Wednesday 25 July 2018

Rewriting the Beginner's Guide to SEO, Chapter 2: Crawling, Indexing, and Ranking

Posted by BritneyMuller

It's been a few months since our last share of our work-in-progress rewrite of the Beginner's Guide to SEO, but after a brief hiatus, we're back to share our draft of Chapter Two with you! This wouldn’t have been possible without the help of Kameron Jenkins, who has thoughtfully contributed her great talent for wordsmithing throughout this piece.

This is your resource, the guide that likely kicked off your interest in and knowledge of SEO, and we want to do right by you. You left amazingly helpful commentary on our outline and draft of Chapter One, and we'd be honored if you would take the time to let us know what you think of Chapter Two in the comments below.


Chapter 2: How Search Engines Work – Crawling, Indexing, and Ranking

First, show up.

As we mentioned in Chapter 1, search engines are answer machines. They exist to discover, understand, and organize the internet's content in order to offer the most relevant results to the questions searchers are asking.

In order to show up in search results, your content needs to first be visible to search engines. It's arguably the most important piece of the SEO puzzle: If your site can't be found, there's no way you'll ever show up in the SERPs (search engine results pages).

How do search engines work?

Search engines have three primary functions:

  1. Crawl: Scour the Internet for content, looking over the code/content for each URL they find.
  2. Index: Store and organize the content found during the crawling process. Once a page is in the index, it’s in the running to be displayed as a result to relevant queries.
  3. Rank: Provide the pieces of content that will best answer a searcher's query. Order the search results by the most helpful to a particular query.

What is search engine crawling?

Crawling is the discovery process in which search engines send out a team of robots (known as crawlers or spiders) to find new and updated content. Content can vary — it could be a webpage, an image, a video, a PDF, etc. — but regardless of the format, content is discovered by links.

The bot starts out by fetching a few web pages, and then follows the links on those webpages to find new URLs. By hopping along this path of links, crawlers are able to find new content and add it to their index — a massive database of discovered URLs — to later be retrieved when a searcher is seeking information that the content on that URL is a good match for.

What is a search engine index?

Search engines process and store information they find in an index, a huge database of all the content they’ve discovered and deem good enough to serve up to searchers.

Search engine ranking

When someone performs a search, search engines scour their index for highly relevant content and then order that content in the hopes of solving the searcher's query. This ordering of search results by relevance is known as ranking. In general, you can assume that the higher a website is ranked, the more relevant the search engine believes that site is to the query.

It’s possible to block search engine crawlers from part or all of your site, or instruct search engines to avoid storing certain pages in their index. While there can be reasons for doing this, if you want your content found by searchers, you have to first make sure it’s accessible to crawlers and is indexable. Otherwise, it’s as good as invisible.

By the end of this chapter, you’ll have the context you need to work with the search engine, rather than against it!

Note: In SEO, not all search engines are equal

Many beginners wonder about the relative importance of particular search engines. Most people know that Google has the largest market share, but how important is it to optimize for Bing, Yahoo, and others? The truth is that despite the existence of more than 30 major web search engines, the SEO community really only pays attention to Google. Why? The short answer is that Google is where the vast majority of people search the web. If we include Google Images, Google Maps, and YouTube (a Google property), more than 90% of web searches happen on Google — that's nearly 20 times Bing and Yahoo combined.

Crawling: Can search engines find your site?

As you've just learned, making sure your site gets crawled and indexed is a prerequisite for showing up in the SERPs. First things first: You can check to see how many and which pages of your website have been indexed by Google using "site:yourdomain.com", an advanced search operator.

Head to Google and type "site:yourdomain.com" into the search bar. This will return results Google has in its index for the site specified:

The number of results Google displays (shown as "About __ results" near the top of the results page) isn't exact, but it does give you a solid idea of which pages are indexed on your site and how they are currently showing up in search results.

For more accurate results, monitor and use the Index Coverage report in Google Search Console. You can sign up for a free Google Search Console account if you don't currently have one. With this tool, you can submit sitemaps for your site and monitor how many submitted pages have actually been added to Google's index, among other things.

If you're not showing up anywhere in the search results, there are a few possible reasons why:

  • Your site is brand new and hasn't been crawled yet.
  • Your site isn't linked to from any external websites.
  • Your site's navigation makes it hard for a robot to crawl it effectively.
  • Your site contains some basic code called crawler directives that is blocking search engines.
  • Your site has been penalized by Google for spammy tactics.

If your site doesn't have any other sites linking to it, you still might be able to get it indexed by submitting your XML sitemap in Google Search Console or manually submitting individual URLs to Google. There's no guarantee they'll include a submitted URL in their index, but it's worth a try!

Can search engines see your whole site?

Sometimes a search engine will be able to find parts of your site by crawling, but other pages or sections might be obscured for one reason or another. It's important to make sure that search engines are able to discover all the content you want indexed, and not just your homepage.

Ask yourself this: Can the bot crawl through your website, and not just to it?

Is your content hidden behind login forms?

If you require users to log in, fill out forms, or answer surveys before accessing certain content, search engines won't see those protected pages. A crawler is definitely not going to log in.

Are you relying on search forms?

Robots cannot use search forms. Some individuals believe that if they place a search box on their site, search engines will be able to find everything that their visitors search for. In reality, crawlers don't type queries into search boxes, so content reachable only through your on-site search stays hidden from them.

Is text hidden within non-text content?

Non-text media forms (images, video, GIFs, etc.) should not be used to display text that you wish to be indexed. While search engines are getting better at recognizing images, there's no guarantee they will be able to read and understand them just yet. It's always best to add text within the HTML markup of your webpage.
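
For example, rather than baking a headline into an image file, keep the words in the markup and use the image's alt attribute for a description. A minimal sketch (the file names and copy below are made up for illustration):

  <!-- Crawlable: the headline lives in the HTML itself -->
  <h1>Spring Sale: 20% Off All Shoes</h1>
  <img src="/images/spring-sale-banner.jpg" alt="Rows of discounted shoes on display">

  <!-- Risky: the same headline exists only as pixels inside an image -->
  <img src="/images/spring-sale-headline.jpg" alt="">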

Can search engines follow your site navigation?

Just as a crawler needs to discover your site via links from other sites, it needs a path of links on your own site to guide it from page to page. If you’ve got a page you want search engines to find but it isn’t linked to from any other pages, it’s as good as invisible. Many sites make the critical mistake of structuring their navigation in ways that are inaccessible to search engines, hindering their ability to get listed in search results.

Common navigation mistakes that can keep crawlers from seeing all of your site:

  • Having a mobile navigation that shows different results than your desktop navigation
  • Any type of navigation where the menu items are not in the HTML, such as JavaScript-enabled navigations (see the sketch after this list). Google has gotten much better at crawling and understanding JavaScript, but it’s still not a perfect process. The more surefire way to ensure something gets found, understood, and indexed by Google is by putting it in the HTML.
  • Personalization, or showing unique navigation to a specific type of visitor versus others, could appear to be cloaking to a search engine crawler
  • Forgetting to link to a primary page on your website through your navigation — remember, links are the paths crawlers follow to new pages!
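
To illustrate the JavaScript point above, here's a rough sketch of the difference between links a crawler can reliably follow and links it may miss (the paths and function name are hypothetical):

  <!-- Reliable: plain anchor tags with href attributes in the HTML -->
  <nav>
    <a href="/products">Products</a>
    <a href="/blog">Blog</a>
  </nav>

  <!-- Less reliable: nothing for a crawler to follow until JavaScript runs -->
  <nav>
    <span onclick="navigateTo('/products')">Products</span>
  </nav>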

This is why it's essential that your website has a clear navigation and helpful URL folder structures.

Information architecture

Information architecture is the practice of organizing and labeling content on a website to improve efficiency and findability for users. The best information architecture is intuitive, meaning that users shouldn't have to think very hard to flow through your website or to find something.

Your site should also have a useful 404 (page not found) page for when a visitor clicks on a dead link or mistypes a URL. The best 404 pages allow users to click back into your site so they don’t bounce off just because they tried to access a nonexistent link.

Tell search engines how to crawl your site

In addition to making sure crawlers can reach your most important pages, it’s also pertinent to note that you’ll have pages on your site you don’t want them to find. These might include things like old URLs that have thin content, duplicate URLs (such as sort-and-filter parameters for e-commerce), special promo code pages, staging or test pages, and so on.

Blocking pages from search engines can also help crawlers prioritize your most important pages and maximize your crawl budget (the average number of pages a search engine bot will crawl on your site).

Crawler directives allow you to control what you want Googlebot to crawl and index using a robots.txt file, meta tag, sitemap.xml file, or Google Search Console.

Robots.txt

Robots.txt files are located in the root directory of websites (ex. yourdomain.com/robots.txt) and suggest which parts of your site search engines should and shouldn't crawl via specific robots.txt directives. This is a great solution when trying to block search engines from non-private pages on your site.

You wouldn't want to block private/sensitive pages from being crawled here because the file is easily accessible by users and bots.
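
As a rough sketch, a simple robots.txt might look like this (the disallowed paths are hypothetical examples, not recommendations for your site):

  # Rules for all crawlers
  User-agent: *
  Disallow: /staging/
  Disallow: /promo-codes/

  # Tell crawlers where to find your sitemap
  Sitemap: https://yourdomain.com/sitemap.xml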

Pro tip:

  • If Googlebot can't find a robots.txt file for a site (40X HTTP status code), it proceeds to crawl the site.
  • If Googlebot finds a robots.txt file for a site (20X HTTP status code), it will usually abide by the suggestions and proceed to crawl the site.
  • If Googlebot finds neither a 20X nor a 40X HTTP status code (ex. a 501 server error), it can't determine if you have a robots.txt file or not and won't crawl your site (see the quick check below).
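
One quick way to see which of these cases applies to you (assuming you have curl available) is to request your robots.txt URL from the command line and read the status code on the first line of the response:

  # HEAD request; the first line of output shows the HTTP status code
  curl -I https://yourdomain.com/robots.txt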

Meta directives

The two types of meta directives are the meta robots tag (more commonly used) and the x-robots-tag. Each provides crawlers with stronger instructions on how to crawl and index a URL's content.
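
For example, the meta robots tag sits in the <head> of a page's HTML. A minimal sketch, using noindex and nofollow values for illustration:

  <head>
    <!-- Asks search engines not to index this page or follow its links -->
    <meta name="robots" content="noindex, nofollow">
  </head>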

The x-robots-tag provides more flexibility and functionality if you want to block search engines at scale, because you can use regular expressions, block non-HTML files, and apply sitewide noindex tags.
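
Because the x-robots-tag travels as an HTTP response header rather than inside the HTML, it can cover files that contain no HTML at all. As a rough sketch, assuming an Apache server with mod_headers enabled, you could noindex every PDF on a site like this:

  # Apache (mod_headers): send an X-Robots-Tag header for every PDF
  <FilesMatch "\.pdf$">
    Header set X-Robots-Tag "noindex, nofollow"
  </FilesMatch>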

These are the best options for blocking more sensitive*/private URLs from search engines.

*For very sensitive URLs, it is best practice to remove them from the site or require a secure login to view the pages.

WordPress Tip: In Dashboard > Settings > Reading, make sure the "Search Engine Visibility" box is not checked. This blocks search engines from coming to your site via your robots.txt file!

Avoid these common pitfalls, and you'll have clean, crawlable content that will allow bots easy access to your pages.

Sitemaps

A sitemap is just what it sounds like: a list of URLs on your site that crawlers can use to discover and index your content. One of the easiest ways to ensure Google is finding your highest priority pages is to create a file that meets Google's standards and submit it through Google Search Console. While submitting a sitemap doesn’t replace the need for good site navigation, it can certainly help crawlers follow a path to all of your important pages.
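
A bare-bones sitemap following the sitemaps.org XML protocol (the format Google accepts) looks something like this; the URLs and date are placeholders:

  <?xml version="1.0" encoding="UTF-8"?>
  <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <url>
      <loc>https://yourdomain.com/</loc>
      <lastmod>2018-07-01</lastmod>
    </url>
    <url>
      <loc>https://yourdomain.com/important-page</loc>
    </url>
  </urlset>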

Google Search Console

Some sites (most commonly e-commerce sites) make the same content available on multiple different URLs by appending certain parameters to URLs. If you’ve ever shopped online, you’ve likely narrowed down your search via filters. For example, you may search for “shoes” on Amazon, and then refine your search by size, color, and style. Each time you refine, the URL changes slightly. How does Google know which version of the URL to serve to searchers? Google does a pretty good job at figuring out the representative URL on its own, but you can use the URL Parameters feature in Google Search Console to tell Google exactly how you want it to treat your pages.
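
For example, the filtered URLs might look something like this (hypothetical parameters), all pointing at essentially the same set of products:

  https://www.example.com/shoes
  https://www.example.com/shoes?color=red
  https://www.example.com/shoes?color=red&size=9&sort=price-asc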

Indexing: How do search engines understand and remember your site?

Once you’ve ensured your site has been crawled, the next order of business is to make sure it can be indexed. That’s right — just because your site can be discovered and crawled by a search engine doesn’t necessarily mean that it will be stored in their index. In the previous section on crawling, we discussed how search engines discover your web pages. The index is where your discovered pages are stored. After a crawler finds a page, the search engine renders it just like a browser would. In the process of doing so, the search engine analyzes that page's contents. All of that information is stored in its index.

Read on to learn about how indexing works and how you can make sure your site makes it into this all-important database.

Can I see how a Googlebot crawler sees my pages?

Yes, the cached version of your page will reflect a snapshot of the last time Googlebot crawled it.

Google crawls and caches web pages at different frequencies. More established, well-known sites that post frequently like https://www.nytimes.com will be crawled more frequently than the much-less-famous website for Roger the Mozbot’s side hustle, http://www.rogerlovescupcakes.com (if only it were real…)

You can view what the cached version of a page looks like by clicking the drop-down arrow next to the URL in the SERP and choosing "Cached".

You can also view the text-only version of your site to determine if your important content is being crawled and cached effectively.

Are pages ever removed from the index?

Yes, pages can be removed from the index! Some of the main reasons why a URL might be removed include:

  • The URL is returning a "not found" error (4XX) or server error (5XX) – This could be accidental (the page was moved and a 301 redirect was not set up) or intentional (the page was deleted and 404ed in order to get it removed from the index)
  • The URL had a noindex meta tag added – This tag can be added by site owners to instruct the search engine to omit the page from its index.
  • The URL has been manually penalized for violating the search engine’s Webmaster Guidelines and, as a result, was removed from the index.
  • The URL has been blocked from crawling with the addition of a password required before visitors can access the page.

If you believe that a page on your website that was previously in Google’s index is no longer showing up, you can manually submit the URL to Google by navigating to the “Submit URL” tool in Search Console.

Ranking: How do search engines rank URLs?

How do search engines ensure that when someone types a query into the search bar, they get relevant results in return? That process is known as ranking, or the ordering of search results by most relevant to least relevant to a particular query.

To determine relevance, search engines use algorithms: processes or formulas by which stored information is retrieved and ordered in meaningful ways. These algorithms have gone through many changes over the years in order to improve the quality of search results. Google, for example, makes algorithm adjustments every day — some of these updates are minor quality tweaks, whereas others are core/broad algorithm updates deployed to tackle a specific issue, like Penguin to tackle link spam. Check out our Google Algorithm Change History for a list of both confirmed and unconfirmed Google updates going back to the year 2000.

Why does the algorithm change so often? Is Google just trying to keep us on our toes? While Google doesn’t always reveal specifics as to why they do what they do, we do know that Google’s aim when making algorithm adjustments is to improve overall search quality. That’s why, in response to algorithm update questions, Google will answer with something along the lines of: “We’re making quality updates all the time.” This means that, if your site suffered after an algorithm adjustment, you should compare it against Google’s Quality Guidelines or Search Quality Rater Guidelines; both are very telling in terms of what search engines want.

What do search engines want?

Search engines have always wanted the same thing: to provide useful answers to searchers’ questions in the most helpful formats. If that’s true, then why does it appear that SEO is different now than in years past?

Think about it in terms of someone learning a new language.

At first, their understanding of the language is very rudimentary — “See Spot Run.” Over time, their understanding starts to deepen, and they learn semantics — the meaning behind language and the relationship between words and phrases. Eventually, with enough practice, the student knows the language well enough to even understand nuance, and is able to provide answers to even vague or incomplete questions.

When search engines were just beginning to learn our language, it was much easier to game the system by using tricks and tactics that actually go against quality guidelines. Take keyword stuffing, for example. If you wanted to rank for a particular keyword like “funny jokes,” you might add the words “funny jokes” to your page over and over again, in hopes of boosting your ranking for that term. An invented example of what that stuffed copy might read like:
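
  Welcome to the best funny jokes site on the web! Our funny jokes are the funniest jokes around. If you love funny jokes, browse our funny jokes, because nobody has more funny jokes than our funny jokes page!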

Source: https://moz.com/blog/beginners-guide-to-seo-chapter-2
