How search engines work: crawling and indexing for SEO

Graham Charlton

When you enter a search term into Google, you aren’t searching the web but rather you’re searching Google’s index of the web.

This means that to the average searcher, your site doesn’t exist unless Google or another search engine has already crawled and indexed it.

With search engines so fundamental to the web, it’s important that marketers have an understanding of how search engines work, and how they crawl and index websites.

This understanding helps to improve your website visibility and can help ensure that all of your pages are given the chance to appear in search results.

In this article, I’ll look at how search engines crawl and index webpages, how you can ensure your content is indexed and address any problems you may have with pages not appearing in Google.

How search engines work

I’m mainly focusing on Google here but other search engines work in much the same way.

Search engines have three key functions: crawling, indexing and ranking.

1. Crawling the web

The search engine will crawl the web for pages to add to its index. According to Google, this process begins with its current index made up of previous crawls and sitemaps submitted by website owners.

Google’s crawlers use links on websites to discover new pages and pay attention to new sites, changes to existing sites and dead links. These crawlers then send back the information to Google’s servers.

2. Indexing the web

Google stores the information it finds from crawling – the website content, links etc. – in its Search Index. This index contains ‘hundreds of billions of webpages and is well over 100,000,000 gigabytes in size’.

Essentially, once a page is in this index, it has a chance of ranking as a result for relevant queries.

3. Ranking the web

From its index, Google then displays the results that its algorithm decides are most relevant to the search term entered. The algorithm considers a range of factors when deciding which pages to rank, including relevance to the search query, the authority of a page (measured by links and other factors) and more.

PageRank

Google uses a system called PageRank as a way to rank the relative authority of different websites and pages. It was patented by Google’s founders in 1998 and formed the basis of Google’s algorithm.

Until 2014, there was even an optional Google toolbar which would display the rank (out of ten) for any page you visited.

Now PageRank is no longer a public facing metric, but it’s still used. The essential idea is around links and authority. 

Links from one page to another convey trust and authority and help Google to decide on the relative importance of pages. This flow of PageRank between sites is often referred to as ‘link juice.’

PageRank factors in the relative importance of links, so when Google crawls a site and follows links, a backlink from a site with a PageRank of 9 will carry more weight than one from a site with a PageRank of 2, and so on.
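
To make the idea concrete, here is a rough Python sketch of the original PageRank calculation run on a made-up three-page site. The link graph and damping factor below are purely illustrative (this is the textbook algorithm, not how Google computes rankings today): each page shares its score across the pages it links to, so pages with more, and better, inbound links end up with higher scores.

# Toy PageRank calculation on a made-up three-page link graph.
# Each page's score is shared equally across its outgoing links.
links = {
    "home": ["about", "products"],
    "about": ["home"],
    "products": ["home", "about"],
}

damping = 0.85                      # the standard damping factor from the original paper
pages = list(links)
rank = {page: 1 / len(pages) for page in pages}

for _ in range(50):                 # iterate until the scores settle
    new_rank = {}
    for page in pages:
        incoming = sum(
            rank[other] / len(links[other])
            for other in pages
            if page in links[other]
        )
        new_rank[page] = (1 - damping) / len(pages) + damping * incoming
    rank = new_rank

for page, score in sorted(rank.items(), key=lambda item: -item[1]):
    print(f"{page}: {score:.3f}")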

How crawling and indexing works

Search engines use automated programs to access and retrieve information from pages around the web; Google refers to these programs as ‘crawlers’ or ‘spiders’.

When Google’s spider explores a website, it visits all the links contained within and follows any instructions included in the robots.txt file, which tells its crawlers which pages or files the crawler can or can’t request from your site.

Crawlers use algorithms to work out how often they should scan a specific page and how many pages of a website they need to scan. This helps them to identify, for example, a frequently updated page and scan it more often.

From an SEO perspective, crawling is the point at which Google sees new content and key factors such as how many links point into and out of a page, and gauges their quality.

Once crawled, all key information that is relevant to searchers, and to Google’s algorithm, is stored within the search engine’s index.

For a site owner, it’s essential that search engines are able to crawl their site and add its pages to the index.

Crawl budget

The number of pages Google crawls and how frequently it does so is often referred to as the ‘crawl budget’.

In most cases, this won’t be an issue, but a large number of pages on a site, lots of redirects or the recent addition of a lot of pages can potentially impact crawl budget and affect indexing.

Log file analysis

Log file analysis can help you to ensure your crawl budget is not being overused and help you to see how Google is crawling your site. Suganthan Mohanadasan has an excellent guide to this topic.

Every request made to your server for content is recorded in a log file. This helps you to see exactly which pages Google and other search engines are crawling on your site.

Log file analysis can help you to see:

  • How Google crawls your site, and whether your ‘crawl budget’ is being wasted
  • Any accessibility errors that might hamper crawling
  • Pages which aren’t being crawled or are crawled less often

A log file can be accessed from your server or CDN provider and will look something like this:

[Image: an example server log file (source: www.suganthan.com)]

These files can be analysed for the issues set out above, and there are SEO tools which will help you do this work (example below from SEMrush).

[Image: the Log File Analyzer report in SEMrush]
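
If you want a quick look before reaching for a dedicated tool, a short script can answer the basic questions. The sketch below is a minimal Python example that assumes a standard Apache/Nginx ‘combined’ format log saved as access.log (adjust the filename and parsing for your own server); it counts which URLs Googlebot requests most often and which status codes it gets back.

# Minimal sketch: count Googlebot requests per URL in a 'combined' format access log.
import re
from collections import Counter

LOG_LINE = re.compile(r'"(?:GET|POST|HEAD) (?P<path>\S+) HTTP/[\d.]+" (?P<status>\d{3})')

googlebot_hits = Counter()
status_codes = Counter()

with open("access.log") as log:            # path to your exported log file
    for line in log:
        if "Googlebot" not in line:        # crude user-agent filter; verify IPs for real analysis
            continue
        match = LOG_LINE.search(line)
        if match:
            googlebot_hits[match.group("path")] += 1
            status_codes[match.group("status")] += 1

print("Most-crawled URLs:")
for path, count in googlebot_hits.most_common(10):
    print(f"  {count:5d}  {path}")
print("Status codes returned to Googlebot:", dict(status_codes))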

How to ensure your website has been crawled and indexed

If Google isn’t crawling and indexing your site, no searcher will find it, so it’s important to check that your website and all of its pages have been indexed.

It’s simple enough to check if your site is in Google’s index. Just head to Google and search for site:yourwebsite.com.

You’ll then quickly see if your site is indexed and how many pages from it have been indexed by Google.

If you want to see if a specific page has been indexed, then search for site:yourwebsite.com/web-page and it should appear if Google has indexed it. Google Search Console (or Bing Webmaster Tools) is also a great place to check for indexing issues.

In Google Search Console, head to Index > Coverage and you’ll see a summary of indexed pages.

It will show pages that are valid and have been indexed by Google, pages intentionally excluded by robots.txt, and any possible errors.

[Image: Google Search Console error report]

This allows you to quickly check if the right number of pages are indexed or to be alerted to possible indexing issues. If Google Search Console highlights any problems, you can see which URL is affected and the potential problem.

For individual URLs, perhaps new pages you want to rank quickly, you can use the URL inspection tool which will confirm whether it’s on Google or not.

[Image: Google URL Inspection tool]

It will also allow you to request indexing for a particular URL. This can be good practice for a new page and may speed the process up. If you’ve searched for a URL and it doesn’t show as indexed, click the ‘request indexing’ link.

Common crawling and indexing issues to watch for

If you’ve carried out these checks and some or all of your pages aren’t indexed by Google, it could be a crawling issue that is causing the problem.

Here are some common problems and how to fix them. Note that the solutions to some of these issues require basic coding knowledge, or may need the help of your IT team.

1. Indexing blocked by noindex tags

Noindex tags are useful for keeping some content out of Google’s index, but on the wrong pages they will prevent indexing.

Check if your affected page(s) contain this tag: <meta name="robots" content="noindex" />. If they do, the problem will be solved by removing the tag. You can also request indexing using the URL inspection tool as outlined above.

2. Nofollow links

This is a similar issue to the noindex tag. Check if you have one of the following nofollow tags:

  • Nofollow for the whole page, which will prevent Google from following any link on the page: <meta name="robots" content="nofollow">
  • Nofollow for a specific link: <a href="pagename.html" rel="nofollow">

In either case, removing the tags will solve the problem.
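
If you want to check a handful of pages without opening each one, the sketch below (Python, standard library only, using an example URL) fetches a page and reports any robots meta directives, so a stray noindex or nofollow is easy to spot.

# Sketch: fetch a page and report any robots meta directives.
from html.parser import HTMLParser
from urllib.request import urlopen

class RobotsMetaFinder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "robots":
            self.directives.append(attrs.get("content") or "")

url = "https://www.example.com/some-page"   # replace with the page you want to check
html = urlopen(url).read().decode("utf-8", errors="replace")
finder = RobotsMetaFinder()
finder.feed(html)

if any("noindex" in d.lower() for d in finder.directives):
    print("Warning: page carries a noindex directive")
if any("nofollow" in d.lower() for d in finder.directives):
    print("Warning: page carries a nofollow directive")
if not finder.directives:
    print("No robots meta tag found (also check the X-Robots-Tag HTTP header)")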

3. Blocking the whole site or sections from Google’s crawlers

Robots.txt is the first file of your website that crawlers look at, and it is possible to block the whole site, or entire sections of it, from here.

If Google sees this, the whole site is blocked from crawling:

User-agent: *
Disallow: /

The following example would block a section of the site, perhaps an entire category:

User-agent: *
Disallow: /products/
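
You can test your robots.txt rules against specific URLs with Python’s built-in parser. The sketch below uses example URLs; swap in your own domain and the pages you care about.

# Sketch: check whether specific URLs are blocked by a site's robots.txt.
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.set_url("https://www.example.com/robots.txt")   # example domain
parser.read()

for url in [
    "https://www.example.com/",
    "https://www.example.com/products/widget",
]:
    allowed = parser.can_fetch("Googlebot", url)
    print(("allowed" if allowed else "BLOCKED"), url)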

4. No internal links

Google crawls pages on your site by following links, so if a page has no links to it, indexing will be delayed or not happen at all.

If you want a page to be visible, and to be indexed as quickly as possible, add links to it from other relevant pages on your site. If it’s an important page, link to it from your other key pages.

5. Broken links

Broken links may occur due to typos when you link from other pages or because you have undergone a website migration or structural change.

This will prevent crawling but is easy to catch in Google’s coverage report and to fix by correcting the link URL.
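
If you want to spot-check a batch of URLs yourself, a short script can request each one and flag anything that doesn’t come back with a 200. The sketch below uses example URLs and only Python’s standard library; note that some servers don’t answer HEAD requests, in which case switch to GET.

# Sketch: report the HTTP status of a list of URLs to catch broken links.
from urllib.error import HTTPError, URLError
from urllib.request import Request, urlopen

urls = [
    "https://www.example.com/",
    "https://www.example.com/old-page",
]

for url in urls:
    try:
        response = urlopen(Request(url, method="HEAD"), timeout=10)
        status = response.status
    except HTTPError as error:
        status = error.code                       # e.g. 404 for a broken link
    except URLError as error:
        status = f"unreachable ({error.reason})"
    print(status, url)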

6. Low quality or unnecessary pages

Lots of low-quality pages with thin, duplicate or auto-generated content will potentially waste crawl budget and slow down indexing. As Google states:

“Wasting server resources on pages like these will drain crawl activity from pages that do actually have value, which may cause a significant delay in discovering great content on a site.”

7. Misplaced canonical tags

Canonical tags are used to tell Google which is the preferred version of a page. A misplaced canonical tag could prevent a page being indexed.

To check for this issue, use Google Search Console’s URL inspection tool. You’ll see a canonical tag warning if you have this problem.
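
You can also spot-check canonicals directly. The sketch below (Python, using an example URL) pulls the rel="canonical" link from a page and compares it with the URL you expect to rank, which makes a misplaced canonical easy to see.

# Sketch: extract a page's canonical URL and compare it with the page's own URL.
from html.parser import HTMLParser
from urllib.request import urlopen

class CanonicalFinder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "link" and (attrs.get("rel") or "").lower() == "canonical":
            self.canonical = attrs.get("href")

url = "https://www.example.com/products/widget"   # replace with your page
finder = CanonicalFinder()
finder.feed(urlopen(url).read().decode("utf-8", errors="replace"))

if finder.canonical is None:
    print("No canonical tag found")
elif finder.canonical.rstrip("/") != url.rstrip("/"):
    print("Canonical points elsewhere:", finder.canonical)
else:
    print("Canonical matches the page URL")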

Conclusion

In most cases, Google will crawl and index your site without any problems and new pages you add will be indexed in good time.

However, it’s important to be aware of potential issues so you avoid falling into the trap of some of the mistakes mentioned here.

As a general rule, a good site structure and the use of links to help Google to crawl new pages will help you to avoid many common problems.

It’s also important to use Google Search Console to alert you to any crawl issues so they can be fixed before they affect your search visibility for any length of time. Misplaced tags or broken links can happen to any site, but effective monitoring can help you reduce the risk.

Graham Charlton is Editor in Chief at behavioural marketing company SaleCycle. He has previously worked for Econsultancy and Search Engine Watch, and has written several best practice guides on e-commerce and digital marketing. Follow him on Twitter.
