
How do Search Engines Work?

In this guide we’re going to provide you with an introduction to how search engines
work. This will cover the processes of crawling and indexing as well as concepts such
as crawl budget and PageRank.
Search engines work by crawling hundreds of billions of pages using their own web
crawlers. These web crawlers are commonly referred to as search engine bots or spiders.
A search engine navigates the web by downloading web pages and following links on
these pages to discover new pages that have been made available.
 

The Search Engine Index


Webpages that have been discovered by the search engine are added into a data
structure called an index.

The index includes all the discovered URLs along with a number of relevant key signals
about the contents of each URL such as:

 The keywords discovered within the page’s content – what topics does the page cover?
 The type of content that is being crawled (using microdata called Schema) – what is
included on the page?
 The freshness of the page – how recently was it updated?
 The previous user engagement of the page and/or domain – how do people interact with the page?

What is The Aim of a Search Engine Algorithm?
The aim of the search engine algorithm is to present a relevant set of high quality search
results that will fulfil the user’s query/question as quickly as possible.
The user then selects an option from the list of search results and this action, along with
subsequent activity, then feeds into future learnings which can affect search engine
rankings going forward.
 

What happens when a search is performed?
When a search query is entered into a search engine by a user, all of the pages which
are deemed to be relevant are identified from the index and an algorithm is used to
hierarchically rank the relevant pages into a set of results.
The algorithms used to rank the most relevant results differ for each search engine. For
example, a page that ranks highly for a search query in Google may not rank highly for
the same query in Bing.

In addition to the search query, search engines use other relevant data to return results,
including:

 Location – Some search queries are location-dependent e.g. ‘cafes near me’ or ‘movie times’.
 Language detected – Search engines will return results in the language of the user, if it
can be detected.
 Previous search history – Search engines will return different results for a query dependent on what the user has previously searched for.
 Device – A different set of results may be returned based on the device from which the query was made.

Why Might a Page Not be Indexed?
There are a number of circumstances where a URL will not be indexed by a search
engine. This may be due to:

 Robots.txt file exclusions – a file which tells search engines what they shouldn’t visit on
your site.
 Directives on the webpage telling search engines not to index that page (noindex tag) or
to index another similar page (canonical tag).
 Search engine algorithms judging the page to be of low quality, to have thin content or to contain duplicate content.
 The URL returning an error page (e.g. a 404 Not Found HTTP response code).
Next: Search Engine Crawling

Search Engine Crawling


Now that you’ve got a top-level understanding of how search engines work, let’s delve deeper
into the processes that search engines and their web crawlers use to understand the web. Let’s start
with the crawling process.
 

What is Search Engine Crawling?


Crawling is the process used by search engine web crawlers (bots or spiders) to visit and
download a page and extract its links in order to discover additional pages.
Pages known to the search engine are crawled periodically to determine whether any changes
have been made to the page’s content since the last time it was crawled. If a search engine
detects changes to a page after crawling it, it will update its index to reflect those changes.
 
How Does Web Crawling Work?
Search engines use their own web crawlers to discover and access web pages.
All commercial search engine crawlers begin crawling a website by downloading its robots.txt
file, which contains rules about what pages search engines should or should not crawl on the
website. The robots.txt file may also reference sitemaps, which contain lists of URLs that the site wants a search engine crawler to crawl.
Search engine crawlers use a number of algorithms and rules to determine how frequently a page
should be re-crawled and how many pages on a site should be indexed. For example, a page which changes on a regular basis may be crawled more frequently than one that is rarely modified.
 

How Can Search Engine Crawlers be Identified?
The search engine bots crawling a website can be identified from the user agent string that they
pass to the web server when requesting web pages.

Here are a few examples of user agent strings used by search engines:

 Googlebot User Agent


Mozilla/5.0 (compatible; Googlebot/2.1; +https://www.google.com/bot.html)
 Bingbot User Agent
Mozilla/5.0 (compatible; bingbot/2.0; +https://www.bing.com/bingbot.htm)
 Baidu User Agent
Mozilla/5.0 (compatible; Baiduspider/2.0; +https://www.baidu.com/search/spider.html)
 Yandex User Agent
Mozilla/5.0 (compatible; YandexBot/3.0; +https://yandex.com/bots)
Anyone can use the same user agent as those used by search engines. However, the IP address
that made the request can also be used to confirm that it came from the search engine – a process
called reverse DNS lookup.
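As a rough sketch of how that verification might be automated, assuming Python and its standard socket module; the is_verified_googlebot helper and the example IP are purely illustrative, and the hostname suffixes are the ones Google documents for Googlebot:

[code]
import socket

def is_verified_googlebot(ip_address):
    try:
        # Reverse DNS lookup: resolve the requesting IP address to a hostname.
        hostname, _, _ = socket.gethostbyaddr(ip_address)
    except socket.herror:
        return False  # No reverse DNS record, so the claim cannot be verified
    # Genuine Googlebot hostnames end in googlebot.com or google.com.
    if not hostname.endswith((".googlebot.com", ".google.com")):
        return False
    # Forward-confirm: the hostname must resolve back to the original IP address.
    try:
        return socket.gethostbyname(hostname) == ip_address
    except socket.gaierror:
        return False

print(is_verified_googlebot("66.249.66.1"))  # Example IP from Google's published crawl range
[/code]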
 

Crawling images and other non-text files
Search engines will normally attempt to crawl and index every URL that they encounter.

However, if the URL is a non-text file type such as an image, video or audio file, search engines
will typically not be able to read the content of the file other than the associated filename and
metadata.
Although search engines may only be able to extract a limited amount of information from non-text file types, these files can still be indexed, rank in search results and receive traffic.

You can find a full list of file types that can be indexed by Google here.
 

Crawling and Extracting Links From Pages
Crawlers discover new pages by re-crawling existing pages they already know about, then
extracting the links to other pages to find new URLs. These new URLs are added to the crawl
queue so that they can be downloaded at a later date.
Through this process of following links, search engines are able to discover every publicly-
available webpage on the internet which is linked from at least one other page.
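To illustrate this discover-and-queue loop, here is a deliberately minimal crawler sketch in Python using only the standard library; a real search engine crawler adds robots.txt handling, politeness delays, scheduling and far more:

[code]
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collects the href values of <a> tags found in a downloaded page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links += [value for name, value in attrs if name == "href" and value]

def crawl(seed_url, max_pages=10):
    queue, seen, crawled = deque([seed_url]), {seed_url}, []
    while queue and len(crawled) < max_pages:
        url = queue.popleft()
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", "ignore")
        except (OSError, ValueError):
            continue  # Unreachable or non-HTTP URLs are simply skipped in this sketch
        crawled.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            absolute = urljoin(url, link)  # Resolve relative links against the current page
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)  # Newly discovered URLs join the crawl queue
    return crawled

print(crawl("https://www.example.com/"))
[/code]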
 

Sitemaps
Another way that search engines can discover new pages is by crawling sitemaps.
Sitemaps contain sets of URLs, and can be created by a website to provide search engines with a
list of pages to be crawled. These can help search engines find content hidden deep within a
website, and can give webmasters better control over, and understanding of, which areas of the site are indexed and how frequently they are crawled.
 

Page submissions
Alternatively, individual page submissions can often be made directly to search engines via their
respective interfaces. This manual method of page discovery can be used when new content is
published on site, or if changes have taken place and you want to minimise the time that it takes
for search engines to see the changed content.
Google states that for large URL volumes you should use XML sitemaps, but sometimes the
manual submission method is convenient when submitting a handful of pages. It is also
important to note that Google limits webmasters to 10 URL submissions per day.

Additionally, Google says that the response time for indexing is the same for sitemaps as
individual submissions.

Next: Search Engine Indexing


Search Engine Indexing
What happens once a search engine has finished crawling a page? Let’s take a look at the indexing
process that search engines use to store information about web pages, enabling them to quickly
return relevant, high quality results.
 

What’s the need for indexing by search engines?
Remember the days before the internet when you’d have to consult an encyclopedia to learn
about the world and dig through the Yellow Pages to find a plumber? Even in the early days of
the web, before search engines, we had to search through directories to retrieve information.
What a time-consuming process. How did we ever have the patience?
Search engines have revolutionised information retrieval to the extent that users expect near-
instantaneous responses to their search queries.
 

What is search engine indexing?


Indexing is the process by which search engines organise information before a search to enable
super-fast responses to queries.

Searching through individual pages for keywords and topics at query time would be a very slow way for search engines to identify relevant information. Instead, search engines (including Google) use
an inverted index, also known as a reverse index.
 

What is an inverted index?


An inverted index is a system wherein a database of text elements is compiled along with
pointers to the documents which contain those elements. Then, search engines use a process
called tokenisation to reduce words to their core meaning, thus reducing the amount of resources
needed to store and retrieve data. This is a much faster approach than listing all known
documents against all relevant keywords and characters.
An example of inverted indexing

Below is a very basic example which illustrates the concept of inverted indexing. In the example
you can see that each keyword (or token) is associated with a row of documents in which that
element was identified.

Keyword   Document Path 1               Document Path 2             Document Path 3
SEO       example.com/seo-tips          moz.com                     …
HTTPS     deepcrawl.co.uk/https-speed   example.com/https-future    …

This example uses URLs, but these might be document IDs instead, depending on how the search engine is structured.
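To make the idea concrete, here is a minimal inverted index sketch in Python; the documents mirror the table above, and the lower-case-and-split tokenisation is an illustrative stand-in for the far more sophisticated processing a real search engine performs:

[code]
from collections import defaultdict

# Illustrative documents, mirroring the table above.
documents = {
    "example.com/seo-tips": "SEO tips for beginners",
    "deepcrawl.co.uk/https-speed": "HTTPS and site speed",
    "example.com/https-future": "The future of HTTPS",
}

# Build the inverted index: each token points to the documents that contain it.
inverted_index = defaultdict(set)
for url, text in documents.items():
    for token in text.lower().split():  # Crude tokenisation, purely for illustration
        inverted_index[token].add(url)

# Answering a query is now a fast lookup rather than a scan of every document.
print(inverted_index["https"])
# {'deepcrawl.co.uk/https-speed', 'example.com/https-future'}
[/code]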
 

The cached version of a page


In addition to indexing pages, search engines may also store a highly compressed text-only
version of a document including all HTML and metadata.

The cached document is the latest snapshot of the page that the search engine has seen.

The cached version of a page can be accessed (in Google) by clicking the little green arrow next
to each search result’s URL and selecting the cached option. Alternatively, you can use the
‘cache:’ Google search operator to view the cached version of the page.
Bing offers the same facility to view the cached version of a page via a green down arrow next to
each search result but doesn’t currently support the ‘cache:’ search operator.
 

What is PageRank?
“PageRank” is a Google algorithm named after the co-founder of Google, Larry Page (yes,
really!). It is a value for each page calculated by counting the number of links pointing at a
page in order to determine the page’s value relative to every other page on the internet. The value
passed by each individual link is based on the number and value of links which point to the page
with the link.
PageRank is just one of the many signals used within the large Google ranking algorithm.
An approximation of PageRank values was initially provided by Google, but these values are no longer publicly visible.
While PageRank is a Google term, all commercial search engines calculate and use an equivalent
link equity metric. Some SEO tools try to give an estimation of PageRank using their own logic
and calculations. For example, Page Authority in Moz tools, TrustFlow in Majestic, or URL
Rating in Ahrefs. DeepCrawl has a metric called DeepRank to measure the value of pages based
on the internal links within a website.
 
How PageRank flows through pages
Pages pass PageRank, or link equity, through to other pages via links. When a page links to
content elsewhere it is seen as a vote of confidence and trust, in that the content being linked to is
being recommended as relevant and useful for users. The count of these links and the measure of
how authoritative the linking website is, determines the relative PageRank of the linked-to page.

PageRank is equally divided across all discovered links on the page. For example, if your page has five links, each link would pass 20% of the page’s PageRank to its target page. Links which use the rel="nofollow" attribute do not pass PageRank.
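As a worked illustration of that equal split, here is a deliberately simplified Python sketch; real PageRank is calculated iteratively with a damping factor, and the treatment of nofollow links shown here (they take a share but pass nothing on) is one simplifying assumption among several:

[code]
def distribute_pagerank(page_rank, outgoing_links):
    """Equally divide a page's PageRank across its outgoing links (simplified model)."""
    share = page_rank / len(outgoing_links)
    # Links carrying rel="nofollow" still take a share here but pass nothing on.
    return {url: share for url, nofollow in outgoing_links if not nofollow}

# A page with five links: each followed link passes 20% of the page's PageRank.
links = [("/a", False), ("/b", False), ("/c", False), ("/d", False), ("/sponsored", True)]
print(distribute_pagerank(1.0, links))
# {'/a': 0.2, '/b': 0.2, '/c': 0.2, '/d': 0.2}
[/code]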
 

The importance of backlinks


Backlinks are a cornerstone of how search engines understand the importance of a page. There have
been many studies and tests performed to identify the correlation between backlinks and
rankings.

Research into backlinks by Moz shows that, of the results for the top 50 Google search queries (~15,000 search results), 99.2% had at least one external backlink. On top of this, SEOs consistently rate backlinks as one of the most important ranking factors in surveys.
Next: Search Engine Differences

Search Engine Differences


Now that we’ve looked at the basics of how search engines work, it is worth taking this
opportunity to break down some of the key differences between some of the major search
engines: Google, Bing, Yandex and Baidu.
 Google – Google was launched in 1998 and unless you’ve been living on another planet, you’ll
know that Google is by far the most widely-used search engine in terms of search volume and is
the main focus for most in search engine optimisation (SEO).
 Bing – Owned by Microsoft, Bing launched in 2009 and has the second largest search volume
worldwide.
 Yandex – The search engine of choice in Russia, run by the country’s largest technology company.
 Baidu – The dominant search engine used in China and the 4th most popular site according to the
Alexa 500.
Now that you know what’s out there in the search engine landscape, let’s take a look at a few of
the areas where they differ.
 

Device indexing
Google is making a move towards mobile-first indexing, where they will use the mobile version
of a site’s content to rank pages from that site rather than the desktop version.
In 2018, Google also plans to roll out a mobile page speed update, which means that page speed will become a ranking factor in mobile search.
Christi Olson, Bing’s Head of Evangelism for Search at Microsoft, said that they have no plans to roll out a mobile-first index similar to Google’s.
Yandex began labeling mobile-friendly pages in their index as of November 2015 and rolled out
a mobile-friendly algorithm in 2016.
The mobile-friendly algorithm, code-named Vladivostok, did not result in pages which are not considered mobile-friendly being removed from the search results, but it was stated that such pages would potentially not rank as prominently for users searching on mobile devices.
“Vladivostok implementation doesn’t mean web pages not optimized for mobile experience will
now disappear from search results, but their position on SERPS may differ depending on
whether the user search[es] on their mobile or desktop,”
Baidu’s mobile search results differ substantially depending on whether a page is deemed to be
mobile-friendly. It is also worth noting that Baidu makes use of transcoding in order to convert
non-mobile-friendly web pages into Baidu-generated mobile-friendly pages.
 

Backlinks as a ranking signal


Google focuses on the quality of backlinks over volume, according to empirical and anecdotal
evidence.

It used to be the case that the volume of backlinks was a key ranking signal, which led to a lot of
low quality link acquisition with companies purchasing backlinks from link farms and networks.

Bing uses backlink information in much the same way as Google, according to their webmaster
guidelines as well as anecdotal reports.
Bing’s webmaster guidelines state:
“The fact is Bing wants to see quality links pointing at your website. Often even just a few
quality inbound links from trusted websites is enough to help boost your rankings. Just as it is
with content, when it comes to links, quality matters most.”
Yandex stopped using backlink data in their ranking algorithms for certain verticals in 2014. Around one year later, backlink data was reintroduced into their algorithms, and they now provide
the following warning regarding the use of purchased links intended to promote search rankings:
“Publishing SEO links on other sites in order to promote your own site. Such links include, in
particular, links that are bought through link exchanges and aggregators.”

It is known that, like Google and Bing, Yandex looks for high quality, relevant links from
authoritative sources, but backlinks alone are not a decisive ranking factor.

Baidu values backlinks from websites based in China much more than those from foreign
sites. It is reported that Baidu lags behind the other major search engines with regard to the
detection of link spam.
Link spam tactics are still effective in promoting rankings in the Baidu search results and
therefore continue to be used in the promotion of Chinese websites.
 

Social media as a ranking signal


Google officially doesn’t use social media as a ranking factor. Matt Cutts explained that this is
due to the difficulties of understanding social identities, and because Google wants to avoid
using data that may be incomplete or misleading.
Bing, on the other hand, embraces social signals as a part of its algorithms. Their webmaster
guidelines state:
“If you are influential socially, this leads to your followers sharing your information widely,
which in turn results in Bing seeing these positive signals. These positive signals can have an
impact on how you rank organically in the long run.”
Yandex does seem to derive some ranking signals from social media, at least according to
anecdotal reports.
Baidu does not use social signals in its ranking algorithms according to reports. However, there
is often a strong correlation between sites ranking prominently in Baidu and active social media
accounts.
Next: Crawl Budget
Crawl Budget
Following on from our introduction to the crawling process used to discover new pages, it is
important to understand the main rules and conditions around crawling that search engines
incorporate as part of their algorithms. After reading this you will have an understanding
of crawl budget, demand and rate.
 

What is crawl budget?


Crawl budget is the number of URLs on a website which a search engine will crawl in a given
time period, and is a function of crawl rate and crawl demand.
The Google Webmaster Central Blog defines crawl budget as the following:
“Taking crawl rate and crawl demand together we define crawl budget as the number of URLs
Googlebot can and wants to crawl.”
 

Why is crawl budget limited?


Crawl budget is constrained in order to ensure that a website’s server is not overloaded with too
many concurrent connections or too much demand for server resources which could adversely
impact the experience of the site’s visitors.

Every IP (web host) has a maximum number of connections that it can handle. Many websites
can be hosted on a shared server, so if a website shares a server or IP with several other websites
it may have a lower crawl budget than a website hosted on a dedicated server.
Equally, a website which is hosted on a cluster of dedicated servers which responds quickly will
typically have a higher crawl budget than a website which is hosted on a single server and begins
to respond more slowly when there is a high amount of traffic.

It is worth bearing in mind that just because a website responds quickly and has the resources to
sustain a high crawl rate, it doesn’t mean that search engines will want to dedicate a high amount
of their own resources if the content is not considered important enough.
 

What is crawl rate and crawl rate limit?
Crawl rate is defined as the number of URLs per second that a search engine will attempt to crawl on a site. This is normally proportional to the number of active HTTP connections that it chooses to open simultaneously.
Crawl rate limit can be defined as the maximum fetching that can be achieved without
degrading the experience of visitors to a site.

There are a couple of factors that can cause fluctuations in crawl rate. These include:

 Crawl health – Faster responding websites may see increases in crawl rate, whereas slower
websites may see reductions in crawl rate.
 Limiting the rate at which Google crawls your website in Google Search Console by going into Settings and navigating to the Crawl Rate section.

What is crawl demand?


In addition to crawl health and crawl rate limits specified by the webmaster, crawl rate will vary
from page to page based on the demand for a specific page.
The demand from users for previously indexed pages impacts how often a search engine crawls
those pages. Pages that are more popular will likely be crawled more often than pages that are
rarely visited, or those that are not updated or hold little value. New or important pages are
normally prioritised over old pages which do not change often.
 

Managing crawl budget


Issues with larger sites
Managing crawl budget is particularly important for larger sites with many URLs and a high
turnover of content.

Larger sites may encounter issues with getting new, never-before-crawled pages indexed and appearing in a search engine’s results pages. It may also be the case that pages that have
already been indexed take longer to be re-crawled, meaning that changes take longer to be
detected and then updated in the index.
Issues with low value URLs

Another important part of managing crawl budget is dealing with low value URLs, which
can consume a large amount of crawl budget. This can be problematic because it could mean that
crawl budget is being wasted on low value URLs while higher value URLs are crawled less often
than you would ideally like.

Examples of low value URLs which may consume crawl budget are:

 URLs with tracking parameters and session identifiers
 On-site duplicate content
 Soft error pages, such as discontinued products
 Multi-faceted categorisation
 Site search results pages
When/why/how can I influence crawl budget?
Most search engines will provide you with statistics about the number of pages crawled per day
within their webmaster interfaces (such as Google Search Console or Bing Webmaster Tools).

Alternatively you can analyse server log files, which record every time a page is requested by a
search engine and provide the most accurate data on which URLs are crawled and how
frequently.

Do all websites need to consider crawl budget?

Managing crawl budget isn’t something that needs to be worried about on the majority of websites, because on sites with fewer than a few thousand URLs, both existing and new pages can usually be crawled within a day. This means that crawl budget isn’t something that demands attention for smaller sites.

Influencing crawl budget

Managing crawl activity is more of a consideration for larger sites and those that auto-generate
content based on URL parameters.

So what can large sites do to influence the crawl activity by search engine bots to ensure their
high value pages are crawled regularly?

Ensuring high priority pages are accessible to crawlers


Large sites should ensure the .htaccess and robots.txt files don’t prevent crawlers from accessing
high priority pages on the website. Additionally, web crawlers should also be able to crawl CSS
and JavaScript files.
Disallowing pages that should not be indexed
Regardless of the size of a site, there are always going to be pages that you will want to disallow
from search engine indexes. A few examples include:
 Duplicate or near duplicate pages – Pages that present predominantly duplicate content should
be disallowed.
 Dynamically generated URLs – Such as onsite search results pages.
 Thin or low value content – Pages with little content or little valuable content are also good
candidates for being excluded from indexes.
Robots.txt
The robots.txt file is used to provide instructions to web crawlers using the Robots Exclusion
Protocol. Disallowing directories and pages that should not be crawled in the robots.txt file is a
good method for freeing up valuable crawl budget on large sites.
Noindex robots meta tag & X-Robots-Tag
Robots.txt disallow instructions do not guarantee that a page will not be indexed and shown in the search results. Search engines use other information, such as internal links, which may lead them to index a page which should ideally be omitted.

To prevent most search engine crawlers from indexing a page, the following meta tag should be placed in the <head> section of the page.

<meta name="robots" content="noindex">

An alternative to the noindex robots meta tag is to return an X-Robots-Tag: noindex header in the HTTP response to a page request.
HTTP/1.1 200 OK
Date: Tue, 25 May 2010 21:42:43 GMT
(…)
X-Robots-Tag: noindex
(…)
Managing parameter/URL sprawl
A common cause of crawl budget wastage is poor management of parameters and URLs, known as URL sprawl. The best strategy to avoid URL sprawl on a website is to design it so that URLs
are only created for unique and useful pages.

If there is already an issue with URL sprawl on a website, there are several steps that should be
taken to address this:

 Stop using useless parameters – These are parameters that do not make meaningful changes to
the content on a page and could include session IDs, tracking parameters and sorting parameters.
 Uniform casing – Ensure that all URLs share the same casing i.e. all lower case or camel case.
 Trailing slashes – Check that all URLs follow the same trailing slash rules i.e. every URL has a
trailing slash or it doesn’t.

All URLs which don’t follow the above rules should be redirected to their canonical version.
You should also ensure all links are updated to point to the canonical versions. Additionally, you should use rel="nofollow" on links to URLs that don’t follow these rules, e.g. links to pages with sorting parameters.
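A minimal sketch of that kind of normalisation, assuming Python’s standard urllib; the canonicalise helper and the list of "useless" parameters are purely illustrative and would need to reflect your own site’s rules:

[code]
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

USELESS_PARAMS = {"utm_source", "utm_medium", "sessionid", "sort"}  # Illustrative list

def canonicalise(url):
    """Lower-case the host and path, enforce a trailing slash and strip useless parameters."""
    scheme, host, path, query, _ = urlsplit(url)
    path = path.lower()
    if not path.endswith("/"):
        path += "/"  # Pick one trailing-slash rule and apply it everywhere
    params = [(k, v) for k, v in parse_qsl(query) if k.lower() not in USELESS_PARAMS]
    return urlunsplit((scheme, host.lower(), path, urlencode(params), ""))

print(canonicalise("https://www.Example.com/Shoes?sort=price&colour=red&utm_source=mail"))
# https://www.example.com/shoes/?colour=red
[/code]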

Nofollow links
Using rel="nofollow" tells search engines not to pass link equity via that link to the linked URL.
There is good evidence to suggest that Googlebot will honour the nofollow attribute and not
follow the link to crawl and discover content. This means that nofollow can be used by
webmasters to moderate crawl activity within a website.

It should also be noted that external links which do not use the rel="nofollow" attribute will provide a pathway for search engine bots to crawl the linked resource.
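For reference, a nofollow link is simply a standard anchor element carrying the attribute; the URL shown is purely illustrative:

<a href="https://www.example.com/page?sort=price" rel="nofollow">Sorted view</a>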

Fix broken links

If there are broken links (external and internal) on a site, this will expend crawl budget
unnecessarily. The number of broken links should be monitored regularly on a site and kept to an
absolute minimum.

Avoid unnecessary redirects

Redirects often occur after a page’s URL has been changed, with a 301 redirect being implemented from the old URL to the new one. However, other onsite links may be neglected and not updated to point to the new URLs, resulting in unnecessary redirect hops.

Unnecessary redirects can delay crawling and indexation of the target URL, as well as impacting
user experience by increasing load time.
Next: Robots.txt

Robots.txt
In this section of our guide to robots directives, we’ll go into more detail about the robots.txt file and how it can be used to instruct search engine web crawlers. This file is especially
useful for managing crawl budget and making sure search engines are spending their time on
your site efficiently and crawling only the important pages.
 
What is a robots.txt file used for?
The robots.txt file is there to tell crawlers and robots which URLs they should not visit on your
website. This is important to help them avoid crawling low quality pages, or getting stuck in
crawl traps where an infinite number of URLs could potentially be created, for example, a
calendar section which creates a new URL for every day.
As Google explains in their robots.txt specifications guide, the file format should be plain text
encoded in UTF-8. The file’s records (or lines) should be separated by CR, CR/LF or LF.
You should be mindful of the size of a robots.txt file, as search engines have their own maximum
file size limits. The maximum size for Google is 500KB.
 

Where should the robots.txt exist?


The robots.txt should always exist on the root of the domain, for example:

https://www.example.com/robots.txt

This file is specific to the protocol and full domain, so the robots.txt
on https://www.example.com does not impact the crawling
of http://www.example.com or https://subdomain.example.com; these should have their own
robots.txt files.
 

When should you use robots.txt rules?
In general, websites should try to utilise the robots.txt as little as possible to control crawling.
Improving your website’s architecture and making it clean and accessible for crawlers is a much
better solution. However, using robots.txt where necessary to prevent crawlers from accessing
low quality sections of the site is recommended if these problems cannot be fixed in the short
term.

Google recommends only using robots.txt when server issues are being caused, or for crawl efficiency issues such as Googlebot spending a lot of time crawling a non-indexable section of a site.

Some examples of pages which you may not want to be crawled are:

 Category pages with non-standard sorting as this generally creates duplication with the
primary category page
 User-generated content that cannot be moderated
 Pages with sensitive information
 Internal search pages, as there can be an infinite number of these result pages, which provides a poor user experience and wastes crawl budget

When shouldn’t you use robots.txt?
The robots.txt file is a useful tool when used correctly, however, there are instances where it isn’t
the best solution. Here are some examples of when not to use robots.txt to control crawling:

1. Blocking Javascript/CSS

Search engines need to be able to access all resources on your site to correctly render pages,
which is a necessary part of maintaining good rankings. JavaScript files which dramatically change the user experience but are disallowed from crawling by search engines may result in manual or algorithmic penalties.

For instance, if you serve an ad interstitial or redirect users with JavaScript that a search engine
cannot access, this may be seen as cloaking and the rankings of your content may be adjusted
accordingly.

2. Blocking URL parameters

You can use robots.txt to block URLs containing specific parameters, but this isn’t always the
best course of action. It is better to handle these in Google Search Console, as there are more
parameter-specific options there to communicate preferred crawling methods to Google.

You could also place the information in a URL fragment (/page#sort=price), as search engines
do not crawl this. Additionally, if a URL parameter must be used, the links to it could contain the
rel=nofollow attribute to prevent crawlers from trying to access it.
3. Blocking URLs with backlinks

Disallowing URLs within the robots.txt prevents link equity from passing through to the website.
This means that if search engines are unable to follow links from other websites as the target
URL is disallowed, your website will not gain the authority that those links are passing and as a
result, you may not rank as well overall.

4. Getting indexed pages deindexed

Using Disallow doesn’t get pages deindexed: even if a URL is blocked and search engines have never crawled the page, it may still get indexed, for example if other sites link to it. This is because the crawling and indexing processes are largely separate.
5. Setting rules which ignore social network crawlers

Even if you don’t want search engines to crawl and index pages, you may want social networks
to be able to access those pages so that a page snippet can be built. For example, Facebook will
attempt to visit every page that gets posted on the network, so that they can serve a relevant
snippet. Keep this in mind when setting robots.txt rules.

6. Blocking access from staging or dev sites

Using the robots.txt to block an entire staging site isn’t best practice. Google recommends noindexing the pages but allowing them to be crawled; in general, though, it is better to render the site inaccessible from the outside world.

7. When you have nothing to block


Some websites with a very clean architecture have no need to block crawlers from any pages. In
this situation it’s perfectly acceptable not to have a robots.txt file, and return a 404 status when
it’s requested.
 

Robots.txt Syntax and Formatting


Now that we’ve learnt what robots.txt is and when it should and shouldn’t be used, let’s take a
look at the standardised syntax and formatting rules that should be adhered to when writing a
robots.txt file.
Comments
Comments are lines that are completely ignored by search engines and start with a #. They exist
to allow you to write notes about what each line of your robots.txt does, why it exists, and when
it was added. In general, it is advised to document the purpose of every line of your robots.txt
file, so that it can be removed when it is no longer necessary and is not modified while it is still
essential.
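For example, a documented block might look like this (the date and reasoning shown are purely illustrative):

[code]
# Added 2018-03-12: block internal search results to save crawl budget
User-agent: *
Disallow: /search
[/code]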
Specifying User-agent
A block of rules can be applied to specific user agents using the “User-agent” directive. For
instance, if you wanted certain rules to apply to Google, Bing and Yandex, but not to Facebook and ad networks, this can be achieved by specifying the user agent tokens that a set of rules applies to.

Each crawler has its own user-agent token, which is used to select the matching blocks.

Crawlers will follow the most specific user agent rules set for them, with name parts separated by hyphens, and will then fall back to more generic rules if an exact match isn’t found. For example, Googlebot News will look for a match on ‘googlebot-news’, then ‘googlebot’, then ‘*’.
Here are some of the most common user agent tokens you’ll come across:

 * – The rules apply to every bot, unless there is a more specific set of rules
 Googlebot – All Google crawlers
 Googlebot-News – Crawler for Google News
 Googlebot-Image – Crawler for Google Images
 Mediapartners-Google – Google Adsense crawler
 Bingbot – Bing’s crawler
 Yandex – Yandex’s crawler
 Baiduspider – Baidu’s crawler
 Facebot – Facebook’s crawler
 Twitterbot – Twitter’s crawler
This list of user agent tokens is by no means exhaustive, so to learn more about some of the
crawlers out there, take a look at the documentation published
by Google, Bing, Yandex, Baidu, Facebook and Twitter.

The matching of a user agent token to a robots.txt block is not case sensitive. E.g. ‘googlebot’
will match Google’s user agent token ‘Googlebot’.

Pattern Matching URLs

You might have a particular URL string pattern you want to block from being crawled, as pattern matching is much more efficient than including a full list of complete URLs to be excluded in your robots.txt file.

To help you refine your URL paths, you can use the * and $ symbols. Here’s how they work:

 * – This is a wildcard and represents any amount of any character. It can be at the start or in the
middle of a URL path, but isn’t required at the end. You can use multiple wildcards within a URL
string, for example, “Disallow: */products?*sort=”. Rules with full paths should not start with a
wildcard.
 $ – This character signifies the end of a URL string, so “Disallow: */dress$” will match only
URLs ending in “/dress”, and not “/dress?parameter”.
It’s worth noting that robots.txt rules are case sensitive, meaning that if you disallow URLs with
the parameter “search” (e.g. “Disallow: *?search=”), robots might still crawl URLs with
different capitalisation, such as “?Search=anything”.
The directive rules match against URL paths only, and can’t include a protocol or hostname. A
slash at the start of a directive matches against the start of the URL path. E.g. “Disallow: /starts”
would match to www.example.com/starts.
Unless you start a directive match with a / or *, it will not match anything. E.g. “Disallow: starts” would never match anything.

To help visualise the ways different URLs rules work, we’ve put together some examples for
you:
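Each illustrative rule below is annotated with sample URL paths it would and would not block (the paths themselves are hypothetical):

[code]
Disallow: /products     # matches /products, /products/shoes and /products?sort=price
Disallow: /*?sort=      # matches /products?sort=price and /shoes/?sort=asc
Disallow: /*.pdf$       # matches /guide.pdf but not /guide.pdf?download=true
Disallow: */dress$      # matches /summer/dress but not /dress?parameter
[/code]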
Robots.txt Sitemap Link
The sitemap directive in a robots.txt file tells search engines where to find the XML sitemap,
which helps them to discover all the URLs on the website. To learn more about sitemaps, take a
look at our guide on sitemap audits and advanced configuration.
When including sitemaps in a robots.txt file, you should use absolute URLs (e.g. https://www.example.com/sitemap.xml) instead of relative URLs (e.g. /sitemap.xml). It’s also worth noting that sitemaps don’t have to sit on the same root domain; they can also be hosted on an external domain.
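For example, the sitemap directive is simply an absolute URL on its own line in the robots.txt file:

Sitemap: https://www.example.com/sitemap.xml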
Search engines will discover and may crawl the sitemaps listed in your robots.txt file, however,
these sitemaps will not appear in Google Search Console or Bing Webmaster Tools without
manual submission.
 

Robots.txt Blocks
The “disallow” rule in the robots.txt file can be used in a number of ways for different user
agents. In this section, we’ll cover some of the different ways you can format combinations of
blocks.

It’s important to remember that directives in the robots.txt file are only instructions. Malicious
crawlers will ignore your robots.txt file and crawl any part of your site that is public, so disallow
should not be used in place of robust security measures.

Multiple User-agent blocks

You can apply a block of rules to multiple user agents by listing them before the set of rules. For example, the following disallow rule will apply to both Googlebot and Bing:

User-agent: googlebot
User-agent: bing
Disallow: /a
Spacing between blocks of directives

Google will ignore blank lines between directives within a block. In this first example, the second rule will still be picked up, even though there is a blank line separating the two parts of the rule:

[code]
User-agent: *
Disallow: /disallowed/

Disallow: /test1/robots_excluded_blank_line
[/code]

In this second example, Googlebot-mobile would inherit the same rules as Bingbot:

[code]
User-agent: googlebot-mobile

User-agent: bing
Disallow: /test1/deepcrawl_excluded
[/code]
Separate blocks combined
Multiple blocks with the same user agent are combined. So in the example below, the top and
bottom blocks would be combined and Googlebot would be disallowed from crawling “/b” and
“/a”.

User-agent: googlebot
Disallow: /b

User-agent: bing
Disallow: /a

User-agent: googlebot
Disallow: /a
 

Robots.txt Allow
The robots.txt “allow” rule explicitly gives permission for certain URLs to be crawled. While
this is the default for all URLs, this rule can be used to overwrite a disallow rule. For example, if
“/locations” is disallowed, you could allow the crawling of “/locations/london” by having the
specific rule of “Allow: /locations/london”.
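Written out as a robots.txt block, that example looks like this:

[code]
User-agent: *
Disallow: /locations
Allow: /locations/london
[/code]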
 

Robots.txt Prioritisation
When several allow and disallow rules apply to a URL, the longest matching rule is the one that
is applied. Let’s look at what would happen for the URL “/home/search/shirts” with the
following rules:
Disallow: /home
Allow: *search/*
Disallow: *shirts

In this case, the URL is allowed to be crawled because the Allow rule has 9 characters, whereas
the disallow rule has only 7. If you need a specific URL path to be allowed or disallowed, you
can utilise * to make the string longer. For example:

Disallow: *******************/shirts
When a URL matches both an allow rule and a disallow rule, but the rules are the same length,
the disallow will be followed. For example, the URL “/search/shirts” will be disallowed in the
following scenario:
Disallow: /search
Allow: *shirts
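A rough Python sketch of this longest-match logic, simplified to treat rules as plain path prefixes (a real parser would also have to expand the * and $ wildcards described above):

[code]
def is_allowed(path, rules):
    """Apply the longest matching rule; ties go to disallow, as described above."""
    matches = [(len(prefix), rule) for rule, prefix in rules if path.startswith(prefix)]
    if not matches:
        return True  # No rule matches, so crawling is allowed by default
    longest = max(length for length, _ in matches)
    return all(rule == "allow" for length, rule in matches if length == longest)

rules = [("disallow", "/home"), ("allow", "/home/search/"), ("disallow", "/home/search/shirts")]
print(is_allowed("/home/search/shirts", rules))    # False: the 19-character disallow is longest
print(is_allowed("/home/search/trousers", rules))  # True: the 13-character allow is longest
[/code]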
 

Robots.txt Directives
Page level directives (which we’ll cover later on in this guide) are great tools, but the issue with
them is that search engines must crawl a page before being able to read these instructions, which
can consume crawl budget.

Robots.txt directives can help to reduce the strain on crawl budget because you can add
directives directly into your robots.txt file rather than waiting for search engines to crawl pages
before taking action on them. This solution is much quicker and easier to manage.

The following robots.txt directives work in the same way as the allow and disallow directives, in
that you can specify wildcards (*) and use the $ symbol to denote the end of a URL string.
Robots.txt NoIndex
Robots.txt noindex is a useful tool for managing search engine indexing without using up crawl
budget. Disallowing a page in robots.txt doesn’t mean it is removed from the index, so the
noindex directive is much more effective to use for this purpose.
Google doesn’t officially support robots.txt noindex, and you shouldn’t rely on it because
although it works today, it may not do so tomorrow. This tool can be helpful though and should
be used as a short term fix in combination with other longer-term index controls, but not as a
mission-critical directive. Take a look at the tests run by ohgm and Stone Temple which both
prove that the feature works effectively.

Here’s an example of how you would use robots.txt noindex:


[code]
User-agent: *
NoIndex: /directory
NoIndex: /*?*sort=
[/code]
As well as noindex, Google currently unofficially obeys several other indexing directives when
they’re placed within the robots.txt. It is important to note that not all search engines and
crawlers support these directives, and the ones which do may stop supporting them at any time –
you shouldn’t rely on these working consistently.
 

Common Robots.txt Issues


There are some key issues and considerations for the robots.txt file and the impact it can have on
a site’s performance. We’ve taken the time to list some of the key points to consider with
robots.txt as well as some of the most common issues which you can hopefully avoid.

1. Have a fallback block of rules for all bots – Using blocks of rules for specific user agent strings
without having a fallback block of rules for every other bot means that your website will
eventually encounter a bot which does not have any rulesets to follow.
2. It is important that robots.txt is kept up to date – A relatively common problem occurs when
the robots.txt is set during the initial development phase of a website, but is not updated as the
website grows, meaning that potentially useful pages are disallowed.
3. Be aware of redirecting search engines through disallowed URLs – For
example, /product > /disallowed > /category
4. Case sensitivity can cause a lot of problems – Webmasters may expect a section of a website not to be crawled, but those pages may be crawled because of alternate casings, e.g. “Disallow: /admin” exists, but search engines crawl “/ADMIN”.
5. Don’t disallow backlinked URLs – This prevents PageRank from flowing to your site from
others that are linking to you.
6. Crawl Delay can cause search issues – The “crawl-delay” directive forces crawlers to visit your
website slower than they would have liked, meaning that your important pages may be crawled
less often than is optimal. This directive is not obeyed by Google or Baidu, but is supported by
Bing and Yandex.
7. Make sure the robots.txt only returns a 5xx status code if the whole site is down – Returning
a 5xx status code for /robots.txt indicates to search engines that the website is down for
maintenance. This typically means that they will try to crawl the website again later.
8. Robots.txt disallow overrides the parameter removal tool – Be mindful that your robots.txt
rules may override parameter handling and any other indexation hints that you may have given to
search engines.
9. Sitelinks Search Box markup will work with internal search pages blocked – Internal search
pages on a site do not need to be crawlable for the Sitelinks Search Box markup to work.
10. Disallowing a migrated domain will impact the success of the migration – If you disallow a migrated domain, search engines won’t be able to follow any of the redirects from the old site to the new one, so the migration is unlikely to be a success.
Testing & Auditing Robots.txt
Considering just how harmful a robots.txt file can be if the directives within aren’t handled
correctly, there are a few different ways you can test it to make sure it has been set up properly.
Take a look at this guide on how to audit URLs blocked by robots.txt, as well as these
examples:
 Use DeepCrawl – The Disallowed Pages and Disallowed URLs (Uncrawled) reports can show
you which pages are being blocked from search engines by your robots.txt file.
 Use Google Search Console – With the GSC robots.txt tester tool you can see the latest cached version of your robots.txt file, as well as using the Fetch and Render tool to see renders from the Googlebot
user agent as well as the browser user agent. Things to note: GSC only works for Google User
agents, and only single URLs can be tested.
 Try combining the insights from both tools by spot-checking disallowed URLs that DeepCrawl has flagged within the GSC robots.txt tester tool, to clarify the specific rules which are resulting in a disallow.

Monitoring Robots.txt Changes


When there are lots of people working on a site, and with the issues that can be caused if even
one character is out of place in a robots.txt file, constantly monitoring your robots.txt is
crucial. Here are some ways in which you can check for any issues:
 Check Google Search Console to see the current robots.txt which Google is using. Sometimes
robots.txt can be delivered conditionally based on user agents, so this is the only method to see
exactly what Google is seeing.
 Check the size of the robots.txt file if you have noticed significant changes to make sure it is
under Google’s 500KB size limit.
 Go to the Google Search Console Index Status report in advanced mode to cross-check robots.txt
changes with the number of disallowed and allowed URLs on your site.
 Schedule regular crawls with DeepCrawl to see the number of disallowed pages on your site on
an ongoing basis, so you can track changes.
Next: URL-level Robots Directives
