How Do Search Engines Work?
In this guide we’re going to provide you with an introduction to how search engines
work. This will cover the processes of crawling and indexing, as well as concepts such
as crawl budget and PageRank.
Search engines work by crawling hundreds of billions of pages using their own web
crawlers. These web crawlers are commonly referred to as search engine bots or spiders.
A search engine navigates the web by downloading web pages and following links on
these pages to discover new pages that have been made available.
The index includes all the discovered URLs along with a number of relevant key signals
about the contents of each URL such as:
The keywords discovered within the page’s content – what topics does the page cover?
The type of content that is being crawled (identified using structured data markup such as
Schema.org) – what is included on the page?
The freshness of the page – how recently was it updated?
The previous user engagement of the page and/or domain – how do people interact with
the page?
There are also a number of factors that can prevent a URL from being crawled or indexed,
including:
Robots.txt file exclusions – a file which tells search engines what they shouldn’t visit on
your site.
Directives on the webpage telling search engines not to index that page (noindex tag) or
to index another similar page (canonical tag).
Search engine algorithms judging the page to be of low quality, have thin content or
contain duplicate content.
The URL returning an error page (e.g. a 404 Not Found HTTP response code).
Here are a few examples of user agent strings used by search engines:
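These are representative strings only; the exact versions vary over time and each search engine documents its own crawlers:

```
Googlebot:    Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
Bingbot:      Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
Baiduspider:  Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)
YandexBot:    Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)
```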
However, if the URL is a non-text file type such as an image, video or audio file, search engines
will typically not be able to read the content of the file other than the associated filename and
metadata.
Although a search engine may only be able to extract a limited amount of information about non-
text file types, they can still be indexed, rank in search results and receive traffic.
You can find a full list of file types that can be indexed by Google in their Search documentation.
Sitemaps
Another way that search engines can discover new pages is by crawling sitemaps.
Sitemaps contain sets of URLs, and can be created by a website to provide search engines with a
list of pages to be crawled. These can help search engines find content hidden deep within a
website and can provide webmasters with the ability to better control and understand the areas of
site indexing and frequency.
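A minimal XML sitemap, with a hypothetical URL, looks like this:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.example.com/page-1</loc>
    <lastmod>2018-01-15</lastmod>
  </url>
</urlset>
```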
Page submissions
Alternatively, individual page submissions can often be made directly to search engines via their
respective interfaces. This manual method of page discovery can be used when new content is
published on site, or if changes have taken place and you want to minimise the time that it takes
for search engines to see the changed content.
Google states that for large URL volumes you should use XML sitemaps, but sometimes the
manual submission method is convenient when submitting a handful of pages. It is also
important to note that Google limits webmasters to 10 URL submissions per day.
Additionally, Google says that the response time for indexing is the same for sitemaps as
individual submissions.
Searching through individual pages for keywords and topics would be a very slow process for
search engines to identify relevant information. Instead, search engines (including Google) use
an inverted index, also known as a reverse index.
Below is a very basic example which illustrates the concept of inverted indexing. In the example
you can see that each keyword (or token) is associated with a row of documents in which that
element was identified.
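As an illustration, a minimal inverted index can be built like this (the document contents here are hypothetical):

```python
# A minimal sketch of an inverted index: each token maps to the set of
# document IDs in which it appears, so a keyword lookup is a single
# dictionary access rather than a scan through every document.
documents = {
    1: "search engines crawl the web",
    2: "crawlers follow links on the web",
    3: "search results are ranked",
}

inverted_index = {}
for doc_id, text in documents.items():
    for token in text.split():
        inverted_index.setdefault(token, set()).add(doc_id)

# Looking up a token returns every document containing it.
print(sorted(inverted_index["search"]))  # [1, 3]
print(sorted(inverted_index["web"]))     # [1, 2]
```

Real search engine indexes also store positions, frequencies and other signals per token, but the lookup principle is the same.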
The cached document is the latest snapshot of the page that the search engine has seen.
The cached version of a page can be accessed (in Google) by clicking the little green arrow next
to each search result’s URL and selecting the cached option. Alternatively, you can use the
‘cache:’ Google search operator to view the cached version of the page.
Bing offers the same facility to view the cached version of a page via a green down arrow next to
each search result but doesn’t currently support the ‘cache:’ search operator.
What is PageRank?
“PageRank” is a Google algorithm named after the co-founder of Google, Larry Page (yes,
really!). It is a value calculated for each page by counting the number of links pointing at that
page, in order to determine the page’s value relative to every other page on the internet. The value
passed by each individual link is based on the number and value of links which point to the page
containing the link.
PageRank is just one of the many signals used within the large Google ranking algorithm.
An approximation of the PageRank value of each page was initially provided by Google, but
these values are no longer publicly visible.
While PageRank is a Google term, all commercial search engines calculate and use an equivalent
link equity metric. Some SEO tools try to give an estimation of PageRank using their own logic
and calculations. For example, Page Authority in Moz tools, TrustFlow in Majestic, or URL
Rating in Ahrefs. DeepCrawl has a metric called DeepRank to measure the value of pages based
on the internal links within a website.
How PageRank flows through pages
Pages pass PageRank, or link equity, through to other pages via links. When a page links to
content elsewhere it is seen as a vote of confidence and trust, in that the content being linked to is
being recommended as relevant and useful for users. The count of these links and the measure of
how authoritative the linking website is, determines the relative PageRank of the linked-to page.
PageRank is equally divided across all discovered links on the page. For example, if your page
has five links, each link would pass 20% of the page’s PageRank through each link to the target
pages. Links which use the rel=”nofollow” attribute do not pass PageRank.
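The equal split described above can be sketched as follows. This is a simplified model with hypothetical URLs; the real PageRank computation is iterative across the whole link graph:

```python
def link_equity(page_rank, links):
    # The page's PageRank is divided equally across all discovered
    # links; links marked rel="nofollow" still receive a share of the
    # split but pass none of it on to their target.
    share = page_rank / len(links)
    return {url: 0.0 if nofollow else share for url, nofollow in links}

# A page with five links: each followed link passes 20% of the page's
# PageRank, while the nofollow link passes nothing.
links = [("/a", False), ("/b", False), ("/c", False), ("/d", False), ("/e", True)]
passed = link_equity(1.0, links)
print(passed["/a"])  # 0.2
print(passed["/e"])  # 0.0
```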
Research into backlinks by Moz shows that, of the results for the top 50 Google search queries
(~15,000 search results), 99.2% had at least one external backlink. On top of this, SEOs
consistently rate backlinks as one of the most important ranking factors in surveys.
Device indexing
Google is making a move towards mobile-first indexing, where they will use the mobile version
of a site’s content to rank pages from that site rather than the desktop version.
In 2018, Google also plans to roll out a mobile page speed update, which means that page
speed will become a ranking factor in mobile search.
Christi Olson, Bing’s Head of Evangelism for Search at Microsoft, said that they have no plans
to roll out a mobile-first index similar to Google’s.
Yandex began labeling mobile-friendly pages in their index as of November 2015 and rolled out
a mobile-friendly algorithm in 2016.
The mobile-friendly algorithm, code-named Vladivostok, did not result in pages which are not
considered mobile-friendly being removed from the search results, but it was stated that such
pages would potentially not rank as prominently for search users who are using mobile devices.
“Vladivostok implementation doesn’t mean web pages not optimized for mobile experience will
now disappear from search results, but their position on SERPs may differ depending on
whether the user search[es] on their mobile or desktop.”
Baidu’s mobile search results differ substantially dependent on whether a page is deemed to be
mobile-friendly. It is also worth noting that Baidu makes use of transcoding in order to convert
non-mobile-friendly web pages into Baidu-generated mobile-friendly pages.
It used to be the case that the volume of backlinks was a key ranking signal, which led to a lot of
low quality link acquisition with companies purchasing backlinks from link farms and networks.
Bing uses backlink information in much the same way as Google, according to their webmaster
guidelines as well as anecdotal reports.
Bing’s webmaster guidelines state:
“The fact is Bing wants to see quality links pointing at your website. Often even just a few
quality inbound links from trusted websites is enough to help boost your rankings. Just as it is
with content, when it comes to links, quality matters most.”
Yandex stopped using backlink data in their ranking algorithms for certain verticals in 2014.
Around one year later, backlink data was reintroduced into their algorithms, and they now provide
the following warning regarding the use of purchased links intended to promote search rankings:
“Publishing SEO links on other sites in order to promote your own site. Such links include, in
particular, links that are bought through link exchanges and aggregators.”
It is known that, like Google and Bing, Yandex looks for high quality, relevant links from
authoritative sources, but backlinks alone are not a decisive ranking factor.
Baidu values backlinks from websites based in China much more than those from foreign
sites. It is reported that Baidu lags behind the other major search engines with regard to the
detection of link spam.
Link spam tactics are still effective in promoting rankings in the Baidu search results and
therefore continue to be used in the promotion of Chinese websites.
Every IP (web host) has a maximum number of connections that it can handle. Many websites
can be hosted on a shared server, so if a website shares a server or IP with several other websites
it may have a lower crawl budget than a website hosted on a dedicated server.
Equally, a website which is hosted on a cluster of dedicated servers which responds quickly will
typically have a higher crawl budget than a website which is hosted on a single server and begins
to respond more slowly when there is a high amount of traffic.
It is worth bearing in mind that just because a website responds quickly and has the resources to
sustain a high crawl rate, it doesn’t mean that search engines will want to dedicate a high amount
of their own resources if the content is not considered important enough.
There are a couple of factors that can cause fluctuations in crawl rate. These include:
Crawl health – Faster responding websites may see increases in crawl rate, whereas slower
websites may see reductions in crawl rate.
Crawl rate limits – the rate at which Google crawls your website can be limited in Google
Search Console by going into Settings and navigating to the Crawl Rate section.
Larger sites may encounter issues with getting new pages which have never been crawled and
indexed to appear in a search engine’s results pages. It may also be the case that pages that have
already been indexed take longer to be re-crawled, meaning that changes take longer to be
detected and then updated in the index.
Issues with low value URLs
Another important part of managing crawl budget is about dealing with low value URLs which
can consume a large amount of crawl budget. This can be problematic because it could mean that
crawl budget is being wasted on low value URLs while higher value URLs are crawled less often
than you would ideally like.
Examples of low value URLs which may consume crawl budget include URLs with session IDs
or tracking parameters, sorting and faceted navigation URLs, and infinite URL spaces such as
calendar pages.
To see which URLs are being crawled, you can analyse server log files, which record every time
a page is requested by a search engine and provide the most accurate data on which URLs are
crawled and how frequently.
Managing crawl budget isn’t something that needs to be worried about on the majority of
websites, because a site with fewer than a few thousand URLs can usually have its new pages
crawled within a day. This means that crawl budget isn’t something that demands attention for
smaller sites.
Managing crawl activity is more of a consideration for larger sites and those that auto-generate
content based on URL parameters.
So what can large sites do to influence the crawl activity by search engine bots to ensure their
high value pages are crawled regularly?
To prevent most search engine crawlers from indexing a page, the following meta tag should be
placed in the <head> section of the page.
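The standard robots meta tag takes this form:

```html
<meta name="robots" content="noindex">
```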
An alternative to the noindex robots meta tag is to return an X-Robots-Tag: noindex header in
response to a page request:
[code]
HTTP/1.1 200 OK
Date: Tue, 25 May 2010 21:42:43 GMT
(…)
X-Robots-Tag: noindex
(…)
[/code]
Managing parameter/URL sprawl
A common cause of crawl budget wastage is poor management of parameters and URLs, known
as URL sprawl. The best strategy to avoid URL sprawl on a website is to design it so that URLs
are only created for unique and useful pages.
If there is already an issue with URL sprawl on a website, there are several steps that should be
taken to address this:
Stop using useless parameters – These are parameters that do not make meaningful changes to
the content on a page and could include session IDs, tracking parameters and sorting parameters.
Uniform casing – Ensure that all URLs share the same casing i.e. all lower case or camel case.
Trailing slashes – Check that all URLs follow the same trailing slash rules i.e. every URL has a
trailing slash or it doesn’t.
All URLs which don’t follow the above rules should be redirected to their canonical version.
You should also ensure all links are updated to point to the canonical versions. Additionally, you
should use rel=”nofollow” on links to URLs that don’t follow these rules, e.g. links to pages with
sorting parameters.
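As an illustration, a hypothetical canonicalisation helper enforcing the lower-casing and trailing-slash rules above might look like this:

```python
def canonicalise(path):
    # A hypothetical normalisation: enforce uniform lower casing and a
    # consistent trailing slash so that only one canonical version of
    # each URL is ever linked to or served.
    path = path.lower()
    if not path.endswith("/"):
        path += "/"
    return path

print(canonicalise("/Category/Shoes"))   # /category/shoes/
print(canonicalise("/category/shoes/"))  # /category/shoes/
```

Requests for any non-canonical form would then be 301-redirected to the canonicalised path.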
Nofollow links
Using rel=”nofollow” tells search engines not to pass link equity via that link to the linked URL.
There is good evidence to suggest that Googlebot will honour the nofollow attribute and not
follow the link to crawl and discover content. This means that nofollow can be used by
webmasters to moderate crawl activity within a website.
It should also be noted that external links which do not use the rel=”nofollow” attribute will
provide a pathway for search engine bots to crawl the linked resource.
If there are broken links (external and internal) on a site, this will expend crawl budget
unnecessarily. The number of broken links should be monitored regularly on a site and kept to an
absolute minimum.
Unnecessary redirects can often occur after a page’s URL has been changed, with a 301 redirect
being implemented from the old URL to the new one. However, other onsite links may be
neglected and not updated to reflect new URLs, resulting in unnecessary redirects.
Unnecessary redirects can delay crawling and indexation of the target URL, as well as impacting
user experience by increasing load time.
Robots.txt
In this section of our guide to robots directives, we’ll go into more detail about the robots.txt
file and how it can be used to instruct search engine web crawlers. This file is especially
useful for managing crawl budget and making sure search engines are spending their time on
your site efficiently, crawling only the important pages.
What is a robots.txt file used for?
The robots.txt file is there to tell crawlers and robots which URLs they should not visit on your
website. This is important to help them avoid crawling low quality pages, or getting stuck in
crawl traps where an infinite number of URLs could potentially be created, for example, a
calendar section which creates a new URL for every day.
As Google explains in their robots.txt specifications guide, the file format should be plain text
encoded in UTF-8. The file’s records (or lines) should be separated by CR, CR/LF or LF.
You should be mindful of the size of a robots.txt file, as search engines have their own maximum
file size limits. The maximum size for Google is 500KB.
This file is specific to the protocol and full domain, so the robots.txt
on https://www.example.com does not impact the crawling
of http://www.example.com or https://subdomain.example.com; these should have their own
robots.txt files.
Google recommends only using robots.txt when server issues are being caused or for crawl
efficiency issues, such as Googlebot spending a lot of time crawling a non-indexable section of a
site, for example.
Some examples of pages which you may not want to be crawled are:
Category pages with non-standard sorting as this generally creates duplication with the
primary category page
User-generated content that cannot be moderated
Pages with sensitive information
Internal search pages as there can be an infinite amount of these result pages which provides a
poor user experience and wastes crawl budget
1. Blocking Javascript/CSS
Search engines need to be able to access all resources on your site to correctly render pages,
which is a necessary part of maintaining good rankings. JavaScript files which dramatically
change the user experience, but are disallowed from crawling by search engines may result in
manual or algorithmic penalties.
For instance, if you serve an ad interstitial or redirect users with JavaScript that a search engine
cannot access, this may be seen as cloaking and the rankings of your content may be adjusted
accordingly.
2. Blocking URL parameters
You can use robots.txt to block URLs containing specific parameters, but this isn’t always the
best course of action. It is better to handle these in Google Search Console, as there are more
parameter-specific options there to communicate preferred crawling methods to Google.
You could also place the information in a URL fragment (/page#sort=price), as search engines
do not crawl this. Additionally, if a URL parameter must be used, the links to it could contain the
rel=nofollow attribute to prevent crawlers from trying to access it.
3. Blocking URLs with backlinks
Disallowing URLs within the robots.txt prevents link equity from passing through to the website.
This means that if search engines are unable to follow links from other websites as the target
URL is disallowed, your website will not gain the authority that those links are passing and as a
result, you may not rank as well overall.
4. Expecting disallowed pages to be deindexed
Using Disallow doesn’t get pages deindexed; even if the URL is blocked and search engines
have never crawled the page, disallowed pages may still get indexed. This is because the
crawling and indexing processes are largely separate.
5. Setting rules which ignore social network crawlers
Even if you don’t want search engines to crawl and index pages, you may want social networks
to be able to access those pages so that a page snippet can be built. For example, Facebook will
attempt to visit every page that gets posted on the network, so that they can serve a relevant
snippet. Keep this in mind when setting robots.txt rules.
6. Blocking an entire staging site
Using robots.txt to block an entire staging site isn’t best practice. Google recommends
noindexing the pages but allowing them to be crawled, but in general it is better to render the site
inaccessible from the outside world.
Each crawler has its own user-agent token, which is used to select the matching blocks.
Crawlers will follow the most specific set of user agent rules available to them (token names are
separated by hyphens), and will then fall back to more generic rules if an exact match isn’t
found. For example, Googlebot News will look for a match of ‘googlebot-news’, then
‘googlebot’, then ‘*’.
Here are some of the most common user agent tokens you’ll come across:
* – The rules apply to every bot, unless there is a more specific set of rules
Googlebot – All Google crawlers
Googlebot-News – Crawler for Google News
Googlebot-Image – Crawler for Google Images
Mediapartners-Google – Google Adsense crawler
Bingbot – Bing’s crawler
Yandex – Yandex’s crawler
Baiduspider – Baidu’s crawler
Facebot – Facebook’s crawler
Twitterbot – Twitter’s crawler
This list of user agent tokens is by no means exhaustive, so to learn more about some of the
crawlers out there, take a look at the documentation published
by Google, Bing, Yandex, Baidu, Facebook and Twitter.
The matching of a user agent token to a robots.txt block is not case sensitive. E.g. ‘googlebot’
will match Google’s user agent token ‘Googlebot’.
You might want to block a particular URL pattern from being crawled, as this is much more
efficient than including a full list of complete URLs to be excluded in your robots.txt file.
To help you refine your URL paths, you can use the * and $ symbols. Here’s how they work:
* – This is a wildcard and represents any amount of any character. It can be at the start or in the
middle of a URL path, but isn’t required at the end. You can use multiple wildcards within a URL
string, for example, “Disallow: */products?*sort=”. Rules with full paths should not start with a
wildcard.
$ – This character signifies the end of a URL string, so “Disallow: */dress$” will match only
URLs ending in “/dress”, and not “/dress?parameter”.
It’s worth noting that robots.txt rules are case sensitive, meaning that if you disallow URLs with
the parameter “search” (e.g. “Disallow: *?search=”), robots might still crawl URLs with
different capitalisation, such as “?Search=anything”.
The directive rules match against URL paths only, and can’t include a protocol or hostname. A
slash at the start of a directive matches against the start of the URL path. E.g. “Disallow: /starts”
would match to www.example.com/starts.
Unless you start a directive match with a / or *, it will not match anything. E.g. “Disallow:
starts” would never match anything.
To help visualise the way different URL rules work, we’ve put together some examples for
you:
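The wildcard and anchor behaviour described above can be sketched with a small matcher. This is a simplified model of the matching rules, not the search engines’ actual implementation:

```python
import re

def rule_matches(rule, path):
    # Translate a robots.txt path rule into a regex: '*' matches any
    # run of characters, '$' anchors the end of the path, and matching
    # is always anchored to the start of the URL path.
    pattern = re.escape(rule).replace(r"\*", ".*").replace(r"\$", "$")
    return re.match(pattern, path) is not None

print(rule_matches("/fish", "/fish/salmon"))         # True: prefix match
print(rule_matches("*/dress$", "/clothing/dress"))   # True: path ends in /dress
print(rule_matches("*/dress$", "/dress?parameter"))  # False: '$' anchors the end
print(rule_matches("starts", "/starts"))             # False: no leading / or *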
Robots.txt Sitemap Link
The sitemap directive in a robots.txt file tells search engines where to find the XML sitemap,
which helps them to discover all the URLs on the website. To learn more about sitemaps, take a
look at our guide on sitemap audits and advanced configuration.
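The directive is a single line pointing at the sitemap URL, for example:

```
Sitemap: https://www.example.com/sitemap.xml
```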
When including sitemaps in a robots.txt file, you should use absolute URLs
(e.g. https://www.example.com/sitemap.xml) instead of relative URLs (e.g. /sitemap.xml). It’s also
worth noting that sitemaps don’t have to sit on the same root domain; they can also be hosted on
an external domain.
Search engines will discover and may crawl the sitemaps listed in your robots.txt file, however,
these sitemaps will not appear in Google Search Console or Bing Webmaster Tools without
manual submission.
Robots.txt Blocks
The “disallow” rule in the robots.txt file can be used in a number of ways for different user
agents. In this section, we’ll cover some of the different ways you can format combinations of
blocks.
It’s important to remember that directives in the robots.txt file are only instructions. Malicious
crawlers will ignore your robots.txt file and crawl any part of your site that is public, so disallow
should not be used in place of robust security measures.
You can match a block of rules to multiple user agents by listing them before a set of rules. For
example, the following disallow rule will apply to both Googlebot and Bing:
[code]
User-agent: googlebot
User-agent: bing
Disallow: /a
[/code]
Spacing between blocks of directives
Google will ignore blank lines between directives and blocks. In this first example, the second
rule will still be picked up, even though there is a blank line separating the two parts of the rule:
[code]
User-agent: *
Disallow: /disallowed/

Disallow: /test1/robots_excluded_blank_line
[/code]
In this second example, Googlebot-mobile would inherit the same rules as Bingbot:
[code]
User-agent: googlebot-mobile
User-agent: bing
Disallow: /test1/deepcrawl_excluded
[/code]
Separate blocks combined
Multiple blocks with the same user agent are combined. So in the example below, the top and
bottom blocks would be combined and Googlebot would be disallowed from crawling “/b” and
“/a”.
[code]
User-agent: googlebot
Disallow: /b

User-agent: bing
Disallow: /a

User-agent: googlebot
Disallow: /a
[/code]
Robots.txt Allow
The robots.txt “allow” rule explicitly gives permission for certain URLs to be crawled. While
this is the default for all URLs, this rule can be used to overwrite a disallow rule. For example, if
“/locations” is disallowed, you could allow the crawling of “/locations/london” by having the
specific rule of “Allow: /locations/london”.
Robots.txt Prioritisation
When several allow and disallow rules apply to a URL, the longest matching rule is the one that
is applied. Let’s look at what would happen for the URL “/home/search/shirts” with the
following rules:
[code]
Disallow: /home
Allow: *search/*
Disallow: *shirts
[/code]
In this case, the URL is allowed to be crawled because the Allow rule has 9 characters, whereas
the longest matching Disallow rule has only 7. If you need a specific URL path to be allowed or
disallowed, you can utilise * to make the string longer. For example:
Disallow: *******************/shirts
When a URL matches both an allow rule and a disallow rule, but the rules are the same length,
the disallow will be followed. For example, the URL “/search/shirts” will be disallowed in the
following scenario:
[code]
Disallow: /search
Allow: *shirts
[/code]
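The longest-match prioritisation can be sketched as follows. This is a simplified model using the tie-breaking described above, not the engines’ actual implementation:

```python
import re

def matches(pattern, path):
    # '*' matches any run of characters; '$' anchors the end of the path.
    regex = re.escape(pattern).replace(r"\*", ".*").replace(r"\$", "$")
    return re.match(regex, path) is not None

def is_allowed(path, rules):
    # The longest matching rule wins; when an allow and a disallow
    # match with the same length, the disallow is applied.
    best_len, allowed = -1, True  # crawling is allowed by default
    for directive, pattern in rules:
        if matches(pattern, path):
            tie_break = len(pattern) == best_len and directive == "disallow"
            if len(pattern) > best_len or tie_break:
                best_len, allowed = len(pattern), directive == "allow"
    return allowed

rules = [("disallow", "/home"), ("allow", "*search/*"), ("disallow", "*shirts")]
print(is_allowed("/home/search/shirts", rules))  # True: the 9-character allow wins

tie = [("disallow", "/search"), ("allow", "*shirts")]
print(is_allowed("/search/shirts", tie))  # False: equal length, disallow wins
```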
Robots.txt Directives
Page level directives (which we’ll cover later on in this guide) are great tools, but the issue with
them is that search engines must crawl a page before being able to read these instructions, which
can consume crawl budget.
Robots.txt directives can help to reduce the strain on crawl budget because you can add
directives directly into your robots.txt file rather than waiting for search engines to crawl pages
before taking action on them. This solution is much quicker and easier to manage.
The following robots.txt directives work in the same way as the allow and disallow directives, in
that you can specify wildcards (*) and use the $ symbol to denote the end of a URL string.
Robots.txt NoIndex
Robots.txt noindex is a useful tool for managing search engine indexing without using up crawl
budget. Disallowing a page in robots.txt doesn’t mean it is removed from the index, so the
noindex directive is much more effective to use for this purpose.
Google doesn’t officially support robots.txt noindex, and you shouldn’t rely on it, because
although it works today, it may not do so tomorrow. The directive can be helpful though, and
should be used as a short-term fix in combination with other longer-term index controls, but not
as a mission-critical directive. Take a look at the tests run by ohgm and Stone Temple, which
both indicate that the feature works effectively.
1. Have a fallback block of rules for all bots – Using blocks of rules for specific user agent strings
without having a fallback block of rules for every other bot means that your website will
eventually encounter a bot which does not have any rulesets to follow.
2. It is important that robots.txt is kept up to date – A relatively common problem occurs when
the robots.txt is set during the initial development phase of a website, but is not updated as the
website grows, meaning that potentially useful pages are disallowed.
3. Be aware of redirecting search engines through disallowed URLs – For
example, /product > /disallowed > /category
4. Case sensitivity can cause a lot of problems – Webmasters may expect a section of a website
not to be crawled, but those pages may be crawled because of alternate casings, i.e. “Disallow:
/admin” exists, but search engines crawl “/ADMIN”.
5. Don’t disallow backlinked URLs – This prevents PageRank from flowing to your site from
others that are linking to you.
6. Crawl Delay can cause search issues – The “crawl-delay” directive forces crawlers to visit your
website slower than they would have liked, meaning that your important pages may be crawled
less often than is optimal. This directive is not obeyed by Google or Baidu, but is supported by
Bing and Yandex.
7. Make sure the robots.txt only returns a 5xx status code if the whole site is down – Returning
a 5xx status code for /robots.txt indicates to search engines that the website is down for
maintenance. This typically means that they will try to crawl the website again later.
8. Robots.txt disallow overrides the parameter removal tool – Be mindful that your robots.txt
rules may override parameter handling and any other indexation hints that you may have given to
search engines.
9. Sitelinks Search Box markup will work with internal search pages blocked – Internal search
pages on a site do not need to be crawlable for the Sitelinks Search Box markup to work.
10. Disallowing a migrated domain will impact the success of the migration – If you disallow a
migrated domain, search engines won’t be able to follow any of the redirects from the old site to
the new one, so the migration is unlikely to be a success.
Testing & Auditing Robots.txt
Considering just how harmful a robots.txt file can be if the directives within aren’t handled
correctly, there are a few different ways you can test it to make sure it has been set up properly.
Take a look at this guide on how to audit URLs blocked by robots.txt, as well as these
examples:
Use DeepCrawl – The Disallowed Pages and Disallowed URLs (Uncrawled) reports can show
you which pages are being blocked from search engines by your robots.txt file.
Use Google Search Console – With the GSC robots.txt tester tool you can see the latest cached
version of your robots.txt file, as well as using the Fetch and Render tool to see renders from the
Googlebot user agent and the browser user agent. Things to note: GSC only works for Google
user agents, and only single URLs can be tested.
Try combining the insights from both tools by spot-checking disallowed URLs that DeepCrawl
has flagged within the GSC robots.txt tester tool to clarify the specific rules which are resulting in
a disallow.