Search Engines and SEO (IT302)

IT302 Web Technologies and Applications

Search Engines and Search Engine Optimization (SEO)

Dr.Sowmya Kamath S, Dept of IT, NITK Surathkal 7-Nov-16


Web Information Retrieval
Goal: To retrieve all Web documents which are relevant to a
query while retrieving as few non-relevant documents as
possible.
Measured by recall and precision.

Recall: The fraction of the relevant Web documents that are successfully
retrieved for a given query.
Precision: The fraction of the retrieved Web documents that are relevant to
the given search query.
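
The two measures can be sketched in a few lines of Python. This is only an illustration: the function name `precision_recall` and the document IDs are made up for the example and do not come from the slides.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for the results of one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # relevant documents that were actually retrieved
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query: 4 documents retrieved, 5 documents are truly relevant.
retrieved = ["d1", "d2", "d3", "d4"]
relevant = ["d1", "d3", "d5", "d6", "d7"]
p, r = precision_recall(retrieved, relevant)
print(f"precision={p:.2f}, recall={r:.2f}")  # precision=0.50, recall=0.40
```

Two of the four retrieved documents are relevant (precision 0.5), and two of the five relevant documents were found (recall 0.4).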



Challenges in Web Information Retrieval
Large volume
Unstructured and redundant data
Distributed and Heterogeneous data
High percentage of volatile data
Quality of data

How to specify and use the user query:
Query processing, normalization, optimization, ...

How to handle the matching documents returned by the system:
Scoring, ranking, pagination, ...



What is a Search Engine?
Definition: An internet-based tool that searches an index of
documents for a particular term or phrase specified by the user.
Search engines are large web-based search applications that explore the
billions of resources on the internet.
E.g. Google, Yahoo, Bing, Ask Jeeves, AltaVista

Common Characteristics:
Find matching documents and display them according to relevance.
Frequent updates to proprietary ranking algorithm.
Strive to produce better, more relevant results than competitors.
Terms used - Spider, Crawler, Indexer, Ranking Algorithm.



Types of Search Engines
Search by Keywords or phrases
E.g. Google, Bing, Yahoo, Yandex
Search by specialization (vertical search engines)
E.g. Google Scholar/Images, Yahoo Finance, Google Maps, Bing Videos
Specialize in other languages
E.g. Baidu, Chinese Yahoo!, Google search in Indian Languages
Users answering questions and creating content.
E.g. Ask Jeeves!, Answers.com
Semantic Search Engines
E.g. Swoogle, Hakia, SenseBot.
Question-Answering Search Engines
E.g. IBM Watson, Wolfram Alpha, Amazon Evi
Working of a Search Engine
[Figure: working of a search engine]
Working of a Search Engine Crawler
[Figure: working of a search engine crawler]
Some Web Search Algorithms
HITS (Hyperlink-Induced Topic Search)
Simple algorithm.
Reportedly used by the pre-Google-era search engine AltaVista.

PageRank
Used by the Google Internet search engine.
The basic PageRank algorithm is described in Stanford University's patent
documents.
Bing Search
Reportedly uses highly complex probabilistic ensemble ranking functions over a
neural network architecture (unconfirmed; not much information is available
on Bing's algorithms).

Yahoo Search
As with Bing, not much information is available.



Google's Search Algorithm
PageRank
A method for rating the importance of web pages objectively and
mechanically using the link structure of the web.
Developed by Larry Page (hence the name PageRank) and Sergey Brin.
Used by the Google Internet search engine.
The patent was issued to Stanford University.

Assigns a numerical weighting to each element of a hyperlinked
set of documents (such as the World Wide Web), with the purpose
of measuring its relative importance within the set.



PageRank Algorithm
Exploits the link structure of the Web
More than 1 billion web pages, 1.7 trillion links

Backlinks and forward links:

A and B are C's backlinks
C is a forward link of A and B

Intuitively, a webpage is important if it has a lot of backlinks, which keep
changing over time.
PageRank Algorithm (contd.)
Essentially, Google interprets a link from page A to page B as a
vote for page B, by page A.
BUT these votes don't weigh the same, because PageRank also
analyzes the page that casts the vote.

PageRank does not rank the whole website, but is determined for each page
individually and iteratively.
The PageRank of page A is recursively defined by the PageRank of
those pages which link to page A.



PageRank Algorithm (contd.)
The PageRank (PR) of a page u is given as:

PR(u) = (1 - d) (1/N) + d Σ_v ( PR(v) / C(v) )

The sum runs over the set of pages v which point to page u (i.e., which vote for it).
PR(u) is the PageRank of page u,
PR(v) is the PageRank of a page v that links to page u,
C(v) is the number of links going out of page v,
N is the number of pages in the network.
The parameter d is a damping factor which can be set between 0 and
1 (d is set to 0.85, if not otherwise specified).
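
As a rough illustration of the formula, the sketch below applies it repeatedly to a small link graph until the values stop changing. The function name `pagerank`, the tolerance and the three-page graph are assumptions made for this example; the sketch also assumes that every page has at least one outgoing link. It is not the implementation used by any search engine.

```python
def pagerank(links, d=0.85, tol=1e-10, max_iter=1000):
    """links maps each page to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    pr = {p: 1.0 / n for p in pages}  # any starting values work
    for _ in range(max_iter):
        new = {}
        for u in pages:
            # Sum PR(v) / C(v) over every page v that links to u.
            incoming = sum(pr[v] / len(links[v]) for v in pages if u in links[v])
            new[u] = (1 - d) / n + d * incoming
        if max(abs(new[p] - pr[p]) for p in pages) < tol:
            return new
        pr = new
    return pr


# Hypothetical graph: A links to B and C, B links to C, C links to A.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(graph)
for page, score in sorted(ranks.items()):
    print(page, round(score, 4))
```

In this graph C receives votes from both A and B, so it ends up with the highest score, while B, which only receives half of A's vote, ends up with the lowest.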



Calculating a Webpage's PageRank (PR)
Examples

Consider an example in which A links to B, B links to C, and C links to A.

The number of web pages N = 3;
Let the damping parameter d = 0.7

PageRank PR(u) = (1 - d) (1/N) + d Σ_v ( PR(v) / C(v) )

where u can be A, B or C.



Calculating a Webpage's PageRank (PR)
So,
PR(A) = (1 - d) ( 1/N ) + d ( PR(C) / 1 )
PR(B) = (1 - d) ( 1/N ) + d ( PR(A) / 1 )
PR(C) = (1 - d) ( 1/N ) + d ( PR(B) / 1 )

Substituting value of d and N ,


PR(A) = 0.1 + 0.7 PR(C) ---- (i)
PR(B) = 0.1 + 0.7 PR(A) ---- (ii)
PR(C) = 0.1 + 0.7 PR(B) ---- (iii)

By solving the resulting system of linear equations, we get

PR(A) = 1/3 ≈ 0.33    PR(B) = 1/3 ≈ 0.33    PR(C) = 1/3 ≈ 0.33
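
One way to reach this result by hand is to substitute (iii) into (ii) and then into (i). The steps below are a worked sketch of that substitution and use only equations (i) to (iii):

```latex
\begin{aligned}
PR(A) &= 0.1 + 0.7\,PR(C) \\
      &= 0.1 + 0.7\,\bigl(0.1 + 0.7\,PR(B)\bigr) \\
      &= 0.1 + 0.7\,\bigl(0.1 + 0.7\,(0.1 + 0.7\,PR(A))\bigr) \\
      &= 0.219 + 0.343\,PR(A) \\
\Rightarrow\quad PR(A) &= \frac{0.219}{0.657} = \frac{1}{3}
\end{aligned}
```

Substituting PR(A) = 1/3 into (ii) gives PR(B) = 0.1 + 0.7/3 = 1/3, and (iii) then gives PR(C) = 1/3. The three pages form a symmetric cycle, so they share the rank equally.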



Calculating a Webpage's PageRank (PR)
Consider another example, in which A links to B, B links to C, and C links to both A and B.
The number of web pages N = 3;
Let the damping parameter be d = 0.7
PR(u) = (1 - d) (1/N) + d Σ_v ( PR(v) / C(v) )
So,
PR(A) = (1 - d) ( 1 / N ) + d ( PR(C) / 2 )
PR(B) = (1 - d) ( 1 / N ) + d ( PR(A) / 1 + PR(C) / 2 )
PR(C) = (1 - d) ( 1 / N ) + d ( PR(B) / 1 )
Substituting value of d and N ,
PR(A) = 0.1 + 0.35 PR(C) ------(i)
PR(B) = 0.1 + 0.70 PR(A) + 0.35 PR(C) -----(ii)
PR(C) = 0.1 + 0.70 PR(B) -----(iii)

By solving the resulting system of linear equations, we get

PR(A) = 0.2314    PR(B) = 0.3933    PR(C) = 0.3753
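
These values can be checked mechanically. The sketch below rewrites equations (i) to (iii) in matrix form and solves them exactly with Gauss-Jordan elimination over fractions. The helper name `solve` and the matrix layout are choices made for this illustration only.

```python
from fractions import Fraction


def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination, exactly."""
    n = len(b)
    # Augmented matrix [a | b] with exact rational entries.
    m = [[Fraction(v) for v in row] + [Fraction(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        # Use the first row with a non-zero entry in this column as the pivot.
        pivot = next(r for r in range(col, n) if m[r][col] != 0)
        m[col], m[pivot] = m[pivot], m[col]
        m[col] = [v / m[col][col] for v in m[col]]
        for r in range(n):
            if r != col:
                factor = m[r][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [row[n] for row in m]


# Equations (i)-(iii) with all unknowns moved to the left-hand side:
#   PR(A)               - 0.35 PR(C) = 0.1
#  -0.70 PR(A) + PR(B)  - 0.35 PR(C) = 0.1
#               -0.70 PR(B) + PR(C)  = 0.1
coefficients = [["1", "0", "-0.35"],
                ["-0.7", "1", "-0.35"],
                ["0", "-0.7", "1"]]
constants = ["0.1", "0.1", "0.1"]

pr_a, pr_b, pr_c = solve(coefficients, constants)
print(round(float(pr_a), 4), round(float(pr_b), 4), round(float(pr_c), 4))
# prints: 0.2314 0.3933 0.3753
```

The exact solution is 90/389, 153/389 and 146/389, which sums to 1, as expected for this form of the PageRank formula.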



Calculating a Webpage's PageRank (PR)
Consider another example, in which A links to B, B links to both A and C, and C links to B.
The number of web pages N = 3;
Let the damping parameter be d = 0.7
PR(u) = (1 - d) (1/N) + d Σ_v ( PR(v) / C(v) )
So,
PR(A) = (1 - d) ( 1 / N ) + d ( PR(B) / 2 )
PR(B) = (1 - d) ( 1 / N ) + d ( PR(A) / 1 + PR(C) / 1 )
PR(C) = (1 - d) ( 1 / N ) + d ( PR(B) / 2 )
Substituting value of d and N
PR(A) = 0.1 + 0.35 PR(B) ----(i)
PR(B) = 0.1 + 0.70 PR(A) + 0.70 PR(C) -----(ii)
PR(C) = 0.1 + 0.35 PR(B) ------(iii)

By solving the resulting system of linear equations, we get

PR(A) = 0.2647    PR(B) = 0.4706    PR(C) = 0.2647
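
Here the symmetry of the equations shortens the work: (i) and (iii) have the same right-hand side, so PR(A) = PR(C). A worked sketch of the remaining steps, using only equations (i) to (iii):

```latex
\begin{aligned}
PR(B) &= 0.1 + 0.70\,PR(A) + 0.70\,PR(C) = 0.1 + 1.4\,PR(A) \\
      &= 0.1 + 1.4\,\bigl(0.1 + 0.35\,PR(B)\bigr) = 0.24 + 0.49\,PR(B) \\
\Rightarrow\quad PR(B) &= \frac{0.24}{0.51} = \frac{8}{17} \approx 0.4706 \\
PR(A) = PR(C) &= 0.1 + 0.35 \cdot \frac{8}{17} = \frac{9}{34} \approx 0.2647
\end{aligned}
```

B is linked to by both other pages, which is why it collects the largest share of the rank.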



Calculating a Webpage's PageRank (PR)
Iterative Computation

Consider two pages, A and B, that link to each other (N = 2, C(A) = C(B) = 1), with d = 0.85.
Let's start with PR(A) = PR(B) = 10

After 1st iteration:

PR(A) = (1-d)*1/N + d*(PR(B)/C(B))
      = 0.15*0.5 + 0.85 * (10/1)
      = 8.575 ≈ 8.58
PR(B) = (1-d)*1/N + d*(PR(A)/C(A))
      = 0.15*0.5 + 0.85 * (8.575/1)
      ≈ 7.36



Calculating a Webpage's PageRank (PR)
Iterative Computation

After 2nd iteration:

PR(A) = (1-d)*1/N + d*(PR(B)/C(B))
      = 0.15*0.5 + 0.85 * (7.36/1)
      = 6.331
PR(B) = (1-d)*1/N + d*(PR(A)/C(A))
      = 0.15*0.5 + 0.85 * (6.331/1)
      = 5.456
And so on... until when?



Calculating a Webpage's PageRank (PR)
Iterative Computation

Answer: Iterations should be repeated until the PR values converge.

Thus, we can start with any values of PR and repeat the iterations
until the PR values converge, i.e. they don't
change too much between iterations.
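
The stopping rule can be sketched for the two-page example used in the iterations above (two pages linking to each other, d = 0.85, N = 2). The function name `iterate_two_pages` and the tolerance of 1e-6 are assumptions made for this illustration. As in the hand calculation, the freshly updated PR(A) is used when computing PR(B).

```python
def iterate_two_pages(pr_a=10.0, pr_b=10.0, d=0.85, tol=1e-6, max_iter=1000):
    """Iterate PageRank for two pages A and B that link only to each other."""
    n = 2
    for i in range(1, max_iter + 1):
        new_a = (1 - d) / n + d * pr_b   # C(B) = 1
        new_b = (1 - d) / n + d * new_a  # C(A) = 1, uses the updated PR(A)
        change = max(abs(new_a - pr_a), abs(new_b - pr_b))
        pr_a, pr_b = new_a, new_b
        if change < tol:  # values no longer change much: converged
            return pr_a, pr_b, i
    return pr_a, pr_b, max_iter


a, b, iterations = iterate_two_pages()
print(f"PR(A)={a:.4f}, PR(B)={b:.4f} after {iterations} iterations")
```

Both values settle near 0.5 regardless of the starting point, because x = 0.075 + 0.85x has the single solution x = 0.5.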



Convergence Property
PR (322 Million Links): 52 iterations
PR (161 Million Links): 45 iterations
Scaling factor is roughly linear in log n



Conclusion
PageRank is a global ranking of all web pages based on their
locations in the web graph structure.

PageRank uses information which is external to the web pages: backlinks.

Backlinks from important pages are more significant than
backlinks from average pages.

The structure of the web graph is very useful for information
retrieval tasks.

Search Engine Optimization



Why is Search Engine Marketing important?
85% of all traffic on the Internet is referred to their destination by search
engines.

98% of all users don't look past the first 20 results (most only view the top 4-5).

Cost-effective advertising.

Clear and measurable ROI (Return on Investment).

Operates under the assumption:
More (relevant) traffic + good conversion rate = more sales/leads/exposure/value



SEO = Search Engine Optimization
Used as a means to increase relevant traffic to a website.

Refers to the process of optimizing certain ranking factors
used by search engines in order to achieve high rankings for
targeted search terms.

Two types of ranking factors:
Positive factors, which generally improve a website's rankings.
Negative factors, which may hurt a website's rankings.



Search Engine Ranking Factors
Broadly classified into
On page factors
Visible
Invisible
Time based factors
External factors.



1. On Page Factors
Visible On Page Factors
1. Page Copy
A page that contains the keywords that a user is looking for is likely to be
relevant to his or her search query.
Page copy may also contain related words, which may contribute to rankings.

Important:
Keyword insertion should not be excessive. Excessive repetition, called
keyword stuffing, can be perceived by search engines as spam.
E.g. using the end of one sentence and the beginning of another sentence to
repeat a keyword subtly:
"Mangalore Hotel: Visit us in coastal Mangalore. Hotel rooms are affordable and
well maintained. Mangalore is a ..."
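
A simple way to keep an eye on repetition is to measure how large a share of the words on a page a given keyword takes up. The sketch below is one possible definition of keyword density; the function name and the tokenisation rule are assumptions made for this illustration, and the slides do not state any threshold at which search engines treat repetition as spam.

```python
import re


def keyword_density(text, keyword):
    """Fraction of the words in text that are exactly the given keyword."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    if not words:
        return 0.0
    return words.count(keyword.lower()) / len(words)


page_copy = ("Mangalore Hotel: Visit us in coastal Mangalore. "
             "Hotel rooms are affordable and well maintained.")
for term in ("Mangalore", "Hotel", "rooms"):
    print(term, f"{keyword_density(page_copy, term):.1%}")
```

In this short sample, "Mangalore" and "Hotel" each account for 2 of the 14 words, which is the pattern the stuffing example relies on.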



1. On Page Factors
Visible On Page Factors (contd.)
2. Page Title
Defined by the contents of the <title> tag within <head> section.
Visible both in the title bar of a browser window as well as the headline of
a search engine result.
Most important factor for increasing Click Through Rate (CTR)

Tips:
Never set the title for all pages in your website to the same generic text.
At best, the pages will be indexed poorly.
At worst, the site will receive a penalty if search engines consider these
pages to be duplicate content.
Solution: Insert targeted keywords when writing page titles for better rankings.
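
Repeated titles are easy to detect automatically. The sketch below uses Python's standard `html.parser` module to pull the <title> text out of each page and group the pages by title. The class and function names, the URLs and the page contents are hypothetical and serve only as an illustration.

```python
from collections import defaultdict
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text found inside the <title> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


def duplicate_titles(pages):
    """pages maps URL -> HTML; returns every title used by more than one URL."""
    by_title = defaultdict(list)
    for url, html in pages.items():
        parser = TitleParser()
        parser.feed(html)
        by_title[parser.title.strip()].append(url)
    return {title: urls for title, urls in by_title.items() if len(urls) > 1}


# Hypothetical site with a generic title reused on two pages.
site = {
    "/": "<html><head><title>Welcome</title></head><body></body></html>",
    "/rooms": "<html><head><title>Welcome</title></head><body></body></html>",
    "/contact": "<html><head><title>Contact us</title></head><body></body></html>",
}
print(duplicate_titles(site))  # {'Welcome': ['/', '/rooms']}
```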



1. On Page Factors
Visible On Page Factors (contd.)
3. Page Headings
Content of the heading tags (<h1> to <h6>).
Important since these indicate overall context and meaning.

4. Outbound Links
Links that are contained in a web page pointing to other pages or
sites are also evaluated by a search engine.
A related link on a webpage is considered valuable content.
However, totally irrelevant links may be considered spam content
and hence may hurt the page's rankings.



1. On Page Factors
Visible On Page Factors (contd.)
5. Internal Link Structure and Anchor Text
SE algorithms assume that pages which are not linked to, or which are buried
deep within a website's internal link structure, are less important.

E.g. Homepage -> Page1 -> PageX -> PageY -> PageZ

* The 4th page is harder to reach.

Tips:
Add links to individual pages.
Add a sitemap that links to all pages in the site.
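
How deeply a page is buried can be measured as its click depth, i.e. the minimum number of links that must be followed from the homepage to reach it. The sketch below computes this with a breadth-first search; the function name and the `Sitemap` page are assumptions added for this illustration.

```python
from collections import deque


def click_depths(links, home):
    """Minimum number of clicks needed to reach each page from the homepage."""
    depths = {home: 0}
    queue = deque([home])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in depths:  # first visit is the shortest path in BFS
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths


# The chain from the example: every page links only to the next one.
chain = {"Homepage": ["Page1"], "Page1": ["PageX"], "PageX": ["PageY"],
         "PageY": ["PageZ"], "PageZ": []}
print(click_depths(chain, "Homepage")["PageZ"])  # 4

# Same site with a hypothetical sitemap linked from the homepage.
with_sitemap = dict(chain)
with_sitemap["Homepage"] = ["Page1", "Sitemap"]
with_sitemap["Sitemap"] = ["Page1", "PageX", "PageY", "PageZ"]
print(click_depths(with_sitemap, "Homepage")["PageZ"])  # 2
```

Adding the sitemap brings every page within two clicks of the homepage, which is the effect the second tip aims for.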



1. On Page Factors
Visible On Page Factors (contd.)
6. Keywords in URL and domain name
Use relevant keywords, e.g. "flowers" if you want to have an online flower
shop.

7. Overall site topicality
The fact that a webpage is semantically related to other pages within a
website may boost the rankings of that particular page.



1. On Page Factors
Invisible On Page Factors
Invisible to human readers but can be read by a search engine parsing a
website.

1. Meta Description
Important since SEs may use it to build search engine result pages (SERPs)
e.g.
<head>
<meta name="description" content="NITK, a premier
institute for technical education in India" />
</head>



1. On Page Factors
Invisible On Page Factors (contd.)
2. Meta Keywords
A few major keywords, as well as their common misspellings, can be placed here.
e.g.
<head>
<meta name="keywords" content="NITK, NIT Karnataka,
NIT, Surathkal, Suratkal, Surtakal" />
</head>

1. On Page Factors
Invisible On Page Factors (contd.)

3. alt and title attributes

Useful in screen readers and text-based browsers.

e.g. <img src="/images/nitk.jpg" alt="Photo of NITK Surathkal" />

<a href="/nitk/home.html" title="Homepage of NITK Surathkal"> </a>

TIP: always add the alt attribute to your images but include the
title attribute only if the image is a link.
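
The first half of the tip can be checked automatically. The sketch below uses Python's standard `html.parser` module to list the images on a page that have no alt text. The class name and the sample markup are made up for this illustration, and the check treats an empty alt attribute the same as a missing one, which is a simplifying assumption.

```python
from html.parser import HTMLParser


class AltChecker(HTMLParser):
    """Records the src of every <img> that has no (or an empty) alt attribute."""

    def __init__(self):
        super().__init__()
        self.missing_alt = []

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <img ... />.
        if tag == "img":
            attributes = dict(attrs)
            if not attributes.get("alt"):
                self.missing_alt.append(attributes.get("src"))


# Hypothetical page fragment: the second image lacks an alt attribute.
html = ('<img src="/images/nitk.jpg" alt="Photo of NITK Surathkal" />'
        '<img src="/images/logo.png">')
checker = AltChecker()
checker.feed(html)
print(checker.missing_alt)  # ['/images/logo.png']
```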



2. Time Based Factors
a. Site and page age
A website that has existed for a longer time is likely to be ranked better
than a new site.
A website that gradually adds new and valuable content acquires trust.

b. Link age
Links that are present on other sites pointing to the website in question
acquire more value over time,
because both the link's age and the other site's age/popularity/usage
contribute to increasing the ranking of the website in question.



2. Time Based Factors (contd.)
c. Domain Registration Length
SEs may view a long-standing domain name registration as an indication
that the site is not spam.

Domain names are available cheaply, and spammers frequently use them
in a disposable fashion.
These domains eventually get banned and must be abandoned.
Also, spammers generally do not register a domain for longer than a
year.

* People who buy a domain and pay for it on a yearly basis may
be treated as spammers, at least initially.
3. External Factors
1. Quantity, Quality and Relevance of Inbound Links

Quantity: A site with many inbound links is likely to be relevant.
Many people may find it useful, so they have placed a link to your site on
their own sites.

Quality: If a popular website links to your site, your
ranking may increase, as the popular site itself has many inbound links
and a good reputation.

Relevance: A link from a semantically related webpage or site is
viewed as more valuable than a link from a random unrelated site.



3. External Factors (contd.)
2. Link Churn
Links sometimes appear and disappear on pages of a site. The rate at
which this happens is called link churn.
If link churn is high, SEs may regard the site as spam.

3. Link Acquisition Rate
A new site gaining thousands of new links in a relatively short time
may be viewed as suspicious, unless links from highly ranked sites are among them.

4. Link Location
Prominently displayed links to your site on popular sites may be
regarded more highly by search engines than those buried in the site.
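
Link churn can be made concrete by comparing two snapshots of the links found on a page. The slides give no formula, so the definition below (links added plus links removed, divided by the number of links in the earlier snapshot) is only an assumption for illustration, as are the function name and the sample URLs.

```python
def link_churn(old_links, new_links):
    """Share of links that appeared or disappeared between two snapshots.

    Assumed definition: (added + removed) / number of links in the old snapshot.
    """
    old_links, new_links = set(old_links), set(new_links)
    if not old_links:
        return 0.0
    added = new_links - old_links
    removed = old_links - new_links
    return (len(added) + len(removed)) / len(old_links)


# Hypothetical snapshots of the outbound links on one page.
last_month = ["a.example", "b.example", "c.example", "d.example"]
this_month = ["a.example", "b.example", "e.example"]
print(link_churn(last_month, this_month))  # 0.75
```

One link was added and two were removed out of four original links, giving a churn of 0.75 under this definition.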
Ranking the Ranking factors
On-Page Factors (Code & Content)
Content, Content, Content (Body text) <body> #1
Keyword frequency & density #2
Title tags <title> #3
ALT image tags #4
Header tags <h1> #5
Hyperlink text #6

Off-Page Factors
Anchor text #1
Link Popularity (votes for your site) adds credibility #2
...
Pay Per Click

PPC ads appear as sponsored listings


Companies bid on the price they are willing to pay per click

Typically have very good tracking tools and statistics


Ability to control ad text
Can set budgets and spending limits

Fastest way to build up brand recognition.



Tips for developing a good SEO strategy:
Research desirable keywords and search phrases (WordTracker,
Overture, Google AdWords)
Identify search phrases to target (should be relevant to business/market,
obtainable and profitable)
Clean and optimize the website's HTML code for appropriate keyword
density, title tag optimization, internal linking structure, headings and
subheadings, etc.
Add quality content. Write the copy so as to appeal to both search
engines and actual website visitors.
Study competitors (competing websites) and search engines.
Implement a quality link building campaign.
Constantly monitor rankings for targeted search terms.
More Reading
Croft, W. Bruce, Donald Metzler, and Trevor Strohman. Search
Engines. Pearson Education, 2010.

Brin, Sergey, and Lawrence Page. "Reprint of: The anatomy of a
large-scale hypertextual web search engine." Computer
Networks 56.18 (2012): 3825-3833.

Kobayashi, Mei, and Koichi Takeda. "Information retrieval on the
web." ACM Computing Surveys (CSUR) 32.2 (2000): 144-173.
