


Individual Assignment of Information Retrieval


Submitted to Dr. Kula. K(Ass. Professor.)
A web search engine or Internet search engine is a tool in the form of software to collect particular
data through a web search. In other words, a search engine helps the user search the World Wide Web
to receive his/her required data in a systematic way.
The role of internet search engines has become increasingly important because daily life is more
specialized and we rely on the Internet almost every day as an important source of information.
The World Wide Web (WWW) allows people to share information and data from large database
repositories globally. Finding that information requires specialized tools known generically as
search engines. Many search engines are available today, yet retrieving meaningful
information remains difficult.
Search Engines are now part of our daily life and one of the most important tools for billions of
internet users worldwide. From searching for the best pizza shop in the town to the development of
blockchain technology, people are now becoming more and more dependent on search engines to get
the answer to their everyday queries.
The first known search engine is Archie. It appeared in 1989 and was created by
Alan Emtage, a computer science student at McGill University. It is considered the
original search engine and the turning point in the creation of today’s giant alternatives Yahoo and
According to statistics from Netmarketshare, Statista, and StatCounter, the top 5 search engines
worldwide in terms of market share are Google, Bing, Yahoo, Baidu, and Yandex.
Google was not the first search engine to be created in history, but with a 78.23% worldwide market
share in 2019, it has a dominant lead over its rival search engines.
Google is the dominant force in search and one of the most popular search engines worldwide. It is
likely to stay ahead of other search engines for the foreseeable future because of its powerful algorithms,
easy-to-use interface, leading marketing and advertising platform, and personalized user experience.
According to StatCounter, Google possesses an 88.37 percent share of the United States’ search
market in 2019.
The rest of the US market share (11.63%) is divided mainly between Bing, with 6.07%, Yahoo, with 3.94%, and
DuckDuckGo, with 1.28%.
Here are the top five search engines in more detail:
Bing: With a 6.07% market share in the United States, Bing is the second largest search engine in
the world after Google. It grew out of Microsoft’s previous search engines: MSN Search,
Windows Live Search, and later Live Search. Microsoft launched Bing in 2009 to challenge Google,
and it is the default search engine on Windows PCs. However, its operators have not yet
convinced users that Bing can be a reliable alternative to Google. Bing has run
many advertising campaigns to compete with Google and has added new features to its own
advertising platform, “Bing Ads” (changing AdWords entry to related keywords), improved reporting to
maintain systems’ standards, and sought help from some of the managers who have information about

Google: According to StatCounter, Google accounts for 92.16% of the search engine market share
worldwide. Google alone receives 3.5 billion searches a day. It is so popular that we almost forget the
existence of other search engines. Google has outstanding features such as cutting-edge algorithms, an
easy-to-use interface, and a personalized user experience.
The order of search results returned by Google is based, in part, on a priority rank system called
“PageRank”. Google also provides many different options for customized searches, using symbols to
include, exclude, specify or require certain search behavior, and offers specialized interactive
experiences, such as flight status and package tracking, weather forecasts, currency, unit, and time
conversions, word definitions, and more.
Over the years, Google has offered numerous services covering most categories, from everyday services
such as Gmail and Google Drive to commercial services such as Google Business and Google Ads.

Yahoo!: Yahoo! Search is a web search engine owned by Yahoo, based in California. Yahoo is one of the most
popular email providers, and its web search engine is the third largest search engine in the world,
with a market share between 1.64% and 2.04%. Yahoo! Directory was created in April 1994 by David
Filo and Jerry Yang of Stanford University. Now, Yahoo! is considered an internet portal rather than a
search engine, and according to Alexa it ranks as the 11th most visited website on the Internet. From
October 2011 to October 2015, Yahoo! Search had a four-year deal with Bing under
which Yahoo’s internet search was powered exclusively by Bing. From October 2015 until October
2018, Yahoo! added Google to its service-providing partners, along with Bing. One year
later, Yahoo! Search was once again powered exclusively by Bing. In 2014, Yahoo! Search became the default
search engine for Firefox browsers in the US.
Baidu: Founded in 2000 by Robin Li and Eric Xu, Baidu is the most popular search engine in
China, with about 70% of the Chinese search market and a global market share between 0.92% and 9.37%.
Now ranked 4th in global internet engagement, Baidu is only available in the Chinese language.
Yandex: Founded in 1997 by Arkady Volozh, Arkady Borkovsky, and Ilya Segalovich, Yandex is Russia’s most
popular search engine and has a global market share between 0.47% and 0.83%. According to
Alexa, Yandex is among the 30 most popular websites on the Internet, with a ranking position of 4 in
Russia. It is a technology company that builds intelligent products and services powered by machine
learning. Yandex holds about a 65% market share in Russia, provides localized search results for
more than 1,400 cities, and is the single most visited page in the Russian language.

Oromo: The Oromo are an indigenous African people inhabiting the northeastern part of Africa. They
are the largest single ethnic group in Ethiopia, where the Oromia region accounts for a large share of
Ethiopia's land area and population. The Oromo language, also known as Afaan Oromoo, is a
Cushitic language spoken by about 50 million people in Ethiopia, Kenya,
Somalia, and Egypt, and is the 3rd largest language in Africa. Beyond the resident population in
Ethiopia, there are also Oromo speakers abroad: in the United States, Australia, Canada, and various
European cities, communities speak the language and teach it to their children, and foreigners interested
in communicating in Afaan Oromoo also take Oromo classes. In Oromia, it has the status of
an official language. It has its own writing system, which is based on the Latin script. The oral tradition is
very rich, and nowadays there is a substantial body of literary work written in Oromo, alongside modern
arts such as music and folk arts. Oromo people speak Afaan Oromoo as well as Amharic, Tigrinya, Gurage, and Omotic
languages. They are mainly Christian and Muslim, while only 3% still follow the traditional religion
based on the worship of the god Waaq. The Oromo are mainly farmers and cattle herders. They have
distinguished themselves throughout history for their strong military organization.
“Google is the largest U.S.-based search engine whose mission is to organize our universe’s
information and make it universally accessible and useful. And to make Afaan Oromoo and other
Ethiopian languages part of this global project, to me, is quintessential in creating a new information
superhighway and thereby information societies everywhere.”
Google provides a platform for translating Google products into low-resource languages such as Afaan
Oromoo. The language is low-resource because of poverty and restricted investment in technology in
Oromia and Ethiopia. The part of society that uses the Internet and other computer services is mainly
the educated and political elites who can afford to pay for the services.


Mueller says, for example, that some languages don’t separate words with spaces. That
makes it necessary to use a different algorithm than what Google uses for languages that do
use spaces.
He states:
“Mostly. The search uses lots & lots of algorithms. Some of them apply to content in all languages,
some of them are specific to individual languages (for example, some languages don’t use spaces to
separate words — which would make things kinda hard to search for if Google assumed that all
languages were like English).”
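
The point about spaces can be illustrated with a minimal sketch. It compares plain whitespace splitting with overlapping character bigrams, which are one common fallback for text that has no word delimiters. This is not a description of what Google actually does; the function names and the sample strings are illustrative assumptions.

```python
def whitespace_tokens(text):
    """Split on whitespace; adequate for languages that separate words with spaces."""
    return text.split()


def char_ngrams(text, n=2):
    """Overlapping character n-grams; a simple fallback when words are not delimited."""
    chars = [c for c in text if not c.isspace()]
    return ["".join(chars[i:i + n]) for i in range(len(chars) - n + 1)]


# Illustrative sample strings (assumptions, not taken from the source).
english = "search engines index words"
chinese = "搜索引擎索引词语"  # written without spaces between words

english_tokens = whitespace_tokens(english)   # four separate word tokens
chinese_tokens = whitespace_tokens(chinese)   # the whole string stays one token
chinese_bigrams = char_ngrams(chinese, n=2)   # searchable two-character pieces

print(len(english_tokens), len(chinese_tokens), len(chinese_bigrams))
```

Whitespace splitting leaves the second string as a single unsearchable token, while the bigram view breaks it into pieces that an index could match against.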

Challenges for low-resource languages:

1. Lack of annotated datasets: Annotated datasets are necessary to train Machine Learning (ML)
models in a supervised fashion. These models are commonly used to solve specific tasks very
accurately, like hate speech detection. However, creating annotated datasets requires human
intervention by labeling training examples one by one, making the process usually
time-consuming and very expensive given the thousands of examples advanced deep learning
models require. Thus, it becomes infeasible to rely on only manual data creation in the long run.
2. Lack of unlabelled datasets: Unlabelled datasets like text corpora are the precursors to their
annotated versions. They are essential for training base models that are later fine-tuned for
specific tasks. Hence, approaches to circumvent the lack of unlabelled datasets also become
very important.
3. Supporting multiple dialects of a language: Languages that have multiple dialects are also a
tricky problem, especially for speech models. A model trained on one variety of a language usually
won’t perform well on its other dialects. For example, most unlabelled and annotated
datasets available for Arabic are in Modern Standard Arabic. However, Modern Standard Arabic is
too formal for many Arabic speakers to give a human-like feeling when they interact with voice or
chat assistants in daily use. Thus, supporting dialects becomes necessary for practical use cases.
The list of challenges grows with every low-resource language, and even large corporations and
their NLP Software as a Service (SaaS) offerings, such as Google Dialogflow, Amazon Lex, or
Microsoft LUIS, understandably support only a small number of low-resource languages.
Issues Related to Cross-lingual Entity Linking
In this part, I present an analysis of the limitations of several leading candidate generation
methods. Although these methods adopt different techniques, they all rely heavily on
Wikipedia interlanguage links as their cross-lingual resource. However, the small size of the
source-language (SL) Wikipedia limits their performance in the low-resource language setting. As
shown in Figure 2, the core challenge of low-resource cross-lingual entity linking (XEL) is to link
low-resource language entities (A) to candidates in the English Wikipedia (C), but interlanguage
links only map the small subset of low-resource language entities that appear in both the
low-resource language Wikipedia (B) and the English Wikipedia.
Therefore, methods that leverage only interlanguage links (B ∩ C) as the main source of supervision
cannot cover a wide range of entities.
For example, Amharic Wikipedia has 14,854 entries, but only 8,176 of them have interlanguage links
to English.
In the low-resource language setting, fewer Wikipedia articles lead to fewer Wikipedia anchor-text
mappings, which reduces the ability of current methods to cover many SL mentions. For instance, the
Oromo Wikipedia article for “Laayibeeriyaa” has far fewer hyperlinks than
the English Wikipedia article for “Liberia”, even though the two are connected through an interlanguage
link. (xingyue 2, xdyu, 2020, 3-4)
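
The coverage gap described above can be worked through with a short sketch. It uses the Amharic Wikipedia figures quoted in the text and a toy interlanguage-link table. The table contents, the function names, and the uncovered mention "Finfinnee" are assumptions made for illustration and do not come from the cited paper.

```python
def link_coverage(total_entries, linked_entries):
    """Fraction of low-resource Wikipedia entries that have an English interlanguage link."""
    return linked_entries / total_entries


# Figures quoted in the text for Amharic Wikipedia.
amharic_coverage = link_coverage(total_entries=14_854, linked_entries=8_176)
print(f"Amharic entries with an English link: {amharic_coverage:.1%}")

# Toy interlanguage-link table (B ∩ C). Only the Liberia pair is mentioned in the text.
interlanguage_links = {"Laayibeeriyaa": "Liberia"}


def generate_candidates(mention, links):
    """Return English candidates for a mention using only interlanguage links."""
    return [links[mention]] if mention in links else []


covered = generate_candidates("Laayibeeriyaa", interlanguage_links)
# "Finfinnee" is a hypothetical mention that is absent from the toy table.
uncovered = generate_candidates("Finfinnee", interlanguage_links)
print(covered, uncovered)
```

Under these figures, only a little over half of the Amharic entries can be reached through interlanguage links at all, and any mention outside the table yields no candidates, which is the limitation the section describes.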


Spam: The increasing importance of search engines to commercial websites has given rise to a
phenomenon called “webspam”, that is, web pages that exist only to mislead search engines into
directing users to certain websites. Webspam is a nuisance to users as well as search engines: users
have a harder time finding the information they need, and search engines have to cope with an inflated
corpus, which in turn increases their cost per query. Therefore, search engines have a strong
incentive to weed out spam web pages from their index.
Content Quality: The web is full of noisy, low-quality, unreliable, and contradictory content. While
there has been a great deal of research on determining the relevance of documents, the issue of
document quality or accuracy has not received much attention in web search or information
retrieval. Because the web is so large, techniques for judging document quality are essential for generating
good search results. The most successful approach to determining quality on the web is based on
link analysis, for instance, PageRank.
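
As a rough illustration of link analysis, the following is a minimal power-iteration sketch of the PageRank idea on a toy link graph. It is a textbook-style simplification, not Google's production ranking; the damping factor of 0.85, the iteration count, and the four-page graph are assumptions chosen for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict {page: [pages it links to]}."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    pages = sorted(pages)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page receives a small baseline share (the "random jump").
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its damped rank evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical toy web: "c" is linked from three pages, "d" from none.
toy_web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(toy_web)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(page, round(score, 3))
```

In this toy graph the page with the most incoming links ends up with the highest score and the page nobody links to ends up with the lowest, which is the intuition behind using links as a quality signal.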
Duplicate Hosts: Web search engines try to avoid crawling and indexing duplicate and
near-duplicate pages, as they add no new information to the search results and clutter up the
results. The problem of finding duplicate or near-duplicate pages in a set of crawled pages is well
studied. Duplicate hosts are the single largest source of duplicate pages on the web, so solving the
duplicate hosts problem can significantly improve a web crawler. Standard checksumming
techniques make it easy to recognize documents that are exact duplicates of each other (as a
result of mirroring and plagiarism). Web search engines face considerable problems due to duplicate
and near-duplicate web pages. These pages enlarge the space required to store the index, slow down
serving or increase its cost, and so frustrate users. The identification of similar
or near-duplicate pairs in a large collection is a significant problem with widespread applications. In
general, predicting whether a page is a duplicate of an already crawled page is difficult, and
although a lot of work is being done in this field, the problem has not yet been completely overcome.
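
The two cases above can be sketched briefly: a checksum catches exact duplicates, while word-shingle overlap (Jaccard similarity) gives a rough signal for near-duplicates. The lowercase and whitespace normalization, the shingle size of three words, and the sample pages are assumptions for illustration, not a description of any particular search engine.

```python
import hashlib


def checksum(text):
    """SHA-256 of the text after lowercasing and collapsing whitespace (an assumed normalization)."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def shingles(text, k=3):
    """Set of overlapping k-word sequences from the text."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}


def jaccard(text_a, text_b, k=3):
    """Jaccard similarity of the two shingle sets: 1.0 means identical sets, 0.0 means disjoint."""
    a, b = shingles(text_a, k), shingles(text_b, k)
    return len(a & b) / len(a | b)


# Hypothetical sample pages.
page_a = "Search engines avoid indexing duplicate pages on the web"
page_b = "search engines  avoid indexing duplicate pages on the web"   # mirror copy
page_c = "Search engines avoid indexing near duplicate pages on the web"

is_exact_duplicate = checksum(page_a) == checksum(page_b)
is_exact_near = checksum(page_a) == checksum(page_c)
similarity = jaccard(page_a, page_c)
print(is_exact_duplicate, is_exact_near, similarity)
```

The mirror copy matches by checksum, whereas the page with one extra word does not and is only caught by the similarity score. Real crawlers would need a threshold and a scalable way to compare many pages, which this sketch leaves out.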
