Case Study-Google


A Case Study Analysis on Google Distributed System

In partial fulfillment of the requirements for the degree

Bachelor of Science in Computer Engineering

Submitted by:

Escarsa, Julianne Loren v.

Larida, Bienvenido M.

Tadena, Jomer R.

BSCpE 3-A

Submitted on:

November 26, 2021


TABLE OF CONTENTS

INTRODUCTION

 Description
 Developers
 Utilization and Programming Language Used

BODY

 Google Search Engine Operations and Features
 Additional Features and Applications Description
 Google System Architecture as their Advantage
 Drawbacks of Google

RECOMMENDATIONS

CONCLUSIONS

REFERENCES

INTRODUCTION

Description

A distributed system is a set of autonomous computers that interact over a network to achieve a
common goal. The aim is to divide resources and tasks among different machines so that each
machine carries a lighter load, overall performance stays high, and the freed resources can
serve other users at the same time. Google, one of the best-known search engines, applies this
concept to answer billions of queries a month while maintaining its speed. Projects such as
Google Earth, Google Maps, Google Finance, and many others also need large independent storage.
The Google search environment is complex: it supports tens of thousands of queries per second,
reads hundreds of megabytes of data, and uses billions of CPU cycles. The environment is fault
tolerant and processes requests efficiently.

Google Search (known simply as Google) is a search engine provided by Google. Handling over
3.5 billion searches per day, it has a 92% share of the global search engine market. It is also
the most-visited website in the world. The order of search results returned by Google is based,
in part, on a priority rank system called "PageRank". Google Search also provides many options
for customized searches, using symbols to include, exclude, specify, or require certain search
behavior, and offers specialized interactive experiences, such as flight status and package
tracking, weather forecasts, currency, unit, and time conversions, word definitions, and more.

The main purpose of Google Search is to search for text in publicly accessible documents
offered by web servers, as opposed to other data, such as images or data contained in
databases. It was originally developed in 1997 by Larry Page, Sergey Brin, and Scott Hassan.
Google's mission statement is “to organize the world’s information and make it universally
accessible and useful”. Born out of an internet search research project at Stanford, the company
has since diversified into cloud computing, a set of internet-based application, storage, and
computing services sufficient to support most users’ needs, allowing them to dispense with
local data storage and application software.

Developers

Brin and Page, who met as graduate students at Stanford University, were intrigued by
the idea of extracting meaning from the mass of data accumulating on the Internet. They began
working from Page’s dormitory room at Stanford to devise a new type of search technology,
which they dubbed BackRub. BackRub was an early search engine from the 1990s that is
now regarded as the predecessor of the Google search engine. It was personally developed and
operated by Google founders Larry Page and Sergey Brin. It began as a research project of
Larry Page, who was part of the team that worked on the Stanford Digital Library Project (SDLP).
This project had set itself the goal of developing technologies that could create a universal
digital library. As part of his thesis, Larry Page took up the issue of linking. His view of
the link structure of the Web as a huge graph later led to the development of the still
significant PageRank algorithm. This work led them to build BackRub on the principle that the
websites with the most incoming links would be the most relevant.

Page and Brin determined that links to a page were a sign of its online authority, and
Google's algorithm therefore returned more useful results, helping propel Google to become the
most-used search engine. Google's algorithm was patented and named PageRank. The current
search technology is based on some of these principles, but has evolved to the point where
there are many more variables at play.

Utilization and Programming Language Used

Google uses the information shared by sites and apps to deliver services, maintain and
improve them, develop new services, measure the effectiveness of advertising, protect against
fraud and abuse, and personalize the content and ads users see on Google and on its partners’
sites and apps. A Google hardware load balancer forwards each incoming request to a Google Web
Server (GWS), which is responsible for executing the query and formulating the response in
HyperText Markup Language (HTML). Google search involves three parts. First, Googlebot, a web
crawler, is a program that finds and fetches web pages. Second, the indexer records every word
found on those pages in an index, which is divided into pieces called index shards. Looking a
query up in the index yields an ordered list of document identifiers (docids).

For each docid, a record is kept comprising the Uniform Resource Locator (URL), the page
title, and relevant text; these document records are maintained by the document servers
(docservers). Finally, the query processor compares the query with the indices and recommends
the most relevant results. Google’s stated mission is to organize the world’s information and
make it universally accessible and useful. In short, the underlying search engine consists of
crawling, indexing, and ranking web pages.

The languages Google uses in house are JavaScript and TypeScript for the front end, and a
more varied set for the back end: C, C++, Java, Python, and Node.js. Almost all of the most
popular websites use JavaScript on the front end, with very few exceptions. JavaScript is a
dynamic, object-oriented language used mainly on the front end; TypeScript, created by
Microsoft, is a typed superset of JavaScript. C and C++ are both general-purpose languages
suited to performance-critical systems code, and both appear to be heavily used at Google.

Currently, Python is recognized as an official language at Google and is one of its key
languages today, alongside C++ and Java. Some of the key Python contributors are Googlers, and
they continue to use, promote, and support the language actively. Python runs on many Google
internal systems and shows up in many Google APIs. It suits the engineering process at Google
well.

From a distributed systems perspective, Google provides a fascinating case study with
extremely demanding requirements, particularly in terms of scalability, reliability, performance
and openness. For example, in terms of search, it is noteworthy that the underlying system has
successfully scaled with the growth of the company from its initial production system in 1998 to
dealing with over 88 billion queries a month by the end of 2010, that the main search engine has
never experienced an outage in all that time and that users can expect query results in around
0.2 seconds. The case study we present here will examine the strategies and design decisions
behind that success, and provide insight into design of complex distributed systems.

GOOGLE AS DISTRIBUTED SYSTEM (BODY)

The role of the Google search engine is, as for any web search engine, to take a given
query and return an ordered list of the most relevant results that match that query by searching
the content of the Web. The challenges stem from the size of the Web and its rate of change,
as well as the requirement to provide the most relevant results from the perspective of its users.
We provide a brief overview of the operation of Google search below.

Crawling: The task of the crawler is to locate and retrieve the contents of the Web and
pass the contents onto the indexing subsystem. This is performed by a software service called
Googlebot, which recursively reads a given web page, harvesting all the links from that web
page and then scheduling further crawling operations for the harvested links (a technique known
as deep searching that is highly effective in reaching practically all pages in the Web). In the
past, because of the size of the Web, crawling was generally performed once every few weeks.
However, for certain web pages this was insufficient.
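
The crawling loop described above can be illustrated with a minimal sketch. It is not Googlebot's implementation: the in-memory `TOY_WEB` dictionary, the `crawl` function, and the `toy_fetch_links` helper are all hypothetical names introduced here, and a real crawler would fetch pages over HTTP and parse their HTML rather than read a dictionary.

```python
from collections import deque

# Hypothetical in-memory "web": each URL maps to the links found on that page.
# This stands in for fetching and parsing real pages.
TOY_WEB = {
    "a.example/": ["a.example/about", "b.example/"],
    "a.example/about": ["a.example/"],
    "b.example/": ["c.example/", "a.example/about"],
    "c.example/": [],
}


def toy_fetch_links(url):
    """Return the links harvested from a page (here, a dictionary lookup)."""
    return TOY_WEB.get(url, [])


def crawl(seed, fetch_links):
    """Visit every page reachable from the seed and return URLs in visit order."""
    frontier = deque([seed])  # pages scheduled for crawling
    seen = {seed}             # prevents scheduling the same page twice
    visited = []
    while frontier:
        url = frontier.popleft()
        visited.append(url)   # a real crawler would pass the page to the indexer here
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)  # schedule a further crawling operation
    return visited


order = crawl("a.example/", toy_fetch_links)
print(order)
```

The `seen` set is what keeps the recursion over links from looping forever on pages that link back to each other.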

Indexing: While crawling is an important function in terms of being aware of the content
of the Web, it does not really help us with our search for occurrences of ‘distributed systems
book’. To understand how this is processed, we need to have a closer look at indexing. The role
of indexing is to produce an index for the contents of the Web that is similar to an index at the
back of a book, but on a much larger scale.
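
To make the book-index analogy concrete, the following is a minimal sketch of an inverted index, assuming a tiny hand-made document collection keyed by docid. The function names `build_index` and `search` and the sample documents are illustrative assumptions, not part of Google's system.

```python
from collections import defaultdict


def build_index(docs):
    """Map each word to the set of docids whose text contains it."""
    index = defaultdict(set)
    for docid, text in docs.items():
        for word in text.lower().split():
            index[word].add(docid)
    return index


def search(index, query):
    """Return the sorted docids that contain every word of the query."""
    words = query.lower().split()
    if not words:
        return []
    matches = set(index.get(words[0], set()))
    for word in words[1:]:
        matches &= index.get(word, set())  # keep only docs containing this word too
    return sorted(matches)


# Hypothetical documents, keyed by docid.
docs = {
    1: "distributed systems book",
    2: "a book about cooking",
    3: "distributed systems concepts and design book",
}
index = build_index(docs)
hits = search(index, "distributed systems book")
print(hits)
```

The lookup only says which documents match; it gives no ordering by importance, which is the gap that ranking fills.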

Ranking: The problem with indexing on its own is that it provides no information about
the relative importance of the web pages containing a particular set of keywords – yet this is
crucial in determining the potential relevance of a given page. All modern search engines
therefore place significant emphasis on a system of ranking whereby a higher rank is an
indication of the importance of a page and it is used to ensure that important pages are returned
nearer to the top of the list of results than lower-ranked pages. As mentioned above, much of
the success of Google can be traced back to the effectiveness of its ranking algorithm.
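
As an illustration of link-based ranking, here is a sketch of the simplified textbook form of PageRank computed by repeated iteration. It is not Google's production ranking, which the paper notes now involves many more variables. The damping factor of 0.85 is a commonly cited value and is an assumption here, as are the function name `pagerank` and the four-page link graph.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Estimate PageRank for a graph given as {page: [pages it links to]}.

    Every link target must also appear as a key in `links`.
    """
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}  # start with equal rank everywhere
    for _ in range(iterations):
        # Each page receives a small base share regardless of links.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outlinks in links.items():
            # A page with no outlinks spreads its rank evenly over all pages.
            targets = outlinks if outlinks else pages
            share = damping * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical link graph: three pages link to C, and C links back to A.
links = {"A": ["C"], "B": ["C"], "C": ["A"], "D": ["C"]}
rank = pagerank(links)
best = max(rank, key=rank.get)
print(best, rank)
```

Page C ends up with the highest rank because it has the most incoming links, and A follows because it is linked from the highly ranked C.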

Anatomy of a search engine: The founders of Google, Sergey Brin and Larry Page,
wrote a seminal paper on the ‘anatomy’ of the Google search engine in 1998 [Brin and Page
1998], providing interesting insights into how their search engine was implemented. The overall
architecture described in this paper is illustrated in Figure 1, redrawn from the original. In this
diagram, we distinguish between services directly supporting web search, drawn as ovals, and
the underlying storage infrastructure components, illustrated as rectangles.

Figure 1 Outline architecture of the original Google search engine [Brin and Page 1998]

Additional Features and Applications Description

 Google Mail: Mail system with messages hosted by Google but with desktop-like message
management.
 Google Docs: Web-based office suite supporting shared editing of documents held on
Google servers.
 Google Sites: Wiki-like websites with shared editing facilities.
 Google Talk: Supports instant text messaging and Voice over IP.
 Google Calendar: Web-based calendar with all data hosted on Google servers.
 Google Wave: Collaboration tool integrating email, instant messaging, wikis, and social
networks.
 Google News: Fully automated news aggregator site.
 Google Maps: Scalable web-based world map including high-resolution imagery and
unlimited user-generated overlays.
 Google Earth: Scalable near-3D view of the globe with unlimited user-generated overlays.
 Google App Engine: Google's distributed infrastructure made available to outside parties
as a service (platform as a service).
 Accessibility Scanner: Helps Android app creators identify opportunities to improve their
apps for users.
 Action Blocks: Lets users add common actions to the Android Home screen and trigger them
by activating a block (for example, a photo), such as calling a loved one.
 Android: Committed to creating an open platform that is more accessible to everyone, with
tools ranging from hearing aid compatibility to a built-in screen reader.
 Android Accessibility Suite: Helps make an Android device more accessible. Services
include Accessibility Menu, Select to Speak, Switch Access, and TalkBack.
 Chrome Browser: Supports screen readers and magnifiers and offers people with low vision
full-page zoom, high-contrast color, and extensions.
 Classroom: Helps teachers and students stay organized, communicate with their class, and
go paperless.
 Voice Access: Android app for controlling a device with spoken commands, such as opening
apps, navigating, and editing text hands-free.
 Google Photos: Organizes and finds photos.
 Google Meet: Holds video meetings with people inside or outside an organization. Meetings
can be joined from a computer, a mobile device, or a conference room.
 Google Assistant: Helps users get things done throughout the day in a natural and
personalized way, whether at home or on the road.

Google as a cloud provider: Google has diversified significantly beyond search and now
offers a wide range of web-based applications, including the set of applications promoted as
Google Apps. More generally, Google is now a major player in the area of cloud computing.

Software as a service: This area is concerned with offering application-level software over
the Internet as web applications. A prime example is Google Apps, a set of web based
applications including Gmail, Google Docs, Google Sites, Google Talk and Google Calendar.
Google’s aim is to replace traditional office suites with applications supporting shared document
preparation, online calendars, and a range of collaboration tools supporting email, wikis, Voice
over IP and instant messaging.

Platform as a service: This area is concerned with offering distributed system APIs as
services across the Internet, with these APIs used to support the development and hosting of
web applications (note that the term ‘platform’ here differs from its other common use, where
it refers to the hardware and operating system level). With the launch of the Google App
Engine, Google went beyond software as a service and now offers its distributed systems
infrastructure, as discussed in this case study, as a cloud service.

Physical Model: The key philosophy of Google in terms of physical infrastructure is to use
very large numbers of commodity PCs to produce a cost-effective environment for distributed
storage and computation. Purchasing decisions are based on obtaining the best performance
per dollar rather than absolute performance with a typical spend on a single PC unit of around
$1,000. Commodity PCs are organized in racks with between 40 and 80 PCs in a given rack.
The racks are double-sided with half the PCs on each side. Each rack has an Ethernet switch
that provides connectivity across the rack and also to the external world. Racks are organized
into clusters, which are a key unit of management, determining for example the placement and
replication of services. A cluster typically consists of 30 or more racks and two high-bandwidth
switches providing connectivity to the outside world (the Internet and other Google centers).
Clusters are housed in Google data centers that are spread around the world. In 2000, Google
relied on key data centers in Silicon Valley (two centers) and in Virginia.
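
The figures in this paragraph imply a rough cluster size and hardware cost, which the following sketch works out. The calculation assumes exactly 30 racks (the lower bound of "30 or more") and ignores switches, power, and cooling; the constant and function names are illustrative.

```python
# Figures taken from the paragraph above.
PCS_PER_RACK_RANGE = (40, 80)  # PCs in a given rack
RACKS_PER_CLUSTER = 30         # "30 or more" racks; the lower bound is assumed here
COST_PER_PC_USD = 1000         # typical spend on a single PC unit


def cluster_size(pcs_per_rack, racks=RACKS_PER_CLUSTER):
    """Number of PCs in a cluster with the given rack density."""
    return pcs_per_rack * racks


low, high = (cluster_size(n) for n in PCS_PER_RACK_RANGE)
print(f"PCs per cluster: {low} to {high}")
print(f"Approximate PC spend: ${low * COST_PER_PC_USD:,} to ${high * COST_PER_PC_USD:,}")
```

Under these assumptions a single cluster holds between 1,200 and 2,400 PCs.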

Google System Architecture as their Advantage

Scalability: The first and most obvious requirement for the underlying Google infrastructure is
to master scalability and, in particular, to have approaches that scale to what is an Ultra-Large
Scale (ULS) distributed system.

Reliability: Google offers a 99.9% service level agreement (effectively, a system guarantee) to
paying customers of Google Apps covering Gmail, Google Calendar, Google Docs, Google Sites
and Google Talk. The company has an excellent overall record in terms of availability of services.
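
To put the 99.9% figure in perspective, the sketch below converts an availability target into a downtime budget. How the agreement actually measures availability is not stated in this paper, so the measurement periods (a 30-day month and a 365-day year) and the function name are assumptions.

```python
def allowed_downtime_minutes(availability, period_days):
    """Minutes of downtime permitted by an availability target over a period."""
    total_minutes = period_days * 24 * 60
    return (1.0 - availability) * total_minutes


monthly = allowed_downtime_minutes(0.999, 30)   # assumed 30-day month
yearly = allowed_downtime_minutes(0.999, 365)   # assumed 365-day year
print(f"Per month: {monthly:.1f} minutes")
print(f"Per year:  {yearly / 60:.2f} hours")
```

A 99.9% target therefore leaves roughly 43 minutes of downtime per month, or under 9 hours per year.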

Performance: The overall performance of the system is critical for Google, especially in
achieving low latency of user interactions. The better the performance, the more likely it is that
a user will return with more queries that, in turn, increase their exposure to ads, hence
potentially increasing revenue. The importance of performance is exemplified by the target of
completing web search operations in 0.2 seconds and achieving the required throughput to
respond to all incoming requests while dealing with very large datasets.

Openness: The infrastructure must also be open in order to support further development in the
range of web applications on offer. It is well known that Google as an organization encourages
and nurtures innovation, and this is most evident in the development of new web applications.
This is only possible with an infrastructure that is extensible and provides support for the
development of new applications.

Figure 2 The overall Google systems architecture

Google has responded to these needs by developing the overall system architecture shown
in Figure 2. This figure shows the underlying computing platform at the bottom (that is, the
physical architecture as described above) and the well-known Google services and applications
at the top.

The middle layer defines a common distributed infrastructure providing middleware support
for search and cloud computing. This is crucial to the success of Google. The infrastructure
provides the common distributed system services for developers of Google services and
applications and encapsulates key strategies for dealing with scalability, reliability and
performance. The provision of a well-designed common infrastructure such as this can bootstrap
the development of new applications and services through reuse of the underlying system
services and, more subtly, provides an overall coherence to the growing Google code base by
enforcing common strategies and design principles.

Figure 3 Google Infrastructure

Google infrastructure: The system is constructed as a set of distributed services offering
core functionality to developers (see Figure 3). This set of services naturally partitions into the
following subsets.

The underlying communication paradigms, including services for both remote invocation
and indirect communication:

 the protocol buffers component offers a common serialization format for Google, including
the serialization of requests and replies in remote invocation.
 the Google publish-subscribe service supports the efficient dissemination of events to
potentially large numbers of subscribers.
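
The publish-subscribe idea in the list above can be sketched with a small in-process broker. This is only an illustration of the communication pattern: Google's service is distributed across machines, and the `Broker` class, its methods, and the topic name are hypothetical.

```python
from collections import defaultdict


class Broker:
    """Minimal in-process broker: publishers and subscribers only share topic names."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        """Register a callback to be invoked for every event on the topic."""
        self._subscribers[topic].append(callback)

    def publish(self, topic, event):
        """Deliver the event to each subscriber; return how many received it."""
        callbacks = self._subscribers.get(topic, [])
        for callback in callbacks:
            callback(event)
        return len(callbacks)


broker = Broker()
inbox_a, inbox_b = [], []
broker.subscribe("page-updated", inbox_a.append)
broker.subscribe("page-updated", inbox_b.append)

delivered = broker.publish("page-updated", {"url": "a.example/"})
ignored = broker.publish("no-subscribers", {"url": "b.example/"})
print(delivered, ignored, inbox_a, inbox_b)
```

The publisher never names its receivers, which is what makes this form of communication indirect.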

Data and coordination services providing unstructured and semi-structured abstractions for
the storage of data coupled with services to support coordinated access to the data:

 GFS offers a distributed file system optimized for the particular requirements of Google
applications and services (including the storage of very large files).
 Chubby supports coordination services and the ability to store small volumes of data.
 Bigtable provides a distributed database offering access to semi-structured data.

Distributed computation services providing means for carrying out parallel and distributed
computation over the physical infrastructure:

 MapReduce supports distributed computation over potentially very large datasets (for
example, stored in Bigtable).
 Sawzall provides a higher-level language for the execution of such distributed computations.
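
The MapReduce programming model can be illustrated with a single-process word count. This sketch shows only the map, shuffle, and reduce steps; it does not reproduce the distribution across machines or the fault handling that the real service provides, and all function names and sample documents are assumptions.

```python
from collections import defaultdict


def map_phase(docid, text):
    """Map step: emit a (word, 1) pair for every word in one document."""
    for word in text.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(word, counts):
    """Reduce step: combine the values for one key into a single result."""
    return word, sum(counts)


def word_count(docs):
    """Run map, shuffle, and reduce in sequence over {docid: text}."""
    pairs = [pair for docid, text in docs.items() for pair in map_phase(docid, text)]
    return dict(reduce_phase(word, counts) for word, counts in shuffle(pairs).items())


# Hypothetical input documents.
counts = word_count({1: "the web is big", 2: "the web changes"})
print(counts)
```

Because each map call looks at one document and each reduce call looks at one key, both steps could run in parallel on different machines.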

Microservices: Google App Engine has a number of features that are well suited to a
microservices-based application. In an App Engine project, multiple microservices can be
deployed as separate services (previously known as modules in App Engine).

These services have full isolation of code; the only way to execute code in these services
is through an HTTP invocation, such as a user request or a RESTful API call. Code in one service
can't directly call code in another service. Code can be deployed to services independently, and
different services can be written in different languages, such as Python, Java, Go, and PHP.
Auto scaling, load balancing, and machine instance types are all managed independently for
services.
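
The isolation rule above, where one service reaches another only through an HTTP invocation, can be sketched with the Python standard library. This does not use any App Engine API: the `InventoryHandler` service, the `/stock` path, and the `call_inventory` helper are hypothetical, and both "services" run in one local process purely for demonstration.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class InventoryHandler(BaseHTTPRequestHandler):
    """Hypothetical 'inventory' service exposing its data only over HTTP."""

    def do_GET(self):
        body = json.dumps({"item": "book", "in_stock": 3}).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demonstration output quiet


def call_inventory(base_url):
    """A second service's view: it can only issue an HTTP request, not call code."""
    # An empty ProxyHandler bypasses any system proxy for this local request.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(base_url + "/stock", timeout=5) as response:
        return json.loads(response.read().decode("utf-8"))


# Port 0 asks the operating system for any free port.
server = HTTPServer(("127.0.0.1", 0), InventoryHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
port = server.server_address[1]

stock = call_inventory(f"http://127.0.0.1:{port}")
server.shutdown()
server.server_close()
print(stock)
```

Because the caller depends only on the URL and the JSON shape, the inventory service could be redeployed or rewritten in another language without changing the caller.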

Companies that use Google Cloud Platform: Spotify, Twitter, Snapchat, Vimeo,
BestBuy, 20th Century Fox, Acciona Mobility, AFL, Altavia, Coca-Cola, Domino's, Feedly,
ShareThis, and many more.

Drawbacks of Google

Software failures: About 20 machines need to be rebooted per day due to software
failures. (Interestingly, the rebooting process is entirely manual.)

Hardware failures: Hardware failures occur about one-tenth as often as software failures, with
around 2–3% of PCs failing per annum due to hardware faults. Of these, 95% are due to faults in
disks or DRAM.
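
The hardware failure rates above translate into expected yearly counts as in the following sketch. The cluster size of 2,400 PCs is an assumption (30 racks of 80 PCs, the upper end of the rack size given in the physical model), and the names are illustrative.

```python
CLUSTER_PCS = 2400          # assumed: 30 racks of 80 PCs
ANNUAL_FAILURE_RATES = (0.02, 0.03)  # 2-3% of PCs fail per annum
DISK_OR_DRAM_SHARE = 0.95   # share of hardware failures due to disks or DRAM


def expected_failures(pc_count, annual_rate):
    """Expected number of PCs with a hardware failure in one year."""
    return pc_count * annual_rate


low, high = (expected_failures(CLUSTER_PCS, rate) for rate in ANNUAL_FAILURE_RATES)
print(f"Hardware failures per year: {low:.0f} to {high:.0f}")
print(f"Of which disk or DRAM: {low * DISK_OR_DRAM_SHARE:.1f} to {high * DISK_OR_DRAM_SHARE:.1f}")
```

Even at a low per-machine rate, a cluster of this size should expect a hardware failure roughly every week, which is why fault tolerance has to be designed into the software.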

Reliability lapses: The outage of Gmail on 1 September 2009 is a reminder of the continuing
challenges in this area.

Penalizing sites harshly: Every new update to Google’s algorithms creates anxiety among the
many bloggers and website owners who rely on Google for traffic and income. These updates can
hurt their livelihoods because they depend on Google for traffic.

A penalty from Google leads to:

 Lower rankings in Google.
 Lower traffic from Google.
 Lower visibility of the website on Google.

Always tracking users: Another disadvantage of Google is that it is always tracking its users.
Even without GPS or Google Maps, Google can still track a user's location and private
information through their web searches.

Google tracks users in the following ways:

 Through permissions in Google services such as Google Maps and Google Search.
 Through the searches users make on Google and the websites they click in search results.
 Through other Google products they use, such as Google Docs and Sheets.
 Through Google Chrome permissions.
 Through the Android software they use on their mobile phones.

RECOMMENDATIONS

Through its strong commitment to learning and development, Google has thrived as
a learning organization that provides its users with useful and accessible information based
on openness and creative thinking. As a learning organization, Google embraces the idea that
solutions to its challenges and ongoing business problems can also be found in its users’
feedback and recommendations. On that note, we recommend the following:

Google needs to make the most of its strengths, address its weaknesses, seize available
opportunities, and mitigate threats from the external environment. We recommend that Google
further reflect on its mission statement and develop it. All of its users are seeking the best
services on the market, so Google needs to rethink how it can maintain its efficiency.

To reduce future software and hardware failures, Google also needs to invest more time and
money in development, implementation, and maintenance. The company should run a series of
tests aimed at bringing the 2–3% of PCs that fail each year as close to 0% as possible. In
terms of reliability, Google is reliable but not perfectly so. Some of its services can still
encounter outages, so Google should review the infrastructure behind all of its services.

Google should provide information about the relative importance of the web pages containing
a particular set of keywords. It is important that Google penalize websites fairly, and that
updates to the PageRank algorithm be reviewed before they are rolled out.

Another important point is that Google should always respect users’ privacy. Even though
users have ways to limit tracking, Google should be transparent about what data and
information it acquires. Overall, Google is already doing well in providing its services
as distributed systems. Even so, some failures are inevitable, so it is better to be prepared
for them.

CONCLUSIONS

This is a very challenging topic and one that requires a thorough understanding of the
technological choices available to distributed systems developers at all levels of system
development, including communication paradigms, available services and associated distributed
algorithms. The inevitable trade-offs associated with the design choices demand a thorough
understanding of the application domain.

The approach taken in this study is to highlight the art of distributed systems design through
a substantial case study – that is, the examination of the design of the underlying Google
infrastructure, the platform and middleware used by Google to support its search engine and
expanding set of applications and services. This is a compelling case study as it addresses what
is the most complex and large-scale distributed system ever constructed, and one that has
demonstrably met its design requirements.

This case study examined the overall architecture of the system together with in-depth
studies of the key underlying services – specifically, protocol buffers, the publish subscribe
service, GFS, Chubby, Bigtable, MapReduce and Sawzall – which all work together to support
complex distributed applications and services including the core search engine and Google
Earth. One key lesson to be taken from this case study is the importance of really understanding
your application domain, deriving a core set of underlying design principles and applying them
consistently. In the case of Google, this manifests itself in a strong advocacy of simplicity and
low-overhead approaches coupled with an emphasis on testing, logging and tracing.

The end result is an architecture that is highly scalable, reliable, high performance and
open in terms of supporting new applications and services. The Google infrastructure is one of
a number of middleware solutions for cloud computing that have emerged in recent years.
Beyond that, there is a real paucity of published case studies related to distributed systems
design, and this is a pity given the potential educational value of studying overall distributed
systems architectures and their associated design principles. The main contribution of this
analysis is therefore to provide a first in-depth case study illustrating the complexities of
designing and implementing a complete distributed system solution.

REFERENCES

 Coulouris, G. Distributed Systems: Concepts and Design, fifth edition, chapters 1 and 2.
 Krzyzanowski, P. Distributed Systems Case Study: Google Cluster Architecture, 2010.
 "Google Search Statistics - Internet Live Stats". www.internetlivestats.com. Retrieved
April 9, 2021.
 http://en.wikipedia.org/wiki/Distributed_system
 https://www.britannica.com/topic/Google-Inc
 https://en.ryte.com/wiki/Backrub
 https://www.techopedia.com/definition/5359/google
 https://policies.google.com/technologies/partner-sites?hl=en-US
 https://preettheman.medium.com/these-are-the-programming-languages-google-uses-e248a03a589d
 https://quintagroup.com/cms/python/google
 https://googleblog.blogspot.com
 https://George-Coulouris-Distributed-Systems-Concepts-and-Design-5th-Edition
 https://entrepreneurshipera.com/advantages-and-disadvantages-of-google-search-engine/
