Large Scale Index Processing: MapReduce


Pranav Kulkarni, Vimalendu Shekhar

Introduction
Indexes are used in databases to implement prompt and effective data retrieval from the tables in the database. Database indexes are analogous to the index of a book: to find a topic, we look up its page number in the index and then look for that specific topic on that page, instead of scanning the whole book. Indexes are therefore an integral part of the study of database systems.

Motivation
The internet is a huge source of information, but that information is unorganized: web sites appear, disappear, move, or mutate on a daily basis. Searching for useful data in this morass of information can be very tedious and frustrating. To speed up the retrieval of useful information from the internet, indexes must be created for this huge amount of information, and because the internet is constantly growing, these indexes must be updated constantly. Hence the creation and processing of these indexes play a major role in connecting people to useful information on the internet.

How is the topic related to databases?
MapReduce and Google Percolator are the most widely used techniques for large scale indexing. Bigtable is a distributed storage system for managing structured data that is designed to scale to a very large size. It is a multidimensional, sorted, sparse, time-based table. In simple words, it is a database which contains all the information required by MapReduce and Percolator. Thus, large scale index processing amounts to performing operations on a database using various techniques.

With the advent of Big Data and cloud computing, there is a need for frameworks that process huge amounts of data by distributing the work among a cluster of processors. Two of the most popular frameworks used for this task are MapReduce and Google Percolator.

We will give a short introduction to MapReduce and discuss its applications, advantages and disadvantages. Then we will discuss the factors that form the motivation behind the development of Percolator by Google, followed by the working of Percolator in detail.

MapReduce

Introduction
The MapReduce model was introduced by Google and IBM for processing large data sets in a distributed fashion. The basic MapReduce infrastructure uses a cluster containing a large number of low-end systems working in parallel to solve a computing problem, instead of a few high-end servers. Although widely used today, MapReduce is based on ideas developed many years back. It defies the conventions of an ideal DBMS and does not support many database tools like views and transactions.

Working
The core idea behind MapReduce is mapping your data set into a collection of <key, value> pairs, and then reducing over all pairs with the same key. This idea is achieved in two stages, Map and Reduce.

Map
Map takes a list of records from the input file. The output of the Map function is a set of records in the form of (key, value) pairs. The Map function basically splits the records among the computers in the cluster. Many instances of the Map function may be running at the same time in the cluster, and each of them uses the same hash function for generating the (key, value) pairs.

Shuffle and Reduce
Reduce takes the collection of (key, value) pairs as input and merges the records with the same key. The pairs generated by the same hash function are reduced by the same instance of the Reduce function, and the reduced sets are computed in parallel.
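As a sketch of the two stages, here is a minimal, single-process word count in Python (the canonical MapReduce example). The function names and the two-reducer partitioning are illustrative assumptions; a real framework runs many Map and Reduce instances in parallel across a cluster, with the hash function deciding which Reduce instance handles each key.

```python
from collections import defaultdict

def map_phase(records):
    """Map: emit a (key, value) pair for every word in every record."""
    pairs = []
    for record in records:
        for word in record.split():
            pairs.append((word, 1))
    return pairs

def shuffle(pairs, num_reducers=2):
    """Shuffle: the same hash function routes equal keys to the same reducer."""
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in pairs:
        partitions[hash(key) % num_reducers][key].append(value)
    return partitions

def reduce_phase(partitions):
    """Reduce: merge all values that share a key (here, by summing counts)."""
    result = {}
    for partition in partitions:
        for key, values in partition.items():
            result[key] = sum(values)
    return result

counts = reduce_phase(shuffle(map_phase(["to be or not to be"])))
```

Because keys are partitioned by hash, each partition can be reduced on a different machine with no coordination, which is where the parallelism comes from.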

Figure: Working of MapReduce (drawn using Draw.io)


Applications of MapReduce
1) Distributed pattern-based searching and sorting.
2) Clustering of documents.
3) Construction of inverted indexes, which was earlier used by Google to create its web search index.
4) One of the most popular online movie libraries, Netflix, uses MapReduce technology to maintain and retrieve its collection of movies.

Merits
1) MapReduce is a very important Big Data tool and enables us to query live data.
2) MapReduce is simple to implement. A programmer defines only the Map and Reduce functions, without worrying about the physical distribution of the task at each node.
3) MapReduce is flexible, as it does not have any dependency on the data model or the schema.
4) MapReduce is highly fault tolerant.

However, the database community wasn't happy with the MapReduce technique, for the following reasons:
1) Suboptimal implementation, which uses brute force instead of indexing.
2) Not exactly novel: the database community says it is just a specific implementation of well-known techniques over 30 years old.
3) MapReduce misses most of the features that are routinely included in current DBMSs.
4) It is incompatible with the tools DBMS users depend on.

The database community believes that schemas, and their separation from the applications, are good. They also believe that high level access languages are good. MapReduce represents the era of the 1960s, before modern DBMSs were invented. MapReduce does not have any system catalogs that maintain the structure of the records. Therefore, when a programmer has to write a new application, the programmer needs to discover the structure by examining the code. This is an extremely tedious process, and it is unavoidable when MapReduce is used.

All modern DBMSs use a hash index or a B-tree index to speed up data access. Searching with an index is considerably faster than searching without one, and query optimizers decide whether to use an index or brute force to perform search operations on the database. MapReduce provides no facilities for an index and has to resort to brute force every single time.
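The gap between the two approaches can be sketched with a hypothetical table: a brute-force lookup scans every record (what MapReduce must do), while a hash index built once answers each lookup in roughly constant time. The record layout and field names here are illustrative assumptions.

```python
# A hypothetical table of records and a hash index on the "id" field.
records = [{"id": i, "payload": f"row-{i}"} for i in range(100_000)]

def scan_lookup(records, key):
    """Brute force: examine every record until the key matches, O(n)."""
    for r in records:
        if r["id"] == key:
            return r
    return None

# Hash index: built once, then each lookup is O(1) on average.
index = {r["id"]: r for r in records}

def indexed_lookup(index, key):
    return index.get(key)

# Both return the same record; only the cost differs.
assert scan_lookup(records, 99_999) == indexed_lookup(index, 99_999)
```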

Demerits
A major drawback associated with MapReduce is the amount of processing required. Consider a scenario where we crawl a small portion of the web and update the index. MapReduce then needs to process the entire repository, not just the newly updated documents. This results in a huge overhead every time a small update operation is performed. Depending on the size of the repository, a large number of MapReduce runs must be used, and efficiency decreases as the repository grows.

In a world where the amount of data generated over the web is increasing exponentially, maintaining efficiency and speed of data access becomes difficult with MapReduce. This motivated Google to come up with an incremental system that continuously updates the search index without reprocessing the entire thing from scratch. This new system is Google Percolator.

Characteristics of an Ideal System for Index Processing
1) It should be designed for incremental processing and creation of search indexes as web pages are crawled into the system. Incremental processing would reduce the overhead and costs associated with updating the search indexes.
2) It should provide random access to the repository.
3) One of the most important parts of incremental processing is keeping track of the updates which have already been processed. Incremental processing works like a loop and is a repeatable process. Since it is repeatable, we need some conditions which remain true throughout the process, i.e. we need to be able to maintain invariants.
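The third requirement can be sketched as a loop that maintains one invariant: every document id in a (hypothetical) `processed` set is already in the index, so re-running the loop never redoes finished work. The `processed` set and index shape are illustrative assumptions, not Percolator's actual data structures.

```python
processed = set()   # invariant: every id in here is already indexed
index = {}          # hypothetical search index: word -> set of doc ids

def process_batch(crawled_docs):
    """Incrementally index only documents not seen before."""
    for doc_id, text in crawled_docs.items():
        if doc_id in processed:        # the invariant lets us skip this doc
            continue
        for word in text.split():
            index.setdefault(word, set()).add(doc_id)
        processed.add(doc_id)          # invariant re-established

process_batch({"d1": "big data", "d2": "big index"})
process_batch({"d1": "big data", "d3": "new page"})  # d1 is not reprocessed
```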

Percolator fulfills the above requirements of an ideal system and is currently used by Google to maintain and update its search indexes.

Percolator

Before looking into the working of Percolator, we need to understand the concept of Bigtable, as Percolator is built on top of Bigtable.

Bigtable
Bigtable is a distributed, multidimensional, sorted map, indexed by row key, column key and timestamp. Bigtable is designed to provide lookup and update operations, which include modify, read and write. Bigtable provides huge scalability and can grow up to petabytes in size to accommodate the Google search indexes.

Bigtable is a compressed, high performance, proprietary data store built on Google File System, Chubby Lock Service, SSTable and a few other Google technologies.

Bigtable implements huge data storage using a large number of multidimensional storage spaces partitioned across many machines in the cluster. These storage spaces are called tablets. Data is written to or retrieved from the tablets on commands from a program known as the Master.
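The map structure described above can be sketched as a toy in-memory model, assuming the (row, column, timestamp) keying from the description; real Bigtable shards rows into tablets across machines, which this single-dict toy deliberately omits.

```python
class TinyBigtable:
    """Toy model of Bigtable's map: (row, column, timestamp) -> value."""

    def __init__(self):
        self.cells = {}   # (row, column) -> {timestamp: value}

    def write(self, row, column, timestamp, value):
        """Update operation: add a new timestamped version of a cell."""
        self.cells.setdefault((row, column), {})[timestamp] = value

    def read(self, row, column, max_timestamp=None):
        """Lookup operation: return the newest value at or before max_timestamp."""
        versions = self.cells.get((row, column), {})
        eligible = [ts for ts in versions
                    if max_timestamp is None or ts <= max_timestamp]
        return versions[max(eligible)] if eligible else None

t = TinyBigtable()
t.write("com.example/page", "contents", 1, "old html")
t.write("com.example/page", "contents", 5, "new html")
```

Keeping every version under its timestamp, rather than overwriting in place, is what later lets a transactional layer like Percolator pick a consistent snapshot to read.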

Why not only Bigtable?
1) Bigtable alone cannot be used to maintain the search index, as it does not provide multi-row transactions, which are a necessity for incrementally updating the search indexes.
2) Although Bigtable can grow to accommodate the size of the indexes, it does not offer programmers some of the very essential tools for incrementally updating and maintaining the index.

Introduction to Percolator
Percolator was designed to overcome the limitations of MapReduce by implementing a platform for incremental processing, i.e. continually updating the web search index without having to delete the existing indexes and reprocess the entire thing from scratch. Percolator achieves this flexibility by providing a number of features like multi-row transactions and random access to locations in the repository.

Incremental processing has an immediate effect on latency: there is only a very small delay between the arrival of a document in the system and the creation of its index. Hence it reduces the average age of documents. By replacing MapReduce with Percolator, Google says that it is able to process the same number of documents every day while reducing the average age of documents in Google search results by 50%. Because of the sequential nature of MapReduce, it took days to index a document after its arrival before it could actually turn up in Google search results.

Percolator fulfills the third requirement of an ideal system by providing Observers. Observers help programmers keep track of the incremental computation and are invoked whenever there is a write operation on any of the rows in the column on which they are registered.
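A minimal sketch of that trigger mechanism, assuming a hypothetical registry that maps column names to callbacks (Percolator's real observers are registered with the system and run as distributed workers, not in-process callbacks):

```python
# Hypothetical observer registry: column name -> list of callbacks.
observers = {}

def register_observer(column, callback):
    observers.setdefault(column, []).append(callback)

def write(table, row, column, value):
    """Every write triggers the observers registered on that column."""
    table[(row, column)] = value
    for callback in observers.get(column, []):
        callback(table, row)

# Example: note which documents need re-indexing whenever raw contents change.
reindexed = []
register_observer("contents", lambda table, row: reindexed.append(row))

table = {}
write(table, "com.example/page", "contents", "<html>...</html>")
```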

Definitions that will help us understand the working of Percolator:
1) Incremental Processing: Incrementally creating new indexes and updating the existing indexes as new web pages arrive in the system.

2) Pagerank: Pagerank is an algorithm used by Google to decide how important and useful a web site is with respect to the search criteria.

3) Link/Index Inversion: An inverted index is a link from the content on a web page to its location in the database.
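A tiny illustration of index inversion, using two made-up pages: the forward mapping (page to words) is inverted into a mapping from each word to the set of pages containing it, which is what a search lookup needs.

```python
# Forward data: page location -> its content (illustrative pages).
pages = {
    "a.com": "databases use indexes",
    "b.com": "indexes speed up search",
}

# Inverted index: word -> set of page locations containing it.
inverted = {}
for url, text in pages.items():
    for word in text.split():
        inverted.setdefault(word, set()).add(url)
```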

An overview of Google File System (GFS)
A basic understanding of GFS is necessary to understand the working of Percolator. A GFS cluster is basically divided into two sets of nodes: a large number of Chunkservers and a single Master.

Chunkservers: Each file in GFS is divided into 64 megabyte chunks, which are stored by Chunkservers. Every chunk has a primary copy and a secondary copy. These copies are stored in the Chunkservers, which send the primary copy when requested by the Master. Secondary copies serve as a backup in case of any failure.

Master: The Master node is responsible for generating a 64 bit address for the chunks and keeping track of the processes that are reading or writing chunks. All processes first need to query the Master to get the location of the desired chunk.
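The chunk addressing described above can be sketched as follows: a client turns a byte offset into a chunk index, then asks the Master which Chunkserver holds that chunk. The file name, server names and the flat lookup table are illustrative assumptions; real GFS also hands out chunk handles and tracks the secondary replicas.

```python
CHUNK_SIZE = 64 * 1024 * 1024   # 64 MB chunks, as in GFS

# Hypothetical Master table: (file, chunk index) -> Chunkserver holding
# the primary copy of that chunk.
chunk_locations = {
    ("/logs/web.log", 0): "chunkserver-3",
    ("/logs/web.log", 1): "chunkserver-7",
}

def locate(filename, byte_offset):
    """Map a byte offset to a chunk index, then ask the Master for
    the Chunkserver holding that chunk."""
    chunk_index = byte_offset // CHUNK_SIZE
    return chunk_locations.get((filename, chunk_index))

server = locate("/logs/web.log", 70 * 1024 * 1024)   # offset falls in chunk 1
```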

Design of Percolator
Percolator achieves its incremental processing capability through the following two design abstractions:
1) The ability to perform ACID transactions over a randomly accessible repository. This is achieved by using Bigtable as the underlying storage system.
2) The ability to define invariants and keep track of the incremental computation. This is achieved through Observers.

The infrastructure to achieve the above design abstractions consists of running three programs on every machine in the Percolator cluster: a Percolator worker, a Bigtable tablet server, and a GFS chunkserver.

The Percolator worker acts as an interface between the Observers and Bigtable. The Percolator worker scans the Bigtable; any change in any row forces the Percolator worker to trigger the corresponding Observer. This is achieved by writing invocation functions in the Observer and calling them from the Percolator worker via a Remote Procedure Call (RPC). On invocation, the Observer identifies the request and generates a corresponding RPC call to the Bigtable tablet servers. In response, the Bigtable tablet servers generate another read/write RPC call to the GFS Chunkservers, routed via the Master node, which provides the address of the chunk where the data has to be written or read.

The data is stored in Bigtable, along with Percolator metadata stored in special columns. The Percolator library consists of Bigtable operations wrapped in Percolator-specific computations.

Each node in the Percolator cluster can raise a request for a read/write operation on the Bigtable, and thousands of machines can request a lock on a particular column in the Bigtable. Hence the design of Percolator must incorporate an appropriate locking service. The lock service must satisfy the following conditions:
1) Locks must not be lost in the case of machine failure, as that can result in two conflicting transactions getting committed.
2) As there could be a large number of lock requests, the lock service must have high throughput to handle all the requests.
3) The overhead of reading and creating locks should not be high.

To satisfy the first condition, the lock server needs to be replicated, with copies saved at different locations. To service the large number of requests, the lock server needs to be distributed across several machines. As the Bigtable is already distributed across several machines in the cluster, Percolator stores the locks in special in-memory columns in the Bigtable itself.

Working
Percolator performs various transactions. A read begins with a Get() function that checks for a lock in its timestamp range. If no lock is found, the function reads the latest record and returns the data corresponding to it. If a transaction encounters a conflicting lock, it needs to check whether the previous transaction failed or succeeded. If the previous transaction failed, the current transaction can erase its locks. This is crucial, as deadlocks may occur if locks are not cleaned up properly. The transaction protocol uses strictly increasing timestamps to guarantee that Get() returns all committed writes before the transaction's start timestamp.
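A sketch of that Get() logic over a single versioned cell, assuming a toy `Cell` with a data column (timestamp to value) and a lock column marking pending writes; the status strings and structure are illustrative, not Percolator's actual API.

```python
class Cell:
    def __init__(self):
        self.data = {}    # timestamp -> committed value
        self.lock = {}    # timestamp -> transaction holding a pending write

def get(cell, start_ts):
    """Sketch of Percolator's Get(): refuse to read past a pending lock,
    otherwise return the newest committed value before start_ts."""
    if any(ts < start_ts for ts in cell.lock):
        # Conflicting lock: the caller must wait, or erase the lock if its
        # owning transaction is known to have failed.
        return None, "conflicting lock"
    committed = [ts for ts in cell.data if ts < start_ts]
    if not committed:
        return None, "no visible write"
    return cell.data[max(committed)], "ok"

cell = Cell()
cell.data[10] = "v1"
cell.data[20] = "v2"
value, status = get(cell, start_ts=25)
```

Because timestamps are strictly increasing, a read at start_ts=25 can only ever see writes committed before 25, which is the snapshot guarantee the text describes.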

How Percolator overcomes the limitations of MapReduce

Scalability
Percolator employs Observers, which work somewhat like triggers in a conventional DBMS. An Observer is invoked in the event of a write operation on any column on which it is registered in the Bigtable. These Observers are organized hierarchically: on any event, the Observer at the top of the hierarchy invokes the Observers down the hierarchy. The work is thus divided among many Observers, and no single Observer is responsible for scanning and updating the whole repository. Hence the task gets divided incrementally among multiple Observers. This overcomes the major limitation of the MapReduce technique.

Tradeoffs
Percolator uses roughly 30 times more CPU per transaction than a standard DBMS. It trades efficient use of resources for scalability.

Caffeine
The Percolator-based system used by Google is called Caffeine. The primary goal of designing Caffeine was to reduce latency: a document moves over 100 times faster through Caffeine than through the previous systems. Caffeine also has a clustering phase, which helps the lookup of documents by avoiding an extra scan over the repository. Caffeine also makes use of additional resources to provide the necessary throughput.

Comparison of Bigtable and Percolator

Final Thoughts

Future of MapReduce and Percolator
MapReduce in its current form will not be of much use, and its use will gradually fade away. But this change will not happen overnight, as the alternatives to MapReduce are still at a very nascent stage. A lot of systems currently use MapReduce and will continue to do so for a few years. Two upcoming systems, Tez and Spark, are based on the idea of MapReduce but combine both batch processing and streaming in a single engine.

Percolator is here to stay and will be used as the primary system for creating and updating the Google search index in the near future. Many systems inspired by the working of Percolator are coming up, like HBase coprocessors and Apache Drill. Haeinsa, the transaction library for HBase, is also built upon the idea of the working of Percolator, but its detailed implementation is somewhat different.

Next Big Thing?
Google Percolator is designed to overcome latency at the cost of everything else. The infrastructure and resources required to implement Percolator are very large as compared to MapReduce. Hence, modifications to the design of Percolator, or a completely new system inspired by its working but not very resource intensive, are on the cards. A system which combines the low latency of Percolator with the frugality of resources of MapReduce could be the next big thing.

Percolator is built over Bigtable and adds some features, like multi-row transactions, in order to achieve a system of incremental processing. Google has come up with a new database called Spanner, which has a semi-relational data model. Spanner has the same distributed architecture as Bigtable, but data retrieval is faster.

References:
Large-scale Incremental Processing Using Distributed Transactions and Notifications, by Daniel Peng and Frank Dabek.
https://www.usenix.org/legacy/event/osdi10/tech/full_papers/Peng.pdf
Big Data beyond MapReduce: Google's Big Data papers, by Mikio L. Braun.
http://blog.mikiobraun.de/2013/02/bigdatabeyondmapreducegooglespapers.html
Google Percolator: global search jolt sans MapReduce comedown, by Cade Metz.
http://www.theregister.co.uk/2010/09/24/google_percolator/
MapReduce: A major step backwards, by David DeWitt.
http://homes.cs.washington.edu/~billhowe/mapreduce_a_major_step_backwards.html
Large-scale Incremental Processing Using Distributed Transactions and Notifications, by Daniel Peng and Frank Dabek, presented by Nick Radcliffe.
http://courses.cs.vt.edu/cs5204/fall11butt/lectures/perculator.pdf
