Digital Notes of Big Data Analytics Dated 5.1.2024
DEPARTMENT OF CSE & IT
MALLA REDDY ENGINEERING COLLEGE FOR WOMEN
(Autonomous Institution - UGC, Govt. of India)
Accredited by NBA & NAAC with 'A' Grade, UGC, Govt. of India
NIRF Indian Ranking 2018, Accepted by MHRD, Govt. of India
Permanently Affiliated to JNTUH, Approved by AICTE, ISO 9001:2015 Certified Institution
AAAA+ Rated by Digital Learning Magazine, AAA+ Rated by Careers 360 Magazine
6th Rank CSR, Platinum Rated by AICTE-CII Survey
Top 100 Rank band by ARIIA, MHRD, Govt. of India
National Ranking: Top 100 Rank band by Outlook
National Ranking: Top 100 Rank band by Times News Magazine
Maisammaguda, Dhulapally, Secunderabad, Kompally - 500100
2023-2024
Course Name: Introduction to Big Data Analytics
Course Code: 2012PE05
Course Objectives:
TEXTBOOK:
1. Seema Acharya, Subhashini Chellappan, "Big Data and Analytics", Wiley Publications, First Edition, 2015.
REFERENCE BOOKS:
1. Judith Hurwitz, Alan Nugent, Fern Halper, Marcia Kaufman, "Big Data for Dummies", John Wiley & Sons, Inc., 2013.
2. Tom White, "Hadoop: The Definitive Guide", O'Reilly Publications, Fourth Edition, 2015.
SNO  Title
UNIT-1  INTRODUCTION TO BIG DATA
I  Introduction to Big Data
I  Characteristics of Big Data
I  Evolution of Big Data
I  Definition of Big Data
I  Challenges with Big Data
I  Big Data Analytics
I  Classification of Big Data Analytics
I  Importance and Challenges Facing Big Data
I  Data and Analysis
UNIT-2  HADOOP TECHNOLOGY
II  Introducing Hadoop
II  Conventional Approach versus Hadoop
II  Processing Data with Hadoop
II  Introduction to the Hadoop Ecosystem
II  Challenges with Hadoop
UNIT-3  HADOOP FILE SYSTEM
III  Introduction to the Hadoop Distributed File System
III  Design of the Hadoop File System
III  Basic HDFS
III  Basic File System
III  Hadoop File System
III  Basic File System versus Hadoop File System
III  The Java Interface
III  Inserting Data from a Hadoop URL
III  Inserting Data Using the FileSystem API
III  Storing Data
III  Data Flow: Anatomy of a File Read
III  Anatomy of a File Write
III  Limitations
UNIT-4  MAPREDUCE
IV  Introduction to MapReduce: Its Framing
IV  Features of MapReduce
IV  Working of MapReduce
IV  Analyzing Map and Reduce Functions
IV  MapReduce Techniques to Optimize Jobs; Uses
IV  Controlling Input Formats in MapReduce Execution
IV  Different Phases in MapReduce; Applications
UNIT-5  BIG DATA PLATFORMS
V  Sqoop
V  Cassandra
V  MongoDB
V  HIVE
V  PIG
Important Questions
Introduction to Big Data Analytics, Dept. of CSE (AI & ML)
► Introduction to Big Data
► Characteristics of Data
► Evolution of Big Data
► Definition of Big Data
► Challenges with Big Data
► Big Data Analytics
► Classification of Analytics
► Importance and Challenges Facing Big Data
► Terminologies Used in Big Data Environments
► Data Storage and Analysis
INTRODUCTION TO BIG DATA
Characteristics of Data:
Before going into the characteristics of data in detail, let us understand the term BIG DATA: BIG means bigger in size, and DATA means raw facts.
Irrespective of the size of the enterprise (big or small), data continues to be a precious and irreplaceable asset. Data is present internal to the enterprise and also exists outside the four walls and firewalls of the enterprise. Data is present in homogeneous sources as well as in heterogeneous sources. The need of the hour is to understand, manage, process, and prepare the data for analysis to draw valuable insights.
Data → Information → Insights
The difference between data and information can be broadly summarized as follows:
             Data                                        Information
Meaning      Data is raw, unorganized facts that         When data is processed, organized,
             need to be processed. Data can be           structured or presented in a given
             something simple and seemingly random,      context so as to make it useful,
             and useless until it is organized.          it is called information.
Example      Each student's test score is one            The class's average score or the
             piece of data.                              school's average score is information
                                                         that can be concluded from the
                                                         given data.
Definition   Latin 'datum', meaning "that which is       Information is interpreted data.
             given". "Data" was the plural form of
             the singular "datum".
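The table's example can be made concrete in a couple of lines: individual scores are data, and the class average derived from them is information. The scores below are made up purely for illustration.

```python
# Data vs. information, following the table above: each score is one
# piece of data; the class average computed from them is information.
# The scores themselves are hypothetical.
scores = [72, 85, 90, 64, 79]               # data: individual test scores
class_average = sum(scores) / len(scores)   # information: the class average
# class_average == 78.0
```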
Characteristics of Data:
Let us start with the characteristics of data. As depicted in the figure below, data has three key characteristics:
Composition: The composition of data deals with the structure of data, that is, the sources, the granularity, the types, and the nature of data as to whether it is static or real-time streaming.
Condition: The condition of data deals with the state of data, that is, "Can one use this data as is for analysis?" or "Does it require cleansing for further enhancement and enrichment?"
Context: The context of data deals with "Where has this data been generated?", "Why was this data generated?", "How sensitive is this data?", "What are the events associated with this data?", and so on.
Small data (data as it existed prior to the big data revolution) is about certainty. It is about fairly known data sources; it is about no major changes to the composition or context of data. Most often we have answers to queries like why this data was generated, where and when it was generated, exactly how we would like to use it, what questions this data will be able to answer, and so on.
Big data is about complexity: in terms of multiple and unknown datasets, in terms of exploding volume, in terms of the speed at which the data is being generated and the speed at which it needs to be processed, and in terms of the variety of data (internal or external, behavioral or social) that is being generated.
Evolution of Big Data:
The 1970s and before was the era of mainframes. The data was essentially primitive and structured. Relational databases evolved in the 1980s and 1990s; that era was one of data-intensive applications. The World Wide Web (WWW) and the Internet of Things (IoT) have led to an onslaught of structured, unstructured, and multimedia data.
Figure: The evolution of Big Data. Mainframes provided basic storage of primitive, structured data (1970s and before); relational databases supported data-intensive applications (1980s and 1990s); data-driven generation and storage now spans structured, unstructured, and multimedia data (2000s and beyond).
Definition of Big Data:
If we were to ask you the simple question "Define Big Data", what would your answer be? Well, we will give you a few responses that we have heard over time:
1. Anything beyond the human and technical infrastructure needed to support storage, processing, and analysis.
2. Today's BIG may be tomorrow's NORMAL.
3. Terabytes or petabytes or zettabytes of data.
4. I think it is about the 3Vs.
Well, all of these responses are correct. But it is not just one of these; in fact, big data is all of the above and more.
"Big data is high-volume, high-velocity, and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making." -- Gartner
Figure: Gartner's definition of Big Data: high-volume, high-velocity, and high-variety information assets; cost-effective, innovative forms of information processing; enhanced insight and decision making.
Challenges with Big Data:
1. Data today is growing at an exponential rate. Most of the data that we have today has been generated in the last 2-3 years. This high tide of data will continue to rise incessantly. The key questions here are: "Will all this data be useful for analysis?", "Do we work with all this data or a subset of it?", "How will we separate the knowledge from the noise?", etc.
2. Cloud computing and virtualization are here to stay. Cloud computing is the answer to managing infrastructure for big data as far as cost-efficiency, elasticity, and easy upgrading/downgrading are concerned. This further complicates the decision to host big data solutions outside the enterprise.
3. Another challenge is to decide on the period of retention of big data. Just how long should one retain this data? A tricky question indeed, as some data is useful for making long-term decisions, whereas in a few cases the data may quickly become irrelevant and obsolete just a few hours after having been generated.
4. There is a dearth of skilled professionals who possess the high level of proficiency in data science that is vital in implementing big data solutions.
5. Then, of course, there are other challenges with respect to capture, storage, preparation, search, analysis, transfer, security, and visualization of big data. Big data refers to datasets whose size is typically beyond the storage capacity of traditional database software tools. There is no explicit definition of how big a dataset should be for it to be considered big data. Here we have to deal with data that is just too big, moves way too fast, and does not fit the structures of typical database systems. The data changes are highly dynamic, and therefore there is a need to ingest it as quickly as possible.
6. Data visualization is becoming popular as a separate discipline. We are short by quite a number as far as business visualization experts are concerned.
Figure: Challenges with Big Data (capture, storage, preparation, search, analysis, transfer, security, visualization).
What is Big Data?
Big data is data that is big in volume, velocity, and variety.
Volume: Data grows from bits to bytes to petabytes and exabytes.
Bit         0 or 1
Byte        8 bits
Kilobyte    1024 bytes
Megabyte    1024^2 bytes
Gigabyte    1024^3 bytes
Terabyte    1024^4 bytes
Petabyte    1024^5 bytes
Exabyte     1024^6 bytes
Zettabyte   1024^7 bytes
Yottabyte   1024^8 bytes
Figure: Growth of data
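The unit ladder in the table above, where each unit is 1024 times the previous one, can be sketched programmatically. This is an illustrative helper, not part of the syllabus material.

```python
# Each unit in the table above is 1024 times the previous one.
UNITS = ["bytes", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]

def human_readable(num_bytes: int) -> str:
    """Render a byte count using the largest unit whose value is >= 1."""
    value = float(num_bytes)
    for unit in UNITS:
        if value < 1024 or unit == UNITS[-1]:
            return f"{value:.2f} {unit}"
        value /= 1024
    return f"{value:.2f} {UNITS[-1]}"

print(human_readable(1024))          # 1.00 KB
print(human_readable(5 * 1024**4))   # 5.00 TB
```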
Where does this data get generated?
There are a multitude of sources for big data. An XLS, a DOC, a PDF, etc. is unstructured data; a video on YouTube, a chat conversation on Internet Messenger, a customer feedback form on an online retail website is unstructured data; CCTV coverage and weather forecast reports are unstructured data too.
1. Typical internal data sources: data present within an organization's firewall, as follows:
Data storage: file systems, SQL (RDBMSs - Oracle, MS SQL Server, DB2, MySQL, PostgreSQL, etc.), NoSQL (MongoDB, Cassandra, etc.), and so on.
2. Both (internal + external data sources):
Sensor data: car sensors, smart electric meters, office buildings, air conditioning units, refrigerators, and so on.
Machine log data: event logs, application logs, business process logs, audit logs, clickstream data, etc.
Social media: Twitter, blogs, Facebook, LinkedIn, YouTube, Instagram, etc.
Business apps: ERP, CRM, HR, Google Docs, and so on.
Media: audio, video, images, podcasts, etc.
Docs: comma-separated values (CSV), Word documents, PDF, XLS, PPT, and so on.
Figure: Sources of Big Data (data storage, archives, media, sensor data, machine log data, public web).
Velocity:
We have moved from the days of batch processing to real-time processing.
Batch → Periodic → Near real time → Real-time processing
Variety:
Variety deals with a wide range of data types and sources of data. We will study this under three categories: structured data, semi-structured data, and unstructured data.
1. Structured data: from traditional transaction processing systems and RDBMSs, etc.
2. Semi-structured data: for example, HyperText Markup Language (HTML) and eXtensible Markup Language (XML).
3. Unstructured data: for example, unstructured text documents, audio, video, emails, photos, PDFs, social media, etc.
Evolution of Big Data, Definition of Big Data, Challenges with Big Data, Traditional Business Intelligence (BI) versus Big Data.
Big data analytics: Classification of Analytics, Importance and challenges facing big data, Terminologies Used in Big Data Environments, The Big Data Technology Landscape.
Big Data is becoming one of the most talked-about technology trends nowadays. The real challenge for large organizations is to get the maximum out of the data already available and to predict what kind of data to collect in the future. How to take existing data and make it meaningful, so that it provides accurate insight into past data, is one of the key discussion points in many executive meetings in organizations.
With the explosion of data, the challenge has gone to the next level, and now Big Data is becoming a reality in many organizations. The goal of every organization and expert is the same, to get the maximum out of the data, but the route and the starting point are different for each organization and expert. As organizations evaluate and architect big data solutions, they also learn the ways and opportunities related to Big Data.
There is no single solution to big data, and no single vendor can claim to know all about Big Data. Big Data is too big a concept, and there are many players: different architectures, different vendors, and different technologies.
CLASSIFICATION OF ANALYTICS
What is Data Analytics?
In this new digital world, data is being generated in enormous amounts, which opens new paradigms. As we have high computing power as well as large amounts of data, we can make use of this data for data-driven decision making. The main benefit of data-driven decisions is that they are made by observing past trends that have resulted in beneficial results and business predictions.
Data Analytics Techniques and Life Cycle
Use of Data Analytics
There are some key domains and strategic planning techniques in which Data Analytics has played a very important role:
• Improved Decision-Making: If we have supporting data in favor of a decision, then we will be able to implement it with an even higher probability of success. For example, if a certain decision or plan has led to better outcomes, there will be no doubt about implementing it again.
• Better Customer Service: Churn modeling is the best example of this, in which we try to predict or identify what leads to customer churn and change those things accordingly, so that customer attrition is as low as possible, which is a most important factor in any organization.
• Efficient Operations: Data Analytics can help us understand the demand of the situation and what should be done to get better results; then we will be able to streamline our processes, which in turn will lead to efficient operations.
• Effective Marketing: Market segmentation techniques have been implemented to target this important factor, in which we are supposed to find the marketing techniques that will help us increase our sales and lead to effective marketing strategies.
Types of Data Analytics
There are four major types of data analytics:
1. Predictive (forecasting)
2. Descriptive (business intelligence and data mining)
3. Prescriptive (optimization and simulation)
4. Diagnostic analytics
Figure: Data Analytics and its types
Predictive Analytics
Predictive analytics turns data into valuable, actionable information. Predictive analytics uses data to determine the probable outcome of an event or the likelihood of a situation occurring. Predictive analytics holds a variety of statistical techniques from modeling, machine learning, data mining, and game theory that analyze current and historical facts to make predictions about future events. Techniques that are used for predictive analytics are:
• Linear Regression
• Time Series Analysis and Forecasting
• Data Mining
Basic Cornerstones of Predictive Analytics
• Predictive modeling
• Decision Analysis and Optimization
• Transaction profiling
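As a minimal sketch of the first technique listed above, ordinary least-squares linear regression can be written in plain Python. The monthly sales figures are invented for illustration; this is not an example from the textbook.

```python
# Toy linear regression: fit y = a*x + b by ordinary least squares,
# then use the fitted line to forecast the next value.
def fit_line(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance(x, y) / variance(x); intercept from the means.
    a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
        / sum((x - mean_x) ** 2 for x in xs)
    b = mean_y - a * mean_x
    return a, b

# Hypothetical monthly sales, used to forecast month 6.
months = [1, 2, 3, 4, 5]
sales = [10, 12, 14, 16, 18]
a, b = fit_line(months, sales)
forecast = a * 6 + b   # predicted sales for month 6
```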
Descriptive Analytics
Descriptive analytics looks at data and analyzes past events for insight into how to approach future events. It looks at past performance and understands that performance by mining historical data to understand the causes of success or failure in the past. Almost all management reporting, such as sales, marketing, operations, and finance, uses this type of analysis.
The descriptive model quantifies relationships in data in a way that is often used to classify customers or prospects into groups. Unlike a predictive model that focuses on predicting the behavior of a single customer, descriptive analytics identifies many different relationships between customers and products.
Common examples of descriptive analytics are company reports that provide historic reviews, like:
• Data Queries
• Reports
• Descriptive Statistics
• Data dashboards
Prescriptive Analytics
Prescriptive analytics automatically synthesizes big data, mathematical science, business rules, and machine learning to make a prediction and then suggests a decision option to take advantage of the prediction.
Prescriptive analytics goes beyond predicting future outcomes by also suggesting actions that benefit from the predictions and showing the decision maker the implications of each decision option. Prescriptive analytics not only anticipates what will happen and when it will happen but also why it will happen. Further, prescriptive analytics can suggest decision options on how to take advantage of a future opportunity or mitigate a future risk, and illustrate the implications of each decision option.
For example, prescriptive analytics can benefit healthcare strategic planning by using analytics to leverage operational and usage data combined with data on external factors such as economic data, population demography, etc.
Diagnostic Analytics
In this analysis, we generally use historical data over other data to answer a question or to solve a problem. We try to find dependencies and patterns in the historical data of the particular problem.
For example, companies go for this analysis because it gives great insight into a problem, and they also keep detailed information at their disposal; otherwise, data collection may have to be repeated for every problem, which would be very time-consuming. Common techniques used for diagnostic analytics are:
• Data discovery
• Data mining
• Correlations
Future Scope of Data Analytics
1. Retail: To study sales patterns, consumer behavior, and inventory management, data analytics can be applied in the retail sector. Data analytics can be used by retailers to make data-driven decisions regarding what products to stock, how to price them, and how best to organize their stores.
2. Healthcare: Data analytics can be used to evaluate patient data, spot trends in patient health, and create individualized treatment regimens. Data analytics can be used by healthcare companies to enhance patient outcomes and lower healthcare expenditures.
3. Finance: In the field of finance, data analytics can be used to evaluate investment data, spot trends in the financial markets, and make wise investment decisions. Data analytics can be used by financial institutions to lower risk and boost the performance of investment portfolios.
Figure 1.1: Big Data Sphere
VELOCITY
The data growth and social media explosion have changed how we look at data. There was a time when we used to believe that yesterday's data is recent; as a matter of fact, newspapers still follow that model.
BIG DATA TECHNOLOGIES
Figure 1.5: Big Data Layout
1. APACHE HADOOP
Apache Hadoop is one of the main supportive elements in Big Data technologies. It simplifies the processing of large amounts of structured or unstructured data in an inexpensive manner. Hadoop is an open-source project from Apache that has been continuously improving over the years. Hadoop is basically a set of software libraries and frameworks to manage and process big amounts of data, from a single server to thousands of machines.
It provides an efficient and powerful error-detection mechanism based on the application layer rather than relying upon hardware.
In December 2011 Apache released Hadoop 1.0.0; more information and an installation guide can be found in the Apache Hadoop documentation. Hadoop is not a single project but includes a number of other technologies in it.
2. MAPREDUCE
MapReduce was introduced by Google to create large web search indexes. It is basically a framework to write applications that process large amounts of structured or unstructured data over the web. MapReduce takes a query and breaks it into parts to run on multiple nodes. By distributed query processing it makes it easy to maintain large amounts of data by dividing the data across several different machines. Hadoop MapReduce is a software framework for easily writing applications that manage large datasets in a highly fault-tolerant manner. More tutorials and a getting-started guide can be found in the Apache documentation.
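The split-and-combine idea described above can be sketched on a single machine: a "map" step emits (word, 1) pairs, the pairs are grouped by key, and a "reduce" step sums the counts. Real Hadoop MapReduce distributes these steps across many nodes; this toy word count keeps everything in one process and is only an illustration of the programming model.

```python
# Minimal single-machine sketch of the MapReduce model: map emits
# (word, 1) pairs, shuffle groups them by key, reduce sums the counts.
from collections import defaultdict

def map_phase(document):
    """Emit a (word, 1) pair for every word in the document."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    """Group the emitted pairs by key, as the framework would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Sum the counts for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}

docs = ["big data is big", "data is everywhere"]
pairs = [pair for doc in docs for pair in map_phase(doc)]
counts = reduce_phase(shuffle(pairs))
# counts == {"big": 2, "data": 2, "is": 2, "everywhere": 1}
```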
3. HDFS (Hadoop Distributed File System)
HDFS is a Java-based file system that is used to store structured or unstructured data over large clusters of distributed servers. The data stored in HDFS has no restriction or rule to be applied; the data can be either fully unstructured or purely structured. In HDFS the work of making sense of the data is done by the developer's code only. The Hadoop Distributed File System provides a highly fault-tolerant environment, with deployment on low-cost hardware machines. HDFS is now a part of the Apache Hadoop project; more information and an installation guide can be found in the Apache HDFS documentation.
4. HIVE
Hive was originally developed by Facebook and has now been open source for some time. Hive works something like a bridge between SQL and Hadoop; it is basically used to run SQL-like queries on Hadoop clusters. Apache Hive is basically a data warehouse that provides ad-hoc queries, data summarization, and analysis of huge datasets stored in Hadoop-compatible file systems.
Hive provides an SQL-like query language called HiveQL for working with huge amounts of data stored in Hadoop clusters. In January 2013 Apache released Hive 0.10.0; more information and an installation guide can be found in the Apache Hive documentation.
5. PIG
Pig was introduced by Yahoo and was later made fully open source. It also provides a bridge to query data over Hadoop clusters but, unlike Hive, it implements a scripting language to make Hadoop data accessible to developers and businesspersons. Apache Pig provides a high-level programming platform for developers to process and analyze Big Data using user-defined functions and programming effort. In January 2013 Apache released Pig 0.10.1.
Advantages of Big Data:
Big Data is used for many things, but some of the main usage purposes are as follows:
• Big Data enables you to gather information about customers and their experience, and eventually helps you to align it properly.
• It helps in anticipating predictive failures beforehand by analyzing the problems and providing their potential solutions.
• Big Data is also useful for companies to anticipate customer demand, roll out new plans, test markets, etc.
• Big Data is very useful in assessing predictive failures by analyzing various indicators such as unstructured data, error messages, log entries, engine temperature, etc.
• Big Data is also very efficient in maintaining operational functions along with anticipating future demands of the customers and current market demands, thus providing proper results.
Benefits of Big Data in IT Sectors:
• Many old IT companies are fully dependent on big data in order to modernize their outdated mainframes and antiquated code bases by identifying the root causes of failures and issues in real time. Many organizations are replacing their traditional systems with open-source platforms like Hadoop.
• Most big data solutions are based on Hadoop, which allows designs to scale up.
Benefits of Big Data in Business:
• Data quality has a direct impact on business process efficiency. In the purchase-to-pay process, poor-quality vendor data can cause missing purchase contracts or pricing information, which can lead to delays in procuring vital goods. Many companies use big data solutions or algorithms simply to do what they have already been doing, so that there is no data loss; moreover, if we run an algorithm against the dataset, the result might be a list of individuals who exhibit attributes of fraudulent behavior.
• In the order-to-cash process, incomplete or inaccurate credit limits or pricing information can lead to overall customer service loss, reduced revenue, or increased service cost. With the help of big data technologies and the ability to run various algorithms more quickly, the data can be updated at regular intervals throughout the day.
• The systematic analysis of data, or data profiling, is used to assess the overall health of the data, which leads to proper business decisions in accordance with the present situation, because sometimes inaccurate data results in incorrect management, which means business decisions are based on incorrect information. For example, the more one can understand customers' complex relationships, preferences, and interaction history with the company, the more relevant and timely the business outreach.
Benefits of Big Data in Enterprise:
• Big data might allow a company to collect trillions or billions of real-time data points on its products, resources, or customers, and then repackage the data instantaneously to optimize the customer experience.
• The speed at which data is updated using big data technologies allows enterprises to respond more quickly and accurately to customer demands. For example, MetLife used MongoDB to quickly consolidate customer information from over 70 different sources and provide a single, rapidly updated view.
• Big data can help enterprises act more nimbly, allowing them to adapt to changes faster than their competitors.
Big data challenges include storing and analyzing extremely large and fast-growing data.
Some of the Big Data challenges are:
1. Sharing and Accessing Data:
• Perhaps the most frequent challenge in big data efforts is the inaccessibility of datasets from external sources.
• Sharing data can cause substantial challenges.
2. Fault Tolerance:
• Fault tolerance is another technical challenge, and fault-tolerant computing is extremely hard, involving intricate algorithms.
• Nowadays some of the new technologies like cloud computing and big data always intend that whenever a failure occurs, the damage done should be within an acceptable threshold; that is, the whole task should not have to begin again from scratch.
3. Scalability:
Big data projects can grow and evolve rapidly. The scalability issue of Big Data has led towards cloud computing. It also requires dealing with system failures in an efficient manner. This again leads to the big question of what kinds of storage devices are to be used.
TRADITIONAL VS BIG DATA BUSINESS APPROACH
1. Schemaless and Column-Oriented Databases (NoSQL)
We have been using table- and row-based relational databases over the years; these databases are just fine with online transactions and quick updates. When unstructured data and large amounts of data come into the picture, we need databases without a hard-coded schema attachment. There are a number of databases that fit into this category; these databases can store unstructured, semi-structured, or even fully structured data.
Apart from other benefits, the finest thing with schemaless databases is that they make data migration very easy. MongoDB is a very popular and widely used NoSQL database these days. NoSQL and schemaless databases are used when the primary concern is to store a huge amount of data and not to maintain relationships between elements. "NoSQL (not only SQL) is a type of database that does not primarily rely upon a schema-based structure and does not use SQL for data processing."
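The schemaless idea described above can be sketched in pure Python: unlike rows in a relational table, documents in one "collection" need not share the same fields. This toy mimics the shape of a document store like MongoDB but is not MongoDB itself; the names and records are invented.

```python
# Pure-Python sketch of a schemaless "collection": documents with
# different shapes coexist, and queries match on whatever fields exist.
collection = []

def insert(document):
    collection.append(document)

def find(**criteria):
    """Return documents whose fields match all the given criteria."""
    return [doc for doc in collection
            if all(doc.get(k) == v for k, v in criteria.items())]

# Three documents with three different shapes, no schema required:
insert({"name": "Alice", "city": "Hyderabad"})
insert({"name": "Bob", "orders": 42})
insert({"name": "Carol", "city": "Hyderabad", "premium": True})

hyderabad_users = find(city="Hyderabad")   # matches Alice and Carol
```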
Figure 1.6: Big Data
The traditional approach works on structured data that has a basic layout and the structure provided.
Figure 1.7: Static Data
The structured approach designs the database as per the requirements, in tuples and columns. Live incoming data, which can be an input from an ever-changing scenario, cannot be dealt with in this way.
Figure 1.8: Streaming Data
Big data analytics works on unstructured data, where no specific pattern of the data is defined. The data is not organized in rows and columns. The live flow of data is captured and the analysis is done on it. Efficiency increases when the data to be analyzed is large.
Figure 1.9: Big Data Architecture
UNIT-2: Hadoop Technology
Topics:
Introducing Hadoop
History of Hadoop
Conventional approach versus Hadoop
Introduction to the Hadoop Ecosystem
Processing Data with Hadoop
Hadoop Distributors
Use Cases
Challenges in Hadoop
Interacting with Hadoop
HADOOP
1. Hadoop is an open-source framework that supports the processing of large datasets in a distributed computing environment.
2. Hadoop consists of MapReduce, the Hadoop Distributed File System (HDFS), and a number of related projects such as Apache Hive, HBase, and ZooKeeper. MapReduce and the Hadoop Distributed File System (HDFS) are the main components of Hadoop.
3. Apache Hadoop is an open-source, free, Java-based software framework that offers a powerful distributed platform to store and manage Big Data.
4. It is licensed under an Apache V2 license.
5. It runs applications on large clusters of commodity hardware and processes thousands of terabytes of data on thousands of nodes. Hadoop is inspired by Google's MapReduce and Google File System (GFS) papers.
6. The major advantage of the Hadoop framework is that it provides reliability and high availability.
USE OF HADOOP
There are many advantages of using Hadoop:
1. Robust and Scalable: We can add new nodes as needed, as well as modify them.
2. Affordable and Cost-Effective: We do not need any special hardware for running Hadoop. We can just use commodity servers.
3. Adaptive and Flexible: Hadoop is built keeping in mind that it will handle structured and unstructured data.
4. Highly Available and Fault-Tolerant: When a node fails, the Hadoop framework automatically fails over to another node.
CONVENTIONAL APPROACH VERSUS HADOOP
There are two major components of the Hadoop framework, and both of them perform two of the important tasks for it.
1. Hadoop MapReduce is the method to split a larger data problem into smaller chunks and distribute them to many different commodity servers. Each server has its own set of resources, and they process the chunks locally. Once a commodity server has processed the data, it sends the results back collectively to the main server. This is effectively a process by which we process large data effectively and efficiently.
2. The Hadoop Distributed File System (HDFS) is a virtual file system. There is a big difference between any other file system and Hadoop: when we move a file onto HDFS, it is automatically split into many small pieces. These small chunks of the file are replicated and stored on other servers (usually 3) for fault tolerance or high availability.
3. NameNode: The NameNode is the heart of the Hadoop system. The NameNode manages the file system namespace. It stores the metadata of the data blocks. This metadata is stored permanently on the local disk in the form of a namespace image and an edit log file. The NameNode also knows the location of the data blocks on the data nodes. However, the NameNode does not store this information persistently; it creates the block-to-DataNode mapping when it is restarted. If the NameNode crashes, the entire Hadoop system goes down.
4. Secondary NameNode: The responsibility of the Secondary NameNode is to periodically copy and merge the namespace image and edit log. If the NameNode crashes, the namespace image stored in the Secondary NameNode can be used to restart the NameNode.
5. DataNode: It stores the blocks of data and retrieves them. The DataNodes also report the blocks' information to the NameNode periodically.
6. Job Tracker: The Job Tracker's responsibility is to schedule the clients' jobs. The Job Tracker creates map and reduce tasks and schedules them to run on the DataNodes (task trackers). The Job Tracker also checks for any failed tasks and reschedules them on another DataNode. The Job Tracker can be run on the NameNode or on a separate node.
7. Task Tracker: The Task Tracker runs on the DataNodes. The Task Tracker's responsibility is to run the map or reduce tasks assigned by the Job Tracker and to report the status of the tasks to it.
Besides the above two core components, the Hadoop project also contains the following modules:
1. Hadoop Common: Common utilities for the other Hadoop modules
2. Hadoop YARN: A framework for job scheduling and cluster resource management
RDBMS
Why can't we use databases with lots of disks to do large-scale batch analysis? Why is MapReduce needed?
The answer to these questions comes from another trend in disk drives: seek time is improving more slowly than transfer rate. Seeking is the process of moving the disk's head to a particular place on the disk to read or write data. It characterizes the latency of a disk operation, whereas the transfer rate corresponds to a disk's bandwidth.
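To make the seek-versus-transfer tradeoff concrete, here is a back-of-the-envelope calculation. The transfer rate, seek time, and record size below are assumed round numbers for illustration, not measurements:

```python
# Illustrative arithmetic (assumed figures, not benchmarks): compare reading
# a dataset sequentially at the transfer rate with seeking to every record.
TRANSFER_RATE_MB_S = 100   # assumed sequential transfer rate
SEEK_TIME_MS = 10          # assumed average seek time
RECORD_KB = 100            # assumed record size

def stream_seconds(total_mb):
    """Time to stream the whole dataset at the transfer rate."""
    return total_mb / TRANSFER_RATE_MB_S

def seek_seconds(total_mb):
    """Time to reach every record via a separate seek."""
    records = total_mb * 1024 / RECORD_KB
    return records * SEEK_TIME_MS / 1000

# For a 1 TB dataset, streaming takes 10,000 s (under 3 hours), while one
# seek per record would take 102,400 s (over a day).
print(stream_seconds(1_000_000), seek_seconds(1_000_000))
```

With these (assumed) numbers the seek-dominated pattern is an order of magnitude slower, which is why MapReduce favors streaming reads.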
If the data access pattern is dominated by seeks, it will take longer to read or write large portions of the dataset than streaming through it, which operates at the transfer rate. On the other hand, for updating a small proportion of records in a database, a traditional B-Tree (the data structure used in relational databases, which is limited by the rate at which it can perform seeks) works well. For updating the majority of a database, a B-Tree is less efficient than MapReduce, which uses Sort/Merge to rebuild the database.
Keys and values for MapReduce are not an intrinsic property of the data; they are chosen by the person analyzing the data.
Relational data is often normalized to retain its integrity and remove redundancy. Normalization poses problems for MapReduce, since it makes reading a record a non-local operation, and one of the central assumptions that MapReduce makes is that it is possible to perform (high-speed) streaming reads and writes.
A web server log is a good example of a set of records that is not normalized (for example, the client hostnames are specified in full each time, even though the same client may appear many times), and this is one reason that log files of all kinds are particularly well suited to analysis with MapReduce.
MapReduce is a linearly scalable programming model. The programmer writes two functions, a map function and a reduce function, each of which defines a mapping from one set of key-value pairs to another. These functions are oblivious to the size of the data or the cluster that they are operating on, so they can be used unchanged for a small dataset and for a massive one. More important, if you double the size of the input data, a job will run twice as slow. But if you also double the size of the cluster, a job will run as fast as the original one. This is not generally true of SQL queries.
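The linear-scalability claim can be captured in a toy cost model. This is a deliberate simplification that ignores startup and shuffle overheads, which real jobs do have:

```python
# Toy model of linear scalability: job time grows with data size and
# shrinks with cluster size (overheads are ignored for illustration).
def job_time(data_units, nodes, unit_time=1.0):
    """Seconds to process `data_units` of work spread over `nodes` machines."""
    return data_units * unit_time / nodes

base = job_time(100, 10)
assert job_time(200, 10) == 2 * base   # double the data: twice as slow
assert job_time(200, 20) == base       # double data AND cluster: unchanged
```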
Over time, however, the differences between relational databases and MapReduce systems are likely to blur, both as relational databases start incorporating some of the ideas from MapReduce (such as Aster Data's and Greenplum's databases) and, from the other direction, as higher-level query languages built on MapReduce (such as Pig and Hive) make MapReduce systems more approachable to traditional database programmers.
A BRIEF HISTORY OF HADOOP
Hadoop was created by Doug Cutting, the creator of Apache Lucene, the widely used text search library. Hadoop has its origins in Apache Nutch, an open source web search engine, itself a part of the Lucene project.
Building a web search engine from scratch was an ambitious goal, for not only is the software required to crawl and index websites complex to write, but it is also a challenge to run without a dedicated operations team, since there are so many moving parts. It's expensive, too: Mike Cafarella and Doug Cutting estimated a system supporting a 1-billion-page index would cost around half a million dollars in hardware, with a monthly running cost of $30,000. Nevertheless, they believed it was a worthy goal, as it would open up and ultimately democratize search engine algorithms.
Nutch was started in 2002, and a working crawler and search system quickly emerged. However,
Google reported that its MapReduce implementation sorted one terabyte in 68 seconds. As the first edition of this book was going to press (May 2009), it was announced that a team at Yahoo! used Hadoop to sort one terabyte in 62 seconds.
PROCESSING THE DATA WITH HADOOP
To take advantage of the parallel processing that Hadoop provides, we need to express our query as a MapReduce job. After some local, small-scale testing, we will be able to run it on a cluster of machines.
MAP AND REDUCE
MapReduce works by breaking the processing into two phases: the map phase and the reduce phase. Each phase has key-value pairs as input and output, the types of which may be chosen by the programmer. The programmer also specifies two functions: the map function and the reduce function.
The input to our map phase is the raw NCDC data. We choose a text input format that gives us each line in the dataset as a text value. The key is the offset of the beginning of the line from the beginning of the file, but as we have no need for this, we ignore it.
Our map function is simple. We pull out the year and the air temperature, since these are the only fields we are interested in. In this case, the map function is just a data preparation phase, setting up the data in such a way that the reduce function can do its work on it: finding the maximum temperature for each year. The map function is also a good place to drop bad records: here we filter out temperatures that are missing, suspect, or erroneous.
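This filtering role of the map function can be sketched with a hypothetical helper in plain Python. The 9999 "missing" sentinel and the set of acceptable quality codes follow the NCDC convention, but the helper itself is an illustration, not Hadoop code:

```python
# Hypothetical helper illustrating the "drop bad records" step: emit a
# (year, temperature) pair only when the reading is present and its
# quality code is acceptable.
MISSING = 9999                         # NCDC sentinel for "no reading"
GOOD_QUALITY = {"0", "1", "4", "5", "9"}

def filter_record(year, temp, quality):
    """Yield (year, temp) for good readings; yield nothing otherwise."""
    if temp != MISSING and quality in GOOD_QUALITY:
        yield year, temp

print(list(filter_record(1950, 0, "1")))        # [(1950, 0)]
print(list(filter_record(1950, MISSING, "1")))  # []  (missing reading dropped)
```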
To visualize the way the map works, consider the following sample lines of input data (some unused columns have been dropped to fit the page, indicated by ellipses):
0067011990999991950051507004...9999999N9+00001+99999999999...
0043011990999991950051512004...9999999N9+00221+99999999999...
0043011990999991950051518004...9999999N9-00111+99999999999...
0043012650999991949032412004...0500001N9+01111+99999999999...
0043012650999991949032418004...0500001N9+00781+99999999999...
These lines are presented to the map function as the following key-value pairs:
(0,0067011990999991950051507004...9999999N9+00001+99999999999 ...)
(106,0043011990999991950051512004...9999999N9+00221+99999999999...)
(212,0043011990999991950051518004...9999999N9-00111+99999999999...)
(318,0043012650999991949032412004...0500001N9+01111+99999999999...)
(424,0043012650999991949032418004...0500001N9+00781+99999999999...)
The keys are the line offsets within the file, which we ignore in our map function. The map function merely extracts the year and the air temperature, and emits them as its output
(the temperature values have been interpreted as integers):
(1950,0)
(1950,22)
(1950,-11)
(1949,111)
(1949,78)
The output from the map function is processed by the MapReduce framework before being sent to the reduce function. This processing sorts and groups the key-value pairs by key. So, continuing the example, our reduce function sees the following input:
(1949,[111,78])
(1950,[0,22,-11])
Each year appears with a list of all its air temperature readings. All the reduce function has to do now is iterate through the list and pick up the maximum reading:
(1949,111)
(1950,22)
This is the final output: the maximum global temperature recorded in each year.
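The steps above can be reproduced with a minimal pure-Python simulation. This is not the Hadoop API: map emits (year, temperature) pairs, a shuffle step stands in for the framework's sort-and-group, and reduce picks the maximum per year:

```python
# Minimal simulation of the MapReduce flow described above (not Hadoop code).
from collections import defaultdict

def map_fn(parsed_records):
    # In real Hadoop the map function would parse each input line; here the
    # (year, temperature) fields are assumed to be extracted already.
    for year, temp in parsed_records:
        yield year, temp

def shuffle(pairs):
    # The framework's sort-and-group step between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reduce_fn(year, temps):
    return year, max(temps)

map_output = list(map_fn([(1950, 0), (1950, 22), (1950, -11),
                          (1949, 111), (1949, 78)]))
reduce_input = shuffle(map_output)   # [(1949, [111, 78]), (1950, [0, 22, -11])]
result = [reduce_fn(year, temps) for year, temps in reduce_input]
print(result)                        # [(1949, 111), (1950, 22)]
```

The intermediate and final values match the worked example in the text.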
The whole data flow is illustrated in Figure 2.2. At the bottom of the diagram is a Unix pipeline, which mimics the whole MapReduce flow, and which we will see again later in the chapter when we look at Hadoop Streaming.
The map() method also provides an instance of Context to write the output to. In this case, we write the year as a Text object (since we are just using it as a key), and the temperature is wrapped in an IntWritable. We write an output record only if the temperature is present and the quality code indicates the temperature reading is OK.
Again, four formal type parameters are used to specify the input and output types, this time for the reduce function. The input types of the reduce function must match the output types of the map function: Text and IntWritable. And in this case, the output types of the reduce function are Text and IntWritable, for a year and its maximum temperature, which we find by iterating through the temperatures and comparing each with a record of the highest found so far.
Having constructed a Job object, we specify the input and output paths. An input path is specified by calling the static addInputPath() method on FileInputFormat, and it can be a single file, a directory (in which case the input forms all the files in that directory), or a file pattern. As the name suggests, addInputPath() can be called more than once to use input from multiple paths.
The output path (of which there is only one) is specified by the static setOutputPath() method on FileOutputFormat. It specifies a directory where the output files from the reducer functions are written. The directory shouldn't exist before running the job, as Hadoop will complain and not run the job.
system: five map inputs produced five map outputs, then five reduce inputs in two groups
produced two reduce outputs.
The output was written to the output directory, which contains one output file per reducer. The job had a single reducer, so we find a single file, named part-r-00000:
% cat output/part-r-00000
1949 111
1950 22
This result is the same as when we went through it by hand earlier. We interpret this as saying that the maximum temperature recorded in 1949 was 11.1°C, and in 1950 it was 2.2°C.
THE OLD AND THE NEW JAVA MAPREDUCE APIS
The Java MapReduce API used in the previous section was first released in Hadoop 0.20.0. This new API, sometimes referred to as "Context Objects," was designed to make the API easier to evolve in the future.
The new API makes extensive use of "context objects" that allow the user code to communicate with the MapReduce system. The new Context, for example, essentially unifies the role of the JobConf, the OutputCollector, and the Reporter from the old API.
In both APIs, key-value record pairs are pushed to the mapper and reducer, but in addition, the new API allows both mappers and reducers to control the execution flow by overriding the run() method. For example, records can be processed in batches, or the execution can be terminated before all the records have been processed. In the old API this is possible for mappers by writing a MapRunnable, but no equivalent exists for reducers.
Configuration has been unified. The old API has a special JobConf object for job configuration, which is an extension of Hadoop's vanilla Configuration object (used for configuring daemons). In the new API, this distinction is dropped, so job configuration is done through a Configuration.
Job control is performed through the Job class in the new API, rather than the old JobClient, which no longer exists in the new API.
Output files are named slightly differently: in the old API both map and reduce outputs are named part-nnnnn, while in the new API map outputs are named part-m-nnnnn, and reduce outputs are
named part-r-nnnnn (where nnnnn is an integer designating the part number, starting from zero).
User-overridable methods in the new API are declared to throw java.lang.InterruptedException. What this means is that you can write your code to be responsive to interrupts so that the framework can gracefully cancel long-running operations if it needs to.
In the new API, the reduce() method passes values as a java.lang.Iterable, rather than a java.lang.Iterator (as the old API does). This change makes it easier to iterate over the values using Java's for-each loop construct: for (VALUEIN value : values) { ... }
HADOOP ECOSYSTEM
Although Hadoop is best known for MapReduce and its distributed filesystem (HDFS, renamed from NDFS), the term is also used for a family of related projects that fall under the umbrella of infrastructure for distributed computing and large-scale data processing.
All of the core projects covered in this book are hosted by the Apache Software Foundation, which provides support for a community of open source software projects, including the original HTTP Server from which it gets its name. As the Hadoop ecosystem grows, more projects are appearing, not necessarily hosted at Apache, which provide complementary services to Hadoop, or build on the core to add higher-level abstractions.
The Hadoop projects that are covered in this book are described briefly here:
Common
A set of components and interfaces for distributed filesystems and general I/O (serialization, Java RPC, persistent data structures).
Avro
A serialization system for efficient, cross-language RPC and persistent data storage.
MapReduce
A distributed data processing model and execution environment that runs on large clusters of commodity machines.
HDFS
A distributed filesystem that runs on large clusters of commodity machines.
Pig
A dataflow language and execution environment for exploring very large datasets. Pig runs on HDFS and MapReduce clusters.
Hive
A distributed data warehouse. Hive manages data stored in HDFS and provides a query language
based on SQL (which is translated by the runtime engine to MapReduce jobs) for querying the data.
HBase
A distributed, column-oriented database. HBase uses HDFS for its underlying storage, and supports both batch-style computations using MapReduce and point queries (random reads).
ZooKeeper
A distributed, highly available coordination service. ZooKeeper provides primitives such as distributed locks that can be used for building distributed applications.
Sqoop
A tool for efficiently moving data between relational databases and HDFS.
PHYSICAL ARCHITECTURE
Figure 2.2: Physical Architecture
Hadoop Cluster - Architecture, Core Components and Workflow
1. The architecture of a Hadoop cluster
2. Core components of a Hadoop cluster
3. Workflow of how a file is stored in Hadoop
HADOOP CLUSTER
i. A Hadoop cluster is a special type of computational cluster designed for storing and analyzing vast amounts of unstructured data in a distributed computing environment.
MallareddyEngineeringCollegeForWomen(AutonomousInstitution,UGC-Govt.ofIndia) 32
IntroductiontoBigDataAnalytics Dept.ofCSEAI &ML
Figure 2.3: Hadoop Cluster
ii. These clusters run on low-cost commodity computers.
iii. Hadoop clusters are often referred to as "shared nothing" systems because the only thing that is shared between nodes is the network that connects them.
Figure 2.4: Shared Nothing
Large Hadoop clusters are arranged in several racks. Network traffic between different nodes in the same rack is much more desirable than network traffic across the racks.
A real-time example: Yahoo!'s Hadoop cluster. They have more than 10,000 machines running Hadoop and nearly 1 petabyte of user data.
Figure 2.4: Yahoo Hadoop Cluster
• A small Hadoop cluster includes a single master node and multiple worker or slave nodes. As discussed earlier, the entire cluster contains two layers.
• One of the layers is the MapReduce layer and the other is the HDFS layer.
• Each of these layers has its own relevant components.
• The master node consists of a JobTracker, TaskTracker, NameNode and DataNode.
• A slave or worker node consists of a DataNode and a TaskTracker.
It is also possible for a slave or worker node to be only a data node or only a compute node; as a matter of fact, that is a key feature of Hadoop.
Figure 2.4: NameNode Cluster
HADOOP CLUSTER ARCHITECTURE:
Figure 2.5: Hadoop Cluster Architecture
A Hadoop cluster would consist of:
► 110 different racks
► Each rack would have around 40 slave machines
► At the top of each rack there is a rack switch
► Each slave machine (rack server in a rack) has cables coming out of it from both ends
► Cables are connected to the rack switch at the top, which means that the top rack switch will have around 80 ports
► There are 8 global core switches
► The rack switch has uplinks connected to the core switches, thereby connecting all the other racks with uniform bandwidth and forming the cluster. In the cluster, you have a few machines that act as NameNode and as JobTracker. They are referred to as Masters. These masters have a different configuration, favoring more DRAM and CPU and less local storage.
A Hadoop cluster has 3 components:
1. Client
2. Master
3. Slave
The role of each component is shown in the image below.
Figure 2.6: Hadoop Core Component
Client:
It is neither master nor slave; rather, it plays the role of loading the data into the cluster, submitting MapReduce jobs describing how the data should be processed, and then retrieving the data to see the response after job completion.
Figure 2.6: Hadoop Client
Masters:
The Masters consist of 3 components: NameNode, Secondary NameNode and JobTracker.
Figure 2.7: MapReduce-HDFS
NameNode:
The NameNode does NOT store the files, only the files' metadata. In a later section we will see that it is actually the DataNode which stores the files.
Figure 2.8: NameNode
• The NameNode oversees the health of the DataNodes and coordinates access to the data stored in the DataNodes.
• The NameNode keeps track of all the filesystem-related information, such as:
• Which section of a file is saved in which part of the cluster
• Last access time for the files
• User permissions, like which user has access to the file
JobTracker:
The JobTracker coordinates the parallel processing of data using MapReduce.
Secondary NameNode
Figure 2.9: Secondary NameNode
• The job of the Secondary NameNode is to contact the NameNode in a periodic manner, after a certain time interval (by default 1 hour).
• The NameNode, which keeps all filesystem metadata in RAM, has no capability of its own to persist that metadata onto disk.
• If the NameNode crashes, you lose everything that was in RAM, and you don't have any backup of the filesystem.
• What the Secondary NameNode does is contact the NameNode every hour and pull a copy of the metadata information out of the NameNode.
• It shuffles and merges this information into a clean file, sends it back to the NameNode, and keeps a copy for itself.
• Hence the Secondary NameNode is not a backup; rather, it does the job of housekeeping.
• In case of NameNode failure, the saved metadata can rebuild it easily.
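The housekeeping described above can be modeled with a toy checkpoint function. This is an illustration only; the real NameNode fsimage and edit-log formats are binary and considerably more involved:

```python
# Toy model (not Hadoop code) of the Secondary NameNode's housekeeping:
# fold the edit log into the namespace image, producing a fresh
# checkpoint image and an empty edit log.
def checkpoint(fsimage, edit_log):
    image = dict(fsimage)                 # start from the old image
    for op, path, meta in edit_log:       # replay each logged operation
        if op == "create":
            image[path] = meta
        elif op == "delete":
            image.pop(path, None)
    return image, []                      # merged image, truncated edit log

image, edits = checkpoint({"/a": {"blocks": 1}},
                          [("create", "/b", {"blocks": 2}),
                           ("delete", "/a", None)])
print(image)  # {'/b': {'blocks': 2}}
```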
Slaves:
Slave nodes are the majority of machines in a Hadoop cluster and are responsible to:
• Store the data
• Process the computation
• Each slave runs both a DataNode and a TaskTracker daemon, which communicate with their masters.
• The TaskTracker daemon is a slave to the JobTracker, and the DataNode daemon is a slave to the NameNode.
Figure 2.10: Slaves
Hadoop - Typical Workflow in HDFS: Take the example of an input file, Sample.txt.
Figure 2.11: HDFS Workflow
1. How does Sample.txt get loaded into the Hadoop cluster?
Figure 2.12: Loading a file into the Hadoop Cluster
• The client machine does this step and loads Sample.txt into the cluster.
• It breaks Sample.txt into smaller chunks, which are known as "blocks" in the Hadoop context.
• The client puts these blocks on different machines (DataNodes) throughout the cluster.
2. Next, how does the client know to which DataNodes it should load the blocks?
• Now the NameNode comes into the picture.
• The NameNode uses its rack-awareness intelligence to decide which DataNodes to provide.
• For each data block (in this case Block A, Block B and Block C), the client contacts the NameNode, and in response the NameNode sends an ordered list of 3 DataNodes.
For example, in response to the Block A request, the NameNode may send DataNode-2, DataNode-3 and DataNode-4.
The Block B DataNodes list may be DataNode-1, DataNode-3, DataNode-4, and the Block C DataNode list may be DataNode-1, DataNode-2, DataNode-3. Hence:
• Block A gets stored in DataNode-2, DataNode-3, DataNode-4
• Block B gets stored in DataNode-1, DataNode-3, DataNode-4
• Block C gets stored in DataNode-1, DataNode-2, DataNode-3
• Every block is replicated to more than 1 DataNode to ensure data recovery at the time of machine failures. That is why the NameNode sends a list of 3 DataNodes for each individual block.
4. Who does the block replication?
• The client writes the data block directly to one DataNode.
• The DataNodes then replicate the block to other DataNodes.
• Only when one block has been written to all 3 DataNodes does the cycle repeat for the next block.
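The placement dialogue above can be sketched with a hypothetical helper. Real HDFS uses rack awareness to choose replica locations; this toy version simply rotates through the available DataNodes for illustration:

```python
# Hypothetical sketch of the NameNode's answer: for each block, return an
# ordered list of `replication` DataNodes. Real HDFS placement is
# rack-aware; this illustration just rotates through the node list.
def place_blocks(blocks, datanodes, replication=3):
    placement = {}
    for i, block in enumerate(blocks):
        placement[block] = [datanodes[(i + r) % len(datanodes)]
                            for r in range(replication)]
    return placement

plan = place_blocks(["A", "B", "C"], ["DN1", "DN2", "DN3", "DN4"])
print(plan)  # each block mapped to an ordered list of 3 DataNodes
```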
5. A few additional points about replication and the NameNode:
• In Hadoop Gen 1 there is only one NameNode, whereas in Gen 2 there is an active-passive model for the NameNode, in which one more node, the "passive node," comes into the picture.
• The default setting for Hadoop is to have 3 copies of each block in the cluster. This setting can be configured with the "dfs.replication" parameter of the hdfs-site.xml file.
• Keep in mind that the client directly writes the block to the DataNode without any intervention from the NameNode in this process.
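The dfs.replication setting mentioned above lives in hdfs-site.xml. A minimal sketch of such a configuration file might look like this (the property name is the real one; the value 3 is the default):

```xml
<configuration>
  <!-- Number of replicas kept for each HDFS block (default is 3) -->
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>
```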
Challenges in Hadoop
• The Network File System (NFS) is the oldest and the most commonly used distributed file system, and it was designed for a general class of applications, whereas only specific kinds of applications can make use of Hadoop.
• Hadoop was created to address the limitations of the distributed file system: it can store a large amount of data, offers failure protection and provides fast access. But it should be known that the benefits that come with Hadoop come at some cost.
• Hadoop is designed for applications that require streaming reads rather than random reads. So if a file has four parts, the expectation is to read all the parts one by one, going from 1 to 4 till the end; a random seek, where you want to go to a specific location in the file, is something that isn't efficient with Hadoop. Hence, Hadoop is designed for non-real-time batch processing of data.
• Because Hadoop is designed for streaming reads, caching of data isn't provided. Caching would mean that when you want to read data another time, it can be read very fast from the cache; this caching isn't available in Hadoop because you get faster access to the data directly by doing a sequential read.
• A typical application will write the data and then read it several times; it will not be updating the data that it has written. Hence, updating data written to closed files is not available. However, you have to know that from release 0.19, appending will be supported for those files that aren't closed; but for those files that have been closed, updating isn't possible.
• In the case of Hadoop we aren't talking about one computer; in this scenario we usually have a large number of computers, and hardware failures are unavoidable: sometimes one computer will fail, and sometimes an entire rack can fail too. Hadoop gives excellent protection against hardware failure; however, the performance will go down in proportion to the number of computers that are down. In the big picture, this doesn't really matter and is not generally noticeable, since if you have 100 computers and 3 of them fail, then 97 are still working, so the proportionate loss of performance isn't that noticeable. Nevertheless, the way Hadoop works, there is a loss in performance, and this loss of performance through hardware failures is managed through the replication strategy.
UNIT - III
HADOOP FILE SYSTEM
► Introduction to Hadoop Distributed File System (HDFS)
► Overview
► The Design of HDFS
► HDFS Concepts
► Basic Filesystem
► Hadoop Filesystems
► Basic Filesystem vs Hadoop Filesystems
► Local File System
► File-Based Data Structures
► Sequential File
► MapFile
► The Java Interface
► Library Classes
► Inserting/Reading Data from a Hadoop URL
► Inserting/Reading Data Using the Filesystem API
► Writing/Storing Data
► Data Flow - Anatomy of a File Read
► Anatomy of a File Write
► Limitations
With growing data velocity the data size easily outgrows the storage limit of a machine. A solution
would be to store the data across a network of machines. Such filesystems are called distributed
filesystems. Since data is stored across a network all the complications of a network come in.
This is where Hadoop comes in. It provides one of the most reliable filesystems. HDFS (Hadoop
Distributed File System) is a unique design that provides storage for extremely large files with
streaming data access pattern and it runs on commodity hardware. Let’s elaborate the terms:
Extremely large files: Here we are talking about data in the range of petabytes (1000 TB).
Streaming Data Access Pattern: HDFS is designed on the principle of write-once, read-many-times. Once data is written, large portions of the dataset can be processed any number of times.
Commodity hardware: Hardware that is inexpensive and easily available in the market. This is one of the features which specially distinguishes HDFS from other file systems.
Nodes: Master-slave nodes typically form the HDFS cluster.
1. NameNode (Master Node):
Manages all the slave nodes and assigns work to them.
It executes filesystem namespace operations like opening, closing, and renaming files and directories.
It should be deployed on reliable hardware with a high-end configuration, not on commodity hardware.
2. DataNode (Slave Node):
Actual worker nodes, which do the actual work like reading, writing, and processing.
They also perform creation, deletion, and replication upon instruction from the master.
They can be deployed on commodity hardware.
HDFS daemons: Daemons are the processes running in the background.
Namenodes:
Let's assume that a 100 TB file is inserted; the master node (NameNode) will first divide the file into blocks of 10 TB (the default block size is 128 MB in Hadoop 2.x and above).
Then these blocks are stored across the different DataNodes (slave nodes).
DataNodes (slave nodes) replicate the blocks among themselves, and the information of what blocks they contain is sent to the master.
The default replication factor is 3, which means that for each block 3 replicas are created (including itself). In hdfs-site.xml we can increase or decrease the replication factor, i.e. we can edit its configuration there.
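The splitting and replication just described can be put into a back-of-the-envelope calculation. The block size and replication factor mirror the defaults named above; the helper itself is a simplification (it counts the last, possibly partial, block as full):

```python
# Hypothetical sketch of how HDFS splits and replicates a file.
import math

BLOCK_SIZE_MB = 128      # default block size in Hadoop 2.x and above
REPLICATION_FACTOR = 3   # default dfs.replication

def hdfs_footprint(file_size_mb):
    """Return (number of blocks, total replicated storage in MB).
    The last partial block is counted as a full block for simplicity."""
    blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    return blocks, blocks * BLOCK_SIZE_MB * REPLICATION_FACTOR

# A 1 GB (1024 MB) file splits into 8 blocks; with 3 replicas the
# cluster holds 24 block copies, i.e. 3072 MB of raw storage.
print(hdfs_footprint(1024))  # (8, 3072)
```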
Note: The master node (NameNode) has a record of everything; it knows the location and info of each and every DataNode and the blocks they contain, i.e. nothing is done without the permission of the master node.
Even if multiple DataNodes are down, we can still do our work, thus making the system highly reliable and giving it high fault tolerance.
Limitations: Though HDFS provides many features, there are some areas where it doesn't work well.
Low-latency data access: Applications that require low-latency access to data, i.e. in the range of milliseconds, will not work well with HDFS, because HDFS is designed keeping in mind that we need high throughput of data, even at the cost of latency.
Small file problem: Having lots of small files will result in lots of seeks and lots of movement from one DataNode to another to retrieve each small file; this whole process is a very inefficient data access pattern.
Hadoop - Introduction
Hadoop is an Apache open source framework written in Java that allows distributed processing of
large datasets across clusters of computers using simple programming models. The Hadoop
framework application works in an environment that provides
distributed storage and computation across clusters of computers. Hadoop is designed to scale up
from single server to thousands of machines, each offering local computation and storage.
Hadoop Architecture
MapReduce
MapReduce is a parallel programming model for writing distributed applications devised at Google
for efficient processing of large amounts of data (multi-terabyte data-sets), on large clusters
(thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner. The MapReduce
program runs on Hadoop which is an Apache open-source framework.
The Hadoop Distributed File System (HDFS) is based on the Google File System (GFS) and provides
a distributed file system that is designed to run on commodity hardware. It has many similarities with
existing distributed file systems. However, the differences from other distributed file systems are
significant. It is highly fault-tolerant and is designed to be deployed on low-cost hardware. It
provides high throughput access to application data and is suitable for applications having large
datasets.
Apart from the above-mentioned two core components, Hadoop framework also includes the
following two modules −
Hadoop Common − These are Java libraries and utilities required by other Hadoop modules.
Hadoop YARN − This is a framework for job scheduling and cluster resource management.
It is quite expensive to build bigger servers with heavy configurations that handle large scale
processing, but as an alternative, you can tie together many commodity computers, each with a single CPU, as a single functional distributed system; practically, the clustered machines can read the dataset in parallel and provide a much higher throughput. Moreover, it is cheaper than one high-end server.
So this is the first motivational factor behind using Hadoop that it runs across clustered and low-cost
machines.
Hadoop runs code across a cluster of computers. This process includes the following core tasks that
Hadoop performs −
Data is initially divided into directories and files. Files are divided into uniform sized blocks of 128M
and 64M (preferably 128M).
These files are then distributed across various cluster nodes for further processing.
HDFS, being on top of the local file system, supervises the processing.
Blocks are replicated for handling hardware failure.
Checking that the code was executed successfully.
Performing the sort that takes place between the map and reduce stages.
Sending the sorted data to a certain computer.
Writing the debugging logs for each job.
Advantages of Hadoop
The Hadoop framework allows the user to quickly write and test distributed systems. It is efficient, and it automatically distributes the data and work across the machines and, in turn, utilizes the underlying
parallelism of the CPU cores.
Hadoop does not rely on hardware to provide fault-tolerance and high availability (FTHA), rather
Hadoop library itself has been designed to detect and handle failures at the application layer.
Servers can be added or removed from the cluster dynamically and Hadoop continues to operate
without interruption.
Another big advantage of Hadoop is that apart from being open source, it is compatible with all platforms since it is Java-based.
The Hadoop File System was developed using a distributed file system design. It runs on commodity hardware. Unlike other distributed systems, HDFS is highly fault-tolerant and designed using low-cost hardware.
HDFS holds very large amounts of data and provides easy access. To store such huge data, the files
are stored across multiple machines. These files are stored in a redundant fashion to rescue the system
from possible data loss in case of failure. HDFS also makes applications available for parallel
processing.
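A minimal sketch of the storage cost of this redundancy (the class and method names here are illustrative, not part of Hadoop; HDFS's default replication factor is 3):

```java
public class ReplicationCost {
    // Raw cluster bytes consumed by a file stored with the given
    // replication factor (the HDFS default replication factor is 3).
    static long rawBytes(long fileSize, int replication) {
        return fileSize * replication;
    }

    public static void main(String[] args) {
        long oneGb = 1024L * 1024 * 1024;
        // A 1 GB file with replication 3 consumes 3 GB of raw disk across the cluster.
        System.out.println(rawBytes(oneGb, 3));
    }
}
```

The redundancy trades disk space for durability: losing any single machine still leaves two copies of every block.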
HDFS is a filesystem designed for storing very large files with streaming data access patterns,
running on clusters of commodity hardware. Very large files: “Very large” in this context means files
that are hundreds of megabytes, gigabytes, or terabytes in size. There are Hadoop clusters running
today that store petabytes of data. Streaming data access : HDFS is built around the idea that the most
efficient data processing pattern is a write-once, read-many-times pattern. A dataset is typically
MallareddyEngineeringCollegeForWomen(AutonomousInstitution,UGC-Govt.ofIndia) 45
IntroductiontoBigDataAnalytics Dept.ofCSEAI&ML
generated or copied from source, then various analyses are performed on that dataset over time.
Commodity hardware : Hadoop doesn’t require expensive, highly reliable hardware to run on. It’s
designed to run on clusters of commodity hardware (commonly available hardware from multiple
vendors) for which the chance of node failure across the cluster is high, at least for large
clusters. HDFS is designed to carry on working without a noticeable interruption to the user in the
face of such failure. These are areas where HDFS is not a good fit today: Low-latency data access :
Applications that require low-latency access to data, in the tens of milliseconds range, will not work
well with HDFS. Lots of small files : Since the namenode holds filesystem metadata in memory, the
limit to the number of files in a filesystem is governed by the amount of memory on the namenode.
Multiple writers, arbitrary file modifications: Files in HDFS may be written to by a single writer.
Writes are always made at the end of the file. There is no support for multiple writers, or for
modifications at arbitrary offsets in the file.
Features of HDFS
It is suitable for distributed storage and processing.
Hadoop provides a command interface to interact with HDFS.
The built-in servers of namenode and datanode help users to easily check the status of cluster.
Streaming access to file system data.
HDFS provides file permissions and authentication.
HDFS Architecture
HDFS CONCEPTS:
HDFS follows the master-slave architecture and it has the following elements.
1. Blocks: A block is the minimum amount of data that HDFS can read or write. HDFS blocks are 128 MB
by default, and this is configurable. Files in HDFS are broken into block-sized chunks, which are stored
as independent units. Unlike in a local file system, if a file in HDFS is smaller than the block size, it
does not occupy the full block's size; i.e., a 5 MB file stored in HDFS with a block size of 128 MB takes
only 5 MB of space. The HDFS block size is large in order to minimize the cost of seeks.
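The block arithmetic above can be sketched in plain Java (the class and method names are illustrative, not part of the HDFS API):

```java
public class BlockMath {
    static final long BLOCK_SIZE = 128L * 1024 * 1024; // 128 MB default block size

    // Number of HDFS blocks a file of the given size occupies.
    static long blockCount(long fileSize) {
        return (fileSize + BLOCK_SIZE - 1) / BLOCK_SIZE; // ceiling division
    }

    // Bytes actually stored for the last (possibly partial) block:
    // a small file does not consume a full block's worth of space.
    static long lastBlockBytes(long fileSize) {
        long rem = fileSize % BLOCK_SIZE;
        return rem == 0 ? Math.min(fileSize, BLOCK_SIZE) : rem;
    }

    public static void main(String[] args) {
        long fiveMb = 5L * 1024 * 1024;
        System.out.println(blockCount(fiveMb));     // 1 block
        System.out.println(lastBlockBytes(fiveMb)); // 5 MB, not 128 MB
        System.out.println(blockCount(1024L * 1024 * 1024)); // a 1 GB file -> 8 blocks
    }
}
```

So a 5 MB file occupies one block but only 5 MB of disk, while a 1 GB file is split into eight full 128 MB blocks.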
2. Name Node: HDFS works in a master-worker pattern where the name node acts as the master. The
Name Node is the controller and manager of HDFS, as it knows the status and the metadata of all the
files in HDFS; the metadata information includes file permissions, names, and the location of each
block. The metadata is small, so it is stored in the memory of the name node, allowing faster access
to data. Moreover, although the HDFS cluster is accessed by multiple clients concurrently, all this
information is handled by a single machine. File system operations such as opening, closing, and
renaming are executed by it.
3. Data Node: Data nodes store and retrieve blocks when told to by clients or the name node. They
report back to the name node periodically with lists of the blocks they are storing. The data node, being
commodity hardware, also does the work of block creation, deletion, and replication as directed by the
name node.
The name node is commodity hardware that contains the GNU/Linux operating system and the name
node software. It is software that can be run on commodity hardware. The system having the name
node acts as the master server: it manages the file system namespace and regulates clients' access to files.
The datanode is commodity hardware having the GNU/Linux operating system and datanode
software. For every node (Commodity hardware/System) in a cluster, there will be a datanode. These
nodes manage the data storage of their system.
Datanodes perform read-write operations on the file systems, as per client request.
They also perform operations such as block creation, deletion, and replication according to the
instructions of the namenode.
Block
Generally the user data is stored in the files of HDFS. A file in the file system will be divided into
one or more segments and/or stored in individual data nodes. These file segments are called
blocks. In other words, the minimum amount of data that HDFS can read or write is called a block.
The default block size was 64 MB in older Hadoop versions (128 MB in Hadoop 2 and later), but it can
be changed as needed via the HDFS configuration.
Goals of HDFS
Fault detection and recovery − Since HDFS includes a large number of commodity hardware components,
failure of components is frequent. Therefore HDFS should have mechanisms for quick and automatic
fault detection and recovery.
Huge datasets − HDFS should have hundreds of nodes per cluster to manage the applications having
huge datasets.
Hardware at data − A requested task can be done efficiently, when the computation takes place near
the data. Especially where huge datasets are involved, it reduces the network traffic and increases the
throughput.
What is HDFS
Hadoop comes with a distributed file system called HDFS. In HDFS data is distributed over several
machines and replicated to ensure their durability to failure and high availability to parallel
application.
It is cost effective as it uses commodity hardware. It involves the concept of blocks, data nodes and
node name.
o Low latency data access: Applications that require very low latency to access the first record should
not use HDFS, as it gives importance to throughput over the whole dataset rather than the time to fetch
the first record.
o Lots of small files: The name node holds the metadata of files in memory, and if the files are
small in size, this takes up a lot of the name node's memory, which is not feasible.
o Multiple writes: HDFS should not be used when we have to write to a file multiple times.
Hadoop - Introduction
Hadoop is an Apache open source framework written in java that allows distributed processing of
large datasets across clusters of computers using simple programming models. The Hadoop
framework application works in an environment that provides
distributed storage and computation across clusters of computers. Hadoop is designed to scale up
from single server to thousands of machines, each offering local computation and storage.
HADOOP ARCHITECTURE:
At its core, Hadoop has two major layers: the processing/computation layer (MapReduce) and the
storage layer (HDFS).
The Hadoop Distributed File System (HDFS) is based on the Google File System (GFS) and provides
a distributed file system that is designed to run on commodity hardware. It has many similarities with
existing distributed file systems. However, the differences from other distributed file systems are
significant. It is highly fault-tolerant and is designed to be deployed on low-cost hardware. It
provides high throughput access to application data and is suitable for applications having large
datasets.
Apart from the above-mentioned two core components, Hadoop framework also includes the
following two modules −
Hadoop Common − These are Java libraries and utilities required by other Hadoop modules.
Hadoop YARN − This is a framework for job scheduling and cluster resource management.
Since all the metadata is stored in the name node, it is very important. If it fails, the file system cannot
be used, as there would be no way of knowing how to reconstruct the files from the blocks present in
the data nodes. To overcome this, the concept of the secondary name node arises.
Secondary Name Node: It is a separate physical machine which acts as a helper to the name node. It
performs periodic checkpoints: it communicates with the name node and takes snapshots of the
metadata, which helps minimize downtime and loss of data.
Starting HDFS
The HDFS should be formatted initially and then started in the distributed mode. Commands are
given below.
To Start $ start-dfs.sh
o Copy the file "data.txt" from the local folder /usr/home/Desktop to the HDFS folder
/user/test
Recursive deleting
Example:
"<localSrc>" and "<localDest>" are paths as above, but on the local file system
o put <localSrc><dest>
Copies the file or directory from the local file system identified by localSrc to dest within the DFS.
o copyFromLocal <localSrc><dest>
Identical to -put
o moveFromLocal <localSrc><dest>
Copies the file or directory from the local file system identified by localSrc to dest within HDFS, and
then deletes the local copy on success.
o get [-crc] <src><localDest>
Copies the file or directory in HDFS identified by src to the local file system path identified by
localDest.
o cat <filename>
Displays the contents of filename on stdout.
o moveToLocal <src><localDest>
Works like -get, but deletes the HDFS copy on success.
o setrep [-R] [-w] rep <path>
Sets the target replication factor for files identified by path to rep. (The actual replication factor will
move toward the target over time)
o touchz <path>
Creates a file at path containing the current time as a timestamp. Fails if a file already exists at path,
unless the file is already size 0.
o test -[ezd] <path>
Returns 1 if path exists, has zero length, or is a directory; 0 otherwise.
o stat [format] <path>
Prints information about path. Format is a string which accepts file size in blocks (%b), filename
(%n), block size (%o), replication (%r), and modification date (%y, %Y).
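As a toy illustration of how such a format string expands (this is not Hadoop's implementation; the class, method, and sample values are made up):

```java
import java.util.Map;

public class StatFormat {
    // Toy expansion of a stat-style format string: each %-specifier is
    // replaced by its value. The sample values below are made up.
    static String expand(String format, Map<String, String> values) {
        String out = format;
        for (Map.Entry<String, String> e : values.entrySet()) {
            out = out.replace(e.getKey(), e.getValue());
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> sample = Map.of(
                "%n", "file.txt",    // filename
                "%r", "3",           // replication
                "%o", "134217728");  // block size (128 MB)
        System.out.println(expand("%n r=%r bs=%o", sample)); // file.txt r=3 bs=134217728
    }
}
```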
Starting HDFS
Initially you have to format the configured HDFS file system, open namenode (HDFS server), and
execute the following command.
After formatting the HDFS, start the distributed file system. The following command will start the
namenode as well as the data nodes as cluster.
After loading the information on the server, we can find the list of files in a directory or the status of
a file using 'ls'. Given below is the syntax of ls; you can pass a directory or a filename as an
argument.
Assume we have data in a file called file.txt in the local system which needs to be saved in the
HDFS file system. Follow the steps given below to insert the required file into the Hadoop file system.
Step 1
You have to create an input directory.
$ $HADOOP_HOME/bin/hadoop fs -mkdir /user/input
Step 2
Transfer and store a data file from local systems to the Hadoop file system using the put command.
$ $HADOOP_HOME/bin/hadoop fs -put /home/file.txt /user/input
Step 3
You can verify the file using ls command.
$ $HADOOP_HOME/bin/hadoop fs -ls /user/input
Retrieving Data from HDFS
Assume we have a file in HDFS called outfile. Given below is a simple demonstration for retrieving
the required file from the Hadoop file system.
Step 1
Initially, view the data from HDFS using cat command.
$ $HADOOP_HOME/bin/hadoop fs -cat /user/output/outfile
Step 2
Get the file from HDFS to the local file system using get command.
$ $HADOOP_HOME/bin/hadoop fs -get /user/output/ /home/hadoop_tp/
Shutting Down the HDFS
You can shut down the HDFS by using the following command.
$ stop-dfs.sh
This article will cover basic Hadoop Distributed File System (HDFS) commands in Linux. You'll
learn how to create and list directories, move, delete, and read files, and more.
Hadoop FS Command Line
The Hadoop FS command line is a simple way to access and interface with HDFS. Below are some
basic HDFS commands in Linux, including operations like creating directories, moving files, deleting
files, reading files, and listing directories.
To use HDFS commands, start the Hadoop services using the following command:
sbin/start-all.sh
jps
Below we cover several basic HDFS commands; a fuller list of file system commands is given by
running a command with -help.
mkdir:
To create a directory, similar to the Unix mkdir command.
Options:
-p : Do not fail if the directory already exists
ls:
List directories present under a specific directory in HDFS, similar to the Unix ls command. The
-lsr command can be used for recursive listing of directories and files.
Options:
-d : List the directories as plain files
-h : Format the sizes of files to a human-readable manner instead of number of bytes
-R : Recursively list the contents of directories
copyFromLocal:
Copy files from the local file system to HDFS, similar to the -put command. This command will not
work if the file already exists. To overwrite the destination if the file already exists, add the -f
flag to the command.
Options:
-f : Overwrite the destination if it already exists
copyToLocal: Copy files from HDFS to local file system, similar to -get command.
$ hadoop fs -copyToLocal [-p] [-ignoreCrc] [-crc] ...
mv:
Move files from one directory to another within HDFS, similar to Unix mv command.
rm:
Remove a file from HDFS, similar to the Unix rm command. This command does not delete directories.
For recursive deletes, use the -rm -r command.
Options:
-r : Recursively remove directories and files
-skipTrash : To bypass trash and immediately delete the source
getmerge:
Merge a list of files in one directory on HDFS into a single file on the local file system. This is one of
the most important and useful commands when trying to read the contents of a MapReduce job's or
Pig job's output files.
setrep:
Change the replication factor of a file to a specific value, instead of the default replication factor used
for the rest of HDFS. If the path is a directory, the command recursively changes the replication factor
of all files in the directory tree for the given input.
Options:
-w : Request the command wait for the replication to be completed (potentially takes a long time)
-r : Accept for backwards compatibility and has no effect
touchz:
Creates an empty file in HDFS.
test:
Test whether an HDFS path exists, is an empty file, or is a directory.
Options:
-e : Check whether the path exists
-z : Check whether the file is zero length
-d : Check whether the path is a directory
appendToFile:
Appends the contents of all given local files to the provided destination file on HDFS. The
destination file will be created if it doesn't already exist.
$ hadoop fs -appendToFile <localsrc> ... <dst>
chmod:
Change the permissions of a file, similar to the Linux shell's chmod command but with a few exceptions.
<MODE> Same as the mode used for the shell's chmod command; the only letters recognized are 'rwxXt'
<OCTALMODE> Mode specified in 3 or 4 digits. It is not possible to specify only part of the mode,
unlike the shell command.
Options:
-R : Modify the files recursively
chown:
Change the owner and group of a file, similar to the Linux shell's chown command but with a few
exceptions.
Options:
-R : Modify the files recursively
df:
Shows the capacity (free and used space) of the file system. The status of the root partition is
provided if the file system has multiple partitions and no path is specified.
Options:
-h : Format the sizes of files to a human-readable manner instead of number of bytes
du:
Show size of each file in the directory.
Options:
-s : Show total summary size
-h : Format the sizes of files to a human-readable manner instead of number of bytes
tail:
Show the last 1KB of the file.
Options:
-f : Show appended data as the file grows
Difference Between Local File System (LFS) and Distributed File System (DFS)
The basic file system of the Linux operating system is termed the Local File System (LFS). It stores
any data file as-is, in a single copy.
LFS is not reliable, because it does not replicate data files; DFS is reliable, because its data blocks
are replicated onto different DataNodes.
LFS is cheaper, because it does not need extra memory for storing any data file; DFS is more
expensive, because it needs extra memory to replicate the same data blocks.
LFS is less complex than DFS; DFS is more complex than LFS.
LFS stores data files in a tree format. Here, any user can access data files directly. LFS does not
replicate the data blocks. It is typically used for storing and processing personal (small) data.
When we need to store and process a large data file (at least approximately 1 TB in size), the local file
system of the operating system is not appropriate. In such cases we use the Distributed File System. It
can be created on any Linux operating system with Hadoop. DFS stores any data file by dividing it into
several blocks. This file system works in a master-slave format, where the master is the NameNode and
the DataNodes are the slaves. All the blocks of a data file are stored in different DataNodes, and the
locations are known only by the NameNode. Every data block is replicated onto different DataNodes to
avoid data loss when any DataNode fails. In DFS, data files are not directly accessible to any user,
because only the NameNode knows where the data blocks of a file are stored.
Sequence file
SequenceFile files are flat files designed by Hadoop to store binary <key, value> pairs.
A SequenceFile can be used as a container: small files packed into a SequenceFile can be stored and
processed efficiently.
SequenceFile files are not sorted by their stored keys; SequenceFile's internal Writer class provides
append functionality.
The key and value in a SequenceFile can be any Writable type, or a custom Writable type.
The internal format of the SequenceFile depends on whether compression is enabled and, if it is,
whether it is record compression or block compression.
There are three types:
A. No compression: If compression is not enabled (the default setting), then each record consists of its
record length (in bytes), the key length, the key, and the value. The length field is four bytes.
B. Record compression: The record-compressed format is basically the same as the uncompressed
format; the difference is that the value bytes are compressed with the codec defined in the header. Note
that the key is not compressed.
C. Block compression: Block compression compresses multiple records at once, so it is more compact
than record compression and is generally preferred. When the number of buffered record bytes reaches
a minimum size, they are added to a block. The minimum size is defined by the
io.seqfile.compress.blocksize property; the default value is 1,000,000 bytes. The format is record count,
key length, key, value length, value.
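The uncompressed record layout described above (record length, key length, key, value) can be sketched with plain Java streams. This is only an illustration of the framing idea, not SequenceFile's actual on-disk format, which also includes a file header and sync markers:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

public class RecordFraming {
    // Write one record as: record length, key length, key bytes, value bytes.
    static void writeRecord(DataOutputStream out, byte[] key, byte[] value) throws IOException {
        out.writeInt(key.length + value.length); // record length (four bytes)
        out.writeInt(key.length);                // key length (four bytes)
        out.write(key);
        out.write(value);
    }

    // Read one record back, returning {key, value}.
    static byte[][] readRecord(DataInputStream in) throws IOException {
        int recordLen = in.readInt();
        int keyLen = in.readInt();
        byte[] key = new byte[keyLen];
        byte[] value = new byte[recordLen - keyLen];
        in.readFully(key);
        in.readFully(value);
        return new byte[][] { key, value };
    }

    // Round-trip a single pair through the framing above.
    static byte[][] roundTrip(byte[] key, byte[] value) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            writeRecord(new DataOutputStream(buf), key, value);
            return readRecord(new DataInputStream(new ByteArrayInputStream(buf.toByteArray())));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[][] kv = roundTrip("james".getBytes(), "NBA".getBytes());
        System.out.println(new String(kv[0]) + " -> " + new String(kv[1])); // james -> NBA
    }
}
```

Because each record carries its own lengths, a reader can skip over records without understanding their contents, which is what makes the flat-file format easy to split and scan.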
read/write Sequencefile
Write Process:
1) Create a configuration
2) Get filesystem
3) Create file output path
4) Call Sequencefile.createwriter to get Sequencefile.writer object
5) Call SequenceFile.Writer.append Append write file
6) Close the stream.
Read process:
1) Create a configuration
2) Get filesystem
3) Create file output path
4) New one sequencefile.reader for reading
5) Get Keyclass and Valueclass
6) Close the stream
org.apache.hadoop.io
Class SequenceFile
There are three SequenceFile writers, based on the SequenceFile.CompressionType used to compress
key/value pairs:
1. Writer: Uncompressed records.
2. RecordCompressWriter: Record-compressed files; only the values are compressed.
3. BlockCompressWriter: Block-compressed files; both keys and values are collected in 'blocks' separately
and compressed. The size of the 'block' is configurable.
package SequenceFile;
import java.io.IOException;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.SequenceFile.CompressionType;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.compress.BZip2Codec;
import org.apache.hadoop.util.ReflectionUtils;
public class Demo01 {
final static String uri= "hdfs://liguodong:8020/liguodong";
final static String[] data = {
"apache,software","chinese,good","james,NBA","index,pass"
};
public static void write(FileSystem fs, Configuration configuration, Path path) throws IOException {
// 1-3) The configuration, filesystem and output path are created by the caller
IntWritable key = new IntWritable();
Text value = new Text();
// 4) Get a SequenceFile.Writer object
// No compression:
/*SequenceFile.Writer writer = SequenceFile.createWriter
(fs,configuration,path,key.getClass(),value.getClass());*/
//Record compression
@SuppressWarnings("deprecation")
SequenceFile.Writer writer = SequenceFile.createWriter
(fs,configuration,path,key.getClass(),
value.getClass(), CompressionType.RECORD, new BZip2Codec());
//Block compression
/*@SuppressWarnings("deprecation")
SequenceFile.Writer writer = SequenceFile.createWriter
(fs,configuration,path,key.getClass(),
value.getClass(),CompressionType.BLOCK,new BZip2Codec());*/
//5
for (int i = 0; i <30; i++) {
key.set(100-i);
value.set(data[i%data.length]);
writer.append(key, value);
}
//6, close the stream
IOUtils.closeStream(writer);
}
public static void read(FileSystem fs,Configuration configuration,Path path) throws IOException {
//4
@SuppressWarnings("deprecation")
SequenceFile.Reader reader = new SequenceFile.Reader(fs, path,configuration);
//5
Writable key = (Writable) ReflectionUtils.newInstance
(reader.getKeyClass(), configuration);
Writable value = (Writable) ReflectionUtils.newInstance
(reader.getValueClass(), configuration);
while(reader.next(key,value)){
System.out.println("key = "+ key);
System.out.println("value = "+ value);
System.out.println("position = "+ reader.getPosition());
}
IOUtils.closeStream(reader);
}
}
Operation Result:
key = 100
value = apache,software
position = 164
key = 99
value = chinese,good
position = 197
key = 98
value = james,NBA
position = 227
key = 97
value = index,pass
position = 258
key = 96
value = apache,software
position = 294
key = 95
value = chinese,good
position = 327
......
key = 72
value = apache,software
position = 1074
key = 71
value = chinese,good
position = 1107
MapFile
A MapFile is a sorted SequenceFile with an index, and records can be looked up by key.
Unlike a SequenceFile, a MapFile's key must implement the WritableComparable interface; that is, the
key must be comparable, and the value is a Writable type.
You can use the MapFile.fix() method to rebuild the index and convert a SequenceFile into a MapFile.
By observing its directory structure, we can see that a MapFile consists of two parts (named by two
static member variables of the class): data and index.
The index is a data-index file: it primarily records the key of each indexed record and the offset at
which that record sits in the data file.
When a MapFile is accessed, the index file is loaded into memory, and the index mapping relationship
is used to quickly navigate to the location of the specified record in the data file.
Therefore, the retrieval efficiency of a MapFile is high relative to a SequenceFile; the disadvantage is
that it consumes a portion of memory to store the index data.
It should be noted that a MapFile does not record every entry in the index; by default it stores an index
entry for every 128 records. The recording interval can be modified via the MapFile.Writer
setIndexInterval() method, or by changing the io.map.index.interval property.
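The interval-index idea can be illustrated with a toy in-memory structure (this is not Hadoop's MapFile implementation; the class is made up, and the small interval here stands in for MapFile's default of 128):

```java
import java.util.ArrayList;
import java.util.List;

public class IntervalIndex {
    // Sorted keys, with an index entry every `interval` records --
    // the same idea MapFile uses with a default interval of 128.
    private final List<String> keys;   // sorted data records
    private final List<Integer> index; // positions of the indexed keys
    private final int interval;

    IntervalIndex(List<String> sortedKeys, int interval) {
        this.keys = sortedKeys;
        this.interval = interval;
        this.index = new ArrayList<>();
        for (int i = 0; i < sortedKeys.size(); i += interval) {
            index.add(i);
        }
    }

    // Binary-search the small in-memory index for the last indexed key <= key,
    // then scan at most `interval` records in the data.
    int find(String key) {
        int lo = 0, hi = index.size() - 1, start = 0;
        while (lo <= hi) {
            int mid = (lo + hi) / 2;
            if (keys.get(index.get(mid)).compareTo(key) <= 0) {
                start = index.get(mid);
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        for (int i = start; i < Math.min(start + interval, keys.size()); i++) {
            if (keys.get(i).equals(key)) return i;
        }
        return -1; // not found
    }

    public static void main(String[] args) {
        List<String> data = new ArrayList<>();
        for (char c = 'a'; c <= 'z'; c++) data.add(String.valueOf(c));
        IntervalIndex idx = new IntervalIndex(data, 4);
        System.out.println(idx.find("q")); // 16
    }
}
```

This shows the trade-off described above: only a fraction of the keys are kept in memory, at the cost of a short scan after each index lookup.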
read/write Mapfile
Write Process:
1) Create a configuration
2) Get filesystem
3) Create file output path
4) New one Mapfile.writer object
5) Call MapFile.Writer.append Append write file
6) Close the stream
Read process:
1) Create a configuration
2) Get filesystem
3) Create file output path
4) New one mapfile.reader for reading
5) Get Keyclass and Valueclass
6) Close the stream
Hadoop has an abstract notion of filesystems, of which HDFS is just one implementation. The Java abstract
class org.apache.hadoop.fs.FileSystem represents the client interface to a filesystem in Hadoop, and there are
several concrete implementations.Hadoop is written in Java, so most Hadoop filesystem interactions are
mediated through the Java API. The filesystem shell, for example, is a Java application that uses the Java
FileSystem class to provide filesystem operations.By exposing its filesystem interface as a Java API,
Hadoop makes it awkward for non-Java applications to access HDFS. The HTTP REST API exposed by the
WebHDFS protocol makes it easier for other languages to interact with HDFS. Note that the HTTP interface
is slower than the native Java client, so should be avoided for very large data transfers if possible.
There are two ways of accessing HDFS over HTTP: directly, where the HDFS daemons serve HTTP
requests to clients and via a proxy, which accesses HDFS on the client’s behalf using the usual
DistributedFileSystem API.
In the first case, the embedded web servers in the namenode and datanodes act as WebHDFS endpoints.
(WebHDFS is enabled by default, since dfs.webhdfs.enabled is set to true.) File metadata operations are
handled by the namenode, while file read (and write) operations are sent first to the namenode, which sends
an HTTP redirect to the client indicating the datanode to stream file data from (or to).
The second way of accessing HDFS over HTTP relies on one or more standalone proxy servers. All traffic to
the cluster passes through the proxy, so the client never accesses the namenode or datanode directly. This
allows for stricter firewall and bandwidth-limiting policies to be put in place. The HttpFS proxy exposes the
same HTTP (and HTTPS) interface as WebHDFS, so clients can access both using webhdfs (or swebhdfs)
URIs. The HttpFS proxy is started independently of the namenode and datanode daemons, using the
httpfs.sh script, and by default listens on a different port (14000).
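To make the WebHDFS interface concrete, here is a small sketch of how request URLs are formed. The host name is an example value (50070 is the classic namenode HTTP port); the op values OPEN and GETFILESTATUS are from the documented WebHDFS REST API:

```java
public class WebHdfsUrl {
    // WebHDFS REST URLs have the shape:
    // http://<host>:<port>/webhdfs/v1<path>?op=<OP>
    static String url(String host, int port, String path, String op) {
        return "http://" + host + ":" + port + "/webhdfs/v1" + path + "?op=" + op;
    }

    public static void main(String[] args) {
        // Reading file data: the namenode answers with an HTTP redirect
        // to the datanode that actually streams the bytes.
        System.out.println(url("namenode.example.com", 50070, "/user/input/file.txt", "OPEN"));
        // File metadata: handled by the namenode itself.
        System.out.println(url("namenode.example.com", 50070, "/user/input/file.txt", "GETFILESTATUS"));
    }
}
```

Any HTTP client (curl, a browser, a non-Java program) can issue such requests, which is the point of the REST interface.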
In this section, we dig into the Hadoop FileSystem class: the API for interacting with one of Hadoop’s
filesystems.
A file in a Hadoop filesystem is represented by a Hadoop Path object. FileSystem is a general filesystem
API, so the first step is to retrieve an instance for the filesystem we want to use—HDFS, in this case. There
are several static factory methods for getting a FileSystem instance
A Configuration object encapsulates a client or server’s configuration, which is set using configuration files
read from the classpath, such as core-site.xml. The first method returns the default filesystem (as specified in
core-site.xml, or the default local filesystem if not specified there). The second uses the given URI’s scheme
and authority to determine the filesystem to use, falling back to the default filesystem if no scheme is
specified in the given URI. The third retrieves the filesystem as the given user, which is important in the
context of security.The fourth one retrieves a local filesystem instance.
With a FileSystem instance in hand, we invoke an open() method to get the input stream for a file. The
first method uses a default buffer size of 4 KB; the second gives the user an option to specify the buffer size.
The open() method on FileSystem actually returns an FSDataInputStream rather than a standard java.io
class. This class is a specialization of java.io.DataInputStream with support for random access, so you can
read from any part of the stream:
package org.apache.hadoop.fs;
public class FSDataInputStream extends DataInputStream implements Seekable, PositionedReadable {
}
The Seekable interface permits seeking to a position in the file and provides a query method for the current
offset from the start of the file (getPos()) . Calling seek() with a position that is greater than the length of the
file will result in an IOException. Unlike the skip() method of java.io.InputStream, which positions the
stream at a point later than the current position, seek() can move to an arbitrary,
absolute position in the file.
FSDataInputStream also implements the PositionedReadable interface for reading parts of a file at a given
offset.
The read() method reads up to length bytes from the given position in the file into the buffer at the given
offset in the buffer. The return value is the number of bytes actually read; callers should check this value, as
it may be less than length.The readFully() methods will read length bytes into the buffer, unless the end of
the file is reached, in which case an EOFException is thrown.
Finally, bear in mind that calling seek() is a relatively expensive operation and should be done sparingly.
You should structure your application access patterns to rely on streaming data (by using MapReduce, for
example) rather than performing a large number of seeks.
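The seek-versus-skip distinction can be demonstrated with plain Java I/O, using java.io.RandomAccessFile as a stand-in for FSDataInputStream (the temporary file and its contents are made up for the demo):

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.file.Files;
import java.nio.file.Path;

public class SeekDemo {
    // Read single characters at two absolute positions, moving backwards in
    // between -- something InputStream.skip() cannot do, but seek() can.
    static String demo(Path p) throws IOException {
        try (RandomAccessFile f = new RandomAccessFile(p.toFile(), "r")) {
            StringBuilder sb = new StringBuilder();
            f.seek(7);                  // jump to an absolute position
            sb.append((char) f.read());
            f.seek(2);                  // seek backwards to an earlier position
            sb.append((char) f.read());
            return sb.toString();
        }
    }

    public static void main(String[] args) throws IOException {
        Path p = Files.createTempFile("seekdemo", ".txt");
        Files.write(p, "0123456789".getBytes());
        System.out.println(demo(p)); // 72
        Files.delete(p);
    }
}
```

On HDFS each seek() may mean contacting a different datanode, which is why the text advises structuring access patterns around streaming rather than many seeks.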
Writing Data
The FileSystem class has a number of methods for creating a file. The simplest is the method that takes a
Path object for the file to be created and returns an output stream to write to.
There are overloaded versions of this method that allow you to specify whether to forcibly overwrite existing
files, the replication factor of the file, the buffer size to use when writing the file, the block size for the file,
and file permissions.
Note : The create() methods create any parent directories of the file to be written that don’t already exist.
Though convenient, this behavior may be unexpected. If you want the write to fail when the parent directory
doesn’t exist, you should check for the existence of the parent directory first by calling the exists() method.
Alternatively, use FileContext, which allows you to control whether parent directories are created or not.
There’s also an overloaded method for passing a callback interface, Progressable, so your application can be
notified of the progress of the data being written to the datanodes.
package org.apache.hadoop.util;
public interface Progressable {
public void progress();
}
As an alternative to creating a new file, you can append to an existing file using the append() method (there
are also some other overloaded versions)
The append operation allows a single writer to modify an already written file by opening it and writing data
from the final offset in the file. With this API, applications that produce unbounded files, such as logfiles,
can write to an existing file after having closed it. The append operation is optional and not implemented by
all Hadoop filesystems. The example below shows how to copy a local file to a Hadoop filesystem. We
illustrate progress by printing a period every time the progress() method is called by Hadoop, which is
after each 64 KB packet of data is written to the datanode pipeline.
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.util.Progressable;
public class CopyFileWithProgress {
public static void main(String[] args) throws Exception {
String localSrc = args[0];
String dst = args[1];
InputStream in = new BufferedInputStream(new FileInputStream(localSrc));
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(URI.create(dst), conf);
OutputStream out = fs.create(new Path(dst), new Progressable() {
public void progress() {
System.out.print(".");
}
});
IOUtils.copyBytes(in, out, 4096, true);
}
MallareddyEngineeringCollegeForWomen(AutonomousInstitution,UGC-Govt.ofIndia) 68
IntroductiontoBigDataAnalytics Dept.ofCSEAI&ML
}
FSDataOutputStream
The create() method on FileSystem returns an FSDataOutputStream, which, like FSDataInputStream, has a
method for querying the current position in the file:
package org.apache.hadoop.fs;
public class FSDataOutputStream extends DataOutputStream implements Syncable {
public long getPos() throws IOException {
}
}
However, unlike FSDataInputStream, FSDataOutputStream does not permit seeking. This is because HDFS
allows only sequential writes to an open file or appends to an already written file. In other words, there is no
support for writing to anywhere other than the end of the file, so there is no value in being able to seek while
writing.
The related mkdirs() method on FileSystem creates a directory along with all of the necessary parent directories if they don't already exist, and returns true if it is successful.
Hadoop components need the Hadoop libraries to be accessible from CloverDX. The libraries are needed by HadoopReader, HadoopWriter, ExecuteMapReduce, HDFS, and Hive, and are necessary to establish a Hadoop connection (see Hadoop Connections).
The officially supported version of Hadoop is Cloudera 5, version 5.6.0. Other versions close to this one might work, but compatibility is not guaranteed.
Cloudera 5
The below mentioned libraries are needed for the connection to Cloudera 5.
Common libraries
hadoop-common-2.6.0-cdh5.6.0.jar
hadoop-auth-2.6.0-cdh5.6.0.jar
guava-15.0.jar
avro-1.7.6-cdh5.6.0.jar
htrace-core4-4.0.1-incubating.jar
servlet-api-3.0.jar
HDFS
hadoop-hdfs-2.6.0-cdh5.6.0.jar
protobuf-java-2.5.0.jar
MapReduce
hadoop-annotations-2.6.0-cdh5.6.0.jar
hadoop-mapreduce-client-app-2.6.0-cdh5.6.0.jar
hadoop-mapreduce-client-common-2.6.0-cdh5.6.0.jar
hadoop-mapreduce-client-core-2.6.0-cdh5.6.0.jar
hadoop-mapreduce-client-hs-2.6.0-cdh5.6.0.jar
hadoop-mapreduce-client-jobclient-2.6.0-cdh5.6.0.jar
hadoop-mapreduce-client-shuffle-2.6.0-cdh5.6.0.jar
jackson-core-asl-1.9.2.jar
jackson-mapper-asl-1.9.12.jar
hadoop-yarn-api-2.6.0-cdh5.6.0.jar
hadoop-yarn-client-2.6.0-cdh5.6.0.jar
hadoop-yarn-common-2.6.0-cdh5.6.0.jar
Hive
hive-jdbc-1.1.0-cdh5.6.0.jar
hive-exec-1.1.0-cdh5.6.0.jar
hive-metastore-1.1.0-cdh5.6.0.jar
hive-service-1.1.0-cdh5.6.0.jar
libfb303-0.9.2.jar
slf4j-api-1.7.5.jar
slf4j-log4j12-1.7.5.jar
The libraries can be found in your CDH installation or in a package downloaded from Cloudera.
CDH installation
Required libraries from CDH reside in the directories from the following list.
/usr/lib/hadoop
/usr/lib/hadoop-hdfs
/usr/lib/hadoop-mapreduce
/usr/lib/hadoop-yarn
+ 3rd party libraries are located in lib subdirectories
Package downloaded from Cloudera
The files can also be found in a package downloaded from Cloudera, at the following locations.
share/hadoop/common
share/hadoop/hdfs
share/hadoop/mapreduce2
share/hadoop/yarn
+ lib subdirectories
One of the simplest ways to read a file from a Hadoop filesystem is by using a java.net.URL object to
open a stream to read the data from. The general idiom is:
InputStream in = null;
try {
in = new URL("hdfs://host/path").openStream();
// process in
} finally {
IOUtils.closeStream(in);
}
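The same idiom works with any URL scheme the JVM already understands. As a sketch that runs without a Hadoop cluster, here it is applied to a file: URL (the temporary file and its contents are invented for illustration):

```java
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;

public class UrlReadDemo {
    // Read every byte from a URL's stream, mirroring the HDFS idiom above.
    static String readAll(URL url) throws Exception {
        InputStream in = null;
        try {
            in = url.openStream();
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return out.toString("UTF-8");
        } finally {
            if (in != null) in.close(); // IOUtils.closeStream(in) does this null-safely
        }
    }

    public static void main(String[] args) throws Exception {
        // Use a local file: URL in place of hdfs://host/path
        Path tmp = Files.createTempFile("urlread", ".txt");
        Files.write(tmp, "hello hdfs idiom".getBytes("UTF-8"));
        System.out.println(readAll(tmp.toUri().toURL()));
    }
}
```

With an hdfs: URL, the only extra step is registering Hadoop's URL stream handler, as the text explains next.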
There’s a little bit more work required to make Java recognize Hadoop’s hdfs URL scheme. This is
achieved by calling the setURLStreamHandlerFactory() method on URL with an instance of
FsUrlStreamHandlerFactory. This method can be called only once per JVM, so it is typically executed in
a static block. This limitation means that if some other part of your program — perhaps a third-party
component outside your control — sets a URLStreamHandlerFactory, you won’t be able to use this
approach for reading data from Hadoop. The next section discusses an alternative.
Example 3-1 shows a program for displaying files from Hadoop filesystems on standard output, like the
Unix cat command.
Example 3-1. Displaying files from a Hadoop filesystem on standard output using a URLStreamHandler
% export HADOOP_CLASSPATH=hadoop-examples.jar
% hadoop URLCat hdfs://localhost/user/tom/quangle.txt
On the top of the Crumpetty Tree
The Quangle Wangle sat,
But his face you could not see,
On account of his Beaver Hat.
Hadoop has an abstract notion of file systems, of which HDFS is just one implementation. The Java
abstract class org.apache.hadoop.fs.FileSystem represents the client interface to a filesystem in
Hadoop, and there are several concrete implementations. Hadoop is written in Java, so most Hadoop
file system interactions are mediated through the Java API. The filesystem shell, for example, is a
Java application that uses the Java FileSystem class to provide filesystem operations. By exposing its
file system interface as a Java API, Hadoop makes it awkward for non-Java applications to access
HDFS. The HTTP REST API exposed by the WebHDFS protocol makes it easier for other languages
to interact with HDFS. Note that the HTTP interface is slower than the native Java client, so should
be avoided for very large data transfers if possible.
There are two ways of accessing HDFS over HTTP: directly, where the HDFS daemons serve HTTP requests to clients; and via a proxy, which accesses HDFS on the client's behalf using the usual DistributedFileSystem API.
In the first case, the embedded web servers in the namenode and datanodes act as WebHDFS
endpoints. (WebHDFS is enabled by default, since dfs.webhdfs.enabled is set to true.) File metadata
operations are handled by the namenode, while file read (and write) operations are sent first to the
namenode, which sends an HTTP redirect to the client indicating the datanode to stream file data
from (or to).
The second way of accessing HDFS over HTTP relies on one or more standalone proxy servers.
All traffic to the cluster passes through the proxy, so the client never accesses the namenode or
datanode directly. This allows for stricter firewall and bandwidth-limiting policies to be put in place.
The HttpFS proxy exposes the same HTTP (and HTTPS) interface as WebHDFS, so clients can
access both using webhdfs (or swebhdfs) URIs. The HttpFS proxy is started independently of the
namenode and datanode daemons, using the httpfs.sh script, and by default listens on a different port (14000).
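Both endpoints share the same REST path convention: a file read is an ordinary HTTP GET against /webhdfs/v1/&lt;path&gt;?op=OPEN. A small sketch of how such URIs are formed (the hostnames are invented for illustration; 50070 is the default namenode web port in the Hadoop 2 line covered here):

```java
public class WebHdfsUri {
    // Build a WebHDFS REST URI: http://<host>:<port>/webhdfs/v1<path>?op=<OP>
    static String uri(String host, int port, String path, String op) {
        return "http://" + host + ":" + port + "/webhdfs/v1" + path + "?op=" + op;
    }

    public static void main(String[] args) {
        // Directly against the namenode's embedded web server
        System.out.println(uri("namenode.example.com", 50070, "/user/tom/quangle.txt", "OPEN"));
        // Via an HttpFS proxy, which listens on port 14000 by default
        System.out.println(uri("httpfs.example.com", 14000, "/user/tom/quangle.txt", "OPEN"));
    }
}
```

Because the two interfaces share this URI layout, a client can switch between direct WebHDFS and an HttpFS proxy by changing only the host and port.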
In this section, we dig into the Hadoop FileSystem class: the API for interacting with one of
Hadoop’s filesystems.
The open() method on FileSystem returns an FSDataInputStream, which implements the Seekable interface. Unlike java.io.InputStream's skip(), which positions the stream at a point later than the current position, seek() can move to an arbitrary, absolute position in the file.
public interface Seekable {
void seek(long pos) throws IOException;
long getPos() throws IOException;
}
FSDataInputStream also implements the PositionedReadable interface for reading parts of a file at a
given offset.
public interface PositionedReadable {
public int read(long position, byte[] buffer, int offset, int length) throws IOException;
public void readFully(long position, byte[] buffer, int offset, int length) throws IOException;
public void readFully(long position, byte[] buffer) throws IOException;
}
The read() method reads up to length bytes from the given position in the file into the buffer at the
given offset in the buffer. The return value is the number of bytes actually read; callers should check
this value, as it may be less than length. The readFully() methods will read length bytes into the
buffer, unless the end of the file is reached, in which case an EOFException is thrown.
Finally, bear in mind that calling seek() is a relatively expensive operation and should be done
sparingly. You should structure your application access patterns to rely on streaming data (by using
MapReduce, for example) rather than performing a large number of seeks.
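Plain Java NIO has the same positional-read idea: FileChannel.read(ByteBuffer, long) reads at an absolute offset without moving the channel's own position, much like PositionedReadable. A small sketch (the temporary file is invented for illustration; this is an analogy, not Hadoop's API):

```java
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class PositionalReadDemo {
    // Read up to len bytes at an absolute position; the channel's own
    // position is left untouched, like PositionedReadable.read().
    static String readAt(Path file, long position, int len) throws Exception {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            ByteBuffer buf = ByteBuffer.allocate(len);
            int n = ch.read(buf, position); // may read fewer than len bytes
            return new String(buf.array(), 0, Math.max(n, 0), "UTF-8");
        }
    }

    public static void main(String[] args) throws Exception {
        Path tmp = Files.createTempFile("pread", ".txt");
        Files.write(tmp, "0123456789".getBytes("UTF-8"));
        System.out.println(readAt(tmp, 3, 4)); // prints "3456"
    }
}
```

As with Hadoop's read(), the caller must check the returned byte count, since a positional read may return fewer bytes than requested.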
Writing Data
The FileSystem class has a number of methods for creating a file. The simplest is the method that
takes a Path object for the file to be created and returns an output stream to write to.
public FSDataOutputStream create(Path f) throws IOException
There are overloaded versions of this method that allow you to specify whether to forcibly overwrite
existing files, the replication factor of the file, the buffer size to use when writing the file, the block
size for the file, and file permissions.
Note : The create() methods create any parent directories of the file to be written that don’t already
exist. Though convenient, this behavior may be unexpected. If you want the write to fail when the
parent directory doesn’t exist, you should check for the existence of the parent directory first by
calling the exists() method. Alternatively, use FileContext, which allows you to control whether
parent directories are created or not.
There’s also an overloaded method for passing a callback interface, Progressable, so your application
can be notified of the progress of the data being written to the datanodes.
package org.apache.hadoop.util;
public interface Progressable {
public void progress();
}
As an alternative to creating a new file, you can append to an existing file using the append() method (there are also some other overloaded versions):
public FSDataOutputStream append(Path f) throws IOException
The append operation allows a single writer to modify an already written file by opening it and
writing data from the final offset in the file. With this API, applications that produce unbounded files,
such as logfiles, can write to an existing file after having closed it. The append operation is optional
and not implemented by all Hadoop filesystems. The example below shows how to copy a local file to a
Hadoop filesystem. We illustrate progress by printing a period every time the progress() method is
called by Hadoop, which is after each 64 KB packet of data is written to the datanode pipeline.
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.util.Progressable;
public class CopyFileWithProgress {
public static void main(String[] args) throws Exception {
String localSrc = args[0];
String dst = args[1];
InputStream in = new BufferedInputStream(new FileInputStream(localSrc));
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(URI.create(dst), conf);
OutputStream out = fs.create(new Path(dst), new Progressable() {
public void progress() {
System.out.print(".");
}
});
IOUtils.copyBytes(in, out, 4096, true);
}
}
FSDataOutputStream
The create() method on FileSystem returns an FSDataOutputStream, which, like FSDataInputStream,
has a method for querying the current position in the file:
package org.apache.hadoop.fs;
public class FSDataOutputStream extends DataOutputStream implements Syncable {
public long getPos() throws IOException {
// implementation elided
}
}
However, unlike FSDataInputStream, FSDataOutputStream does not permit seeking. This is because
HDFS allows only sequential writes to an open file or appends to an already written file. In other
words, there is no support for writing to anywhere other than the end of the file, so there is no value
in being able to seek while writing.
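The append-only contract can be mimicked with java.nio.file: a stream opened with StandardOpenOption.APPEND only ever writes at the end of the file, which is exactly the restriction HDFS imposes. A minimal sketch (file names invented; an analogy, not the HDFS API):

```java
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class AppendOnlyDemo {
    // Append bytes to an existing file; writes always land at the end,
    // mirroring HDFS's "sequential writes or appends only" rule.
    static void append(Path file, String text) throws Exception {
        try (OutputStream out = Files.newOutputStream(file, StandardOpenOption.APPEND)) {
            out.write(text.getBytes("UTF-8"));
        }
    }

    public static void main(String[] args) throws Exception {
        Path tmp = Files.createTempFile("append", ".log");
        Files.write(tmp, "line1\n".getBytes("UTF-8"));
        append(tmp, "line2\n"); // reopen-and-append, like FileSystem.append()
        System.out.println(Files.readAllLines(tmp)); // [line1, line2]
    }
}
```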
Big data is a collection of data sets that are large, complex, and difficult to store and process using available data management tools or traditional data processing applications. Hadoop is an open-source framework for writing, running, storing, and processing large datasets in a parallel and distributed manner. It is a solution used to overcome the challenges faced by big data. Its key features are:
Fault-Tolerance
Scalability
Distributed Storage
Reliability
High availability
Cost-effective
High throughput
Building Blocks of Hadoop:
1. Name Node
2. Data Node
3. Secondary Name Node (SNN)
4. Job Tracker
5. Task Tracker
Anatomy of File Read in HDFS
Let's get an idea of how data flows between the client interacting with HDFS, the name node, and the data nodes, step by step:
Step 1: The client opens the file it wishes to read by calling open() on the FileSystem object (which for HDFS is an instance of DistributedFileSystem).
Step 2: Distributed File System( DFS) calls the name node, using remote procedure calls (RPCs), to
determine the locations of the first few blocks in the file. For each block, the name node returns the
addresses of the data nodes that have a copy of that block. The DFS returns an FSDataInputStream to
the client for it to read data from. FSDataInputStream in turn wraps a DFSInputStream, which
manages the data node and name node I/O.
Step 3: The client then calls read() on the stream. DFSInputStream, which has stored the data node addresses for the first few blocks in the file, then connects to the first (closest) data node for the first block in the file.
Step 4: Data is streamed from the data node back to the client, which calls read() repeatedly on the
stream.
Step 5: When the end of the block is reached, DFSInputStream will close the connection to the data node, then find the best data node for the next block. This happens transparently to the client, which from its point of view is simply reading a continuous stream. Blocks are read in order, with the DFSInputStream opening new connections to data nodes as the client reads through the stream. It will also call the name node to retrieve the data node locations for the next batch of blocks as needed.
Step 6: When the client has finished reading the file, it calls close() on the FSDataInputStream.
Anatomy of File Write in HDFS
Step 1: The client creates the file by calling create() on the DistributedFileSystem object.
Step 2: DFS makes an RPC call to the name node to create a new file in the file system’s namespace,
with no blocks associated with it. The name node performs various checks to make sure the file
doesn’t already exist and that the client has the right permissions to create the file. If these checks
pass, the name node makes a record of the new file; otherwise, the file can't be created and an IOException is thrown to the client. The DFS returns an FSDataOutputStream for the client to start writing data to.
Step 3: As the client writes data, the DFSOutputStream splits it into packets, which it writes to an internal queue called the data queue. The data queue is consumed by the DataStreamer, which is responsible for asking the name node to allocate new blocks by picking a list of suitable data nodes to store the replicas. The list of data nodes forms a pipeline, and here we'll assume the replication level is three, so there are three nodes in the pipeline. The DataStreamer streams the packets to the first data node in the pipeline, which stores each packet and forwards it to the second data node in the pipeline.
Step 4: Similarly, the second data node stores the packet and forwards it to the third (and last) data
node in the pipeline.
Step 5: The DFSOutputStream sustains an internal queue of packets that are waiting to be
acknowledged by data nodes, called an “ack queue”.
Step 6: When the client has finished writing data, it calls close() on the stream. This flushes all the remaining packets to the data node pipeline and waits for acknowledgments before contacting the name node to signal that the file is complete.
HDFS follows the Write Once Read Many model. So, we can't edit files that are already stored in HDFS, but we can append to them by reopening the file. This design allows HDFS to scale to a
large number of concurrent clients because the data traffic is spread across all the data nodes in the
cluster. Thus, it increases the availability, scalability, and throughput of the system.
Limitations of HDFS
The main problem with Hadoop is that it is not suitable for small data. HDFS lacks the ability to support random reading of small files due to its high-capacity design. Small files are files smaller than the HDFS block size (default 128 MB). HDFS was designed to work with a small number of large files for storing large data sets, rather than a large number of small files, and it cannot handle huge numbers of small files. If there are very many small files, the NameNode will be overloaded, since it stores the namespace of HDFS in memory.
Slow Processing Speed
MapReduce processes a huge amount of data. In Hadoop, MapReduce works by breaking the processing into two phases: Map and Reduce. MapReduce requires a lot of time to perform these tasks, thus increasing latency and reducing processing speed.
Hadoop only supports batch processing, it is not suitable for streaming data. Hence, overall
performance is slower. MapReduce framework doesn’t leverage the memory of the Hadoop cluster to
the maximum.
No Real-time Processing
Hadoop with its core MapReduce framework is unable to process real-time data. Hadoop processes data in batches. First, the user loads the file into HDFS. Then the user runs a MapReduce job with the
file as input. It follows the ETL cycle of processing. The user extracts the data from the source. Then
the data gets transformed to meet the business requirements. And finally loaded into the data
warehouse. The users can generate insights from this data. The companies use these insights for the
betterment of their business.
Iterative Processing
Core Hadoop does not support iterative processing. Iterative processing requires a cyclic data flow, in which the output of a previous stage serves as the input to the next stage. Hadoop MapReduce is capable
of batch processing. It works on the principle of write-once-read-many. The data gets written on the
disk once. And then read multiple times to get insights. The Map-reduce of Hadoop has a batch
processing engine at its core. It is not able to iterate through data.
Latency
MapReduce in Hadoop is slower because it supports different formats and structures of huge amounts of data. In MapReduce, Map takes a set of data and converts it into another set of data, where an individual element is broken down into a key-value pair. Reduce takes the output from the Map as input and processes it further. MapReduce requires a lot of
time to perform these tasks thereby increasing latency.
No Ease of Use
In Hadoop, we have to hand code each and every operation. This has two drawbacks: first, it is difficult to use, and second, it increases the number of lines of code. There is no interactive mode
available with Hadoop Map-Reduce. This also makes it difficult to debug as it runs in the batch
mode. In this mode, we have to specify the jar file, the input as well as the location of the output file.
If the program fails in between, it is difficult to find the culprit code.
Security Issue
Hadoop does not implement encryption-decryption at the storage or network levels. Thus it is not very secure. For security, Hadoop adopts Kerberos authentication, which is difficult to maintain.
No Caching
Apache Hadoop is not efficient for caching. MapReduce cannot cache the intermediate data in memory for further use, and this diminishes the performance of Hadoop.
Lengthy Code
Apache Hadoop has about 1,20,000 lines of code. A larger number of lines produces a larger number of bugs, and it takes more time to execute the programs.
► Introduction to MapReduce: its Framework
► Features of MapReduce
► Working of MapReduce
► Analyze Map and Reduce Functions
► MapReduce Techniques to optimize a job
► Uses
► Controlling input formats in MapReduce Execution
► Different Phases in MapReduce
► Applications
MapReduce Features
This chapter looks at some of the more advanced features of MapReduce, including counters and sorting and joining datasets.
Counters
There are often things you would like to know about the data you are analyzing but that are peripheral to the analysis you are performing. For example, if you were counting invalid records and discovered that the proportion of invalid records in the whole dataset was very high, you might be prompted to check why so many records were being marked as invalid. Perhaps there is a bug in the part of the program that detects invalid records? Or if the data were of poor quality and genuinely did have very many invalid records, after discovering this, you might decide to increase the size of the dataset so that the number of good records was large enough for meaningful analysis.
Counters are a useful channel for gathering statistics about the job: for quality control or for application-level statistics. They are also useful for problem diagnosis. If you are tempted to put a log message into your map or reduce task, it is often better to see whether you can use a counter instead to record that a particular condition occurred. In addition to counter values being much easier to retrieve than log output for large distributed jobs, you get a record of the number of times that condition occurred, which is more work to obtain from a set of logfiles.
Sorting
The ability to sort data is at the heart of MapReduce. Even if your application isn't concerned with sorting per se, it may be able to use the sorting stage that MapReduce provides to organize its data.
Joins
MapReduce can perform joins between large datasets, but writing the code to do joins from scratch is fairly involved. Rather than writing MapReduce programs, you might consider using a higher-level framework such as Pig, Hive, or Cascading, in which join operations are a core part of the implementation. Let's briefly consider the problem we are trying to solve. We have two datasets (for example, the weather stations database and the weather records) and we want to reconcile the two. Let's say we want to see each station's history, with the station's metadata inlined in each output row.
This is illustrated in Figure 8-2.
How we implement the join depends on how large the datasets are and how they are partitioned. If one dataset is large (the weather records) but the other one is small enough to be distributed to each node in the cluster (as the station metadata is), the join can be effected by a MapReduce job that brings the records for each station together (a partial sort on station ID, for example). The mapper or reducer uses the smaller dataset to look up the station metadata for a station ID, so it can be written out with each record. See Side Data Distribution for a discussion of this approach, where we focus on the mechanics of distributing the data to tasktrackers.
If the join is performed by the mapper, it is called a map-side join, whereas if it is performed by the reducer it is called a reduce-side join.
If both datasets are too large for either to be copied to each node in the cluster, we can still join them using MapReduce with a map-side or reduce-side join, depending on how the data is structured. One common example of this case is a user database and a log of some user activity (such as access logs). For a popular service, it is not feasible to distribute the user database (or the logs) to all the MapReduce nodes.
MapReduce Working
The whole process goes through four phases of execution, namely splitting, mapping, shuffling, and reducing.
Consider you have the following input data for your MapReduce program:
Welcome to Hadoop Class
Hadoop is good
Hadoop is bad
MapReduce Architecture
The final output of the MapReduce task is:
Bad 1
Class 1
Good 1
Hadoop 3
Is 2
To 1
Welcome 1
The data goes through the following phases:
Input Splits:
An input to a MapReduce job is divided into fixed-size pieces called input splits. An input split is a chunk of the input that is consumed by a single map.
Mapping
This is the very first phase in the execution of the map-reduce program. In this phase, data in each split is passed to a mapping function to produce output values. In our example, the job of the mapping phase is to count the number of occurrences of each word from the input splits (more details about input splits are given below) and prepare a list in the form of <word, frequency>.
Shuffling
This phase consumes the output of the Mapping phase. Its task is to consolidate the relevant records from the Mapping phase output. In our example, the same words are clubbed together along with their respective frequency.
Reducing
In this phase, output values from the Shuffling phase are aggregated. This phase combines values from the Shuffling phase and returns a single output value. In short, this phase summarizes the complete dataset. In our example, this phase aggregates the values from the Shuffling phase, i.e., calculates the total occurrences of each word.
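The four phases can be traced in plain Java, with no Hadoop cluster, for the sample input above. This is a sketch of the data flow only, not Hadoop's actual API, and it counts words case-sensitively, so the capitalization differs slightly from the output table above:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class PhasesDemo {
    // Run the mapping, shuffling, and reducing phases over the given
    // input splits and return a <word, count> table.
    static Map<String, Integer> wordCount(String[] splits) {
        // Mapping: emit a <word, 1> pair for every word in every split
        List<String[]> pairs = new ArrayList<>();
        for (String split : splits)
            for (String word : split.split(" "))
                pairs.add(new String[] {word, "1"});

        // Shuffling: club the pairs for the same word together
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String[] pair : pairs)
            grouped.computeIfAbsent(pair[0], k -> new ArrayList<>()).add(Integer.parseInt(pair[1]));

        // Reducing: aggregate each word's list of 1s into a total count
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            int sum = 0;
            for (int v : e.getValue()) sum += v;
            counts.put(e.getKey(), sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Splitting: each line of the sample input becomes one input split
        String[] splits = {"Welcome to Hadoop Class", "Hadoop is good", "Hadoop is bad"};
        wordCount(splits).forEach((w, c) -> System.out.println(w + " " + c));
    }
}
```

In real Hadoop, the splitting is done by the framework from HDFS blocks, and the shuffle moves data across the network between map and reduce tasks; here all four phases simply run in one process.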
MapReduce
1. Traditional Enterprise Systems normally have a centralized server to store and process data.
2. The following illustration depicts a schematic view of a traditional enterprise system. The traditional model is certainly not suitable to process huge volumes of scalable data and cannot be accommodated by standard database servers.
3. Moreover, the centralized system creates too much of a bottleneck while processing multiple files simultaneously.
Figure 4.1: MapReduce
Figure 4.2: Physical structure
1. A MapReduce computation executes as follows:
► Some number of Map tasks each are given one or more chunks from a distributed file system. These Map tasks turn the chunk into a sequence of key-value pairs. The way key-value pairs are produced from the input data is determined by the code written by the user for the Map function.
► The key-value pairs from each Map task are collected by a master controller and sorted by key. The keys are divided among all the Reduce tasks, so all key-value pairs with the same key wind up at the same Reduce task.
► The Reduce tasks work on one key at a time, and combine all the values associated with that key in some way. The manner of combination of values is determined by the code written by the user for the Reduce function.
Figure 4.3: Schematic MapReduce Computation
A. The Map Task
1. We view input files for a Map task as consisting of elements, which can be any type: a tuple or a document, for example.
2. A chunk is a collection of elements, and no element is stored across two chunks.
3. Technically, all inputs to Map tasks and outputs from Reduce tasks are of the key-value-pair form, but normally the keys of input elements are not relevant and we shall tend to ignore them.
4. Insisting on this form for inputs and outputs is motivated by the desire to allow composition of several MapReduce processes.
5. The Map function takes an input element as its argument and produces zero or more key-value pairs.
6. The types of keys and values are each arbitrary. Further, keys are not "keys" in the usual sense; they do not have to be unique.
7. Rather, a Map task can produce several key-value pairs with the same key, even from the same element.
Example 1: A MapReduce computation with what has become the standard example application: counting the number of occurrences of each word in a collection of documents. In this example, the input file is a repository of documents, and each document is an element. The Map function for this example uses keys that are of type String (the words) and values that are integers. The Map task reads a document and breaks it into its sequence of words w1, w2, ..., wn. It then emits a sequence of key-value pairs where the value is always 1. That is, the output of the Map task for this document is the sequence of key-value pairs:
(w1, 1), (w2, 1), ..., (wn, 1)
A single Map task will typically process many documents: all the documents in one or more chunks. Thus, its output will be more than the sequence for the one document suggested above. If a word w appears m times among all the documents assigned to that process, then there will be m key-value pairs (w, 1) among its output. An option is to combine these m pairs into a single pair (w, m), but we can only do that because the Reduce tasks apply an associative and commutative operation, addition, to the values.
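Because addition is associative and commutative, pre-combining the m pairs (w, 1) into a single pair (w, m) inside each Map task cannot change the final totals. A plain-Java sketch of this combiner idea (the two "documents" are invented for illustration):

```java
import java.util.HashMap;
import java.util.Map;

public class CombinerDemo {
    // A Map task's raw output is one (w, 1) pair per occurrence; the
    // combiner folds those into (w, m) per word within the task.
    static Map<String, Integer> combine(String[] words) {
        Map<String, Integer> combined = new HashMap<>();
        for (String w : words)
            combined.merge(w, 1, Integer::sum);
        return combined;
    }

    // The Reduce side: sum the per-task partial counts for each word.
    static Map<String, Integer> reduce(Map<String, Integer> partA, Map<String, Integer> partB) {
        Map<String, Integer> total = new HashMap<>(partA);
        partB.forEach((w, c) -> total.merge(w, c, Integer::sum));
        return total;
    }

    public static void main(String[] args) {
        String[] doc1 = {"big", "data", "big"};
        String[] doc2 = {"data", "big"};
        // Combining per task, then reducing, gives the same totals as
        // shipping every (w, 1) pair to the reducer individually: big=3, data=2
        System.out.println(reduce(combine(doc1), combine(doc2)));
    }
}
```

The payoff in a real cluster is less shuffle traffic: each Map task ships one pair per distinct word instead of one pair per occurrence.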
B. Grouping by Key
i. When the Map tasks have all completed successfully, the key-value pairs are grouped by key, and the values associated with each key are formed into a list of values.
ii. The grouping is performed by the system, regardless of what the Map and Reduce tasks do.
Figure 4.4: Overview of the execution of a MapReduce program
ii. The Reduce task takes the output from the Map as an input and combines those data tuples (key-value pairs) into a smaller set of tuples.
iii. The reduce task is always performed after the map job.
Figure 4.5: Reduce job
► Input Phase - Here we have a RecordReader that translates each record in an input file and sends the parsed data to the mapper in the form of key-value pairs.
► Map - Map is a user-defined function, which takes a series of key-value pairs and processes each one of them to generate zero or more key-value pairs.
► Intermediate Keys - The key-value pairs generated by the mapper are known as intermediate keys.
► Combiner - A combiner is a type of local Reducer that groups similar data from the map phase into identifiable sets. It takes the intermediate keys from the mapper as input and applies user-defined code to aggregate the values in the small scope of one mapper. It is not a part of the main MapReduce algorithm; it is optional.
► Shuffle and Sort - The Reducer task starts with the Shuffle and Sort step. It downloads the grouped key-value pairs onto the local machine, where the Reducer is running. The individual key-value pairs are sorted by key into a larger data list. The data list groups the equivalent keys together so that their values can be iterated easily in the Reducer task.
► Reducer - The Reducer takes the grouped key-value paired data as input and runs a Reducer function on each one of them. Here, the data can be aggregated, filtered, and combined in a number of ways, and it can require a wide range of processing. Once the execution is over, it passes zero or more key-value pairs to the final step.
► Output Phase - In the output phase, we have an output formatter that translates the final key-value pairs from the Reducer function and writes them onto a file using a record writer.
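The phases above can be strung together in a small word-count sketch. Plain Python stands in for Hadoop here; every function name is illustrative, and the combiner runs on each mapper's output before the shuffle, exactly the local pre-aggregation described above.

```python
from collections import defaultdict

# Map: emit (word, 1) for every word in a line.
def map_fn(line):
    return [(word, 1) for word in line.split()]

# Combiner: pre-aggregate one mapper's output locally.
def combine(pairs):
    acc = defaultdict(int)
    for k, v in pairs:
        acc[k] += v
    return list(acc.items())

# Shuffle and Sort: group intermediate pairs by key, sorted by key.
def shuffle(all_pairs):
    groups = defaultdict(list)
    for k, v in all_pairs:
        groups[k].append(v)
    return sorted(groups.items())

# Reduce: sum the value list for each key.
def reduce_fn(key, values):
    return (key, sum(values))

lines = ["big data", "big deal"]
intermediate = []
for line in lines:                      # one "mapper" per line
    intermediate.extend(combine(map_fn(line)))
result = [reduce_fn(k, vs) for k, vs in shuffle(intermediate)]
print(result)  # [('big', 2), ('data', 1), ('deal', 1)]
```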
iv. The MapReduce phase
Figure 4.6: The MapReduce phase
F. MapReduce - Example
Twitter receives around 500 million tweets per day, which is nearly 6,000 tweets per second. The following illustration shows how Twitter manages its tweets with the help of MapReduce.
Figure 4.7: Example
Tokenize - Tokenizes the tweets into maps of tokens and writes them as key-value pairs.
Filter - Filters unwanted words from the maps of tokens and writes the filtered maps as key-value pairs.
Count - Generates a token counter per word.
Aggregate Counters - Prepares an aggregate of similar counter values into small manageable units.
G. MapReduce - Algorithm
The MapReduce algorithm contains two important tasks, namely Map and Reduce. The map task is done by means of the Mapper class.
The Mapper class takes the input, tokenizes it, maps and sorts it. The output of the Mapper class is used as input by the Reducer class.
The reduce task is done by means of the Reducer class.
MapReduce implements various mathematical algorithms to divide a task into small parts and assign them to multiple systems. In technical terms, the MapReduce algorithm helps in sending the Map and Reduce tasks to appropriate servers in a cluster.
Figure 4.8: The MapReduce Class
H. Coping With Node Failures
i. The worst thing that can happen is that the compute node at which the Master is executing fails. In this case, the entire MapReduce job must be restarted.
ii. But only this one node can bring the entire process down; other failures will be managed by the Master, and the MapReduce job will complete eventually.
iii. Suppose the compute node at which a Map worker resides fails. This failure will be detected by the Master, because it periodically pings the Worker processes.
iv. All the Map tasks that were assigned to this Worker will have to be redone, even if they had completed. The reason for redoing completed Map tasks is that their output destined for the Reduce tasks resides at that compute node, and is now unavailable to the Reduce tasks.
v. The Master sets the status of each of these Map tasks to idle and will schedule them on a Worker when one becomes available.
vi. The Master must also inform each Reduce task that the location of its input from that Map task has changed. Dealing with a failure at the node of a Reduce worker is simpler.
vii. The Master simply sets the status of its currently executing Reduce tasks to idle. These will be rescheduled on another Reduce worker later.
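The Master's bookkeeping for a failed worker can be sketched as follows. This is an illustrative model of the rules above, not actual Hadoop code: Map tasks on the failed node go back to idle even if completed (their local output is lost), while only the currently running Reduce tasks are reset, since completed Reduce output lives in durable storage.

```python
def handle_worker_failure(tasks, failed_node):
    """Reset task statuses after 'failed_node' dies, per the MapReduce rules."""
    for task in tasks:
        if task["node"] != failed_node:
            continue
        if task["type"] == "map":
            task["status"] = "idle"   # redo even if completed: output is lost
        elif task["status"] == "running":
            task["status"] = "idle"   # completed reduces already wrote durably
    return tasks

tasks = [
    {"type": "map", "status": "completed", "node": "n1"},
    {"type": "map", "status": "running", "node": "n1"},
    {"type": "reduce", "status": "completed", "node": "n1"},
    {"type": "reduce", "status": "running", "node": "n1"},
]
handle_worker_failure(tasks, "n1")
print([t["status"] for t in tasks])  # ['idle', 'idle', 'completed', 'idle']
```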
Controlling MapReduce Execution with InputFormat
Hadoop InputFormat checks the input specification of the job. InputFormat splits the input file into InputSplits and assigns each split to an individual Mapper.
In this section we will learn:
what InputFormat is in Hadoop MapReduce,
the different methods to get the data to the mapper, and
the different types of InputFormat in Hadoop, such as FileInputFormat, TextInputFormat, KeyValueTextInputFormat, etc.
2. How do we get the data to the mapper?
FileInputFormat in Hadoop
It is the base class for all file-based InputFormats. Hadoop FileInputFormat specifies the input directory where data files are located. When we start a Hadoop job, FileInputFormat is provided with a path containing files to read. FileInputFormat reads all files and divides these files into one or more InputSplits.
TextInputFormat
It is the default InputFormat of MapReduce. TextInputFormat treats each line of each input file as a separate record and performs no parsing. This is useful for unformatted data or line-based records like log files.
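A rough sketch of what TextInputFormat produces: the key is the byte offset at which each line starts, and the value is the line content. The Python below is purely illustrative; Hadoop itself yields (LongWritable, Text) pairs via its Java API.

```python
def text_input_records(data: bytes):
    """Yield (byte_offset, line) records, like TextInputFormat's key-value pairs."""
    offset = 0
    for line in data.splitlines(keepends=True):
        yield offset, line.rstrip(b"\n").decode()
        offset += len(line)  # next record starts after this line's bytes

data = b"first line\nsecond\n"
print(list(text_input_records(data)))  # [(0, 'first line'), (11, 'second')]
```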
SequenceFileInputFormat
Hadoop SequenceFileInputFormat is an InputFormat which reads sequence files. Sequence files are binary files that store sequences of binary key-value pairs. Sequence files are block-compressed and provide direct serialization and deserialization of several arbitrary data types (not just text). Here both key and value are user-defined.
SequenceFileAsTextInputFormat
Hadoop SequenceFileAsTextInputFormat is another form of SequenceFileInputFormat which converts the sequence file's keys and values to Text objects. The conversion is performed on the keys and values by calling 'toString()'. This InputFormat makes sequence files suitable input for streaming.
SequenceFileAsBinaryInputFormat
Hadoop SequenceFileAsBinaryInputFormat is a SequenceFileInputFormat using which we can extract the sequence file's keys and values as opaque binary objects.
NLineInputFormat
Hadoop NLineInputFormat is another form of TextInputFormat where the keys are the byte offset of the line and the values are the contents of the line. With TextInputFormat and KeyValueTextInputFormat, each mapper receives a variable number of lines of input, and the number depends on the size of the split and the length of the lines. If we want our mapper to receive a fixed number of lines of input, we use NLineInputFormat.
N is the number of lines of input that each mapper receives. By default (N=1), each mapper receives exactly one line of input. If N=2, then each split contains two lines. One mapper will receive the first two key-value pairs and another mapper will receive the second two key-value pairs.
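The N-line splitting can be sketched as below (illustrative Python, not the Hadoop API); with N=2, the first mapper gets lines 1-2, the second gets lines 3-4, and so on, with a possibly shorter final split.

```python
def nline_splits(lines, n=1):
    """Group input lines into splits of n lines each, as NLineInputFormat does;
    each mapper then processes exactly one split (the last may be shorter)."""
    return [lines[i:i + n] for i in range(0, len(lines), n)]

lines = ["l1", "l2", "l3", "l4", "l5"]
print(nline_splits(lines, n=2))  # [['l1', 'l2'], ['l3', 'l4'], ['l5']]
```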
DBInputFormat
Hadoop DBInputFormat is an InputFormat that reads data from a relational database using JDBC. As it doesn't have partitioning capabilities, we need to be careful not to swamp the database by reading it with too many mappers. So it is best for loading relatively small datasets, perhaps for joining with large datasets from HDFS using MultipleInputs. Here the key is a LongWritable, while the value is a DBWritable.
UNIT V - BIG DATA PLATFORMS
Sqoop
Cassandra
MongoDB
HIVE
PIG
Storm
Flink
Sqoop
Sqoop is a tool designed to transfer data between Hadoop and relational database servers. It is used to import data from relational databases such as MySQL and Oracle to Hadoop HDFS, and to export data from the Hadoop file system to relational databases. This is a brief tutorial that explains how to make use of Sqoop in the Hadoop ecosystem.
Audience
This tutorial is prepared for professionals aspiring to make a career in Big Data Analytics using the Hadoop framework with Sqoop. ETL developers and professionals who are into analytics in general may also use this tutorial to good effect.
Prerequisites
Before proceeding with this tutorial, you need a basic knowledge of Core Java, database concepts of SQL, the Hadoop file system, and any flavor of the Linux operating system.
The traditional application management system, that is, the interaction of applications with a relational database using an RDBMS, is one of the sources that generate Big Data. Such Big Data, generated by the RDBMS, is stored in relational database servers in the relational database structure.
When Big Data storage and analysis tools such as MapReduce, Hive, HBase, Cassandra, Pig, etc. of the Hadoop ecosystem came into the picture, they required a tool to interact with the relational database servers for importing and exporting the Big Data residing in them. Here, Sqoop occupies a place in the Hadoop ecosystem to provide feasible interaction between relational database servers and Hadoop's HDFS.
Sqoop - "SQL to Hadoop and Hadoop to SQL"
Sqoop is a tool designed to transfer data between Hadoop and relational database servers. It is used to import data from relational databases such as MySQL and Oracle to Hadoop HDFS, and to export data from the Hadoop file system to relational databases. It is provided by the Apache Software Foundation.
How does Sqoop work?
The following image describes the workflow of Sqoop.
Sqoop Import
The import tool imports individual tables from an RDBMS to HDFS. Each row in a table is treated as a record in HDFS. All records are stored as text data in text files or as binary data in Avro and Sequence files.
Sqoop Export
The export tool exports a set of files from HDFS back to an RDBMS. The files given as input to Sqoop contain records, which are called rows in a table. Those are read and parsed into a set of records and delimited with a user-specified delimiter.
Cassandra
Cassandra is an open-source NoSQL distributed database that manages large amounts of data across commodity servers. It is a decentralized, scalable storage system designed to handle vast volumes of data across multiple commodity servers, providing high availability without a single point of failure.
Cassandra was created at Facebook but was open-sourced and released to become an Apache project (maintained by the American non-profit Apache Software Foundation) in 2008. It became a top-level Apache project in 2010 and is now among the best-known NoSQL database systems in the world, trusted and used by many large organizations.
1. Architecture of Cassandra
The beautiful thing about Cassandra's architecture is that it can easily be expanded to house more data. By adding more nodes, you can double the amount of data the system carries without overwhelming it. This dynamic scaling ability goes both ways: by reducing the number of nodes, developers can shrink the database system if necessary. Compared to previous structured query language (SQL) databases and the complexity of increasing their data-carrying capacity, Cassandra's architecture gives it a considerable advantage.
Another way Cassandra's architecture helps its functionality is that it increases data security and protects from data loss.
2. The partitioning system
3. Cassandra's replicability
Another way Cassandra works is by replicating data across nodes. These secondary nodes are called replica nodes, and the number of replica nodes for a given dataset is based on the replication factor (RF). A replication factor of 3 means three nodes cover the same token range, storing the same data. Multiple replicas are key to the reliability of Cassandra.
Even when one node stops functioning, temporarily or permanently, other nodes hold the same data, meaning that data is hardly ever wholly lost. Better still, if a temporarily disrupted node is back on track, it receives an update on the data actions it may have missed and then catches up to speed to continue functioning.
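Replica placement can be sketched as below. This is a hypothetical simplification: real Cassandra uses a configurable partitioner and snitch-aware replication strategies, but the core idea of hashing a key onto a ring and walking clockwise for RF replicas is the same.

```python
import hashlib

def replica_nodes(key, ring, rf=3):
    """Pick rf consecutive nodes on the ring for 'key' (simplified placement)."""
    token = int(hashlib.md5(key.encode()).hexdigest(), 16)  # key -> token
    start = token % len(ring)                               # owning node
    return [ring[(start + i) % len(ring)] for i in range(rf)]

ring = ["n0", "n1", "n2", "n3", "n4"]
replicas = replica_nodes("user:42", ring, rf=3)
print(len(replicas), len(set(replicas)))  # 3 3  -> three distinct replica nodes
```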
Key Features of Cassandra
Cassandra is a unique database system, and some of its key features include:
1. Open-source availability
Nothing is more exciting than getting a handy product for free. This is probably one of the significant factors behind Cassandra's far-reaching popularity and acceptance. Cassandra is among the open-source products hosted by Apache and is free for anyone who wants to utilize it.
2. Distributed footprint
Another feature of Cassandra is that it is well distributed and meant to run over multiple nodes as opposed to a central system. All the nodes are equal in significance, and without a master node, no bottleneck slows the process down. This is very important because the companies that utilize Cassandra need to constantly run on accurate data and cannot tolerate data loss. The equal and wide distribution of Cassandra data across nodes means that losing one node does not significantly affect the system's general performance.
3. Scalability
Cassandra has elastic scalability. This means that it can be scaled up or down without much difficulty or resistance. Cassandra's scalability is once again due to the nodal architecture. It is intended to grow horizontally as your needs as a developer or company grow. Scaling up in Cassandra is very easy and not limited to location. Adding or removing extra nodes can adjust your database system to suit your dynamic needs.
Another exciting point about scaling in Cassandra is that there is no slowdown, pause or hitch in the system during the process. This means end-users would not feel the effect of whatever happened, ensuring smooth service to all individuals connected to the network.
4. Cassandra Query Language
Cassandra is not a relational database and does not use the standard query language, SQL. It uses the Cassandra Query Language (CQL). This would have posed a problem for admins, as they would have to master a whole new language, but the good thing about CQL is that it is very similar to SQL. It is structured to operate with rows and columns, i.e., table-based data.
However, it does lack the flexibility that comes with the fixed schema of SQL. CQL combines the tabular database management system and the key-value store. It operates using data type operations, definition operations, data definition operations, triggers operations, security operations, arithmetic operations, etc.
5. Fault tolerance
Cassandra is fault-tolerant primarily because of its data replicative ability. Data replication denotes the ability of the system to store the same information at multiple locations or nodes. This makes it highly available and tolerant of faults in the system. Failure of a single node or data center does not bring the system to a halt, as data has been replicated and stored across the nodes in the cluster. Data replication leads to a high level of backup and recovery.
6. Schema free
SQL is a fixed-schema database language, making it rigid and fixed. However, Cassandra is a schema-optional data model and allows the operator to create as many rows and columns as is deemed necessary.
7. Tunable consistency
Cassandra has two types of consistency: eventual consistency and strong consistency. Strong consistency broadcasts any update or information to every node where the concerned data is located before the write is considered complete. In eventual consistency, the client is acknowledged as soon as the cluster accepts a write, and the update propagates to the remaining replicas in the background.
Cassandra's tunable consistency is a feature that allows the developer to decide to use either of the two types depending on the function being carried out. The developer can use either or both kinds of consistency at any time.
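The "tunable" part can be made concrete with the standard quorum rule: with replication factor RF, a read consistency level R and a write consistency level W guarantee that a read sees the latest write when R + W > RF, because any read set must overlap any write set in at least one replica. A minimal sketch:

```python
def is_strongly_consistent(rf, r, w):
    """True when read quorum r plus write quorum w guarantees overlap
    among rf replicas (the R + W > RF rule)."""
    return r + w > rf

print(is_strongly_consistent(rf=3, r=2, w=2))  # True  (QUORUM reads + QUORUM writes)
print(is_strongly_consistent(rf=3, r=1, w=1))  # False (eventual consistency)
```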
8. Fast writes
Cassandra is known to have a very high throughput, not hindered by its size. Its ability to write quickly is a function of its data handling process. The initial step taken is to write to the commit log. This is for durability, to preserve data in case of damage or node downtime. Writing to the commit log is a speedy and efficient process using this tool.
Top 6 Uses of Cassandra
Cassandra is an open-source NoSQL database management system with many advantages and practical functionalities that rival other systems. It is used by several large and small companies across the globe. Some of the top applications of Cassandra include:
1. E-commerce
E-commerce is an extremely sensitive field that cuts across every region and country. The nature of financial markets means anticipated peak times as well as downtimes. For a finance operation, no customer would want to experience downtime or lack of access when there is revenue to be earned and lots of opportunities to hold on to. E-commerce companies can avoid these downtimes or potential blackouts by using a highly reliable system like Cassandra. Its fault tolerance allows it to keep running even if a whole center is damaged, with little or no hitch in the system.
Due to its easy scalability, especially in peak seasons, e-commerce and inventory management is also a significant application of Cassandra. When there is a market rush, the company has to increase the ability of the database to carry and store more data. Seasonal, rapid e-commerce growth that is affordable and does not cause a system restart is simply a perfect fit for companies.
E-commerce websites also benefit from Cassandra as it stores and records visitors' activities. It then allows analytical tools to modulate the visitor's actions and, for instance, tempt them to stay on the website.
2. Entertainment websites
With the aid of Cassandra, websites for movies, games and music can keep track of customer behavior and preferences. The database records, for each visitor, what was clicked, downloaded, time spent, etc. This information is analyzed and used to recommend further entertainment options to the end-user.
This application of Cassandra falls under the personalization, recommendation and customer experience use cases. It is not just limited to entertainment sites, but extends to online shopping platforms and social media recommendations. This is why users receive notifications of goods similar to what they spent time browsing.
3. Internet of Things (IoT) and edge computing
Today's world is seeing the rise of the Internet of Things (IoT). We are steadily bombarded with thousands of new information points or datasets. Every wearable device, weather sensor, traffic sensor, or mobile device keeps track of and sends data on weather, traffic, energy usage, soil conditions, etc. This flood of information can be overwhelming and easily lost. Cassandra suits such workloads because:
• Cassandra allows every individual node to carry out read and write operations.
• It can handle and store a large amount of data.
• Cassandra supports the analysis of data in real time.
4. Authentication and fraud detection
Fraud detection is essential to the security and reliability of many companies, mainly banks, insurance and other financial institutions. At all points, these companies must ensure that they can beat the new, revamped ways of stealing data developed by fraudsters. Financial companies aim to keep fraudsters and hackers away from their systems. Cassandra is applicable here due to continuous, real-time, big data analytics. Cassandra can source data across a wide range of internet activities and can trigger alarms when patterns and irregularities that may precede fraud are detected.
On the other hand, these same financial institutions are required to carry out a seamless identity authentication process. To make the user login authentication process tight enough to allow only genuine customers, yet simple enough to make it easy for them, Cassandra helps to carry out real-time analysis. In addition to these analyses, Cassandra is also vital because, with it, you can be sure of constant accessibility.
In general, Cassandra is among the top database options chosen to detect fraud and carry out authentication processes because:
• It allows for real-time analysis with machine learning and artificial intelligence (AI).
• It can host large amounts of actively growing datasets.
• It has a flexible schema to allow the processing of different data types.
5. Messaging
There are currently several messaging applications in use, and an ever-growing number of individuals are using them. This creates the need for a stable database system to store ever-flowing information volumes. Cassandra provides both stability and storage capacity for companies that offer messaging services.
6. Logistics and asset management
Cassandra is used in logistics and asset management to track the movement of any item to be transported. From the purchase to the final delivery, applications can rely on Cassandra to log each transaction.
What is Apache Pig?
Apache Pig is an abstraction over MapReduce. It is a tool/platform which is used to analyze larger sets of data, representing them as data flows. Pig is generally used with Hadoop; we can perform all the data manipulation operations in Hadoop using Apache Pig.
To write data analysis programs, Pig provides a high-level language known as Pig Latin. This language provides various operators using which programmers can develop their own functions for reading, writing, and processing data.
To analyze data using Apache Pig, programmers need to write scripts using the Pig Latin language. All these scripts are internally converted to Map and Reduce tasks. Apache Pig has a component known as the Pig Engine that accepts the Pig Latin scripts as input and converts those scripts into MapReduce jobs.
Why Do We Need Apache Pig?
Programmers who are not so good at Java normally used to struggle working with Hadoop, especially while performing any MapReduce tasks. Apache Pig is a boon for all such programmers.
• Using Pig Latin, programmers can perform MapReduce tasks easily without having to type complex code in Java.
• Apache Pig uses a multi-query approach, thereby reducing the length of code. For example, an operation that would require you to type 200 lines of code (LoC) in Java can easily be done by typing as few as just 10 LoC in Apache Pig. Ultimately, Apache Pig reduces the development time by almost 16 times.
• Pig Latin is a SQL-like language, and it is easy to learn Apache Pig when you are familiar with SQL.
• Apache Pig provides many built-in operators to support data operations like joins, filters, ordering, etc. In addition, it also provides nested data types like tuples, bags, and maps that are missing from MapReduce.
Features of Pig
Apache Pig comes with the following features:
• Rich set of operators - It provides many operators to perform operations like join, sort, filter, etc.
• Ease of programming - Pig Latin is similar to SQL, and it is easy to write a Pig script if you are good at SQL.
• Optimization opportunities - The tasks in Apache Pig optimize their execution automatically, so the programmers need to focus only on the semantics of the language.
• Extensibility - Using the existing operators, users can develop their own functions to read, process, and write data.
• UDFs - Pig provides the facility to create User-Defined Functions in other programming languages such as Java and invoke or embed them in Pig scripts.
• Handles all kinds of data - Apache Pig analyzes all kinds of data, both structured as well as unstructured. It stores the results in HDFS.
Apache Pig Vs MapReduce
Listed below are the major differences between Apache Pig and MapReduce.
• Apache Pig is a dataflow language; MapReduce is a data processing paradigm.
• Apache Pig is a high-level language; MapReduce is low level and rigid.
Apache Pig Vs SQL
Listed below are the major differences between Apache Pig and SQL.
• The data model in Apache Pig is nested relational; the data model used in SQL is flat relational.
• Apache Pig provides limited opportunity for query optimization; there is more opportunity for query optimization in SQL.
In addition to the above differences, Apache Pig Latin:
• Allows splits in the pipeline.
• Allows developers to store data anywhere in the pipeline.
• Declares execution plans.
• Provides operators to perform ETL (Extract, Transform, and Load) functions.
Apache Pig Vs Hive
Both Apache Pig and Hive are used to create MapReduce jobs, and in some cases, Hive operates on HDFS in a similar way Apache Pig does. In the following list, we have noted a few significant points that set Apache Pig apart from Hive.
• Pig Latin is a dataflow language; HiveQL is a query processing language.
• Pig Latin is a procedural language and it fits in the pipeline paradigm; HiveQL is a declarative language.
Apache Pig is a data processing language which provides a rich set of data types and operators to perform various operations on the data.
To perform a particular task, programmers using Pig need to write a Pig script using the Pig Latin language, and execute it using any of the execution mechanisms (Grunt Shell, UDFs, Embedded). After execution, these scripts will go through a series of transformations applied by the Pig Framework, to produce the desired output. Internally, Apache Pig converts these scripts into a series of MapReduce jobs, and thus, it makes the programmer's job easy. The architecture of Apache Pig is shown below.
Apache Pig Components
As shown in the figure, there are various components in the Apache Pig framework. Let us take a look at the major components.
Parser
Initially the Pig scripts are handled by the Parser. It checks the syntax of the script, does type checking, and other miscellaneous checks. The output of the parser will be a DAG (directed acyclic graph), which represents the Pig Latin statements and logical operators.
In the DAG, the logical operators of the script are represented as the nodes and the data flows are represented as edges.
Optimizer
The logical plan (DAG) is passed to the logical optimizer, which carries out the logical optimizations such as projection and pushdown.
Compiler
The compiler compiles the optimized logical plan into a series of MapReduce jobs.
Execution engine
Finally, the MapReduce jobs are submitted to Hadoop in a sorted order, and these MapReduce jobs are executed on Hadoop, producing the desired results.
Pig Latin Data Model
The data model of Pig Latin is fully nested and it allows complex non-atomic data types such as map and tuple. Given below is the diagrammatical representation of Pig Latin's data model.
Atom
Any single value in Pig Latin, irrespective of its data type, is known as an Atom. It is stored as a string and can be used as a string and a number. int, long, float, double, chararray, and bytearray are the atomic values of Pig. A piece of data or a simple atomic value is known as a field.
Example - 'raja' or '30'
Tuple
A record that is formed by an ordered set of fields is known as a tuple; the fields can be of any type. A tuple is similar to a row in a table of an RDBMS.
Example - (Raja, 30)
Bag
A bag is an unordered set of tuples. In other words, a collection of tuples (non-unique) is known as a bag. Each tuple can have any number of fields (flexible schema). A bag is represented by '{}'. It is similar to a table in an RDBMS, but unlike a table in an RDBMS, it is not necessary that every tuple contain the same number of fields or that the fields in the same position (column) have the same type.
Example - {(Raja, 30), (Mohammad, 45)}
A bag can be a field in a relation; in that context, it is known as an inner bag.
Example - {Raja, 30, {9848022338, raja@gmail.com,}}
Map
A map (or data map) is a set of key-value pairs. The key needs to be of type chararray and should be unique. The value might be of any type. It is represented by '[]'.
Example - [name#Raja, age#30]
Relation
A relation is a bag of tuples. The relations in Pig Latin are unordered (there is no guarantee that tuples are processed in any particular order).
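The data model above can be mirrored with ordinary Python values, purely as an illustration; Pig itself stores these as its own types, and the variable names here are ours.

```python
# Pig Latin data model mirrored with plain Python values (illustrative only):
atom = "raja"                                   # Atom: a single value / field
row = ("Raja", 30)                              # Tuple: ordered set of fields
bag = {("Raja", 30), ("Mohammad", 45)}          # Bag: unordered set of tuples
data_map = {"name": "Raja", "age": 30}          # Map: chararray keys, any values
relation = bag                                  # Relation: a bag of tuples

# A relation is unordered, so we only ask about membership, never position.
assert ("Raja", 30) in relation
print(sorted(t[0] for t in relation))  # ['Mohammad', 'Raja']
```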
Apache Pig - Installation
This chapter explains how to download, install, and set up Apache Pig in your system.
Prerequisites
It is essential that you have Hadoop and Java installed on your system before you go for Apache Pig. Therefore, prior to installing Apache Pig, install Hadoop and Java by following the steps given in the following link:
http://www.tutorialspoint.com/hadoop/hadoop_enviornment_setup.htm
Download Apache Pig
First of all, download the latest version of Apache Pig from the following website: https://pig.apache.org/
Step 1
Open the homepage of the Apache Pig website. Under the section News, click on the link release page as shown in the following snapshot.
Step 2
On clicking the specified link, you will be redirected to the Apache Pig Releases page. On this page, under the Download section, you will have two links, namely, Pig 0.8 and later and Pig 0.7 and before. Click on the link Pig 0.8 and later; then you will be redirected to the page having a set of mirrors.
Step 3
Choose and click any one of these mirrors as shown below.
Step 4
These mirrors will take you to the Pig Releases page. This page contains various versions of Apache Pig. Click the latest version among them.
Step 5
Within these folders, you will have the source and binary files of Apache Pig in various distributions. Download the tar files of the source and binary files of Apache Pig 0.15, pig-0.15.0-src.tar.gz and pig-0.15.0.tar.gz.
Install Apache Pig
After downloading the Apache Pig software, install it in your Linux environment by following the steps given below.
Step 1
Create a directory with the name Pig in the same directory where the installation directories of Hadoop, Java, and other software were installed. (In our tutorial, we have created the Pig directory in the user named Hadoop.)
$ mkdir Pig
Step 2
Extract the downloaded tar files as shown below.
$ cd Downloads/
$ tar zxvf pig-0.15.0-src.tar.gz
$ tar zxvf pig-0.15.0.tar.gz
Step 3
Move the content of the extracted pig-0.15.0-src directory to the Pig directory created earlier as shown below.
$ mv pig-0.15.0-src/* /home/Hadoop/Pig/
Configure Apache Pig
After installing Apache Pig, we have to configure it. To configure it, we need to edit two files: .bashrc and pig.properties.
.bashrc file
In the .bashrc file, set the following variables:
• PIG_HOME to Apache Pig's installation folder,
• the PATH environment variable to the bin folder, and
• the PIG_CLASSPATH environment variable to the etc (configuration) folder of your Hadoop installation (the directory that contains the core-site.xml, hdfs-site.xml and mapred-site.xml files).
export PIG_HOME=/home/Hadoop/Pig
export PATH=$PATH:/home/Hadoop/Pig/bin
export PIG_CLASSPATH=$HADOOP_HOME/conf
pig.properties file
In the conf folder of Pig, we have a file named pig.properties. In the pig.properties file, you can set various parameters as given below.
You can verify the installation with the pig -version command; a successful installation prints output like the following.
Apache Pig version 0.15.0 (r1682971)
compiled Jun 01 2015, 11:44:35
Running Pig
You can run Pig (execute Pig Latin statements and Pig commands) using various modes:

                   Local Mode   Tez Local Mode   Spark Local Mode   Mapreduce Mode   Tez Mode   Spark Mode
Interactive Mode   yes          experimental     experimental       yes              yes        yes
Batch Mode         yes          experimental     experimental       yes              yes        yes
ExecutionModes
Pighassixexecutionmodesorexectypes:
• LocalMode-TorunPiginlocalmode,youneedaccesstoasinglemachine;allfilesareinstalled
and runusing your localhost and file system. Specify local mode using the -xflag (pig -x local).
• TezLocalMode-TorunPig intezlocalmode. Itissimilar tolocalmode, exceptinternallyPig
willinvoke tez runtime engine.SpecifyTez localmode using the -xflag(pig -xtez_local).
Note:Tezlocalmodeisexperimental.Therearesomequerieswhichjust erroroutonbiggerdatainlocal
mode.
• Spark Local Mode - To run Pig in Spark local mode. It is similar to local mode, except internally Pig will invoke the Spark runtime engine. Specify Spark local mode using the -x flag (pig -x spark_local). Note: Spark local mode is experimental. There are some queries which just error out on bigger data in local mode.
• Mapreduce Mode - To run Pig in mapreduce mode, you need access to a Hadoop cluster and HDFS installation. Mapreduce mode is the default mode; you can, but don't need to, specify it using the -x flag (pig OR pig -x mapreduce).
• Spark Mode - To run Pig in Spark mode, you need access to a Spark, Yarn or Mesos cluster and HDFS installation. Specify Spark mode using the -x flag (-x spark). In Spark execution mode, it is necessary to set env::SPARK_MASTER to an appropriate value (local - local mode, yarn-client - yarn-client mode, mesos://host:port - Spark on Mesos, or spark://host:port - Spark cluster; for more information refer to the Spark documentation on Master URLs; yarn-cluster mode is currently not supported). Pig scripts run on Spark can take advantage of the dynamic allocation feature. The feature can be enabled by simply enabling spark.dynamicAllocation.enabled. Refer to the Spark configuration for additional configuration details. In general, all properties in the Pig script prefixed with spark. are copied to the Spark Application Configuration. Please note that the Yarn auxiliary service needs to be enabled on Spark for this to work. See the Spark documentation for additional details.
You can run Pig in either mode using the "pig" command (the bin/pig Perl script) or the "java" command (java -cp pig.jar ...).
Examples
These examples show how to run Pig in local and mapreduce mode using the pig command.
/* local mode */
$ pig -x local ...

/* Tez local mode */
$ pig -x tez_local ...

/* Spark local mode */
$ pig -x spark_local ...

/* mapreduce mode */
$ pig ...
or
$ pig -x mapreduce ...

/* Tez mode */
$ pig -x tez ...

/* Spark mode */
$ pig -x spark ...
Interactive Mode
You can run Pig in interactive mode using the Grunt shell. Invoke the Grunt shell using the "pig" command (as shown below) and then enter your Pig Latin statements and Pig commands interactively at the command line.
Example
These Pig Latin statements extract all user IDs from the /etc/passwd file. First, copy the /etc/passwd file to your local working directory. Next, invoke the Grunt shell by typing the "pig" command (in local or hadoop mode). Then, enter the Pig Latin statements interactively at the grunt prompt (be sure to include the semicolon after each statement). The DUMP operator will display the results to your terminal screen.
grunt> A = load 'passwd' using PigStorage(':');
grunt> B = foreach A generate $0 as id;
grunt> dump B;
Local Mode
$ pig -x local
... - Connecting to ...
grunt>
Tez Local Mode
$ pig -x tez_local
... - Connecting to ...
grunt>
Spark Local Mode
$ pig -x spark_local
... - Connecting to ...
grunt>
Mapreduce Mode
$ pig -x mapreduce
... - Connecting to ...
grunt>
or
$ pig
... - Connecting to ...
grunt>
Tez Mode
$ pig -x tez
... - Connecting to ...
grunt>
Spark Mode
$ pig -x spark
... - Connecting to ...
grunt>
Batch Mode
You can run Pig in batch mode using Pig scripts and the "pig" command (in local or hadoop mode).
Example
The Pig Latin statements in the Pig script (id.pig) extract all user IDs from the /etc/passwd file. First, copy the /etc/passwd file to your local working directory. Next, run the Pig script from the command line (using local or mapreduce mode). The STORE operator will write the results to a file (id.out).
/* id.pig */
Tez Mode
$ pig -x tez id.pig
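The body of id.pig is not reproduced in these notes. Based on the interactive example earlier in this section, it would contain statements along these lines (a sketch, not the original file):

```pig
/* id.pig -- extract user IDs from passwd and store them in id.out */
A = load 'passwd' using PigStorage(':');  -- fields in /etc/passwd are colon-separated
B = foreach A generate $0 as id;          -- keep only the first field (the user ID)
store B into 'id.out';                    -- write the results to the file id.out
```

The script can then be run with, for example, $ pig -x local id.pig (local mode) or $ pig id.pig (mapreduce mode).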
Spark Mode
$ pig -x spark id.pig
Pig Scripts
Use Pig scripts to place Pig Latin statements and Pig commands in a single file. While not required, it is good practice to identify the file using the *.pig extension.
You can run Pig scripts from the command line and from the Grunt shell (see the run and exec commands).
Pig scripts allow you to pass values to parameters using parameter substitution.
Comments in Scripts
You can include comments in Pig scripts:
• For multi-line comments use /* ... */
• For single-line comments use --
/* myscript.pig
My script is simple.
It includes three Pig Latin statements.
*/
Scripts and Distributed File Systems
Pig supports running scripts (and Jar files) that are stored in HDFS, Amazon S3, and other distributed file systems. The script's full location URI is required (see REGISTER for information about Jar files). For example, to run a Pig script on HDFS, do the following:
$ pig hdfs://nn.mydomain.com:9020/myscripts/script.pig
Pig Latin Statements
Pig Latin statements are the basic constructs you use to process data using Pig. A Pig Latin statement is an operator that takes a relation as input and produces another relation as output. (This definition applies to all Pig Latin operators except LOAD and STORE, which read data from and write data to the file system.) Pig Latin statements may include expressions and schemas. Pig Latin statements can span multiple lines and must end with a semi-colon (;). By default, Pig Latin statements are processed using multi-query execution.
Pig Latin statements are generally organized as follows:
• A LOAD statement to read data from the file system.
• A series of "transformation" statements to process the data.
• A DUMP statement to view results or a STORE statement to save the results. Note that a DUMP or STORE statement is required to generate output.
In this example Pig will validate, but not execute, the LOAD and FOREACH statements:
A = LOAD 'student' USING PigStorage() AS (name:chararray, age:int, gpa:float);
B = FOREACH A GENERATE name;
In this example, Pig will validate and then execute the LOAD, FOREACH, and DUMP statements:
A = LOAD 'student' USING PigStorage() AS (name:chararray, age:int, gpa:float);
B = FOREACH A GENERATE name;
DUMP B;
(John)
(Mary)
(Bill)
(Joe)
Loading Data
Use the LOAD operator and the load/store functions to read data into Pig (PigStorage is the default load function).
Working with Data
Pig allows you to transform data in many ways. As a starting point, become familiar with these operators:
• Use the FILTER operator to work with tuples or rows of data. Use the FOREACH operator to work with columns of data.
• Use the GROUP operator to group data in a single relation. Use the COGROUP, inner JOIN, and outer JOIN operators to group or join data in two or more relations.
• Use the UNION operator to merge the contents of two or more relations. Use the SPLIT operator to partition the contents of a relation into multiple relations.
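As a minimal sketch combining a few of these operators (the student file and its schema are hypothetical, chosen to match the earlier examples):

```pig
-- load rows from a hypothetical comma-separated student file
students = LOAD 'student' USING PigStorage(',') AS (name:chararray, age:int, gpa:float);
adults   = FILTER students BY age >= 18;        -- FILTER works with rows of data
names    = FOREACH adults GENERATE name, gpa;   -- FOREACH works with columns of data
by_age   = GROUP students BY age;               -- GROUP groups data in a single relation
DUMP names;
```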
To specify Pig properties use one of these mechanisms:
• The pig.properties file (add the directory that contains the pig.properties file to the classpath)
• The -D and a Pig property in the PIG_OPTS environment variable (export PIG_OPTS=-Dpig.tmpfilecompression=true)
• The -P command line option and a properties file (pig -P mypig.properties)
• The set command (set pig.exec.nocombiner true)
Note: The properties file uses the standard Java property file format.
The following precedence order is supported: pig.properties < -D Pig property < -P properties file < set command. This means that if the same property is provided using the -D command line option as well as the -P command line option (properties file), the value of the property in the properties file will take precedence.
TospecifyHadooppropertiesyoucanusethesamemechanisms:
• Hadoopconfigurationfiles(includepig-cluster-hadoop-site.xml)
• The-D and aHadoop propertyinPIG_OPTS environmentvariable (export PIG_OPTS=-
Dmapreduce.task.profile=true)
• The-Pcommandlineoptionandaproperty file(pig-Pproperty_file)
• Thesetcommand(setmapred.map.tasks.speculative.executionfalse)
The same precedenceholds: Hadoop configurationfiles< -D Hadoop property< -P properties_file< set
command.
Hadooppropertiesare not interpretedbyPigbut are passeddirectly to Hadoop.AnyHadoopproperty can
be passed this way.
Allproperties that Pigcollects, including Hadoop properties, areavailable toanyUDFviathe
UDFContextobject. To get access to the properties,you cancallthe getJobConf method.
Apache Pig - Diagnostic Operators
The load statement will simply load the data into the specified relation in Apache Pig. To verify the execution of the load statement, you have to use the Diagnostic Operators. Pig Latin provides four different types of diagnostic operators:
• Dump operator
• Describe operator
• Explanation (explain) operator
• Illustration (illustrate) operator
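As an illustrative sketch of how the four operators are used at the Grunt prompt (the student_data.txt file, its HDFS location, and its schema are assumptions):

```pig
grunt> student = LOAD 'hdfs://localhost:9000/pig_data/student_data.txt'
           USING PigStorage(',') AS (id:int, name:chararray);
grunt> DUMP student;        -- runs the job and prints the relation's contents
grunt> DESCRIBE student;    -- prints the schema of the relation
grunt> EXPLAIN student;     -- prints the logical, physical and MapReduce execution plans
grunt> ILLUSTRATE student;  -- shows step-by-step execution on a small sample of the data
```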
The GROUP operator is used to group the data in one or more relations. It collects the data having the same key.
Syntax
Given below is the syntax of the GROUP operator:
grunt> Group_data = GROUP Relation_name BY age;
Apache Pig - Cogroup Operator
The COGROUP operator works more or less in the same way as the GROUP operator. The only difference between the two operators is that the GROUP operator is normally used with one relation, while the COGROUP operator is used in statements involving two or more relations.
Grouping Two Relations using Cogroup
Assume that we have two files namely student_details.txt and employee_details.txt in the HDFS directory /pig_data/.
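A sketch of cogrouping the two relations (the schemas shown are assumptions; only the file names and directory come from the text above):

```pig
grunt> student_details  = LOAD 'hdfs://localhost:9000/pig_data/student_details.txt'
           USING PigStorage(',') AS (id:int, name:chararray, age:int, city:chararray);
grunt> employee_details = LOAD 'hdfs://localhost:9000/pig_data/employee_details.txt'
           USING PigStorage(',') AS (id:int, name:chararray, age:int, city:chararray);
-- group both relations on the same key; each output tuple holds one bag per relation
grunt> cogroup_data = COGROUP student_details BY age, employee_details BY age;
grunt> DUMP cogroup_data;
```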
Apache Pig - Join Operator
The JOIN operator is used to combine records from two or more relations. While performing a join operation, we declare one (or a group of) tuple(s) from each relation as keys. When these keys match, the two particular tuples are matched, else the records are dropped. Joins can be of the following types:
• Self-join
• Inner-join
• Outer-join - left join, right join, and full join
This chapter explains with examples how to use the JOIN operator in Pig Latin. Assume that we have two files namely customers.txt and orders.txt in the /pig_data/ directory of HDFS.
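For example, an inner join of the two files on the customer ID might look like this (the schemas are assumptions; only the file names come from the text above):

```pig
grunt> customers = LOAD 'hdfs://localhost:9000/pig_data/customers.txt'
           USING PigStorage(',') AS (id:int, name:chararray, age:int, salary:int);
grunt> orders    = LOAD 'hdfs://localhost:9000/pig_data/orders.txt'
           USING PigStorage(',') AS (oid:int, date:chararray, customer_id:int, amount:int);
-- inner join: keep only tuples whose keys match in both relations
grunt> customer_orders = JOIN customers BY id, orders BY customer_id;
grunt> DUMP customer_orders;
```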
Apache Pig - Cross Operator
Apache Pig - Foreach Operator
The FOREACH operator is used to generate specified data transformations based on the column data.
Syntax
Given below is the syntax of the FOREACH operator:
grunt> Relation_name2 = FOREACH Relation_name1 GENERATE (required data);
Working with Functions in Pig
Apache Pig provides various built-in functions, namely eval, load, store, math, string, bag and tuple functions.
Eval Functions
Given below is the list of eval functions provided by Apache Pig.

S.N.  Function & Description
1   AVG() - To compute the average of the numerical values within a bag.
2   BagToString() - To concatenate the elements of a bag into a string. While concatenating, we can place a delimiter between these values (optional).
3   CONCAT() - To concatenate two or more expressions of the same type.
4   COUNT() - To get the number of elements in a bag, while counting the number of tuples in a bag.
5   COUNT_STAR() - It is similar to the COUNT() function. It is used to get the number of elements in a bag.
6   DIFF() - To compare two bags (fields) in a tuple.
7   IsEmpty() - To check if a bag or map is empty.
8   MAX() - To get the highest (maximum) value (numeric or chararray) for a certain column in a single-column bag.
9   MIN() - To get the minimum (lowest) value (numeric or chararray) for a certain column in a single-column bag.
10  PluckTuple() - Using the Pig Latin PluckTuple() function, we can define a string prefix and filter the columns in a relation that begin with the given prefix.
11  SIZE() - To compute the number of elements based on any Pig data type.
12  SUBTRACT() - To subtract two bags. It takes two bags as inputs and returns a bag which contains the tuples of the first bag that are not in the second bag.
13  SUM() - To get the total of the numeric values of a column in a single-column bag.
14  TOKENIZE() - To split a string (which contains a group of words) in a single tuple and return a bag which contains the output of the split operation.
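A sketch of using a few of these eval functions together (the student relation and its schema are hypothetical):

```pig
students = LOAD 'student' USING PigStorage(',') AS (name:chararray, age:int, gpa:float);
grouped  = GROUP students ALL;                        -- a single group holding the whole bag
stats    = FOREACH grouped GENERATE
               COUNT(students),                       -- number of tuples in the bag
               AVG(students.gpa),                     -- average of a numeric column
               MAX(students.gpa), MIN(students.gpa);  -- highest and lowest values
DUMP stats;
```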
The Load and Store functions in Apache Pig are used to determine how data goes into and comes out of Pig. These functions are used with the load and store operators. Given below is the list of load and store functions available in Pig.

S.N.  Function & Description
1   PigStorage() - To load and store structured files.
2   TextLoader() - To load unstructured data into Pig.
3   BinStorage() - To load and store data into Pig using a machine-readable format.
4   Handling Compression - In Pig Latin, we can load and store compressed data.
Given below is the list of Bag and Tuple functions.

S.N.  Function & Description
1   TOBAG() - To convert two or more expressions into a bag.
2   TOP() - To get the top N tuples of a relation.
3   TOTUPLE() - To convert one or more expressions into a tuple.
4   TOMAP() - To convert the key-value pairs into a Map.
We have the following String functions in Apache Pig.

S.N.  Functions & Description
1   ENDSWITH(string, testAgainst) - To verify whether a given string ends with a particular substring.
2   STARTSWITH(string, substring) - Accepts two string parameters and verifies whether the first string starts with the second.
4   EqualsIgnoreCase(string1, string2) - To compare two strings ignoring the case.
5   INDEXOF(string, 'character', startIndex) - Returns the first occurrence of a character in a string, searching forward from a start index.
6   LAST_INDEX_OF(expression) - Returns the index of the last occurrence of a character in a string, searching backward from a start index.
7   LCFIRST(expression) - Converts the first character in a string to lowercase.
8   UCFIRST(expression) - Returns a string with the first character converted to uppercase.
9   UPPER(expression) - Returns a string converted to uppercase.
10  LOWER(expression) - Converts all characters in a string to lowercase.
11  REPLACE(string, 'oldChar', 'newChar') - To replace existing characters in a string with new characters.
12  STRSPLIT(string, regex, limit) - To split a string around matches of a given regular expression.
13  STRSPLITTOBAG(string, regex, limit) - Similar to the STRSPLIT() function, it splits the string by a given delimiter and returns the result in a bag.
14  TRIM(expression) - Returns a copy of a string with leading and trailing whitespace removed.
15  LTRIM(expression) - Returns a copy of a string with leading whitespace removed.
16  RTRIM(expression) - Returns a copy of a string with trailing whitespace removed.
Apache Pig provides the following Date and Time functions:

S.N.  Functions & Description
1   ToDate(milliseconds) - This function returns a date-time object according to the given parameters. The other alternatives for this function are ToDate(isostring), ToDate(userstring, format), and ToDate(userstring, format, timezone).
2   CurrentTime() - Returns the date-time object of the current time.
3   GetDay(datetime) - Returns the day of a month from the date-time object.
4   GetHour(datetime) - Returns the hour of a day from the date-time object.
5   GetMilliSecond(datetime) - Returns the millisecond of a second from the date-time object.
6   GetMinute(datetime) - Returns the minute of an hour from the date-time object.
7   GetMonth(datetime) - Returns the month of a year from the date-time object.
8   GetSecond(datetime) - Returns the second of a minute from the date-time object.
9   GetWeek(datetime) - Returns the week of a year from the date-time object.
10  GetWeekYear(datetime) - Returns the week year from the date-time object.
11  GetYear(datetime) - Returns the year from the date-time object.
12  AddDuration(datetime, duration) - Returns the result of a date-time object along with the duration object.
13  SubtractDuration(datetime, duration) - Subtracts the Duration object from the Date-Time object and returns the result.
14  DaysBetween(datetime1, datetime2) - Returns the number of days between the two date-time objects.
15  HoursBetween(datetime1, datetime2) - Returns the number of hours between two date-time objects.
16  MilliSecondsBetween(datetime1, datetime2) - Returns the number of milliseconds between two date-time objects.
17  MinutesBetween(datetime1, datetime2) - Returns the number of minutes between two date-time objects.
18  MonthsBetween(datetime1, datetime2) - Returns the number of months between two date-time objects.
19  SecondsBetween(datetime1, datetime2) - Returns the number of seconds between two date-time objects.
20  WeeksBetween(datetime1, datetime2) - Returns the number of weeks between two date-time objects.
21  YearsBetween(datetime1, datetime2) - Returns the number of years between two date-time objects.
We have the following Math functions in Apache Pig:

S.N.  Functions & Description
1   ABS(expression) - To get the absolute value of an expression.
2   ACOS(expression) - To get the arc cosine of an expression.
3   ASIN(expression) - To get the arc sine of an expression.
4   ATAN(expression) - This function is used to get the arc tangent of an expression.
5   CBRT(expression) - This function is used to get the cube root of an expression.
6   CEIL(expression) - This function is used to get the value of an expression rounded up to the nearest integer.
7   COS(expression) - This function is used to get the trigonometric cosine of an expression.
8   COSH(expression) - This function is used to get the hyperbolic cosine of an expression.
9   EXP(expression) - This function is used to get Euler's number e raised to the power of x.
10  FLOOR(expression) - To get the value of an expression rounded down to the nearest integer.
11  LOG(expression) - To get the natural logarithm (base e) of an expression.
12  LOG10(expression) - To get the base 10 logarithm of an expression.
13  RANDOM() - To get a pseudorandom number (type double) greater than or equal to 0.0 and less than 1.0.
14  ROUND(expression) - To get the value of an expression rounded to an integer (if the result type is float) or rounded to a long (if the result type is double).
15  SIN(expression) - To get the sine of an expression.
16  SINH(expression) - To get the hyperbolic sine of an expression.
17  SQRT(expression) - To get the positive square root of an expression.
18  TAN(expression) - To get the trigonometric tangent of an angle.
19  TANH(expression) - To get the hyperbolic tangent of an expression.
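A short sketch applying some of the string and math functions above inside a FOREACH (the relation and its fields are hypothetical):

```pig
data   = LOAD 'student' USING PigStorage(',') AS (name:chararray, gpa:double);
result = FOREACH data GENERATE
             UPPER(name),   -- string function: convert the name to uppercase
             TRIM(name),    -- strip leading and trailing whitespace
             ROUND(gpa),    -- math function: round to the nearest integer
             SQRT(gpa);     -- positive square root
DUMP result;
```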
Hive - Introduction
The term 'Big Data' is used for collections of large datasets that include huge volume, high velocity, and a variety of data that is increasing day by day. Using traditional data management systems, it is difficult to process Big Data. Therefore, the Apache Software Foundation introduced a framework called Hadoop to solve Big Data management and processing challenges.
Hadoop
Hadoop is an open-source framework to store and process Big Data in a distributed environment. It contains two modules, one is MapReduce and the other is Hadoop Distributed File System (HDFS).
MapReduce: It is a parallel programming model for processing large amounts of structured, semi-structured, and unstructured data on large clusters of commodity hardware.
HDFS: Hadoop Distributed File System is a part of the Hadoop framework, used to store and process the datasets. It provides a fault-tolerant file system to run on commodity hardware.
The Hadoop ecosystem contains different sub-projects (tools) such as Sqoop, Pig, and Hive that are used to help the Hadoop modules.
Sqoop: It is used to import and export data between HDFS and RDBMS.
This component diagram contains different units. The following table describes each unit:

Unit Name        Operation
User Interface   Hive is a data warehouse infrastructure software that can create interaction between user and HDFS. The user interfaces that Hive supports are Hive Web UI, Hive command line, and Hive HD Insight (in Windows server).
Execution Engine The execution engine processes the query and generates results as same as MapReduce results. It uses the flavor of MapReduce. (The rows for the other units are not fully preserved in these notes.)
HDFS or HBASE    Hadoop Distributed File System or HBASE are the data storage techniques to store data into the file system.
Working of Hive
The following diagram depicts the workflow between Hive and Hadoop.
The following table defines how Hive interacts with the Hadoop framework:

Step No.  Operation
1    Execute Query - The Hive interface such as Command Line or Web UI sends a query to the Driver (any database driver such as JDBC, ODBC, etc.) to execute.
2    Get Plan - The driver takes the help of the query compiler that parses the query to check the syntax and query plan or the requirement of the query.
3    Get Metadata - The compiler sends a metadata request to the Metastore (any database).
4    Send Metadata - The Metastore sends metadata as a response to the compiler.
5    Send Plan - The compiler checks the requirement and resends the plan to the driver. Up to here, the parsing and compiling of a query is complete.
6    Execute Plan - The driver sends the execute plan to the execution engine.
7    Execute Job - Internally, the process of executing the job is a MapReduce job. The execution engine sends the job to the JobTracker, which is in the Name node, and it assigns this job to the TaskTracker, which is in the Data node. Here, the query executes the MapReduce job.
7.1  Metadata Ops - Meanwhile, during execution, the execution engine can execute metadata operations with the Metastore.
8    Fetch Result - The execution engine receives the results from the Data nodes.
9    Send Results - The execution engine sends those resultant values to the driver.
10   Send Results - The driver sends the results to the Hive Interfaces.
Installation and Configuration
You can install a stable release of Hive by downloading a tarball, or you can download the source code and build Hive from that.
Running HiveServer2 and Beeline
Requirements
• Java 1.7. Note: Hive versions 1.2 onward require Java 1.7 or newer. Hive versions 0.14 to 1.1 work with Java 1.6 as well. Users are strongly advised to start moving to Java 1.8 (see HIVE-8607).
• Hadoop 2.x (preferred), 1.x (not supported by Hive 2.0.0 onward). Hive versions up to 0.13 also supported Hadoop 0.20.x and 0.23.x.
Hive is commonly used in production Linux and Windows environments. Mac is a commonly used development environment. The instructions in this document are applicable to Linux and Mac. Using it on Windows would require slightly different steps.
Installing Hive from a Stable Release
Start by downloading the most recent stable release of Hive from one of the Apache download mirrors (see Hive Releases).
Next you need to unpack the tarball. This will result in the creation of a subdirectory named hive-x.y.z (where x.y.z is the release number):
$ tar -xzvf hive-x.y.z.tar.gz
Set the environment variable HIVE_HOME to point to the installation directory:
$ cd hive-x.y.z
$ export HIVE_HOME={{pwd}}
Finally, add $HIVE_HOME/bin to your PATH:
$ export PATH=$HIVE_HOME/bin:$PATH
BuildingHivefromSource
The Hive GIT repository for the most recent Hive code 1s located here: git clonehttps://git-
wip-us.apache.org/repos/asfi'hive.git (the master branch).
All release versions are in branches named "branch-0.#" or "branch-I.#" orthe upcoming"branch-
2.#", withtheexceptionofrelease0.8.1whichisin"branch-0.8-r2".Anybrancheswithothernamesare
feature branchesfor works-in-progress. See Understanding Hive Branches for details.
Asof0.13,HiveisbuiltusingApache Maven.
CompileHiveonmaster
TobuildthecurrentHivecodefromthemasterbranch:
$gitclonehttps://git-wip-us.apache.org/repos/asfi'hive.git
$cdhive
$mvncleanpackage-Pdist[-DskipTests-Dmaven.javadoc.skip=true]
$ cd packaging/target/apache-hive-{version}-SNAPSHOT-bin/apache-hive-{version}-
SNAPSHOT-bin
$ ls LICENSE NOTICE README.txt
RELEASENOTES.txt
bin/ (all the shell scripts) lib/ (required jar files) confi'(configuration files)
examples/(sampleinputandqueryfiles)
MallareddyEngineeringCollegeForWomen(AutonomousInstitution,UGC-Govt.ofIndia) 141
IntroductiontoBigDataAnalytics Dept.ofCSEAI &ML
hcatalog/(hcataloginstallation)
scripts I(upgradescriptsforhive-metastore) Here,{version}referstothecurrent Hiveversion.
IfbuildingHivesourceusingMaven(mvn),wewillrefertothedirectory"/packaging/target/apache-hive-
{version}-SNAPSHOT-bin/apache-hive-{version}-SNAPSHOT-bin" as <install-dir>fortherestofthe
page.
CompileHiveonbranch-]
In branch-I,Hive supports bothHadoop1.xand 2.x.You will need to specify which version ofHadoop
to build against via aMaven profile.To build againstHadoop 1.xusetheprofile hadoop-1; for Hadoop
2.xusehadoop-2.Forexampletobuild againstHadoop1.x,theabovemvncommandbecomes:
$mvncleanpackage-Phadoop-1,dist
CompileHivePriorto0.13onHadoop0.20
PriortoHive0.13,HivewasbuiltusingApacheAnt.Tobuildan olderversionof HiveonHadoop 0.20:
$svncohttp://svn.apache.org/repos/as£'hive/branches/branch-{version}hive
$cdhive
$antcleanpackage
$cdbuild/dist #ls LICENSE NOTICE README.txt
RELEASENOTES.txt
bin/ (allthe shell scripts) lib/ (requiredjar files) conf/(configurationfiles)
examples/(sampleinputandqueryfiles) hcatalog / (hcatalog installation)
scripts/(upgradescriptsforhive-metastore)
IfusingAnt,wewillrefertothedirectory"build/dist"as<install-dir>.
Compile Hive Prior to 0.13 on Hadoop 0.23
To build Hive in Ant against Hadoop 0.23, 2.0.0, or another version, build with the appropriate flag; some examples below:
$ ant clean package -Dhadoop.version=0.23.3 -Dhadoop-0.23.version=0.23.3 -Dhadoop.mr.rev=23
$ ant clean package -Dhadoop.version=2.0.0-alpha -Dhadoop-0.23.version=2.0.0-alpha -Dhadoop.mr.rev=23
Running Hive
Hive uses Hadoop, so:
• you must have Hadoop in your path, OR
• export HADOOP_HOME=<hadoop-install-dir>
Starting from Hive 2.1, we need to run the schematool command below as an initialization step. For example, we can use "derby" as the db type.
$ $HIVE_HOME/bin/schematool -dbType <db type> -initSchema
HiveServer2 (introduced in Hive 0.11) has its own CLI called Beeline. HiveCLI is now deprecated in favor of Beeline, as HiveCLI lacks the multi-user, security, and other capabilities of HiveServer2. To run HiveServer2 and Beeline from the shell:
$ $HIVE_HOME/bin/hiveserver2
$ $HIVE_HOME/bin/beeline -u jdbc:hive2://$HS2_HOST:$HS2_PORT
Beeline is started with the JDBC URL of the HiveServer2, which depends on the address and port where HiveServer2 was started. By default, it will be (localhost:10000), so the address will look like jdbc:hive2://localhost:10000.
Or, to start Beeline and HiveServer2 in the same process for testing purposes, for a similar user experience to HiveCLI:
$ $HIVE_HOME/bin/beeline -u jdbc:hive2://
Running HCatalog
To run the HCatalog server from the shell in Hive release 0.11.0 and later:
$ $HIVE_HOME/hcatalog/sbin/hcat_server.sh
To use the HCatalog command line interface (CLI) in Hive release 0.11.0 and later:
$ $HIVE_HOME/hcatalog/bin/hcat
For more information, see HCatalog Installation from Tarball and HCatalog CLI in the HCatalog manual.
Running WebHCat (Templeton)
To run the WebHCat server from the shell in Hive release 0.11.0 and later:
$ $HIVE_HOME/hcatalog/sbin/webhcat_server.sh
For more information, see WebHCat Installation in the WebHCat manual.
Hive Services
Hive services such as Meta store, File system, and Job Client in turn communicate with Hive storage and perform the following actions:
• Metadata information of tables created in Hive is stored in the Hive "Meta storage database".
• Query results and data loaded in the tables are going to be stored in the Hadoop cluster on HDFS.
Job execution flow:
4. The Metastore sends metadata information back to the compiler.
5. The compiler communicates with the Driver with the proposed plan to execute the query.
6. The Driver sends execution plans to the Execution engine.
7. The Execution Engine (EE) acts as a bridge between Hive and Hadoop to process the query. For DFS operations:
• The EE should first contact the Name Node and then the Data nodes to get the values stored in tables.
• The EE is going to fetch the desired records from the Data Nodes. The actual data of tables resides in the data nodes only; from the Name Node it only fetches the metadata information for the query.
• It collects actual data from the data nodes related to the mentioned query.
• The Execution Engine (EE) communicates bi-directionally with the Metastore present in Hive to perform DDL (Data Definition Language) operations. Here DDL operations like CREATE, DROP and ALTER of tables and databases are done. The Metastore stores information about database names, table names and column names only. It will fetch data related to the query mentioned.
• The Execution Engine (EE) in turn communicates with Hadoop daemons such as the Name node, Data nodes, and job tracker to execute the query on top of the Hadoop file system.
8. Fetching results from the driver.
9. Sending results to the Execution engine. Once the results are fetched from the data nodes to the EE, it will send the results back to the driver and to the UI (frontend).
Hive is continuously in contact with the Hadoop file system and its daemons via the Execution engine. The dotted arrow in the Job flow diagram shows the Execution engine's communication with the Hadoop daemons.
Hive - Data Types
This chapter takes you through the different data types in Hive, which are involved in table creation. All the data types in Hive are classified into four types, given as follows:
• Column Types
• Literals
• Null Values
• Complex Types
Column Types
Column types are used as the column data types of Hive. They are as follows:
Integral Types
Integer type data can be specified using the integral data type INT. When the data range exceeds the range of INT, you need to use BIGINT, and if the data range is smaller than INT, you use SMALLINT. TINYINT is smaller than SMALLINT.
The following table depicts various INT data types:

Data Type   Postfix   Example
TINYINT     Y         10Y
SMALLINT    S         10S
INT         -         10
BIGINT      L         10L
String Types
String type data can be specified using single quotes (' ') or double quotes (" "). It contains two data types: VARCHAR and CHAR. Hive follows C-type escape characters.
The following table depicts various CHAR data types:

Data Type   Length
VARCHAR     1 to 65535
CHAR        255

Timestamp
It supports traditional UNIX timestamps with optional nanosecond precision. It supports the java.sql.Timestamp format "YYYY-MM-DD HH:MM:SS.fffffffff" and format "yyyy-mm-dd hh:mm:ss.ffffffffff".
Dates
DATE values are described in year/month/day format in the form YYYY-MM-DD.
Decimals
The DECIMAL type in Hive is the same as the Big Decimal format of Java. It is used for representing immutable arbitrary precision. The syntax and example are as follows:
DECIMAL(precision, scale)
decimal(10,0)
Union Types
UNIONTYPE<int,double,array<string>,struct<a:int,b:string>>
{0:1}
{l:2.0}
{2:["three","four"]}
{3:{"a":5,"b":"five"}}
{2:["six","seven"]}
{3:{"a":8,"b":"eight"}}
Literals
ThefollowingliteralsareusedinHive:
FloatingPointTypes
Floatingpointtypesarenothingbutnumberswithdecimalpoints.Generally,thistypeofdata1s composed of
DOUBLEdata type.
DecimalType
Decimal type data is nothingbutfloating pointvalue with higherrange than DOUBLEdata type.The
range of decimal type is approximately -1o-308 toI0308.
NullValue
MissingvaluesarerepresentedbythespecialvalueNULL.
ComplexTypes
TheHivecomplexdatatypesareasfollows:
Arrays
ArraysinHive are usedthesamewaytheyareusedinJava.
Syntax:ARRAY<data_type>
Maps
MapsinHivearesimilartoJavaMaps.
Syntax: MAP<primitive_type, data_type>
Structs
Structs in Hive are similar to using complex data with comments.
Syntax: STRUCT<col_name : data_type [COMMENT col_comment], ...>
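The three complex types above can appear together in one table definition. A sketch, with hypothetical table and column names:

```sql
CREATE TABLE IF NOT EXISTS employee_details (
  name    STRING,
  skills  ARRAY<STRING>,                 -- e.g. a list of skill names
  phone   MAP<STRING, BIGINT>,           -- e.g. 'home' mapped to a number
  address STRUCT<city:STRING, pin:INT>   -- one struct value per row
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
COLLECTION ITEMS TERMINATED BY '|'
MAP KEYS TERMINATED BY ':';
```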
Hive-Built-in Functions
This chapter explains the built-in functions available in Hive. The functions look quite similar to SQL functions, except for their usage.
Built-In Functions
Hive supports the following built-in functions:
Return Type     | Signature                       | Description
BIGINT          | round(double a)                 | It returns the rounded BIGINT value of the double.
BIGINT          | floor(double a)                 | It returns the maximum BIGINT value that is equal to or less than the double.
BIGINT          | ceil(double a)                  | It returns the minimum BIGINT value that is equal to or greater than the double.
double          | rand(), rand(int seed)          | It returns a random number that changes from row to row.
string          | concat(string A, string B, ...) | It returns the string resulting from concatenating B after A.
string          | substr(string A, int start)     | It returns the substring of A starting from the start position till the end of string A.
int             | size(Array<T>)                  | It returns the number of elements in the array.
value of <type> | cast(<expr> as <type>)          | It converts the results of the expression expr to <type>, e.g. cast('1' as BIGINT) converts the string '1' to its integral representation. A NULL is returned if the conversion does not succeed.
round() function
hive> SELECT round(2.6) from temp;
On successful execution of the query, you get to see the following response:
3.0
floor() function
hive> SELECT floor(2.6) from temp;
On successful execution of the query, you get to see the following response:
2.0
ceil() function
hive> SELECT ceil(2.6) from temp;
On successful execution of the query, you get to see the following response:
3.0
Hive DDL Commands - Types of DDL Hive Commands
By DataFlair Team · Updated March 4, 2020
Want to run Hive queries for creating, modifying, dropping, and altering tables and databases?
In this article, we are going to learn Hive DDL commands. The article describes the Hive Data Definition Language (DDL) commands for performing various operations like creating a table/database in Hive, dropping a table/database in Hive, altering a table/database in Hive, etc. There are many DDL commands. This article will cover each DDL command individually, along with their syntax and examples.
For running Hive DDL commands, you must have Hive installed on your system.
Introduction to Hive DDL commands
Hive DDL commands are the statements used for defining and changing the structure of a table or database in Hive. They are used to build or modify tables and other objects in a database.
The several types of Hive DDL commands are:
1. CREATE
2. SHOW
3. DESCRIBE
4. USE
5. DROP
6. ALTER
7. TRUNCATE
Table-1 Hive DDL commands

DDL Command | Applies to
USE         | Database
DROP        | Database, Table
ALTER       | Database, Table
TRUNCATE    | Table
Before moving forward, note that the Hive commands are case-insensitive: CREATE DATABASE is the same as create database.
So now, let us go through each of the commands deeply. Let's start with the DDL commands on Databases in Hive.
DDL Commands On Databases in Hive
1. CREATE DATABASE in Hive
The CREATE DATABASE statement is used to create a database in Hive. The keywords DATABASE and SCHEMA are interchangeable; we can use either one.
Syntax:
CREATE (DATABASE|SCHEMA) [IF NOT EXISTS] database_name
[COMMENT database_comment]
[LOCATION hdfs_path]
[WITH DBPROPERTIES (property_name=property_value, ...)];
DDL CREATE DATABASE Example:
Here in this example, we are creating a database 'dataflair'.
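The example screenshot did not survive extraction; a minimal sketch of the command it illustrated (the COMMENT value here is an illustrative assumption):

```sql
CREATE DATABASE IF NOT EXISTS dataflair
COMMENT 'Database for the examples in these notes';
```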
2. SHOW DATABASES in Hive
The SHOW DATABASES statement lists all the databases present in Hive.
Syntax:
SHOW (DATABASES|SCHEMAS);
DDL SHOW DATABASES Example:
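A sketch of the command and the kind of output it produces (the sample listing is illustrative; 'default' always exists in a fresh Hive installation):

```sql
SHOW DATABASES;
-- +----------------+
-- | database_name  |
-- +----------------+
-- | dataflair      |
-- | default        |
-- +----------------+
```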
3. DESCRIBE DATABASE in Hive
The DESCRIBE DATABASE statement in Hive shows the name of the database, its comment (if set), and its location on the file system.
The EXTENDED keyword can be used to also get the database properties.
Syntax:
DESCRIBE (DATABASE|SCHEMA) [EXTENDED] db_name;
DDL DESCRIBE DATABASE Example:
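A sketch of both forms against the 'dataflair' database from the earlier example:

```sql
DESCRIBE DATABASE dataflair;
-- Shows the database name, comment, and HDFS location.
DESCRIBE DATABASE EXTENDED dataflair;
-- Additionally shows the DBPROPERTIES key-value pairs, if any were set.
```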
4. USE DATABASE in Hive
The USE statement in Hive is used to select the specific database for a session, on which all subsequent HiveQL statements will be executed.
Syntax:
USE database_name;
DDL USE DATABASE Example:
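A sketch using the 'dataflair' database from the earlier example:

```sql
USE dataflair;
-- All subsequent statements in this session now run against 'dataflair'.
```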
5. DROP DATABASE in Hive
The DROP DATABASE statement in Hive is used to drop (delete) the database.
Syntax:
DROP (DATABASE|SCHEMA) [IF EXISTS] database_name [RESTRICT|CASCADE];
DDL DROP DATABASE Example:
Here in this example, we are dropping the database 'dataflair' using the DROP statement.
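A sketch of the command; the choice between RESTRICT and CASCADE is worth noting:

```sql
-- RESTRICT (the default) fails if the database still contains tables;
-- CASCADE drops the contained tables first, then the database.
DROP DATABASE IF EXISTS dataflair CASCADE;
```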
6. ALTER DATABASE in Hive
The ALTER DATABASE statement in Hive is used to change the metadata associated with a database.
Syntax for changing Database Properties:
ALTER (DATABASE|SCHEMA) database_name SET DBPROPERTIES (property_name=property_value, ...);
DDL ALTER DATABASE properties Example:
In this example, we are setting the database properties of the 'dataflair' database after its creation by using the ALTER command.
Syntax for changing Database owner:
ALTER (DATABASE|SCHEMA) database_name SET OWNER [USER|ROLE] user_or_role;
DDL ALTER DATABASE owner Example:
In this example, we are changing the owner role of the 'dataflair' database using the ALTER statement.
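A sketch of both ALTER DATABASE forms (the property value and the role name are illustrative assumptions):

```sql
-- Setting a database property:
ALTER DATABASE dataflair SET DBPROPERTIES ('edited-by' = 'dataflair');
-- Changing the owner to a role:
ALTER DATABASE dataflair SET OWNER ROLE admin;
```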
Syntax for changing Database Location:
ALTER (DATABASE|SCHEMA) database_name SET LOCATION hdfs_path;
Note: The ALTER DATABASE ... SET LOCATION statement does not move the database's current directory contents to the newly specified location. It does not change the locations associated with any tables or partitions under the specified database. Instead, it changes the default parent directory, where new tables will be added for this database.
No other metadata associated with the database can be changed.

DDL Commands on Tables in Hive
1. CREATE TABLE in Hive
The CREATE TABLE statement in Hive is used to create a table with the given name. If a table or view already exists with the same name, an error is thrown. We can use IF NOT EXISTS to skip the error.
Syntax:
CREATE TABLE [IF NOT EXISTS] [db_name.]table_name
[(col_name data_type [COMMENT col_comment], ... [COMMENT col_comment])]
[COMMENT table_comment]
[ROW FORMAT row_format]
[STORED AS file_format]
[LOCATION hdfs_path];
DDL CREATE TABLE Example:
In this example, we are creating a table 'Employee' in the 'dataflair' database.
ROW FORMAT DELIMITED tells Hive that when it finds a newline character, that means a new record.
FIELDS TERMINATED BY ',' tells Hive what delimiter we are using in our files to separate each column.
STORED AS TEXTFILE tells Hive what type of file to expect.
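The screenshot of the statement did not survive extraction; a sketch that matches the clauses described above (the column list is an assumption):

```sql
CREATE TABLE IF NOT EXISTS dataflair.Employee (
  Emp_Id   INT,
  Emp_Name STRING,
  Salary   BIGINT
)
ROW FORMAT DELIMITED           -- newline separates records
FIELDS TERMINATED BY ','       -- comma separates columns in the data files
STORED AS TEXTFILE;            -- the underlying files are plain text
```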
2. SHOW TABLES in Hive
The SHOW TABLES statement in Hive lists all the base tables and views in the current database.
Syntax:
SHOW TABLES [IN database_name];
DDL SHOW TABLES Example:
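A sketch against the 'dataflair' database used throughout this section:

```sql
SHOW TABLES IN dataflair;
-- Lists the base tables and views in 'dataflair', e.g. the Employee table
-- created above.
```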
3. DESCRIBE TABLE in Hive
The DESCRIBE statement in Hive shows the list of columns for the specified table.
Syntax:
DESCRIBE [EXTENDED|FORMATTED] [db_name.]table_name[.col_name ([.field_name])];
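A sketch of the two common forms against the Employee table from the earlier example:

```sql
DESCRIBE dataflair.Employee;           -- column names and types
DESCRIBE FORMATTED dataflair.Employee; -- plus location, owner, and table parameters
```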
4. DROP TABLE in Hive
The DROP TABLE statement in Hive deletes the data for a particular table and removes all metadata associated with it from the Hive metastore.
If PURGE is not specified, then the data is actually moved to the .Trash/current directory. If PURGE is specified, then the data is lost completely.
Syntax:
DROP TABLE [IF EXISTS] table_name [PURGE];
DDL DROP TABLE Example:
In the below example, we are deleting the 'employee' table.
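A sketch of the command described above:

```sql
-- Without PURGE the data goes to the .Trash/current directory;
-- with PURGE it is deleted permanently.
DROP TABLE IF EXISTS employee PURGE;
```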
5. ALTER TABLE in Hive
The ALTER TABLE statement in Hive enables you to change the structure of an existing table. Using the ALTER TABLE statement we can rename the table, add columns to the table, change the table properties, etc.
Syntax to Rename a table:
ALTER TABLE table_name RENAME TO new_table_name;
DDL ALTER TABLE name Example:
In this example, we are renaming the 'Employee' table to 'Comp_Emp' using the ALTER statement.
0: jdbc:hive2://localhost:10000> ALTER TABLE Employee RENAME TO Comp_Emp;
INFO  : Compiling command(queryId=dataflair_20200206174836_ddaea397-57cc-407b-9bc5-0db197fa9ad4): ALTER TABLE Employee RENAME TO Comp_Emp
INFO  : Concurrency mode is disabled, not creating a lock manager
INFO  : Semantic Analysis Completed (retrial = false)
INFO  : Returning Hive schema: Schema(fieldSchemas:null, properties:null)
INFO  : Completed compiling command(queryId=dataflair_20200206174836_ddaea397-57cc-407b-9bc5-0db197fa9ad4); Time taken: 0.024 seconds
INFO  : Concurrency mode is disabled, not creating a lock manager
INFO  : Executing command(queryId=dataflair_20200206174836_ddaea397-57cc-407b-9bc5-0db197fa9ad4): ALTER TABLE Employee RENAME TO Comp_Emp
INFO  : Starting task [Stage-0:DDL] in serial mode
INFO  : Completed executing command(queryId=dataflair_20200206174836_ddaea397-57cc-407b-9bc5-0db197fa9ad4); Time taken: 0.129 seconds
INFO  : OK
INFO  : Concurrency mode is disabled, not creating a lock manager
No rows affected (0.167 seconds)
Syntax to Add columns to a table:
ALTER TABLE table_name ADD COLUMNS (column1 data_type, column2 data_type);
0: jdbc:hive2://localhost:10000> ALTER TABLE Comp_Emp ADD COLUMNS (Emp_DOB STRING, Emp_Contact STRING);
INFO  : Compiling command(queryId=dataflair_20200206175634_fcd1d6b9-749f-4066-a5bf-ced8d070e13b): ALTER TABLE Comp_Emp ADD COLUMNS (Emp_DOB STRING, Emp_Contact STRING)
INFO  : Concurrency mode is disabled, not creating a lock manager
INFO  : Semantic Analysis Completed (retrial = false)
INFO  : Returning Hive schema: Schema(fieldSchemas:null, properties:null)
INFO  : Completed compiling command(queryId=dataflair_20200206175634_fcd1d6b9-749f-4066-a5bf-ced8d070e13b); Time taken: 0.02 seconds
INFO  : Concurrency mode is disabled, not creating a lock manager
INFO  : Executing command(queryId=dataflair_20200206175634_fcd1d6b9-749f-4066-a5bf-ced8d070e13b): ALTER TABLE Comp_Emp ADD COLUMNS (Emp_DOB STRING, Emp_Contact STRING)
INFO  : Starting task [Stage-0:DDL] in serial mode
INFO  : Completed executing command(queryId=dataflair_20200206175634_fcd1d6b9-749f-4066-a5bf-ced8d070e13b); Time taken: 0.07 seconds
INFO  : OK
INFO  : Concurrency mode is disabled, not creating a lock manager
No rows affected (0.099 seconds)
Syntax to set table properties:
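The syntax line itself did not survive extraction; the standard Hive form, with an illustrative property value on the Comp_Emp table from the examples above, is:

```sql
-- General form:
-- ALTER TABLE table_name SET TBLPROPERTIES (property_name = property_value, ...);
ALTER TABLE Comp_Emp SET TBLPROPERTIES ('comment' = 'Employee details table');
```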
6. TRUNCATE TABLE in Hive
The TRUNCATE TABLE statement in Hive removes all the rows from a table or partition.
Syntax:
TRUNCATE TABLE table_name;
DDL TRUNCATE TABLE Example:
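A sketch against the Comp_Emp table from the examples above:

```sql
-- Removes all rows but keeps the table definition (metadata) intact.
TRUNCATE TABLE Comp_Emp;
```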
SET 1:
1. Discuss the following in detail:
   a. Conventional challenges in big data
   b. Nature of Data
2. Describe the steps involved in support vector based inference methodology.
3. Write a short note on Bayesian inference methodology.
4. Define the different inferences in big data analytics.
5. Describe bootstrapping and its importance.
6. What is sampling and sampling distribution? Give a detailed analysis.
7. Define the following:
   a. Intelligent Data Analytics
   b. Analysis vs Reporting
8. Describe the prediction error and regression techniques.
9. Describe any five characteristics of Big Data.
SET 2:
1. a) What is a data stream?
   b) Discuss 14 insights of Infosphere in data stream.
2. Explain the different applications of data streams in detail.
3. Explain the stream model and Data Stream Management System architecture.
4. Explain how to count ones in a window using the DGIM algorithm.
5. Write a short note on the following:
   (i) Counting distinct elements in a stream.
   (ii) Finding the most popular elements using a decaying window.
6. What are filters in Big Data? Explain the Bloom Filter with an example.
8. Explain the following: a. FM algorithm and its applications b. AMS algorithm and its applications
9. What is Real Time Analytics? Discuss their technologies in detail.
SET 3:
1. a) What is Hadoop? Explain its components.
   b) How do you analyze the data in Hadoop?
2. Explain the following:
   a. Mapper class
   b. Reducer class
   c. Scaling out
3. Explain the failures in MapReduce.
4. Explain the MapReduce data flow with single reduce and multiple reduces.
5. How is Hadoop streaming suited to text processing? Explain.
6. Define HDFS. Describe namenode, datanode and block. Explain HDFS operations in detail.
7. Write in detail the concept of developing a MapReduce application.
8. How does a MapReduce job work with the classic Java stream?
9. Explain how MapReduce jobs run on YARN.
SET 4:
1. a) What is a Cluster? Explain setting up a Hadoop cluster.
   b) What are the different types of Hadoop configuration files? Discuss.
2. What are control scripts? Explain the start-dfs.sh and start-mapred.sh scripts.
3. What are the additional configuration properties to set for HDFS?
4. Explain the three-step Kerberos ticket exchange protocol.
5. What is benchmarking? How does it work in Hadoop?
6. How will you define commissioning new nodes and decommissioning old nodes?
7. Write the steps for upgrading HDFS.
8. Discuss administering Hadoop with its checkpointing process diagram.
9. How to run a proxy and run a MapReduce job.
10. Explain the Datanode directory structure.
11. Set log levels and its metrics.
13. How is security done in Hadoop? Justify.
SET 5:
1. a) What is PIG? Explain its installation process.
   b) Explain the two execution types or modes in PIG.
2. Explain Grouping, Join, CoGroup, Cross & Group in data.
3. Explain the process of installing HIVE & the features of HIVE.
4. How will you query the data in HIVE?
5. Give a detailed note on HBASE.
6. What is Zookeeper? Explain its features with applications.
7. Explain in detail IBM Infosphere BigInsights and Streams.
8. Discuss the visual data analysis techniques in detail.
9. Give a detailed note on interaction techniques with their applications.
10. What is HiveQL? Explain its features.