Digital Notes of Big Data Analytics Dated 5.1.2024
DEPARTMENT OF CSE & IT
MALLA REDDY ENGINEERING COLLEGE FOR WOMEN
(Autonomous Institution - UGC, Govt. of India)
Accredited by NBA & NAAC with 'A' Grade, UGC, Govt. of India
NIRF Indian Ranking 2018, Accepted by MHRD, Govt. of India
Permanently Affiliated to JNTUH, Approved by AICTE, ISO 9001:2015 Certified Institution
AAAA+ Rated by Digital Learning Magazine, AAA+ Rated by Careers 360 Magazine
6th Rank CSR, Platinum Rated by AICTE-CII Survey
Top 100 Rank band by ARIIA, MHRD, Govt. of India
National Ranking: Top 100 Rank band by Outlook
National Ranking: Top 100 Rank band by Times News Magazine
Maisammaguda, Dhulapally, Secunderabad, Kompally - 500100
2023-2024
Course Name: Introduction to Big Data Analytics
Course Code: 2012PE05
Course Objectives:
TEXTBOOK:
1. Seema Acharya, Subhashini Chellappan, "Big Data and Analytics", Wiley Publications, First Edition, 2015.
REFERENCE BOOKS:
1. Judith Hurwitz, Alan Nugent, Fern Halper, Marcia Kaufman, "Big Data for Dummies", John Wiley & Sons, Inc., 2013.
2. Tom White, "Hadoop: The Definitive Guide", O'Reilly Publications, Fourth Edition, 2015.
SNO  Title
UNIT-1  INTRODUCTION TO BIG DATA
I  Introduction to Big Data
I  Characteristics of Big Data
I  Evolution of Big Data
I  Definition of Big Data
I  Challenges with Big Data
I  Big Data Analytics
I  Classification of Big Data Analytics
I  Importance and Challenges Facing Big Data
I  Data and Analysis
UNIT-2  HADOOP TECHNOLOGY
II  Introducing Hadoop
II  Conventional Approach versus Hadoop
II  Processing Data with Hadoop
II  Introduction to the Hadoop Ecosystem
II  Challenges with Hadoop
UNIT-3  HADOOP FILE SYSTEM
III  Introduction to the Hadoop Distributed File System
III  Design of the Hadoop File System
III  Basic HDFS
III  Basic File System
III  Hadoop File System
III  Basic File System versus Hadoop File System
III  The Java Interface
III  Inserting Data from a Hadoop URL
III  Inserting Data Using the FileSystem API
III  Storing Data
III  Data Flow: Anatomy of a File Read
III  Anatomy of a File Write
III  Limitations
UNIT-4  MAPREDUCE
IV  Introduction to MapReduce: Its Framing
IV  Features of MapReduce
IV  Working of MapReduce
IV  Analyzing Map and Reduce Functions
IV  MapReduce Techniques to Optimize Jobs; Uses
IV  Controlling Input Formats in MapReduce Execution
IV  Different Phases in MapReduce; Applications
UNIT-5  BIG DATA PLATFORMS
V  Sqoop
V  Cassandra
V  MongoDB
V  HIVE
V  PIG
Important Questions
Introduction to Big Data Analytics, Dept. of CSE (AI & ML)
► Introduction to Big Data
► Characteristics of Data
► Evolution of Big Data
► Definition of Big Data
► Challenges with Big Data
► Big Data Analytics
► Classification of Analytics
► Importance and Challenges Facing Big Data
► Terminologies Used in Big Data Environments
► Data Storage and Analysis
INTRODUCTION TO BIG DATA
Characteristics of Data:
Before going into the characteristics of data in detail, let us understand the term BIG DATA: BIG means bigger in size, and DATA means raw facts.
Irrespective of the size of the enterprise (big or small), data continues to be a precious and irreplaceable asset. Data is present internal to the enterprise and also exists outside the four walls and firewalls of the enterprise. Data is present in homogeneous sources as well as in heterogeneous sources. The need of the hour is to understand, manage, process, and prepare the data for analysis to draw valuable insights.
Data → Information → Insights
The difference between data and information can be broadly summarized as follows:
             Data                                        Information
Meaning      Data is raw, unorganized facts that         When data is processed, organized,
             need to be processed. Data can be           structured or presented in a given
             something simple and seemingly random,      context so as to make it useful,
             and useless until it is organized.          it is called information.
Example      Each student's test score is one            The class's average score or the
             piece of data.                              school's average score is information
                                                         that can be concluded from the
                                                         given data.
Definition   Latin 'datum', meaning "that which is       Information is interpreted data.
             given". "Data" was the plural form of
             the singular "datum".
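The table's example can be made concrete in a couple of lines: individual scores are data, and the class average derived from them is information. The scores below are made up purely for illustration.

```python
# Data vs. information, following the table above: each score is one
# piece of data; the class average computed from them is information.
# The scores themselves are hypothetical.
scores = [72, 85, 90, 64, 79]               # data: individual test scores
class_average = sum(scores) / len(scores)   # information: the class average
# class_average == 78.0
```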
Characteristics of Data:
Let us start with the characteristics of data. As depicted in the figure below, data has three key characteristics:
Composition: The composition of data deals with the structure of data, that is, the sources, the granularity, the types, and the nature of data as to whether it is static or real-time streaming.
Condition: The condition of data deals with the state of data, that is, "Can one use this data as is for analysis?" or "Does it require cleansing for further enhancement and enrichment?"
Context: The context of data deals with "Where has this data been generated?", "Why was this data generated?", "How sensitive is this data?", "What are the events associated with this data?", and so on.
Small data (data as it existed prior to the big data revolution) is about certainty. It is about fairly known data sources; it is about no major changes to the composition or context of data. Most often we have answers to queries like why this data was generated, where and when it was generated, exactly how we would like to use it, what questions this data will be able to answer, and so on.
Big data is about complexity: in terms of multiple and unknown datasets, in terms of exploding volume, in terms of the speed at which the data is being generated and the speed at which it needs to be processed, and in terms of the variety of data (internal or external, behavioral or social) that is being generated.
Evolution of Big Data:
The 1970s and before was the era of mainframes. The data was essentially primitive and structured. Relational databases evolved in the 1980s and 1990s; that era was one of data-intensive applications. The World Wide Web (WWW) and the Internet of Things (IoT) have led to an onslaught of structured, unstructured, and multimedia data.
Figure: The evolution of Big Data. Mainframes provided basic storage of primitive, structured data (1970s and before); relational databases supported data-intensive applications (1980s and 1990s); data-driven generation and storage now spans structured, unstructured, and multimedia data (2000s and beyond).
Definition of Big Data:
If we were to ask you the simple question "Define Big Data", what would your answer be? Well, we will give you a few responses that we have heard over time:
1. Anything beyond the human and technical infrastructure needed to support storage, processing, and analysis.
2. Today's BIG may be tomorrow's NORMAL.
3. Terabytes or petabytes or zettabytes of data.
4. I think it is about the 3Vs.
Well, all of these responses are correct. But it is not just one of these; in fact, big data is all of the above and more.
"Big data is high-volume, high-velocity, and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making." -- Gartner
Figure: Gartner's definition of Big Data: high-volume, high-velocity, and high-variety information assets; cost-effective, innovative forms of information processing; enhanced insight and decision making.
Challenges with Big Data:
1. Data today is growing at an exponential rate. Most of the data that we have today has been generated in the last 2-3 years. This high tide of data will continue to rise incessantly. The key questions here are: "Will all this data be useful for analysis?", "Do we work with all this data or a subset of it?", "How will we separate the knowledge from the noise?", etc.
2. Cloud computing and virtualization are here to stay. Cloud computing is the answer to managing infrastructure for big data as far as cost-efficiency, elasticity, and easy upgrading/downgrading are concerned. This further complicates the decision to host big data solutions outside the enterprise.
3. Another challenge is to decide on the period of retention of big data. Just how long should one retain this data? A tricky question indeed, as some data is useful for making long-term decisions, whereas in a few cases the data may quickly become irrelevant and obsolete just a few hours after having been generated.
4. There is a dearth of skilled professionals who possess the high level of proficiency in data science that is vital in implementing big data solutions.
5. Then, of course, there are other challenges with respect to capture, storage, preparation, search, analysis, transfer, security, and visualization of big data. Big data refers to datasets whose size is typically beyond the storage capacity of traditional database software tools. There is no explicit definition of how big a dataset should be for it to be considered big data. Here we have to deal with data that is just too big, moves way too fast, and does not fit the structures of typical database systems. The data changes are highly dynamic, and therefore there is a need to ingest it as quickly as possible.
6. Data visualization is becoming popular as a separate discipline. We are short by quite a number as far as business visualization experts are concerned.
Figure: Challenges with Big Data (capture, storage, preparation, search, analysis, transfer, security, visualization).
What is Big Data?
Big data is data that is big in volume, velocity, and variety.
Volume: Data grows from bits to bytes to petabytes and exabytes.
Bit         0 or 1
Byte        8 bits
Kilobyte    1024 bytes
Megabyte    1024^2 bytes
Gigabyte    1024^3 bytes
Terabyte    1024^4 bytes
Petabyte    1024^5 bytes
Exabyte     1024^6 bytes
Zettabyte   1024^7 bytes
Yottabyte   1024^8 bytes
Figure: Growth of data
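The unit ladder in the table above, where each unit is 1024 times the previous one, can be sketched programmatically. This is an illustrative helper, not part of the syllabus material.

```python
# Each unit in the table above is 1024 times the previous one.
UNITS = ["bytes", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]

def human_readable(num_bytes: int) -> str:
    """Render a byte count using the largest unit whose value is >= 1."""
    value = float(num_bytes)
    for unit in UNITS:
        if value < 1024 or unit == UNITS[-1]:
            return f"{value:.2f} {unit}"
        value /= 1024
    return f"{value:.2f} {UNITS[-1]}"

print(human_readable(1024))          # 1.00 KB
print(human_readable(5 * 1024**4))   # 5.00 TB
```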
Where does this data get generated?
There are a multitude of sources for big data. An XLS, a DOC, a PDF, etc. is unstructured data; a video on YouTube, a chat conversation on Internet Messenger, a customer feedback form on an online retail website is unstructured data; CCTV coverage and weather forecast reports are unstructured data too.
1. Typical internal data sources: data present within an organization's firewall, as follows:
Data storage: file systems, SQL (RDBMSs - Oracle, MS SQL Server, DB2, MySQL, PostgreSQL, etc.), NoSQL (MongoDB, Cassandra, etc.), and so on.
2. Both (internal + external data sources):
Sensor data: car sensors, smart electric meters, office buildings, air conditioning units, refrigerators, and so on.
Machine log data: event logs, application logs, business process logs, audit logs, clickstream data, etc.
Social media: Twitter, blogs, Facebook, LinkedIn, YouTube, Instagram, etc.
Business apps: ERP, CRM, HR, Google Docs, and so on.
Media: audio, video, images, podcasts, etc.
Docs: comma-separated values (CSV), Word documents, PDF, XLS, PPT, and so on.
Figure: Sources of Big Data (data storage, archives, media, sensor data, machine log data, public web).
Velocity:
We have moved from the days of batch processing to real-time processing.
Batch → Periodic → Near real time → Real-time processing
Variety:
Variety deals with a wide range of data types and sources of data. We will study this under three categories: structured data, semi-structured data, and unstructured data.
1. Structured data: from traditional transaction processing systems and RDBMSs, etc.
2. Semi-structured data: for example, HyperText Markup Language (HTML) and eXtensible Markup Language (XML).
3. Unstructured data: for example, unstructured text documents, audio, video, emails, photos, PDFs, social media, etc.
Evolution of Big Data, Definition of Big Data, Challenges with Big Data, Traditional Business Intelligence (BI) versus Big Data.
Big data analytics: Classification of Analytics, Importance and challenges facing big data, Terminologies Used in Big Data Environments, The Big Data Technology Landscape.
Big Data is becoming one of the most talked-about technology trends nowadays. The real challenge for large organizations is to get the maximum out of the data already available and to predict what kind of data to collect in the future. How to take existing data and make it meaningful, so that it provides accurate insight into past data, is one of the key discussion points in many executive meetings in organizations.
With the explosion of data, the challenge has gone to the next level, and now Big Data is becoming a reality in many organizations. The goal of every organization and expert is the same, to get the maximum out of the data, but the route and the starting point are different for each organization and expert. As organizations evaluate and architect big data solutions, they also learn the ways and opportunities related to Big Data.
There is no single solution to big data, and no single vendor can claim to know all about Big Data. Big Data is too big a concept, and there are many players: different architectures, different vendors, and different technologies.
CLASSIFICATION OF ANALYTICS
What is Data Analytics?
In this new digital world, data is being generated in enormous amounts, which opens new paradigms. As we have high computing power as well as large amounts of data, we can make use of this data for data-driven decision making. The main benefit of data-driven decisions is that they are made by observing past trends that have resulted in beneficial results and business predictions.
Data Analytics Techniques and Life Cycle
Use of Data Analytics
There are some key domains and strategic planning techniques in which Data Analytics has played a very important role:
• Improved Decision-Making: If we have supporting data in favor of a decision, then we will be able to implement it with an even higher probability of success. For example, if a certain decision or plan has led to better outcomes, there will be no doubt about implementing it again.
• Better Customer Service: Churn modeling is the best example of this, in which we try to predict or identify what leads to customer churn and change those things accordingly, so that customer attrition is as low as possible, which is a most important factor in any organization.
• Efficient Operations: Data Analytics can help us understand the demand of the situation and what should be done to get better results; then we will be able to streamline our processes, which in turn will lead to efficient operations.
• Effective Marketing: Market segmentation techniques have been implemented to target this important factor, in which we are supposed to find the marketing techniques that will help us increase our sales and lead to effective marketing strategies.
Types of Data Analytics
There are four major types of data analytics:
1. Predictive (forecasting)
2. Descriptive (business intelligence and data mining)
3. Prescriptive (optimization and simulation)
4. Diagnostic analytics
Figure: Data Analytics and its types
Predictive Analytics
Predictive analytics turns data into valuable, actionable information. Predictive analytics uses data to determine the probable outcome of an event or the likelihood of a situation occurring. Predictive analytics holds a variety of statistical techniques from modeling, machine learning, data mining, and game theory that analyze current and historical facts to make predictions about future events. Techniques that are used for predictive analytics are:
• Linear Regression
• Time Series Analysis and Forecasting
• Data Mining
Basic Cornerstones of Predictive Analytics
• Predictive modeling
• Decision Analysis and Optimization
• Transaction profiling
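As a minimal sketch of the first technique listed above, ordinary least-squares linear regression can be written in plain Python. The monthly sales figures are invented for illustration; this is not an example from the textbook.

```python
# Toy linear regression: fit y = a*x + b by ordinary least squares,
# then use the fitted line to forecast the next value.
def fit_line(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance(x, y) / variance(x); intercept from the means.
    a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
        / sum((x - mean_x) ** 2 for x in xs)
    b = mean_y - a * mean_x
    return a, b

# Hypothetical monthly sales, used to forecast month 6.
months = [1, 2, 3, 4, 5]
sales = [10, 12, 14, 16, 18]
a, b = fit_line(months, sales)
forecast = a * 6 + b   # predicted sales for month 6
```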
Descriptive Analytics
Descriptive analytics looks at data and analyzes past events for insight into how to approach future events. It looks at past performance and understands that performance by mining historical data to understand the causes of success or failure in the past. Almost all management reporting, such as sales, marketing, operations, and finance, uses this type of analysis.
The descriptive model quantifies relationships in data in a way that is often used to classify customers or prospects into groups. Unlike a predictive model that focuses on predicting the behavior of a single customer, descriptive analytics identifies many different relationships between customers and products.
Common examples of descriptive analytics are company reports that provide historic reviews, like:
• Data Queries
• Reports
• Descriptive Statistics
• Data dashboards
Prescriptive Analytics
Prescriptive analytics automatically synthesizes big data, mathematical science, business rules, and machine learning to make a prediction and then suggests a decision option to take advantage of the prediction.
Prescriptive analytics goes beyond predicting future outcomes by also suggesting actions that benefit from the predictions and showing the decision maker the implications of each decision option. Prescriptive analytics not only anticipates what will happen and when it will happen but also why it will happen. Further, prescriptive analytics can suggest decision options on how to take advantage of a future opportunity or mitigate a future risk, and illustrate the implications of each decision option.
For example, prescriptive analytics can benefit healthcare strategic planning by using analytics to leverage operational and usage data combined with data on external factors such as economic data, population demography, etc.
Diagnostic Analytics
In this analysis, we generally use historical data over other data to answer a question or to solve a problem. We try to find dependencies and patterns in the historical data of the particular problem.
For example, companies go for this analysis because it gives great insight into a problem, and they also keep detailed information at their disposal; otherwise, data collection may have to be repeated for every problem, which would be very time-consuming. Common techniques used for diagnostic analytics are:
• Data discovery
• Data mining
• Correlations
Future Scope of Data Analytics
1. Retail: To study sales patterns, consumer behavior, and inventory management, data analytics can be applied in the retail sector. Data analytics can be used by retailers to make data-driven decisions regarding what products to stock, how to price them, and how best to organize their stores.
2. Healthcare: Data analytics can be used to evaluate patient data, spot trends in patient health, and create individualized treatment regimens. Data analytics can be used by healthcare companies to enhance patient outcomes and lower healthcare expenditures.
3. Finance: In the field of finance, data analytics can be used to evaluate investment data, spot trends in the financial markets, and make wise investment decisions. Data analytics can be used by financial institutions to lower risk and boost the performance of investment portfolios.
Figure 1.1: Big Data Sphere
VELOCITY
The data growth and social media explosion have changed how we look at data. There was a time when we used to believe that yesterday's data is recent; as a matter of fact, newspapers still follow that model.
BIG DATA TECHNOLOGIES
Figure 1.5: Big Data Layout
1. APACHE HADOOP
Apache Hadoop is one of the main supportive elements in Big Data technologies. It simplifies the processing of large amounts of structured or unstructured data in an inexpensive manner. Hadoop is an open-source project from Apache that has been continuously improving over the years. Hadoop is basically a set of software libraries and frameworks to manage and process big amounts of data, from a single server to thousands of machines.
It provides an efficient and powerful error-detection mechanism based on the application layer rather than relying upon hardware.
In December 2011 Apache released Hadoop 1.0.0; more information and an installation guide can be found in the Apache Hadoop documentation. Hadoop is not a single project but includes a number of other technologies in it.
2. MAPREDUCE
MapReduce was introduced by Google to create large web search indexes. It is basically a framework to write applications that process large amounts of structured or unstructured data over the web. MapReduce takes a query and breaks it into parts to run on multiple nodes. By distributed query processing it makes it easy to maintain large amounts of data by dividing the data across several different machines. Hadoop MapReduce is a software framework for easily writing applications that manage large datasets in a highly fault-tolerant manner. More tutorials and a getting-started guide can be found in the Apache documentation.
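The split-and-combine idea described above can be sketched on a single machine: a "map" step emits (word, 1) pairs, the pairs are grouped by key, and a "reduce" step sums the counts. Real Hadoop MapReduce distributes these steps across many nodes; this toy word count keeps everything in one process and is only an illustration of the programming model.

```python
# Minimal single-machine sketch of the MapReduce model: map emits
# (word, 1) pairs, shuffle groups them by key, reduce sums the counts.
from collections import defaultdict

def map_phase(document):
    """Emit a (word, 1) pair for every word in the document."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    """Group the emitted pairs by key, as the framework would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Sum the counts for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}

docs = ["big data is big", "data is everywhere"]
pairs = [pair for doc in docs for pair in map_phase(doc)]
counts = reduce_phase(shuffle(pairs))
# counts == {"big": 2, "data": 2, "is": 2, "everywhere": 1}
```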
3. HDFS (Hadoop Distributed File System)
HDFS is a Java-based file system that is used to store structured or unstructured data over large clusters of distributed servers. The data stored in HDFS has no restriction or rule to be applied; the data can be either fully unstructured or purely structured. In HDFS the work of making sense of the data is done by the developer's code only. The Hadoop Distributed File System provides a highly fault-tolerant environment, with deployment on low-cost hardware machines. HDFS is now a part of the Apache Hadoop project; more information and an installation guide can be found in the Apache HDFS documentation.
4. HIVE
Hive was originally developed by Facebook and has now been open source for some time. Hive works something like a bridge between SQL and Hadoop; it is basically used to run SQL-like queries on Hadoop clusters. Apache Hive is basically a data warehouse that provides ad-hoc queries, data summarization, and analysis of huge datasets stored in Hadoop-compatible file systems.
Hive provides an SQL-like query language called HiveQL for working with huge amounts of data stored in Hadoop clusters. In January 2013 Apache released Hive 0.10.0; more information and an installation guide can be found in the Apache Hive documentation.
5. PIG
Pig was introduced by Yahoo and was later made fully open source. It also provides a bridge to query data over Hadoop clusters but, unlike Hive, it implements a scripting language to make Hadoop data accessible to developers and businesspersons. Apache Pig provides a high-level programming platform for developers to process and analyze Big Data using user-defined functions and programming effort. In January 2013 Apache released Pig 0.10.1.
Advantages of Big Data:
Big Data is used for many things, but some of the main usage purposes are as follows:
• Big Data enables you to gather information about customers and their experience, and eventually helps you to align it properly.
• It helps in anticipating predictive failures beforehand by analyzing the problems and providing their potential solutions.
• Big Data is also useful for companies to anticipate customer demand, roll out new plans, test markets, etc.
• Big Data is very useful in assessing predictive failures by analyzing various indicators such as unstructured data, error messages, log entries, engine temperature, etc.
• Big Data is also very efficient in maintaining operational functions along with anticipating future demands of the customers and current market demands, thus providing proper results.
Benefits of Big Data in IT Sectors:
• Many old IT companies are fully dependent on big data in order to modernize their outdated mainframes and antiquated code bases by identifying the root causes of failures and issues in real time. Many organizations are replacing their traditional systems with open-source platforms like Hadoop.
• Most big data solutions are based on Hadoop, which allows designs to scale up.
Benefits of Big Data in Business:
• Data quality has a direct impact on business process efficiency. In the purchase-to-pay process, poor-quality vendor data can cause missing purchase contracts or pricing information, which can lead to delays in procuring vital goods. Many companies use big data solutions or algorithms simply to do what they have already been doing, so that there is no data loss; moreover, if we run an algorithm against the dataset, the result might be a list of individuals who exhibit attributes of fraudulent behavior.
• In the order-to-cash process, incomplete or inaccurate credit limits or pricing information can lead to overall customer service loss, reduced revenue, or increased service cost. With the help of big data technologies and the ability to run various algorithms more quickly, the data can be updated at regular intervals throughout the day.
• The systematic analysis of data, or data profiling, is used to assess the overall health of the data, which leads to proper business decisions in accordance with the present situation, because sometimes inaccurate data results in incorrect management, which means business decisions are based on incorrect information. For example, the more one can understand customers' complex relationships, preferences, and interaction history with the company, the more relevant and timely the business outreach.
Benefits of Big Data in Enterprise:
• Big data might allow a company to collect trillions or billions of real-time data points on its products, resources, or customers, and then repackage the data instantaneously to optimize the customer experience.
• The speed at which data is updated using big data technologies allows enterprises to respond more quickly and accurately to customer demands. For example, MetLife used MongoDB to quickly consolidate customer information from over 70 different sources and provide a single, rapidly updated view.
• Big data can help enterprises act more nimbly, allowing them to adapt to changes faster than their competitors.
Big data challenges include storing and analyzing extremely large and fast-growing data.
Some of the Big Data challenges are:
1. Sharing and Accessing Data:
• Perhaps the most frequent challenge in big data efforts is the inaccessibility of datasets from external sources.
• Sharing data can cause substantial challenges.
2. Fault Tolerance:
• Fault tolerance is another technical challenge, and fault-tolerant computing is extremely hard, involving intricate algorithms.
• Nowadays some of the new technologies like cloud computing and big data always intend that whenever a failure occurs, the damage done should be within an acceptable threshold; that is, the whole task should not have to begin again from scratch.
3. Scalability:
Big data projects can grow and evolve rapidly. The scalability issue of Big Data has led towards cloud computing. It also requires dealing with system failures in an efficient manner. This again leads to the big question of what kinds of storage devices are to be used.
TRADITIONAL VS BIG DATA BUSINESS APPROACH
1. Schemaless and Column-Oriented Databases (NoSQL)
We have been using table- and row-based relational databases over the years; these databases are just fine with online transactions and quick updates. When unstructured data and large amounts of data come into the picture, we need databases without a hard-coded schema attachment. There are a number of databases that fit into this category; these databases can store unstructured, semi-structured, or even fully structured data.
Apart from other benefits, the finest thing with schemaless databases is that they make data migration very easy. MongoDB is a very popular and widely used NoSQL database these days. NoSQL and schemaless databases are used when the primary concern is to store a huge amount of data and not to maintain relationships between elements. "NoSQL (not only SQL) is a type of database that does not primarily rely upon a schema-based structure and does not use SQL for data processing."
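The schemaless idea described above can be sketched in pure Python: unlike rows in a relational table, documents in one "collection" need not share the same fields. This toy mimics the shape of a document store like MongoDB but is not MongoDB itself; the names and records are invented.

```python
# Pure-Python sketch of a schemaless "collection": documents with
# different shapes coexist, and queries match on whatever fields exist.
collection = []

def insert(document):
    collection.append(document)

def find(**criteria):
    """Return documents whose fields match all the given criteria."""
    return [doc for doc in collection
            if all(doc.get(k) == v for k, v in criteria.items())]

# Three documents with three different shapes, no schema required:
insert({"name": "Alice", "city": "Hyderabad"})
insert({"name": "Bob", "orders": 42})
insert({"name": "Carol", "city": "Hyderabad", "premium": True})

hyderabad_users = find(city="Hyderabad")   # matches Alice and Carol
```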
Figure 1.6: Big Data
The traditional approach works on structured data that has a basic layout and the structure provided.
Figure 1.7: Static Data
The structured approach designs the database as per the requirements, in tuples and columns. Live incoming data, which can be an input from an ever-changing scenario, cannot be dealt with in this way.
Figure 1.8: Streaming Data
Big data analytics works on unstructured data, where no specific pattern of the data is defined. The data is not organized in rows and columns. The live flow of data is captured and the analysis is done on it. Efficiency increases when the data to be analyzed is large.
Figure 1.9: Big Data Architecture
UNIT-2: Hadoop Technology
Topics:
Introducing Hadoop
History of Hadoop
Conventional approach versus Hadoop
Introduction to the Hadoop Ecosystem
Processing Data with Hadoop
Hadoop Distributors
Use Cases
Challenges in Hadoop
Interacting with Hadoop
HADOOP
1. Hadoop is an open-source framework that supports the processing of large datasets in a distributed computing environment.
2. Hadoop consists of MapReduce, the Hadoop Distributed File System (HDFS), and a number of related projects such as Apache Hive, HBase, and ZooKeeper. MapReduce and the Hadoop Distributed File System (HDFS) are the main components of Hadoop.
3. Apache Hadoop is an open-source, free, Java-based software framework that offers a powerful distributed platform to store and manage Big Data.
4. It is licensed under an Apache V2 license.
5. It runs applications on large clusters of commodity hardware and processes thousands of terabytes of data on thousands of nodes. Hadoop is inspired by Google's MapReduce and Google File System (GFS) papers.
6. The major advantage of the Hadoop framework is that it provides reliability and high availability.
USE OF HADOOP
There are many advantages of using Hadoop:
1. Robust and Scalable: We can add new nodes as needed, as well as modify them.
2. Affordable and Cost-Effective: We do not need any special hardware for running Hadoop. We can just use commodity servers.
3. Adaptive and Flexible: Hadoop is built keeping in mind that it will handle structured and unstructured data.
4. Highly Available and Fault-Tolerant: When a node fails, the Hadoop framework automatically fails over to another node.
CONVENTIONAL APPROACH VERSUS HADOOP
There are two major components of the Hadoop framework, and both of them perform two of the important tasks for it.
1. Hadoop MapReduce is the method to split a larger data problem into smaller chunks and distribute them to many different commodity servers. Each server has its own set of resources, and they process the chunks locally. Once a commodity server has processed the data, it sends the results back collectively to the main server. This is effectively a process by which we process large data effectively and efficiently.
2. The Hadoop Distributed File System (HDFS) is a virtual file system. There is a big difference between any other file system and Hadoop: when we move a file onto HDFS, it is automatically split into many small pieces. These small chunks of the file are replicated and stored on other servers (usually 3) for fault tolerance or high availability.
3. NameNode: The NameNode is the heart of the Hadoop system. The NameNode manages the file system namespace. It stores the metadata of the data blocks. This metadata is stored permanently on the local disk in the form of a namespace image and an edit log file. The NameNode also knows the location of the data blocks on the data nodes. However, the NameNode does not store this information persistently; it creates the block-to-DataNode mapping when it is restarted. If the NameNode crashes, the entire Hadoop system goes down.
4. Secondary NameNode: The responsibility of the Secondary NameNode is to periodically copy and merge the namespace image and edit log. If the NameNode crashes, the namespace image stored in the Secondary NameNode can be used to restart the NameNode.
5. DataNode: It stores the blocks of data and retrieves them. The DataNodes also report the blocks' information to the NameNode periodically.
6. Job Tracker: The Job Tracker's responsibility is to schedule the clients' jobs. The Job Tracker creates map and reduce tasks and schedules them to run on the DataNodes (task trackers). The Job Tracker also checks for any failed tasks and reschedules them on another DataNode. The Job Tracker can be run on the NameNode or on a separate node.
7. Task Tracker: The Task Tracker runs on the DataNodes. The Task Tracker's responsibility is to run the map or reduce tasks assigned by the Job Tracker and to report the status of the tasks to it.
Besides the above two core components, the Hadoop project also contains the following modules:
1. Hadoop Common: Common utilities for the other Hadoop modules
2. Hadoop YARN: A framework for job scheduling and cluster resource management
RDBMS
Why can't we use databases with lots of disks to do large-scale batch analysis? Why is MapReduce needed?
The answer to these questions comes from another trend in disk drives: seek time is improving more slowly than transfer rate. Seeking is the process of moving the disk's head to a particular place on the disk to read or write data. It characterizes the latency of a disk operation, whereas the transfer rate corresponds to a disk's bandwidth.
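To make the seek-versus-transfer tradeoff concrete, here is a back-of-the-envelope calculation. The transfer rate, seek time, and record size below are assumed round numbers for illustration, not measurements:

```python
# Illustrative arithmetic (assumed figures, not benchmarks): compare reading
# a dataset sequentially at the transfer rate with seeking to every record.
TRANSFER_RATE_MB_S = 100   # assumed sequential transfer rate
SEEK_TIME_MS = 10          # assumed average seek time
RECORD_KB = 100            # assumed record size

def stream_seconds(total_mb):
    """Time to stream the whole dataset at the transfer rate."""
    return total_mb / TRANSFER_RATE_MB_S

def seek_seconds(total_mb):
    """Time to reach every record via a separate seek."""
    records = total_mb * 1024 / RECORD_KB
    return records * SEEK_TIME_MS / 1000

# For a 1 TB dataset, streaming takes 10,000 s (under 3 hours), while one
# seek per record would take 102,400 s (over a day).
print(stream_seconds(1_000_000), seek_seconds(1_000_000))
```

With these (assumed) numbers the seek-dominated pattern is an order of magnitude slower, which is why MapReduce favors streaming reads.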
If the data access pattern is dominated by seeks, it will take longer to read or write large portions of the dataset than streaming through it, which operates at the transfer rate. On the other hand, for updating a small proportion of records in a database, a traditional B-Tree (the data structure used in relational databases, which is limited by the rate at which it can perform seeks) works well. For updating the majority of a database, a B-Tree is less efficient than MapReduce, which uses Sort/Merge to rebuild the database.
Keys and values for MapReduce are not an intrinsic property of the data; they are chosen by the person analyzing the data.
Relational data is often normalized to retain its integrity and remove redundancy. Normalization poses problems for MapReduce, since it makes reading a record a non-local operation, and one of the central assumptions that MapReduce makes is that it is possible to perform (high-speed) streaming reads and writes.
A web server log is a good example of a set of records that is not normalized (for example, the client hostnames are specified in full each time, even though the same client may appear many times), and this is one reason that log files of all kinds are particularly well suited to analysis with MapReduce.
MapReduce is a linearly scalable programming model. The programmer writes two functions, a map function and a reduce function, each of which defines a mapping from one set of key-value pairs to another. These functions are oblivious to the size of the data or the cluster that they are operating on, so they can be used unchanged for a small dataset and for a massive one. More important, if you double the size of the input data, a job will run twice as slow. But if you also double the size of the cluster, a job will run as fast as the original one. This is not generally true of SQL queries.
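The linear-scalability claim can be captured in a toy cost model. This is a deliberate simplification that ignores startup and shuffle overheads, which real jobs do have:

```python
# Toy model of linear scalability: job time grows with data size and
# shrinks with cluster size (overheads are ignored for illustration).
def job_time(data_units, nodes, unit_time=1.0):
    """Seconds to process `data_units` of work spread over `nodes` machines."""
    return data_units * unit_time / nodes

base = job_time(100, 10)
assert job_time(200, 10) == 2 * base   # double the data: twice as slow
assert job_time(200, 20) == base       # double data AND cluster: unchanged
```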
Over time, however, the differences between relational databases and MapReduce systems are likely to blur, both as relational databases start incorporating some of the ideas from MapReduce (such as Aster Data's and Greenplum's databases) and, from the other direction, as higher-level query languages built on MapReduce (such as Pig and Hive) make MapReduce systems more approachable to traditional database programmers.
A BRIEF HISTORY OF HADOOP
Hadoop was created by Doug Cutting, the creator of Apache Lucene, the widely used text search library. Hadoop has its origins in Apache Nutch, an open source web search engine, itself a part of the Lucene project.
Building a web search engine from scratch was an ambitious goal, for not only is the software required to crawl and index websites complex to write, but it is also a challenge to run without a dedicated operations team, since there are so many moving parts. It's expensive, too: Mike Cafarella and Doug Cutting estimated a system supporting a 1-billion-page index would cost around half a million dollars in hardware, with a monthly running cost of $30,000. Nevertheless, they believed it was a worthy goal, as it would open up and ultimately democratize search engine algorithms.
Nutch was started in 2002, and a working crawler and search system quickly emerged. However,
Google reported that its MapReduce implementation sorted one terabyte in 68 seconds. As the first edition of this book was going to press (May 2009), it was announced that a team at Yahoo! used Hadoop to sort one terabyte in 62 seconds.
PROCESSING THE DATA WITH HADOOP
To take advantage of the parallel processing that Hadoop provides, we need to express our query as a MapReduce job. After some local, small-scale testing, we will be able to run it on a cluster of machines.
MAP AND REDUCE
MapReduce works by breaking the processing into two phases: the map phase and the reduce phase. Each phase has key-value pairs as input and output, the types of which may be chosen by the programmer. The programmer also specifies two functions: the map function and the reduce function.
The input to our map phase is the raw NCDC data. We choose a text input format that gives us each line in the dataset as a text value. The key is the offset of the beginning of the line from the beginning of the file, but as we have no need for this, we ignore it.
Our map function is simple. We pull out the year and the air temperature, since these are the only fields we are interested in. In this case, the map function is just a data preparation phase, setting up the data in such a way that the reduce function can do its work on it: finding the maximum temperature for each year. The map function is also a good place to drop bad records: here we filter out temperatures that are missing, suspect, or erroneous.
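This filtering role of the map function can be sketched with a hypothetical helper in plain Python. The 9999 "missing" sentinel and the set of acceptable quality codes follow the NCDC convention, but the helper itself is an illustration, not Hadoop code:

```python
# Hypothetical helper illustrating the "drop bad records" step: emit a
# (year, temperature) pair only when the reading is present and its
# quality code is acceptable.
MISSING = 9999                         # NCDC sentinel for "no reading"
GOOD_QUALITY = {"0", "1", "4", "5", "9"}

def filter_record(year, temp, quality):
    """Yield (year, temp) for good readings; yield nothing otherwise."""
    if temp != MISSING and quality in GOOD_QUALITY:
        yield year, temp

print(list(filter_record(1950, 0, "1")))        # [(1950, 0)]
print(list(filter_record(1950, MISSING, "1")))  # []  (missing reading dropped)
```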
To visualize the way the map works, consider the following sample lines of input data (some unused columns have been dropped to fit the page, indicated by ellipses):
0067011990999991950051507004...9999999N9+00001+99999999999...
0043011990999991950051512004...9999999N9+00221+99999999999...
0043011990999991950051518004...9999999N9-00111+99999999999...
0043012650999991949032412004...0500001N9+01111+99999999999...
0043012650999991949032418004...0500001N9+00781+99999999999...
These lines are presented to the map function as the following key-value pairs:
(0,0067011990999991950051507004...9999999N9+00001+99999999999 ...)
(106,0043011990999991950051512004...9999999N9+00221+99999999999...)
(212,0043011990999991950051518004...9999999N9-00111+99999999999...)
(318,0043012650999991949032412004...0500001N9+01111+99999999999...)
(424,0043012650999991949032418004...0500001N9+00781+99999999999...)
The keys are the line offsets within the file, which we ignore in our map function. The map function merely extracts the year and the air temperature, and emits them as its output
(the temperature values have been interpreted as integers):
(1950,0)
(1950,22)
(1950,-11)
(1949,111)
(1949,78)
The output from the map function is processed by the MapReduce framework before being sent to the reduce function. This processing sorts and groups the key-value pairs by key. So, continuing the example, our reduce function sees the following input:
(1949,[111,78])
(1950,[0,22,-11])
Each year appears with a list of all its air temperature readings. All the reduce function has to do now is iterate through the list and pick up the maximum reading:
(1949,111)
(1950,22)
This is the final output: the maximum global temperature recorded in each year.
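The steps above can be reproduced with a minimal pure-Python simulation. This is not the Hadoop API: map emits (year, temperature) pairs, a shuffle step stands in for the framework's sort-and-group, and reduce picks the maximum per year:

```python
# Minimal simulation of the MapReduce flow described above (not Hadoop code).
from collections import defaultdict

def map_fn(parsed_records):
    # In real Hadoop the map function would parse each input line; here the
    # (year, temperature) fields are assumed to be extracted already.
    for year, temp in parsed_records:
        yield year, temp

def shuffle(pairs):
    # The framework's sort-and-group step between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reduce_fn(year, temps):
    return year, max(temps)

map_output = list(map_fn([(1950, 0), (1950, 22), (1950, -11),
                          (1949, 111), (1949, 78)]))
reduce_input = shuffle(map_output)   # [(1949, [111, 78]), (1950, [0, 22, -11])]
result = [reduce_fn(year, temps) for year, temps in reduce_input]
print(result)                        # [(1949, 111), (1950, 22)]
```

The intermediate and final values match the worked example in the text.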
The whole data flow is illustrated in Figure 2.2. At the bottom of the diagram is a Unix pipeline, which mimics the whole MapReduce flow, and which we will see again later in the chapter when we look at Hadoop Streaming.
The map() method also provides an instance of Context to write the output to. In this case, we write the year as a Text object (since we are just using it as a key), and the temperature is wrapped in an IntWritable. We write an output record only if the temperature is present and the quality code indicates the temperature reading is OK.
Again, four formal type parameters are used to specify the input and output types, this time for the reduce function. The input types of the reduce function must match the output types of the map function: Text and IntWritable. And in this case, the output types of the reduce function are Text and IntWritable, for a year and its maximum temperature, which we find by iterating through the temperatures and comparing each with a record of the highest found so far.
Having constructed a Job object, we specify the input and output paths. An input path is specified by calling the static addInputPath() method on FileInputFormat, and it can be a single file, a directory (in which case the input forms all the files in that directory), or a file pattern. As the name suggests, addInputPath() can be called more than once to use input from multiple paths.
The output path (of which there is only one) is specified by the static setOutputPath() method on FileOutputFormat. It specifies a directory where the output files from the reducer functions are written. The directory shouldn't exist before running the job, as Hadoop will complain and not run the job.
system: five map inputs produced five map outputs, then five reduce inputs in two groups
produced two reduce outputs.
The output was written to the output directory, which contains one output file per reducer. The job had a single reducer, so we find a single file, named part-r-00000:
% cat output/part-r-00000
1949 111
1950 22
This result is the same as when we went through it by hand earlier. We interpret this as saying that the maximum temperature recorded in 1949 was 11.1°C, and in 1950 it was 2.2°C.
THE OLD AND THE NEW JAVA MAPREDUCE APIS
The Java MapReduce API used in the previous section was first released in Hadoop 0.20.0. This new API, sometimes referred to as "Context Objects," was designed to make the API easier to evolve in the future.
The new API makes extensive use of "context objects" that allow the user code to communicate with the MapReduce system. The new Context, for example, essentially unifies the role of the JobConf, the OutputCollector, and the Reporter from the old API.
In both APIs, key-value record pairs are pushed to the mapper and reducer, but in addition, the new API allows both mappers and reducers to control the execution flow by overriding the run() method. For example, records can be processed in batches, or the execution can be terminated before all the records have been processed. In the old API this is possible for mappers by writing a MapRunnable, but no equivalent exists for reducers.
Configuration has been unified. The old API has a special JobConf object for job configuration, which is an extension of Hadoop's vanilla Configuration object (used for configuring daemons). In the new API, this distinction is dropped, so job configuration is done through a Configuration.
Job control is performed through the Job class in the new API, rather than the old JobClient, which no longer exists in the new API.
Output files are named slightly differently: in the old API both map and reduce outputs are named part-nnnnn, while in the new API map outputs are named part-m-nnnnn, and reduce outputs are
named part-r-nnnnn (where nnnnn is an integer designating the part number, starting from zero).
User-overridable methods in the new API are declared to throw java.lang.InterruptedException. What this means is that you can write your code to be responsive to interrupts so that the framework can gracefully cancel long-running operations if it needs to.
In the new API, the reduce() method passes values as a java.lang.Iterable, rather than a java.lang.Iterator (as the old API does). This change makes it easier to iterate over the values using Java's for-each loop construct: for (VALUEIN value : values) { ... }
HADOOP ECOSYSTEM
Although Hadoop is best known for MapReduce and its distributed filesystem (HDFS, renamed from NDFS), the term is also used for a family of related projects that fall under the umbrella of infrastructure for distributed computing and large-scale data processing.
All of the core projects covered in this book are hosted by the Apache Software Foundation, which provides support for a community of open source software projects, including the original HTTP Server from which it gets its name. As the Hadoop ecosystem grows, more projects are appearing, not necessarily hosted at Apache, which provide complementary services to Hadoop, or build on the core to add higher-level abstractions.
The Hadoop projects that are covered in this book are described briefly here:
Common
A set of components and interfaces for distributed filesystems and general I/O (serialization, Java RPC, persistent data structures).
Avro
A serialization system for efficient, cross-language RPC and persistent data storage.
MapReduce
A distributed data processing model and execution environment that runs on large clusters of commodity machines.
HDFS
A distributed filesystem that runs on large clusters of commodity machines.
Pig
A dataflow language and execution environment for exploring very large datasets. Pig runs on HDFS and MapReduce clusters.
Hive
A distributed data warehouse. Hive manages data stored in HDFS and provides a query language
based on SQL (which is translated by the runtime engine to MapReduce jobs) for querying the data.
HBase
A distributed, column-oriented database. HBase uses HDFS for its underlying storage, and supports both batch-style computations using MapReduce and point queries (random reads).
ZooKeeper
A distributed, highly available coordination service. ZooKeeper provides primitives such as distributed locks that can be used for building distributed applications.
Sqoop
A tool for efficiently moving data between relational databases and HDFS.
PHYSICAL ARCHITECTURE
Figure 2.2: Physical Architecture
Hadoop Cluster - Architecture, Core Components and Workflow
1. The architecture of a Hadoop cluster
2. Core components of a Hadoop cluster
3. Workflow of how a file is stored in Hadoop
HADOOP CLUSTER
i. A Hadoop cluster is a special type of computational cluster designed for storing and analyzing vast amounts of unstructured data in a distributed computing environment.
MallareddyEngineeringCollegeForWomen(AutonomousInstitution,UGC-Govt.ofIndia) 32
IntroductiontoBigDataAnalytics Dept.ofCSEAI &ML
Figure 2.3: Hadoop Cluster
ii. These clusters run on low-cost commodity computers.
iii. Hadoop clusters are often referred to as "shared nothing" systems because the only thing that is shared between nodes is the network that connects them.
Figure 2.4: Shared Nothing
Large Hadoop clusters are arranged in several racks. Network traffic between different nodes in the same rack is much more desirable than network traffic across the racks.
A real-time example: Yahoo!'s Hadoop cluster. They have more than 10,000 machines running Hadoop and nearly 1 petabyte of user data.
Figure 2.4: Yahoo Hadoop Cluster
• A small Hadoop cluster includes a single master node and multiple worker or slave nodes. As discussed earlier, the entire cluster contains two layers.
• One of the layers is the MapReduce layer and the other is the HDFS layer.
• Each of these layers has its own relevant components.
• The master node consists of a JobTracker, TaskTracker, NameNode and DataNode.
• A slave or worker node consists of a DataNode and a TaskTracker.
It is also possible for a slave or worker node to be only a data node or only a compute node; as a matter of fact, that is a key feature of Hadoop.
Figure 2.4: NameNode Cluster
HADOOP CLUSTER ARCHITECTURE:
Figure 2.5: Hadoop Cluster Architecture
A Hadoop cluster would consist of:
► 110 different racks
► Each rack would have around 40 slave machines
► At the top of each rack there is a rack switch
► Each slave machine (rack server in a rack) has cables coming out of it from both ends
► Cables are connected to the rack switch at the top, which means that the top rack switch will have around 80 ports
► There are 8 global core switches
► The rack switch has uplinks connected to the core switches, thereby connecting all the other racks with uniform bandwidth and forming the cluster. In the cluster, you have a few machines that act as NameNode and as JobTracker. They are referred to as Masters. These masters have a different configuration, favoring more DRAM and CPU and less local storage.
A Hadoop cluster has 3 components:
1. Client
2. Master
3. Slave
The role of each component is shown in the image below.
Figure 2.6: Hadoop Core Component
Client:
It is neither master nor slave; rather, it plays the role of loading the data into the cluster, submitting MapReduce jobs describing how the data should be processed, and then retrieving the data to see the response after job completion.
Figure 2.6: Hadoop Client
Masters:
The Masters consist of 3 components: NameNode, Secondary NameNode and JobTracker.
Figure 2.7: MapReduce-HDFS
NameNode:
The NameNode does NOT store the files, only the files' metadata. In a later section we will see that it is actually the DataNode which stores the files.
Figure 2.8: NameNode
• The NameNode oversees the health of the DataNodes and coordinates access to the data stored in the DataNodes.
• The NameNode keeps track of all the filesystem-related information, such as:
• Which section of a file is saved in which part of the cluster
• Last access time for the files
• User permissions, like which user has access to the file
JobTracker:
The JobTracker coordinates the parallel processing of data using MapReduce.
Secondary NameNode
Figure 2.9: Secondary NameNode
• The job of the Secondary NameNode is to contact the NameNode in a periodic manner, after a certain time interval (by default 1 hour).
• The NameNode, which keeps all filesystem metadata in RAM, has no capability of its own to persist that metadata onto disk.
• If the NameNode crashes, you lose everything that was in RAM, and you don't have any backup of the filesystem.
• What the Secondary NameNode does is contact the NameNode every hour and pull a copy of the metadata information out of the NameNode.
• It shuffles and merges this information into a clean file, sends it back to the NameNode, and keeps a copy for itself.
• Hence the Secondary NameNode is not a backup; rather, it does the job of housekeeping.
• In case of NameNode failure, the saved metadata can rebuild it easily.
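The housekeeping described above can be modeled with a toy checkpoint function. This is an illustration only; the real NameNode fsimage and edit-log formats are binary and considerably more involved:

```python
# Toy model (not Hadoop code) of the Secondary NameNode's housekeeping:
# fold the edit log into the namespace image, producing a fresh
# checkpoint image and an empty edit log.
def checkpoint(fsimage, edit_log):
    image = dict(fsimage)                 # start from the old image
    for op, path, meta in edit_log:       # replay each logged operation
        if op == "create":
            image[path] = meta
        elif op == "delete":
            image.pop(path, None)
    return image, []                      # merged image, truncated edit log

image, edits = checkpoint({"/a": {"blocks": 1}},
                          [("create", "/b", {"blocks": 2}),
                           ("delete", "/a", None)])
print(image)  # {'/b': {'blocks': 2}}
```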
Slaves:
Slave nodes are the majority of machines in a Hadoop cluster and are responsible to:
• Store the data
• Process the computation
• Each slave runs both a DataNode and a TaskTracker daemon, which communicate with their masters.
• The TaskTracker daemon is a slave to the JobTracker, and the DataNode daemon is a slave to the NameNode.
Figure 2.10: Slaves
Hadoop - Typical Workflow in HDFS: Take the example of an input file, Sample.txt.
Figure 2.11: HDFS Workflow
1. How does Sample.txt get loaded into the Hadoop cluster?
Figure 2.12: Loading a file into the Hadoop Cluster
• The client machine does this step and loads Sample.txt into the cluster.
• It breaks Sample.txt into smaller chunks, which are known as "blocks" in the Hadoop context.
• The client puts these blocks on different machines (DataNodes) throughout the cluster.
2. Next, how does the client know to which DataNodes it should load the blocks?
• Now the NameNode comes into the picture.
• The NameNode uses its rack-awareness intelligence to decide which DataNodes to provide.
• For each data block (in this case Block A, Block B and Block C), the client contacts the NameNode, and in response the NameNode sends an ordered list of 3 DataNodes.
For example, in response to the Block A request, the NameNode may send DataNode-2, DataNode-3 and DataNode-4.
The Block B DataNodes list may be DataNode-1, DataNode-3, DataNode-4, and the Block C DataNode list may be DataNode-1, DataNode-2, DataNode-3. Hence:
• Block A gets stored in DataNode-2, DataNode-3, DataNode-4
• Block B gets stored in DataNode-1, DataNode-3, DataNode-4
• Block C gets stored in DataNode-1, DataNode-2, DataNode-3
• Every block is replicated to more than 1 DataNode to ensure data recovery at the time of machine failures. That is why the NameNode sends a list of 3 DataNodes for each individual block.
4. Who does the block replication?
• The client writes the data block directly to one DataNode.
• The DataNodes then replicate the block to other DataNodes.
• Only when one block has been written to all 3 DataNodes does the cycle repeat for the next block.
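The placement dialogue above can be sketched with a hypothetical helper. Real HDFS uses rack awareness to choose replica locations; this toy version simply rotates through the available DataNodes for illustration:

```python
# Hypothetical sketch of the NameNode's answer: for each block, return an
# ordered list of `replication` DataNodes. Real HDFS placement is
# rack-aware; this illustration just rotates through the node list.
def place_blocks(blocks, datanodes, replication=3):
    placement = {}
    for i, block in enumerate(blocks):
        placement[block] = [datanodes[(i + r) % len(datanodes)]
                            for r in range(replication)]
    return placement

plan = place_blocks(["A", "B", "C"], ["DN1", "DN2", "DN3", "DN4"])
print(plan)  # each block mapped to an ordered list of 3 DataNodes
```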
5. A few additional points about replication and the NameNode:
• In Hadoop Gen 1 there is only one NameNode, whereas in Gen 2 there is an active-passive model for the NameNode, in which one more node, the "passive node," comes into the picture.
• The default setting for Hadoop is to have 3 copies of each block in the cluster. This setting can be configured with the "dfs.replication" parameter of the hdfs-site.xml file.
• Keep in mind that the client directly writes the block to the DataNode without any intervention from the NameNode in this process.
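The dfs.replication setting mentioned above lives in hdfs-site.xml. A minimal sketch of such a configuration file might look like this (the property name is the real one; the value 3 is the default):

```xml
<configuration>
  <!-- Number of replicas kept for each HDFS block (default is 3) -->
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>
```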
Challenges in Hadoop
• The Network File System (NFS) is the oldest and the most commonly used distributed file system, and it was designed for a general class of applications, whereas only specific kinds of applications can make use of Hadoop.
• Hadoop was created to address the limitations of the distributed file system: it can store a large amount of data, offers failure protection and provides fast access. But it should be known that the benefits that come with Hadoop come at some cost.
• Hadoop is designed for applications that require streaming reads rather than random reads. So if a file has four parts, the expectation is to read all the parts one by one, going from 1 to 4 till the end; a random seek, where you want to go to a specific location in the file, is something that isn't efficient with Hadoop. Hence, Hadoop is designed for non-real-time batch processing of data.
• Because Hadoop is designed for streaming reads, caching of data isn't provided. Caching would mean that when you want to read data another time, it can be read very fast from the cache; this caching isn't available in Hadoop because you get faster access to the data directly by doing a sequential read.
• A typical application will write the data and then read it several times; it will not be updating the data that it has written. Hence, updating data written to closed files is not available. However, you have to know that from release 0.19, appending will be supported for those files that aren't closed; but for those files that have been closed, updating isn't possible.
• In the case of Hadoop we aren't talking about one computer; in this scenario we usually have a large number of computers, and hardware failures are unavoidable: sometimes one computer will fail, and sometimes an entire rack can fail too. Hadoop gives excellent protection against hardware failure; however, the performance will go down in proportion to the number of computers that are down. In the big picture, this doesn't really matter and is not generally noticeable, since if you have 100 computers and 3 of them fail, then 97 are still working, so the proportionate loss of performance isn't that noticeable. Nevertheless, the way Hadoop works, there is a loss in performance, and this loss of performance through hardware failures is managed through the replication strategy.
UNIT - III
HADOOP FILE SYSTEM
► Introduction to Hadoop Distributed File System (HDFS)
► Overview
► The Design of HDFS
► HDFS Concepts
► Basic Filesystem
► Hadoop Filesystems
► Basic Filesystem vs Hadoop Filesystems
► Local File System
► File-Based Data Structures
► Sequential File
► MapFile
► The Java Interface
► Library Classes
► Inserting/Reading Data from a Hadoop URL
► Inserting/Reading Data Using the Filesystem API
► Writing/Storing Data
► Data Flow - Anatomy of a File Read
► Anatomy of a File Write
► Limitations
With growing data velocity the data size easily outgrows the storage limit of a machine. A solution
would be to store the data across a network of machines. Such filesystems are called distributed
filesystems. Since data is stored across a network all the complications of a network come in.
This is where Hadoop comes in. It provides one of the most reliable filesystems. HDFS (Hadoop
Distributed File System) is a unique design that provides storage for extremely large files with
streaming data access pattern and it runs on commodity hardware. Let’s elaborate the terms:
Extremely large files: Here we are talking about data in the range of petabytes (1000 TB).
Streaming Data Access Pattern: HDFS is designed on the principle of write-once, read-many-times. Once data is written, large portions of the dataset can be processed any number of times.
Commodity hardware: Hardware that is inexpensive and easily available in the market. This is one of the features which specially distinguishes HDFS from other file systems.
Nodes: Master-slave nodes typically form the HDFS cluster.
1. NameNode (Master Node):
Manages all the slave nodes and assigns work to them.
It executes filesystem namespace operations like opening, closing, and renaming files and directories.
It should be deployed on reliable hardware with a high-end configuration, not on commodity hardware.
2. DataNode (Slave Node):
Actual worker nodes, which do the actual work like reading, writing, and processing.
They also perform creation, deletion, and replication upon instruction from the master.
They can be deployed on commodity hardware.
HDFS daemons: Daemons are the processes running in the background.
Namenodes:
Let's assume that a 100 TB file is inserted; the master node (NameNode) will first divide the file into blocks of 10 TB (the default block size is 128 MB in Hadoop 2.x and above).
Then these blocks are stored across the different DataNodes (slave nodes).
DataNodes (slave nodes) replicate the blocks among themselves, and the information of what blocks they contain is sent to the master.
The default replication factor is 3, which means that for each block 3 replicas are created (including itself). In hdfs-site.xml we can increase or decrease the replication factor, i.e. we can edit its configuration there.
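The splitting and replication just described can be put into a back-of-the-envelope calculation. The block size and replication factor mirror the defaults named above; the helper itself is a simplification (it counts the last, possibly partial, block as full):

```python
# Hypothetical sketch of how HDFS splits and replicates a file.
import math

BLOCK_SIZE_MB = 128      # default block size in Hadoop 2.x and above
REPLICATION_FACTOR = 3   # default dfs.replication

def hdfs_footprint(file_size_mb):
    """Return (number of blocks, total replicated storage in MB).
    The last partial block is counted as a full block for simplicity."""
    blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    return blocks, blocks * BLOCK_SIZE_MB * REPLICATION_FACTOR

# A 1 GB (1024 MB) file splits into 8 blocks; with 3 replicas the
# cluster holds 24 block copies, i.e. 3072 MB of raw storage.
print(hdfs_footprint(1024))  # (8, 3072)
```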
Note: The master node (NameNode) has a record of everything; it knows the location and info of each and every DataNode and the blocks they contain, i.e. nothing is done without the permission of the master node.
Even if multiple DataNodes are down, we can still do our work, thus making the system highly reliable and giving it high fault tolerance.
Limitations: Though HDFS provides many features, there are some areas where it doesn't work well.
Low-latency data access: Applications that require low-latency access to data, i.e. in the range of milliseconds, will not work well with HDFS, because HDFS is designed keeping in mind that we need high throughput of data, even at the cost of latency.
Small file problem: Having lots of small files will result in lots of seeks and lots of movement from one DataNode to another to retrieve each small file; this whole process is a very inefficient data access pattern.
Hadoop - Introduction
Hadoop is an Apache open source framework written in Java that allows distributed processing of
large datasets across clusters of computers using simple programming models. The Hadoop
framework application works in an environment that provides
distributed storage and computation across clusters of computers. Hadoop is designed to scale up
from single server to thousands of machines, each offering local computation and storage.
Hadoop Architecture
MapReduce
MapReduce is a parallel programming model for writing distributed applications devised at Google
for efficient processing of large amounts of data (multi-terabyte data-sets), on large clusters
(thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner. The MapReduce
program runs on Hadoop which is an Apache open-source framework.
The Hadoop Distributed File System (HDFS) is based on the Google File System (GFS) and provides
a distributed file system that is designed to run on commodity hardware. It has many similarities with
existing distributed file systems. However, the differences from other distributed file systems are
significant. It is highly fault-tolerant and is designed to be deployed on low-cost hardware. It
provides high throughput access to application data and is suitable for applications having large
datasets.
Apart from the above-mentioned two core components, Hadoop framework also includes the
following two modules −
Hadoop Common − These are Java libraries and utilities required by other Hadoop modules.
Hadoop YARN − This is a framework for job scheduling and cluster resource management.
It is quite expensive to build bigger servers with heavy configurations that handle large scale
processing, but as an alternative, you can tie together many commodity computers, each with a single CPU, as a single functional distributed system; practically, the clustered machines can read the dataset in parallel and provide a much higher throughput. Moreover, it is cheaper than one high-end server.
So this is the first motivational factor behind using Hadoop that it runs across clustered and low-cost
machines.
Hadoop runs code across a cluster of computers. This process includes the following core tasks that
Hadoop performs −
Data is initially divided into directories and files. Files are divided into uniform sized blocks of 128M
and 64M (preferably 128M).
These files are then distributed across various cluster nodes for further processing.
HDFS, being on top of the local file system, supervises the processing.
Blocks are replicated for handling hardware failure.
Checking that the code was executed successfully.
Performing the sort that takes place between the map and reduce stages.
Sending the sorted data to a certain computer.
Writing the debugging logs for each job.
Advantages of Hadoop
The Hadoop framework allows the user to quickly write and test distributed systems. It is efficient, and it automatically distributes the data and work across the machines and, in turn, utilizes the underlying
parallelism of the CPU cores.
Hadoop does not rely on hardware to provide fault-tolerance and high availability (FTHA), rather
Hadoop library itself has been designed to detect and handle failures at the application layer.
Servers can be added or removed from the cluster dynamically and Hadoop continues to operate
without interruption.
Another big advantage of Hadoop is that apart from being open source, it is compatible with all platforms since it is Java-based.
The Hadoop File System was developed using a distributed file system design. It runs on commodity hardware. Unlike other distributed systems, HDFS is highly fault-tolerant and designed using low-cost hardware.
HDFS holds very large amounts of data and provides easy access. To store such huge data, the files
are stored across multiple machines. These files are stored in a redundant fashion to rescue the system
from possible data loss in case of failure. HDFS also makes applications available for parallel
processing.
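A minimal sketch of the storage cost of this redundancy (the class and method names here are illustrative, not part of Hadoop; HDFS's default replication factor is 3):

```java
public class ReplicationCost {
    // Raw cluster bytes consumed by a file stored with the given
    // replication factor (the HDFS default replication factor is 3).
    static long rawBytes(long fileSize, int replication) {
        return fileSize * replication;
    }

    public static void main(String[] args) {
        long oneGb = 1024L * 1024 * 1024;
        // A 1 GB file with replication 3 consumes 3 GB of raw disk across the cluster.
        System.out.println(rawBytes(oneGb, 3));
    }
}
```

The redundancy trades disk space for durability: losing any single machine still leaves two copies of every block.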
HDFS is a filesystem designed for storing very large files with streaming data access patterns,
running on clusters of commodity hardware. Very large files: “Very large” in this context means files
that are hundreds of megabytes, gigabytes, or terabytes in size. There are Hadoop clusters running
today that store petabytes of data. Streaming data access : HDFS is built around the idea that the most
efficient data processing pattern is a write-once, read-many-times pattern. A dataset is typically
MallareddyEngineeringCollegeForWomen(AutonomousInstitution,UGC-Govt.ofIndia) 45
IntroductiontoBigDataAnalytics Dept.ofCSEAI&ML
generated or copied from source, then various analyses are performed on that dataset over time.
Commodity hardware : Hadoop doesn’t require expensive, highly reliable hardware to run on. It’s
designed to run on clusters of commodity hardware (commonly available hardware from multiple
vendors) for which the chance of node failure across the cluster is high, at least for large
clusters. HDFS is designed to carry on working without a noticeable interruption to the user in the
face of such failure. These are areas where HDFS is not a good fit today: Low-latency data access :
Applications that require low-latency access to data, in the tens of milliseconds range, will not work
well with HDFS. Lots of small files : Since the namenode holds filesystem metadata in memory, the
limit to the number of files in a filesystem is governed by the amount of memory on the namenode.
Multiple writers, arbitrary file modifications: Files in HDFS may be written to by a single writer.
Writes are always made at the end of the file. There is no support for multiple writers, or for
modifications at arbitrary offsets in the file.
Features of HDFS
It is suitable for distributed storage and processing.
Hadoop provides a command interface to interact with HDFS.
The built-in servers of namenode and datanode help users to easily check the status of cluster.
Streaming access to file system data.
HDFS provides file permissions and authentication.
HDFS Architecture
HDFS CONCEPTS:
HDFS follows the master-slave architecture and it has the following elements.
1. Blocks: A block is the minimum amount of data that HDFS can read or write. HDFS blocks are 128 MB
by default, and this is configurable. Files in HDFS are broken into block-sized chunks, which are stored
as independent units. Unlike in a local file system, if a file in HDFS is smaller than the block size, it
does not occupy the full block's size; i.e., a 5 MB file stored in HDFS with a block size of 128 MB takes
only 5 MB of space. The HDFS block size is large in order to minimize the cost of seeks.
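The block arithmetic above can be sketched in plain Java (the class and method names are illustrative, not part of the HDFS API):

```java
public class BlockMath {
    static final long BLOCK_SIZE = 128L * 1024 * 1024; // 128 MB default block size

    // Number of HDFS blocks a file of the given size occupies.
    static long blockCount(long fileSize) {
        return (fileSize + BLOCK_SIZE - 1) / BLOCK_SIZE; // ceiling division
    }

    // Bytes actually stored for the last (possibly partial) block:
    // a small file does not consume a full block's worth of space.
    static long lastBlockBytes(long fileSize) {
        long rem = fileSize % BLOCK_SIZE;
        return rem == 0 ? Math.min(fileSize, BLOCK_SIZE) : rem;
    }

    public static void main(String[] args) {
        long fiveMb = 5L * 1024 * 1024;
        System.out.println(blockCount(fiveMb));     // 1 block
        System.out.println(lastBlockBytes(fiveMb)); // 5 MB, not 128 MB
        System.out.println(blockCount(1024L * 1024 * 1024)); // a 1 GB file -> 8 blocks
    }
}
```

So a 5 MB file occupies one block but only 5 MB of disk, while a 1 GB file is split into eight full 128 MB blocks.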
2. Name Node: HDFS works in a master-worker pattern where the name node acts as the master. The
Name Node is the controller and manager of HDFS, as it knows the status and the metadata of all the
files in HDFS; the metadata information includes file permissions, names, and the location of each
block. The metadata is small, so it is stored in the memory of the name node, allowing faster access
to data. Moreover, although the HDFS cluster is accessed by multiple clients concurrently, all this
information is handled by a single machine. File system operations such as opening, closing, and
renaming are executed by it.
3. Data Node: Data nodes store and retrieve blocks when told to by clients or the name node. They
report back to the name node periodically with lists of the blocks they are storing. The data node, being
commodity hardware, also does the work of block creation, deletion, and replication as directed by the
name node.
The name node is commodity hardware that contains the GNU/Linux operating system and the name
node software. It is software that can be run on commodity hardware. The system having the name
node acts as the master server: it manages the file system namespace and regulates clients' access to files.
The datanode is commodity hardware having the GNU/Linux operating system and datanode
software. For every node (Commodity hardware/System) in a cluster, there will be a datanode. These
nodes manage the data storage of their system.
Datanodes perform read-write operations on the file systems, as per client request.
They also perform operations such as block creation, deletion, and replication according to the
instructions of the namenode.
Block
Generally the user data is stored in the files of HDFS. A file in the file system will be divided into
one or more segments and/or stored in individual data nodes. These file segments are called
blocks. In other words, the minimum amount of data that HDFS can read or write is called a block.
The default block size was 64 MB in older Hadoop versions (128 MB in Hadoop 2 and later), but it can
be changed as needed via the HDFS configuration.
Goals of HDFS
Fault detection and recovery − Since HDFS includes a large number of commodity hardware components,
failure of components is frequent. Therefore HDFS should have mechanisms for quick and automatic
fault detection and recovery.
Huge datasets − HDFS should have hundreds of nodes per cluster to manage the applications having
huge datasets.
Hardware at data − A requested task can be done efficiently, when the computation takes place near
the data. Especially where huge datasets are involved, it reduces the network traffic and increases the
throughput.
What is HDFS
Hadoop comes with a distributed file system called HDFS. In HDFS data is distributed over several
machines and replicated to ensure their durability to failure and high availability to parallel
application.
It is cost effective as it uses commodity hardware. It involves the concept of blocks, data nodes and
node name.
o Low latency data access: Applications that require very low latency to access the first record should
not use HDFS, as it gives importance to throughput over the whole dataset rather than the time to fetch
the first record.
o Lots of small files: The name node holds the metadata of files in memory, and if the files are
small in size, this takes up a lot of the name node's memory, which is not feasible.
o Multiple writes: HDFS should not be used when we have to write to a file multiple times.
Hadoop - Introduction
Hadoop is an Apache open source framework written in java that allows distributed processing of
large datasets across clusters of computers using simple programming models. The Hadoop
framework application works in an environment that provides
distributed storage and computation across clusters of computers. Hadoop is designed to scale up
from single server to thousands of machines, each offering local computation and storage.
HADOOP ARCHITECTURE:
At its core, Hadoop has two major layers: the processing/computation layer (MapReduce) and the
storage layer (HDFS).
The Hadoop Distributed File System (HDFS) is based on the Google File System (GFS) and provides
a distributed file system that is designed to run on commodity hardware. It has many similarities with
existing distributed file systems. However, the differences from other distributed file systems are
significant. It is highly fault-tolerant and is designed to be deployed on low-cost hardware. It
provides high throughput access to application data and is suitable for applications having large
datasets.
Apart from the above-mentioned two core components, Hadoop framework also includes the
following two modules −
Hadoop Common − These are Java libraries and utilities required by other Hadoop modules.
Hadoop YARN − This is a framework for job scheduling and cluster resource management.
Since all the metadata is stored in the name node, it is very important. If it fails, the file system cannot
be used, as there would be no way of knowing how to reconstruct the files from the blocks present in
the data nodes. To overcome this, the concept of the secondary name node arises.
Secondary Name Node: It is a separate physical machine which acts as a helper to the name node. It
performs periodic checkpoints: it communicates with the name node and takes snapshots of the
metadata, which helps minimize downtime and loss of data.
Starting HDFS
The HDFS should be formatted initially and then started in the distributed mode. Commands are
given below.
To Start $ start-dfs.sh
o Copy the file "data.txt" from the local folder /usr/home/Desktop to the HDFS folder
/user/test
Recursive deleting
Example:
"<localSrc>" and "<localDest>" are paths as above, but on the local file system
o put <localSrc><dest>
Copies the file or directory from the local file system identified by localSrc to dest within the DFS.
o copyFromLocal <localSrc><dest>
Identical to -put
o moveFromLocal <localSrc><dest>
Copies the file or directory from the local file system identified by localSrc to dest within HDFS, and
then deletes the local copy on success.
o get [-crc] <src><localDest>
Copies the file or directory in HDFS identified by src to the local file system path identified by
localDest.
o cat <filename>
Displays the contents of filename on stdout.
o moveToLocal <src><localDest>
Works like -get, but deletes the HDFS copy on success.
o setrep [-R] [-w] rep <path>
Sets the target replication factor for files identified by path to rep. (The actual replication factor will
move toward the target over time)
o touchz <path>
Creates a file at path containing the current time as a timestamp. Fails if a file already exists at path,
unless the file is already size 0.
o test -[ezd] <path>
Returns 1 if path exists, has zero length, or is a directory; 0 otherwise.
o stat [format] <path>
Prints information about path. Format is a string which accepts file size in blocks (%b), filename
(%n), block size (%o), replication (%r), and modification date (%y, %Y).
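As a toy illustration of how such a format string expands (this is not Hadoop's implementation; the class, method, and sample values are made up):

```java
import java.util.Map;

public class StatFormat {
    // Toy expansion of a stat-style format string: each %-specifier is
    // replaced by its value. The sample values below are made up.
    static String expand(String format, Map<String, String> values) {
        String out = format;
        for (Map.Entry<String, String> e : values.entrySet()) {
            out = out.replace(e.getKey(), e.getValue());
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> sample = Map.of(
                "%n", "file.txt",    // filename
                "%r", "3",           // replication
                "%o", "134217728");  // block size (128 MB)
        System.out.println(expand("%n r=%r bs=%o", sample)); // file.txt r=3 bs=134217728
    }
}
```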
Starting HDFS
Initially you have to format the configured HDFS file system, open namenode (HDFS server), and
execute the following command.
After formatting the HDFS, start the distributed file system. The following command will start the
namenode as well as the data nodes as cluster.
After loading the information on the server, we can find the list of files in a directory or the status of
a file using 'ls'. Given below is the syntax of ls; you can pass a directory or a filename as an
argument.
Assume we have data in a file called file.txt in the local system which needs to be saved in the
HDFS file system. Follow the steps given below to insert the required file into the Hadoop file system.
Step 1
You have to create an input directory.
$ $HADOOP_HOME/bin/hadoop fs -mkdir /user/input
Step 2
Transfer and store a data file from local systems to the Hadoop file system using the put command.
$ $HADOOP_HOME/bin/hadoop fs -put /home/file.txt /user/input
Step 3
You can verify the file using ls command.
$ $HADOOP_HOME/bin/hadoop fs -ls /user/input
Retrieving Data from HDFS
Assume we have a file in HDFS called outfile. Given below is a simple demonstration for retrieving
the required file from the Hadoop file system.
Step 1
Initially, view the data from HDFS using cat command.
$ $HADOOP_HOME/bin/hadoop fs -cat /user/output/outfile
Step 2
Get the file from HDFS to the local file system using get command.
$ $HADOOP_HOME/bin/hadoop fs -get /user/output/ /home/hadoop_tp/
Shutting Down the HDFS
You can shut down the HDFS by using the following command.
$ stop-dfs.sh
This article will cover basic Hadoop Distributed File System (HDFS) commands in Linux. You'll
learn how to create and list directories, move, delete, and read files, and more.
Hadoop FS Command Line
The Hadoop FS command line is a simple way to access and interface with HDFS. Below are some
basic HDFS commands in Linux, including operations like creating directories, moving files, deleting
files, reading files, and listing directories.
To use HDFS commands, start the Hadoop services using the following command:
sbin/start-all.sh
jps
Below we cover several basic HDFS commands; a fuller list of file system commands is given by
running a command with -help.
mkdir:
To create a directory, similar to the Unix mkdir command.
Options:
-p : Do not fail if the directory already exists
ls:
List directories present under a specific directory in HDFS, similar to the Unix ls command. The
-lsr command can be used for recursive listing of directories and files.
Options:
-d : List the directories as plain files
-h : Format the sizes of files to a human-readable manner instead of number of bytes
-R : Recursively list the contents of directories
copyFromLocal:
Copy files from the local file system to HDFS, similar to the -put command. This command will not
work if the file already exists. To overwrite the destination if the file already exists, add the -f
flag to the command.
Options:
-f : Overwrite the destination if it already exists
copyToLocal: Copy files from HDFS to local file system, similar to -get command.
$ hadoop fs -copyToLocal [-p] [-ignoreCrc] [-crc] ...
mv:
Move files from one directory to another within HDFS, similar to Unix mv command.
rm:
Remove a file from HDFS, similar to the Unix rm command. This command does not delete directories.
For recursive deletes, use the -rm -r command.
Options:
-r : Recursively remove directories and files
-skipTrash : To bypass trash and immediately delete the source
getmerge:
Merge a list of files in one directory on HDFS into a single file on the local file system. This is one of
the most important and useful commands when trying to read the contents of a MapReduce job's or
Pig job's output files.
setrep:
Change the replication factor of a file to a specific value, instead of the default replication factor used
for the rest of HDFS. If the path is a directory, the command recursively changes the replication factor
of all files in the directory tree for the given input.
Options:
-w : Request the command wait for the replication to be completed (potentially takes a long time)
-r : Accept for backwards compatibility and has no effect
touchz:
Creates an empty file in HDFS.
test:
Test whether an HDFS path exists, is an empty file, or is a directory.
Options:
-e : Check whether the path exists
-z : Check whether the file is zero length
-d : Check whether the path is a directory
appendToFile:
Appends the contents of all given local files to the provided destination file on HDFS. The
destination file will be created if it doesn't already exist.
$ hadoop fs -appendToFile <localsrc> ... <dst>
chmod:
Change the permissions of a file, similar to the Linux shell's chmod command but with a few exceptions.
<MODE> Same as the mode used for the shell's chmod command; the only letters recognized are 'rwxXt'
<OCTALMODE> Mode specified in 3 or 4 digits. It is not possible to specify only part of the mode,
unlike the shell command.
Options:
-R : Modify the files recursively
chown:
Change the owner and group of a file, similar to the Linux shell's chown command but with a few
exceptions.
Options:
-R : Modify the files recursively
df:
Shows the capacity (free and used space) of the file system. The status of the root partition is
provided if the file system has multiple partitions and no path is specified.
Options:
-h : Format the sizes of files to a human-readable manner instead of number of bytes
du:
Show size of each file in the directory.
Options:
-s : Show total summary size
-h : Format the sizes of files to a human-readable manner instead of number of bytes
tail:
Show the last 1KB of the file.
Options:
-f : Show appended data as the file grows
Difference Between Local File System (LFS) and Distributed File System (DFS)
The basic file system of the Linux operating system is termed the Local File System (LFS). It stores
any data file as-is, in a single copy.
LFS is not reliable, because it does not replicate data files; DFS is reliable, because its data blocks
are replicated onto different DataNodes.
LFS is cheaper, because it does not need extra memory for storing any data file; DFS is more
expensive, because it needs extra memory to replicate the same data blocks.
LFS is less complex than DFS; DFS is more complex than LFS.
LFS stores data files in a tree format. Here, any user can access data files directly. LFS does not
replicate the data blocks. It is typically used for storing and processing personal (small) data.
When we need to store and process a large data file (at least approximately 1 TB in size), the local file
system of the operating system is not appropriate. In such cases we use the Distributed File System. It
can be created on any Linux operating system with Hadoop. DFS stores any data file by dividing it into
several blocks. This file system works in a master-slave format, where the master is the NameNode and
the DataNodes are the slaves. All the blocks of a data file are stored in different DataNodes, and the
locations are known only by the NameNode. Every data block is replicated onto different DataNodes to
avoid data loss when any DataNode fails. In DFS, data files are not directly accessible to any user,
because only the NameNode knows where the data blocks of a file are stored.
Sequence file
SequenceFile files are flat files designed by Hadoop to store binary <key, value> pairs.
A SequenceFile can be used as a container: small files packed into a SequenceFile can be stored and
processed efficiently.
SequenceFile files are not sorted by their stored keys; SequenceFile's internal Writer class provides
append functionality.
The key and value in a SequenceFile can be any Writable type, or a custom Writable type.
The internal format of the SequenceFile depends on whether compression is enabled and, if it is,
whether it is record compression or block compression.
There are three types:
A. No compression: If compression is not enabled (the default setting), then each record consists of its
record length (in bytes), the key length, the key, and the value. The length field is four bytes.
B. Record compression: The record-compressed format is basically the same as the uncompressed
format; the difference is that the value bytes are compressed with the codec defined in the header. Note
that the key is not compressed.
C. Block compression: Block compression compresses multiple records at once, so it is more compact
than record compression and is generally preferred. When the number of buffered record bytes reaches
a minimum size, they are added to a block. The minimum size is defined by the
io.seqfile.compress.blocksize property; the default value is 1,000,000 bytes. The format is record count,
key length, key, value length, value.
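The uncompressed record layout described above (record length, key length, key, value) can be sketched with plain Java streams. This is only an illustration of the framing idea, not SequenceFile's actual on-disk format, which also includes a file header and sync markers:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

public class RecordFraming {
    // Write one record as: record length, key length, key bytes, value bytes.
    static void writeRecord(DataOutputStream out, byte[] key, byte[] value) throws IOException {
        out.writeInt(key.length + value.length); // record length (four bytes)
        out.writeInt(key.length);                // key length (four bytes)
        out.write(key);
        out.write(value);
    }

    // Read one record back, returning {key, value}.
    static byte[][] readRecord(DataInputStream in) throws IOException {
        int recordLen = in.readInt();
        int keyLen = in.readInt();
        byte[] key = new byte[keyLen];
        byte[] value = new byte[recordLen - keyLen];
        in.readFully(key);
        in.readFully(value);
        return new byte[][] { key, value };
    }

    // Round-trip a single pair through the framing above.
    static byte[][] roundTrip(byte[] key, byte[] value) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            writeRecord(new DataOutputStream(buf), key, value);
            return readRecord(new DataInputStream(new ByteArrayInputStream(buf.toByteArray())));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[][] kv = roundTrip("james".getBytes(), "NBA".getBytes());
        System.out.println(new String(kv[0]) + " -> " + new String(kv[1])); // james -> NBA
    }
}
```

Because each record carries its own lengths, a reader can skip over records without understanding their contents, which is what makes the flat-file format easy to split and scan.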
read/write Sequencefile
Write Process:
1) Create a configuration
2) Get filesystem
3) Create file output path
4) Call Sequencefile.createwriter to get Sequencefile.writer object
5) Call SequenceFile.Writer.append Append write file
6) Close the stream.
Read process:
1) Create a configuration
2) Get filesystem
3) Create file output path
4) New one sequencefile.reader for reading
5) Get Keyclass and Valueclass
6) Close the stream
org.apache.hadoop.io
Class SequenceFile
There are three SequenceFile writers, based on the SequenceFile.CompressionType used to compress
key/value pairs:
1. Writer: Uncompressed records.
2. RecordCompressWriter: Record-compressed files; only the values are compressed.
3. BlockCompressWriter: Block-compressed files; both keys and values are collected in 'blocks' separately
and compressed. The size of the 'block' is configurable.
package SequenceFile;
import java.io.IOException;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.SequenceFile.CompressionType;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.compress.BZip2Codec;
import org.apache.hadoop.util.ReflectionUtils;
public class Demo01 {
final static String uri= "hdfs://liguodong:8020/liguodong";
final static String[] data = {
"apache,software","chinese,good","james,NBA","index,pass"
};
public static void write(FileSystem fs, Configuration configuration, Path path) throws IOException {
// 1-3) The configuration, filesystem and output path are created by the caller
IntWritable key = new IntWritable();
Text value = new Text();
// 4) Get a SequenceFile.Writer object
// No compression:
/*SequenceFile.Writer writer = SequenceFile.createWriter
(fs,configuration,path,key.getClass(),value.getClass());*/
//Record compression
@SuppressWarnings("deprecation")
SequenceFile.Writer writer = SequenceFile.createWriter
(fs,configuration,path,key.getClass(),
value.getClass(), CompressionType.RECORD, new BZip2Codec());
//Block compression
/*@SuppressWarnings("deprecation")
SequenceFile.Writer writer = SequenceFile.createWriter
(fs,configuration,path,key.getClass(),
value.getClass(),CompressionType.BLOCK,new BZip2Codec());*/
//5
for (int i = 0; i <30; i++) {
key.set(100-i);
value.set(data[i%data.length]);
writer.append(key, value);
}
//6, close the stream
IOUtils.closeStream(writer);
}
public static void read(FileSystem fs,Configuration configuration,Path path) throws IOException {
//4
@SuppressWarnings("deprecation")
SequenceFile.Reader reader = new SequenceFile.Reader(fs, path,configuration);
//5
Writable key = (Writable) ReflectionUtils.newInstance
(reader.getKeyClass(), configuration);
Writable value = (Writable) ReflectionUtils.newInstance
(reader.getValueClass(), configuration);
while(reader.next(key,value)){
System.out.println("key = "+ key);
System.out.println("value = "+ value);
System.out.println("position = "+ reader.getPosition());
}
IOUtils.closeStream(reader);
}
}
Operation Result:
key = 100
value = apache,software
position = 164
key = 99
value = chinese,good
position = 197
key = 98
value = james,NBA
position = 227
key = 97
value = index,pass
position = 258
key = 96
value = apache,software
position = 294
key = 95
value = chinese,good
position = 327
......
key = 72
value = apache,software
position = 1074
key = 71
value = chinese,good
position = 1107
MapFile
A MapFile is a sorted SequenceFile with an index, and records can be looked up by key.
Unlike a SequenceFile, a MapFile's key must implement the WritableComparable interface; that is, the
key must be comparable, and the value is a Writable type.
You can use the MapFile.fix() method to rebuild the index and convert a SequenceFile into a MapFile.
By observing its directory structure, we can see that a MapFile consists of two parts (named by two
static member variables of the class): data and index.
The index is a data-index file: it primarily records the key of each indexed record and the offset at
which that record sits in the data file.
When a MapFile is accessed, the index file is loaded into memory, and the index mapping relationship
is used to quickly navigate to the location of the specified record in the data file.
Therefore, the retrieval efficiency of a MapFile is high relative to a SequenceFile; the disadvantage is
that it consumes a portion of memory to store the index data.
It should be noted that a MapFile does not record every entry in the index; by default it stores an index
entry for every 128 records. The recording interval can be modified via the MapFile.Writer
setIndexInterval() method, or by changing the io.map.index.interval property.
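The interval-index idea can be illustrated with a toy in-memory structure (this is not Hadoop's MapFile implementation; the class is made up, and the small interval here stands in for MapFile's default of 128):

```java
import java.util.ArrayList;
import java.util.List;

public class IntervalIndex {
    // Sorted keys, with an index entry every `interval` records --
    // the same idea MapFile uses with a default interval of 128.
    private final List<String> keys;   // sorted data records
    private final List<Integer> index; // positions of the indexed keys
    private final int interval;

    IntervalIndex(List<String> sortedKeys, int interval) {
        this.keys = sortedKeys;
        this.interval = interval;
        this.index = new ArrayList<>();
        for (int i = 0; i < sortedKeys.size(); i += interval) {
            index.add(i);
        }
    }

    // Binary-search the small in-memory index for the last indexed key <= key,
    // then scan at most `interval` records in the data.
    int find(String key) {
        int lo = 0, hi = index.size() - 1, start = 0;
        while (lo <= hi) {
            int mid = (lo + hi) / 2;
            if (keys.get(index.get(mid)).compareTo(key) <= 0) {
                start = index.get(mid);
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        for (int i = start; i < Math.min(start + interval, keys.size()); i++) {
            if (keys.get(i).equals(key)) return i;
        }
        return -1; // not found
    }

    public static void main(String[] args) {
        List<String> data = new ArrayList<>();
        for (char c = 'a'; c <= 'z'; c++) data.add(String.valueOf(c));
        IntervalIndex idx = new IntervalIndex(data, 4);
        System.out.println(idx.find("q")); // 16
    }
}
```

This shows the trade-off described above: only a fraction of the keys are kept in memory, at the cost of a short scan after each index lookup.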
read/write Mapfile
Write Process:
1) Create a configuration
2) Get filesystem
3) Create file output path
4) New one Mapfile.writer object
5) Call MapFile.Writer.append Append write file
6) Close the stream
Read process:
1) Create a configuration
2) Get filesystem
3) Create file output path
4) New one mapfile.reader for reading
5) Get Keyclass and Valueclass
6) Close the stream
Hadoop has an abstract notion of filesystems, of which HDFS is just one implementation. The Java abstract
class org.apache.hadoop.fs.FileSystem represents the client interface to a filesystem in Hadoop, and there are
several concrete implementations.Hadoop is written in Java, so most Hadoop filesystem interactions are
mediated through the Java API. The filesystem shell, for example, is a Java application that uses the Java
FileSystem class to provide filesystem operations.By exposing its filesystem interface as a Java API,
Hadoop makes it awkward for non-Java applications to access HDFS. The HTTP REST API exposed by the
WebHDFS protocol makes it easier for other languages to interact with HDFS. Note that the HTTP interface
is slower than the native Java client, so should be avoided for very large data transfers if possible.
There are two ways of accessing HDFS over HTTP: directly, where the HDFS daemons serve HTTP
requests to clients and via a proxy, which accesses HDFS on the client’s behalf using the usual
DistributedFileSystem API.
In the first case, the embedded web servers in the namenode and datanodes act as WebHDFS endpoints.
(WebHDFS is enabled by default, since dfs.webhdfs.enabled is set to true.) File metadata operations are
handled by the namenode, while file read (and write) operations are sent first to the namenode, which sends
an HTTP redirect to the client indicating the datanode to stream file data from (or to).
The second way of accessing HDFS over HTTP relies on one or more standalone proxy servers. All traffic to
the cluster passes through the proxy, so the client never accesses the namenode or datanode directly. This
allows for stricter firewall and bandwidth-limiting policies to be put in place. The HttpFS proxy exposes the
same HTTP (and HTTPS) interface as WebHDFS, so clients can access both using webhdfs (or swebhdfs)
URIs. The HttpFS proxy is started independently of the namenode and datanode daemons, using the
httpfs.sh script, and by default listens on a different port (14000).
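To make the WebHDFS interface concrete, here is a small sketch of how request URLs are formed. The host name is an example value (50070 is the classic namenode HTTP port); the op values OPEN and GETFILESTATUS are from the documented WebHDFS REST API:

```java
public class WebHdfsUrl {
    // WebHDFS REST URLs have the shape:
    // http://<host>:<port>/webhdfs/v1<path>?op=<OP>
    static String url(String host, int port, String path, String op) {
        return "http://" + host + ":" + port + "/webhdfs/v1" + path + "?op=" + op;
    }

    public static void main(String[] args) {
        // Reading file data: the namenode answers with an HTTP redirect
        // to the datanode that actually streams the bytes.
        System.out.println(url("namenode.example.com", 50070, "/user/input/file.txt", "OPEN"));
        // File metadata: handled by the namenode itself.
        System.out.println(url("namenode.example.com", 50070, "/user/input/file.txt", "GETFILESTATUS"));
    }
}
```

Any HTTP client (curl, a browser, a non-Java program) can issue such requests, which is the point of the REST interface.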
In this section, we dig into the Hadoop FileSystem class: the API for interacting with one of Hadoop’s
filesystems.
A file in a Hadoop filesystem is represented by a Hadoop Path object. FileSystem is a general filesystem
API, so the first step is to retrieve an instance for the filesystem we want to use—HDFS, in this case. There
are several static factory methods for getting a FileSystem instance
A Configuration object encapsulates a client or server’s configuration, which is set using configuration files
read from the classpath, such as core-site.xml. The first method returns the default filesystem (as specified in
core-site.xml, or the default local filesystem if not specified there). The second uses the given URI’s scheme
and authority to determine the filesystem to use, falling back to the default filesystem if no scheme is
specified in the given URI. The third retrieves the filesystem as the given user, which is important in the
context of security.The fourth one retrieves a local filesystem instance.
With a FileSystem instance in hand, we invoke an open() method to get the input stream for a file. The
first method uses a default buffer size of 4 KB; the second gives the user an option to specify the buffer size.
The open() method on FileSystem actually returns an FSDataInputStream rather than a standard java.io
class. This class is a specialization of java.io.DataInputStream with support for random access, so you can
read from any part of the stream:
package org.apache.hadoop.fs;
public class FSDataInputStream extends DataInputStream implements Seekable, PositionedReadable {
}
The Seekable interface permits seeking to a position in the file and provides a query method for the current
offset from the start of the file (getPos()) . Calling seek() with a position that is greater than the length of the
file will result in an IOException. Unlike the skip() method of java.io.InputStream, which positions the
stream at a point later than the current position, seek() can move to an arbitrary,
absolute position in the file.
FSDataInputStream also implements the PositionedReadable interface for reading parts of a file at a given
offset.
The read() method reads up to length bytes from the given position in the file into the buffer at the given
offset in the buffer. The return value is the number of bytes actually read; callers should check this value, as
it may be less than length.The readFully() methods will read length bytes into the buffer, unless the end of
the file is reached, in which case an EOFException is thrown.
Finally, bear in mind that calling seek() is a relatively expensive operation and should be done sparingly.
You should structure your application access patterns to rely on streaming data (by using MapReduce, for
example) rather than performing a large number of seeks.
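The seek-versus-skip distinction can be demonstrated with plain Java I/O, using java.io.RandomAccessFile as a stand-in for FSDataInputStream (the temporary file and its contents are made up for the demo):

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.file.Files;
import java.nio.file.Path;

public class SeekDemo {
    // Read single characters at two absolute positions, moving backwards in
    // between -- something InputStream.skip() cannot do, but seek() can.
    static String demo(Path p) throws IOException {
        try (RandomAccessFile f = new RandomAccessFile(p.toFile(), "r")) {
            StringBuilder sb = new StringBuilder();
            f.seek(7);                  // jump to an absolute position
            sb.append((char) f.read());
            f.seek(2);                  // seek backwards to an earlier position
            sb.append((char) f.read());
            return sb.toString();
        }
    }

    public static void main(String[] args) throws IOException {
        Path p = Files.createTempFile("seekdemo", ".txt");
        Files.write(p, "0123456789".getBytes());
        System.out.println(demo(p)); // 72
        Files.delete(p);
    }
}
```

On HDFS each seek() may mean contacting a different datanode, which is why the text advises structuring access patterns around streaming rather than many seeks.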
Writing Data
The FileSystem class has a number of methods for creating a file. The simplest is the method that takes a
Path object for the file to be created and returns an output stream to write to.
There are overloaded versions of this method that allow you to specify whether to forcibly overwrite existing
files, the replication factor of the file, the buffer size to use when writing the file, the block size for the file,
and file permissions.
Note : The create() methods create any parent directories of the file to be written that don’t already exist.
Though convenient, this behavior may be unexpected. If you want the write to fail when the parent directory
doesn’t exist, you should check for the existence of the parent directory first by calling the exists() method.
Alternatively, use FileContext, which allows you to control whether parent directories are created or not.
There’s also an overloaded method for passing a callback interface, Progressable, so your application can be
notified of the progress of the data being written to the datanodes.
package org.apache.hadoop.util;
public interface Progressable {
public void progress();
}
As an alternative to creating a new file, you can append to an existing file using the append() method (there
are also some other overloaded versions)
The append operation allows a single writer to modify an already written file by opening it and writing data
from the final offset in the file. With this API, applications that produce unbounded files, such as logfiles,
can write to an existing file after having closed it. The append operation is optional and not implemented by
all Hadoop filesystems. The example below shows how to copy a local file to a Hadoop filesystem. We
illustrate progress by printing a period every time the progress() method is called by Hadoop, which is
after each 64 KB packet of data is written to the datanode pipeline.
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.util.Progressable;
public class CopyFileWithProgress {
public static void main(String[] args) throws Exception {
String localSrc = args[0];
String dst = args[1];
InputStream in = new BufferedInputStream(new FileInputStream(localSrc));
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(URI.create(dst), conf);
OutputStream out = fs.create(new Path(dst), new Progressable() {
public void progress() {
System.out.print(".");
}
});
IOUtils.copyBytes(in, out, 4096, true);
}
MallareddyEngineeringCollegeForWomen(AutonomousInstitution,UGC-Govt.ofIndia) 68
IntroductiontoBigDataAnalytics Dept.ofCSEAI&ML
}
FSDataOutputStream
The create() method on FileSystem returns an FSDataOutputStream, which, like FSDataInputStream, has a
method for querying the current position in the file:
package org.apache.hadoop.fs;
public class FSDataOutputStream extends DataOutputStream implements Syncable {
public long getPos() throws IOException {
}
}
However, unlike FSDataInputStream, FSDataOutputStream does not permit seeking. This is because HDFS
allows only sequential writes to an open file or appends to an already written file. In other words, there is no
support for writing to anywhere other than the end of the file, so there is no value in being able to seek while
writing.
The related mkdirs() method on FileSystem creates a directory along with all of the necessary parent directories if they don't already exist, and returns true if it is successful.
Hadoop components need the Hadoop libraries to be accessible from CloverDX. The libraries are needed by HadoopReader, HadoopWriter, ExecuteMapReduce, HDFS, and Hive, and are necessary to establish a Hadoop connection (see Hadoop Connections).
The officially supported version of Hadoop is Cloudera 5, version 5.6.0. Other versions close to this one might work, but compatibility is not guaranteed.
Cloudera 5
The below mentioned libraries are needed for the connection to Cloudera 5.
Common libraries
hadoop-common-2.6.0-cdh5.6.0.jar
hadoop-auth-2.6.0-cdh5.6.0.jar
guava-15.0.jar
avro-1.7.6-cdh5.6.0.jar
htrace-core4-4.0.1-incubating.jar
servlet-api-3.0.jar
HDFS
hadoop-hdfs-2.6.0-cdh5.6.0.jar
protobuf-java-2.5.0.jar
MapReduce
hadoop-annotations-2.6.0-cdh5.6.0.jar
hadoop-mapreduce-client-app-2.6.0-cdh5.6.0.jar
hadoop-mapreduce-client-common-2.6.0-cdh5.6.0.jar
hadoop-mapreduce-client-core-2.6.0-cdh5.6.0.jar
hadoop-mapreduce-client-hs-2.6.0-cdh5.6.0.jar
hadoop-mapreduce-client-jobclient-2.6.0-cdh5.6.0.jar
hadoop-mapreduce-client-shuffle-2.6.0-cdh5.6.0.jar
jackson-core-asl-1.9.2.jar
jackson-mapper-asl-1.9.12.jar
hadoop-yarn-api-2.6.0-cdh5.6.0.jar
hadoop-yarn-client-2.6.0-cdh5.6.0.jar
hadoop-yarn-common-2.6.0-cdh5.6.0.jar
Hive
hive-jdbc-1.1.0-cdh5.6.0.jar
hive-exec-1.1.0-cdh5.6.0.jar
hive-metastore-1.1.0-cdh5.6.0.jar
hive-service-1.1.0-cdh5.6.0.jar
libfb303-0.9.2.jar
slf4j-api-1.7.5.jar
slf4j-log4j12-1.7.5.jar
The libraries can be found in your CDH installation or in a package downloaded from Cloudera.
CDH installation
Required libraries from CDH reside in the directories from the following list.
/usr/lib/hadoop
/usr/lib/hadoop-hdfs
/usr/lib/hadoop-mapreduce
/usr/lib/hadoop-yarn
+ 3rd party libraries are located in lib subdirectories
Package downloaded from Cloudera
The files can also be found in a package downloaded from Cloudera, at the following locations.
share/hadoop/common
share/hadoop/hdfs
share/hadoop/mapreduce2
share/hadoop/yarn
+ lib subdirectories
One of the simplest ways to read a file from a Hadoop filesystem is by using a java.net.URL object to
open a stream to read the data from. The general idiom is:
InputStream in = null;
try {
in = new URL("hdfs://host/path").openStream();
// process in
} finally {
IOUtils.closeStream(in);
}
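The same idiom works with any URL scheme the JVM already understands. As a sketch that runs without a Hadoop cluster, here it is applied to a file: URL (the temporary file and its contents are invented for illustration):

```java
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;

public class UrlReadDemo {
    // Read every byte from a URL's stream, mirroring the HDFS idiom above.
    static String readAll(URL url) throws Exception {
        InputStream in = null;
        try {
            in = url.openStream();
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return out.toString("UTF-8");
        } finally {
            if (in != null) in.close(); // IOUtils.closeStream(in) does this null-safely
        }
    }

    public static void main(String[] args) throws Exception {
        // Use a local file: URL in place of hdfs://host/path
        Path tmp = Files.createTempFile("urlread", ".txt");
        Files.write(tmp, "hello hdfs idiom".getBytes("UTF-8"));
        System.out.println(readAll(tmp.toUri().toURL()));
    }
}
```

With an hdfs: URL, the only extra step is registering Hadoop's URL stream handler, as the text explains next.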
There’s a little bit more work required to make Java recognize Hadoop’s hdfs URL scheme. This is
achieved by calling the setURLStreamHandlerFactory() method on URL with an instance of
FsUrlStreamHandlerFactory. This method can be called only once per JVM, so it is typically executed in
a static block. This limitation means that if some other part of your program — perhaps a third-party
component outside your control — sets a URLStreamHandlerFactory, you won’t be able to use this
approach for reading data from Hadoop. The next section discusses an alternative.
Example 3-1 shows a program for displaying files from Hadoop filesystems on standard output, like the
Unix cat command.
Example 3-1. Displaying files from a Hadoop filesystem on standard output using a URLStreamHandler
% export HADOOP_CLASSPATH=hadoop-examples.jar
% hadoop URLCat hdfs://localhost/user/tom/quangle.txt
On the top of the Crumpetty Tree
The Quangle Wangle sat,
But his face you could not see,
On account of his Beaver Hat.
Hadoop has an abstract notion of file systems, of which HDFS is just one implementation. The Java
abstract class org.apache.hadoop.fs.FileSystem represents the client interface to a filesystem in
Hadoop, and there are several concrete implementations. Hadoop is written in Java, so most Hadoop
file system interactions are mediated through the Java API. The filesystem shell, for example, is a
Java application that uses the Java FileSystem class to provide filesystem operations. By exposing its
file system interface as a Java API, Hadoop makes it awkward for non-Java applications to access
HDFS. The HTTP REST API exposed by the WebHDFS protocol makes it easier for other languages
to interact with HDFS. Note that the HTTP interface is slower than the native Java client, so should
be avoided for very large data transfers if possible.
There are two ways of accessing HDFS over HTTP: directly, where the HDFS daemons serve HTTP requests to clients; and via a proxy, which accesses HDFS on the client's behalf using the usual DistributedFileSystem API.
In the first case, the embedded web servers in the namenode and datanodes act as WebHDFS
endpoints. (WebHDFS is enabled by default, since dfs.webhdfs.enabled is set to true.) File metadata
operations are handled by the namenode, while file read (and write) operations are sent first to the
namenode, which sends an HTTP redirect to the client indicating the datanode to stream file data
from (or to).
The second way of accessing HDFS over HTTP relies on one or more standalone proxy servers.
All traffic to the cluster passes through the proxy, so the client never accesses the namenode or
datanode directly. This allows for stricter firewall and bandwidth-limiting policies to be put in place.
The HttpFS proxy exposes the same HTTP (and HTTPS) interface as WebHDFS, so clients can
access both using webhdfs (or swebhdfs) URIs. The HttpFS proxy is started independently of the
namenode and datanode daemons, using the httpfs.sh script, and by default listens on a different port (14000).
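Both endpoints share the same REST path convention: a file read is an ordinary HTTP GET against /webhdfs/v1/&lt;path&gt;?op=OPEN. A small sketch of how such URIs are formed (the hostnames are invented for illustration; 50070 is the default namenode web port in the Hadoop 2 line covered here):

```java
public class WebHdfsUri {
    // Build a WebHDFS REST URI: http://<host>:<port>/webhdfs/v1<path>?op=<OP>
    static String uri(String host, int port, String path, String op) {
        return "http://" + host + ":" + port + "/webhdfs/v1" + path + "?op=" + op;
    }

    public static void main(String[] args) {
        // Directly against the namenode's embedded web server
        System.out.println(uri("namenode.example.com", 50070, "/user/tom/quangle.txt", "OPEN"));
        // Via an HttpFS proxy, which listens on port 14000 by default
        System.out.println(uri("httpfs.example.com", 14000, "/user/tom/quangle.txt", "OPEN"));
    }
}
```

Because the two interfaces share this URI layout, a client can switch between direct WebHDFS and an HttpFS proxy by changing only the host and port.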
In this section, we dig into the Hadoop FileSystem class: the API for interacting with one of
Hadoop’s filesystems.
The open() method on FileSystem returns an FSDataInputStream, which implements the Seekable interface. Unlike java.io.InputStream's skip(), which positions the stream at a point later than the current position, seek() can move to an arbitrary, absolute position in the file.
public interface Seekable {
void seek(long pos) throws IOException;
long getPos() throws IOException;
}
FSDataInputStream also implements the PositionedReadable interface for reading parts of a file at a
given offset.
public interface PositionedReadable {
public int read(long position, byte[] buffer, int offset, int length) throws IOException;
public void readFully(long position, byte[] buffer, int offset, int length) throws IOException;
public void readFully(long position, byte[] buffer) throws IOException;
}
The read() method reads up to length bytes from the given position in the file into the buffer at the
given offset in the buffer. The return value is the number of bytes actually read; callers should check
this value, as it may be less than length. The readFully() methods will read length bytes into the
buffer, unless the end of the file is reached, in which case an EOFException is thrown.
Finally, bear in mind that calling seek() is a relatively expensive operation and should be done
sparingly. You should structure your application access patterns to rely on streaming data (by using
MapReduce, for example) rather than performing a large number of seeks.
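Plain Java NIO has the same positional-read idea: FileChannel.read(ByteBuffer, long) reads at an absolute offset without moving the channel's own position, much like PositionedReadable. A small sketch (the temporary file is invented for illustration; this is an analogy, not Hadoop's API):

```java
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class PositionalReadDemo {
    // Read up to len bytes at an absolute position; the channel's own
    // position is left untouched, like PositionedReadable.read().
    static String readAt(Path file, long position, int len) throws Exception {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            ByteBuffer buf = ByteBuffer.allocate(len);
            int n = ch.read(buf, position); // may read fewer than len bytes
            return new String(buf.array(), 0, Math.max(n, 0), "UTF-8");
        }
    }

    public static void main(String[] args) throws Exception {
        Path tmp = Files.createTempFile("pread", ".txt");
        Files.write(tmp, "0123456789".getBytes("UTF-8"));
        System.out.println(readAt(tmp, 3, 4)); // prints "3456"
    }
}
```

As with Hadoop's read(), the caller must check the returned byte count, since a positional read may return fewer bytes than requested.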
Writing Data
The FileSystem class has a number of methods for creating a file. The simplest is the method that
takes a Path object for the file to be created and returns an output stream to write to.
public FSDataOutputStream create(Path f) throws IOException
There are overloaded versions of this method that allow you to specify whether to forcibly overwrite
existing files, the replication factor of the file, the buffer size to use when writing the file, the block
size for the file, and file permissions.
Note : The create() methods create any parent directories of the file to be written that don’t already
exist. Though convenient, this behavior may be unexpected. If you want the write to fail when the
parent directory doesn’t exist, you should check for the existence of the parent directory first by
calling the exists() method. Alternatively, use FileContext, which allows you to control whether
parent directories are created or not.
There’s also an overloaded method for passing a callback interface, Progressable, so your application
can be notified of the progress of the data being written to the datanodes.
package org.apache.hadoop.util;
public interface Progressable {
public void progress();
}
As an alternative to creating a new file, you can append to an existing file using the append() method (there are also some other overloaded versions):
public FSDataOutputStream append(Path f) throws IOException
The append operation allows a single writer to modify an already written file by opening it and
writing data from the final offset in the file. With this API, applications that produce unbounded files,
such as logfiles, can write to an existing file after having closed it. The append operation is optional
and not implemented by all Hadoop filesystems. The example below shows how to copy a local file to a
Hadoop filesystem. We illustrate progress by printing a period every time the progress() method is
called by Hadoop, which is after each 64 KB packet of data is written to the datanode pipeline.
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.util.Progressable;
public class CopyFileWithProgress {
public static void main(String[] args) throws Exception {
String localSrc = args[0];
String dst = args[1];
InputStream in = new BufferedInputStream(new FileInputStream(localSrc));
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(URI.create(dst), conf);
OutputStream out = fs.create(new Path(dst), new Progressable() {
public void progress() {
System.out.print(".");
}
});
IOUtils.copyBytes(in, out, 4096, true);
}
}
FSDataOutputStream
The create() method on FileSystem returns an FSDataOutputStream, which, like FSDataInputStream,
has a method for querying the current position in the file:
package org.apache.hadoop.fs;
public class FSDataOutputStream extends DataOutputStream implements Syncable {
public long getPos() throws IOException {
// implementation elided
}
}
However, unlike FSDataInputStream, FSDataOutputStream does not permit seeking. This is because
HDFS allows only sequential writes to an open file or appends to an already written file. In other
words, there is no support for writing to anywhere other than the end of the file, so there is no value
in being able to seek while writing.
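The append-only contract can be mimicked with java.nio.file: a stream opened with StandardOpenOption.APPEND only ever writes at the end of the file, which is exactly the restriction HDFS imposes. A minimal sketch (file names invented; an analogy, not the HDFS API):

```java
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class AppendOnlyDemo {
    // Append bytes to an existing file; writes always land at the end,
    // mirroring HDFS's "sequential writes or appends only" rule.
    static void append(Path file, String text) throws Exception {
        try (OutputStream out = Files.newOutputStream(file, StandardOpenOption.APPEND)) {
            out.write(text.getBytes("UTF-8"));
        }
    }

    public static void main(String[] args) throws Exception {
        Path tmp = Files.createTempFile("append", ".log");
        Files.write(tmp, "line1\n".getBytes("UTF-8"));
        append(tmp, "line2\n"); // reopen-and-append, like FileSystem.append()
        System.out.println(Files.readAllLines(tmp)); // [line1, line2]
    }
}
```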
Big data is a collection of data sets that are large, complex, and difficult to store and process using available data management tools or traditional data processing applications. Hadoop is an open-source framework for writing, running, storing, and processing large datasets in a parallel and distributed manner. It is a solution used to overcome the challenges faced by big data. Its key features are:
Fault-Tolerance
Scalability
Distributed Storage
Reliability
High availability
Cost-effective
High throughput
Building Blocks of Hadoop:
1. Name Node
2. Data Node
3. Secondary Name Node (SNN)
4. Job Tracker
5. Task Tracker
Anatomy of File Read in HDFS
Let's get an idea of how data flows between the client interacting with HDFS, the name node, and the data nodes, step by step:
Step 1: The client opens the file it wishes to read by calling open() on the FileSystem object (which for HDFS is an instance of DistributedFileSystem).
Step 2: Distributed File System( DFS) calls the name node, using remote procedure calls (RPCs), to
determine the locations of the first few blocks in the file. For each block, the name node returns the
addresses of the data nodes that have a copy of that block. The DFS returns an FSDataInputStream to
the client for it to read data from. FSDataInputStream in turn wraps a DFSInputStream, which
manages the data node and name node I/O.
Step 3: The client then calls read() on the stream. DFSInputStream, which has stored the data node addresses for the first few blocks in the file, then connects to the first (closest) data node for the first block in the file.
Step 4: Data is streamed from the data node back to the client, which calls read() repeatedly on the
stream.
Step 5: When the end of the block is reached, DFSInputStream will close the connection to the data node, then find the best data node for the next block. This happens transparently to the client, which from its point of view is simply reading a continuous stream. Blocks are read in order, with the DFSInputStream opening new connections to data nodes as the client reads through the stream. It will also call the name node to retrieve the data node locations for the next batch of blocks as needed.
Step 6: When the client has finished reading the file, it calls close() on the FSDataInputStream.
Anatomy of File Write in HDFS
Step 1: The client creates the file by calling create() on the DistributedFileSystem object.
Step 2: DFS makes an RPC call to the name node to create a new file in the file system’s namespace,
with no blocks associated with it. The name node performs various checks to make sure the file
doesn’t already exist and that the client has the right permissions to create the file. If these checks
pass, the name node makes a record of the new file; otherwise, the file can't be created and an IOException is thrown to the client. The DFS returns an FSDataOutputStream for the client to start writing data to.
Step 3: As the client writes data, the DFSOutputStream splits it into packets, which it writes to an internal queue called the data queue. The data queue is consumed by the DataStreamer, which is responsible for asking the name node to allocate new blocks by picking a list of suitable data nodes to store the replicas. The list of data nodes forms a pipeline, and here we'll assume the replication level is three, so there are three nodes in the pipeline. The DataStreamer streams the packets to the first data node in the pipeline, which stores each packet and forwards it to the second data node in the pipeline.
Step 4: Similarly, the second data node stores the packet and forwards it to the third (and last) data
node in the pipeline.
Step 5: The DFSOutputStream sustains an internal queue of packets that are waiting to be
acknowledged by data nodes, called an “ack queue”.
Step 6: When the client has finished writing data, it calls close() on the stream. This flushes all the remaining packets to the data node pipeline and waits for acknowledgments before contacting the name node to signal that the file is complete.
HDFS follows the Write Once Read Many model. So, we can't edit files that are already stored in HDFS, but we can append to them by reopening the file. This design allows HDFS to scale to a
large number of concurrent clients because the data traffic is spread across all the data nodes in the
cluster. Thus, it increases the availability, scalability, and throughput of the system.
Limitations of HDFS
The main problem with Hadoop is that it is not suitable for small data. HDFS lacks the ability to support random reading of small files due to its high-capacity design. Small files are files smaller than the HDFS block size (default 128 MB). HDFS was designed to work with a small number of large files for storing large data sets, rather than a large number of small files, and it cannot handle huge numbers of small files. If there are very many small files, the NameNode will be overloaded, since it stores the namespace of HDFS in memory.
Slow Processing Speed
MapReduce processes a huge amount of data. In Hadoop, MapReduce works by breaking the processing into two phases: Map and Reduce. MapReduce requires a lot of time to perform these tasks, thus increasing latency and reducing processing speed.
Hadoop only supports batch processing, it is not suitable for streaming data. Hence, overall
performance is slower. MapReduce framework doesn’t leverage the memory of the Hadoop cluster to
the maximum.
No Real-time Processing
Hadoop with its core MapReduce framework is unable to process real-time data. Hadoop processes data in batches. First, the user loads the file into HDFS. Then the user runs a MapReduce job with the
file as input. It follows the ETL cycle of processing. The user extracts the data from the source. Then
the data gets transformed to meet the business requirements. And finally loaded into the data
warehouse. The users can generate insights from this data. The companies use these insights for the
betterment of their business.
Iterative Processing
Core Hadoop does not support iterative processing. Iterative processing requires a cyclic data flow, in which the output of a previous stage serves as the input to the next stage. Hadoop MapReduce is capable
of batch processing. It works on the principle of write-once-read-many. The data gets written on the
disk once. And then read multiple times to get insights. The Map-reduce of Hadoop has a batch
processing engine at its core. It is not able to iterate through data.
Latency
MapReduce in Hadoop is slower because it supports different formats and structures of huge amounts of data. In MapReduce, Map takes a set of data and converts it into another set of data, where an individual element is broken down into a key-value pair. Reduce takes the output from the Map as input and processes it further. MapReduce requires a lot of
time to perform these tasks thereby increasing latency.
No Ease of Use
In Hadoop, we have to hand code each and every operation. This has two drawbacks: first, it is difficult to use, and second, it increases the number of lines of code. There is no interactive mode
available with Hadoop Map-Reduce. This also makes it difficult to debug as it runs in the batch
mode. In this mode, we have to specify the jar file, the input as well as the location of the output file.
If the program fails in between, it is difficult to find the culprit code.
Security Issue
Hadoop does not implement encryption-decryption at the storage or network levels. Thus it is not very secure. For security, Hadoop adopts Kerberos authentication, which is difficult to maintain.
No Caching
Apache Hadoop is not efficient for caching. MapReduce cannot cache the intermediate data in memory for further use, and this diminishes the performance of Hadoop.
Lengthy Code
Apache Hadoop has about 1,20,000 lines of code. A larger number of lines produces a larger number of bugs, and it takes more time to execute the programs.
► Introduction to MapReduce: its Framework
► Features of MapReduce
► Working of MapReduce
► Analyze Map and Reduce Functions
► MapReduce Techniques to optimize a job
► Uses
► Controlling input formats in MapReduce Execution
► Different Phases in MapReduce
► Applications
MapReduce Features
This chapter looks at some of the more advanced features of MapReduce, including counters and sorting and joining datasets.
Counters
There are often things you would like to know about the data you are analyzing but that are peripheral to the analysis you are performing. For example, if you were counting invalid records and discovered that the proportion of invalid records in the whole dataset was very high, you might be prompted to check why so many records were being marked as invalid. Perhaps there is a bug in the part of the program that detects invalid records? Or if the data were of poor quality and genuinely did have very many invalid records, after discovering this, you might decide to increase the size of the dataset so that the number of good records was large enough for meaningful analysis.
Counters are a useful channel for gathering statistics about the job: for quality control or for application-level statistics. They are also useful for problem diagnosis. If you are tempted to put a log message into your map or reduce task, it is often better to see whether you can use a counter instead to record that a particular condition occurred. In addition to counter values being much easier to retrieve than log output for large distributed jobs, you get a record of the number of times that condition occurred, which is more work to obtain from a set of logfiles.
Sorting
The ability to sort data is at the heart of MapReduce. Even if your application isn't concerned with sorting per se, it may be able to use the sorting stage that MapReduce provides to organize its data.
Joins
MapReduce can perform joins between large datasets, but writing the code to do joins from scratch is fairly involved. Rather than writing MapReduce programs, you might consider using a higher-level framework such as Pig, Hive, or Cascading, in which join operations are a core part of the implementation. Let's briefly consider the problem we are trying to solve. We have two datasets (for example, the weather stations database and the weather records) and we want to reconcile the two. Let's say we want to see each station's history, with the station's metadata inlined in each output row.
This is illustrated in Figure 8-2.
How we implement the join depends on how large the datasets are and how they are partitioned. If one dataset is large (the weather records) but the other one is small enough to be distributed to each node in the cluster (as the station metadata is), the join can be effected by a MapReduce job that brings the records for each station together (a partial sort on station ID, for example). The mapper or reducer uses the smaller dataset to look up the station metadata for a station ID, so it can be written out with each record. See Side Data Distribution for a discussion of this approach, where we focus on the mechanics of distributing the data to tasktrackers.
If the join is performed by the mapper, it is called a map-side join, whereas if it is performed by the reducer it is called a reduce-side join.
If both datasets are too large for either to be copied to each node in the cluster, we can still join them using MapReduce with a map-side or reduce-side join, depending on how the data is structured. One common example of this case is a user database and a log of some user activity (such as access logs). For a popular service, it is not feasible to distribute the user database (or the logs) to all the MapReduce nodes.
MapReduce Working
The whole process goes through four phases of execution, namely splitting, mapping, shuffling, and reducing.
Consider you have the following input data for your MapReduce program:
Welcome to Hadoop Class
Hadoop is good
Hadoop is bad
MapReduce Architecture
The final output of the MapReduce task is:
Bad 1
Class 1
Good 1
Hadoop 3
Is 2
To 1
Welcome 1
The data goes through the following phases:
Input Splits:
An input to a MapReduce job is divided into fixed-size pieces called input splits. An input split is a chunk of the input that is consumed by a single map.
Mapping
This is the very first phase in the execution of the map-reduce program. In this phase, data in each split is passed to a mapping function to produce output values. In our example, the job of the mapping phase is to count the number of occurrences of each word from the input splits (more details about input splits are given below) and prepare a list in the form of <word, frequency>.
Shuffling
This phase consumes the output of the Mapping phase. Its task is to consolidate the relevant records from the Mapping phase output. In our example, the same words are clubbed together along with their respective frequency.
Reducing
In this phase, output values from the Shuffling phase are aggregated. This phase combines values from the Shuffling phase and returns a single output value. In short, this phase summarizes the complete dataset. In our example, this phase aggregates the values from the Shuffling phase, i.e., calculates the total occurrences of each word.
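The four phases can be traced in plain Java, with no Hadoop cluster, for the sample input above. This is a sketch of the data flow only, not Hadoop's actual API, and it counts words case-sensitively, so the capitalization differs slightly from the output table above:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class PhasesDemo {
    // Run the mapping, shuffling, and reducing phases over the given
    // input splits and return a <word, count> table.
    static Map<String, Integer> wordCount(String[] splits) {
        // Mapping: emit a <word, 1> pair for every word in every split
        List<String[]> pairs = new ArrayList<>();
        for (String split : splits)
            for (String word : split.split(" "))
                pairs.add(new String[] {word, "1"});

        // Shuffling: club the pairs for the same word together
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String[] pair : pairs)
            grouped.computeIfAbsent(pair[0], k -> new ArrayList<>()).add(Integer.parseInt(pair[1]));

        // Reducing: aggregate each word's list of 1s into a total count
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            int sum = 0;
            for (int v : e.getValue()) sum += v;
            counts.put(e.getKey(), sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Splitting: each line of the sample input becomes one input split
        String[] splits = {"Welcome to Hadoop Class", "Hadoop is good", "Hadoop is bad"};
        wordCount(splits).forEach((w, c) -> System.out.println(w + " " + c));
    }
}
```

In real Hadoop, the splitting is done by the framework from HDFS blocks, and the shuffle moves data across the network between map and reduce tasks; here all four phases simply run in one process.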
MapReduce
1. Traditional Enterprise Systems normally have a centralized server to store and process data.
2. The following illustration depicts a schematic view of a traditional enterprise system. The traditional model is certainly not suitable to process huge volumes of scalable data and cannot be accommodated by standard database servers.
3. Moreover, the centralized system creates too much of a bottleneck while processing multiple files simultaneously.
Figure 4.1: MapReduce
Figure 4.2: Physical structure
1. A MapReduce computation executes as follows:
► Some number of Map tasks each are given one or more chunks from a distributed file system. These Map tasks turn the chunk into a sequence of key-value pairs. The way key-value pairs are produced from the input data is determined by the code written by the user for the Map function.
► The key-value pairs from each Map task are collected by a master controller and sorted by key. The keys are divided among all the Reduce tasks, so all key-value pairs with the same key wind up at the same Reduce task.
► The Reduce tasks work on one key at a time, and combine all the values associated with that key in some way. The manner of combination of values is determined by the code written by the user for the Reduce function.
Figure 4.3: Schematic MapReduce Computation
A. The Map Task
1. We view input files for a Map task as consisting of elements, which can be any type: a tuple or a document, for example.
2. A chunk is a collection of elements, and no element is stored across two chunks.
3. Technically, all inputs to Map tasks and outputs from Reduce tasks are of the key-value-pair form, but normally the keys of input elements are not relevant and we shall tend to ignore them.
4. Insisting on this form for inputs and outputs is motivated by the desire to allow composition of several MapReduce processes.
5. The Map function takes an input element as its argument and produces zero or more key-value pairs.
6. The types of keys and values are each arbitrary. Further, keys are not "keys" in the usual sense; they do not have to be unique.
7. Rather, a Map task can produce several key-value pairs with the same key, even from the same element.
Example 1: A MapReduce computation with what has become the standard example application: counting the number of occurrences of each word in a collection of documents. In this example, the input file is a repository of documents, and each document is an element. The Map function for this example uses keys that are of type String (the words) and values that are integers. The Map task reads a document and breaks it into its sequence of words w1, w2, ..., wn. It then emits a sequence of key-value pairs where the value is always 1. That is, the output of the Map task for this document is the sequence of key-value pairs:
(w1, 1), (w2, 1), ..., (wn, 1)
A single Map task will typically process many documents: all the documents in one or more chunks. Thus, its output will be more than the sequence for the one document suggested above. If a word w appears m times among all the documents assigned to that process, then there will be m key-value pairs (w, 1) among its output. An option is to combine these m pairs into a single pair (w, m), but we can only do that because the Reduce tasks apply an associative and commutative operation, addition, to the values.
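Because addition is associative and commutative, pre-combining the m pairs (w, 1) into a single pair (w, m) inside each Map task cannot change the final totals. A plain-Java sketch of this combiner idea (the two "documents" are invented for illustration):

```java
import java.util.HashMap;
import java.util.Map;

public class CombinerDemo {
    // A Map task's raw output is one (w, 1) pair per occurrence; the
    // combiner folds those into (w, m) per word within the task.
    static Map<String, Integer> combine(String[] words) {
        Map<String, Integer> combined = new HashMap<>();
        for (String w : words)
            combined.merge(w, 1, Integer::sum);
        return combined;
    }

    // The Reduce side: sum the per-task partial counts for each word.
    static Map<String, Integer> reduce(Map<String, Integer> partA, Map<String, Integer> partB) {
        Map<String, Integer> total = new HashMap<>(partA);
        partB.forEach((w, c) -> total.merge(w, c, Integer::sum));
        return total;
    }

    public static void main(String[] args) {
        String[] doc1 = {"big", "data", "big"};
        String[] doc2 = {"data", "big"};
        // Combining per task, then reducing, gives the same totals as
        // shipping every (w, 1) pair to the reducer individually: big=3, data=2
        System.out.println(reduce(combine(doc1), combine(doc2)));
    }
}
```

The payoff in a real cluster is less shuffle traffic: each Map task ships one pair per distinct word instead of one pair per occurrence.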
B. Grouping by Key
i. When the Map tasks have all completed successfully, the key-value pairs are grouped by key, and the values associated with each key are formed into a list of values.
ii. The grouping is performed by the system, regardless of what the Map and Reduce tasks do.
Figure 4.4: Overview of the execution of a MapReduce program
ii. The Reduce task takes the output from the Map as an input and combines those data tuples (key-value pairs) into a smaller set of tuples.
iii. The reduce task is always performed after the map job.
Figure 4.5: Reduce job
► Input Phase - Here we have a RecordReader that translates each record in an input file and sends the parsed data to the mapper in the form of key-value pairs.
► Map - Map is a user-defined function, which takes a series of key-value pairs and processes each one of them to generate zero or more key-value pairs.
► Intermediate Keys - The key-value pairs generated by the mapper are known as intermediate keys.
► Combiner - A combiner is a type of local Reducer that groups similar data from the map phase into identifiable sets. It takes the intermediate keys from the mapper as input and applies user-defined code to aggregate the values in the small scope of one mapper. It is not a part of the main MapReduce algorithm; it is optional.
► Shuffle and Sort - The Reducer task starts with the Shuffle and Sort step. It downloads the grouped key-value pairs onto the local machine, where the Reducer is running. The individual key-value pairs are sorted by key into a larger data list. The data list groups the equivalent keys together so that their values can be iterated easily in the Reducer task.
► Reducer - The Reducer takes the grouped key-value paired data as input and runs a Reducer function on each one of them. Here, the data can be aggregated, filtered, and combined in a number of ways, and it can require a wide range of processing. Once the execution is over, it passes zero or more key-value pairs to the final step.
► Output Phase - In the output phase, we have an output formatter that translates the final key-value pairs from the Reducer function and writes them onto a file using a record writer.
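The phases above can be strung together in a small word-count sketch. Plain Python stands in for Hadoop here; every function name is illustrative, and the combiner runs on each mapper's output before the shuffle, exactly the local pre-aggregation described above.

```python
from collections import defaultdict

# Map: emit (word, 1) for every word in a line.
def map_fn(line):
    return [(word, 1) for word in line.split()]

# Combiner: pre-aggregate one mapper's output locally.
def combine(pairs):
    acc = defaultdict(int)
    for k, v in pairs:
        acc[k] += v
    return list(acc.items())

# Shuffle and Sort: group intermediate pairs by key, sorted by key.
def shuffle(all_pairs):
    groups = defaultdict(list)
    for k, v in all_pairs:
        groups[k].append(v)
    return sorted(groups.items())

# Reduce: sum the value list for each key.
def reduce_fn(key, values):
    return (key, sum(values))

lines = ["big data", "big deal"]
intermediate = []
for line in lines:                      # one "mapper" per line
    intermediate.extend(combine(map_fn(line)))
result = [reduce_fn(k, vs) for k, vs in shuffle(intermediate)]
print(result)  # [('big', 2), ('data', 1), ('deal', 1)]
```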
iv. The MapReduce phase
Figure 4.6: The MapReduce phase
F. MapReduce - Example
Twitter receives around 500 million tweets per day, which is nearly 6,000 tweets per second. The following illustration shows how Twitter manages its tweets with the help of MapReduce.
Figure 4.7: Example
Tokenize - Tokenizes the tweets into maps of tokens and writes them as key-value pairs.
Filter - Filters unwanted words from the maps of tokens and writes the filtered maps as key-value pairs.
Count - Generates a token counter per word.
Aggregate Counters - Prepares an aggregate of similar counter values into small manageable units.
G. MapReduce - Algorithm
The MapReduce algorithm contains two important tasks, namely Map and Reduce. The map task is done by means of the Mapper class.
The Mapper class takes the input, tokenizes it, maps and sorts it. The output of the Mapper class is used as input by the Reducer class.
The reduce task is done by means of the Reducer class.
MapReduce implements various mathematical algorithms to divide a task into small parts and assign them to multiple systems. In technical terms, the MapReduce algorithm helps in sending the Map and Reduce tasks to appropriate servers in a cluster.
Figure 4.8: The MapReduce Class
H. Coping With Node Failures
i. The worst thing that can happen is that the compute node at which the Master is executing fails. In this case, the entire MapReduce job must be restarted.
ii. But only this one node can bring the entire process down; other failures will be managed by the Master, and the MapReduce job will complete eventually.
iii. Suppose the compute node at which a Map worker resides fails. This failure will be detected by the Master, because it periodically pings the Worker processes.
iv. All the Map tasks that were assigned to this Worker will have to be redone, even if they had completed. The reason for redoing completed Map tasks is that their output destined for the Reduce tasks resides at that compute node, and is now unavailable to the Reduce tasks.
v. The Master sets the status of each of these Map tasks to idle and will schedule them on a Worker when one becomes available.
vi. The Master must also inform each Reduce task that the location of its input from that Map task has changed. Dealing with a failure at the node of a Reduce worker is simpler.
vii. The Master simply sets the status of its currently executing Reduce tasks to idle. These will be rescheduled on another Reduce worker later.
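The Master's bookkeeping for a failed worker can be sketched as follows. This is an illustrative model of the rules above, not actual Hadoop code: Map tasks on the failed node go back to idle even if completed (their local output is lost), while only the currently running Reduce tasks are reset, since completed Reduce output lives in durable storage.

```python
def handle_worker_failure(tasks, failed_node):
    """Reset task statuses after 'failed_node' dies, per the MapReduce rules."""
    for task in tasks:
        if task["node"] != failed_node:
            continue
        if task["type"] == "map":
            task["status"] = "idle"   # redo even if completed: output is lost
        elif task["status"] == "running":
            task["status"] = "idle"   # completed reduces already wrote durably
    return tasks

tasks = [
    {"type": "map", "status": "completed", "node": "n1"},
    {"type": "map", "status": "running", "node": "n1"},
    {"type": "reduce", "status": "completed", "node": "n1"},
    {"type": "reduce", "status": "running", "node": "n1"},
]
handle_worker_failure(tasks, "n1")
print([t["status"] for t in tasks])  # ['idle', 'idle', 'completed', 'idle']
```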
Controlling MapReduce Execution with InputFormat
Hadoop InputFormat checks the input specification of the job. InputFormat splits the input file into InputSplits and assigns each split to an individual Mapper.
In this section we will learn:
what InputFormat is in Hadoop MapReduce,
the different methods to get the data to the mapper, and
the different types of InputFormat in Hadoop, such as FileInputFormat, TextInputFormat, KeyValueTextInputFormat, etc.
2. How do we get the data to the mapper?
FileInputFormat in Hadoop
It is the base class for all file-based InputFormats. Hadoop FileInputFormat specifies the input directory where data files are located. When we start a Hadoop job, FileInputFormat is provided with a path containing files to read. FileInputFormat reads all files and divides these files into one or more InputSplits.
TextInputFormat
It is the default InputFormat of MapReduce. TextInputFormat treats each line of each input file as a separate record and performs no parsing. This is useful for unformatted data or line-based records like log files.
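A rough sketch of what TextInputFormat produces: the key is the byte offset at which each line starts, and the value is the line content. The Python below is purely illustrative; Hadoop itself yields (LongWritable, Text) pairs via its Java API.

```python
def text_input_records(data: bytes):
    """Yield (byte_offset, line) records, like TextInputFormat's key-value pairs."""
    offset = 0
    for line in data.splitlines(keepends=True):
        yield offset, line.rstrip(b"\n").decode()
        offset += len(line)  # next record starts after this line's bytes

data = b"first line\nsecond\n"
print(list(text_input_records(data)))  # [(0, 'first line'), (11, 'second')]
```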
SequenceFileInputFormat
Hadoop SequenceFileInputFormat is an InputFormat which reads sequence files. Sequence files are binary files that store sequences of binary key-value pairs. Sequence files are block-compressed and provide direct serialization and deserialization of several arbitrary data types (not just text). Here both key and value are user-defined.
SequenceFileAsTextInputFormat
Hadoop SequenceFileAsTextInputFormat is another form of SequenceFileInputFormat which converts the sequence file's keys and values to Text objects. The conversion is performed on the keys and values by calling 'toString()'. This InputFormat makes sequence files suitable input for streaming.
SequenceFileAsBinaryInputFormat
Hadoop SequenceFileAsBinaryInputFormat is a SequenceFileInputFormat using which we can extract the sequence file's keys and values as opaque binary objects.
NLineInputFormat
Hadoop NLineInputFormat is another form of TextInputFormat where the keys are the byte offset of the line and the values are the contents of the line. With TextInputFormat and KeyValueTextInputFormat, each mapper receives a variable number of lines of input, and the number depends on the size of the split and the length of the lines. If we want our mapper to receive a fixed number of lines of input, we use NLineInputFormat.
N is the number of lines of input that each mapper receives. By default (N=1), each mapper receives exactly one line of input. If N=2, then each split contains two lines. One mapper will receive the first two key-value pairs and another mapper will receive the second two key-value pairs.
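The N-line splitting can be sketched as below (illustrative Python, not the Hadoop API); with N=2, the first mapper gets lines 1-2, the second gets lines 3-4, and so on, with a possibly shorter final split.

```python
def nline_splits(lines, n=1):
    """Group input lines into splits of n lines each, as NLineInputFormat does;
    each mapper then processes exactly one split (the last may be shorter)."""
    return [lines[i:i + n] for i in range(0, len(lines), n)]

lines = ["l1", "l2", "l3", "l4", "l5"]
print(nline_splits(lines, n=2))  # [['l1', 'l2'], ['l3', 'l4'], ['l5']]
```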
DBInputFormat
Hadoop DBInputFormat is an InputFormat that reads data from a relational database using JDBC. As it doesn't have partitioning capabilities, we need to be careful not to swamp the database by reading it with too many mappers. So it is best for loading relatively small datasets, perhaps for joining with large datasets from HDFS using MultipleInputs. Here the key is a LongWritable, while the value is a DBWritable.
UNIT V - BIG DATA PLATFORMS
Sqoop
Cassandra
MongoDB
HIVE
PIG
Storm
Flink
Sqoop
Sqoop is a tool designed to transfer data between Hadoop and relational database servers. It is used to import data from relational databases such as MySQL and Oracle to Hadoop HDFS, and to export data from the Hadoop file system to relational databases. This is a brief tutorial that explains how to make use of Sqoop in the Hadoop ecosystem.
Audience
This tutorial is prepared for professionals aspiring to make a career in Big Data Analytics using the Hadoop framework with Sqoop. ETL developers and professionals who are into analytics in general may also use this tutorial to good effect.
Prerequisites
Before proceeding with this tutorial, you need a basic knowledge of Core Java, database concepts of SQL, the Hadoop file system, and any flavor of the Linux operating system.
The traditional application management system, that is, the interaction of applications with a relational database using an RDBMS, is one of the sources that generate Big Data. Such Big Data, generated by the RDBMS, is stored in relational database servers in the relational database structure.
When Big Data storage and analysis tools such as MapReduce, Hive, HBase, Cassandra, Pig, etc. of the Hadoop ecosystem came into the picture, they required a tool to interact with the relational database servers for importing and exporting the Big Data residing in them. Here, Sqoop occupies a place in the Hadoop ecosystem to provide feasible interaction between relational database servers and Hadoop's HDFS.
Sqoop - "SQL to Hadoop and Hadoop to SQL"
Sqoop is a tool designed to transfer data between Hadoop and relational database servers. It is used to import data from relational databases such as MySQL and Oracle to Hadoop HDFS, and to export data from the Hadoop file system to relational databases. It is provided by the Apache Software Foundation.
How does Sqoop work?
The following image describes the workflow of Sqoop.
Sqoop Import
The import tool imports individual tables from an RDBMS to HDFS. Each row in a table is treated as a record in HDFS. All records are stored as text data in text files or as binary data in Avro and Sequence files.
Sqoop Export
The export tool exports a set of files from HDFS back to an RDBMS. The files given as input to Sqoop contain records, which are called rows in a table. Those are read and parsed into a set of records and delimited with a user-specified delimiter.
Cassandra
Cassandra is an open-source NoSQL distributed database that manages large amounts of data across commodity servers. It is a decentralized, scalable storage system designed to handle vast volumes of data across multiple commodity servers, providing high availability without a single point of failure.
Cassandra was created at Facebook but was open-sourced and released to become an Apache project (maintained by the American non-profit Apache Software Foundation) in 2008. It became a top-level Apache project in 2010 and is now among the best-known NoSQL database systems in the world, trusted and used by many large organizations.
1. Architecture of Cassandra
The beautiful thing about Cassandra's architecture is that it can easily be expanded to house more data. By adding more nodes, you can double the amount of data the system carries without overwhelming it. This dynamic scaling ability goes both ways: by reducing the number of nodes, developers can shrink the database system if necessary. Compared to previous structured query language (SQL) databases and the complexity of increasing their data-carrying capacity, Cassandra's architecture gives it a considerable advantage.
Another way Cassandra's architecture helps its functionality is that it increases data security and protects from data loss.
2. The partitioning system
3. Cassandra's replicability
Another way Cassandra works is by replicating data across nodes. These secondary nodes are called replica nodes, and the number of replica nodes for a given dataset is based on the replication factor (RF). A replication factor of 3 means three nodes cover the same token range, storing the same data. Multiple replicas are key to the reliability of Cassandra.
Even when one node stops functioning, temporarily or permanently, other nodes hold the same data, meaning that data is hardly ever wholly lost. Better still, if a temporarily disrupted node is back on track, it receives an update on the data actions it may have missed and then catches up to speed to continue functioning.
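Replica placement can be sketched as below. This is a hypothetical simplification: real Cassandra uses a configurable partitioner and snitch-aware replication strategies, but the core idea of hashing a key onto a ring and walking clockwise for RF replicas is the same.

```python
import hashlib

def replica_nodes(key, ring, rf=3):
    """Pick rf consecutive nodes on the ring for 'key' (simplified placement)."""
    token = int(hashlib.md5(key.encode()).hexdigest(), 16)  # key -> token
    start = token % len(ring)                               # owning node
    return [ring[(start + i) % len(ring)] for i in range(rf)]

ring = ["n0", "n1", "n2", "n3", "n4"]
replicas = replica_nodes("user:42", ring, rf=3)
print(len(replicas), len(set(replicas)))  # 3 3  -> three distinct replica nodes
```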
Key Features of Cassandra
Cassandra is a unique database system, and some of its key features include:
1. Open-source availability
Nothing is more exciting than getting a handy product for free. This is probably one of the significant factors behind Cassandra's far-reaching popularity and acceptance. Cassandra is among the open-source products hosted by Apache and is free for anyone who wants to utilize it.
2. Distributed footprint
Another feature of Cassandra is that it is well distributed and meant to run over multiple nodes as opposed to a central system. All the nodes are equal in significance, and without a master node, no bottleneck slows the process down. This is very important because the companies that utilize Cassandra need to constantly run on accurate data and cannot tolerate data loss. The equal and wide distribution of Cassandra data across nodes means that losing one node does not significantly affect the system's general performance.
3. Scalability
Cassandra has elastic scalability. This means that it can be scaled up or down without much difficulty or resistance. Cassandra's scalability is once again due to the nodal architecture. It is intended to grow horizontally as your needs as a developer or company grow. Scaling up in Cassandra is very easy and not limited to location. Adding or removing extra nodes can adjust your database system to suit your dynamic needs.
Another exciting point about scaling in Cassandra is that there is no slowdown, pause or hitch in the system during the process. This means end-users would not feel the effect of whatever happened, ensuring smooth service to all individuals connected to the network.
4. Cassandra Query Language
Cassandra is not a relational database and does not use the standard query language, SQL. It uses the Cassandra Query Language (CQL). This would have posed a problem for admins, as they would have to master a whole new language, but the good thing about CQL is that it is very similar to SQL. It is structured to operate with rows and columns, i.e., table-based data.
However, it does lack the flexibility that comes with the fixed schema of SQL. CQL combines the tabular database management system and the key-value store. It operates using data type operations, definition operations, data definition operations, triggers operations, security operations, arithmetic operations, etc.
5. Fault tolerance
Cassandra is fault-tolerant primarily because of its data replicative ability. Data replication denotes the ability of the system to store the same information at multiple locations or nodes. This makes it highly available and tolerant of faults in the system. Failure of a single node or data center does not bring the system to a halt, as data has been replicated and stored across the nodes in the cluster. Data replication leads to a high level of backup and recovery.
6. Schema free
SQL is a fixed-schema database language, making it rigid and fixed. However, Cassandra is a schema-optional data model and allows the operator to create as many rows and columns as is deemed necessary.
7. Tunable consistency
Cassandra has two types of consistency: eventual consistency and strong consistency. Strong consistency broadcasts any update or information to every node where the concerned data is located before the write is considered complete. In eventual consistency, the client is acknowledged as soon as the cluster accepts a write, and the update propagates to the remaining replicas in the background.
Cassandra's tunable consistency is a feature that allows the developer to decide to use either of the two types depending on the function being carried out. The developer can use either or both kinds of consistency at any time.
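The "tunable" part can be made concrete with the standard quorum rule: with replication factor RF, a read consistency level R and a write consistency level W guarantee that a read sees the latest write when R + W > RF, because any read set must overlap any write set in at least one replica. A minimal sketch:

```python
def is_strongly_consistent(rf, r, w):
    """True when read quorum r plus write quorum w guarantees overlap
    among rf replicas (the R + W > RF rule)."""
    return r + w > rf

print(is_strongly_consistent(rf=3, r=2, w=2))  # True  (QUORUM reads + QUORUM writes)
print(is_strongly_consistent(rf=3, r=1, w=1))  # False (eventual consistency)
```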
8. Fast writes
Cassandra is known to have a very high throughput, not hindered by its size. Its ability to write quickly is a function of its data handling process. The initial step taken is to write to the commit log. This is for durability, to preserve data in case of damage or node downtime. Writing to the commit log is a speedy and efficient process using this tool.
Top 6 Uses of Cassandra
Cassandra is an open-source NoSQL database management system with many advantages and practical functionalities that rival other systems. It is used by several large and small companies across the globe. Some of the top applications of Cassandra include:
1. E-commerce
E-commerce is an extremely sensitive field that cuts across every region and country. The nature of financial markets means anticipated peak times as well as downtimes. For a finance operation, no customer would want to experience downtime or lack of access when there is revenue to be earned and lots of opportunities to hold on to. E-commerce companies can avoid these downtimes or potential blackouts by using a highly reliable system like Cassandra. Its fault tolerance allows it to keep running even if a whole center is damaged, with little or no hitch in the system.
Due to its easy scalability, especially in peak seasons, e-commerce and inventory management is also a significant application of Cassandra. When there is a market rush, the company has to increase the ability of the database to carry and store more data. Seasonal, rapid e-commerce growth that is affordable and does not cause a system restart is simply a perfect fit for companies.
E-commerce websites also benefit from Cassandra as it stores and records visitors' activities. It then allows analytical tools to modulate the visitor's actions and, for instance, tempt them to stay on the website.
2. Entertainment websites
With the aid of Cassandra, websites for movies, games and music can keep track of customer behavior and preferences. The database records, for each visitor, what was clicked, downloaded, time spent, etc. This information is analyzed and used to recommend further entertainment options to the end-user.
This application of Cassandra falls under the personalization, recommendation and customer experience use cases. It is not just limited to entertainment sites, but extends to online shopping platforms and social media recommendations. This is why users receive notifications of goods similar to what they spent time browsing.
3. Internet of Things (IoT) and edge computing
Today's world is seeing the rise of the Internet of Things (IoT). We are steadily bombarded with thousands of new information points or datasets. Every wearable device, weather sensor, traffic sensor, or mobile device keeps track of and sends data on weather, traffic, energy usage, soil conditions, etc. This flood of information can be overwhelming and easily lost. Cassandra suits such workloads because:
• Cassandra allows every individual node to carry out read and write operations.
• It can handle and store a large amount of data.
• Cassandra supports the analysis of data in real time.
4. Authentication and fraud detection
Fraud detection is essential to the security and reliability of many companies, mainly banks, insurance and other financial institutions. At all points, these companies must ensure that they can beat the new, revamped ways of stealing data developed by fraudsters. Financial companies aim to keep fraudsters and hackers away from their systems. Cassandra is applicable here due to continuous, real-time, big data analytics. Cassandra can source data across a wide range of internet activities and can trigger alarms when patterns and irregularities that may precede fraud are detected.
On the other hand, these same financial institutions are required to carry out a seamless identity authentication process. To make the user login authentication process tight enough to allow only genuine customers, yet simple enough to make it easy for them, Cassandra helps to carry out real-time analysis. In addition to these analyses, Cassandra is also vital because, with it, you can be sure of constant accessibility.
In general, Cassandra is among the top database options chosen to detect fraud and carry out authentication processes because:
• It allows for real-time analysis with machine learning and artificial intelligence (AI).
• It can host large amounts of actively growing datasets.
• It has a flexible schema to allow the processing of different data types.
5. Messaging
There are currently several messaging applications in use, and an ever-growing number of individuals are using them. This creates the need for a stable database system to store ever-flowing information volumes. Cassandra provides both stability and storage capacity for companies that offer messaging services.
6. Logistics and asset management
Cassandra is used in logistics and asset management to track the movement of any item to be transported. From the purchase to the final delivery, applications can rely on Cassandra to log each transaction.
What is Apache Pig?
Apache Pig is an abstraction over MapReduce. It is a tool/platform which is used to analyze larger sets of data, representing them as data flows. Pig is generally used with Hadoop; we can perform all the data manipulation operations in Hadoop using Apache Pig.
To write data analysis programs, Pig provides a high-level language known as Pig Latin. This language provides various operators using which programmers can develop their own functions for reading, writing, and processing data.
To analyze data using Apache Pig, programmers need to write scripts using the Pig Latin language. All these scripts are internally converted to Map and Reduce tasks. Apache Pig has a component known as the Pig Engine that accepts the Pig Latin scripts as input and converts those scripts into MapReduce jobs.
Why Do We Need Apache Pig?
Programmers who are not so good at Java normally used to struggle working with Hadoop, especially while performing any MapReduce tasks. Apache Pig is a boon for all such programmers.
• Using Pig Latin, programmers can perform MapReduce tasks easily without having to type complex code in Java.
• Apache Pig uses a multi-query approach, thereby reducing the length of code. For example, an operation that would require you to type 200 lines of code (LoC) in Java can easily be done by typing as few as just 10 LoC in Apache Pig. Ultimately, Apache Pig reduces the development time by almost 16 times.
• Pig Latin is a SQL-like language, and it is easy to learn Apache Pig when you are familiar with SQL.
• Apache Pig provides many built-in operators to support data operations like joins, filters, ordering, etc. In addition, it also provides nested data types like tuples, bags, and maps that are missing from MapReduce.
Features of Pig
Apache Pig comes with the following features:
• Rich set of operators - It provides many operators to perform operations like join, sort, filter, etc.
• Ease of programming - Pig Latin is similar to SQL, and it is easy to write a Pig script if you are good at SQL.
• Optimization opportunities - The tasks in Apache Pig optimize their execution automatically, so the programmers need to focus only on the semantics of the language.
• Extensibility - Using the existing operators, users can develop their own functions to read, process, and write data.
• UDFs - Pig provides the facility to create User-Defined Functions in other programming languages such as Java and invoke or embed them in Pig scripts.
• Handles all kinds of data - Apache Pig analyzes all kinds of data, both structured as well as unstructured. It stores the results in HDFS.
Apache Pig Vs MapReduce
Listed below are the major differences between Apache Pig and MapReduce.
• Apache Pig is a dataflow language; MapReduce is a data processing paradigm.
• Apache Pig is a high-level language; MapReduce is low level and rigid.
Apache Pig Vs SQL
Listed below are the major differences between Apache Pig and SQL.
• The data model in Apache Pig is nested relational; the data model used in SQL is flat relational.
• Apache Pig provides limited opportunity for query optimization; there is more opportunity for query optimization in SQL.
In addition to the above differences, Apache Pig Latin:
• Allows splits in the pipeline.
• Allows developers to store data anywhere in the pipeline.
• Declares execution plans.
• Provides operators to perform ETL (Extract, Transform, and Load) functions.
Apache Pig Vs Hive
Both Apache Pig and Hive are used to create MapReduce jobs, and in some cases, Hive operates on HDFS in a similar way Apache Pig does. In the following list, we have noted a few significant points that set Apache Pig apart from Hive.
• Pig Latin is a dataflow language; HiveQL is a query processing language.
• Pig Latin is a procedural language and it fits in the pipeline paradigm; HiveQL is a declarative language.
Apache Pig is a data processing language which provides a rich set of data types and operators to perform various operations on the data.
To perform a particular task, programmers using Pig need to write a Pig script using the Pig Latin language, and execute it using any of the execution mechanisms (Grunt Shell, UDFs, Embedded). After execution, these scripts will go through a series of transformations applied by the Pig Framework, to produce the desired output. Internally, Apache Pig converts these scripts into a series of MapReduce jobs, and thus, it makes the programmer's job easy. The architecture of Apache Pig is shown below.
Apache Pig Components
As shown in the figure, there are various components in the Apache Pig framework. Let us take a look at the major components.
Parser
Initially the Pig scripts are handled by the Parser. It checks the syntax of the script, does type checking, and other miscellaneous checks. The output of the parser will be a DAG (directed acyclic graph), which represents the Pig Latin statements and logical operators.
In the DAG, the logical operators of the script are represented as the nodes and the data flows are represented as edges.
Optimizer
The logical plan (DAG) is passed to the logical optimizer, which carries out the logical optimizations such as projection and pushdown.
Compiler
The compiler compiles the optimized logical plan into a series of MapReduce jobs.
Execution engine
Finally, the MapReduce jobs are submitted to Hadoop in a sorted order, and these MapReduce jobs are executed on Hadoop, producing the desired results.
Pig Latin Data Model
The data model of Pig Latin is fully nested and it allows complex non-atomic data types such as map and tuple. Given below is the diagrammatical representation of Pig Latin's data model.
Atom
Any single value in Pig Latin, irrespective of its data type, is known as an Atom. It is stored as a string and can be used as a string and a number. int, long, float, double, chararray, and bytearray are the atomic values of Pig. A piece of data or a simple atomic value is known as a field.
Example - 'raja' or '30'
Tuple
A record that is formed by an ordered set of fields is known as a tuple; the fields can be of any type. A tuple is similar to a row in a table of an RDBMS.
Example - (Raja, 30)
Bag
A bag is an unordered set of tuples. In other words, a collection of tuples (non-unique) is known as a bag. Each tuple can have any number of fields (flexible schema). A bag is represented by '{}'. It is similar to a table in an RDBMS, but unlike a table in an RDBMS, it is not necessary that every tuple contain the same number of fields or that the fields in the same position (column) have the same type.
Example - {(Raja, 30), (Mohammad, 45)}
A bag can be a field in a relation; in that context, it is known as an inner bag.
Example - {Raja, 30, {9848022338, raja@gmail.com,}}
Map
A map (or data map) is a set of key-value pairs. The key needs to be of type chararray and should be unique. The value might be of any type. It is represented by '[]'.
Example - [name#Raja, age#30]
Relation
A relation is a bag of tuples. The relations in Pig Latin are unordered (there is no guarantee that tuples are processed in any particular order).
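The data model above can be mirrored with ordinary Python values, purely as an illustration; Pig itself stores these as its own types, and the variable names here are ours.

```python
# Pig Latin data model mirrored with plain Python values (illustrative only):
atom = "raja"                                   # Atom: a single value / field
row = ("Raja", 30)                              # Tuple: ordered set of fields
bag = {("Raja", 30), ("Mohammad", 45)}          # Bag: unordered set of tuples
data_map = {"name": "Raja", "age": 30}          # Map: chararray keys, any values
relation = bag                                  # Relation: a bag of tuples

# A relation is unordered, so we only ask about membership, never position.
assert ("Raja", 30) in relation
print(sorted(t[0] for t in relation))  # ['Mohammad', 'Raja']
```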
Apache Pig - Installation
This chapter explains how to download, install, and set up Apache Pig in your system.
Prerequisites
It is essential that you have Hadoop and Java installed on your system before you go for Apache Pig. Therefore, prior to installing Apache Pig, install Hadoop and Java by following the steps given in the following link:
http://www.tutorialspoint.com/hadoop/hadoop_enviornment_setup.htm
Download Apache Pig
First of all, download the latest version of Apache Pig from the following website: https://pig.apache.org/
Step 1
Open the homepage of the Apache Pig website. Under the section News, click on the link release page as shown in the following snapshot.
Step 2
On clicking the specified link, you will be redirected to the Apache Pig Releases page. On this page, under the Download section, you will have two links, namely, Pig 0.8 and later and Pig 0.7 and before. Click on the link Pig 0.8 and later; then you will be redirected to the page having a set of mirrors.
Step 3
Choose and click any one of these mirrors as shown below.
Step 4
These mirrors will take you to the Pig Releases page. This page contains various versions of Apache Pig. Click the latest version among them.
Step 5
Within these folders, you will have the source and binary files of Apache Pig in various distributions. Download the tar files of the source and binary files of Apache Pig 0.15, pig-0.15.0-src.tar.gz and pig-0.15.0.tar.gz.
Install Apache Pig
After downloading the Apache Pig software, install it in your Linux environment by following the steps given below.
Step 1
Create a directory with the name Pig in the same directory where the installation directories of Hadoop, Java, and other software were installed. (In our tutorial, we have created the Pig directory in the user named Hadoop.)
$ mkdir Pig
Step 2
Extract the downloaded tar files as shown below.
$ cd Downloads/
$ tar zxvf pig-0.15.0-src.tar.gz
$ tar zxvf pig-0.15.0.tar.gz
Step 3
Move the content of the extracted pig-0.15.0-src directory to the Pig directory created earlier as shown below.
$ mv pig-0.15.0-src/* /home/Hadoop/Pig/
Configure Apache Pig
After installing Apache Pig, we have to configure it. To configure it, we need to edit two files: .bashrc and pig.properties.
.bashrc file
In the .bashrc file, set the following variables:
• PIG_HOME to Apache Pig's installation folder,
• the PATH environment variable to the bin folder, and
• the PIG_CLASSPATH environment variable to the etc (configuration) folder of your Hadoop installation (the directory that contains the core-site.xml, hdfs-site.xml and mapred-site.xml files).
export PIG_HOME=/home/Hadoop/Pig
export PATH=$PATH:/home/Hadoop/Pig/bin
export PIG_CLASSPATH=$HADOOP_HOME/conf
pig.properties file
In the conf folder of Pig, we have a file named pig.properties. In the pig.properties file, you can set various parameters as given below.
You can verify the installation with the pig -version command; a successful installation prints output like the following.
Apache Pig version 0.15.0 (r1682971)
compiled Jun 01 2015, 11:44:35
Running Pig
You can run Pig (execute Pig Latin statements and Pig commands) using various modes:

                   Local Mode   Tez Local Mode   Spark Local Mode   Mapreduce Mode   Tez Mode   Spark Mode
Interactive Mode   yes          experimental     experimental       yes              yes        yes
Batch Mode         yes          experimental     experimental       yes              yes        yes
ExecutionModes
Pighassixexecutionmodesorexectypes:
• LocalMode-TorunPiginlocalmode,youneedaccesstoasinglemachine;allfilesareinstalled
and runusing your localhost and file system. Specify local mode using the -xflag (pig -x local).
• TezLocalMode-TorunPig intezlocalmode. Itissimilar tolocalmode, exceptinternallyPig
willinvoke tez runtime engine.SpecifyTez localmode using the -xflag(pig -xtez_local).
Note:Tezlocalmodeisexperimental.Therearesomequerieswhichjust erroroutonbiggerdatainlocal
mode.
• Spark Local Mode - To run Pig in Spark local mode. It is similar to local mode, except internally Pig will invoke the Spark runtime engine. Specify Spark local mode using the -x flag (pig -x spark_local). Note: Spark local mode is experimental. There are some queries which just error out on bigger data in local mode.
• Mapreduce Mode - To run Pig in mapreduce mode, you need access to a Hadoop cluster and HDFS installation. Mapreduce mode is the default mode; you can, but don't need to, specify it using the -x flag (pig OR pig -x mapreduce).
• Spark Mode - To run Pig in Spark mode, you need access to a Spark, Yarn or Mesos cluster and HDFS installation. Specify Spark mode using the -x flag (-x spark). In Spark execution mode, it is necessary to set env::SPARK_MASTER to an appropriate value (local - local mode, yarn-client - yarn-client mode, mesos://host:port - Spark on Mesos, or spark://host:port - Spark cluster; for more information refer to the Spark documentation on Master URLs; yarn-cluster mode is currently not supported). Pig scripts run on Spark can take advantage of the dynamic allocation feature. The feature can be enabled by simply enabling spark.dynamicAllocation.enabled. Refer to the Spark configuration for additional configuration details. In general, all properties in the Pig script prefixed with spark. are copied to the Spark Application Configuration. Please note that the Yarn auxiliary service needs to be enabled on Spark for this to work. See the Spark documentation for additional details.
You can run Pig in either mode using the "pig" command (the bin/pig Perl script) or the "java" command (java -cp pig.jar ...).
Examples
These examples show how to run Pig in local and mapreduce mode using the pig command.
/* local mode */
$ pig -x local ...

/* Tez local mode */
$ pig -x tez_local ...

/* Spark local mode */
$ pig -x spark_local ...

/* mapreduce mode */
$ pig ...
or
$ pig -x mapreduce ...

/* Tez mode */
$ pig -x tez ...

/* Spark mode */
$ pig -x spark ...
Interactive Mode
You can run Pig in interactive mode using the Grunt shell. Invoke the Grunt shell using the "pig" command (as shown below) and then enter your Pig Latin statements and Pig commands interactively at the command line.
Example
These Pig Latin statements extract all user IDs from the /etc/passwd file. First, copy the /etc/passwd file to your local working directory. Next, invoke the Grunt shell by typing the "pig" command (in local or hadoop mode). Then, enter the Pig Latin statements interactively at the grunt prompt (be sure to include the semicolon after each statement). The DUMP operator will display the results to your terminal screen.
grunt> A = load 'passwd' using PigStorage(':');
grunt> B = foreach A generate $0 as id;
grunt> dump B;
Local Mode
$ pig -x local
... - Connecting to ...
grunt>
Tez Local Mode
$ pig -x tez_local
... - Connecting to ...
grunt>
Spark Local Mode
$ pig -x spark_local
... - Connecting to ...
grunt>
Mapreduce Mode
$ pig -x mapreduce
... - Connecting to ...
grunt>
or
$ pig
... - Connecting to ...
grunt>
Tez Mode
$ pig -x tez
... - Connecting to ...
grunt>
Spark Mode
$ pig -x spark
... - Connecting to ...
grunt>
Batch Mode
You can run Pig in batch mode using Pig scripts and the "pig" command (in local or hadoop mode).
Example
The Pig Latin statements in the Pig script (id.pig) extract all user IDs from the /etc/passwd file. First, copy the /etc/passwd file to your local working directory. Next, run the Pig script from the command line (using local or mapreduce mode). The STORE operator will write the results to a file (id.out).
/* id.pig */
Tez Mode
$ pig -x tez id.pig
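The body of id.pig is not reproduced in these notes. Based on the interactive example earlier in this section, it would contain statements along these lines (a sketch, not the original file):

```pig
/* id.pig -- extract user IDs from passwd and store them in id.out */
A = load 'passwd' using PigStorage(':');  -- fields in /etc/passwd are colon-separated
B = foreach A generate $0 as id;          -- keep only the first field (the user ID)
store B into 'id.out';                    -- write the results to the file id.out
```

The script can then be run with, for example, $ pig -x local id.pig (local mode) or $ pig id.pig (mapreduce mode).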
Spark Mode
$ pig -x spark id.pig
Pig Scripts
Use Pig scripts to place Pig Latin statements and Pig commands in a single file. While not required, it is good practice to identify the file using the *.pig extension.
You can run Pig scripts from the command line and from the Grunt shell (see the run and exec commands).
Pig scripts allow you to pass values to parameters using parameter substitution.
Comments in Scripts
You can include comments in Pig scripts:
• For multi-line comments use /* ... */
• For single-line comments use --
/* myscript.pig
My script is simple.
It includes three Pig Latin statements.
*/
Scripts and Distributed File Systems
Pig supports running scripts (and Jar files) that are stored in HDFS, Amazon S3, and other distributed file systems. The script's full location URI is required (see REGISTER for information about Jar files). For example, to run a Pig script on HDFS, do the following:
$ pig hdfs://nn.mydomain.com:9020/myscripts/script.pig
Pig Latin Statements
Pig Latin statements are the basic constructs you use to process data using Pig. A Pig Latin statement is an operator that takes a relation as input and produces another relation as output. (This definition applies to all Pig Latin operators except LOAD and STORE, which read data from and write data to the file system.) Pig Latin statements may include expressions and schemas. Pig Latin statements can span multiple lines and must end with a semi-colon (;). By default, Pig Latin statements are processed using multi-query execution.
Pig Latin statements are generally organized as follows:
• A LOAD statement to read data from the file system.
• A series of "transformation" statements to process the data.
• A DUMP statement to view results or a STORE statement to save the results. Note that a DUMP or STORE statement is required to generate output.
In this example Pig will validate, but not execute, the LOAD and FOREACH statements:
A = LOAD 'student' USING PigStorage() AS (name:chararray, age:int, gpa:float);
B = FOREACH A GENERATE name;
In this example, Pig will validate and then execute the LOAD, FOREACH, and DUMP statements:
A = LOAD 'student' USING PigStorage() AS (name:chararray, age:int, gpa:float);
B = FOREACH A GENERATE name;
DUMP B;
(John)
(Mary)
(Bill)
(Joe)
Loading Data
Use the LOAD operator and the load/store functions to read data into Pig (PigStorage is the default load function).
Working with Data
Pig allows you to transform data in many ways. As a starting point, become familiar with these operators:
• Use the FILTER operator to work with tuples or rows of data. Use the FOREACH operator to work with columns of data.
• Use the GROUP operator to group data in a single relation. Use the COGROUP, inner JOIN, and outer JOIN operators to group or join data in two or more relations.
• Use the UNION operator to merge the contents of two or more relations. Use the SPLIT operator to partition the contents of a relation into multiple relations.
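As a minimal sketch combining a few of these operators (the student file and its schema are hypothetical, chosen to match the earlier examples):

```pig
-- load rows from a hypothetical comma-separated student file
students = LOAD 'student' USING PigStorage(',') AS (name:chararray, age:int, gpa:float);
adults   = FILTER students BY age >= 18;        -- FILTER works with rows of data
names    = FOREACH adults GENERATE name, gpa;   -- FOREACH works with columns of data
by_age   = GROUP students BY age;               -- GROUP groups data in a single relation
DUMP names;
```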
To specify Pig properties use one of these mechanisms:
• The pig.properties file (add the directory that contains the pig.properties file to the classpath)
• The -D and a Pig property in the PIG_OPTS environment variable (export PIG_OPTS=-Dpig.tmpfilecompression=true)
• The -P command line option and a properties file (pig -P mypig.properties)
• The set command (set pig.exec.nocombiner true)
Note: The properties file uses the standard Java property file format.
The following precedence order is supported: pig.properties < -D Pig property < -P properties file < set command. This means that if the same property is provided using the -D command line option as well as the -P command line option (properties file), the value of the property in the properties file will take precedence.
TospecifyHadooppropertiesyoucanusethesamemechanisms:
• Hadoopconfigurationfiles(includepig-cluster-hadoop-site.xml)
• The-D and aHadoop propertyinPIG_OPTS environmentvariable (export PIG_OPTS=-
Dmapreduce.task.profile=true)
• The-Pcommandlineoptionandaproperty file(pig-Pproperty_file)
• Thesetcommand(setmapred.map.tasks.speculative.executionfalse)
The same precedenceholds: Hadoop configurationfiles< -D Hadoop property< -P properties_file< set
command.
Hadooppropertiesare not interpretedbyPigbut are passeddirectly to Hadoop.AnyHadoopproperty can
be passed this way.
Allproperties that Pigcollects, including Hadoop properties, areavailable toanyUDFviathe
UDFContextobject. To get access to the properties,you cancallthe getJobConf method.
Apache Pig - Diagnostic Operators
The load statement will simply load the data into the specified relation in Apache Pig. To verify the execution of the load statement, you have to use the Diagnostic Operators. Pig Latin provides four different types of diagnostic operators:
• Dump operator
• Describe operator
• Explanation (explain) operator
• Illustration (illustrate) operator
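As an illustrative sketch of how the four operators are used at the Grunt prompt (the student_data.txt file, its HDFS location, and its schema are assumptions):

```pig
grunt> student = LOAD 'hdfs://localhost:9000/pig_data/student_data.txt'
           USING PigStorage(',') AS (id:int, name:chararray);
grunt> DUMP student;        -- runs the job and prints the relation's contents
grunt> DESCRIBE student;    -- prints the schema of the relation
grunt> EXPLAIN student;     -- prints the logical, physical and MapReduce execution plans
grunt> ILLUSTRATE student;  -- shows step-by-step execution on a small sample of the data
```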
The GROUP operator is used to group the data in one or more relations. It collects the data having the same key.
Syntax
Given below is the syntax of the GROUP operator:
grunt> Group_data = GROUP Relation_name BY age;
Apache Pig - Cogroup Operator
The COGROUP operator works more or less in the same way as the GROUP operator. The only difference between the two operators is that the GROUP operator is normally used with one relation, while the COGROUP operator is used in statements involving two or more relations.
Grouping Two Relations using Cogroup
Assume that we have two files namely student_details.txt and employee_details.txt in the HDFS directory /pig_data/.
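A sketch of cogrouping the two relations (the schemas shown are assumptions; only the file names and directory come from the text above):

```pig
grunt> student_details  = LOAD 'hdfs://localhost:9000/pig_data/student_details.txt'
           USING PigStorage(',') AS (id:int, name:chararray, age:int, city:chararray);
grunt> employee_details = LOAD 'hdfs://localhost:9000/pig_data/employee_details.txt'
           USING PigStorage(',') AS (id:int, name:chararray, age:int, city:chararray);
-- group both relations on the same key; each output tuple holds one bag per relation
grunt> cogroup_data = COGROUP student_details BY age, employee_details BY age;
grunt> DUMP cogroup_data;
```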
Apache Pig - Join Operator
The JOIN operator is used to combine records from two or more relations. While performing a join operation, we declare one (or a group of) tuple(s) from each relation as keys. When these keys match, the two particular tuples are matched, else the records are dropped. Joins can be of the following types:
• Self-join
• Inner-join
• Outer-join - left join, right join, and full join
This chapter explains with examples how to use the JOIN operator in Pig Latin. Assume that we have two files namely customers.txt and orders.txt in the /pig_data/ directory of HDFS.
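For example, an inner join of the two files on the customer ID might look like this (the schemas are assumptions; only the file names come from the text above):

```pig
grunt> customers = LOAD 'hdfs://localhost:9000/pig_data/customers.txt'
           USING PigStorage(',') AS (id:int, name:chararray, age:int, salary:int);
grunt> orders    = LOAD 'hdfs://localhost:9000/pig_data/orders.txt'
           USING PigStorage(',') AS (oid:int, date:chararray, customer_id:int, amount:int);
-- inner join: keep only tuples whose keys match in both relations
grunt> customer_orders = JOIN customers BY id, orders BY customer_id;
grunt> DUMP customer_orders;
```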
Apache Pig - Cross Operator
Apache Pig - Foreach Operator
The FOREACH operator is used to generate specified data transformations based on the column data.
Syntax
Given below is the syntax of the FOREACH operator:
grunt> Relation_name2 = FOREACH Relation_name1 GENERATE (required data);
Working with Functions in Pig
Apache Pig provides various built-in functions, namely eval, load, store, math, string, bag and tuple functions.
Eval Functions
Given below is the list of eval functions provided by Apache Pig.

S.N.  Function & Description
1   AVG() - To compute the average of the numerical values within a bag.
2   BagToString() - To concatenate the elements of a bag into a string. While concatenating, we can place a delimiter between these values (optional).
3   CONCAT() - To concatenate two or more expressions of the same type.
4   COUNT() - To get the number of elements in a bag, while counting the number of tuples in a bag.
5   COUNT_STAR() - It is similar to the COUNT() function. It is used to get the number of elements in a bag.
6   DIFF() - To compare two bags (fields) in a tuple.
7   IsEmpty() - To check if a bag or map is empty.
8   MAX() - To get the highest (maximum) value (numeric or chararray) for a certain column in a single-column bag.
9   MIN() - To get the minimum (lowest) value (numeric or chararray) for a certain column in a single-column bag.
10  PluckTuple() - Using the Pig Latin PluckTuple() function, we can define a string prefix and filter the columns in a relation that begin with the given prefix.
11  SIZE() - To compute the number of elements based on any Pig data type.
12  SUBTRACT() - To subtract two bags. It takes two bags as inputs and returns a bag which contains the tuples of the first bag that are not in the second bag.
13  SUM() - To get the total of the numeric values of a column in a single-column bag.
14  TOKENIZE() - To split a string (which contains a group of words) in a single tuple and return a bag which contains the output of the split operation.
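A sketch of using a few of these eval functions together (the student relation and its schema are hypothetical):

```pig
students = LOAD 'student' USING PigStorage(',') AS (name:chararray, age:int, gpa:float);
grouped  = GROUP students ALL;                        -- a single group holding the whole bag
stats    = FOREACH grouped GENERATE
               COUNT(students),                       -- number of tuples in the bag
               AVG(students.gpa),                     -- average of a numeric column
               MAX(students.gpa), MIN(students.gpa);  -- highest and lowest values
DUMP stats;
```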
The Load and Store functions in Apache Pig are used to determine how data goes into and comes out of Pig. These functions are used with the load and store operators. Given below is the list of load and store functions available in Pig.

S.N.  Function & Description
1   PigStorage() - To load and store structured files.
2   TextLoader() - To load unstructured data into Pig.
3   BinStorage() - To load and store data into Pig using a machine-readable format.
4   Handling Compression - In Pig Latin, we can load and store compressed data.
Given below is the list of Bag and Tuple functions.

S.N.  Function & Description
1   TOBAG() - To convert two or more expressions into a bag.
2   TOP() - To get the top N tuples of a relation.
3   TOTUPLE() - To convert one or more expressions into a tuple.
4   TOMAP() - To convert the key-value pairs into a Map.
We have the following String functions in Apache Pig.

S.N.  Functions & Description
1   ENDSWITH(string, testAgainst) - To verify whether a given string ends with a particular substring.
2   STARTSWITH(string, substring) - Accepts two string parameters and verifies whether the first string starts with the second.
4   EqualsIgnoreCase(string1, string2) - To compare two strings ignoring the case.
5   INDEXOF(string, 'character', startIndex) - Returns the first occurrence of a character in a string, searching forward from a start index.
6   LAST_INDEX_OF(expression) - Returns the index of the last occurrence of a character in a string, searching backward from a start index.
7   LCFIRST(expression) - Converts the first character in a string to lowercase.
8   UCFIRST(expression) - Returns a string with the first character converted to uppercase.
9   UPPER(expression) - Returns a string converted to uppercase.
10  LOWER(expression) - Converts all characters in a string to lowercase.
11  REPLACE(string, 'oldChar', 'newChar') - To replace existing characters in a string with new characters.
12  STRSPLIT(string, regex, limit) - To split a string around matches of a given regular expression.
13  STRSPLITTOBAG(string, regex, limit) - Similar to the STRSPLIT() function, it splits the string by a given delimiter and returns the result in a bag.
14  TRIM(expression) - Returns a copy of a string with leading and trailing whitespace removed.
15  LTRIM(expression) - Returns a copy of a string with leading whitespace removed.
16  RTRIM(expression) - Returns a copy of a string with trailing whitespace removed.
Apache Pig provides the following Date and Time functions:

S.N.  Functions & Description
1   ToDate(milliseconds) - This function returns a date-time object according to the given parameters. The other alternatives for this function are ToDate(isostring), ToDate(userstring, format), and ToDate(userstring, format, timezone).
2   CurrentTime() - Returns the date-time object of the current time.
3   GetDay(datetime) - Returns the day of a month from the date-time object.
4   GetHour(datetime) - Returns the hour of a day from the date-time object.
5   GetMilliSecond(datetime) - Returns the millisecond of a second from the date-time object.
6   GetMinute(datetime) - Returns the minute of an hour from the date-time object.
7   GetMonth(datetime) - Returns the month of a year from the date-time object.
8   GetSecond(datetime) - Returns the second of a minute from the date-time object.
9   GetWeek(datetime) - Returns the week of a year from the date-time object.
10  GetWeekYear(datetime) - Returns the week year from the date-time object.
11  GetYear(datetime) - Returns the year from the date-time object.
12  AddDuration(datetime, duration) - Returns the result of a date-time object along with the duration object.
13  SubtractDuration(datetime, duration) - Subtracts the Duration object from the Date-Time object and returns the result.
14  DaysBetween(datetime1, datetime2) - Returns the number of days between the two date-time objects.
15  HoursBetween(datetime1, datetime2) - Returns the number of hours between two date-time objects.
16  MilliSecondsBetween(datetime1, datetime2) - Returns the number of milliseconds between two date-time objects.
17  MinutesBetween(datetime1, datetime2) - Returns the number of minutes between two date-time objects.
18  MonthsBetween(datetime1, datetime2) - Returns the number of months between two date-time objects.
19  SecondsBetween(datetime1, datetime2) - Returns the number of seconds between two date-time objects.
20  WeeksBetween(datetime1, datetime2) - Returns the number of weeks between two date-time objects.
21  YearsBetween(datetime1, datetime2) - Returns the number of years between two date-time objects.
We have the following Math functions in Apache Pig:

S.N.  Functions & Description
1   ABS(expression) - To get the absolute value of an expression.
2   ACOS(expression) - To get the arc cosine of an expression.
3   ASIN(expression) - To get the arc sine of an expression.
4   ATAN(expression) - This function is used to get the arc tangent of an expression.
5   CBRT(expression) - This function is used to get the cube root of an expression.
6   CEIL(expression) - This function is used to get the value of an expression rounded up to the nearest integer.
7   COS(expression) - This function is used to get the trigonometric cosine of an expression.
8   COSH(expression) - This function is used to get the hyperbolic cosine of an expression.
9   EXP(expression) - This function is used to get Euler's number e raised to the power of x.
10  FLOOR(expression) - To get the value of an expression rounded down to the nearest integer.
11  LOG(expression) - To get the natural logarithm (base e) of an expression.
12  LOG10(expression) - To get the base 10 logarithm of an expression.
13  RANDOM() - To get a pseudorandom number (type double) greater than or equal to 0.0 and less than 1.0.
14  ROUND(expression) - To get the value of an expression rounded to an integer (if the result type is float) or rounded to a long (if the result type is double).
15  SIN(expression) - To get the sine of an expression.
16  SINH(expression) - To get the hyperbolic sine of an expression.
17  SQRT(expression) - To get the positive square root of an expression.
18  TAN(expression) - To get the trigonometric tangent of an angle.
19  TANH(expression) - To get the hyperbolic tangent of an expression.
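A short sketch applying some of the string and math functions above inside a FOREACH (the relation and its fields are hypothetical):

```pig
data   = LOAD 'student' USING PigStorage(',') AS (name:chararray, gpa:double);
result = FOREACH data GENERATE
             UPPER(name),   -- string function: convert the name to uppercase
             TRIM(name),    -- strip leading and trailing whitespace
             ROUND(gpa),    -- math function: round to the nearest integer
             SQRT(gpa);     -- positive square root
DUMP result;
```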
Hive - Introduction
The term 'Big Data' is used for collections of large datasets that include huge volume, high velocity, and a variety of data that is increasing day by day. Using traditional data management systems, it is difficult to process Big Data. Therefore, the Apache Software Foundation introduced a framework called Hadoop to solve Big Data management and processing challenges.
Hadoop
Hadoop is an open-source framework to store and process Big Data in a distributed environment. It contains two modules, one is MapReduce and the other is Hadoop Distributed File System (HDFS).
MapReduce: It is a parallel programming model for processing large amounts of structured, semi-structured, and unstructured data on large clusters of commodity hardware.
HDFS: Hadoop Distributed File System is a part of the Hadoop framework, used to store and process the datasets. It provides a fault-tolerant file system to run on commodity hardware.
The Hadoop ecosystem contains different sub-projects (tools) such as Sqoop, Pig, and Hive that are used to help the Hadoop modules.
Sqoop: It is used to import and export data between HDFS and RDBMS.
This component diagram contains different units. The following table describes each unit:

Unit Name        Operation
User Interface   Hive is a data warehouse infrastructure software that can create interaction between user and HDFS. The user interfaces that Hive supports are Hive Web UI, Hive command line, and Hive HD Insight (in Windows server).
Execution Engine The execution engine processes the query and generates results as same as MapReduce results. It uses the flavor of MapReduce. (The rows for the other units are not fully preserved in these notes.)
HDFS or HBASE    Hadoop Distributed File System or HBASE are the data storage techniques to store data into the file system.
Working of Hive
The following diagram depicts the workflow between Hive and Hadoop.
The following table defines how Hive interacts with the Hadoop framework:

Step No.  Operation
1    Execute Query - The Hive interface such as Command Line or Web UI sends a query to the Driver (any database driver such as JDBC, ODBC, etc.) to execute.
2    Get Plan - The driver takes the help of the query compiler that parses the query to check the syntax and query plan or the requirement of the query.
3    Get Metadata - The compiler sends a metadata request to the Metastore (any database).
4    Send Metadata - The Metastore sends metadata as a response to the compiler.
5    Send Plan - The compiler checks the requirement and resends the plan to the driver. Up to here, the parsing and compiling of a query is complete.
6    Execute Plan - The driver sends the execute plan to the execution engine.
7    Execute Job - Internally, the process of executing the job is a MapReduce job. The execution engine sends the job to the JobTracker, which is in the Name node, and it assigns this job to the TaskTracker, which is in the Data node. Here, the query executes the MapReduce job.
7.1  Metadata Ops - Meanwhile, during execution, the execution engine can execute metadata operations with the Metastore.
8    Fetch Result - The execution engine receives the results from the Data nodes.
9    Send Results - The execution engine sends those resultant values to the driver.
10   Send Results - The driver sends the results to the Hive Interfaces.
Installation and Configuration
You can install a stable release of Hive by downloading a tarball, or you can download the source code and build Hive from that.
Running HiveServer2 and Beeline
Requirements
• Java 1.7. Note: Hive versions 1.2 onward require Java 1.7 or newer. Hive versions 0.14 to 1.1 work with Java 1.6 as well. Users are strongly advised to start moving to Java 1.8 (see HIVE-8607).
• Hadoop 2.x (preferred), 1.x (not supported by Hive 2.0.0 onward). Hive versions up to 0.13 also supported Hadoop 0.20.x and 0.23.x.
Hive is commonly used in production Linux and Windows environments. Mac is a commonly used development environment. The instructions in this document are applicable to Linux and Mac. Using it on Windows would require slightly different steps.
Installing Hive from a Stable Release
Start by downloading the most recent stable release of Hive from one of the Apache download mirrors (see Hive Releases).
Next you need to unpack the tarball. This will result in the creation of a subdirectory named hive-x.y.z (where x.y.z is the release number):
$ tar -xzvf hive-x.y.z.tar.gz
Set the environment variable HIVE_HOME to point to the installation directory:
$ cd hive-x.y.z
$ export HIVE_HOME={{pwd}}
Finally, add $HIVE_HOME/bin to your PATH:
$ export PATH=$HIVE_HOME/bin:$PATH
BuildingHivefromSource
The Hive GIT repository for the most recent Hive code 1s located here: git clonehttps://git-
wip-us.apache.org/repos/asfi'hive.git (the master branch).
All release versions are in branches named "branch-0.#" or "branch-I.#" orthe upcoming"branch-
2.#", withtheexceptionofrelease0.8.1whichisin"branch-0.8-r2".Anybrancheswithothernamesare
feature branchesfor works-in-progress. See Understanding Hive Branches for details.
Asof0.13,HiveisbuiltusingApache Maven.
CompileHiveonmaster
TobuildthecurrentHivecodefromthemasterbranch:
$gitclonehttps://git-wip-us.apache.org/repos/asfi'hive.git
$cdhive
$mvncleanpackage-Pdist[-DskipTests-Dmaven.javadoc.skip=true]
$ cd packaging/target/apache-hive-{version}-SNAPSHOT-bin/apache-hive-{version}-
SNAPSHOT-bin
$ ls LICENSE NOTICE README.txt
RELEASENOTES.txt
bin/ (all the shell scripts) lib/ (required jar files) confi'(configuration files)
examples/(sampleinputandqueryfiles)
MallareddyEngineeringCollegeForWomen(AutonomousInstitution,UGC-Govt.ofIndia) 141
IntroductiontoBigDataAnalytics Dept.ofCSEAI &ML
hcatalog/(hcataloginstallation)
scripts I(upgradescriptsforhive-metastore) Here,{version}referstothecurrent Hiveversion.
IfbuildingHivesourceusingMaven(mvn),wewillrefertothedirectory"/packaging/target/apache-hive-
{version}-SNAPSHOT-bin/apache-hive-{version}-SNAPSHOT-bin" as <install-dir>fortherestofthe
page.
CompileHiveonbranch-]
In branch-I,Hive supports bothHadoop1.xand 2.x.You will need to specify which version ofHadoop
to build against via aMaven profile.To build againstHadoop 1.xusetheprofile hadoop-1; for Hadoop
2.xusehadoop-2.Forexampletobuild againstHadoop1.x,theabovemvncommandbecomes:
$mvncleanpackage-Phadoop-1,dist
CompileHivePriorto0.13onHadoop0.20
PriortoHive0.13,HivewasbuiltusingApacheAnt.Tobuildan olderversionof HiveonHadoop 0.20:
$svncohttp://svn.apache.org/repos/as£'hive/branches/branch-{version}hive
$cdhive
$antcleanpackage
$cdbuild/dist #ls LICENSE NOTICE README.txt
RELEASENOTES.txt
bin/ (allthe shell scripts) lib/ (requiredjar files) conf/(configurationfiles)
examples/(sampleinputandqueryfiles) hcatalog / (hcatalog installation)
scripts/(upgradescriptsforhive-metastore)
IfusingAnt,wewillrefertothedirectory"build/dist"as<install-dir>.
Compile Hive Prior to 0.13 on Hadoop 0.23
To build Hive in Ant against Hadoop 0.23, 2.0.0, or another version, build with the appropriate flag; some examples below:
$ ant clean package -Dhadoop.version=0.23.3 -Dhadoop-0.23.version=0.23.3 -Dhadoop.mr.rev=23
$ ant clean package -Dhadoop.version=2.0.0-alpha -Dhadoop-0.23.version=2.0.0-alpha -Dhadoop.mr.rev=23
Running Hive
Hive uses Hadoop, so:
• you must have Hadoop in your path, OR
• export HADOOP_HOME=<hadoop-install-dir>
Starting from Hive 2.1, we need to run the schematool command below as an initialization step. For example, we can use "derby" as the db type.
$ $HIVE_HOME/bin/schematool -dbType <db type> -initSchema
HiveServer2 (introduced in Hive 0.11) has its own CLI called Beeline. HiveCLI is now deprecated in favor of Beeline, as HiveCLI lacks the multi-user, security, and other capabilities of HiveServer2. To run HiveServer2 and Beeline from the shell:
$ $HIVE_HOME/bin/hiveserver2
$ $HIVE_HOME/bin/beeline -u jdbc:hive2://$HS2_HOST:$HS2_PORT
Beeline is started with the JDBC URL of the HiveServer2, which depends on the address and port where HiveServer2 was started. By default, it will be (localhost:10000), so the address will look like jdbc:hive2://localhost:10000.
Or, to start Beeline and HiveServer2 in the same process for testing purposes, for a similar user experience to HiveCLI:
$ $HIVE_HOME/bin/beeline -u jdbc:hive2://
Running HCatalog
To run the HCatalog server from the shell in Hive release 0.11.0 and later:
$ $HIVE_HOME/hcatalog/sbin/hcat_server.sh
To use the HCatalog command line interface (CLI) in Hive release 0.11.0 and later:
$ $HIVE_HOME/hcatalog/bin/hcat
For more information, see HCatalog Installation from Tarball and HCatalog CLI in the HCatalog manual.
Running WebHCat (Templeton)
To run the WebHCat server from the shell in Hive release 0.11.0 and later:
$ $HIVE_HOME/hcatalog/sbin/webhcat_server.sh
For more information, see WebHCat Installation in the WebHCat manual.
Hive Services
Hive services such as Meta store, File system, and Job Client in turn communicate with Hive storage and perform the following actions:
• Metadata information of tables created in Hive is stored in the Hive "Meta storage database".
• Query results and data loaded in the tables are going to be stored in the Hadoop cluster on HDFS.
Job execution flow:
4. The Metastore sends metadata information back to the compiler.
5. The compiler communicates with the Driver with the proposed plan to execute the query.
6. The Driver sends execution plans to the Execution engine.
7. The Execution Engine (EE) acts as a bridge between Hive and Hadoop to process the query. For DFS operations:
• The EE should first contact the Name Node and then the Data nodes to get the values stored in tables.
• The EE is going to fetch the desired records from the Data Nodes. The actual data of tables resides in the data nodes only; from the Name Node it only fetches the metadata information for the query.
• It collects actual data from the data nodes related to the mentioned query.
• The Execution Engine (EE) communicates bi-directionally with the Metastore present in Hive to perform DDL (Data Definition Language) operations. Here DDL operations like CREATE, DROP and ALTER of tables and databases are done. The Metastore stores information about database names, table names and column names only. It will fetch data related to the query mentioned.
• The Execution Engine (EE) in turn communicates with Hadoop daemons such as the Name node, Data nodes, and job tracker to execute the query on top of the Hadoop file system.
8. Fetching results from the driver.
9. Sending results to the Execution engine. Once the results are fetched from the data nodes to the EE, it will send the results back to the driver and to the UI (frontend).
Hive is continuously in contact with the Hadoop file system and its daemons via the Execution engine. The dotted arrow in the Job flow diagram shows the Execution engine's communication with the Hadoop daemons.
Hive - Data Types
This chapter takes you through the different data types in Hive, which are involved in table creation. All the data types in Hive are classified into four types, given as follows:
• Column Types
• Literals
• Null Values
• Complex Types
Column Types
Column types are used as the column data types of Hive. They are as follows:
Integral Types
Integer type data can be specified using the integral data type INT. When the data range exceeds the range of INT, you need to use BIGINT, and if the data range is smaller than INT, you use SMALLINT. TINYINT is smaller than SMALLINT.
The following table depicts various INT data types:

Data Type   Postfix   Example
TINYINT     Y         10Y
SMALLINT    S         10S
INT         -         10
BIGINT      L         10L
String Types
String type data can be specified using single quotes (' ') or double quotes (" "). It contains two data types: VARCHAR and CHAR. Hive follows C-type escape characters.
The following table depicts various CHAR data types:

Data Type   Length
VARCHAR     1 to 65535
CHAR        255

Timestamp
It supports traditional UNIX timestamps with optional nanosecond precision. It supports the java.sql.Timestamp format "YYYY-MM-DD HH:MM:SS.fffffffff" and format "yyyy-mm-dd hh:mm:ss.ffffffffff".
Dates
DATE values are described in year/month/day format in the form YYYY-MM-DD.
Decimals
The DECIMAL type in Hive is the same as the Big Decimal format of Java. It is used for representing immutable arbitrary precision. The syntax and example are as follows:
DECIMAL(precision, scale)
decimal(10,0)
Union Types
UNIONTYPE<int,double,array<string>,struct<a:int,b:string>>
{0:1}
{l:2.0}
{2:["three","four"]}
{3:{"a":5,"b":"five"}}
{2:["six","seven"]}
{3:{"a":8,"b":"eight"}}
Literals
ThefollowingliteralsareusedinHive:
FloatingPointTypes
Floatingpointtypesarenothingbutnumberswithdecimalpoints.Generally,thistypeofdata1s composed of
DOUBLEdata type.
DecimalType
Decimal type data is nothingbutfloating pointvalue with higherrange than DOUBLEdata type.The
range of decimal type is approximately -1o-308 toI0308.
NullValue
MissingvaluesarerepresentedbythespecialvalueNULL.
ComplexTypes
TheHivecomplexdatatypesareasfollows:
Arrays
ArraysinHive are usedthesamewaytheyareusedinJava.
Syntax:ARRAY<data_type>
Maps
MapsinHivearesimilartoJavaMaps.
Syntax: MAP<primitive_type, data_type>
Structs
Structs in Hive are similar to using complex data with comments.
Syntax: STRUCT<col_name : data_type [COMMENT col_comment], ...>
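The three complex types above can appear together in one table definition. A sketch, with hypothetical table and column names:

```sql
CREATE TABLE IF NOT EXISTS employee_details (
  name    STRING,
  skills  ARRAY<STRING>,                 -- e.g. a list of skill names
  phone   MAP<STRING, BIGINT>,           -- e.g. 'home' mapped to a number
  address STRUCT<city:STRING, pin:INT>   -- one struct value per row
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
COLLECTION ITEMS TERMINATED BY '|'
MAP KEYS TERMINATED BY ':';
```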
Hive-Built-in Functions
This chapter explains the built-in functions available in Hive. The functions look quite similar to SQL functions, except for their usage.
Built-In Functions
Hive supports the following built-in functions:
Return Type     | Signature                       | Description
BIGINT          | round(double a)                 | It returns the rounded BIGINT value of the double.
BIGINT          | floor(double a)                 | It returns the maximum BIGINT value that is equal to or less than the double.
BIGINT          | ceil(double a)                  | It returns the minimum BIGINT value that is equal to or greater than the double.
double          | rand(), rand(int seed)          | It returns a random number that changes from row to row.
string          | concat(string A, string B, ...) | It returns the string resulting from concatenating B after A.
string          | substr(string A, int start)     | It returns the substring of A starting from the start position till the end of string A.
int             | size(Array<T>)                  | It returns the number of elements in the array.
value of <type> | cast(<expr> as <type>)          | It converts the results of the expression expr to <type>, e.g. cast('1' as BIGINT) converts the string '1' to its integral representation. A NULL is returned if the conversion does not succeed.
round() function
hive> SELECT round(2.6) from temp;
On successful execution of the query, you get to see the following response:
3.0
floor() function
hive> SELECT floor(2.6) from temp;
On successful execution of the query, you get to see the following response:
2.0
ceil() function
hive> SELECT ceil(2.6) from temp;
On successful execution of the query, you get to see the following response:
3.0
Hive DDL Commands - Types of DDL Hive Commands
By DataFlair Team · Updated March 4, 2020
Want to run Hive queries for creating, modifying, dropping, and altering tables and databases?
In this article, we are going to learn Hive DDL commands. The article describes the Hive Data Definition Language (DDL) commands for performing various operations like creating a table/database in Hive, dropping a table/database in Hive, altering a table/database in Hive, etc. There are many DDL commands. This article will cover each DDL command individually, along with their syntax and examples.
For running Hive DDL commands, you must have Hive installed on your system.
Introduction to Hive DDL commands
Hive DDL commands are the statements used for defining and changing the structure of a table or database in Hive. They are used to build or modify tables and other objects in a database.
The several types of Hive DDL commands are:
1. CREATE
2. SHOW
3. DESCRIBE
4. USE
5. DROP
6. ALTER
7. TRUNCATE
Table-1 Hive DDL commands

DDL Command | Applies to
USE         | Database
DROP        | Database, Table
ALTER       | Database, Table
TRUNCATE    | Table
Before moving forward, note that the Hive commands are case-insensitive: CREATE DATABASE is the same as create database.
So now, let us go through each of the commands deeply. Let's start with the DDL commands on Databases in Hive.
DDL Commands On Databases in Hive
1. CREATE DATABASE in Hive
The CREATE DATABASE statement is used to create a database in Hive. The keywords DATABASE and SCHEMA are interchangeable; we can use either one.
Syntax:
CREATE (DATABASE|SCHEMA) [IF NOT EXISTS] database_name
[COMMENT database_comment]
[LOCATION hdfs_path]
[WITH DBPROPERTIES (property_name=property_value, ...)];
DDL CREATE DATABASE Example:
Here in this example, we are creating a database 'dataflair'.
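The example screenshot did not survive extraction; a minimal sketch of the command it illustrated (the COMMENT value here is an illustrative assumption):

```sql
CREATE DATABASE IF NOT EXISTS dataflair
COMMENT 'Database for the examples in these notes';
```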
2. SHOW DATABASES in Hive
The SHOW DATABASES statement lists all the databases present in Hive.
Syntax:
SHOW (DATABASES|SCHEMAS);
DDL SHOW DATABASES Example:
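A sketch of the command and the kind of output it produces (the sample listing is illustrative; 'default' always exists in a fresh Hive installation):

```sql
SHOW DATABASES;
-- +----------------+
-- | database_name  |
-- +----------------+
-- | dataflair      |
-- | default        |
-- +----------------+
```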
3. DESCRIBE DATABASE in Hive
The DESCRIBE DATABASE statement in Hive shows the name of the database, its comment (if set), and its location on the file system.
The EXTENDED keyword can be used to also get the database properties.
Syntax:
DESCRIBE (DATABASE|SCHEMA) [EXTENDED] db_name;
DDL DESCRIBE DATABASE Example:
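A sketch of both forms against the 'dataflair' database from the earlier example:

```sql
DESCRIBE DATABASE dataflair;
-- Shows the database name, comment, and HDFS location.
DESCRIBE DATABASE EXTENDED dataflair;
-- Additionally shows the DBPROPERTIES key-value pairs, if any were set.
```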
4. USE DATABASE in Hive
The USE statement in Hive is used to select the specific database for a session, on which all subsequent HiveQL statements will be executed.
Syntax:
USE database_name;
DDL USE DATABASE Example:
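A sketch using the 'dataflair' database from the earlier example:

```sql
USE dataflair;
-- All subsequent statements in this session now run against 'dataflair'.
```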
5. DROP DATABASE in Hive
The DROP DATABASE statement in Hive is used to drop (delete) the database.
Syntax:
DROP (DATABASE|SCHEMA) [IF EXISTS] database_name [RESTRICT|CASCADE];
DDL DROP DATABASE Example:
Here in this example, we are dropping the database 'dataflair' using the DROP statement.
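A sketch of the command; the choice between RESTRICT and CASCADE is worth noting:

```sql
-- RESTRICT (the default) fails if the database still contains tables;
-- CASCADE drops the contained tables first, then the database.
DROP DATABASE IF EXISTS dataflair CASCADE;
```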
6. ALTER DATABASE in Hive
The ALTER DATABASE statement in Hive is used to change the metadata associated with a database.
Syntax for changing Database Properties:
ALTER (DATABASE|SCHEMA) database_name SET DBPROPERTIES (property_name=property_value, ...);
DDL ALTER DATABASE properties Example:
In this example, we are setting the database properties of the 'dataflair' database after its creation by using the ALTER command.
Syntax for changing Database owner:
ALTER (DATABASE|SCHEMA) database_name SET OWNER [USER|ROLE] user_or_role;
DDL ALTER DATABASE owner Example:
In this example, we are changing the owner role of the 'dataflair' database using the ALTER statement.
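A sketch of both ALTER DATABASE forms (the property value and the role name are illustrative assumptions):

```sql
-- Setting a database property:
ALTER DATABASE dataflair SET DBPROPERTIES ('edited-by' = 'dataflair');
-- Changing the owner to a role:
ALTER DATABASE dataflair SET OWNER ROLE admin;
```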
Syntax for changing Database Location:
ALTER (DATABASE|SCHEMA) database_name SET LOCATION hdfs_path;
Note: The ALTER DATABASE ... SET LOCATION statement does not move the database's current directory contents to the newly specified location. It does not change the locations associated with any tables or partitions under the specified database. Instead, it changes the default parent directory, where new tables will be added for this database.
No other metadata associated with the database can be changed.

DDL Commands on Tables in Hive
1. CREATE TABLE in Hive
The CREATE TABLE statement in Hive is used to create a table with the given name. If a table or view already exists with the same name, an error is thrown. We can use IF NOT EXISTS to skip the error.
Syntax:
CREATE TABLE [IF NOT EXISTS] [db_name.]table_name
[(col_name data_type [COMMENT col_comment], ... [COMMENT col_comment])]
[COMMENT table_comment]
[ROW FORMAT row_format]
[STORED AS file_format]
[LOCATION hdfs_path];
DDL CREATE TABLE Example:
In this example, we are creating a table 'Employee' in the 'dataflair' database.
ROW FORMAT DELIMITED tells Hive that when it finds a newline character, that means a new record.
FIELDS TERMINATED BY ',' tells Hive what delimiter we are using in our files to separate each column.
STORED AS TEXTFILE tells Hive what type of file to expect.
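The screenshot of the statement did not survive extraction; a sketch that matches the clauses described above (the column list is an assumption):

```sql
CREATE TABLE IF NOT EXISTS dataflair.Employee (
  Emp_Id   INT,
  Emp_Name STRING,
  Salary   BIGINT
)
ROW FORMAT DELIMITED           -- newline separates records
FIELDS TERMINATED BY ','       -- comma separates columns in the data files
STORED AS TEXTFILE;            -- the underlying files are plain text
```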
2. SHOW TABLES in Hive
The SHOW TABLES statement in Hive lists all the base tables and views in the current database.
Syntax:
SHOW TABLES [IN database_name];
DDL SHOW TABLES Example:
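A sketch against the 'dataflair' database used throughout this section:

```sql
SHOW TABLES IN dataflair;
-- Lists the base tables and views in 'dataflair', e.g. the Employee table
-- created above.
```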
3. DESCRIBE TABLE in Hive
The DESCRIBE statement in Hive shows the list of columns for the specified table.
Syntax:
DESCRIBE [EXTENDED|FORMATTED] [db_name.]table_name[.col_name ([.field_name])];
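A sketch of the two common forms against the Employee table from the earlier example:

```sql
DESCRIBE dataflair.Employee;           -- column names and types
DESCRIBE FORMATTED dataflair.Employee; -- plus location, owner, and table parameters
```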
4. DROP TABLE in Hive
The DROP TABLE statement in Hive deletes the data for a particular table and removes all metadata associated with it from the Hive metastore.
If PURGE is not specified, then the data is actually moved to the .Trash/current directory. If PURGE is specified, then the data is lost completely.
Syntax:
DROP TABLE [IF EXISTS] table_name [PURGE];
DDL DROP TABLE Example:
In the below example, we are deleting the 'employee' table.
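A sketch of the command described above:

```sql
-- Without PURGE the data goes to the .Trash/current directory;
-- with PURGE it is deleted permanently.
DROP TABLE IF EXISTS employee PURGE;
```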
5. ALTER TABLE in Hive
The ALTER TABLE statement in Hive enables you to change the structure of an existing table. Using the ALTER TABLE statement we can rename the table, add columns to the table, change the table properties, etc.
Syntax to Rename a table:
ALTER TABLE table_name RENAME TO new_table_name;
DDL ALTER TABLE name Example:
In this example, we are renaming the 'Employee' table to 'Comp_Emp' using the ALTER statement.
0: jdbc:hive2://localhost:10000> ALTER TABLE Employee RENAME TO Comp_Emp;
INFO  : Compiling command(queryId=dataflair_20200206174836_ddaea397-57cc-407b-9bc5-0db197fa9ad4): ALTER TABLE Employee RENAME TO Comp_Emp
INFO  : Concurrency mode is disabled, not creating a lock manager
INFO  : Semantic Analysis Completed (retrial = false)
INFO  : Returning Hive schema: Schema(fieldSchemas:null, properties:null)
INFO  : Completed compiling command(queryId=dataflair_20200206174836_ddaea397-57cc-407b-9bc5-0db197fa9ad4); Time taken: 0.024 seconds
INFO  : Concurrency mode is disabled, not creating a lock manager
INFO  : Executing command(queryId=dataflair_20200206174836_ddaea397-57cc-407b-9bc5-0db197fa9ad4): ALTER TABLE Employee RENAME TO Comp_Emp
INFO  : Starting task [Stage-0:DDL] in serial mode
INFO  : Completed executing command(queryId=dataflair_20200206174836_ddaea397-57cc-407b-9bc5-0db197fa9ad4); Time taken: 0.129 seconds
INFO  : OK
INFO  : Concurrency mode is disabled, not creating a lock manager
No rows affected (0.167 seconds)
Syntax to Add columns to a table:
ALTER TABLE table_name ADD COLUMNS (column1 data_type, column2 data_type);
0: jdbc:hive2://localhost:10000> ALTER TABLE Comp_Emp ADD COLUMNS (Emp_DOB STRING, Emp_Contact STRING);
INFO  : Compiling command(queryId=dataflair_20200206175634_fcd1d6b9-749f-4066-a5bf-ced8d070e13b): ALTER TABLE Comp_Emp ADD COLUMNS (Emp_DOB STRING, Emp_Contact STRING)
INFO  : Concurrency mode is disabled, not creating a lock manager
INFO  : Semantic Analysis Completed (retrial = false)
INFO  : Returning Hive schema: Schema(fieldSchemas:null, properties:null)
INFO  : Completed compiling command(queryId=dataflair_20200206175634_fcd1d6b9-749f-4066-a5bf-ced8d070e13b); Time taken: 0.02 seconds
INFO  : Concurrency mode is disabled, not creating a lock manager
INFO  : Executing command(queryId=dataflair_20200206175634_fcd1d6b9-749f-4066-a5bf-ced8d070e13b): ALTER TABLE Comp_Emp ADD COLUMNS (Emp_DOB STRING, Emp_Contact STRING)
INFO  : Starting task [Stage-0:DDL] in serial mode
INFO  : Completed executing command(queryId=dataflair_20200206175634_fcd1d6b9-749f-4066-a5bf-ced8d070e13b); Time taken: 0.07 seconds
INFO  : OK
INFO  : Concurrency mode is disabled, not creating a lock manager
No rows affected (0.099 seconds)
Syntax to set table properties:
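The syntax line itself did not survive extraction; the standard Hive form, with an illustrative property value on the Comp_Emp table from the examples above, is:

```sql
-- General form:
-- ALTER TABLE table_name SET TBLPROPERTIES (property_name = property_value, ...);
ALTER TABLE Comp_Emp SET TBLPROPERTIES ('comment' = 'Employee details table');
```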
6. TRUNCATE TABLE in Hive
The TRUNCATE TABLE statement in Hive removes all the rows from a table or partition.
Syntax:
TRUNCATE TABLE table_name;
DDL TRUNCATE TABLE Example:
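A sketch against the Comp_Emp table from the examples above:

```sql
-- Removes all rows but keeps the table definition (metadata) intact.
TRUNCATE TABLE Comp_Emp;
```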
SET 1:
1. Discuss the following in detail:
   a. Conventional challenges in big data
   b. Nature of Data
2. Describe the steps involved in support vector based inference methodology.
3. Write a short note on Bayesian inference methodology.
4. Define the different inferences in big data analytics.
5. Describe bootstrapping and its importance.
6. What is sampling and sampling distribution? Give a detailed analysis.
7. Define the following:
   a. Intelligent Data Analytics
   b. Analysis vs Reporting
8. Describe the prediction error and regression techniques.
9. Describe any five characteristics of Big Data.
SET 2:
1. a) What is a data stream?
   b) Discuss 14 insights of Infosphere in data stream.
2. Explain the different applications of data streams in detail.
3. Explain the stream model and Data Stream Management System architecture.
4. Explain how to count ones in a window using the DGIM algorithm.
5. Write a short note on the following:
   (i) Counting distinct elements in a stream.
   (ii) Finding the most popular elements using a decaying window.
6. What are filters in Big Data? Explain the Bloom Filter with an example.
8. Explain the following: a. FM algorithm and its applications b. AMS algorithm and its applications
9. What is Real Time Analytics? Discuss their technologies in detail.
SET 3:
1. a) What is Hadoop? Explain its components.
   b) How do you analyze the data in Hadoop?
2. Explain the following:
   a. Mapper class
   b. Reducer class
   c. Scaling out
3. Explain the failures in MapReduce.
4. Explain the MapReduce data flow with single reduce and multiple reduces.
5. How is Hadoop streaming suited to text processing? Explain.
6. Define HDFS. Describe namenode, datanode and block. Explain HDFS operations in detail.
7. Write in detail the concept of developing a MapReduce application.
8. How does a MapReduce job work with the classic Java stream?
9. Explain how MapReduce jobs run on YARN.
SET 4:
1. a) What is a Cluster? Explain setting up a Hadoop cluster.
   b) What are the different types of Hadoop configuration files? Discuss.
2. What are control scripts? Explain the start-dfs.sh and start-mapred.sh scripts.
3. What are the additional configuration properties to set for HDFS?
4. Explain the three-step Kerberos ticket exchange protocol.
5. What is benchmarking? How does it work in Hadoop?
6. How will you define commissioning new nodes and decommissioning old nodes?
7. Write the steps for upgrading HDFS.
8. Discuss administering Hadoop with its checkpointing process diagram.
9. How to run a proxy and run a MapReduce job.
10. Explain the Datanode directory structure.
11. Set log levels and its metrics.
13. How is security done in Hadoop? Justify.
SET 5:
1. a) What is PIG? Explain its installation process.
   b) Explain the two execution types or modes in PIG.
2. Explain Grouping, Join, CoGroup, Cross & Group in data.
3. Explain the process of installing HIVE & the features of HIVE.
4. How will you query the data in HIVE?
5. Give a detailed note on HBASE.
6. What is Zookeeper? Explain its features with applications.
7. Explain in detail IBM Infosphere BigInsights and Streams.
8. Discuss the visual data analysis techniques in detail.
9. Give a detailed note on interaction techniques with their applications.
10. What is HiveQL? Explain its features.