
UNIT I

INTRODUCTION TO BIG DATA: Introduction – distributed file system – Big Data and its importance, Four V's in big data, Drivers for Big Data, Big Data analytics, Big Data applications. Algorithms using MapReduce, Matrix-Vector Multiplication by MapReduce.
UNIT II
INTRODUCTION TO HADOOP: Big Data – Apache Hadoop & Hadoop Ecosystem – Moving Data in and out of Hadoop – Understanding inputs and outputs of MapReduce – Data Serialization.
UNIT- III
HADOOP ARCHITECTURE: Hadoop Architecture, Hadoop Storage: HDFS, Common Hadoop Shell commands, Anatomy of File Write and Read, NameNode, Secondary NameNode, and DataNode, Hadoop MapReduce paradigm, Map and Reduce tasks, Job, Task Trackers – Cluster Setup – SSH & Hadoop Configuration – HDFS Administering – Monitoring & Maintenance.
UNIT-IV
HIVE AND HIVEQL, HBASE: Hive Architecture and Installation, Comparison with Traditional Database, HiveQL – Querying Data – Sorting and Aggregating, MapReduce Scripts, Joins & Subqueries.
UNIT-V
HBase concepts – Advanced Usage, Schema Design, Advanced Indexing – Zookeeper – how it helps in monitoring a cluster, how HBase uses Zookeeper, and how to build applications with Zookeeper.



Andhra Pradesh State Council of Higher Education
B.Sc. Computer Science/Information Technology (IT) Syllabus Under CBCS
w.e.f. 2015

BIG DATA TECHNOLOGY


Question Bank
1. What is a distributed file system? Explain the significance of the four V's in Big Data.
2. Explain briefly about Big Data analytics.
3. Explain briefly about Big Data applications.
4. What is Big Data? Explain the characteristics and features of Apache Hadoop.
5. Explain how we move data in and out of Hadoop.
6. Write about MapReduce and data serialization.
7. Explain briefly about Hadoop architecture.
8. Explain Hadoop shell commands.
9. Write about HDFS administration, monitoring and maintenance.
10. Explain Hive architecture and installation.
11. Compare a traditional database with Hive.
12. Explain sorting and aggregation in Hive.
13. Explain the concepts of HBase. Write its uses.
14. How is schema design done in HBase?
15. How is Zookeeper used in monitoring a cluster?

Big Data Technology


Model Question Paper
Answer any five questions.                                5 X 15 = 75
1. What is a distributed file system? Explain the significance of the four V's in Big Data.
2. Explain briefly about Big Data analytics.
3. What is Big Data? Explain the characteristics and features of Apache Hadoop.
4. Explain how we move data in and out of Hadoop.
5. Explain briefly about Hadoop architecture.
6. Explain Hadoop shell commands.
7. Explain Hive architecture and installation.
8. Compare a traditional database with Hive.
9. Explain the concepts of HBase. Write its uses.
10. How is schema design done in HBase?



1. What is Big Data?

In a digital world, data is being generated at a very high rate because of the increasing use of the internet, sensors and heavy machines. The sheer volume, variety, velocity and veracity of such data is signified by the term "BIG DATA".
Examples: rolling web log data, network and system logs, and click information. What is considered "big data" varies depending on the capabilities of the organization managing the data set, and on the capabilities of the applications traditionally used to process and analyse the data set in its domain. Big data is when the data itself becomes part of the problem.

Evolution of Big Data:

There are some major milestones in the evolution of big data:

1940s: Storage of information was very limited.

1960s: Work on automatic data compression was published. The explosion of information in the preceding years made it necessary to minimise the storage required for information.
1970s: Information flow began to be tracked in order to measure the volume of information circulating in the country.
1980s: Research projects were started to measure the volume of information in bits.
1990s: Digital storage systems became more economical than paper storage.
2000s: From 2000 onwards, various methods were introduced to streamline information techniques for controlling the volume, velocity and variety of data, thus introducing 3-D data management.

Hadoop was created by Doug Cutting and Mike Cafarella in 2005. It was originally developed
to support distribution for the Nutch search engine project. Doug, who was working at
Yahoo! at the time and is now Chief Architect of Cloudera, named the project after his son's
toy elephant.
2. Explain Distributed File System?
A distributed file system (DFS) is a file system with data stored on a server. The data is
accessed and processed as if it was stored on the local client machine. The DFS makes it
convenient to share information and files among users on a network in a controlled and
authorized way.
Example
The following scenario explains how much time is required to read 1 terabyte (TB) of data using one machine with 4 I/O channels, each channel having a capacity of 100 MB per second.

While Reading 1TB data…

The distributed file system provides reliability and availability based on the replication of the data. Replication makes sure that the data is stored in multiple locations, so that even if there is an issue with a particular copy, the process is not disturbed because the same file is also available on another machine. Distributed environments follow the logic of a single system image, where all the systems present in the distributed environment have the same characteristics.

A normal environment requires about 45 minutes to perform the read operation on 1 TB of data; with a distributed environment the same task takes only about 4.5 minutes. The same performance benefit can be achieved throughout all operations, such as searching and other key operations.
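As a rough back-of-the-envelope check on these figures (assuming 1 TB ≈ 1,000,000 MB and that each of the 4 channels sustains 100 MB per second in parallel):

\[
\frac{1{,}000{,}000\ \text{MB}}{4 \times 100\ \text{MB/s}} = 2500\ \text{s} \approx 42\ \text{minutes}
\]

and if ten such machines read the data in parallel in a distributed setup, the time drops to roughly \(2500/10 = 250\ \text{s} \approx 4.2\) minutes, which matches the 45-minute and 4.5-minute figures quoted above.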
Most Important Question

Q: What are the characteristics of 'Big Data', or the V's of Big Data?

(i) Volume – We already know that Big Data indicates huge 'volumes' of data being generated on a daily basis from various sources like social media platforms, business processes, machines, networks, and human interactions.

Facebook Example:

 As of 2011, there are 500,000,000 active Facebook users – roughly 1 in every 13 people on earth. Half of them are logged in on any given day.
 A record-breaking 750 million photos were uploaded to Facebook over New Year's weekend.
 There are 206.2 million internet users in the US. That means 71.2% of the US web audience is on Facebook.



Twitter Example:
 Twitter has over 500 million registered users.
 The USA leads with 141.8 million accounts, which represent 27.4 percent of all Twitter users – well ahead of Brazil, Japan, the UK and Indonesia.
 79% of US Twitter users are more likely to recommend brands they follow.
 67% of US Twitter users are more likely to buy from brands they follow.
 57% of all companies that use social media for business use Twitter

(ii)Variety – The next aspect of 'Big Data' is its variety.

Variety of Big Data refers to structured, unstructured, and semi-structured data that is gathered from multiple sources. While in the past data could only be collected from spreadsheets and databases, today data comes in an array of forms such as emails, PDFs, photos, videos, audio files, social media posts, and much more.

(iii)Velocity – The term 'velocity' refers to the speed of generation of data.


How fast the data is generated and processed to meet demand determines the real potential of the data.

(iv)Veracity – This refers to the inconsistency which can be shown by the


data at times, thus hampering the process of being able to handle and
manage the data effectively.
Ex: minimum, maximum, mean, standard deviation.

v) validity: It refers to the correctness of the data.

vi) Volatility: It refers to how long the data is valid. Data that is valid now may become invalid after a few minutes or a few days.

Q: Explain the types (variety) of Big Data?

Big data can be found in three forms:

1. Structured
2. Unstructured
3. Semi-structured

Structured

By structured data, we mean data that can be processed, stored, and retrieved in a fixed
format. It refers to highly organized information that can be readily and seamlessly stored
and accessed from a database by simple search engine algorithms. For instance, the
employee table in a company database will be structured as the employee details, their
job positions, their salaries, etc., will be present in an organized manner.



Unstructured

Any data with unknown form or the structure is classified as unstructured data. This makes it
very difficult and time-consuming to process and analyze unstructured data. A typical example of unstructured data is a heterogeneous data source containing a combination of simple text files, images, videos, etc.

Semi-structured

Semi-structured data pertains to data containing both of the formats mentioned above, that is, structured and unstructured data. An example of semi-structured data is data represented in an XML file.

Personal data stored in an XML file:

<rec><name>Prashant Rao</name><sex>Male</sex><age>35</age></rec>
<rec><name>Seema R.</name><sex>Female</sex><age>41</age></rec>

Q) Explain Big Data Applications or Big Data Analytics ?

GOVERNMENT

Big data analytics has proven to be very useful in the government sector. Big data analysis played a large role in Barack Obama's successful 2012 re-election campaign. Most recently, big data analysis was also a major factor in the BJP and its allies winning the 2014 Indian General Election. The Indian Government utilizes numerous techniques to ascertain how the Indian electorate is responding to government action, as well as ideas for policy augmentation.

SOCIAL MEDIA ANALYTICS

The advent of social media has led to an outburst of big data. Various solutions have been built to analyze social media activity; for example, IBM's Cognos Consumer Insights, a point solution running on IBM's BigInsights Big Data platform, can make sense of the chatter. Social media can provide valuable real-time insights into how the market is responding to products and campaigns. With the help of these insights, companies can adjust their pricing, promotion, and campaign placements accordingly. Before utilizing big data, some preprocessing needs to be done on it in order to derive intelligent and valuable results. Thus, to understand the consumer mindset, the application of intelligent decisions derived from big data is necessary.

TECHNOLOGY

The technological applications of big data comprise the following companies, which deal with huge amounts of data every day and put them to use for business decisions as well. For example, eBay.com uses two data warehouses at 7.5 petabytes and 40 PB, as well as a 40 PB Hadoop cluster, for search, consumer recommendations, and merchandising. Amazon.com handles millions of back-end operations every day, as well as queries from more than half a million third-party sellers. The core technology that keeps Amazon running is Linux-based and, as of 2005, they had the world's three largest Linux databases, with capacities of 7.8 TB, 18.5 TB, and 24.7 TB. Facebook handles 50 billion photos from its user base. Windermere Real Estate uses anonymous GPS signals from nearly 100 million drivers to help new home buyers determine their typical drive times to and from work throughout various times of the day.

CALL CENTER ANALYTICS

Now we turn to the customer-facing Big Data application examples, of which call center
analytics are particularly powerful. What’s going on in a customer’s call center is often a
great barometer and influencer of market sentiment, but without a Big Data solution, much
of the insight that a call center can provide will be overlooked or discovered too late. Big
Data solutions can help identify recurring problems or customer and staff behavior patterns
on the fly not only by making sense of time/quality resolution metrics but also by capturing
and processing call content itself.

BANKING

The use of customer data invariably raises privacy issues. By uncovering hidden connections
between seemingly unrelated pieces of data, big data analytics could potentially reveal
sensitive personal information. Research indicates that 62% of bankers are cautious in their
use of big data due to privacy issues. Further, outsourcing of data analysis activities or distribution of customer data across departments for the generation of richer insights also amplifies security risks – for example, when customers' earnings, savings, mortgages, and insurance policies end up in the wrong hands. Such incidents reinforce concerns about data privacy and discourage customers from sharing personal information in exchange for customized offers.

AGRICULTURE

A biotechnology firm uses sensor data to optimize crop efficiency. It plants test crops and
runs simulations to measure how plants react to various changes in condition. Its data
environment constantly adjusts to changes in the attributes of various data it collects,



including temperature, water levels, soil composition, growth, output, and gene sequencing
of each plant in the test bed. These simulations allow it to discover the optimal
environmental conditions for specific gene types.

MARKETING

Marketers have begun to use facial recognition software to learn how well their advertising
succeeds or fails at stimulating interest in their products. A recent study published in the
Harvard Business Review looked at what kinds of advertisements compelled viewers to
continue watching and what turned viewers off. Among their tools was “a system that
analyses facial expressions to reveal what viewers are feeling.” The research was designed
to discover what kinds of promotions induced watchers to share the ads with their social
network, helping marketers create ads most likely to “go viral” and improve sales.

SMART PHONES

Perhaps more impressive, people now carry facial recognition technology in their pockets.
Users of iPhone and Android smartphones have applications at their fingertips that use facial recognition technology for various tasks. For example, Android users with the Remember app can snap a photo of someone, then bring up stored information about that person based on their image when their own memory lets them down – a potential boon for salespeople.

TELECOM

Nowadays big data is used in many different fields, and in telecom it also plays a very important role. Operators face an uphill challenge when they need to deliver new, compelling, revenue-generating services without overloading their networks and while keeping their running costs under control. The market demands a new set of data management and analysis capabilities that can help service providers make accurate decisions by taking into account customer, network context and other critical aspects of their businesses. Most of these decisions must be made in real time, placing additional pressure on the operators. Real-time predictive analytics can help leverage the data that resides in their multitude of systems, make it immediately accessible and help correlate that data to generate insight that can help them drive their business forward.

HEALTHCARE

Traditionally, the healthcare industry has lagged behind other industries in the use of big data. Part of the problem stems from resistance to change: providers are accustomed to making treatment decisions independently, using their own clinical judgment, rather than relying on protocols based on big data. Other obstacles are more structural in nature. This is one of the best places to set an example for Big Data applications. Even within a single hospital, payor, or pharmaceutical company, important information often remains siloed within one group or department because organizations lack procedures for integrating data and communicating findings.

Health care stakeholders now have access to promising new threads of knowledge. This information is a form of "big data," so called not only for its sheer volume but for its complexity, diversity, and timeliness. Pharmaceutical industry experts, payers, and providers are now beginning to analyze big data to obtain insights. Recent technological advances in the industry have improved their ability to work with such data, even though the files are enormous and often have different database structures and technical characteristics.

Q) DRIVERS OF BIG DATA ?

While most definitions of Big Data focus on the new forms of unstructured data flowing
through businesses with new levels of “volume, velocity, variety, and complexity”, I tend to
answer the question using a simple equation:

Big Data = Transactions + Interactions + Observations


The following graphic illustrates what I mean:

7 KEY DRIVERS BEHIND THE BIG DATA MARKET

Business
1. Opportunity to enable innovative new business models
2. Potential for new insights that drive competitive advantage
Technical
1. Data collected and stored continues to grow exponentially
2. Data is increasingly everywhere and in many formats
3. Traditional solutions are failing under new requirements
Financial
1. Cost of data systems, as a percentage of IT spend, continues to grow
2. Cost advantages of commodity hardware & open source software



Q ) Explain Big data Analytics ?

Big data analytics refers to the strategy of analyzing large volumes of data, or big data. This
big data is gathered from a wide variety of sources, including social networks, videos, digital
images, sensors, and sales transaction records. The aim in analyzing all this data is to
uncover patterns and connections that might otherwise be invisible, and make superior
business decisions.

Prescriptive

These analytics reveal what kind of actions should be taken and determine future rules and regulations. They are quite valuable since they allow business owners to answer specific queries. Take the bariatric health care industry, for example. Prescriptive analytics can be used to measure how many patients in a population are morbidly obese. That number can then be filtered further by adding categories such as diabetes and LDL cholesterol levels to determine the exact treatment. Some companies also use this data analysis to forecast sales leads, social media, CRM data, etc.

Diagnostic

These analytics analyze past data to determine why certain incidents happened. Say you end up with an unsuccessful social media campaign; using a diagnostic big data analysis you can examine the number of posts that were put up, followers, fans, page views/reviews, pins, etc., allowing you to sift the grain from the chaff, so to speak. In other words, you can distill literally thousands of data points into a single view to see what worked and what didn't, thus saving time and resources.

Descriptive

This phase is based on present processes and incoming data. Such analysis can help you
determine valuable patterns that can offer critical insights into important processes. For
instance, it can help you assess credit risk, review old financial performance to determine
how a customer might pay in the future and even categorize your clientele according to
their preferences and sales cycle. Mining descriptive analytics involves the usage of a
dashboard or simple email reports.

Predictive Analytics

These analytics involve the extraction of current data sets to help users determine upcoming trends and outcomes with ease. They cannot tell us exactly what will happen in the future, but rather what a business owner can expect along with different scenarios. In other words, predictive analysis is an enabler of big data in that it amasses an enormous amount of data, such as customer information, historical data and customer insight, in order to predict future scenarios.



Q) Explain MapReduce – Algorithm ?
The MapReduce algorithm contains two important tasks, namely Map and Reduce.

 The map task is done by means of Mapper Class


 The reduce task is done by means of Reducer Class.
Mapper class takes the input, tokenizes it, maps and sorts it. The output of Mapper class is
used as input by Reducer class, which in turn searches matching pairs and reduces them.

MapReduce implements various mathematical algorithms to divide a task into small parts
and assign them to multiple systems. In technical terms, MapReduce algorithm helps in
sending the Map & Reduce tasks to appropriate servers in a cluster.
These mathematical algorithms may include the following −

 Sorting
 Searching
Sorting
Sorting is one of the basic MapReduce algorithms to process and analyze data. MapReduce
implements sorting algorithm to automatically sort the output key-value pairs from the
mapper by their keys.
 Sorting methods are implemented in the mapper class itself.
 In the Shuffle and Sort phase, after tokenizing the values in the mapper class,
the Context class (user-defined class) collects the matching valued keys as a
collection.
 To collect similar key-value pairs (intermediate keys), the Mapper class takes the
help of RawComparator class to sort the key-value pairs.
 The set of intermediate key-value pairs for a given Reducer is automatically sorted
by Hadoop to form key-values (K2, {V2, V2, …}) before they are presented to the
Reducer.
Searching
Searching plays an important role in MapReduce algorithm. It helps in the combiner phase
(optional) and in the Reducer phase. Let us try to understand how Searching works with
the help of an example.



Example
The following example shows how MapReduce employs Searching algorithm to find out the
details of the employee who draws the highest salary in a given employee dataset.

 Let us assume we have employee data in four different files − A, B, C, and D. Let us also assume there are duplicate employee records in all four files because of importing the employee data from all database tables repeatedly. See the following illustration.

 The Map phase processes each input file and provides the employee data in key-
value pairs (<k, v> : <emp name, salary>). See the following illustration.

 The combiner phase (searching technique) will accept the input from the Map
phase as a key-value pair with employee name and salary. Using searching
technique, the combiner will check all the employee salary to find the highest
salaried employee in each file. See the following snippet.

<k: employee name, v: salary>

// Treat the salary of the first employee as the current maximum.
Max = salary of the first employee;

// For every subsequent employee record:
if (v(next employee).salary > Max) {
    Max = v(next employee).salary;
} else {
    continue checking;
}

The expected result is as follows −

<satish, 26000>   <gopal, 50000>   <kiran, 45000>   <manisha, 45000>

 Reducer phase − From each file, you will find the highest salaried employee. To avoid redundancy, check all the <k, v> pairs and eliminate duplicate entries, if any. The same algorithm is used between the four <k, v> pairs coming from the four input files. The final output should be as follows −

<gopal, 50000>
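A minimal Java sketch (not from the notes) of how the searching logic described above could be written, assuming the mapper emits a single common key and values of the form "name<TAB>salary":

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class MaxSalaryReducer extends Reducer<Text, Text, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        String maxName = null;
        int maxSalary = Integer.MIN_VALUE;
        // Scan every "name<TAB>salary" value and keep the highest salary seen so far.
        for (Text value : values) {
            String[] parts = value.toString().split("\t");
            int salary = Integer.parseInt(parts[1]);
            if (salary > maxSalary) {
                maxSalary = salary;
                maxName = parts[0];
            }
        }
        context.write(new Text(maxName), new IntWritable(maxSalary));
    }
}

A combiner performing the same local scan in each map task could also be added, provided it emits its result in the same "name<TAB>salary" Text form so that this reducer can consume it.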

Q) Explain features of Apache Hadoop ?

1. Open Source
Apache Hadoop is an open source project. It means its code can be modified
according to business requirements.
2. Distributed Processing
As data is stored in a distributed manner in HDFS across the cluster, data is
processed in parallel on a cluster of nodes.
3. Fault Tolerance
This is one of the most important features of Hadoop. By default, 3 replicas of each block are stored across the cluster in Hadoop, and this can also be changed as per requirement. So if any node goes down, data on that node can easily be recovered from other nodes with the help of this characteristic. Failures of nodes or tasks are recovered automatically by the framework. This is how Hadoop is fault tolerant.
4. Reliability
Due to replication of data in the cluster, data is reliably stored on the cluster of machines despite machine failures. If your machine goes down, your data will still be stored reliably due to this characteristic of Hadoop.
5. High Availability
Data is highly available and accessible despite hardware failure due to multiple copies of data. If a machine or a piece of hardware crashes, then data will be accessed from another path.
6. Scalability
Hadoop is highly scalable in the way new hardware can be easily added to the nodes.
This feature of Hadoop also provides horizontal scalability which means new nodes
can be added on the fly without any downtime.
7. Economic
Apache Hadoop is not very expensive as it runs on a cluster of commodity hardware. We do not need any specialized machine for it. Hadoop also provides huge cost savings, as it is very easy to add more nodes on the fly. So if requirements increase, you can increase nodes as well without any downtime and without requiring much pre-planning.

8. Easy to use
There is no need for the client to deal with distributed computing; the framework takes care of all of it. So Hadoop is easy to use.
9. Data Locality
This is a unique feature of Hadoop that makes it easy to handle Big Data. Hadoop works on the data locality principle, which states: move computation to the data instead of data to the computation. When a client submits a MapReduce algorithm, the algorithm is moved to the data in the cluster rather than bringing data to the location where the algorithm is submitted and then processing it.

Q ) Explain about Apache Hadoop ecosystem components

HDFS:



HDFS stands for Hadoop Distributed File System, used for managing big data sets with high Volume, Velocity and Variety. HDFS implements a master-slave architecture: the master is the NameNode and the slaves are the DataNodes.

Features:

• Scalable
• Reliable
• Commodity Hardware

HDFS is well known for Big Data storage.

Map Reduce:

Map Reduce is a programming model designed to process high-volume distributed data. The platform is built using Java for better exception handling. Map Reduce includes two daemons, the Job Tracker and the Task Tracker.

Features:

• Functional Programming.
• Works very well on Big Data.
• Can process large datasets.
Map Reduce is the main component known for processing big data.

YARN:

YARN stands for Yet Another Resource Negotiator. It is also called as MapReduce 2(MRv2).
The two major functionalities of Job Tracker in MRv1, resource management and job
scheduling/ monitoring are split into separate daemons which are ResourceManager,
NodeManager and ApplicationMaster.

Features:

• Better resource management.


• Scalability
• Dynamic allocation of cluster resources.

Data Access :

Pig:

Apache Pig is a high-level language built on top of MapReduce for analyzing large datasets with simple ad hoc data analysis programs. Pig is also known as a Data Flow language. It is very well integrated with Python. It was initially developed by Yahoo.

Salient features of pig:



• Ease of programming
• Optimization opportunities
• Extensibility.

Pig scripts are internally converted to MapReduce programs.

Hive:

Apache Hive is another high-level query language and data warehouse infrastructure built on top of Hadoop for providing data summarization, query and analysis. It was initially developed by Facebook and later made open source.

Salient features of hive:

• SQL like query language called HQL.


• Partitioning and bucketing for faster data processing.
• Integration with visualization tools like Tableau.
Hive queries are internally converted to MapReduce programs.

If you want to become a big data analyst, these two high level languages are a must know!!

Data Storage:

Hbase:

Apache HBase is a NoSQL database built for hosting large tables with
billions of rows and millions of columns on top of Hadoop commodity
hardware machines. Use Apache Hbase when you need random, realtime
read/write access to your Big Data.

Features:

• Strictly consistent reads and writes. In memory operations.


• Easy to use Java API for client access.
• Well integrated with pig, hive and sqoop.
• Is a consistent and partition tolerant system in CAP theorem.

Cassandra:

Cassandra is a NoSQL database designed for linear scalability and high availability. Cassandra
is based on key-value model. Developed by Facebook and known for faster response to
queries.

Features:

• Column indexes
• Support for de-normalization
• Materialized views
• Powerful built-in caching.



Hcatalog:

HCatalog is a table management layer which provides integration of hive metadata for other
Hadoop applications. It enables users with different data processing tools like Apache pig,
Apache MapReduce and Apache Hive to more easily read and write data.

Features:

• Tabular view for different formats.


• Notifications of data availability.
• REST API’s for external systems to access metadata.

Lucene:

Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform.

Features:

• Scalable, High – Performance indexing.


• Powerful, Accurate and Efficient search algorithms.
• Cross-platform solution.

Hama:

Apache Hama is a distributed framework based on Bulk Synchronous Parallel (BSP) computing. It is capable of and well known for massive scientific computations such as matrix, graph and network algorithms.

Features:

• Simple programming model


• Well suited for iterative algorithms
• YARN supported
• Collaborative filtering unsupervised machine learning.
• K-Means clustering.

Crunch:

Apache Crunch is built for pipelining MapReduce programs in a way that is simple and efficient. This framework is used for writing, testing and running MapReduce pipelines.

Features:

• Developer focused.
• Minimal abstractions
• Flexible data model.



Data Serialization:

Avro:

Apache Avro is a data serialization framework which is language neutral.


Designed for language portability, allowing data to potentially outlive the language used to read and write it.

Apache Sqoop:

Apache Sqoop is a tool designed for bulk data transfers between


relational databases and Hadoop.

Features:

• Import and export to and from HDFS.


• Import and export to and from Hive.
• Import and export to HBase.

Apache Flume:

Flume is a distributed, reliable, and available service for efficiently


collecting, aggregating, and moving large amounts of log data.

Features:

• Robust
• Fault tolerant
• Simple and flexible Architecture based on streaming data flows.

Apache Ambari:

Ambari is designed to make Hadoop management simpler by providing an interface for provisioning, managing and monitoring Apache Hadoop clusters.

Features:

• Provision a Hadoop Cluster.


• Manage a Hadoop Cluster.
• Monitor a Hadoop Cluster.

Apache Zookeeper:



Zookeeper is a centralized service designed for maintaining configuration information,
naming, providing distributed synchronization, and providing group services.

Features:

• Serialization
• Atomicity
• Reliability
• Simple API

Apache Oozie:

Oozie is a workflow scheduler system to manage Apache


Hadoop jobs.

Features:

• Scalable, reliable and extensible system.


• Supports several types of Hadoop jobs such as Map-Reduce, Hive, Pig and Sqoop.
• Simple and easy to use.

Q) Hadoop HDFS Data Read and Write Operations (Anatomy of File Write and Read)?
To write a file in HDFS, a client needs to interact with the master, i.e. the namenode. The namenode provides the addresses of the datanodes (slaves) on which the client will start writing the data. The client writes data directly on the datanodes, and the datanodes then form a data write pipeline.
The first datanode copies the block to another datanode, which in turn copies it to the third datanode. Once the replicas of the blocks are created, the datanodes send back an acknowledgment.

HDFS Data Write Pipeline Workflow


Now let's understand the complete end-to-end HDFS data write pipeline. As shown in the figure above, the data write operation in HDFS is distributed; the client copies the data in a distributed manner onto the datanodes. The step-by-step explanation of the data write operation is:
i) The HDFS client sends a create request on DistributedFileSystem APIs.
ii) DistributedFileSystem makes an RPC call to the namenode to create a new file in the file
system’s namespace. The namenode performs various checks to make sure that the file
doesn’t already exist and that the client has the permissions to create the file. When these
checks pass, then only the namenode makes a record of the new file; otherwise, file
creation fails and the client is thrown an IOException.
iii) The DistributedFileSystem returns a FSDataOutputStream for the client to start writing
data to. As the client writes data, DFSOutputStream splits it into packets, which it writes to
an internal queue, called the data queue. The data queue is consumed by the DataStreamer,



which is responsible for asking the namenode to allocate new blocks by picking a list of
suitable datanodes to store the replicas.
iv) The list of datanodes form a pipeline, and here we’ll assume the replication level is three,
so there are three nodes in the pipeline. The DataStreamer streams the packets to the first
datanode in the pipeline, which stores the packet and forwards it to the second datanode in
the pipeline. Similarly, the second datanode stores the packet and forwards it to the third
(and last) datanode in the pipeline.
v) DFSOutputStream also maintains an internal queue of packets that are waiting to be
acknowledged by datanodes, called the ack queue. A packet is removed from the ack queue
only when it has been acknowledged by the datanodes in the pipeline. Datanode sends the
acknowledgment once required replicas are created (3 by default). Similarly, all the blocks
are stored and replicated on the different datanodes, the data blocks are copied in parallel.
vi) When the client has finished writing data, it calls close() on the stream.
vii) This action flushes all the remaining packets to the datanode pipeline and waits for
acknowledgments before contacting the namenode to signal that the file is complete. The
namenode already knows which blocks the file is made up of, so it only has to wait for
blocks to be minimally replicated before returning successfully.
We can summarize the HDFS data write operation from the following diagram:

Hadoop HDFS Data Read Operation


To read a file from HDFS, a client needs to interact with namenode (master) as namenode is
the centerpiece of Hadoop cluster (it stores all the metadata i.e. data about the data). Now
namenode checks for required privileges, if the client has sufficient privileges then
namenode provides the address of the slaves where a file is stored. Now client will interact
directly with the respective datanodes to read the data blocks.

HDFS File Read Workflow


Now let's understand the complete end-to-end HDFS data read operation. As shown in the figure above, the data read operation in HDFS is distributed; the client reads the data in parallel from the datanodes. The step-by-step explanation of the data read cycle is:



i) Client opens the file it wishes to read by calling open() on the FileSystem object, which for
HDFS is an instance of DistributedFileSystem.
ii) DistributedFileSystem calls the namenode using RPC to determine the locations of the
blocks for the first few blocks in the file. For each block, the namenode returns the
addresses of the datanodes that have a copy of that block and datanode are sorted
according to their proximity to the client.
iii) DistributedFileSystem returns a FSDataInputStream to the client for it to read data
from. FSDataInputStream, thus, wraps the DFSInputStream which manages the datanode
and namenode I/O. Client calls read() on the stream. DFSInputStream which has stored the
datanode addresses then connects to the closest datanode for the first block in the file.
iv) Data is streamed from the datanode back to the client, as a result client can
call read() repeatedly on the stream. When the block ends, DFSInputStream will close the
connection to the datanode and then finds the best datanode for the next block.
v) If the DFSInputStream encounters an error while communicating with a datanode, it will
try the next closest one for that block. It will also remember datanodes that have failed so
that it doesn't needlessly retry them for later blocks. The DFSInputStream also verifies checksums for the data transferred to it from the datanode. If it finds a corrupt block, it reports this to the namenode before the DFSInputStream attempts to read a replica of the block from another datanode.
vi) When the client has finished reading the data, it calls close() on the stream.
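The following is a minimal client-side sketch (not from the notes) showing the FileSystem API calls that trigger the write and read workflows described above; the file path and configuration are hypothetical:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWriteDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);       // a DistributedFileSystem when fs.defaultFS points to HDFS

        // Write: create() contacts the namenode; the data then flows through the datanode pipeline.
        Path file = new Path("/user/demo/sample.txt");
        try (FSDataOutputStream out = fs.create(file)) {
            out.writeUTF("hello hdfs");
        }

        // Read: open() asks the namenode for block locations; read() streams from the closest datanodes.
        try (FSDataInputStream in = fs.open(file)) {
            System.out.println(in.readUTF());
        }
    }
}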

Q ) Types of InputFormat in MapReduce



FileInputFormat in Hadoop
It is the base class for all file-based InputFormats. Hadoop FileInputFormat specifies input
directory where data files are located.
TextInputFormat
It is the default InputFormat of MapReduce. TextInputFormat treats each line of each input
file as a separate record and performs no parsing. This is useful for unformatted data or line-
based records like log files.
 Key – It is the byte offset of the beginning of the line within the file (not whole file just
one split), so it will be unique if combined with the file name.
 Value – It is the contents of the line, excluding line terminators.
KeyValueTextInputFormat
It is similar to TextInputFormat, as it also treats each line of input as a separate record. While TextInputFormat treats the entire line as the value, KeyValueTextInputFormat breaks the line itself into a key and a value at a tab character ('\t'). Here the key is everything up to the tab character, while the value is the remaining part of the line after the tab character.
SequenceFileInputFormat
Hadoop SequenceFileInputFormat is an InputFormat which reads sequence files. Sequence files are binary files that store sequences of binary key-value pairs.
SequenceFileAsTextInputFormat
Hadoop SequenceFileAsTextInputFormat is another form of SequenceFileInputFormat which converts the sequence file keys and values to Text objects. The conversion is performed on the keys and values by calling toString(). This InputFormat makes sequence files suitable input for streaming.
SequenceFileAsBinaryInputFormat
Hadoop SequenceFileAsBinaryInputFormat is a SequenceFileInputFormat using which we can extract the sequence file's keys and values as opaque binary objects.
NLineInputFormat
Hadoop NLineInputFormat is another form of TextInputFormat where the keys are byte
offset of the line and values are contents of the line. Each mapper receives a variable
number of lines of input with TextInputFormat and KeyValueTextInputFormat and the
number depends on the size of the split and the length of the lines. And if we want our
mapper to receive a fixed number of lines of input, then we use NLineInputFormat.
DBInputFormat
Hadoop DBInputFormat is an InputFormat that reads data from a relational database using JDBC. As it doesn't have partitioning capabilities, we need to be careful not to swamp the database we are reading from by running too many mappers.
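As a small illustration of how one of these InputFormats is selected, the sketch below (a hypothetical driver fragment, with job name and classes chosen for illustration) sets KeyValueTextInputFormat on a job instead of the default TextInputFormat:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;

public class InputFormatDemo {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "kv-example");
        // Split each input line at the first tab character into (key, value) instead of (offset, line).
        job.setInputFormatClass(KeyValueTextInputFormat.class);
        // The mapper, reducer, input/output paths, etc. would be set here as in the word-count driver shown later.
    }
}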

Q ) Understanding Inputs And Outputs In Mapreduce ?



Your data might be XML files sitting behind a number of FTP servers, text log files sitting on a central web server, or Lucene indexes in HDFS. How does MapReduce support reading and writing to these different serialization structures across the various storage mechanisms? You'll need to know the answer in order to support a specific serialization format.

Data input :-
The two classes that support data input in MapReduce are InputFormat and RecordReader. The InputFormat class is consulted to determine how the input data should be partitioned for the map tasks, and the RecordReader performs the reading of data from the inputs.

INPUT FORMAT :-
Every job in MapReduce must define its inputs according to contracts specified in
the InputFormat abstract class. InputFormat implementers must fulfill three contracts:
first, they describe type information for map input keys and values; next, they specify
how the input data should be partitioned; and finally, they indicate the RecordReader
instance that should read the data from source.
RECORDREADER :-



The RecordReader class is used by MapReduce in the map tasks to read data from an input
split and provide each record in the form of a key/value pair for use by mappers. A task is
commonly created for each input split, and each task has a single RecordReader that’s
responsible for reading the data for that input split.
Data output :-
MapReduce uses a similar process for supporting output data as it does for input data. Two classes must exist, an OutputFormat and a RecordWriter. The OutputFormat performs some basic validation of the data sink properties, and the RecordWriter writes each reducer output to the data sink.
OUTPUTFORMAT:-
Much like the InputFormat class, the OutputFormat class, as shown in figure 3.5, defines the
contracts that implementers must fulfill, including checking the information related to the
job output, providing a RecordWriter, and specifying an output committer, which allows
writes to be staged and then made “permanent” upon task and/or job success.
RECORDWRITER:-
You’ll use the RecordWriter to write the reducer outputs to the destination data sink.It’s a
simple class.
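For concreteness, here is a minimal sketch (not from the notes; the output path and job name are hypothetical) of a driver wiring up the default TextOutputFormat, whose RecordWriter writes each reducer output record as a "key<TAB>value" line:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class OutputFormatDemo {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "output-format-demo");
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // TextOutputFormat validates the output directory and supplies the RecordWriter.
        job.setOutputFormatClass(TextOutputFormat.class);
        FileOutputFormat.setOutputPath(job, new Path("/user/demo/output"));
    }
}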

Q ) Explain Hadoop MapReduce ?


Hadoop MapReduce is a programming paradigm at the heart of Apache Hadoop, providing massive scalability across hundreds or thousands of servers in a Hadoop cluster built on commodity hardware. The MapReduce model processes large unstructured data sets with a distributed algorithm on a Hadoop cluster.
The term MapReduce represents two separate and distinct tasks that Hadoop programs perform – the Map job and the Reduce job. The Map job takes data sets as input and processes them to produce key-value pairs. The Reduce job takes the output of the Map job, i.e. the key-value pairs, and aggregates them to produce the desired result. The input and output of the map and reduce jobs are stored in HDFS.

MapReduce
Hadoop MapReduce (Hadoop Map/Reduce) is a software framework for distributed processing of large data sets on computing clusters. Apache Hadoop is an open-source framework that allows us to store and process big data in a distributed environment across clusters of computers using simple programming models. MapReduce is the core component for data processing in the Hadoop framework. In layman's terms, MapReduce helps split the input data set into a number of parts and run a program on all the data parts in parallel at once. The term MapReduce refers to two separate and distinct tasks. The first is the map operation, which takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs). The reduce operation combines those data tuples based on the key and accordingly modifies the value of the key.

Bear, Deer, River and Car Example


The following word count example explains the MapReduce method. For simplicity, let's consider a few words of a text document. We want to find the number of occurrences of each word. First the input is split to distribute the work among all the map nodes, as shown in the figure. Then each word is identified and mapped to the number one. Thus the pairs, also called tuples, are created. In the first mapper node three words – Deer, Bear and River – are passed. Thus the output of the node will be three key-value pairs with three distinct keys and the value set to one. The mapping process remains the same on all the nodes. These tuples are then passed to the reduce nodes. A partitioner comes into action and carries out shuffling so that all the tuples with the same key are sent to the same node.

The Reducer node processes all the tuples such that all the pairs with the same key are counted and the count is updated as the value of that specific key. In the example there are two pairs with the key 'Bear', which are then reduced to a single tuple with the value equal to the count. All the output tuples are then collected and written to the output file.
Hadoop divides the job into tasks. There are two types of tasks:
1. Map tasks (Splits & Mapping)
2. Reduce tasks (Shuffling, Reducing)
as mentioned above.
The complete execution process (execution of Map and Reduce tasks, both) is controlled by
two types of entities called a
1. Jobtracker : Acts like a master (responsible for complete execution of submitted job)
2. Multiple Task Trackers : Acts like slaves, each of them performing the job



For every job submitted for execution in the system, there is one Jobtracker that resides
on Namenode and there are multiple tasktrackers which reside on Datanode.

 A job is divided into multiple tasks which are then run onto multiple data nodes in a
cluster.
 It is the responsibility of jobtracker to coordinate the activity by scheduling tasks to
run on different data nodes.
 Execution of an individual task is then looked after by the tasktracker, which resides on every data node executing part of the job.
 The tasktracker's responsibility is to send the progress report to the jobtracker.
 In addition, the tasktracker periodically sends a 'heartbeat' signal to the jobtracker so as to notify it of the current state of the system.
 Thus jobtracker keeps track of overall progress of each job. In the event of task
failure, the jobtracker can reschedule it on a different tasktracker.

Q) Application of MapReduce Example (OR)

Matrix-Vector Multiplication by MapReduce
MapReduce is a high-level programming model for processing large data sets in parallel, originally developed by Google and adapted from functional programming. The model is suitable for a range of problems such as matrix operations, relational algebra, statistical frequency counting, etc.

The MapReduce implementation in this example differs from the school-book multiplication
that I just introduced. A single map function will be processing only a single matrix element
rather than the whole row.



There are different implementations of matrix-vector multiplication depending on whether the vector fits into main memory or not. This example demonstrates a basic version of matrix-vector multiplication in which the vector fits into main memory.
We store the sparse matrix (sparse matrices often appear in scientific and engineering
problems) as a triple with explicit coordinates (i, j, aij). E.g. the value of the first entry of the
matrix (0,0) is 3. Similarly, the value of the second entry in (0,1) is 2. We do not store the
zero entries of the matrix. This is the sparse matrix representation of the matrix from the
figure above:
i, j, aij
0,0,3
0,1,2
1,1,4
1,2,1
2,0,2
2,2,1
We store the vector in a dense format without explicit coordinates. You can see this below:
4
3
1
In our case, the map function takes a single element of the sparse matrix (not the whole
row!), multiplies it with the corresponding entry of the vector and produces an
intermediate key-value pair (i, aij*vj). This is sufficient because in order to perform the
summation (i.e. the reduce step) we only need to know the matrix row, we do not need to
know the matrix column. E.g. one of the map functions takes the first element of the matrix
(0,0,3), multiplies it with the first element of the vector (4) and produces an intermediate
key-value pair (0,12). The key (0), i.e. the row position of the element in the sparse matrix,
associates the value (12) with its position in the matrix-vector product. Another map
function takes the second element (0,1,2), multiplies it with the second entry of the vector (3), producing an intermediate key-value pair (0,6), etc.

Reduce function performs a summary operation on the intermediate keys. E.g. an


intermediate value with the key "0" will be summed up under a final index "0".
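A minimal Java sketch (not from the notes) of the map and reduce functions described above, assuming each input line holds one sparse-matrix triple "i,j,aij" and that the dense vector is small enough to be loaded into every mapper (here it is hard-coded for illustration):

import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class MatrixVectorMultiply {

    public static class MVMapper
            extends Mapper<LongWritable, Text, IntWritable, DoubleWritable> {
        // Hypothetical in-memory vector; in practice it could be read in setup() from a side file.
        private final double[] vector = {4, 3, 1};

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] parts = value.toString().split(",");
            int i = Integer.parseInt(parts[0].trim());
            int j = Integer.parseInt(parts[1].trim());
            double aij = Double.parseDouble(parts[2].trim());
            // Emit (row index, partial product aij * vj).
            context.write(new IntWritable(i), new DoubleWritable(aij * vector[j]));
        }
    }

    public static class MVReducer
            extends Reducer<IntWritable, DoubleWritable, IntWritable, DoubleWritable> {
        @Override
        protected void reduce(IntWritable row, Iterable<DoubleWritable> partials, Context context)
                throws IOException, InterruptedException {
            double sum = 0;
            // Sum all partial products that share the same row index.
            for (DoubleWritable p : partials) {
                sum += p.get();
            }
            context.write(row, new DoubleWritable(sum));
        }
    }
}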



Q) Explain word count program in mapreducer with example?
In Hadoop, MapReduce is a computation that decomposes large manipulation jobs into individual tasks that can be executed in parallel across a cluster of servers. The results of the tasks can be joined together to compute the final results.
MapReduce consists of 2 steps:
 Map Function – It takes a set of data and converts it into another set of data, where
individual elements are broken down into tuples (Key-Value pair).
Example – (Map function in Word Count)

Input (set of data):
Bus, Car, bus, car, train, car, bus, car, train, bus, TRAIN, BUS, buS, caR, CAR, car, BUS, TRAIN

Output (converted into another set of (Key, Value) data):
(Bus,1), (Car,1), (bus,1), (car,1), (train,1), (car,1), (bus,1), (car,1), (train,1), (bus,1), (TRAIN,1), (BUS,1), (buS,1), (caR,1), (CAR,1), (car,1), (BUS,1), (TRAIN,1)

 Reduce Function – Takes the output from Map as an input and combines those data
tuples into a smaller set of tuples.
Example – (Reduce function in Word Count)

Input (set of tuples – output of the Map function):
(Bus,1), (Car,1), (bus,1), (car,1), (train,1), (car,1), (bus,1), (car,1), (train,1), (bus,1), (TRAIN,1), (BUS,1), (buS,1), (caR,1), (CAR,1), (car,1), (BUS,1), (TRAIN,1)

Output (converted into a smaller set of tuples):
(BUS,7), (CAR,7), (TRAIN,4)



Work Flow of Program
Workflow of MapReduce consists of 5 steps

1. Splitting – The splitting parameter can be anything, e.g. splitting by space, comma,
semicolon, or even by a new line (‘\n’).
2. Mapping – as explained above
3. Intermediate splitting – the entire process runs in parallel on different clusters. In order to group the data in the "Reduce Phase", tuples with the same key should end up on the same cluster.
4. Reduce – it is nothing but mostly a group-by phase.
5. Combining – the last phase, where all the data (the individual result set from each cluster) is combined together to form a result.
Now Let's See the Word Count Program in Java
Fortunately we don't have to write all of the above steps; we only need to write the splitting parameter, the Map function logic, and the Reduce function logic. The remaining steps will execute automatically.
Make sure that Hadoop is installed on your system with the Java JDK.

Steps
Step 1. Open Eclipse> File > New > Java Project >( Name it – MRProgramsDemo) > Finish
Step 2. Right Click > New > Package ( Name it - PackageDemo) > Finish
Step 3. Right Click on Package > New > Class (Name it - WordCount)
Step 4. Add Following Reference Libraries –
Right Click on Project > Build Path > Add External Archives

 /usr/lib/hadoop-0.20/hadoop-core.jar
 /usr/lib/hadoop-0.20/lib/commons-cli-1.2.jar

Step 5. Type following Program :

package PackageDemo;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
public class WordCount {
public static void main(String [] args) throws Exception
{
Configuration c=new Configuration();
String[] files=new GenericOptionsParser(c,args).getRemainingArgs();
Path input=new Path(files[0]);
Path output=new Path(files[1]);
Job j=new Job(c,"wordcount");
j.setJarByClass(WordCount.class);
j.setMapperClass(MapForWordCount.class);
j.setReducerClass(ReduceForWordCount.class);
j.setOutputKeyClass(Text.class);
j.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(j, input);
FileOutputFormat.setOutputPath(j, output);
System.exit(j.waitForCompletion(true)?0:1);
}
public static class MapForWordCount extends Mapper<LongWritable, Text, Text,
IntWritable>{
public void map(LongWritable key, Text value, Context con) throws IOException,
InterruptedException
{
String line = value.toString();
String[] words=line.split(",");
for(String word: words )
{
Text outputKey = new Text(word.toUpperCase().trim());
IntWritable outputValue = new IntWritable(1);
con.write(outputKey, outputValue);
}
}
}
public static class ReduceForWordCount extends Reducer<Text, IntWritable, Text,
IntWritable>
{
public void reduce(Text word, Iterable<IntWritable> values, Context con) throws
IOException, InterruptedException
{
int sum = 0;
for(IntWritable value : values)
{
sum += value.get();
}
con.write(word, new IntWritable(sum));



}
}
}

Explanation
The program consists of 3 classes:

 Driver class (public static void main – the entry point)


 Map class which extends public class Mapper<KEYIN,VALUEIN,KEYOUT,VALUEOUT>
and implements the Map function.
 Reduce class which extends public class
Reducer<KEYIN,VALUEIN,KEYOUT,VALUEOUT> and implements the Reduce function.

Step 6. Make Jar File


Right Click on Project> Export> Select export destination as Jar File > next> Finish

Step 7: Take a text file and move it in HDFS

To Move this into Hadoop directly, open the terminal and enter the following commands:



[training@localhost ~]$ hadoop fs -put wordcountFile wordCountFile
Step 8. Run Jar file
(hadoop jar jarfilename.jar packageName.ClassName PathToInputTextFile PathToOutputDirectory)

[training@localhost ~]$ hadoop jar MRProgramsDemo.jar PackageDemo.WordCount


wordCountFile MRDir1
Step 9. Open Result

[training@localhost ~]$ hadoop fs -cat MRDir1/part-r-00000


BUS 7
CAR 4
TRAIN 6

Q) What is Serialization?

Serialization is the process of translating data structures or an object's state into binary or textual form to transport the data over a network or to store it on some persistent storage. Once the data is transported over the network or retrieved from persistent storage, it needs to be deserialized again. Serialization is termed marshalling and deserialization is termed unmarshalling.
Avro is one of the preferred data serialization systems because of its language neutrality.
Due to lack of language portability in Hadoop writable classes, Avro becomes a natural
choice because of its ability to handle multiple data formats which can be further processed
by multiple languages.
Avro is also very much preferred for serializing the data in Hadoop.
It uses JSON(JavaScript Object Notation) for defining data types and protocols and serializes
data in a compact binary format. Its primary use is in Apache Hadoop, where it can provide
both a serialization format for persistent data, and a wire format for communication
between Hadoop nodes, and from client programs to the Hadoop services.
By this, we can define Avro as a file format introduced with Hadoop to store data in a predefined format. This file format can be used in any of Hadoop's tools like Pig and Hive.
Generally in distributed systems like Hadoop, the concept of serialization is used
for Interprocess Communication and Persistent Storage.

Interprocess Communication
 To establish the interprocess communication between the nodes connected in a
network, RPC technique was used.
 RPC used internal serialization to convert the message into binary format before
sending it to the remote node via network. At the other end the remote system
deserializes the binary stream into the original message.
 The RPC serialization format is required to be as follows −
o Compact − To make the best use of network bandwidth, which is the most
scarce resource in a data center.



o Fast − Since the communication between the nodes is crucial in distributed
systems, the serialization and deserialization process should be quick,
producing less overhead.
o Extensible − Protocols change over time to meet new requirements, so it
should be straightforward to evolve the protocol in a controlled manner for
clients and servers.
o Interoperable − The message format should support the nodes that are
written in different languages.

Persistent Storage
Persistent Storage is a digital storage facility that does not lose its data with the loss of
power supply. For example - Magnetic disks and Hard Disk Drives.

Writable Interface
This is the interface in Hadoop which provides methods for serialization and
deserialization. The following table describes the methods −

S.No. Methods and Description

1 void readFields(DataInput in)


This method is used to deserialize the fields of the given object.

2 void write(DataOutput out)


This method is used to serialize the fields of the given object.

WritableComparable Interface
It is the combination of Writable and Comparable interfaces. This interface
inherits Writable interface of Hadoop as well as Comparable interface of Java. Therefore it
provides methods for data serialization, deserialization, and comparison.

S.No. Methods and Description

1 int compareTo(class obj)


This method compares current object with the given object obj.

In addition to these interfaces, Hadoop supports a number of wrapper classes that implement
the WritableComparable interface. Each class wraps a Java primitive type; examples include
BooleanWritable, ByteWritable, IntWritable, LongWritable, FloatWritable, and DoubleWritable.



These classes are useful to serialize various types of data in Hadoop. For instance, let us
consider the IntWritable class. Let us see how this class is used to serialize and deserialize
the data in Hadoop.
IntWritable Class
This class implements the Writable, Comparable, and WritableComparable interfaces. It
wraps an integer data type in it. This class provides methods used to serialize and deserialize
integer type of data.

Constructors
S.No. Summary

1 IntWritable()

2 IntWritable( int value)

Methods
S.No. Summary

1 int get()
Using this method you can get the integer value present in the current object.

2 void readFields(DataInput in)


This method is used to deserialize the data in the given DataInput object.

3 void set(int value)


This method is used to set the value of the current IntWritable object.



4 void write(DataOutput out)
This method is used to serialize the data in the current object to the
given DataOutput object.

Serializing the Data in Hadoop


The procedure to serialize the integer type of data is discussed below.
 Instantiate IntWritable class by wrapping an integer value in it.
 Instantiate ByteArrayOutputStream class.
 Instantiate DataOutputStream class and pass the object
of ByteArrayOutputStream class to it.
 Serialize the integer value in the IntWritable object using the write() method. This
method needs an object of the DataOutputStream class.
 The serialized data will be stored in the ByteArrayOutputStream object which was passed
as a parameter to the DataOutputStream class at the time of instantiation; convert the data
in that object to a byte array using its toByteArray() method.
Deserializing the Data in Hadoop
The procedure to deserialize the integer type of data is discussed below −
 Instantiate the IntWritable class (no value needs to be wrapped; it will be filled in on read).
 Instantiate the ByteArrayInputStream class with the serialized byte array.
 Instantiate the DataInputStream class and pass the object
of the ByteArrayInputStream class to it.
 Deserialize the data in the object of DataInputStream using the readFields() method of
the IntWritable class.
 The deserialized data will be stored in the object of the IntWritable class. You can
retrieve this data using the get() method of this class.
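The two procedures above can be combined into a small Java sketch; it is an illustrative example, assuming the standard org.apache.hadoop.io.IntWritable class and the java.io stream classes are available on the classpath:

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import org.apache.hadoop.io.IntWritable;

public class IntWritableSerDeDemo {
    public static void main(String[] args) throws Exception {
        // Serialize: wrap the integer and write it into a byte array
        IntWritable original = new IntWritable(42);
        ByteArrayOutputStream byteOut = new ByteArrayOutputStream();
        DataOutputStream dataOut = new DataOutputStream(byteOut);
        original.write(dataOut);                 // serialization
        byte[] serialized = byteOut.toByteArray();

        // Deserialize: read the bytes back into a fresh IntWritable
        IntWritable restored = new IntWritable();
        DataInputStream dataIn =
            new DataInputStream(new ByteArrayInputStream(serialized));
        restored.readFields(dataIn);             // deserialization
        System.out.println("Restored value: " + restored.get());
    }
}

Running the class prints "Restored value: 42", confirming that readFields() recovers exactly what write() produced.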
Advantage of Hadoop over Java Serialization
Hadoop’s Writable-based serialization is capable of reducing the object-creation overhead
by reusing the Writable objects, which is not possible with the Java’s native serialization
framework.
Disadvantages of Hadoop Serialization
To serialize Hadoop data, there are two ways −
 You can use the Writable classes, provided by Hadoop’s native library.
 You can also use Sequence Files which store the data in binary format.
The main drawback of these two mechanisms is that Writables and SequenceFiles have
only a Java API and they cannot be written or read in any other language.
Therefore any files created in Hadoop with the above two mechanisms cannot be read by
any other (third) language, which makes Hadoop a limited box. To address this
drawback, Doug Cutting created Avro, which is a language-independent data structure.



Q) Explain Hadoop Architecture ?
Apache HDFS or Hadoop Distributed File System is a block-structured file system where
each file is divided into blocks of a pre-determined size. These blocks are stored across a
cluster of one or several machines. Apache Hadoop HDFS Architecture follows
a Master/Slave Architecture, where a cluster comprises a single NameNode (Master node)
and all the other nodes are DataNodes (Slave nodes). HDFS can be deployed on a broad
spectrum of machines that support Java. Though one can run several DataNodes on a single
machine, in the practical world these DataNodes are spread across various machines.

The main elements of the Apache Hadoop HDFS Architecture are as follows:

 HDFS Master/Slave Topology


 NameNode, DataNode and Secondary NameNode
 What is a block?
 Replication Management
 Rack Awareness
 HDFS Read/Write – Behind the scenes

Name Node

NameNode is the master node in the Apache Hadoop HDFS Architecture that maintains and
manages the blocks present on the DataNodes (slave nodes). NameNode is a very highly
available server that manages the File System Namespace and controls access to files by
clients.

Functions of NameNode:
 It is the master daemon that maintains and
manages the DataNodes (slave nodes)
 It records the metadata of all the files stored in the cluster, e.g. The location of
blocks stored, the size of the files, permissions, hierarchy, etc. There are two files
associated with the metadata:



FsImage: It contains the complete state of the file system namespace since the
start of the NameNode.
EditLogs: It contains all the recent modifications made to the file system with
respect to the most recent FsImage.
 It records each change that takes place to the file system metadata. For example, if a
file is deleted in HDFS, the NameNode will immediately record this in the EditLog.
 It regularly receives a Heartbeat and a block report from all the DataNodes in the
cluster to ensure that the DataNodes are live.
 In case of the DataNode failure, the NameNode chooses new DataNodes for new
replicas,balance disk usage and manages the communication traffic to the DataNodes.

DataNode:
DataNodes are the slave nodes in HDFS. Unlike the NameNode, a DataNode is commodity
hardware, that is, an inexpensive system which is not of high quality or high availability.
The DataNode is a block server that stores the data in a local file system such as ext3 or ext4.
Functions of DataNode:

 These are slave daemons or process which runs on each slave machine.
 The actual data is stored on DataNodes.
 The DataNodes perform the low-level read and write requests from the file system’s
clients.
 They send heartbeats to the NameNode periodically to report the overall health of
HDFS, by default, this frequency is set to 3 seconds.
Secondary NameNode:
Apart from these two daemons, there is a third daemon or a process called Secondary
NameNode. The Secondary NameNode works concurrently with the primary NameNode
as a helper daemon. And don’t be confused about the Secondary NameNode being
a backup NameNode because it is not.
Functions of Secondary NameNode:

 The Secondary NameNode is one which constantly reads all the file systems and
metadata from the RAM of the NameNode and writes it into the hard disk or the file
system.
 It is responsible for combining the EditLogs with FsImage from the NameNode.

Blocks:



Blocks are nothing but the smallest contiguous location on your hard drive where data is
stored. In general, in any file system, you store the data as a collection of blocks.
Similarly, HDFS stores each file as blocks which are scattered throughout the Apache
Hadoop cluster. The default size of each block is 128 MB in Apache Hadoop 2.x (64 MB in
Apache Hadoop 1.x), which you can configure as per your requirement.
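As an illustration (not one of the standard steps above), the block size can be changed through hdfs-site.xml; the snippet below assumes Hadoop 2.x, where the property is dfs.blocksize and the value is given in bytes:

<configuration>
  <property>
    <name>dfs.blocksize</name>
    <!-- 256 MB instead of the default 128 MB -->
    <value>268435456</value>
  </property>
</configuration>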

Replication Management:

HDFS provides a reliable way to store huge data in a distributed environment as data
blocks. The blocks are also replicated to provide fault tolerance. The default replication
factor is 3, which is again configurable. So, as you can see in the figure, each block is
replicated three times and stored on different DataNodes.

Therefore, if you are storing a file of 128 MB in HDFS using the default configuration, you
will end up occupying a space of 384 MB (3 * 128 MB), as the blocks will be replicated
three times and each replica will reside on a different DataNode.
Rack Awareness:

Again, the NameNode also ensures that all the replicas are not stored on the same rack or a
single rack. It follows an in-built Rack Awareness Algorithm to reduce latency as well as
provide fault tolerance. Considering the replication factor is 3, the Rack Awareness
Algorithm says that the first replica of a block will be stored on a local rack and the next two
replicas will be stored on a different (remote) rack, but on different DataNodes within that
(remote) rack, as shown in the figure. If you have more replicas, the rest of the replicas will
be placed on random DataNodes provided not more than two replicas reside on the same
rack, if possible. This is how an actual Hadoop production cluster looks. Here, you have
multiple racks populated with DataNodes.
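Rack awareness is not automatic: the administrator points Hadoop at a topology script that maps each DataNode address to a rack name such as /rack1. A minimal sketch for core-site.xml, assuming Hadoop 2.x (the script path here is only an illustrative assumption):

<configuration>
  <property>
    <name>net.topology.script.file.name</name>
    <value>/etc/hadoop/conf/rack-topology.sh</value>
  </property>
</configuration>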

HDFS Read/Write Architecture:

HDFS follows a Write Once – Read Many philosophy. So, you can't edit files already stored
in HDFS, but you can append new data by re-opening the file.
HDFS Write Architecture:
Suppose a situation where an HDFS client wants to write a file named "example.txt" of size
248 MB.

Q) EXPLAIN HADOOP SHELL COMMANDS?

1. Print the Hadoop version


Syntax: hadoop version
Description: determines which Hadoop version you are using, as well as its checksum and build date.

2. Create a directory in HDFS at given path(s)


Syntax: hadoop fs -mkdir [-p] <paths>
Description: the -p option creates parent directories along the path.
Ex: create a single directory "kumar"
hadoop fs -mkdir /user/hadoop/kumar
3. List the contents of a directory
Syntax: hadoop fs -ls [-R] [-h] <args>
Description: lists the files and directories.
Ex: hadoop fs -ls /user
hadoop fs -ls -R /user

4. Upload a file into HDFS


Syntax: hadoop fs -put <localsrc> ... <dest>
Description: copies a single source or multiple sources from the local file system to the
destination file system in HDFS.
Ex: single file copy
hadoop fs -put localfile1 /user/hadoop/
5. Download a file from HDFS
Syntax: hadoop fs -get <hdfs-src> <localdst>
Description: copies/downloads files from HDFS to the local file system.

Ex:
Copy the sampletext1.csv file from HDFS to the local file system
hadoop fs -get /user/hadoop/sampletext1.csv localfile
6. View the content of a file
Syntax: hadoop fs -cat <path[filename]>
Description: displays the content of the HDFS file on your stdout.

Ex:
Show the contents of the "cmd.txt" file
hadoop fs -cat /user/hadoop/cmd.txt

7.Copy files from source to destination


Syntax: hadoop fs -cp [-f] URI [URI ...] <dest>
Description: allows multiple sources; in that case the destination must be a directory. The -f
option will overwrite the destination if it already exists.

Ex:
Copy "text1.txt" to the directory "kumar"
hadoop fs -cp /user/hadoop/text1.txt /user/hadoop/kumar

8.Copy a file from local file system to HDFS


Syntax: hadoop fs -copyFromLocal [-f] <localsrc> URI
Description: similar to the put command, except that the source is restricted to a local file
reference. The -f option will overwrite the destination if it already exists.

Ex:
Copy a file to HDFS
hadoop fs -copyFromLocal localfile hdfs://host:port/user/hadoop/

9.Copy file/folder to local file system from HDFS


Syntax: hadoop fs -copyToLocal URI <localdst>
Description: similar to the get command, except that the destination is restricted to a local file
reference, i.e. the destination directory must be in the local file system.

Ex:
Copy the "text1.txt" file from HDFS to a local directory
hadoop fs -copyToLocal hdfs://host:port/user/hadoop/text1.txt localdirpath

10.Move file from source to destination in hdfs


Syntax: hadoop fs -mv URI [URI ...] <dest>
Description: allows multiple sources, in which case the destination needs to be a directory.
Moving files across file systems is not permitted.

Ex:
Move the single file "text1.txt" to kumar
hadoop fs -mv /user/hadoop/text1.txt /user/hadoop/kumar

11.Remove a file or directory from HDFS


Syntax: hadoop fs -rm [-r|-R] [-skipTrash] URI [URI ...]
Description: the -r option is equivalent to -R.
If the -skipTrash option is specified, the specified files are deleted
completely rather than being moved to the trash directory.

Ex:
Delete the "text5.txt" file from HDFS
hadoop fs -rm /user/hadoop/text5.txt

12.Remove empty directory from HDFS


Syntax: hadoop fs -rmdir [--ignore-fail-on-non-empty] URI [URI ...]
Description: deletes a directory only if it is empty; with --ignore-fail-on-non-empty, no error
is reported when the directory is not empty.

Ex:
Delete the empty directory "sairam"
hadoop fs -rmdir /user/hadoop/sairam

13.Display last few lines of a file in hdfs


Syntax: hadoop fs -tail [-f] URI
Description: displays the last kilobyte of the file on the screen.

Ex:
Grab the bottom lines of the text1.txt file
hadoop fs -tail /user/hadoop/text1.txt

14.Display disk usage of files and directories


Syntax: hadoop fs -du [-s] [-h] URI [URI ...]
Description: displays the sizes of files and directories contained in the given directory, or the
length of the file in case it is just a file.

Ex:
View the disk usage of the "sampletext1.csv" file
hadoop fs -du /user/hadoop/sampletext1.csv

15.To empty the trash of HDFS


Syntax: hadoop fs -expunge
Description: the user can explicitly empty the trash directory of HDFS. By default the
trash directory is located at "/user/<username>/.Trash/…".

Ex:
hadoop fs -expunge
16. Collect specific information about a file or directory
Syntax: hadoop fs -stat [format] <path> ...
Description: formatting options
%b size of file in bytes
%g group name
%n filename
%r replication factor
%u user name of owner
%y modification date of the file (%Y gives milliseconds since the epoch)
%o HDFS block size in bytes (128 MB by default)

17. Change the replication factor of a file in HDFS


Syntax: hadoop fs -setrep [-w] <numReplicas> <path>

Ex:
Change the replication factor (here to 3) of the files within a directory
hadoop fs -setrep 3 /user/hadoop/hdfsdir

18.Move local files into hdfs file system


Syntax: hadoop fs -moveFromLocal <localsrc> <dst>
Description: similar to the put command, except that the source localsrc is deleted after it is
copied.

Ex:
Move the files "text1.txt" and "text2.txt" from the local file system to HDFS
hadoop fs -moveFromLocal text1.txt text2.txt /user/hadoop/

19.Return the help for an individual command


Syntax: hadoop fs -usage <command>
Description: shows the command syntax along with its options.

Ex:
hadoop fs -usage rm

Q) EXPLAIN HADOOP CLUSTER AND SETTING UP SSH?


What is a Hadoop Cluster?
A cluster is a set of connected computers which work together as a single system.
Similarly, a Hadoop cluster is just a computer cluster which we use for handling huge
volumes of data in a distributed manner.

Hadoop clusters have two types of


machines, such as Master and Slave,
where:

Master: HDFS NameNode, YARN ResourceManager.


Slaves: HDFS DataNodes, YARN NodeManagers.
However, it is recommended to separate the master and slave nodes, because:
 Task/application workloads on the slave nodes should be isolated from the masters.
 Slave nodes are frequently decommissioned for maintenance.
Datanode and Namenode
The NameNode is the HDFS master, which manages the file system namespace and
regulates access to files by clients, and also consults with the DataNodes (HDFS slaves)
while copying data or running MapReduce operations. The DataNodes manage the storage
attached to the nodes they run on; there are a number of DataNodes, typically one DataNode
per slave in the cluster.

SSH Key Authentication


A distributed Hadoop cluster setup requires your "master" node [name node & job tracker]
to be able to SSH (without requiring a password, so key-based authentication) to all other
"slave" nodes (e.g. data nodes).
SSH key-based authentication is needed so that the master node can log in to the slave
nodes (and the secondary node) to start/stop them, etc. This also needs to be set up on the
secondary name node (which is listed in your masters file) so that [presuming it is running
on another machine, which is a VERY good idea for a production cluster] it can be started
from your name node with ./start-dfs.sh, and the job tracker node with ./start-mapred.sh.

Start the Hadoop Cluster


Hadoop cluster in one of the three supported modes:
 Local (Standalone) Mode
 Pseudo-Distributed Mode
 Fully-Distributed Mode



Standalone Operation
By default, Hadoop is configured to run in a non-distributed mode, as a single Java process.
This is useful for debugging.
The following example copies the unpacked conf directory to use as input and then finds
and displays every match of the given regular expression. Output is written to the
given output directory.
$ mkdir input
$ cp conf/*.xml input
$ bin/hadoop jar hadoop-examples-*.jar grep input output 'dfs[a-z.]+'
$ cat output/*
Pseudo-Distributed Operation
Hadoop can also be run on a single-node in a pseudo-distributed mode where each Hadoop
daemon runs in a separate Java process.
Configuration
Use the following:

conf/core-site.xml:
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>

conf/hdfs-site.xml:
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>

conf/mapred-site.xml:
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>localhost:9001</value>
</property>
</configuration>
Setup passphraseless ssh
Now check that you can ssh to the localhost without a passphrase:
$ ssh localhost
If you cannot ssh to localhost without a passphrase, execute the following commands:
$ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
Execution
Format a new distributed-filesystem:
$ bin/hadoop namenode -format



Start the hadoop daemons:
$ bin/start-all.sh
The hadoop daemon log output is written to the ${HADOOP_LOG_DIR} directory (defaults
to ${HADOOP_HOME}/logs).
Browse the web interface for the NameNode and the JobTracker; by default they are
available at:
 NameNode - http://localhost:50070/
 JobTracker - http://localhost:50030/
Fully distributed mode:
As the name suggests, this mode involves code running on an actual Hadoop cluster. In this
mode we see the real power of Hadoop, when you run your code against a large input on
thousands of servers. It is always difficult to debug a MapReduce program in this mode, as
you have mappers running on different machines with different pieces of input, and with
large inputs it is likely that the data will be irregular in its format.

Q)HDFS Administering ,monitoring and Maintenance?

In the Hadoop world, a Systems Administrator is called a Hadoop Administrator. Hadoop


Admin Roles and Responsibilities include setting up Hadoop clusters. Other duties involve
backup, recovery and maintenance. Hadoop administration requires good knowledge of
hardware systems and excellent understanding of Hadoop architecture.
It’s easy to get started with Hadoop administration because Linux system administration is a
pretty well-known beast, and because systems administrators are used to administering all
kinds of existing complex applications.
What does a Hadoop Admin do on a day-to-day basis?
Installation and Configuration
Cluster Maintenance
Resource Management
Security Management
Troubleshooting
Cluster Monitoring
Backup And Recovery Task
Aligning with the systems engineering team to propose and deploy new hardware and
software environments required for Hadoop and to expand existing environments.
Diligently teaming with the infrastructure, network, database, application and business
intelligence teams to guarantee high data quality and availability.
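A few commonly used commands for day-to-day monitoring and maintenance are sketched below; they are illustrative, and the exact output differs from cluster to cluster:

$ hdfs dfsadmin -report          # capacity, live/dead DataNodes, remaining space
$ hdfs dfsadmin -safemode get    # check whether the NameNode is in safe mode
$ hdfs fsck / -files -blocks     # check the health of files and blocks
$ hdfs balancer                  # rebalance blocks across DataNodes
$ hdfs dfsadmin -refreshNodes    # re-read include/exclude files when decommissioning nodes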

Q) Comparison with Hive and Traditional database ?


 Hive is based on the notion of Write once, Read many times but RDBMS is designed
for Read and Write many times.

 In RDBMS, record level updates, insertions and deletes, transactions


and indexes are possible. Whereas these are not allowed in Hive because Hive was built
to operate over HDFS data using MapReduce, where full-table scans are the norm and a
table update is achieved by transforming the data into a new table.

 In RDBMS, the maximum data size allowed will be in the 10's of terabytes,


whereas Hive can handle 100's of petabytes very easily.



 As Hadoop is a batch-oriented system, Hive doesn’t support OLTP (Online Transaction
Processing) but it is closer to OLAP (Online Analytical Processing) but not ideal since
there is significant latency between issuing a query and receiving a reply, due to
the overhead of Mapreduce jobs and due to the size of the data sets Hadoop was designed
to serve.

 RDBMS is best suited for dynamic data analysis and where fast responses are expected
but Hive is suited for data warehouse applications, where relatively static data is
analyzed, fast response times are not required, and when the data is not changing rapidly.

 To overcome the limitations of Hive, HBase is being integrated with Hive to


support record level operations and OLAP.

 Hive is very easily scalable at low cost, but an RDBMS is not that scalable, and scaling it
up is very costly.

Hive vs. Traditional database:

Hive: Schema on READ – it does not verify the schema while the data is loaded.
Traditional database: Schema on WRITE – the table schema is enforced at data load time;
if the data being loaded does not conform to the schema, it is rejected.

Hive: Very easily scalable at low cost.
Traditional database: Not as scalable; scaling up is costly.

Hive: Based on the Hadoop notion of write once, read many times.
Traditional database: Can read and write many times.

Hive: Record-level updates are not possible.
Traditional database: Record-level updates, insertions and deletes, transactions and indexes
are possible.

Hive: OLTP (On-line Transaction Processing) is not yet supported; OLAP (On-line
Analytical Processing) is supported.
Traditional database: Both OLTP and OLAP are supported.

Q ) Explain Hive ?
Hive is a data warehouse infrastructure tool to process structured data in Hadoop. It resides
on top of Hadoop to summarize Big Data, and makes querying and analyzing easy.
Initially Hive was developed by Facebook, later the Apache Software Foundation took it up
and developed it further as an open source under the name Apache Hive. It is used by
different companies. For example, Amazon uses it in Amazon Elastic MapReduce.
Features of Hive
 It stores schema in a database and processed data into HDFS.
 It is designed for OLAP.
 It provides SQL type language for querying called HiveQL or HQL.
 It is familiar, fast, scalable, and extensible.
Architecture of Hive



The following component diagram depicts the architecture of Hive:

This component diagram contains different units. The following table describes each unit:

Unit Name Operation


User Interface Hive is a data warehouse infrastructure software that can create
interaction between user and HDFS. The user interfaces that
Hive supports are Hive Web UI, Hive command line, and Hive
HD Insight (In Windows server).
Meta Store Hive chooses respective database servers to store the schema or
Metadata of tables, databases, columns in a table, their data
types, and HDFS mapping.
HiveQL Process Engine HiveQL is similar to SQL for querying on schema info on the
Metastore. It is one of the replacements of traditional approach
for MapReduce program. Instead of writing MapReduce
program in Java, we can write a query for MapReduce job and
process it.

Execution Engine The conjunction part of HiveQL process Engine and


MapReduce is Hive Execution Engine. Execution
engine processes the query and generates results
as same as MapReduce results. It uses the flavor of
MapReduce.

HDFS or HBASE Hadoop distributed file system or HBASE are the


data storage techniques to store data into file
system.



Q) How to install Hive on Ubuntu:
Please follow the below steps to install Apache Hive on Ubuntu:
Step 1: Download Hive tar.
Command: wget http://archive.apache.org/dist/hive/hive-2.1.0/apache-hive-2.1.0-
bin.tar.gz

Step 2: Extract the tar file.


Command: tar -xzf apache-hive-2.1.0-bin.tar.gz
Command: ls

Step 3: Edit the ".bashrc" file to update the environment variables for the user.
Command: sudo gedit .bashrc
Add the following at the end of the file:
# Set HIVE_HOME
export HIVE_HOME=/home/edureka/apache-hive-2.1.0-bin
export PATH=$PATH:/home/edureka/apache-hive-2.1.0-bin/bin

Also, make sure that hadoop path is also set.


Run below command to make the changes work in same terminal.
Command: source .bashrc
Step 4: Check the Hive version.



Step 5: Create Hive directories within HDFS. The directory ‘warehouse’ is the location to
store the table or data related to hive.
Command:
 hdfs dfs -mkdir -p /user/hive/warehouse
 hdfs dfs -mkdir /tmp
Step 6: Set read/write permissions for table.
Command:
In this command, we are giving write permission to the group:
 hdfs dfs -chmod g+w /user/hive/warehouse

 hdfs dfs -chmod g+w /tmp


Step 7: Set Hadoop path in hive-env.sh
Command: cd apache-hive-2.1.0-bin/
Command: gedit conf/hive-env.sh

Set the parameters as shown in the below snapshot.

Step 8: Edit hive-site.xml


Command: gedit conf/hive-site.xml
Step 9: By default, Hive uses Derby database. Initialize Derby database.
Command: bin/schematool -initSchema -dbType derby
Step 10: Launch Hive.
Command: hive
Step 11: Run few queries in Hive shell.
Command: show databases;
Command: create table employee (id string, name string, dept string) row format delimited
fields terminated by '\t' stored as textfile;
Command: show tables;
Step 12: To exit from Hive:
Command: exit;

Q) Explain Aggregate Functions in Hive?


Hive supports the following built-in aggregate functions. The usage of these functions is the
same as the SQL aggregate functions.

Return Type - Signature - Description
BIGINT - count(*), count(expr) - Returns the total number of retrieved rows.
DOUBLE - sum(col), sum(DISTINCT col) - Returns the sum of the elements in the group or
the sum of the distinct values of the column in the group.
DOUBLE - avg(col), avg(DISTINCT col) - Returns the average of the elements in the group
or the average of the distinct values of the column in the group.
DOUBLE - min(col) - Returns the minimum value of the column in the group.
DOUBLE - max(col) - Returns the maximum value of the column in the group.
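A small HiveQL sketch combining these aggregates with GROUP BY and ORDER BY; the table employees(name, dept, salary) is assumed here only for illustration:

SELECT dept,
       count(*)    AS emp_count,
       sum(salary) AS total_salary,
       avg(salary) AS avg_salary,
       min(salary) AS min_salary,
       max(salary) AS max_salary
FROM employees
GROUP BY dept
ORDER BY avg_salary DESC;

GROUP BY groups the rows per department before the aggregates are applied, and ORDER BY sorts the final result by the computed average.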

Q) Explain HiveQL ?
HiveQL is the Hive query language
Hadoop is an open source framework for the distributed processing of large amounts of
data across a cluster. It relies upon the MapReduce paradigm to reduce complex tasks
into smaller parallel tasks that can be executed concurrently across multiple machines.
However, writing MapReduce tasks on top of Hadoop for processing data is not for
everyone since it requires learning a new framework and a new programming paradigm
altogether. What is needed is an easy-to-use abstraction on top of Hadoop that allows
people not familiar with it to use its capabilities as easily.

Hive aims to solve this problem by offering an SQL-like interface, called HiveQL, on
top of Hadoop. Hive achieves this task by converting queries written in HiveQL into
MapReduce tasks that are then run across the Hadoop cluster to fetch the desired
results

Hive is best suited for batch processing large amounts of data (such as in data
warehousing) but is not ideally suitable as a routine transactional database because of
its slow response times (it needs to fetch data from across a cluster).

A common task for which Hive is used is the processing of logs of web servers. These
logs have a regular structure and hence can be readily converted into a format that
Hive can understand and process

Hive query language (HiveQL) supports SQL features like CREATE tables, DROP
tables, SELECT ... FROM ... WHERE clauses, Joins (inner, left outer, right outer and
outer joins), Cartesian products, GROUP BY, SORT BY, aggregations, union and
many useful functions on primitive as well as complex data types. Metadata browsing
features such as list databases, tables and so on are also provided. HiveQL does have
limitations compared with traditional RDBMS SQL. HiveQL allows creation of new
tables in accordance with partitions (each table can have one or more partitions in
Hive) as well as buckets (the data in partitions is further distributed as buckets), and
allows insertion of data into single or multiple tables, but does not allow deletion or
updating of data.

HiveQL: Data Definition


First open the Hive console by typing:
$ hive
Once the Hive console is opened, the prompt looks like
hive>
and you can then run queries, e.g. to create a table.
1. Create and Show database
They are very useful for larger clusters with multiple teams and users, as a way of avoiding
table name collisions. It’s also common to use databases to organize production tables into
logical groups. If you don’t specify a database, the default database is used.
hive> CREATE DATABASE IF NOT EXISTS financials;
At any time, you can see the databases that already exist as follows:
hive> SHOW DATABASES;
output is
default
financials
hive> CREATE DATABASE human_resources;

hive> SHOW DATABASES;


output is
default
financials
human_resources

2. DESCRIBE database
- shows the directory location for the database.

hive> DESCRIBE DATABASE financials;


output is
hdfs://master-server/user/hive/warehouse/financials.db
3. USE database
The USE command sets a database as your working database, analogous to changing working directories in a
filesystem
hive> USE financials;
4. DROP database
you can drop a database:

hive> DROP DATABASE IF EXISTS financials;


The IF EXISTS is optional and suppresses warnings if financials doesn’t exist.

5. Alter Database
You can set key-value pairs in the DBPROPERTIES associated with a database using the
ALTER DATABASE command. No other metadata about the database can be
changed,including its name and directory location:

hive> ALTER DATABASE financials SET DBPROPERTIES ('edited-by' = 'active steps');


6. Create Tables
The CREATE TABLE statement follows SQL conventions, but Hive's version offers
significant extensions to support a wide range of flexibility in where the data files for tables
are stored, the formats used, etc.
 Managed Tables



 The tables we have created so far are called managed tables or sometimes internal
tables, because Hive controls the lifecycle of their data. As we have seen, Hive
stores the data for these tables in a subdirectory under the directory defined by
hive.metastore.warehouse.dir (e.g., /user/hive/warehouse), by default.
 When we drop a managed table, Hive deletes the data in the table.
 Managed tables are less convenient for sharing with other tools.
 External Tables
CREATE EXTERNAL TABLE IF NOT EXISTS stocks (
exchange STRING,
symbol STRING,
ymd STRING,
price_open FLOAT,
price_high FLOAT,
price_low FLOAT,
price_close FLOAT,
volume INT,
price_adj_close FLOAT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/stocks/';

The EXTERNAL keyword tells Hive this table is external, and the LOCATION ...
clause is required to tell Hive where it is located. Because it is external, Hive does not
assume it owns the data, so dropping the table deletes only the metadata, not the data itself.

Partitioned, Managed Tables

Partitioned tables help to organize data in a logical fashion, such as hierarchically.


Example: Our HR people often run queries with WHERE clauses that restrict the results to a
particular country or to a particular first-level subdivision (e.g., state in the United States or
province in Canada).
we have to use address.state to project the value inside the address. So, let’s partition the
data first by country and then by state:

CREATE TABLE IF NOT EXISTS mydb.employees (


name STRING,
salary FLOAT,
subordinates ARRAY<STRING>,
deductions MAP<STRING, FLOAT>,
address STRUCT<street:STRING, city:STRING, state:STRING, zip:INT>
)
PARTITIONED BY (country STRING, state STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\001'
COLLECTION ITEMS TERMINATED BY '\002'
MAP KEYS TERMINATED BY '\003'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;



Partitioning tables changes how Hive structures the data storage. If we create this table in the
mydb database, there will still be an employees directory for the table:

hdfs://master_server/user/hive/warehouse/mydb.db/employees

Data can then be loaded into a specific partition:

LOAD DATA LOCAL INPATH '/path/to/employees.txt'
INTO TABLE employees
PARTITION (country = 'US', state = 'IL');
Once created, the partition keys (country and state, in this case) behave like regular columns.
hive> SHOW PARTITIONS employees;
output is
OK
country=US/state=IL
Time taken: 0.145 seconds
Dropping Tables
The familiar DROP TABLE command from SQL is supported:
DROP TABLE IF EXISTS employees;
HiveQL: Data Manipulation
1. Loading Data into Managed Tables
Create stocks table
CREATE EXTERNAL TABLE IF NOT EXISTS stocks (
exchange STRING,
symbol STRING,
ymd STRING,
price_open FLOAT,
price_high FLOAT,
price_low FLOAT,
price_close FLOAT,
volume INT,
price_adj_close FLOAT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION '/data/stocks/';
Queries on Stock Data Set
Load the stocks data:
LOAD DATA LOCAL INPATH '/path/to/stocks.txt'
INTO TABLE stocks
PARTITION (exchange = 'NASDAQ', symbol = 'AAPL');
This command will first create the directory for the partition, if it doesn’t already exist,then
copy the data to it.
2. Inserting Data into Tables from Queries
INSERT OVERWRITE TABLE employees PARTITION (country = 'US', state = 'OR')
With OVERWRITE, any previous contents of the partition are replaced. If you drop the
keyword OVERWRITE or replace it with INTO, Hive appends the data rather than replaces
it.
HiveQL queries

1. SELECT … FROM Clauses


SELECT is the projection operator in SQL. The FROM clause identifies from which table,
view, or nested query we select records
Create employees



CREATE EXTERNAL TABLE employees (
name STRING,
salary FLOAT,
subordinates ARRAY<STRING>,
deductions MAP<STRING, FLOAT>,
address STRUCT<street:STRING, city:STRING, state:STRING, zip:INT>
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\001'
COLLECTION ITEMS TERMINATED BY '\002'
MAP KEYS TERMINATED BY '\003'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '/data/employees';

Load data

LOAD DATA LOCAL INPATH '/path/to/employees.txt'


INTO TABLE employees
PARTITION (country = 'US', state = 'IL');

Data in employee.txt is assumed as

Select data

hive> SELECT name, salary FROM employees;

output is

When you select columns that are one of the collection types, Hive uses JSON (JavaScript
Object Notation) syntax for the output. First, let's select the subordinates, an
ARRAY, where a comma-separated list surrounded with [...] is used.
hive> SELECT name, subordinates FROM employees;
output is

The deductions is a MAP, where the JSON representation for maps is used, namely a comma-
separated list of key:value pairs, surrounded with {…}:



hive> SELECT name, deductions FROM employees;
output is

Finally, the address is a STRUCT, which is also written using the JSON map format:
hive> SELECT name, address FROM employees;

Q) EXPLAIN USING JOINS IN HIVE ?


Hive supports joining two or more tables to aggregate information. The various joins
supported by Hive are:
1. Inner joins
2. Outer joins
1. Inner joins:
In case of inner joins, only the records satisfying the given condition get selected. All the
other records get discarded. The below figure illustrates the concept of inner joins.
Let us take an example to describe the concept of inner joins. Consider the two tables, Order
and Customer.

DATA OF THE ORDER TABLE


Order-id Cust-id

1 12
2 15
3 16
4 20
5 25
The above table contains the order id (order-id) and the corresponding cust id(cust-id)

DATA OF CUSTOMER TABLE:


Cust-id Cust-name
12 Bob
15 Joseph
20 Lisa
23 Monty
The above table contains the customer id(cust-id) and the corresponding customer name(cust-
name)
To know the names of the customers who have placed orders, we need to take the inner join
of the tables as follows:
SELECT O.Order-id, C.Cust-name
FROM Order O
JOIN
Customer C
ON (O.Cust-id = C.Cust-id);
The output of the previous command is as follows:
Order-id Cust-name
1 Bob
2 Joseph
4 Lisa
OUTER JOINS:
Sometimes, you need to retrieve all the records from one table and only some records from
the other table. In such cases, you have to use the outer joins.
Outer joins are of three types:
1. Right outer join
2. Left outer join
3. Full outer join
Right outer join:
In this type of join, all the records from the table on the right of the join are retained.
The query involving a right outer join can be written as follows:
SELECT O.Order-id, C.Cust-name
FROM Order O
RIGHT OUTER JOIN
Customer C
ON (O.Cust-id = C.Cust-id);
The output of the preceding query is as follows:
Order-id Cust-name
1 Bob
2 Joseph
4 Lisa
NULL Monty
You can see that the Order-id field for the customer 'Monty' is marked as NULL. It is
because there is no related record matching the given customer id in the Order table.
Left Outer Join:
In this type of join, all the records from the table on the left side of the join are retained.

The query involving left outer joins can be written as follows:


SELECT O.Order-id, C.Cust-name
FROM Order O
LEFT OUTER JOIN
Customer C
ON (O.Cust-id = C.Cust-id);



Here all the entries from the Order table are present in the output. A field is marked NULL
as there is no corresponding value for the key in the Customer table. The preceding query
gives the following output:
Order-id Cust-name
1 Bob
2 Joseph
3 NULL
4 Lisa
5 NULL
Full outer join:
In this case, all the fields from both tables are included; for the entries that do not have any
match, a NULL value is displayed.
The query involving a full outer join can be written as follows:
SELECT O.Order-id, C.Cust-name
FROM Order O
FULL OUTER JOIN
Customer C
ON (O.Cust-id = C.Cust-id);

The preceding query gives the following output:

Order-id Cust-name
1 Bob
2 Joseph
3 NULL
4 Lisa
5 NULL
NULL Monty
Cartesian product joins:
In Cartesian product joins, all the records of one table are combined with another table in
all possible combinations. This type of join does not involve any key column to join the
tables.
The following is a query using the Cartesian product join, which joins the Order table with
the Customer table in all possible combinations:
SELECT * FROM Order JOIN Customer;

Map-Side Joins:
In a join operation in Hive, the job is assigned to a MapReduce task that consists of two
stages: map and reduce. At the map stage, the data is read from the join tables and the
'join key' and 'join value' pairs are written to an intermediate file. This intermediate file is
then sorted and merged in the shuffle stage. At the reduce stage, the sorted result is taken as
input, and the joining task is completed.
A map-side join is similar to the normal join, but here all the tasks are performed by the
mapper alone. The map-side join is preferred when one of the tables is small. Suppose you
have two tables, of which one is a small table. When the MapReduce task is submitted, a
MapReduce local task will be created to read the data of the small table from HDFS and
store it into an in-memory hash table. After reading the data, the MapReduce local task
serializes the in-memory hash table into a hash table file.
In the next stage, the main join MapReduce task runs and moves the data from the hash table
file to the Hadoop distributed cache, which supplies these files to each mapper's local disk. It
means that all the mappers can load this hash table file back into memory and continue
with the join work.
In a map-side join, the small table is read only once, and if multiple mappers are running on
the same machine, the distributed cache needs to push only one copy of the hash table file to
that machine. Map-side joins help in improving the performance of a task by decreasing the
time to finish the task. Moreover, they help in minimising the cost involved in sorting and
merging in the shuffle and reduce stages.
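In older Hive versions a map-side join can be requested explicitly with the MAPJOIN hint (newer versions convert joins automatically when hive.auto.convert.join is enabled). The sketch below reuses the Order/Customer example, with table and column names rewritten as valid HiveQL identifiers (orders, customer, order_id, cust_id, cust_name), which are assumptions for illustration:

SELECT /*+ MAPJOIN(c) */ o.order_id, c.cust_name
FROM orders o
JOIN customer c
ON (o.cust_id = c.cust_id);

Because customer is the small table, it is loaded into memory on each mapper and the join completes without a reduce stage.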

Sub queries:
A Query present within a Query is known as a sub query. The main query will depend on the
values returned by the subqueries.
Subqueries can be classified into two types
 Subqueries in FROM clause
 Subqueries in WHERE clause
When to use:
 To get a particular value combined from two column values from different tables
 Dependency of one table values on other tables
 Comparative checking of one column values from other tables
Syntax:
Subquery in FROM clause
SELECT <column names 1, 2…n>From (SubQuery) <TableName_Main >
Subquery in WHERE clause
SELECT <column names 1, 2…n> From<TableName_Main>WHERE col1 IN (SubQuery);
Example:
SELECT col1 FROM (SELECT a+b AS col1 FROM t1) t2
Here t1 and t2 are table names. The part in parentheses is the subquery performed on table
t1. Here a and b are columns that are added in the subquery and assigned to col1. The
column "col1" produced by the subquery is the column that the outer (main) query selects
as col1.
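A sketch of the second form, a subquery in the WHERE clause (supported in Hive 0.13 and later); the orders and customer tables and their columns are assumed only for illustration:

SELECT cust_name
FROM customer
WHERE cust_id IN (SELECT cust_id FROM orders);

The outer query returns only those customers whose cust_id appears in the result of the inner query on orders.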



HBASE
Q) What is HBase ?

HBase is an open-source, column-oriented distributed database system in a Hadoop
environment. It is modeled after Google's Bigtable and is primarily written in Java. Apache
HBase is needed for real-time Big Data applications.

HBase can store massive amounts of data, from terabytes to petabytes. The tables present in
HBase consist of billions of rows having millions of columns.

HBase is a column-oriented database and can manage structured and unstructured data. It is
a NoSQL tool for accessing huge amounts of data using a non-relational data model.
HBase is a column-oriented NoSQL database. Although it looks similar to a relational
database which contains rows and columns, it is not a relational database. Relational
databases are row-oriented while HBase is column-oriented. So, let us first understand the
difference between column-oriented and row-oriented databases.

Row-oriented vs column-oriented Databases:


Row-oriented databases store table records in a sequence of rows. Whereas column-oriented
databases store table records in a sequence of columns, i.e. the entries in a column are stored
in contiguous locations on disks.
To better understand it, let us take an example and consider the table below.

If this table is stored in a row-oriented database. It will store the records as shown below:
1, Paul Walker, US, 231, Gallardo,
2, Vin Diesel, Brazil, 520, Mustang
In row-oriented databases data is stored on the basis of rows or tuples as you can see above.
While the column-oriented databases store this data as:
1,2, Paul Walker, Vin Diesel, US, Brazil, 231, 520, Gallardo, Mustang
In a column-oriented databases, all the column values are stored together like first column
values will be stored together, then the second column values will be stored together and data
in other columns are stored in a similar manner.

 When the amount of data is very huge, like in terms of petabytes or exabytes, we use
column-oriented approach, because the data of a single column is stored together and
can be accessed faster.
 A row-oriented approach comparatively handles a smaller number of rows and
columns efficiently, as a row-oriented database stores data in a structured format.
 When we need to process and analyze a large set of semi-structured or unstructured
data, we use column oriented approach. Such as applications dealing with Online



Analytical Processing like data mining, data warehousing, applications including
analytics, etc.
 Whereas, Online Transactional Processing such as banking and finance domains
which handle structured data and require transactional properties (ACID properties)
use row-oriented approach.
Relational Databases vs. HBase
When talking of data stores, we first think of Relational Databases with structured data
storage and a sophisticated query engine. However, a Relational Database incurs a big
penalty to improve performance as the data size increases. HBase, on the other hand, is
designed from the ground up to provide scalability and partitioning to enable efficient
data structure serialization, storage and retrieval. Broadly, the differences between a
Relational Database and HBase are:

HDFS vs. HBase


HDFS is a distributed file system that is well suited for storing large files. It’s designed to
support batch processing of data but doesn’t provide fast individual record lookups. HBase is
built on top of HDFS and is designed to provide access to single rows of data in large tables.
Overall, the differences between HDFS and HBase are

Q) Explain HBase Architecture and its components


HBase has four major components, i.e., the HMaster Server, the HBase Region Servers,
Regions, and ZooKeeper.
HBase architecture has a single HBase master node (HMaster) and several slaves i.e. region
servers. Each region server (slave) serves a set of regions, and a region can be served only by
a single region server. Whenever a client sends a write request, HMaster receives the request
and forwards it to the corresponding region server.
The below figure explains the hierarchy of the HBase Architecture. We will talk about each
one of them individually.

Region

A region contains all the rows between the start key and the end key assigned to that
region. HBase tables can be divided into a number of regions in such a way that all the
columns of a column family is stored in one region. Each region contains the rows in a
sorted order.

Many regions are assigned to a Region Server, which is responsible for handling, managing,
executing reads and writes operations on that set of regions.

 A table can be divided into a number of regions. A Region is a sorted range of rows
storing data between a start key and an end key.
 A Region has a default size of 256MB which can be configured according to the need.
 A Group of regions is served to the clients by a Region Server.
 A Region Server can serve approximately 1000 regions to the client.
Now starting from the top of the hierarchy, I would first like to explain the HMaster
Server, which acts similarly to the NameNode in HDFS.



Region Server runs on HDFS DataNode and consists of the following components –

 Block Cache – This is the read cache. Most frequently read data is stored in the read cache
and whenever the block cache is full, recently used data is evicted.
 MemStore- This is the write cache and stores new data that is not yet written to the disk.
Every column family in a region has a MemStore.
 Write Ahead Log (WAL) is a file that stores new data that is not persisted to permanent
storage.
 HFile is the actual storage file that stores the rows as sorted key values on a disk.

HMaster
As in the below image, you can see the HMaster handles a collection of Region Server which
resides on DataNode. Let us understand how HMaster does that.

 HBase HMaster performs DDL operations (create and delete tables) and
assigns regions to the Region servers as you can see in the above image.
 It coordinates and manages the Region Server (similar as NameNode manages
DataNode in HDFS).
 It assigns regions to the Region Servers on startup and re-assigns regions to Region
Servers during recovery and load balancing.

ZooKeeper



This below image explains the ZooKeeper’s coordination mechanism.

 Zookeeper acts like a coordinator inside HBase distributed environment. It helps in


maintaining server state inside the cluster by communicating through sessions.
 Every Region Server along with HMaster Server sends continuous heartbeat at regular
interval to Zookeeper and it checks which server is alive and available as mentioned
in above image. It also provides server failure notifications so that, recovery measures
can be executed.
 Referring to the above image you can see there is an inactive server, which acts as
a backup for the active server. If the active server fails, it comes to the rescue.
 The active HMaster sends heartbeats to ZooKeeper while the inactive HMaster
listens for the notifications sent by the active HMaster. If the active HMaster fails to send
a heartbeat, the session is deleted and the inactive HMaster becomes active.
 If a Region Server fails to send a heartbeat, the session is expired and all
listeners are notified about it. Then the HMaster performs suitable recovery actions.
 ZooKeeper also maintains the .META server's path, which helps any client
in searching for any region. The client first has to check with the .META server in which
Region Server a region belongs, and it gets the path of that Region Server.

Q) Explain Schema Design in HBase?


The HBase schema design is very different compared to relational database schema
design. Below are some general concepts that should be followed while designing a schema
in HBase:
 Row key: Each table in HBase is indexed on the row key. Data is sorted
lexicographically by this row key. There are no secondary indices available on an HBase
table.
 Atomicity: Avoid designing tables that require atomicity across all rows. All
operations on HBase rows are atomic at the row level.
 Even distribution: Reads and writes should be uniformly distributed across all nodes
available in the cluster. Design the row key in such a way that related entities are stored
in adjacent rows to increase read efficiency.
HBase Schema Row key, Column family, Column qualifier, individual and Row value Size Limit
Consider below is the size limit when designing schema in Hbase:
 Row keys: 4 KB per key
 Column families: not more than 10 column families per table



 Column qualifiers: 16 KB per qualifier
 Individual values: less than 10 MB per cell
 All values in a single row: max 10 MB

Reverse Domain Names


If you are storing data that is represented by domain names then consider using the reverse
domain name as the row key for your HBase tables, for example com.company.name.
This technique works perfectly fine when you have data spread across multiple reverse
domains. If you have very few reverse domains then you may end up storing data on a single
node, causing hotspotting.

Hashing
When you have the data which is represented by the string identifier, then that is good
choice for your Hbase table row key. Use hash of that string identifier as a row key instead
of raw string. For example, if you are storing user data that is identified by user ID’s then
hash of user ID is better choice for your row key.

Timestamps
When you retrieve data based on time when it was stored, it is best to include the
timestamp in your row key. For example, you are trying to store the machine log identified
by machine number then append the timestamp to the machine number when designing
row key, machine001#1435310751234.

Combined Row Key


You can combine multiple keys to design the row key for your HBase table based on your
requirements.
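A minimal Java sketch of such a combined row key, hashing a user ID for even distribution and appending a reversed timestamp so the newest entries sort first; the class name, key layout and salting scheme are illustrative assumptions, not a prescribed HBase API:

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

public class RowKeyBuilder {
    // Builds a row key of the form <hash-prefix>#<userId>#<reversedTimestamp>.
    // The hash prefix spreads writes across regions; the reversed timestamp
    // keeps the newest entries first within each user.
    public static String build(String userId, long timestampMillis) throws Exception {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        byte[] digest = md5.digest(userId.getBytes(StandardCharsets.UTF_8));
        StringBuilder prefix = new StringBuilder();
        // four bytes of the hash are enough for salting purposes
        for (int i = 0; i < 4; i++) {
            prefix.append(String.format("%02x", digest[i]));
        }
        long reversedTs = Long.MAX_VALUE - timestampMillis;
        return prefix + "#" + userId + "#" + reversedTs;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(build("user001", System.currentTimeMillis()));
    }
}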
HBase Column Families and Column Qualifiers
Below is some guidance on column families and column qualifiers:
Column Families
In HBase, keep to at most about 10 column families to get the best performance out of the
HBase cluster. If your row contains multiple values that are related to each other, you should
place them in the same column family. Also, the names of your column families should be
short, since they are included in the data that is transferred for each request.
Column Qualifiers
You can create as many column qualifiers as you need in each row. The empty cells in a
row do not consume any space. The names of your column qualifiers should be short,
since they are included in the data that is transferred for each request.
Creating HBase Schema Design
You can create the schema using Apache HBase shell or Java API’s:
Below is the example of create table schema:



hbase(main):001:0> create 'test_table_schema', 'cf'

0 row(s) in 2.7740 seconds

=> Hbase::Table - test_table_schema
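Continuing the example, a few rows can be inserted and read back from the HBase shell; the row key and values below are illustrative, reusing the reverse-domain and timestamp ideas from above:

hbase(main):002:0> put 'test_table_schema', 'com.company.name#1435310751234', 'cf:status', 'OK'
hbase(main):003:0> get 'test_table_schema', 'com.company.name#1435310751234'
hbase(main):004:0> scan 'test_table_schema', {LIMIT => 5}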

Q) What is Apache ZooKeeper?

Apache ZooKeeper is a software project of the Apache Software Foundation. It is an open-
source technology that maintains configuration information and provides synchronization
as well as group services, which are deployed on a Hadoop cluster to administer the
infrastructure.

The ZooKeeper framework was originally built at Yahoo! for easier accessing of applications
but, later on, ZooKeeper was used for organizing services used by distributed frameworks
like Hadoop, HBase, etc., and Apache ZooKeeper became a standard. It was designed to be a
vigorous service that enabled application developers to focus mainly on their application
logic rather than coordination.
In a distributed environment, coordinating and managing a service has become a difficult
process. Apache ZooKeeper was used to solve this problem because of its simple
architecture, as well as API, that allows developers to implement common coordination
tasks like electing a master server, managing group membership, and managing metadata.
Apache ZooKeeper is used for maintaining centralized configuration information, naming,
providing distributed synchronization, and providing group services in a simple interface so
that we don’t have to write it from scratch. Apache Kafka also uses ZooKeeper to manage
configuration. ZooKeeper allows developers to focus on the core application logic, and it
implements various protocols on the cluster so that the applications need not implement
them on their own.
ZooKeeper Architecture
Apache ZooKeeper works on the Client–Server architecture in which clients are machine
nodes and servers are nodes.
The following figure shows the relationship between the servers and their clients. In this, we
can see that each client sources the client library, and further they communicate with any of
the ZooKeeper nodes.



The components of the ZooKeeper architecture are explained below.

Part - Description

Client - The client node in our distributed application cluster is used to access information
from the server. It sends a message to the server to let the server know that the client is alive,
and if there is no response from the connected server the client automatically resends the
message to another server.

Server - The server gives an acknowledgement to the client to inform it that the server is
alive, and it provides all services to clients.

Leader - If any of the server nodes fails, this server node performs automatic recovery.

Follower - It is a server node which follows the instructions given by the leader.
Working of Apache ZooKeeper
 The first thing that happens as soon as the ensemble (a group of ZooKeeper servers) starts is
that it waits for the clients to connect to the servers.
 After that, a client in the ZooKeeper ensemble will connect to one of the nodes. That node
can be either a leader node or a follower node.
 Once the client is connected to a particular node, the node assigns a session ID to the client
and sends an acknowledgement to that particular client.
 If the client does not get any acknowledgement from the node, then it resends the message to
another node in the ZooKeeper ensemble and tries to connect with it.
 On receiving the acknowledgement, the client makes sure that the connection is not lost by
sending heartbeats to the node at regular intervals.
 Finally, the client can perform functions like reading, writing, or storing data as needed, as
shown in the sketch below.
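The steps above can be illustrated with a minimal Java sketch using the standard ZooKeeper
client API. The ensemble addresses, znode path, and data below are hypothetical placeholders,
and the example only shows the happy path (connect, write, read, close):

import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.Watcher.Event.KeeperState;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZooKeeperClientDemo {
    public static void main(String[] args) throws Exception {
        final CountDownLatch connected = new CountDownLatch(1);

        // Connect to the ensemble (hypothetical server list); the client library
        // picks one of the listed servers, which may be the leader or a follower.
        ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 3000,
                new Watcher() {
                    public void process(WatchedEvent event) {
                        if (event.getState() == KeeperState.SyncConnected) {
                            connected.countDown();   // session established, session ID assigned
                        }
                    }
                });
        connected.await();   // wait for the acknowledgement from the node

        // Write: create a znode if it does not exist yet.
        String path = "/demo_config";
        if (zk.exists(path, false) == null) {
            zk.create(path, "value-1".getBytes(),
                      ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }

        // Read the data back; heartbeats are sent automatically by the client library.
        byte[] data = zk.getData(path, false, null);
        System.out.println("Read from ZooKeeper: " + new String(data));

        zk.close();   // end the session
    }
}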

Features of Apache ZooKeeper



Apache ZooKeeper provides a wide range of good features to the user.

 Updating the Node’s Status: Apache ZooKeeper is capable of updating every node, which
allows it to store updated information about each node across the cluster.
 Managing the Cluster: This technology can manage the cluster in such a way that the status
of each node is maintained in real time, leaving less room for errors and ambiguity.
 Naming Service: ZooKeeper attaches a unique identification to every node, quite similar to
DNA, which helps identify the node.
 Automatic Failure Recovery: Apache ZooKeeper locks the data while it is being modified,
which helps the cluster recover automatically if a failure occurs in the database.

Benefits of Apache ZooKeeper

 Simplicity: Coordination is done with the help of a shared hierarchical namespace.


 Reliability: The system keeps performing even if more than one node fails.
 Order: It keeps track by stamping each update with a number denoting its order.
 Speed: It is especially fast in workloads where reads are more common than writes, with
read-to-write ratios of around 10:1.
 Scalability: The performance can be enhanced by deploying more machines.

ZooKeeper Use Cases


There are many use cases of ZooKeeper. Some of the most prominent of them are as
follows:



 Managing the configuration (see the sketch after this list)
 Naming services
 Choosing the leader
 Queuing messages
 Managing the notification system
 Synchronization
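Two of these use cases, managing the configuration and the notification system, can be sketched
with a small configuration watcher. This is only a minimal illustration, assuming a ZooKeeper
handle like the zk object from the earlier sketch and the same hypothetical /demo_config znode;
ZooKeeper watches are one-shot, so the watcher re-registers itself every time it reads the data:

import java.nio.charset.StandardCharsets;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.Watcher.Event.EventType;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class ConfigWatcher implements Watcher {
    private final ZooKeeper zk;
    private static final String CONFIG_PATH = "/demo_config";   // hypothetical znode

    public ConfigWatcher(ZooKeeper zk) throws Exception {
        this.zk = zk;
        readConfig();   // initial read also registers the first watch
    }

    // Read the configuration znode and re-register this object as its watcher.
    private void readConfig() throws Exception {
        Stat stat = new Stat();
        byte[] data = zk.getData(CONFIG_PATH, this, stat);
        System.out.println("Config (version " + stat.getVersion() + "): "
                + new String(data, StandardCharsets.UTF_8));
    }

    // ZooKeeper notifies this watcher once when the znode changes.
    @Override
    public void process(WatchedEvent event) {
        if (event.getType() == EventType.NodeDataChanged) {
            try {
                readConfig();   // pick up the new value and set a new watch
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }
}

Any client that changes /demo_config (for example with a setData call) will trigger the
NodeDataChanged notification on every watcher that is currently registered on that znode.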

One of the ways in which we can communicate with the ZooKeeper ensemble is by using the
ZooKeeper Command Line Interface (CLI). The CLI exposes various options and is also relied
upon heavily for debugging.

Applications of Zookeeper

a. Apache Solr

For leader election and centralized configuration, Apache Solr uses Zookeeper.
b. Apache Mesos

Apache Mesos is a cluster manager that offers efficient resource isolation and sharing across
distributed applications. Mesos uses ZooKeeper for its fault-tolerant replicated master.
c. Yahoo!

ZooKeeper was originally built at Yahoo! to meet requirements such as robustness, data
transparency, centralized configuration, better performance, and coordination.
d. Apache Hadoop

Apache Hadoop is the driving force behind the growth of the Big Data industry, and it relies
on ZooKeeper for configuration management and coordination.

Large Hadoop clusters are supported by multiple ZooKeeper servers, and each client machine
communicates with one of the ZooKeeper servers to retrieve and update synchronization
information. For example, the Human Genome Project holds terabytes of data and uses the
Hadoop MapReduce framework to analyze the dataset and find interesting facts for human
development.
e. Apache HBase

Apache HBase is an open-source, distributed, NoSQL database used for real-time read/write
access to large datasets. A distributed HBase installation depends on a running ZooKeeper
cluster.



In addition, Apache HBase uses ZooKeeper to track the status of distributed data across the
master and region servers. Some use cases of HBase are:
i. Telecom

One of them is the telecom industry, which stores billions of mobile call records and accesses
them in real time. It uses HBase to process all the records in real time, easily and
efficiently.
ii. Social network

Social networks like Twitter, LinkedIn, and Facebook receive huge volumes of data on a daily
basis, so they also use HBase to find recent trends and other interesting facts.
f. Apache Accumulo

Apache Accumulo, another sorted, distributed key/value store, is built on top of Apache
ZooKeeper (and Apache Hadoop).
g. Neo4j

Neo4j is a distributed graph database that uses ZooKeeper for write-master selection and
read-slave coordination.
h. Cloudera

Cloudera Search integrates search functionality with Hadoop and uses ZooKeeper for
centralized configuration management.

Q) What are the Benefits or Advantages of Big Data?


Benefits or Advantages of Big Data

Following are the benefits or advantages of Big Data:


➨Big data analysis derives innovative solutions. It helps in understanding and targeting
customers and in optimizing business processes.
➨It helps in improving science and research.
➨It improves healthcare and public health with the availability of patient records.
➨It helps in financial trading, sports, polling, security/law enforcement, etc.
➨Anyone can access vast information via surveys and deliver an answer to any query.

Drawbacks or disadvantages of Big Data


Following are the drawbacks or disadvantages of Big Data:
➨Traditional storage can cost a lot of money to store big data.
➨Lots of big data is unstructured.
➨Big data analysis violates principles of privacy.
➨It can be used for manipulation of customer records.
➨It may increase social stratification.
➨Big data analysis is not useful in the short run; it needs to be analyzed over a longer
duration to leverage its benefits.

