Lecture Notes PDF
In order to understand ‘Big Data’, we first need to understand ‘data’. The Oxford
dictionary defines 'data' as:
"The quantities, characters, or symbols on which operations are performed by
a computer, which may be stored and transmitted in the form of electrical
signals and recorded on magnetic, optical, or mechanical recording media. "
Big data analytics refers to the process of analysing huge volumes of data, or big
data. So ‘Big Data’ is still data, just at a huge scale: the term describes collections of
data that are huge in size and yet growing exponentially with time. Big data is
collected from a large assortment of sources, such as social networks, videos,
digital images, and sensors. The major aim of big data analytics is to discover new
patterns and relationships that might otherwise be invisible, and it can provide new
insights about the users who created the data.
It means that if you have a laptop and your data doesn’t fit on your laptop, that’s big
data for you. But if you have a very large firm with large clusters of storage, and
even then your data exceeds the storage capacity of your systems, then that’s big
data for you too.
Big data is not something you can pin down by saying that 50 GB or 50 TB would
make it big data. Whenever the amount of data a person, individual or firm holds
exceeds their storage capacity or their ability to analyse it, that data becomes big
data for them.
Big data is a term that refers to data sets or combinations of data sets whose size
(volume), complexity (variability), and rate of growth (velocity) make them
difficult to be captured, managed, processed or analysed by conventional
technologies and tools, such as relational databases and desktop statistics or
visualization packages, within the time necessary to make them useful.
While the size used to determine whether a particular data set is considered
big data is not firmly defined and continues to change over time, most analysts and
practitioners currently refer to data sets from 30-50 terabytes (10^12 bytes, i.e. 1000
gigabytes per terabyte) to multiple petabytes (10^15 bytes, i.e. 1000 terabytes per
petabyte) as big data.
The complex nature of big data is primarily driven by the unstructured
nature of much of the data that is generated by modern technologies, such as that
from web logs, radio frequency identification (RFID), sensors embedded in devices,
machinery, vehicles, Internet searches, social networks such as Facebook,
portable computers, smart phones and other cell phones, GPS devices, and call
centre records. In most cases, in order to effectively utilize big data, it must be
combined with structured data (typically from a relational database) from a more
conventional business application, such as Enterprise Resource Planning (ERP) or
Customer Relationship Management (CRM).
Similar to the complexity, or variability, aspect of big data, its rate of growth,
or velocity aspect, is largely due to the ubiquitous nature of modern online,
real-time data capture devices, systems, and networks. It is expected that the rate of
growth of big data will continue to increase for the foreseeable future. Specific new
big data technologies and tools have been and continue to be developed. Much of
the new big data technology relies heavily on massively parallel processing (MPP)
databases, which can concurrently distribute the processing of very large sets of
data across many servers. As another example, specific database query tools have
been developed for working with the massive amounts of unstructured data that
are being generated in big data environments.
The proliferation of smart phones and other GPS devices offers advertisers an
opportunity to target consumers when they are in close proximity to a store, a
coffee shop or a restaurant. This opens up new revenue for service providers and
offers many businesses a chance to target new customers.
Retailers usually know who buys their products. Use of social media and web
log files from their ecommerce sites can help them understand who didn’t buy
and why they chose not to, information not available to them today. This can enable
much more effective micro customer segmentation and targeted marketing
campaigns, as well as improve supply chain efficiencies.
Other widely-cited examples of the effective use of big data exist in the
following areas:
a) Using information technology (IT) logs to improve IT troubleshooting and
security breach detection, speed, effectiveness, and future occurrence
prevention.
b) Use of voluminous historical call centre information more quickly, in order
to improve customer interaction and satisfaction.
c) Use of social media content in order to better and more quickly understand
customer sentiment about you/your customers, and improve products,
services, and customer interaction.
d) Fraud detection and prevention in any industry that processes financial
transactions online, such as shopping, banking, investing, insurance and
health care claims.
e) Use of financial market transaction information to more quickly assess risk
and take corrective action.
Beyond simply being a lot of information, big data is now more precisely defined
by a set of characteristics. Those characteristics are commonly referred to as the
four Vs: Volume, Velocity, Variety and Veracity. A fifth V, Value, is often added as well.
Volume
The main characteristic that makes data “big” is the sheer volume. It makes no
sense to focus on minimum storage units because the total amount of information is
growing exponentially every year. In 2010, Thomson Reuters estimated in its
annual report that it believed the world was “awash with over 800 Exabytes of data
and growing.”
For that same year, EMC, a hardware company that makes data storage devices,
thought it was closer to 900 Exabytes and would grow by 50 percent every year.
No one really knows how much new data is being generated, but the amount of
information being collected is huge.
Variety
Variety is one of the most interesting developments in technology as more and more
information is digitized. Traditional data types (structured data) include things on
a bank statement like date, amount, and time. These are things that fit neatly in a
relational database.
Structured data is augmented by unstructured data, which is where things
like Twitter feeds, audio files, MRI images, web pages and web logs are put:
anything that can be captured and stored but doesn’t have a meta model (a set of
rules to frame a concept or idea; it defines a class of information and how to
express it) that neatly defines it.
Unstructured data is a fundamental concept in big data. The best way to understand
unstructured data is by comparing it to structured data. Think of structured data as
data that is well defined in a set of rules. For example, money will always be
numbers and have at least two decimal points; names are expressed as text; and
dates follow a specific pattern.
With unstructured data, on the other hand, there are no rules. A picture, a
voice recording, a tweet — they all can be different but express ideas and thoughts
based on human understanding. One of the goals of big data is to use technology
to take this unstructured data and make sense of it. The definition of big data
depends on whether the data can be ingested, processed, and examined in a time
that meets a particular business’s requirements. For one company or system, big
data may be 50TB; for another, it may be 10PB.
Veracity
Veracity refers to the trustworthiness of the data. Can the manager rely on the fact
that the data is representative? Every good manager knows that there are inherent
discrepancies in all the data collected.
Velocity
Velocity is the frequency of incoming data that needs to be processed. Think about
how many SMS messages, Facebook status updates, or credit card swipes are
being sent on a particular telecom carrier every minute of every day, and you’ll
have a good appreciation of velocity. A streaming application like Amazon Web
Services Kinesis is an example of an application that handles the velocity of data.
Value
It may seem painfully obvious to some, but a real objective is critical to this mashup
of the four V’s. Will the insights you gather from analysis create a new product line,
a cross-sell opportunity, or a cost-cutting measure? Or will your data analysis lead
to the discovery of a critical causal effect that results in a cure to a disease? The
ultimate objective of any big data project should be to generate some sort of value
for the company doing all the analysis. Otherwise, you’re just performing some
technological task for technology’s sake.
ERP, SCM, CRM, and transactional Web applications are classic examples of
systems processing Transactions. Highly structured data in these
systems is typically stored in SQL databases.
Interactions are about how people and things interact with each other or with
your business. Web Logs, User Click Streams, Social Interactions &
Feeds, and User-Generated Content are classic places to find Interaction
data.
Observational data tends to come from the “Internet of Things”. Sensors
for heat, motion, pressure and RFID and GPS chips within such things
as mobile devices, ATM machines, and even aircraft engines provide
just some examples of “things” that output Observation data.
Business
1. Opportunity to enable innovative new business models
2. Potential for new insights that drive competitive advantage
Technical
1. Data collected and stored continues to grow exponentially
2. Data is increasingly everywhere and in many formats
3. Traditional solutions are failing under new requirements
Financial
1. Cost of data systems, as a percentage of IT spend, continues to grow
2. Cost advantages of commodity hardware & open source software
As the technology that helps an organization to break down data silos and analyze
data improves, business can be transformed in all sorts of ways. According to
Datamation, today's advances in analyzing big data allow researchers to
decode human DNA in minutes, predict where terrorists plan to attack,
determine which gene is most likely to be responsible for certain diseases
and, of course, which ads you are most likely to respond to on Facebook.
Another example comes from one of the biggest mobile carriers in the world.
France's Orange launched its Data for Development project by releasing
subscriber data for customers in the Ivory Coast. The 2.5 billion records,
which were made anonymous, included details on calls and text messages
exchanged between 5 million users. Researchers accessed the data and sent
Orange proposals for how the data could serve as the foundation for development
projects to improve public health and safety. Proposed projects included one that
showed how to improve public safety by tracking cell phone data to map where
people went after emergencies; another showed how to use cellular data for
disease containment.
Notably, the business area getting the most attention relates to increasing
efficiency and optimizing operations. Specifically, 62 percent of respondents said
that they use big data analytics to improve speed and reduce complexity.
Retail traders, big banks, hedge funds and other so-called ‘big boys’ in the
financial markets use big data for trade analytics, including high-frequency
trading, pre-trade decision-support analytics, sentiment measurement, predictive
analytics, etc.
This industry also relies heavily on big data for risk analytics, including anti-money
laundering, on-demand enterprise risk management, "Know Your Customer", and
fraud mitigation.
Big Data providers specific to this industry include 1010data, Panopticon Software,
StreamBase Systems, NICE Actimize and Quartet FS.
3. Healthcare Providers
Industry-Specific challenges
The healthcare sector has access to huge amounts of data but has been plagued by
failures to use that data to curb rising healthcare costs and by inefficient
systems that stifle faster and better healthcare benefits across the board. This is
mainly due to the fact that electronic data is unavailable, inadequate, or unusable.
Additionally, the healthcare databases that hold health-related information have
made it difficult to link data that can show patterns useful in the medical field.
4. Education
Industry-Specific big data challenges
From a technical point of view, a major challenge in the education industry is to
incorporate big data from different sources and vendors and to utilize it on
platforms that were not designed for the varying data. From a practical point of
view, staff and institutions have to learn the new data management and analysis
tools.
On the technical side, there are challenges to integrate data from different sources,
on different platforms and from different vendors that were not designed to work
with one another.
Politically, issues of privacy and personal data protection associated with big data
used for educational purposes is a challenge.
In a different use case of big data in education, it is also used to measure
teachers’ effectiveness and to ensure a good experience for both students and
teachers. Teachers’ performance can be fine-tuned and measured against student
numbers, subject matter, student demographics, student aspirations, behavioural
classification and several other variables.
5. Manufacturing
Industry-Specific challenges
Similarly, large volumes of data from the manufacturing industry are untapped. The
underutilization of this information prevents improved quality of products, energy
efficiency, reliability, and better profit margins.
Big data has also been used in solving today’s manufacturing challenges and to
gain competitive advantage among other benefits.
A study by Deloitte shows the supply chain capabilities from big data
currently in use and their expected use in the future.
6. Government
Industry-Specific challenges
In governments the biggest challenges are the integration and interoperability of
big data across different government departments and affiliated organizations.
The Food and Drug Administration (FDA) is using big data to detect and study
patterns of food-related illnesses and diseases. This allows for a faster response,
which has led to faster treatment and fewer deaths. The Department of Homeland
Security uses big data for several different use cases. Big data is analyzed from
different government agencies and is used to protect the country.
7. Insurance
Industry-Specific challenges
Lack of personalized services, lack of personalized pricing and the lack of targeted
services to new segments and to specific market segments are some of the main
challenges.
Through massive data from digital channels and social media, real-time monitoring
of claims throughout the claims cycle has been used to provide insights.
8. Retail and Wholesale Trade
Industry-Specific challenges
From traditional brick and mortar retailers and wholesalers to current day e-
commerce traders, the industry has gathered a lot of data over time. This data,
derived from customer loyalty cards, POS scanners, RFID etc. is not being used
enough to improve customer experiences on the whole. Any changes and
improvements made have been quite slow.
Applications of big data in the Retail and Wholesale industry
Big data from customer loyalty data, POS, store inventory, local demographics data
continues to be gathered by retail and wholesale stores.
In New York’s Big Show retail trade conference in 2014, companies like Microsoft,
Cisco and IBM pitched the need for the retail industry to utilize big data for
analytics and for other uses including:
Optimized staffing through data from shopping patterns, local events, and so on
Reduced fraud
Timely analysis of inventory
Social media use also has a lot of potential use and continues to be slowly but surely
adopted especially by brick and mortar stores. Social media is used for customer
prospecting, customer retention, promotion of products, and more.
9. Transportation
Industry-Specific challenges
In recent times, huge amounts of data from location-based social networks and
high speed data from telecoms have affected travel behavior. Regrettably,
research to understand travel behaviour has not progressed as quickly.
In most places, transport demand models are still based on poorly understood new
social media structures.
10. Energy and Utilities
In utility companies, the use of big data also allows for better asset and workforce
management, which is useful for recognizing errors and correcting them as soon as
possible, before complete failure is experienced.
vii) MapReduce Algorithm
The MapReduce algorithm consists of three steps, or functions: Map, Shuffle, and Reduce.
Map Function:
The Map function is the first step in the MapReduce algorithm. It takes the input
tasks (datasets) and divides them into smaller sub-tasks, then performs the
required computation on each sub-task in parallel.
The output of the Map function is a set of key-value pairs of the form <Key, Value>.
Shuffle Function:
The Shuffle function is the second step; it consists of two sub-steps, Merging and
Sorting. The Merging step combines all key-value pairs that share the same key
into <Key, List<Value>> pairs. The Sorting step takes the input from the Merging
step and sorts all key-value pairs by key.
This step also returns <Key, List<Value>> output, but with sorted key-value pairs.
Finally, the Shuffle function passes the sorted list of <Key, List<Value>> pairs to the
next step.
Reduce Function:
The Reduce function is the final step in the MapReduce algorithm. It performs only
one step, the Reduce step: it takes the list of sorted <Key, List<Value>> pairs from
the Shuffle function and performs a reduce operation on each one.
The final step's output looks like the first step's output in form. However, the final
<Key, Value> pairs are different from the first step's <Key, Value> pairs: they are
the computed and sorted results. The difference between the first step's output
and the final output is easiest to see with a simple example, which is discussed in
the next section.
Those are the three steps of the MapReduce algorithm.
Example:
Problem Statement:
Count the number of occurrences of each word available in a DataSet.
Input DataSet
For simplicity, this example uses a small input dataset; real applications work with
very large amounts of data.
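The three steps described above can be sketched as a small, framework-free Java program for the word-count problem. This is a simulation of the algorithm, not Hadoop code; the sample dataset is illustrative, chosen here for the example:

```java
import java.util.*;

public class WordCountSimulation {

    // Map step: emit a <word, 1> pair for every word in every input line.
    static List<Map.Entry<String, Integer>> map(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines)
            for (String word : line.toLowerCase().split("\\s+"))
                if (!word.isEmpty())
                    pairs.add(Map.entry(word, 1));
        return pairs;
    }

    // Shuffle step: merge values that share a key, then sort by key,
    // producing <Key, List<Value>> pairs (a TreeMap keeps keys sorted).
    static SortedMap<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        SortedMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs)
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        return grouped;
    }

    // Reduce step: collapse each key's list of values into a single count.
    static SortedMap<String, Integer> reduce(SortedMap<String, List<Integer>> grouped) {
        SortedMap<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            int sum = 0;
            for (int one : entry.getValue()) sum += one;
            counts.put(entry.getKey(), sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Illustrative input dataset: three "lines" of text.
        List<String> dataSet = List.of("deer bear river", "car car river", "deer car bear");
        System.out.println(reduce(shuffle(map(dataSet))));
        // prints {bear=2, car=3, deer=2, river=2}
    }
}
```

Note how each function's output type matches the step descriptions: Map emits <Key, Value> pairs, Shuffle returns sorted <Key, List<Value>> pairs, and Reduce produces the final computed <Key, Value> pairs.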
That is all about the MapReduce algorithm. It’s time to start developing and testing
MapReduce programs.
2. INTRODUCTION TO Hadoop and Hadoop Architecture
i. Big Data - Apache Hadoop & Hadoop Ecosystem
ii. Moving data in and out of Hadoop – Understanding inputs and
outputs of MapReduce
iii. Data Serialization
HADOOP:
Hadoop is an Apache open source framework written in Java that allows
distributed processing of large datasets across clusters of computers using
simple programming models. A Hadoop framework application works in an
environment that provides distributed storage and computation across clusters of
computers. Hadoop is designed to scale up from single server to thousands of
machines, each offering local computation and storage.
Hadoop Architecture:
Hadoop framework includes the following four modules:
Hadoop Common: These are Java libraries and utilities required by other
Hadoop modules. These libraries provide filesystem and OS level
abstractions and contain the necessary Java files and scripts required to
start Hadoop.
Hadoop YARN: This is a framework for job scheduling and cluster
resource management.
Hadoop Distributed File System (HDFS™): A distributed file system that
provides high-throughput access to application data.
Hadoop MapReduce: This is a YARN-based system for parallel processing
of large data sets.
Since 2012, the term "Hadoop" often refers not just to the base modules
mentioned above but also to the collection of additional software packages that
can be installed on top of or alongside Hadoop, such as Apache Pig, Apache Hive,
Apache HBase, Apache Spark etc.
MapReduce:
Hadoop MapReduce is a software framework for easily writing applications which
process big amounts of data in-parallel on large clusters (thousands of nodes) of
commodity hardware in a reliable, fault-tolerant manner.
The term MapReduce actually refers to the following two different tasks that
Hadoop programs perform:
The Map Task: This is the first task, which takes input data and converts it
into a set of data, where individual elements are broken down into tuples
(key/value pairs).
The Reduce Task: This task takes the output from a map task as input and
combines those data tuples into a smaller set of tuples. The reduce task is
always performed after the map task.
Typically, both the input and the output are stored in a file-system. The
framework takes care of scheduling tasks, monitoring them and re-executes the
failed tasks.
The MapReduce framework consists of a single master Job Tracker and one
slave Task Tracker per cluster-node.
Job Tracker, the master is responsible for resource management, tracking
resource consumption/availability and scheduling the jobs component tasks on
the slaves, monitoring them and re-executing the failed tasks.
The slaves, Task Tracker execute the tasks as directed by the master and
provide task-status information to the master periodically.
The Job Tracker is a single point of failure for the Hadoop MapReduce service
which means if Job Tracker goes down, all running jobs are halted.
HDFS holds very large amounts of data and provides easy access. To store such
huge data, the files are stored across multiple machines. These files are stored in a
redundant fashion to rescue the system from possible data loss in case of
failure. HDFS also makes applications available for parallel processing.
Features of HDFS
It is suitable for distributed storage and processing.
Hadoop provides a command interface to interact with HDFS.
The built-in servers of NameNode and DataNode help users to easily
check the status of the cluster.
Streaming access to file system data.
HDFS provides file permissions and authentication.
HDFS Architecture
Given below is the architecture of a Hadoop File System.
HDFS follows the master-slave architecture and it has the following elements.
Namenode
The NameNode is the commodity hardware that contains the GNU/Linux
operating system and the NameNode software. It is a software that can be run on
commodity hardware. The system having the NameNode acts as the master
server and it does the following tasks:
Manages the file system namespace.
Regulates clients’ access to files.
It also executes file system operations such as renaming, closing, and
opening files and directories.
DataNode
The DataNode is a commodity hardware having the GNU/Linux operating
system and DataNode software. For every node (Commodity hardware/System)
in a cluster, there will be a DataNode. These nodes manage the data storage of
their system.
DataNodes perform read-write operations on the file systems, as per client
request.
Block
Generally, the user data is stored in the files of HDFS. A file in the file system will
be divided into one or more segments and/or stored in individual data nodes.
These file segments are called blocks. In other words, the minimum amount of
data that HDFS can read or write is called a block. The default block size is
64 MB in Hadoop 1.x (128 MB in Hadoop 2.x), but it can be changed as needed in
the HDFS configuration.
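The block arithmetic implied above is simple: the number of blocks a file occupies is the file size divided by the block size, rounded up. A plain-Java sketch (this is ordinary arithmetic, not a Hadoop API):

```java
public class BlockCount {
    // Number of HDFS blocks needed for a file: ceiling of size / blockSize.
    // The last block may be only partially filled.
    static long blocksFor(long fileSizeBytes, long blockSizeBytes) {
        return (fileSizeBytes + blockSizeBytes - 1) / blockSizeBytes;
    }

    public static void main(String[] args) {
        long MB = 1024L * 1024L;
        System.out.println(blocksFor(200 * MB, 64 * MB));  // 4 blocks (64+64+64+8 MB)
        System.out.println(blocksFor(200 * MB, 128 * MB)); // 2 blocks (128+72 MB)
    }
}
```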
Goals of HDFS
Fault detection and recovery: Since HDFS includes a large number of
commodity hardware, failure of components is frequent. Therefore, HDFS
should have mechanisms for quick and automatic fault detection and
recovery.
Core Hadoop:
HDFS:
HDFS stands for Hadoop Distributed File System, for managing big data sets with
high Volume, Velocity and Variety. HDFS implements a master-slave architecture:
the master is the NameNode and the slaves are the DataNodes.
Features:
• Scalable
• Reliable
• Commodity Hardware
HDFS is the best-known storage layer for Big Data.
Map Reduce:
MapReduce is a programming model designed to process high volumes of
distributed data. The platform is built using Java for better exception handling.
MapReduce includes two daemons, JobTracker and TaskTracker.
Features:
• Functional Programming.
• Works very well on Big Data.
• Can process large datasets.
Map Reduce is the main component known for processing big data.
YARN:
YARN stands for Yet Another Resource Negotiator. It is also called
MapReduce 2 (MRv2). The two major functionalities of the JobTracker in MRv1,
resource management and job scheduling/monitoring, are split into separate
daemons: the ResourceManager, the NodeManager and the per-application
ApplicationMaster.
Features:
• Better resource management.
• Scalability
• Dynamic allocation of cluster resources.
Data Access:
Pig:
Apache Pig is a high-level language built on top of MapReduce for analyzing
large datasets with simple ad hoc data analysis programs. Pig is also known as
a data flow language. It is well integrated with Python. It was initially developed
by Yahoo.
Salient features of pig:
• Ease of programming
• Optimization opportunities
• Extensibility.
Pig scripts internally will be converted to map reduce programs.
Hive:
Apache Hive is another high-level query language and data warehouse
infrastructure built on top of Hadoop, providing data summarization, query and
analysis. It was initially developed by Facebook and later made open source.
Salient features of hive:
• SQL like query language called HQL.
• Partitioning and bucketing for faster data processing.
• Integration with visualization tools like Tableau.
Hive queries internally will be converted to map reduce programs.
If you want to become a big data analyst, these two high-level languages are a
must-know!
Data Storage
HBase:
Apache HBase is a NoSQL database built for hosting large tables with billions of
rows and millions of columns on top of Hadoop commodity hardware machines.
Use Apache HBase when you need random, real-time read/write access to your
Big Data.
Features:
• Strictly consistent reads and writes. In memory operations.
• Easy to use Java API for client access.
• Well integrated with pig, hive and sqoop.
• Is a consistent and partition tolerant system in CAP theorem.
Cassandra:
Cassandra is a NoSQL database designed for linear scalability and high
availability. Cassandra is based on a key-value model. It was originally developed
at Facebook and is known for its fast response to queries.
Features:
• Column indexes
• Support for de-normalization
• Materialized views
• Powerful built-in caching.
Lucene:
Apache Lucene™ is a high-performance, full-featured text search engine library
written entirely in Java. It is a technology suitable for nearly any application that
requires full-text search, especially cross-platform.
Features:
• Scalable, high-performance indexing.
• Powerful, accurate and efficient search algorithms.
• Cross-platform solution.
Hama:
Apache Hama is a distributed framework based on Bulk Synchronous
Parallel (BSP) computing. It is capable of, and well known for, massive scientific
computations such as matrix, graph and network algorithms.
Features:
• Simple programming model
• Well suited for iterative algorithms
• YARN supported
• Unsupervised machine learning, e.g. collaborative filtering.
• K-Means clustering.
Crunch:
Apache Crunch is built for pipelining MapReduce programs that are simple and
efficient. The framework is used for writing, testing and running MapReduce
pipelines.
Features:
• Developer focused.
• Minimal abstractions
• Flexible data model.
Data Serialization:
Avro:
Apache Avro is a language-neutral data serialization framework. It is designed
for language portability, allowing data to potentially outlive the language used to
read and write it.
Thrift:
Thrift is a language developed to build interfaces to interact with technologies
built on Hadoop. It is used to define and create services for numerous languages.
Data Intelligence
Drill
Apache Drill is a low latency SQL query engine for Hadoop and NoSQL.
Features:
• Agility
• Flexibility
• Familiarity.
Mahout:
Apache Mahout is a scalable machine learning library designed for building
predictive analytics on Big Data. Mahout now has implementations on Apache
Spark for faster in-memory computing.
Features:
• Collaborative filtering.
• Classification
• Clustering
• Dimensionality reduction
Data Integration
Apache Sqoop:
Apache Sqoop is a tool designed for bulk data transfers between relational
databases and Hadoop.
Features:
• Import and export to and from HDFS.
• Import and export to and from Hive.
• Import and export to HBase.
Apache Flume:
Flume is a distributed, reliable, and available service for efficiently collecting,
aggregating, and moving large amounts of log data.
Features:
• Robust
• Fault tolerant
• Simple and flexible Architecture based on streaming data flows.
Apache Chukwa:
A scalable log collector used for monitoring large distributed file systems.
Features:
• Scales to thousands of nodes.
• Reliable delivery.
• Should be able to store data indefinitely.
During a MapReduce job, Hadoop sends the Map and Reduce tasks to the
appropriate servers in the cluster.
The framework manages all the details of data-passing such as issuing
tasks, verifying task completion, and copying data around the cluster
between the nodes.
Most of the computing takes place on nodes with data on local disks that
reduces the network traffic.
After completion of the given tasks, the cluster collects and reduces the
data to form an appropriate result, and sends it back to the Hadoop server.
Inputs and Outputs (Java Perspective) :
The MapReduce framework operates on <key, value> pairs, that is, the
framework views the input to the job as a set of <key, value> pairs and produces
a set of <key, value> pairs as the output of the job, conceivably of different types.
The key and value classes have to be serializable by the framework and hence
need to implement the Writable interface. Additionally, the key classes have to
implement the WritableComparable interface to facilitate sorting by the
framework. Input and output types of a MapReduce job: (Input) <k1, v1>
-> map -> <k2, v2> -> reduce -> <k3, v3> (Output).
Terminology
PayLoad - Applications implement the Map and the Reduce functions, and
form the core of the job.
Mapper - Mapper maps the input key/value pairs to a set of intermediate
key/value pair.
NamedNode - Node that manages the Hadoop Distributed File System
(HDFS).
DataNode - Node where data is presented in advance before any
processing takes place.
MasterNode - Node where JobTracker runs and which accepts job
requests from clients.
SlaveNode - Node where Map and Reduce program runs.
JobTracker - Schedules jobs and tracks the assigned jobs on the TaskTracker.
Task Tracker - Tracks the task and reports status to JobTracker.
Job - An execution of a Mapper and Reducer across a dataset.
Task - An execution of a Mapper or a Reducer on a slice of data.
Task Attempt - A particular instance of an attempt to execute a task on a
SlaveNode.
Example Scenario:
Given below is the data regarding the electrical consumption of an
organization. It contains the monthly electrical consumption and the annual
average for various years.
But, think of the data representing the electrical consumption of all the large scale
industries of a particular state, since its formation.
When we write applications to process such bulk data,
They will take a lot of time to execute.
There will be a heavy network traffic when we move data from source
to network server and so on.
Serialization in Java:
Java provides a mechanism, called object serialization where an object can be
represented as a sequence of bytes that includes the object's data as well as
information about the object's type and the types of data stored in the object.
After a serialized object is written into a file, it can be read from the file and
deserialized. That is, the type information and bytes that represent the object and
its data can be used to recreate the object in memory.
ObjectInputStream and ObjectOutputStream classes are used to serialize and
deserialize an object respectively in Java.
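The mechanism just described can be sketched in a few lines. The Employee class and its fields here are illustrative, and a byte array stands in for the file, but the ObjectOutputStream/ObjectInputStream calls are the standard Java API:

```java
import java.io.*;

public class SerializationDemo {

    // The class to serialize must implement java.io.Serializable.
    // (Employee and its fields are illustrative, not from the notes.)
    static class Employee implements Serializable {
        private static final long serialVersionUID = 1L;
        String name;
        int id;
        Employee(String name, int id) { this.name = name; this.id = id; }
    }

    // Serialize the object into bytes, then deserialize those bytes back
    // into a new, equivalent object in memory.
    static Employee roundTrip(Employee original) throws IOException, ClassNotFoundException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(original); // object -> sequence of bytes
        }
        try (ObjectInputStream in = new ObjectInputStream(
                new ByteArrayInputStream(bytes.toByteArray()))) {
            return (Employee) in.readObject(); // bytes -> recreated object
        }
    }

    public static void main(String[] args) throws Exception {
        Employee copy = roundTrip(new Employee("Alice", 42));
        System.out.println(copy.name + " " + copy.id); // prints "Alice 42"
    }
}
```

The recreated object is a distinct instance with the same field values, which is exactly the property distributed systems rely on when shipping objects between nodes.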
Serialization in Hadoop
Generally in distributed systems like Hadoop, the concept of serialization is used
for Interprocess Communication and Persistent Storage.
Interprocess Communication
To establish interprocess communication between the nodes connected
in a network, the RPC (Remote Procedure Call) technique is used.
RPC uses internal serialization to convert the message into binary format
before sending it to the remote node via the network. At the other end, the remote
system deserializes the binary stream into the original message.
The RPC serialization format is required to be as follows -
o Compact - To make the best use of network bandwidth, which is the most
scarce resource in a data center.
o Fast - Since the communication between the nodes is crucial in distributed
systems, the serialization and deserialization process should be quick, producing
less overhead.
o Extensible - Protocols change over time to meet new requirements, so it should
be straightforward to evolve the protocol in a controlled manner for clients and
servers.
o Interoperable - The message format should support the nodes that are written
in different languages.
Writable Interface
This is the interface in Hadoop which provides methods for serialization and
deserialization. It declares two methods -
write(DataOutput out) - Serializes the object's fields to the given output stream.
readFields(DataInput in) - Deserializes the object's fields from the given input
stream.
WritableComparable Interface
It is the combination of Writable and Comparable interfaces. This interface
inherits Writable interface of Hadoop as well as Comparable interface of Java.
Therefore, it provides methods for data serialization, deserialization, and
comparison.
These classes are useful to serialize various types of data in Hadoop. For instance,
let us consider the IntWritable class. Let us see how this class is used to serialize
and deserialize the data in Hadoop.
IntWritable Class
This class implements the Writable, Comparable, and
WritableComparable interfaces. It wraps an integer data type in it. This class
provides methods used to serialize and deserialize integer type of data.
Serializing the Data in Hadoop
The procedure to serialize the integer type of data is discussed below.
Instantiate the IntWritable class by wrapping an integer value in it.
Instantiate the ByteArrayOutputStream class.
Instantiate the DataOutputStream class and pass the object of the
ByteArrayOutputStream class to it.
Serialize the integer value in the IntWritable object using the write() method.
This method needs an object of the DataOutputStream class.
The serialized data will be stored in the byte array object which was passed as
a parameter to the DataOutputStream class at the time of instantiation. Convert
the data in the object to a byte array.
Example
The following example shows how to serialize data of integer type in Hadoop –
Deserializing the Data in Hadoop
The procedure to deserialize the integer type of data is discussed below -
Instantiate the IntWritable class.
Instantiate the ByteArrayInputStream class by passing the serialized byte
array to it.
Instantiate the DataInputStream class and pass the object of the
ByteArrayInputStream class to it.
Deserialize the data in the object of DataInputStream using the readFields()
method of the IntWritable class.
The deserialized data will be stored in the object of the IntWritable class. You
can retrieve this data using the get() method of this class.
Example
The following example shows how to deserialize the data of integer type in
Hadoop –
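The Hadoop classes themselves are not on hand in these notes, so the following JDK-only sketch mimics IntWritable's write()/readFields() pattern to show both procedures end to end; with the real org.apache.hadoop.io.IntWritable the steps are the same, only the class differs:

```java
import java.io.*;

// JDK-only stand-in for Hadoop's IntWritable: wraps an int and knows how to
// write itself to a DataOutput and read itself back from a DataInput.
public class IntWritableSketch {
    private int value;

    public IntWritableSketch(int value) { this.value = value; }

    public void write(DataOutput out) throws IOException { out.writeInt(value); }     // serialize
    public void readFields(DataInput in) throws IOException { value = in.readInt(); } // deserialize
    public int get() { return value; }

    // Full round trip: serialize into a byte array, then deserialize from it.
    public static int roundTrip(int n) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            new IntWritableSketch(n).write(new DataOutputStream(bos));

            IntWritableSketch copy = new IntWritableSketch(0);
            copy.readFields(new DataInputStream(
                new ByteArrayInputStream(bos.toByteArray())));
            return copy.get();
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip(42));
    }
}
```

Note how readFields() fills in an existing object rather than creating a new one; this is the reuse property discussed next.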
Advantage of Hadoop over Java Serialization
Hadoop’s Writable-based serialization is capable of reducing the object-creation
overhead by reusing the Writable objects, which is not possible with Java’s
native serialization framework.
Log file - In general, a log file is a file that lists events/actions that occur in an
operating system. For example, web servers list every request made to the
server in the log files.
Hadoop File System Shell provides commands to insert data into Hadoop and
read from it. You can insert data into Hadoop using the put command, for
example:
hadoop fs -put <local-source-path> <HDFS-destination-path>
We can use the put command of Hadoop to transfer data from these sources to
HDFS. But, it suffers from the following drawbacks -
Using the put command, we can transfer only one file at a time, while the data
generators generate data at a much higher rate. Since analysis made on older
data is less accurate, we need a solution to transfer data in real time.
If we use the put command, the data needs to be packaged and ready for
upload. Since web servers generate data continuously, this is a very
difficult task.
What we need here is a solution that can overcome the drawbacks of put
command and transfer the "streaming data" from data generators to centralized
stores (especially HDFS) with less delay.
Note - In POSIX file system, whenever we are accessing a file (say performing
write operation), other programs can still read this file (at least the saved portion
of the file). This is because the file exists on the disk before it is closed.
Available Solutions
To send streaming data (log files, events etc..,) from various sources to HDFS, we
have the following tools available at our disposal –
Facebook’s Scribe
Scribe is an immensely popular tool that is used to aggregate and stream log
data. It is designed to scale to a very large number of nodes and be robust to
network and node failures.
Apache Kafka
Kafka has been developed by Apache Software Foundation. It is an open-source
message broker. Using Kafka, we can handle feeds with high-throughput and
low-latency.
Apache Flume
Apache Flume is a tool/service/data ingestion mechanism for collecting,
aggregating, and transporting large amounts of streaming data, such as log data
and events, from various web servers to a centralized data store.
1. Flume Event
An event is the basic unit of the data transported inside Flume. It contains a
payload of byte array that is to be transported from the source to the destination
accompanied by optional headers. A typical Flume event would have the
following structure.
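The real org.apache.flume.Event interface is not reproduced here, but a minimal sketch of the shape described above (a byte-array payload plus optional string headers) could look like:

```java
import java.nio.charset.StandardCharsets;
import java.util.*;

// Minimal sketch of a Flume-style event: a byte[] body plus optional headers.
public class SimpleEvent {
    private final Map<String, String> headers = new HashMap<>();
    private final byte[] body;

    public SimpleEvent(String payload) {
        this.body = payload.getBytes(StandardCharsets.UTF_8);
    }

    public void setHeader(String key, String value) { headers.put(key, value); }
    public String getHeader(String key) { return headers.get(key); }
    public String bodyAsString() { return new String(body, StandardCharsets.UTF_8); }

    public static void main(String[] args) {
        SimpleEvent e = new SimpleEvent("GET /index.html 200"); // e.g. one log line
        e.setHeader("timestamp", "1625097600000");              // optional header
        System.out.println(e.bodyAsString());
    }
}
```

The body is opaque bytes to Flume; the headers carry routing metadata, which is what the multiplexing channel selectors described later inspect.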
2. Flume Agent
An agent is an independent daemon process (JVM) in Flume. It receives the
data (events) from clients or other agents and forwards it to its next
destination (sink or agent). Flume may have more than one agent. The following
diagram represents a Flume Agent. As shown in the diagram, a Flume Agent
contains three main components, namely source, channel, and sink.
i) Source
A source is the component of an Agent which receives data from the data
generators and transfers it to one or more channels in the form of Flume events.
Apache Flume supports several types of sources and each source receives
events from a specified data generator.
Example - Avro source, Thrift source, Twitter 1% source, etc.
ii) Channel
A channel is a transient store which receives the events from the source and
buffers them till they are consumed by sinks. It acts as a bridge between the
sources and the sinks.
These channels are fully transactional and they can work with any number of
sources and sinks.
Example - JDBC channel, File system channel, Memory channel, etc.
iii) Sink
A sink stores the data into centralized stores like HBase and HDFS. It consumes
the data (events) from the channels and delivers it to the destination. The
destination of the sink might be another agent or the central stores.
Example - HDFS sink
Note - A Flume agent can have multiple sources, sinks, and channels.
1) Interceptors
Interceptors are used to alter/inspect flume events which are transferred
between source and channel.
2) Channel Selectors
These are used to determine which channel to opt for when transferring the data
in case of multiple channels. There are two types of channel selectors -
Default channel selectors - These are also known as replicating channel
selectors; they replicate all the events in each channel.
Multiplexing channel selectors - These decide the channel to send an event to
based on the address in the header of that event.
3) Sink Processors
These are used to invoke a particular sink from the selected group of sinks. These
are used to create failover paths for your sinks or load balance events across
multiple sinks from a channel.
Flume Dataflow
Flume is a framework which is used to move log data into HDFS. Generally,
events and log data are generated by the log servers and these servers have
Flume agents running on them. These agents receive the data from the data
generators.
Finally, the data from all these collectors will be aggregated and pushed to a
centralized store such as HBase or HDFS. The following diagram explains the data
flow in Flume.
Multi-hop Flow
Within Flume, there can be multiple agents and before reaching the final
destination, an event may travel through more than one agent. This is known
as multi-hop flow.
Fan-out Flow
The dataflow from one source to multiple channels is known as fan-out flow. It is
of two types -
Replicating - The data flow where the data will be replicated in all the
configured channels.
Multiplexing - The data flow where the data will be sent to a selected channel
which is mentioned in the header of the event.
Fan-in Flow
The data flow in which the data will be transferred from many sources to one
channel is known as fan-in flow.
Failure Handling
In Flume, for each event, two transactions take place: one at the sender and one at
the receiver. The sender sends events to the receiver. Soon after receiving the
data, the receiver commits its own transaction and sends a “received” signal to
the sender. After receiving the signal, the sender commits its transaction. (Sender
will not commit its transaction till it receives a signal from the receiver.)
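This handshake can be modeled as a toy, single-threaded sketch (an illustration of the commit ordering only, not Flume's actual API):

```java
// Toy model of Flume's paired transactions: the sender commits only after the
// receiver has committed and acknowledged, so an event is never dropped in
// between the two hops.
public class HopTransaction {
    boolean senderCommitted = false;
    boolean receiverCommitted = false;

    // Receiver stores the event, commits its own transaction, and acks.
    boolean deliverToReceiver(String event) {
        receiverCommitted = true;   // receiver commits first
        return true;                // the "received" signal back to the sender
    }

    public void send(String event) {
        boolean ack = deliverToReceiver(event);
        if (ack) {
            senderCommitted = true; // sender commits only after the ack
        }
    }

    public static void main(String[] args) {
        HopTransaction hop = new HopTransaction();
        hop.send("log line");
        System.out.println(hop.senderCommitted && hop.receiverCommitted);
    }
}
```

If the ack never arrives, the sender's transaction stays open and the event can be re-delivered, which is the failure-handling guarantee described above.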
3. HDFS, HIVE AND HIVEQL, HBASE
HDFS-Overview, Installation and Shell, Java API; Hive Architecture and Installation,
Comparison with Traditional Database, HiveQL Querying Data, Sorting And
Aggregating, Map Reduce Scripts, Joins & Sub queries, HBase concepts, Advanced
Usage, Schema Design, Advance Indexing, PIG, Zookeeper , how it helps in
monitoring a cluster, HBase uses Zookeeper and how to Build Applications with
Zookeeper.
The term ‘Big Data’ is used for collections of large datasets that include huge
volume, high velocity, and a variety of data that is increasing day by day. Using
traditional data management systems, it is difficult to process Big Data. Therefore,
the Apache Software Foundation introduced a framework called Hadoop to solve
Big Data management and processing challenges.
Hadoop
Hadoop is an open-source framework to store and process Big Data in a distributed
environment. It contains two modules, one is MapReduce and another is Hadoop
Distributed File System (HDFS).
The Hadoop ecosystem contains different sub-projects (tools) such as Sqoop, Pig,
and Hive that are used to help Hadoop modules.
Sqoop: It is used to import and export data between HDFS and RDBMS.
Pig: It is a procedural language platform used to develop a script for
MapReduce operations.
Hive: It is a platform used to develop SQL type scripts to do MapReduce
operations.
What is Hive
Hive is a data warehouse infrastructure tool to process structured data in Hadoop.
It resides on top of Hadoop to summarize Big Data, and makes querying and
analyzing easy.
Initially Hive was developed by Facebook, later the Apache Software Foundation
took it up and developed it further as an open source under the name Apache Hive.
It is used by different companies. For example, Amazon uses it in Amazon Elastic
MapReduce.
Features of Hive
It stores the schema in a database and the processed data in HDFS.
It is designed for OLAP.
It provides SQL type language for querying called HiveQL or HQL.
It is familiar, fast, scalable, and extensible.
Architecture of Hive
The following component diagram depicts the architecture of Hive:
This component diagram contains different units. The following table describes
each unit:
Working of Hive
The following diagram depicts the workflow between Hive and Hadoop.
The following table defines how Hive interacts with Hadoop framework:
1. Execute Query - The Hive interface, such as the Command Line or Web UI,
sends the query to the Driver (any database driver such as JDBC, ODBC, etc.)
to execute.
2. Get Plan - The driver takes the help of the query compiler, which parses the
query to check the syntax and the query plan or the requirement of the query.
3. Get Metadata - The compiler sends a metadata request to the Metastore
(any database).
4. Send Metadata - The Metastore sends the metadata as a response to the
compiler.
5. Send Plan - The compiler checks the requirement and resends the plan to the
driver. Up to here, the parsing and compiling of the query is complete.
6. Execute Plan - The driver sends the execute plan to the execution engine.
7. Execute Job - If we execute any query, Hive internally converts that query
into a Hadoop MapReduce program, which shows that Hive runs on top of
Hadoop.
ADVANCED QUERY:
We can create a new table with data available in another table using:
Create table namewithcell as select name, cell from cellnumbers;
This creates a new table namewithcell and inserts the name and cell-number
data from the cellnumbers table.
Join:
Syntax
join_table:
Example
We will use the following two tables in this chapter. Consider the following table
named CUSTOMERS:
+----+----------+-----+-----------+----------+
| ID | NAME | AGE | ADDRESS | SALARY |
+----+----------+-----+-----------+----------+
| 1 | Ramesh | 32 | Ahmedabad | 2000.00 |
| 2 | Khilan | 25 | Delhi | 1500.00 |
| 3 | kaushik | 23 | Kota | 2000.00 |
| 4 | Chaitali | 25 | Mumbai | 6500.00 |
| 5 | Hardik | 27 | Bhopal | 8500.00 |
| 6 | Komal | 22 | MP | 4500.00 |
| 7 | Muffy | 24 | Indore | 10000.00 |
+----+----------+-----+-----------+----------+
Consider another table named ORDERS:
+-----+---------------------+-------------+--------+
|OID | DATE | CUSTOMER_ID | AMOUNT |
+-----+---------------------+-------------+--------+
| 102 | 2009-10-08 00:00:00 | 3 | 3000 |
| 100 | 2009-10-08 00:00:00 | 3 | 1500 |
| 101 | 2009-11-20 00:00:00 | 2 | 1560 |
| 103 | 2008-05-20 00:00:00 | 4 | 2060 |
+-----+---------------------+-------------+--------+
JOIN
LEFT OUTER JOIN
RIGHT OUTER JOIN
FULL OUTER JOIN
JOIN
The JOIN clause is used to combine and retrieve records from multiple tables.
JOIN is the same as INNER JOIN in SQL. A JOIN condition is raised using the
primary keys and foreign keys of the tables.
The following query executes JOIN on the CUSTOMERS and ORDERS tables, and
retrieves the records:
hive> SELECT c.ID, c.NAME, c.AGE, o.AMOUNT
FROM CUSTOMERS c JOIN ORDERS o
ON (c.ID = o.CUSTOMER_ID);
On successful execution of the query, you get to see the following response:
+----+----------+-----+--------+
| ID | NAME | AGE | AMOUNT |
+----+----------+-----+--------+
| 3 | kaushik | 23 | 3000 |
| 3 | kaushik | 23 | 1500 |
| 2 | Khilan | 25 | 1560 |
| 4 | Chaitali | 25 | 2060 |
+----+----------+-----+--------+
A LEFT JOIN returns all the values from the left table, plus the matched values from
the right table, or NULL in case of no matching JOIN predicate.
The following query demonstrates LEFT OUTER JOIN between the CUSTOMERS
and ORDERS tables:
hive> SELECT c.ID, c.NAME, o.AMOUNT, o.DATE
FROM CUSTOMERS c
LEFT OUTER JOIN ORDERS o
ON (c.ID = o.CUSTOMER_ID);
On successful execution of the query, you get to see the following response:
+----+----------+--------+---------------------+
| ID | NAME | AMOUNT | DATE |
+----+----------+--------+---------------------+
| 1 | Ramesh | NULL | NULL |
| 2 | Khilan | 1560 | 2009-11-20 00:00:00 |
| 3 | kaushik | 3000 | 2009-10-08 00:00:00 |
| 3 | kaushik | 1500 | 2009-10-08 00:00:00 |
| 4 | Chaitali | 2060 | 2008-05-20 00:00:00 |
| 5 | Hardik | NULL | NULL |
| 6 | Komal | NULL | NULL |
| 7 | Muffy | NULL | NULL |
+----+----------+--------+---------------------+
A RIGHT JOIN returns all the values from the right table, plus the matched values
from the left table, or NULL in case of no matching join predicate.
The following query demonstrates RIGHT OUTER JOIN between the CUSTOMERS
and ORDERS tables:
hive> SELECT c.ID, c.NAME, o.AMOUNT, o.DATE
FROM CUSTOMERS c
RIGHT OUTER JOIN ORDERS o
ON (c.ID = o.CUSTOMER_ID);
On successful execution of the query, you get to see the following response:
+------+----------+--------+---------------------+
| ID | NAME | AMOUNT | DATE |
+------+----------+--------+---------------------+
| 3 | kaushik | 3000 | 2009-10-08 00:00:00 |
| 3 | kaushik | 1500 | 2009-10-08 00:00:00 |
| 2 | Khilan | 1560 | 2009-11-20 00:00:00 |
| 4 | Chaitali | 2060 | 2008-05-20 00:00:00 |
+------+----------+--------+---------------------+
The following query demonstrates FULL OUTER JOIN between the CUSTOMERS
and ORDERS tables:
hive> SELECT c.ID, c.NAME, o.AMOUNT, o.DATE
FROM CUSTOMERS c
FULL OUTER JOIN ORDERS o
ON (c.ID = o.CUSTOMER_ID);
On successful execution of the query, you get to see the following response:
+------+----------+--------+---------------------+
| ID | NAME | AMOUNT | DATE |
+------+----------+--------+---------------------+
| 1 | Ramesh | NULL | NULL |
| 2 | Khilan | 1560 | 2009-11-20 00:00:00 |
| 3 | kaushik | 3000 | 2009-10-08 00:00:00 |
| 3 | kaushik | 1500 | 2009-10-08 00:00:00 |
| 4 | Chaitali | 2060 | 2008-05-20 00:00:00 |
| 5 | Hardik | NULL | NULL |
| 6 | Komal | NULL | NULL |
| 7 | Muffy | NULL | NULL |
| 3 | kaushik | 3000 | 2009-10-08 00:00:00 |
| 3 | kaushik | 1500 | 2009-10-08 00:00:00 |
| 2 | Khilan | 1560 | 2009-11-20 00:00:00 |
| 4 | Chaitali | 2060 | 2008-05-20 00:00:00 |
+------+----------+--------+---------------------+
HBase
HBase is a data model, similar to Google’s Bigtable, designed to provide quick
random access to huge amounts of structured data.
Since 1970, RDBMS has been the solution for data storage and maintenance
related problems. After the advent of big data, companies realized the benefit of
processing big data and started opting for solutions like Hadoop.
Hadoop uses a distributed file system for storing big data, and MapReduce to
process it. Hadoop excels at storing and processing huge volumes of data in
various formats: structured, semi-structured, and even unstructured.
Limitations of Hadoop
Hadoop can perform only batch processing, and data will be accessed only in a
sequential manner. That means one has to search the entire dataset even for the
simplest of jobs.
A huge dataset when processed results in another huge data set, which should also
be processed sequentially. At this point, a new solution is needed to access any
point of data in a single unit of time (random access).
What is HBase?
HBase is a distributed column-oriented database built on top of the Hadoop file
system. It is an open-source project and is horizontally scalable.
It leverages the fault tolerance provided by the Hadoop File System (HDFS).
One can store the data in HDFS either directly or through HBase. Data consumer
reads/accesses the data in HDFS randomly using HBase. HBase sits on top of the
Hadoop File System and provides read and write access.
Features of HBase
HBase is linearly scalable.
It has automatic failure support.
It provides consistent reads and writes.
It integrates with Hadoop, both as a source and a destination.
It has an easy Java API for clients.
It provides data replication across clusters.
Applications of HBase
It is used for write-heavy applications.
HBase is used whenever we need to provide fast random access to available
data.
Companies such as Facebook, Twitter, Yahoo, and Adobe use HBase
internally.
Zookeeper - Overview
ZooKeeper is a distributed coordination service to manage a large set of hosts.
Coordinating and managing a service in a distributed environment is a
complicated process. ZooKeeper solves this issue with its simple architecture
and API.
ZooKeeper allows developers to focus on core application logic without worrying
about the distributed nature of the application.
The ZooKeeper framework was originally built at “Yahoo!” for accessing their
applications in an easy and robust manner. Later, Apache ZooKeeper became a
standard for organized service used by Hadoop, HBase, and other distributed
frameworks. For example, Apache HBase uses ZooKeeper to track the status of
distributed data.
Distributed Application
A distributed application can run on multiple systems in a network at a given time
(simultaneously) by coordinating among themselves to complete a particular task
in a fast and efficient manner. Normally, complex and time-consuming tasks
that would take hours to complete on a non-distributed application (running on
a single system) can be done in minutes by a distributed application using the
computing capabilities of all the systems involved.
The time to complete the task can be further reduced by configuring the
distributed application to run on more systems. A group of systems in which a
distributed application is running is called a Cluster and each machine running in
a cluster is called a Node.
A distributed application has two parts, Server and Client application. Server
applications are actually distributed and have a common interface so that clients
can connect to any server in the cluster and get the same result. Client applications
are the tools to interact with a distributed application.
Benefits of Distributed Applications
Reliability − Failure of a single or a few systems does not cause the whole
system to fail.
Scalability − Performance can be increased as and when needed by adding
more machines, with a minor change in the configuration of the application
and no downtime.
Transparency − Hides the complexity of the system and shows itself as a
single entity / application.
Distributed applications offer a lot of benefits, but they also pose a few complex
and hard-to-crack challenges. The ZooKeeper framework provides a complete
mechanism to overcome these challenges. Race conditions and deadlocks are
handled using a fail-safe synchronization approach. Another main drawback is
inconsistency of data, which ZooKeeper resolves with atomicity.
Benefits of ZooKeeper
Here are the benefits of using ZooKeeper −
Architecture of ZooKeeper
Take a look at the following diagram. It depicts the “Client-Server Architecture” of
ZooKeeper.
Each one of the components that is a part of the ZooKeeper architecture has been
explained in the following table.
Part - Description
Leader - Server node which performs automatic recovery if any of the
connected nodes fails. Leaders are elected on service startup.
Architecture of HBase
In HBase, tables are split into regions and are served by the region servers.
Regions are vertically divided by column families into “Stores”. Stores are saved
as files in HDFS. Shown below is the architecture of HBase.
Note: The term ‘store’ is used for regions to explain the storage structure.
HBase has three major components: the client library, a master server, and region
servers. Region servers can be added or removed as per requirement.
MasterServer
The master server -
Assigns regions to the region servers and takes the help of Apache
ZooKeeper for this task.
Handles load balancing of the regions across region servers. It unloads the
busy servers and shifts the regions to less occupied servers.
Maintains the state of the cluster by negotiating the load balancing.
Is responsible for schema changes and other metadata operations such as
creation of tables and column families.
Regions
Regions are nothing but tables that are split up and spread across the region
servers.
Region server
When we take a deeper look into the region server, it contains regions and stores
as shown below:
The store contains the MemStore and HFiles. The MemStore is just like a cache
memory. Anything that is entered into HBase is stored here initially. Later, the
data is transferred and saved in HFiles as blocks, and the MemStore is flushed.
Zookeeper
Zookeeper is an open-source project that provides services like maintaining
configuration information, naming, providing distributed synchronization,
etc.
Zookeeper has ephemeral nodes representing different region servers.
Master servers use these nodes to discover available servers.
In addition to availability, the nodes are also used to track server failures or
network partitions.
Clients communicate with region servers via zookeeper.
In pseudo and standalone modes, HBase itself will take care of zookeeper.
Apache Spark Overview:
Apache Spark is a cluster-computing platform that provides an API for distributed
programming similar to the MapReduce model, but is designed to be fast for
interactive queries and iterative algorithms.
Apache Spark is a general framework for distributed computing that offers high
performance for both batch and interactive processing. It exposes APIs for Java,
Python, and Scala and consists of Spark core and several related projects:
• Spark SQL - Module for working with structured data. Allows you to seamlessly
mix SQL queries with Spark
programs.
• Spark Streaming - API that allows you to build scalable fault-tolerant streaming
applications.
• MLlib - API that implements common machine learning algorithms.
• GraphX - API for graphs and graph-parallel computation
Speed:
Apache Spark is claimed to be 10x to 100x faster than Hadoop due to its use of
in-memory processing.
Roughly speaking, a job that runs entirely in memory can be about 100 times
faster, while a disk-based process is about 10 times faster.
Memory:
Apache Spark stores data in memory, whereas Hadoop MapReduce stores data
on hard disk. So there is more usage of main memory in Apache Spark, whereas
Hadoop keeps whatever data it has on hard disk.
RDD:
The Resilient Distributed Dataset (RDD) is the primary data abstraction in
Apache Spark and the core of Spark. RDDs provide guaranteed fault tolerance.
On the other hand, Apache Hadoop replicates data in multiple copies to achieve
fault tolerance.
Streaming:
Apache Spark supports streaming with very little administration. This makes it
much easier to use than Hadoop for real-time stream processing.
API:
Spark provides a versatile API that can be used with multiple data sources as
well as languages. It can be used with Java, Scala, Python, and more.
At a high level, every Spark application consists of a driver program that runs the
user’s main function and executes various parallel operations on a cluster. The main
abstraction Spark provides is a resilient distributed dataset (RDD), which is a
collection of elements partitioned across the nodes of the cluster that can be
operated on in parallel. RDDs are created by starting with a file in the Hadoop file
system (or any other Hadoop-supported file system), or an existing Scala collection
in the driver program, and transforming it. Users may also ask Spark to persist an
RDD in memory, allowing it to be reused efficiently across parallel operations.
Finally, RDDs automatically recover from node failures.
Decomposing the name RDD:
Resilient, i.e. fault-tolerant with the help of the RDD lineage graph (DAG), and
so able to recompute missing or damaged partitions after node failures.
Distributed, since the data resides on multiple nodes.
Dataset represents the records of the data you work with. The user can load the
data set externally; it can be a JSON file, a CSV file, a text file, or a database via
JDBC, with no specific data structure.
Let us first discuss how MapReduce operations take place and why they are
not so efficient.
MapReduce is widely adopted for processing and generating large datasets with a
parallel, distributed algorithm on a cluster. It allows users to write parallel
computations, using a set of high-level operators, without having to worry about
work distribution and fault tolerance.
Data sharing is slow in MapReduce due to replication, serialization, and disk IO.
Regarding the storage system, most Hadoop applications spend more than 90%
of their time doing HDFS read-write operations.
The illustration given below shows the iterative operations on Spark RDD. It will
store intermediate results in a distributed memory instead of Stable storage (Disk)
and make the system faster.
This illustration shows interactive operations on Spark RDD. If different queries are
run on the same set of data repeatedly, this particular data can be kept in memory
for better execution times.
All transformations in Spark are lazy, in that they do not compute their results right
away. Instead, they just remember the transformations applied to some base
dataset (e.g. a file). The transformations are only computed when an action
requires a result to be returned to the driver program. This design enables Spark
to run more efficiently. For example, we can realize that a dataset created through
map will be used in a reduce and return only the result of the reduce to the driver,
rather than the larger mapped dataset.
By default, each transformed RDD may be recomputed each time you run an action
on it. However, you may also persist an RDD in memory using the persist (or cache)
method, in which case Spark will keep the elements around on the cluster for much
faster access the next time you query it. There is also support for persisting RDDs
on disk, or replicated across multiple nodes.
Consider the classic example from the Spark programming guide:
JavaRDD<String> lines = sc.textFile("data.txt");
JavaRDD<Integer> lineLengths = lines.map(s -> s.length());
int totalLength = lineLengths.reduce((a, b) -> a + b);
The first line defines a base RDD from an external file. This dataset is not loaded in
memory or otherwise acted on: lines is merely a pointer to the file. The second line
defines lineLengths as the result of a map transformation. Again, lineLengths is not
immediately computed, due to laziness. Finally, we run reduce, which is an action.
At this point Spark breaks the computation into tasks to run on separate machines,
and each machine runs both its part of the map and a local reduction, returning
only its answer to the driver program.
If we also wanted to use lineLengths again later, we could add
lineLengths.persist(StorageLevel.MEMORY_ONLY());
before the reduce, which would cause lineLengths to be saved in memory after the
first time it is computed.
5. NoSQL
I. What is it?
II. Where It is Used, Types of NoSQL databases,
III. Why NoSQL?
IV. Advantages of NoSQL
V. Use of NoSQL in Industry
VI. SQL vs NoSQL, NewSQL
i) What is NoSQL?
A NoSQL database environment is, simply put, a non-relational and largely
distributed database system that enables rapid, ad-hoc organization and
analysis of extremely high-volume, disparate data types.
NoSQL databases are sometimes referred to as cloud databases,
nonrelational databases, Big Data databases and a myriad of other terms and
were developed in response to the sheer volume of data being generated,
stored and analyzed by modern users (user-generated data) and their
applications (machine-generated data).
In general, NoSQL databases have become the first alternative to relational
databases, with scalability, availability, and fault tolerance being key
deciding factors.
They go well beyond the more widely understood legacy, relational
databases (such as Oracle, SQL Server and DB2 databases) in satisfying the
needs of today’s modern business applications.
A very flexible and schema-less data model, horizontal scalability,
distributed architectures, and the use of languages and interfaces that are
“not only” SQL typically characterize this technology.
From a business standpoint, adopting a NoSQL or ‘Big Data’ environment
has been shown to provide a clear competitive advantage in numerous
industries. In the ‘age of data’, this is compelling: the importance of data is
summed up in the saying, “if your data isn’t growing, then neither is your
business”.
1) Key-Value Stores
Such databases are the simplest to implement. They are mostly preferred
when one is working with complex data that is difficult to model. Also, they
are the best in situations where write performance (rapid recording of
data) is prioritized. The third environment where these databases prevail is
when data is accessed by key.
Notable examples of NoSQL databases in this category include Voldemort,
Tokyo Cabinet, Redis, and Amazon Dynamo.
Key-value stores are used in such projects like:
Amazon’s Shopping Cart (Amazon Dynamo is used)
Mozilla Test Pilot
Rhino DHT
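Stripped to its essence (this is a sketch of the access pattern only, not any particular product's API), a key-value store is just put/get by key:

```java
import java.util.*;

// Minimal in-memory key-value store sketch: every read and write goes through
// a key, which is what makes this model fast for workloads like shopping carts.
public class KeyValueStore {
    private final Map<String, String> data = new HashMap<>();

    public void put(String key, String value) { data.put(key, value); }
    public String get(String key) { return data.get(key); }

    public static void main(String[] args) {
        KeyValueStore store = new KeyValueStore();
        store.put("cart:user42", "[book, laptop]"); // write is one keyed insert
        System.out.println(store.get("cart:user42")); // read is one keyed lookup
    }
}
```

Real systems such as Redis or Dynamo add persistence, replication, and partitioning on top of this same keyed interface.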
2) Column Family Stores
Column Family Stores are also referred to as distributed peer stores. They
are designed to handle huge amounts of data distributed over many servers.
Like Key-Value Stores, they use keys. However, the key points to multiple
columns of the database. Columns here are organized by column family:
Examples of databases in this category include Cassandra and HBase.
Google’s Bigtable also falls into this category, though it is not distributed
outside the Google platform.
Typical applications that have implemented Column Family Store
databases include:
Google Earth and Google Maps (Google’s Bigtable)
Ebay
The New York Times
Comcast
Hulu
Databases in this category are suitable for applications with distributed file
systems. Their key strengths for distributed applications lie in their
distributed storage and retrieval capabilities.
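The wide-column data model can be sketched in plain JavaScript: each row key maps to column families, each holding its own set of named columns. The function and key names here are illustrative, not a real database’s API:

```javascript
// row key -> column family -> column -> value
const table = new Map();

function putColumn(rowKey, family, column, value) {
  if (!table.has(rowKey)) table.set(rowKey, new Map());
  const row = table.get(rowKey);
  if (!row.has(family)) row.set(family, new Map());
  row.get(family).set(column, value);
}

function getColumn(rowKey, family, column) {
  // missing rows/families/columns simply yield undefined
  return table.get(rowKey)?.get(family)?.get(column);
}

// One key points to multiple columns, organized by family:
putColumn("user:1", "profile", "name", "Ada");
putColumn("user:1", "profile", "email", "ada@example.com");
putColumn("user:1", "activity", "last_login", "2024-01-01");
console.log(getColumn("user:1", "profile", "name")); // prints Ada
```

Because all columns of a family live together, a cluster can shard rows across servers by key while keeping related columns co-located.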
3) Document Databases
These categories of database manage document-oriented data (semi-
structured data). Here, data may be represented in formats similar to JSON.
Each document has an arbitrary set of properties that may differ from other
documents in the same collection:
Examples of databases under this category include: MongoDB and
CouchDB.
The following applications are practical examples where the above
document databases have been used:
LinkedIn
Dropbox Mailbox
Friendsell.com
Memorize.com
Document databases are suitable for use in applications that need to keep
data in a complicated multi-level format without a fixed schema for each
record. They are especially good for straightforward mapping of business
models to database entities.
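The schema-less property can be shown directly: in this plain-JavaScript sketch a collection is just an array of JSON-like objects, each with its own set of fields (all names and values are assumed examples):

```javascript
// Three documents in one "collection", each with a different shape --
// there is no fixed schema they must all conform to.
const users = [
  { _id: 1, name: "Kyle" },                             // minimal document
  { _id: 2, name: "Ada", emails: ["ada@example.com"] }, // extra field
  { _id: 3, name: "Lin", address: { city: "Pune" } }    // nested field
];

// Query by a property that only some documents have:
const withEmail = users.filter(u => Array.isArray(u.emails));
console.log(withEmail.length); // prints 1
```

This is why document stores map business objects to database entities so directly: the stored shape is the application’s own object shape.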
4) Graph Databases
These databases keep data in the forms of nodes, properties and edges.
Nodes stand for objects whose data we want to store, properties represent
the features of those objects, and edges show the relationships between
those objects (nodes). In this representation, adjacent nodes point to each
other directly (the edges may be directed or indirected):
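A minimal graph model can be sketched in plain JavaScript: nodes carry properties, and directed edges connect nodes to each other directly, so traversal follows pointers rather than joins. All names below are illustrative:

```javascript
// node id -> { properties, out: [outgoing edges] }
const nodes = new Map();

function addNode(id, properties) {
  nodes.set(id, { properties, out: [] });
}
function addEdge(from, to, label) {
  // a directed edge; an undirected graph would add the reverse edge too
  nodes.get(from).out.push({ to, label });
}

addNode("alice", { age: 30 });
addNode("bob", { age: 25 });
addEdge("alice", "bob", "FRIEND");

// Traverse: who is alice directly connected to?
const friends = nodes.get("alice").out.map(e => e.to);
console.log(friends); // prints [ 'bob' ]
```

Adjacency is stored with the node itself, which is why relationship-heavy queries (friends of friends, shortest paths) are the model’s strength.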
Database
Database is a physical container for collections. Each database gets its own set of
files on the file system. A single MongoDB server typically has multiple databases.
Collection
A collection is a group of MongoDB documents; it is the equivalent of an RDBMS
table. Collections do not enforce a schema, so documents within a collection can
have different fields.
Document
A document is a set of key-value pairs. Documents have a dynamic schema,
meaning that documents in the same collection do not need to have the same set
of fields or structure.
The following table shows the relationship of RDBMS terminology with MongoDB.
RDBMS           MongoDB
Database        Database
Table           Collection
Tuple/Row       Document
Column          Field
Table Join      Embedded Documents
Primary Key     Primary Key (default key _id provided by MongoDB itself)
iv) Indexes
The best way to understand database indexes is by analogy: many books have
indexes matching keywords to page numbers. Suppose you have a cookbook and
want to find all recipes calling for pears (maybe you have a lot of pears and don’t
want them to go bad). The time-consuming approach would be to page through
every recipe, checking each ingredient list for pears. Most people would prefer to
check the book’s index for the pears entry, which would give a list of all the recipes
containing pears.
Database indexes are data structures that provide this same service. Indexes
in MongoDB are implemented as a B-tree data structure. B-tree indexes, also used
in many relational databases, are optimized for a variety of queries, including
range scans and queries with sort clauses.
Most databases give each document or row a primary key, a unique identifier
for that datum. The primary key is generally indexed automatically so that each
datum can be efficiently accessed using its unique key, and MongoDB is no
different. But not every database allows you to also index the data inside that row
or document. These are called secondary indexes. Many NoSQL databases, such as
HBase, are considered key-value stores because they don’t allow any secondary
indexes. This is a significant feature in MongoDB; by permitting multiple secondary
indexes, MongoDB allows users to optimize for a wide variety of queries.
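The cookbook analogy translates directly into code: a secondary index is essentially a map from a field’s value to the documents containing it. This plain-JavaScript sketch (with assumed example data) omits what a real B-tree adds, namely ordered traversal for range scans:

```javascript
const recipes = [
  { _id: 1, title: "Pear tart", ingredients: ["pear", "flour"] },
  { _id: 2, title: "Apple pie", ingredients: ["apple", "flour"] },
  { _id: 3, title: "Pear salad", ingredients: ["pear", "greens"] }
];

// Build a secondary index on the ingredients field:
// ingredient -> list of _ids, like a book index's page numbers.
const index = new Map();
for (const r of recipes) {
  for (const ing of r.ingredients) {
    if (!index.has(ing)) index.set(ing, []);
    index.get(ing).push(r._id);
  }
}

// The lookup avoids scanning every recipe:
console.log(index.get("pear")); // prints [ 1, 3 ]
```

The trade-off, as in a book, is that the index must be updated whenever a recipe is added or changed.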
v) Replication
MongoDB provides database replication via a topology known as a replica set.
Replica sets distribute data across two or more machines for redundancy and
automate failover in the event of server and network outages. Additionally,
replication is used to scale database reads. If you have a read-intensive
application, as is commonly the case on the web, it’s possible to spread database
reads across machines in the replica set cluster.
vi) Scaling
The easiest way to scale most databases is to upgrade the hardware. If your
application is running on a single node, it’s usually possible to add some
combination of faster disks, more memory, and a beefier CPU to ease any database
bottlenecks. The technique of augmenting a single node’s hardware for scale is
known as vertical scaling, or scaling up. Vertical scaling has the advantages of
being simple, reliable, and cost-effective up to a certain point, but eventually you
reach a point where it’s no longer feasible to move to a better machine.
It then makes sense to consider scaling horizontally, or scaling out (see
figure). Instead of beefing up a single node, scaling horizontally means distributing
the database across multiple machines. A horizontally scaled architecture can run
on many smaller, less expensive machines, often reducing your hosting costs.
Machines will unavoidably fail from time to time. If you’ve scaled vertically
and the machine fails, then you need to deal with the failure of a machine on which
most of your system depends. This may not be an issue if a copy of the data exists
on a replicated slave, but it’s still the case that only a single server need fail to
bring down the entire system. Contrast that with failure inside a horizontally scaled
architecture. This may be less catastrophic because a single machine represents a
much smaller percentage of the system as a whole.
MongoDB was designed to make horizontal scaling manageable.
Core server
The core database server runs via an executable called mongod (mongodb.exe on
Windows). The mongod server process receives commands over a network socket
using a custom binary protocol. Most of our MongoDB production servers are run
on Linux because of its reliability, wide adoption, and excellent tools.
mongod can be run in several modes, such as a standalone server or a
member of a replica set. Replication is recommended when you’re running
MongoDB in production, and you generally see replica set configurations consisting
of two replicas plus a mongod running in arbiter mode.
Configuring a mongod process is relatively simple; it can be accomplished
both with command-line arguments and with a text configuration file. To see these
configurations, you can run mongod --help.
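For illustration, a standalone mongod might be started with command-line arguments like the following; the port, data path, and log path here are assumed example values, not taken from the text:

```shell
# assumed example paths; run mongod --help for the full option list
mongod --port 27017 --dbpath /data/db --logpath /var/log/mongodb/mongod.log
```

The same settings can equivalently be placed in a text configuration file passed via --config.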
JavaScript shell
The MongoDB command shell is a JavaScript-based tool for administering the
database and manipulating data. The mongo executable loads the shell and
connects to a specified mongod process, or one running locally by default. The
shell was developed to be similar to the MySQL shell; the biggest differences are
that it’s based on JavaScript and SQL isn’t used. For instance, you can pick your
database and then insert a simple document into the users collection like this:
> use tutorial
> db.users.insert({name: "Kyle"})
The first command indicates which database you want to use. The second
command is a JavaScript expression that inserts a simple document. To see the
results of your insert, you can issue a simple query:
> db.users.find()
{ _id: ObjectId("4ba667b0a90578631c9caea0"), name: "Kyle" }
The find method returns the inserted document, with an object ID added. All
documents require a primary key stored in the _id field. You’re allowed to enter a
custom _id as long as you can guarantee its uniqueness. But if you omit the _id
altogether, a MongoDB object ID will be inserted automatically.
In addition to allowing you to insert and query for data, the shell permits you
to run administrative commands. Some examples include viewing the current
database operation, checking the status of replication to a secondary node, and
configuring a collection for sharding.
Database drivers
The MongoDB drivers are easy to use. The driver is the code used in an application
to communicate with a MongoDB server. All drivers have functionality to query,
retrieve results, write data, and run database commands.
Command-line tools
MongoDB is bundled with several command-line utilities:
mongoexport and mongoimport—Export and import JSON, CSV, and TSV data; this
is useful if you need your data in widely supported formats. mongoimport can also
be good for initial imports of large data sets, although before importing, it’s often
desirable to adjust the data model to take best advantage of MongoDB. In such
cases, it’s easier to import the data through one of the drivers using a custom script.
mongostat—Similar to iostat, this utility constantly polls MongoDB and the system
to provide helpful stats, including the number of operations per second (inserts,
queries, updates, deletes, and so on), the amount of virtual memory allocated, and
the number of connections to the server.
mongotop—Similar to top, this utility polls MongoDB and shows the amount of time
it spends reading and writing data in each collection.
show dbs
We create a user for our database, with its password and role.
Now, first create a collection. A collection is similar to a table in a relational
database.
Now, observe that, just to insert the gender field in the above example, we
had to write all three fields: first_name, last_name and gender. If we write
only gender, the update will replace the entire document with one containing
only the gender field.
To remove a field from a document, we can use the $unset operator. This
will remove the age field from the document where first_name is Steven.
What happens if we try to update a document which is not present in the
collection? If we want a new document to be created when no match is
found, this can be done by using ‘upsert’. With upsert enabled, a new
document with first_name: “Marry” is created.
We can also rename any field using the $rename operator. Here we rename
‘gender’ to ‘sex’.
To remove a particular document, perform a remove query with a matching
condition. A plain remove with the condition first_name: “Steven” will delete all
documents with that name; to remove only the first matching document, pass the
justOne option.
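The update semantics described above ($set, $unset, $rename, and upsert) can be modelled in plain JavaScript. This is a behavioural sketch, not MongoDB’s implementation, and the collection and field names are assumed examples:

```javascript
// Apply a MongoDB-style update document to one plain object.
function applyUpdate(doc, update) {
  for (const [field, value] of Object.entries(update.$set || {})) {
    doc[field] = value;        // $set: change only the named fields
  }
  for (const field of Object.keys(update.$unset || {})) {
    delete doc[field];         // $unset: remove a field
  }
  for (const [from, to] of Object.entries(update.$rename || {})) {
    doc[to] = doc[from];       // $rename: move the value to a new name
    delete doc[from];
  }
  return doc;
}

// Upsert: update the first match, or insert a new document if none.
function updateOne(collection, query, update, { upsert = false } = {}) {
  const doc = collection.find(d =>
    Object.entries(query).every(([k, v]) => d[k] === v));
  if (doc) return applyUpdate(doc, update);
  if (upsert) {
    const fresh = applyUpdate({ ...query }, update);
    collection.push(fresh);
    return fresh;
  }
  return null; // no match and no upsert: nothing happens
}

const students = [{ first_name: "Steven", age: 30 }];
updateOne(students, { first_name: "Steven" }, { $unset: { age: "" } });
updateOne(students, { first_name: "Marry" },
          { $set: { gender: "female" } }, { upsert: true });
console.log(students.length); // prints 2
```

Note how a plain document without operators would have replaced the match entirely; the operators are what make partial updates possible.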
> for (i = 0; i < 20000; i++) { db.numbers.save({num: i}) }
That’s a lot of documents, so don’t be surprised if the insert takes a few seconds to
complete. Once it returns, you can run a couple of queries to verify that all the
documents are present:
> db.numbers.count()
20000
> db.numbers.find()
{ "_id": ObjectId("4bfbf132dba1aa7c30ac830a"), "num": 0 }
{ "_id": ObjectId("4bfbf132dba1aa7c30ac830b"), "num": 1 }
{ "_id": ObjectId("4bfbf132dba1aa7c30ac830c"), "num": 2 }
{ "_id": ObjectId("4bfbf132dba1aa7c30ac830d"), "num": 3 }
{ "_id": ObjectId("4bfbf132dba1aa7c30ac830e"), "num": 4 }
{ "_id": ObjectId("4bfbf132dba1aa7c30ac830f"), "num": 5 }
{ "_id": ObjectId("4bfbf132dba1aa7c30ac8310"), "num": 6 }
{ "_id": ObjectId("4bfbf132dba1aa7c30ac8311"), "num": 7 }
{ "_id": ObjectId("4bfbf132dba1aa7c30ac8312"), "num": 8 }
{ "_id": ObjectId("4bfbf132dba1aa7c30ac8313"), "num": 9 }
{ "_id": ObjectId("4bfbf132dba1aa7c30ac8314"), "num": 10 }
{ "_id": ObjectId("4bfbf132dba1aa7c30ac8315"), "num": 11 }
{ "_id": ObjectId("4bfbf132dba1aa7c30ac8316"), "num": 12 }
{ "_id": ObjectId("4bfbf132dba1aa7c30ac8317"), "num": 13 }
{ "_id": ObjectId("4bfbf132dba1aa7c30ac8318"), "num": 14 }
{ "_id": ObjectId("4bfbf132dba1aa7c30ac8319"), "num": 15 }
{ "_id": ObjectId("4bfbf132dba1aa7c30ac831a"), "num": 16 }
{ "_id": ObjectId("4bfbf132dba1aa7c30ac831b"), "num": 17 }
{ "_id": ObjectId("4bfbf132dba1aa7c30ac831c"), "num": 18 }
{ "_id": ObjectId("4bfbf132dba1aa7c30ac831d"), "num": 19 }
The count() command shows that you’ve inserted 20,000 documents. The
subsequent query displays the first 20 results (this number may be different in your
shell). You can display additional results with the it command:
> it
{ "_id": ObjectId("4bfbf132dba1aa7c30ac831e"), "num": 20 }
{ "_id": ObjectId("4bfbf132dba1aa7c30ac831f"), "num": 21 }
{ "_id": ObjectId("4bfbf132dba1aa7c30ac8320"), "num": 22 }
...
The it command instructs the shell to return the next result set. With a sizable set of
documents available, let’s try a couple of queries. Given what you know about
MongoDB’s query engine, a simple query matching a document on its num
attribute makes sense:
> db.numbers.find({num: 500})
RANGE QUERIES
More interestingly, you can also issue range queries using the special $gt and $lt
operators. They stand for greater than and less than, respectively. Here’s how you
query for all documents with a num value greater than 19,995:
> db.numbers.find({num: {$gt: 19995}})
You can also combine the two operators to specify upper and lower boundaries:
> db.numbers.find({num: {$gt: 19995, $lt: 19998}})
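The semantics of $gt and $lt can be sketched in plain JavaScript; this is an illustrative filter over sample documents, not MongoDB’s query engine:

```javascript
// Evaluate a MongoDB-style range condition against one document.
function matches(doc, field, cond) {
  const v = doc[field];
  if ("$gt" in cond && !(v > cond.$gt)) return false;
  if ("$lt" in cond && !(v < cond.$lt)) return false;
  return true; // both operators in one condition AND together
}

// 20 sample documents, shaped like the numbers collection above:
const numbers = Array.from({ length: 20 }, (_, i) => ({ num: i }));

// num greater than 15:
const above = numbers.filter(d => matches(d, "num", { $gt: 15 }));
console.log(above.length); // prints 4

// combine upper and lower bounds: 5 < num < 10
const both = numbers.filter(d => matches(d, "num", { $gt: 5, $lt: 10 }));
console.log(both.map(d => d.num)); // prints [ 6, 7, 8, 9 ]
```

Both bounds are exclusive, which is why 5 and 10 themselves are not in the combined result.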
Refer theory from Pages 73, 74 and 75… of the book “MongoDB in Action”, Kyle
Banker, Peter Bakkum, Shaun Verch, Dreamtech Press.
Refer theory from Page 98… of the book “MongoDB in Action”, Kyle Banker,
Peter Bakkum, Shaun Verch, Dreamtech Press.
Refer theory from Page 103… of the book “MongoDB in Action”, Kyle Banker,
Peter Bakkum, Shaun Verch, Dreamtech Press.