Lecture Notes PDF
In order to understand ‘Big Data’, we first need to understand ‘data’. The Oxford
dictionary defines 'data' as:
"The quantities, characters, or symbols on which operations are performed by
a computer, which may be stored and transmitted in the form of electrical
signals and recorded on magnetic, optical, or mechanical recording media. "
Big data analytics refers to the process of analysing huge volumes of data, or big
data. So ‘Big Data’ is still data, just at a huge scale: the term describes collections of
data that are huge in size and yet growing exponentially with time. Big data is
collected from a large assortment of sources, such as social networks, videos,
digital images, and sensors. The major aim of big data analytics is to discover new
patterns and relationships that might otherwise be invisible, and it can provide new
insights about the users who created the data.
It means that if you have a laptop and your data doesn’t fit on your laptop, that’s big
data for you. But if you have a very large firm with large clusters of storage, and
even then your data exceeds the storage capacity of your systems, then that’s big
data for you too.
Big data is not something you can pin down by saying that 50 GB or 50 TB would
make it big data. Whenever the amount of data a person, individual or firm holds
exceeds their storage capacity or their ability to analyse it, that data becomes big
data for them.
Big data is a term that refers to data sets or combinations of data sets whose size
(volume), complexity (variability), and rate of growth (velocity) make them
difficult to be captured, managed, processed or analysed by conventional
technologies and tools, such as relational databases and desktop statistics or
visualization packages, within the time necessary to make them useful.
While the size used to determine whether a particular data set is considered
big data is not firmly defined and continues to change over time, most analysts and
practitioners currently refer to data sets from 30-50 terabytes (10^12 bytes, i.e. 1000
gigabytes per terabyte) to multiple petabytes (10^15 bytes, i.e. 1000 terabytes per
petabyte) as big data.
The complex nature of big data is primarily driven by the unstructured
nature of much of the data that is generated by modern technologies, such as that
from web logs, radio frequency identification (RFID), sensors embedded in devices,
machinery, vehicles, Internet searches, social networks such as Facebook,
portable computers, smart phones and other cell phones, GPS devices, and call
centre records. In most cases, in order to effectively utilize big data, it must be
combined with structured data (typically from a relational database) from a more
conventional business application, such as Enterprise Resource Planning (ERP) or
Customer Relationship Management (CRM).
Similar to the complexity, or variability, aspect of big data, its rate of growth,
or velocity aspect, is largely due to the ubiquitous nature of modern online,
real-time data capture devices, systems, and networks. It is expected that the rate of
growth of big data will continue to increase for the foreseeable future. Specific new
big data technologies and tools have been and continue to be developed. Much of
the new big data technology relies heavily on massively parallel processing (MPP)
databases, which can concurrently distribute the processing of very large sets of
data across many servers. As another example, specific database query tools have
been developed for working with the massive amounts of unstructured data that
are being generated in big data environments.
The proliferation of smart phones and other GPS devices offers advertisers an
opportunity to target consumers when they are in close proximity to a store, a
coffee shop or a restaurant. This opens up new revenue for service providers and
offers many businesses a chance to target new customers.
Retailers usually know who buys their products. Use of social media and web
log files from their ecommerce sites can help them understand who didn’t buy
and why they chose not to, information not available to them today. This can enable
much more effective micro customer segmentation and targeted marketing
campaigns, as well as improve supply chain efficiencies.
Other widely-cited examples of the effective use of big data exist in the
following areas:
a) Using information technology (IT) logs to improve IT troubleshooting and
security breach detection, speed, effectiveness, and future occurrence
prevention.
b) Use of voluminous historical call centre information more quickly, in order
to improve customer interaction and satisfaction.
c) Use of social media content in order to better and more quickly understand
customer sentiment about you/your customers, and improve products,
services, and customer interaction.
d) Fraud detection and prevention in any industry that processes financial
transactions online, such as shopping, banking, investing, insurance and
health care claims.
e) Use of financial market transaction information to more quickly assess risk
and take corrective action.
Beyond simply being a lot of information, big data is now more precisely defined
by a set of characteristics. Those characteristics are commonly referred to as the
four Vs: Volume, Velocity, Variety and Veracity. A fifth V, Value, is often added as well.
Volume
The main characteristic that makes data “big” is the sheer volume. It makes no
sense to focus on minimum storage units because the total amount of information is
growing exponentially every year. In 2010, Thomson Reuters estimated in its
annual report that it believed the world was “awash with over 800 Exabytes of data
and growing.”
For that same year, EMC, a hardware company that makes data storage devices,
thought it was closer to 900 Exabytes and would grow by 50 percent every year.
No one really knows how much new data is being generated, but the amount of
information being collected is huge.
Variety
Variety is one of the most interesting developments in technology as more and more
information is digitized. Traditional data types (structured data) include things on
a bank statement like date, amount, and time. These are things that fit neatly in a
relational database.
Structured data is augmented by unstructured data, which is where things
like Twitter feeds, audio files, MRI images, web pages and web logs are put:
anything that can be captured and stored but doesn’t have a meta model (a set of
rules to frame a concept or idea; it defines a class of information and how to
express it) that neatly defines it.
Unstructured data is a fundamental concept in big data. The best way to understand
unstructured data is by comparing it to structured data. Think of structured data as
data that is well defined in a set of rules. For example, money will always be
numbers and have at least two decimal points; names are expressed as text; and
dates follow a specific pattern.
With unstructured data, on the other hand, there are no rules. A picture, a
voice recording, a tweet — they all can be different but express ideas and thoughts
based on human understanding. One of the goals of big data is to use technology
to take this unstructured data and make sense of it. The definition of big data
depends on whether the data can be ingested, processed, and examined in a time
that meets a particular business’s requirements. For one company or system, big
data may be 50TB; for another, it may be 10PB.
Veracity
Veracity refers to the trustworthiness of the data. Can the manager rely on the fact
that the data is representative? Every good manager knows that there are inherent
discrepancies in all the data collected.
Velocity
Velocity is the frequency of incoming data that needs to be processed. Think about
how many SMS messages, Facebook status updates, or credit card swipes are
being sent on a particular telecom carrier every minute of every day, and you’ll
have a good appreciation of velocity. A streaming application like Amazon Web
Services Kinesis is an example of an application that handles the velocity of data.
Value
It may seem painfully obvious to some, but a real objective is critical to this mashup
of the four V’s. Will the insights you gather from analysis create a new product line,
a cross-sell opportunity, or a cost-cutting measure? Or will your data analysis lead
to the discovery of a critical causal effect that results in a cure to a disease? The
ultimate objective of any big data project should be to generate some sort of value
for the company doing all the analysis. Otherwise, you’re just performing some
technological task for technology’s sake.
ERP, SCM, CRM, and transactional Web applications are classic examples of
systems processing Transactions. Highly structured data in these
systems is typically stored in SQL databases.
Interactions are about how people and things interact with each other or with
your business. Web Logs, User Click Streams, Social Interactions &
Feeds, and User-Generated Content are classic places to find Interaction
data.
Observational data tends to come from the “Internet of Things”. Sensors
for heat, motion, pressure and RFID and GPS chips within such things
as mobile devices, ATM machines, and even aircraft engines provide
just some examples of “things” that output Observation data.
Business
1. Opportunity to enable innovative new business models
2. Potential for new insights that drive competitive advantage
Technical
1. Data collected and stored continues to grow exponentially
2. Data is increasingly everywhere and in many formats
3. Traditional solutions are failing under new requirements
Financial
1. Cost of data systems, as a percentage of IT spend, continues to grow
2. Cost advantages of commodity hardware & open source software
As the technology that helps an organization to break down data silos and analyze
data improves, business can be transformed in all sorts of ways. According to
Datamation, today's advances in analyzing big data allow researchers to
decode human DNA in minutes, predict where terrorists plan to attack,
determine which gene is most likely to be responsible for certain diseases
and, of course, which ads you are most likely to respond to on Facebook.
Another example comes from one of the biggest mobile carriers in the world.
France's Orange launched its Data for Development project by releasing
subscriber data for customers in the Ivory Coast. The 2.5 billion records,
which were made anonymous, included details on calls and text messages
exchanged between 5 million users. Researchers accessed the data and sent
Orange proposals for how the data could serve as the foundation for development
projects to improve public health and safety. Proposed projects included one that
showed how to improve public safety by tracking cell phone data to map where
people went after emergencies; another showed how to use cellular data for
disease containment.
Notably, the business area getting the most attention relates to increasing
efficiency and optimizing operations. Specifically, 62 percent of respondents said
that they use big data analytics to improve speed and reduce complexity.
Retail traders, big banks, hedge funds and other so-called ‘big boys’ in the
financial markets use big data for trade analytics, including high-frequency
trading, pre-trade decision-support analytics, sentiment measurement, predictive
analytics, etc.
This industry also relies heavily on big data for risk analytics, including anti-money
laundering, on-demand enterprise risk management, "Know Your Customer", and
fraud mitigation.
Big Data providers specific to this industry include 1010data, Panopticon Software,
StreamBase Systems, NICE Actimize and Quartet FS.
3. Healthcare Providers
Industry-Specific challenges
The healthcare sector has access to huge amounts of data but has been plagued by
failures to use that data to curb rising healthcare costs and by inefficient
systems that stifle faster and better healthcare benefits across the board. This is
mainly due to the fact that electronic data is unavailable, inadequate, or unusable.
Additionally, the healthcare databases that hold health-related information have
made it difficult to link data that can show patterns useful in the medical field.
4. Education
Industry-Specific big data challenges
From a technical point of view, a major challenge in the education industry is to
incorporate big data from different sources and vendors and to utilize it on
platforms that were not designed for the varying data. From a practical point of
view, staff and institutions have to learn the new data management and analysis
tools.
On the technical side, there are challenges to integrate data from different sources,
on different platforms and from different vendors that were not designed to work
with one another.
Politically, issues of privacy and personal data protection associated with big data
used for educational purposes is a challenge.
In a different use case of big data in education, it is also used to measure
teachers’ effectiveness and to ensure a good experience for both students and
teachers. Teachers’ performance can be fine-tuned and measured against student
numbers, subject matter, student demographics, student aspirations, behavioural
classification and several other variables.
5. Manufacturing
Industry-Specific challenges
Similarly, large volumes of data from the manufacturing industry are untapped. The
underutilization of this information prevents improved quality of products, energy
efficiency, reliability, and better profit margins.
Big data has also been used in solving today’s manufacturing challenges and to
gain competitive advantage among other benefits.
A study by Deloitte shows the supply chain capabilities from big data
currently in use and their expected use in the future.
6. Government
Industry-Specific challenges
In governments the biggest challenges are the integration and interoperability of
big data across different government departments and affiliated organizations.
The Food and Drug Administration (FDA) is using big data to detect and study
patterns of food-related illnesses and diseases. This allows for a faster response,
which has led to faster treatment and fewer deaths. The Department of Homeland
Security uses big data for several different use cases. Big data is analyzed from
different government agencies and is used to protect the country.
7. Insurance
Industry-Specific challenges
Lack of personalized services, lack of personalized pricing and the lack of targeted
services to new segments and to specific market segments are some of the main
challenges.
Through massive data from digital channels and social media, real-time monitoring
of claims throughout the claims cycle has been used to provide insights.
8. Retail and Wholesale Trade
Industry-Specific challenges
From traditional brick and mortar retailers and wholesalers to current day e-
commerce traders, the industry has gathered a lot of data over time. This data,
derived from customer loyalty cards, POS scanners, RFID etc. is not being used
enough to improve customer experiences on the whole. Any changes and
improvements made have been quite slow.
Applications of big data in the Retail and Wholesale industry
Big data from customer loyalty data, POS, store inventory, local demographics data
continues to be gathered by retail and wholesale stores.
In New York’s Big Show retail trade conference in 2014, companies like Microsoft,
Cisco and IBM pitched the need for the retail industry to utilize big data for
analytics and for other uses including:
Optimized staffing through data from shopping patterns, local events, and so on
Reduced fraud
Timely analysis of inventory
Social media use also has a lot of potential use and continues to be slowly but surely
adopted especially by brick and mortar stores. Social media is used for customer
prospecting, customer retention, promotion of products, and more.
9. Transportation
Industry-Specific challenges
In recent times, huge amounts of data from location-based social networks and
high speed data from telecoms have affected travel behavior. Regrettably,
research to understand travel behaviour has not progressed as quickly.
In most places, transport demand models are still based on poorly understood new
social media structures.
10. Energy and Utilities
In utility companies, the use of big data also allows for better asset and workforce
management, which is useful for recognizing errors and correcting them as soon as
possible, before complete failure is experienced.
vii) MapReduce Algorithm
The MapReduce algorithm consists of three steps, or functions: Map, Shuffle, and Reduce.
Map Function:
The Map function is the first step in the MapReduce algorithm. It takes the input
tasks (datasets) and divides them into smaller sub-tasks, then performs the
required computation on each sub-task in parallel.
The output of the Map function is a set of key-value pairs of the form <Key, Value>.
Shuffle Function:
The Shuffle function is the second step; it consists of two sub-steps, Merging and
Sorting. The Merging step combines all key-value pairs that share the same key
into <Key, List<Value>> pairs. The Sorting step takes the input from the Merging
step and sorts all key-value pairs by key.
This step also returns <Key, List<Value>> output, but with sorted key-value pairs.
Finally, the Shuffle function passes the sorted list of <Key, List<Value>> pairs to the
next step.
Reduce Function:
The Reduce function is the final step in the MapReduce algorithm. It performs only
one step, the Reduce step: it takes the list of sorted <Key, List<Value>> pairs from
the Shuffle function and performs a reduce operation on each one.
The final step's output looks like the first step's output in form. However, the final
<Key, Value> pairs are different from the first step's <Key, Value> pairs: they are
the computed and sorted results. The difference between the first step's output
and the final output is easiest to see with a simple example, which is discussed in
the next section.
Those are the three steps of the MapReduce algorithm.
Example:
Problem Statement:
Count the number of occurrences of each word available in a DataSet.
Input DataSet
For simplicity, this example uses a small input dataset; real applications work with
very large amounts of data.
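The three steps described above can be sketched as a small, framework-free Java program for the word-count problem. This is a simulation of the algorithm, not Hadoop code; the sample dataset is illustrative, chosen here for the example:

```java
import java.util.*;

public class WordCountSimulation {

    // Map step: emit a <word, 1> pair for every word in every input line.
    static List<Map.Entry<String, Integer>> map(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines)
            for (String word : line.toLowerCase().split("\\s+"))
                if (!word.isEmpty())
                    pairs.add(Map.entry(word, 1));
        return pairs;
    }

    // Shuffle step: merge values that share a key, then sort by key,
    // producing <Key, List<Value>> pairs (a TreeMap keeps keys sorted).
    static SortedMap<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        SortedMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs)
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        return grouped;
    }

    // Reduce step: collapse each key's list of values into a single count.
    static SortedMap<String, Integer> reduce(SortedMap<String, List<Integer>> grouped) {
        SortedMap<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            int sum = 0;
            for (int one : entry.getValue()) sum += one;
            counts.put(entry.getKey(), sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Illustrative input dataset: three "lines" of text.
        List<String> dataSet = List.of("deer bear river", "car car river", "deer car bear");
        System.out.println(reduce(shuffle(map(dataSet))));
        // prints {bear=2, car=3, deer=2, river=2}
    }
}
```

Note how each function's output type matches the step descriptions: Map emits <Key, Value> pairs, Shuffle returns sorted <Key, List<Value>> pairs, and Reduce produces the final computed <Key, Value> pairs.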
That is all about the MapReduce algorithm. It’s time to start developing and testing
MapReduce programs.
2. INTRODUCTION TO Hadoop and Hadoop Architecture
i. Big Data - Apache Hadoop & Hadoop Ecosystem
ii. Moving data in and out of Hadoop – Understanding inputs and
outputs of MapReduce
iii. Data Serialization
HADOOP:
Hadoop is an Apache open source framework written in Java that allows
distributed processing of large datasets across clusters of computers using
simple programming models. A Hadoop framework application works in an
environment that provides distributed storage and computation across clusters of
computers. Hadoop is designed to scale up from single server to thousands of
machines, each offering local computation and storage.
Hadoop Architecture:
Hadoop framework includes the following four modules:
Hadoop Common: These are Java libraries and utilities required by other
Hadoop modules. These libraries provide filesystem and OS level
abstractions and contain the necessary Java files and scripts required to
start Hadoop.
Hadoop YARN: This is a framework for job scheduling and cluster
resource management.
Hadoop Distributed File System (HDFS™): A distributed file system that
provides high-throughput access to application data.
Hadoop MapReduce: This is a YARN-based system for parallel processing
of large data sets.
Since 2012, the term "Hadoop" often refers not just to the base modules
mentioned above but also to the collection of additional software packages that
can be installed on top of or alongside Hadoop, such as Apache Pig, Apache Hive,
Apache HBase, Apache Spark etc.
MapReduce:
Hadoop MapReduce is a software framework for easily writing applications which
process big amounts of data in-parallel on large clusters (thousands of nodes) of
commodity hardware in a reliable, fault-tolerant manner.
The term MapReduce actually refers to the following two different tasks that
Hadoop programs perform:
The Map Task: This is the first task, which takes input data and converts it
into a set of data, where individual elements are broken down into tuples
(key/value pairs).
The Reduce Task: This task takes the output from a map task as input and
combines those data tuples into a smaller set of tuples. The reduce task is
always performed after the map task.
Typically, both the input and the output are stored in a file-system. The
framework takes care of scheduling tasks, monitoring them and re-executes the
failed tasks.
The MapReduce framework consists of a single master Job Tracker and one
slave Task Tracker per cluster-node.
Job Tracker, the master is responsible for resource management, tracking
resource consumption/availability and scheduling the jobs component tasks on
the slaves, monitoring them and re-executing the failed tasks.
The slaves, Task Tracker execute the tasks as directed by the master and
provide task-status information to the master periodically.
The Job Tracker is a single point of failure for the Hadoop MapReduce service
which means if Job Tracker goes down, all running jobs are halted.
HDFS holds very large amounts of data and provides easy access. To store such
huge data, the files are stored across multiple machines. These files are stored in a
redundant fashion to rescue the system from possible data loss in case of
failure. HDFS also makes applications available for parallel processing.
Features of HDFS
It is suitable for distributed storage and processing.
Hadoop provides a command interface to interact with HDFS.
The built-in servers of NameNode and DataNode help users to easily
check the status of the cluster.
Streaming access to file system data.
HDFS provides file permissions and authentication.
HDFS Architecture
Given below is the architecture of a Hadoop File System.
HDFS follows the master-slave architecture and it has the following elements.
Namenode
The NameNode is the commodity hardware that contains the GNU/Linux
operating system and the NameNode software. It is a software that can be run on
commodity hardware. The system having the NameNode acts as the master
server and it does the following tasks:
Manages the file system namespace.
Regulates clients’ access to files.
It also executes file system operations such as renaming, closing, and
opening files and directories.
DataNode
The DataNode is a commodity hardware having the GNU/Linux operating
system and DataNode software. For every node (Commodity hardware/System)
in a cluster, there will be a DataNode. These nodes manage the data storage of
their system.
DataNodes perform read-write operations on the file systems, as per client
request.
Block
Generally, the user data is stored in the files of HDFS. A file in the file system will
be divided into one or more segments and/or stored in individual data nodes.
These file segments are called blocks. In other words, the minimum amount of
data that HDFS can read or write is called a block. The default block size is
64 MB in Hadoop 1.x (128 MB in Hadoop 2.x), but it can be changed as needed in
the HDFS configuration.
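The block arithmetic implied above is simple: the number of blocks a file occupies is the file size divided by the block size, rounded up. A plain-Java sketch (this is ordinary arithmetic, not a Hadoop API):

```java
public class BlockCount {
    // Number of HDFS blocks needed for a file: ceiling of size / blockSize.
    // The last block may be only partially filled.
    static long blocksFor(long fileSizeBytes, long blockSizeBytes) {
        return (fileSizeBytes + blockSizeBytes - 1) / blockSizeBytes;
    }

    public static void main(String[] args) {
        long MB = 1024L * 1024L;
        System.out.println(blocksFor(200 * MB, 64 * MB));  // 4 blocks (64+64+64+8 MB)
        System.out.println(blocksFor(200 * MB, 128 * MB)); // 2 blocks (128+72 MB)
    }
}
```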
Goals of HDFS
Fault detection and recovery: Since HDFS includes a large number of
commodity hardware, failure of components is frequent. Therefore, HDFS
should have mechanisms for quick and automatic fault detection and
recovery.
Core Hadoop:
HDFS:
HDFS stands for Hadoop Distributed File System, for managing big data sets with
high Volume, Velocity and Variety. HDFS implements a master-slave architecture:
the master is the NameNode and the slaves are the DataNodes.
Features:
• Scalable
• Reliable
• Commodity Hardware
HDFS is the best-known storage layer for Big Data.
Map Reduce:
MapReduce is a programming model designed to process high volumes of
distributed data. The platform is built using Java for better exception handling.
MapReduce includes two daemons, JobTracker and TaskTracker.
Features:
• Functional Programming.
• Works very well on Big Data.
• Can process large datasets.
Map Reduce is the main component known for processing big data.
YARN:
YARN stands for Yet Another Resource Negotiator. It is also called
MapReduce 2 (MRv2). The two major functionalities of the JobTracker in MRv1,
resource management and job scheduling/monitoring, are split into separate
daemons: the ResourceManager, the NodeManager and the per-application
ApplicationMaster.
Features:
• Better resource management.
• Scalability
• Dynamic allocation of cluster resources.
Data Access:
Pig:
Apache Pig is a high-level language built on top of MapReduce for analyzing
large datasets with simple ad hoc data analysis programs. Pig is also known as
a data flow language. It is well integrated with Python. It was initially developed
by Yahoo.
Salient features of pig:
• Ease of programming
• Optimization opportunities
• Extensibility.
Pig scripts internally will be converted to map reduce programs.
Hive:
Apache Hive is another high-level query language and data warehouse
infrastructure built on top of Hadoop, providing data summarization, query and
analysis. It was initially developed by Facebook and later made open source.
Salient features of hive:
• SQL like query language called HQL.
• Partitioning and bucketing for faster data processing.
• Integration with visualization tools like Tableau.
Hive queries internally will be converted to map reduce programs.
If you want to become a big data analyst, these two high-level languages are a
must-know!
Data Storage
HBase:
Apache HBase is a NoSQL database built for hosting large tables with billions of
rows and millions of columns on top of Hadoop commodity hardware machines.
Use Apache HBase when you need random, real-time read/write access to your
Big Data.
Features:
• Strictly consistent reads and writes. In memory operations.
• Easy to use Java API for client access.
• Well integrated with pig, hive and sqoop.
• Is a consistent and partition tolerant system in CAP theorem.
Cassandra:
Cassandra is a NoSQL database designed for linear scalability and high
availability. Cassandra is based on a key-value model. It was originally developed
at Facebook and is known for its fast response to queries.
Features:
• Column indexes
• Support for de-normalization
• Materialized views
• Powerful built-in caching.
Lucene:
Apache Lucene™ is a high-performance, full-featured text search engine library
written entirely in Java. It is a technology suitable for nearly any application that
requires full-text search, especially cross-platform.
Features:
• Scalable, high-performance indexing.
• Powerful, accurate and efficient search algorithms.
• Cross-platform solution.
Hama:
Apache Hama is a distributed framework based on Bulk Synchronous
Parallel (BSP) computing. It is capable of, and well known for, massive scientific
computations such as matrix, graph and network algorithms.
Features:
• Simple programming model
• Well suited for iterative algorithms
• YARN supported
• Unsupervised machine learning, e.g. collaborative filtering.
• K-Means clustering.
Crunch:
Apache Crunch is built for pipelining MapReduce programs that are simple and
efficient. The framework is used for writing, testing and running MapReduce
pipelines.
Features:
• Developer focused.
• Minimal abstractions
• Flexible data model.
Data Serialization:
Avro:
Apache Avro is a language-neutral data serialization framework. It is designed
for language portability, allowing data to potentially outlive the language used to
read and write it.
Thrift:
Thrift is a language developed to build interfaces to interact with technologies
built on Hadoop. It is used to define and create services for numerous languages.
Data Intelligence
Drill
Apache Drill is a low latency SQL query engine for Hadoop and NoSQL.
Features:
• Agility
• Flexibility
• Familiarity.
Mahout:
Apache Mahout is a scalable machine learning library designed for building
predictive analytics on Big Data. Mahout now has implementations on Apache
Spark for faster in-memory computing.
Features:
• Collaborative filtering.
• Classification
• Clustering
• Dimensionality reduction
Data Integration
Apache Sqoop:
Apache Sqoop is a tool designed for bulk data transfers between relational
databases and Hadoop.
Features:
• Import and export to and from HDFS.
• Import and export to and from Hive.
• Import and export to HBase.
Apache Flume:
Flume is a distributed, reliable, and available service for efficiently collecting,
aggregating, and moving large amounts of log data.
Features:
• Robust
• Fault tolerant
• Simple and flexible Architecture based on streaming data flows.
Apache Chukwa:
A scalable log collector used for monitoring large distributed file systems.
Features:
• Scales to thousands of nodes.
• Reliable delivery.
• Should be able to store data indefinitely.
During a MapReduce job, Hadoop sends the Map and Reduce tasks to the
appropriate servers in the cluster.
The framework manages all the details of data-passing such as issuing
tasks, verifying task completion, and copying data around the cluster
between the nodes.
Most of the computing takes place on nodes with data on local disks that
reduces the network traffic.
After completion of the given tasks, the cluster collects and reduces the
data to form an appropriate result, and sends it back to the Hadoop server.
Inputs and Outputs (Java Perspective) :
The MapReduce framework operates on <key, value> pairs, that is, the
framework views the input to the job as a set of <key, value> pairs and produces
a set of <key, value> pairs as the output of the job, conceivably of different types.
The key and value classes have to be serializable by the framework and hence
need to implement the Writable interface. Additionally, the key classes have to
implement the WritableComparable interface to facilitate sorting by the
framework. Input and output types of a MapReduce job: (Input) <k1, v1>
-> map -> <k2, v2> -> reduce -> <k3, v3> (Output).
Terminology
PayLoad - Applications implement the Map and the Reduce functions, and
form the core of the job.
Mapper - Mapper maps the input key/value pairs to a set of intermediate
key/value pair.
NamedNode - Node that manages the Hadoop Distributed File System
(HDFS).
DataNode - Node where data is presented in advance before any
processing takes place.
MasterNode - Node where JobTracker runs and which accepts job
requests from clients.
SlaveNode - Node where Map and Reduce program runs.
JobTracker - Schedules jobs and tracks the assigned jobs on the TaskTracker.
Task Tracker - Tracks the task and reports status to JobTracker.
Job - An execution of a Mapper and Reducer across a dataset.
Task - An execution of a Mapper or a Reducer on a slice of data.
Task Attempt - A particular instance of an attempt to execute a task on a
SlaveNode.
Example Scenario:
Given below is the data regarding the electrical consumption of an
organization. It contains the monthly electrical consumption and the annual
average for various years.
But, think of the data representing the electrical consumption of all the large scale
industries of a particular state, since its formation.
When we write applications to process such bulk data,
They will take a lot of time to execute.
There will be a heavy network traffic when we move data from source
to network server and so on.
Serialization in Java:
Java provides a mechanism, called object serialization where an object can be
represented as a sequence of bytes that includes the object's data as well as
information about the object's type and the types of data stored in the object.
After a serialized object is written into a file, it can be read from the file and
deserialized. That is, the type information and bytes that represent the object and
its data can be used to recreate the object in memory.
ObjectInputStream and ObjectOutputStream classes are used to serialize and
deserialize an object respectively in Java.
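The mechanism just described can be sketched in a few lines. The Employee class and its fields here are illustrative, and a byte array stands in for the file, but the ObjectOutputStream/ObjectInputStream calls are the standard Java API:

```java
import java.io.*;

public class SerializationDemo {

    // The class to serialize must implement java.io.Serializable.
    // (Employee and its fields are illustrative, not from the notes.)
    static class Employee implements Serializable {
        private static final long serialVersionUID = 1L;
        String name;
        int id;
        Employee(String name, int id) { this.name = name; this.id = id; }
    }

    // Serialize the object into bytes, then deserialize those bytes back
    // into a new, equivalent object in memory.
    static Employee roundTrip(Employee original) throws IOException, ClassNotFoundException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(original); // object -> sequence of bytes
        }
        try (ObjectInputStream in = new ObjectInputStream(
                new ByteArrayInputStream(bytes.toByteArray()))) {
            return (Employee) in.readObject(); // bytes -> recreated object
        }
    }

    public static void main(String[] args) throws Exception {
        Employee copy = roundTrip(new Employee("Alice", 42));
        System.out.println(copy.name + " " + copy.id); // prints "Alice 42"
    }
}
```

The recreated object is a distinct instance with the same field values, which is exactly the property distributed systems rely on when shipping objects between nodes.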
Serialization in Hadoop
Generally in distributed systems like Hadoop, the concept of serialization is used
for Interprocess Communication and Persistent Storage.
Interprocess Communication
To establish interprocess communication between the nodes connected
in a network, the RPC (Remote Procedure Call) technique is used.
RPC uses internal serialization to convert the message into binary format
before sending it to the remote node via the network. At the other end, the remote
system deserializes the binary stream into the original message.
The RPC serialization format is required to be as follows -
o Compact - To make the best use of network bandwidth, which is the most
scarce resource in a data center.
o Fast - Since the communication between the nodes is crucial in distributed
systems, the serialization and deserialization process should be quick, producing
less overhead.
o Extensible - Protocols change over time to meet new requirements, so it should
be straightforward to evolve the protocol in a controlled manner for clients and
servers.
o Interoperable - The message format should support the nodes that are written
in different languages.
Writable Interface
This is the interface in Hadoop which provides methods for serialization and
deserialization. It declares two methods -
write(DataOutput out) - Serializes the object's fields to the given output stream.
readFields(DataInput in) - Deserializes the object's fields from the given input
stream.
WritableComparable Interface
It is the combination of Writable and Comparable interfaces. This interface
inherits Writable interface of Hadoop as well as Comparable interface of Java.
Therefore, it provides methods for data serialization, deserialization, and
comparison.
These classes are useful to serialize various types of data in Hadoop. For instance,
let us consider the IntWritable class. Let us see how this class is used to serialize
and deserialize the data in Hadoop.
IntWritable Class
This class implements the Writable, Comparable, and
WritableComparable interfaces. It wraps an integer data type in it. This class
provides methods used to serialize and deserialize integer type of data.
Serializing the Data in Hadoop
The procedure to serialize the integer type of data is discussed below.
Instantiate the IntWritable class by wrapping an integer value in it.
Instantiate the ByteArrayOutputStream class.
Instantiate the DataOutputStream class and pass the object of the
ByteArrayOutputStream class to it.
Serialize the integer value in the IntWritable object using the write() method.
This method needs an object of the DataOutputStream class.
The serialized data will be stored in the byte array object which was passed as
a parameter to the DataOutputStream class at the time of instantiation. Convert
the data in the object to a byte array.
Example
The following example shows how to serialize data of integer type in Hadoop –
Deserializing the Data in Hadoop
The procedure to deserialize the integer type of data is discussed below -
Instantiate the IntWritable class.
Instantiate the ByteArrayInputStream class by passing the serialized byte
array to it.
Instantiate the DataInputStream class and pass the object of the
ByteArrayInputStream class to it.
Deserialize the data in the object of DataInputStream using the readFields()
method of the IntWritable class.
The deserialized data will be stored in the object of the IntWritable class. You
can retrieve this data using the get() method of this class.
Example
The following example shows how to deserialize the data of integer type in
Hadoop –
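The Hadoop classes themselves are not on hand in these notes, so the following JDK-only sketch mimics IntWritable's write()/readFields() pattern to show both procedures end to end; with the real org.apache.hadoop.io.IntWritable the steps are the same, only the class differs:

```java
import java.io.*;

// JDK-only stand-in for Hadoop's IntWritable: wraps an int and knows how to
// write itself to a DataOutput and read itself back from a DataInput.
public class IntWritableSketch {
    private int value;

    public IntWritableSketch(int value) { this.value = value; }

    public void write(DataOutput out) throws IOException { out.writeInt(value); }     // serialize
    public void readFields(DataInput in) throws IOException { value = in.readInt(); } // deserialize
    public int get() { return value; }

    // Full round trip: serialize into a byte array, then deserialize from it.
    public static int roundTrip(int n) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            new IntWritableSketch(n).write(new DataOutputStream(bos));

            IntWritableSketch copy = new IntWritableSketch(0);
            copy.readFields(new DataInputStream(
                new ByteArrayInputStream(bos.toByteArray())));
            return copy.get();
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip(42));
    }
}
```

Note how readFields() fills in an existing object rather than creating a new one; this is the reuse property discussed next.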
Advantage of Hadoop over Java Serialization
Hadoop’s Writable-based serialization is capable of reducing the object-creation
overhead by reusing the Writable objects, which is not possible with Java’s
native serialization framework.
Log file - In general, a log file is a file that lists events/actions that occur in an
operating system. For example, web servers list every request made to the
server in the log files.
Hadoop File System Shell provides commands to insert data into Hadoop and
read from it. You can insert data into Hadoop using the put command, for
example:
hadoop fs -put <local-source-path> <HDFS-destination-path>
We can use the put command of Hadoop to transfer data from these sources to
HDFS. But, it suffers from the following drawbacks -
Using the put command, we can transfer only one file at a time, while the data
generators generate data at a much higher rate. Since analysis made on older
data is less accurate, we need a solution to transfer data in real time.
If we use the put command, the data needs to be packaged and ready for
upload. Since web servers generate data continuously, this is a very
difficult task.
What we need here is a solution that can overcome the drawbacks of put
command and transfer the "streaming data" from data generators to centralized
stores (especially HDFS) with less delay.
Note - In POSIX file system, whenever we are accessing a file (say performing
write operation), other programs can still read this file (at least the saved portion
of the file). This is because the file exists on the disk before it is closed.
Available Solutions
To send streaming data (log files, events etc..,) from various sources to HDFS, we
have the following tools available at our disposal –
Facebook’s Scribe
Scribe is an immensely popular tool that is used to aggregate and stream log
data. It is designed to scale to a very large number of nodes and be robust to
network and node failures.
Apache Kafka
Kafka has been developed by Apache Software Foundation. It is an open-source
message broker. Using Kafka, we can handle feeds with high-throughput and
low-latency.
Apache Flume
Apache Flume is a tool/service/data ingestion mechanism for collecting,
aggregating, and transporting large amounts of streaming data, such as log data
and events, from various web servers to a centralized data store.
1. Flume Event
An event is the basic unit of the data transported inside Flume. It contains a
payload of byte array that is to be transported from the source to the destination
accompanied by optional headers. A typical Flume event would have the
following structure.
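The real org.apache.flume.Event interface is not reproduced here, but a minimal sketch of the shape described above (a byte-array payload plus optional string headers) could look like:

```java
import java.nio.charset.StandardCharsets;
import java.util.*;

// Minimal sketch of a Flume-style event: a byte[] body plus optional headers.
public class SimpleEvent {
    private final Map<String, String> headers = new HashMap<>();
    private final byte[] body;

    public SimpleEvent(String payload) {
        this.body = payload.getBytes(StandardCharsets.UTF_8);
    }

    public void setHeader(String key, String value) { headers.put(key, value); }
    public String getHeader(String key) { return headers.get(key); }
    public String bodyAsString() { return new String(body, StandardCharsets.UTF_8); }

    public static void main(String[] args) {
        SimpleEvent e = new SimpleEvent("GET /index.html 200"); // e.g. one log line
        e.setHeader("timestamp", "1625097600000");              // optional header
        System.out.println(e.bodyAsString());
    }
}
```

The body is opaque bytes to Flume; the headers carry routing metadata, which is what the multiplexing channel selectors described later inspect.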
2. Flume Agent
An agent is an independent daemon process (JVM) in Flume. It receives the
data (events) from clients or other agents and forwards it to its next
destination (sink or agent). Flume may have more than one agent. The following
diagram represents a Flume Agent. As shown in the diagram, a Flume Agent
contains three main components, namely source, channel, and sink.
i) Source
A source is the component of an Agent which receives data from the data
generators and transfers it to one or more channels in the form of Flume events.
Apache Flume supports several types of sources and each source receives
events from a specified data generator.
Example - Avro source, Thrift source, Twitter 1% source, etc.
ii) Channel
A channel is a transient store which receives the events from the source and
buffers them till they are consumed by sinks. It acts as a bridge between the
sources and the sinks.
These channels are fully transactional and they can work with any number of
sources and sinks.
Example - JDBC channel, File system channel, Memory channel, etc.
iii) Sink
A sink stores the data into centralized stores like HBase and HDFS. It consumes
the data (events) from the channels and delivers it to the destination. The
destination of the sink might be another agent or the central stores.
Example - HDFS sink
Note - A Flume agent can have multiple sources, sinks, and channels.
1) Interceptors
Interceptors are used to alter/inspect flume events which are transferred
between source and channel.
2) Channel Selectors
These are used to determine which channel to opt for when transferring the data
in case of multiple channels. There are two types of channel selectors -
Default channel selectors - These are also known as replicating channel
selectors; they replicate all the events in each channel.
Multiplexing channel selectors - These decide the channel to send an event to
based on the address in the header of that event.
3) Sink Processors
These are used to invoke a particular sink from the selected group of sinks. These
are used to create failover paths for your sinks or load balance events across
multiple sinks from a channel.
Flume Dataflow
Flume is a framework which is used to move log data into HDFS. Generally,
events and log data are generated by the log servers and these servers have
Flume agents running on them. These agents receive the data from the data
generators.
Finally, the data from all these collectors will be aggregated and pushed to a
centralized store such as HBase or HDFS. The following diagram explains the data
flow in Flume.
Multi-hop Flow
Within Flume, there can be multiple agents and before reaching the final
destination, an event may travel through more than one agent. This is known
as multi-hop flow.
Fan-out Flow
The dataflow from one source to multiple channels is known as fan-out flow. It is
of two types -
Replicating - The data flow where the data will be replicated in all the
configured channels.
Multiplexing - The data flow where the data will be sent to a selected channel
which is mentioned in the header of the event.
Fan-in Flow
The data flow in which the data will be transferred from many sources to one
channel is known as fan-in flow.
Failure Handling
In Flume, for each event, two transactions take place: one at the sender and one at
the receiver. The sender sends events to the receiver. Soon after receiving the
data, the receiver commits its own transaction and sends a “received” signal to
the sender. After receiving the signal, the sender commits its transaction. (Sender
will not commit its transaction till it receives a signal from the receiver.)
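This handshake can be modeled as a toy, single-threaded sketch (an illustration of the commit ordering only, not Flume's actual API):

```java
// Toy model of Flume's paired transactions: the sender commits only after the
// receiver has committed and acknowledged, so an event is never dropped in
// between the two hops.
public class HopTransaction {
    boolean senderCommitted = false;
    boolean receiverCommitted = false;

    // Receiver stores the event, commits its own transaction, and acks.
    boolean deliverToReceiver(String event) {
        receiverCommitted = true;   // receiver commits first
        return true;                // the "received" signal back to the sender
    }

    public void send(String event) {
        boolean ack = deliverToReceiver(event);
        if (ack) {
            senderCommitted = true; // sender commits only after the ack
        }
    }

    public static void main(String[] args) {
        HopTransaction hop = new HopTransaction();
        hop.send("log line");
        System.out.println(hop.senderCommitted && hop.receiverCommitted);
    }
}
```

If the ack never arrives, the sender's transaction stays open and the event can be re-delivered, which is the failure-handling guarantee described above.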
3. HDFS, HIVE AND HIVEQL, HBASE
HDFS-Overview, Installation and Shell, Java API; Hive Architecture and Installation,
Comparison with Traditional Database, HiveQL Querying Data, Sorting And
Aggregating, Map Reduce Scripts, Joins & Sub queries, HBase concepts, Advanced
Usage, Schema Design, Advance Indexing, PIG, Zookeeper , how it helps in
monitoring a cluster, HBase uses Zookeeper and how to Build Applications with
Zookeeper.
The term ‘Big Data’ is used for collections of large datasets that include huge
volume, high velocity, and a variety of data that is increasing day by day. Using
traditional data management systems, it is difficult to process Big Data. Therefore,
the Apache Software Foundation introduced a framework called Hadoop to solve
Big Data management and processing challenges.
Hadoop
Hadoop is an open-source framework to store and process Big Data in a distributed
environment. It contains two modules, one is MapReduce and another is Hadoop
Distributed File System (HDFS).
The Hadoop ecosystem contains different sub-projects (tools) such as Sqoop, Pig,
and Hive that are used to help Hadoop modules.
Sqoop: It is used to import and export data between HDFS and RDBMS.
Pig: It is a procedural language platform used to develop a script for
MapReduce operations.
Hive: It is a platform used to develop SQL type scripts to do MapReduce
operations.
What is Hive
Hive is a data warehouse infrastructure tool to process structured data in Hadoop.
It resides on top of Hadoop to summarize Big Data, and makes querying and
analyzing easy.
Initially Hive was developed by Facebook, later the Apache Software Foundation
took it up and developed it further as an open source under the name Apache Hive.
It is used by different companies. For example, Amazon uses it in Amazon Elastic
MapReduce.
Features of Hive
It stores the schema in a database and the processed data in HDFS.
It is designed for OLAP.
It provides SQL type language for querying called HiveQL or HQL.
It is familiar, fast, scalable, and extensible.
Architecture of Hive
The following component diagram depicts the architecture of Hive:
This component diagram contains different units. The following table describes
each unit:
Working of Hive
The following diagram depicts the workflow between Hive and Hadoop.
The following table defines how Hive interacts with Hadoop framework:
1. Execute Query - The Hive interface, such as the Command Line or Web UI,
sends the query to the Driver (any database driver such as JDBC, ODBC, etc.)
to execute.
2. Get Plan - The driver takes the help of the query compiler, which parses the
query to check the syntax and the query plan or the requirement of the query.
3. Get Metadata - The compiler sends a metadata request to the Metastore
(any database).
4. Send Metadata - The Metastore sends the metadata as a response to the
compiler.
5. Send Plan - The compiler checks the requirement and resends the plan to the
driver. Up to here, the parsing and compiling of the query is complete.
6. Execute Plan - The driver sends the execute plan to the execution engine.
7. Execute Job - If we execute any query, Hive internally converts that query
into a Hadoop MapReduce program, which shows that Hive runs on top of
Hadoop.
ADVANCED QUERY:
We can create a new table with data available in another table using:
Create table namewithcell as select name, cell from cellnumbers;
This creates a new table namewithcell and inserts the name and cell-number
data from the cellnumbers table.
Join:
Syntax
join_table:
Example
We will use the following two tables in this chapter. Consider the following table
named CUSTOMERS:
+----+----------+-----+-----------+----------+
| ID | NAME | AGE | ADDRESS | SALARY |
+----+----------+-----+-----------+----------+
| 1 | Ramesh | 32 | Ahmedabad | 2000.00 |
| 2 | Khilan | 25 | Delhi | 1500.00 |
| 3 | kaushik | 23 | Kota | 2000.00 |
| 4 | Chaitali | 25 | Mumbai | 6500.00 |
| 5 | Hardik | 27 | Bhopal | 8500.00 |
| 6 | Komal | 22 | MP | 4500.00 |
| 7 | Muffy | 24 | Indore | 10000.00 |
+----+----------+-----+-----------+----------+
Consider another table named ORDERS:
+-----+---------------------+-------------+--------+
|OID | DATE | CUSTOMER_ID | AMOUNT |
+-----+---------------------+-------------+--------+
| 102 | 2009-10-08 00:00:00 | 3 | 3000 |
| 100 | 2009-10-08 00:00:00 | 3 | 1500 |
| 101 | 2009-11-20 00:00:00 | 2 | 1560 |
| 103 | 2008-05-20 00:00:00 | 4 | 2060 |
+-----+---------------------+-------------+--------+
JOIN
LEFT OUTER JOIN
RIGHT OUTER JOIN
FULL OUTER JOIN
JOIN
The JOIN clause is used to combine and retrieve records from multiple tables.
JOIN is the same as INNER JOIN in SQL. A JOIN condition is raised using the
primary keys and foreign keys of the tables.
The following query executes JOIN on the CUSTOMERS and ORDERS tables, and
retrieves the records:
hive> SELECT c.ID, c.NAME, c.AGE, o.AMOUNT
FROM CUSTOMERS c JOIN ORDERS o
ON (c.ID = o.CUSTOMER_ID);
On successful execution of the query, you get to see the following response:
+----+----------+-----+--------+
| ID | NAME | AGE | AMOUNT |
+----+----------+-----+--------+
| 3 | kaushik | 23 | 3000 |
| 3 | kaushik | 23 | 1500 |
| 2 | Khilan | 25 | 1560 |
| 4 | Chaitali | 25 | 2060 |
+----+----------+-----+--------+
A LEFT JOIN returns all the values from the left table, plus the matched values from
the right table, or NULL in case of no matching JOIN predicate.
The following query demonstrates LEFT OUTER JOIN between the CUSTOMERS
and ORDERS tables:
hive> SELECT c.ID, c.NAME, o.AMOUNT, o.DATE
FROM CUSTOMERS c
LEFT OUTER JOIN ORDERS o
ON (c.ID = o.CUSTOMER_ID);
On successful execution of the query, you get to see the following response:
+----+----------+--------+---------------------+
| ID | NAME | AMOUNT | DATE |
+----+----------+--------+---------------------+
| 1 | Ramesh | NULL | NULL |
| 2 | Khilan | 1560 | 2009-11-20 00:00:00 |
| 3 | kaushik | 3000 | 2009-10-08 00:00:00 |
| 3 | kaushik | 1500 | 2009-10-08 00:00:00 |
| 4 | Chaitali | 2060 | 2008-05-20 00:00:00 |
| 5 | Hardik | NULL | NULL |
| 6 | Komal | NULL | NULL |
| 7 | Muffy | NULL | NULL |
+----+----------+--------+---------------------+
A RIGHT JOIN returns all the values from the right table, plus the matched values
from the left table, or NULL in case of no matching join predicate.
The following query demonstrates RIGHT OUTER JOIN between the CUSTOMERS
and ORDERS tables:
hive> SELECT c.ID, c.NAME, o.AMOUNT, o.DATE
FROM CUSTOMERS c
RIGHT OUTER JOIN ORDERS o
ON (c.ID = o.CUSTOMER_ID);
On successful execution of the query, you get to see the following response:
+------+----------+--------+---------------------+
| ID | NAME | AMOUNT | DATE |
+------+----------+--------+---------------------+
| 3 | kaushik | 3000 | 2009-10-08 00:00:00 |
| 3 | kaushik | 1500 | 2009-10-08 00:00:00 |
| 2 | Khilan | 1560 | 2009-11-20 00:00:00 |
| 4 | Chaitali | 2060 | 2008-05-20 00:00:00 |
+------+----------+--------+---------------------+
The following query demonstrates FULL OUTER JOIN between the CUSTOMERS
and ORDERS tables:
hive> SELECT c.ID, c.NAME, o.AMOUNT, o.DATE
FROM CUSTOMERS c
FULL OUTER JOIN ORDERS o
ON (c.ID = o.CUSTOMER_ID);
On successful execution of the query, you get to see the following response:
+------+----------+--------+---------------------+
| ID | NAME | AMOUNT | DATE |
+------+----------+--------+---------------------+
| 1 | Ramesh | NULL | NULL |
| 2 | Khilan | 1560 | 2009-11-20 00:00:00 |
| 3 | kaushik | 3000 | 2009-10-08 00:00:00 |
| 3 | kaushik | 1500 | 2009-10-08 00:00:00 |
| 4 | Chaitali | 2060 | 2008-05-20 00:00:00 |
| 5 | Hardik | NULL | NULL |
| 6 | Komal | NULL | NULL |
| 7 | Muffy | NULL | NULL |
| 3 | kaushik | 3000 | 2009-10-08 00:00:00 |
| 3 | kaushik | 1500 | 2009-10-08 00:00:00 |
| 2 | Khilan | 1560 | 2009-11-20 00:00:00 |
| 4 | Chaitali | 2060 | 2008-05-20 00:00:00 |
+------+----------+--------+---------------------+
HBase
HBase is a data model, similar to Google’s Bigtable, designed to provide quick
random access to huge amounts of structured data.
Since 1970, RDBMS has been the solution for data storage and maintenance
related problems. After the advent of big data, companies realized the benefit of
processing big data and started opting for solutions like Hadoop.
Hadoop uses a distributed file system for storing big data, and MapReduce to
process it. Hadoop excels at storing and processing huge volumes of data in
various formats: structured, semi-structured, and even unstructured.
Limitations of Hadoop
Hadoop can perform only batch processing, and data will be accessed only in a
sequential manner. That means one has to search the entire dataset even for the
simplest of jobs.
A huge dataset when processed results in another huge data set, which should also
be processed sequentially. At this point, a new solution is needed to access any
point of data in a single unit of time (random access).
What is HBase?
HBase is a distributed column-oriented database built on top of the Hadoop file
system. It is an open-source project and is horizontally scalable.
It leverages the fault tolerance provided by the Hadoop File System (HDFS).
One can store the data in HDFS either directly or through HBase. Data consumer
reads/accesses the data in HDFS randomly using HBase. HBase sits on top of the
Hadoop File System and provides read and write access.
Features of HBase
HBase is linearly scalable.
It has automatic failure support.
It provides consistent reads and writes.
It integrates with Hadoop, both as a source and a destination.
It has an easy Java API for clients.
It provides data replication across clusters.
Applications of HBase
It is used for write-heavy applications.
HBase is used whenever we need to provide fast random access to available
data.
Companies such as Facebook, Twitter, Yahoo, and Adobe use HBase
internally.
Zookeeper - Overview
ZooKeeper is a distributed coordination service to manage a large set of hosts.
Coordinating and managing a service in a distributed environment is a
complicated process. ZooKeeper solves this issue with its simple architecture
and API.
ZooKeeper allows developers to focus on core application logic without worrying
about the distributed nature of the application.
The ZooKeeper framework was originally built at “Yahoo!” for accessing their
applications in an easy and robust manner. Later, Apache ZooKeeper became a
standard for organized service used by Hadoop, HBase, and other distributed
frameworks. For example, Apache HBase uses ZooKeeper to track the status of
distributed data.
Distributed Application
A distributed application can run on multiple systems in a network at a given time
(simultaneously) by coordinating among themselves to complete a particular task
in a fast and efficient manner. Normally, complex and time-consuming tasks
that would take hours to complete on a non-distributed application (running on
a single system) can be done in minutes by a distributed application using the
computing capabilities of all the systems involved.
The time to complete the task can be further reduced by configuring the
distributed application to run on more systems. A group of systems in which a
distributed application is running is called a Cluster and each machine running in
a cluster is called a Node.
A distributed application has two parts, Server and Client application. Server
applications are actually distributed and have a common interface so that clients
can connect to any server in the cluster and get the same result. Client applications
are the tools to interact with a distributed application.
Benefits of Distributed Applications
Reliability − Failure of a single or a few systems does not cause the whole
system to fail.
Scalability − Performance can be increased as and when needed by adding
more machines, with a minor change in the configuration of the application
and no downtime.
Transparency − Hides the complexity of the system and shows itself as a
single entity / application.
Distributed applications offer a lot of benefits, but they also pose a few complex
and hard-to-crack challenges. The ZooKeeper framework provides a complete
mechanism to overcome these challenges. Race conditions and deadlocks are
handled using a fail-safe synchronization approach. Another main drawback is
inconsistency of data, which ZooKeeper resolves with atomicity.
Benefits of ZooKeeper
Here are the benefits of using ZooKeeper −
Architecture of ZooKeeper
Take a look at the following diagram. It depicts the “Client-Server Architecture” of
ZooKeeper.
Each one of the components that is a part of the ZooKeeper architecture has been
explained in the following table.
Part - Description
Leader - Server node which performs automatic recovery if any of the
connected nodes fails. Leaders are elected on service startup.
Architecture of HBase
In HBase, tables are split into regions and are served by the region servers.
Regions are vertically divided by column families into “Stores”. Stores are saved
as files in HDFS. Shown below is the architecture of HBase.
Note: The term ‘store’ is used for regions to explain the storage structure.
HBase has three major components: the client library, a master server, and region
servers. Region servers can be added or removed as per requirement.
MasterServer
The master server -
Assigns regions to the region servers and takes the help of Apache
ZooKeeper for this task.
Handles load balancing of the regions across region servers. It unloads the
busy servers and shifts the regions to less occupied servers.
Maintains the state of the cluster by negotiating the load balancing.
Is responsible for schema changes and other metadata operations such as
creation of tables and column families.
Regions
Regions are nothing but tables that are split up and spread across the region
servers.
Region server
When we take a deeper look into the region server, it contains regions and stores
as shown below:
The store contains the MemStore and HFiles. The MemStore is just like a cache
memory. Anything that is entered into HBase is stored here initially. Later, the
data is transferred and saved in HFiles as blocks, and the MemStore is flushed.
Zookeeper
Zookeeper is an open-source project that provides services like maintaining
configuration information, naming, providing distributed synchronization,
etc.
Zookeeper has ephemeral nodes representing different region servers.
Master servers use these nodes to discover available servers.
In addition to availability, the nodes are also used to track server failures or
network partitions.
Clients communicate with region servers via zookeeper.
In pseudo and standalone modes, HBase itself will take care of zookeeper.
Apache Spark Overview:
Apache Spark is a cluster-computing platform that provides an API for distributed
programming similar to the MapReduce model, but is designed to be fast for
interactive queries and iterative algorithms.
Apache Spark is a general framework for distributed computing that offers high
performance for both batch and interactive processing. It exposes APIs for Java,
Python, and Scala and consists of Spark core and several related projects:
• Spark SQL - Module for working with structured data. Allows you to seamlessly
mix SQL queries with Spark
programs.
• Spark Streaming - API that allows you to build scalable fault-tolerant streaming
applications.
• MLlib - API that implements common machine learning algorithms.
• GraphX - API for graphs and graph-parallel computation
Speed:
Apache Spark is claimed to be 10x to 100x faster than Hadoop due to its use of
in-memory processing.
Roughly speaking, a job that runs entirely in memory can be about 100 times
faster, while a disk-based process is about 10 times faster.
Memory:
Apache Spark stores data in memory, whereas Hadoop MapReduce stores data
on hard disk. So there is more usage of main memory in Apache Spark, whereas
Hadoop keeps whatever data it has on hard disk.
RDD:
The Resilient Distributed Dataset (RDD) is the primary data abstraction in
Apache Spark and the core of Spark. RDDs provide guaranteed fault tolerance.
On the other hand, Apache Hadoop replicates data in multiple copies to achieve
fault tolerance.
Streaming:
Apache Spark supports streaming with very little administration. This makes it
much easier to use than Hadoop for real-time stream processing.
API:
Spark provides a versatile API that can be used with multiple data sources as
well as languages. It can be used with Java, Scala, Python, and more.
At a high level, every Spark application consists of a driver program that runs the
user’s main function and executes various parallel operations on a cluster. The main
abstraction Spark provides is a resilient distributed dataset (RDD), which is a
collection of elements partitioned across the nodes of the cluster that can be
operated on in parallel. RDDs are created by starting with a file in the Hadoop file
system (or any other Hadoop-supported file system), or an existing Scala collection
in the driver program, and transforming it. Users may also ask Spark to persist an
RDD in memory, allowing it to be reused efficiently across parallel operations.
Finally, RDDs automatically recover from node failures.
Decomposing the name RDD:
Resilient, i.e. fault-tolerant with the help of the RDD lineage graph (DAG), and
so able to recompute missing or damaged partitions after node failures.
Distributed, since the data resides on multiple nodes.
Dataset represents the records of the data you work with. The user can load the
data set externally; it can be a JSON file, a CSV file, a text file, or a database via
JDBC, with no specific data structure.
Let us first discuss how MapReduce operations take place and why they are
not so efficient.
MapReduce is widely adopted for processing and generating large datasets with a
parallel, distributed algorithm on a cluster. It allows users to write parallel
computations, using a set of high-level operators, without having to worry about
work distribution and fault tolerance.
Data sharing is slow in MapReduce due to replication, serialization, and disk IO.
Regarding the storage system, most Hadoop applications spend more than 90%
of their time doing HDFS read-write operations.
The illustration given below shows the iterative operations on Spark RDD. It will
store intermediate results in a distributed memory instead of Stable storage (Disk)
and make the system faster.
This illustration shows interactive operations on Spark RDD. If different queries are
run on the same set of data repeatedly, this particular data can be kept in memory
for better execution times.
All transformations in Spark are lazy, in that they do not compute their results right
away. Instead, they just remember the transformations applied to some base
dataset (e.g. a file). The transformations are only computed when an action
requires a result to be returned to the driver program. This design enables Spark
to run more efficiently. For example, we can realize that a dataset created through
map will be used in a reduce and return only the result of the reduce to the driver,
rather than the larger mapped dataset.
By default, each transformed RDD may be recomputed each time you run an action
on it. However, you may also persist an RDD in memory using the persist (or cache)
method, in which case Spark will keep the elements around on the cluster for much
faster access the next time you query it. There is also support for persisting RDDs
on disk, or replicated across multiple nodes.
Consider the classic example from the Spark programming guide:
JavaRDD<String> lines = sc.textFile("data.txt");
JavaRDD<Integer> lineLengths = lines.map(s -> s.length());
int totalLength = lineLengths.reduce((a, b) -> a + b);
The first line defines a base RDD from an external file. This dataset is not loaded in
memory or otherwise acted on: lines is merely a pointer to the file. The second line
defines lineLengths as the result of a map transformation. Again, lineLengths is not
immediately computed, due to laziness. Finally, we run reduce, which is an action.
At this point Spark breaks the computation into tasks to run on separate machines,
and each machine runs both its part of the map and a local reduction, returning
only its answer to the driver program.
If we also wanted to use lineLengths again later, we could add
lineLengths.persist(StorageLevel.MEMORY_ONLY());
before the reduce, which would cause lineLengths to be saved in memory after the
first time it is computed.
5. NoSQL
I. What is it?
II. Where It is Used, Types of NoSQL databases,
III. Why NoSQL?
IV. Advantages of NoSQL
V. Use of NoSQL in Industry
VI. SQL vs NoSQL, NewSQL
i) What is NoSQL?
A NoSQL database environment is, simply put, a non-relational and largely
distributed database system that enables rapid, ad-hoc organization and
analysis of extremely high-volume, disparate data types.
NoSQL databases are sometimes referred to as cloud databases,
nonrelational databases, Big Data databases and a myriad of other terms and
were developed in response to the sheer volume of data being generated,
stored and analyzed by modern users (user-generated data) and their
applications (machine-generated data).
In general, NoSQL databases have become the first alternative to relational
databases, with scalability, availability, and fault tolerance being key
deciding factors.
They go well beyond the more widely understood legacy, relational
databases (such as Oracle, SQL Server and DB2 databases) in satisfying the
needs of today’s modern business applications.
A very flexible and schema-less data model, horizontal scalability,
distributed architectures, and the use of languages and interfaces that are
“not only” SQL typically characterize this technology.
From a business standpoint, adopting a NoSQL or ‘Big Data’ environment
has been shown to provide a clear competitive advantage in numerous
industries. In the ‘age of data’, this is compelling: the importance of data is
summed up in the saying, “if your data isn’t growing, then neither is your
business”.
1) Key-Value Stores
Such databases are the simplest to implement. They are mostly preferred
when one is working with complex data that is difficult to model. Also, they
are the best in situations where write performance (rapid recording of
data) is prioritized. The third environment where these databases prevail is
when data is accessed by key.
Notable examples of NoSQL databases in this category include Voldemort,
Tokyo Cabinet, Redis, and Amazon Dynamo.
Key-value stores are used in such projects like:
Amazon’s Shopping Cart (Amazon Dynamo is used)
Mozilla Test Pilot
Rhino DHT
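Stripped to its essence (this is a sketch of the access pattern only, not any particular product's API), a key-value store is just put/get by key:

```java
import java.util.*;

// Minimal in-memory key-value store sketch: every read and write goes through
// a key, which is what makes this model fast for workloads like shopping carts.
public class KeyValueStore {
    private final Map<String, String> data = new HashMap<>();

    public void put(String key, String value) { data.put(key, value); }
    public String get(String key) { return data.get(key); }

    public static void main(String[] args) {
        KeyValueStore store = new KeyValueStore();
        store.put("cart:user42", "[book, laptop]"); // write is one keyed insert
        System.out.println(store.get("cart:user42")); // read is one keyed lookup
    }
}
```

Real systems such as Redis or Dynamo add persistence, replication, and partitioning on top of this same keyed interface.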
2) Column Family Stores
Column Family Stores are also referred to as distributed peer stores. They
are designed to handle huge amounts of data distributed over many servers.
Like Key-Value Stores, they use keys. However, the key points to multiple
columns of the database. Columns here are organized by column family:
Examples of databases in this category include Cassandra and HBase.
Google’s Bigtable also falls into this category, though it is not distributed
outside the Google platform.
Typical applications that have implemented Column Family Store
databases include:
Google Earth and Google Maps (Google’s Bigtable)
Ebay
The New York Times
Comcast
Hulu
Databases in this category are suitable for applications with distributed file
systems. Their key strengths for distributed applications lie in their
distributed storage and retrieval capabilities.
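The wide-column data model can be sketched in plain JavaScript: each row key maps to column families, each holding its own set of named columns. The function and key names here are illustrative, not a real database’s API:

```javascript
// row key -> column family -> column -> value
const table = new Map();

function putColumn(rowKey, family, column, value) {
  if (!table.has(rowKey)) table.set(rowKey, new Map());
  const row = table.get(rowKey);
  if (!row.has(family)) row.set(family, new Map());
  row.get(family).set(column, value);
}

function getColumn(rowKey, family, column) {
  // missing rows/families/columns simply yield undefined
  return table.get(rowKey)?.get(family)?.get(column);
}

// One key points to multiple columns, organized by family:
putColumn("user:1", "profile", "name", "Ada");
putColumn("user:1", "profile", "email", "ada@example.com");
putColumn("user:1", "activity", "last_login", "2024-01-01");
console.log(getColumn("user:1", "profile", "name")); // prints Ada
```

Because all columns of a family live together, a cluster can shard rows across servers by key while keeping related columns co-located.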
3) Document Databases
These categories of database manage document-oriented data (semi-
structured data). Here, data may be represented in formats similar to JSON.
Each document has an arbitrary set of properties that may differ from other
documents in the same collection:
Examples of databases under this category include: MongoDB and
CouchDB.
The following applications are practical examples where the above
document databases have been used:
LinkedIn
Dropbox Mailbox
Friendsell.com
Memorize.com
Document databases are suitable for use in applications that need to keep
data in a complicated multi-level format without a fixed schema for each
record. They are especially good for straightforward mapping of business
models to database entities.
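The schema-less property can be shown directly: in this plain-JavaScript sketch a collection is just an array of JSON-like objects, each with its own set of fields (all names and values are assumed examples):

```javascript
// Three documents in one "collection", each with a different shape --
// there is no fixed schema they must all conform to.
const users = [
  { _id: 1, name: "Kyle" },                             // minimal document
  { _id: 2, name: "Ada", emails: ["ada@example.com"] }, // extra field
  { _id: 3, name: "Lin", address: { city: "Pune" } }    // nested field
];

// Query by a property that only some documents have:
const withEmail = users.filter(u => Array.isArray(u.emails));
console.log(withEmail.length); // prints 1
```

This is why document stores map business objects to database entities so directly: the stored shape is the application’s own object shape.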
4) Graph Databases
These databases keep data in the forms of nodes, properties and edges.
Nodes stand for objects whose data we want to store, properties represent
the features of those objects, and edges show the relationships between
those objects (nodes). In this representation, adjacent nodes point to each
other directly (the edges may be directed or indirected):
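A minimal graph model can be sketched in plain JavaScript: nodes carry properties, and directed edges connect nodes to each other directly, so traversal follows pointers rather than joins. All names below are illustrative:

```javascript
// node id -> { properties, out: [outgoing edges] }
const nodes = new Map();

function addNode(id, properties) {
  nodes.set(id, { properties, out: [] });
}
function addEdge(from, to, label) {
  // a directed edge; an undirected graph would add the reverse edge too
  nodes.get(from).out.push({ to, label });
}

addNode("alice", { age: 30 });
addNode("bob", { age: 25 });
addEdge("alice", "bob", "FRIEND");

// Traverse: who is alice directly connected to?
const friends = nodes.get("alice").out.map(e => e.to);
console.log(friends); // prints [ 'bob' ]
```

Adjacency is stored with the node itself, which is why relationship-heavy queries (friends of friends, shortest paths) are the model’s strength.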
Database
Database is a physical container for collections. Each database gets its own set of
files on the file system. A single MongoDB server typically has multiple databases.
Collection
A collection is a group of MongoDB documents; it is the equivalent of an RDBMS
table. Collections do not enforce a schema, so documents within a collection can
have different fields.
Document
A document is a set of key-value pairs. Documents have a dynamic schema,
meaning that documents in the same collection do not need to have the same set
of fields or structure.
The following table shows the relationship of RDBMS terminology with MongoDB.
RDBMS           MongoDB
Database        Database
Table           Collection
Tuple/Row       Document
Column          Field
Table Join      Embedded Documents
Primary Key     Primary Key (default key _id provided by MongoDB itself)
iv) Indexes
The best way to understand database indexes is by analogy: many books have
indexes matching keywords to page numbers. Suppose you have a cookbook and
want to find all recipes calling for pears (maybe you have a lot of pears and don’t
want them to go bad). The time-consuming approach would be to page through
every recipe, checking each ingredient list for pears. Most people would prefer to
check the book’s index for the pears entry, which would give a list of all the recipes
containing pears.
Database indexes are data structures that provide this same service. Indexes
in MongoDB are implemented as a B-tree data structure. B-tree indexes, also used
in many relational databases, are optimized for a variety of queries, including
range scans and queries with sort clauses.
Most databases give each document or row a primary key, a unique identifier
for that datum. The primary key is generally indexed automatically so that each
datum can be efficiently accessed using its unique key, and MongoDB is no
different. But not every database allows you to also index the data inside that row
or document. These are called secondary indexes. Many NoSQL databases, such as
HBase, are considered key-value stores because they don’t allow any secondary
indexes. This is a significant feature in MongoDB; by permitting multiple secondary
indexes, MongoDB allows users to optimize for a wide variety of queries.
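The cookbook analogy translates directly into code: a secondary index is essentially a map from a field’s value to the documents containing it. This plain-JavaScript sketch (with assumed example data) omits what a real B-tree adds, namely ordered traversal for range scans:

```javascript
const recipes = [
  { _id: 1, title: "Pear tart", ingredients: ["pear", "flour"] },
  { _id: 2, title: "Apple pie", ingredients: ["apple", "flour"] },
  { _id: 3, title: "Pear salad", ingredients: ["pear", "greens"] }
];

// Build a secondary index on the ingredients field:
// ingredient -> list of _ids, like a book index's page numbers.
const index = new Map();
for (const r of recipes) {
  for (const ing of r.ingredients) {
    if (!index.has(ing)) index.set(ing, []);
    index.get(ing).push(r._id);
  }
}

// The lookup avoids scanning every recipe:
console.log(index.get("pear")); // prints [ 1, 3 ]
```

The trade-off, as in a book, is that the index must be updated whenever a recipe is added or changed.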
v) Replication
MongoDB provides database replication via a topology known as a replica set.
Replica sets distribute data across two or more machines for redundancy and
automate failover in the event of server and network outages. Additionally,
replication is used to scale database reads. If you have a read-intensive
application, as is commonly the case on the web, it’s possible to spread database
reads across machines in the replica set cluster.
vi) Scaling
The easiest way to scale most databases is to upgrade the hardware. If your
application is running on a single node, it’s usually possible to add some
combination of faster disks, more memory, and a beefier CPU to ease any database
bottlenecks. The technique of augmenting a single node’s hardware for scale is
known as vertical scaling, or scaling up. Vertical scaling has the advantages of
being simple, reliable, and cost-effective up to a certain point, but eventually you
reach a point where it’s no longer feasible to move to a better machine.
It then makes sense to consider scaling horizontally, or scaling out (see
figure). Instead of beefing up a single node, scaling horizontally means distributing
the database across multiple machines. A horizontally scaled architecture can run
on many smaller, less expensive machines, often reducing your hosting costs.
Machines will unavoidably fail from time to time. If you’ve scaled vertically
and the machine fails, then you need to deal with the failure of a machine on which
most of your system depends. This may not be an issue if a copy of the data exists
on a replicated slave, but it’s still the case that only a single server need fail to
bring down the entire system. Contrast that with failure inside a horizontally scaled
architecture. This may be less catastrophic because a single machine represents a
much smaller percentage of the system as a whole.
MongoDB was designed to make horizontal scaling manageable.
Core server
The core database server runs via an executable called mongod (mongodb.exe on
Windows). The mongod server process receives commands over a network socket
using a custom binary protocol. Most of our MongoDB production servers are run
on Linux because of its reliability, wide adoption, and excellent tools.
mongod can be run in several modes, such as a standalone server or a
member of a replica set. Replication is recommended when you’re running
MongoDB in production, and you generally see replica set configurations consisting
of two replicas plus a mongod running in arbiter mode.
Configuring a mongod process is relatively simple; it can be accomplished
both with command-line arguments and with a text configuration file. To see these
configurations, you can run mongod --help.
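For illustration, a standalone mongod might be started with command-line arguments like the following; the port, data path, and log path here are assumed example values, not taken from the text:

```shell
# assumed example paths; run mongod --help for the full option list
mongod --port 27017 --dbpath /data/db --logpath /var/log/mongodb/mongod.log
```

The same settings can equivalently be placed in a text configuration file passed via --config.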
JavaScript shell
The MongoDB command shell is a JavaScript-based tool for administering the
database and manipulating data. The mongo executable loads the shell and
connects to a specified mongod process, or one running locally by default. The
shell was developed to be similar to the MySQL shell; the biggest differences are
that it’s based on JavaScript and SQL isn’t used. For instance, you can pick your
database and then insert a simple document into the users collection like this:
> use tutorial
> db.users.insert({name: "Kyle"})
The first command indicates which database you want to use. The second
command is a JavaScript expression that inserts a simple document. To see the
results of your insert, you can issue a simple query:
> db.users.find()
{ _id: ObjectId("4ba667b0a90578631c9caea0"), name: "Kyle" }
The find method returns the inserted document, with an object ID added. All
documents require a primary key stored in the _id field. You’re allowed to enter a
custom _id as long as you can guarantee its uniqueness. But if you omit the _id
altogether, a MongoDB object ID will be inserted automatically.
In addition to allowing you to insert and query for data, the shell permits you
to run administrative commands. Some examples include viewing the current
database operation, checking the status of replication to a secondary node, and
configuring a collection for sharding.
Database drivers
The MongoDB drivers are easy to use. The driver is the code used in an application
to communicate with a MongoDB server. All drivers have functionality to query,
retrieve results, write data, and run database commands.
Command-line tools
MongoDB is bundled with several command-line utilities:
mongoexport and mongoimport—Export and import JSON, CSV, and TSV data; this
is useful if you need your data in widely supported formats. mongoimport can also
be good for initial imports of large data sets, although before importing, it’s often
desirable to adjust the data model to take best advantage of MongoDB. In such
cases, it’s easier to import the data through one of the drivers using a custom script.
mongostat—Similar to iostat, this utility constantly polls MongoDB and the system
to provide helpful stats, including the number of operations per second (inserts,
queries, updates, deletes, and so on), the amount of virtual memory allocated, and
the number of connections to the server.
mongotop—Similar to top, this utility polls MongoDB and shows the amount of time
it spends reading and writing data in each collection.
show dbs
We create a user for our database, with its password and role.
Now, first create a collection. A collection is similar to a table in a relational
database.
Now, observe that, just to insert the gender field in the above example, we
had to write all three fields: first_name, last_name and gender. If we write
only gender, the update will replace the entire document with one containing
only the gender field.
To remove a field from a document, we can use the $unset operator. This
will remove the age field from the document where first_name is Steven.
What happens if we try to update a document which is not present in the
collection? If we want a new document to be created when no match is
found, this can be done by using ‘upsert’. With upsert enabled, a new
document with first_name: “Marry” is created.
We can also rename any field using the $rename operator. Here we rename
‘gender’ to ‘sex’.
To remove a particular document, perform a remove query with a matching
condition. A plain remove with the condition first_name: “Steven” will delete all
documents with that name; to remove only the first matching document, pass the
justOne option.
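The update semantics described above ($set, $unset, $rename, and upsert) can be modelled in plain JavaScript. This is a behavioural sketch, not MongoDB’s implementation, and the collection and field names are assumed examples:

```javascript
// Apply a MongoDB-style update document to one plain object.
function applyUpdate(doc, update) {
  for (const [field, value] of Object.entries(update.$set || {})) {
    doc[field] = value;        // $set: change only the named fields
  }
  for (const field of Object.keys(update.$unset || {})) {
    delete doc[field];         // $unset: remove a field
  }
  for (const [from, to] of Object.entries(update.$rename || {})) {
    doc[to] = doc[from];       // $rename: move the value to a new name
    delete doc[from];
  }
  return doc;
}

// Upsert: update the first match, or insert a new document if none.
function updateOne(collection, query, update, { upsert = false } = {}) {
  const doc = collection.find(d =>
    Object.entries(query).every(([k, v]) => d[k] === v));
  if (doc) return applyUpdate(doc, update);
  if (upsert) {
    const fresh = applyUpdate({ ...query }, update);
    collection.push(fresh);
    return fresh;
  }
  return null; // no match and no upsert: nothing happens
}

const students = [{ first_name: "Steven", age: 30 }];
updateOne(students, { first_name: "Steven" }, { $unset: { age: "" } });
updateOne(students, { first_name: "Marry" },
          { $set: { gender: "female" } }, { upsert: true });
console.log(students.length); // prints 2
```

Note how a plain document without operators would have replaced the match entirely; the operators are what make partial updates possible.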
> for (i = 0; i < 20000; i++) { db.numbers.save({num: i}) }
That’s a lot of documents, so don’t be surprised if the insert takes a few seconds to
complete. Once it returns, you can run a couple of queries to verify that all the
documents are present:
> db.numbers.count()
20000
> db.numbers.find()
{ "_id": ObjectId("4bfbf132dba1aa7c30ac830a"), "num": 0 }
{ "_id": ObjectId("4bfbf132dba1aa7c30ac830b"), "num": 1 }
{ "_id": ObjectId("4bfbf132dba1aa7c30ac830c"), "num": 2 }
{ "_id": ObjectId("4bfbf132dba1aa7c30ac830d"), "num": 3 }
{ "_id": ObjectId("4bfbf132dba1aa7c30ac830e"), "num": 4 }
{ "_id": ObjectId("4bfbf132dba1aa7c30ac830f"), "num": 5 }
{ "_id": ObjectId("4bfbf132dba1aa7c30ac8310"), "num": 6 }
{ "_id": ObjectId("4bfbf132dba1aa7c30ac8311"), "num": 7 }
{ "_id": ObjectId("4bfbf132dba1aa7c30ac8312"), "num": 8 }
{ "_id": ObjectId("4bfbf132dba1aa7c30ac8313"), "num": 9 }
{ "_id": ObjectId("4bfbf132dba1aa7c30ac8314"), "num": 10 }
{ "_id": ObjectId("4bfbf132dba1aa7c30ac8315"), "num": 11 }
{ "_id": ObjectId("4bfbf132dba1aa7c30ac8316"), "num": 12 }
{ "_id": ObjectId("4bfbf132dba1aa7c30ac8317"), "num": 13 }
{ "_id": ObjectId("4bfbf132dba1aa7c30ac8318"), "num": 14 }
{ "_id": ObjectId("4bfbf132dba1aa7c30ac8319"), "num": 15 }
{ "_id": ObjectId("4bfbf132dba1aa7c30ac831a"), "num": 16 }
{ "_id": ObjectId("4bfbf132dba1aa7c30ac831b"), "num": 17 }
{ "_id": ObjectId("4bfbf132dba1aa7c30ac831c"), "num": 18 }
{ "_id": ObjectId("4bfbf132dba1aa7c30ac831d"), "num": 19 }
The count() command shows that you’ve inserted 20,000 documents. The
subsequent query displays the first 20 results (this number may be different in your
shell). You can display additional results with the it command:
> it
{ "_id": ObjectId("4bfbf132dba1aa7c30ac831e"), "num": 20 }
{ "_id": ObjectId("4bfbf132dba1aa7c30ac831f"), "num": 21 }
{ "_id": ObjectId("4bfbf132dba1aa7c30ac8320"), "num": 22 }
...
The it command instructs the shell to return the next result set. With a sizable set of
documents available, let’s try a couple of queries. Given what you know about
MongoDB’s query engine, a simple query matching a document on its num
attribute makes sense:
> db.numbers.find({num: 500})
RANGE QUERIES
More interestingly, you can also issue range queries using the special $gt and $lt
operators. They stand for greater than and less than, respectively. Here’s how you
query for all documents with a num value greater than 19,995:
> db.numbers.find({num: {$gt: 19995}})
You can also combine the two operators to specify upper and lower boundaries:
> db.numbers.find({num: {$gt: 19995, $lt: 19998}})
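The semantics of $gt and $lt can be sketched in plain JavaScript; this is an illustrative filter over sample documents, not MongoDB’s query engine:

```javascript
// Evaluate a MongoDB-style range condition against one document.
function matches(doc, field, cond) {
  const v = doc[field];
  if ("$gt" in cond && !(v > cond.$gt)) return false;
  if ("$lt" in cond && !(v < cond.$lt)) return false;
  return true; // both operators in one condition AND together
}

// 20 sample documents, shaped like the numbers collection above:
const numbers = Array.from({ length: 20 }, (_, i) => ({ num: i }));

// num greater than 15:
const above = numbers.filter(d => matches(d, "num", { $gt: 15 }));
console.log(above.length); // prints 4

// combine upper and lower bounds: 5 < num < 10
const both = numbers.filter(d => matches(d, "num", { $gt: 5, $lt: 10 }));
console.log(both.map(d => d.num)); // prints [ 6, 7, 8, 9 ]
```

Both bounds are exclusive, which is why 5 and 10 themselves are not in the combined result.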
Refer theory from Pages 73, 74 and 75… of the book “MongoDB in Action”, Kyle
Banker, Peter Bakkum, Shaun Verch, Dreamtech Press.
Refer theory from Page 98… of the book “MongoDB in Action”, Kyle Banker,
Peter Bakkum, Shaun Verch, Dreamtech Press.
Refer theory from Page 103… of the book “MongoDB in Action”, Kyle Banker,
Peter Bakkum, Shaun Verch, Dreamtech Press.