BIG DATA ANALYTICS (Introduction & Life Cycle)
C Ranichandra, SITE, VIT

'Big Data' is a term used to describe a collection of data that is huge in size and yet growing exponentially with time. Such data is so large and complex that none of the traditional data management tools can store or process it efficiently.
(ii) Variety – Variety refers to heterogeneous sources and the nature of data, both structured and unstructured. Nowadays, data in the form of emails, photos, videos, monitoring devices, PDFs, audio, etc. is also being considered in analysis applications. This variety of unstructured data poses certain issues for storing, mining, and analysing data.
(iii) Velocity – The term 'velocity' refers to the speed of generation of data: how fast the data is generated and processed to meet demands. Big data velocity deals with the speed at which data flows in from sources like business processes, application logs, networks and social media sites, sensors, mobile devices, etc. The flow of data is massive and continuous.

(iv) Veracity – Veracity refers to the uncertainty of data, i.e. whether the obtained data is correct or consistent. Out of the huge amount of data that is generated in almost every process, only the data that is correct and consistent can be used for further analysis.

Data is obtained primarily from the following types of sources:

Internal sources (organizational or enterprise data): provide structured or organized data that originates from within the enterprise. Eg: customer relationship management (CRM) and enterprise resource planning (ERP) systems, customer details, products and sales data. Application: this data (the current data in the operational system) is used to support the daily business operations of an organization.

External sources (social data): provide unstructured or unorganized data that originates from the external environment of an organization. Eg: business partners, syndicate data suppliers, the Internet, government, market research. Application: this data is often analyzed to understand entities mostly external to the organization, such as customers, competitors, the market, and the environment.
Types of data (based on source):

Social data: refers to information collected from various social networking sites and online portals. Eg: Facebook, Twitter, and LinkedIn.

Machine data: refers to information generated from Radio Frequency Identification (RFID) chips, bar code scanners, and similar devices.

Transactional data: refers to information generated from online shopping sites, retailers, and Business-to-Business (B2B) transactions.

Different types of data based on structure — examples: students' course registration information, students' digital assignments, the FIFA World Cup website.
Unstructured data:
• Unstructured data is a set of data that might or might not have any logical or repeating patterns.
• Consists of metadata, i.e. the additional information related to data.
• Comprises inconsistent data, such as data from files and social media websites.
• Consists of data in different formats such as e-mails, text, audio, video, or images.
• Such data does not follow the proper structure of data models as in relational databases; in other words, it is not stored consistently in rows and columns of a database.

Some sources of unstructured data include:
• Text both internal and external to an organization: documents, logs, survey results, feedback, and emails from both within and across the organization.
• Social media: data obtained from social networking platforms, including YouTube.
• RDF data.
• Data exchange formats like XML or JavaScript Object Notation (JSON) data.

Working with unstructured data poses certain challenges, such as:
• Identifying the unstructured data that can be processed.
• Sorting, organizing, and arranging unstructured data in different sets and formats.
• Combining and linking unstructured data in a more structured format to derive any logical conclusions out of the available information.
• Cost in terms of storage space and the human resources (data analysts and scientists) needed to deal with the exponential growth of unstructured data.
Sample web server access log (an example of machine-generated log data):

64.242.88.10 - - [07/Mar/2004:16:36:22 -0800] "GET /twiki/bin/rdiff/Main/WebIndex?rev1=1.2&rev2=1.1 HTTP/1.1" 200 46373
64.242.88.10 - - [07/Mar/2004:16:37:27 -0800] "GET /twiki/bin/view/TWiki/DontNotify HTTP/1.1" 200 4140
64.242.88.10 - - [07/Mar/2004:16:39:24 -0800] "GET /twiki/bin/v

Data Analytics Life Cycle:
• 4. Model Building — the team also considers whether its existing tools will suffice for executing the models and workflows (for example, fast hardware and parallel processing, if applicable).
• 5. Communicate Results — develop a narrative to summarize and convey findings to stakeholders.
• 6. Operationalize — in Phase 6, the team delivers final reports, briefings, code, and technical documents. In addition, the team may run a pilot project to implement the models in a production environment.
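Log lines like the ones above follow the Apache Common Log Format and can be parsed with a regular expression. A minimal Python sketch (the field names in the pattern are my own choices, not part of the format):

```python
import re

# Common Log Format: host ident authuser [timestamp] "request" status bytes
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<proto>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+)'
)

def parse_log_line(line):
    """Return a dict of fields from one access-log line, or None if malformed."""
    m = LOG_PATTERN.match(line)
    return m.groupdict() if m else None

entry = parse_log_line(
    '64.242.88.10 - - [07/Mar/2004:16:37:27 -0800] '
    '"GET /twiki/bin/view/TWiki/DontNotify HTTP/1.1" 200 4140'
)
print(entry["host"], entry["status"], entry["size"])  # 64.242.88.10 200 4140
```

Parsed this way, each line becomes a record whose fields (host, status, response size, etc.) can feed later phases of the analytics life cycle.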
Responsibilities:

Data Scientist:
• Develop and plan required analytic projects in response to business needs.
• Contribute to data mining architectures, modeling standards, reporting, and data analysis methodologies.
• Interface with business sponsors over the course of an analysis project to provide mining/analysis services.
• Adhere to change control and testing processes for modifications to analytical models.

Data Analyst:
• Perform statistical analysis of business data.
• Design and create data reports and reporting tools to help business executives in their decision making.
• Document the types and structure of the business data (logical modeling).
• Analyze and mine business data to identify patterns and correlations among the various data points.
Banking and Securities
Industry-specific challenges in this industry include: fraud early warning.
Applications: The Securities Exchange Commission (SEC) is using big data to monitor financial market activity. They are currently using network analytics and natural language processors to catch illegal trading activity in the financial markets.

Communications, Media and Entertainment
Industry-specific challenges: since consumers expect rich media on demand in different formats and on a variety of devices, some big data challenges in this industry include:
• Collecting, analyzing, and utilizing consumer insights
• Leveraging mobile and social media content
• Understanding patterns of real-time media content usage
Applications: organizations in this industry simultaneously analyze customer data along with behavioral data to create detailed customer profiles that can be used to recommend content on demand. Spotify, an on-demand music service, uses Hadoop big data analytics to collect data from its millions of users worldwide and then uses the analyzed data to give informed music recommendations to individual users.

Healthcare
Industry-specific challenges: electronic data is unavailable, inadequate, or unusable; the healthcare databases that hold health-related information have made it difficult to link data that can show patterns useful in the medical field; the exclusion of patients from the decision-making process; and the use of data from different readily available sensors.
Applications of big data in the healthcare sector: faster identification and more efficient analysis of healthcare information. Obamacare has also used big data.
Big Data Providers in this industry include: Recombinant Data, Humedica, Explorys and Cerner.

Education
Industry-specific challenges: incorporating big data from different sources and vendors and utilizing it on platforms that were not designed for the varying data; staff and institutions have to learn new data management and analysis tools; and issues of privacy and personal data protection associated with big data used for educational purposes.
Applications: big data is used quite significantly in higher education. For example, the University of Tasmania, an Australian university with over 26,000 students, has deployed a learning and management system that tracks, among other things, when a student logs onto the system, how much time is spent on different pages in the system, as well as the overall progress of a student over time. Big data is also used to measure teachers' effectiveness, to ensure a good experience for both students and teachers; teachers' performance can be fine-tuned and measured against student numbers and subject matter. On a governmental level, the Office of Educational Technology in the U.S. Department of Education is using big data to develop analytics to help course-correct students who are going astray while using online big data courses. Click patterns are also being used to detect boredom.

Manufacturing and Natural Resources
Industry-specific challenges: increasing demand for natural resources, including oil, agricultural products, minerals, gas, metals, and so on, has led to an increase in the volume, complexity, and velocity of data that is a challenge to handle. Large volumes of data from the manufacturing industry remain untapped; the underutilization of this information prevents improved quality of products, energy efficiency, reliability, and better profit margins.
Applications: predictive modeling to support decision making has been utilized to ingest and integrate large amounts of geospatial, graphical, text, and temporal data. Areas of interest where this has been used include seismic interpretation and reservoir characterization. Big data has also been used in solving today's manufacturing challenges and to gain competitive advantage, among other benefits.

Government
Industry-specific challenges: in governments, the biggest challenges are the integration and interoperability of big data across different government departments and affiliated organizations.
Applications: energy exploration, financial market analysis, fraud detection, health-related research, and environmental protection. Some more specific examples are as follows: the Food and Drug Administration (FDA) is using big data to detect and study patterns of food-related illnesses and diseases; this allows for faster response, which has led to faster treatment and fewer deaths.
Big Data Providers in this industry include: Digital Reasoning, Socrata and HP.

Insurance
Industry-specific challenges: lack of personalized services, lack of personalized pricing, and the lack of targeted services to new segments and to specific market segments are some of the main challenges. Challenges identified by professionals in the insurance industry also include underutilization of data gathered by loss adjusters and a hunger for better insight.
Applications: big data has been used in the industry to provide customer insights for transparent and simpler products, by analyzing and predicting customer behavior through data derived from social media, GPS-enabled devices, and CCTV footage. Big data also allows for better customer retention by insurance companies. When it comes to claims management, predictive analytics from big data has been used to offer faster service, since massive amounts of data can be analyzed, especially in the underwriting stage. Fraud detection has also been enhanced. Through massive data from digital channels and social media, real-time monitoring of claims throughout the claims cycle has been used to provide insights.
Big Data Providers in this industry include: Sprint, Qualcomm, Octo Telematics, The Climate Corp.
Retail and Wholesale Trade
Industry-specific challenges: from traditional brick-and-mortar retailers and wholesalers to current-day e-commerce traders, the industry has gathered a lot of data over time. This data, derived from customer loyalty cards, POS scanners, RFID, etc., is not being used enough to improve customer experiences on the whole; any changes and improvements made have been quite slow.
Applications: big data from customer loyalty data, POS, store inventory, and local demographics data continues to be gathered by retail and wholesale stores. At New York's Big Show retail trade conference in 2014, companies like Microsoft, Cisco and IBM pitched the need for the retail industry to utilize big data for analytics and for other uses, including:
• Optimized staffing through data from shopping patterns, local events, and so on
• Reduced fraud
• Timely analysis of inventory
Social media is also used for customer prospecting, customer retention, promotion of products, and more.
Big Data Providers in this industry include: First Retail, First Insight, Fujitsu, Infor, Epicor and Vistex.

Transportation
Industry-specific challenges: in recent times, huge amounts of data from location-based social networks and high-speed data from telecoms have affected travel behavior. Regrettably, research to understand travel behavior has not progressed as quickly; in most places, transport demand models are still based on poorly understood new social media structures.
Applications of big data in the transportation industry:
• Government use: traffic control, route planning, intelligent transport systems, congestion management (by predicting traffic conditions).
• Private-sector use: revenue management, technological enhancements, logistics, and competitive advantage (by consolidating shipments and optimizing freight movement).
• Individual use: route planning to save on fuel and time, travel arrangements in tourism, etc.
Example: IBM's "City of Dublin" project for congestion and road traffic control (2010, 66 million).
Big Data Providers in this industry include: Qualcomm and Manhattan Associates.
10. Energy and Utilities
Industry-Specific challenges
Suppliers of electricity, gas and water are finding ways to analyse the vast volumes of data their new smart systems are generating in order to
gain insights in customer trends and operational efficiencies.
Applications:
Smart meter readers allow data to be collected almost every 15 minutes as opposed to once a day with the old meter readers.
This granular data is being used to analyze consumption of utilities better which allows for improved customer feedback and
better control of utilities use.
In utility companies, the use of big data also allows for better asset and workforce management, which is useful for recognizing errors and correcting them as soon as possible before complete failure is experienced, and for detecting illegal tapping.
A smart grid - an electrical grid which includes a variety of operational and energy measures including smart meters, smart
appliances, renewable energy resources, and energy efficiency resources
Big Data Providers in this industry include: Alstom, Siemens, ABB, Cloudera, and IBM.
DATA ANALYTICS (HDFS)

HDFS Concepts
• Block caching: frequently accessed blocks are cached in datanode memory; applications can instruct the namenode which blocks to cache.
• HDFS federation: the namenode keeps a reference to every file and block in the filesystem. HDFS federation allows a cluster to scale by adding namenodes, each of which manages a portion of the filesystem.
• HDFS high availability: the combination of replicating namenode metadata on multiple filesystems and using a secondary namenode to create checkpoints protects against data loss, but the namenode is still a single point of failure (SPOF).
• Hadoop 2 remedied this situation with HDFS HA: in this implementation there is a pair of namenodes in an active-standby configuration.
• A few architectural changes are needed to allow this to happen:
  - Namenodes must use highly available shared storage.
  - Datanodes must send block reports to both namenodes.
  - Clients must be configured to handle namenode failover.
  - The secondary namenode role is subsumed by the standby.
• Choices for HA shared storage: an NFS filer, or the quorum journal manager (QJM), the recommended choice for HDFS installations.
• Failover controller: manages the transition from the active namenode to the standby (the default implementation uses ZooKeeper).
• Fencing: the HA implementation ensures the previously active namenode is prevented from doing any damage and causing corruption.
What is Hadoop?

Hadoop Cluster:
Normally, any set of loosely or tightly connected computers that work together as a single system is called a cluster. In simple words, a computer cluster used for Hadoop is called a Hadoop cluster.
A Hadoop cluster is a special type of computational cluster designed for storing and analyzing vast amounts of unstructured data in a distributed computing environment. These clusters run on low-cost commodity computers.
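The batch processing such a cluster performs is typically MapReduce. As a cluster-free sketch of the idea, here is the same map → shuffle → reduce pattern in plain Python, counting words (this only illustrates the programming model; real Hadoop distributes each phase across nodes):

```python
from collections import defaultdict

def map_phase(records):
    # Mapper: emit a (word, 1) pair for every word in every input record.
    for record in records:
        for word in record.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle/sort: group all emitted values by key, as the framework
    # would between the map and reduce phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reducer: sum the counts for each word.
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["big data analytics", "big data tools"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts["big"], counts["data"], counts["tools"])  # 2 2 1
```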
HBase:
Hadoop Database, or HBase, is a non-relational (NoSQL) database that runs on top of HDFS. HBase was created for large tables that have billions of rows and millions of columns, with fault-tolerance capability and horizontal scalability; it is based on Google's Bigtable. Hadoop can perform only batch processing, and data will be accessed only in a sequential manner; for random access to huge data, HBase is used.

Hive:
Hive was created by Facebook and later donated to the Apache foundation. Hive mainly deals with structured data stored in HDFS, with a query language similar to SQL known as HQL (Hive Query Language). Hive also runs MapReduce programs in the backend to process data in HDFS.
Pig:
Similar to Hive, Pig also deals with structured data, using the Pig Latin language. Pig was originally developed at Yahoo. It is an alternative for programmers who love scripting and don't want to use Java/Python or SQL to process data in Hadoop. A Pig Latin program is made up of a series of operations that are applied to the input data.

[Diagram: Pig Latin operations — load, filter, group, distribute, foreach, map — running in local or MapReduce mode.]
[Diagram: Hive access paths — CLI/UI/API (JDBC/ODBC) and Thrift server clients (PHP/Perl/Python/C++/Java) submit HiveQL, which Hive compiles.]
Mahout:
Mahout is an open-source machine learning library from Apache, written in Java. The algorithms it implements fall under the broad umbrella of machine learning or collective intelligence. Mahout aims to be the machine learning tool of choice when the collection of data to be processed is very large, perhaps far too large for a single machine. Uses include predictive analysis: recommenders, clustering, and classification.

Oozie:
Oozie is a workflow scheduler system to manage Hadoop jobs. It is implemented as a Java web application that runs in a Java servlet container. Hadoop basically deals with big data, and a programmer often wants to run many jobs in a sequential manner: the output of job A is the input to job B, the output of job B is the input to job C, and the final output is the output of job C. To automate this sequence we need a workflow, and an engine to execute it — which is what Oozie provides.
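The job chain described above (A feeds B, B feeds C) can be sketched as a plain Python pipeline. The job functions here are hypothetical stand-ins for Hadoop jobs; a real Oozie workflow is declared in XML and executed on the cluster:

```python
def job_a(data):
    # Job A: split raw lines into tokens.
    return [word for line in data for word in line.split()]

def job_b(words):
    # Job B: normalize the tokens produced by job A.
    return [word.lower() for word in words]

def job_c(words):
    # Job C: count the tokens produced by job B.
    return len(words)

def run_workflow(jobs, initial_input):
    """Run jobs sequentially: each job's output is the next job's input."""
    result = initial_input
    for job in jobs:
        result = job(result)
    return result

final = run_workflow([job_a, job_b, job_c], ["Big Data", "Hadoop Jobs"])
print(final)  # 4
```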
Zookeeper:
ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services.
Writing distributed applications is difficult because partial failures may occur between nodes. To overcome this, Apache ZooKeeper was developed: an open-source server which enables highly reliable distributed coordination. In case of any partial failure, clients can connect to any node and be assured that they will receive the correct, up-to-date information.
NoSQL:
Since 1970, the RDBMS was the only solution for data storage, manipulation, and maintenance. After data changed in all dimensions (the Vs), companies realized they needed new solutions for processing big data. Hadoop was one such solution, but it offers only sequential access.

HBase:
HBase is a distributed column-oriented database built on top of HDFS. Based on Google's Bigtable, it provides random access to structured data. It is a part of the Hadoop ecosystem and provides random read/write access to data in HDFS.

HDFS and HBase compared:
HDFS — distributed file system; high-latency batch processing; only sequential access of data.
HBase — database on HDFS; low-latency access to single rows; random access by hash index.
Applications of HBase:
• Write-heavy applications
• Random access of data
• Facebook, Twitter, Yahoo, and Adobe use HBase internally.

HBase architecture:
Tables are split into regions and served by region servers; regions are divided into stores and stored in HDFS.
HBase shell commands:

General: status, version, table_help, whoami

DDL: create, list, disable, is_disabled, enable, is_enabled, describe, alter, exists, drop_all (drop tables matching a regex)

DML:
• put — put a cell value
• get — get a row or cell
• delete — delete a cell value
• deleteall — delete all the cells in a row
• scan — scan and return table values
• count — number of rows in a table
• truncate — disable, drop, and recreate a specified table

[Slides: DDL, DML, and DDL+DML command screenshots; exercise: MBA Admissions.]
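The DML commands above act on a simple row → column → value model. As a rough illustration (this is not real HBase client code — the table, row keys, and column names are made up), here is the same put/get/delete/deleteall/scan/count behavior on an in-memory Python dict:

```python
# Toy in-memory "table": row key -> {"family:qualifier": value},
# mimicking the semantics of the HBase shell DML commands.
table = {}

def put(row, column, value):
    table.setdefault(row, {})[column] = value   # put: set a cell value

def get(row):
    return table.get(row, {})                   # get: fetch a whole row

def delete(row, column):
    table.get(row, {}).pop(column, None)        # delete: remove one cell

def deleteall(row):
    table.pop(row, None)                        # deleteall: remove a row's cells

def scan():
    return sorted(table.items())                # scan: rows in key order

def count():
    return len(table)                           # count: number of rows

put("r1", "info:name", "Rani")
put("r1", "info:salary", "29000")
put("r2", "info:name", "Uma")
print(count())        # 2
delete("r1", "info:salary")
deleteall("r2")
print(get("r1"))      # {'info:name': 'Rani'}
```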
Hive:
• A Hadoop-based data-warehousing-like framework developed by Facebook.
• Users fire queries using an SQL-like language, HiveQL, so SQL programmers can use the warehouse without MapReduce knowledge.
• It stores the schema in a database and processed data in HDFS.
• It is designed for OLAP.
• It provides an SQL-type language for querying, called HiveQL or HQL.
• It is familiar, fast, scalable, and extensible.
Architecture of Hive:

Meta store — Hive chooses respective database servers to store the schema or metadata of tables, databases, columns in a table, their data types, and the HDFS mapping.

HiveQL process engine — HiveQL is similar to SQL for querying on the schema info in the metastore. It is one of the replacements of the traditional approach for MapReduce programs: instead of writing a MapReduce program in Java, we can write a query for the MapReduce job and process it.

Execution engine — the conjunction of the HiveQL process engine and MapReduce is the Hive execution engine. The execution engine processes the query and generates the same results as MapReduce; it uses the flavor of MapReduce.

HDFS or HBase — the Hadoop distributed file system or HBase are the data storage techniques used to store data in the file system.
Hive query workflow steps:
1. Execute Query — the Hive interface, such as the command line or web UI, sends the query to the driver (any database driver such as JDBC, ODBC, etc.) to execute.
2. Get Plan — the driver takes the help of the query compiler, which parses the query to check the syntax and the query plan or the requirement of the query.
3. Get Metadata — the compiler sends a metadata request to the metastore (any database).
4. Send Metadata — the metastore sends the metadata as a response to the compiler.
5. Send Plan — the compiler checks the requirement and resends the plan to the driver. Up to here, the parsing and compiling of the query is complete.
6. Execute Plan — the driver sends the execute plan to the execution engine.
7. Execute Job — internally, the execution job is a MapReduce job. The execution engine sends the job to the JobTracker (in the name node), which assigns the job to TaskTrackers (in the data nodes). Here, the query executes as a MapReduce job.
7.1. Metadata Ops — meanwhile, the execution engine can execute metadata operations with the metastore.
8. Fetch Result — the execution engine receives the results from the data nodes.
9. Send Results — the execution engine sends those resultant values to the driver.
10. Send Results — the driver sends the results to the Hive interfaces.
HiveQL — Create: [slides with CREATE examples]

JSON in Hive:

A simple JSON document:
{ "fruit": "Apple", "size": "Large", "color": "Red" }

Employee records as JSON:
{ "emp":
  { "e1": { "eid": "12", "name": "Rani", "salary": "29000" },
    "e2": { "eid": "13", "name": "uma", "salary": "29000" } } }

Extracting fields with get_json_object:
select get_json_object(str, '$.eid') as eid,
       get_json_object(str, '$.name') as name
from json_emp;
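Hive's get_json_object pulls a field out of a JSON string by a '$.'-prefixed path. A minimal Python equivalent for simple dotted paths, applied to one employee record as each row's string column might hold it (the json_emp table and its str column are taken from the query above):

```python
import json

def get_json_object(json_str, path):
    """Minimal stand-in for Hive's get_json_object, for simple '$.a.b' paths."""
    obj = json.loads(json_str)
    for field in path.lstrip("$.").split("."):
        obj = obj[field]
    return obj

# One employee record, as a row of json_emp might store it.
row = '{"eid": "12", "name": "Rani", "salary": "29000"}'
print(get_json_object(row, "$.eid"), get_json_object(row, "$.name"))  # 12 Rani
```

Nested paths work the same way, e.g. '$.emp.e1.eid' against the full document shown above. (Hive's real function also supports array indexing and wildcards, which this sketch omits.)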
S3

Concepts: buckets, objects, keys, regions.

Buckets — a bucket is a container for objects stored in Amazon S3. Every object is contained in a bucket.

Objects — objects are the fundamental entities stored in Amazon S3. Objects consist of object data and metadata.
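A bucket name, region, and object key together address one object. Assuming S3's documented virtual-hosted-style URL format (the bucket and key names below are hypothetical, just for illustration):

```python
def object_url(bucket, region, key):
    """Virtual-hosted-style URL for an S3 object: bucket + region + key."""
    return f"https://{bucket}.s3.{region}.amazonaws.com/{key}"

# Hypothetical bucket and key, showing how the pieces combine.
url = object_url("my-course-data", "us-east-1", "logs/2004/access.log")
print(url)
```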
Common operations: [slides with examples of common S3 operations]

Impala:
Impala is an MPP (Massively Parallel Processing) SQL query engine for processing huge volumes of data stored in a Hadoop cluster. It is open-source software written in C++ and Java. It provides high performance and low latency compared to other SQL engines for Hadoop. In other words, Impala is the highest-performing SQL engine (giving an RDBMS-like experience), providing the fastest way to access data stored in the Hadoop Distributed File System.
Impala combines the SQL support and multi-user performance of a traditional analytic database with the scalability and flexibility of Apache Hadoop, by utilizing standard components such as HDFS, HBase, the metastore, YARN, and Sentry. With Impala, users can communicate with HDFS or HBase using SQL queries in a faster way compared to other SQL engines like Hive. Impala can read almost all the file formats used by Hadoop, such as Parquet, Avro, and RCFile.
Trigger support compared: HBase provides support for triggers; Hive does not provide any support for triggers; Impala does not provide any support for triggers.

Running Impala: in the Cloudera VM, start the impala-shell.