
BIG DATA ANALYTICS (Introduction & Life Cycle)
C Ranichandra, SITE, VIT

Overview

• Big Data
'Big Data' is a term used to describe a collection of data that is huge in size and yet growing exponentially with time.
Such data is so large and complex that none of the traditional data management tools are able to store it or process it efficiently.

Where we have Big Data

• Business
• Education
• Medical

Characteristics Of 'Big Data'

(i) Volume – The name 'Big Data' itself is related to a size which is enormous. The size of data plays a crucial role in determining the value that can be derived from it. Volume is the amount of data generated by organizations or individuals.

(ii) Variety – Variety refers to heterogeneous sources and the nature of data, both structured and unstructured. Nowadays, data in the form of emails, photos, videos, monitoring devices, PDFs, audio, etc. is also being considered in analysis applications. This variety of unstructured data poses certain issues for storage, mining and analysing data.

(iii) Velocity – The term 'velocity' refers to the speed of generation of data. How fast the data is generated and processed to meet the demands determines the real potential in the data. Big Data velocity deals with the speed at which data flows in from sources like business processes, application logs, networks and social media sites, sensors, mobile devices, etc. The flow of data is massive and continuous.

(iv) Veracity – Veracity generally refers to the uncertainty of data, i.e. whether the obtained data is correct or consistent. Out of the huge amount of data that is generated in almost every process, only the data that is correct and consistent can be used for further analysis.

Data is obtained primarily from the following types of sources:

Internal sources, such as organizational or enterprise data
Internal sources provide structured or organized data that originates from within the enterprise and helps run the business.
Eg: Customer relationship management (CRM) and Enterprise resource planning (ERP) systems, customer details, products and sales data
Application: This data (current data in the operational system) is used to support the daily business operations of an organization.

External sources, such as social data
External sources provide unstructured or unorganized data that originates from the external environment of an organization.
Eg: Business partners, syndicate data suppliers, Internet, Government, market research
Application: This data is often analyzed to understand entities that are mostly external to the organization, such as customers, competitors, market, and environment.

Types of data:

Social Data:
Refers to the information collected from various social networking sites and online portals
Eg: Facebook, Twitter, and LinkedIn

Machine Data:
Refers to the information generated from Radio Frequency Identification (RFID) chips, bar code scanners and sensors
Eg: Faculty attendance, Library books issue

Transactional Data:
Refers to the information generated from online shopping sites, retailers, and Business to Business (B2B) transactions
Eg: Retail websites like eBay and Amazon.com, Google

Different types of Data based on Structure (examples):
• Students course registration information
• Students Digital Assignments
• FIFA World Cup website


On the basis of the structure of data received from the sources, Big Data comprises:
• Structured data
• Unstructured data
• Semi-structured data

Structured data:
• Is defined as data that has a defined repeating pattern. This pattern makes it easier for any program to sort, read, and process the data. Processing structured data is much easier and faster than processing data without any specific repeating pattern.
• Is organized data in a predefined format
• Is data that resides in fixed fields within a record or file
• Is formatted data that has entities and their attributes mapped
• Is used to query and report against predetermined data types

Some sources of structured data include:
• Relational databases
• Flat files in the form of records
• Multidimensional databases
• Legacy databases

Unstructured Data:
•Unstructured data is a set of data that might or might not have any logical or repeating patterns.
•Consists of metadata, i.e. the additional information related to the data
•Comprises inconsistent data such as data from files and social media websites
•Consists of data in different formats such as e-mails, text, audio, video, or images

Some sources of unstructured data include:
•Text both internal/external to an organization: documents, logs, survey results, feedback and emails from both within and across the organization
•Social media: data obtained from social networking platforms including YouTube, Facebook, Twitter, LinkedIn, and Flickr
•Mobile data: data such as text messages and location information

Working with unstructured data poses certain challenges like:
•Identifying the unstructured data that can be processed
•Sorting, organizing, and arranging unstructured data in different sets and formats
•Combining and linking unstructured data in a more structured format to derive any logical conclusions out of the available information
•Cost in terms of storage space and human resources (data analysts and scientists) needed to deal with the exponential growth of unstructured data

Semi-Structured Data

•Semi-structured data, also known as having a schema-less or self-describing structure,


refers to a form of structured data that contains tags or markup elements in order to
separate elements and generate hierarchies of records and fields in the given data.

•Such type of data does not follow the proper structure of data models as in relational
databases.

•In other words, data is stored inconsistently in rows and columns of a database.

Some sources for semi-structured data include:

•RDF

•File systems like Web data in the form of cookies

•Data exchange formats like XML or JavaScript Object Notation (JSON) data

•NoSQL databases

Example (excerpt from a web server access log, semi-structured data):

64.242.88.10 - - [07/Mar/2004:16:05:49 -0800] "GET /twiki/bin/edit/Main/Double_bounce_sender?topicparent=Main.ConfigurationVariables HTTP/1.1" 401 12846
64.242.88.10 - - [07/Mar/2004:16:06:51 -0800] "GET /twiki/bin/rdiff/TWiki/NewUserTemplate?rev1=1.3&rev2=1.2 HTTP/1.1" 200 4523
64.242.88.10 - - [07/Mar/2004:16:10:02 -0800] "GET /mailman/listinfo/hsdivision HTTP/1.1" 200 6291
64.242.88.10 - - [07/Mar/2004:16:11:58 -0800] "GET /twiki/bin/view/TWiki/WikiSyntax HTTP/1.1" 200 7352
64.242.88.10 - - [07/Mar/2004:16:20:55 -0800] "GET /twiki/bin/view/Main/DCCAndPostFix HTTP/1.1" 200 5253

Data Analytics Life Cycle

• 1. Discovery
• 2. Data Preparation
• 3. Model Planning
• 4. Model Building
• 5. Communicate Results
• 6. Operationalize
Data Analytics Life Cycle:

Phase 1—Discovery: In Phase 1, the team learns the business domain,


including relevant history such as whether the organization or business
unit has attempted similar projects in the past from which they can
learn.

•The team assesses the resources available to support the project in


terms of people, technology, time, and data.

•Important activities in this phase include framing the business problem


that can be addressed in subsequent phases and formulating initial
hypotheses (IHs) to test and begin learning the data.

Phase 2—Data preparation: Phase 2 requires the presence of an analytic sandbox, in which the team can work with data and perform analytics for the duration of the project.
•The team needs to execute extract, load, and transform (ELT) or extract, transform and load (ETL) to get data into the sandbox.
•Data should be transformed so the team can work with it and analyze it.
•In this phase, the team also needs to familiarize itself with the data thoroughly and take steps to condition the data.

Phase 3—Model planning: Phase 3 is model planning, where the team determines the methods, techniques, and workflow it intends to follow for the subsequent model building phase.
•The team explores the data to learn about the relationships between variables and subsequently selects key variables and the most suitable models.

Phase 4—Model building: In Phase 4, the team develops datasets for testing, training, and production purposes. In addition, in this phase the team builds and executes models based on the work done in the model planning phase.
The team also considers whether its existing tools will suffice for running the models, or if it will need a more robust environment for executing models and workflows (for example, fast hardware and parallel processing, if applicable).

Phase 5—Communicate results: In Phase 5, the team, in collaboration with major stakeholders, determines if the results of the project are a success or a failure based on the criteria developed in Phase 1.
•The team should identify key findings, quantify the business value, and develop a narrative to summarize and convey findings to stakeholders.

Phase 6—Operationalize: In Phase 6, the team delivers final reports, briefings, code, and technical documents. In addition, the team may run a pilot project to implement the models in a production environment.
Responsibilities

Data Scientist:
The Data Scientist will be responsible for designing and implementing processes and layouts for complex, large-scale data sets used for modeling, data mining, and research purposes. The role involves business case development, planning, coordination/collaboration with various internal and vendor teams, project-managing the lifecycle of an analysis project, and interfacing with business sponsors to provide periodic updates. The role requires working on multiple projects simultaneously, being a "super user" of the data, and intense intellectual curiosity.

Typical responsibilities:
 Develop and plan required analytic projects in response to business needs.
 Conduct research and make recommendations on data mining products, services, protocols, and standards in support of procurement and development efforts.
 Work with application developers to extract data relevant for analysis.
 Develop new tools or methods for data acquisition and transformation.
 In conjunction with data owners and department managers, contribute to the development of data models and protocols for mining production databases.
 Contribute to data mining architectures, modeling standards, reporting, and data analysis methodologies.
 Collaborate with unit managers, end users, development staff, and other stakeholders to integrate data mining results with existing systems.
 Provide and apply quality assurance best practices for data mining/analysis services.
 Adhere to change control and testing processes for modifications to analytical models.

The Data Analyst
is the professional whose focus of analysis and problem solving relates to data, types of data, and relationships among data elements within a business system or IT system.

The data analyst role can vary but commonly involves the following:
 Mapping and tracing data from system to system in order to solve a given business or system problem,
 Designing and creating data reports and reporting tools to help business executives in their decision making,
 Documenting the types and structure of the business data (logical modeling),
 Analyzing and mining business data to identify patterns and correlations among the various data points,
 Performing statistical analysis of business data.

Data Analyst
» Performs logical data modeling
» Identifies patterns in data
» Designs and creates reports

Other common titles for this role are:
Data Modeler, Business Intelligence Analyst, Data Warehouse Analyst, Systems Analyst, Business Analyst (generic term), etc.
Analyst and Scientist

Data analysts are generally well versed in SQL, they know some Regular Expressions, they can slice and dice data, they can use analytics or BI packages – like Tableau or Pentaho or an in-house analytics solution – and they can tell a story from the data. They should also have some level of scientific curiosity.

On the other end of the spectrum, a Data Scientist will have quite a bit of machine learning and engineering or programming skills and will be able to manipulate data to his or her own will.

Key Roles in Analytic Project

1. Data Hygienists make sure that data coming into the system is clean and accurate, and stays that way over the entire data lifecycle. For example, are the time values being captured the same? One data set might be measuring calendar days in a year (365), another working days in a year (260), and yet another hours in a year (8,760). All the values have to be the same so that comparisons are possible. Or have old data fields been populated with new types of data but under the old field names?
[Clean, enhance, standardize; data integrity management; clustering]

2. Data Explorers sift through mountains of data to discover the data you actually need. That can be a significant task because so much data out there was never intended for analytic use and, therefore, is not stored or organized in a way that's easy to access. Cash register data is a perfect example. Its original function was to allow companies to track revenue, not to predict what product a given customer would buy next.
[Data aggregation; dimensionality reduction]

3. Business Solution Architects put the discovered data together and organize it so that it's ready to analyze. They structure the data to ensure it can be usefully queried in appropriate timeframes by all users. Some data needs to be accessed by the minute or hour, for example, so that data needs to be updated every minute or hour.
[DFS; key-value stores]

4. Data Scientists take this organized data and create sophisticated analytics models that, for example, help predict customer behavior and allow advanced customer segmentation and pricing optimization. They ensure each model is updated frequently so it remains relevant for longer.
[MapReduce; mining: classification, clustering, frequent item sets]

5. Campaign Experts turn the models into results. They have a thorough knowledge of the technical systems that deliver specific marketing campaigns, such as which customer should get what message when. They use what they learn from the models to prioritize channels and sequence the campaigns — for example, based on analysis of an identified segment's historical behavior it will be most effective to first send an email and then follow it up 48 hours later with a direct mail.
[Mails, presentations, video conferences]
Big Data in Industry Verticals

1. Banking and Securities

Industry-specific challenges in this industry include:
 fraud early warning,
 card fraud detection,
 archival of audit trails {an audit log is a security-relevant chronological record that provides documentary evidence of a sequence of activities},
 enterprise credit risk reporting,
 trade visibility,
 customer data transformation.

Applications
The Securities and Exchange Commission (SEC) is using big data to monitor financial market activity. They are currently using network analytics and natural language processors to catch illegal trading activity in the financial markets.
Retail traders, big banks, hedge funds and other so-called 'big boys' in the financial markets use big data for trade analytics used in high-frequency trading, pre-trade decision-support analytics, sentiment measurement, predictive analytics, etc.
This industry also heavily relies on big data for risk analytics, including anti-money laundering, demand enterprise risk management, "Know Your Customer", and fraud mitigation.


2. Communications, Media and Entertainment

Industry-specific challenges
Since consumers expect rich media on demand in different formats and on a variety of devices, some big data challenges in the communications, media and entertainment industry include:
 Collecting, analyzing, and utilizing consumer insights
 Leveraging mobile and social media content
 Understanding patterns of real-time media content usage

Applications
Organizations in this industry simultaneously analyze customer data along with behavioral data to create detailed customer profiles that can be used to:
 Create content for different target audiences
 Recommend content on demand
 Measure content performance
Spotify, an on-demand music service, uses Hadoop big data analytics to collect data from its millions of users worldwide and then uses the analyzed data to give informed music recommendations to individual users.

3. Healthcare Providers

Industry-specific challenges
 Electronic data is unavailable, inadequate, or unusable.
 The healthcare databases that hold health-related information have made it difficult to link data that can show patterns useful in the medical field.
 The exclusion of patients from the decision-making process, and the use of data from different readily available sensors.

Applications of big data in the healthcare sector
 Centralized patient information for medical assistance and diagnosis
 Faster identification and efficient analysis of healthcare information
 Used in tracking the spread of chronic disease
 Obamacare has used big data
Big Data Providers in this industry include: Recombinant Data, Humedica, Explorys and Cerner

4. Education

Industry-specific challenges
 Incorporating big data from different sources and vendors and utilizing it on platforms that were not designed for the varying data.
 Staff and institutions have to learn the new data management and analysis tools.
 Issues of privacy and personal data protection associated with big data used for educational purposes are a challenge.

Applications
 Big data is used quite significantly in higher education. For example, The University of Tasmania, an Australian university with over 26,000 students, has deployed a Learning and Management System that tracks, among other things, when a student logs onto the system, how much time is spent on different pages in the system, as well as the overall progress of a student over time.
 Measuring teachers' effectiveness to ensure a good experience for both students and teachers. Teacher performance can be fine-tuned and measured against student numbers, subject matter, etc.
 On a governmental level, the Office of Educational Technology in the U.S. Department of Education is using big data to develop analytics to help course-correct students who are going astray while using online big data courses. Click patterns are also being used to detect boredom.
Big Data Providers: Carnegie Learning, D2L (Desire to Learn)

5. Manufacturing and Natural Resources

Industry-specific big data challenges
 Increasing demand for natural resources including oil, agricultural products, minerals, gas, metals, and so on has led to an increase in the volume, complexity, and velocity of data that is a challenge to handle.
 Large volumes of data from the manufacturing industry are untapped. The underutilization of this information prevents improved quality of products, energy efficiency, reliability, and better profit margins.

Applications
 Predictive modeling to support decision making has been utilized to ingest and integrate large amounts of data from geospatial data, graphical data, text and temporal data. Areas of interest where this has been used include seismic interpretation and reservoir characterization.
 Big data has also been used in solving today's manufacturing challenges and to gain competitive advantage, among other benefits.


6. Government

Industry-specific challenges
In governments the biggest challenges are the integration and interoperability of big data across different government departments and affiliated organizations.

Applications
 energy exploration
 financial market analysis
 fraud detection
 health-related research
 environmental protection
Some more specific examples are as follows:
The Food and Drug Administration (FDA) is using big data to detect and study patterns of food-related illnesses and diseases. This allows for faster response, which has led to faster treatment and fewer deaths.
Big Data Providers in this industry include: Digital Reasoning, Socrata and HP

7. Insurance

Industry-specific challenges
 Lack of personalized services, lack of personalized pricing and the lack of targeted services to new segments and to specific market segments are some of the main challenges.
 Challenges identified by professionals in the insurance industry include underutilization of data gathered by loss adjusters and a hunger for better insight.

Applications
 Big data has been used in the industry to provide customer insights for transparent and simpler products, by analyzing and predicting customer behavior through data derived from social media, GPS-enabled devices and CCTV footage. Big data also allows for better customer retention by insurance companies.
 When it comes to claims management, predictive analytics from big data has been used to offer faster service, since massive amounts of data can be analyzed, especially in the underwriting stage.
 Fraud detection has also been enhanced.
 Through massive data from digital channels and social media, real-time monitoring of claims throughout the claims cycle has been used to provide insights.
Big Data Providers in this industry include: Sprint, Qualcomm, Octo Telematics, The Climate Corp.

8. Retail and Wholesale Trade

Industry-specific challenges
From traditional brick-and-mortar retailers and wholesalers to current-day e-commerce traders, the industry has gathered a lot of data over time. This data, derived from customer loyalty cards, POS scanners, RFID, etc. is not being used enough to improve customer experiences on the whole. Any changes and improvements made have been quite slow.

Applications
 Big data from customer loyalty data, POS, store inventory, and local demographics data continues to be gathered by retail and wholesale stores.
 At New York's Big Show retail trade conference in 2014, companies like Microsoft, Cisco and IBM pitched the need for the retail industry to utilize big data for analytics and for other uses, including:
  Optimized staffing through data from shopping patterns, local events, and so on
  Reduced fraud
  Timely analysis of inventory
 Social media is used for customer prospecting, customer retention, promotion of products, and more.
Big Data Providers in this industry include: First Retail, First Insight, Fujitsu, Infor, Epicor and Vistex

9. Transportation

Industry-specific challenges
In recent times, huge amounts of data from location-based social networks and high-speed data from telecoms have affected travel behavior. Regrettably, research to understand travel behavior has not progressed as quickly. In most places, transport demand models are still based on poorly understood new social media structures.

Applications of big data in the transportation industry
 Government use of big data: traffic control, route planning, intelligent transport systems, congestion management (by predicting traffic conditions)
 Private-sector use of big data in transport: revenue management, technological enhancements, logistics and competitive advantage (by consolidating shipments and optimizing freight movement)
 Individual use of big data includes: route planning to save on fuel and time, travel arrangements in tourism, etc.
 IBM "City of Dublin" (2010): congestion and road traffic control, 66 million
Big Data Providers in this industry include: Qualcomm and Manhattan Associates
10. Energy and Utilities

Industry-specific challenges
Suppliers of electricity, gas and water are finding ways to analyse the vast volumes of data their new smart systems are generating in order to gain insights into customer trends and operational efficiencies.

Applications
 Smart meter readers allow data to be collected almost every 15 minutes, as opposed to once a day with the old meter readers. This granular data is being used to better analyze consumption of utilities, which allows for improved customer feedback and better control of utilities use.
 In utility companies the use of big data also allows for better asset and workforce management, which is useful for recognizing errors and correcting them as soon as possible before complete failure is experienced, and for detecting illegal tapping.
 A smart grid is an electrical grid which includes a variety of operational and energy measures, including smart meters, smart appliances, renewable energy resources, and energy efficiency resources.
Big Data Providers in this industry include: Alstom, Siemens, ABB, Cloudera and IBM


DATA ANALYTICS (HDFS)

Hadoop Intro

Data Storage and Analysis
 Although storage capacities of hard drives have increased massively, access speeds have not kept up
 1990: 1370 MB capacity, 4.4 MB/s transfer speed
 2010: 1 TB capacity, 100 MB/s transfer speed
 Alternative: parallel read/write
 Problem 1: with many pieces of hardware, failures occur; use replication
 Problem 2: combining data that is distributed across many pieces of hardware

History of Hadoop
 Created by Doug Cutting, the creator of Apache Lucene (a text search library)
 The name Hadoop is not an acronym; it is a made-up name
 Hadoop has its origin in Apache Nutch, an open source web search engine
 Nutch was started in 2002 as a crawler and search system, but could not scale to billions of pages
 Mike Cafarella and Cutting estimated the cost of a system that could support a billion-page index
 2003: Google published its distributed file system (GFS), used at Google
 2004: NDFS (Nutch Distributed File System) was started
What is Hadoop?
 Hadoop provides a reliable shared storage and analysis system.
 Storage is provided by HDFS
 Analysis by MapReduce
Hadoop is an Apache open source framework written in Java that allows distributed processing of large datasets across clusters of computers using simple programming models. A Hadoop framework application works in an environment that provides distributed storage and computation across clusters of computers. Hadoop is designed to scale up from a single server to thousands of machines, each offering local computation and storage.

DFS
 Partition: when data outgrows the storage capacity of a single machine, partition it across multiple machines
 File systems that manage storage across a network of machines are called distributed file systems (DFS)
 All the network-based complications make a DFS more complex than an ordinary file system
 Hadoop comes with a DFS called HDFS

Design of HDFS
HDFS is a filesystem designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware.
 Very large files: hundreds of MB, GB or TB; there are Hadoop clusters that store PB
 Streaming data access: write once, read many times; a dataset is generated/copied from a source and analysis is done on all of it or a majority of it
 Commodity hardware: Hadoop doesn't require expensive or highly reliable hardware; it is designed to run on commonly available hardware that can be obtained from multiple vendors
HDFS is not a good fit for
 Applications that require low-latency data access (in tens of milliseconds); these will not work well with HDFS, e.g. HBASE
 Lots of small files: the namenode has to hold the metadata
 Multiple writers and arbitrary file modifications: HDFS supports a single writer and append-only writes

HDFS Concepts
• Namenode: manages the filesystem namespace (master), maintains the filesystem tree and metadata; the information is stored in the namespace image and edit log.
• Datanode: the workhorses of the filesystem; they store and retrieve blocks when accessed by clients and report back to the namenode.
• Secondary Namenode: periodically merges the namespace image with the edit log; keeps a copy of the merged namespace image used in the event of the namenode failing.

HDFS Concepts
• Block caching: frequently accessed blocks are cached in datanode memory; applications can instruct the namenode which blocks to cache.
• HDFS Federation: the namenode keeps a reference to every file and block in the filesystem; HDFS federation allows a cluster to scale by adding namenodes, each of which manages a portion of the filesystem.
• HDFS high availability: the combination of replicating namenode metadata on multiple filesystems and using the secondary namenode to create checkpoints protects against data loss, but the namenode is still a SPOF (single point of failure).
HDFS Concepts
• Hadoop 2 remedied this situation with HDFS HA
• In this implementation there are a pair of namenodes in an active-standby configuration
• A few architectural changes are needed to allow this to happen:
  • Namenodes must use highly available shared storage
  • Datanodes must send block reports to both namenodes
  • Clients must be configured to handle namenode failover
  • The secondary namenode role is subsumed by the standby
• Failover controller: manages the transition from the active namenode to the standby {the default implementation uses ZooKeeper}
• Fencing: the HA implementation ensures the previously active namenode is prevented from doing any damage and causing corruption
• Choices for HA shared storage:
  • NFS filer
  • QJM, quorum journal manager {the recommended choice for HDFS installations}

HDFS Concepts
Blocks: a disk has a block size, which is the minimum amount of data it can read/write. In filesystems for a single disk, disk blocks are typically 512 bytes.
In HDFS, a block is a much larger unit, 128 MB by default; files are broken into block-sized chunks which are stored as independent units.
Unlike an ordinary filesystem, a 1 MB file stored with a 128 MB block size uses only 1 MB of disk space.

Benefits of blocks
1. A file can be larger than a single disk, and stored on multiple nodes in a cluster
2. The abstraction of a block simplifies the storage subsystem; metadata is stored separately
3. Blocks fit well with replication for providing fault tolerance and availability
HDFS Shell Commands
1. Create a directory in HDFS at given path(s).
Usage:
hadoop fs -mkdir <paths>
Example:
hadoop fs -mkdir /user/dir1 /user/dir2
(hdfs dfs -mkdir ... works equivalently)

2. List the contents of a directory.
Usage:
hadoop fs -ls <args>
Example:
hadoop fs -ls /user/tmp

3. Upload and download a file in HDFS.
Upload:
hadoop fs -put: copy a single src file, or multiple src files, from the local file system to the Hadoop file system
Usage:
hadoop fs -put <localsrc> ... <HDFS_dest_Path>
Example:
hadoop fs -put /home/ubuntu/Samplefile.txt /user/local/hadoop/dir3/

Download:
hadoop fs -get: copies/downloads files to the local file system
Usage:
hadoop fs -get <hdfs_src> <localdst>
Example:
hadoop fs -get /user/local/hadoop/dir3/Samplefile.txt /home/ubuntu/mydir

4. See contents of a file
Same as the unix cat command:
Usage:
hadoop fs -cat <path[filename]>
Example:
hadoop fs -cat /user/…./dir1/abc.txt

5. Change file permissions
Same as the unix chmod command:
Usage:
hdfs dfs -chmod [-R] <MODE[,MODE]... | OCTALMODE> URI [URI ...]
Example:
hdfs dfs -chmod 777 abc.txt
hdfs dfs -chmod 755 new.txt

6. Delete a file
Usage:
hdfs dfs -rm [-skipTrash] URI [URI ...]
hdfs dfs -rmr [-skipTrash] URI [URI ...]   (delete recursively)
Example:
hdfs dfs -rm abc.txt
hdfs dfs -rmr dir2
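The same operations can also be performed programmatically. Below is a minimal sketch using the Hadoop FileSystem Java API; the local and HDFS paths are hypothetical, and the cluster configuration is assumed to be picked up from core-site.xml / hdfs-site.xml on the classpath.

// A minimal sketch of common HDFS file operations via the Java API (paths are hypothetical).
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsOps {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();      // reads core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);

    fs.mkdirs(new Path("/user/dir1"));                                   // mkdir
    fs.copyFromLocalFile(new Path("/home/ubuntu/Samplefile.txt"),        // put
                         new Path("/user/dir1/Samplefile.txt"));

    for (FileStatus status : fs.listStatus(new Path("/user/dir1"))) {    // ls
      System.out.println(status.getPath());
    }

    try (FSDataInputStream in = fs.open(new Path("/user/dir1/Samplefile.txt"))) {  // cat
      IOUtils.copyBytes(in, System.out, 4096, false);
    }

    fs.delete(new Path("/user/dir1/Samplefile.txt"), false);             // rm
    fs.close();
  }
}

The class is typically compiled against the Hadoop client libraries and launched through the hadoop command so that the cluster configuration is on the classpath.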
Hadoop Cluster - Architecture, Core Components

Hadoop Cluster:
Normally any set of loosely connected or tightly connected computers that work together as a single system is called a cluster. In simple words, a computer cluster used for Hadoop is called a Hadoop Cluster.
A Hadoop cluster is a special type of computational cluster designed for storing and analyzing vast amounts of unstructured data in a distributed computing environment. These clusters run on low-cost commodity computers.
Hadoop clusters are often referred to as "shared nothing" systems because the only thing that is shared between the nodes is the network that connects them.
Example: Yahoo's Hadoop cluster has more than 10,000 machines running Hadoop and nearly 1 petabyte of user data.

A Hadoop cluster has 3 components:
Client,
Master,
Slave

Client: It is neither master nor slave; rather it plays the role of loading the data into the cluster, submitting MapReduce jobs describing how the data should be processed, and then retrieving the data to see the response after job completion.

Masters: The Masters consist of 3 components:
NameNode,
Secondary NameNode,
JobTracker.

NameNode: The NameNode does NOT store the files but only the files' metadata.
The NameNode oversees the health of the DataNodes and coordinates access to the data stored in the DataNodes.
The NameNode keeps track of all the file-system-related information, such as:
-- which section of a file is saved in which part of the cluster
-- last access time for the files
-- user permissions, i.e. which users have access to a file

JobTracker:
The JobTracker coordinates the parallel processing of data using MapReduce.

Secondary Name Node:
Don't get confused with the name "Secondary". The Secondary NameNode is NOT a backup or high-availability node for the NameNode; it does the job of housekeeping (merging the namespace image with the edit log).

Slaves: Slave nodes are the majority of machines in a Hadoop cluster and are responsible to
 Store the data
 Process the computation
Each slave runs both a DataNode and a TaskTracker daemon which communicate with their masters. The TaskTracker daemon is a slave to the JobTracker, and the DataNode daemon a slave to the NameNode.

Hadoop EcoSystem
Hadoop is a framework which deals with Big Data, but unlike other frameworks it is not a simple framework; it has its own family of components for processing different things, tied up in one umbrella called the Hadoop Ecosystem.
Data is mainly categorized into 3 types under the Big Data platform.
Structured Data - Data which has a proper structure and which can be easily stored in tabular form in any relational database like MySQL, Oracle, etc. is known as structured data. Example: employee data.
Semi-Structured Data - Data which has some structure but cannot be saved in tabular form in relational databases is known as semi-structured data. Example: XML data, email messages, etc.
Unstructured Data - Data which does not have any structure and cannot be saved in tabular form in relational databases is known as unstructured data. Example: video files, audio files, text files, etc.

SQOOP : SQL + HADOOP = SQOOP
When we import any structured data from a table (RDBMS) to HDFS, a file is created in HDFS which we can then process either by a MapReduce program directly or by HIVE or PIG.


HDFS
(Hadoop Distributed File System)

HDFS is a main component of Hadoop and a technique to store data in a distributed manner in order to compute fast.
HDFS saves data in blocks of 128 MB in size, which is a logical splitting of data across Datanodes (the physical storage of data) in a Hadoop cluster (a formation of several Datanodes, i.e. a collection of commodity hardware connected through a single network).
All information about the data splits in the Datanodes, known as metadata, is captured in the Namenode, which is again a part of HDFS.

MapReduce Framework
 It is another main component of Hadoop and a method of programming on distributed data stored in HDFS.
 We can write MapReduce programs using languages like JAVA, C++, PYTHON, RUBY, etc.
 As the name suggests, MapReduce works in two steps: Map does the mapping of logic onto the data (distributed in HDFS), and once the computation is over, the Reducer collects the results of the Maps to generate the final output of the MapReduce job.
 A MapReduce program can be applied to any type of data, whether structured or unstructured, stored in HDFS. Example - word count using MapReduce.
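As a concrete illustration of the word-count example just mentioned, here is a minimal MapReduce sketch using the standard Hadoop Java API (org.apache.hadoop.mapreduce); the input and output HDFS paths are hypothetical and are passed as command-line arguments.

// Minimal word count using the Hadoop MapReduce Java API.
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map: emit (word, 1) for every word in the input line
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reduce: sum the counts for each word
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // combine locally before the shuffle
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Packaged into a jar, it would typically be submitted as: hadoop jar wordcount.jar WordCount /user/input /user/output (both paths hypothetical).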

HBASE
 Hadoop Database, or HBASE, is a non-relational (NoSQL) database that runs on top of HDFS.
 HBASE was created for large tables which have billions of rows and millions of columns, with fault-tolerance capability and horizontal scalability, and is based on Google's BigTable.
 Hadoop can perform only batch processing, and data will be accessed only in a sequential manner; for random access to huge data, HBASE is used.

Hive
 Hive was created by Facebook and later donated to the Apache foundation.
 Hive mainly deals with structured data which is stored in HDFS, using a query language similar to SQL known as HQL (Hive Query Language).
 Hive also runs MapReduce programs in the backend to process data in HDFS.
Pig
 Similar to HIVE, PIG also deals with structured data, using the PIG LATIN language.
 PIG was originally developed at Yahoo to answer a need similar to HIVE.
 It is an alternative provided to programmers who love scripting and don't want to use Java/Python or SQL to process data.
 A Pig Latin program is made up of a series of operations, or transformations, that are applied to the input data; it runs MapReduce programs in the backend to produce the output.

[Diagram: HIVE and PIG. Hive architecture: CLI/UI/API (JDBC/ODBC), Thrift server and clients (PHP/Perl/Python/C++/Java), Metastore, HiveQL Driver, Compiler and Execution Engine running MapReduce jobs on Hadoop. Pig data flow: load, filter, aggregation, group, distribute, foreach, global aggregation, store, mapped onto Map and Reduce phases.]

Mahout
 Mahout is an open source machine learning library from Apache, written in Java.
 The algorithms it implements fall under the broad umbrella of machine learning or collective intelligence.
 Mahout aims to be the machine learning tool of choice when the collection of data to be processed is very large, perhaps far too large for a single machine.
 Predictive analysis: recommenders, clustering, classification.

Oozie
 It is a workflow scheduler system to manage Hadoop jobs.
 Oozie is implemented as a Java web application that runs in a Java servlet container.
 Hadoop basically deals with big data, and a programmer often wants to run many jobs in a sequential manner, e.g. the output of job A is the input to job B, the output of job B is the input to job C, and the final output is the output of job C. To automate this sequence we need a workflow, and to execute it we need an engine; OOZIE is used for this.

Zookeeper
 ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services.
 Writing distributed applications is difficult because partial failures may occur between nodes; to overcome this, Apache ZooKeeper has been developed as an open-source server which enables highly reliable distributed coordination.
 In case of any partial failure, clients can connect to any node and be assured that they will receive the correct, up-to-date information.
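To make the coordination idea concrete, here is a minimal sketch using the ZooKeeper Java client API; the connect string localhost:2181 and the znode paths /app and /app/config are hypothetical.

// A small example: publish a configuration value as a znode and read it back.
import java.util.concurrent.CountDownLatch;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkConfigExample {
  public static void main(String[] args) throws Exception {
    CountDownLatch connected = new CountDownLatch(1);

    // Open a session; the watcher fires once the connection is established.
    ZooKeeper zk = new ZooKeeper("localhost:2181", 5000, (WatchedEvent event) -> {
      if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
        connected.countDown();
      }
    });
    connected.await();

    // Publish a piece of configuration as persistent znodes (if absent).
    if (zk.exists("/app", false) == null) {
      zk.create("/app", new byte[0], ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
    }
    if (zk.exists("/app/config", false) == null) {
      zk.create("/app/config", "batch.size=64".getBytes(),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
    }

    // Any client connected to the ensemble can now read the same configuration.
    byte[] data = zk.getData("/app/config", false, null);
    System.out.println("config = " + new String(data));

    zk.close();
  }
}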
HBase
N.C. Senthilkumar, C. Ranichandra

NO SQL
 Not Only SQL
 Non-relational, distributed, open-source and horizontally scalable
 > 255 NoSQL databases
 Categorized as:
  Column store/column families: HBASE, Accumulo, IBM Informix
  Document store: Azure Document DB, MongoDB, IBM Cloudant
  Key value/tuple store: DynamoDB, Azure Table Storage, Oracle NoSQL DB
  Graph databases: AllegroGraph, Neo4J, OrientDB

NO SQL
 Since 1970, RDBMS was the only solution for data storage, manipulation and maintenance.
 After data changed in all dimensions (the Vs), companies realized they needed solutions for processing big data.
 Solution: Hadoop, but it offers only sequential access.

HBase
 HBase is a distributed column-oriented database built on top of HDFS.
 Based on Google's BigTable, it provides random access to structured data.
 It is a part of the Hadoop ecosystem and provides random read/write access on top of HDFS.

Random R/W: HDFS and HBASE
 HDFS: a distributed file system | HBASE: a database on top of HDFS
 HDFS: provides high-latency batch processing | HBASE: low-latency access to single rows
 HDFS: only sequential access of data | HBASE: random access by hash index

Storage Mechanism in HBase
 Column-oriented
 A table schema defines only column families, which are key-value pairs

 Rowid | Column family (Col1, Col2, Col3) | Column family (Col1, Col2, Col3)

HBase and RDBMS
 HBASE: schema-less | RDBMS: schema-oriented
 HBASE: built for wide tables, horizontally scalable | RDBMS: thin and built for small tables, hard to scale
 HBASE: no transactions, suitable for OLAP | RDBMS: transactional
 HBASE: denormalized data | RDBMS: normalized data
 HBASE: good for semi-structured and structured data | RDBMS: good for structured data

Applications of HBase
 Write-heavy applications
 Random access of data
 Facebook, Twitter, Yahoo and Adobe use HBase internally

HBase Architecture
Tables are split into regions and served by region servers; regions are divided into stores and stored in HDFS.

HBase Shell Commands

 General:
  status
  version
  table_help
  whoami
 DDL:
  create
  list
  disable
  is_disabled
  enable
  is_enabled
  describe
  alter
  exists
  drop
  drop_all - drop tables matching a regex
 DML:
  put - put a cell value
  get - get a row or cell
  delete - delete a cell value
  deleteall - delete all the cells in a row
  scan - scan and return table values
  count - number of rows in a table
  truncate - disable, drop and recreate a specified table

DDL Commands
 create <table name>, <family name>
 list
 disable <table name>
 describe <table name>
 drop <table name>
 drop_all <regexp>
 alter <table name>, NAME => <column family name>, VERSIONS => 5
 alter <table name>, NAME => <cf>, METHOD => 'delete'

DML Commands
 put <'table name'>, <'row name'>, <'column name'>, <'value'>
 scan <'table name'>
 get <'table name'>, <'row id'>, {COLUMN => 'column family:column name'}
 delete <'table name'>, <'row name'>, <'column name'>
 deleteall <'table name'>, <'row name'>
 truncate <table name>

Example: DDL + DML Commands

 create 'emp', 'personal data', 'professional data'
 put 'emp', '1', 'personal data:name', 'raju'
 put 'emp', '1', 'personal data:city', 'hyderabad'
 put 'emp', '1', 'professional data:designation', 'manager'
 put 'emp', '1', 'professional data:salary', '50000'
 scan 'emp'
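The same 'emp' table can also be accessed from Java through the HBase client API. A minimal sketch follows; it assumes the table already exists (e.g. created with the shell commands above) and that a local ZooKeeper quorum is reachable at localhost.

// Put, get, and scan on the 'emp' table via the HBase Java client API.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class EmpExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    conf.set("hbase.zookeeper.quorum", "localhost");   // assumed local/pseudo-distributed setup

    try (Connection connection = ConnectionFactory.createConnection(conf);
         Table emp = connection.getTable(TableName.valueOf("emp"))) {

      // put 'emp','1','personal data:name','raju'
      Put put = new Put(Bytes.toBytes("1"));
      put.addColumn(Bytes.toBytes("personal data"), Bytes.toBytes("name"), Bytes.toBytes("raju"));
      put.addColumn(Bytes.toBytes("personal data"), Bytes.toBytes("city"), Bytes.toBytes("hyderabad"));
      emp.put(put);

      // get 'emp','1'
      Result row = emp.get(new Get(Bytes.toBytes("1")));
      System.out.println("name = " +
          Bytes.toString(row.getValue(Bytes.toBytes("personal data"), Bytes.toBytes("name"))));

      // scan 'emp'
      try (ResultScanner scanner = emp.getScanner(new Scan())) {
        for (Result r : scanner) {
          System.out.println(Bytes.toString(r.getRow()));
        }
      }
    }
  }
}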
DDL + DML Commands (continued)

 get 'emp', '1'
 get 'emp', '1', {COLUMN => 'personal data:name'}
 get 'emp', '1', {FILTER => "SingleColumnValueFilter('personal data', 'name', =, 'binary:sakthi')"}
 delete 'emp', '1', 'personal data:city'
 deleteall 'emp', '1'
 put 'emp', '1', 'personal data:city', 'Delhi'
 count 'emp'
 truncate 'emp'
 disable 'emp'
 drop 'emp'

Exercise - MBA Admissions
Design an HBase table for MBA admissions with the following layout:
 Row key: Application no
 Column family "Personal Data": Name, Gender, Address
 Column family "Academic details": UG qualification, University/college, Overall percentage

Hive
 A Hadoop-based data-warehousing-like framework
 Developed by Facebook
 Fire queries using an SQL-like language, HiveQL
 Allows SQL programmers to use the warehouse without MapReduce knowledge

Features of Hive
 It stores the schema in a database and processed data in HDFS.
 It is designed for OLAP.
 It provides an SQL-type language for querying called HiveQL or HQL.
 It is familiar, fast, scalable, and extensible.
Architecture of Hive - Explanation

Unit Name: Operation

User Interface: Hive is a data warehouse infrastructure software that can create interaction between the user and HDFS. The user interfaces that Hive supports are Hive Web UI, Hive command line, and Hive HD Insight (in Windows Server).

Meta Store: Hive chooses respective database servers to store the schema or metadata of tables, databases, columns in a table, their data types, and HDFS mapping.

HiveQL Process Engine: HiveQL is similar to SQL for querying on schema info in the Metastore. It is one of the replacements of the traditional approach for MapReduce programs. Instead of writing a MapReduce program in Java, we can write a query for the MapReduce job and process it.

Execution Engine: The conjunction part of the HiveQL Process Engine and MapReduce is the Hive Execution Engine. The execution engine processes the query and generates results the same as MapReduce results. It uses the flavor of MapReduce.

HDFS or HBASE: The Hadoop distributed file system or HBASE are the data storage techniques used to store data in the file system.

Workflow Steps

Step 1 - Execute Query: The Hive interface, such as Command Line or Web UI, sends the query to the Driver (any database driver such as JDBC, ODBC, etc.) to execute.
Step 2 - Get Plan: The driver takes the help of the query compiler, which parses the query to check the syntax and the query plan or the requirement of the query.
Step 3 - Get Metadata: The compiler sends a metadata request to the Metastore (any database).
Step 4 - Send Metadata: The Metastore sends the metadata as a response to the compiler.
Step 5 - Send Plan: The compiler checks the requirement and resends the plan to the driver. Up to here, the parsing and compiling of the query is complete.
Step 6 - Execute Plan: The driver sends the execute plan to the execution engine.
Step 7 - Execute Job: Internally, the process of executing the job is a MapReduce job. The execution engine sends the job to the JobTracker, which is in the Name node, and it assigns this job to the TaskTracker, which is in the Data node. Here, the query executes the MapReduce job.
Step 7.1 - Metadata Ops: Meanwhile, during execution, the execution engine can execute metadata operations with the Metastore.
Step 8 - Fetch Result: The execution engine receives the results from the Data nodes.
Step 9 - Send Results: The execution engine sends those resultant values to the driver.
Step 10 - Send Results: The driver sends the results to the Hive interfaces.
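Because the driver accepts JDBC connections (through HiveServer2), HiveQL can also be issued from a Java program. Below is a minimal sketch; the connection URL, the user, and the employee table (created on the next slide) are assumptions for a local setup.

// Querying Hive over JDBC; the query is executed on the cluster by the engine Hive is configured with.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
  public static void main(String[] args) throws Exception {
    // The Hive JDBC driver; loading it explicitly also registers it with DriverManager.
    Class.forName("org.apache.hive.jdbc.HiveDriver");

    try (Connection con = DriverManager.getConnection(
             "jdbc:hive2://localhost:10000/default", "hive", "");
         Statement stmt = con.createStatement()) {

      ResultSet rs = stmt.executeQuery(
          "SELECT designation, COUNT(*) FROM employee GROUP BY designation");
      while (rs.next()) {
        System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
      }
    }
  }
}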
HiveQL - Create

CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
  [(col_name data_type [COMMENT col_comment], ...)]
  [COMMENT table_comment]
  [ROW FORMAT row_format]
  [STORED AS file_format]

Create example

CREATE TABLE IF NOT EXISTS employee
( eid int,
  name String,
  salary String,
  designation String)
COMMENT 'Employee details'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;

Load data

LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename
  [PARTITION (partcol1=val1, partcol2=val2 ...)]

LOAD DATA LOCAL INPATH '/home/user/sample.txt' OVERWRITE INTO TABLE employee;

Semi structured data (XML)

<emp><eid>12</eid><name>rani</name><salary>29000</salary></emp>

create table xmlsample_Emp(str string);
load data local inpath '/home/user/sample.xml' overwrite into table xmlsample_Emp;
select xpath(str,'emp/eid/text()'), xpath(str,'emp/salary/text()') from xmlsample_Emp;
JSON

{ "fruit": "Apple",
  "size": "Large",
  "color": "Red" }

{ "emp":
    { "e1": { "eid": "12", "name": "Rani", "salary": "29000" },
      "e2": { "eid": "13", "name": "uma", "salary": "29000" } } }

select get_json_object(str,'$.eid') as eid,
       get_json_object(str,'$.name') as name
from json_emp;

S3
C. Ranichandra

 Amazon Simple Storage Service (Amazon S3)
 Storage for the Internet. It is designed to make web-scale computing easier for developers.
 Has a simple web services interface that you can use to store and retrieve any amount of data, at any time, from anywhere on the web.

Concepts
 Buckets
 Objects
 Keys
 Regions

 Buckets
A bucket is a container for objects stored in Amazon S3. Every object is contained in a bucket.
 Objects
Objects are the fundamental entities stored in Amazon S3. Objects consist of object data and metadata.

 Keys
A key is the unique identifier for an object within a bucket. Every object in a bucket has exactly one key. The combination of a bucket, key, and version ID uniquely identifies each object.
 Regions
You can choose the geographical AWS Region where Amazon S3 will store the buckets that you create. You might choose a Region to optimize latency, minimize costs, or address regulatory requirements. Objects stored in a Region never leave the Region unless you explicitly transfer them to another Region.

Common Operations
 Create a bucket - Create and name your own bucket in which to store your objects.
 Write an object - Store data by creating or overwriting an object. When you write an object, you specify a unique key in the namespace of your bucket. This is also a good time to specify any access control you want on the object.
 Read an object - Read data back. You can download the data via HTTP or BitTorrent.
 Delete an object - Delete some of your data.
 List keys - List the keys contained in one of your buckets. You can filter the key list based on a prefix.
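These common operations map directly onto the S3 API. A minimal sketch using the AWS SDK for Java (v1) follows; the bucket name, key, and region are hypothetical, and credentials are assumed to come from the default provider chain (environment, profile, or instance role).

// Create a bucket, write/read/list/delete objects with the AWS SDK for Java (v1).
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.S3ObjectSummary;

public class S3Example {
  public static void main(String[] args) {
    AmazonS3 s3 = AmazonS3ClientBuilder.standard()
        .withRegion("us-east-1")                  // choose a Region
        .build();

    String bucket = "my-example-bucket-12345";    // bucket names are globally unique

    s3.createBucket(bucket);                                          // create a bucket
    s3.putObject(bucket, "notes/sample.txt", "hello s3");             // write an object
    String body = s3.getObjectAsString(bucket, "notes/sample.txt");   // read it back
    System.out.println(body);

    // list keys, optionally filtered by a prefix
    for (S3ObjectSummary summary : s3.listObjectsV2(bucket, "notes/").getObjectSummaries()) {
      System.out.println(summary.getKey());
    }

    s3.deleteObject(bucket, "notes/sample.txt");                      // delete the object
  }
}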
Impala
 Impala is an MPP (Massive Parallel Processing) SQL query engine for processing huge volumes of data that is stored in a Hadoop cluster.
 It is open source software which is written in C++ and Java.
 It provides high performance and low latency compared to other SQL engines for Hadoop.
 In other words, Impala is the highest-performing SQL engine (giving an RDBMS-like experience) which provides the fastest way to access data that is stored in the Hadoop Distributed File System.

Features
 Impala combines the SQL support and multi-user performance of a traditional analytic database with the scalability and flexibility of Apache Hadoop, by utilizing standard components such as HDFS, HBase, Metastore, YARN, and Sentry.
 With Impala, users can communicate with HDFS or HBase using SQL queries in a faster way compared to other SQL engines like Hive.
 Impala can read almost all the file formats used by Hadoop, such as Parquet, Avro and RCFile.

Advantages
 Process data that is stored in HDFS at lightning-fast speed with traditional SQL knowledge.
 Data transformation and data movement are not required for data stored on Hadoop.
 Access the data that is stored in HDFS, HBase, and Amazon S3 without knowledge of Java (MapReduce jobs).
 The time-consuming stages of loading and reorganizing are overcome with new techniques such as exploratory data analysis and data discovery, making the process faster.

HBase vs Hive vs Impala
 HBase is a wide-column store database based on Apache Hadoop. It uses the concepts of BigTable. | Hive is data warehouse software; using it, we can access and manage large distributed datasets built on Hadoop. | Impala is a tool to manage and analyze data that is stored on Hadoop.
 Data model: HBase - wide column store, schema-free | Hive - relational, schema-based | Impala - relational, schema-based
 Implementation language: HBase - Java | Hive - Java | Impala - C++
 APIs: HBase - Java, RESTful and Thrift APIs | Hive - JDBC, ODBC and Thrift APIs | Impala - JDBC and ODBC APIs
 Language support: HBase - C, C#, C++, Groovy, Java, PHP, Python, and Scala | Hive - C++, Java, PHP, and Python | Impala - all languages supporting JDBC/ODBC
 Triggers: HBase - provides support for triggers | Hive - no support for triggers | Impala - no support for triggers
 Cloudera VM
 impala-shell
