
BIG DATA

ANALYTICS

compiled by Dr. Rohini Temkar 1


Module 1: Introduction to Big Data and Hadoop

1 Introduction to Big Data

2 Big data characteristics, Types of big data

3 Traditional vs Big Data Business Approach

4 Case study of Big Data

5 Concept of Hadoop

6 Core Hadoop Components, Hadoop Ecosystem



What’s Big Data?

No single definition

Information Age: Activities

• Use of smartphones for voice calls or SMS
• Searching in a web browser
• Emails
• Online shopping
• Conversations on social media: tweets, posts
• Uploads of images and videos to social media and YouTube
• Listening to music
• Reading a book
• Sensors connected to mobile devices; imaging technology in the medical field
• Sensors for climate information
• IoT data: Internet of Everything!

Buzzword: BIG DATA

Sources of Data
1 Text

2 Image

3 Audio

4 Video



Data Sources

© 2015 IBM Corporation
Source: Josh James, Domosphere


Who’s Generating Big Data

• Mobile devices (tracking all objects all the time)
• Social media and networks (all of us are generating data)
• Scientific instruments (collecting all sorts of data)
• Sensor technology and networks (measuring all kinds of data)

■ Progress and innovation are no longer hindered by the ability to collect data,
■ but by the ability to manage, analyze, summarize, visualize, and discover knowledge from the collected data in a timely manner and in a scalable fashion.

Data:
• Everything we do in the digital world leaves a trace, and that trace is data.
• Something that is full of information.
• It is everywhere.
• It is created constantly.
• It is generated at an ever-increasing rate from everywhere.

Big:
• It is doubling every two years and changing the way we live.
• Something that is large.
• Is it kilobytes, megabytes, gigabytes?


What’s Big Data?
■ Big data is the term for a collection of data sets so
large and complex that it becomes difficult to process
using on-hand database management tools or
traditional data processing applications.
■ The challenges include capture, curation, storage,
search, sharing, transfer, analysis, and visualization.
■ The trend to larger data sets is due to the additional
information derivable from analysis of a single large set
of related data, as compared to separate smaller sets
with the same total amount of data, allowing
correlations to be found to “spot business trends,
determine quality of research, prevent diseases, link
legal citations, combat crime, and determine real-time
roadway traffic conditions.”


Reflection Point
• So what problems could be associated with this large amount of data?


Challenges
According to IBM, every day we create 2.5 quintillion (2.5 × 10^18) bytes of data.

• Acquisition: Data writing and reading speeds differ across devices.
• Storage: Traditional devices are not able to store this huge amount of data.
• Searching: Traditional databases cannot be used to search for a particular item in a large chunk of data.
• Visualization: Traditional data visualization tools fail with large data sets.
• Sharing: Requires large bandwidth (BW).
• Analytics: Because of the large size of the data, it is difficult to analyse with traditional algorithms.
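To get a feel for the daily figure quoted above, the arithmetic can be sketched as follows. This is only an illustration: the 1 TB drive is a hypothetical yardstick that does not appear in the slides, and decimal units (1 TB = 10^12 bytes) are assumed.

```python
# Daily data volume quoted on the slide (attributed to IBM).
BYTES_PER_DAY = 2.5e18
SECONDS_PER_DAY = 24 * 60 * 60

# Average generation rate, expressed in decimal terabytes per second.
bytes_per_second = BYTES_PER_DAY / SECONDS_PER_DAY
terabytes_per_second = bytes_per_second / 1e12

# Hypothetical yardstick: how many 1 TB drives would one day of data fill?
drives_per_day = BYTES_PER_DAY / 1e12

print(f"{terabytes_per_second:.1f} TB generated per second")
print(f"{drives_per_day:,.0f} one-terabyte drives filled per day")
```

Under these assumptions the rate works out to roughly 29 TB every second, which is why acquisition, storage, and sharing each become a challenge on their own.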


• The vast amount of data generated every minute makes acquisition, storage, searching, sharing, analytics, and visualization difficult.

So a situation in which data sets have grown to such enormous sizes that conventional information technologies can no longer effectively handle either the size of the data set or its scale and growth is termed Big Data.


“BIG DATA IS AT THE FOUNDATION
OF ALL THE MEGATRENDS THAT
ARE HAPPENING TODAY, FROM
SOCIAL TO MOBILE TO CLOUD TO
GAMING.”

CHRIS LYNCH, VERTICA SYSTEMS



“Big Data in general is defined as high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.” - Gartner


What is Big… A perspective
• 1 byte: 1 grain of rice
• 1 kilobyte: 1 cup of rice
• 1 megabyte: 8 bags of rice
• 1 gigabyte: 3 semi trucks
• 1 terabyte: 2 container ships
• 1 petabyte: needs buildings from an entire city
• 1 exabyte: may cover half of India
• 1 zettabyte: fills the Pacific Ocean
• 1 yottabyte: a ball of rice the size of Earth
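The unit ladder above can be made concrete with a small helper that expresses a raw byte count in the largest fitting unit. This is a minimal sketch: the function name `humanize` is made up for illustration, and decimal steps of 1000 are assumed (binary steps of 1024 are also in common use).

```python
# Unit abbreviations from byte up to yottabyte, each 1000x the previous one.
UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]

def humanize(n_bytes: float) -> str:
    """Express a byte count in the largest decimal unit that keeps the value below 1000."""
    value = float(n_bytes)
    for unit in UNITS:
        # Stop at the first unit that fits, or at the last unit available.
        if value < 1000 or unit == UNITS[-1]:
            return f"{value:g} {unit}"
        value /= 1000

print(humanize(2.5e18))  # the daily volume quoted from IBM
print(humanize(35e21))   # the 2020 volume quoted in the deck
```

Each step up the ladder is a factor of one thousand, so the jump from one grain of rice to a container ship is four steps, or 10^12.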


CHARACTERISTICS OF BIG DATA


Big Data: 3V’s



Characteristics of Big Data
1-Scale (Volume)
■ Data Volume
– 44x increase from 2009 to 2020
– From 0.8 zettabytes to 35 zettabytes
■ Data volume is increasing exponentially

[Figure: Exponential increase in collected/generated data]

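The growth figures above imply a yearly growth rate, which can be worked out as a rough check. This sketch assumes steady compound growth between the two quoted endpoints, which the slide does not state.

```python
import math

# Endpoints quoted on the slide, in zettabytes.
start_zb, end_zb = 0.8, 35.0
years = 2020 - 2009

# Overall growth factor (about 44x) and the equivalent compound annual rate.
growth_factor = end_zb / start_zb
annual_rate = growth_factor ** (1 / years) - 1

# Time for the volume to double at that annual rate.
doubling_years = math.log(2) / math.log(1 + annual_rate)

print(f"growth factor: {growth_factor:.1f}x")
print(f"annual growth: {annual_rate:.0%}")
print(f"doubling time: {doubling_years:.1f} years")
```

Under that assumption the volume grows by about 41% a year and doubles roughly every two years, which matches the "doubling every two years" statement earlier in the deck.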
2-Complexity (Variety)
■ Various formats, types, and structures
■ Text, numerical, images, audio, video, sequences, time series, social media data, multi-dimensional arrays, etc.
■ Static data vs. streaming data
■ A single application can be generating/collecting many types of data

To extract knowledge 🡺 all these types of data need to be linked together


A Single View to the Customer

[Diagram: "Our Customer" at the center, linked to Social Media, Banking, Finance, Gaming, Known History, Entertain, and Purchase]


Variety: Different forms of data
• Texts: Newspapers, articles, white papers, customer names, addresses, product names, customer transaction information, text, Excel and PDF files, encrypted mails and messages, corporate and personal emails, SMS, social media data
• Numbers: Dates, phone numbers, Social Security numbers, credit card numbers
• Websites: Server and application logs, Internet search history, survey reports, consumer ratings and reviews, social network profiles, blogs, user forums, government, publisher and aggregator databases
• Audio, Video & Images: Call center data, customer call logs, voice transcriptions, voicemails, phone logs, videos, surveillance images, medical images
• Sensor Data: Fields, farms, industries, GPS (vehicle location) data, Internet of Everything

3-Speed (Velocity)
■ Data is being generated fast and needs to be processed fast
■ Online Data Analytics
■ Late decisions 🡺 missing opportunities
■ Examples
– E-Promotions: Based on your current location, your purchase history, and what you like 🡺 send promotions right now for the store next to you
– Healthcare monitoring: sensors monitoring your activities and body 🡺 any abnormal measurements require immediate reaction

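The healthcare monitoring example can be sketched as a check that runs on each reading as it arrives, instead of waiting for a batch. This is a minimal illustration and not a clinical method: the function name, the window size, the threshold, and the sample heart-rate values are all made up.

```python
from collections import deque
from statistics import mean, stdev

def stream_alerts(readings, window=5, threshold=3.0):
    """Yield (index, value) for readings far from the recent rolling average."""
    recent = deque(maxlen=window)  # baseline of the most recent normal readings
    for i, value in enumerate(readings):
        if len(recent) == window:
            mu, sigma = mean(recent), stdev(recent)
            # Flag values more than `threshold` standard deviations from the baseline.
            if sigma > 0 and abs(value - mu) > threshold * sigma:
                yield i, value
                continue  # keep the outlier out of the baseline
        recent.append(value)

# Hypothetical heart-rate stream with one abnormal spike.
heart_rate = [72, 74, 71, 73, 72, 75, 140, 73, 72]
alerts = list(stream_alerts(heart_rate))
print(alerts)
```

Because the function is a generator, an alert is produced the moment the abnormal value is seen, which is the point of velocity: the reaction happens while the data is still arriving.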
Some Make it 4V’s

Big Data Vs Small Data

Types of Data
                      Structured Data                  Unstructured Data
Characteristics       - Predefined data models         - No predefined data models
                      - Generally text & numbers,      - May be text, numbers, video, audio,
                        easy to search                   sound, images & other formats
                                                       - Difficult to search
Resides in            Relational databases,            Applications, NoSQL databases,
                      data warehouses                  data warehouses, data lakes
Generated by          Machines as well as humans       Machines as well as humans
Typical applications  Airline reservation systems,     Word processing, presentation software,
                      ERP data, CRP data,              email clients, tools for viewing &
                      inventory control                editing video


Structured Data Example



Unstructured Data Example



Semi-structured data

• Textual data files with a discernible pattern that enables parsing.
• For example, Extensible Markup Language (XML) data files that are self-describing and defined by an XML schema.
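The "discernible pattern that enables parsing" can be shown with a small XML document. This is a sketch using Python's standard `xml.etree.ElementTree` module; the order records, tag names, and values are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical semi-structured data: the tags describe the content themselves.
xml_text = """
<orders>
  <order id="1001"><customer>Asha</customer><amount currency="INR">2500</amount></order>
  <order id="1002"><customer>Ravi</customer><amount currency="INR">1200</amount></order>
</orders>
""".strip()

root = ET.fromstring(xml_text)

# The repeating <order> pattern lets us turn the text into structured rows.
rows = [
    {
        "id": order.get("id"),
        "customer": order.findtext("customer"),
        "amount": float(order.findtext("amount")),
    }
    for order in root.findall("order")
]

total = sum(row["amount"] for row in rows)
print(rows)
print(total)
```

No table schema was defined in advance; the structure was recovered from the tags, which is what separates semi-structured data from both structured and unstructured data.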


Big Data Analytics



Applications of Big Data



• Smart wearables have gradually gained popularity and are the latest trend among people of all age groups.
• They generate massive amounts of real-time data, including alerts that can help save lives.

Big Data in E-commerce
■ The recommendation engine is one of the most striking applications of Big Data.
■ It furnishes companies with a 360-degree view of their customers.
■ Companies then make suggestions to customers accordingly.
■ Customers now experience more personalized services than they have ever had.
■ Big Data has completely redefined people’s online shopping experiences.

Big Data in Media and Entertainment
■ Viewers these days want content that matches their own preferences.
■ They want content that is relatively new compared with what they saw previously.
■ Earlier, companies broadcast ads randomly, without any kind of analysis.


Big Data in Finance
■ Digital banking and payments are two of the most prominent trends today, and Big Data has been at the heart of them.
■ Big Data now drives key areas of financial firms such as fraud detection, risk analysis, algorithmic trading, and customer satisfaction.


Big Data in Telecom
■ With the help of Big Data and analytics, companies can now track the areas with the lowest and the highest network traffic and take the necessary steps to ensure hassle-free network connectivity.
■ As in other industries, Big Data has helped the telecom industry understand its customers well.
■ Telecom companies now provide customers with offers that are as customized as possible.


Big Data in Automobile



Big Data in Travel Industry
■ Through Big Data and analytics, travel companies are now able to offer a more customized travel experience.
■ They are now able to understand their customers’ requirements much better.
■ From providing the best offers to making suggestions in real time, Big Data is a valuable guide for any traveller.


CASE STUDY: Insurance
• With Big Data technologies, insurance companies can store and crunch any volume of data in a cost-effective way.
• The major challenges are:
  • Verifying the data collected from customers
  • Profiling customer behavior
  • Assessing social networking influence
  • Identifying the right premium amount
  • Verifying claims raised
  • Fraud detection
  • Exact claim reimbursement


CASE STUDY: Insurance
■ Input
• Internal data, e.g. customer data from CRM, product data, etc.
• External data, e.g. from social networking sites regarding customer behavior, sentiments, reviews or ratings of products, etc.
■ Both internal and external sources of data can be used effectively for deciding the premium for a car.
■ If insurance is required for a particular brand of car, reviews about that car can be mined and, using sentiment analysis, the car can be put into different categories.
■ Better cars with better reviews and ratings can be offered a low premium.
■ Cars that have bad reviews and break down frequently can be offered a higher insurance premium.
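The review-mining step in this case study can be sketched with a tiny word-list scorer. Everything here is an assumption made for illustration: the word lists, the function names, the sample reviews, and the "standard premium" middle category are not from the slides, and real sentiment analysis uses far richer models.

```python
import re

# Hypothetical sentiment lexicons; a real system would use a trained model.
POSITIVE = {"reliable", "smooth", "safe", "excellent"}
NEGATIVE = {"breakdown", "faulty", "unsafe", "poor"}

def sentiment_score(review: str) -> int:
    """Count positive words minus negative words in one review."""
    words = re.findall(r"[a-z]+", review.lower())
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

def premium_category(reviews) -> str:
    """Map the overall sentiment of a car's reviews to a premium category."""
    total = sum(sentiment_score(r) for r in reviews)
    if total > 0:
        return "low premium"
    if total < 0:
        return "high premium"
    return "standard premium"  # assumed fallback when sentiment is neutral

good_car = ["Reliable and smooth ride", "Excellent mileage"]
bad_car = ["Frequent breakdown on highways", "Faulty brakes, unsafe"]
print(premium_category(good_car))
print(premium_category(bad_car))
```

The external review text is reduced to a score, and the score drives the pricing decision, which mirrors the flow from input data to premium described above.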


Technologies Available for Big Data
Apache Hadoop

• It is one of the main supporting elements of Big Data technologies.
• It simplifies the processing of large amounts of structured or unstructured data in a cheap manner.
• It is an open source project from Apache.
• It is basically a set of software libraries and frameworks to manage and process large amounts of data, scaling from a single server to thousands of machines.
• It detects and handles failures at the application layer rather than relying on hardware.
• In December 2011, Apache released Hadoop 1.0.0.
• It is not a single project but includes a number of other technologies.


MapReduce
• MapReduce is a programming model for efficient parallel processing of large data sets in a distributed manner.
• The data is first split and then combined to produce the final result.
• It is basically a framework for writing applications that process large amounts of structured or unstructured data across a cluster of machines.
• It takes a query and breaks it into parts that run on multiple nodes.
• By distributing query processing, it makes large amounts of data easier to handle, dividing the data among several different machines.
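The split-then-combine idea can be sketched in plain Python with the classic word count. This runs in a single process and only imitates the model; it does not use Hadoop's API. The function names and the two text chunks are illustrative assumptions, with each chunk standing in for the part of the input handled by one node.

```python
from collections import defaultdict

def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk of input."""
    return [(word.lower(), 1) for word in chunk.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine the values for one key into a final result."""
    return key, sum(values)

# Each chunk stands in for the slice of data processed on a separate node.
chunks = ["big data needs hadoop", "hadoop stores big data"]

mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts)
```

The map step needs no knowledge of other chunks and the reduce step needs no knowledge of other keys, which is what allows a real cluster to run both phases in parallel.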


HDFS (Hadoop Distributed File System)
• It is a Java-based file system used to store structured or unstructured data over large clusters of distributed servers.
• HDFS imposes no meaning on the data it stores; making sense of the data is left to the developer's code.
• It provides a highly fault-tolerant environment and can be deployed on low-cost hardware.
• HDFS is now part of the Apache Hadoop project; more information and an installation guide can be found in the Apache HDFS documentation.
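HDFS gets its fault tolerance on low-cost hardware by splitting each file into blocks and keeping several copies of every block on different machines. The sketch below estimates the resulting footprint. It assumes a 128 MB block size and a replication factor of 3, the usual defaults in recent Hadoop releases; both are configurable and neither is stated in the slides. The function name is hypothetical.

```python
import math

# Assumed defaults; real clusters may configure different values.
BLOCK_SIZE_MB = 128
REPLICATION = 3

def hdfs_footprint(file_size_mb):
    """Return (blocks, block replicas, raw storage in MB) for one file."""
    blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    replicas = blocks * REPLICATION
    raw_storage_mb = file_size_mb * REPLICATION
    return blocks, replicas, raw_storage_mb

blocks, replicas, raw_mb = hdfs_footprint(1000)
print(f"{blocks} blocks, {replicas} block replicas, {raw_mb} MB of raw storage")
```

The trade-off is visible in the numbers: a 1000 MB file occupies about three times its size on disk, and in exchange the loss of any single machine does not lose data.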


Hive
• Hive was originally developed by Facebook and has been open source for some time.
• Hive works as a bridge between SQL and Hadoop; it is basically used to run SQL queries on Hadoop clusters.
• Apache Hive is basically a data warehouse that provides ad-hoc queries, data summarization and analysis of huge data sets stored in Hadoop-compatible file systems.
• Hive provides a SQL-like query language called HiveQL for querying huge amounts of data stored in Hadoop clusters.
• In January 2013, Apache released Hive 0.10.0; more information and an installation guide can be found in the Apache Hive documentation.
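The kind of summarization query Hive is used for can be illustrated with a plain SELECT ... GROUP BY statement. The sketch below runs the query on Python's built-in `sqlite3` module as a stand-in, not on Hive, so it shows only the style of query; HiveQL differs in details and executes on a Hadoop cluster. The `sales` table and its rows are invented.

```python
import sqlite3

# In-memory SQLite database used purely as a stand-in for a Hive table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (category TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("books", 250.0), ("music", 100.0), ("books", 150.0)],
)

# An ad-hoc summarization query of the sort the slide describes.
query = """
    SELECT category, COUNT(*) AS orders, SUM(amount) AS revenue
    FROM sales
    GROUP BY category
    ORDER BY category
"""
summary = conn.execute(query).fetchall()
print(summary)
conn.close()
```

The appeal of the SQL bridge is that an analyst writes a declarative query like this one, and the system takes care of turning it into distributed work over the stored data.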


Pig
• Pig was introduced by Yahoo and was later made fully open source.
• It also provides a bridge to query data over Hadoop clusters but, unlike Hive, it uses a scripting language to make Hadoop data accessible to developers and business users.
• Apache Pig provides a high-level programming platform for developers to process and analyse Big Data using user-defined functions and programming effort.
• In January 2013, Apache released Pig 0.10.1, which is intended for use with Hadoop 0.10.1 or later releases.
• More information and an installation guide can be found in the Apache Pig Getting Started documentation.


Big data Tools over the
years…
