
Big Data

Hello!
I am Minakshi Gogoi
You can find me at
minakshi_cse@gimt-guwahati.ac.in
Contents
● What is Big Data?

● Where does Big Data come from?

● Types of data.
Let’s start with the first set of slides
“Big data is the term for a collection of data sets so
large and complex that it becomes difficult
to process using on-hand database
management tools or traditional
data processing applications”

Big Data…
• Petabytes, exabytes of data
• Volumes too great for typical DBMS
• Information from multiple internal and external sources:
  - Transactions
  - Social media
  - Enterprise content
  - Sensors
  - Mobile devices

Where does Big Data come from?
[Figure: sources of Big Data. Labels: Email, Transactions, Enterprise, Partner, Employee, Customer, Supplier, Contracts, Monitoring, Public, Commercial, Sensor, Credit, Weather, Industry, Population, Social Media, Sentiment, Economic, Network]
Types of Data
• When gathering data, we collect it from individual cases on particular variables.
• A variable is a unit of data collection whose value can vary.
• Variables can be classified into types according to the level
of mathematical scaling that can be carried out on the data.
• There are four types of data or levels of measurement:

1. Categorical (Nominal)
2. Ordinal
3. Interval
4. Ratio
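One way to picture the four levels is as a ladder where each level keeps the operations of the levels below it and adds one more. The sketch below is illustrative only: the operation labels and the `meaningful_operations` helper are assumptions made for this example, not part of the slides.

```python
from enum import IntEnum


class Level(IntEnum):
    """The four levels of measurement, from least to most structure."""
    NOMINAL = 1
    ORDINAL = 2
    INTERVAL = 3
    RATIO = 4


# Operation that first becomes meaningful at each level (illustrative labels).
NEW_OPERATION = {
    Level.NOMINAL: "equality / counting",
    Level.ORDINAL: "ranking",
    Level.INTERVAL: "differences",
    Level.RATIO: "ratios",
}


def meaningful_operations(level: Level) -> list:
    """Each level supports its own operation plus those of every lower level."""
    return [NEW_OPERATION[lower] for lower in Level if lower <= level]


print(meaningful_operations(Level.ORDINAL))
```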
Categorical (Nominal) data
• Nominal or categorical data consists of categories that cannot be
rank ordered – each category is just different.
• The categories cannot be placed in any order, and no judgement can be
made about the relative size or distance from one category to another.
• Categories bear no quantitative relationship to one another
• Examples:
- customer’s location (America, Europe, Asia)
- employee classification (manager, supervisor, associate)
• What does this mean? No arithmetic can be performed on the categories;
they can only be counted and compared for equality.
• Therefore, nominal data reflect qualitative differences rather than quantitative
ones.
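A small sketch of what can still be computed on nominal data: frequencies and the most common category (the mode). The answers below are made up for illustration, using the customer location categories from the example above.

```python
from collections import Counter

# Hypothetical answers for the "customer's location" example.
locations = ["Asia", "Europe", "Asia", "America", "Asia", "Europe"]

# Counting and equality checks are meaningful for nominal data.
counts = Counter(locations)
mode, mode_count = counts.most_common(1)[0]

print(counts)
print(mode, mode_count)

# An average of these categories is not defined: "Asia" + "Europe" has no meaning.
```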
Nominal data
Examples:
What is your gender? (please tick)
Male
Female

Did you enjoy the match? (please tick)
Yes
No

• Systems for measuring nominal data must ensure that each category is mutually exclusive
Ordinal data
Example:
How satisfied are you with the level of service you have received? (please tick)
Very satisfied
Somewhat satisfied
Neutral
Somewhat dissatisfied
Very dissatisfied

• Ordinal data consists of categories that can be rank ordered.
• As with nominal data, the distance between categories cannot be
calculated, but the categories can be ranked above or below each other.
• No fixed units of measurement
• Examples:
- Class rankings
- Survey responses
(poor, average, good, very good, excellent)
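Because ordinal categories can be ranked, sorting and picking the middle response (the median) are meaningful, while averaging the labels is not. A minimal sketch, using the survey scale from the example above and made-up responses:

```python
# Scale taken from the survey-response example, in ascending order.
SCALE = ["poor", "average", "good", "very good", "excellent"]
RANK = {label: position for position, label in enumerate(SCALE)}

# Hypothetical responses (an odd number, so the median is a single item).
responses = ["good", "poor", "excellent", "good", "average"]

# Ranking is valid for ordinal data; distances between ranks are not.
ordered = sorted(responses, key=RANK.__getitem__)
median_response = ordered[len(ordered) // 2]

print(ordered)
print(median_response)
```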
Interval and ratio data
• Both interval and ratio data are examples of scale data.
• Scale data:
- data is in numeric format
- data can be measured on a continuous scale
- the distance between values can be observed and, as a result, measured
- the data can be placed in rank order.
Interval data
• Ordinal data but with constant differences between observations
• Ratios are not meaningful
• Examples:
- Time – moves along a continuous measure of seconds, minutes and so on,
and has no true zero point.
- Temperature (in degrees Celsius or Fahrenheit) – moves along a continuous
measure of degrees and has no true zero.
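Why ratios are not meaningful on an interval scale can be checked with temperature. In the sketch below (the two readings are arbitrary), the difference between the readings is the same in Celsius and Kelvin, but the "twice as hot" ratio seen in Celsius disappears once the zero point moves.

```python
import math


def celsius_to_kelvin(celsius: float) -> float:
    """Shift the zero point; the size of one degree stays the same."""
    return celsius + 273.15


a_c, b_c = 10.0, 20.0                      # arbitrary readings in Celsius
a_k, b_k = celsius_to_kelvin(a_c), celsius_to_kelvin(b_c)

diff_c, diff_k = b_c - a_c, b_k - a_k      # differences agree on both scales
ratio_c, ratio_k = b_c / a_c, b_k / a_k    # ratios do not

print(math.isclose(diff_c, diff_k))
print(ratio_c, round(ratio_k, 3))
```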
Ratio data
• Ratio data is measured on a continuous scale and
does have a natural zero point.
• Ratios are meaningful
• Examples:
- monthly sales
- delivery times
- weight
- height
- age
Data for Business Analytics
Classifying Data Elements in a Purchasing Database
[Figure 1.2: purchasing database with its columns labeled Categorical, Interval, or Ratio]

If there were a field (column) for Supplier Rating (Excellent, Good, Acceptable, Bad), that data would be
classified as ordinal.
Big Data Characteristics

VOLUME
Growing quantity of data
e.g. social media, behavioral, video

VELOCITY
Quickening speed of data
e.g. smart meters, process monitoring

VARIETY
Increase in types of data
e.g. app data, unstructured data

Gartner, Feb 2001
Which Big Data characteristic is the
biggest issue for your organization?

• Variety of data: 48%
• Volume of data: 35%
• Velocity of data: 16%

Source: Getting Value from Big Data, Gartner Webinar, May 2012
Volume
• Volume
- Petabytes, exabytes of data
- Volumes too great for typical DBMS
Big Data: 3V’s

Volume (Scale)
● Data Volume
○ 44x increase from 2009 to 2020
○ From 0.8 zettabytes (ZB) to 35 ZB
● Data volume is increasing exponentially

Exponential increase in
collected/generated data
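As a rough check on these figures, the implied yearly growth can be worked out under the assumption of steady compound growth between 2009 and 2020. That assumption is ours; the slide only gives the two endpoints.

```python
start_zb, end_zb = 0.8, 35.0          # zettabytes, from the slide
start_year, end_year = 2009, 2020

growth_factor = end_zb / start_zb     # close to the 44x quoted on the slide
years = end_year - start_year

# Compound annual growth rate, assuming the same rate every year.
annual_rate = growth_factor ** (1 / years) - 1

print(f"growth factor: {growth_factor:.2f}x over {years} years")
print(f"implied annual growth: {annual_rate:.1%}")
```

Under this assumption the data volume grows by roughly 40% per year.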
Volume - Bytes Defined

eBay data warehouse (2010) = 10 PB

eBay will increase this 2.5 times by 2011

Teradata > 10 PB

Megabyte: 2^20 bytes or, loosely, one million bytes
Gigabyte: 2^30 bytes or, loosely, one billion bytes
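The same power-of-two pattern extends to the larger units used in this deck. The sketch below builds the table and applies it to the eBay figures above; the unit table is an illustration of the binary convention, and the 2011 value is simply the slide's 2.5x projection.

```python
# Each unit is 2^10 times the previous one (binary convention).
UNITS = ["KB", "MB", "GB", "TB", "PB", "EB", "ZB"]
BYTES = {unit: 2 ** (10 * (i + 1)) for i, unit in enumerate(UNITS)}

ebay_2010 = 10 * BYTES["PB"]          # 10 PB data warehouse
ebay_2011 = 2.5 * ebay_2010           # the 2.5x increase stated on the slide

print(BYTES["MB"], BYTES["GB"])
print(ebay_2011 / BYTES["PB"], "PB")
```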
Velocity
• Velocity
• Massive amount of
streaming data
Variety
• Variety
• Massive sets of unstructured/semi-structured data
from Web traffic, social media, sensors, and so on
Which source of data represents the most
immediate opportunity?

Source: Getting Value from Big Data, Gartner Webinar, May 2012
Big Data Opportunities
Making better informed decisions
e.g. strategies, recommendations

Discovering hidden insights
e.g. anomalies, forensics, patterns, trends

Automating business processes
e.g. complex events, translation
Variety (Complexity)
● Relational Data (Tables/Transaction/Legacy Data)
● Text Data (Web)
● Semi-structured Data (XML)
● Graph Data
○ Social Network, Semantic Web (RDF), …

● Streaming Data
○ You can only scan the data once

● A single application can be generating/collecting many types of data

● Big Public Data (online, weather, finance, etc)

To extract knowledge, all these types of data need to be linked together.
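The "you can only scan the data once" constraint on streaming data changes how even simple statistics are computed. As an illustration (not something the slides prescribe), Welford's algorithm keeps a running mean and variance in a single pass without storing the stream. The readings are made up.

```python
from typing import Iterable, Tuple


def running_mean_variance(stream: Iterable[float]) -> Tuple[int, float, float]:
    """Single pass over the stream; returns (count, mean, population variance)."""
    count, mean, m2 = 0, 0.0, 0.0
    for x in stream:
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)      # uses the already-updated mean
    variance = m2 / count if count else 0.0
    return count, mean, variance


# An iterator can be consumed only once, like a real stream.
readings = iter([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
count, mean, variance = running_mean_variance(readings)
print(count, mean, variance)
```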
A Single View to the Customer

[Figure: "Our Customer" at the center, linked to Social Media, Banking, Finance, Gaming, Known History, Purchase, and Entertain]
Velocity (Speed)

● Data is being generated fast and needs to be processed fast
● Online Data Analytics
● Late decisions → missing opportunities
● Examples
○ E-Promotions: based on your current location, your purchase history, and what you like →
send promotions right now for the store next to you
○ Healthcare monitoring: sensors monitoring your activities and body → any abnormal
measurements require immediate reaction
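The healthcare example can be sketched as a check that runs on each reading as it arrives, instead of on a batch collected later. The safe range, the `alerts` helper, and the readings below are hypothetical values chosen for illustration.

```python
from typing import Iterable, Iterator, Tuple

# Hypothetical safe range for a heart-rate sensor, in beats per minute.
LOW, HIGH = 50, 120


def alerts(readings: Iterable[Tuple[int, int]]) -> Iterator[Tuple[int, int]]:
    """Yield (timestamp, value) as soon as a reading leaves the safe range."""
    for timestamp, value in readings:
        if not LOW <= value <= HIGH:
            yield timestamp, value


# Made-up (timestamp, bpm) pairs standing in for a live sensor feed.
stream = [(0, 72), (1, 75), (2, 131), (3, 80), (4, 44)]
flagged = list(alerts(stream))
print(flagged)
```

Because `alerts` is a generator, each abnormal reading is reported before the next one is read, which is the point of processing at the speed the data arrives.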
Real-time/Fast Data

● Social media and networks (all of us are generating data)
● Scientific instruments (collecting all sorts of data)
● Mobile devices (tracking all objects all the time)
● Sensor technology and networks (measuring all kinds of data)
● Progress and innovation are no longer hindered by the ability to collect data
● But by the ability to manage, analyze, summarize, visualize, and discover knowledge
from the collected data in a timely manner and in a scalable fashion
Real-Time Analytics/Decision Requirement

Customer: Influence Behavior
● Product Recommendations that are Relevant & Compelling
● Learning why Customers Switch to competitors and their offers; in time to Counter
● Friend Invitations to join a Game or Activity that expands business
● Improving the Marketing Effectiveness of a Promotion while it is still in Play
● Preventing Fraud as it is Occurring & preventing more proactively
Some Make it 4V’s

Harnessing Big Data

● OLTP: Online Transaction Processing (DBMSs)


● OLAP: Online Analytical Processing (Data Warehousing)
● RTAP: Real-Time Analytics Processing (Big Data Architecture & technology)
The Model Has Changed…
● The Model of Generating/Consuming Data has Changed
Old Model: Few companies are generating data, all others are consuming data

New Model: all of us are generating data, and all of us are consuming data

What’s driving Big Data
- Optimizations and predictive analytics
- Complex statistical analysis
- All types of data, and many sources
- Very large datasets
- More of a real-time

- Ad-hoc querying and reporting


- Data mining techniques
- Structured data, typical sources
- Small to mid-size datasets

Big Data Analytics
● Big data is more real-time in nature
than traditional DW applications
● Traditional DW architectures (e.g.
Exadata, Teradata) are not well-suited
for big data apps
● Shared-nothing, massively parallel
processing (MPP), scale-out architectures are
well-suited for big data apps
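The shared-nothing idea can be sketched in a few lines: records are hash-partitioned so each node owns a disjoint slice, every node aggregates only its own slice, and the partial results are merged at the end. This is a toy illustration under stated assumptions: worker threads stand in for separate machines, and the records and helper names are made up.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def partition(records, n_nodes):
    """Hash-partition records by key so each node owns a disjoint slice."""
    shards = [[] for _ in range(n_nodes)]
    for key, amount in records:
        shards[hash(key) % n_nodes].append((key, amount))
    return shards


def local_totals(shard):
    """Work done by one node, using only the data it owns."""
    totals = Counter()
    for key, amount in shard:
        totals[key] += amount
    return totals


# Made-up (region, sales) records.
records = [("asia", 10), ("europe", 5), ("asia", 7), ("america", 3), ("europe", 1)]
shards = partition(records, n_nodes=3)

# Threads stand in for independent nodes in this sketch.
with ThreadPoolExecutor(max_workers=3) as pool:
    partials = list(pool.map(local_totals, shards))

# Merge step: combine the per-node partial totals.
merged = Counter()
for partial in partials:
    merged.update(partial)
print(dict(merged))
```

Scaling out then means adding more shards and workers; no node needs to see another node's data until the final merge.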

Big Data Technology

Thanks !
