L01 Introduction


INTRODUCTION:

BIG DATA ANALYTICS

Objective…

How to manage very large amounts of data and extract value and
knowledge from them

OUTLINE:
• TYPES OF DIGITAL DATA
• INTRODUCTION TO BIG DATA
• BIG DATA ANALYTICS

Classification of Digital Data

Approximate percentage distribution of digital data


Structured Data
Sources of Structured Data

• Databases such as Oracle, DB2, Teradata, MySQL, PostgreSQL, etc.
• Spreadsheets
• OLTP systems

Ease with Structured Data

• Input / Update / Delete
• Security
• Indexing / Searching
• Scalability
• Transaction Processing
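
The operations listed above can be illustrated with a minimal sketch using Python's built-in sqlite3 module. The `orders` table, its columns, and the sample rows are hypothetical and chosen only for illustration; any relational database would behave similarly.

```python
import sqlite3

# In-memory database; the "orders" table and its rows are hypothetical examples.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)")

# Input / Update / Delete: a fixed schema makes each operation a single statement.
conn.executemany(
    "INSERT INTO orders (id, customer, amount) VALUES (?, ?, ?)",
    [(1, "alice", 20.0), (2, "bob", 35.5), (3, "alice", 12.5)],
)
conn.execute("UPDATE orders SET amount = 40.0 WHERE id = 2")
conn.execute("DELETE FROM orders WHERE id = 3")
conn.commit()

# Indexing / Searching: an index on a known column supports fast lookups.
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer)")
alice_orders = conn.execute(
    "SELECT id, amount FROM orders WHERE customer = ?", ("alice",)
).fetchall()

# Transaction processing: both statements succeed together or not at all.
try:
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute("UPDATE orders SET amount = amount - 5 WHERE id = 1")
        # Deliberate failure: id 1 already exists, so this INSERT is rejected.
        conn.execute("INSERT INTO orders (id, customer, amount) VALUES (1, 'carol', 5.0)")
except sqlite3.IntegrityError:
    pass  # the UPDATE above was rolled back along with the failed INSERT

amount_after = conn.execute("SELECT amount FROM orders WHERE id = 1").fetchone()[0]
print(alice_orders, amount_after)
```

Because the failed INSERT rolls the whole transaction back, the amount for order 1 is unchanged after the `with` block.
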
Semi-structured Data
Sources of Semi-structured Data

• XML (eXtensible Markup Language)
• Other markup languages
• JSON (JavaScript Object Notation)

Characteristics of Semi-structured Data

• Inconsistent structure
• Self-describing (label/value pairs)
• Schema information is often blended with data values
• Data objects may have different attributes not known beforehand
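
A small sketch of these characteristics, using Python's standard json module. The two records and their field names are invented for illustration.

```python
import json

# Two hypothetical records from the same collection; note the differing attributes.
raw = """
[
  {"name": "Asha", "email": "asha@example.com"},
  {"name": "Ravi", "phone": "555-0100", "address": {"city": "Pune"}}
]
"""

records = json.loads(raw)

# Self-describing: each value travels with its own label, so the "schema"
# is discovered by reading the data rather than declared up front.
for record in records:
    print(sorted(record.keys()))

# Inconsistent structure: the full set of attributes is only known after a scan.
all_attributes = sorted({key for record in records for key in record})
shared_attributes = sorted(set.intersection(*(set(r) for r in records)))
print(all_attributes)     # attributes seen anywhere
print(shared_attributes)  # attributes present in every record
```

Only `name` appears in both records; the other attributes could not have been known before reading the data.
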
Unstructured Data
Sources of Unstructured Data

• Web pages
• Images
• Free-form text
• Audio
• Videos
• Body of email
• Text messages
• Chats
• Social media data
• Word documents

Issues with terminology – Unstructured Data

• Structure can be implied despite not being formally defined.
• Data with some structure may still be labeled unstructured if the structure doesn’t help with the processing task at hand.
• Data may have some structure or may even be highly structured in ways that are unanticipated or unannounced.

Dealing with Unstructured Data

• Data Mining
• Natural Language Processing (NLP)
• Text Analytics
• Noisy Text Analytics
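
As a rough illustration of the simplest kind of text analytics, the sketch below tokenizes a piece of free-form text, counts terms, and recovers structure that was only implied. The sample email, the stop-word list, and the `#1234` order-number pattern are all assumptions made for this example; real NLP pipelines are far more involved.

```python
import re
from collections import Counter

# Hypothetical free-form text, e.g. the body of a support email.
text = """
Hi team, my order #4521 arrived late and the box was damaged.
Please refund order #4521 or send a replacement. Thanks, Meera
"""

# Step 1: normalize and tokenize the raw text.
tokens = re.findall(r"[a-z']+", text.lower())

# Step 2: drop a few very common words (a tiny, illustrative stop list).
stop_words = {"hi", "my", "and", "the", "was", "or", "a", "please", "thanks"}
term_counts = Counter(t for t in tokens if t not in stop_words)

# Step 3: pull out structure that was only implied, here the order numbers.
order_ids = sorted(set(re.findall(r"#(\d+)", text)))

print(term_counts.most_common(3))
print(order_ids)
```
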


Answer a few quick questions …
• Which category (structured, semi-structured, or unstructured) will you place a Web page in?
• Which category (structured, semi-structured, or unstructured) will you place a Word document in?
• State a few examples of human-generated and machine-generated data.

Categorize Health Care Data…
HEALTH CARE DATA SETS

What is big data?
• “Every day, we create 2.5 quintillion bytes of data — so much
that 90% of the data in the world today has been created in
the last two years alone.”

• This data comes from everywhere:
– sensors used to gather climate information,
– posts to social media sites,
– digital pictures and videos,
– purchase transaction records, and
– cell phone GPS signals, to name a few.

This data is “big data.”


Huge amount of data
• There are huge volumes of data in the world:
+ From the beginning of recorded time until 2003, we created 5 billion gigabytes (5 exabytes) of data.
+ In 2011, the same amount was created every two days.
+ In 2013, the same amount was created every 10 minutes.
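
The figures above can be turned into comparable rates with a little arithmetic. The sketch below assumes decimal (SI) units, which the slide does not specify: 1 GB = 10^9 bytes and 1 EB = 10^18 bytes.

```python
# Decimal (SI) units are assumed: 1 GB = 10**9 bytes, 1 EB = 10**18 bytes.
GB = 10**9
EB = 10**18
TB = 10**12

amount_bytes = 5 * 10**9 * GB      # "5 billion gigabytes"
amount_eb = amount_bytes / EB      # the same amount expressed in exabytes

SECONDS_PER_DAY = 24 * 60 * 60
rate_2011 = amount_bytes / (2 * SECONDS_PER_DAY)  # bytes/s if created every two days
rate_2013 = amount_bytes / (10 * 60)              # bytes/s if created every 10 minutes

print(f"{amount_eb:.0f} EB")
print(f"2011: {rate_2011 / TB:.1f} TB/s")
print(f"2013: {rate_2013 / TB:.1f} TB/s")
print(f"speed-up 2011 to 2013: {rate_2013 / rate_2011:.0f}x")
```

Going from "every two days" to "every 10 minutes" is a 288-fold increase in the rate of data creation (2,880 minutes divided by 10).
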
Finally…
“Big data” is similar to “small data”, but bigger.
But bigger data requires different approaches (techniques, tools, architecture),
with an aim to solve new problems, or old problems in a better way.
Harnessing Big Data

• OLTP: Online Transaction Processing (DBMSs)
• OLAP: Online Analytical Processing (Data Warehousing)
• RTAP: Real-Time Analytics Processing (Big Data Architecture & technology)

Definition of Big Data

• No single standard definition…

“Big Data” is data whose scale, diversity, and complexity require new architecture, techniques, algorithms, and analytics to manage it and extract value and hidden knowledge from it…

Definition of Big Data

Big Data is high-volume, high-velocity, and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.

Source: Gartner IT Glossary

• High-volume, high-velocity, high-variety
• Cost-effective, innovative forms of information processing
• Enhanced insight & decision making

Big data spans three dimensions: Volume, Velocity and Variety

• Volume: in petabytes. Electronic medical records, images, digital pathology, email, web communications
• Velocity: often time-sensitive; data must be analyzed as it’s streaming in to maximize its value to patient care (e.g. patient monitoring)
• Variety: structured and unstructured data: clinical notes, audio transcription, imaging, click streams

Characteristics of Big Data:
1-Scale (Volume)

• Data Volume
– 44x increase from 2009 to 2020
– From 0.8 zettabytes to 35 zettabytes
• Data volume is increasing exponentially

Exponential increase in collected/generated data
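
To get a feel for what "exponential" means here, the sketch below derives the yearly growth implied by the two volume figures. It assumes a constant compound growth rate between 2009 and 2020, which is a simplification and not something the slide claims.

```python
import math

start_zb, end_zb = 0.8, 35.0   # zettabytes in 2009 and 2020, from the slide
years = 2020 - 2009

growth_factor = end_zb / start_zb              # roughly the "44x" quoted above
annual_growth = growth_factor ** (1 / years)   # constant yearly multiplier (assumed)
doubling_years = math.log(2) / math.log(annual_growth)

print(f"overall growth: {growth_factor:.2f}x")
print(f"implied yearly growth: {(annual_growth - 1) * 100:.0f}%")
print(f"implied doubling time: {doubling_years:.1f} years")
```

Under that assumption the volume grows by about 41% per year, which corresponds to doubling roughly every two years.
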
Big Data Characteristics
How big is Big Data?
- What is big today may not be big tomorrow
- Any data that can challenge our current technology in some manner can be considered Big Data
- Volume
- Communication
- Speed of generation
- Meaningful analysis
Characteristics of Big Data:
2-Complexity (Variety)
• Various formats, types, and structures
• Text, numerical, images, audio, video, sequences, time series, social media data, multi-dimensional arrays, etc.
• Static data vs. streaming data
• A single application can be generating/collecting many types of data

To extract knowledge, all these types of data need to be linked together

Variety

• Structured data: for example, traditional transaction processing systems and RDBMS.
• Semi-structured data: for example, HyperText Markup Language (HTML) and eXtensible Markup Language (XML).
• Unstructured data: for example, unstructured text documents, audio, video, email, photos, PDFs, social media, etc.

Characteristics of Big Data:
3-Speed (Velocity)
• Data is being generated fast and needs to be processed fast
• Online Data Analytics
• Late decisions → missing opportunities
• Examples
– E-Promotions: based on your current location, your purchase history, and what you like → send promotions right now for the store next to you
– Healthcare monitoring: sensors monitoring your activities and body → any abnormal measurements require immediate reaction

Velocity

Batch → Periodic → Near real time → Real-time processing
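
The two ends of this spectrum can be contrasted with a minimal sketch. The heart-rate readings and the alert threshold are hypothetical values, not clinical guidance; the point is only that batch processing answers after all data is stored, while real-time processing reacts to each reading as it arrives.

```python
from typing import Iterable, Iterator

# Hypothetical heart-rate readings (beats per minute) arriving one at a time.
readings = [72, 75, 71, 140, 74, 73]
THRESHOLD = 120  # illustrative alert level, not a clinical value


def batch_alerts(stored: list[int]) -> list[int]:
    """Batch: wait until all readings are stored, then scan them once."""
    return [i for i, bpm in enumerate(stored) if bpm > THRESHOLD]


def streaming_alerts(stream: Iterable[int]) -> Iterator[tuple[int, float]]:
    """Real-time: react to each reading on arrival, keeping only constant state."""
    count, mean = 0, 0.0
    for i, bpm in enumerate(stream):
        count += 1
        mean += (bpm - mean) / count  # running mean, updated incrementally
        if bpm > THRESHOLD:
            yield i, mean             # the alert is available immediately


batch_result = batch_alerts(readings)                   # known only after the last reading
stream_result = list(streaming_alerts(iter(readings)))  # each alert is raised as it happens
print(batch_result, stream_result)
```

Both approaches flag the same reading, but the streaming version could raise the alert at the fourth reading instead of after the whole batch has been collected.
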


Other Characteristics of Data –
Which are not Definitional Traits of Big Data

• Veracity and Validity

• Volatility

• Variability
Sources of Big Data
Who’s Generating Big Data

• Mobile devices (tracking all objects all the time)
• Social media and networks (all of us are generating data)
• Scientific instruments (collecting all sorts of data)
• Sensor technology and networks (measuring all kinds of data)

• Progress and innovation are no longer hindered by the ability to collect data
• But by the ability to manage, analyze, summarize, visualize, and discover knowledge from the collected data in a timely manner and in a scalable fashion

Challenges with Big Data

• Capture
• Storage
• Curation
• Search
• Analysis
• Transfer
• Visualization
• Privacy violations

Challenges in Handling Big Data

• The bottleneck is in technology
– New architecture, algorithms, techniques are needed
• Also in technical skills
– Experts in using the new technology and dealing with big data

Why Big Data?

More data → More accurate analysis → More confidence in decision making → Greater operational efficiencies, cost reduction, time reduction, new product development, optimized offerings, etc.

Traditional Business Intelligence (BI) versus Big Data
A Typical Data Warehouse Environment

• Data sources: ERP, CRM, Legacy, 3rd party Apps
• Data Warehouse
• Uses: Reporting / Dashboarding, OLAP, Ad hoc querying, Modeling

A Typical Hadoop Environment

• Data sources: Web Logs, Images and Videos, Social Media (Twitter, Facebook, etc.), Docs & PDFs
• Hadoop: HDFS, MapReduce
• Connected systems: Operational Systems, Data Warehouse, Data Marts, ODS
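
The slides name MapReduce without describing it. As a rough, single-process imitation of its three steps (map, shuffle, reduce), the sketch below counts hits per URL in a few invented web-log lines. It does not use Hadoop's API; the function names and the log format are assumptions made for this example.

```python
from collections import defaultdict
from itertools import chain

# Hypothetical web-log lines; on a real cluster these would be blocks stored in HDFS.
log_lines = [
    "GET /home 200",
    "GET /cart 200",
    "GET /home 404",
    "POST /cart 200",
]


def map_phase(line: str):
    """Map: emit one (key, value) pair per record, here (url, 1)."""
    _method, url, _status = line.split()
    yield url, 1


def shuffle(pairs):
    """Shuffle: group all values that share a key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: list) -> tuple:
    """Reduce: combine the values of one key, here by summing."""
    return key, sum(values)


mapped = chain.from_iterable(map_phase(line) for line in log_lines)
hits = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(hits)
```

In a real deployment the map and reduce calls would run in parallel on many machines; here they run sequentially so that the data flow is easy to follow.
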


Co-existence of Big Data and Data Warehouse

• Data sources: Web Logs, Images and Videos, Social Media (Twitter, Facebook, etc.), Docs & PDFs
• Hadoop: HDFS, MapReduce
• Data Warehouse
• Connected systems: Operational Systems, Data Warehouse, Data Marts, ODS


Answer a few quick questions …
• Share your understanding of Big Data.
• How is the traditional BI environment different from the Big Data environment?
• Share your experience as a customer on an e-commerce site. Comment on the big data that gets created on a typical e-commerce site.
