Chapter 1

INTRODUCTION TO
BIG DATA ANALYTICS

Dr G Sudha Sadasivam
• Introduction – Big data & Characteristics, Cloud
Computing, Virtualisation, Data Center
Architecture
• Data Analytics Life Cycle & Hadoop , Map
Reduce
• No SQL Stores - Key-Value Stores – Columnar
Stores – Document Stores - Graph Databases
• Theory & Methods – Clustering, Classification,
Regression- Stream Analytics - Spark
CA
• Test 1 – Unit 1 to 2;
• Test 2 – Unit 3,4
• Assignment Presentation
– Hadoop
– MongoDB
– Neo4j
– Storm / Spark
• Tutorial 1 & 2
– Analysis questions
Agenda
• Current generation data
• Data Science and related fields
• Big Data Characteristics
• Traditional vs Big Data Systems
• Composition of Data Science Team
• Architecture of Big data System with usecase
• Big Data Usecases
Introduction
• Social media, machine logs, and transaction data (weblogs, text, videos,
images, and sensor readings) create large volumes of heterogeneous
real-time data.
• Organization, management, analysis, dissemination, and
knowledge discovery from this priceless data aids decision
making in business
• This requires large-scale computing infrastructure and
sophisticated systems for storage
• Data is the raw material of the current industrial revolution 4.0.
Sales of Cakes

Month  Jan  Feb  Mar  Apr  May  Nov  Dec
Sales  230  120  100  130  140  180  250

Data – the raw numbers
Information – more sales in Jan, Nov, Dec
Knowledge – the peaks connect with a community festival
Action – advertise to the community
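The data-to-information-to-knowledge-to-action ladder above can be sketched in a few lines of Python. The figures are the ones from the table; the festival calendar is an assumption added for illustration:

```python
# Raw data: monthly cake sales (numbers alone carry no meaning yet).
sales = {"Jan": 230, "Feb": 120, "Mar": 100, "Apr": 130,
         "May": 140, "Nov": 180, "Dec": 250}

# Information: which months sell more than average?
avg = sum(sales.values()) / len(sales)
peak_months = [m for m, s in sales.items() if s > avg]

# Knowledge: connect the peaks with an external fact (community festivals).
festival_months = {"Jan", "Nov", "Dec"}   # assumed calendar for illustration
explained = [m for m in peak_months if m in festival_months]

# Action: advertise to the community ahead of each festival peak.
for m in explained:
    print(f"Advertise before the {m} festival")
```

Running this recovers the slide's "Information" line: the peaks are Jan, Nov, and Dec.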

Challenges & 3 Vs

[Figure: the 3 Vs of big data, with Value as the vertical axis]
• Volume (MB -> GB -> TB -> PB; Terabytes -> Zettabytes): massive and
growing amounts of information residing internal and external to the
organization.
• Variety (Structured -> Semi-structured -> Unstructured): from structured
data (e.g. a salary system) to unconventional semi-structured or
unstructured (diverse) data including web pages, log files, social media,
click-streams, instant messages, text messages, emails, and sensor data
from active and passive systems; sources include social media, marketing
automation, web, audio, video, photos, sensors, and mobile.
• Velocity (Batch -> Periodic -> Near Real Time -> Real Time): changing
information, from batch to streaming data.

"A massive volume of both structured and unstructured data that is so
large that it is difficult to store, analyse, process, share, visualise and
manage with traditional database and software techniques."
– Roger Magoulas of O'Reilly, 2005

Applications: sentiment analytics, transaction analytics, claim fraud
analytics, warranty claim analytics, surveillance analytics, CDR analytics.
Data Science
• Doll analysed the effect of smoking (1950) to identify
– What has happened?
– Why did it happen?
– What is likely to happen in future?
– What actions can be taken based on the analysis?
– How to cause something to happen?
• This descriptive, diagnostic, predictive, prescriptive, or cognitive analysis
can lead to knowledge discovery
• The term “Data Science” was coined by Peter Naur (1960) and
reintroduced by Chikio Hayashi (1996) in the International Federation of
Classification Societies (IFCS)
• Data science is the science of making sense out of data
• Discovery is guided by data rather than by a model
• Data Science is the extraction of actionable knowledge
directly from data through a process of discovery,
hypothesis, and hypothesis analysis.
Data Science
Data science lies at the intersection of three areas:
• Mathematics – algebra, optimisation, statistics, operations research (OR)
• Domain Expertise – healthcare, engineering, economics, business, law
• Computer Science – programming, software tools, frameworks,
infrastructure, algorithms

• Statistics: branch of mathematics that deals with collection, analysis,
interpretation, presentation, and organization of data. OR is used in
decision making.
• Data is collected from various domains.
• Programming is needed to develop algorithms, tools and frameworks.
• Data science makes sense out of data, manually or automatically.
• Analytics makes smart business decisions using statistics, domain data
and programming to improve the business.
• What is data science?
• Why is analytics important for big data?
• What are the characteristics of the current generation of data?
• Define big data.
• What is the difference between knowledge and information?
• What is the objective of big data systems?
Data Science related disciplines
• Statistics is a measure for an attribute(s) of a sample.
• Artificial intelligence (AI) is a sub-domain under computer
science, concerned with solving tasks using abstract intelligence.
• ML is a branch of AI that aims at designing automated systems
that can update/refine themselves by discovering new
knowledge
• Data mining deals with designing algorithms (comp sc) to extract
insights from data (domain)
• Both machine learning and statistics converge on learning from
data. While statistics is estimation-based, machine learning
deals with automated learning.
• Data science covers the whole spectrum of data processing, not
just the algorithmic or statistical aspects. It includes architecture,
data integration, machine learning, data visualization, business
intelligence, and automated data-driven decisions.
Machine Learning vs Data Science:
– Branch of AI vs uses ML
– Develops new (individual) models vs explores many models, builds and
tunes hybrids
– Proves mathematical properties of models vs understands empirical
properties of models
– Improves/validates on a few, relatively clean, small datasets vs
develops/uses tools that can handle massive datasets
– Research (publish a paper!) vs industry (take action!)

Database vs Data Science:
– past vs future querying
– ACID vs CAP
– SQL vs NoSQL (schema-less)
Data miners sort through huge datasets using sophisticated
software to identify undiscovered patterns and establish
hidden relationships.
Data analytics focuses on inference, the process of deriving a
conclusion based solely on what is already known by the
researcher
• Data analytics aims at using tools and techniques to discover
knowledge from hidden patterns and to take effective actions
for prediction
– Descriptive analytics - Analysis of data leads to describing
patterns
– Diagnostic analytics - the knowledge discovery helps to
understand the reason behind the occurrence of patterns
– Predictive analytics - the knowledge discovered is used to
forecast future trends
– Prescriptive analytics - the knowledge discovered can be
used to suggest actions to be taken in future
– Cognitive Analytics - the knowledge discovered causes
something to happen

• Data engineering is a sub-field of software architecture, basically
dealing with the hardware and frameworks to maintain data that is
consumed by data scientists
Worked example on the cake-sales data:

Month  Jan  Feb  Mar  Apr  May  Nov  Dec
Sales  230  120  100  130  140  180  250

1. Descriptive – sales from Jan–Sept 2018: max/min/average sales; which
products are sold
2. Diagnostic – why are the products sold?
3. Predictive – from sales of Jan–Sept 2018, predict October sales of a
product
4. Prescriptive – if the October prediction for product 1 is 30 units,
which user segment should be targeted with offers or discounts?
5. Cognitive – prescriptive with sensors and actuators, e.g. automatically
send emails to customer segments

Complexity of different types of Analytics:
Hindsight -> Insight -> Foresight -> Context -> Inference
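A minimal Python sketch of the descriptive, predictive, and prescriptive steps on the same figures. The forecast is a deliberately naive mean of recent months, and the prescriptive rule is an invented example, not a real model:

```python
# Monthly sales figures from the table above.
sales = {"Jan": 230, "Feb": 120, "Mar": 100, "Apr": 130,
         "May": 140, "Nov": 180, "Dec": 250}

# Descriptive: what happened? (max / min / average sales)
max_month = max(sales, key=sales.get)
min_month = min(sales, key=sales.get)
average = sum(sales.values()) / len(sales)

# Predictive: forecast the next month as the mean of the last three
# (a deliberately naive model; real systems would fit a proper one).
recent = list(sales.values())[-3:]          # May, Nov, Dec
forecast = sum(recent) / len(recent)

# Prescriptive: suggest an action based on the forecast (invented rule).
action = "run a discount campaign" if forecast < average else "stock up"
```

On these numbers the forecast (190) exceeds the average (about 164), so the suggested action is to stock up.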


Big Data Characteristics
• Volume: scale at which the data is growing.
• Variety: encompasses structured data (customer information),
semi-structured (XML), unstructured data (videos, images,
sensors)
• Velocity: static and dynamic streaming data from business
applications
• Value: extract additional value from big data sources to provide
business insights for better decision making.
• Veracity: dependability, trustworthiness, reliability and
certainty of data.
• Validity: correctness, authenticity, accuracy of data used for
extracting useful information
• Variability: dynamic, evolving behavior of a data source. E.g. different
coffee blends are variety; the same blend tasting different on different
days is variability
• Visualisation: to communicate information clearly and
effectively to its intended users eg. Charts and graphs
• Viscosity: data velocity relative to the time scale of events.
Gives time lag between the occurrence and usage of an
event. Only highly viscous data is stored; less viscous data
calls for quick action.
• Virality: information dispersion/ spread in a people-to-people
network
• Volatility: a measure of the rate of loss/stability of data. Time
for validity and storage of data
• Venue: source of data and methods to obtain it
• Vocabulary: data models and data structures used to organize,
store, and analyse data
• Versatility: data used differently under different contexts
Architecture of Big Data Systems
• Relational systems – 3 layers
• Big Data Systems – 4 layers
• Data storage layer
– Different data stores for
different types of data –
polyglot persistence
– Has physical and logical layers
– Eg HDFS, S3, MongoDB
• Data Processing Layer: processes
stored data in batch
(MapReduce) or real-time
(Spark) modes
• Data Query Layer: obtain
valuable insights by querying
processing layer eg Hive
• Visualisation Layer: presents
the value of data to the users
in an understandable format
• Services Layers:
– Ingestion Layer: Data from multiple sources is prioritized,
validated, categorized, and routed to the destination for
effective storage and access. Data may be ingested in
batches or in real-time. Eg. Flume, LogStash, Sqoop
– Data Collector Layer: transports data from the ingestion
layer to the rest of the data pipeline using a messaging
system. Eg Kafka
– Data Security layer: provides authentication, authorization,
audit, data encryption and centralized administration for
big data systems eg Knox
– Data monitoring Layer: monitors the performance at
infrastructure, framework, analytics engine, data store,
and application levels
– Infrastructure Layer: provides the hardware to host various
big data frameworks eg Cloud
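The batch mode named for the processing layer follows the MapReduce model. As a toy, single-machine sketch (real Hadoop distributes these phases across a cluster), counting words in log lines looks like:

```python
from collections import defaultdict

def map_phase(lines):
    """Emit a (word, 1) pair for every word – the 'map' step."""
    for line in lines:
        for word in line.split():
            yield word.lower(), 1

def shuffle(pairs):
    """Group emitted values by key – the 'shuffle' step."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Aggregate each group – the 'reduce' step."""
    return {key: sum(values) for key, values in groups.items()}

logs = ["error timeout", "ok", "error disk full"]
counts = reduce_phase(shuffle(map_phase(logs)))
# counts["error"] == 2
```

The map, shuffle, and reduce functions here are independent of each other, which is exactly what lets a framework run each phase in parallel over many machines.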
Case study – Big data Systems in Personalised Healthcare
• EHRs with genomics, and
image data are stored in HDFS
• Pig and Hive tools - to clean
and prepare data.
• Identifying cohorts (similar
patients) using clinical trials of
heart failure patients and EHR
• Pig queries - to filter patients
based on heart failure level
(descriptive)
• Analyse clinical trials with rule
base by experts (diagnostic)
• Personalised treatment –
predictive
• Actions to offer the treatment
– prescriptive
• data processing layer extracts
the big data driven phenotype
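The descriptive filtering step can be pictured in plain Python. This is only a hedged sketch of what such a Pig query does; the record fields and threshold are invented for illustration:

```python
# Hypothetical EHR records; field names are assumptions for illustration.
patients = [
    {"id": 1, "hf_level": 3, "age": 64},
    {"id": 2, "hf_level": 1, "age": 52},
    {"id": 3, "hf_level": 4, "age": 71},
]

# Descriptive step: filter the cohort by heart-failure level,
# mirroring a Pig statement like: FILTER patients BY hf_level >= 3;
cohort = [p for p in patients if p["hf_level"] >= 3]
```

The filtered cohort then feeds the diagnostic and predictive stages described above.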
Traditional vs big data systems
• Processing – Transactional vs Analytical
• Framework – Structured vs heterogeneous data
• Infrastructure - small/medium scale with centralized control vs large scale
Roles in Data Science Team
• Data Scientist: designs and conducts experiments by collecting and
cleansing data, analysing it, inferring insights, and visualizing the
results. Skills: research, statistics, machine learning, mathematics,
design of new algorithms.
• Statistician: collects, analyses, and interprets data using statistical
methods to transform businesses.
• Data Analyst: collects, processes, and analyses data (data munging,
data analysis, tool usage, visualisation). Unlike a data scientist, a
data analyst does not invent new algorithms.
• Business Analyst: aims at improving the business from the insights.
Skills: business insights, communication, tool usage, improving business
processes.
• Data Architect: creates a blueprint for the data management system to
capture, integrate, organize, protect, and maintain data. Skills:
design/implement data models, design new storage solutions.
• Database Administrator (DBA): manages and maintains the database and
storage solutions.
• Data Engineer: has software-engineering skills to develop, construct,
test, and maintain frameworks.
Big Data Usecases / Verticals
• Sports Domain
– Personalised coaching by analysing player’s movements and
styles
– predict performance of a game
– eco-friendly product design
– Internet gaming
• Sentiment analysis
– Track flight experience
– Identify customer preferences based on demographics,
pricing, seasons, and geography
• Behavioral Analysis
– understand customer behavior to add business value
– Cash back offers to customers based on purchase history
– Retaining old customers
– advertise suitably to its customers
– Optimise orders
• Customer segmentation: grouping similar users based on their
purchases and recommending suitable items
– Grouping by analysing the browsing and purchase patterns
– music recommendations based on the customer profile
– Personalising healthcare - to predict health outcomes
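A minimal sketch of the grouping idea behind segmentation, using a toy one-dimensional k-means over invented spend figures (production systems would use a clustering library, e.g. Spark MLlib):

```python
# Toy 1-D k-means: segment customers by annual spend (figures invented).
spend = [120, 150, 130, 900, 950, 880]

# Start with two guessed centroids and refine them iteratively.
centroids = [min(spend), max(spend)]
for _ in range(10):                      # a few iterations converge here
    clusters = [[], []]
    for s in spend:
        # Assign each customer to the nearest centroid.
        nearest = min((abs(s - c), i) for i, c in enumerate(centroids))[1]
        clusters[nearest].append(s)
    # Move each centroid to the mean of its cluster.
    centroids = [sum(c) / len(c) for c in clusters]

# clusters[0] = low spenders, clusters[1] = high spenders
```

Recommendations then differ per segment: low spenders might get cash-back offers, high spenders premium items.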
• Prediction: prediction of outcomes based on historical info
– the performance of students in various courses
– weather forecasting, predict natural disasters/ climatic changes
– Spread of diseases
– stock recommendations to customers
– use RT traffic information to model & predict traffic patterns
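A hedged sketch of trend-based prediction: fit a least-squares line to a short, invented series of hourly traffic counts and extrapolate one step ahead:

```python
# Invented hourly vehicle counts at times t = 0..4.
counts = [100, 110, 125, 135, 150]
n = len(counts)
xs = range(n)

# Ordinary least-squares fit of a straight line y = intercept + slope * t.
mean_x = sum(xs) / n
mean_y = sum(counts) / n
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, counts))
         / sum((x - mean_x) ** 2 for x in xs))
intercept = mean_y - slope * mean_x

# Extrapolate one step ahead: the predicted count at t = 5.
forecast = intercept + slope * n
```

Real traffic or weather models are far richer, but the shape is the same: learn a pattern from historical observations, then project it forward.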
• Fraud Detection: to detect prevent and eliminate internal and
external frauds.
– VISA fraud detection
– Intelligent Bureau of Canada (IBC) identifies fraudulent claims
– Analysing black box of an aircraft during sabotage
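A toy version of the anomaly-scoring idea behind fraud detection (the amounts are invented; real systems combine many such signals, not a single rule):

```python
from statistics import mean, stdev

# Flag transactions far from the customer's typical amount (toy data).
amounts = [42.0, 39.5, 45.0, 41.0, 40.5, 980.0]
mu, sigma = mean(amounts[:-1]), stdev(amounts[:-1])   # baseline history

def is_suspicious(amount, threshold=3.0):
    """z-score test: an amount more than threshold*sigma from the
    customer's mean looks suspicious."""
    return abs(amount - mu) > threshold * sigma

flags = [a for a in amounts if is_suspicious(a)]
```

Here only the outlying 980.0 transaction is flagged; the ordinary ones fall well within three standard deviations of the baseline.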
Conclusion

• Current generation data
• Data Science and related fields
• Big Data Characteristics
• Traditional vs Big Data Systems
• Composition of Data Science Team
• Architecture of Big data System with usecase
• Big Data Usecases
