Mod 1 Intro BDAEdited
INTRODUCTION TO BIG
DATA ANALYTICS
CPIS604-SACHI ARAFAT (FROM EMC NOTES) 1
WEEK 1
INTRO TO BDA
MODULE 1.A – Upon completion of this module, you should be able to:
• Define big data
• Identify four business drivers for advanced analytics
• Distinguish the techniques for Business Intelligence from Data Science
• Describe the role of the Data Scientist within the new big data ecosystem
• Cite at least three illustrative examples of big data opportunities
BIG DATA DEFINED
“Big Data” is data whose scale, distribution, diversity, and/or
timeliness require the use of new technical architectures and
analytics to enable insights that unlock new sources of
business value.
Requires new data architectures, analytic sandboxes
New tools
New analytical methods
Integrating multiple skills into new role of data scientist
Source: McKinsey May 2011 article Big Data: The next frontier for innovation, competition, and productivity
2. Processing Complexity
Changing data structures
Use cases warranting additional transformations and
analytical techniques
3. Data Structure
Greater variety of data structures to mine and analyze
Semi-Structured
• Textual data files with a discernible pattern, enabling parsing
• Example: XML data files that are self-describing and defined by an XML schema
Semi-Structured Data
View Source
http://www.google.com/#hl=en&sugexp=kjrmc&cp=8&gs_id=2m&xhr=t&q=data+scientist&pq=big+data&pf=p&sclient=psyb&source=hp&pbx=1&oq=data+sci&aq=0&aqi=g4&aql=f&gs_sm=&gs_upl=&bav=on.2,or.r_gc.r_pw.,cf.osb&fp=d566e0fbd09c8604&biw=1382&bih=651
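The search URL above is itself text with a discernible pattern: the part after `#` is a run of `key=value` pairs. As a minimal sketch of what "enabling parsing" can mean in practice, the Python below splits a shortened copy of that URL into its parameters using only the standard library. The function name `parse_search_url` is an illustrative assumption, not part of the course material.

```python
from urllib.parse import urlsplit, parse_qs

# Shortened copy of the slide's URL (trailing parameters omitted for brevity).
URL = (
    "http://www.google.com/"
    "#hl=en&sugexp=kjrmc&cp=8&gs_id=2m&xhr=t&q=data+scientist&pq=big+data"
    "&pf=p&sclient=psyb&source=hp&pbx=1&oq=data+sci&aq=0&aqi=g4&aql=f"
)


def parse_search_url(url):
    """Return (host, params) where params maps each key to a list of values."""
    parts = urlsplit(url)
    # The key=value pairs sit in the fragment (after '#'), not the query string.
    params = parse_qs(parts.fragment)
    return parts.netloc, params


host, params = parse_search_url(URL)
print(host)          # host name of the search site
print(params["q"])   # current search terms
print(params["pq"])  # previous search terms
```

No schema is declared anywhere in the URL; the structure is recovered purely from the `&` and `=` delimiters.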
Unstructured Data
The Red Wheelbarrow, by
William Carlos Williams
• Spreadsheets and low-volume DBs for recordkeeping
• Analyst dependent on data extracts

• Supports BI and reporting, but restricts robust analyses
• Analyst dependent on IT & DBAs for data access and schema changes
• Analysts must spend significant time to get extracts from multiple sources

• Enables high performance analytics using in-db processing
• Reduces costs associated with data replication into "shadow" file systems
• “Analyst-owned” rather than “DBA owned”
Discussion Questions
1. Discuss how the bank’s data would change under these circumstances.
2. How are their needs changing with these business changes?
3. What do you need to consider from an analyst's point of view? What are some things to consider implementing as the bank grows?
MODULE 1: INTRODUCTION TO BIG
DATA ANALYTICS
Lesson 1: Summary
MODULE 1: INTRODUCTION TO BIG DATA ANALYTICS
BUSINESS DRIVERS FOR ANALYTICS
Current Business Problems Provide Opportunities for Organizations to
Become More Analytical & Data Driven
Driver: Examples
1. Desire to optimize business operations: Sales, pricing, profitability, efficiency
2. Desire to identify business risk: Customer churn, fraud, default
3. Predict new business opportunities: Upsell, cross-sell, best new customer prospects
4. Comply with laws or regulatory requirements: Anti-Money Laundering, Fair Lending, Basel II
[Figure: current analytical architecture. Labels: Departmental Warehouse, Enterprise Applications, “Spread Marts”, Prioritized Operational Processes, Reporting, Siloed Analytics, Non-Agile Models, Static schemas accrete over time]
OPPORTUNITIES FOR A NEW APPROACH
TO ANALYTICS
NEW APPLICATIONS DRIVING DATA VOLUME
[Figure: data users/buyers. Labels: Catalog Co-Ops, Phone/TV, Retail, Media, Media Archives, Credit Bureaus, List Brokers, Private Investigators/Lawyers, Financial, Banks, Delivery Service, Government]
Analytic Sandbox
• Data assets gathered from multiple sources and technologies for analysis
• Enables high performance analytics using in-db processing
• Reduces costs associated with data replication into "shadow" file systems
• “Analyst-owned” rather than “DBA owned”

1. Speed of decision making
2. Throughput
3. Analysis flexibility
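To make "in-db processing" concrete, here is a minimal sketch that contrasts pulling every row out of the database and aggregating in application code with asking the database to do the aggregation itself. It uses Python's built-in `sqlite3` with an in-memory database. The `loans` table, its columns, and the four sample rows are hypothetical stand-ins, not data from the course.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a small, made-up loans table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE loans (id INTEGER PRIMARY KEY, region TEXT, amount REAL)"
    )
    rows = [("north", 1000.0), ("north", 3000.0),
            ("south", 500.0), ("south", 1500.0)]
    conn.executemany("INSERT INTO loans (region, amount) VALUES (?, ?)", rows)
    return conn


def average_by_extract(conn):
    """Extract approach: copy every row out, then aggregate in Python."""
    totals = {}
    for region, amount in conn.execute("SELECT region, amount FROM loans"):
        total, count = totals.get(region, (0.0, 0))
        totals[region] = (total + amount, count + 1)
    return {region: total / count for region, (total, count) in totals.items()}


def average_in_db(conn):
    """In-db approach: the database aggregates; only the summary leaves it."""
    query = "SELECT region, AVG(amount) FROM loans GROUP BY region"
    return dict(conn.execute(query))


conn = build_demo_db()
print(average_by_extract(conn))
print(average_in_db(conn))
```

Both functions return the same averages, but the second moves one row per region out of the database instead of one row per loan, which is the replication cost the sandbox bullets refer to.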
Your Thoughts?
[Figure: loan processing steps. Axis labels only partly recoverable, including Income Verification, Employment History, Credit History, and Appraisal]
MODULE 1: INTRODUCTION TO BIG DATA
ANALYTICS
Lesson 3 (Week 2): The Data Scientist
SKILLS NEEDED IN THE NEW DATA ECOSYSTEM
Your Thoughts?
What new skill sets do you need to take advantage of the big data
sets in the loan processing improvement case study?
Note: Figures above reflect a projected talent gap in the US in 2018, as shown in the McKinsey May 2011 article Big Data: The next frontier for innovation, competition, and productivity
Data Scientists
• Reframe business challenges as analytics challenges
• Design, implement and deploy statistical models and data mining techniques on big data
• Create insights that lead to actionable recommendations
[Figure: roles and platform. Roles: Data Scientists, Data Engineers, Data Analyst, BI Analyst, LOB User, Data Platform Admin. Layers: Analytic Productivity Platform, Tools & Services, Infrastructure]
• Quantitative
• Technical
• Curious & Creative
• Skeptical
• Communicative & Collaborative
MODULE 1: INTRODUCTION TO BIG DATA
ANALYTICS
Lesson 4: Big Data Analytics in Industry Verticals
BIG DATA ANALYTICS: INDUSTRY EXAMPLES
1 Health Care
• Reducing Cost of Care
• Preventing Pandemics
3 Life Sciences
• Genomic Mapping
4 IT Infrastructure
• Unstructured Data Analysis
5 Online Services
• Social Media for Professionals
1
Use of Big Data
• Dr. Jeffrey Brenner generated his own crime maps from medical billing records of 3 hospitals
Key Outcomes
• City hospitals & ERs provided expensive, low-quality care
• Reduced hospital costs by 56% by realizing that 80% of the city’s medical costs came from 13% of its residents, mainly low-income or elderly
• Now offers preventative care over the phone or through home visits
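The "80% of costs from 13% of residents" finding is a cost-concentration calculation: sort residents by cost, then count how many are needed to reach a given share of the total. The sketch below shows that arithmetic on a tiny synthetic list. The function name and the `toy_costs` values are invented for illustration and are not the Camden billing data.

```python
def share_of_residents_for_cost_share(costs, target_share=0.80):
    """Smallest fraction of residents whose combined cost reaches target_share."""
    total = sum(costs)
    if total <= 0:
        raise ValueError("costs must sum to a positive amount")
    ordered = sorted(costs, reverse=True)  # most expensive residents first
    running = 0
    for count, cost in enumerate(ordered, start=1):
        running += cost
        if running / total >= target_share:
            return count / len(ordered)
    return 1.0


# Synthetic per-resident annual costs (hypothetical values, 10 residents).
toy_costs = [400, 400, 50, 50, 25, 25, 25, 25, 0, 0]
fraction = share_of_residents_for_cost_share(toy_costs)
print(f"{fraction:.0%} of residents account for 80% of costs")
```

On real billing records the same loop would run over one total per patient, which is why linking records across hospitals matters before the calculation can be done.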
Situation
• Broad Institute (MIT & Harvard) mapping the Human Genome
CHECK YOUR
KNOWLEDGE
Your Thoughts?