Introduction to Data Science
Dr. Ramakrishna Dantu
Associate Professor, BITS Pilani
Introduction to Data Science
Disclaimer and Acknowledgement
• The content for these slides has been obtained from books and various
other sources on the Internet
• I hereby acknowledge all the contributors for their material and inputs
• I have provided source information wherever necessary
• I have added and modified the content to suit the requirements of the
course
Introduction to Data Science
Module Topics
• Fundamentals of Data Science
Why Data Science
Defining Data Science
Data Science Process
• Real World applications
• Data Science vs BI
• Data Science vs Statistics
• Roles and responsibilities of a Data Scientist
• Software Engineering for Data Science
• Data Scientists Toolbox
• Data Science Challenges
Why Data Science?
Fundamentals of Data Science
Why Data Science?
• Let's look into this question from the following perspectives:
Data Science as a field
Various data science job roles
Market revenue
Skills
Source: https://analyticsindiamag.com/why‐a‐career‐in‐data‐science‐should‐be‐your‐prime‐focus‐in‐2020/
Fundamentals of Data Science
Why Data Science?
• Data Science as a field
"Data Science is the sexiest job in the 21st century"
‐‐IBM
Data Science is one of the fastest growing fields in the world
In 2019, there were about 2.9 million data science job openings globally
According to the U.S. Bureau of Labor Statistics, 11.5 million new jobs will be
created by the year 2026
Even with the COVID‐19 situation and the ongoing shortage of talent, data science as a career
option is unlikely to see a dip
Source: https://analyticsindiamag.com/why‐a‐career‐in‐data‐science‐should‐be‐your‐prime‐focus‐in‐2020/
Fundamentals of Data Science
Why Data Science?
• The Hottest Job Roles and Trends In Data Science 2020
The growth of data science as a career choice in 2020 is also driving the rise of its
various job roles
Data Engineer
Data Administrator
Machine Learning Engineer
Statistician
Data and Analytics Manager
In India, the average salary of a data scientist as of January 2020 is ₹10L/yr, which is
pretty attractive (Glassdoor, 2020)
Source: https://analyticsindiamag.com/why‐a‐career‐in‐data‐science‐should‐be‐your‐prime‐focus‐in‐2020/
Fundamentals of Data Science
Why Data Science?
• Market Revenue
In recent years, with the shift toward analytics and data science, market
revenue has increased
Analytics, data science, and big data industry in India generated about
$2.71 billion annually in 2018, and the revenue grew to
$3.03 billion in 2019
This 2019 figure is expected to nearly double by 2025
Source: https://analyticsindiamag.com/why‐a‐career‐in‐data‐science‐should‐be‐your‐prime‐focus‐in‐2020/
Fundamentals of Data Science
Most wanted Data Science Skills in 2019
• Survey Details
LinkedIn data scientist job openings
Survey conducted in the US in April
2019
24,697 respondents
76.13 percent of data scientist job
openings on LinkedIn required a
knowledge of the programming
language Python
Wider industry data may vary
Source: https://softwarestrategiesblog.com/2019/06/13/how‐to‐get‐your‐data‐scientist‐career‐started/
Data Science Defined
Fundamentals of Data Science
Data Science Defined
• There is no consistent definition of what constitutes data science
• Data Science is the study of data.
• Data Science is the art of uncovering useful patterns, connections, insights, and trends that
are hiding behind the data
• Data Science helps translate data into a story. The storytelling helps in uncovering
insights. The insights in turn help in making decisions or strategic choices
• Data Science involves extracting meaningful insights from any data
Requires a major effort of preparing, cleaning, scrubbing, or standardizing the data
Algorithms are then applied to crunch the pre‐processed data
This process is iterative and requires analysts’ awareness of the best practices
Automation of tasks allows us to focus on the most important aspect of data science:
Interpreting the results of the analysis in order to make decisions
Fundamentals of Data Science
Data Science Defined
• Data science is an inter‐disciplinary practice that draws from
Data engineering, statistics, data mining, machine learning, and predictive analytics
• Similar to operations research, data science focuses on…
Implementing data‐driven models and managing their outcomes
• To uncover nuggets of wisdom and actionable insights, Data science
combines the
Data‐driven approach of statistical data analysis
Computational power of the computer
Programming acumen of a computer scientist
Domain‐specific business intelligence
Source: Data Science using Python and R – Chantal Larose & Daniel Larose
Fundamentals of Data Science
Data Science Defined – Drew Conway's view
• Drew Conway is the CEO and founder of Alluvium
• He is a leading expert in the application of computational methods to large‐scale social and
behavioral problems
• Drew has been writing and speaking about the role of data — and the discipline
of data science — in industry, government, and academia for several years.
• Drew has advised and consulted companies across many industries:
Ranging from fledgling start‐ups to Fortune 100 companies
Academic institutions, and
Government agencies at all levels
• Drew started his career in counter‐terrorism as a computational social scientist
in the U.S. intelligence community
Source: http://drewconway.com
Fundamentals of Data Science
Data Science Defined ‐ Drew Conway's view
• Data science comprises three distinct
and overlapping skills:
Statistics – for modeling and summarizing
datasets
Computer science – for designing and using
algorithms to effectively store, process, and
visualize the data
Domain expertise – necessary both to
formulate the right questions and to put their
answers in context
[Figure: Drew Conway’s Data Science Venn Diagram]
Source: http://drewconway.com/zia/2013/3/26/the‐data‐science‐venn‐diagram
Fundamentals of Data Science
Data Science Defined ‐ Drew Conway's view
• Danger Zone
Here, people "know enough to be dangerous"; this is the most
problematic area of the diagram
People here are perfectly capable of extracting and structuring data in the
field they know quite a bit about
They even know enough R to run a linear regression and report the
coefficients
But they lack any understanding of what those coefficients mean
It is from this part of the diagram that the phrase "lies, damned lies, and
statistics" emanates
Because, either through ignorance or malice, this overlap of skills gives people the
ability to create what appears to be a legitimate analysis without any
understanding of how they got there or what they have created
As such, the danger zone is sparsely populated; however, it does not take
many to produce a lot of damage.
[Figure: Drew Conway’s Data Science Venn Diagram]
Source: http://drewconway.com/zia/2013/3/26/the‐data‐science‐venn‐diagram
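A minimal Python sketch (not from the source slides) of the point above: fitting a regression and printing coefficients is mechanically easy, while interpreting them correctly needs the statistics and domain expertise the diagram calls for. All data and variable names below are invented.

```python
# Reporting coefficients is trivial; interpreting them is the hard part.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
ad_spend = rng.uniform(10, 100, size=200)              # marketing spend (k$)
store_visits = 5 * ad_spend + rng.normal(0, 40, 200)   # correlated with spend
sales = 2.0 * ad_spend + 0.5 * store_visits + rng.normal(0, 10, 200)

X = np.column_stack([ad_spend, store_visits])
model = LinearRegression().fit(X, sales)
print("coefficients:", model.coef_, "intercept:", model.intercept_)
# ad_spend and store_visits are strongly correlated, so the individual
# coefficients are unstable and are not causal effects; recognizing that
# requires the statistics and domain knowledge the Venn diagram demands.
```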
Fundamentals of Data Science
Artificial Intelligence, Machine Learning, and Data Science
• Artificial Intelligence
AI involves making machines capable of mimicking human behavior, particularly cognitive
functions:
E.g., facial recognition, automated driving, sorting mail based on postal code
• Machine learning
Considered a sub‐field of or one of the tools of AI
Involves providing machines with the capability of
learning from experience
Experience for machines comes in the form of data
Data that is used to teach machines is called training data
• Data Science
Data science is the application of machine learning,
artificial intelligence, and other quantitative fields like
statistics, visualization, and mathematics
Source: Data Science, 2nd Edition by Bala Deshpande; Vijay Kotu
Data Science Process
Fundamentals of Data Science
Various Stages of Data Science Process
Source: Simplilearn ‐ https://www.youtube.com/watch?v=X3paOmcrTjQ
Fundamentals of Data Science
Phases of Data Science Process
• Problem Description
Clearly describe project objectives
Create a problem statement that can be solved using data science
• Data Preparation
Data cleaning/preparation is more labor‐intensive
• Exploratory Data Analysis
Gain insights into the data through graphical exploration
• Setup
Partition the data; balance the data, if needed
Establish a baseline model
• Modeling
The coding of the data science process
Apply various algorithms to uncover hidden relationships
• Evaluation
Determine whether your models are any good
Select the best performing model from a set
• Deployment
Deploy the model for real‐world problems
Source: Data Science Using Python & R by Chantal Larose and Daniel Larose
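Below is a minimal, hypothetical Python walk-through of these phases on synthetic data; the churn problem, column names, and models are illustrative assumptions, not part of the source material.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Problem description: predict whether a customer churns (binary label).
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "tenure_months": rng.integers(1, 60, 500),
    "monthly_spend": rng.uniform(10, 200, 500),
})
df["churn"] = (df["tenure_months"] < 12).astype(int)

# Data preparation: cleaning/standardizing would happen here.
df = df.dropna()

# Exploratory data analysis: quick numeric summaries (plots in practice).
print(df.describe())

# Setup: partition the data and establish a baseline model.
X_train, X_test, y_train, y_test = train_test_split(
    df[["tenure_months", "monthly_spend"]], df["churn"], random_state=0)
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)

# Modeling: apply an algorithm to uncover relationships.
model = LogisticRegression().fit(X_train, y_train)

# Evaluation: is the model any better than the baseline?
print("baseline accuracy:", accuracy_score(y_test, baseline.predict(X_test)))
print("model accuracy   :", accuracy_score(y_test, model.predict(X_test)))

# Deployment: the chosen model would then be served for real-world use.
```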
What is Data Science?
Key features and motivations of Data Science
• Extracting Meaningful Patterns
• Building Representative Models
• Combination of Statistics, Machine Learning, and Computing
• Learning Algorithms
• Associated Fields
Descriptive Statistics
Exploratory Visualization
Dimensional Slicing
Hypothesis Testing
Data Engineering
Business Intelligence
Source: Data Science, 2nd Edition, by Bala Deshpande; Vijay Kotu
Fundamentals of Data Science
Data Science Tasks and Examples
Classification
Description: Predict if a data point belongs to one of the predefined classes; the prediction is based on learning from a known dataset
Algorithms: Decision trees, neural networks, Bayesian models, induction rules, k‐nearest neighbors
Examples: Assigning voters into known buckets by political parties, e.g., soccer moms; bucketing new customers into one of the known customer groups

Regression
Description: Predict the numeric target label of a data point; the prediction is based on learning from a known dataset
Algorithms: Linear regression, logistic regression
Examples: Predicting the unemployment rate for the next year; estimating insurance premium

Anomaly Detection
Description: Predict if a data point is an outlier compared to other data points in the dataset
Algorithms: Distance‐based methods, density‐based local outlier factor (LOF)
Examples: Detecting fraudulent credit card transactions and network intrusion

Association Analysis
Description: Identify relationships within an item set based on transaction data
Algorithms: FP‐growth algorithm, a priori algorithm
Examples: Finding cross‐selling opportunities for a retailer based on transaction purchase history
The Local Outlier Factor (LOF) algorithm is an unsupervised anomaly detection method which
computes the local density deviation of a given data point with respect to its neighbors.
Source: Data Science, 2nd Edition, by Bala Deshpande; Vijay Kotu
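As a small illustration of the LOF method described above, the sketch below applies scikit-learn's LocalOutlierFactor to invented transaction amounts; the data and parameters are assumptions.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(1)
normal_txns = rng.normal(loc=50, scale=10, size=(200, 1))   # typical amounts
fraud_txns = np.array([[500.0], [750.0]])                    # obvious outliers
X = np.vstack([normal_txns, fraud_txns])

lof = LocalOutlierFactor(n_neighbors=20)   # unsupervised: no labels used
labels = lof.fit_predict(X)                # -1 = outlier, 1 = inlier
print("flagged as outliers:", X[labels == -1].ravel())
```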
Fundamentals of Data Science
Data Science Tasks and Examples
Time Series Forecasting
Description: Predict the value of the target variable for a future timeframe based on historical values
Algorithms: Exponential smoothing, autoregressive integrated moving average (ARIMA), regression
Examples: Sales forecasting; production forecasting; virtually any growth phenomenon that needs to be extrapolated

Clustering
Description: Identify natural clusters within the dataset based on inherent properties within the dataset
Algorithms: k‐Means, density‐based clustering (e.g., DBSCAN)
Examples: Finding customer segments in a company based on transaction, web, and customer call data

Recommendation Engines
Description: Predict the preference of an item for a user
Algorithms: Collaborative filtering, content‐based filtering, hybrid recommenders
Examples: Finding the top recommended movies for a user
Source: Data Science, 2nd Edition, by Bala Deshpande; Vijay Kotu
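A minimal sketch of the clustering row above: segmenting customers with k-means in scikit-learn on two invented features (the features and the choice of two segments are assumptions).

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(7)
# Two made-up customer groups: low spend / few visits vs. high spend / many visits
spend = np.concatenate([rng.normal(20, 5, 100), rng.normal(90, 10, 100)])
visits = np.concatenate([rng.normal(2, 1, 100), rng.normal(12, 2, 100)])
X = np.column_stack([spend, visits])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("cluster centres:\n", kmeans.cluster_centers_)
print("first 10 customer segments:", kmeans.labels_[:10])
```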
Fundamentals of Data Science
Disciplines involved in Data Science
Source: http://www.verival.com/DataScience
Real World
Data Science Applications
Real‐World Applications
Data Science Use Cases
Source: https://data‐flair.training/blogs/data‐science‐use‐cases/
Real‐World Applications
Facebook
• Social Analytics
Utilizes quantitative research to gain insights about the social interactions among
people
Makes use of deep learning, facial recognition, and text analysis
In facial recognition, it uses powerful neural networks to classify faces in the
photographs
In text analysis, it uses its own understanding engine called “DeepText” to
understand people’s interests and align photographs with text
It uses deep learning for targeted advertising
Using the insights gained from data, it clusters users based on their preferences and provides
them with the advertisements that appeal to them
Source: https://data‐flair.training/blogs/data‐science‐use‐cases/
Real‐World Applications
Amazon
• Improving E‐Commerce Experience
Personalized recommendation
Amazon heavily relies on predictive analytics (a personalized recommender system) to increase customer satisfaction
Amazon analyzes the purchase history of customers, other customer suggestions, and user ratings to recommend
products
Anticipatory shipping model
Uses big data for predicting the products that are most likely to be purchased by its users
Analyzes patterns of customer purchases and keeps products in the nearest warehouse which the customers may utilize
in the future
Price discounts
Using parameters such as the user activity, order history, prices offered by the competitors, product availability, etc.,
Amazon provides discounts on popular items and earns profits on less popular items.
Fraud Detection
Amazon has its own novel ways and algorithms to detect fraud sellers and fraudulent purchases
Improving Packaging Efficiency
Amazon optimizes packaging of products in warehouses and increases efficiency of packaging lines through the data
collected from the workers
Source: https://data‐flair.training/blogs/data‐science‐use‐cases/
Real‐World Applications
Uber
• Improving Rider Experience
Uber maintains a large database of drivers, customers, and several other records
Makes extensive use of Big Data and crowdsourcing to derive insights and provide the best
services to its customers
Dynamic pricing
The concept is rooted in Big Data and data science: fares are calculated based on several parameters
Uber matches a customer’s profile with the most suitable driver and charges based on the time it
takes to cover the distance rather than the distance itself
o The time of travel is calculated using algorithms that make use of data related to traffic density and
weather conditions
When demand (riders) is higher than supply (available drivers), the price of the ride goes up
When the demand for Uber rides is lower, Uber charges a lower rate (a toy illustration of this idea follows)
Source: https://data‐flair.training/blogs/data‐science‐use‐cases/
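Uber's actual pricing model is proprietary; the toy function below only illustrates the demand/supply idea described above, with invented rates and caps.

```python
# Purely illustrative: a made-up surge multiplier driven by the rider/driver
# ratio, and a fare based on estimated travel time (not Uber's real model).
def estimated_fare(base_rate_per_min, est_minutes, riders_waiting, drivers_free):
    ratio = riders_waiting / max(drivers_free, 1)
    surge = min(1.0 + 0.25 * max(ratio - 1.0, 0.0), 3.0)   # capped multiplier
    return base_rate_per_min * est_minutes * surge

print(estimated_fare(2.0, 20, riders_waiting=50, drivers_free=50))   # no surge
print(estimated_fare(2.0, 20, riders_waiting=150, drivers_free=50))  # surge applied
```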
Real‐World Applications
Bank of America and other Banks
• Improving Customer Experience
Erica – a virtual financial assistant (BoA)
Considered one of the finest innovations in the finance domain, Erica serves as a customer advisor to over 45 million
users around the world
Erica makes use of Speech Recognition to take customer inputs, which is a technological advancement in the field of
Data Science.
Fraud detection
Uses data science and predictive analytics to detect frauds in payments, insurance, credit cards, and customer
information
Banks employ data scientists to use their quantitative knowledge where they apply algorithms like association,
clustering, forecasting, and classification.
Risk modeling
Banks use data science for risk modeling to regulate financial activities
Customer segmentation
Using various data‐mining techniques, banks are able to segment their customers in the high‐value and low‐value
segments
Data scientists make use of clustering, logistic regression, and decision trees to help banks understand the Customer
Lifetime Value (CLV) and group customers into the appropriate segments
Source: https://data‐flair.training/blogs/data‐science‐use‐cases/
Real‐World Applications
Airbnb
• Providing better search results
Airbnb uses massive amounts of data on customers and hosts, homestays and lodge records, as
well as website traffic
Uses data science to provide better search results to its customers and find compatible hosts
• Detecting bounce rates
Airbnb makes use of demographic analytics to analyze bounce rates from their websites
In 2014, Airbnb found that users in certain countries would click the neighborhood link, browse the
page and photos and not make any booking
To mitigate this issue, Airbnb released a different version in those countries and replaced
neighborhood links with the top travel destinations
This resulted in a 10% improvement in the lift rate for those users
• Providing ideal lodgings and localities
Airbnb uses knowledge graphs where the user’s preferences are matched with the various
parameters to provide ideal lodgings and localities
Source: https://data‐flair.training/blogs/data‐science‐use‐cases/
Real‐World Applications
Spotify
• Providing better music streaming experience
Spotify uses Data Science and leverages big data for providing personalized music recommendations
It has over 100 million users and uses a massive amount of big data
Uses over 600 GBs of daily data generated by the users to build its algorithms to boost user experience
• Improving experience for artists and managers
Spotify for Artists application allows the artists and managers to analyze their streams, fan approval and
the hits they are generating through Spotify’s playlists
• Others
Spotify uses data science to gain insights about which universities had the highest percentage of party
playlists and which ones spent the most time on it
"Spotify Insights" publishes information about the ongoing trends in the music
Spotify's Niland, an API based product, uses machine learning to provide better searches and
recommendations to its users.
Spotify analyzes listening habits of its users to predict the Grammy Award Winners
In the year 2013, Spotify made 4 correct predictions out of 6
Source: https://data‐flair.training/blogs/data‐science‐use‐cases/
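A tiny user-based collaborative-filtering sketch of the recommendation idea above, using cosine similarity on an invented listening matrix; real services such as Spotify use far richer data and models.

```python
import numpy as np

# rows = users, columns = tracks; 0 means "not listened to yet" (invented data)
ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
], dtype=float)

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-9)

target = 0                                      # recommend for user 0
sims = np.array([cosine(ratings[target], ratings[u]) for u in range(len(ratings))])
sims[target] = 0                                # ignore self-similarity
scores = sims @ ratings                         # similarity-weighted ratings
scores[ratings[target] > 0] = -np.inf           # drop already-rated tracks
print("recommended track index:", int(np.argmax(scores)))
```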
Real‐World Applications
Real World Examples
• Anomaly detection
Fraud, disease, crime, etc.
• Automation and decision‐making
Background checks, credit worthiness, etc.
• Classifications
In an email server, this could mean classifying emails as “important” or “junk”
• Forecasting
Sales, revenue and customer retention
• Pattern detection
Weather patterns, financial market patterns, etc.
• Recognition
Facial, voice, text, etc.
• Recommendations
Based on learned preferences, recommendation engines can refer you to movies, restaurants and books you may like
Real‐World Applications
Real World Examples
• Sales and Marketing
Google AdSense collects data from internet users so relevant commercial messages
can be matched to the person browsing the internet
MaxPoint (http://maxpoint.com/us) is another example of real‐time personalized
advertising
• Customer Relationship Management
Commercial companies in almost every industry use data science to gain insights
into their customers, processes, staff, competition, and products
Many companies use data science to offer customers a better user experience, as
well as to cross‐sell, up‐sell, and personalize their offerings
Source: https://data‐flair.training/blogs/data‐science‐applications/
Data Science Vs. Business
Intelligence
Data Science Vs. Business Intelligence
Comparing Data Science and BI
• Business Intelligence
What happened last quarter?
How many units sold?
Where is the problem?
In which situations?
• Data Science
Why is this happening?
What if…
What would be an optimal scenario
for our business?
What will happen next?
What if these trends continue?
Source: https://www.datasciencecentral.com/profiles/blogs/updated‐difference‐between‐business‐intelligence‐and‐data‐science
Data Science Vs. Business Intelligence
Comparing Data Science and BI
Each row lists Business Intelligence first, then Data Science:
Perspective: Past and Present | Forward Looking
Involves Statistics: Yes | Yes
Analysis: Descriptive, Comparative | Explorative, Diagnostic, Predictive
Data: Usually structured (SQL); Data Warehouse; New data, same analysis | Less structured (logs, cloud data, text, NoSQL, SQL); Distributed data; Same data, new analysis
Scope: General | Specific to business questions
Tools: Statistics, Visualization | Statistics, Machine Learning, Graph Analysis, NLP
Expertise: Business Analyst | Data Scientist
Deliverable: Tabular, Charts | Visualization, Insight, Story
Data Science Vs. Business Intelligence
BI Analyst and Data Scientist Characteristics
Each row lists the BI Analyst first, then the Data Scientist:
Focus: Reports, KPIs, Trends | Patterns, Correlations, Models
Process: Static, Comparative | Exploratory, Experimentation, Visual
Data Sources: Pre‐planned, added slowly | On the fly, as needed
Transformation: Up front, carefully planned | In‐database, on‐demand, enrichment
Data Quality: Single version of truth | "Good enough", probabilities
Analytics: Retrospective, Descriptive | Predictive, Prescriptive
Table shows the different attitudinal approaches for each
Source: https://www.datasciencecentral.com/profiles/blogs/updated‐difference‐between‐business‐intelligence‐and‐data‐science
Data Science Vs. Business Intelligence
Comparing Data Science and Statistics
Each row lists Data Science first, then Statistics:
Type of problem: Semi‐structured or unstructured | Well structured
Analysis objective: Need not be well defined | Well‐formed objective
Type of analysis: Explorative | Confirmative
Source: https://www.datasciencecentral.com
Roles and Responsibilities of a Data
Scientist
R & R of a Data Scientist
Who is a Data Scientist?
• One source writes this…
In fact, some data scientists are — for all practical purposes — statisticians, while others are pretty much
indistinguishable from software engineers
Some are machine‐learning experts, while others couldn’t machine‐learn their way out of kindergarten
Some are PhDs with impressive publication records, while others have never read an academic paper
(shame on them, though)
No matter how you define data science, you’ll find practitioners for whom the definition is totally,
absolutely wrong
• Nonetheless, a data scientist is someone who extracts insights from messy data
• A data scientist is responsible for guiding a data science project from start to finish
• Success in a data science project comes not just from any one tool, but from having
quantifiable goals, good methodology, cross‐discipline interactions, and a repeatable
workflow
Source: Data Science from Scratch, by Joel Grus
R & R of a Data Scientist
Who is a Data Scientist?
• Skills Required for a Data Scientist
Math
Statistics
Computer science
Machine learning
Domain expertise
Data visualization
Communication and presentation skills
Source: https://www.northeastern.edu/graduate/blog/what‐does‐a‐data‐scientist‐do/
R & R of a Data Scientist
Data Scientist Job Titles
• The most common careers in data science include the following roles:
Data scientists:
Design data modeling processes to create algorithms and predictive models and perform
custom analysis.
Data analysts:
Manipulate large data sets and use them to identify trends and reach meaningful conclusions
to inform strategic business decisions.
Data engineers:
Clean, aggregate, and organize data from disparate sources and transfer it to data warehouses.
Business intelligence specialists:
Identify trends in data sets.
Data architects:
Design, create, and manage an organization’s data architecture.
Source: https://www.northeastern.edu/graduate/blog/what‐does‐a‐data‐scientist‐do/
Software Engineering for Data
Science
Software Engineering for Data Science
Software Engineering
• Software engineering is an engineering discipline that is concerned
with all aspects of software production
From early stages of system specification through to maintaining the system after it
has gone into production
• Software includes computer programs, all associated documentation,
and configuration data that are needed for software to work correctly
• Software engineers are concerned with developing software
products. That is,
Generic products that are stand‐alone systems
Products that are customized for a particular customer
Source: Software Engineering by Ian Sommerville
Software Engineering for Data Science
Software Development Process
• Waterfall model
Views development activities as strictly defined steps or states
Each step cannot begin until the previous step is completed and documented
No phase is considered complete until its documentation is approved
The model is tightly tied to project management phases
Phases: Gathering Requirements → Systems Analysis → Systems Design → Development → Testing
• Iterative model
The whole project is developed in multiple iterations
Each iteration implements a selected set of requirements or functionality
The iterative process groups requirements into layers or cycles
Within each iteration, the development goes through a complete life cycle: analysis, design, coding, and testing
Cycle: Design → Development → Testing, repeated each iteration
[Figure: the Software Engineering Process (Gathering Requirements → Systems Analysis → Systems Design → Development → Testing) shown alongside the Data Science Process (Problem Description → Data Preparation → Exploratory Data Analysis → Setup → Modeling → Evaluation → Deployment)]
Software Engineering for Data Science
Data Science Vs. Software Engineering
Each point lists Software Engineering first, then Data Science:
• Software engineering focuses on creating software that serves a specific purpose | Data science involves analyzing huge amounts of data, with some aspects of programming and development
• Uses a methodology involving various phases, beginning from requirements specification through software deployment into production | Uses a methodology involving various phases, beginning from requirements specification through model deployment
• Consider software such as the Chrome or Firefox browser, developed using software engineering principles | You use the browser to search for, let's say, "Data Science Bootcamp"; the results displayed are probably based on an algorithm developed using data science
If you wish to find a restaurant, you would use your phone assistant (developed by a software engineer) to find
one. Then, an algorithm (developed by a data scientist) searches and finds restaurants close to your location and
displays the results in a map application (developed by a software engineer) that tells you exactly how to get there
Source: https://careerkarma.com/blog/data‐science‐vs‐software‐engineering/
Software Engineering for Data Science
Data Science Software Engineering
• Involves collecting and analyzing • Is concerned with creating useful
data applications
• Data scientists utilize the ETL process • Software engineers use the SDLC
process
• Is more process‐oriented
• Uses frameworks like Waterfall,
• Data scientists use tools like Amazon Agile, and Spiral
S3, MongoDB, Hadoop, and MySQL • Software engineers use tools like
• Skills include machine learning, Rails, Django, Flask, and Vue.js
statistics, and data visualization • Skills are focused on coding
languages
Source: https://careerkarma.com/blog/data‐science‐vs‐software‐engineering/
Data Scientist's Toolbox
Data Scientist's Toolbox
Tools Used by Data Scientists for Data Analysis
• Java, R, Python, Clojure, Haskell, Scala…
• Hadoop, HDFS & MapReduce, Spark, Storm…
• HBase, Pig, Hive, Shark, Impala, Cascalog…
• ETL, web scrapers, Flume, Sqoop, Hue…
• SQL, RDBMS, DW, OLAP…
• Knime, Weka, RapidMiner, SciPy, NumPy, scikit‐learn, pandas…
• D3.js, Gephi, ggplot2, Tableau, Flare, Shiny…
• SPSS, Matlab, SAS…
• NoSQL, MongoDB, Couchbase, Cassandra…
• And yes! … MS‐Excel: the most used, most underrated DS tool.
Source: http://makemeanalyst.com/data‐scientist‐toolkits/
Data Science Challenges
Data Science Challenges
Broad Categorization of Data Science Challenges
• By reading related material from various sources, data science
challenges can be categorized as:
Data related
Organization related
Technology related
People related
Skill related
Can you think of anything else?
Data Science Challenges
Challenges Faced by Data Professionals
• Survey of over 16,000 data professionals
• The question was…
• “At work, which barriers or challenges have
you faced this past year? (Select all that
apply).”
• The top 10 challenges were:
Dirty data (36%)
Lack of data science talent (30%)
Company politics (27%)
Lack of clear question (22%)
Data inaccessible (22%)
Results not used by decision makers (18%)
Explaining data science to others (16%)
Privacy issues (14%)
Lack of domain expertise (14%)
Organization small and cannot afford data
science team (13%)
Source: https://businessoverbroadway.com/2018/03/18/top‐10‐challenges‐to‐practicing‐data‐science‐at‐work/
Data Science Challenges
Challenges Faced by Data Professionals
• A principal component analysis of the 20 challenges resulted in five categories
Insights not Used in Decision Making
These challenges include company politics, an inability to integrate study findings into decision‐making
processes and lack of management support.
Data Privacy, Veracity, Unavailability
These challenges revolved around the data itself, including how “dirty” it is, its availability as well as privacy
issues.
Limitations of tools to scale / deploy
Challenges in this category are related to the tools that are used to extract insights, deploy models as well as
scaling solutions up to the full database.
Lack of Funds
Challenges around lack of funding impact what the organization can purchase with respect to external data
sources, data science talent and, perhaps, domain expertise.
Wrong Questions Asked
Challenges are about the difficulty in maintaining expectations about the impact of data science projects and not
having a clear question to answer or a clear direction to go in with the available data
Source: https://businessoverbroadway.com/2018/03/18/top‐10‐challenges‐to‐practicing‐data‐science‐at‐work/
Data Science Challenges
Data Science Challenges
• Insights not Used in Decision Making
Company politics
Inability to integrate study findings into decision‐making
processes
Lack of management support.
• Data Privacy, Veracity, Unavailability
“Dirty” data
Lack of availability
Privacy issues
• Limitations of tools to scale / deploy
Challenges in this category are related to the tools that are
used to extract insights, deploy models as well as scaling
solutions up to the full database.
• Lack of Funds
Challenges around lack of funding impact what the organization
can purchase with respect to external data sources, data
science talent and, perhaps, domain expertise.
• Wrong Questions Asked
Challenges are about the difficulty in maintaining expectations
about the impact of data science projects and not having a
clear question to answer or a clear direction to go in with the
available data
Source: https://businessoverbroadway.com/2018/03/18/top‐10‐challenges‐to‐practicing‐data‐science‐at‐work/
Data Science Challenges
Data Science Challenges
• Unreasonable Management Expectations
• Misunderstanding Data’s Significance
• Being Blamed for Bad News
• Having to Convince Management
• Lack of Professionals
• Problem Identification
• Accessing the Right Data
• Cleansing of the Data
Source: https://www.dataquest.io/blog/data‐science‐problems‐fix/
Data Science Challenges
Data Science Challenges
• Complexity of reality
• Cognitive bias
• Data Quality
Accuracy
Completeness
Uniqueness
Timeliness
Consistency
• Content and Source Bias
• Granularity of Data
In terms of time and space
• Availability & Access of Data
Data Divide
Source: Coursera
Source: https://dimensionless.in/challenges‐faced‐by‐data‐scientist‐and‐how‐to‐overcome‐them/
Data Science Challenges
Cognitive Bias
• The way we see the world (events, facts, people, etc.,) is based on our own
belief system and experiences and may not be reasonable or accurate
• Cognitive biases are ways of thinking about and perceiving the world that
may not necessarily reflect reality
• We may think we experience the world around us with perfect objectivity,
but this is rarely the case
• Each of us sees things differently based on our preconceptions, past
experiences, cultural, environmental, and social factors
This doesn’t necessarily mean that the way we think or feel about something is truly
representative of reality
• Simply put, cognitive biases are the distortions of reality because of the lens
through which we view the world
Source: https://www.wordstream.com/blog/ws/2014/05/22/landing‐pages‐cognitive‐biases
Data Science Challenges
Types of Cognitive Biases
• Selection (or sample) Bias
• Seasonal Bias
• Linearity Bias
• Confirmation Bias
• Recall Bias
• Survivor Bias
• Observer Bias
• Reinforcement Bias
• Availability Bias
Source: https://www.wordstream.com/blog/ws/2014/05/22/landing‐pages‐cognitive‐biases
Data Science Challenges
Cognitive Bias
• MONEYBALL ‐ breaking biases ‐ FREQUENCY BASED PROBABILITY /
STATISTICS ‐ MATHEMATICS in the MOVIES
https://www.youtube.com/watch?v=KWPhV6PUr9o&ab_channel=Movieclips
• Types of cognitive bias
https://www.youtube.com/watch?v=wEwGBIr_RIw&ab_channel=PracticalPsychol
ogy
• Graphs can be misleading
https://www.youtube.com/watch?v=E91bGT9BjYk&ab_channel=TED‐Ed
• Simpson's paradox
https://www.youtube.com/watch?v=sxYrzzy3cq8&ab_channel=TED‐Ed
Source: https://www.wordstream.com/blog/ws/2014/05/22/landing‐pages‐cognitive‐biases
Thank You!
Introduction to Data Science
Data Analytics
Dr. Ramakrishna Dantu
Associate Professor, BITS Pilani
Introduction to Data Science
Disclaimer and Acknowledgement
• The content for these slides has been obtained from books and various
other sources on the Internet
• I hereby acknowledge all the contributors for their material and inputs
• I have provided source information wherever necessary
• I have added and modified the content to suit the requirements of the
course
Introduction to Data Science
Data Analytics Module Topics
• Defining Analytics
• Types of data analytics
Descriptive, Diagnostic
Predictive, Prescriptive
• Data Analytics – methodologies
CRISP‐DM Methodology
SEMMA
BIG DATA LIFE CYCLE
SMAM
• Analytics Capacity Building
• Challenges in Data‐driven decision making
What is Analytics?
Defining Analytics
What is Analytics?
What route should I follow?
How to retain customers?
What home should I buy?
What stock should I purchase?
What treatment should I recommend?
Defining Analytics
What is Analytics? – Dictionary meaning
• It can be used as a noun or verb.
Is Analytics the name of your department, or do you actually "do" Analytics?
Analytics can be a noun. "He is the HOD of Business Analytics department!"
Analytics can also be a verb. "I applied Analytics to the data to find patterns"
• Let's look at the dictionary meaning first
Analytics is the systematic computational analysis of data or statistics
‐‐ Oxford
Analytics is a process in which a computer examines information using mathematical
methods in order to find useful patterns
‐‐ Cambridge
Analytics is the analysis of data, typically large sets of business data, by the use of
mathematics, statistics, and computer software
‐‐ dictionary.com
Source: http://the‐data‐guy.blogspot.com/2016/10/is‐analytics‐noun‐or‐verb.html
Defining Analytics
What is Analytics? – From websites
• It is the process of discovering, interpreting, and communicating significant
patterns in data and using tools to empower your entire organization to ask any
question of any data in any environment on any device
‐‐Oracle
• Data Analytics refers to the techniques used to analyze data to enhance
productivity and business gain
‐‐ Edureka
• Data analytics is the pursuit of extracting meaning from raw data using
specialized computer systems
‐‐ Informatica
• I couldn’t find a definition from popular websites such as Google, IBM,
Microsoft, Amazon, etc.,
Source: Internet
Defining Analytics
What is Analytics?
• Analytics is the process of extracting and creating information from
raw data by using techniques such as:
filtering, processing, categorizing, condensing and contextualizing the data
• Analytics is a broad term that encompasses the processes,
technologies, frameworks and algorithms to extract meaningful
insights from data
• This information thus obtained is then used to infer knowledge about
the system and/or its users, and its operations to make the systems
smarter and more efficient
Source: Big Data Analytics – A Hands‐on Approach by Arshdeep Bahga & Vijay Madisetti
Characteristics of Big Data
Characteristics of Big Data (5Vs)
Volume
• Big data is so large that it would not fit on a single machine
• Specialized tools and frameworks are required to store, process and analyze such data
• For example: social media applications process billions of messages every day, IoT can generate terabytes of sensor data every day, etc.
• The data generated by modern IT, industrial, healthcare, Internet of Things, and other systems is growing exponentially
• Cost of storage and processing architectures is going down
• There is a need to extract valuable insights from the data to improve business processes, efficiency and service to consumers
• Though there is no fixed threshold for the volume of data to be considered as big data, the term is typically used for massive‐scale data that is difficult to store, manage and process using traditional databases and data processing architectures
Velocity
• Specialized tools are required to ingest such high‐velocity data into the big data infrastructure and analyze the data in real time
[Figure: the five Vs of Big Data – Volume, Velocity, Variety, Veracity, Value]
Source: Big Data Analytics – A Hands‐on Approach by Arshdeep Bahga & Vijay Madisetti
Characteristics of Big Data (5Vs)
Variety & Veracity
• Variety
Variety refers to different forms of the data
Big data comes in different forms such as structured, unstructured or semi‐structured, including text data, image, audio, video and sensor data
Big data systems need to be flexible enough to handle such variety of data
• Veracity
Veracity refers to how accurate the data is
To extract value from the data, the data needs to be cleaned to remove noise
Data‐driven applications can reap the benefits of big data only when the data is meaningful and accurate
Therefore, cleansing of data is important so that incorrect and faulty data can be filtered out
• Value
The end goal of any big data analytics system is to extract value from the data
The value of the data is also related to the veracity or accuracy of the data
For some applications value also depends on how fast we are able to process the data
Source: https://blogs.gartner.com/jason‐mcnellis/2019/11/05/youre‐likely‐investing‐lot‐marketing‐analytics‐getting‐right‐insights/
Types of Analytics
Types of analytics according to the objective
• Descriptive Analytics – What is happening?
Involves retrospective analysis of what happened using historical data
Data mining methods help uncover patterns that offer insight
Lays the foundation for diagnosis and further analysis of what might happen in the future
• Diagnostic Analytics – Why did it happen?
Looks for the root cause of a problem
Used to determine why something happened
Attempts to find and understand the causes of events and behaviors
• Predictive Analytics – What is likely to happen?
Uses past data and external sources to predict the future
Involves forecasting
Uses data mining and AI techniques to analyze current data and develop future scenarios
• Prescriptive Analytics – What should be done?
Dedicated to finding the right action to be taken
Descriptive analytics provides a historical perspective, and predictive analytics helps forecast what might happen; prescriptive analytics uses these parameters to find the best solution
(A small descriptive vs. predictive sketch follows below.)
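A minimal pandas/NumPy contrast between the descriptive and predictive questions above, on an invented monthly sales series (diagnostic and prescriptive steps would need richer data and business rules).

```python
import numpy as np
import pandas as pd

sales = pd.Series([100, 110, 120, 135, 150, 160],
                  index=pd.period_range("2023-01", periods=6, freq="M"))

# Descriptive: what happened?
print("average monthly sales:", sales.mean())
print("month-over-month growth:\n", sales.pct_change())

# Predictive: what is likely to happen next month? (naive linear trend)
x = np.arange(len(sales))
slope, intercept = np.polyfit(x, sales.values, 1)
print("forecast for next month:", slope * len(sales) + intercept)
```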
Types of Analytics
Next level of Analytics
• Cognitive Computing
Human cognition is based on the context and reasoning
Cognitive systems mimic how humans reason and process
Cognitive systems analyze information and draw inferences using probability
They continuously learn from data and
reprogram themselves
According to one source
"The essential distinction between cognitive
platforms and artificial intelligence systems is that:
o you want an AI to do something for you
o a cognitive platform is something you turn to for
collaboration or for advice"
Source: https://interestingengineering.com/cognitive‐computing‐more‐human‐than‐artificial‐intelligence
Image Source: https://towardsdatascience.com/cognitive‐computing‐what‐can‐it‐be‐used‐for‐8af4721928f5
Types of Analytics
Next level of Analytics
• Cognitive Analytics – What Don't I Know?
Involves Semantics, AI, Machine learning, Deep Learning, Natural Language Processing, and Neural
Networks
Simulates human thought process to learn from the data and extract the hidden patterns from data
Uses all types of data: audio, video, text, images in the analytics process
Although this is the top tier of analytics maturity,
Cognitive Analytics can be used in the prior levels
"It extends the analytics journey to areas that were
unreachable with more classical analytics techniques like
business intelligence, statistics, and operations research."
‐‐ Jean Francois Puget
Applications: https://www.gflesch.com/elevity‐it‐blog/healthcare‐retail‐finance‐are‐leveraging‐cognitive‐computing Source: https://www.ecapitaladvisors.com/blog/analytics‐maturity/
https://www.xenonstack.com/insights/what‐is‐cognitive‐analytics/
Types of Analytics
Application areas of Cognitive Analytics
• Healthcare, Financial Services, Supply Chains, Customer Relationship
Management, Cybersecurity, etc.,.
• Chatbots:
Programs that can simulate a human conversation by understanding the communication in
a contextual sense
Chatbots use the machine learning technique called natural language processing
Natural language processing allows programs to take inputs from humans (voice or text),
analyze it and then provide logical answers
Cognitive computing enables chatbots to have a certain level of intelligence in
communication
E.g., understanding user’s needs based on past communication, giving suggestions, etc.
Source: https://www.newgenapps.com/blog/what‐is‐cognitive‐computing‐applications‐companies‐artificial‐intelligence/
Types of Analytics
Application areas of Cognitive Analytics
• Sentiment analysis:
It is the science of understanding emotions conveyed in a communication
While it is easy for humans to understand tone, intent, etc. in a conversation, it is far
more complicated for machines
To enable machines to understand human communication, a training dataset of human
conversations is used, and the accuracy of the resulting analysis is then evaluated
Sentiment analysis is popularly used to analyze social media communications like
tweets, comments, reviews, complaints, etc. (a minimal sketch follows below)
Source: https://www.newgenapps.com/blog/what‐is‐cognitive‐computing‐applications‐companies‐artificial‐intelligence/
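A minimal sentiment-classification sketch of the idea above, training on a few invented labelled sentences with scikit-learn; production systems use far larger conversation datasets.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ["great product, love it", "terrible service, very slow",
         "happy with the support", "awful experience, never again"]
labels = [1, 0, 1, 0]                     # 1 = positive, 0 = negative (made up)

clf = make_pipeline(CountVectorizer(), LogisticRegression())
clf.fit(texts, labels)
print(clf.predict(["the support team was great"]))   # expected: [1]
```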
Types of Analytics
Major Players in Cognitive Computing
• SparkCognition
• Expert System
• Microsoft Cognitive Services
• IBM Watson
• Numenta
• Google DeepMind
• Cisco Cognitive Threat Analytics
• CognitiveScale
• CustomerMatrix
• HPE Haven OnDemand
Source: https://www.newgenapps.com/blog/what‐is‐cognitive‐computing‐applications‐companies‐artificial‐intelligence/
Types of Analytics
Level of automation in different types of analytics
Source: Gartner
Analytics Vs. Data Science
Types of Analytics
How Analytics is Connected to Data Science
Image Source: Simplilearn ‐ https://www.youtube.com/watch?v=X3paOmcrTjQ
Image Source: https://thedatascientist.com/why‐now‐is‐the‐right‐time‐for‐sports‐analytics/
Types of Analytics
How Analytics is Connected to Data Science
Source: Simplilearn ‐ https://www.youtube.com/watch?v=X3paOmcrTjQ
Types of Analytics
Types of analytics according to the domain
• Marketing Analytics
• Financial Analytics
• Healthcare Analytics
• Sports Analytics
• HR Analytics
• Customer Analytics
• Web Analytics
• Social Analytics
• Cultural Analytics
• Political Analytics
Image Source: https://www.dreamstime.com/stock‐photo‐diverse‐people‐social‐networking‐concepts‐image44309613
Types of Analytics
Types of analytics according to the type of data
• Text analytics
• Real‐time data analytics
• Multimedia analytics
• Geo analytics
• Mobile analytics
Types of Analytics
Computational Tasks or "giants"
• The National Research Council characterized computational tasks for massive
data analysis into seven groups (aka "giants")
• These groups include:
(1) Basic Statistics
(2) Generalized N‐Body Problems
(3) Linear Algebraic Computations
(4) Graph‐Theoretic Computations
(5) Optimization
(6) Integration and
(7) Alignment
• This grouping of computational tasks provides a taxonomy of tasks that is useful
in analyzing data according to mathematical structure and computational
strategy
Source: Big Data Analytics – A Hands‐on Approach by Arshdeep Bahga & Vijay Madisetti
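A quick NumPy illustration of two of the "giants" above on random data: basic statistics and linear algebraic computations (the data is synthetic and the choice of examples is an assumption).

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))            # 1000 observations, 5 variables

# Basic statistics: means, variances, correlations
print("column means:", X.mean(axis=0))
print("correlation matrix shape:", np.corrcoef(X, rowvar=False).shape)

# Linear algebraic computations: e.g., PCA via the SVD of centered data
U, s, Vt = np.linalg.svd(X - X.mean(axis=0), full_matrices=False)
print("variance explained by first component:", (s[0]**2) / (s**2).sum())
```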
Types of Analytics
Mapping between types of analytics and computational tasks or 'giants'
Descriptive Analytics (What happened): Reports, Alerts
Diagnostic Analytics (Why did it happen): Queries, Data Mining
Predictive Analytics (What is likely to happen): Forecasts, Simulations
Prescriptive Analytics (What can we do to make it happen): Reports, Alerts
Source: https://avasant.com/report/intelligent‐automation‐use‐cases/
Types of Analytics
Analytics use cases
Enterprise process‐specific use cases
• Customer Service, Finance & Accounting, Human Resources, Marketing & Sales, Procurement
IT process‐specific use cases
• IT Applications: application code generation, application configuration, application performance management, application self‐healing, rapid prototyping, error detection and remedy management, release and deployment support, automatic code refactoring, test case generation, application development
• IT Infrastructure & Operations: auto resolution of tickets, endpoint preemptive monitoring, autonomous cybersecurity, predictive analytics‐led server management, predictive maintenance, threat detection, log analysis, capacity planning, predictive infrastructure scaling, infrastructure cost management
Source: https://avasant.com/report/intelligent‐automation‐use‐cases/
Types of Analytics
Applications – Descriptive Analytics
• Telecom company finding current trend of customers
• Online retail company finding number of site visits for tickets
• Healthcare provider learns how many patients were hospitalized last month
• Retailer – the average weekly sales volume
• Manufacturer – a rate of the products returned for a past month, etc.
• A placement agency – number of candidates being placed to top companies,
candidates actually joining, candidates who accept but do not join etc.
• Consultancy firm – Doing salary analysis of different work streams, stages,
industries
Source: https://www.govtech.com/data/Role‐of‐Data‐Analytics‐in‐Predictive‐Policing.html
Predictive Analytics – Case Study
Predictive Policing – Seeing the crime before it happens
• At the start of each shift, crime data fed
into the PredPol system provides
officers with 15 different zones for four
types of crime:
Auto theft, vehicle burglary, burglary and
gang‐related activity
• Each zone covers an area of 500 feet by 500 feet
Source: https://www.govtech.com/data/Role‐of‐Data‐Analytics‐in‐Predictive‐Policing.html
Predictive Analytics – Case Study
Predictive Policing – PredPol algorithm
• PredPol uses a machine‐learning algorithm to make its predictions
• Historical event datasets are used to train the algorithm for each new city (ideally 2 to 5 years of data)
• It then updates the algorithm each day with new events as they are received from the department
• This new information comes from the agency’s records management system (RMS).
• PredPol uses ONLY 3 data points – crime type, crime location, and crime date/time – to create its predictions
• No personally identifiable information is ever used
• No demographic, ethnic or socio‐economic information is ever used
• This eliminates the possibility for privacy or civil rights violations seen with other intelligence‐led policing
models
• Predictions are displayed as red boxes (covering roughly 150 x 150 meters) on a web interface via Google Maps
• The boxes represent the highest‐risk areas for each day and for the corresponding shift: day, swing or night
shift
• Officers are instructed to spend roughly 10% of their shift time patrolling PredPol boxes
Source: https://www.govtech.com/data/Role‐of‐Data‐Analytics‐in‐Predictive‐Policing.html
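PredPol's own model is proprietary; the sketch below only illustrates the simpler idea of aggregating recent events (type, location, date/time) into grid cells to surface the highest-risk boxes, using invented coordinates.

```python
# Not PredPol's algorithm: a toy grid aggregation of recent crime events.
from collections import Counter

# (crime_type, x_metres, y_metres, iso_datetime) -- invented events
events = [
    ("burglary",     120,  80, "2020-01-03T22:10"),
    ("auto theft",   130,  95, "2020-01-04T01:40"),
    ("burglary",     140,  70, "2020-01-04T23:05"),
    ("gang-related", 900, 640, "2020-01-05T20:30"),
]

CELL = 150  # metres, mirroring the ~150 x 150 m boxes mentioned above
counts = Counter((int(x // CELL), int(y // CELL)) for _, x, y, _ in events)
print("highest-count cells:", counts.most_common(2))
```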
Predictive Analytics – Case Study
Predictive Policing – Predicting crime hot spots
• Descriptive and Diagnostic
Analytics
Software searches for patterns
and correlations in past crime
data
• Predictive Analytics
Algorithm predicts when/where
crime is likely to happen in the
future
• Response
Police deploy resources in certain
areas at certain times
Source: https://www.sciencemag.org/news/2016/09/can‐predictive‐policing‐prevent‐crime‐it‐happens
Predictive Analytics – Case Study
Predictive Policing – Predicting crime hot spots
• Descriptive and Diagnostic
Analytics
Software analyzes the social
contacts of those that have been
involved in the past
• Predictive Analytics
Algorithm identifies probable
perpetrators or victims of
violence
• Response
People are told they're
considered at risk, or social
services are offered
Source: https://www.sciencemag.org/news/2016/09/can‐predictive‐policing‐prevent‐crime‐it‐happens
Predictive Analytics – Case Study
Predictive Policing
• Science Behind the News: Predictive Policing
https://www.youtube.com/watch?v=74_jreara3w
Healthcare Case Study
Healthcare Analytics – Case Study
Healthcare Practices
• Private Practice
• Group Practice
• Large HMOs
• Hospital Based
• Locum Tenens
Source: https://www.micromd.com/blogmd/5‐types‐medical‐practices/
Image Source: https://technologyadvice.com/blog/healthcare/best‐medical‐practice‐management‐software/
Healthcare Analytics – Case Study
Healthcare Practice
• Private Practice
A physician practices alone without any partners
and typically with minimal support staff
Ideally works for physicians who wish to own and
manage their own practice
Benefits
Individual freedom, closer relationships with patients,
and the ability to set their own practice’s growth
pattern
Drawback
Longer work hours, financial extremes, and a greater
amount of business risk.
Source: https://www.micromd.com/blogmd/5‐types‐medical‐practices/
Image Source: https://www.mediqfinancial.com.au/blog/the‐essential‐guide‐to‐starting‐your‐own‐medical‐practice/
Healthcare Analytics – Case Study
Healthcare Practice
• Group Practice
A group practice involves two or more physicians who all provide medical care within the same facility
They utilize the same personnel and divide the income in a manner previously agreed upon by the group
Group practices may consist of providers from a single specialty or multiple specialties
Benefits
Shorter work hours, built‐in on‐call coverage, and access to more working capital
All of these factors can lead to less stress
Drawback
Less individual freedom, limits on the ability to rapidly grow income, and the need for a consensus on business
decisions.
Source: https://www.micromd.com/blogmd/5‐types‐medical‐practices/
Image source: https://www.physicianleaders.org/news/selling‐medical‐practice‐you‐need‐exit‐strategy
Healthcare Analytics – Case Study
Healthcare Practice
• Large HMOs (Health Maintenance Organizations)
An HMO employs providers to care for its members and beneficiaries
The goal of HMOs is to decrease medical costs for those consumers
There are two main types of HMOs
The prepaid group practice model and the medical care foundation (MCF), also called individual practice association.
Benefits
More stable work life and regular hours for providers (physicians)
Less paperwork and regulatory responsibilities and a regular salary along with bonus opportunities
These bonuses are based on productivity or patient satisfaction
Drawback
The main drawback for physicians working for an HMO is the lack of autonomy
HMOs require physicians to follow their guidelines in providing care.
Examples:
The Kaiser Foundation Health Plan in California
The Health Insurance Plan of Greater New York
Group Health Cooperative of Puget Sound
Source: https://www.micromd.com/blogmd/5‐types‐medical‐practices/
Healthcare Analytics – Case Study
Healthcare Practice
• Hospital Based
In hospital based work, physicians earn a predictable
income, have a regular patient base, and a solid
referral network
Physicians who are employed by a hospital will either
work in a hospital‐owned practice or in a department
of the hospital itself
Benefits
A regular work schedule, low to no business and legal
risk, and a steady flow of income
Drawback
Lack of physician autonomy
Employee constraints and the expectation that
physicians become involved in hospital committee work
Source: https://www.micromd.com/blogmd/5‐types‐medical‐practices/
Healthcare Analytics – Case Study
Healthcare Practice
• Locum Tenens
Locum tenens is derived from the Latin phrase for "to hold the place of."
In locum tenens, physicians temporarily relocate to areas that are short of healthcare professionals
These types of positions offer temporary employment and may offer higher pay than more
permanent employment situations
Benefits
Physicians working in locum tenens scenarios enjoy the benefits of variety and the ability to
experience numerous types of practices and geographic locations
They also enjoy schedule flexibility and lower living costs
Drawback
The possibility that benefits are not included, and a potential lack of steady work
Physicians need to regularly uproot their families
Source: https://www.micromd.com/blogmd/5‐types‐medical‐practices/
Healthcare Analytics – Case Study
Healthcare Scenario – Sources of Data
• Common sources of practice data existing within the practice:
Revenue Cycle Management System (billing for a procedure, processing denials, collecting
payments)
Electronic Health Records (EHRs) (provides services to a patient)
Scheduling & Information System (schedules an appointment)
Survey Results
Peer Review System
Other sources
• Analytics provides a platform to convert that data into actionable
information for the practice
• As reimbursement shifts to a value‐based care model, it is critical to have
insight into both clinical and business metrics to prepare for the future
Source: https://integratedmp.com/healthcare‐practice‐analytics‐101‐numbers‐charts‐dashboards‐oh‐my/
Healthcare Analytics – Case Study
Healthcare Scenario – Integrated Medical Partners (IMP)
• https://www.youtube.com/watch?v=olpuyn6kemg
• IMP
Primarily offers
Revenue Cycle Management (RCM) and Advanced Analytics among others
Helps maximize collections for independent physician practices
Provides practice performance analytics
Offers tailored solutions to partner with physicians and hospitals so that they can
Improve patient care, enhance compliance, improve operational efficiency, and increase profitability
Source: https://integratedmp.com/healthcare‐practice‐analytics‐101‐numbers‐charts‐dashboards‐oh‐my/
Healthcare Analytics – Case Study
Healthcare Scenario – Denial of Claims
• Consider that the healthcare practice has seen an increase in denied
claims over the past several months
• Increased denials impact negatively on the financial performance of
the organization
• Therefore, the company needs tools to help identify and resolve the
root cause of the denials
This trend needs to be reversed as soon as possible, but how?
• The different types of analytics provide necessary insights into the
data and the cause of the denial increase
Source: https://integratedmp.com/4‐key‐healthcare‐analytics‐sources‐is‐your‐practice‐using‐them/
Healthcare Analytics – Case Study
Healthcare Scenario – Descriptive Analytics
• Descriptive – what happened?
Descriptive analytics will tell you what is happening in the practice
In this example, there has been an increase in the number of denied claims over
the past several months
Further research identifying a trend revealed that the increase in denials is specific
to a particular denial code
What is this denial code?
This denial code is for a referring provider (doctor) who is not enrolled in the Medicare Provider
Enrollment, Chain, and Ownership System (PECOS).
• Now, the question is why this provider is not enrolled in PECOS?
Source: https://integratedmp.com/4‐key‐healthcare‐analytics‐sources‐is‐your‐practice‐using‐them/
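A pandas sketch of this descriptive step on an invented claims table; the denial code "N265" is used here only as a stand-in for the PECOS-related code.

```python
import pandas as pd

claims = pd.DataFrame({
    "month":       ["2021-01", "2021-01", "2021-02", "2021-02", "2021-03"],
    "denied":      [True, False, True, True, True],
    "denial_code": ["N265", None, "N265", "CO-97", "N265"],
})

denials = claims[claims["denied"]]
# Which denial code is driving the increase?
print(denials.groupby("denial_code").size().sort_values(ascending=False))
# Is the denial count trending up month over month?
print(denials.groupby("month").size())
```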
Healthcare Analytics – Case Study
Healthcare Scenario – Diagnostic Analytics
• Diagnostic – why did it happen?
The descriptive analytics has identified an increase in denials specific to a referring
provider
The next step is to diagnose to understand why this change occurred
Identify why the referring provider is not enrolled in PECOS
So, we utilize diagnostic analytics to understand why this change occurred
Looking for changes in referring provider patterns at the time the increase in PECOS
denials began would help to identify new referring providers
It will also identify whether these new referring providers are enrolled in PECOS
For the purpose of this example, assume we identify one referring provider who is
new to referring to the practice and is not enrolled in PECOS.
Source: https://integratedmp.com/4‐key‐healthcare‐analytics‐sources‐is‐your‐practice‐using‐them/
Healthcare Analytics – Case Study
Healthcare Scenario – Predictive Analytics
• Predictive – what will happen?
Predictive Analytics allows us to learn from historical trends to predict what will
happen in the future
Utilizing descriptive and diagnostic analytics, we can determine the historical
referral pattern of the new referring provider not enrolled in PECOS
Assuming this provider remains not enrolled in PECOS and refers a steady stream of
patients to the organization
predictive analytics will tell us the expected denials associated with these claims
The resulting impact of this situation is identified
Source: https://integratedmp.com/4‐key‐healthcare‐analytics‐sources‐is‐your‐practice‐using‐them/
Healthcare Analytics – Case Study
Healthcare Scenario – Prescriptive Analytics
• Prescriptive – what should I do?
Prescriptive Analytics assists in determining the best course of action from the information
gathered from descriptive, diagnostic and predictive analytics
In this scenario, the best course of action would be to contact the new referring provider,
express appreciation for the new referrals, but also evidence that the provider’s referrals
are resulting in denied claims due to PECOS enrollment
By working with this provider to become enrolled in PECOS, denied claims can now be
rebilled
In addition, future claims are less likely to be denied due to the provider not being
certified/eligible to be paid for the procedure/service on the claim date of service
The key to prescriptive analytics is implementing a solution that will prevent the same
breakdown from occurring again in the future
By reducing the likelihood of claim denials, the overall financial health of the physician
practice benefits
Source: https://integratedmp.com/4‐key‐healthcare‐analytics‐sources‐is‐your‐practice‐using‐them/
Analytics Maturity Models
Types of Analytics
Gartner Maturity Model for Data and Analytics
• D&A – Data and Analytics
• CDO – Chief Data Officer
• ROI – Return on Investment
• Outside‐in approach is guided by the
belief that customer value creation is
the key to success
• Inside‐out approach is guided by the
belief that the inner strengths and
capabilities of the organization will
produce a sustainable future
• Synergy: the combined power of a
group of things when they are working
together that is greater than the total
power achieved by each working
separately
Source: Gartner
Source: http://www.verival.com/DataScience
Types of Analytics
Analytics Maturity Vs. Competitive Advantage
• The model shows varying stages
of a company, its level of
analytics needs, and how
organizations can grow and
move up to the next level of
analytical maturity.
• By learning how to apply these
concepts to your organization,
you'll empower your teams and
yourself with the framework to
drive action at the right level for
the stage you're at
• The concept of transforming raw
data into prescriptive analytics is
easily digestible in each of these
graphics.
Source: https://www.jirav.com/blog/data‐analytics‐maturity‐models
Types of Analytics
Competing on Analytics
• Potential competitive advantage
increase with more sophisticated
analytics
• Companies in the bottom third of this
figure should be trying to move further
up it
• Business Intelligence (BI) tools typically
cover descriptive analytics
• Moving towards predictive analytics
can enable companies to see problems
before they arise
• Prescriptive analysis then helps to
specify the actions an organization
should carry out
• Autonomous analytics can power
decisions without human interaction
Source: Competing on Analytics: The new science of winning by Thomas H. Davenport and Jeanne G. Harris
Types of Analytics
Competing on Analytics
• "Without human interaction" may sound scary, as jobs may be affected, but the fact is that it increases productivity, and more time can be spent innovating rather than investing in a manual process
Google search is a good example.
• Not all organizations need to be
at the autonomous analysis
stage – but all need to be
moving away from the bottom
of the chart.
Source: Competing on Analytics: The new science of winning by Thomas H. Davenport and Jeanne G. Harris
Types of Analytics
Competing on Analytics
Source: HBR: Competing on Analytics: by Thomas H. Davenport and Jeanne G. Harris
Skills & Tools for Analytics
Skills Required for Analytics
Technical & Business Skills
• Technical skills for data analytics:
Packages and Statistical methods
BI Platform and Data Warehousing
Database design
Data Visualization and munging
Reporting methods
Knowledge of Hadoop and MapReduce
Data Mining
• Business skills for data analytics:
Effective communication skills
Creative thinking
Industry knowledge
Analytic problem solving
Image Source: Simplilearn ‐ https://www.youtube.com/watch?v=X3paOmcrTjQ
Tools in Analytics
Commercial & Open Source Tools
• Commercial Tools
SAS
WPS
MS Excel
Tableau
Pentaho
Spotfire
Statistica
Qlikview
KISSmetrics
• Free/Open Source Tools
R Programming
Google Analytics
Python
Anaconda Jupyter
Apache Spark
Weka
BigML
RapidMiner
Thank You!
Introduction to Data Science
Data Analytics Methodologies
Dr. Ramakrishna Dantu
Associate Professor, BITS Pilani
Introduction to Data Science
Disclaimer and Acknowledgement
• The content for these slides has been obtained from books and various other sources on the Internet
• I hereby acknowledge all the contributors for their material and inputs.
• I have provided source information wherever necessary
• I have added and modified the content to suit the requirements of the
course
Introduction to Data Science
Data Analytics Module Topics
• Defining Analytics
• Types of data analytics
Descriptive, Diagnostic
Predictive, Prescriptive
• Data Analytics – methodologies
KDD
CRISP‐DM
SEMMA
SMAM
BIG DATA LIFE CYCLE
• Analytics Capacity Building
• Challenges in Data‐driven decision making
Data Analytics Methodologies
What exactly is a methodology?
Source: Data Mining and Predictive Analytics Daniel Larose & Chantal Larose
Knowledge Discovery in Databases
Knowledge Discovery in Databases
• The Knowledge Discovery in Databases
(or KDD) refers to a process of finding
knowledge from data in the context of
large databases
• It does this by applying data mining methods (algorithms) to extract
what is deemed knowledge:
In accordance with the specifications of measures and thresholds, using a database
along with any required preprocessing, subsampling, and transformations of that
database
Source: http://www2.cs.uregina.ca/~dbd/cs831/notes/kdd/1_kdd.html
Image source: https://www.javatpoint.com/kdd‐process‐in‐data‐mining
Knowledge Discovery in Databases
KDD Process
• KDD requires relevant prior knowledge and a brief understanding of the application domain and its goals
• KDD process is iterative and interactive in nature
• There are nine different steps in seven different phases of this model
Source: http://www2.cs.uregina.ca/~dbd/cs831/notes/kdd/1_kdd.html
Image source: https://www.javatpoint.com/kdd‐process‐in‐data‐mining
Knowledge Discovery in Databases
KDD Process
• Pre KDD process (Developing Domain Understanding)
This is the first stage of KDD process in which goals are defined from customer’s view point
and used to develop and understanding about application domain and its prior knowledge
Involves developing an understanding of:
The application domain
The relevant prior knowledge and
The goals of the end‐user
• Post KDD process (Using Discovered Knowledge)
This understanding must be followed by:
Knowledge consolidation
Incorporating this knowledge into the system
Source: http://www2.cs.uregina.ca/~dbd/cs831/notes/kdd/1_kdd.html
Knowledge Discovery in Databases
KDD Process
• Selection:
This is the second stage of the KDD process, which focuses on creating a target data set and selecting a subset of data samples or variables.
It is an important stage because knowledge discovery is performed on this selected data.
• Pre‐processing
This is the third stage of KDD process which focuses on target data
cleaning and pre‐processing to obtain consistent data without any
noise and inconsistencies
In this stage, strategies are developed for handling such noisy and inconsistent data.
• Transformation
This is the fourth stage of KDD process which focuses on the
transformation of data using dimensionality reduction and other
transformation methods
Source: http://www2.cs.uregina.ca/~dbd/cs831/notes/kdd/1_kdd.html
Image source: https://www.javatpoint.com/kdd‐process‐in‐data‐mining
Knowledge Discovery in Databases
KDD Process
• Data Mining ‐ this stage consists of three steps:
Choosing the suitable data mining task
In this step of KDD process an appropriate data mining task is
chosen based on particular goals that are defined in first stage
Examples of data mining methods or tasks are classification, clustering, regression, summarization, etc.
Choosing a suitable data mining algorithm
In this step, one or more appropriate data mining algorithms are
selected for searching different patterns from data
There are a number of algorithms available today for data mining, but appropriate algorithms are selected by matching them against the overall criteria for data mining
Employing the data mining algorithm
In this step, the selected algorithms are implemented
Source: http://www2.cs.uregina.ca/~dbd/cs831/notes/kdd/1_kdd.html
Image source: https://www.javatpoint.com/kdd‐process‐in‐data‐mining
Knowledge Discovery in Databases
KDD Process
• Interpretation/Evaluation
This is the eighth step of KDD process that focuses on
interpretation and evaluation of mined patterns
This step may involve in visualization of extracted patterns
• Post KDD Process (Using discovered knowledge)
In this final step of KDD process the discovered knowledge is
used for different purposes
The discovered knowledge can be used by the interested parties or can be integrated with another system for further action
Source: http://www2.cs.uregina.ca/~dbd/cs831/notes/kdd/1_kdd.html
Image source: https://www.javatpoint.com/kdd‐process‐in‐data‐mining
CRISP‐DM
CRoss‐Industry Standard Process for Data
Mining
Data Analytics Methodologies
CRISP‐DM Methodology
• Stands for "CRoss‐Industry Standard Process for Data Mining"
• CRISP‐DM was conceived in 1996 and became a European Union project under the ESPRIT funding
initiative in 1997
European Strategic Program on Research in Information Technology
• The project was led by five companies: Integral Solutions Ltd (ISL), Teradata, Daimler AG, NCR
Corporation and OHRA, an insurance company.
ISL was later acquired by and merged into SPSS. The computer giant NCR Corporation produced the Teradata data
warehouse and its own data mining software.
• It is an open standard process model that describes common approaches used by data mining
experts
• Provides a nonproprietary, technology‐agnostic, and structured approach for fitting a data mining project into the general problem‐solving strategy
• This model is an idealized sequence of six stages
• In practice, it will often be necessary to backtrack to previous activities and repeat certain actions
Source: Data Mining and Predictive Analytics Daniel Larose & Chantal Larose
Data Analytics Methodologies
CRISP‐DM Methodology
• A data mining project is conceptualized as a
life cycle consisting of six phases
• The phase‐sequence is adaptive
That is, the next phase in the sequence depends on
the outcomes associated with the previous phase
• Dependencies between phases are indicated by the arrows
For instance, suppose that we are in the modeling phase
Depending on the behavior and characteristics of the model, we may have to return to
the data preparation phase for further refinement before moving forward to the model
evaluation phase
Source: Data Mining and Predictive Analytics Daniel Larose & Chantal Larose
Image Source: https://towardsdatascience.com/why‐using‐crisp‐dm‐will‐make‐you‐a‐better‐data‐scientist‐66efe5b72686
Data Analytics Methodologies
Six Phases of CRISP‐DM Methodology
• Business/Research Understanding Phase
This is the first phase of CRISP‐DM process
It focuses on and uncovers important factors including success
criteria, business and data mining objectives and requirements as
well as business terminologies and technical terms.
a) First, clearly enunciate the project objectives and requirements
in terms of the business or research unit as a whole
b) Then, formulate these goals and restrictions into a data science
problem definition
c) Finally, prepare a preliminary strategy for achieving these
objectives
Source: Data Mining and Predictive Analytics Daniel Larose & Chantal Larose
Data Analytics Methodologies
Six Phases of CRISP‐DM Methodology
• Data Understanding Phase
This phase of CRISP‐DM process focuses on data collection,
exploring of data, checking quality, to get insights into the
data to form hypotheses for hidden information
a) First, collect the data
b) Then, use exploratory data analysis to familiarize yourself
with the data, and discover initial insights
c) Evaluate the quality of the data
d) Finally, if desired, select interesting subsets that may
contain actionable patterns
Source: Data Mining and Predictive Analytics Daniel Larose & Chantal Larose
Data Analytics Methodologies
Six Phases of CRISP‐DM Methodology
• Data Preparation Phase
This phase focuses on selection and preparation of the final data set
This phase may include many tasks such as records, table and attributes selection
as well as cleaning and transformation of data.
a) This labor‐intensive phase covers all aspects of preparing the final data set, which
shall be used for subsequent phases, from the initial, raw, dirty data
b) Select the cases and variables you want to analyze, and that are appropriate for
your analysis.
c) Perform transformations on certain variables, if needed.
d) Clean the raw data so that it is ready for the modeling tools.
Source: Data Mining and Predictive Analytics Daniel Larose & Chantal Larose
Data Analytics Methodologies
Six Phases of CRISP‐DM Methodology
• Modeling Phase
This phase involves selection and application of various modeling techniques
Different parameters are set and different models are built for same data mining
problem
a) Select and apply appropriate modeling techniques.
b) Calibrate model settings to optimize results.
c) Often, several different techniques may be applied for the same data science
problem
d) May require looping back to a previous phase in order to bring the data into the specific format required by a particular modeling technique
Source: Data Mining and Predictive Analytics Daniel Larose & Chantal Larose
Data Analytics Methodologies
Six Phases of CRISP‐DM Methodology
• Evaluation Phase
This phase involves evaluation of obtained models and deciding how to use the
results
Interpretation of the model depends upon the algorithm and models are evaluated
to review whether they achieve the objectives properly or not.
a) Evaluate the models delivered by the modeling phase for quality and
effectiveness, before deploying them for use in the field
b) Determine whether the model in fact achieves the objectives set for it in phase 1.
c) Establish whether some important facet of the business or research problem has
not been sufficiently accounted for.
d) Finally, come to a decision regarding the use of the modeling results.
Source: Data Mining and Predictive Analytics Daniel Larose & Chantal Larose
Data Analytics Methodologies
Six Phases of CRISP‐DM Methodology
• Deployment Phase
This phase focuses on deploying the final model and determining the usage of the obtained knowledge and results
This phase also focuses on organizing, reporting and presenting the gained
knowledge when needed
a) Model creation does not signify the completion of the project; the models must be put to use
b) Examples:
a) A simple deployment: Generate a report.
b) A more complex deployment: Initiate projects based on the predictive outcomes of the
model
c) For businesses, the customer often carries out the deployment based on your
model
Source: Data Mining and Predictive Analytics Daniel Larose & Chantal Larose
Data Analytics Methodologies
Phases and Tasks of CRISP‐DM Methodology
SEMMA Process Model
Data Analytics Methodologies
SEMMA Process Model
• SEMMA stands for Sample, Explore, Modify, Model and Assess
• It is a data mining method presented by the SAS Institute
• It enables data organization, discovery, development and maintenance of data
mining projects
• SAS Institute defines data mining as the process of Sampling, Exploring,
Modifying, Modeling, and Assessing (SEMMA) large amounts of data to uncover
previously unknown patterns which can be utilized as a business advantage
• The data mining process is applicable across a variety of industries and provides methodologies for diverse business problems, such as:
Fraud detection, customer retention and attrition, market segmentation, risk analysis, customer
satisfaction, bankruptcy prediction, and portfolio analysis
Source: Big Data Analytics Methods, 2nd Edition by Ghavami P
Data Analytics Methodologies
SEMMA Process Model
• Five stages of SEMMA model:
Sample:
Involves sampling of data to extract critical information about the data while small enough to manipulate quickly (this stage is optional)
Explore:
This stage focuses on data exploration and discovery (e.g., data visualization, clustering, associations)
It improves understanding of the data, finding trends and anomalies in the data
Modify:
This stage involves modification of data through variable selection, data creation and data transformations, and the model selection process
This stage may evaluate anomalies, outliers and variable reduction in data
Model:
The goal of this stage is to build the model and apply the model to the data
Different modeling techniques (neural networks, tree‐based models, logistic models, other statistical models) may be applied as datasets have different attributes
Assess:
The final stage of SEMMA evaluates the reliability and usefulness of the results, and the performance and accuracy of the model(s)
[Figure: SEMMA stages – Sample (optional), Explore, Modify, Model, Assess]
Source: SAS® Enterprise Miner™ 14.3: Reference Help
Big Data Analytics Methods, 2nd Edition by Ghavami P
Data Analytics Methodologies
SEMMA Process Model
• SEMMA is actually not a data mining methodology
• It is a logical organization of the functional tool set of SAS Enterprise Miner for carrying out the core tasks of data mining
[Figure: SEMMA stages – Sample (optional), Explore, Modify, Model, Assess]
Source: SAS® Enterprise Miner™ 14.3: Reference Help
Big Data Analytics Methods, 2nd Edition by Ghavami P
Comparison of Methodologies
Data Analytics Methodologies
Comparison of KDD Vs. SEMMA
• By comparing KDD with SEMMA stages we can see that they are very similar
SEMMA – KDD
Sample phase – can be identified with Selection
Explore phase – can be identified with Pre‐processing
Modify phase – can be identified with Transformation
Model phase – can be identified with Data Mining
Assess phase – can be identified with Interpretation/Evaluation
• Close examination tells us that the five stages of the SEMMA process can be seen as a practical implementation of the five stages of the KDD process, since it is directly linked to the SAS Enterprise Miner software.
Data Analytics Methodologies
Comparison of KDD Vs. CRISP‐DM
• Comparing the KDD stages with the CRISP‐DM stages is not as straightforward as in the SEMMA case
• The CRISP‐DM methodology incorporates the steps that, as referred to above, must precede and follow the KDD process, that is to say:
• The Business Understanding phase
Can be identified with the development of an understanding of the application domain, the relevant prior knowledge and the goals of
the end‐user;
• The Data Understanding phase
Can be identified as the combination of Selection and Pre‐processing;
• The Deployment phase
Can be identified with the consolidation by incorporating this knowledge into the system.
• The Data Preparation phase
Can be identified with Transformation;
• The Modeling phase
Can be identified with Data Mining;
• The Evaluation phase
Can be identified with Interpretation/Evaluation.
Data Analytics Methodologies
KDD Vs. SEMMA Vs. CRISP‐DM
[Figure: side‐by‐side mapping of the KDD, SEMMA and CRISP‐DM stages]
Data Analytics Methodologies
Drawbacks of CRISP‐DM‐1
• One major drawback is that the model no longer seems
to be actively maintained
• The official site, CRISP‐DM.org, is no longer being
maintained
• Furthermore, the framework itself has not been updated
for working with new technologies, such as big data
• Big data technologies mean that there can be additional effort spent in the data understanding phase
For example, as the business grapples with the additional complexities involved in the shape of big data sources.
Data Analytics Methodologies
Drawbacks of CRISP‐DM‐1
• The methodology itself was conceived in the mid‐1990s
• Gregory Piatetsky of KDNuggets says:
"CRISP‐DM remains the most popular methodology for
analytics, data mining, and data science projects, with 43%
share in latest KDnuggets Poll, but a replacement for
unmaintained CRISP‐DM is long overdue."
Data Analytics Methodologies
Drawbacks of CRISP‐DM‐2
• CRISP‐DM is a great framework and its use on
projects helps focus them on delivering real business
value
• CRISP‐DM has been around a long time so many
projects are implemented using CRISP‐DM
• However, projects do take shortcuts
• These methodology shortcuts translate into frequent friction points between the business and IT professionals
• Some of these shortcuts make sense but too often
they result in projects using a corrupted version of the
approach like the one shown in Figure
Data Analytics Methodologies
Drawbacks of CRISP‐DM‐3: Common Shortcuts
• A lack of clarity:
Rather than drilling down into the details to get clarity on both the business problem and the analytic approach, the project team works superficially with business goals and uses some metrics to measure success.
• Mindless rework:
Most try to find new data or new modeling techniques rather than working
with their business partners to re‐evaluate the business problem.
• Blind hand‐off to IT:
Some analytic teams don’t think about deployment and operationalization of
their models at all
They fail to recognize that the models they build will have to be applied to
live data in operational data stores or embedded in operational systems.
• Failure to iterate
Analytic professionals know that models age and that models need to be kept
up to date if they are to continue to be valuable.
Standard Methodology for
Analytical Models (SMAM)
Data Analytics Methodologies
Standard Methodology for Analytics Models (SMAM)
• Here we will look into another methodology called SMAM, which is
supposed to overcome those friction points
Source: https://www.kdnuggets.com/2015/08/new‐standard‐methodology‐analytical‐models.html
Data Analytics Methodologies
Standard Methodology for Analytics Models (SMAM)
Phase – Description
Use case identification – Selection of the ideal approach from a list of candidates
Model requirements gathering – Understanding the conditions required for the model to function
Data preparation – Getting the data ready for the modeling
Modeling experiments – Scientific experimentation to solve the business question
Insight creation – Visualization and dashboarding to provide insight
Proof of value (ROI) – Running the model in a small‐scale setting to prove the value
Operationalization – Embedding the analytical model in operational systems
Model Lifecycle – Governance around model lifetime and refresh
Source: https://www.kdnuggets.com/2015/08/new‐standard‐methodology‐analytical‐models.html
Data Analytics Methodologies
Standard Methodology for Analytics Models (SMAM)
• Use case identification phase
Describes the brainstorming/discovery process of looking at the different areas where models may
be applicable
It involves educating business parties involved about what analytical modeling is, what realistic
expectations are from the various approaches and how models can be leveraged in the business
Discussions on the use‐case identification involve topics around data availability, model integration
complexity, analytical model complexity and model impact on the business
From a list of identified use‐cases in an area, the one with the best ranking on above mentioned
criteria should be considered for implementation
Parties involved in this phase are (higher) management to ensure the right goal setting, IT for data
availability, the involved department for model relevance checking and the data scientists to infuse
the analytical knowledge and creative analytical idea provisioning
The result of this phase is a chosen use‐case, and potentially a roadmap with the other considered
initiatives on a timeline.
Source: https://www.kdnuggets.com/2015/08/new‐standard‐methodology‐analytical‐models.html
Data Analytics Methodologies
Standard Methodology for Analytics Models (SMAM)
• Model requirements gathering
In this phase, requirements are gathered for the use‐case selected in the previous
phase
These requirements include:
Conditions for the cases/customers/entities considered for scoring
Requirements about the availability of data, initial ideas around model reporting can be
explored and finally, ways that the end‐users would like to consume the results of the analytical
models
Parties involved in this phase are people from the involved department(s), the end‐users and
the data scientists
The result of this phase is a requirements document.
Source: https://www.kdnuggets.com/2015/08/new‐standard‐methodology‐analytical‐models.html
Data Analytics Methodologies
Standard Methodology for Analytics Models (SMAM)
• Data preparation
In this phase, discussions revolve around data access, data location, data understanding,
data validation, and creation of the modeling data
This is a phase where IT/data administrators/DBA’s closely work together with the data
scientist to help prepare the data in a format consumable by the data scientist
The process is agile; the data scientist tries out various approaches on smaller sets and then may ask IT to perform the required transformations at full scale
As with the CRISP‐DM, the previous phase, this phase and the next happen in that order,
but often iterate back and forth
Parties involved
The involved parties are IT/data administrators/DBA/data modelers and data scientists
The results of this phase should be the data scientist being convinced that with the data
available, a model is viable, as well as the scoring of the model in the operational
environment
Source: https://www.kdnuggets.com/2015/08/new‐standard‐methodology‐analytical‐models.html
Data Analytics Methodologies
Standard Methodology for Analytics Models (SMAM)
• Modeling experiments
In this phase, data scientists play with the data and try to crack the nut, coming up with a solution that is cool, elegant, and working
Results are not immediate; progress is obtained by evolution and by patiently
collecting insights to put them together in an ever evolving model
At times, the solution at hand may not look viable anymore, and an entire different
angle needs to be explored, seemingly starting from scratch
The data scientist may need to connect to end‐user to validate initial results, or to
have discussion to get ideas which can be translated into testable hypotheses/
model features
The result of this phase is an analytical model that is evaluated in the best possible
way with the (historic) data available as well as a reporting of these facts.
Source: https://www.kdnuggets.com/2015/08/new‐standard‐methodology‐analytical‐models.html
Data Analytics Methodologies
Standard Methodology for Analytics Models (SMAM)
• Insight creation
Dashboards and visualization are critically important for the acceptance of the model by the
business
The outcomes of this phase are analytical reporting and operational reporting
Analytical reporting refers to any reporting on data where the outcome (of the analytical model) has
already been observed – descriptive modeling
This data can then be used to understand the performance of the model and the evolution of
performance over time
Operational reporting refers to any reporting on the data where the outcome has not yet been
observed – predictive modeling
This data can be used to understand what the model predicts for the future in an aggregated sense
and is used for monitoring purposes
The involved parties are the end‐users, business department, and the data scientists
The result of this phase is a set of visualizations and dashboards that provide a clear view on the
model effectiveness and provide business usable insights.
Source: https://www.kdnuggets.com/2015/08/new‐standard‐methodology‐analytical‐models.html
Data Analytics Methodologies
Standard Methodology for Analytics Models (SMAM)
• Proof of value (ROI)
This phase involves deploying the model on a small scale to ensure return on
investment
If the ROI is positive, the business will be convinced that they can trust the model
A decision can be made if the model should be deployed or not
• Operationalization
This phase involves putting the model into production and operationalization of the
model by closely working with the IT department
The model is handed over to the maintenance team for support
Source: https://www.kdnuggets.com/2015/08/new‐standard‐methodology‐analytical‐models.html
Data Analytics Methodologies
Standard Methodology for Analytics Models (SMAM)
• Model lifecycle
An analytical model in production will not be fit forever
Depending on how fast the business changes, the model performance degrades over time.
Generally, two types of model changes can happen: refresh and upgrade
In a model refresh, the model is trained with more recent data, leaving the model
structurally untouched
The model upgrade is typically initiated by the availability of new data sources and the
request from the business to improve model performance by the inclusion of the new
sources
The involved parties are the end‐users, the operational team that handles the model
execution, the IT/data administrators/DBA for the new data and the data scientist
Source: https://www.kdnuggets.com/2015/08/new‐standard‐methodology‐analytical‐models.html
Big Data Life Cycle
Data Analytics Methodologies
Big Data Life Cycle
• Big Data analysis differs from traditional data analysis primarily due to the volume,
velocity, variety, and veracity characteristics of the data
• Massive amounts of data ‐ It is big – typically in terabytes or even petabytes
• Varied data formats such as video files to SQL databases to text data
It could be a traditional database, it could be video data, log data, text data or even voice data
• Data that comes in at varying frequency – from days to minutes ‐ It keeps increasing as
new data keeps flowing in
• To address these distinct aspects of Big Data, a step‐by‐step methodology is needed to
organize and manage the tasks and activities associated with Big Data analysis
• These activities and tasks involve acquiring, processing, analyzing and repurposing data
• In addition to the lifecycle, it is also important to consider the issues of training, education,
tooling and staffing of a data analytics team
Source: Big Data Fundamentals: Concepts, Drivers & Techniques by Wajid Khattak; Paul Buhler; Thomas Erl
Data Analytics Methodologies
Big Data Life Cycle
• The Big Data analytics lifecycle can be divided
into nine stages:
1) Business Case Evaluation
2) Data Identification
3) Data Acquisition & Filtering
4) Data Extraction
5) Data Validation & Cleansing
6) Data Aggregation & Representation
7) Data Analysis
8) Data Visualization
9) Utilization of Analysis Results
Source: Big Data Fundamentals: Concepts, Drivers & Techniques by Wajid Khattak; Paul Buhler; Thomas Erl
Data Analytics Methodologies
Big Data Life Cycle
• Business Case Evaluation
This stage involves creating, assessing, and
approving a business case prior to proceeding to
next stage
The business case presents:
Clear requirements
Justification for the analytics project
Motivation
Goals for carrying out the analysis
Budget
Timeline
Resources
Source: Big Data Fundamentals: Concepts, Drivers & Techniques by Wajid Khattak; Paul Buhler; Thomas Erl
Data Analytics Methodologies
Big Data Life Cycle
• Data Identification
This stage involves identifying the data and their sources required for the analysis
A wider variety of data sources may increase the probability of finding hidden
patterns and correlations
For example, it is beneficial to look into as many types of sources as possible, particularly when it is not clear exactly what to look for
However, there is a tradeoff between the value gained and the investment in terms of resources, time, and budget
These datasets can be internal or external to the enterprise depending on the project
objectives
Internal data sources include
Data warehouses, data marts, operational systems, etc.
External data sources include
Third‐party datasets such as market trends, stock data analysis; Government supplied datasets,
and other publicly available datasets.
Sometimes, data may need to be harvested using automated tools from weblogs (blogs) and other content‐based web sites
Source: Big Data Fundamentals: Concepts, Drivers & Techniques by Wajid Khattak; Paul Buhler; Thomas Erl
Data Analytics Methodologies
Big Data Life Cycle
• Data Acquisition & Filtering
Here, data is gathered from all of the data sources that were identified during the
previous stage
Depending on the type of data source, data may come as a collection of files, such as
data purchased from a third‐party data provider, or may require API integration, such
as with Twitter
In many cases, some or most of the acquired data may be irrelevant (noise) and can
be discarded as part of the filtering process
Metadata can be added via automation to data from both internal and external data
sources to improve the classification and querying
For example:
Dataset size and structure, source information, date and time of creation or collection and
language‐specific information
Metadata should be machine‐readable so that it can be passed forward along
subsequent analysis stages
This helps maintain the origins of data throughout the Big Data analytics lifecycle,
which helps to establish and preserve data accuracy and quality
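• As an illustration (not from the source), machine‐readable metadata for an acquired dataset could be captured as a small JSON record; the field names below are hypothetical:
import json
from datetime import datetime, timezone

# Hypothetical metadata record that travels with an acquired dataset
metadata = {
    "source": "third-party provider X",            # assumed provider name
    "acquired_at": datetime.now(timezone.utc).isoformat(),
    "format": "CSV",
    "size_bytes": 10_485_760,
    "record_count": 250_000,
    "language": "en",
}

print(json.dumps(metadata, indent=2))              # machine-readable, can be passed to later stages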
Source: Big Data Fundamentals: Concepts, Drivers & Techniques by Wajid Khattak; Paul Buhler; Thomas Erl
Data Analytics Methodologies
Big Data Life Cycle
• Data Extraction and Transformation
Due to disparate external data sources, data may
arrive in a format incompatible with the Big Data
solution
XML, JSON Files
Tab or comma or other delimited files
Raw text files
The extent of extraction and transformation
required depends on the types of analytics and
capabilities of the Big Data solution
For example: Merging fields, Separating data fields into
individual components, etc
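• A minimal sketch (pandas, hypothetical column names) of the field‐level transformations mentioned above, i.e. merging fields and separating a field into components:
import pandas as pd

df = pd.DataFrame({"first_name": ["Asha", "Ravi"],
                   "last_name": ["Rao", "Kumar"],
                   "full_address": ["12 MG Road, Bengaluru", "4 Park St, Kolkata"]})

# Merge two fields into one
df["full_name"] = df["first_name"] + " " + df["last_name"]

# Separate one field into individual components
df[["street", "city"]] = df["full_address"].str.split(", ", n=1, expand=True)
print(df)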
Source: Big Data Fundamentals: Concepts, Drivers & Techniques by Wajid Khattak; Paul Buhler; Thomas Erl
Data Analytics Methodologies
Big Data Life Cycle
• Data Validation & Cleansing
Internal data usually has a pre‐defined structure and is pre‐validated
External data can be unstructured and without any
indication of validity
This stage is dedicated to establishing complex validation
rules and removing any known invalid data
For example:
Redundant data across different datasets
o This redundancy can be exploited to explore interconnected
datasets in order to assemble validation parameters and fill in
missing valid data
The data validation and cleansing can be a batch process
during ETL or it can be a real‐time in‐memory operation
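• A minimal sketch (pandas, hypothetical field names) of expressing validation rules and removing known invalid records:
import pandas as pd

claims = pd.DataFrame({"claim_id": [1, 2, 3],
                       "age": [42, 300, 67],              # 300 is clearly invalid
                       "amount": [1200.0, 450.0, -80.0]}) # a negative amount is invalid

# Validation rules expressed as boolean conditions
valid = claims["age"].between(0, 120) & (claims["amount"] >= 0)

clean_claims = claims[valid]      # records passing every rule
rejected = claims[~valid]         # known invalid records, set aside for review
print(clean_claims, rejected, sep="\n")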
Source: Big Data Fundamentals: Concepts, Drivers & Techniques by Wajid Khattak; Paul Buhler; Thomas Erl
Data Analytics Methodologies
Big Data Life Cycle
• Data Aggregation & Representation
This stage is dedicated to integrating multiple
datasets together to arrive at a unified view.
Data spread across multiple datasets may need to be
joined together via common fields such as Date or ID
Data may need to be reconciled because the same data fields may appear in multiple datasets
E.g., date of birth
Data aggregation can become complicated because
of differences in:
Data Structure – Although the data format may be the
same, the data model may be different
Semantics – A value that is labeled differently in two
different datasets may mean the same thing, for
example "surname" and "last name"
Source: Big Data Fundamentals: Concepts, Drivers & Techniques by Wajid Khattak; Paul Buhler; Thomas Erl
Data Analytics Methodologies
Big Data Life Cycle
• Data Aggregation and Representation
Very often they follow different naming conventions
and varied standards for data representation
E.g., Customer_Identifier, Customer_Id, Cust_Id, CID
Names have to be standardized and discrepancies
should be resolved
Integrating the data involves combining all the relevant
operational data into coherent data structures to be
made ready for loading into the data warehouse.
Data integration and consolidation is a type of
preprocess before other major transformation routines
are applied
Data Analytics Methodologies
Big Data Life Cycle
• Data Aggregation & Representation
Data aggregation can be a time and effort intensive operation
because of large volumes of data
Future data analysis requirements need to be considered during this
stage to help foster data reusability
It is important to understand that the same data can be stored in
many different forms
One form may be better suited for a particular type of analysis than
another
For example, data stored as a BLOB would be of little use if the analysis
requires access to individual data fields
A standardized data structure for the Big Data solution can be used
for a range of analysis techniques and projects
Source: Big Data Fundamentals: Concepts, Drivers & Techniques by Wajid Khattak; Paul Buhler; Thomas Erl
Data Analytics Methodologies
Big Data Life Cycle
• Data Analysis
This stage is dedicated to carrying out the actual analysis task,
which typically involves one or more types of analytics
This stage can be iterative in which case analysis is repeated
until the appropriate pattern or correlation is uncovered
Depending on the type of analytic result required, this stage
can be
as simple as querying a dataset to compute an aggregation for
comparison
as challenging as combining data mining and complex statistical analysis
techniques to discover patterns, anomalies, or to generate a statistical
or mathematical model to depict relationships between variables.
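• For the simple end of that spectrum, a query that computes an aggregation for comparison might look like this (pandas, hypothetical columns):
import pandas as pd

loans = pd.DataFrame({"branch": ["North", "North", "South", "South"],
                      "amount": [1000, 2500, 1800, 900],
                      "defaulted": [0, 1, 0, 0]})

# Aggregate: average loan amount and default rate per branch
summary = loans.groupby("branch").agg(avg_amount=("amount", "mean"),
                                      default_rate=("defaulted", "mean"))
print(summary)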
Source: Big Data Fundamentals: Concepts, Drivers & Techniques by Wajid Khattak; Paul Buhler; Thomas Erl
Data Analytics Methodologies
Big Data Life Cycle
• Data Visualization
This stage is dedicated to using visualization techniques and tools to
graphically communicate the results of analysis for effective
interpretation
Business users need to be able to understand the results in order to
obtain value from the data analysis and subsequently have the ability
to provide feedback from stage 8 back to stage 7
Data Visualization provides users with the ability to perform visual
analysis, allowing for the discovery of answers to unknown questions
Data Visualization helps in presenting the same results in a number of different ways, which can influence the interpretation of the results.
Providing a method of drilling down to comparatively simple statistics
is crucial, in order for users to understand how the rolled up or
aggregated results were generated.
Source: Big Data Fundamentals: Concepts, Drivers & Techniques by Wajid Khattak; Paul Buhler; Thomas Erl
Data Analytics Methodologies
Big Data Life Cycle
• Utilization of Analysis Results
This stage involves determining how and where processed
analysis data can be further leveraged
The results of the analysis may produce "models" that can be used on new datasets to understand patterns and relationships that exist within the data
A model may look like a mathematical equation or a set of rules
Models can be used to improve business process logic and
application system logic, and they can form the basis of a new
system or software program
Source: Big Data Fundamentals: Concepts, Drivers & Techniques by Wajid Khattak; Paul Buhler; Thomas Erl
Data Analytics Methodologies
Big Data Life Cycle Vs. CRISP‐DM
• A comparison of stages involved in Big Data Life Cycle and CRISP‐DM
Business Problem Definition – CRISP DM : Business understanding
Research ‐‐ new step
Human Resources Assessment – new step
Data Acquisition – new step
Data Munging – new step
Data Storage – new step
Exploratory Data Analysis – CRISP DM : Data Understanding
Data Preparation for Modeling and Assessment – CRISP DM : Data Preparation
Modeling – CRISP DM : Data Modelling
Implementation – CRISP DM : Evaluation & Deployment
Data Analytics Methodologies
Big Data Life Cycle – Possible case studies
• What is precision agriculture? Why is it a likely answer to climate change and food security?
https://www.youtube.com/watch?v=581Kx8wzTMc&ab_channel=Inter‐AmericanDevelopmentBank
• Innovating for Agribusiness
https://www.youtube.com/watch?v=C4W0qSQ6A8U
• How Big Data Can Solve Food Insecurity
https://www.youtube.com/watch?v=4r_IxShUQuA&ab_channel=mSTARProject
• AI for AgriTech ecosystem in India‐ IBM Research
https://www.youtube.com/watch?v=hhoLSI4bW_4&ab_channel=IBMIndia
• Bringing Artificial Intelligence to agriculture to boost crop yield
https://www.youtube.com/watch?v=GSvT940uS_8&ab_channel=MicrosoftIndia
Thank You!
Introduction to Data Science
Data Science Process
Dr. Ramakrishna Dantu
Associate Professor, BITS Pilani
Introduction to Data Science
Disclaimer and Acknowledgement
• The content for these slides has been obtained from books and various other sources on the Internet
• I hereby acknowledge all the contributors for their material and inputs.
• I have provided source information wherever necessary
• I have added and modified the content to suit the requirements of the
course
Introduction to Data Science
Data Science Process
• Data Science Methodology
Business understanding
Data Requirements
Data Acquisition
Data Understanding
Data preparation
Modelling
Model Evaluation
Deployment and feedback
• Case Study
• Data Science Proposal
Samples
Evaluation
Review Guide
Data Science Process
Data Science Process
Stages in Data Science Process
• Setting the research goal
• Retrieving data
• Data Preparation
• Data Exploration
• Data Modeling
• Presentation and Automation
Source: Introducing Data Science by Cielin et al.,
Data Science Process
Business Scenario
• Suppose you’re working for a Bank
• The bank feels that it’s losing too much money to
bad loans and wants to reduce its losses
• To do so, they want a tool to help loan officers
more accurately detect risky loans
Source: Introducing Data Science by Cielin et al.,
Setting the Research Goal
Data Science Process
Setting the Research Goal ‐ Concept
• An essential outcome of this stage is the research goal that states the purpose
of your assignment in a clear and focused manner
• Understanding the business goals and context is critical for project success
• Ask questions until you grasp the exact business expectations
Identify how your project fits in the bigger picture
Understand how this research is going to change the business
Understand how business will use the results
• Nothing is more frustrating than reporting your findings back to the organization and having everyone immediately realize that you misunderstood their question
• Don’t skim over this phase lightly
• Many data scientists fail here:
Despite their mathematical wit and scientific brilliance, they never seem to grasp the
business goals and context
Source: Introducing Data Science by Cielin et al.,
Data Science Process
Setting the Research Goal ‐ Concept
• The first task in a data science project is to define a measurable and
quantifiable goal
• Typically, this phase involves two steps:
Define research goal
Develop a project charter
• Our aim is to learn all about the context of the project:
Why do the sponsors want the project in the first place?
What do they lack, and what do they need?
What are they doing to solve the problem now, and why isn’t that good enough?
What resources will you need: what kind of data and how much staff?
Will you have domain experts to collaborate with, and what are the computational
resources?
How do the project sponsors plan to deploy your results?
What are the constraints that have to be met for successful deployment?
Source: Introducing Data Science by Cielin et al.,
Data Science Process
Setting the Research Goal ‐ Concept
• A project charter requires teamwork, and your input
covers at least the following:
A clear research goal
The project mission and context
How you’re going to perform your analysis
What resources you expect to use
Proof that it’s an achievable project, or proof of concepts
Deliverables and a measure of success
A timeline
• Your client can use this information to make an
estimation of the project costs and the data and people
required for your project to become a success.
Source: Introducing Data Science by Cielin et al.,
Data Science Process
Setting the Research Goal – Case Scenario
• In the loan example, the ultimate business goal is to reduce the bank’s
losses due to bad loans
• Project sponsor wants a tool to help loan officers more accurately score
loan applicants, and so reduce the number of bad loans made
• Loan officers feel that they have final discretion on loan approvals
• Once we have a high‐level understanding, we work with stakeholders to
define the precise goal of the project
• The goal should be specific and measurable
NO ‐ "We want to get better at finding bad loans" but instead
YES ‐ "We want to reduce our rate of loan charge‐offs by at least 10%, using a model that
predicts which loan applicants are likely to default."
Source: Practical Data Science with R ‐ Nina Zumel, John Mount
Data Science Process
Setting the Research Goal – Case Scenario
• Importance of having a specific and measurable goal
A concrete goal helps with acceptance criteria and when to stop the project
If the goal is less specific (not measurable), there is no boundary to the project,
because no result will be "good enough."
If we don’t know what we want to achieve, we don’t know when to stop trying—or
even what to try
When the project eventually terminates—because either time or resources run
out—no one will be happy with the outcome
Source: Practical Data Science with R ‐ Nina Zumel, John Mount
Data Science Process
Setting the Research Goal – Case Scenario
• When can we have non‐specific or non‐measurable goals?
When our project is explorative in nature
"Is there something in the data that correlates to higher defaults?"
"Should we think about reducing the kinds of loans we give out?"
Which types might we eliminate?
In this situation, we can still scope the project with concrete stopping conditions, such as a
time limit
For example
We might decide to spend two weeks, and no more, exploring the data, with the goal of coming up
with candidate hypotheses
These hypotheses can then be turned into concrete questions or goals for a full‐scale modeling
project
Source: Practical Data Science with R ‐ Nina Zumel, John Mount
Data Collection
Data Science Process
Retrieving Data ‐ Concept
• This step involves finding suitable data and getting access
to it:
Internal data
External data
• In this step, we ask the following questions:
What data is available to me?
Will it help me solve the problem?
Is it enough?
Is the data quality good enough?
• Data found in this phase is usually in the raw form
• Data requires polishing and transformation before using
it in the next stage
Source: Introducing Data Science by Cielin et al.,
Data Science Process
Retrieving Data – Case Scenario
• For the loan application case scenario, we need data related to loan
applications, and those that were defaulted
• Imagine that we’ve collected a sample of representative loans from
the last decade
• Some of the loans have defaulted
Most of them (about 70%) have not
• We’ve collected a variety of attributes about each loan application
See the listing in the next slide
Source: Practical Data Science with R ‐ Nina Zumel, John Mount
Data Science Process
Retrieving Data – Case Scenario
• Table 1.2 Loan data attributes
Status_of_existing_checking_account (at time of application)
Duration_in_month (loan length)
Credit_history
Purpose (car loan, student loan, and so on)
Credit_amount (loan amount)
Savings_Account_or_bonds (balance/amount)
Present_employment_since
Installment_rate_in_percentage_of_disposable_income
Personal_status_and_sex
Cosigners
Present_residence_since
Collateral (car, property, and so on)
Age_in_years
Other_installment_plans (other loans/lines of credit—the type)
Housing (own, rent, and so on)
Number_of_existing_credits_at_this_bank
Job (employment type)
Number_of_dependents
Telephone (do they have one)
Loan_status (dependent variable)
Source: Practical Data Science with R ‐ Nina Zumel, John Mount
Data Science Process
Retrieving Data – Case Scenario
• In the dataset, Loan_status takes on two possible values:
GoodLoan and BadLoan
• For the purposes of this case scenario
GoodLoan means the loan was paid off, and BadLoan means the loan defaulted
• Analysis Tip:
As much as possible, try to use information that can be directly measured, rather than information that is
inferred from another measurement
For example, we might be tempted to use income as a variable, reasoning that a lower income implies
more difficulty paying off a loan
The ability to pay off a loan is more directly measured by considering the size of the loan payments
relative to the borrower’s disposable income
This information is more useful than income alone
This information is available in the dataset as the variable
Installment_rate_in_percentage_of_disposable_income
Source: Practical Data Science with R ‐ Nina Zumel, John Mount
Data Science Process
Retrieving Data – Case Scenario
• This is the stage where we initially explore and visualize the data
• We also perform basic cleaning of the data, as needed
• In the process of exploring, we may discover that the data isn't suitable for the problem, or that we need other types of information as well
• We may discover things in the data that raise issues more important than the one we originally planned to address
• For example, the data in the figure seems counterintuitive
[Figure: The fraction of defaulting loans by credit history category. The dark region of each bar represents the fraction of loans in that category that defaulted.]
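• A sketch of how the quantity behind such a chart could be computed, assuming a pandas DataFrame with hypothetical columns credit_history and loan_status:
import pandas as pd

loans = pd.DataFrame({"credit_history": ["all paid", "delinquent", "all paid", "critical"],
                      "loan_status": ["BadLoan", "GoodLoan", "GoodLoan", "GoodLoan"]})

# Fraction of BadLoan outcomes within each credit-history category
default_fraction = loans["loan_status"].eq("BadLoan").groupby(loans["credit_history"]).mean()
print(default_fraction)   # this is the fraction shown as the dark region of each bar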
Source: Practical Data Science with R ‐ Nina Zumel, John Mount
Data Science Process
Retrieving Data – Case Scenario
• Why would some of the seemingly safe
applicants (those who repaid all credits to
the bank) default at a higher rate than
seemingly riskier ones (those who had been
delinquent in the past)?
• After looking more carefully and discussing
with domain experts, we realize that this
sample is inherently biased:
We only have loans that were actually made
(and therefore already accepted).
• A true unbiased sample of loan applications
should include
both loan applications that were accepted and
ones that were rejected
Source: Practical Data Science with R ‐ Nina Zumel, John Mount
Data Science Process
Retrieving Data – Case Scenario
• Because the data sample only includes accepted loans, there are fewer risky‐looking loans
than safe‐looking ones in the data
• The probable story is that risky‐looking loans were approved after a much stricter vetting
process
A process that perhaps the safe‐looking loan applications could bypass
• This suggests that if your model is to be used downstream of the current application
approval process
Credit history is no longer a useful variable
• It also suggests that even seemingly safe loan applications should be more carefully
scrutinized
• Discoveries like this may require us to approach stakeholders to change or refine the
project goals
• We may decide to concentrate on the seemingly safe loan applications
Source: Practical Data Science with R ‐ Nina Zumel, John Mount
Data Preparation
Techniques used in Data Preparation
Overview
• The raw data collected is likely to be "a
diamond in the rough."
• Our task is to sanitize and prepare it for
use in the modeling and reporting phase
• This includes cleansing, transformation,
and combining the data from a raw form
into data that’s directly usable in your
models
• Doing so is tremendously important
because:
Our models will perform better with clean data
Otherwise, garbage in equals garbage out
Source: Introducing Data Science by Cielin et al.,
Techniques used in Data Preparation
Overview
• Data Preparation
The third phase of the process is data preparation
Cleansing
Finding and correcting errors, Spaces, Typos,
Missing values
Outliers
Transformation
Aggregating, Extrapolating, Derived measures, Reducing number of
variables, Creating dummies, etc
Combining
Creating views
Set operators
Source: Introducing Data Science by Cielin et al.,
Techniques in Data Preparation
Data Cleansing
• Data cleansing focuses on removing errors in the data
• The purpose is to give the data a "true and consistent
representation" of the process from where it
originates
• Two types of errors:
Interpretation error
It exists when we take the value in the data for granted
o E.g., Age = 300 years, Weight = 500 kilos
Inconsistencies between data sources
E.g., Using "Male" in one table and "M" in another
E.g., Using "Rupees" in one table and "Dollars" in another
Source: Introducing Data Science by Cielin et al.,
Techniques in Data Preparation
Data Cleansing
Source: Introducing Data Science by Cielin et al.,
Techniques in Data Preparation
Data Cleansing
• Sometimes we may use more advanced methods, such as
simple modeling, to find and identify data errors and
anomalies
• Using diagnostic plots can be especially insightful
For example, simple regression can identify data points that seem out
of place
• When a single observation has too much influence, this
can point to an error in the data, but it can also be a valid
point
• At the data cleansing stage, these advanced methods are,
however, rarely applied and often regarded by certain
data scientists as overkill.
Source: Introducing Data Science by Cielin et al.,
Techniques in Data Preparation
Data Cleansing
The outlier point influences
the model heavily and is
worth investigating because
it can point to a region
where we don’t have
enough data or might
indicate an error in the
data, but it also can be a
valid data point
Source: Introducing Data Science by Cielin et al.,
Techniques in Data Preparation
Data Cleansing – Data Entry Errors
• Data entry errors may be caused by humans or
machines
• Errors originating from machines are transmission errors
or bugs in the extract, transform, and load phase (ETL)
• When the variable doesn’t have many classes, an error check can be done by tabulating the data with counts
• For example:
Consider a variable that can take only two values: "Good" and "Bad"
We can create a frequency table and see if those are truly the only two values present
The values "Godo" and "Bade" point out that something went wrong in at least 16 cases
• Most errors of this type are easy to fix with simple assignment statements and if‐then‐else rules:
if (x == "Godo"):
    x = "Good"
if (x == "Bade"):
    x = "Bad"
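• The frequency table mentioned above can be produced in one line with pandas (a sketch with hypothetical data):
import pandas as pd

quality = pd.Series(["Good", "Bad", "Godo", "Good", "Bade", "Good"])

print(quality.value_counts())                                # reveals the misspelled "Godo" and "Bade"
quality = quality.replace({"Godo": "Good", "Bade": "Bad"})   # fix them in one pass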
Source: Introducing Data Science by Cielin et al.,
Techniques in Data Preparation
Data Cleansing – Redundant White Space
• Whitespaces tend to be hard to detect but cause errors like other redundant characters would
• Most of us have lost several hours in projects because of a bug that was caused by whitespaces at
the end of a string
• We require the software program to join two keys and notice that observations are missing from
the output file
• After looking for days through the code, we realize that the cleaning during the ETL phase wasn’t
well executed, and keys in one table contained a whitespace at the end of a string
• This caused a mismatch of keys such as "FR " – "FR", dropping the observations that couldn’t be
matched
• Most programming languages provide string functions that will remove the leading and trailing
whitespaces
• For instance, Python provides the strip() method to remove leading and trailing whitespace
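• For example, stripping keys before the join avoids the "FR " vs. "FR" mismatch described above:
keys = ["FR ", " DE", "IN"]
clean_keys = [k.strip() for k in keys]   # ["FR", "DE", "IN"] – whitespace removed before matching
print("FR ".strip() == "FR")             # True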
Source: Introducing Data Science by Cielin et al.,
Techniques in Data Preparation
Data Cleansing
• Fixing Capital Letter Mismatches
Capital letter mismatches are common.
Most programming languages make a distinction between "India" and "india"
We can solve this problem by applying a function that returns both strings in lowercase or
uppercase, such as .lower() or .upper() in Python
"India".lower() == "india".lower() should result in true
• Impossible Values & Sanity Check
Sanity checks are another valuable type of data check
Here you check the value against physically or theoretically impossible values such as
people taller than 3 meters or someone with an age of 300 years
Sanity checks can be directly expressed with rules:
ageCheck = 0 <= age <= 120
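These two checks can be sketched directly in Python (the variable names are illustrative):

# case-insensitive comparison
same_country = "India".lower() == "india".lower()   # True

# sanity check against physically impossible values
age = 300
age_ok = 0 <= age <= 120                             # False, so the record needs correction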
Source: Introducing Data Science by Cielin et al.,
Techniques in Data Preparation
Data Cleansing ‐ Outliers
• An outlier is an observation that seems to be
distant from other observations
• One observation that follows a different logic
than the other observations
• The easiest way to find outliers is to use a plot or
a table with the minimum and maximum values
• The plot on the top shows no outliers
• The plot on the bottom shows possible outliers
on the upper side when a normal distribution is
expected
• The normal distribution, or Gaussian distribution,
is the most common distribution in natural
sciences
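A minimal sketch of such a min/max check, using simulated data and an illustrative 3-standard-deviation rule (the threshold is a judgment call, not a fixed standard):

import numpy as np
import pandas as pd

values = pd.Series(np.append(np.random.normal(50, 5, 200), 150.0))   # one suspicious value added

print(values.min(), values.max())      # a simple min/max table already hints at the outlier
z = (values - values.mean()) / values.std()
print(values[abs(z) > 3])              # observations far away from the rest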
Source: Introducing Data Science by Cielin et al.,
Techniques in Data Preparation
Data Cleansing – Missing Values
• Missing values aren’t necessarily wrong
• Missing values might be an indicator that
Something went wrong in the data collection or that an error happened
in the ETL process
• Which technique to use at what time depends on the particular case
• For instance, if we don't have observations to spare, omitting
an observation is probably not an option
• If the variable can be described by a stable distribution, we
could impute based on this
• However, a missing value may actually mean "zero"
• This can be the case in sales, for instance:
If no promotion is applied to a customer's basket, that customer's promo
field is missing, but it most likely means 0: no price cut
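A small sketch of the two options mentioned above, omitting versus imputing, using an illustrative "promo" column in pandas:

import numpy as np
import pandas as pd

df = pd.DataFrame({"promo": [5.0, np.nan, 2.5, np.nan]})

dropped = df.dropna(subset=["promo"])   # omit observations (only if we have data to spare)
imputed = df.fillna({"promo": 0.0})     # impute 0 when "missing" most likely means "no price cut"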
Source: Introducing Data Science by Cielin et al.,
Techniques in Data Preparation
Data Cleansing
• Deviations from Code Book
A code book is a description of your data, a form of metadata
It contains things such as the number of variables per observation, the number of
observations, and what each encoding within a variable means
For example, "0" equals "negative", "5" stands for "very positive"
A code book also tells the type of data you’re looking at:
Is it hierarchical, graph, something else?
To spot deviations, we look at the values that are present in set A (the data) but not in set B (the code book)
If we have multiple values to check, it's better to put them from the code book into
a table and use a difference operator to check the discrepancy between both tables (see the sketch below)
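A minimal sketch of such a difference check in Python; the encodings are invented for the example:

codebook_values = {0, 1, 2, 3, 4, 5}        # encodings documented in the code book
observed_values = {0, 1, 2, 3, 4, 5, 9}     # encodings actually found in the data

print(observed_values - codebook_values)    # in the data but not in the code book: {9}
print(codebook_values - observed_values)    # documented but never observed: set()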
Source: Introducing Data Science by Cielin et al.,
Techniques in Data Preparation
Data Cleansing
• Different Units of Measurement
When integrating two data sets, you have to pay attention to their respective units of
measurement
For example, prices of gasoline in the world
When we gather data from different data providers, some data sets can contain prices per gallon
and others prices per liter
A simple conversion will do the trick in this case
• Different Levels of Aggregation
Having different levels of aggregation is similar to having different types of measurement
For example, a data set containing data per week versus one containing data per month
This type of error is generally easy to detect, and summarizing (or the inverse, expanding)
the data sets will fix it
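Both fixes can be sketched briefly in pandas, with illustrative column names and a standard gallon-to-liter factor:

import pandas as pd

LITERS_PER_GALLON = 3.78541

prices = pd.DataFrame({"price_per_gallon": [3.50, 3.70]})
prices["price_per_liter"] = prices["price_per_gallon"] / LITERS_PER_GALLON   # unit conversion

daily = pd.DataFrame({"amount": range(60)},
                     index=pd.date_range("2024-01-01", periods=60, freq="D"))
monthly = daily.resample("M").sum()   # summarize daily data to the coarser monthly level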
Source: Introducing Data Science by Cielin et al.,
Techniques in Data Preparation
Data Transformation
• Relationships between an input variable and an output variable aren’t always
linear
• For instance, a relationship of the form y = ae^(bx)
• Taking logarithms simplifies the estimation problem dramatically: for this form, log(y) = log(a) + bx, which is linear in x
• Transforming the input variables greatly simplifies the estimation problem
• Other times you might want to combine two variables into a new variable
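The log-transform point above can be sketched on simulated data: y = a·e^(bx) becomes a straight line after taking the log of y, so an ordinary least-squares fit recovers a and b (the numbers are made up):

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(1, 10, 50)
y = 2.0 * np.exp(0.3 * x) * rng.lognormal(sigma=0.05, size=x.size)   # noisy exponential data

# log(y) = log(a) + b*x is linear, so a degree-1 polynomial fit estimates it
b, log_a = np.polyfit(x, np.log(y), deg=1)
print("a ≈", np.exp(log_a), "b ≈", b)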
Source: Introducing Data Science by Cielin et al.,
Techniques in Data Preparation
Data Transformation
Source: Introducing Data Science by Cielin et al.,
Techniques in Data Preparation
Data Transformation
• Reducing the Number of Variables
Sometimes there are too many variables and we need to reduce the number because
they don't add new information to the model
Having too many variables in the model makes the model difficult to handle
Certain techniques don’t perform well when you overload them with too many
input variables
For instance, all the techniques based on a Euclidean distance perform well only up to 10
variables
Data scientists use special methods to reduce the number of variables but retain
the maximum amount of data
Source: Introducing Data Science by Cielin et al.,
Techniques in Data Preparation
Data Transformation
• Reducing the Number of Variables
Figure shows how reducing the number of variables makes it easier to understand
the key values
It also shows how two variables account for 50.6% of the variation within the data
set (component1 = 27.8% + component2 = 22.8%)
These variables, called "component1" and "component2," are both combinations
of the original variables
They’re the principal components of the underlying data structure
Source: Introducing Data Science by Cielin et al.,
Techniques in Data Preparation
Data Transformation
Source: Introducing Data Science by Cielin et al.,
Techniques in Data Preparation
Data Transformation – Euclidean Distance
• Euclidean distance or "ordinary" distance is an extension of one of the first things we learn in trigonometry: the Pythagorean theorem
• If we know the lengths of the two sides next to the 90° angle of a right-angled triangle, we can easily derive the
length of the remaining side (the hypotenuse)
• The formula for this is hypotenuse = sqrt(side1² + side2²)
• The Euclidean distance between two points in a two-dimensional plane is calculated with a similar formula:
distance = sqrt((x1 − x2)² + (y1 − y2)²)
• If we want to expand this distance calculation to more dimensions, we add the coordinates of the points within
those higher dimensions to the formula
• For three dimensions we get distance = sqrt((x1 − x2)² + (y1 − y2)² + (z1 − z2)²)
• Euclidean Distance is used in
K‐nearest neighbors (classification) or K‐means (clustering) to find the “k closest points” of a particular sample point
Hierarchical clustering, agglomerative clustering (complete and single linkage) where you want to find the distance between clusters.
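The distance formulas above translate directly into a few lines of Python:

import math

def euclidean(p, q):
    # square root of the sum of squared coordinate differences, in any number of dimensions
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

print(euclidean((0, 0), (3, 4)))          # 5.0, the hypotenuse of a 3-4-5 right triangle
print(euclidean((1, 2, 3), (4, 6, 3)))    # 5.0 in three dimensions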
Source: Introducing Data Science by Cielin et al.,
Techniques in Data Preparation
Data Transformation
• Turning Variables into Dummies
Variables can be turned into dummy variables
Dummy variables can only take two values:
true(1) or false(0)
They’re used to indicate the absence of a categorical
effect that may explain the observation
In this case you’ll make separate columns for the
classes stored in one variable and indicate it with 1
if the class is present and 0 otherwise
An example is turning one column named Weekdays
into the columns Monday through Sunday
You use an indicator to show if the observation was
on a Monday; you put 1 on Monday and 0
elsewhere
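A minimal sketch of the Weekdays example with pandas (get_dummies does the column splitting):

import pandas as pd

df = pd.DataFrame({"Weekdays": ["Monday", "Tuesday", "Monday", "Sunday"]})
dummies = pd.get_dummies(df["Weekdays"], dtype=int)
print(dummies)   # one 0/1 column per weekday; 1 marks the day the observation falls on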
Source: Introducing Data Science by Cielin et al.,
Techniques in Data Preparation
Combining Data
• Data comes from several different places
• In this substep we focus on integrating
these different sources
• Data varies in size, type, and structure,
ranging from databases and Excel files to
text documents
• Different ways of combining data
Joining
Enriching an observation from one table with
information from another table
Appending or Stacking
Adding the observations of one table to those of
another table
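Both operations can be sketched with pandas; the table and column names are invented for the example:

import pandas as pd

clients = pd.DataFrame({"client_id": [1, 2], "city": ["Delhi", "Pune"]})
orders = pd.DataFrame({"client_id": [1, 1, 2], "amount": [250, 100, 400]})

joined = orders.merge(clients, on="client_id", how="left")   # joining: enrich orders with client info
extra = pd.DataFrame({"client_id": [3], "amount": [90]})
appended = pd.concat([orders, extra], ignore_index=True)     # appending: stack observations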
Source: Introducing Data Science by Cielin et al.,
Techniques in Data Preparation
Combining Data – Joining Tables
Source: Introducing Data Science by Cielin et al.,
Techniques in Data Preparation
Combining Data – Appending Tables
Source: Introducing Data Science by Cielin et al.,
Techniques in Data Preparation
Combining Data – Using Views to Simulate Joins & Appends
Source: Introducing Data Science by Cielin et al.,
Techniques in Data Preparation
Combining Data – Enriching Aggregated Values
Source: Introducing Data Science by Cielin et al.,
Data Exploration
Data Exploration
Overview
• Data Exploration
The fourth phase in the process is data exploration
The goal of this step is to gain a deeper understanding of
the data
We look for patterns, correlations, and deviations based
on visual and descriptive techniques
The insights gained here will enable us to start modeling
Source: Introducing Data Science by Cielin et al.,
Data Exploration
Overview
• During exploratory data analysis we take a deep dive into
the data
• Information becomes much easier to grasp when shown in
a picture
• We mainly use graphical techniques to gain an
understanding of our data and the interactions between
variables
• It’s common to discover anomalies we missed previously
We may have to take a step back and fix them
• The visualization techniques in this phase can range from
simple line graphs or histograms to more complex diagrams
such as Sankey diagram (see next slide)
Source: Introducing Data Science by Cielin et al.,
Data Exploration
Sankey diagram
• This Sankey diagram
visualizes the flow of energy:
supplies are on the left, and
demands are on the right
• Links show how varying
amounts of energy are
converted or transmitted
before being consumed or
lost
Data: Department of Energy & Climate Change via Tom Counsell
Source: https://observablehq.com/@d3/sankey‐diagram
Data Exploration
Overview
• A bar chart, a line plot, and a distribution are some of the graphs
used in exploratory analysis
Source: Introducing Data Science by Cielin et al.,
Data Exploration
Overview
• Drawing multiple plots
together can help us
understand the structure of
our data over multiple
variables.
Source: Introducing Data Science by Cielin et al.,
Data Exploration
Overview
• Histogram
In a histogram a variable is cut into discrete categories and the number of occurrences in each category
are summed up and shown in the graph
• Boxplot
The boxplot, on the other hand, doesn’t show how many observations are present but does offer an
impression of the distribution within categories
It can show the maximum, minimum, median, and other characterizing measures at the same time.
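A small sketch of both plots on simulated data with matplotlib:

import numpy as np
import matplotlib.pyplot as plt

values = np.random.normal(loc=50, scale=5, size=500)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.hist(values, bins=20)     # histogram: counts per discrete bin
ax2.boxplot(values)           # boxplot: median, quartiles, and potential outliers
plt.show()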
Source: Introducing Data Science by Cielin et al.,
Data Modeling
Data Modeling
Overview
• The most common data science modeling tasks are these:
Classifying—Deciding if something belongs to one category or another
Scoring—Predicting or estimating a numeric value, such as a price or probability
Ranking—Learning to order items by preferences
Clustering—Grouping items into most‐similar groups
Finding relations—Finding correlations or potential causes of effects seen in the data
Characterizing—Very general plotting and report generation from data
Source: Practical Data Science with R ‐ Nina Zumel, John Mount
Data Exploration
Modeling – Case Scenario
• The loan application problem is a classification problem:
We want to identify loan applicants who are likely to default
• Some common approaches in such cases are logistic regression and
tree‐based methods
• To solve this problem, we decide that a decision tree is most suitable
• Loan officers, who will use our model, are interested in knowing an
indication of how confident the model is in its decision:
Is this applicant highly likely to default, or only somewhat likely?
Source: Practical Data Science with R ‐ Nina Zumel, John Mount
Data Exploration
Modeling – Case Scenario
Let’s suppose that there is
an application for a one‐
year loan of $10,000
On the other hand,
suppose that there is an
application for a one‐year
loan of $15,000
Source: Practical Data Science with R ‐ Nina Zumel, John Mount
Data Exploration
Modeling – Case Scenario
• Let’s suppose that we discover the model shown in figure
• Let’s trace an example path through the tree
• Let’s suppose that there is an application for a one‐year loan of $10,000
• At the top of the tree (node 1), the model checks if the loan is for longer than 34
months
• The answer is “no,” so the model takes the right branch down the tree
This is shown as the highlighted branch from node 1
• The next question (node 3) is whether the loan is for more than $11,000
• Again, the answer is “no,” so the model branches right and arrives at leaf 3.
Source: Practical Data Science with R ‐ Nina Zumel, John Mount
Data Exploration
Modeling – Case Scenario
• Historically, 75% of loans that arrive at this leaf are good loans, so the
model recommends that you approve this loan, as there is a high
probability that it will be paid off
• On the other hand, suppose that there is an application for a one‐
year loan of $15,000
• In this case, the model would first branch right at node 1, and then
left at node 3, to arrive at leaf 2
• Historically, all loans that arrive at leaf 2 have defaulted, so the model
recommends that you reject this loan application.
Source: Practical Data Science with R ‐ Nina Zumel, John Mount
Model Evaluation
Model Evaluation
Overview
• Once we have a model ready, we need to determine if it meets our goals:
Is it accurate enough for our needs?
Does it generalize well?
Does it perform better than "the obvious guess"?
Does it perform better than whatever estimate we currently use?
Do the results of the model (coefficients, clusters, rules, confidence intervals, significances,
and diagnostics) make sense in the context of the problem domain?
• If the answer is "NO" to any of the above questions, it’s time to loop back
to the modeling step
Or decide that the data doesn’t support the goal you’re trying to achieve.
Source: Practical Data Science with R ‐ Nina Zumel, John Mount
Model Evaluation
Evaluation – Case Scenario
• In the loan application example, the first thing to check is whether the rules that
the model discovered make sense
• Looking at the decision tree structure, we don't notice any obviously strange rules,
so we can go ahead and evaluate the model's accuracy
• A good summary of classifier accuracy is the confusion matrix, which tabulates
actual classifications against predicted ones.
• We create a confusion matrix where rows represent actual loan status, and
columns represent predicted loan status
• conf_mat ["GoodLoan", "BadLoan"] refers to the element conf_mat[2, 1]
The number of actual good loans that the model predicted were bad
• The diagonal entries of the matrix represent correct predictions.
Source: Practical Data Science with R ‐ Nina Zumel, John Mount
Model Evaluation
Confusion Matrix
• A confusion matrix is a table that is often used to evaluate the
performance of a classification model (or "classifier")
• It works on a set of test data for which the true values are known
• There are two possible predicted classes: "YES" and "NO"
• If we were predicting the presence of a disease, for example, "yes"
would mean they have the disease, and "no" would mean they
don't have the disease.
• The classifier made a total of 165 predictions
E.g., 165 patients were being tested for the presence of that disease
• Out of those 165 cases, the classifier predicted "yes" 110 times,
and "no" 55 times.
• In reality, 105 patients in the sample have the disease, and 60
patients do not.
Source: https://www.dataschool.io/simple‐guide‐to‐confusion‐matrix‐terminology
Model Evaluation
Confusion Matrix
• True positives (TP):
These are cases in which the model predicted yes (they
have the disease), and the patients actually do have the
disease.
• True negatives (TN):
The model predicted no, and they don't have the disease.
• False positives (FP):
The model predicted YES, but they don't actually have the
disease. (Also known as a "Type I error.")
• False negatives (FN):
The model predicted NO, but they actually do have the
disease. (Also known as a "Type II error.")
Source: https://www.dataschool.io/simple‐guide‐to‐confusion‐matrix‐terminology
Model Evaluation
Confusion Matrix
• Accuracy – Overall, how often is the classifier correct?
(TP+TN)/total = (100+50)/165 = 0.91
• Misclassification Rate (Error Rate) – Overall, how often is it wrong?
(FP+FN)/total = (10+5)/165 = 0.09 (equivalent to 1 minus Accuracy)
• True Positive Rate (Sensitivity or Recall) – When it's actually YES, how often does it predict YES?
TP/actual YES = 100/105 = 0.95
• False Positive Rate – When it's actually NO, how often does it predict YES?
FP/actual NO = 10/60 = 0.17
Source: https://www.dataschool.io/simple‐guide‐to‐confusion‐matrix‐terminology
Model Evaluation
Evaluation – Case Scenario
Source: Practical Data Science with R ‐ Nina Zumel, John Mount
Model Evaluation
Evaluation – Case Scenario
Confusion matrix (n = 1000; "Yes" = bad loan is the positive class):
Actual Bad Loan: TP = 41, FN = 259 (300 actual bad loans)
Actual Good Loan: FP = 13, TN = 687 (700 actual good loans)
Predicted Bad Loan: 54; Predicted Good Loan: 946
• Accuracy – (TP+TN)/total = (41+687)/1000 = 0.728
• Misclassification Rate (Error Rate) – (FP+FN)/total = (13+259)/1000 = 0.272 (equivalent to 1 minus Accuracy)
• True Positive Rate (Sensitivity or Recall) – TP/actual yes = 41/300 = 0.137
• False Positive Rate – FP/actual no = 13/700 = 0.01857
• True Negative Rate (Specificity) – TN/actual no = 687/700 = 0.98143 (equivalent to 1 minus False Positive Rate)
• Precision – TP/predicted yes = 41/54 = 0.759
• Prevalence – actual yes/total = 300/1000 = 0.30
Since the prediction target is Bad Loans: "Yes" means it is a bad loan, and "No" means it is not a bad loan
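The figures above can be recomputed from the four counts, which is a useful cross-check (here "yes" = bad loan is the positive class):

TP, FN, FP, TN = 41, 259, 13, 687
total = TP + FN + FP + TN          # 1000 loans

accuracy = (TP + TN) / total       # 0.728
error_rate = (FP + FN) / total     # 0.272
sensitivity = TP / (TP + FN)       # true positive rate: 41/300 ≈ 0.137
specificity = TN / (TN + FP)       # true negative rate: 687/700 ≈ 0.981
precision = TP / (TP + FP)         # 41/54 ≈ 0.759
prevalence = (TP + FN) / total     # 300/1000 = 0.30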
Source: Practical Data Science with R ‐ Nina Zumel, John Mount
Model Presentation
Model Presentation
Overview
• This phase involves presenting our results to stakeholders
• We also document the model for those in the organization who are
responsible for using, running, and maintaining the model once it has
been deployed
• Different audiences require different kinds of information
• Business‐oriented audiences want to understand the impact of our
findings in terms of business metrics
Source: Practical Data Science with R ‐ Nina Zumel, John Mount
Model Presentation
Case Scenario
• In the loan example, our business audience is interested in knowing how our
model can reduce chargeoffs
Chargeoff = the money that the bank loses to bad loans
• Suppose our model identified a set of bad loans that amounted to 22% of the
total money lost to defaults
• Then our presentation should emphasize that the model can potentially reduce
the bank’s losses by that amount
See figure in the next slide
• We may also give other interesting findings or recommendations, such as:
New car loans are much riskier than used car loans
Most losses are tied to bad car loans and bad equipment loans (assuming that the audience didn’t
already know these facts)
Source: Practical Data Science with R ‐ Nina Zumel, John Mount
Model Presentation
Case Scenario
• Our presentation for the loan
officers should emphasize:
How should they interpret the model?
What does the model output look like?
If the model provides a trace of which
rules in the decision tree executed
How do they read that?
If the model provides a confidence
score in addition to a classification
How should they use the confidence
score?
When might they potentially overrule
the model?
Source: Practical Data Science with R ‐ Nina Zumel, John Mount
Model Deployment
Concept
• Finally, the model is put into operation
• In many organizations, the data scientist no longer has primary responsibility for
the day‐to‐day operation of the model
• But you still should ensure that the model will run smoothly and won’t make
disastrous decisions
• You also want to make sure that the model can be updated as its environment
changes
• And in many situations, the model will initially be deployed in a small pilot
program
• The test might bring out issues that you didn’t anticipate, and you may have to
adjust the model accordingly
Source: Practical Data Science with R ‐ Nina Zumel, John Mount
Model Deployment
Case Scenario
• When we deploy the model, we might find that loan officers
frequently override the model in certain situations because it
contradicts their intuition
• Is their intuition wrong? Or
• Is our model incomplete? Or
• More positively, our model may perform so successfully that the
bank wants us to extend it to home loans as well
Source: Practical Data Science with R ‐ Nina Zumel, John Mount
Thank You!
Introduction to Data Science
Data Science Process
Dr. Ramakrishna Dantu
Associate Professor, BITS Pilani
Introduction to Data Science
Data Science Process
• Data Science Methodology
Business understanding
Data Requirements
Data Acquisition
Data Understanding
Data preparation
Modelling
Model Evaluation
Deployment and feedback
• Case Study
• Data Science Proposal
Samples
Evaluation
Review Guide
Data Science Process with a
Case Study
Data Science Process
Stages in Data Science Process
Business Understanding → Analytic Approach → Data Requirements → Data Collection → Data Understanding → Data Preparation → Data Modeling → Evaluation → Deployment → Feedback (the process is cyclical)
Source: CognitiveClass
Data Science Process
10 Questions the process aims to answer
• From Problem to Approach
1. What is the problem that you are trying to solve?
2. How can you use data to answer the questions?
• Working with Data
3. What data do you need to answer the question?
4. Where is the data coming from? Identify all Sources. How will you acquire it?
5. Is the data that you collected representative of the problem to be solved?
6. What additional work is required to manipulate and work with the data?
• Delivering the Answer
7. In what way can the data be visualized to get to the answer that is required?
8. Does the model used really answer the initial question or does it need to be adjusted?
9. Can you put the model into practice?
10. Can you get constructive feedback into answering the question?
Source: CognitiveClass
Data Science Process
Five Stages of Data Science Process
• From Problem to Approach
Business Understanding
Analytic Approach
• From Requirements to Collection
Data Requirements
Data Collection
• From Understanding to Preparation
Data Understanding
Data Preparation
• From Modeling to Evaluation
Modeling
Evaluation
• From Deployment to Feedback
Deployment
Feedback
Source: CognitiveClass
Case Study
Case Study
Hospital Readmissions
Image Source: https://medium.com/nwamaka‐imasogie/predicting‐hospital‐readmission‐using‐nlp‐5f0fe6f1a705
Case Study
Hospital Readmissions
• Scenario
There is a limited budget for providing healthcare to the public
Hospital readmissions for recurring problems are considered a sign of failure
in the healthcare system
There is a dire need to properly address the patient condition prior to the initial
patient discharge
The core question is:
What is the best way to allocate these funds to maximize their use in providing quality care?
A successful data science program will deliver:
better patient care by giving physicians new tools to incorporate timely, data‐driven
information into patient care decisions
Source: CognitiveClass
From Problem to
Analytic Approach
Source: CognitiveClass
Case Study
Data Problem to Analytic Approach
• Key topics covered
The need to understand and prioritize the business goal.
The way stakeholder support influences a project.
The importance of selecting the right model.
When to use a predictive, descriptive, or classification model.
Source: CognitiveClass
BUSINESS
UNDERSTANDING
Source: CognitiveClass
Case Study
Business Understanding (Concept)
• What is the problem that you are trying to solve?
• Identify the goal
• Identify and define the objectives that support the goal.
• Ask questions
• Seek clarifications
• Get the stakeholder buy‐in and support
Source: CognitiveClass
Case Study
Business Understanding (Application)
• Case study asks the following question:
What is the best way to allocate the limited healthcare budget to maximize its use
in providing quality care?
• Goals and Objectives
Define the GOALS
To provide quality care without increasing cost
Define the OBJECTIVES
Review the process to identify inefficiencies
Source: CognitiveClass
Case Study
Business Understanding (Application)
• Examining hospital readmissions
It was found that approximately 30% of individuals who finish rehab treatment would be readmitted
to a rehab center within one year
50% would be readmitted within five years
After reviewing some records, it was found that patients with heart failure were high on the list of
readmissions.
Source: CognitiveClass
Case Study
Business Understanding (Application)
• It was found that a decision tree model can be applied to investigate this
scenario to determine the reason for this phenomenon
• It is important to gain business insight for the analytics team to implement the
data science project
Data scientists proposed and organized an on‐site workshop
• The business sponsor's involvement throughout the project was critical
• The sponsor:
Set the overall direction
Remained committed and advised when required
Got the necessary support
(Image: Pilot Project Kickoff)
Source: CognitiveClass
Case Study
Business Understanding (Application)
• Finally, four business requirements were identified for whatever
model would be built
Case study question
What is the best way to allocate the limited healthcare budget to maximize its use in providing
quality care?
Business requirements
To predict the risk of readmission
To predict readmission outcomes for those patients with Congestive Heart Failure
To understand the combination of events that led to the predicted outcome
To apply an easy‐to‐understand process to new patients, regarding their readmission risk
Source: CognitiveClass
ANALYTIC
APPROACH
Source: CognitiveClass
Case Study
Analytic Approach (Concept)
• Available data:
Patient data, readmissions data, CHF data, etc.
• How can we use data to answer the questions?
• Choose Analytic approach based on the type of question
Descriptive
Current status
Diagnostic (Statistical Analysis)
What happened?
Why is this happening?
Predictive (Forecasting)
What if these trends continue?
What will happen next?
Prescriptive
How do we solve it?
Source: CognitiveClass
Case Study
Analytic Approach (Concept)
• The analytic approach can be selected once a clear understanding of the
question is established:
If the question is to determine probabilities of an action
A Predictive model can be used
If the question is to show relationships
A Descriptive model may be used
If the question requires a yes / no answer
A Classification approach to predicting a response is appropriate
Decision tree classification
o Categorical outcome
o Explicit "decision path" showing conditions leading to high risk
o Likelihood of classified outcome
o Easy to understand and apply
Source: CognitiveClass
Case Study
Analytic Approach (Application)
• In the case study, a decision tree classification model was used to identify the combination
of conditions leading to each patient's outcome
• Examining the variables in each of the nodes along each path to a leaf, led to a respective
threshold value to split the tree
E.g., Age >= 60
• A decision tree classifier provides both the predicted outcome and the likelihood of
that outcome, based on the proportion of the dominant outcome, yes or no, in each group
• The analysts can obtain the readmission risk, or the
likelihood of a yes for each patient
• If the dominant outcome is yes, then the risk is
simply the proportion of yes patients in the leaf
• If it is no, then the risk is 1 minus the proportion of
no patients in the leaf
Source: CognitiveClass
Case Study
Analytic Approach (Application) – Why decision tree??
• For non‐data scientists, a decision tree classification model is easy to understand
and apply, to score new patients for their risk of readmission
• Clinicians can readily see what conditions are causing a patient to be scored as
high‐risk
• Multiple models can be built and applied at various points during hospital stay
• This gives a moving picture of the patient's risk and how it is evolving with the
various treatments being applied
• For these reasons, the decision tree approach
was chosen for building the CHF readmission
model
Source: CognitiveClass
From Data Requirements
to Collection
Source: CognitiveClass
Case Study
From Data Requirements to Collection
• Key topics covered
The significance of defining the data requirements for the model.
Why the content, format, and representation of data matter
The importance of identifying the correct sources for data collection
How to handle unavailable and redundant data.
To anticipate the needs of future stages in the process.
Source: CognitiveClass
DATA
REQUIREMENTS
Source: CognitiveClass
Case Study
Data Requirements (Concept)
• If our goal is to make a "Biryani" but we don't have the right
ingredients, then the success of making a good Biryani will be
compromised
• If the "recipe" is the problem to be solved, then the data are the
ingredients
• The data scientist must ask the following questions:
What are the data requirements
How to obtain or collect them
How to understand or use them
How to prepare the data to meet the desired outcome
• Based on the understanding of the problem and the analytic approach chosen, it is important to define the
data requirements
• Here, the analytic approach is decision tree classification, so data requirements should be defined accordingly
• This involves:
Identifying data content, formats, and data sources needed for the initial data collection
Source: CognitiveClass
Case Study
Data Requirements (Application)
• Data requirements for the case study included selecting a suitable list of patients from the
health insurance providers member base
• In order to put together patient clinical histories, three criteria were identified for selecting
the patient cohort
1) A patient must be admitted as an in‐patient within
health insurance provider's service area
2) Patient's primary diagnosis should be CHF for one full
year
3) Prior to the primary admission for CHF, a patient must
have had at least 6 months of continuous enrollment
• Disqualifying conditions (outliers)
CHF patients who have been diagnosed with other serious conditions are excluded because these conditions may result
in above-average readmission rates and may therefore distort the results
Source: CognitiveClass
Case Study
Data Requirements (Application) – Defining the data
• The content and format suitable for decision tree classifier needs to be
defined
• Format
Transactional format
This model requires one record per patient
Columns of the record represent dependent and predictor variables
• Content
To model the readmission outcome, data should represent all aspects of the patient's
clinical history. This includes:
Authorizations, primary, secondary and tertiary diagnoses, procedures, prescriptions and other
services provided during hospitalization or visits by patients / doctors.
Source: CognitiveClass
Case Study
Data Requirements (Application)
• A given patient can have thousands of records that represent all their attributes
• The data analytics specialists collected the transaction records from patient
records and created a set of new variables to represent that information
• Creating these variables was a task for the data preparation phase, so it is important to anticipate the
needs of the next phases
Source: CognitiveClass
DATA
COLLECTION
Source: CognitiveClass
Case Study
Data Collection (Concept)
• What occurs during data collection?
Assess to determine whether or not we have the data we need.
Decide whether more data or less data is required.
Revise data requirements if needed.
Assess content, quality and initial insights of data.
Identify gaps in data.
How to extract, merge, and archive data?
• Once data collection is completed, the Data Scientist performs an assessment to make sure all the
required data has been obtained
• As with the purchase of ingredients for making a meal, some ingredients may be out of season and more
difficult to obtain or cost more than originally planned.
• At this stage, the data requirements are reviewed and a decision is made as to whether more or less data is
required for the collection.
Source: CognitiveClass
Case Study
Data Collection (Concept)
• The collected data is explored using descriptive statistics and visualization to assess its
content and quality
• The gaps in the data are identified and plans for filling or replacement must be made
• Once this step is complete, essentially, the ingredients are now ready for washing and
cutting
• Now, let's look at the "data collection" step in the context of the case study
Source: CognitiveClass
Case Study
Data Collection – Available Data (Application)
• This case study required data about:
Demographics, clinical and coverage information of patients, provider information,
claims records, as well as pharmaceutical and other information related to all the
diagnoses of the CHF patients.
• Available data sources
Corporate data warehouse
Single source of medical, claims, eligibility, provider, &
member information
In‐patient record system
Claim patient system
Disease management program information
Source: CognitiveClass
Case Study
Data Collection – Unavailable Data (Application)
• This case study also required other data, but not available
Pharmaceutical records
Information on drugs
• This data source was not yet integrated with the rest of
the data sources
• In such situations, it is okay to postpone decisions about unavailable data and to try to capture the data later
• For example:
This can happen even after obtaining intermediate results from predictive modeling
If the results indicate that drug information may be important for a good model, you will spend time trying to get it.
• However, it turned out that they could build a reasonably good model without this
information about drugs.
Source: CognitiveClass
Case Study
Data Collection – Merging Data (Application)
• Data Pre‐processing
Database administrators and programmers often work
together to extract data from different sources and then
combine them
Redundant data can be deleted, and the remaining data is made available to the
next stage of the methodology:
the "Data Understanding" phase
At this stage, scientists and analysts can discuss ways to better
manage their data by automating certain database processes
to facilitate data collection
• Next, we move on to understanding the data
Source: CognitiveClass
From Data Understanding
to Data Preparation
Source: CognitiveClass
Case Study
From Data Understanding to Preparation
• Key topics covered
The importance of descriptive statistics.
How to manage missing, invalid, or misleading data.
The need to clean data and sometimes transform it.
The consequences of bad data for the model.
Data understanding is iterative
We learn more about data the more we study it.
Source: CognitiveClass
DATA
UNDERSTANDING
Source: CognitiveClass
Case Study
Data Understanding (Concept)
• This section of the methodology answers the question
Is the data you collected representative of the problem to be solved?
• Descriptive statistics
Univariate statistics
Pairwise correlation
Histograms
• Assess data quality
Missing values
Invalid data
Misleading data
• From the data collected, we should understand the variables and their characteristics
using Exploratory Data Analysis and Descriptive Statistics
• Sometimes we may have to perform pre‐processing operations on the data
Source: CognitiveClass
Case Study
Data Understanding (Application)
• First, Univariate Statistics
Basic statistics included univariate statistics for each
variable, such as:
mean, median, minimum, maximum, and standard deviation
• Second, Pairwise Correlations
Pairwise correlations were used to determine the
degree of correlation between the variables
Variables that are highly correlated are essentially redundant
Only one variable of such a pair is then kept for the modeling (see the sketch below)
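A minimal sketch of both steps with pandas on synthetic patient-like columns (the column names are invented, not the case study's actual schema):

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.integers(40, 90, size=200),
    "length_of_stay": rng.integers(1, 15, size=200),
    "prior_admissions": rng.integers(0, 5, size=200),
})

print(df.describe())   # mean, std, min, max and quartiles for each variable
print(df.corr())       # pairwise correlations; highly correlated pairs are candidates for removal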
Source: CognitiveClass
Case Study
Data Understanding (Application)
• Third, Histograms
The histograms of the variables were examined to understand their distributions
Histograms are a good way to understand how the values of a variable are distributed
They help to know what kind of data preparation may be needed to make the variable more useful in a model
For example:
If a categorical variable contains too many different values to be meaningful in a model, the histogram can help decide how to consolidate those values
Source: CognitiveClass
Case Study
Data Understanding (Application)
• Looking at data quality
Univariate statistics and histograms are also used to
assess the quality of the data.
On the basis of the data provided, some values can be re‐
coded or deleted if necessary
E.g., if a particular variable has a lot of missing values, we may
drop the variable from the model
The question then arises as to whether "missing" means
something
Sometimes a missing value means "no" or "0" (zero), or sometimes simply "we do not
know".
Or if a variable contains invalid or misleading values. For example
A numeric variable called "age" containing 0 to 100 and 999, where "triple‐9" actually means
"missing", will be treated as a valid value unless we have corrected it
Source: CognitiveClass
Case Study
Data Understanding (Application)
• This is an iterative process
Originally, the meaning of CHF admission was decided on
the basis of a primary diagnosis of CHF
However, preliminary data analysis and clinical experience
revealed that CHF admissions were also based on other diagnoses
The initial definition did not cover all cases of CHF admissions
This made the data scientists return to the data collection phase
They added secondary and tertiary diagnoses, and created a more complete definition of
CHF admission
This is one example of the iterative processes in the methodology
The more we work with the problem and the data, the more we learn and the more the
model can be adjusted, which ultimately leads to a better resolution of the problem.
Source: CognitiveClass
DATA
PREPARATION
Source: CognitiveClass
Case Study
Data Preparation (Concept)
• In a way, data preparation is like removing dirt and washing vegetables
• Compared to data collection and understanding, data preparation is the most time consuming
phase – 70% to 90% of overall project time
• Automating data collection and preparation can reduce this to about 50%.
• The data preparation phase of the methodology answers the question:
What are the ways in which data is prepared?
Address missing or invalid values
Remove duplicates
Format data properly
• Transforming data
Process of getting data into a state where it may be easier to work with
• Feature Engineering
Process of using domain knowledge of data to create features that make ML algorithms work
Feature is a characteristic that might help solving a problem
Source: CognitiveClass
Case Study
Data Preparation (Concept)
Source: CognitiveClass
Case Study
Data Preparation (Concept)
• Feature Engineering
Feature engineering is also part of the data preparation
Use domain knowledge on data to create features that work with machine learning
algorithms
A feature is a property that can be
useful for solving a problem
The features in the data are
important for the predictive models
and influence the desired results.
Source: CognitiveClass
Case Study
Data Preparation (Application)
• Defining Congestive Heart Failure
In the case study, first step in the data preparation stage was to actually define
what CHF means
First, the set of diagnosis‐related group codes needed to be identified, as CHF
implies certain kinds of fluid buildup
Data scientists also needed to consider that CHF is only one type of heart failure
Clinical guidance was needed to get the right codes for CHF
Source: CognitiveClass
Case Study
Data Preparation (Application)
Source: CognitiveClass
Case Study
Data Preparation (Application)
• Defining re‐admission criteria for the same condition
Next step involved defining the criteria for CHF
readmissions
The timing of events needed to be evaluated in
order to define whether a particular CHF
admission was an initial event (called an index
admission), or a CHF-related re-admission
Based on clinical expertise, a time period of 30
days was set as the window for readmission
relevant for CHF patients, following the
discharge from the initial admission
Source: CognitiveClass
Case Study
Data Preparation (Application)
• Aggregating transactional records
Next, the records that were in
transactional format were aggregated
Meaning that the data included multiple
records for each patient
Transactional records included claims
submitted for physician, laboratory,
hospital, and clinical services
Also included were records describing all
the diagnoses, procedures, prescriptions,
and other information about in‐patients
and out‐patients
Source: CognitiveClass
Case Study
Data Preparation (Application)
• Aggregating data to patient level
A given patient could have hundreds or
even thousands of records, depending on
their clinical history
All the transactional records were
aggregated to the patient level, yielding a
single record for each patient
This is required for the decision‐tree
classification method used for modeling
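The roll-up can be sketched with a pandas groupby; the claim records and column names below are invented stand-ins for the case study's data:

import pandas as pd

claims = pd.DataFrame({
    "patient_id": [1, 1, 1, 2, 2],
    "claim_type": ["lab", "hospital", "lab", "physician", "hospital"],
    "cost": [120, 5000, 90, 200, 7000],
})

per_patient = claims.groupby("patient_id").agg(
    n_claims=("claim_type", "size"),                                   # how many transactions per patient
    n_hospital_claims=("claim_type", lambda s: (s == "hospital").sum()),
    total_cost=("cost", "sum"),
).reset_index()                                                        # one record per patient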
Source: CognitiveClass
Case Study
Data Preparation (Application)
• Aggregating data to patient level
Many new columns were created
representing the information in the
transactions
For example
Frequency and most recent visits to doctors,
clinics and hospitals with diagnoses,
procedures, prescriptions, and so forth
Co‐morbidities with CHF were also
considered, such as:
Diabetes, hypertension, and many other
diseases and chronic conditions that could
impact the risk of re‐admission for CHF
Source: CognitiveClass
Case Study
Data Preparation (Application)
• Do we need more data or less
data?
A literature review on CHF was also
undertaken to see whether any
important data elements were
overlooked
Such as co‐morbidities that had not yet
been accounted for
The literature review involved looping
back to the data collection stage to add
a few more indicators for conditions
and procedures
Source: CognitiveClass
Case Study
Data Preparation (Application)
• Creating new variables
Aggregating the transactional data at the
patient level, involved
merging it with the other patient data,
including their demographic information, such
as age, gender, type of insurance, and so forth
The result was the creation of one table
containing a single record per patient
Columns represent the attributes about the
patient in his or her clinical history
These columns would be used as variables
in the predictive modeling
Source: CognitiveClass
Case Study
Data Preparation (Application)
• Completing the data set
Here is a list of the variables that were ultimately used in building the model
Measures
o Gender, Age, Primary Diagnosis Related Group (DRG), Length of Stay, CHF Diagnosis Importance (primary, secondary, tertiary), Prior admissions, Line of business
Diagnosis Flags (Y/N)
o CHF, Atrial fibrillation, Pneumonia, Diabetes, Renal failure, Hypertension
• Dependent Variable
CHF readmission within 30 days following discharge from CHF hospitalization (Yes/No)
Source: CognitiveClass
Case Study
Data Preparation (Application)
• Creating training and testing data
sets
The data preparation stage resulted in a
cohort of 2,343 patients
These patients met all of the criteria for
this case study
The data (patient records) were then split
into training and testing sets for building
and validating the model, respectively
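A minimal sketch of such a split with scikit-learn, on a synthetic stand-in for the 2,343-patient cohort; the 70/30 ratio is illustrative, not the case study's actual split:

import pandas as pd
from sklearn.model_selection import train_test_split

patients = pd.DataFrame({"age": range(100), "readmitted_30d": [0, 1] * 50})

train, test = train_test_split(
    patients, test_size=0.3,
    stratify=patients["readmitted_30d"],   # keep the yes/no proportions similar in both sets
    random_state=42,
)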
Source: CognitiveClass
From Modeling to
Evaluation
Source: CognitiveClass
Case Study
From Data Modeling to Evaluation
• Key topics covered
The difference between descriptive and predictive models
The role of training sets and test sets
The importance of asking if the question has been answered
Why diagnostic measures tools are needed
The purpose of statistical significance tests
That modeling and evaluation are iterative processes
Source: CognitiveClass
• Modeling
In what way can the data be visualized to get to the answer that is required?
DATA
MODELING
Source: CognitiveClass
Case Study
Data Modeling (Concept)
• Data modeling focuses on developing models that are either descriptive or
predictive
• Descriptive Models
What happened?
Use statistics
• Predictive Models
What will happen?
Use Machine Learning
Try to generate yes/no type outcomes
A training set is used for developing the predictive model
• A training set
contains historical data in which the outcomes are already known
acts like a gauge to determine if the model needs to be calibrated
• The data scientist will try different algorithms to ensure that the variables in
play are actually required
Source: CognitiveClass
Case Study
Data Modeling (Concept)
• Success of data compilation, data preparation, and modeling
depends on:
the understanding of problem and the analytical approach being
taken
• Like the quality of ingredients in cooking, the quality of data
sets the stage for the outcome
If data quality is bad, the outcome will be bad
• Constant refinement, adjustment, and tweaking within each
step are essential to ensure a solid outcome
• The end goal is to build a model that can answer the original
question
In this stage of the methodology, model evaluation, deployment, and
feedback loops ensure that the model is relevant and the question is
really answered
Source: CognitiveClass
Case Study
Data Modeling (Application)
• There are many aspects to model building – one of those is
parameter tuning to improve the model
• With a prepared training set, the first decision tree classification
model for CHF readmission can be built
• We are looking for patients with high‐risk readmission, so the
outcome of interest will be CHF readmission equals "yes"
• In this first model, overall accuracy in classifying the yes and no
outcomes was 85%
• This sounds good, but the model correctly identified only 45% of the "yes" outcomes
Meaning, when it is actually YES, the model predicted YES only 45% of the time
• The question is:
How could the accuracy of the model be improved in predicting the yes
outcome?
Source: CognitiveClass
Case Study
Data Modeling (Application)
• For decision tree classification, the best parameter to adjust is
the relative cost of misclassified yes and no outcomes
• Type I Error or False positive
When a true, non‐readmission is misclassified, and action is taken to reduce
that patient's risk, the cost of that error is the wasted intervention
• Type II Error or False negative
When a true readmission is misclassified, and no action is taken to reduce that risk
The cost of this error is the readmission and all its attendant costs, plus the trauma to the patient
• The costs of the two different kinds of misclassification errors can be quite different
So, we adjust the relative weights of misclassifying the yes and no outcomes
• The default is 1-to-1, but the decision tree algorithm allows the setting of a higher value for yes
(Figure: predicted vs. actual condition – True Positive (Power), False Positive (Type I Error), False Negative (Type II Error), True Negative)
Source: CognitiveClass
Case Study
Data Modeling (Application)
• For the second model, the relative cost was set at 9‐to‐1
Ratio of the cost of a misclassified yes (false negative) to a misclassified no (false positive)
• This is a very high ratio, but gives more insight to the model's
behavior
• This time the model correctly classified 97% of the YES, but at the
expense of a very low accuracy on the NO, with an overall
accuracy of only 49%
• This was clearly not a good model
• The problem with this outcome is the large number of false‐positives
Meaning, a true, non‐readmission is misclassified as re‐admission
This would recommend unnecessary and costly intervention for patients, who would not have been re‐admitted
anyway
• Therefore, the data scientist needs to try again to find a better balance between the yes and no
accuracies
Source: CognitiveClass
Case Study
Data Modeling (Application)
• The data scientist needs to try again to find a better balance between the yes and no
accuracies
• For the third model, the relative cost was set at a more reasonable 4‐to‐1
This time, the overall accuracy was 81%
Yes accuracy was 68%
This is called sensitivity by statisticians
No accuracy was 85%
This is called specificity
• In medical diagnosis
Test sensitivity is the ability of a test to correctly identify those with the disease (true positive rate)
Test specificity is the ability of the test to correctly identify those without the disease (true negative rate).
• This is the optimum balance that can be obtained with a rather small training set
By adjusting the relative cost of misclassified yes and no outcomes parameter
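The case study used a different toolset, but the same idea, weighting a missed readmission more heavily than a wasted intervention, can be sketched with scikit-learn's class_weight on synthetic data:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = (X[:, 0] + rng.normal(scale=1.0, size=500) > 1.0).astype(int)   # 1 = readmitted (synthetic)

# mirror a 4-to-1 relative misclassification cost: errors on "yes" weigh 4x errors on "no"
model = DecisionTreeClassifier(class_weight={0: 1, 1: 4}, max_depth=4, random_state=0)
model.fit(X, y)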
Source: CognitiveClass
Case Study
Confusion Matrix
• A confusion matrix is a table that is often used to evaluate the
performance of a classification model (or "classifier")
• It works on a set of test data for which the true values are known
• There are two possible predicted classes: "YES" and "NO"
• If we were predicting the presence of a disease, for example, "yes"
would mean they have the disease, and "no" would mean they
don't have the disease.
• The classifier made a total of 165 predictions
E.g., 165 patients were being tested for the presence of that disease
• Out of those 165 cases, the classifier predicted "yes" 110 times,
and "no" 55 times.
• In reality, 105 patients in the sample have the disease, and 60
patients do not.
Source: https://www.dataschool.io/simple‐guide‐to‐confusion‐matrix‐terminology
Case Study
Confusion Matrix
• True positives (TP):
These are cases in which the model predicted yes (they
have the disease), and the patients actually do have the
disease.
• True negatives (TN):
The model predicted no, and they don't have the disease.
• False positives (FP):
The model predicted YES, but they don't actually have the
disease. (Also known as a "Type I error.")
• False negatives (FN):
The model predicted NO, but they actually do have the
disease. (Also known as a "Type II error.")
Source: https://www.dataschool.io/simple‐guide‐to‐confusion‐matrix‐terminology
Case Study
Confusion Matrix
• Accuracy – Overall, how often is the classifier correct?
(TP+TN)/total = (100+50)/165 = 0.91
• Misclassification Rate (Error Rate) – Overall, how often is it wrong?
(FP+FN)/total = (10+5)/165 = 0.09 (equivalent to 1 minus Accuracy)
• True Positive Rate (Sensitivity or Recall) – When it's actually YES, how often does it predict YES?
TP/actual YES = 100/105 = 0.95
• False Positive Rate – When it's actually NO, how often does it predict YES?
FP/actual NO = 10/60 = 0.17
Source: https://www.dataschool.io/simple‐guide‐to‐confusion‐matrix‐terminology
Case Study
Data Modeling (Application)
• Evaluation
Does the model used really answer the initial question or does it need to be
adjusted?
DATA
EVALUATION
Source: CognitiveClass
Case Study
Model Evaluation (Concept)
• Quality of the developed model is assessed
• Before model gets deployed evaluate whether the model really
answers the initial question
• Two phases of evaluation
Diagnostic measure phase
Ensure model is working as intended.
Statistical significance phase
Ensure data is being properly handled and interpreted within the model.
Source: CognitiveClass
Case Study
Model Evaluation (Concept)
• Diagnostic measurement phase
Ensures that the model works as intended
If the model is a predictive model
A decision tree can be used to assess whether the response
provided by the model matches the original design
This allows areas to be displayed where adjustments are required
If the model is a descriptive model that evaluates
relationships
A set of tests with known results can be applied and the model
refined as necessary
• Statistical significance test phase
Can be applied to the model to ensure that the model data is processed and interpreted correctly
This is to avoid a second unnecessary assumption when the answer is revealed
Source: CognitiveClass
Case Study
Model Evaluation (Application)
• Let's look at one way to find the optimal model
through a diagnostic measure based on tuning
one of the parameters in model building
• Specifically we'll see how to tune the relative cost
of misclassifying yes and no outcomes.
• As shown in this table, four models were built
with four different relative misclassification costs.
• As we see, each value of this model‐building parameter increases the true‐
positive rate (or sensitivity) of the accuracy in predicting yes, at the expense of
lower accuracy in predicting no, that is, an increasing false‐positive rate.
Source: CognitiveClass
Case Study
Model Evaluation (Application)
• Which model is best based on tuning this parameter?
• Risk-reducing intervention – two scenarios
The intervention cannot be applied to all CHF patients, because many of them would not have been readmitted anyway; this would not be cost-effective
The intervention itself would not be as effective in improving patient care if not enough high-risk CHF patients were targeted
• So, how do we determine which model was optimal?
• This can be done with the help of an ROC curve (receiver operating characteristic curve)
• ROC curve is a graph showing the performance of a classification model at all classification
thresholds
• This curve plots two parameters:
True Positive Rate
False Positive Rate
Source: CognitiveClass
Case Study
Receiver Operating Characteristic (ROC) Curve
• ROC curves are used to show the connection/trade‐off between clinical sensitivity and
specificity for every possible cut‐off (threshold) for a test or a combination of tests
• The area under an ROC curve is a measure of the usefulness of a test in general
A greater area means a more useful test
• ROC curves are used in clinical biochemistry to choose the most appropriate cut‐off for a
test
• The best cut‐off has the highest true positive rate together with the lowest false positive
rate.
• ROC curves were first employed in the study of discriminator systems for the detection of
radio signals in the presence of noise in the 1940s, following the attack on Pearl Harbor.
• The initial research was motivated by the desire to determine how the US RADAR
"receiver operators" had missed the Japanese aircraft.
Case Study
Model Evaluation (Application)
• Cost Study – Using the ROC Curve
An ROC curve plots TPR vs. FPR at different classification thresholds
Lowering the classification threshold classifies more items as positive, thus increasing both False Positives and True Positives
The curve quantifies how well a binary classification model separates the yes and no outcomes as some discrimination criterion is varied
In this case, the criterion is the relative misclassification cost
The optimal model is the one giving the maximum separation between the blue ROC curve and the red baseline
We can see that model 3, with a relative misclassification cost of 4-to-1, is the best of the 4 models (see the sketch below)
(Figure: ROC curves as a diagnostic tool for evaluating classification models – true-positive rate vs. false-positive rate, with the optimal model at maximum separation)
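A minimal sketch of an ROC curve with scikit-learn and matplotlib, using synthetic risk scores rather than the case study's models:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=300)                                    # 1 = readmitted (synthetic)
y_score = np.clip(0.4 * y_true + rng.normal(0.3, 0.25, size=300), 0, 1)  # predicted risk of "yes"

fpr, tpr, _ = roc_curve(y_true, y_score)
print("Area under the curve:", roc_auc_score(y_true, y_score))

plt.plot(fpr, tpr, label="model")
plt.plot([0, 1], [0, 1], "r--", label="baseline")   # the red base line of a random classifier
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()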
Source: CognitiveClass
From Deployment
to Feedback
Source: CognitiveClass
Case Study
From Deployment to Feedback
• Key topics covered
The importance of stakeholder input.
To consider the scale of deployment.
The importance of incorporating feedback to refine the
model.
The refined model must be redeployed.
This process should be repeated as often as necessary.
Source: CognitiveClass
• Deployment
Can you put the model into practice?
MODEL
DEPLOYMENT
Source: CognitiveClass
Case Study
Deployment (Concept)
• To make the model relevant and useful to address the initial question,
involves getting the stakeholders familiar with the tool produced
• Once the model is evaluated/approved by the stakeholders, it is
deployed and put to the ultimate test
• The model may be rolled out to a limited group of users or in a test
environment, to build up confidence in applying the outcome for use
across the board
Source: CognitiveClass
Case Study
Deployment (Application)
• Understanding the results
In preparation for model deployment, the next
step was to assimilate the knowledge for the
business group who would be designing and
managing the intervention program to reduce
readmission risk
In this scenario, the business people translated the
model results so that the clinical staff could
understand how to identify high‐risk patients and
design suitable intervention actions
The goal was to reduce the likelihood that these patients would be readmitted within 30 days after
discharge
During the business requirements stage, the Intervention Program Director and her team had
wanted an application that would provide automated, near real‐time risk assessments of congestive
heart failure
Source: CognitiveClass
Case Study
Deployment (Application)
• Gathering application requirements
The application also had to be easy for clinical staff to use,
preferably as a browser‐based application on a tablet that each
staff member could carry around
The patient data was generated throughout the
hospital stay
It would be automatically prepared in a format needed by the model, and each patient
would be scored near the time of discharge (see the scoring‐service sketch after this slide)
Clinicians would then have the most up‐to‐date risk assessment for each patient, helping
them to select which patients to target for intervention after discharge. As part of solution
deployment, the Intervention team would develop and deliver training for the clinical staff.
Source: CognitiveClass
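The slides describe the deployed solution only at a high level (automated, near real‐time scoring behind a browser‐based tablet application). As a hypothetical sketch of what such a scoring service could look like, the snippet below exposes a trained model over HTTP with Flask; the endpoint name, feature names, and model file are all assumptions made for illustration, not details from the case study:

```python
# Hypothetical scoring service: accepts a patient record as JSON and returns
# a readmission-risk probability near the time of discharge.
from flask import Flask, request, jsonify
import joblib

app = Flask(__name__)
model = joblib.load("chf_readmission_model.pkl")              # assumed pre-trained classifier
FEATURES = ["age", "num_prior_admissions", "length_of_stay"]  # illustrative fields only

@app.route("/readmission-risk", methods=["POST"])
def readmission_risk():
    record = request.get_json()                     # patient data prepared upstream
    x = [[record[f] for f in FEATURES]]             # single-patient feature row
    risk = float(model.predict_proba(x)[0][1])      # probability of readmission
    return jsonify({"patient_id": record.get("patient_id"), "risk": risk})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```

A clinician‐facing tablet application could then call this endpoint and display the returned risk score for each patient nearing discharge.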
Case Study
Deployment (Application)
• Additional Requirements
Processes for tracking and monitoring
patients receiving the intervention would
have to be developed in collaboration with
IT developers and database administrators,
so that the results could go through the
feedback stage and the model could be
refined over time
Source: CognitiveClass
[Data Science Methodology diagram: FEEDBACK stage]
Source: CognitiveClass
Case Study
Feedback (Concept)
• Feedback from the users will help to refine the
model and assess it for performance and impact
• The value of the model will be dependent on
successfully incorporating feedback and making
adjustments for as long as the solution is required
• Throughout the Data Science Methodology, each
step sets the stage for the next
• This makes the methodology cyclical and ensures refinement at each stage
• Once the model has been evaluated and the data scientist trusts that it will work, it will be
deployed and will undergo the final test:
Its real use in real time in the field
Source: CognitiveClass
Case Study
Feedback (Application)
• Feedback stage included these steps:
First
The review process would be defined and put
into place, with overall responsibility for
measuring the results of the model applied to
the CHF risk population
Clinical management executives would have
overall responsibility for the review process
Second
CHF patients receiving the intervention would be tracked and
their re‐admission outcomes recorded
Third
The intervention would then be measured to determine how
effective it was in reducing re‐admissions
• For ethical reasons, CHF patients would not be split into
control and treatment groups
• Instead, readmission rates would be compared before and after
the implementation of the model to measure its impact
(see the before/after comparison sketch after this slide)
Source: CognitiveClass
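The slides do not prescribe a particular statistical test, but a simple way to quantify the before/after comparison is a two‐proportion z‐test on readmission rates; the counts below are invented for illustration:

```python
# Illustrative before/after comparison of 30-day readmission rates
# using a two-proportion z-test (all counts are made up).
from statsmodels.stats.proportion import proportions_ztest

readmitted = [220, 160]    # readmissions within 30 days: [before, after]
discharged = [1000, 950]   # CHF discharges in each period

stat, p_value = proportions_ztest(count=readmitted, nobs=discharged)
rate_before, rate_after = (r / n for r, n in zip(readmitted, discharged))
print(f"before = {rate_before:.1%}, after = {rate_after:.1%}, p-value = {p_value:.4f}")
```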
Case Study
Feedback (Application)
• After the deployment and feedback stages, the impact
of the intervention program on re‐admission rates
would be reviewed after the first year of its
implementation
• Then the model would be refined, based on all of the
data compiled after model implementation and the
knowledge gained throughout these stages
• Other refinements included:
Incorporating information about participation in the
intervention program, and possibly refining the model to
incorporate detailed pharmaceutical data
If you recall, data collection was initially deferred because the
pharmaceutical data was not readily available at the time
But after feedback and practical experience with the model, it
might be determined that adding that data could be worth the
investment of effort and time
We also have to allow for the possibility that other refinements
might present themselves during the feedback stage.
Source: CognitiveClass
Case Study
Feedback (Application)
• Redeployment
The intervention actions and processes
would be reviewed and very likely
refined as well, based on the
experience and knowledge gained
through initial deployment and
feedback
Finally, the refined model and
intervention actions would be
redeployed, with the feedback process
continued throughout the life of the
Intervention program
Source: CognitiveClass
Thank You!