Introduction to Data Science
Dr. Ramakrishna Dantu
Associate Professor, BITS Pilani
Introduction to Data Science
Disclaimer and Acknowledgement
• The content for these slides has been obtained from books and various
other sources on the Internet
• I hereby acknowledge all the contributors for their material and inputs
• I have provided source information wherever necessary
• I have added and modified the content to suit the requirements of the
course
Introduction to Data Science
Module Topics
• Fundamentals of Data Science
Why Data Science
Defining Data Science
Data Science Process
• Real World applications
• Data Science vs BI
• Data Science vs Statistics
• Roles and responsibilities of a Data Scientist
• Software Engineering for Data Science
• Data Scientists Toolbox
• Data Science Challenges
Why Data Science?
Fundamentals of Data Science
Why Data Science?
• Let's look into this question from the following perspectives:
Data Science as a field
Various data science job roles
Market revenue
Skills
Source: https://analyticsindiamag.com/why‐a‐career‐in‐data‐science‐should‐be‐your‐prime‐focus‐in‐2020/
Fundamentals of Data Science
Why Data Science?
• Data Science as a field
"Data Science is the sexiest job in the 21st century"
‐‐IBM
Data Science is one of the fastest growing fields in the world
In 2019, there were about 2.9 million data science job openings globally
According to the U.S. Bureau of Labor Statistics, 11.5 million new jobs will be
created by the year 2026
Even with the COVID‐19 situation and the ongoing shortage of talent, data science as a career
option is unlikely to see a dip
Source: https://analyticsindiamag.com/why‐a‐career‐in‐data‐science‐should‐be‐your‐prime‐focus‐in‐2020/
Fundamentals of Data Science
Why Data Science?
• The Hottest Job Roles and Trends In Data Science 2020
The growth of data science as a career choice in 2020 is also driving the rise of its
various job roles
Data Engineer
Data Administrator
Machine Learning Engineer
Statistician
Data and Analytics Manager
In India, the average salary of a data scientist as of January 2020 is ₹10L/yr, which is
pretty attractive (Glassdoor, 2020)
Source: https://analyticsindiamag.com/why‐a‐career‐in‐data‐science‐should‐be‐your‐prime‐focus‐in‐2020/
Fundamentals of Data Science
Why Data Science?
• Market Revenue
In recent years, with the shift toward analytics and data science, market
revenue has increased
Analytics, data science, and big data industry in India generated about
$2.71 billion annually in 2018, and the revenue grew to
$3.03 billion in 2019
This 2019 figure is expected to nearly double by 2025
Source: https://analyticsindiamag.com/why‐a‐career‐in‐data‐science‐should‐be‐your‐prime‐focus‐in‐2020/
Fundamentals of Data Science
Most wanted Data Science Skills in 2019
• Survey Details
LinkedIn data scientist job openings
Survey conducted in the US in April
2019
24,697 respondents
76.13 percent of data scientist job
openings on LinkedIn required a
knowledge of the programming
language Python
Wider industry data may vary
Source: https://softwarestrategiesblog.com/2019/06/13/how‐to‐get‐your‐data‐scientist‐career‐started/
Data Science Defined
Fundamentals of Data Science
Data Science Defined
• There is no consistent definition of what constitutes data science
• Data Science is the study of data.
• Data Science is the art of uncovering useful patterns, connections, insights, and trends that
are hiding behind the data
• Data Science helps translate data into a story. The storytelling helps in uncovering
insights. The insights in turn help in making decisions or strategic choices
• Data Science involves extracting meaningful insights from any data
Requires a major effort of preparing, cleaning, scrubbing, or standardizing the data
Algorithms are then applied to crunch the pre‐processed data
This process is iterative and requires analysts’ awareness of the best practices
Automation of tasks allows us to focus on the most important aspect of data science:
Interpreting the results of the analysis in order to make decisions
Fundamentals of Data Science
Data Science Defined
• Data science is an inter‐disciplinary practice that draws from
Data engineering, statistics, data mining, machine learning, and predictive analytics
• Similar to operations research, data science focuses on…
Implementing data‐driven models and managing their outcomes
• To uncover nuggets of wisdom and actionable insights, Data science
combines the
Data‐driven approach of statistical data analysis
Computational power of the computer
Programming acumen of a computer scientist
Domain‐specific business intelligence
Source: Data Science using Python and R – Chantal Larose & Daniel Larose
Fundamentals of Data Science
Data Science Defined – Drew Conway's view
• Drew Conway is the CEO and founder of Alluvium
• He is a leading expert in the application of computational methods to large‐scale social and
behavioral problems
• Drew has been writing and speaking about the role of data — and the discipline
of data science — in industry, government, and academia for several years.
• Drew has advised and consulted companies across many industries:
Ranging from fledgling start‐ups to Fortune 100 companies
Academic institutions, and
Government agencies at all levels
• Drew started his career in counter‐terrorism as a computational social scientist
in the U.S. intelligence community
Source: http://drewconway.com
Fundamentals of Data Science
Data Science Defined ‐ Drew Conway's view
• Data science comprises three distinct
and overlapping skills:
Statistics – for modeling and summarizing
datasets
Computer science – for designing and using
algorithms to effectively store, process, and
visualize the data
Domain expertise – necessary both to
formulate the right questions and to put their
answers in context
[Figure: Drew Conway’s Data Science Venn Diagram]
Source: http://drewconway.com/zia/2013/3/26/the‐data‐science‐venn‐diagram
Fundamentals of Data Science
Data Science Defined ‐ Drew Conway's view
• Danger Zone
Here, people "know enough to be dangerous"; this is the most
problematic area of the diagram
People here are perfectly capable of extracting and structuring data in the
field they know quite a bit about
They even know enough R to run a linear regression and report the
coefficients
But they lack any understanding of what those coefficients mean
It is from this part of the diagram that the phrase "lies, damned lies, and
statistics" emanates
Because, either through ignorance or malice, this overlap of skills gives people the
ability to create what appears to be a legitimate analysis without any
understanding of how they got there or what they have created
As such, the danger zone is sparsely populated; however, it does not take
many to produce a lot of damage.
[Figure: Drew Conway’s Data Science Venn Diagram]
Source: http://drewconway.com/zia/2013/3/26/the‐data‐science‐venn‐diagram
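A minimal Python sketch (not from the source slides) of the point above: fitting a regression and printing coefficients is mechanically easy, while interpreting them correctly needs the statistics and domain expertise the diagram calls for. All data and variable names below are invented.

```python
# Reporting coefficients is trivial; interpreting them is the hard part.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
ad_spend = rng.uniform(10, 100, size=200)              # marketing spend (k$)
store_visits = 5 * ad_spend + rng.normal(0, 40, 200)   # correlated with spend
sales = 2.0 * ad_spend + 0.5 * store_visits + rng.normal(0, 10, 200)

X = np.column_stack([ad_spend, store_visits])
model = LinearRegression().fit(X, sales)
print("coefficients:", model.coef_, "intercept:", model.intercept_)
# ad_spend and store_visits are strongly correlated, so the individual
# coefficients are unstable and are not causal effects; recognizing that
# requires the statistics and domain knowledge the Venn diagram demands.
```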
Fundamentals of Data Science
Artificial Intelligence, Machine Learning, and Data Science
• Artificial Intelligence
AI involves making machines capable of mimicking human behavior, particularly cognitive
functions:
E.g., facial recognition, automated driving, sorting mail based on postal code
• Machine learning
Considered a sub‐field of or one of the tools of AI
Involves providing machines with the capability of
learning from experience
Experience for machines comes in the form of data
Data that is used to teach machines is called training data
• Data Science
Data science is the application of machine learning,
artificial intelligence, and other quantitative fields like
statistics, visualization, and mathematics
Source: Data Science, 2nd Edition by Bala Deshpande; Vijay Kotu
Data Science Process
Fundamentals of Data Science
Various Stages of Data Science Process
Source: Simplilearn ‐ https://www.youtube.com/watch?v=X3paOmcrTjQ
Fundamentals of Data Science
Phases of Data Science Process
• Problem Description
Clearly describe project objectives
Create a problem statement that can be solved using data science
• Data Preparation
Data cleaning/preparation is more labor‐intensive
• Exploratory Data Analysis
Gain insights into the data through graphical exploration
• Setup
Partition the data; balance the data, if needed
Establish a baseline model
• Modeling
The coding of the data science process
Apply various algorithms to uncover hidden relationships
• Evaluation
Determine whether your models are any good
Select the best performing model from a set
• Deployment
Deploy the model for real‐world problems
Source: Data Science Using Python & R by Chantal Larose and Daniel Larose
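Below is a minimal, hypothetical Python walk-through of these phases on synthetic data; the churn problem, column names, and models are illustrative assumptions, not part of the source material.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Problem description: predict whether a customer churns (binary label).
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "tenure_months": rng.integers(1, 60, 500),
    "monthly_spend": rng.uniform(10, 200, 500),
})
df["churn"] = (df["tenure_months"] < 12).astype(int)

# Data preparation: cleaning/standardizing would happen here.
df = df.dropna()

# Exploratory data analysis: quick numeric summaries (plots in practice).
print(df.describe())

# Setup: partition the data and establish a baseline model.
X_train, X_test, y_train, y_test = train_test_split(
    df[["tenure_months", "monthly_spend"]], df["churn"], random_state=0)
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)

# Modeling: apply an algorithm to uncover relationships.
model = LogisticRegression().fit(X_train, y_train)

# Evaluation: is the model any better than the baseline?
print("baseline accuracy:", accuracy_score(y_test, baseline.predict(X_test)))
print("model accuracy   :", accuracy_score(y_test, model.predict(X_test)))

# Deployment: the chosen model would then be served for real-world use.
```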
What is Data Science?
Key features and motivations of Data Science
• Extracting Meaningful Patterns
• Building Representative Models
• Combination of Statistics, Machine Learning, and Computing
• Learning Algorithms
• Associated Fields
Descriptive Statistics
Exploratory Visualization
Dimensional Slicing
Hypothesis Testing
Data Engineering
Business Intelligence
Source: Data Science, 2nd Edition, by Bala Deshpande; Vijay Kotu
Fundamentals of Data Science
Data Science Tasks and Examples
Classification
Description: Predict if a data point belongs to one of the predefined classes; the prediction is based on learning from a known dataset
Algorithms: Decision trees, neural networks, Bayesian models, induction rules, k‐nearest neighbors
Examples: Assigning voters into known buckets by political parties, e.g., soccer moms; bucketing new customers into one of the known customer groups

Regression
Description: Predict the numeric target label of a data point; the prediction is based on learning from a known dataset
Algorithms: Linear regression, logistic regression
Examples: Predicting the unemployment rate for the next year; estimating insurance premium

Anomaly Detection
Description: Predict if a data point is an outlier compared to other data points in the dataset
Algorithms: Distance‐based methods, density‐based local outlier factor (LOF)
Examples: Detecting fraudulent credit card transactions and network intrusion

Association Analysis
Description: Identify relationships within an item set based on transaction data
Algorithms: FP‐growth algorithm, a priori algorithm
Examples: Finding cross‐selling opportunities for a retailer based on transaction purchase history
The Local Outlier Factor (LOF) algorithm is an unsupervised anomaly detection method which
computes the local density deviation of a given data point with respect to its neighbors.
Source: Data Science, 2nd Edition, by Bala Deshpande; Vijay Kotu
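As a small illustration of the LOF method described above, the sketch below applies scikit-learn's LocalOutlierFactor to invented transaction amounts; the data and parameters are assumptions.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(1)
normal_txns = rng.normal(loc=50, scale=10, size=(200, 1))   # typical amounts
fraud_txns = np.array([[500.0], [750.0]])                    # obvious outliers
X = np.vstack([normal_txns, fraud_txns])

lof = LocalOutlierFactor(n_neighbors=20)   # unsupervised: no labels used
labels = lof.fit_predict(X)                # -1 = outlier, 1 = inlier
print("flagged as outliers:", X[labels == -1].ravel())
```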
Fundamentals of Data Science
Data Science Tasks and Examples
Time Series Forecasting
Description: Predict the value of the target variable for a future timeframe based on historical values
Algorithms: Exponential smoothing, autoregressive integrated moving average (ARIMA), regression
Examples: Sales forecasting; production forecasting; virtually any growth phenomenon that needs to be extrapolated

Clustering
Description: Identify natural clusters within the dataset based on inherent properties within the dataset
Algorithms: k‐Means, density‐based clustering (e.g., DBSCAN)
Examples: Finding customer segments in a company based on transaction, web, and customer call data

Recommendation Engines
Description: Predict the preference of an item for a user
Algorithms: Collaborative filtering, content‐based filtering, hybrid recommenders
Examples: Finding the top recommended movies for a user
Source: Data Science, 2nd Edition, by Bala Deshpande; Vijay Kotu
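A minimal sketch of the clustering row above: segmenting customers with k-means in scikit-learn on two invented features (the features and the choice of two segments are assumptions).

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(7)
# Two made-up customer groups: low spend / few visits vs. high spend / many visits
spend = np.concatenate([rng.normal(20, 5, 100), rng.normal(90, 10, 100)])
visits = np.concatenate([rng.normal(2, 1, 100), rng.normal(12, 2, 100)])
X = np.column_stack([spend, visits])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("cluster centres:\n", kmeans.cluster_centers_)
print("first 10 customer segments:", kmeans.labels_[:10])
```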
Fundamentals of Data Science
Disciplines involved in Data Science
Source: http://www.verival.com/DataScience
Real World
Data Science Applications
Real‐World Applications
Data Science Use Cases
Source: https://data‐flair.training/blogs/data‐science‐use‐cases/
Real‐World Applications
Facebook
• Social Analytics
Utilizes quantitative research to gain insights about the social interactions among
people
Makes use of deep learning, facial recognition, and text analysis
In facial recognition, it uses powerful neural networks to classify faces in the
photographs
In text analysis, it uses its own understanding engine called “DeepText” to
understand people’s interests and align photographs with text
It uses deep learning for targeted advertising
Using the insights gained from data, it clusters users based on their preferences and provides
them with the advertisements that appeal to them
Source: https://data‐flair.training/blogs/data‐science‐use‐cases/
Real‐World Applications
Amazon
• Improving E‐Commerce Experience
Personalized recommendation
Amazon heavily relies on predictive analytics (a personalized recommender system) to increase customer satisfaction
Amazon analyzes the purchase history of customers, other customer suggestions, and user ratings to recommend
products
Anticipatory shipping model
Uses big data for predicting the products that are most likely to be purchased by its users
Analyzes patterns of customer purchases and keeps products in the nearest warehouse which the customers may utilize
in the future
Price discounts
Using parameters such as the user activity, order history, prices offered by the competitors, product availability, etc.,
Amazon provides discounts on popular items and earns profits on less popular items.
Fraud Detection
Amazon has its own novel ways and algorithms to detect fraud sellers and fraudulent purchases
Improving Packaging Efficiency
Amazon optimizes packaging of products in warehouses and increases efficiency of packaging lines through the data
collected from the workers
Source: https://data‐flair.training/blogs/data‐science‐use‐cases/
Real‐World Applications
Uber
• Improving Rider Experience
Uber maintains a large database of drivers, customers, and several other records
Makes extensive use of Big Data and crowdsourcing to derive insights and provide the best
services to its customers
Dynamic pricing
The concept is rooted in Big Data and data science: fares are calculated based on several parameters
Uber matches a customer’s profile with the most suitable driver and charges based on the time it
takes to cover the distance rather than the distance itself
o The time of travel is calculated using algorithms that make use of data related to traffic density and
weather conditions
When demand (riders) is higher than supply (available drivers), the price of the ride goes up
When the demand for Uber rides is lower, Uber charges a lower rate (a toy illustration of this idea follows)
Source: https://data‐flair.training/blogs/data‐science‐use‐cases/
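Uber's actual pricing model is proprietary; the toy function below only illustrates the demand/supply idea described above, with invented rates and caps.

```python
# Purely illustrative: a made-up surge multiplier driven by the rider/driver
# ratio, and a fare based on estimated travel time (not Uber's real model).
def estimated_fare(base_rate_per_min, est_minutes, riders_waiting, drivers_free):
    ratio = riders_waiting / max(drivers_free, 1)
    surge = min(1.0 + 0.25 * max(ratio - 1.0, 0.0), 3.0)   # capped multiplier
    return base_rate_per_min * est_minutes * surge

print(estimated_fare(2.0, 20, riders_waiting=50, drivers_free=50))   # no surge
print(estimated_fare(2.0, 20, riders_waiting=150, drivers_free=50))  # surge applied
```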
Real‐World Applications
Bank of America and other Banks
• Improving Customer Experience
Erica – a virtual financial assistant (BoA)
Considered one of the finest innovations in the finance domain, Erica serves as a customer advisor to over 45 million
users around the world
Erica makes use of Speech Recognition to take customer inputs, which is a technological advancement in the field of
Data Science.
Fraud detection
Uses data science and predictive analytics to detect frauds in payments, insurance, credit cards, and customer
information
Banks employ data scientists to use their quantitative knowledge where they apply algorithms like association,
clustering, forecasting, and classification.
Risk modeling
Banks use data science for risk modeling to regulate financial activities
Customer segmentation
Using various data‐mining techniques, banks are able to segment their customers in the high‐value and low‐value
segments
Data scientists make use of clustering, logistic regression, and decision trees to help banks understand the Customer
Lifetime Value (CLV) and group customers into the appropriate segments
Source: https://data‐flair.training/blogs/data‐science‐use‐cases/
Real‐World Applications
Airbnb
• Providing better search results
Airbnb uses massive amounts of data on customers and hosts, homestays and lodge records, as
well as website traffic
Uses data science to provide better search results to its customers and find compatible hosts
• Detecting bounce rates
Airbnb makes use of demographic analytics to analyze bounce rates from their websites
In 2014, Airbnb found that users in certain countries would click the neighborhood link, browse the
page and photos and not make any booking
To mitigate this issue, Airbnb released a different version in those countries and replaced
neighborhood links with the top travel destinations
This resulted in a 10% improvement in the lift rate for those users
• Providing ideal lodgings and localities
Airbnb uses knowledge graphs where the user’s preferences are matched with the various
parameters to provide ideal lodgings and localities
Source: https://data‐flair.training/blogs/data‐science‐use‐cases/
Real‐World Applications
Spotify
• Providing better music streaming experience
Spotify uses Data Science and leverages big data for providing personalized music recommendations
It has over 100 million users and uses a massive amount of big data
Uses over 600 GBs of daily data generated by the users to build its algorithms to boost user experience
• Improving experience for artists and managers
Spotify for Artists application allows the artists and managers to analyze their streams, fan approval and
the hits they are generating through Spotify’s playlists
• Others
Spotify uses data science to gain insights about which universities had the highest percentage of party
playlists and which ones spent the most time on it
"Spotify Insights" publishes information about the ongoing trends in the music
Spotify's Niland, an API based product, uses machine learning to provide better searches and
recommendations to its users.
Spotify analyzes listening habits of its users to predict the Grammy Award Winners
In the year 2013, Spotify made 4 correct predictions out of 6
Source: https://data‐flair.training/blogs/data‐science‐use‐cases/
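A tiny user-based collaborative-filtering sketch of the recommendation idea above, using cosine similarity on an invented listening matrix; real services such as Spotify use far richer data and models.

```python
import numpy as np

# rows = users, columns = tracks; 0 means "not listened to yet" (invented data)
ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
], dtype=float)

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-9)

target = 0                                      # recommend for user 0
sims = np.array([cosine(ratings[target], ratings[u]) for u in range(len(ratings))])
sims[target] = 0                                # ignore self-similarity
scores = sims @ ratings                         # similarity-weighted ratings
scores[ratings[target] > 0] = -np.inf           # drop already-rated tracks
print("recommended track index:", int(np.argmax(scores)))
```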
Real‐World Applications
Real World Examples
• Anomaly detection
Fraud, disease, crime, etc.
• Automation and decision‐making
Background checks, credit worthiness, etc.
• Classifications
In an email server, this could mean classifying emails as “important” or “junk”
• Forecasting
Sales, revenue and customer retention
• Pattern detection
Weather patterns, financial market patterns, etc.
• Recognition
Facial, voice, text, etc.
• Recommendations
Based on learned preferences, recommendation engines can refer you to movies, restaurants and books you may like
Real‐World Applications
Real World Examples
• Sales and Marketing
Google AdSense collects data from internet users so relevant commercial messages
can be matched to the person browsing the internet
MaxPoint (http://maxpoint.com/us) is another example of real‐time personalized
advertising
• Customer Relationship Management
Commercial companies in almost every industry use data science to gain insights
into their customers, processes, staff, competition, and products
Many companies use data science to offer customers a better user experience, as
well as to cross‐sell, up‐sell, and personalize their offerings
Source: https://data‐flair.training/blogs/data‐science‐applications/
Data Science Vs. Business
Intelligence
Data Science Vs. Business Intelligence
Comparing Data Science and BI
• Business Intelligence
What happened last quarter?
How many units sold?
Where is the problem?
In which situations?
• Data Science
Why is this happening?
What if…
What would be an optimal scenario
for our business?
What will happen next?
What if these trends continue?
Source: https://www.datasciencecentral.com/profiles/blogs/updated‐difference‐between‐business‐intelligence‐and‐data‐science
Data Science Vs. Business Intelligence
Comparing Data Science and BI
Each row lists Business Intelligence first, then Data Science:
Perspective: Past and Present | Forward Looking
Involves Statistics: Yes | Yes
Analysis: Descriptive, Comparative | Explorative, Diagnostic, Predictive
Data: Usually structured (SQL); Data Warehouse; New data, same analysis | Less structured (logs, cloud data, text, NoSQL, SQL); Distributed data; Same data, new analysis
Scope: General | Specific to business questions
Tools: Statistics, Visualization | Statistics, Machine Learning, Graph Analysis, NLP
Expertise: Business Analyst | Data Scientist
Deliverable: Tabular, Charts | Visualization, Insight, Story
Data Science Vs. Business Intelligence
BI Analyst and Data Scientist Characteristics
Each row lists the BI Analyst first, then the Data Scientist:
Focus: Reports, KPIs, Trends | Patterns, Correlations, Models
Process: Static, Comparative | Exploratory, Experimentation, Visual
Data Sources: Pre‐planned, added slowly | On the fly, as needed
Transformation: Up front, carefully planned | In‐database, on‐demand, enrichment
Data Quality: Single version of truth | "Good enough", probabilities
Analytics: Retrospective, Descriptive | Predictive, Prescriptive
Table shows the different attitudinal approaches for each
Source: https://www.datasciencecentral.com/profiles/blogs/updated‐difference‐between‐business‐intelligence‐and‐data‐science
Data Science Vs. Business Intelligence
Comparing Data Science and Statistics
Each row lists Data Science first, then Statistics:
Type of problem: Semi‐structured or unstructured | Well structured
Analysis objective: Need not be well defined | Well‐formed objective
Type of analysis: Explorative | Confirmative
Source: https://www.datasciencecentral.com
Roles and Responsibilities of a Data
Scientist
R & R of a Data Scientist
Who is a Data Scientist?
• One source writes this…
In fact, some data scientists are — for all practical purposes — statisticians, while others are pretty much
indistinguishable from software engineers
Some are machine‐learning experts, while others couldn’t machine‐learn their way out of kindergarten
Some are PhDs with impressive publication records, while others have never read an academic paper
(shame on them, though)
No matter how you define data science, you’ll find practitioners for whom the definition is totally,
absolutely wrong
• Nonetheless, a data scientist is someone who extracts insights from messy data
• A data scientist is responsible for guiding a data science project from start to finish
• Success in a data science project comes not just from any one tool, but from having
quantifiable goals, good methodology, cross‐discipline interactions, and a repeatable
workflow
Source: Data Science from Scratch, by Joel Grus
R & R of a Data Scientist
Who is a Data Scientist?
• Skills Required for a Data Scientist
Math
Statistics
Computer science
Machine learning
Domain expertise
Data visualization
Communication and presentation skills
Source: https://www.northeastern.edu/graduate/blog/what‐does‐a‐data‐scientist‐do/
R & R of a Data Scientist
Data Scientist Job Titles
• The most common careers in data science include the following roles:
Data scientists:
Design data modeling processes to create algorithms and predictive models and perform
custom analysis.
Data analysts:
Manipulate large data sets and use them to identify trends and reach meaningful conclusions
to inform strategic business decisions.
Data engineers:
Clean, aggregate, and organize data from disparate sources and transfer it to data warehouses.
Business intelligence specialists:
Identify trends in data sets.
Data architects:
Design, create, and manage an organization’s data architecture.
Source: https://www.northeastern.edu/graduate/blog/what‐does‐a‐data‐scientist‐do/
Software Engineering for Data
Science
Software Engineering for Data Science
Software Engineering
• Software engineering is an engineering discipline that is concerned
with all aspects of software production
From early stages of system specification through to maintaining the system after it
has gone into production
• Software includes computer programs, all associated documentation,
and configuration data that are needed for software to work correctly
• Software engineers are concerned with developing software
products. That is,
Generic products that are stand‐alone systems
Products that are customized for a particular customer
Source: Software Engineering by Ian Sommerville
Software Engineering for Data Science
Software Development Process
• Waterfall model
Views development activities as strictly defined steps or states
Each step cannot begin until the previous step is completed and documented
No phase is considered complete until its documentation is approved
The model is tightly tied to project management phases
Phases: Gathering Requirements → Systems Analysis → Systems Design → Development → Testing
• Iterative model
The whole project is developed in multiple iterations
Each iteration implements a selected set of requirements or functionality
The iterative process groups requirements into layers or cycles
Within each iteration, the development goes through a complete life cycle: analysis, design, coding, and testing
Cycle: Design → Development → Testing, repeated each iteration
[Figure: the Software Engineering Process (Gathering Requirements → Systems Analysis → Systems Design → Development → Testing) shown alongside the Data Science Process (Problem Description → Data Preparation → Exploratory Data Analysis → Setup → Modeling → Evaluation → Deployment)]
Software Engineering for Data Science
Data Science Vs. Software Engineering
Each point lists Software Engineering first, then Data Science:
• Software engineering focuses on creating software that serves a specific purpose | Data science involves analyzing huge amounts of data, with some aspects of programming and development
• Uses a methodology involving various phases, beginning from requirements specification through software deployment into production | Uses a methodology involving various phases, beginning from requirements specification through model deployment
• Consider software such as the Chrome or Firefox browser, developed using software engineering principles | You use the browser to search for, let's say, "Data Science Bootcamp"; the results displayed are probably based on an algorithm developed using data science
If you wish to find a restaurant, you would use your phone assistant (developed by a software engineer) to find
one. Then, an algorithm (developed by a data scientist) searches and finds restaurants close to your location and
displays the results in a map application (developed by a software engineer) that tells you exactly how to get there
Source: https://careerkarma.com/blog/data‐science‐vs‐software‐engineering/
Software Engineering for Data Science
Data Science Software Engineering
• Involves collecting and analyzing • Is concerned with creating useful
data applications
• Data scientists utilize the ETL process • Software engineers use the SDLC
process
• Is more process‐oriented
• Uses frameworks like Waterfall,
• Data scientists use tools like Amazon Agile, and Spiral
S3, MongoDB, Hadoop, and MySQL • Software engineers use tools like
• Skills include machine learning, Rails, Django, Flask, and Vue.js
statistics, and data visualization • Skills are focused on coding
languages
Source: https://careerkarma.com/blog/data‐science‐vs‐software‐engineering/
Data Scientist's Toolbox
Data Scientist's Toolbox
Tools Used by Data Scientists for Data Analysis
• Java, R, Python, Clojure, Haskell, Scala…
• Hadoop, HDFS & MapReduce, Spark, Storm…
• HBase, Pig, Hive, Shark, Impala, Cascalog…
• ETL, web scrapers, Flume, Sqoop, Hue…
• SQL, RDBMS, DW, OLAP…
• Knime, Weka, RapidMiner, SciPy, NumPy, scikit‐learn, pandas…
• D3.js, Gephi, ggplot2, Tableau, Flare, Shiny…
• SPSS, Matlab, SAS…
• NoSQL, MongoDB, Couchbase, Cassandra…
• And yes! … MS‐Excel: the most used, most underrated DS tool.
Source: http://makemeanalyst.com/data‐scientist‐toolkits/
Data Science Challenges
Data Science Challenges
Broad Categorization of Data Science Challenges
• By reading related material from various sources, data science
challenges can be categorized as:
Data related
Organization related
Technology related
People related
Skill related
Can you think of anything else?
Data Science Challenges
Challenges Faced by Data Professionals
• Survey of over 16,000 data professionals
• The question was…
• “At work, which barriers or challenges have
you faced this past year? (Select all that
apply).”
• The top 10 challenges were:
Dirty data (36%)
Lack of data science talent (30%)
Company politics (27%)
Lack of clear question (22%)
Data inaccessible (22%)
Results not used by decision makers (18%)
Explaining data science to others (16%)
Privacy issues (14%)
Lack of domain expertise (14%)
Organization small and cannot afford data
science team (13%)
Source: https://businessoverbroadway.com/2018/03/18/top‐10‐challenges‐to‐practicing‐data‐science‐at‐work/
Data Science Challenges
Challenges Faced by Data Professionals
• A principal component analysis of the 20 challenges resulted in five categories
Insights not Used in Decision Making
These challenges include company politics, an inability to integrate study findings into decision‐making
processes and lack of management support.
Data Privacy, Veracity, Unavailability
These challenges revolved around the data itself, including how “dirty” it is, its availability as well as privacy
issues.
Limitations of tools to scale / deploy
Challenges in this category are related to the tools that are used to extract insights, deploy models as well as
scaling solutions up to the full database.
Lack of Funds
Challenges around lack of funding impact what the organization can purchase with respect to external data
sources, data science talent and, perhaps, domain expertise.
Wrong Questions Asked
Challenges are about the difficulty in maintaining expectations about the impact of data science projects and not
having a clear question to answer or a clear direction to go in with the available data
Source: https://businessoverbroadway.com/2018/03/18/top‐10‐challenges‐to‐practicing‐data‐science‐at‐work/
Data Science Challenges
Data Science Challenges
• Insights not Used in Decision Making
Company politics
Inability to integrate study findings into decision‐making
processes
Lack of management support.
• Data Privacy, Veracity, Unavailability
“Dirty” data
Lack of availability
Privacy issues
• Limitations of tools to scale / deploy
Challenges in this category are related to the tools that are
used to extract insights, deploy models as well as scaling
solutions up to the full database.
• Lack of Funds
Challenges around lack of funding impact what the organization
can purchase with respect to external data sources, data
science talent and, perhaps, domain expertise.
• Wrong Questions Asked
Challenges are about the difficulty in maintaining expectations
about the impact of data science projects and not having a
clear question to answer or a clear direction to go in with the
available data
Source: https://businessoverbroadway.com/2018/03/18/top‐10‐challenges‐to‐practicing‐data‐science‐at‐work/
Data Science Challenges
Data Science Challenges
• Unreasonable Management Expectations
• Misunderstanding Data’s Significance
• Being Blamed for Bad News
• Having to Convince Management
• Lack of Professionals
• Problem Identification
• Accessing the Right Data
• Cleansing of the Data
Source: https://www.dataquest.io/blog/data‐science‐problems‐fix/
Data Science Challenges
Data Science Challenges
• Complexity of reality
• Cognitive bias
• Data Quality
Accuracy
Completeness
Uniqueness
Timeliness
Consistency
• Content and Source Bias
• Granularity of Data
In terms of time and space
• Availability & Access of Data
Data Divide
Source: Coursera
Source: https://dimensionless.in/challenges‐faced‐by‐data‐scientist‐and‐how‐to‐overcome‐them/
Data Science Challenges
Cognitive Bias
• The way we see the world (events, facts, people, etc.,) is based on our own
belief system and experiences and may not be reasonable or accurate
• Cognitive biases are ways of thinking about and perceiving the world that
may not necessarily reflect reality
• We may think we experience the world around us with perfect objectivity,
but this is rarely the case
• Each of us sees things differently based on our preconceptions, past
experiences, cultural, environmental, and social factors
This doesn’t necessarily mean that the way we think or feel about something is truly
representative of reality
• Simply put, cognitive biases are the distortions of reality because of the lens
through which we view the world
Source: https://www.wordstream.com/blog/ws/2014/05/22/landing‐pages‐cognitive‐biases
Data Science Challenges
Types of Cognitive Biases
• Selection (or sample) Bias
• Seasonal Bias
• Linearity Bias
• Confirmation Bias
• Recall Bias
• Survivor Bias
• Observer Bias
• Reinforcement Bias
• Availability Bias
Source: https://www.wordstream.com/blog/ws/2014/05/22/landing‐pages‐cognitive‐biases
Data Science Challenges
Cognitive Bias
• MONEYBALL ‐ breaking biases ‐ FREQUENCY BASED PROBABILITY /
STATISTICS ‐ MATHEMATICS in the MOVIES
https://www.youtube.com/watch?v=KWPhV6PUr9o&ab_channel=Movieclips
• Types of cognitive bias
https://www.youtube.com/watch?v=wEwGBIr_RIw&ab_channel=PracticalPsychol
ogy
• Graphs can be misleading
https://www.youtube.com/watch?v=E91bGT9BjYk&ab_channel=TED‐Ed
• Simpson's paradox
https://www.youtube.com/watch?v=sxYrzzy3cq8&ab_channel=TED‐Ed
Source: https://www.wordstream.com/blog/ws/2014/05/22/landing‐pages‐cognitive‐biases
Thank You!
Introduction to Data Science
Data Analytics
Dr. Ramakrishna Dantu
Associate Professor, BITS Pilani
Introduction to Data Science
Disclaimer and Acknowledgement
• The content for these slides has been obtained from books and various
other sources on the Internet
• I hereby acknowledge all the contributors for their material and inputs
• I have provided source information wherever necessary
• I have added and modified the content to suit the requirements of the
course
Introduction to Data Science
Data Analytics Module Topics
• Defining Analytics
• Types of data analytics
Descriptive, Diagnostic
Predictive, Prescriptive
• Data Analytics – methodologies
CRISP‐DM Methodology
SEMMA
BIG DATA LIFE CYCLE
SMAM
• Analytics Capacity Building
• Challenges in Data‐driven decision making
What is Analytics?
Defining Analytics
What is Analytics?
What route should I follow?
How to retain customers?
What home should I buy?
What stock should I purchase?
What treatment should I recommend?
Defining Analytics
What is Analytics? – Dictionary meaning
• It can be used as a noun or verb.
Is Analytics the name of your department, or do you actually "do" Analytics?
Analytics can be a noun. "He is the HOD of Business Analytics department!"
Analytics can also be a verb. "I applied Analytics to the data to find patterns"
• Let's look at the dictionary meaning first
Analytics is the systematic computational analysis of data or statistics
‐‐ Oxford
Analytics is a process in which a computer examines information using mathematical
methods in order to find useful patterns
‐‐ Cambridge
Analytics is the analysis of data, typically large sets of business data, by the use of
mathematics, statistics, and computer software
‐‐ dictionary.com
Source: http://the‐data‐guy.blogspot.com/2016/10/is‐analytics‐noun‐or‐verb.html
Defining Analytics
What is Analytics? – From websites
• It is the process of discovering, interpreting, and communicating significant
patterns in data and using tools to empower your entire organization to ask any
question of any data in any environment on any device
‐‐Oracle
• Data Analytics refers to the techniques used to analyze data to enhance
productivity and business gain
‐‐ Edureka
• Data analytics is the pursuit of extracting meaning from raw data using
specialized computer systems
‐‐ Informatica
• I couldn’t find a definition from popular websites such as Google, IBM,
Microsoft, Amazon, etc.,
Source: Internet
Defining Analytics
What is Analytics?
• Analytics is the process of extracting and creating information from
raw data by using techniques such as:
filtering, processing, categorizing, condensing and contextualizing the data
• Analytics is a broad term that encompasses the processes,
technologies, frameworks and algorithms to extract meaningful
insights from data
• This information thus obtained is then used to infer knowledge about
the system and/or its users, and its operations to make the systems
smarter and more efficient
Source: Big Data Analytics – A Hands‐on Approach by Arshdeep Bahga & Vijay Madisetti
Characteristics of Big Data
Characteristics of Big Data (5Vs)
Volume
• Big data is so large that it would not fit on a single machine
• Specialized tools and frameworks are required to store, process and analyze such data
• For example: social media applications process billions of messages every day, IoT can generate terabytes of sensor data every day, etc.
• The data generated by modern IT, industrial, healthcare, Internet of Things, and other systems is growing exponentially
• Cost of storage and processing architectures is going down
• There is a need to extract valuable insights from the data to improve business processes, efficiency and service to consumers
• Though there is no fixed threshold for the volume of data to be considered as big data, the term is typically used for massive‐scale data that is difficult to store, manage and process using traditional databases and data processing architectures
Velocity
• Specialized tools are required to ingest such high‐velocity data into the big data infrastructure and analyze the data in real time
[Figure: the five Vs of Big Data – Volume, Velocity, Variety, Veracity, Value]
Source: Big Data Analytics – A Hands‐on Approach by Arshdeep Bahga & Vijay Madisetti
Characteristics of Big Data (5Vs)
Variety & Veracity
• Variety
Variety refers to different forms of the data
Big data comes in different forms such as structured, unstructured or semi‐structured, including text data, image, audio, video and sensor data
Big data systems need to be flexible enough to handle such variety of data
• Veracity
Veracity refers to how accurate the data is
To extract value from the data, the data needs to be cleaned to remove noise
Data‐driven applications can reap the benefits of big data only when the data is meaningful and accurate
Therefore, cleansing of data is important so that incorrect and faulty data can be filtered out
• Value
The end goal of any big data analytics system is to extract value from the data
The value of the data is also related to the veracity or accuracy of the data
For some applications value also depends on how fast we are able to process the data
Source: https://blogs.gartner.com/jason‐mcnellis/2019/11/05/youre‐likely‐investing‐lot‐marketing‐analytics‐getting‐right‐insights/
Types of Analytics
Types of analytics according to the objective
• Descriptive Analytics – What is happening?
Involves retrospective analysis of what happened using historical data
Data mining methods help uncover patterns that offer insight
Lays the foundation for diagnosis and further analysis of what might happen in the future
• Diagnostic Analytics – Why did it happen?
Looks for the root cause of a problem
Used to determine why something happened
Attempts to find and understand the causes of events and behaviors
• Predictive Analytics – What is likely to happen?
Uses past data and external sources to predict the future
Involves forecasting
Uses data mining and AI techniques to analyze current data and develop future scenarios
• Prescriptive Analytics – What should be done?
Dedicated to finding the right action to be taken
Descriptive analytics provides a historical perspective, and predictive analytics helps forecast what might happen; prescriptive analytics uses these parameters to find the best solution
(A small descriptive vs. predictive sketch follows below.)
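A minimal pandas/NumPy contrast between the descriptive and predictive questions above, on an invented monthly sales series (diagnostic and prescriptive steps would need richer data and business rules).

```python
import numpy as np
import pandas as pd

sales = pd.Series([100, 110, 120, 135, 150, 160],
                  index=pd.period_range("2023-01", periods=6, freq="M"))

# Descriptive: what happened?
print("average monthly sales:", sales.mean())
print("month-over-month growth:\n", sales.pct_change())

# Predictive: what is likely to happen next month? (naive linear trend)
x = np.arange(len(sales))
slope, intercept = np.polyfit(x, sales.values, 1)
print("forecast for next month:", slope * len(sales) + intercept)
```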
Types of Analytics
Next level of Analytics
• Cognitive Computing
Human cognition is based on the context and reasoning
Cognitive systems mimic how humans reason and process
Cognitive systems analyze information and draw inferences using probability
They continuously learn from data and
reprogram themselves
According to one source
"The essential distinction between cognitive
platforms and artificial intelligence systems is that:
o you want an AI to do something for you
o a cognitive platform is something you turn to for
collaboration or for advice"
Source: https://interestingengineering.com/cognitive‐computing‐more‐human‐than‐artificial‐intelligence
Image Source: https://towardsdatascience.com/cognitive‐computing‐what‐can‐it‐be‐used‐for‐8af4721928f5
Types of Analytics
Next level of Analytics
• Cognitive Analytics – What Don't I Know?
Involves Semantics, AI, Machine learning, Deep Learning, Natural Language Processing, and Neural
Networks
Simulates human thought process to learn from the data and extract the hidden patterns from data
Uses all types of data: audio, video, text, images in the analytics process
Although this is the top tier of analytics maturity,
Cognitive Analytics can be used in the prior levels
"It extends the analytics journey to areas that were
unreachable with more classical analytics techniques like
business intelligence, statistics, and operations research."
‐‐ Jean Francois Puget
Applications: https://www.gflesch.com/elevity‐it‐blog/healthcare‐retail‐finance‐are‐leveraging‐cognitive‐computing Source: https://www.ecapitaladvisors.com/blog/analytics‐maturity/
https://www.xenonstack.com/insights/what‐is‐cognitive‐analytics/
Types of Analytics
Application areas of Cognitive Analytics
• Healthcare, Financial Services, Supply Chains, Customer Relationship
Management, Cybersecurity, etc.,.
• Chatbots:
Programs that can simulate a human conversation by understanding the communication in
a contextual sense
Chatbots use the machine learning technique called natural language processing
Natural language processing allows programs to take inputs from humans (voice or text),
analyze it and then provide logical answers
Cognitive computing enables chatbots to have a certain level of intelligence in
communication
E.g., understanding user’s needs based on past communication, giving suggestions, etc.
Source: https://www.newgenapps.com/blog/what‐is‐cognitive‐computing‐applications‐companies‐artificial‐intelligence/
Types of Analytics
Application areas of Cognitive Analytics
• Sentiment analysis:
It is the science of understanding emotions conveyed in a communication
While it is easy for humans to understand tone, intent, etc. in a conversation, it is far
more complicated for machines
To enable machines to understand human communication, a training dataset of human
conversations is used, and the accuracy of the resulting analysis is then evaluated
Sentiment analysis is popularly used to analyze social media communications like
tweets, comments, reviews, complaints, etc. (a minimal sketch follows below)
Source: https://www.newgenapps.com/blog/what‐is‐cognitive‐computing‐applications‐companies‐artificial‐intelligence/
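A minimal sentiment-classification sketch of the idea above, training on a few invented labelled sentences with scikit-learn; production systems use far larger conversation datasets.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ["great product, love it", "terrible service, very slow",
         "happy with the support", "awful experience, never again"]
labels = [1, 0, 1, 0]                     # 1 = positive, 0 = negative (made up)

clf = make_pipeline(CountVectorizer(), LogisticRegression())
clf.fit(texts, labels)
print(clf.predict(["the support team was great"]))   # expected: [1]
```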
Types of Analytics
Major Players in Cognitive Computing
• SparkCognition
• Expert System
• Microsoft Cognitive Services
• IBM Watson
• Numenta
• Google DeepMind
• Cisco Cognitive Threat Analytics
• CognitiveScale
• CustomerMatrix
• HPE Haven OnDemand
Source: https://www.newgenapps.com/blog/what‐is‐cognitive‐computing‐applications‐companies‐artificial‐intelligence/
Types of Analytics
Level of automation in different types of analytics
Source: Gartner
Analytics Vs. Data Science
Types of Analytics
How Analytics is Connected to Data Science
Image Source: Simplilearn ‐ https://www.youtube.com/watch?v=X3paOmcrTjQ
Image Source: https://thedatascientist.com/why‐now‐is‐the‐right‐time‐for‐sports‐analytics/
Types of Analytics
How Analytics is Connected to Data Science
Source: Simplilearn ‐ https://www.youtube.com/watch?v=X3paOmcrTjQ
Types of Analytics
Types of analytics according to the domain
• Marketing Analytics
• Financial Analytics
• Healthcare Analytics
• Sports Analytics
• HR Analytics
• Customer Analytics
• Web Analytics
• Social Analytics
• Cultural Analytics
• Political Analytics
Image Source: https://www.dreamstime.com/stock‐photo‐diverse‐people‐social‐networking‐concepts‐image44309613
Types of Analytics
Types of analytics according to the type of data
• Text analytics
• Real‐time data analytics
• Multimedia analytics
• Geo analytics
• Mobile analytics
Types of Analytics
Computational Tasks or "giants"
• The National Research Council characterized computational tasks for massive
data analysis into seven groups (aka "giants")
• These groups include:
(1) Basic Statistics
(2) Generalized N‐Body Problems
(3) Linear Algebraic Computations
(4) Graph‐Theoretic Computations
(5) Optimization
(6) Integration and
(7) Alignment
• This grouping of computational tasks provides a taxonomy of tasks that is useful
in analyzing data according to mathematical structure and computational
strategy
Source: Big Data Analytics – A Hands‐on Approach by Arshdeep Bahga & Vijay Madisetti
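A quick NumPy illustration of two of the "giants" above on random data: basic statistics and linear algebraic computations (the data is synthetic and the choice of examples is an assumption).

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))            # 1000 observations, 5 variables

# Basic statistics: means, variances, correlations
print("column means:", X.mean(axis=0))
print("correlation matrix shape:", np.corrcoef(X, rowvar=False).shape)

# Linear algebraic computations: e.g., PCA via the SVD of centered data
U, s, Vt = np.linalg.svd(X - X.mean(axis=0), full_matrices=False)
print("variance explained by first component:", (s[0]**2) / (s**2).sum())
```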
Types of Analytics
Mapping between types of analytics and computational tasks or 'giants'
Descriptive Analytics (What happened): Reports, Alerts
Diagnostic Analytics (Why did it happen): Queries, Data Mining
Predictive Analytics (What is likely to happen): Forecasts, Simulations
Prescriptive Analytics (What can we do to make it happen): Reports, Alerts
Source: https://avasant.com/report/intelligent‐automation‐use‐cases/
Types of Analytics
Analytics use cases
Enterprise process‐specific use cases
• Customer Service, Finance & Accounting, Human Resources, Marketing & Sales, Procurement
IT process‐specific use cases
• IT Applications: application code generation, application configuration, application performance management, application self‐healing, rapid prototyping, error detection and remedy management, release and deployment support, automatic code refactoring, test case generation, application development
• IT Infrastructure & Operations: auto resolution of tickets, endpoint preemptive monitoring, autonomous cybersecurity, predictive analytics‐led server management, predictive maintenance, threat detection, log analysis, capacity planning, predictive infrastructure scaling, infrastructure cost management
Source: https://avasant.com/report/intelligent‐automation‐use‐cases/
Types of Analytics
Applications – Descriptive Analytics
• Telecom company finding current trend of customers
• Online retail company finding number of site visits for tickets
• Healthcare provider learns how many patients were hospitalized last month
• Retailer – the average weekly sales volume
• Manufacturer – a rate of the products returned for a past month, etc.
• A placement agency – number of candidates being placed to top companies,
candidates actually joining, candidates who accept but do not join etc.
• Consultancy firm – Doing salary analysis of different work streams, stages,
industries
Source: https://www.govtech.com/data/Role‐of‐Data‐Analytics‐in‐Predictive‐Policing.html
Predictive Analytics – Case Study
Predictive Policing – Seeing the crime before it happens
• At the start of each shift, crime data fed
into the PredPol system provides
officers with 15 different zones for four
types of crime:
Auto theft, vehicle burglary, burglary and
gang‐related activity
• Each zone covers an area of 500 feet by 500 feet
Source: https://www.govtech.com/data/Role‐of‐Data‐Analytics‐in‐Predictive‐Policing.html
Predictive Analytics – Case Study
Predictive Policing – PredPol algorithm
• PredPol uses a machine‐learning algorithm to make its predictions
• Historical event datasets are used to train the algorithm for each new city (ideally 2 to 5 years of data)
• It then updates the algorithm each day with new events as they are received from the department
• This new information comes from the agency’s records management system (RMS).
• PredPol uses ONLY 3 data points – crime type, crime location, and crime date/time – to create its predictions
• No personally identifiable information is ever used
• No demographic, ethnic or socio‐economic information is ever used
• This eliminates the possibility for privacy or civil rights violations seen with other intelligence‐led policing
models
• Predictions are displayed as red boxes (covering roughly 150 x 150 meters) on a web interface via Google Maps
• The boxes represent the highest‐risk areas for each day and for the corresponding shift: day, swing or night
shift
• Officers are instructed to spend roughly 10% of their shift time patrolling PredPol boxes
Source: https://www.govtech.com/data/Role‐of‐Data‐Analytics‐in‐Predictive‐Policing.html
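PredPol's own model is proprietary; the sketch below only illustrates the simpler idea of aggregating recent events (type, location, date/time) into grid cells to surface the highest-risk boxes, using invented coordinates.

```python
# Not PredPol's algorithm: a toy grid aggregation of recent crime events.
from collections import Counter

# (crime_type, x_metres, y_metres, iso_datetime) -- invented events
events = [
    ("burglary",     120,  80, "2020-01-03T22:10"),
    ("auto theft",   130,  95, "2020-01-04T01:40"),
    ("burglary",     140,  70, "2020-01-04T23:05"),
    ("gang-related", 900, 640, "2020-01-05T20:30"),
]

CELL = 150  # metres, mirroring the ~150 x 150 m boxes mentioned above
counts = Counter((int(x // CELL), int(y // CELL)) for _, x, y, _ in events)
print("highest-count cells:", counts.most_common(2))
```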
Predictive Analytics – Case Study
Predictive Policing – Predicting crime hot spots
• Descriptive and Diagnostic
Analytics
Software searches for patterns
and correlations in past crime
data
• Predictive Analytics
Algorithm predicts when/where
crime is likely to happen in the
future
• Response
Police deploy resources in certain
areas at certain times
Source: https://www.sciencemag.org/news/2016/09/can‐predictive‐policing‐prevent‐crime‐it‐happens
Predictive Analytics – Case Study
Predictive Policing – Predicting crime hot spots
• Descriptive and Diagnostic
Analytics
Software analyzes the social
contacts of those that have been
involved in the past
• Predictive Analytics
Algorithm identifies probable
perpetrators or victims of
violence
• Response
People are told they're
considered at risk, or social
services are offered
Source: https://www.sciencemag.org/news/2016/09/can‐predictive‐policing‐prevent‐crime‐it‐happens
Predictive Analytics – Case Study
Predictive Policing
• Science Behind the News: Predictive Policing
https://www.youtube.com/watch?v=74_jreara3w
Healthcare Case Study
Healthcare Analytics – Case Study
Healthcare Practices
• Private Practice
• Group Practice
• Large HMOs
• Hospital Based
• Locum Tenens
Source: https://www.micromd.com/blogmd/5‐types‐medical‐practices/
Image Source: https://technologyadvice.com/blog/healthcare/best‐medical‐practice‐management‐software/
Healthcare Analytics – Case Study
Healthcare Practice
• Private Practice
A physician practices alone without any partners
and typically with minimal support staff
Ideally works for physicians who wish to own and
manage their own practice
Benefits
Individual freedom, closer relationships with patients,
and the ability to set their own practice’s growth
pattern
Drawback
Longer work hours, financial extremes, and a greater
amount of business risk.
Source: https://www.micromd.com/blogmd/5‐types‐medical‐practices/
Image Source: https://www.mediqfinancial.com.au/blog/the‐essential‐guide‐to‐starting‐your‐own‐medical‐practice/
Healthcare Analytics – Case Study
Healthcare Practice
• Group Practice
A group practice involves two or more physicians who all provide medical care within the same facility
They utilize the same personnel and divide the income in a manner previously agreed upon by the group
Group practices may consist of providers from a single specialty or multiple specialties
Benefits
Shorter work hours, built‐in on‐call coverage, and access to more working capital
All of these factors can lead to less stress
Drawback
Less individual freedom, limits on the ability to rapidly grow income, and the need for a consensus on business
decisions.
Source: https://www.micromd.com/blogmd/5‐types‐medical‐practices/
Image source: https://www.physicianleaders.org/news/selling‐medical‐practice‐you‐need‐exit‐strategy
Healthcare Analytics – Case Study
Healthcare Practice
• Large HMOs (Health Maintenance Organizations)
An HMO employs providers to care for its members and beneficiaries
The goal of HMOs is to decrease medical costs for those consumers
There are two main types of HMOs
The prepaid group practice model and the medical care foundation (MCF), also called individual practice association.
Benefits
More stable work life and regular hours for providers (physicians)
Less paperwork and regulatory responsibilities and a regular salary along with bonus opportunities
These bonuses are based on productivity or patient satisfaction
Drawback
The main drawback for physicians working for an HMO is the lack of autonomy
HMOs require physicians to follow their guidelines in providing care.
Examples:
The Kaiser Foundation Health Plan in California
The Health Insurance Plan of Greater New York
Group Health Cooperative of Puget Sound
Source: https://www.micromd.com/blogmd/5‐types‐medical‐practices/
Healthcare Analytics – Case Study
Healthcare Practice
• Hospital Based
In hospital based work, physicians earn a predictable
income, have a regular patient base, and a solid
referral network
Physicians who are employed by a hospital will either
work in a hospital‐owned practice or in a department
of the hospital itself
Benefits
A regular work schedule, low to no business and legal
risk, and a steady flow of income
Drawback
Lack of physician autonomy
Employee constraints and the expectation that
physicians become involved in hospital committee work
Source: https://www.micromd.com/blogmd/5‐types‐medical‐practices/
Healthcare Analytics – Case Study
Healthcare Practice
• Locum Tenens
Locum tenens is derived from the Latin phrase for "to hold the place of."
In locum tenens, physicians temporarily relocate to areas that are short of healthcare professionals
These types of positions offer temporary employment and may offer higher pay than more
permanent employment situations
Benefits
Physicians working in locum tenens scenarios enjoy the benefits of variety and the ability to
experience numerous types of practices and geographic locations
They also enjoy schedule flexibility and lower living costs
Drawback
The possibility that benefits are not included, and a potential lack of steady work
Physicians need to regularly uproot their families
Source: https://www.micromd.com/blogmd/5‐types‐medical‐practices/
Healthcare Analytics – Case Study
Healthcare Scenario – Sources of Data
• Common sources of practice data existing within the practice:
Revenue Cycle Management System (billing for a procedure, processing denials, collecting
payments)
Electronic Health Records (EHRs) (provides services to a patient)
Scheduling & Information System (schedules an appointment)
Survey Results
Peer Review System
Other sources
• Analytics provides a platform to convert that data into actionable
information for the practice
• As reimbursement shifts to a value‐based care model, it is critical to have
insight into both clinical and business metrics to prepare for the future
Source: https://integratedmp.com/healthcare‐practice‐analytics‐101‐numbers‐charts‐dashboards‐oh‐my/
Healthcare Analytics – Case Study
Healthcare Scenario – Integrated Medical Partners (IMP)
• https://www.youtube.com/watch?v=olpuyn6kemg
• IMP
Primarily offers
Revenue Cycle Management (RCM) and Advanced Analytics among others
Helps maximize collections for independent physician practices
Provides practice performance analytics
Offers tailored solutions to partner with physicians and hospitals so that they can
Improve patient care, enhance compliance, improve operational efficiency, and increase profitability
Source: https://integratedmp.com/healthcare‐practice‐analytics‐101‐numbers‐charts‐dashboards‐oh‐my/
Healthcare Analytics – Case Study
Healthcare Scenario – Denial of Claims
• Consider that the healthcare practice has seen an increase in denied
claims over the past several months
• Increased denials impact negatively on the financial performance of
the organization
• Therefore, the company needs tools to help identify and resolve the
root cause of the denials
This trend needs to be reversed as soon as possible, but how?
• The different types of analytics provide necessary insights into the
data and the cause of the denial increase
Source: https://integratedmp.com/4‐key‐healthcare‐analytics‐sources‐is‐your‐practice‐using‐them/
Healthcare Analytics – Case Study
Healthcare Scenario – Descriptive Analytics
• Descriptive – what happened?
Descriptive analytics will tell you what is happening in the practice
In this example, there has been an increase in the number of denied claims over
the past several months
Further research identifying a trend revealed that the increase in denials is specific
to a particular denial code
What is this denial code?
This denial code is for a referring provider (doctor) who is not enrolled in the Medicare Provider
Enrollment, Chain, and Ownership System (PECOS).
• Now, the question is why this provider is not enrolled in PECOS?
Source: https://integratedmp.com/4‐key‐healthcare‐analytics‐sources‐is‐your‐practice‐using‐them/
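A pandas sketch of this descriptive step on an invented claims table; the denial code "N265" is used here only as a stand-in for the PECOS-related code.

```python
import pandas as pd

claims = pd.DataFrame({
    "month":       ["2021-01", "2021-01", "2021-02", "2021-02", "2021-03"],
    "denied":      [True, False, True, True, True],
    "denial_code": ["N265", None, "N265", "CO-97", "N265"],
})

denials = claims[claims["denied"]]
# Which denial code is driving the increase?
print(denials.groupby("denial_code").size().sort_values(ascending=False))
# Is the denial count trending up month over month?
print(denials.groupby("month").size())
```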
Healthcare Analytics – Case Study
Healthcare Scenario – Diagnostic Analytics
• Diagnostic – why did it happen?
The descriptive analytics has identified an increase in denials specific to a referring
provider
The next step is to diagnose to understand why this change occurred
Identify why the referring provider is not enrolled in PECOS
So, we utilize diagnostic analytics to understand why this change occurred
Looking for changes in referring provider patterns at the time the increase in PECOS
denials began would help to identify new referring providers
It will also identify whether these new referring providers are enrolled in PECOS
For the purpose of this example, assume we identify one referring provider who is
new to referring to the practice and is not enrolled in PECOS.
Source: https://integratedmp.com/4‐key‐healthcare‐analytics‐sources‐is‐your‐practice‐using‐them/
Healthcare Analytics – Case Study
Healthcare Scenario – Predictive Analytics
• Predictive – what will happen?
Predictive Analytics allows us to learn from historical trends to predict what will
happen in the future
Utilizing descriptive and diagnostic analytics, we can determine the historical
referral pattern of the new referring provider not enrolled in PECOS
Assuming this provider remains not enrolled in PECOS and refers a steady stream of
patients to the organization
predictive analytics will tell us the expected denials associated with these claims
The resulting impact of this situation is identified
Source: https://integratedmp.com/4‐key‐healthcare‐analytics‐sources‐is‐your‐practice‐using‐them/
Healthcare Analytics – Case Study
Healthcare Scenario – Prescriptive Analytics
• Prescriptive – what should I do?
Prescriptive Analytics assists in determining the best course of action from the information
gathered from descriptive, diagnostic and predictive analytics
In this scenario, the best course of action would be to contact the new referring provider,
express appreciation for the new referrals, but also evidence that the provider’s referrals
are resulting in denied claims due to PECOS enrollment
By working with this provider to become enrolled in PECOS, denied claims can now be
rebilled
In addition, future claims are less likely to be denied due to the provider not being
certified/eligible to be paid for the procedure/service on the claim date of service
The key to prescriptive analytics is implementing a solution that will prevent the same
breakdown from occurring again in the future
By reducing the likelihood of claim denials, the overall financial health of the physician
practice benefits
Source: https://integratedmp.com/4‐key‐healthcare‐analytics‐sources‐is‐your‐practice‐using‐them/
Analytics Maturity Models
Types of Analytics
Gartner Maturity Model for Data and Analytics
• D&A – Data and Analytics
• CDO – Chief Data Officer
• ROI – Return on Investment
• Outside‐in approach is guided by the
belief that customer value creation is
the key to success
• Inside‐out approach is guided by the
belief that the inner strengths and
capabilities of the organization will
produce a sustainable future
• Synergy: the combined power of a
group of things when they are working
together that is greater than the total
power achieved by each working
separately
Source: Gartner
Source: http://www.verival.com/DataScience
Types of Analytics
Analytics Maturity Vs. Competitive Advantage
• The model shows varying stages
of a company, its level of
analytics needs, and how
organizations can grow and
move up to the next level of
analytical maturity.
• By learning how to apply these
concepts to your organization,
you'll empower your teams and
yourself with the framework to
drive action at the right level for
the stage you're at
• The concept of transforming raw
data into prescriptive analytics is
easily digestible in each of these
graphics.
Source: https://www.jirav.com/blog/data‐analytics‐maturity‐models
Types of Analytics
Competing on Analytics
• Potential competitive advantage
increase with more sophisticated
analytics
• Companies in the bottom third of this
figure should be trying to move further
up it
• Business Intelligence (BI) tools typically
cover descriptive analytics
• Moving towards predictive analytics
can enable companies to see problems
before they arise
• Prescriptive analysis then helps to
specify the actions an organization
should carry out
• Autonomous analytics can power
decisions without human interaction
Source: Competing on Analytics: The new science of winning by Thomas H. Davenport and Jeanne G. Harris
Types of Analytics
Competing on Analytics
• "Without human interaction" may sound scary, as jobs may be affected, but the fact is that it increases productivity, and more time can be spent innovating rather than investing in a manual process
Google search is a good example.
• Not all organizations need to be
at the autonomous analysis
stage – but all need to be
moving away from the bottom
of the chart.
Source: Competing on Analytics: The new science of winning by Thomas H. Davenport and Jeanne G. Harris
Types of Analytics
Competing on Analytics
Source: HBR: Competing on Analytics: by Thomas H. Davenport and Jeanne G. Harris
Skills & Tools for Analytics
Skills Required for Analytics
Technical & Business Skills
• Technical skills for data analytics:
Packages and Statistical methods
BI Platform and Data Warehousing
Database design
Data Visualization and munging
Reporting methods
Knowledge of Hadoop and MapReduce
Data Mining
• Business skills for data analytics:
Effective communication skills
Creative thinking
Industry knowledge
Analytic problem solving
Image Source: Simplilearn ‐ https://www.youtube.com/watch?v=X3paOmcrTjQ
Tools in Analytics
Commercial & Open Source Tools
• Commercial Tools
SAS
WPS
MS Excel
Tableau
Pentaho
Spotfire
Statistica
Qlikview
KISSmetrics
• Free/Open Source Tools
R Programming
Google Analytics
Python
Anaconda Jupyter
Apache Spark
Weka
BigML
RapidMiner
Thank You!
Introduction to Data Science
Data Analytics Methodologies
Dr. Ramakrishna Dantu
Associate Professor, BITS Pilani
Introduction to Data Science
Disclaimer and Acknowledgement
• The content for these slides has been obtained from books and various other sources on the Internet
• I hereby acknowledge all the contributors for their material and inputs.
• I have provided source information wherever necessary
• I have added and modified the content to suit the requirements of the
course
Introduction to Data Science
Data Analytics Module Topics
• Defining Analytics
• Types of data analytics
Descriptive, Diagnostic
Predictive, Prescriptive
• Data Analytics – methodologies
KDD
CRISP‐DM
SEMMA
SMAM
BIG DATA LIFE CYCLE
• Analytics Capacity Building
• Challenges in Data‐driven decision making
Data Analytics Methodologies
What exactly is a methodology?
Source: Data Mining and Predictive Analytics Daniel Larose & Chantal Larose
Knowledge Discovery in Databases
Knowledge Discovery in Databases
• The Knowledge Discovery in Databases
(or KDD) refers to a process of finding
knowledge from data in the context of
large databases
• It does this by applying data mining methods (algorithms) to extract
what is deemed knowledge:
In accordance with the specifications of measures and thresholds, using a database
along with any required preprocessing, subsampling, and transformations of that
database
Source: http://www2.cs.uregina.ca/~dbd/cs831/notes/kdd/1_kdd.html
Image source: https://www.javatpoint.com/kdd‐process‐in‐data‐mining
Knowledge Discovery in Databases
KDD Process
• KDD requires relevant prior knowledge and a brief understanding of the application domain and its goals
• KDD process is iterative and interactive in nature
• There are nine different steps in seven different phases of this model
Source: http://www2.cs.uregina.ca/~dbd/cs831/notes/kdd/1_kdd.html
Image source: https://www.javatpoint.com/kdd‐process‐in‐data‐mining
Knowledge Discovery in Databases
KDD Process
• Pre KDD process (Developing Domain Understanding)
This is the first stage of KDD process in which goals are defined from customer’s view point
and used to develop and understanding about application domain and its prior knowledge
Involves developing an understanding of:
The application domain
The relevant prior knowledge and
The goals of the end‐user
• Post KDD process (Using Discovered Knowledge)
This understanding must be followed by:
Knowledge consolidation
Incorporating this knowledge into the system
Source: http://www2.cs.uregina.ca/~dbd/cs831/notes/kdd/1_kdd.html
Knowledge Discovery in Databases
KDD Process
• Selection:
This is the second stage of the KDD process, which focuses on creating a target data set and selecting a subset of data samples or variables.
It is an important stage because knowledge discovery is performed on this selected data.
• Pre‐processing
This is the third stage of KDD process which focuses on target data
cleaning and pre‐processing to obtain consistent data without any
noise and inconsistencies
In this stage, strategies are developed for handling such noisy and inconsistent data.
• Transformation
This is the fourth stage of KDD process which focuses on the
transformation of data using dimensionality reduction and other
transformation methods
Source: http://www2.cs.uregina.ca/~dbd/cs831/notes/kdd/1_kdd.html
Image source: https://www.javatpoint.com/kdd‐process‐in‐data‐mining
Knowledge Discovery in Databases
KDD Process
• Data Mining ‐ this stage consists of three steps:
Choosing the suitable data mining task
In this step of KDD process an appropriate data mining task is
chosen based on particular goals that are defined in first stage
Examples of data mining methods or tasks are classification, clustering, regression, summarization, etc.
Choosing a suitable data mining algorithm
In this step, one or more appropriate data mining algorithms are
selected for searching different patterns from data
There are a number of algorithms available today for data mining, but appropriate algorithms are selected by matching them against the overall criteria for data mining
Employing the data mining algorithm
In this step, the selected algorithms are implemented
Source: http://www2.cs.uregina.ca/~dbd/cs831/notes/kdd/1_kdd.html
Image source: https://www.javatpoint.com/kdd‐process‐in‐data‐mining
Knowledge Discovery in Databases
KDD Process
• Interpretation/Evaluation
This is the eighth step of KDD process that focuses on
interpretation and evaluation of mined patterns
This step may involve in visualization of extracted patterns
• Post KDD Process (Using discovered knowledge)
In this final step of KDD process the discovered knowledge is
used for different purposes
The discovered knowledge can be used by the interested parties or can be integrated with another system for further action
Source: http://www2.cs.uregina.ca/~dbd/cs831/notes/kdd/1_kdd.html
Image source: https://www.javatpoint.com/kdd‐process‐in‐data‐mining
CRISP‐DM
CRoss‐Industry Standard Process for Data
Mining
Data Analytics Methodologies
CRISP‐DM Methodology
• Stands for "CRoss‐Industry Standard Process for Data Mining"
• CRISP‐DM was conceived in 1996 and became a European Union project under the ESPRIT funding
initiative in 1997
European Strategic Program on Research in Information Technology
• The project was led by five companies: Integral Solutions Ltd (ISL), Teradata, Daimler AG, NCR
Corporation and OHRA, an insurance company.
ISL was later acquired by and merged into SPSS. The computer giant NCR Corporation produced the Teradata data
warehouse and its own data mining software.
• It is an open standard process model that describes common approaches used by data mining
experts
• Provides a nonproprietary, technology‐agnostic, and structured approach for fitting a data mining project into the general problem‐solving strategy
• This model is an idealized sequence of six stages
• In practice, it will often be necessary to backtrack to previous activities and repeat certain actions
Source: Data Mining and Predictive Analytics Daniel Larose & Chantal Larose
Data Analytics Methodologies
CRISP‐DM Methodology
• A data mining project is conceptualized as a
life cycle consisting of six phases
• The phase‐sequence is adaptive
That is, the next phase in the sequence depends on
the outcomes associated with the previous phase
• Dependencies between phases are indicated by the arrows
For instance, suppose that we are in the modeling phase
Depending on the behavior and characteristics of the model, we may have to return to
the data preparation phase for further refinement before moving forward to the model
evaluation phase
Source: Data Mining and Predictive Analytics Daniel Larose & Chantal Larose
Image Source: https://towardsdatascience.com/why‐using‐crisp‐dm‐will‐make‐you‐a‐better‐data‐scientist‐66efe5b72686
Data Analytics Methodologies
Six Phases of CRISP‐DM Methodology
• Business/Research Understanding Phase
This is the first phase of CRISP‐DM process
It focuses on and uncovers important factors including success
criteria, business and data mining objectives and requirements as
well as business terminologies and technical terms.
a) First, clearly enunciate the project objectives and requirements
in terms of the business or research unit as a whole
b) Then, formulate these goals and restrictions into a data science
problem definition
c) Finally, prepare a preliminary strategy for achieving these
objectives
Source: Data Mining and Predictive Analytics Daniel Larose & Chantal Larose
Data Analytics Methodologies
Six Phases of CRISP‐DM Methodology
• Data Understanding Phase
This phase of CRISP‐DM process focuses on data collection,
exploring of data, checking quality, to get insights into the
data to form hypotheses for hidden information
a) First, collect the data
b) Then, use exploratory data analysis to familiarize yourself
with the data, and discover initial insights
c) Evaluate the quality of the data
d) Finally, if desired, select interesting subsets that may
contain actionable patterns
Source: Data Mining and Predictive Analytics Daniel Larose & Chantal Larose
Data Analytics Methodologies
Six Phases of CRISP‐DM Methodology
• Data Preparation Phase
This phase focuses on selection and preparation of the final data set
This phase may include many tasks such as records, table and attributes selection
as well as cleaning and transformation of data.
a) This labor‐intensive phase covers all aspects of preparing the final data set, which
shall be used for subsequent phases, from the initial, raw, dirty data
b) Select the cases and variables you want to analyze, and that are appropriate for
your analysis.
c) Perform transformations on certain variables, if needed.
d) Clean the raw data so that it is ready for the modeling tools.
Source: Data Mining and Predictive Analytics Daniel Larose & Chantal Larose
Data Analytics Methodologies
Six Phases of CRISP‐DM Methodology
• Modeling Phase
This phase involves selection and application of various modeling techniques
Different parameters are set and different models are built for same data mining
problem
a) Select and apply appropriate modeling techniques.
b) Calibrate model settings to optimize results.
c) Often, several different techniques may be applied for the same data science
problem
d) May require looping back to a previous phase in order to bring the data into the specific format required by a particular modeling technique
Source: Data Mining and Predictive Analytics Daniel Larose & Chantal Larose
Data Analytics Methodologies
Six Phases of CRISP‐DM Methodology
• Evaluation Phase
This phase involves evaluation of obtained models and deciding how to use the
results
Interpretation of the model depends upon the algorithm and models are evaluated
to review whether they achieve the objectives properly or not.
a) Evaluate the models delivered by the modeling phase for quality and
effectiveness, before deploying them for use in the field
b) Determine whether the model in fact achieves the objectives set for it in phase 1.
c) Establish whether some important facet of the business or research problem has
not been sufficiently accounted for.
d) Finally, come to a decision regarding the use of the modeling results.
Source: Data Mining and Predictive Analytics Daniel Larose & Chantal Larose
Data Analytics Methodologies
Six Phases of CRISP‐DM Methodology
• Deployment Phase
This phase focuses on deploying the final model and determining the usage of the obtained knowledge and results
This phase also focuses on organizing, reporting and presenting the gained
knowledge when needed
a) Model creation does not signify the completion of the project; the models must be put to use
b) Examples:
a) A simple deployment: Generate a report.
b) A more complex deployment: Initiate projects based on the predictive outcomes of the
model
c) For businesses, the customer often carries out the deployment based on your
model
Source: Data Mining and Predictive Analytics Daniel Larose & Chantal Larose
Data Analytics Methodologies
Phases and Tasks of CRISP‐DM Methodology
SEMMA Process Model
Data Analytics Methodologies
SEMMA Process Model
• SEMMA stands for Sample, Explore, Modify, Model and Assess
• It is a data mining method presented by the SAS Institute
• It enables data organization, discovery, development and maintenance of data
mining projects
• SAS Institute defines data mining as the process of Sampling, Exploring,
Modifying, Modeling, and Assessing (SEMMA) large amounts of data to uncover
previously unknown patterns which can be utilized as a business advantage
• The data mining process is applicable across a variety of industries and provides methodologies for diverse business problems, such as:
Fraud detection, customer retention and attrition, market segmentation, risk analysis, customer
satisfaction, bankruptcy prediction, and portfolio analysis
Source: Big Data Analytics Methods, 2nd Edition by Ghavami P
Data Analytics Methodologies
SEMMA Process Model
• Five stages of SEMMA model:
Sample:
Involves sampling of data to extract critical information about the data while small enough to manipulate quickly (this stage is optional)
Explore:
This stage focuses on data exploration and discovery (e.g., data visualization, clustering, associations)
It improves understanding of the data, finding trends and anomalies in the data
Modify:
This stage involves modification of data through variable selection, data creation and data transformations, and the model selection process
This stage may evaluate anomalies, outliers and variable reduction in data
Model:
The goal of this stage is to build the model and apply the model to the data
Different modeling techniques (neural networks, tree‐based models, logistic models, other statistical models) may be applied as datasets have different attributes
Assess:
The final stage of SEMMA evaluates the reliability and usefulness of the results, and the performance and accuracy of the model(s)
[Figure: SEMMA stages – Sample (optional), Explore, Modify, Model, Assess]
Source: SAS® Enterprise Miner™ 14.3: Reference Help
Big Data Analytics Methods, 2nd Edition by Ghavami P
Data Analytics Methodologies
SEMMA Process Model
• SEMMA is actually not a data mining methodology
• It is a logical organization of the functional tool set of SAS Enterprise Miner for carrying out the core tasks of data mining
[Figure: SEMMA stages – Sample (optional), Explore, Modify, Model, Assess]
Source: SAS® Enterprise Miner™ 14.3: Reference Help
Big Data Analytics Methods, 2nd Edition by Ghavami P
Comparison of Methodologies
Data Analytics Methodologies
Comparison of KDD Vs. SEMMA
• By comparing KDD with SEMMA stages we can see that they are very similar
SEMMA – KDD
Sample phase – can be identified with Selection
Explore phase – can be identified with Pre‐processing
Modify phase – can be identified with Transformation
Model phase – can be identified with Data Mining
Assess phase – can be identified with Interpretation/Evaluation
• Close examination tells us that the five stages of the SEMMA process can be seen as a practical implementation of the five stages of the KDD process, since it is directly linked to the SAS Enterprise Miner software.
Data Analytics Methodologies
Comparison of KDD Vs. CRISP‐DM
• Comparing the KDD stages with the CRISP‐DM stages is not as straightforward as in the SEMMA case
• The CRISP‐DM methodology incorporates the steps that, as referred to above, must precede and follow the KDD process, that is to say:
• The Business Understanding phase
Can be identified with the development of an understanding of the application domain, the relevant prior knowledge and the goals of
the end‐user;
• The Data Understanding phase
Can be identified as the combination of Selection and Pre‐processing;
• The Deployment phase
Can be identified with the consolidation by incorporating this knowledge into the system.
• The Data Preparation phase
Can be identified with Transformation;
• The Modeling phase
Can be identified with Data Mining;
• The Evaluation phase
Can be identified with Interpretation/Evaluation.
Data Analytics Methodologies
KDD Vs. SEMMA Vs. CRISP‐DM
[Figure: side‐by‐side mapping of the KDD, SEMMA and CRISP‐DM stages]
Data Analytics Methodologies
Drawbacks of CRISP‐DM‐1
• One major drawback is that the model no longer seems
to be actively maintained
• The official site, CRISP‐DM.org, is no longer being
maintained
• Furthermore, the framework itself has not been updated
for working with new technologies, such as big data
• Big data technologies mean that there can be additional effort spent in the data understanding phase
For example, as the business grapples with the additional complexities involved in the shape of big data sources.
Data Analytics Methodologies
Drawbacks of CRISP‐DM‐1
• The methodology itself was conceived in the mid‐1990s
• Gregory Piatetsky of KDNuggets says:
"CRISP‐DM remains the most popular methodology for
analytics, data mining, and data science projects, with 43%
share in latest KDnuggets Poll, but a replacement for
unmaintained CRISP‐DM is long overdue."
Data Analytics Methodologies
Drawbacks of CRISP‐DM‐2
• CRISP‐DM is a great framework and its use on
projects helps focus them on delivering real business
value
• CRISP‐DM has been around a long time so many
projects are implemented using CRISP‐DM
• However, projects do take shortcuts
• These methodology shortcuts translate into frequent friction points between the business and IT professionals
• Some of these shortcuts make sense but too often
they result in projects using a corrupted version of the
approach like the one shown in Figure
Data Analytics Methodologies
Drawbacks of CRISP‐DM‐3: Common Shortcuts
• A lack of clarity:
Rather than drilling down into the details to get clarity on both the business problem and the analytic approach, the project team works superficially with business goals and uses some metrics to measure success.
• Mindless rework:
Most try to find new data or new modeling techniques rather than working
with their business partners to re‐evaluate the business problem.
• Blind hand‐off to IT:
Some analytic teams don’t think about deployment and operationalization of
their models at all
They fail to recognize that the models they build will have to be applied to
live data in operational data stores or embedded in operational systems.
• Failure to iterate
Analytic professionals know that models age and that models need to be kept
up to date if they are to continue to be valuable.
Standard Methodology for
Analytical Models (SMAM)
Data Analytics Methodologies
Standard Methodology for Analytics Models (SMAM)
• Here we will look into another methodology called SMAM, which is
supposed to overcome those friction points
Source: https://www.kdnuggets.com/2015/08/new‐standard‐methodology‐analytical‐models.html
Data Analytics Methodologies
Standard Methodology for Analytics Models (SMAM)
Phase – Description
Use case identification – Selection of the ideal approach from a list of candidates
Model requirements gathering – Understanding the conditions required for the model to function
Data preparation – Getting the data ready for the modeling
Modeling experiments – Scientific experimentation to solve the business question
Insight creation – Visualization and dashboarding to provide insight
Proof of value (ROI) – Running the model in a small‐scale setting to prove the value
Operationalization – Embedding the analytical model in operational systems
Model Lifecycle – Governance around model lifetime and refresh
Source: https://www.kdnuggets.com/2015/08/new‐standard‐methodology‐analytical‐models.html
Data Analytics Methodologies
Standard Methodology for Analytics Models (SMAM)
• Use case identification phase
Describes the brainstorming/discovery process of looking at the different areas where models may
be applicable
It involves educating business parties involved about what analytical modeling is, what realistic
expectations are from the various approaches and how models can be leveraged in the business
Discussions on the use‐case identification involve topics around data availability, model integration
complexity, analytical model complexity and model impact on the business
From a list of identified use‐cases in an area, the one with the best ranking on above mentioned
criteria should be considered for implementation
Parties involved in this phase are (higher) management to ensure the right goal setting, IT for data
availability, the involved department for model relevance checking and the data scientists to infuse
the analytical knowledge and creative analytical idea provisioning
The result of this phase is a chosen use‐case, and potentially a roadmap with the other considered
initiatives on a timeline.
Source: https://www.kdnuggets.com/2015/08/new‐standard‐methodology‐analytical‐models.html
Data Analytics Methodologies
Standard Methodology for Analytics Models (SMAM)
• Model requirements gathering
In this phase, requirements are gathered for the use‐case selected in the previous
phase
These requirements include:
Conditions for the cases/customers/entities considered for scoring
Requirements about the availability of data, initial ideas around model reporting can be
explored and finally, ways that the end‐users would like to consume the results of the analytical
models
Parties involved in this phase are people from the involved department(s), the end‐users and
the data scientists
The result of this phase is a requirements document.
Source: https://www.kdnuggets.com/2015/08/new‐standard‐methodology‐analytical‐models.html
Data Analytics Methodologies
Standard Methodology for Analytics Models (SMAM)
• Data preparation
In this phase, discussions revolve around data access, data location, data understanding,
data validation, and creation of the modeling data
This is a phase where IT/data administrators/DBA’s closely work together with the data
scientist to help prepare the data in a format consumable by the data scientist
The process is agile; the data scientist tries out various approaches on smaller sets and then may ask IT to perform the required transformations at full scale
As with the CRISP‐DM, the previous phase, this phase and the next happen in that order,
but often iterate back and forth
Parties involved
The involved parties are IT/data administrators/DBA/data modelers and data scientists
The results of this phase should be the data scientist being convinced that with the data
available, a model is viable, as well as the scoring of the model in the operational
environment
Source: https://www.kdnuggets.com/2015/08/new‐standard‐methodology‐analytical‐models.html
Data Analytics Methodologies
Standard Methodology for Analytics Models (SMAM)
• Modeling experiments
In this phase, data scientists play with the data and try to crack the nut, coming up with a solution that is cool, elegant, and working
Results are not immediate; progress is obtained by evolution and by patiently
collecting insights to put them together in an ever evolving model
At times, the solution at hand may not look viable anymore, and an entire different
angle needs to be explored, seemingly starting from scratch
The data scientist may need to connect to end‐user to validate initial results, or to
have discussion to get ideas which can be translated into testable hypotheses/
model features
The result of this phase is an analytical model that is evaluated in the best possible
way with the (historic) data available as well as a reporting of these facts.
Source: https://www.kdnuggets.com/2015/08/new‐standard‐methodology‐analytical‐models.html
Data Analytics Methodologies
Standard Methodology for Analytics Models (SMAM)
• Insight creation
Dashboards and visualization are critically important for the acceptance of the model by the
business
The outcomes of this phase are analytical reporting and operational reporting
Analytical reporting refers to any reporting on data where the outcome (of the analytical model) has
already been observed – descriptive modeling
This data can then be used to understand the performance of the model and the evolution of
performance over time
Operational reporting refers to any reporting on the data where the outcome has not yet been
observed – predictive modeling
This data can be used to understand what the model predicts for the future in an aggregated sense
and is used for monitoring purposes
The involved parties are the end‐users, business department, and the data scientists
The result of this phase is a set of visualizations and dashboards that provide a clear view on the
model effectiveness and provide business usable insights.
Source: https://www.kdnuggets.com/2015/08/new‐standard‐methodology‐analytical‐models.html
Data Analytics Methodologies
Standard Methodology for Analytics Models (SMAM)
• Proof of value (ROI)
This phase involves deploying the model on a small scale to ensure return on
investment
If the ROI is positive, the business will be convinced that they can trust the model
A decision can be made if the model should be deployed or not
• Operationalization
This phase involves putting the model into production and operationalization of the
model by closely working with the IT department
The model is handed over to the maintenance team for support
Source: https://www.kdnuggets.com/2015/08/new‐standard‐methodology‐analytical‐models.html
Data Analytics Methodologies
Standard Methodology for Analytics Models (SMAM)
• Model lifecycle
An analytical model in production will not be fit forever
Depending on how fast the business changes, the model performance degrades over time.
Generally, two types of model changes can happen: refresh and upgrade
In a model refresh, the model is trained with more recent data, leaving the model
structurally untouched
The model upgrade is typically initiated by the availability of new data sources and the
request from the business to improve model performance by the inclusion of the new
sources
The involved parties are the end‐users, the operational team that handles the model
execution, the IT/data administrators/DBA for the new data and the data scientist
Source: https://www.kdnuggets.com/2015/08/new‐standard‐methodology‐analytical‐models.html
Big Data Life Cycle
Data Analytics Methodologies
Big Data Life Cycle
• Big Data analysis differs from traditional data analysis primarily due to the volume,
velocity, variety, and veracity characteristics of the data
• Massive amounts of data ‐ It is big – typically in terabytes or even petabytes
• Varied data formats such as video files to SQL databases to text data
It could be a traditional database, it could be video data, log data, text data or even voice data
• Data that comes in at varying frequency – from days to minutes ‐ It keeps increasing as
new data keeps flowing in
• To address these distinct aspects of Big Data, a step‐by‐step methodology is needed to
organize and manage the tasks and activities associated with Big Data analysis
• These activities and tasks involve acquiring, processing, analyzing and repurposing data
• In addition to the lifecycle, it is also important to consider the issues of training, education,
tooling and staffing of a data analytics team
Source: Big Data Fundamentals: Concepts, Drivers & Techniques by Wajid Khattak; Paul Buhler; Thomas Erl
Data Analytics Methodologies
Big Data Life Cycle
• The Big Data analytics lifecycle can be divided
into nine stages:
1) Business Case Evaluation
2) Data Identification
3) Data Acquisition & Filtering
4) Data Extraction
5) Data Validation & Cleansing
6) Data Aggregation & Representation
7) Data Analysis
8) Data Visualization
9) Utilization of Analysis Results
Source: Big Data Fundamentals: Concepts, Drivers & Techniques by Wajid Khattak; Paul Buhler; Thomas Erl
Data Analytics Methodologies
Big Data Life Cycle
• Business Case Evaluation
This stage involves creating, assessing, and
approving a business case prior to proceeding to
next stage
The business case presents:
Clear requirements
Justification for the analytics project
Motivation
Goals for carrying out the analysis
Budget
Timeline
Resources
Source: Big Data Fundamentals: Concepts, Drivers & Techniques by Wajid Khattak; Paul Buhler; Thomas Erl
Data Analytics Methodologies
Big Data Life Cycle
• Data Identification
This stage involves identifying the data and their sources required for the analysis
A wider variety of data sources may increase the probability of finding hidden
patterns and correlations
For example, it is beneficial to look into as many types of sources as possible, particularly when it is not clear exactly what to look for
However, there is a tradeoff between the value gained and the investment in terms of resources, time, and budget
These datasets can be internal or external to the enterprise depending on the project
objectives
Internal data sources include
Data warehouses, data marts, operational systems, etc.
External data sources include
Third‐party datasets such as market trends, stock data analysis; Government supplied datasets,
and other publicly available datasets.
Sometimes, data may need to be harvested using automated tools from weblogs (blogs) and other content‐based web sites
Source: Big Data Fundamentals: Concepts, Drivers & Techniques by Wajid Khattak; Paul Buhler; Thomas Erl
Data Analytics Methodologies
Big Data Life Cycle
• Data Acquisition & Filtering
Here, data is gathered from all of the data sources that were identified during the
previous stage
Depending on the type of data source, data may come as a collection of files, such as
data purchased from a third‐party data provider, or may require API integration, such
as with Twitter
In many cases, some or most of the acquired data may be irrelevant (noise) and can
be discarded as part of the filtering process
Metadata can be added via automation to data from both internal and external data
sources to improve the classification and querying
For example:
Dataset size and structure, source information, date and time of creation or collection and
language‐specific information
Metadata should be machine‐readable so that it can be passed forward along
subsequent analysis stages
This helps maintain the origins of data throughout the Big Data analytics lifecycle,
which helps to establish and preserve data accuracy and quality
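• As an illustration (not from the source), machine‐readable metadata for an acquired dataset could be captured as a small JSON record; the field names below are hypothetical:
import json
from datetime import datetime, timezone

# Hypothetical metadata record that travels with an acquired dataset
metadata = {
    "source": "third-party provider X",            # assumed provider name
    "acquired_at": datetime.now(timezone.utc).isoformat(),
    "format": "CSV",
    "size_bytes": 10_485_760,
    "record_count": 250_000,
    "language": "en",
}

print(json.dumps(metadata, indent=2))              # machine-readable, can be passed to later stages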
Source: Big Data Fundamentals: Concepts, Drivers & Techniques by Wajid Khattak; Paul Buhler; Thomas Erl
Data Analytics Methodologies
Big Data Life Cycle
• Data Extraction and Transformation
Due to disparate external data sources, data may
arrive in a format incompatible with the Big Data
solution
XML, JSON Files
Tab or comma or other delimited files
Raw text files
The extent of extraction and transformation
required depends on the types of analytics and
capabilities of the Big Data solution
For example: Merging fields, Separating data fields into
individual components, etc
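• A minimal sketch (pandas, hypothetical column names) of the field‐level transformations mentioned above, i.e. merging fields and separating a field into components:
import pandas as pd

df = pd.DataFrame({"first_name": ["Asha", "Ravi"],
                   "last_name": ["Rao", "Kumar"],
                   "full_address": ["12 MG Road, Bengaluru", "4 Park St, Kolkata"]})

# Merge two fields into one
df["full_name"] = df["first_name"] + " " + df["last_name"]

# Separate one field into individual components
df[["street", "city"]] = df["full_address"].str.split(", ", n=1, expand=True)
print(df)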
Source: Big Data Fundamentals: Concepts, Drivers & Techniques by Wajid Khattak; Paul Buhler; Thomas Erl
Data Analytics Methodologies
Big Data Life Cycle
• Data Validation & Cleansing
Internal data usually has a pre‐defined structure and is pre‐validated
External data can be unstructured and without any
indication of validity
This stage is dedicated to establishing complex validation
rules and removing any known invalid data
For example:
Redundant data across different datasets
o This redundancy can be exploited to explore interconnected
datasets in order to assemble validation parameters and fill in
missing valid data
The data validation and cleansing can be a batch process
during ETL or it can be a real‐time in‐memory operation
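• A minimal sketch (pandas, hypothetical field names) of expressing validation rules and removing known invalid records:
import pandas as pd

claims = pd.DataFrame({"claim_id": [1, 2, 3],
                       "age": [42, 300, 67],              # 300 is clearly invalid
                       "amount": [1200.0, 450.0, -80.0]}) # a negative amount is invalid

# Validation rules expressed as boolean conditions
valid = claims["age"].between(0, 120) & (claims["amount"] >= 0)

clean_claims = claims[valid]      # records passing every rule
rejected = claims[~valid]         # known invalid records, set aside for review
print(clean_claims, rejected, sep="\n")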
Source: Big Data Fundamentals: Concepts, Drivers & Techniques by Wajid Khattak; Paul Buhler; Thomas Erl
Data Analytics Methodologies
Big Data Life Cycle
• Data Aggregation & Representation
This stage is dedicated to integrating multiple
datasets together to arrive at a unified view.
Data spread across multiple datasets may need to be
joined together via common fields such as Date or ID
Data may need to be reconciled because the same data fields may appear in multiple datasets
E.g., date of birth
Data aggregation can become complicated because
of differences in:
Data Structure – Although the data format may be the
same, the data model may be different
Semantics – A value that is labeled differently in two
different datasets may mean the same thing, for
example "surname" and "last name"
Source: Big Data Fundamentals: Concepts, Drivers & Techniques by Wajid Khattak; Paul Buhler; Thomas Erl
Data Analytics Methodologies
Big Data Life Cycle
• Data Aggregation and Representation
Very often they follow different naming conventions
and varied standards for data representation
E.g., Customer_Identifier, Customer_Id, Cust_Id, CID
Names have to be standardized and discrepancies
should be resolved
Integrating the data involves combining all the relevant
operational data into coherent data structures to be
made ready for loading into the data warehouse.
Data integration and consolidation is a type of
preprocess before other major transformation routines
are applied
Data Analytics Methodologies
Big Data Life Cycle
• Data Aggregation & Representation
Data aggregation can be a time and effort intensive operation
because of large volumes of data
Future data analysis requirements need to be considered during this
stage to help foster data reusability
It is important to understand that the same data can be stored in
many different forms
One form may be better suited for a particular type of analysis than
another
For example, data stored as a BLOB would be of little use if the analysis
requires access to individual data fields
A standardized data structure for the Big Data solution can be used
for a range of analysis techniques and projects
Source: Big Data Fundamentals: Concepts, Drivers & Techniques by Wajid Khattak; Paul Buhler; Thomas Erl
Data Analytics Methodologies
Big Data Life Cycle
• Data Analysis
This stage is dedicated to carrying out the actual analysis task,
which typically involves one or more types of analytics
This stage can be iterative in which case analysis is repeated
until the appropriate pattern or correlation is uncovered
Depending on the type of analytic result required, this stage
can be
as simple as querying a dataset to compute an aggregation for
comparison
as challenging as combining data mining and complex statistical analysis
techniques to discover patterns, anomalies, or to generate a statistical
or mathematical model to depict relationships between variables.
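• For the simple end of that spectrum, a query that computes an aggregation for comparison might look like this (pandas, hypothetical columns):
import pandas as pd

loans = pd.DataFrame({"branch": ["North", "North", "South", "South"],
                      "amount": [1000, 2500, 1800, 900],
                      "defaulted": [0, 1, 0, 0]})

# Aggregate: average loan amount and default rate per branch
summary = loans.groupby("branch").agg(avg_amount=("amount", "mean"),
                                      default_rate=("defaulted", "mean"))
print(summary)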
Source: Big Data Fundamentals: Concepts, Drivers & Techniques by Wajid Khattak; Paul Buhler; Thomas Erl
Data Analytics Methodologies
Big Data Life Cycle
• Data Visualization
This stage is dedicated to using visualization techniques and tools to
graphically communicate the results of analysis for effective
interpretation
Business users need to be able to understand the results in order to
obtain value from the data analysis and subsequently have the ability
to provide feedback from stage 8 back to stage 7
Data Visualization provides users with the ability to perform visual
analysis, allowing for the discovery of answers to unknown questions
Data Visualization helps in presenting the same results in a number of different ways, which can influence the interpretation of the results.
Providing a method of drilling down to comparatively simple statistics
is crucial, in order for users to understand how the rolled up or
aggregated results were generated.
Source: Big Data Fundamentals: Concepts, Drivers & Techniques by Wajid Khattak; Paul Buhler; Thomas Erl
Data Analytics Methodologies
Big Data Life Cycle
• Utilization of Analysis Results
This stage involves determining how and where processed
analysis data can be further leveraged
The results of the analysis may produce "models" that can be used on new datasets to understand patterns and relationships that exist within the data
A model may look like a mathematical equation or a set of rules
Models can be used to improve business process logic and
application system logic, and they can form the basis of a new
system or software program
Source: Big Data Fundamentals: Concepts, Drivers & Techniques by Wajid Khattak; Paul Buhler; Thomas Erl
Data Analytics Methodologies
Big Data Life Cycle Vs. CRISP‐DM
• A comparison of stages involved in Big Data Life Cycle and CRISP‐DM
Business Problem Definition – CRISP DM : Business understanding
Research ‐‐ new step
Human Resources Assessment – new step
Data Acquisition – new step
Data Munging – new step
Data Storage – new step
Exploratory Data Analysis – CRISP DM : Data Understanding
Data Preparation for Modeling and Assessment – CRISP DM : Data Preparation
Modeling – CRISP DM : Data Modelling
Implementation – CRISP DM : Evaluation & Deployment
Data Analytics Methodologies
Big Data Life Cycle – Possible case studies
• What is precision agriculture? Why is it a likely answer to climate change and food security?
https://www.youtube.com/watch?v=581Kx8wzTMc&ab_channel=Inter‐AmericanDevelopmentBank
• Innovating for Agribusiness
https://www.youtube.com/watch?v=C4W0qSQ6A8U
• How Big Data Can Solve Food Insecurity
https://www.youtube.com/watch?v=4r_IxShUQuA&ab_channel=mSTARProject
• AI for AgriTech ecosystem in India‐ IBM Research
https://www.youtube.com/watch?v=hhoLSI4bW_4&ab_channel=IBMIndia
• Bringing Artificial Intelligence to agriculture to boost crop yield
https://www.youtube.com/watch?v=GSvT940uS_8&ab_channel=MicrosoftIndia
Thank You!
Introduction to Data Science
Data Science Process
Dr. Ramakrishna Dantu
Associate Professor, BITS Pilani
Introduction to Data Science
Disclaimer and Acknowledgement
• The content for these slides has been obtained from books and various other sources on the Internet
• I hereby acknowledge all the contributors for their material and inputs.
• I have provided source information wherever necessary
• I have added and modified the content to suit the requirements of the
course
Introduction to Data Science
Data Science Process
• Data Science Methodology
Business understanding
Data Requirements
Data Acquisition
Data Understanding
Data preparation
Modelling
Model Evaluation
Deployment and feedback
• Case Study
• Data Science Proposal
Samples
Evaluation
Review Guide
Data Science Process
Data Science Process
Stages in Data Science Process
• Setting the research goal
• Retrieving data
• Data Preparation
• Data Exploration
• Data Modeling
• Presentation and Automation
Source: Introducing Data Science by Cielin et al.,
Data Science Process
Business Scenario
• Suppose you’re working for a Bank
• The bank feels that it’s losing too much money to
bad loans and wants to reduce its losses
• To do so, they want a tool to help loan officers
more accurately detect risky loans
Source: Introducing Data Science by Cielin et al.,
Setting the Research Goal
Data Science Process
Setting the Research Goal ‐ Concept
• An essential outcome of this stage is the research goal that states the purpose
of your assignment in a clear and focused manner
• Understanding the business goals and context is critical for project success
• Ask questions until you grasp the exact business expectations
Identify how your project fits in the bigger picture
Understand how this research is going to change the business
Understand how business will use the results
• Nothing is more frustrating than reporting your findings back to the organization and having everyone immediately realize that you misunderstood their question
• Don’t skim over this phase lightly
• Many data scientists fail here:
Despite their mathematical wit and scientific brilliance, they never seem to grasp the
business goals and context
Source: Introducing Data Science by Cielin et al.,
Data Science Process
Setting the Research Goal ‐ Concept
• The first task in a data science project is to define a measurable and
quantifiable goal
• Typically, this phase involves two steps:
Define research goal
Develop a project charter
• Our aim is to learn all about the context of the project:
Why do the sponsors want the project in the first place?
What do they lack, and what do they need?
What are they doing to solve the problem now, and why isn’t that good enough?
What resources will you need: what kind of data and how much staff?
Will you have domain experts to collaborate with, and what are the computational
resources?
How do the project sponsors plan to deploy your results?
What are the constraints that have to be met for successful deployment?
Source: Introducing Data Science by Cielin et al.,
Data Science Process
Setting the Research Goal ‐ Concept
• A project charter requires teamwork, and your input
covers at least the following:
A clear research goal
The project mission and context
How you’re going to perform your analysis
What resources you expect to use
Proof that it’s an achievable project, or proof of concepts
Deliverables and a measure of success
A timeline
• Your client can use this information to make an
estimation of the project costs and the data and people
required for your project to become a success.
Source: Introducing Data Science by Cielin et al.,
Data Science Process
Setting the Research Goal – Case Scenario
• In the loan example, the ultimate business goal is to reduce the bank’s
losses due to bad loans
• Project sponsor wants a tool to help loan officers more accurately score
loan applicants, and so reduce the number of bad loans made
• Loan officers feel that they have final discretion on loan approvals
• Once we have a high‐level understanding, we work with stakeholders to
define the precise goal of the project
• The goal should be specific and measurable
NO ‐ "We want to get better at finding bad loans" but instead
YES ‐ "We want to reduce our rate of loan charge‐offs by at least 10%, using a model that
predicts which loan applicants are likely to default."
Source: Practical Data Science with R ‐ Nina Zumel, John Mount
Data Science Process
Setting the Research Goal – Case Scenario
• Importance of having a specific and measurable goal
A concrete goal helps with acceptance criteria and when to stop the project
If the goal is less specific (not measurable), there is no boundary to the project,
because no result will be "good enough."
If we don’t know what we want to achieve, we don’t know when to stop trying—or
even what to try
When the project eventually terminates—because either time or resources run
out—no one will be happy with the outcome
Source: Practical Data Science with R ‐ Nina Zumel, John Mount
Data Science Process
Setting the Research Goal – Case Scenario
• When can we have non‐specific or non‐measurable goals?
When our project is explorative in nature
"Is there something in the data that correlates to higher defaults?"
"Should we think about reducing the kinds of loans we give out?"
Which types might we eliminate?
In this situation, we can still scope the project with concrete stopping conditions, such as a
time limit
For example
We might decide to spend two weeks, and no more, exploring the data, with the goal of coming up
with candidate hypotheses
These hypotheses can then be turned into concrete questions or goals for a full‐scale modeling
project
Source: Practical Data Science with R ‐ Nina Zumel, John Mount
Data Collection
Data Science Process
Retrieving Data ‐ Concept
• This step involves finding suitable data and getting access
to it:
Internal data
External data
• In this step, we ask the following questions:
What data is available to me?
Will it help me solve the problem?
Is it enough?
Is the data quality good enough?
• Data found in this phase is usually in the raw form
• Data requires polishing and transformation before using
it in the next stage
Source: Introducing Data Science by Cielin et al.,
Data Science Process
Retrieving Data – Case Scenario
• For the loan application case scenario, we need data related to loan
applications, and those that were defaulted
• Imagine that we’ve collected a sample of representative loans from
the last decade
• Some of the loans have defaulted
Most of them (about 70%) have not
• We’ve collected a variety of attributes about each loan application
See the listing in the next slide
Source: Practical Data Science with R ‐ Nina Zumel, John Mount
Data Science Process
Retrieving Data – Case Scenario
• Table 1.2 Loan data attributes
Status_of_existing_checking_account (at time of application)
Duration_in_month (loan length)
Credit_history
Purpose (car loan, student loan, and so on)
Credit_amount (loan amount)
Savings_Account_or_bonds (balance/amount)
Present_employment_since
Installment_rate_in_percentage_of_disposable_income
Personal_status_and_sex
Cosigners
Present_residence_since
Collateral (car, property, and so on)
Age_in_years
Other_installment_plans (other loans/lines of credit—the type)
Housing (own, rent, and so on)
Number_of_existing_credits_at_this_bank
Job (employment type)
Number_of_dependents
Telephone (do they have one)
Loan_status (dependent variable)
Source: Practical Data Science with R ‐ Nina Zumel, John Mount
Data Science Process
Retrieving Data – Case Scenario
• In the dataset, Loan_status takes on two possible values:
GoodLoan and BadLoan
• For the purposes of this case scenario
GoodLoan means the loan was paid off, and BadLoan means the loan defaulted
• Analysis Tip:
As much as possible, try to use information that can be directly measured, rather than information that is
inferred from another measurement
For example, we might be tempted to use income as a variable, reasoning that a lower income implies
more difficulty paying off a loan
The ability to pay off a loan is more directly measured by considering the size of the loan payments
relative to the borrower’s disposable income
This information is more useful than income alone
This information is available in the dataset as the variable
Installment_rate_in_percentage_of_disposable_income
Source: Practical Data Science with R ‐ Nina Zumel, John Mount
Data Science Process
Retrieving Data – Case Scenario
• This is the stage where we initially explore and visualize the data
• We also perform basic cleaning of the data, as needed
• In the process of exploring, we may discover that the data isn't suitable for the problem, or that we need other types of information as well
• We may discover things in the data that raise issues more important than the one we originally planned to address
• For example, the data in the figure seems counterintuitive
[Figure: The fraction of defaulting loans by credit history category. The dark region of each bar represents the fraction of loans in that category that defaulted.]
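• A sketch of how the quantity behind such a chart could be computed, assuming a pandas DataFrame with hypothetical columns credit_history and loan_status:
import pandas as pd

loans = pd.DataFrame({"credit_history": ["all paid", "delinquent", "all paid", "critical"],
                      "loan_status": ["BadLoan", "GoodLoan", "GoodLoan", "GoodLoan"]})

# Fraction of BadLoan outcomes within each credit-history category
default_fraction = loans["loan_status"].eq("BadLoan").groupby(loans["credit_history"]).mean()
print(default_fraction)   # this is the fraction shown as the dark region of each bar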
Source: Practical Data Science with R ‐ Nina Zumel, John Mount
Data Science Process
Retrieving Data – Case Scenario
• Why would some of the seemingly safe
applicants (those who repaid all credits to
the bank) default at a higher rate than
seemingly riskier ones (those who had been
delinquent in the past)?
• After looking more carefully and discussing
with domain experts, we realize that this
sample is inherently biased:
We only have loans that were actually made
(and therefore already accepted).
• A true unbiased sample of loan applications
should include
both loan applications that were accepted and
ones that were rejected
Source: Practical Data Science with R ‐ Nina Zumel, John Mount
Data Science Process
Retrieving Data – Case Scenario
• Because the data sample only includes accepted loans, there are fewer risky‐looking loans
than safe‐looking ones in the data
• The probable story is that risky‐looking loans were approved after a much stricter vetting
process
A process that perhaps the safe‐looking loan applications could bypass
• This suggests that if your model is to be used downstream of the current application
approval process
Credit history is no longer a useful variable
• It also suggests that even seemingly safe loan applications should be more carefully
scrutinized
• Discoveries like this may require us to approach stakeholders to change or refine the
project goals
• We may decide to concentrate on the seemingly safe loan applications
Source: Practical Data Science with R ‐ Nina Zumel, John Mount
Data Preparation
Techniques used in Data Preparation
Overview
• The raw data collected is likely to be "a
diamond in the rough."
• Our task is to sanitize and prepare it for
use in the modeling and reporting phase
• This includes cleansing, transformation,
and combining the data from a raw form
into data that’s directly usable in your
models
• Doing so is tremendously important
because:
Our models will perform better with clean data
Otherwise, garbage in equals garbage out
Source: Introducing Data Science by Cielin et al.,
Techniques used in Data Preparation
Overview
• Data Preparation
The third phase of the process is data preparation
Cleansing
Finding and correcting errors, Spaces, Typos,
Missing values
Outliers
Transformation
Aggregating, Extrapolating, Derived measures, Reducing number of
variables, Creating dummies, etc
Combining
Creating views
Set operators
Source: Introducing Data Science by Cielin et al.,
Techniques in Data Preparation
Data Cleansing
• Data cleansing focuses on removing errors in the data
• The purpose is to give the data a "true and consistent
representation" of the process from where it
originates
• Two types of errors:
Interpretation error
It exists when we take the value in the data for granted
o E.g., Age = 300 years, Weight = 500 kilos
Inconsistencies between data sources
E.g., Using "Male" in one table and "M" in another
E.g., Using "Rupees" in one table and "Dollars" in another
Source: Introducing Data Science by Cielin et al.,
Techniques in Data Preparation
Data Cleansing
Source: Introducing Data Science by Cielin et al.,
Techniques in Data Preparation
Data Cleansing
• Sometimes we may use more advanced methods, such as
simple modeling, to find and identify data errors and
anomalies
• Using diagnostic plots can be especially insightful
For example, simple regression can identify data points that seem out
of place
• When a single observation has too much influence, this
can point to an error in the data, but it can also be a valid
point
• At the data cleansing stage, these advanced methods are,
however, rarely applied and often regarded by certain
data scientists as overkill.
Source: Introducing Data Science by Cielin et al.,
Techniques in Data Preparation
Data Cleansing
The outlier point influences
the model heavily and is
worth investigating because
it can point to a region
where we don’t have
enough data or might
indicate an error in the
data, but it also can be a
valid data point
Source: Introducing Data Science by Cielin et al.,
Techniques in Data Preparation
Data Cleansing – Data Entry Errors
• Data entry errors may be caused by humans or
machines
• Errors originating from machines are transmission errors
or bugs in the extract, transform, and load phase (ETL)
• When the variable doesn’t have many classes, an error check can be done by tabulating the data with counts
• For example:
Consider a variable that can take only two values: "Good" and "Bad"
We can create a frequency table and see if those are truly the only two values present
The values "Godo" and "Bade" point out that something went wrong in at least 16 cases
• Most errors of this type are easy to fix with simple assignment statements and if‐then‐else rules:
if (x == "Godo"):
    x = "Good"
if (x == "Bade"):
    x = "Bad"
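• The frequency table mentioned above can be produced in one line with pandas (a sketch with hypothetical data):
import pandas as pd

quality = pd.Series(["Good", "Bad", "Godo", "Good", "Bade", "Good"])

print(quality.value_counts())                                # reveals the misspelled "Godo" and "Bade"
quality = quality.replace({"Godo": "Good", "Bade": "Bad"})   # fix them in one pass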
Source: Introducing Data Science by Cielin et al.,
Techniques in Data Preparation
Data Cleansing – Redundant White Space
• Whitespaces tend to be hard to detect but cause errors like other redundant characters would
• Most of us have lost several hours in projects because of a bug that was caused by whitespaces at
the end of a string
• We require the software program to join two keys and notice that observations are missing from
the output file
• After looking for days through the code, we realize that the cleaning during the ETL phase wasn’t
well executed, and keys in one table contained a whitespace at the end of a string
• This caused a mismatch of keys such as "FR " – "FR", dropping the observations that couldn’t be
matched
• Most programming languages provide string functions that will remove the leading and trailing
whitespaces
• For instance, Python provides the strip() method to remove leading and trailing whitespace
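• For example, stripping keys before the join avoids the "FR " vs. "FR" mismatch described above:
keys = ["FR ", " DE", "IN"]
clean_keys = [k.strip() for k in keys]   # ["FR", "DE", "IN"] – whitespace removed before matching
print("FR ".strip() == "FR")             # True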
Source: Introducing Data Science by Cielin et al.,
Techniques in Data Preparation
Data Cleansing
• Fixing Capital Letter Mismatches
Capital letter mismatches are common.
Most programming languages make a distinction between "India" and "india"
We can solve this problem by applying a function that returns both strings in lowercase or
uppercase, such as .lower() or .upper() in Python
"India".lower() == "india".lower() should result in true
• Impossible Values & Sanity Check
Sanity checks are another valuable type of data check
Here you check the value against physically or theoretically impossible values such as
people taller than 3 meters or someone with an age of 300 years
Sanity checks can be directly expressed with rules:
ageCheck = 0 <= age <= 120
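These two checks can be sketched directly in Python (the variable names are illustrative):

# case-insensitive comparison
same_country = "India".lower() == "india".lower()   # True

# sanity check against physically impossible values
age = 300
age_ok = 0 <= age <= 120                             # False, so the record needs correction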
Source: Introducing Data Science by Cielin et al.,
Techniques in Data Preparation
Data Cleansing ‐ Outliers
• An outlier is an observation that seems to be
distant from other observations
• One observation that follows a different logic
than the other observations
• The easiest way to find outliers is to use a plot or
a table with the minimum and maximum values
• The plot on the top shows no outliers
• The plot on the bottom shows possible outliers
on the upper side when a normal distribution is
expected
• The normal distribution, or Gaussian distribution,
is the most common distribution in natural
sciences
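A minimal sketch of such a min/max check, using simulated data and an illustrative 3-standard-deviation rule (the threshold is a judgment call, not a fixed standard):

import numpy as np
import pandas as pd

values = pd.Series(np.append(np.random.normal(50, 5, 200), 150.0))   # one suspicious value added

print(values.min(), values.max())      # a simple min/max table already hints at the outlier
z = (values - values.mean()) / values.std()
print(values[abs(z) > 3])              # observations far away from the rest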
Source: Introducing Data Science by Cielin et al.,
Techniques in Data Preparation
Data Cleansing – Missing Values
• Missing values aren’t necessarily wrong
• Missing values might be an indicator that
Something went wrong in the data collection or that an error happened
in the ETL process
• Which technique to use at what time depends on the particular case
• For instance, if we don't have observations to spare, omitting
an observation is probably not an option
• If the variable can be described by a stable distribution, we
could impute based on this
• However, a missing value may actually mean "zero"
• This can be the case in sales, for instance:
If no promotion is applied to a customer's basket, that customer's promo
field is missing, but it most likely means 0: no price cut
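A small sketch of the two options mentioned above, omitting versus imputing, using an illustrative "promo" column in pandas:

import numpy as np
import pandas as pd

df = pd.DataFrame({"promo": [5.0, np.nan, 2.5, np.nan]})

dropped = df.dropna(subset=["promo"])   # omit observations (only if we have data to spare)
imputed = df.fillna({"promo": 0.0})     # impute 0 when "missing" most likely means "no price cut"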
Source: Introducing Data Science by Cielin et al.,
Techniques in Data Preparation
Data Cleansing
• Deviations from Code Book
A code book is a description of your data, a form of metadata
It contains things such as the number of variables per observation, the number of
observations, and what each encoding within a variable means
For example, "0" equals "negative", "5" stands for "very positive"
A code book also tells the type of data you’re looking at:
Is it hierarchical, graph, something else?
To spot deviations, we look at the values that are present in set A (the data) but not in set B (the code book)
If we have multiple values to check, it's better to put them from the code book into
a table and use a difference operator to check the discrepancy between both tables (see the sketch below)
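A minimal sketch of such a difference check in Python; the encodings are invented for the example:

codebook_values = {0, 1, 2, 3, 4, 5}        # encodings documented in the code book
observed_values = {0, 1, 2, 3, 4, 5, 9}     # encodings actually found in the data

print(observed_values - codebook_values)    # in the data but not in the code book: {9}
print(codebook_values - observed_values)    # documented but never observed: set()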
Source: Introducing Data Science by Cielin et al.,
Techniques in Data Preparation
Data Cleansing
• Different Units of Measurement
When integrating two data sets, you have to pay attention to their respective units of
measurement
For example, prices of gasoline in the world
When we gather data from different data providers, some data sets can contain prices per gallon
and others prices per liter
A simple conversion will do the trick in this case
• Different Levels of Aggregation
Having different levels of aggregation is similar to having different types of measurement
For example, a data set containing data per week versus one containing data per month
This type of error is generally easy to detect, and summarizing (or the inverse, expanding)
the data sets will fix it
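Both fixes can be sketched briefly in pandas, with illustrative column names and a standard gallon-to-liter factor:

import pandas as pd

LITERS_PER_GALLON = 3.78541

prices = pd.DataFrame({"price_per_gallon": [3.50, 3.70]})
prices["price_per_liter"] = prices["price_per_gallon"] / LITERS_PER_GALLON   # unit conversion

daily = pd.DataFrame({"amount": range(60)},
                     index=pd.date_range("2024-01-01", periods=60, freq="D"))
monthly = daily.resample("M").sum()   # summarize daily data to the coarser monthly level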
Source: Introducing Data Science by Cielin et al.,
Techniques in Data Preparation
Data Transformation
• Relationships between an input variable and an output variable aren’t always
linear
• For instance, a relationship of the form y = ae^(bx)
• Taking logarithms simplifies the estimation problem dramatically: for this form, log(y) = log(a) + bx, which is linear in x
• Transforming the input variables greatly simplifies the estimation problem
• Other times you might want to combine two variables into a new variable
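The log-transform point above can be sketched on simulated data: y = a·e^(bx) becomes a straight line after taking the log of y, so an ordinary least-squares fit recovers a and b (the numbers are made up):

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(1, 10, 50)
y = 2.0 * np.exp(0.3 * x) * rng.lognormal(sigma=0.05, size=x.size)   # noisy exponential data

# log(y) = log(a) + b*x is linear, so a degree-1 polynomial fit estimates it
b, log_a = np.polyfit(x, np.log(y), deg=1)
print("a ≈", np.exp(log_a), "b ≈", b)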
Source: Introducing Data Science by Cielin et al.,
Techniques in Data Preparation
Data Transformation
Source: Introducing Data Science by Cielin et al.,
Techniques in Data Preparation
Data Transformation
• Reducing the Number of Variables
Sometimes there are too many variables and we need to reduce the number because
they don't add new information to the model
Having too many variables in the model makes the model difficult to handle
Certain techniques don’t perform well when you overload them with too many
input variables
For instance, all the techniques based on a Euclidean distance perform well only up to 10
variables
Data scientists use special methods to reduce the number of variables but retain
the maximum amount of data
Source: Introducing Data Science by Cielin et al.,
Techniques in Data Preparation
Data Transformation
• Reducing the Number of Variables
Figure shows how reducing the number of variables makes it easier to understand
the key values
It also shows how two variables account for 50.6% of the variation within the data
set (component1 = 27.8% + component2 = 22.8%)
These variables, called "component1" and "component2," are both combinations
of the original variables
They’re the principal components of the underlying data structure
Source: Introducing Data Science by Cielin et al.,
Techniques in Data Preparation
Data Transformation
Source: Introducing Data Science by Cielin et al.,
Techniques in Data Preparation
Data Transformation – Euclidean Distance
• Euclidean distance or "ordinary" distance is an extension of one of the first things we learn in trigonometry: the Pythagorean theorem
• If we know the lengths of the two sides next to the 90° angle of a right-angled triangle, we can easily derive the
length of the remaining side (the hypotenuse)
• The formula for this is hypotenuse = sqrt(side1² + side2²)
• The Euclidean distance between two points in a two-dimensional plane is calculated with a similar formula:
distance = sqrt((x1 − x2)² + (y1 − y2)²)
• If we want to expand this distance calculation to more dimensions, we add the coordinates of the points within
those higher dimensions to the formula
• For three dimensions we get distance = sqrt((x1 − x2)² + (y1 − y2)² + (z1 − z2)²)
• Euclidean Distance is used in
K‐nearest neighbors (classification) or K‐means (clustering) to find the “k closest points” of a particular sample point
Hierarchical clustering, agglomerative clustering (complete and single linkage) where you want to find the distance between clusters.
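The distance formulas above translate directly into a few lines of Python:

import math

def euclidean(p, q):
    # square root of the sum of squared coordinate differences, in any number of dimensions
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

print(euclidean((0, 0), (3, 4)))          # 5.0, the hypotenuse of a 3-4-5 right triangle
print(euclidean((1, 2, 3), (4, 6, 3)))    # 5.0 in three dimensions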
Source: Introducing Data Science by Cielin et al.,
Techniques in Data Preparation
Data Transformation
• Turning Variables into Dummies
Variables can be turned into dummy variables
Dummy variables can only take two values:
true(1) or false(0)
They’re used to indicate the absence of a categorical
effect that may explain the observation
In this case you’ll make separate columns for the
classes stored in one variable and indicate it with 1
if the class is present and 0 otherwise
An example is turning one column named Weekdays
into the columns Monday through Sunday
You use an indicator to show if the observation was
on a Monday; you put 1 on Monday and 0
elsewhere
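A minimal sketch of the Weekdays example with pandas (get_dummies does the column splitting):

import pandas as pd

df = pd.DataFrame({"Weekdays": ["Monday", "Tuesday", "Monday", "Sunday"]})
dummies = pd.get_dummies(df["Weekdays"], dtype=int)
print(dummies)   # one 0/1 column per weekday; 1 marks the day the observation falls on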
Source: Introducing Data Science by Cielin et al.,
Techniques in Data Preparation
Combining Data
• Data comes from several different places
• In this substep we focus on integrating
these different sources
• Data varies in size, type, and structure,
ranging from databases and Excel files to
text documents
• Different ways of combining data
Joining
Enriching an observation from one table with
information from another table
Appending or Stacking
Adding the observations of one table to those of
another table
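Both operations can be sketched with pandas; the table and column names are invented for the example:

import pandas as pd

clients = pd.DataFrame({"client_id": [1, 2], "city": ["Delhi", "Pune"]})
orders = pd.DataFrame({"client_id": [1, 1, 2], "amount": [250, 100, 400]})

joined = orders.merge(clients, on="client_id", how="left")   # joining: enrich orders with client info
extra = pd.DataFrame({"client_id": [3], "amount": [90]})
appended = pd.concat([orders, extra], ignore_index=True)     # appending: stack observations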
Source: Introducing Data Science by Cielin et al.,
Techniques in Data Preparation
Combining Data – Joining Tables
Source: Introducing Data Science by Cielin et al.,
Techniques in Data Preparation
Combining Data – Appending Tables
Source: Introducing Data Science by Cielin et al.,
Techniques in Data Preparation
Combining Data – Using Views to Simulate Joins & Appends
Source: Introducing Data Science by Cielin et al.,
Techniques in Data Preparation
Combining Data – Enriching Aggregated Values
Source: Introducing Data Science by Cielin et al.,
Data Exploration
Data Exploration
Overview
• Data Exploration
The fourth phase in the process is data exploration
The goal of this step is to gain a deeper understanding of
the data
We look for patterns, correlations, and deviations based
on visual and descriptive techniques
The insights gained here will enable us to start modeling
Source: Introducing Data Science by Cielin et al.,
Data Exploration
Overview
• During exploratory data analysis we take a deep dive into
the data
• Information becomes much easier to grasp when shown in
a picture
• We mainly use graphical techniques to gain an
understanding of our data and the interactions between
variables
• It’s common to discover anomalies we missed previously
We may have to take a step back and fix them
• The visualization techniques in this phase can range from
simple line graphs or histograms to more complex diagrams
such as Sankey diagram (see next slide)
Source: Introducing Data Science by Cielin et al.,
Data Exploration
Sankey diagram
• This Sankey diagram
visualizes the flow of energy:
supplies are on the left, and
demands are on the right
• Links show how varying
amounts of energy are
converted or transmitted
before being consumed or
lost
Data: Department of Energy & Climate Change via Tom Counsell
Source: https://observablehq.com/@d3/sankey‐diagram
Data Exploration
Overview
• A bar chart, a line plot, and a distribution are some of the graphs
used in exploratory analysis
Source: Introducing Data Science by Cielin et al.,
Data Exploration
Overview
• Drawing multiple plots
together can help us
understand the structure of
our data over multiple
variables.
Source: Introducing Data Science by Cielin et al.,
Data Exploration
Overview
• Histogram
In a histogram a variable is cut into discrete categories and the number of occurrences in each category
are summed up and shown in the graph
• Boxplot
The boxplot, on the other hand, doesn’t show how many observations are present but does offer an
impression of the distribution within categories
It can show the maximum, minimum, median, and other characterizing measures at the same time.
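A small sketch of both plots on simulated data with matplotlib:

import numpy as np
import matplotlib.pyplot as plt

values = np.random.normal(loc=50, scale=5, size=500)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.hist(values, bins=20)     # histogram: counts per discrete bin
ax2.boxplot(values)           # boxplot: median, quartiles, and potential outliers
plt.show()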
Source: Introducing Data Science by Cielin et al.,
Data Modeling
Data Modeling
Overview
• The most common data science modeling tasks are these:
Classifying—Deciding if something belongs to one category or another
Scoring—Predicting or estimating a numeric value, such as a price or probability
Ranking—Learning to order items by preferences
Clustering—Grouping items into most‐similar groups
Finding relations—Finding correlations or potential causes of effects seen in the data
Characterizing—Very general plotting and report generation from data
Source: Practical Data Science with R ‐ Nina Zumel, John Mount
Data Exploration
Modeling – Case Scenario
• The loan application problem is a classification problem:
We want to identify loan applicants who are likely to default
• Some common approaches in such cases are logistic regression and
tree‐based methods
• To solve this problem, we decide that a decision tree is most suitable
• Loan officers, who will use our model, are interested in knowing an
indication of how confident the model is in its decision:
Is this applicant highly likely to default, or only somewhat likely?
Source: Practical Data Science with R ‐ Nina Zumel, John Mount
Data Exploration
Modeling – Case Scenario
Let’s suppose that there is
an application for a one‐
year loan of $10,000
On the other hand,
suppose that there is an
application for a one‐year
loan of $15,000
Source: Practical Data Science with R ‐ Nina Zumel, John Mount
Data Exploration
Modeling – Case Scenario
• Let’s suppose that we discover the model shown in figure
• Let’s trace an example path through the tree
• Let’s suppose that there is an application for a one‐year loan of $10,000
• At the top of the tree (node 1), the model checks if the loan is for longer than 34
months
• The answer is “no,” so the model takes the right branch down the tree
This is shown as the highlighted branch from node 1
• The next question (node 3) is whether the loan is for more than $11,000
• Again, the answer is “no,” so the model branches right and arrives at leaf 3.
Source: Practical Data Science with R ‐ Nina Zumel, John Mount
Data Exploration
Modeling – Case Scenario
• Historically, 75% of loans that arrive at this leaf are good loans, so the
model recommends that you approve this loan, as there is a high
probability that it will be paid off
• On the other hand, suppose that there is an application for a one‐
year loan of $15,000
• In this case, the model would first branch right at node 1, and then
left at node 3, to arrive at leaf 2
• Historically, all loans that arrive at leaf 2 have defaulted, so the model
recommends that you reject this loan application.
Source: Practical Data Science with R ‐ Nina Zumel, John Mount
Model Evaluation
Model Evaluation
Overview
• Once we have a model ready, we need to determine if it meets our goals:
Is it accurate enough for our needs?
Does it generalize well?
Does it perform better than "the obvious guess"?
Does it perform better than whatever estimate we currently use?
Do the results of the model (coefficients, clusters, rules, confidence intervals, significances,
and diagnostics) make sense in the context of the problem domain?
• If the answer is "NO" to any of the above questions, it’s time to loop back
to the modeling step
Or decide that the data doesn’t support the goal you’re trying to achieve.
Source: Practical Data Science with R ‐ Nina Zumel, John Mount
Model Evaluation
Evaluation – Case Scenario
• In the loan application example, the first thing to check is whether the rules that
the model discovered make sense
• Looking at the decision tree structure, we don't notice any obviously strange rules,
so we can go ahead and evaluate the model's accuracy
• A good summary of classifier accuracy is the confusion matrix, which tabulates
actual classifications against predicted ones.
• We create a confusion matrix where rows represent actual loan status, and
columns represent predicted loan status
• conf_mat ["GoodLoan", "BadLoan"] refers to the element conf_mat[2, 1]
The number of actual good loans that the model predicted were bad
• The diagonal entries of the matrix represent correct predictions.
Source: Practical Data Science with R ‐ Nina Zumel, John Mount
Model Evaluation
Confusion Matrix
• A confusion matrix is a table that is often used to evaluate the
performance of a classification model (or "classifier")
• It works on a set of test data for which the true values are known
• There are two possible predicted classes: "YES" and "NO"
• If we were predicting the presence of a disease, for example, "yes"
would mean they have the disease, and "no" would mean they
don't have the disease.
• The classifier made a total of 165 predictions
E.g., 165 patients were being tested for the presence of that disease
• Out of those 165 cases, the classifier predicted "yes" 110 times,
and "no" 55 times.
• In reality, 105 patients in the sample have the disease, and 60
patients do not.
Source: https://www.dataschool.io/simple‐guide‐to‐confusion‐matrix‐terminology
Model Evaluation
Confusion Matrix
• True positives (TP):
These are cases in which the model predicted yes (they
have the disease), and the patients actually do have the
disease.
• True negatives (TN):
The model predicted no, and they don't have the disease.
• False positives (FP):
The model predicted YES, but they don't actually have the
disease. (Also known as a "Type I error.")
• False negatives (FN):
The model predicted NO, but they actually do have the
disease. (Also known as a "Type II error.")
Source: https://www.dataschool.io/simple‐guide‐to‐confusion‐matrix‐terminology
Model Evaluation
Confusion Matrix
• Accuracy – Overall, how often is the classifier correct?
(TP+TN)/total = (100+50)/165 = 0.91
• Misclassification Rate (Error Rate) – Overall, how often is it wrong?
(FP+FN)/total = (10+5)/165 = 0.09 (equivalent to 1 minus Accuracy)
• True Positive Rate (Sensitivity or Recall) – When it's actually YES, how often does it predict YES?
TP/actual YES = 100/105 = 0.95
• False Positive Rate – When it's actually NO, how often does it predict YES?
FP/actual NO = 10/60 = 0.17
Source: https://www.dataschool.io/simple‐guide‐to‐confusion‐matrix‐terminology
Model Evaluation
Evaluation – Case Scenario
Source: Practical Data Science with R ‐ Nina Zumel, John Mount
Model Evaluation
Evaluation – Case Scenario
Confusion matrix (n = 1000; "Yes" = bad loan is the positive class):
Actual Bad Loan: TP = 41, FN = 259 (300 actual bad loans)
Actual Good Loan: FP = 13, TN = 687 (700 actual good loans)
Predicted Bad Loan: 54; Predicted Good Loan: 946
• Accuracy – (TP+TN)/total = (41+687)/1000 = 0.728
• Misclassification Rate (Error Rate) – (FP+FN)/total = (13+259)/1000 = 0.272 (equivalent to 1 minus Accuracy)
• True Positive Rate (Sensitivity or Recall) – TP/actual yes = 41/300 = 0.137
• False Positive Rate – FP/actual no = 13/700 = 0.01857
• True Negative Rate (Specificity) – TN/actual no = 687/700 = 0.98143 (equivalent to 1 minus False Positive Rate)
• Precision – TP/predicted yes = 41/54 = 0.759
• Prevalence – actual yes/total = 300/1000 = 0.30
Since the prediction target is Bad Loans: "Yes" means it is a bad loan, and "No" means it is not a bad loan
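The figures above can be recomputed from the four counts, which is a useful cross-check (here "yes" = bad loan is the positive class):

TP, FN, FP, TN = 41, 259, 13, 687
total = TP + FN + FP + TN          # 1000 loans

accuracy = (TP + TN) / total       # 0.728
error_rate = (FP + FN) / total     # 0.272
sensitivity = TP / (TP + FN)       # true positive rate: 41/300 ≈ 0.137
specificity = TN / (TN + FP)       # true negative rate: 687/700 ≈ 0.981
precision = TP / (TP + FP)         # 41/54 ≈ 0.759
prevalence = (TP + FN) / total     # 300/1000 = 0.30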
Source: Practical Data Science with R ‐ Nina Zumel, John Mount
Model Presentation
Model Presentation
Overview
• This phase involves presenting our results to stakeholders
• We also document the model for those in the organization who are
responsible for using, running, and maintaining the model once it has
been deployed
• Different audiences require different kinds of information
• Business‐oriented audiences want to understand the impact of our
findings in terms of business metrics
Source: Practical Data Science with R ‐ Nina Zumel, John Mount
Model Presentation
Case Scenario
• In the loan example, our business audience is interested in knowing how our
model can reduce chargeoffs
Chargeoff = the money that the bank loses to bad loans
• Suppose our model identified a set of bad loans that amounted to 22% of the
total money lost to defaults
• Then our presentation should emphasize that the model can potentially reduce
the bank’s losses by that amount
See figure in the next slide
• We may also give other interesting findings or recommendations, such as:
New car loans are much riskier than used car loans
Most losses are tied to bad car loans and bad equipment loans (assuming that the audience didn’t
already know these facts)
Source: Practical Data Science with R ‐ Nina Zumel, John Mount
Model Presentation
Case Scenario
• Our presentation for the loan
officers should emphasize:
How should they interpret the model?
What does the model output look like?
If the model provides a trace of which
rules in the decision tree executed
How do they read that?
If the model provides a confidence
score in addition to a classification
How should they use the confidence
score?
When might they potentially overrule
the model?
Source: Practical Data Science with R ‐ Nina Zumel, John Mount
Model Deployment
Concept
• Finally, the model is put into operation
• In many organizations, the data scientist no longer has primary responsibility for
the day‐to‐day operation of the model
• But you still should ensure that the model will run smoothly and won’t make
disastrous decisions
• You also want to make sure that the model can be updated as its environment
changes
• And in many situations, the model will initially be deployed in a small pilot
program
• The test might bring out issues that you didn’t anticipate, and you may have to
adjust the model accordingly
Source: Practical Data Science with R ‐ Nina Zumel, John Mount
Model Deployment
Case Scenario
• When we deploy the model, we might find that loan officers
frequently override the model in certain situations because it
contradicts their intuition
• Is their intuition wrong? Or
• Is our model incomplete? Or
• More positively, our model may perform so successfully that the
bank wants us to extend it to home loans as well
Source: Practical Data Science with R ‐ Nina Zumel, John Mount
Thank You!
Introduction to Data Science
Data Science Process
Dr. Ramakrishna Dantu
Associate Professor, BITS Pilani
Introduction to Data Science
Data Science Process
• Data Science Methodology
Business understanding
Data Requirements
Data Acquisition
Data Understanding
Data preparation
Modelling
Model Evaluation
Deployment and feedback
• Case Study
• Data Science Proposal
Samples
Evaluation
Review Guide
Data Science Process with a
Case Study
Data Science Process
Stages in Data Science Process
Business Understanding → Analytic Approach → Data Requirements → Data Collection → Data Understanding → Data Preparation → Data Modeling → Evaluation → Deployment → Feedback (the process is cyclical)
Source: CognitiveClass
Data Science Process
10 Questions the process aims to answer
• From Problem to Approach
1. What is the problem that you are trying to solve?
2. How can you use data to answer the questions?
• Working with Data
3. What data do you need to answer the question?
4. Where is the data coming from? Identify all Sources. How will you acquire it?
5. Is the data that you collected representative of the problem to be solved?
6. What additional work is required to manipulate and work with the data?
• Delivering the Answer
7. In what way can the data be visualized to get to the answer that is required?
8. Does the model used really answer the initial question or does it need to be adjusted?
9. Can you put the model into practice?
10. Can you get constructive feedback into answering the question?
Source: CognitiveClass
Data Science Process
Five Stages of Data Science Process
• From Problem to Approach
Business Understanding
Analytic Approach
• From Requirements to Collection
Data Requirements
Data Collection
• From Understanding to Preparation
Data Understanding
Data Preparation
• From Modeling to Evaluation
Modeling
Evaluation
• From Deployment to Feedback
Deployment
Feedback
Source: CognitiveClass
Case Study
Case Study
Hospital Readmissions
Image Source: https://medium.com/nwamaka‐imasogie/predicting‐hospital‐readmission‐using‐nlp‐5f0fe6f1a705
Case Study
Hospital Readmissions
• Scenario
There is a limited budget for providing healthcare to the public
Hospital readmissions for recurring problems are considered a sign of failure
in the healthcare system
There is a dire need to properly address the patient condition prior to the initial
patient discharge
The core question is:
What is the best way to allocate these funds to maximize their use in providing quality care?
A successful data science program will deliver:
better patient care by giving physicians new tools to incorporate timely, data‐driven
information into patient care decisions
Source: CognitiveClass
From Problem to
Analytic Approach
Source: CognitiveClass
Case Study
Data Problem to Analytic Approach
• Key topics covered
The need to understand and prioritize the business goal.
The way stakeholder support influences a project.
The importance of selecting the right model.
When to use a predictive, descriptive, or classification model.
Source: CognitiveClass
BUSINESS
UNDERSTANDING
Source: CognitiveClass
Case Study
Business Understanding (Concept)
• What is the problem that you are trying to solve?
• Identify the goal
• Identify and define the objectives that support the goal.
• Ask questions
• Seek clarifications
• Get the stakeholder buy‐in and support
Source: CognitiveClass
Case Study
Business Understanding (Application)
• Case study asks the following question:
What is the best way to allocate the limited healthcare budget to maximize its use
in providing quality care?
• Goals and Objectives
Define the GOALS
To provide quality care without increasing cost
Define the OBJECTIVES
Review the process to identify inefficiencies
Source: CognitiveClass
Case Study
Business Understanding (Application)
• Examining hospital readmissions
It was found that approximately 30% of individuals who finish rehab treatment would be readmitted
to a rehab center within one year
50% would be readmitted within five years
After reviewing some records, it was found that patients with heart failure were high on the list of
readmissions.
Source: CognitiveClass
Case Study
Business Understanding (Application)
• It was found that a decision tree model can be applied to investigate this
scenario to determine the reason for this phenomenon
• It is important to gain business insight for the analytics team to implement the
data science project
Data scientists proposed and organized an on‐site workshop
• The business sponsor's involvement throughout the project was critical
• The sponsor:
Set the overall direction
Remained committed and advised when required
Got the necessary support
(Image: Pilot Project Kickoff)
Source: CognitiveClass
Case Study
Business Understanding (Application)
• Finally, four business requirements were identified for whatever
model would be built
Case study question
What is the best way to allocate the limited healthcare budget to maximize its use in providing
quality care?
Business requirements
To predict the risk of readmission
To predict readmission outcomes for those patients with Congestive Heart Failure
To understand the combination of events that led to the predicted outcome
To apply an easy‐to‐understand process to new patients, regarding their readmission risk
Source: CognitiveClass
ANALYTIC
APPROACH
Source: CognitiveClass
Case Study
Analytic Approach (Concept)
• Available data:
Patient data, readmissions data, CHF data, etc.
• How can we use data to answer the questions?
• Choose Analytic approach based on the type of question
Descriptive
Current status
Diagnostic (Statistical Analysis)
What happened?
Why is this happening?
Predictive (Forecasting)
What if these trends continue?
What will happen next?
Prescriptive
How do we solve it?
Source: CognitiveClass
Case Study
Analytic Approach (Concept)
• The analytic approach can be selected once a clear understanding of the
question is established:
If the question is to determine probabilities of an action
A Predictive model can be used
If the question is to show relationships
A Descriptive model may be used
If the question requires a yes / no answer
A Classification approach to predicting a response is appropriate
Decision tree classification
o Categorical outcome
o Explicit "decision path" showing conditions leading to high risk
o Likelihood of classified outcome
o Easy to understand and apply
Source: CognitiveClass
Case Study
Analytic Approach (Application)
• In the case study, a decision tree classification model was used to identify the combination
of conditions leading to each patient's outcome
• Examining the variables in each of the nodes along each path to a leaf, led to a respective
threshold value to split the tree
E.g., Age >= 60
• A decision tree classifier provides both the predicted outcome and the likelihood of
that outcome, based on the proportion of the dominant outcome, yes or no, in each group
• The analysts can obtain the readmission risk, or the
likelihood of a yes for each patient
• If the dominant outcome is yes, then the risk is
simply the proportion of yes patients in the leaf
• If it is no, then the risk is 1 minus the proportion of
no patients in the leaf
Source: CognitiveClass
Case Study
Analytic Approach (Application) – Why decision tree??
• For non‐data scientists, a decision tree classification model is easy to understand
and apply, to score new patients for their risk of readmission
• Clinicians can readily see what conditions are causing a patient to be scored as
high‐risk
• Multiple models can be built and applied at various points during hospital stay
• This gives a moving picture of the patient's risk and how it is evolving with the
various treatments being applied
• For these reasons, the decision tree approach
was chosen for building the CHF readmission
model
Source: CognitiveClass
From Data Requirements
to Collection
Source: CognitiveClass
Case Study
From Data Requirements to Collection
• Key topics covered
The significance of defining the data requirements for the model.
Why the content, format, and representation of data matter
The importance of identifying the correct sources for data collection
How to handle unavailable and redundant data.
To anticipate the needs of future stages in the process.
Source: CognitiveClass
DATA
REQUIREMENTS
Source: CognitiveClass
Case Study
Data Requirements (Concept)
• If our goal is to make a "Biryani" but we don't have the right
ingredients, then the success of making a good Biryani will be
compromised
• If the "recipe" is the problem to be solved, then the data are the
ingredients
• The data scientist must ask the following questions:
What are the data requirements
How to obtain or collect them
How to understand or use them
How to prepare the data to meet the desired outcome
• Based on the understanding of the problem and the analytic approach chosen, it is important to define the
data requirements
• Here, the analytic approach is decision tree classification, so data requirements should be defined accordingly
• This involves:
Identifying data content, formats, and data sources needed for the initial data collection
Source: CognitiveClass
Case Study
Data Requirements (Application)
• Data requirements for the case study included selecting a suitable list of patients from the
health insurance providers member base
• In order to put together patient clinical histories, three criteria were identified for selecting
the patient cohort
1) A patient must be admitted as an in‐patient within
health insurance provider's service area
2) Patient's primary diagnosis should be CHF for one full
year
3) Prior to the primary admission for CHF, a patient must
have had at least 6 months of continuous enrollment
• Disqualifying conditions (outliers)
CHF patients who have been diagnosed with other serious conditions are excluded because these conditions may result
in above-average readmission rates and may therefore distort the results
Source: CognitiveClass
Case Study
Data Requirements (Application) – Defining the data
• The content and format suitable for decision tree classifier needs to be
defined
• Format
Transactional format
This model requires one record per patient
Columns of the record represent dependent and predictor variables
• Content
To model the readmission outcome, data should represent all aspects of the patient's
clinical history. This includes:
Authorizations, primary, secondary and tertiary diagnoses, procedures, prescriptions and other
services provided during hospitalization or visits by patients / doctors.
Source: CognitiveClass
Case Study
Data Requirements (Application)
• A given patient can have thousands of records that represent all their attributes
• The data analytics specialists collected the transaction records from patient
records and created a set of new variables to represent that information
• Creating these variables was a task for the data preparation phase, so it is important to anticipate the
needs of the next phases
Source: CognitiveClass
DATA
COLLECTION
Source: CognitiveClass
Case Study
Data Collection (Concept)
• What occurs during data collection?
Assess to determine whether or not we have the data we need.
Decide whether more data or less data is required.
Revise data requirements if needed.
Assess content, quality and initial insights of data.
Identify gaps in data.
How to extract, merge, and archive data?
• Once data collection is completed, the Data Scientist performs an assessment to make sure all the
required data has been obtained
• As with the purchase of ingredients for making a meal, some ingredients may be out of season and more
difficult to obtain or cost more than originally planned.
• At this stage, the data requirements are reviewed and a decision is made as to whether more or less data is
required for the collection.
Source: CognitiveClass
Case Study
Data Collection (Concept)
• The collected data is explored using descriptive statistics and visualization to assess its
content and quality
• The gaps in the data are identified and plans for filling or replacement must be made
• Once this step is complete, essentially, the ingredients are now ready for washing and
cutting
• Now, let's look at the "data collection" step in the context of the case study
Source: CognitiveClass
Case Study
Data Collection – Available Data (Application)
• This case study required data about:
Demographics, clinical and coverage information of patients, provider information,
claims records, as well as pharmaceutical and other information related to all the
diagnoses of the CHF patients.
• Available data sources
Corporate data warehouse
Single source of medical, claims, eligibility, provider, &
member information
In‐patient record system
Claim patient system
Disease management program information
Source: CognitiveClass
Case Study
Data Collection – Unavailable Data (Application)
• This case study also required other data, but not available
Pharmaceutical records
Information on drugs
• This data source was not yet integrated with the rest of
the data sources
• In such situations, it is okay to postpone decisions about unavailable data and to try to capture the data later
• For example:
This can happen even after obtaining intermediate results from predictive modeling
If the results indicate that drug information may be important for a good model, you will spend time trying to get it.
• However, it turned out that they could build a reasonably good model without this
information about drugs.
Source: CognitiveClass
Case Study
Data Collection – Merging Data (Application)
• Data Pre‐processing
Database administrators and programmers often work
together to extract data from different sources and then
combine them
Redundant data can be deleted, and the remaining data is made available to the
next stage of the methodology:
the "Data Understanding" phase
At this stage, scientists and analysts can discuss ways to better
manage their data by automating certain database processes
to facilitate data collection
• Next, we move on to understanding the data
Source: CognitiveClass
From Data Understanding
to Data Preparation
Source: CognitiveClass
Case Study
From Data Understanding to Preparation
• Key topics covered
The importance of descriptive statistics.
How to manage missing, invalid, or misleading data.
The need to clean data and sometimes transform it.
The consequences of bad data for the model.
Data understanding is iterative
We learn more about data the more we study it.
Source: CognitiveClass
DATA
UNDERSTANDING
Source: CognitiveClass
Case Study
Data Understanding (Concept)
• This section of the methodology answers the question
Is the data you collected representative of the problem to be solved?
• Descriptive statistics
Univariate statistics
Pairwise correlation
Histograms
• Assess data quality
Missing values
Invalid data
Misleading data
• From the data collected, we should understand the variables and their characteristics
using Exploratory Data Analysis and Descriptive Statistics
• Sometimes we may have to perform pre‐processing operations on the data
Source: CognitiveClass
Case Study
Data Understanding (Application)
• First, Univariate Statistics
Basic statistics included univariate statistics for each
variable, such as:
mean, median, minimum, maximum, and standard deviation
• Second, Pairwise Correlations
Pairwise correlations were used to determine the
degree of correlation between the variables
Variables that are highly correlated are essentially redundant
Only one variable of such a pair is then kept for the modeling (see the sketch below)
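A minimal sketch of both steps with pandas on synthetic patient-like columns (the column names are invented, not the case study's actual schema):

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.integers(40, 90, size=200),
    "length_of_stay": rng.integers(1, 15, size=200),
    "prior_admissions": rng.integers(0, 5, size=200),
})

print(df.describe())   # mean, std, min, max and quartiles for each variable
print(df.corr())       # pairwise correlations; highly correlated pairs are candidates for removal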
Source: CognitiveClass
Case Study
Data Understanding (Application)
• Third, Histograms
The histograms of the variables were examined to understand their distributions
Histograms are a good way to understand how the values of a variable are distributed
They help to know what kind of data preparation may be needed to make the variable more useful in a model
For example:
If a categorical variable contains too many different values to be meaningful in a model, the histogram can help decide how to consolidate those values
Source: CognitiveClass
Case Study
Data Understanding (Application)
• Looking at data quality
Univariate statistics and histograms are also used to
assess the quality of the data.
On the basis of the data provided, some values can be re‐
coded or deleted if necessary
E.g., if a particular variable has a lot of missing values, we may
drop the variable from the model
The question then arises as to whether "missing" means
something
Sometimes a missing value means "no" or "0" (zero), or sometimes simply "we do not
know".
Or if a variable contains invalid or misleading values. For example
A numeric variable called "age" containing 0 to 100 and 999, where "triple‐9" actually means
"missing", will be treated as a valid value unless we have corrected it
Source: CognitiveClass
Case Study
Data Understanding (Application)
• This is an iterative process
Originally, the meaning of CHF admission was decided on
the basis of a primary diagnosis of CHF
However, preliminary data analysis and clinical experience
revealed that CHF admissions were also based on other diagnoses
The initial definition did not cover all cases of CHF admissions
This made the data scientists return to the data collection phase
They added secondary and tertiary diagnoses, and created a more complete definition of
CHF admission
This is one example of the iterative processes in the methodology
The more we work with the problem and the data, the more we learn and the more the
model can be adjusted, which ultimately leads to a better resolution of the problem.
Source: CognitiveClass
DATA
PREPARATION
Source: CognitiveClass
Case Study
Data Preparation (Concept)
• In a way, data preparation is like removing dirt and washing vegetables
• Compared to data collection and understanding, data preparation is the most time consuming
phase – 70% to 90% of overall project time
• Automating data collection and preparation can reduce this to about 50%.
• The data preparation phase of the methodology answers the question:
What are the ways in which data is prepared?
Address missing or invalid values
Remove duplicates
Format data properly
• Transforming data
Process of getting data into a state where it may be easier to work with
• Feature Engineering
Process of using domain knowledge of data to create features that make ML algorithms work
Feature is a characteristic that might help solving a problem
Source: CognitiveClass
Case Study
Data Preparation (Concept)
Source: CognitiveClass
Case Study
Data Preparation (Concept)
• Feature Engineering
Feature engineering is also part of the data preparation
Use domain knowledge on data to create features that work with machine learning
algorithms
A feature is a property that can be
useful for solving a problem
The features in the data are
important for the predictive models
and influence the desired results.
Source: CognitiveClass
Case Study
Data Preparation (Application)
• Defining Congestive Heart Failure
In the case study, first step in the data preparation stage was to actually define
what CHF means
First, the set of diagnosis‐related group codes needed to be identified, as CHF
implies certain kinds of fluid buildup
Data scientists also needed to consider that CHF is only one type of heart failure
Clinical guidance was needed to get the right codes for CHF
Source: CognitiveClass
Case Study
Data Preparation (Application)
Source: CognitiveClass
Case Study
Data Preparation (Application)
• Defining re‐admission criteria for the same condition
Next step involved defining the criteria for CHF
readmissions
The timing of events needed to be evaluated in
order to define whether a particular CHF
admission was an initial event (called an index
admission), or a CHF-related re-admission
Based on clinical expertise, a time period of 30
days was set as the window for readmission
relevant for CHF patients, following the
discharge from the initial admission
Source: CognitiveClass
Case Study
Data Preparation (Application)
• Aggregating transactional records
Next, the records that were in
transactional format were aggregated
Meaning that the data included multiple
records for each patient
Transactional records included claims
submitted for physician, laboratory,
hospital, and clinical services
Also included were records describing all
the diagnoses, procedures, prescriptions,
and other information about in‐patients
and out‐patients
Source: CognitiveClass
Case Study
Data Preparation (Application)
• Aggregating data to patient level
A given patient could have hundreds or
even thousands of records, depending on
their clinical history
All the transactional records were
aggregated to the patient level, yielding a
single record for each patient
This is required for the decision‐tree
classification method used for modeling
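The roll-up can be sketched with a pandas groupby; the claim records and column names below are invented stand-ins for the case study's data:

import pandas as pd

claims = pd.DataFrame({
    "patient_id": [1, 1, 1, 2, 2],
    "claim_type": ["lab", "hospital", "lab", "physician", "hospital"],
    "cost": [120, 5000, 90, 200, 7000],
})

per_patient = claims.groupby("patient_id").agg(
    n_claims=("claim_type", "size"),                                   # how many transactions per patient
    n_hospital_claims=("claim_type", lambda s: (s == "hospital").sum()),
    total_cost=("cost", "sum"),
).reset_index()                                                        # one record per patient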
Source: CognitiveClass
Case Study
Data Preparation (Application)
• Aggregating data to patient level
Many new columns were created
representing the information in the
transactions
For example
Frequency and most recent visits to doctors,
clinics and hospitals with diagnoses,
procedures, prescriptions, and so forth
Co‐morbidities with CHF were also
considered, such as:
Diabetes, hypertension, and many other
diseases and chronic conditions that could
impact the risk of re‐admission for CHF
Source: CognitiveClass
Case Study
Data Preparation (Application)
• Do we need more data or less
data?
A literature review on CHF was also
undertaken to see whether any
important data elements were
overlooked
Such as co‐morbidities that had not yet
been accounted for
The literature review involved looping
back to the data collection stage to add
a few more indicators for conditions
and procedures
Source: CognitiveClass
Case Study
Data Preparation (Application)
• Creating new variables
Aggregating the transactional data at the
patient level, involved
merging it with the other patient data,
including their demographic information, such
as age, gender, type of insurance, and so forth
The result was the creation of one table
containing a single record per patient
Columns represent the attributes about the
patient in his or her clinical history
These columns would be used as variables
in the predictive modeling
Source: CognitiveClass
Case Study
Data Preparation (Application)
• Completing the data set
Here is a list of the variables that were ultimately used in building the model
Measures
o Gender, Age, Primary Diagnosis Related Group (DRG), Length of Stay, CHF Diagnosis Importance (primary, secondary, tertiary), Prior admissions, Line of business
Diagnosis Flags (Y/N)
o CHF, Atrial fibrillation, Pneumonia, Diabetes, Renal failure, Hypertension
• Dependent Variable
CHF readmission within 30 days following discharge from CHF hospitalization (Yes/No)
Source: CognitiveClass
Case Study
Data Preparation (Application)
• Creating training and testing data
sets
The data preparation stage resulted in a
cohort of 2,343 patients
These patients met all of the criteria for
this case study
The data (patient records) were then split
into training and testing sets for building
and validating the model, respectively
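A minimal sketch of such a split with scikit-learn, on a synthetic stand-in for the 2,343-patient cohort; the 70/30 ratio is illustrative, not the case study's actual split:

import pandas as pd
from sklearn.model_selection import train_test_split

patients = pd.DataFrame({"age": range(100), "readmitted_30d": [0, 1] * 50})

train, test = train_test_split(
    patients, test_size=0.3,
    stratify=patients["readmitted_30d"],   # keep the yes/no proportions similar in both sets
    random_state=42,
)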
Source: CognitiveClass
From Modeling to
Evaluation
Source: CognitiveClass
Case Study
From Data Modeling to Evaluation
• Key topics covered
The difference between descriptive and predictive models
The role of training sets and test sets
The importance of asking if the question has been answered
Why diagnostic measures tools are needed
The purpose of statistical significance tests
That modeling and evaluation are iterative processes
Source: CognitiveClass
• Modeling
In what way can the data be visualized to get to the answer that is required?
DATA
MODELING
Source: CognitiveClass
Case Study
Data Modeling (Concept)
• Data modeling focuses on developing models that are either descriptive or
predictive
• Descriptive Models
What happened?
Use statistics
• Predictive Models
What will happen?
Use Machine Learning
Try to generate yes/no type outcomes
A training set is used for developing the predictive model
• A training set
contains historical data in which the outcomes are already known
acts like a gauge to determine if the model needs to be calibrated
• The data scientist will try different algorithms to ensure that the variables in
play are actually required
Source: CognitiveClass
Case Study
Data Modeling (Concept)
• Success of data compilation, data preparation, and modeling
depends on:
the understanding of problem and the analytical approach being
taken
• Like the quality of ingredients in cooking, the quality of data
sets the stage for the outcome
If data quality is bad, the outcome will be bad
• Constant refinement, adjustment, and tweaking within each
step are essential to ensure a solid outcome
• The end goal is to build a model that can answer the original
question
In this stage of the methodology, model evaluation, deployment, and
feedback loops ensure that the model is relevant and the question is
really answered
Source: CognitiveClass
Case Study
Data Modeling (Application)
• There are many aspects to model building – one of those is
parameter tuning to improve the model
• With a prepared training set, the first decision tree classification
model for CHF readmission can be built
• We are looking for patients with high‐risk readmission, so the
outcome of interest will be CHF readmission equals "yes"
• In this first model, overall accuracy in classifying the yes and no
outcomes was 85%
• This sounds good, but the model correctly identified only 45% of the "yes" outcomes
Meaning, when it is actually YES, the model predicted YES only 45% of the time
• The question is:
How could the accuracy of the model be improved in predicting the yes
outcome?
Source: CognitiveClass
Case Study
Data Modeling (Application)
• For decision tree classification, the best parameter to adjust is
the relative cost of misclassified yes and no outcomes
• Type I Error or False positive
When a true, non‐readmission is misclassified, and action is taken to reduce
that patient's risk, the cost of that error is the wasted intervention
• Type II Error or False negative
When a true readmission is misclassified, and no action is taken to reduce that risk
The cost of this error is the readmission and all its attendant costs, plus the trauma to the patient
• The costs of the two different kinds of misclassification errors can be quite different
So, we adjust the relative weights of misclassifying the yes and no outcomes
• The default is 1-to-1, but the decision tree algorithm allows the setting of a higher value for yes
(Figure: predicted vs. actual condition – True Positive (Power), False Positive (Type I Error), False Negative (Type II Error), True Negative)
Source: CognitiveClass
Case Study
Data Modeling (Application)
• For the second model, the relative cost was set at 9‐to‐1
Ratio of the cost of a misclassified yes (false negative) to a misclassified no (false positive)
• This is a very high ratio, but gives more insight to the model's
behavior
• This time the model correctly classified 97% of the YES, but at the
expense of a very low accuracy on the NO, with an overall
accuracy of only 49%
• This was clearly not a good model
• The problem with this outcome is the large number of false‐positives
Meaning, a true, non‐readmission is misclassified as re‐admission
This would recommend unnecessary and costly intervention for patients, who would not have been re‐admitted
anyway
• Therefore, the data scientist needs to try again to find a better balance between the yes and no
accuracies
Source: CognitiveClass
Case Study
Data Modeling (Application)
• The data scientist needs to try again to find a better balance between the yes and no
accuracies
• For the third model, the relative cost was set at a more reasonable 4‐to‐1
This time, the overall accuracy was 81%
Yes accuracy was 68%
This is called sensitivity by statisticians
No accuracy was 85%
This is called specificity
• In medical diagnosis
Test sensitivity is the ability of a test to correctly identify those with the disease (true positive rate)
Test specificity is the ability of the test to correctly identify those without the disease (true negative rate).
• This is the optimum balance that can be obtained with a rather small training set
By adjusting the relative cost of misclassified yes and no outcomes parameter
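The case study used a different toolset, but the same idea, weighting a missed readmission more heavily than a wasted intervention, can be sketched with scikit-learn's class_weight on synthetic data:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = (X[:, 0] + rng.normal(scale=1.0, size=500) > 1.0).astype(int)   # 1 = readmitted (synthetic)

# mirror a 4-to-1 relative misclassification cost: errors on "yes" weigh 4x errors on "no"
model = DecisionTreeClassifier(class_weight={0: 1, 1: 4}, max_depth=4, random_state=0)
model.fit(X, y)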
Source: CognitiveClass
Case Study
Confusion Matrix
• A confusion matrix is a table that is often used to evaluate the
performance of a classification model (or "classifier")
• It works on a set of test data for which the true values are known
• There are two possible predicted classes: "YES" and "NO"
• If we were predicting the presence of a disease, for example, "yes"
would mean they have the disease, and "no" would mean they
don't have the disease.
• The classifier made a total of 165 predictions
E.g., 165 patients were being tested for the presence of that disease
• Out of those 165 cases, the classifier predicted "yes" 110 times,
and "no" 55 times.
• In reality, 105 patients in the sample have the disease, and 60
patients do not.
Source: https://www.dataschool.io/simple‐guide‐to‐confusion‐matrix‐terminology
Case Study
Confusion Matrix
• True positives (TP):
These are cases in which the model predicted yes (they
have the disease), and the patients actually do have the
disease.
• True negatives (TN):
The model predicted no, and they don't have the disease.
• False positives (FP):
The model predicted YES, but they don't actually have the
disease. (Also known as a "Type I error.")
• False negatives (FN):
The model predicted NO, but they actually do have the
disease. (Also known as a "Type II error.")
Source: https://www.dataschool.io/simple‐guide‐to‐confusion‐matrix‐terminology
Case Study
Confusion Matrix
• Accuracy – Overall, how often is the classifier correct?
(TP+TN)/total = (100+50)/165 = 0.91
• Misclassification Rate (Error Rate) – Overall, how often is it wrong?
(FP+FN)/total = (10+5)/165 = 0.09 (equivalent to 1 minus Accuracy)
• True Positive Rate (Sensitivity or Recall) – When it's actually YES, how often does it predict YES?
TP/actual YES = 100/105 = 0.95
• False Positive Rate – When it's actually NO, how often does it predict YES?
FP/actual NO = 10/60 = 0.17
Source: https://www.dataschool.io/simple‐guide‐to‐confusion‐matrix‐terminology
Case Study
Data Modeling (Application)
• Evaluation
Does the model used really answer the initial question or does it need to be
adjusted?
DATA
EVALUATION
Source: CognitiveClass
Case Study
Model Evaluation (Concept)
• Quality of the developed model is assessed
• Before model gets deployed evaluate whether the model really
answers the initial question
• Two phases of evaluation
Diagnostic measure phase
Ensure model is working as intended.
Statistical significance phase
Ensure data is being properly handled and interpreted within the model.
Source: CognitiveClass
Case Study
Model Evaluation (Concept)
• Diagnostic measurement phase
Ensures that the model works as intended
If the model is a predictive model
A decision tree can be used to assess whether the response
provided by the model matches the original design
This allows areas to be displayed where adjustments are required
If the model is a descriptive model that evaluates
relationships
A set of tests with known results can be applied and the model
refined as necessary
• Statistical significance test phase
Can be applied to the model to ensure that the model data is processed and interpreted correctly
This is to avoid a second unnecessary assumption when the answer is revealed
Source: CognitiveClass
Case Study
Model Evaluation (Application)
• Let's look at one way to find the optimal model
through a diagnostic measure based on tuning
one of the parameters in model building
• Specifically we'll see how to tune the relative cost
of misclassifying yes and no outcomes.
• As shown in this table, four models were built
with four different relative misclassification costs.
• As we see, each value of this model‐building parameter increases the true‐
positive rate (or sensitivity) of the accuracy in predicting yes, at the expense of
lower accuracy in predicting no, that is, an increasing false‐positive rate.
Source: CognitiveClass
Case Study
Model Evaluation (Application)
• Which model is best based on tuning this parameter?
• Risk-reducing intervention – two scenarios
The intervention cannot be applied to all CHF patients, because many of them would not have been readmitted anyway; this would not be cost-effective
The intervention itself would not be as effective in improving patient care if not enough high-risk CHF patients were targeted
• So, how do we determine which model was optimal?
• This can be done with the help of an ROC curve (receiver operating characteristic curve)
• ROC curve is a graph showing the performance of a classification model at all classification
thresholds
• This curve plots two parameters:
True Positive Rate
False Positive Rate
Source: CognitiveClass
Case Study
Receiver Operating Characteristic (ROC) Curve
• ROC curves are used to show the connection/trade‐off between clinical sensitivity and
specificity for every possible cut‐off (threshold) for a test or a combination of tests
• The area under an ROC curve is a measure of the usefulness of a test in general
A greater area means a more useful test
• ROC curves are used in clinical biochemistry to choose the most appropriate cut‐off for a
test
• The best cut‐off has the highest true positive rate together with the lowest false positive
rate.
• ROC curves were first employed in the study of discriminator systems for the detection of
radio signals in the presence of noise in the 1940s, following the attack on Pearl Harbor.
• The initial research was motivated by the desire to determine how the US RADAR
"receiver operators" had missed the Japanese aircraft.
Case Study
Model Evaluation (Application)
• Cost Study – Using the ROC Curve
An ROC curve plots TPR vs. FPR at different classification thresholds
Lowering the classification threshold classifies more items as positive, thus increasing both False Positives and True Positives
The curve quantifies how well a binary classification model separates the yes and no outcomes as some discrimination criterion is varied
In this case, the criterion is the relative misclassification cost
The optimal model is the one giving the maximum separation between the blue ROC curve and the red baseline
We can see that model 3, with a relative misclassification cost of 4-to-1, is the best of the 4 models (see the sketch below)
(Figure: ROC curves as a diagnostic tool for evaluating classification models – true-positive rate vs. false-positive rate, with the optimal model at maximum separation)
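A minimal sketch of an ROC curve with scikit-learn and matplotlib, using synthetic risk scores rather than the case study's models:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=300)                                    # 1 = readmitted (synthetic)
y_score = np.clip(0.4 * y_true + rng.normal(0.3, 0.25, size=300), 0, 1)  # predicted risk of "yes"

fpr, tpr, _ = roc_curve(y_true, y_score)
print("Area under the curve:", roc_auc_score(y_true, y_score))

plt.plot(fpr, tpr, label="model")
plt.plot([0, 1], [0, 1], "r--", label="baseline")   # the red base line of a random classifier
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()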
Source: CognitiveClass
From Deployment
to Feedback
Source: CognitiveClass
Case Study
From Deployment to Feedback
• Key topics covered
The importance of stakeholder input.
To consider the scale of deployment.
The importance of incorporating feedback to refine the
model.
The refined model must be redeployed.
This process should be repeated as often as necessary.
Source: CognitiveClass
• Deployment
Can you put the model into practice?
MODEL
DEPLOYMENT
Source: CognitiveClass
Case Study
Deployment (Concept)
• To make the model relevant and useful to address the initial question,
involves getting the stakeholders familiar with the tool produced
• Once the model is evaluated/approved by the stakeholders, it is
deployed and put to the ultimate test
• The model may be rolled out to a limited group of users or in a test
environment, to build up confidence in applying the outcome for use
across the board
Source: CognitiveClass
Case Study
Deployment (Application)
• Understanding the results
In preparation for model deployment, the next
step was to assimilate the knowledge for the
business group who would be designing and
managing the intervention program to reduce
readmission risk
In this scenario, the business people translated the
model results so that the clinical staff could
understand how to identify high‐risk patients and
design suitable intervention actions
The goal was to reduce the likelihood that these patients would be readmitted within 30 days after
discharge
During the business requirements stage, the Intervention Program Director and her team had
wanted an application that would provide automated, near real‐time risk assessments of congestive
heart failure
Source: CognitiveClass
Case Study
Deployment (Application)
• Gathering application requirements
The application also had to be easy for clinical staff to use,
preferably as a browser‐based application on a tablet that each
staff member could carry around
The patient data was generated throughout the
hospital stay
It would be automatically prepared in a format needed by the model, and each patient
would be scored near the time of discharge (see the scoring‐service sketch after this slide)
Clinicians would then have the most up‐to‐date risk assessment for each patient, helping
them to select which patients to target for intervention after discharge. As part of solution
deployment, the Intervention team would develop and deliver training for the clinical staff.
Source: CognitiveClass
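The slides describe the deployed solution only at a high level (automated, near real‐time scoring behind a browser‐based tablet application). As a hypothetical sketch of what such a scoring service could look like, the snippet below exposes a trained model over HTTP with Flask; the endpoint name, feature names, and model file are all assumptions made for illustration, not details from the case study:

```python
# Hypothetical scoring service: accepts a patient record as JSON and returns
# a readmission-risk probability near the time of discharge.
from flask import Flask, request, jsonify
import joblib

app = Flask(__name__)
model = joblib.load("chf_readmission_model.pkl")              # assumed pre-trained classifier
FEATURES = ["age", "num_prior_admissions", "length_of_stay"]  # illustrative fields only

@app.route("/readmission-risk", methods=["POST"])
def readmission_risk():
    record = request.get_json()                     # patient data prepared upstream
    x = [[record[f] for f in FEATURES]]             # single-patient feature row
    risk = float(model.predict_proba(x)[0][1])      # probability of readmission
    return jsonify({"patient_id": record.get("patient_id"), "risk": risk})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```

A clinician‐facing tablet application could then call this endpoint and display the returned risk score for each patient nearing discharge.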
Case Study
Deployment (Application)
• Additional Requirements
Processes for tracking and monitoring
patients receiving the intervention would
have to be developed in collaboration with
IT developers and database administrators,
so that the results could go through the
feedback stage and the model could be
refined over time
Source: CognitiveClass
[Data Science Methodology diagram: FEEDBACK stage]
Source: CognitiveClass
Case Study
Feedback (Concept)
• Feedback from the users will help to refine the
model and assess it for performance and impact
• The value of the model will be dependent on
successfully incorporating feedback and making
adjustments for as long as the solution is required
• Throughout the Data Science Methodology, each
step sets the stage for the next
• This makes the methodology cyclical and ensures refinement at each stage
• Once the model has been evaluated and the data scientist trusts that it will work, it will be
deployed and will undergo the final test:
Its real use in real time in the field
Source: CognitiveClass
Case Study
Feedback (Application)
• Feedback stage included these steps:
First
The review process would be defined and put
into place, with overall responsibility for
measuring the results of the model applied to
the CHF risk population
Clinical management executives would have
overall responsibility for the review process
Second
CHF patients receiving the intervention would be tracked and
their re‐admission outcomes recorded
Third
The intervention would then be measured to determine how
effective it was in reducing re‐admissions
• For ethical reasons, CHF patients would not be split into
control and treatment groups
• Instead, readmission rates would be compared before and after
the implementation of the model to measure its impact
(see the before/after comparison sketch after this slide)
Source: CognitiveClass
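The slides do not prescribe a particular statistical test, but a simple way to quantify the before/after comparison is a two‐proportion z‐test on readmission rates; the counts below are invented for illustration:

```python
# Illustrative before/after comparison of 30-day readmission rates
# using a two-proportion z-test (all counts are made up).
from statsmodels.stats.proportion import proportions_ztest

readmitted = [220, 160]    # readmissions within 30 days: [before, after]
discharged = [1000, 950]   # CHF discharges in each period

stat, p_value = proportions_ztest(count=readmitted, nobs=discharged)
rate_before, rate_after = (r / n for r, n in zip(readmitted, discharged))
print(f"before = {rate_before:.1%}, after = {rate_after:.1%}, p-value = {p_value:.4f}")
```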
Case Study
Feedback (Application)
• After the deployment and feedback stages, the impact
of the intervention program on re‐admission rates
would be reviewed after the first year of its
implementation
• Then the model would be refined, based on all of the
data compiled after model implementation and the
knowledge gained throughout these stages
• Other refinements included:
Incorporating information about participation in the
intervention program, and possibly refining the model to
incorporate detailed pharmaceutical data
If you recall, data collection was initially deferred because the
pharmaceutical data was not readily available at the time
But after feedback and practical experience with the model, it
might be determined that adding that data could be worth the
investment of effort and time
We also have to allow for the possibility that other refinements
might present themselves during the feedback stage.
Source: CognitiveClass
Case Study
Feedback (Application)
• Redeployment
The intervention actions and processes
would be reviewed and very likely
refined as well, based on the
experience and knowledge gained
through initial deployment and
feedback
Finally, the refined model and
intervention actions would be
redeployed, with the feedback process
continued throughout the life of the
Intervention program
Source: CognitiveClass
Thank You!