HealthCare Analytics - Day 1-5
Analytics
3 Ground Rules
•Participate (write in the chat window or unmute
yourself)
•Take Actions
Are you ready?
• Get your laptop/ desktop
Introduction to Anaconda,
Orange Data mining tool
What is healthcare analytics?
The use of advanced computing technology to improve medical care
• Aarogya setu
• Vaccination
• Practo
• Curofy
• Amion
Healthcare analytics can be viewed as the intersection of three fields:
Healthcare, Mathematics (Math), and Computer science (CS)
Healthcare
•Healthcare delivery and policy
•Healthcare data: history & physical examination (H&P)
•Clinical science: physiology & pathology
Mathematics (Math)
•High school mathematics
•Probability and statistics
•Linear algebra
•Calculus and optimization
Computer science (CS)
• Artificial intelligence
• Databases - electronic medical record (EMR)
• Programming languages
• Human-computer interaction
Healthcare Analytics Improves Medical Care
The effectiveness of medical care is measured using "healthcare triple aim"
prescriptive analytics
Important Pillars of Health Care Analytics
Predicting future
diagnostic and treatment
events
By identifying high-risk patients, steps can be taken to hinder or delay the onset of
the disease or prevent it altogether.
Healthcare Imaging
2030 - Meet Sophie
Predicting Future Diagnostic And Treatment
Events
The application of
Analytics in
Healthcare
Let us understand the basics
Health Care Analytics
Statistics
Unstructured Vs Structured Data
Categories of Data
Data
• Measurable • Countable
• Endlessly Divisible • Indivisible
Attribute/Discrete - characteristics identified only by name, label,
class,
or category.
Examples: defective/not defective,
approve/decline, branch of account,
relationship manager name, over
$100,000/under $100,000.
Data Types
Variable/Continuous - characteristics that can take on any value
on a continuum, therefore enabling precise measurement of
the distribution and interval between points.
Infinitely divisible.
Examples: cycle time, money.
Mean - Average
Example: sugar levels for diabetes
https://in.linkedin.com/in/dimplesanghvi
Median – the Middle value
Example: the sizes of the T-shirts we plan to distribute
Mode – the most frequently occurring value
Example: the sizes of the T-shirts we plan to distribute
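The three measures above can be sketched in a few lines of Python. This is only an illustration: the sugar readings and T-shirt sizes are made-up values, and the names `sugar_levels`, `tshirt_sizes` and `size_order` are assumptions, not data from the session.

```python
import statistics

# Hypothetical sugar readings in mg/dL (illustrative values only).
sugar_levels = [90, 110, 95, 180, 105]

# Hypothetical T-shirt sizes requested by participants.
tshirt_sizes = ["M", "L", "M", "S", "XL", "M", "L"]
size_order = ["S", "M", "L", "XL"]  # smallest to largest

# Mean: the average. One high reading (180) pulls it upwards.
mean_sugar = statistics.mean(sugar_levels)
median_sugar = statistics.median(sugar_levels)

# Median of an ordered category: rank the sizes, then take the middle rank.
ranks = sorted(size_order.index(size) for size in tshirt_sizes)
median_size = size_order[statistics.median_low(ranks)]

# Mode: the most frequently occurring value.
mode_size = statistics.mode(tshirt_sizes)

print(mean_sugar, median_sugar, median_size, mode_size)
```

The mean needs numeric data, while the mode also works on labels such as T-shirt sizes.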
Graphs – a picture is worth a thousand words
Pie Chart
How size can mislead
Which team has worst performers?
[Figure: two bar charts, Team A and Team B, each showing Category 1 and Category 2. The y-axis of Team A runs from 0 to 5, the y-axis of Team B runs from 4 to 10.]
Which team has worst performers?
Correct data
Series 1, Category 1 to Category 4: 99.6, 98.3, 97.3, 99.9
[Figure: the same Series 1 values drawn twice, once on a y-axis running from 96 to 101 and once on a y-axis running from 0 to 100.]
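A quick calculation shows why the choice of axis matters. This sketch assumes the four values belong to Category 1 to Category 4 in order and that the first chart starts its y-axis at 96; `bar_height_ratio` is a hypothetical helper, not part of any charting library.

```python
# Series 1 values from the chart (the category assignment is an assumption).
values = {"Category 1": 99.6, "Category 2": 98.3, "Category 3": 97.3, "Category 4": 99.9}

def bar_height_ratio(a, b, axis_start):
    """Ratio of the drawn bar heights when the y-axis starts at axis_start."""
    return (a - axis_start) / (b - axis_start)

best, worst = max(values.values()), min(values.values())

# On an axis starting at 96 the tallest bar looks about 3 times the shortest.
truncated = bar_height_ratio(best, worst, axis_start=96)

# On an axis starting at 0 the two bars are almost the same height.
full = bar_height_ratio(best, worst, axis_start=0)

print(round(truncated, 2), round(full, 2))
```

The data is identical in both cases; only the starting point of the axis changes the impression.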
Explain
• Descriptive
• Predictive
• Prescriptive
• What is Probability?
•Types of Events
• Types of Probability
• i. Marginal Probability
• ii. Joint Probability
• iii. Conditional Probability
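The three types of probability can be illustrated on a small table of counts. The numbers below are hypothetical (100 imaginary patients) and are only meant to show how each quantity is calculated.

```python
from fractions import Fraction

# Hypothetical counts for 100 patients (illustrative only).
counts = {
    ("smoker", "diabetic"): 15,
    ("smoker", "not diabetic"): 25,
    ("non-smoker", "diabetic"): 20,
    ("non-smoker", "not diabetic"): 40,
}
total = sum(counts.values())

# i. Marginal probability: P(diabetic), ignoring smoking status.
p_diabetic = Fraction(
    sum(n for (_, status), n in counts.items() if status == "diabetic"), total
)

# ii. Joint probability: P(smoker and diabetic).
p_joint = Fraction(counts[("smoker", "diabetic")], total)

# iii. Conditional probability: P(diabetic | smoker) = P(smoker and diabetic) / P(smoker).
p_smoker = Fraction(
    sum(n for (habit, _), n in counts.items() if habit == "smoker"), total
)
p_conditional = p_joint / p_smoker

print(p_diabetic, p_joint, p_conditional)
```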
Simpson’s
Paradox
Let us understand
Simpson’s Paradox
• Hospital A vs. Hospital B
People tend to talk about the overall score and
not the individual marks
Lurking variable
UK smoking
case study
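Simpson's Paradox is easier to see with numbers. The counts below are invented for two hypothetical hospitals; the lurking variable is assumed to be the condition of the patient on arrival.

```python
# Hypothetical (survived, treated) counts per hospital and patient condition.
data = {
    "Hospital A": {"good": (590, 600), "poor": (210, 400)},
    "Hospital B": {"good": (870, 900), "poor": (30, 100)},
}

def rate(survived, treated):
    return survived / treated

def overall_rate(hospital):
    """Survival rate over all patients, ignoring their condition."""
    survived = sum(s for s, _ in data[hospital].values())
    treated = sum(t for _, t in data[hospital].values())
    return rate(survived, treated)

# Within each condition, Hospital A has the higher survival rate.
for condition in ("good", "poor"):
    a = rate(*data["Hospital A"][condition])
    b = rate(*data["Hospital B"][condition])
    print(f"{condition}: A={a:.1%}  B={b:.1%}")

# Overall, Hospital B looks better because it treats far fewer poor-condition patients.
print(f"overall: A={overall_rate('Hospital A'):.1%}  B={overall_rate('Hospital B'):.1%}")
```

Hospital A wins in every group but loses on the overall score, which is why the individual marks matter.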
•Take Actions
Types of Machine
Learning
U-OSEMN
Content
Linear Regression
Use Cases
Examples of Healthcare
Analytics
Let us watch some videos
Machine Learning
Machine
Learning
1. BUsiness Requirement
2. Obtain the Data
3. Scrub the Data
4. Data Exploration
5. Data Modelling
6. INterpret & Deployment of the Model
Data Life Cycle
Questions and tasks along the life cycle:
• What is the most efficient way to store and access the data?
• Understand the patterns in the data
• Determine optimal data features for ML models
• Create a model that predicts the target most accurately
• Determine the model in a pre-production / test environment
• Monitor the performance
Note: Remember To Split The Base Data
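A minimal sketch of splitting the base data into a training part and a test part, using only the Python standard library. The 80/20 ratio, the seed and the function name `train_test_split` are assumptions chosen for illustration; libraries such as scikit-learn offer their own version of this step.

```python
import random

def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of rows and split it into (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

base_data = list(range(100))  # stand-in for 100 patient records
train, test = train_test_split(base_data)

print(len(train), len(test))
```

The model is built on `train` only, so that `test` stays unseen until the evaluation step.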
Simple Linear Regression
Business Requirement
• The hospital has data on children in the age group of 0-2 years
• The research team has captured a few features
• Can you understand the story the data is telling?
• The current head size should be 36 cm; can we predict it based on
the other parameters?
• How significant is the model?
• Are we missing any information?
Given the dataset – let us predict
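A simple linear regression can be sketched without any library. The data points below are hypothetical, and using age in months as the single predictor of head circumference is an assumption; the session dataset may use other features.

```python
# Hypothetical (age in months, head circumference in cm) pairs, for illustration only.
ages = [0, 3, 6, 12, 18, 24]
head_cm = [35.0, 40.5, 43.0, 46.0, 47.5, 48.5]

def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

slope, intercept = fit_line(ages, head_cm)

# Predict the head circumference of a 9-month-old child from the fitted line.
predicted = intercept + slope * 9
print(round(slope, 3), round(intercept, 3), round(predicted, 1))
```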
Distribution
Stratification
Bootstrapping
Cross Validation
Cross Validation
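Cross validation can be sketched as a rotation of the validation part over the data. The helper `k_fold_indices` below is a hypothetical, unshuffled k-fold splitter written for illustration; it only produces the row indices, not a model.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size

# Each row is used for validation exactly once and for training k-1 times.
for train_idx, val_idx in k_fold_indices(10, 5):
    print("validate on", val_idx, "train on", train_idx)
```

In practice the rows are usually shuffled first, or stratified when the classes are imbalanced.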
Build the model
Goodness of fit – R-Square
• The R-Square value is a statistical measure of how close the data
are to the fitted regression line
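One common way to compute R-Square is 1 minus the ratio of the residual sum of squares to the total sum of squares. The sketch below uses made-up measurements; `r_squared` is an illustrative helper, not a library function.

```python
def r_squared(actual, predicted):
    """R-Square = 1 - (residual sum of squares / total sum of squares)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

actual = [36.0, 38.0, 40.0, 42.0]      # hypothetical measurements
perfect = [36.0, 38.0, 40.0, 42.0]     # predictions that sit exactly on the data
baseline = [39.0, 39.0, 39.0, 39.0]    # always predicts the mean of the data

print(r_squared(actual, perfect))   # a perfect fit
print(r_squared(actual, baseline))  # no better than predicting the mean
```

A value near 1 means the data lie close to the fitted line; a value near 0 means the line explains little more than the mean does.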
The dataset
• Features: Age, No. of children, BMI, Gender
• Target: the charges
Evaluation of Classification model
• Accuracy
• Precision
• Recall
What happens when the
prediction should be categorical?
• Logistic regression fits the data with a sigmoidal/logistic curve rather than a line
and outputs an approximation of the probability of the output given the input
• Decision tree learning is one of the predictive
modelling approaches.
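The sigmoidal/logistic curve mentioned above can be sketched as follows. The weight and bias are hypothetical numbers picked by hand, not fitted coefficients, and the single input `x` stands for any numeric feature.

```python
import math

def sigmoid(z):
    """Logistic curve: maps any real number to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_probability(x, weight, bias):
    """Approximate probability of the positive class for input x."""
    return sigmoid(weight * x + bias)

# Hypothetical coefficients, chosen by hand for illustration (not trained).
weight, bias = 0.05, -6.0

for x in (80, 120, 160):
    p = predict_probability(x, weight, bias)
    label = "positive" if p >= 0.5 else "negative"
    print(x, round(p, 3), label)
```

The output is a probability, so a threshold (0.5 here, as an assumption) is needed to turn it into a category.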
Descriptive – what has happened? (as is)
• In my current dataset, 65% are non-diabetic and 35% are diabetic
• 45% of my dataset has 0-2 pregnancies; 55% has more than 2
• ~30% of my dataset has 0-1 pregnancies; 70% has more than 1
Diabetes
•Can you build a machine learning model
to accurately predict whether or not the
patients in the dataset have diabetes?
Kidney Disease
• The data was taken over a 2-month
period in India with 25 features (e.g.
red blood cell count, white blood cell
count, etc.).
Good Model vs. Overfitting vs. Underfitting
[Figures: example fits plotted against X for a good model, an overfitted model and an underfitted model]
• Underfitting
• Model does not capture the underlying trend of the data and
does not fit the data well enough.
• Low variance but high bias.
• Underfitting is often a result of an excessively simple model
Machine Learning
• When thinking about overfitting and underfitting we want
to keep in mind the relationship of model performance on
the training set versus the test/validation set.
• This data was easy to visualize, but how can we see
underfitting and overfitting when dealing with multi
dimensional data sets?
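One way to see underfitting and overfitting without a plot is to compare the error on the training set with the error on the test set. The sketch below uses hypothetical noisy data and three hand-written models; the model names and the noise level are assumptions for illustration.

```python
import random

rng = random.Random(0)
xs = list(range(40))
ys = [2 * x + 1 + rng.gauss(0, 3) for x in xs]  # hypothetical noisy linear data
train = [(x, y) for x, y in zip(xs, ys) if x % 2 == 0]
test = [(x, y) for x, y in zip(xs, ys) if x % 2 == 1]

def mse(model, data):
    """Mean squared error of model over (x, y) pairs."""
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)

# Too simple: always predict the training mean (high bias, underfits).
mean_y = sum(y for _, y in train) / len(train)
def too_simple(x):
    return mean_y

# Too flexible: memorise the training set and return the nearest point's y (overfits).
def too_flexible(x):
    return min(train, key=lambda point: abs(point[0] - x))[1]

# Reasonable: least-squares straight line.
mean_x = sum(x for x, _ in train) / len(train)
slope = (sum((x - mean_x) * (y - mean_y) for x, y in train)
         / sum((x - mean_x) ** 2 for x, _ in train))
def straight_line(x):
    return mean_y + slope * (x - mean_x)

for name, model in [("too simple", too_simple), ("straight line", straight_line),
                    ("too flexible", too_flexible)]:
    print(f"{name:14s} train MSE={mse(model, train):8.2f}  test MSE={mse(model, test):8.2f}")
```

A model that is poor on both sets is underfitting; a model that is perfect on the training set but clearly worse on the test set is overfitting. The same comparison works for any number of dimensions.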
Evaluating
Performance
CLASSIFICATION
Model Evaluation
• The key classification metrics
we need to understand are:
• Accuracy
• Recall
• Precision
• F1-Score
Model Evaluation
• Typically in any classification
task your model can only
achieve two results:
• Either your model was
correct in its prediction.
• Or your model was
incorrect in its prediction.
Model Evaluation
[Diagram: a test image from X_test is passed to the TRAINED MODEL, which outputs a prediction on the test image (e.g. DOG). The prediction is compared with the correct label from y_test.]
Compare Prediction to Correct Label
• DOG == DOG ? The prediction is correct.
• DOG == CAT ? The prediction is incorrect.
Model Evaluation - Confusion Matrix
Model Evaluation
• Accuracy - the number of correct predictions made by the
model divided by the total number of predictions
• For example, if the X_test set was 100 images and our
model correctly predicted 80 images, then we have
80/100 = 0.8, i.e. 80% accuracy.
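The comparison of each prediction with its correct label can be sketched as below. The label lists are hypothetical and are built so that 80 of the 100 predictions match.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the correct labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

# Hypothetical labels: 100 test images, of which the model gets 80 right.
y_test = ["DOG"] * 50 + ["CAT"] * 50
y_pred = ["DOG"] * 40 + ["CAT"] * 10 + ["CAT"] * 40 + ["DOG"] * 10

print(accuracy(y_test, y_pred))
```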
Software Research
Domain
Knowledge
Accuracy calculation using Confusion Matrix
                 Predicted True         Predicted False
Actual True      True Positive (TP)     False Negative (FN)
Actual False     False Positive (FP)    True Negative (TN)
Total predictions = TP+FP+TN+FN
Accuracy – Uses
Build a model for cancer prediction
Accuracy
• Accuracy = (TP+TN) / (TP+FP+TN+FN)
Calculate the accuracy score
Accuracy – Let's play with a dumb model
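A small sketch of why accuracy alone can mislead. It assumes that the "dumb model" is one that always predicts the majority class, and it uses made-up screening numbers (5 cancer cases in 100 patients); both are assumptions for illustration.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, FP, TN, FN) for a binary classification task."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return tp, fp, tn, fn

def accuracy(tp, fp, tn, fn):
    return (tp + tn) / (tp + fp + tn + fn)

# Hypothetical screening data: 5 of 100 patients actually have cancer (label 1).
y_true = [1] * 5 + [0] * 95
dumb_pred = [0] * 100  # the dumb model always predicts "no cancer"

tp, fp, tn, fn = confusion_counts(y_true, dumb_pred)
print(tp, fp, tn, fn, accuracy(tp, fp, tn, fn))
```

The dumb model scores 95% accuracy while missing every single cancer case, which is why recall and precision are needed as well.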
You are in charge of the airport security check, where you know that film
stars, politicians and other high-profile people are passing through the gate
• Actual Criminal, predicted Criminal: True Positive (TP)
• Actual Criminal, predicted Innocent: False Negative (FN), the mistake
• Actual Innocent, predicted Criminal: False Positive (FP), I can hold anyone back for investigation
• Actual Innocent, predicted Innocent: True Negative (TN)
• Total Positives = TP + FN
• Total Negatives = TN + FP
Recall = TP / (TP+FN)
An undetected weapon carrier will cost me my job
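Recall can be computed directly from the two counts in its formula. The numbers below are hypothetical airport figures used only to show the calculation.

```python
def recall(tp, fn):
    """Recall = TP / (TP + FN): the share of actual positives that were caught."""
    return tp / (tp + fn)

# Hypothetical day at the gate: 10 weapon carriers, 9 caught, 1 missed.
print(recall(tp=9, fn=1))

# Holding back more innocent people adds False Positives, which recall ignores.
print(recall(tp=10, fn=0))
```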
SSR Murder investigation
You went to the party knowing that all the film stars and
high-profile people were attending it
Prediction by Model vs. Actual (Criminal / Innocent)
• Actual Innocent, predicted Criminal: False Positive (FP), I cannot make this mistake
• Actual Innocent, predicted Innocent: True Negative (TN)
• Total Actual Negatives = TN + FP
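When a False Positive is the mistake to avoid, precision is the metric to watch, and the F1-Score combines precision with recall. The sketch below uses the standard definitions, Precision = TP / (TP + FP) and F1 = 2 x Precision x Recall / (Precision + Recall); the counts are hypothetical.

```python
def precision(tp, fp):
    """Precision = TP / (TP + FP): the share of positive predictions that were right."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Recall = TP / (TP + FN): the share of actual positives that were caught."""
    return tp / (tp + fn)

def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall."""
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r)

# Hypothetical investigation: 8 criminals accused correctly,
# 2 innocent people accused wrongly, 4 criminals missed.
tp, fp, fn = 8, 2, 4
print(precision(tp, fp), round(recall(tp, fn), 3), round(f1_score(tp, fp, fn), 3))
```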
https://www.youtube.com/watch?v=QllguanpKic
AI in Healthcare: Top A.I. Algorithms In Healthcare - The Medical Futurist
Image Analytics
MRI dataset of 5 stages of Alzheimer's disease from ADNI repository
https://www.kaggle.com/yasserhessein/dataset-alzheimer
Normal Distribution
• Normal distribution is characterised by a bell-shaped
curve
• Mean is at the centre of the bell-shaped curve
• Mean = Median
• Percentage of values
• Mean ± 1 Std. Dev. contains c.68% of all values
• Mean ± 2 Std. Dev. contains c.95% of all values
• Standard Deviation measures spread
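The Normal Curve
[Figure: the normal curve from -4 to +4 standard deviations. Mean ± 1 Std. Dev. covers 68.27%, Mean ± 2 Std. Dev. covers 95.45% and Mean ± 3 Std. Dev. covers 99.73% of all values.]
The empirical rule allows us to divide the normal distribution into predictable ranges based
on standard deviation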
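The percentages of the empirical rule can be reproduced with the error function from the Python standard library. `share_within` is an illustrative helper name; the results are close to the figures quoted for the normal curve.

```python
import math

def share_within(k):
    """Share of a normal distribution within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"Mean ± {k} Std. Dev.: {share_within(k):.2%}")
```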
Central Limit Theorem
Medical Camp
• MedCamp organizes health camps in several cities with low
work-life balance. They reach out to working people and ask
them to register for these health camps. MedCamp gives those
who attend the facility to undergo health checks or to
increase their awareness by visiting various stalls
(depending on the format of the camp).
• MedCamp has conducted 65 such events over a period of 4
years, and there is a high drop-off between “Registration”
and the number of people taking tests at the camps.
• One of the huge costs in arranging these camps is the
amount of inventory you need to carry. If you carry more
than the required inventory, you incur unnecessarily high
costs.
• On the other hand, if you carry less than the required
inventory for conducting these medical checks, people end
up having a bad experience.
The Process
MedCamp employees/volunteers reach out to people and drive
registrations. During the camp, people who “ShowUp” either undergo
the medical tests or visit stalls, depending on the format of the health
camp.
• Other things to note:
• Since this is a completely voluntary activity for the working professionals, MedCamp
usually has little profile information about these people.
• For a few camps, there was a hardware failure, so some information about the date
and time of registration is lost.
• MedCamp runs 3 formats of these camps.
• The first and second formats provide people with an instantaneous health score.
• The third format provides information about several health issues through various
awareness stalls.
Favorable outcome
PatientProfile.csv - This file contains patient profile details like PatientID, OnlineFollower, social
media details, Income, Education, Age, FirstInteractionDate, CityType and Employer_Category.
FirstHealthCampAttended.csv - This file contains details about people who attended a health camp
of the first format. This includes Donation (amount) & HealthScore of the person.
SecondHealthCampAttended.csv - This file contains details about people who attended a health
camp of the second format. This includes HealthScore of the person.
ThirdHealthCampAttended.csv - This file contains details about people who attended a health
camp of the third format. This includes Numberofstallvisited & LastStallVisited_Number.
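A first step with these files is usually to join the attendance details onto the patient profiles. The sketch below uses tiny in-memory stand-ins instead of the real files; the rows are invented, and it assumes that the attendance files also carry a PatientID column, which the descriptions above do not state.

```python
import csv
import io

# Tiny in-memory stand-ins for the real files; rows and values are hypothetical.
patient_profile_csv = """PatientID,Age,CityType
1,34,A
2,45,B
3,29,A
"""
first_camp_csv = """PatientID,Donation,HealthScore
1,50,0.72
3,20,0.41
"""

profiles = {row["PatientID"]: row for row in csv.DictReader(io.StringIO(patient_profile_csv))}
attended = {row["PatientID"]: row for row in csv.DictReader(io.StringIO(first_camp_csv))}

# Left join: keep every patient and add attendance details where they exist.
joined = []
for patient_id, profile in profiles.items():
    record = dict(profile)
    record.update(attended.get(patient_id, {"Donation": None, "HealthScore": None}))
    record["AttendedFirstFormat"] = patient_id in attended
    joined.append(record)

for record in joined:
    print(record)
```

With the real files, `io.StringIO(...)` would be replaced by an opened file, and the same join would be repeated for the second and third formats.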
• Evaluation Metric
• The evaluation metric for this hackathon is ROC-AUC Score.
• healthcare-analytics dataset
• healthcare-provider-fraud-detection-analysis
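The ROC-AUC Score named as the evaluation metric can be read as the chance that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The sketch below computes it by comparing every such pair; the labels and scores are a small hypothetical example, and `roc_auc` is an illustrative helper rather than a library call.

```python
def roc_auc(y_true, scores):
    """ROC-AUC as the chance that a random positive outranks a random negative."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half
    return wins / (len(positives) * len(negatives))

y_true = [0, 0, 1, 1]             # 1 = favorable outcome (hypothetical labels)
scores = [0.1, 0.4, 0.35, 0.8]    # predicted probabilities (hypothetical)

print(roc_auc(y_true, scores))
```

A score of 0.5 corresponds to random guessing and 1.0 to a perfect ranking. The pairwise loop is fine for small examples but slow on large datasets.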
Machine Learning Process
1. Data Acquisition
2. Data Cleaning
3. Model Training & Building (on the Training Data): explore and build the model based on the data
4. Model Testing (on the Test Data): evaluate, i.e. check the performance of your model (Test & Score)
5. Adjust Model Parameters: enhance the model / improve the model accuracy
6. Model Deployment: deploy the model
Supervised Learning