HealthCare Analytics - Day 1-5

3 Ground Rules
•Participate (write in the chat window or unmute yourself)
•Keep yourself hydrated (have a water bottle next to you)
•Take Actions
Are you ready?
• Get your laptop/ desktop
• Get Paper & Pen
• Close the Door
• Keep your phone on silent(not even vibration)

Content
• Basics of healthcare analytics
• History of healthcare analytics
• Examples of healthcare analytics
• Introduction to Anaconda and the Orange data mining tool
What is healthcare analytics?
The use of advanced computing technology to improve medical care.
The National Library of Medicine (NLM) defines healthcare as a multidisciplinary domain of planning, developing, adopting, and applying information technology-related inventions in the delivery, administration, and development of healthcare facilities.
What is healthcare analytics?
the use of advanced computing technology to improve medical care
• Aarogya setu
• Vaccination
• Practo
• Curofy
• Amion
Healthcare analytics can be viewed as the intersection of three fields: Healthcare, Mathematics (Math), and Computer science (CS).
Healthcare
•Healthcare delivery and policy
•Healthcare data: history & physical examination (H&P)
•Clinical science: physiology & pathology
Mathematics (Math)
•High school mathematics
•Probability and statistics
•Linear algebra
•Calculus and optimization
Computer science (CS)
• Artificial intelligence
• Databases: electronic medical record (EMR)
• Programming languages
• Human-computer interaction
Healthcare Analytics Improves Medical Care
The effectiveness of medical care is measured using the "healthcare triple aim":
• Improving outcomes
• Reducing costs
• Ensuring quality
Healthcare outcomes
We yearn for better outcomes in our own lives whenever we visit a doctor or a hospital. Some of the things about which we are concerned:
• Accurate diagnosis
• Effective treatment
• No complications
• An overall improved quality of life
Analytics
• Descriptive analytics
• Predictive analytics
• Prescriptive analytics
Important Pillars of Health Care Analytics
Predicting future
diagnostic and treatment
events
By identifying high-risk patients, steps can be taken to hinder or delay the onset of
the disease or prevent it altogether.
Healthcare Imaging
2030 - Meet Sophie
Predicting Future Diagnostic And Treatment
Events
First: what specific event (or disease) are we interested in predicting?
Second: what data will we use to make our predictions?

• Structured clinical data (EMR)
• Unstructured data- x-ray images, ECG, data recorded from devices
Third: what machine learning algorithm will we use?

Key Learning Dimensions on Software
• Obtaining adequate & relevant data in the right format
• Applying the Roadmap and Deriving Business Insights
• Mechanics of Running the Tool

Click on Orange
The application of
Analytics in
Healthcare
Let us understand the basics
Health Care Analytics
Statistics
Unstructured Vs Structured Data
Categories of Data
Data
• Variable / Continuous: measurable, endlessly divisible
• Attribute / Discrete: countable, indivisible
Attribute/Discrete - characteristics identified only by name, label, class, or category.
Examples: defective/not defective, approve/decline, branch of account, relationship manager name, over $100,000/under $100,000.
Data Types
Variable/Continuous - characteristics that can take on any value
on a continuum, therefore enabling precise measurement of
the distribution and interval between points.
Infinitely divisible.
Examples: cycle time, money.
Mean - Average
We did example from sugar level for diabetes
https://in.linkedin.com/in/dimplesanghvi
Median – the Middle value
We did example from T-shirt we plan to distribute
Mode – Most frequently
occurring value
We did example from T-shirt we plan to distribute
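The three measures above can be tried out with Python's built-in statistics module. This is only a small sketch: the sugar readings and T-shirt sizes below are made-up values, not the ones used in class.

```python
from statistics import mean, median, mode

# Hypothetical fasting blood-sugar readings (mg/dL); not the class data.
sugar_levels = [92, 105, 98, 140, 110]

# Hypothetical T-shirt sizes, coded 1=S, 2=M, 3=L, 4=XL.
tshirt_sizes = [2, 3, 3, 1, 4, 3, 2]

avg_sugar = mean(sugar_levels)       # arithmetic average
middle_size = median(tshirt_sizes)   # middle value after sorting
common_size = mode(tshirt_sizes)     # most frequently occurring value

print(avg_sugar, middle_size, common_size)
```

The mean suits a measured value such as a sugar level, while the median and mode are the more natural summaries when deciding which T-shirt size to order.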
Graphs – 1 picture is worth a 1000 words
Pie Chart
In the misleading pie chart, Item C appears to be at least as large as Item A,

whereas in reality, it is less than half
How size can mislead
Which team has worst performers?
[Figure: two bar charts, Team A and Team B, each showing Category 1 and Category 2. Team A's vertical axis runs from 0 to 5, Team B's from 4 to 10.]
Which team has worst performers?
Correct data (Series 1): Category 1 = 99.6, Category 2 = 98.3, Category 3 = 97.3, Category 4 = 99.9
[Figure: the same four values plotted twice, once on a vertical axis truncated to 96-101 and once on a full axis from 0 to 100.]
Explain
• Descriptive
• Predictive
• Prescriptive
• What is Probability?
•Types of Events
• Types of Probability
• i. Marginal Probability
• ii. Joint Probability
• iii. Conditional Probability
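The three types of probability can be illustrated on a small cross-tabulation. The sketch below assumes a made-up table of 200 patients split by smoking status and disease; the counts are purely illustrative.

```python
# Hypothetical counts of 200 patients, cross-tabulated by smoking status and disease.
counts = {
    ("smoker", "disease"): 30,
    ("smoker", "healthy"): 50,
    ("non-smoker", "disease"): 20,
    ("non-smoker", "healthy"): 100,
}
total = sum(counts.values())

# Marginal probability: P(disease), ignoring smoking status.
p_disease = sum(n for (_, d), n in counts.items() if d == "disease") / total

# Joint probability: P(smoker AND disease).
p_smoker_and_disease = counts[("smoker", "disease")] / total

# Conditional probability: P(disease | smoker) = P(smoker AND disease) / P(smoker).
p_smoker = sum(n for (s, _), n in counts.items() if s == "smoker") / total
p_disease_given_smoker = p_smoker_and_disease / p_smoker

print(p_disease, p_smoker_and_disease, round(p_disease_given_smoker, 3))
```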
Simpson’s Paradox
Let us understand Simpson’s Paradox
Hospital | Admitted & treated | Survived | Critical condition (survived) | Non-critical condition (survived)
Hospital A | 1000 | 900 | 100 (30) | 900 (870)
Hospital B | 1000 | 800 | 400 (210) | 600 (590)
Why did this happen?
Simpson’s Paradox
Grouping or aggregating the data can give misleading information.
You will find that people talk about the overall score and not the individual marks.
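The hospital figures from the Simpson's Paradox slide can be checked with a few lines of Python. The sketch below only re-computes survival rates from those counts; the variable and function names are illustrative.

```python
# Survival counts per hospital and patient group, as (survived, admitted).
hospitals = {
    "A": {"critical": (30, 100), "non_critical": (870, 900)},
    "B": {"critical": (210, 400), "non_critical": (590, 600)},
}

def survival_rate(survived, admitted):
    return survived / admitted

rates = {}
for name, groups in hospitals.items():
    total_survived = sum(s for s, _ in groups.values())
    total_admitted = sum(a for _, a in groups.values())
    rates[name] = {
        "overall": survival_rate(total_survived, total_admitted),
        "critical": survival_rate(*groups["critical"]),
        "non_critical": survival_rate(*groups["non_critical"]),
    }

for name, r in rates.items():
    print(name, {k: round(v, 3) for k, v in r.items()})
```

Hospital A looks better overall (90% vs 80%), yet Hospital B has the higher survival rate in both the critical and the non-critical group. Hospital B simply treats many more critical patients, which pulls its aggregate down.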
Lurking variable
UK smoking
case study
• Lurking variable – Age is the lurking variable
Population
Medical task
Screening
An example -screening for cervical cancer;
women are recommended to undergo this
cost-effective test every 1-3 years
throughout most of their lives
Response to treatment
Healthcare Analytics
MBA(BA) 2020-22 Sem III Online Class
Day 3
Content
• Types of Machine Learning
• U-OSEMN
• Linear Regression
• Use Cases
Examples of Healthcare Analytics
Let us watch some videos
Machine Learning
• Supervised Learning: Regression, Classification
• Unsupervised Learning: Segmentation, Clustering
• Reinforcement Learning
Supervised learning
• Training data includes desired outputs
Unsupervised learning
• Training data does not include desired outputs
Inductive Learning
Given examples of a function (y= f(X))
Predict function f(X) for new examples X

• if y is Discrete then f(X): Classification
• If y is Continuous then f(X): Regression
f(X) = Probability(X): Probability estimation

Data Life Cycle- U-OSEMN
1. BUsiness Requirement
2. Obtain the Data
3. Scrub the Data
4. Data Exploration
5. Data Modelling
6. INterpret & Deployment of the Model
Data Life Cycle
1. Business Requirement
• Understand the business problem
• Identify the objective
• Define the variable to be predicted
2. Obtain the Data
• What data do we need for this project?
• What are the data sources?
• How can I obtain the data?
• What is the most efficient way to store and access it?
3. Scrub the Data
• Transform the data into the desired format
• Data cleaning: missing values, corrupted data
• Remove unnecessary data
• Cleaning the data and its significance
• Combining / grouping data
4. Data Exploration
• Understand the patterns in the data
• Derive useful insights
• Form hypotheses
5. Data Modelling
• Determine optimal data features for ML models
• Create a model that predicts the target most accurately
• Evaluate & test the efficiency of the model
6. Interpret & Deployment of the Model
• Deploy the model in a pre-production / test environment
• Monitor the performance
Note: Remember To Split The Base Data
Simple Linear Regression
One independent variable and one dependent variable
Multiple Linear Regression
Multiple independent variables and one dependent variable
Straight Line Equation
Positive or Negative slope?
The correlation between two random variables, X and Y, is a measure of the degree of linear association between the two variables.
The correlation coefficient "r", can take on any value from -1 to 1.
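To make the correlation coefficient concrete, here is a minimal sketch that computes Pearson's r by hand. The age and head-size values are invented for illustration and are not the class dataset.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical data: age in months vs head circumference in cm.
age_months = [1, 3, 6, 9, 12]
head_cm = [37.0, 40.5, 43.0, 45.0, 46.0]

r = pearson_r(age_months, head_cm)
print(round(r, 3))
```

A value of r close to +1 indicates a strong positive linear association, close to -1 a strong negative one, and close to 0 little linear association.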
What type of
correlation
What can
you
interpret?
Predict the size of
head of the baby
Linear Regression problem

2. Obtain the Data
3. Scrub the Data
4. Data Exploration
5. Data Modelling
Business Requirement
• The hospital has data on children in the age group of 0-2 years
• The research team has captured a few features
• Can you understand the story the data is telling?
• The current head size should be 36 cm; can we predict it based on the other parameters?
• How significant is the model?
• Are we missing any information?
Given the dataset – let us predict
Distribution
• Head size by Gender

• What can you interpret
What do you analyse?
% Split – 70/30
Stratification
Bootstrapping
Cross Validation
Cross Validation
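Two of the sampling options listed above, the 70/30 split and cross validation, can be sketched in plain Python. The function names and the list of 100 placeholder records are assumptions made for this illustration.

```python
import random

def train_test_split(rows, test_fraction=0.3, seed=42):
    """Shuffle a copy of the rows and split them into train and test parts."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def k_fold_indices(n_rows, k=5):
    """Yield (train_indices, validation_indices) for each of k folds."""
    indices = list(range(n_rows))
    fold_size, remainder = divmod(n_rows, k)
    start = 0
    for fold in range(k):
        stop = start + fold_size + (1 if fold < remainder else 0)
        validation = indices[start:stop]
        train_part = indices[:start] + indices[stop:]
        yield train_part, validation
        start = stop

rows = list(range(100))           # stand-in for 100 patient records
train, test = train_test_split(rows)
folds = list(k_fold_indices(len(rows), k=5))
print(len(train), len(test), len(folds))
```

In cross validation every record is used for validation exactly once, so the score does not depend on one lucky or unlucky split.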
Build the model
Goodness of fit – R-Square
•R-Square value is a statistical measure of how close the data are to the fitted regression line
•It is also known as the coefficient of determination
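As a rough sketch of how the fitted line and R-Square relate, the example below fits a simple linear regression by ordinary least squares and then computes R-Square as 1 minus the ratio of residual to total sum of squares. The data points are invented, and the class itself uses the Orange tool rather than hand-written code.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return intercept, slope

def r_squared(ys, predictions):
    """R-Square = 1 - (residual sum of squares / total sum of squares)."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, predictions))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

# Hypothetical data: age in months vs head circumference in cm.
age_months = [1, 3, 6, 9, 12]
head_cm = [37.0, 40.5, 43.0, 45.0, 46.0]

intercept, slope = fit_line(age_months, head_cm)
predictions = [intercept + slope * x for x in age_months]
r2 = r_squared(head_cm, predictions)
print(round(intercept, 2), round(slope, 3), round(r2, 3))
```

An R-Square near 1 means the points lie close to the fitted line; a value near 0 means the line explains little of the variation in the target.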
Predict the
Cost of
Treatment
2. Obtain the Data
3. Scrub the Data
4. Data Exploration
5. Data Modelling
The dataset
• Features: Age, No. of children, BMI, Gender
• Target: the charges
Can you explain? Also try for other columns/features.
Charge of treatment by Gender
Charge of treatment by Smoker / non-smoker
Scatter Plot
Which model is better on the training dataset?
Which model is better?
Healthcare Analytics
MBA(BA) 2020-22 Sem
III Online Class
Logistic Regression
Content
• Logistic Regression
• Confusion Matrix
• Evaluation of Classification model: Accuracy, Precision, Recall
Logistic Regression
What happens when the prediction should be categorical?
• Does the person have the disease or not,
• Will the patient survive or not,
• Application accepted or not, etc.
Logistic Regression
• One commonly used algorithm is Logistic Regression
• Assumes that the dependent (output) variable is binary
Predict output for categorical data /binary data
• Does person have disease or not,
• Will the patient survive or not,
• Application accepted or not, etc.
• Logistic regression fits the data with a sigmoidal/logistic curve rather than a line
and outputs an approximation of the probability of the output given the input
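The sigmoidal curve mentioned above can be written in a few lines. This is a sketch only: the two features and the coefficients are hand-picked for illustration and were not fitted to any dataset.

```python
from math import exp

def sigmoid(z):
    """Logistic (sigmoid) function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + exp(-z))

def predict_probability(features, weights, bias):
    """Probability of the positive class for one patient."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

# Hypothetical, hand-picked coefficients for [glucose, BMI]; not fitted to data.
weights, bias = [0.04, 0.09], -8.0
patient = [150, 32.0]

probability = predict_probability(patient, weights, bias)
predicted_class = 1 if probability >= 0.5 else 0   # 0.5 is an assumed threshold
print(round(probability, 3), predicted_class)
```

The linear part works like linear regression; the sigmoid then squeezes the result into a probability, and a threshold turns that probability into a yes/no prediction.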
CART – Classification and Regression Tree
• Decision tree learning is one of the predictive modelling approaches.
• Tree models where the target variable can take a discrete set of values are called classification trees;
• Decision trees where the target variable can take continuous values are called regression trees.
• Decision trees are among the most popular machine learning algorithms given their intelligibility and simplicity.
Diabetes
• This dataset is originally from the NIH
• The objective of the dataset is to diagnostically predict whether or not a
patient has diabetes, based on certain diagnostic measurements
included in the dataset.
• In particular, all patients here are females at least 21 years old of Pima
Indian heritage.
• The dataset consists of several medical predictor variables and one target variable, Outcome. Predictor variables include the number of pregnancies the patient has had, their BMI, insulin level, age, etc.
1. BUsiness Requirement
2. Obtain the Data
3. Scrub the Data
4. Data Exploration
5. Data Modelling
6. INterpret & Deployment of the Model
Descriptive – what has happened? (as is)
• In my current dataset, 65% are non-diabetic and 35% are diabetic
• 45% of my dataset has 0-2 pregnancies and 55% has more than 2
• ~30% of my dataset has 0-1 pregnancies and 70% has more than 1
Diabetes
•Can you build a machine learning model to accurately predict whether or not the patients in the dataset have diabetes?
Kidney Disease
• The data was taken over a 2-month period in India with 25 features (e.g., red blood cell count, white blood cell count, etc.).
• The target is the 'classification', which is either 'ckd' or 'notckd'.
• ckd = chronic kidney disease.
• There are 400 rows.
Predicting the cancer diagnosis
• 30 features are used, examples:
• radius (mean of distances from center to points on the
perimeter)
• texture (standard deviation of gray-scale values)
• perimeter
• area
• smoothness (local variation in radius lengths)
• compactness (perimeter^2 / area - 1.0)
• concavity (severity of concave portions of the contour)
• concave points (number of concave portions of the contour)
• symmetry
• fractal dimension ("coastline approximation" - 1)
• Target Class Distribution: 212 Malignant, 357 Benign
Overfitting and Underfitting
Machine Learning
• Overfitting
• The model fits too much to the noise from the data.
• This often results in low error on training sets but high error on test/validation sets.
[Figures: the data; a good model; overfitting]
Underfitting
• Underfitting
• Model does not capture the underlying trend of the data and
does not fit the data well enough.
• Low variance but high bias.
• Underfitting is often a result of an excessively simple model
[Figures: the data; underfitting]
Machine Learning
• When thinking about overfitting and underfitting we want
to keep in mind the relationship of model performance on
the training set versus the test/validation set.
• This data was easy to visualize, but how can we see underfitting and overfitting when dealing with multi-dimensional data sets?
Evaluating Performance
CLASSIFICATION
Model Evaluation
• The key classification metrics we need to understand are:
• Accuracy
• Recall
• Precision
• F1-Score
Model Evaluation
• Typically in any classification task your model can only achieve two results:
• Either your model was correct in its prediction.
• Or your model was incorrect in its prediction.
Model Evaluation
• A test image from X_test is passed to the trained model, which outputs a prediction on the test image (for example, DOG).
• The prediction is compared to the correct label from y_test.
• DOG == DOG? The prediction is correct.
• DOG == CAT? The prediction is incorrect.
Model Evaluation - Confusion Matrix
Model Evaluation
• Accuracy - the number of correct predictions made by the model divided by the total number of predictions
• For example, if the X_test set was 100 images and our model correctly predicted 80 images, then we have 80/100.
• 0.8 or 80% accuracy

Model Evaluation
• Is Accuracy a good choice with unbalanced classes?
• Imagine we had 99 images of dogs and 1 image of a
cat.
• If our model was simply a line that always predicted
dog we would get 99% accuracy!
Recall
○ Ability of a model to find all the relevant cases within a dataset.
○ The precise definition of recall is:
Recall = (number of true positives) / (number of true positives + number of false negatives)

Precision
○ Ability of a classification model to identify only the relevant data points.
○ Precision is defined as:
Precision = (number of true positives) / (number of true positives + number of false positives)

F1-Score
● F1-Score is the harmonic mean of precision and recall taking
both metrics into account
○ harmonic mean instead of a simple average because it punishes
extreme values.
○ A classifier with a precision of 1.0 and a recall of 0.0 has a
simple average of 0.5 but an F1 score of 0.
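The point about the harmonic mean can be verified directly. A minimal sketch, with the convention (an assumption of this example) that F1 is taken as 0 when precision and recall are both 0:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; taken as 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

simple_average = (1.0 + 0.0) / 2     # the misleading simple average
f1_extreme = f1_score(1.0, 0.0)      # harmonic mean punishes the zero recall
f1_balanced = f1_score(0.8, 0.8)     # equal inputs give back the same value

print(simple_average, f1_extreme, round(f1_balanced, 3))
```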
Confusion Matrix
We can also view all correctly classified versus incorrectly classified images in the form of a confusion
matrix.
[Figure: Venn diagram of data science (DS) with the labels Machine Learning, Math & Statistics, Software, Research, and Domain Knowledge]
Accuracy calculation using Confusion Matrix
Ratio of the total number of correct predictions over total predictions
Accuracy = Total correct predictions / Total predictions
Accuracy = (TP + TN) / (TP + FP + TN + FN)

Confusion matrix (rows: actual, columns: prediction by model)
Actual \ Predicted | True | False
True | True Positive (TP) | False Negative (FN)
False | False Positive (FP) | True Negative (TN)
Accuracy – Use case: build a model for cancer prediction
• Accuracy = Total correct predictions / Total predictions
• Accuracy = (TP + TN) / (TP + FP + TN + FN)
Calculate the Accuracy score
Accuracy – Let's play with a dumb model
Now what is the Accuracy score?
Is the dumb model great? Or is the ML model we built worse than the dumb one?
No. Using accuracy alone to understand the performance of a model is not correct.
True Positive Rate (TPR)
TPR = Correctly predicted positives / Total actual positives
TPR = TP / (TP + FN)
TP + FN = Total actual positives
FP + TN = Total actual negatives
The higher the value, the better the model
TPR results are between zero and one
Calculate TPR for the dumb model
TPR = Correctly predicted positives (TP) / Total actual positives (TP + FN)
Now what is the Recall score?
Now is the dumb model good?
TPR = Recall
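The dumb-model exercise can be reproduced with a short script. The sketch below assumes a made-up test set of 100 patients with 5 cancer cases and a "dumb" model that predicts "no cancer" for everyone; the counts are illustrative, not the class dataset.

```python
def confusion_counts(actual, predicted, positive=1):
    """Return (TP, FP, TN, FN) for binary labels."""
    tp = sum(a == positive and p == positive for a, p in zip(actual, predicted))
    fp = sum(a != positive and p == positive for a, p in zip(actual, predicted))
    tn = sum(a != positive and p != positive for a, p in zip(actual, predicted))
    fn = sum(a == positive and p != positive for a, p in zip(actual, predicted))
    return tp, fp, tn, fn

def accuracy(tp, fp, tn, fn):
    return (tp + tn) / (tp + fp + tn + fn)

def recall(tp, fn):
    return tp / (tp + fn) if (tp + fn) else 0.0

# Hypothetical test set: 5 patients with cancer (1) and 95 without (0).
actual = [1] * 5 + [0] * 95
# The "dumb" model predicts "no cancer" for everyone.
dumb_predictions = [0] * 100

tp, fp, tn, fn = confusion_counts(actual, dumb_predictions)
dumb_accuracy = accuracy(tp, fp, tn, fn)
dumb_recall = recall(tp, fn)
print(dumb_accuracy, dumb_recall)
```

The dumb model scores 95% accuracy while finding none of the actual cancer cases, which is why recall (TPR) has to be read alongside accuracy on unbalanced data.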
Can you identify this scene?
Airport Metal detection investigation
You are in charge at the Airport security check, where you know that all film stars, politicians and high-profile people are passing through the gate.
Aim: “All” weapon carriers MUST be caught

Confusion matrix (rows: actual, columns: prediction by model)
Actual \ Predicted | Criminal | Innocent
Criminal | True Positive (TP) | False Negative (FN): I cannot make this mistake
Innocent | False Positive (FP): I can make this mistake, I hold anyone back for investigation (even innocent people) | True Negative (TN)
Total Positives = TP + FN
Total Negatives = TN + FP

Recall
Out of the total actual positives, how many have been predicted as positives
Recall = TP / (TP + FN)
The fewer the False Negatives, the better the model
An undetected weapon carrier will cost me my job
SSR Murder investigation
You went to the party knowing that all film stars and high-profile people are attending it.
Aim: Arrest “only” criminals

Confusion matrix (rows: actual, columns: prediction by model)
Actual \ Predicted | Criminal | Innocent
Criminal | True Positive (TP) | False Negative (FN): we end up making this mistake
Innocent | False Positive (FP): I cannot make this mistake | True Negative (TN)
Total Actual Positives = TP + FN
Total Actual Negatives = TN + FP

Precision
Avoiding False Positives is more important than avoiding False Negatives
We call this Precision: we cannot afford to have FP
Precision = Correctly predicted positives / Total predicted positives
Precision = TP / (TP + FP)
The higher the precision, the better the model
TP + FN = Total actual positives
FP + TN = Total actual negatives
TP + FP = Total predicted positives
FN + TN = Total predicted negatives
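The trade-off between the two scenarios can be put into numbers. The counts below are hypothetical: two screening policies applied to the same 1,000 people, 10 of whom are actual positives.

```python
def precision(tp, fp):
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp, fn):
    return tp / (tp + fn) if (tp + fn) else 0.0

# Hypothetical counts for two policies over the same 1,000 people,
# 10 of whom are actual positives.
strict = {"tp": 10, "fp": 190, "fn": 0}    # airport style: flag widely, miss nobody
cautious = {"tp": 6, "fp": 0, "fn": 4}     # investigation style: accuse only when sure

strict_scores = (precision(strict["tp"], strict["fp"]),
                 recall(strict["tp"], strict["fn"]))
cautious_scores = (precision(cautious["tp"], cautious["fp"]),
                   recall(cautious["tp"], cautious["fn"]))
print(strict_scores, cautious_scores)
```

The airport-style policy reaches full recall at the cost of very low precision, while the investigation-style policy reaches full precision but misses some actual positives. Which one is "better" depends on which mistake is more costly.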
Model Evaluation
● Still confused on the confusion matrix?
● No problem! Check out the Wikipedia page for it, it has a really good diagram with all the formulas for all the metrics.
● Let’s think back on this idea of:
○ What is a good enough accuracy?
● This all depends on the context of the situation!
● Did you create a model to predict presence of a disease?
● Is the disease presence well balanced in the general population? (Probably not!)
Evaluating Performance
REGRESSION
Evaluating Regression
• Let’s take a moment now to discuss evaluating Regression Models
• Regression is a task where a model attempts to predict continuous values (unlike categorical values, which is classification)
• Evaluation metrics like accuracy or recall aren’t useful for regression problems; we need metrics designed for continuous values!
• For example, attempting to predict the cost of treatment of a patient given their features is a regression task.
• Attempting to predict whether the patient is a smoker or not given their features would be a classification task.
• Let’s discuss some of the most common evaluation metrics
for regression:
• Mean Absolute Error
• Mean Squared Error
• Root Mean Square Error
• Mean Absolute Error (MAE)
• This is the mean of the absolute value of errors.
• Easy to understand
• MAE won’t punish large errors, however. We want our error metrics to account for these!
• Mean Squared Error (MSE)

• This is the mean of the squared errors.
• Larger errors are noted more than with MAE,
making MSE more popular.
• Root Mean Square Error (RMSE)
• This is the root of the mean of the squared errors.
• Most popular (has same units as y)
• Most common question from students:
• “Is this value of RMSE good?”
• Context is everything!
• A RMSE of $10 is fantastic for predicting the price of a
house, but horrible for predicting the price of a candy bar!
• Compare your error metric to the average value of the label
in your data set to try to get an intuition of its overall
performance.
• Domain knowledge also plays an important role here!
• Context of importance is also necessary to consider.
• We may create a model to predict how much medication to
give, in which case small fluctuations in RMSE may actually
be very significant.
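The difference between MAE, MSE and RMSE shows up clearly on a toy example. In the sketch below the treatment costs and the two sets of predictions are invented: both models have the same total absolute error, but model B concentrates it in one large miss.

```python
from math import sqrt

def mae(actual, predicted):
    """Mean Absolute Error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def mse(actual, predicted):
    """Mean Squared Error."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root Mean Square Error, in the same units as the target."""
    return sqrt(mse(actual, predicted))

# Hypothetical treatment costs and two sets of predictions.
actual = [100, 100, 100, 100]
model_a = [110, 110, 110, 110]   # four errors of 10
model_b = [100, 100, 100, 140]   # one error of 40

for name, preds in (("A", model_a), ("B", model_b)):
    print(name, mae(actual, preds), mse(actual, preds), rmse(actual, preds))
```

MAE cannot tell the two models apart, while MSE and RMSE penalize model B for its single large error.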
History of the present illness (HPI)
HPI element | Corresponding question | Example answer
Location | Where is the pain located? | The pain is left-sided and radiates to the left arm and back.
Quality | What does the pain feel like? | Patient reports a shooting, stabbing pain.
Severity | On a scale of 1-10, how bad is the pain? | Severity is 8/10.
Timing | Onset: When did the pain first start? Frequency: How often does the pain occur? Duration: How long are the pain episodes? | The current episode began half an hour ago. Episodes have occurred for a few months, following exercise, and for periods of up to 15-20 minutes.
Exacerbating Factors | What makes the pain worse? | Pain is exacerbated by exercise.
Alleviating Factors | What relieves the pain? | Pain is relieved by rest and weight loss.
Associated Symptoms | Do you notice any other symptoms when the pain is present? | Patient reports symptoms associated with dyspnea.
Heart Attack
https://www.youtube.com/watch?v=QllguanpKic
AI in Healthcare: Top A.I. Algorithms In Healthcare - The Medical Futurist
Image Analytics
MRI dataset of 5 stages of Alzheimer's disease from ADNI repository
https://www.kaggle.com/yasserhessein/dataset-alzheimer
Normal Distribution
• Normal distribution is characterised by a bell-shaped curve
• Mean is at the centre of the bell-shaped curve
• Mean = Median
• Percentage of values
• Mean ± 1 Std. Dev. contains c.68% of all values
• Mean ± 2 Std. Dev. contains c.95% of all values
• Mean ± 3 Std. Dev. contains c.99.7% of all values
• Mean ± 4.5 Std. Dev. contains 99.99932% of all values
Standard Deviation measures spread
[Figure: two bell curves centred on the same mean, one with a smaller and one with a larger Std. Dev.]
The Normal Curve
[Figure: normal curve over -4 to +4 standard deviations, with 68.27% of values within ±1, 95.45% within ±2 and 99.73% within ±3]
The empirical rule allows us to divide the normal distribution into predictable ranges based on standard deviation
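The percentages of the empirical rule follow from the normal distribution itself: the share of values within k standard deviations of the mean equals erf(k / sqrt(2)). A small sketch using only the standard library (the function name is illustrative):

```python
from math import erf, sqrt

def fraction_within(k):
    """Fraction of a normal distribution within k standard deviations of the mean."""
    return erf(k / sqrt(2))

coverage = {k: fraction_within(k) for k in (1, 2, 3, 4.5)}
for k, share in coverage.items():
    print(f"mean ± {k} std. dev.: {share:.5%}")
```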
Central Limit Theorem
Medical Camp
• MedCamp organizes health camps in several cities with low
work-life balance. They reach out to working people and ask
them to register for these health camps. For those
who attend, MedCamp provides them the facility to
undergo health checks or increase awareness by visiting
various stalls (depending on the format of the camp).
• MedCamp has conducted 65 such events over a period of 4
years and there is a high drop off between “Registration”
and the Number of people taking tests at the Camps.
• One of the huge costs in arranging these camps is the
amount of inventory you need to carry. If you carry more
than the required inventory, you incur unnecessarily high
costs.
• On the other hand, if you carry less than the required inventory for conducting these medical checks, people end up having a bad experience.
The Process
MedCamp employees/volunteers reach out to people and drive registrations. During the camp, people who “ShowUp” either undergo the medical tests or visit stalls, depending on the format of the health camp.
• Other things to note:
• Since this is a completely voluntary activity for the working professionals, MedCamp
usually has little profile information about these people.
• For a few camps, there was a hardware failure, so some information about the date
and time of registration is lost.
• MedCamp runs 3 formats of these camps.
• The first and second format provides people with an instantaneous health score.
• The third format provides information about several health issues through various
awareness stalls.
Favorable outcome
For the first 2 formats, a favorable outcome is defined as getting a health_score, while in the third format it is defined as visiting at least one stall.
You need to predict the chances (probability) of having a favorable outcome.
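The definition of a favorable outcome can be turned into a small labelling rule. This is only a sketch: the function and argument names are illustrative and do not have to match the column names in the competition files.

```python
def is_favorable(camp_format, health_score=None, stalls_visited=0):
    """Return True if an attendance record counts as a favorable outcome.

    camp_format is 1, 2 or 3. Formats 1 and 2 need a health score;
    format 3 needs at least one stall visit.
    """
    if camp_format in (1, 2):
        return health_score is not None
    if camp_format == 3:
        return stalls_visited >= 1
    raise ValueError(f"unknown camp format: {camp_format}")

print(is_favorable(1, health_score=0.4), is_favorable(3, stalls_visited=0))
```

A rule like this produces the 0/1 target column; the model is then trained to predict the probability of that target for each registration.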
Data Description
HealthCampDetail.csv - File containing HealthCampId, CampStartDate, CampEndDate and
Category details of each camp.
Train.csv- File containing registration details for all the test camps.
PatientProfile.csv -This file contains Patient profile details like PatientID, OnlineFollower, Social
media details, Income, Education, Age, FirstInteractionDate, CityType and Employer_Category
FirstHealthCampAttended.csv -This file contains details about people who attended health camp
of first format. This includes Donation (amount) & HealthScore of the person.
SecondHealthCampAttended.csv -This file contains details about people who attended health
camp of second format. This includes HealthScore of the person.
ThirdHealthCampAttended.csv - This file contains details about people who attended health
camp of third format. This includes Numberofstallvisited & LastStallVisited_Number.
• Evaluation Metric
• The evaluation metric for this hackathon is ROC-AUC Score.
• healthcare-analytics dataset
• healthcare-provider-fraud-detection-analysis
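The ROC-AUC score named above as the evaluation metric can be read as the probability that a randomly chosen positive case receives a higher predicted score than a randomly chosen negative one. The sketch below computes it by comparing every positive-negative pair; the labels and scores are made up, and for large datasets a library implementation would be the practical choice.

```python
def roc_auc(labels, scores):
    """ROC-AUC as the share of positive-negative pairs ranked correctly.

    A tie between a positive and a negative score counts as half.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("ROC-AUC needs at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))

# Hypothetical labels (1 = favorable outcome) and predicted probabilities.
labels = [1, 0, 1, 0, 1]
scores = [0.9, 0.3, 0.6, 0.7, 0.8]
auc = roc_auc(labels, scores)
print(round(auc, 3))
```

A score of 0.5 corresponds to random ranking and 1.0 to a perfect ranking of positives above negatives.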
Machine Learning Process
• Obtain: Get your data! Customers, Sensors, etc. (Data Acquisition)
• Scrub: Clean and format your data (Data Cleaning)
• Explore and build Model: split the data into Training Data and Test Data
• Model: Build Model based on the data (Model Training & Building)
• Evaluate: Check the performance of your model, Test & Score (Model Testing)
• Enhance the model / improve the model accuracy (Adjust Model Parameters)
• Deploy the model (Model Deployment)
Supervised Learning
• To fix this issue, data is often split into 3 sets

• Training Data
• Used to train model parameters
• Validation Data
• Used to determine what model hyperparameters to adjust
• Test Data
• Used to get some final performance metric
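A three-way split can be sketched as follows. The 70/15/15 proportions, the function name and the placeholder records are assumptions for this example; the slides do not prescribe specific proportions.

```python
import random

def three_way_split(rows, train_share=0.7, validation_share=0.15, seed=0):
    """Shuffle a copy of rows and split into train, validation and test lists.

    Whatever remains after the train and validation shares goes to the test set.
    """
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_share)
    n_val = round(len(shuffled) * validation_share)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])

rows = list(range(200))     # stand-in for 200 patient records
train_rows, val_rows, test_rows = three_way_split(rows)
print(len(train_rows), len(val_rows), len(test_rows))
```

The test set is touched only once, at the very end, so the final metric reflects data the model and its hyperparameters have never seen.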
