
HealthCare

Analytics
3 Ground Rules
• Participate (write in the chat window or unmute yourself)
• Keep yourself hydrated (have a water bottle next to you)
• Take action
Are you ready?
• Get your laptop/desktop
• Get paper & pen
• Close the door
• Keep your phone on silent (not even vibration)

Content
• Basics of healthcare analytics
• History of healthcare analytics
• Examples of healthcare analytics
• Introduction to Anaconda, Orange data mining tool
What is healthcare analytics?
The use of advanced computing technology to improve medical care.

The National Library of Medicine (NLM) defines healthcare analytics as a multidisciplinary domain of planning, developing, adopting, and applying information technology-related inventions in the delivery, administration, and development of healthcare facilities.
Examples of healthcare analytics applications:
• Aarogya Setu
• Vaccination
• Practo
• Curofy
• Amion
Healthcare Analytics
Healthcare analytics can be viewed as the intersection of three fields: healthcare, mathematics, and computer science (CS).
• Healthcare: healthcare delivery and policy; healthcare data (history & physical examination, H&P); clinical science (physiology, pathology)
• Mathematics: high-school mathematics; probability and statistics; linear algebra; calculus and optimization
• Computer science: artificial intelligence; databases (electronic medical records, EMR); programming languages; human-computer interaction
Healthcare Analytics Improves Medical Care
The effectiveness of medical care is measured using the "healthcare triple aim": improving outcomes, reducing costs, and ensuring quality.

Healthcare outcomes
Whenever we visit a doctor or a hospital, we yearn for better outcomes in our own lives. Some of the things we are concerned about:
• Accurate diagnosis
• Effective treatment
• No complications
• An overall improved quality of life
Analytics
• Descriptive analytics
• Predictive analytics
• Prescriptive analytics
Important Pillars of Health Care Analytics
Predicting future
diagnostic and treatment
events
By identifying high-risk patients, steps can be taken to hinder or delay the onset of
the disease or prevent it altogether.
Healthcare Imaging
2030 - Meet Sophie
Predicting Future Diagnostic and Treatment Events
First: what specific event (or disease) are we interested in predicting?
Second: what data will we use to make our predictions?
• Structured clinical data (EMR)
• Unstructured data: X-ray images, ECG, data recorded from devices
Third: what machine learning algorithm will we use?

Key Learning Dimensions on Software
• Obtaining adequate & relevant data in the right format
• Applying the roadmap and deriving business insights
• Mechanics of running the tool

Click on Orange

The application of
Analytics in
Healthcare
Let us understand the basics
Health Care Analytics
Statistics
Unstructured Vs Structured Data
Categories of Data
Data falls into two categories:
• Variable / Continuous: measurable; endlessly divisible
• Attribute / Discrete: countable; indivisible

Attribute/Discrete data are characteristics identified only by name, label, class, or category.
Examples: defective/not defective, approve/decline, branch of account, relationship manager name, over $100,000/under $100,000.
Data Types
Variable/Continuous data are characteristics that can take on any value on a continuum, enabling precise measurement of the distribution and interval between points. Infinitely divisible.
Examples: cycle time, money.
Mean - the average
We did an example using sugar levels for diabetes.

https://in.linkedin.com/in/dimplesanghvi
Median - the middle value
We did an example using the T-shirts we plan to distribute.
Mode - the most frequently occurring value
We did an example using the T-shirts we plan to distribute.
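The three measures above can be checked with Python's standard library. The sugar readings below are hypothetical, not the values used in class:

```python
from statistics import mean, median, mode

# Hypothetical fasting sugar readings (mg/dL) -- illustrative only
sugar = [92, 95, 95, 101, 110, 95, 130]

print(mean(sugar))    # arithmetic average
print(median(sugar))  # middle value after sorting
print(mode(sugar))    # most frequently occurring value
```

Note that the median and mode agree here (95) while the mean is pulled up by the single large reading, which is exactly why we look at all three.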
Graphs - one picture is worth a thousand words
Pie Chart
In the misleading pie chart, Item C appears to be at least as large as Item A, whereas in reality it is less than half of A's size.
How size can mislead

Which team has the worst performers?
(Figure: bar charts of Team A and Team B plotted on different y-axis scales; the mismatched axes make the teams impossible to compare fairly.)
Which team has the worst performers? Correct data
(Figure: the same values, 99.9, 99.6, 98.3, and 97.3 across four categories, plotted twice. With a truncated y-axis the differences look dramatic; on a full 0-100 axis they are clearly tiny.)
Explain:
• Descriptive
• Predictive
• Prescriptive
• What is probability?
• Types of events
• Types of probability:
  i. Marginal probability
  ii. Joint probability
  iii. Conditional probability
Simpson's Paradox
Let us understand Simpson's Paradox

Hospital A: 1000 admitted & treated, 900 survived
• 100 in critical condition: 30 survived
• 900 in non-critical condition: 870 survived

Hospital B: 1000 admitted & treated, 800 survived
• 400 in critical condition: 210 survived
• 600 in non-critical condition: 590 survived

Why did this happen?
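The hospital counts on this slide can be verified directly: Hospital A looks better in aggregate, yet Hospital B has the better survival rate in every subgroup. A minimal check in Python:

```python
# Survival counts from the slide: (survived, treated)
hospital_a = {"critical": (30, 100), "non-critical": (870, 900)}
hospital_b = {"critical": (210, 400), "non-critical": (590, 600)}

def rate(survived, treated):
    return survived / treated

# Aggregated rates: A appears better (0.90 vs 0.80)
overall_a = rate(*map(sum, zip(*hospital_a.values())))
overall_b = rate(*map(sum, zip(*hospital_b.values())))
print(overall_a, overall_b)

# Subgroup rates: B wins in both subgroups -- Simpson's paradox
for group in hospital_a:
    print(group, rate(*hospital_a[group]), rate(*hospital_b[group]))
```

The reversal happens because Hospital B treats a much larger share of critical patients, and critical condition (the lurking variable) dominates survival.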
Simpson's Paradox

Grouping/aggregating the data misleads: the aggregate can reverse what every subgroup shows.

You will find people talk about the overall score and not the individual marks.

Lurking variable
UK smoking case study
• Age is the lurking variable
Population
Medical task
Screening
An example: screening for cervical cancer; women are recommended to undergo this cost-effective test every 1-3 years throughout most of their lives.
Response to treatment
Healthcare Analytics
MBA(BA) 2020-22 Sem III Online Class, Day 3
Content
• Types of Machine Learning
• U-OSEMN
• Linear Regression
• Use Cases
• Examples of Healthcare Analytics
Let us watch some videos
Machine Learning

Machine learning splits into three types:
• Supervised Learning: regression, classification
• Unsupervised Learning: segmentation, clustering
• Reinforcement Learning
Supervised learning
• Training data includes desired outputs

Unsupervised learning
• Training data does not include desired outputs
Inductive Learning

Given examples of a function y = f(X), predict f(X) for new examples X.

• If y is discrete, f(X) is classification
• If y is continuous, f(X) is regression
• If f(X) = Probability(X), it is probability estimation


Data Life Cycle- U-OSEMN

1. BUsiness Requirement
2. Obtain the Data 
3. Scrub the Data 
4. Data Exploration
5. Data Modelling
6. INterpret & Deployment of the Model
Data Life Cycle

1. Business Requirement: understand the business problem; identify the objective; define the variable to be predicted
2. Obtain the Data
3. Scrub the Data
4. Data Exploration
5. Data Modelling
6. Interpret & Deployment of the Model
Data Life Cycle

1. Business Requirement
2. Obtain the Data: What data do we need for this project? What are the data sources? How can I obtain the data? What is the most efficient way to store and access it?
3. Scrub the Data
4. Data Exploration
5. Data Modelling
6. Interpret & Deployment of the Model
Data Life Cycle

1. Business Requirement
2. Obtain the Data
3. Scrub the Data: transform the data into the desired format; data cleaning (missing values, corrupted data); remove unnecessary data; understand cleaning and its significance; combine/group data
4. Data Exploration
5. Data Modelling
6. Interpret & Deployment of the Model
Data Life Cycle

1. Business Requirement
2. Obtain the Data
3. Scrub the Data
4. Data Exploration: understand the patterns in the data; derive useful insights; form hypotheses
5. Data Modelling
6. Interpret & Deployment of the Model
Data Life Cycle

1. Business Requirement
2. Obtain the Data
3. Scrub the Data
4. Data Exploration
5. Data Modelling: determine the optimal data features for ML models; create a model that predicts the target most accurately; evaluate & test the efficiency of the model
6. Interpret & Deployment of the Model


Data Life Cycle

1. Business Requirement
2. Obtain the Data
3. Scrub the Data
4. Data Exploration
5. Data Modelling
6. Interpret & Deployment of the Model: deploy the model in a pre-production/test environment; monitor its performance

Note: remember to split the base data.
Simple Linear Regression
• one independent variable and one dependent variable
Multiple Linear Regression
• multiple independent variables and one dependent variable
Straight Line Equation
Positive or negative slope?

The correlation between two random variables, X and Y, is a measure of the degree of linear association between the two variables.
The correlation coefficient "r" can take on any value from -1 to 1.
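To see r in action, here is a small pure-Python computation of the Pearson correlation coefficient. The data points are invented for illustration; since y rises roughly linearly with x, r should land close to +1:

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical, nearly linear data
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 8.0, 9.9]
r = pearson_r(x, y)
print(r)  # close to +1: strong positive linear association
```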
What type of correlation? What can you interpret?

Predict the size of the head of a baby
A linear regression problem


Data Life Cycle- U-OSEMN

1. Business Requirement
2. Obtain the Data 
3. Scrub the Data 
4. Data Exploration
5. Data Modelling
6. Interpret & Deployment of the Model
Business Requirement
• The hospital has data on children in the age group 0-2 years
• The research team has captured a few features
• Can you understand what story the data is telling?
• The current head size is 36 cm; can we predict it based on other parameters?
• How significant is the model?
• Are we missing any information?
Given the dataset, let us predict
Distribution
• Head size by gender
• What can you interpret?
What do you analyse?
• % split: 70/30
• Stratification
• Bootstrapping
• Cross-validation
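A stratified 70/30 split can be sketched in pure Python. The labels and dataset size below are invented for illustration; the point is that each class keeps the same proportion in train and test:

```python
import random

def stratified_split(rows, label_of, test_frac=0.3, seed=0):
    """70/30 split that preserves class proportions (stratification)."""
    rng = random.Random(seed)
    by_class = {}
    for row in rows:
        by_class.setdefault(label_of(row), []).append(row)
    train, test = [], []
    for members in by_class.values():
        rng.shuffle(members)          # randomise within each class
        cut = int(len(members) * test_frac)
        test.extend(members[:cut])    # 30% of each class to test
        train.extend(members[cut:])   # 70% of each class to train
    return train, test

# Hypothetical dataset: 90 patients, one third labelled "diabetic"
data = [(i, "diabetic" if i % 3 == 0 else "healthy") for i in range(90)]
train, test = stratified_split(data, label_of=lambda r: r[1])
```

In practice a library helper (e.g. a stratified split option in your ML tool) does the same job; the sketch just shows what "stratification" means.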
Cross Validation
Build the model

Goodness of fit: R-Square
• The R-Square value is a statistical measure of how close the data are to the fitted regression line
• It is also known as the coefficient of determination
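R-Square can be computed by hand from its definition, R² = 1 - SS_res / SS_tot. The head-circumference numbers below are hypothetical, not the class dataset:

```python
# Hypothetical data: age in months vs. head circumference in cm
x = [0, 3, 6, 9, 12, 18, 24]
y = [35, 40, 43, 45, 46, 47.5, 48.5]

# Fit a simple least-squares line y = b0 + b1 * x
n = len(x)
mx, my = sum(x) / n, sum(y) / n
b1 = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / \
     sum((xi - mx) ** 2 for xi in x)
b0 = my - b1 * mx

pred = [b0 + b1 * xi for xi in x]
ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, pred))  # residual sum of squares
ss_tot = sum((yi - my) ** 2 for yi in y)                 # total sum of squares
r2 = 1 - ss_res / ss_tot
print(r2)  # fraction of variance explained by the line
```

Because head growth flattens with age, the straight line explains most, but not all, of the variance here.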
Predict the Cost of Treatment
Data Life Cycle- U-OSEMN

1. Business Requirement
2. Obtain the Data 
3. Scrub the Data 
4. Data Exploration
5. Data Modelling
6. Interpret & Deployment of the Model
The dataset
• Features: charges, age, number of children, BMI, gender
• Can you explain? Also try the other columns/features
Charge of treatment by gender
Charge of treatment: smoker / non-smoker
Scatter Plot
Which model is better on the training dataset?
Which model is better?
Healthcare Analytics
MBA(BA) 2020-22 Sem III Online Class
Logistic Regression

Content
• Logistic Regression
• Confusion Matrix
• Evaluation of a classification model: Accuracy, Precision, Recall
Logistic Regression
What happens when the prediction should be categorical?
• Does the person have the disease or not
• Will the patient survive or not
• Application accepted or not, etc.
Logistic Regression
• One commonly used algorithm is logistic regression
• It assumes the dependent (output) variable is binary: predict the output for categorical/binary data
• Does the person have the disease or not; will the patient survive or not; application accepted or not, etc.
• Logistic regression fits the data with a sigmoidal/logistic curve rather than a line and outputs an approximation of the probability of the output given the input
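The sigmoid curve that logistic regression fits can be sketched in a few lines. The coefficients below are invented for illustration, not fitted to any real dataset:

```python
import math

def sigmoid(z):
    """Maps any real number into a probability between 0 and 1."""
    return 1 / (1 + math.exp(-z))

# Hypothetical fitted coefficients: intercept and weight per mg/dL of glucose
b0, b1 = -6.0, 0.05

def p_disease(glucose):
    """Approximate probability of disease given a glucose reading."""
    return sigmoid(b0 + b1 * glucose)

print(p_disease(100))  # low glucose -> probability below 0.5
print(p_disease(160))  # high glucose -> probability above 0.5
```

Thresholding the probability (commonly at 0.5) converts the continuous output into the binary "disease / no disease" prediction.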
CART - Classification and Regression Trees
• Decision tree learning is one of the predictive modelling approaches.
• Tree models where the target variable can take a discrete set of values are called classification trees.
• Decision trees where the target variable can take continuous values are called regression trees.
• Decision trees are among the most popular machine learning algorithms given their intelligibility and simplicity.
Diabetes
• This dataset is originally from the NIH.
• The objective is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset.
• In particular, all patients here are females at least 21 years old of Pima Indian heritage.
• The dataset consists of several medical predictor variables and one target variable, Outcome. Predictor variables include the number of pregnancies the patient has had, BMI, insulin level, age, etc.
Data Life Cycle- U-OSEMN

1. BUsiness Requirement
2. Obtain the Data 
3. Scrub the Data 
4. Data Exploration
5. Data Modelling
6. INterpret & Deployment of the Model
Descriptive - what has happened? (as is)
• In my current dataset, 65% are non-diabetic and 35% are diabetic
• 45% of the dataset has 0-2 pregnancies; 55% has more than 2
• ~30% of the dataset has 0-1 pregnancies; ~70% has more than 1
Diabetes
• Can you build a machine learning model to accurately predict whether or not the patients in the dataset have diabetes?
Kidney Disease
• The data was taken over a 2-month period in India, with 25 features (e.g., red blood cell count, white blood cell count).
• The target is 'classification', which is either 'ckd' or 'notckd' (ckd = chronic kidney disease).
• There are 400 rows.
Predicting the cancer diagnosis
• 30 features are used, examples:
•   radius (mean of distances from center to points on the
perimeter)
•   texture (standard deviation of gray-scale values)
•   perimeter
•   area
•   smoothness (local variation in radius lengths)
•   compactness (perimeter^2 / area - 1.0)
•   concavity (severity of concave portions of the contour)
•   concave points (number of concave portions of the contour)
•   symmetry 
•   fractal dimension ("coastline approximation" - 1)
• Target Class Distribution: 212 Malignant, 357 Benign
Overfitting and Underfitting
• Overfitting
• The model fits too much to the noise in the data.
• This often results in low error on training sets but high error on test/validation sets.
(Figures: the same scatter of data points fitted by a good model and by progressively overfit models)
Underfitting
• The model does not capture the underlying trend of the data and does not fit the data well enough.
• Low variance but high bias.
• Underfitting is often the result of an excessively simple model.
(Figures: the same scatter of data points fitted by an underfit model)
Machine Learning
• When thinking about overfitting and underfitting, we want to keep in mind the relationship of model performance on the training set versus the test/validation set.
• This data was easy to visualize, but how can we see underfitting and overfitting when dealing with multi-dimensional datasets?
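One answer: compare training error and test error numerically. A minimal sketch with invented data, fitting a low-degree and a high-degree polynomial to points generated from a linear relationship plus noise:

```python
import numpy as np

rng = np.random.default_rng(42)
x_train = np.linspace(0, 1, 15)
y_train = 2 * x_train + rng.normal(0, 0.2, size=x_train.size)  # linear truth + noise
x_test = np.linspace(0, 1, 50)
y_test = 2 * x_test                    # noise-free truth for scoring

def fit_and_score(degree, xs, ys):
    """Fit a polynomial of `degree` on the training set, return MSE on (xs, ys)."""
    coeffs = np.polyfit(x_train, y_train, degree)
    return float(np.mean((np.polyval(coeffs, xs) - ys) ** 2))

train_lo, train_hi = fit_and_score(1, x_train, y_train), fit_and_score(9, x_train, y_train)
test_lo, test_hi = fit_and_score(1, x_test, y_test), fit_and_score(9, x_test, y_test)
print(train_lo, train_hi)  # the high-degree model fits training data at least as well
print(test_lo, test_hi)    # ...but typically generalises worse: overfitting
```

The degree-9 model's training error can never exceed the degree-1 model's (its hypothesis space contains the line), so a low training error alone tells you nothing about generalisation.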
Evaluating
Performance
CLASSIFICATION
Model Evaluation
• The key classification metrics we need to understand are:
• Accuracy
• Recall
• Precision
• F1-Score
Model Evaluation
• Typically, in any classification task your model can only achieve two results:
• Either your model was correct in its prediction,
• or your model was incorrect in its prediction.
Model Evaluation
A test image from X_test is fed to the TRAINED MODEL, which outputs a prediction ("DOG"). We compare the prediction to the correct label from y_test: DOG == DOG? The prediction is correct.
Model Evaluation
If the model instead predicts "CAT" while the correct label from y_test is DOG (DOG == CAT?), the prediction is incorrect.
Model Evaluation - Confusion Matrix
Model Evaluation
• Accuracy: the proportion of correct predictions made by the model.
• For example, if the X_test set was 100 images and our model correctly predicted 80 images, then we have 80/100.
• 0.8, or 80% accuracy.
Model Evaluation
• Is accuracy a good choice with unbalanced classes?
• Imagine we had 99 images of dogs and 1 image of a cat.
• If our model simply always predicted "dog", we would get 99% accuracy!
Recall
○ The ability of a model to find all the relevant cases within a dataset.
○ The precise definition of recall is:
   recall = true positives / (true positives + false negatives)
Precision
○ The ability of a classification model to identify only the relevant data points.
○ Precision is defined as:
   precision = true positives / (true positives + false positives)
F1-Score
● The F1-Score is the harmonic mean of precision and recall, taking both metrics into account.
○ We use the harmonic mean instead of a simple average because it punishes extreme values.
○ A classifier with a precision of 1.0 and a recall of 0.0 has a simple average of 0.5 but an F1 score of 0.
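The slide's extreme-value example can be verified directly from the harmonic-mean formula F1 = 2PR / (P + R):

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# The slide's example: precision 1.0, recall 0.0
print((1.0 + 0.0) / 2)     # simple average: 0.5
print(f1_score(1.0, 0.0))  # harmonic mean: 0.0 -- the extreme value is punished
```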
Confusion Matrix
We can also view all correctly classified versus incorrectly classified images in the form of a confusion matrix.

(Figure: data science Venn diagram of Machine Learning, Math & Statistics, Software, Research, and Domain Knowledge)
Accuracy calculation using the Confusion Matrix

The ratio of the total number of correct predictions over the total predictions:

Accuracy = total correct predictions / total predictions
         = (TP + TN) / (TP + FP + TN + FN)

Confusion matrix (rows: actual; columns: predicted by the model):
• Actual True:  True Positive (TP)  | False Negative (FN)
• Actual False: False Positive (FP) | True Negative (TN)
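The accuracy formula, and its weakness on unbalanced data, can be checked with hypothetical screening counts (invented for illustration: 1000 patients, 100 with the disease):

```python
# Hypothetical screening results
TP, FN = 80, 20      # diseased patients caught / missed
FP, TN = 30, 870     # healthy patients flagged / cleared

accuracy  = (TP + TN) / (TP + FP + TN + FN)   # 0.95
recall    = TP / (TP + FN)                    # 0.80: share of diseased caught
precision = TP / (TP + FP)                    # ~0.73: share of flags that were right

# A "dumb" model that always predicts healthy on the same 1000 patients:
dumb_accuracy = 900 / 1000                    # 0.90 -- still looks decent
dumb_recall   = 0 / (0 + 100)                 # 0.0  -- catches no one
print(accuracy, recall, precision, dumb_accuracy, dumb_recall)
```

This is the point of the "dumb model" slides that follow: accuracy alone hides the fact that the dumb model detects zero cases.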
Accuracy - uses
Build a model for cancer prediction.

Accuracy = total correct predictions / total predictions = (TP + TN) / (TP + FP + TN + FN)

Calculate the accuracy score.

Accuracy - let's play with a dumb model
Now what is the accuracy score?
Is the dumb model great? Or is the ML model we built worse than the dumb one?
No... using accuracy alone to understand the performance of a model is not correct.
True Positive Rate (TPR)

TPR = actual positives correctly predicted / total actual positives
    = TP / (TP + FN)

TP + FN = total actual positives; FP + TN = total actual negatives.

TPR values lie between zero and one; the higher the value, the better the model.
Calculate the TPR for the dumb model
TPR = actual positives correctly predicted (TP) / total actual positives (TP + FN)

Now what is the recall score? Is the dumb model still good?

TPR = Recall
Can you identify this scene?
Airport metal detection investigation

Confusion matrix (actual vs. predicted): Criminal / Innocent
• Actual criminal: True Positive (TP) | False Negative (FN) ("I cannot make this mistake"). Total positives = TP + FN
• Actual innocent: False Positive (FP) | True Negative (TN) ("I can hold anyone back for investigation"). Total negatives = TN + FP

You are in charge at the airport security check, where you know that film stars, politicians, and other high-profile people are passing through the gate.

Aim: ALL weapon carriers MUST be caught.

Airport metal detection investigation: Recall
• Actual criminal predicted innocent (FN): I cannot make this mistake.
• Actual innocent predicted criminal (FP): I can make this mistake; I can hold anyone back for investigation (even innocent people).
Recall
Out of the total actual positives, how many have been predicted as positives?

Recall = TP / (TP + FN)

Minimizing false negatives makes the model better: an undetected weapon carrier will cost my job.
SSR murder investigation

Confusion matrix (actual vs. predicted): Criminal / Innocent
• Actual criminal: True Positive (TP) | False Negative (FN). Total positives = TP + FN
• Actual innocent: False Positive (FP) | True Negative (TN). Total negatives = TN + FP

You went to the party knowing that film stars and high-profile people are attending it.

Aim: arrest ONLY criminals.
Precision
SSR murder investigation
• Actual criminal predicted criminal (TP): correct.
• Actual criminal predicted innocent (FN): we end up making this mistake.
• Actual innocent predicted criminal (FP): I cannot make this mistake.

Avoiding a false positive is more important than encountering a false negative.
We call this precision: we cannot afford to have FPs.
Precision

Precision = predicted positives that are correct / total predicted positives
          = TP / (TP + FP)

The higher the precision, the better the model.

TP + FN = total actual positives; FP + TN = total actual negatives.
TP + FP = total predicted positives; FN + TN = total predicted negatives.
Model Evaluation

● Still confused about the confusion matrix?
● No problem! Check out the Wikipedia page for it; it has a really good diagram with all the formulas for all the metrics.
● Let's think back on this idea: what is a good enough accuracy?
● This all depends on the context of the situation!
● Did you create a model to predict the presence of a disease?
● Is the disease presence well balanced in the general population? (Probably not!)
Evaluating Performance
REGRESSION
Evaluating Regression
• Let's take a moment now to discuss evaluating regression models.
• Regression is a task where a model attempts to predict continuous values (unlike categorical values, which is classification).
• Evaluation metrics like accuracy or recall aren't useful for regression problems; we need metrics designed for continuous values!
Evaluating Regression
• For example, attempting to predict the cost of treatment of a patient given its features is a regression task.
• Attempting to predict whether the patient is a smoker or not given its features would be a classification task.
Evaluating Regression
• Let’s discuss some of the most common evaluation metrics
for regression:
• Mean Absolute Error
• Mean Squared Error
• Root Mean Square Error
Evaluating Regression
• Mean Absolute Error (MAE)
• This is the mean of the absolute value of the errors.
• Easy to understand.
• MAE won't punish large errors, however; we want our error metrics to account for these!
Evaluating Regression

• Mean Squared Error (MSE)


• This is the mean of the squared errors.
• Larger errors are noted more than with MAE,
making MSE more popular.
Evaluating Regression
• Root Mean Square Error (RMSE)
• This is the root of the  mean of the squared errors.
• Most popular (has same units as y)
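The three metrics can be compared on a tiny invented example. Note how the single large error (4) pushes RMSE well above MAE:

```python
import math

# Hypothetical predictions vs. actual treatment costs (in thousands)
y_true = [10, 12, 14, 16]
y_pred = [11, 11, 14, 20]
errors = [p - t for p, t in zip(y_pred, y_true)]   # 1, -1, 0, 4

mae  = sum(abs(e) for e in errors) / len(errors)   # 1.5
mse  = sum(e * e for e in errors) / len(errors)    # 4.5 (squaring punishes the 4)
rmse = math.sqrt(mse)                              # back in the same units as y
print(mae, mse, rmse)
```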
Evaluating Regression
• Most common question from students:
• “Is this value of RMSE good?”
• Context is everything!
• A RMSE of $10 is fantastic for predicting the price of a
house, but horrible for predicting the price of a candy bar!
Evaluating Regression
• Compare your error metric to the average value of the label
in your data set to try to get an intuition of its overall
performance.
• Domain knowledge also plays an important role here!
• Context of importance is also necessary to consider.
• We may create a model to predict how much medication to
give, in which case small fluctuations in RMSE may actually
be very significant.
History of the present illness (HPI)

HPI element | Corresponding question | Example answer
Location | Where is the pain located? | The pain is left-sided and radiates to the left arm and back.
Quality | What does the pain feel like? | Patient reports a shooting, stabbing pain.
Severity | On a scale of 1-10, how bad is the pain? | Severity is 8/10.
Timing | Onset: When did the pain first start? Frequency: How often does the pain occur? Duration: How long are the pain episodes? | The current episode began half an hour ago. Episodes have occurred for a few months, following exercise, and for periods of up to 15-20 minutes.
Exacerbating Factors | What makes the pain worse? | Pain is exacerbated by exercise.
Alleviating Factors | What relieves the pain? | Pain is relieved by rest and weight loss.
Associated Symptoms | Do you notice any other symptoms when the pain is present? | Patient reports symptoms associated with dyspnea.
Heart Attack
https://www.youtube.com/watch?v=QllguanpKic

AI in Healthcare: Top A.I. Algorithms In Healthcare - The Medical Futurist
Image Analytics
MRI dataset of 5 stages of Alzheimer's disease from ADNI repository

https://www.kaggle.com/yasserhessein/dataset-alzheimer
Normal Distribution
• The normal distribution is characterised by a bell-shaped curve
• The mean is at the centre of the bell-shaped curve
• Mean = Median
• Percentage of values:
• Mean ± 1 Std. Dev. contains c. 68% of all values
• Mean ± 2 Std. Dev. contains c. 95% of all values
• Mean ± 3 Std. Dev. contains c. 99.7% of all values
• Mean ± 4.5 Std. Dev. contains 99.99932% of all values
(Figure: bell curves with smaller and larger standard deviations around the mean)
Standard deviation measures spread
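The empirical-rule percentages above can be checked by simulation; the sample size and random seed below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.normal(loc=0.0, scale=1.0, size=100_000)  # standard normal samples

def frac_within(k):
    """Fraction of samples within k standard deviations of the mean."""
    return float(np.mean(np.abs(z) <= k))

print(frac_within(1))  # ~0.683
print(frac_within(2))  # ~0.954
print(frac_within(3))  # ~0.997
```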
The Normal Curve
(Figure: standard normal curve from -4 to 4 standard deviations, marking 68.26% within ±1, 95.46% within ±2, and 99.73% within ±3)

The empirical rule allows us to divide the normal distribution into predictable ranges based on standard deviation.
Central Limit Theorem
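The Central Limit Theorem can also be seen by simulation: means of samples drawn from a distinctly non-normal distribution (uniform) cluster into a bell shape around the true mean, with spread sigma / sqrt(n). The sample counts and seed below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(1)
# 20,000 samples of size 30 from Uniform(0, 1): clearly not bell-shaped itself
samples = rng.uniform(0.0, 1.0, size=(20_000, 30))
sample_means = samples.mean(axis=1)

# CLT: the means centre on mu = 0.5 with spread sigma / sqrt(n)
expected_sd = (1 / 12) ** 0.5 / 30 ** 0.5   # Uniform(0,1) has sd sqrt(1/12)
print(sample_means.mean(), sample_means.std(), expected_sd)
```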
Medical Camp
• MedCamp organizes health camps in several cities with low work-life balance. They reach out to working people and ask them to register for these health camps. For those who attend, MedCamp provides the facility to undergo health checks or increase awareness by visiting various stalls (depending on the format of the camp).
• MedCamp has conducted 65 such events over a period of 4 years, and there is a high drop-off between "Registration" and the number of people taking tests at the camps.
• One of the huge costs in arranging these camps is the amount of inventory you need to carry. If you carry more than the required inventory, you incur unnecessarily high costs.
• On the other hand, if you carry less than the required inventory for conducting these medical checks, people end up having a bad experience.
The Process
MedCamp employees/volunteers reach out to people and drive registrations. During the camp, people who show up either undergo the medical tests or visit stalls, depending on the format of the health camp.
• Other things to note:
• Since this is a completely voluntary activity for the working professionals, MedCamp usually has little profile information about these people.
• For a few camps, there was a hardware failure, so some information about the date and time of registration is lost.
• MedCamp runs 3 formats of these camps.
• The first and second formats provide people with an instantaneous health score.
• The third format provides information about several health issues through various awareness stalls.
Favorable outcome

For the first 2 formats, a favorable outcome is defined as getting a health_score; in the third format it is defined as visiting at least one stall.

You need to predict the chances (probability) of having a favorable outcome.
Data Description
• HealthCampDetail.csv: HealthCampId, CampStartDate, CampEndDate and Category details of each camp.
• Train.csv: registration details for all the test camps.
• PatientProfile.csv: patient profile details like PatientID, OnlineFollower, social media details, Income, Education, Age, FirstInteractionDate, CityType and Employer_Category.
• FirstHealthCampAttended.csv: details about people who attended a health camp of the first format, including Donation (amount) & HealthScore of the person.
• SecondHealthCampAttended.csv: details about people who attended a health camp of the second format, including HealthScore of the person.
• ThirdHealthCampAttended.csv: details about people who attended a health camp of the third format, including Numberofstallvisited & LastStallVisited_Number.
• Evaluation Metric
• The evaluation metric for this hackathon is the ROC-AUC score.
• healthcare-analytics dataset

• healthcare-provider-fraud-detection-analysis
Machine Learning Process
The pipeline grows one stage at a time:
Data Acquisition → Data Cleaning → (Test Data / Training Data split) → Model Training & Building → Model Testing → Adjust Model Parameters → Model Deployment

• Obtain: get your data! Customers, sensors, etc.
• Scrub: clean and format your data
• Explore and build the model
• Model: build the model based on the data
• Evaluate: check the performance of your model (Test & Score)
• Enhance: improve the model accuracy by adjusting model parameters
• Deploy the model
Supervised Learning

• To avoid tuning to the test set, data is often split into 3 sets:
• Training data: used to train model parameters
• Validation data: used to determine which model hyperparameters to adjust
• Test data: used to get a final performance metric
