Thesis Presentation

Thesis Title:
A Machine Learning Based Approach for Predicting Patient’s Admission Statistics in Local Hospitals Using Electronic Health Records
Presented by: Shaaf Amjad

19-MSc-EM/PT-01
Supervisor:
Dr. Saifullah
Associate Professor
BACKGROUND
• Electronic Health Records (EHR) are widely used in hospitals and are becoming
a valuable source of analytical data for improving health care services and
hospital administrative functions
• The influx of patients visiting a hospital is an important factor in keeping
its operations running smoothly
• Machine Learning uses its predictive capability to assess the dynamics of
these databases and identify key relationships between the features and
variables involved.
Abstract
This research paper highlights the significance of managing and optimizing hospital operations and resource allocation
concerning the influx of ordinary case patients. While existing scholarly investigations have predominantly concentrated
on aspects like readmissions and lengths of stays across diverse hospital departments, the attention devoted to ordinary
case patients remains limited. In this study, we employ the capabilities of machine learning (ML) methodologies and
apply them to electronic health records (EHR) extracted from a web-based open-source database. The primary objective is
to develop a predictive model for estimating the incoming volume of ordinary case patients during specific months and
years. To achieve this, we undertake a comprehensive evaluation of three prominent ML algorithms: Linear Regression
(LR), Support Vector Regression (SVR), and Random Forest (RF). The assessment of prediction accuracy is based on
the computation of normalized MAE and RMSE values. The empirical findings demonstrate that the Random Forest
algorithm exhibits the highest degree of accuracy in predicting patient influx, closely trailed by Linear Regression, while
Support Vector Regression lags. The analytical underpinning for Random Forest's superior performance lies in its
adeptness at capturing intricate non-linear relationships inherent within the dataset.
AIMS and OBJECTIVES
• To acquire quality electronic health records, either from local hospitals or
from an open database, for model development
• To predict the visit count of ordinary case patients, i.e. patients whose
conditions are not chronic
• To build models based on different Machine Learning algorithms, compare them,
and select the best-performing model
• To provide a working model that assists in forecasting and efficient planning
of hospital operations
RESEARCH METHODOLOGY
ELECTRONIC HEALTH RECORD
• Electronic Health Record (EHR): an electronic version of a patient’s medical
history that is maintained by the provider over time and may include all of the
key administrative and clinical data relevant to that person’s care under a
particular provider, including demographics, progress notes, problems,
medications, vital signs, past medical history, immunizations, laboratory data
and radiology reports.
EHR SYSTEM
MACHINE LEARNING
• Machine learning techniques are computational methods that learn
patterns or classifications within data without being explicitly
programmed to do so.
• Machine Learning relies on different algorithms to solve data
problems.
• The kind of algorithm employed depends on the kind of problem to be solved,
the number of variables, the kind of model that suits it best, and so on.
MACHINE LEARNING
• SUPERVISED
• relies on labelled input and output training data
• UNSUPERVISED
• relies on unlabeled or raw data.
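The difference between the two settings can be shown with a minimal Scikit-Learn sketch. The toy numbers below are hypothetical and are not taken from the thesis dataset; `LinearRegression` and `KMeans` are only convenient representatives of each family.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

# Hypothetical toy data: one input feature and one known output per row.
X = np.array([[1], [2], [3], [10], [11], [12]], dtype=float)
y = np.array([100.0, 110.0, 120.0, 190.0, 200.0, 210.0])

# Supervised: the model is fitted on the inputs X together with the labels y.
supervised = LinearRegression().fit(X, y)
predicted = supervised.predict(np.array([[6.0]]))

# Unsupervised: only X is given; the model groups the rows on its own.
unsupervised = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
clusters = unsupervised.labels_
```

The regression task in this work is supervised: each month (input) comes with a recorded visit count (label).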
MACHINE LEARNING
LITERATURE WORK
1. ML-based model predictions for clinical deterioration in patients
2. ML-based risk of readmissions in hospitals
3. Prediction of in-hospital length of stay among cardiac patients
4. Prediction of mortality rate in influenza patients
5. Applications of AI in ophthalmology using EHR
6. Model for acute admissions in the aged population
7. Prediction of ICU patient readmission
8. Intelligent health risk prediction systems
RESEARCH GAPS
• Most of the studies focused on health care and on the individual health
characteristics of patients.
• None linked the influx rate of patients to the efficient and seamless
operation of hospitals.
• The frequency of patient visits is crucial for forecasting demand and for
managing resources effectively to ensure smooth operation.
• Ordinary case patients with minor consultations are not considered by any
study; the focus has been on chronic patients.
• Ordinary case patients play a major part in determining the daily patient
visit count.
ML MODEL WORKING FRAMEWORK
[Diagram: model working framework; surviving label: MAE]
MACHINE LEARNING MODEL DESIGN
• DATASET
• The main objective of this project is to acquire EHR data that will serve as
the source database for ML model design.
• The quality and structure of a dataset are pivotal determinants of the
efficacy of machine learning models
• Machine learning models require a clean, structured, sorted and large
dataset to achieve the required accuracy
• Given these requirements, we opted for an open, web-sourced EHR dataset,
details of which are in the next slides
ML MODEL DESIGN
• DATASET FEATURES
• NHS DIGITAL UK open-source dataset
• NHS UK permits use of its recorded medical data under its “supporting open
data and transparency” policy
• Data was acquired from its statistical record archive
• File name: Hospital Episodes Statistics
• Format of dataset: Excel, tabular format
• Timescale of record: from April 2007 to September 2022
• Feature: number of patient visits, by nature of treatment and department of
admission, across all hospitals and health centers in the UK registered under NHS
Dataset Analysis
Sr. No   Classes of variables identified from datasheet   No. of patient visits (millions)
1        APC_Finished_Consultant                          283.87
2        APC_FCEs_with_a_procedure                        167.27
3        APC_Ordinary_Episodes                            186.77
4        APC_Day_Case_Episodes                            97.09
5        APC_Day_Case_Episodes_with_proc                  91.439
6        APC_Finished_Admission_Episodes                  239.04
ML MODEL DESIGN: Phase 2
• DATASET-PREPROCESSING
• Cleaning and sorting of data
• Data Splitting: Dataset is divided into two parts
• Training Dataset
• Testing Dataset
• In our project, the dataset is split 80:20 by year
• Training dataset: 80%; year 2007 to year 2020
• Testing dataset: 20%; year 2021 to year 2022
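A year-based split like the one above can be sketched as follows. This is a minimal illustration, not the thesis code: the `visits` values are synthetic, and the column names `date` and `visits` are hypothetical.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the monthly series (April 2007 to September 2022);
# the real visit counts are not reproduced here.
months = pd.date_range("2007-04-01", "2022-09-01", freq="MS")
rng = np.random.default_rng(0)
data = pd.DataFrame({
    "date": months,
    "visits": rng.integers(800_000, 1_000_000, size=len(months)),
})

# Chronological split: rows are cut by year and never shuffled, so the
# model is always tested on months that come after its training period.
train = data[data["date"].dt.year <= 2020]
test = data[data["date"].dt.year >= 2021]
```

On a complete monthly series this cut gives 165 training months and 21 test months; the realised ratio follows from the cut-off year and from how many rows each year contributes.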
ML MODEL DESIGN
• TRAINING MODEL
ML MODEL DESIGN
• TESTING MODEL
ML MODEL SELECTION
Selection is based on following factors:
1. Benchmark practices
2. Quality and structure of the data
3. Size of the data
4. Degree of randomization, non-linearity
5. Continuous data, Classifications, features
6. Complexity of the model objective, intended results requirement
7. Degree of accuracy
8. Processing capability of system
Linear Regression
It finds the best-fit line through the data points. A simple linear
regression model can be represented as follows:
$\hat{y} = \theta_0 + \theta_1 x_1$
The above equation can be re-written as:
$\hat{y} = \theta_0 x_0 + \theta_1 x_1$
• $x_1$ represents the month in the dataset
• $\hat{y}$ represents the predicted number of admitted patients in that month
• Set $x_0 = 1$ to solve for the intercept term ($\theta_0$).
In vector notation, the above equation can be rewritten as:
$\hat{y} = \theta^{T} x$
[Figure: data points and the regression line; x-axis: Months, y-axis: No of admitted patients]
Using Least Squares to find $\theta$:
$\theta = (X^{T} X)^{-1} X^{T} \mathbf{y}$
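As a minimal sketch of the least-squares step, assuming the usual normal-equation form $\theta = (X^{T} X)^{-1} X^{T} \mathbf{y}$, the fit can be computed with NumPy. The five monthly counts are the ones shown in the Decision Tree example table of this deck; the variable names are illustrative.

```python
import numpy as np

# Month index (January = 1) and admitted patients for five months.
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([900553, 872450, 890136, 907927, 915756], dtype=float)

# Design matrix: a column of ones (x0 = 1, for the intercept) next to x1.
X = np.column_stack([np.ones_like(x), x])

# Solve the normal equations (X^T X) theta = X^T y. Using solve() avoids
# forming the explicit inverse, which is numerically safer.
theta = np.linalg.solve(X.T @ X, X.T @ y)
intercept, slope = theta

# Predicted admissions for month 6 from the fitted line.
next_month = intercept + slope * 6
```

`np.polyfit(x, y, 1)` returns the same line (slope first, then intercept) and is a quick cross-check.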

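Support Vector Regression
It works by finding a hyperplane that maximizes the margin while also
allowing a certain amount of error tolerance.
$y = wx + b$
• Given:
• x = months
• y = Number of patients admitted
• Find w and b
Cost Function:
$\min_{w,\, b,\, \xi,\, \xi^*} \; \frac{1}{2} \lVert w \rVert^2 + C \sum_{i=1}^{n} (\xi_i + \xi_i^*)$
Constraints:
$y_i - (w x_i + b) \le \varepsilon + \xi_i$
$(w x_i + b) - y_i \le \varepsilon + \xi_i^*$
$\xi_i,\ \xi_i^* \ge 0$
$C$ and $\varepsilon$ are the hyperparameters.
[Figure: data points around the hyperplane (decision boundary), with the margin and the support vectors marked]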
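A hedged Scikit-Learn sketch of fitting such a model is given below. The data is a hypothetical linear trend, not the thesis dataset, and the values of `C` and `epsilon` are arbitrary choices for illustration. Because visit counts are in the hundreds of thousands, both the input and the target are standardized first; that scaling step is an assumption of this sketch, not something the slides state.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Hypothetical data: 24 months with a perfectly linear trend.
x = np.arange(1, 25, dtype=float).reshape(-1, 1)
y = 900_000 + 5_000 * x.ravel()

# Linear-kernel SVR: C penalises points outside the epsilon tube,
# epsilon sets the width of the tube (the tolerated error).
model = TransformedTargetRegressor(
    regressor=make_pipeline(
        StandardScaler(), SVR(kernel="linear", C=10.0, epsilon=0.05)
    ),
    transformer=StandardScaler(),  # scales y for fitting, unscales predictions
)
model.fit(x, y)
pred = model.predict(x)

# w and b of the fitted hyperplane, expressed in the standardized units.
svr = model.regressor_.named_steps["svr"]
w, b = svr.coef_[0, 0], svr.intercept_[0]
```

Errors smaller than `epsilon` (in standardized units) cost nothing, so the fitted line is allowed to sit slightly off the data as long as every point stays inside the tube.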

Random Forest
It averages the prediction outcomes of multiple decision trees to give
the final prediction.
[Diagram: Dataset (2007-2020) → Shuffle → Dataset-1, Dataset-2, ..., Dataset-N → Decision Tree-1, Decision Tree-2, ..., Decision Tree-N → Average → Predicted Outcome (Y’)]
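The averaging step can be checked directly in Scikit-Learn. The sketch below uses synthetic data and assumes `RandomForestRegressor`, which draws a bootstrap resample (sampling with replacement) for each tree; that is one concrete way to realise the "Shuffle" box and may differ in detail from the slide.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic monthly series (hypothetical values).
rng = np.random.default_rng(0)
X = np.arange(1, 61, dtype=float).reshape(-1, 1)
y = 900_000 + 20_000 * np.sin(X.ravel() / 6) + rng.normal(0, 5_000, 60)

forest = RandomForestRegressor(n_estimators=25, random_state=0).fit(X, y)

# Each fitted tree was trained on its own resampled copy of the data.
# Averaging their individual predictions reproduces the forest's output.
per_tree = np.stack([tree.predict(X) for tree in forest.estimators_])
manual_average = per_tree.mean(axis=0)
forest_prediction = forest.predict(X)
```

Comparing `manual_average` with `forest_prediction` shows that the forest output is the plain mean of the N tree outputs.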
Decision Tree
1. The features (x) and their associated labels (y) are sorted in ascending order
based on the value of x.
2. Take the first two data points and compute the average of their x values, which
is used as the threshold ($\tau$) value (also known as the split value).
3. Each data point (x) in the data column is compared with the split value and
categorized into left and right tree nodes according to the following rules:
a. If $x < \tau$, the data goes to the left child node.
b. If $x \ge \tau$, the data goes to the right child node.
4. Now, take the average of all the Y values in the left child node and the average
of all Y values in the right child node separately. These two values are the
predicted output of the decision tree for $x < \tau$ and $x \ge \tau$ respectively.
Using the predicted and original values (Y), calculate the mean square error (MSE)
and note it down.
5. Repeat steps 2-4 for the next pair of adjacent data points and note down the MSE
obtained for each split value.
6. Once all the data points have been traversed, select the split value that
resulted in the minimum MSE, then repeat steps 2-5 for each child node to find
the next best split value.

Month     No of patients admitted   Split Value   MSE
1 (Jan)   900553                    1.5           10
2 (Feb)   872450                    2.5           8
3 (Mar)   890136                    3.5           7
4 (Apr)   907927                    4.5           10
5 (May)   915756
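The split search for a single node can be sketched in a few lines of NumPy, using the five monthly counts from the example table. The function name `split_mse` is hypothetical. The MSE column in the table appears to hold small illustrative numbers, so the magnitudes computed here differ, but the ordering of the splits can still be compared.

```python
import numpy as np

months = np.array([1, 2, 3, 4, 5], dtype=float)
admitted = np.array([900553, 872450, 890136, 907927, 915756], dtype=float)

def split_mse(x, y, tau):
    """MSE when each side of the split predicts the mean of its own y values."""
    left, right = y[x < tau], y[x >= tau]
    predicted = np.where(x < tau, left.mean(), right.mean())
    return float(np.mean((y - predicted) ** 2))

# Candidate split values: midpoints between adjacent (sorted) months.
candidates = (months[:-1] + months[1:]) / 2

# Score every candidate and keep the one with the lowest MSE.
scores = {float(tau): split_mse(months, admitted, tau) for tau in candidates}
best_tau = min(scores, key=scores.get)
```

For these five points the lowest MSE is obtained at a split value of 3.5. Growing the full tree repeats the same search inside each child node.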
Decision Tree
[Diagram: the root node splits at $\tau_1 = 3.5$; each child node computes its own threshold ($\tau_2$, $\tau_3$); the four leaves $x < \tau_2$, $x > \tau_2$, $x < \tau_3$, $x > \tau_3$ each give a Predicted Output (y)]
EXPERIMENTATION
• PROGRAMMING LANGUAGE
• PYTHON (GOOGLE COLABORATORY)
• a browser-based environment for running Python
• ML models are trained with algorithms from the Python library Scikit-Learn
• Scikit-Learn is a Python module that encompasses a range of data mining
and analysis tools
• It covers mid-level supervised and unsupervised ML algorithms
EXPERIMENTATION
• STEPS IN PROGRAMMING
1. Install the required libraries for the ML algorithms
2. Upload data from the Excel file, select variables and crosscheck the data
3. Convert months and years into numeric values that are understandable to
the programming language
4. Split the data for training and testing the models
5. Train the models on the training dataset: train LR, SVR and RF
6. Validate each model by computing its error against the testing
dataset.
7. Perform statistical tests to evaluate the performance of the
prediction model, i.e. MAE, RMSE, normalized absolute error
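The steps above can be sketched end to end as follows. This is not the notebook used in the thesis: the data is synthetic (step 2 would instead read the Excel file), the column names are hypothetical, and all hyperparameters are Scikit-Learn defaults except where shown. The scaling wrapped around SVR is an assumption of this sketch.

```python
import numpy as np
import pandas as pd
from sklearn.compose import TransformedTargetRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Step 2 (stand-in): synthetic monthly visit counts with trend and seasonality.
dates = pd.date_range("2007-04-01", "2022-09-01", freq="MS")
month = dates.month.to_numpy()
year = dates.year.to_numpy()
rng = np.random.default_rng(42)
visits = (850_000 + 600 * np.arange(len(dates))
          + 30_000 * np.sin(2 * np.pi * month / 12)
          + rng.normal(0, 10_000, len(dates)))
df = pd.DataFrame({"year": year, "month": month, "visits": visits})

# Step 3: months and years as numeric features.
X = df[["year", "month"]].to_numpy(dtype=float)
y = df["visits"].to_numpy()

# Step 4: chronological split (train up to 2020, test from 2021).
is_train = (df["year"] <= 2020).to_numpy()
X_train, y_train = X[is_train], y[is_train]
X_test, y_test = X[~is_train], y[~is_train]

# Step 5: the three models compared in this work.
models = {
    "LR": LinearRegression(),
    "SVR": TransformedTargetRegressor(
        regressor=make_pipeline(StandardScaler(), SVR()),
        transformer=StandardScaler(),
    ),
    "RF": RandomForestRegressor(n_estimators=100, random_state=0),
}

# Steps 6-7: fit, predict on the held-out months, compute MAE and RMSE.
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    results[name] = {
        "MAE": mean_absolute_error(y_test, pred),
        "RMSE": float(np.sqrt(mean_squared_error(y_test, pred))),
    }
```

`results` then holds one MAE and one RMSE per model. Numbers obtained on this synthetic series say nothing about the ranking reported for the real data.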
Code Example:
[Two slides of code screenshots]
RESULTS AND PERFORMANCE EVALUATION
• Regression models on continuous (non-classification) data are usually
evaluated by RMSE, R-squared and MAE.
• We employed MAE, considering the degree of randomness in the data and
results.
• Errors are computed for each ML model
• Errors are computed in Python
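For reference, the standard definitions of the three metrics are given below, with $y_i$ the true visit count for month $i$, $\hat{y}_i$ the predicted count, $\bar{y}$ the mean of the true values, and $n$ the number of test months. These are the textbook forms; the slides do not spell them out.

```latex
\mathrm{MAE}  = \frac{1}{n} \sum_{i=1}^{n} \lvert y_i - \hat{y}_i \rvert
\qquad
\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}
\qquad
R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}
```

RMSE squares each error before averaging, so a few large misses weigh more heavily in RMSE than in MAE.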
Results and performance evaluation
[Figure: three panels (Random Forest, SVR, Linear Regression), each plotting Predicted Patients against True Value; x-axis: Month, January '21 to September '22; y-axis: Visit Count, 0 to 1,200,000]
Normalizing MAE
• MAE values are normalized to a 0 to 1 scale
Random Forest
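The slides do not say which normalization was applied. One common convention, shown here purely as an assumption, divides the MAE by the range of the true values so that the result is unit-free; the function name `normalized_mae` and the example numbers are hypothetical.

```python
import numpy as np

def normalized_mae(y_true, y_pred):
    """MAE divided by the range (max - min) of the true values.

    Assumed convention, not confirmed by the source.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mae = np.mean(np.abs(y_true - y_pred))
    value_range = y_true.max() - y_true.min()
    if value_range == 0:
        raise ValueError("true values are constant; the range is zero")
    return float(mae / value_range)

# Toy example: absolute errors of 10, 10 and 30 over a true range of 200.
example = normalized_mae([100, 200, 300], [110, 190, 330])
```

Other conventions divide by the mean of the true values or min-max scale the targets before computing the error, and each gives a different number for the same predictions.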
Results and Performance Evaluation
Results: Average Normalized MAE Values
ML Model                    MAE (normalized)
Random Forest               0.184
Support Vector Regression   0.186
Linear Regression           0.191
Results: Analysis
• Based on normalized MAE values:
• Random Forest is overall the best model
• Linear Regression is second best, but there is not much difference between LR
and Random Forest performance
• SVR is the least accurate model among all
Discussion
• Random Forest may have outperformed the other algorithms due to its
ability to capture complex non-linear relationships inherent in the data.
The nature of the data used in this study involves intricate non-linear
patterns, and RF’s capability to model and capture such patterns led to
lower MAE values
