Tambi 2018-FPR

School of Computing
FACULTY OF ENGINEERING
Process Mining in Healthcare: A Question Driven Methodology for
Analyzing Emergency Room Processes Using MIMIC-III Database
Ankit Tambi
Submitted in accordance with the requirements for the degree of
(MSc Advanced Computer Science)
(2017/2018)
The candidate confirms that the following have been submitted:
Items               Format           Recipient(s) and Date
Deliverables 1, 2   Report           SSO (01/05/19)
Deliverable 3       Software codes   Supervisor (01/05/19)
Type of Project: Empirical Investigation
The candidate confirms that the work submitted is their own and the appropriate credit has
been given where reference has been made to the work of others.
I understand that failure to attribute material which is obtained from another source may be
considered as plagiarism.
(Signature of student) ___________________
© 2019 The University of Leeds and Ankit Mohan Tambi

ABSTRACT
The Emergency Room (ER) is a high-priority entry point to the hospital, where
patients arrive in very severe condition and require immediate attention. Failure to provide
quick and effective treatment may even lead to death.
To avoid such outcomes, the ER needs good team interaction among professionals,
which can be supported by clearly defining their roles and relationships. Moreover, managing
the required resources and inventories in advance may reduce the long waiting
times caused by a lack or scarcity of necessary resources. Managing all this is a very tedious job;
however, Process Mining (PM) in healthcare helps to understand these complex processes by
analysing their process models using software such as ProM and Disco.
Efficient application of PM in healthcare requires a large database. This project
uses MIMIC-III (Medical Information Mart for Intensive Care), a large, de-identified,
free-access database comprising healthcare-related information on more than 40,000
patients admitted to critical care units.
The methodology used is the Question-Driven Methodology (QDM), which was specifically
designed to handle the flexible nature of the ER and helps to answer the Frequently
Posed Questions (FPQs) asked by ER experts.
The process model generated using this methodology gave insights into the
order and frequency of the procedure events executed for patients diagnosed with
pneumonia and admitted through the ER. This information can help focus attention on the commonly
executed procedure events, such as chest X-ray, and help reduce waiting time and chaos,
thereby decreasing the discomfort of patients during their stay at the ER.
ACKNOWLEDGEMENTS
I am deeply grateful to my supervisor, Prof. Owen Johnson, for his continuous
guidance and support, and for his patience, motivation, enthusiasm and immense
knowledge throughout the course of this work. Owen has always motivated me to push
boundaries and has helped give shape and method to my vaguely defined research
ideas. Owen patiently guided me as I moved from asking incredibly stupid questions in the
beginning to slightly less stupid questions towards the end. I owe him not only the
completion of this thesis but also a great part of my professional development at the university.
Equally important has been the role of Dr Lydia Lau in encouraging me to learn
more deeply about the subject and to grow in the field of process mining. Lydia helped me
through my ups and downs. She has been a huge inspiration to me and never let me feel
demoralised. As an international student, I am also very thankful to the staff of the
Student Support Office.
My heartfelt thanks go to Eric Rojas and his team. I met him in person when he
visited the university and attended his presentation of the Question-Driven Methodology paper in
healthcare, which laid the foundation of this project.
I am also thankful to PhD student Angelina Prima Kurniati, from whom I received a lot of
help during my project.
I consider myself particularly fortunate to have my younger sister Nidhi, who is currently
pursuing medicine in India, and I owe her a great deal of what I was able to do. Without her I
could not have understood all the medical terms used in this project. Because of her support in
explaining the critical terms used in medicine, I was able to understand the project more
deeply. A healthcare project needs support from doctors, since an engineer
alone cannot grasp the meaning of all the complicated terminology.
Finally, I owe everything that I have been able to achieve to the years of toil of my
parents. I cannot marvel enough at their endeavours in supporting my early education
in spite of having limited means. The foundation they helped lay allowed this young boy from
a small town in India to believe that he can aspire. My father's passionate association with
education continues to inspire me every day. I dedicate this thesis to my parents as a small
effort to express my immense gratitude. I also thank the rest of my family for always being
supportive, and hope that I continue to expand and pursue my aspirations.
LIST OF FIGURES
Fig. No.  Name of Figure
2.1       An Example of Event Log Fragment
2.2       Process mining in healthcare
2.3       Proposed methodology
3.1       Frequently-posed questions for ER (Emergency Rooms)
3.2       A dotted chart of patients with Pneumonia
5.1       An example of an Event log fragment of Procedure Events
5.2       Example of a filter generated using Disco
5.3       (a) Mean activity duration
          (b) Process map with LABEL as activity (40% activities & 10% path)
5.4       Process map with CATEGORY as activity (100% activities & 10% path)
5.5       (a) Process map for pneumonia with CATEGORY as activity (100% activities & 10% path)
          (b) Process map for pneumonia with CATEGORY as activity (100% activities & 70% path)
5.6       (a) Mean and relative frequency for CATEGORY as activity
          (b) Mean and relative frequency for LABEL as activity
5.7       Petri net on CATEGORY as events
5.8       Petri net log replay result
5.9       Conformance checking (a) Legend (b) Global Statistics (c) Elements Statistics
5.10      The Petri Net generated by Interactive Data-Aware Heuristics Miner
LIST OF TABLES
Table No.  Name of Table
4.1        Guidelines for the data extraction stage (MIMIC-III)
4.2        Guidelines for the event log creation stage. FPQ, Frequently-Posed Question
4.3        Analysis guide for general type of question
TABLE OF CONTENTS
Sr. No.  Title
ABSTRACT
ACKNOWLEDGEMENTS
LIST OF FIGURES
LIST OF TABLES
1 INTRODUCTION
  1.1 Overview
  1.2 Problem Statement and Possible Solution
  1.3 Project Scope
  1.4 Aim
  1.5 Objectives
  1.6 Deliverables
  1.7 Project management and planning
2 BACKGROUND RESEARCH AND LITERATURE REVIEW
  2.1 Initial research and training
  2.2 The Emergency Room
  2.3 Process Mining
  2.4 Process Mining Tools
    2.4.1 ProM
    2.4.2 Disco
    2.4.3 MySQL
    2.4.4 HeidiSQL
  2.5 Version Controls
    2.5.1 MIMIC-III
    2.5.2 Disco
    2.5.3 ProM
    2.5.4 HeidiSQL
    2.5.5 MySQL Workbench 8.0
  2.6 Data Collection
  2.7 MIMIC-III
  2.8 Proposed Methodology
  2.9 Conclusion
3 FREQUENTLY POSED QUESTIONS
4 METHODOLOGY
5 IMPLEMENTATION AND RESULT
  5.1 Data Description and Understanding
  5.2 Methodology implementation
  5.3 Conclusion
6 DATA REFERENCE MODEL
7 PROJECT EVALUATION
  7.1 General Evaluation
  7.2 Personal Reflection
8 CONCLUSION AND FUTURE WORK
REFERENCES
APPENDIX A
APPENDIX B
1 INTRODUCTION
1.1 Overview
There is a worldwide increase in awareness that good-quality healthcare is key to
keeping people healthy, happy and alive for longer. Moreover, very large sums
are spent worldwide on healthcare, by governments as well
as individuals [7,8].
Emergency care, a field that developed only in recent decades, deals with patients
arriving in critical condition. Patients arrive with a collection of symptoms of varying
severity, which have to be assessed rapidly. Following this, immediate decisions have to
be made, which lead to a collection of events occurring in a complex manner for the rapid
and effective recovery of the patient. Thus, studying the processes in the ER (Emergency Room)
can lead to a better understanding of these processes as well as increased transparency
[2].
MIMIC-III is a large healthcare database that can serve as an extremely valuable
resource for studying these processes. However, retrieving this data in the required form (an event
log) can be quite difficult, which led me to develop a Data Reference Model (DRM).
In this project, a Question-Driven Methodology (QDM), developed by Eric Rojas et al. specifically for the
flexible nature of the ER, was used for Process Mining in the ER through the
use of Frequently Posed Questions (FPQs). An analysis of these processes can help add
more structure to them. Frequently executed activities were observed, which could help to
manage resources better while decreasing waiting time [1].
1.2 Problem Statement and Possible Solution
The problem addressed in this project is to develop a data reference model for the MIMIC-III
database that can be used to answer FPQs posed by experts, using the Question-Driven
Methodology (designed by Rojas E. et al. specifically for answering FPQs about ER processes).
Process Mining (PM) tools, techniques and algorithms are applied to ER data for pneumonia
patients to generate process models. Analysing these models shows which activities are
frequently executed for pneumonia patients, which indicates the resources to focus on in
order to add more structure to the processes and decrease waiting time.
1.3 Project Scope
Since the scope of this project is limited, MIMIC-III data is used rather than workshops with
clinicians. MIMIC-III is a large, free-access database
comprising healthcare-related information on more than 40,000 patients admitted to critical
care units between 2001 and 2012.
The aim of this project is to answer one of the FPQs posed by ER experts using the
QDM developed by Eric Rojas et al. specifically for ER processes, together with PM tools,
techniques and algorithms. I have developed a Data Reference Model (DRM) that
helps to answer a Process Discovery question (in this case for pneumonia) and to gain
insights into what frequently happens in the ER, which in turn helps to manage
resources and inventories better and so reduce waiting time and chaos.
1.4 Aim
To demonstrate the use of the Question-Driven Methodology (QDM), proposed by Eric Rojas et al.
specifically for ER processes, on the MIMIC-III dataset by
developing a Data Reference Model (DRM) from it and applying Process Mining tools,
techniques and algorithms to explore the process models and answer Frequently Posed
Questions (FPQs) asked by experts.
1.5 Objectives
1. To better understand how Process Mining works in healthcare so as to be able to
apply it specifically to the ER.
2. To learn the basic workings of several Process Mining tools and techniques.
3. To analyze the interaction among various healthcare professionals at the ER using the event
log.
4. To apply the stages of QDM to MIMIC-III data by developing a DRM.
1.6 Deliverables
1) The models on ER processes
2) The Project Report
1.7 Project management and planning
Months    : TASK
June 2018 : Initial meetings
July 2018 : Learned process mining online course, Scoping and Planning
Sept 2018 : Learned ProM and Disco
Oct 2018  : Requesting data from MIMIC-III
Nov 2018  : Background Research of Healthcare
Feb 2019  : Further Research and Reference Material
Mar 2019  : Applying QDM Stages
Apr 2019  : Write Up and Hand In
2 BACKGROUND RESEARCH AND LITERATURE REVIEW
2.1 Initial research and training
At the initial stage of this project, to gain a better understanding of Process Mining
and its different tools, techniques and algorithms, I took the online Coursera course
"Process Mining: Data Science in Action" [17] by Van der Aalst. Later, to
gain further knowledge of the ProM software, I studied the FutureLearn course
"Introduction to Process Mining with ProM" [18].
After the initial training in PM and PM tools (ProM and Disco), the next step
was to gain access to the MIMIC-III dataset. MIMIC-III is a large database that can be regarded as big
data, a topic I was already familiar with from a module in my second
semester. With this background, I set out to identify which of the V's of big data apply to a
large database like MIMIC-III.
After gaining access to MIMIC-III, I studied it table by table, noting how its
columns relate to one another.
I then studied some of the frequently used Process Mining methodologies, namely
PM2 (Process Mining Project Methodology), the clear path method and the L* life-cycle model.
During this period, Eric Rojas visited the university and gave a presentation on the use of the
Question-Driven Methodology in healthcare for answering the Frequently Posed Questions of
healthcare experts about Emergency Room (ER) processes. After meeting him
personally and attending his presentation, I decided to replicate his recent paper on a Question-Driven
Methodology for analyzing ER processes using Process Mining, this time on the MIMIC-III database.
2.2 The Emergency Room
Emergency care has, in recent decades, become recognized as a specialty in its
own right. It is provided at the Emergency Room (ER) to patients arriving in critical
condition and requiring urgent attention and care, making the ER the first contact point with the
healthcare system for unscheduled and undifferentiated patients of all ages [10,9,1].
As defined by the American College of Emergency Physicians, "Emergency medicine
is the medical speciality dedicated to the diagnosis and treatment of unforeseen illness or
injury. …The practice of emergency medicine includes the initial evaluation, diagnosis,
treatment, coordination of care among multiple providers, and disposition of any patient
requiring expeditious medical, surgical, or psychiatric care" [3].
At the ER, many healthcare professionals (physicians, nurses), in coordination with technicians
and supporting resources, work towards providing quick and effective care to patients [3]. This
begins with the initial screening and examination, generally carried out by the nursing staff, in
order to classify patients according to their level of severity. Thus, the nurses are usually the
first contact in the ER [1, 3].
Following the screening by the nurses, the emergency medicine professionals further
examine the patients, in order of severity, and provide their assistance [3]. Their
primary aim is to fully treat the patients and help them recover. Whenever that is not
immediately possible, they aim to alleviate the patients' symptoms. To achieve this,
rapid assessment followed by quick and accurate decisions for the recovery of the patient
have to be made through systematic collaboration among various healthcare
professionals, making team interaction and collaboration a key aspect of success at the
ER. Their coordinated collaboration is critical for the prompt care of patients, who arrive
at the ER in a delicate condition requiring immediate attention [3].
Many studies point towards inadequate team interaction in the ER as the primary or
contributing cause of over half of malpractice claims [1,2]. A severe impact on patients'
treatment at the ER has been observed in the absence of an adequate team structure or of
clearly defined goals and responsibilities for all team members. In some cases,
non-involvement of the relevant team members in the decision-making process, lack of a
standard protocol and poor prioritization of activities, followed by poor communication, led to
further chaos. This puts patients at higher risk by wasting precious time in cases of
utmost emergency [3].
On the other hand, good team interaction that aids the planning and execution of ER
processes can significantly reduce morbidity and mortality rates. It can also make
patients' stay at the ER much more comfortable. Moreover, it can reduce clinical error
rates, which in turn reduces legal costs, and can even affect human aspects such as
stress and frustration among patients as well as healthcare professionals [3].
Thus, applying Process Mining to the stored data through the use of QDM,
and answering the resulting questions, can help address many of these challenges faced in the ER.
2.3 Process Mining
Improving healthcare processes can be expected to greatly affect the quality
of experience of patients as well as that of healthcare professionals at the ER. However,
such improvement is a very complex and challenging task. These processes have considerable
scope for improved productivity through identifying well-defined patterns in the
processes and the roles of the healthcare professionals in them, thus reducing chaos
and waiting times while at the same time reducing costs [4].
This can be done by observing and analysing many of the complex, non-trivial and
time-consuming processes (clinical and administrative) occurring at the ER in order to come
up with suggestions for making them more efficient [4,2]. One approach commonly undertaken is
conducting interviews. Unfortunately, this is time-consuming. Moreover, the
suggestions are quite often highly subjective, since each person involved in the healthcare processes
tends to have an ideal scenario in mind, which in reality is only one of many possible
scenarios [4].
Therefore, to obtain suggestions that are more objective in
nature, event data proves to be a readily available source. This is made possible by Process
Mining, which provides several tools and techniques to analyse and improve processes [4,1].
The extraction of an event log is the starting point for Process Mining. In an ER, the
event log can be viewed as a set of episodes (cases), each consisting of several procedures or
activities that were executed for a particular instance [3]. The event logs in ER records
also store additional information about the events, such as the timestamp and data elements
(e.g., age, sex of the patient). In fact, whenever possible, Process Mining techniques use
extra information such as resources [4]. An example of an event log is given in Fig 2.1.
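As a minimal sketch of this structure, an event log can be represented as a list of events that is grouped into one time-ordered trace per case. The field names (`case_id`, `activity`, `timestamp`, `resource`) and the sample rows are illustrative assumptions, not MIMIC-III data.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class Event:
    case_id: str                    # episode (case) the event belongs to
    activity: str                   # name of the executed procedure or activity
    timestamp: datetime             # when the activity was executed
    resource: Optional[str] = None  # optional extra information, e.g. a caregiver


def group_into_traces(events):
    """Group events by case and order each case by timestamp."""
    by_case = defaultdict(list)
    for event in events:
        by_case[event.case_id].append(event)
    return {
        case_id: [e.activity for e in sorted(case_events, key=lambda e: e.timestamp)]
        for case_id, case_events in by_case.items()
    }


# Hypothetical sample rows, for illustration only.
log = [
    Event("case-1", "Chest X-Ray", datetime(2019, 1, 1, 10, 30), "nurse-7"),
    Event("case-1", "Registration", datetime(2019, 1, 1, 10, 0), "nurse-7"),
    Event("case-2", "Registration", datetime(2019, 1, 2, 8, 15)),
]

traces = group_into_traces(log)
```

Each trace is then simply the ordered list of activity names for one episode, which is the form that discovery techniques typically take as input.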
ER processes, however, are quite complex; every patient is a different episode in
terms of multiple diagnoses and their differing severity, which influences the way in
which healthcare professionals seek to treat the patient [3].
Fig 2.1 An Example of Event Log Fragment
Process Mining uses event data (i.e., observed behaviour) to determine
process models [4]. The three main types of Process Mining are discovery, conformance and
enhancement. A discovery technique produces a process model from the event log
without using any prior (a-priori) information. Conformance checking compares an existing
process model (obtained, for example, through discovery) with an event log of the same process.
It checks whether the process model conforms to the event
log and vice versa, and the result is expressed in terms of fitness [4]. The third type of Process Mining is
enhancement, which deals with extending and/or improving an existing process model using
information about the actual process recorded in the event log [4].
Fig 2.2 Process mining in healthcare (based on [2])
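To make the idea of discovery more concrete, the sketch below counts how often one activity is directly followed by another across all traces. This directly-follows count is a deliberately simplified illustration and is not the algorithm used by ProM or Disco; the traces are hypothetical.

```python
from collections import Counter


def discover_dfg(traces):
    """Count how often activity a is directly followed by activity b."""
    dfg = Counter()
    for trace in traces:
        for a, b in zip(trace, trace[1:]):
            dfg[(a, b)] += 1
    return dfg


# Hypothetical traces (one ordered list of activities per case).
traces = [
    ["Registration", "Chest X-Ray", "Discharge"],
    ["Registration", "Chest X-Ray", "Blood Test", "Discharge"],
    ["Registration", "Blood Test", "Discharge"],
]

dfg = discover_dfg(traces)
```

The resulting counts can be read as the edges of a simple process map: frequent pairs correspond to the main paths, and rare pairs to exceptional behaviour.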
However, Process Mining, being an emerging field, still has several limitations in its
application. These include the limited availability of healthcare information systems that
are process-aware and record event logs. Further, data extraction and interpretation (to
respond to questions frequently posed by experts) can prove to be a tedious job with a
high dependence on ER experts. To identify opportunities for applying Process
Mining, it is crucial to understand the questions frequently posed by healthcare
experts regarding such processes [1, 2].
Our approach focuses on extracting knowledge from healthcare databases through
the use of Process Mining to answer various types of questions posed by experts. This
project uses the Question-Driven Methodology specific to the ER. Moreover, a Data Reference
Model for the MIMIC-III database has been proposed to make data extraction easier. The resulting
process models can also be used to understand how the information system is
expected to support process execution, thereby helping to develop more process-aware
healthcare data [2].
2.4 Process Mining Tools
2.4.1 ProM
ProM is a free, open-source, Java-based framework developed at the Eindhoven
University of Technology for the purpose of Process Mining [5]. ProM provides several filters
that help to pre-process the data, and plug-ins (such as the Fuzzy Miner and miners that produce
Petri nets) that can be applied to the data (filtered or unfiltered event logs) [5]. This makes ProM
an important tool in several fields, such as healthcare, banking, telecommunications and
public sector services like transport. However, the presence of so many plug-ins and filters
also makes it quite complex to use [11].
2.4.2 Disco
Disco (developed by Fluxicon) is a powerful Process Mining tool. Unlike ProM, it does not offer
a large collection of plug-ins; instead, it was specifically designed to be easy to use, fast
and efficient. It works efficiently even with large and complex data, generating
insightful process maps and detailed statistics through many filters within minutes. This has
made Disco a widely used tool in Process Mining [12].
2.4.3 MySQL
MySQL is an open-source Relational Database Management System (RDBMS),
written in C and C++. It was first released in 1995. MySQL Workbench is a graphical tool
for working with MySQL servers and databases. It also allows data migration from
other RDBMSs such as PostgreSQL, Microsoft SQL Server and SQLite [13].
2.4.4 HeidiSQL
HeidiSQL is a free and open-source administration tool for MySQL, PostgreSQL,
MariaDB and Microsoft SQL Server. It was created in 2002 by Ansgar as an easy-to-use GUI. It lets
you view and edit data and structures in the aforementioned RDBMSs and has become
one of the most popular tools for these platforms [14].
2.5 Version Controls
2.5.1 MIMIC-III
The version of MIMIC used in this project is MIMIC-III v1.4, which holds data on
more than 40,000 patients, of whom 38,645 are adults and 7,875
are neonates [15].
2.5.2 Disco
The version of Disco used in this project is 2.2.1, released on
28/08/2018.
2.5.3 ProM
The versions used in this project are ProM 6.8 and
XESame 1.8.
2.5.4 HeidiSQL
The version of HeidiSQL used in this project is 9.4.0.5125
(32 bit), compiled on 2016-10-21.
2.5.5 MySQL Workbench 8.0
The version of MySQL Workbench used in this project is 8.0.13 build
13780177 CE (64 bits).
2.6 Data Collection
MIMIC-III holds data on more than 40,000 patients [15], and the database is enriched with
detailed patient information. To gain access to this large, freely available database
for research purposes, the following steps must be completed:
1. Complete the CITI "Data or Specimens Only Research" course.
2. Pass the ethics test.
3. Apply for access to the MIMIC-III database through PhysioNet, providing the CITI
completion report. Once the application is approved, access is granted.
4. Download all the tables of the MIMIC-III database. Initially, the tables are in the form of
compressed CSV files.
5. Decompress the files and store them on the local computer for further use.
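The final decompression step can be scripted. The sketch below assumes that the downloaded tables are gzip-compressed files ending in `.csv.gz`; the directory names in the usage note are hypothetical.

```python
import gzip
import shutil
from pathlib import Path


def decompress_tables(source_dir, target_dir):
    """Decompress every *.csv.gz file in source_dir into target_dir."""
    source = Path(source_dir)
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    written = []
    for archive in sorted(source.glob("*.csv.gz")):
        # Drop the trailing ".gz" so that TABLE.csv.gz becomes TABLE.csv.
        output = target / archive.name[: -len(".gz")]
        with gzip.open(archive, "rb") as compressed, open(output, "wb") as plain:
            shutil.copyfileobj(compressed, plain)
        written.append(output)
    return written
```

A call such as `decompress_tables("mimic-download", "mimic-csv")` would then return the list of CSV paths that were written, ready to be imported into the database.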
2.7 MIMIC-III
MIMIC-III (Medical Information Mart for Intensive Care) is a free-access database
consisting of healthcare records. It contains de-identified information on more than 40,000
patients admitted to critical care units at a large tertiary care centre, the Beth
Israel Deaconess Medical Center in Boston, between 2001 and 2012.
The data is spread across 26 relational tables that include information such as diagnoses,
diagnostic codes, bedside measurements of vital signs, laboratory observations, notes
charted by caregivers, admission and discharge times and locations, survival data, relevant
personal patient information and more. These tables are linked to each other through various
IDs such as SUBJECT_ID, HADM_ID, ITEMID and CGID.
De-identification, which is necessary to protect private and sensitive
healthcare information, means that patients are identified only through numerical
IDs. Moreover, the dates in the MIMIC-III database are shifted into the future by a random amount to
better protect patient confidentiality. However, this shifting of time caused
several problems in handling and using the data, which are discussed later in the Implementation
and Result chapter.
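The effect of date shifting on process analysis can be illustrated with a small sketch. It assumes that all timestamps of one patient are moved by a single random offset; the offset range and the sample timestamps are invented for illustration and do not reproduce the actual de-identification procedure.

```python
import random
from datetime import datetime, timedelta


def shift_patient_events(events_by_patient, seed=0):
    """Shift all timestamps of a patient by one random offset per patient."""
    rng = random.Random(seed)
    shifted = {}
    for patient_id, timestamps in events_by_patient.items():
        # Hypothetical offset range of roughly 100 to 200 years.
        offset = timedelta(days=rng.randint(100 * 365, 200 * 365))
        shifted[patient_id] = [t + offset for t in timestamps]
    return shifted


original = {
    "patient-1": [datetime(2010, 3, 1, 9, 0), datetime(2010, 3, 1, 15, 30)],
    "patient-2": [datetime(2011, 7, 4, 22, 0), datetime(2011, 7, 6, 1, 0)],
}
shifted = shift_patient_events(original)

# Time between two events of the same patient, before and after shifting.
durations_before = {p: ts[1] - ts[0] for p, ts in original.items()}
durations_after = {p: ts[1] - ts[0] for p, ts in shifted.items()}
```

Under this assumption, durations within one patient's record stay meaningful, while calendar dates and the relative ordering of different patients do not, which is one reason shifted dates need care when building an event log.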
The MIMIC-III database contains information from all departments of the hospital;
however, this project focuses only on patients admitted through the ER.
For that purpose, a Data Reference Model was developed specifically for answering
the questions frequently posed by experts [1] about ER processes; it is discussed in the
Data Reference Model chapter.
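Restricting the data to ER admissions can be sketched as a two-step filter that links tables through HADM_ID. The column name `ADMISSION_LOCATION` and the value `EMERGENCY ROOM ADMIT` are assumptions of this sketch that should be verified against the actual tables, and the rows are hypothetical.

```python
ER_ADMISSION_LOCATION = "EMERGENCY ROOM ADMIT"  # assumed value, verify against the data


def er_admission_ids(admissions):
    """Return the HADM_ID of every admission that came through the ER."""
    return {
        row["HADM_ID"]
        for row in admissions
        if row["ADMISSION_LOCATION"] == ER_ADMISSION_LOCATION
    }


def keep_er_events(events, er_ids):
    """Keep only the events whose admission is in er_ids."""
    return [event for event in events if event["HADM_ID"] in er_ids]


# Hypothetical rows standing in for an admissions table and an events table.
admissions = [
    {"HADM_ID": 100, "ADMISSION_LOCATION": "EMERGENCY ROOM ADMIT"},
    {"HADM_ID": 101, "ADMISSION_LOCATION": "OTHER"},
]
events = [
    {"HADM_ID": 100, "LABEL": "Chest X-Ray"},
    {"HADM_ID": 101, "LABEL": "Chest X-Ray"},
    {"HADM_ID": 100, "LABEL": "Blood Cultured"},
]

er_events = keep_er_events(events, er_admission_ids(admissions))
```

In a database, the same selection would normally be written as a join on HADM_ID; the in-memory version only shows the logic.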
2.8 Proposed Methodology
Process Mining is often described as bridging the gap between data-oriented analysis
techniques, such as Data Mining and Machine Learning, and process-oriented analysis. Multiple
methodologies have been developed to support Process Mining projects, such as the Process
Diagnostics Method (PDM), the L* life-cycle model and the Process Mining Project Methodology (PM2).
As this project concentrates specifically on ER processes, a newer methodology,
the Question-Driven Methodology (QDM) proposed by Eric Rojas et al. [1], is used. It deals
specifically with the flexible nature of ER processes and allows the part of the data to
focus on to be identified from the beginning, based on the question to be answered. An event log
is extracted from the MIMIC-III data, and the DRM is built to include the minimum information
required for the selected question. Further, Process Mining tools allow the identification of
the more frequent variants after clustering similar groups of episodes.
Fig. 2.3 Proposed methodology (based on [1])
This methodology consists of six stages corresponding to Eric Rojas's Question-Driven
Methodology (of which the 6th is beyond the scope of this project):
(1) Extraction of data (from MIMIC-III*);
(2) Creation of an event log based on the questions;
(3) Filtration of the event log;
(4) Applying data analysis;
(5) Applying Process Mining (PM) techniques;
(6) Analyzing the results with the experts**.
2.9 Conclusion
A better understanding of ER processes, obtained by answering the Frequently Posed
Questions, can help improve the processes and define the roles of all the professionals,
thereby decreasing chaos, discomfort and waiting time. This can be achieved by applying
PM tools and techniques to the MIMIC-III dataset using the Question-Driven
Methodology proposed by Rojas, E. et al.
*The methodology of Rojas et al. uses a Hospital Information System (HIS) database instead.
**Not implemented in this project, since it was beyond its scope.
3 FREQUENTLY POSED QUESTIONS
Most of the questions suggested by Eric Rojas et al., which are based on a HIS database, can also
be answered through the MIMIC-III database. They divided these questions into 2 types:
general and episode-oriented.
Fig. 3.1 Frequently-posed questions for ER (Emergency Rooms). (Based on [1])
The general questions can be further classified into 4 types:
Process discovery is about discovering process models that describe the control flow
of activities. For example, "What is the process model for the procedures performed while
treating patients with a particular diagnosis?"
Conformance checking is checking the fitness of the process model against the event
log to verify whether the processes correspond to that model. This can help to check from
time to time whether internal protocols are being followed. The higher the fitness, the fewer
the deviations from the process model.
Performance analysis is done by analyzing processes and sub-processes with
respect to the time taken to execute them.
Organizational analysis centres on the resources (healthcare professionals) and
their roles and relationships during the execution of different activities.
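The relationship between fitness and deviations in conformance checking can be shown with a deliberately simplified measure: the fraction of traces in which every step is allowed by the model. This is only an illustrative sketch and not the replay-based fitness computed by ProM; the protocol and traces are hypothetical.

```python
def trace_fits(trace, allowed_steps):
    """A trace fits if every consecutive pair of activities is allowed."""
    return all((a, b) in allowed_steps for a, b in zip(trace, trace[1:]))


def fitness(traces, allowed_steps):
    """Fraction of traces that fit the model (1.0 means no deviations)."""
    if not traces:
        raise ValueError("at least one trace is required")
    fitting = sum(trace_fits(trace, allowed_steps) for trace in traces)
    return fitting / len(traces)


# Hypothetical protocol expressed as allowed directly-follows steps.
protocol = {
    ("Registration", "Chest X-Ray"),
    ("Chest X-Ray", "Blood Test"),
    ("Blood Test", "Discharge"),
}
traces = [
    ["Registration", "Chest X-Ray", "Blood Test", "Discharge"],
    ["Registration", "Chest X-Ray", "Blood Test", "Discharge"],
    ["Registration", "Blood Test", "Discharge"],  # skips the X-ray
    ["Registration", "Chest X-Ray", "Blood Test", "Discharge"],
]

score = fitness(traces, protocol)  # 3 of the 4 traces follow the protocol
```

A score below 1.0 points to cases that deviate from the protocol, which can then be inspected individually.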
The episode-oriented questions described by E. Rojas et al. [1] are of 3 types:
Triage-driven questions, based on the concept of the triage system, cannot be
answered through the MIMIC-III database, since it lacks the necessary data on grading
patients according to the triage system.
Stay duration-oriented questions are based on the total duration of a patient's stay at
the ER. The characteristics of patients staying for different durations can be examined through
this type of question. For example, "What are the characteristics of the patients staying for
a duration shorter than 24 hours?" This can be answered by filtering the cases with a stay
shorter than 24 hours. A dotted chart of patients with pneumonia showing variants against stay
duration is given in Fig 3.2.
Fig 3.2 A dotted chart of patients with Pneumonia
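A stay duration filter of this kind can be sketched as follows. The sketch approximates the stay duration by the time between the first and last recorded event of a case, which is an assumption made for illustration; the cases are hypothetical.

```python
from datetime import datetime, timedelta


def stay_duration(timestamps):
    """Time between the first and last recorded event of a case."""
    return max(timestamps) - min(timestamps)


def cases_shorter_than(cases, limit):
    """Return the ids of cases whose stay duration is below the limit."""
    return sorted(
        case_id
        for case_id, timestamps in cases.items()
        if stay_duration(timestamps) < limit
    )


cases = {
    "case-1": [datetime(2019, 1, 1, 8, 0), datetime(2019, 1, 1, 20, 0)],    # 12 h
    "case-2": [datetime(2019, 1, 1, 8, 0), datetime(2019, 1, 3, 8, 0)],     # 48 h
    "case-3": [datetime(2019, 1, 5, 23, 0), datetime(2019, 1, 6, 22, 59)],  # 23 h 59 min
}

short_stays = cases_shorter_than(cases, timedelta(hours=24))
```

The selected case ids can then be used to compare the characteristics of short-stay patients with those of the remaining patients.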
ER patient discharge-driven questions are concerned with the destination of
patients after the ER. A patient may either be discharged home or admitted to another unit
of the hospital. These questions seek to find the characteristics that differ between the two
destinations. For example, "What are the activities executed for the patients that are
discharged home?"
Compound questions concerning more than one of the above criteria may
also be answered. Such questions require combining the data needed for each of the basic
questions involved. Filters are then applied to this data to reach the required answers. For
example, "What are the characteristics of the patients with pneumonia that are discharged home
within 24 hours of their admission to the ER?"
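One way to picture a compound question is as a conjunction of the filters belonging to each basic question. In the sketch below, the field names (`diagnosis`, `discharge_location`, `stay`) and the sample cases are hypothetical and serve only to show how the filters combine.

```python
from datetime import timedelta


def matches_all(case, predicates):
    """True if the case satisfies every basic-question filter."""
    return all(predicate(case) for predicate in predicates)


# One predicate per basic question; the field names are hypothetical.
def has_pneumonia(case):
    return case["diagnosis"] == "Pneumonia"


def discharged_home(case):
    return case["discharge_location"] == "HOME"


def stayed_under_24h(case):
    return case["stay"] < timedelta(hours=24)


cases = [
    {"id": 1, "diagnosis": "Pneumonia", "discharge_location": "HOME", "stay": timedelta(hours=10)},
    {"id": 2, "diagnosis": "Pneumonia", "discharge_location": "HOME", "stay": timedelta(hours=30)},
    {"id": 3, "diagnosis": "Pneumonia", "discharge_location": "ICU", "stay": timedelta(hours=5)},
    {"id": 4, "diagnosis": "Sepsis", "discharge_location": "HOME", "stay": timedelta(hours=8)},
]

filters = [has_pneumonia, discharged_home, stayed_under_24h]
selected = [case["id"] for case in cases if matches_all(case, filters)]
```

Keeping each basic filter separate makes it easy to answer other compound questions by combining the same predicates differently.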
Apart from the triage-driven questions, these questions can be answered through the Data
Reference Model built specifically for the ER from the MIMIC-III database, described in the
Data Reference Model chapter.
4 METHODOLOGY
The Process Mining methodology implemented in this project is the Question-Driven
Methodology (QDM) for analysing Emergency Room (ER) processes, proposed by Eric
Rojas et al. [1]. This methodology consists of 6 stages, of which the
first 5 have been implemented on MIMIC-III; stage 6 is beyond the scope of this
project. The 6 stages of the Question-Driven Methodology (QDM) are as follows:
Stage 1. Data extraction:
This stage is concerned with identifying the data to be used for the FPQs
and extracting it from its source. Based on this, a data model is created. This
requires checking for the availability of timestamps, defining the events or activities, creating
any extra fields wherever necessary and verifying the quality of the extracted data. These
activities are summed up in Table 4.1 below.
Table 4.1 Guidelines for the data extraction stage (MIMIC-III) [1]
Activity 1.1: Identify available data in MIMIC-III and build the data model
  Description: Have access to the correct data from the direct sources, them being
  MIMIC-III.
  Guidelines: Make sure you have permissions and access granted to them directly or
  through the data owner. Identify if data are missing in the data sources, and check if it is
  feasible to execute the analysis (e.g., timestamps are the minimum required data). The
  available data model should contain as many dimensions and attributes of the data
  reference model as possible. It should have as much detail granularity as possible.

Activity 1.2: Ensure availability of a timestamp for each event
  Description: Check that for each event or activity included in the data model, a correct
  timestamp is included.
  Guidelines: If different levels of accuracy are present, the highest one present in all of
  the data is recommended, just to have the same level across all of the examined data. If
  some data do not have a timestamp, they cannot be used for the analysis.

Activity 1.3: Name events
  Description: In case any activity or event does not have an appropriate name, one
  should be assigned to it.
  Guidelines: Use meaningful names for the ER experts.

Activity 1.4: Create specific-fields
  Description: Create specific-fields based on the required needs.
  Guidelines: It is advisable to group activities into subprocesses. It might also be useful
  to split activities into sub-activities (e.g., the professional activities could be split
  according to the role, such as "physician professional activities" or "nurse professional
  activities").

Activity 1.5: Verify data quality
  Description: Further general issues have been identified from the literature review that
  must be tackled when generating an event log for process mining purposes in healthcare.
  Guidelines: Check lack of data, incorrect data or the inaccuracy and irrelevance of data.
  Check in more detail all of the significant challenges previously found in the literature.
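Activity 1.2 can be illustrated with a short Python sketch. It is only a minimal example under stated assumptions: the rows, the two accepted timestamp formats and the function names are hypothetical and are not taken from the project's actual extraction.

```python
from datetime import datetime

# Hypothetical extracted rows: (case id, event name, timestamp text or None).
rows = [
    (100, "Chest X-Ray", "2130-01-05 10:15:00"),
    (100, "EKG", None),                          # missing timestamp
    (101, "Blood Cultured", "2131-03-02 08:00"), # coarser accuracy (no seconds)
]

def parse_timestamp(text):
    """Return a datetime, or None when the value is missing or unparseable."""
    if not text:
        return None
    for fmt in ("%Y-%m-%d %H:%M:%S", "%Y-%m-%d %H:%M"):
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue
    return None

def split_by_timestamp(rows):
    """Separate rows usable for process mining from rows without a timestamp."""
    usable, rejected = [], []
    for case_id, event, ts in rows:
        parsed = parse_timestamp(ts)
        if parsed is None:
            rejected.append((case_id, event, ts))
        else:
            usable.append((case_id, event, parsed))
    return usable, rejected

usable, rejected = split_by_timestamp(rows)
```

Rows that end up in `rejected` correspond to the data that, according to the guideline, cannot be used for the analysis.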
Stage 2. Event log creation:
This stage selects specific data (for example, including only necessary columns such
as case, time, event and necessary resources) to generate an event log from the extracted
data based on the FPQs. This stage may need to be revisited several times, whenever it
becomes clear that more resources are required. Table 4.2 sums up the activities of this
stage.
Table 4.2 Guidelines for the event log creation stage. FPQ, Frequently-Posed Question [1]
Activity 2.1: Identify data required to perform the specific analysis
  Description: Identify the FPQ to be answered and identify what data from the general
  data model will be used. Not all of the data included in the reference model may be
  required to answer a specific question.
  Guidelines: Have clarity and a good understanding of the FPQs that are desired to be
  answered.

Activity 2.2: Create the event log
  Description: Once the data stored in the data model are available, a specific event log
  must be created each time a question requires a response.
  Guidelines: Establish the format in which the event log will be built. Tools such as Excel
  with comma separated values files can be used, but more specific standards (such as
  XES) should also be considered.

Activity 2.3: Include specific characteristics for each event or activity
  Description: According to the characteristics of the data and the question that requires
  an answer, certain data types must be included in the event log.
  Guidelines: After the first version of the event log is built, an inspection should be made
  to assure that not only the minimum data are included, but also the desired
  characteristics of the episode with correct values.
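As a rough illustration of activity 2.2, the sketch below writes a minimal event log in CSV form with the standard `csv` module. The input records and the column names `case_id`, `activity`, `timestamp` and `resource` are assumptions made for this example, not the exact layout used in the project.

```python
import csv
import io

# Hypothetical joined records; only the columns needed for the question are kept.
records = [
    {"hadm_id": 101, "label": "EKG", "start_time": "2131-03-02 08:30:00", "cgid": 21},
    {"hadm_id": 100, "label": "Chest X-Ray", "start_time": "2130-01-05 10:15:00", "cgid": 17},
    {"hadm_id": 100, "label": "Blood Cultured", "start_time": "2130-01-05 09:40:00", "cgid": 17},
]

EVENT_LOG_COLUMNS = ["case_id", "activity", "timestamp", "resource"]

def to_event_log_csv(records):
    """Return the event log as CSV text, ordered by case and then by time."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=EVENT_LOG_COLUMNS, lineterminator="\n")
    writer.writeheader()
    # The timestamp text sorts chronologically because it runs from year to second.
    for r in sorted(records, key=lambda r: (r["hadm_id"], r["start_time"])):
        writer.writerow({
            "case_id": r["hadm_id"],
            "activity": r["label"],
            "timestamp": r["start_time"],
            "resource": r["cgid"],
        })
    return buffer.getvalue()

event_log_csv = to_event_log_csv(records)
```

A file in this shape can be imported into a Process Mining tool by mapping the columns to case, activity, timestamp and resource.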
Stage 3. Filtering stage:
The filtering stage, as the name suggests, involves filtering the event log based on
the specific requirements of the question such as filtering for particular diagnoses and further
filtering of data to exclude certain variants. This filtering is mainly done with the filters built
into Process Mining tools, especially Disco. At times, MS-Excel can also be used to filter
certain attributes.
Stage 4. Data analysis stage:
The data is analysed to discover different patterns and to gain more insight into the
datasets. Appropriate data analysis techniques are selected based on the expected results,
and the tools required to perform the selected techniques are then identified.
Stage 5. Process Mining stage:
The Process Mining stage includes the selection and application of appropriate
techniques through the use of tools corresponding to those techniques. This stage focuses
on discovery, analysis and improvement of real processes. This can be done through a
variety of Process Mining tools including Disco [12] and ProM [16]. Four types of process
analysis are performed: process discovery, conformance analysis, performance analysis
and organizational analysis.
Process discovery based on the event log results in a model that describes the
activities and paths taken in different cases. This is done through discovery algorithms such
as the fuzzy miner (which is more flexible in nature), the heuristics miner and the inductive
miner, and the result can be represented as a model such as a Petri net.
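To make the idea of discovery concrete, the following Python sketch counts the directly-follows relation of a small event log, which is the basic relation that several discovery algorithms (the heuristics miner among them) build on. It is not the algorithm used by Disco or ProM; the log contents and function names are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical event log rows: (case id, timestamp, activity).
event_log = [
    (1, "2130-01-05 09:00:00", "admission"),
    (1, "2130-01-05 09:40:00", "Imaging"),
    (1, "2130-01-06 12:00:00", "discharge"),
    (2, "2131-03-02 08:00:00", "admission"),
    (2, "2131-03-02 08:30:00", "Imaging"),
    (2, "2131-03-02 09:10:00", "Cultures"),
    (2, "2131-03-03 10:00:00", "discharge"),
]

def traces_from_log(log):
    """Group events by case and order each case by timestamp."""
    by_case = defaultdict(list)
    for case_id, ts, activity in log:
        by_case[case_id].append((ts, activity))
    return {case: [a for _, a in sorted(events)] for case, events in by_case.items()}

def directly_follows(traces):
    """Count how often activity a is immediately followed by activity b."""
    counts = Counter()
    for trace in traces.values():
        counts.update(zip(trace, trace[1:]))
    return counts

dfg = directly_follows(traces_from_log(event_log))
```

Each key of `dfg` is an arc of a simple process map, and its count is the frequency with which that path was taken.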
Conformance analysis focuses on the verification of conformance between the
discovered model and the event log. It is an important step since it seeks to confirm whether
the process is being run as predicted by the model. Replay-based conformance checking
can be applied for exploratory conformance analysis.
Performance analysis is performed from the time perspective. It considers the duration
of activities and the waiting time between them in order to discover bottlenecks. This can be
performed through tools such as ProM.
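The time perspective can be sketched in a few lines of Python. The example below computes the mean waiting time between consecutive events and picks the transition with the largest mean as a candidate bottleneck. The log, the timestamp format and the function name are assumptions for illustration only, and the calculation is far simpler than the performance plug-ins in ProM.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean

FMT = "%Y-%m-%d %H:%M:%S"

# Hypothetical event log rows: (case id, timestamp, activity).
log = [
    (1, "2130-01-05 09:00:00", "admission"),
    (1, "2130-01-05 09:30:00", "Imaging"),
    (1, "2130-01-05 11:30:00", "discharge"),
    (2, "2131-03-02 10:00:00", "admission"),
    (2, "2131-03-02 11:30:00", "Imaging"),
    (2, "2131-03-02 12:30:00", "discharge"),
]

def mean_waiting_minutes(log):
    """Mean time in minutes between each pair of consecutive activities."""
    by_case = defaultdict(list)
    for case_id, ts, activity in log:
        by_case[case_id].append((datetime.strptime(ts, FMT), activity))
    waits = defaultdict(list)
    for events in by_case.values():
        events.sort()
        for (t1, a1), (t2, a2) in zip(events, events[1:]):
            waits[(a1, a2)].append((t2 - t1).total_seconds() / 60)
    return {pair: mean(values) for pair, values in waits.items()}

waiting = mean_waiting_minutes(log)
bottleneck = max(waiting, key=waiting.get)  # transition with the longest mean wait
```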
Organizational analysis is done from the resource perspective. It identifies the roles
and relations between the resources during the case execution. This can be performed
through organizational metrics in ProM.
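As an illustration of one organizational metric, the sketch below counts handovers of work between resources, taking the CGID of the caregiver as the resource. This is a simplified assumption-based example: the rows are hypothetical, and the counting rule (a handover whenever two consecutive events of a case have different resources) is only one possible definition, not necessarily the one applied by ProM.

```python
from collections import Counter, defaultdict

# Hypothetical rows: (case id, timestamp, caregiver id used as the resource).
log = [
    (1, "2130-01-05 09:40:00", 17),
    (1, "2130-01-05 10:15:00", 17),
    (1, "2130-01-05 11:00:00", 21),
    (2, "2131-03-02 08:30:00", 17),
    (2, "2131-03-02 09:00:00", 21),
    (2, "2131-03-02 09:30:00", 30),
]

def handover_of_work(log):
    """Count how often work passes from one resource to a different one."""
    by_case = defaultdict(list)
    for case_id, ts, resource in log:
        by_case[case_id].append((ts, resource))
    handovers = Counter()
    for events in by_case.values():
        resources = [r for _, r in sorted(events)]
        for giver, receiver in zip(resources, resources[1:]):
            if giver != receiver:  # consecutive events by the same resource are ignored
                handovers[(giver, receiver)] += 1
    return handovers

handovers = handover_of_work(log)
```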
A further question-specific analysis is performed through several different techniques

to either obtain a process model through discovery or verify conformance through
conformance checking techniques.
Table 4.3 Analysis guide for general type of question [1]
Question: General Discovery Questions
  Analysis: Heuristics miner algorithm, genetic miner algorithm and inductive miner
  algorithm

Question: General Conformance Questions
  Analysis: Conformance checking and replay

Question: General Performance Questions
  Analysis: Performance analysis technique

Question: General Organizational Questions
  Analysis: Organizational metrics, such as handover of work, doing similar tasks,
  working together and subcontracting.
The data analysis and the corresponding Process Mining analysis are repeated
iteratively, refining the data and the results until the required answers are obtained. During
this stage the event logs are filtered further by identifying trends and keeping only the
desired episodes, which also helps reduce the spaghetti effect in the process model.
Stage 6. Result evaluation stage:
This stage involves evaluation of the obtained results by relevant ER experts
(who know the complete process, including each task performed) by means of
questionnaires, interviews and focus groups [1]. The experts may judge that the obtained
results are irrelevant or uncommon, in which case previous stages have to be revisited to
verify whether the techniques and filters were applied appropriately.
5 IMPLEMENTATION AND RESULT
5.1 Data Description and Understanding
The data used is derived from MIMIC-III, a large healthcare-related database
with medical records of over 40,000 patients admitted to a tertiary-care hospital
(Beth Israel Deaconess Medical Center) between 2001 and 2012 [15].
It is a relational database that consists of 26 tables interconnected through several
columns with the suffix ID (for example, the HADM_ID, CGID and ITEMID columns). The data
in these tables aren't directly arranged in the form of event logs. However, 16 of these tables
contain timestamped data (for example, the event tables PROCEDUREEVENTS_MV,
DATETIMEEVENTS and CHARTEVENTS), and they can be linked with other tables that
either contain additional information describing the contents of the event tables
(D_ITEMS, D_ICD_DIAGNOSES) or other information related to the patient
(ADMISSIONS, PATIENTS). Thus, a combination of two or more tables is generally
necessary to generate a meaningful event log. These event logs form the input for the
Process Mining tools.
The ADMISSIONS table contains information about the admission of the patient to the
hospital, such as the time and location of admission as well as discharge. The initial
diagnosis is also contained in this table. The episodes (or cases) are referred to through 2
IDs (SUBJECT_ID, HADM_ID). The SUBJECT_ID is the unique ID of a patient, while the
HADM_ID refers to a single admission at the hospital; that is, a patient has a single
SUBJECT_ID but may have multiple HADM_IDs, each referring to a different admission.
These IDs link the ADMISSIONS table to all the -EVENTS tables. Different -EVENTS
tables contain records of different types of events (for example, the
PROCEDUREEVENTS_MV table contains the records of all the procedures performed on
the patients), described in the form of ITEMIDs. These ITEMIDs on their own impart no
understanding since they are just numbers. The full description of what each ITEMID
means is given in the D_ITEMS table, which needs to be linked to the -EVENTS tables via
the ITEMID.
However, linking all the -EVENTS tables simultaneously to the ADMISSIONS table
leads to too many processes, which are too complicated to be analyzed through the
Process Mining techniques and result in the formation of Spaghetti models. Instead,
each -EVENTS table can be linked separately to the ADMISSIONS table, followed by the
linking of the D_ITEMS table, and the different categories of processes can be analyzed
one by one [5].
The MIMIC-III database is an extension of MIMIC-II, that is, MIMIC-III contains data
from MIMIC-II (collected between 2001 and 2008) in addition to the newly collected data.
This extension was accompanied by a change in the data management software, from
the CareVue (CV) system (used for MIMIC-II) to the MetaVision (MV) system, which
led to several changes in the way the information was stored and interpreted.
Moreover, in order to protect patient confidentiality, all the dates in the database have
been shifted randomly to some point in the future. However, internal consistency with respect
to each patient is maintained, so that the timestamped data for the same patient can be
synchronized across all the tables.
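The effect of such a shift can be shown with a small Python sketch. It only illustrates the principle that a single offset per patient moves the dates while preserving the intervals between events; the offset value, the events and the function names are hypothetical and do not reproduce the actual de-identification procedure of MIMIC-III.

```python
from datetime import datetime, timedelta

def shift(events, offset_days):
    """Shift every timestamp of one patient by the same number of days."""
    offset = timedelta(days=offset_days)
    return [(name, ts + offset) for name, ts in events]

def intervals(events):
    """Time differences between consecutive events."""
    return [later[1] - earlier[1] for earlier, later in zip(events, events[1:])]

# Hypothetical events of one patient with their real timestamps.
original = [
    ("admission", datetime(2010, 3, 1, 9, 0)),
    ("Chest X-Ray", datetime(2010, 3, 1, 9, 45)),
    ("discharge", datetime(2010, 3, 4, 14, 0)),
]

# Hypothetical offset of roughly 120 years, applied to this patient only.
shifted = shift(original, offset_days=43800)
```

Because the durations and the order of events are unchanged, waiting times and case durations computed from the shifted data remain valid for Process Mining, while the calendar dates themselves carry no meaning.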
Since this project specifically seeks to study the processes in the ER, only the
records of those patients admitted through the ER were included. Further, only the records
collected through the MetaVision (MV) system (2008 onwards) were used. A question-driven
approach was applied to these records to describe the processes occurring in the ER. The
questions used were obtained from the study performed by Rojas et al. [1].
With that as a reference, the question considered for this methodology is "What are
the procedure events executed while treating patients diagnosed with pneumonia admitted
through ER?"
5.2 Methodology implementation
Stages from the QDM were implemented as follows:
Stage 1 and 2:
Data extraction and Event log creation:
The data was extracted from the MIMIC-III database followed by its cleansing.
Out of the aforementioned 26 tables, the tables necessary for answering the question
were selected. In this case, the records from the ADMISSIONS, PROCEDUREEVENTS_MV
and D_ITEMS tables were used. This data needed to be cleaned and edited. At first this
did not seem very important; however, when the tables (in comma-separated values
(CSV) format) were imported into the SQL tools to be joined for generating the event
log, several errors appeared. The RDBMS used was MySQL, and the graphical tool used to
import and join the tables was HeidiSQL.
This was the lengthiest of all the stages, given the size, complexity and inconsistency
of the records. Information was missing in several places, which generated data truncation
errors on each import because those rows held fewer values than expected. These rows had
to be excluded manually every time. Further, the date-time formats caused several errors. At
first, the format had to be changed to "yyyy/mm/dd hh:mm:ss" before importing into the
database. In the later stages, the date-time data was treated as text. One possible approach
was to use the option for formatting text as a date; however, that dropped the time part,
which is a crucial component of the event log. Eventually, all steps had to be repeated from
the beginning to include the date-time column from the source.
Only selected columns from the data were imported. Moreover, the records were
filtered to include only the records with the attribute EMERGENCY ROOM ADMIT from the
ADMISSION_LOCATION column in the ADMISSIONS table. Following this, the columns
imported were HADM_ID, ADMIT_TIME, DISCH_TIME and DIAGNOSIS from the
ADMISSIONS table. Preference was given to HADM_ID over SUBJECT_ID since each
patient may have several different entries with different symptoms and diagnoses each time,
and the processes being analyzed were specific towards diagnoses rather than the patients.
The diagnosis was derived from the ADMISSIONS table itself since it contained the brief
diagnosis given in the ER based on the initial symptoms the patient first appeared at the ER
with.
From the PROCEDUREEVENTS_MV table, the HADM_ID, START_TIME, END_TIME
and ITEM_ID columns were imported. The END_TIME column was later dropped, as
mentioned further. From the D_ITEMS table, the records were filtered on the LINKSTO
column, keeping only the rows with the PROCEDUREEVENTS_MV attribute, so that only the
ITEMIDs corresponding to that event table were included. Following the filtration, the ITEMID,
LABEL and CATEGORY columns were imported; the importance of the CATEGORY column
was not clear at first, so it was only included later.
Once the required tables were imported, they had to be joined. The type of join was
decided on the basis of requirements. An example query for joining the ADMISSIONS table
with the PROCEDUREEVENTS_MV table is as follows:
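-- Join each ER admission with its procedure events on HADM_ID.
-- In MySQL, CROSS JOIN combined with a WHERE condition behaves like an INNER JOIN.
INSERT INTO combined_table_1
SELECT a.hadm_id, a.admit_time, a.disch_time, a.diagnosis, p.item_id
FROM admissions a
CROSS JOIN procedure_events p
WHERE a.hadm_id = p.hadm_id;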
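The behaviour of this kind of join can be checked on a toy dataset. The Python sketch below uses an in-memory SQLite database as a stand-in for MySQL, which is an assumption of this example; the table names follow the query above, while the `start_time` column, the ITEMID values and all rows are hypothetical. It shows that CROSS JOIN with a matching WHERE clause returns the same rows as an explicit INNER JOIN.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite used here only as a stand-in for MySQL
conn.executescript("""
CREATE TABLE admissions (hadm_id INTEGER, admit_time TEXT, disch_time TEXT, diagnosis TEXT);
CREATE TABLE procedure_events (hadm_id INTEGER, start_time TEXT, item_id INTEGER);
INSERT INTO admissions VALUES
  (100, '2130-01-05 09:00:00', '2130-01-06 12:00:00', 'PNEUMONIA'),
  (101, '2131-03-02 08:00:00', '2131-03-03 10:00:00', 'SEPSIS');
INSERT INTO procedure_events VALUES
  (100, '2130-01-05 09:40:00', 9001),
  (100, '2130-01-05 10:15:00', 9002),
  (999, '2132-01-01 00:00:00', 9001);
""")

cross_rows = conn.execute("""
    SELECT a.hadm_id, p.item_id
    FROM admissions a
    CROSS JOIN procedure_events p
    WHERE a.hadm_id = p.hadm_id
    ORDER BY p.start_time
""").fetchall()

inner_rows = conn.execute("""
    SELECT a.hadm_id, p.item_id
    FROM admissions a
    INNER JOIN procedure_events p ON a.hadm_id = p.hadm_id
    ORDER BY p.start_time
""").fetchall()
```

Admission 101 (no procedures) and the procedure with HADM_ID 999 (no matching admission) drop out of both results, which is how the join restricts the procedure events to the filtered admissions.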
The newly generated table (combined_table_1) was later joined similarly with the
D_ITEMS table to add 2 more columns (LABEL, CATEGORY). Moreover, it was observed
that the starting (as well as ending) events differed between cases. This was quite
problematic for the application of several Process Mining techniques. Hence the admission
and discharge events had to be added to the event log alongside the procedures. For this,
the HADM_ID and ADMIT_TIME columns were extracted from the ADMISSIONS table to
match the HADM_ID and START_TIME columns of the event log, respectively.
To match the LABEL and CATEGORY columns, corresponding columns were
added with the attribute 'admission'. All the rows from this table were then added to the initial
event log. Similar steps were taken for adding the discharge event, using HADM_ID and
DISCH_TIME instead and setting the LABEL and CATEGORY attributes to DISCHARGE. A
sample of the event log thus generated is shown in fig 5.1.
Fig 5.1 An example of an Event log fragment of Procedure Events
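The addition of admission and discharge events described above can be sketched in Python as follows. This is an assumption-based illustration, not the procedure actually run in HeidiSQL: the row layout, the sample values and the lowercase event names are hypothetical.

```python
def add_boundary_events(event_log, admissions):
    """Return the event log with one admission and one discharge event per case."""
    extended = list(event_log)
    for adm in admissions:
        for time_column, name in (("admit_time", "admission"), ("disch_time", "discharge")):
            extended.append({
                "hadm_id": adm["hadm_id"],
                "start_time": adm[time_column],
                "label": name,      # the same value is used for LABEL ...
                "category": name,   # ... and for CATEGORY
            })
    # Order by case and time so that every trace reads chronologically.
    return sorted(extended, key=lambda row: (row["hadm_id"], row["start_time"]))

# Hypothetical inputs.
event_log = [
    {"hadm_id": 100, "start_time": "2130-01-05 09:40:00",
     "label": "Chest X-Ray", "category": "Imaging"},
]
admissions = [
    {"hadm_id": 100, "admit_time": "2130-01-05 09:00:00",
     "disch_time": "2130-01-06 12:00:00"},
]

extended_log = add_boundary_events(event_log, admissions)
```

With these extra rows every complete case starts with the same activity and ends with the same activity, which is what several Process Mining techniques expect.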
Stage 3. Filtering:
The filtering of the data was done at several levels. As mentioned in the previous
section, the ADMISSIONS table data was filtered to only include the cases that were
admitted through the Emergency Room. This was done through MS-Excel. Later only the
procedure events corresponding to these ER patients were included in the final event log,
through the use of CROSS JOIN with the WHERE clause. This was done through HeidiSQL.
It was noted in the next stage that some of the variants (about 1% of the total) didn't begin
with admission and/or end with discharge. This might have occurred since several records
with data quality issues were excluded in Stage 1. These variants were filtered through
Disco.
Fig 5.2 Example of a filter generated using Disco
Furthermore, to specifically show the processes in Pneumonia patients, the event log
was filtered to only include the cases with the diagnosis of Pneumonia. Hence, it can be
noted that the filtering can be done through several applications as suitable.
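The two filters described in this stage (complete cases only, and a particular diagnosis) can be expressed compactly in Python. The sketch below is a hypothetical equivalent of the Disco filters, not a reproduction of them; the case data, the substring match on the diagnosis text and the function name are assumptions.

```python
def keep_case(trace, diagnosis, wanted="PNEUMONIA"):
    """Keep a case only if it is complete and has the wanted diagnosis."""
    complete = bool(trace) and trace[0] == "admission" and trace[-1] == "discharge"
    return complete and wanted in diagnosis.upper()

# Hypothetical cases: case id -> (ordered activities, diagnosis text).
cases = {
    100: (["admission", "Imaging", "discharge"], "PNEUMONIA"),
    101: (["Imaging", "discharge"], "PNEUMONIA"),               # no admission event
    102: (["admission", "Imaging", "discharge"], "SEPSIS"),     # other diagnosis
    103: (["admission", "Cultures", "discharge"], "Pneumonia; aspiration"),
}

kept = [case_id for case_id, (trace, diagnosis) in cases.items()
        if keep_case(trace, diagnosis)]
```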
Stage 4 and 5. Data analysis and Process Mining analysis:
In this stage, the data were imported into the Process Mining tools and analysed. The
mean duration and frequency of the activities/events were noted (fig 5.3(a)). Further, the
process map generated in Disco with 40% activities and 10% paths, using the labels of the
procedures as the events, is shown in fig 5.3(b).
Fig 5.3(a) Mean activity duration

Fig 5.3(b) Process map with LABEL as activity (40% activities & 10% path)
Upon the analysis of this map, it became clear that further data analysis was
required. The information in the CATEGORY column categorizes the activities into 10 types.
Eventually, CATEGORY was used as the event instead of LABEL. The following map,
showing 100% activities and 10% paths, was generated when these changes were made.
Fig 5.4 Process map with CATEGORY as activity (100% activities & 10% path)
Further, we analyzed the process for a particular diagnosis, in this case pneumonia.
Only the cases with Pneumonia as the diagnosis (around 7 thousand cases) were selected.
Out of these, the roughly 1% of cases that didn't begin with admission and/or didn't end with
discharge were excluded. The data and processes were then analyzed, first in Disco and
later in ProM.
The following fig 5.5(a) and 5.5(b) show the result of process discovery through
Disco, showing 10% and 70% of all the paths, respectively.
Fig 5.5(a) Process map for pneumonia with CATEGORY as activity (100% activities &
10% path)
Fig 5.5(b) Process map for pneumonia with CATEGORY as activity (100% activities &
70% path)
It is to be noted that the darkness of the colour of the activities in the above process
maps is proportional to the frequency of the activities. This is also shown in fig 5.6(a). It
can be observed that imaging and culture procedures are performed quite frequently; in fact,
on average more than one imaging procedure is performed per episode. Thus, the total case
duration could be shortened significantly by any decrease in the duration of these activities,
either by increasing the availability of the tools (or machines) and resources required to
perform them, or by improving the techniques themselves to reduce their actual duration.
Upon further analysis of the frequency of the actual activities (with LABEL as the event), it
can be noted that chest x-rays, EKGs and blood, sputum and urine cultures are done in over
half the cases, on average.
Fig 5.6(a) Mean and relative frequency for CATEGORY as activity
Fig 5.6(b) Mean and relative frequency for LABEL as activity
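The two kinds of figures quoted above (occurrences per episode and share of cases) can be computed from the traces with a few lines of Python. The sketch is a minimal illustration under stated assumptions: the traces are hypothetical, and the two measures are defined here as mean occurrences per case and fraction of cases containing the activity, which may differ from the exact definitions Disco uses for its statistics.

```python
from collections import Counter

# Hypothetical traces: case id -> ordered list of activities (CATEGORY as activity).
traces = {
    1: ["admission", "Imaging", "Imaging", "Imaging", "Cultures", "discharge"],
    2: ["admission", "Imaging", "Imaging", "discharge"],
    3: ["admission", "Cultures", "discharge"],
    4: ["admission", "discharge"],
}

def activity_statistics(traces):
    """Map each activity to (mean occurrences per case, fraction of cases containing it)."""
    n_cases = len(traces)
    total = Counter()
    cases_with = Counter()
    for trace in traces.values():
        total.update(trace)            # every occurrence counts
        cases_with.update(set(trace))  # each case counts at most once
    return {a: (total[a] / n_cases, cases_with[a] / n_cases) for a in total}

stats = activity_statistics(traces)
```

In this toy log, imaging occurs 1.25 times per case on average although only half of the cases contain it, which shows why the two measures are worth reporting separately.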

Following this, conformance checking was done in ProM using the Replay log on Petri
Net for Conformance analysis technique. This required an event log and a Petri Net model.
The Petri Net model was generated through process discovery techniques and is shown in
fig 5.7.
Fig 5.7 Petri net on CATEGORY as events
The results obtained through the conformance analysis of the above Petri Net are
shown in fig 5.8 below.
Fig 5.8 Petri net log replay result
The trace fitness, as shown in fig 5.9(b), is 0.977, which is close to 1, meaning that
the traces fit the process model well. "Access line – peripheral" is one activity
where the moves aren't synchronous, that is, its occurrences in the log had no corresponding
activity in the model that was enabled during the replay.
Fig 5.9 Conformance checking a) Legend b) Global Statistics c) Elements Statistics

At this point, one could attempt to repair the log by adding the missing events.
However, ER processes are highly flexible in nature [1] and the data is noisy, with several
missing events. Discovery algorithms like the heuristics miner were developed to deal with
such data, since they take the log as it is and don't try to repair missing events [6]. The Petri
Net generated by the Interactive Data-Aware Heuristics Miner is shown in fig 5.10.
Fig 5.10 The Petri Net generated by Interactive Data-Aware Heuristics Miner
5.3 Conclusion:
While implementing all these steps, we were able to develop a Data Reference
model, specifying the exact relations of only the necessary tables and their necessary
columns. This model has been described in the next chapter.
The implementation of Process Mining in healthcare through a Question-Driven
Methodology specifically designed for emergency room processes allows us to understand
how the processes are executed in the case of pneumonia patients. It shows which
activities are frequently executed for pneumonia patients and so gives an idea of which
resources to focus on in order to add more structure to the processes and
decrease the waiting time.
6 DATA REFERENCE MODEL
A data reference model based on the MIMIC-III database for ER processes is
described in this section. This model is designed to contain all the data required
to study several ER processes and to answer questions frequently posed by ER experts.
The initial step is to filter only the episodes admitted through the Emergency Room
from the ADMISSIONS table. It should be kept in mind that the other tables contain cases
for all the patients admitted to the hospital, so they cannot be used directly in the event
log. Thus, the ADMISSIONS table forms the base of all the tables (Primary level table), and
any other table to be used must be cross joined to the filtered ADMISSIONS table where
the HADM_IDs from both or all the tables match.
Most -EVENTS tables, namely CHARTEVENTS, DATETIMEEVENTS,
INPUTEVENTS_CV, INPUTEVENTS_MV, LABEVENTS, MICROBIOLOGYEVENTS,
NOTEEVENTS, OUTPUTEVENTS and PROCEDUREEVENTS_MV, form the Secondary
level tables and are linked to the ADMISSIONS table, as explained above. In addition, the
PRESCRIPTIONS and SERVICES tables are also linked to the Primary table through
HADM_ID, and the PATIENTS table is linked through SUBJECT_ID. Note that CALLOUT,
ICUSTAYS, CPTEVENTS, DRGCODES and TRANSFERS could also be added at this level,
but they aren't included in this model since they are either concerned with billing procedures
or don't refer to ER episodes.
The tertiary level tables are further linked to elements in the secondary level tables.
The D_ITEMS table links through ITEMID to all the -EVENTS tables except LABEVENTS and
CPTEVENTS. D_LABITEMS links similarly through ITEMID to LABEVENTS. These two tables
give detailed descriptions of the ITEMIDs recorded in the event tables. The D_CPT table,
which links to the CPTEVENTS table, has also not been used.
The CAREGIVERS table is also a tertiary level table that describes and links through
the CGID to the –EVENT tables. D_ICD_DIAGNOSES and D_ICD_PROCEDURES are
more tables on this level but haven't been used in this model.
The important columns to be included from the tables are as follows:

Primary table:
1. ADMISSIONS
SUBJECT_ID, HADM_ID, ADMITTIME, DISCHTIME, DEATHTIME,

ADMISSION_LOCATION (only to filter ER ADMITS), DISCHARGE_LOCATION,
DIAGNOSIS.
Secondary tables:
1. CHARTEVENTS
HADM_ID, ITEMID, CHARTTIME, CGID
2. DATETIMEEVENTS
3. INPUTEVENTS_CV
4. INPUTEVENTS_MV
HADM_ID, STARTTIME, ENDTIME, ITEMID, CGID
5. LABEVENTS
HADM_ID, ITEMID, CHARTTIME
6. MICROBIOLOGYEVENTS
HADM_ID, CHARTDATE, CHARTTIME
7. NOTEEVENTS
HADM_ID, CHARTDATE, CATEGORY, DESCRIPTION, CGID
8. OUTPUTEVENTS
HADM_ID, CHARTTIME, ITEMID, CGID
9. PATIENTS
SUBJECT_ID, GENDER, DOB, DOD
10. PRESCRIPTIONS
HADM_ID, STARTDATE, ENDDATE, DRUG
11. PROCEDUREEVENTS_MV
HADM_ID, STARTTIME, ENDTIME, ITEMID, CGID

12. SERVICES
HADM_ID, TRANSFERTIME, PREV_SERVICE, CURR_SERVICE
Tertiary level tables:
1. D_ITEMS
ITEMID, LABEL, CATEGORY; this table is to be filtered on the LINKSTO column, depending
upon which -EVENTS table it is being linked to.
2. D_LABITEMS
ITEMID, LABEL, CATEGORY
3. CAREGIVERS
CGID, LABEL
The primary keys that are required to link with other tables are highlighted. Other
columns may be included, if and when needed, for further details.
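The three levels of the model can also be written down as a small data structure, which makes the chain of joins for any table explicit. The Python sketch below is an illustrative encoding under simplifying assumptions: only a subset of the tables is listed, each table is given a single parent link (D_ITEMS is tied to PROCEDUREEVENTS_MV here although it can link to several event tables), and the dictionary layout and function name are hypothetical.

```python
# Each entry: table -> level and the (parent table, linking column), if any.
DATA_REFERENCE_MODEL = {
    "ADMISSIONS": {"level": "primary", "link": None},
    "PROCEDUREEVENTS_MV": {"level": "secondary", "link": ("ADMISSIONS", "HADM_ID")},
    "LABEVENTS": {"level": "secondary", "link": ("ADMISSIONS", "HADM_ID")},
    "PATIENTS": {"level": "secondary", "link": ("ADMISSIONS", "SUBJECT_ID")},
    "D_ITEMS": {"level": "tertiary", "link": ("PROCEDUREEVENTS_MV", "ITEMID")},
    "D_LABITEMS": {"level": "tertiary", "link": ("LABEVENTS", "ITEMID")},
}

def join_path(table):
    """List the joins needed to connect `table` back to the primary table."""
    path = []
    while DATA_REFERENCE_MODEL[table]["link"] is not None:
        parent, key = DATA_REFERENCE_MODEL[table]["link"]
        path.append((table, parent, key))
        table = parent
    return path

procedure_label_path = join_path("D_ITEMS")
```

For the question studied in this project, `procedure_label_path` lists the two joins that were performed: D_ITEMS to PROCEDUREEVENTS_MV on ITEMID, and PROCEDUREEVENTS_MV to ADMISSIONS on HADM_ID.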
7 PROJECT EVALUATION
7.1 General Evaluation
Using the Question-Driven Methodology (QDM) specifically designed for ER
processes (proposed by Rojas et al. [1]) together with Process Mining techniques in
healthcare, we sought to find the activities executed for treating patients having pneumonia,
with the help of a Data Reference Model (DRM) developed from the MIMIC-III database.
Different process models were generated using software such as ProM and
Disco.
Analysing these process models showed which activities are frequently executed for
pneumonia patients, which gives an idea of which resources to focus on in order to add
more structure to the processes and decrease the waiting time.
7.2 Personal Reflection
Through this project, I learned to trust the process and to trust myself in order to
successfully reach the goal, which could never have been achieved without the help of my
professors. Moreover, I learned how to handle a larger project by myself.
I learned to write a thesis and went through all its processes, such as learning more
about academic writing, referencing and citation.
This project taught me responsibility from conception to completion.
This project gave me a golden opportunity to read and learn about the work of many other
researchers in related areas, and through that I learned that the world of work is full of
collaborative efforts and of the relationships one builds with the subject and its authors.
I learned to use software and write code, which was my biggest fear.
I mostly enjoyed the reading and understanding part of the project but found the report
writing part quite difficult.
The next time I work on another project, it will be much easier for
me because of this experience. The experience of handling a project on my own,
going through so many data tables, learning and understanding them thoroughly,
implementing the procedure and getting the result will definitely help me in the real world,
and I'm very grateful for it.
8 CONCLUSION AND FUTURE WORK
In this project, the processes in the ER, particularly for patients with pneumonia, were
studied. This also gave us more insight into the frequency of the activities executed, which in
turn can help manage the resources beforehand.
The scope of this project was to answer one example of the FPQs posed by ER experts.
Future work may involve answering further, more complex FPQs. Moreover, there is
considerable scope for the development of a search platform for healthcare that connects the
actual databases of every hospital within the country and applies PM techniques to them,
through which we could gain more insight not only into the workings of the ER but also into
every sector of healthcare.
With the help of PM techniques in Healthcare we can gain more insights about:
1. The number of patients in need of healthcare and the season-wise epidemiology.
2. How much healthcare staff and of what specialization is required in that part of the
country for that particular season.
3. How much inventory we require and how to manage it beforehand.
4. The prior actions to be taken by the government for the citizens.
This would make healthcare much more manageable and efficient, decreasing morbidity
and mortality.
REFERENCES
1. Rojas, E. et al. Question-Driven Methodology for Analyzing Emergency Room Processes
Using Process Mining. Appl. Sci. 2017, 7, 302, pp. 1-29.
2. Rojas, E. et al. Process mining in healthcare: A literature review. Journal of Biomedical
Informatics 61 (2016), pp. 224–236.
3. Alvarez, C. et al. Discovering role interaction models in the Emergency Room using
Process Mining. Journal of Biomedical Informatics 78 (2018), pp. 60–77.
4. Mans, Ronny S., van der Aalst, Wil M.P. and Vanwersch, Rob J.B. Process Mining in
Healthcare. Switzerland: Springer International Publishing AG, 2015.
5. Kong, Xiaoqing. Process Mining of Big Data in Healthcare. MSc Computer Science. The
University of Leeds. 2016-17.
6. Mans, Ronny S. et al. Repairing Event Logs Using Stochastic Process Models.
[Online]. [No Year]. [Accessed 11 Feb 2019]. Available from:
https://pdfs.semanticscholar.org/
7. National Center for Biotechnology Information. Improving the Nation’s Health Care
System. [Online]. 2009. [Accessed 5 Feb 2019]. Available from:
https://www.ncbi.nlm.nih.gov/
8. Wikipedia. Health care. [Online]. 2019. [Accessed 21 April 2019]. Available from:
https://en.wikipedia.org/
9. Sakr, M. and Wardrope, J. Casualty, accident and emergency, or emergency medicine,
the evolution. [Online]. 2000. [Accessed 3 March 2019]. Available from:
https://www.ncbi.nlm.nih.gov/
10. Wikipedia. Emergency medicine. [Online]. 2019. [Accessed 09 Feb 2019]. Available from:
https://en.wikipedia.org/
11. Eindhoven University of Technology. Process Mining. [Online]. 2016. [Accessed 18 July
2018]. Available from: http://www.processmining.org/
12. Fluxicon BV. Discover Your Processes. [Online]. 2019. [Accessed 18 July 2018].
Available from: https://fluxicon.com/disco/
13. Oracle Corporation and/or its affiliates. MySQL Documentation. [Online]. 2019.
[Accessed 05 March 2019]. Available from: https://dev.mysql.com/
14. Becker, A. HeidiSQL. [Online]. 2019. [Accessed 23 Feb 2019]. Available from:
https://www.heidisql.com/
15. MIMIC-III, a freely accessible critical care database. Johnson AEW, Pollard TJ, Shen L,
Lehman L, Feng M, Ghassemi M, Moody B, Szolovits P, Celi LA, and Mark RG. MIMIC-III
Critical Care Database. [Online]. 2016. [Accessed 21 June 2018]. Available from:
http://www.nature.com/
16. Eindhoven University of Technology. ProM Tools. [Online]. 2010. [Accessed 5 July
2018]. Available from: http://www.promtools.org/
17. Coursera. Process Mining: Data Science In Action. [Online]. 2018. [Accessed 10 Oct 2018].
Available from: https://www.coursera.org/
18. Future Learn. Introduction To Process Mining With ProM. [Online]. 2018. [Accessed 21
Oct 2018]. Available from: https://www.futurelearn.com/
39
APPENDIX A
Tables in MIMIC-III
ADMISSIONS: Defines a patient’s hospital admission
CALLOUT: Provides information when a patient was READY for discharge from the ICU,
and when the patient was actually discharged from the ICU
CAREGIVERS: Defines the role of caregivers
CHARTEVENTS: Contains all charted data for all patients
CPTEVENTS: Contains current procedural terminology (CPT) codes, which facilitate billing
for procedures performed on patients
D_CPT: High-level definitions for current procedural terminology (CPT) codes
D_ICD_DIAGNOSES: Definition table for ICD diagnoses
D_ICD_PROCEDURES: Definition table for ICD procedures
D_ITEMS: Definition table for all items in the ICU databases
D_LABITEMS: Definition table for all laboratory measurements
DATETIMEEVENTS: Contains all date formatted data
DIAGNOSES_ICD: Contains ICD diagnoses for patients, most notably ICD-9 diagnoses
DRGCODES: Contains diagnosis related groups (DRG) codes for patients
ICUSTAYS: Defines each ICUSTAY_ID in the database, i.e. defines a single ICU stay
INPUTEVENTS_CV: Input data for patients
INPUTEVENTS_MV: Input data for patients
LABEVENTS: Contains all laboratory measurements for a given patient, including outpatient
data
MICROBIOLOGYEVENTS: Contains microbiology information, including tests performed

and sensitivities
NOTEEVENTS: Contains all notes for patients
OUTPUTEVENTS: Output data for patients
PATIENTS: Defines each SUBJECT_ID in the database, i.e. defines a single patient

PRESCRIPTIONS: Contains medication related order entries, i.e. prescriptions
PROCEDUREEVENTS_MV: Contains procedures for patients
PROCEDURES_ICD: Contains ICD procedures for patients, most notably ICD-9 procedures
SERVICES: Lists services that a patient was admitted/transferred under
TRANSFERS: Physical locations for patients throughout their hospital stay

APPENDIX B
List of queries run in HeidiSQL
CODE 1:
SELECT p.hadm_id, p.start_time , p.item_id , d.label , d.category
FROM procedure_event p
CROSS JOIN d_items d
WHERE p.item_id = d.item_id ;
CODE 2:
INSERT INTO combined_table_1 (hadm_id , start_time , item_id , label , category)
SELECT hadm_id , admit_time , èvent` , èvent`
FROM admission_event ;
CODE 3:
INSERT INTO combined_table_1 (hadm_id , start_time , item_id , label , category)
SELECT hadm_id , disch_time , èvent` , èvent`
FROM discharge_event ;
CODE 4:
SELECT c.hadm_id, c.start_time, c.item_id, c.label, c.category, d.diagnosis
FROM combined_table_1 c
INNER JOIN diagnosis_table d
ON c.hadm_id = d.hadm_id;
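
The four queries together build one event log: procedures are joined to their item definitions, admission and discharge rows are appended, and each row is then tagged with the diagnosis of its hospital admission. The following is only a minimal sketch of that flow, not the queries as they were run. It assumes SQLite (through Python's sqlite3 module) instead of MySQL and HeidiSQL, uses a few invented toy rows, and fills the admission and discharge rows with placeholder values (NULL item_id, the labels 'Admission' and 'Discharge', the category 'event') that are assumptions made for illustration. The helper names make_toy_db and build_event_log are hypothetical.

```python
import sqlite3


def make_toy_db():
    """Create an in-memory database with invented rows (not MIMIC-III data)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE procedure_event (hadm_id INTEGER, start_time TEXT, item_id INTEGER);
        CREATE TABLE d_items (item_id INTEGER, label TEXT, category TEXT);
        CREATE TABLE admission_event (hadm_id INTEGER, admit_time TEXT);
        CREATE TABLE discharge_event (hadm_id INTEGER, disch_time TEXT);
        CREATE TABLE diagnosis_table (hadm_id INTEGER, diagnosis TEXT);
        INSERT INTO procedure_event VALUES (1, '2101-01-01 10:00:00', 500);
        INSERT INTO d_items VALUES (500, 'Chest X-Ray', 'Imaging');
        INSERT INTO admission_event VALUES (1, '2101-01-01 08:00:00');
        INSERT INTO discharge_event VALUES (1, '2101-01-03 12:00:00');
        INSERT INTO diagnosis_table VALUES (1, 'PNEUMONIA');
    """)
    return conn


def build_event_log(conn):
    """Return (hadm_id, start_time, label, diagnosis) rows ordered per admission."""
    cur = conn.cursor()
    # Step 1 (cf. CODE 1): procedures joined to their item definitions.
    cur.execute("""
        CREATE TABLE combined_table_1 AS
        SELECT p.hadm_id, p.start_time, p.item_id, d.label, d.category
        FROM procedure_event p
        INNER JOIN d_items d ON p.item_id = d.item_id
    """)
    # Steps 2 and 3 (cf. CODE 2 and CODE 3): append admission and discharge rows.
    # ASSUMPTION: NULL item_id and the literal label/category values are placeholders.
    cur.execute("""
        INSERT INTO combined_table_1 (hadm_id, start_time, item_id, label, category)
        SELECT hadm_id, admit_time, NULL, 'Admission', 'event' FROM admission_event
    """)
    cur.execute("""
        INSERT INTO combined_table_1 (hadm_id, start_time, item_id, label, category)
        SELECT hadm_id, disch_time, NULL, 'Discharge', 'event' FROM discharge_event
    """)
    # Step 4 (cf. CODE 4): attach the diagnosis and order events within each admission.
    cur.execute("""
        SELECT c.hadm_id, c.start_time, c.label, d.diagnosis
        FROM combined_table_1 c
        INNER JOIN diagnosis_table d ON c.hadm_id = d.hadm_id
        ORDER BY c.hadm_id, c.start_time
    """)
    return cur.fetchall()


if __name__ == "__main__":
    for row in build_event_log(make_toy_db()):
        print(row)
```

In this sketch hadm_id plays the role of the case identifier, label the activity and start_time the timestamp, which are the three fields a process mining tool such as Disco or ProM needs when importing an event log.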

