
Accepted Manuscript

Unsupervised neural networks for clustering emergent patient flows

Marina Resta, Michele Sonnessa, Elena Tanfani, Angela Testi

PII: S2211-6923(16)30135-7
DOI: http://dx.doi.org/10.1016/j.orhc.2017.08.002
Reference: ORHC 127

To appear in: Operations Research for Health Care

Received date : 2 November 2016


Accepted date : 15 August 2017

Please cite this article as: M. Resta, M. Sonnessa, E. Tanfani, A. Testi, Unsupervised neural
networks for clustering emergent patient flows, Operations Research for Health Care (2017),
http://dx.doi.org/10.1016/j.orhc.2017.08.002

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to
our customers we are providing this early version of the manuscript. The manuscript will undergo
copyediting, typesetting, and review of the resulting proof before it is published in its final form.
Please note that during the production process errors may be discovered which could affect the
content, and all legal disclaimers that apply to the journal pertain.
UNSUPERVISED NEURAL NETWORKS FOR CLUSTERING
EMERGENT PATIENT FLOWS

Marina Resta, Michele Sonnessa, Elena Tanfani, Angela Testi


Department of Economics and Business Studies, University of Genova, Via Vivaldi 5, 16126 Genova (Italy)

Abstract

In recent years, hospitals have increasingly been faced with a growing proportion of their inpatient admissions coming
from the fluctuating demand of emergency admissions. The opportunity to move emergency patients with a
decision to admit out of an Emergency Department (ED) is linked to the ability of the hospital to actually
receive them. Indeed, growing concern over public budget constraints implies reducing the number of
inpatient ward beds, making it crucial to improve bed capacity planning. Attention must be focused on
avoiding system bottlenecks such as the boarding in the ED of emergent patients waiting to be admitted into
inpatient hospital wards. Bed management is considered a critical function in managing bed capacity and
smoothing elective and emergent patient flows. In order to support the bed management function, the
clustering and predictive analysis of patient flow data are needed. In this work, we use an unsupervised
neural network technique, namely Self Organizing Maps (SOMs), to explore input data and to extract
significant patterns. A large quantity of data records has been collected over a yearly period to obtain
information related to the arrivals of emergent patients at a medium-sized ED located in the city of Genova.
The aim of the paper is twofold. Our first goal is to develop a new framework, based on SOMs, for the
analysis of healthcare data that include heterogeneous information. Second, we give a seasonal connotation
to the analyzed data, as the SOM can discover clusters and patient profiles that can be used to support bed
capacity planning.

Keywords: Bed Management, Emergency Department Overcrowding, Unsupervised Neural Networks, Self-Organizing Maps (SOMs)

1. INTRODUCTION AND ADDRESSED PROBLEM

In recent years there has been growing concern about reducing overcrowding in Emergency
Departments (EDs), which is widely acknowledged to be an issue of worldwide importance
(Forster et al., 2003). Many studies showed that the boarding of patients in ED hallways, when no
inpatient beds are available, is one major cause of ED overcrowding. A boarded patient is defined
as a patient who remains in the ED after a decision to admit to an inpatient ward, because no
inpatient beds are available. ED boarding often reduces the ability to see new patients in the ED,
increases waiting time and length of stay, and leads to suffering for those patients who wait, lying
on trolleys in emergency department corridors for hours, and even days, as well as to the
dissatisfaction of the emergency department staff (Derlet and Richards, 2000; Richards et al., 2011).
The problem is further complicated by the current decrease in hospital beds due to financial
concerns about public expenditure. In particular, Italy is the country in Europe with the lowest ratio of
beds to population (about 3.4 per 1,000 inhabitants vs. the 6.3 European average, Eurostat 2014).

Facing ED boarding requires recognizing that it is necessary to manage the whole bed capacity,
considering emergent and elective admissions simultaneously, in order to better smooth intra-hospital
patient flows from the ED to inpatient wards.
The problem is not novel, and one suggested solution is the introduction of the so-called Bed Manager
(BM) function within the hospital organization (Proudlove et al., 2003; Proudlove et al., 2007).
About twenty years ago, Green and Armstrong (1994) had already conceptualized the BM function as
the way of keeping a balance between flexibility for admitting emergency patients and high bed
occupancy, which is an indicator of good hospital management.
The control of the whole set of patient flows is obviously possible only with the help of an on-line
system able to provide early information about pending admissions and the acute beds available
(Tortorella et al., 2013). This can be done, for instance, by visualizing patient flows in real time by
means of a tool which collects and filters the information from the ED and inpatient wards, thus
supporting hospital bed managers in their daily decision making (Jensen et al., 2013).
The BM function should be supplemented by other techniques and tools that allow classifying and
clustering patients who arrive at the ED and forecasting the demand for inpatient hospital beds coming
from the ED.
In order to support the bed management function, a deep data investigation aimed at classifying,
clustering and predicting patient flows is needed.
In this work, we explore the potential of machine learning techniques to support the bed
management function. As is widely known, machine learning (Mitchell, 1997) is a subset of
computer science that deals with techniques and algorithms, based on empirical data, aimed at
producing new knowledge about the observed phenomena. Data mining techniques can be used to learn
the relationships between the critical features of the instances and the performance of algorithms
(Smith-Miles and Lopes, 2012). The main goal is to learn as much information as possible from
large amounts of data to support more informed decisions, and to transform new knowledge into an
understandable structure for further use (Fayyad et al., 1996).
In particular, we use a kind of Artificial Neural Network (ANN), referred to as Kohonen's Self-
Organising Maps (SOMs), to explore input data and extract significant patterns (Kohonen 1982,
1997). The main rationale for using SOMs over more traditional methods is their inherent local
modeling property and the topology preservation of units, which enhance the interpretability of the dynamics.
This study was developed with the collaboration of the Local Health Government of the Liguria
Region (Italy), which helped us to collect and analyse the main data related to the flows of patients in a
medium-sized ED located in the city of Genova.
The paper is organized as follows. In Section 2 the main literature of interest is reviewed and the
novelties of the study underlined. The methodology in use is introduced and explained in detail in
Section 3 where the application to a real case study is also presented. In Section 4 the results are
illustrated and discussed. Conclusions and future directions of research end the paper.

2. LITERATURE REVIEW
As stated in the previous section, in this work we focus on ED boarding and overcrowding
and on the potential of bed management to address the problem and avoid bottlenecks in the patient
flows. As a matter of fact, there is a wide variety of contributions dealing with those issues, either
from a more strictly mathematical viewpoint or from a more general quantitative perspective. Here we are
mainly concerned with highlighting both the major points of contact between the earlier works on this
topic and our paper, and the elements of novelty in our approach.
Starting from the evidence that ED boarding and access blocks can potentially represent a threat to
patient safety, Boyle et al. (2014) developed a model of bed management based on feature
matching, in order to extract the similarities among various days. They extracted all days within the
historical dataset that match the day type (Sunday, Monday, public holiday, etc.) within a 4-week
window centered on the day of interest, and applied a computational predictive model based on
smoothing techniques as well as on multiple regression and autoregressive integrated moving
average models. Alternatively, Chan et al. (2012) focused on the complexity of the Emergency
Department (ED) as the key to analyzing hospital crowding. They then discussed a machine learning
model which can identify the factors causing ED crowding and validate the coping strategies of
hospitals. The model discussed in Chan et al. (2012) first introduces the decision tree method to fit
a nonlinear association and obtain intelligent grading rules of ED crowding; then it integrates the
intelligent grading rules and indexes of coping strategies to construct a hierarchical linear model.
The final outcome is a model which is able to manage the traditional modeling issues of high
correlation among independent variables and non-convergence.
Furthermore, Kannampallil et al. (2011) examined various aspects of complexity and proposed a
theoretical lens for understanding and studying complexity in healthcare systems based on the
degree of interrelatedness of system components. Along a research trail focusing on
both the complexity of hospital procedures and the strengths of machine learning methods, there is
also the contribution of Gopakumar et al. (2016), where the authors compared the efficacy of five
forecasting models to provide a reasonable estimate of total next-day discharges, which can aid
efficient bed management. The compared forecasting methods include Autoregressive Integrated
Moving Average (ARIMA), Autoregressive Moving Average with Exogenous Variables (ARMAX)
as described by Brockwell and Davis (2009), k-nearest neighbours (Altman, 1992), random forests (Ho,
1995; Ho, 1998; Hastie et al., 2008), and support vector regression (Cortes and Vapnik, 1995;
Suykens and Vandewalle, 1999; Smola and Schölkopf, 2004). The challenge lies in dealing with
large amounts of discharge noise introduced by the nonlinear nature of hospital procedures, and the
unavailability of real-time clinical information in wards. The forecasting quality in this case was
checked using mean forecast error, mean absolute error, symmetric mean absolute percentage error,
and root mean square error (Armstrong, 1985). Random forest and support vector regression models
resulted in superior performance over traditional autoregressive methods. Similarly, Oliveira et al.
(2014) propose a Data Mining (DM) approach in order to identify relevant data about patient
management and provide decision makers with important information to support their decisions.
The core of the procedure is a set of 48 DM models based on machine learning techniques,
namely Regression Trees (RT) and Support Vector Machines (SVM), used to perform regression
tasks. The regression models were able to predict patient discharges with very promising values of
Relative Absolute Error (RAE). Joy and Jones (2005), on the other hand, discussed a hybrid
methodology, incorporating a neural network and an ARIMA model, to predict a time series of bed
demand, while Bagnasco et al. (2015) highlighted the utility of artificial neural networks in
predicting communication risks. Finally, Gul and Guneri (2015) aimed to forecast patient Length
Of Stay (LOS) using a Feed Forward Artificial Neural Network (FFNN) with predictive input factors
such as patient age, sex, mode of arrival, treatment unit, medical tests and inspection in the ED.

Our work fits into the research vein that focuses on the potential of machine learning techniques to
deal with data mining and bed management issues. However, with respect to the existing literature,
we add some elements of innovation.
A major element of novelty resides in the method used, i.e. a kind of Artificial Neural Network
(ANN) called Kohonen's Self-Organising Map (Kohonen 1982, 1997). Among the many machine
learning algorithms, ANNs are inspired by the functioning of the human brain to
estimate unknown functions that depend on a large number of inputs (McCulloch and Pitts, 1943;
MacKay, 2003). Kohonen's Self-Organising Maps (SOMs) are a particular kind of ANN using
unsupervised learning to produce a low-dimensional (typically two-dimensional) map, that is, a
discretized representation of the input space (Von der Malsburg, 1973). SOMs have been largely
applied in a variety of fields, including economic dynamics, accounting and financial reporting
(Back et al., 2001; Sarlin, 2013; Peat and Jones, 2014; Resta, 2016): the main strength of this
approach relies on the possibility to extract intrinsic patterns, letting the data literally speak for
themselves (Martin-del-Brio and Serrano-Cinca, 1993; Resta, 2016). Moreover, SOMs can be
applied not only to numerical information, but also to qualitative data extracted from annual reports
(Visa et al. 2000). In general, although a large number of research papers agree that the SOM is a very
suitable technique to investigate financial data (Schreck et al. 2007; Budayan et al. 2009) and for data
mining in its broadest sense (Ong and Abidi, 1999), to the best of our knowledge, there is no
paper directly applying SOMs to support bed management strategies.
This paper aims at filling this gap, suggesting a pilot case study from which to further develop a more
consolidated framework for the use of SOMs on similar problems. More in detail, our work fits in
the existing literature by contributing in at least two directions:
first, developing a new framework, based on SOMs, for the analysis of healthcare data that include
heterogeneous information; second, giving a seasonal connotation to the analyzed dataset, as the
SOMs can be used to discover clusters and patient profiles that can be used to support hospital
and bed management policies.

3. MATERIALS AND METHODS

3.1 Methodology

The Self Organizing Map (SOM) algorithm is based on competitive learning, or Vector
Quantization (VQ). Thus, instead of learning the input features by a mechanism of error correction
(as ANNs generally do), the SOM represents the input space by way of neighbourhood relationships
that keep the existing topological properties unchanged. The result is a projection from the original
higher dimensional input space onto a bi-dimensional grid where neighbouring nodes represent patterns
that are closer in the input space. This kind of multidimensional scaling makes the SOM particularly
appealing as a tool to manage data characterized by high complexity.
From the technical viewpoint, we assume that M is the m x k bi-dimensional projection grid whose
elements (units, nodes) are arranged into m rows and k columns; each unit of M is associated with
an array w_{i,j} (i = 1, ..., m, j = 1, ..., k) with as many components as the dimension of the data in the
input space. The process leading to VQ consists of a number of steps, represented in the flowchart of
Figure 1 and summarised below:
1. The arrays w_{i,j} (i = 1, ..., m, j = 1, ..., k) are initialized at random.
2. For t = 1, an input vector x(t) is taken from the r-dimensional input space X.
3. For each node in the map:
   a) the Euclidean distance is computed to evaluate the similarity between the input
      vector and the map's nodes:

      d_E\big(x(t), w_{i,j}(t)\big) = \sqrt{\big(x(t) - w_{i,j}(t)\big)^{T}\big(x(t) - w_{i,j}(t)\big)}     (1)

      where T is the transposition operator.
   b) the node with the smallest distance from x(t) is called the Best Matching Unit (BMU) and
      it is labelled by w^{*}_{i,j}(t):

      w^{*}_{i,j}(t) = \arg\min_{w_{i,j}(t) \in M} d_E\big(x(t), w_{i,j}(t)\big)     (2)

4. The nodes in the neighbourhood of the BMU (including the BMU itself) are updated as
   follows:

      w_{i,j}(t+1) = w_{i,j}(t) + \alpha(t)\, h\big(t, x(t), w^{*}_{i,j}(t)\big)\, \big(x(t) - w_{i,j}(t)\big)     (3)

   where \alpha(t) is a scalar factor in the range (0,1), decreasing over time t, that defines the size of
   the correction. In particular, as time goes on, \alpha(t) decreases from values close to one (meaning
   maximum correction) to values close to zero (no correction at all). Finally, h(\cdot) is
   the neighbourhood function modelling the distance between the map nodes and the BMU,
   which, in its simplest shape, is given by:

      h\big(p_{BMU}, p_{w_{i,j}}, t\big) = e^{-\,|p_{BMU} - p_{w_{i,j}}| / \sigma(t)}     (4)

   where p_{BMU} and p_{w_{i,j}} are the grid coordinates of the BMU and of the generic map node,
   respectively, and \sigma(t) is a neighbourhood width decreasing over time.
5. Increase t and repeat from Step 2 until all the patterns have been presented to the map at least
   once.

Figure 1. Flowchart of the SOM learning algorithm: weights are randomly initialized, an input vector is selected, the Euclidean distance to each node is computed, the Best Matching Unit is selected and its neighbours are updated, until all patterns have been presented at least once.
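As an illustration, steps 1-5 can be condensed into a short online training loop. The following is a minimal sketch in plain Python/NumPy: the map size, the linearly decreasing learning rate and the exponentially decaying neighbourhood radius are illustrative choices consistent with Eqs. (3)-(4), not the exact settings used in this study.

```python
import numpy as np

def train_som(X, m=19, k=19, n_epochs=10, alpha0=1.0, seed=0):
    """Online SOM training following steps 1-5: random initialization, BMU search,
    neighbourhood update with decaying learning rate and radius."""
    rng = np.random.default_rng(seed)
    n, r = X.shape                                    # N input vectors of dimension r
    W = rng.random((m, k, r))                         # step 1: random initialization
    grid = np.dstack(np.meshgrid(np.arange(m), np.arange(k), indexing="ij"))  # node coordinates
    t_max = n_epochs * n
    t = 0
    for _ in range(n_epochs):
        for x in X[rng.permutation(n)]:               # step 2: present inputs in random order
            alpha = alpha0 * (1.0 - t / t_max)        # linearly decreasing learning rate
            sigma = max(min(m, k) * (1.0 - t / t_max), 1.0)   # shrinking neighbourhood radius
            d = np.linalg.norm(W - x, axis=2)         # step 3a: Euclidean distance to every node
            bmu = np.unravel_index(d.argmin(), d.shape)       # step 3b: Best Matching Unit
            grid_dist = np.linalg.norm(grid - np.array(bmu), axis=2)
            h = np.exp(-grid_dist / sigma)            # Eq. (4): neighbourhood function
            W += alpha * h[..., None] * (x - W)       # step 4 / Eq. (3): update BMU and neighbours
            t += 1
    return W
```

Calling, e.g., train_som(X) on a scaled data matrix X returns the array of prototype vectors from which a U-Matrix such as those discussed below can be derived.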

The goodness of the SOM representation of the input space can be evaluated by several error
measures (Villmann et al., 1997). Those of widespread use, also used in this study, are the
Quantization Error (QE) and the Topographic Error (TE).
The QE is computed by determining the average distance of the sample vectors to the cluster
centroids by which they are represented:

   QE = \frac{1}{N} \sum_{i=1}^{N} d_E(x_i, w^{*})     (5)

where w^{*} is the BMU for the i-th input x_i and N is the number of input vectors used to train the map.
On the other hand, TE is a topology preservation measure; for all data samples, it requires determining
the respective best and second-best matching units, as follows:

   TE = \frac{1}{N} \sum_{j=1}^{N} 1_u(x_j)     (6)

where 1_u(x_j) is the indicator function assuming value 1 if the first and second Best Matching
Units of x_j are not direct neighbours of each other, and zero otherwise.
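Both error measures can be computed directly from the trained prototypes. The sketch below assumes the prototype array W produced by the hypothetical train_som function above and treats "direct neighbours" as 4-connected grid positions, which is an assumption of this illustration.

```python
import numpy as np

def _bmu_distances(X, W):
    """Squared Euclidean distances between every input vector and every prototype."""
    m, k, r = W.shape
    flat = W.reshape(m * k, r)
    d2 = (X ** 2).sum(1)[:, None] - 2.0 * X @ flat.T + (flat ** 2).sum(1)[None, :]
    return np.maximum(d2, 0.0)               # guard against tiny negative values from rounding

def quantization_error(X, W):
    """Eq. (5): average distance between each input and its Best Matching Unit."""
    return np.sqrt(_bmu_distances(X, W).min(axis=1)).mean()

def topographic_error(X, W):
    """Eq. (6): share of inputs whose best and second-best matching units
    are not direct (4-connected) neighbours on the grid."""
    m, k, _ = W.shape
    best2 = np.argsort(_bmu_distances(X, W), axis=1)[:, :2]    # BMU and second BMU per input
    rows, cols = np.unravel_index(best2, (m, k))
    manhattan = np.abs(rows[:, 0] - rows[:, 1]) + np.abs(cols[:, 0] - cols[:, 1])
    return (manhattan > 1).mean()
```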
Clearly, the major strength of the SOM lies in offering a convenient tool to project high dimensional input
data onto a two-dimensional lattice. Consider for instance Figure 2, where the result of projecting
5-dimensional input samples onto the neural space is shown, originating a Unified distance matrix
(U-Matrix). This figure uses the following coding: hexagons represent the neurons, the colours
indicate the distances between neurons, with different shades of colour varying from deep blue to
yellow. In particular, tones of yellow refer to the largest distances, while both deep and lighter blue
represent smaller distances. The width of the hexagons, on the other hand, gives an idea of the
representativeness of the node: the wider the hexagon, the greater the ability of the neuron to
represent the features of the input space.

Figure 2. The Unified Distance Matrix (U-Matrix) for a sample SOM.

Alternatively, the U-Matrix can be presented as shown in Figure 3, where the organization of the map is
given in terms of the cluster structure of the projected input space.

Figure 3. Cluster organization in a U-Matrix.

A straightforward interpretation of the U-Matrix in Figure 3 suggests that, according to the colour
division, the SOM algorithm has clustered the data into nine groups. The colour difference indicates
that data points in these regions are farther apart.
Figure 3 also suggests the significance of each node of the map, provided as the number of hits
between input patterns and neurons in the SOM. This is directly rendered by indicating the number
of matches between each node and the input patterns. As an example, the hexagon labelled with the
number 61 is a neuron which matches (i.e. is the most similar to) 61 input patterns.
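The hit counts just described can be obtained by assigning each input pattern to its BMU and counting; a minimal sketch, reusing the hypothetical _bmu_distances helper defined above, is:

```python
import numpy as np

def hit_counts(X, W):
    """Number of input patterns whose Best Matching Unit is each neuron of the map."""
    m, k, _ = W.shape
    bmus = _bmu_distances(X, W).argmin(axis=1)          # index of the BMU for every input
    return np.bincount(bmus, minlength=m * k).reshape(m, k)
```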

3.2. Case Study Data Collection

The dataset used in this study includes data from the ED database of the Villa Scassi Hospital
located in Genova (Italy). The database contains the records of the 44,876 patients who accessed
the ED during the yearly period in which the data were collected (from January to December 2015).
The original database contained a considerable amount of information for each patient, including
several demographic variables and clinical data; however, we decided to focus our analysis mainly
on non-clinical variables registered at the triage visit, before any clinical consultation, thus giving
attention to the characteristics of patients when they access the ED. This rationale led us to consider
the input variables listed in Table 1.

Table 1. Input variables in the dataset.

Variable Name            Type
Age                      Numeric, Integer (in years)
Gender                   Categorical, numerically coded
Citizenship              Categorical, numerically coded
Registered Residence     Categorical, numerically coded
Time in ED               Numeric, Integer (in hours)

We performed a data cleaning step, aimed at removing both missing data and badly specified variables,
which led to a 3% cut of the dataset. The final sample was therefore composed of 42,278 patients,
each described by the set of variables in Table 1.
Table 1 directly introduces one of the major issues of the database, and the main cause of the
complexity in managing this kind of data: only Age and Time in ED are numerical variables,
carrying hard quantitative information about the data. The remaining variables represent the kind
of soft information which is a major challenge for any data mining process. As a matter of fact, the
use of SOMs finds its deepest motivation in this issue, as more recent software implementations of the
algorithm make it possible to manage both kinds of data at the same time (Olteanu and Villa-Vialaneix, 2015).
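The analysis relies on a SOM implementation able to handle numerical and categorical variables jointly (Olteanu and Villa-Vialaneix, 2015). When only a standard Euclidean-distance SOM is available, a common workaround, sketched below purely as an illustration and with hypothetical column names, is to one-hot encode the categorical variables and rescale the numeric ones before training:

```python
import pandas as pd

def encode_patients(df):
    """Turn triage records such as those in Table 1 into a purely numeric matrix:
    categorical columns are one-hot encoded, numeric columns scaled to [0, 1]."""
    categorical = ["Gender", "Citizenship", "RegisteredResidence"]   # illustrative column names
    numeric = ["Age", "TimeInED"]
    out = pd.get_dummies(df[categorical].astype("category"))         # one 0/1 column per category
    for col in numeric:                                              # min-max scaling to [0, 1]
        v = df[col].astype(float)
        out[col] = (v - v.min()) / (v.max() - v.min())
    return out.to_numpy(dtype=float)
```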
A preliminary data exploration allowed us to infer the most important features of the categorical
variables, as summarized in Table 2, while basic statistics (min, max, median and main quartiles)
for the numeric variables are provided in Table 3.

Table 2. Basic features for categorical variables in the dataset.

Gender                    Citizenship                          Registered Residence
Male      21,392          NULL                           26    Genova     35,355
Female    21,484          African (AR)                1,684    Liguria     5,432
                          European (ER)              38,130    Italy       1,755
                          American (RA)               2,559    Abroad        334
                          Southern East Asia (SEAR)     477
Total     42,876          Total                      42,876    Total      42,876

Looking at Table 2, one can observe that the sample under observation was approximately equally
distributed with respect to the gender attribute. The citizenship of patients, on the other hand, was
organized into five classes: African (AR), representing 4% of the whole sample; European (ER),
accounting for 89% of the sample; American (RA) and South East Asian (SEAR), representing
6% and 1% of the sample, respectively. The NULL entry represents the case of missing citizenship
data: this is a very residual group, as only in 0.0606% of cases was this information not
provided. As far as the Registered Residence is concerned, 1% of the patients came from
abroad, while the remaining 99% lived either in Genova (82%), other parts of Liguria (13%) or
other parts of Italy (4%).

Table 3. Basic statistics for numerical variables in the dataset.

             Age    Time in ED
Min. 0 0
1st Quartile 36 1
Median 53 3
Mean 54.5 7.36
3rd Quartile 75 7
Max. 115 624

The age of patients accessing the ED ranges from 0 to 115, where both the lower and upper bounds
must be regarded as extreme values, as the median and mean values are very close to each other and
equal to about 54 years. Additionally, as explained in the discussion of the results, using the Age
information we provided each instance in the database with one of the following labels: Y (Young)
for ages lower than fifteen years, AP (Active Population) for people with ages in the range [15,65],
and E (Elders) for people older than 65.
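As a minimal illustration, the age labelling rule just described can be written as:

```python
def age_class(age):
    """Age labels used in the analysis: Y (< 15 years), AP (15-65 years), E (> 65 years)."""
    if age < 15:
        return "Y"
    if age <= 65:
        return "AP"
    return "E"
```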
The variable Time in ED gives the number of hours that patients stay in the ED before being
discharged or admitted into an inpatient hospital ward. This variable shows great variability,
ranging from 0 to 624 hours. The latter value, however, is probably an extreme value, as the mean
and the 3rd quartile are both very close to 7 hours.

3.3. Data Analysis with Self Organizing Maps

We ran Self Organizing Maps (SOMs) on the dataset described in Sec. 3.2. The tests and analysis
have been arranged on a monthly basis, in order to highlight (where present) either the similarities or the
divergences among separate months, hence providing very suitable indications to avoid bed
misallocations. The number of patients in the dataset for each month is given in Table 4.

Table 4. Number of patients per month grouped according to the Age class.

Age class   Jan     Feb     March   April   May     June    July    Aug     Sept    Oct     Nov     Dec
Total       3,779   3,302   3,451   3,493   3,586   3,691   3,918   3,706   3,413   3,556   3,396   3,585
Y           44      41      29      44      33      35      41      38      37      42      51      45
AP          2,255   1,966   2,193   2,154   2,255   2,426   2,545   2,348   2,107   2,173   2,100   2,145
E           1,480   1,295   1,230   1,295   1,298   1,231   1,332   1,320   1,269   1,341   1,245   1,395

The input data for each month consist of a matrix where each row represents a patient accessing
the ED and the five columns report the variables described in Table 1. Before running the SOM,
whenever possible, data were scaled by way of a discrete histogram equalization, i.e., values were
preliminarily ordered, then replaced by their ordinal number and finally scaled in the range [0,1].
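In code, this rank-based scaling can be sketched as follows; how ties are handled is not specified in the text, so the arbitrary but consistent ordering used below is an assumption of this illustration.

```python
import numpy as np

def histogram_equalize(values):
    """Discrete histogram equalization: order the values, replace each by its ordinal
    number, then rescale the ordinal numbers to the range [0, 1]."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values)
    ranks = np.empty_like(values)
    ranks[order] = np.arange(len(values))        # ordinal number of each value
    span = ranks.max() - ranks.min()
    return (ranks - ranks.min()) / span if span > 0 else np.zeros_like(ranks)
```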
Clearly, as in any data-driven approach, applying the SOM methodology also entails a certain degree
of complexity due to the choice of proper parameters for the algorithm in use.
To this aim, and in order to make it easier to both understand the simulation and analyse the results,
we denote by u the generic output of the SOM procedure, so that the following equality holds:
u = f (Xdim, Ydim, AlphaT, InAlpha, Neighb, InRad) (7)

where: Xdim and Ydim are, respectively, the number of rows and columns of the map (with
Xdim,Ydim>1); AlphaT specifies the shape of the learning rate function which drives the learning
process, and InAlpha is the initial value for the learning rate function. Furthermore, Neighb defines
the neighbourhood structure of the net (in order to reduce the complexity of the study, we opted for
a standard cross-shaped neighbourhood). Finally, InRad is the initial radius of the neighbourhood,
where InRad= min(Xdim, Ydim).
Basically, our simulations assume AlphaT to be a linearly decreasing function with initial value equal
to one, and a cross-shaped neighbourhood. We then ran SOMs considering all the possible
permutations of Xdim and Ydim, varying them in the range [12; 20]. This range of variation was
suggested by the rule of thumb proposed in Olteanu and Villa-Vialaneix (2015), according to
which a good number of neurons in a SOM should be at least equal to N/10, where N is the
sample length.
All the listed SOM configurations were run on the monthly databases of patients, for an overall
number of 864 different configurations, 72 for each month. The goodness of the parameters was
tested in two different ways: firstly, we checked the level of the convergence indexes reported in
Section 3 by Eqs. (5) and (6); additionally, we ran the Analysis of Variance (ANOVA) test on each
final map, in order to verify the significance of the variables in use.
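The selection of the map dimensions can be scripted as a simple grid search. The sketch below reuses the hypothetical train_som, quantization_error and topographic_error helpers defined earlier and ranks the configurations by quantization error only, a simplification of the combined QE/TE/ANOVA check described in the text.

```python
import itertools

def best_map_size(X, sizes=range(12, 21)):
    """Train one SOM per (Xdim, Ydim) pair in the tested range and keep the
    configuration with the lowest quantization error."""
    results = []
    for m, k in itertools.product(sizes, sizes):
        W = train_som(X, m=m, k=k)
        results.append(((m, k), quantization_error(X, W), topographic_error(X, W)))
    return min(results, key=lambda r: r[1])       # ((m, k), QE, TE) of the best configuration
```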
For all the examined cases, the best configuration turned out to be the square 19x19 SOM. The
ANOVA scores of the 19x19 SOM are given in Table 5 for each month.
In particular, looking at the results in Table 5, by way of the ANOVA test we try to assess the
significance of the examined variables in contributing to the final layout of the maps (one for each
month) and to the clustering. The number of stars under the F-scores must be intended as a signal to
discard the null hypothesis of equality in mean of the examined variables: in other words, all the
variables considered in the study have significantly different means, and are therefore relevant for the
map definition in every month. Moving to the Topographic Error (TE), as TE measures how well
the topology of the input space is preserved in the map, the low values reported in Table 5 testify
that only a very low proportion of the input vectors is wrongly mapped by the monthly SOM, and
hence the results obtained are very satisfying from the classification viewpoint. Finally, the
Quantization Error (QE) is at low levels, too, in all the observed months, and due to the known
trade-off between QE and TE it did not seem appropriate to attempt any further modification of the
map dimensions in search of better values of either QE or TE.


Table 5. ANOVA scores and values of the Topographic (TE) and Quantization Error (QE) for SOMs with dimensions 19x19.

                      Jan       Feb       March     April     May       June       July      Aug       Sept
Age                   198.848   139.334   106.749   174.835   238.17    745.531    212.045   442.811   157.704
                      ***       ***       ***       ***       ***       ***        ***       ***       ***
Gender                154.213   5.824     76.522    251.531   884.985   1534.711   116.806   971.049   251.496
                      ***       ***       ***       ***       ***       ***        ***       ***       ***
Citizenship           2675.678  534.628   1006.546  950.918   4266.74   14315.77   1221.497  6867.608  636.066
                      ***       ***       ***       ***       ***       ***        ***       ***       ***
Registered Residence  40.985    2.517     14.737    4.933     6.909     23.959     1.911     8.359     4.188
                      ***       ***       ***       ***       ***       ***        ***       ***       ***
Time in ED            22.974    143.158   79.089    39.233    181.698   290.607    13.849    75.521    27.423
                      ***       ***       ***       ***       ***       ***        ***       ***       ***
TE                    0.00194   0.001546  -0.00005  0.00010   0.001624  0.00106    -0.00144  0.001643  0.000811
QE                    0         0.00030   0.00203   0.001718  0.00028   0.00081    0.00332   0.00027   0.00117


4. RESULTS AND DISCUSSION
In this section, we provide the discussion of the results obtained by running SOMs of dimension
19x19 on the database presented in Section 3.
Figure 4 shows the U-Matrix obtained for each month. Different colour shades correspond to
different clusters.

Figure 4. U-Matrix for each month (year 2015).

At a first glance, the information provided in the different months seems very similar, as visual
inspection of the U-Matrices suggests that the numbers of clusters in the maps are close to one
another. Indeed, looking at both the number of emerging clusters (Table 6) and at the analysis of the
composition of the groups and variables (Figures 5 and 6), the apparent homogeneity of Figure 4 is lost.
In Table 6, the percentages express the representativeness of each cluster size with respect to the
monthly number of incoming patients. A straightforward remark at this point is that the
greater/lower number of groups emerging over the 12-month horizon probably corresponds to
different features of the incoming patients. In addition, an interesting clue comes from observing that,
while on average the clusters are equally representative of the sample of patients, anomalies are
present in some months. In fact, from April to July, we encounter both clusters with the highest
significance with respect to the sample size (44.03% and 34.71%) and groups with the lowest
representativeness (0.36% and 0.71%).

Table 6. Percentage of patients for each cluster and month.

Jan Feb March Apr May June July Aug Sept Oct Nov Dec
CL01 13.63% 2.97% 8.61% 17.09% 17.35% 5.47% 13.96% 42.53% 16.76% 13.58% 13.28% 7.45%
CL02 12.73% 10.21% 7.10% 5.70% 9.87% 13.17% 0.71% 7.26% 20.77% 9.45% 10.57% 32.30%
CL03 1.38% 2.00% 11.24% 5.98% 11.57% 31.94% 12.48% 13.65% 12.16% 7.03% 5.24% 3.29%
CL04 11.33% 5.39% 8.72% 13.34% 8.53% 36.25% 34.71% 1.62% 13.30% 2.08% 21.88% 19.22%
CL05 1.83% 2.42% 13.88% 6.44% 4.91% 13.17% 5.79% 6.23% 12.66% 11.87% 16.31% 1.03%
CL06 6.01% 10.27% 4.69% 44.03% 0.36% - 15.90% 5.67% 11.28% 7.17% 32.71% 22.73%
CL07 1.35% 16.05% 5.74% 7.41% 6.16% - 16.44% 15.81% 13.07% 3.77% - 13.97%
CL08 3.89% 1.42% 3.25% - 7.19% - - 7.23% - 11.45% - -
CL09 1.96% 4.63% 14.87% - 5.08% - - - - 18.56% - -
CL10 14.08% 3.88% 4.98% - 7.50% - - - - 15.04% - -
CL11 4.08% 6.15% 1.30% - 1.92% - - - - - - -
CL12 7.94% 3.85% 11.36% - 0.73% - - - - - - -
CL13 6.77% 18.84% 0.96% - 6.16% - - - - - - -
CL14 8.57% 11.93% 3.30% - 10.68% - - - - - - -
CL15 4.47% - - - 1.98% - - - - - - -

Studying how much the input components affect the overall representation provides a first
detailed analysis. This information can be visually observed by examining the SOM weight planes,
i.e. by visualizing the neuron colouring for each single input component. Figure 5 offers a
representation of the five weight planes obtained from the map depicted in Figure 4.
Looking at the contribution of each variable to the cluster composition, we can capture the (possible)
correlation or anti-correlation of each variable in determining the final clustering. In this way one
can study both the organization of the input space provided by the overall SOM (as in Figure 4)
and the impact of each component on the overall structure of the data (as in Figure 5), thus deriving
some important pieces of information concerning the intrinsic features of the dataset. In the case
discussed here we consider five components as determinants of the U-Matrices reported in
Figure 4. A first glance at Figure 5 suggests a possible stronger correlation among the first three
components (Age, Citizenship and Gender), as their colour distributions show some similarities.
Conversely, there is apparently no evident clue to interpret the role of either Registered Residence
or Time in ED.

Figure 5. Component planes in the sample SOM.

Figure 6 supports the above remarks, graphically showing the main features of three months (i.e.
January, June and December) that better represent the variety of the emerging patterns.

[Figure 6 panels (a) January, (b) June and (c) December: for each cluster (x-axis), the percentage of patients to be admitted (y-axis); each circle is annotated with the age classes, citizenship groups and share of Out of Time (OoT) patients of the cluster.]

Figure 6. Main features of the emerging clusters in the months of January (a), June (b) and December (c).

The graphs in Figure 6 must be interpreted in the following way: for each month, we represented on
the x-axis the cluster number, and on the y-axis the percentage of patients admitted into an inpatient
hospital ward after being diagnosed. For each cluster, we then drew a circle whose width is
proportional to the significance of the group, with the colour expressing the dominant gender (blue
for males and red for females), while the connoting features are given on top of each circle. As an
example, by looking at the month of June (middle graph in Figure 6) we can observe that
clusters 3 and 4 are those with the highest representativeness, but while in the former group we find
a greater percentage of male patients, the contrary holds in the case of cluster 4. Additionally, in
both clusters the greatest part of the patients come from the European area, while the other groups
are more heterogeneous with respect to this aspect. Another interesting clue concerns the
percentage of Out of Time (OoT) patients, i.e. those patients whose stay in the ED is longer than
eight hours. Figure 6, in fact, suggests the existence of a correlation between this variable and the
percentage of inpatients.
Table 7 reports the probability of being admitted into inpatient wards, for each cluster and for the
three months analyzed in Figure 6, together with the characteristics of each cluster.
The main outcome of the clustering approach proposed here is the ability to classify patients
according to a set of characteristics detected at the ED entrance and, more interestingly, to assign
to each cluster/patient a probability of requiring an inpatient ward bed.
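Operationally, the per-cluster admission probability shown in Table 7 is the observed admission rate within each cluster. A minimal sketch, assuming a vector of cluster labels and a 0/1 admission flag per patient (both hypothetical variable names), is:

```python
import numpy as np

def admission_probability(cluster_labels, admitted):
    """Share of patients admitted to an inpatient ward, computed per cluster."""
    cluster_labels = np.asarray(cluster_labels)
    admitted = np.asarray(admitted, dtype=float)
    return {c: admitted[cluster_labels == c].mean() for c in np.unique(cluster_labels)}
```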

Table 7. Probability to be admitted into inpatient wards for each cluster.


Cluster    Age            Citizenship              Sex        Admission probability
January
1 Y, AP, E ER, SEAR Women 16%
2 Y, AP ER, SEAR, AR Women 6%
3 AP, E RA, SEAR, AR Women 58%
4 AP ER Men 14%
5 AP, E ER Men 4%
6 Y, AP, E ER, SEAR, AR, RA Women 8%
7 AP, E ER, SEAR, AR, RA Men 4%
8 AP, E ER Men 38%
9 AP, E ER Men 66%
10 AP, E ER Women 27%
11 Y, AP, E RA, AR, SEAR Men 8%
12 Y, AP, E ER Men 16%
13 Y, AP ER, AR Men 5%
14 AP, E ER Men 39%
15 AP, E ER Women 51%
June
1 Y, AP, E SEAR, AR,RA Men 10%
2 Y, AP, E ER, SEAR Men 19%
3 Y, AP, E ER Men 17%
4 Y, AP, E ER Women 22%
5 Y, AP, E ER, SEAR, RA Women 17%
December
1 Y, AP, E SEAR, AR, RA Women 9%
2 Y, AP, E ER, RA Men 18%
3 AP, E ER Women 100%
4 Y, AP ER, AR Women 7%
5 Y, AP, E ER, SEAR, AR, RA Men 16%
6 AP, E ER Women 22%

7 AP, E ER, SEAR, AR Men 19%

Finally, we suggest a possible way to analyze the similarities (and dissimilarities) among the
observed months. To this aim, we compute a convergence/divergence index (CDI) for each month
r (r = 1, 2, ..., 12), defined as follows:

   CDI_{r,s} = \frac{IM_r}{BM_{r,s}},   r = 1, 2, ..., 12     (8)

where:

   IM_r = \sum_{i=1}^{m} \sum_{j=1}^{k-1} d_E\big(w^{r}_{i,j}, w^{r}_{i,j+1}\big)
        + \sum_{i=1}^{m-1} \sum_{j=1}^{k} d_E\big(w^{r}_{i,j}, w^{r}_{i+1,j}\big)
        + \sum_{i=1}^{m-1} \sum_{j=1}^{k-1} \big[ d_E\big(w^{r}_{i,j}, w^{r}_{i+1,j+1}\big) + d_E\big(w^{r}_{i+1,j}, w^{r}_{i,j+1}\big) \big]     (9)

and:

   BM_{r,s} = \sum_{i=1}^{m_r} \sum_{j=1}^{k_s-1} d_E\big(w^{r}_{i,j}, w^{s}_{i,j+1}\big)
            + \sum_{i=1}^{m_r-1} \sum_{j=1}^{k_s} d_E\big(w^{r}_{i,j}, w^{s}_{i+1,j}\big)
            + \sum_{i=1}^{m_r-1} \sum_{j=1}^{k_s-1} \big[ d_E\big(w^{r}_{i,j}, w^{s}_{i+1,j+1}\big) + d_E\big(w^{r}_{i+1,j}, w^{s}_{i,j+1}\big) \big]     (10)

Eq. (9) defines the Intra-Map (IM) distance, with w^{r}_{i,j} denoting the vector associated with the neuron
with coordinates (i,j) in the SOM of the r-th month of the year, while m and k indicate the number
of rows and columns in the map, respectively, as already done in Sec. 3: in the case under
examination we have m = k = 19 for all the examined SOMs. The index is computed as the sum of the
Euclidean distances (d_E) between each couple of adjacent neurons in the SOM referring to the r-th
month. More in detail, the first two addends of (9) evaluate the Euclidean distance between each
couple of horizontally and vertically adjacent neurons in the SOM lattice, while the third addend
adds the distances along the two diagonals of each elementary cell of the grid. Moving to (10), it
defines the Between Maps (BM) distance: while w^{r}_{i,j} is a vector of the SOM trained on the r-th
month data, w^{s}_{i,j} is a vector of the SOM trained on the s-th month, with r≠s, and m_r, k_s
represent the number of rows and columns of the SOMs of the r-th and s-th months, respectively.
It is necessary to highlight that, in its actual formulation as given in (8), CDI_{r,s} works only if the
SOM of the r-th month has the same dimensions as the SOM referring to the s-th month. In fact,
while IM_r is shape independent, BM_{r,s} clearly works only if the SOMs under comparison have the
same number of rows and columns. Future efforts will therefore be directed to generalizing (10) so
that BM can also work between maps with different dimensions. Even in its current form, by
construction, the CDI is an index that evaluates the intra-map distance (IM) over the distance
between two maps (BM), and it allows evaluating how similar two maps are to each other.
The higher the index, the higher the variability of patients clustered by the monthly SOM compared
to the others.
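The index can be computed directly from two prototype arrays of the same shape (such as those produced by the hypothetical train_som sketch in Section 3.1); the sketch below follows Eqs. (8)-(10) as written above and assumes maps of equal dimensions.

```python
import numpy as np

def intra_map_distance(W):
    """Eq. (9): sum of Euclidean distances between horizontally, vertically and
    diagonally adjacent prototype vectors of a single (m x k x r) SOM."""
    horiz = np.linalg.norm(W[:, :-1] - W[:, 1:], axis=2).sum()
    vert = np.linalg.norm(W[:-1, :] - W[1:, :], axis=2).sum()
    diag = (np.linalg.norm(W[:-1, :-1] - W[1:, 1:], axis=2).sum()
            + np.linalg.norm(W[1:, :-1] - W[:-1, 1:], axis=2).sum())
    return horiz + vert + diag

def between_map_distance(Wr, Ws):
    """Eq. (10): the same sums as Eq. (9), but each distance is taken between a neuron
    of the r-th map and the corresponding neighbouring neuron of the s-th map."""
    horiz = np.linalg.norm(Wr[:, :-1] - Ws[:, 1:], axis=2).sum()
    vert = np.linalg.norm(Wr[:-1, :] - Ws[1:, :], axis=2).sum()
    diag = (np.linalg.norm(Wr[:-1, :-1] - Ws[1:, 1:], axis=2).sum()
            + np.linalg.norm(Wr[1:, :-1] - Ws[:-1, 1:], axis=2).sum())
    return horiz + vert + diag

def cdi(Wr, Ws):
    """Eq. (8): convergence/divergence index between the SOMs of months r and s."""
    return intra_map_distance(Wr) / between_map_distance(Wr, Ws)
```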
A synthetic analysis of the map similarity is given in Figure 7. Quite surprisingly, July is very far
from any other month, and it is also the month with the highest number of patients accessing
the ED. A possible explanation of such a difference is that the presence of tourists in the
Liguria region significantly alters the composition of the patient population. Such a phenomenon
should affect the month of August as well; however, no equally clear evidence is observable in that
case.

Figure 7. Radar plot for the behavior of the CDI in the observed months.

Furthermore, Figure 8 gives an alternative representation of the CDI, i.e. of how far the maps are
from one another. In general, it can be noticed that winter months are characterized by a less uniform
data distribution compared to other months (except for July).


Figure 8. Overall absolute distance of each month compared to all the others.


5. CONCLUSIONS
In this paper, a novel approach for patient classification is proposed on the basis of Self Organizing
Maps (SOMs). This technique has been extensively used to classify data patterns by way of a bi-
dimensional representation, and to create clusters that group input data with common characteristics
into subsets. The use of the SOM is here examined on an inter-temporal basis, as the classification of
input patterns is applied month by month, and a new method is introduced to compare cluster
compositions over a temporal horizon.
This SOM-based model has been applied to a case study, focusing on data taken from the
Emergency Department of a hospital located in Genova, Italy. In our experimental analysis, we
applied the SOM classification technique on a monthly basis, using mainly demographic features of
the patients accessing the hospital. We then extracted the clusters for each month and analyzed them.
Furthermore, our newly introduced metric to measure the distance among monthly SOMs made it
possible to assess the level of generalization of the classification month by month.
The proposed method can help in gaining a deeper understanding of the flow of patients arriving at
the ED. It is not only a question of the number of patients, but also of their characteristics.
Until now, the only characteristics detected are the clinical ones, revealed at the moment of
triage, which reflect the clinical needs of patients. What is lacking is knowledge of the expected need
of assistance, mainly the demand addressed to hospital bed capacity.
The analysis conducted at various levels of detail suggests that the clustering approach makes it
possible to classify patients as soon as they enter the ED and to provide, for each patient, the
probability of being admitted into an inpatient hospital ward and thus of requiring an inpatient bed.
This information allows the bed manager to intervene before bottlenecks arise and contributes to
improving this activity by triggering the appropriate tools in advance to better allocate patients to beds.
Future work will be directed at defining an automatic procedure, based on the methodology proposed
here, that, given new input data, is able to produce admission probabilities depending on patient
profiles.


REFERENCES

1. Altman, N. S. (1992). An introduction to kernel and nearest-neighbor nonparametric regression. The


American Statistician, 46(3): 175–185.
2. Armstrong, J. S. (1985) Long-range Forecasting: From Crystal Ball to Computer, 2nd. ed. Wiley.
3. Back, B., Toivonen, J., Vanharanta, H., & Visa, A. (2001). Comparing numerical data and text
information from annual reports using self-organizing maps. International Journal of Accounting
Information Systems, 2(4), 249-269.
4. Bagnasco A., Siri, A., Aleo, G., Rocco G. and L. Sasso (2015) Applying artificial neural networks to
predict communication risks in the emergency department. Journal of Advanced Nursing, 71(10), 2293–2304.
5. Boyle, J., Jessup, M., Crilly, J., Green, D., Lind, J., Wallis, M., Miller, P. and G. Fitzgerald (2014)
Predicting emergency department admissions, Emergency Medical Journal, 29(5): 358-365.
6. Brockwell, P. J.; Davis, R. A. (2009). Time Series: Theory and Methods (2nd ed.). New York: Springer.
p. 273.
7. Budayan, C., Dikmen, I., & Birgonul, M. T. (2009). Comparing the performance of traditional cluster
analysis, self-organizing maps and fuzzy C-means method for strategic grouping. Expert Systems with
Applications, 36(9), 11772-11781.
8. Chan, C., Huang, H. & You, H. (2012). Intelligence modeling for coping strategies to reduce emergency
department overcrowding in hospitals. Journal of Intelligent Manufacturing, 23(6), 2307–2318.
9. Cortes, C.; Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3): 273–297.
10. Fayyad, U., Piatetsky-Shapiro, G., & Smyth, P. (1996), From data mining to knowledge discovery in
databases. AI magazine, 17(3), 37.
11. Forster A.J., Stiell I., Wells G. (2003) The effect of hospital occupancy on emergency department length
of stay and patient disposition. Academic Emergency Medicine, 10(2):127-133.
12. Gul, M., and A.F. Guneri (2015) Forecasting patient length of stay in an emergency department by
artificial neural networks. Journal of Aeronautics and Space Technologies, 8(2): 43-48.
13. Gopakumar, S., Tran, T., Luo, W., Phung, D., and S. Venkatesh (2016) Forecasting Daily Patient
Outflow From a Ward Having No Real-Time Clinical Data. JMIR Med Inform, 4(3): e24.
14. Ho, T. K. (1995) Random Decision Forests. Proceedings of the 3rd International Conference on
Document Analysis and Recognition, Montreal, QC, 14–16 August 1995, pp. 278–282.
15. Ho, T.K. (1998) The Random Subspace Method for Constructing Decision Forests. IEEE Transactions
on Pattern Analysis and Machine Intelligence, 20(8): 832–844.
16. Hastie, T., Tibshirani, R., and J. Friedman (2008). The Elements of Statistical Learning (2nd ed.).
Springer.
17. Joy, M.P, and S. Jones (2005) Predicting Bed Demand in a Hospital using Neural Networks and
ARIMA models: a Hybrid Approach, Proceedings ESANN 2005 (13th Annual Symposium on Artificial
Neural Networks), pp. 127-132.
18. Kannampallil, T.G., Schauer, G.F., Cohen, T., and V.L. Patel (2011) Considering complexity in
healthcare systems. Journal of Biomedical Informatics, 44 (2011): 943–947.
19. Kohonen T. (1982), Self-organized formation of topologically correct feature maps, Biological
Cybernetics, 43, 59–69.
20. Kohonen T. (1997), Self-organizing maps, (2nd ed.), Springer, Berlin.
21. Malmberg, A., Malmberg, B., & Lundequist, P. (2000). Agglomeration and firm performance:
economies of scale, localisation, and urbanisation among Swedish export firms. Environment and
Planning A, 32(2), 305-321.
22. Martin-del-Brio, B, Serrano-Cinca, C. (1993). Self-organizing neural networks for the analysis and
representation of data: some financial cases. Neural Computing & Applications, 1 (1993), pp. 193–206.
23. McCulloch, W. S., & Pitts, W. (1943). A logical calculus of the ideas immanent in nervous activity. The
bulletin of mathematical biophysics, 5(4), 115-133.
24. Mitchell, T. (1997), Machine Learning, McGraw Hill, p.2.
25. MacKay, D. J. (2003). Information theory, inference and learning algorithms. Cambridge university
press.
26. Oliveira, S., Portela, F., Santos M.F., Machado J., and A. Abelha (2014) Hospital bed management
support using regression data mining models. Proceedings IWBBIO 2014, Granada, 7–9 April 2014.
27. Ong, J., Abidi S.S.R (1999) Data Mining Using Self-Organizing Kohonen maps: A Technique for
Effective Data Clustering & Visualisation. Proceedings of the International Conference on Artificial
Intelligence (IC-AI'99).
28. Resta, M. (2016). Computational Intelligence Paradigms in Economic and Financial Decision Making.
Springer International Publishing.
29. Sarlin, P. (2013). Decomposing the global financial crisis: A self-organizing time map. Pattern
Recognition Letters, 34(14), 1701-1709.
30. Smola, A. J.; Schölkopf, B. (2004). A tutorial on support vector regression. Statistics and Computing,
14(3): 199–222.
31. Smith-Miles K., Lopes L. (2012). Measuring instance difficulty for combinatorial optimization
problems, Computers and Operations Research, 39 (5), 875-889.
32. Suykens, J. A. K.; Vandewalle, Joos P. L. (1999) Least squares support vector machine classifiers,
Neural Processing Letters, 9(3): 293–300.
33. Vesanto, J. (1999). SOM-based data visualization methods. Intelligent data analysis, 3(2), 111-126.
34. Villmann, T & Der, R. & Herrmann, M. & Martinetz, T.M. (1997) Topology preservation in self-
organizing feature maps: exact definition and measurement, IEEE Transactions on Neural Networks,
8(2): 256–266.
35. Visa, A., Toivonen, J., Ruokonen, P., Vanharanta, H., & Back, B. (2000). Knowledge discovery from
text documents based on paragraph maps. Proceedings of the 33rd Annual Hawaii International
Conference on System Sciences.
36. Von der Malsburg, C (1973), Self-organization of orientation sensitive cells in the striate cortex,
Kybernetik, 14, 85–100.
37. Green, J., Armstrong, D. (1994). The views of service providers. In Morrell, D., Green, J., Armstrong,
D., Bartholomew, J., Gelder, F., Jenkins, C., Jankowski, R., Mandalia, S., Britten, N., Shaw, A., Savill,
R. (eds) Five Essays on Emergency Pathways, Institute for the King's Fund Commission on the Future of
Acute Services in London, King's Fund, London.
38. Howell, E., Bessman, E., Kravet, S., Kolodner, K., Marshall, R., Wright, S., (2008). Active Bed
Management by Hospitalists and Emergent Department Throughput. Annals of Internal Medicine 149,
804-810.
39. Howell, E., Bessman, E., Marshall, R. Wright, S., (2010). Hospitalist bed management effecting
throughput from the emergency department to the intensive care unit. Journal of Critical Care 7(2), 184-
189.
40. Proudlove, N.C., Gordon K., Boaden R. (2003) Can good bed management solve the overcrowding in
accident and emergency departments? Emergency Medical Journal 20,149-155.
41. Tortorella, F., Ukanowicz, D., Douglas-Ntagha, P., Ray, R., Triller, M., (2013). Improving bed turnover
time with a bed management system. Journal of Nursing Administration, 43(1), 37-43.
42. Proudlove N.C., Black S., Fletcher A. (2007). OR and the challenge to improve the NHS: Modelling for
insight and improvement in in-patient flows. Journal of the Operational Research Society, 58: 145–158.
