
Journal of Biomedical Informatics 120 (2021) 103840
Original Research
Treatment initiation prediction by EHR mapped PPD tensor based convolutional neural networks boosting algorithm
Xueli Xiao a, Guanhao Wei b, *, Li Zhou b, Yi Pan a, Huan Jing b, Emily Zhao b, Yilian Yuan b
a Computer Science Department, Georgia State University, Atlanta, GA 30303, USA
b Advance Analytics, IQVIA Inc., Plymouth Meeting, PA 19462, USA
ARTICLE INFO

Keywords: Electronic Health Records; Image Mapping; Convolutional Neural Networks; Treatment Initiation Prediction

ABSTRACT

Electronic health records contain patient’s information that can be used for health analytics tasks such as disease detection, disease progression prediction, patient profiling, etc. Traditional machine learning or deep learning methods treat EHR entities as individual features, and no relationships between them are taken into consideration. We propose to evaluate the relationships between EHR features and map them into Procedures, Prescriptions, and Diagnoses (PPD) tensor data, which can be formatted as images. The mapped images are then fed into deep convolutional networks for local pattern and feature learning. We add this relationship-learning part as a boosting module on a commonly used classical machine learning model. Experiments were performed on a Chronic Lymphocytic Leukemia dataset for treatment initiation prediction. Experimental results show that the proposed approach has better real world modeling performance than the baseline models in terms of prediction precision.
* Corresponding author.
E-mail address: guanhao.wei@iqvia.com (G. Wei).
https://doi.org/10.1016/j.jbi.2021.103840
Received 4 February 2021; Received in revised form 8 June 2021; Accepted 11 June 2021
Available online 15 June 2021
1532-0464/© 2021 Elsevier Inc. This article is made available under the Elsevier license (http://www.elsevier.com/open-access/userlicense/1.0/).

1. Introduction

Electronic health records (EHRs) collected through medical office visits and pharmacy transactions can provide a significant amount of digital information that can be used in applications such as health analytics and clinical informatics [1,2]. Various machine learning and deep learning models have been applied to EHR data for disease detection [3–6], disease or treatment progression [7,8], disease subtyping classification [9], and so on.

In addition to information such as patient demographics (age, gender, health care providers, and so on), an EHR also contains information related to patient medical journeys: a sequence of diagnosis codes, surgical procedures, lab procedures, and prescriptions over time. Traditionally, the information in an EHR is treated as a flattened bag of aggregated features or a medical feature sequence in disease progression prediction models. In addition, each patient’s visit consists of different types of features. For example, diagnosis and lab results will affect a healthcare provider’s decisions on prescriptions and treatment procedures. Also, the frequency and sequence of treatments in the early stage of a disease will impact the choices of treatment plans in the late stage. Unfortunately, such latent interactions between multiple EHR features are hard to leverage to make significant contributions for model predictions.

Recent works show that incorporating hierarchical and sequential feature relationships can help improve prediction performance. Choi et al. proposed graph convolutional transformers that learn the hidden structure of EHR while performing supervised training [10]. Choi et al. have also proposed MiME [11], a model architecture that reflects relationships between diagnosis and treatments. Wu et al. studied the relationship between patients and clinicians because patients with the same disease that visit the same clinician tend to receive similar treatments [12]. In EHR records, since feature interactions and mutual or concurrent effects cannot be reflected automatically, researchers try to reconstruct this potentially meaningful information using various methods. For example, the relationship between a prescription and a disease can be represented by how often they appear in the same medical claim. Zeng et al. measured drug-drug similarity by calculating the Jaccard similarity coefficient on EHR data [13].

Deep convolutional neural networks (CNNs) are a popular class of deep learning models, and they are mostly used to analyze visual imagery. Images contain spatially coherent pixels. Neighboring pixels share similar information. Convolution filters in a CNN model do convolution operations and extract features among neighboring pixels. Sharma et al. [14] have transformed non-image data such as gene
expressions and test data to images and used CNNs for extracting features and subsequent prediction tasks. When mapping non-image data to images, the positions of the pixels are essential. If they are arbitrarily arranged, it can harm the feature extraction and prediction performances.

Inspired by previous works, we propose to map EHR features, mainly Procedures, Prescriptions, and Diagnoses (PPD) records, to images using relational information among the features. We refer to such an EHR mapped image as a PPD tensor for this particular use case. This way, the feature relationships are taken into consideration as related features are positioned as neighboring pixels inside the PPD tensors. A CNN model can then be used to extract and learn from the mapped PPD tensors. Specifically, we experiment on EHR data on Chronic Lymphocytic Leukemia (CLL) patients and make predictions on whether the next line of treatments needs to be initiated.

The contribution of our work is as follows:

• We use a convolutional neural network to learn the relationship among medical entities in EHR data and perform a supervised prediction task.
• We propose a novel method that maps EHR patient journeys (diagnosis, procedures, and prescriptions) to PPD tensors. The underlying relations among the medical entities are taken into consideration.
• Deep learning models require balance among classes. In the actual data, the number of positive samples is small. We design a novel EHR data augmentation method that upsamples the positive class.
• We utilize a hybrid machine learning model architecture that includes a base machine learning model and a CNN boosting model. The base machine learning model can learn additional features that are not mapped to PPD tensors.

The rest of our work is organized as follows: Section 2 is related work. Section 3 introduces our method: how EHR data can be mapped to PPD tensors and then fed to Convolutional Neural Networks to learn local patterns. Section 4 describes experiment settings and results. Finally, Section 5 includes the conclusion and possible future directions.

2. Related work

2.1. Deep learning models on EHR data

The ability to proactively identify potential changes in a patient’s treatment is a core research task of disease or treatment progression modeling. Various deep learning models have been developed for this purpose. Ma et al. [15] proposed a bidirectional RNN, which is informed on both past and future visits. Three attention mechanisms are used to take into consideration the relationships among visits. Che et al. [7] designed an RNN architecture for personalized predictions of Parkinson’s disease. Dynamic temporal matching is used to learn the similarity between longitudinal patient records. Choi et al. [6] developed a two-level predictive model to detect influential past visits and clinical variables and improved model accuracy and interpretability. Choi et al. [3] also proposed an RNN model for early detection of heart failures. Hierarchical information from medical oncology is extracted using a graph-based attention model. Ma et al. [16] proposed a framework for diagnosis prediction based on patients’ historical visit records. Medical code descriptions are incorporated in their model.

2.2. Mapping non-image data to images

CNNs are deep learning models that handle images effectively. Many data are not in the form of images, such as texts, speeches, financial data, genomic data, etc. Some researchers map non-image data to images and use CNNs to produce higher classification performance. Xu et al. turned texts into binary codes by applying 1-d convolution to them [17]. Gao et al. converted DNA sequences to binary code and used CNNs to predict polyadenylation sites [18]. Lyu et al. used CNNs for RNA-seq data. The resulting images are constructed based on chromosome location [19]. Sharma et al. converted various datasets into images. Relationships between a set of features are analyzed, and similar features are placed close to each other. Crucial information can be learned through element arrangements. On the other hand, if features are arbitrarily arranged, the prediction performance can be badly influenced [14].

3. Method

3.1. Mapping electronic health records to PPD tensors

We take the diagnosis and treatment (procedure and prescription) features from EHR data and map them into PPD tensors for each patient. As a result, each patient will have a unique PPD tensor representing his/her doctor visits journey. The resulting tensors are of rank 3. The diagnosis and treatment features will be positioned inside of a matrix (a rank-2 tensor), and the time information will make the 3rd rank of the tensor. For example, if there are 144 diagnosis and treatment features in total, they can be mapped as a 12 × 12 matrix without time information.

To map diagnosis-treatment features into PPD tensors, we need to consider the relationships between features, how to position them according to their relationships, and how to normalize feature values to the range of [0, 1]. We also have some time information in our dataset, and we encoded that into the 3rd rank of PPD tensors.

We first analyze the relationship between diagnosis and treatment features by evaluating quantitatively how often two features appear together in the same medical claim. Then, we treat each feature like a pixel inside an image and optimize their locations inside the image matrix by considering their relationships. Related features are positioned as adjacent pixels as much as possible. Since the range of a pixel value is [0, 1], 0 representing black and 1 representing white, we also normalize our features into this range. Finally, we add time information as the 3rd rank of the PPD tensor. The following subsections discuss in detail how diagnosis-treatment features are mapped into image tensors.

3.1.1. Relationships between procedure, prescription and diagnosis

Electronic health records contain various types of information such as patient demographic information like age, gender, and ethnicity; and patient encounter information such as diagnosis codes, procedures used, and drugs prescribed. In addition, there exist interactive relationships between the procedure, prescription, and diagnosis features. For example, doctors usually give the procedure EKG based on the diagnosis of chest pain. This means that chest pain and EKG are closely related.

In images, pixels that are physically close together share similar information. Similarly, transforming Electronic Health Records into PPD tensors requires positioning similar entities together. Electronic health records often do not contain direct relationships and closeness between key medical entities. However, such information can be measured using quantitative methods. To extract these kinds of associations from EHRs, we look at how frequently two medical entities appear together in a medical claim. The more frequently they appear together, the more likely they are related. Specifically, we use conditional probability to measure the relationship between a treatment (procedure or prescription) and a diagnosis. Formula 1 shows how the relationship is measured:
P(rx | dx) = P(rx ∩ dx) / P(dx)    (1)

P represents probability, rx is a treatment, and dx is a diagnosis. We first go through all medical claims in our dataset and obtain the probability of every single diagnosis and the probability of a treatment and a diagnosis appearing together. Then, we calculate the conditional probability of a treatment given a diagnosis. This quantitatively reflects the relationship between diagnosis and treatments.

3.1.2. Genetic Algorithms for optimizing entity positions

When mapping medical features into PPD tensors, we treat each feature as a pixel, and we want the related features to be positioned as neighboring pixels as much as possible. Fig. 1 shows an example of how some medical features are positioned inside of an image. Related features, such as Fever, IV fluid, and Advil, are placed as neighboring pixels. There might not be enough features to fill the image, so some pixels are "empty" and do not represent any features. The fixed feature locations and feature closeness reflect the image structure.

Fig. 1. An example of how related features are positioned as neighboring pixels inside of a PPD tensor.

How to arrange the positions of diagnosis and treatments in a PPD tensor such that related entities are as close together as possible is an optimization problem. Suppose the number of diagnosis and treatments together is N, and diagnosis and treatments are mapped to an image of size W⋅H; there will then be (W⋅H)!/(W⋅H − N)! ways to arrange them. Traversing every possible arrangement and evaluating how close together the medical features are positioned is computationally impossible in practice. As the number of features increases, the number of ways of positioning grows super-exponentially.

Genetic algorithms (GA) can be used to find solutions to optimization problems efficiently. A genetic algorithm is a kind of evolutionary algorithm inspired by the process of natural selection. It is commonly used to generate high-quality solutions when the search space size is enormous. GAs have been applied to various optimization problems such as deep learning hyperparameter optimizations [20–22], neural network weight optimizations [23,24], search problems in neuroscience [25,26], vehicle routing problems [27], etc. Typically, in a GA, an initial population of solutions is randomly generated and then evolved to create better solutions. Biologically inspired operators such as selection, crossover, and mutation are applied to the solutions during the evolutionary process. As a result, the algorithm allows better solutions to survive into new generations and eliminates bad ones.

Fig. 2. The genetic representation of feature positions inside of a matrix.

Fig. 3. Modified crossover operation.

To use a genetic algorithm, we need to have genetic representations for the solutions. For our problem, we unroll the image mapping to a vector that contains diagnosis and treatments, as shown in Fig. 2. The
unrolled vectors are the genetic representation of image mapping solutions. We create random solution vectors and apply genetic operators to get better and better image mapping solutions.

We design modified genetic operators to generate new populations. We still use selection, crossover, and mutation operations. However, the original crossover and mutation operations cannot be directly applied for ordering diagnosis and treatment because each entity appears in the chromosome exactly once. Using the original crossover and mutation would cause the solutions to contain some entities multiple times while missing other entities. Instead, we use modified crossover and mutation operations. Fig. 3 shows our crossover operations. Two chromosomes, Parent a and Parent b, are recombined to generate two children. The children contain genes from both parents. To produce Child a, the front part of the chromosome from Parent a (bounded by the red box) is first taken. However, we cannot simply use the latter part of Parent b. Instead, from Parent b, we remove the genes Parent a has already passed on to the child, concatenate the remaining genes in order, and give them to Child a (bounded by the orange boxes). Child b is produced similarly to Child a. For mutation, two genes in the chromosome are randomly chosen, and their positions are swapped (see Fig. 4). These modified genetic operations guarantee that the children inherit from both parents while containing each diagnosis and treatment exactly once.

Fig. 4. The mutation operation swaps two values.

We also design a fitness function to evaluate how good the solutions are. We have calculated the relationship between a diagnosis and a treatment (procedure or prescription) according to Formula 1. For each treatment, we have obtained its relationship values with all diagnosis features. To calculate the fitness of a feature arrangement solution, we first find a treatment’s n nearest diagnosis neighbors {N1, N2, ..., Nn} according to the relationship values, with N1 being the closest diagnosis neighbor. The number of nearest neighbors to consider (n) is a hyperparameter to set up. We look at 3-nearest neighbors for our dataset. In the image mapping, we define a neighborhood for each treatment. The neighborhood is simply the adjacent pixels to the treatment pixel. Fig. 5 is an example of a neighborhood. The neighborhood of the prescription Advil is bounded by the red box and has size 3 × 3, which means we consider the surrounding 8 pixels to be this prescription’s neighbors. The neighborhood size is a hyperparameter that can take different values. The larger the neighborhood, the less constraint we put on the feature ordering. Next, for each treatment i, we check if any diagnosis in {N1, N2, ..., Nn} is within its neighborhood and assign a corresponding score Ti. The value of Ti is determined by which diagnosis is within i’s neighborhood. For example, if N1 is within i’s neighborhood, Ti takes on a small value such as 0.5. If N3 is within i’s neighborhood, Ti has a slightly larger value such as 0.8. If no diagnosis is within i’s neighborhood, Ti is infinity. The fitness score of a PPD tensor mapping is sum(Ti) for all treatments i. We want to optimize the mapping to have as small a fitness score as possible.

The overall steps of our modified genetic algorithm for diagnosis-treatment to PPD tensor mapping are as follows. First, we designed our crossover and mutation operation to produce new solutions. Different strategies to adopt crossover and mutation to discover new mapping solutions can be explored in the future. We repeat the evolutionary process until the fitness score of the best-discovered solution converges (our stopping condition).

Algorithm 1. Diagnosis Treatments Mapping Optimization

3.1.3. Values of medical features

The value for each pixel inside a PPD tensor ranges from 0 to 1: zero represents black, one means white, and the values in between are various degrees of gray. The value for a procedure, prescription, or diagnosis feature is the number of times that feature happened within a fixed period for the patient. To normalize each feature to the range of [0, 1], we divide each feature by its value at the 95th percentile. Any results greater than one are clipped to 1. We normalized using the 95th percentile value instead of the maximum value to handle outliers.

3.1.4. Time information as 3rd rank in a tensor

The EHR data contains patients’ visit information over some time, and we want to incorporate some time information in our PPD tensor. So, for a given time interval, instead of treating it as one whole part, we divide it into three parts and count how many times each feature appears within each part. The counts are then normalized to a value in the [0, 1] range. The three parts of the time interval make the 3rd rank of the PPD tensor, so the 3rd rank has dimension 3. Note that the length of time interval used in our dataset is the same for all EHR data. Fig. 6 shows the time dimension of our PPD tensor: the front matrix corresponds to features earlier in time, the back matrix to later ones.

3.2. Data augmentation

One problem for prediction tasks on EHRs is that the dataset is usually highly unbalanced. The number of positive samples is minimal compared with the negative ones. Having highly unbalanced data will cause deep learning models to perform poorly since it will be difficult for the model to learn from the minority class. Using data augmentation [28] to boost the number of positive samples is a solution to data imbalance.

Popular image augmentation methods such as translation, rotation, and flipping do not suit our needs here. Since pixels at different coordinates have different meanings (they represent different diagnoses or treatments), we do not want to use translations or rotations because they change pixel locations.
A popular method used for EHR data augmentation is the Synthetic Minority Oversampling Technique (SMOTE) [29]. SMOTE works by selecting two samples, drawing a line between them in the feature space, and generating a new one along the line. Unfortunately, we found data augmentation using SMOTE to be very slow due to the large amount of data and the high feature dimensionality that we have. We use a novel data augmentation technique named PPD-tensor-overlay for EHR data. The technique is inspired by data augmentation methods using image overlay [30]. As shown in Fig. 7, we take all features from a positive sample and overlay them with part or all of the features from a negative one to produce a new synthesized positive sample. Note that before the data augmentation step, we have removed some negative training samples that could potentially be positive using the information obtained from the base machine learning model. The negative samples used by the data augmentation are therefore a reduced set of the original negatives. Intuitively, positive samples plus features from the negative samples should still generate positive ones. This technique is easy to implement and runs very efficiently in practice.
Fig. 5. A 3 × 3 neighborhood example.

Formula 2 is used to generate a synthesized positive patient PPD tensor.

I_aug[i, j] = min(1, I_pos[i, j] + Mask[i, j] ⋅ I_neg[i, j])    (2)

I_aug is the augmented image, I_pos is a positive image, and I_neg is a negative image. i and j are coordinates for a pixel in the image. Mask[i, j] is a mask for the pixel that takes either 0 or 1. This value is randomly generated according to a preset probability. The function min takes the minimum value between two values.
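As an illustration of Formula 2, the following NumPy sketch overlays one negative PPD tensor onto one positive PPD tensor. The function name `ppd_tensor_overlay`, the parameter `keep_prob` (standing in for the preset mask probability), and the choice to share one mask value per pixel (i, j) across the three time slices are assumptions made for this example; the text does not specify how the mask is applied along the time rank.

```python
import numpy as np


def ppd_tensor_overlay(pos, neg, keep_prob=0.5, rng=None):
    """Synthesize a positive PPD tensor following Formula 2.

    pos, neg : arrays of identical shape, (H, W) or (H, W, T), values in [0, 1].
    keep_prob: probability that Mask[i, j] equals 1 (hypothetical parameter name).
    """
    pos = np.asarray(pos, dtype=float)
    neg = np.asarray(neg, dtype=float)
    if pos.shape != neg.shape:
        raise ValueError("positive and negative tensors must have the same shape")
    rng = np.random.default_rng() if rng is None else rng

    # One Bernoulli draw per pixel (i, j): 1 keeps the negative pixel, 0 drops it.
    mask = (rng.random(pos.shape[:2]) < keep_prob).astype(float)
    if pos.ndim == 3:
        # Assumption: the same mask value is reused for every time slice.
        mask = mask[:, :, None]

    # I_aug = min(1, I_pos + Mask * I_neg)
    return np.minimum(1.0, pos + mask * neg)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    positive = rng.random((12, 12, 3))
    negative = rng.random((12, 12, 3))
    synthetic = ppd_tensor_overlay(positive, negative, keep_prob=0.3, rng=rng)
    print(synthetic.shape)
```

Because the overlay only adds non-negative values and clips at 1, every pixel of the synthesized tensor is at least as large as the corresponding pixel of the positive sample.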
3.3. Convolutional Neural Networks for image mapping classification
EHR data for every patient is processed, and the diagnosis and treatment (prescription and procedure) information is converted to PPD tensors. Thus, every patient has one image that encodes the EHR information over a time interval. After mapping diagnosis and treatments to PPD tensors, we pass them to a CNN for feature extraction and classification. A CNN model of multiple convolution layers is created for a binary classification task. Algorithm 2 describes how the convolution filters learn the similarity information from neighboring pixels.

Fig. 6. Time information encoded in the 3rd rank of a PPD tensor.

Algorithm 2. Convolution for feature similarity learning
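As a rough illustration of how a convolution filter combines neighboring pixels of a PPD tensor, the NumPy sketch below applies a single filter to one 12 × 12 × 3 tensor. It is not a reproduction of Algorithm 2: the function name `conv2d_single_filter`, the 3 × 3 filter size, the absence of padding, and the ReLU activation are all assumptions chosen for the example.

```python
import numpy as np


def conv2d_single_filter(tensor, kernel, bias=0.0):
    """Slide one filter over an (H, W, C) tensor without padding.

    Each output value is a weighted sum over a small spatial neighborhood
    and all C time slices, so features placed next to each other in the
    PPD tensor contribute to the same output activation.
    """
    tensor = np.asarray(tensor, dtype=float)
    kernel = np.asarray(kernel, dtype=float)
    h, w, c = tensor.shape
    kh, kw, kc = kernel.shape
    if kc != c:
        raise ValueError("kernel depth must match the number of time slices")

    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = tensor[i:i + kh, j:j + kw, :]
            out[i, j] = np.sum(patch * kernel) + bias

    # ReLU activation (an assumption; the activation is not stated in the text).
    return np.maximum(out, 0.0)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    ppd_tensor = rng.random((12, 12, 3))
    filt = rng.normal(size=(3, 3, 3))
    feature_map = conv2d_single_filter(ppd_tensor, filt)
    print(feature_map.shape)
```

In a trained CNN the filter weights are learned, so a filter can come to respond strongly when a treatment pixel and its related diagnosis pixels are active together.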
3.4. Overall model architecture
Fig. 8 shows the overall architecture of the proposed modeling pipeline. Our input data contains both patient demographic information and patient journey information. This information is fed to the commonly used classical machine learning models, and prediction scores are produced. The patient journey information (sequences of diagnoses, procedures, and prescriptions) is mapped to PPD tensors using optimized mapping rules. We then use our data augmentation method to balance the positive and negative classes. The output scores from the base machine learning model are also used to remove some negative training samples that could potentially be confusing to the deep learning model. Negative sample removal is an optional step. We adopted this step for the dataset we experimented on because the negative class is noisy and may contain potentially positive samples. Therefore, we removed some negative training samples with a higher probability of being positive to avoid confusing the deep learning model. The PPD tensors are then fed to a CNN model for prediction. The output scores from the CNN model are finally combined with those of the base machine learning model to produce a final prediction score.
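Two steps of this pipeline can be sketched in a few lines: the optional removal of negatives that the base model scores highly, and the final score combination. The text does not state the removal threshold or the combination rule, so the function names, the `neg_threshold` parameter, and the weighted average used below are hypothetical choices for illustration only.

```python
import numpy as np


def keep_mask_for_cnn_training(base_scores, labels, neg_threshold=0.5):
    """Return a boolean mask of training samples to keep for the CNN.

    Negatives (label 0) whose base-model score is at or above
    neg_threshold are treated as potentially positive and dropped.
    All positives are kept.
    """
    base_scores = np.asarray(base_scores, dtype=float)
    labels = np.asarray(labels)
    confusing_negatives = (labels == 0) & (base_scores >= neg_threshold)
    return ~confusing_negatives


def combine_scores(base_scores, cnn_scores, cnn_weight=0.5):
    """Blend base-model and CNN scores with a weighted average (an assumption)."""
    base_scores = np.asarray(base_scores, dtype=float)
    cnn_scores = np.asarray(cnn_scores, dtype=float)
    return (1.0 - cnn_weight) * base_scores + cnn_weight * cnn_scores


if __name__ == "__main__":
    labels = np.array([1, 0, 0, 0])
    base = np.array([0.9, 0.8, 0.2, 0.5])
    print(keep_mask_for_cnn_training(base, labels))
    print(combine_scores(base, np.array([0.7, 0.1, 0.3, 0.6])))
```

Any monotone combination of the two scores would fit the description equally well; the weighted average is simply the shortest one to write down.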
Fig. 7. To generate a synthesized positive patient PPD tensor, a positive patient’s features are overlaid with part of a negative patient’s features.

4. Experiments and results

We perform our experiments on a Chronic Lymphocytic Leukemia dataset. We compare our results with base machine learning models on the validation dataset. Top k patients with the highest scores are selected, and precisions are calculated for different k values.

4.1. Chronic Lymphocytic Leukemia dataset

We perform our experiments on the Chronic Lymphocytic Leukemia patient dataset to make predictions for the second-line treatment (L2+) initiation in a fixed time window. This dataset is obtained from the IQVIA longitudinal prescription claims database (Rx) and electronic medical claims (Dx) database. IQVIA receives 2 billion prescription claims per year with history from January 2004 with coverage up to 88% for the retail channel, 50–70% for traditional and specialty mail orders, and 40% for long-term care. This information represents activities during the prescription transaction and contains information regarding the product, provider, payer, and geography. Rx data is longitudinally linked back to an anonymous patient token and can be linked to events within the data set itself and across other patient data assets. In addition, IQVIA receives just under 1 billion office-based electronic medical claims from office-based individual professionals, ambulatory and general healthcare sites per year, including patient-level diagnosis and procedure information. The information represents nearly 40% of all electronically filed medical claims in the US and is now the largest diagnosis database of its kind. The Dx data includes information tracked to a specific prescriber, enabling the user to obtain a more thorough picture of a prescriber’s activity which can, in turn, be linked to prescription data, affiliations, and other practitioner attributes.

Fig. 8. The overall architecture of the proposed modeling pipeline.

Fig. 9. Overall model performance in terms of precision at top k.

Information for patients who were diagnosed and received initial treatment with CLL L2+ was selected from the databases. Patients labeled as positive are the ones who received second-line treatment in the 15-month look-back period interval. In contrast, negative ones were not observed to receive any further treatments inside of the interval. The training dataset consists of monthly aggregated cohorts from April 2018 through May 2019. In addition, data from 4 cohorts from September 2019 to December 2019 are used for model validation. Due to potential patient information overlap from neighboring cohorts, we ensured that validation cohorts are at least three months later than the most recent training cohort to avoid information leakage. Overall, there will be around 60 K patients in each cohort. Thus, the amount of positive
patients who received second-line treatment is around 3% among all CLL patients. In general, we extract Dx/Rx claim level data from the IQVIA database to build features in different categories: patient characteristics, major HCP information, diagnosis/prescription/treatment counts, period gaps between various visit events, etc. Feature selection methods using fuzzy logic or rough sets [31,32] could be explored in the future to generate better features. In total, there are 226 features for baseline models, and only 144 visit count features are used for each patient to construct a mapped 12 × 12 PPD tensor.
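To make the 144-feature, 12 × 12 layout concrete, the following NumPy sketch assembles PPD tensors from per-slice visit counts, combining the 95th-percentile normalization of Section 3.1.3 with the three time slices of Section 3.1.4. The function name `build_ppd_tensors`, the `positions` array (the flat pixel index chosen for each feature, for example by the genetic algorithm), and the choice to compute one percentile per feature across all patients and slices are assumptions made for illustration, not details stated in the text.

```python
import numpy as np


def build_ppd_tensors(counts, positions, height=12, width=12):
    """Map visit counts to PPD tensors of shape (patients, height, width, slices).

    counts   : array (n_patients, n_slices, n_features) of raw visit counts.
    positions: array (n_features,) of distinct flat pixel indices in
               [0, height * width); pixels without a feature stay 0.
    """
    counts = np.asarray(counts, dtype=float)
    positions = np.asarray(positions, dtype=int)
    n_patients, n_slices, n_features = counts.shape
    if len(set(positions.tolist())) != n_features:
        raise ValueError("each feature needs its own pixel position")

    # 95th percentile per feature; guard against features that are almost always 0.
    p95 = np.percentile(counts.reshape(-1, n_features), 95, axis=0)
    p95 = np.where(p95 > 0, p95, 1.0)

    # Divide by the percentile and clip anything above 1 down to 1.
    normalized = np.clip(counts / p95, 0.0, 1.0)

    # Scatter each feature to its pixel, then fold the flat axis into an image.
    flat = np.zeros((n_patients, n_slices, height * width))
    flat[:, :, positions] = normalized
    images = flat.reshape(n_patients, n_slices, height, width)

    # Move the time slices to the last axis: (patients, height, width, slices).
    return np.moveaxis(images, 1, -1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    raw_counts = rng.poisson(2.0, size=(50, 3, 144))
    pixel_positions = rng.permutation(144)  # stand-in for an optimized mapping
    tensors = build_ppd_tensors(raw_counts, pixel_positions)
    print(tensors.shape)
```

In practice the percentiles would be computed on the training cohorts only and reused for the validation cohorts, which this sketch leaves out for brevity.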
4.2. Treatment initiation prediction model performance
Fig. 10. Precision-recall curve comparisons.

Since our dataset is highly unbalanced between positives and negatives, we use precision@k to evaluate model performances. First, scores are calculated for all patients, and the k patients with the highest scores are selected. We then calculate the percentage of the selected patients who are actually positive.
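The metric described above can be written as a short function. This is a minimal sketch; the function name `precision_at_k` and the tie-breaking behavior (a stable sort, so earlier patients win ties) are choices made for the example.

```python
import numpy as np


def precision_at_k(scores, labels, k):
    """Fraction of true positives among the k highest-scored patients."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    if k <= 0:
        raise ValueError("k must be positive")
    k = min(k, len(scores))

    # Indices of the k largest scores, highest first.
    top_k = np.argsort(-scores, kind="stable")[:k]
    return float(labels[top_k].mean())


if __name__ == "__main__":
    scores = [0.9, 0.1, 0.8, 0.4]
    labels = [1, 0, 0, 1]
    print(precision_at_k(scores, labels, k=2))  # top-2 are patients 0 and 2
```

With a positive rate of roughly 3%, a random ranking would give a precision@k of about 0.03 at any k, which is a useful reference point when reading the reported values.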
We apply our method to three baseline models for comparison: Random Forest, XGBoost (XGB) [33], and CatBoost. We evaluate the performance of these models with and without CNN learned feature similarities.

In a real-world application, considering the limited time and resources that can be put on truly identified patients for further healthcare arrangement, we only care about model performance for the top 3000 patients with the highest scores. Thus, we compare model precision for k up to 3000.

As shown in Figs. 9 and 10, we compare average model performances on testing cohorts between the baseline machine learning models and the proposed CNN boosted models. The ensemble of baseline models such as xgb-rf is included as a reference. It can be seen that our model outperforms the baselines in general.

Fig. 11 shows the architecture of the CNN we used in our model. There are in total four convolutional layers with one fully connected layer at the end. The hyperparameter settings can be found in the figure.

Fig. 11. CNN model architecture.
Table 1
Average model prediction precision comparison at top k scored CLL patients.

Top k value                                 100       300       500       1,000     2,000     3,000
Random Forest                               24.0%     19.0%     17.4%     14.6%     12.5%     11.3%
CNN-Boost-RF                                23.0%     19.3%     17.4%     14.9%     12.5%     11.3%
(random mapping, SMOTE)                     (-1.0%)   (+0.3%)   (+0.0%)   (+0.3%)   (+0.0%)   (+0.0%)
CNN-Boost-RF                                26.0%     20.6%     18.0%     15.8%     12.6%     11.3%
(optimized mapping, SMOTE)                  (+2.0%)   (+1.6%)   (+0.6%)   (+1.2%)   (+0.1%)   (+0.0%)
CNN-Boost-RF                                26.0%     21.0%     18.2%     15.8%     12.7%     11.3%
(optimized mapping, PPD-tensor-overlay)     (+2.0%)   (+2.0%)   (+0.8%)   (+1.2%)   (+0.2%)   (+0.0%)
XGB                                         27.0%     20.3%     17.8%     14.7%     12.7%     11.5%
CNN-Boost-XGB                               27.0%     20.6%     18.2%     15.1%     12.7%     11.5%
(random mapping, SMOTE)                     (+0.0%)   (+0.3%)   (+0.4%)   (+0.4%)   (+0.0%)   (+0.0%)
CNN-Boost-XGB                               31.0%     23.0%     20.4%     17.0%     13.5%     11.6%
CNN-Boost-XGB                               33.0%     23.6%     21.0%     17.4%     13.7%     11.9%
(optimized mapping, PPD-tensor-overlay)     (+6.0%)   (+3.3%)   (+3.2%)   (+2.7%)   (+1.0%)   (+0.4%)
CatBoost                                    25.0%     20.0%     17.8%     15.0%     12.7%     11.5%
CNN-Boost-CatB                              24.0%     20.3%     18.0%     15.5%     12.9%     11.6%
(random mapping, SMOTE)                     (-1.0%)   (+0.3%)   (+0.2%)   (+0.5%)   (+0.2%)   (+0.1%)
CNN-Boost-CatB                              27.0%     22.3%     20.4%     16.9%     13.3%     12.0%
CNN-Boost-CatB                              27.0%     22.6%     20.4%     17.0%     13.5%     12.0%
(optimized mapping, PPD-tensor-overlay)     (+2.0%)   (+2.6%)   (+2.6%)   (+2.0%)   (+0.8%)   (+0.5%)
RF + XGB                                    26.0%     19.6%     17.8%     14.9%     13.0%     11.7%
XGB + CatB                                  26.0%     20.3%     17.8%     14.8%     13.0%     11.8%
RF + CatB                                   25.0%     20.3%     17.4%     14.8%     12.8%     11.8%
In Table 1, we show the precision@k values for a selected range of k values: 100, 300, 500, 1000, 2000 and 3000. We compare our method with the base machine learning models: Random Forest, XGBoost, and CatBoost. From Table 1, it can be observed that our CNN boosted ML model has overall better precision@k than the corresponding single baseline model. The improvements tend to be larger at smaller k values. In addition to listing single model performance, we also run experiments with combinations of base ML models. A simple combination of RF, XGB, and CatBoost does not necessarily improve model performance, which also demonstrates the boosting ability of the proposed CNN method.

Furthermore, to better isolate the impact of the genetic algorithm and the data augmentation in the proposed boosting approach, we also report the model performance without optimized PPD tensor locations (random mapping) and with the SMOTE data augmentation technique. The results show that the genetic algorithm for feature position optimization is critically important for generating a high-quality PPD tensor that can be fed into CNN models. A random set of feature positions contributes little to generating a meaningful image for learning. Data augmentation is a crucial component of the CNN part of our model. Because the two classes are highly unbalanced, our CNN model fails to work without data augmentation and cannot learn anything from the minority class. SMOTE can also be used for minority class upsampling, and we include it for reference. Our PPD-tensor-overlay data augmentation runs much faster than SMOTE and gives slightly higher model performance.

Compared to our production method, which is a fine-tuned XGB and RF model ensemble, the CNN boosted XGB model improves absolute precision at k = 1000 by 2.5 percentage points (17.4% versus 14.9% for RF + XGB in Table 1). This means that many more correctly identified patients could potentially receive targeted treatment on time, bringing significant value to the effectiveness of the overall treatment strategy.

5. Conclusion and future work

In this work, we propose to map procedure, prescription, and diagnosis features in EHR records to PPD tensors and use a CNN algorithm to learn the image features for model prediction boosting. The interactive relationship between these medical features is taken into consideration. We apply our method to real-world healthcare record data, and experimental results show that it can significantly boost the performance of classical base machine learning models. In the future, we plan to apply the new model design to various therapeutic areas and different clinical scopes that use a similar data structure. In addition to full-image-based classification, recent studies such as Recurrent Attention Convolutional Neural Network and Multiple Object Recognition have become hot topics; they analyze the local information of an image and emphasize the important role of such subpatterns in the task. With the support of those cutting-edge findings, we can further investigate identifying the subpatterns of PPD feature combinations that make the greatest contributions to model prediction performance, which can be essential references for healthcare providers. Additionally, those medical features can be mapped to 3D images in which more detailed time sequence information can be encoded. The 3D images can then be fed into 3D-CNN models for feature learning.

6. CRediT authorship contribution statement

Xueli Xiao: Methodology, Software, Visualization, Writing - Original Draft. Guanhao Wei: Conceptualization, Validation, Writing - Review & Editing. Li Zhou: Supervision, Writing - Review & Editing. Yi Pan: Methodology, Resources. Huan Jing: Data Curation. Emily Zhao and Yilian Yuan: Project administration, Funding acquisition.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References

[1] Eric J. Topol. High-performance medicine: the convergence of human and artificial intelligence. Nature Medicine, 25(1):44–56, 1 2019.
[2] Cao Xiao, Edward Choi, Jimeng Sun, Opportunities and challenges in developing deep learning models using electronic health records data: a systematic review, Journal of the American Medical Informatics Association: JAMIA 25 (10) (2018) 1419–1428.
[3] Edward Choi, Andy Schuetz, Walter F Stewart, Jimeng Sun, Using recurrent neural network models for early detection of heart failure onset, Journal of the American Medical Informatics Association: JAMIA 24 (2) (2017) 361–370.
[4] Kezi Yu, Yunlong Wang, Yong Cai, Cao Xiao, Emily Zhao, Lucas Glass, and Jimeng Sun. Rare Disease Detection by Sequence Modeling with Generative Adversarial Networks. arXiv preprint arXiv:1907.01022, 2019.
[5] Zachary C Lipton, David C Kale, Charles Elkan, Randall P Wetzel Laura, and Leland K Whittier Virtual PICU. Learning to diagnose with lstm recurrent neural networks. arXiv preprint arXiv:1511.03677, 2015.
[6] Edward Choi, Mohammad Taha Bahadori, Joshua A Kulas, Andy Schuetz, Walter F Stewart, Jimeng Sun, RETAIN: An Interpretable Predictive Model for Healthcare using Reverse Time Attention Mechanism, in: Advances in Neural Information Processing Systems, 2016.
[7] Chao Che, Cao Xiao, Jian Liang, Bo Jin, Jiayu Zhou, and Fei Wang. An RNN Architecture with Dynamic Temporal Matching for Personalized Predictions of Parkinson’s Disease, in: Proceedings of the 2017 SIAM International Conference on Data Mining, Philadelphia, PA, 6 2017. Society for Industrial and Applied Mathematics.
[8] Fan Zhang, Tong Wu, Yunlong Wang, Yong Cai, Cao Xiao, Emily Zhao, Lucas Glass, and Jimeng Sun. Predicting Treatment Initiation from Clinical Time Series Data via Graph-Augmented Time-Sensitive Model. arXiv preprint arXiv:1907.01099, 2019.
[9] Inci M Baytas, Cao Xiao, Xi Zhang, Fei Wang, Anil K Jain, and Jiayu Zhou. Patient Subtyping via Time-Aware LSTM Networks, in: Proceedings of the 23rd ACM SIGKDD international conference on knowledge discovery and data mining, 2017.
[10] Edward Choi, Zhen Xu, Yujia Li, Michael Dusenberry, Gerardo Flores, Emily Xue, and Andrew Dai. Learning the graphical structure of electronic health records with graph convolutional transformer, in: Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 606–613, 2020.
[11] Edward Choi, Cao Xiao, Walter F Stewart, and Jimeng Sun. MiME: Multilevel Medical Embedding of Electronic Health Records for Predictive Healthcare. arXiv preprint arXiv:1810.09593, 2018.
[12] Tong Wu, Yunlong Wang, Yue Wang, Emily Zhao, Yilian Yuan, and Zhi Yang. Representation Learning of EHR Data via Graph-Based Medical Entity Embedding. arXiv preprint arXiv:1910.02574, 2019.
[13] Xian Zeng, Zheng Jia, Zhiqiang He, Weihong Chen, Xudong Lu, Huilong Duan, and Haomin Li. Measure clinical drug–drug similarity using Electronic Medical Records. International Journal of Medical Informatics, 124(May 2018):97–103, 2019.
[14] Alok Sharma, Edwin Vans, Daichi Shigemizu, Keith A. Boroevich, Tatsuhiko Tsunoda, DeepInsight: A methodology to transform a non-image data to an image for convolution neural network architecture, Scientific Reports 9 (1) (2019) 1–7.
[15] Fenglong Ma, Radha Chitta, Jing Zhou, Quanzeng You, Tong Sun, and Jing Gao. Dipole: Diagnosis prediction in healthcare via attention-based bidirectional recurrent neural networks, in: Proceedings of the 23rd ACM SIGKDD international conference on knowledge discovery and data mining, pages 1903–1911, 2017.
[16] Fenglong Ma, Yaqing Wang, Houping Xiao, Ye Yuan, Radha Chitta, Jing Zhou, and Jing Gao. A General Framework for Diagnosis Prediction via Incorporating Medical Code Descriptions, in: 2018 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pages 1070–1075. IEEE, 12 2018.
[17] Jiaming Xu, Peng Wang, Guanhua Tian, Bo Xu, Jun Zhao, Fangyuan Wang, and Hongwei Hao. Convolutional Neural Networks for Text Hashing, in: IJCAI 15 Proceedings of the 24th International Conference on Artificial Intelligence, pages 1369–1375, 12 2015.
[18] Xin Gao, Jie Zhang, Zhi Wei, Hakon Hakonarson, Deeppolya: A convolutional neural network approach for polyadenylation site prediction, IEEE Access 6 (2018) 24340–24349.
[19] Boyu Lyu and Anamul Haque. Deep Learning Based Tumor Type Classification Using Gene Expression Data, in: BCB ’18: Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics, pages 89–96, 2018.
[20] Steven R. Young, Derek C. Rose, Thomas P. Karnowski, Seung-Hwan Lim, and Robert M. Patton. Optimizing deep learning hyper-parameters through an evolutionary algorithm, in: Proceedings of the Workshop on Machine Learning in High-Performance Computing Environments - MLHPC ’15, pages 1–5, Austin, Texas, USA, 2015. ACM Press.
[21] Esteban Real, Sherry Moore, Andrew Selle, Saurabh Saxena, Yutaka Leon Suematsu, Jie Tan, Quoc Le, and Alex Kurakin. Large-Scale Evolution of Image Classifiers, in: Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 2902–2911, Sydney, NSW, Australia, 2017.
[22] Xueli Xiao, Ming Yan, Sunitha Basodi, Chunyan Ji, and Yi Pan. Efficient Hyperparameter Optimization in Deep Learning Using a Variable Length Genetic Algorithm. arXiv preprint arXiv:2006.12703, 2020.
[23] Jatinder N.D Gupta and Randall S Sexton. Comparing backpropagation with a genetic algorithm for neural network training. Omega, 27(6):679–684, 12 1999.
[24] David J Montana and Lawrence Davis. Training Feedforward Neural Networks Using Genetic Algorithms, in: Proceedings of the 11th international joint conference on Artificial Intelligence, pages 762–767, San Francisco, CA, USA, 1989. Morgan Kaufmann Publishers Inc.
[25] Xiuchun Xiao, Bing Liu, Jing Zhang, Xueli Xiao, and Yi Pan. An Optimized Method for Bayesian Connectivity Change Point Model. Journal of Computational Biology, 25(3):337–347, 3 2018.
[26] Xiuchun Xiao, Bing Liu, Jing Zhang, Xueli Xiao, and Yi Pan. Detecting Change Points in fMRI Data via Bayesian Inference and Genetic Algorithm Model, in: ISBRA 2017: Bioinformatics Research and Applications, pages 314–324. Springer, Cham, 2017.
[27] Barrie M. Baker and M.A. Ayechew. A Genetic Algorithm for the Vehicle Routing Problem. Computers & Operations Research, 30(5):787–800, 4 2003.
[28] Connor Shorten and Taghi M. Khoshgoftaar. A survey on Image Data Augmentation for Deep Learning. Journal of Big Data, 6(1):60, 12 2019.
[29] Nitesh V. Chawla, Kevin W Bowyer, Lawrence O Hall, and W Philip Kegelmeyer. SMOTE: Synthetic Minority Over-sampling Technique. Journal of Artificial Intelligence Research, 16:321–357, 2002.
[30] Hiroshi Inoue. Data Augmentation by Pairing Samples for Images Classification. arXiv preprint arXiv:1801.02929, 2018.
[31] T.K. Bamunu Mudiyanselage, X. Xiao, Y. Zhang, Y. Pan, Deep Fuzzy Neural Networks for Biomarker Selection for Accurate Cancer Detection, IEEE Trans. Fuzzy Syst., 2019.
[32] Junbo Zhang, Jian-Syuan Wong, Tianrui Li, and Yi Pan. A comparison of parallel large-scale knowledge acquisition using rough set theory on different MapReduce runtime systems. International Journal of Approximate Reasoning, 55(3):896–907, 3 2014.
[33] Tianqi Chen and Carlos Guestrin. XGBoost: A Scalable Tree Boosting System, in: Proceedings of the 22nd acm sigkdd international conference on knowledge discovery and data mining, pages 785–794, 2016.
