
2017 IEEE 19th International Conference on e-Health Networking, Applications and Services (Healthcom)

A Survey of Data Mining Technology on Electronic Medical Records


Wencheng Sun1, Zhiping Cai1, Fang Liu1, Shengqun Fang1, Guoyan Wang2
1 College of Computer, National University of Defense Technology, Changsha 410073, China
2 SysCan Biotechnology Company Limited, Suzhou 215000, China
Corresponding author: Zhiping Cai (zpcai@nudt.edu.cn)
Abstract—Medical institutes use Electronic Medical Records (EMR) to record a series of medical events, including diagnostic information (diagnosis codes), procedures performed (procedure codes) and admission details. Many data mining technologies are applied to EMR data sets for knowledge discovery, which is valuable to medical practice. The knowledge found helps to develop treatment plans, improve health care and reduce medical expenses; moreover, it can assist in predicting and controlling outbreaks of epidemic disease. The growing social value it creates has made it a hot spot for experts and scholars. In this paper, we summarize the research status of data mining technologies on EMR and analyze the challenges that EMR research currently confronts.

Keywords—EMR; data mining; classification; clustering; regression

I. INTRODUCTION

EMR contains many kinds of health-care data, which can be divided into three kinds: structured data, semi-structured data and unstructured data. Structured data is generally stored in fixed-mode databases and includes basic information (age, height, weight, blood type, etc.), drugs taken and allergies. Semi-structured data has a flow chart format, similar to the Resource Description Framework (RDF), and includes name, value and time-stamp. Unstructured text [1] is narrative data, such as clinical notes, surgical records and pathology reports. It is a treasure trove for data mining, but it lacks a common structural framework and contains many errors, such as improper grammar, spelling errors and semantic ambiguities, which increase complexity. EMR data are diverse and rich, and each data source or feature subset can provide different insights. Owing to the importance of medical health and the growing social value it creates, EMR data mining has become a hot topic in academia and industry.

II. RESEARCH PROGRESS OF DATA MINING TECHNOLOGY ON EMR

Data mining (DM), also known as Knowledge Discovery and Data Mining (KDD), refers to the process of discovering implicit and valuable patterns from data sets. Data mining techniques can be divided into two main categories: descriptive (unsupervised) learning and predictive (supervised) learning [2]. Descriptive data mining, a kind of exploratory analysis, attempts to measure the similarity between records and discover noteworthy patterns and relationships. The most important techniques in descriptive data mining are clustering and association rule mining. Predictive data mining attempts to classify data based on specific targets (or tags) in order to construct predictive models and generate predictive rules. Classification algorithms play a major role both in predictive learning and in the data mining technologies used on EMR. Table I shows the current research progress of data mining technology on EMR.

A. Classification Technology

The classification process can be divided into two stages: the learning stage and the classification stage. The classification model is generally constructed in the learning stage, and the model's accuracy is estimated in the classification stage. Before seeking medical care, patients normally wish to know how their condition is: serious or mild, early or late. Doctors and nurses may wish to know which treatment patients should receive, plan A or plan B. Problems of this kind are very suitable for classification technology to

978-1-5090-6704-6/17/$31.00 ©2017 IEEE



solve. As these examples show, the labels of the training set must be provided when building a classifier. That is, the class label of each training record is known, so classification is a supervised learning method.

Lo H Z et al. [3] presented an algorithm, based on Naive Bayes, which allows existing ADR (Adverse Drug Reaction) detection methods, which were developed for spontaneous reporting systems, to be applied directly to longitudinal EHR data. Hannan B et al. [5] constructed a system called iHANDs, which uses a Bayesian inference mechanism to identify the level of confidence for each possible cause. Li C et al. [6] proposed a novel hierarchical Bayesian non-parametric model, the word distance dependent Chinese restaurant franchise (wddCRF), which incorporates word-to-word distances to discover semantically coherent disease topics. Somanchi S et al. [7] used the SVM (Support Vector Machine) method to predict Code Blue events in order to reduce mortality. Ravindranath K R [8] constructed a CDSS (Clinical Decision Support System) using decision tree techniques and proposed an extended sub-tree strategy to handle continuous values. Che Z et al. [9] proposed two novel modifications to standard neural net training to discover and detect characteristic patterns of physiology in clinical time series data. Baba Y et al. [11] used a multi-classifier method to construct low-cost preventive approaches to non-communicable diseases by predicting subjective risk, drug recommendation and future risk.

B. Clustering Technology

Clustering is the process of dividing a data set into different subsets (also known as groups or clusters) such that data objects within the same subset are highly similar and data objects in different subsets are very dissimilar. In the medical field there are many areas about which we have little knowledge, such as super bacteria and new influenza strains, so their accumulated data is difficult to classify into known types. In these cases clustering technology is needed, and different clustering technologies will generate different subsets.

Rabbi K et al. [12] proposed a clustering algorithm for sensory data in health-care organizations based on dynamic feature selection, known as PCEHRClust, which aims to improve the quality of medical services. Canino G et al. [13] proposed a methodology to analyze Diagnosis Related Group (DRG) records and used a semantics-based clustering procedure to look for similar patterns of diseases. Toddenroth D et al. [14] proposed strategies to adapt heat maps to the search for associations and causal effects within routine EMR data. The use of measures of association as a clustering input proved to be valid and was taken as a trigger to apply transformations. Maya S et al. [15] proposed a bottom-up hierarchical clustering method to cluster the spatial patterns of the visual field in glaucoma patients in order to analyze the progression patterns of glaucoma.

C. Frequent Pattern Mining

A frequent pattern represents a trend in the data set. Medical entity names that occur frequently must have a certain relevance to one another, which helps with subsequent treatment. For example, adverse drug reactions and changes in nursing methods will lead to deterioration or improvement of a patient's condition.

Huang et al. [16] used a combination of two semantics-driven frequent pattern mining algorithms to analyze adverse drug events and optimize drug alert methods. Zhang L et al. [17] proposed a topic-model-based SPM (Sequential Pattern Mining) approach to find disease progression patterns. Perer A et al. [18] combined frequent sequence mining techniques with advanced visualizations to support the integration of data-driven insights into care pathway discovery.

D. Association Rule Mining

Association rule mining is usually applied after frequent pattern mining, and sometimes the two methods are grouped together as frequent pattern mining. Association rule mining can help discover rule relations between medical entities in a data set, such as drugs and symptoms or disease and condition, and help confirm the confidence of these relations.

Pathak J et al. [19] combined the semantic web with association rule mining technology to identify potential

drug-drug interactions, providing guidance for drug therapy and post-treatment interventions. Simon et al. [20] applied association rule mining to EMR to discover sets of risk factors and the corresponding sub-populations of patients at particularly high risk of developing diabetes. Chen J H et al. [21] used association methods to improve the accuracy of sequential prediction and to build a medical decision support system that predicts ICU mortality.

E. Time Series Mining

For patients, the improvement or deterioration of their condition varies over time. If doctors can anticipate this evolution, they can conduct risk prediction. Furthermore, for treatment plans closely related to time, such as emergency treatment of heart failure and advanced cancer treatment, time series mining technology is more effective.

Moskovitch R et al. [23] presented Maitreya, a framework for the prediction of outcome events that learns predictive models based on temporal patterns. Yin C et al. [25] used relevance feedback for retrieving time series data to provide decision support. Kop R et al. [26] studied the benefit of using advanced data mining techniques for colorectal cancer (CRC) and found that state-of-the-art data mining techniques, such as temporal data mining, are able to generate better predictive models. Zhou Z et al. [27] proposed a predictive method using spatio-temporal kernel density estimation (stKDE) and provided spatial density predictions for ambulance demand in Toronto. Liu C et al. [29] developed a novel representation, the temporal graph, for the longitudinal and heterogeneous properties of EMR.

F. Regression Algorithm

Regression discovers mathematical relations between medical data by applying relevant mathematical tools. Regression algorithms are generally used to identify relationships between variables, such as the extent to which variable A affects variable B, or to predict the future trend of variables.

Vinzamuri B et al. [31] proposed an algorithm that combines two unique correlation-based regularizers with Cox regression, which proved to be valid on several synthetic data sets and on EHR data about heart failure. Toerper et al. [32] developed a web-based tool that forecasts the daily bed need for admissions using routinely available clinical data within EMR. Tran T et al. [33] constructed a novel ordinal regression framework for predicting medical risk stratification from EMR, in order to solve a large short-term suicide risk prediction problem. Wang Z et al. [35] developed a dynamic Poisson auto-regressive model with exogenous input variables (DPARX) for flu forecasting, to predict short-term ILI (influenza-like illness) case counts.

G. Hybrid Model

A hybrid model is one built after comprehensive consideration of the advantages and disadvantages of various mining technologies. With careful design, hybrid models can therefore perform better than models built with a single technology. For example, when clustering technology is used to predict suicidal behaviors of people with mental illness, the results are expected to improve if the target data set is preprocessed ahead of time.

Liu M et al. [37] used different methods, such as neural networks and Bayesian methods, to test for adverse drug reactions. Maslove D M et al. [38] classified laboratory data and physiologic data using decision trees and naive Bayes classifiers to evaluate six discretization strategies, both supervised and unsupervised. Chen Y et al. [39] proposed an approach based on bi-clustering and a neural network for classification of lesions in breast ultrasound. Amin S et al. [40] proposed a hybrid system involving neural networks and genetic algorithms for prediction of heart disease using major risk factors.
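
Section II-F describes regression as a way to quantify the extent to which variable A affects variable B and to predict a future trend. The following is a minimal sketch of that idea only, assuming ordinary least squares with a single predictor (a simplification chosen here; the surveyed papers use richer models such as Cox, ordinal or Poisson regression). The function name `fit_line` and the toy numbers are hypothetical and do not come from any surveyed data set.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two paired observations")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Spread of x around its mean; zero means the slope is undefined.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be equal")
    # Co-variation of x and y around their means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical toy data: variable A (age in years) and
# variable B (systolic blood pressure in mmHg).
ages = [30, 40, 50, 60, 70]
pressures = [118, 124, 130, 136, 142]

slope, intercept = fit_line(ages, pressures)

# Extrapolate the fitted trend to an unseen value of A.
predicted_80 = intercept + slope * 80
```

Here `slope` estimates how much B changes per unit change in A, and `predicted_80` extends the fitted trend to a new value of A. Real EMR data would need pre-processing and a model suited to the outcome type before such a fit is meaningful.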

TABLE I. SUMMARY OF DATA MINING TECHNOLOGIES ON EMR

DM TYPE | METHODS | APPLICATION | REFERENCE | YEAR
Classification Technology | Naive Bayesian | Adverse drug reactions | [3] | 2013
Classification Technology | Bayesian grading framework | Infectious disease | [4] | 2015
Classification Technology | Bayesian reasoning mechanism | Personal health decision making system | [5] | 2014
Classification Technology | Stratified Bayesian model | Disease risk prediction | [6] | 2016
Classification Technology | Support Vector Machines | ICU risk prediction | [7] | 2015
Classification Technology | Decision tree | Decision Support Systems | [8] | 2015
Classification Technology | Neural network | EMR data processing | [9] | 2015
Classification Technology | Dynamic classification and hierarchical model | Re-hospitalization risk prediction | [10] | 2015
Classification Technology | Multi-classifier method | Noncommunicable disease prediction | [11] | 2015
Clustering Technology | Dynamic feature selection | Quality of medical service | [12] | 2015
Clustering Technology | A semantic-based clustering method | Disease patterns | [13] | 2015
Clustering Technology | Correlation measurement | EMR data processing | [14] | 2014
Clustering Technology | A bottom-up hierarchical clustering method | Glaucoma treatment | [15] | 2015
Frequent Pattern Mining | Frequent pattern mining | Adverse drug reactions | [16] | 2013
Frequent Pattern Mining | Clinical subject sequence pattern mining | Clinical patterns | [17] | 2016
Frequent Pattern Mining | Frequent pattern mining and visualization | Nursing path | [18] | 2015
Association Rule Mining | Semantic web and association rule mining | Potential drug-drug interactions | [19] | 2013
Association Rule Mining | Association rule mining | Diabetes risk prediction | [20] | 2015
Association Rule Mining | Association rule mining | Recommendation systems | [21] | 2016
Association Rule Mining | Association methods | ICU mortality prediction | [22] | 2014
Time Series Mining | Time series mining | Risk prediction | [23] | 2016
Time Series Mining | Time series mining | Personalized medical care | [24] | 2015
Time Series Mining | Relevance feedback | Decision-making support | [25] | 2014
Time Series Mining | Time series mining | Risk prediction of colorectal cancer | [26] | 2015
Time Series Mining | Space-time series technology | Ambulance positioning | [27] | 2015
Time Series Mining | Time series mining | Risk prediction of heart disease | [28] | 2014
Time Series Mining | Phenotyping framework | Risk prediction of heart failure | [29] | 2015
Time Series Mining | Time series mining | Disease risk prediction | [30] | 2014
Regression Algorithm | Cox regression | Heart failure treatment | [31] | 2013
Regression Algorithm | Multivariate logistic regression | Analysis of Cardiac Surgical Bed Demand | [32] | 2016
Regression Algorithm | Ordinal regression framework | Suicide risk prediction | [33] | 2015
Regression Algorithm | Generalized linear dynamic model | Death probability risk prediction | [34] | 2015
Regression Algorithm | Dynamic Poisson Auto-regressive Model | Influenza prediction | [35] | 2015
Regression Algorithm | Sparse logistic regression | Disease risk prediction | [36] | 2014
Hybrid Model | Neural networks and Bayesian | Adverse drug reactions | [37] | 2013
Hybrid Model | Decision trees and naive Bayes classifiers | ICU data classification | [38] | 2013
Hybrid Model | Bi-clustering and neural network | Breast cancer treatment | [39] | 2016
Hybrid Model | Neural networks and genetic algorithms | Heart disease risk prediction | [40] | 2013
Hybrid Model | K-means and clustering technologies | EMR data processing | [41] | 2014
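
Section II-D states that association rule mining helps confirm the confidence of relations between medical entities. The sketch below illustrates what that means, assuming the standard definitions: the support of an itemset is the fraction of records that contain it, and the confidence of a rule A -> C is support(A and C together) divided by support(A). The function names and the records are hypothetical toy values and are not taken from any surveyed paper.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    if not transactions:
        raise ValueError("need at least one transaction")
    items = set(items)
    hits = sum(1 for t in transactions if items <= set(t))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimate of P(consequent | antecedent) over the transactions."""
    base = support(transactions, antecedent)
    if base == 0:
        raise ValueError("antecedent never occurs in the transactions")
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / base


# Hypothetical records: each set holds the coded entities
# (drugs, symptoms) observed in one patient visit.
visits = [
    {"drug_a", "drug_b", "symptom_x"},
    {"drug_a", "drug_b", "symptom_x"},
    {"drug_a", "drug_b"},
    {"drug_a"},
    {"drug_b", "symptom_y"},
]

# Rule under test: {drug_a, drug_b} -> {symptom_x}
rule_support = support(visits, {"drug_a", "drug_b", "symptom_x"})
rule_confidence = confidence(visits, {"drug_a", "drug_b"}, {"symptom_x"})
```

In this toy set the rule's itemset appears in two of five visits, and two of the three visits that contain both drugs also contain the symptom. A practical miner would first enumerate frequent itemsets (Section II-C) and keep only rules above chosen support and confidence thresholds.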

III. CONCLUSION

As a new research field, EMR data mining is highly valued. However, there are some challenges that constrain its development.

1. Data Quality. EMR data is incomplete, redundant and diverse; therefore, it cannot be directly applied in data mining. Data pre-processing is a must, and different data sets need different methods according to their own characteristics. In addition, as narrative data, unstructured text does not have a unified framework and is generally processed with NLP (Natural Language Processing) and text mining methods.

2. Data Sharing and Privacy. At present, many patients suffer from social, moral and ethical stress and, worse still, are unwilling or afraid to receive medical treatment because of their illness or symptoms, such as unmarried pregnancy, HIV or other private issues. In the future, EMR will be circulated as a commercial product or public good, which would promote the development of medicine but at the same time put great pressure on privacy protection.

3. EMR Management. The way EMR is managed and the EMR standard vary with the HIS (hospital information system) of each medical institute, which makes numerous applications with EMR at the core, such as medical decision support systems and mobile medical services, difficult to promote. Moreover, cloud technology will play a larger role in EMR storage, transmission and management.

The healthcare industry is predominantly moving towards accessible and quality health services, and EMR data mining will continue to evolve fast, leading to impactful and positive changes to the way people work and live.

REFERENCES

[1] Feldman, K., Hazekamp, N., & Chawla, N. V. (2016). Mining the Clinical Narrative: All Text are Not Equal. IEEE International Conference on Healthcare Informatics (pp.271-280). IEEE.
[2] Fayyad, U., Piatetsky-Shapiro, G., & Smyth, P. (1996). The KDD process for extracting useful knowledge from volumes of data. ACM.
[3] Lo, H. Z., Ding, W., & Nazeri, Z. (2013). Mining Adverse Drug Reactions from Electronic Health Records. IEEE International Conference on Data Mining Workshops (Vol.309, pp.1137-1140). IEEE.
[4] Fan, K., Eisenberg, M., Walsh, A., Aiello, A., & Heller, K. (2015). Hierarchical Graph-Coupled HMMs for Heterogeneous Personalized Health Data. ACM SIGKDD International Conference (pp.239-248). ACM.
[5] Hannan, B., Zhang, X., & Sethares, K. (2014). iHANDs: Intelligent Health Advising and Decision-Support Agent. IEEE/WIC/ACM International Joint Conferences on Web Intelligence (Vol.3, pp.294-301). ACM.
[6] Li, C., Rana, S., Phung, D., & Venkatesh, S. (2016). Hierarchical bayesian nonparametric models for knowledge discovery from electronic medical records. Knowledge-Based Systems, 99, 168-182.
[7] Somanchi, S., Adhikari, S., Lin, A., Eneva, E., & Ghani, R. (2015). Early Prediction of Cardiac Arrest (Code Blue) using Electronic Medical Records. ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp.2119-2126). ACM.
[8] Ravindranath, K. R. (2015). Clinical Decision Support System for heart diseases using Extended sub tree. International Conference on Pervasive Computing (pp.1-5).
[9] Che, Z., Kale, D., Li, W., Bahadori, M. T., & Liu, Y. (2015). Deep Computational Phenotyping. ACM SIGKDD International Conference (pp.507-516). ACM.
[10] Basu Roy, S., Teredesai, A., Zolfaghar, K., Liu, R., Hazel, D., & Newman, S., et al. (2015). Dynamic Hierarchical Classification for Patient Risk-of-Readmission. ACM SIGKDD International Conference (pp.1691-1700). ACM.
[11] Baba, Y., Kashima, H., Nohara, Y., Kai, E., Ghosh, P., & Islam, R., et al. (2015). Predictive Approaches for Low-Cost Preventive Medicine Program in Developing Countries. ACM SIGKDD International Conference (Vol.1, pp.1681-1690). ACM.
[12] Rabbi, K., Mamun, Q., & Islam, M. R. (2015). Dynamic feature selection (DFS) based Data clustering technique on sensory data streaming in eHealth record system. Industrial Electronics and Applications (pp.661-665). IEEE.
[13] Canino, G., Guzzi, P., Tradigo, G., & Zhang, A. (2015). On the analysis of diseases and their related geographical data. IEEE Journal of Biomedical & Health Informatics, 1-1.
[14] Toddenroth, D., Ganslandt, T., Castellanos, I., Prokosch, H. U., & Bürkle, T. (2014). Employing heat maps to mine associations in structured routine care data. Artificial Intelligence in Medicine, 60(2), 79.
[15] Maya, S., Morino, K., Murata, H., Asaoka, R., & Yamanishi, K. (2015). Discovery of Glaucoma Progressive Patterns Using Hierarchical MDL-Based Clustering. ACM SIGKDD International Conference (pp.1979-1988). ACM.
[16] Huang, J., Huan, J., Tropsha, A., & Dang, J. (2013). Semantics-driven frequent data pattern mining on electronic health

records for effective adverse drug event monitoring. IEEE International Conference on Bioinformatics and Biomedicine (pp.608-611). IEEE.
[17] Zhang, L., Zhao, J., Wang, Y., & Xie, B. (2016). Mining patterns of disease progression: a topic-model-based approach. Studies in Health Technology & Informatics, 228, 354.
[18] Perer, A., Wang, F., & Hu, J. (2015). Mining and exploring care pathways from electronic medical records with visual analytics. Journal of Biomedical Informatics, 56(C), 369.
[19] Pathak, J., Kiefer, R. C., & Chute, C. G. (2013). Mining drug-drug interaction patterns from linked data: A case study for Warfarin, Clopidogrel, and Simvastatin. IEEE International Conference on Bioinformatics and Biomedicine (pp.23-30). IEEE.
[20] Simon, G. J., Caraballo, P. J., Therneau, T. M., & Cha, S. S. (2015). Extending association rule summarization techniques to assess risk of diabetes mellitus. IEEE Transactions on Knowledge and Data Engineering, 27(1), 130-141.
[21] Chen, J. H., Podchiyska, T., & Altman, R. B. (2016). Orderrex: clinical order decision support and outcome predictions by data-mining electronic medical records. Journal of the American Medical Informatics Association, 23(2), 339.
[22] Chen, J. H., & Altman, R. B. (2014). Automated physician order recommendations and outcome predictions by data-mining electronic medical records. AMIA Joint Summits on Translational Science Proceedings, 2014, 206-210.
[23] Moskovitch, R., Choi, H., Hripcsak, G., & Tatonetti, N. (2016). Prognosis of clinical outcomes with temporal patterns and experiences with one class feature selection. IEEE/ACM Transactions on Computational Biology & Bioinformatics, PP(99), 1-1.
[24] Yadav, P., Steinbach, M., Pruinelli, L., & Westra, B. (2015). Forensic Style Analysis with Survival Trajectories. IEEE International Conference on Data Mining (pp.1069-1074). IEEE.
[25] Yin, C., Ishikawa, H., & Takama, Y. (2014). Proposal of time series data retrieval with user feedback. IEEE International Conference on Granular Computing (pp.358-361). IEEE.
[26] Kop, R., Hoogendoorn, M., Moons, L. M. G., Numans, M. E., & Teije, A. T. (2015). On the advantage of using dedicated data mining techniques to predict colorectal cancer. Lecture Notes in Computer Science, 9105(5), 133-142.
[27] Zhou, Z., & Matteson, D. S. (2015). Predicting Ambulance Demand: a Spatio-Temporal Kernel Approach. ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp.2297-2303). ACM.
[28] Chia, C. C., & Syed, Z. (2014). Scalable noise mining in long-term electrocardiographic time-series to predict death following heart attacks. ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp.125-134). ACM.
[29] Liu, C., Wang, F., Hu, J., & Xiong, H. (2015). Temporal Phenotyping from Longitudinal Electronic Health Records: A Graph Based Framework. ACM SIGKDD International Conference (pp.705-714). ACM.
[30] Zhou, J., Wang, F., Hu, J., & Ye, J. (2014). From micro to macro: data driven phenotyping by densification of longitudinal electronic medical records. ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp.135-144). ACM.
[31] Vinzamuri, B., & Reddy, C. K. (2013). Cox Regression with Correlation Based Regularization for Electronic Health Records. IEEE International Conference on Data Mining (pp.757-766). IEEE.
[32] Toerper, M. F., Flanagan, E., Siddiqui, S., Appelbaum, J., Kasper, E. K., & Levin, S. (2015). Cardiac catheterization laboratory inpatient forecast tool: a prospective evaluation. Journal of the American Medical Informatics Association, 23(e1).
[33] Tran, T., Phung, D., Luo, W., & Venkatesh, S. (2015). Stabilized sparse ordinal regression for medical risk stratification. Knowledge and Information Systems, 43(3), 555-582.
[34] Caballero Barajas, K. L., & Akella, R. (2015). Dynamically Modeling Patient's Health State from Electronic Medical Records: A Time Series Approach. ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp.69-78). ACM.
[35] Wang, Z., Chakraborty, P., Mekaru, S. R., Brownstein, J. S., Ye, J., & Ramakrishnan, N. (2015). Dynamic Poisson Autoregression for Influenza-Like-Illness Case Count Prediction. ACM SIGKDD International Conference (pp.1285-1294). ACM.
[36] Wang, F., Zhang, P., Qian, B., Wang, X., & Davidson, I. (2014). Clinical risk prediction with multilinear sparse logistic regression. ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp.145-154). ACM.
[37] Liu, M., McPeek Hinz, E. R., Matheny, M. E., Denny, J. C., Schildcrout, J. S., & Miller, R. A., et al. (2013). Comparative analysis of pharmacovigilance methods in the detection of adverse drug reactions using electronic medical records. Journal of the American Medical Informatics Association, 20(3), 420-6.
[38] Maslove, D. M., Podchiyska, T., & Lowe, H. J. (2012). Discretization of continuous features in clinical datasets. Journal of the American Medical Informatics Association, 20(3), 544-53.
[39] Chen, Y., & Huang, Q. (2016). An approach based on biclustering and neural network for classification of lesions in breast ultrasound. International Conference on Advanced Robotics and Mechatronics (pp.597-601).
[40] Amin, S. U., Agarwal, K., & Beg, R. (2013). Genetic neural network based data mining in prediction of heart disease using risk factors. Information & Communication Technologies (pp.1227-1231). IEEE.
[41] Sumana, B. V., & Santhanam, T. (2014). Prediction of diseases by cascading clustering and classification. International Conference on Advances in Electronics, Computers and Communications (pp.1-8).
