CIS5364 Term Paper: Data Mining in Healthcare
net/publication/322754945
2 authors, including:
Indrajit Sen
Texas State University
All content following this page was uploaded by Indrajit Sen on 28 January 2018.
ABSTRACT
The focus of this paper is to examine the gift of data mining in everyday life,
has received a fillip. Using and enhancing already known statistical techniques, data
mining helps predict human behavior from sectors as diverse as supermarket purchases
The paper starts with a brief introduction to data mining including describing its
popular and everyday applications in retail. Data mining technologies and algorithms are
briefly analyzed. A quick interview with a firm actually using data mining to its benefit is
mentioned.
Next, the paper moves on to describing various research papers that have used
data mining to answer critical health questions. What is the age group most susceptible
to cardiovascular diseases? What is the most popular cancer vaccine trial? How many
such trials have been successful? What is a good treatment for a rare children’s disease?
How can data mining be used to solve problems relating to medical applications across
nations? How can life expectancy be accurately determined? Most of these questions find
We then examine legal and ethical aspects of data mining. Finally, we close on an
INTRODUCTION
In today’s world it seems difficult to plan without data mining. But imagine waking up
one day and realizing that there is no way to access any information that is valuable
to you. Suppose you are a doctor and find that you have no means of looking up a
patient’s habits and activities on the computer. There is no way to search for
effective treatments and best practices, and no way to analyze the data and avoid
some of the complications involved in the industry. We know
With the advancement in data mining, these days we can answer crucial questions
like “What kind of surgeries resulted in longer than five days of stay for patients in
hospitals?” and “What were the common pre-surgery symptoms of patients who stayed
for a longer period of time in a hospital?” The utility of data mining is not limited
to the healthcare industry; it also extends to improving customer satisfaction, better target
for all industries. However, since this paper is based on healthcare applications of data
BACKGROUND
Data mining can be considered a relatively recently developed methodology and
technology, coming into prominence only in 1994. It aims to identify valid, novel,
through copious sets of data to sniff out patterns that are too subtle or complex for humans
to detect. A huge amount of data is collected during different processes.
Traditional methods would take too much time and effort to analyze the data. With data
mining business tools and data mining algorithms, it is much easier to track down the
Due to its importance, data mining has been used intensively by many
and its applications within healthcare are of vital importance. For example, data mining
can help healthcare insurers detect fraud and abuse, healthcare organizations make
and best practices, and patients receive better and more affordable healthcare services
healthcare, customer relationship management, and the detection of fraud and abuse. It
also gives an illustrative example of a healthcare data mining application involving the
Imagine that you are running fast and reach a point where you should not run any
further, but you keep pushing yourself until your doctor calls and tells you that you must
slow down. This is less far-fetched than it sounds. Already, some mobile apps and trackers are collecting
your fitness data and sending it to the cloud. Microsoft HealthVault, Microsoft’s
web-based electronic health records platform, lets doctors access data from fitness trackers
like Fitbit or Nike+ FuelBand and glucose and heart monitors that patients have
Today, with the advancement in technology, you do not have to fill out a new form
every time you see another doctor. Doctors now share that information with each other.
Apple, Adidas, Samsung, GPS maker Garmin, audio technology company Jawbone, and
gaming hardware manufacturer Razer are developing products that measure biological
functions at ever faster clips. Startups across the country are creating gadgets such as
pill boxes that can monitor whether patients are taking their meds and under-the-mattress
sensors that measure heart rate, breathing and movement. It is an attempt to create a
focus on the most important information in the data they have collected about the behavior
of their customers and potential customers. Data mining can reveal a great deal about
patterns and behaviors, which helps in making valuable business decisions. Its
applications include:
1) Fraud Detection: Big stores like Macy’s or JCPenney, as well as small
businesses, can keep track of customers who buy items and return them after using
them. This kind of information can be tracked if the transactions are made with one
particular credit card. During a job search, one of the authors spoke with Mr. Shane
Johnson, a business analyst at Buckle, Inc., who said that many customers buy a
particular item, such as children’s clothing or a woman’s dress, and return it after a
few days. These garments have usually been worn. After examining the credit card
information in detail, the store found that the customers doing this were mainly
women between 18 and 29 years old and of Hispanic origin. According to Johnson, there
is little the store can do to fix the problem; at most it can tell these customers
that they have a fairly strong return history, so that they know the store is aware of
what they are doing (Johnson, 2014).
2) Identifying complementary goods for a particular kind of product:
a) Amazon offers a useful example of how descriptive findings are used for prediction.
Looking at the user’s purchase history Amazon was able to find the association between
b) Target assigns every customer a Guest ID number, tied to their credit card, name, or
e-mail address that becomes a bucket that stores a history of everything they have bought
and any demographic information Target has collected from them or bought from other
TOOLS, SOFTWARE AND ALGORITHMS
COMMON DATA ANALYSIS TOOLS
Orange: A component-based data mining and machine learning software suite written in
RapidMiner: An environment for machine learning and data mining experiments (7).
Laboratory.
SenticNet API: A semantic and affective resource for opinion mining and sentiment
analysis.
framework for analyzing unstructured content such as text, audio and video – originally
developed by IBM.
programming language.
One of the authors of this paper interned at the Keller Williams Realty firm and used
the R software for her research work. Keller Williams is a renowned realty firm that
collects and analyzes customer data. It collects data from various sources such as
different companies, seminars, online enquiries and walk-ins. From this it gathers vital
information about its clients: for example, whether people living in a particular location
are looking for a big-budget or a small-budget house, and how age is related to the size
of the house. It then creates and organizes marketing campaigns designed for the
particular target group identified by the analysis. Data mining helped the firm
considerably, because it could now target a limited group of people with specific
attributes instead of the whole population, many of whom do not even require a
big-budget house. We interviewed the manager of Keller Williams South Austin, who said
that data mining and its applications have resulted in more focused marketing and in
improved results compared with the past, when campaigns were targeted at clients as a
single entity. He added that campaigns and marketing events are now more specific and
take customer needs into account, rather than relying on bulk marketing and sending
thousands of emails on a regular basis to people whose requirements are not even met by
those ad campaigns.
algorithm checks for some sort of connectivity and pattern in the data and creates results.
The algorithm then uses the results of this analysis to define the optimal parameters for
creating the mining model. These parameters are then applied across the entire data set
There can be multiple algorithms to define the model. It is not unusual for seasoned
analysts to mine data using an initial algorithm, and then use a more complex one to
refine their results. Research papers that mined healthcare databases have often
found that their findings were enhanced by the second algorithm, as this paper shows
in a subsequent section. The information extracted depends on the algorithm used,
and can then be used to make valuable decisions.
CHOOSING THE RIGHT ALGORITHM
It is not always easy to choose the best algorithm; the choice can be tricky and
cumbersome. Different algorithms can produce different results, and how much the
results differ can sometimes be used to determine the efficacy of a research method
(Microsoft Technet, 2014). For example, suppose you are working for Sam’s Club, have
data on tens of thousands of customers, and need to cut the data down but cannot
decide which data to delete and which to keep. In this case the Microsoft
Decision Trees algorithm can be of great use, because it can identify which
Examples are Support Vector Machines (SVM) and C4.5 (Yang, 2007).
Naïve Bayes
various attributes in a data set. The most famous example is the Apriori algorithm.
Segmentation algorithms: These slice up the data into groups or clusters. The
Sequence analysis algorithms: These summarize frequent sequences or episodes
in data, such as a Web path flow. An example is the Microsoft Sequence Clustering
algorithm (Microsoft Technet, 2014).
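To make the idea of summarizing frequent sequences concrete, the following is a minimal sketch in Python. It is not the Microsoft algorithm; it simply counts how many sessions contain each contiguous run of pages. The function name, the click paths and the support threshold are all hypothetical and invented for illustration.

```python
from collections import Counter

def frequent_subsequences(sessions, length=2, min_support=2):
    """Return contiguous page sequences that occur in at least min_support sessions.

    Support is the number of sessions containing the sequence at least once.
    """
    support = Counter()
    for pages in sessions:
        # Collect each distinct contiguous window once per session.
        windows = {tuple(pages[i:i + length])
                   for i in range(len(pages) - length + 1)}
        support.update(windows)
    return {seq: n for seq, n in support.items() if n >= min_support}

# Hypothetical Web click paths (assumed data, not from any cited study).
sessions = [
    ["home", "search", "product", "cart"],
    ["home", "search", "product"],
    ["home", "product", "cart"],
]
frequent = frequent_subsequences(sessions, length=2, min_support=2)
```

Here the pair ("home", "product") appears in only one session and is dropped, while the three other page-to-page steps each appear in two sessions and are kept.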
Cardiovascular disease and cancer are the two deadliest killers in the world in that
order, according to the WHO (Mathers CD, 2009). Better knowledge about causes and
symptoms can no doubt reduce or delay fatalities to a large extent. Data about patients
either in the format of the data or its availability. Even if all or most of the data could be
brought in a mutually intelligible format, it is not humanly possible to draw inferences from
the hidden patterns. Most of the hidden information or pattern would go unnoticed and
the utility of the precious data would really be limited to a small group of localized patients.
Physicians in technologically advanced nations like the US and the UK would not be able
to fruitfully research that data and find new ground-breaking cures for all of humankind.
CARDIOVASCULAR DISEASES
A group of three Iranian scientists used classical data mining algorithms like
Decision Trees, Artificial Neural Networks (ANNs), and Support Vector Machine (SVM)
to attempt to predict the early onset of Coronary Artery Disease (CAD) (Peyman Rezaei
Hachesu, 2013). Although the study was local, and onset of CAD is also dependent on
race, their study provides valuable insight into prediction of CAD. A group of around 5000
patients with CAD were analyzed using the three algorithms above. The following steps
1) The sample population was carefully chosen with expert medical advice, such
that patients of a particular heart health hospital in Tehran, Iran qualified well
2) Not all patients in the available pool had consistent or complete data. The
data was pre-processed to remove noise, missing values were substituted using
average values in most cases, and outliers were removed. Outliers were defined
as values lying outside the range between the first and third quartiles.
3) After the clean-up, only around 2000 data points were found to be complete
and valid. Since separation into a training and testing set is an important aspect
of data mining, 80% of the data was used for training and 20% for testing.
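The pre-processing steps listed above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the study's actual code: the function names and the sample ages are hypothetical, and the outlier rule uses the common 1.5 x IQR fence around the quartiles, which is an assumption (setting k=0 keeps only values between the first and third quartiles, the literal reading of the description above).

```python
import random
import statistics

def impute_mean(values):
    """Replace missing entries (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = statistics.mean(observed)
    return [mean if v is None else v for v in values]

def remove_outliers(values, k=1.5):
    """Keep values within [Q1 - k*IQR, Q3 + k*IQR] (k=1.5 is an assumed convention)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if low <= v <= high]

def split_train_test(rows, train_fraction=0.8, seed=0):
    """Shuffle reproducibly, then cut into training and testing portions."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical patient ages with two missing values and one implausible entry.
ages = [52, 58, None, 61, 57, 140, 55, 60, None, 59]
imputed = impute_mean(ages)
cleaned = remove_outliers(imputed)
train, test = split_train_test(list(range(10)))  # 80% / 20% of ten record ids
```

Note that imputing before removing outliers lets the implausible value inflate the substituted mean; in practice the order of these two steps is a design choice worth checking against the data.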
It was found that the mean age for onset of CAD was 58 with the 54-64 year old
age group being most susceptible. Overall, the SVM technique was found to be the most
accurate.
Using similar data sets and the same analysis algorithm (SVM), the onset of CAD in
other countries, including the US, could potentially be predicted. According to the
American Heart Association, the cost to treat heart disease in United States will triple by
2030 (American Heart Association, 2011). Further research into the factors causing CAD
CANCER
Although cardiovascular disease is the biggest killer, cancer is not far behind. In
fact, cancer is catching up, with global cancer deaths projected to increase from
7.1 million in 2002 to 11.5 million in 2030 (World Health Organization, 2007).
The largest pharmaceutical companies in the world are (literally) in a rat-race to invent
new medications and compounds to cure cancer. A vital part of any new drug or vaccine
introduction is clinical trials. Clinical trials are research studies that explore whether a
medical strategy, treatment, or device is safe and effective for humans (National Institutes
of Health, 2014). As such, clinical trials involve huge data sets; however, just collecting
the data is useless if it cannot be mined or analyzed usefully. A wealth of publicly available
summarize and visualize cancer vaccine clinical trials (Xiaohong Cao, 2008). The
researchers deduced that although a large volume of data was available, only simple
querying techniques were used thus far. Using sophisticated data mining and
bioinformatics, the researchers were able to answer critical questions such as how long
the trials have been running, with or without success, which vaccine platforms were used,
and the phase of the trials. However, the most important question answered was whether any
types of cancer were neglected in research and trials. The researchers (not so surprisingly) found that
several varieties of equally deadly cancer like bladder, liver, pancreatic, stomach and
A few other major findings using data mining techniques on the publicly available cancer
1) Though the first cancer vaccine (lung) trial was conducted in 1971, a gradual
prevalence of trials started only as late as the early 2000s. Trials have been
2) The top five cancers targeted by vaccine therapy in clinical trials are: melanoma
(skin cancer), cervical, prostate, breast, and leukemia. Melanoma is the largest
3) With regard to institutions actually performing the trials, it was observed that the
type of vaccine strategy used. The researchers found that the majority of the
trials used an antigen-based vaccine followed by a cellular-based one. Together,
the antigen- and cellular-based vaccines form over 80% of the trials.
5) An interesting scatter-plot with cancer incidence rates on the X-axis and five
cancer prevalence and survival rates with existing medication. The four most
frequently occurring cancers (prostate, melanoma, breast and cervical) all have high
clinical trial rates (dark red circles). Interestingly, prostate cancer has a very high
PEDIATRICS
specialized hospitals like the Memphis, TN based St. Jude; mining all of the available
inpatient data is more important than ever. The aptly named ‘KID’ or Kids’ Inpatient
Database is a veritable one-stop shop for all pediatrics related clinical data (Bliss-Holtz,
2012). The KID is included in the HCUP (Healthcare Cost and Utilization Project) family
and Quality (AHRQ), a federal agency. The data sizes are large, implying that relatively
rare children’s diseases like prune belly syndrome can be easily analyzed. Variables
contained in the KID include primary and secondary diagnoses; primary and secondary
procedures; admission and discharge status; patient demographics including gender, age,
race, median income (by ZIP code data); total charges; length of stay and hospital
characteristics (e.g., ownership, size, teaching status). The KID is thus a veritable gold
mine and if properly mined can help solve many pediatrics related questions that
physicians face.
OUTPATIENT HEALTHCARE
Most outpatients are not treated as lavishly as inpatients in a typical hospital,
presumably because they pay much less. However, outpatient illnesses can be very
involved, and adequate knowledge regarding diseases, conditions and medications can
mean cost savings for both the patient and the care provider. A research paper
(Huang, 2013) based on a medical database of a Taiwanese hospital aims to
determine the best algorithm to analyze such a data set. Association rules can be
constructed between abnormal health examination results and outpatient illnesses. A
disease prevention knowledge database can then be built up that assists healthcare
providers in follow-up treatment and prevention. The author also proposes a new
algorithm that can analyze such a data set more effectively. Though definitely a candidate
for more rigorous testing, the power of data mining and the potential for further research
is easily demonstrated.
A few points on the choice of data mining algorithms and research methodology
required by this study. Apriori algorithms were first discussed in 1993 and have
been popular since then (Huang, 2013). However, Apriori requires repeated
forefront.
2) Since the research was conducted in Taiwan, the data consisted of two parts
checkup data was divided into normal (01), below normal (02) and above
normal (03). Normal health data was filtered out, since the association sought
was between abnormal health results and outpatient illness (around 100,000
data points).
3) Outpatient illness records were obtained six months before and after the clinical
data. Incomplete, prenatal and dental data were also removed from the
dataset.
Please see figure 2 in the appendix for a flowchart of the data integration process.
4) A new algorithm, the Data Cutting and Sorting Method (DCSM), was proposed in
view of the limitations of the Apriori method. The DCSM is a seven-step
5) Empirical analysis revealed that association rules found by using DCSM and
Apriori were exactly the same, thereby validating the new algorithm. However,
DCSM was found to be around ten times faster than the classical Apriori.
research.
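For readers unfamiliar with Apriori, the sketch below shows the classical level-wise idea in plain Python: count candidate itemsets with one pass over the records per itemset size, keep the frequent ones, and join them into larger candidates. It is a minimal illustration, not the implementation used in the study, and it does not attempt to reproduce DCSM. The record contents, item names and support threshold are hypothetical.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {itemset: support_count} for every frequent itemset."""
    transactions = [frozenset(t) for t in transactions]
    items = {item for t in transactions for item in t}
    current = {frozenset([i]) for i in items}
    frequent = {}
    k = 1
    while current:
        # One full pass over the data per level: this repeated scanning is
        # the main cost of the classical algorithm.
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        k += 1
        # Join frequent (k-1)-itemsets into k-item candidates, then prune any
        # candidate that has an infrequent (k-1)-subset.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        current = {c for c in candidates
                   if all(frozenset(s) in level for s in combinations(c, k - 1))}
    return frequent

def confidence(frequent, antecedent, consequent):
    """Confidence of the rule antecedent -> consequent."""
    a = frozenset(antecedent)
    return frequent[a | frozenset(consequent)] / frequent[a]

# Hypothetical records pairing abnormal exam results with a later diagnosis.
records = [
    {"glucose_high", "bmi_high", "diabetes"},
    {"glucose_high", "diabetes"},
    {"glucose_high", "bmi_high", "diabetes"},
    {"bmi_high", "hypertension"},
    {"glucose_high", "bmi_high"},
]
frequent = apriori(records, min_support=2)
rule_conf = confidence(frequent, {"glucose_high"}, {"diabetes"})
```

In this toy data, four records contain an abnormal glucose result and three of those also contain a diabetes diagnosis, so the rule "glucose_high -> diabetes" has a confidence of 0.75.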
TRANSNATIONAL MEDICINE
conditions travel visa-free across international borders and time zones. The problem is
cancer in one country might not be the culprit in another; however, a related cause may
very well be. Data mining comes to the rescue again! To identify patterns of related
causes for a deadly disease, sequence clustering algorithms are very useful. Keeping in
mind the geographical distances between two countries, technologies like Service
geographically disparate datasets. With ever increasing Internet speeds, large data sets
eliminates most licensing needs and abstracts difficult technology from regular physician
Life expectancy is a very useful metric, not only for healthcare administration, but
also for social applications like insurance, Medicare, etc. A group of researchers sought
to determine the life expectancy of a sample of outpatients aged 50 and over
(Jason Scott Mathias, 2013). They used predictive data mining and high-dimensional
analytics. According to the authors, predictive data mining is already being used by
companies like Amazon and Google to recommend products to their customers.
Applications in healthcare include the ability to improve cancer and infectious disease
treatments.
The study had around 7500 subjects: patients over 50 with at least
one visit to a large medical facility in 2003. 980 health attributes from their electronic
health records were extracted and run through complex statistical techniques (that
known diseases, hospital visits, patient vital signs, medications and healthcare utilization.
Using Correlation Feature Selection (CFS), all attributes were tested for mutual
correlation and correlation with a dichotomous variable that represented death in five
years. The number of patients who passed away within five years was noted. Using a mix of
the rotation forest ensembling technique with alternating decision trees, the researchers
were able to develop an index that could distinguish a group of high risk
The research has great ramifications since patients who are more likely to survive
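The intuition behind Correlation Feature Selection can be sketched with a small Python example. CFS scores a subset of k attributes with the merit k*r_cf / sqrt(k + k*(k-1)*r_ff), where r_cf is the mean attribute-to-outcome correlation and r_ff is the mean attribute-to-attribute correlation, so attributes that predict the outcome but do not duplicate each other are favored. The sketch assumes plain Pearson correlation as the measure, which may differ from what the study used, and all patient values are hypothetical.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation; with a 0/1 outcome this is the point-biserial correlation."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def cfs_merit(features, outcome):
    """CFS merit of a feature subset given as {name: list of values}."""
    k = len(features)
    r_cf = sum(abs(pearson(v, outcome)) for v in features.values()) / k
    pairs = list(combinations(features.values(), 2))
    r_ff = sum(abs(pearson(a, b)) for a, b in pairs) / len(pairs) if pairs else 0.0
    return k * r_cf / math.sqrt(k + k * (k - 1) * r_ff)

# Hypothetical data: 1 means the patient died within five years.
died = [0, 0, 0, 1, 1, 1]
age = [52, 55, 60, 71, 78, 83]
age_months = [a * 12 for a in age]   # deliberately redundant copy of age
visits = [1, 3, 2, 4, 6, 5]

merit_age = cfs_merit({"age": age}, died)
merit_redundant = cfs_merit({"age": age, "age_months": age_months}, died)
merit_mixed = cfs_merit({"age": age, "visits": visits}, died)
```

Adding the redundant attribute leaves the merit unchanged (up to rounding), which is the behavior that lets this kind of selection shrink hundreds of correlated health attributes to a smaller set.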
LEGAL ASPECTS OF DATA MINING
Data mining of health care related databases has two broad-based uses in the
legal world. The first being its use in non-healthcare legal matters where data mining can
Evidence 404(b) makes no provision for treating prior acts found by humans any
differently than prior acts found by computer using data mining. Thus, a plaintiff with a
claims-related case can very well use reasonable data mining techniques to support their
position in a court of law.
The second legal aspect of data mining deals with the healthcare data itself. A
good starting point is the US Supreme Court ruling of June 2011 in Sorrell v. IMS
Health Inc., which determined that Vermont's law prohibiting pharmacies from selling
prescription data to "data-mining companies" violated the Free Speech Clause of the First
Amendment (Cohen, 2012). When it comes to healthcare data, HIPAA (Health Insurance
Portability and Accountability Act of 1996) has a leading role to play. The Supreme Court
ruling is a little surprising because the federal Privacy Rule that implements HIPAA
prohibits any unauthorized use or disclosure of protected health information for marketing
purposes. However, laws are usually interpreted ‘in context’ (and this was a marketing,
not a research context) and thus the Supreme Court ruling throws many challenges in the
face of data mining evangelists who seek to make all healthcare research related data
global. Where marketing stops and gainful research starts has to be carefully determined.
Globally, however, privacy laws differ, and what may be legal in the US might be
systems, due diligence must be conducted prior to any significant monetary or time
commitment.
ETHICS
Ethics questions start where the law ends. Data mining firms might masquerade
as research firms, extract a lot of diverse data and sell it for their own profits. The question
of how useful such a mining exercise is going to be to the larger society in general must
be asked first. Hospitals are often cash-strapped and look for ways of making money
(other than over-billing insurance companies). A large hospital might well be tempted to
sell the data for ‘research purposes’ on a continuing basis, a step that might be legal in
some states or countries but totally unethical. Primary care physicians have their own
ethical role to play too. Bypassing HIPAA for research related data mining make quick
With the decrease in the ‘digital divide’, data travels internationally in seconds.
Most privacy laws apply only within national borders. Data can be easily traded across
boundaries (and not illegally, since laws in most countries have not caught up yet),
and very cheaply considering the levels of income in developing (and poor)
nations. Such international data mining ‘cartels’ can easily put a large population of a region
FINDINGS AND PROPOSED SOLUTIONS
Technology is addictive (and lucrative too), but legal regulations must be in place
countries have a few laws. Consortia of major countries (including emerging markets)
must be formed that can deliberate and legislate on transnational and ethical aspects of
data mining. Laws must favor the poorer economies to prevent misuse.
Education is vital in a complex field like data mining. Many large universities have
started offering courses in Data Mining, but a lot more needs to be done to reach the
masses. Data Mining does not have only elitist applications, but it can be used in everyday
and only a very small slice of the pie has been discovered so far. Current applications are
restricted to more experimental areas. Data mining should get easier and more
commonplace every day. In the near future, however, data mining algorithms should be able to
diseases like cancer. Also, currently most derived data mining patterns are more
mathematical than practical and are virtually ‘rocket science’ for most people not trained to
The future should see more technology abstraction layers being put (by developed
application software) that should make use and interpretation of data mining technologies
APPENDIX (LIST OF FIGURES)
FIGURE 1
FIGURE 2
FIGURE 3
BIBLIOGRAPHY
(2012). Retrieved from The Atlantic: http://www.theatlantic.com/technology/archive/2012/04/everything-you-wanted-to-know-about-data-mining-but-were-afraid-to-ask/255388/
American Heart Association. (2011). Retrieved from Cost to treat heart disease in United States will triple by 2030:
www.sciencedaily.com/releases/2011/01/110124121545.htm
Bliss-Holtz, J. (2012). THE KIDS’ INPATIENT DATABASE (KID) AND DATA MINING. Informa Healthcare
USA, Inc.
Borgwardt, H.-P. K. (2007). Future trends in data mining. Springer Science+Business Media.
Hernandez, D. (2014). Doctors monitor patients remotely via smartphones and fitness trackers. Retrieved from
http://www.pbs.org/newshour/updates/doctors-monitor-patients-vitals-via-smartphones-fitness-trackers
Hian, C. K. (n.d.). Data mining applications in healthcare. Retrieved from Journal of Healthcare Information
Management: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.92.3184&rep=rep1&type=pdf
Hill, K. (2012). How target figured out a teen girl was pregnant before her father did. Retrieved from
http://www.forbes.com/sites/kashmirhill/2012/02/16/how-target-figured-out-a-teen-girl-was-pregnant-before-her-father-did/
Huang, Y. C. (2013). Mining association rules between abnormal health examination results and outpatient medical
records. Health Information Management Journal.
Jason Scott Mathias, A. (2013). Development of a 5 year life expectancy index in older adults using predictive
mining of electronic health record data. Journal of the American Medical Informatics Association.
Jigjidsuren, C.-P. S. (2011). A Data-Mining Framework for Transnational Healthcare System. Journal of Medical
Systems.
Koh HC, T. G. (n.d.). US National Library of Medicine National Institutes of Health. Retrieved from
http://www.ncbi.nlm.nih.gov/pubmed/15869215
Mathers CD, L. D. (2009). Projections of global mortality and burden of disease from 2002 to 2030.
Peyman Rezaei Hachesu, M. A. (2013). Cardiac diseases prediction and rule extract with data mining - Classification
techniques. HealthMed.
World Health Organization. (2007). Retrieved from Department of Measurement and Health Information Systems:
World Health Statistics.
Xiaohong Cao, K. B. (2008). Data mining of cancer vaccine trials: a bird's-eye view. Immunome Research.