CIS 5364 Term Paper: Data Mining in Healthcare

DATA MINING IN HEALTHCARE
Research · January 2018

DOI: 10.13140/RG.2.2.22189.38887

DATA MINING IN HEALTHCARE
Indrajit Sen, Krati Khandelwal

CIS 5364; SPRING 2014
TEXAS STATE UNIVERSITY
TABLE OF CONTENTS
Abstract
Introduction
Background
Definition and Usage
Tools, Software and Algorithms
    Common data analysis tools
    Data mining algorithms
    Choosing the right algorithm
    Choosing an algorithm by type
Applications of Data Mining in Healthcare
    Cardiovascular diseases
    Cancer
    Pediatrics
    Outpatient healthcare
    Transnational medicine
    Life expectancy calculations
Legal Aspects of Data Mining
Ethics
Findings and Proposed Solutions
Summary and Conclusion
Appendix (List of figures)
    Figure 1
    Figure 2
    Figure 3
Bibliography
ABSTRACT
The focus of this paper is to examine the contribution of data mining to everyday life,
especially healthcare. With the spread of computing technology, statistical analysis
has received a boost. By using and enhancing established statistical techniques, data
mining helps predict human behavior in sectors as diverse as supermarket purchases
and cancer vaccine manufacture.
The paper starts with a brief introduction to data mining, including a description of its
popular, everyday applications in retail. Data mining technologies and algorithms are
briefly analyzed, and a short interview with a firm that uses data mining to its benefit is
reported.
Next, the paper describes various research papers that have used
data mining to answer critical health questions. What is the age group most susceptible
to cardiovascular diseases? What is the most popular cancer vaccine trial? How many
such trials have been successful? What is a good treatment for a rare children’s disease?
How can data mining be used to solve problems relating to medical applications across
nations? How can life expectancy be accurately determined? Most of these questions find
an answer in this paper.
We then examine legal and ethical aspects of data mining. Finally, we close on an
optimistic note on the future prospects of this promising technology.
INTRODUCTION
In today’s world it seems difficult to plan without data mining. Imagine waking up one
day to find that you cannot access any of the information that is valuable to you.
Suppose you are a doctor and there is no way to look in the computer and recall a
patient’s habits and activities, no way to search for effective treatments and best
practices, and no way to analyze the data and avoid some of the complications involved
in the industry. We know that data is powerful and valuable. But how?
With advances in data mining, we can now answer crucial questions
like “What kind of surgeries resulted in longer than five days of stay for patients in
hospitals?” and “What were the common pre-surgery symptoms of patients who stayed
for a longer period of time in a hospital?” The utility of data mining is not limited to the
healthcare industry: it also helps improve customer satisfaction, target marketing
campaigns better, identify high-risk clients, and improve production processes
in all industries. However, since this paper is about healthcare applications of data
mining, our focus will be on healthcare.
BACKGROUND
Data mining is a relatively recent methodology and
technology, coming into prominence only in 1994. It aims to identify valid, novel,
potentially useful, and understandable correlations and patterns by combing
through copious sets of data to sniff out patterns that are too subtle or complex for humans
to detect. Huge amounts of data are collected during different processes, and
traditional methods take too much time and effort to analyze them. With data
mining business tools and data mining algorithms, it is much easier to get to the
core of the information quickly and accurately (Hian).
Because of its importance, data mining has been used intensively by many
organizations, and it is becoming increasingly popular in healthcare. For example, data mining
can help healthcare insurers detect fraud and abuse, healthcare organizations make
customer relationship management decisions, physicians identify effective treatments
and best practices, and patients receive better and more affordable healthcare services
(Koh HC1, n.d.).
Major application areas include the evaluation of treatment effectiveness, the management of
healthcare, customer relationship management, and the detection of fraud and abuse. The
same source also gives an illustrative example of a healthcare data mining application
involving the identification of risk factors associated with the onset of diabetes.
Imagine that you are running fast and reach a point where you should not run
any further, but you keep pushing yourself until your doctor calls and tells you to
slow down. That scenario is closer than it sounds. Already, some mobile apps and trackers
are collecting your fitness data and sending it to the cloud. Microsoft HealthVault,
Microsoft’s web-based electronic health records platform, lets doctors access data that
patients have uploaded themselves from fitness trackers like Fitbit or Nike+ FuelBand and
from glucose and heart monitors (Hernandez, 2014).
Today, with advances in technology, you do not have to fill out a new form
every time you see another doctor; doctors now share that information with each other.
Apple, Adidas, Samsung, GPS maker Garmin, audio technology company Jawbone, and
gaming hardware manufacturer Razer are developing products that measure biological
functions at an ever faster clip. Startups across the country are creating gadgets such as
pill boxes that can monitor whether patients are taking their meds and under-the-mattress
sensors that measure heart rate, breathing and movement. It is an attempt to create a
one-stop shop for health information (Hernandez, 2014).
DEFINITION AND USAGE

Data mining is a powerful technology with great potential to help companies
focus on the most important information in the data they have collected about the behavior
of their customers and potential customers. Data mining reveals a great deal
about patterns and behaviors, which in turn supports valuable business
decisions. Typical uses include:
1) Fraud detection: Big stores like Macy’s or JCPenney, as well as small
businesses, can keep track of customers who buy things and return
them after using them. This kind of information can be tracked when the transactions are
made with one particular credit card. During one of the authors’ job search, she spoke
with a business analyst at Buckle, Inc., Mr. Shane Johnson, who said that many
customers buy a particular item, such as children’s clothing or a women’s dress, and return
it after a few days. These dresses have usually been worn. After digging into the credit card
information in detail, the store found that the customers who were
doing this were mainly females between 18 and 29 years old and of Hispanic
origin. According to Mr. Johnson, the store can do little to fix the problem. At most, it can
tell these customers that they have a strong return history, so that this segment of
customers knows the store is aware of what they are doing (Johnson, 2014).
2) Identifying complementary goods for a particular kind of product:
a) Amazon offers a useful example of how descriptive findings are used for prediction.
By looking at users’ purchase histories, Amazon was able to find the association between
cocktail shaker and martini glass purchases (The Atlantic, 2012).
b) A similar example: Target assigns every customer a Guest ID number, tied to their credit
card, name, or e-mail address, that becomes a bucket storing a history of everything they
have bought and any demographic information Target has collected from them or bought
from other sources (Hill, 2012).
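An association such as the one between cocktail shakers and martini glasses is usually quantified with support, confidence and lift. The following is a minimal sketch of those three measures, not a description of how Amazon or Target actually compute them. The baskets, the item names and the helper functions `support` and `rule_metrics` are all invented for illustration.

```python
# Hypothetical shopping baskets; items and counts are invented for illustration.
baskets = [
    {"cocktail shaker", "martini glass", "olives"},
    {"cocktail shaker", "martini glass"},
    {"martini glass", "coasters"},
    {"cocktail shaker", "ice bucket"},
    {"novel", "bookmark"},
    {"cocktail shaker", "martini glass", "ice bucket"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def rule_metrics(antecedent, consequent, transactions):
    """Return (support, confidence, lift) for the rule antecedent -> consequent."""
    both = support(set(antecedent) | set(consequent), transactions)
    confidence = both / support(antecedent, transactions)
    lift = confidence / support(consequent, transactions)
    return both, confidence, lift


rule_support, rule_confidence, rule_lift = rule_metrics(
    {"cocktail shaker"}, {"martini glass"}, baskets
)
print(rule_support, rule_confidence, rule_lift)
```

A lift above 1 means the two items appear together more often than they would if purchases were independent, which is the kind of signal a retailer could use to recommend one item to buyers of the other.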
TOOLS, SOFTWARE AND ALGORITHMS
COMMON DATA ANALYSIS TOOLS
Orange: A component-based data mining and machine learning software suite written in
the Python language (Wikipedia, 2014).
R: A programming language and software environment for statistical computing, data
mining, and graphics. It is part of the GNU Project.
RapidMiner: An environment for machine learning and data mining experiments (7).
SCaViS: A Java cross-platform data analysis framework developed at Argonne National
Laboratory.
SenticNet API: A semantic and affective resource for opinion mining and sentiment
analysis.
UIMA: The UIMA (Unstructured Information Management Architecture) is a component
framework for analyzing unstructured content such as text, audio and video, originally
developed by IBM.
Weka: A suite of machine learning software applications written in the Java
programming language.
Many more tools are available.
One of the authors of this paper interned at the Keller Williams Realty firm and used
R for her research work. Keller Williams is a renowned realty firm that
collects and analyzes customer data. It collects data from various sources such as
other companies, seminars, online enquiries and walk-ins. From this data it extracts vital
information about its clients: for example, whether people living in a particular location
are looking for a big-budget or a small-budget house, and how age is related to the size of
the house. It then creates and organizes marketing campaigns. These marketing campaigns
are designed for a particular target group identified by the analysis. Data
mining helped the firm considerably, because it could now target a limited group of
people with specific attributes instead of targeting the whole population, including people
who do not even want a big-budget house. We interviewed the manager of Keller Williams
South Austin, who said that data mining and its applications have resulted in more
focused marketing and in better results than in the past, when
campaigns were aimed at all clients as a single group. He added that
campaigns and marketing events are now more specific and take customer needs into
account, rather than relying on bulk marketing that sends thousands of emails on a regular
basis to people whose requirements are not even met by those ad campaigns.
DATA MINING ALGORITHMS

A data mining algorithm is a set of calculations that interprets the data. The
algorithm checks for some sort of connectivity and pattern in the data and creates results.
The algorithm then uses the results of this analysis to define the optimal parameters for
creating the mining model. These parameters are then applied across the entire data set
to extract actionable patterns and detailed statistics.
More than one algorithm can be used to define the model. It is not unusual for seasoned
analysts to mine data using an initial algorithm and then use a more complex one to
refine their results. Research papers that mined healthcare
databases have often found that a second
algorithm enhanced their findings, as a subsequent section of this paper shows. The
information extracted depends on the algorithm used and can then be used to make valuable
decisions.
CHOOSING THE RIGHT ALGORITHM
It is not always easy to choose the best algorithm; the choice can be tricky and
cumbersome. Every algorithm produces a different result, and how different the
results are can sometimes be used to determine the efficacy of a research method
(Microsoft Technet, 2014). For example, suppose you work for Sam’s Club, hold tens
of thousands of customer records, and have to cut the data down but cannot decide
which data to delete and which to keep. In this case the Microsoft
Decision Trees algorithm can be of great use, because it can identify which
columns are of least importance and can therefore be deleted.
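One common way for a decision tree learner to judge how much a column matters is information gain, the reduction in entropy of the target obtained by splitting on that column. The sketch below ranks columns this way. It illustrates the general idea only and is not the internals of the Microsoft Decision Trees algorithm; the customer records, the column names and the target `renewed` are hypothetical.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(rows, column, target):
    """Entropy reduction in `target` obtained by splitting `rows` on `column`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[column]].append(row[target])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([row[target] for row in rows]) - remainder


# Hypothetical customer records; column names and values are invented.
customers = [
    {"membership": "business", "region": "north", "pays_by_card": "yes", "renewed": "yes"},
    {"membership": "business", "region": "south", "pays_by_card": "yes", "renewed": "yes"},
    {"membership": "personal", "region": "north", "pays_by_card": "yes", "renewed": "no"},
    {"membership": "personal", "region": "south", "pays_by_card": "yes", "renewed": "no"},
    {"membership": "business", "region": "north", "pays_by_card": "no", "renewed": "yes"},
    {"membership": "personal", "region": "south", "pays_by_card": "no", "renewed": "no"},
]

features = ["membership", "region", "pays_by_card"]
gains = {f: information_gain(customers, f, "renewed") for f in features}
ranked = sorted(features, key=gains.get, reverse=True)
print(ranked)
```

Columns at the end of `ranked` carry the least information about the target in this toy data and would be the first candidates for removal.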
CHOOSING AN ALGORITHM BY TYPE
• Classification algorithms: A dataset usually has several attributes. A classification
algorithm predicts one or more discrete variables based on these attributes.
Examples are Support Vector Machines (SVM), C4.5, AdaBoost and Naïve Bayes (Yang, 2007).
• Regression algorithms: While classification algorithms predict discrete variables,
regression algorithms predict continuous variables. Examples are linear regression and
the regression trees built by CART.
• Association algorithms: These are useful in determining the associations between
various attributes in a data set. The most famous example is the Apriori algorithm.
• Segmentation algorithms: These slice up the data into groups or clusters. The
Microsoft Clustering Algorithm is a good example.
• Sequence analysis algorithms: These summarize frequent sequences or episodes
in data, such as a Web path flow. An example is the Microsoft Sequence Clustering
algorithm (Microsoft Technet, 2014).
APPLICATIONS OF DATA MINING IN HEALTHCARE
Cardiovascular disease and cancer are the two deadliest killers in the world, in that
order, according to the WHO (Mathers CD, 2009). Better knowledge about causes and
symptoms can no doubt reduce or delay fatalities to a large extent. Data about patients
are present in hospital databases around the globe. However, there seems to be no
consistency, either in the format of the data or in its availability. Even if all or most of
the data could be brought into a mutually intelligible format, it is not humanly possible to
draw inferences from the hidden patterns unaided. Most of the hidden information would go
unnoticed, and the utility of the precious data would be limited to a small group of
localized patients. Physicians in technologically advanced nations like the US and the UK
would not be able to research that data fruitfully and find groundbreaking cures for all of
humankind.
CARDIOVASCULAR DISEASES
A group of three Iranian scientists used classical data mining algorithms like
Decision Trees, Artificial Neural Networks (ANNs), and Support Vector Machines (SVM)
to attempt to predict the early onset of Coronary Artery Disease (CAD) (Peyman Rezaei
Hachesu1, 2013). Although the study was local, and the onset of CAD also depends on
race, the study provides valuable insight into the prediction of CAD. A group of around 5000
patients with CAD was analyzed using the three algorithms above. The following steps
were followed to preserve the validity and integrity of the research:
1) The sample population was carefully chosen with expert medical advice, such
that patients of a particular heart hospital in Tehran, Iran qualified well
for the study.
2) Not all patients in the available pool had consistent or
complete data. Data was pre-processed to remove noise, missing values were
substituted using average values in most cases, and outliers were removed.
Outliers were defined as values lying outside the first and third quartile.
Minitab14 was used to further investigate the data distribution.
3) After the clean-up, only around 2000 data points were found to be complete
and valid. Since separation into a training and a testing set is an important aspect
of data mining, 80% of the data was used for training and 20% for testing.
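The pre-processing steps above (mean substitution, quartile-based outlier removal, 80/20 split) can be sketched as follows. This is an illustration under stated assumptions, not the study's actual procedure: the readings are invented, the helper names are hypothetical, and the outlier rule assumed here is the common Tukey fence of 1.5 times the interquartile range beyond the first and third quartiles, which the study may or may not have used.

```python
import random
import statistics


def impute_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = statistics.mean(observed)
    return [mean if v is None else v for v in values]


def iqr_bounds(values, k=1.5):
    """Tukey fences (Q1 - k*IQR, Q3 + k*IQR); the factor k=1.5 is an assumption."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def train_test_split(records, train_fraction=0.8, seed=0):
    """Shuffle a copy of records and split it into training and testing lists."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical systolic blood pressure readings; None marks a missing value.
readings = [118, 122, None, 130, 125, 121, 127, 300, 119, 124]

filled = impute_mean(readings)                      # step 2: substitute averages
low, high = iqr_bounds(filled)                      # step 2: quartile-based fences
cleaned = [v for v in filled if low <= v <= high]   # drop values outside the fences
train, test = train_test_split(cleaned)             # step 3: 80% train, 20% test
print(len(cleaned), len(train), len(test))
```

Note that imputing before removing outliers, as in the order described above, lets an extreme value pull the substituted mean upward; reversing the two steps is a design choice worth considering on real data.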
It was found that the mean age for onset of CAD was 58, with the 54-64 year old
age group being most susceptible. Overall, the SVM technique was found to be the most
accurate.
Using similar data sets and the same analysis algorithm (SVM),
onset of CAD in other countries, including the US, can be predicted. According to the
American Heart Association, the cost to treat heart disease in the United States will triple
by 2030 (American Heart Association, 2011). Further research into the factors causing CAD
can reduce this expense significantly.
CANCER
Although cardiovascular disease is the biggest killer, cancer is not far behind. In
fact, cancer is catching up, with global cancer deaths projected to
increase from 7.1 million in 2002 to 11.5 million in 2030 (World Health Organization, 2007).
The largest pharmaceutical companies in the world are (literally) in a rat race to invent
new medications and compounds to cure cancer. A vital part of any new drug or vaccine
introduction is clinical trials. Clinical trials are research studies that explore whether a
medical strategy, treatment, or device is safe and effective for humans (National Institutes
of Health, 2014). As such, clinical trials involve huge data sets. However, just collecting
the data is useless if it cannot be mined or analyzed usefully. A wealth of publicly available
clinical data can be found on the US government’s website ClinicalTrials.gov.
Using the data available there on cancer in particular, three US researchers tried to
summarize and visualize cancer vaccine clinical trials (Xiaohong Cao*1, 2008). The
researchers observed that although a large volume of data was available, only simple
querying techniques had been used thus far. Using sophisticated data mining and
bioinformatics, the researchers were able to answer critical questions such as how long the
trials have been running, with or without success, which vaccine platforms were used, and
which phase the trials had reached. However, the most important question answered was
whether any types of cancer were neglected in research and trials. The researchers (not so
surprisingly) found that several equally deadly cancers, such as bladder, liver, pancreatic,
stomach and esophageal cancer, were neglected. This finding is sure to rattle the boardrooms
of many multinational pharmaceutical companies.
A few other major findings from applying data mining techniques to the publicly available
cancer clinical trial data are:
1) Though the first cancer vaccine (lung) trial was conducted in 1971, trials
became prevalent only as late as the early 2000s. Trials have been
steadily increasing since that time.
2) The top five cancers targeted by vaccine therapy in clinical trials are: melanoma
(skin cancer), cervical, prostate, breast, and leukemia. Melanoma is the largest
trial candidate, while cervical cancer is second.
3) With regard to institutions actually performing the trials, it was observed that the
National Cancer Institute was the undisputed leader, followed by GSK
(GlaxoSmithKline). All other pharmaceutical companies had contributed more or less
equally to cancer vaccine trials and research.
4) Effectiveness of cancer vaccine trials can also be measured by the specific
type of vaccine strategy used. The researchers found that the majority of the
trials used an antigen-based vaccine, followed by a cellular-based one. Together,
the antigen- and cellular-based vaccines form over 80% of the trials.
5) A scatter plot with cancer incidence rates on the X-axis and five-year
survival rates on the Y-axis gives an interesting representation of current
cancer prevalence and survival rates with existing medication. The four most
frequently occurring cancers (prostate, melanoma, breast and cervical) all have high
clinical trial rates (dark red circles). Interestingly, prostate cancer has a very high
survival rate too. Please see figure 1 in the appendix.
PEDIATRICS
Pediatrics is gaining increasing focus in the healthcare arena. With
specialized hospitals like the Memphis, TN-based St. Jude, mining all of the available
inpatient data is more important than ever. The aptly named ‘KID’ or Kids’ Inpatient
Database is a one-stop shop for pediatrics-related clinical data (Bliss-Holtz,
2012). The KID is part of the HCUP (Healthcare Cost and Utilization Project) family,
created in a Federal-State-Industry partnership with the Agency for Healthcare Research
and Quality (AHRQ), a federal agency. The data sizes are large, which means that even
relatively rare children’s diseases like prune belly syndrome can be analyzed. Variables
contained in the KID include primary and secondary diagnoses; primary and secondary
procedures; admission and discharge status; patient demographics including gender, age,
race, median income (by ZIP code data); total charges; length of stay and hospital
characteristics (e.g., ownership, size, teaching status). The KID is thus a veritable gold
mine and, if properly mined, can help solve many pediatrics-related questions that
physicians face.
OUTPATIENT HEALTHCARE
Outpatients in a typical hospital usually receive less attention than inpatients,
presumably because they pay much less. However, outpatient illnesses can be very involved,
and adequate knowledge regarding diseases, conditions and medications can
mean cost savings for both the patient and the care provider. A research paper
(Huang, 2013) based on the medical database of a Taiwanese hospital aims to
determine the best algorithm to analyze such a data set. Association rules can be
constructed between abnormal health examination results and outpatient illnesses. A
disease prevention knowledge database can then be built up that assists healthcare
providers in follow-up treatment and prevention. The author also proposes a new
algorithm that can analyze such a data set more effectively. Though the algorithm is
definitely a candidate for more rigorous testing, the study demonstrates the power of data
mining and the potential for further research.
A few points on the choice of data mining algorithms and the research methodology
used in the study:
1) Apriori algorithms are generally used to derive association rules, as
required by this study. Apriori algorithms were first discussed in 1993 and have
been popular since then (Huang, 2013). However, Apriori requires repeated
database scans that result in low efficiency. As medical research improves, a
requirement to correlate multiple diseases and causes has come to the
forefront.
2) The data came from a hospital in Taiwan and consisted of two parts:
health examination results and outpatient medical
records. No distinction was made regarding medical department. Patient health
checkup data was divided into normal (01), below normal (02) and above
normal (03). Normal health data was filtered out, since the association sought
was between abnormal health results and outpatient illness (around 100,000
data points).
3) Outpatient illness records were obtained for the six months before and after the clinical
data. In addition, incomplete, prenatal and dental data were removed from the
dataset.
Please see figure 2 in the appendix for a flowchart of the data integration process.
4) A new algorithm, DCSM (Data Cutting and Sorting Method), was proposed in
view of the limitations of the Apriori method. The DCSM is a seven-step
method and consists of:
a. Convert the data into a Boolean matrix.
b. Establish large item sets for high-frequency data.
c. Establish a reduction matrix: essentially, remove unpaired data.
d. Iterate step (b).
e. Iterate step (c).
f. Iterate step (d).
g. Return to step (b) and repeat steps (c) to (f).
5) Empirical analysis revealed that the association rules found by DCSM and
Apriori were exactly the same, thereby validating the new algorithm. However,
DCSM was found to be around ten times faster than the classical Apriori.
6) Association rules were corroborated by medical doctors and independent
research.
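To make the repeated database scans of point 1 concrete, here is a minimal sketch of the classical level-wise Apriori search for frequent item sets. It is not DCSM, whose internals are only summarized above, and it is not the study's code. The patient records, the item labels and the 0.4 support threshold are hypothetical.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Level-wise (Apriori-style) search for itemsets with support >= min_support."""
    n = len(transactions)
    items = sorted({item for t in transactions for item in t})
    current = [frozenset([i]) for i in items]
    frequent = {}
    size = 1
    while current:
        # One full pass over the data per level: the repeated scan noted in point 1.
        counts = {c: sum(c <= t for t in transactions) for c in current}
        level = {c: cnt / n for c, cnt in counts.items() if cnt / n >= min_support}
        frequent.update(level)
        size += 1
        # Join frequent sets into candidates one item larger, then prune any
        # candidate that has an infrequent subset.
        candidates = {a | b for a in level for b in level if len(a | b) == size}
        current = [
            c for c in candidates
            if all(frozenset(s) in level for s in combinations(c, size - 1))
        ]
    return frequent


# Hypothetical records mixing abnormal exam results and outpatient diagnoses.
records = [
    {"glucose_high", "bmi_high", "diabetes"},
    {"glucose_high", "diabetes"},
    {"glucose_high", "bmi_high", "diabetes"},
    {"bmi_high", "hypertension"},
    {"cholesterol_high", "hypertension"},
]

result = frequent_itemsets(records, min_support=0.4)
for itemset, sup in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), sup)
```

Each pass of the `while` loop rescans every record, so the cost grows with both the number of records and the size of the largest frequent item set. That is the inefficiency a faster method has to address.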
TRANSNATIONAL MEDICINE
A disease-causing pathogen knows no international boundaries. Diseases and
conditions travel visa-free across international borders and time zones. The problem is
further exacerbated by different lifestyles in different countries. A particular cause of
cancer in one country might not be the culprit in another, although a related cause may
very well be. Data mining comes to the rescue again. To identify patterns of related
causes for a deadly disease, sequence clustering algorithms are very useful. Given
the geographical distances between countries, technologies like Service
Oriented Architecture (SOA) and Cloud Computing can be used to retrieve and query
geographically disparate datasets. With ever increasing Internet speeds, large data sets
can be transmitted across oceans quickly and intact. Virtualization technology
eliminates most licensing needs and abstracts difficult technology away from regular
physician assistants (Jigjidsuren, 2011). A representative diagram for transnational medicine
is given in the appendix in figure 3.
LIFE EXPECTANCY CALCULATIONS
Life expectancy is a very useful metric, not only for healthcare administration, but
also for social applications like insurance, Medicare, etc. A group of researchers sought
to determine the life expectancy of a sample of outpatients aged 50
and over (Jason Scott Mathias, 2013). They used predictive data mining and high
dimensional analytics. According to the authors, predictive data mining is already being
used by companies like Amazon and Google to recommend products to their customers.
Applications in healthcare include the ability to improve cancer and infectious disease
treatments.
The study had around 7500 subjects: patients over 50 with at least
one visit to a large medical facility in 2003. A total of 980 health attributes were extracted
from their electronic health records and run through complex statistical techniques (which
included predictive data mining). Attributes included information about demographics,
known diseases, hospital visits, patient vital signs, medications and healthcare utilization.
Using Correlation Feature Selection (CFS), all attributes were tested for mutual
correlation and for correlation with a dichotomous variable that represented death within
five years. The number of patients who passed away within five years was noted. Using
the rotation forest ensemble technique with alternating decision trees, the researchers
were able to develop an index that could distinguish a group of high-risk
patients with life expectancy of less than five years.
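The idea behind Correlation Feature Selection can be illustrated with its usual merit heuristic, which rewards attributes that correlate with the outcome and penalizes attributes that correlate with each other. For a subset of $k$ attributes with mean attribute-outcome correlation $\bar{r}_{cf}$ and mean attribute-attribute correlation $\bar{r}_{ff}$, the merit is $k\,\bar{r}_{cf} / \sqrt{k + k(k-1)\,\bar{r}_{ff}}$. The sketch below is an assumption-laden illustration, not the study's implementation: the patient values are invented, and using absolute Pearson correlation as the correlation measure is an assumption.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def cfs_merit(features, outcome):
    """CFS merit of a feature subset: k*r_cf / sqrt(k + k*(k-1)*r_ff)."""
    k = len(features)
    r_cf = sum(abs(pearson(f, outcome)) for f in features.values()) / k
    pairs = list(combinations(features.values(), 2))
    r_ff = sum(abs(pearson(a, b)) for a, b in pairs) / len(pairs) if pairs else 0.0
    return k * r_cf / math.sqrt(k + k * (k - 1) * r_ff)


# Hypothetical patients; died_5y is the dichotomous outcome (1 = died within five years).
age = [52, 60, 71, 80, 55, 77, 66, 90]
age_months = [a * 12 for a in age]          # a fully redundant copy of age
admissions = [0, 1, 3, 4, 0, 2, 1, 5]
died_5y = [0, 0, 1, 1, 0, 1, 0, 1]

merit_age = cfs_merit({"age": age}, died_5y)
merit_redundant = cfs_merit({"age": age, "age_months": age_months}, died_5y)
merit_pair = cfs_merit({"age": age, "admissions": admissions}, died_5y)
print(merit_age, merit_redundant, merit_pair)
```

Adding the redundant `age_months` attribute leaves the merit unchanged, while adding an attribute that carries some independent signal raises it. With 980 candidate attributes, this is what lets the method discard mutually correlated ones.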
The research has significant implications, since patients who are more likely to survive
longer can be given preference in diagnostic treatments and organ transplants.
LEGAL ASPECTS OF DATA MINING
Data mining of health care related databases has two broad-based uses in the
legal world. The first being its use in non-healthcare legal matters where data mining can
be used as credible evidence while testifying. It is to be noted that Federal Rule of
Evidence 404(b) makes no provision for treating prior acts found by humans any
differently than prior acts found by computer using data mining. Thus, a plaintiff with a
claims related case can very well use reasonable data mining techniques to hold his stand
in a court of law.
The second legal aspect of data mining deals with the healthcare data itself. A
good introductory fact is the US Supreme Court ruling of June 2011 in Sorrell versus IMS
Health Inc. determined that Vermont's law prohibiting pharmacies from selling
prescription data to "data-mining companies" violated the Free Speech Clause of the First
Amendment (Cohen, 2012). When it comes to healthcare data, HIPPA (Health Insurance
Portability and Accountability Act of 1996) has a leading role to play. The Supreme Court
ruling is a little surprising because of the Federal Privacy Rule that implements the HIPPA
prohibits any unauthorized use or disclosure of protected health information for marketing
purposes. However, laws are usually interpreted ‘in context’ (and this was a marketing,
not a research context) and thus the Supreme Court ruling throws many challenges in the
face of data mining evangelists who seek to make all healthcare research related data
global. Where marketing stops and gainful research starts has to be carefully determined.
Globally, however privacy laws differ and what may be legal in the US might be
illegal in another country. Especially when implementing transnational healthcare
systems, due diligence must be conducted prior to any significant monetary or time
commitment.
ETHICS
Ethical questions start where the law ends. Data mining firms might masquerade
as research firms, extract large amounts of diverse data, and sell it for their own profit.
The question of how useful such a mining exercise will be to society at large must
be asked first. Hospitals are often cash-strapped and look for ways of making money
(other than over-billing insurance companies). A large hospital might well be tempted to
sell its data for ‘research purposes’ on a continuing basis, a step that might be legal in
some states or countries but is still unethical. Primary care physicians have their own
ethical role to play too. Bypassing HIPAA for research-related data mining may make quick
money, but it puts patient privacy and ethics at stake.
As the ‘digital divide’ narrows, data travels internationally in seconds.
Most privacy laws apply only within national borders. Data can be easily
traded across borders (and not illegally, since laws in most countries have not caught
up yet), and very cheaply considering the levels of income in developing (and poor)
nations. Such international data mining ‘cartels’ can easily expose large populations of
a region to privacy risks without their prior consent.
FINDINGS AND PROPOSED SOLUTIONS
Technology is addictive (and lucrative too), but legal regulations must be in place
to prevent misuse. Currently, there is no international legislation, and only a few
advanced countries have any relevant laws. Consortia of major countries (including
emerging markets) must be formed to deliberate and legislate on the transnational and
ethical aspects of data mining. Laws must favor the poorer economies to prevent misuse.
Education is vital in a complex field like data mining. Many large universities have
started offering courses in data mining, but a lot more needs to be done to reach the
masses. Data mining does not have only elitist applications; it could be used in everyday
life in the near future.
SUMMARY AND CONCLUSION

Data mining is a new technology and is still in its infancy. Applications are few,
and only a very small slice of the pie has been discovered so far. Current applications
are restricted to more experimental areas. Data mining should get easier and more
commonplace every day. In the near future, data mining algorithms should be able to
‘self-tune’ and help researchers, especially in healthcare, to eliminate deadly
diseases like cancer. Currently, however, most derived data mining patterns are more
mathematical than practical and are virtually ‘rocket science’ for most people not
trained to understand the science.
The future should see more technology abstraction layers (provided by
application software) that should make the use and interpretation of data mining
technologies as routine as e-mail is today (Borgwardt, 2007).
APPENDIX (LIST OF FIGURES)
FIGURE 1
FIGURE 2
FIGURE 3
BIBLIOGRAPHY
(2012). Retrieved from The Atlantic: http://www.theatlantic.com/technology/archive/2012/04/everything-you-
wanted-to-know-about-data-mining-but-were-afraid-to-ask/255388/
(2014). Retrieved from Microsoft Technet: http://technet.microsoft.com/en-us/library/ms175595.aspx
American Heart Association. (2011). Retrieved from Cost to treat heart disease in United States will triple by 2030:
www.sciencedaily.com/releases/2011/01/110124121545.htm
Bliss-Holtz, J. (2012). THE KIDS’ INPATIENT DATABASE (KID) AND DATA MINING. Informa Healthcare
USA, Inc.
Borgwardt, H.-P. K. (2007). Future trends in data mining. Springer Science+Business Media.
Cohen, B. (2012). REGULATING DATA MINING POST-SORRELL: USING HIPAA TO RESTRICT
MARKETING USES OF PATIENTS' PRIVATE MEDICAL INFORMATION. Wake Forest Law Review.
Hernandez, D. (2014). Doctors monitor patients remotely via smartphones and fitness trackers. Retrieved from
http://www.pbs.org/newshour/updates/doctors-monitor-patients-vitals-via-smartphones-fitness-trackers
Hian, C. K. (n.d.). Data mining applications in healthcare. Retrieved from Journal of Healthcare Information
Management: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.92.3184&rep=rep1&type=pdf
Hill, K. (2012). How target figured out a teen girl was pregnant before her father did. Retrieved from
http://www.forbes.com/sites/kashmirhill/2012/02/16/how-target-figured-out-a-teen-girl-was-pregnant-
before-her-father-did/
Huang, Y. C. (2013). Mining association rules between abnormal health examination results and outpatient medical
records. Health Information Management Journal.
Jason Scott Mathias, 1. A. (2013). Development of a 5 year life expectancy index in older adults using predictive
mining of electronic health record data. Journal of the American Medical Informatics Association.
Jigjidsuren, C.-P. S. (2011). A Data-Mining Framework for Transnational Healthcare System. Journal of Medical
Systems.
Johnson, S. (2014). (K. K, Interviewer)
Koh HC1, T. G. (n.d.). US National Library of Medicine National Institutes of Health. Retrieved from
http://www.ncbi.nlm.nih.gov/pubmed/15869215
Mathers CD, L. D. (2009). Projections of global mortality and burden of disease from 2002 to 2030.
National Institutes of Health. (2014). Retrieved from
https://www.nhlbi.nih.gov/health/health-topics/topics/clinicaltrials/
Peyman Rezaei Hachesu1, M. A. (2013). Cardiac diseases prediction and rule extract with data mining - Classification
techniques. HealthMed.
Wikipedia. (2014). Retrieved from http://en.wikipedia.org/wiki/Data_mining
World Health Organization. (2007). Retrieved from Department of Measurement and Health Information Systems:
World Health Statistics.
Xiaohong Cao*1, K. B. (2008). Data mining of cancer vaccine trials: a bird's-eye view. Immunome Research.
Yang, X. W. (2007). Top 10 algorithms in data mining. Retrieved from
http://www.cs.umd.edu/~samir/498/10Algorithms-08.pdf