
CSE 300
Data Mining and Its Application and Usage in Medicine
By Radhika
1
Data Mining and Medicine
 History
 Past 20 years with relational databases
 More dimensions to database queries
 One of the earliest and most successful areas of data mining
 Mid-1800s: London hit by an infectious disease (cholera)
 Two theories
– Miasma theory: bad air propagated the disease
– Germ theory: the disease was water-borne
 Advantages
– Discover trends even when we don’t understand the reasons
– May also discover irrelevant patterns that confuse rather than enlighten
– Protection against unaided human inference of patterns: provide
quantifiable measures that aid human judgment
 Data Mining
 Patterns that are persistent and meaningful
 Knowledge Discovery in Databases (KDD)
2
The future of data mining
 The 10 biggest killers in the US (figure)
 Data mining = the process of discovering interesting,
meaningful and actionable patterns hidden in large
amounts of data
3
Major Issues in Medical Data Mining
 Heterogeneity of medical data
 Volume and complexity
 Physician’s interpretation
 Poor mathematical categorization
 Canonical form
 Solution: standard vocabularies, interfaces
between different sources of data, integration,
design of electronic patient records
 Ethical, Legal and Social Issues
 Data ownership
 Lawsuits
 Privacy and security of human data
 Expected benefits
 Administrative issues
4
Why Data Preprocessing?
 Patient records consist of clinical and lab parameters and
results of particular investigations, specific to given tasks
 Incomplete: lacking attribute values, lacking
certain attributes of interest, or containing only
aggregate data
 Noisy: containing errors or outliers
 Inconsistent: containing discrepancies in codes or
names
 Temporal parameters for chronic diseases
 No quality data, no quality mining results!
 A data warehouse needs consistent integration of
quality data
 In the medical domain, handling incomplete,
inconsistent or noisy data requires people with
domain knowledge (see the sketch below)
5
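A minimal preprocessing sketch in Python/pandas, not from the slides: the column names and values are hypothetical, and it only illustrates the three problems listed above (incomplete, noisy, inconsistent data).

```python
import pandas as pd

records = pd.DataFrame({
    "glucose":        [148, 85, None, 89, 137],
    "blood_pressure": [72, 66, 64, 66, 400],      # 400 is an implausible outlier
    "diagnosis_code": ["E11", "e11", "E11", None, "E11"],
})

# Incomplete data: fill a missing lab value with the column median
records["glucose"] = records["glucose"].fillna(records["glucose"].median())

# Noisy data: flag values outside a clinically plausible range for expert review
records["bp_suspect"] = ~records["blood_pressure"].between(30, 250)

# Inconsistent data: normalize diagnosis codes to one canonical form
records["diagnosis_code"] = records["diagnosis_code"].str.upper()

print(records)
```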
What is Data Mining? The KDD Process
(Diagram: Databases → Data Cleaning / Data Integration → Data Warehouse
→ Data Selection → Task-relevant Data → Data Mining → Pattern Evaluation)
6
From Tables and Spreadsheets to Data Cubes
 A data warehouse is based on a multidimensional data
model that views data in the form of a data cube
 A data cube, such as sales, allows data to be modeled
and viewed in multiple dimensions
 Dimension tables, such as item (item_name, brand,
type) or time (day, week, month, quarter, year)
 Fact table contains measures (such as
dollars_sold) and keys to each of the related dimension
tables
 W. H. Inmon: “A data warehouse is a subject-oriented,
integrated, time-variant, and nonvolatile collection of
data in support of management’s decision-making
process.”
7
Data Warehouse vs. Heterogeneous DBMS
 Data warehouse: update-driven, high performance
 Information from heterogeneous sources is
integrated in advance and stored in the warehouse for
direct query and analysis
 Does not contain the most current information
 Query processing does not interfere with
processing at local sources
 Stores and integrates historical information
 Supports complex multidimensional queries
8
Data Warehouse vs. Operational DBMS
 OLTP (on-line transaction processing)
CSE  Major task of traditional relational DBMS
300  Day-to-day operations: purchasing, inventory,
banking, manufacturing, payroll, registration,
accounting, etc.
 OLAP (on-line analytical processing)
 Major task of data warehouse system
 Data analysis and decision making
 Distinct features (OLTP vs. OLAP):
 User and system orientation: customer vs. market
 Data contents: current, detailed vs. historical,
consolidated
 Database design: ER + application vs. star + subject
 View: current, local vs. evolutionary, integrated
 Access patterns: update vs. read-only but complex
queries
9
(Figure slide)
10
Why Separate Data Warehouse?
 High performance for both systems
CSE  DBMS tuned for OLTP: access methods, indexing,
300
concurrency control, recovery
 Warehouse tuned for OLAP: complex OLAP
queries, multidimensional view, consolidation
 Different functions and different data:
 Missing data: Decision support requires historical
data which operational DBs do not typically
maintain
 Data consolidation: DS requires consolidation
(aggregation, summarization) of data from
heterogeneous sources
 Data quality: different sources typically use
inconsistent data representations, codes and formats
which have to be reconciled
11
(Figure slide)
12
(Figure slide)
13
Typical OLAP Operations
 Roll up (drill-up): summarize data
CSE  by climbing up hierarchy or by dimension reduction
300  Drill down (roll down): reverse of roll-up
 from higher level summary to lower level summary or
detailed data, or introducing new dimensions
 Slice and dice:
 project and select
 Pivot (rotate):
 reorient the cube, visualization, 3D to series of 2D planes.
 Other operations
 drill across: involving (across) more than one fact table
 drill through: through the bottom level of the cube to its
back-end relational tables (using SQL)

14
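The roll-up, slice and pivot operations above can be mimicked on a toy fact table with pandas. This is an assumed illustration only (the table, item names and figures are invented), not part of the original slides.

```python
import pandas as pd

fact = pd.DataFrame({
    "year":    [2023, 2023, 2023, 2024, 2024, 2024],
    "quarter": ["Q1", "Q1", "Q2", "Q1", "Q2", "Q2"],
    "item":    ["statin", "insulin", "statin", "insulin", "statin", "insulin"],
    "dollars_sold": [100, 250, 120, 260, 130, 240],
})

# Roll-up: climb the time hierarchy from quarter to year
rollup = fact.groupby("year")["dollars_sold"].sum()

# Slice: select a single value along one dimension (item = "statin")
statin_slice = fact[fact["item"] == "statin"]

# Pivot: reorient the cube into a 2-D view (time down, item across)
pivoted = fact.pivot_table(index=["year", "quarter"], columns="item",
                           values="dollars_sold", aggfunc="sum")

print(rollup, statin_slice, pivoted, sep="\n\n")
```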
(Figure slide)
15
(Figure slide)
16
Multi-Tiered Architecture
(Diagram: three-tier data warehouse architecture)
 Data sources: operational DBs and other sources; extract, transform, load,
refresh through the metadata repository and integrator; monitoring
 Data storage: data warehouse and data marts
 OLAP engine: OLAP server
 Front-end tools: query, reports, analysis, data mining
17
Steps of a KDD Process
 Learning the application domain:
CSE  relevant prior knowledge and goals of application
300  Creating a target data set: data selection
 Data cleaning and preprocessing: (may take 60% of effort!)
 Data reduction and transformation:
 Find useful features, dimensionality/variable reduction,
invariant representation.
 Choosing functions of data mining
 summarization, classification, regression, association,
clustering.
 Choosing the mining algorithm(s)
 Data mining: search for patterns of interest
 Pattern evaluation and knowledge presentation
 visualization, transformation, removing redundant patterns,
etc.
 Use of discovered knowledge
18
Common Techniques in Data Mining
 Predictive Data Mining
 Most important
 Classification: relate one set of variables in the data to
response variables
 Regression: estimate some continuous value
 Descriptive Data Mining
 Clustering: discovering groups of similar instances
 Association rule extraction
 Variables/Observations
 Summarization of group descriptions
19
Leukemia
 Different types of cells look very similar
 Given a number of samples (patients):
 can we diagnose the disease accurately?
 predict the outcome of treatment?
 recommend the best treatment based on previous
treatments?
 Solution: data mining on micro-array data
 38 training patients, 34 testing patients, ~7000 patient
attributes
 2 classes: Acute Lymphoblastic Leukemia (ALL) vs.
Acute Myeloid Leukemia (AML)
20
Clustering/Instance Based Learning
 Uses specific instances to perform classification rather than general
IF-THEN rules
 Nearest-neighbor classifier
 Among the most studied algorithms for medical purposes
 Clustering: partitioning a data set into several groups
(clusters) such that
 Homogeneity: objects belonging to the same cluster are
similar to each other
 Separation: objects belonging to different clusters are
dissimilar to each other
 Three elements
 The set of objects
 The set of attributes
 Distance measure
21
Measure the Dissimilarity of Objects
 Find the best matching instance
 Distance function
 Measures the dissimilarity between a pair of
data objects
 Things to consider
 Usually very different for interval-scaled,
boolean, nominal, ordinal and ratio-scaled
variables
 Weights should be associated with different
variables based on the application and data
semantics
 The quality of a clustering result depends on both the
distance measure adopted and its implementation
22
Minkowski Distance
 Minkowski distance: a generalization

   d(i, j) = ( |xi1 - xj1|^q + |xi2 - xj2|^q + ... + |xip - xjp|^q )^(1/q),   q > 0

 If q = 2, d is the Euclidean distance
 If q = 1, d is the Manhattan distance
 Example: for xi = (1, 7) and xj = (7, 1), the two legs are 6 and 6, so the
Manhattan distance (q = 1) is 12 and the Euclidean distance (q = 2) is about
8.48 (computed in the sketch below)
23
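A minimal Python sketch of the formula above; with q = 1 and q = 2 it reproduces the slide's example distances.

```python
# Minkowski distance; q = 1 gives Manhattan, q = 2 gives Euclidean distance
def minkowski(x, y, q):
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)

xi, xj = (1, 7), (7, 1)
print(minkowski(xi, xj, 1))   # 12.0   (Manhattan)
print(minkowski(xi, xj, 2))   # ~8.485 (Euclidean, 6 * sqrt(2))
```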
Binary Variables
 A contingency table for binary data (Object i on the rows, Object j on the
columns):

              Object j
              1      0      sum
  Object i 1  a      b      a+b
           0  c      d      c+d
       sum    a+c    b+d    p

 Simple matching coefficient (dissimilarity):

   d(i, j) = (b + c) / (a + b + c + d)
24
Dissimilarity between Binary Variables
 Example

            A1  A2  A3  A4  A5  A6  A7
 Object 1    1   0   1   1   1   0   0
 Object 2    1   1   1   0   0   0   1

 Contingency table (Object 1 on the rows, Object 2 on the columns):

            1    0    sum
        1   2    2    4
        0   2    1    3
      sum   4    3    7

   d(O1, O2) = (2 + 2) / (2 + 2 + 2 + 1) = 4/7 ≈ 0.57   (computed below)
25
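A short Python sketch reproducing the example: it builds the contingency counts and applies d = (b + c) / (a + b + c + d).

```python
def binary_dissimilarity(x, y):
    a = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 1)  # both 1
    b = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 0)
    c = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 1)
    d = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 0)  # both 0
    return (b + c) / (a + b + c + d)

o1 = [1, 0, 1, 1, 1, 0, 0]
o2 = [1, 1, 1, 0, 0, 0, 1]
print(binary_dissimilarity(o1, o2))   # 4/7 ≈ 0.571
```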
k-Means Clustering Algorithm
 Initialization
 Arbitrarily choose k objects as the initial cluster
centers (centroids)
 Iterate until no change
 For each object Oi
 Calculate the distances between Oi and the k centroids
 (Re)assign Oi to the cluster whose centroid is
closest to Oi
 Update the cluster centroids based on the current
assignment (a minimal implementation sketch follows below)
26
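A compact NumPy sketch of the iteration described above (Lloyd's k-means); the toy 2-D points are assumed purely for illustration.

```python
import numpy as np

def k_means(points, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialization: arbitrarily choose k objects as the initial centroids
    centroids = points[rng.choice(len(points), k, replace=False)]
    for _ in range(n_iter):
        # (Re)assign each object to the cluster with the closest centroid
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update the centroids based on the current assignment
        new_centroids = np.array([points[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):   # stop when nothing changes
            break
        centroids = new_centroids
    return labels, centroids

pts = np.array([[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]], dtype=float)
labels, centers = k_means(pts, k=2)
print(labels)
print(centers)
```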
k-Means Clustering Method
(Figure: objects plotted on a 10 × 10 grid; the current clusters and their
means are shown, followed by the new, relocated clusters after reassignment)
27
Dataset
 Data set from the UCI repository
 http://kdd.ics.uci.edu/
 768 female Pima Indians evaluated for diabetes
 After data cleaning: 392 data entries (loading sketch below)
28
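A sketch of the cleaning step, with an assumed file name and column layout: in the commonly distributed CSV of this dataset, zeros in several clinical columns encode missing measurements, and dropping those rows leaves the 392 complete cases mentioned above.

```python
import pandas as pd

cols = ["pregnancies", "glucose", "blood_pressure", "skin_thickness",
        "insulin", "bmi", "pedigree", "age", "diabetes"]
pima = pd.read_csv("pima-indians-diabetes.csv", header=None, names=cols)

clinical = ["glucose", "blood_pressure", "skin_thickness", "insulin", "bmi"]
complete = pima[(pima[clinical] != 0).all(axis=1)]   # keep fully measured records
print(len(pima), len(complete))                      # expected: 768 392
```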
Hierarchical Clustering
 Groups observations based on dissimilarity
 Compacts the database into “labels” that represent the
observations
 Measures of similarity/dissimilarity
 Euclidean distance
 Manhattan distance
 Types of clustering (linkage)
 Single link
 Average link
 Complete link (SciPy sketch below)
29
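A minimal sketch of single-, complete- and average-link clustering with SciPy, run on assumed toy 2-D observations rather than the Pima data.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (10, 2)), rng.normal(6, 1, (10, 2))])  # two blobs

for method in ("single", "complete", "average"):
    Z = linkage(X, method=method, metric="euclidean")
    labels = fcluster(Z, t=2, criterion="maxclust")   # cut the tree into 2 clusters
    print(method, "cluster sizes:", np.bincount(labels)[1:])

dendrogram(linkage(X, method="average"))              # compare dendrograms visually
plt.show()
```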
Hierarchical Clustering: Comparison
(Figure: the same six points clustered by single-link, complete-link,
average-link and centroid-distance methods, producing different groupings)
30
Compare Dendrograms
(Figure: dendrograms over points 1–6 for single-link, complete-link,
average-link and centroid-distance clustering)
31
Which Distance Measure is Better?
 Each method has both advantages and disadvantages;
the choice is application-dependent
 Single-link
 Can find irregular-shaped clusters
 Sensitive to outliers
 Complete-link, average-link and centroid distance
 Robust to outliers
 Tend to break large clusters
 Prefer spherical clusters
32
Dendrogram from dataset
 (Figure: single-link dendrogram, built from a minimum spanning tree
through the observations)
 The single observation that is last to join a cluster is the patient whose
blood pressure is in the bottom quartile, skin thickness is in the bottom
quartile and BMI is in the bottom half
 Her insulin, however, was the largest, and she is a 59-year-old diabetic
33
Dendrogram from dataset
 (Figure: complete-link dendrogram)
 Maximum dissimilarity between observations in one
cluster when compared to another
34
Dendrogram from dataset
 (Figure: average-link dendrogram)
 Average dissimilarity between observations in one
cluster when compared to another
35
Supervised versus Unsupervised Learning
 Supervised learning (classification)
CSE  Supervision: Training data (observations,
300
measurements, etc.) are accompanied by labels
indicating the class of the observations
 New data is classified based on training set
 Unsupervised learning (clustering)
 Class labels of training data are unknown
 Given a set of measurements, observations, etc.,
need to establish existence of classes or clusters in
data

36
Classification and Prediction
 Derive models that use patient-specific
information to aid clinical decision making
 A priori decision on the predictors and the variables to predict
 No method can find predictors that are not present in the
data
 Numeric response
 Least-squares regression
 Categorical response
 Classification trees
 Neural networks
 Support vector machines
 Decision models
 Prognosis, diagnosis and treatment planning
 Embedded in clinical information systems
37
Least Squares Regression
 Find a linear function of the predictor variables that
minimizes the sum of squared differences with the response
 Supervised learning technique
 Predict insulin in our dataset from glucose and BMI (sketch below)
38
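A hedged sketch of this regression with scikit-learn: serum insulin is predicted from glucose and BMI by ordinary least squares. The file name and column layout are assumed, as in the earlier loading sketch.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

cols = ["pregnancies", "glucose", "blood_pressure", "skin_thickness",
        "insulin", "bmi", "pedigree", "age", "diabetes"]
pima = pd.read_csv("pima-indians-diabetes.csv", header=None, names=cols)
pima = pima[(pima[["glucose", "bmi", "insulin"]] != 0).all(axis=1)]  # drop missing

X, y = pima[["glucose", "bmi"]], pima["insulin"]
model = LinearRegression().fit(X, y)    # minimizes the sum of squared errors
print(model.intercept_, model.coef_)    # the fitted linear function
print(model.score(X, y))                # R^2 on the training data
```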
Decision Trees
 Decision tree
 Each internal node tests an attribute
 Each branch corresponds to an attribute value
 Each leaf node assigns a classification
 ID3 algorithm
 Uses training objects with known class labels to
classify testing objects
 Ranks attributes with an information-gain measure
 Minimal height
 least number of tests to classify an object
 Used in commercial tools, e.g. Clementine
 ASSISTANT
 Deals with medical datasets
 Incomplete data
 Discretizes continuous variables
 Prunes unreliable parts of the tree
 Classifies data
39
Decision Trees
(Figure: example decision tree)
40
Algorithm for Decision Tree Induction
Basic algorithm (a greedy algorithm)
 Attributes are categorical (if continuous-valued,
they are discretized in advance)
 Tree is constructed in a top-down recursive
divide-and-conquer manner
 At start, all training examples are at the root
 Test attributes are selected on basis of a heuristic
or statistical measure (e.g., information gain)
 Examples are partitioned recursively based on
selected attributes

41
Training Dataset
      Age      BMI     Hereditary  Vision     Risk of Condition X
P1    <=30     high    no          fair       no
P2    <=30     high    no          excellent  no
P3    >40      high    no          fair       yes
P4    31...40  medium  no          fair       yes
P5    31...40  low     yes         fair       yes
P6    31...40  low     yes         excellent  no
P7    >40      low     yes         excellent  yes
P8    <=30     medium  no          fair       no
P9    <=30     low     yes         fair       yes
P10   31...40  medium  yes         fair       yes
P11   <=30     medium  yes         excellent  yes
P12   >40      medium  no          excellent  yes
P13   >40      high    yes         fair       yes
P14   31...40  medium  no          excellent  no

42
Construction of A Decision Tree for “Condition X”
(Tree constructed for “Condition X”; root = Age over all patients [P1,…,P14], Yes: 9, No: 5)

 Age <= 30: [P1,P2,P8,P9,P11]  Yes: 2, No: 3  → split on Hereditary
– Hereditary = no:  [P1,P2,P8]   Yes: 0, No: 3  → NO
– Hereditary = yes: [P9,P11]     Yes: 2, No: 0  → YES
 Age > 40: [P3,P7,P12,P13]  Yes: 4, No: 0  → YES
 Age 31…40: [P4,P5,P6,P10,P14]  Yes: 3, No: 2  → split on Vision
– Vision = excellent: [P6,P14]     Yes: 0, No: 2  → NO
– Vision = fair:      [P4,P5,P10]  Yes: 3, No: 0  → YES

(A scikit-learn sketch of this tree follows below.)
43
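A sketch that fits a tree to the 14 training patients from the table with scikit-learn (entropy criterion, one-hot encoded attributes). Scikit-learn builds binary splits rather than the multi-way splits drawn above, so the printed tree may be shaped differently while still separating the same classes.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

data = pd.DataFrame({
    "age":        ["<=30", "<=30", ">40", "31..40", "31..40", "31..40", ">40",
                   "<=30", "<=30", "31..40", "<=30", ">40", ">40", "31..40"],
    "bmi":        ["high", "high", "high", "medium", "low", "low", "low",
                   "medium", "low", "medium", "medium", "medium", "high", "medium"],
    "hereditary": ["no", "no", "no", "no", "yes", "yes", "yes",
                   "no", "yes", "yes", "yes", "no", "yes", "no"],
    "vision":     ["fair", "excellent", "fair", "fair", "fair", "excellent", "excellent",
                   "fair", "fair", "fair", "excellent", "excellent", "fair", "excellent"],
    "risk":       ["no", "no", "yes", "yes", "yes", "no", "yes",
                   "no", "yes", "yes", "yes", "yes", "yes", "no"],
})

X = pd.get_dummies(data.drop(columns="risk"))          # one-hot encode the attributes
tree = DecisionTreeClassifier(criterion="entropy").fit(X, data["risk"])
print(export_text(tree, feature_names=list(X.columns)))
```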
Entropy and Information Gain
 S contains si tuples of class Ci, for i = 1, …, m
 Information (entropy) required to classify an arbitrary tuple:

   I(s1, s2, …, sm) = − Σ_{i=1..m} (si / s) · log2(si / s)

 Entropy of attribute A with values {a1, a2, …, av}:

   E(A) = Σ_{j=1..v} ((s1j + … + smj) / s) · I(s1j, …, smj)

 Information gained by branching on attribute A:

   Gain(A) = I(s1, s2, …, sm) − E(A)

(Worked computation below.)

44
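A small Python check of these formulas for the 14-patient table: the class entropy of the whole set and the information gain of the Age attribute, using the per-branch class counts shown in the tree above.

```python
from math import log2

def info(counts):
    """Entropy I(s1, ..., sm) of a list of class counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c)

I_all = info([9, 5])                        # 9 "yes", 5 "no"  -> ~0.940 bits

# Age splits the 14 patients into three branches with these class counts:
branches = [[2, 3], [4, 0], [3, 2]]         # <=30, >40, 31..40
E_age = sum(sum(b) / 14 * info(b) for b in branches)    # ~0.694 bits
print("Gain(Age) =", round(I_all - E_age, 3))           # ~0.247
```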
Entropy and Information Gain
 Select the attribute with the highest information gain (or
greatest entropy reduction)
 Such an attribute minimizes the information needed to
classify the samples
45
Rule Induction
 IF conditions THEN conclusion
 E.g. CN2
 Concept description:
 Characterization: provides a concise and succinct summarization
of a given collection of data
 Comparison: provides descriptions comparing two or more
collections of data
 Training set, testing set
 Imprecise
 Predictive accuracy
 P / (P + N)
46
Example used in a Clinic
 Hip arthroplasty: help the trauma surgeon predict the patient’s long-
term clinical status after surgery
 Outcome evaluated during follow-ups for 2 years
 2 modeling techniques
 Naïve Bayesian classifier
 Decision trees
 Bayesian classifier
 P(outcome = good) = 0.55 (11/20 good)
 The probability gets updated as more attributes are
considered
 P(timing = good | outcome = good) = 9/11 ≈ 0.82
 P(outcome = bad) = 9/20; P(timing = good | outcome = bad) = 5/9
(Worked update below.)
47
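A short check of the Bayesian update with the numbers on this slide; a sketch of the arithmetic only, not the clinic's actual model.

```python
p_good         = 11 / 20     # prior P(outcome = good)
p_bad          = 9 / 20      # prior P(outcome = bad)
p_t_given_good = 9 / 11      # P(timing = good | outcome = good)
p_t_given_bad  = 5 / 9       # P(timing = good | outcome = bad)

# Posterior after observing timing = good (Bayes theorem)
evidence = p_t_given_good * p_good + p_t_given_bad * p_bad
posterior_good = p_t_given_good * p_good / evidence
print(round(posterior_good, 3))   # ~0.643: the estimate rises from 0.55
```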
Nomogram (figure)
48
Bayesian Classification
 Bayesian classifier vs. decision tree
CSE  Decision tree: predict the class label
300
 Bayesian classifier: statistical classifier; predict
class membership probabilities
 Based on Bayes theorem; estimate posterior
probability
 Naïve Bayesian classifier:
 Simple classifier that assumes attribute
independence
 High speed when applied to large databases
 Comparable in performance to decision trees

49
Bayes Theorem
 Let X be a data sample whose class label is unknown
 Let Hi be the hypothesis that X belongs to a particular
class Ci
 P(Hi) is the prior probability that X belongs to
class Ci
 Can be estimated by ni/n from the training data
samples
 n is the total number of training data samples
 ni is the number of training data samples of class Ci

 Bayes theorem:

   P(Hi | X) = P(X | Hi) · P(Hi) / P(X)
50
More classification Techniques
 Neural networks
 Similar to the pattern-recognition properties of biological
systems
 Most frequently used
 Multi-layer perceptrons
– Inputs with bias, connected by weights to hidden and output layers
 Backpropagation neural networks
 Support vector machines
 Separate the database into mutually exclusive regions
 Transform to another problem space
 Kernel functions (dot product)
 Output for new points predicted by their position
 Comparison with classification trees
 Not possible to know which features or combinations of
features most influence a prediction
51
Multilayer Perceptrons
 Apply non-linear transfer functions to weighted sums of
inputs
 Werbos (backpropagation) algorithm
 Weights initialized randomly
 Training set, testing set (sketch below)
52
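A hedged sketch of a multilayer perceptron trained by backpropagation with scikit-learn, split into a training and a testing set; the Pima file name and columns are assumed as before.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

cols = ["pregnancies", "glucose", "blood_pressure", "skin_thickness",
        "insulin", "bmi", "pedigree", "age", "diabetes"]
pima = pd.read_csv("pima-indians-diabetes.csv", header=None, names=cols)

X_train, X_test, y_train, y_test = train_test_split(
    pima.drop(columns="diabetes"), pima["diabetes"], test_size=0.3, random_state=0)

mlp = make_pipeline(
    StandardScaler(),      # scale inputs before the weighted sums
    MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000, random_state=0))
mlp.fit(X_train, y_train)  # weights start random and are tuned by backpropagation
print("test accuracy:", mlp.score(X_test, y_test))
```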
Support Vector Machines
 3 steps
 Support vector creation
 Maximal distance (margin) between points found
 Perpendicular decision boundary
 Allows some points to be misclassified
 Pima Indian data with X1 (glucose) and X2 (BMI); sketch below
53
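A sketch of a soft-margin SVM on the two features named above (glucose as X1, BMI as X2), with the same assumed file layout; a finite C lets some points be misclassified.

```python
import pandas as pd
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

cols = ["pregnancies", "glucose", "blood_pressure", "skin_thickness",
        "insulin", "bmi", "pedigree", "age", "diabetes"]
pima = pd.read_csv("pima-indians-diabetes.csv", header=None, names=cols)

X, y = pima[["glucose", "bmi"]], pima["diabetes"]
svm = make_pipeline(StandardScaler(),
                    SVC(kernel="linear", C=1.0))   # linear (dot-product) kernel, soft margin
svm.fit(X, y)
print("support vectors per class:", svm.named_steps["svc"].n_support_)
print("training accuracy:", svm.score(X, y))
```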
What is Association Rule Mining?
 Finding frequent patterns, associations, correlations, or causal
structures among sets of items or objects in transaction
databases, relational databases, and other information
repositories

PatientID  Conditions
1          High LDL, Low HDL, High BMI, Heart Failure
2          High LDL, Low HDL, Heart Failure, Diabetes
3          Diabetes
4          High LDL, Low HDL, Heart Failure
5          High BMI, High LDL, Low HDL, Heart Failure

 Example of an association rule:
   {High LDL, Low HDL} → {Heart Failure}
 People who have high LDL (“bad” cholesterol) and low HDL
(“good” cholesterol) are at higher risk of heart failure.
54
Association Rule Mining
 Market basket analysis
 Groups of items that are bought together are placed together
 Healthcare
 Understanding associations among patients with
demands for similar treatments and services
 Goal: find items for which the joint probability of
occurrence is high
 Basket of binary-valued variables
 Results form association rules, augmented with
support and confidence
55
Association Rule Mining
 Association rule
 An implication of the form X → Y, where X and Y are
itemsets and X ∩ Y = ∅ (D is the set of transactions; some
transactions contain X, some contain Y, some contain both)
 Rule evaluation metrics
 Support (s): fraction of transactions that contain both X and Y

   s = P(X ∪ Y) = (# trans containing X ∪ Y) / (# trans in D)

 Confidence (c): measures how often items in Y appear in
transactions that contain X

   c = P(Y | X) = (# trans containing X ∪ Y) / (# trans containing X)

(Both are computed for the earlier patient table in the sketch below.)
56
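A tiny Python sketch computing support and confidence for the rule {High LDL, Low HDL} → {Heart Failure} over the five patients of the earlier example table.

```python
transactions = [
    {"High LDL", "Low HDL", "High BMI", "Heart Failure"},   # patient 1
    {"High LDL", "Low HDL", "Heart Failure", "Diabetes"},   # patient 2
    {"Diabetes"},                                           # patient 3
    {"High LDL", "Low HDL", "Heart Failure"},               # patient 4
    {"High BMI", "High LDL", "Low HDL", "Heart Failure"},   # patient 5
]
X = {"High LDL", "Low HDL"}
Y = {"Heart Failure"}

n_X  = sum(1 for t in transactions if X <= t)         # transactions containing X
n_XY = sum(1 for t in transactions if (X | Y) <= t)   # transactions containing X and Y
print("support    =", n_XY / len(transactions))       # 4/5 = 0.8
print("confidence =", n_XY / n_X)                     # 4/4 = 1.0
```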
The Apriori Algorithm
 Starts with the frequent 1-itemsets
 Include only those “items” that pass the support threshold
 Use the frequent 1-itemsets to generate candidate 2-itemsets
 Stop when the threshold is not satisfied by any itemset (runnable sketch below)

 L1 = {frequent items};
 for (k = 1; Lk ≠ ∅; k++) do
   Candidate generation: Ck+1 = candidates generated from Lk;
   Candidate counting: for each transaction t in the database,
     increment the count of all candidates in Ck+1 that are contained in t;
   Lk+1 = candidates in Ck+1 with support ≥ min_sup;
 return ∪k Lk;
57
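A compact Python sketch of this loop, run on the four transactions of the next slide (min_sup = 0.5, i.e. at least two transactions); it recovers the same frequent itemsets, ending with {b, c, e}.

```python
def apriori(transactions, min_sup):
    min_count = min_sup * len(transactions)
    items = {i for t in transactions for i in t}
    # L1: frequent 1-itemsets
    L = [{frozenset([i]) for i in items
          if sum(i in t for t in transactions) >= min_count}]
    k = 1
    while L[-1]:
        # Candidate generation: join frequent k-itemsets into (k+1)-itemsets
        candidates = {a | b for a in L[-1] for b in L[-1] if len(a | b) == k + 1}
        # Candidate counting: keep candidates that meet the support threshold
        L.append({c for c in candidates
                  if sum(c <= t for t in transactions) >= min_count})
        k += 1
    return [s for level in L for s in level]   # union of all Lk

D = [frozenset("acd"), frozenset("bce"), frozenset("abce"), frozenset("be")]
for itemset in apriori(D, 0.5):
    print(sorted(itemset))
```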
Apriori-based Mining
Min_sup = 0.5

Database D                  1-candidates      Frequent 1-itemsets
TID  Items        Scan D    Itemset  Sup      Itemset  Sup
10   a, c, d                a        2        a        2
20   b, c, e                b        3        b        3
30   a, b, c, e             c        3        c        3
40   b, e                   d        1        e        3
                            e        3

2-candidates: ab, ac, ae, bc, be, ce
Counting (Scan D): ab 1, ac 2, ae 1, bc 2, be 3, ce 2
Frequent 2-itemsets: ac 2, bc 2, be 3, ce 2

3-candidates (Scan D): bce
Frequent 3-itemsets: bce 2
58
Principal Component Analysis
 Principal components
 With a large number of variables it is highly likely that
some subsets of the variables are strongly correlated with each
other. Reduce the number of variables but retain the variability in the dataset
 Linear combinations of the variables in the database
 Variance of each PC is maximized
– Displays as much of the spread of the original data as possible
 PCs are orthogonal to each other
– Minimizes the overlap between the variables
 Each component is normalized so that its sum of squares is unity
– Easier for mathematical analysis
 Number of PCs < number of variables
 Associations found
 A small number of PCs explains a large amount of the variance
 Example: 768 female Pima Indians evaluated for diabetes (sketch below)
 Number of times pregnant, two-hour oral glucose tolerance test
(OGTT) plasma glucose, diastolic blood pressure, triceps skin fold
thickness, two-hour serum insulin, BMI, diabetes pedigree
function, age, diabetes onset within the last 5 years
59
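A hedged sketch of PCA on the standardized Pima variables (file name and column layout assumed as before); the cumulative ratios show how few components capture most of the variance.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

cols = ["pregnancies", "glucose", "blood_pressure", "skin_thickness",
        "insulin", "bmi", "pedigree", "age", "diabetes"]
pima = pd.read_csv("pima-indians-diabetes.csv", header=None, names=cols)

X = StandardScaler().fit_transform(pima.drop(columns="diabetes"))
pca = PCA().fit(X)                                 # components are orthogonal and unit-norm
print(pca.explained_variance_ratio_)               # variance explained by each PC
print(pca.explained_variance_ratio_.cumsum())      # cumulative share of the variance
```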
PCA Example
(Figure: PCA example)
60
National Cancer Institute
 CancerNet: http://www.nci.nih.gov
 CancerNet for Patients and the Public
 CancerNet for Health Professionals
 CancerNet for Basic Researchers
 CancerLit
61
Conclusion
 About three-quarters of a billion people’s medical records are
electronically available
 Data mining in medicine is distinct from other fields due
to the nature of the data: heterogeneous, with ethical, legal
and social constraints
 The most commonly used technique is classification and
prediction, with different techniques applied in
different cases
 Association rules describe the data in the database
 Medical data mining can be the most rewarding,
despite the difficulty
62
Thank you !!!


63
