
ANOVA

For comparing means among more than two groups

Variance: average of squared differences from mean.
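A quick check of this definition in R (the numbers are made up for illustration; note that R's built-in var() divides by n − 1 rather than n):

x <- c(4, 8, 6, 5, 3)          # made-up data
mean((x - mean(x))^2)          # variance: average of squared differences from the mean
var(x)                         # R's sample variance divides by n - 1 instead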


Hypotheses of One-Way ANOVA
• H0: μ1 = μ2 = μ3 = … = μc
• All population means are equal
• i.e., no treatment effect (no variation in means among groups)

• H1: Not all of the population means are the same
• At least one population mean is different
• i.e., there is a treatment effect
• Does not mean that all population means are different (some pairs may be the same)
The F-distribution
• A ratio of variances follows an F-distribution:

σ²between / σ²within ~ F(n, m)

The F-test tests the hypothesis that two variances are equal.
F will be close to 1 if the sample variances are equal.

H0: σ²between = σ²within
Ha: σ²between > σ²within
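A minimal sketch in R of comparing two sample variances with an F ratio (the samples are made up; var.test() is the built-in equivalent):

g1 <- c(60, 67, 42, 67, 56, 62)                 # made-up sample 1
g2 <- c(50, 52, 43, 67, 67, 59)                 # made-up sample 2
F_ratio <- var(g1) / var(g2)                    # close to 1 when the variances are equal
p_value <- 2 * min(pf(F_ratio, 5, 5), 1 - pf(F_ratio, 5, 5))  # two-sided, df = n - 1 each
F_ratio; p_value
var.test(g1, g2)                                # the same comparison, done directly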
How to calculate ANOVAs by hand…
 
Treatment 1 Treatment 2 Treatment 3 Treatment 4
y11 y21 y31 y41
y12 y22 y32 y42
y13 y23 y33 y43
y14 y24 y34 y44
y15 y25 y35 y45
y16 y26 y36 y46
y17 y27 y37 y47
y18 y28 y38 y48
y19 y29 y39 y49
y1,10 y2,10 y3,10 y4,10

(n = 10 observations per group, k = 4 groups)
The group means:

ȳ1· = Σ(j=1..10) y1j / 10    ȳ2· = Σ(j=1..10) y2j / 10    ȳ3· = Σ(j=1..10) y3j / 10    ȳ4· = Σ(j=1..10) y4j / 10

The (within-)group variances:

s1² = Σ(j=1..10) (y1j − ȳ1·)² / (10 − 1), and likewise for groups 2, 3, and 4.
Sum of Squares Within (SSW), or Sum of Squares Error (SSE)

Adding up the numerators of the four within-group variances gives:

SSW = Σ(j=1..10) (y1j − ȳ1·)² + Σ(j=1..10) (y2j − ȳ2·)² + Σ(j=1..10) (y3j − ȳ3·)² + Σ(j=1..10) (y4j − ȳ4·)²

    = Σ(i=1..4) Σ(j=1..10) (yij − ȳi·)²    the Sum of Squares Within (SSW) (or SSE, for chance error)
Sum of Squares Between (SSB), or Sum of Squares Regression (SSR)

Overall mean of all 40 observations ("grand mean"):

ȳ·· = Σ(i=1..4) Σ(j=1..10) yij / 40

SSB = 10 × Σ(i=1..4) (ȳi· − ȳ··)²

Variability of the group means compared to the grand mean (the variability due to the treatment).
Total Sum of Squares (TSS)

TSS = Σ(i=1..4) Σ(j=1..10) (yij − ȳ··)²

The squared difference of every observation from the overall mean (the numerator of the variance of Y!).
Partitioning of Variance

Σ(i=1..4) Σ(j=1..10) (yij − ȳi·)² + 10 × Σ(i=1..4) (ȳi· − ȳ··)² = Σ(i=1..4) Σ(j=1..10) (yij − ȳ··)²

SSW + SSB = TSS
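The partition can be checked numerically; a minimal R sketch with made-up data for k = 4 groups of n = 10:

set.seed(1)
y <- matrix(rnorm(40, mean = 60, sd = 5), nrow = 10)  # columns are the 4 groups
group_means <- colMeans(y)
grand_mean  <- mean(y)
SSW <- sum(sweep(y, 2, group_means)^2)          # deviations from group means
SSB <- 10 * sum((group_means - grand_mean)^2)   # group means vs. grand mean
TSS <- sum((y - grand_mean)^2)                  # deviations from grand mean
all.equal(SSW + SSB, TSS)                       # TRUE: the variance partitions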


ANOVA Table

Source of variation: Between (k groups)
  d.f.: k − 1
  Sum of squares: SSB (sum of squared deviations of group means from grand mean)
  Mean sum of squares: SSB / (k − 1)
  F-statistic: [SSB / (k − 1)] / [SSW / (nk − k)] ~ F(k−1, nk−k)
  p-value: go to F chart

Source of variation: Within (n individuals per group)
  d.f.: nk − k
  Sum of squares: SSW (sum of squared deviations of observations from their group mean)
  Mean sum of squares: s² = SSW / (nk − k)

Source of variation: Total variation
  d.f.: nk − 1
  Sum of squares: TSS (sum of squared deviations of observations from grand mean); TSS = SSB + SSW
Example

Treatment 1 Treatment 2 Treatment 3 Treatment 4


60 inches 50 48 47
67 52 49 67
42 43 50 54
67 67 55 67
56 67 56 68
62 59 61 65
64 67 61 65
59 64 60 56
72 63 59 60
71 65 64 65
Example

Step 1) Calculate the sum of squares between groups:

Mean for group 1 = 62.0
Mean for group 2 = 59.7
Mean for group 3 = 56.3
Mean for group 4 = 61.4

Grand mean = 59.85

SSB = [(62 − 59.85)² + (59.7 − 59.85)² + (56.3 − 59.85)² + (61.4 − 59.85)²] × n per group = 19.65 × 10 = 196.5

Example

Step 2) Calculate the sum of squares within groups:

(60 − 62)² + (67 − 62)² + (42 − 62)² + (67 − 62)² + (56 − 62)² + (62 − 62)² + (64 − 62)² + (59 − 62)² + (72 − 62)² + (71 − 62)² + (50 − 59.7)² + (52 − 59.7)² + (43 − 59.7)² + (67 − 59.7)² + (67 − 59.7)² + … + (sum of 40 squared deviations) = 2060.6
Step 3) Fill in the ANOVA table

Source of variation   d.f.   Sum of squares   Mean sum of squares   F-statistic   p-value
Between                 3         196.5               65.5              1.14         .344
Within                 36        2060.6               57.2
Total                  39        2257.1

INTERPRETATION of ANOVA:
How much of the variance in height is explained by treatment group?
R² = "Coefficient of Determination" = SSB/TSS = 196.5/2257.1 ≈ 9%
Coefficient of Determination

R² = SSB / (SSB + SSE) = SSB / SST
The amount of variation in the outcome variable (dependent variable) that is explained by the predictor
(independent variable).
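The worked example above can be reproduced with R's built-in aov(), using the 40 heights from the example table:

height <- c(60, 67, 42, 67, 56, 62, 64, 59, 72, 71,   # treatment 1
            50, 52, 43, 67, 67, 59, 67, 64, 63, 65,   # treatment 2
            48, 49, 50, 55, 56, 61, 61, 60, 59, 64,   # treatment 3
            47, 67, 54, 67, 68, 65, 65, 56, 60, 65)   # treatment 4
treatment <- factor(rep(1:4, each = 10))
summary(aov(height ~ treatment))   # SSB = 196.5, SSW = 2060.6, F = 1.14, p = .344
196.5 / 2257.1                     # R-squared: about 9% of variation explained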
Advanced Analytical Theory and Methods
• Clustering: overview
• K-means: use cases
• Overview of methods
• Determining the number of clusters
• Diagnostics
• Reasons to choose and cautions


Overview of Clustering

Clustering is the use of unsupervised techniques for grouping similar objects


Supervised methods use labeled objects

Unsupervised methods use unlabeled objects

Clustering looks for hidden structure in the data, similarities based on attributes


Often used for exploratory analysis

No predictions are made
General Applications of Clustering
• Pattern Recognition

• Spatial Data Analysis

• create thematic maps in GIS by clustering feature spaces

• detect spatial clusters and explain them in spatial data mining

• Image Processing

• Economic Science (especially market research)

• WWW

• Document classification

• Cluster Weblog data to discover groups of similar access patterns


Examples of Clustering Applications

 Marketing: Help marketers discover distinct groups in their customer bases, and
then use this knowledge to develop targeted marketing programs
 Land use: Identification of areas of similar land use in an earth observation database
 Insurance: Identifying groups of motor insurance policy holders with a high average
claim cost
 City-planning: Identifying groups of houses according to their house type, value, and
geographical location
 Earthquake studies: observed earthquake epicenters should be clustered along
continent faults



CLUSTERING

• Cluster: a collection of data objects similar to one another within the same cluster
• Dissimilar to the objects in other clusters
• The distance between points in a cluster is less than the distance between a point in the cluster and any point outside it
• Data can be clustered on different attributes
• Clustering differs from classification
• Unsupervised learning
• No predefined classes (no a priori knowledge)

Cluster analysis: Finding similarities between data according to the


characteristics found in the data and grouping similar data objects into clusters

Clustering Methods

• Given a cluster Km of N points {tm1, tm2, …, tmN}, the centroid or middle of the cluster is computed as:
• Centroid = Cm = Σ(i=1..N) tmi / N, which is considered the representative of the cluster (there may not be any corresponding actual object)
• Some algorithms use as representative a centrally located member object, called the medoid

[Figure: centroid vs. medoid of a cluster]
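A small R sketch contrasting the two representatives on made-up 2-D points (the medoid must be a member of the cluster, while the centroid need not be):

pts <- matrix(c(1, 1,  2, 1,  1, 2,  8, 9), ncol = 2, byrow = TRUE)  # made-up points
colMeans(pts)                  # centroid: the mean point (no corresponding object)
d <- as.matrix(dist(pts))      # pairwise Euclidean distances
pts[which.min(rowSums(d)), ]   # medoid: the member point with the smallest total distance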


K-means Algorithm

Given a collection of objects, each with n measurable attributes, and a chosen value k (the number of clusters), the algorithm identifies the k clusters of objects based on the objects' proximity to the centers of the k groups.

The algorithm is iterative, with the centers adjusted to the mean of each cluster's n-dimensional vector of attributes.
Use Cases
• Clustering is often used as a lead-in to classification, where labels are
applied to the identified clusters
• Some applications
• Image processing
• With security images, successive frames are examined for change
• Medical
• Patients can be grouped to identify naturally occurring clusters
• Customer segmentation
• Marketing and sales groups identify customers having similar behaviors and spending
patterns


K-Means Example
Given: {2,4,10,12,3,20,30,11,25}, k=2

Randomly assign means: m1=3,m2=4

K1={2,3}, K2={4,10,12,20,30,11,25}, m1=2.5,m2=16

K1={2,3,4},K2={10,12,20,30,11,25}, m1=3,m2=18

K1={2,3,4,10},K2={12,20,30,11,25}, m1=4.75,m2=19.6

K1={2,3,4,10,11,12},K2={20,30,25}, m1=7,m2=25
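The same one-dimensional run can be reproduced with R's kmeans(), starting from the same initial means (algorithm = "Lloyd" mirrors the plain assign-then-recompute iteration traced above):

x <- c(2, 4, 10, 12, 3, 20, 30, 11, 25)
km <- kmeans(x, centers = c(3, 4), algorithm = "Lloyd")  # initial means m1 = 3, m2 = 4
km$cluster   # final assignment of each point
km$centers   # final means: 7 and 25, as in the hand computation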
K-means Method
Four Steps

Choose the value of k and the initial guesses for the centroids

Compute the distance from each data point to each centroid, and assign each point to the closest centroid

Compute the centroid of each newly defined cluster from step 2

Repeat steps 2 and 3 until the algorithm converges (no changes occur)
K-means Method (two dimensions)
Example – Step 1
• Choose the value of k and the k initial guesses for the centroids.
• In this example, k = 3, and the initial centroids are indicated by the points shaded in red, green, and blue.
K-means Method (two dimensions)
Example – Step 2
• Points are assigned to the closest centroid.
• In two dimensions, the distance d between any two points (x1, y1) and (x2, y2) is the Euclidean distance:
d = √[(x1 − x2)² + (y1 − y2)²]
K-means Method (two dimensions)
Example – Step 3
• Compute the centroids of the new clusters. In two dimensions, the centroid (xc, yc) of the m points in a cluster is calculated as:
(xc, yc) = (Σ(i=1..m) xi / m, Σ(i=1..m) yi / m)
K-means Method (two dimensions)
Example – Step 4
• Repeat steps 2 and 3 until convergence
• Convergence occurs when the centroids do not change or when the centroids oscillate back and forth
• Oscillation can occur when one or more points are equidistant from two or more centroids
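A sketch of the two-dimensional case in R, on made-up data with three blobs (nstart = 10 reruns the algorithm from several random starts):

set.seed(42)
pts <- rbind(cbind(rnorm(20, 0, 0.5), rnorm(20, 0, 0.5)),   # made-up blob near (0, 0)
             cbind(rnorm(20, 3, 0.5), rnorm(20, 3, 0.5)),   # blob near (3, 3)
             cbind(rnorm(20, 0, 0.5), rnorm(20, 3, 0.5)))   # blob near (0, 3)
km <- kmeans(pts, centers = 3, nstart = 10)
plot(pts, col = km$cluster, pch = 19)   # points colored by assigned cluster
points(km$centers, pch = 8, cex = 2)    # final centroids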


Determining Number of Clusters

• Any number k of clusters can be identified in a given dataset, but what value of k should be selected?
• The value of k can be chosen based on a reasonable guess or some predefined requirement.
• How do we know whether k clusters is better or worse than k − 1 or k + 1 clusters?
• Solution:
• Use a heuristic – e.g., the Within Sum of Squares (WSS)
• The WSS metric is the sum of the squares of the distances between each data point and the closest centroid
• The process of identifying the appropriate value of k is referred to as finding the "elbow" of the WSS curve
Determining Number of Clusters (WSS Method)

1. Run the clustering algorithm (e.g., k-means) for different values of k, for instance varying k from 1 to 10 clusters.
2. For each k, calculate the total within-cluster sum of square (WSS).
3. Plot the curve of WSS according to the number of clusters k.
4. The location of a bend (knee) in the plot is generally considered as an indicator of the
appropriate number of clusters.

WSS = Σ(k=1..K) Σ(xi in Ck) (xi − μk)²

• where xi is a data point belonging to the cluster Ck
• and μk is the mean value of the points assigned to the cluster Ck
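A minimal R sketch of the procedure on made-up data (tot.withinss is the total WSS reported by kmeans()):

set.seed(7)
pts <- rbind(cbind(rnorm(30, 0), rnorm(30, 0)),   # made-up data with 3 clusters
             cbind(rnorm(30, 5), rnorm(30, 5)),
             cbind(rnorm(30, 0), rnorm(30, 5)))
wss <- sapply(1:10, function(k) kmeans(pts, centers = k, nstart = 10)$tot.withinss)
plot(1:10, wss, type = "b", xlab = "Number of clusters k", ylab = "Total WSS")
# the bend ("elbow") of the curve, here around k = 3, suggests the number of clusters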
K-means
• K-means is unsupervised: no prior knowledge of the "right answer"
• The goal of the algorithm is to determine a definition of the right answer by finding clusters in the data
• Kinds of data: e.g., Google+ data, survey data, medical data, SAT scores
• Example: given data {age, gender, income, state, household size}, the goal is to segment the users.
K-MEANS CLUSTERING
• The k-means algorithm is an algorithm to cluster n objects based on attributes
into k partitions, where k < n.
• It is similar to the expectation-maximization algorithm for mixtures of Gaussians
in that they both attempt to find the centers of natural clusters in the data.
• It assumes that the object attributes form a vector space.
• An algorithm for partitioning (or clustering) N data points into K disjoint subsets Sj so as to minimize the sum-of-squares criterion

J = Σ(j=1..K) Σ(xn in Sj) |xn − μj|²

where xn is a vector representing the nth data point and μj is the geometric centroid of the data points in Sj.
• Simply speaking, k-means clustering is an algorithm to classify or group objects, based on their attributes/features, into K groups.
• K is a positive integer.
• The grouping is done by minimizing the sum of squared distances between the data points and the corresponding cluster centroids.
How does the K-Means clustering algorithm work?
• Step 1: Begin with a decision on the value of k, the number of clusters.
• Step 2: Put any initial partition that classifies the data into k clusters. You may assign the training samples randomly, or systematically as follows:
1. Take the first k training samples as single-element clusters.
2. Assign each of the remaining N − k training samples to the cluster with the nearest centroid. After each assignment, recompute the centroid of the gaining cluster.
• Step 3: Take each sample in sequence and compute its distance from the centroid of each of the clusters. If a sample is not currently in the cluster with the closest centroid, switch this sample to that cluster and update the centroids of the cluster gaining the new sample and the cluster losing the sample.
• Step 4: Repeat step 3 until convergence is achieved, that is, until a pass through the training samples causes no new assignments (see the sketch below).
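A from-scratch sketch of the loop in R. This is the batch (Lloyd) variant of the steps above: it reassigns all points at once, then recomputes all centroids; it assumes x is a numeric matrix and does not guard against empty clusters:

simple_kmeans <- function(x, k, max_iter = 100) {
  centers <- x[sample(nrow(x), k), , drop = FALSE]        # step 1: k random points as centroids
  assign_old <- rep(0, nrow(x))
  for (iter in 1:max_iter) {
    d <- as.matrix(dist(rbind(centers, x)))[-(1:k), 1:k]  # distance of every point to every centroid
    assign_new <- apply(d, 1, which.min)                  # step 2: assign to the nearest centroid
    if (all(assign_new == assign_old)) break              # step 4: stop when nothing changes
    for (j in 1:k)                                        # step 3: recompute the centroids
      centers[j, ] <- colMeans(x[assign_new == j, , drop = FALSE])
    assign_old <- assign_new
  }
  list(cluster = assign_new, centers = centers)
}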
A simple example showing the implementation of the k-means algorithm (using K = 2)

Step 1:
Initialization: We randomly choose the following two centroids (k = 2) for two clusters. In this case the two centroids are m1 = (1.0, 1.0) and m2 = (5.0, 7.0).
Step 2:
• Thus, we obtain two clusters containing:
{1,2,3} and {4,5,6,7}.
• Their new centroids are:
Step 3:
• Now, using these centroids, we compute the Euclidean distance of each object, as shown in the table.
• Therefore, the new clusters are: {1,2} and {3,4,5,6,7}
• The next centroids are: m1 = (1.25, 1.5) and m2 = (3.9, 5.1)

Step 4:
• The clusters obtained are: {1,2} and {3,4,5,6,7}
• Therefore, there is no change in the clusters.
• Thus, the algorithm comes to a halt here, and the final result consists of the 2 clusters {1,2} and {3,4,5,6,7}.
The remaining individuals are now examined in sequence and allocated to the cluster to
which they are closest, in terms of Euclidean distance to the cluster mean. The mean
vector is recalculated each time a new member is added. This leads to the following
series of steps:


Diagnostics

When the number of clusters is small, plotting the data helps refine the choice of k. The following questions should be considered:

• Are the clusters well separated from each other?
• Do any of the clusters have only a few points?
• Do any of the centroids appear to be too close to each other?
Diagnostics
Example of distinct clusters
Diagnostics
Example of less obvious clusters
Diagnostics
Six clusters from points of previous figure


Reasons to Choose and Cautions
• Decisions the practitioner must make
• What object attributes should be included in the
analysis?
• What unit of measure should be used for each attribute?
• Do the attributes need to be rescaled?
• What other considerations might apply?
Reasons to Choose and Cautions
Object Attributes

• It is important to understand what attributes will be known at the time a new object is assigned to a cluster
• E.g., information on existing customers' satisfaction or purchase frequency may be available, but such information may not be available for potential customers
• E.g., information like age and income of existing customers is available, but may not be available for new customers
• It is best to reduce the number of attributes when possible
• Too many attributes can minimize the impact of the key variables
• Identify highly correlated attributes for reduction
• Combine several attributes into one: e.g., debt/asset ratio
Reasons to Choose and Cautions
Units of Measure

• The k-means algorithm will identify different clusters depending on the units of measure

[Figure: clusters found with k = 2]
[Figure: age dominates; k = 2]
Reasons to Choose and Cautions
Rescaling

• Rescaling can reduce the domination effect
• E.g., divide each variable by the appropriate standard deviation

[Figure: rescaled attributes; k = 2]
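A minimal R sketch of rescaling before clustering ('age' and 'income' are made-up columns; scale() centers each attribute and divides it by its standard deviation):

set.seed(3)
customers <- data.frame(age = rnorm(100, 40, 10),               # made-up attributes
                        income = rnorm(100, 60000, 15000))
km_raw    <- kmeans(customers, centers = 2, nstart = 25)        # income dominates the distance
km_scaled <- kmeans(scale(customers), centers = 2, nstart = 25) # attributes on comparable scales
table(km_raw$cluster, km_scaled$cluster)   # the two runs generally group the points differently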
Reasons to Choose and Cautions
Additional Considerations

• K-means is sensitive to the starting seeds
• It is important to rerun with several seeds – R's kmeans() has the nstart option
• Could explore distance metrics other than Euclidean
• E.g., Manhattan, Mahalanobis, etc.
• K-means is easily applied to numeric data and does not work well with nominal attributes
• E.g., color
Additional Algorithms

• K-modes clustering – kmodes()
• Partitioning around Medoids (PAM) – pam()
• Hierarchical agglomerative clustering – hclust()
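Usage sketches for these alternatives on made-up data (pam() is in the CRAN package cluster; the k-modes function is kmodes() in the package klaR):

library(cluster)
pts <- cbind(rnorm(30), rnorm(30))   # made-up numeric data
pam_fit <- pam(pts, k = 2)           # partitioning around medoids
pam_fit$medoids                      # the two representative member points
hc <- hclust(dist(pts))              # hierarchical agglomerative clustering
cutree(hc, k = 2)                    # cut the dendrogram into 2 clusters
# klaR::kmodes() clusters categorical data, using modes instead of means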
Summary
• Clustering analysis groups similar objects based on the
objects’ attributes
• To use k-means properly, it is important to
• Properly scale the attribute values to avoid domination
• Ensure that the concept of distance between the assigned values of an attribute is meaningful
• Carefully choose the number of clusters, k
• Once the clusters are identified, it is often useful to label
them in a descriptive way
Weaknesses of K-Means Clustering
1. When the number of data points is small, the initial grouping will determine the clusters significantly.
2. The number of clusters, K, must be determined beforehand. A disadvantage is that the algorithm does not yield the same result with each run, since the resulting clusters depend on the initial random assignments.
3. We never know the real clusters: with the same data, if points are input in a different order, a different clustering may result when the number of data points is small.
4. It is sensitive to the initial conditions. Different initial conditions may produce different clustering results, and the algorithm may be trapped in a local optimum.
Applications of K-Means Clustering
• It is relatively efficient and fast: it computes the result in O(tkn), where n is the number of objects or points, k is the number of clusters, and t is the number of iterations.
• k-means clustering can be applied to machine learning or data mining
• Used on acoustic data in speech understanding to convert waveforms into one of k categories (known as Vector Quantization or Image Segmentation)
• Also used for choosing color palettes on old-fashioned graphical display devices and for image quantization
