ANOVA for Comparing Means Between More Than 2 Groups
(Variance: the average of squared differences from the mean)
• $H_1$: Not all of the population means are the same
• At least one population mean is different
• i.e., there is a treatment effect
• This does not mean that all population means are different (some pairs may be the same)
The F-distribution
• A ratio of variances follows an F-distribution:

$$\frac{\sigma^2_{\text{between}}}{\sigma^2_{\text{within}}} \sim F_{n,m}$$

The F-test tests the hypothesis that two variances are equal. F will be close to 1 if the sample variances are equal.

$$H_0: \sigma^2_{\text{between}} = \sigma^2_{\text{within}} \qquad H_a: \sigma^2_{\text{between}} \neq \sigma^2_{\text{within}}$$
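A minimal sketch in R with synthetic samples, using base R's var() and var.test() (the sample sizes 30 and 25 are illustrative):

```r
set.seed(42)
x <- rnorm(30, mean = 10, sd = 2)  # sample 1
y <- rnorm(25, mean = 10, sd = 2)  # sample 2, same true variance

F_stat <- var(x) / var(y)  # ratio of the sample variances
F_stat                     # close to 1 when the variances are equal

var.test(x, y)             # the same F-test, with degrees of freedom and p-value
```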
How to calculate ANOVAs by hand…
Treatment 1   Treatment 2   Treatment 3   Treatment 4
y11           y21           y31           y41
y12           y22           y32           y42
y13           y23           y33           y43
y14           y24           y34           y44
y15           y25           y35           y45
y16           y26           y36           y46
y17           y27           y37           y47
y18           y28           y38           y48
y19           y29           y39           y49
y110          y210          y310          y410

(n = 10 observations per group, k = 4 groups)
The group means:

$$\bar{y}_i = \frac{\sum_{j=1}^{10} y_{ij}}{10}, \qquad i = 1, 2, 3, 4$$

The (within) group variances:

$$s_i^2 = \frac{\sum_{j=1}^{10} (y_{ij} - \bar{y}_i)^2}{10 - 1}, \qquad i = 1, 2, 3, 4$$
Sum of Squares Within (SSW), or Sum of Squares Error (SSE)

The numerators of the four (within) group variances, added together:

$$SSW = \sum_{j=1}^{10}(y_{1j}-\bar{y}_1)^2 + \sum_{j=1}^{10}(y_{2j}-\bar{y}_2)^2 + \sum_{j=1}^{10}(y_{3j}-\bar{y}_3)^2 + \sum_{j=1}^{10}(y_{4j}-\bar{y}_4)^2 = \sum_{i=1}^{4}\sum_{j=1}^{10}(y_{ij}-\bar{y}_i)^2$$

(also called SSE, for chance error)
Sum of Squares Between (SSB), or Sum of Squares Regression (SSR)

The grand mean (overall mean of all 40 observations):

$$\bar{y} = \frac{\sum_{i=1}^{4}\sum_{j=1}^{10} y_{ij}}{40}$$

$$SSB = 10 \times \sum_{i=1}^{4} (\bar{y}_i - \bar{y})^2$$

SSB measures the variability of the group means compared to the grand mean (the variability due to the treatment).
Total Sum of Squares (SST)

$$SST = \sum_{i=1}^{4}\sum_{j=1}^{10} (y_{ij} - \bar{y})^2$$

The squared difference of every observation from the grand mean.
Partitioning of Variance
$$\sum_{i=1}^{4}\sum_{j=1}^{10}(y_{ij}-\bar{y}_i)^2 \;+\; 10\times\sum_{i=1}^{4}(\bar{y}_i-\bar{y})^2 \;=\; \sum_{i=1}^{4}\sum_{j=1}^{10}(y_{ij}-\bar{y})^2$$

That is, SSW + SSB = SST.
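A minimal sketch verifying this partition in R on simulated data (k = 4 groups of n = 10, with made-up group means of 60, 62, 65, and 61):

```r
set.seed(1)
k <- 4; n <- 10
y <- matrix(rnorm(k * n, mean = rep(c(60, 62, 65, 61), each = n), sd = 3),
            nrow = n)                        # column i holds group i

group_means <- colMeans(y)                   # the group means
grand_mean  <- mean(y)                       # mean of all 40 observations

SSW <- sum(sweep(y, 2, group_means)^2)       # within-group sum of squares
SSB <- n * sum((group_means - grand_mean)^2) # between-group sum of squares
SST <- sum((y - grand_mean)^2)               # total sum of squares

all.equal(SSW + SSB, SST)                    # TRUE: SSW + SSB = SST
```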
Source of variation   d.f.   Sum of squares   Mean Sum of Squares   F-statistic   p-value
Total                 39     2257.1
INTERPRETATION of ANOVA:
How much of the variance in height is explained by treatment group?
R2 = “Coefficient of Determination” = SSB/TSS = 196.5/2257.1 ≈ 9%
Coefficient of Determination

$$R^2 = \frac{SSB}{SSB + SSE} = \frac{SSB}{SST}$$
The amount of variation in the outcome variable (dependent variable) that is explained by the predictor
(independent variable).
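A minimal sketch on the same kind of synthetic height data, assuming base R's aov(); it prints the ANOVA table (d.f., sums of squares, F, p-value) and computes R² = SSB/SST as defined above:

```r
set.seed(1)
height <- rnorm(40, mean = rep(c(60, 62, 65, 61), each = 10), sd = 3)
group  <- factor(rep(1:4, each = 10))   # 4 treatment groups of 10

fit <- aov(height ~ group)
summary(fit)                            # the ANOVA table

ss <- summary(fit)[[1]][["Sum Sq"]]     # c(SSB, SSE)
ss[1] / sum(ss)                         # R^2 = SSB / (SSB + SSE) = SSB / SST
```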
Advanced Analytical Theory and Methods: Clustering – Overview, Overview of Methods, Diagnostics
● Supervised methods use labeled objects
● Unsupervised methods use unlabeled objects

Clustering looks for hidden structure in the data: similarities based on attributes.

● Often used for exploratory analysis
● No predictions are made
General Applications of Clustering
• Pattern Recognition
• Image Processing
• WWW
• Document classification
Marketing: Help marketers discover distinct groups in their customer bases, and
then use this knowledge to develop targeted marketing programs
Land use: Identification of areas of similar land use in an earth observation database
Insurance: Identifying groups of motor insurance policy holders with a high average
claim cost
City-planning: Identifying groups of houses according to their house type, value, and
geographical location
Earthquake studies: Observed earthquake epicenters should be clustered along continental faults
Clustering Methods
• Given a cluster $K_m$ of $N$ points $\{t_{m1}, t_{m2}, \ldots, t_{mN}\}$, the centroid (middle) of the cluster is computed as
• Centroid = $C_m = \frac{\sum_{i=1}^{N} t_{mi}}{N}$, which is considered the representative of the cluster (there may not be any corresponding object)
• Some algorithms instead use a centrally located object, called a medoid, as the representative

(Figure: centroid vs. medoid)
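A minimal sketch contrasting the two representatives on a toy 2-D cluster; the centroid comes from colMeans(), and the medoid is found by hand as the object with the smallest total distance to the others:

```r
pts <- rbind(c(1, 1), c(2, 1), c(1, 2), c(8, 8))  # toy cluster with an outlier

centroid <- colMeans(pts)        # C_m = (sum of t_mi) / N

d <- as.matrix(dist(pts))        # pairwise Euclidean distances
medoid <- pts[which.min(rowSums(d)), ]

centroid  # (3, 3): not one of the objects
medoid    # an actual object near the middle of the cluster
```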
K-means
Given a collection of objects, each with n measurable attributes, and a chosen value k (the number of clusters), the algorithm identifies the k clusters of objects based on the objects' proximity to the centers of the k groups. The algorithm is iterative, with the centers adjusted to the mean of each cluster's n-dimensional vector of attributes.
Use Cases
• Clustering is often used as a lead-in to classification, where labels are
applied to the identified clusters
• Some applications
• Image processing
• With security images, successive frames are examined for change
• Medical
• Patients can be grouped to identify naturally occurring clusters
• Customer segmentation
• Marketing and sales groups identify customers having similar behaviors and spending
patterns
Example: k-means iterations on one-dimensional data (k = 2)
K1 = {2, 3, 4},             K2 = {10, 12, 20, 30, 11, 25};   m1 = 3,    m2 = 18
K1 = {2, 3, 4, 10},         K2 = {12, 20, 30, 11, 25};       m1 = 4.75, m2 = 19.6
K1 = {2, 3, 4, 10, 11, 12}, K2 = {20, 30, 25};               m1 = 7,    m2 = 25
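A minimal sketch reproducing this trace with base R's kmeans(), seeded with the same initial means m1 = 3 and m2 = 18 (the one-dimensional data are taken from the example above):

```r
x <- c(2, 3, 4, 10, 11, 12, 20, 25, 30)

# Supply the example's initial centers rather than random ones
fit <- kmeans(x, centers = c(3, 18))

split(x, fit$cluster)  # final clusters: {2,3,4,10,11,12} and {20,25,30}
fit$centers            # final means 7 and 25, as in the last line above
```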
K-means Method
Four Steps
1. Choose the value of k and the initial guesses for the centroids
2. Compute the distance from each data point to each centroid, and assign each point to the closest centroid
3. Compute the centroid of each newly defined cluster
4. Repeat steps 2 and 3 until the algorithm converges (no changes occur)
K-means Method – in Two Dimensions
Example – Step 1
• Choose the value of k and the k initial guesses for the centroids.
• In this example, k = 3, and the initial centroids are indicated by the points shaded in red, green, and blue.
K-means Method – in Two Dimensions
Example – Step 2
• Points are assigned to the closest centroid.
• In two dimensions, the distance, d, between any two points (x1, y1) and (x2, y2) is expressed by the Euclidean distance measure:

$$d = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}$$
K-means Method – in Two Dimensions
Example – Step 3
• Compute the centroids of the new clusters. In two dimensions, the centroid (xc, yc) of the m points in a cluster is calculated as follows:

$$(x_c, y_c) = \left( \frac{\sum_{i=1}^{m} x_i}{m}, \; \frac{\sum_{i=1}^{m} y_i}{m} \right)$$
K-means Method – in Two Dimensions
Example – Step 4
• Repeat steps 2 and 3 until convergence (a compact run is sketched below).
• Convergence occurs when the centroids do not change or when the centroids oscillate back and forth.
• Oscillation can occur when one or more points are equidistant from the centroids.
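A minimal sketch of the four steps on synthetic two-dimensional data; base R's kmeans() performs the assign-and-recompute loop (steps 2 and 3) internally until convergence:

```r
set.seed(7)
pts <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),  # 10 points near (0, 0)
             matrix(rnorm(20, mean = 4), ncol = 2),  # 10 points near (4, 4)
             matrix(rnorm(20, mean = 8), ncol = 2))  # 10 points near (8, 8)

fit <- kmeans(pts, centers = 3)  # step 1: k = 3, random initial centroids
fit$centers                      # centroids after convergence
fit$cluster                      # final assignment of each point
fit$iter                         # iterations of steps 2-3 that were needed
```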
Diagnostics – Choosing the Number of Clusters k
1. Compute the clustering algorithm (e.g., k-means clustering) for different values of k, for instance by varying k from 1 to 10 clusters.
2. For each k, calculate the total within-cluster sum of squares (WSS).
3. Plot the curve of WSS against the number of clusters k.
4. The location of a bend (knee) in the plot is generally considered an indicator of the appropriate number of clusters.
WSS is the objective that k-means minimizes:

$$WSS = \sum_{j=1}^{k} \sum_{x_n \in S_j} \lVert x_n - \mu_j \rVert^2$$

where $x_n$ is a vector representing the nth data point and $\mu_j$ is the geometric centroid of the data points in $S_j$.
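A minimal sketch of this elbow procedure on synthetic data with three built-in groups, using base R's kmeans() and its tot.withinss component for the WSS of step 2:

```r
set.seed(7)
pts <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),
             matrix(rnorm(20, mean = 4), ncol = 2),
             matrix(rnorm(20, mean = 8), ncol = 2))

# Steps 1-2: total within-cluster sum of squares for k = 1..10
wss <- sapply(1:10, function(k)
  kmeans(pts, centers = k, nstart = 25)$tot.withinss)

# Step 3: plot WSS against k; step 4: look for the bend (here near k = 3)
plot(1:10, wss, type = "b",
     xlab = "Number of clusters k", ylab = "Total within-cluster SS")
```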
• Simply speaking, k-means clustering is an algorithm that classifies or groups objects, based on their attributes/features, into K groups.
• K is a positive integer.
• The grouping is done by minimizing the sum of squared distances between the data points and the corresponding cluster centroids.
How does the K-Means Clustering algorithm work?
• Step 1: Begin with a decision on the value of k, the number of clusters.
• Step 2: Put any initial partition that classifies the data into k clusters. You may assign the training samples randomly, or systematically as follows:
  1. Take the first k training samples as single-element clusters.
  2. Assign each of the remaining (N − k) training samples to the cluster with the nearest centroid. After each assignment, recompute the centroid of the gaining cluster.
• Step 3: Take each sample in sequence and compute its distance from the centroid of each of the clusters. If a sample is not currently in the cluster with the closest centroid, switch the sample to that cluster and update the centroids of the cluster gaining the sample and the cluster losing it.
• Step 4: Repeat step 3 until convergence is achieved, that is, until a pass through the training samples causes no new assignments. (A simplified version of this loop is sketched below.)
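A minimal sketch of this loop written out by hand in R (the helper name simple_kmeans is hypothetical, and the sketch does not guard against empty clusters); in practice, base R's kmeans() is the usual choice:

```r
simple_kmeans <- function(x, k, max_iter = 100) {
  x <- as.matrix(x)
  # Step 2 (systematic option 1): first k samples seed the clusters
  centroids <- x[1:k, , drop = FALSE]
  assign_old <- rep(0, nrow(x))
  for (iter in 1:max_iter) {
    # Step 3: squared distance from every point to every centroid
    d <- sapply(1:k, function(j) rowSums(sweep(x, 2, centroids[j, ])^2))
    assign_new <- max.col(-d, ties.method = "first")  # closest centroid
    # Step 4: stop once a full pass causes no new assignments
    if (all(assign_new == assign_old)) break
    assign_old <- assign_new
    # Update each centroid as the mean of its current members
    for (j in 1:k)
      centroids[j, ] <- colMeans(x[assign_new == j, , drop = FALSE])
  }
  list(cluster = assign_new, centers = centroids)
}

# On the 1-D data from the earlier example, it recovers
# {2,3,4,10,11,12} and {20,25,30} with means 7 and 25:
simple_kmeans(c(2, 3, 4, 10, 11, 12, 20, 25, 30), k = 2)
```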
A simple example showing the implementation of the k-means algorithm (using K = 2)
Step 1:
Initialization: We randomly choose the following two centroids (k = 2) for the two clusters. In this case, the two centroids are m1 = (1.0, 1.0) and m2 = (5.0, 7.0).
Step 2:
• Thus, we obtain two clusters containing objects {1, 2, 3} and {4, 5, 6, 7}.
• Their new centroids are recomputed as the means of each cluster's members.
Step 3:
• Now, using these centroids, we compute the Euclidean distance of each object from each centroid, as shown in the table.
When the number of clusters is small, plotting the data helps refine the choice of k:
● Are the clusters well separated from each other?
● Do any of the clusters have only a few points?
● Do any of the centroids appear to be too close to each other?
(Figure: example clustering with k = 2)
Reasons to Choose and Cautions
Units of Measure
(Figure: k = 2 clustering in which age dominates the distance measure)
Rescaling
(Figure: k = 2 clustering with rescaled attributes)
Reasons to Choose and Cautions
Additional Considerations
● K-means is sensitive to the initial seeds; it is important to rerun with several seeds – R has the nstart option (sketched below)
● Distance measures other than Euclidean can be used – e.g., Manhattan, Mahalanobis, etc.
● K-means is easily applied to numeric data and does not work well with nominal attributes – e.g., color
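A minimal sketch of both cautions on synthetic customer data: scale() rescales the attributes so income's large units do not dominate the Euclidean distance, and nstart = 25 reruns kmeans() from 25 random seeds and keeps the best solution:

```r
set.seed(11)
customers <- data.frame(
  age    = runif(100, 18, 70),           # years
  income = runif(100, 20000, 150000))    # dollars: far larger units

raw_fit    <- kmeans(customers,        centers = 2, nstart = 25)
scaled_fit <- kmeans(scale(customers), centers = 2, nstart = 25)

# The two clusterings can disagree: without rescaling, income drives the fit
table(raw = raw_fit$cluster, scaled = scaled_fit$cluster)
```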
Additional Algorithms
● K-modes clustering – kmod()
● Partitioning Around Medoids (PAM) – pam()
● Hierarchical clustering – hclust()
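A minimal usage sketch of two of these alternatives, assuming the cluster package for pam(); hclust() is in base R (kmod() is not sketched here):

```r
library(cluster)  # provides pam()

set.seed(7)
pts <- matrix(rnorm(60), ncol = 2)

# Partitioning Around Medoids: like k-means, but each center is a real object
pam_fit <- pam(pts, k = 3)
pam_fit$medoids

# Hierarchical clustering: build a dendrogram, then cut it into 3 groups
h_fit <- hclust(dist(pts))
cutree(h_fit, k = 3)
```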
Summary
• Cluster analysis groups similar objects based on the objects' attributes
• To use k-means properly, it is important to:
  • Properly scale the attribute values to avoid domination by any one attribute
  • Ensure that the concept of distance between the assigned values of an attribute is meaningful
  • Carefully choose the number of clusters, k
• Once the clusters are identified, it is often useful to label them in a descriptive way
Weaknesses of K-Means Clustering
1. When the number of data points is small, the initial grouping will determine the clusters significantly.
2. The number of clusters, K, must be determined beforehand. A disadvantage is that the algorithm does not yield the same result with each run, since the resulting clusters depend on the initial random assignments.
3. We never know the real clusters: if the same data are input in a different order, the algorithm may produce different clusters when the number of data points is small.
4. It is sensitive to initial conditions; different initial conditions may produce different clusterings, and the algorithm may be trapped in a local optimum. (See the sketch below.)
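A minimal sketch of weakness 4: two runs of kmeans() that differ only in the random initialization can converge to different local optima, visible as different total within-cluster sums of squares:

```r
set.seed(7)
pts <- matrix(rnorm(100), ncol = 2)  # synthetic data with no clear structure

set.seed(1); fit_a <- kmeans(pts, centers = 4)
set.seed(2); fit_b <- kmeans(pts, centers = 4)

fit_a$tot.withinss  # the two runs may report different
fit_b$tot.withinss  # (locally optimal) values
```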
Applications of K-Means Clustering
• It is relatively efficient and fast: it computes a result in O(tkn), where n is the number of objects or points, k is the number of clusters, and t is the number of iterations.
• k-means clustering can be applied to machine learning and data mining problems.
• It is used on acoustic data in speech understanding to convert waveforms into one of k categories (known as Vector Quantization) and in Image Segmentation (sketched below).
• It is also used for choosing color palettes on old-fashioned graphical display devices and for Image Quantization.
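A minimal sketch of k-means as a quantizer on a synthetic signal: each sample is replaced by its cluster center, so the waveform is mapped onto k representative levels, as in vector quantization:

```r
set.seed(7)
signal <- sin(seq(0, 2 * pi, length.out = 200)) + rnorm(200, sd = 0.05)

k   <- 4
fit <- kmeans(signal, centers = k, nstart = 10)

quantized <- fit$centers[fit$cluster]  # each sample -> its cluster's level
head(cbind(signal, quantized))         # original vs. k-level approximation
```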