RPubs - The Analytics Edge EdX MIT15 Clustering

The_Analytics_Edge_edX_MIT15.071x_June2015_5
Nishant Upadhyay
Department of Economics, University of Pune, India
nishantup@gmail.com (mailto:nishantup@gmail.com)

16 July 2015
Unit 6: Clustering
Preliminaries
Recommendations Worth a Million_An Introduction to Clustering_1
VIDEO 1: INTRODUCTION TO NETFLIX
VIDEO 2: RECOMMENDATION SYSTEMS
VIDEO 3: MOVIE DATA AND CLUSTERING
VIDEO 4: COMPUTING DISTANCES
VIDEO 5: HIERARCHICAL CLUSTERING
VIDEO 6: GETTING THE DATA (R script reproduced here)
VIDEO 7: HIERARCHICAL CLUSTERING IN R (R script reproduced here)
VIDEO 8: THE ANALYTICS EDGE OF RECOMMENDATION SYSTEMS
Predictive Diagnosis_Discovering Patterns for Disease Detection_2
VIDEO 1: HEART ATTACKS
VIDEO 2: THE DATA
VIDEO 3: PREDICTING HEART ATTACKS USING CLUSTERING
VIDEO 4: UNDERSTANDING CLUSTER PATTERNS
VIDEO 5: THE ANALYTICS EDGE
Seeing the Big Picture_Segmenting Images to Create Data(Recitation)_3
VIDEO 1: IMAGE SEGMENTATION
VIDEO 2: CLUSTERING PIXELS (R script reproduced here)
VIDEO 3: HIERARCHICAL CLUSTERING (R script reproduced here)
VIDEO 4: MRI IMAGE (R script reproduced here)
VIDEO 5: K-MEANS CLUSTERING (R script reproduced here)
VIDEO 6: DETECTING TUMORS (R script reproduced here)
VIDEO 7: COMPARING METHODS USED IN THE CLASS
Assignment 6
DOCUMENT CLUSTERING WITH DAILY KOS
MARKET SEGMENTATION FOR AIRLINES
PREDICTING STOCK RETURNS WITH CLUSTER-THEN-PREDICT


Source file: The_Analytics_Edge_edX_MIT15.071x_June2015_5.rmd

Unit 6: Clustering
Preliminaries
These are my notes for the lectures of The_Analytics_Edge_edX_MIT15.071x_June2015 (https://courses.edx.org/courses/course-
v1:MITx+15.071x_2a+2T2015) by Professor Dimitris Bertsimas. The goal of these notes is to provide the reproducible R code for all the
lectures.

A good list of resources about using R for clustering is given below:

Getting your clustering right (Part I)_Analytics Vidya (http://www.analyticsvidhya.com/blog/2013/11/getting-clustering-right/)


Cluster Analysis_Quick-R (http://www.statmethods.net/advstats/cluster.html)
Hierarchical Cluster Analysis (http://www.r-tutor.com/gpu-computing/clustering/hierarchical-cluster-analysis)
Cluster Analysis in R (http://ecology.msu.montana.edu/labdsv/R/labs/lab13/lab13.html)
Cluster Analysis and R_stat.berkeley (http://www.stat.berkeley.edu/~s133/Cluster2a.html)
K-means Clustering (from R in Action)_R-statistics blog (http://www.r-statistics.com/2013/08/k-means-clustering-from-r-in-action/)
Cluster Analysis: Tutorial with R_Jari Oksanen (http://cc.oulu.fi/~jarioksa/opetus/metodi/sessio3.pdf)
Cluster Analysis using R - IASRI (http://iasri.res.in/ebook/win_school_aa/notes/cluster_analysis_usingr.pdf)
Performing a cluster analysis in R (http://www.instantr.com/2013/02/12/performing-a-cluster-analysis-in-r/)
Cluster Analysis with R_Rpubs (https://rstudio-pubs-static.s3.amazonaws.com/33876_1d7794d9a86647ca90c4f182df93f0e8.html)
Cluster analysis tutorial - University of Georgia (http://strata.uga.edu/software/pdf/clusterTutorial.pdf)
Cluster Analysis in R (http://www.pmc.ucsc.edu/~mclapham/Rtips/cluster.htm)
Cluster Analysis using R with banking customer balance distribution (http://deepaksinghviblog.blogspot.in/2014/09/cluster-analysis-using-
r-with-banking.html)
CLUSTER ANALYSIS: HOW TO IDENTIFY INHERENT GROUPS WITHIN DATA?_Rstatistics.net (http://rstatistics.net/cluster-analysis/)
CRAN Task View: Cluster Analysis & Finite Mixture Models (https://cran.r-project.org/web/views/Cluster.html)
R Script for K-Means Cluster Analysis (http://www.mattpeeples.net/kmeans.html)
R & Bioconductor Manual (http://manuals.bioinformatics.ucr.edu/home/R_BioCondManual#TOC-Clustering-and-Data-Mining-in-R)
Data Science with R_Clusters analysis_togaware (http://handsondatascience.com/ClustersO.pdf)
Quick-R_ Tree-Based Models (http://www.statmethods.net/advstats/cart.html)

Recommendations Worth a Million_An Introduction to Clustering_1
An Introduction to Clustering

VIDEO 1: INTRODUCTION TO NETFLIX


Netflix

Online DVD rental and streaming video service


More than 40 million subscribers worldwide
$3.6 billion in revenue
Key aspect is being able to offer customers accurate movie recommendations based on a customer's own preferences and viewing history

netflix

The Netflix Prize

From 2006 to 2009, Netflix ran a contest asking the public to submit algorithms to predict user ratings for movies
Training data set of ~100,000,000 ratings and test data set of ~3,000,000 ratings were provided
Offered a grand prize of $1,000,000 USD to the team that could beat Netflix's own algorithm, Cinematch, by more than 10%, measured in RMSE

Contest Rules

If the grand prize was not yet reached, progress prizes of $50,000 USD per year would be awarded for the best result so far, as long as it
had >1% improvement over the previous year.
Teams must submit code and a description of the algorithm to be awarded any prizes
If any team met the 10% improvement goal, last call would be issued and 30 days would remain for all teams to submit their best
algorithm.

Last Call Announced

On June 26, 2009, the team BellKor's Pragmatic Chaos submitted a 10.05% improvement over Cinematch

QUICK QUESTION

About how many years did it take for a team to submit a 10% improvement over Cinematch?

Ans:2.5

EXPLANATION: The contest started in October 2006 and ended in July 2009, so it took about 2.5 years for a team to submit a 10%
improvement solution.

VIDEO 2: RECOMMENDATION SYSTEMS

Predicting the Best User Ratings

Netflix was willing to pay over $1M for the best user rating algorithm, which shows how critical the recommendation system was to their
business
What data could be used to predict user ratings?
Every movie in Netflix's database has the ranking from all users who have ranked that movie
We also know facts about the movie itself: actors, director, genre classifications, year released, etc.

Recommender system_Collaborative Filtering

Using Other Users' Rankings

userrating

Consider suggesting to Carl that he watch Men in Black, since Amy rated it highly and Carl and Amy seem to have similar preferences
This technique is called Collaborative Filtering

Recommender system_Content Filtering

Using Movie Information

contentfiltering

Strengths and Weaknesses

Collaborative Filtering Systems


Can accurately suggest complex items without understanding the nature of the items
Requires a lot of data about the user to make accurate recommendations
Millions of items - need lots of computing power
Content Filtering
Requires very little data to get started
Can be limited in scope

Hybrid Recommendation Systems

Netflix uses both collaborative and content filtering


For example, consider a collaborative filtering approach where we determine that Amy and Carl have similar preferences.
We could then do content filtering, where we would find that Terminator, which both Amy and Carl liked, is classified in almost the same
set of genres as Starship Troopers
Recommend Starship Troopers to both Amy and Carl, even though neither of them has seen it before

QUICK QUESTION

Let's consider a recommendation system on Amazon.com, an online retail site.

If Amazon.com constructs a recommendation system for books, and would like to use the same exact algorithm for shoes, what type would it
have to be?

Ans: Collaborative Filtering

If Amazon.com would like to suggest books to users based on the previous books they have purchased, what type of recommendation system
would it be?

Ans:Content Filtering

EXPLANATION: In the first case, the recommendation system would have to be collaborative filtering, since it can't use information about the
items. In the second case, the recommendation system would be content filtering since other users are not involved.

VIDEO 3: MOVIE DATA AND CLUSTERING


MovieLens Item Dataset

Movies in the dataset are categorized as belonging to different genres

genres

Each movie may belong to many genres


Can we systematically find groups of movies with similar sets of genres?

Why Clustering?

Unsupervised learning
Goal is to segment the data into similar groups instead of prediction
Can also cluster data into similar groups and then build a predictive model for each group
Be careful not to overfit your model!
This works best with large datasets

cluster

Types of Clustering Methods

There are many different algorithms for clustering


Differ in what makes a cluster and how to find them
We will cover
Hierarchical
K-means in the next lecture

QUICK QUESTION

In the previous video, we discussed how clustering is used to split the data into similar groups. Which of the following tasks do you think are
appropriate for clustering? Select all that apply.

Ans: Dividing search results on Google into categories based on the topic & Grouping players into different types of basketball players that
make it to the NBA

EXPLANATION: The first two options are appropriate tasks for clustering. Clustering probably wouldn't help us predict the winner of the World
Series.

VIDEO 4: COMPUTING DISTANCES


How does Clustering work?

Distance Between Points

Need to define distance between two data points


Most popular is Euclidean distance

Distance between points i and j is

d(i, j) = sqrt( (x_i1 - x_j1)^2 + (x_i2 - x_j2)^2 + ... + (x_ik - x_jk)^2 )

where k is the number of independent variables

Distance Example

The movie Toy Story is categorized as Animation, Comedy, and Childrens


Toy Story:(0,0,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0)


toy

The movie Batman Forever is categorized as Action, Adventure, Comedy, and Crime
Batman Forever:(0,1,1,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0)

batman

Distance Between Points

Toy Story: (0,0,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0)


Batman Forever: (0,1,1,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0)

calc1

Other popular distance metrics:


Manhattan Distance
Sum of absolute values instead of squares
Maximum Coordinate Distance
Only consider measurement for which data points deviate the most
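
These distances are easy to check in R with the dist() function. The short sketch below is not part of the original lecture script; the two genre vectors are simply typed in from the slides above:

# Sketch: pairwise distance between Toy Story and Batman Forever under the three metrics
toyStory      = c(0,0,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0)
batmanForever = c(0,1,1,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0)

dist(rbind(toyStory, batmanForever), method = "euclidean")  # sqrt(5), about 2.236
dist(rbind(toyStory, batmanForever), method = "manhattan")  # 5 (sum of absolute differences)
dist(rbind(toyStory, batmanForever), method = "maximum")    # 1 (largest single-coordinate difference)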

Distance Between Clusters

Minimum Distance
Distance between clusters is the distance between points that are the closest


min

Maximum Distance
Distance between clusters is the distance between points that are the farthest

max

Centroid Distance
Distance between centroids of clusters
Centroid is point that has the average of all data points in each component

centroid
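
In R's hclust() function these three notions of between-cluster distance roughly correspond to the method argument: "single" for minimum, "complete" for maximum, and "centroid" for centroid distance. A small sketch on simulated data (not from the lecture):

# Sketch: linkage choices in hclust() on 10 random 2-D points
set.seed(1)
pts = matrix(rnorm(20), ncol = 2)
d = dist(pts, method = "euclidean")
plot(hclust(d, method = "single"))    # minimum distance between clusters
plot(hclust(d, method = "complete"))  # maximum distance between clusters
plot(hclust(d, method = "centroid"))  # distance between cluster centroids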

Normalize Data

Distance is highly influenced by the scale of the variables, so it is customary to normalize first


In our movie dataset, all genre variables are on the same scale and so normalization is not necessary
However, if we included a variable such as Box Office Revenue, we would need to normalize.
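
A minimal sketch of what that normalization would look like (the revenue figures below are made up purely for illustration):

# Sketch: centering and scaling a raw-scale variable before computing distances
revenue = c(191.8, 336.5, 33.2)            # hypothetical box office revenues, in $ millions
(revenue - mean(revenue)) / sd(revenue)    # normalized by hand: mean 0, standard deviation 1
scale(revenue)                             # the same thing using R's scale() function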

QUICK QUESTION

The movie The Godfather is in the genres action, crime, and drama, and is defined by the vector: (0,1,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0)

The movie Titanic is in the genres action, drama, and romance, and is defined by the vector: (0,1,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0)

What is the distance between The Godfather and Titanic, using Euclidean distance?

Ans:1.414214

EXPLANATION: The distance between these two movies is the square root of 2. They have a difference of 1 in two genres - crime and romance.
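
The answer can be verified in R (a small sketch, typing in the two vectors from the question above):

godfather = c(0,1,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0)
titanic   = c(0,1,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0)
dist(rbind(godfather, titanic), method = "euclidean")   # sqrt(2) = 1.414214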

VIDEO 5: HIERARCHICAL CLUSTERING


Hierarchical

Start with each data point in its own cluster

step1

Combine two nearest clusters (Euclidean, Centroid)

step2

combine two nearest clusters (Euclidean, Centroid)

step3

Combine two nearest clusters (Euclidean, Centroid)

step4

Combine two nearest clusters (Euclidean, Centroid)

step5

Combine two nearest clusters (Euclidean, Centroid)

step6

Combine two nearest clusters (Euclidean, Centroid)

step7

Combine two nearest clusters (Euclidean, Centroid)

step8

Display Cluster Process: Dendrogram


dendo

Select Clusters

select

Meaningful Clusters?

Look at statistics (mean, min, max, . . .) for each cluster and each variable
See if the clusters have a feature in common that was not used in the clustering (like an outcome)

QUICK QUESTION

Suppose you are running the Hierarchical clustering algorithm with 212 observations.

How many clusters will there be at the start of the algorithm?

Ans:212

How many clusters will there be at the end of the algorithm?

Ans:1

EXPLANATION:The Hierarchical clustering algorithm always starts with each data point in its own cluster, and ends with all data points in the
same cluster. So there will be 212 clusters at the beginning of the algorithm, and 1 cluster at the end of the algorithm.
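
A quick way to see this in R (a sketch on simulated data, not part of the course script): hclust() records one merge per step, so 212 observations produce exactly 211 merges before everything ends up in a single cluster.

set.seed(1)
pts = matrix(rnorm(212 * 2), ncol = 2)   # 212 simulated observations
hc = hclust(dist(pts))
nrow(hc$merge)                           # 211 merge steps: from 212 clusters down to 1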

VIDEO 6: GETTING THE DATA (R script reproduced here)


In this video, we'll be downloading our dataset from the MovieLens website. Please open the following link in a new window or tab of your
browser to access the data: movieLens.txt (http://files.grouplens.org/datasets/movielens/ml-100k/u.item)

#Unit 6 - Introduction to Clustering

#Video 6

#After following the steps in the video, load the data into R
movies = read.table("movieLens.txt", header=FALSE, sep="|",quote="\"")

str(movies)


## 'data.frame': 1682 obs. of 24 variables:


## $ V1 : int 1 2 3 4 5 6 7 8 9 10 ...
## $ V2 : Factor w/ 1664 levels "'Til There Was You (1997)",..: 1525 618 555 594 344 1318 1545 111 391 1240 ...
## $ V3 : Factor w/ 241 levels "","01-Aug-1997",..: 71 71 71 71 71 71 71 71 71 182 ...
## $ V4 : logi NA NA NA NA NA NA ...
## $ V5 : Factor w/ 1661 levels "","http://us.imdb.com/M/title-exact/Independence%20(1997)",..: 1431 565 505 543
310 1661 1453 103 357 1183 ...
## $ V6 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ V7 : int 0 1 0 1 0 0 0 0 0 0 ...
## $ V8 : int 0 1 0 0 0 0 0 0 0 0 ...
## $ V9 : int 1 0 0 0 0 0 0 0 0 0 ...
## $ V10: int 1 0 0 0 0 0 0 1 0 0 ...
## $ V11: int 1 0 0 1 0 0 0 1 0 0 ...
## $ V12: int 0 0 0 0 1 0 0 0 0 0 ...
## $ V13: int 0 0 0 0 0 0 0 0 0 0 ...
## $ V14: int 0 0 0 1 1 1 1 1 1 1 ...
## $ V15: int 0 0 0 0 0 0 0 0 0 0 ...
## $ V16: int 0 0 0 0 0 0 0 0 0 0 ...
## $ V17: int 0 0 0 0 0 0 0 0 0 0 ...
## $ V18: int 0 0 0 0 0 0 0 0 0 0 ...
## $ V19: int 0 0 0 0 0 0 0 0 0 0 ...
## $ V20: int 0 0 0 0 0 0 0 0 0 0 ...
## $ V21: int 0 0 0 0 0 0 1 0 0 0 ...
## $ V22: int 0 1 1 0 1 0 0 0 0 0 ...
## $ V23: int 0 0 0 0 0 0 0 0 0 1 ...
## $ V24: int 0 0 0 0 0 0 0 0 0 0 ...

#Add column names


colnames(movies) = c("ID", "Title", "ReleaseDate", "VideoReleaseDate", "IMDB", "Unknown", "Action", "Adventure",
"Animation", "Childrens", "Comedy", "Crime", "Documentary", "Drama", "Fantasy", "FilmNoir", "Horror", "Musical",
"Mystery", "Romance", "SciFi", "Thriller", "War", "Western")

str(movies)

## 'data.frame': 1682 obs. of 24 variables:


## $ ID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Title : Factor w/ 1664 levels "'Til There Was You (1997)",..: 1525 618 555 594 344 1318 1545 111
391 1240 ...
## $ ReleaseDate : Factor w/ 241 levels "","01-Aug-1997",..: 71 71 71 71 71 71 71 71 71 182 ...
## $ VideoReleaseDate: logi NA NA NA NA NA NA ...
## $ IMDB : Factor w/ 1661 levels "","http://us.imdb.com/M/title-exact/Independence%20(1997)",..: 143
1 565 505 543 310 1661 1453 103 357 1183 ...
## $ Unknown : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Action : int 0 1 0 1 0 0 0 0 0 0 ...
## $ Adventure : int 0 1 0 0 0 0 0 0 0 0 ...
## $ Animation : int 1 0 0 0 0 0 0 0 0 0 ...
## $ Childrens : int 1 0 0 0 0 0 0 1 0 0 ...
## $ Comedy : int 1 0 0 1 0 0 0 1 0 0 ...
## $ Crime : int 0 0 0 0 1 0 0 0 0 0 ...
## $ Documentary : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Drama : int 0 0 0 1 1 1 1 1 1 1 ...
## $ Fantasy : int 0 0 0 0 0 0 0 0 0 0 ...
## $ FilmNoir : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Horror : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Musical : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Mystery : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Romance : int 0 0 0 0 0 0 0 0 0 0 ...
## $ SciFi : int 0 0 0 0 0 0 1 0 0 0 ...
## $ Thriller : int 0 1 1 0 1 0 0 0 0 0 ...
## $ War : int 0 0 0 0 0 0 0 0 0 1 ...
## $ Western : int 0 0 0 0 0 0 0 0 0 0 ...

# Remove unnecessary variables


movies$ID = NULL
movies$ReleaseDate = NULL
movies$VideoReleaseDate = NULL
movies$IMDB = NULL

#Remove duplicates
movies = unique(movies)

#Take a look at our data again:


str(movies)


## 'data.frame': 1664 obs. of 20 variables:


## $ Title : Factor w/ 1664 levels "'Til There Was You (1997)",..: 1525 618 555 594 344 1318 1545 111 391 1
240 ...
## $ Unknown : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Action : int 0 1 0 1 0 0 0 0 0 0 ...
## $ Adventure : int 0 1 0 0 0 0 0 0 0 0 ...
## $ Animation : int 1 0 0 0 0 0 0 0 0 0 ...
## $ Childrens : int 1 0 0 0 0 0 0 1 0 0 ...
## $ Comedy : int 1 0 0 1 0 0 0 1 0 0 ...
## $ Crime : int 0 0 0 0 1 0 0 0 0 0 ...
## $ Documentary: int 0 0 0 0 0 0 0 0 0 0 ...
## $ Drama : int 0 0 0 1 1 1 1 1 1 1 ...
## $ Fantasy : int 0 0 0 0 0 0 0 0 0 0 ...
## $ FilmNoir : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Horror : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Musical : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Mystery : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Romance : int 0 0 0 0 0 0 0 0 0 0 ...
## $ SciFi : int 0 0 0 0 0 0 1 0 0 0 ...
## $ Thriller : int 0 1 1 0 1 0 0 0 0 0 ...
## $ War : int 0 0 0 0 0 0 0 0 0 1 ...
## $ Western : int 0 0 0 0 0 0 0 0 0 0 ...

#QUICK QUESTION

#Using the table function in R, please answer the following questions about the dataset "movies".

#How many movies are classified as comedies?


table(movies$Comedy)

##
## 0 1
## 1162 502

#Ans:502

#How many movies are classified as westerns


table(movies$Western)

##
## 0 1
## 1637 27

#Ans:27

#How many movies are classified as romance AND drama?


table(movies$Drama,movies$Romance)

##
## 0 1
## 0 801 147
## 1 619 97

#Ans:97

VIDEO 7: HIERARCHICAL CLUSTERING IN R (R script reproduced here)


#Video 7

#There are 2 steps to Hierarchical Clustering


#1.compute the distances between all data points
#2.Then we need to cluster the points

# Calculate distances between genre features:


distances = dist(movies[2:20], method = "euclidean")

# Hierarchical clustering
clusterMovies = hclust(distances, method = "ward.D") #the ward method cares about the distance between clusters using centroid distance and also the variance in each cluster

# Plot the dendrogram


plot(clusterMovies)


#from the dendrogram, drawing the horizontal line as per the requirement of our problem, we select 10 clusters
#(i.e. the horizontal line cuts across 10 vertical lines)

#Assign points to clusters (Label each movie in cluster (with 10 clusters))


clusterGroups = cutree(clusterMovies, k = 10)

#Now let's figure out what the clusters are like.

# Let's use the tapply function to compute the percentage of movies in each genre and cluster

#Calculate average "action" genre value for each cluster


tapply(movies$Action, clusterGroups, mean)

## 1 2 3 4 5 6 7
## 0.1784512 0.7839196 0.1238532 0.0000000 0.0000000 0.1015625 0.0000000
## 8 9 10
## 0.0000000 0.0000000 0.0000000

#Calculate average "Romance" genre value for each cluster


tapply(movies$Romance, clusterGroups, mean)

## 1 2 3 4 5 6
## 0.10437710 0.04522613 0.03669725 0.00000000 0.00000000 1.00000000
## 7 8 9 10
## 1.00000000 0.00000000 0.00000000 0.00000000

#We can repeat this for each genre. If you do, you get the results in ClusterMeans.ods

#Find which cluster Men in Black is in.


subset(movies, Title=="Men in Black (1997)")

## Title Unknown Action Adventure Animation Childrens


## 257 Men in Black (1997) 0 1 1 0 0
## Comedy Crime Documentary Drama Fantasy FilmNoir Horror Musical Mystery
## 257 1 0 0 0 0 0 0 0 0
## Romance SciFi Thriller War Western
## 257 0 1 0 0 0

mibGrp <-clusterGroups[257] # to get which cluster Men in Black (1997) goes into
mibGrp

## 257
## 2


# Create a new data set with just the movies from cluster 2
cluster2 = subset(movies, clusterGroups==mibGrp)

#Look at the first 10 titles in this cluster(Find other movies in same cluster as "Men in Black"):
cluster2$Title[1:10]

## [1] GoldenEye (1995)


## [2] Bad Boys (1995)
## [3] Apollo 13 (1995)
## [4] Net, The (1995)
## [5] Natural Born Killers (1994)
## [6] Outbreak (1995)
## [7] Stargate (1994)
## [8] Fugitive, The (1993)
## [9] Jurassic Park (1993)
## [10] Robert A. Heinlein's The Puppet Masters (1994)
## 1664 Levels: 'Til There Was You (1997) ... Zeus and Roxanne (1997)

#AN ADVANCED APPROACH TO FINDING CLUSTER CENTROIDS


#Other ways to calculate average genre value:

#In this video, we explain how you can find the cluster centroids by using the function "tapply" for each variable
#in the dataset. While this approach works and is familiar to us, it can be a little tedious when there are a lot
#of variables. An alternative approach is to use the colMeans function. With this approach, you only have one
#command for each cluster instead of one command for each variable.

#If you run the following command in your R console, you can get all of the column (variable) means for cluster 1:
colMeans(subset(movies[2:20], clusterGroups == 1))

## Unknown Action Adventure Animation Childrens Comedy


## 0.006734007 0.178451178 0.185185185 0.134680135 0.393939394 0.363636364
## Crime Documentary Drama Fantasy FilmNoir Horror
## 0.033670034 0.010101010 0.306397306 0.070707071 0.000000000 0.016835017
## Musical Mystery Romance SciFi Thriller War
## 0.188552189 0.000000000 0.104377104 0.074074074 0.040404040 0.225589226
## Western
## 0.090909091

#You can repeat this for each cluster by changing the clusterGroups number. However, if you also have a lot of
#clusters, this approach is not that much more efficient than just using the tapply function.

#A more advanced approach uses the "split" and "lapply" functions. The following command will split the data into
#subsets based on the clusters:
spl = split(movies[2:20], clusterGroups)

#Then you can use spl to access the different clusters, because
head(spl[[1]])

## Unknown Action Adventure Animation Childrens Comedy Crime Documentary


## 1 0 0 0 1 1 1 0 0
## 4 0 1 0 0 0 1 0 0
## 7 0 0 0 0 0 0 0 0
## 8 0 0 0 0 1 1 0 0
## 10 0 0 0 0 0 0 0 0
## 21 0 1 1 0 0 1 0 0
## Drama Fantasy FilmNoir Horror Musical Mystery Romance SciFi Thriller
## 1 0 0 0 0 0 0 0 0 0
## 4 1 0 0 0 0 0 0 0 0
## 7 1 0 0 0 0 0 0 1 0
## 8 1 0 0 0 0 0 0 0 0
## 10 1 0 0 0 0 0 0 0 0
## 21 0 0 0 0 1 0 0 0 1
## War Western
## 1 0 0
## 4 0 0
## 7 0 0
## 8 0 0
## 10 1 0
## 21 0 0

#is the same as


head(subset(movies[2:20], clusterGroups == 1))


## Unknown Action Adventure Animation Childrens Comedy Crime Documentary


## 1 0 0 0 1 1 1 0 0
## 4 0 1 0 0 0 1 0 0
## 7 0 0 0 0 0 0 0 0
## 8 0 0 0 0 1 1 0 0
## 10 0 0 0 0 0 0 0 0
## 21 0 1 1 0 0 1 0 0
## Drama Fantasy FilmNoir Horror Musical Mystery Romance SciFi Thriller
## 1 0 0 0 0 0 0 0 0 0
## 4 1 0 0 0 0 0 0 0 0
## 7 1 0 0 0 0 0 0 1 0
## 8 1 0 0 0 0 0 0 0 0
## 10 1 0 0 0 0 0 0 0 0
## 21 0 0 0 0 1 0 0 0 1
## War Western
## 1 0 0
## 4 0 0
## 7 0 0
## 8 0 0
## 10 1 0
## 21 0 0

colMeans(spl[[1]])

## Unknown Action Adventure Animation Childrens Comedy


## 0.006734007 0.178451178 0.185185185 0.134680135 0.393939394 0.363636364
## Crime Documentary Drama Fantasy FilmNoir Horror
## 0.033670034 0.010101010 0.306397306 0.070707071 0.000000000 0.016835017
## Musical Mystery Romance SciFi Thriller War
## 0.188552189 0.000000000 0.104377104 0.074074074 0.040404040 0.225589226
## Western
## 0.090909091

#so colMeans(spl[[1]]) will output the centroid of cluster 1. But an even easier approach uses the lapply
#function. The following command will output the cluster centroids for all clusters:

lapply(spl, colMeans)


## $`1`
## Unknown Action Adventure Animation Childrens Comedy
## 0.006734007 0.178451178 0.185185185 0.134680135 0.393939394 0.363636364
## Crime Documentary Drama Fantasy FilmNoir Horror
## 0.033670034 0.010101010 0.306397306 0.070707071 0.000000000 0.016835017
## Musical Mystery Romance SciFi Thriller War
## 0.188552189 0.000000000 0.104377104 0.074074074 0.040404040 0.225589226
## Western
## 0.090909091
##
## $`2`
## Unknown Action Adventure Animation Childrens Comedy
## 0.000000000 0.783919598 0.351758794 0.010050251 0.005025126 0.065326633
## Crime Documentary Drama Fantasy FilmNoir Horror
## 0.005025126 0.000000000 0.110552764 0.000000000 0.000000000 0.080402010
## Musical Mystery Romance SciFi Thriller War
## 0.000000000 0.000000000 0.045226131 0.346733668 0.376884422 0.015075377
## Western
## 0.000000000
##
## $`3`
## Unknown Action Adventure Animation Childrens Comedy
## 0.000000000 0.123853211 0.036697248 0.000000000 0.009174312 0.064220183
## Crime Documentary Drama Fantasy FilmNoir Horror
## 0.412844037 0.000000000 0.380733945 0.004587156 0.105504587 0.018348624
## Musical Mystery Romance SciFi Thriller War
## 0.000000000 0.275229358 0.036697248 0.041284404 0.610091743 0.000000000
## Western
## 0.000000000
##
## $`4`
## Unknown Action Adventure Animation Childrens Comedy
## 0 0 0 0 0 0
## Crime Documentary Drama Fantasy FilmNoir Horror
## 0 0 1 0 0 0
## Musical Mystery Romance SciFi Thriller War
## 0 0 0 0 0 0
## Western
## 0
##
## $`5`
## Unknown Action Adventure Animation Childrens Comedy
## 0 0 0 0 0 1
## Crime Documentary Drama Fantasy FilmNoir Horror
## 0 0 0 0 0 0
## Musical Mystery Romance SciFi Thriller War
## 0 0 0 0 0 0
## Western
## 0
##
## $`6`
## Unknown Action Adventure Animation Childrens Comedy
## 0.0000000 0.1015625 0.0000000 0.0000000 0.0000000 0.1093750
## Crime Documentary Drama Fantasy FilmNoir Horror
## 0.0468750 0.0000000 0.6640625 0.0000000 0.0078125 0.0156250
## Musical Mystery Romance SciFi Thriller War
## 0.0000000 0.0000000 1.0000000 0.0000000 0.1406250 0.0000000
## Western
## 0.0000000
##
## $`7`
## Unknown Action Adventure Animation Childrens Comedy
## 0 0 0 0 0 1
## Crime Documentary Drama Fantasy FilmNoir Horror
## 0 0 0 0 0 0
## Musical Mystery Romance SciFi Thriller War
## 0 0 1 0 0 0
## Western
## 0
##
## $`8`
## Unknown Action Adventure Animation Childrens Comedy
## 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0212766
## Crime Documentary Drama Fantasy FilmNoir Horror
## 0.0000000 1.0000000 0.0000000 0.0000000 0.0000000 0.0000000
## Musical Mystery Romance SciFi Thriller War
## 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0212766
## Western
## 0.0000000
##
## $`9`

## Unknown Action Adventure Animation Childrens Comedy
## 0 0 0 0 0 1
## Crime Documentary Drama Fantasy FilmNoir Horror
## 0 0 1 0 0 0
## Musical Mystery Romance SciFi Thriller War
## 0 0 0 0 0 0
## Western
## 0
##
## $`10`
## Unknown Action Adventure Animation Childrens Comedy
## 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.1587302
## Crime Documentary Drama Fantasy FilmNoir Horror
## 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 1.0000000
## Musical Mystery Romance SciFi Thriller War
## 0.0000000 0.0000000 0.0000000 0.0000000 0.1587302 0.0000000
## Western
## 0.0000000

#The lapply function runs the second argument (colMeans) on each element of the first argument (each cluster subset
#in spl). So instead of using 19 tapply commands, or 10 colMeans commands, we can output our centroids with just
#two commands: one to define spl, and then the lapply command.

#An even better way


sapply(spl, colMeans)

## 1 2 3 4 5 6 7 8
## Unknown 0.006734007 0.000000000 0.000000000 0 0 0.0000000 0 0.0000000
## Action 0.178451178 0.783919598 0.123853211 0 0 0.1015625 0 0.0000000
## Adventure 0.185185185 0.351758794 0.036697248 0 0 0.0000000 0 0.0000000
## Animation 0.134680135 0.010050251 0.000000000 0 0 0.0000000 0 0.0000000
## Childrens 0.393939394 0.005025126 0.009174312 0 0 0.0000000 0 0.0000000
## Comedy 0.363636364 0.065326633 0.064220183 0 1 0.1093750 1 0.0212766
## Crime 0.033670034 0.005025126 0.412844037 0 0 0.0468750 0 0.0000000
## Documentary 0.010101010 0.000000000 0.000000000 0 0 0.0000000 0 1.0000000
## Drama 0.306397306 0.110552764 0.380733945 1 0 0.6640625 0 0.0000000
## Fantasy 0.070707071 0.000000000 0.004587156 0 0 0.0000000 0 0.0000000
## FilmNoir 0.000000000 0.000000000 0.105504587 0 0 0.0078125 0 0.0000000
## Horror 0.016835017 0.080402010 0.018348624 0 0 0.0156250 0 0.0000000
## Musical 0.188552189 0.000000000 0.000000000 0 0 0.0000000 0 0.0000000
## Mystery 0.000000000 0.000000000 0.275229358 0 0 0.0000000 0 0.0000000
## Romance 0.104377104 0.045226131 0.036697248 0 0 1.0000000 1 0.0000000
## SciFi 0.074074074 0.346733668 0.041284404 0 0 0.0000000 0 0.0000000
## Thriller 0.040404040 0.376884422 0.610091743 0 0 0.1406250 0 0.0000000
## War 0.225589226 0.015075377 0.000000000 0 0 0.0000000 0 0.0212766
## Western 0.090909091 0.000000000 0.000000000 0 0 0.0000000 0 0.0000000
## 9 10
## Unknown 0 0.0000000
## Action 0 0.0000000
## Adventure 0 0.0000000
## Animation 0 0.0000000
## Childrens 0 0.0000000
## Comedy 1 0.1587302
## Crime 0 0.0000000
## Documentary 0 0.0000000
## Drama 1 0.0000000
## Fantasy 0 0.0000000
## FilmNoir 0 0.0000000
## Horror 0 1.0000000
## Musical 0 0.0000000
## Mystery 0 0.0000000
## Romance 0 0.0000000
## SciFi 0 0.0000000
## Thriller 0 0.1587302
## War 0 0.0000000
## Western 0 0.0000000

#QUICK QUESTION

#Run the cutree function again to create the cluster groups, but this time pick k = 2 clusters. It turns out that
#the algorithm groups all of the movies that only belong to one specific genre in one cluster (cluster 2), and puts
#all of the other movies in the other cluster (cluster 1). What is the genre that all of the movies in cluster 2
#belong to?

clusterGroups <- cutree(clusterMovies, k = 2)


spl <- split(movies[2:20], clusterGroups)
sapply(spl, colMeans)


## 1 2
## Unknown 0.001545595 0
## Action 0.192426584 0
## Adventure 0.102782071 0
## Animation 0.032457496 0
## Childrens 0.092735703 0
## Comedy 0.387944359 0
## Crime 0.082689335 0
## Documentary 0.038639876 0
## Drama 0.267387944 1
## Fantasy 0.017001546 0
## FilmNoir 0.018547141 0
## Horror 0.069551777 0
## Musical 0.043276662 0
## Mystery 0.046367852 0
## Romance 0.188562597 0
## SciFi 0.077279753 0
## Thriller 0.191653787 0
## War 0.054868624 0
## Western 0.020865533 0

#or

lapply(spl, colMeans)

## $`1`
## Unknown Action Adventure Animation Childrens Comedy
## 0.001545595 0.192426584 0.102782071 0.032457496 0.092735703 0.387944359
## Crime Documentary Drama Fantasy FilmNoir Horror
## 0.082689335 0.038639876 0.267387944 0.017001546 0.018547141 0.069551777
## Musical Mystery Romance SciFi Thriller War
## 0.043276662 0.046367852 0.188562597 0.077279753 0.191653787 0.054868624
## Western
## 0.020865533
##
## $`2`
## Unknown Action Adventure Animation Childrens Comedy
## 0 0 0 0 0 0
## Crime Documentary Drama Fantasy FilmNoir Horror
## 0 0 1 0 0 0
## Musical Mystery Romance SciFi Thriller War
## 0 0 0 0 0 0
## Western
## 0

#Ans:Drama

VIDEO 8: THE ANALYTICS EDGE OF RECOMMENDATION SYSTEMS


Beyond Movies: Mass Personalization

beyond

Cornerstone of these Top Businesses


corner

Recommendation Method Used

methodused

Winners are Declared!

winners

The Edge of Recommendation Systems

In today's digital age, businesses often have hundreds of thousands of items to offer their customers
Excellent recommendation systems can make or break these businesses
Clustering algorithms, which are tailored to find similar customers or similar items, form the backbone of many of these recommendation
systems

Predictive Diagnosis_Discovering Patterns for Disease Detection_2
VIDEO 1: HEART ATTACKS
Heart Attacks

Heart attack is a common complication of coronary heart disease resulting from the interruption of blood supply to part of the heart
2012 report from the American Heart Association estimates about 715,000 Americans have a heart attack every year
Every 20 seconds, a person has a heart attack in the US
Nearly half occur without prior warning signs
250,000 Americans die of Sudden Cardiac Death yearly

Analytics Helps Monitoring

Understanding the clinical characteristics of patients in whom a heart attack was missed is key
Need for an increased understanding of the patterns in a patient's diagnostic history that link to a heart attack
Predicting whether a patient is at risk of a heart attack helps monitoring and calls for action
Analytics helps understand patterns of heart attacks and provides good predictions

QUICK QUESTION

In this class, we've learned many different methods for predicting outcomes. Which of the following methods is designed to be used to predict an
outcome like whether or not someone will experience a heart attack? Select all that apply.

Ans:Logistic Regression,CART, Random Forest

EXPLANATION: Logistic Regression, CART, and Random Forest are all designed to be used to predict whether or not someone has a heart attack,
since this is a classification problem. Linear regression would be appropriate for a problem with a continuous outcome, such as the amount of
time until someone has a heart attack. In this lecture, we'll use random forest, but the other methods could be used too.

VIDEO 2: THE DATA


Claims Data

Claims data offers an expansive view of a patient's health history


Demographics, medical history and medications
Offers insights regarding a patient's risk
May reveal indicative signals and patterns
We will use health insurance claims filed for about 7,000 members from January 2000 - November 2007

Concentrated on members with the following attributes


At least 5 claims with coronary artery disease diagnosis
At least 5 claims with hypertension diagnostic codes
At least 100 total medical claims
At least 5 pharmacy claims
Data from at least 5 years
Yields patients with a high risk of heart attack and a reasonably rich history and continuous coverage

Data Aggregation

The resulting dataset includes about 20 million health insurance entries including individual medical and pharmaceutical records
Diagnosis, procedure and drug codes in the dataset comprise tens of thousands of attributes
Codes were aggregated into groups
218 diagnosis groups, 180 procedure groups, 538 drug groups
46 diagnosis groups were considered by clinicians as possible risk factors for heart attacks

Diagnostic History

We then compress medical records to obtain a chronological representation of a patient's diagnostic profile
Cost and number of medical claims and hospital visits by diagnosis
Observations split into 21 periods, each 90 days in length
Examined 9 months of diagnostic history leading up to heart attack/no heart attack event
Align data to make observations date-independent while preserving the order of events
3 months ~ 0-3 months before heart attack
6 months ~ 3-6 months before heart attack
9 months ~ 6-9 months before heart attack

Target Variable

Target prediction is the first occurrence of a heart attack


Diagnosis on medical claim
Visit to emergency room followed by hospitalization
Binary Yes/No

target

Dataset Compilation


data1

Cost Bucket Partitioning

Cost is a good summary of a person's overall health


Divide population into similar smaller groups
Low risk, average risk, high risk

Build models for each group

QUICK QUESTION

In the previous video, we discussed how we split the data into three groups, or buckets, according to cost.

Which bucket has the most data, in terms of number of patients?

Ans:Cost Bucket 1

Which bucket probably has the densest data, in terms of number of claims per person?

Ans:Cost Bucket 3

EXPLANATION:Cost Bucket 1 contains the most patients (see slide 7 of the previous video), and Cost Bucket 3 probably has the densest data,
since these are the patients with the highest cost in terms of claims.

VIDEO 3: PREDICTING HEART ATTACKS USING CLUSTERING


Predicting Heart Attacks (Random Forest)

Predicting whether a patient has a heart attack for each of the cost buckets using the random forest algorithm

Incorporating Clustering

Patients in each bucket may have different characteristics


cluster11

Clustering Cost Buckets

Two clustering algorithms were used for the analysis as an alternative to hierarchical clustering
Spectral Clustering
k-means clustering


k-Means Clustering

algo1

Algo2

algo3

algo4

algo5


algo6

algo7

Practical Considerations

The number of clusters k can be selected from previous knowledge or experimenting


Can strategically select initial partition of points into clusters if you have some knowledge of the data
Can run algorithm several times with different random starting points (see the sketch after this list)
In recitation, we will learn how to run the k-means clustering algorithm in R
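
A minimal sketch of these options in R (simulated data, not the course script): kmeans() takes the number of clusters up front via the centers argument, and nstart re-runs the algorithm from several random starting partitions and keeps the best result.

set.seed(1)
x = matrix(rnorm(200), ncol = 2)            # 100 simulated 2-D points
km = kmeans(x, centers = 3, nstart = 20)    # k = 3 clusters, 20 random restarts
km$size                                     # number of points assigned to each cluster
km$centers                                  # the 3 cluster centroids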

Random Forest with Clustering

RF2

QUICK QUESTION

K-means clustering differs from hierarchical clustering in a couple of important ways. Which of the following statements is true?

Ans:In k-means clustering, you have to pick the number of clusters you want before you run the algorithm

EXPLANATION: In k-means clustering, you have to pick the number of clusters before you run the algorithm, but the computational effort needed
is much less than that for hierarchical clustering (we'll see this in more detail during the recitation).

VIDEO 4: UNDERSTANDING CLUSTER PATTERNS


Understanding Cluster Patterns

Clusters are interpretable and reveal unique patterns of diagnostic history among the population

patterns

QUICK QUESTION

As we saw in the previous video, the clusters can be used to find interesting patterns of health in addition to being used to improve predictive
models. By changing the number of clusters, you can find more general or more specific patterns.

If you wanted to find more unusual patterns shared by a small number of people, would you increase or decrease the number of clusters?

Ans:Increase

EXPLANATION: If you wanted to find more unusual patterns, you would increase the number of clusters since the clusters would become smaller
and more patterns would probably emerge.

VIDEO 5: THE ANALYTICS EDGE


Impact of Clustering

Clustering members within each cost bucket yielded better predictions of heart attacks within clusters
Grouping patients in clusters exhibits temporal diagnostic patterns within 9 months of a heart attack
These patterns can be incorporated in the diagnostic rules for heart attacks
Great research interest in using analytics for early heart failure detection through pattern recognition

Seeing the Big Picture_Segmenting Images to Create Data(Recitation)_3
VIDEO 1: IMAGE SEGMENTATION
Segmenting Images to Create Data

Image Segmentation

Divide up digital images into salient regions/clusters corresponding to individual surfaces, objects, or natural parts of objects
Clusters should be uniform and homogeneous with respect to certain characteristics (color, intensity, texture)
Goal: Useful and analyzable image representation

Wide Applications

Medical Imaging
Locate tissue classes, organs, pathologies and tumors
Measure tissue/tumor volume
Object Detection
Detect facial features in photos
Detect pedestrians in footage from surveillance videos
Recognition tasks
Fingerprint/Iris recognition

Various Methods

Clustering methods
Partition image into clusters based on differences in pixel colors, intensity or texture
Edge detection
Based on the detection of discontinuity, such as an abrupt change in the gray level in gray-scale images
Region-growing methods
Divides image into regions, then sequentially merges sufficiently similar regions

In this Recitation.

Review hierarchical and k-means clustering in R


Restrict ourselves to gray-scale images
Simple example of a flower image (flower.csv)
Medical imaging application with examples of transverse MRI images of the brain (healthy.csv and tumor.csv)
Compare the use, pros and cons of all analytics methods we have seen so far


VIDEO 2: CLUSTERING PIXELS (R script reproduced here)


Grayscale Images

Image is represented as a matrix of pixel intensity values ranging from 0 (black) to 1 (white)
For 8 bits/pixel (bpp), 256 color levels

gray

Grayscale Image Segmentation

Cluster pixels according to their intensity values

gray2

#Unit 6 - Recitation

#Video 2

flower = read.csv("flower.csv", header=FALSE) #since we do not have headers in the data and we don't want R to treat the first row as a header by default, we use the header=FALSE argument
str(flower)


## 'data.frame': 50 obs. of 50 variables:


## $ V1 : num 0.0991 0.0991 0.1034 0.1034 0.1034 ...
## $ V2 : num 0.112 0.108 0.112 0.116 0.108 ...
## $ V3 : num 0.134 0.116 0.121 0.116 0.112 ...
## $ V4 : num 0.138 0.138 0.121 0.121 0.112 ...
## $ V5 : num 0.138 0.134 0.125 0.116 0.112 ...
## $ V6 : num 0.138 0.129 0.121 0.108 0.112 ...
## $ V7 : num 0.129 0.116 0.103 0.108 0.112 ...
## $ V8 : num 0.116 0.103 0.103 0.103 0.116 ...
## $ V9 : num 0.1121 0.0991 0.1078 0.1121 0.1164 ...
## $ V10: num 0.121 0.108 0.112 0.116 0.125 ...
## $ V11: num 0.134 0.125 0.129 0.134 0.129 ...
## $ V12: num 0.147 0.134 0.138 0.129 0.138 ...
## $ V13: num 0.000862 0.146552 0.142241 0.142241 0.133621 ...
## $ V14: num 0.000862 0.000862 0.142241 0.133621 0.12931 ...
## $ V15: num 0.142 0.142 0.134 0.121 0.116 ...
## $ V16: num 0.125 0.125 0.116 0.108 0.108 ...
## $ V17: num 0.1121 0.1164 0.1078 0.0991 0.0991 ...
## $ V18: num 0.108 0.112 0.108 0.108 0.108 ...
## $ V19: num 0.121 0.129 0.125 0.116 0.116 ...
## $ V20: num 0.138 0.129 0.125 0.116 0.116 ...
## $ V21: num 0.138 0.134 0.121 0.125 0.125 ...
## $ V22: num 0.134 0.129 0.125 0.121 0.103 ...
## $ V23: num 0.125 0.1207 0.1164 0.1164 0.0819 ...
## $ V24: num 0.1034 0.1034 0.0991 0.0991 0.1034 ...
## $ V25: num 0.0948 0.0905 0.0905 0.1034 0.125 ...
## $ V26: num 0.0862 0.0862 0.0991 0.125 0.1422 ...
## $ V27: num 0.086207 0.086207 0.103448 0.12931 0.000862 ...
## $ V28: num 0.0991 0.1078 0.1164 0.1293 0.1466 ...
## $ V29: num 0.116 0.134 0.134 0.121 0.142 ...
## $ V30: num 0.121 0.138 0.142 0.129 0.138 ...
## $ V31: num 0.121 0.134 0.142 0.134 0.129 ...
## $ V32: num 0.116 0.134 0.129 0.116 0.112 ...
## $ V33: num 0.108 0.112 0.116 0.108 0.108 ...
## $ V34: num 0.1078 0.1078 0.1034 0.0991 0.1034 ...
## $ V35: num 0.1078 0.1034 0.0991 0.0991 0.0991 ...
## $ V36: num 0.1078 0.1034 0.1034 0.0905 0.0862 ...
## $ V37: num 0.1078 0.1078 0.1034 0.0819 0.0733 ...
## $ V38: num 0.0948 0.0991 0.0776 0.069 0.0733 ...
## $ V39: num 0.0733 0.056 0.0474 0.0474 0.056 ...
## $ V40: num 0.0474 0.0388 0.0431 0.0474 0.0603 ...
## $ V41: num 0.0345 0.0345 0.0388 0.0474 0.0647 ...
## $ V42: num 0.0259 0.0259 0.0345 0.0431 0.056 ...
## $ V43: num 0.0259 0.0259 0.0388 0.0517 0.0603 ...
## $ V44: num 0.0302 0.0302 0.0345 0.0517 0.0603 ...
## $ V45: num 0.0259 0.0259 0.0259 0.0388 0.0474 ...
## $ V46: num 0.0259 0.0172 0.0172 0.0259 0.0345 ...
## $ V47: num 0.01724 0.01724 0.00862 0.02155 0.02586 ...
## $ V48: num 0.0216 0.0129 0.0129 0.0172 0.0302 ...
## $ V49: num 0.0216 0.0216 0.0216 0.0345 0.0603 ...
## $ V50: num 0.0302 0.0345 0.0388 0.0603 0.0776 ...

# Change the data type to matrix


#since the data is given as a 50 by 50 data frame of pixel intensities, we need to convert/manipulate the data so
#that clustering can be applied. This requires converting the data frame first to a matrix and then converting this
#matrix into a vector using the as. functions

flowerMatrix = as.matrix(flower)
str(flowerMatrix)

## num [1:50, 1:50] 0.0991 0.0991 0.1034 0.1034 0.1034 ...


## - attr(*, "dimnames")=List of 2
## ..$ : NULL
## ..$ : chr [1:50] "V1" "V2" "V3" "V4" ...

# Turn matrix into a vector


flowerVector = as.vector(flowerMatrix)
str(flowerVector)

## num [1:2500] 0.0991 0.0991 0.1034 0.1034 0.1034 ...

#Had we converted the data frame directly into vector as below, we see that it still remains a data frame.
flowerVector2 = as.vector(flower)
str(flowerVector2)


## 'data.frame': 50 obs. of 50 variables:


## $ V1 : num 0.0991 0.0991 0.1034 0.1034 0.1034 ...
## $ V2 : num 0.112 0.108 0.112 0.116 0.108 ...
## $ V3 : num 0.134 0.116 0.121 0.116 0.112 ...
## $ V4 : num 0.138 0.138 0.121 0.121 0.112 ...
## $ V5 : num 0.138 0.134 0.125 0.116 0.112 ...
## $ V6 : num 0.138 0.129 0.121 0.108 0.112 ...
## $ V7 : num 0.129 0.116 0.103 0.108 0.112 ...
## $ V8 : num 0.116 0.103 0.103 0.103 0.116 ...
## $ V9 : num 0.1121 0.0991 0.1078 0.1121 0.1164 ...
## $ V10: num 0.121 0.108 0.112 0.116 0.125 ...
## $ V11: num 0.134 0.125 0.129 0.134 0.129 ...
## $ V12: num 0.147 0.134 0.138 0.129 0.138 ...
## $ V13: num 0.000862 0.146552 0.142241 0.142241 0.133621 ...
## $ V14: num 0.000862 0.000862 0.142241 0.133621 0.12931 ...
## $ V15: num 0.142 0.142 0.134 0.121 0.116 ...
## $ V16: num 0.125 0.125 0.116 0.108 0.108 ...
## $ V17: num 0.1121 0.1164 0.1078 0.0991 0.0991 ...
## $ V18: num 0.108 0.112 0.108 0.108 0.108 ...
## $ V19: num 0.121 0.129 0.125 0.116 0.116 ...
## $ V20: num 0.138 0.129 0.125 0.116 0.116 ...
## $ V21: num 0.138 0.134 0.121 0.125 0.125 ...
## $ V22: num 0.134 0.129 0.125 0.121 0.103 ...
## $ V23: num 0.125 0.1207 0.1164 0.1164 0.0819 ...
## $ V24: num 0.1034 0.1034 0.0991 0.0991 0.1034 ...
## $ V25: num 0.0948 0.0905 0.0905 0.1034 0.125 ...
## $ V26: num 0.0862 0.0862 0.0991 0.125 0.1422 ...
## $ V27: num 0.086207 0.086207 0.103448 0.12931 0.000862 ...
## $ V28: num 0.0991 0.1078 0.1164 0.1293 0.1466 ...
## $ V29: num 0.116 0.134 0.134 0.121 0.142 ...
## $ V30: num 0.121 0.138 0.142 0.129 0.138 ...
## $ V31: num 0.121 0.134 0.142 0.134 0.129 ...
## $ V32: num 0.116 0.134 0.129 0.116 0.112 ...
## $ V33: num 0.108 0.112 0.116 0.108 0.108 ...
## $ V34: num 0.1078 0.1078 0.1034 0.0991 0.1034 ...
## $ V35: num 0.1078 0.1034 0.0991 0.0991 0.0991 ...
## $ V36: num 0.1078 0.1034 0.1034 0.0905 0.0862 ...
## $ V37: num 0.1078 0.1078 0.1034 0.0819 0.0733 ...
## $ V38: num 0.0948 0.0991 0.0776 0.069 0.0733 ...
## $ V39: num 0.0733 0.056 0.0474 0.0474 0.056 ...
## $ V40: num 0.0474 0.0388 0.0431 0.0474 0.0603 ...
## $ V41: num 0.0345 0.0345 0.0388 0.0474 0.0647 ...
## $ V42: num 0.0259 0.0259 0.0345 0.0431 0.056 ...
## $ V43: num 0.0259 0.0259 0.0388 0.0517 0.0603 ...
## $ V44: num 0.0302 0.0302 0.0345 0.0517 0.0603 ...
## $ V45: num 0.0259 0.0259 0.0259 0.0388 0.0474 ...
## $ V46: num 0.0259 0.0172 0.0172 0.0259 0.0345 ...
## $ V47: num 0.01724 0.01724 0.00862 0.02155 0.02586 ...
## $ V48: num 0.0216 0.0129 0.0129 0.0172 0.0302 ...
## $ V49: num 0.0216 0.0216 0.0216 0.0345 0.0603 ...
## $ V50: num 0.0302 0.0345 0.0388 0.0603 0.0776 ...

#Hence it's crucial that we first convert the data frame into a matrix and then into a vector

#let's proceed with the familiar hierarchical clustering by computing the pair-wise distances between each
#intensity value in the flower vector as below:

#Compute "euclidean" distances


distance = dist(flowerVector, method = "euclidean")

VIDEO 3: HIERARCHICAL CLUSTERING (R script reproduced here)


Dendrogram Example


dendo1

The lowest level denotes the individual observations (data points); all higher nodes denote clusters.
Vertical lines denote the distance between clusters: the taller these lines, the more DISSIMILAR the clusters are.

Choose the number of clusters by drawing a horizontal line cutting through the vertical distance lines in the dendrogram

dendo2

What number of clusters should we choose?

A smaller number of clusters means the clustering will be coarser; a higher number would result in over-segmentation. Hence we should think
about the trade-off.

Criteria to choose the number of clusters: * If the horizontal line passing through the vertical distance lines has enough head room to move up & down,
then that can be one reason to choose the cut. In the diagram above, a 3-cluster cut seems appropriate as the horizontal cut line has enough head room
up & down.

#Video 3

# Hierarchical clustering of the intensity values, which gives the dendrogram tree


clusterIntensity = hclust(distance, method="ward.D2")

#Plot the dendrogram


plot(clusterIntensity)

# Select 3 clusters
#visualising the cuts by plotting a rectangle around the clusters
rect.hclust(clusterIntensity, k = 3, border = "red")


#lets split the data into these 3 clusters


flowerClusters = cutree(clusterIntensity, k = 3)
flowerClusters


## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [35] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [69] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [103] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [137] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [171] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [205] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [239] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [273] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [307] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [341] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 3
## [375] 2 2 3 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [409] 1 1 1 1 1 1 1 1 1 2 2 1 1 1 3 3 3 3 3 2 1 2 3 2 1 1 1 1 1 1 1 1 1 1
## [443] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 3 1 1 1 3 3 3 3
## [477] 3 2 2 3 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [511] 1 1 1 1 1 1 2 3 3 2 1 1 3 3 3 3 3 2 3 3 3 1 2 3 3 1 1 1 1 1 1 1 1 1
## [545] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 3 2 1 2 3 3 3 1 1 3 3 3 3 3 2
## [579] 3 3 2 2 3 3 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [613] 2 3 3 2 1 3 3 3 2 1 2 3 3 3 3 3 3 3 2 3 3 3 3 3 3 3 1 1 1 1 1 1 1 1
## [647] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 3 3 2 2 3 3 3 1 2 3 3 3 3 3 3 3
## [681] 3 3 3 3 3 3 3 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2
## [715] 3 3 3 3 3 3 3 2 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 1 1 1 1 1 1 1 1 1 1
## [749] 1 1 1 1 1 1 1 1 1 1 1 2 3 2 1 1 2 3 3 3 3 3 3 3 2 3 3 3 3 3 3 3 3 3
## [783] 3 3 3 3 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 3 3 3 2 1 1
## [817] 3 3 3 3 3 3 2 2 3 3 3 3 3 3 3 3 3 3 3 2 1 1 2 2 3 3 1 1 1 1 1 1 1 1
## [851] 1 1 1 1 1 1 1 1 1 1 2 3 3 3 2 1 2 3 3 3 3 3 3 2 3 3 3 3 3 3 3 3 3 3
## [885] 2 1 2 3 3 3 3 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 3 3 3 3 2 2
## [919] 3 3 3 3 3 2 2 3 3 3 3 3 3 3 3 2 1 3 3 3 3 3 3 2 1 1 1 1 1 1 1 1 1 1
## [953] 1 1 1 1 1 1 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 2 2 2 2 2 3 3 3 3 2 2 3 3
## [987] 3 3 3 3 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 3 3 3 3 3 3 3 3 3
## [1021] 3 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [1055] 1 1 2 2 2 3 3 3 3 3 3 3 3 3 3 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 2 2
## [1089] 2 2 3 3 3 2 1 1 1 1 1 1 1 1 1 1 1 2 3 3 3 3 3 3 3 3 3 3 3 3 3 2 2 2
## [1123] 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 2 1 1 1 1 1 1 1 1 1 1 1 1
## [1157] 2 3 3 3 3 3 3 3 3 3 3 3 3 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 2
## [1191] 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 3 3 3 3 3 2 2 2 2 2 2
## [1225] 2 2 2 2 2 2 2 3 3 3 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [1259] 3 3 3 3 3 3 3 3 3 3 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1
## [1293] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 3 3 3 3 3 3 3 3 3 3 3 2 2 2 2 2 2 2
## [1327] 2 2 2 2 3 3 3 3 3 3 3 3 3 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 3 3
## [1361] 3 3 3 3 3 3 3 3 3 3 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 1
## [1395] 1 1 1 1 1 1 1 1 1 1 1 1 1 2 3 3 3 3 3 3 3 3 3 3 3 3 2 2 2 2 2 2 2 2
## [1429] 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 3 3 3
## [1463] 3 3 3 3 3 3 3 3 3 3 3 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 2 1 1 1 1 1
## [1497] 1 1 1 1 1 1 1 1 1 1 1 1 3 3 3 3 3 3 3 2 3 3 3 3 3 3 3 2 2 3 3 3 2 3
## [1531] 3 3 3 3 3 2 3 3 3 3 3 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 3 3 2 2 1
## [1565] 2 3 3 3 3 3 3 3 3 2 2 3 3 3 3 2 3 3 3 3 3 2 1 2 2 2 2 2 1 1 1 1 1 1
## [1599] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 3 3 3 3 3 3 3 3 3 2 3 3 3 3 3 2 3 3
## [1633] 3 3 3 3 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 3 3 3
## [1667] 2 3 3 3 3 3 2 2 3 3 3 3 3 2 1 3 3 3 3 3 3 3 2 1 1 1 1 1 1 1 1 1 1 1
## [1701] 1 1 1 1 1 1 1 1 1 1 1 2 3 3 2 2 3 3 3 3 3 3 1 3 3 3 3 3 3 2 1 2 3 3
## [1735] 3 3 3 3 3 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 3 2 1 3 3 3
## [1769] 3 3 3 1 1 3 3 3 3 3 3 2 1 1 2 3 3 3 3 2 3 2 1 1 1 1 1 1 1 1 1 1 1 1
## [1803] 1 1 1 1 1 1 1 1 1 2 2 1 2 3 3 3 3 3 2 1 2 3 3 2 3 3 3 2 1 1 1 1 3 3
## [1837] 3 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 3 3 3 3 3
## [1871] 1 1 2 3 3 2 3 3 3 3 1 1 1 1 1 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [1905] 1 1 1 1 1 1 1 1 1 1 2 3 3 3 3 3 1 1 2 3 3 1 3 3 3 3 1 1 1 1 1 1 1 1
## [1939] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 3 3 2 1 1
## [1973] 2 3 3 1 2 3 3 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [2007] 1 1 1 1 1 1 1 1 1 1 1 2 3 1 1 1 1 3 3 1 1 3 3 3 1 1 1 1 1 1 1 1 1 1
## [2041] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2
## [2075] 3 1 1 1 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [2109] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [2143] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [2177] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [2211] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [2245] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [2279] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [2313] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [2347] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [2381] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [2415] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [2449] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [2483] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

#this assigns values 1 to 3 denoting which of the 3 clusters the intensity(data point) was assigned to

# Find mean intensity values of each of these 3 clusters


tapply(flowerVector, flowerClusters, mean)


## 1 2 3
## 0.08574315 0.50826255 0.93147713

# Plot the image and the clusters


#Before plotting the image we want to convert the flowerClusters vector into a matrix of form n x m using dim()
dim(flowerClusters) = c(50,50)
#plotting the image
image(flowerClusters, axes = FALSE)

#Let's see how the original flower looks so that we can compare it with our image output
# Original image with gray colour scale
image(flowerMatrix,axes=FALSE,col=grey(seq(0,1,length=256)))

#this produces a low-resolution image. Next we try a high-resolution brain image

VIDEO 4: MRI IMAGE (R script reproduced here)


k-Means Clustering

k-means clustering aims to partition the data into k clusters in which each data point belongs to the cluster with the nearest mean (centroid).
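As a minimal illustration (my own toy example, not from the lecture), kmeans() applied to a one-dimensional vector with two well-separated groups recovers cluster means near 0 and 10:

set.seed(1)
toy = c(rnorm(5, mean = 0), rnorm(5, mean = 10))   # two obvious groups
kmeans(toy, centers = 2)$centers                    # the two cluster means, close to 0 and 10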


kmeans1

k-Means Clustering steps:

kmeans2

kmeans3

kmeans4


kmeans5

kmeans6

kmeans7

kmeans8

# Video 4

# Let's try this with an MRI image of the brain


#healthy.csv dataframe consists of matrix of intensity values
healthy = read.csv("healthy.csv", header=FALSE)

#Creating the healthy matrix


healthyMatrix = as.matrix(healthy)
str(healthyMatrix)

## num [1:566, 1:646] 0.00427 0.00855 0.01282 0.01282 0.01282 ...


## - attr(*, "dimnames")=List of 2
## ..$ : NULL
## ..$ : chr [1:646] "V1" "V2" "V3" "V4" ...

#we can see that the image is 566 by 646....considerably larger than the flower

# Plot the image in gray scale


image(healthyMatrix,axes=FALSE,col=grey(seq(0,1,length=256)))

# Hierarchical clustering by computing the distance vector


healthyVector = as.vector(healthyMatrix)
#distance = dist(healthyVector, method = "euclidean")
#throws up error that the vector size is huge 480GB+

# We have an error - why?


#lets see the size of the vector
str(healthyVector)

## num [1:365636] 0.00427 0.00855 0.01282 0.01282 0.01282 ...

#Lets store the value in n


n=365636

#therefore the pairwise distance between each of the above n data points is:
n*(n-1)/2

## [1] 66844659430

#wow...66 billion+ pairwise distances
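#Rough memory estimate (my own back-of-the-envelope check): dist() stores each pairwise distance as an 8-byte double
n*(n-1)/2*8/2^30   # about 498 GB -- far too large to hold in memory, hence the error noted above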

#So the bad news is that we cannot use hierarchical clustering because of the high resolution of the image. Hence we turn to another clustering method: k-means

VIDEO 5: K-MEANS CLUSTERING (R script reproduced here)



#Video 5

# Specify number of clusters (Depends upon what you want to extract from the image)
k = 5

# Run k-means
set.seed(1) #since k-means clustering starts by randomly assigning data points to clusters, we set the seed to obtain the same results each time
KMC = kmeans(healthyVector, centers = k, iter.max = 1000)
str(KMC)

## List of 9
## $ cluster : int [1:365636] 3 3 3 3 3 3 3 3 3 3 ...
## $ centers : num [1:5, 1] 0.4818 0.1062 0.0196 0.3094 0.1842
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:5] "1" "2" "3" "4" ...
## .. ..$ : NULL
## $ totss : num 5775
## $ withinss : num [1:5] 96.6 47.2 39.2 57.5 62.3
## $ tot.withinss: num 303
## $ betweenss : num 5472
## $ size : int [1:5] 20556 101085 133162 31555 79278
## $ iter : int 2
## $ ifault : int 0
## - attr(*, "class")= chr "kmeans"

# Extract clusters variable for plotting it


healthyClusters = KMC$cluster

#mean intensity values of each cluster are already available under the variable name centers in the KMC object (NO NEED to calculate as was done for hierarchical clustering)
KMC$centers[2] #mean intensity value of 2nd cluster

## [1] 0.1061945

# Plot the image with the clusters


#first convert into a matrix using dim()
dim(healthyClusters) = c(nrow(healthyMatrix), ncol(healthyMatrix))
image(healthyClusters, axes = FALSE, col=rainbow(k))


#SCREE PLOTS

#While dendrograms can be used to select the final number of clusters for Hierarchical Clustering, we can't use dendrograms for k-means clustering. However, there are several other ways that the number of clusters can be selected. One common way to select the number of clusters is by using a scree plot, which works for any clustering algorithm.

#A standard scree plot has the number of clusters on the x-axis, and the sum of the within-cluster sum of squares on the y-axis. The within-cluster sum of squares for a cluster is the sum, across all points in the cluster, of the squared distance between each point and the centroid of the cluster. We ideally want very small within-cluster sum of squares, since this means that the points are all very close to their centroid.

#To create the scree plot, the clustering algorithm is run with a range of values for the number of clusters. For each number of clusters, the within-cluster sum of squares can easily be extracted when using k-means clustering. For example, suppose that we want to cluster the MRI image from this video into two clusters. We can first run the k-means algorithm with two clusters:

KMC2 = kmeans(healthyVector, centers = 2, iter.max = 1000)

#Then, the within-cluster sum of squares is just an element of KMC2:


KMC2$withinss

## [1] 803.2878 1214.2523

#This gives a vector of the within-cluster sum of squares for each cluster (in this case, there should be two numbers).
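#A quick sanity check (my own addition): each entry of withinss can be reproduced by hand as the sum of squared distances from the points in that cluster to its centre, e.g. for cluster 1:
sum((healthyVector[KMC2$cluster == 1] - KMC2$centers[1])^2)   # should match KMC2$withinss[1]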

#Now suppose we want to determine the best number of clusters for this dataset. We would first repeat the kmeans
function call above with centers = 3, centers = 4, etc. to create KMC3, KMC4, and so on. Then, we could generate
the following plot:

KMC3 = kmeans(healthyVector, centers = 3, iter.max = 1000)


KMC4 = kmeans(healthyVector, centers = 4, iter.max = 1000)
KMC5 = kmeans(healthyVector, centers = 5, iter.max = 1000)
KMC6 = kmeans(healthyVector, centers = 6, iter.max = 1000)
KMC7 = kmeans(healthyVector, centers = 7, iter.max = 1000)
KMC8 = kmeans(healthyVector, centers = 8, iter.max = 1000)
KMC9 = kmeans(healthyVector, centers = 9, iter.max = 1000)
KMC10 = kmeans(healthyVector, centers =10, iter.max = 1000)

NumClusters = seq(2,10,1)

SumWithinss = c(sum(KMC2$withinss), sum(KMC3$withinss), sum(KMC4$withinss), sum(KMC5$withinss),


sum(KMC6$withinss), sum(KMC7$withinss), sum(KMC8$withinss), sum(KMC9$withinss), sum(KMC10$withinss))

#SCREE plot
plot(NumClusters, SumWithinss, type="b")


#The plot looks like this (the type="b" argument just told the plot command to give us points and lines):

#To determine the best number of clusters using this plot, we want to look for a bend, or elbow, in the plot. This means that we want to find the number of clusters for which increasing the number of clusters further does not significantly help to reduce the within-cluster sum of squares. For this particular dataset, it looks like 4 or 5 clusters is a good choice. Beyond 5, increasing the number of clusters does not really reduce the within-cluster sum of squares too much.

#Note: You may have noticed it took a lot of typing to generate SumWithinss; this is because we limited ourselves to R functions we've learned so far in the course. In fact, R has powerful functions for repeating tasks with a different input (in this case running kmeans with different cluster sizes). For instance, we could generate SumWithinss with:

#All the above steps in one go using sapply()


SumWithinss = sapply(2:10, function(x) sum(kmeans(healthyVector, centers=x, iter.max=1000)$withinss))
#SCREE plot
plot(NumClusters, SumWithinss, type="b")

VIDEO 6: DETECTING TUMORS (R script reproduced here)


First Taste of a Fascinating Field

MRI image segmentation is the subject of ongoing research


k-means is a good starting point, but not enough
Advanced clustering techniques such as the modified fuzzy k-means (MFCM) clustering technique
Packages in R specialized for medical image analysis Medical Imaging (http://cran.r-project.org/web/views/MedicalImaging.html)

#Video 6

#Apply to a test image

tumor = read.csv("tumor.csv", header=FALSE)


tumorMatrix = as.matrix(tumor)
tumorVector = as.vector(tumorMatrix)

#Apply clusters from before to new image, using the flexclust package
#install.packages("flexclust")
library(flexclust)

## Loading required package: grid

## Loading required package: modeltools

## Loading required package: stats4


#converting KMC into kcca object


KMC.kcca = as.kcca(KMC, healthyVector) #Healthy vector as a training set
tumorClusters = predict(KMC.kcca, newdata = tumorVector) #tumorvector as a testing set

#Visualize the clusters

#converting tumorClusters into a matrix


dim(tumorClusters) = c(nrow(tumorMatrix), ncol(tumorMatrix))

#visualising the tumorClusters


image(tumorClusters, axes = FALSE, col=rainbow(k))

Segmented MRI Images

mri1

Left side image is of a healthy brain

T2 Weighted MRI Images


mri2

VIDEO 7: COMPARING METHODS USED IN THE CLASS


Comparison of Methods

method1

method2


method3

Assignment 6
DOCUMENT CLUSTERING WITH DAILY KOS
Document clustering, or text clustering, is a very popular application of clustering algorithms. A web search engine, like Google, often returns
thousands of results for a simple query. For example, if you type the search term jaguar into Google, around 200 million results are returned.
This makes it very difficult to browse or find relevant information, especially if the search term has multiple meanings. If we search for jaguar, we
might be looking for information about the animal, the car, or the Jacksonville Jaguars football team.

Clustering methods can be used to automatically group search results into categories, making it easier to find relevant results. This method is used in the search engines PolyMeta and Helioid, as well as on FirstGov.gov, the official Web portal for the U.S. government. The two most
common algorithms used for document clustering are Hierarchical and k-means.

In this problem, we'll be clustering articles published on Daily Kos, an American political blog that publishes news and opinion articles written from a progressive point of view. Daily Kos was founded by Markos Moulitsas in 2002, and as of September 2014, the site had an average weekday traffic of hundreds of thousands of visits.

The file dailykos.csv contains data on 3,430 news articles or blogs that have been posted on Daily Kos. These articles were posted in 2004, leading up to the United States Presidential Election. The leading candidates were incumbent President George W. Bush (Republican) and John Kerry (Democratic). Foreign policy was a dominant topic of the election, specifically the 2003 invasion of Iraq.

Each of the variables in the dataset is a word that has appeared in at least 50 different articles (1,545 words in total). The set of words has been
trimmed according to some of the techniques covered in the previous week on text analytics (punctuation has been removed, and stop words
have been removed). For each document, the variable values are the number of times that word appeared in the document.

#PROBLEM 1.1 - HIERARCHICAL CLUSTERING

#Let's start by building a hierarchical clustering model. First, read the data set into R.
dailykos<- read.csv("dailykos.csv")
dim(dailykos)

## [1] 3430 1545

#compute the distances (using method="euclidean")


kosDist<- dist(dailykos,method="euclidean")

#use hclust to build the model (using method="ward.D"),clustering on all of the variables
kosHierClust<- hclust(kosDist, method="ward.D")

#Running the dist function will probably take you a while. Why? Select all that apply.
#Ans:We have a lot of observations, so it takes a long time to compute the distance between each pair of observations & We have a lot of variables, so the distance computation is long.
#Explanation:The distance computation can take a long time if you have a lot of observations and/or if there are a lot of variables. As we saw in recitation, it might not even work if you have too many of either!
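#Rough scale of the computation (my own check): dist() must evaluate one distance per pair of articles, each over all 1,545 word-count variables
nrow(dailykos)*(nrow(dailykos)-1)/2   # about 5.88 million pairs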

###########################################

#PROBLEM 1.2 - HIERARCHICAL CLUSTERING

#Plot the dendrogram of your hierarchical clustering model. Just looking at the dendrogram, which of the following seem like good choices for the number of clusters? Select all that apply.

#plot the dendrogram


plot(kosHierClust)#where "kosHierClust" is the name of the clustering model


#Ans:2 & 3
#Explanation:The choices 2 and 3 are good cluster choices according to the dendrogram, because there is a lot of space between the horizontal lines in the dendrogram in those cut off spots (draw a horizontal line across the dendrogram where it crosses 2 or 3 vertical lines). The choices of 5 and 6 do not seem good according to the dendrogram because there is very little space.

#######################################

#PROBLEM 1.3 - HIERARCHICAL CLUSTERING

#In this problem, we are trying to cluster news articles or blog posts into groups. This can be used to show readers categories to choose from when trying to decide what to read. Just thinking about this application, what are good choices for the number of clusters? Select all that apply.
#Ans:7 & 8
#Explanation:Thinking about the application, it is probably better to show the reader more categories than 2 or 3. These categories would probably be too broad to be useful. Seven or eight categories seems more reasonable.

######################################

#PROBLEM 1.4 - HIERARCHICAL CLUSTERING

#Let's pick 7 clusters. This number is reasonable according to the dendrogram, and also seems reasonable for the
application. Use the cutree function to split your data into 7 clusters.

#You can split your data into clusters by first using the cutree function to compute the cluster numbers
#cluster assignment of hierarchical clustering
hierGroups<-cutree(kosHierClust, k=7)

#Then, you can use the subset function 7 times to split the data into the 7 clusters(we don't really want to run
tapply on every single variable when we have over 1,000 different variables):
HierCluster1 = subset(dailykos, hierGroups == 1)
nrow(HierCluster1)

## [1] 1266

HierCluster2 = subset(dailykos, hierGroups == 2)


nrow(HierCluster2)

## [1] 321

HierCluster3 = subset(dailykos, hierGroups == 3)


nrow(HierCluster3)

## [1] 374

HierCluster4 = subset(dailykos, hierGroups == 4)


nrow(HierCluster4)


## [1] 139

HierCluster5 = subset(dailykos, hierGroups == 5)


nrow(HierCluster5)

## [1] 407

HierCluster6 = subset(dailykos, hierGroups == 6)


nrow(HierCluster6)

## [1] 714

HierCluster7 = subset(dailykos, hierGroups == 7)


nrow(HierCluster7)

## [1] 209

#or
table(hierGroups)

## hierGroups
## 1 2 3 4 5 6 7
## 1266 321 374 139 407 714 209

#More Advanced Approach:

#There is a very useful function in R called the "split" function. Given a vector assigning groups like hierGroups, you could split dailykos into the clusters by typing:
HierCluster = split(dailykos, hierGroups)

#Then cluster 1 can be accessed by typing HierCluster[[1]], cluster 2 can be accessed by typing HierCluster[[2]], etc. If you have a variable in your current R session called "split", you will need to remove it with rm(split) before using the split function.
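#Equivalent one-liner (my own addition) to read all seven cluster sizes off the split list; it reproduces the table(hierGroups) counts above
sapply(HierCluster, nrow)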

#How many observations are in cluster 3?


#Ans:374

#Which cluster has the most observations?


#Ans:1

#Which cluster has the fewest observations?


#Ans:4

#########################################

#PROBLEM 1.5 - HIERARCHICAL CLUSTERING

#Instead of looking at the average value in each variable individually, we'll just look at the top 6 words in each cluster. To do this for cluster 1, type the following in your R console (where "HierCluster1" should be replaced with the name of your first cluster subset):
tail(sort(colMeans(HierCluster[[1]])))

## state republican poll democrat kerry bush


## 0.7575039 0.7590837 0.9036335 0.9194313 1.0624013 1.7053712


#This computes the mean frequency values of each of the words in cluster 1, and then outputs the 6 words that occur the most frequently. The colMeans function computes the column (word) means, the sort function orders the words in increasing order of the mean values, and the tail function outputs the last 6 words listed, which are the ones with the largest column means.

#What is the most frequent word in this cluster, in terms of average value? Enter the word exactly how you see it
in the output:
#Ans:bush
#Explanation:After running the R command given above, we can see that the most frequent word on average is "bush". This corresponds to President George W. Bush.

##################################################

#PROBLEM 1.6 - HIERARCHICAL CLUSTERING

#Now repeat the command given in the previous problem for each of the other clusters, and answer the following questions.

#Get top 6 words of all clusters

#using lapply() rather than doing clusterwise


topWords <- lapply(HierCluster, function(c) tail(sort(colMeans(c))))
topWords

## $`1`
## state republican poll democrat kerry bush
## 0.7575039 0.7590837 0.9036335 0.9194313 1.0624013 1.7053712
##
## $`2`
## bush democrat challenge vote poll november
## 2.847352 2.850467 4.096573 4.398754 4.847352 10.339564
##
## $`3`
## elect parties state republican democrat bush
## 1.647059 1.665775 2.320856 2.524064 3.823529 4.406417
##
## $`4`
## campaign voter presided poll bush kerry
## 1.431655 1.539568 1.625899 3.589928 7.834532 8.438849
##
## $`5`
## american presided administration war iraq
## 1.090909 1.120393 1.230958 1.776413 2.427518
## bush
## 3.941032
##
## $`6`
## race bush kerry elect democrat poll
## 0.4579832 0.4887955 0.5168067 0.5350140 0.5644258 0.5812325
##
## $`7`
## democrat clark edward poll kerry dean
## 2.148325 2.497608 2.607656 2.765550 3.952153 5.803828

#or doing each cluster wise


tail(sort(colMeans(HierCluster2)))

## bush democrat challenge vote poll november


## 2.847352 2.850467 4.096573 4.398754 4.847352 10.339564

tail(sort(colMeans(HierCluster3)))

## elect parties state republican democrat bush


## 1.647059 1.665775 2.320856 2.524064 3.823529 4.406417

tail(sort(colMeans(HierCluster4)))

## campaign voter presided poll bush kerry


## 1.431655 1.539568 1.625899 3.589928 7.834532 8.438849

tail(sort(colMeans(HierCluster5)))


## american presided administration war iraq


## 1.090909 1.120393 1.230958 1.776413 2.427518
## bush
## 3.941032

tail(sort(colMeans(HierCluster6)))

## race bush kerry elect democrat poll


## 0.4579832 0.4887955 0.5168067 0.5350140 0.5644258 0.5812325

tail(sort(colMeans(HierCluster7)))

## democrat clark edward poll kerry dean


## 2.148325 2.497608 2.607656 2.765550 3.952153 5.803828

#Which words best describe cluster 2?


#Ans:november, poll, vote, challenge

#Which cluster could best be described as the cluster related to the Iraq war?
#Ans:Cluster 5

#In 2004, one of the candidates for the Democratic nomination for the President of the United States was Howard Dean, John Kerry was the candidate who won the Democratic nomination, and John Edwards was the running mate of John Kerry (the Vice President nominee). Given this information, which cluster best corresponds to the Democratic party?
#Ans:Cluster 7
#Explanation:You can see that the words that best describe Cluster 2 are november, poll, vote, and challenge. The most common words in Cluster 5 are bush, iraq, war, and administration, so it is the cluster that can best be described as corresponding to the Iraq war. And the most common words in Cluster 7 are dean, kerry, poll, and edward, so it looks like the Democratic cluster.

##############################################

#PROBLEM 2.1 - K-MEANS CLUSTERING

#Now, run k-means clustering, setting the seed to 1000 right before you run the kmeans function. Again, pick the number of clusters equal to 7. You don't need to add the iter.max argument.

set.seed(1000)

#cluster assignment of k-means clustering


KmeansCluster<-kmeans(dailykos, centers=7)
str(KmeansCluster)

## List of 9
## $ cluster : int [1:3430] 4 4 6 4 1 4 7 4 4 4 ...
## $ centers : num [1:7, 1:1545] 0.0342 0.0556 0.0253 0.0136 0.0491 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:7] "1" "2" "3" "4" ...
## .. ..$ : chr [1:1545] "abandon" "abc" "ability" "abortion" ...
## $ totss : num 896461
## $ withinss : num [1:7] 76583 52693 99504 258927 88632 ...
## $ tot.withinss: num 730632
## $ betweenss : num 165829
## $ size : int [1:7] 146 144 277 2063 163 329 308
## $ iter : int 7
## $ ifault : int 0
## - attr(*, "class")= chr "kmeans"

#Then, you can subset your data into the 7 clusters by using the following commands:
KmeansCluster1 = subset(dailykos, KmeansCluster$cluster == 1)
nrow(KmeansCluster1)

## [1] 146

KmeansCluster2 = subset(dailykos, KmeansCluster$cluster == 2)


nrow(KmeansCluster2)

## [1] 144

KmeansCluster3 = subset(dailykos, KmeansCluster$cluster == 3)


nrow(KmeansCluster3)


## [1] 277

KmeansCluster4 = subset(dailykos, KmeansCluster$cluster == 4)


nrow(KmeansCluster4)

## [1] 2063

KmeansCluster5 = subset(dailykos, KmeansCluster$cluster == 5)


nrow(KmeansCluster5)

## [1] 163

KmeansCluster6 = subset(dailykos, KmeansCluster$cluster == 6)


nrow(KmeansCluster6)

## [1] 329

KmeansCluster7 = subset(dailykos, KmeansCluster$cluster == 7)


nrow(KmeansCluster7)

## [1] 308

#Alternatively, you could answer these questions by looking at


table(KmeansCluster$cluster)

##
## 1 2 3 4 5 6 7
## 146 144 277 2063 163 329 308

#More Advanced Approach:

#rm(split)
KmeansClusterspl<-split(dailykos,KmeansCluster$cluster)

#How many observations are in Cluster 3?


nrow(KmeansClusterspl[[3]])

## [1] 277

#Ans:277

#using sapply() to get the no of obs in all 7 clusters


sapply(KmeansClusterspl, nrow)

## 1 2 3 4 5 6 7
## 146 144 277 2063 163 329 308

#Which cluster has the most observations?


#Ans:Cluster 4

#Which cluster has the fewest number of observations?


#Ans:Cluster 2

###################################################

#PROBLEM 2.2 - K-MEANS CLUSTERING

#Now, output the six most frequent words in each cluster, like we did in the previous problem, for each of the k-
means clusters.

#most frequent words from each kmeans cluster:


kmTopWords <- lapply(KmeansClusterspl, function(c) tail(sort(colMeans(c))))
kmTopWords


## $`1`
## state iraq kerry administration presided
## 1.609589 1.616438 1.636986 2.664384 2.767123
## bush
## 11.431507
##
## $`2`
## primaries democrat edward clark kerry dean
## 2.319444 2.694444 2.798611 3.090278 4.979167 8.277778
##
## $`3`
## administration iraqi american bush war
## 1.389892 1.610108 1.685921 2.610108 3.025271
## iraq
## 4.093863
##
## $`4`
## elect republican kerry poll democrat bush
## 0.6010664 0.6175473 0.6495395 0.7474552 0.7891420 1.1473582
##
## $`5`
## race senate state parties republican democrat
## 2.484663 2.650307 3.521472 3.619632 4.638037 6.993865
##
## $`6`
## democrat bush challenge vote poll november
## 2.899696 2.960486 4.121581 4.446809 4.872340 10.370821
##
## $`7`
## presided voter campaign poll bush kerry
## 1.324675 1.334416 1.383117 2.788961 5.970779 6.480519

#or

#we can output the most frequent words in each of the k-means clusters by using the following commands:
tail(sort(colMeans(KmeansCluster1)))

## state iraq kerry administration presided


## 1.609589 1.616438 1.636986 2.664384 2.767123
## bush
## 11.431507

tail(sort(colMeans(KmeansCluster2)))

## primaries democrat edward clark kerry dean


## 2.319444 2.694444 2.798611 3.090278 4.979167 8.277778

tail(sort(colMeans(KmeansCluster3)))

## administration iraqi american bush war


## 1.389892 1.610108 1.685921 2.610108 3.025271
## iraq
## 4.093863

tail(sort(colMeans(KmeansCluster4)))

## elect republican kerry poll democrat bush


## 0.6010664 0.6175473 0.6495395 0.7474552 0.7891420 1.1473582

tail(sort(colMeans(KmeansCluster5)))

## race senate state parties republican democrat


## 2.484663 2.650307 3.521472 3.619632 4.638037 6.993865

tail(sort(colMeans(KmeansCluster6)))

## democrat bush challenge vote poll november


## 2.899696 2.960486 4.121581 4.446809 4.872340 10.370821

tail(sort(colMeans(KmeansCluster7)))


## presided voter campaign poll bush kerry


## 1.324675 1.334416 1.383117 2.788961 5.970779 6.480519

#Which k-means cluster best corresponds to the Iraq War?


#Ans:Cluster 3

#Which k-means cluster best corresponds to the Democratic party? (Remember that we are looking for the names of the key Democratic party leaders.)
#Ans:Cluster 2
#Explanation:By looking at the output, you can see that the cluster best corresponding to the Iraq War is cluster 3 (top words are iraq, war, and bush) and the cluster best corresponding to the Democratic party is cluster 2 (top words dean, kerry, clark, and edward).

##################################################

#PROBLEM 2.3 - K-MEANS CLUSTERING

#For the rest of this problem, we'll ask you to compare how observations were assigned to clusters in the two different methods. Use the table function to compare the cluster assignment of hierarchical clustering to the cluster assignment of k-means clustering.

#topWords
#kmTopWords
table(hierGroups,KmeansCluster$cluster)

##
## hierGroups 1 2 3 4 5 6 7
## 1 3 11 64 1045 32 0 111
## 2 0 0 0 0 0 320 1
## 3 85 10 42 79 126 8 24
## 4 10 5 0 0 1 0 123
## 5 48 0 171 145 3 1 39
## 6 0 2 0 712 0 0 0
## 7 0 116 0 82 1 0 10

#Which Hierarchical Cluster best corresponds to K-Means Cluster 2?


#Ans:Hierarchical Cluster 7
#Explanation:From "table(hierGroups, KmeansCluster$cluster)", we read that 116 (80.6%) of the observations in K-M
eans Cluster 2 also fall in Hierarchical Cluster 7.
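#One way to read such shares off directly (my own addition): column-wise proportions of the cross-tabulation, where each column is a k-means cluster
round(prop.table(table(hierGroups, KmeansCluster$cluster), margin=2), 3)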

####################################################

#PROBLEM 2.4 - K-MEANS CLUSTERING

#Which Hierarchical Cluster best corresponds to K-Means Cluster 3?


table(hierGroups, KmeansCluster$cluster)

##
## hierGroups 1 2 3 4 5 6 7
## 1 3 11 64 1045 32 0 111
## 2 0 0 0 0 0 320 1
## 3 85 10 42 79 126 8 24
## 4 10 5 0 0 1 0 123
## 5 48 0 171 145 3 1 39
## 6 0 2 0 712 0 0 0
## 7 0 116 0 82 1 0 10

#Ans:Hierarchical Cluster 5
#Explanation:From "table(hierGroups, KmeansCluster$cluster)", we read that 171 (61.7%) of the observations in K-M
eans Cluster 3 also fall in Hierarchical Cluster 5.

###########################################

#PROBLEM 2.5 - K-MEANS CLUSTERING

#Which Hierarchical Cluster best corresponds to K-Means Cluster 7?


table(hierGroups, KmeansCluster$cluster)


##
## hierGroups 1 2 3 4 5 6 7
## 1 3 11 64 1045 32 0 111
## 2 0 0 0 0 0 320 1
## 3 85 10 42 79 126 8 24
## 4 10 5 0 0 1 0 123
## 5 48 0 171 145 3 1 39
## 6 0 2 0 712 0 0 0
## 7 0 116 0 82 1 0 10

#Ans:No Hierarchical Cluster contains at least half of the points in K-Means Cluster 7.
#Explanation:From "table(hierGroups, KmeansCluster$cluster)", we read that no more than 123 (39.9%) of the observ
ations in K-Means Cluster 7 fall in any hierarchical cluster.

##########################################

#PROBLEM 2.6 - K-MEANS CLUSTERING

#Which Hierarchical Cluster best corresponds to K-Means Cluster 6?


table(hierGroups, KmeansCluster$cluster)

##
## hierGroups 1 2 3 4 5 6 7
## 1 3 11 64 1045 32 0 111
## 2 0 0 0 0 0 320 1
## 3 85 10 42 79 126 8 24
## 4 10 5 0 0 1 0 123
## 5 48 0 171 145 3 1 39
## 6 0 2 0 712 0 0 0
## 7 0 116 0 82 1 0 10

#Ans:Hierarchical Cluster 2
#Explanation:From "table(hierGroups, KmeansCluster$cluster)", we read that 320 (97.3%) of observations in K-Means
Cluster 6 fall in Hierarchical Cluster 2.

MARKET SEGMENTATION FOR AIRLINES


Market segmentation is a strategy that divides a broad target market of customers into smaller, more similar groups, and then designs a
marketing strategy specifically for each group. Clustering is a common technique for market segmentation since it automatically finds similar
groups given a data set.

In this problem, we'll see how clustering can be used to find similar groups of customers who belong to an airline's frequent flyer program. The airline is trying to learn more about its customers so that it can target different customer segments with different types of mileage offers.

The file AirlinesCluster.csv contains information on 3,999 members of the frequent flyer program. This data comes from the textbook Data Mining
for Business Intelligence, by Galit Shmueli, Nitin R. Patel, and Peter C. Bruce. For more information, see the website for the book.

There are seven different variables in the dataset, described below:

Balance = number of miles eligible for award travel


QualMiles = number of miles qualifying for TopFlight status
BonusMiles = number of miles earned from non-flight bonus transactions in the past 12 months
BonusTrans = number of non-flight bonus transactions in the past 12 months
FlightMiles = number of flight miles in the past 12 months
FlightTrans = number of flight transactions in the past 12 months
DaysSinceEnroll = number of days since enrolled in the frequent flyer program

#Read in the data


airlines<- read.csv("AirlinesCluster.csv")
summary(airlines)

## Balance QualMiles BonusMiles BonusTrans


## Min. : 0 Min. : 0.0 Min. : 0 Min. : 0.0
## 1st Qu.: 18528 1st Qu.: 0.0 1st Qu.: 1250 1st Qu.: 3.0
## Median : 43097 Median : 0.0 Median : 7171 Median :12.0
## Mean : 73601 Mean : 144.1 Mean : 17145 Mean :11.6
## 3rd Qu.: 92404 3rd Qu.: 0.0 3rd Qu.: 23801 3rd Qu.:17.0
## Max. :1704838 Max. :11148.0 Max. :263685 Max. :86.0
## FlightMiles FlightTrans DaysSinceEnroll
## Min. : 0.0 Min. : 0.000 Min. : 2
## 1st Qu.: 0.0 1st Qu.: 0.000 1st Qu.:2330
## Median : 0.0 Median : 0.000 Median :4096
## Mean : 460.1 Mean : 1.374 Mean :4119
## 3rd Qu.: 311.0 3rd Qu.: 1.000 3rd Qu.:5790
## Max. :30817.0 Max. :53.000 Max. :8296


#or better way


colMeans(airlines)

## Balance QualMiles BonusMiles BonusTrans


## 73601.327582 144.114529 17144.846212 11.601900
## FlightMiles FlightTrans DaysSinceEnroll
## 460.055764 1.373593 4118.559390

#PROBLEM 1.1 - NORMALIZING THE DATA

#Looking at the summary of airlines, which TWO variables have (on average) the smallest values?
#Ans:BonusTrans & FlightTrans

#Which TWO variables have (on average) the largest values?


#Ans:Balance & BonusMiles

#EXPLANATION: For the smallest values, BonusTrans and FlightTrans are on the scale of tens, whereas all other variables have values in the thousands. For the largest values, Balance and BonusMiles have average values in the tens of thousands.

#################################

#PROBLEM 1.2 - NORMALIZING THE DATA

#In this problem, we will normalize our data before we run the clustering algorithms. Why is it important to normalize the data before clustering?
#Ans:If we don't normalize the data, the clustering will be dominated by the variables that are on a larger scale.
#EXPLANATION:If we don't normalize the data, the variables that are on a larger scale will contribute much more to the distance calculation, and thus will dominate the clustering.

###################################

#PROBLEM 1.3 - NORMALIZING THE DATA

#Let's go ahead and normalize our data. You can normalize the variables in a data frame by using the preProcess function in the "caret" package.

library(caret)

##
## Attaching package: 'caret'

## The following object is masked from 'package:mosaic':


##
## dotPlot

#Now, create a normalized data frame called "airlinesNorm" by running the following commands:

preproc = preProcess(airlines)

airlinesNorm = predict(preproc, airlines) #Normalising the data


##The first command pre-processes the data, and the second command performs the normalization.

summary(airlinesNorm)

## Balance QualMiles BonusMiles BonusTrans


## Min. :-0.7303 Min. :-0.1863 Min. :-0.7099 Min. :-1.20805
## 1st Qu.:-0.5465 1st Qu.:-0.1863 1st Qu.:-0.6581 1st Qu.:-0.89568
## Median :-0.3027 Median :-0.1863 Median :-0.4130 Median : 0.04145
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.00000
## 3rd Qu.: 0.1866 3rd Qu.:-0.1863 3rd Qu.: 0.2756 3rd Qu.: 0.56208
## Max. :16.1868 Max. :14.2231 Max. :10.2083 Max. : 7.74673
## FlightMiles FlightTrans DaysSinceEnroll
## Min. :-0.3286 Min. :-0.36212 Min. :-1.99336
## 1st Qu.:-0.3286 1st Qu.:-0.36212 1st Qu.:-0.86607
## Median :-0.3286 Median :-0.36212 Median :-0.01092
## Mean : 0.0000 Mean : 0.00000 Mean : 0.00000
## 3rd Qu.:-0.1065 3rd Qu.:-0.09849 3rd Qu.: 0.80960
## Max. :21.6803 Max. :13.61035 Max. : 2.02284

#Mean of each variable


round(colMeans(airlinesNorm),6) #or


## Balance QualMiles BonusMiles BonusTrans


## 0 0 0 0
## FlightMiles FlightTrans DaysSinceEnroll
## 0 0 0

round(sapply(airlinesNorm,mean),6)

## Balance QualMiles BonusMiles BonusTrans


## 0 0 0 0
## FlightMiles FlightTrans DaysSinceEnroll
## 0 0 0

#std.deviation of each var


sapply(airlinesNorm,sd)

## Balance QualMiles BonusMiles BonusTrans


## 1 1 1 1
## FlightMiles FlightTrans DaysSinceEnroll
## 1 1 1

#If you look at the summary of airlinesNorm, you should see that all of the variables now have mean zero. You can
also see that each of the variables has standard deviation 1 by using the sd() function.
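#To see what normalization buys us (my own check, not part of the assignment): pairwise distances on the raw data are dominated by the large-scale variables (Balance, BonusMiles), while on the normalized data every variable contributes on a comparable scale
dist(airlines[1:2,])
dist(airlinesNorm[1:2,])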

#In the normalized data, which variable has the largest maximum value?
apply(airlinesNorm, 2, max)

## Balance QualMiles BonusMiles BonusTrans


## 16.186811 14.223084 10.208293 7.746727
## FlightMiles FlightTrans DaysSinceEnroll
## 21.680292 13.610351 2.022842

#Ans:FlightMiles

#In the normalized data, which variable has the smallest minimum value?
apply(airlinesNorm, 2, min)

## Balance QualMiles BonusMiles BonusTrans


## -0.7303482 -0.1862754 -0.7099031 -1.2080518
## FlightMiles FlightTrans DaysSinceEnroll
## -0.3285622 -0.3621226 -1.9933614

#Ans:DaysSinceEnroll
#EXPLANATION:You can see from the output that FlightMiles now has the largest maximum value, and DaysSinceEnroll
now has the smallest minimum value. Note that these were not the variables with the largest and smallest values
in the original dataset airlines.

#####################################

#PROBLEM 2.1 - HIERARCHICAL CLUSTERING

#Compute the distances between data points (using euclidean distance) and then run the Hierarchical clustering algorithm (using method="ward.D") on the normalized data. It may take a few minutes for the commands to finish since the dataset has a large number of observations for hierarchical clustering.

distances <- dist(airlinesNorm,method="euclidean")


airlineClust <- hclust(distances,method="ward.D")

#Then, plot the dendrogram of the hierarchical clustering process:


plot(airlineClust)


#Suppose the airline is looking for somewhere between 2 and 10 clusters. According to the dendrogram, which of the following is NOT a good choice for the number of clusters?
#Ans:6
#EXPLANATION:If you run a horizontal line down the dendrogram, you can see that there is a long stretch over which the line crosses 2 clusters, 3 clusters, or 7 clusters. However, it is hard to see the horizontal line cross 6 clusters. This means that 6 clusters is probably not a good choice.

##########################################

#PROBLEM 2.2 - HIERARCHICAL CLUSTERING

#Suppose that after looking at the dendrogram and discussing with the marketing department, the airline decides to proceed with 5 clusters. Divide the data points into 5 clusters by using the cutree function.

#How many data points are in Cluster 1?

#Divide into 5 clusters


clusterGroups<- cutree(airlineClust, k=5)
airlineClusters <-split(airlinesNorm, clusterGroups)
nrow(airlineClusters[[1]])

## [1] 776

#or
table(clusterGroups)

## clusterGroups
## 1 2 3 4 5
## 776 519 494 868 1342

#Ans:776
#EXPLANATION:you can see that there are 776 data points in the first cluster.

#####################################

#PROBLEM 2.3 - HIERARCHICAL CLUSTERING

#Now, use tapply to compare the average values in each of the variables for the 5 clusters (the centroids of the clusters). You may want to compute the average values of the unnormalized data so that it is easier to interpret. You can do this for the variable "Balance" with the following command:

#The centroids of the clusters (Un-normalised data)


#average values in 'Balance'variables for the 5 cluster

tapply(airlines$Balance, clusterGroups, mean)

## 1 2 3 4 5
## 57866.90 110669.27 198191.57 52335.91 36255.91


#then similarly compute for all other variables


tapply(airlines$QualMiles, clusterGroups, mean)

## 1 2 3 4 5
## 0.6443299 1065.9826590 30.3461538 4.8479263 2.5111773

tapply(airlines$BonusMiles, clusterGroups, mean)

## 1 2 3 4 5
## 10360.124 22881.763 55795.860 20788.766 2264.788

tapply(airlines$BonusTrans, clusterGroups, mean)

## 1 2 3 4 5
## 10.823454 18.229287 19.663968 17.087558 2.973174

tapply(airlines$FlightMiles, clusterGroups, mean)

## 1 2 3 4 5
## 83.18428 2613.41811 327.67611 111.57373 119.32191

tapply(airlines$FlightTrans, clusterGroups, mean)

## 1 2 3 4 5
## 0.3028351 7.4026975 1.0688259 0.3444700 0.4388972

tapply(airlines$DaysSinceEnroll, clusterGroups, mean)

## 1 2 3 4 5
## 6235.365 4402.414 5615.709 2840.823 3060.081

#or

#Advanced Explanation:

#Instead of using tapply, you could have alternatively used colMeans and subset, as follows:
colMeans(subset(airlines, clusterGroups == 1))

## Balance QualMiles BonusMiles BonusTrans


## 5.786690e+04 6.443299e-01 1.036012e+04 1.082345e+01
## FlightMiles FlightTrans DaysSinceEnroll
## 8.318428e+01 3.028351e-01 6.235365e+03

colMeans(subset(airlines, clusterGroups == 2))

## Balance QualMiles BonusMiles BonusTrans


## 1.106693e+05 1.065983e+03 2.288176e+04 1.822929e+01
## FlightMiles FlightTrans DaysSinceEnroll
## 2.613418e+03 7.402697e+00 4.402414e+03

colMeans(subset(airlines, clusterGroups == 3))

## Balance QualMiles BonusMiles BonusTrans


## 1.981916e+05 3.034615e+01 5.579586e+04 1.966397e+01
## FlightMiles FlightTrans DaysSinceEnroll
## 3.276761e+02 1.068826e+00 5.615709e+03

colMeans(subset(airlines, clusterGroups == 4))

## Balance QualMiles BonusMiles BonusTrans


## 52335.913594 4.847926 20788.766129 17.087558
## FlightMiles FlightTrans DaysSinceEnroll
## 111.573733 0.344470 2840.822581

colMeans(subset(airlines, clusterGroups == 5))


## Balance QualMiles BonusMiles BonusTrans


## 3.625591e+04 2.511177e+00 2.264788e+03 2.973174e+00
## FlightMiles FlightTrans DaysSinceEnroll
## 1.193219e+02 4.388972e-01 3.060081e+03

#But an even more compact way of finding the centroids would be to use the function "split" to first split the data into clusters, and then to use the function "lapply" to apply the function "colMeans" to each of the clusters:
lapply(split(airlines, clusterGroups), colMeans)

## $`1`
## Balance QualMiles BonusMiles BonusTrans
## 5.786690e+04 6.443299e-01 1.036012e+04 1.082345e+01
## FlightMiles FlightTrans DaysSinceEnroll
## 8.318428e+01 3.028351e-01 6.235365e+03
##
## $`2`
## Balance QualMiles BonusMiles BonusTrans
## 1.106693e+05 1.065983e+03 2.288176e+04 1.822929e+01
## FlightMiles FlightTrans DaysSinceEnroll
## 2.613418e+03 7.402697e+00 4.402414e+03
##
## $`3`
## Balance QualMiles BonusMiles BonusTrans
## 1.981916e+05 3.034615e+01 5.579586e+04 1.966397e+01
## FlightMiles FlightTrans DaysSinceEnroll
## 3.276761e+02 1.068826e+00 5.615709e+03
##
## $`4`
## Balance QualMiles BonusMiles BonusTrans
## 52335.913594 4.847926 20788.766129 17.087558
## FlightMiles FlightTrans DaysSinceEnroll
## 111.573733 0.344470 2840.822581
##
## $`5`
## Balance QualMiles BonusMiles BonusTrans
## 3.625591e+04 2.511177e+00 2.264788e+03 2.973174e+00
## FlightMiles FlightTrans DaysSinceEnroll
## 1.193219e+02 4.388972e-01 3.060081e+03

#or better & faster ways is using sapply()


airlinesUnnormClusters<-split(airlines,clusterGroups)
round(sapply(airlinesUnnormClusters,colMeans),4)

## 1 2 3 4 5
## Balance 57866.9046 110669.2659 198191.5749 52335.9136 36255.9098
## QualMiles 0.6443 1065.9827 30.3462 4.8479 2.5112
## BonusMiles 10360.1237 22881.7630 55795.8603 20788.7661 2264.7876
## BonusTrans 10.8235 18.2293 19.6640 17.0876 2.9732
## FlightMiles 83.1843 2613.4181 327.6761 111.5737 119.3219
## FlightTrans 0.3028 7.4027 1.0688 0.3445 0.4389
## DaysSinceEnroll 6235.3647 4402.4143 5615.7085 2840.8226 3060.0812


#Compared to the other clusters, Cluster 1 has the largest average values in which variables (if any)? Select all
that apply.
#Ans:DaysSinceEnroll
#EXPLANATION:The only variable for which Cluster 1 has large values is DaysSinceEnroll.

#How would you describe the customers in Cluster 1?


#Ans: Infrequent but loyal customers.
#EXPLANATION:Cluster 1 mostly contains customers with few miles, but who have been with the airline the longest.

####################################

#PROBLEM 2.4 - HIERARCHICAL CLUSTERING

#Compared to the other clusters, Cluster 2 has the largest average values in which variables (if any)? Select all
that apply.
#Ans:QualMiles,FlightMiles,FlightTrans
#EXPLANATION:Cluster 2 has the largest average values in the variables QualMiles, FlightMiles and FlightTrans. This cluster also has relatively large values in BonusTrans and Balance.

#How would you describe the customers in Cluster 2?


#Ans: Customers who have accumulated a large amount of miles, and the ones with the largest number of flight transactions.
#EXPLANATION:Cluster 2 contains customers with a large amount of miles, mostly accumulated through flight transactions.

##################################

#PROBLEM 2.5 - HIERARCHICAL CLUSTERING

#Compared to the other clusters, Cluster 3 has the largest average values in which variables (if any)? Select all
that apply.
#Ans:Balance,BonusMiles,BonusTrans
#EXPLANATION:Cluster 3 has the largest values in Balance, BonusMiles, and BonusTrans. While it also has relatively large values in other variables, these are the three for which it has the largest values.

#How would you describe the customers in Cluster 3?


#Ans:Customers who have accumulated a large amount of miles, mostly through non-flight transactions.
#EXPLANATION:Cluster 3 mostly contains customers with a lot of miles, and who have earned the miles mostly through bonus transactions.

########################################

#PROBLEM 2.6 - HIERARCHICAL CLUSTERING

#Compared to the other clusters, Cluster 4 has the largest average values in which variables (if any)? Select all
that apply.
#Ans:None
#EXPLANATION:Cluster 4 does not have the largest values in any of the variables.

#How would you describe the customers in Cluster 4?


#Ans:Relatively new customers who seem to be accumulating miles, mostly through non-flight transactions.
#EXPLANATION:Cluster 4 customers have the smallest value in DaysSinceEnroll, but they are already accumulating a
reasonable number of miles.

################################

#PROBLEM 2.7 - HIERARCHICAL CLUSTERING

#Compared to the other clusters, Cluster 5 has the largest average values in which variables (if any)? Select all
that apply.
#Ans:None
#EXPLANATION:Cluster 5 does not have the largest values in any of the variables.

#How would you describe the customers in Cluster 5?


#Ans:Relatively new customers who don't use the airline very often.
#EXPLANATION:Cluster 5 customers have lower than average values in all variables.

####################################

#PROBLEM 3.1 - K-MEANS CLUSTERING

#Now run the k-means clustering algorithm on the normalized data, again creating 5 clusters. Set the seed to 88 right before running the clustering algorithm, and set the argument iter.max to 1000.

#Kmeans clustering

set.seed(88)
kmeansClust<- kmeans(airlinesNorm, centers=5, iter.max=1000)

#you can look at the number of observations in each cluster with the following command:
table(kmeansClust$cluster)


##
## 1 2 3 4 5
## 408 141 993 1182 1275

#or
sum(kmeansClust$size > 1000)

## [1] 2

#How many clusters have more than 1,000 observations?


#Ans:2
#Explanation:There are two clusters with more than 1000 observations.

###################################

#PROBLEM 3.2 - K-MEANS CLUSTERING

#Now, compare the cluster centroids to each other either by dividing the data points into groups and then using tapply, or by looking at the output of kmeansClust$centers, where "kmeansClust" is the name of the output of the kmeans function. (Note that the output of kmeansClust$centers will be for the normalized data. If you want to look at the average values for the unnormalized data, you need to use tapply like we did for hierarchical clustering.)

#the centroids of the clusters (Normalised data)

#Hierarchical clusters' centroids (computed on the normalized data)


airlinesNormClusters<-split(airlinesNorm,clusterGroups)
round(sapply(airlinesNormClusters,colMeans),4)

## 1 2 3 4 5
## Balance -0.1561 0.3678 1.2363 -0.2110 -0.3706
## QualMiles -0.1854 1.1916 -0.1471 -0.1800 -0.1830
## BonusMiles -0.2809 0.2375 1.6004 0.1509 -0.6161
## BonusTrans -0.0811 0.6901 0.8395 0.5712 -0.8985
## FlightMiles -0.2692 1.5379 -0.0945 -0.2489 -0.2433
## FlightTrans -0.2823 1.5895 -0.0803 -0.2713 -0.2464
## DaysSinceEnroll 1.0250 0.1375 0.7250 -0.6187 -0.5125

#Kmeans centroid
kmeansClust$centers

## Balance QualMiles BonusMiles BonusTrans FlightMiles FlightTrans


## 1 1.44439706 0.51115730 1.8769284 1.0331951 0.1169945 0.1444636
## 2 1.00054098 0.68382234 0.6144780 1.7214887 3.8559798 4.1196141
## 3 -0.05580605 -0.14104391 0.3041358 0.7108744 -0.1218278 -0.1287569
## 4 -0.13331742 -0.11491607 -0.3492669 -0.3373455 -0.1833989 -0.1961819
## 5 -0.40579897 -0.02281076 -0.5816482 -0.7619054 -0.1989602 -0.2196582
## DaysSinceEnroll
## 1 0.7198040
## 2 0.2742394
## 3 -0.3398209
## 4 0.9640923
## 5 -0.8897747

#Do you expect Cluster 1 of the K-Means clustering output to necessarily be similar to Cluster 1 of the Hierarchical clustering output?
#Ans:No, because cluster ordering is not meaningful in either k-means clustering or hierarchical clustering.
#EXPLANATION:The clusters are not displayed in a meaningful order, so while there may be a cluster produced by the k-means algorithm that is similar to Cluster 1 produced by the Hierarchical method, it will not necessarily be shown first.

PREDICTING STOCK RETURNS WITH CLUSTER-THEN-PREDICT


In the second lecture sequence this week, we heard about cluster-then-predict, a methodology in which you first cluster observations and then build cluster-specific prediction models. In the lecture sequence, we saw how this methodology helped improve the prediction of heart attack risk. In this assignment, we'll use cluster-then-predict to predict future stock prices using historical stock data.

When selecting which stocks to invest in, investors seek to obtain good future returns. In this problem, we will first use clustering to identify clusters of stocks that have similar returns over time. Then, we'll use logistic regression to predict whether or not the stocks will have positive
future returns.

For this problem, we'll use StocksCluster.csv, which contains monthly stock returns from the NASDAQ stock exchange. The NASDAQ is the
second-largest stock exchange in the world, and it lists many technology companies. The stock price data used in this problem was obtained
from infochimps, a website providing access to many datasets.

Each observation in the dataset is the monthly returns of a particular company in a particular year. The years included are 2000-2009. The
companies are limited to tickers that were listed on the exchange for the entire period 2000-2009, and whose stock price never fell below $1. So,
for example, one observation is for Yahoo in 2000, and another observation is for Yahoo in 2001. Our goal will be to predict whether or not the
stock return in December will be positive, using the stock returns for the first 11 months of the year.

This dataset contains the following variables:

ReturnJan = the return for the company's stock during January (in the year of the observation).
ReturnFeb = the return for the company's stock during February (in the year of the observation).
ReturnMar = the return for the company's stock during March (in the year of the observation).
ReturnApr = the return for the company's stock during April (in the year of the observation).
ReturnMay = the return for the company's stock during May (in the year of the observation).
ReturnJune = the return for the company's stock during June (in the year of the observation).
ReturnJuly = the return for the company's stock during July (in the year of the observation).
ReturnAug = the return for the company's stock during August (in the year of the observation).
ReturnSep = the return for the company's stock during September (in the year of the observation).
ReturnOct = the return for the company's stock during October (in the year of the observation).
ReturnNov = the return for the company's stock during November (in the year of the observation).
PositiveDec = whether or not the company's stock had a positive return in December (in the year of the observation). This variable takes
value 1 if the return was positive, and value 0 if the return was not positive.

For the first 11 variables, the value stored is a proportional change in stock value during that month. For instance, a value of 0.05 means the
stock increased in value 5% during the month, while a value of -0.02 means the stock decreased in value 2% during the month.

#PROBLEM 1.1 - EXPLORING THE DATASET

#Load StocksCluster.csv into a data frame called "stocks".

#How many observations are in the dataset?


stocks <- read.csv("StocksCluster.csv")
str(stocks)

## 'data.frame': 11580 obs. of 12 variables:


## $ ReturnJan : num 0.0807 -0.0107 0.0477 -0.074 -0.031 ...
## $ ReturnFeb : num 0.0663 0.1021 0.036 -0.0482 -0.2127 ...
## $ ReturnMar : num 0.0329 0.1455 0.0397 0.0182 0.0915 ...
## $ ReturnApr : num 0.1831 -0.0844 -0.1624 -0.0247 0.1893 ...
## $ ReturnMay : num 0.13033 -0.3273 -0.14743 -0.00604 -0.15385 ...
## $ ReturnJune : num -0.0176 -0.3593 0.0486 -0.0253 -0.1061 ...
## $ ReturnJuly : num -0.0205 -0.0253 -0.1354 -0.094 0.3553 ...
## $ ReturnAug : num 0.0247 0.2113 0.0334 0.0953 0.0568 ...
## $ ReturnSep : num -0.0204 -0.58 0 0.0567 0.0336 ...
## $ ReturnOct : num -0.1733 -0.2671 0.0917 -0.0963 0.0363 ...
## $ ReturnNov : num -0.0254 -0.1512 -0.0596 -0.0405 -0.0853 ...
## $ PositiveDec: int 0 0 0 1 1 1 1 0 0 0 ...

#or
nrow(stocks)

## [1] 11580

#Ans:11580

###################################

#PROBLEM 1.2 - EXPLORING THE DATASET

#What proportion of the observations have positive returns in December?

#Proportion of stocks with positive return

prop.table(table(stocks$PositiveDec)) #6324/11580 = 0.546

##
## 0 1
## 0.453886 0.546114

#or
sum(stocks$PositiveDec) / nrow(stocks)

## [1] 0.546114

#or
mean(stocks$PositiveDec)


## [1] 0.546114

#Ans:0.546114

######################################

#PROBLEM 1.3 - EXPLORING THE DATASET

#What is the maximum correlation between any two return variables in the dataset? You should look at the pairwise correlations between ReturnJan, ReturnFeb, ReturnMar, ReturnApr, ReturnMay, ReturnJune, ReturnJuly, ReturnAug, ReturnSep, ReturnOct, and ReturnNov.

#Max correlation between months January to November


cm <- cor(stocks[1:11])
diag(cm) <- -1 # set the diagonal (each variable's correlation with itself, which is 1) to -1 so it is ignored by max()
max(cm)

## [1] 0.1916728

#Ans:0.1916728
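#Which pair of months attains this maximum (my own addition)? arr.ind=TRUE reports the row/column indices, i.e. the two variables involved
which(cm == max(cm), arr.ind=TRUE)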

####################################

#PROBLEM 1.4 - EXPLORING THE DATASET

#Which month (from January through November) has the largest mean return across all observations in the dataset?

summary(stocks)

## ReturnJan ReturnFeb ReturnMar


## Min. :-0.7616205 Min. :-0.690000 Min. :-0.712994
## 1st Qu.:-0.0691663 1st Qu.:-0.077748 1st Qu.:-0.046389
## Median : 0.0009965 Median :-0.010626 Median : 0.009878
## Mean : 0.0126316 Mean :-0.007605 Mean : 0.019402
## 3rd Qu.: 0.0732606 3rd Qu.: 0.043600 3rd Qu.: 0.077066
## Max. : 3.0683060 Max. : 6.943694 Max. : 4.008621
## ReturnApr ReturnMay ReturnJune
## Min. :-0.826503 Min. :-0.92207 Min. :-0.717920
## 1st Qu.:-0.054468 1st Qu.:-0.04640 1st Qu.:-0.063966
## Median : 0.009059 Median : 0.01293 Median :-0.000880
## Mean : 0.026308 Mean : 0.02474 Mean : 0.005938
## 3rd Qu.: 0.085338 3rd Qu.: 0.08396 3rd Qu.: 0.061566
## Max. : 2.528827 Max. : 6.93013 Max. : 4.339713
## ReturnJuly ReturnAug ReturnSep
## Min. :-0.7613096 Min. :-0.726800 Min. :-0.839730
## 1st Qu.:-0.0731917 1st Qu.:-0.046272 1st Qu.:-0.074648
## Median :-0.0008047 Median : 0.007205 Median :-0.007616
## Mean : 0.0030509 Mean : 0.016198 Mean :-0.014721
## 3rd Qu.: 0.0718205 3rd Qu.: 0.070783 3rd Qu.: 0.049476
## Max. : 2.5500000 Max. : 3.626609 Max. : 5.863980
## ReturnOct ReturnNov PositiveDec
## Min. :-0.685504 Min. :-0.747171 Min. :0.0000
## 1st Qu.:-0.070915 1st Qu.:-0.054890 1st Qu.:0.0000
## Median : 0.002115 Median : 0.008522 Median :1.0000
## Mean : 0.005651 Mean : 0.011387 Mean :0.5461
## 3rd Qu.: 0.074542 3rd Qu.: 0.076576 3rd Qu.:1.0000
## Max. : 5.665138 Max. : 3.271676 Max. :1.0000

#If you look at the mean value for each variable, you can see that April has the largest mean value (0.026308), and September has the smallest mean value (-0.014721).

#or better

cm <- colMeans(stocks[1:11])
which.max(cm)

## ReturnApr
## 4

#Ans:ReturnApr

#Which month (from January through November) has the smallest mean return across all observations in the dataset?
which.min(cm)


## ReturnSep
## 9

#Ans:ReturnSep

###########################################

#PROBLEM 2.1 - INITIAL LOGISTIC REGRESSION MODEL

#Run the following commands to split the data into a training set and testing set, putting 70% of the data in the
training set and 30% of the data in the testing set:

#Split the data into a training set and testing set

library(caTools)

##
## Attaching package: 'caTools'

## The following objects are masked from 'package:base64enc':


##
## base64decode, base64encode

set.seed(144)
spl = sample.split(stocks$PositiveDec, SplitRatio = 0.7)
stocksTrain = subset(stocks, spl == TRUE)
stocksTest = subset(stocks, spl == FALSE)

#Then, use the stocksTrain data frame to train a logistic regression model (name it StocksModel) to predict PositiveDec using all the other variables as independent variables. Don't forget to add the argument family=binomial to your glm command.

#Train Logistic regression model as:


StocksModel = glm(PositiveDec ~ ., data=stocksTrain, family=binomial)

#we can compute our predictions on the training set with:


PredictTrain = predict(StocksModel, type="response")

#construct a confusion/classification matrix with the table function using a threshold of 0.5:
cmat_LR <-table(stocksTrain$PositiveDec, PredictTrain > 0.5)
cmat_LR

##
## FALSE TRUE
## 0 990 2689
## 1 787 3640

#lets now compute the overall accuracy of the training set


accu_LR <-(cmat_LR[1,1] + cmat_LR[2,2])/sum(cmat_LR)
accu_LR

## [1] 0.5711818

#or
sum(diag(cmat_LR))/nrow(stocksTrain)

## [1] 0.5711818


#What is the overall accuracy on the training set using a threshold of 0.5?
#Ans: 0.5711818 ((990 + 3640)/(990 + 2689 + 787 + 3640) = 0.571)

#########################################

#PROBLEM 2.2 - INITIAL LOGISTIC REGRESSION MODEL

#Now obtain test set predictions from StocksModel. What is the overall accuracy of the model on the test set, again using a threshold of 0.5?

#Out-of-Sample predictions of the Logistic Regression model


PredictTest = predict(StocksModel, newdata=stocksTest, type="response")

#Then we compute the confusion matrix for the testing set using a threshold of 0.5
cmat_LR<-table(stocksTest$PositiveDec, PredictTest > 0.5)
cmat_LR

##
## FALSE TRUE
## 0 417 1160
## 1 344 1553

#let's now compute the overall accuracy on the test set


accu_LR <-(cmat_LR[1,1] + cmat_LR[2,2])/sum(cmat_LR)
accu_LR

## [1] 0.5670697

#or
sum(diag(cmat_LR))/nrow(stocksTest)

## [1] 0.5670697

#Ans:0.5670697 (417 + 1553)/(417 + 1160 + 344 + 1553) = 0.567

########################################

#PROBLEM 2.3 - INITIAL LOGISTIC REGRESSION MODEL

#What is the accuracy on the test set of a baseline model that always predicts the most common outcome (PositiveDec = 1)?

#baseline model computed by making a table of the outcome variable in the test set:
baseline<-table(stocksTest$PositiveDec)
baseline

##
## 0 1
## 1577 1897

#Baseline accuracy
accu_baseline <- max(baseline)/sum(baseline)
accu_baseline #1897/(1577 + 1897) = 0.5460564

## [1] 0.5460564

#or
baseline[2] / sum(baseline)

## 1
## 0.5460564

#Ans:0.5460564
#EXPLANATION:The baseline model would get all of the PositiveDec = 1 cases correct, and all of the PositiveDec =
0 cases wrong, for an accuracy of 1897/(1577 + 1897) = 0.5460564.

##################################

#PROBLEM 3.1 - CLUSTERING STOCKS

#Now, let's cluster the stocks. The first step in this process is to remove the dependent variable using the following commands:

#Remove the dependent variable from the data: we are trying to use clustering to "discover" the dependent variable, so it must not be part of the clustering inputs
limitedTrain = stocksTrain
limitedTrain$PositiveDec = NULL
limitedTest = stocksTest
limitedTest$PositiveDec = NULL

#Why do we need to remove the dependent variable in the clustering phase of the cluster-then-predict methodology?
#Ans:Needing to know the dependent variable value to assign an observation to a cluster defeats the purpose of the methodology
#EXPLANATION:In cluster-then-predict, our final goal is to predict the dependent variable, which is unknown to us at the time of prediction. Therefore, if we need to know the outcome value to perform the clustering, the methodology is no longer useful for prediction of an unknown outcome value.
#This is an important point that is sometimes mistakenly overlooked. If you use the outcome value to cluster, you might conclude your method strongly outperforms a non-clustering alternative. However, this is because it is using the outcome to determine the clusters, which is not valid.

################################################

#PROBLEM 3.2 - CLUSTERING STOCKS

#In the market segmentation assignment in this week's homework, you were introduced to the preProcess command from the caret package, which normalizes variables by subtracting the mean and dividing by the standard deviation.

#In cases where we have a training and testing set, we'll want to normalize by the mean and standard deviation of the variables in the training set. We can do this by passing just the training set to the preProcess function:

library(caret)
preproc = preProcess(limitedTrain)
normTrain = predict(preproc, limitedTrain)
normTest = predict(preproc, limitedTest)

#What is the mean of the ReturnJan variable in normTrain?


mean(normTrain$ReturnJan)

## [1] 2.100586e-17

#Ans:2.100586e-17

#What is the mean of the ReturnJan variable in normTest?


mean(normTest$ReturnJan)

## [1] -0.0004185886

#Ans:-0.0004185886

###############################

#PROBLEM 3.3 - CLUSTERING STOCKS

#Why is the mean ReturnJan variable much closer to 0 in normTrain than in normTest?
#Ans:The distribution of the ReturnJan variable is different in the training and testing set
#EXPLANATION:From mean(stocksTrain$ReturnJan) and mean(stocksTest$ReturnJan), we see that the average return in January is slightly higher in the training set than in the testing set. Since normTest was constructed by subtracting the mean ReturnJan value from the training set, this explains why the mean value of ReturnJan is slightly negative in normTest.
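
#(A small optional sketch, not part of the assignment): assuming preProcess used its default center-and-scale
#method, we can reproduce one normalized column by hand with the training-set mean and standard deviation,
#which also makes clear why normTest$ReturnJan has a slightly negative mean.
manualJan <- (limitedTest$ReturnJan - mean(limitedTrain$ReturnJan)) / sd(limitedTrain$ReturnJan)
all.equal(manualJan, normTest$ReturnJan)  #expected TRUE if the preProcess defaults were used
mean(stocksTrain$ReturnJan)               #training-set mean used for centering
mean(stocksTest$ReturnJan)                #slightly lower, hence the negative mean after centering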

#####################################

#PROBLEM 3.4 - CLUSTERING STOCKS

#Set the random seed to 144 (it is important to do this again, even though we did it earlier). Run k-means clustering with 3 clusters on normTrain, storing the result in an object called km.

set.seed(144)
km <- kmeans(normTrain, centers=3, iter.max=1000)

#We can see the number of observations in each cluster by:


km$size

## [1] 3157 4696 253

#or
table(km$cluster)

##
## 1 2 3
## 3157 4696 253

#Which cluster has the largest number of observations?


#Ans:Cluster 2
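
#(A hedged, optional sketch): looking at the cluster centers on the normalized scale gives a rough sense of
#what distinguishes the clusters, i.e. in which months each cluster's average return is relatively high or low.
round(km$centers, 2)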

##################################

#PROBLEM 3.5 - CLUSTERING STOCKS

#Recall from the recitation that we can use the flexclust package to obtain training set and testing set cluster
assignments for our observations (note that the call to as.kcca may take a while to complete):

#Obtain training set and testing set cluster assignments:

library(flexclust)

km.kcca = as.kcca(km, normTrain)


clusterTrain = predict(km.kcca)
clusterTest = predict(km.kcca, newdata=normTest)

#How many test-set observations were assigned to Cluster 2?

#we can obtain the breakdown of the testing set clusters with:
table(clusterTest)

## clusterTest
## 1 2 3
## 1298 2080 96

#or getting directly as


sum(clusterTest == 2) #test-set observations assigned to Cluster 2

## [1] 2080

#Ans:2080
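
#(Optional sanity check): the flexclust training-set assignments should correspond to the original kmeans()
#assignments; a cross-tabulation makes any disagreement for borderline observations easy to spot.
table(clusterTrain, km$cluster)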

#################################

#PROBLEM 4.1 - CLUSTER-SPECIFIC PREDICTIONS

#Using the subset function, build data frames stocksTrain1, stocksTrain2, and stocksTrain3, containing the elements in the stocksTrain data frame assigned to clusters 1, 2, and 3, respectively (be careful to take subsets of stocksTrain, not of normTrain). Similarly build stocksTest1, stocksTest2, and stocksTest3 from the stocksTest data frame.

#We can obtain the necessary subsets with:


stocksTrain1 = subset(stocksTrain, clusterTrain == 1)
stocksTrain2 = subset(stocksTrain, clusterTrain == 2)
stocksTrain3 = subset(stocksTrain, clusterTrain == 3)
stocksTest1 = subset(stocksTest, clusterTest == 1)
stocksTest2 = subset(stocksTest, clusterTest == 2)
stocksTest3 = subset(stocksTest, clusterTest == 3)

#Which training set data frame has the highest average value of the dependent variable?
mean(stocksTrain1$PositiveDec)

## [1] 0.6024707

mean(stocksTrain2$PositiveDec)

## [1] 0.5140545

mean(stocksTrain3$PositiveDec)

## [1] 0.4387352

#or better way

#Building stocksTrain 1/2/3 from clusterTrain, and stocksTest 1/2/3 from clusterTest, using split()
stocksTrain11 <- split(stocksTrain, clusterTrain)
stocksTest11 <- split(stocksTest, clusterTest)

sapply(stocksTrain11, function(s){ mean(s$PositiveDec) })

## 1 2 3
## 0.6024707 0.5140545 0.4387352

#Ans:stocksTrain1
#EXPLANATION: we see that stocksTrain1 has the observations with the highest average value of the dependent variable.

####################################

#PROBLEM 4.2 - CLUSTER-SPECIFIC PREDICTIONS

#Build logistic regression models StocksModel1, StocksModel2, and StocksModel3, which predict PositiveDec using all the other variables as independent variables. StocksModel1 should be trained on stocksTrain1, StocksModel2 should be trained on stocksTrain2, and StocksModel3 should be trained on stocksTrain3.

#We can build the Logistic models with:


StocksModel1 = glm(PositiveDec ~ ., data=stocksTrain1, family=binomial)
StocksModel2 = glm(PositiveDec ~ ., data=stocksTrain2, family=binomial)
StocksModel3 = glm(PositiveDec ~ ., data=stocksTrain3, family=binomial)

summary(StocksModel1)

##
## Call:
## glm(formula = PositiveDec ~ ., family = binomial, data = stocksTrain1)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.7307 -1.2910 0.8878 1.0280 1.5023
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.17224 0.06302 2.733 0.00628 **
## ReturnJan 0.02498 0.29306 0.085 0.93206
## ReturnFeb -0.37207 0.29123 -1.278 0.20139
## ReturnMar 0.59555 0.23325 2.553 0.01067 *
## ReturnApr 1.19048 0.22439 5.305 1.12e-07 ***
## ReturnMay 0.30421 0.22845 1.332 0.18298
## ReturnJune -0.01165 0.29993 -0.039 0.96901
## ReturnJuly 0.19769 0.27790 0.711 0.47685
## ReturnAug 0.51273 0.30858 1.662 0.09660 .
## ReturnSep 0.58833 0.28133 2.091 0.03651 *
## ReturnOct -1.02254 0.26007 -3.932 8.43e-05 ***
## ReturnNov -0.74847 0.28280 -2.647 0.00813 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 4243.0 on 3156 degrees of freedom
## Residual deviance: 4172.9 on 3145 degrees of freedom
## AIC: 4196.9
##
## Number of Fisher Scoring iterations: 4

summary(StocksModel2)

##
## Call:
## glm(formula = PositiveDec ~ ., family = binomial, data = stocksTrain2)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2012 -1.1941 0.8583 1.1334 1.9424
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.10293 0.03785 2.719 0.006540 **
## ReturnJan 0.88451 0.20276 4.362 1.29e-05 ***
## ReturnFeb 0.31762 0.26624 1.193 0.232878
## ReturnMar -0.37978 0.24045 -1.579 0.114231
## ReturnApr 0.49291 0.22460 2.195 0.028189 *
## ReturnMay 0.89655 0.25492 3.517 0.000436 ***
## ReturnJune 1.50088 0.26014 5.770 7.95e-09 ***
## ReturnJuly 0.78315 0.26864 2.915 0.003554 **
## ReturnAug -0.24486 0.27080 -0.904 0.365876
## ReturnSep 0.73685 0.24820 2.969 0.002989 **
## ReturnOct -0.27756 0.18400 -1.509 0.131419
## ReturnNov -0.78747 0.22458 -3.506 0.000454 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 6506.3 on 4695 degrees of freedom
## Residual deviance: 6362.2 on 4684 degrees of freedom
## AIC: 6386.2
##
## Number of Fisher Scoring iterations: 4

summary(StocksModel3)

##
## Call:
## glm(formula = PositiveDec ~ ., family = binomial, data = stocksTrain3)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9146 -1.0393 -0.7689 1.1921 1.6939
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.181896 0.325182 -0.559 0.5759
## ReturnJan -0.009789 0.448943 -0.022 0.9826
## ReturnFeb -0.046883 0.213432 -0.220 0.8261
## ReturnMar 0.674179 0.564790 1.194 0.2326
## ReturnApr 1.281466 0.602672 2.126 0.0335 *
## ReturnMay 0.762512 0.647783 1.177 0.2392
## ReturnJune 0.329434 0.408038 0.807 0.4195
## ReturnJuly 0.774164 0.729360 1.061 0.2885
## ReturnAug 0.982605 0.533158 1.843 0.0653 .
## ReturnSep 0.363807 0.627774 0.580 0.5622
## ReturnOct 0.782242 0.733123 1.067 0.2860
## ReturnNov -0.873752 0.738480 -1.183 0.2367
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 346.92 on 252 degrees of freedom
## Residual deviance: 328.29 on 241 degrees of freedom
## AIC: 352.29
##
## Number of Fisher Scoring iterations: 4

#or better way using a custom function

#Create logistic regression models 1, 2 and 3 from stocksTrain11[[1]], stocksTrain11[[2]] and stocksTrain11[[3]] via lapply


stocksModels <- lapply(stocksTrain11, function(s){
glm(s$PositiveDec ~ ., family=binomial, data=s)
})

sapply(stocksModels, function(m){ m$coefficients })

## 1 2 3
## (Intercept) 0.17223985 0.1029318 -0.181895809
## ReturnJan 0.02498357 0.8845148 -0.009789345
## ReturnFeb -0.37207369 0.3176221 -0.046883260
## ReturnMar 0.59554957 -0.3797811 0.674179495
## ReturnApr 1.19047752 0.4929105 1.281466189
## ReturnMay 0.30420906 0.8965492 0.762511555
## ReturnJune -0.01165375 1.5008787 0.329433917
## ReturnJuly 0.19769226 0.7831487 0.774164370
## ReturnAug 0.51272941 -0.2448602 0.982605385
## ReturnSep 0.58832685 0.7368522 0.363806823
## ReturnOct -1.02253506 -0.2775631 0.782242086
## ReturnNov -0.74847186 -0.7874737 -0.873752144

#Which variables have a positive sign for the coefficient in at least one of StocksModel1, StocksModel2, and StocksModel3 and a negative sign for the coefficient in at least one of StocksModel1, StocksModel2, and StocksModel3? Select all that apply.
#Ans: ReturnJan, ReturnFeb, ReturnMar, ReturnJune, ReturnAug, and ReturnOct differ in sign between the models.
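
#(A hedged sketch): the same answer can be derived programmatically by checking, for each coefficient other
#than the intercept, whether its sign differs across the three cluster-specific models.
coefSigns <- sign(sapply(stocksModels, coef))                     #12 x 3 matrix of coefficient signs
coefSigns <- coefSigns[rownames(coefSigns) != "(Intercept)", ]    #drop the intercept row
rownames(coefSigns)[apply(coefSigns, 1, function(r) any(r > 0) && any(r < 0))]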

###########################################

#PROBLEM 4.3 - CLUSTER-SPECIFIC PREDICTIONS

#Using StocksModel1, make test-set predictions called PredictTest1 on the data frame stocksTest1. Using StocksModel2, make test-set predictions called PredictTest2 on the data frame stocksTest2. Using StocksModel3, make test-set predictions called PredictTest3 on the data frame stocksTest3.

#The out-of-sample predictions can be obtained with:


PredictTest1 = predict(StocksModel1, newdata = stocksTest1, type="response")
PredictTest2 = predict(StocksModel2, newdata = stocksTest2, type="response")
PredictTest3 = predict(StocksModel3, newdata = stocksTest3, type="response")

#The out-of-sample confusion/classification matrices using a threshold of 0.5 are:


cmat1<-table(stocksTest1$PositiveDec, PredictTest1 > 0.5)
cmat1

##
## FALSE TRUE
## 0 30 471
## 1 23 774

cmat2<-table(stocksTest2$PositiveDec, PredictTest2 > 0.5)


cmat2

##
## FALSE TRUE
## 0 388 626
## 1 309 757

cmat3<-table(stocksTest3$PositiveDec, PredictTest3 > 0.5)


cmat3

##
## FALSE TRUE
## 0 49 13
## 1 21 13

#let's now compute the out-of-sample overall accuracies:


accu_1<- (cmat1[1,1] + cmat1[2,2])/sum(cmat1)
accu_1

## [1] 0.6194145

accu_2<- (cmat2[1,1] + cmat2[2,2])/sum(cmat2)


accu_2

## [1] 0.5504808

accu_3<- (cmat3[1,1] + cmat3[2,2])/sum(cmat3)


accu_3

## [1] 0.6458333

#or better way using a custom function

#Make predictions using a threshold of 0.5 and compute accuracies, applying stocksModels[[i]] to stocksTest11[[i]]

predictions <- sapply(1:3, function (i) {


p <- predict(stocksModels[[i]], newdata=stocksTest11[[i]], type="response")
(conf.mat <- table(stocksTest11[[i]]$PositiveDec, p > 0.5))
accuracy <- sum(diag(conf.mat)) / sum(conf.mat)
list(predict=p, accuracy=accuracy)
})
predictions

## [,1] [,2] [,3]
## predict Numeric,1298 Numeric,2080 Numeric,96
## accuracy 0.6194145 0.5504808 0.6458333

#What is the overall accuracy of StocksModel1 on the test set stocksTest1, using a threshold of 0.5?
#Ans:0.6194145 (30 + 774)/(30 + 471 + 23 + 774) = 0.6194145

#What is the overall accuracy of StocksModel2 on the test set stocksTest2, using a threshold of 0.5?
#Ans:0.5504808 (388 + 757)/(388 + 626 + 309 + 757) = 0.5504808

#What is the overall accuracy of StocksModel3 on the test set stocksTest3, using a threshold of 0.5?
#Ans:0.6458333 (49 + 13)/(49 + 13 + 21 + 13) = 0.6458333

#############################################

#PROBLEM 4.4 - CLUSTER-SPECIFIC PREDICTIONS

#To compute the overall test-set accuracy of the cluster-then-predict approach, we can combine all the test-set predictions into a single vector and all the true outcomes into a single vector:

AllPredictions = c(PredictTest1, PredictTest2, PredictTest3)


AllOutcomes = c(stocksTest1$PositiveDec, stocksTest2$PositiveDec, stocksTest3$PositiveDec)

#What is the overall test-set accuracy of the cluster-then-predict approach, again using a threshold of 0.5?

#Computing overall confusion matrix of the cluster-then-predict approach


cmatoverall<-table(AllOutcomes, AllPredictions > 0.5)
cmatoverall

##
## AllOutcomes FALSE TRUE
## 0 467 1110
## 1 353 1544

#let's now compute the out-of-sample overall accuracy:


accu_overall<- (cmatoverall[1,1] + cmatoverall[2,2])/sum(cmatoverall)
accu_overall #(467 + 1544)/(467 + 1110 + 353 + 1544) = 0.5788716

## [1] 0.5788716

#Ans:0.5788716
#EXPLANATION:Which tells us that the overall accuracy is (467 + 1544)/(467 + 1110 + 353 + 1544) = 0.5788716.

#We see a modest improvement over the original logistic regression model. Since predicting stock returns is a notoriously hard problem, this is a good increase in accuracy. By investing in stocks for which we are more confident that they will have positive returns (by selecting the ones with higher predicted probabilities), this cluster-then-predict model can give us an edge over the original logistic regression model.
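
#(A hedged, optional sketch beyond the assignment): quantify the gain over the single logistic regression model
#(accu_LR still holds its test-set accuracy from Problem 2.2 at this point), and check how often the most
#confident predictions, here the top 10% of predicted probabilities, actually have PositiveDec = 1.
accu_overall - accu_LR                        #improvement of cluster-then-predict over the single model

cutoff <- quantile(AllPredictions, 0.9)       #threshold for the top decile of predicted probabilities
mean(AllOutcomes[AllPredictions >= cutoff])   #fraction of these high-confidence stocks with a positive December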

sessionInfo()

## R version 3.3.0 (2016-05-03)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 10586)
##
## locale:
## [1] LC_COLLATE=English_India.1252 LC_CTYPE=English_India.1252
## [3] LC_MONETARY=English_India.1252 LC_NUMERIC=C
## [5] LC_TIME=English_India.1252
##
## attached base packages:
## [1] stats4 grid stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] caTools_1.17.1 caret_6.0-68 flexclust_1.3-4
## [4] modeltools_0.2-21 DataComputing_0.8.3 curl_0.9.7
## [7] base64enc_0.1-3 manipulate_1.0.1 mosaic_0.13.0
## [10] mosaicData_0.13.0 car_2.1-2 lattice_0.20-33
## [13] knitr_1.13 stringr_1.0.0 tidyr_0.4.1
## [16] lubridate_1.5.6 dplyr_0.4.3 ggplot2_2.1.0
##
## loaded via a namespace (and not attached):
## [1] reshape2_1.4.1 splines_3.3.0 colorspace_1.2-6
## [4] htmltools_0.3.5 yaml_2.1.13 mgcv_1.8-12
## [7] nloptr_1.0.4 DBI_0.4-1 foreach_1.4.3
## [10] plyr_1.8.4 MatrixModels_0.4-1 munsell_0.4.3
## [13] gtable_0.2.0 codetools_0.2-14 evaluate_0.9
## [16] SparseM_1.7 quantreg_5.26 pbkrtest_0.4-6
## [19] parallel_3.3.0 Rcpp_0.12.5 scales_0.4.0
## [22] formatR_1.4 mime_0.4 lme4_1.1-12
## [25] gridExtra_2.2.1 digest_0.6.9 stringi_1.1.1
## [28] bitops_1.0-6 tools_3.3.0 magrittr_1.5
## [31] lazyeval_0.1.10 ggdendro_0.1-20 MASS_7.3-45
## [34] Matrix_1.2-6 iterators_1.0.8 assertthat_0.1
## [37] minqa_1.2.4 rmarkdown_0.9.6 R6_2.1.2
## [40] nnet_7.3-12 nlme_3.1-128

################# Finally clustering unit finished.......Yay! ###################
