
#This lab aims to build a comprehensive analysis and understanding

#of the dataset and then cluster the data into groups.

#the data is the seeds dataset available here: https://archive.ics.uci.edu/ml/datasets/seeds#

#***** NOTE *****

#For the purpose of this assessment, the data has been updated and changed slightly.

#This is an invitation for everyone to focus on working on the data

#rather than finding code on the internet to analyze the data :)

#0. Load the data (i.e., the CSV file given to you)

seeds <- read.csv("Seeds.csv")

#************************************************************

#1. Explore the data

#************************************************************

#PLEASE MAKE SURE THAT YOU COMMENT EVERY IMPORTANT STEP YOU DO HERE!

#What do you do? Why do you do that? Why do you get the outcome you get? How do you explain the outcome?

#Demonstrate that you can use boxplots and histograms, and show how you exploit that information.

#Requirement 1: Should we keep all the attributes? Demonstrate and comment on your approach.

#Include visualizations with their explanation when possible.

names(seeds)

seeds$ID <- NULL

#We should remove ID because it is just an identifier and adds no value to the model.

# Requirement 2: Should I transform variables?

#Include visualizations with their explanation when possible.

head(seeds)
str(seeds)

unique(seeds$Group)

seeds$Group <- as.factor(seeds$Group)

# As Group has 3 unique values (1, 2, 3) that represent categories rather than quantities, we need to convert it from integer to factor.

# Requirement 3: Are there NA values, outliers or other strange values? Provide and explain the solutions you put in place to solve these issues.

#Include visualizations with their explanation when possible.

sum(is.na(seeds))

#The total is 12, which means there are 12 missing (NA) cells in the data frame.
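To see where those 12 NAs sit and how they could be handled, here is a short sketch using only base R (the name seeds_complete is just an illustrative variable, not from the original script):

```r
# Count NA cells per column to see which attributes are affected
colSums(is.na(seeds))

# One simple remedy: keep only the rows with no missing values
seeds_complete <- seeds[complete.cases(seeds), ]
sum(is.na(seeds_complete))  # should now be 0
```

This mirrors the row filtering applied later before clustering.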

boxplot(seeds,notch = TRUE, col = 1:7)

# We can see that there is only one outlier, in the asymmetry coefficient.
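As a sketch of how that outlier could be confirmed numerically, the usual 1.5 * IQR rule can be applied to the column; the column name asymmetrycoefficient is assumed from the boxplot label and may differ in the actual CSV:

```r
# Flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] as outliers;
# the column name is an assumption and may need adjusting
x <- seeds$asymmetrycoefficient
q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
iqr <- q[2] - q[1]
x[!is.na(x) & (x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr)]
</imports>
```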

#Requirement 4: Provide some interesting descriptive learnings from the descriptive analysis that you have done.

#Include visualizations with their explanation when possible.

pairs(seeds)

#Area, Perimeter, and the Length and Width of the kernel are highly correlated, as expected, since they are geometrically related.
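The impression from pairs() can be backed up with an explicit correlation matrix over the numeric columns, restricted to complete rows since the data contains NAs:

```r
# Correlation matrix of the numeric attributes (Group is a factor, so excluded);
# use = "complete.obs" drops rows with any NA before computing correlations
num_cols <- sapply(seeds, is.numeric)
round(cor(seeds[, num_cols], use = "complete.obs"), 2)
```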
plot(x=seeds$area, y=seeds$Group)

#We can see that Group 3 has the smallest area; Group 2 has the largest area range, followed by Group 1.

#Requirement 5: Conclude on the descriptive task you have just done.

#Conclusion

1. Perimeter, Area, Length of Kernel and Width of Kernel are all highly correlated.
2. Group 2 has the largest area, then Group 1, then Group 3.
3. A similar trend is observed for Length and Width of Kernel.
#************************************************************

#2. Clustering

#************************************************************

#Cluster the dataset using kmeans

#Requirement 6: demonstrate the best number of clusters. Implement and comment the steps you take to do so.

df <- seeds

df$Group <- NULL

df <- df[rowSums(is.na(df)) == 0,]

k_limit <- 7

wss <- sapply(1:k_limit, function(k) {
  kmeans(df, k, nstart = 50, iter.max = 15)$tot.withinss
})

wss

plot(1:k_limit, wss,
     type = "b", pch = 19, frame = FALSE,
     xlab = "Number of clusters K",
     ylab = "Total within-cluster sum of squares")

To identify the correct number of clusters, I used the elbow method.

We can see that the elbow is at K = 3, so K = 3 is the right choice.
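The elbow choice can be cross-checked with the average silhouette width, where higher values indicate better-separated clusters. This is a sketch assuming the cluster package is installed; it was not used in the original script:

```r
library(cluster)  # for silhouette()

d <- dist(df)
avg_sil <- sapply(2:k_limit, function(k) {
  km <- kmeans(df, k, nstart = 50, iter.max = 15)
  mean(silhouette(km$cluster, d)[, 3])  # column 3 holds the silhouette widths
})
plot(2:k_limit, avg_sil, type = "b", pch = 19,
     xlab = "Number of clusters K", ylab = "Average silhouette width")
```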

#Include visualizations with their explanation when possible.


#Requirement 7: comment on the quality of the clustering

#Include visualizations with their explanation when possible.

kmean <- kmeans(df, 3, nstart = 50, iter.max = 15)

kmean

Within cluster sum of squares by cluster:


[1] 179.0850 173.6147 200.6035
(between_SS / total_SS = 78.8 %)

The between-cluster sum of squares captures 78.8% of the total sum of squares.

#Requirement 8: Propose a better solution. Motivate your approach

#Include visualizations with their explanation when possible.

1. We can scale the data.


2. As Area and Perimeter are highly correlated with the Length and Width of the kernel, we can cluster directly on the Length and Width attributes.

The EDA showed that Length and Width of the kernel are better choices.

df <- scale(df)

kmean_scale <- kmeans(df[, c(4, 5)], 3, nstart = 50, iter.max = 15)

kmean_scale

#Requirement 9: Comment on the quality of the clustering

#Include visualizations with their explanation when possible.

We can see that the between_SS / total_SS ratio has increased from 78.8% to 84.8%.


Within cluster sum of squares by cluster:
[1] 20.82258 20.81995 18.67862
(between_SS / total_SS = 84.8 %)
#Requirement 10: Demonstrate that the solution you are proposing is the best by considering the attribute Group as the class of the dataset.

#Comment and motivate your approach.

The plot below shows the distribution of the 3 true groups in 3 colors. Note that the rows containing NAs were dropped when building df, so the Group vector must be subset the same way to keep the colors aligned with the points.

kept <- rowSums(is.na(seeds[, names(seeds) != "Group"])) == 0
plot(df[, c(4, 5)], col = seeds$Group[kept])

The plot below shows the distribution of the 3 clusters we obtained, in 3 colors.

plot(df[,c(4,5)], col = kmean_scale$cluster)

The two plots are very similar: the clusters largely recover the original groups, which supports the proposed solution.
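The agreement can also be quantified with a contingency table; k-means label numbers are arbitrary, so a good result shows one dominant cell per row. The row filter below reproduces the NA removal applied when building df:

```r
# Cross-tabulate true groups against cluster labels for the clustered rows
kept <- rowSums(is.na(seeds[, names(seeds) != "Group"])) == 0
table(Group = seeds$Group[kept], Cluster = kmean_scale$cluster)
```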
#Requirement 11: Attach the obtained clusters to the data
Clustering vector:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1
18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34
1 1 3 1 1 1 3 1 1 3 3 1 1 3 1 1 1
35 36 37 39 40 41 42 43 44 45 46 47 50 51 52 53 54
1 1 1 1 1 1 1 3 1 1 1 1 1 1 1 1 1
55 56 57 58 59 60 63 64 65 66 67 68 69 70 71 72 73
1 1 1 1 1 3 3 3 3 3 1 1 1 3 2 2 2
74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90
2 2 1 2 2 2 1 1 2 2 2 2 2 2 2 2 2
91 92 93 95 96 97 98 99 100 102 103 104 105 106 107 108 109
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126
2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 1 2
127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143
2 2 2 2 2 2 1 1 1 1 2 1 1 1 3 1 3
145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161
3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178
3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195
3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
196 198 199 200 202 203 204 205 207 208 209 210
3 3 3 3 3 3 3 3 3 1 3 3
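The clustering vector above can be attached back to the data as a new column; rows that were dropped for missing values keep NA. A sketch, assuming the same row filter used to build df:

```r
# Add the cluster assignment as a new column; rows excluded from clustering stay NA
kept <- rowSums(is.na(seeds[, names(seeds) != "Group"])) == 0
seeds$Cluster <- NA
seeds$Cluster[kept] <- kmean_scale$cluster
head(seeds)
```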

#Requirement 12: Provide a general conclusion on the clustering of this data set, i.e., what can we
learn from this data.

Conclusion

Generally we have 3 groups containing different types of seeds: Group 2 has the biggest seeds and Group 3 the smallest. Length and Width of the kernel are highly related to Area and Perimeter. Compactness is highest in Group 3, then Group 2, then Group 1.
