Assignment 1 ST June
#of the dataset and then cluster the data into groups.
#For the purpose of this assessment, the data has been updated and changed slightly.
#0. Load the data (i.e., the CSV file given to you)
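#A minimal sketch of the loading step. The file name, column names and values are hypothetical; a temporary CSV is written first only so the example runs on its own. Substitute the path of the CSV you were given.

```r
# Hypothetical stand-in for the assignment CSV (made-up values).
csv_path <- tempfile(fileext = ".csv")
writeLines(c("ID,area,Group", "1,15.26,1", "2,,2", "3,12.30,3"), csv_path)

# Blank fields in numeric columns are read as NA.
demo_load <- read.csv(csv_path, header = TRUE, stringsAsFactors = FALSE)
dim(demo_load)
str(demo_load)
```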
#************************************************************
#************************************************************
#PLEASE MAKE SURE THAT YOU COMMENT EVERY IMPORTANT STEP YOU DO HERE!
#What do you do? Why do you do that? Why do you get the outcome you get? How do you explain the outcome?
#Demonstrate that you can use boxplots and histograms, and that you exploit that information.
#Requirement 1: Should we keep all the attributes? Demonstrate and comment on your approach.
names(seeds)
#We should remove ID: it is only a row identifier and adds no value to the model.
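#One possible way to drop the identifier, assuming the column is literally named ID. The toy data frame and its values are made up for illustration.

```r
# Toy data frame with hypothetical values.
demo <- data.frame(ID = 1:4,
                   area = c(15.3, 14.9, 18.7, 12.1),
                   Group = c(1, 1, 2, 3))

# Keep every column except ID.
demo_no_id <- demo[, names(demo) != "ID"]
names(demo_no_id)
```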
head(seeds)
str(seeds)
unique(seeds$Group)
#Group has 3 unique values (1, 2, 3). They are group labels, i.e. categorical values, so we need to convert the column from int to factor.
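#A short sketch of the int-to-factor conversion on a toy data frame (made-up values); the same call would apply to the Group column of the real data.

```r
# Toy data frame with hypothetical values; Group starts as integer.
demo <- data.frame(area = c(15.3, 14.9, 18.7, 12.1),
                   Group = c(1L, 1L, 2L, 3L))

# Treat the group label as categorical rather than numeric.
demo$Group <- factor(demo$Group)
levels(demo$Group)
table(demo$Group)   # number of rows per group
```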
#Requirement 3: Are there NA values, outliers or other strange values? Provide and explain the solutions you put in place to solve these issues.
sum(is.na(seeds))
#This comes out to be 12, which means there are 12 missing values in the data frame.
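#A hedged sketch of one way to locate and handle the missing values: count NAs per column, find the affected rows, then drop them with na.omit(). Dropping rows is an assumption here (imputation is an alternative); the toy data and the column name width are made up. Note that na.omit() keeps the original row names, so the remaining row labels can show gaps.

```r
# Toy data frame with hypothetical values and two missing entries.
demo <- data.frame(area  = c(15.3, NA, 18.7, 12.1, 13.4),
                   width = c(3.3, 3.1, NA, 2.8, 3.0))

colSums(is.na(demo))            # NA count per column
which(!complete.cases(demo))    # rows with at least one NA

# Drop incomplete rows; the original row names are preserved.
demo_clean <- na.omit(demo)
rownames(demo_clean)
```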
#Requirement 4: Provide some interesting descriptive learnings from the descriptive analysis that you have done.
pairs(seeds)
#Area, perimeter, length of kernel and width of kernel are highly correlated, as expected, since they all measure the size of the kernel.
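#The visual impression from pairs() can be backed by numbers with cor(). This is a sketch on synthetic data: the column names and the relationship between them are assumptions chosen only to illustrate a strong and a weak correlation.

```r
set.seed(1)
# Synthetic data: perimeter is built from area, asymmetry is independent noise.
area <- runif(30, 10, 21)
demo <- data.frame(area = area,
                   perimeter = 4 * sqrt(area) + rnorm(30, sd = 0.1),
                   asymmetry = runif(30, 1, 8))

# Pairwise Pearson correlations, rounded for readability.
round(cor(demo), 2)
```

#If the real data still contains NA values, cor(x, use = "complete.obs") restricts the computation to complete rows.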
plot(x=seeds$area, y=seeds$Group)
#We can see that Group 3 has the smallest seeds (its area is the smallest), Group 2 has the largest areas, and Group 1 lies in between.
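#With Group as a factor, a boxplot per group plus a per-group median makes this comparison explicit. A sketch on made-up values, arranged to mirror the ordering described above; the numbers are not taken from the assignment data.

```r
# Toy data with hypothetical area values for three groups.
demo <- data.frame(area  = c(14.1, 15.0, 14.6, 18.2, 19.1, 18.5, 11.9, 12.3, 11.6),
                   Group = factor(rep(1:3, each = 3)))

# One box per group.
boxplot(area ~ Group, data = demo, xlab = "Group", ylab = "area")

# Median area per group.
group_medians <- tapply(demo$area, demo$Group, median)
group_medians
```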
#Conclusion
#1. Perimeter, area, length of kernel and width of kernel are all highly correlated.
#2. Group 2 has the biggest area, then Group 1, then Group 3.
#3. A similar trend is observed for length and width of kernel.
#************************************************************
#2. Clustering
#************************************************************
#Requirement 6: Demonstrate the best number of clusters. Implement and comment the steps you take to do so.
df <- seeds
k_limit <- 7
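#Elbow method: total within-cluster sum of squares for k = 1..k_limit.
#Assumes df holds only numeric columns (no ID, no Group factor).
wss <- sapply(1:k_limit, function(k) kmeans(df, centers = k, nstart = 10)$tot.withinss)
plot(1:k_limit, wss, type = "b", xlab = "Number of clusters (k)", ylab = "Total within-cluster sum of squares")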
#So K = 3 is the right choice.
kmean <- kmeans(df, centers = 3)
kmean
#Out of the total sum of squares, 78% is captured by the between-cluster sum of squares.
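#The between/total percentage that kmeans prints can also be computed directly from the fitted object. A sketch on synthetic, well-separated data, so the ratio here will differ from the 78% reported above.

```r
set.seed(42)
# Synthetic data: three well-separated blobs in two dimensions.
demo <- rbind(matrix(rnorm(40, mean = 0),  ncol = 2),
              matrix(rnorm(40, mean = 5),  ncol = 2),
              matrix(rnorm(40, mean = 10), ncol = 2))

fit <- kmeans(demo, centers = 3, nstart = 25)

# Share of the total sum of squares explained by the cluster separation.
ratio <- fit$betweenss / fit$totss
round(100 * ratio, 1)
```

#The total sum of squares splits into betweenss + tot.withinss, so a higher ratio means tighter, better separated clusters.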
#From the EDA we have seen that length and width of kernel are better choices.
df <- scale(df)
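#What scale() does, shown on a toy data frame (made-up values): each column is centred to mean 0 and rescaled to standard deviation 1, so attributes with large units such as area no longer dominate the distance used by k-means. The result is a matrix, not a data frame.

```r
# Toy data frame with hypothetical values on very different scales.
demo <- data.frame(area = c(12, 15, 18, 21),
                   compactness = c(0.85, 0.87, 0.88, 0.90))

demo_scaled <- scale(demo)

round(colMeans(demo_scaled), 10)   # column means after scaling
apply(demo_scaled, 2, sd)          # column standard deviations after scaling
class(demo_scaled)
```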
kmean_scale <- kmeans(df, centers = 3)
kmean_scale
#Both models (kmean and kmean_scale) are similar in terms of clusters. We can see our clustering worked better.
#Requirement 11: Attach the obtained clusters to the data
Clustering vector:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1
18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34
1 1 3 1 1 1 3 1 1 3 3 1 1 3 1 1 1
35 36 37 39 40 41 42 43 44 45 46 47 50 51 52 53 54
1 1 1 1 1 1 1 3 1 1 1 1 1 1 1 1 1
55 56 57 58 59 60 63 64 65 66 67 68 69 70 71 72 73
1 1 1 1 1 3 3 3 3 3 1 1 1 3 2 2 2
74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90
2 2 1 2 2 2 1 1 2 2 2 2 2 2 2 2 2
91 92 93 95 96 97 98 99 100 102 103 104 105 106 107 108 109
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126
2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 1 2
127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143
2 2 2 2 2 2 1 1 1 1 2 1 1 1 3 1 3
145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161
3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178
3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195
3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
196 198 199 200 202 203 204 205 207 208 209 210
3 3 3 3 3 3 3 3 3 1 3 3
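#A sketch of attaching the cluster labels to the data and comparing them with the known groups. The toy data, the column names length and width, and the use of two clusters are assumptions for illustration only. The cluster vector has one entry per clustered row, so it must be attached to the same (cleaned) data frame that was passed to kmeans.

```r
set.seed(7)
# Toy data: two hypothetical groups with different kernel sizes.
demo <- data.frame(length = c(rnorm(10, 5.5, 0.1), rnorm(10, 6.2, 0.1)),
                   width  = c(rnorm(10, 3.2, 0.1), rnorm(10, 3.7, 0.1)),
                   Group  = factor(rep(1:2, each = 10)))

fit <- kmeans(scale(demo[, c("length", "width")]), centers = 2, nstart = 25)

# Attach the cluster label of each row as a new column.
demo$cluster <- fit$cluster

# Cross-tabulate known groups against the clusters found.
table(Group = demo$Group, cluster = demo$cluster)
```

#Cluster numbers are arbitrary labels, so cluster 1 need not correspond to Group 1; read the table by looking at which cluster holds most rows of each group.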
#Requirement 12: Provide a general conclusion on the clustering of this data set, i.e., what can we learn from this data.
#Conclusion
#In general we have 3 groups, each containing a different type of seed. Group 2 has the biggest seeds and Group 3 has the smallest.
#Length and width are highly related to area and perimeter.
#Compactness is highest in Group 3, then Group 2, then Group 1.