Sol H109


Student name

Student id
Contents
Applicable Area: K-means Clustering on the Iris Dataset using WEKA
a. Real-life Scenario
b. Why K-means Algorithm with WEKA
Brainstorming and Rationale
Reasons for Selection
2. Application Process using WEKA
3. Potential Insights
4. Significance
Dataset Introduction: Iris Flower Dataset
a. Overview of the Dataset
Source of the Dataset
b. Potential Insights through K-means Clustering
1. Species Identification
2. Feature Analysis
3. Species Comparison
4. Data-driven Species Classification
5. Visual Representation
6. Generalization to Similar Datasets
Results of K-means Clustering on Iris dataset
a. Discuss and Interpret the Results
Cluster Centroids
Results of Hierarchical Clustering
b. Discuss the Novelty and Significance
References
Appendix of dataset
Applicable Area: K-means Clustering on the Iris Dataset using WEKA

a. Real-life Scenario
Imagine you are a data analyst at a botanical research institute. The institute has accumulated
a large body of measurements of iris flowers from different species, such as sepal length and
width and petal length and width. The aim is to discover latent patterns and clusters in the iris
dataset that reveal the natural variation among different types of iris flowers (Omelina, Goga,
Pavlovicova, Oravec, & Jansen, 2021).
b. Why K-means Algorithm with WEKA
Brainstorming and Rationale
K-means is a widely used clustering technique that divides a dataset into k distinct,
non-overlapping subsets, or clusters (Wu et al., 2019). Each data point is assigned to the cluster
whose mean (centroid) is nearest, which makes the method suitable for highlighting inherent
groupings in the iris dataset.
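The assign-to-nearest-mean idea can be sketched in a few lines of Python. This is a minimal illustration and not WEKA's implementation; the function names, the fixed seed, and the choice of six (petal length, petal width) pairs from the appendix are assumptions made for the example.

```python
import math
import random


def euclidean(a, b):
    """Straight-line distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(points, k, max_iter=100, seed=0):
    """Return (centroids, assignments) after repeated assign/update steps."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]  # random starting points
    assignments = [None] * len(points)
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster with the nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: euclidean(p, centroids[c])) for p in points
        ]
        if new_assignments == assignments:
            break  # no point changed cluster, so the centroids are stable
        assignments = new_assignments
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assignments


# Six (petal length, petal width) pairs from the appendix:
# rows 1, 3, 4 (Iris-setosa) and rows 101, 103, 104 (Iris-virginica).
sample = [(1.4, 0.2), (1.3, 0.2), (1.5, 0.2), (6.0, 2.5), (5.9, 2.1), (5.6, 1.8)]
centroids, assignments = kmeans(sample, k=2)
print(assignments)
```

On this small sample the three setosa flowers and the three virginica flowers end up in separate clusters. Cluster numbering depends on the random starting points, so only the grouping is meaningful, not which group is called 0 or 1.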
Reasons for Selection
Numerical Data Suitability
• The iris dataset consists of numerical measurements of sepal and petal dimensions.
K-means works well on numerical data, so it is a good fit for clustering on these
continuous features (Bouckaert et al., 2018).
Cluster Interpretability
• K-means assigns every flower to exactly one cluster, so the groups have clean boundaries
that are easy to interpret and classify. From a botanical perspective, this helps in
identifying different species.
Ease of Use in WEKA
• WEKA provides a user-friendly graphical interface for applying machine learning algorithms.
This suits data analysts who may not have advanced programming skills, since they can
explore and apply clustering techniques without writing code.
Scalability
• K-means is computationally efficient and scales to much larger datasets, so the
150-observation iris dataset is clustered at negligible computational cost.
Cluster Centroids Representativeness
• K-means generates centroids that represent the mean value of the data points in each cluster.
These give useful information about the typical flower in each group and help in
understanding which characteristics differ between types.

2. Application Process using WEKA


a. Data Preparation
The data is preprocessed to handle any missing values and formatted for clustering.
b. Applying K-means Algorithm
Choose the K-means clustering algorithm in WEKA and enter the number of clusters, K.
Apply the algorithm to split the iris dataset into clusters (bin Othman & Yau, 2007).
c. Visualization and Interpretation
Interpret the clustering results using WEKA’s visualization tools. Scatter plots can help
illustrate the separation of data points into different clusters.
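The data preparation step can be sketched as follows, assuming the measurements start out as plain rows and need to be written in ARFF, the text format WEKA reads. The function name `to_arff`, the attribute names, and the incomplete fourth row are hypothetical and only serve the illustration. Dropping incomplete rows is one possible policy; ARFF can instead mark a missing value with `?`.

```python
def to_arff(relation, rows):
    """Build ARFF text from (sepal_len, sepal_wid, petal_len, petal_wid, species) rows."""
    # Drop rows that contain a missing measurement (represented here as None).
    complete = [r for r in rows if all(v is not None for v in r)]
    species = sorted({r[4] for r in complete})
    lines = [f"@relation {relation}", ""]
    for name in ("sepallength", "sepalwidth", "petallength", "petalwidth"):
        lines.append(f"@attribute {name} numeric")
    # The species label is a nominal attribute listing its allowed values.
    lines.append("@attribute class {" + ",".join(species) + "}")
    lines += ["", "@data"]
    for r in complete:
        lines.append(",".join(str(v) for v in r))
    return "\n".join(lines) + "\n"


rows = [
    (5.1, 3.5, 1.4, 0.2, "Iris-setosa"),      # appendix row 1
    (7.0, 3.2, 4.7, 1.4, "Iris-versicolor"),  # appendix row 51
    (6.3, 3.3, 6.0, 2.5, "Iris-virginica"),   # appendix row 101
    (5.0, 3.4, 1.5, None, "Iris-setosa"),     # hypothetical row with a missing value
]
arff_text = to_arff("iris", rows)
print(arff_text)
```

The Id column from the appendix is left out on purpose, because it is a row counter rather than a measurement and would distort distance-based clustering.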

3. Potential Insights
Species Groupings
K-means clustering could reveal natural groupings that correspond to different species of iris
flowers, based on their sepal and petal measurements (Bouckaert et al., 2016).
Characteristic Features
Analysis of the cluster centroids can show typical features for each species and help identify
the most distinguishing characteristics.
Data-Driven Species Classification
Clustering can provide insights for data-driven species classification, improving the
institution’s ability to categorize iris flowers (Hall et al., 2009).

4. Significance
A valuable element of this research is the application of the K-means algorithm to the iris
dataset using WEKA. By applying clustering, the institution can reveal underlying patterns and
groupings in the dataset, which helps in understanding the variability among different species.
K-means is selected because it works well on numerical data, is easy to interpret, and is
scalable and efficient. The clustering ability of the algorithm is consistent with the
institution’s goal of categorizing iris flowers based on their morphological features
(Bowyer & Flynn, 2016).
In botany, where exact identification of plant species is critical, the clustering insights can
guide well-informed decisions. The cluster centroids may help researchers describe the typical
characteristics of each species, so that the groups are easier to classify from data-driven
observations.
WEKA’s intuitive interface makes the K-means algorithm straightforward to use for data analysts
and researchers with varying levels of technical ability. WEKA also offers visualizations of
clustering results that increase interpretability, allowing researchers to investigate and
understand the underlying structure of the iris dataset (Arora, 2012).
In summary, applying K-means with WEKA to the iris dataset not only helps in sorting flowers by
their measurements but also forms a basis for subsequent botanical study and categorization.
This data-driven approach supports the institution’s objective of advancing botanical knowledge
through careful analysis and interpretation of the iris dataset.

Dataset Introduction: Iris Flower Dataset


a. Overview of the Dataset
The dataset selected for analysis is the well-known Iris flower dataset, which is widely used in
machine learning and statistical analysis (De Marsico, Nappi, Riccio, & Wechsler, 2015).
It consists of 150 observations of iris flowers, with each observation belonging to one of three
species: setosa, versicolor, or virginica. The dataset includes four features (data variables)
measured in centimeters:
• Sepal Length: length of the sepal.
• Sepal Width: width of the sepal.
• Petal Length: length of the petal.
• Petal Width: width of the petal.
The dataset is commonly used in pattern recognition and classification tasks because the three
species differ measurably in these features: setosa is clearly separable from the other two,
while versicolor and virginica overlap somewhat. The objective is often to predict the species
of an iris flower from its measurements.
Source of the Dataset
Ronald A. Fisher, a British biologist and statistician, introduced the Iris flower dataset in
1936. It is now a classic dataset and is readily available in machine learning libraries such as
scikit-learn in Python. It is also available in the UCI Machine Learning Repository, a popular
repository of machine learning datasets used by researchers and practitioners.

b. Potential Insights through K-means Clustering


1. Species Identification
The primary insight from applying the K-means clustering algorithm to the Iris dataset is the
potential identification of natural groupings that correspond to the three species of iris flowers:
setosa, versicolor, and virginica. K-means clustering divides data into K groups, and for the
Iris dataset K would be set to three to reflect the number of species.
2. Feature Analysis
K-means clustering makes it possible to analyze average characteristics within each cluster.
We could look at the average sepal length, sepal width, petal length, and petal width for
every cluster. This analysis may identify characteristic features that differentiate one species
from another.
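A per-cluster feature average of this kind could be computed as in the sketch below. The function name `cluster_profiles` and the cluster labels are hypothetical; the four measurement rows are appendix rows 1, 2, 51, and 52.

```python
from collections import defaultdict

FEATURES = ("sepal length", "sepal width", "petal length", "petal width")


def cluster_profiles(rows, labels):
    """Average each feature over the rows that share a cluster label."""
    grouped = defaultdict(list)
    for row, label in zip(rows, labels):
        grouped[label].append(row)
    return {
        label: {
            name: sum(column) / len(members)
            for name, column in zip(FEATURES, zip(*members))
        }
        for label, members in grouped.items()
    }


rows = [
    (5.1, 3.5, 1.4, 0.2),  # appendix row 1 (Iris-setosa)
    (4.9, 3.0, 1.4, 0.2),  # appendix row 2 (Iris-setosa)
    (7.0, 3.2, 4.7, 1.4),  # appendix row 51 (Iris-versicolor)
    (6.4, 3.2, 4.5, 1.5),  # appendix row 52 (Iris-versicolor)
]
labels = [0, 0, 1, 1]  # hypothetical cluster assignments, for illustration only
profiles = cluster_profiles(rows, labels)
for label, profile in profiles.items():
    print(label, profile)
```

Comparing the two profiles shows the kind of contrast the analysis looks for: in this tiny sample the petal measurements differ far more between the groups than the sepal widths do.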
3. Species Comparison
With the K-means results, we can evaluate how well the clusters correspond to the actual species
labels in the dataset. Measuring how closely the clustering matches the known species categories
gives a quantitative assessment of the algorithm’s effectiveness.
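One way to quantify the match is to cross-tabulate clusters against species and compute purity, the share of flowers that belong to the majority species of their cluster. The sketch below assumes that approach; the function names and the nine example assignments are hypothetical and are not results from this study.

```python
from collections import Counter


def contingency(cluster_ids, species):
    """Count how many flowers of each species fall in each cluster."""
    table = {}
    for c, s in zip(cluster_ids, species):
        table.setdefault(c, Counter())[s] += 1
    return table


def purity(table):
    """Fraction of flowers that belong to the majority species of their cluster."""
    total = sum(sum(counts.values()) for counts in table.values())
    matched = sum(max(counts.values()) for counts in table.values())
    return matched / total


# Hypothetical assignments for nine flowers; one virginica lands in cluster 1.
cluster_ids = [0, 0, 0, 1, 1, 1, 1, 2, 2]
species = (
    ["Iris-setosa"] * 3 + ["Iris-versicolor"] * 3 + ["Iris-virginica"] * 3
)
table = contingency(cluster_ids, species)
score = purity(table)
print(table)
print(round(score, 3))
```

A purity of 1.0 would mean every cluster contains a single species. Purity alone rewards using many small clusters, so it is best read alongside the contingency table itself.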
4. Data-driven Species Classification
The formation of clusters that closely match the actual species labels would indicate that the
features used for clustering are highly indicative of iris species.
5. Visual Representation
Scatter plots of the clustered data points can also be generated to give an intuitive view of
how the iris flowers separate based on their features. This graphical display can help interpret
the outcomes and convey results to stakeholders.

Fig# Hierarchical Clusters


Fig# Filtered clusters
6. Generalization to Similar Datasets
The knowledge obtained from K-means clustering analysis of the Iris dataset can be applied to
other datasets in botany or plant biology. The methodology and findings could inform future
research on clustering or on plant species classification based on morphological
characteristics.

Results of K-means Clustering on Iris dataset


a. Discuss and Interpret the Results
Applying the K-means clustering algorithm to the Iris dataset reveals how iris flowers naturally
cluster according to their sepal and petal measurements. Here is a breakdown of the key
findings:
Cluster Centroids
The cluster centroids represent the average values of sepal length, sepal width, petal length,
and petal width for each group. Cluster 0 has a centroid similar to Iris-versicolor, while
Cluster 1 has a centroid resembling Iris-setosa.
Sepal Length: Cluster 0 (6.262) is longer than Cluster 1 (5.047).
Sepal Width: Cluster 0 is narrower than Cluster 1.
Petal Length: Cluster 0 measures 4.906, longer than Cluster 1.
Petal Width: Cluster 1 has a smaller petal width than Cluster 0.
Cluster Membership
The initial (random) starting points show two predominant clusters, mostly representing
Iris-versicolor. The final clustering shows a more distinct split into two clusters, one
resembling Iris-versicolor and the other Iris-setosa.
Within Cluster Sum of Squared Errors (WCSS)
WCSS measures how internally coherent the clusters are. In this instance, WCSS is 62.14, which
is the sum of squared distances between each data point and its assigned cluster centroid.
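The definition can be made concrete with a small sketch: for every point, take the squared Euclidean distance to the centroid of its cluster, then add these up. The function name `wcss` and the four toy points are assumptions chosen so the arithmetic is easy to check by hand; they are not the iris results. A figure reported by a tool may be computed on normalized attributes, so it need not match a calculation on raw centimeter values.

```python
def wcss(points, assignments, centroids):
    """Sum of squared distances from each point to its assigned centroid."""
    total = 0.0
    for point, cluster in zip(points, assignments):
        total += sum((x - c) ** 2 for x, c in zip(point, centroids[cluster]))
    return total


# Toy data: two clusters of two points each, with centroids at the cluster means.
points = [(1.0, 1.0), (3.0, 1.0), (10.0, 5.0), (10.0, 7.0)]
assignments = [0, 0, 1, 1]
centroids = [(2.0, 1.0), (10.0, 6.0)]

total = wcss(points, assignments, centroids)
print(total)  # each point is exactly 1 unit from its centroid, so 1 + 1 + 1 + 1 = 4.0
```

A lower value means tighter clusters, but WCSS always falls as K grows, so it is mainly useful for comparing runs with the same K or for spotting where further increases in K stop helping.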
Clustered Instances
Cluster 0 contains about 67 percent of the instances, while Cluster 1 contains about
33 percent.

Fig# Results of K-means clusters


Results of Hierarchical Clustering
Two clusters emerged when the hierarchical clustering model was applied to the Iris dataset:
33% of the instances are in Cluster 0 and 67% in Cluster 1. Clusters are arranged hierarchically
and represented by a dendrogram that describes the connections between data points. The
algorithm uses the Euclidean distance measure and applies the single-linkage method. The results
give a hierarchical view of the Iris dataset’s inherent structure and of the natural groupings
that arise from similarity in sepal and petal measurements. Hierarchical clustering thus makes
it possible to investigate cohesion within the dataset and to understand how iris flowers are
organized in a hierarchy.
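Single-linkage clustering with Euclidean distance can be sketched as repeated merging of the two clusters whose closest members are nearest to each other. The code below is a naive illustration under that definition, not WEKA's implementation; the function name and the six (petal length, petal width) pairs taken from the appendix are assumptions made for the example.

```python
import math


def single_linkage(points, n_clusters):
    """Merge the two closest clusters until n_clusters remain; return index lists."""
    clusters = [[i] for i in range(len(points))]  # start with one cluster per point

    def gap(a, b):
        # Single linkage: distance between the closest pair of members.
        return min(math.dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > n_clusters:
        # Find the pair of clusters separated by the smallest gap.
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda pair: gap(clusters[pair[0]], clusters[pair[1]]),
        )
        clusters[i] += clusters.pop(j)  # j > i, so index i is still valid after the pop
    return clusters


# Appendix rows 1, 3, 4 (Iris-setosa) and rows 101, 103, 104 (Iris-virginica).
sample = [(1.4, 0.2), (1.3, 0.2), (1.5, 0.2), (6.0, 2.5), (5.9, 2.1), (5.6, 1.8)]
clusters = single_linkage(sample, n_clusters=2)
print(clusters)
```

Recording the order and distance of each merge would give the information a dendrogram displays. Because single linkage only looks at the closest pair, it tends to chain neighbouring points together, which is one reason it can produce one large and one small cluster.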

b. Discuss the Novelty and Significance


Novelty
The novelty is that the iris flowers are cleanly separated into two clusters that do not follow
the three nominal species labels. The algorithm discovered latent patterns in the sepal and
petal measurements, indicating a natural grouping that may not be apparent from conventional
species identification.
Significance
Data-Driven Species Grouping
K-means clustering is data driven: it establishes clusters of iris flowers from morphological
characteristics alone. This can be valuable where traditional taxonomy is difficult to apply or
where other characteristics contribute more to the variability of a species.
Species Identification Challenges
This matters in cases where species have similar morphological features and may therefore be
misclassified. Clustering offers another perspective and may help avoid some shortcomings of
exact species recognition.
Potential Insights into Hybridization
The clustering results may point to hybridization or to within-species variants that appear
morphologically similar. This opens up a more nuanced view of iris flower diversity.
Practical Applications in Botany
Applying K-means clustering to the Iris dataset has practical implications, especially in
botany, where proper identification and classification of plant species are vital. A data-driven
approach can supplement classical taxonomy because it offers a more holistic view of plant
biodiversity (Aher & Lobo, 2011).
This is important because it may help with problems of species identification, yield more
information about hybridization, and contribute to richer knowledge of plant biodiversity. An
algorithmic approach is an efficient way for researchers to search datasets for unconventional
patterns that may prove useful from various scientific angles.

References
Aher, S. B., & Lobo, L. (2011). Data mining in educational system using WEKA. Paper presented
at the International conference on emerging technology trends (ICETT).
Arora, R. (2012). Comparative analysis of classification algorithms on different datasets using
WEKA. International Journal of Computer Applications, 54(13).
bin Othman, M. F., & Yau, T. M. S. (2007). Comparison of different classification techniques
using WEKA for breast cancer. Paper presented at the 3rd Kuala Lumpur International
Conference on Biomedical Engineering 2006: Biomed 2006, 11–14 December 2006 Kuala
Lumpur, Malaysia.
Bouckaert, R. R., Frank, E., Hall, M., Kirkby, R., Reutemann, P., Seewald, A., & Scuse, D.
(2016). WEKA manual for version 3-9-1. University of Waikato: Hamilton, New Zealand, 1-341.
Bouckaert, R. R., Frank, E., Hall, M., Kirkby, R., Reutemann, P., Seewald, A., & Scuse, D.
(2018). WEKA manual for version 3-8-3. The University of Waikato, 1-327.
Bowyer, K. W., & Flynn, P. J. (2016). The ND-IRIS-0405 iris image dataset. arXiv preprint
arXiv:1606.04853.
De Marsico, M., Nappi, M., Riccio, D., & Wechsler, H. (2015). Mobile iris challenge evaluation
(MICHE)-I, biometric iris dataset and protocols. Pattern Recognition Letters, 57, 17-23.
Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., & Witten, I. H. (2009). The
WEKA data mining software: an update. ACM SIGKDD explorations newsletter, 11(1), 10-18.
Omelina, L., Goga, J., Pavlovicova, J., Oravec, M., & Jansen, B. (2021). A survey of iris
datasets. Image and Vision Computing, 108, 104109.
Wu, Y., He, J., Ji, Y., Huang, G., Yao, H., Zhang, P., . . . Li, Y. (2019). Enhanced classification
models for iris dataset. Procedia Computer Science, 162, 946-954.

Appendix of dataset

Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm Species
1 5.1 3.5 1.4 0.2 Iris-setosa
2 4.9 3 1.4 0.2 Iris-setosa
3 4.7 3.2 1.3 0.2 Iris-setosa
4 4.6 3.1 1.5 0.2 Iris-setosa
5 5 3.6 1.4 0.2 Iris-setosa
6 5.4 3.9 1.7 0.4 Iris-setosa
7 4.6 3.4 1.4 0.3 Iris-setosa
8 5 3.4 1.5 0.2 Iris-setosa
9 4.4 2.9 1.4 0.2 Iris-setosa
10 4.9 3.1 1.5 0.1 Iris-setosa
11 5.4 3.7 1.5 0.2 Iris-setosa
12 4.8 3.4 1.6 0.2 Iris-setosa
13 4.8 3 1.4 0.1 Iris-setosa
14 4.3 3 1.1 0.1 Iris-setosa
15 5.8 4 1.2 0.2 Iris-setosa
16 5.7 4.4 1.5 0.4 Iris-setosa
17 5.4 3.9 1.3 0.4 Iris-setosa
18 5.1 3.5 1.4 0.3 Iris-setosa
19 5.7 3.8 1.7 0.3 Iris-setosa
20 5.1 3.8 1.5 0.3 Iris-setosa
21 5.4 3.4 1.7 0.2 Iris-setosa
22 5.1 3.7 1.5 0.4 Iris-setosa
23 4.6 3.6 1 0.2 Iris-setosa
24 5.1 3.3 1.7 0.5 Iris-setosa
25 4.8 3.4 1.9 0.2 Iris-setosa
26 5 3 1.6 0.2 Iris-setosa
27 5 3.4 1.6 0.4 Iris-setosa
28 5.2 3.5 1.5 0.2 Iris-setosa
29 5.2 3.4 1.4 0.2 Iris-setosa
30 4.7 3.2 1.6 0.2 Iris-setosa
31 4.8 3.1 1.6 0.2 Iris-setosa
32 5.4 3.4 1.5 0.4 Iris-setosa
33 5.2 4.1 1.5 0.1 Iris-setosa
34 5.5 4.2 1.4 0.2 Iris-setosa
35 4.9 3.1 1.5 0.1 Iris-setosa
36 5 3.2 1.2 0.2 Iris-setosa
37 5.5 3.5 1.3 0.2 Iris-setosa
38 4.9 3.1 1.5 0.1 Iris-setosa
39 4.4 3 1.3 0.2 Iris-setosa
40 5.1 3.4 1.5 0.2 Iris-setosa
41 5 3.5 1.3 0.3 Iris-setosa
42 4.5 2.3 1.3 0.3 Iris-setosa
43 4.4 3.2 1.3 0.2 Iris-setosa
44 5 3.5 1.6 0.6 Iris-setosa
45 5.1 3.8 1.9 0.4 Iris-setosa
46 4.8 3 1.4 0.3 Iris-setosa
47 5.1 3.8 1.6 0.2 Iris-setosa
48 4.6 3.2 1.4 0.2 Iris-setosa
49 5.3 3.7 1.5 0.2 Iris-setosa
50 5 3.3 1.4 0.2 Iris-setosa
51 7 3.2 4.7 1.4 Iris-versicolor
52 6.4 3.2 4.5 1.5 Iris-versicolor
53 6.9 3.1 4.9 1.5 Iris-versicolor
54 5.5 2.3 4 1.3 Iris-versicolor
55 6.5 2.8 4.6 1.5 Iris-versicolor
56 5.7 2.8 4.5 1.3 Iris-versicolor
57 6.3 3.3 4.7 1.6 Iris-versicolor
58 4.9 2.4 3.3 1 Iris-versicolor
59 6.6 2.9 4.6 1.3 Iris-versicolor
60 5.2 2.7 3.9 1.4 Iris-versicolor
61 5 2 3.5 1 Iris-versicolor
62 5.9 3 4.2 1.5 Iris-versicolor
63 6 2.2 4 1 Iris-versicolor
64 6.1 2.9 4.7 1.4 Iris-versicolor
65 5.6 2.9 3.6 1.3 Iris-versicolor
66 6.7 3.1 4.4 1.4 Iris-versicolor
67 5.6 3 4.5 1.5 Iris-versicolor
68 5.8 2.7 4.1 1 Iris-versicolor
69 6.2 2.2 4.5 1.5 Iris-versicolor
70 5.6 2.5 3.9 1.1 Iris-versicolor
71 5.9 3.2 4.8 1.8 Iris-versicolor
72 6.1 2.8 4 1.3 Iris-versicolor
73 6.3 2.5 4.9 1.5 Iris-versicolor
74 6.1 2.8 4.7 1.2 Iris-versicolor
75 6.4 2.9 4.3 1.3 Iris-versicolor
76 6.6 3 4.4 1.4 Iris-versicolor
77 6.8 2.8 4.8 1.4 Iris-versicolor
78 6.7 3 5 1.7 Iris-versicolor
79 6 2.9 4.5 1.5 Iris-versicolor
80 5.7 2.6 3.5 1 Iris-versicolor
81 5.5 2.4 3.8 1.1 Iris-versicolor
82 5.5 2.4 3.7 1 Iris-versicolor
83 5.8 2.7 3.9 1.2 Iris-versicolor
84 6 2.7 5.1 1.6 Iris-versicolor
85 5.4 3 4.5 1.5 Iris-versicolor
86 6 3.4 4.5 1.6 Iris-versicolor
87 6.7 3.1 4.7 1.5 Iris-versicolor
88 6.3 2.3 4.4 1.3 Iris-versicolor
89 5.6 3 4.1 1.3 Iris-versicolor
90 5.5 2.5 4 1.3 Iris-versicolor
91 5.5 2.6 4.4 1.2 Iris-versicolor
92 6.1 3 4.6 1.4 Iris-versicolor
93 5.8 2.6 4 1.2 Iris-versicolor
94 5 2.3 3.3 1 Iris-versicolor
95 5.6 2.7 4.2 1.3 Iris-versicolor
96 5.7 3 4.2 1.2 Iris-versicolor
97 5.7 2.9 4.2 1.3 Iris-versicolor
98 6.2 2.9 4.3 1.3 Iris-versicolor
99 5.1 2.5 3 1.1 Iris-versicolor
100 5.7 2.8 4.1 1.3 Iris-versicolor
101 6.3 3.3 6 2.5 Iris-virginica
102 5.8 2.7 5.1 1.9 Iris-virginica
103 7.1 3 5.9 2.1 Iris-virginica
104 6.3 2.9 5.6 1.8 Iris-virginica
105 6.5 3 5.8 2.2 Iris-virginica
106 7.6 3 6.6 2.1 Iris-virginica
107 4.9 2.5 4.5 1.7 Iris-virginica
108 7.3 2.9 6.3 1.8 Iris-virginica
109 6.7 2.5 5.8 1.8 Iris-virginica
110 7.2 3.6 6.1 2.5 Iris-virginica
111 6.5 3.2 5.1 2 Iris-virginica
112 6.4 2.7 5.3 1.9 Iris-virginica
113 6.8 3 5.5 2.1 Iris-virginica
114 5.7 2.5 5 2 Iris-virginica
115 5.8 2.8 5.1 2.4 Iris-virginica
116 6.4 3.2 5.3 2.3 Iris-virginica
117 6.5 3 5.5 1.8 Iris-virginica
118 7.7 3.8 6.7 2.2 Iris-virginica
119 7.7 2.6 6.9 2.3 Iris-virginica
120 6 2.2 5 1.5 Iris-virginica
121 6.9 3.2 5.7 2.3 Iris-virginica
122 5.6 2.8 4.9 2 Iris-virginica
123 7.7 2.8 6.7 2 Iris-virginica
124 6.3 2.7 4.9 1.8 Iris-virginica
125 6.7 3.3 5.7 2.1 Iris-virginica
126 7.2 3.2 6 1.8 Iris-virginica
127 6.2 2.8 4.8 1.8 Iris-virginica
128 6.1 3 4.9 1.8 Iris-virginica
129 6.4 2.8 5.6 2.1 Iris-virginica
130 7.2 3 5.8 1.6 Iris-virginica
131 7.4 2.8 6.1 1.9 Iris-virginica
132 7.9 3.8 6.4 2 Iris-virginica
133 6.4 2.8 5.6 2.2 Iris-virginica
134 6.3 2.8 5.1 1.5 Iris-virginica
135 6.1 2.6 5.6 1.4 Iris-virginica
136 7.7 3 6.1 2.3 Iris-virginica
137 6.3 3.4 5.6 2.4 Iris-virginica
138 6.4 3.1 5.5 1.8 Iris-virginica
139 6 3 4.8 1.8 Iris-virginica
140 6.9 3.1 5.4 2.1 Iris-virginica
141 6.7 3.1 5.6 2.4 Iris-virginica
142 6.9 3.1 5.1 2.3 Iris-virginica
143 5.8 2.7 5.1 1.9 Iris-virginica
144 6.8 3.2 5.9 2.3 Iris-virginica
145 6.7 3.3 5.7 2.5 Iris-virginica
146 6.7 3 5.2 2.3 Iris-virginica
147 6.3 2.5 5 1.9 Iris-virginica
148 6.5 3 5.2 2 Iris-virginica
149 6.2 3.4 5.4 2.3 Iris-virginica
150 5.9 3 5.1 1.8 Iris-virginica
