
Group Members:

Shahzaman (46146)

Sunil Kumar (42545)

Sant Saran (47969)

Shariq Khan (46458)

Qalandar Bux (48269)

M.Abdullah (47750)

BUSINESS ANALYTICS PROJECT


Introduction:

In this project we apply the methods and techniques studied in the Business Analytics course to
analyze a dataset. In the first phase of the project we created an account on kaggle.com, where
we could explore various datasets. We selected the "Iris species" problem to work on and
downloaded the dataset.

Dataset:

The Iris dataset was used in R.A. Fisher's classic 1936 paper, The Use of Multiple Measurements
in Taxonomic Problems, and can also be found on the UCI Machine Learning Repository. It
includes three iris species with 50 samples each as well as some properties about each flower.
One flower species is linearly separable from the other two, but the other two are not linearly
separable from each other.

 The columns in this dataset are:

Id

SepalLengthCm

SepalWidthCm

PetalLengthCm

PetalWidthCm

Species

CLUSTERING:

K-Means Clustering:

K-means clustering is a method of vector quantization that aims to partition n
observations into k clusters, in which each observation belongs to the cluster with
the nearest mean (the cluster center), which serves as a prototype of the cluster.
K-means is one of the most popular clustering algorithms. Clustering aims to group
the data so that objects in the same cluster are similar to each other, while
objects in different clusters differ substantially. Clustering is an unsupervised
learning method: it groups the data automatically, without using class labels.
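The assignment and update steps described above can be sketched in a few lines of R. This is only an illustrative sketch, not the method used in this project (which relies on R's built-in kmeans function). The helper name simple_kmeans, the seed, and the use of R's built-in iris data frame are assumptions made for the example.

```r
# Minimal sketch of k-means (Lloyd's algorithm).
# simple_kmeans is a hypothetical helper written for illustration only.
simple_kmeans <- function(x, k, max_iter = 100) {
  x <- as.matrix(x)
  # Start from k observations chosen at random
  centers <- x[sample(nrow(x), k), , drop = FALSE]
  cluster <- integer(nrow(x))
  for (iter in seq_len(max_iter)) {
    # Assignment step: squared Euclidean distance from every point to every center
    d <- sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centers[j, ])^2))
    new_cluster <- max.col(-d, ties.method = "first")  # index of the nearest center
    if (all(new_cluster == cluster)) break             # assignments stopped changing
    cluster <- new_cluster
    # Update step: move each center to the mean of the points assigned to it
    for (j in seq_len(k)) {
      if (any(cluster == j)) {
        centers[j, ] <- colMeans(x[cluster == j, , drop = FALSE])
      }
    }
  }
  list(cluster = cluster, centers = centers)
}

set.seed(1)                               # arbitrary seed, for reproducibility only
fit <- simple_kmeans(iris[, 3:4], 3)      # petal length and petal width
table(fit$cluster)                        # cluster sizes
```

A single random start like this can end in a poor local optimum, which is why the kmeans call used later in this report sets nstart = 20 and keeps the best of 20 starts.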

First, we downloaded and explored the Iris species dataset. We then decided to
segment the Iris data into clusters.

Summary of the four variables of the dataset:

The dataset contains 3 classes (setosa, versicolor, and virginica), with 50 instances of each
iris species.

           Sepal length   Sepal width   Petal length   Petal width
Min:              4.300         2.000          1.000         0.100
1st Qu:           5.100         2.800          1.600         0.300
Median:           5.800         3.000          4.350         1.300
Mean:             5.843         3.057          3.758         1.199
3rd Qu:           6.400         3.300          5.100         1.800
Max:              7.900         4.400          6.900         2.500

Variance of each variable:

Sepal length    0.6856935
Sepal width     0.1899794
Petal length    3.1162779
Petal width     0.5810063
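The summary statistics and the four per-variable values listed after them (which match the sample variances of the columns) can be reproduced with a short sketch like the following. It assumes R's built-in iris data frame, whose columns are named Sepal.Length, Sepal.Width, Petal.Length and Petal.Width rather than the Kaggle names listed earlier. The variable names measurements and variances are chosen here for illustration.

```r
# Assumption: the built-in iris data frame holds the same measurements as the Kaggle file
measurements <- iris[, 1:4]               # the four numeric columns

# Min, quartiles, median, mean and max for each column
summary(measurements)

# Sample variance of each column
variances <- sapply(measurements, var)
round(variances, 7)
```

Petal length has by far the largest variance, which is one reason the petal measurements are informative for separating the species.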

We applied the elbow method to the Iris dataset.

The elbow point: K (centers) = 3

K-means grouped the data into 3 clusters.

The within-cluster sum of squares drops sharply up to K = 3 and only slowly afterwards, so we
conclude that 3 is the best value of K for the final model.

Within-cluster sum of squares (wss) for k = 1 to 10:

 [1] 550.895333  86.390220  31.371359  19.465989  13.916909  11.025145
 [7]   9.185076   7.615402   6.456495   5.550520

plot(1:k.max, wss, type = "b", xlab = "Number of clusters (k)", ylab = "Within-cluster sum of squares")
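The plot call above assumes that k.max and wss already exist. The report does not show how they were computed, so the following is only a sketch of one common way to do it. It assumes the built-in iris data frame, the petal columns iris[, 3:4] that are also used for the final model, k.max = 10 (the report lists ten values), and an arbitrary seed. The variable name petals is chosen here for illustration.

```r
set.seed(123)                 # arbitrary seed; results for larger k can vary between runs
k.max <- 10                   # assumption: the report lists ten wss values
petals <- iris[, 3:4]         # petal length and petal width

# Total within-cluster sum of squares for k = 1, 2, ..., k.max
wss <- sapply(1:k.max, function(k) {
  kmeans(petals, centers = k, nstart = 20)$tot.withinss
})

round(wss, 3)
```

For k = 1 the value is simply the total sum of squares of the two petal columns, which is why the first number reported is 550.895. The elbow is the point after which adding another cluster reduces wss only slightly.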

Results:

Optimum value of K=3


icluster <- kmeans(iris[, 3:4], 3, nstart = 20)
table(icluster$cluster, iris$Species)

K means has grouped the data into three clusters.

     setosa   versicolor   virginica
1         0            2          46
2        50            0           0
3         0           48           4
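To put a number on how well the clusters agree with the species, one option is to label each cluster with its majority species and count the matches. The sketch below does this using the counts copied from the table above. The variable names tab, matched and purity are chosen here for illustration, and this agreement measure is an addition for clarity, not part of the original analysis.

```r
# Cross-tabulation copied from the table above (rows: clusters, columns: species)
tab <- matrix(c( 0,  2, 46,
                50,  0,  0,
                 0, 48,  4),
              nrow = 3, byrow = TRUE,
              dimnames = list(cluster = as.character(1:3),
                              species = c("setosa", "versicolor", "virginica")))

# Label each cluster with its majority species and count the matching flowers
matched <- sum(apply(tab, 1, max))   # 46 + 50 + 48 = 144
purity  <- matched / sum(tab)        # 144 / 150 = 0.96

matched
purity
```

All 50 setosa flowers fall in one cluster, while 6 versicolor and virginica flowers end up in the other species' cluster. This is consistent with the earlier remark that only one species is linearly separable from the other two.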

Conclusion:
We downloaded the Iris species dataset from Kaggle.com and applied the K-means clustering
technique to it. Clustering aims to group similar observations across the whole data space, and
we found the optimum number of clusters to be K = 3.
