Solution HW2
Question 3.1
Using the same data set (credit_card_data.txt or credit_card_data-headers.txt) as in Question 2.2, use
the ksvm or kknn function to find a good classifier:
A. using cross-validation (do this for the k-nearest-neighbors model; SVM is optional); and
3. The best k and the corresponding KNN k-fold cross-validation accuracy are as follows:
4. You can refer to the attached “3.1.a. KNN with K-fold.R” for details.
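The cross-validation step above can be sketched as follows. This is a minimal illustration, not the attached script: it assumes the data is read from credit_card_data.txt with the response in the last column (V11), that `cv.kknn` from the kknn package returns a list whose first element pairs observed and predicted values, and the range of k tested (1 to 20) is an assumption.

```r
# Sketch: k-fold cross-validation for KNN with the kknn package.
library(kknn)

data <- read.table("credit_card_data.txt", header = FALSE)
data$V11 <- as.factor(data$V11)   # response column in this data set

set.seed(42)
accuracies <- rep(0, 20)
for (k in 1:20) {
  # cv.kknn runs kcv-fold cross-validation for a given number of neighbors k
  cv <- cv.kknn(V11 ~ ., data, kcv = 10, k = k, scale = TRUE)
  # cv[[1]] is assumed to hold (observed, predicted) pairs for each point
  accuracies[k] <- mean(cv[[1]][, 1] == cv[[1]][, 2])
}
best_k <- which.max(accuracies)
cat("best k:", best_k, "  CV accuracy:", accuracies[best_k], "\n")
```

Because the folds are assigned randomly, the reported accuracy will vary slightly from run to run unless a seed is fixed.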
B. Splitting the data into training, validation, and test data sets (pick either KNN or SVM; the other
is optional).
I split the data into training, validation, and test data sets and then used an SVM model to find the
most accurate classifier.
1. 60% of the data was used for the training set, and the remaining 40% was divided equally
between the validation and test data sets. You can refer to the attached “3.1.b. SVM with train,
test, validate.R” for details.
2. Accuracy for C values (0.00001, 0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 1000) is as follows:
5. As expected, the true model accuracy (86.25%), as calculated on the test data set, is
lower than the validation accuracy (87.78%); the validation estimate includes random
effects because the validation set was used to select the model.
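The split-and-select procedure above can be sketched as follows. This is an illustrative sketch, not the attached script: it assumes kernlab's ksvm with a linear (vanilladot) kernel, the ten predictor columns followed by the response in column 11, and a fixed seed for the random split.

```r
# Sketch: 60/20/20 train/validation/test split and C selection with ksvm.
library(kernlab)

data <- read.table("credit_card_data.txt", header = FALSE)

set.seed(42)
n <- nrow(data)
train_idx <- sample(n, size = round(0.6 * n))
rest      <- setdiff(1:n, train_idx)
valid_idx <- sample(rest, size = round(0.5 * length(rest)))
test_idx  <- setdiff(rest, valid_idx)

train <- data[train_idx, ]; valid <- data[valid_idx, ]; test <- data[test_idx, ]

c_values <- c(0.00001, 0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 1000)
valid_acc <- sapply(c_values, function(C) {
  model <- ksvm(as.matrix(train[, 1:10]), as.factor(train[, 11]),
                type = "C-svc", kernel = "vanilladot", C = C, scaled = TRUE)
  mean(predict(model, as.matrix(valid[, 1:10])) == valid[, 11])
})

# Choose C on the validation set, then report accuracy on the untouched test set.
best_C <- c_values[which.max(valid_acc)]
final  <- ksvm(as.matrix(train[, 1:10]), as.factor(train[, 11]),
               type = "C-svc", kernel = "vanilladot", C = best_C, scaled = TRUE)
test_acc <- mean(predict(final, as.matrix(test[, 1:10])) == test[, 11])
cat("best C:", best_C, "  test accuracy:", test_acc, "\n")
```

Keeping the test set out of the selection loop is what makes its accuracy an unbiased estimate of true model performance.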
Question 4.1
Describe a situation or problem from your job, everyday life, current events, etc., for which a
clustering model would be appropriate. List some (up to 5) predictors that you might use.
Model: Clustering
Predictor variables:
Question 4.2
The iris data set iris.txt contains 150 data points, each with four predictor variables and one
categorical response. The predictors are the width and length of the sepal and petal of flowers,
and the response is the type of flower. The data is available from the R library datasets and can be
accessed with iris once the library is loaded. It is also available at the UCI Machine Learning
Repository (https://archive.ics.uci.edu/ml/datasets/Iris). The response values are only given to
see how well a specific method performed and should not be used to build the model.
Use the R function kmeans to cluster the points as well as possible. Report the best combination
of predictors, your suggested value of k, and how well your best clustering predicts flower type.
1. I first used the R function kmeans to cluster the points as well as possible, for k ranging from 1 to 10
with nstart = 5 and iter.max = 15.
2. I then calculated the WCSS (within-cluster sum of squares) corresponding to each k and stored
the results in the wcss vector.
3. I then created a plot to visualize the relationship between the number of clusters (k) and the total
within-cluster sum of squares (WCSS). The "elbow" point in the plot is used to select the best value
of k, which in this case is determined to be 3.
4. For the best combination of predictors, I calculated the within-cluster sum of squares for different
predictor combinations.
5. This results in the WCSS for the different combinations of predictors:
6. As the number of predictors increases, the dimensionality of the data increases and the clusters
become less compact, increasing the within-cluster sum of squares. The predictor combination with the
least WCSS (4.93) is Petal_width alone. However, the main goal of using k-means is to cluster the points
as well as possible, so I used all the predictors in the model.
7. I then performed k-means clustering with all predictors and k = 3, and created a contingency table to
see how well my best clustering predicts flower type.
8. The model performed well for Iris-setosa, but there was some misclassification for Iris-versicolor
and Iris-virginica:
For Iris-setosa, all 50 samples were correctly classified into Cluster 1.
For Iris-versicolor, 48 out of 50 samples were classified into Cluster 2, but 2 were classified into
Cluster 3.
For Iris-virginica, 36 out of 50 samples were classified into Cluster 3, but 14 were incorrectly
classified into Cluster 1.
9. You can refer to the attached “4.1 K-Means.R” for details.
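The whole workflow above (elbow plot, predictor-combination WCSS, and the final contingency table) can be sketched as follows. This is an illustrative sketch, not the attached script; it uses the built-in iris data, assumes k = 3 when comparing predictor combinations, and fixes a seed so the cluster assignments are reproducible.

```r
# Sketch of the k-means workflow on the built-in iris data.
library(datasets)
data(iris)
preds <- iris[, 1:4]   # four predictor columns; Species is the response

# Steps 1-3: total WCSS for k = 1..10, then pick k at the "elbow" of the plot.
set.seed(42)
wcss <- sapply(1:10, function(k) {
  kmeans(preds, centers = k, nstart = 5, iter.max = 15)$tot.withinss
})
plot(1:10, wcss, type = "b", xlab = "k", ylab = "total within-cluster SS")

# Steps 4-5: WCSS for every combination of predictors at k = 3.
for (m in 1:4) {
  combos <- combn(names(preds), m)
  for (j in 1:ncol(combos)) {
    cols <- combos[, j]
    w <- kmeans(preds[, cols, drop = FALSE], centers = 3,
                nstart = 5, iter.max = 15)$tot.withinss
    cat(paste(cols, collapse = " + "), ":", round(w, 2), "\n")
  }
}

# Step 7: final clustering with all predictors and k = 3, compared
# against the true species in a contingency table.
km <- kmeans(preds, centers = 3, nstart = 5, iter.max = 15)
print(table(km$cluster, iris$Species))
```

Note that kmeans labels clusters arbitrarily (cluster "1" need not correspond to setosa on every run), so the contingency table should be read by matching each cluster to its dominant species.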