
Question 3.

Using the same data set (credit_card_data.txt or credit_card_data-headers.txt) as in Question 2.2, use
the ksvm or kknn function to find a good classifier:

A. using cross-validation (do this for the k-nearest-neighbors model; SVM is optional); and

I ran k-fold cross-validation on the k-nearest-neighbors (KNN) model.


1. Number of folds used: 5.
2. Accuracy for k = 1 through 20 under k-fold cross-validation is as follows:

3. The best k and its corresponding k-fold cross-validation accuracy are as follows:

4. Refer to the attached “3.1.a. KNN with K-fold.R” for details.
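
The procedure above can be sketched roughly as follows. This is a minimal illustration, not the contents of the attached script: the helper name cv_knn_accuracy, the seed, and the synthetic demo data frame are assumptions made for the example, and the response is assumed to be a numeric 0/1 column so that rounding the fitted value yields a class.

```r
library(kknn)

# Mean k-fold cross-validation accuracy of kknn for each candidate k.
# `data` is a data frame; `response` names its numeric 0/1 response column.
# (Hypothetical helper, not taken from the attached script.)
cv_knn_accuracy <- function(data, response, k_values = 1:20, n_folds = 5, seed = 1) {
  set.seed(seed)
  # Randomly assign every row to one of n_folds roughly equal folds
  fold_id <- sample(rep(seq_len(n_folds), length.out = nrow(data)))
  model_formula <- as.formula(paste(response, "~ ."))

  sapply(k_values, function(k) {
    fold_acc <- sapply(seq_len(n_folds), function(f) {
      train <- data[fold_id != f, ]
      held_out <- data[fold_id == f, ]
      fit <- kknn(model_formula, train, held_out, k = k, scale = TRUE)
      # Round the fitted value to 0 or 1 and compare with the true label
      mean(round(fit$fitted.values) == held_out[[response]])
    })
    mean(fold_acc)
  })
}

# Synthetic stand-in data (assumption) so the sketch runs without the data file
set.seed(42)
demo <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
demo$y <- as.numeric(demo$x1 + demo$x2 > 0)

k_values <- 1:20
acc <- cv_knn_accuracy(demo, "y", k_values = k_values, n_folds = 5)
best_k <- k_values[which.max(acc)]
```

For the credit card data, the same function would presumably be called on the data frame read from credit_card_data-headers.txt, with the name of its response column passed as `response`.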

B. Splitting the data into training, validation, and test data sets (pick either KNN or SVM; the other
is optional).

I split the data into training, validation, and test sets and then used an SVM model to find
the most accurate classifier.
1. 60% of the data is used for the training set, and the remaining 40% is divided equally
between the validation and test sets. Refer to the attached “3.1.b. SVM with train,
test, validate.R” for details.
2. Accuracy for C values (0.00001, 0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 1000) is as follows:

3. The best C and its validation accuracy are as follows:

4. The test accuracy of the best model is as follows:

5. As expected, the true model accuracy (86.25%), calculated on the test data set, is
lower than the validation accuracy (87.78%). The validation figure is optimistic because
the best C was chosen on that same data set, so it includes random effects.

6. Coefficients and intercept of the final model:

7. Final equation of classifier:
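
The steps above can be sketched roughly as follows. This is an illustration under stated assumptions rather than the attached script: the helper names (split_indices, fit_svm, accuracy), the seeds, the linear kernel, and the synthetic two-predictor data are all assumptions made for the example. Note that with scaled = TRUE the recovered coefficients describe the hyperplane in scaled-predictor space.

```r
library(kernlab)

# Split row indices into 60% training, 20% validation, 20% test.
# (Hypothetical helper, not taken from the attached script.)
split_indices <- function(n, seed = 1) {
  set.seed(seed)
  shuffled <- sample(n)
  n_train <- floor(0.6 * n)
  n_valid <- floor(0.2 * n)
  list(train = shuffled[1:n_train],
       valid = shuffled[(n_train + 1):(n_train + n_valid)],
       test  = shuffled[(n_train + n_valid + 1):n])
}

# Fit a linear-kernel SVM classifier for a given C (kernel choice is an assumption)
fit_svm <- function(x, y, C) {
  ksvm(x, as.factor(y), type = "C-svc", kernel = "vanilladot", C = C, scaled = TRUE)
}

# Fraction of rows whose predicted label matches the true label
accuracy <- function(model, x, y) {
  mean(as.character(predict(model, x)) == as.character(y))
}

# Synthetic stand-in data (assumption) so the sketch runs without the data file
set.seed(7)
x <- matrix(rnorm(600), ncol = 2, dimnames = list(NULL, c("x1", "x2")))
y <- as.numeric(x[, 1] - x[, 2] > 0)
idx <- split_indices(nrow(x))

# Choose C on the validation set only
c_values <- 10^(-5:3)
valid_acc <- sapply(c_values, function(C) {
  accuracy(fit_svm(x[idx$train, ], y[idx$train], C), x[idx$valid, ], y[idx$valid])
})
best_C <- c_values[which.max(valid_acc)]

# Report the chosen model's accuracy on the untouched test set
final_model <- fit_svm(x[idx$train, ], y[idx$train], best_C)
test_acc <- accuracy(final_model, x[idx$test, ], y[idx$test])

# Classifier a1*x1 + ... + am*xm + a0 = 0 in scaled-predictor space
a  <- colSums(final_model@xmatrix[[1]] * final_model@coef[[1]])
a0 <- -final_model@b
```

Keeping the test set out of the loop over C is what makes test_acc an honest estimate; valid_acc for the winning C is biased upward because that C was selected on the validation rows.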


Question 4.1

Describe a situation or problem from your job, everyday life, current events, etc., for which a
clustering model would be appropriate. List some (up to 5) predictors that you might use.

Problem Statement: A company came across market research showing that its
products are significantly higher in value and lower in price than
competitors’ products in the market. The company wants to use this
opportunity to implement price increases for all its customers. Before doing so,
management wants to group customers into specific clusters based on their
price sensitivity and stickiness. Once the clusters are identified,
management can implement focused, informed price increases and reduce
negative effects of a price increase, such as churn.

Model: Clustering

Predictor variables:

1. Number of products bought by the customer (indicates customer stickiness).
2. Number of contracts held by the customer (indicates customer stickiness).
3. Business ($) given by the customer in terms of ARR (indicates affordability,
price sensitivity, and customer stickiness).
4. Number of trading partners (shows how deeply integrated the product is
with the customer’s customers).
5. Percentage increase or decrease in customer activity on the platform over the
last 3 months (shows changing customer preference).

Question 4.2

The iris data set iris.txt contains 150 data points, each with four predictor variables and one
categorical response. The predictors are the width and length of the sepal and petal of flowers,
and the response is the type of flower. The data is available from the R library datasets and can be
accessed with iris once the library is loaded. It is also available at the UCI Machine Learning
Repository (https://archive.ics.uci.edu/ml/datasets/Iris ). The response values are only given to
see how well a specific method performed and should not be used to build the model.

Use the R function kmeans to cluster the points as well as possible. Report the best combination
of predictors, your suggested value of k, and how well your best clustering predicts flower type.

1. I first used the R function kmeans to cluster the points for k ranging from 1 to 10,
with nstart = 5 and iter.max = 15.
2. I then calculated the WCSS (within-cluster sum of squares) for each k and stored
the results in the wcss vector.

3. I then created a plot to visualize the relationship between the number of clusters (k) and the total
within-cluster sum of squares (WCSS). The "elbow" point in the plot is used to select the best value
of k, which in this case is determined to be 3.

4. To find the best combination of predictors, I calculated the within-cluster sum of squares for
different predictor combinations.
5. This gives the WCSS for each combination of predictors:

6. As the number of predictors increases, each point's squared distance to its cluster center sums over
more dimensions, so the within-cluster sum of squares increases and clusters appear less compact. By this
measure the best combination of predictors is Petal_width alone, with the lowest WCSS of 4.93. However,
the main goal of using k-means is to cluster the points as well as possible, hence I will use all the
predictors in the model.

7. I then performed k-means clustering with all predictors and k = 3 and created a contingency table to
see how well my best clustering predicts flower type.
8. The model performed well for Iris-setosa, but some Iris-versicolor and Iris-virginica samples were
misclassified:
For Iris-setosa, all 50 samples were correctly classified into Cluster 1.
For Iris-versicolor, 48 out of 50 samples were classified into Cluster 2, but 2 were classified into
Cluster 3.
For Iris-virginica, 36 out of 50 samples were classified into Cluster 3, but 14 were incorrectly
classified into Cluster 2.
Overall, 134 of 150 samples (89.3%) fell into the cluster matching their species.
9. More details are in the attached “4.1 K-Means.R” file.
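
The workflow above can be sketched roughly as follows, using the iris data frame that ships with R. This is a minimal illustration and not the attached script: the seed and variable names are assumptions, and because kmeans starts from random centers, cluster numbering and exact values can differ between runs.

```r
# iris ships with base R (datasets package); columns 1:4 are the predictors
predictors <- iris[, 1:4]

set.seed(1)
k_values <- 1:10

# Total within-cluster sum of squares (WCSS) for each k
wcss <- sapply(k_values, function(k) {
  kmeans(predictors, centers = k, nstart = 5, iter.max = 15)$tot.withinss
})

# Elbow plot: look for the k after which WCSS stops dropping sharply
plot(k_values, wcss, type = "b",
     xlab = "Number of clusters (k)", ylab = "Total WCSS")

# WCSS at k = 3 for every non-empty subset of the four predictors
subsets <- unlist(
  lapply(1:4, function(m) combn(names(predictors), m, simplify = FALSE)),
  recursive = FALSE
)
subset_wcss <- sapply(subsets, function(cols) {
  kmeans(predictors[, cols, drop = FALSE], centers = 3,
         nstart = 5, iter.max = 15)$tot.withinss
})
names(subset_wcss) <- sapply(subsets, paste, collapse = " + ")
sort(subset_wcss)

# Final clustering on all predictors, compared against the true species.
# The species labels are used only to evaluate, never to fit.
final <- kmeans(predictors, centers = 3, nstart = 5, iter.max = 15)
confusion <- table(Species = iris$Species, Cluster = final$cluster)
confusion
```

Since WCSS sums squared distances over every predictor included, the values in subset_wcss are not on a common scale across subsets of different sizes, which is one reason to judge the final clustering with the contingency table instead.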
