Question 2.2


The files credit_card_data.txt (without headers) and credit_card_data-headers.txt (with headers) contain a dataset with 654 data points, 6 continuous and 4 binary predictor variables. It has anonymized credit card applications with a binary response variable (last column) indicating if the application was positive or negative. The dataset is the “Credit Approval Data Set” from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/Credit+Approval) without the categorical variables and without data points that have missing values.

1. Using the support vector machine function ksvm contained in the R package kernlab, find a
good classifier for this data. Show the equation of your classifier, and how well it classifies the
data points in the full data set. (Don’t worry about test/validation data yet; we’ll cover that
topic soon.)

To find a good classifier for this data, we first explore values of the cost parameter C, the soft-margin trade-off that ksvm uses to penalize misclassified points. Exploring C in increments of 100 between 10 and 2000, we find as follows.

Exploring at increments of 10 between 10 and 1000, we find as follows.

Reviewing the two charts, the highest prediction accuracy appears to lie between C = 10 and C = 200. Reviewing values between 10 and 200 at increments of 5, we find as follows.

Reviewing values between 1 and 10 at increments of 1, we find as follows.

Based on these findings, the initially provided C = 100 can be used for ksvm with vanilladot, since prediction accuracy shows no significant deviation anywhere between C = 10 and C = 200.
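A minimal sketch of the C sweep described above (file name and column layout taken from the problem statement; the grid endpoints are the ones explored here):

library(kernlab)

data <- read.table("credit_card_data-headers.txt", header = TRUE)

# Fit ksvm with a given cost C and return full-data prediction accuracy (%)
accuracy_for_C <- function(C) {
  model <- ksvm(as.matrix(data[, 1:10]), as.factor(data[, 11]),
                type = "C-svc", kernel = "vanilladot", C = C, scaled = TRUE)
  pred <- predict(model, data[, 1:10])
  sum(pred == data[, 11]) / nrow(data) * 100
}

# Coarse grid first; the finer passes rerun this with seq(10, 1000, 10),
# seq(10, 200, 5), and 1:10
sapply(seq(10, 2000, by = 100), accuracy_for_C)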

Equation of Classifier (the data set has ten predictors, so the final coefficient is read as belonging to x10; the intercept a0 is recovered as -model@b in the sketch below):

-0.001006535 x1 - 0.001172905 x2 - 0.001626197 x3 + 0.00300642 x4 + 1.004941 x5 - 0.002825943 x6 + 0.0002600295 x7 - 0.0005349551 x8 - 0.001228376 x9 + 0.1063634 x10 + a0 = 0

Prediction Accuracy: 86.39144 %
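For reference, a minimal sketch of how these coefficients and the accuracy can be extracted from the fitted model (using kernlab's standard xmatrix, coef, and b slots):

library(kernlab)

data <- read.table("credit_card_data-headers.txt", header = TRUE)

model <- ksvm(as.matrix(data[, 1:10]), as.factor(data[, 11]),
              type = "C-svc", kernel = "vanilladot", C = 100, scaled = TRUE)

# Hyperplane coefficients a1..a10 and intercept a0
a  <- colSums(model@xmatrix[[1]] * model@coef[[1]])
a0 <- -model@b

# Prediction accuracy on the full data set, as a percentage
pred <- predict(model, data[, 1:10])
sum(pred == data[, 11]) / nrow(data) * 100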


2. You are welcome, but not required, to try other (nonlinear) kernels as well; we’re not covering
them in this course, but they can sometimes be useful and might provide better predictions
than vanilladot.

[1,] "rbfdot" "95.1070336391437"


[2,] "polydot" "86.3914373088685"
[3,] "vanilladot" "86.3914373088685"
[4,] "tanhdot" "72.17125382263"
[5,] "laplacedot" "100"
[6,] "besseldot" "92.5076452599388"
[7,] "anovadot" "90.6727828746177"
[8,] "splinedot" "97.8593272171254"

3. Using the k-nearest-neighbors classification function kknn contained in the R kknn package, suggest a good value of k, and show how well it classifies the data points in the full data set. Don’t forget to scale the data (scale=TRUE in kknn).
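One way to produce per-k accuracies like those below is a leave-one-out style loop: for each point, the model is trained on the other 653 points and asked to classify the held-out point. A minimal sketch (assuming the response column in the headers file is named R1):

library(kknn)

data <- read.table("credit_card_data-headers.txt", header = TRUE)

# Full-data accuracy (%) for a given k, leaving each point out of its own fit
accuracy_for_k <- function(k) {
  preds <- rep(0, nrow(data))
  for (i in 1:nrow(data)) {
    # Train on every row except i so the point cannot vote for itself
    model <- kknn(R1 ~ ., train = data[-i, ], test = data[i, ],
                  k = k, scale = TRUE)
    preds[i] <- as.integer(fitted(model) + 0.5)  # round fitted fraction to 0/1
  }
  sum(preds == data[, 11]) / nrow(data) * 100
}

sapply(1:50, accuracy_for_k)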

 k    Accuracy (%)
 1    81.49847
 2    81.49847
 3    81.49847
 4    81.49847
 5    85.16820
 6    84.55657
 7    84.70948
 8    84.86239
 9    84.70948
10    85.01529
11    85.16820
12    85.32110
13    85.16820
14    85.16820
15    85.32110
16    85.16820
17    85.16820
18    85.16820
19    85.01529
20    85.01529
21    84.86239
22    84.70948
23    84.40367
24    84.55657
25    84.55657
26    84.40367
27    84.09786
28    83.79205
29    83.94495
30    84.09786
31    83.79205
32    83.63914
33    83.48624
34    83.33333
35    83.18043
36    83.18043
37    83.18043
38    83.18043
39    83.18043
40    83.18043
41    83.18043
42    83.48624
43    83.48624
44    83.63914
45    83.94495
46    84.09786
47    83.79205
48    83.94495
49    83.94495
50    83.79205

Based on these results, I recommend a k value of 12: it is the smallest number of neighbors that attains the highest accuracy found (85.32%, tied with k = 15), which optimizes both computational efficiency and model accuracy.
