CIS242 HW4 Ans

CIS242_HW4_Ans
March 18, 2021
0.1 CIS 242

0.2 Spring 2021
0.3 HOMEWORK ASSIGNMENT 4
Please compile your responses using markdown in your Jupyter notebook to answer the questions. If
you prefer, you may also submit a Word or PDF document with the responses along with the PDF or
HTML version of the completed notebook.
Active notebooks (.ipynb files) or raw code (.py files) will NOT be accepted and no
points will be given. The code part of the files will not be graded, but they will be checked if
necessary to verify your findings and recommendations. Point deductions may occur if there are
major discrepancies between your written answers and the output from the code.
Please organize your notebook to have the homework responses at the top and the working code
underneath.
Each question is worth 2 points for a total of 24 points.
0.4 NAME:
0.5 Predicting Loans for Universal Bank
Remember: You were recently hired as a data scientist by Universal Bank. The bank’s Vice
President is interested in learning what predictive analytics can do for the bank. She supplies you
with a dataset containing information on a sample of customers. Here is a description of each
variable in the Universal Bank dataset:
• Age: Customer’s age in years
• Experience: Number of years of professional work experience
• Income: Annual income in thousands of dollars ($000)
• Zip code: Zip code of home address
• Family: Customer’s family size
• CC Avg: Average spending on credit cards per month in thousands of dollars ($000)
• Education: Education level where 1 = Undergraduate; 2 = Graduate; and 3 = Advanced/Professional
• Mortgage: Value of house mortgage if any; in thousands of dollars ($000)
• Personal.Loan: Did the customer accept a personal loan offered in the bank’s last campaign?
1=Yes; 0 = No.
• Securities.Account: Does the customer have a securities account with the bank? 1 = Yes; 0
= No.
• CD.Account: Does the customer have a certificate of deposit (CD) account with the bank? 1
= Yes; 0 = No.
• Online: Does the customer use Internet banking facilities? 1 = Yes; 0 = No.
• Credit.Card: Does the customer use a credit card issued by Universal Bank? 1 = Yes; 0 =
No.
You have already investigated these data and built a logistic regression model to help understand
the personal loan market. Now the VP wants to investigate other predictive models.
1. You are asked to create a NB model to predict personal loans. What variables will
you use? Why?
[1]: # code or markdown here
import pandas as pd
data = pd.read_csv('D:/station/FEB/JIHAO/HW4/UniversalBank.csv')
data.head()
abs(data.corr()) # find correlation
[1]: Age Experience Income ZIP Code Family \

Age 1.000000 0.994215 0.055269 0.029216 0.046418
Experience 0.994215 1.000000 0.046574 0.028626 0.052563
Income 0.055269 0.046574 1.000000 0.016410 0.157501
ZIP Code 0.029216 0.028626 0.016410 1.000000 0.011778
Family 0.046418 0.052563 0.157501 0.011778 1.000000
CCAvg 0.052030 0.050089 0.645993 0.004068 0.109285
Education 0.041334 0.013152 0.187524 0.017377 0.064929
Mortgage 0.012539 0.010582 0.206806 0.007383 0.020445
Personal Loan 0.007726 0.007413 0.502462 0.000107 0.061367
Securities Account 0.000436 0.001232 0.002616 0.004704 0.019994
CD Account 0.008043 0.010353 0.169738 0.019972 0.014110
Online 0.013702 0.013898 0.014206 0.016990 0.010354
CreditCard 0.007681 0.008967 0.002385 0.007691 0.011588
CCAvg Education Mortgage Personal Loan \

Age 0.052030 0.041334 0.012539 0.007726
Experience 0.050089 0.013152 0.010582 0.007413
Income 0.645993 0.187524 0.206806 0.502462
ZIP Code 0.004068 0.017377 0.007383 0.000107
Family 0.109285 0.064929 0.020445 0.061367
CCAvg 1.000000 0.136138 0.109909 0.366891
Education 0.136138 1.000000 0.033327 0.136722
Mortgage 0.109909 0.033327 1.000000 0.142095
Personal Loan 0.366891 0.136722 0.142095 1.000000
Securities Account 0.015087 0.010812 0.005411 0.021954
CD Account 0.136537 0.013934 0.089311 0.316355
Online 0.003620 0.015004 0.005995 0.006278
CreditCard 0.006686 0.011014 0.007231 0.002802
Securities Account CD Account Online CreditCard
Age 0.000436 0.008043 0.013702 0.007681
Experience 0.001232 0.010353 0.013898 0.008967
Income 0.002616 0.169738 0.014206 0.002385
ZIP Code 0.004704 0.019972 0.016990 0.007691
Family 0.019994 0.014110 0.010354 0.011588
CCAvg 0.015087 0.136537 0.003620 0.006686
Education 0.010812 0.013934 0.015004 0.011014
Mortgage 0.005411 0.089311 0.005995 0.007231
Personal Loan 0.021954 0.316355 0.006278 0.002802
Securities Account 1.000000 0.317034 0.012627 0.015028
CD Account 0.317034 1.000000 0.175880 0.278644
Online 0.012627 0.175880 1.000000 0.004210
CreditCard 0.015028 0.278644 0.004210 1.000000
[2]: [['Experience', 'Age'], ['Income', 'CCAvg'],

['Income', 'Mortgage'], ['Securities Account', 'CD Account'],
['CD Account', 'CreditCard']]
[2]: [['Experience', 'Age'],

['Income', 'CCAvg'],
['Income', 'Mortgage'],
['Securities Account', 'CD Account'],
['CD Account', 'CreditCard']]
The pairs above have an absolute correlation above 0.2. Because NB assumes that the features are
conditionally independent given the class, we decide to drop one variable from each pair. We also
remove ZIP Code to avoid misuse. However, depending on the NB model we select, we still have to
choose between continuous and categorical variables in the next step.
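The code that produced the list of pairs is not shown in the notebook. A minimal sketch of how such pairs could be listed from a correlation matrix follows. The helper name `correlated_pairs` and the toy data are assumptions for illustration, not the author's code or the bank data.

```python
import numpy as np
import pandas as pd

def correlated_pairs(df, threshold=0.2, exclude=()):
    """Return feature pairs whose absolute correlation exceeds threshold."""
    corr = df.drop(columns=list(exclude)).corr().abs()
    cols = list(corr.columns)
    pairs = []
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:          # upper triangle only, no self-pairs
            if corr.loc[a, b] > threshold:
                pairs.append([a, b])
    return pairs

# Toy data (hypothetical): b is a noisy copy of a, c is independent of both.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
toy = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.1, size=500),
    "c": rng.normal(size=500),
    "target": rng.integers(0, 2, size=500),
})

# The target is excluded because only feature-to-feature correlation matters here.
pairs = correlated_pairs(toy, threshold=0.2, exclude=["target"])
print(pairs)
```

On the bank data, the equivalent call would pass `exclude=['Personal Loan']` so that correlations with the target do not appear in the list.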
[3]: data_nb = data.copy()

for i in ['Age', 'CCAvg', 'Mortgage', 'Securities Account', 'CreditCard', 'ZIP Code']:
    del data_nb[i] # drop variables
data_nb # variables in models
[3]: Experience Income Family Education Personal Loan CD Account Online

0 1 49 4 1 0 0 0
1 19 34 3 1 0 0 0
2 15 11 1 1 0 0 0
3 9 100 1 2 0 0 0
4 8 45 4 2 0 0 0
… … … … … … … …
4995 3 40 1 3 0 0 1
4996 4 15 4 1 0 0 1
4997 39 24 2 3 0 0 0
4998 40 49 3 2 0 0 1
4999 4 83 3 1 0 0 1
[5000 rows x 7 columns]
2. Do you need to make any changes to the data? If so, what do you change and why?
If not, why not?

First, we are going to use CategoricalNB, so we keep only the categorical variables and convert
the multi-level ones (Family and Education) into dummy variables.
[4]: cols = ['Family', 'Education', 'Personal Loan',

'CD Account', 'Online'] # selected cols
[5]: m_nb_df = data_nb[cols].copy()

m_nb_df = pd.get_dummies(m_nb_df,
                         columns=['Family', 'Education'],
                         drop_first=True) # get dummy
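To make the effect of `drop_first=True` concrete, here is a small sketch on made-up rows (the values are hypothetical, not taken from the bank data). The first level of the encoded column becomes the baseline and is represented by all dummy columns being zero.

```python
import pandas as pd

# Hypothetical rows: Family has four levels, Online is already binary.
toy = pd.DataFrame({"Family": [1, 2, 3, 4, 2], "Online": [0, 1, 1, 0, 1]})

# Family level 1 is dropped and becomes the implicit baseline.
encoded = pd.get_dummies(toy, columns=["Family"], drop_first=True)
print(encoded.columns.tolist())
print(encoded.astype(int))
```

Columns that are not listed in `columns=` are passed through unchanged, which is why the binary variables need no further treatment.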
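3. Train a NB model. What set up or parameter choices did you make? Why?
We split the data into training and test sets (80%/20%, with random_state=1234 for reproducibility)
so that we can check the model's predictive performance on unseen data. CategoricalNB is fit with
its default parameters.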

from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.naive_bayes import MultinomialNB, GaussianNB, CategoricalNB
X = m_nb_df.copy()
y = X.pop('Personal Loan')
X_train, X_test, y_train, y_test = train_test_split(X, y, # train test split
                                                    test_size=0.2,
                                                    random_state=1234)
print(y_train.value_counts()/len(y_train))
print(y_test.value_counts()/len(y_test))
NB = CategoricalNB()
NB.fit(X_train, y_train)
0 0.90625
1 0.09375
Name: Personal Loan, dtype: float64
0 0.895
1 0.105
[6]: CategoricalNB()
4. Test the accuracy of your model. What is the overall accuracy of the model? What
did it predict well? What did it not predict well?
print("accuracy of NB: " + str(metrics.accuracy_score(y_test, NB.predict(X_test))))
pd.crosstab(NB.predict(X_test), y_test, rownames=["Predicted"], colnames=["Actual"])
accuracy of NB: 0.902
[7]: Actual 0 1
Predicted
0 883 86
1 12 19
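The overall accuracy is 0.902, only slightly above the 0.895 obtained by always predicting "no loan".
The model predicts well for customers who did not take a loan (883 of 895 correct), but poorly for
customers who did: only 19 of the 105 actual loan takers are identified.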
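The figures behind this reading can be recomputed directly from the crosstab above. The sketch below copies the four counts by hand; the variable names are illustrative only.

```python
import numpy as np

# Rows are the predicted class, columns are the actual class, as in the crosstab.
cm = np.array([[883, 86],
               [12, 19]])

total = cm.sum()
accuracy = (cm[0, 0] + cm[1, 1]) / total      # correct predictions over all cases
recall_no_loan = cm[0, 0] / cm[:, 0].sum()    # actual 0 that were predicted 0
recall_loan = cm[1, 1] / cm[:, 1].sum()       # actual 1 that were predicted 1
baseline = cm[:, 0].sum() / total             # accuracy of always predicting 0

print(f"accuracy         = {accuracy:.3f}")
print(f"recall (no loan) = {recall_no_loan:.3f}")
print(f"recall (loan)    = {recall_loan:.3f}")
print(f"baseline         = {baseline:.3f}")
```

Because the classes are imbalanced (about 10% loan takers), the recall on the loan class says more about the model's usefulness than the overall accuracy does.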
5. Train at least two more versions of the model. What changes did you make? Why?
What was the outcome?

In the first version, we use continuous data to fit a GaussianNB.
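[8]: cols = ['Experience', 'Income', 'Personal Loan', 'Family']

X = data_nb[cols].copy()
y = X.pop('Personal Loan') # separate the target from the features
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=1234)
print(y_train.value_counts()/len(y_train))
print(y_test.value_counts()/len(y_test))
NB = GaussianNB()
NB.fit(X_train, y_train)
pd.crosstab(NB.predict(X_test), y_test, rownames=["Predicted"], colnames=["Actual"])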
0 0.90625
1 0.09375
0 0.895
1 0.105
[8]: Actual 0 1
Predicted
0 850 61
1 45 44
Lastly, we convert the continuous variables into categorical ones and fit a MultinomialNB.
[9]: import matplotlib.pyplot as plt # EDA for encoding numerical variables

for cl in X.columns:
    plt.figure()
    plt.hist(X[cl])
    plt.title(cl)
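[10]: X = data_nb.copy()
y = X.pop('Personal Loan') # separate the target from the features
X.Experience = pd.qcut(X.Experience, q=10, labels=False)
X.Income = pd.qcut(X.Income, q=3, labels=False)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=1234)
print(y_train.value_counts()/len(y_train))
print(y_test.value_counts()/len(y_test))
NB = MultinomialNB()
NB.fit(X_train, y_train)
pd.crosstab(NB.predict(X_test), y_test, rownames=["Predicted"], colnames=["Actual"])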
0 0.90625
1 0.09375
0 0.895
1 0.105
[10]: Actual 0 1
Predicted
0 892 93
1 3 12
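By overall accuracy, MultinomialNB performs best (0.904, versus 0.902 for CategoricalNB and 0.894
for GaussianNB). However, it identifies only 12 of the 105 actual loan takers, while GaussianNB
identifies 44.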
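The quantile binning used for Experience and Income can be illustrated on a handful of values. This is a sketch on made-up incomes, assuming the same `pd.qcut(..., labels=False)` call as in the notebook; each bin receives roughly the same number of observations.

```python
import pandas as pd

# Hypothetical incomes in $000, sorted only to make the bins easy to read.
income = pd.Series([8, 15, 24, 34, 40, 45, 49, 83, 100], name="Income")

# q=3 cuts at the 1/3 and 2/3 quantiles; labels=False returns integer bin codes.
income_bin = pd.qcut(income, q=3, labels=False)

print(pd.DataFrame({"Income": income, "bin": income_bin}))
print(income_bin.value_counts().sort_index())
```

The integer codes are non-negative, which is what MultinomialNB requires of its inputs.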
6. Now you are asked to create a KNN model to predict personal loans. What
variables will you use? Why?

KNN does not make assumptions about correlation or distribution, so we can use all variables except
ZIP Code, which we drop to avoid misuse. However, we have to change the categorical variables to
dummy variables.
[11]: del data['ZIP Code']
7. Do you need to make any changes to the data? How is this similar or different to
the NB model? Why?

As with NB, we still need to transform categorical data into dummy variables. Unlike NB, however,
KNN can use continuous and categorical (dummy) variables together.
[12]: data = pd.get_dummies(data, columns=['Education'], drop_first=True)
8. Train a KNN model. What set up or parameter choices did you make? Why?

The parameter to be set is K. We do not know in advance which K is best; we only know it should be
neither too large nor too small, so we have to try different values.
[13]: from sklearn.neighbors import KNeighborsClassifier

import numpy as np
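X = data.copy()
y = X.pop('Personal Loan') # separate the target from the features
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=1234)
error_rate = []
for i in range(3,21): # choose k
    knn = KNeighborsClassifier(n_neighbors=i, metric = "euclidean")
    knn.fit(X_train,y_train)
    pred_i = knn.predict(X_test)
    error_rate.append(np.mean(pred_i != y_test))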
plt.figure(figsize=(10,6))
plt.plot(range(3,21),error_rate,color='blue', linestyle='dashed', marker='o',
markerfacecolor='red', markersize=10)
plt.title('Error Rate vs. K Value')
plt.xlabel('K')
plt.ylabel('Error Rate')
[13]: Text(0, 0.5, 'Error Rate')
[14]: ids = np.where(error_rate == min(error_rate))[0][0]
opt_k = [i for i in range(3,21)][ids] # get optimal k
opt_k
[14]: 14
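Picking K by the test-set error means the test set is no longer an independent check of the final model. One common alternative, sketched here on synthetic data from `make_classification` (an assumption; this is not the bank data and not the author's procedure), is to choose K by cross-validation on the training set and to touch the test set only once at the end.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic, imbalanced stand-in for the loan data (about 10% positives).
X, y = make_classification(n_samples=600, n_features=6, n_informative=4,
                           n_redundant=0, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

k_values = range(3, 21)
cv_error = []
for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k)
    # Mean 5-fold accuracy on the training data only.
    acc = cross_val_score(knn, X_train, y_train, cv=5, scoring="accuracy").mean()
    cv_error.append(1 - acc)

# Smallest K with the lowest cross-validated error.
best_k = k_values[cv_error.index(min(cv_error))]

final = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train)
test_accuracy = final.score(X_test, y_test)
print("chosen k:", best_k)
print("test accuracy:", test_accuracy)
```

The same loop structure as in the notebook is kept, so the error-rate plot can be drawn from `cv_error` in the same way.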
9. Test the accuracy of your model. What is the overall accuracy of the model? What
did it predict well? What did it not predict well?
[15]: knn = KNeighborsClassifier(n_neighbors=opt_k)
knn.fit(X_train, y_train)
print("accuracy of KNN: " + str(metrics.accuracy_score(y_test, knn.predict(X_test))))
pd.crosstab(knn.predict(X_test), y_test, rownames=["Predicted"], colnames=["Actual"])
accuracy of KNN: 0.902
[15]: Actual 0 1
Predicted
0 885 88
1 10 17
10. What changes can you make that might affect the KNN model? Train at least
two more models. What is the result?
In the first version, we scale our data.
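[16]: from sklearn.preprocessing import MinMaxScaler

X = data.copy()
y = X.pop('Personal Loan') # separate the target from the features
X = MinMaxScaler().fit_transform(X) # scaling
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=1234)
error_rate = []
for i in range(3,21):
    knn = KNeighborsClassifier(n_neighbors=i, metric = "euclidean")
    knn.fit(X_train,y_train)
    pred_i = knn.predict(X_test)
    error_rate.append(np.mean(pred_i != y_test))
ids = np.where(error_rate == min(error_rate))[0][0]
opt_k = [i for i in range(3,21)][ids]
opt_k
[16]: 3
[17]: knn = KNeighborsClassifier(n_neighbors=opt_k)
knn.fit(X_train, y_train)
print("accuracy of KNN: " + str(metrics.accuracy_score(y_test, knn.predict(X_test))))
pd.crosstab(knn.predict(X_test), y_test, rownames=["Predicted"], colnames=["Actual"])
accuracy of KNN: 0.956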
[17]: Actual 0 1
Predicted
0 891 40
1 4 65
In the second version, we use Manhattan distance.
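[18]: error_rate = []
for i in range(3,21):
    knn = KNeighborsClassifier(n_neighbors=i, metric = "manhattan") # different distance
    knn.fit(X_train,y_train)
    pred_i = knn.predict(X_test)
    error_rate.append(np.mean(pred_i != y_test))
ids = np.where(error_rate == min(error_rate))[0][0]
opt_k = [i for i in range(3,21)][ids]
opt_k
[18]: 7
[19]: knn = KNeighborsClassifier(n_neighbors=opt_k)
knn.fit(X_train, y_train)
print("accuracy of KNN: " + str(metrics.accuracy_score(y_test, knn.predict(X_test))))
pd.crosstab(knn.predict(X_test), y_test, rownames=["Predicted"], colnames=["Actual"])
accuracy of KNN: 0.946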
[19]: Actual 0 1
Predicted
0 891 50
1 4 55
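The optimal K varies across versions. With scaled data, however, the result is better in terms of
accuracy: it rises from 0.902 on the unscaled data to 0.956 and 0.946 in the two scaled versions, and
the number of correctly identified loan takers rises from 17 to 65 and 55.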
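Why scaling changes the result can be seen from the distances themselves. The sketch below uses three hypothetical customers with one large-range variable (Income in $000) and one binary variable (Online); the values are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical customers: [Income in $000, Online (0/1)]
X = np.array([[40.0, 0.0],    # A
              [50.0, 1.0],    # B: similar income, different Online value
              [100.0, 0.0]])  # C: very different income, same Online value

def dist(a, b):
    """Euclidean distance between two rows."""
    return float(np.linalg.norm(a - b))

raw_ab, raw_ac = dist(X[0], X[1]), dist(X[0], X[2])

# Min-max scaling maps every column to the range [0, 1].
Xs = MinMaxScaler().fit_transform(X)
scaled_ab, scaled_ac = dist(Xs[0], Xs[1]), dist(Xs[0], Xs[2])

print(f"raw:    d(A,B) = {raw_ab:.3f}, d(A,C) = {raw_ac:.3f}")
print(f"scaled: d(A,B) = {scaled_ab:.3f}, d(A,C) = {scaled_ac:.3f}")
```

On the raw values Income dominates the distance and the binary variable barely matters; after scaling, a difference in Online counts as much as the full Income range, so the nearest neighbor of A can change.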
11. What are the relative strengths and weaknesses of the various models? Mention
at least 3 points.

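• NB is a good choice when all predictors are categorical.
• A single NB model cannot mix data types: categorical and continuous predictors need different NB variants.
• KNN requires trying different values of K.
• KNN is sensitive to the scale of the variables.
• KNN can use mixed data types.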
12. What would you recommend to the VP? Why? Remember, you have 3 different
models (and their variants) to choose from.

If we combine the predictions of all three models (logistic regression, NB, and KNN), either
weighting them equally or giving priority to one of them, we may get better predictions.
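One way such a combination could be set up is scikit-learn's `VotingClassifier`. The sketch below is an assumption about how the idea might be implemented, not something the notebook does; it uses synthetic data, and the member models and their settings (GaussianNB, K = 3, equal weights) are illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic, imbalanced stand-in for the loan data (about 10% positives).
X, y = make_classification(n_samples=600, n_features=6, n_informative=4,
                           n_redundant=0, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

ensemble = VotingClassifier(
    estimators=[
        ("logit", LogisticRegression(max_iter=1000)),
        ("nb", GaussianNB()),
        # KNN is scale sensitive, so it gets its own scaling step.
        ("knn", make_pipeline(MinMaxScaler(), KNeighborsClassifier(n_neighbors=3))),
    ],
    voting="soft",       # average the predicted class probabilities
    weights=[1, 1, 1],   # equal weighting; raise one entry to prioritize a model
)
ensemble.fit(X_train, y_train)
pred = ensemble.predict(X_test)
print("ensemble test accuracy:", ensemble.score(X_test, y_test))
```

Changing `weights` to, for example, `[1, 1, 2]` would give the KNN member twice the influence of the others.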
8.1 Quick survey:

Please take a minute to provide feedback on how you think the course is going so far using this link
The responses are anonymous and will be helpful for me to understand how I can try to make sure
you get the most out of the class. Please respond by the time the homework is due on Thursday,
March 4 at 11am EST.
8.1.1 If we get a response rate of 60% or above, everyone gets an extra 1 point on
this assignment! If we get a response rate of 75% or above, everyone gets an
extra 2 points. That means that 60% (or 75%) of the class provides honest
feedback and not 60% (or 75%) of the class says nice things. :-)
8.2 Working code below
[ ]: