
CIS242_HW4_Ans

March 18, 2021

0.1 CIS 242


0.2 Spring 2021
0.3 HOMEWORK ASSIGNMENT 4
Please compile your responses using markdown in your Jupyter notebook to answer the questions. If
you prefer, you may also submit a Word or PDF document with the responses along with the PDF or
HTML version of the completed notebook.

Active notebooks (.ipynb files) or raw code (.py files) will NOT be accepted and no
points will be given. The code part of the files will not be graded, but they will be checked if
necessary to verify your findings and recommendations. Point deductions may occur if there are
major discrepancies between your written answers and the output from the code.
Please organize your notebook to have the homework responses at the top and the working code
underneath.
Each question is worth 2 points for a total of 24 points.

0.4 NAME:
0.5 Predicting Loans for Universal Bank
Remember: You were recently hired as a data scientist by Universal Bank. The bank’s Vice
President is interested in learning what predictive analytics can do for the bank. She supplies you
with a dataset containing information on a sample of customers. Here is a description of each
variable in the Universal Bank dataset:
• Age: Customer’s age in years
• Experience: Number of years of professional work experience
• Income: Annual income in thousands of dollars ($000)
• Zip code: Zip code of home address
• Family: Customer’s family size
• CC Avg: Average spending on credit cards per month in thousands of dollars ($000)
• Education: Education level where 1 = Undergraduate; 2 = Graduate; and 3=Ad-
vanced/Professional
• Mortgage: Value of house mortgage, if any, in thousands of dollars ($000)
• Personal.Loan: Did the customer accept a personal loan offered in the bank’s last campaign?
1=Yes; 0 = No.
• Securities.Account: Does the customer have a securities account with the bank? 1 = Yes; 0
= No.

• CD.Account: Does the customer have a certificate of deposit (CD) account with the bank? 1
= Yes; 0 = No.
• Online: Does the customer use Internet banking facilities? 1 = Yes; 0 = No.
• Credit.Card: Does the customer use a credit card issued by Universal Bank? 1 = Yes; 0 =
No.
You have already investigated these data and built a logistic regression model to help understand
the personal loan market. Now the VP wants to investigate other predictive models.

1. You are asked to create a NB model to predict personal loans. What variables will
you use? Why?
[1]: # code or markdown here
import pandas as pd
data = pd.read_csv('D:/station/FEB/JIHAO/HW4/UniversalBank.csv')
data.head()
abs(data.corr()) # find correlation

[1]: Age Experience Income ZIP Code Family \


Age 1.000000 0.994215 0.055269 0.029216 0.046418
Experience 0.994215 1.000000 0.046574 0.028626 0.052563
Income 0.055269 0.046574 1.000000 0.016410 0.157501
ZIP Code 0.029216 0.028626 0.016410 1.000000 0.011778
Family 0.046418 0.052563 0.157501 0.011778 1.000000
CCAvg 0.052030 0.050089 0.645993 0.004068 0.109285
Education 0.041334 0.013152 0.187524 0.017377 0.064929
Mortgage 0.012539 0.010582 0.206806 0.007383 0.020445
Personal Loan 0.007726 0.007413 0.502462 0.000107 0.061367
Securities Account 0.000436 0.001232 0.002616 0.004704 0.019994
CD Account 0.008043 0.010353 0.169738 0.019972 0.014110
Online 0.013702 0.013898 0.014206 0.016990 0.010354
CreditCard 0.007681 0.008967 0.002385 0.007691 0.011588

CCAvg Education Mortgage Personal Loan \


Age 0.052030 0.041334 0.012539 0.007726
Experience 0.050089 0.013152 0.010582 0.007413
Income 0.645993 0.187524 0.206806 0.502462
ZIP Code 0.004068 0.017377 0.007383 0.000107
Family 0.109285 0.064929 0.020445 0.061367
CCAvg 1.000000 0.136138 0.109909 0.366891
Education 0.136138 1.000000 0.033327 0.136722
Mortgage 0.109909 0.033327 1.000000 0.142095
Personal Loan 0.366891 0.136722 0.142095 1.000000
Securities Account 0.015087 0.010812 0.005411 0.021954
CD Account 0.136537 0.013934 0.089311 0.316355
Online 0.003620 0.015004 0.005995 0.006278
CreditCard 0.006686 0.011014 0.007231 0.002802

Securities Account CD Account Online CreditCard

Age 0.000436 0.008043 0.013702 0.007681
Experience 0.001232 0.010353 0.013898 0.008967
Income 0.002616 0.169738 0.014206 0.002385
ZIP Code 0.004704 0.019972 0.016990 0.007691
Family 0.019994 0.014110 0.010354 0.011588
CCAvg 0.015087 0.136537 0.003620 0.006686
Education 0.010812 0.013934 0.015004 0.011014
Mortgage 0.005411 0.089311 0.005995 0.007231
Personal Loan 0.021954 0.316355 0.006278 0.002802
Securities Account 1.000000 0.317034 0.012627 0.015028
CD Account 0.317034 1.000000 0.175880 0.278644
Online 0.012627 0.175880 1.000000 0.004210
CreditCard 0.015028 0.278644 0.004210 1.000000

[2]: [['Experience', 'Age'], ['Income', 'CCAvg'],


['Income', 'Mortgage'], ['Securities Account', 'CD Account'],
['CD Account', 'CreditCard']]

[2]: [['Experience', 'Age'],


['Income', 'CCAvg'],
['Income', 'Mortgage'],
['Securities Account', 'CD Account'],
['CD Account', 'CreditCard']]

The pairs above have an absolute correlation above 0.2. Because NB assumes that features are
conditionally independent given the class, we decide to drop one variable from each correlated pair.
We also remove ZIP Code to avoid misuse. The choice between continuous and categorical variables
depends on the NB variant we select next.
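The list of correlated pairs could also be produced programmatically rather than read off the matrix. The following is a minimal sketch on hypothetical toy data; the helper name `high_corr_pairs` and the columns `x`, `y`, `z` are illustrative assumptions and are not part of the original notebook. It is meant to be applied to the predictor columns only, with the target dropped.

```python
import pandas as pd


def high_corr_pairs(df, threshold=0.2):
    """Return [a, b] column pairs whose absolute correlation exceeds threshold."""
    corr = df.corr().abs()
    cols = list(corr.columns)
    pairs = []
    for i, a in enumerate(cols):
        # Only look at columns after `a` so each pair is reported once.
        for b in cols[i + 1:]:
            if corr.loc[a, b] > threshold:
                pairs.append([a, b])
    return pairs


# Hypothetical toy data: y moves with x, z is unrelated to both.
toy = pd.DataFrame({
    "x": [1, 2, 3, 4],
    "y": [2, 4, 6, 9],
    "z": [1, -1, -1, 1],
})
pairs = high_corr_pairs(toy, threshold=0.2)
print(pairs)
```

On the homework data, the same helper could be called on `data.drop(columns=['Personal Loan'])`, assuming the column names shown in the correlation output.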

[3]: data_nb = data.copy()

for i in ['Age', 'CCAvg', 'Mortgage', 'Securities Account', 'CreditCard',
          'ZIP Code']:
    del data_nb[i] # drop variables

data_nb # variables in models

[3]: Experience Income Family Education Personal Loan CD Account Online


0 1 49 4 1 0 0 0
1 19 34 3 1 0 0 0
2 15 11 1 1 0 0 0
3 9 100 1 2 0 0 0
4 8 45 4 2 0 0 0
… … … … … … … …
4995 3 40 1 3 0 0 1
4996 4 15 4 1 0 0 1
4997 39 24 2 3 0 0 0
4998 40 49 3 2 0 0 1
4999 4 83 3 1 0 0 1

[5000 rows x 7 columns]

2. Do you need to make any changes to the data? If so, what do you change and why?
If not, why not?



First, we are going to use CategoricalNB, so we must use categorical data. We convert the
multi-level categorical variables (Family and Education) into dummy variables.
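To make the dummy-variable step concrete, here is a small sketch on a hypothetical four-row frame (the values are made up for illustration). With `drop_first=True`, the first level of each encoded column becomes the baseline and gets no column of its own.

```python
import pandas as pd

# Hypothetical toy rows using two of the homework's column names.
toy = pd.DataFrame({"Education": [1, 2, 3, 2], "Online": [0, 1, 1, 0]})

# Education has three levels, so two dummy columns remain after drop_first.
encoded = pd.get_dummies(toy, columns=["Education"], drop_first=True)
print(list(encoded.columns))
print(encoded.astype(int))
```

Binary columns such as Online are already 0/1 and need no further encoding.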

[4]: cols = ['Family', 'Education', 'Personal Loan',
        'CD Account', 'Online'] # selected cols

[5]: m_nb_df = data_nb[cols].copy()

m_nb_df = pd.get_dummies(m_nb_df,
                         columns=['Family', 'Education'],
                         drop_first=True) # get dummy

3. Train a NB model. What set up or parameter choices did you make? Why?

We split the data into training and test sets (80/20) so that we can check the model's predictive
performance on data it has not seen.

[6]: # code or markdown here

from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.naive_bayes import MultinomialNB, GaussianNB, CategoricalNB

X = m_nb_df.copy()
y = X.pop('Personal Loan')
X_train, X_test, y_train, y_test = train_test_split(X, y, # train test split
                                                    test_size=0.2,
                                                    random_state=1234)

print(y_train.value_counts()/len(y_train))
print(y_test.value_counts()/len(y_test))
NB = CategoricalNB()
NB.fit(X_train, y_train)

0 0.90625
1 0.09375
Name: Personal Loan, dtype: float64
0 0.895
1 0.105
Name: Personal Loan, dtype: float64

[6]: CategoricalNB()
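The class shares printed above differ slightly between the training set (about 9.4% loan takers) and the test set (10.5%). One option, which is an assumption here and is not used in the original notebook, is to pass `stratify=y` so that both sets keep the same class proportions. A minimal sketch on hypothetical labels:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical labels with a 10% positive rate, standing in for Personal Loan.
y = np.array([1] * 100 + [0] * 900)
X = np.arange(len(y)).reshape(-1, 1)

# stratify=y keeps the share of each class equal in the train and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1234, stratify=y
)
print(y_train.mean(), y_test.mean())
```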

4. Test the accuracy of your model. What is the overall accuracy of the model? What
did it predict well? What did it not predict well?

[7]: # code or markdown here
print("accuracy of NB: " + str(metrics.accuracy_score(y_test, NB.predict(X_test))))

pd.crosstab(NB.predict(X_test), y_test, rownames=["Predicted"], colnames=["Actual"])

accuracy of NB: 0.902

[7]: Actual 0 1
Predicted
0 883 86
1 12 19

The overall accuracy is 0.902. The model predicts customers who do not take a loan well (883 of 895
correct), but it predicts loan takers poorly (only 19 of 105 correct).
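Overall accuracy hides how the model does on each class. As a sketch, the per-class hit rates can be computed directly from the crosstab counts shown above (rows are predictions, columns are actual values); the variable names are illustrative.

```python
import pandas as pd

# Counts copied from the crosstab above (rows: predicted, columns: actual).
cm = pd.DataFrame([[883, 86], [12, 19]], index=[0, 1], columns=[0, 1])

accuracy = (cm.loc[0, 0] + cm.loc[1, 1]) / cm.values.sum()
recall_no_loan = cm.loc[0, 0] / cm[0].sum()  # share of actual 0s predicted as 0
recall_loan = cm.loc[1, 1] / cm[1].sum()     # share of actual 1s predicted as 1

print(round(accuracy, 3), round(recall_no_loan, 3), round(recall_loan, 3))
```

Because roughly 90% of customers did not take a loan, a model can reach about 0.90 accuracy while finding few of the loan takers.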

5. Train at least two more versions of the model. What changes did you make? Why?
What was the outcome?



In the first version, we use continuous data to fit a GaussianNB.

[8]: cols = ['Experience', 'Income', 'Personal Loan', 'Family']

X = data_nb[cols].copy()
y = X.pop('Personal Loan')
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=1234)

print(y_train.value_counts()/len(y_train))
print(y_test.value_counts()/len(y_test))
NB = GaussianNB()
NB.fit(X_train, y_train)
print("accuracy of NB: " + str(metrics.accuracy_score(y_test, NB.predict(X_test))))

pd.crosstab(NB.predict(X_test), y_test, rownames=["Predicted"], colnames=["Actual"])

0 0.90625
1 0.09375
Name: Personal Loan, dtype: float64
0 0.895
1 0.105
Name: Personal Loan, dtype: float64
accuracy of NB: 0.894

[8]: Actual 0 1
Predicted
0 850 61
1 45 44

Lastly, we convert the continuous variables (Experience and Income) into quantile bins and fit a
MultinomialNB.
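As a small illustration of quantile binning, `pd.qcut` with `labels=False` replaces each value with the index of its quantile bin. The income values below are hypothetical.

```python
import pandas as pd

# Nine hypothetical income values, split into three equal-sized quantile bins.
income = pd.Series([10, 20, 30, 40, 50, 60, 70, 80, 90], name="Income")
income_bin = pd.qcut(income, q=3, labels=False)

print(income_bin.tolist())  # [0, 0, 0, 1, 1, 1, 2, 2, 2]
```

Each bin holds roughly the same number of customers, which is what distinguishes `pd.qcut` from equal-width binning with `pd.cut`.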

[9]: import matplotlib.pyplot as plt # EDA for encoding numerical variables

for cl in X.columns:
    plt.figure()
    plt.hist(X[cl])
    plt.title(cl)

[10]: X = data_nb.copy()
X.Experience = pd.qcut(X.Experience, q=10, labels=False)
X.Income = pd.qcut(X.Income, q=3, labels=False)
y = X.pop('Personal Loan')
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=1234)

print(y_train.value_counts()/len(y_train))
print(y_test.value_counts()/len(y_test))
NB = MultinomialNB()
NB.fit(X_train, y_train)
print("accuracy of NB: " + str(metrics.accuracy_score(y_test, NB.predict(X_test))))

pd.crosstab(NB.predict(X_test), y_test, rownames=["Predicted"], colnames=["Actual"])

0 0.90625
1 0.09375
Name: Personal Loan, dtype: float64
0 0.895
1 0.105
Name: Personal Loan, dtype: float64
accuracy of NB: 0.904

[10]: Actual 0 1
Predicted
0 892 93
1 3 12

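By overall accuracy, MultinomialNB performs best (0.904, versus 0.902 for CategoricalNB and 0.894
for GaussianNB). However, it identifies only 12 of the 105 actual loan takers in the test set, whereas
GaussianNB identifies 44.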

6. Now you are asked to create a KNN model to predict personal loans. What
variables will you use? Why?



KNN does not make assumptions about correlation or distribution, so we can use all variables except
ZIP Code, which we drop to avoid misuse. However, we have to change the categorical ones to dummy
variables.

[11]: del data['ZIP Code']

7. Do you need to make any changes to the data? How is this similar or different to
the NB model? Why?



We still need to transform categorical data into dummy variables. Unlike with NB, however, we can
use mixed (continuous and categorical) data.

[12]: data = pd.get_dummies(data, columns=['Education'], drop_first=True)

8. Train a KNN model. What set up or parameter choices did you make? Why?



The parameter to set is K. We do not know in advance which K is best, only that it should be neither
too large nor too small, so we have to try different values.
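One way to try different values of K without tuning on the test set is cross-validation on the training data. The sketch below uses scikit-learn's `GridSearchCV` on synthetic data from `make_classification`. The data and the 5-fold setting are assumptions for illustration, not the approach taken in the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic, imbalanced stand-in for the bank data (about 10% positives).
X, y = make_classification(n_samples=500, n_features=6,
                           weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1234, stratify=y
)

# Try K = 3..20 using 5-fold cross-validation on the training set only.
search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": list(range(3, 21))},
    cv=5,
    scoring="accuracy",
)
search.fit(X_train, y_train)

best_k = search.best_params_["n_neighbors"]
test_accuracy = search.score(X_test, y_test)  # test set is touched once, at the end
print(best_k, test_accuracy)
```

Keeping the test set out of the search means the final accuracy is an honest estimate for the chosen K.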

[13]: from sklearn.neighbors import KNeighborsClassifier
import numpy as np

X = data.copy()
y = X.pop('Personal Loan')
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=1234)

error_rate = []
for i in range(3,21): # choose k
    knn = KNeighborsClassifier(n_neighbors=i, metric = "euclidean")
    knn.fit(X_train,y_train)
    pred_i = knn.predict(X_test)
    error_rate.append(np.mean(pred_i != y_test))
plt.figure(figsize=(10,6))
plt.plot(range(3,21),error_rate,color='blue', linestyle='dashed', marker='o',
         markerfacecolor='red', markersize=10)
plt.title('Error Rate vs. K Value')
plt.xlabel('K')
plt.ylabel('Error Rate')

[13]: Text(0, 0.5, 'Error Rate')

[14]: ids = np.where(error_rate == min(error_rate))[0][0]
opt_k = [i for i in range(3,21)][ids] # get optimal k
opt_k

[14]: 14

9. Test the accuracy of your model. What is the overall accuracy of the model? What
did it predict well? What did it not predict well?
[15]: knn = KNeighborsClassifier(n_neighbors=opt_k)
knn.fit(X_train,y_train)
print("accuracy of KNN: " + str(metrics.accuracy_score(y_test, knn.predict(X_test))))

pd.crosstab(knn.predict(X_test), y_test, rownames=["Predicted"], colnames=["Actual"])

accuracy of KNN: 0.902

[15]: Actual 0 1
Predicted
0 885 88
1 10 17

10. What changes can you make that might affect the KNN model? Train at least
two more models. What is the result?

In the first version, we scale our data.

[16]: from sklearn.preprocessing import MinMaxScaler

X = data.copy()
y = X.pop('Personal Loan')

X = MinMaxScaler().fit_transform(X) # scaling

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=1234)

error_rate = []
for i in range(1,21):
    knn = KNeighborsClassifier(n_neighbors=i, metric = "euclidean")
    knn.fit(X_train,y_train)
    pred_i = knn.predict(X_test)
    error_rate.append(np.mean(pred_i != y_test))
ids = np.where(error_rate == min(error_rate))[0][0]
# NOTE: the loop above tries K = 1..20, but the index is mapped through
# range(3, 21), so the value reported here is the lowest-error K plus 2.
opt_k = [i for i in range(3,21)][ids]
opt_k

[16]: 3

[17]: knn = KNeighborsClassifier(n_neighbors=opt_k)
knn.fit(X_train,y_train)
print("accuracy of KNN: " + str(metrics.accuracy_score(y_test, knn.predict(X_test))))

pd.crosstab(knn.predict(X_test), y_test, rownames=["Predicted"], colnames=["Actual"])

accuracy of KNN: 0.956

[17]: Actual 0 1
Predicted
0 891 40
1 4 65

In the second version, we use Manhattan distance.
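For reference, the two distance metrics differ only in how per-feature differences are combined. A minimal sketch with two hypothetical points:

```python
import numpy as np

# Two hypothetical points in a two-feature space.
a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])

euclidean = np.sqrt(np.sum((a - b) ** 2))  # straight-line distance
manhattan = np.sum(np.abs(a - b))          # sum of absolute differences

print(euclidean, manhattan)  # 5.0 7.0
```

Manhattan distance weights every unit of difference equally, while Euclidean distance is dominated by the feature with the largest difference.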

[18]: error_rate = []
for i in range(1,21):
    knn = KNeighborsClassifier(n_neighbors=i, metric = "manhattan") # different distance
    knn.fit(X_train,y_train)
    pred_i = knn.predict(X_test)
    error_rate.append(np.mean(pred_i != y_test))
ids = np.where(error_rate == min(error_rate))[0][0]
# NOTE: the loop above tries K = 1..20, but the index is mapped through
# range(3, 21), so the value reported here is the lowest-error K plus 2.
opt_k = [i for i in range(3,21)][ids]
opt_k

[18]: 7

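[19]: # NOTE: metric is not passed here, so this model uses scikit-learn's default
# (Minkowski with p=2, i.e. Euclidean) rather than Manhattan distance.
knn = KNeighborsClassifier(n_neighbors=opt_k)
knn.fit(X_train,y_train)
print("accuracy of KNN: " + str(metrics.accuracy_score(y_test, knn.predict(X_test))))

pd.crosstab(knn.predict(X_test), y_test, rownames=["Predicted"], colnames=["Actual"])

accuracy of KNN: 0.946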

[19]: Actual 0 1
Predicted
0 891 50
1 4 55

The optimal K varies between versions. With scaled data, however, the result is better in terms of
accuracy (0.956 and 0.946, versus 0.902 on unscaled data).
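A toy example may help show why scaling matters for KNN. The three customers below are hypothetical, with Income in thousands of dollars and Family size as the only features. Before scaling, Income differences dominate the distance; after min-max scaling, both features contribute on the same 0 to 1 range.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical customers: columns are Income ($000) and Family size.
X = np.array([[40.0, 1.0],
              [45.0, 4.0],
              [200.0, 1.0]])


def dist(u, v):
    """Euclidean distance between two feature vectors."""
    return float(np.linalg.norm(u - v))


raw_01 = dist(X[0], X[1])  # differs mostly in Family
raw_02 = dist(X[0], X[2])  # differs only in Income

X_scaled = MinMaxScaler().fit_transform(X)
scaled_01 = dist(X_scaled[0], X_scaled[1])
scaled_02 = dist(X_scaled[0], X_scaled[2])

print(raw_01, raw_02)        # Income gap dwarfs the Family gap
print(scaled_01, scaled_02)  # both gaps are now of similar size
```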

11. What are the relative strengths and weaknesses of the various models? Mention
at least 3 points.



• NB is a good choice if we only have categorical variables.
• A single NB variant cannot use mixed data types.
• KNN requires trying different values of K.
• KNN is sensitive to the scale of the variables.
• KNN can use mixed data types.

12. What would you recommend to the VP? Why? Remember, you have 3 different
models (and their variants) to choose from.



If we combine the predictions of all three models, either weighting them equally or giving priority
to one of them, we may get better predictions.
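One possible way to combine the three model types is scikit-learn's `VotingClassifier`. The sketch below is run on synthetic data and is only an illustration under assumptions: the choice of GaussianNB, K = 5, soft voting, and equal weights are not taken from the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic, imbalanced stand-in for the bank data (about 10% positives).
X, y = make_classification(n_samples=500, n_features=6,
                           weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1234, stratify=y
)

ensemble = VotingClassifier(
    estimators=[
        ("logit", LogisticRegression(max_iter=1000)),
        ("nb", GaussianNB()),
        # KNN is scale sensitive, so it gets its own scaler.
        ("knn", make_pipeline(MinMaxScaler(), KNeighborsClassifier(n_neighbors=5))),
    ],
    voting="soft",      # average the predicted class probabilities
    weights=[1, 1, 1],  # equal weights; raise one value to prioritize a model
)
ensemble.fit(X_train, y_train)
ensemble_accuracy = ensemble.score(X_test, y_test)
print(ensemble_accuracy)
```

Whether the ensemble actually beats the best single model would need to be checked on the bank data, ideally on loan-taker recall as well as accuracy.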

8.1 Quick survey:


Please take a minute to provide feedback on how you think the course is going so far using this link
The responses are anonymous and will be helpful for me to understand how I can try to make sure
you get the most out of the class. Please respond by the time the homework is due on Thursday,
March 4 at 11am EST.

8.1.1 If we get a response rate of 60% or above, everyone gets an extra 1 point on
this assignment! If we get a response rate of 75% or above, everyone gets an
extra 2 points. That means that 60% (or 75%) of the class provides honest
feedback and not 60% (or 75%) of the class says nice things. :-)
8.2 Working code below
[ ]:
