CIS242 HW4 Ans
Active notebooks (.ipynb files) or raw code (.py files) will NOT be accepted and no
points will be given. The code part of the files will not be graded, but they will be checked if
necessary to verify your findings and recommendations. Point deductions may occur if there are
major discrepancies between your written answers and the output from the code.
Please organize your notebook to have the homework responses at the top and the working code
underneath.
Each question is worth 2 points for a total of 24 points.
0.4 NAME:
0.5 Predicting Loans for Universal Bank
Remember: You were recently hired as a data scientist by Universal Bank. The bank’s Vice
President is interested in learning what predictive analytics can do for the bank. She supplies you
with a dataset containing information on a sample of customers. Here is a description of each
variable in the Universal Bank dataset:
• Age: Customer’s age in years
• Experience: Number of years of professional work experience
• Income: Annual income in thousands of dollars ($000)
• Zip code: Zip code of home address
• Family: Customer’s family size
• CC Avg: Average spending on credit cards per month in thousands of dollars ($000)
• Education: Education level where 1 = Undergraduate; 2 = Graduate; and 3 = Advanced/Professional
• Mortgage: Value of house mortgage, if any, in thousands of dollars ($000)
• Personal.Loan: Did the customer accept a personal loan offered in the bank’s last campaign?
1=Yes; 0 = No.
• Securities.Account: Does the customer have a securities account with the bank? 1 = Yes; 0
= No.
• CD.Account: Does the customer have a certificate of deposit (CD) account with the bank? 1
= Yes; 0 = No.
• Online: Does the customer use Internet banking facilities? 1 = Yes; 0 = No.
• Credit.Card: Does the customer use a credit card issued by Universal Bank? 1 = Yes; 0 =
No.
You have already investigated these data and built a logistic regression model to help understand
the personal loan market. Now the VP wants to investigate other predictive models.
1. You are asked to create a NB model to predict personal loans. What variables will
you use? Why?
[1]: # code or markdown here
import pandas as pd
data = pd.read_csv('D:/station/FEB/JIHAO/HW4/UniversalBank.csv')
data.head()
abs(data.corr()) # find correlation
                    Securities Account  CD Account    Online  CreditCard
Age                           0.000436    0.008043  0.013702    0.007681
Experience                    0.001232    0.010353  0.013898    0.008967
Income                        0.002616    0.169738  0.014206    0.002385
ZIP Code                      0.004704    0.019972  0.016990    0.007691
Family                        0.019994    0.014110  0.010354    0.011588
CCAvg                         0.015087    0.136537  0.003620    0.006686
Education                     0.010812    0.013934  0.015004    0.011014
Mortgage                      0.005411    0.089311  0.005995    0.007231
Personal Loan                 0.021954    0.316355  0.006278    0.002802
Securities Account            1.000000    0.317034  0.012627    0.015028
CD Account                    0.317034    1.000000  0.175880    0.278644
Online                        0.012627    0.175880  1.000000    0.004210
CreditCard                    0.015028    0.278644  0.004210    1.000000
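Several pairs above have an absolute correlation above 0.2 (for example, CD Account with Securities
Account at 0.317 and with CreditCard at 0.279). Because NB assumes that features are conditionally
independent given the class, we decide to drop some variables. We also remove ZIP Code to avoid
misuse. However, depending on the NB model we select, we still have to choose between continuous
and categorical variables in the next step.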
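The screening step described above can be sketched as a small helper that lists every feature pair whose absolute correlation exceeds a threshold. This is only an illustration: the function name `correlated_pairs` and the toy data are assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd

def correlated_pairs(df, threshold=0.2):
    """Return (col_a, col_b, |corr|) for each pair above the threshold."""
    corr = df.corr().abs()
    # Keep only the upper triangle so each pair appears once and the
    # diagonal (always 1.0) is excluded.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    pairs = pairs[pairs > threshold].sort_values(ascending=False)
    return [(a, b, round(v, 3)) for (a, b), v in pairs.items()]

# Made-up toy data, not the Universal Bank file.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
toy = pd.DataFrame({
    "a": x,
    "b": x + rng.normal(scale=0.1, size=200),  # strongly related to "a"
    "c": rng.normal(size=200),                 # unrelated noise
})
result = correlated_pairs(toy, threshold=0.5)
print(result)
```

On the real data, the same helper could be called on `data` with `threshold=0.2` to list the candidate pairs before deciding which variables to drop.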
[5000 rows x 7 columns]
2. Do you need to make any changes to the data? If so, what do you change and why?
If not, why not?
3. Train a NB model. What set up or parameter choices did you make? Why?
We split the data into training and test sets so that we can check the model’s predictive performance.
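from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import CategoricalNB

X = m_nb_df.copy()
y = X.pop('Personal Loan')  # separate the target from the features
X_train, X_test, y_train, y_test = train_test_split(X, y, # train test split
                                                    test_size=0.2,
                                                    random_state=1234)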
print(y_train.value_counts()/len(y_train))
print(y_test.value_counts()/len(y_test))
NB = CategoricalNB()
NB.fit(X_train, y_train)
0 0.90625
1 0.09375
Name: Personal Loan, dtype: float64
0 0.895
1 0.105
Name: Personal Loan, dtype: float64
[6]: CategoricalNB()
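The class shares printed above differ slightly between the training and test sets (about 9.4% versus 10.5% loan takers). One possible alternative, not used in the original split, is to pass `stratify=y` to `train_test_split` so that both sets keep the same share. The sketch below uses made-up labels with roughly a 10% positive rate.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up labels (10% positives), not the Universal Bank data.
y = pd.Series([1] * 100 + [0] * 900, name="Personal Loan")
X = pd.DataFrame({"feature": range(len(y))})

# stratify=y keeps the class proportions equal in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1234, stratify=y
)
train_rate = y_train.mean()
test_rate = y_test.mean()
print(train_rate, test_rate)
```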
4. Test the accuracy of your model. What is the overall accuracy of the model? What
did it predict well? What did it not predict well?
[7]: # code or markdown here
print("accuracy of NB: " + str(metrics.accuracy_score(y_test, NB.predict(X_test))))
[7]: Actual 0 1
Predicted
0 883 86
1 12 19
The overall accuracy is 0.902. The model predicts well for customers who did not take a loan (883
of 895 correct), but it identifies only 19 of the 105 customers who did.
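The accuracy and the per-class hit rates can be recomputed directly from the confusion matrix above. The sketch below copies the four counts by hand; the variable names are illustrative.

```python
# Counts copied from the confusion matrix above
# (rows = predicted, columns = actual).
tn, fn = 883, 86   # predicted 0: actual 0, actual 1
fp, tp = 12, 19    # predicted 1: actual 0, actual 1

total = tn + fn + fp + tp
accuracy = (tn + tp) / total
recall_no_loan = tn / (tn + fp)   # share of actual 0s predicted as 0
recall_loan = tp / (tp + fn)      # share of actual 1s predicted as 1
precision_loan = tp / (tp + fp)   # share of predicted 1s that are actual 1s

print(f"accuracy={accuracy:.3f}")
print(f"recall (no loan)={recall_no_loan:.3f}")
print(f"recall (loan)={recall_loan:.3f}")
print(f"precision (loan)={precision_loan:.3f}")
```

Because about 90% of customers did not take a loan, accuracy alone is dominated by that class, so the loan-class recall is the more informative figure here.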
5. Train at least two more versions of the model. What changes did you make? Why?
What was the outcome?
print(y_train.value_counts()/len(y_train))
print(y_test.value_counts()/len(y_test))
NB = GaussianNB()
NB.fit(X_train, y_train)
print("accuracy of NB: " + str(metrics.accuracy_score(y_test, NB.predict(X_test))))
0 0.90625
1 0.09375
Name: Personal Loan, dtype: float64
0 0.895
1 0.105
Name: Personal Loan, dtype: float64
accuracy of NB: 0.894
[8]: Actual 0 1
Predicted
0 850 61
1 45 44
Lastly, we convert the continuous variables into categorical ones and fit a MultinomialNB model.
[10]: X = data_nb.copy()
X.Experience = pd.qcut(X.Experience , q=10, labels = False)
X.Income= pd.qcut(X.Income, q=3, labels = False)
y = X.pop('Personal Loan')
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=1234)
print(y_train.value_counts()/len(y_train))
print(y_test.value_counts()/len(y_test))
NB = MultinomialNB()
NB.fit(X_train, y_train)
print("accuracy of NB: " + str(metrics.accuracy_score(y_test, NB.predict(X_test))))
0 0.90625
1 0.09375
Name: Personal Loan, dtype: float64
0 0.895
1 0.105
Name: Personal Loan, dtype: float64
accuracy of NB: 0.904
[10]: Actual 0 1
Predicted
0 892 93
1 3 12
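The quantile binning used before fitting MultinomialNB can be illustrated on a few made-up values. With `labels=False`, `pd.qcut` returns integer bin codes rather than interval labels, which is the form the model receives. The income figures below are assumptions for illustration only.

```python
import pandas as pd

# Made-up incomes in $000, not taken from the Universal Bank file.
income = pd.Series([20, 35, 48, 60, 75, 90, 120, 150, 200], name="Income")

# q=3 cuts at the 1/3 and 2/3 quantiles; labels=False returns codes 0, 1, 2.
income_bin = pd.qcut(income, q=3, labels=False)
bin_codes = income_bin.tolist()
bin_sizes = income_bin.value_counts().sort_index().tolist()
print(bin_codes)
print(bin_sizes)
```

When a column has many tied values, some quantile edges can coincide and `pd.qcut` raises an error; its `duplicates="drop"` option merges the repeated edges in that case.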
6. Now you are asked to create a KNN model to predict personal loans. What
variables will you use? Why?
7. Do you need to make any changes to the data? How is this similar or different to
the NB model? Why?
8. Train a KNN model. What set up or parameter choices did you make? Why?
error_rate = []
for i in range(3,21): # choose k
[14]: ids = np.where(error_rate == min(error_rate))[0][0]
opt_k = [i for i in range(3,21)][ids] # get optimal k
opt_k
[14]: 14
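The lookup from the lowest error rate back to a value of k can be sketched as follows. Keeping the candidate values in one list (`ks`, a name assumed here) that is shared by the loop and the lookup guarantees that the index and the k value stay aligned. The error rates are made up for illustration.

```python
import numpy as np

ks = list(range(3, 21))  # candidate k values, shared by loop and lookup

# Made-up error rates, one per candidate k (lowest at k = 9).
error_rate = [0.05 + 0.01 * abs(k - 9) for k in ks]

best_index = int(np.argmin(error_rate))  # first position of the lowest error
opt_k = ks[best_index]
print(opt_k)
```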
9. Test the accuracy of your model. What is the overall accuracy of the model? What
did it predict well? What did it not predict well?
[15]: knn = KNeighborsClassifier(n_neighbors=opt_k)
knn.fit(X_train,y_train)
print("accuracy of KNN: " + str(metrics.accuracy_score(y_test, knn.predict(X_test))))
[15]: Actual 0 1
Predicted
0 885 88
1 10 17
10. What changes can you make that might affect the KNN model? Train at least
two more models. What is the result?
First version: we scale our data.
X = MinMaxScaler().fit_transform(X) # scaling
error_rate = []
for i in range(1,21):
[16]: 3
[17]: Actual 0 1
Predicted
0 891 40
1 4 65
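[18]: error_rate = []
for i in range(1,21):
    knn = KNeighborsClassifier(n_neighbors=i)  # new classifier for each candidate k
    knn.fit(X_train,y_train)
    pred_i = knn.predict(X_test)
    error_rate.append(np.mean(pred_i != y_test))  # share of misclassified test rows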
ids = np.where(error_rate == min(error_rate))[0][0]
opt_k = [i for i in range(3,21)][ids]
opt_k
[18]: 7
[19]: Actual 0 1
Predicted
0 891 50
1 4 55
The optimal k varies across versions (14, 3, and 7). With scaled data, however, accuracy improves
from 0.902 for the unscaled model to as high as 0.956.
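KNN relies on distances, so a feature measured on a large scale can dominate the others unless the data are rescaled. The sketch below is one possible setup and differs from the code above: it wraps `MinMaxScaler` and `KNeighborsClassifier` in a pipeline so that the scaler is fitted on the training rows only. The data are synthetic and `n_neighbors=5` is an arbitrary choice.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the bank features; one column gets a much larger scale.
X, y = make_classification(n_samples=1000, n_features=6, n_informative=4,
                           n_redundant=0, random_state=0)
X[:, 0] *= 1000

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1234
)

# The pipeline fits the scaler on the training rows only, then applies
# the same min and max to the test rows before predicting.
scaled_knn = make_pipeline(MinMaxScaler(), KNeighborsClassifier(n_neighbors=5))
scaled_knn.fit(X_train, y_train)
accuracy = scaled_knn.score(X_test, y_test)
print(f"accuracy with scaling: {accuracy:.3f}")
```

Fitting the scaler inside the pipeline keeps test-set information out of the scaling step, which makes the test accuracy a fairer estimate.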
11. What are the relative strengths and weaknesses of the various models? Mention
at least 3 points.
12. What would you recommend to the VP? Why? Remember, you have 3 different
models (and their variants) to choose from.
8.1.1 If we get a response rate of 60% or above, everyone gets an extra 1 point on
this assignment! If we get a response rate of 75% or above, everyone gets an
extra 2 points. That means that 60% (or 75%) of the class provides honest
feedback and not 60% (or 75%) of the class says nice things. :-)
8.2 Working code below
[ ]: