
Final_Group_Project

October 20, 2023

1 Introduction
CST 383 Final Group Project
Team:
Adam Momand, Zack Hester, Royal Williams, Alex O’Brien
Data set:
Life Expectancy (WHO) https://www.kaggle.com/datasets/kumarajarshi/life-expectancy-who
Objective
Our team uses World Health Organization (WHO) data on life expectancy by country.
We use this data to predict life expectancy from predictors such as development
status (developing or developed), adult mortality, infant deaths, alcohol consumption,
BMI, and measles and polio figures.

2 Importing Libraries and Data


[ ]: # Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor
from sklearn.neighbors import KNeighborsClassifier
from matplotlib import rcParams

[ ]: # Data
path = "https://raw.githubusercontent.com/alexobrien5/FinalProject/main/life_expectancy.csv"

df = pd.read_csv(path)

3 Preprocessing Data
[ ]: df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2938 entries, 0 to 2937
Data columns (total 22 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Country 2938 non-null object
1 Year 2938 non-null int64
2 Status 2938 non-null object
3 Life expectancy 2928 non-null float64
4 Adult Mortality 2928 non-null float64
5 infant deaths 2938 non-null int64
6 Alcohol 2744 non-null float64
7 percentage expenditure 2938 non-null float64
8 Hepatitis B 2385 non-null float64
9 Measles 2938 non-null int64
10 BMI 2904 non-null float64
11 under-five deaths 2938 non-null int64
12 Polio 2919 non-null float64
13 Total expenditure 2712 non-null float64
14 Diphtheria 2919 non-null float64
15 HIV/AIDS 2938 non-null float64
16 GDP 2490 non-null float64
17 Population 2286 non-null float64
18 thinness 1-19 years 2904 non-null float64
19 thinness 5-9 years 2904 non-null float64
20 Income composition of resources 2771 non-null float64
21 Schooling 2775 non-null float64
dtypes: float64(16), int64(4), object(2)
memory usage: 505.1+ KB

[ ]: # Data Cleaning
# Dropping null rows
df = df.dropna()
df.isna().sum()

[ ]: Country 0
Year 0
Status 0
Life expectancy 0
Adult Mortality 0
infant deaths 0
Alcohol 0
percentage expenditure 0
Hepatitis B 0

Measles 0
BMI 0
under-five deaths 0
Polio 0
Total expenditure 0
Diphtheria 0
HIV/AIDS 0
GDP 0
Population 0
thinness 1-19 years 0
thinness 5-9 years 0
Income composition of resources 0
Schooling 0
dtype: int64
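The `df.info()` output shows 2938 rows before cleaning, while `df.describe()` reports 1649 rows after `dropna()`, so the blanket drop discards roughly 44% of the data. A minimal sketch of how the loss could be audited before dropping is shown here. The toy frame and its values are made up for illustration and are not the WHO data; `needed` is a hypothetical list of the columns a model would actually use.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the WHO data (values are made up for illustration).
toy = pd.DataFrame({
    "Life expectancy": [65.0, 59.9, np.nan, 71.2],
    "GDP": [584.3, np.nan, 631.7, np.nan],
    "Schooling": [10.1, 10.0, 9.9, 12.3],
})

# Missing values per column, largest first.
missing_per_column = toy.isna().sum().sort_values(ascending=False)

# Fraction of rows that survive a blanket dropna().
kept_fraction = len(toy.dropna()) / len(toy)

# Alternative: drop a row only when a column the model needs is missing.
needed = ["Life expectancy", "Schooling"]  # hypothetical choice of columns
kept_subset = toy.dropna(subset=needed)

print(missing_per_column)
print(f"rows kept by dropna(): {kept_fraction:.0%}")
print(f"rows kept with subset={needed}: {len(kept_subset)}")
```

Restricting the drop to the columns in use keeps rows whose only gaps are in unused columns, which may matter here because GDP and Population have the most missing values.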

[ ]: df.describe()

[ ]: Year Life expectancy Adult Mortality infant deaths \


count 1649.000000 1649.000000 1649.000000 1649.000000
mean 2007.840509 69.302304 168.215282 32.553062
std 4.087711 8.796834 125.310417 120.847190
min 2000.000000 44.000000 1.000000 0.000000
25% 2005.000000 64.400000 77.000000 1.000000
50% 2008.000000 71.700000 148.000000 3.000000
75% 2011.000000 75.000000 227.000000 22.000000
max 2015.000000 89.000000 723.000000 1600.000000

Alcohol percentage expenditure Hepatitis B Measles \


count 1649.000000 1649.000000 1649.000000 1649.000000
mean 4.533196 698.973558 79.217708 2224.494239
std 4.029189 1759.229336 25.604664 10085.802019
min 0.010000 0.000000 2.000000 0.000000
25% 0.810000 37.438577 74.000000 0.000000
50% 3.790000 145.102253 89.000000 15.000000
75% 7.340000 509.389994 96.000000 373.000000
max 17.870000 18961.348600 99.000000 131441.000000

BMI under-five deaths Polio Total expenditure \


count 1649.000000 1649.000000 1649.000000 1649.000000
mean 38.128623 44.220133 83.564585 5.955925
std 19.754249 162.897999 22.450557 2.299385
min 2.000000 0.000000 3.000000 0.740000
25% 19.500000 1.000000 81.000000 4.410000
50% 43.700000 4.000000 93.000000 5.840000
75% 55.800000 29.000000 97.000000 7.470000
max 77.100000 2100.000000 99.000000 14.390000

Diphtheria HIV/AIDS GDP Population \
count 1649.000000 1649.000000 1649.000000 1.649000e+03
mean 84.155246 1.983869 5566.031887 1.465363e+07
std 21.579193 6.032360 11475.900117 7.046039e+07
min 2.000000 0.100000 1.681350 3.400000e+01
25% 82.000000 0.100000 462.149650 1.918970e+05
50% 92.000000 0.100000 1592.572182 1.419631e+06
75% 97.000000 0.700000 4718.512910 7.658972e+06
max 99.000000 50.600000 119172.741800 1.293859e+09

thinness 1-19 years thinness 5-9 years \


count 1649.000000 1649.000000
mean 4.850637 4.907762
std 4.599228 4.653757
min 0.100000 0.100000
25% 1.600000 1.700000
50% 3.000000 3.200000
75% 7.100000 7.100000
max 27.200000 28.200000

Income composition of resources Schooling


count 1649.000000 1649.000000
mean 0.631551 12.119891
std 0.183089 2.795388
min 0.000000 4.200000
25% 0.509000 10.300000
50% 0.673000 12.300000
75% 0.751000 14.000000
max 0.936000 20.700000

[ ]: # Preprocessing
# Encode Status as a binary column: 1 = Developed, 0 = Developing
df['Status'].value_counts()

# Equivalent to the 'Developed' dummy column from pd.get_dummies(df['Status'])
df['DevStatus'] = (df['Status'] == 'Developed').astype('uint8')

[ ]: df.hist(bins=50, figsize=(20, 20))

[Figure: grid of histograms (50 bins each), one panel per numeric column, from Year through DevStatus.]

4 Data Exploration and Visualization
[ ]: df.head()

[ ]: Country Year Status Life expectancy Adult Mortality \


0 Afghanistan 2015 Developing 65.0 263.0
1 Afghanistan 2014 Developing 59.9 271.0
2 Afghanistan 2013 Developing 59.9 268.0
3 Afghanistan 2012 Developing 59.5 272.0
4 Afghanistan 2011 Developing 59.2 275.0

infant deaths Alcohol percentage expenditure Hepatitis B Measles … \

0 62 0.01 71.279624 65.0 1154 …
1 64 0.01 73.523582 62.0 492 …
2 66 0.01 73.219243 64.0 430 …
3 69 0.01 78.184215 67.0 2787 …
4 71 0.01 7.097109 68.0 3013 …

Total expenditure Diphtheria HIV/AIDS GDP Population \


0 8.16 65.0 0.1 584.259210 33736494.0
1 8.18 62.0 0.1 612.696514 327582.0
2 8.13 64.0 0.1 631.744976 31731688.0
3 8.52 67.0 0.1 669.959000 3696958.0
4 7.87 68.0 0.1 63.537231 2978599.0

thinness 1-19 years thinness 5-9 years Income composition of resources \


0 17.2 17.3 0.479
1 17.5 17.5 0.476
2 17.7 17.7 0.470
3 17.9 18.0 0.463
4 18.2 18.2 0.454

Schooling DevStatus
0 10.1 0
1 10.0 0
2 9.9 0
3 9.8 0
4 9.5 0

[5 rows x 23 columns]

[ ]: # Create a scatter matrix plot of selected variables


selected_vars = ['Life expectancy', 'Income composition of resources', 'Schooling']

pd.plotting.scatter_matrix(df[selected_vars], figsize=(8, 8))


plt.show()

Showing correlation between features

[ ]: # numeric_only=True restricts the correlation to numeric columns (skips Country and Status)
df.corr(numeric_only=True)

[ ]: Year Life expectancy Adult Mortality \


Year 1.000000 0.050771 -0.037092
Life expectancy 0.050771 1.000000 -0.702523
Adult Mortality -0.037092 -0.702523 1.000000

infant deaths 0.008029 -0.169074 0.042450
Alcohol -0.113365 0.402718 -0.175535
percentage expenditure 0.069553 0.409631 -0.237610
Hepatitis B 0.114897 0.199935 -0.105225
Measles -0.053822 -0.068881 -0.003967
BMI 0.005739 0.542042 -0.351542
under-five deaths 0.010479 -0.192265 0.060365
Polio -0.016699 0.327294 -0.199853
Total expenditure 0.059493 0.174718 -0.085227
Diphtheria 0.029641 0.341331 -0.191429
HIV/AIDS -0.123405 -0.592236 0.550691
GDP 0.096421 0.441322 -0.255035
Population 0.012567 -0.022305 -0.015012
thinness 1-19 years 0.019757 -0.457838 0.272230
thinness 5-9 years 0.014122 -0.457508 0.286723
Income composition of resources 0.122892 0.721083 -0.442203
Schooling 0.088732 0.727630 -0.421171
DevStatus -0.034138 0.442798 -0.278173

infant deaths Alcohol \


Year 0.008029 -0.113365
Life expectancy -0.169074 0.402718
Adult Mortality 0.042450 -0.175535
infant deaths 1.000000 -0.106217
Alcohol -0.106217 1.000000
percentage expenditure -0.090765 0.417047
Hepatitis B -0.231769 0.109889
Measles 0.532680 -0.050110
BMI -0.234425 0.353396
under-five deaths 0.996906 -0.101082
Polio -0.156929 0.240315
Total expenditure -0.146951 0.214885
Diphtheria -0.161871 0.242951
HIV/AIDS 0.007712 -0.027113
GDP -0.098092 0.443433
Population 0.671758 -0.028880
thinness 1-19 years 0.463415 -0.403755
thinness 5-9 years 0.461908 -0.386208
Income composition of resources -0.134754 0.561074
Schooling -0.214372 0.616975
DevStatus -0.108757 0.607782

percentage expenditure Hepatitis B \


Year 0.069553 0.114897
Life expectancy 0.409631 0.199935
Adult Mortality -0.237610 -0.105225
infant deaths -0.090765 -0.231769

Alcohol 0.417047 0.109889
percentage expenditure 1.000000 0.016760
Hepatitis B 0.016760 1.000000
Measles -0.063071 -0.124800
BMI 0.242738 0.143302
under-five deaths -0.092158 -0.240766
Polio 0.128626 0.463331
Total expenditure 0.183872 0.113327
Diphtheria 0.134813 0.588990
HIV/AIDS -0.095085 -0.094802
GDP 0.959299 0.041850
Population -0.016792 -0.129723
thinness 1-19 years -0.255035 -0.129406
thinness 5-9 years -0.255635 -0.133251
Income composition of resources 0.402170 0.184921
Schooling 0.422088 0.215182
DevStatus 0.461688 0.140351

Measles BMI under-five deaths … \


Year -0.053822 0.005739 0.010479 …
Life expectancy -0.068881 0.542042 -0.192265 …
Adult Mortality -0.003967 -0.351542 0.060365 …
infant deaths 0.532680 -0.234425 0.996906 …
Alcohol -0.050110 0.353396 -0.101082 …
percentage expenditure -0.063071 0.242738 -0.092158 …
Hepatitis B -0.124800 0.143302 -0.240766 …
Measles 1.000000 -0.153245 0.517506 …
BMI -0.153245 1.000000 -0.242137 …
under-five deaths 0.517506 -0.242137 1.000000 …
Polio -0.057850 0.186268 -0.171164 …
Total expenditure -0.113583 0.189469 -0.145803 …
Diphtheria -0.058606 0.176295 -0.178448 …
HIV/AIDS -0.003522 -0.210897 0.019476 …
GDP -0.064768 0.266114 -0.100331 …
Population 0.321946 -0.081416 0.658680 …
thinness 1-19 years 0.180642 -0.547018 0.464785 …
thinness 5-9 years 0.174946 -0.554094 0.462289 …
Income composition of resources -0.058277 0.510505 -0.148097 …
Schooling -0.115660 0.554844 -0.226013 …
DevStatus -0.071963 0.298380 -0.109847 …

Total expenditure Diphtheria HIV/AIDS \


Year 0.059493 0.029641 -0.123405
Life expectancy 0.174718 0.341331 -0.592236
Adult Mortality -0.085227 -0.191429 0.550691
infant deaths -0.146951 -0.161871 0.007712
Alcohol 0.214885 0.242951 -0.027113

percentage expenditure 0.183872 0.134813 -0.095085
Hepatitis B 0.113327 0.588990 -0.094802
Measles -0.113583 -0.058606 -0.003522
BMI 0.189469 0.176295 -0.210897
under-five deaths -0.145803 -0.178448 0.019476
Polio 0.119768 0.609245 -0.107885
Total expenditure 1.000000 0.129915 0.043101
Diphtheria 0.129915 1.000000 -0.117601
HIV/AIDS 0.043101 -0.117601 1.000000
GDP 0.180373 0.158438 -0.108081
Population -0.079962 -0.039898 -0.027801
thinness 1-19 years -0.209872 -0.187242 0.172592
thinness 5-9 years -0.217865 -0.180952 0.183147
Income composition of resources 0.183653 0.343262 -0.248590
Schooling 0.243783 0.350398 -0.211840
DevStatus 0.192538 0.201654 -0.129555

GDP Population thinness 1-19 years \


Year 0.096421 0.012567 0.019757
Life expectancy 0.441322 -0.022305 -0.457838
Adult Mortality -0.255035 -0.015012 0.272230
infant deaths -0.098092 0.671758 0.463415
Alcohol 0.443433 -0.028880 -0.403755
percentage expenditure 0.959299 -0.016792 -0.255035
Hepatitis B 0.041850 -0.129723 -0.129406
Measles -0.064768 0.321946 0.180642
BMI 0.266114 -0.081416 -0.547018
under-five deaths -0.100331 0.658680 0.464785
Polio 0.156809 -0.045387 -0.164070
Total expenditure 0.180373 -0.079962 -0.209872
Diphtheria 0.158438 -0.039898 -0.187242
HIV/AIDS -0.108081 -0.027801 0.172592
GDP 1.000000 -0.020369 -0.277498
Population -0.020369 1.000000 0.282529
thinness 1-19 years -0.277498 0.282529 1.000000
thinness 5-9 years -0.277959 0.277913 0.927913
Income composition of resources 0.446856 -0.008132 -0.453679
Schooling 0.467947 -0.040312 -0.491199
DevStatus 0.484801 -0.034790 -0.308005

thinness 5-9 years \


Year 0.014122
Life expectancy -0.457508
Adult Mortality 0.286723
infant deaths 0.461908
Alcohol -0.386208
percentage expenditure -0.255635

Hepatitis B -0.133251
Measles 0.174946
BMI -0.554094
under-five deaths 0.462289
Polio -0.174489
Total expenditure -0.217865
Diphtheria -0.180952
HIV/AIDS 0.183147
GDP -0.277959
Population 0.277913
thinness 1-19 years 0.927913
thinness 5-9 years 1.000000
Income composition of resources -0.438484
Schooling -0.472482
DevStatus -0.307279

Income composition of resources Schooling \


Year 0.122892 0.088732
Life expectancy 0.721083 0.727630
Adult Mortality -0.442203 -0.421171
infant deaths -0.134754 -0.214372
Alcohol 0.561074 0.616975
percentage expenditure 0.402170 0.422088
Hepatitis B 0.184921 0.215182
Measles -0.058277 -0.115660
BMI 0.510505 0.554844
under-five deaths -0.148097 -0.226013
Polio 0.314682 0.350147
Total expenditure 0.183653 0.243783
Diphtheria 0.343262 0.350398
HIV/AIDS -0.248590 -0.211840
GDP 0.446856 0.467947
Population -0.008132 -0.040312
thinness 1-19 years -0.453679 -0.491199
thinness 5-9 years -0.438484 -0.472482
Income composition of resources 1.000000 0.784741
Schooling 0.784741 1.000000
DevStatus 0.463615 0.512543

DevStatus
Year -0.034138
Life expectancy 0.442798
Adult Mortality -0.278173
infant deaths -0.108757
Alcohol 0.607782
percentage expenditure 0.461688
Hepatitis B 0.140351

Measles -0.071963
BMI 0.298380
under-five deaths -0.109847
Polio 0.201917
Total expenditure 0.192538
Diphtheria 0.201654
HIV/AIDS -0.129555
GDP 0.484801
Population -0.034790
thinness 1-19 years -0.308005
thinness 5-9 years -0.307279
Income composition of resources 0.463615
Schooling 0.512543
DevStatus 1.000000

[21 rows x 21 columns]

[ ]: # For each feature, find the feature with the highest positive correlation

def most_corr_index(x):
    # Sorted descending, index[0] is the feature itself (r = 1), so take index[1]
    return x.sort_values(ascending=False).index[1]

df.corr(numeric_only=True).apply(most_corr_index)

[ ]: Year Income composition of resources


Life expectancy Schooling
Adult Mortality HIV/AIDS
infant deaths under-five deaths
Alcohol Schooling
percentage expenditure GDP
Hepatitis B Diphtheria
Measles infant deaths
BMI Schooling
under-five deaths infant deaths
Polio Diphtheria
Total expenditure Schooling
Diphtheria Polio
HIV/AIDS Adult Mortality
GDP percentage expenditure
Population infant deaths
thinness 1-19 years thinness 5-9 years
thinness 5-9 years thinness 1-19 years
Income composition of resources Schooling
Schooling Income composition of resources

DevStatus Alcohol
dtype: object

Notably, Life expectancy is most strongly correlated with Schooling (r ≈ 0.73).
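The ranking above sorts correlations in descending order, so it only surfaces the largest positive value and can overlook a strong negative relationship (for example, Life expectancy and Adult Mortality at about -0.70). A possible variant that ranks by absolute correlation is sketched here. The helper name `strongest_partner` and the synthetic columns are assumptions made for illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd

def strongest_partner(corr: pd.DataFrame) -> pd.Series:
    """For each column, return the other column with the largest |r|."""
    abs_corr = corr.abs()
    # Blank out the diagonal so a column is never matched with itself.
    diagonal = np.eye(len(abs_corr), dtype=bool)
    return abs_corr.mask(diagonal).idxmax()

# Synthetic data: 'neg_a' is strongly negatively related to 'a',
# while 'weak' is only weakly positively related to it.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "neg_a": -a + rng.normal(scale=0.1, size=200),
    "weak": a + rng.normal(scale=5.0, size=200),
})

partners = strongest_partner(toy.corr())
print(partners)
```

On this toy frame a descending sort would pair `a` with `weak`, whereas the absolute-value version pairs it with `neg_a`.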

[ ]: sns.scatterplot(data=df, x='Life expectancy', y='Schooling')

[ ]: <Axes: xlabel='Life expectancy', ylabel='Schooling'>

[ ]: sns.scatterplot(data=df, x='GDP', y='percentage expenditure')

[ ]: <Axes: xlabel='GDP', ylabel='percentage expenditure'>

[ ]: # pairwise plot of selected variables
selected_vars = ['Life expectancy', 'Adult Mortality', 'infant deaths', 'Alcohol']

sns.pairplot(df[selected_vars])

[ ]: <seaborn.axisgrid.PairGrid at 0x7fe820334250>

[ ]: #Histogram: Examine the distribution of a numeric variable.
sns.histplot(data=df, x='Life expectancy')
plt.xlabel('Life Expectancy')
plt.ylabel('Frequency')
plt.title('Distribution of Life Expectancy')
plt.show()

[ ]: # Scatter Plot - Life Expectancy vs GDP
plt.scatter(df['GDP'], df['Life expectancy'])
plt.xlabel('GDP')
plt.ylabel('Life Expectancy')
plt.title('Life Expectancy vs. GDP')
plt.show()

[ ]: # Bar plot - mean Life Expectancy by Country
plt.figure(figsize=(12, 6))
sns.barplot(data=df, x='Country', y='Life expectancy')
plt.xlabel('Country')
plt.ylabel('Life Expectancy')
plt.title('Average Life Expectancy by Country')
plt.xticks(rotation=45)
plt.subplots_adjust(left=0.1, right=0.9)
plt.show()

5 Predicting Life Expectancy with kNN Regression
[ ]: # kNN Regression Prediction
predictors = ['Adult Mortality', 'infant deaths', 'Alcohol', 'Hepatitis B',
              'Measles', 'BMI']

target = ['Life expectancy']

X = df[predictors].values
y = df[target].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30,
                                                    random_state=42)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

knn = KNeighborsRegressor(n_neighbors=7)
knn.fit(X_train, y_train)

predictions = knn.predict(X_test)

print(predictions[:10])

print(y_test[:10])

mse = ((predictions - y_test)**2).mean()


print('MSE: {0:.0f}'.format(mse))
rmse = np.sqrt(mse)
print('RMSE: {0:.0f}'.format(rmse))

blind_mse = ((y_train.mean() - y_test)**2).mean()


print('blind MSE: {0:.0f}'.format(blind_mse))
blind_rmse = np.sqrt(blind_mse)
print('blind RMSE: {0:.0f}'.format(blind_rmse))

[[67.4 ]
[74.47142857]
[81.87142857]
[53.21428571]
[51.85714286]
[49.72857143]
[70.6 ]
[73.64285714]
[73.3 ]
[74.18571429]]
[[67.5]
[73.8]
[79.1]
[54.9]
[48.6]
[50. ]
[68.7]
[74.1]
[76.9]
[72.4]]
MSE: 13
RMSE: 4
blind MSE: 79
blind RMSE: 9
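Rounding to whole numbers (`RMSE: 4`) can hide small differences between models. The following is a minimal sketch of the same comparison against the "blind" mean predictor, reported to two decimals and using `sklearn.metrics.mean_squared_error`. The data is synthetic and stands in for the WHO columns, so the printed values say nothing about the real data set.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 3 predictors, target depends on two of them plus noise.
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 3))
y = 70 + 5 * X[:, 0] - 3 * X[:, 1] + rng.normal(scale=1.0, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42)

# Fit the scaler on the training split only, then apply it to the test split.
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

knn = KNeighborsRegressor(n_neighbors=7).fit(X_train, y_train)
rmse = np.sqrt(mean_squared_error(y_test, knn.predict(X_test)))

# Baseline ("blind") prediction: always predict the training mean.
blind_predictions = np.full_like(y_test, y_train.mean())
blind_rmse = np.sqrt(mean_squared_error(y_test, blind_predictions))

print(f"RMSE: {rmse:.2f}")
print(f"blind RMSE: {blind_rmse:.2f}")
```

Using a one-dimensional `y` here also avoids relying on broadcasting between column vectors when computing the errors.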

[ ]: # Same prediction with more predictors (adding binary developed status feature as well)

predictors = ['Adult Mortality', 'infant deaths', 'Alcohol', 'Hepatitis B',
              'Measles', 'BMI', 'DevStatus', 'under-five deaths', 'Polio', 'GDP']

target = ['Life expectancy']

X = df[predictors].values
y = df[target].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30,
                                                    random_state=42)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

knn = KNeighborsRegressor(n_neighbors=7)
knn.fit(X_train, y_train)

predictions = knn.predict(X_test)

print(predictions[:10])
print(y_test[:10])

mse = ((predictions - y_test)**2).mean()


print('MSE: {0:.0f}'.format(mse))
rmse = np.sqrt(mse)
print('RMSE: {0:.0f}'.format(rmse))

blind_mse = ((y_train.mean() - y_test)**2).mean()


print('blind MSE: {0:.0f}'.format(blind_mse))
blind_rmse = np.sqrt(blind_mse)
print('blind RMSE: {0:.0f}'.format(blind_rmse))

[[67.17142857]
[74.78571429]
[80.05714286]
[53.31428571]
[53.18571429]
[49.72857143]
[69.11428571]
[73.07142857]
[72.35714286]
[73.32857143]]
[[67.5]
[73.8]
[79.1]
[54.9]
[48.6]
[50. ]
[68.7]
[74.1]
[76.9]
[72.4]]
MSE: 13
RMSE: 4
blind MSE: 79

blind RMSE: 9

[ ]: # After finding correlation with Schooling, adding to see if the RMSE decreases
# Same prediction with more predictors (adding binary developed status feature as well)

predictors = ['Adult Mortality', 'infant deaths', 'Alcohol',


X_test = scaler.transform(X_test)

knn = KNeighborsRegressor(n_neighbors=7)
knn.fit(X_train, y_train)

predictions = knn.predict(X_test)

print(predictions[:10])
print(y_test[:10])

mse = ((predictions - y_test)**2).mean()


print('MSE: {0:.0f}'.format(mse))
rmse = np.sqrt(mse)
print('RMSE: {0:.0f}'.format(rmse))

blind_mse = ((y_train.mean() - y_test)**2).mean()


print('blind MSE: {0:.0f}'.format(blind_mse))
blind_rmse = np.sqrt(blind_mse)
print('blind RMSE: {0:.0f}'.format(blind_rmse))

[[65.35714286]
[75.17142857]
[81.52857143]
[51.58571429]
[53.18571429]
[49.72857143]
[69.97142857]
[74.41428571]
[75.8 ]
[73.32857143]]
[[67.5]
[73.8]
[79.1]
[54.9]
[48.6]
[50. ]
[68.7]
[74.1]
[76.9]
[72.4]]
MSE: 12
RMSE: 3

blind MSE: 79
blind RMSE: 9
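All of the regression runs above fix `n_neighbors=7`. One way to choose k instead of fixing it is a cross-validated grid search over a scaling-plus-kNN pipeline, sketched here. The candidate k values and the synthetic data are assumptions for illustration; the pipeline step name `kneighborsregressor` is the one `make_pipeline` generates automatically.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the WHO data).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 70 + 4 * X[:, 0] + 2 * X[:, 1] + rng.normal(scale=1.0, size=200)

# Scaling lives inside the pipeline, so it is refit on each training fold.
pipe = make_pipeline(StandardScaler(), KNeighborsRegressor())
candidate_k = [1, 3, 5, 7, 9, 15]  # hypothetical search range
param_grid = {"kneighborsregressor__n_neighbors": candidate_k}

search = GridSearchCV(pipe, param_grid,
                      scoring="neg_root_mean_squared_error", cv=5)
search.fit(X, y)

best_k = search.best_params_["kneighborsregressor__n_neighbors"]
best_rmse = -search.best_score_  # scores are negated RMSE values
print(f"best k: {best_k}, cross-validated RMSE: {best_rmse:.2f}")
```

Because each candidate is scored on several folds, the comparison depends less on a single train/test split than the fixed-split runs above.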

6 Predicting Status classification using kNN Classifier


[ ]: predictors = ['Year', 'Population']
target = 'Status'

X = df[predictors].values
y = (df[target] == 'Developing').values.astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

knn = KNeighborsClassifier()
knn.fit(X_train, y_train)

predictions = knn.predict(X_test)

print(predictions[:20])
print(y_test[:20])

#accuracy for this model


accuracy = (predictions == y_test).mean()
print('accuracy: {0:.3f}'.format(accuracy)) # varies between runs (unseeded split); 0.858 in this run

[1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1]
[1 1 1 1 1 1 0 1 1 1 1 0 1 0 1 1 1 1 1 1]
accuracy: 0.858
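Most of the labels printed above are 1 (Developing), so raw accuracy is easier to interpret next to a majority-class baseline. A minimal sketch with `DummyClassifier` is shown here. The 85% class share is a made-up figure for the synthetic labels, not a measurement from the WHO data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Synthetic labels: roughly 85% class 1 (an assumed imbalance for illustration).
y = (rng.random(1000) < 0.85).astype(int)
X = rng.normal(size=(1000, 2))  # features deliberately carry no signal

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

# Always predicts the most frequent training label.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
baseline_accuracy = baseline.score(X_test, y_test)
print(f"majority-class accuracy: {baseline_accuracy:.3f}")
```

A classifier is only adding information when its accuracy clearly exceeds this baseline on the same split.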

[ ]: #low accuracy, adding more features including Life expectancy


predictors = ['Year', 'Population', 'Adult Mortality', 'infant deaths', 'Life expectancy',
              'under-five deaths', 'Schooling']


target = 'Status'

X = df[predictors].values
y = (df[target] == 'Developing').values.astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

knn = KNeighborsClassifier()
knn.fit(X_train, y_train)

predictions = knn.predict(X_test)

print(predictions[:20])
print(y_test[:20])

#accuracy for this model


accuracy = (predictions == y_test).mean()
print('accuracy: {0:.3f}'.format(accuracy)) # 0.900 in this run

[1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1]
[1 1 1 1 1 0 1 1 1 1 0 1 1 1 1 1 1 1 1 1]
accuracy: 0.900

[ ]: # accuracy increased from 0.858 to 0.900 in the runs shown, adding more features


predictors = ['Year', 'Population', 'Adult Mortality', 'infant deaths', 'Life expectancy',
              'under-five deaths', 'Schooling', 'Alcohol']


target = 'Status'

X = df[predictors].values
y = (df[target] == 'Developing').values.astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

knn = KNeighborsClassifier()
knn.fit(X_train, y_train)

predictions = knn.predict(X_test)

print(predictions[:20])
print(y_test[:20])

#accuracy for this model


accuracy = (predictions == y_test).mean()
print('accuracy: {0:.3f}'.format(accuracy)) # 0.933 in this run

[1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1]
[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]
accuracy: 0.933

[ ]: #adding even more features
df.info()
predictors = ['Year', 'Population', 'Adult Mortality', 'infant deaths', 'Life expectancy',
              'under-five deaths', 'Schooling', 'Alcohol', 'GDP']


target = 'Status'

X = df[predictors].values
y = (df[target] == 'Developing').values.astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

knn = KNeighborsClassifier()
knn.fit(X_train, y_train)

predictions = knn.predict(X_test)

print(predictions[:20])
print(y_test[:20])

#accuracy for this model


accuracy = (predictions == y_test).mean()
print('accuracy: {0:.3f}'.format(accuracy)) # 0.958 in this run

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1649 entries, 0 to 2937
Data columns (total 23 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Country 1649 non-null object
1 Year 1649 non-null int64
2 Status 1649 non-null object
3 Life expectancy 1649 non-null float64
4 Adult Mortality 1649 non-null float64
5 infant deaths 1649 non-null int64
6 Alcohol 1649 non-null float64
7 percentage expenditure 1649 non-null float64
8 Hepatitis B 1649 non-null float64
9 Measles 1649 non-null int64
10 BMI 1649 non-null float64
11 under-five deaths 1649 non-null int64
12 Polio 1649 non-null float64
13 Total expenditure 1649 non-null float64
14 Diphtheria 1649 non-null float64

15 HIV/AIDS 1649 non-null float64
16 GDP 1649 non-null float64
17 Population 1649 non-null float64
18 thinness 1-19 years 1649 non-null float64
19 thinness 5-9 years 1649 non-null float64
20 Income composition of resources 1649 non-null float64
21 Schooling 1649 non-null float64
22 DevStatus 1649 non-null uint8
dtypes: float64(16), int64(4), object(2), uint8(1)
memory usage: 297.9+ KB
[1 1 1 1 1 0 0 1 1 1 1 1 1 1 0 1 1 1 1 1]
[1 1 1 1 1 0 0 1 1 1 1 1 1 1 0 1 1 1 1 1]
accuracy: 0.958
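Each classifier cell above draws a new, unseeded train/test split, so part of the accuracy change between cells may come from the split rather than from the added features. A sketch of a steadier comparison, scoring every feature set on the same seeded cross-validation folds, follows. The feature sets and data are synthetic placeholders, not the WHO columns.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: two informative columns and two pure-noise columns.
rng = np.random.default_rng(1)
n = 400
informative = rng.normal(size=(n, 2))
noise = rng.normal(size=(n, 2))
y = (informative[:, 0] + informative[:, 1] > 0).astype(int)

feature_sets = {
    "noise only": noise,
    "noise + informative": np.hstack([noise, informative]),
}

# The same seeded folds are reused for every feature set.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
model = make_pipeline(StandardScaler(), KNeighborsClassifier())

mean_accuracy = {}
for name, X in feature_sets.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    mean_accuracy[name] = scores.mean()
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Reporting the mean and spread over folds makes it easier to judge whether a gain from an extra feature is larger than the split-to-split variation.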
