Final Group Project
1 Introduction
CST 383 Final Group Project
Team:
Adam Momand Zack Hester Royal Williams Alex O’Brien
Data set:
Life Expectancy (WHO) https://www.kaggle.com/datasets/kumarajarshi/life-expectancy-who
Objective
Our team uses World Health Organization (WHO) data on life expectancy by country. The goal is
to predict life expectancy from predictors such as development status (developing or developed),
adult mortality, infant deaths, alcohol consumption, BMI, and measles and polio data.
[ ]: # Data
import pandas as pd
path = "https://raw.githubusercontent.com/alexobrien5/FinalProject/main/life_expectancy.csv"
df = pd.read_csv(path)
3 Preprocessing Data
[ ]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2938 entries, 0 to 2937
Data columns (total 22 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Country 2938 non-null object
1 Year 2938 non-null int64
2 Status 2938 non-null object
3 Life expectancy 2928 non-null float64
4 Adult Mortality 2928 non-null float64
5 infant deaths 2938 non-null int64
6 Alcohol 2744 non-null float64
7 percentage expenditure 2938 non-null float64
8 Hepatitis B 2385 non-null float64
9 Measles 2938 non-null int64
10 BMI 2904 non-null float64
11 under-five deaths 2938 non-null int64
12 Polio 2919 non-null float64
13 Total expenditure 2712 non-null float64
14 Diphtheria 2919 non-null float64
15 HIV/AIDS 2938 non-null float64
16 GDP 2490 non-null float64
17 Population 2286 non-null float64
18 thinness 1-19 years 2904 non-null float64
19 thinness 5-9 years 2904 non-null float64
20 Income composition of resources 2771 non-null float64
21 Schooling 2775 non-null float64
dtypes: float64(16), int64(4), object(2)
memory usage: 505.1+ KB
[ ]: # Data Cleaning
# Dropping null rows
df = df.dropna()
df.isna().sum()
[ ]: Country 0
Year 0
Status 0
Life expectancy 0
Adult Mortality 0
infant deaths 0
Alcohol 0
percentage expenditure 0
Hepatitis B 0
Measles 0
BMI 0
under-five deaths 0
Polio 0
Total expenditure 0
Diphtheria 0
HIV/AIDS 0
GDP 0
Population 0
thinness 1-19 years 0
thinness 5-9 years 0
Income composition of resources 0
Schooling 0
dtype: int64
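The `df.info()` listing above reports 2938 rows, while the `df.info()` call later in the notebook reports 1649, so dropping every incomplete row appears to discard a large share of the data. A minimal sketch of how that loss might be audited before committing to `dropna()`, using a made-up toy frame (the values and the names `toy`, `missing_per_column`, `rows_before`, and `rows_after` are illustrative assumptions, not part of the project):

```python
import pandas as pd

# Toy frame standing in for the WHO data (values are made up for illustration)
toy = pd.DataFrame({
    "Country": ["A", "B", "C", "D"],
    "GDP": [100.0, None, 300.0, 400.0],
    "Population": [1e6, 2e6, None, 4e6],
    "Schooling": [10.0, 11.0, 12.0, 13.0],
})

missing_per_column = toy.isna().sum()   # missing cells in each column
rows_before = len(toy)
rows_after = len(toy.dropna())          # rows that survive a full dropna()

print(missing_per_column)
print(f"rows kept: {rows_after} of {rows_before}")
```

Here a single missing cell in `GDP` and another in `Population` remove two of the four rows, which is the same effect the sparse `GDP` and `Population` columns have on the real data.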
[ ]: df.describe()
Diphtheria HIV/AIDS GDP Population \
count 1649.000000 1649.000000 1649.000000 1.649000e+03
mean 84.155246 1.983869 5566.031887 1.465363e+07
std 21.579193 6.032360 11475.900117 7.046039e+07
min 2.000000 0.100000 1.681350 3.400000e+01
25% 82.000000 0.100000 462.149650 1.918970e+05
50% 92.000000 0.100000 1592.572182 1.419631e+06
75% 97.000000 0.700000 4718.512910 7.658972e+06
max 99.000000 50.600000 119172.741800 1.293859e+09
[ ]: # Preprocessing
# Encode Status as a new binary column: 1 = Developed, 0 = Developing
import numpy as np
df['Status'].value_counts()
dummies = pd.get_dummies(df['Status'])
# np.array drops the index so the values are assigned by position
df['DevStatus'] = np.array(dummies['Developed'])
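The same 0/1 column can also be produced in one step by comparing `Status` against the label directly. A small sketch on a made-up frame (the name `toy` and its rows are assumptions for illustration; `uint8` is chosen only to mirror the dtype that `df.info()` reports for `DevStatus` later in the notebook):

```python
import pandas as pd

# Made-up rows; only the Status labels match the real data set
toy = pd.DataFrame({"Status": ["Developing", "Developed", "Developing"]})

# True where the country is developed, stored as 1/0
toy["DevStatus"] = (toy["Status"] == "Developed").astype("uint8")

print(toy)
```

This avoids the intermediate dummy frames, at the cost of hard-coding the label string.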
[Figure: grid of per-column plots. Surviving panel titles: infant deaths, Alcohol, percentage expenditure, Hepatitis B, Measles, BMI, under-five deaths, Polio, Total expenditure, Diphtheria, HIV/AIDS, GDP, Population, thinness 1-19 years, thinness 5-9 years, Income composition of resources, Schooling, DevStatus]
4 Data Exploration and Visualization
[ ]: df.head()
0 62 0.01 71.279624 65.0 1154 …
1 64 0.01 73.523582 62.0 492 …
2 66 0.01 73.219243 64.0 430 …
3 69 0.01 78.184215 67.0 2787 …
4 71 0.01 7.097109 68.0 3013 …
Schooling DevStatus
0 10.1 0
1 10.0 0
2 9.9 0
3 9.8 0
4 9.5 0
[5 rows x 23 columns]
Showing correlation between features
[ ]: df.corr(numeric_only=True)  # skip the text columns Country and Status
Year Life expectancy Adult Mortality \
infant deaths 0.008029 -0.169074 0.042450
Alcohol -0.113365 0.402718 -0.175535
percentage expenditure 0.069553 0.409631 -0.237610
Hepatitis B 0.114897 0.199935 -0.105225
Measles -0.053822 -0.068881 -0.003967
BMI 0.005739 0.542042 -0.351542
under-five deaths 0.010479 -0.192265 0.060365
Polio -0.016699 0.327294 -0.199853
Total expenditure 0.059493 0.174718 -0.085227
Diphtheria 0.029641 0.341331 -0.191429
HIV/AIDS -0.123405 -0.592236 0.550691
GDP 0.096421 0.441322 -0.255035
Population 0.012567 -0.022305 -0.015012
thinness 1-19 years 0.019757 -0.457838 0.272230
thinness 5-9 years 0.014122 -0.457508 0.286723
Income composition of resources 0.122892 0.721083 -0.442203
Schooling 0.088732 0.727630 -0.421171
DevStatus -0.034138 0.442798 -0.278173
percentage expenditure Hepatitis B \
Alcohol 0.417047 0.109889
percentage expenditure 1.000000 0.016760
Hepatitis B 0.016760 1.000000
Measles -0.063071 -0.124800
BMI 0.242738 0.143302
under-five deaths -0.092158 -0.240766
Polio 0.128626 0.463331
Total expenditure 0.183872 0.113327
Diphtheria 0.134813 0.588990
HIV/AIDS -0.095085 -0.094802
GDP 0.959299 0.041850
Population -0.016792 -0.129723
thinness 1-19 years -0.255035 -0.129406
thinness 5-9 years -0.255635 -0.133251
Income composition of resources 0.402170 0.184921
Schooling 0.422088 0.215182
DevStatus 0.461688 0.140351
Total expenditure Diphtheria HIV/AIDS \
percentage expenditure 0.183872 0.134813 -0.095085
Hepatitis B 0.113327 0.588990 -0.094802
Measles -0.113583 -0.058606 -0.003522
BMI 0.189469 0.176295 -0.210897
under-five deaths -0.145803 -0.178448 0.019476
Polio 0.119768 0.609245 -0.107885
Total expenditure 1.000000 0.129915 0.043101
Diphtheria 0.129915 1.000000 -0.117601
HIV/AIDS 0.043101 -0.117601 1.000000
GDP 0.180373 0.158438 -0.108081
Population -0.079962 -0.039898 -0.027801
thinness 1-19 years -0.209872 -0.187242 0.172592
thinness 5-9 years -0.217865 -0.180952 0.183147
Income composition of resources 0.183653 0.343262 -0.248590
Schooling 0.243783 0.350398 -0.211840
DevStatus 0.192538 0.201654 -0.129555
thinness 5-9 years \
Hepatitis B -0.133251
Measles 0.174946
BMI -0.554094
under-five deaths 0.462289
Polio -0.174489
Total expenditure -0.217865
Diphtheria -0.180952
HIV/AIDS 0.183147
GDP -0.277959
Population 0.277913
thinness 1-19 years 0.927913
thinness 5-9 years 1.000000
Income composition of resources -0.438484
Schooling -0.472482
DevStatus -0.307279
DevStatus
Year -0.034138
Life expectancy 0.442798
Adult Mortality -0.278173
infant deaths -0.108757
Alcohol 0.607782
percentage expenditure 0.461688
Hepatitis B 0.140351
Measles -0.071963
BMI 0.298380
under-five deaths -0.109847
Polio 0.201917
Total expenditure 0.192538
Diphtheria 0.201654
HIV/AIDS -0.129555
GDP 0.484801
Population -0.034790
thinness 1-19 years -0.308005
thinness 5-9 years -0.307279
Income composition of resources 0.463615
Schooling 0.512543
DevStatus 1.000000
DevStatus Alcohol
dtype: object
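In the correlation output above, the DevStatus row (0.442798 with Life expectancy) lines up with the second numeric column of the first block, which suggests that column holds the correlations with life expectancy; read that way, Schooling (0.727630) and Income composition of resources (0.721083) are the strongest positive values. A full matrix is hard to scan, so one option is to pull out the target column and rank the other features by absolute correlation. A minimal sketch on made-up numbers (the frame `toy`, its values, and the name `ranking` are assumptions for illustration):

```python
import pandas as pd

# Made-up values; only the column names echo the real data set
toy = pd.DataFrame({
    "Life expectancy": [60.0, 65.0, 70.0, 75.0, 80.0],
    "Schooling": [8.0, 9.0, 10.0, 11.0, 12.0],
    "HIV/AIDS": [5.0, 4.0, 3.0, 2.5, 0.5],
    "Population": [5.0, 1.0, 4.0, 2.0, 3.0],
})

corr = toy.corr()

# Correlation of every other column with the target, strongest first
ranking = (
    corr["Life expectancy"]
    .drop("Life expectancy")
    .abs()
    .sort_values(ascending=False)
)

print(ranking)
```

Ranking by absolute value keeps strongly negative predictors, such as HIV/AIDS in the real matrix, near the top instead of at the bottom.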
[ ]: # Pairwise plot of selected variables
import seaborn as sns
selected_vars = ['Life expectancy', 'Adult Mortality', 'infant deaths', 'Alcohol']
sns.pairplot(df[selected_vars])
[ ]: <seaborn.axisgrid.PairGrid at 0x7fe820334250>
[ ]: # Histogram: examine the distribution of a numeric variable
import matplotlib.pyplot as plt
sns.histplot(data=df, x='Life expectancy')
plt.xlabel('Life Expectancy')
plt.ylabel('Frequency')
plt.title('Distribution of Life Expectancy')
plt.show()
[ ]: # Scatter Plot - Life Expectancy vs GDP
plt.scatter(df['GDP'], df['Life expectancy'])
plt.xlabel('GDP')
plt.ylabel('Life Expectancy')
plt.title('Life Expectancy vs. GDP')
plt.show()
[ ]: # Bar plot - Average Life Expectancy by Country
plt.figure(figsize=(12, 6))
sns.barplot(data=df, x='Country', y='Life expectancy')
plt.xlabel('Country')
plt.ylabel('Life Expectancy')
plt.title('Average Life Expectancy by Country')
plt.xticks(rotation=45)
plt.subplots_adjust(left=0.1, right=0.9)
plt.show()
5 Predicting Life Expectancy with kNN Regression
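[ ]: # kNN Regression Prediction
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler

predictors = ['Adult Mortality', 'infant deaths', 'Alcohol', 'Hepatitis B', 'Measles', 'BMI']
# Assumed: the target definition is not shown; a one-item list keeps y two-dimensional,
# matching the printed output
target = ['Life expectancy']
X = df[predictors].values
y = df[target].values
# Same 70/30 split as the following cells
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)
# Fit the scaler on the training rows only, then apply it to the test rows
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
knn = KNeighborsRegressor(n_neighbors=7)
knn.fit(X_train, y_train)
predictions = knn.predict(X_test)
print(predictions[:10])
print(y_test[:10])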
[[67.4 ]
[74.47142857]
[81.87142857]
[53.21428571]
[51.85714286]
[49.72857143]
[70.6 ]
[73.64285714]
[73.3 ]
[74.18571429]]
[[67.5]
[73.8]
[79.1]
[54.9]
[48.6]
[50. ]
[68.7]
[74.1]
[76.9]
[72.4]]
MSE: 13
RMSE: 4
blind MSE: 79
blind RMSE: 9
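The code that prints the MSE, RMSE, and "blind" figures is not shown, and the printed values look rounded to whole numbers. A minimal sketch of how such figures might be computed, applied only to the ten predictions and actual values printed above, so the numbers will differ from the full test-set results. It assumes "blind" means always predicting a single constant (the mean); the helper `mse` and all variable names are illustrative:

```python
import math
from statistics import mean

# First ten predictions and actual values printed by the cell above
predicted = [67.4, 74.47142857, 81.87142857, 53.21428571, 51.85714286,
             49.72857143, 70.6, 73.64285714, 73.3, 74.18571429]
actual = [67.5, 73.8, 79.1, 54.9, 48.6, 50.0, 68.7, 74.1, 76.9, 72.4]

def mse(y_true, y_pred):
    """Mean squared error between two equal-length sequences."""
    return mean((t - p) ** 2 for t, p in zip(y_true, y_pred))

model_mse = mse(actual, predicted)
model_rmse = math.sqrt(model_mse)

# Assumed meaning of "blind": always predict one constant, here the mean of
# the actual values (in a full workflow this would be the training-set mean).
blind_prediction = mean(actual)
blind_mse = mse(actual, [blind_prediction] * len(actual))
blind_rmse = math.sqrt(blind_mse)

print(f"MSE: {model_mse:.2f}  RMSE: {model_rmse:.2f}")
print(f"blind MSE: {blind_mse:.2f}  blind RMSE: {blind_rmse:.2f}")
```

Printing two decimals rather than whole numbers makes small changes between feature sets visible, which matters when comparing runs whose rounded RMSE values are 4 and 3.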
[ ]: # Same prediction with more predictors (adding the binary developed-status feature as well)
X = df[predictors].values
y = df[target].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
knn = KNeighborsRegressor(n_neighbors=7)
knn.fit(X_train, y_train)
predictions = knn.predict(X_test)
print(predictions[:10])
print(y_test[:10])
[[67.17142857]
[74.78571429]
[80.05714286]
[53.31428571]
[53.18571429]
[49.72857143]
[69.11428571]
[73.07142857]
[72.35714286]
[73.32857143]]
[[67.5]
[73.8]
[79.1]
[54.9]
[48.6]
[50. ]
[68.7]
[74.1]
[76.9]
[72.4]]
MSE: 13
RMSE: 4
blind MSE: 79
blind RMSE: 9
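With `n_neighbors=7`, each prediction is the mean target of the seven nearest training rows in scaled feature space, which is why printed values such as 74.78571429 are multiples of 1/7. A minimal pure-Python sketch of that idea on made-up rows, with `k=2` to keep it small (the helpers `fit_scaler`, `transform`, and `knn_regress` are hypothetical names, not scikit-learn functions):

```python
import math
from statistics import mean, pstdev

def fit_scaler(rows):
    """Return per-column (mean, std) computed from the training rows only."""
    columns = list(zip(*rows))
    return [(mean(c), pstdev(c) or 1.0) for c in columns]  # 1.0 guards constant columns

def transform(rows, params):
    """Standardize each value with the training mean and std."""
    return [[(v - m) / s for v, (m, s) in zip(row, params)] for row in rows]

def knn_regress(train_X, train_y, query, k):
    """Predict by averaging the targets of the k nearest training rows."""
    nearest = sorted(range(len(train_X)),
                     key=lambda i: math.dist(train_X[i], query))[:k]
    return mean(train_y[i] for i in nearest)

# Made-up rows: [adult mortality, infant deaths] -> life expectancy
train_X = [[100.0, 1.0], [110.0, 2.0], [300.0, 30.0], [320.0, 28.0]]
train_y = [75.0, 73.0, 52.0, 50.0]
test_X = [[105.0, 1.0]]

params = fit_scaler(train_X)
train_scaled = transform(train_X, params)
test_scaled = transform(test_X, params)

prediction = knn_regress(train_scaled, train_y, test_scaled[0], k=2)
print(prediction)  # mean of the two closest targets, 75.0 and 73.0
```

Scaling with training statistics only mirrors the `fit_transform` / `transform` pairing in the cells above; without scaling, the column with the largest raw range would dominate the distance.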
[ ]: # After finding a correlation with Schooling, add it to see whether the RMSE decreases
# Same prediction with more predictors (adding the binary developed-status feature as well)
knn = KNeighborsRegressor(n_neighbors=7)
knn.fit(X_train, y_train)
predictions = knn.predict(X_test)
print(predictions[:10])
print(y_test[:10])
[[65.35714286]
[75.17142857]
[81.52857143]
[51.58571429]
[53.18571429]
[49.72857143]
[69.97142857]
[74.41428571]
[75.8 ]
[73.32857143]]
[[67.5]
[73.8]
[79.1]
[54.9]
[48.6]
[50. ]
[68.7]
[74.1]
[76.9]
[72.4]]
MSE: 12
RMSE: 3
blind MSE: 79
blind RMSE: 9
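[ ]: # kNN classification: predict development status (1 = Developing, 0 = Developed)
from sklearn.neighbors import KNeighborsClassifier

target = 'Status'  # assumed: the target definition is not shown in the source
X = df[predictors].values
y = (df[target] == 'Developing').values.astype(int)
# Assumed: the split call is not shown in the source; test_size is a placeholder
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)
predictions = knn.predict(X_test)
print(predictions[:20])
print(y_test[:20])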
[1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1]
[1 1 1 1 1 1 0 1 1 1 1 0 1 0 1 1 1 1 1 1]
accuracy: 0.858
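The labels printed above are mostly 1 (Developing), so accuracy is easier to judge next to a baseline that always predicts the most common label. A minimal sketch using only the twenty predictions and labels printed above, so its numbers differ from the full test-set accuracy (the names `accuracy`, `majority_label`, and `baseline` are illustrative):

```python
# First twenty predictions and labels printed above (1 = Developing, 0 = Developed)
predicted = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1]
actual = [1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1]

# Accuracy: share of positions where prediction and label agree
accuracy = sum(p == a for p, a in zip(predicted, actual)) / len(actual)

# Baseline: always predict the most common label
majority_label = max(set(actual), key=actual.count)
baseline = actual.count(majority_label) / len(actual)

print(f"accuracy: {accuracy:.3f}")
print(f"majority-class baseline: {baseline:.3f}")
```

On this sample the classifier agrees with 18 of 20 labels while the constant guess agrees with 17, so a high accuracy on imbalanced labels is only informative relative to that baseline.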
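[ ]: X = df[predictors].values
y = (df[target] == 'Developing').values.astype(int)
# Assumed: the split call is not shown in the source; test_size is a placeholder
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)
predictions = knn.predict(X_test)
print(predictions[:20])
print(y_test[:20])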
[1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1]
[1 1 1 1 1 0 1 1 1 1 0 1 1 1 1 1 1 1 1 1]
accuracy: 0.900
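[ ]: X = df[predictors].values
y = (df[target] == 'Developing').values.astype(int)
# Assumed: the split call is not shown in the source; test_size is a placeholder
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)
predictions = knn.predict(X_test)
print(predictions[:20])
print(y_test[:20])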
[1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1]
[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]
accuracy: 0.933
[ ]: # Adding even more features
df.info()
predictors = ['Year', 'Population', 'Adult Mortality', 'infant deaths', 'Life expectancy',
X = df[predictors].values
y = (df[target] == 'Developing').values.astype(int)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)
predictions = knn.predict(X_test)
print(predictions[:20])
print(y_test[:20])
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1649 entries, 0 to 2937
Data columns (total 23 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Country 1649 non-null object
1 Year 1649 non-null int64
2 Status 1649 non-null object
3 Life expectancy 1649 non-null float64
4 Adult Mortality 1649 non-null float64
5 infant deaths 1649 non-null int64
6 Alcohol 1649 non-null float64
7 percentage expenditure 1649 non-null float64
8 Hepatitis B 1649 non-null float64
9 Measles 1649 non-null int64
10 BMI 1649 non-null float64
11 under-five deaths 1649 non-null int64
12 Polio 1649 non-null float64
13 Total expenditure 1649 non-null float64
14 Diphtheria 1649 non-null float64
15 HIV/AIDS 1649 non-null float64
16 GDP 1649 non-null float64
17 Population 1649 non-null float64
18 thinness 1-19 years 1649 non-null float64
19 thinness 5-9 years 1649 non-null float64
20 Income composition of resources 1649 non-null float64
21 Schooling 1649 non-null float64
22 DevStatus 1649 non-null uint8
dtypes: float64(16), int64(4), object(2), uint8(1)
memory usage: 297.9+ KB
[1 1 1 1 1 0 0 1 1 1 1 1 1 1 0 1 1 1 1 1]
[1 1 1 1 1 0 0 1 1 1 1 1 1 1 0 1 1 1 1 1]
accuracy: 0.958