Final Report
Contents
1 Executive Summary
1.1 Finding a suitable data set
1.2 Purpose of data cleaning and file preparation
1.3 File conversion and preparation
1.4 The synergy
1.5 The findings
2 Database
2.1 The process of creating the database
2.2 Exploratory Analysis
3 Other Methods
3.1 Model building
3.2 Conclusion
3.3 Recommendation
3.4 Detailed Schedule
3.5 Help from the course material
4 References
5 Appendices
5.1 Appendix 1
5.2 Appendix II
PROJECT FINAL REPORT
1 Executive Summary
Higher life expectancy results from many factors, including improvements in health and nutrition,
education, health services, reduced mortality and many other social and economic factors.
Importantly, life expectancy increases not only when older people live longer but also when fewer
people die at a younger age. These indicators are therefore significant for life expectancy
(Murillo, 2020).
It has become important to analyse and identify the indicators/variables that affect population
health, so that health care systems and other measures can be improved or new strategies
implemented, thereby increasing life expectancy.
This study aims to analyse the Organisation for Economic Co-operation and Development (OECD)
health-related data to determine the indicators that affect population health, using 'Life
expectancy' as the key measure.
The study analyses OECD countries' health-related data to identify which variables are
significantly associated with population health. The results can then serve as a role model for
raising the level of these indicators in developing countries, improving their population health
and life expectancy.
Analysing the various aspects of health in developed countries allows the same health-related
services in developing countries to be compared, gaps to be filled and better insight gained into
future improvements.
An advanced search led me to the OECD statistics (OECD.org, 2020) (https://stats.oecd.org/), which
helped to answer my research question.
The ‘HealthExpenditureByAge’ data set reports only three countries across all years, so it was
excluded from the analysis.
Merged all data into a single dataset.
Created a SQL view to show all variables for all countries across all years, including the
dependent variable (Life expectancy).
Replaced all missing data with nulls.
Sliced the data using the created view.
Imported the merged final dataset into R for data analysis and visualisation.
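As an illustration of the merge step above, the sketch below stacks wide, per-variable tables (one column per year) into country-year rows and joins two variables on those keys, which is the shape the SQL view produces. It is a minimal base R sketch rather than the project's SQL; the helper `wide_to_long`, the toy tables and their values are hypothetical.

```r
# Hypothetical wide tables: one row per country, one column per year.
population <- data.frame(Country = c("A", "B"),
                         `2005` = c(10, 20), `2006` = c(11, NA),
                         check.names = FALSE)
injuries <- data.frame(Country = c("A", "B"),
                       `2005` = c(5, 7), `2006` = c(6, 8),
                       check.names = FALSE)

# Stack the year columns into (Country, year, value) rows.
wide_to_long <- function(wide, value_name) {
  years <- setdiff(names(wide), "Country")
  long <- do.call(rbind, lapply(years, function(y) {
    data.frame(Country = wide$Country,
               year = as.integer(y),
               value = wide[[y]])
  }))
  names(long)[names(long) == "value"] <- value_name
  long
}

# Join the variables on country and year; missing cells stay NA.
merged <- merge(wide_to_long(population, "TotalPopulation"),
                wide_to_long(injuries, "Injuries"),
                by = c("Country", "year"), all = TRUE)
```

Each further variable would be joined the same way, so every row ends up holding all indicators for one country in one year.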
2 Database
2.1 The process of creating the database
The following steps were taken when creating the database in this project.
Requirements analysis and design plan
Requirements for the database were identified by going through all the CSV datasheets.
Identify the database tables, fields and primary / foreign keys.
Draft database outline
Design and draft the table structure according to the above-identified facts.
Create database scripts
According to the OECD data set, all datasets were arranged by country and by year for each
variable. Therefore, each CSV datasheet for a variable is converted to a SQL table, assigning
the country name as the primary key (see appendix 1).
Load test data and perform testing
Load data
Once testing was successful, the data from the CSV files were loaded into the tables
automatically, using SQL scripts (see appendix 1).
Missing values
Missing values need to be handled because they can lead to wrong predictions or classifications in
any model that is used. There are several techniques for handling missing data in the data analysis
process. I used the ‘MICE’ package in R (appendix 2) to impute the missing values instead of
deleting them (Chaitanya Sagar, 2017). This method uses linear regression to predict the missing
values.
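To make the idea of regression-based imputation concrete, the sketch below fills the gaps in one variable from a linear model fitted on the complete rows. It shows a single regression step only; the mice package cycles through every incomplete variable over several iterations and produces several imputed datasets. The function `impute_by_regression` and the toy data are hypothetical and are not taken from the project.

```r
# Hypothetical toy data: y has two missing values, x is complete.
toy <- data.frame(x = c(1, 2, 3, 4, 5, 6),
                  y = c(2, 4, NA, 8, NA, 12))

# Fit target ~ predictor on the complete rows, then predict the gaps.
impute_by_regression <- function(df, target, predictor) {
  missing <- is.na(df[[target]])
  fit <- lm(reformulate(predictor, response = target),
            data = df[!missing, , drop = FALSE])
  df[[target]][missing] <- predict(fit, newdata = df[missing, , drop = FALSE])
  df
}

completed <- impute_by_regression(toy, "y", "x")
```

Because the observed points in the toy data lie on a straight line, the imputed values fall on that same line.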
The figures below show the missing values in the dataset. The ‘Cancer’ variable is 100% missing,
and ‘Population’ has the fewest missing values at 7.8%. The tobacco, fruit consumption and obese
population variables are each missing more than 50% of their data.
Figure 2
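Percentages like those quoted above can be obtained with a one-line summary per column. The sketch below is a hedged base R example; the helper `pct_missing` and the toy data frame are hypothetical, and only the column names echo the project's variables.

```r
# Hypothetical toy data with different amounts of missingness per column.
toy <- data.frame(Cancer = c(NA, NA, NA, NA),
                  TobaccoConsumption = c(1.2, NA, NA, NA),
                  TotalPopulation = c(10, 20, 30, NA))

# Share of missing values per column, in percent, largest first.
pct_missing <- function(df) {
  sort(round(100 * colMeans(is.na(df)), 1), decreasing = TRUE)
}

missing_share <- pct_missing(toy)
```

The appendix code counts missing values with `sum(is.na(x))`; dividing by the number of rows, as `colMeans` does here, turns those counts into shares.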
Based on the above, I imputed the missing values using three parameters of the package: the number
of imputed datasets, the number of iterations the model should run and the imputation method
(Vidhya, 2016).
Five imputed datasets were generated and ‘Goodness of fit’ was checked to select the closest
data set. A density plot was generated to visualise the similarity between the imputed datasets
and the original dataset.
Most importantly, the ‘Life Expectancy’ variable was excluded from the imputation. Including the
response variable in the imputation prediction would not be reasonable, as it may affect the final
prediction model.
After a closer analysis, I chose the second imputed dataset to complete the original dataset. The
image below shows the new ‘completed’ dataset, which does not contain any missing values.
Checking for Outliers – It is important to check for outliers in the given data sets since they
affect the mean/median of the data, which in turn affects the absolute and mean error of the model
built in the next steps. The boxplots below were created for each variable in the data set to show
the spread of the data. However, low or high figures are not treated as outliers, since they are
specific to each country and are reported as valid data.
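For reference, the points a boxplot draws beyond its whiskers follow an interquartile-range rule. The sketch below applies the common 1.5 x IQR version of that rule in base R; `flag_outliers` and the sample values are hypothetical, and R's `boxplot()` uses a closely related hinge-based calculation, so results can differ slightly on small samples.

```r
# Hypothetical life expectancy values for one year.
values <- c(78, 79, 80, 81, 82, 56)

# TRUE where a value lies more than k * IQR outside the quartiles.
flag_outliers <- function(x, k = 1.5) {
  q <- unname(quantile(x, c(0.25, 0.75), na.rm = TRUE))
  iqr <- q[2] - q[1]
  x < q[1] - k * iqr | x > q[2] + k * iqr
}

flagged <- values[flag_outliers(values)]
```

A value flagged this way is only a candidate; as noted above, genuine country-level figures were kept in the analysis.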
The above boxplots show that there is a strong relationship between life expectancy and the type of
country (OECD or non-OECD). The mean and median figures for OECD countries are higher than those
for non-OECD countries. This shows at a glance that OECD (developed) countries have higher life
expectancy than non-OECD countries.
The summary of the variables is as below. Accordingly, the minimum life expectancy is 56 and the
highest life expectancy is 84 years in this dataset.
A correlation matrix was created to see the relationships between the predictor variables. However,
I have excluded the non-OECD countries from this point onwards to focus on the relationships with
life expectancy in the developed (OECD) countries.
According to the above correlation matrix, there are weak positive relationships between life
expectancy and total employment, GDP, health expenditure, fruit consumption and pharmaceutical
consumption. On the other hand, there are weak negative relationships with live birth deaths
and AIDS.
3 Other Methods
Multiple linear regression (Kabacoff, 2017) is used in this study as the statistical analysis
technique to find the best-fitting line between the response and the predictor variables, so that
the slope of the regression line can be measured.
The imputed and completed dataset is split in two for model training and model testing: 70% of the
data (233 observations) is used to train the model and the remaining 30% (100 observations) is
used to test and evaluate it.
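The split sizes follow directly from the 333 observations: 70% rounds down to 233 training rows, leaving 100 for testing. The sketch below reproduces that arithmetic with a random index split in base R. It is an assumed illustration, since the report's own split code is not fully shown in the appendix; only the seed value mirrors the appendix.

```r
set.seed(123)   # fixed seed so the split is reproducible
n <- 333        # number of observations in the completed dataset

# Randomly pick 70% of the row indices for training; the rest form the test set.
train_idx <- sample(seq_len(n), size = floor(0.7 * n))
test_idx <- setdiff(seq_len(n), train_idx)

split_sizes <- c(train = length(train_idx), test = length(test_idx))
```

Keeping the two index sets disjoint ensures the model is evaluated only on rows it has not seen during training.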
A correlation matrix was generated to check for multicollinearity between the predictor variables.
According to the matrix below, population and physician practising are strongly correlated (97%).
Besides, ‘total deaths’ is strongly correlated (97%) with physician practising. Therefore the
‘physician practising' variable is removed from the model. Secondly, total deaths and total health
care are also strongly correlated (74%), so total deaths is removed from the model. Total
employment and GDP are correlated at 76%, so total employment is also removed from the model.
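Reading strongly correlated pairs off a matrix by eye can be supplemented with a small filter. The sketch below lists every predictor pair whose absolute correlation exceeds a cutoff; the function `high_cor_pairs`, the 0.7 cutoff and the toy data are assumptions for illustration, not the procedure used in the report.

```r
# Hypothetical predictors: b is an exact linear function of a, c is unrelated.
toy <- data.frame(a = 1:10,
                  b = 2 * (1:10) + 1,
                  c = c(3, 1, 4, 1, 5, 9, 2, 6, 5, 3))

# List variable pairs whose absolute correlation exceeds the cutoff.
high_cor_pairs <- function(df, cutoff = 0.7) {
  cm <- cor(df, use = "pairwise.complete.obs")
  idx <- which(abs(cm) > cutoff & upper.tri(cm), arr.ind = TRUE)
  data.frame(var1 = rownames(cm)[idx[, "row"]],
             var2 = colnames(cm)[idx[, "col"]],
             r = cm[idx])
}

pairs_found <- high_cor_pairs(toy)
```

For each pair returned, one of the two variables is a candidate for removal, which is the decision made above for physician practising, total deaths and total employment.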
The built models are then used to predict life expectancy on the test dataset and evaluated, as
shown below, using their R squared values. The R squared values for both models are close to
those obtained on the training data.
According to the above plots, the relationship between the residuals and the fitted values is not
completely linear. However, most data points fit the line and the residuals are approximately
normally distributed.
The model with all 13 selected variables achieved an adjusted R squared of 0.1827 on the training
data set, which means that about 18% of the variation in life expectancy can be explained by the
predictor variables. On the test dataset the model gave an R squared of 0.1233, which indicates a
weak relationship between the response variable and the predictor variables.
Likewise, the model built using step-wise regression gave an R squared of 0.1883, which is
slightly higher than that of model 1.
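For readers less familiar with these measures, the sketch below computes R squared from observed and predicted values and then applies the usual adjustment for the number of predictors. The function names and the toy numbers are hypothetical; the formulas are the standard textbook definitions, not code from the report.

```r
# R squared: share of the variance in the observations explained by the predictions.
r_squared <- function(obs, pred) {
  1 - sum((obs - pred)^2) / sum((obs - mean(obs))^2)
}

# Adjusted R squared: penalises R squared for the number of predictors p,
# given n observations.
adjusted_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

# Hypothetical observed and predicted values.
obs <- c(70, 75, 80, 85)
pred <- c(72, 74, 81, 83)
r2_example <- r_squared(obs, pred)
adj_example <- adjusted_r2(0.5, n = 11, p = 2)
```

The adjustment matters most when many predictors are fitted to relatively few observations, because plain R squared never decreases when a variable is added.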
Model 2 shows GDP and total health care variables are highly significant when predicting life
expectancy. Besides, it also demonstrates injuries and health expenditure have a good significance
whereas pharmaceutical consumption and tobacco consumption have a slight significance over the
life expectancy.
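Backward step-wise selection of the kind used for model 2 can be sketched with base R's `step()`, which drops one term at a time while the AIC improves. The example uses the built-in `mtcars` data purely as a stand-in, since the project dataset is not reproduced here; the variable names and the resulting model therefore have no connection to the life expectancy results.

```r
# Stand-in data: mtcars ships with R and is not the report's dataset.
full_model <- lm(mpg ~ ., data = mtcars)

# Backward elimination: repeatedly drop the term whose removal lowers AIC most.
step_model <- step(full_model, direction = "backward", trace = 0)

# Names of the predictors that survive the selection.
kept_terms <- attr(terms(step_model), "term.labels")
```

Calling `summary(step_model)` afterwards reports the significance of each retained predictor, which is how the variables discussed above would be read off.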
3.2 Conclusion
A country’s GDP (gross domestic product) and health care (government / social health care) have
the greatest significance for its life expectancy. Countries with higher domestic income have good
potential for high life expectancy. Health care is likewise a major and obvious factor in life
expectancy. Countries with better economies have better health care systems, which lead to better
life expectancy. On the other hand, of all the other health-related indicators, 'Injuries'
contributes a good level of significance to life expectancy, along with health expenditure.
3.3 Recommendation
There may be a better method for predicting life expectancy than multiple linear regression.
According to the readings, LASSO, Ridge regression and Random Forest can be used to predict life
expectancy. Besides, including other variables such as economic, social, environmental and safety
indicators in the analysis could help improve model performance. The model results show that
health-related indicators have a limited effect on life expectancy. Therefore, there may be other
factors with more significance, which calls for an extended analysis of ‘Life Expectancy’.
3.4 Detailed Schedule
Schedule table (columns include AIMS and Outcome)
[...] behind the schedule at this point but was confident that I could catch up on the work in the next few weeks.
Week 12, 28-2 October | Continue writing progress report | | Continue with the progress report work | Progress report writing continued | Completed
Week 13, 5-9 October | Submit progress report | Module 11 | Read module 11; submit Assignment 2; continue with the analysis; start creating visualisations with the analysed data findings | Work pending – not achieved anything this week, being unwell | Progress report submitted.
Week 14, 12-16 October | Obtain final results/evaluations | Module 12 | Discuss and analyse the outcomes of the analysis | Work pending | Completed module 11 reading and further analysis continued. With further research paper readings, found a method of handling missing values. Therefore, had to revisit the data analysis. Increased my study time to catch up and be on track. Started creating visualisations of the findings.
Week 15, 19-23 October | Prepare final report | | Start writing the final report | Work pending | Report writing started while model building and evaluation continued.
Week 16, 26-6 November | Submit final report | | Submit final assessment | Work pending | Continuation of model building and evaluation using different R functions and regression methods. Completed the report by the due date.
3.5 Help from the course material
My biggest challenge was finding the dataset for the project. Fortunately, the presentation included
in the course material helped me a lot to find my required dataset. I was exposed to most of the
publicly available big datasets, thanks to this tutorial.
Handling big data, preparing the datasets and data cleaning were important steps in this project.
These steps were discussed clearly in the course modules, and I applied them directly to this
project. Database creation, management and security were also discussed in this module, which gave
me an interesting start to data modelling. The ‘data activities’/video clips in particular allowed
me to get hands-on experience as extended knowledge. I was able to manipulate the dataset
(slicing/dicing) according to the analysis requirements before importing it into 'R' for analysis.
'R' was completely new to me and I was a little worried about whether I would be able to handle it
in my project analysis. However, the modules provided a lot of knowledge from the beginner level,
which helped me to start with ease and then advance to deeper functions. I am happy that I got to
learn this powerful statistical package, which is an added value, and I am sincerely proud of my
achievement here. The modules also discuss data visualisation, its quality, techniques and
rationale, which helped me to present the results of my analysis.
It was interesting to learn about the Hadoop framework, although I did not specifically use it in
this project. However, for big data management, knowing the Hadoop framework will be a great
asset in future.
4 References
CHAITANYA SAGAR, P. A. 2017. A Solution to Missing Data: Imputation Using R (17:n37) [Online].
Available: https://www.kdnuggets.com/2017/09/missing-data-imputation-using-r.html
[Accessed].
KABACOFF, R. I. 2017. Multiple (Linear) Regression [Online]. Data Camp. Available:
https://www.statmethods.net/stats/regression.html [Accessed].
MURILLO, I. L. 2020. The life expectancy: what is it and why does it matter [Online]. CENIE. Available:
https://cenie.eu/en/blogs/age-society/life-expectancy-what-it-and-why-does-it-matter
[Accessed].
OECD.ORG. 2020. Our global reach - OECD [Online]. Available:
http://www.oecd.org/about/members-and-partners/ [Accessed 26/07/2020].
VIDHYA, A. 2016. Tutorial on 5 Powerful R Packages used for imputing missing values [Online].
Available: https://www.analyticsvidhya.com/blog/2016/03/tutorial-powerful-packages-imputing-missing-values/
[Accessed].
5 Appendices
5.1 Appendix 1
/***** Project Database Creation SQL ********/
/***************************************/
WITH
(
FIELDTERMINATOR = ',',
ROWTERMINATOR = '\n'
)
(
FIELDTERMINATOR = ',',
ROWTERMINATOR = '\n'
)
select e.Country,
case when e.Country like '%Non-OECD%' then 'N' else 'Y' end as Oecd,
2005 as [year], p.[2005] as TotalPopulation, i.[2005] as Injuries,
ld.[2005] as LiveBirthDeaths, null as Cancer, bw.[2005] as LowBirthWeight,
on e.Country = p.Country
Union
select e.Country,
case when e.Country like '%Non-OECD%' then 'N' else 'Y' end as Oecd,
2006 as [year], p.[2006] as TotalPopulation, i.[2006] as Injuries,
ld.[2006] as LiveBirthDeaths, null as Cancer, bw.[2006] as LowBirthWeight,
pp.[2006] as PhysicianPractising,null as HealthExpenditure,a.[2006] as AIDS,
ph.[2006] as PharmaConsumption,al.[2006] as AlcoholConsumption,ob.[2006] as
ObesePopulation,
on e.Country = p.Country
Union
select e.Country,case when e.Country like '%Non-OECD%' then 'N' else 'Y' end as Oecd,
2007 as [year], p.[2007] as TotalPopulation, i.[2007] as Injuries,
on e.Country = p.Country
Union
select e.Country,case when e.Country like '%Non-OECD%' then 'N' else 'Y' end as Oecd,
2008 as [year], p.[2008] as TotalPopulation, i.[2008] as Injuries,
on e.Country = p.Country
Union
select e.Country,
case when e.Country like '%Non-OECD%' then 'N' else 'Y' end as Oecd,
2009 as [year], p.[2009] as TotalPopulation, i.[2009] as Injuries,
on e.Country = p.Country
Union
select e.Country,case when e.Country like '%Non-OECD%' then 'N' else 'Y' end as Oecd,
2010 as [year], p.[2010] as TotalPopulation, i.[2010] as Injuries,
on e.Country = p.Country
Union
select e.Country,case when e.Country like '%Non-OECD%' then 'N' else 'Y' end as Oecd,
2011 as [year], p.[2011] as TotalPopulation, i.[2011] as Injuries,
on e.Country = p.Country
Union
select e.Country,case when e.Country like '%Non-OECD%' then 'N' else 'Y' end as Oecd,
2012 as [year], p.[2012] as TotalPopulation, i.[2012] as Injuries,
on e.Country = p.Country
Union
select e.Country,case when e.Country like '%Non-OECD%' then 'N' else 'Y' end as Oecd,
2013 as [year], p.[2013] as TotalPopulation, i.[2013] as Injuries,
on e.Country = p.Country
Union
select e.Country,case when e.Country like '%Non-OECD%' then 'N' else 'Y' end as Oecd,
2014 as [year], p.[2014] as TotalPopulation, i.[2014] as Injuries,
on e.Country = p.Country
Union
select e.Country,case when e.Country like '%Non-OECD%' then 'N' else 'Y' end as Oecd,
2015 as [year], p.[2015] as TotalPopulation, i.[2015] as Injuries,
on e.Country = p.Country
Union
select e.Country,case when e.Country like '%Non-OECD%' then 'N' else 'Y' end as Oecd,
2016 as [year], p.[2016] as TotalPopulation, i.[2016] as Injuries,
on e.Country = p.Country
Union
select e.Country,case when e.Country like '%Non-OECD%' then 'N' else 'Y' end as Oecd,
2017 as [year], p.[2017] as TotalPopulation, i.[2017] as Injuries,
on e.Country = p.Country
Union
select e.Country,case when e.Country like '%Non-OECD%' then 'N' else 'Y' end as Oecd,
on e.Country = p.Country
Order by e.Country
5.2 Appendix II
R code
# Deploy packages
# install.packages(c("tidyverse", "lattice", "GGally"))  # run once if not yet installed
library(tidyverse)    # loads dplyr, tidyr, purrr, ggplot2 and readr
library(magrittr)
library(data.table)
library(lattice)      # provides xyplot(); there is no separate "xyplot" package
library(GGally)
library(plot.matrix)
library(cdata)
library(WVPlots)
# devtools::install_github("kassambara/easyGgplot2")  # optional, needs devtools
## test
install.packages("mice")
library(mice)
install.packages("mi")
library(mi)
install.packages("missForest")
library(missForest)
install.packages("survival")
library(readr)
TotalPopulation = col_double(),
Injuries = col_double(),
LiveBirthDeaths = col_double(),
Cancer = col_double(),
LowBirthWeight = col_double(),
PhysicianPractising = col_double(),
HealthExpenditure = col_double(),
AIDS = col_double(),
PharmaConsumption = col_double(),
AlcoholConsumption = col_double(),
ObesePopulation = col_double(),
FruitConsumption = col_double(),
TobaccoConsumption = col_double(),
TotalHealthcare = col_double(),
TotalEmployment = col_double(),
LifeExpectancy = col_double()),
na = "null")
select_if(is.numeric)
select_if(is.numeric)
select_if(is.numeric)
ggcorr(Oecd_num,
label = T,
label_size = 2,
label_round = 2,
hjust = 1,
size = 3,
color = "royalblue",
layout.exp = 5,
low = "green3",
mid = "gray95",
high = "darkorange",
name = "Correlation")
ggcorr(Non_Oecd_selected,
label = T,
label_size = 2,
label_round = 2,
hjust = 1,
size = 3,
color = "royalblue",
layout.exp = 5,
low = "green3",
mid = "gray95",
high = "darkorange",
name = "Correlation")
select_if(is.numeric)
# Correlation matrix
ggcorr(data_num,
label = T,
label_size = 2,
label_round = 2,
hjust = 1,
size = 3,
color = "royalblue",
layout.exp = 5,
low = "green3",
mid = "gray95",
high = "darkorange",
name = "Correlation")
# Boxplots of each variable by year
boxplot(TotalPopulation~year, data=LifeExp,
        xlab="Year", ylab="Number of persons",
        col="blue", border="black")
boxplot(LifeExpectancy~year, data=LifeExp,
        main="Boxplots for each year\nLife Expectancy",
        xlab="Year", ylab="Years",
        col="blue", border="black")
boxplot(HealthExpenditure~year, data=LifeExp,
        xlab="Year", col="blue", border="black")
boxplot(ObesePopulation~year, data=LifeExp,
        xlab="Year", ylab="Percentage of total population",
        col="blue", border="black")
boxplot(TotalEmployment~year, data=LifeExp,
        xlab="Year", col="blue", border="black")
boxplot(GDP~year, data=LifeExp,
        xlab="Year", col="blue", border="black")
boxplot(totDeaths~year, data=LifeExp,
        xlab="Year", col="blue", border="black")
le <- select(LifeExp, LifeExpectancy)  # object name now matches the view() call below
view(le)
geom_point()
summary(Oecd.mis)
sort(sapply(Oecd.mis,function(x) sum(is.na(x))),decreasing = T)
md.pattern(Oecd.mis)
md.pattern(LifeExp.mis)
LifeVar<-subset(LifeExp,select = c(LifeExpectancy))
Var<-subset(LifeExp,select = c(LifeExpectancy,Oecd))
md.pattern(LifeExp.mis)
install.packages("VIM")
library(VIM)
numbers=TRUE, sortVars=TRUE,
labels=names(Oecd.mis), cex.axis=.7,
imputed_Oecd$imp$Oecd
imputed_Oecd$imp$LifeExpectancy
imputed_LifeExp$imp$LifeExpectancy
summary(imputed_Oecd)
summary(complete_LifeExp)
imputed_LifeExp$imp$AIDS
complete_data=mice::complete(imputed_LifeExp,2)
complete_LifeExp<- merge(complete_data,LifeVar,by="row.names")
complete_Oecd=mice::complete(imputed_Oecd,2)
complete_Oecd<- merge(complete_Oecd,LifeVar,by="row.names")
complete_Explo<- merge(complete_data,Var,by="row.names")
md.pattern(complete_Oecd)
densityplot(imputed_LifeExp)
xyplot(imputed_Data)
# Correlation matrix
ggcorr(complete_Oecd,
label = T,
label_size = 2,
label_round = 2,
hjust = 1,
size = 3,
color = "royalblue",
layout.exp = 5,
low = "green3",
mid = "gray95",
high = "darkorange",
name = "Correlation")
install.packages("faraway")
install.packages("glmnet")
library(faraway)
library(glmnet)
library(car)
# excluding multicollinearity
install.packages("caret")
library("caret")
set.seed(123)
length(train)
length(test)
summary(regress_Oecd)
# model 1
lm_mod1=lm(LifeExpectancy~.,data=regress_Oecd,subset = train)
summary(lm_mod1)
round(vif(lm_mod1), 2)
step_mod=step(lm_mod1, direction='backward',trace=F)
summary(step_mod)
round(vif(step_mod), 2)
# plot model 1
plot(lm_mod1,which=c(1,2))
plot(step_mod,which=c(1,2))
data.frame(
R2 = R2(step_pred, test$LifeExpectancy)
)
# Predict model1
data.frame(
R2 = R2(mod1_pred, test$LifeExpectancy)
)