Final Report

PROJECT FINAL REPORT
ANALYSIS OF POPULATION HEALTH

Organization for Economic Co-Operation and Development
(OECD health dataset)
Contents
1 Executive Summary
1.1 Finding a suitable data set
1.2 Purpose of data cleaning and file preparation
1.3 File conversion and preparation
1.4 The synergy
1.5 The findings
2 Database
2.1 The process of creating the database
2.2 Exploratory Analysis
3 Other Methods
3.1 Model building
3.2 Conclusion
3.3 Recommendation
3.4 Detailed Schedule
3.5 Help from the course material
4 References
5 Appendices
5.1 Appendix 1
5.2 Appendix II
1 Executive Summary
Higher life expectancy results from many factors, including improvements in health and nutrition,
education, health services, a decrease in mortality and many other social and economic factors.
Importantly, life expectancy increases not only when older people live longer but also when fewer
people die young. Hence, these indicators matter greatly for life expectancy (Murillo, 2020).
It has therefore become important to identify the indicators/variables that affect population
health, so that health care systems and related measures can be improved, or new strategies
implemented, and life expectancy increased.
This study analyses Organisation for Economic Co-operation and Development (OECD) health-related
data to determine which indicators affect population health, using 'Life expectancy' as the key
measure.
The analysis identifies which variables are significantly associated with population health in these
countries. The results can then serve as a role model for raising the level of those indicators in
developing countries, improving their population health and life expectancy.
Analysing the various aspects of health in developed countries also allows the same health-related
services in developing countries to be compared, so that gaps can be filled and better insight
gained into future improvements.
1.1 Finding a suitable data set

Finding the dataset for the project was the biggest hurdle. Fortunately, the presentation included in
the course material helped a lot in finding the required dataset. Thanks to this tutorial, I was
exposed to most of the publicly available big datasets. Before reading this material, I was not aware
of how to run advanced Google searches, which increase the chance of finding material that a general
search does not return.
As a result of the advanced search, I was able to find the OECD statistics (OECD.org., 2020)
(https://stats.oecd.org/), which helped to answer my research question.
Figure 1: Dataset links
1.2 Purpose of data cleaning and file preparation

There are many reasons why data must be cleaned before analysis. The main ones are the inaccuracies
and duplications that build up in data over time, which can arise when data is collected and stored.
These need to be identified and cleaned before analysis to obtain an accurate result. In addition,
data in text fields needs a level of processing to standardise it and eliminate the large number of
variations within a field.
While pre-processing, different datasets may need to be merged, and in that case duplication can
occur and needs attention. However, de-duplication has to be performed carefully so as not to cause
erroneous exclusions.
Next, missing or erroneous data has to be identified and fixed. Several techniques can be applied to
fix missing data, depending on the nature of the analysis.
1.3 File conversion and preparation

The OECD data set used for this analysis consists of 18 health-related variables in CSV format. Data
in CSV format is not easy to analyse directly. Therefore, using the knowledge gained in the course,
I created a SQL database to make the data cleaning process easier.
The following steps were taken to prepare the datasets for the analysis stage:
• Download the data sets.
• Check ('eyeball') every CSV file to identify the variables within each sheet.
• Handle structural mismatches/errors:
  - the spreadsheets were organised differently from one another
  - there were blank columns in between, and columns irrelevant to the analysis
  - comments were included for some variables
  - trailing and leading spaces were removed manually before the data was migrated
  - unaccepted characters such as “..” used for missing values were updated to blanks in the
    sheet before the analysis
• Create the SQL database and a separate table for each data sheet (see Appendix 1).
• Import the data from the CSV files into the SQL tables.
• Remove unwanted observations.
  The year coverage differed between datasets. Some covered 2010-2017, some 2005-2019 and
  others 2000-2019. In addition, many countries had not reported data for 2019. Considering
  all these anomalies, the period with the fewest missing values across all countries was
  2010 to 2018. Therefore, I use the 2010 to 2018 data for model building and further analysis.
  The ‘HealthExpenditureByAge’ data set has only three countries reported for all years.
  Therefore, it has been excluded from the analysis.
• Merge all data into a single dataset.
  A SQL view was created to show all variables for all countries across all years, including the
  dependent variable (Life expectancy).
• Replace all missing data with nulls.
• Slice the data using the created view.
• Import the merged final dataset into R for data analysis and visualisation.
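As an illustration of the cleaning steps listed above, the following is a minimal Python sketch. It is not the procedure used in this project, where the spreadsheets were edited manually and loaded into SQL. The sample rows, their values and the function name are hypothetical; only the “..” marker, the trimming of spaces and the 2010 to 2018 window come from the steps above.

```python
import csv
import io

# Hypothetical sample in the wide layout described above: one row per
# country, one column per year, ".." marking a missing value.
RAW_CSV = """Country,2009,2010,2011,2019
 Australia ,81.6,81.8,..,83.0
Austria,80.4, 80.7 ,81.1,..
"""

FIRST_YEAR, LAST_YEAR = 2010, 2018  # analysis window chosen in the report


def clean_rows(text):
    """Yield (country, year, value) with spaces trimmed and '..' as None."""
    reader = csv.DictReader(io.StringIO(text))
    for row in reader:
        country = row["Country"].strip()
        for column, cell in row.items():
            if column == "Country":
                continue
            year = int(column)
            if not FIRST_YEAR <= year <= LAST_YEAR:
                continue  # drop years outside the analysis window
            cell = cell.strip()
            value = None if cell in ("", "..") else float(cell)
            yield country, year, value


cleaned = list(clean_rows(RAW_CSV))
for record in cleaned:
    print(record)
```

The sketch also turns the wide year columns into one record per country and year, which is the shape the merged view provides for the analysis.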
1.4 The synergy

Comparatively, OECD countries have better life expectancy than non-OECD countries. All
health-related variables and the life expectancy data were taken into the analysis.
By applying statistical methods to the dataset, it was possible to find the relationships
between ‘Life Expectancy’ and the health-related predictors, and their significance. This
helped to answer my research question, "What health indicators contribute to a country's
life expectancy?". The study revealed that gross domestic product and health care are the
most significant predictors of life expectancy.
1.5 The findings

GDP (gross domestic product) and the total health care variable are highly significant when
predicting life expectancy among OECD countries. In addition, injuries and health expenditure show
good significance, whereas pharmaceutical consumption and tobacco consumption have only slight
significance for life expectancy. Overall, the better a country’s economic status, the higher its
life expectancy is likely to be.
2 Database
2.1 The process of creating the database
The following steps were taken to create the database in this project.
• Requirements analysis and design plan
  The requirements for the database were identified by going through all the CSV datasheets,
  which determined the database tables, fields and primary/foreign keys.
• Draft database outline
  The table structure was designed and drafted according to the facts identified above.
• Create database scripts
  In the OECD data set, every dataset is arranged by country and by year for each variable.
  Therefore, the CSV datasheet for each variable is converted to a SQL table, with the country
  name as the primary key (see Appendix 1).
• Load test data and perform testing
• Load data
  Once testing is successful, the data from the CSV files is loaded into the tables
  automatically using SQL scripts (see Appendix 1).
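The create, load and test cycle described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses an in-memory SQLite database as a stand-in for the SQL Server Express instance used in the project, and the table name, the two year columns and the sample values are hypothetical.

```python
import csv
import io
import sqlite3

# Hypothetical two-year extract of one variable, already cleaned
# (blank cells stand for missing values).
CSV_TEXT = """Country,2010,2011
Australia,81.8,82.0
Austria,80.7,
"""

conn = sqlite3.connect(":memory:")
# One table per variable, keyed by country, with one column per year.
conn.execute(
    "CREATE TABLE LifeExpectancy ("
    'Country TEXT PRIMARY KEY, "2010" REAL, "2011" REAL)'
)

reader = csv.DictReader(io.StringIO(CSV_TEXT))
rows = [
    (r["Country"], r["2010"] or None, r["2011"] or None)  # blank -> NULL
    for r in reader
]
conn.executemany("INSERT INTO LifeExpectancy VALUES (?, ?, ?)", rows)

# Simple load test: row count and number of NULLs in one year column.
loaded = conn.execute("SELECT COUNT(*) FROM LifeExpectancy").fetchone()[0]
missing_2011 = conn.execute(
    'SELECT COUNT(*) FROM LifeExpectancy WHERE "2011" IS NULL'
).fetchone()[0]
print(loaded, missing_2011)
```

Declaring the country as the primary key means a second row for the same country is rejected on insert, which gives a cheap check against duplicated rows during loading.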
2.2 Exploratory Analysis

In the data set, all variables other than ‘country’ and ‘year’ are numerical. A new categorical
column named ‘Oecd’ has been derived to indicate each country’s OECD status.
Missing values
Missing values need to be handled because they can lead to wrong predictions or classifications in
any model. There are several techniques for handling missing data in the data analysis process.
Instead of deleting records with missing values, I used the ‘MICE’ package in R (Appendix 2) to
impute them (Chaitanya Sagar, 2017). This method uses linear regression to predict the missing
values.
The figures below show the missing values in the dataset. The ‘Cancer’ variable is 100% missing,
and ‘Population’ has the fewest missing values, at 7.8%. The tobacco, fruit consumption and obese
population variables are each missing more than 50% of their data.
Figure 2
Figure 3: Missing value count
Figure 4: Missing values & pattern
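The share of missing values per variable, as summarised in the figures above, can be computed along the following lines. This is a minimal Python sketch rather than the R code used in the project, and the records and their values are hypothetical.

```python
# Hypothetical records; None marks a missing value.
records = [
    {"Population": 25.0, "Tobacco": None, "Cancer": None},
    {"Population": 8.9, "Tobacco": 24.3, "Cancer": None},
    {"Population": None, "Tobacco": None, "Cancer": None},
    {"Population": 11.4, "Tobacco": None, "Cancer": None},
]


def missing_share(rows):
    """Return {column: percentage of rows where the column is None}."""
    columns = rows[0].keys()
    return {
        col: 100.0 * sum(r[col] is None for r in rows) / len(rows)
        for col in columns
    }


shares = missing_share(records)
# List the variables from most to least missing.
for col, pct in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{col}: {pct:.1f}% missing")
```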
Based on the above, I imputed the missing values using three parameters of the package: the number
of imputed datasets, the number of iterations the model should run, and the method (VIDHYA, 2016).
Five imputed datasets were generated, and their goodness of fit was checked to select the closest
data set. A density plot was generated to visualise the similarity between the imputed datasets
and the original dataset.
Most importantly, the ‘Life Expectancy’ variable was excluded from the dataset so that it would not
be imputed. It would not be reasonable to include the response variable in the imputation, as this
could affect the final prediction model.
Figure 5: Density plot for imputed data vs. Original data
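To make the idea of regression-based imputation concrete, the following Python sketch fills the gaps in one variable using a straight line fitted on the rows where it is observed. It is a deliberate simplification and not the MICE procedure itself, which repeats a step of this kind for every incomplete variable over several iterations and produces several completed datasets. The variable names and values are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def impute_by_regression(xs, ys):
    """Fill None entries of ys with predictions from the observed pairs."""
    observed = [(x, y) for x, y in zip(xs, ys) if y is not None]
    intercept, slope = fit_line(*zip(*observed))
    return [
        y if y is not None else intercept + slope * x
        for x, y in zip(xs, ys)
    ]


# Hypothetical predictor (fully observed) and variable with gaps.
predictor = [1.0, 2.0, 3.0, 4.0, 5.0]
incomplete = [2.0, 4.0, None, 8.0, None]

completed = impute_by_regression(predictor, incomplete)
print(completed)
```

Observed values are left untouched; only the gaps are replaced by predictions, which is why the density of the imputed data can be compared with the original afterwards.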
After a closer analysis, I chose the second imputed dataset to complete the original dataset. The
image below shows the new ‘completed’ dataset, which does not contain any missing values.
Figure 6: Data summary after imputation
Checking for outliers: It is important to check for outliers in the data sets since they affect
summary statistics, the mean in particular and the median to a lesser extent. This in turn affects
the absolute and mean error of the model built in the next steps. The boxplots below were created
for each variable in the data set to show the spread of the data. However, the low or high values
are not treated as outliers, since they are specific to each country and are reported as valid data.
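For reference, a boxplot conventionally draws points as potential outliers when they lie more than 1.5 times the interquartile range beyond the quartiles. The Python sketch below applies that rule; it is an illustration only, the values are hypothetical, and the 1.5 multiplier is the usual boxplot convention rather than a setting taken from this project.

```python
import statistics


def iqr_fences(values, multiplier=1.5):
    """Return the (lower, upper) fences used by a conventional boxplot."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    return q1 - multiplier * spread, q3 + multiplier * spread


# Hypothetical life expectancy values in years.
life_expectancy = [70.0, 72.0, 74.0, 75.0, 76.0, 78.0, 80.0, 56.0]

lower, upper = iqr_fences(life_expectancy)
flagged = [v for v in life_expectancy if v < lower or v > upper]
print(lower, upper, flagged)
```

A flagged value is only a candidate. As noted above, values that are genuine for a country are kept in the analysis.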
The above boxplots show a strong relationship between life expectancy and the type of country
(OECD or non-OECD). Both the mean and the median life expectancy are higher for OECD countries than
for non-OECD countries. This shows at a glance that OECD (developed) countries have higher life
expectancy than non-OECD countries.
The summary of the variables is shown below. The minimum life expectancy in this dataset is 56
years and the highest is 84 years.
Figure 7 - Summary of the variables
A correlation matrix was created to see the relationships between the predictor variables. From
this point onwards, I have excluded the non-OECD countries from the analysis in order to examine
the relationships with life expectancy in the developed (OECD) countries.
Figure 8: Correlation matrix
According to the above correlation matrix, life expectancy has weak positive relationships with
total employment, GDP, health expenditure, fruit consumption and pharmaceutical consumption. On the
other hand, it has weak negative relationships with live birth deaths and AIDS.
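Each cell of a correlation matrix like the one above is a Pearson correlation coefficient between two variables. The following Python sketch shows the calculation; it is not the R code used in the project, and the column values are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Return a nested dict with the correlation of every pair of columns."""
    names = list(columns)
    return {
        a: {b: pearson_r(columns[a], columns[b]) for b in names}
        for a in names
    }


# Hypothetical values for three of the variables.
columns = {
    "LifeExpectancy": [78.0, 80.0, 81.0, 83.0],
    "GDP": [30.0, 38.0, 41.0, 52.0],
    "LiveBirthDeaths": [6.0, 5.0, 4.5, 3.0],
}

matrix = correlation_matrix(columns)
for name, row in matrix.items():
    print(name, {other: round(r, 2) for other, r in row.items()})
```

Values near +1 or -1 indicate a strong linear relationship, and values near 0 a weak one.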
3 Other Methods
Multiple linear regression (Robert I. Kabacoff, 2017) is used in this study as the statistical
analysis technique. It finds the best-fitting line between the response and the predictor
variables, so that the slope of the regression line can be measured for each predictor.
The imputed, completed dataset is split in two for model training and model testing. 70% of the
data (233 observations) is used to train the model and the remaining 30% (100 observations) is used
to test and evaluate the model.
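A 70/30 split of this kind can be sketched as follows in Python. The function name and the random seed are assumptions made for the illustration, and the row identifiers stand in for the 333 observations; the split in this project was done in R.

```python
import random


def train_test_split(rows, train_share=0.7, seed=42):
    """Shuffle a copy of rows and cut it into training and test parts."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)  # fixed seed: repeatable split
    cut = int(len(shuffled) * train_share)
    return shuffled[:cut], shuffled[cut:]


rows = list(range(333))  # stand-ins for the 333 observations
train, test = train_test_split(rows)
print(len(train), len(test))  # 233 100
```

Shuffling before cutting avoids putting, for example, only the alphabetically first countries in the training set.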
A correlation matrix was generated to check for multicollinearity between the predictor
variables. According to the matrix below, population and physicians practising are strongly
correlated (97%). In addition, total deaths are strongly correlated (97%) with physicians
practising. Therefore, the ‘physician practising’ variable is removed from the model. Secondly,
total deaths and total health care are also strongly correlated (74%), so total deaths are removed
from the model. Total employment and GDP are correlated at 76%, and therefore total employment has
also been removed from the model.
Figure 9: Correlation matrix for predictor variables
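The screening described above can be expressed as a small check in Python. The correlation values and the removed variables are the ones quoted in the text; the 0.7 cut-off is an assumption for the illustration, since the report does not state the threshold that was applied.

```python
# Pairwise correlations quoted in the text above.
pair_correlations = {
    ("TotalPopulation", "PhysicianPractising"): 0.97,
    ("TotalDeaths", "PhysicianPractising"): 0.97,
    ("TotalDeaths", "TotalHealthcare"): 0.74,
    ("TotalEmployment", "GDP"): 0.76,
}
THRESHOLD = 0.7  # assumed cut-off for "strongly correlated"

# Variables removed from the model according to the text above.
removed = {"PhysicianPractising", "TotalDeaths", "TotalEmployment"}

# Flag every pair at or above the cut-off.
flagged = [
    pair for pair, r in pair_correlations.items() if abs(r) >= THRESHOLD
]
# A flagged pair is resolved once at least one of its members is removed.
unresolved = [pair for pair in flagged if not set(pair) & removed]

print(len(flagged), unresolved)
```

With the three removals, no strongly correlated pair remains among the predictors.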
3.1 Model building

Two models were built to compare the final results. Model 1 included all variables except total
deaths, physician practising and total employment, leaving 13 variables. Model 2 was built by
performing stepwise regression, which selected 8 of the 13 variables. The results of the two
models are shown below.
Figure 10: model 1 summary
Figure 11: model 2 (stepwise regression) summary
Model 1 visualisation for residuals:
Model 2 (stepwise regression) visualisation for residuals:
The built models were then used to predict life expectancy on the test dataset and evaluated using
their R squared values, as shown below. For both models, the R squared values on the test data are
close to those of the trained models.
According to the above plots, the relationship between the residuals and the fitted values is not
completely linear. However, most data points fit the line, and the residuals are approximately
normally distributed.
Model 1, with all 13 selected variables, achieved an adjusted R squared of 0.1827 on the training
data set. This means that about 18% of the variation in life expectancy can be explained by the
predictor variables. On the test dataset the model gave an R squared of 0.1233, which indicates a
weak but present relationship between the response variable and the predictor variables.
Likewise, the model built using stepwise regression gave an R squared of 0.1883, which is slightly
higher than that of Model 1.
Model 2 shows that GDP and total health care are highly significant when predicting life
expectancy. In addition, injuries and health expenditure show good significance, whereas
pharmaceutical consumption and tobacco consumption have only slight significance for life
expectancy.
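The R squared and adjusted R squared measures used above can be computed as in the following Python sketch. The actual and predicted values are hypothetical and serve only to show the arithmetic; the figures reported above come from the R model summaries.

```python
def r_squared(actual, predicted):
    """Share of the variation in actual explained by the predictions."""
    mean_y = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_y) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


def adjusted_r_squared(r2, n_obs, n_predictors):
    """Penalise R squared for the number of predictors in the model."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


# Hypothetical life expectancy values and model predictions.
actual = [78.0, 80.0, 82.0, 84.0]
predicted = [79.0, 80.0, 81.0, 84.0]

r2 = r_squared(actual, predicted)
adj_r2 = adjusted_r_squared(r2, n_obs=len(actual), n_predictors=1)
print(round(r2, 4), round(adj_r2, 4))
```

Because the adjustment grows with the number of predictors, the adjusted value is the fairer basis for comparing a 13-variable model with an 8-variable one.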
3.2 Conclusion
GDP (gross domestic product) and health care (government / social health care) have the greatest
significance for a country’s life expectancy. Countries with higher domestic income have good
potential for high life expectancy. Health care is, unsurprisingly, also a major factor in life
expectancy. Countries with better economies have better health care systems, which lead to better
life expectancy. Among the other health-related indicators, 'Injuries' contributes good
significance to life expectancy, along with health expenditure.
3.3 Recommendation
There may be better methods for predicting life expectancy than multiple linear regression.
According to the readings, LASSO, Ridge regression and Random Forest can be used to predict life
expectancy. In addition, including other variables, such as economic, social, environmental and
safety indicators, could help to improve model performance. The model results suggest that
health-related indicators have a limited effect on life expectancy. Therefore, there may be other
factors with more significance, which calls for an extended analysis of ‘Life Expectancy’.
3.4 Detailed Schedule
Week 1 (13-17 July)
Project activity: Familiarization of the project specification; building the research question
Learning activity: Module 1
Aims (details of achievement):
• Finish reading Module 1
• Understanding of the project specification and requirements
• Start thinking about a research question
Outcome, progress report: completed
Outcome, final report: completed

Week 2 (20-24 July)
Project activity: Accessing big data / finalise the data set / familiarise with the data set
Learning activity: Module 2
Aims (details of achievement):
• Have finalised the research question
• Finished reading module 2, which will help to design the relationships among the datasets
• Finalised the big datasets to be used in the project
Outcome, progress report: completed
Outcome, final report: completed

Week 3 (27-31 July)
Project activity: Project proposal writing
Learning activity: Module 3
Aims (details of achievement):
• Completed module 3 reading
• Report writing has started and completed for project proposal
Outcome, progress report: Achieved all tasks but failed to read the module due to taking more time
to finish the project proposal.
Outcome, final report: completed

Week 4 (03-07 August)
Project activity: Submit a proposal / database design
Learning activity: Module 4
Aims (details of achievement):
• Submitted project proposal
• Read Module 4
• Start designing the database to import the dataset
Outcome, progress report: Completed the course reading and catch up with the missed module from the
previous week. Instead of database designing, analysed the downloaded dataset for any
inconsistencies, missingness and well-preparedness. Also familiarised with the dataset to identify
their primary keys and relationships.
Outcome, final report: completed

Week 5 (10-14 August)
Project activity: Creation of the database and tables / import datasets
Learning activity: Module 5
Aims (details of achievement):
• Read Module 5
• Creation of database, tables and their relationships
• Import data into the database
Outcome, progress report: Finalised the database design, downloaded SQL express and installed for
the data cleaning process. Scripts were written to create the database and import data from CSV to
SQL.
Outcome, final report: completed

Week 6 (17-21 August)
Project activity: Data cleaning / exploratory analysis
Learning activity: Module 6
Aims (details of achievement):
• Read module 6
• Start to query data and find any missing data or perform data cleaning if required.
• Exploratory analysis to get to better know the datasets.
Outcome, progress report: Course reading from the previous week and module 6 were completed.
Continuation of database script writing and then data import to tables were completed. Querying the
data and exploratory analysis was not started and fell behind the schedule due to personal (family)
circumstances.
Outcome, final report: completed

Week 7 (24-28 August)
Project activity: Feature selection
Learning activity: Module 7
Aims (details of achievement):
• Selecting the most significant variables out of the many variables according to the previous
  studies or own analysis
• Read module 7
Outcome, progress report: Querying the data and cleaning the data set was completed. Not completed
the scheduled tasks for this week at all. I was still behind the schedule at this point but was
confident that I could catch up on the work in the next few weeks.
Outcome, final report: completed

Week 8 (31-4 September)
Project activity: Feature selection
Learning activity: Module 8
Aims (details of achievement):
• Read module 8
• Continue with the feature selection and finalise the feature to be used in the analysis.
Outcome, progress report: Completed the exploratory analysis (week 6 remaining work) and read
module 7 from week 7.
Outcome, final report: completed

Week 9 (7-11 September)
Project activity: Analysis; identify if any changes needed to the research question or datasets
Learning activity: Module 9
Aims (details of achievement):
• Start the analysis with the data
• Any revisions to the research question or data sets, if find necessary
• Complete reading module 9
Outcome, progress report: Downloaded R and start learning it since this was completely new to me.
This was something that I didn't schedule time in my proposal and contributed to fell behind the
schedule.
Outcome, final report: completed

Week 10 (14-18 September)
Project activity: Review process if required; continue analysis
Learning activity: Module 10
Aims (details of achievement):
• Repeat analysis, if changes occurred, if not continue with the analysis
• Read module 10
Outcome, progress report: Completed reading the course module 9 (from previous week) and module 10.
Reviewing the research question is done and identified that no changes need to be made after
knowing the dataset better and reading through other research papers.
Outcome, final report: completed

Week 11 (21-25 September)
Project activity: Prepare and start the progress report writing
Learning activity: -
Aims (details of achievement):
• Organise and start writing the progress report
Outcome, progress report: Started writing the progress report.
Outcome, final report: -

Week 12 (28-2 October)
Project activity: Continue writing progress report
Learning activity: -
Aims (details of achievement):
• Continue with the progress report work
Outcome, progress report: Progress report writing continued
Outcome, final report: completed

Week 13 (5-9 October)
Project activity: Submit progress report
Learning activity: Module 11
Aims (details of achievement):
• Read module 11
• Submit Assignment 2
• Continue with the analysis
• Start creating visualisations with the analysed data findings
Outcome, progress report: Work pending – not achieved anything this week, being unwell.
Outcome, final report: Progress report submitted.

Week 14 (12-16 October)
Project activity: Obtain final results / evaluations
Learning activity: Module 12
Aims (details of achievement):
• Discuss and analyse the outcomes of the analysis
Outcome, progress report: Work pending
Outcome, final report: Completed module 11 reading and further analysis continued. With further
research paper readings, found a method of handling missing values. Therefore, had to revisit the
data analysis. Increased my study time to catch up and be on track. Started creating visualisation
to the findings.

Week 15 (19-23 October)
Project activity: Prepare final report
Learning activity: -
Aims (details of achievement):
• Start writing the final report
Outcome, progress report: Work pending
Outcome, final report: Report writing has started while continuing with the model building and
evaluation was carried out.

Week 16 (26-6 November)
Project activity: Submit final report
Learning activity: -
Aims (details of achievement):
• Submit final assessment
Outcome, progress report: Work pending
Outcome, final report: Continuation of model building and evaluation using different R functions
and regression methods. Completed the report by the due date.
3.5 Help from the course material
My biggest challenge was finding the dataset for the project. Fortunately, the presentation included
in the course material helped me a lot in finding the required dataset. Thanks to this tutorial, I
was exposed to most of the publicly available big datasets.
Handling big data, preparing the datasets and cleaning the data were important steps in this
project. These steps were discussed clearly in the course modules, and I applied them directly to
this project work. Database creation, management and security were also discussed in the modules,
which gave me an interesting start to data modelling. In particular, the ‘data activities’/video
clips allowed me to gain hands-on experience as extended knowledge. I was able to manipulate the
dataset (slicing/dicing) according to the analysis requirements before importing it into 'R' for
analysis.
'R' was completely new to me and I was a little worried about whether I would be able to handle it
in my project analysis. However, the modules provided plenty of knowledge from the beginner level,
which helped me to start with ease and then advance to deeper functions. I am happy that I got to
learn this powerful statistical package, which is an added value, and I am sincerely proud of my
achievement here. The modules also discuss data visualisation, including its quality, techniques
and rationale, which helped me to present the results of my analysis.
It was interesting to learn about the Hadoop framework, although I did not specifically use it in
this project. However, for big data management, knowing the Hadoop framework will be a great asset
in future.
4 References
CHAITANYA SAGAR, P. A. 2017. A Solution to Missing Data: Imputation Using R (17:n37) [Online].
Available: https://www.kdnuggets.com/2017/09/missing-data-imputation-using-r.html
MURILLO, I. L. 2020. The life expectancy: what is it and why does it matter [Online]. CENIE.
Available: https://cenie.eu/en/blogs/age-society/life-expectancy-what-it-and-why-does-it-matter
OECD.ORG. 2020. Our global reach - OECD [Online]. Available:
http://www.oecd.org/about/members-and-partners/ [Accessed 26/07/2020].
ROBERT I. KABACOFF, P. D. 2017. Multiple (Linear) Regression [Online]. Data Camp. Available:
https://www.statmethods.net/stats/regression.html
VIDHYA, A. 2016. Tutorial on 5 Powerful R Packages used for imputing missing values [Online].
Available: https://www.analyticsvidhya.com/blog/2016/03/tutorial-powerful-packages-imputing-missing-values/
5 Appendices
5.1 Appendix 1
/***** Project Database Creation SQL ********/
-- This is to create the database tables to
-- populate the OECD data
/***************************************/
-- Create AIDS table

CREATE TABLE [dbo].[HealthStatus__AIDS](
[Country] [varchar](100) NOT NULL,
[2000] [decimal](18, 1) NULL,
[2001] [decimal](18, 1) NULL,
[2002] [decimal](18, 1) NULL,
[2003] [decimal](18, 1) NULL,
[2004] [decimal](18, 1) NULL,
[2005] [decimal](18, 1) NULL,
[2006] [decimal](18, 1) NULL,
[2007] [decimal](18, 1) NULL,
[2008] [decimal](18, 1) NULL,
[2009] [decimal](18, 1) NULL,
[2010] [decimal](18, 1) NULL,
[2011] [decimal](18, 1) NULL,
[2012] [decimal](18, 1) NULL,
[2013] [decimal](18, 1) NULL,
[2014] [decimal](18, 1) NULL,
[2015] [decimal](18, 1) NULL,
[2016] [decimal](18, 1) NULL,
[2017] [decimal](18, 1) NULL,
[2018] [decimal](18, 1) NULL,
[2019] [decimal](18, 1) NULL
) ON [PRIMARY]
GO
-- Creating the Alcohol consumption table
CREATE TABLE [dbo].[HealthStatus_AlcoholConsumption](
[Country] [varchar](100) NOT NULL,
[2000] [decimal](18, 1) NULL,
[2001] [decimal](18, 1) NULL,
[2002] [decimal](18, 1) NULL,
[2003] [decimal](18, 1) NULL,
[2004] [decimal](18, 1) NULL,
[2005] [decimal](18, 1) NULL,
[2006] [decimal](18, 1) NULL,
[2007] [decimal](18, 1) NULL,
[2008] [decimal](18, 1) NULL,
[2009] [decimal](18, 1) NULL,
[2010] [decimal](18, 1) NULL,
[2011] [decimal](18, 1) NULL,
[2012] [decimal](18, 1) NULL,
[2013] [decimal](18, 1) NULL,
[2014] [decimal](18, 1) NULL,
[2015] [decimal](18, 1) NULL,
[2016] [decimal](18, 1) NULL,
[2017] [decimal](18, 1) NULL,
[2018] [decimal](18, 1) NULL,
[2019] [decimal](18, 1) NULL
) ON [PRIMARY]
-- Creating the Cancer table
CREATE TABLE [dbo].[HealthStatus_Cancer](
[Country] [varchar](100) NOT NULL,
[2000] [decimal](18, 1) NULL,
[2002] [decimal](18, 1) NULL,
[2008] [decimal](18, 1) NULL,
[2012] [decimal](18, 1) NULL
) ON [PRIMARY]
GO
-- Creating the FruitConsumption table
CREATE TABLE [dbo].[HealthStatus_FruitConsumption](
[Country] [varchar](100) NOT NULL,
[2000] [decimal](18, 1) NULL,
[2001] [decimal](18, 1) NULL,
[2002] [decimal](18, 1) NULL,
[2003] [decimal](18, 1) NULL,
[2004] [decimal](18, 1) NULL,
[2005] [decimal](18, 1) NULL,
[2006] [decimal](18, 1) NULL,
[2007] [decimal](18, 1) NULL,
[2008] [decimal](18, 1) NULL,
[2009] [decimal](18, 1) NULL,
[2010] [decimal](18, 1) NULL,
[2011] [decimal](18, 1) NULL,
[2012] [decimal](18, 1) NULL,
[2013] [decimal](18, 1) NULL,
[2014] [decimal](18, 1) NULL,
[2015] [decimal](18, 1) NULL,
[2016] [decimal](18, 1) NULL,
[2017] [decimal](18, 1) NULL,
[2018] [decimal](18, 1) NULL,
[2019] [decimal](18, 1) NULL
) ON [PRIMARY]
/*Bulk insert data to tables*/

BULK INSERT HealthStatus__AIDS
FROM 'C:\Users\chami\OneDrive\Documents\MSC\BigDataMngmt\Datasets\
HealthStatus__AIDS.txt'
WITH
(
FIELDTERMINATOR = ',',
ROWTERMINATOR = '\n'
)
BULK INSERT HealthStatus_AlcoholConsumption

HealthStatus_AlcoholConsumption.txt'
WITH
(
)
BULK INSERT HealthStatus_Cancer
HealthStatus_Cancer.txt'
WITH
(
)
BULK INSERT HealthStatus_FruitConsumption

HealthStatus_FruitConsumption.txt'
Page | 21
WITH
(
)
BULK INSERT HealthStatus_GDP

HealthStatus_GDP.txt'
WITH
(
)
BULK INSERT HealthStatus_HealthExpenditure

HealthStatus_HealthExpenditure.txt'
WITH
(
)
BULK INSERT HealthStatus_Injuries

HealthStatus_Injuries.txt'
WITH
(
)
BULK INSERT HealthStatus_LifeExpectancy

HealthStatus_LifeExpectancy.txt'
WITH
(
)
BULK INSERT HealthStatus_LowBirthWeight

HealthStatus_LowBirthWeight.txt'
WITH
(
)
BULK INSERT HealthStatus_Mortality_LiveBirthDeaths

HealthStatus_Mortality_LiveBirthDeaths.txt'
WITH
(
)
Page | 22
BULK INSERT HealthStatus__AIDS

HealthStatus__AIDS.txt'
WITH
(
)
BULK INSERT HealthStatus_Mortality_totDeaths

HealthStatus_Mortality_totDeaths.txt'
WITH
(
)
BULK INSERT HealthStatus_ObesePopulation

HealthStatus_ObesePopulation.txt'
WITH
(
)
BULK INSERT HealthStatus_PharmaConsumption

HealthStatus_PharmaConsumption.txt'
WITH
(
)
BULK INSERT HealthStatus_PhysicianPractising

HealthStatus_PhysicianPractising.txt'
WITH
(
)
BULK INSERT HealthStatus_TobaccoConsumption

HealthStatus_TobaccoConsumption.txt'
WITH
(
)
BULK INSERT HealthStatus_TotalEmployment

HealthStatus_TotalEmployment.txt'
WITH
Page | 23
(
)
BULK INSERT HealthStatus_TotalHealthcare

HealthStatus_TotalHealthcare.txt'
WITH
(
)
BULK INSERT HealthStatus_TotalPopulation

HealthStatus_TotalPopulation.txt'
WITH
(
)
/***** Data wrangling ********/

-- merging data tables
-- Pivot the columns --> years to columns
-- deriving new columns
/***************************************/
select * from HealthStatus__AIDS

select * from HealthStatus_AlcoholConsumption
select * from HealthStatus_Cancer
select * from HealthStatus_FruitConsumption
select * from HealthStatus_GDP
select * from HealthStatus_HealthExpenditure
select * from HealthStatus_Injuries
select * from HealthStatus_LifeExpectancy
select * from HealthStatus_LowBirthWeight
select * from HealthStatus_Mortality_LiveBirthDeaths
select * from HealthStatus_Mortality_totDeaths
select * from HealthStatus_ObesePopulation
select * from HealthStatus_PharmaConsumption
select * from HealthStatus_PhysicianPractising
select * from HealthStatus_TobaccoConsumption
select * from HealthStatus_TotalEmployment
select * from HealthStatus_TotalHealthcare
select * from HealthStatus_TotalPopulation
/* Merging the datasets to a single dataset and pivoting the dataset

** Saved the results as a view
*/
select e.Country,
case when e.Country like '%Non-OECD%' then 'N' else 'Y' end as Oecd,
2005 as [year], p.[2005] as TotalPopulation, i.[2005] as Injuries,
ld.[2005] as LiveBirthDeaths, null as Cancer, bw.[2005] as LowBirthWeight,
pp.[2005] as PhysicianPractising, null as HealthExpenditure, a.[2005] as AIDS,
ph.[2005] as PharmaConsumption, al.[2005] as AlcoholConsumption, ob.[2005] as ObesePopulation,
fr.[2005] as FruitConsumption, tc.[2005] as TobaccoConsumption, null as TotalHealthcare,
em.[2005] as TotalEmployment,
null as GDP, e.[2005] as LifeExpectancy
from HealthStatus_LifeExpectancy e left outer join HealthStatus_TotalPopulation p
on e.Country = p.Country
left outer join HealthStatus_Injuries i on e.Country = i.Country
left outer join HealthStatus_Mortality_LiveBirthDeaths ld on e.Country = ld.Country
left outer join HealthStatus_Cancer c on e.Country = c.Country
left outer join HealthStatus_LowBirthWeight bw on e.Country = bw.Country
left outer join HealthStatus_PhysicianPractising pp on e.Country = pp.Country
left outer join HealthStatus_HealthExpenditure he on e.Country = he.Country
left outer join HealthStatus__AIDS a on e.Country = a.Country
left outer join HealthStatus_PharmaConsumption ph on e.Country = ph.Country
left outer join HealthStatus_AlcoholConsumption al on e.Country = al.Country
left outer join HealthStatus_ObesePopulation ob on e.Country = ob.Country
left outer join HealthStatus_FruitConsumption fr on e.Country = fr.Country
left outer join HealthStatus_TobaccoConsumption tc on e.Country = tc.Country
left outer join HealthStatus_TotalEmployment em on e.Country = em.Country
left outer join HealthStatus_TotalHealthcare hc on e.Country = hc.Country
left outer join HealthStatus_GDP gd on e.Country = gd.Country
Union
select e.Country,
case when e.Country like '%Non-OECD%' then 'N' else 'Y' end as Oecd,
ld.[2006] as LiveBirthDeaths, null as Cancer, bw.[2006] as LowBirthWeight,
pp.[2006] as PhysicianPractising,null as HealthExpenditure,a.[2006] as AIDS,
ObesePopulation,

Union

ObesePopulation,
fr.[2007] as FruitConsumption, tc.[2007] as TobaccoConsumption, null as TotalHealthcare,
em.[2007] as TotalEmployment,
Union
ld.[2008] as LiveBirthDeaths, c.[2008] as Cancer, bw.[2008] as LowBirthWeight,

ObesePopulation,

Union
select e.Country,
case when e.Country like '%Non-OECD%' then 'N' else 'Y' end as Oecd,

ObesePopulation,

Union
pp.[2010] as PhysicianPractising, he.[2010] as HealthExpenditure,a.[2010] as AIDS,

ObesePopulation,
fr.[2010] as FruitConsumption,tc.[2010] as TobaccoConsumption,hc.[2010] as

gd.[2010] as GDP, e.[2010] as LifeExpectancy
Union
-- Life Expectancy and Other variables 2011

ObesePopulation,

Union
ld.[2012] as LiveBirthDeaths, c.[2012] as Cancer, bw.[2012] as LowBirthWeight,

ObesePopulation,

Union

ObesePopulation,

Union

ObesePopulation,

Union

ObesePopulation,

Union

ObesePopulation,

Union

ObesePopulation,

Union

ObesePopulation,

Order by e.Country
-- merging the adult mortality totDeaths

select l.*, totDeaths
from LifeExp l left outer join (
    SELECT TOP (1000) [Country], [Year], sum([Value]) as totDeaths
    FROM [Project].[dbo].[HealthStatus_Mortality_totDeaths]
    Group by Country, [Year]) dth
on l.Country = dth.Country
and l.[year] = dth.[Year]
5.2 Appendix II
R code
# Deploy packages
# xyplot() is provided by lattice and ggplot() by ggplot2, so no separate
# "xyplot" or "ggplot" package is installed or loaded here
install.packages("tidyverse")
devtools::install_github("kassambara/easyGgplot2")
library(dplyr)
library(tidyr)
library(plot.matrix)
library(magrittr)
library(tidyverse)
library("GGally")
library(ggplot2)
library(lattice)
library(data.table)
library(cdata)
library(WVPlots)
library(purrr)
## test
install.packages("mice")
library(mice)
#install package and load library
install.packages("mi")
library(mi)
install.packages("missForest")
library(missForest)
install.packages("survival")
#Load raw data
library(readr)
LifeExp <- read_csv("~/MSC/BigDataMngmt/Datasets/LifeExp.csv",
                    col_types = cols(year = col_integer(),
                                     Oecd = col_character(),
                                     TotalPopulation = col_double(),
                                     Injuries = col_double(),
                                     LiveBirthDeaths = col_double(),
                                     Cancer = col_double(),
                                     LowBirthWeight = col_double(),
                                     PhysicianPractising = col_double(),
                                     HealthExpenditure = col_double(),
                                     AIDS = col_double(),
                                     PharmaConsumption = col_double(),
                                     AlcoholConsumption = col_double(),
                                     ObesePopulation = col_double(),
                                     FruitConsumption = col_double(),
                                     TobaccoConsumption = col_double(),
                                     TotalHealthcare = col_double(),
                                     TotalEmployment = col_double(),
                                     GDP = col_double(),
                                     totDeaths = col_double(),
                                     LifeExpectancy = col_double()),
                    na = "null")
# converting the country to uppercase
LifeExp$Country <- toupper(LifeExp$Country)
# separating OECD countries
# non OECD countries do not have sufficient data
Oecd <- LifeExp %>% dplyr::filter(!(Country %like% "OECD"))
NonOecd <- LifeExp %>% dplyr::filter(Country %like% "OECD")
# adding condition for numerical variables
data_num <- life_selected %>%
select_if(is.numeric)
Oecd_num <- data_num %>%
select_if(is.numeric)
Non_Oecd_num <- NonOecd %>%
select_if(is.numeric)
Oecd_selected <- Oecd_num %>% select(-year)
Non_Oecd_selected <- Non_Oecd_num %>% select(-year)
# Correlation matrix - oecd
ggcorr(Oecd_num,
label = T,
label_size = 2,
label_round = 2,
hjust = 1,
size = 3,
color = "royalblue",
layout.exp = 5,
low = "green3",
mid = "gray95",
high = "darkorange",
name = "Correlation")
ggcorr(Non_Oecd_selected,
label = T,
label_size = 2,
label_round = 2,
hjust = 1,
size = 3,
layout.exp = 5,
low = "green3",
mid = "gray95")
# selecting variables with correlation - oecd
var<-Oecd %>% select(1:2, 3,5,9,14,18,20)
var_selected <- var %>% select(-Country, -year)
data_num <- var_selected %>%
select_if(is.numeric)
# Correlation matrix
ggcorr(data_num,
label = T,
label_size = 2,
label_round = 2,
hjust = 1,
size = 3,
layout.exp = 5,
low = "green3",
mid = "gray95")
# plotting the variables over the years
boxplot(TotalPopulation~year, data=LifeExp,
main="Boxplots for each year\nPopulation",
xlab="Year", ylab="Thousands of persons",
col="blue", border="black")

boxplot(Injuries~year, data=LifeExp,
main="Boxplots for each year\nInjuries",
xlab="Year", ylab="Injured per million population",
col="blue", border="black")

boxplot(LowBirthWeight~year, data=LifeExp,
main="Boxplots for each year\nLow birth weight",
xlab="Year", ylab="(% of total live births)",
col="blue", border="black")

boxplot(PhysicianPractising~year, data=LifeExp,
main="Boxplots for each year\nPhysician Practising",
xlab="Year", ylab="Number of persons",
col="blue", border="black")

boxplot(HealthExpenditure~year, data=LifeExp,
main="Boxplots for each year\nHealth Expenditure",
xlab="Year", ylab="Share of gross domestic product",
col="blue", border="black")

boxplot(PharmaConsumption~year, data=LifeExp,
main="Boxplots for each year\nPharma Consumption",
xlab="Year", ylab="Daily dosage per 1 000 inhabitants per day",
col="blue", border="black")

boxplot(AlcoholConsumption~year, data=LifeExp,
main="Boxplots for each year\nAlcohol Consumption",
xlab="Year", ylab="Litres per capita",
col="blue", border="black")

boxplot(LifeExpectancy~year, data=LifeExp,
main="Boxplots for each year\nLife Expectancy",
xlab="Year", ylab="Years",
col="blue", border="black")

boxplot(ObesePopulation~year, data=LifeExp,
main="Boxplots for each year\nObese Population",
xlab="Year", ylab="Percentage of total population",
col="blue", border="black")

boxplot(FruitConsumption~year, data=LifeExp,
main="Boxplots for each year\nFruit Consumption",
xlab="Year", ylab="% of population aged 15 years old and over",
col="blue", border="black")

boxplot(TobaccoConsumption~year, data=LifeExp,
main="Boxplots for each year\nTobacco Consumption",
xlab="Year", ylab="% of population of daily smokers (15+ years)",
col="blue", border="black")

boxplot(TotalHealthcare~year, data=LifeExp,
main="Boxplots for each year\nTotal Health Care",
xlab="Year", ylab="% of total population covered",
col="blue", border="black")

boxplot(TotalEmployment~year, data=LifeExp,
main="Boxplots for each year\nTotal Employment",
xlab="Year", ylab="% of total population",
col="blue", border="black")

boxplot(GDP~year, data=LifeExp,
main="Boxplots for each year\nGross domestic product (GDP)",
xlab="Year", ylab="US$ purchasing power parity",
col="blue", border="black")

boxplot(totDeaths~year, data=LifeExp,
main="Boxplots for each year\nTotal Deaths",
xlab="Year", ylab="persons",
col="blue", border="black")
le<-select(LifeExp, GDP, LifeExpectancy)
l<-select(LifeExp, LifeExpectancy)
view(le)
plot(le, main="Scatterplot Example",
xlab="GDP", ylab="Life Expectancy ", pch=19)
# plot GDP & Life Expectancy
ggplot(regress_Oecd, aes(GDP, LifeExpectancy)) +
geom_point()
#Generate 10% missing values at Random
Oecd.mis <- prodNA(Oecd, noNA = 0.1)
LifeExp.mis <- prodNA(LifeExp, noNA = 0.1)
# visualising the missing values
Oecd.mis <- subset(Oecd.mis, select = -c(year))
Oecd.mis <- subset(Oecd.mis, select = -c(LifeExpectancy))
Oecd.mis <- subset(Oecd.mis, select = -c(Country))
Oecd.mis <- subset(Oecd.mis, select = -c(Oecd))
summary(Oecd.mis)
sort(sapply(Oecd.mis,function(x) sum(is.na(x))),decreasing = T)
md.pattern(Oecd.mis)
md.pattern(LifeExp.mis)
LifeExp.mis <- subset(LifeExp.mis, select = -c(Country))
LifeExp.mis <- subset(LifeExp.mis, select = -c(year))
LifeExp.mis <- subset(LifeExp.mis, select = -c(Oecd))
LifeExp.mis <- subset(LifeExp.mis, select = -c(LifeExpectancy))
# separating the life expectancy variable before imputation
LifeVar<-subset(LifeExp,select = c(LifeExpectancy))
Var<-subset(LifeExp,select = c(LifeExpectancy,Oecd))
md.pattern(LifeExp.mis)
install.packages("VIM")
library(VIM)
mice_plot <- aggr(Oecd.mis, col=c('navyblue','yellow'),
numbers=TRUE, sortVars=TRUE,
labels=names(Oecd.mis), cex.axis=.7,
gap=3, ylab=c("Missing data","Pattern"))
# Imputing the data
imputed_Oecd <- mice(Oecd.mis, exclude="LifeExpectancy", m=5, maxit = 50, method = 'cart')
imputed_LifeExp <- mice(LifeExp.mis, m=5, maxit = 50, method = 'cart')
imputed_Oecd$imp$Oecd
imputed_Oecd$imp$LifeExpectancy
imputed_LifeExp$imp$LifeExpectancy
# summary of imputed data
summary(imputed_Oecd)
imputed_LifeExp$imp$AIDS
#preparing the final dataset - all countries
complete_data=mice::complete(imputed_LifeExp,2)
complete_LifeExp<- merge(complete_data,LifeVar,by="row.names")
complete_LifeExp <- subset(complete_LifeExp, select = -c(Row.names))
summary(complete_LifeExp)
#preparing the final dataset - Oecd
complete_Oecd=mice::complete(imputed_Oecd,2)
complete_Oecd<- merge(complete_Oecd,LifeVar,by="row.names")
complete_Oecd <- subset(complete_Oecd, select = -c(Row.names))
complete_Oecd <- subset(complete_Oecd, select = -c(Cancer))
complete_Explo<- merge(complete_data,Var,by="row.names")
md.pattern(complete_Oecd)
#Comparing the imputed with original data
densityplot(imputed_LifeExp)
xyplot(imputed_Data)
# Correlation matrix
ggcorr(complete_Oecd,
label = T,
label_size = 2,
label_round = 2,
hjust = 1,
size = 3,
layout.exp = 5,
low = "green3",
mid = "gray95")
## METHOD - Multiple Regression
install.packages("faraway")
install.packages("glmnet")
library(faraway)
library(glmnet)
library(car)
# excluding multicollinearity
regress_Oecd <- subset(complete_Oecd, select = -c(PhysicianPractising))
regress_Oecd <- subset(regress_Oecd, select = -c(totDeaths))
regress_Oecd <- subset(regress_Oecd, select = -c(TotalEmployment))
install.packages("caret")
library("caret")
## Train Model: 70% training / 30% validation set split of the data set
set.seed(123)
train <- regress_Oecd$LifeExpectancy %>%
createDataPartition(p = 0.7, list = FALSE)
length(train)
test <- regress_Oecd[-train, ]
# number of rows held out for validation
nrow(test)
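The 70/30 split above relies on createDataPartition from caret. As a rough illustration only, the sketch below performs a plain random 70/30 split in base R. It does not balance the split on the outcome the way createDataPartition does, and the data frame `toy` and its values are made up, standing in for regress_Oecd.

```r
# Hypothetical toy data standing in for regress_Oecd (values are made up)
set.seed(123)
toy <- data.frame(GDP = runif(100, 20000, 60000),
                  LifeExpectancy = rnorm(100, mean = 80, sd = 2))

# Simple random split: 70% of the row indices go to the training set
train_idx <- sample(seq_len(nrow(toy)), size = round(0.7 * nrow(toy)))
toy_train <- toy[train_idx, ]
toy_test  <- toy[-train_idx, ]

# Row counts of the two sets
nrow(toy_train)
nrow(toy_test)
```

Using nrow() rather than length() matters here, because length() on a data frame returns the number of columns.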
summary(regress_Oecd)
# model 1
lm_mod1=lm(LifeExpectancy~.,data=regress_Oecd,subset = train)
summary(lm_mod1)
round(vif(lm_mod1), 2)
# backward stepwise selection
step_mod=step(lm_mod1, direction='backward',trace=F)
summary(step_mod)
round(vif(step_mod), 2)
# plot model 1
plot(lm_mod1,which=c(1,2))
# plot model 2 stepwise
plot(step_mod,which=c(1,2))
# Prediction & Evaluation
# Predict stepwise model
step_pred<- step_mod %>% predict(test)
# Model performance stepwise model
data.frame(
RMSE = RMSE(step_pred, test$LifeExpectancy),
R2 = R2(step_pred, test$LifeExpectancy)
)
# Predict model1
mod1_pred<- lm_mod1 %>% predict(test)
# Model performance model1
data.frame(
RMSE = RMSE(mod1_pred, test$LifeExpectancy),
R2 = R2(mod1_pred, test$LifeExpectancy)
)
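For reference, the two metrics used in the evaluation can be reproduced by hand. The sketch below assumes caret's default definitions, where RMSE is the square root of the mean squared error and R2 is the squared correlation between predictions and observations. The vectors `obs` and `pred` are made-up values, not results from this project.

```r
# Made-up observed and predicted life expectancies (illustrative only)
obs  <- c(78.2, 80.1, 81.5, 79.3, 82.0)
pred <- c(78.9, 79.6, 81.0, 80.1, 81.4)

# Root mean squared error: square root of the mean squared difference
rmse_manual <- sqrt(mean((pred - obs)^2))

# R-squared taken as the squared correlation of predictions and observations
r2_manual <- cor(pred, obs)^2

data.frame(RMSE = rmse_manual, R2 = r2_manual)
```

A lower RMSE and a higher R2 on the validation set indicate the better of the two models being compared.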

