
PROJECT FINAL REPORT

ANALYSIS OF POPULATION HEALTH


Organization for Economic Co-Operation and Development
(OECD health dataset)

Contents

1 Executive Summary
1.1 Finding a suitable data set
1.2 Purpose of data cleaning and file preparation
1.3 File conversion and preparation
1.4 The synergy
1.5 The findings
2 Database
2.1 The process of creating the database
2.2 Exploratory Analysis
3 Other Methods
3.1 Model building
3.2 Conclusion
3.3 Recommendation
3.4 Detailed Schedule
3.5 Help from the course material
4 References
5 Appendices
5.1 Appendix 1
5.2 Appendix II

1 Executive Summary
Higher life expectancy results from many factors, such as improvements in health and nutrition,
education, health services, reduced mortality and many other social and economic factors.
Importantly, life expectancy increases not only when older people live longer but also when fewer
people die at a young age. Hence, these indicators matter greatly for life expectancy
(Murillo, 2020).
It has therefore become important to analyse and identify the indicators/variables that affect
population health, so that health care systems and other measures can be improved or newly
introduced and life expectancy thereby increased.

This study aims to analyse the Organization for Economic Co-operation and Development (OECD)
health-related data to determine the indicators which affect population health, using 'Life
expectancy' as the key measure.
The study analyses this health-related data to recognise which variables have a significant effect
on population health. The results can then be used as a role model for raising the level of these
indicators in developing countries, to improve their population health and their life expectancy.

Analysing the various aspects of health in developed countries allows the same health-related
services in developing countries to be compared, to fill the gaps and provide better insight into
future improvements.

1.1 Finding a suitable data set


Finding the dataset for the project was the biggest hurdle. Fortunately, the presentation included in
the course material helped a lot in finding the required dataset. Thanks to this tutorial, I was
exposed to most of the publicly available big datasets. Before reading this material, I wasn’t aware
of how to do advanced Google searches, which improve the chance of finding material that a general
search does not return.

As a result of the advanced search, I was able to find the OECD statistics (OECD.org., 2020)
(https://stats.oecd.org/), which helped to answer my research question.

Figure 1: Dataset links


1.2 Purpose of data cleaning and file preparation


There are many reasons why data must be cleaned before analysis. The main ones are the inaccuracies
and duplications that build up over time when data is collected and stored. These need to be
identified and cleaned before analysis to obtain an accurate result. In addition, data in text
fields needs some processing to standardise it and eliminate the large number of variations within
a field.
While pre-processing the data set, different datasets may need to be merged; duplication can occur
at this point and needs attention. However, de-duplication has to be performed carefully so as not
to exclude records erroneously.
Next, missing or erroneous data has to be identified and fixed. Several techniques can be applied to
fix missing data, depending on the nature of the analysis.

1.3 File conversion and preparation


The OECD data set used for this analysis consists of 18 health-related variables in CSV format. Data
in CSV format is not easy to analyse. Therefore, using the knowledge learnt in the course, I created
a SQL database to make the data cleaning process easier.
Below are the steps taken to prepare the datasets for the analysis stage:
• Download the data sets.
• Check ('eyeball') every CSV file to identify the variables within each sheet.
• Handle structural mismatches/errors:
  - the spreadsheets were organised differently
  - blank columns in between, and columns irrelevant to the analysis
  - comments included for some variables
  - trailing and leading spaces, removed manually before the data was migrated
  - unaccepted characters such as “..” used for missing values, updated to blanks in the sheet
    before the analysis
• Create the SQL database and a separate table for each data sheet (see Appendix 1).
• Import data from the CSV files into the SQL tables.
• Remove unwanted observations.
  The time span differed between datasets. Some covered 2010-2017, some 2005-2019 and others
  2000-2019. In addition, many countries had not reported data for 2019. Considering all these
  anomalies, the period with the fewest missing values across all countries was 2010 to 2018.
  Therefore, I use the 2010 to 2018 data for model building and further analysis.

  The ‘HealthExpenditureByAge’ data set has only three countries reported for all years. Therefore,
  it has been excluded from the analysis.
• Merge all data into a single dataset.
  Created a SQL view to show all variables for all countries across all years, including the
  dependent variable (life expectancy).
• Replace all missing data with nulls.
• Slice the data using the created view.
• Import the merged final dataset into R for data analysis and visualisation.
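
Some of the clean-up steps above (trimming stray spaces, turning the ".." placeholder into a missing value, keeping only 2010 to 2018, and reshaping one column per year into one row per country and year) can be sketched in Python. This is only an illustration under stated assumptions, not the manual/SQL process used in the project: the function names and the sample rows are hypothetical and are not taken from the OECD files.

```python
import csv
import io

MISSING_MARKERS = {"", ".."}  # ".." is the placeholder mentioned in the report


def clean_cell(value):
    """Trim surrounding whitespace and map missing-value markers to None."""
    text = value.strip()
    return None if text in MISSING_MARKERS else text


def wide_to_long(csv_text, first_year=2010, last_year=2018):
    """Turn 'Country,2010,2011,...' rows into (country, year, value) records."""
    reader = csv.DictReader(io.StringIO(csv_text))
    records = []
    for row in reader:
        country = clean_cell(row["Country"])
        for column, raw in row.items():
            column = column.strip()
            if not column.isdigit():
                continue  # skip non-year columns such as Country
            year = int(column)
            if first_year <= year <= last_year:
                cell = clean_cell(raw or "")
                value = float(cell) if cell is not None else None
                records.append((country, year, value))
    return records


# Hypothetical sample data, for illustration only.
sample = "Country,2009,2010,2011\n Australia ,81.5,81.8,..\nChile,78.2, 78.5 ,78.9\n"
for record in wide_to_long(sample):
    print(record)
```

The 2009 column is dropped because it falls outside the chosen range, and the ".." cell becomes None so that it can later be loaded as a null.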


1.4 The synergy


Comparatively, OECD countries have better life expectancy than non-OECD countries. All
health-related variables and life expectancy data were taken into the analysis. By applying
statistical methods to the dataset, it was possible to find the relationships between ‘Life
Expectancy’ and the health-related predictors, and their significance. This helped to answer my
research question, "What health indicators contribute to a country's life expectancy?". The study
revealed that gross domestic product and health care are the most significant predictors of life
expectancy.

1.5 The findings


The GDP (gross domestic product) and total health care variables are highly significant when
predicting life expectancy among OECD countries. The analysis also shows that injuries and health
expenditure have good significance, whereas pharmaceutical consumption and tobacco consumption have
only slight significance for life expectancy. Overall, the better a country’s economic status, the
higher its life expectancy is likely to be.

2 Database
2.1 The process of creating the database
The following steps were taken when creating the database in this project.
• Requirements analysis and design plan
  Requirements for the database were identified by going through all the CSV datasheets and
  identifying the database tables, fields and primary/foreign keys.
• Draft database outline
  The table structure was designed and drafted according to the facts identified above.
• Create database scripts
  In the OECD data set, the dataset for each variable is arranged by country and by year.
  Therefore, each CSV datasheet for a variable was converted to a SQL table, with the country name
  as the primary key (see Appendix 1).
• Load test data and perform testing
• Load data
  Once the test was successful, the data from the CSV files was loaded into the tables
  automatically, using SQL scripts (see Appendix 1).

2.2 Exploratory Analysis


In the data set, all variables apart from ‘country’ and ‘year’ are numerical. A new categorical
column named ‘Oecd’ has been derived to indicate the country’s OECD status.

Missing values
Missing values need to be handled because they can lead to wrong predictions or classifications in
any model. There are several techniques for handling missing data in the data analysis process.
I used the ‘MICE’ package in R (Appendix 2) to impute the missing values instead of deleting them
(Chaitanya Sagar, 2017). This method uses linear regression to predict the missing values.
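
The idea of predicting a missing value from a regression on another variable can be shown with a deliberately simplified sketch. It is not the MICE algorithm (which cycles over all variables and produces several imputed datasets) and it is not the R code used in this project; it fits one straight line on the observed pairs and uses it to fill the gaps. The variable names and numbers are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def impute_with_regression(xs, ys):
    """Fill None entries of ys with predictions from the observed (x, y) pairs."""
    observed = [(x, y) for x, y in zip(xs, ys) if y is not None]
    intercept, slope = fit_line([x for x, _ in observed], [y for _, y in observed])
    return [y if y is not None else intercept + slope * x for x, y in zip(xs, ys)]


# Hypothetical values: the observed points lie on the line y = 2x + 1.
gdp = [1.0, 2.0, 3.0, 4.0, 5.0]
health_spend = [3.0, 5.0, None, 9.0, None]
print(impute_with_regression(gdp, health_spend))
```

Observed values are returned unchanged; only the None entries are replaced by the fitted line's prediction.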


The figures below clearly show the missing values in the dataset. The ‘Cancer’ variable is 100%
missing, and ‘Population’ has the fewest missing values at 7.8%. The tobacco, fruit consumption and
obese population variables are each missing more than 50% of their data.
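
Percentages like these can be computed by counting the empty cells per column. The following is a small sketch assuming the merged data is held as a list of dictionaries with None for missing cells; the rows are made up and do not reproduce the project's figures.

```python
def missing_percentages(rows):
    """Return {column: percentage of rows in which the value is None}."""
    total = len(rows)
    return {
        column: 100.0 * sum(1 for row in rows if row[column] is None) / total
        for column in rows[0]
    }


# Hypothetical rows, for illustration only.
rows = [
    {"Population": 10.0, "Tobacco": None, "Cancer": None},
    {"Population": 12.0, "Tobacco": 15.0, "Cancer": None},
    {"Population": None, "Tobacco": None, "Cancer": None},
    {"Population": 9.0, "Tobacco": None, "Cancer": None},
]
for column, pct in sorted(missing_percentages(rows).items(), key=lambda kv: kv[1]):
    print(f"{column}: {pct:.1f}% missing")
```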

Figure 2

Figure 3: Missing value count


Figure 4: Missing values & pattern

Based on the above, I imputed the missing values using three parameters for the package: the number
of imputed datasets, the number of iterations the model should run, and the method (VIDHYA, 2016).

Five imputed datasets were generated, and ‘goodness of fit’ was checked to select the closest one.
A density plot was generated to visualise the similarity between the imputed datasets and the
original dataset.

Most importantly, the ‘Life Expectancy’ variable was excluded from the dataset so that it was not
imputed. It would not be reasonable to include the response variable in the imputation, as this
may affect the final prediction model.

Figure 5: Density plot for imputed data vs. Original data


After a closer analysis, I chose the second imputed dataset to complete the original dataset. The
image below shows the new ‘completed’ dataset, which does not contain any missing values.

Figure 6: Data summary after imputation

Checking for outliers: it is important to check for outliers in the data sets since they affect the
mean (and, to a lesser extent, the median) of the data. This in turn affects the absolute and mean
error of the model built in the next steps. The boxplots below were created for each variable in
the data set to show its spread. However, low or high values are not treated as outliers here,
since they are specific to each country and are reported as valid data.
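
A boxplot conventionally marks points lying more than 1.5 times the interquartile range beyond the quartiles. The sketch below applies that rule of thumb in Python purely as an illustration; the function name and the sample values are hypothetical, and flagging a value this way does not mean it should be removed (in this project such values were kept as valid country data).

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values outside [Q1 - k*IQR, Q3 + k*IQR], the usual boxplot rule."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Hypothetical life expectancy values in years, for illustration only.
life_expectancy = [56.0, 74.5, 76.0, 78.2, 79.1, 80.4, 81.0, 82.3, 84.0]
print(iqr_outliers(life_expectancy))  # [56.0]
```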


The boxplots above show a strong relationship between life expectancy and the type of country
(OECD or non-OECD). The mean and median life expectancy for OECD countries are higher than for
non-OECD countries. This shows at a glance that OECD (developed) countries have higher life
expectancy than non-OECD countries.
The summary of the variables is shown below. The minimum life expectancy in this dataset is 56
years and the highest is 84 years.

Figure 7: Summary of the variables


A correlation matrix was created to see the relationships between the predictor variables. From
this point onwards, however, I have excluded the non-OECD countries from the analysis in order to
examine the relationships with life expectancy in the developed (OECD) countries.

Figure 8: Correlation matrix

According to the correlation matrix above, there are weak positive relationships among total
employment, GDP, health expenditure, fruit consumption and pharmaceutical consumption. On the other
hand, there are weak negative relationships between live birth deaths and AIDS.

3 Other Methods
Multiple linear regression (Robert I. Kabacoff, 2017) is used in this study as the statistical
analysis technique to find the best-fitting line between the response and predictor variables, so
that the slope of the regression line can be measured.

The imputed, completed dataset is split in two for use as model training data and model test data.
70% of the data (233 observations) is used to train the model and the remaining 30% (100
observations) is used to test and evaluate the model.
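
A 70/30 split of this kind can be sketched in a few lines of Python. The project itself did this in R; the function name, the seed and the placeholder observations below are assumptions for illustration.

```python
import random


def train_test_split(rows, train_fraction=0.7, seed=42):
    """Shuffle a copy of the rows and cut it into training and test parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# 333 placeholder observations stand in for the completed dataset.
observations = list(range(333))
train, test = train_test_split(observations)
print(len(train), len(test))  # 233 100
```

With 333 observations a 70% cut gives the 233 training and 100 test observations reported above.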

A correlation matrix was generated to check for multicollinearity between the predictor variables.
According to the matrix below, population and physician practising are strongly correlated (97%).
In addition, ‘total deaths’ is strongly correlated (97%) with physician practising. Therefore, the
‘physician practising’ variable is removed from the model. Secondly, total deaths and total health
care are also strongly correlated (74%), so total deaths is removed from the model. Total
employment and GDP are correlated at 76%, and therefore total employment has also been safely
removed from the model.
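
The screening described above amounts to computing the pairwise Pearson correlation and listing the pairs above some cut-off. The sketch below illustrates this in Python; the 0.7 threshold is an assumption (the report removes variables at correlations of 74% and above but does not state a cut-off), and the column names and values are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


def highly_correlated_pairs(columns, threshold=0.7):
    """List (name_a, name_b, r) for every pair with |r| at or above the threshold."""
    pairs = []
    for (name_a, a), (name_b, b) in combinations(columns.items(), 2):
        r = pearson(a, b)
        if abs(r) >= threshold:
            pairs.append((name_a, name_b, round(r, 2)))
    return pairs


# Hypothetical predictor columns, for illustration only.
columns = {
    "population": [5.0, 10.0, 20.0, 40.0, 80.0],
    "physicians": [11.0, 19.0, 42.0, 79.0, 161.0],
    "gdp": [30.0, 45.0, 28.0, 50.0, 33.0],
}
for name_a, name_b, r in highly_correlated_pairs(columns):
    print(f"{name_a} vs {name_b}: r = {r}")
```

Which variable of a flagged pair to drop remains a modelling decision; the code only lists the candidates.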


Figure 9: Correlation matrix for predictor variables

3.1 Model building


Two models were built to compare the final results. Model 1 consists of 13 variables: all variables
except total deaths, physician practising and total employment. Model 2 was built after performing
stepwise regression, which selected 8 of the 13 variables. The results of the two models are shown
below.


Figure 10: model 1 summary

Figure 11: model 2 (stepwise regression) summary


Model 1 visualisation for residuals:

Model 2 (stepwise regression) visualisation for residuals:

The built models are then used to predict life expectancy on the test dataset and evaluated using
their R squared values, as shown below. For both models, the R squared values on the test data are
close to those of the trained models.


According to the plots above, the relationship between the residuals and the fitted values is not
completely linear. However, most data points fit the line and the residuals are approximately
normally distributed.

Model 1, with all 13 selected variables, achieved an adjusted R squared of 0.1827 on the training
data set. This means that about 18% of the variation in life expectancy can be explained by the
predictor variables. On the test dataset the model gave an R squared of 0.1233, which indicates
that the predictors explain only a modest share of the variation in the response variable.
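
For reference, R squared and adjusted R squared follow the standard definitions R² = 1 - SS_res / SS_tot and adjusted R² = 1 - (1 - R²)(n - 1) / (n - p - 1), where n is the number of observations and p the number of predictors. The sketch below computes both in Python; the actual and predicted values are hypothetical and do not reproduce the figures reported above.

```python
def r_squared(actual, predicted):
    """Share of the variance in `actual` explained by `predicted`."""
    mean_y = sum(actual) / len(actual)
    ss_res = sum((y - p) ** 2 for y, p in zip(actual, predicted))
    ss_tot = sum((y - mean_y) ** 2 for y in actual)
    return 1.0 - ss_res / ss_tot


def adjusted_r_squared(r2, n_obs, n_predictors):
    """Penalise R squared for the number of predictors in the model."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


# Hypothetical actual and predicted life expectancies, for illustration only.
actual = [70.0, 75.0, 80.0, 85.0]
predicted = [72.0, 74.0, 81.0, 83.0]
r2 = r_squared(actual, predicted)
print(round(r2, 4))

# Same R squared, but judged as if it came from 233 observations and 13 predictors.
print(round(adjusted_r_squared(r2, n_obs=233, n_predictors=13), 4))
```

Adjusted R squared is always at most R squared, and the gap widens as predictors are added without improving the fit, which is why it is the fairer figure for comparing a 13-variable model with an 8-variable one.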

Likewise, the model built using stepwise regression gave an R squared of 0.1883, which is slightly
higher than that of model 1.

Model 2 shows that the GDP and total health care variables are highly significant when predicting
life expectancy. It also shows that injuries and health expenditure have good significance, whereas
pharmaceutical consumption and tobacco consumption have only slight significance for life
expectancy.

3.2 Conclusion
A country’s GDP (gross domestic product) and health care (government/social health care) have the
greatest significance for its life expectancy. Countries with higher domestic income have good
potential for high life expectancy. Health care is, unsurprisingly, also a major factor in life
expectancy. Countries with better economies have better health care systems, which lead to better
life expectancy. Of all the other health-related indicators, 'Injuries' contributes good
significance to life expectancy, along with health expenditure.

3.3 Recommendation
There may be better methods for predicting life expectancy than multiple linear regression.
According to the readings, LASSO, ridge regression and random forest can be used to predict life
expectancy. Including other variables, such as economic, social, environmental and safety
indicators, could also help to improve model performance. The model results suggest that
health-related indicators have a limited effect on life expectancy. Therefore, there may be other
factors with more significance, which calls for an extended analysis of ‘Life Expectancy’.

3.4 Detailed Schedule

For each week, the entries below give the aims (project activity, learning activity and details of
achievement) and the outcome as recorded in the progress report and in the final report.

Week 1 (13-17 July)
Project activity: Familiarization of the project specification; building the research question
Learning activity: Module 1
Details of achievement:
• Finish reading Module 1
• Understanding of the project specification and requirements.
• Start thinking about a research question
Outcome (progress report): completed
Outcome (final report): completed

Week 2 (20-24 July)
Project activity: Accessing big data / Finalise the data set / Familiarise with the data set
Learning activity: Module 2
Details of achievement:
• Have finalised the research question
• Finished reading module 2, which will help to design the relationships among the datasets
• Finalised the big datasets to be used in the project
Outcome (progress report): completed
Outcome (final report): completed

Week 3 (27-31 July)
Project activity: Project proposal writing
Learning activity: Module 3
Details of achievement:
• Completed module 3 reading
• Report writing has started and completed for project proposal
Outcome (progress report): Achieved all tasks but failed to read the module due to taking more
time to finish the project proposal.
Outcome (final report): completed

Week 4 (03-07 August)
Project activity: Submit a proposal / Database design
Learning activity: Module 4
Details of achievement:
• Submitted project proposal
• Read Module 4
• Start designing the database to import the dataset
Outcome (progress report): Completed the course reading and catch up with the missed module from
the previous week. Instead of database designing, analysed the downloaded dataset for any
inconsistencies, missingness and well-preparedness. Also familiarised with the dataset to identify
their primary keys and relationships.
Outcome (final report): completed

Week 5 (10-14 August)
Project activity: Creation of the database and tables / Import datasets
Learning activity: Module 5
Details of achievement:
• Read Module 5
• Creation of database, tables and their relationships
• Import data into the database
Outcome (progress report): Finalised the database design, downloaded SQL express and installed for
the data cleaning process. Scripts were written to create the database and import data from CSV to
SQL.
Outcome (final report): completed

Week 6 (17-21 August)
Project activity: Data cleaning / Exploratory Analysis
Learning activity: Module 6
Details of achievement:
• Read module 6
• Start to query data and find any missing data or perform data cleaning if required.
• Exploratory analysis to get to better know the datasets.
Outcome (progress report): Course reading from the previous week and module 6 were completed.
Continuation of database script writing and then data import to tables were completed. Querying
the data and exploratory analysis was not started and fell behind the schedule due to personal
(family) circumstances.
Outcome (final report): completed

Week 7 (24-28 August)
Project activity: Feature selection
Learning activity: Module 7
Details of achievement:
• Selecting the most significant variables out of the many variables according to the previous
  studies or own analysis
• Read module 7
Outcome (progress report): Querying the data and cleaning the data set was completed. Not completed
the scheduled tasks for this week at all. I was still behind the schedule at this point but was
confident that I could catch up on the work in the next few weeks.
Outcome (final report): completed

Week 8 (31 August - 4 September)
Project activity: Feature selection
Learning activity: Module 8
Details of achievement:
• Read module 8
• Continue with the feature selection and finalise the feature to be used in the analysis.
Outcome (progress report): Completed the exploratory analysis (week 6 remaining work) and read
module 7 from week 7.
Outcome (final report): completed

Week 9 (7-11 September)
Project activity: Analysis; identify if any changes needed to the research question or datasets
Learning activity: Module 9
Details of achievement:
• Start the analysis with the data
• Any revisions to the research question or data sets, if find necessary
• Complete reading module 9
Outcome (progress report): Downloaded R and start learning it since this was completely new to me.
This was something that I didn't schedule time in my proposal and contributed to fell behind the
schedule.
Outcome (final report): completed

Week 10 (14-18 September)
Project activity: Review process if required; continue analysis
Learning activity: Module 10
Details of achievement:
• Repeat analysis, if changes occurred, if not continue with the analysis
• Read module 10
Outcome (progress report): Completed reading the course module 9 (from previous week) and module
10. Reviewing the research question is done and identified that no changes need to be made after
knowing the dataset better and reading through other research papers.
Outcome (final report): completed

Week 11 (21-25 September)
Project activity: Prepare and start the progress report writing
Learning activity: -
Details of achievement:
• Organise and start writing the progress report
Outcome (progress report): Started writing the progress report.
Outcome (final report): -

Week 12 (28 September - 2 October)
Project activity: Continue writing progress report
Learning activity: -
Details of achievement:
• Continue with the progress report work
Outcome (progress report): Progress report writing continued
Outcome (final report): completed

Week 13 (5-9 October)
Project activity: Submit progress report
Learning activity: Module 11
Details of achievement:
• Read module 11
• Submit Assignment 2
• Continue with the analysis
• Start creating visualisations with the analysed data findings
Outcome (progress report): Work pending – not achieved anything this week, being unwell.
Outcome (final report): Progress report submitted.

Week 14 (12-16 October)
Project activity: Obtain final results / Evaluations
Learning activity: Module 12
Details of achievement:
• Discuss and analyse the outcomes of the analysis
Outcome (progress report): Work pending
Outcome (final report): Completed module 11 reading and further analysis continued. With further
research paper readings, found a method of handling missing values. Therefore, had to revisit the
data analysis. Increased my study time to catch up and be on track. Started creating visualisation
to the findings.

Week 15 (19-23 October)
Project activity: Prepare final report
Learning activity: -
Details of achievement:
• Start writing the final report
Outcome (progress report): Work pending
Outcome (final report): Report writing has started while continuing with the model building and
evaluation was carried out.

Week 16 (26 October - 6 November)
Project activity: Submit final report
Learning activity: -
Details of achievement:
• Submit final assessment
Outcome (progress report): Work pending
Outcome (final report): Continuation of model building and evaluation using different R functions
and regression methods. Completed the report by the due date.

3.5 Help from the course material
My biggest challenge was finding the dataset for the project. Fortunately, the presentation included
in the course material helped me a lot to find my required dataset. I was exposed to most of the
publicly available big datasets, thanks to this tutorial.


Handling big data, preparing the datasets and data cleaning were important steps in this project.
These steps were discussed clearly in the course modules, and I applied them directly to this
project work.

Database creation, management and security were also discussed in the modules, which gave me an
interesting start to data modelling. In particular, the ‘data activities’/video clips allowed me to
gain hands-on experience as extended knowledge. I was able to manipulate the dataset
(slicing/dicing) according to the analysis requirements before importing it into 'R' for analysis.

'R' was completely new to me and I was a little worried whether I would be able to handle it in my
project analysis. However, the modules provided a lot of knowledge from the beginner level, which
helped me to start with ease and then advance to deeper functions. I am happy that I got to learn
this powerful statistical package, which is an added value, and I am sincerely proud of my
achievement here. The modules also discuss data visualisation, its quality, techniques and
rationale, which helped me to present the results of my analysis.

It was interesting to learn about the Hadoop framework, although I did not specifically use it in
this project. However, for big data management, knowing the Hadoop framework will be a great asset
in future.

4 References


CHAITANYA SAGAR, P. A. 2017. A Solution to Missing Data: Imputation Using R (17:n37) [Online].
Available: https://www.kdnuggets.com/2017/09/missing-data-imputation-using-r.html
[Accessed].
MURILLO, I. L. 2020. The life expectancy: what is it and why does it matter [Online]. CENIE. Available:
https://cenie.eu/en/blogs/age-society/life-expectancy-what-it-and-why-does-it-matter
[Accessed].
OECD.ORG. 2020. Our global reach - OECD [Online]. Available:
http://www.oecd.org/about/members-and-partners/ [Accessed 26/07/2020].
ROBERT I. KABACOFF, P. D. 2017. Multiple (Linear) Regression [Online]. Data Camp. Available:
https://www.statmethods.net/stats/regression.html [Accessed].
VIDHYA, A. 2016. Tutorial on 5 Powerful R Packages used for imputing missing values [Online].
Available:
https://www.analyticsvidhya.com/blog/2016/03/tutorial-powerful-packages-imputing-missing-values/
[Accessed].

5 Appendices
5.1 Appendix 1
/***** Project Database Creation SQL ********/
-- This is to create the database tables to
-- populate the OECD data
/***************************************/

-- Create AIDS table


CREATE TABLE [dbo].[HealthStatus__AIDS](
[Country] [varchar](100) NOT NULL,
[2000] [decimal](18, 1) NULL,
[2001] [decimal](18, 1) NULL,
[2002] [decimal](18, 1) NULL,
[2003] [decimal](18, 1) NULL,
[2004] [decimal](18, 1) NULL,
[2005] [decimal](18, 1) NULL,
[2006] [decimal](18, 1) NULL,
[2007] [decimal](18, 1) NULL,
[2008] [decimal](18, 1) NULL,
[2009] [decimal](18, 1) NULL,
[2010] [decimal](18, 1) NULL,
[2011] [decimal](18, 1) NULL,
[2012] [decimal](18, 1) NULL,
[2013] [decimal](18, 1) NULL,
[2014] [decimal](18, 1) NULL,
[2015] [decimal](18, 1) NULL,
[2016] [decimal](18, 1) NULL,
[2017] [decimal](18, 1) NULL,
[2018] [decimal](18, 1) NULL,
[2019] [decimal](18, 1) NULL
) ON [PRIMARY]
GO

-- Creating the Alcohol consumption table


CREATE TABLE [dbo].[HealthStatus_AlcoholConsumption](
[Country] [varchar](100) NOT NULL,
[2000] [decimal](18, 1) NULL,
[2001] [decimal](18, 1) NULL,
[2002] [decimal](18, 1) NULL,
[2003] [decimal](18, 1) NULL,
[2004] [decimal](18, 1) NULL,
[2005] [decimal](18, 1) NULL,
[2006] [decimal](18, 1) NULL,
[2007] [decimal](18, 1) NULL,
[2008] [decimal](18, 1) NULL,
[2009] [decimal](18, 1) NULL,
[2010] [decimal](18, 1) NULL,
[2011] [decimal](18, 1) NULL,
[2012] [decimal](18, 1) NULL,
[2013] [decimal](18, 1) NULL,
[2014] [decimal](18, 1) NULL,
[2015] [decimal](18, 1) NULL,
[2016] [decimal](18, 1) NULL,
[2017] [decimal](18, 1) NULL,
[2018] [decimal](18, 1) NULL,
[2019] [decimal](18, 1) NULL
) ON [PRIMARY]

-- Creating the Cancer table


CREATE TABLE [dbo].[HealthStatus_Cancer](
[Country] [varchar](100) NOT NULL,
[2000] [decimal](18, 1) NULL,
[2002] [decimal](18, 1) NULL,
[2008] [decimal](18, 1) NULL,
[2012] [decimal](18, 1) NULL
) ON [PRIMARY]
GO

-- Creating the FruitConsumption table

CREATE TABLE [dbo].[HealthStatus_FruitConsumption](


[Country] [varchar](100) NOT NULL,
[2000] [decimal](18, 1) NULL,
[2001] [decimal](18, 1) NULL,
[2002] [decimal](18, 1) NULL,
[2003] [decimal](18, 1) NULL,
[2004] [decimal](18, 1) NULL,
[2005] [decimal](18, 1) NULL,
[2006] [decimal](18, 1) NULL,
[2007] [decimal](18, 1) NULL,
[2008] [decimal](18, 1) NULL,
[2009] [decimal](18, 1) NULL,
[2010] [decimal](18, 1) NULL,
[2011] [decimal](18, 1) NULL,
[2012] [decimal](18, 1) NULL,
[2013] [decimal](18, 1) NULL,
[2014] [decimal](18, 1) NULL,
[2015] [decimal](18, 1) NULL,
[2016] [decimal](18, 1) NULL,
[2017] [decimal](18, 1) NULL,
[2018] [decimal](18, 1) NULL,
[2019] [decimal](18, 1) NULL
) ON [PRIMARY]

/*Bulk insert data to tables*/


BULK INSERT HealthStatus__AIDS
FROM 'C:\Users\chami\OneDrive\Documents\MSC\BigDataMngmt\Datasets\
HealthStatus__AIDS.txt'
WITH
(
FIELDTERMINATOR = ',',
ROWTERMINATOR = '\n'
)

BULK INSERT HealthStatus_AlcoholConsumption


FROM 'C:\Users\chami\OneDrive\Documents\MSC\BigDataMngmt\Datasets\
HealthStatus_AlcoholConsumption.txt'
WITH
(
FIELDTERMINATOR = ',',
ROWTERMINATOR = '\n'
)
BULK INSERT HealthStatus_Cancer
FROM 'C:\Users\chami\OneDrive\Documents\MSC\BigDataMngmt\Datasets\
HealthStatus_Cancer.txt'
WITH
(
FIELDTERMINATOR = ',',
ROWTERMINATOR = '\n'
)

BULK INSERT HealthStatus_FruitConsumption


FROM 'C:\Users\chami\OneDrive\Documents\MSC\BigDataMngmt\Datasets\
HealthStatus_FruitConsumption.txt'

Page | 21
PROJECT FINAL REPORT

WITH
(
FIELDTERMINATOR = ',',
ROWTERMINATOR = '\n'
)

BULK INSERT HealthStatus_GDP


FROM 'C:\Users\chami\OneDrive\Documents\MSC\BigDataMngmt\Datasets\
HealthStatus_GDP.txt'
WITH
(
FIELDTERMINATOR = ',',
ROWTERMINATOR = '\n'
)

BULK INSERT HealthStatus_HealthExpenditure


FROM 'C:\Users\chami\OneDrive\Documents\MSC\BigDataMngmt\Datasets\
HealthStatus_HealthExpenditure.txt'
WITH
(
FIELDTERMINATOR = ',',
ROWTERMINATOR = '\n'
)

BULK INSERT HealthStatus_Injuries


FROM 'C:\Users\chami\OneDrive\Documents\MSC\BigDataMngmt\Datasets\
HealthStatus_Injuries.txt'
WITH
(
FIELDTERMINATOR = ',',
ROWTERMINATOR = '\n'
)

BULK INSERT HealthStatus_LifeExpectancy


FROM 'C:\Users\chami\OneDrive\Documents\MSC\BigDataMngmt\Datasets\
HealthStatus_LifeExpectancy.txt'
WITH
(
FIELDTERMINATOR = ',',
ROWTERMINATOR = '\n'
)

BULK INSERT HealthStatus_LowBirthWeight


FROM 'C:\Users\chami\OneDrive\Documents\MSC\BigDataMngmt\Datasets\
HealthStatus_LowBirthWeight.txt'
WITH
(
FIELDTERMINATOR = ',',
ROWTERMINATOR = '\n'
)

BULK INSERT HealthStatus_Mortality_LiveBirthDeaths


FROM 'C:\Users\chami\OneDrive\Documents\MSC\BigDataMngmt\Datasets\
HealthStatus_Mortality_LiveBirthDeaths.txt'
WITH
(
FIELDTERMINATOR = ',',
ROWTERMINATOR = '\n'
)

Page | 22
PROJECT FINAL REPORT

BULK INSERT HealthStatus__AIDS


FROM 'C:\Users\chami\OneDrive\Documents\MSC\BigDataMngmt\Datasets\
HealthStatus__AIDS.txt'
WITH
(
FIELDTERMINATOR = ',',
ROWTERMINATOR = '\n'
)

BULK INSERT HealthStatus_Mortality_totDeaths


FROM 'C:\Users\chami\OneDrive\Documents\MSC\BigDataMngmt\Datasets\
HealthStatus_Mortality_totDeaths.txt'
WITH
(
FIELDTERMINATOR = ',',
ROWTERMINATOR = '\n'
)

BULK INSERT HealthStatus_ObesePopulation


FROM 'C:\Users\chami\OneDrive\Documents\MSC\BigDataMngmt\Datasets\
HealthStatus_ObesePopulation.txt'
WITH
(
FIELDTERMINATOR = ',',
ROWTERMINATOR = '\n'
)

BULK INSERT HealthStatus_PharmaConsumption


FROM 'C:\Users\chami\OneDrive\Documents\MSC\BigDataMngmt\Datasets\
HealthStatus_PharmaConsumption.txt'
WITH
(
FIELDTERMINATOR = ',',
ROWTERMINATOR = '\n'
)

BULK INSERT HealthStatus_PhysicianPractising


FROM 'C:\Users\chami\OneDrive\Documents\MSC\BigDataMngmt\Datasets\
HealthStatus_PhysicianPractising.txt'
WITH
(
FIELDTERMINATOR = ',',
ROWTERMINATOR = '\n'
)

BULK INSERT HealthStatus_TobaccoConsumption


FROM 'C:\Users\chami\OneDrive\Documents\MSC\BigDataMngmt\Datasets\
HealthStatus_TobaccoConsumption.txt'
WITH
(
FIELDTERMINATOR = ',',
ROWTERMINATOR = '\n'
)

BULK INSERT HealthStatus_TotalEmployment
FROM 'C:\Users\chami\OneDrive\Documents\MSC\BigDataMngmt\Datasets\HealthStatus_TotalEmployment.txt'
WITH
(
FIELDTERMINATOR = ',',
ROWTERMINATOR = '\n'
)

BULK INSERT HealthStatus_TotalHealthcare
FROM 'C:\Users\chami\OneDrive\Documents\MSC\BigDataMngmt\Datasets\HealthStatus_TotalHealthcare.txt'
WITH
(
FIELDTERMINATOR = ',',
ROWTERMINATOR = '\n'
)

BULK INSERT HealthStatus_TotalPopulation
FROM 'C:\Users\chami\OneDrive\Documents\MSC\BigDataMngmt\Datasets\HealthStatus_TotalPopulation.txt'
WITH
(
FIELDTERMINATOR = ',',
ROWTERMINATOR = '\n'
)

/***** Data wrangling ********/


-- merging data tables
-- Unpivot the year columns --> one row per country and year
-- deriving new columns
/***************************************/

select * from HealthStatus__AIDS


select * from HealthStatus_AlcoholConsumption
select * from HealthStatus_Cancer
select * from HealthStatus_FruitConsumption
select * from HealthStatus_GDP
select * from HealthStatus_HealthExpenditure
select * from HealthStatus_Injuries
select * from HealthStatus_LifeExpectancy
select * from HealthStatus_LowBirthWeight
select * from HealthStatus_Mortality_LiveBirthDeaths
select * from HealthStatus_Mortality_totDeaths
select * from HealthStatus_ObesePopulation
select * from HealthStatus_PharmaConsumption
select * from HealthStatus_PhysicianPractising
select * from HealthStatus_TobaccoConsumption
select * from HealthStatus_TotalEmployment
select * from HealthStatus_TotalHealthcare
select * from HealthStatus_TotalPopulation

/* Merging the datasets to a single dataset and pivoting the dataset


** Saved the results as a view
*/

select e.Country,case when e.Country like '%Non-OECD%' then 'N' else 'Y' end as Oecd,
2005 as [year], p.[2005] as TotalPopulation, i.[2005] as Injuries,

ld.[2005] as LiveBirthDeaths, null as Cancer, bw.[2005] as LowBirthWeight,


pp.[2005] as PhysicianPractising, null as HealthExpenditure,a.[2005] as AIDS,

ph.[2005] as PharmaConsumption,al.[2005] as AlcoholConsumption,ob.[2005] as


ObesePopulation,

fr.[2005] as FruitConsumption,tc.[2005] as TobaccoConsumption,null as TotalHealthcare,


em.[2005] as TotalEmployment,

null as GDP, e.[2005] as LifeExpectancy

from HealthStatus_LifeExpectancy e left outer join HealthStatus_TotalPopulation p

on e.Country = p.Country

left outer join HealthStatus_Injuries i on e.Country = i.Country

left outer join HealthStatus_Mortality_LiveBirthDeaths ld on e.Country = ld.Country

left outer join HealthStatus_Cancer c on e.Country = c.Country

left outer join HealthStatus_LowBirthWeight bw on e.Country = bw.Country

left outer join HealthStatus_PhysicianPractising pp on e.Country = pp.Country

left outer join HealthStatus_HealthExpenditure he on e.Country = he.Country

left outer join HealthStatus__AIDS a on e.Country = a.Country

left outer join HealthStatus_PharmaConsumption ph on e.Country = ph.Country

left outer join HealthStatus_AlcoholConsumption al on e.Country = al.Country

left outer join HealthStatus_ObesePopulation ob on e.Country = ob.Country

left outer join HealthStatus_FruitConsumption fr on e.Country = fr.Country

left outer join HealthStatus_TobaccoConsumption tc on e.Country = tc.Country

left outer join HealthStatus_TotalEmployment em on e.Country = em.Country

left outer join HealthStatus_TotalHealthcare hc on e.Country = hc.Country

left outer join HealthStatus_GDP gd on e.Country = gd.Country

Union

select e.Country,
case when e.Country like '%Non-OECD%' then 'N' else 'Y' end as Oecd,
2006 as [year], p.[2006] as TotalPopulation, i.[2006] as Injuries,
ld.[2006] as LiveBirthDeaths, null as Cancer, bw.[2006] as LowBirthWeight,
pp.[2006] as PhysicianPractising,null as HealthExpenditure,a.[2006] as AIDS,
ph.[2006] as PharmaConsumption,al.[2006] as AlcoholConsumption,ob.[2006] as
ObesePopulation,

fr.[2006] as FruitConsumption,tc.[2006] as TobaccoConsumption,null as TotalHealthcare,


em.[2006] as TotalEmployment,

null as GDP, e.[2006] as LifeExpectancy

from HealthStatus_LifeExpectancy e left outer join HealthStatus_TotalPopulation p


on e.Country = p.Country

left outer join HealthStatus_Injuries i on e.Country = i.Country

left outer join HealthStatus_Mortality_LiveBirthDeaths ld on e.Country = ld.Country

left outer join HealthStatus_Cancer c on e.Country = c.Country

left outer join HealthStatus_LowBirthWeight bw on e.Country = bw.Country

left outer join HealthStatus_PhysicianPractising pp on e.Country = pp.Country

left outer join HealthStatus_HealthExpenditure he on e.Country = he.Country

left outer join HealthStatus__AIDS a on e.Country = a.Country

left outer join HealthStatus_PharmaConsumption ph on e.Country = ph.Country

left outer join HealthStatus_AlcoholConsumption al on e.Country = al.Country

left outer join HealthStatus_ObesePopulation ob on e.Country = ob.Country

left outer join HealthStatus_FruitConsumption fr on e.Country = fr.Country

left outer join HealthStatus_TobaccoConsumption tc on e.Country = tc.Country

left outer join HealthStatus_TotalEmployment em on e.Country = em.Country

left outer join HealthStatus_TotalHealthcare hc on e.Country = hc.Country

left outer join HealthStatus_GDP gd on e.Country = gd.Country

Union

select e.Country,case when e.Country like '%Non-OECD%' then 'N' else 'Y' end as Oecd,
2007 as [year], p.[2007] as TotalPopulation, i.[2007] as Injuries,

ld.[2007] as LiveBirthDeaths, null as Cancer, bw.[2007] as LowBirthWeight,

pp.[2007] as PhysicianPractising, null as HealthExpenditure,a.[2007] as AIDS,

ph.[2007] as PharmaConsumption,al.[2007] as AlcoholConsumption,ob.[2007] as


ObesePopulation,

fr.[2007] as FruitConsumption,tc.[2007] as TobaccoConsumption,null as


TotalHealthcare, em.[2007] as TotalEmployment,

null as GDP, e.[2007] as LifeExpectancy

from HealthStatus_LifeExpectancy e left outer join HealthStatus_TotalPopulation p

on e.Country = p.Country

left outer join HealthStatus_Injuries i on e.Country = i.Country

left outer join HealthStatus_Mortality_LiveBirthDeaths ld on e.Country = ld.Country

left outer join HealthStatus_Cancer c on e.Country = c.Country

left outer join HealthStatus_LowBirthWeight bw on e.Country = bw.Country


left outer join HealthStatus_PhysicianPractising pp on e.Country = pp.Country

left outer join HealthStatus_HealthExpenditure he on e.Country = he.Country

left outer join HealthStatus__AIDS a on e.Country = a.Country

left outer join HealthStatus_PharmaConsumption ph on e.Country = ph.Country

left outer join HealthStatus_AlcoholConsumption al on e.Country = al.Country

left outer join HealthStatus_ObesePopulation ob on e.Country = ob.Country

left outer join HealthStatus_FruitConsumption fr on e.Country = fr.Country

left outer join HealthStatus_TobaccoConsumption tc on e.Country = tc.Country

left outer join HealthStatus_TotalEmployment em on e.Country = em.Country

left outer join HealthStatus_TotalHealthcare hc on e.Country = hc.Country

left outer join HealthStatus_GDP gd on e.Country = gd.Country

Union

select e.Country,case when e.Country like '%Non-OECD%' then 'N' else 'Y' end as Oecd,
2008 as [year], p.[2008] as TotalPopulation, i.[2008] as Injuries,

ld.[2008] as LiveBirthDeaths, c.[2008] as Cancer, bw.[2008] as LowBirthWeight,

pp.[2008] as PhysicianPractising, null as HealthExpenditure,a.[2008] as AIDS,

ph.[2008] as PharmaConsumption,al.[2008] as AlcoholConsumption,ob.[2008] as


ObesePopulation,

fr.[2008] as FruitConsumption,tc.[2008] as TobaccoConsumption,null as TotalHealthcare,


em.[2008] as TotalEmployment,

null as GDP, e.[2008] as LifeExpectancy

from HealthStatus_LifeExpectancy e left outer join HealthStatus_TotalPopulation p

on e.Country = p.Country

left outer join HealthStatus_Injuries i on e.Country = i.Country

left outer join HealthStatus_Mortality_LiveBirthDeaths ld on e.Country = ld.Country

left outer join HealthStatus_Cancer c on e.Country = c.Country

left outer join HealthStatus_LowBirthWeight bw on e.Country = bw.Country

left outer join HealthStatus_PhysicianPractising pp on e.Country = pp.Country

left outer join HealthStatus_HealthExpenditure he on e.Country = he.Country

left outer join HealthStatus__AIDS a on e.Country = a.Country

left outer join HealthStatus_PharmaConsumption ph on e.Country = ph.Country

left outer join HealthStatus_AlcoholConsumption al on e.Country = al.Country

left outer join HealthStatus_ObesePopulation ob on e.Country = ob.Country


left outer join HealthStatus_FruitConsumption fr on e.Country = fr.Country

left outer join HealthStatus_TobaccoConsumption tc on e.Country = tc.Country

left outer join HealthStatus_TotalEmployment em on e.Country = em.Country

left outer join HealthStatus_TotalHealthcare hc on e.Country = hc.Country

left outer join HealthStatus_GDP gd on e.Country = gd.Country

Union

select e.Country,
case when e.Country like '%Non-OECD%' then 'N' else 'Y' end as Oecd,
2009 as [year], p.[2009] as TotalPopulation, i.[2009] as Injuries,

ld.[2009] as LiveBirthDeaths, null as Cancer, bw.[2009] as LowBirthWeight,

pp.[2009] as PhysicianPractising, null as HealthExpenditure,a.[2009] as AIDS,

ph.[2009] as PharmaConsumption,al.[2009] as AlcoholConsumption,ob.[2009] as


ObesePopulation,

fr.[2009] as FruitConsumption,tc.[2009] as TobaccoConsumption,null as TotalHealthcare,


em.[2009] as TotalEmployment,

null as GDP, e.[2009] as LifeExpectancy

from HealthStatus_LifeExpectancy e left outer join HealthStatus_TotalPopulation p

on e.Country = p.Country

left outer join HealthStatus_Injuries i on e.Country = i.Country

left outer join HealthStatus_Mortality_LiveBirthDeaths ld on e.Country = ld.Country

left outer join HealthStatus_Cancer c on e.Country = c.Country

left outer join HealthStatus_LowBirthWeight bw on e.Country = bw.Country

left outer join HealthStatus_PhysicianPractising pp on e.Country = pp.Country

left outer join HealthStatus_HealthExpenditure he on e.Country = he.Country

left outer join HealthStatus__AIDS a on e.Country = a.Country

left outer join HealthStatus_PharmaConsumption ph on e.Country = ph.Country

left outer join HealthStatus_AlcoholConsumption al on e.Country = al.Country

left outer join HealthStatus_ObesePopulation ob on e.Country = ob.Country

left outer join HealthStatus_FruitConsumption fr on e.Country = fr.Country

left outer join HealthStatus_TobaccoConsumption tc on e.Country = tc.Country

left outer join HealthStatus_TotalEmployment em on e.Country = em.Country

left outer join HealthStatus_TotalHealthcare hc on e.Country = hc.Country

left outer join HealthStatus_GDP gd on e.Country = gd.Country


Union

select e.Country,case when e.Country like '%Non-OECD%' then 'N' else 'Y' end as Oecd,
2010 as [year], p.[2010] as TotalPopulation, i.[2010] as Injuries,

ld.[2010] as LiveBirthDeaths, null as Cancer, bw.[2010] as LowBirthWeight,

pp.[2010] as PhysicianPractising, he.[2010] as HealthExpenditure,a.[2010] as AIDS,

ph.[2010] as PharmaConsumption,al.[2010] as AlcoholConsumption,ob.[2010] as


ObesePopulation,

fr.[2010] as FruitConsumption,tc.[2010] as TobaccoConsumption,hc.[2010] as


TotalHealthcare, em.[2010] as TotalEmployment,

gd.[2010] as GDP, e.[2010] as LifeExpectancy

from HealthStatus_LifeExpectancy e left outer join HealthStatus_TotalPopulation p

on e.Country = p.Country

left outer join HealthStatus_Injuries i on e.Country = i.Country

left outer join HealthStatus_Mortality_LiveBirthDeaths ld on e.Country = ld.Country

left outer join HealthStatus_Cancer c on e.Country = c.Country

left outer join HealthStatus_LowBirthWeight bw on e.Country = bw.Country

left outer join HealthStatus_PhysicianPractising pp on e.Country = pp.Country

left outer join HealthStatus_HealthExpenditure he on e.Country = he.Country

left outer join HealthStatus__AIDS a on e.Country = a.Country

left outer join HealthStatus_PharmaConsumption ph on e.Country = ph.Country

left outer join HealthStatus_AlcoholConsumption al on e.Country = al.Country

left outer join HealthStatus_ObesePopulation ob on e.Country = ob.Country

left outer join HealthStatus_FruitConsumption fr on e.Country = fr.Country

left outer join HealthStatus_TobaccoConsumption tc on e.Country = tc.Country

left outer join HealthStatus_TotalEmployment em on e.Country = em.Country

left outer join HealthStatus_TotalHealthcare hc on e.Country = hc.Country

left outer join HealthStatus_GDP gd on e.Country = gd.Country

Union

-- Life Expectancy and Other variables 2011

select e.Country,case when e.Country like '%Non-OECD%' then 'N' else 'Y' end as Oecd,
2011 as [year], p.[2011] as TotalPopulation, i.[2011] as Injuries,


ld.[2011] as LiveBirthDeaths, null as Cancer, bw.[2011] as LowBirthWeight,

pp.[2011] as PhysicianPractising, he.[2011] as HealthExpenditure,a.[2011] as AIDS,

ph.[2011] as PharmaConsumption,al.[2011] as AlcoholConsumption,ob.[2011] as


ObesePopulation,

fr.[2011] as FruitConsumption,tc.[2011] as TobaccoConsumption,hc.[2011] as


TotalHealthcare, em.[2011] as TotalEmployment,

gd.[2011] as GDP, e.[2011] as LifeExpectancy

from HealthStatus_LifeExpectancy e left outer join HealthStatus_TotalPopulation p

on e.Country = p.Country

left outer join HealthStatus_Injuries i on e.Country = i.Country

left outer join HealthStatus_Mortality_LiveBirthDeaths ld on e.Country = ld.Country

left outer join HealthStatus_Cancer c on e.Country = c.Country

left outer join HealthStatus_LowBirthWeight bw on e.Country = bw.Country

left outer join HealthStatus_PhysicianPractising pp on e.Country = pp.Country

left outer join HealthStatus_HealthExpenditure he on e.Country = he.Country

left outer join HealthStatus__AIDS a on e.Country = a.Country

left outer join HealthStatus_PharmaConsumption ph on e.Country = ph.Country

left outer join HealthStatus_AlcoholConsumption al on e.Country = al.Country

left outer join HealthStatus_ObesePopulation ob on e.Country = ob.Country

left outer join HealthStatus_FruitConsumption fr on e.Country = fr.Country

left outer join HealthStatus_TobaccoConsumption tc on e.Country = tc.Country

left outer join HealthStatus_TotalEmployment em on e.Country = em.Country

left outer join HealthStatus_TotalHealthcare hc on e.Country = hc.Country

left outer join HealthStatus_GDP gd on e.Country = gd.Country

Union

select e.Country,case when e.Country like '%Non-OECD%' then 'N' else 'Y' end as Oecd,
2012 as [year], p.[2012] as TotalPopulation, i.[2012] as Injuries,

ld.[2012] as LiveBirthDeaths, c.[2012] as Cancer, bw.[2012] as LowBirthWeight,

pp.[2012] as PhysicianPractising, he.[2012] as HealthExpenditure,a.[2012] as AIDS,

ph.[2012] as PharmaConsumption,al.[2012] as AlcoholConsumption,ob.[2012] as


ObesePopulation,

fr.[2012] as FruitConsumption,tc.[2012] as TobaccoConsumption,hc.[2012] as


TotalHealthcare, em.[2012] as TotalEmployment,

gd.[2012] as GDP, e.[2012] as LifeExpectancy


from HealthStatus_LifeExpectancy e left outer join HealthStatus_TotalPopulation p

on e.Country = p.Country

left outer join HealthStatus_Injuries i on e.Country = i.Country

left outer join HealthStatus_Mortality_LiveBirthDeaths ld on e.Country = ld.Country

left outer join HealthStatus_Cancer c on e.Country = c.Country

left outer join HealthStatus_LowBirthWeight bw on e.Country = bw.Country

left outer join HealthStatus_PhysicianPractising pp on e.Country = pp.Country

left outer join HealthStatus_HealthExpenditure he on e.Country = he.Country

left outer join HealthStatus__AIDS a on e.Country = a.Country

left outer join HealthStatus_PharmaConsumption ph on e.Country = ph.Country

left outer join HealthStatus_AlcoholConsumption al on e.Country = al.Country

left outer join HealthStatus_ObesePopulation ob on e.Country = ob.Country

left outer join HealthStatus_FruitConsumption fr on e.Country = fr.Country

left outer join HealthStatus_TobaccoConsumption tc on e.Country = tc.Country

left outer join HealthStatus_TotalEmployment em on e.Country = em.Country

left outer join HealthStatus_TotalHealthcare hc on e.Country = hc.Country

left outer join HealthStatus_GDP gd on e.Country = gd.Country

Union

select e.Country,case when e.Country like '%Non-OECD%' then 'N' else 'Y' end as Oecd,
2013 as [year], p.[2013] as TotalPopulation, i.[2013] as Injuries,

ld.[2013] as LiveBirthDeaths, null as Cancer, bw.[2013] as LowBirthWeight,

pp.[2013] as PhysicianPractising, he.[2013] as HealthExpenditure,a.[2013] as AIDS,

ph.[2013] as PharmaConsumption,al.[2013] as AlcoholConsumption,ob.[2013] as


ObesePopulation,

fr.[2013] as FruitConsumption,tc.[2013] as TobaccoConsumption,hc.[2013] as


TotalHealthcare, em.[2013] as TotalEmployment,

gd.[2013] as GDP, e.[2013] as LifeExpectancy

from HealthStatus_LifeExpectancy e left outer join HealthStatus_TotalPopulation p

on e.Country = p.Country

left outer join HealthStatus_Injuries i on e.Country = i.Country

left outer join HealthStatus_Mortality_LiveBirthDeaths ld on e.Country = ld.Country


left outer join HealthStatus_Cancer c on e.Country = c.Country

left outer join HealthStatus_LowBirthWeight bw on e.Country = bw.Country

left outer join HealthStatus_PhysicianPractising pp on e.Country = pp.Country

left outer join HealthStatus_HealthExpenditure he on e.Country = he.Country

left outer join HealthStatus__AIDS a on e.Country = a.Country

left outer join HealthStatus_PharmaConsumption ph on e.Country = ph.Country

left outer join HealthStatus_AlcoholConsumption al on e.Country = al.Country

left outer join HealthStatus_ObesePopulation ob on e.Country = ob.Country

left outer join HealthStatus_FruitConsumption fr on e.Country = fr.Country

left outer join HealthStatus_TobaccoConsumption tc on e.Country = tc.Country

left outer join HealthStatus_TotalEmployment em on e.Country = em.Country

left outer join HealthStatus_TotalHealthcare hc on e.Country = hc.Country

left outer join HealthStatus_GDP gd on e.Country = gd.Country

Union

select e.Country,case when e.Country like '%Non-OECD%' then 'N' else 'Y' end as Oecd,
2014 as [year], p.[2014] as TotalPopulation, i.[2014] as Injuries,

ld.[2014] as LiveBirthDeaths, null as Cancer, bw.[2014] as LowBirthWeight,

pp.[2014] as PhysicianPractising, he.[2014] as HealthExpenditure,a.[2014] as AIDS,

ph.[2014] as PharmaConsumption,al.[2014] as AlcoholConsumption,ob.[2014] as


ObesePopulation,

fr.[2014] as FruitConsumption,tc.[2014] as TobaccoConsumption,hc.[2014] as


TotalHealthcare, em.[2014] as TotalEmployment,

gd.[2014] as GDP, e.[2014] as LifeExpectancy

from HealthStatus_LifeExpectancy e left outer join HealthStatus_TotalPopulation p

on e.Country = p.Country

left outer join HealthStatus_Injuries i on e.Country = i.Country

left outer join HealthStatus_Mortality_LiveBirthDeaths ld on e.Country = ld.Country

left outer join HealthStatus_Cancer c on e.Country = c.Country

left outer join HealthStatus_LowBirthWeight bw on e.Country = bw.Country

left outer join HealthStatus_PhysicianPractising pp on e.Country = pp.Country

left outer join HealthStatus_HealthExpenditure he on e.Country = he.Country


left outer join HealthStatus__AIDS a on e.Country = a.Country

left outer join HealthStatus_PharmaConsumption ph on e.Country = ph.Country

left outer join HealthStatus_AlcoholConsumption al on e.Country = al.Country

left outer join HealthStatus_ObesePopulation ob on e.Country = ob.Country

left outer join HealthStatus_FruitConsumption fr on e.Country = fr.Country

left outer join HealthStatus_TobaccoConsumption tc on e.Country = tc.Country

left outer join HealthStatus_TotalEmployment em on e.Country = em.Country

left outer join HealthStatus_TotalHealthcare hc on e.Country = hc.Country

left outer join HealthStatus_GDP gd on e.Country = gd.Country

Union

select e.Country,case when e.Country like '%Non-OECD%' then 'N' else 'Y' end as Oecd,
2015 as [year], p.[2015] as TotalPopulation, i.[2015] as Injuries,

ld.[2015] as LiveBirthDeaths, null as Cancer, bw.[2015] as LowBirthWeight,

pp.[2015] as PhysicianPractising, he.[2015] as HealthExpenditure,a.[2015] as AIDS,

ph.[2015] as PharmaConsumption,al.[2015] as AlcoholConsumption,ob.[2015] as


ObesePopulation,

fr.[2015] as FruitConsumption,tc.[2015] as TobaccoConsumption,hc.[2015] as


TotalHealthcare, em.[2015] as TotalEmployment,

gd.[2015] as GDP, e.[2015] as LifeExpectancy

from HealthStatus_LifeExpectancy e left outer join HealthStatus_TotalPopulation p

on e.Country = p.Country

left outer join HealthStatus_Injuries i on e.Country = i.Country

left outer join HealthStatus_Mortality_LiveBirthDeaths ld on e.Country = ld.Country

left outer join HealthStatus_Cancer c on e.Country = c.Country

left outer join HealthStatus_LowBirthWeight bw on e.Country = bw.Country

left outer join HealthStatus_PhysicianPractising pp on e.Country = pp.Country

left outer join HealthStatus_HealthExpenditure he on e.Country = he.Country

left outer join HealthStatus__AIDS a on e.Country = a.Country

left outer join HealthStatus_PharmaConsumption ph on e.Country = ph.Country

left outer join HealthStatus_AlcoholConsumption al on e.Country = al.Country

left outer join HealthStatus_ObesePopulation ob on e.Country = ob.Country


left outer join HealthStatus_FruitConsumption fr on e.Country = fr.Country

left outer join HealthStatus_TobaccoConsumption tc on e.Country = tc.Country

left outer join HealthStatus_TotalEmployment em on e.Country = em.Country

left outer join HealthStatus_TotalHealthcare hc on e.Country = hc.Country

left outer join HealthStatus_GDP gd on e.Country = gd.Country

Union

select e.Country,case when e.Country like '%Non-OECD%' then 'N' else 'Y' end as Oecd,
2016 as [year], p.[2016] as TotalPopulation, i.[2016] as Injuries,

ld.[2016] as LiveBirthDeaths, null as Cancer, bw.[2016] as LowBirthWeight,

pp.[2016] as PhysicianPractising, he.[2016] as HealthExpenditure,a.[2016] as AIDS,

ph.[2016] as PharmaConsumption,al.[2016] as AlcoholConsumption,ob.[2016] as


ObesePopulation,

fr.[2016] as FruitConsumption,tc.[2016] as TobaccoConsumption,hc.[2016] as


TotalHealthcare, em.[2016] as TotalEmployment,

gd.[2016] as GDP, e.[2016] as LifeExpectancy

from HealthStatus_LifeExpectancy e left outer join HealthStatus_TotalPopulation p

on e.Country = p.Country

left outer join HealthStatus_Injuries i on e.Country = i.Country

left outer join HealthStatus_Mortality_LiveBirthDeaths ld on e.Country = ld.Country

left outer join HealthStatus_Cancer c on e.Country = c.Country

left outer join HealthStatus_LowBirthWeight bw on e.Country = bw.Country

left outer join HealthStatus_PhysicianPractising pp on e.Country = pp.Country

left outer join HealthStatus_HealthExpenditure he on e.Country = he.Country

left outer join HealthStatus__AIDS a on e.Country = a.Country

left outer join HealthStatus_PharmaConsumption ph on e.Country = ph.Country

left outer join HealthStatus_AlcoholConsumption al on e.Country = al.Country

left outer join HealthStatus_ObesePopulation ob on e.Country = ob.Country

left outer join HealthStatus_FruitConsumption fr on e.Country = fr.Country

left outer join HealthStatus_TobaccoConsumption tc on e.Country = tc.Country

left outer join HealthStatus_TotalEmployment em on e.Country = em.Country


left outer join HealthStatus_TotalHealthcare hc on e.Country = hc.Country

left outer join HealthStatus_GDP gd on e.Country = gd.Country

Union

select e.Country,case when e.Country like '%Non-OECD%' then 'N' else 'Y' end as Oecd,
2017 as [year], p.[2017] as TotalPopulation, i.[2017] as Injuries,

ld.[2017] as LiveBirthDeaths, null as Cancer, bw.[2017] as LowBirthWeight,

pp.[2017] as PhysicianPractising, he.[2017] as HealthExpenditure,a.[2017] as AIDS,

ph.[2017] as PharmaConsumption,al.[2017] as AlcoholConsumption,ob.[2017] as


ObesePopulation,

fr.[2017] as FruitConsumption,tc.[2017] as TobaccoConsumption,hc.[2017] as


TotalHealthcare, em.[2017] as TotalEmployment,

gd.[2017] as GDP, e.[2017] as LifeExpectancy

from HealthStatus_LifeExpectancy e left outer join HealthStatus_TotalPopulation p

on e.Country = p.Country

left outer join HealthStatus_Injuries i on e.Country = i.Country

left outer join HealthStatus_Mortality_LiveBirthDeaths ld on e.Country = ld.Country

left outer join HealthStatus_Cancer c on e.Country = c.Country

left outer join HealthStatus_LowBirthWeight bw on e.Country = bw.Country

left outer join HealthStatus_PhysicianPractising pp on e.Country = pp.Country

left outer join HealthStatus_HealthExpenditure he on e.Country = he.Country

left outer join HealthStatus__AIDS a on e.Country = a.Country

left outer join HealthStatus_PharmaConsumption ph on e.Country = ph.Country

left outer join HealthStatus_AlcoholConsumption al on e.Country = al.Country

left outer join HealthStatus_ObesePopulation ob on e.Country = ob.Country

left outer join HealthStatus_FruitConsumption fr on e.Country = fr.Country

left outer join HealthStatus_TobaccoConsumption tc on e.Country = tc.Country

left outer join HealthStatus_TotalEmployment em on e.Country = em.Country

left outer join HealthStatus_TotalHealthcare hc on e.Country = hc.Country

left outer join HealthStatus_GDP gd on e.Country = gd.Country

Union

select e.Country,case when e.Country like '%Non-OECD%' then 'N' else 'Y' end as Oecd,


2018 as [year], p.[2018] as TotalPopulation, i.[2018] as Injuries,

ld.[2018] as LiveBirthDeaths, null as Cancer, bw.[2018] as LowBirthWeight,

pp.[2018] as PhysicianPractising, he.[2018] as HealthExpenditure,a.[2018] as AIDS,

ph.[2018] as PharmaConsumption,al.[2018] as AlcoholConsumption,ob.[2018] as


ObesePopulation,

fr.[2018] as FruitConsumption,tc.[2018] as TobaccoConsumption,hc.[2018] as


TotalHealthcare, em.[2018] as TotalEmployment,

gd.[2018] as GDP, e.[2018] as LifeExpectancy

from HealthStatus_LifeExpectancy e left outer join HealthStatus_TotalPopulation p

on e.Country = p.Country

left outer join HealthStatus_Injuries i on e.Country = i.Country

left outer join HealthStatus_Mortality_LiveBirthDeaths ld on e.Country = ld.Country

left outer join HealthStatus_Cancer c on e.Country = c.Country

left outer join HealthStatus_LowBirthWeight bw on e.Country = bw.Country

left outer join HealthStatus_PhysicianPractising pp on e.Country = pp.Country

left outer join HealthStatus_HealthExpenditure he on e.Country = he.Country

left outer join HealthStatus__AIDS a on e.Country = a.Country

left outer join HealthStatus_PharmaConsumption ph on e.Country = ph.Country

left outer join HealthStatus_AlcoholConsumption al on e.Country = al.Country

left outer join HealthStatus_ObesePopulation ob on e.Country = ob.Country

left outer join HealthStatus_FruitConsumption fr on e.Country = fr.Country

left outer join HealthStatus_TobaccoConsumption tc on e.Country = tc.Country

left outer join HealthStatus_TotalEmployment em on e.Country = em.Country

left outer join HealthStatus_TotalHealthcare hc on e.Country = hc.Country

left outer join HealthStatus_GDP gd on e.Country = gd.Country

Order by Country

-- merging the adult mortality totDeaths


select l.*, totDeaths
from LifeExp l left outer join (SELECT TOP (1000) [Country],[Year],sum([Value]) as
totDeaths
FROM [Project].[dbo].[HealthStatus_Mortality_totDeaths]
Group by Country, [Year]) dth
on l.Country = dth.country
and l.[year] = dth.[Year]
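
The year-by-year UNION above turns each wide table (one column per year) into one row per country and year, then joins the tables on Country. The same reshaping can be sketched in R with tidyr's pivot_longer(). This is only an illustrative alternative under stated assumptions, not the method used in the project: the two toy tables, their values, and the helper to_long() are hypothetical.

```r
library(dplyr)
library(tidyr)

# Toy stand-ins for two of the wide tables (made-up values, not OECD data)
life_wide <- tibble(Country = c("A", "B"),
                    `2005` = c(78.1, 80.2),
                    `2006` = c(78.4, 80.5))
pop_wide  <- tibble(Country = c("A", "B"),
                    `2005` = c(10.0, 20.0),
                    `2006` = c(10.1, 20.3))

# Hypothetical helper: move the year columns into rows
to_long <- function(wide, value_name) {
  wide %>%
    pivot_longer(cols = -Country, names_to = "year", values_to = value_name) %>%
    mutate(year = as.integer(year))
}

# Join the long tables on Country and year, then derive the Oecd flag
life_long <- to_long(life_wide, "LifeExpectancy") %>%
  left_join(to_long(pop_wide, "TotalPopulation"), by = c("Country", "year")) %>%
  mutate(Oecd = if_else(grepl("Non-OECD", Country), "N", "Y")) %>%
  arrange(Country, year)

print(life_long)
```

Because every year is handled by the same call, a mismatch between the year label and the year column cannot arise, and adding a further year needs no new query block.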


5.2 Appendix II
R code

# Deploy packages
# (xyplot() is provided by lattice and ggplot() by ggplot2, so neither is installed under its own name)
install.packages("tidyverse")
install.packages("devtools")   # assumed here as the source of install_github()
devtools::install_github("kassambara/easyGgplot2")

library(tidyverse)    # attaches dplyr, tidyr, ggplot2 and purrr
library(magrittr)
library(lattice)
library(GGally)
library(data.table)
library(cdata)
library(WVPlots)
library(plot.matrix)

## test

install.packages("mice")

library(mice)


#install package and load library

install.packages("mi")

library(mi)

install.packages("missForest")

library(missForest)

install.packages("survival")

#Load raw data

library(readr)

LifeExp <- read_csv("~/MSC/BigDataMngmt/Datasets/LifeExp.csv",
                    col_types = cols(year = col_integer(),
                                     Oecd = col_character(),
                                     TotalPopulation = col_double(),
                                     Injuries = col_double(),
                                     LiveBirthDeaths = col_double(),
                                     Cancer = col_double(),
                                     LowBirthWeight = col_double(),
                                     PhysicianPractising = col_double(),
                                     HealthExpenditure = col_double(),
                                     AIDS = col_double(),
                                     PharmaConsumption = col_double(),
                                     AlcoholConsumption = col_double(),
                                     ObesePopulation = col_double(),
                                     FruitConsumption = col_double(),
                                     TobaccoConsumption = col_double(),
                                     TotalHealthcare = col_double(),
                                     TotalEmployment = col_double(),
                                     GDP = col_double(),
                                     totDeaths = col_double(),
                                     LifeExpectancy = col_double()),
                    na = "null")

# converting the country to uppercase

LifeExp$Country <- toupper(LifeExp$Country)

# separating OECD countries

# non OECD countries do not have sufficient data


Oecd <- LifeExp %>% dplyr::filter(!(Country %like% "OECD"))

NonOecd <- LifeExp %>% dplyr::filter(Country %like% "OECD")

# keeping only the numerical variables
# (assumption: the OECD subset is the intended input, as life_selected is not defined in this script)

Oecd_num <- Oecd %>%
  select_if(is.numeric)

Non_Oecd_num <- NonOecd %>%
  select_if(is.numeric)

Oecd_selected <- Oecd_num %>% select(-year)

Non_Oecd_selected <- Non_Oecd_num %>% select(-year)

# Correlation matrix - oecd

ggcorr(Oecd_num,
       label = TRUE, label_size = 2, label_round = 2,
       hjust = 1, size = 3, color = "royalblue", layout.exp = 5,
       low = "green3", mid = "gray95", high = "darkorange",
       name = "Correlation")

ggcorr(Non_Oecd_selected,
       label = TRUE, label_size = 2, label_round = 2,
       hjust = 1, size = 3, color = "royalblue", layout.exp = 5,
       low = "green3", mid = "gray95", high = "darkorange",
       name = "Correlation")

# selecting variables with correlation - oecd

var<-Oecd %>% select(1:2, 3,5,9,14,18,20)

var_selected <- var %>% select(-Country, -year)

data_num <- var_selected %>%

select_if(is.numeric)

# Correlation matrix

ggcorr(data_num,
       label = TRUE, label_size = 2, label_round = 2,
       hjust = 1, size = 3, color = "royalblue", layout.exp = 5,
       low = "green3", mid = "gray95", high = "darkorange",
       name = "Correlation")

# plotting the variables over the years

boxplot(TotalPopulation~year,
        data=LifeExp,
        main="Boxplots for each year\nPopulation",
        xlab="Year",
        ylab="Thousands of persons",
        col="blue",
        border="black"
)

boxplot(Injuries~year,
        data=LifeExp,
        main="Boxplots for each year\nInjuries",
        xlab="Year",
        ylab="Injured per million population",
        col="blue",
        border="black"
)

boxplot(PharmaConsumption~year,
        data=LifeExp,
        main="Boxplots for each year\nPharma Consumption",
        xlab="Year",
        ylab="Daily dosage per 1 000 inhabitants per day",
        col="blue",
        border="black"
)

boxplot(LowBirthWeight~year,
        data=LifeExp,
        main="Boxplots for each year\nLow birth weight",
        xlab="Year",
        ylab="(% of total live births)",
        col="blue",
        border="black"
)

boxplot(AlcoholConsumption~year,
        data=LifeExp,
        main="Boxplots for each year\nAlcohol Consumption",
        xlab="Year",
        ylab="Litres per capita",
        col="blue",
        border="black"
)

boxplot(PhysicianPractising~year,
        data=LifeExp,
        main="Boxplots for each year\nPhysician Practising",
        xlab="Year",
        ylab="Number of persons",
        col="blue",
        border="black"
)

boxplot(LifeExpectancy~year,
        data=LifeExp,
        main="Boxplots for each year\nLife Expectancy",
        xlab="Year",
        ylab="Years",
        col="blue",
        border="black"
)

boxplot(HealthExpenditure~year,
        data=LifeExp,
        main="Boxplots for each year\nHealth Expenditure",
        xlab="Year",
        ylab="Share of gross domestic product",
        col="blue",
        border="black"
)

boxplot(ObesePopulation~year,
        data=LifeExp,
        main="Boxplots for each year\nObese Population",
        xlab="Year",
        ylab="Percentage of total population",
        col="blue",
        border="black"
)

boxplot(TotalEmployment~year,
        data=LifeExp,
        main="Boxplots for each year\nTotal Employment",
        xlab="Year",
        ylab="% of total population",
        col="blue",
        border="black"
)

boxplot(FruitConsumption~year,
        data=LifeExp,
        main="Boxplots for each year\nFruit Consumption",
        xlab="Year",
        ylab="% of population aged 15 years old and over",
        col="blue",
        border="black"
)

boxplot(GDP~year,
        data=LifeExp,
        main="Boxplots for each year\nGross domestic product (GDP)",
        xlab="Year",
        ylab="US$ purchasing power parity",
        col="blue",
        border="black"
)

boxplot(TobaccoConsumption~year,
        data=LifeExp,
        main="Boxplots for each year\nTobacco Consumption",
        xlab="Year",
        ylab="% of population of daily smokers (15+ years)",
        col="blue",
        border="black"
)

boxplot(totDeaths~year,
        data=LifeExp,
        main="Boxplots for each year\nTotal Deaths",
        xlab="Year",
        ylab="persons",
        col="blue",
        border="black"
)

boxplot(TotalHealthcare~year,
        data=LifeExp,
        main="Boxplots for each year\nTotal Health Care",
        xlab="Year",
        ylab="% of total population covered",
        col="blue",
        border="black"
)


le<-select(LifeExp, GDP, LifeExpectancy)

l<-select(LifeExp, LifeExpectancy)

view(le)

plot(le, main="Scatterplot Example",

xlab="GDP", ylab="Life Expectancy ", pch=19)

# plot GDP & Life Expectancy

ggplot(regress_Oecd, aes(GDP, LifeExpectancy)) +

geom_point()

#Generate 10% missing values at Random

Oecd.mis <- prodNA(Oecd, noNA = 0.1)

LifeExp.mis <- prodNA(LifeExp, noNA = 0.1)

# visualising the missing values

Oecd.mis <- subset(Oecd.mis, select = -c(year))

Oecd.mis <- subset(Oecd.mis, select = -c(LifeExpectancy))

Oecd.mis <- subset(Oecd.mis, select = -c(Country))

Oecd.mis <- subset(Oecd.mis, select = -c(Oecd))

summary(Oecd.mis)

sort(sapply(Oecd.mis,function(x) sum(is.na(x))),decreasing = T)

md.pattern(Oecd.mis)

md.pattern(LifeExp.mis)

LifeExp.mis <- subset(LifeExp.mis, select = -c(Country))

LifeExp.mis <- subset(LifeExp.mis, select = -c(year))

LifeExp.mis <- subset(LifeExp.mis, select = -c(Oecd))

LifeExp.mis <- subset(LifeExp.mis, select = -c(LifeExpectancy))

# separating the life exp variable before imputation

LifeVar<-subset(LifeExp,select = c(LifeExpectancy))

Var<-subset(LifeExp,select = c(LifeExpectancy,Oecd))

md.pattern(LifeExp.mis)


install.packages("VIM")

library(VIM)

mice_plot <- aggr(Oecd.mis, col=c('navyblue','yellow'),

numbers=TRUE, sortVars=TRUE,

labels=names(Oecd.mis), cex.axis=.7,

gap=3, ylab=c("Missing data","Pattern"))

# Imputing the data

imputed_Oecd <- mice(Oecd.mis, exclude="LifeExpectancy", m=5, maxit = 50, method = 'cart')

imputed_LifeExp <- mice(LifeExp.mis, m=5, maxit = 50, method = 'cart')

# inspecting the imputed values (both mice objects must exist first)

imputed_Oecd$imp$Oecd

imputed_Oecd$imp$LifeExpectancy

imputed_LifeExp$imp$LifeExpectancy

# summary of imputed data

summary(imputed_Oecd)

summary(imputed_LifeExp)

imputed_LifeExp$imp$AIDS

#preparing the final dataset - all countries

complete_data=mice::complete(imputed_LifeExp,2)

complete_LifeExp<- merge(complete_data,LifeVar,by="row.names")

complete_LifeExp <- subset(complete_LifeExp, select = -c(Row.names))

#preparing the final dataset - Oecd

complete_Oecd=mice::complete(imputed_Oecd,2)

complete_Oecd<- merge(complete_Oecd,LifeVar,by="row.names")

complete_Oecd <- subset(complete_Oecd, select = -c(Row.names))

complete_Oecd <- subset(complete_Oecd, select = -c(Cancer))

complete_Explo<- merge(complete_data,Var,by="row.names")

md.pattern(complete_Oecd)

#Comparing the imputed with original data

densityplot(imputed_LifeExp)

xyplot(imputed_Data)


# Correlation matrix

ggcorr(complete_Oecd,

label = T,

label_size = 2,

label_round = 2,

hjust = 1,

size = 3,

color = "royalblue",

layout.exp = 5,

low = "green3",

mid = "gray95",

high = "darkorange",

name = "Correlation")

## METHOD - Multiple Regression

install.packages("faraway")

install.packages("glmnet")

library(faraway)

library(glmnet)

library(car)

# excluding multicollinearity

regress_Oecd <- subset(complete_Oecd, select = -c(PhysicianPractising))

regress_Oecd <- subset(regress_Oecd, select = -c(totDeaths))

regress_Oecd <- subset(regress_Oecd, select = -c(TotalEmployment))

install.packages("caret")

library("caret")

## Train model: 70% training / 30% validation split of the data set

set.seed(123)

train <- regress_Oecd$LifeExpectancy %>%

createDataPartition(p = 0.7, list = FALSE)

length(train)


test <- regress_Oecd[-train, ]

# number of rows in the validation set (length() on a data frame counts columns)

nrow(test)

summary(regress_Oecd)

# model 1

lm_mod1=lm(LifeExpectancy~.,data=regress_Oecd,subset = train)

summary(lm_mod1)

round(vif(lm_mod1), 2)

# backward stepwise selection

step_mod=step(lm_mod1, direction='backward',trace=F)

summary(step_mod)

round(vif(step_mod), 2)

# plot model 1

plot(lm_mod1,which=c(1,2))

# plot model 2 stepwise

plot(step_mod,which=c(1,2))

# Prediction & Evaluation

# Predict stepwise model

step_pred<- step_mod %>% predict(test)

# Model performance stepwise model

data.frame(

RMSE = RMSE(step_pred, test$LifeExpectancy),

R2 = R2(step_pred, test$LifeExpectancy)

)

# Predict model1

mod1_pred<- lm_mod1 %>% predict(test)

# Model performance model1

data.frame(

RMSE = RMSE(mod1_pred, test$LifeExpectancy),

R2 = R2(mod1_pred, test$LifeExpectancy)

)
