ETW2001 2022 S2 Assessment 2

Individual Assignment (15%)

The main task of the assignment is for students to perform analysis using fundamental R programming skills. This assignment requires coding skills from week 1 to week 6.

Deadline: Week 7, 9 September 2022 (Friday) by 11:55 PM in MYT

Submission: A report in pdf format and an R-script that records all code for data preparation and
output.

Note

• When you submit in Moodle, it might show an error saying, “You must upload a supported
file type for this assignment. Accepted file types are; …”. You can ignore this message as long
as you have uploaded the R-script.
• Do not take screenshots of your output. Export plots properly. For tables, you can use the
table function in Microsoft Word. There are packages for creating tables in RStudio, but they
take considerable extra work. You can try them if you want.
• Since this assignment does not require much writing for the report, you will spend most of your
time searching for and writing code. I encourage you to discuss Task 1 via the Discord channel.
• This assignment covers the Unit’s Learning outcomes 1 and 2.
• For 1.7 and 1.8, you might need to refer to Week 6 lecture material.
• You do not need to copy and paste the code into the report since it is recorded in the R-script.
Your report should include plots and analysis.

Context

Population data is valuable as it affects several aspects of the sustainable growth of countries. Analysis
and estimation of population allow us to manage global challenges such as poverty, starvation,
distribution of commodities, energy, environment, and our well-being. This can be useful for
multinational firms in that observing demographic changes provides an insight into future market size.

For this assignment, you will begin by tidying up the data and then analyse multiple
variables related to the population. The data covers the years 2001 to 2050, where future values
are estimates by the World Bank.

Data source: https://databank.worldbank.org/source/population-estimates-and-projections


Task 1 – Data preparation (33 marks)

The answer for this task should be written in the R-script.

1.1 Load the data sheet from the Excel file (Assessment 2 data) into RStudio and assign it to a
data frame called “df”. Load the packages that you use in the assignment. Ensure the
loaded sheet is “data” and set the first row as the column names. (2 marks)
1.2 Using the dplyr functions, delete the “Series code” column from “df”. Save the output as “df1”
(2m)
1.3 Using the dplyr functions, convert the “df1” columns from 2001 to 2050 to numeric. Save
the output as “df2” (dim = 49747, 53). You might see a message saying, “There were 50 or more
warnings”. Do not worry about it. (2m)
1.4 Using the dplyr functions, round the “df2” columns from 2001 to 2050 to 2 decimal
places. Save the output as “df3”. (2m)
1.5 Change the format of the column names as below:
Convert the “df3” column names from “2001 [YR2001]” to “2001”. Apply this to the rest of the
columns up to 2050. Make sure you do not repeat the code 50 times. You need two pieces of code:
one to build the new set of column names and one to apply them as the column names. (5m)
1.6 Drop all NA values from “df3”. Save this output as “df4”. The dimension of df4 should be 33491
by 53. (2m)
1.7 There are too many columns, one for each year. Put all year column names from “df4” under the “Year”
column and put their values under the “Values” column. Also, convert “Year” to numeric values.
Save this output as “df5”. The dimension of “df5” is 1674550 by 5. (4m)
1.8 There are too many rows for each country. Put each “Series name” from “df5” into a separate column
named after it. Save this output as “df6”; its dimension is 12950 by 156. (4m)
1.9 Since there are 156 columns, it would be difficult to identify variable names whenever you want
to perform further analysis. Create a data frame called “index” to show the list of variable names
from “df6”. (2m)
1.10 Using the index number, create “df7”, including the following: (2m)
• Country Name
• Country Code
• Year
• Age dependency ratio (% of working-age population)
• Population ages 65 and above, female (% of female population)
• Population ages 65 and above, male (% of male population)
• Population, female (% of total population)
• Population, total
• Rural population (% of total population)
• Urban population (% of total population)

1.11 Rename the above variables from “df7” as below: (2m)


Original variable name                                           New name
Country Name                                                     Use as it is
Country Code                                                     Use as it is
Year                                                             Use as it is
Age dependency ratio (% of working-age population)               dep_ratio
Population ages 65 and above, female (% of female population)    pop65f
Population ages 65 and above, male (% of male population)        pop65m
Population, female (% of total population)                       popf
Population, total                                                pop
Rural population (% of total population)                         pop_rural
Urban population (% of total population)                         pop_urban

1.12 Using the dplyr functions, calculate population growth (popgr). You may refer to the formula
below:
$$popgr = \left( \frac{pop_{t} - pop_{t-1}}{pop_{t-1}} \right) \times 100$$
Create “df8” that includes popgr and all the variables from “df7”. It is important to group the data
by country, then to calculate popgr. (4m)
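
As a rough illustration of the Task 1 workflow, the sketch below runs the same kinds of steps on a tiny made-up table. Everything in `toy` is invented for illustration (the countries, the values, the series codes, and the use of ".." to mark missing entries are all assumptions), so the dimensions will not match the real data. It is one possible way to chain dplyr and tidyr calls, not a model answer.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for the "data" sheet; all values are invented.
# ".." is assumed to mark a missing value, which becomes NA in step 1.3.
toy <- tibble(
  `Country Name`  = c("A", "A", "B", "B"),
  `Country Code`  = c("AAA", "AAA", "BBB", "BBB"),
  `Series name`   = rep(c("Population, total",
                          "Rural population (% of total population)"), 2),
  `Series code`   = rep(c("S1", "S2"), 2),
  `2001 [YR2001]` = c("100", "40.123", "200", ".."),
  `2002 [YR2002]` = c("110", "44.567", "210", "..")
)

# 1.2 drop a column; 1.3 convert the year columns to numeric
# (warnings about NAs are silenced here only to keep the output tidy)
df1 <- toy %>% select(-`Series code`)
df2 <- df1 %>%
  mutate(across(matches("^[0-9]{4} "), ~ suppressWarnings(as.numeric(.x))))

# 1.4 round to 2 decimal places
df3 <- df2 %>% mutate(across(where(is.numeric), ~ round(.x, 2)))

# 1.5 build the new names once, then apply them once
new_names <- sub(" \\[YR[0-9]{4}\\]$", "", names(df3))
names(df3) <- new_names

# 1.6 drop rows containing NA
df4 <- df3 %>% drop_na()

# 1.7 wide to long: one row per country, series and year
df5 <- df4 %>%
  pivot_longer(cols = matches("^[0-9]{4}$"),
               names_to = "Year", values_to = "Values") %>%
  mutate(Year = as.numeric(Year))

# 1.8 long to wide: one column per series
df6 <- df5 %>% pivot_wider(names_from = `Series name`, values_from = Values)

# 1.9 lookup table of column positions and names
index <- data.frame(position = seq_along(names(df6)), variable = names(df6))

# 1.10 and 1.11 select by position, then rename
df7 <- df6 %>%
  select(1:5) %>%
  rename(pop = `Population, total`,
         pop_rural = `Rural population (% of total population)`)

# 1.12 growth rate within each country, using the previous year's value
df8 <- df7 %>%
  group_by(`Country Name`) %>%
  arrange(Year, .by_group = TRUE) %>%
  mutate(popgr = (pop - lag(pop)) / lag(pop) * 100)
```

Grouping before calling `lag()` matters: without it, the first year of country B would be compared with the last year of country A. In the toy output, the first year of each country has `popgr` equal to `NA` because there is no earlier year to compare with.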

Task 2 – Population around the world (43 marks)

You will be using “df8” for Task 2 and Task 3.

2.1 Create a plot that shows changes in population growth from 2002 to 2022. Your plot must include
the following aspects: (10m)
• Use data for 5 income categories (Low, Lower middle, Middle, Upper middle and High income)
which can be found in `Country Name`
• Assign 5 different colours for these 5 categories.
• The line width is 1.2.
• Display the y-axis from 0 to 3 and the breaks by 0.5.
• Display the x-axis from 2002 to 2022 and the breaks by 2.
• Add appropriate title, subtitle and axis labels.
• Locate the legend at the bottom of the plot. The legend items should fit nicely.
• You may add an extra theme to make your plot more aesthetic (extra marks not granted)
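
The requirements above map fairly directly onto ggplot2 layers. The sketch below shows one way to wire them together, assuming synthetic growth rates generated with `runif()` (the numbers, colours and label text are placeholders, not real World Bank values). Note that `linewidth` is the argument name in ggplot2 3.4.0 and later; older versions use `size` for line width.

```r
library(ggplot2)

# Synthetic data: 5 groups x 21 years of invented growth rates
set.seed(1)
groups <- c("Low income", "Lower middle income", "Middle income",
            "Upper middle income", "High income")
plot_data <- expand.grid(Year = 2002:2022, group = groups,
                         stringsAsFactors = FALSE)
plot_data$popgr <- runif(nrow(plot_data), min = 0.2, max = 2.8)

p <- ggplot(plot_data, aes(x = Year, y = popgr, colour = group)) +
  geom_line(linewidth = 1.2) +
  # axis ranges and break spacing
  scale_y_continuous(limits = c(0, 3), breaks = seq(0, 3, by = 0.5)) +
  scale_x_continuous(limits = c(2002, 2022), breaks = seq(2002, 2022, by = 2)) +
  # one colour per category (colour choice is arbitrary)
  scale_colour_manual(values = c("Low income" = "firebrick",
                                 "Lower middle income" = "darkorange",
                                 "Middle income" = "forestgreen",
                                 "Upper middle income" = "steelblue",
                                 "High income" = "purple")) +
  labs(title = "Placeholder title", subtitle = "Placeholder subtitle",
       x = "Year", y = "Population growth (%)", colour = NULL) +
  theme_minimal() +
  # legend at the bottom, wrapped onto two rows so the items fit
  theme(legend.position = "bottom") +
  guides(colour = guide_legend(nrow = 2))
```

Wrapping the legend with `guide_legend(nrow = 2)` is only one option for making five long labels fit; shrinking the legend text would be another.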

2.2 Analyse the plot you created. Based on plot 2.1, forecast (without any calculation) how the
population growth will be in the future. Your analysis should not exceed 1 page. (10m)

2.3 Replicate the plot you created in 2.1 but covering a time span from 2022 to 2050. Make the
necessary adjustments to the title and scale (the x-axis breaks by 5). (3m)

2.4 Does the plot from 2.3 match your forecast from 2.2? Discuss any observed differences to your
forecast. Provide possible economic reasons for any mismatch. Your analysis should not exceed
half a page. (3m)

2.5 Create a plot that shows changes in dependency ratio from 2002 to 2022. Your plot must include
the following aspects: (3m)
• Use data for 5 income categories (Low, Lower middle, Middle, Upper middle and High income).
• Assign 5 different colours for these 5 categories.
• The line width is 1.2.
• Display the y-axis from 40 to 100 and the breaks by 10.
• Display the x-axis from 2002 to 2022 and the breaks by 2.
• Add appropriate title, subtitle and axis labels.
• Locate the legend at the bottom of the plot. The legend items should fit nicely.
• You may add an extra theme to make your plot more aesthetic (extra marks not granted)

2.6 Analyse the plot you created. Based on plot 2.5, forecast (without any calculation) how the
dependency ratio will be in the future. Your analysis should not exceed 1 page. (6m)
2.7 Replicate the plot you created in 2.5 but covering a time span from 2022 to 2050. Make the
necessary adjustments to the title and scale (the x-axis breaks by 5). (3m)
2.8 Does the plot from 2.7 match your forecast from 2.6? Discuss any observed differences to your
forecast. Discuss why the predicted data by the World Bank shows such patterns for each income
category. Your analysis should not exceed 1 page. (5m)

Task 3 – Mini research (24 marks)


Obtain a random country name by using your student ID as follows:

set.seed(12345678)  # 12345678 is a placeholder; use the digits of your own student ID

sample(df8$`Country Name`, size = 1)

Using the code above, you will obtain a random country’s name from the ‘Country Name’ variable. It
is okay if your output shows a region or group of countries.
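
To see why the seed matters, here is a minimal sketch using a made-up vector in place of `Country Name` (the names and the seed value are placeholders). Setting the same seed before each call makes `sample()` return the same draw, so the marker can reproduce your country from your ID.

```r
# Hypothetical stand-in for df8$`Country Name`; it mixes countries and groups
country_names <- c("Malaysia", "Japan", "Brazil", "World", "High income")

set.seed(12345678)  # placeholder seed, not a real student ID
first_draw <- sample(country_names, size = 1)

set.seed(12345678)  # same seed again
second_draw <- sample(country_names, size = 1)

identical(first_draw, second_draw)  # TRUE: same seed, same draw
```

Because the vector holds regions and income groups alongside countries, a draw can land on any of them.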

3.1 Create one figure that contains 4 line graphs. Include the following details (5m):
• A country or region that is selected by random sampling.
• 4 plots for Population, Population growth rate, Dependency Ratio, and Urban population.
• Since the population scale is too large relative to the other variables, express the population
in units of millions. You are allowed to use billions if you drew a region instead of a country.
• Label each y-axis.
• The x-axis should show years less than or equal to 2020.
• The main title should be “Population in ‘country’s name’”.

Before you filter the selected country, you should ungroup “df8” first.
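
One possible layout for the four graphs is to reshape the selected country to long form and facet by variable, as sketched below. The country "Examplia", all of its values, and the column name `urban_pct` are invented for illustration. Using the facet strips as y-axis labels is a design choice, not a requirement; combining four separate plots would work equally well.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Invented data for a hypothetical country, grouped the way df8 is
toy8 <- tibble(
  `Country Name` = "Examplia",
  Year      = 2001:2022,
  pop       = seq(30e6, 40.5e6, length.out = 22),
  dep_ratio = seq(60, 50, length.out = 22),
  urban_pct = seq(55, 70, length.out = 22)
) %>%
  group_by(`Country Name`) %>%
  mutate(popgr = (pop - lag(pop)) / lag(pop) * 100)

# Text shown beside each panel
panel_labels <- c(pop_million = "Population (millions)",
                  popgr       = "Population growth (%)",
                  dep_ratio   = "Dependency ratio (%)",
                  urban_pct   = "Urban population (%)")

panel_data <- toy8 %>%
  ungroup() %>%                                  # drop grouping before filtering
  filter(`Country Name` == "Examplia", Year <= 2020) %>%
  mutate(pop_million = pop / 1e6) %>%            # rescale population to millions
  select(Year, pop_million, popgr, dep_ratio, urban_pct) %>%
  pivot_longer(-Year, names_to = "variable", values_to = "value") %>%
  mutate(variable = unname(panel_labels[variable]))

p4 <- ggplot(panel_data, aes(x = Year, y = value)) +
  geom_line(na.rm = TRUE) +                      # first-year popgr is NA
  # free y scales; strips moved to the left so they act as y-axis labels
  facet_wrap(~ variable, scales = "free_y", strip.position = "left") +
  labs(title = "Population in Examplia", x = "Year", y = NULL) +
  theme(strip.placement = "outside", strip.background = element_blank())
```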

3.2 Analyse the plot you created in 3.1. If you find any unusual changes, search for possible reasons
with proper citation. Do not write more than a page. (8m)

3.3 Repeat 3.1, this time including both the randomly selected country/region and “Malaysia”. In each
graph, one line should represent the randomly selected country/region and the other Malaysia. Keep all
other details the same, with a proper title and legend. (3m)

3.4 Considering the economic development level, compare and analyse each graph. A good analysis
should include insight with sound reasoning. Do not write more than a page. (8m)
