
EC1011 Data Analysis 2 – Academic Year 2022/23

Portfolio Assessment

Introduction
The first coursework assignment for this module is a group portfolio of short tasks worth a cumulative 50% of the
overall module mark. You are asked to complete six tasks for the portfolio. All six tasks are based on real-world
data and require you to produce statistical analysis based on the concepts and theories we are learning throughout
the term. We have now covered enough material in the module for you to be able to start working on
the portfolio straight away. There are still some topics that we will cover in the second part of the term and that
you will need in order to complete the portfolio. However, I suggest that you get started with the portfolio and
work through the tasks over the coming weeks. For example, you could already complete Tasks 1 and 2 (and
possibly 3) and then the remaining three over the following weeks. Do not leave the work until the last moment. Your
overall performance is likely to benefit greatly from working regularly throughout the term. The portfolio should
be submitted by Thursday 6th April 2023, 4pm (London time).

Group or Individual Work


The portfolio assessment should be completed as group work. If you would like to work on your own, you should
be aware of the challenge of completing the tasks alone and you should contact Guglielmo Volpe
(Guglielmo.volpe@city.ac.uk) to explain and discuss your choice. I would prefer you to work in a group
because sharing ideas, thoughts, views and opinions helps to make your work richer and more creative. If you set up
a group, it should have no more than five members, and my expectation is that you will work together on
the six tasks by actively collaborating, sharing the work and consulting each other on a regular basis. Each group
should appoint a ‘group leader’ who will keep in regular contact with me and the class teachers and will inform us
of the group’s progress. Any concern about the group’s ability to work together should be communicated as soon
as possible to me or the class teachers. Whether you decide to work in a group or on your own, please record
your group members’ names, or just your individual name and ID, on this online spreadsheet by 30th March 2023 (the link is
also available on Moodle):

EC1011 Portfolio Groups.xlsx

Introduction to the Portfolio Tasks


The dataset is based on the data from the Understanding Society survey, which annually interviews a
representative sample of individuals living in the UK about different aspects of their lives. The dataset is available
on Moodle and is named “EC1011 Portfolio Assessment Dataset.xlsx”.

A Brief Introduction to the Understanding Society survey


Researchers follow a large sample of UK households over a long period of time, giving a long-term perspective on
people’s lives. As a longitudinal study, Understanding Society helps explore how life in the UK is changing and
what stays the same over many years. The research is a household panel study, interviewing everyone in a
household to see how different generations experience life in the UK. The Study helps to find out about parents
and children, siblings, new family formation and the wider family and community links. You can find more details
about the research from the Understanding Society website.

The Dataset
The dataset is a cross-sectional dataset that includes information on the respondents over 18 years old. In total
there are 28,981 responses (records). Below you can find the list of the variables included in the dataset and their
values:

EC1011 Data Analysis 2 page 1


Variable name  Variable description                  Variable values
pidp           Person identifier                     Synthetic identifier
age            Age                                   Numerical variable
sat            Life satisfaction                     Categorical variable:
                                                       1 Completely dissatisfied
                                                       2 Mostly dissatisfied
                                                       3 Somewhat dissatisfied
                                                       4 Neither satisfied nor dissatisfied
                                                       5 Somewhat satisfied
                                                       6 Mostly satisfied
                                                       7 Completely satisfied
educ           Education level                       Categorical variable:
                                                       1 Degree
                                                       2 Other higher degree
                                                       3 A-level etc
                                                       4 GCSE etc
                                                       5 Other qualification
                                                       9 No qualification
income         Equivalence scale adjusted            Numerical variable: total household net income (no
               household income                      deductions) divided by the square root of household size
ln_income      Logarithm of equivalence scale        Numerical variable calculated as log(income)
               adjusted household income
age2           Age-squared                           Numerical variable calculated as age^2
mhealth        Ever diagnosed with mental health     Dummy variable: 1 Yes, 0 No
               condition
hcond          Number of health conditions ever      Numerical variable: sum of reported health conditions
               diagnosed                             ever diagnosed
unempl         Unemployed                            Dummy variable: 1 Yes, 0 No
partnered      Partnered                             Dummy variable: is respondent in partnership (de facto
                                                     marital status is married, in a registered same-sex
                                                     civil partnership or living as couple): 1 Yes, 0 No

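
For illustration only, the derived variables in the list above (income, ln_income and age2) could be reproduced along the following lines. This is a minimal Python sketch, not part of the dataset or of the required Excel work: the function name and the example household are hypothetical, and a natural logarithm is assumed because the table only says log(income).

```python
import math


def equivalised_income(net_household_income: float, household_size: int) -> float:
    """Square-root equivalence scale: net income divided by sqrt(household size)."""
    if household_size < 1:
        raise ValueError("household_size must be at least 1")
    return net_household_income / math.sqrt(household_size)


# Hypothetical household (not from the survey): net income 3000, four members.
income = equivalised_income(3000.0, 4)  # 3000 / sqrt(4)
ln_income = math.log(income)            # natural log is an assumption
age = 40
age2 = age ** 2                         # age-squared
```

Dividing by the square root of household size means that a four-person household needs twice, not four times, the income of a single person to be treated as equally well off.
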
Coursework Brief
The portfolio is made up of six tasks. Please complete each task in a separate worksheet in the same file that contains
the dataset. Label each worksheet with the number of the task, i.e. Task 1, Task 2 etc. It is important that for each
question there is clear evidence of how the answer has been calculated, i.e. there should be formulae, functions,
explanations etc. that show how the solution has been computed. The comments should be typed in ‘text boxes’
inside the Excel document. Formulae/functions/graphs/tables do not count towards the word limit.

Feedback
We will be happy to provide you with some feedback on one of the portfolio tasks during the term, provided that
your request reaches us no later than 31st March.

Submission Deadline
You are expected to submit one Excel file that contains seven worksheets: one worksheet with the actual data and
one for each of the six tasks. Each worksheet should contain details of the solutions and comments wherever
needed.

The portfolio coursework has one single submission point which is Thursday 6th April 2023, 4pm (London time).

The coursework is group work and the submission should be made by only one of the group members. There is
no need for all group members to submit the Excel file.

Task 1

This task requires you to deal with density distributions and to use the density histogram to calculate probabilities.
Complete this task in a worksheet labelled ‘Task 1’ and clearly show your solutions and, where required, add your
comments. The comments should not exceed the 300-word limit. Task 1 accounts for 15 marks.

1) What type of random variable is the variable income? Briefly explain.
2) Use the Excel functions (e.g. =average()) to compute the mean, median, mode, minimum, maximum, standard
deviation, skewness, first quintile and fourth quintile of the variable income. Comment on the statistics.
3) Produce a density histogram of the variable income. Comment on your graph with reference to your
discussion in 1) above.
4) Use the density histogram to compute the following:
a. The probability that a randomly chosen household earns an income smaller than or equal to £800;
b. The probability that a randomly chosen household earns an income between £1,000 and £2,000.
5) Suppose that income is normally distributed. Use income statistics and the appropriate Excel functions to plot
the approximate normal distribution of income.
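
The idea behind reading probabilities off a density histogram, as in 4) above, can be sketched as follows. This is a minimal Python illustration under stated assumptions: the incomes, the bin edges and both helper functions are hypothetical, and the coursework itself should be completed in Excel. Each bar has height equal to relative frequency divided by bin width, so a probability is the area of the bars (or parts of bars) over the interval of interest.

```python
def density_histogram(values, edges):
    """Bin densities: relative frequency divided by bin width."""
    n = len(values)
    densities = []
    for i, (lo, hi) in enumerate(zip(edges[:-1], edges[1:])):
        last = i == len(edges) - 2  # the last bin also includes its right edge
        count = sum(1 for v in values if lo <= v < hi or (last and v == hi))
        densities.append(count / n / (hi - lo))
    return densities


def prob_between(edges, densities, a, b):
    """Area under the density histogram between a and b."""
    total = 0.0
    for lo, hi, d in zip(edges[:-1], edges[1:], densities):
        overlap = max(0.0, min(b, hi) - max(a, lo))  # width of the bar inside [a, b]
        total += d * overlap
    return total


# Hypothetical incomes and bin edges, chosen only to keep the arithmetic simple.
incomes = [500, 700, 900, 1100, 1300, 1500, 1700, 1900, 2500, 3500]
edges = [0, 1000, 2000, 3000, 4000]

dens = density_histogram(incomes, edges)
p_le_800 = prob_between(edges, dens, 0, 800)          # P(income <= 800)
p_1000_2000 = prob_between(edges, dens, 1000, 2000)   # P(1000 <= income <= 2000)
```

Note that when an interval cuts through a bar (as with 800 here), the calculation assumes incomes are spread evenly within that bin, so the answer depends on the bin width you choose.
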
Task 2

Is there a relationship between ‘life satisfaction’ (sat) and ‘employment status’ (unempl)? You will investigate this
issue by creating a joint distribution and computing relevant statistics such as the covariance and the correlation
coefficient. Complete this task in a worksheet labelled ‘Task 2’ and clearly show your solutions and, where required,
add your comments. The comments should not exceed the 300-word limit. Task 2 accounts for 25 marks.

Use the ‘life satisfaction’ (sat) and the ‘employment status’ (unempl) variables to answer the following questions:

1) What type of variables are ‘life satisfaction’ (sat) and ‘employment status’ (unempl)? What relationship do
you expect between the two variables? Briefly explain.
2) Produce a joint probability distribution of ‘life satisfaction’ and ‘employment status’. Briefly comment on the
table.
3) Compute the expected values of ‘life satisfaction’ (sat) and ‘employment status’ (unempl). Briefly comment
on your findings.
4) What is the probability that a randomly selected individual reports a ‘life satisfaction’ of 5?
5) What is the probability that a randomly selected individual is unemployed?
6) What is the probability that a randomly selected individual shows a ‘life satisfaction’ of 3 or is employed?
7) What is the probability that a randomly selected individual shows a ‘life satisfaction’ of 6 given that they are
unemployed?
8) Are ‘life satisfaction’ (sat) and ‘employment status’ (unempl) independent events? Explain.
9) Is there a relationship between ‘life satisfaction’ (sat) and ‘employment status’ (unempl)? Use the Excel
functions to compute the covariance and correlation coefficients between ‘life satisfaction’ (sat) and
‘employment status’ (unempl). Briefly comment on your findings.
10) Use the joint probability distribution from 2) to compute the covariance and the correlation coefficient.
Compare your results with those obtained in 9) and briefly comment.
11) The government is concerned both about the ‘life satisfaction’ of citizens and their ‘employment status’
according to the following welfare function:

$\text{Welfare} = 2 \times \text{Life satisfaction} - 1.5 \times \text{Unemployment status}$
Compute the 1) expected value and 2) standard deviation of welfare. Briefly comment on your findings.
12) Define the proportion of households who reported a ‘life satisfaction’ of 6 or more as the probability that a
household is ‘satisfied with their life’. Suppose that a researcher is about to interview 30 households about
their level of satisfaction.
a. Use the appropriate Excel function to produce the probability distribution of the random variable
“Household is satisfied with their life”.
b. Plot the probability distribution graph. Briefly comment on the graph.
c. Compute the expected value and the variance of the distribution.
d. What is the probability that more than 10 individuals will declare themselves “satisfied with their life”?
e. What is the probability that between 12 and 20 individuals will declare themselves “satisfied with their life”?
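
The calculations in this task can be sketched in Python as below. The (sat, unempl) pairs are made up for illustration and are not taken from the survey, and the names are hypothetical; the sketch shows one way to build a joint distribution from observed pairs, derive the covariance and correlation from it, apply the rules for a linear combination such as the welfare function in 11), and evaluate a binomial probability as in 12).

```python
import math
from collections import Counter

# Hypothetical (sat, unempl) observations, not taken from the survey.
pairs = [(6, 0), (6, 0), (5, 0), (7, 0), (3, 1),
         (4, 1), (6, 0), (2, 1), (5, 0), (6, 1)]
n = len(pairs)

# Joint probability distribution P(S = s, U = u) as relative frequencies.
joint = {k: c / n for k, c in Counter(pairs).items()}

e_s = sum(s * p for (s, u), p in joint.items())
e_u = sum(u * p for (s, u), p in joint.items())
var_s = sum((s - e_s) ** 2 * p for (s, u), p in joint.items())
var_u = sum((u - e_u) ** 2 * p for (s, u), p in joint.items())
cov = sum((s - e_s) * (u - e_u) * p for (s, u), p in joint.items())
corr = cov / math.sqrt(var_s * var_u)

# Linear combination W = a*S + b*U with a = 2 and b = -1.5:
# E[W] = a*E[S] + b*E[U],  Var(W) = a^2 Var(S) + b^2 Var(U) + 2ab Cov(S, U).
a, b = 2.0, -1.5
e_w = a * e_s + b * e_u
var_w = a ** 2 * var_s + b ** 2 * var_u + 2 * a * b * cov
sd_w = math.sqrt(var_w)


def binom_pmf(k, n_trials, p):
    """P(X = k) for a binomial random variable with n_trials trials."""
    return math.comb(n_trials, k) * p ** k * (1 - p) ** (n_trials - k)


# "Satisfied" is defined as sat >= 6; 30 interviews are planned.
p_sat = sum(p for (s, u), p in joint.items() if s >= 6)
p_more_than_10 = sum(binom_pmf(k, 30, p_sat) for k in range(11, 31))
```

A covariance computed from the joint distribution divides by n, so it corresponds to a population covariance (COVARIANCE.P in Excel) rather than a sample one (COVARIANCE.S). The two differ by a factor of n/(n-1), while the correlation coefficient is unaffected.
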

Task 3

This task requires you to engage with the insights of the Central Limit Theorem. Complete this task in a worksheet
labelled ‘Task 3’ and clearly show your solutions and, where required, add your comments. The comments should
not exceed the 200-word limit. Task 3 accounts for 10 marks.

In this task we focus on the variable age.

1) Draw thirty random samples of size 20 from the age variable and, for each sample, compute the sample
mean. Compute the mean of the sample means and produce a histogram of the distribution of the sample
means. Briefly comment on your findings.

2) Now, draw thirty random samples of size 50 from the age variable and, for each sample, compute the sample
mean. Compute the mean of the sample means and produce a histogram of the distribution of the sample
means. Compare the mean of the sample means and the distribution of the sample means with those you
computed in 1). Explain the similarity or differences.

3) Draw a random sample of 40 observations from the age variable and:

a. Compute the sample mean.

b. Use the information from the random sample to compute the probability that the sample mean is smaller
than 45. Briefly comment on your findings.
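
The sampling experiment described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the "population" of ages is randomly generated rather than taken from the dataset, the helper name is hypothetical, and the last step reflects one possible reading of 3b, in which the sample mean and sample standard deviation stand in for the unknown population values.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical "population" of ages standing in for the age variable.
population = [random.randint(18, 90) for _ in range(5000)]


def sample_means(pop, n_samples, size):
    """Draw n_samples random samples of the given size and return their means."""
    return [statistics.mean(random.sample(pop, size)) for _ in range(n_samples)]


means_20 = sample_means(population, 30, 20)
means_50 = sample_means(population, 30, 50)

pop_mean = statistics.mean(population)
pop_sd = statistics.pstdev(population)

# Compare the mean and spread of the sample means with sigma / sqrt(n).
for size, means in ((20, means_20), (50, means_50)):
    print(size,
          round(statistics.mean(means), 2),
          round(statistics.stdev(means), 2),
          round(pop_sd / size ** 0.5, 2))

# One reading of 3b: treat the sample mean as approximately normal with
# standard error s / sqrt(n), where s is the sample standard deviation.
sample = random.sample(population, 40)
x_bar = statistics.mean(sample)
se = statistics.stdev(sample) / 40 ** 0.5
p_below_45 = statistics.NormalDist(mu=x_bar, sigma=se).cdf(45)
```

Both sets of sample means should centre near the population mean, while the theoretical standard error sigma / sqrt(n) is smaller for samples of size 50 than for samples of size 20.
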

Task 4

This task asks you to engage with point estimation and, in particular, to compute and compare two point estimators.
Complete this task in a worksheet labelled ‘Task 4’ and clearly show your solutions and, where required, add your
comments, which should not exceed the 200-word limit. Task 4 accounts for 15 marks.

For this task you should focus on the variable ‘life satisfaction’ (sat). Suppose that the population average life
satisfaction is equal to 5 but this average is not known to the researchers.
You are hired to check which estimator might be best in estimating the population average level of life
satisfaction. To this end you are asked to complete the tasks below.

1. We are interested in comparing the properties of two estimators ($\bar{X}_1$ and $\bar{X}_2$) of the population mean. To this
end, complete the following tasks:

a. Draw 30 random samples of size $n = 100$ from the variable sat.

b. For each of the 30 samples drawn, calculate:

• The sample mean $\bar{X}_1 = \frac{\sum_{i=1}^{100} X_i}{100}$, where $X_i$ are all 100 observations of sat in the sample ($i = 1, 2, \dots, 100$);

• An alternative estimator $\bar{X}_2 = \frac{\sum_{i=1}^{20} X_i}{20}$, where $X_i$ are the first 20 observations of sat in the sample ($i = 1, 2, \dots, 20$).

c. Present your calculations in a table:

• The first column of this table should contain the 30 values of $\bar{X}_1$ you have obtained.

• The second column of this table should contain the 30 values of $\bar{X}_2$ you have obtained.

2. Compute the (approximate) bias of the estimator $\bar{X}_1$ and the (approximate) bias of the estimator $\bar{X}_2$. Is one of
the two estimators more biased than the other? Explain why or why not.

3. Compute the (approximate) variance of the estimator $\bar{X}_1$ and the (approximate) variance of the estimator $\bar{X}_2$.
Compute the relative efficiency of the two estimators and interpret it. Is one of the two estimators more
efficient than the other?

4. Overall, which estimator would you recommend the researchers use in order to estimate households’ average
life satisfaction?
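
The comparison of the two estimators can be sketched in Python as below. This is a minimal illustration under stated assumptions: the population of sat scores is hypothetical (built so that its mean is exactly 5, matching the value assumed in this task), and the approximate bias and variance are taken over the 30 simulated estimates.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is reproducible

TRUE_MEAN = 5  # population mean assumed in the task

# Hypothetical population of sat scores whose mean is exactly 5.
population = [3, 4, 5, 6, 7] * 1000

x1, x2 = [], []
for _ in range(30):
    sample = random.sample(population, 100)
    x1.append(sum(sample) / 100)       # estimator 1: all 100 observations
    x2.append(sum(sample[:20]) / 20)   # estimator 2: first 20 observations only

# Approximate bias: average estimate minus the true mean.
bias_1 = statistics.mean(x1) - TRUE_MEAN
bias_2 = statistics.mean(x2) - TRUE_MEAN

# Approximate variance of each estimator across the 30 samples.
var_1 = statistics.variance(x1)
var_2 = statistics.variance(x2)

# Relative efficiency as a ratio of variances; a value above 1 suggests
# that estimator 1 is the more efficient of the two.
relative_efficiency = var_2 / var_1
```

Both estimators are sample means, so both biases should be close to zero, while the variance of the estimator based on 20 observations is expected to be roughly five times that of the estimator based on 100.
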

Task 5

This task asks you to engage with inferential statistics for the population mean with a particular focus on hypotheses
concerning the level of ‘life satisfaction’ across individuals who have or have not been diagnosed with mental health
conditions. Complete this task in a worksheet labelled ‘Task 5’ and clearly show your solutions and, where required,
add your comments, which should not exceed the 300-word limit. Task 5 accounts for 20 marks.

The government is interested in understanding whether there are differences in the average level of ‘life
satisfaction’ (sat) among households who have or have not been diagnosed with mental health issues (mhealth).
To this end, you are asked to complete the following tasks.

1) Suggest two unbiased estimators that researchers can use to estimate the average life satisfaction among
households with and without diagnosed mental health conditions. Use the sample data and your suggested
estimators to compute the estimates and comment on your findings.
2) Produce two confidence intervals that have a high probability of capturing the true average ‘life satisfaction’
of a) households who have been diagnosed with mental health conditions and b) households who have not
been diagnosed with mental health conditions. Plot the confidence intervals and comment on your findings.
3) The government believes that individuals without mental health conditions show a greater level of life
satisfaction than individuals who instead experienced mental health issues. You are asked to test the
government hypotheses by completing the following tasks:
a. The government believes that the average level of life satisfaction among households without mental
health issues is 5.1. Mental Health experts disagree with the government hypothesis and they claim that
the actual average life satisfaction is different. Test this hypothesis and comment on your findings.
b. The government also believes that the average level of life satisfaction among households who have
experienced mental health conditions is 4.1. Mental Health experts, again, strongly disagree and they
believe that the true average level of life satisfaction is smaller than the one claimed by the government.
Test this hypothesis and comment on your findings.
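
A confidence interval and a hypothesis test for a population mean can be sketched in Python as follows. The two samples of sat scores are hypothetical, the function names are illustrative, and the sketch uses the large-sample normal approximation; with small samples a t distribution would normally be used instead.

```python
import math
import statistics


def mean_ci(sample, confidence=0.95):
    """Large-sample confidence interval for the population mean."""
    n = len(sample)
    x_bar = statistics.mean(sample)
    se = statistics.stdev(sample) / math.sqrt(n)
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    return x_bar - z * se, x_bar + z * se


def z_test_mean(sample, mu0, alternative="two-sided"):
    """z statistic and p-value for H0: mu = mu0."""
    n = len(sample)
    se = statistics.stdev(sample) / math.sqrt(n)
    z = (statistics.mean(sample) - mu0) / se
    cdf = statistics.NormalDist().cdf
    if alternative == "two-sided":
        p = 2 * (1 - cdf(abs(z)))
    elif alternative == "less":
        p = cdf(z)
    else:  # "greater"
        p = 1 - cdf(z)
    return z, p


# Hypothetical sat scores, not taken from the survey.
sat_no_mh = [5, 6, 6, 7, 5, 4, 6, 5, 6, 6] * 10   # no mental health diagnosis
sat_mh = [4, 3, 5, 4, 2, 4, 5, 3, 4, 4] * 10      # with a diagnosis

ci_no_mh = mean_ci(sat_no_mh)
ci_mh = mean_ci(sat_mh)

# 3a: H0: mu = 5.1 against H1: mu != 5.1 (two-sided).
z_a, p_a = z_test_mean(sat_no_mh, 5.1, "two-sided")
# 3b: H0: mu = 4.1 against H1: mu < 4.1 (one-sided, lower tail).
z_b, p_b = z_test_mean(sat_mh, 4.1, "less")
```

The choice between a two-sided and a one-sided alternative follows the wording of the experts' claim: "different" in 3a points to a two-sided test, "smaller" in 3b to a lower-tail test.
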

Task 6

This task asks you to engage with inferential statistics for the population proportion with a particular focus on
hypotheses concerning the proportion of individuals with a ‘degree’ or ‘other higher degree’ level of education
(educ). Complete this task in a worksheet labelled ‘Task 6’ and clearly show your solutions and, where required,
add your comments, which should not exceed the 200-word limit. Task 6 accounts for 15 marks.

The government would like to use the survey data in order to understand the proportion of the UK households
who have got a ‘degree’ or ‘other higher degree’ (educ).
1) Complete the following tasks to estimate the population proportion of individuals with a ‘degree or
equivalent’.
a. Suggest an unbiased estimator that could be used to estimate the population proportion of households
with a ‘degree’ or ‘other higher degree’. Use the sample data in order to provide an estimate. Briefly
comment on your choice of estimator and on your estimate.
b. Compute and plot a confidence interval that has a high probability of including the population
proportion of individuals with a ‘degree’ or ‘other higher degree’. Briefly comment on your findings.
c. The government believes that 28% of households have got a ‘degree’ or ‘other higher degree’. Use the
sample data in order to test this hypothesis. Briefly comment on your findings.
2) Suppose that the government is interested in knowing whether a larger proportion of households with
a ‘degree’ or ‘other higher degree’ have expressed a high satisfaction with their life relative to households
with a lower educational level. Explain how you would carry out such research, apply it to the given data
and comment on your findings.
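
Inference for a proportion can be sketched in Python as below. All counts are hypothetical and the function names are illustrative. The sketch relies on the large-sample normal approximation, and the pooled two-proportion z statistic is one common way to approach a comparison like the one in 2), not the only one.

```python
import math
from statistics import NormalDist


def proportion_ci(successes, n, confidence=0.95):
    """Large-sample confidence interval for a population proportion."""
    p_hat = successes / n
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return p_hat - z * se, p_hat + z * se


def proportion_z_test(successes, n, p0):
    """Two-sided z test of H0: p = p0, using the standard error under H0."""
    p_hat = successes / n
    se0 = math.sqrt(p0 * (1 - p0) / n)
    z = (p_hat - p0) / se0
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


def two_proportion_z(x1, n1, x2, n2):
    """Pooled z statistic for H0: p1 = p2."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se


# Hypothetical counts: 300 of 1000 respondents hold a degree or other higher degree.
ci = proportion_ci(300, 1000)
z, p_value = proportion_z_test(300, 1000, 0.28)

# Hypothetical counts: 180 of 300 degree holders and 350 of 700 others report sat >= 6.
z_diff = two_proportion_z(180, 300, 350, 700)
```

The test uses the hypothesised proportion in its standard error, whereas the confidence interval uses the sample proportion, so the two can be computed from the same counts but are not interchangeable.
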
