
AUSTRALIAN NATIONAL UNIVERSITY

RESEARCH SCHOOL OF FINANCE, ACTUARIAL STUDIES, AND STATISTICS

Question 1 [18 marks]
Previous research has shown that for young people, excessive screen time (that is, time spent
in front of a TV or watching digital devices) is associated with lower mental wellbeing (among
other health outcomes).
With the increasing use of digital technology in the classroom and for homework, exposure
to digital devices is hard to avoid, even for very young children.
In a national survey, households were contacted by mail at random to identify those with
children aged 14 years or less. The sampling frame was created from a census administrative
register of household addresses. Households with children aged 14 or under who were willing
to participate in the survey then completed a form listing the age and sex of all children aged
14 and under in the household. Within each household, one child aged 14 or under was then
randomly selected by the survey administrators to be the subject of the survey.
The variable SCREENTIME records how many hours on a weekday the child spends in front
of the TV or on a digital device. The survey responses for 5 year old children are summarised
in the frequency table below.

SCREENTIME    Frequency
< 1 hour      264
1 hour        441
2 hours       481
3 hours       216
4 hours       128
5 hours       10

(a) [1 mark] What sampling method best describes the process by which a child is selected
for participation in the survey?
Answer: Simple random sampling.

Households were contacted at random (assume with equal probability) and a child
was selected at random from each household that agreed to participate in the survey.
Households were not grouped by region or any other strata before sampling, so this is not
stratified random sampling. Not all children within a household were sampled, so it is
not cluster sampling. Households were not selected in alphabetical or other systematic
order, so this is not systematic sampling.

Page 1 of 8 - QUANTITATIVE RESEARCH METHODS - (STAT1008) - MidSemester Examination (S1 2021) - Solutions

(b) [3 marks] After summarising the sample data, it was found that the number of children
in the sample aged 0-5 years was approximately 1.5 times the sample
count of children in the 6-10 age group. However, national data tells us that there are
approximately equal numbers of children in the population in both age groups. What
type of survey error is the most likely source of this inconsistency between the national
figures and the sample counts? Explain your answer.

Answer: For the data set to contain 50% more children aged 0-5 than aged 6-10, the
most likely source is sampling error: just by chance, the 0-5 age group was over-sampled
relative to the 6-10 age group. Non-response error is also plausible if, for
whatever reason, households with children aged 0-5 were more likely to
agree to participate in the survey than households with children aged 6-10.
However, it is unlikely that non-response error would be the major source of
such a large discrepancy between population counts and sample counts, especially if
households have children in both the 0-5 and 6-10 age groups, in which case the age of the
children would not be the key reason behind non-response. Measurement error, whereby
households consistently misreport children's ages, is also unlikely. We are not given
information that households with children of certain ages were excluded from the sampling
frame, so we cannot attribute the discrepancy to coverage error.

(c) [1 mark] Estimate the probability that a 5 year old child has 2 hours of screen time on
a weekday.

Answer: The total in the sample is 264 + 441 + 481 + 216 + 128 + 10 = 1540.

Pr(5 yr old has 2 hrs screentime) = 481/1540 = 0.31

(d) [1 mark] Estimate the probability that a 5 year old child has 4 or more hours of screen
time on a weekday.

Answer: Pr (5yr old has 4+ hrs screentime) = (128+10)/1540=0.090
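
The relative-frequency estimates in parts (c) and (d) can be checked with a short Python sketch. The dictionary keys and variable names are illustrative choices, not part of the exam; the counts are copied from the frequency table in the question.

```python
# Frequency table for 5 year old children, copied from the question.
# The keys are just labels chosen for this sketch.
freq = {"<1": 264, "1": 441, "2": 481, "3": 216, "4": 128, "5": 10}

n = sum(freq.values())  # total number of 5 year olds in the sample

# Part (c): estimated probability of exactly 2 hours of screen time.
p_two_hours = freq["2"] / n

# Part (d): estimated probability of 4 or more hours (4 hours or 5 hours).
p_four_plus = (freq["4"] + freq["5"]) / n

print(n, round(p_two_hours, 2), round(p_four_plus, 3))
```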

(e) [3 marks] Estimate the population mean number of hours spent on screen time for all
5 year old children in the country. You will need to make an assumption on what
numerical value to assign to the category ‘< 1 hour’ to calculate a mean.
Answer: Assume 30 minutes of screen time (the midpoint) for the category ‘< 1 hour’. The
estimate of the population mean is the sample mean

$$\bar{x} = \frac{0.5 \times 264 + 1 \times 441 + 2 \times 481 + 3 \times 216 + 4 \times 128 + 5 \times 10}{1540} = \frac{2745}{1540} = 1.78 \text{ hours}$$
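
As a rough check of the grouped-data mean, the sketch below recomputes it in Python and also shows how much the answer depends on the value assigned to ‘< 1 hour’. The function name `grouped_mean` and the alternative values 0 and 1 are assumptions made for illustration only.

```python
# Counts for the categories with a known number of hours.
counts = {1: 441, 2: 481, 3: 216, 4: 128, 5: 10}
under_one_count = 264  # children in the '< 1 hour' category


def grouped_mean(under_one_value):
    """Sample mean when '< 1 hour' is assigned the given number of hours."""
    n = under_one_count + sum(counts.values())
    total_hours = under_one_value * under_one_count
    total_hours += sum(hours * count for hours, count in counts.items())
    return total_hours / n


mean_midpoint = grouped_mean(0.5)  # the assumption used in the answer
mean_low = grouped_mean(0.0)       # most extreme low assumption
mean_high = grouped_mean(1.0)      # most extreme high assumption

print(round(mean_low, 2), round(mean_midpoint, 2), round(mean_high, 2))
```

Under these assumptions the estimate stays between roughly 1.70 and 1.87 hours, so the choice of midpoint shifts the mean by less than a tenth of an hour either way.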


(f) [2 marks] Describe the shape of the distribution of the data for SCREENTIME for 5
year old children. Provide an explanation for your answer.

Answer: The distribution is unimodal, the mode being 2 hours of screentime. The
distribution is not symmetric, as about 46% of observations are below 2 hours but only
23% of observations are greater than 2 hours. The distribution appears to be slightly
right-skewed.

(g) [1 mark] Calculate the sample median of SCREENTIME for 5 year old children.

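Answer: The median is 2 hours. With n = 1540, the median is the average of the 770th and
771st ranked values. The cumulative counts are 264 (< 1 hour), 705 (1 hour) and 1186
(2 hours), so both ranked values fall in the 2 hours category.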
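
One way to locate the median of a frequency table is to walk through the cumulative counts until the middle ranks are reached. The following Python sketch does this; the helper name `category_at_rank` is hypothetical and the counts are those in the question.

```python
from itertools import accumulate

categories = ["< 1 hour", "1 hour", "2 hours", "3 hours", "4 hours", "5 hours"]
counts = [264, 441, 481, 216, 128, 10]

cumulative = list(accumulate(counts))  # running totals per category
n = cumulative[-1]


def category_at_rank(rank):
    """Return the category containing the observation with this 1-based rank."""
    for category, running_total in zip(categories, cumulative):
        if rank <= running_total:
            return category
    raise ValueError("rank exceeds the sample size")


# n is even, so the median averages the two middle ranked values.
lower_middle = category_at_rank(n // 2)
upper_middle = category_at_rank(n // 2 + 1)

print(cumulative, lower_middle, upper_middle)
```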

(h) [3 marks] The sample standard deviation of the variable SCREENTIME for children of
all ages in the survey is 2.08. If all children in a household that agrees to participate in
the survey are selected to be subjects of the survey, how might the value of the sample
standard deviation change? Would it increase, decrease or remain at a similar value?
Explain your answer.

Answer: It would be plausible to assume that children in the same household have
digital device or TV time at the same time (as allowed by the parents), so the responses
of children in the same household would be more similar than those of children in
different households. Therefore, we might expect the standard deviation of
SCREENTIME to decrease, meaning less variability in the data values, due to the
similarity of hours watched by children in the same household.

(i) [3 marks] What values of the variable SCREENTIME for 5 year old children would be
considered outliers? State any assumptions you need to make. Show some working for
your answer. Note that your answer could be outside the range of observed data values
listed in the summary table at the beginning of this question.

Answer: To be considered an outlier, the corresponding z-score must be greater
than +3 or less than -3. First, calculate the sample standard deviation, again assigning
0.5 hours to the category ‘< 1 hour’:

$$s^2 = \frac{0.5^2 \times 264 + 1^2 \times 441 + 2^2 \times 481 + 3^2 \times 216 + 4^2 \times 128 + 5^2 \times 10 - 1540 \times 1.78^2}{1540 - 1} = 1.0755^2$$

(The value 1.0755 is obtained when the unrounded mean $\bar{x} = 2745/1540$ is used in place of 1.78.)

Let $x_u$ be the smallest value that would be considered a positive outlier. We require

$$\frac{x_u - \bar{x}}{s_x} = \frac{x_u - 1.78}{1.0755} > 3.$$

Solving for $x_u$, we have $x_u > 5.01$. So to be considered a
positive outlier, a 5 year old child would need more than 5.01 hours of
SCREENTIME on a weekday. To be considered a negative outlier, solve for $x_l$
such that

$$\frac{x_l - 1.78}{1.0755} < -3.$$

This implies $x_l < -1.44$, but we cannot have negative screentime
values, so there are no negative outliers to consider for this study.
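
The outlier thresholds can be reproduced with the short Python sketch below. It assumes, as in the answer, that ‘< 1 hour’ is treated as 0.5 hours and that an outlier is a value with a z-score beyond ±3; the variable names are illustrative.

```python
import math

# Hours assigned to each category; 0.5 for '< 1 hour' is an assumption.
hours = [0.5, 1, 2, 3, 4, 5]
counts = [264, 441, 481, 216, 128, 10]

n = sum(counts)
mean = sum(h * c for h, c in zip(hours, counts)) / n

# Sample variance for grouped data, dividing by n - 1.
sum_sq_dev = sum(c * (h - mean) ** 2 for h, c in zip(hours, counts))
sd = math.sqrt(sum_sq_dev / (n - 1))

# Values with |z| > 3 lie outside these two thresholds.
upper_threshold = mean + 3 * sd
lower_threshold = mean - 3 * sd

print(round(sd, 4), round(upper_threshold, 2), round(lower_threshold, 2))
```

Because the lower threshold is negative and screen time cannot be, only the upper threshold matters in practice.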


Question 2 [17 marks]

More and more households are looking to save money by switching to solar energy to gen-
erate electricity. Some households are hesitant to convert to solar energy given the high
cost of installation, the variability in solar power production depending on the weather and
the inflexibility of trying to time electricity consumption when the sun’s rays are at their
strongest.
The data below shows the daily amount of solar energy (in kWh) produced by a household’s
solar system for every day in the month of January 2021. The data has been sorted from
smallest to largest.
10.33, 12.12, 19.11, 30.74, 31.78, 32.91, 33.50, 37.24, 40.04, 48.04, 56.25, 57.20, 61.45, 62.02,
64.76, 64.85, 65.78, 67.70, 68.12, 70.52, 71.04, 73.43, 73.95, 74.74, 75.55, 75.69, 75.74, 76.35,
76.35, 77.02, 77.68

(a) [1 mark] What is the lower quartile (25th percentile) of daily solar energy production
for this household in the month of January?

Answer: n=31 (31 days in January). The 25th percentile is the (31+1)/4=8th ranked
value. Therefore the 25th percentile is 37.24.

(b) [1 mark] What is the upper quartile (75th percentile) of daily solar energy production
for this household in the month of January?

Answer: The 75th percentile is the 3 × (31+1)/4 = 24th ranked value. Therefore the 75th
percentile is 74.74.

(c) [1 mark] What is the median amount of daily solar energy production for this household
in the month of January?
Answer: The median is the (31+1)/2 = 16th ranked value. Therefore the median is 64.85.

(d) [4 marks] Comment on the shape of the distribution of daily solar energy production in
the month of January for this household. Provide some summary statistics to support
your answer.
Answer: The five number summary is Min=10.33; Q1 =37.24; Median = 64.85; Q3 =
74.74; Max = 77.68.

We can calculate the following distances:

• Q1 - Min = 26.91; Max - Q3 = 2.94 → the left whisker is much longer than the right
whisker (on a boxplot)
• Median - Min = 54.52; Max - Median = 12.83 → the minimum is further away from
the median than the maximum value
• Median - Q1 = 27.61; Q3 - Median = 9.89 → the median is closer to Q3 than to Q1

All of the relative distance measures above are indicative of a left-skewed distribution.
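
The five number summary and the distances above can be reproduced with the Python sketch below. It uses the (n + 1) × p ranking rule from parts (a) to (c); the helper `value_at_rank_fraction` is a hypothetical name and only handles the case where the rank is a whole number, which holds for n = 31.

```python
# Daily solar production (kWh) for January 2021, already sorted.
production = [
    10.33, 12.12, 19.11, 30.74, 31.78, 32.91, 33.50, 37.24, 40.04, 48.04,
    56.25, 57.20, 61.45, 62.02, 64.76, 64.85, 65.78, 67.70, 68.12, 70.52,
    71.04, 73.43, 73.95, 74.74, 75.55, 75.69, 75.74, 76.35, 76.35, 77.02,
    77.68,
]


def value_at_rank_fraction(sorted_data, numerator, denominator):
    """Value at 1-based rank (n + 1) * numerator / denominator."""
    rank, remainder = divmod((len(sorted_data) + 1) * numerator, denominator)
    if remainder:
        # Interpolation between ranks is deliberately left out of this sketch.
        raise ValueError("rank is not a whole number")
    return sorted_data[rank - 1]


minimum, maximum = production[0], production[-1]
q1 = value_at_rank_fraction(production, 1, 4)      # 8th ranked value
median = value_at_rank_fraction(production, 1, 2)  # 16th ranked value
q3 = value_at_rank_fraction(production, 3, 4)      # 24th ranked value

# Distances used to judge skewness.
left_whisker = round(q1 - minimum, 2)
right_whisker = round(maximum - q3, 2)
lower_box = round(median - q1, 2)
upper_box = round(q3 - median, 2)

print(minimum, q1, median, q3, maximum)
print(left_whisker, right_whisker, lower_box, upper_box)
```

Other quartile conventions (for example, interpolating methods in statistical software) may give slightly different values for Q1 and Q3.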


(e) [2 marks] To what population can the sample data results be generalised? Give
reason(s) for your answer.

Answer: Solar energy production is seasonal, so we certainly cannot generalise the
results to other months of the year. System solar energy production also depends
on the number of solar panels installed and the type of solar panel (its power and
wattage levels). At best, the results can be generalised to solar energy production
in the month of January for households with the same solar system setup.

(f) [2 marks] The sample mean of the daily solar energy production in January 2021 is
56.84 kWh and the sample variance is 440.02. Which data values (if any) have z-score
values less than -2 or greater than +2? Show some working for your answer.

Answer: Calculate the z-scores of the largest and smallest data points first.

• $\frac{77.68 - 56.84}{\sqrt{440.02}} = 0.99$ (< 2, so NO data values have z-scores greater than +2)
• $\frac{10.33 - 56.84}{\sqrt{440.02}} = -2.22$; $\frac{12.12 - 56.84}{\sqrt{440.02}} = -2.13$; $\frac{19.11 - 56.84}{\sqrt{440.02}} = -1.80$. So the lowest two data
values (10.33 and 12.12) have z-score values < -2.
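
Rather than checking only the extreme values by hand, every observation can be screened at once. The Python sketch below uses the sample mean and variance quoted in the question (it does not recompute them); the names `z_score` and `flagged` are illustrative.

```python
import math

# Daily solar production (kWh) for January 2021, from the question.
production = [
    10.33, 12.12, 19.11, 30.74, 31.78, 32.91, 33.50, 37.24, 40.04, 48.04,
    56.25, 57.20, 61.45, 62.02, 64.76, 64.85, 65.78, 67.70, 68.12, 70.52,
    71.04, 73.43, 73.95, 74.74, 75.55, 75.69, 75.74, 76.35, 76.35, 77.02,
    77.68,
]

sample_mean = 56.84       # quoted in the question
sample_variance = 440.02  # quoted in the question
sample_sd = math.sqrt(sample_variance)


def z_score(value):
    """Standardise a value using the quoted sample mean and SD."""
    return (value - sample_mean) / sample_sd


# Keep the observations whose z-score lies beyond -2 or +2.
flagged = [x for x in production if abs(z_score(x)) > 2]

print(flagged)
print(round(z_score(max(production)), 2), round(z_score(min(production)), 2))
```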

(g) [2 marks] For this household, the average daily solar energy production in the month
of February 2021 is 51.22 kWh. Calculate the average daily solar energy production
over the first two months of the year in 2021. Show some working for your answer.

Answer:
$$\bar{x}_{\text{Jan,Feb}} = \frac{31 \times 56.84 + 28 \times 51.22}{31 + 28} = 54.17 \text{ kWh}$$
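
The combined mean is a weighted average in which each month's mean is weighted by its number of days. A minimal Python sketch, with illustrative variable names, also shows why a simple average of the two monthly means would not be quite right:

```python
# Monthly means (kWh per day) and the number of days in each month of 2021.
jan_mean, jan_days = 56.84, 31
feb_mean, feb_days = 51.22, 28

# Weighted mean: total energy over both months divided by total days.
total_energy = jan_days * jan_mean + feb_days * feb_mean
combined_mean = total_energy / (jan_days + feb_days)

# Unweighted average of the two means, shown only for comparison.
naive_mean = (jan_mean + feb_mean) / 2

print(round(combined_mean, 2), round(naive_mean, 2))
```

The weighted value is slightly higher than the unweighted one because January, the month with the larger mean, contributes more days.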


(h) [2 marks] The variable Self-Consumption tells us how much of the solar energy produced
is actually used by the household on a given day. Below is a scatter plot of daily solar
energy production (System Production) and daily self-consumption for January 2021.
Comment on the relationship between System Production and Self-Consumption and
explain whether this observed relationship makes practical sense.

Answer: The relationship is positive (an increase in self-consumption is associated
with an increase in system production). This makes sense: if more solar energy
is produced, then more solar energy is available for consumption. A linear trendline
through the body of the data would show the points lie reasonably close to the trendline,
indicating moderate to strong positive correlation between the two variables.

(i) [2 marks] The covariance between the variables System Production and Self-Consumption
is 106.6373. The sample variance of Self-Consumption is 63.4335. Calculate the coefficient
of correlation between System Production and Self-Consumption and interpret
its value. (Please show some working for your answer.)

Answer:
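$$r = \frac{106.6373}{\sqrt{440.02}\,\sqrt{63.4335}} = 0.64$$

where 440.02 is the sample variance of System Production (daily solar energy production) given earlier.
In words, there is evidence of a strong positive linear association between System Production
and Self-Consumption.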
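
The correlation coefficient is the covariance divided by the product of the two standard deviations. A short Python sketch using the summary statistics quoted in the question (variable names are illustrative):

```python
import math

covariance = 106.6373             # between System Production and Self-Consumption
var_production = 440.02           # sample variance of System Production
var_self_consumption = 63.4335    # sample variance of Self-Consumption

# r = cov(x, y) / (s_x * s_y)
r = covariance / (math.sqrt(var_production) * math.sqrt(var_self_consumption))

print(round(r, 2))
```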


Question 3 [10 marks]

E-scooters for rental were introduced in Canberra by the ACT government in September
2020. The e-scooters were launched in Canberra to provide an environmentally friendly
alternative mode of transport for people to commute short distances such as to work or for
visitors to travel between the local attractions of the city. However, there have been reports
that e-Scooters are more likely to be hired for joyrides (that is for leisure) rather than as a
regular transport option.
The contingency tables below were created from a random sample of e-Scooter users. Separate
tables have been produced for the sample of male users and for the sample of female users.
The rows in the table refer to the purpose of hire (Leisure/Non-Leisure). The columns in the
table refer to the age group of the user.

MALES         < 18   18-34   35-50   51-65   over 65
Leisure         16      48      16      10         5
Non-Leisure      0      23      32       0         0

FEMALES       < 18   18-34   35-50   51-65   over 65
Leisure         20      20      14       4        12
Non-Leisure      0      28      30       4         0

(a) [2 marks] Estimate the probability that an e-Scooter user is male and aged 18-34. Show
some working for your answer.
Answer:

$$\Pr(\text{Male and aged 18-34}) = \frac{\text{Number of users who are male and aged 18-34}}{\text{Total number of users in survey}} = \frac{48+23}{282} = \frac{71}{282} = 0.25$$

(where 282 is the sum of all the sample counts in both tables)

(b) [2 marks] Estimate the probability that an e-Scooter user who is aged 18-34, hires the
e-Scooter for leisure. Show some working for your answer.
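Answer:

$$\Pr(\text{Leisure} \mid \text{aged 18-34}) = \frac{\text{Number of users who are aged 18-34 and hire for leisure}}{\text{Total number of 18-34 aged users in survey}} = \frac{48+20}{48+23+20+28} = \frac{68}{119} = 0.57$$

Note: this is equivalent to

$$\Pr(\text{Leisure} \mid \text{aged 18-34}) = \Pr(\text{Leisure} \mid \text{Male and aged 18-34})\Pr(\text{Male} \mid \text{aged 18-34}) + \Pr(\text{Leisure} \mid \text{Female and aged 18-34})\Pr(\text{Female} \mid \text{aged 18-34}) = \frac{48}{71} \times \frac{71}{119} + \frac{20}{48} \times \frac{48}{119}$$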


(c) [2 marks] Given a person hires an e-Scooter for leisure, what is the probability that
person is male? Show some working for your answer.

Answer:
$$\Pr(\text{Male} \mid \text{Leisure}) = \frac{\text{Number of users who are male and hire for leisure}}{\text{Total number of users who hire for leisure}} = \frac{16+48+16+10+5}{16+48+16+10+5+20+20+14+4+12} = \frac{95}{165} = 0.58$$

(d) [2 marks] Estimate the probability that an e-Scooter user hires the e-Scooter for leisure.
Show some working for your answer.

Answer:

$$\Pr(\text{Leisure}) = \frac{\text{Number of users who hire for leisure}}{\text{Total number of users}} = \frac{95+70}{282} = 0.59$$
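
The joint, conditional and marginal probabilities in parts (a) to (d) all come from the same two contingency tables. The Python sketch below recomputes them; the data structures and variable names are illustrative choices, with the counts copied from the tables in the question.

```python
age_groups = ["< 18", "18-34", "35-50", "51-65", "over 65"]

# Rows of each table, in the same column order as age_groups.
males = {"Leisure": [16, 48, 16, 10, 5], "Non-Leisure": [0, 23, 32, 0, 0]}
females = {"Leisure": [20, 20, 14, 4, 12], "Non-Leisure": [0, 28, 30, 4, 0]}

total_users = sum(sum(row) for table in (males, females) for row in table.values())
col = age_groups.index("18-34")

# (a) Joint probability: male AND aged 18-34.
male_18_34 = males["Leisure"][col] + males["Non-Leisure"][col]
p_male_and_18_34 = male_18_34 / total_users

# (b) Conditional probability: leisure GIVEN aged 18-34.
female_18_34 = females["Leisure"][col] + females["Non-Leisure"][col]
leisure_18_34 = males["Leisure"][col] + females["Leisure"][col]
p_leisure_given_18_34 = leisure_18_34 / (male_18_34 + female_18_34)

# (c) Conditional probability: male GIVEN leisure.
male_leisure = sum(males["Leisure"])
all_leisure = male_leisure + sum(females["Leisure"])
p_male_given_leisure = male_leisure / all_leisure

# (d) Marginal probability: leisure.
p_leisure = all_leisure / total_users

print(total_users, round(p_male_and_18_34, 2), round(p_leisure_given_18_34, 2),
      round(p_male_given_leisure, 2), round(p_leisure, 2))
```

The only thing that changes between the four parts is which counts go in the numerator and which group of users forms the denominator.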

(e) [2 marks] At an intersection, 5 e-Scooter riders are seen waiting to cross the road.
Estimate the probability that 3 out of the 5 riders have hired the e-Scooter for leisure.
(You may assume the 5 riders do not know each other). Show some working for your
answer.

Answer: Let X be the random variable denoting the number of riders out of 5 who hired
for leisure. We can assume $X \sim \text{Binomial}(n = 5, p = 0.59)$.

$$\Pr(X = 3) = {}^{5}C_{3} \times 0.59^{3} \times (1 - 0.59)^{5-3} = 0.35$$
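
The binomial probability can be checked with Python's standard library. The sketch below uses the rounded value p = 0.59 from part (d), as the worked answer does, and assumes the 5 riders hire independently; the function name `binomial_pmf` is illustrative.

```python
import math


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)


n_riders = 5
p_leisure = 0.59  # rounded estimate from part (d)

p_three_leisure = binomial_pmf(3, n_riders, p_leisure)

# Sanity check: the probabilities over all possible outcomes sum to 1.
total_probability = sum(binomial_pmf(k, n_riders, p_leisure) for k in range(n_riders + 1))

print(round(p_three_leisure, 2))
```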
