
ST104a - Statistics 1- Exams with
commentaries 2011-2018
written by
mreducation
This paper is not to be removed from the Examination Halls
UNIVERSITY OF LONDON 279 004a ZA
BSc degrees and Diplomas for Graduates in Economics, Management, Finance and the
Social Sciences, the Diplomas in Economics and Social Sciences and Access Route for
External Students
Statistics 1 (half unit)
Thursday, 5 May 2011 : 10.00am to 12.00pm
Candidates should answer THREE of the following FOUR questions: QUESTION 1 of
Section A (50 marks) and TWO questions from Section B (25 marks each). Candidates are
strongly advised to divide their time accordingly.
Extracts from statistical tables are given after the final question on this paper
Graph paper is provided at the end of this question paper. If used, it must be detached and
fastened securely inside the answer book.
A calculator may be used when answering questions on this paper and it must comply in all
respects with the specification given with your Admission Notice. The make and type of
machine must be clearly stated on the front cover of the answer book.
© University of London 2011

UL11/0186 PLEASE TURN OVER
D01 Page 1 of 19
SECTION A
Answer all parts of Question 1 (50 marks in total).
1. (a) Consider the following sample dataset:

3 2 4 6 1
i. Find the median and the variance.
ii. Suppose that we add the same number to all the values of the dataset.
Which of the two quantities (median or variance) will remain the same?
Explain briefly. (Note that no marks will be awarded without an
explanation.)
(4 marks)
(b) State whether the following are possible or not and give a brief explanation.
(Note that no marks will be awarded for a simple possible/not possible answer.)
i. The mean of a dataset is always smaller than the mode.
ii. The significance level of a test is greater than the probability of a Type I
error.
iii. A chi-squared value can be positive.
iv. Statistical hypotheses can be statements about the sample mean.
(8 marks)
(c) There is an error in each of the following sentences. Explain what is wrong in
each case.
i. We found a high positive correlation of 0.03 between faculty teaching and
research evaluations.
ii. The correlation between height and weight was found to be 0.53 kilograms.
iii. There is a positive correlation between the gender of employees in a
company and their income.
(6 marks)
(d) Consider a random sample of 20 values from a normal population with mean
𝜇 and variance 𝜎². The sample mean is 18.5 and an estimate of the variance is
2.1. Calculate a 95% confidence interval for 𝜇 and provide its general formula.
(4 marks)
(e) i. A bowl contains 7 balls: 3 blue and 4 yellow. Two balls are drawn
successively without replacement. What is the probability that both of
the balls are yellow?
ii. In the same draw what is the probability that the two balls are different
colours?
iii. Consider a high-risk population where 10% of the people have HIV. A
diagnostic test is correct in 90% of the cases if the person has HIV and in
95% of the cases if the person does not have the virus. If a person is tested
positive, what is the probability that the person does not have HIV?
(8 marks)
(f) With $x_1 = 3$, $x_2 = 1$, $x_3 = 2$, $x_4 = 1$, $x_5 = 2$, find
i. $\sum_{i=1}^{4} (x_i - 2)$
ii. $\sum_{i=2}^{5} 3x_i$
iii. $\sum_{i=4}^{5} x_i^3$
(6 marks)
(g) The summary statistics for 2 independent datasets from a population with a
normal distribution are as follows:
Sample size Sample mean Sample standard deviation
𝑥 data 13 4.3 1.2
𝑦 data 21 4.9 1.4
Compute the mean and the variance of the combined dataset.
(6 marks)
(h) Assume that the marks of students at a certain university are normally
distributed with mean 52 and variance 100. Consider a randomly chosen
student from that university and find the probability
i. of failing the class (pass mark is 34).
ii. of obtaining a mark between 60 and 70.
(4 marks)
(i) Define random sampling and quota sampling. Provide an example where you
would prefer one method to the other.
(4 marks)
SECTION B
Answer two questions from this section (25 marks each).
2. (a) It is assumed that there is a linear relationship between the obtained yield
of apple trees and the amount of fertiliser supplied to them. In order to test
this assumption, nine apple trees of the same type were randomly selected and
supplied weekly with a fixed quantity (𝑥 grams) of fertiliser. The yield of each
apple tree (𝑦 kilograms) was recorded.
Tree 1 2 3 4 5 6 7 8 9
𝑥 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0
𝑦 3.9 4.3 5.5 6.4 6.9 7.1 7.3 7.7 8.0
The summary statistics for these data are:

Sum of the 𝑥 data: 27 Sum of the squares of 𝑥 data: 96
Sum of the 𝑦 data: 57.1 Sum of the squares of 𝑦 data: 379.51
Sum of the products of 𝑥 and 𝑦 data: 186.75
i. Draw a scatter diagram of these data on the graph paper provided. Label
the diagram carefully.
ii. Calculate the least squares line of 𝑦 on 𝑥 and draw the line on the scatter
diagram.
iii. What prediction will you give for a tree that is treated weekly with 3.2
grams of fertiliser?
iv. Would you use the least squares line in part (ii.) to predict the yield of an
apple tree that is treated weekly with 15 grams of fertiliser? Justify your
answer.
(13 marks)
(b) A consumer report examined potential differences between two brands of tyres.
The mean life of the tyres is of primary concern. The available data, measured
in thousands of miles, are provided below.

Sample size Sample mean Sample standard deviation
Brand A 34 21.4 1.5
Brand B 38 22.3 1.8
i. Use an appropriate hypothesis test to determine whether the mean lives
of the two brands are different. Test at two appropriate significance levels
and comment on your findings.
ii. State clearly any assumptions you made.
iii. Repeat the procedure above to determine whether the mean life of the
tyres of brand B is longer than that of the brand A tyres.
(12 marks)
3. (a) The ministry of education is considering funding pre-school education. Before
making their recommendations, administrators take a random sample of 100
students from various areas to compare the performance of students in algebra
between those who attended pre-school and those who did not. The results are
summarised in the table below:
Below Grade Level At Grade Level Advanced
Pre-school 12 29 16
No Pre-school 18 11 14
i. Test for association between performance in algebra and pre-school
attendance at two appropriate significance levels. State the null and
alternative hypotheses clearly.
ii. Comment on your results describing potential associations in detail. Dis-
cuss any potential differences in the algebra marks between students who
did and students who did not attend pre-school.
(13 marks)
(b) You have been asked to design a stratified random sample survey to examine
whether the recession had any impact in the spending habits of UK university
students.
i. Discuss how you will choose your sampling frame. Provide possible limi-
tations of your choice.
ii. Provide two relevant stratification factors. Justify your answers.
iii. Provide two actions to reduce response bias and provide the reasons for
which you think they would be successful.
iv. Briefly discuss the statistical methodology you will use to analyse the
collected data.
(12 marks)
4. (a) The IQ scores for a sample of 30 students who are entering their first year of
high school are shown below:
95 95 97 98 101
102 103 104 105 106
106 107 108 108 110
111 115 115 117 119
119 121 121 126 126
128 133 134 136 142
i. Carefully construct, draw and label a histogram of these data on the graph
paper provided.
ii. Find the mean (given that the sum of the data is 3408) and the modal
group.
iii. Find the median and the upper quartile.
iv. Comment on the data given the shape of the histogram and the measures
you have calculated.
(12 marks)
(b) The student union of a large university gathered a random sample of 525
students to determine whether they are in favour of a new grading system.
The results are summarised in the table below:
Sample size Number in favour of new grading system

Humanities 325 221
Science 200 120
i. Do the results indicate a difference between humanities and science in the
population proportions in favour of the new grading system? Conduct an
appropriate hypothesis test at two appropriate levels and comment on your
results.
ii. Give a 97% confidence interval for the difference between the two
proportions in the population.
(13 marks)
END OF PAPER
Dennis V. Lindley, William F. Scott, New Cambridge Statistical Tables, (1995) © Cambridge University Press, reproduced with permission.
UNIVERSITY OF LONDON 279 004a ZB
Social Sciences, the Diplomas in Economics and Social Sciences and Access Route for
External Students
Thursday, 5 May 2011 : 10.00am to 12.00pm

Extracts from statistical tables are given after the final question on this paper

SECTION A
1. (a) Consider the following sample dataset:

5 3 2 4 1
i. Find the mean and the variance.
ii. Suppose that we subtract the same number from all the values of the
dataset. Which of the two quantities (mean or variance) will remain the
same? Explain briefly. (Note that no marks will be awarded without an
explanation.)
(4 marks)
(b) State whether the following are possible or not and give a brief explanation.
(Note that no marks will be awarded for a simple possible/not possible answer.)
i. The mean of a dataset is always greater than the median.
ii. The power of a test is related to the probability of a Type II error.
iii. A low chi-squared value shows that there is little evidence of association
between the two variables tested.
iv. Statistical hypotheses can be statements about the sample variance.
(8 marks)
(c) There is an error in each of the following sentences. Explain what is wrong in
each case.
i. We found a high negative correlation of -0.03 between faculty teaching and
research evaluations.
ii. The correlation between height and weight was found to be 0.7 meters.
iii. There is a positive correlation between the gender of employees in a
company and their salary.
(6 marks)
(d) Consider a random sample of 15 values from a normal population with mean
𝜇 and variance 𝜎². The sample mean is 11.5 and an estimate of the variance is
1.1. Calculate a 95% confidence interval for 𝜇 and provide its general formula.
(4 marks)
(e) i. A bowl contains 9 balls: 5 red and 4 green. Two balls are drawn
successively without replacement. What is the probability that both of
the balls are red?
ii. In the same draw what is the probability that the two balls are different
colours?
iii. Consider a high-risk population where 5% of the people have HIV. A
diagnostic test is correct in 95% of the cases if the person has HIV and in
90% of the cases if the person does not have the virus. If a person is tested
positive, what is the probability that the person does not have HIV?
(8 marks)
(f) With $x_1 = 4$, $x_2 = 1$, $x_3 = 2$, $x_4 = 4$, $x_5 = 3$, find
i. $\sum_{i=3}^{5} (x_i - 4)$
ii. $\sum_{i=1}^{4} 2x_i$
iii. $\sum_{i=2}^{3} x_i^3$
(6 marks)
(g) The summary statistics for 2 independent datasets from a population with a
normal distribution are as follows:
Sample size Sample mean Sample standard deviation
𝑥 data 18 5.3 1.0
𝑦 data 15 4.1 1.5
Compute the mean and the variance of the combined dataset.
(6 marks)
(h) Assume that the marks of students at a certain university are normally
distributed with mean 55 and variance 81. Consider a randomly chosen
student from that university and find the probability
i. of getting a first (70 or above).
ii. of obtaining a mark between 50 and 60.
(4 marks)
(i) Define random sampling and cluster sampling. Provide an example where
cluster sampling will be useful.
(4 marks)
SECTION B
2. (a) A company would like to predict how its sales trainees will perform based
on the results of an aptitude test that is given to them at the beginning of the
training. The table below contains the test scores (x values) and the values of
the sales for these trainees during the first month of working at the company
(y values in hundreds of dollars).
Salesman 1 2 3 4 5 6 7 8 9
𝑥 1.8 2.6 2.8 3.4 3.6 4.2 4.8 5.2 5.4
𝑦 5.4 6.4 6.0 6.2 6.8 7.0 7.6 7.3 7.6

Sum of the 𝑥 data: 33.8 Sum of the squares of 𝑥 data: 139.24
Sum of the 𝑦 data: 60.3 Sum of the squares of 𝑦 data: 408.61
Sum of the products of 𝑥 and 𝑦 data: 233.6
ii. Calculate the least squares line of 𝑦 on 𝑥 and draw the line on the scatter
diagram.
iii. What level of sales would you expect from a salesman who scored 4.0 on
the aptitude test?
iv. Would you use the least squares line in part (ii.) to predict the sales of a
salesman that scored 7.0 on the aptitude test? Justify your answer.
(13 marks)
(b) A consumer report examined potential differences between two brands of tyres.
The mean life of the tyres is of primary concern. The available data, measured
in thousands of miles, are provided below:

Sample size Sample mean Sample standard deviation
Brand A 40 25.1 1.3
Brand B 32 23.9 2.0
i. Use an appropriate hypothesis test to determine whether the mean lives
of the two brands are different. Test at two appropriate significance levels
and comment on your findings.
ii. State clearly any assumptions you made.
iii. Repeat the procedure above to determine whether the mean life of the
tyres of brand B is longer than that of the brand A tyres.
(12 marks)
3. (a) The ministry of education is considering funding pre-school education further.
Before making their recommendations, administrators take a random sample
of 100 students from various areas to compare the performance of students in
algebra between those who attended pre-school and those who did not. The
results are summarised in the table below:
Below Grade Level At Grade Level Advanced
Pre-school 10 26 15
No Pre-school 20 14 15
i. Test for association between performance in algebra and pre-school
attendance at two appropriate significance levels. State the null and
alternative hypotheses clearly.
ii. Comment on your results describing potential associations in detail. Dis-
cuss any potential differences in the algebra marks between students who
did and students who did not attend pre-school.
(13 marks)
(b) You have been asked to design a stratified random sample survey to examine
whether job satisfaction of employees varies between different job types.
i. Discuss how you will choose your sampling frame. Provide possible limi-
tations of your choice.
ii. Provide two relevant stratification factors. Justify your answers.
iii. Provide two actions to reduce response bias and provide the reasons for
which you think they would be successful.
iv. Briefly discuss the statistical methodology you will use to analyse the
collected data.
(12 marks)
4. (a) The IQ scores for a sample of 30 students who are entering their first year of
high school are shown below:
103 117 121 104 127
114 111 129 143 115
95 102 96 107 98
99 101 104 107 113
123 131 135 113 112
124 102 99 96 101
i. Carefully construct, draw and label an appropriate stem-and-leaf diagram.

ii. Find the mean (given that the sum of the data is 3342) and the modal
group.
iii. Find the median and the lower quartile.
iv. Comment on the data given the shape of the stem-and-leaf diagram and
the measures you have calculated.
(12 marks)
(b) The student union of a large university gathered a random sample of
students to determine whether they are in favour of a new grading system.
The results are summarised in the table below:
Sample size Number in favour of new grading system

Males 225 110
Females 275 165
i. Do the results indicate a difference between males and females in the
population proportions in favour of the new grading system? Conduct
an appropriate hypothesis test at two appropriate levels and comment on
your results.
ii. Give a 98% confidence interval for the difference between the two
proportions in the population.
(13 marks)
END OF PAPER
Examiners’ commentaries 2011

04a Statistics 1
Important note
This commentary reflects the examination and assessment arrangements for this course in the
academic year 2010–11. The format and structure of the examination may change in future years,
and any such changes will be publicised on the virtual learning environment (VLE).
Please note that all page references are to the 2011 subject guide.
Specific comments on questions – Zone A
SECTION A
Question 1
(a) Reading for this question

This question asks for the median, which is a measure of location, and the variance,
which is a measure of dispersion. Specific sections exist for this material in Chapter 3 of
the subject guide about data presentation. These sections also contain activities to test
your understanding about these measures.
Approaching this question

i. Such questions are more easily approached if the numbers are rearranged in
(ascending) order. You get, going upwards:
1 2 3 4 6
which gives a median of 3.
The mean can also be calculated to be $\bar{x} = (1 + 2 + 3 + 4 + 6)/5 = 3.2$, which can be used to
calculate the variance:
$s^2 = \frac{(1 - 3.2)^2 + (2 - 3.2)^2 + (3 - 3.2)^2 + (4 - 3.2)^2 + (6 - 3.2)^2}{5 - 1} = 3.7.$
ii. In this question you may think about the definition and realise that if you add the
same number to all the sample values, the location will be shifted whereas the
dispersion will remain the same. Hence the median will change but the variance will
remain the same. Some candidates verified this with a numerical example by adding a
specific number – say 1 – to all the values. This can be helpful as well.
Weak candidates confused definitions or did not know how to calculate some or all of the
measures asked for. It is important that these basics are thoroughly revised: they
underpin the rest of the syllabus.
(4 marks)
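The numerical check mentioned in part (ii.) can be sketched in a few lines of Python. This is only an illustration, not part of the marking scheme: the shift of 1 is the example value suggested above, and the variable names are our own.

```python
from statistics import median, variance

data = [3, 2, 4, 6, 1]
shifted = [x + 1 for x in data]  # add the same number (here 1) to every value

sample_median = median(data)        # middle value of the ordered data
sample_variance = variance(data)    # sample variance, divisor n - 1

shifted_median = median(shifted)      # location moves with the data
shifted_variance = variance(shifted)  # dispersion is unaffected by the shift

print(sample_median, sample_variance)
print(shifted_median, shifted_variance)
```

The median moves by exactly the amount added, while the variance is the same for both lists.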
(b) Reading for this question

This question contains material from various parts of the subject guide. Here, it is more
important to have a good intuitive understanding of the relevant concepts than the
technical level in computations. Part (i) requires material from Chapter 3 in the section
about measures of location. For parts (ii) and (iv) you need to know about hypotheses
and types of error. You can look at Chapter 7 on hypothesis testing in the section about
Type I and Type II errors and the hypotheses respectively. Finally, part (iii) requires
knowledge about the chi-squared test that can be found in Chapter 8.

Candidates always find this type of question tricky. It requires a brief explanation of the
reason for a possible/not possible answer and not just a choice between the two. Some
candidates lost marks as well for long, rambling explanations without a decision as to
whether a statement was ‘possible’ or not.
i. The key word in this case is the word ‘always’ which makes a very strong statement.
A good way to approach such questions is to think whether there is a possibility that
the mean is not smaller than the mode. Since this is the case, a good answer would be
‘No, it can also be larger or equal’.
ii. Here, the definition of a Type I error is required. It states clearly that the two
quantities are equal. So, a good answer would be ‘No, the significance level is equal to
the probability of a Type I error’.
iii. Note that the chi-squared value is a sum of squares and therefore takes positive values.
Hence, a good answer would be ‘The chi-squared value is always positive, so yes’.
iv. This question points out a key feature of statistical hypotheses. They can be
statements about population parameters only. It would be wrong to phrase a
statistical hypothesis in terms of characteristics of the sample. Hence, a good answer
would be ‘No, hypotheses have to be statements about population parameters’.
(8 marks)
(c) Reading for this question

This question asks candidates to show their understanding of the basic ideas of
correlation that are covered on pp.164–168 of the subject guide with further references
given in Chapter 11. Again a good technical level is not needed, rather a good and
intuitive understanding of correlation, although it can be useful.

This question asks you to identify an error about statistics in some sentences. Some
candidates were confused and gave answers like ‘There cannot be a positive correlation
between research and teaching’. First of all, one cannot be sure that this is the case.
But, more importantly, this is an examination on statistical concepts so you should try
to identify errors about them. In this case the errors are easy to spot if you recall the
basic properties of correlation.
i. Correlation takes values between −1 and 1 with values close to 0 indicating weak
(low) correlation. A good answer would be ‘Correlation of 0.03 is not high’.
ii. Note that correlation is a relative measure and it can quantify the relationship of
quantities with different measurement units. Hence a good answer would be
‘Correlation does not reflect measurement units’.
iii. Here you have to realise that correlation refers only to variables where ‘high’ and
‘low’ make sense. A good answer would be ‘Gender is not a continuous variable’.
(6 marks)
(d) Reading for this question

The entire content of Chapter 6 is relevant and in particular Sections 6.6 and 6.7. Try
Activity A6.4.

This asks for a 95% confidence interval. This was straightforward once the correct
distribution, t, was identified. Weak candidates did not notice that the variance was
unknown and used the normal distribution.
The working is given below:
• Confidence interval formula: $\bar{x} \pm t_{\alpha/2,\, n-1} \cdot s/\sqrt{n}$.
• Degrees of freedom: 19.
• t-value: ≈ 2.09.
• Confidence interval: (17.82, 19.18).
(4 marks)
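As a rough check of the working above, the interval can be reproduced as follows. The critical value 2.093 is typed in by hand as the tabulated $t_{0.025,\,19}$ (an assumption of this sketch; in the examination it is read from the statistical tables), and the variable names are illustrative.

```python
from math import sqrt

n = 20
sample_mean = 18.5
sample_var = 2.1   # estimate of the variance, so s = sqrt(2.1)
t_crit = 2.093     # t value for 19 degrees of freedom, taken from tables (assumed)

margin = t_crit * sqrt(sample_var / n)   # t * s / sqrt(n)
ci = (sample_mean - margin, sample_mean + margin)

print(f"({ci[0]:.2f}, {ci[1]:.2f})")
```

Using the normal value 1.96 in place of `t_crit` would give a narrower interval, which is the mistake described above.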
(e) Reading for this question

These are both probability questions. Read Chapter 4 about probability and in
particular the sections about the definition of probability and probability trees.

i. In such questions it is essential to start by defining the events and then list what you
know about them. In our case one can define
• $B_1$: The first ball is yellow.
• $B_2$: The second ball is yellow.
It may be of help to write down some things that are immediately known about $B_1$
and $B_2$. For example, as there are 7 balls in total and 4 of them are yellow,
$P(B_1) = 4/7$.
Now, to get to the specific question of this part, the event that both balls are yellow
is $B_1 \cap B_2$. So, we can write
$P(B_1 \cap B_2) = P(B_1) P(B_2 | B_1) = \frac{4}{7} \cdot \frac{3}{6} = \frac{2}{7}.$
ii. If the balls are of different colours, then only one ball is yellow. This can happen if
the first ball was yellow and the second ball was blue ($B_1 \cap B_2^c$), or if the first ball was
blue and the second ball was yellow ($B_1^c \cap B_2$). Adding the probabilities for these two
cases gives
$P(B_1 \cap B_2^c) + P(B_1^c \cap B_2) = \frac{4}{7} \cdot \frac{3}{6} + \frac{3}{7} \cdot \frac{4}{6} = \frac{4}{7}.$
iii. Most candidates had difficulty in this part although it had similarities with (i.) and
(ii.). As before, the best way to start with such exercises is to define the relevant
events. In this case we have
• A : Test positive.
• B : Person has HIV.
The next step is to write down what is given for the above events, or their
complements, or combinations of events. In our case, for example, 10% of people have
HIV, so $P(B) = 0.1$. Another way that information can be given is through
conditional probabilities. Typical phrases to identify such cases are ‘given ..., the
probability of ... is’ or ‘if ..., then the probability of ... is’ etc. In this question we are
told that if the person has HIV (or else, given $B$) the diagnostic test is correct (hence
positive, or else $A$) with probability 90%. This means that $P(A | B) = 0.9$. Similarly,
we obtain that if the person does not have HIV (given $B^c$) the test is correct (hence
negative, or else $A^c$) which leads to $P(A^c | B^c) = 0.95$.
We may now use the formula
$P(B^c | A) = \frac{P(A | B^c) P(B^c)}{P(A | B^c) P(B^c) + P(A | B) P(B)},$
as we know all the quantities. We get
$P(B^c | A) = \frac{(1 - 0.95) \times 0.9}{(1 - 0.95) \times 0.9 + 0.9 \times 0.1} = \frac{1}{3}.$
(8 marks)
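The three probabilities in this part can be checked with exact fractions. The sketch below simply mirrors the events defined above (A: tests positive, B: has HIV); the variable names are our own and the code is an illustration rather than a required method.

```python
from fractions import Fraction as F

# Bowl: 4 yellow and 3 blue balls, two draws without replacement.
p_both_yellow = F(4, 7) * F(3, 6)
p_different = F(4, 7) * F(3, 6) + F(3, 7) * F(4, 6)  # yellow-blue or blue-yellow

# Diagnostic test: A = tests positive, B = person has HIV.
p_b = F(10, 100)                  # P(B)
p_a_given_b = F(90, 100)          # P(A | B)
p_not_a_given_not_b = F(95, 100)  # P(A^c | B^c)

p_not_b = 1 - p_b                          # P(B^c)
p_a_given_not_b = 1 - p_not_a_given_not_b  # P(A | B^c), a false positive

# Bayes' formula for P(B^c | A)
p_not_b_given_a = (p_a_given_not_b * p_not_b) / (
    p_a_given_not_b * p_not_b + p_a_given_b * p_b
)

print(p_both_yellow, p_different, p_not_b_given_a)
```

Working with `Fraction` avoids rounding, so the results can be compared directly with the fractions in the working above.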
(f) Reading for this question

This question refers to the basic bookwork which can be found on pp.15–16 of the
subject guide, and in particular Activity A1.6 on p.19.

Be careful to leave the $x_i$s in the order given and only cover the values of $i$ asked for.
This question was generally well done; the answers are:
i. $\sum_{i=1}^{4} (x_i - 2) = (3 - 2) + (1 - 2) + (2 - 2) + (1 - 2) = -1.$
ii. $\sum_{i=2}^{5} 3x_i = (3 \times 1) + (3 \times 2) + (3 \times 1) + (3 \times 2) = 18.$
iii. $\sum_{i=4}^{5} x_i^3 = 1^3 + 2^3 = 9.$
(6 marks)
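A short sketch of the same three sums is given below. Storing the values in a dictionary keyed by $i$ is our own choice, made so that the index ranges match the summation limits exactly; note that `range` excludes its upper end.

```python
# Values keyed by their index i, in the order given in the question.
x = {1: 3, 2: 1, 3: 2, 4: 1, 5: 2}

sum_i = sum(x[i] - 2 for i in range(1, 5))     # i = 1, ..., 4
sum_ii = sum(3 * x[i] for i in range(2, 6))    # i = 2, ..., 5
sum_iii = sum(x[i] ** 3 for i in range(4, 6))  # i = 4, 5

print(sum_i, sum_ii, sum_iii)
```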
(g) Reading for this question

This question asks candidates to go back to first principles and calculate a mean and
standard deviation using summary statistics. The bookwork is given on pp.36–37 for the
arithmetic mean and on pp.40–41 for the variance.

The total of the data is (13 × 4.3) + (21 × 4.9) = 158.8. There are 13 + 21 = 34 data
values, so the combined mean is 158.8/34 = 4.67. To calculate the variance, first find the
sample variances. They are 1.2² = 1.44 and 1.4² = 1.96. Hence, the ‘sum of squares’ is
(12 × 1.44) + (20 × 1.96) = 56.48. Samples are from the same normal distribution, so
their variances are the same, so we can use the pooled variance formula. Hence
$s_p^2 = \frac{56.48}{13 + 21 - 2} = 1.765.$
Answers to one decimal place were accepted.
(6 marks)
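The arithmetic above can be reproduced from the summary statistics alone. The sketch follows the pooled variance approach described in this commentary; the variable names are illustrative.

```python
# Summary statistics: sample size, sample mean, sample standard deviation.
n_x, mean_x, sd_x = 13, 4.3, 1.2
n_y, mean_y, sd_y = 21, 4.9, 1.4

# Combined mean: total of all observations divided by the total count.
combined_mean = (n_x * mean_x + n_y * mean_y) / (n_x + n_y)

# Pooled variance: each sample variance weighted by its degrees of freedom.
sum_of_squares = (n_x - 1) * sd_x**2 + (n_y - 1) * sd_y**2
pooled_var = sum_of_squares / (n_x + n_y - 2)

print(round(combined_mean, 2), round(pooled_var, 3))
```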
(h) Reading for this question

This section examines the ideas of the normal random variable. Read the relevant
section of Chapter 5 and work through the examples and activities of this section.

The basic property of the normal random variable for this question is that if
$X \sim N(\mu, \sigma^2)$, then $Z = \frac{X - \mu}{\sigma} \sim N(0, 1)$. Note also that
• P (Z < a) = P (Z ≤ a) = Φ(a)
• P (Z > a) = P (Z ≥ a) = 1 − P (Z ≤ a) = 1 − P (Z < a) = 1 − Φ(a)
• P (a < Z < b) = P (a ≤ Z < b) = P (a < Z ≤ b) = P (a ≤ Z ≤ b) = Φ(b) − Φ(a).
The above is all you need to answer the two parts of this question:

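i. $P(X < 34) = P\left(\frac{X - 52}{\sqrt{100}} < \frac{34 - 52}{\sqrt{100}}\right) = \Phi(-1.8) = 0.036.$
ii. $P(60 \le X \le 70) = P\left(\frac{60 - 52}{\sqrt{100}} < \frac{X - 52}{\sqrt{100}} < \frac{70 - 52}{\sqrt{100}}\right) = \Phi(1.8) - \Phi(0.8) = 0.176.$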
(4 marks)
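For readers who want to confirm the table look-ups, the two probabilities can be sketched with `NormalDist` from Python's standard `statistics` module. In the examination the values of Φ come from the statistical tables; the code is only a check, and the variable names are our own.

```python
from statistics import NormalDist

# Marks are N(52, 100), so the standard deviation is 10.
marks = NormalDist(mu=52, sigma=10)

p_fail = marks.cdf(34)                      # P(X < 34) = Phi(-1.8)
p_between = marks.cdf(70) - marks.cdf(60)   # Phi(1.8) - Phi(0.8)

print(round(p_fail, 3), round(p_between, 3))
```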
(i) Reading for this question

This question requires knowledge about sampling and sample surveys. Useful
background reading may be found in Chapter 9 of the subject guide. See also the
references to Newbold, Carlson and Thorne given on p.135 of Chapter 9.

This question asked for definitions and an example. The answer is meant to be short.
Many candidates wrote long answers that in most situations contained irrelevant things.
The definitions (also available in the subject guide) are given below:
• Random sampling: Each unit has a known, non-zero probability of being selected.
• Quota sampling: interviewers are given a quota of numbers they must interview
according to various factors such as age, sex, etc (see pp.138–139 of the subject guide).
Regarding the example, one could mention any kind of sample survey, e.g. data by
population density, age, and income within London boroughs in order to decide where
to locate new convenience stores. An advantage of random sampling is that it allows the
use of statistical methodology, whereas quota sampling surveys are easier to conduct and
have lower cost.
(4 marks)
SECTION B
Question 2

This is a standard regression question and the reading is to be found on pp.170–174 in the
subject guide. Further references are given in Chapter 11 of the subject guide.

i. [Figure: scatter diagram titled ‘Obtained yield of apple trees versus weekly amount of fertiliser supplied’. Horizontal axis: x, grams of fertiliser (1 to 5). Vertical axis: y, yield in kgs (4 to 8).]
Candidates are reminded that they are asked to draw and label the scatter diagram
which should include a title (‘Scatter diagram’ alone will not suffice) and labelled axes
which also give their units. Far too many candidates threw away marks by neglecting
these points and consequently were only given one mark out of the possible four allocated
for this part of the question. Another common way of losing marks was failing to use the
graph paper which was provided, and required, in the question. Candidates who drew on
the ordinary paper in their booklet were not awarded marks for this part of the question.
(4 marks)
ii. The regression line can be written by the equation $\hat{y} = a + bx$ or $y = a + bx + \varepsilon$. The
formula for b is
$b = \frac{\sum x_i y_i - n\bar{x}\bar{y}}{\sum x_i^2 - n\bar{x}^2},$
and by substituting the summary statistics we get b = 1.03.
The formula for a is $a = \bar{y} - b\bar{x}$, so we get a = 3.25.
Hence the regression line can be written as $\hat{y} = 3.25 + 1.03x$ or $y = 3.25 + 1.03x + \varepsilon$.
(5 marks)
iii. The prediction will be ŷ = 3.25 + 1.03 × 3.2 = 6.55 kilograms. One mark was deducted in
cases where the units of measurement were not given.
(2 marks)
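The slope, intercept and prediction in parts (ii.) and (iii.) can be checked directly from the summary statistics given in the question. The sketch below is illustrative only; it keeps unrounded coefficients for the prediction, which agrees with the rounded working to two decimal places.

```python
# Summary statistics from the question (nine apple trees).
n = 9
sum_x, sum_y = 27, 57.1
sum_xx, sum_xy = 96, 186.75

x_bar = sum_x / n
y_bar = sum_y / n

# Least squares slope and intercept of y on x.
b = (sum_xy - n * x_bar * y_bar) / (sum_xx - n * x_bar**2)
a = y_bar - b * x_bar

# Predicted yield (kilograms) for 3.2 grams of fertiliser per week.
prediction = a + b * 3.2

print(round(b, 2), round(a, 2), round(prediction, 2))
```

The value 3.2 lies inside the observed range of x (1.0 to 5.0), so this prediction is an interpolation.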
iv. This could be a good idea due to a strong, positive, linear relationship but requires
extrapolation, so it has to be applied with caution. Answers such as ‘No, because it
requires extrapolation’ were given half credit whereas answers saying yes, but without
mentioning extrapolation, were not given any credit.
(2 marks)

The question asks for a two-tailed hypothesis test comparing means. See pp.114–115 of the
subject guide.

i. The null hypothesis is that the mean lives of the two brands (µA and µB ) do not differ,
the alternative is that they do differ.
$H_0: \mu_A = \mu_B$ vs $H_1: \mu_A \neq \mu_B$.
Use the test statistic formula:
$\frac{\bar{x} - \bar{y}}{\sqrt{s_A^2/n_A + s_B^2/n_B}}$ or $\frac{\bar{x} - \bar{y}}{\sqrt{s_p^2/n_A + s_p^2/n_B}}$
to find the test statistic value: 2.313 (or 2.289 if pooled variance used). The critical
values, assuming a normal approximation as the number of observations is large, are
±1.96. If a t-distribution with 70 degrees of freedom is assumed, we have t = 2.00 (using
60 degrees of freedom, the nearest value in the table). Taking 5%, we reject the null
hypothesis and there is therefore evidence for a difference between the two. If we take an
α of 1%, the critical values are ±2.576, so we do not reject H0 . We conclude that there is
some evidence of a difference between the brands.
(7 marks)
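The two-tailed test in part (i.) can be sketched as below. It computes both versions of the statistic (separate and pooled variances) and compares the first with normal critical values obtained from `NormalDist.inv_cdf`; the sign is negative because the difference is taken as brand A minus brand B, so absolute values are compared. Variable names are our own.

```python
from math import sqrt
from statistics import NormalDist

# Summary statistics: sample size, sample mean, sample standard deviation.
n_a, mean_a, sd_a = 34, 21.4, 1.5
n_b, mean_b, sd_b = 38, 22.3, 1.8

diff = mean_a - mean_b

# Statistic with separate variance estimates.
z_unpooled = diff / sqrt(sd_a**2 / n_a + sd_b**2 / n_b)

# Statistic with the pooled variance estimate.
pooled_var = ((n_a - 1) * sd_a**2 + (n_b - 1) * sd_b**2) / (n_a + n_b - 2)
t_pooled = diff / sqrt(pooled_var * (1 / n_a + 1 / n_b))

# Two-tailed normal critical values at the 5% and 1% levels.
z_crit_5 = NormalDist().inv_cdf(0.975)
z_crit_1 = NormalDist().inv_cdf(0.995)

reject_at_5 = abs(z_unpooled) > z_crit_5
reject_at_1 = abs(z_unpooled) > z_crit_1

print(round(abs(z_unpooled), 3), round(abs(t_pooled), 3), reject_at_5, reject_at_1)
```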
ii. The assumptions made in (i.) were:
• Assumption about whether $\sigma_A^2 = \sigma_B^2$.
• Assumption about whether $n_A + n_B - 2$ is ‘large’, hence t v. z.
• Assumption about independent samples.
• Assumption about independent samples.
(2 marks)
iii. In this case the question was whether the mean life of the tyres of brand B is longer than
that of the brand A tyres. Hence the hypotheses are
H0 : µA = µB vs H1 : µA < µB .
The statistic to use is the same as before. However, the critical values will be different.
We conclude that the result is highly significant so there is evidence that the mean life of
brand B is longer. The z-values are ≈ 1.645 for 5% and ≈ 2.32 for 1%.
(3 marks)
Question 3

Part (i.) is a straightforward chi-squared test and the reading is given in Chapter 8 of the
subject guide, in particular pp.122–127. For part (ii.) of the question, look at Activity A8.4.

i. Set out the null hypothesis that there is no association between performance and
pre-school attendance against the alternative that there is an association. Be careful to
get these the correct way round!
H0 : No association between performance and pre-school attendance.

H1 : Association between performance and pre-school attendance.
Work out the expected values. For example, you should work out the expected value, if
there is no association, for the students below grade that attended pre-school as:
(30/100) × 57 = 17.1. Repeat for each cell to get the table below.

Pre-school 17.1 22.8 17.1
No pre-school 12.9 17.2 12.9
The test statistic formula is

$$\sum_{i,j} \frac{(O_{i,j} - E_{i,j})^2}{E_{i,j}},$$
which gives a value of 7.623. This is a 3 × 2 contingency table so the degrees of freedom
are (3 − 1) × (2 − 1) = 2. For α = 0.05 this gives a critical value of 5.99, hence we reject
H0 . For a second (smaller) α, say 1% we get a critical value of 9.21, where we do not
reject H0 .
We conclude that there is some evidence of an association between pre-school attendance
and algebra marks.
Many candidates looked up the tables incorrectly and so failed to follow through their
earlier accurate work. A larger number did not expand on their results sufficiently.
Saying ‘we reject at the 5% level, but not at 1%’ is insufficient. What does this mean? Is
there an association or not? If there is one, how strong is it? This needed to be answered
if the full nine marks allocated for this question were to be given. Many candidates lost
marks by missing out on follow-up parts like this.
(9 marks)
ii. There are a number of statements that can be drawn from the previous results. By
checking differences between expected and observed numbers we can extract various
arguments that aid in the interpretation of the results. For example, we may say things
like:
• Main sources of association: pre-school v. below grade and at grade.
• Students who attended pre-school are more likely to obtain grade algebra marks than
students who did not.
• Pre-school attendance reduces the chances of a below grade level algebra mark.
There were some excellent answers to this, but many candidates ignored this part of the
question. (4 marks)
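For the chi-squared test in part (i.), the expected counts and the critical values can be reproduced with a short Python sketch. It assumes row totals of 57 and 43 and column totals of 30, 40 and 30; these are inferred from the worked example (30/100) × 57 and the expected-value table, not quoted from the question paper. The helper `chi2_critical_df2` is an illustrative name and relies on a shortcut that holds only for 2 degrees of freedom.

```python
import math

row_totals = [57, 43]      # pre-school, no pre-school (inferred from the table)
col_totals = [30, 40, 30]  # assumed column totals, inferred from the worked example
n = sum(row_totals)

# Expected count under no association: row total x column total / grand total.
expected = [[r * c / n for c in col_totals] for r in row_totals]


def chi2_critical_df2(alpha):
    """Upper-alpha critical value of chi-squared with 2 degrees of freedom.

    With 2 df the distribution is exponential with mean 2, so the quantile
    has the closed form -2 * ln(alpha). This shortcut is valid for df = 2 only.
    """
    return -2 * math.log(alpha)


statistic = 7.623  # value reported in the commentary
for alpha in (0.05, 0.01):
    crit = chi2_critical_df2(alpha)
    print(f"alpha = {alpha}: critical value {crit:.2f}, reject H0: {statistic > crit}")
```

For other degrees of freedom the critical value should be read from the statistical tables, as in the examination.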

This was a fairly standard survey design question. Background reading is given in Chapters
9 and 10 of the subject guide which, along with the essential reading, should be looked at
carefully. Candidates were expected to have studied and understood the main important
constituents of design in random sampling.

The main thing to note here is that many candidates wrote essays without any structure.
This exercise asks for specific things, each of which requires one or two lines. If you
do not know what these things are, do not write lengthy essays: they gain no marks and
waste valuable examination time. If you can identify what is
being asked for, keep in mind that the answer should not be long.
Note also that there is usually no unique answer to such questions. Below are some good
answers for this case.
i. Two possible ways to answer this part:
• A sampling frame can be an email list of university students. However, this would
probably rule out face-to-face interviews and response bias may become an issue.
• Alternatively, a list of telephone numbers can be obtained and telephone interviews
may be conducted. However, there may not be a telephone for every university
student. Also, it is likely that those who do not have telephones were affected
differently by the recession and may also have different spending habits. This can
bias the results.
ii. Make sure you provide a justification for your answers. Possible relevant stratification
factors are the following:
• Income level, as it is obviously related to spending habits. Also, the recession may
have had different impacts on different income levels.
• Gender, as it is related to spending habits.
• Studying away from home. Students who live on their own have different spending
habits than those living with their families.
iii. Make sure you provide a justification for your answers. Possible ways to reduce bias are
the following:
• Incentives, as people are more likely to answer and also to answer with accuracy.
• Face-to-face interviews, to eliminate confusion and reduce the chance of missing
values.
• Length of questionnaire, so that people are more likely to answer and devote a
reasonable amount of attention.
iv. Summaries of variables indicating amount of spending (in different categories) before and
after the recession. If these variables are continuous, boxplots can be used (before and
after recession) for graphical presentation, whereas t-tests (possibly one-tailed) may be
used to test if the observed differences are significant. If the variables are categorical,
contingency tables and chi-squared tests can be used instead.
(12 marks)
Question 4

Chapter 3 provides all the relevant material for this question.

i. Candidates were not required to use the graph paper provided here, but it was useful to
those who did so and they were not penalised. A good histogram gave a title and
labelled the x and y axes. Candidates and their teachers should note that the y-axis does
not denote ‘frequency’ as was commonly noted. Frequencies are given by the area under
each block. A graph like the one below would be a good answer.
[Figure: histogram titled ‘Histogram of IQScores’; x-axis: IQ scores (90 to 150); y-axis: frequency densities (0.00 to 0.03).]
ii. The mean can be found to be 113.6, whereas the modal group is the one between IQ
scores of 100 and 110.
iii. The median is 110.5, whereas the upper quartile is 121. The median had to be exactly
110.5, but for the upper quartile similar values based on different interpolations were also
accepted.
iv. There is positive (right) skewness in the distribution of the data. Most of the IQ scores
are around 100 and 110.
(12 marks)

Look up the sections about hypothesis testing for proportions (part i.) and confidence
intervals for proportions (part ii.) in Chapters 7 and 6 of the subject guide, respectively.

i. If π1 is the proportion in favour of the new grading system in humanities and π2 the
corresponding proportion in science, the hypotheses are:
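H0 : π1 = π2 vs H1 : π1 ≠ π2 .
The test statistic formula is

$$\frac{p_1 - p_2}{\text{s.e.}(p_1 - p_2)},$$

where the standard error can be calculated with either of the following methods:

$$\text{s.e.}(p_1 - p_2) = \sqrt{0.6495 \times 0.3505 \times \left(\frac{1}{325} + \frac{1}{200}\right)} = 0.043$$

$$\text{or} \quad \text{s.e.}(p_1 - p_2) = \sqrt{\frac{0.68 \times 0.32}{325} + \frac{0.6 \times 0.4}{200}} = 0.043.$$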
The test statistic value is 1.860. For α = 0.05, the critical values are ±1.96, so we do not
reject H0 at the 5% level.
We therefore choose a larger second α to be 10%, which gives critical values of ±1.645.
We therefore reject H0 at this level and conclude that there is weak evidence of a
difference in the proportions in favour of the new grading system between students in
humanities and science.
Candidates got full marks for this question if they either:
• provided an interpretation of the findings saying that ‘Students in humanities are
more in favour of the new grading system than students in science’, or
• justified the use of the normal distribution by the large sample.
(9 marks)
ii. This asks for a 97% confidence interval. The normal distribution may be used as before.
• Confidence interval formula: (p1 − p2 ) ± zα/2 × s.e.(p1 − p2 ).
• z-value: 2.17.
• End-points: 0.08 ± 2.17 × 0.043.
• Report as an interval: (−0.013, 0.173).
(4 marks)
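Both parts of this question can be reproduced with a few lines of Python. This is a sketch under stated assumptions: the sample proportions 0.68 (n = 325) and 0.60 (n = 200) are taken from the standard error calculation above, the pooled standard error is used for the test, and the unpooled one is used for the confidence interval. All variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

p1, n1 = 0.68, 325  # humanities: sample proportion and sample size
p2, n2 = 0.60, 200  # science: sample proportion and sample size

# Pooled proportion, used for the standard error under H0: pi1 = pi2.
p_pool = (p1 * n1 + p2 * n2) / (n1 + n2)
se_pooled = sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))

# Unpooled standard error, used here for the confidence interval.
se_unpooled = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)

z_stat = (p1 - p2) / se_pooled

# 97% confidence interval: the z-value cuts off 1.5% in each tail.
z_crit = NormalDist().inv_cdf(1 - 0.03 / 2)
lower = (p1 - p2) - z_crit * se_unpooled
upper = (p1 - p2) + z_crit * se_unpooled

print(f"pooled p = {p_pool:.4f}, s.e. = {se_pooled:.3f} (pooled), {se_unpooled:.3f} (unpooled)")
print(f"z = {z_stat:.3f}, z-value for 97% = {z_crit:.2f}, CI = ({lower:.3f}, {upper:.3f})")
```

The commentary works with the standard error rounded to 0.043, so the last decimal place of the test statistic and the interval end-points may differ slightly from the unrounded calculation.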

04a Statistics 1
Important note
Please note that all page references are to the 2011 subject guide.
Specific comments on questions – Zone B
SECTION A
Question 1

This question asks for the mean, which is a measure of location, and the variance, which
is a measure of dispersion. Specific sections exist for this material in Chapter 3 of the
subject guide about data presentation. These sections also contain activities to test your
understanding about these measures.

i. The mean can be calculated to be (1 + 2 + 3 + 4 + 6)/5 = 3.2, which can then be used to
calculate the variance:

$$\frac{(1 - 3.2)^2 + (2 - 3.2)^2 + (3 - 3.2)^2 + (4 - 3.2)^2 + (6 - 3.2)^2}{5 - 1} = 3.7.$$
ii. In this question you may think about the definition and realise that if you subtract
the same number from all the sample values, the location will be shifted whereas the
dispersion will remain the same. Hence the mean will change and the variance will
remain the same. Some students verified this with a numerical example by subtracting
a specific number – say 1 – from all the values. This can be helpful as well.
Weak candidates confused definitions or did not know how to calculate some or all of the
measures asked for. It is important that these basics are thoroughly revised: they
underpin the rest of the syllabus.
(4 marks)
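The numerical check mentioned in part (ii.) can be sketched in Python as follows. The dataset is the one used in the calculation above, and subtracting 1 follows the example given in the commentary; the variable names are illustrative.

```python
from statistics import mean, variance

data = [1, 2, 3, 4, 6]
shifted = [x - 1 for x in data]  # subtract the same number from every value

# statistics.variance is the sample variance (divisor n - 1).
print(mean(data), variance(data))
print(mean(shifted), variance(shifted))
```

The mean moves by exactly the amount subtracted, while the variance depends only on deviations from the mean and so is unchanged.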

important to have a good intuitive understanding of the relevant concepts than the
technical level in computations. Part (i) requires material from Chapter 3 in the section
about measures of location. For parts (ii) and (iv) you need to know about hypotheses
and types of error. You can look at Chapter 7 on hypothesis testing in the section about
Type I and Type II errors and the hypotheses respectively. Finally, part (iii) requires
knowledge about the chi-squared test that can be found in Chapter 8.

reason for a possible/not possible answer and not just a choice between the two. Some
candidates lost marks as well for long, rambling explanations without a decision as to
whether a statement was ‘possible’ or not.
i. The key word in this case is the word ‘always’ which makes a very strong statement.
A good way to approach such questions is to think whether there is a possibility that
the mean is not larger than the median. Since this is the case, a good answer would
be ‘No, it can be also smaller or equal’.
ii. Here, the definition of a Type II error is required. It states clearly that the power of a
test is equal to 1 minus the probability of a Type II error. So, a good answer would
be ‘Yes, it is equal to 1 minus the probability of a Type II error’.
iii. Low values of the chi-squared statistic show that there is not much distance between
observed and expected values (if we assumed no association), therefore this is
possible. An answer here would be ‘Yes, low values provide little evidence for
association’. Some careful candidates mentioned here that we may still reject the null
hypothesis for low values. This is correct but it does not mean that the statement is
not possible. Marks were given to candidates when their explanations were valid.
iv. This question points out a key feature of statistical hypotheses. They can be
statements about population parameters only. It would be wrong to phrase a
statistical hypothesis in terms of characteristics of the sample. Hence, a good answer
would be ‘No, hypotheses have to be statements about population parameters’.
(8 marks)
(c) Reading for this question

This question asks candidates to show their understanding of the basic ideas of
correlation that are covered on pp.164–168 of the subject guide with further references
given in Chapter 11. Again a good technical level is not needed, rather a good and
intuitive understanding of correlation, although it can be useful.

This question asks you to identify an error about statistics in some sentences. Some
candidates were confused and gave answers like ‘There cannot be a positive correlation
between research and teaching’. First of all, one cannot be sure that this is the case.
But, more importantly, this is an examination on statistical concepts so you should try
to identify errors about them. In this case the errors are easy to spot if you recall the
basic properties of correlation.
i. Correlation takes values between −1 and 1 with values close to 0 indicating weak
(low) correlation. A good answer would be ‘Correlation of −0.03 is not high’.
ii. Note that correlation is a relative measure and it can quantify the relationship of
quantities with different measurement units. Hence a good answer would be
‘Correlation does not reflect measurement units’.
iii. Here you have to realise that correlation refers only to variables where ‘high’ and
‘low’ make sense. A good answer would be ‘Gender is not a continuous variable’.
(6 marks)
(d) Reading for this question

The entire content of Chapter 6 is relevant and in particular Sections 6.6 and 6.7. Try
Activity A6.4.

This asks for a 95% confidence interval. This was straightforward once the correct
distribution, t, was identified. Weak candidates did not notice that the variance was
unknown and used the normal distribution.
• Confidence interval formula: x̄ ± t(α/2, n−1) × s/√n.
• Degrees of freedom: 14.
• t-value: ≈ 2.14.
• Confidence interval: (10.92, 12.08).
(4 marks)
(e) Reading for this question

These are both probability questions. Read Chapter 4 about probability and in
particular the sections about the definition of probability and probability trees.

i. In such questions it is essential to start by defining the events and then list what you
know about them. In our case one can define
• B1 : The first ball is red.
• B2 : The second ball is red.
It may be of help to write down some things that are immediately known about B1
and B2 . For example, as there are 9 balls in total and 5 of them are red, P (B1 ) = 5/9.
Now, to get to the specific question of this part, the event that both balls are red
is B1 ∩ B2 . So, we can write

$$P(B_1 \cap B_2) = P(B_1) P(B_2 \mid B_1) = \frac{5}{9} \cdot \frac{4}{8} = \frac{5}{18}.$$
ii. If the balls are of different colours, then only one ball is red. This can happen if the
first ball was red and the second ball was green (B1 ∩ B2^c), or if the first ball was
green and the second ball was red (B1^c ∩ B2). Adding the probabilities for these two
cases gives

$$P(B_1 \cap B_2^c) + P(B_1^c \cap B_2) = \frac{5}{9} \cdot \frac{4}{8} + \frac{4}{9} \cdot \frac{5}{8} = \frac{5}{9}.$$
iii. Most candidates had difficulty in this part although it had similarities with (i.) and
(ii.). As before, the best way to start with such exercises is to define the relevant
events. In this case we have
• A : Test positive.
• B : Person has HIV.
The next step is to write down what is given for the above events, or their
complements, or combinations of events. In our case, for example, 5% of people have
HIV, so P (B) = 0.05. Another way that information can be given is through
conditional probabilities. Typical phrases to identify such cases are ‘given ..., the
probability of ... is’ or ‘if ..., then the probability of ... is’ etc. In this question we are
told that if the person has HIV (or else, given B) the diagnostic test is correct (hence
positive, or else A) with probability 95%. This means that P (A|B) = 0.95. Similarly,
we obtain that if the person does not have HIV (given B^c) the test is correct (hence
negative, or else A^c) which leads to P (A^c | B^c) = 0.90.
We may now use the formula

$$P(B^c \mid A) = \frac{P(A \mid B^c) P(B^c)}{P(A \mid B^c) P(B^c) + P(A \mid B) P(B)},$$

as we know all the quantities. We get

$$P(B^c \mid A) = \frac{(1 - 0.90) \times 0.95}{(1 - 0.90) \times 0.95 + 0.95 \times 0.05} = \frac{2}{3}.$$
(8 marks)
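The three probabilities can be verified with exact fractions. The sketch below assumes, as in the worked answers, 5 red and 4 green balls drawn without replacement, and the HIV figures P(B) = 0.05, P(A|B) = 0.95 and P(A^c|B^c) = 0.90. Variable names are illustrative.

```python
from fractions import Fraction as F

# Parts (i.) and (ii.): 9 balls, 5 red and 4 green, drawn without replacement.
red, green = 5, 4
total = red + green

p_both_red = F(red, total) * F(red - 1, total - 1)
p_different = (F(red, total) * F(green, total - 1)
               + F(green, total) * F(red, total - 1))

# Part (iii.): Bayes' theorem for P(no HIV | positive test).
p_hiv = F(5, 100)             # P(B)
p_pos_given_hiv = F(95, 100)  # P(A | B)
p_neg_given_no = F(90, 100)   # P(A^c | B^c)

p_pos_given_no = 1 - p_neg_given_no  # P(A | B^c)
p_no = 1 - p_hiv                     # P(B^c)
p_pos = p_pos_given_no * p_no + p_pos_given_hiv * p_hiv  # total probability of A
p_no_given_pos = p_pos_given_no * p_no / p_pos

print(p_both_red, p_different, p_no_given_pos)
```

Working with `Fraction` avoids rounding, so the results can be compared directly with the fractions in the commentary.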
(f ) Reading for this question

This question refers to the basic bookwork which can be found on pp.15–16 of the
subject guide, and in particular Activity A1.6 on p.19.

Be careful to leave the xi s in the order given and only cover the values of i asked for.
i. $\sum_{i=3}^{5} (x_i - 4) = (2 - 4) + (4 - 4) + (3 - 4) = -3$.
ii. $\sum_{i=1}^{4} 2x_i = (2 \times 4) + (2 \times 1) + (2 \times 2) + (2 \times 4) = 22$.
iii. $\sum_{i=2}^{3} x_i^3 = 1^3 + 2^3 = 9$.
(6 marks)
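The same sums can be written as Python generator expressions, which makes the index ranges explicit. The x values below are inferred from the terms in the worked answers (x1 = 4, x2 = 1, x3 = 2, x4 = 4, x5 = 3) rather than quoted from the question paper, and the helper `xi` is an illustrative name.

```python
# x values consistent with the worked answers (x1, ..., x5).
x = [4, 1, 2, 4, 3]


def xi(i):
    """Return x_i using the 1-based index of the summation notation."""
    return x[i - 1]


sum_i = sum(xi(i) - 4 for i in range(3, 6))     # i = 3, 4, 5
sum_ii = sum(2 * xi(i) for i in range(1, 5))    # i = 1, 2, 3, 4
sum_iii = sum(xi(i) ** 3 for i in range(2, 4))  # i = 2, 3

print(sum_i, sum_ii, sum_iii)
```

Note that `range(a, b)` stops at b − 1, so the upper limit of each sum appears as one more than the last index.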
(g) Reading for this question

This question asks candidates to go back to first principles and calculate a mean and
standard deviation using summary statistics. The bookwork is given on pp.36–37 for the
arithmetic mean and on pp.40–41 for the variance.

The total of the data is (18 × 5.3) + (15 × 4.1) = 156.9. There are 18 + 15 = 33 data
values, so the combined mean is 156.9/33 = 4.75. To calculate the variance, first find the
sample variances. They are 1.0² = 1.0 and 1.5² = 2.25. Hence, the ‘sum of squares’ is
(17 × 1.0) + (14 × 2.25) = 48.5. Samples are from the same normal distribution, so their
variances are the same, so we can use the pooled variance formula. Hence

$$s_p^2 = \frac{48.5}{18 + 15 - 2} = 1.564.$$
Answers to one decimal place were accepted.
(6 marks)
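The combined mean and pooled variance follow directly from the summary statistics. A minimal Python sketch, assuming sample sizes 18 and 15, means 5.3 and 4.1, and standard deviations 1.0 and 1.5 as used in the calculation above (variable names are illustrative):

```python
n1, mean1, sd1 = 18, 5.3, 1.0
n2, mean2, sd2 = 15, 4.1, 1.5

# Combined mean: total of all observations divided by the total sample size.
combined_mean = (n1 * mean1 + n2 * mean2) / (n1 + n2)

# Pooled variance: weight each sample variance by its degrees of freedom.
sum_of_squares = (n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2
pooled_variance = sum_of_squares / (n1 + n2 - 2)

print(f"combined mean = {combined_mean:.4f}, pooled variance = {pooled_variance:.4f}")
```

The pooled variance is only appropriate under the assumption, stated in the commentary, that both samples share the same population variance.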
(h) Reading for this question

This section examines the ideas of the normal random variable. Read the relevant
section of Chapter 5 and work through the examples and activities of this section.

The basic property of the normal random variable for this question is that if
X ∼ N(µ, σ²), then Z = (X − µ)/σ ∼ N(0, 1), and:
• P (Z < a) = P (Z ≤ a) = Φ(a),
• P (Z > a) = P (Z ≥ a) = 1 − P (Z ≤ a) = 1 − P (Z < a) = 1 − Φ(a),
The above is all you need to answer the two parts of this question:

i. $P(X \geq 70) = 1 - P(X < 70) = 1 - P\left(\frac{X - 55}{\sqrt{81}} < \frac{70 - 55}{\sqrt{81}}\right) = 1 - \Phi(5/3) = 0.048$.

ii. $P(50 \leq X \leq 59) = P\left(\frac{50 - 55}{\sqrt{81}} < \frac{X - 55}{\sqrt{81}} < \frac{59 - 55}{\sqrt{81}}\right) = \Phi(5/9) - \Phi(-5/9) = 0.422$.
(4 marks)
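Standardisation can be checked against a direct calculation. The sketch below covers part (i.) only and assumes X ∼ N(55, 81), so that σ = 9; it uses `statistics.NormalDist` from the Python standard library in place of the statistical tables.

```python
from statistics import NormalDist

X = NormalDist(mu=55, sigma=9)  # variance 81, so sigma = sqrt(81) = 9
Z = NormalDist()                # standard normal; Z.cdf plays the role of Phi

p_direct = 1 - X.cdf(70)                   # P(X >= 70) computed on the original scale
p_standardised = 1 - Z.cdf((70 - 55) / 9)  # 1 - Phi(5/3) after standardising

print(round(p_direct, 3), round(p_standardised, 3))
```

Both routes give the same probability, which is the point of the standardisation step: any normal probability reduces to a lookup of Φ.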
(i) Reading for this question

This question requires knowledge about sampling and sample surveys. Useful
background reading may be found in Chapter 9 of the subject guide. See also the
references to Newbold, Carlson and Thorne given on p.135 of Chapter 9.

This question asked for definitions and an example. The answer is meant to be short.
Many candidates wrote long answers that in most situations contained irrelevant things.
The definitions (also available in the subject guide) are given below:
• Random sampling: Each unit has a known, non-zero probability of being selected.
• Cluster sampling: Roughly speaking, random sampling within a cluster/subgroup of
the population (that usually has also been chosen at random). See also p.142 of the
subject guide.
Regarding the example, one could mention any kind of sample survey, e.g. data by
population density, age, and income within London boroughs in order to decide where
to locate new convenience stores. An advantage of random sampling is that it allows for
more accurate statistical methodology, whereas quota sampling surveys are easier to
conduct and have lower cost.
(4 marks)
SECTION B
Question 2

This is a standard regression question and the reading is to be found on pp.170–174 in the
subject guide. Further references are given in Chapter 11 of the subject guide.

i. Candidates are reminded that they are asked to draw and label the scatter diagram
and to give the units of measurement. Far too many candidates threw away marks by neglecting
these points and consequently were only given one mark out of the possible four allocated
for this part of the question. Another common way of losing marks was failing to use the
graph paper which was provided, and required, in the question. Candidates who drew on
the ordinary paper in their booklet were not awarded marks for this part of the question.
[Figure: scatter diagram titled ‘Sales versus aptitude test scores’; x-axis: aptitude score (2.0 to 5.5); y-axis: sales in hundreds of dollars (5.5 to 7.5).]
(4 marks)
ii. The regression line can be written by the equation ŷ = a + bx or y = a + bx + ε. The
formula for b is

$$b = \frac{\sum x_i y_i - n\bar{x}\bar{y}}{\sum x_i^2 - n\bar{x}^2},$$

with a = ȳ − bx̄. Hence the regression line can be written as ŷ = 4.52 + 0.58x or y = 4.52 + 0.58x + ε.
(5 marks)
iii. The prediction will be ŷ = 4.52 + 0.58 × 4.0 = 6.84 hundreds of dollars. One mark was
deducted in cases where the units of measurement were not given.
(2 marks)
iv. This could be a good idea due to a strong, positive, linear relationship but requires
extrapolation, so it has to be applied with caution. Answers such as ‘No, because it
requires extrapolation’ were given half credit whereas answers saying yes, but without
mentioning extrapolation, were not given any credit.
(2 marks)
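The slope and intercept formulas in part (ii.) and the prediction in part (iii.) can be sketched in Python. The original sales data are not reproduced in this commentary, so the function is checked on a small hypothetical dataset lying exactly on y = 1 + 2x; only the fitted coefficients 4.52 and 0.58 come from the commentary. The function name `least_squares` is illustrative, and the intercept uses the standard companion formula a = ȳ − bx̄.

```python
def least_squares(xs, ys):
    """Return (a, b) for the line y-hat = a + b x using the summary-sum formulas."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    b = (sum_xy - n * x_bar * y_bar) / (sum_xx - n * x_bar ** 2)
    a = y_bar - b * x_bar
    return a, b


# Hypothetical data lying exactly on y = 1 + 2x, used only to check the formulas.
a, b = least_squares([1, 2, 3, 4], [3, 5, 7, 9])

# Prediction from the fitted line reported in the commentary.
y_hat = 4.52 + 0.58 * 4.0  # in hundreds of dollars
print(a, b, round(y_hat, 2))
```

As the commentary stresses, a prediction is only meaningful with its units attached, here hundreds of dollars.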

The question asks for a two-tailed hypothesis test comparing means. See pp.114–115 of the
subject guide.

i. The null hypothesis is that the mean lives of the two brands (µA and µB ) do not differ,
the alternative is that they do differ.
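H0 : µA = µB vs H1 : µA ≠ µB .
Use the test statistic formula:

$$\frac{\bar{x} - \bar{y}}{\sqrt{\dfrac{s_A^2}{n_A} + \dfrac{s_B^2}{n_B}}} \quad \text{or} \quad \frac{\bar{x} - \bar{y}}{\sqrt{\dfrac{s_p^2}{n_A} + \dfrac{s_p^2}{n_B}}}$$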
to find the test statistic value: 2.934 (or 3.071 if pooled variance used). The critical
values, assuming a normal approximation as the number of observations is large, are
±1.96. If a t-distribution with 70 degrees of freedom is assumed, we have t = 2.00 (using
60 degrees of freedom, the nearest value in the table). Taking 5%, we reject the null
hypothesis and there is therefore evidence for a difference between the two. If we take an
α of 1%, the critical values are ±2.576; the test statistic exceeds these as well, so we
again reject H0 . We conclude that there is strong evidence of a difference between the brands.
(7 marks)
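ii. The assumptions for (i.) were that:
• Assumption about whether σA² = σB².
• Assumption about whether nA + nB − 2 is ‘large’, hence t v. z.
• Assumption about independent samples.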
(2 marks)
iii. In this case the question was whether the mean life of the tyres of brand B is longer than
that of the brand A tyres. Hence the hypotheses are
H0 : µA = µB vs H1 : µA < µB .
The statistic to use is the same as before in absolute value but has a different sign: it is
−2.934 (or −3.071 if pooled variance is used). The critical values take a positive value for any
significance level (≈ 1.645 for 5%), so we do not reject H0 : there is no evidence that the
mean life of brand B is longer.
This bit was a little confusing as the sample mean of brand B was in fact smaller. Some
candidates tested the hypothesis
H0 : µA = µB vs H1 : µB < µA
as they thought it might have been a more interesting question. Usually this is not
allowed and candidates should answer the question as set. But given the peculiarity of
this case these candidates were awarded full marks if they carried out their test correctly.
(3 marks)
Question 3

Part (i.) is a straightforward chi-squared test and the reading is given in Chapter 8 of the
subject guide, in particular pp.122–127. For part (ii.) of the question, look at Activity A8.4.

i. Set out the null hypothesis that there is no association between performance and
pre-school attendance against the alternative, that there is an association. Be careful to
get these the correct way round!
H0 : No association between performance and pre-school attendance.

H1 : Association between performance and pre-school attendance.
Work out the expected values. For example, you should work out the expected value, if
there is no association, for the students below grade that attended pre-school as:
(30/100) × 51 = 15.3. Repeat for each cell to get the table below.

Pre-school 15.3 20.4 15.3
No pre-school 14.7 19.6 14.7
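The test statistic formula is

$$\sum_{i,j} \frac{(O_{i,j} - E_{i,j})^2}{E_{i,j}}.$$

This is a 3 × 2 contingency table so the degrees of freedom
are (3 − 1) × (2 − 1) = 2. For α = 0.05 this gives a critical value of 5.99, hence we reject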
H0 . For a second (smaller) α, say 1% we get a critical value of 9.21, where we do not
reject H0 .
We conclude that there is some evidence of an association between pre-school attendance
and algebra marks.
Saying ‘we reject at the 5% level, but not at 1%’ is insufficient. What does this mean? Is
there an association or not? If there is one, how strong is it? This needed to be answered
if the full nine marks allocated for this question were to be given. Many candidates lost
marks by missing out on follow-up parts like this.
(9 marks)
ii. There are a number of statements that can be drawn from the previous results. By
checking differences between expected and observed numbers we can extract various
arguments that aid in the interpretation of the results. For example, we may say things
like:
• Main sources of association: pre-school v. below grade and at grade.
• Students who attended pre-school are more likely to obtain grade algebra marks than
students who did not.
• Pre-school attendance reduces the chances of a below grade level algebra mark.
There were some excellent answers to this, but many candidates ignored this part of the
question. (4 marks)

This was a fairly standard survey design question. Background reading is given in Chapters
9 and 10 of the subject guide which, along with the essential reading, should be looked at
carefully. Candidates were expected to have studied and understood the main important
constituents of design in random sampling.

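The main thing to note here is that many candidates wrote essays without any structure.
This exercise asks for specific things, each of which requires one or two lines. If you
do not know what these things are, do not write lengthy essays: they gain no marks and
waste valuable examination time. If you can identify what is
being asked for, keep in mind that the answer should not be long.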
Note also that there is usually no unique answer to such questions. Below are some good
answers for this case.
i. Two possible ways to answer this part:
• A sampling frame can be an email list from different companies. However, this would
probably rule out face-to-face interviews and response bias may become an issue.
• Alternatively, a list of telephone numbers can be obtained and telephone interviews
may be conducted. However, there may not be a telephone for every employee and
this could vary depending on the job type.
ii. Make sure you provide a justification for your answers. Possible relevant stratification
factors are the following:
• Income level, as it is obviously related to job satisfaction and varies across job types.
• Gender, as we see different proportions of women in different job types and the link
with job satisfaction is interesting.
• Education level, as it varies across job types and the link with job satisfaction is
interesting.
iii. Make sure you provide a justification for your answers. Possible ways to reduce bias are
the following:
• Incentives, as people are more likely to answer and also to answer with accuracy.
• Face-to-face interview, to eliminate confusion and reduce the chance of missing values.
• Length of questionnaire, so that people are more likely to answer and devote a
reasonable amount of attention.
iv. Summaries of variables indicating amount of job satisfaction for different job types. If
these variables are continuous, boxplots can be used to graphically compare pairs of job
types, whereas t-tests may be used to test if the observed differences are significant. If the
variables are categorical, contingency tables and chi-squared tests can be used instead.
(12 marks)
Question 4

Chapter 3 provides all the relevant material for this question. More specifically read p.35 of
the subject guide and look at the stem-and-leaf example and the accompanying commentary.

i. The stem-and-leaf diagram the Examiners were hoping to see, is shown below. Marks
were awarded for including the title, a sensible choice of stems, stem-and-leaf labels,
correct vertical alignment, and accuracy.
Stem-and-leaf plot of IQ scores
Stem = 10s | Leaf = 1s
9 | 566899
10 | 112234477
11 | 1233457
12 | 13479
13 | 15
14 | 3
ii. The mean can be found to be 111.4, whereas the modal group is the one between IQ
scores of 100 and 110.
iii. The median is 109, whereas the lower quartile is 101.25. The median had to be exactly 109,
but for the lower quartile similar values based on different interpolations were also
accepted.
iv. There is positive (right) skewness in the distribution of the data. Most of the IQ scores
are around 100 and 110.
(12 marks)
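The summary measures in parts (ii.) and (iii.) can be recomputed from the stem-and-leaf diagram. The Python sketch below rebuilds the 30 scores from the stems and leaves shown above. The quartile uses the `inclusive` interpolation method of `statistics.quantiles`, which is one of several accepted conventions, in line with the remark that similar values were accepted.

```python
from statistics import mean, median, quantiles

# Stems (tens) and leaves (units) as shown in the diagram.
stem_and_leaf = {
    9: "566899",
    10: "112234477",
    11: "1233457",
    12: "13479",
    13: "15",
    14: "3",
}

scores = sorted(10 * stem + int(leaf)
                for stem, leaves in stem_and_leaf.items()
                for leaf in leaves)

# method="inclusive" interpolates at position (n - 1) * p in the sorted data.
q1, q2, q3 = quantiles(scores, n=4, method="inclusive")

print(len(scores), mean(scores), median(scores), q1)
```

Other interpolation rules place the lower quartile at a slightly different position, which is why only the median had to match exactly.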

Look up the sections about hypothesis testing for proportions (part i.) and confidence
intervals for proportions (part ii.) in Chapters 7 and 6 of the subject guide, respectively.

i. If π1 is the proportion of males in favour of the new grading system and π2 the
corresponding proportion of females, the hypotheses are:
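H0 : π1 = π2 vs H1 : π1 ≠ π2 .
The test statistic formula is

$$\frac{p_1 - p_2}{\text{s.e.}(p_1 - p_2)},$$

where the standard error can be calculated with either of the following methods:

$$\text{s.e.}(p_1 - p_2) = \sqrt{0.55 \times 0.45 \times \left(\frac{1}{225} + \frac{1}{275}\right)} = 0.045$$

$$\text{or} \quad \text{s.e.}(p_1 - p_2) = \sqrt{\frac{0.4889 \times 0.5111}{225} + \frac{0.6 \times 0.4}{275}} = 0.045.$$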
The test statistic value is 2.495. For α = 0.05, the critical values are ±1.96, so we reject
H0 at the 5% level.
We therefore choose a smaller second α to be 1%, which gives critical values of ±2.576.
We therefore do not reject H0 at this level and conclude that there is some evidence of a
difference in the proportions in favour of the new grading system between males and
females.
Candidates got full marks for this question if they either:
• provided an interpretation of the findings saying that ‘Females are more in favour of
the new grading system than males’, or
• justified the use of the normal distribution by the large sample.

(9 marks)
ii. This asks for a 98% confidence interval. The normal distribution may be used as before.
• Confidence interval formula: (p1 − p2 ) ± zα/2 × s.e.(p1 − p2 ).
• z-value: 2.326
• End-points: 0.111 ± 2.326 × 0.045.
• Report as an interval: (0.006, 0.216).
(4 marks)
UNIVERSITY OF LONDON ST104A ZA

(279 004A)
Social Sciences, the Diplomas in Economics and Social Sciences and Access Route
Friday, 4 May 2012 : 10.00am to 12.00pm

A list of formulae and extracts from statistical tables are given after the final question on this
paper.

SECTION A
1. (a) The following data represent different types of variables. Classify each one of
them as measurable (continuous) or categorical. If a variable is categorical,
further classify it as nominal or ordinal. Justify your answer. (Note that no
marks will be awarded without justification.)
i. The education level for a number of employees from a company (elementary
school, high school, university or postgraduate degree).
ii. The blood pressure from 30 hospital patients.
iii. The hair colour of 50 persons.
iv. The weights of 30 randomly selected cereal boxes.
(8 marks)
(b) The table below contains the number of graduates from eight high schools
in a particular year that are pursuing a university degree in humanities and
sciences:
Sciences: 65 76 104 67 75 88 77 116
Humanities: 46 65 76 50 72 51 40 87
i. Find the mean and the median for the number of students in each category
of degree.
ii. Find the lower quartile of the number of students in sciences and the upper
quartile for the number of students in humanities.
iii. Calculate the Spearman rank correlation coefficient and interpret its value.
(13 marks)
(c) A test is taken by some students, their marks are recorded and we are
interested in the properties of the sample mean. Under the assumption that
the marks follow a Normal distribution with exact mean 60 and variance 81,
calculate the probability that the mark of a randomly selected student
i. is greater than 59.5 exactly; and
ii. lies between 59 and 60.5 exactly.
(4 marks)
(d) A sample of 180 students was taken and each student was questioned regarding
their preferences for a number of courses. The course in Mathematics was
chosen by 65 students. Calculate a 95% confidence interval for the proportion
of students in favour of Mathematics in the population.
(3 marks)
(e) Suppose that x1 = 2, x2 = 3, x3 = 5, x4 = 4, x5 = 0, and y1 = 3, y2 = 2,
y3 = 1, y4 = 0, y5 = 1. Calculate the following quantities:

i. $\sum_{i=3}^{5} 3(x_i - 1)$   ii. $\sum_{i=2}^{4} x_i y_i$   iii. $\sum_{i=1}^{3} x_i (y_i - 2)$
(6 marks)
(f) The probability distribution of a variable X is given below.
x 0 1 3 5
pX (x) .5 .2 .2 .1
i. Find the probability that X is larger than 2.
ii. Find the expected value of X, E(X).
(4 marks)
(g) Two fair dice are thrown.

i. Suppose that M denotes the largest of the scores on the two dice. State
the probability distribution of M .
ii. You are told that the sum of the scores on the two dice is at most 4. What
is the probability of at least one score being 2?
(4 marks)
(h) State whether the following are true or false and give a brief explanation. (Note
that no marks will be awarded for a simple true/false answer.)
i. A 95% confidence interval for the mean is wider than a 90% one when
obtained from the same data.
ii. A p-value is the probability of the alternative hypothesis being true.
iii. As the p-value becomes larger the null hypothesis becomes more plausible.
(6 marks)
(i) Provide an example where response bias may occur. Be brief in explaining why
response bias may occur.
(2 marks)
SECTION B
2. (a) A survey was conducted in order to examine potential differences of

opinion regarding a new taxation policy in 3 major cities of England
(London, Birmingham, Manchester). The responses were measured on a
binary scale (in favour, against) and are summarised in the table below
London Birmingham Manchester

In favour 56 34 28
Against 44 46 42
i. Test for an association between a person’s opinion on the new taxation
policy and the city of residence at two appropriate significance levels. State
the null and alternative hypotheses clearly.
ii. Comment on your results describing potential associations in detail.
Discuss the potential differences in rates in favour of the new taxation
policy across different cities.
(13 marks)
(b) You work for a market research company and your boss has asked you to carry
out a random sample survey for a car company to identify whether a new car
model is attractive to females. The main concern is to produce results of high
accuracy. You are being asked to prepare a brief summary containing the items
below. (Note you are not supposed to provide a lengthy answer. You are in
danger of losing marks should you do so.)
i. Choose an appropriate probability sampling scheme. Provide a brief
justification for your answer.
ii. Describe the sampling frame and the method of contact you will use.
Briefly explain the reasons for your choices.
iii. Provide an example in which selection bias may occur. State an action
that you would take to address this issue.
iv. State the main research question of the survey. Identify the variables
associated with this question.
(12 marks)
3. (a) A study was conducted to determine whether smoking is associated with
alcohol consumption. The data in the table below provide the number of
cigarettes smoked per day (y) and the number of alcohol units consumed per
week (x) for 12 randomly selected people.
Cigarettes per day (y)         10   20   15   17   25    5    2   13   30    3   20   10
Alcohol units per week (x)      5  7.5    5    7    8    3    2    8   11    4    5    8
Sum of x data: 73.5    Sum of the squares of x data: 522.25
Sum of y data: 170     Sum of the squares of y data: 3,246
Sum of the products of x and y data: 1,239
ii. Calculate the correlation coefficient. Interpret your findings.
iii. Calculate the least squares line of y on x and draw the line on the scatter
diagram.
iv. Obtain the predicted number of cigarettes per day for a person who
consumes 5 alcohol units per week according to the equation in (iii.).
Would you use this value to predict the number of cigarettes per day
for this person? Justify your answer.
(13 marks)
(b) A survey was conducted in order to compare the number of working hours per
day between male and female employees in a big company. A random sample
was drawn consisting of various employees in the company and the average
number of working hours per day was recorded. The data are summarised in
the following table:
Gender Sample size Working hours per day Sample standard deviation
Males 41 9.0 1.9
Females 29 7.5 1.1
i. You are asked to consider an appropriate hypothesis test to determine
whether there is a difference in the number of working hours per day
between male and female employees. Test at two appropriate significance
levels and comment on your findings. Specify the test statistic you use and
its distribution under the null hypothesis.
ii. State clearly any other assumptions you make.
iii. Give a 98% confidence interval for the mean number of working hours per
day for female employees.
(12 marks)
4. (a) An assignment is given to the students of a big class. A random sample of
students was selected afterwards and the time (in minutes) that students
needed to complete the assignment was recorded. The figures are reported
below:
39 40 44 47 32
37 25 71 56 33
64 63 42 43 34
25 28 35 24 45
35 22 53 55 36
46 46 27 27 38
i. Carefully construct, draw and label a stem-and-leaf diagram of these data
on the graph paper provided.
ii. Find the mean, the median, the interquartile range and the modal group.
iii. Comment on the data given the shape of the stem-and-leaf diagram and
(13 marks)
(b) i. A pharmaceutical company is conducting an experiment to test whether
a new cough pill is effective. The cough pill was given to 40 patients and
it reduced coughing in 14 of them. You are asked to use an appropriate
hypothesis test to determine whether the cough pill is effective. State the
test hypotheses, and specify your test statistic and its distribution under
the null hypothesis. Comment on your findings.
ii. A second experiment followed where a placebo pill was given to another
group of 30 patients. A placebo pill contains no medication and is
prescribed so that the patient will expect to get well. In some
situations, this expectation is enough for the patient to recover. This
effect, also known as the placebo effect, occurred to some extent in the
second experiment where coughing was reduced in 11 of the patients. You
are asked to consider an appropriate hypothesis test to incorporate this
new evidence with the previous data and re-assess the effectiveness of the
cough pill.
(12 marks)
END OF PAPER
ST104a Statistics 1
Examination Formula Sheet

Expected value of a discrete random variable:
$\mu = E[X] = \sum_{i=1}^{N} p_i x_i$

Standard deviation of a discrete random variable:
$\sigma = \sqrt{\sigma^2} = \sqrt{\sum_{i=1}^{N} p_i (x_i - \mu)^2}$

The transformation formula:
$Z = \frac{X - \mu}{\sigma}$

Finding Z for the sampling distribution of the sample mean:
$Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}}$

Finding Z for the sampling distribution of the sample proportion:
$Z = \frac{P - \pi}{\sqrt{\pi(1 - \pi)/n}}$

Confidence interval endpoints for a single mean (σ known):
$\bar{x} \pm z \frac{\sigma}{\sqrt{n}}$

Confidence interval endpoints for a single mean (σ unknown):
$\bar{x} \pm t_{n-1} \frac{s}{\sqrt{n}}$

Confidence interval endpoints for a single proportion:
$p \pm z \sqrt{\frac{p(1 - p)}{n}}$

Sample size determination for a mean:
$n \geq \frac{Z^2 \sigma^2}{e^2}$

Sample size determination for a proportion:
$n \geq \frac{Z^2 p(1 - p)}{e^2}$

Z-test of hypothesis for a single mean (σ known):
$Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}}$

t-test of hypothesis for a single mean (σ unknown):
$t = \frac{\bar{X} - \mu}{S / \sqrt{n}}$

Z-test of hypothesis for a single proportion:
$Z \cong \frac{p - \pi}{\sqrt{\pi(1 - \pi)/n}}$

Z-test for the difference between two means (variances known):
$Z = \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}}$

t-test for the difference between two means (variances unknown):
$t = \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{\sqrt{S_p^2 \left( \frac{1}{n_1} + \frac{1}{n_2} \right)}}$

Confidence interval endpoints for the difference between two means:
$(\bar{x}_1 - \bar{x}_2) \pm t_{n_1 + n_2 - 2} \sqrt{s_p^2 \left( \frac{1}{n_1} + \frac{1}{n_2} \right)}$

Pooled variance estimator:
$S_p^2 = \frac{(n_1 - 1) S_1^2 + (n_2 - 1) S_2^2}{n_1 + n_2 - 2}$

t-test for the difference in means in paired samples:
$t = \frac{\bar{X}_d - \mu_d}{S_d / \sqrt{n}}$

Confidence interval endpoints for the difference in means in paired samples:
$\bar{x}_d \pm t_{n-1} \frac{s_d}{\sqrt{n}}$

Z-test for the difference between two proportions:
$Z = \frac{(P_1 - P_2) - (\pi_1 - \pi_2)}{\sqrt{P(1 - P) \left( \frac{1}{n_1} + \frac{1}{n_2} \right)}}$

Pooled proportion estimator:
$P = \frac{R_1 + R_2}{n_1 + n_2}$

Confidence interval endpoints for the difference between two proportions:
$(p_1 - p_2) \pm z \sqrt{\frac{p_1(1 - p_1)}{n_1} + \frac{p_2(1 - p_2)}{n_2}}$

χ² test of association:
$\sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$

Sample correlation coefficient:
$r = \frac{\sum_{i=1}^{n} x_i y_i - n \bar{x} \bar{y}}{\sqrt{\left( \sum_{i=1}^{n} x_i^2 - n \bar{x}^2 \right) \left( \sum_{i=1}^{n} y_i^2 - n \bar{y}^2 \right)}}$

Spearman rank correlation:
$r_s = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}$

Simple linear regression line estimates:
$b = \frac{\sum_{i=1}^{n} x_i y_i - n \bar{x} \bar{y}}{\sum_{i=1}^{n} x_i^2 - n \bar{x}^2}$
$a = \bar{y} - b \bar{x}$
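The sample correlation coefficient and the regression estimates on the formula sheet need only the sample size and five sums. As an illustration, the sketch below applies them to the summary figures printed in Question 3(a) of this paper (n = 12, sum of x = 73.5, sum of y = 170, sum of x squared = 522.25, sum of y squared = 3,246, sum of products = 1,239). The function name `correlation_and_line` is hypothetical and is not part of the examination material.

```python
import math


def correlation_and_line(n, sum_x, sum_y, sum_x2, sum_y2, sum_xy):
    """Return (r, a, b) computed from summary sums, following the formula sheet."""
    x_bar = sum_x / n
    y_bar = sum_y / n
    s_xy = sum_xy - n * x_bar * y_bar   # numerator shared by r and b
    s_xx = sum_x2 - n * x_bar ** 2
    s_yy = sum_y2 - n * y_bar ** 2
    r = s_xy / math.sqrt(s_xx * s_yy)   # sample correlation coefficient
    b = s_xy / s_xx                     # slope of the least squares line of y on x
    a = y_bar - b * x_bar               # intercept
    return r, a, b


r, a, b = correlation_and_line(12, 73.5, 170, 522.25, 3246, 1239)
print(round(r, 3), round(a, 3), round(b, 3))
```

Because r and b share the same numerator, they always have the same sign, which is a quick consistency check on hand calculations.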
UNIVERSITY OF LONDON ST104A ZB

(279 004A)
[Day], ## [Month] 2012 : ##.##Xm to ##.##Xm

A list of formulae and extracts from statistical tables are given after the final question on this
paper.

SECTION A
1. (a) The following data represent different types of variables. Classify each one of
them as measurable (continuous) or categorical. If a variable is categorical,
further classify it as nominal or ordinal. Justify your answer. (Note that no
marks will be awarded without justification.)
i. The amount of time it takes each of 15 telephone installers to hook up a
wall phone.
ii. The style of music preferred by each of 30 randomly selected radio listeners.
iii. The lengths of 50 randomly selected cars.
iv. The classification of a student (First, Upper Second, Lower Second, Third,
Pass, Fail) in the course 04a: Statistics 1.
(8 marks)
(b) The number of raisins in each of 16 mini boxes for two brands are shown below:
Brand A: 22 27 20 29 24 31 25 26
Brand B: 26 29 25 33 24 35 31 27
i. Find the mean and the mode for each brand.
ii. Find the upper quartile of Brand A and the lower quartile of Brand B.
iii. The mini boxes were made in 8 different machines corresponding to each
column in the table above. Calculate the Spearman rank correlation
coefficient and interpret its value.
(13 marks)
(c) A test is taken by some students, their marks are recorded and we are
interested in the properties of the sample mean. Under the assumption that
the marks follow a Normal distribution with exact mean 65 and variance 144,
calculate the probability that the mark of a randomly selected student
i. is greater than 67.5 exactly; and
ii. lies between 63 and 67 exactly.
(4 marks)
(d) A sample of 160 students was taken and each student was questioned regarding
their preferences for a number of courses. The course in Economics was chosen
by 75 students. Calculate a 95% confidence interval for the proportion of
students in favour of Economics in the population.
(3 marks)
(e) Suppose that x1 = 3, x2 = 2, x3 = 0, x4 = 4, x5 = 1, and y1 = 1, y2 = 0,


i. $\sum_{i=1}^{4} 2(x_i - 2)$   ii. $\sum_{i=3}^{5} (x_i + y_i)$   iii. $\sum_{i=4}^{5} x_i (y_i - 3)$
(6 marks)
(f) The probability distribution of a variable X is given below.
x         1     3     4     6
p_X(x)    0.2   0.3   0.4   0.1
i. Find the probability that X is an odd number.
ii. Find the expected value of X, E(X).
(4 marks)
(g) Two fair dice are thrown.

i. Suppose that D is the absolute difference between the scores on the two
dice. State the probability distribution of D.
ii. You are told that the sum of the scores on the two dice is at least 10.
What is the probability of at least one score being 6?
(4 marks)
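(h) State whether the following are true or false and give a brief explanation. (Note
that no marks will be awarded for a simple true/false answer.)
i. A 95% confidence interval for the mean is wider than a 99% one when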
obtained from the same data.
ii. A p-value is the probability of not rejecting the null hypothesis.
iii. As the value of a chi-squared test statistic becomes larger, the associated
p-value becomes smaller.
(6 marks)
(i) Provide an example where selection bias may occur. Be brief in explaining why
selection bias may occur.
(2 marks)
SECTION B
2. (a) An experiment was conducted in order to determine whether contacting people
by phone or by letter before sending them a survey will increase the response
rate. Specifically, one group of people received a letter before getting the
survey; one group received a phone call before receiving the survey; and one
group did not receive any information before the survey arrived. For this study,
a response was defined as returning the survey within 2 weeks.
                                        no contact   letter   phone
Number of people who responded              10         17       37
Number of people who did not respond        31         22       12
i. Test for an association between the method of contact prior to the survey
and response at two appropriate significance levels. State the null and
alternative hypotheses clearly.
ii. Comment on your results describing potential associations in detail.
Discuss the potential differences in response rates for different methods
of contact.
(13 marks)
(b) You work for a market research company and your boss has asked you to carry
out a random sample survey for a mobile phone company to identify whether
a recently launched mobile phone is attractive to younger people. Limited
time and money resources are available at your disposal. You are being asked
to prepare a brief summary containing the items below. (Note you are not
supposed to provide a lengthy answer. You are in danger of losing marks should
you do so.)
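i. Choose an appropriate probability sampling scheme. Provide a brief
justification for your answer.
ii. Describe the sampling frame and the method of contact you will use.
Briefly explain the reasons for your choices.
iii. Provide an example in which response bias may occur. State an action
that you would take to address this issue.
iv. State the main research question of the survey. Identify the variables
associated with this question.
(12 marks)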
3. (a) We are interested in assessing the potential impact of the growth rate (X) of
the Gross National Product (GNP) on the birth rate (Y ) of a country. The
table below provides data for these quantities for 12 countries:
Country Birth rate (y) GNP growth rate (x)

Brazil 30 5.1
Colombia 29 3.2
Costa Rica 30 3.0
India 35 1.4
Mexico 36 3.8
Peru 36 1.0
Philippines 34 2.8
Senegal 48 -0.3
South Korea 24 6.9
Sri Lanka 27 2.5
Taiwan 21 6.2
Thailand 30 4.6

Sum of y data: 380 Sum of the squares of y data: 12,564
Sum of the products of x and y data: 1,139.7
ii. Calculate the correlation coefficient. Interpret your findings.
iii. Calculate the least squares line of y on x and draw the line on the scatter
diagram.
iv. Obtain the predicted birth rate value of a country with a GNP growth
rate of 5.0 according to the equation in (iii.). Would you use this value to
predict the birth rate of this country? Justify your answer.
(13 marks)
(b) A transport company operates two types of trucks (A and B) and wants to
compare them in terms of fuel consumption. An experiment is conducted and
the kilometers per litre (kpl) rates of various type A and type B trucks are
recorded and summarised in the following table:
          Sample size   Average kpl   Sample standard deviation
Type A        33           31.0                  7.6
Type B        40           32.2                  1.8
i. You are asked to consider an appropriate hypothesis test to determine
whether the mean distances per litre, covered by each of the two types
of trucks, are different. Test at two appropriate significance levels and
comment on your findings. Specify the test statistic you use and its
distribution under the null hypothesis.
ii. State clearly any other assumptions you make.
iii. Give a 98% confidence interval for the mean kpl rate for the type A trucks.
(12 marks)
4. (a) The following figures are the hottest daily temperatures (in degrees Celsius)
for 15 days of June at two coastal resorts:
19 20 21 21 22
22 22 22 23 23
23 23 23 23 24
24 24 24 24 25
25 25 25 25 26
26 26 27 27 28
paper provided.
ii. Find the mean, the median, the interquartile range and the modal group.
iii. Comment on the data given the shape of the histogram and the measures
(13 marks)

(b) i. A pharmaceutical company is conducting an experiment to test whether
a new type of pain reliever is effective. The pain reliever was given to 30
patients and it reduced the pain for 16 of them. You are asked to use
an appropriate hypothesis test to determine whether the pain reliever is
effective. State the test hypotheses, and specify your test statistic and its
distribution under the null hypothesis. Comment on your findings.
prescribed so that the patient will expect to get well. In some
situations, this expectation is enough for the patient to recover. This
effect, also known as the placebo effect, occurred to some extent in the
second experiment where the pain was reduced for 13 of the patients. You
are asked to consider an appropriate hypothesis test to incorporate this
new evidence with the previous data and re-assess the effectiveness of the
pain reliever.
(12 marks)
END OF PAPER
ST104A ZA
Statistics 1
Friday, 03 May 2013 : 10.00am to 12.00pm
Candidates should answer THREE of the following FOUR questions: QUESTION 1 of Section
A (50 marks) and TWO questions from Section B (25 marks each). Candidates are strongly
advised to divide their time accordingly.
A list of formulae and extracts from statistical tables are provided after the final question on this
paper.
PLEASE TURN OVER

SECTION A
1. (a) Classify each one of the following variables as measurable (continuous) or
categorical. If a variable is categorical, further classify it as nominal or ordinal.
Justify your answer. (Note that no marks will be awarded without justification.)
i. Country of birth.
ii. Favourite brand of soft drink.
iii. Rank of country by academic quality according to ratings given by
educational specialists.
iv. Temperature in degrees Celsius.
[8 marks]
(b) The table below contains the ages of the volunteers for a project in two different
years:
2011 20 18 38 18 20 18
2012 20 22 18 22 20 22 24 22 20
i. Find the mean age and the median age for each year.
ii. Calculate the range of the ages for each year and give an explanation for
any differences you find.
iii. Calculate the standard deviation of the ages for each year and give an
explanation for any differences you find.
iv. Comment on the differences in the mean and median for the two years
that you found in part i. For this data set, which do you think would give
a better description of the difference in ages: the mean or the median?
Explain briefly.
[12 marks]
(c) Monthly household expenditure in country A is normally distributed with a
mean of £1200 per month and a standard deviation of £400 per month. In country
B it is also normally distributed but with a mean of £960 per month and a
standard deviation of £200 per month. Which country has a higher proportion
of households spending less than £800? [4 marks]
(d) We would like to design a survey to estimate the average number of hours
university students spend studying per week. How many students must we
randomly select to be 95 percent confident that the sample mean is within 2
hours of the population mean? Assume that a previous survey has shown that
the standard deviation of hours spent studying is 6.95 hours. [3 marks]
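The sample size rule on the formula sheet, n ≥ Z²σ²/e², can be evaluated directly for part (d). The sketch below assumes the usual two-sided 95% normal value z = 1.96, which the question does not state explicitly, and rounds up because a sample size must be a whole number. The function name `sample_size_for_mean` is hypothetical.

```python
import math


def sample_size_for_mean(z, sigma, e):
    """Smallest whole n with n >= (z * sigma / e) ** 2."""
    return math.ceil((z * sigma / e) ** 2)


# sigma = 6.95 hours and e = 2 hours come from the question;
# z = 1.96 is an assumed table value for 95% confidence.
n_required = sample_size_for_mean(z=1.96, sigma=6.95, e=2)
print(n_required)
```

Rounding down instead of up would give an interval slightly wider than the required 2 hours either side.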
(e) Suppose that x1 = 4, x2 = −3, x3 = 5, x4 = 0, x5 = 3, and y1 = 3, y2 = 2,


i. $\sum_{i=1}^{5} x_i$   ii. $\sum_{i=2}^{5} 2 x_i (y_i + 1)$   iii. $x_2^2 + \sum_{i=1}^{3} (x_i + y_i^3)$
[6 marks]
(f) In an introductory economics class, the numbers of males and females are 16
and 24, respectively.
i. A student is selected randomly from the class. What is the probability the
student is female?
ii. A student is selected at random and removed from the class. A second
student is then selected. What is the probability that one of the students
is male and the other is female?
iii. What is the probability that the second student is male, given that the
first student is female and removed from the class?
iv. In previous years it was found that 80% of males pass the exam and 85%
of females pass the examination. Based on the available information, find
the probability that a student who passes the exam is female.
[8 marks]
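Part (f) iv. is an application of the total probability formula followed by Bayes' theorem. A minimal sketch, assuming the class composition (16 males, 24 females) gives the prior probabilities of selecting a male or a female student; the variable names are illustrative only.

```python
# Prior probabilities from the class composition in part (f).
p_female = 24 / 40
p_male = 16 / 40

# Pass rates quoted in part iv.
p_pass_given_female = 0.85
p_pass_given_male = 0.80

# Total probability of passing.
p_pass = p_female * p_pass_given_female + p_male * p_pass_given_male

# Bayes' theorem: P(female | pass).
p_female_given_pass = p_female * p_pass_given_female / p_pass
print(round(p_pass, 4), round(p_female_given_pass, 4))
```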
(g) State whether the following are true or false and give a brief explanation. (Note
that no marks will be awarded for a simple true/false answer.)
i. In an observational study, a control group provides an essential tool to
establish causal relationships.
ii. If two variables are correlated we can conclude that one causes the other.
iii. The mean income of British households can be expected to be larger than
the median income of British households.
[6 marks]
(h) In the context of sampling, explain the difference between item non-response
and unit non-response. [3 marks]
SECTION B
2. (a) A social survey in the United States asked subjects, ‘Would you say that
homeopathy is very scientific, sort of scientific, or not at all scientific?’ The table
below cross-classifies their responses with their highest level of education.
                                   Homeopathy is scientific
Highest degree           Very        Sort of      Not at all    Total
Less than High school    46 (11%)    168 (41%)    196 (48%)     410 (100%)
High school              100 (5%)    572 (31%)    1148 (63%)    1820 (100%)
College or higher        32 (2%)     248 (18%)    1076 (79%)    1356 (100%)
Total                    178 (5%)    988 (28%)    2420 (67%)    3586 (100%)
i. Based on the data in the table, and without doing a significance test, how
would you describe the relationship between education and opinion on
whether or not homeopathy is scientific? [4 marks]
ii. Calculate the χ² statistic and use it to test for independence, using a 1%
significance level. What do you conclude? [9 marks]
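The χ² statistic compares each observed count with the count expected under independence, (row total × column total) / grand total. The sketch below applies this to the 3 × 3 table of counts in part (a), with rows for the education levels and columns for Very, Sort of and Not at all. It is only an arithmetic aid: the critical value still has to be read from the statistical tables.

```python
# Observed counts from the table in 2(a); the Total row and column are excluded.
observed = [
    [46, 168, 196],     # Less than High school
    [100, 572, 1148],   # High school
    [32, 248, 1076],    # College or higher
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected count for each cell under independence.
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

# Sum of (O - E)^2 / E over all cells.
chi_squared = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

# Degrees of freedom: (rows - 1) * (columns - 1).
degrees_of_freedom = (len(row_totals) - 1) * (len(col_totals) - 1)
print(round(chi_squared, 2), degrees_of_freedom)
```

Checking that the computed row and column totals agree with the printed Total row and column is a useful guard against transcription errors.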
(b) i. Define each of the following:
– Simple random sampling
– Stratified random sampling.
[4 marks]
ii. Why might a researcher prefer to take a stratified random sample rather
than a simple random sample? Give two reasons. [3 marks]
iii. You have been asked to design a nation-wide survey in your country to find
out about the smoking habits of adults. Give two stratification factors you
might use, and explain why you have chosen them. [5 marks]
3. The level of infant mortality (y) is represented by the number of baby deaths for
every 1000 births. For 12 areas these are shown in the following table. For each
area, the percentage (x) of babies born into families earning at least £25,000 is also
shown.
Area                     A   B   C   D   E   F   G   H   I   J   K   L
Percentage (x)          20   6  10  21  12  36   6  19  26  13  21  16
Infant mortality (y)     5  17  16   8  15   5  25  12  11  11   7  12
Sum of x data: 206    Sum of the squares of x data: 4356
Sum of y data: 144    Sum of the squares of y data: 2088
Sum of the products of x and y data: 2036
(a) i. Draw a scatter diagram of these data on the graph paper provided. Label
the diagram carefully. [4 marks]
ii. Calculate the sample correlation coefficient. Interpret your findings.
[3 marks]
iii. Calculate the least squares line of y on x and draw the line on the scatter
diagram. [4 marks]
iv. Using the equation you found in iii., obtain the predicted infant mortality
for an area where 38% of babies are born into families earning at least
£25,000. Do you think this value is realistic? Justify your answer. [2 marks]
(b) A survey is conducted to compare public local attitudes towards environmental
policies. A number of people in two areas of interest are sampled, and asked
if they are satisfied with their local environmental policy. The results of this
survey are shown in the following table.
          Sample size   Number satisfied
Area A       168              127
Area B       207              132
i. You are asked to consider an appropriate hypothesis test to determine
whether there is a difference between the two areas in the proportion who
are satisfied. Test at two appropriate significance levels and comment on
your findings. Specify the test statistic you use and its distribution under
the null hypothesis. [7 marks]
ii. State clearly any other assumptions you make. [2 marks]
iii. Give a 98% confidence interval for the proportion of people in Areas A
and B combined who are satisfied, assuming the respective sample sizes
are proportional to population sizes. [3 marks]
4. (a) i. Carefully construct a box plot on the graph paper provided to display the
following yearly incomes of a group of people, measured in £1000:
9 6 12 24 21 57 6 15 9 12 30 36
[8 marks]
ii. Based on the shape of the box plot you have drawn, describe the
distribution of the data. [2 marks]
iii. Name two other types of graphical displays that would be suitable to
represent the data. Briefly explain your choices. [3 marks]
(b) A new treatment has been devised with the aim of reducing blood pressure
for people with high blood pressure. Each participant’s blood pressure was
measured before and after the program to see if the treatment is effective. The
following data were obtained:
Before After
177 174
142 146
146 144
162 159
145 145
162 163
152 156
154 150
171 172
i. Carry out an appropriate hypothesis test to determine whether the
treatment is effective for reducing blood pressure. State the test
hypotheses, and specify your test statistic and its distribution under the
null hypothesis. Comment on your findings. [6 marks]
ii. State any assumptions you made. [2 marks]
iii. Give a 90% confidence interval for the difference in means. [2 marks]
iv. On the basis of the data alone, would you recommend the programme to
a friend who suffers from high blood pressure? Explain why or why not.
[2 marks]
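For before-and-after data such as those in part (b), the paired t statistic on the formula sheet is built from the differences alone. The sketch below takes differences as before minus after (an assumed sign convention, so that a positive mean indicates a reduction) and sets μ_d = 0 under the null hypothesis. The critical value from the t distribution with n − 1 degrees of freedom still comes from the statistical tables.

```python
import math
import statistics

# Blood pressure readings from the table in part (b).
before = [177, 142, 146, 162, 145, 162, 152, 154, 171]
after = [174, 146, 144, 159, 145, 163, 156, 150, 172]

# Paired differences, before minus after.
differences = [b - a for b, a in zip(before, after)]
n = len(differences)

mean_d = statistics.mean(differences)   # sample mean of the differences
sd_d = statistics.stdev(differences)    # sample standard deviation (divisor n - 1)

# t statistic with mu_d = 0 under the null hypothesis.
t_statistic = mean_d / (sd_d / math.sqrt(n))
print(n, round(mean_d, 4), round(sd_d, 4), round(t_statistic, 4))
```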
END OF PAPER
ST104A ZA
Statistics 1
Friday, 03 May 2013 : 10.00am to 12.00pm
A list of formulae and extracts from statistical tables are provided after the final question on this
paper.
PLEASE TURN OVER

SECTION A

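1. (a) Classify each one of the following variables as measurable (continuous) or
categorical. If a variable is categorical, further classify it as nominal or ordinal.
Justify your answer. (Note that no marks will be awarded without justification.)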
i. Rank of a university according to its reputation.
ii. Country of residence.
iii. Birth-weight of a baby.
iv. Favourite pop group.
[8 marks]
(b) The table below contains the marks (out of 20) of all students taking an
examination for the same course in two years:
2011 10 9 19 9 10 9
2012 10 11 9 11 10 11 12 11 10
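i. Find the mean mark and the median mark for each year.
ii. Calculate the range of the marks for each year and give an explanation for
any differences you find.
iii. Calculate the standard deviation of the marks for each year and give an
explanation for any differences you find.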
iv. Comment on the differences in the mean and median for the two years
that you found in part i. For this data set, which do you think would give
a better description of the difference in marks: the mean or the median?
Explain briefly.
[12 marks]
(c) Weekly household expenditure in country A is normally distributed with a

standard deviation of £50 per week. Which country has a higher proportion
of households spending less than £200? [4 marks]
(d) We would like to start an internet service provider and need to estimate the
average weekly internet usage of households for our business plan. Internet
usage is measured in minutes. How many households must we randomly select
to be 95 percent confident that the sample mean is within 2 minutes of the
population mean? Assume that a previous survey of household usage has shown
that the standard deviation of internet usage is 6.95 minutes. [3 marks]
(e) Suppose that x1 = 2, x2 = −3, x3 = 6, x4 = 0, x5 = 3, and y1 = 3, y2 = 2,


i. $\sum_{i=1}^{5} x_i$   ii. $\sum_{i=2}^{5} 2 x_i (y_i + 1)$   iii. $x_2^2 + \sum_{i=1}^{3} (x_i + y_i^3)$
[6 marks]
(f) In an introductory statistics class, the numbers of males and females are 17
student is female?
student is then selected. What is the probability that one of the students
is male and the other is female?
iii. What is the probability that the second student is male, given that the
first student is female and removed from the class?
iv. In previous years it was found that 80% of males pass the exam and 85%
of females pass the examination. Based on the available information, find
the probability that a student who passes the exam is female.
[8 marks]
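(g) State whether the following are true or false and give a brief explanation. (Note
that no marks will be awarded for a simple true/false answer.)
i. An important difference between an experimental design and an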
observational study is that in an observational study data are collected
on units without any intervention.
iii. If a variable has a symmetric distribution, its mean and median are the
same.
[6 marks]
(h) In the context of sampling, explain the difference between item non-response
and unit non-response. [3 marks]
SECTION B
2. (a) The 2006 General Social Survey in the United States asked subjects, ‘Would
you say that astrology is very scientific, sort of scientific, or not at all
scientific?’ The table below cross-classifies their responses with their highest
level of education.
Astrology is scientific
High school 50 (5%) 286 (31%) 574 (63%) 910 (100%)
Total 89 (5%) 494 (28%) 1210 (67%) 1793 (100%)
i. Based on the data in the table, and without doing a significance test, how
would you describe the relationship between education and opinion on
whether or not astrology is scientific? [4 marks]
ii. Calculate the χ² statistic and use it to test for independence, using a 5%
significance level. What do you conclude? [9 marks]
(b) i. Define each of the following:
– Simple random sampling
– Stratified random sampling.
[4 marks]
ii. Why might a researcher prefer to take a stratified random sample rather
than a simple random sample? Give two reasons. [3 marks]
iii. You have been asked to design a nation-wide survey in your country to find
out about the smoking habits of adults. Give two stratification factors you
might use, and explain why you have chosen them. [5 marks]
3. The level of infant mortality (y) is represented by the number of baby deaths for
shown.
Percentage (x) 19 5 9 20 11 35 5 18 25 12 20 15

(a) i. Draw a scatter diagram of these data on the graph paper provided. Label
the diagram carefully. [4 marks]
ii. Calculate the sample correlation coefficient. Interpret your findings.
[3 marks]
iii. Calculate the least squares line of y on x and draw the line on the scatter
diagram. [4 marks]
iv. Using the equation you found in iii., obtain the predicted infant mortality
for an area where 34% of babies are born into families earning at least
£25,000. Do you think this value is realistic? Justify your answer. [2 marks]
(b) A survey is conducted to compare public attitudes towards local policing. A
number of people in two areas of interest are sampled, and asked if they are
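satisfied with their local police-community relationship. The results of this
survey are shown in the following table.
          Sample size   Number satisfied
Area A       153              115
Area B       188              120
i. You are asked to consider an appropriate hypothesis test to determine
whether there is a difference between the two areas in the proportion who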
are satisfied. Test at two appropriate significance levels and comment on
your findings. Specify the test statistic you use and its distribution under
the null hypothesis. [7 marks]
ii. State clearly any other assumptions you make. [2 marks]
iii. Give a 98% confidence interval for the proportion of people in Areas A
and B combined who are satisfied, assuming the respective sample sizes
are proportional to population sizes. [3 marks]
4. (a) i. Carefully construct a box plot on the graph paper provided to display the
3 2 4 8 7 19 2 5 3 4 10 12
[8 marks]
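ii. Based on the shape of the box plot you have drawn, describe the
distribution of the data. [2 marks]
iii. Name two other types of graphical displays that would be suitable to
represent the data. Briefly explain your choices. [3 marks]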
(b) A new fitness programme is devised for obese people. Each participant’s weight
in kg was measured before and after the program to see if the fitness program
is effective in reducing their weights. The following data were obtained:
Before After
145 143
116 120
120 118
133 130
119 119
133 134
125 128
126 123
140 141
i. Carry out an appropriate hypothesis test to determine whether the fitness
programme is effective for reducing weight. State the test hypotheses, and
specify your test statistic and its distribution under the null hypothesis.
Comment on your findings. [6 marks]
ii. State any assumptions you made. [2 marks]
iii. Give an 80% confidence interval for the difference in means. [2 marks]
iv. On the basis of the data alone, would you recommend the programme to
a friend who wants to lose weight? Explain why or why not. [2 marks]
END OF PAPER
ST104a Statistics 1

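Expected value of a discrete random variable:
$\mu = E[X] = \sum_{i=1}^{N} p_i x_i$

Standard deviation of a discrete random variable:
$\sigma = \sqrt{\sigma^2} = \sqrt{\sum_{i=1}^{N} p_i (x_i - \mu)^2}$

The transformation formula:
$Z = \dfrac{X - \mu}{\sigma}$

Finding Z for the sampling distribution of the sample mean:
$Z = \dfrac{\bar{X} - \mu}{\sigma/\sqrt{n}}$

Finding Z for the sampling distribution of the sample proportion:
$Z = \dfrac{P - \pi}{\sqrt{\pi(1-\pi)/n}}$

Confidence interval endpoints for a single mean (σ known):
$\bar{x} \pm z \dfrac{\sigma}{\sqrt{n}}$

Confidence interval endpoints for a single mean (σ unknown):
$\bar{x} \pm t_{n-1} \dfrac{s}{\sqrt{n}}$

Confidence interval endpoints for a single proportion:
$p \pm z \sqrt{\dfrac{p(1-p)}{n}}$

Sample size determination for a mean:
$n \ge \dfrac{Z^2 \sigma^2}{e^2}$

Sample size determination for a proportion:
$n \ge \dfrac{Z^2 p(1-p)}{e^2}$

Z test of hypothesis for a single mean (σ known):
$Z = \dfrac{\bar{X} - \mu}{\sigma/\sqrt{n}}$

t test of hypothesis for a single mean (σ unknown):
$t = \dfrac{\bar{X} - \mu}{S/\sqrt{n}}$

Z test of hypothesis for a single proportion:
$Z \cong \dfrac{p - \pi}{\sqrt{\pi(1-\pi)/n}}$

Z test for the difference between two means (variances known):
$Z = \dfrac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{\sqrt{\sigma_1^2/n_1 + \sigma_2^2/n_2}}$

t test for the difference between two means (variances unknown):
$t = \dfrac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{\sqrt{S_p^2 \left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}$

Confidence interval endpoints for the difference between two means:
$(\bar{x}_1 - \bar{x}_2) \pm t_{n_1+n_2-2} \sqrt{s_p^2 \left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}$

Pooled variance estimator:
$S_p^2 = \dfrac{(n_1 - 1)S_1^2 + (n_2 - 1)S_2^2}{n_1 + n_2 - 2}$

t test for the difference in means in paired samples:
$t = \dfrac{\bar{X}_d - \mu_d}{S_d/\sqrt{n}}$

Confidence interval endpoints for the difference in means in paired samples:
$\bar{x}_d \pm t_{n-1} \dfrac{s_d}{\sqrt{n}}$

Z test for the difference between two proportions:
$Z = \dfrac{(P_1 - P_2) - (\pi_1 - \pi_2)}{\sqrt{P(1-P)\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}$

Pooled proportion estimator:
$P = \dfrac{R_1 + R_2}{n_1 + n_2}$

Confidence interval endpoints for the difference between two proportions:
$(p_1 - p_2) \pm z \sqrt{\dfrac{p_1(1-p_1)}{n_1} + \dfrac{p_2(1-p_2)}{n_2}}$

χ2 statistic for a test of association:
$\sum_{i=1}^{r} \sum_{j=1}^{c} \dfrac{(O_{ij} - E_{ij})^2}{E_{ij}}$

Sample correlation coefficient:
$r = \dfrac{\sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}}{\sqrt{\left(\sum_{i=1}^{n} x_i^2 - n\bar{x}^2\right)\left(\sum_{i=1}^{n} y_i^2 - n\bar{y}^2\right)}}$

Spearman rank correlation:
$r_s = 1 - \dfrac{6\sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}$

Simple linear regression line estimates:
$b = \dfrac{\sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}}{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}$
$a = \bar{y} - b\bar{x}$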

ST104a Statistics 1
Important note
A change that took place from 2011–12 onwards is the presence of a formula sheet. The purpose of
this change is to encourage candidates to devote more time to understanding the key concepts of the
syllabus rather than memorising a large number of formulae. Nevertheless, candidates should not rely
on this formula sheet entirely but only use it for verification. The formula sheet is available on the
virtual learning environment (VLE).
Information about the subject guide
Unless otherwise stated, all cross-references will be to the latest version of the subject guide (2011).
You should always attempt to use the most recent edition of any Essential reading textbook, even if
the commentary and/or online reading list and/or subject guide refers to an earlier edition. If
different editions of Essential reading are listed, please check the VLE for reading supplements – if
none are available, please use the contents list and index of the new edition to find the relevant
section.
General remarks
Learning outcomes
By the end of this course, and having completed the Essential reading and activities, you should:
• be familiar with the key ideas of statistics that are accessible to a candidate with a
moderate mathematical competence
• be able to routinely apply a variety of methods for explaining, summarising and presenting
data and interpreting results clearly using appropriate diagrams, titles and labels when
required
• be able to summarise the ideas of randomness and variability, and the way in which these
link to probability theory to allow the systematic and logical collection of statistical
techniques of great practical importance in many applied areas
• have a grounding in probability theory and some grasp of the most common statistical
methods
• be able to perform inference to test the significance of common measures such as means and
proportions and conduct chi-squared tests of contingency tables
• be able to use simple regression and correlation analysis and know when it is appropriate to
do so.
Planning your time in the examination
You have two hours to complete this paper, which is in two parts. The first part, Section A, is
compulsory; it covers several subquestions and accounts for 50 per cent of the total marks.
Section B contains three questions, each worth 25 per cent, from which you are asked to choose two.
Remember that each of the Section B questions is likely to cover more than one topic. In 2013, for
example, the first part of Question 2 asked for a chi-squared test and survey design problems
appeared in the second. The first part of Question 3 was on regression and involved drawing a
diagram, while the second part was a hypothesis test comparing population means using the sample
data given. Question 4 had a series of questions involving drawing diagrams, hypothesis testing and
confidence intervals. This means that it is really important that you make sure you have a
reasonable idea of what topics are covered before you start work on the paper! We suggest you
divide your time as follows during the examination:
• Spend the first 10 minutes annotating the paper. Note the topics covered in each question
and subquestion.
• Allow yourself 45 minutes for Section A. Don’t allow yourself to get stuck on any one
question, but don’t just give up after two minutes!
• Once you have chosen your two Section B questions, give them about 25 minutes each.
• This leaves you with 15 minutes. Do not leave the examination hall at this point! Check
over any questions you may not have completely finished. Make sure you have labelled and
given a title to any tables or diagrams which were required and, if you did more than the
two questions required in Section B, decide which one to delete. Remember that only two of
your answers will be given credit in Section B and that you must choose which these are!
What are the Examiners looking for?
The Examiners are looking for very simple demonstrations from you. They want to be sure that you:
• have covered the syllabus as described and explained in the subject guide
• know the basic formulae given there and when and how to use them
• understand and answer the questions set.
You are not expected to write long essays where explanations or descriptions of sample design
are required, and note form answers are acceptable. However, clear and accurate language, both
mathematical and written, is expected and marked. The explanations below and in the specific
commentaries for the papers for each zone should make these requirements clear.
Key steps to improvement
The most important thing you can do is answer the question set! This may sound very simple, but
these are some of the things that candidates did not do, though asked, in the 2013 examinations!
Remember:
• If you are asked to label a diagram (which is almost always the case!), please do so. Writing
‘Histogram’ or ‘Stem-and-leaf diagram’ in itself is insufficient. What do the data describe?
What are the units? What are the x and y axes?
• If you are specifically asked to carry out a hypothesis test, or a confidence interval, do so. It
is not acceptable to do one rather than the other! If you are asked to find a 5% value, this is
what will be marked.
• Do not waste time calculating things which are not required by the Examiners. If you are
asked to find the line of best fit, you will get no marks if you calculate the correlation
coefficient as well. If you are asked to use the confidence interval you have just calculated to
comment on the results, carrying out an additional hypothesis test will not help your marks.
How should you use the specific comments on each question given in the
Examiners’ commentaries?
We hope that you find these useful. For each question and subquestion, they give:
• further guidance for each question on the points made in the last section
• the answers, or keys to the answers, which the Examiners were looking for
• the relevant detailed reference to Newbold (seventh edition) and the subject guide (2011)
• where appropriate, suggested activities from the subject guide which should help you to
prepare, and similar questions from Newbold.
Any further references you might need are given in the part of the subject guide to which you are
referred for each answer.
Question spotting
Many candidates are disappointed to find that their examination performance is poorer
than they expected. This can be due to a number of different reasons and the Examiners’
commentaries suggest ways of addressing common problems and improving your performance.
We want to draw your attention to one particular failing – ‘question spotting’, that is,
confining your examination preparation to a few question topics which have come up in past
papers for the course. This can have very serious consequences.
We recognise that candidates may not cover all topics in the syllabus in the same depth, but
you need to be aware that Examiners are free to set questions on any aspect of the syllabus.
This means that you need to study enough of the syllabus to enable you to answer the required
number of examination questions.
The syllabus can be found in the ‘Course information sheet’ in the section of the VLE dedicated
to this course. You should read the syllabus very carefully and ensure that you cover sufficient
material in preparation for the examination.
Examiners will vary the topics and questions from year to year and may well set questions that
have not appeared in past papers – every topic on the syllabus is a legitimate examination
target. So although past papers can be helpful in revision, you cannot assume that topics or
specific questions that have come up in past examinations will occur again.
If you rely on a question spotting strategy, it is likely you will find yourself in
difficulties when you sit the examination paper. We strongly advise you not to
adopt this strategy.
Comments on specific questions – Zone A
Section A
Question 1
(a) Classify each one of the following variables as measurable (continuous) or
categorical. If a variable is categorical, further classify it as nominal or ordinal.
Justify your answer. (Note that no marks will be awarded without justification.)
i. Country of birth.
ii. Favourite brand of soft drink.
iii. Rank of country by academic quality according to ratings given by
educational specialists.
iv. Temperature in degrees Celsius.
(8 marks)
Reading for this question

This question requires identifying types of variable so reading the relevant section in the
subject guide (Section 3.6) is essential. Candidates should gain familiarity with the notion of
a variable and be able to distinguish between discrete and continuous (measurable) data. In
addition to identifying whether a variable is categorical or measurable, further distinctions
between ordinal and nominal categorical variables should be made by candidates.
Approaching the question
A general tip for identifying continuous and categorical variables is to think of the possible
values they can take. If these are finite and represent specific entities the variable is
categorical. Otherwise, if these consist of numbers corresponding to measurements, the data
are continuous and the variable is measurable. Such variables may also have measurement
units or can be measured to various decimal places.
i. Each country is a category, so the possible values are one for each country. Hence, the
variable is categorical. Note also that countries do not have a natural ordering, so this
represents a categorical nominal variable.
ii. Each brand of soft drink is a category and is a potential value of this variable. Hence,
the variable is categorical. Moreover, brands of soft drinks do not have a natural
ordering, therefore this categorical variable is on a nominal scale.
iii. Each rank is a category, therefore this is a categorical variable. The values of this
variable are the ranks of each country. By definition the categories (ranks) are ordered,
thus resulting in a (categorical) ordinal variable.
iv. The data represent temperatures that can be measured to many decimal places; e.g.
19.234 degrees Celsius. This is, therefore, a measurable variable.
Weak candidates did not provide a justification for their choices, confused nominal or
categorical variables with measurable ones, and sometimes answered ordinal when their
justification pointed to a nominal variable. Writing ‘It is measurable because it can be
measured’ will not result in a high mark.
(b) The table below contains the ages of the volunteers for a project in two different
years:
2011 20 18 38 18 20 18
2012 20 22 18 22 20 22 24 22 20
i. Find the mean age and the median age for each year.
ii. Calculate the range of the ages for each year and give an explanation for any
differences you find.
iii. Calculate the standard deviation of the ages for each year and give an
explanation for any differences you find.
iv. Comment on the differences in the mean and median for the two years that
you found in part i. For this data set, which do you think would give a better
description of the difference in ages: the mean or the median? Explain
briefly.
(12 marks)
This question contains material mostly from Chapter 3 of the subject guide and in
particular Section 3.8 (Measures of location) for parts (i) and (iv), and Section 3.9
(Measures of spread) for parts (ii) and (iii).
It is important to do the summation carefully and divide by the correct number of
observations to obtain the mean. For questions that require calculations on the median (or
other percentiles like quartiles), a good strategy is to write the observations in order. Note
also that this question requires these measures for both years, so the calculations should be
done for each year separately.
i. In order to calculate the two means, you should sum the numbers corresponding to each
year and then divide them by the number of observations in each row. Doing so yields
(20 + 18 + 38 + 18 + 20 + 18)/6 = 22, for 2011,
and
(20 + 22 + 18 + 22 + 20 + 22 + 24 + 22 + 20)/9 = 21.11, for 2012.
For the median if we put the numbers in ascending order we get
18 18 18 20 20 38, for 2011,
and
18 20 20 20 22 22 22 22 24, for 2012.
The median for 2011 is given by taking the average between the 3rd and the 4th number
in the first of the rows above, resulting in a value of (18 + 20)/2 = 19. The median for
2012 is obtained from the 5th number in the 2nd row above, which is 22.
ii. Note that the range of a variable equals the difference between the maximum value and
the minimum value. Hence, the range for 2011 was 38 − 18 = 20, whereas the range for
2012 was 24 − 18 = 6. Some candidates answered ‘from 18 to 38’. While this is true, note
that it does not correspond to the definition of the range so it is essential to give the
numbers 20 (2011) and 6 (2012) in your answer.
It is also essential to comment on the different ranges between 2011 and 2012. The
difference is big and is caused by the outlier 38 in 2011.
Some candidates confused ‘Range’ and ‘Interquartile range’. Make sure that you identify
what is being asked.
iii. In order to answer this question, candidates should be familiar with Section 3.9.3 (on
variance and standard deviation) and the chapter activities. It is very important to show
your work with relevant summations of the squared deviation from the mean. In this way
you may get some marks even if the numerical answer is wrong as you are demonstrating
knowledge of the method. The answer for 2011 is 7.90, whereas for 2012 it is 1.76.
iv. The mean is higher in 2011 but the median is higher in 2012. This can be attributed to
the fact that 2011 contains an outlier (38) which results in a high mean. Apart from this
outlier, ages tend to be higher in 2012, so the median gives a somewhat better indication
of the ‘typical’ age for each year.
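The calculations in parts (i) to (iii) can be checked with a short script. This is only a sketch using Python's standard `statistics` module; the helper name `summarise` is hypothetical, and `stdev` uses the n − 1 divisor, which matches the standard deviations quoted above.

```python
from statistics import mean, median, stdev

# Ages of the volunteers, copied from the table in the question.
ages = {
    2011: [20, 18, 38, 18, 20, 18],
    2012: [20, 22, 18, 22, 20, 22, 24, 22, 20],
}


def summarise(values):
    """Return the mean, median, range and sample standard deviation (n - 1 divisor)."""
    return {
        "mean": mean(values),
        "median": median(values),
        "range": max(values) - min(values),
        "sd": stdev(values),
    }


summary = {year: summarise(values) for year, values in ages.items()}
for year, stats in summary.items():
    print(year, {name: round(value, 2) for name, value in stats.items()})
```

Removing the value 38 from the 2011 list and rerunning shows how strongly a single outlier affects the mean, the range and the standard deviation, while the median barely moves.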
(c) Monthly household expenditure in country A is normally distributed with a

standard deviation of £200 per week. Which country has a higher proportion of
households spending less than £800?
(4 marks)
This section examines the ideas of the normal random variable. Read the relevant section of
Chapter 5 and work through the examples and activities of this section. The Sample
examination questions are quite relevant.
The basic property of the normal random variable for this question is that if $X \sim N(\mu, \sigma^2)$,
then $Z = (X - \mu)/\sigma \sim N(0, 1)$. Note also that:
• $P(Z < a) = P(Z \le a) = \Phi(a)$
• $P(Z > a) = P(Z \ge a) = 1 - P(Z \le a) = 1 - P(Z < a) = 1 - \Phi(a)$
• $P(a < Z < b) = P(a \le Z < b) = P(a < Z \le b) = P(a \le Z \le b) = \Phi(b) - \Phi(a)$.
The above is all you need to find the requested proportions:

• $P(X < 800) = P\left(\dfrac{X - 1200}{400} < \dfrac{800 - 1200}{400}\right) = P(Z < -1) = 0.1587$
• $P(Y < 800) = P\left(\dfrac{Y - 960}{200} < \dfrac{800 - 960}{200}\right) = P(Z < -0.8) = 0.2119$.
So country B has a higher proportion of households spending less than £800.
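The two proportions can be verified numerically. The sketch below uses `statistics.NormalDist` from the Python standard library in place of the statistical tables; the means and standard deviations are those used in the working above (1200 and 400 for country A, 960 and 200 for country B).

```python
from statistics import NormalDist

# Parameters taken from the working above.
country_a = NormalDist(mu=1200, sigma=400)
country_b = NormalDist(mu=960, sigma=200)

# P(expenditure < 800) for each country.
p_a = country_a.cdf(800)
p_b = country_b.cdf(800)

print(f"Country A: {p_a:.4f}")
print(f"Country B: {p_b:.4f}")
print("Higher proportion below 800:", "A" if p_a > p_b else "B")
```

In the examination itself only the tables are available, so standardising to Z by hand remains the method to practise.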
(d) We would like to design a survey to estimate the average number of hours
university students spend studying per week. How many students must we
randomly select to be 95 percent confident that the sample mean is within 2
hours of the population mean? Assume that a previous survey has shown that
the standard deviation of hours spent studying is 6.95 hours.
(3 marks)
All of Chapter 6 is relevant, but the main reading for this question can be found in Section
6.1 (Choosing a sample size). It is essential to read this section carefully and attempt the
activities and exercises.
This question asks you to determine a sample size. This is straightforward once the
distribution is identified. Since the sample size is large, a normal distribution can be used.
• Identify the correct z-value: 1.96.
• Solve $1.96 \, \sigma/\sqrt{n} = 2$ for n, which gives $n = (1.96 \, \sigma/2)^2$.
We can take σ = 6.95 to find n = 46.39.
• Round up to n = 47.
Some candidates forgot to round up. Remember that you are asked about a sample size.
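The steps above can be sketched as a small function. The name `sample_size_for_mean` and its parameters are hypothetical; the z-value is taken from the standard normal distribution rather than from tables, and `math.ceil` performs the rounding up.

```python
import math
from statistics import NormalDist


def sample_size_for_mean(sigma, margin, confidence):
    """Smallest n for which z * sigma / sqrt(n) does not exceed the margin."""
    # Two-sided z-value, e.g. about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return math.ceil((z * sigma / margin) ** 2)


n_required = sample_size_for_mean(sigma=6.95, margin=2, confidence=0.95)
print(n_required)
```

Rounding up rather than to the nearest integer guarantees that the tolerance is met.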
(e) Suppose that x1 = 4, x2 = −3, x3 = 5, x4 = 0, x5 = 3, and y1 = 3, y2 = 2, y3 = 1,
y4 = 0, y5 = 1. Calculate the following quantities:
i. $\sum_{i=1}^{5} x_i$   ii. $\sum_{i=2}^{5} 2x_i(y_i + 1)$   iii. $x_2^2 + \sum_{i=1}^{3} (x_i + y_i^3)$
(6 marks)
This question refers to the basic bookwork which can be found in Section 1.9 of the subject
guide and, in particular, in Activity A1.6.
Be careful to leave the xs and ys in the order given and only cover the values of i asked for.
This question was generally done well; the answers are:
i. $\sum_{i=1}^{5} x_i = 4 + (-3) + 5 + 0 + 3 = 9$. (1 mark)
ii. $\sum_{i=2}^{5} 2x_i(y_i + 1) = 2\sum_{i=2}^{5} x_i(y_i + 1) = 2(-3 \times (2 + 1) + 5 \times (1 + 1) + 0 \times (0 + 1) + 3 \times (1 + 1)) = 2 \times 7 = 14$. (2 marks)
iii. $x_2^2 + \sum_{i=1}^{3} (x_i + y_i^3) = (-3)^2 + (4 + 3^3) + (-3 + 2^3) + (5 + 1^3) = 9 + 31 + 5 + 6 = 51$.
(3 marks)
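A quick way to check summations of this kind is to evaluate them directly. In the sketch below the helper `term_range` is a hypothetical convenience that converts the one-based index limits used in the question into Python's zero-based positions.

```python
x = [4, -3, 5, 0, 3]
y = [3, 2, 1, 0, 1]


def term_range(first, last):
    """Zero-based positions for the one-based index range first..last inclusive."""
    return range(first - 1, last)


# i. sum of x_i for i = 1, ..., 5
sum_i = sum(x[i] for i in term_range(1, 5))

# ii. sum of 2 x_i (y_i + 1) for i = 2, ..., 5
sum_ii = sum(2 * x[i] * (y[i] + 1) for i in term_range(2, 5))

# iii. x_2 squared, plus the sum of (x_i + y_i cubed) for i = 1, 2, 3
sum_iii = x[1] ** 2 + sum(x[i] + y[i] ** 3 for i in term_range(1, 3))

print(sum_i, sum_ii, sum_iii)
```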
(f ) In an introductory economics class, the numbers of males and females are 16

student is female?
student is then selected. What is the probability that one of the students is
male and the other is female?
iii. What is the probability that the second student is male, given that the first
student is female and removed from the class?
iv. In previous years it was found that 80% of males pass the exam and 85% of
females pass the exam. Based on the available information, find the
probability that a student who passes the examination is female.
(8 marks)
This is a question on probability and targets mostly the material covered in Chapter 4. It is
essential to practise this area by attempting the chapter activities and exercises as well as
accessing the material on the VLE. In particular you should attempt Activity A4.6 and
Sample examination question 4. It is also useful to familiarise yourself with probability trees
as they can be quite useful when completing such exercises.
The first three parts were straightforward for those that were familiar with this section. Part
(iv) required knowledge of Bayes’ formula or a very good understanding of probability trees.
The working out is shown below:
i. There are 24 females and 16 males in the class. Hence the answer is
24/(16 + 24) = 24/40 = 0.600.
ii. The correct answer here is $\frac{16}{40} \times \frac{24}{39} + \frac{24}{40} \times \frac{16}{39} = 0.492$. Although not necessary, the use of
a probability tree would be quite helpful here.
iii. This part can be answered in a similar way to part (i) noting that there are now 16
males and 23 females in the class. Hence 16/39 = 0.410.
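iv. By Bayes’ formula,
$P(\text{female} \mid \text{pass}) = \dfrac{P(\text{pass} \mid \text{female}) \, P(\text{female})}{P(\text{pass})}$
$= \dfrac{0.85 \times 24/40}{P(\text{pass} \cap \text{female}) + P(\text{pass} \cap \text{male})}$
$= \dfrac{0.85 \times 24/40}{0.85 \times 24/40 + 0.80 \times 16/40}$
$= 0.614$.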
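The four answers can be reproduced with exact fractions, which mirrors the branches of a probability tree. This is a sketch only; the variable names are illustrative, and the class sizes (16 males, 24 females) and pass rates are those used in the working above.

```python
from fractions import Fraction

n_male, n_female = 16, 24
total = n_male + n_female

# i. First student selected is female.
p_female = Fraction(n_female, total)

# ii. One male and one female, in either order, selecting without replacement.
p_mixed_pair = (Fraction(n_male, total) * Fraction(n_female, total - 1)
                + Fraction(n_female, total) * Fraction(n_male, total - 1))

# iii. Second student is male given that a female has already been removed.
p_male_second = Fraction(n_male, total - 1)

# iv. Bayes' formula: P(female | pass).
p_pass_given_female = Fraction(85, 100)
p_pass_given_male = Fraction(80, 100)
p_pass = p_pass_given_female * p_female + p_pass_given_male * (1 - p_female)
p_female_given_pass = p_pass_given_female * p_female / p_pass

for label, value in [("i", p_female), ("ii", p_mixed_pair),
                     ("iii", p_male_second), ("iv", p_female_given_pass)]:
    print(label, value, round(float(value), 3))
```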
i. In an observational study, a control group provides an essential tool to
establish causal relationships.
iii. The mean income of British households can be expected to be larger than
the median income of British households.
(6 marks)
important to have a good intuitive understanding of the relevant concepts than a technical
level of knowledge in computations. Part (i) requires material from Chapter 10 and, in
particular, the sections on observational studies and designed experiments. Part (ii) is about
correlation and causation detailed in Section 11.7 of the subject guide. Finally part (iii)
targets the material covered in Chapter 3.
reason for a true/false and not just a choice between the two. Some candidates also lost
marks for long rambling explanations without a decision as to whether a statement was true
or false.
i. True. A possible way to provide an explanation here is through an example: if we
want to establish the causal effects of fluoridated water, we need a control group
without fluoride in the water, but which is as similar as possible to a group with
fluoridated water. Another way is to note that randomised experiments are better tools
to establish causal relations, but we may not be able to carry out a proper experiment
(see p.156 of the subject guide).
ii. False; the correlation may be spurious, for example there may be a third variable
affecting both variables leading to a correlation.
iii. In this part it is important to realise that income is typically a right (positively) skewed
variable. Hence the statement is true since, due to the right skewness, the mean will be
bigger than the median.
(h) In the context of sampling, explain the difference between item non-response
and unit non-response.
(3 marks)
This question requires knowledge about sampling and sample surveys. Useful background
reading may be found in Chapter 9 of the subject guide. The material directly related to
this question, item non-response and unit non-response, appears on p.145. See also the
references to Newbold and Carlson given in Chapter 9 of the subject guide.
The relevant parts of p.145 are that:
• item non-response occurs when a sampled member fails to respond to one or more individual questions
• unit non-response occurs when no information is collected from a sample member.
In addition to the definitions supplied above, it would also be useful to use an example.
Section B
Question 2
(a) A social survey in the United States asked subjects, ‘Would you say that
homeopathy is very scientific, sort of scientific, or not at all scientific?’ The
table below cross-classifies their responses with their highest level of education.
Homeopathy is scientific
              Very        Sort of      Not at all    Total
High school   100 (5%)    572 (31%)    1148 (63%)    1820 (100%)
Total         178 (5%)    988 (28%)    2420 (67%)    3586 (100%)
i. Based on the data in the table, and without doing a significance test, how
whether or not homeopathy is scientific?
(4 marks)
ii. Calculate the χ2 statistic and use it to test for independence, using a 1%
significance level. What do you conclude?
(9 marks)
This part targets Chapter 8 on contingency tables and chi-square tests. Note that part (i) of
the question does not require any calculations, just understanding and interpreting
contingency tables. Candidates can attempt Activity A8.4 to practise. Part (ii) is a
straightforward chi-squared test and the reading is also given in Chapter 8.
i. Using the percentages we see that the higher someone’s education, the smaller the belief
that homeopathy is very scientific and the higher the belief that it is not at all scientific.
For example, 79% of those who attended college or higher education responded that
homeopathy is not at all scientific, whereas the corresponding proportion for those with
less than high school education is 48%.
ii. Set out the null hypothesis that there is no association between education and views on
homeopathy against the alternative, that there is an association. Be careful to get these
the correct way round!
H0 : No association between education and views on homeopathy versus
H1 : Association between education and views on homeopathy.
Work out the expected values to obtain the table below:
20.3514    112.962    276.687
90.3402    501.439    1228.22
67.3084    373.6      915.092
The test statistic is
$\sum_{i,j} \dfrac{(O_{i,j} - E_{i,j})^2}{E_{i,j}}$,
which gives a value of 187.913. This is a 3 × 3 contingency table so the degrees of
freedom are (3 − 1) × (3 − 1) = 4.
For α = 0.05 ⇒ the critical value is 9.488, hence we reject H0 . For a second (stronger)
significance level, say 1%, the critical value is 13.277, hence again we reject H0 .
We conclude that the association between views on homeopathy and educational level is
highly significant.
Saying ‘we reject at the 5% level, but not at 10%’ is insufficient. What does this mean?
Is there a connection or not? If there is one, how strong is it? These questions needed to be
answered for the full nine marks allocated to this question to be awarded. Many candidates
lost marks by failing to follow up in this way.
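As an illustration of how each expected value is obtained (row total × column total / grand total), the sketch below recomputes the expected counts for the High school row from the totals in the question and then compares the reported statistic with the two critical values quoted above. It is a partial check only: the contribution computed here is for one row, not the full χ2 statistic.

```python
# Observed counts and totals for the High school row, as given in the question.
observed_high_school = [100, 572, 1148]
row_total = 1820
column_totals = [178, 988, 2420]
grand_total = 3586

# Expected count under independence: row total * column total / grand total.
expected_high_school = [row_total * col / grand_total for col in column_totals]

# Contribution of this row to the chi-squared statistic.
row_contribution = sum((o - e) ** 2 / e
                       for o, e in zip(observed_high_school, expected_high_school))

# Degrees of freedom for an r x c table, and the decision at each level.
rows, cols = 3, 3
degrees_of_freedom = (rows - 1) * (cols - 1)
reported_statistic = 187.913                # value reported in the commentary
critical_values = {0.05: 9.488, 0.01: 13.277}  # from chi-squared tables, 4 d.f.
decisions = {alpha: reported_statistic > cv for alpha, cv in critical_values.items()}

print([round(e, 3) for e in expected_high_school])
print(degrees_of_freedom, decisions)
```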
(b) i. Define each of the following:

— Simple random sampling
— Stratified random sampling.
(4 marks)
ii. Why might a researcher prefer to take a stratified random sample rather
than a simple random sample? Give two reasons.
(3 marks)
iii. You have been asked to design a nation-wide survey in your country to find
out about the smoking habits of adults. Give two stratification factors you
might use, and explain why you have chosen them.
(5 marks)
This question on basic material on survey designs required background reading from
Chapters 9 and 10 of the subject guide which, along with the recommended reading should
be looked at carefully. Candidates were expected to have studied and understood the main
important constituents of design in random sampling. It is also a good idea to try the
activities in Chapter 9.
One of the main things to avoid here is writing an answer without any structure. This
exercise asks for specific things and each one of them requires one or two lines. If you are
unsure of what these specific things are, do not write lengthy essays. This is a waste of
your valuable examination time. If you can identify what is being asked, keep in mind that
the answer should not be long. Note also that in some cases there is no unique answer
to the question.
i. Simple random sampling:
• Every sample has equal probability.
• With replacement.
Stratified random sampling:
• Population divided into strata (or groups).
• Random sample from each group.
ii. There are generally two main reasons why one would prefer stratified to simple random
sampling.
• Potentially more precision of parameter estimates.
• Obtain information about subgroups.
iii. In this part you can choose factors based on two arguments. First, you can aim for
factors whose subgroups differ regarding smoking habits (e.g. gender, ethnic groups, age
groups etc.). In that way the stratified sampling scheme will have increased precision.
Alternatively you can just suggest factors that are interesting from a research point of
view.
Question 3
The level of infant mortality (y) is represented by the number of baby deaths for
shown.
Percentage (x) 20 6 10 21 12 36 6 19 26 13 21 16

(a) i. Draw a scatter diagram of these data on the graph paper provided. Label the
diagram carefully.
(4 marks)
ii. Calculate the sample correlation coefficient. Interpret your findings.

(3 marks)
diagram.
(4 marks)
iv. Using the equation you found in iii., obtain the predicted infant mortality for
an area where 38% of babies are born into families earning at least £25,000.
Do you think this value is realistic? Justify your answer.
(2 marks)
This is a standard regression question and the reading is to be found in Chapter 11. Section
11.6 provides details for scatter diagrams and is suitable for part (i) whereas the remaining
parts focus on correlation and regression and are covered in Sections 11.8 to 11.10 of the
subject guide. Section 11.7 is also relevant. Sample examination question 2 from this
chapter is recommended for practice on questions of this type.
i. The scatter diagram should include a full title (‘Scatter diagram’ alone will not suffice) and labelled
axes, including information about units. Far too many candidates threw away marks by
neglecting these points and consequently were only given one mark out of the possible
four allocated for this part of the question. Another common way of losing marks was
failing to use the graph paper which was provided, and required, in the question.
Candidates who drew on the ordinary paper in their booklet were not awarded marks for
this part of the question.
[Figure: scatter diagram titled ‘Infant mortality and economic class’. x-axis: percentage of babies born into families earning at least 25,000 pounds; y-axis: infant mortality (number of baby deaths for every 1000 births).]
ii. The summary statistics can be substituted into the formula for the correlation coefficient
(make sure you know which one it is!) to obtain the value −0.8026. An interpretation of
this value is the following: The data suggest that the higher the percentage of families
earning at least a certain income, the lower the mortality. The fact that the value is very
close to −1, suggests that this is a strong (negative) association.
iii. The regression line can be written by the equation $\hat{y} = a + bx$ or $y = a + bx + \varepsilon$. The
formula for b is
$b = \dfrac{\sum x_i y_i - n\bar{x}\bar{y}}{\sum x_i^2 - n\bar{x}^2}$,
and by substituting the summary statistics we get b = −0.5319.
Hence the regression line can be written as ŷ = 21.1314 − 0.5319x or
y = 21.1314 − 0.5319x + ε. It should also be plotted in the scatter diagram.
iv. The prediction will be ŷ = 21.1314 − 0.5319 × 38 = 0.918 infant mortality (number of
baby deaths for every 1,000 births). However, since this point is outside the observed
range of x, this prediction should not be trusted as it is based on extrapolation.
Many candidates did not give the measurement units here. These are essential in
answering such a question and a mark is deducted if they are not specified.
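The prediction and the extrapolation check can be sketched as follows. The coefficients are the rounded values quoted above, so the result agrees with the quoted prediction only to about two decimal places; the function names `predict` and `is_extrapolation` are hypothetical.

```python
# Rounded regression estimates quoted in the commentary.
a, b = 21.1314, -0.5319

# Observed percentages (x) from the question.
x_observed = [20, 6, 10, 21, 12, 36, 6, 19, 26, 13, 21, 16]


def predict(x):
    """Predicted infant mortality (baby deaths per 1,000 births) at percentage x."""
    return a + b * x


def is_extrapolation(x):
    """True when x lies outside the range of the observed x values."""
    return not (min(x_observed) <= x <= max(x_observed))


x_new = 38
print(round(predict(x_new), 2), "deaths per 1,000 births")
print("Extrapolation:", is_extrapolation(x_new))
```

Because 38 exceeds the largest observed percentage (36), the check flags the prediction as an extrapolation, which is the reason given above for not trusting it.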
(b) A survey is conducted to compare public local attitudes towards environmental

policies. A number of people in two areas of interest are sampled, and asked if
they are satisfied with their local environmental policy. The results of this
Sample size Number satisfied
Area A 168 127
Area B 207 132

whether there is a difference between the two areas in the proportion who
are satisfied. Test at two appropriate significance levels and comment on
your findings. Specify the test statistic you use and its distribution under the
null hypothesis.
(7 marks)
(2 marks)
iii. Give a 98% confidence interval for the proportion of people in Areas A and B
combined who are satisfied, assuming the respective sample sizes are
proportional to population sizes.
(3 marks)
The first two parts of the question refer to a two-sided hypothesis test comparing
proportions. While the entire chapter on hypothesis testing is relevant, one can focus on the
sections involving proportions (Sections 7.14 and 7.15). The last part of the question is on
confidence intervals that are located in Chapter 6 and, in particular (confidence intervals for
proportions), in Section 6.10.
i. The null hypothesis is that the proportions of the two areas (πA and πB ) do not differ,
the alternative is that they do.
H0 : πA = πB versus H1 : πA ≠ πB .
The test statistic is provided in the formula sheet (note that it is based on the pooled
variance):
$Z = \dfrac{P_A - P_B}{\sqrt{\dfrac{P(1-P)}{n_A} + \dfrac{P(1-P)}{n_B}}}$
where P = (127 + 132)/(168 + 207) = 0.690667.

The test statistic value is 2.464 (PA = 0.7560, PB = 0.6377, pooled se = 0.0480). The
critical value at the 5% level, assuming a normal approximation as the number of
observations is large, is ±1.96. Hence, we reject the null hypothesis suggesting evidence
for a difference between the two areas. If we take a (smaller) α of 1%, the critical value is
±2.576, so we do not reject H0 . We conclude that there is some, but not strong, evidence
of a difference between the two areas.
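The calculation above can be checked with a short sketch. The function name `pooled_two_proportion_z` is introduced here for illustration only; the counts and the critical values 1.96 and 2.576 are those quoted in the commentary.

```python
from math import sqrt


def pooled_two_proportion_z(x_a, n_a, x_b, n_b):
    """Return (z, pooled proportion, pooled standard error) for H0: pi_A = pi_B."""
    p_a = x_a / n_a
    p_b = x_b / n_b
    # Pooled estimate of the common proportion under H0.
    p = (x_a + x_b) / (n_a + n_b)
    se = sqrt(p * (1 - p) * (1 / n_a + 1 / n_b))
    return (p_a - p_b) / se, p, se


z, p_pooled, se_pooled = pooled_two_proportion_z(127, 168, 132, 207)

# Two-sided test: compare |z| with the tabulated normal critical values.
reject_at_5_percent = abs(z) > 1.96
reject_at_1_percent = abs(z) > 2.576

print(round(z, 3), round(p_pooled, 6), round(se_pooled, 4))
print(reject_at_5_percent, reject_at_1_percent)
```

Rejecting at 5% but not at 1% is what leads to the wording 'some, but not strong, evidence'.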
ii. The assumptions included:
• Assumption about whether nA + nB − 2 is ‘large’, hence t v. z
iii. The question is a standard exercise in confidence intervals. Note the question refers to
areas A and B combined. The working is given below:
• Correct quantile: zα/2 = 2.326.
• Correct endpoints: 0.635 and 0.746. (Also accept two decimal places.)
• Report as an interval: (0.635, 0.746). (Also accept between 0.635 and 0.746.)
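As a check on the endpoints, the interval can be reproduced as follows. This is a minimal sketch: `proportion_ci` is a name chosen here, and 2.326 is the tabulated quantile quoted above.

```python
from math import sqrt


def proportion_ci(successes, n, z_quantile):
    """Normal-approximation confidence interval for a single proportion."""
    p = successes / n
    half_width = z_quantile * sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width


# Areas A and B combined: 127 + 132 satisfied out of 168 + 207 sampled.
lower, upper = proportion_ci(127 + 132, 168 + 207, 2.326)
print(round(lower, 3), round(upper, 3))
```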
Question 4
(a) i. Carefully construct a box plot on the graph paper provided to display the
9 6 12 24 21 57 6 15 9 12 30 36
(8 marks)
ii. Based on the shape of the box plot you have drawn, describe the distribution
of the data
(2 marks)
represent the data. Briefly explain your choices.
(3 marks)
Chapter 3 provides all the relevant material for this question. More specifically, information
on boxplots can be found in Section 3.9.2, but all of Sections 3.8 and 3.9 are highly relevant.
i. The boxplot diagram the Examiners were hoping to see is shown below. Marks were
awarded for including the title, identifying the box and the whiskers and noting the
outlier, with reasonable accuracy.
In order to identify the box, the quartiles are needed; these are 9 and 25.5, giving an
interquartile range of 16.5. The median, which is 13.5, is also needed.
Hence the outlier limits are from 0 to 50.25. (−15.75 to 50.25 is also allowed.)
The extreme outlier limits are then from 0 to 75. (−40.5 to 75 is also allowed.)
Hence 57 is an outlier but not an extreme outlier.
Note that you did not need to label the x axis and that the plot can be transposed.
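The quantities needed for the box plot can be sketched in a few lines. Quartile conventions differ between textbooks and software; the assumption here is that the 'inclusive' interpolation method of Python's `statistics.quantiles` is acceptable, and it happens to reproduce the quartiles 9 and 25.5 quoted above. The fences use the usual 1.5 × IQR and 3 × IQR rules.

```python
from statistics import quantiles

incomes = [9, 6, 12, 24, 21, 57, 6, 15, 9, 12, 30, 36]

# Lower quartile, median and upper quartile (inclusive interpolation).
q1, median, q3 = quantiles(incomes, n=4, method="inclusive")
iqr = q3 - q1

# Limits beyond which a point is an outlier, or an extreme outlier.
outlier_limits = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)
extreme_limits = (q1 - 3 * iqr, q3 + 3 * iqr)

outliers = [x for x in incomes if x < outlier_limits[0] or x > outlier_limits[1]]
extreme_outliers = [x for x in incomes if x < extreme_limits[0] or x > extreme_limits[1]]

print(q1, median, q3, iqr)
print(outlier_limits, extreme_limits)
print(outliers, extreme_outliers)
```

Since incomes cannot be negative, the lower limits may be replaced by 0 when drawing the plot.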
ii. Based on the shape of the boxplot, we can see that the distribution of the data is
positively skewed.
iii. A histogram or stem-and-leaf diagram are other types of suitable graphical displays. The
variable income is measurable and these graphs are suitable for displaying the
distribution of such variables.
[Figure: box plot titled ‘Distribution of Income’; vertical axis: income in thousands of pounds, marked from 10 to 60.]
(b) A new treatment has been devised with the aim of reducing blood pressure for
people with high blood pressure. Each participant’s blood pressure was
measured before and after the program to see if the treatment is effective. The
following data were obtained:
Before After
177 174
142 146
146 144
162 159
145 145
162 163
152 156
154 150
171 172
i. Carry out an appropriate hypothesis test to determine whether the
treatment is effective for reducing blood pressure. State the test hypotheses,
and specify your test statistic and its distribution under the null hypothesis.
Comment on your findings.
(6 marks)
ii. State any assumptions you made.
(2 marks)
iii. Give a 90% confidence interval for the difference in means.
(2 marks)
iv. On the basis of the data alone, would you recommend the programme to a
friend who suffers from high blood pressure? Explain why or why not.
(2 marks)
Look up the sections about hypothesis testing for testing differences in means. However, it
is essential for this part of the question to focus on the section of the subject guide
regarding paired samples (Section 7.16.4).
i. Regarding hypotheses, note that the word ‘effective’ suggests a one-sided test:
H0 : µbefore = µafter , H1 : µbefore > µafter
In this part, it is also essential to realise that we have a paired sample, as we have two
observations for each person (before and after treatment). Hence the difference for each
person should be calculated:
3 −4 2 3 0 −1 −4 4 −1
The next step is to calculate sd = 2.991, x̄d = 0.2222, in order to obtain the value of the
test statistic
$$\frac{\bar{x}_d - 0}{s_d/\sqrt{n}} = 0.2229.$$
We have the t distribution with eight degrees of freedom, hence the critical value (for a
one-sided test) is 1.860.
Hence, we do not reject H0 at the 5% level. Testing at the 10% level gives a critical value
of t8,0.1 = 1.397. Therefore, we still do not reject H0 . There is no significant evidence
that the treatment is effective.
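The paired calculation can be verified with the sketch below. Variable names are chosen here for illustration; the critical values 1.860 and 1.397 are the table values quoted above rather than computed ones.

```python
from math import sqrt
from statistics import mean, stdev

before = [177, 142, 146, 162, 145, 162, 152, 154, 171]
after = [174, 146, 144, 159, 145, 163, 156, 150, 172]

# Paired design: work with one difference per participant.
diffs = [b - a for b, a in zip(before, after)]
n = len(diffs)

d_bar = mean(diffs)
s_d = stdev(diffs)  # sample standard deviation (divisor n - 1)
t_stat = (d_bar - 0) / (s_d / sqrt(n))

# One-sided test against t with n - 1 = 8 degrees of freedom.
reject_at_5_percent = t_stat > 1.860
reject_at_10_percent = t_stat > 1.397

print(diffs)
print(round(d_bar, 4), round(s_d, 3), round(t_stat, 4))
print(reject_at_5_percent, reject_at_10_percent)
```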
ii. • Differences normally distributed (no marks for normally distributed blood pressure).
• Pairs of observations are independent (a weaker condition which suffices is that the
differences are independent, but this is unlikely if observations are not).
iii. This is a straightforward exercise for confidence intervals given the appropriate formula
from the formula sheet (make sure that you can recognise it). The requested confidence
interval is (−1.6316, 2.0766).
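A sketch of the interval calculation, assuming the formula is the usual mean difference plus or minus a t quantile times the standard error. The quantile 1.860 is the rounded table value for eight degrees of freedom, so the endpoints agree with those quoted only to about three decimal places.

```python
from math import sqrt
from statistics import mean, stdev

diffs = [3, -4, 2, 3, 0, -1, -4, 4, -1]  # before minus after
T_8_UPPER_5_PERCENT = 1.860  # table value; gives a two-sided 90% interval


def paired_ci(d, t_quantile):
    """Confidence interval for the mean of paired differences."""
    centre = mean(d)
    half_width = t_quantile * stdev(d) / sqrt(len(d))
    return centre - half_width, centre + half_width


ci_lower, ci_upper = paired_ci(diffs, T_8_UPPER_5_PERCENT)
print(round(ci_lower, 3), round(ci_upper, 3))
```

The interval comfortably contains zero, which is consistent with not rejecting H0 in part i.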
iv. The evidence in the data that the treatment works is close to negligible as can be seen,
for example, from the 90% confidence interval, so there is no reason to recommend the
treatment on the basis of the data alone.
Important note
section.
Comments on specific questions – Zone B
Section A
Question 1

i. Rank of a university according to its reputation.
ii. Country of residence.
iii. Birth-weight of a baby.
iv. Favourite pop group.
(8 marks)
i. Each rank is a category, therefore this is a categorical variable. The values of this
variable are the ranks of each university. By definition the categories (ranks) are ordered,
thus resulting in a (categorical) ordinal variable.
ii. Each country is a category, so the possible values are one for each country. Hence, the
iii. The data represent weights of babies at birth that can be measured to many decimal
places; for example 5.234 kgs. This is, therefore, a measurable variable.
iv. Each pop group is a category and is also a potential value of this variable. Hence, the
variable is categorical. Moreover, pop groups do not have a natural ordering, therefore
this categorical variable is on a nominal scale.
was pointing to a nominal variable. Writing ‘It is measurable because it can be measured’
will not result in a high mark.
(b) The table below contains the marks (out of 20) of all students taking an
examination for the same course in two years:
2011 10 9 19 9 10 9
2012 10 11 9 11 10 11 12 11 10
any differences you find.
explanation for any differences you find.
iv. Comment on the differences in the mean and median for the two years that
you found in part i. For this data set, which do you think would give a better
description of the difference in marks: the mean or the median? Explain
briefly.
(12 marks)
This question contains material mostly from Chapter 3 of the subject guide and, in
particular, Section 3.8 (Measures of location) for parts (i) and (iv), and Section 3.9
(Measures of spread) for parts (ii) and (iii).
It is important to do the summation carefully and divide by the correct number of
observations to obtain the mean. For questions that require calculations on the median (or
also that this question requires these measures for both years, so the calculations should be
done for each year separately.
year and then divide them by the number of observations in each row. Doing so yields
(10 + 9 + 19 + 9 + 10 + 9)/6 = 11, for 2011,
and
(10 + 11 + 9 + 11 + 10 + 11 + 12 + 11 + 10)/9 = 10.56, for 2012.
For the median if we put the numbers in ascending order we get
9 9 9 10 10 19, for 2011,
and
9 10 10 10 11 11 11 11 12, for 2012.
The median for 2011 is given by taking the average between the 3rd and the 4th number
in the first of the rows above, resulting in a value of (9 + 10)/2 = 9.5. The median for
2012 is obtained from the 5th number in the 2nd row above, which is 11.
ii. Note that the range of a variable equals the difference between the maximum value and
the minimum value. Hence, the range for 2011 was 19 − 9 = 10, whereas the range for
2012 was 12 − 9 = 3. Some candidates answered ‘from 9 to 19’. While this is true, note
that it does not correspond to the definition of the range so it is essential to give the
numbers 10 (2011) and 3 (2012) in your answer.
Some candidates confused ‘Range’ and ‘Interquartile range’. Make sure that you identify
what is being asked.
iii. In order to answer this question, candidates should be familiar with Section 3.9.3 (on
variance and standard deviation) and the chapter activities. It is very important to show
your work with relevant summations of the squared deviation from the mean. In this way
you may get some marks even if the numerical answer is wrong as you are demonstrating
knowledge of the method. The answer for 2011 is 3.95, whereas for 2012 it is 0.88.
iv. The mean is higher in 2011 but the median is higher in 2012. This can be attributed to
the fact that 2011 contains an outlier (19) which results in a high mean. Apart from this
outlier, marks tend to be higher in 2012, so the median gives a somewhat better
indication of the ‘typical’ mark for each year.
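The summary measures for parts (i) to (iii) can be reproduced with the standard library. This is only a check on the arithmetic; in the examination the working must still be shown.

```python
from statistics import mean, median, stdev

marks = {
    2011: [10, 9, 19, 9, 10, 9],
    2012: [10, 11, 9, 11, 10, 11, 12, 11, 10],
}

summary = {}
for year, values in marks.items():
    summary[year] = {
        "mean": mean(values),
        "median": median(values),
        "range": max(values) - min(values),  # a single number, not 'from 9 to 19'
        "sd": stdev(values),                 # sample standard deviation
    }

for year, stats in summary.items():
    print(year, {name: round(value, 2) for name, value in stats.items()})
```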
(c) Weekly household expenditure in country A is normally distributed with a

standard deviation of £50 per week. Which country has a higher proportion of
households spending less than £200?
(4 marks)
Chapter 5 and work through the examples and activities of this section. The Sample
then Z = (X − µ)/σ
• P (Z < a) = P (Z ≤ a) = Φ(a)
• P (Z > a) = P (Z ≥ a) = 1 − P (Z ≤ a) = 1 − P (Z < a) = 1 − Φ(a)
• $P(X < 200) = P\left(\frac{X - 300}{100} < \frac{200 - 300}{100}\right) = P(Z < -1) = 0.1587$

• $P(Y < 200) = P\left(\frac{Y - 240}{50} < \frac{200 - 240}{50}\right) = P(Z < -0.8) = 0.2119.$
So country B has a higher proportion of households spending less than £200.
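The two probabilities can be checked without tables. The sketch assumes the parameters used in the working above (mean 300 and standard deviation 100 for country A, mean 240 and standard deviation 50 for country B).

```python
from statistics import NormalDist

# Parameters as they appear in the standardisation steps above.
country_a = NormalDist(mu=300, sigma=100)
country_b = NormalDist(mu=240, sigma=50)

p_a = country_a.cdf(200)  # P(X < 200) = Phi(-1)
p_b = country_b.cdf(200)  # P(Y < 200) = Phi(-0.8)

print(round(p_a, 4), round(p_b, 4))
print("B" if p_b > p_a else "A")
```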
(d) We would like to start an internet service provider and need to estimate the
average weekly internet usage of households for our business plan. Internet
usage is measured in minutes. How many households must we randomly select
to be 95 percent confident that the sample mean is within 2 minutes of the
population mean? Assume that a previous survey of household usage has shown
that the standard deviation of internet usage is 6.95 minutes.
(3 marks)
All of Chapter 6 is relevant, but the main reading for this question can be found in Section
6.1 (Choosing a sample size). It is essential to read this section carefully and attempt the
activities and exercises.
This question asks you to determine a sample size. This is straightforward once the
distribution is identified. Since the sample size is large, a normal distribution can be used.
• Identify the correct z-value: 1.96.
• Solve
$$1.96 \frac{\sigma}{\sqrt{n}} = 2.$$
We can take σ = 6.95 to find n = 46.39.
• Round up to n = 47.
Some candidates forgot to round up. Remember that you are asked about a sample size.
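The steps above amount to rearranging the margin-of-error equation for n and rounding up. A minimal sketch, with `required_sample_size` as an illustrative name:

```python
from math import ceil


def required_sample_size(sigma, margin, z=1.96):
    """Solve z * sigma / sqrt(n) = margin for n (not yet rounded)."""
    return (z * sigma / margin) ** 2


n_exact = required_sample_size(sigma=6.95, margin=2)
n_required = ceil(n_exact)  # always round up: a smaller n would miss the target

print(round(n_exact, 2), n_required)
```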
(e) Suppose that x1 = 2, x2 = −3, x3 = 6, x4 = 0, x5 = 3, and y1 = 3, y2 = 2, y3 = 1,

y4 = 0, y5 = 1. Calculate the following quantities:
i. $\sum_{i=1}^{5} x_i$   ii. $\sum_{i=2}^{5} 2x_i(y_i + 1)$   iii. $x_2^2 + \sum_{i=1}^{3} (x_i + y_i^3)$
(6 marks)
guide and, in particular, in Activity A1.6.
Be careful to leave the xs and ys in the order given and only cover the values of i asked for.
This question was generally done well; the answers are:
i. $\sum_{i=1}^{5} x_i = 2 + (-3) + 6 + 0 + 3 = 8$. (1 mark)
ii. $\sum_{i=2}^{5} 2x_i(y_i + 1) = 2\sum_{i=2}^{5} x_i(y_i + 1) = 2(-3 \times (2 + 1) + 6 \times (1 + 1) + 0 \times (0 + 1) + 3 \times (1 + 1)) = 2 \times 9 = 18$. (2 marks)
iii. $x_2^2 + \sum_{i=1}^{3} (x_i + y_i^3) = (-3)^2 + (2 + 3^3) + (-3 + 2^3) + (6 + 1^3) = 9 + 29 + 5 + 7 = 50$.
(3 marks)
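The three sums can be checked mechanically. The only subtlety is that the question indexes from 1 while Python lists index from 0, so term i is stored at position i − 1.

```python
x = [2, -3, 6, 0, 3]
y = [3, 2, 1, 0, 1]

# i. Sum of x_i for i = 1..5
part_i = sum(x[i - 1] for i in range(1, 6))

# ii. Sum of 2 * x_i * (y_i + 1) for i = 2..5 (note the lower limit is 2)
part_ii = sum(2 * x[i - 1] * (y[i - 1] + 1) for i in range(2, 6))

# iii. x_2 squared plus the sum of (x_i + y_i cubed) for i = 1..3
part_iii = x[2 - 1] ** 2 + sum(x[i - 1] + y[i - 1] ** 3 for i in range(1, 4))

print(part_i, part_ii, part_iii)
```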
(f ) In an introductory statistics class, the numbers of males and females are 17 and
23, respectively.
student is female?
student is then selected. What is the probability that one of the students is
male and the other is female?
iii. What is the probability that the second student is male, given that the first
student is female and removed from the class?
iv. In previous years it was found that 80% of males pass the exam and 85% of
females pass the exam. Based on the available information, find the
probability that a student who passes the examination is female.
(8 marks)
This is a question on probability and targets mostly the material covered in Chapter 4. It is
essential to practise this area by attempting the chapter activities and exercises as well as
accessing the material on the VLE. In particular you can attempt Activity A4.6 and Sample
examination question 4. It is also useful to familiarise yourself with probability trees as they
can be quite useful when completing such exercises.
The first three parts were straightforward for those that were familiar with this section. Part
(iv) required knowledge of Bayes’ formula or a very good understanding of probability trees.
The working out is shown below:
i. There are 23 females and 17 males in the class. Hence the answer is
23/(17 + 23) = 23/40 = 0.575.
ii. The correct answer here is (17/40) × (23/39) + (23/40) × (17/39) = 0.501. Although not
necessary, the use of a probability tree would be quite helpful here.
iii. This part can be answered in a similar way to part (i) noting that there are now 17
males and 22 females in the class. Hence 17/39 = 0.436.
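iv.
$$\begin{aligned}
P(\text{female} \mid \text{pass}) &= \frac{P(\text{pass} \mid \text{female})\,P(\text{female})}{P(\text{pass})} \\
&= \frac{0.85 \times 23/40}{P(\text{pass} \cap \text{female}) + P(\text{pass} \cap \text{male})} \\
&= \frac{0.85 \times 23/40}{0.85 \times 23/40 + 0.80 \times 17/40} \\
&= 0.5897.
\end{aligned}$$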
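The four answers can be verified with exact fractions, which avoids rounding until the final step. Variable names are chosen here for illustration.

```python
from fractions import Fraction

males, females = 17, 23
total = males + females

# i. A randomly selected student is female.
p_female = Fraction(females, total)

# ii. Two students drawn without replacement: one male and one female, in either order.
p_one_of_each = (Fraction(males, total) * Fraction(females, total - 1)
                 + Fraction(females, total) * Fraction(males, total - 1))

# iii. Second student is male, given the first was female and has been removed.
p_male_second = Fraction(males, total - 1)

# iv. Bayes' formula: P(female | pass).
p_pass_given_female = Fraction(85, 100)
p_pass_given_male = Fraction(80, 100)
p_pass = p_pass_given_female * p_female + p_pass_given_male * (1 - p_female)
p_female_given_pass = p_pass_given_female * p_female / p_pass

for value in (p_female, p_one_of_each, p_male_second, p_female_given_pass):
    print(value, round(float(value), 4))
```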
i. An important difference between an experimental design and an
observational study is that in an observational study data are collected on
units without any intervention.
iii. If a variable has a symmetric distribution, its mean and median are the same.
(6 marks)
important to have a good intuitive understanding of the relevant concepts than a technical
level of knowledge in computations. Part (i) requires material from Chapter 10 and, in
particular, the sections on observational studies and designed experiments. Part (ii) is about
correlation and causation detailed in Section 11.7 of the subject guide. Finally part (iii)
targets the material covered in Chapter 3.
reason for a true/false and not just a choice between the two. Some candidates also lost
or false.
i. True. One way to explain this is through an example: in an experimental design
some units are administered a treatment, whereas no such intervention takes place
in an observational study.
Note: candidates should indicate in some way that they know what the assertion means,
such as via an example (see p.156 of the subject guide).
ii. False; the correlation may be spurious, for example there may be a third variable
affecting both variables leading to a correlation.
iii. True; mean and median are at the centre of symmetry.
(h) In the context of sampling, explain the difference between item non-response
and unit non-response.
(3 marks)
This question requires knowledge about sampling and sample surveys. Useful background
reading may be found in Chapter 9 of the subject guide. The material directly related to
this question, item non-response and unit non-response, appears on p.145. See also the
references to Newbold and Carlson given in Chapter 9 of the subject guide.
The relevant parts of p.145 are that:
• item non-response occurs when a sampled member fails to respond to a particular question
• unit non-response occurs when no information is collected from a sample member.
In addition to the definitions supplied above, it would also be useful to use an example.
Section B
Question 2
(a) The 2006 General Social Survey in the United States asked subjects, ‘Would you
say that astrology is very scientific, sort of scientific, or not at all scientific?’ The
table below cross-classifies their responses with their highest level of education.
Astrology is scientific
High school 50 (5%) 286 (31%) 574 (63%) 910 (100%)
Total 89 (5%) 494 (28%) 1210 (67%) 1793 (100%)
whether or not astrology is scientific?
(4 marks)
(9 marks)
This part targets Chapter 8 on contingency tables and chi-square tests. Note that part (i) of
the question does not require any calculations, just understanding and interpreting
straightforward chi-squared test and the reading is also given in Chapter 8.
i. Using the percentages we see that the higher someone’s education, the smaller the belief
that astrology is very scientific and the higher the belief that it is not at all scientific.
For example, 79% of those who attended college or higher education responded that
astrology is not at all scientific, whereas the corresponding proportion for those with less
than high school education is 48%.
ii. Set out the null hypothesis that there is no association between education and views on
astrology against the alternative, that there is an association. Be careful to get these the
correct way round!
H0 : No association between education and views on astrology versus
H1 : Association between education and views on astrology.
Work out the expected values to obtain the table below
10.1757   56.4808   138.344
45.1701   250.719   614.11
33.6542   186.8     457.546
The test statistic is
$$\sum_{i,j} \frac{(O_{i,j} - E_{i,j})^2}{E_{i,j}},$$
which gives a value of 93.9567. This is a 3 × 3 contingency table so the degrees of
freedom are (3 − 1) × (3 − 1) = 4.
For α = 0.05, the critical value is 9.488, hence we reject H0 . For a second (stronger) α,
say 1%, the critical value is 13.277, hence we still reject H0 .
We conclude that the association between views on astrology and educational level is
highly significant.
Saying ‘we reject at the 5% level, but not at 1%’ is insufficient. What does this mean?
Is there a connection or not? If there is one, how strong is it? These questions needed to be
answered for the full nine marks allocated to this question to be awarded. Many candidates
lost marks by failing to follow up like this.
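The expected values come from (row total × column total)/(grand total). As a sketch, the 'High school' row can be reproduced from the totals printed in the table; the remaining rows follow in the same way from their own row totals. The statistic 93.9567 and the critical values are taken from the commentary, not recomputed.

```python
column_totals = [89, 494, 1210]  # very / sort of / not at all scientific
grand_total = 1793
high_school_total = 910

# Expected count under H0 (no association) for each cell in the row.
expected_high_school = [high_school_total * c / grand_total for c in column_totals]

# Degrees of freedom for an r x c table.
rows, cols = 3, 3
degrees_of_freedom = (rows - 1) * (cols - 1)

chi_squared = 93.9567  # value reported in the commentary
reject_at_5_percent = chi_squared > 9.488
reject_at_1_percent = chi_squared > 13.277

print([round(e, 4) for e in expected_high_school])
print(degrees_of_freedom, reject_at_5_percent, reject_at_1_percent)
```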
(b) i. Define each of the following:

— Simple random sampling
— Stratified random sampling.
(4 marks)
ii. Why might a researcher prefer to take a stratified random sample rather
than a simple random sample? Give two reasons.
(3 marks)
iii. You have been asked to design a nation-wide survey in your country to find
out about the smoking habits of adults. Give two stratification factors you
might use, and explain why you have chosen them.
(5 marks)
This question on basic material on survey designs required background from Chapters 9 and
10 of the subject guide which, along with the recommended reading should be looked at
constituents of design in random sampling. It is also a good idea to try the activities in
Chapter 9.
One of the main things to avoid here is writing an answer without any structure. This
unsure of what these specific things are, do not write lengthy essays. This is a waste of
your valuable examination time. If you can identify what is being asked, keep in mind that
the answer should not be long. Note also that in some cases there is no unique answer
to the question.
i. Simple random sampling:
• Every sample has equal probability.
• With replacement.
Stratified random sampling:
• Population divided into strata (or groups).
• Random sample from each group.
ii. There are generally two main reasons why one would prefer stratified to simple random
sampling.
• Potentially more precision of parameter estimates.
• Obtain information about subgroups.
iii. In this part you can choose factors based on two arguments. First, you can aim for
factors whose subgroups differ regarding smoking habits (e.g. gender, ethnic groups, age
groups etc.). In that way the stratified sampling scheme will have increased precision.
Alternatively you can just suggest factors that are interesting from a research point of
view.
Question 3
The level of infant mortality (y) is represented by the number of baby deaths for
shown.
Percentage (x) 19 5 9 20 11 35 5 18 25 12 20 15

(a) i. Draw a scatter diagram of these data on the graph paper provided. Label the
diagram carefully.
(4 marks)
(3 marks)
diagram.
(4 marks)
iv. Using the equation you found in iii., obtain the predicted infant mortality for
an area where 34% of babies are born into families earning at least £25,000.
Do you think this value is realistic? Justify your answer.
(2 marks)
parts focus on correlation and regression and are covered in Sections 11.8 to 11.10 of the
subject guide. Section 11.7 is also relevant. Sample examination question 2 from this
chapter is recommended for practice on questions of this type.
which should include a full title (‘Scatter diagram’ alone will not suffice) and labelled
axes, including information about units. Far too many candidates threw away marks by
[Figure: scatter diagram titled ‘Infant mortality and economic class’; x-axis: percentage of babies born into families earning at least 25,000 pounds (0 to 35); y-axis: infant mortality, number of baby deaths for every 1000 births (0 to 25).]
this value is the following: The data suggest that the higher the percentage of families
earning at least a certain income, the lower the mortality. The fact that the value is very
close to −1, suggests that this is a strong (negative) association.
iii. The regression line can be written by the equation ŷ = a + bx or y = a + bx + ε. The
formula for b is
$$b = \frac{\sum x_i y_i - n\bar{x}\bar{y}}{\sum x_i^2 - n\bar{x}^2},$$
iv. The prediction will be ŷ = 18.5994 − 0.5319 × 34 = 0.51 infant mortality (number of
baby deaths for every 1,000 births). However, since this point is very close to the maximum
observation of x, which is 35%, this prediction should not be trusted too much as it is
almost based on extrapolation.
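A short sketch of the prediction step, using the fitted coefficients quoted above and the x value of 34% asked for in the question. The check against the observed range of x is an illustration of the extrapolation point, not part of the marking scheme.

```python
# Fitted line from part iii: y-hat = a + b * x
a, b = 18.5994, -0.5319

# Observed percentages (x) from the question.
x_observed = [19, 5, 9, 20, 11, 35, 5, 18, 25, 12, 20, 15]


def predict(x):
    """Predicted infant mortality (deaths per 1,000 births) at percentage x."""
    return a + b * x


x_new = 34
y_hat = predict(x_new)

# How close is the new point to the edge of the observed data?
within_observed_range = min(x_observed) <= x_new <= max(x_observed)
distance_to_max = max(x_observed) - x_new

print(round(y_hat, 2), within_observed_range, distance_to_max)
```

The new point lies inside the observed range, but only just, and the predicted value is far below every plotted mortality figure, which is why the prediction deserves caution.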
answering such a question and a mark is deducted if they are not specified.
(b) A survey is conducted to compare public attitudes towards local policing. A

number of people in two areas of interest are sampled, and asked if they are
satisfied with their local police-community relationship. The results of this
Sample size Number satisfied
Area A 153 115
Area B 188 120

whether there is a difference between the two areas in the proportion who
are satisfied. Test at two appropriate significance levels and comment on
your findings. Specify the test statistic you use and its distribution under the
null hypothesis.
(7 marks)
(2 marks)
iii. Give a 98% confidence interval for the proportion of people in Areas A and B
combined who are satisfied, assuming the respective sample sizes are
proportional to population sizes.
(3 marks)
proportions. While the entire chapter on hypothesis testing is relevant, one can focus on the
sections involving proportions (Sections 7.14 and 7.15). The last part of the question is on
confidence intervals, which are covered in Chapter 6 and, in particular (confidence intervals for
proportions), in Section 6.10.
i. The null hypothesis is that the proportions of the two areas (πA and πB ) do not differ,
the alternative is that they do.
H0 : πA = πB versus H1 : πA ≠ πB .
The test statistic is provided in the formula sheet (note that it is based on the pooled
variance):
$$Z = \frac{P_A - P_B}{\sqrt{\dfrac{P(1-P)}{n_A} + \dfrac{P(1-P)}{n_B}}}$$
where P = (115 + 120)/(153 + 188) = 0.68915.

The test statistic value is 2.249 (PA = 0.7516, PB = 0.6383, pooled se = 0.0504). The
critical value at the 5% level, assuming a normal approximation as the number of
observations is large, is ±1.96. Hence, we reject the null hypothesis suggesting evidence
for a difference between the two areas. If we take a (smaller) α of 1%, the critical value is
±2.576, so we do not reject H0 . We conclude that there is some, but not strong, evidence
of a difference between the two areas.
ii. The assumptions included:

• Assumption about whether nA + nB − 2 is ‘large’, hence t v. z
iii. The question is a standard exercise in confidence intervals. Note the question refers to
areas A and B combined. The working is given below:
• Correct quantile: zα/2 = 2.326.
• Correct endpoints: 0.631 and 0.747. (Also accept two decimal places.)
• Report as an interval: (0.631, 0.747). (Also accept between 0.631 and 0.747.)
Question 4
3 2 4 8 7 19 2 5 3 4 10 12
(8 marks)
of the data
(2 marks)
(3 marks)
Chapter 3 provides all the relevant material for this question. More specifically, information
on boxplots can be found in Section 3.9.2, but all of Sections 3.8 and 3.9 are highly relevant.
i. The boxplot diagram the Examiners were hoping to see is shown below. Marks were
awarded for including the title, identifying the box and the whiskers and noting the
outlier, at a reasonable accuracy.
[Figure: box plot titled ‘Distribution of Income’; vertical axis: income in thousands of pounds, marked from 0 to 20.]
In order to identify the box, the quartiles are needed; these are 3 and 8.5, giving an
interquartile range of 5.5. The median, which is 4.5, is also needed.
Hence the outlier limits are from 0 to 16.75. (−5.25 to 16.75 is also allowed.)
The extreme outlier limits are then from 0 to 25 (−13.5 to 25 is also allowed.)
Hence 19 is an outlier but not an extreme outlier.
Note that you did not need to label the x axis and that the plot can be transposed.
ii. Based on the shape of the boxplot, we can see that the distribution of the data is
positively skewed.
iii. A histogram or stem-and-leaf diagram are other types of suitable graphical displays. The
variable income is measurable and these graphs are suitable for displaying the
distribution of such variables.
(b) A new fitness programme is devised for obese people. Each participant’s weight
in kg was measured before and after the program to see if the fitness program is
effective in reducing their weights. The following data were obtained:
Before After
145 143
116 120
120 118
133 130
119 119
133 134
125 128
126 123
140 141
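i. Carry out an appropriate hypothesis test to determine whether the fitness
programme is effective for reducing weight. State the test hypotheses, and
specify your test statistic and its distribution under the null hypothesis.
Comment on your findings.
(6 marks)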
ii. State any assumptions you made.
(2 marks)
(2 marks)
iv. On the basis of the data alone, would you recommend the programme to a
friend who wants to lose weight? Explain why or why not.
(2 marks)
Look up the sections about hypothesis testing for testing differences in means. However, it
is essential for this part of the question to focus on the section of the subject guide
regarding paired samples (Section 7.16.4).
i. Regarding hypotheses, note that the word ‘effective’ suggests a one-sided test:
H0 : µbefore = µafter , H1 : µbefore > µafter
In this part, it is also essential to realise that we have a paired sample, as we have two
observations for each person (before and after treatment). Hence the difference for each
person should be calculated:
2 −4 2 3 0 −1 −3 3 −1
The next step is to calculate sd = 2.571, x̄d = 0.1111, in order to obtain the value of the
test statistic
$$\frac{\bar{x}_d - 0}{s_d/\sqrt{n}} = 0.1296.$$
We have the t distribution with eight degrees of freedom, hence the critical value (for a
one-sided test) is 1.860.
Hence, we do not reject H0 at the 5% level. Testing at the 10% level gives a critical value
of t8,0.1 = 1.397. Therefore, we still do not reject H0 . There is no significant evidence
that the fitness program is effective.
ii. • Differences normally distributed (no marks for normally distributed weights).
• Pairs of observations are independent (a weaker condition which suffices is that the
differences are independent, but this is unlikely if observations are not).
iii. This is a straightforward exercise for confidence intervals given the appropriate formula
from the formula sheet (make sure that you can recognise it). The requested confidence
interval is (−0.650729, 0.872951).
iv. The evidence in the data that the programme works is close to negligible as can be seen,
for example, from the 80% confidence interval, so there is no reason to recommend the
programme on the basis of the data alone.
ST104A ZA d0
BSc degrees and Diplomas for Graduates in Economics, Management, Finance

and the Social Sciences, the Diplomas in Economics and Social Sciences and
Access Route
Statistics 1
Wednesday, 14 May 2014 : 10:00 to 12:00

Candidates should answer THREE of the following FOUR questions: QUESTION 1 of
Section A (50 marks) and TWO questions from Section B (25 marks each). Candidates
are strongly advised to divide their time accordingly.
A list of formulae and extracts from statistical tables are provided after the final question
on this paper.
Graph paper is provided at the end of this question paper. If used, it must be detached
and fastened securely inside the answer book.
A calculator may be used when answering questions on this paper and it must comply
in all respects with the specification given with your Admission Notice. The make and
type of machine must be clearly stated on the front cover of the answer book.
PLEASE TURN OVER

UL14/0741 Page 1 of 21 D1

SECTION A

i. Country of residence.
ii. Maximum speed a car can reach in 10 seconds.
iii. Value of the Dow Jones index.
iv. Position of Manchester United in the English Premier League (EPL) at a
particular point of the season.
[8 marks]
(b) The table below contains the number of wine bottles sold at two different
supermarkets on the last days from the previous month:
Supermarket A 55 52 102 96 59 55 60
Supermarket B 61 68 63 69 62 71 72 67 62
i. Find the mean and the median number of wine bottles sold for each
supermarket.
ii. Comment on the differences in the mean and median for the two
supermarkets that you found in part (i.). For this data set, which do
you think would give a better description for the number of wine bottles
sold: the mean or the median? Explain briefly.
iii. After making some enquiries, you find out that there was a party thrown
in a house on the street of supermarket A on the days with 102 and 96
wine bottles sold. Without doing any calculations, would you change your
answers about potential differences between the means and medians for the
two supermarkets? Give explanations for any statements that you make.
[8 marks]
(c) Suppose that X is a normally distributed random variable with mean 0 and
variance 1.
i. Find the probability that X + 4 is less than 4.
ii. Find the value of b so that the probability of X − b being less than zero is
0.975
[4 marks]
(d) You are told that a 95% confidence interval for a population proportion is
(0.3775, 0.6225). What was the sample proportion that led to this confidence
interval? Also, what was the size of the sample used? [5 marks]
(e) Suppose that x1 = 3, x2 = −2, x3 = 2, x4 = 2, x5 = −2, and y1 = 1, y2 = 2,

y3 = −2, y4 = 1, y5 = 0. Calculate the following quantities:
i. $\sum_{i=1}^{5} x_i$   ii. $\sum_{i=3}^{5} 3x_i(y_i - 2)$   iii. $x_5^2 + \sum_{i=2}^{4} (x_i^2 + y_i)$
[6 marks]
(f) Suppose there are two boxes; the first one contains three green and one red
balls, whereas the second contains two green and two red balls. First, a box is
chosen at random and then a ball is drawn randomly from that box.
i. What is the probability that the ball drawn is green?
ii. If the ball drawn was green, what is the probability that the first box was
chosen?
[5 marks]
(g) The probability distribution of a random variable X is given below.
x 0 1 2 3
pX (x) 0.2 0.3 0.1 0.4
ii. Find E(X), the expected value of X.
[4 marks]
i. In quota sampling we cannot draw statistical inference.
ii. The Spearman rank correlation coefficient is more useful than Pearson
correlation in data with outliers.
iii. If the constant in the regression equation is negative, the correlation will
also be negative.
iv. If the p-value for a test is larger than the significance level, we reject H0 .
v. In experimental studies one can use quota sampling to select the treatment
and control groups.
[10 marks]
SECTION B
2. (a) A social survey in the UK asked subjects, ‘Do you do your shopping online?’
with the possible answers being ‘Frequently’, ‘Rarely’ and ‘Never’. The table
below cross-classifies their responses with their gender.
Shop online
Gender Frequently Rarely Never Total
Male 52 (26%) 94 (47%) 54 (27%) 200 (100%)
Female 47 (39%) 52 (43%) 21 (18%) 120 (100%)
Total 99 (31%) 146 (46%) 75 (23%) 320 (100%)
would you describe the relationship between gender and tendency to shop
online?
[13 marks]
(b) i. You have been asked to design a nationwide survey in your country to find
out about internet usage among children less than 10 years old. Provide
a probability sampling scheme and a sampling frame that you would like
to use. Identify a potential source of selection bias that may occur and
discuss how this issue can be addressed.
ii. Describe what a longitudinal survey is. State two ways in which panel
surveys differ from longitudinal surveys.
[12 marks]
3. A car insurance company would like to examine the relationship between driving
experience and insurance premium. For this reason, a random sample of ten drivers
is taken and the years of driving experience (x) as well as the monthly insurance
premium (y, in £) is recorded. The data are shown in the table below.
Driver #1 #2 #3 #4 #5 #6 #7 #8 #9 #10
Driving experience (x) 6 3 11 10 15 6 25 16 15 20
Insurance premium (y) 66 88 51 70 44 56 42 60 45 40

diagram.
iv. Based on the regression equation in part (iii.), what will be the predicted
monthly insurance premium for a driver with 10 years of experience? Will
you trust this value? Justify your answer.
[13 marks]
(b) A company wants to check the quality of its customer service regarding phone
enquiries. For this reason, the manager wants to compare the call waiting
times during the years 2013 and 2012. Unfortunately, extensive records of
the company are not available, and he can only check a random sample of
phone calls within these two years. The available data, measured in minutes
of waiting time, are provided below for each year.

2013 42 7.4 0.5
2012 35 7.1 0.6
i. Use an appropriate hypothesis test to determine whether the mean waiting

times were different between these two years. Test at two appropriate
significance levels, stating clearly the hypotheses, the test statistic and its
distribution under the null hypothesis. Comment on your findings.
ii. State clearly any assumptions you made in (i.).
iii. Adjust the procedure above to determine whether the mean waiting time
in 2013 was greater than in 2012.
[12 marks]
4. (a) i. Carefully construct a box plot on the graph paper provided to display
the following average daily intakes of calories for 12 athletes, measured in
kcals:
1808 2200 2154 2004 2101 1957 3061 2500 2009 2147 2231 1936
distribution of the data.
[13 marks]
(b) A study was made to determine the amount of fuel economy obtained by using a
specific new type of tyre over a standard type. For this reason, 8 cars were fitted
with the new type of tyre and the fuel consumption (in km/l) was measured
after a test-drive. Afterwards, the same cars with the same drivers were fitted
with the standard type of tyre and the experiment was repeated to obtain the
following fuel consumption measurements.
Car #1 #2 #3 #4 #5 #6 #7 #8
Standard type tyres 4.6 6.5 7.4 5.5 5.3 5.2 6.6 6.7
New type tyres 4.1 6.2 7.1 5.4 5.5 5.1 6.1 6.3
fuel consumption is different between the two types of tyre. State the
the null hypothesis. Comment on your findings.
ii. State any assumptions you made in (i.).
iv. On the basis of the data alone, would you be concerned about fuel
consumption if you wanted to buy the new type of tyre? Provide an
explanation with your answer.
[12 marks]
END OF PAPER
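ST104a Statistics 1

Expected value of a discrete random variable:
\mu = E[X] = \sum_{i=1}^{N} p_i x_i

Standard deviation of a discrete random variable:
\sigma = \sqrt{\sigma^2} = \sqrt{\sum_{i=1}^{N} p_i (x_i - \mu)^2}

The transformation formula:
Z = \frac{X - \mu}{\sigma}

Finding Z for the sampling distribution of the sample mean:
Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}}

Finding Z for the sampling distribution of the sample proportion:
Z = \frac{P - \pi}{\sqrt{\pi(1 - \pi)/n}}

Confidence interval endpoints for a single mean (\sigma known):
\bar{x} \pm z \frac{\sigma}{\sqrt{n}}

Confidence interval endpoints for a single mean (\sigma unknown):
\bar{x} \pm t_{n-1} \frac{s}{\sqrt{n}}

Confidence interval endpoints for a single proportion:
p \pm z \sqrt{\frac{p(1 - p)}{n}}

Sample size determination for a mean:
n \ge \frac{Z^2 \sigma^2}{e^2}

Sample size determination for a proportion:
n \ge \frac{Z^2 p(1 - p)}{e^2}

Z test of hypothesis for a single mean (\sigma known):
Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}}

t test of hypothesis for a single mean (\sigma unknown):
t = \frac{\bar{X} - \mu}{S / \sqrt{n}}

Z test of hypothesis for a single proportion:
Z \cong \frac{p - \pi}{\sqrt{\pi(1 - \pi)/n}}

Z test for the difference between two means (variances known):
Z = \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{\sqrt{\sigma_1^2/n_1 + \sigma_2^2/n_2}}

t test for the difference between two means (variances unknown):
t = \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{\sqrt{S_p^2 (1/n_1 + 1/n_2)}}

Confidence interval endpoints for the difference between two means:
(\bar{x}_1 - \bar{x}_2) \pm t_{n_1 + n_2 - 2} \sqrt{s_p^2 \left(\frac{1}{n_1} + \frac{1}{n_2}\right)}

Pooled variance estimator:
S_p^2 = \frac{(n_1 - 1) S_1^2 + (n_2 - 1) S_2^2}{n_1 + n_2 - 2}

t test for the difference in means in paired samples:
t = \frac{\bar{X}_d - \mu_d}{S_d / \sqrt{n}}

Confidence interval endpoints for the difference in means in paired samples:
\bar{x}_d \pm t_{n-1} \frac{s_d}{\sqrt{n}}

Z test for the difference between two proportions:
Z = \frac{(P_1 - P_2) - (\pi_1 - \pi_2)}{\sqrt{P(1 - P)(1/n_1 + 1/n_2)}}

Pooled proportion estimator:
P = \frac{R_1 + R_2}{n_1 + n_2}

Confidence interval endpoints for the difference between two proportions:
(p_1 - p_2) \pm z \sqrt{\frac{p_1(1 - p_1)}{n_1} + \frac{p_2(1 - p_2)}{n_2}}

\chi^2 test of association:
\sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}

Sample correlation coefficient:
r = \frac{\sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}}{\sqrt{\left(\sum_{i=1}^{n} x_i^2 - n\bar{x}^2\right)\left(\sum_{i=1}^{n} y_i^2 - n\bar{y}^2\right)}}

Spearman rank correlation:
r_s = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}

Simple linear regression line estimates:
b = \frac{\sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}}{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}
a = \bar{y} - b\bar{x}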
ST104A ZA

Access Route
Statistics 1
Wednesday, 14 May 2014 : 10:00 to 12:00

on this paper.
PLEASE TURN OVER

UL14/0742 Page 1 of 21 D3

SECTION A

i. Calories you consumed yesterday.
ii. Type of a film.
iii. Rank of a continent according to its population.
iv. Time an employee has spent in a company.
[8 marks]
(b) The table below contains the number of customers that visited two different
branches of a bank on the last days of the previous month:
Branch A 75 72 142 120 79 75 81
Branch B 81 88 83 89 82 91 92 87 82
i. Find the mean and the median number of customers for each branch.
ii. Comment on the differences in the mean and median for the two branches
that you found in part (i.). For this data set, which do you think would
give a better description for the number of customers: the mean or the
median? Explain briefly.
iii. After making some enquiries, you find out that the ATM next door to
branch A was not working on the days with 142 and 120 customers.
Without doing any calculations, would you change your answers about
potential differences between the means and between the medians of the
two branches? Give explanations for any statements that you make.
[8 marks]
variance 1.
0.95
[4 marks]
interval? Also, what was the size of the sample used? [5 marks]

X
i=5 X
i=5 X
i=4
i=1 i=3 i=2
[6 marks]
(f) Suppose there are two boxes; the first one contains one green and three red
chosen?
[5 marks]
x 1 2 3 4
pX (x) 0.3 0.3 0.3 0.1
i. Find the probability that X is an even number.
[4 marks]
i. In stratified random sampling the interviewer selects a certain number of
people according to some pre-specified strata.
ii. If two variables have correlation which is almost zero, we can conclude
that they are independent.
iii. If two variables have correlation which is close to one, we can conclude
that the variables are related.
iv. If the χ2 test statistic is larger than the 5% critical value, the p-value is
also larger than 0.05.
v. Cluster sampling can be used to reduce the cost of a survey.
[10 marks]
SECTION B
2. (a) A social survey in the UK asked subjects, ‘Do you buy organic products, despite
the fact they are usually more expensive?’ with the possible answers being
‘Yes’, ‘Sometimes’ and ‘No’. The table below cross-classifies their responses
with their place of residence (‘Rural’ or ‘Urban’ areas).
Buy organic products
Place of residence Yes Sometimes No Total
Rural area 35 (17%) 90 (45%) 75 (38%) 200 (100%)
Urban area 73 (21%) 163 (46%) 114 (33%) 350 (100%)
Total 108(20%) 253 (46%) 189 (34%) 550 (100%)
would you describe the relationship between place of residence and buying
organic products?
[13 marks]
out about internet usage among children less than 10 years old. Provide
to use. Identify a potential source of selection bias that may occur and
discuss how this issue can be addressed.
ii. Describe what is a longitudinal survey. State an advantage and a disad-
vantage when using such surveys.
[12 marks]
3. We are interested in studying the association between the price of flour and the
production of wheat in a particular area of the UK. The data shown in the table
below provide figures regarding the production of wheat in tonnes (x) as well as the
price of flour (y), in £ per kg, over the last 10 years.
Year #1 #2 #3 #4 #5 #6 #7 #8 #9 #10
Production of Wheat (x) 30 28 32 25 25 24 22 24 35 40
Price of Flour (y) 25 30 27 40 42 41 50 45 30 25

diagram.
price of flour for a year with 45 tonnes production of wheat? Will you
trust this value? Justify your answer.
[13 marks]
enquiries. For this reason the manager wants to compare the call waiting
times during the years 2013 and 2012. Unfortunately, extensive records of
the company are not available, and he can only check a random sample of
phone calls within these two years. The available data, measured in minutes
of waiting times, are provided below for each year.

2013 41 5.8 0.5
2012 34 6.1 0.6

iii. Adjust the procedure above to determine whether the mean waiting time
in 2013 was less than that of 2012.
[12 marks]
4. (a) i. Carefully construct a box plot on the graph paper provided to display
the following annual earnings for the salesmen of a company, measured in
£000s:
35 26 22 24 21 57 36 35 29 47 30 36
distribution of the data.
[13 marks]
(b) A study was made to determine the amount of fuel economy obtained by using a
specific new type of tyre over a standard type. For this reason, 8 cars were fitted
with the new type of tyre and the fuel consumption (in km/l) was measured
after a test-drive. Afterwards, the same cars with the same drivers were fitted
with the standard type tyres and the experiment was repeated to obtain the
following fuel consumption measurements.
Car #1 #2 #3 #4 #5 #6 #7 #8
New type tyres 5.1 6.2 7.3 5.4 5.5 5.1 6.1 7.3
fuel consumption is different between the two types of tyre. State the
the null hypothesis. Comment on your findings.
[12 marks]
END OF PAPER

ST104a Statistics 1 (half course)
Important note
section.
General remarks
Learning outcomes
By the end of this unit and having completed the Essential reading and activities you should:
• be able to apply a variety of methods for explaining, summarising and presenting data and
interpreting results clearly using appropriate diagrams, titles and labels when required
• understand the ideas of randomness and variability, and the way in which these link to
probability theory to allow the systematic and logical collection of statistical techniques of
great practical importance in many applied areas
methods
• be able to use inference to test the significance of common measures such as means and
proportions and carry out chi-squared tests of contingency tables
• be able to carry out simple regression and correlation analysis and know when it is
appropriate to do so.
appeared in the second part. The first part of Question 3 was on regression and involved drawing a
data provided. Question 4 had a series of questions involving drawing diagrams, such as boxplots,
hypothesis testing, in particular paired t tests, and confidence intervals. This means that it is really
important that you make sure you have a reasonable idea of what topics are covered before you start
work on the paper! We suggest you divide your time as follows during the examination:
and subquestion.
• Allow yourself 45 minutes for Section A. Do not allow yourself to get stuck on any one
question, but do not just give up after two minutes!
What are the Examiners looking for?
The Examiners are looking for very simple demonstrations from you. They want to be sure that you:
Remember:
is not acceptable to do one rather than the other! If you are asked to find a 5% critical
value, this is what will be marked.
• Do not waste time calculating things which are not required by the Examiners. If you are
Examiners’ commentaries?
• the answers, or keys to the answers, which the Examiners were looking for
• the relevant detailed reference to Newbold et al. (seventh edition) and the subject guide
prepare, and similar questions from Newbold et al.
Question spotting
Many candidates are disappointed to find that their examination performance is poorer
than they expected. This can be due to a number of different reasons and the Examiners’
commentaries suggest ways of addressing common problems and improving your performance.
We want to draw your attention to one particular failing – ‘question spotting’, that is,
confining your examination preparation to a few question topics which have come up in past
papers for the course. This can have very serious consequences.
We recognise that candidates may not cover all topics in the syllabus in the same depth, but
you need to be aware that Examiners are free to set questions on any aspect of the syllabus.
This means that you need to study enough of the syllabus to enable you to answer the required
number of examination questions.
The syllabus can be found in the ‘Course information sheet’ in the section of the VLE dedicated
to this course. You should read the syllabus very carefully and ensure that you cover sufficient
material in preparation for the examination.
Examiners will vary the topics and questions from year to year and may well set questions that
have not appeared in past papers – every topic on the syllabus is a legitimate examination
target. So although past papers can be helpful in revision, you cannot assume that topics or
specific questions that have come up in past examinations will occur again.
If you rely on a question spotting strategy, it is likely you will find yourself in
difficulties when you sit the examination paper. We strongly advise you not to
adopt this strategy.
Section A
Question 1

i. Country of residence.
ii. Maximum speed a car can reach in 10 seconds.
iii. Value of the Dow Jones index.
iv. Position of Manchester United in the English Premier League (EPL) at a
particular point of the season.
(8 marks)

This question requires identifying types of variables so reading the relevant section in the

A general tip for identifying continuous and categorical variables is to think of the possible
i. Each country is a category, so the possible values are one for each country. Hence, this
represents a nominal categorical variable.
ii. Speed is a variable which can be measured in miles per hour or kilometres per hour to
several decimal places. Hence it is a measurable variable.
iii. The Dow Jones index takes values to several decimal places. It is therefore regarded as a
measurable variable.
iv. The position of Manchester United can be either 1st, 2nd or any other position up to
20th. By definition, these positions (places) are ordered: 1st is the highest place and
20th is the lowest. Hence it is an ordinal categorical variable.
Weak candidates did not provide justifications for their choices, reported nominal or
categorical or measurable variables and sometimes answered ordinal when their justification
was pointing to a nominal variable. There were also phrases like ‘It is measurable because it
can be measured’ that were not awarded any marks.
(b) The table below contains the number of wine bottles sold at two different
supermarkets on the last days from the previous month:
Supermarket A 55 52 102 96 59 55 60
Supermarket B 61 68 63 69 62 71 72 67 62
i. Find the mean and the median number of wine bottles sold for each
supermarket.
ii. Comment on the differences in the mean and median for the two
supermarkets that you found in part (i.). For this data set, which do you
think would give a better description for the number of wine bottles sold:
the mean or the median? Explain briefly.
iii. After making some enquiries, you find out that there was a party thrown in a
house on the street of supermarket A on the days with 102 and 96 wine
bottles sold. Without doing any calculations, would you change your answers
about potential differences between the means and medians for the two
supermarkets? Give explanations for any statements that you make.
(8 marks)

This question contains material mostly from Chapter 4 and in particular Section 4.8
(Measures of location) for parts (i.) and (iv.) and Section 4.9 (Measures of spread) for parts
(ii.) and (iii.).

It is important to do the summation carefully and divide by the correct number of
observations to obtain the mean. For questions which require calculations on the median (or
also that this question requires these measures for both supermarkets, so the calculations
should be done for each supermarket separately.
supermarket and then divide them by the number of observations in each row. Doing so
yields:
(55 + 52 + · · · + 60)/7 = 68.4, for supermarket A
and:
(61 + 68 + · · · + 62)/9 = 66.1, for supermarket B.
For the median if we put the numbers in ascending order we get:
52 55 55 59 60 96 102, for supermarket A
and:
61 62 62 63 67 68 69 71 72, for supermarket B.
The median for supermarket A is given by taking the 4th number in the first of the rows
above, which is 59. The median for supermarket B is obtained from the 5th number in
the 2nd row above, which is 67.
One mark for each of the four numbers above was awarded.
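The four numbers above can be checked with a short script. This is only a sketch using Python's standard `statistics` module; the variable names are illustrative and not part of the examination material.

```python
from statistics import mean, median

# Daily wine-bottle sales, copied from the question.
supermarket_a = [55, 52, 102, 96, 59, 55, 60]
supermarket_b = [61, 68, 63, 69, 62, 71, 72, 67, 62]

# Mean: sum of the observations divided by the number of observations.
mean_a, mean_b = mean(supermarket_a), mean(supermarket_b)

# Median: middle value of the sorted data (4th of 7 for A, 5th of 9 for B).
median_a, median_b = median(supermarket_a), median(supermarket_b)

print(f"Supermarket A: mean = {mean_a:.1f}, median = {median_a}")
print(f"Supermarket B: mean = {mean_b:.1f}, median = {median_b}")
```

Note how the two large values for supermarket A pull its mean above that of supermarket B, while its median stays below.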
ii. It is first important to note that the mean of supermarket A is higher than that of
supermarket B. However, this does not necessarily indicate that the centre of the
distribution of supermarket A is larger than that of supermarket B. Supermarket A
contains two outliers (102 and 96) which result in a high mean. Apart from these
outliers, the numbers of wine bottles sold tend to be higher in supermarket B, so the
median gives a somewhat better indication of the ‘typical’ number of wine bottles sold
for each supermarket.
iii. After taking out these days we can argue that both the mean and median for
supermarket A would be smaller than those for supermarket B on a day where there are
no house parties nearby. This is because all the numbers of wine bottles sold in
supermarket A (on a day where there are no house parties nearby) are smaller than or equal
to any number of wine bottles sold in supermarket B.
variance 1.
0.975.
(4 marks)

Chapter 6 and work out the examples and activities of this section. The Sample

then Z = (X − µ)/σ ∼ N (0, 1). Note also that:
• P (Z < a) = P (Z ≤ a) = Φ(a)
• P (Z > a) = P (Z ≥ a) = 1 − P (Z ≤ a) = 1 − P (Z < a) = 1 − Φ(a)
i. We get:
P(X + 4 < 4) = P\left(\frac{X + 4 - 4}{1} < \frac{4 - 4}{1}\right) = P(X < 0).
Due to symmetry we get:
P (X < 0) = 0.5.
In fact the above probability can be found directly using the symmetry property of the
normal distribution. Such direct answers, giving the correct value above and stating the
symmetry, were also accepted.
ii. We can write:
P (X − b < 0) = 0.975 ⇔ P (X − b + b < b) = 0.975 ⇔ P (X < b) = 0.975.
From the tables we can get that b = 1.96.
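Both values can be checked numerically instead of from the tables. The sketch below assumes, as the working above implies, that X is normal with mean 0 and variance 1; it uses `statistics.NormalDist` from the Python standard library.

```python
from statistics import NormalDist

# Assumption: X ~ N(0, 1), as implied by the working above.
std_normal = NormalDist(mu=0, sigma=1)

# Part (i): P(X + 4 < 4) = P(X < 0), which is 0.5 by symmetry.
p_below_zero = std_normal.cdf(0)

# Part (ii): b solves P(X < b) = 0.975, i.e. b is the 97.5th percentile.
b = std_normal.inv_cdf(0.975)

print(f"P(X < 0) = {p_below_zero:.3f}")
print(f"b = {b:.2f}")
```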
interval? Also, what was the size of the sample used?
(5 marks)

The whole of Chapter 7 is relevant, but the main reading for this question is Section 7.10
about confidence intervals for a single proportion. Section 7.11 on sample size determination
is also highly relevant. It is essential to read this section carefully and attempt all Learning
activities.

Note that confidence intervals for single proportions (and all the other confidence intervals
in this course) are symmetric around p. Hence p would be in the centre of the interval
(0.3775, 0.6225). Adding the two endpoints and dividing by 2 gives p = 0.5. This question
also asks you to determine a sample size.
• Find the sample proportion: p = 0.5 (see above).

• Find the standard error (using the relevant formula from the formula sheet):
\sqrt{\frac{p(1 - p)}{n}} = \frac{\sqrt{0.5 \times 0.5}}{\sqrt{n}} = \frac{0.5}{\sqrt{n}}.
• Use the correct z value: 1.96.

• Solve the relevant equation to find n:
1.96 \times \frac{0.5}{\sqrt{n}} = 0.1225.
• Remember to round up the solution to the equation above. The correct sample size is
n = 64.
Some candidates forgot to round up. Remember that we are asked about a sample size.
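The steps above can be sketched in a few lines of Python. The interval endpoints and z = 1.96 are taken from the commentary; the variable names, and the small rounding guard before `math.ceil`, are illustrative choices made here.

```python
import math

lower, upper = 0.3775, 0.6225   # interval endpoints quoted in the commentary
z = 1.96                        # z value used in the commentary

p = (lower + upper) / 2         # the interval is symmetric around p
half_width = (upper - lower) / 2

# Solve z * sqrt(p(1 - p) / n) = half_width for n.
n_exact = (z * math.sqrt(p * (1 - p)) / half_width) ** 2

# Round up to a whole number of observations. The round() call only guards
# against floating-point noise pushing an exact 64 slightly above 64.
n = math.ceil(round(n_exact, 6))

print(f"p = {p:.2f}, half-width = {half_width:.4f}, n = {n}")
```

Here the equation happens to solve to a whole number, but in general the solution must still be rounded up, since a sample size cannot be fractional.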

i. \sum_{i=1}^{5} x_i    ii. \sum_{i=3}^{5} 3x_i(y_i − 2)    iii. x_5^2 + \sum_{i=2}^{4} (x_i^2 + y_i)
(6 marks)

guide and in particular Learning activity Question 6.

Be careful to leave the xi s and yi s in the order given and only cover the values of i asked for.
i. \sum_{i=1}^{5} x_i = 3 + (−2) + 2 + 2 − 2 = 3.
ii. \sum_{i=3}^{5} 3x_i(y_i − 2) = 3 \sum_{i=3}^{5} x_i(y_i − 2) = 3(2 × (−2 − 2) + 2 × (1 − 2) + (−2) × (0 − 2)) =
3 × (−6) = −18.
iii. x_5^2 + \sum_{i=2}^{4} (x_i^2 + y_i) = (−2)^2 + ((−2)^2 + 2) + (2^2 − 2) + (2^2 + 1) = 4 + 6 + 2 + 5 = 17.
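The three sums can be reproduced as follows. The data table for this question does not appear in this extract, so the values below are inferred from the worked solutions above (an assumption); y_1 is not needed by any of the three sums and is left out.

```python
# Values inferred from the worked solutions (assumption; the original
# data table is missing from this extract). Keys are the index i.
x = {1: 3, 2: -2, 3: 2, 4: 2, 5: -2}
y = {2: 2, 3: -2, 4: 1, 5: 0}   # y_1 is not used by any of the sums

# i. Sum of x_i for i = 1, ..., 5.
sum_i = sum(x[i] for i in range(1, 6))

# ii. Sum of 3 x_i (y_i - 2) for i = 3, 4, 5 only.
sum_ii = sum(3 * x[i] * (y[i] - 2) for i in range(3, 6))

# iii. x_5 squared plus the sum of (x_i^2 + y_i) for i = 2, 3, 4.
sum_iii = x[5] ** 2 + sum(x[i] ** 2 + y[i] for i in range(2, 5))

print(sum_i, sum_ii, sum_iii)
```

Note that `range(3, 6)` covers i = 3, 4, 5: the upper limit of the range is exclusive, which mirrors the advice to cover only the values of i asked for.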
(f ) Suppose there are two boxes; the first one contains three green and one red
chosen?
(5 marks)

This is a question on probability and targets mostly the material of Chapter 5. It is
essential to practise exercises through the Learning activities of this chapter as well as the
material on the virtual learning environment (VLE). In particular you can attempt
Learning activity Question 6 and Sample examination Question 4. It is also useful to
familiarise yourself with probability trees as they can be quite handy in such exercises.

The first part was straightforward for those that were familiar with this section. Part (ii.)
required knowledge of Bayes’ formula or a very good understanding of probability trees. The
working is given below.
i. Let B1 , B2 denote boxes 1 and 2, respectively, G denote a green ball and R a red ball.
We have P (G) = P (G | B1 ) P (B1 ) + P (G | B2 )P (B2 ).
Note: It is useful to note that the working up to this point may be thought of as
‘knowledge of the method’ and therefore earns one mark regardless of whether
candidates obtain the correct final answer.
The following was worth one more mark and can be found by just substituting and doing
the straightforward calculations:
P(G) = \frac{3}{4} \times \frac{1}{2} + \frac{1}{2} \times \frac{1}{2} = \frac{5}{8}.
Note: Some candidates reported the number 0.625 instead of 5/8. This is acceptable as
long as three decimal places are used.
ii. This part can be found by using Bayes' formula:
P(B_1 \mid G) = \frac{P(G \mid B_1) P(B_1)}{P(G)} = \frac{3/4 \times 1/2}{5/8} = \frac{3}{5} = 0.6.
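The total probability and Bayes' formula steps can be checked with exact fractions. This is a sketch only: the conditional probabilities 3/4 and 1/2 and the equal box probabilities are the ones used in the working above, and the variable names are illustrative.

```python
from fractions import Fraction

# Probabilities used in the working above.
p_b1 = Fraction(1, 2)           # P(B1): box 1 is chosen
p_b2 = Fraction(1, 2)           # P(B2): box 2 is chosen
p_g_given_b1 = Fraction(3, 4)   # P(G | B1)
p_g_given_b2 = Fraction(1, 2)   # P(G | B2)

# Part (i): total probability formula.
p_g = p_g_given_b1 * p_b1 + p_g_given_b2 * p_b2

# Part (ii): Bayes' formula.
p_b1_given_g = p_g_given_b1 * p_b1 / p_g

print(f"P(G) = {p_g}")
print(f"P(B1 | G) = {p_b1_given_g} = {float(p_b1_given_g)}")
```

Working with `Fraction` avoids rounding, so the answer can be reported either as a fraction or to three decimal places.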
x 0 1 2 3
pX (x) 0.2 0.3 0.1 0.4
(4 marks)

This is another question on probability, exploring the concepts of relative frequency,
conditional probability and probability distributions. Reading from Chapter 5 is suggested
focusing on the sections covering these topics. Try Learning activity Question 1 and the
exercises on probability trees.
i. P (X odd number) = P (X = 1) + P (X = 3) = 0.3 + 0.4 = 0.7.
ii. E(X) = \sum_i x_i P(X = x_i) = 0 × 0.2 + 1 × 0.3 + 2 × 0.1 + 3 × 0.4 = 1.7.
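Both parts follow directly from the probability table, as this short sketch shows. The dictionary `pmf` simply holds the table given in the question; the name is illustrative.

```python
# Probability distribution of X from the question: value -> probability.
pmf = {0: 0.2, 1: 0.3, 2: 0.1, 3: 0.4}

# Sanity check: the probabilities of a distribution must sum to 1.
total = sum(pmf.values())

# Part (i): add the probabilities of the odd values of X.
p_odd = sum(p for x, p in pmf.items() if x % 2 == 1)

# Part (ii): E(X) is the probability-weighted sum of the values.
expected_value = sum(x * p for x, p in pmf.items())

print(f"P(X odd) = {p_odd:.1f}, E(X) = {expected_value:.1f}")
```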
i. In quota sampling we cannot draw statistical inference.
ii. The Spearman rank correlation coefficient is more useful than Pearson
correlation in data with outliers.
iii. If the constant in the regression equation is negative, the correlation will also
be negative.
iv. If the p-value for a test is larger than the significance level, we reject H0 .
v. In experimental studies one can use quota sampling to select the treatment
and control groups.
(10 marks)

This question contains material from various parts of the subject guide. Here, it is more
important to have a good intuitive understanding of the relevant concepts than the technical
level in computations. Part (i.) requires material from Chapter 10 and in particular Section
10.7 on types of samples. Part (ii.) is about correlation (see Section 12.8), whereas part (iii.)
is mainly on regression (Section 12.9) and its links with correlation. Part (iv.) targets the
concepts of a p-value covered in Section 8.11. Finally, part (v.) is on observational studies
and designed experiments and the use of treatment and control groups. Sections 11.6 and
11.7 are relevant.

reason for either true or false and not just a choice between the two. Some candidates lost
marks too for long rambling explanations without a decision as to whether a statement was
true or false.
i. True, because we cannot attach a random probability model to the data.

ii. True, because the ranks are affected less than the sample mean present in Pearson’s
correlation coefficient.
iii. False, because the constant in the regression equation has nothing to do with
correlation.
iv. False, because we reject H0 if the p-value is smaller than the significance level.
v. False, because in experimental studies, the subjects should be allocated to groups in a
random way.
Section B
Question 2
(a) A social survey in the UK asked subjects, ‘Do you do your shopping online?’
with the possible answers being ‘Frequently’, ‘Rarely’ and ‘Never’. The table
below cross-classifies their responses with their gender.
Shop online
Gender Frequently Rarely Never Total
Male 52 (26%) 94 (47%) 54 (27%) 200 (100%)
Female 47 (39%) 52 (43%) 21 (18%) 120 (100%)
Total 99 (31%) 146 (46%) 75 (23%) 320 (100%)
would you describe the relationship between gender and tendency to shop
online?
(13 marks)

This part targets Chapter 9 on contingency tables and chi-squared tests. Note that part (i.)
of the question does not require any calculations, just understanding and interpreting
contingency tables. Candidates can attempt Learning activity Question 4 to practise. Part
(ii) is a straightforward chi-squared test and the reading is also given in Chapter 9.

i. Looking at the percentages, we see some differences between males and females. More
specifically, 39% of females shop online frequently versus 26% of males. Moreover, the
percentage of males who never shop online is 27% versus 18% for females. Hence, there
may be an association between gender and tendency to shop online, although this needs
to be investigated further.
ii. Set out the null hypothesis that there is no association between gender and tendency to
shop online against the alternative, that there is an association. Be careful to get these
H0 : No association between gender and tendency to shop online vs.
H1 : Association between gender and tendency to shop online.
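Work out the expected values to obtain the table below:
Gender Frequently Rarely Never
Male 61.875 91.250 46.875
Female 37.125 54.750 28.125
The test statistic formula is:
\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}
The degrees of freedom are (2 − 1) × (3 − 1) = 2.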
For α = 0.05 ⇒ the critical value is 5.991, hence we reject H0 .
We conclude that there is some evidence of an association between gender and tendency
to shop online.
Saying ‘we reject at the 5% significance level, but fail to reject at the 1% significance
level’ is insufficient. What does this mean? Is there an association or not? If there is one,
how strong is it? This needed to be answered if the full nine marks allocated for this
question were to be given. Many candidates lost marks by missing out follow-up like this.
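The expected values and the test statistic can be reproduced from the observed counts with the sketch below. The 5% critical value 5.991 is the one quoted above; the 1% critical value 9.210 is the standard table value for 2 degrees of freedom and is added here as an assumption about what the tables show.

```python
# Observed counts from the question (rows: Male, Female;
# columns: Frequently, Rarely, Never).
observed = [[52, 94, 54],
            [47, 52, 21]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected count under H0 = row total x column total / grand total.
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Chi-squared statistic: sum of (O - E)^2 / E over all six cells.
chi2 = sum((o - e) ** 2 / e
           for o_row, e_row in zip(observed, expected)
           for o, e in zip(o_row, e_row))

df = (len(observed) - 1) * (len(observed[0]) - 1)

# Critical values for 2 degrees of freedom (9.210 is an assumed table value).
reject_at_5 = chi2 > 5.991
reject_at_1 = chi2 > 9.210

print(f"chi-squared = {chi2:.2f} on {df} degrees of freedom")
print(f"reject H0 at 5%: {reject_at_5}, at 1%: {reject_at_1}")
```

A result that is significant at the 5% level but not at the 1% level corresponds to moderate, rather than strong, evidence of an association.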
out about internet usage among children less than 10 years old. Provide a
probability sampling scheme and a sampling frame that you would like to
use. Identify a potential source of selection bias that may occur and discuss
how this issue can be addressed.
ii. Describe what is a longitudinal survey. State two ways in which panel
surveys differ from longitudinal surveys.
(12 marks)

This was a question on basic material on survey designs. Background reading is given in
Chapters 10 and 11 of the subject guide which, along with the recommended reading, should
Learning activities of Chapter 10. Part (ii.) in particular looked at longitudinal studies for
which Section 11.8.1 is highly relevant.

One of the main things to avoid in part (i.) here is to write essays without any structure.
are unsure of what these things are, do not write lengthy essays. This is not giving you
being asked, keep in mind that the answer should not be long. Note also that in some
cases there is no unique answer to the question.
The marking scheme and some model answers for part (i.) (worth 6 marks) are given below.
Sampling frame – 1 mark: note that the target group is ‘children less than ten years old’
hence one might take the view that they only need to look at children who are at school or
nursery school – say aged from 4 or 5 to 10. If this is the case, a frame of schools and
nurseries and sampling from their lists may be used. Another example is to use doctors’ lists
(if possible).
Sampling scheme – 2 marks: one mark for stating a probability/random sampling scheme
and one mark for a relevant justification. For example, if one goes with clustering (area of
the country/type of school. . . junior, infants, preschool etc.) or stratified sampling
(stratification factors: gender, age group), why would these schemes be advantageous?
Source of selection bias – 2 marks: selection bias could arise from the omission of those who
are not at school or preschool (in most countries, school is compulsory only from five or six
years old) and those who are home schooled. Note that this should not be confused with
response bias, for example things about children responding differently if the teachers ask
the questions.
Way to address it – 1 mark: reset the target population group to match what the sampling
frame is actually providing.
Part (ii.) was on longitudinal studies and involved more direct questions. One mark was
given for a description of a longitudinal study; i.e. a longitudinal survey is a survey where
the same individuals are resurveyed over time. Another mark was given for a relevant
example. In terms of ways panel surveys are different from longitudinal surveys, two marks
were given for each statement in the list below (for a maximum of four marks):
• they are more likely to be chosen by quota rather than random methods
• individuals are interviewed every 2 to 4 weeks (rather than every few years)
• individuals are unlikely to be panel members for longer than two years at a time.
Question 3
(a) A car insurance company would like to examine the relationship between driving
experience and insurance premium. For this reason, a random sample of ten
drivers is taken and the years of driving experience (x) as well as the monthly
insurance premium (y, in £) is recorded. The data are shown in the table below.
Driver #1 #2 #3 #4 #5 #6 #7 #8 #9 #10
Driving experience (x) 6 3 11 10 15 6 25 16 15 20
Insurance premium (y) 66 88 51 70 44 56 42 60 45 40
i. Draw a scatter diagram of these data on the graph paper provided. Label the
diagram carefully.
diagram.
monthly insurance premium for a driver with 10 years of experience? Will
you trust this value? Justify your answer.
(13 marks)

12.6 provides details for scatter diagrams and is suitable for part (i.) whereas the remaining
parts are on correlation and regression that are covered in Sections 12.8–12.10 of the subject
guide. Section 12.7 is also relevant. Sample examination question 2 of this chapter is
recommended for practice on questions of this type.

which give their units in addition. Far too many candidates threw away marks by
[Figure: scatter diagram titled ‘Driving experience and insurance premium’. x-axis: years of driving experience; y-axis: insurance premium in pounds.]
this value is the following: the data suggest that the higher the driving experience of a
certain driver, the lower the insurance premium. The fact that the value is very close to
−1, suggests that this is a strong (negative) linear association.
iii. The regression line can be written by the equation ŷ = a + bx or y = a + bx + ε. The
formula for b is:
b = \frac{\sum x_i y_i - n\bar{x}\bar{y}}{\sum x_i^2 - n\bar{x}^2}
and a = \bar{y} - b\bar{x}.
iv. The prediction will be ŷ = 78.4318 − 1.7505 × 10 = £60.93. Yes, we would trust this
value, since this point is inside the observed range of x, and therefore the prediction is
based on interpolation.
answering such questions and a mark is deducted if they are not specified. It is also
important to provide the answer to at least two decimal places.
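The regression estimates and the prediction can be reproduced from the data in the question. The sketch below applies the formula-sheet expressions for b and a directly; the variable names are illustrative.

```python
# Data from the question.
x = [6, 3, 11, 10, 15, 6, 25, 16, 15, 20]      # years of driving experience
y = [66, 88, 51, 70, 44, 56, 42, 60, 45, 40]   # monthly premium in pounds

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# b = (sum of x_i y_i - n x_bar y_bar) / (sum of x_i^2 - n x_bar^2)
s_xy = sum(xi * yi for xi, yi in zip(x, y)) - n * x_bar * y_bar
s_xx = sum(xi ** 2 for xi in x) - n * x_bar ** 2
b = s_xy / s_xx

# a = y_bar - b x_bar
a = y_bar - b * x_bar

# Prediction for 10 years of experience, which lies inside the observed
# range of x (3 to 25), so this is interpolation rather than extrapolation.
x_new = 10
prediction = a + b * x_new
inside_range = min(x) <= x_new <= max(x)

print(f"a = {a:.4f}, b = {b:.4f}")
print(f"predicted premium = {prediction:.2f} pounds, interpolation: {inside_range}")
```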
enquiries. For this reason, the manager wants to compare the call waiting times
during the years 2013 and 2012. Unfortunately, extensive records of the
company are not available, and he can only check a random sample of phone
calls within these two years. The available data, measured in minutes of waiting
time, are provided below for each year.

2013 42 7.4 0.5
2012 35 7.1 0.6

iii. Adjust the procedure above to determine whether the mean waiting time in
2013 was greater than in 2012.
(12 marks)

proportions. While all of Chapter 8 on hypothesis testing is relevant, one can focus on the
sections involving proportions (8.14 and 8.15), in particular Section 8.15. The last part of
the question refers to one-sided hypothesis tests that are also located in these sections.

i. Let µ1 denote the mean waiting time during 2013 and µ2 the mean waiting time during
2012.
The null hypothesis is that the two population means (µ1 and µ2) do not differ; the
alternative is that they do.
H0 : µ1 = µ2 vs. H1 : µ1 ≠ µ2 .
The test statistic formulae, depending on whether a pooled variance is used or not, are
provided in the formula sheet:
$$ \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{S_1^2/n_1 + S_2^2/n_2}} \quad \text{or} \quad \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{S_p^2 (1/n_1 + 1/n_2)}}. $$
The test statistic value is 2.354 (2.394 if the pooled variance is used). The critical value
at the 5% significance level, assuming a normal approximation as the number of
for a difference between the two years. If we take a (smaller) significance level of 1%, the
critical value is ±2.576, so we do not reject H0 . We conclude that there is some but not
strong (i.e. moderate) evidence of a difference between the two years.
ii. The assumptions for (ii.) were:
• Assumption about whether σ1² = σ2².
• Assumption about whether n1 + n2 − 2 is ‘large’, hence t vs. z.
iii. This case corresponds to a one-sided test, therefore the hypotheses would be H0 : µ1 = µ2
vs. H1 : µ1 > µ2 . The test statistic value is the same for this case but the critical values
are now 1.645 for the 5% significance level and ≈ 2.33 for the 1% significance level. As
we now reject H0 at both levels we conclude that there is strong evidence (i.e. the result
is highly significant) that the mean waiting time in 2013 was greater than in 2012.
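As a rough check of the two test statistic values quoted in part (i.), the sketch below applies the two formulae to the summary figures in the table. It assumes that the three columns of the table are the sample size, the sample mean and the sample standard deviation (the column headings are not shown above); the variable names are illustrative only.

```python
import math

# Summary figures from the table (2013 first, 2012 second).
# Assumption: the third column is the sample standard deviation.
n1, mean1, sd1 = 42, 7.4, 0.5
n2, mean2, sd2 = 35, 7.1, 0.6

diff = mean1 - mean2

# Without pooling: standard error is sqrt(S1^2/n1 + S2^2/n2).
se_unpooled = math.sqrt(sd1**2 / n1 + sd2**2 / n2)
z_unpooled = diff / se_unpooled

# With a pooled variance: Sp^2 weights each sample variance by its degrees of freedom.
sp2 = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2)
se_pooled = math.sqrt(sp2 * (1 / n1 + 1 / n2))
z_pooled = diff / se_pooled

print(round(z_unpooled, 3), round(z_pooled, 3))
```

Under that assumption the two values agree with 2.354 and 2.394 to three decimal places.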
Question 4
following average daily intakes of calories for 12 athletes, measured in kcals:
1808 2200 2154 2004 2101 1957 3061 2500 2009 2147 2231 1936
of the data
(13 marks)

Chapter 4 provides all the relevant material for this question. More specifically, reading on
boxplots can be found in Section 4.9.2, but all of Sections 4.8 and 4.9 are highly relevant.
i. The boxplot diagram the Examiners were expecting to see is shown below. Marks were
awarded for including the title, identifying the box and the whiskers and noting outliers,
In order to identify the box, the quartiles are needed which are 1992.25 (anything
between 1957 and 2009 is acceptable), 2124.00, 2207.75 (anything between 2200 and 2231
is also acceptable as long as it is consistent with Q1 ), hence giving an interquartile range
of 215.5 (or anything else consistent with the values of Q1 and Q3 ).
Hence the outlier limits are from 1669 to 2531.
The value of 3061 is therefore an outlier.
Note that no label of the x-axis is necessary and that the plot can be transposed.
ii. Based on the shape of the boxplot above, we can see that the distribution of the data is
positively skewed.
iii. A histogram, stem-and-leaf diagram or a dot plot are other types of suitable graphical
displays. The reason is that the variable (daily calorie intake) is measurable and these graphs are
suitable for displaying the distribution of such variables.
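The quartiles and outlier limits quoted in part (i.) can be reproduced with a short sketch. It assumes the interpolation rule that Python's statistics.quantiles uses with method="inclusive", and the usual 1.5 × IQR rule for the outlier limits; as the commentary notes, other quartile conventions are also acceptable.

```python
import statistics

# Average daily calorie intakes (kcal) for the 12 athletes.
calories = [1808, 2200, 2154, 2004, 2101, 1957, 3061, 2500, 2009, 2147, 2231, 1936]

# Quartiles by linear interpolation between order statistics.
q1, median, q3 = statistics.quantiles(calories, n=4, method="inclusive")
iqr = q3 - q1

# Outlier limits: 1.5 interquartile ranges beyond the quartiles.
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr
outliers = [v for v in calories if v < lower_limit or v > upper_limit]

print(q1, median, q3, iqr, lower_limit, upper_limit, outliers)
```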
(b) A study was made to determine the amount of fuel economy obtained by
using a specific new type of tyre over a standard type. For this reason, 8 cars
were fitted with the new type of tyre and the fuel consumption (in km/l) was
measured after a test-drive. Afterwards, the same cars with the same drivers
were fitted with the standard type of tyre and the experiment was repeated
to obtain the following fuel consumption measurements.
Car #1 #2 #3 #4 #5 #6 #7 #8
New type tyres 4.1 6.2 7.1 5.4 5.5 5.1 6.1 6.3
i. Carry out an appropriate hypothesis test to determine whether the fuel
consumption is different between the two types of tyre. State the test
null hypothesis. Comment on your findings.
(12 marks)
Look up the sections about hypothesis testing for testing differences in means in Chapter 8.
However, it is essential for this part to focus on the section regarding paired samples
(Section 8.16.4).

i. Regarding hypotheses, note that the wording ‘is different’ suggests a two-sided test:
H0 : µstandard = µnew vs. H1 : µstandard ≠ µnew .
In this part, it is also essential to realise that we have paired samples, as we have two
observations for each car (with standard and new types of tyres). Hence the difference
for each car should be calculated:
0.5 0.3 0.3 0.1 −0.2 0.1 0.5 0.4.
The next step is to calculate sd = 0.23905 and x̄d = 0.25, in order to obtain the test
statistic value (x̄d − 0)/(sd /√n) = 2.958.
We have a t distribution with 7 degrees of freedom, hence the critical value (for a
Hence, we reject H0 at the 5% significance level. Testing at the 1% significance level
gives a critical value of t7, 0.01 = 3.499. Therefore, we do not reject H0 concluding that
there is moderate evidence of a difference between the two types of tyre.
ii. Assumptions are:
• Differences normally distributed [no marks for normally distributed fuel
consumption].
• Pairs of observations are independent [a weaker condition which suffices is that the
differences are independent, but this is unlikely if observations are not].
iii. This is a straightforward exercise for confidence intervals using the appropriate formula
from the formula sheet (make sure to be able to recognise it). The requested confidence
interval is (0.0501, 0.4499).
iv. There is some, but not strong, evidence in the data that the new type of tyre results in a
lower fuel consumption. This can be seen, for example, from the 95% confidence interval
whose endpoints are both positive.
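The paired-sample calculations in parts (i.) and (iii.) can be sketched as follows, starting from the eight differences listed above. The critical value 2.365 is the two-sided 5% point of the t distribution with 7 degrees of freedom taken from standard tables; the variable names are illustrative only.

```python
import math
import statistics

# Differences in fuel consumption for the eight cars, as listed in the commentary.
differences = [0.5, 0.3, 0.3, 0.1, -0.2, 0.1, 0.5, 0.4]

n = len(differences)
mean_d = statistics.mean(differences)
sd_d = statistics.stdev(differences)  # sample standard deviation (divisor n - 1)
se = sd_d / math.sqrt(n)

# Test statistic for H0: mean difference = 0.
t_stat = (mean_d - 0) / se

# 95% confidence interval; 2.365 is read from t tables with 7 degrees of freedom.
t_crit = 2.365
ci = (mean_d - t_crit * se, mean_d + t_crit * se)

print(round(sd_d, 5), round(t_stat, 3), tuple(round(v, 4) for v in ci))
```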

Important note
section.
Section A
Question 1

i. Calories you consumed yesterday.
ii. Type of a film.
iii. Rank of a continent according to its population.
iv. Time an employee has spent in a company.
(8 marks)


A general tip for identifying continuous and categorical variable is to think of the possible
i. The data represent calories which can be measured to many decimal places, for example
203.4. This is therefore a measurable variable.
ii. Each type of film is a category: comedy, drama, horror, thriller etc. Hence, the variable
is categorical. Note also that types of film do not have a natural ordering, so this
represents a nominal categorical variable.
iii. Each rank is a category, therefore this is a categorical variable. The values of this
variable are the ranks of each continent. By definition the categories (ranks) are ordered,
therefore resulting in an ordinal categorical variable.
iv. Time can be measured in various units (years, months, weeks) and to several decimal
places, for example 5.3 years. This is therefore a measurable variable.
Weak candidates did not provide justifications for their choices, reported nominal or
categorical or measurable variables and sometimes answered ordinal when their justification
(b) The table below contains the number of customers that visited two different
branches of a bank on the last days of the previous month:
Branch A 75 72 142 120 79 75 81
Branch B 81 88 83 89 82 91 92 87 82
i. Find the mean and the median number of customers for each branch.
ii. Comment on the differences in the mean and median for the two branches
that you found in part (i.). For this data set, which do you think would give
a better description for the number of customers: the mean or the median?
Explain briefly.
iii. After making some enquiries, you find out that the ATM next door to
branch A was not working on the days with 142 and 120 customers. Without
doing any calculations, would you change your answers about potential
differences between the means and between the medians of the two
branches? Give explanations for any statements that you make.
(8 marks)

(Measures of location) for parts (i.) and (iv.) and Section 4.9 (Measures of spread) for parts
(ii.) and (iii.).

It is important to do the summation carefully and divide with the correct number of
observations to obtain the mean. For questions which require calculations on the median (or
also that this question requires these measures for both banks, so the calculations should be
done for each bank separately.
branch and then divide them by the number of observations in each row. Doing so yields:
(75 + 72 + · · · + 81)/7 = 92.0, for branch A
and:
(81 + 88 + · · · + 82)/9 = 86.1, for branch B.
For the median if we put the numbers in ascending order we get:
72 75 75 79 81 120 142, for branch A
and:
81 82 82 83 87 88 89 91 92, for branch B.
The median for branch A is given by taking the 4th number in the first of the rows
above, which is 79. The median for branch B is obtained from the 5th number in the 2nd
row above, which is 87.
One mark for each of the four numbers above was awarded.
ii. It is first important to note that the mean of branch A is higher than that of branch B.
However, this does not necessarily indicate that the centre of the distribution of branch
A is larger than that of branch B. Branch A contains two outliers (120 and 142) which
result in a high mean. Apart from these outliers, the numbers of customers tend to be
higher in branch B, so the median gives a somewhat better indication of the ‘typical’
number of customers for each branch.
iii. After taking out these days we can argue that both the mean and median for branch A
would be smaller than that for branch B on a day that the ATM next door to branch A
is working. This is because all the numbers of customers in branch A (on a day when the
ATM next door to branch A is working) are smaller or equal to any number of customers
in branch B.
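The means and medians in part (i.), and the effect of removing the two unusual days in part (iii.), can be checked with a short sketch. The dictionary and variable names are illustrative only.

```python
import statistics

# Number of customers per day, from the table in the question.
branches = {
    "A": [75, 72, 142, 120, 79, 75, 81],
    "B": [81, 88, 83, 89, 82, 91, 92, 87, 82],
}

# Mean and median for each branch, computed separately.
summary = {
    name: (statistics.mean(data), statistics.median(data))
    for name, data in branches.items()
}

# Part (iii.): branch A without the two days on which the ATM was not working.
branch_a_atm_working = [v for v in branches["A"] if v not in (142, 120)]
all_a_at_most_b = max(branch_a_atm_working) <= min(branches["B"])

for name, (mean, median) in summary.items():
    print(name, round(mean, 1), median)
print(all_a_at_most_b)
```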
variance 1.
0.95.
(4 marks)

Chapter 6 and work out the examples and activities of this section. The Sample

• P (Z < a) = P (Z ≤ a) = Φ(a)
• P (Z > a) = P (Z ≥ a) = 1 − P (Z ≤ a) = 1 − P (Z < a) = 1 − Φ(a)
i. We get:
$$ P(X + 3 < 3) = P\left(\frac{X + 3 - 3}{1} < \frac{3 - 3}{1}\right) = P(X < 0). $$
Due to symmetry we get:
P (X < 0) = 0.5.
In fact the above probability can be found directly using the symmetry property of the
normal distribution. Such direct answers, giving the correct value above and stating the
symmetry, were also accepted.
ii. We can write:

P (X − b < 0) = 0.95 ⇔ P (X − b + b < b) = 0.95 ⇔ P (X < b) = 0.95.
From the tables we can get that b = 1.6449.
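Both values can also be obtained without tables. The sketch below assumes that X is standard normal (mean 0 and variance 1), which is consistent with the standardisation in the working above; it uses Python's statistics.NormalDist.

```python
from statistics import NormalDist

# Assumption: X follows the standard normal distribution.
X = NormalDist(mu=0, sigma=1)

# Part (i.): P(X + 3 < 3) = P(X < 0), which equals 0.5 by symmetry.
p_below_zero = X.cdf(0)

# Part (ii.): the value b with P(X < b) = 0.95, i.e. the 95th percentile.
b = X.inv_cdf(0.95)

print(p_below_zero, round(b, 4))
```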
interval? Also, what was the size of the sample used?
(5 marks)

The whole of Chapter 7 is relevant, but the main reading for this question is Section 7.10
about confidence intervals for a single proportion. Section 7.11 on sample size determination
is also highly relevant. It is essential to read this section carefully and attempt all Learning
activities.

Note that confidence intervals for single proportions (and all the other confidence intervals
in this course) are symmetric around p. Hence p would be in the centre of the interval
(0.4086, 0.5914). Adding the two endpoints and dividing by 2 gives p = 0.5. This question
also asks you to determine a sample size.
• Find the sample proportion: p = 0.5 (see above).
• Find the standard error (using the relevant formula from the formula sheet):
$$ \sqrt{\frac{p(1 - p)}{n}} = \frac{\sqrt{0.5 \times 0.5}}{\sqrt{n}} = \frac{0.5}{\sqrt{n}}. $$
• Use the correct z value: 1.6449.
• Solve the relevant equation to find n:
$$ 1.6449 \times \frac{0.5}{\sqrt{n}} = 0.0914. $$
• Remember to round up the solution to the equation above. The correct sample size is
n = 81.
Some candidates forgot to round up. Remember that we are asked about a sample size.
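The steps in the list above can be sketched as follows, using the interval endpoints and the z value quoted in the commentary. The variable names are illustrative only.

```python
import math

# Endpoints of the reported confidence interval for the proportion.
lower, upper = 0.4086, 0.5914

# The interval is symmetric, so the sample proportion is its midpoint
# and the margin of error is half its width.
p = (lower + upper) / 2
margin = (upper - lower) / 2

# z value quoted in the commentary.
z = 1.6449

# Solve z * sqrt(p(1 - p)/n) = margin for n, then round up to a whole number.
n_exact = (z * math.sqrt(p * (1 - p)) / margin) ** 2
n = math.ceil(n_exact)

print(round(p, 4), round(margin, 4), n)
```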

i. $\sum_{i=1}^{5} x_i$   ii. $\sum_{i=3}^{5} 3x_i(y_i - 2)$   iii. $x_5^2 + \sum_{i=2}^{4} (x_i^2 + y_i)$.
(6 marks)

guide and in particular Learning activity Question 6.

i. $\sum_{i=1}^{5} x_i = 3 + (-2) + 1 + 0 - 2 = 0$.
ii. $\sum_{i=3}^{5} 3x_i(y_i - 2) = 3\sum_{i=3}^{5} x_i(y_i - 2) = 3(1 \times (-1 - 2) + 0 \times (2 - 2) + (-2) \times (0 - 2)) = 3 \times 1 = 3$.
iii. $x_5^2 + \sum_{i=2}^{4} (x_i^2 + y_i) = (-2)^2 + ((-2)^2 + 2) + (1^2 - 1) + (0^2 + 2) = 4 + 6 + 0 + 2 = 12$.
(f ) Suppose there are two boxes; the first one contains one green and three red
chosen?
(5 marks)

This is a question on probability and targets mostly the material of Chapter 5. It is
essential to practise exercises through the Learning activities of this chapter as well as the
material on the virtual learning environment (VLE). In particular you can attempt
Learning activity Question 6, Sample examination Question 4. It is also useful to
familiarise yourself with probability trees as they can be quite handy in such exercises.

The first part was straightforward for those that were familiar with this section. Part (ii.)
required knowledge of Bayes’ formula or a very good understanding of probability trees. The
working is given below.
i. Let B1 , B2 denote boxes 1 and 2 respectively, G denote a green ball and R a red ball. We
have P (G) = P (G | B1 ) P (B1 ) + P (G | B2 ) P (B2 ).
Note: It is useful to note that the working up to this point may be thought of as
‘knowledge of the method’ and therefore earns one mark regardless of whether
candidates obtain the correct final answer.
The following was worth one more mark and can be found by just substituting and doing
the straightforward calculations:
$$ P(G) = \frac{1}{4} \times \frac{1}{2} + \frac{1}{2} \times \frac{1}{2} = \frac{3}{8}. $$
Note: Some candidates reported the number 0.375 instead of 3/8. This is acceptable as
long as three decimal places are used.
ii. This part can be found by using Bayes’ formula:
$$ P(B_1 \mid G) = \frac{P(G \mid B_1)\,P(B_1)}{P(G)} = \frac{1/4 \times 1/2}{3/8} = \frac{1}{3}. $$
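The total probability and Bayes' formula calculations can be checked with exact fractions. The sketch takes the probabilities used in the working above (each box chosen with probability 1/2, P(G | B1) = 1/4 and P(G | B2) = 1/2) as given; the variable names are illustrative only.

```python
from fractions import Fraction

# Probabilities as used in the working above.
p_b1 = Fraction(1, 2)          # P(B1): box 1 is chosen
p_b2 = Fraction(1, 2)          # P(B2): box 2 is chosen
p_g_given_b1 = Fraction(1, 4)  # P(G | B1)
p_g_given_b2 = Fraction(1, 2)  # P(G | B2)

# Part (i.): total probability of drawing a green ball.
p_g = p_g_given_b1 * p_b1 + p_g_given_b2 * p_b2

# Part (ii.): Bayes' formula for the probability that box 1 was chosen, given green.
p_b1_given_g = p_g_given_b1 * p_b1 / p_g

print(p_g, p_b1_given_g)
```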

x 1 2 3 4
pX (x) 0.3 0.3 0.3 0.1
i. Find the probability that X is an even number.
(4 marks)

conditional probability and probability distributions. Reading from Chapter 5 is suggested
focusing on the sections covering these topics. Try Learning activity Question 1 and the
i. P (X even number) = P (X = 2) + P (X = 4) = 0.3 + 0.1 = 0.4.

ii. $E(X) = \sum_i x_i P(X = x_i) = 1 \times 0.3 + 2 \times 0.3 + 3 \times 0.3 + 4 \times 0.1 = 2.2$.
i. In stratified random sampling the interviewer selects a certain number of
people according to some pre-specified strata.
ii. If two variables have correlation which is almost zero, we can conclude that
they are independent.
iii. If two variables have correlation which is close to one, we can conclude that
the variables are related.
iv. If the χ² test statistic is larger than the 5% critical value, the p-value is also
larger than 0.05.
v. Cluster sampling can be used to reduce the cost of a survey.
(10 marks)

level in computations. Part (i.) requires material from Chapter 10 and in particular the
Section 10.7 on types of samples. Parts (ii.) and (iii.) are about correlation (see Sections
12.7 and 12.8). Part (iv.) targets the concepts of a p-value covered in Section 8.11. Finally,
part (v.) is on types of sampling for which Section 10.7 is relevant as mentioned earlier.

reason for either true or false and not just a choice between the two. Some candidates lost
true or false.
i. False, because in stratified sampling people would be chosen randomly, not by the
interviewer.
ii. False, because the two variables may have a non-linear relationship.
iii. True, because this means they would have a strong linear relationship.
iv. False, because if the χ² value was above the 5% critical value, the test is significant
which means the p-value would be smaller than 0.05.
v. True, because cluster sampling can be cheaper than other forms of random sampling.
Section B
Question 2
(a) A social survey in the UK asked subjects, ‘Do you buy organic products,
despite the fact they are usually more expensive?’ with the possible answers
being ‘Yes’, ‘Sometimes’ and ‘No’. The table below cross-classifies their
responses with their place of residence (‘Rural’ or ‘Urban’ areas).
                         Buy organic products
Place of residence   Yes         Sometimes    No           Total
Rural area           35 (17%)    90 (45%)     75 (38%)     200 (100%)
Urban area           73 (21%)    163 (46%)    114 (33%)    350 (100%)
Total                108 (20%)   253 (46%)    189 (34%)    550 (100%)
would you describe the relationship between place of residence and buying
organic products?
(13 marks)

contingency tables. Candidates can attempt Learning activity Question 4 to practise. Part
(ii) is a straightforward chi-squared test and the reading is also given in Chapter 9.

i. Looking at the percentages, we see that distributions for the answers regarding buying
organic products are similar in rural and urban areas. More specifically, the percentage
of people who answered ‘Yes’ is quite close in these two cases (17% and 21%,
respectively). The percentages of those who replied ‘No’ were not too far either (38% vs.
33%). Hence, there does not seem to be a strong association between place of residence
and buying organic products, although this needs to be investigated further.
ii. Set out the null hypothesis that there is no association between place of residence and
buying organic products against the alternative, that there is an association. Be careful
to get these the correct way round!
H0 : No association between place of residence and buying organic products vs.
H1 : Association between place of residence and buying organic products.
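Expected frequencies:
                     Yes         Sometimes    No
Rural area           39.2727     92.0000      68.7273
Urban area           68.7273     161.0000     120.2727

Test statistic formula:
$$ \sum_{i,j} \frac{(O_{i,j} - E_{i,j})^2}{E_{i,j}} $$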
are (2 − 1) × (3 − 1) = 2.
For α = 0.1 ⇒ the critical value of 4.605, hence we do not reject H0 .
We conclude that there is no evidence to support an association between place of
residence and buying organic products.
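The chi-squared calculation for part (ii.) can be sketched from the observed counts in the table. The expected frequencies are computed as (row total × column total) / grand total, and 4.605 is the 10% critical value quoted above; the variable names are illustrative only.

```python
# Observed counts: rows are Rural and Urban, columns are Yes, Sometimes, No.
observed = [
    [35, 90, 75],
    [73, 163, 114],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected frequency for each cell under no association.
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

# Chi-squared statistic: sum over all cells of (O - E)^2 / E.
chi2 = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

degrees_of_freedom = (len(observed) - 1) * (len(observed[0]) - 1)
critical_value_10pct = 4.605  # from chi-squared tables, 2 degrees of freedom
reject_h0 = chi2 > critical_value_10pct

print(round(chi2, 3), degrees_of_freedom, reject_h0)
```

The statistic computed this way falls well below 4.605, in line with the decision not to reject H0.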
Saying ‘we do not reject at the 5% significance level, but do reject at the 10% significance
level’ is insufficient. What does this mean? Is there an association or not? If there is one,
how strong is it? This needed to be answered if the full nine marks allocated for this
question were to be given. Many candidates lost marks by missing out follow-up like this.
out about internet usage among children less than 10 years old. Provide a
probability sampling scheme and a sampling frame that you would like to
use. Identify a potential source of selection bias that may occur and discuss
how this issue can be addressed.
ii. Describe what is a longitudinal survey. State an advantage and a
disadvantage when using such surveys.
(12 marks)

Learning activities of Chapter 10. Part (ii.) in particular looked at longitudinal studies for
which Section 11.8.1 is highly relevant.

One of the main things to avoid in part (i.) here is to write essays without any structure.
are unsure of what these things are, do not write lengthy essays. This is not giving you
The marking scheme and some model answers for part (i.) (worth 6 marks) are given below:
Sampling frame – 1 mark: note that the target group is ‘children less than ten years old’
hence one might take the view that they only need to look at children who are at school or
nursery school – say aged from 4 or 5 to 10. If this is the case, a frame of schools and
nurseries and sampling from their lists may be used. Another example is to use doctors’ lists
(if possible).
Sampling scheme – 2 marks: one mark for stating a probability/random sampling scheme
and one mark for a relevant justification. For example, if one goes with clustering (area of
the country/type of school. . . junior, infants, preschool etc.) or stratified sampling
(stratification factors: gender, age group), why would these schemes be advantageous?
Source of selection bias – 2 marks: selection bias could arise from the omission of those who
are not at school or preschool (in most countries, school is compulsory only from five or six
years old) and those who are home schooled. Note that this should not be confused with
response bias, for example things about children responding differently if the teachers ask
the questions.
Way to address it – 1 mark: reset the target population group to match what the sampling
frame is actually providing.
Part (ii.) was on longitudinal studies and involved more direct questions. One mark was
given for a description of a longitudinal study; i.e. a longitudinal survey is a survey where
the same individuals are resurveyed over time. Another mark was given for a relevant
example. In terms of advantages of longitudinal surveys, two marks were given for any in
the list below or any other sensible argument:
• Being able to see how individuals change over time. For example, the kinds of people
who change the products they buy in response to price changes or advertising campaigns.
• Being able to see the characteristics of those who do not change with respect to an
attribute. For example, seeing the characteristics of those who are loyal to a brand.
(Note that a cross-sectional survey might show no overall change, but the individuals’
positions might have reversed.)
Marks were also given for disadvantages; see below for some examples.
• The response rate might tail off over time (drop out).
• To improve the response rate, the researcher may have an effect on the respondents
(conditioning).
Question 3
(a) We are interested in studying the association between the price of flour and the
production of wheat in a particular area of the UK. The data shown in the table
below provide figures regarding the production of wheat in tonnes (x) as well as
the price of flour (y), in £ per kg, over the last 10 years.
Year #1 #2 #3 #4 #5 #6 #7 #8 #9 #10
Production of Wheat (x) 30 28 32 25 25 24 22 24 35 40
Price of Flour (y) 25 30 27 40 42 41 50 45 30 25
diagram carefully.
diagram.
price of flour for a year with 45 tonnes production of wheat? Will you trust
this value? Justify your answer.
(13 marks)

12.6 provides details for scatter diagrams and is suitable for part (i.) whereas the remaining
guide. Section 12.7 is also relevant. Sample examination question 2 of this chapter is

[Figure: scatter diagram "Flour price and wheat production"; x: wheat production (tonnes); y: flour price (£ per kg)]
this value is the following: the data suggest that the higher the production of wheat, the
lower the price. The fact that the value is very close to −1, suggests that this is a strong
(negative) linear association.
iii. The regression line can be written by the equation ŷ = a + bx or y = a + bx + ε. The
formula for b is:
$$ b = \frac{\sum x_i y_i - n\bar{x}\bar{y}}{\sum x_i^2 - n\bar{x}^2} $$
iv. The prediction will be ŷ = 73.9005 − 1.3474 × 45 = 13.268, i.e. £13.268 per kg. However,
since this point is outside the range of x, this prediction should not be trusted too much
as it is based on extrapolation.
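The slope formula in part (iii.) and the prediction in part (iv.) can be checked against the data in the question. The sketch uses the standard intercept formula a = ȳ − b x̄, which is not written out above; the variable names are illustrative only.

```python
# Data from the question: wheat production (x, tonnes) and flour price (y, £ per kg).
x = [30, 28, 32, 25, 25, 24, 22, 24, 35, 40]
y = [25, 30, 27, 40, 42, 41, 50, 45, 30, 25]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

sum_xy = sum(xi * yi for xi, yi in zip(x, y))
sum_x2 = sum(xi ** 2 for xi in x)

# Slope from the formula in part (iii.); intercept from a = y_bar - b * x_bar.
b = (sum_xy - n * x_bar * y_bar) / (sum_x2 - n * x_bar ** 2)
a = y_bar - b * x_bar

# Prediction for a year with 45 tonnes of wheat production.
x_new = 45
prediction = a + b * x_new

# A prediction outside the observed range of x is an extrapolation.
is_extrapolation = not (min(x) <= x_new <= max(x))

print(round(a, 4), round(b, 4), round(prediction, 3), is_extrapolation)
```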
important to provide the answer to at least two decimal places.
enquiries. For this reason the manager wants to compare the call waiting times
during the years 2013 and 2012. Unfortunately, extensive records of the
company are not available, and he can only check a random sample of phone
calls within these two years. The available data, measured in minutes of waiting
times, are provided below for each year.

2013 41 5.8 0.5
2012 34 6.1 0.6

iii. Adjust the procedure above to determine whether the mean waiting time in
2013 was less than that of 2012.
(12 marks)

proportions. While all of Chapter 8 on hypothesis testing is relevant, one can focus on the
sections involving proportions (8.14 and 8.15), in particular Section 8.15. The last part of
the question refers to one-sided hypothesis tests that are also located in these sections.

i. Let µ1 denote the mean waiting time during 2013 and µ2 the mean waiting time during
2012.
The null hypothesis is that the two population means (µ1 and µ2) do not differ; the
alternative is that they do.
H0 : µ1 = µ2 vs. H1 : µ1 ≠ µ2 .
$$ \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{S_1^2/n_1 + S_2^2/n_2}} \quad \text{or} \quad \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{S_p^2 (1/n_1 + 1/n_2)}}. $$
The test statistic value is 2.3225 (2.3624 if the pooled variance is used). The critical
value at the 5% significance level, assuming a normal approximation as the number of
for a difference between the two years. If we take a (smaller) significance level of 1%, the
critical value is ±2.576, so we do not reject H0 . We conclude that there is some but not
strong (i.e. moderate) evidence of a difference between the two years.
ii. The assumptions for (ii.) were:
• Assumption about whether σ1² = σ2².
• Assumption about whether n1 + n2 − 2 is ‘large’, hence t vs. z.
iii. This case corresponds to a one-sided test, therefore the hypotheses would be H0 : µ1 = µ2
vs. H1 : µ1 > µ2 . The test statistic value is the same for this case but the critical values
are now 1.645 for the 5% significance level and ≈ 2.33 for the 1% significance level. As
we now reject H0 at both levels we conclude that there is strong evidence (i.e. the result
is highly significant) that the mean waiting time in 2013 was greater than in 2012.
Question 4
following annual earnings for the salesmen of a company, measured in £000s:
35 26 22 24 21 57 36 35 29 47 30 36
of the data
(13 marks)

boxplots can be found in Section 4.9.2, but all of Sections 4.8 and 4.9 are highly relevant.

i. The boxplot diagram the Examiners were expecting to see is shown below. Marks were
awarded for including the title, identifying the box and the whiskers and noting outliers,
In order to identify the box, the quartiles are needed that are 25.5 (anything between 24
and 26 is acceptable), 32.5, 36, hence giving an interquartile range of 10.5.
Hence the outlier limits are from 9.75 to 51.75.
The value of 57 is therefore an outlier.
Note that no label of the x-axis is necessary and that the plot can be transposed.
ii. Based on the shape of the boxplot above, we can see that the distribution of the data is
positively skewed.
iii. A histogram, stem-and-leaf diagram or a dot plot are other types of suitable graphical
displays. The reason is that the variable income is measurable and these graphs are
suitable for displaying the distribution of such variables.
(b) A study was made to determine the amount of fuel economy obtained by
using a specific new type of tyre over a standard type. For this reason, 8 cars
were fitted with the new type of tyre and the fuel consumption (in km/l) was
measured after a test-drive. Afterwards, the same cars with the same drivers
were fitted with the standard type tyres and the experiment was repeated to
obtain the following fuel consumption measurements.
Car #1 #2 #3 #4 #5 #6 #7 #8
New type tyres 5.1 6.2 7.3 5.4 5.5 5.1 6.1 7.3
i. Carry out an appropriate hypothesis test to determine whether the fuel
consumption is different between the two types of tyre. State the test
(12 marks)
Look up the sections about hypothesis testing for testing differences in means in Chapter 8.
However, it is essential for this part to focus on the section regarding paired samples
(Section 8.16.4).

i. Regarding hypotheses, note that the wording ‘is different’ suggests a two-sided test:
H0 : µstandard = µnew vs. H1 : µstandard ≠ µnew .
In this part, it is also essential to realise that we have paired samples, as we have two
observations for each car (with standard and new types of tyres). Hence the difference
for each car should be calculated:
−0.5 0.3 0.1 0.1 −0.2 0.1 0.5 −0.6.
The next step is to calculate sd = 0.3808, x̄d = −0.025, in order to obtain the test
statistic value (x̄d − 0)/(sd /√n) = −0.1857.
Hence, we do not reject H0 at the 5% significance level. Testing at the 10% significance
level gives a critical value of t7, 0.1 = 1.895. Therefore, we still do not reject H0 concluding
that there is no significant evidence of a difference between the two types of tyre.
ii. Assumptions are:
• Differences normally distributed [no marks for normally distributed fuel
consumption].
• Pairs of observations are independent [a weaker condition which suffices is that the
differences are independent, but this is unlikely if observations are not].
iii. This is a straightforward exercise for confidence intervals using the appropriate formula
from the formula sheet (make sure to be able to recognise it). The requested confidence
interval is (−0.3433, 0.2933).
iv. The evidence in the data that the new type of tyre works is close to negligible. This can
be seen, for example, from the 95% confidence interval whose endpoints are negative and
positive.
ST104A ZA

Access Route
Statistics 1
Thursday, 7 May 2015 : 10:00 to 12:00

on this paper.
PLEASE TURN OVER

UL15/0377 Page 1 of 9 D1
SECTION A
(a) Classify each one of the following variables as either measurable (continuous) or
categorical. If a variable is categorical, further classify it as either nominal or ordinal.
Justify your answer. (Note that no marks will be awarded without a justification.)
i. The rating of a restaurant according to the number of Michelin stars it has.

ii. The unemployment rate of a country.
iii. The amount of money in a bank account.
iv. The colour of metro line.
[8 marks]
(b) Consider the following sample dataset:
2, 6, x, 13, 9.
You are told that the value of the sample mean is 8.
i. Calculate the value of x.

ii. Find the sample variance.
[4 marks]
(c) For a certain type of laptop, the duration of a fully charged battery until it becomes
empty, X, is normally distributed with a mean of 6 hours and a standard deviation
of 2 hours.
i. What is the probability that such a battery will last at least 5 hours?
ii. What is the probability that such a battery will last between 6 and 8 hours?
[4 marks]
(d) Suppose that x1 = 5, x2 = 3, x3 = −1, x4 = −2, x5 = 4, and y1 = 3, y2 = −2,

y3 = 2, y4 = −1, y5 = 0. Calculate the following quantities:
i. $\sum_{i=1}^{3} 2x_i$   ii. $\sum_{i=3}^{5} 3x_i(y_i - 2)$   iii. $y_4^2 + \sum_{i=2}^{4} (3x_i + y_i^2)$.
[6 marks]
UL15/0217 Page 2 of 7
D00
(e) The length of stay in a hospital is useful for planning purposes. Let X denote
the length of stay in days in a hospital after a minor operation. The probability
distribution of X is given below:
x 1 2 3 4
pX (x) 0.5 0.3 0.1 0.1
i. Find E(X), the expected length of stay in days in hospital after a minor
operation.
ii. A new policy in the hospital will add exactly one day to the length of stay
for this operation for every stay. Will the probability distribution of X change
after this new policy is put in place? If so, what will be the new expected
length of stay after this new policy is put in place?
[4 marks]
(f) The NBA basketball player LeBron James generally shoots his first 3-point shot in a
basketball game with a 30% success rate. If LeBron makes his first 3-point
shot, the success rate on his following 3-point shots goes up to 40%. If he misses it,
the success rate on his following 3-point shots drops to 20%.
i. What is the probability that LeBron James makes exactly one of his first two
3-point shots?
ii. If LeBron made exactly one of his first two 3-point shots, what is the probability
that the shot he made was the first one?
Note: A 3-point shot is when a player attempts to put the ball into the basket
from a wide distance.
[5 marks]
(g) It is known that the true mean mark in the course of ‘Statistics I’ at LSE is 64.5.
A random sample of 49 LSE athletes who took the course was taken where the
sample average was x̄ = 63.1 and the sample standard deviation was s = 5.6.
Perform a suitable hypothesis test to determine whether LSE athletes have a
different mean mark for the course ‘Statistics I’ than LSE students in general. State
your hypotheses, the test statistic and its distribution under the null hypothesis,
and your conclusion in the context of the problem.
[7 marks]
i. In a confidence interval for a population mean, an increase of the variance
will increase the width of the interval (assuming that everything else remains
constant).

ii. In a confidence interval for a population mean based on the t distribution, an

increase in the degrees of freedom of t will increase the width of the interval
(assuming that everything else remains constant).
iii. In a χ² test, an increase in the significance level α from 5% to 10% will increase
the p-value.
iv. In a χ² test, if the p-value is larger than the significance level, we conclude
that there is not sufficient evidence of association between the two relevant
variables.
v. In a sample survey assume that some respondents replied to all the questions
except for one. The non-responses are called ‘unit non response’.
vi. The regression of the variable Y on the variable X will always have the same
slope as the regression of the variable X on the variable Y .
[12 marks]

SECTION B
2. (a) A study looked into the amount of help students are receiving, and consisted
of 300 students from three schools. The students were classified into three
categories according to the type of help they receive. The data are shown
below.
                         Type of help
           Private tuition   Help from family   No help   Total
School 1         35                 25             40      100
School 2         28                 47             25      100
School 3         38                 22             40      100
Total           101                 94            105      300
i. Based on the data in the table, and without conducting a significance
test, compare the distributions of help received by students within schools.
Which type of help is most common in School 1, School 2 and School 3?
[14 marks]
(b) i. Describe what selection bias is and when it may occur. Give an example.
ii. You have been asked to design a nationwide survey in your country to
find out about working conditions among employees in the postal offices.
Provide a probability sampling scheme and a sampling frame that you
would like to use. Identify a potential source of response bias that may
occur and discuss how this issue could be addressed.
[11 marks]

3. A study is made for a particular allergy medication in order to determine the length
of relief it provides Y (in hours) in relation to the dosage of medication X (in mg).
For this reason, ten patients were given different doses of the medication and were
asked to report back when the medication seemed to wear off.
Patient            #1    #2    #3    #4    #5    #6    #7    #8    #9    #10
Dosage (x)         3     3.5   4     5     6     6.5   7     8     8.5   9
Relief Hours (y)   9.1   5.5   12.3  9.2   14.2  16.8  22.0  18.3  24.5  22.7

Sum of y data: 154.6 Sum of the squares of y data: 2767.3
Sum of the products of x and y data: 1049.1
diagram.
length of relief for a dosage of 11 mg? Will you trust this value? Justify
your answer.
[13 marks]
(b) A study focused on the perception of life satisfaction that may vary between
older and younger people. For this reason 12 adults over the age of 70 and 16
adults aged between 18 and 30 took a life satisfaction test that gave a score
for each one of them (high values of the score indicate higher life satisfaction).
Summaries of these scores are presented below.
                 Sample size   Sample mean   Sample variance
Older adults         12           33.5            16.0
Younger adults       16           29.0            15.3

times were different between these two age groups. Test at two appropriate
iii. Adjust the procedure above to determine whether the mean life satisfaction
score for older adults is higher than that of younger adults.
[12 marks]

4. (a) A variety of a broad bean plant is studied and the number of beans per plant
is counted and listed below.
71 94 62 74 106
76 87 94 76 78
83 56 78 79 80
60 92 54 81 45
72 54 45 85 72
74 65 68 55 66
i. Carefully construct, draw and label a stem-and-leaf diagram of these data.
ii. Find the mean (given that the sum of the data is 2182), the median and
the modal stem.
iii. Comment on the data given the shape of the stem-and-leaf diagram and
iv. Name two other types of graphical displays that would be suitable to
represent the data.
[12 marks]
(b) A researcher is interested in determining whether a particular pill provides
effective treatment for stomach pain. A randomised experiment was conducted
to address this question. The study randomly allocated 200 people to either
a group where the pill was administered, or a group where a placebo pill was
given. These people were monitored and the numbers of those who got better
(or did not) were recorded. The results are summarised below:
           Did not get better   Got better
Pill               21               74
Placebo            50               55
i. Give a 95% confidence interval for the difference in the rates of getting
better from stomach pain between those who took the pill and the placebo
group.
ii. Carry out an appropriate hypothesis test at the 5% significance level to
determine whether the rate of getting better is higher in the pill group,
compared to the rate in the placebo group. State the test hypotheses, and
iii. State any assumptions you made in (ii.).
iv. On the basis of the data alone, would you conclude that the particular
pill increases the chances of getting better from stomach pain? Provide an
[13 marks]
END OF PAPER

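ST104a Statistics 1

Expected value of a discrete random variable:
\[ \mu = E(X) = \sum_{i=1}^{N} p_i x_i \]

Standard deviation of a discrete random variable:
\[ \sigma = \sqrt{\sigma^2} = \sqrt{\sum_{i=1}^{N} p_i (x_i - \mu)^2} \]

The transformation formula:
\[ Z = \frac{X - \mu}{\sigma} \]

Finding Z for the sampling distribution of the sample mean:
\[ Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}} \]

Finding Z for the sampling distribution of the sample proportion:
\[ Z = \frac{P - \pi}{\sqrt{\pi (1 - \pi) / n}} \]

Confidence interval endpoints for a single mean (σ known):
\[ \bar{x} \pm z_{\alpha/2} \cdot \frac{\sigma}{\sqrt{n}} \]

Confidence interval endpoints for a single mean (σ unknown):
\[ \bar{x} \pm t_{\alpha/2,\, n-1} \cdot \frac{s}{\sqrt{n}} \]

Confidence interval endpoints for a single proportion:
\[ p \pm z_{\alpha/2} \cdot \sqrt{\frac{p (1 - p)}{n}} \]

Sample size determination for a mean:
\[ n \geq \frac{z_{\alpha/2}^2 \, \sigma^2}{e^2} \]

Sample size determination for a proportion:
\[ n \geq \frac{z_{\alpha/2}^2 \, p (1 - p)}{e^2} \]

z test of hypothesis for a single mean (σ known):
\[ Z = \frac{\bar{X} - \mu_0}{\sigma / \sqrt{n}} \]

t test of hypothesis for a single mean (σ unknown):
\[ T = \frac{\bar{X} - \mu_0}{S / \sqrt{n}} \]

z test of hypothesis for a single proportion:
\[ Z \cong \frac{P - \pi_0}{\sqrt{\pi_0 (1 - \pi_0) / n}} \]

z test for the difference between two means (variances known):
\[ Z = \frac{\bar{X}_1 - \bar{X}_2 - (\mu_1 - \mu_2)}{\sqrt{\sigma_1^2 / n_1 + \sigma_2^2 / n_2}} \]

t test for the difference between two means (variances unknown):
\[ T = \frac{\bar{X}_1 - \bar{X}_2 - (\mu_1 - \mu_2)}{\sqrt{S_p^2 (1/n_1 + 1/n_2)}} \]

Confidence interval endpoints for the difference between two means:
\[ (\bar{x}_1 - \bar{x}_2) \pm t_{\alpha/2,\, n_1 + n_2 - 2} \cdot \sqrt{s_p^2 \left( \frac{1}{n_1} + \frac{1}{n_2} \right)} \]

Pooled variance estimator:
\[ S_p^2 = \frac{(n_1 - 1) S_1^2 + (n_2 - 1) S_2^2}{n_1 + n_2 - 2} \]

t test for the difference in means in paired samples:
\[ T = \frac{\bar{X}_d - \mu_d}{S_d / \sqrt{n}} \]

Confidence interval endpoints for the difference in means in paired samples:
\[ \bar{x}_d \pm t_{\alpha/2,\, n-1} \cdot \frac{s_d}{\sqrt{n}} \]

z test for the difference between two proportions:
\[ Z = \frac{(P_1 - P_2) - (\pi_1 - \pi_2)}{\sqrt{P (1 - P) (1/n_1 + 1/n_2)}} \]

Pooled proportion estimator:
\[ P = \frac{R_1 + R_2}{n_1 + n_2} \]

Confidence interval endpoints for the difference between two proportions:
\[ (p_1 - p_2) \pm z_{\alpha/2} \cdot \sqrt{\frac{p_1 (1 - p_1)}{n_1} + \frac{p_2 (1 - p_2)}{n_2}} \]

χ2 statistic for test of association:
\[ \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} \]

Sample correlation coefficient:
\[ r = \frac{\sum_{i=1}^{n} x_i y_i - n \bar{x} \bar{y}}{\sqrt{\left( \sum_{i=1}^{n} x_i^2 - n \bar{x}^2 \right) \left( \sum_{i=1}^{n} y_i^2 - n \bar{y}^2 \right)}} \]

Spearman rank correlation:
\[ r_s = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n (n^2 - 1)} \]

Simple linear regression line estimates:
\[ b = \frac{\sum_{i=1}^{n} x_i y_i - n \bar{x} \bar{y}}{\sum_{i=1}^{n} x_i^2 - n \bar{x}^2}, \qquad a = \bar{y} - b \bar{x} \]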
ST104A ZA

Access Route
Statistics 1
Thursday, 7 May 2015 : 10:00 to 12:00

on this paper.
PLEASE TURN OVER

UL15/0378 Page 1 of 9 D1
SECTION A
(a) Classify each one of the following variables as either measurable (continuous) or
categorical. If a variable is categorical, further classify it as either nominal or ordinal.
Justify your answer. (Note that no marks will be awarded without a justification.)
i. Political parties in your country.

ii. The recorded time of a marathon runner in the Olympic Games.
iii. The Gross Domestic Product (GDP) of a country.
iv. Highest level of education for the employees of a company.
[8 marks]
3, 5, x, 12, 10

[4 marks]
(c) For a certain type of laptop, the duration of a fully charged battery until it becomes
empty, X, is normally distributed with a mean of 5 hours and a standard deviation
of 1.5 hours.
i. What is the probability that such a battery will last at least 4 hours?
ii. What is the probability that such a battery will last between 5 and 7 hours?
[4 marks]
(d) Suppose that x1 = 5, x2 = 3, x3 = −1, x4 = −2, x5 = 4, and y1 = 3, y2 = −2,

i. $\sum_{i=3}^{5} 3x_i$    ii. $\sum_{i=2}^{4} 2x_i (y_i - 3)$    iii. $y_3^2 + \sum_{i=1}^{3} (2x_i + y_i^2)$.
[6 marks]
(e) The length of stay in a hospital is useful for planning purposes. Let X denote
the length of stay in days in a hospital after a minor operation. The probability
distribution of X is given below:
x 1 2 3 4
pX (x) 0.4 0.3 0.2 0.1
i. Find E(X), the expected length of stay in days in hospital after a minor
operation.
ii. A new policy in the hospital will add exactly one day to the length of stay
for this operation for every stay. Will the probability distribution of X change
after this new policy is put in place? If so, what will be the new expected
length of stay after this new policy is put in place?
[4 marks]
(f) The NBA basketball player Kobe Bryant generally shoots his first 3-point shot in a
basketball game with a 40% success rate. If Kobe makes his first 3-point shot, the success
rate on his following 3-point shots goes up to 50%. If he misses it, the success rate
on his following 3-point shots drops to 20%.
i. What is the probability that Kobe Bryant makes exactly one of his first two
3-point shots?
ii. If Kobe made exactly one of his first two 3-point shots, what is the probability
that the shot he made was the first one?
Note: A 3-point shot is when a player attempts to put the ball into the basket
from a long distance.
[5 marks]
(g) It is known that the true mean mark in the course of ‘Statistics I’ at LSE is 63.5.
A random sample of 49 LSE athletes who took this course was taken where the
sample average was x̄ = 62.2 and the sample standard deviation was s = 5.2.
Perform a suitable hypothesis test to determine whether LSE athletes have a
different mean mark for the course ‘Statistics I’ than LSE students in general. State
your hypotheses, the test statistic and its distribution under the null hypothesis,
and your conclusion in the context of the problem.
[7 marks]
i. Increasing the confidence level will increase the width of a confidence interval
for a population mean (assuming that everything else remains constant).
ii. Increasing the sample size will increase the width of a confidence interval for a
population mean (assuming that everything else remains constant).
Alternatively, since we divide by √n, the width will decrease if n increases.
iii. In a χ2 test, an increase in the significance level α from 5% to 10% will decrease
the probability of a Type I error.
iv. In a χ2 test, if the p-value is smaller than the significance level, we conclude
that there is association between the two relevant variables.
v. In a sample survey assume that some respondents replied to all the questions
and some did not reply at all. The non-responses are called ‘item non response’.
[12 marks]

SECTION B
2. (a) A mental health study focused on 300 patients visiting three community mental
health centres. The patients were classified into three groups according to the
primary issue for which they were seen. The data are shown below.
                       Type of Problem
           Social Adjustment   Stress Related   Other   Total
Centre 1          45                 28           27     100
Centre 2          28                 44           28     100
Centre 3          46                 29           25     100
Total            119                101           80     300
i. Based on the data in the table, and without conducting a significance test,
compare the distributions of problems within centres. Which problem is
most common in Centre 1, Centre 2 and Centre 3?
[14 marks]
(b) i. Describe what response bias is and when it may occur. Give an example.
ii. You have been asked to design a nationwide survey in your country to find
out about job satisfaction among employees in the banking sector. Provide
to use. Identify a potential source of response bias that may occur and
discuss how this issue could be addressed.
[11 marks]

3. A chain of package delivery stores is looking into the association between weekly
sales (in hundreds of $) in each store (y) and the number of customers who made
purchases in that week (x). For this reason, 10 stores were selected at random from
all the stores in the chain and the variables x and y were recorded. They appear in
the table below:
Store                #1    #2    #3    #4    #5    #6    #7    #8    #9    #10
# of customers (x)   90    92    50    74    78    88    87    51    53    42
Sales (y)            11.2  11.1  6.8   9.2   9.4   10.1  9.4   7.7   8.2   6.1

Sum of y data: 89.2 Sum of the squares of y data: 822
diagram.
weekly sales for a store where 70 customers made a purchase? Will you
trust this value? Justify your answer.
[13 marks]
(b) A study focused on the perception of life satisfaction that may vary between
older and younger people. For this reason 15 adults over the age of 70 and 13
adults aged between 18 and 30 took a life satisfaction test that gave a score
for each one of them (high values of the score indicate higher life satisfaction).
Summaries of these scores are presented below.

                 Sample size   Sample mean   Sample variance
Older adults         15           32.1            15.2
Younger adults       13           28.5            19.3

times were different between these two age groups. Test at two appropriate
iii. Adjust the procedure above to determine whether the mean life satisfaction
score for older adults is higher than that of younger adults.
[12 marks]

4. (a) Thirty people were asked about the number of hours they exercise in a seven
day period and their answers were recorded and listed below.
2.0 4.0 4.5 5.0 5.5
6.0 6.5 6.5 7.0 7.0
7.5 7.5 8.0 8.0 8.5
8.5 8.5 9.0 9.0 10.0
10.5 10.5 11.0 11.5 12.0
13.0 14.0 17.0 18.0 21.0
paper provided.
ii. Find the mean (given that the sum of the data is 277), the median and
the modal group.
represent the data.
[12 marks]
(b) A researcher is interested in determining whether taking additional vitamin C
helps prevent the common cold. A randomised experiment was conducted to
address this question. The study randomly allocated 279 people to either a
group where vitamin C supplements were given, or a group where a placebo
pill was given. These people were monitored and the numbers of those who
got or did not get a cold were recorded. The results are summarised below:
             Got a cold   Did not get a cold
Vitamin C        17              122
Placebo          31              109
i. Give a 95% confidence interval for the difference in the rates of getting a
cold between the vitamin C and the placebo groups.
ii. Carry out an appropriate hypothesis test at the 5% significance level to
determine whether the rate of getting a cold is lower in the vitamin C
group, compared to the rate in the placebo group. State the test
iv. On the basis of the data alone, would you conclude that a vitamin C pill
reduces the chances of getting a cold? Provide an explanation with your
answer.
[13 marks]
END OF PAPER

ST104a Statistics 1
Important note
Information about the subject guide and the Essential reading

references
the commentary and/or online reading list and/or subject guide refer to an earlier edition. If
section.
General remarks
Learning outcomes
By the end of this course and having completed the Essential reading and activities you should:
• be able to apply a variety of methods for explaining, summarising and presenting data and
interpreting results clearly using appropriate diagrams, titles and labels when required
methods
• be able to use simple regression and correlation analysis and know when it is appropriate to
do so.
compulsory and covers several subquestions and accounts for 50 per cent of the total marks. Section
B contains three questions, each worth 25 per cent, from which you are asked to choose two.
appeared in the second. The first part of Question 3 was on regression and involved drawing a
data given. Question 4 had a series of questions which involved drawing diagrams, such as box
plots, hypothesis testing (in particular paired t-tests) and confidence intervals. This means that it is
really important that you make sure you have a reasonable idea of what topics are covered before
you start work on the paper! We suggest you divide your time as follows during the examination:
and subquestion.
• Allow yourself 45 minutes for Section A. Don’t allow yourself to get stuck on any one
question, but don’t just give up after two minutes!
given a title to any tables or diagrams that were required and, if you did more than the two
questions required in Section B, decide which one to delete. Remember that only two of
your answers will be given credit in Section B and that you must choose which these are.
What are the examiners looking for?
The examiners are looking for very simple demonstrations from you. They want to be sure that you:
• understand and can answer the questions set.
these are some of the things that candidates did not do, though asked, in the 2014 examinations.
Remember:
• If you are asked to label a diagram (which is almost always the case), please do so. Writing
is not acceptable to do one rather than the other. If you are asked to find a 5% value, this is
what will be marked.
• Do not waste time calculating things which are not required by the examiners. If you are
Commentaries?
• the answers, or keys to the answers, which the examiners were looking for
• the relevant detailed reference to P. Newbold, W.L. Carlson and B.M. Thorne Statistics for
business and economics. (London: Prentice–Hall, 2012) eighth edition [ISBN
9780273767060] and the subject guide
prepare, and similar questions from Newbold (2012).
Important note
In 2015, ST104a Statistics 1 was examined by two replacement examination papers, sat on 28
May and 3 June. Commentaries for these papers are provided and hence references are to these two
dates rather than ‘Zone A’ and ‘Zone B’.
Examination revision strategy
Many candidates are disappointed to find that their examination performance is poorer than they
expected. This may be due to a number of reasons. The Examiners’ commentaries suggest ways of
addressing common problems and improving your performance. One particular failing is ‘question
spotting’, that is, confining your examination preparation to a few questions and/or topics which
have come up in past papers for the course. This can have serious consequences.
We recognise that candidates may not cover all topics in the syllabus in the same depth, but you
need to be aware that examiners are free to set questions on any aspect of the syllabus. This
means that you need to study enough of the syllabus to enable you to answer the required number of
examination questions.
The syllabus can be found in the Course information sheet in the section of the VLE dedicated to
each course. You should read the syllabus carefully and ensure that you cover sufficient material in
preparation for the examination. Examiners will vary the topics and questions from year to year and
may well set questions that have not appeared in past papers. Examination papers may legitimately
include questions on any topic in the syllabus. So, although past papers can be helpful during your
revision, you cannot assume that topics or specific questions that have come up in past examinations
will occur again.
If you rely on a question-spotting strategy, it is likely you will find yourself in difficulties
when you sit the examination. We strongly advise you not to adopt this strategy.
ST104a Statistics 1
Important note
and any such changes will be publicised on the virtual learning environment (VLE). Note that in
what follows • corresponds to 1 mark unless stated otherwise.

references
section.
Comments on specific questions – 28 May replacement

examination
Section A
Question 1
(a) Consider the following sample dataset:
8, 2, 6, x, 5.

[4 marks]

(Measure of location) for part (i) and Section 3.9 (Measure of dispersion) for part (ii) of the
subject guide.

First you need to write down the formula for the sample mean. Then, it is important to do
the summation carefully and divide by the correct number of observations to obtain the
mean. Note that the sum in the numerator will contain the unknown x, hence this will give
you a simple equation. The solution of this equation will provide x. The working is as
follows.
i. • Since the sample mean is equal to 6, we can write:
\[ \frac{8 + 2 + 6 + x + 5}{5} = 6 \]
• or else:
21 + x = 30 ↔ x = 9.
ii. • Method:
\[ s^2 = \frac{(8 - 6)^2 + (2 - 6)^2 + (6 - 6)^2 + (9 - 6)^2 + (5 - 6)^2}{4} \]
• Correct value: 7.5.
Some candidates divided by 5 in the formula above. In such cases only one mark was
awarded for part (ii), provided that the correct value was obtained. The reason is that the
formula for the sample variance provided in the subject guide only suggests dividing by
n − 1, where n is the number of observations. In another error that occurred in some cases,
candidates subtracted the number x = 9 rather than the sample mean which is given to be 6.
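The working above can be checked with a short sketch. It assumes Python's standard `statistics` module, whose `variance` function uses the n − 1 divisor described in the commentary; the variable names are illustrative only.

```python
from statistics import mean, variance

known = [8, 2, 6, 5]      # the four observed values
target_mean = 6           # the sample mean given in the question
n = len(known) + 1        # total number of observations, including x

# Solve (sum(known) + x) / n = target_mean for x.
x = target_mean * n - sum(known)

data = known + [x]
s2 = variance(data)       # sample variance, divides by n - 1

print(x, mean(data), s2)
```

Dividing by n instead (the population variance, `statistics.pvariance`) gives a smaller value, which is the error mentioned above.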
(b) Suppose that x1 = 7, x2 = 3, x3 = 1, x4 = 0, x5 = −6, and y1 = −3, y2 = 5,

i. $\sum_{i=2}^{4} 2y_i$    ii. $\sum_{i=1}^{3} 4(x_i - 1)$    iii. $y_1^2 + \sum_{i=3}^{5} (x_i^2 + 2y_i^2)$.
[6 marks]

This question refers to the basic bookwork which can be found in Section 1.9 of the subject
guide and in particular Activity A1.6.
This question was generally well done. The answers are:
i. $\sum_{i=2}^{4} 2y_i = 2(5 - 8 + 9) = 12$.
ii. $\sum_{i=1}^{3} 4(x_i - 1) = 4 \sum_{i=1}^{3} (x_i - 1) = 4((7 - 1) + (3 - 1) + (1 - 1)) = 4(6 + 2 + 0) = 32$.
iii. $y_1^2 + \sum_{i=3}^{5} (x_i^2 + 2y_i^2) = (-3)^2 + (1^2 + 2 \times (-8)^2) + (0^2 + 2 \times 9^2) + ((-6)^2 + 2 \times 1^2) = 9 + 129 + 162 + 38 = 338$.
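As a check on the three sums, here is a minimal sketch in Python. The values of y3, y4 and y5 are not legible in the question as reproduced here; they are an assumption, inferred from the worked answers (−8, 9 and 1).

```python
# Index the observations from 1 to match the notation in the question.
x = {1: 7, 2: 3, 3: 1, 4: 0, 5: -6}
y = {1: -3, 2: 5, 3: -8, 4: 9, 5: 1}   # y3, y4, y5 inferred from the worked answers

# range(a, b) stops at b - 1, so the upper limit is the summation limit plus one.
part_i = sum(2 * y[i] for i in range(2, 5))                               # i = 2..4
part_ii = sum(4 * (x[i] - 1) for i in range(1, 4))                        # i = 1..3
part_iii = y[1] ** 2 + sum(x[i] ** 2 + 2 * y[i] ** 2 for i in range(3, 6))  # i = 3..5

print(part_i, part_ii, part_iii)
```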
(c) In a population 20% of men show early signs of losing their hair and 2% of them
carry a gene that is related to hair loss. It is also known that 80% of men who
carry the gene experience early hair loss.
i. What is the probability that a man carries the gene and experiences early
hair loss?
ii. What is the probability that a man carries the gene, given that he
experiences early hair loss?
[4 marks]

This is a question on probability and targets mostly the material in Chapter 4. It is
essential to practise on such exercises through the activities and exercises in this chapter as
well as the material on the VLE. In particular you can attempt Activity A4.6 and Sample
examination question 4. It is also useful to familiarise yourself with probability trees as they
can be quite handy in such exercises.

The first part was straightforward for those who were familiar with this section as it just
requires knowledge of the conditional probability definition. Part (ii) can be done by either
using Bayes' formula or a probability tree, or simply with a good understanding of the conditional
probability concept. The working is given below:
i. • P (G ∩ H) = P (G) P (H | G)
• = 0.02 × 0.8 = 0.016.
ii. • P (G | H) = P (G ∩ H)/P (H)
• = 0.016/0.2 = 0.08.
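The two steps above translate directly into a few lines of Python. This is only an illustrative sketch; G denotes carrying the gene and H experiencing early hair loss, as in the working, and the variable names are made up for the example.

```python
p_gene = 0.02             # P(G): man carries the gene
p_hair = 0.20             # P(H): man shows early hair loss
p_hair_given_gene = 0.80  # P(H | G)

# Part (i): multiplication rule, P(G and H) = P(G) * P(H | G).
p_gene_and_hair = p_gene * p_hair_given_gene

# Part (ii): definition of conditional probability, P(G | H) = P(G and H) / P(H).
p_gene_given_hair = p_gene_and_hair / p_hair

print(round(p_gene_and_hair, 3), round(p_gene_given_hair, 2))
```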
(d) Classify each one of the following variables as either measurable (continuous) or
categorical. If a variable is categorical, further classify it as either nominal or
ordinal. Justify your answer. (Note that no marks will be awarded without a
justification.)
i. Classification of a university degree.
ii. Fuel consumption of a car.
iii. Eye colour.
iv. The cost of life insurance.
[8 marks]

subject guide (Section 3.6) is essential. Candidates should gain familiarity with the notion
of a variable and be able to distinguish between discrete and continuous (measurable) data.
In addition to identifying whether a variable is categorical or measurable, further
distinctions between ordinal and nominal categorical variables should be made by candidates.

values they can take. If these are finite and represent specific entities, the variable is
categorical. Otherwise, if these consist of numbers corresponding to measurements, the data
i. The classification of a university degree can be 1st, 2.1, 2.2, 3 or fail in some countries.
Clearly these values represent categories and by definition these classifications are
ordered. Hence, this variable is a categorical ordinal variable.
ii. Fuel consumption is a variable that can be measured in miles/gallon or kilometres/litre
to some decimal places. Hence it is a measurable variable.
iii. Each eye colour is a category, so the possible values are one for each colour. Hence, the
variable is categorical. Note also that colours do not have a natural ordering, so this
iv. The cost of life insurance is a variable that can be measured in $, £ etc. to two decimal
places. Hence it is a measurable variable.
(e) In the past, the mean telephone call time of customers to a computer helpline
has been 16.0 minutes. The computer company conducts a training scheme for
its telephone consultants with the intention of reducing this mean call time.
After training, a random sample of 20 calling times had a sample mean of 14.3
minutes and a sample standard deviation of 5.0 minutes. Carry out a hypothesis
test, at two suitable significance levels, to decide if the training scheme has been
successful. State your hypotheses, the test statistic and its distribution under
the null hypothesis, and your conclusion in the context of the problem.
[7 marks]

This question refers to a one-sided hypothesis test examining whether the telephone call
time of customers to a computer helpline is less than 16.0 minutes. While the entire chapter
on hypothesis testing is relevant, candidates can focus on the relevant sections for a single
mean (7.12 and 7.13) and in particular 7.13. The question refers to one-sided hypothesis
tests that are located in Section 7.10 of Chapter 7.
It is essential to identify the type of hypothesis test required for this question. Since there is
only one variable involved it will have to be a single mean test, and the test statistic can be
found in the formula sheet. Make sure to substitute the relevant quantities carefully and
avoid any numerical errors in the calculation.
The next step is to identify the distribution of the test statistic. The fact that a sample
standard deviation is given indicates that the variance is unknown. Hence, since n < 30, the
t distribution should be used.
The remaining steps involve finding the critical values from the corresponding statistical
table for the relevant significance levels, deciding whether to reject H0 , and interpreting the
results in the context of the problem. The working of the exercise is given below:
• H0 : µ = 16 vs. H1 : µ < 16. (No X̄s, accept H0 : µ ≥ 16.)
• Test statistic value:
\[ \frac{\bar{x} - 16}{s / \sqrt{20}} = \frac{14.3 - 16}{5 / \sqrt{20}} = -1.52. \]
• The variance is unknown and n < 30 so the t distribution should be used.
• For α = 0.05, the critical value is −1.729.
• Decision: do not reject H0 .
• Choose larger α, say α = 0.1, hence −1.328, hence reject H0 .
• Weak evidence that the training has been successful in reducing the mean call time.
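The test steps listed above can be reproduced with a small sketch. It uses only the standard library; the critical values are the table values with 19 degrees of freedom quoted in the working, hard-coded rather than computed, and the variable names are illustrative.

```python
from math import sqrt

mu0 = 16.0    # mean call time under H0
xbar = 14.3   # sample mean after training
s = 5.0       # sample standard deviation
n = 20        # sample size

# t statistic for a single mean with unknown variance.
t_stat = (xbar - mu0) / (s / sqrt(n))

# Lower-tail critical values of t with n - 1 = 19 degrees of freedom,
# taken from statistical tables as in the working above.
critical = {0.05: -1.729, 0.10: -1.328}

# Reject H0 in favour of H1: mu < 16 when the statistic falls below the critical value.
decisions = {alpha: t_stat < c for alpha, c in critical.items()}

print(round(t_stat, 2), decisions)
```

Rejecting at the 10% level but not at the 5% level is what the commentary describes as weak evidence.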
(f ) The amount of coffee dispensed into a coffee cup by a coffee machine follows a
normal distribution with mean 125 ml and standard deviation 8 ml.
i. Find the probability that one cup is filled above the level of 137 ml.
ii. What is the proportion of cups with coffee contents between 117 ml and 133
ml?
[4 marks]

Chapter 5 and work out the examples and activities of this section. The sample examination
questions are quite relevant.
∗ P (Z < a) = P (Z ≤ a) = Φ(a)
∗ P (Z > a) = P (Z ≥ a) = 1 − P (Z ≤ a) = 1 − P (Z < a) = 1 − Φ(a)
∗ P (a < Z < b) = P (a ≤ Z < b) = P (a < Z ≤ b) = P (a ≤ Z ≤ b) = Φ(b) − Φ(a).
i. • We can write:
\[ P(X > 137) = P\left( \frac{X - 125}{8} > \frac{137 - 125}{8} \right) = P(Z > 1.5). \]
• Continuing from above, we get P (Z > 1.5) = 1 − Φ(1.5) = 1 − 0.9332 = 0.0668.

ii. • We can write:
\[ P(117 < X < 133) = P\left( \frac{117 - 125}{8} \leq Z \leq \frac{133 - 125}{8} \right) = P(-1 \leq Z \leq 1). \]
• Continuing from above, we get:
P (−1 ≤ Z ≤ 1) = Φ(1) − Φ(−1) = 0.8413 − 0.1586 = 0.6827.
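For readers who want to check table look-ups numerically, the following sketch uses `statistics.NormalDist` from the Python standard library (available from Python 3.8). It works with X directly, so no standardisation step is needed; table-based working may differ in the last decimal place because of rounding.

```python
from statistics import NormalDist

# Amount of coffee dispensed: normal with mean 125 ml and standard deviation 8 ml.
coffee = NormalDist(mu=125, sigma=8)

# Part (i): P(X > 137) = 1 - P(X <= 137).
p_above_137 = 1 - coffee.cdf(137)

# Part (ii): P(117 < X < 133) = F(133) - F(117).
p_between = coffee.cdf(133) - coffee.cdf(117)

print(round(p_above_137, 4), round(p_between, 4))
```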
(g) The variable X takes the values 1, 2, 3 and 5 according to the following
distribution
x 1 2 3 5
pX (x) 0.1 0.3 0.4 0.2
i. What is the probability that X is negative?
iii. Find the probability that X 2 > 8.
[5 marks]

conditional probability and probability distribution. Reading from Chapter 4 of the subject
guide is suggested with a focus on the sections on these topics. Try Activity A4.1 and the
i. • X only takes positive values, so the probability is 0.
ii. •• $E(X) = \sum_i x_i P(X = x_i) = 1 \times 0.1 + 2 \times 0.3 + 3 \times 0.4 + 5 \times 0.2 = 2.9$.
iii. • The probability distribution of Z = X 2 will be:
Z 1 4 9 25
pZ (z) 0.1 0.3 0.4 0.2
• Hence, the correct probability is 0.6.
Note that this part may be answered without deriving the probability distribution table
of Z. One can note that only the values X = 3 and X = 5 will give X 2 > 8, hence the
requested probability is 0.4 + 0.2 = 0.6.
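The same calculations can be written as a short sketch that stores the distribution as a dictionary mapping each value of X to its probability. The names are illustrative only.

```python
# Probability distribution of X: value -> probability.
pmf = {1: 0.1, 2: 0.3, 3: 0.4, 5: 0.2}

# P(X < 0): no value of X is negative, so the sum is over nothing.
p_negative = sum(p for x, p in pmf.items() if x < 0)

# E(X) = sum of x * P(X = x).
expected = sum(x * p for x, p in pmf.items())

# P(X^2 > 8): add the probabilities of the values whose square exceeds 8.
p_square_gt_8 = sum(p for x, p in pmf.items() if x ** 2 > 8)

print(p_negative, round(expected, 1), round(p_square_gt_8, 1))
```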
i. The chance that a normal random variable is less than one standard
deviation from its mean is 95%.
ii. Quota sampling is free of selection bias.
iii. Increasing the level of confidence will decrease the width of a confidence
interval for a population mean (assuming that everything else remains
constant).
iv. Failing to reject a false null hypothesis is known as a Type I error.
v. In a chi-squared test of association, the larger the test statistic value, the
larger the corresponding p-value.
vi. The upper quartile of a sample dataset is never smaller than the lower
quartile.
[12 marks]

level in computations. Part (i) concerns normal random variables that can be found in
Chapter 5 of the subject guide. Part (ii) requires material from Chapter 9 and in particular
Section 9.7 on types of sample, whereas Part (iii) is about correlation and regression (see
Sections 11.8 and 11.9). Parts (iv) and (v) target specific concepts of hypothesis testing;
namely Type I error (see Section 7.7) and a p-value (covered in Section 7.11) respectively.
Finally, Part (vi) is on measures of location and spread, which are covered in Sections 3.8 and
3.9.
reason for a true/false answer and not just a choice between the two. Some candidates lost
true or false.
i. False; it is approximately 68%.
ii. False. Selection is non-random and therefore introduces selection bias.
iii. False. A higher level of confidence would increase the z/t value making the interval wider.
iv. False. It is a Type II error.
v. False. The larger the test statistic value, the smaller the p-value.
vi. True. Q1 ≤ Q3 .
Section B
Question 2
(a) The following data show the periods (in minutes) that a random sample of
students needed to complete a statistics assignment:
76 59 93 87 38
50 56 123 45 67
102 34 54 85 85
50 44 33 51 40
82 92 79 38 86
34 29 107 63 46
i. Carefully construct a stem-and-leaf diagram of these data.
ii. Find the median and the quartiles.
iii. Comment on the data given the shape of the stem-and-leaf diagram without
any further calculations.
iv. Name two other types of graphical displays that would be suitable to
represent the data.
[12 marks]

stem-and-leaf diagrams can be found in Section 3.7.4, but the entire Sections 3.7, 3.8 and
3.9 are highly relevant.

i. A stem-and-leaf diagram, which was compatible with what the examiners were expecting
to see, is shown below. Marks were awarded for including the title, correct labelling,
vertical alignment and reasonable accuracy.
Stem-and-leaf diagram of time needed to complete a statistics assignment
Stem = 10s of minutes | Leaf = minutes
2 9
3 34488
4 0456
5 001469
6 37
7 69
8 25567
9 23
10 27
11
12 3
ii. • Median = 57.5.
• Q1 ≈ 44.25. Note: Any reasonable quartile method was accepted.
• Q3 = 85.
iii. Based on the shape of the stem-and-leaf diagram, we can see that the distribution of
the data is positively skewed.
iv. A boxplot, histogram or dot plot are other types of suitable graphical displays. The
reason for this is that the variable (time in minutes) is measurable and these graphs are suitable for
displaying the distribution of such variables.
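The diagram and the summary statistics in parts (i) and (ii) can be reproduced with the sketch below. It uses only the standard library. The quartile method is an assumption: `statistics.quantiles` with `method="inclusive"` happens to match the values quoted above, and other reasonable methods give slightly different quartiles.

```python
from collections import defaultdict
from statistics import median, quantiles

times = [76, 59, 93, 87, 38, 50, 56, 123, 45, 67,
         102, 34, 54, 85, 85, 50, 44, 33, 51, 40,
         82, 92, 79, 38, 86, 34, 29, 107, 63, 46]

# Group the sorted observations by stem (tens) and keep the leaf (units).
stems = defaultdict(list)
for t in sorted(times):
    stems[t // 10].append(t % 10)

# Include empty stems so that gaps in the data remain visible.
lines = []
for stem in range(min(stems), max(stems) + 1):
    leaves = "".join(str(leaf) for leaf in stems.get(stem, []))
    lines.append(f"{stem:>2} | {leaves}")

print("Stem = 10s of minutes | Leaf = minutes")
print("\n".join(lines))

# Median and quartiles (inclusive interpolation method, an assumed choice).
q1, q2, q3 = quantiles(times, n=4, method="inclusive")
print(median(times), q1, q3)
```

A title and a stem/leaf key still have to be added by hand in an examination answer, since marks are awarded for them.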
(b) A random sample of 512 unionised workers found that 38 had been made
redundant in the last twelve months. An independent random sample of 654
non-unionised workers found that 67 had been made redundant over the same
period.
i. Give a 95% confidence interval for the difference in the rates of redundancy
between unionised and non-unionised workers.
ii. Carry out a hypothesis test, at two suitable significance levels, to determine
whether unionised workers are less likely to be made redundant compared to
non-unionised workers. State the test hypotheses, and your test statistic and
its distribution under the null hypothesis. Comment on your findings.
[13 marks]

Look up the sections in the subject guide about hypothesis testing and confidence intervals
for differences in proportions; more specifically Sections 6.13, 7.14 and 7.15.

i. Let p1 , n1 refer to the proportion of redundant unionised workers and to the total
number of unionised workers, respectively. Similarly, denote by p2 and n2 the
corresponding quantities for non-unionised workers. The calculation for the confidence
interval is straightforward given the formula sheet; make sure to be able to recognise the
relevant formula. First, the standard error needs to be calculated:
\[ \text{s.e.}(p_1 - p_2) = \sqrt{\frac{p_1(1 - p_1)}{n_1} + \frac{p_2(1 - p_2)}{n_2}} = 0.0166. \]
Then, the lower and upper bounds can be found to be −0.0604 and 0.0044, respectively.
Finally, the above should be presented as an interval (−0.0604, 0.0044).
ii. As before, let π1 denote the proportion of unionised workers made redundant and π2 the
corresponding proportion for non-unionised workers. Also denote by p the overall (pooled) sample
proportion of redundant workers. Regarding hypotheses, note that the wording ‘less
likely’ suggests a one-sided test: H0 : π1 = π2 vs. H1 : π1 < π2 . The next step is to
identify the test statistic, which is (p1 − p2 )/s.e.(p1 − p2 ) and follows a standard normal
distribution under H0 , where
\[ \text{s.e.}(p_1 - p_2) = \sqrt{p(1 - p)\left(\frac{1}{n_1} + \frac{1}{n_2}\right)} = 0.0169. \]
Based on the above, the value of the test statistic is −1.6576. The critical value at the
5% level is −1.645, hence we reject H0 at the 5% level. Testing at the 1% level gives a
critical value of −2.326, so we do not reject H0 at the 1% level. We conclude that there is
moderate evidence that unionised workers are less likely to be made redundant.
iii. • Sample size is large enough to justify the normality assumption.
• Equal variances.
Some candidates stated assumptions in this part that were not made in part (ii). Marks
were not awarded in such cases.
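The two standard errors used in parts (i) and (ii) are easy to confuse, so a small Python sketch may help. It uses only the counts given in the question; the variable names are illustrative, and the critical values −1.645 and −2.326 are the usual standard normal table values for one-sided tests.

```python
import math

x1, n1 = 38, 512   # unionised: redundancies, sample size
x2, n2 = 67, 654   # non-unionised: redundancies, sample size
p1, p2 = x1 / n1, x2 / n2
diff = p1 - p2

# Part (i): unpooled standard error and 95% confidence interval.
se_ci = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
ci = (diff - 1.96 * se_ci, diff + 1.96 * se_ci)

# Part (ii): pooled standard error under H0 and the z statistic.
p = (x1 + x2) / (n1 + n2)
se_test = math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
z = diff / se_test

print(round(se_ci, 4), round(se_test, 4))
for alpha, crit in ((0.05, -1.645), (0.01, -2.326)):
    print(alpha, "reject H0" if z < crit else "do not reject H0")
```

Because the commentary rounds intermediate quantities, unrounded arithmetic can differ from the quoted figures in the later decimal places; the decisions at the 5% and 1% levels are unaffected.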
Question 3
(a) A survey was conducted to investigate the relationship between the frequency of
newspaper readership and readers’ educational background. The following table
shows the results of this survey:
                         Educational background
                      Graduate  A-levels  Less than A-levels  Total
Low readership              19        32                  49    100
Moderate readership         25        52                  23    100
Frequent readership         46        40                  14    100
Total                       90       124                  86    300
would you say there is an association between the frequency of newspaper
readership and reader’s educational background?
ii. Calculate the χ2 statistic and use it to test for independence, using two
appropriate significance levels. What do you conclude?
[14 marks]

straightforward chi-squared test and the reading is also given in Chapter 8. Look also at
Activity A8.4.
i. There are some differences in the distributions within readership levels. More specifically,
graduates make up 46% of frequent readers but only 19% of low readers, whereas those with
less than A-levels make up only 14% of frequent readers but 49% of low readers.
Those with A-levels form the largest group among moderate readers (52%).
Hence, there seems to be an association between readership levels and reader’s
educational background, although this needs to be investigated further. (Note: the
conclusion of the last sentence must be stated to get full marks).
ii. Set out the null hypothesis that there is no association between readership level and
educational background against the alternative – that there is an association. Be careful
to get these the correct way round!
H0 : No association between readership level and educational background vs.
H1 : Association between readership level and educational background.
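The expected frequencies Ei,j (row total × column total / overall total) are:
30.00 41.33 28.67
30.00 41.33 28.67
30.00 41.33 28.67
The test statistic is
\[ \sum_{i,j} \frac{(O_{i,j} - E_{i,j})^2}{E_{i,j}} \]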
that gives a value of 41.350. This is a 3 × 3 contingency table so the degrees of freedom
are (3 − 1) × (3 − 1) = 4.
For α = 0.05, the critical value from the chi-squared distribution with 4 degrees of
freedom is 9.488, hence reject H0 .
Next, for α = 0.01, the critical value is 13.277, hence reject H0 again.
We conclude that there is strong evidence of an association between readership level and
educational background.
Saying ‘we do reject at the 5% level, but at 10%’ is insufficient. What does this mean? Is
there a connection or not? If there is one, how strong is it? This needed to be answered
if the full nine marks allocated for this question were to be earned. Many candidates lost
marks by missing out on follow-up like this.
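The chi-squared calculation above can be reproduced with a few lines of Python. This is a minimal sketch: the function name `chi_squared` is illustrative, and the critical values are the table values quoted in the commentary rather than computed ones.

```python
def chi_squared(observed):
    """Return (statistic, degrees of freedom) for a table of observed counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / total   # expected count under H0
            stat += (o - e) ** 2 / e
    df = (len(observed) - 1) * (len(observed[0]) - 1)
    return stat, df

# Rows: low, moderate, frequent readership; columns: graduate, A-levels, less than A-levels.
readership = [[19, 32, 49], [25, 52, 23], [46, 40, 14]]
stat, df = chi_squared(readership)

for alpha, crit in ((0.05, 9.488), (0.01, 13.277)):
    print(alpha, "reject H0" if stat > crit else "do not reject H0")
```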
(b) i. Explain the difference between item non-response and unit non-response.
ii. State any three factors which could cause non-response.
iii. A travel agency offers customers a range of ways to make holiday bookings –
in store, online and through their call centres. To determine the level of
customer satisfaction, the company’s management has decided to use a
survey of all types of customers and has asked you to devise an appropriate
random sampling scheme. Explain in detail your recommendation, including
how you might address non-response.
[11 marks]
This question was on basic material on survey designs. Background reading is given in Chapters
9 and 10 of the subject guide which, along with the recommended reading, should be looked at
constituents of design in random sampling. It is also a good idea to try the activities in Chapter
9.
One of the main things to avoid in this part is to write essays without any structure. This
exercise asks for specific things and each one of them requires 1 or 2 lines. If you are unsure of
what these things are, do not write lengthy essays. Doing so earns no marks and is a
waste of your valuable examination time. If you can identify what is being asked, keep in mind
that the answer should not be long. Note also that in some cases there is no unique answer
to the question.
The marking scheme and some model answers are given below:
i. • Item non-response occurs when a sampled member fails to respond to a question in the
questionnaire.
• Unit non-response occurs when no information is collected from a sample member.
ii. 3 marks: Any three of:
— Not-at-home.
— Refusals.
— Incapacity to respond.
— Not found
— Lost schedules.
iii. 6 marks: Possible ‘ingredients’ of an answer:
— Sampling frame to be the travel agency’s customer database.
— Propose stratified sampling since all types of customers are to be surveyed.
— Stratification factors could include booking method, gender, holiday type.
— Take a simple random sample from each stratum.
— Contact method: mail, phone or email (likely to have all details on database).
— Minimise non-response through suitable incentive, such as discount off next booking.
Question 4
(a) An area manager in a department store wants to study the relationship between
the number of workers on duty, x, and the value of merchandise lost to
shoplifters, in $. To do so, the manager assigned a different number of workers
for each of 10 weeks. The results were as follows:
Week #1 #2 #3 #4 #5 #6 #7 #8 #9 #10
x 9 11 12 13 15 18 16 14 12 10
y 420 350 360 300 225 200 230 280 315 410
diagram carefully.
diagram.
iv. Based on the regression equation in part (iii.), what will be the predicted loss
from shoplifting when there are 17 workers on duty? Will you trust this
value? Justify your answer.
[13 marks]

This is a standard regression question and the reading is to be found on Chapter 11. Section
guide. Section 11.7 is also relevant. Sample examination question 2 in this chapter is

ii. The summary statistics can be substituted to the formula for the correlation (make sure
you know which one it is!) to obtain the value −0.9688. An interpretation of this value is
the following: The data suggest that the higher the number of workers, the lower the loss
from shoplifters. The fact that the value is very close to −1, suggests that this is a
strong linear negative association.
Many candidates did not mention all three words (strong, linear, negative). Note that all
of these words provide useful information on interpreting the association and are
therefore required to obtain full marks.
iii. The regression line can be written by the equation ŷ = a + bx or y = a + bx + ε. The
formula for b is:
\[ b = \frac{\sum x_i y_i - n\bar{x}\bar{y}}{\sum x_i^2 - n\bar{x}^2} \]
Hence the regression line can be written as ŷ = 655.36 − 26.64x or
y = 655.36 − 26.64x + ε.
Many candidates incorrectly reported the regression line as y = 655.36 − 26.64x. This
expression is false; one of the two above is required.
iv. The prediction will be ŷ = 655.36 − 26.64 × 17 = $202.48. Yes, we would trust this value,
since x = 17 is inside the observed range of x (9 to 18), and therefore the prediction is based on
interpolation.
important to provide the answer in two decimal places.
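The correlation, slope, intercept and prediction can be checked from the raw data with the summation formulas above. The following Python sketch uses illustrative variable names and is not part of the marking scheme.

```python
import math

workers = [9, 11, 12, 13, 15, 18, 16, 14, 12, 10]            # x
loss = [420, 350, 360, 300, 225, 200, 230, 280, 315, 410]    # y, in $

n = len(workers)
x_bar, y_bar = sum(workers) / n, sum(loss) / n

# Corrected sums of squares and cross-products.
s_xy = sum(xi * yi for xi, yi in zip(workers, loss)) - n * x_bar * y_bar
s_xx = sum(xi ** 2 for xi in workers) - n * x_bar ** 2
s_yy = sum(yi ** 2 for yi in loss) - n * y_bar ** 2

r = s_xy / math.sqrt(s_xx * s_yy)   # sample correlation
b = s_xy / s_xx                     # slope
a = y_bar - b * x_bar               # intercept
prediction = a + b * 17             # predicted loss with 17 workers

# 17 lies within the observed range of x, so this is interpolation.
print(round(r, 4), round(a, 2), round(b, 2), min(workers) <= 17 <= max(workers))
```

A prediction computed from unrounded coefficients differs by a few cents from one computed with the rounded line, which is one reason to carry enough decimal places through the working.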
(b) A study was conducted to determine the amount of hours spent on Facebook by
university and high school students. For this reason, a questionnaire was
administered to a random sample of 16 university and 14 high school students
and the hours per day spent on Facebook were recorded. Summaries of the data
are shown in the table below:
University students 16 2.9 0.9
High school students 14 2.1 1.1
i. Use an appropriate hypothesis test to determine whether the mean hours per
day spent on Facebook were different between university and high school
students. Test at two suitable significance levels, stating clearly the
hypotheses, the test statistic and its distribution under the null hypothesis.
iii. Adjust the procedure above to determine whether the mean hours spent per
day on Facebook for university students is higher than that of high school
students.
[12 marks]

proportions. While the entire chapter on hypothesis testing is relevant, one can focus on
the sections involving proportions (7.14 and 7.15), in particular 7.15. The last part of the
question refers to one-sided hypothesis tests that are also located in these sections.
i. Let µA denote the mean hours per day spent on Facebook for university students and µB
the mean hours per day spent on Facebook for high school students.
The null hypothesis is that the proportions of the two population means (µA and µB ) do
H0 : µA = µB vs. H1 : µA ≠ µB .
If equal variances are assumed, the test statistic value is 2.1939 (the pooled variance is
0.9929). If equal variances are not assumed the test statistic value is 2.1788.
Since the variances are unknown and the sample size is not large enough, the t28
distribution is being used. The critical values at the 5% level are ±2.048, hence we reject
the null hypothesis. If we take a (smaller) α of 1%, the critical values are ±2.763, so we
do not reject H0 . We conclude that there is some but not strong (moderate) evidence of
a difference in the mean hours spent on Facebook between university and high school
students.
∗ Assumption about whether $\sigma_A^2 = \sigma_B^2$.
∗ Assumption about whether nA + nB − 2 is ‘large’, hence t vs. z.
∗ Assumption about independent samples.
Some candidates stated assumptions in this part that were not made in part (i). Marks
were not awarded in such cases.
iii. This case corresponds to a one-sided test, therefore the hypotheses would be
H0 : µA = µB vs. H1 : µA > µB . The test statistic is the same for this case but the
critical values are now 1.701 for 5% and ≈ 2.467 for 1%. As before we reject H0 at the
5% but not at the 1% level, and we conclude that there is moderate evidence (result is
moderately significant) – university students spend more time on Facebook than high
school students.
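The test statistics in parts (i) and (iii) can be checked with the short Python sketch below. It rests on an assumption: the column headings of the summary table are missing here, and the last column is read as the sample variance because that reading reproduces the pooled variance of 0.9929 quoted above. The critical values are the t28 table values quoted in the commentary.

```python
import math

# (n, sample mean, sample variance); reading the last column as a variance is an assumption.
n_a, mean_a, var_a = 16, 2.9, 0.9   # university students
n_b, mean_b, var_b = 14, 2.1, 1.1   # high school students

df = n_a + n_b - 2
pooled_var = ((n_a - 1) * var_a + (n_b - 1) * var_b) / df
t_pooled = (mean_a - mean_b) / math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
t_unpooled = (mean_a - mean_b) / math.sqrt(var_a / n_a + var_b / n_b)

# Critical values of t with 28 degrees of freedom, as quoted in the commentary.
two_sided = {0.05: 2.048, 0.01: 2.763}
one_sided = {0.05: 1.701, 0.01: 2.467}
for alpha in (0.05, 0.01):
    print(alpha,
          "two-sided: reject" if abs(t_pooled) > two_sided[alpha] else "two-sided: do not reject",
          "one-sided: reject" if t_pooled > one_sided[alpha] else "one-sided: do not reject")
```

The statistic is the same for the two-sided and one-sided tests; only the critical values change.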

ST104a Statistics 1
Important note
what follows • corresponds to 1 mark unless stated otherwise.

references
section.
Comments on specific questions – 3 June replacement examination
Section A
Question 1
ordinal. Justify your answer. (Note that no marks will be awarded without a
justification.)
i. The manufacturer of a car.
ii. The amount of money in a bank account.
iii. The Gross Domestic Product (GDP) of a country.
iv. The rating of a hotel according to the number of stars it has.
[8 marks]


categorical. Otherwise, if these consist of number corresponding to measurements, the data
i. Each car manufacturer is a category: Audi, Ford, Toyota, Mercedes, etc. Hence, the
variable is categorical. Note also that car brands do not have a natural ordering, so this
ii. The data represent amounts in – say – $ or euros that can be measured to two decimal
places, for example $1203.40. This is therefore a measurable variable.
iii. GDP can be measured in $ to two decimal places. This is therefore a measurable variable.
iv. Each hotel rating (1 star, 2 stars, . . . , 5 stars) is a category. Also, by definition there is a
natural ordering, for example a 5 star hotel is regarded as better than a 3 star one. This
is therefore a categorical ordinal variable.
Weak candidates did not provide justification for their choices, reported nominal or
4, x, 8, 7, 2

[4 marks]

This question contains material mostly from the subject guide, Chapter 3 and in particular
Section 3.8 (Measure of location) for part (i) and Section 3.9 (Measure of dispersion) for
part (ii).
First you need to write down the formula for the sample mean. Then, it is important to do
the summation carefully and divide by the correct number of observations to obtain the
mean. Note that the sum in the numerator will contain the unknown x, hence this will give
you a simple equation. The solution of this equation will provide x. The worked
solution is as follows.
i. • Since the sample mean is equal to 5, we can write:
\[ \frac{4 + x + 8 + 7 + 2}{5} = 5 \]
• or else:
21 + x = 25 ↔ x = 4.
ii. • Method:
\[ s^2 = \frac{(4 - 5)^2 + (4 - 5)^2 + (8 - 5)^2 + (7 - 5)^2 + (2 - 5)^2}{4}. \]
• Correct value: 6.
Some candidates divided by 5 in the formula above. In such cases only one mark was
awarded for part (ii), provided that the correct value was obtained. The reason is that the
formula for the sample variance provided in the subject guide only suggests dividing by
n − 1, where n is the number of observations. In another error that occurred in some cases,
candidates subtracted the number x = 4 rather than the sample mean, which is given to be 5.
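The two steps can be verified in Python as a quick sketch. It takes the sample mean to be 5, as in the equation of the working, and contrasts the n − 1 divisor with the divisor n that lost marks.

```python
import statistics

known = [4, 8, 7, 2]      # the four known observations
n, mean = 5, 5            # sample size and the stated sample mean

# Part (i): the sum of all five values must equal n * mean.
x = n * mean - sum(known)

# Part (ii): sample variance with divisor n - 1 (the subject guide's formula).
data = [4, x, 8, 7, 2]
s2 = statistics.variance(data)

# Dividing by n instead gives a different (population-style) value.
s2_divide_by_n = statistics.pvariance(data)
print(x, s2, s2_divide_by_n)
```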
(c) The salaries of the employees of a company are normally distributed with mean
£25,000 and a standard deviation of £10,000.
i. What is the proportion of employees with a salary of at least £20,000?
ii. What is the proportion of employees with salaries between £15,000 and
£35,000?
[4 marks]

Chapter 5 of the subject guide and work out the examples and activities of this section. The
sample examination questions in this chapter are quite relevant.
∗ P (Z < a) = P (Z ≤ a) = Φ(a)
∗ P (Z > a) = P (Z ≥ a) = 1 − P (Z ≤ a) = 1 − P (Z < a) = 1 − Φ(a)
∗ P (a < Z < b) = P (a ≤ Z < b) = P (a < Z ≤ b) = P (a ≤ Z ≤ b) = Φ(b) − Φ(a).
i. • We can write:
\[ P(X \ge 20000) = P\left(\frac{X - 25000}{10000} > \frac{20000 - 25000}{10000}\right) = P(Z > -0.5). \]
• Continuing from above, we get P (Z > −0.5) = Φ(0.5) = 0.6914.

ii. • We can write:
\[ P(15000 < X < 35000) = P\left(\frac{15000 - 25000}{10000} \le Z \le \frac{35000 - 25000}{10000}\right) = P(-1 \le Z \le 1). \]
• Continuing from above, we get:
P (−1 ≤ Z ≤ 1) = Φ(1) − Φ(−1) = 0.8413 − 0.1586 = 0.6827.
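For readers who want to check table look-ups, the same probabilities can be obtained from Python's standard library. This is only a sketch; in the examination the statistical tables must be used, and values read from tables may differ in the last digit because of rounding.

```python
from statistics import NormalDist

salary = NormalDist(mu=25000, sigma=10000)

# Part (i): P(X >= 20000) = 1 - P(X < 20000).
p_at_least_20k = 1 - salary.cdf(20000)

# Part (ii): P(15000 < X < 35000), i.e. within one standard deviation of the mean.
p_between = salary.cdf(35000) - salary.cdf(15000)

print(round(p_at_least_20k, 4), round(p_between, 4))
```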
(d) Suppose that x1 = −3, x2 = 5, x3 = 5, x4 = −1, x5 = 2, and y1 = 1, y2 = −4,

i. $\sum_{i=3}^{5} 2x_i$ ii. $\sum_{i=2}^{4} 3(y_i - 3)$ iii. $y_4^2 + \sum_{i=1}^{3} (2x_i + y_i^2)$.
[6 marks]

guide and in particular Activity A1.6.
This question was generally well done. The answers are:
i. $\sum_{i=3}^{5} 2x_i = 2(5 - 1 + 2) = 12$.
ii. $\sum_{i=2}^{4} 3(y_i - 3) = 3\sum_{i=2}^{4} (y_i - 3) = 3((-4 - 3) + (5 - 3) + (-1 - 3)) = 3(-7 + 2 - 4) = -27$.
iii. $y_4^2 + \sum_{i=1}^{3} (2x_i + y_i^2) = (-1)^2 + (2 \times (-3) + 1^2) + (2 \times 5 + (-4)^2) + (2 \times 5 + 5^2) = 1 - 5 + 26 + 35 = 57$.
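Summation limits are a common source of slips, and a short Python sketch makes them explicit. The x values are those given in the question; the y values y1 to y4 are taken from the worked answers above (an assumption, since the full list of y values is cut off in the question as reproduced here).

```python
# Values indexed from 1 to match the question's notation.
x = {1: -3, 2: 5, 3: 5, 4: -1, 5: 2}
y = {1: 1, 2: -4, 3: 5, 4: -1}   # y1..y4 as used in the worked answers

part_i = sum(2 * x[i] for i in range(3, 6))                              # i = 3, 4, 5
part_ii = sum(3 * (y[i] - 3) for i in range(2, 5))                       # i = 2, 3, 4
part_iii = y[4] ** 2 + sum(2 * x[i] + y[i] ** 2 for i in range(1, 4))    # i = 1, 2, 3

print(part_i, part_ii, part_iii)
```

Note that `range(3, 6)` stops before 6, which mirrors an upper summation limit of 5.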
(e) The variable X takes the values 2, 4, 6 and 8 according to the following
distribution
x 2 4 6 8
pX (x) 0.3 0.2 0.1 0.4
i. What is the probability that X is an odd number?
iii. Find the probability that X/2 > 3.
[5 marks]

conditional probability and probability distribution. Reading from Chapter 4 of the subject
guide is suggested with a focus on the sections on these topics. Try Activity A4.1 and the
i. • X only takes even values, so the probability is 0.
ii. •• $E(X) = \sum_i x_i P(X = x_i) = 2 \times 0.3 + 4 \times 0.2 + 6 \times 0.1 + 8 \times 0.4 = 5.2$.
iii. • The probability distribution of Z = X/2 will be:
Z 1 2 3 4
pZ (z) 0.3 0.2 0.1 0.4
• Hence, the correct probability is 0.4.
Note that this part may be answered without deriving the probability distribution table
of Z. One can note that only the value X = 8 gives X/2 > 3, hence the requested
probability is 0.4.
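All three parts follow directly from the probability table, as the Python sketch below illustrates. The dictionary `pmf` simply encodes the distribution given in the question.

```python
# Probability function of X from the question.
pmf = {2: 0.3, 4: 0.2, 6: 0.1, 8: 0.4}

# Part (i): sum the probabilities of odd values (there are none).
p_odd = sum(p for value, p in pmf.items() if value % 2 == 1)

# Part (ii): expected value, the probability-weighted sum of the values.
expected = sum(value * p for value, p in pmf.items())

# Part (iii): P(X/2 > 3); only values of X above 6 qualify.
p_half_above_3 = sum(p for value, p in pmf.items() if value / 2 > 3)

print(p_odd, round(expected, 2), p_half_above_3)
```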
(f ) You toss two fair dice independently.

i. What is the probability that both numbers are sixes?
ii. What is the probability that both numbers are odd?
iii. You are now told that the first one of them shows a two. What is the
probability in this case that both are twos?
[4 marks]

This is a question on probability and targets mostly the material of Chapter 4 in the subject
guide. It is essential to practise on such exercises through the activities and exercises of this
chapter as well as the material on the VLE. In particular you can attempt Activity A4.6
and Sample examination question 4. It is also useful to familiarise yourself with probability
trees as they can be quite handy in such exercises.
The first part was straightforward for those who were familiar with this section. Part (iii)
requires knowledge of the definition of conditional probability, although it can also be
approached by common logic regarding the concept of
independence. The working is given below.
i. • Correct probability calculation: (1/6) × (1/6) = 1/36.

ii. • Correct probability calculation: (1/2) × (1/2) = 1/4.
iii. • Correct probability of the first one being two: (1/6).
• Correct conditional probability: (1/6) × (1/6)/(1/6) = 1/6.
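These answers can be confirmed by enumerating the 36 equally likely outcomes. The following Python sketch uses a hypothetical helper `prob` that counts outcomes satisfying an event, optionally restricted to those satisfying a condition.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely (first die, second die) outcomes.
outcomes = list(product(range(1, 7), repeat=2))

def prob(event, given=lambda outcome: True):
    """P(event | given), computed by counting equally likely outcomes."""
    conditioned = [o for o in outcomes if given(o)]
    favourable = [o for o in conditioned if event(o)]
    return Fraction(len(favourable), len(conditioned))

both_sixes = prob(lambda o: o == (6, 6))
both_odd = prob(lambda o: o[0] % 2 == 1 and o[1] % 2 == 1)
both_twos_given_first_two = prob(lambda o: o == (2, 2), given=lambda o: o[0] == 2)

print(both_sixes, both_odd, both_twos_given_first_two)
```

Restricting the sample space to outcomes where the first die shows a two leaves six outcomes, only one of which is a double two, which is the counting version of the conditional probability formula.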
(g) It is stated in a consumer magazine that the average price of football shirts in
London is £19.00. A random sample is taken by obtaining a single football shirt
from each of 16 randomly chosen London retailers. The sample mean is £20.20
and the sample standard deviation is £2.40. Carry out a hypothesis test, at two
appropriate significance levels, to determine whether the price of football shirts
in London is more expensive than the price stated in the consumer magazine.
State your hypotheses, the test statistic and its distribution under the null
hypothesis, and your conclusion in the context of the problem.
[7 marks]

This question refers to a one-sided hypothesis test examining whether the price of football
shirts is greater than £19.00. While the entire chapter on hypothesis testing is relevant,
candidates can focus on the relevant sections for a single mean (7.12 and 7.13) and in
particular 7.13. The question refers to one-sided hypothesis tests that are located in Section
7.10 of Chapter 7.
only one variable involved it will have to be a single mean test, and the test statistic can be
found in the formula sheet. Make sure to substitute the relevant quantities carefully and
avoid any numerical errors in the calculation.
The next step is to identify the distribution of the test statistic. The fact that a sample
standard deviation is given, indicates that the variance is unknown. Hence, since n < 30 the
t distribution should be used.
table for the relevant significance levels, deciding whether to reject H0 , and interpreting the
results in the context of the problem. The working of the exercise is given below:
• H0 : µ = 19 vs. H1 : µ > 19. (No X̄s, accept H0 : µ ≤ 19.)
• Test statistic value:
\[ \frac{\bar{x} - 19}{s/\sqrt{16}} = \frac{20.20 - 19.00}{2.40/\sqrt{16}} = 2. \]
• The variance is unknown and n < 30 so the t distribution (with n − 1 = 15 degrees of freedom) should be used.
• For α = 0.05, the critical value is 1.753.
• Decision: reject H0 .
• Choose a smaller α, say α = 0.01: the critical value is 2.602, hence do not reject H0 .
• Moderate evidence that the actual price of football shirts in London is more expensive
than the price stated in the consumer magazine.
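The steps of this one-sided test can be laid out in a few lines of Python. This is a sketch with illustrative variable names; the critical values are the t15 table values quoted in the working above.

```python
import math

n, x_bar, s, mu0 = 16, 20.20, 2.40, 19.00

# Test statistic for a single mean with unknown variance.
t = (x_bar - mu0) / (s / math.sqrt(n))
df = n - 1

# Upper-tail critical values of t with 15 degrees of freedom (from tables).
for alpha, crit in ((0.05, 1.753), (0.01, 2.602)):
    print(alpha, "reject H0" if t > crit else "do not reject H0")
```

Rejecting at the 5% level but not at the 1% level is what the commentary describes as moderate evidence.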
i. The chance that a normal random variable is less than two standard
deviations from its mean is 99%.
ii. The lower the regression coefficient in absolute value the weaker the
correlation.
iii. Increasing the sample size will increase the width of a confidence interval for
a population mean (assuming that everything else remains constant).
iv. When testing a hypothesis, we use a two tailed test if we want to test
whether the parameter is greater than what is stated in the null hypothesis.
v. A population list is needed in order to conduct quota sampling.
[12 marks]

level in computations. Part (i) concerns normal random variables that can be found in
Chapter 5 of the subject guide. Part (ii) is about correlation and regression (see Sections
11.7 and 11.8), whereas the next part targets confidence intervals (see for example Sections
7.7 to 7.9). The next part (iv) targets the concepts of a p-value covered in Section 7.11 in
the context of chi-squared test, presented in Chapter 8. Finally, part (vi) requires material
from Chapter 9 and in particular the Section 9.7 on types of sample.
marks too for long, rambling explanations without indicating a decision as to whether a
statement was true or false.
i. False; it is approximately 95.45%, since P (−2 < Z < 2) = 2Φ(2) − 1 ≈ 0.9545.
ii. True; as the absolute value of the regression coefficient approaches zero, the correlation
coefficient also approaches zero.
iii. False; a larger sample size will result in higher accuracy and therefore a smaller width.
Alternatively, since we are dividing by √n, the width will decrease if n increases.
iv. False; this is done when we want to test if the parameter is different than what is stated
in the null hypothesis.
v. False; a population list is needed in random sampling designs.
vi. False; the denominator in the formula for the coefficient b will generally be different in
these two cases whereas the numerator will be the same.
Section B
Question 2
(a) Questionnaires were mailed to 300 households, in three different areas of a city,
to assess the level of local sporting facilities. The collected data are shown in
the table below.
               Sporting Facilities Level
          Very good  Fairly good  Poor  Total
Area 1           44           30    26    100
Area 2           29           26    45    100
Area 3           45           28    27    100
Total           118           84    98    300
would you say there is an association between areas and level of local
sporting facilities?
ii. Calculate the χ2 statistic and use it to test for independence, using two
appropriate significance levels. What do you conclude?
[14 marks]

This part targets Chapter 8 on contingency tables and chi-squared tests in the subject
guide. Note that part (i) of the question does not require any calculations, just
understanding and interpreting contingency tables. Candidates can attempt Activity A8.4
to practise. Part (ii) is a straightforward chi-squared test and the reading is also given in
Chapter 8. Look also at Activity A8.4.
i. There are some differences in the distributions within areas. More specifically, very good
sporting facilities appear more frequently than poor sporting facilities in Areas 1 and 3
(44% vs. 26% and 45% vs. 27%, respectively). In Area 2, however, poor sporting facilities
are more common than very good ones (45% vs. 29%). There seems to be an association
between area and level of sporting facilities, although this needs to be investigated
further. (Note: the conclusion of the last sentence is essential to get full marks).
ii. Set out the null hypothesis that there is no association between area and level of sporting
facilities against the alternative, that there is association. Be careful to get these the
correct way round!
H0 : No association between areas and level of sporting facilities vs.
H1 : Association between areas and level of sporting facilities.
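The expected frequencies Ei,j (row total × column total / overall total) are:
39.33 28.00 32.67
39.33 28.00 32.67
39.33 28.00 32.67
The test statistic is
\[ \sum_{i,j} \frac{(O_{i,j} - E_{i,j})^2}{E_{i,j}} \]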
that gives a value of 11.370. This is a 3 × 3 contingency table so the degrees of freedom
are (3 − 1) × (3 − 1) = 4.
For α = 0.05, the critical value from the chi-squared distribution with 4 degrees of
freedom is 9.488, hence reject H0 .
Next, for α = 0.01, the critical value is 13.277, hence do not reject H0 .
We conclude that there is some evidence to support an association between area and
level of sporting facilities.
Saying ‘we do reject at the 5% level, but at 10%’ is insufficient. What does this mean? Is
there a connection or not? If there is one, how strong is it? This needed to be answered
if the full nine marks allocated for this question were to be earned. Many candidates lost
marks by missing out on follow-up like this.
(b) i. Provide the definition of simple random sampling and cluster sampling
designs.
ii. Why might a researcher prefer cluster sampling rather than simple random
sampling?
iii. Name one other random sampling scheme, provide its definition and one of
its advantages.
[11 marks]

This was a question on basic material on survey designs. Background reading is given in
activities of Chapter 9 of the subject guide.

One of the main things to avoid in this part of the question is to write essays without any
structure. This exercise asks for specific things and each one of them requires 1 or 2 lines. If
you are unsure of what these things are, do not write lengthy essays. Doing so earns no
marks and is a waste of your valuable examination time. If you can identify what is
The marking scheme and some model answers are given below:
i. Simple random sampling – each population unit has a known, equal, non-zero probability
of selection.
Cluster sampling – the population is divided into clusters. Next, some of the clusters are
selected by simple random sampling.
See pages 141–142 of the guide for more details.
ii. One of the advantages of cluster sampling over simple random sampling is the reduced
cost. Under cluster sampling only one area needs to be visited instead of many.
iii. Candidates here can choose any of the remaining random sampling designs that have
been described in the subject guide. These are stratified random sampling, systematic
random sampling and multistage sampling.
A model answer should contain two parts:
• Definition.
• Description of the advantage together with an example.
Candidates are advised to go to the relevant part of the subject guide and write down
the above for all the random sampling designs.
Question 3
(a) The following data show the recorded times (y) in seconds taken by 10
international athletes to run 100 metres together with the corresponding wind
speeds (x) at the time of running. A positive wind speed indicates the wind is
in the direction of running and therefore considered to be helpful whereas a
negative wind speed indicates the wind is against the runner.
Athlete #1 #2 #3 #4 #5 #6 #7 #8 #9 #10
x −2.45 −1.23 −0.78 −0.33 −0.37 0.34 0.53 1.17 2.35 2.91
y 10.52 10.47 10.41 10.25 10.54 10.09 10.30 9.99 9.92 9.87

diagram carefully.
diagram.
time for a runner for a wind speed of 1.5? Will you trust this value? Justify
your answer.
[13 marks]

This is a standard regression question and the reading is to be found in Chapter 11 of the
subject guide. Section 11.6 provides details for scatter diagrams and is suitable for part (i)
whereas the remaining parts are on correlation and regression that are covered in Sections
11.8–11.10 of the subject guide. Section 11.7 is also relevant. Sample examination question 2 in
this chapter is recommended for practice on questions of this type.
failing to use the graph paper, which was provided, and required, in the question.
[Figure: scatter diagram titled ‘Times of 100m international athletes against wind speed’; horizontal axis: Wind speed; vertical axis: Recorded time in seconds.]
ii. The summary statistics can be substituted to the formula for the correlation (make sure
you know which one it is!) to obtain the value −0.9051. An interpretation of this value is
the following: The data suggest that the higher the wind speed, the lower the time to
run 100 metres. The fact that the value is very close to −1, suggests that this is a strong
linear negative association.
Many candidates did not mention all three words (strong, linear, negative). Note that all
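of these words provide useful information on interpreting the association and are
therefore required to obtain full marks.
iii. The regression line can be written by the equation ŷ = a + bx or y = a + bx + ε. The
formula for b is:
\[ b = \frac{\sum x_i y_i - n\bar{x}\bar{y}}{\sum x_i^2 - n\bar{x}^2} \]
The formula for a is a = ȳ − bx̄, so we get a = 10.2663.
Hence the regression line can be written as ŷ = 10.2663 − 0.1414x or
y = 10.2663 − 0.1414x + ε.
Many candidates incorrectly reported the regression line as y = 10.2663 − 0.1414x.
This expression is false; one of the two above is required.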
iv. The prediction will be ŷ = 10.2663 − 0.1414 × 1.5 = 10.05 seconds. We would trust this
value, since this point is inside the observed range of x, and therefore the prediction is
based on interpolation.
answering such question and a mark is deducted if they are not specified. It is also
important to provide the answer in at least two decimal places.
(b) Behavioural researchers have developed an index designed to measure
managerial success. Of interest is whether there is a difference in average
managerial success based on the level of interaction with people outside a
manager’s immediate work unit. Managers in group 1 engage in a high volume
of interactions with people outside their work unit, while those in group 2
rarely do. The data are summarised in the table below:
Group 1 22 65.33 6.61
Group 2 25 61.58 5.37
i. Carry out a hypothesis test to determine whether the mean managerial
success index scores are different between the two groups. Test at two
suitable significance levels, stating clearly the hypotheses, the test statistic
and its distribution under the null hypothesis. Comment on your findings.
iii. Adjust the procedure above to determine whether the mean managerial
success for managers who have a high volume of interactions with people
outside their work unit is higher than that of those who rarely do.
[12 marks]

proportions. While the entire chapter on hypothesis testing in the subject guide is relevant,
candidates can focus on the sections involving proportions (7.14 and 7.15) and in particular
7.15. The last part of the question refers to one-sided hypothesis tests that are also located
in these sections.
i. Let µA denote the mean managerial success score for group 1 and µB the mean
managerial success score for group 2.
The null hypothesis is that the proportions of the two population means (µA and µB ) do
H0 : µA = µB vs. H1 : µA ≠ µB .
\[ \frac{\bar{x} - \bar{y}}{\sqrt{s_A^2/n_A + s_B^2/n_B}} \quad \text{or} \quad \frac{\bar{x} - \bar{y}}{\sqrt{s_p^2 (1/n_A + 1/n_B)}}. \]
distribution is being used. The critical value at the 5% level is −2.048, hence we reject
the null hypothesis. If we take a (smaller) α of 1%, the critical value is −2.763, so we do
not reject H0 . We conclude that there is moderate evidence of a difference in the mean
scores of managerial success between the two groups.
∗ Assumption about equal variances.
∗ Assumption about whether nA + nB is ‘large’ so that the normality assumption is
satisfied.
∗ Assumption about independent samples.
Some candidates stated assumptions in this part that were not made in part (i). Marks
were not awarded in such cases.
iii. This case corresponds to a one-sided test, therefore the hypotheses would be
H0 : µA = µB vs. H1 : µA > µB . The test statistic is the same for this case but the
critical values are now 1.684 for 5% and ≈ 2.423 for 1%. As before we reject H0 at the
5% but not at the 1% level, and we conclude that there is moderate evidence (result is
moderately significant) – group 1 managers are on average more successful than group 2
managers.
Question 4
(a) The following data show the length (in inches) of fish caught in one day in a
river:
10.1 10.4 10.5 10.9 11.1
11.2 11.2 11.5 11.7 11.9
12.1 12.1 12.2 12.2 12.3
12.4 12.5 12.6 12.8 12.9
13.2 13.4 13.5 13.6 13.7
14.3 14.5 14.8 15.2 15.5
paper provided.
ii. Find the mean (given that the sum of the data is 376.3), the median and the
modal group.
represent the data.
[12 marks]

Chapter 3 in the subject guide provides all the relevant material for this question. More
specifically, reading on histograms can be found in Section 3.7.3, but the entire Sections 3.7,
3.8 and 3.9 are highly relevant.
i. A histogram compatible with what the examiners were expecting to see is shown below.
Marks were awarded for including the title, labelling correctly and accurately drawing
the figure.
[Figure: histogram titled ‘Histogram of lengths of fish caught’; horizontal axis: lengths in inches (10 to 16); vertical axis: Density.]
ii. • Mean: 12.543. Note: raw data should be used, not grouped data.
• Median: 12.35. Note: same as above.
• Modal group: 12–13 inches. Note: units are necessary.
iii. Based on the shape of the histogram, we can see that the distribution of the data is
positively skewed.
iv. A boxplot, stem-and-leaf diagram or a dot plot are other types of suitable graphical
displays. The reason for that is that the variable income is measurable and these graphs
are suitable for displaying the distribution of such variables.
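The summary statistics in part ii. can be checked with a short script. This is only a sketch: the variable names are illustrative, and the one-inch classes [10, 11), [11, 12), ... are an assumption chosen to match the histogram axis.

```python
from collections import Counter
from statistics import mean, median

# Lengths (in inches) of the 30 fish, copied from the question.
lengths = [
    10.1, 10.4, 10.5, 10.9, 11.1,
    11.2, 11.2, 11.5, 11.7, 11.9,
    12.1, 12.1, 12.2, 12.2, 12.3,
    12.4, 12.5, 12.6, 12.8, 12.9,
    13.2, 13.4, 13.5, 13.6, 13.7,
    14.3, 14.5, 14.8, 15.2, 15.5,
]

# Mean and median are computed from the raw data, not from grouped data.
sample_mean = mean(lengths)
sample_median = median(lengths)

# Assumed grouping: one-inch classes labelled by their lower boundary.
class_counts = Counter(int(x) for x in lengths)
modal_lower = max(class_counts, key=class_counts.get)

print(f"mean = {sample_mean:.3f} inches")
print(f"median = {sample_median:.2f} inches")
print(f"modal group = {modal_lower}-{modal_lower + 1} inches")
```

With a different choice of class boundaries the modal group could change, which is one reason the commentary asks for the mean and median to be taken from the raw data.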
(b) In order to estimate the percentage of city households that have high speed
internet access, a random sample of 140 city households was taken. Of these, 70
had high speed internet access. A similar sample of 170 rural households was
also taken and it was found that 61 of them had high speed internet access. The
data are summarised in the table below.
                            City Households   Rural Households
With high speed internet           70                 61
Total                             140                170
i. Give a 95% confidence interval for the difference between the proportions of
high speed internet access in city and rural households.
ii. Carry out a hypothesis test, at two suitable significance levels, to determine
whether city households are more likely to have high speed internet access
compared to rural households. State the test hypotheses, and specify your
test statistic and its distribution under the null hypothesis. Comment on
your findings.
[13 marks]

Look up the sections about hypothesis testing and confidence intervals for differences in
proportions; more specifically sections 6.13, 7.14 and 7.15 in Chapters 6 and 7 of the subject
guide.
i. Let p1 , n1 refer to the proportion of city households with high speed internet access and
to the total number of households, respectively. Similarly, denote by p2 and n2 the
corresponding quantities of rural households. The calculation for the confidence interval
is straightforward given the formula sheet; make sure you are able to recognise the
relevant formula. First, the standard error needs to be calculated:
\[ \text{s.e.}(p_1 - p_2) = \sqrt{\frac{p_1 (1 - p_1)}{n_1} + \frac{p_2 (1 - p_2)}{n_2}} = 0.056. \]
Then, the lower and upper bounds can be found to be 0.0314 and 0.2510 respectively.
Finally, the above should be presented as an interval (0.0314, 0.2510).
ii. As before, let π1 denote the proportion of city households with high speed internet and
π2 the corresponding proportion of rural households. Also denote by p the overall (pooled)
sample proportion of households with high speed internet, p = (70 + 61)/(140 + 170) = 0.4226.
Regarding hypotheses, note that the wording ‘more likely’ suggests a one-sided test:
H0 : π1 = π2 vs. H1 : π1 > π2 .
The next step is to identify the test statistic, which is (p1 − p2 )/(s.e.(p1 − p2 )) and
follows a standard normal distribution under H0 , where:
\[ \text{s.e.}(p_1 - p_2) = \sqrt{p (1 - p) \left( \frac{1}{n_1} + \frac{1}{n_2} \right)} = 0.0564. \]
Based on the above the value of the test statistic is 2.503. The critical value at the 5%
level is 1.645, hence we reject H0 at the 5% level. Testing at the 1% level gives a critical
value of 2.326. Therefore, we still reject H0 at the 1% level, concluding that city
households are more likely to have high speed internet than rural households.
iii. • Sample size is large enough to justify the normality assumption.
• Equal variances.
Some candidates stated assumptions in this part that were not made in part (ii). Marks
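For part (b), the interval in i. and the test in ii. can be reproduced numerically. The sketch below uses only the counts given in the question; the variable names are illustrative, and 1.96 is the usual two-sided 95% normal critical value.

```python
from math import sqrt
from statistics import NormalDist

# Counts from the question: households with high speed internet / sample sizes.
r1, n1 = 70, 140   # city
r2, n2 = 61, 170   # rural

p1, p2 = r1 / n1, r2 / n2
diff = p1 - p2

# Part i: 95% confidence interval, using the unpooled standard error.
se_unpooled = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
z_crit_95 = 1.96
ci = (diff - z_crit_95 * se_unpooled, diff + z_crit_95 * se_unpooled)

# Part ii: one-sided z test, using the pooled standard error under H0.
p_pooled = (r1 + r2) / (n1 + n2)
se_pooled = sqrt(p_pooled * (1 - p_pooled) * (1 / n1 + 1 / n2))
z = diff / se_pooled

# Upper-tail critical values at the 5% and 1% significance levels.
crit_5 = NormalDist().inv_cdf(0.95)
crit_1 = NormalDist().inv_cdf(0.99)

print(f"CI = ({ci[0]:.4f}, {ci[1]:.4f})")
print(f"z = {z:.3f}, reject at 5%: {z > crit_5}, reject at 1%: {z > crit_1}")
```

The unrounded test statistic may differ from the quoted value in the third decimal place, because the commentary works with rounded intermediate values.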

and the Social Sciences, the Diplomas in Economics and Social Sciences
Statistics 1
Friday, 6 May 2016 : 10:00 to 12:00

on this paper.
PLEASE TURN OVER

UL16/0489 Page 1 of 21 D1
SECTION A
Answer all parts of question 1 (50 marks in total).
1. (a) A random sample of the heights of buildings has a sample mean of 24.96 metres.
State the units of measurements for the summaries below and justify your
answers.
i. sample variance
ii. sample standard deviation.
[4 marks]
(b) Suppose that x1 = 8, x2 = −1, x3 = −6, x4 = 5, x5 = 0, and y1 = −7, y2 = 3,
y3 = 0, y4 = 1, y5 = −3. Calculate the following quantities:
i. $\sum_{i=2}^{4} x_i^2$   ii. $\sum_{i=1}^{3} 2 x_i y_i$   iii. $y_5^3 + \sum_{i=3}^{4} \frac{y_i^4}{x_i}$.
[6 marks]
(c) A population is normally distributed with a population mean of 138 and a

population standard deviation of 21.
i. State the distribution of the sample mean for simple random samples of
size n = 25.
ii. Given a simple random sample of size n = 25, determine the probability
that the sample mean will be less than 128.
[4 marks]
(d) Classify each one of the following variables as either measurable

(continuous) or categorical. If a variable is categorical, further classify it as
either nominal or ordinal. Justify your answer. (No marks will be awarded
without a justification.)
i. The weight of a cereal packet produced in a factory.
ii. The order an athlete finishes a marathon.
iii. The colour of a pair of shoes.
iv. Currency exchange rates.
[8 marks]
(e) The random variable X takes the values 0, 1 and 4 according to the following
probability distribution:
x 0 1 4
pX (x) 0.2 k k
i. Determine the constant k.
ii. Find E(X), the expected value of X.
iii. Find Var(X), the variance of X.
[5 marks]
(f) An engine encounters a standard environment with a probability of 0.95, and

a severe environment with a probability of 0.05. In a normal environment the
probability of failure is 0.02, whereas in the severe environment this probability
is 0.5.
i. What is the probability of failure?
ii. Given that failure has occurred, what is the probability that the
environment encountered was severe?
[4 marks]
(g) A museum conducts a survey of its visitors in order to assess the popularity
of a device which is used to provide information on the museum exhibits. The
device will be withdrawn if fewer than 20% of all of the museum’s visitors make
use of it. Of a random sample of 100 visitors, 15 chose to use the device.
i. Carry out an appropriate hypothesis test at the 5% significance level to
see if the device should be withdrawn and state your conclusions.
ii. Calculate the p-value of the test.
[7 marks]
(h) State whether the following are true or false and give a brief explanation. (No
marks will be awarded for a simple true/false answer.)
i. The interquartile range of a sample is influenced by extreme values.
ii. A sampling distribution is the probability distribution of a population
parameter.
iii. A sample correlation coefficient close to 1 indicates a strong positive linear
relationship between two categorical variables.
iv. A p-value of 0.08 represents a highly significant hypothesis test result.
v. Rejection of a null hypothesis might indicate that a Type II error has been
committed.
vi. A quota sample is the non-random equivalent of a systematic random
sample.
[12 marks]
SECTION B
Answer two out of the three questions from this section (25 marks each).
2. (a) A factory uses four different machines to manufacture a particular type

of machine component. A random sample of 400 components is selected
from the output of the factory. Each component in the sample is inspected
to determine whether or not it is faulty. The machine that produced the
component is also recorded. The results are as follows:
                 Outcome
             Faulty   Non-faulty   Total
Machine 1       4         96        100
Machine 2       2         98        100
Machine 3      11         89        100
Machine 4      14         86        100
Total          31        369        400
i. Based on the data in the table, and without conducting any significance
test, would you say there is an association between the machine number
and the component being faulty?
ii. Calculate the χ2 statistic and use it to test for independence, using a
5% significance level. What do you conclude?
[14 marks]
(b) i. Describe how stratified random sampling is performed and explain how
it differs from quota sampling.
ii. A company producing handheld electronic devices (tablets, mobile
phones etc.) wants to understand how people of different ages rate
its products. For this reason, the company’s management has decided
to use a survey of its customers and has asked you to devise an
appropriate random sampling scheme. Outline the key components
of your sampling scheme.
[11 marks]
3. (a) The data below represent heights, measured in centimetres, of women from
an adult female population:
162 164 164 165 165
166 166 166 167 167
167 167 167 168 168
168 168 168 168 169
169 169 169 170 170
170 171 172 184 185
i. Carefully construct, draw and label a histogram of these data on the
graph paper provided.
ii. Find the median height among these women and the upper quartile.
What percentage of women were below 165 cm?
iii. Comment on the data given the shape of the histogram without doing
represent the data.
[13 marks]
(b) A random sample of 9 people tried a specific diet that lasted 2 months
to lose weight. The weights of these people, measured in kilograms, were
measured both at the beginning and the end of the diet, and are shown in
the table below:
Weight before diet Weight after diet
75 73
76 72
90 92
92 93
89 89
63 61
65 62
80 76
90 84
i. Carry out an appropriate hypothesis test to determine whether the diet
is effective in helping people lose weight. State the test hypotheses, and
ii. State any assumptions you made in i.
iii. Give a 90% confidence interval for the difference between the means of
the weights before and after the diet.
[12 marks]
4. The director of a local Tourism Authority would like to know whether a family’s
annual expenditure on recreation (y), measured in $000s, is related to their
annual income (x), also measured in $000s. In order to explore this potential
relationship, the variables x and y were recorded for 10 randomly selected
families that visited the area last year. The results were as follows:
Week #1 #2 #3 #4 #5 #6 #7 #8 #9 #10
x 41.2 50.1 52.0 62.0 44.5 37.7 73.5 37.5 56.7 65.2
y 2.4 2.7 2.8 8.0 3.1 2.1 12.1 2.0 3.9 8.9
Sum of y data: 48 Sum of the squares of y data: 343.74
(a) i. Draw a scatter diagram of these data on the graph paper provided.
Label the diagram carefully.
iii. Calculate the least squares line of y on x and draw the line on the
scatter diagram.
iv. Do you find the analyses in ii. and iii. appropriate? Justify your
answer and suggest any alternative ways to model the relationship
between x and y.
[13 marks]
(b) The fuel consumption of two different car models (A and B) was compared
in the following way. A random sample of 20 cars from model A and 35 cars
from model B were taken and the fuel consumption (in miles per gallon)
was measured for each car. The results are summarised in the table below.

Car Model A 20 30.9 6.11
Car Model B 35 27.1 6.41
i. Use an appropriate hypothesis test to determine whether the model A
cars can do more miles per gallon than model B cars. State clearly
the hypotheses, the test statistic and its distribution under the null
hypothesis, and carry out the test at two appropriate significance
levels. Comment on your findings.
ii. State clearly any assumptions you made in i.
iii. Provide a 95% confidence interval for the difference between the mean
fuel consumption of the two car models.
[12 marks]
END OF PAPER
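ST104a Statistics 1

Expected value of a discrete random variable:
\[ \mu = E(X) = \sum_{i=1}^{N} p_i x_i \]
Standard deviation of a discrete random variable:
\[ \sigma = \sqrt{\sigma^2} = \sqrt{\sum_{i=1}^{N} p_i (x_i - \mu)^2} \]
The transformation formula:
\[ Z = \frac{X - \mu}{\sigma} \]
Finding Z for the sampling distribution of the sample mean:
\[ Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}} \]
Finding Z for the sampling distribution of the sample proportion:
\[ Z = \frac{P - \pi}{\sqrt{\pi (1 - \pi) / n}} \]
Confidence interval endpoints for a single mean (σ known):
\[ \bar{x} \pm z_{\alpha/2} \cdot \frac{\sigma}{\sqrt{n}} \]
Confidence interval endpoints for a single mean (σ unknown):
\[ \bar{x} \pm t_{\alpha/2,\, n-1} \cdot \frac{s}{\sqrt{n}} \]
Confidence interval endpoints for a single proportion:
\[ p \pm z_{\alpha/2} \cdot \sqrt{\frac{p (1 - p)}{n}} \]
Sample size determination for a mean:
\[ n \geq \frac{z_{\alpha/2}^2 \, \sigma^2}{e^2} \]
Sample size determination for a proportion:
\[ n \geq \frac{z_{\alpha/2}^2 \, p (1 - p)}{e^2} \]
z test of hypothesis for a single mean (σ known):
\[ Z = \frac{\bar{X} - \mu_0}{\sigma / \sqrt{n}} \]
t test of hypothesis for a single mean (σ unknown):
\[ T = \frac{\bar{X} - \mu_0}{S / \sqrt{n}} \]
z test of hypothesis for a single proportion:
\[ Z \cong \frac{P - \pi_0}{\sqrt{\pi_0 (1 - \pi_0) / n}} \]
z test for the difference between two means (variances known):
\[ Z = \frac{\bar{X}_1 - \bar{X}_2 - (\mu_1 - \mu_2)}{\sqrt{\sigma_1^2 / n_1 + \sigma_2^2 / n_2}} \]
t test for the difference between two means (variances unknown):
\[ T = \frac{\bar{X}_1 - \bar{X}_2 - (\mu_1 - \mu_2)}{\sqrt{S_p^2 (1/n_1 + 1/n_2)}} \]
Confidence interval endpoints for the difference between two means:
\[ (\bar{x}_1 - \bar{x}_2) \pm t_{\alpha/2,\, n_1 + n_2 - 2} \cdot \sqrt{s_p^2 \left( \frac{1}{n_1} + \frac{1}{n_2} \right)} \]
Pooled variance estimator:
\[ S_p^2 = \frac{(n_1 - 1) S_1^2 + (n_2 - 1) S_2^2}{n_1 + n_2 - 2} \]
t test for the difference in means in paired samples:
\[ T = \frac{\bar{X}_d - \mu_d}{S_d / \sqrt{n}} \]
Confidence interval endpoints for the difference in means in paired samples:
\[ \bar{x}_d \pm t_{\alpha/2,\, n-1} \cdot \frac{s_d}{\sqrt{n}} \]
z test for the difference between two proportions:
\[ Z = \frac{(P_1 - P_2) - (\pi_1 - \pi_2)}{\sqrt{P (1 - P) (1/n_1 + 1/n_2)}} \]
Pooled proportion estimator:
\[ P = \frac{R_1 + R_2}{n_1 + n_2} \]
Confidence interval endpoints for the difference between two proportions:
\[ (p_1 - p_2) \pm z_{\alpha/2} \cdot \sqrt{\frac{p_1 (1 - p_1)}{n_1} + \frac{p_2 (1 - p_2)}{n_2}} \]
χ2 statistic for test of association:
\[ \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} \]
Sample correlation coefficient:
\[ r = \frac{\sum_{i=1}^{n} x_i y_i - n \bar{x} \bar{y}}{\sqrt{\left( \sum_{i=1}^{n} x_i^2 - n \bar{x}^2 \right) \left( \sum_{i=1}^{n} y_i^2 - n \bar{y}^2 \right)}} \]
Spearman rank correlation:
\[ r_s = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n (n^2 - 1)} \]
Simple linear regression line estimates:
\[ b = \frac{\sum_{i=1}^{n} x_i y_i - n \bar{x} \bar{y}}{\sum_{i=1}^{n} x_i^2 - n \bar{x}^2}, \qquad a = \bar{y} - b \bar{x} \]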

Statistics 1
Friday, 6 May 2016 : 10:00 to 12:00

on this paper.
PLEASE TURN OVER

UL16/0490 Page 1 of 21 D1
SECTION A
1. (a) A random sample of athletes’ times to run 200 metres has a sample mean of
24.96 seconds. State the units of measurements for the summaries below and
justify your answers.
i. sample variance
[4 marks]

!
i=4 !
i=3 !
i=5 4
y i
i=2 i=1 i=4
xi
[6 marks]

i. State the distribution of the sample mean for simple random samples of
size n = 100.
[4 marks]
(d) Classify each one of the following variables as either measurable

i. The weight of a chocolate bar produced in a factory.
ii. Responses to ‘what is your age group?’ in a questionnaire.
iii. The colour of a car.
iv. Inflation rates.
[8 marks]
x 0 1 3
pX (x) 0.4 k k

[5 marks]
(f) An engine encounters a standard environment with a probability of 0.9, and

a severe environment with a probability of 0.1. In a normal environment the
is 0.5.
ii. Given that failure has occurred, what is the probability that the
environment encountered was severe?
[4 marks]
(g) A museum conducts a survey of its visitors in order to assess the popularity
of a device which is used to provide information on the museum exhibits. The
i. Carry out an appropriate hypothesis test at the 5% significance level to
see if the device should be withdrawn and state your conclusions.
[7 marks]
(h) State whether the following are true or false and give a brief explanation. (No
marks will be awarded for a simple true/false answer.)
i. The range of a sample is influenced by extreme values.
parameter.
iii. A sample correlation coefficient close to −1 indicates a strong negative
linear relationship between two categorical variables.
iv. A p-value of 0.007 represents a weakly significant hypothesis test result.
v. Failure to reject a null hypothesis might indicate that a Type I error has
been committed.
vi. A stratified random sample is the random equivalent of a convenience
sample.
[12 marks]
SECTION B
2. (a) A sample consisting of 400 randomly selected students was classified in

terms of personality type (introvert or extrovert) and in terms of their
favourite colour (red, yellow, green or blue). Their responses are
summarised in the table below:
             Personality type
          Introvert   Extrovert   Total
Red          32          68        100
Yellow       26          74        100
Green        21          79        100
Blue         46          54        100
Total       125         275        400
test, would you say there is an association between the student’s type of
personality and colour preference?
ii. Calculate the χ2 statistic and use it to test for independence, using a
5% significance level. What do you conclude?
[14 marks]
(b) i. Describe how quota sampling is performed and explain how it differs
from stratified random sampling.
ii. A company producing handheld electronic devices (tablets, mobile
phones etc.) wants to understand how men and women rate its
products. For this reason, the company’s management has decided to
use a survey of its customers and has asked you to devise an
appropriate random sampling scheme. Outline the key components
of your sampling scheme.
[11 marks]
3. (a) A policeman recorded the speed of 30 cars on a road with a 30 miles per
hour speed limit. The recorded data are shown below:
25.6 25.7 25.7 25.8 25.8
26.2 26.9 27.5 27.7 27.8
27.9 27.9 28.3 28.4 28.5
28.8 28.9 28.9 29.0 29.1
29.2 29.3 29.5 29.7 29.8
30.1 30.1 30.2 36.2 36.9
ii. Find the median speed among these cars and the upper quartile. What
percentage of drivers were exceeding the 30 miles per hour speed limit?
iii. Comment on the data given the shape of the histogram without doing
represent the data.
[13 marks]
(b) A random sample of 9 students received special training to improve their

performance on IQ tests. Each of the 9 students took an IQ test before
and after the training and their scores are shown in the table below:
IQ score before training IQ score after training
105 107
116 120
120 118
93 92
119 119
133 135
75 78
86 90
90 96
special training is effective for increasing the average IQ score. State
the test hypotheses, and specify your test statistic and its distribution
under the null hypothesis. Comment on your findings.
iii. Give a 90% confidence interval for the difference between the means of
the IQ scores before and after training.
[12 marks]
4. An insurance company wants to relate the amount of fire damage (y) in

major residential fires to the distance between the residence and the nearest fire
station (x). For this reason, a study was conducted in a large suburb of a major
city based on a sample of 10 recent fires in this suburb. For each of these fires,
the variables x and y were recorded and are shown in the table below:
Fire #1 #2 #3 #4 #5 #6 #7 #8 #9 #10
x 3.4 1.8 4.6 2.3 3.1 5.5 0.7 3.0 2.6 4.3
y 2.6 1.8 5.9 2.3 2.8 8.6 1.4 2.3 2.0 5.7
(a) i. Draw a scatter diagram of these data on the graph paper provided.
Label the diagram carefully.
iii. Calculate the least squares line of y on x and draw the line on the
scatter diagram.
iv. Do you find the analyses in ii. and iii. appropriate? Justify your
answer and suggest any alternative ways to model the relationship
between x and y.
[13 marks]
(b) The 55 university students on a certain course were randomly assigned to

two class groups of size 30 and 25 students respectively. At the end of the
year, all students took the examination and their marks are summarised
in the table below.

Class Group 1 30 75.33 7.61
Class Group 2 25 71.40 6.37
i. Use an appropriate hypothesis test to determine whether the students
of class group 1 were better in terms of examination marks. State
clearly the hypotheses, the test statistic and its distribution under the
null hypothesis, and carry out the test at two appropriate significance
levels. Comment on your findings.
iii. Provide a 95% confidence interval for the difference between the mean
exam marks of the two class groups.
[12 marks]
END OF PAPER

ST104a Statistics 1
Important note

references
section.
General remarks
Learning outcomes
At the end of the course and having completed the Essential reading and activities you should:
• be familiar with the key ideas of statistics that are accessible to a student with a moderate
mathematical competence
required
methods
proportions and conduct chi-square tests of contingency tables
• be able to use simple linear regression and correlation analysis and know when it is
appeared in the second part. Question 3 had a series of questions involving drawing diagrams, such
as histograms, hypothesis testing, in particular paired sample t tests, and confidence intervals. The
first part of Question 4 was on linear regression and involved drawing a diagram, while the second
part was a hypothesis test comparing population means using the sample data given. This means
that it is really important that you make sure you have a reasonable idea of what topics are covered
before you start work on the paper! We suggest you divide your time as follows during the
examination.
and subquestion.
are required, and note-form answers are acceptable. However, clear and accurate language, both
Examiners’ commentaries for the papers for each zone should make these requirements clear.
Remember the following.
What are the units? What are the x-axis and y-axis?
is not acceptable to do one rather than the other! If you are asked to find a 5% critical
value, this is what will be marked.
comment on the results, carrying out an additional hypothesis test will not gain you marks.
Examiners’ commentaries?
• the relevant detailed reference to P. Newbold, W.L. Carlson and B.M. Thorne Statistics for
prepare, and similar questions from Newbold (2012).
Memorising from the Examiners’ commentaries
It was noted recently that a small number of candidates appeared to be memorising answers from
previous years’ Examiners’ commentaries, for example plots, and produced the exact same image of
them without looking at the current year’s examination paper questions! Note that this is very easy
to spot. The Examiners’ commentaries should be used as a guide to practise on sample examination
questions and it is pointless to attempt to memorise them.
expected. This may be due to a number of reasons. The Examiners’ commentaries suggest ways of
addressing common problems and improving your performance. One particular failing is ‘question
We recognise that candidates may not cover all topics in the syllabus in the same depth, but you
need to be aware that the examiners are free to set questions on any aspect of the syllabus. This
The syllabus can be found in the Course information sheet in the section of the VLE dedicated to
each course. You should read the syllabus carefully and ensure that you cover sufficient material in
preparation for the examination. Examiners will vary the topics and questions from year to year and
may well set questions that have not appeared in past papers. Examination papers may legitimately
include questions on any topic in the syllabus. So, although past papers can be helpful during your
revision, you cannot assume that topics or specific questions that have come up in past examinations
will occur again.

ST104a Statistics 1
Important note
what follows the symbol • corresponds to 1 mark unless stated otherwise.

references
section.
A (50 marks) and TWO questions from Section B (25 marks each).
Section A
Question 1
(a) A random sample of the heights of buildings has a sample mean of 24.96 metres.
State the units of measurements for the summaries below and justify your
answers.
i. sample variance
(4 marks)

This question requires knowledge regarding measures of location and spread. Hence reading
of Sections 4.8 and 4.9 in the subject guide is essential and in particular Section 4.9.3. For
example, candidates should gain familiarity with the sample mean, median, variance and
standard deviation.

The first thing to do is check the formulae for the sample variance and standard deviation.
It is then not hard to note that the sample variance, $s^2$, involves squared deviations of the
observations about the sample mean:
\[ s^2 = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2. \]
The units of measurement will therefore be metres squared, m².

The formula for the standard deviation, s, involves the square root of the sample variance:
\[ s = \sqrt{\frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2} \]
hence we return to the original units of measurement, i.e. metres, m.

Some candidates did not provide a justification for their choices, for example just reporting
metres or metres squared. Justification is essential, however, so mentioning the formulae
was necessary to get full marks.

y3 = 0, y4 = 1, y5 = −3. Calculate the following quantities:
i. $\sum_{i=2}^{4} x_i^2$   ii. $\sum_{i=1}^{3} 2 x_i y_i$   iii. $y_5^3 + \sum_{i=3}^{4} \frac{y_i^4}{x_i}$.
(6 marks)

guide, and in particular Activity A1.6.
This question was generally well done. The answers are as follows.
i. $\sum_{i=2}^{4} x_i^2 = (-1)^2 + (-6)^2 + 5^2 = 1 + 36 + 25 = 62$.
ii. $\sum_{i=1}^{3} 2 x_i y_i = 2 \sum_{i=1}^{3} x_i y_i = 2((8 \times -7) + (-1 \times 3) + (-6 \times 0)) = 2(-56 - 3 + 0) = -118$.
iii. $y_5^3 + \sum_{i=3}^{4} y_i^4 / x_i = (-3)^3 + (0 + 1/5) = -26.8$.
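The three sums above can be verified with a few lines of code. This is a sketch only; the dictionaries `xs` and `ys` are illustrative helpers that give the 1-based indexing used in the question.

```python
# Data from the question, stored with 1-based indices to match x_1, ..., x_5.
x = [8, -1, -6, 5, 0]
y = [-7, 3, 0, 1, -3]
xs = dict(enumerate(x, start=1))
ys = dict(enumerate(y, start=1))

# i. Sum of x_i^2 for i = 2, 3, 4.
sum_i = sum(xs[i] ** 2 for i in range(2, 5))

# ii. Sum of 2 * x_i * y_i for i = 1, 2, 3.
sum_ii = sum(2 * xs[i] * ys[i] for i in range(1, 4))

# iii. y_5^3 plus the sum of y_i^4 / x_i for i = 3, 4.
sum_iii = ys[5] ** 3 + sum(ys[i] ** 4 / xs[i] for i in range(3, 5))

print(sum_i, sum_ii, round(sum_iii, 1))
```

Note that `range(2, 5)` stops before 5, which is the usual source of off-by-one errors when translating summation limits into code.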

i. State the distribution of the sample mean for simple random samples of size
n = 25.
ii. Given a simple random sample of size n = 25, determine the probability that
the sample mean will be less than 128.
(4 marks)

Sample examination questions are quite relevant. For the first part of the question it is
essential to check Section 6.9 of the subject guide.
The first part just requires knowledge of the fact that if X is a normal random variable with
mean µ and variance σ 2 , the sample mean from a sample of size n, X̄, is also a normal
random variable with mean µ and variance σ 2 /n. Direct application of this fact then yields
that:
\[ \bar{X} \sim N\left( 138, \frac{(21)^2}{25} \right) = N(138, 17.64). \]
For the second part, the basic property of the normal random variable for this question is
that if X ∼ N (µ, σ 2 ), then Z = (X − µ)/σ ∼ N (0, 1). Note also that:
* P (Z < a) = P (Z ≤ a) = Φ(a)
* P (Z > a) = P (Z ≥ a) = 1 − P (Z ≤ a) = 1 − P (Z < a) = 1 − Φ(a)
* P (a < Z < b) = P (a ≤ Z < b) = P (a < Z ≤ b) = P (a ≤ Z ≤ b) = Φ(b) − Φ(a).
The above is all you need to find the requested proportion. We can write:

\begin{align*}
P(\bar{X} < 128) &= P\left( Z < \frac{128 - 138}{\sqrt{17.64}} \right) \\
&= P(Z < -2.38) \\
&= 1 - \Phi(2.38) \\
&= 1 - 0.99134 \\
&= 0.00866.
\end{align*}
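The same probability can be obtained without tables. The sketch below uses `statistics.NormalDist` from the Python standard library; the variable names are illustrative. Because it does not round z to two decimal places, the result differs very slightly from the table-based value above.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 138, 21, 25

# Sampling distribution of the sample mean: N(mu, sigma^2 / n).
sampling_dist = NormalDist(mu, sigma / sqrt(n))

# Standardised value and the required probability P(X-bar < 128).
z = (128 - mu) / sampling_dist.stdev
prob = sampling_dist.cdf(128)

print(f"variance of X-bar = {sampling_dist.variance:.2f}")
print(f"z = {z:.2f}, P(X-bar < 128) = {prob:.5f}")
```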
(d) Classify each one of the following variables as measurable (continuous) or

i. The weight of a cereal packet produced in a factory.
ii. The order an athlete finishes a marathon.
iii. The colour of a pair of shoes.
iv. Currency exchange rates.
(8 marks)

i. Measurable because the weight can be measured, for example, in grammes to several
decimal places such as 499.28 g.
ii. The observations consist of the athletes finishing in a specific order (1st, 2nd etc.). It is
therefore a categorical ordinal variable.
iii. Each colour (black, white, red, etc.) is a category. Also, there is no natural ordering
between the colours, for example we cannot really say that ‘blue is higher than red’. This
is therefore a categorical nominal variable.
iv. Measurable because exchange rates are quoted to several decimal places, for example
US$1.45 to the £.
x 0 1 4
pX (x) 0.2 k k
(5 marks)

This is a question on probability, exploring the concepts of relative frequency, conditional
probability and probability distribution. Reading from Chapter 5 of the subject guide is
suggested with focus on the sections on these topics. Try Activity A5.1 and the exercises on
probability trees.
i. $\sum_i p(x_i) = 1$, hence k = 0.4.
ii. $E(X) = \sum_i x_i \, p(x_i) = 0 \times 0.2 + 1 \times 0.4 + 4 \times 0.4 = 2.0$.
iii. $E(X^2) = \sum_i x_i^2 \, p(x_i) = 0^2 \times 0.2 + 1^2 \times 0.4 + 4^2 \times 0.4 = 6.8$. Hence:
\[ \text{Var}(X) = 6.8 - 2^2 = 2.8. \]

An alternative method to find the variance is through the formula $\sum_i (x_i - \mu)^2 \, p(x_i)$,
where µ is found in part ii.
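Both routes to the variance can be checked numerically. The following is a minimal sketch with illustrative variable names; k is obtained from the requirement that the probabilities sum to 1.

```python
values = [0, 1, 4]

# The probabilities must sum to 1: 0.2 + k + k = 1.
k = (1 - 0.2) / 2
probs = [0.2, k, k]

# E(X) and E(X^2) from the probability distribution.
expected = sum(x * p for x, p in zip(values, probs))
expected_sq = sum(x ** 2 * p for x, p in zip(values, probs))

# Var(X) computed two ways: E(X^2) - E(X)^2, and the direct definition.
variance = expected_sq - expected ** 2
variance_alt = sum((x - expected) ** 2 * p for x, p in zip(values, probs))

print(f"k = {k:.1f}, E(X) = {expected:.1f}, Var(X) = {variance:.1f}")
```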
(f ) An engine encounters a standard environment with a probability of 0.95, and a

severe environment with a probability of 0.05. In a normal environment the
is 0.5.
ii. Given that failure has occurred, what is the probability that the environment
encountered was severe?
(4 marks)

guide. It is essential to practise on such exercises through the learning activities and
exercises of this chapter as well as the material on the VLE. In particular you can attempt
Learning activity A5.6 and Sample examination question 5. It is also useful to familiarise
yourself with probability trees as they can be quite handy in such exercises.
The first part was straightforward for candidates familiar with this section, requiring the use
of the law of total probability (although it can also be calculated using common intuition).
Part ii. requires knowledge of the conditional probability definition or, alternatively,
knowledge of Bayes’ theorem.
The working for this exercise is given below.

i. We have:
P (F ) = P (F | N ) P (N ) + P (F | S) P (S) = 0.02 × 0.95 + 0.5 × 0.05 = 0.044.
ii. We have:
\[ P(S \mid F) = \frac{P(F \mid S) \, P(S)}{P(F)} = \frac{0.025}{0.044} = \frac{25}{44} = 0.5682. \]
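The two steps, the law of total probability followed by Bayes' theorem, can be sketched in code as follows. The variable names are illustrative and the probabilities are those given in the question.

```python
# Probabilities of each environment and of failure within each environment.
p_standard, p_severe = 0.95, 0.05
p_fail_given_standard, p_fail_given_severe = 0.02, 0.5

# i. Law of total probability: P(F) = P(F | N) P(N) + P(F | S) P(S).
p_fail = p_fail_given_standard * p_standard + p_fail_given_severe * p_severe

# ii. Bayes' theorem: P(S | F) = P(F | S) P(S) / P(F).
p_severe_given_fail = p_fail_given_severe * p_severe / p_fail

print(f"P(F) = {p_fail:.3f}")
print(f"P(S | F) = {p_severe_given_fail:.4f}")
```

Although a severe environment is encountered only 5% of the time, it accounts for more than half of all failures, which is the kind of result a probability tree makes easy to see.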
(g) A museum conducts a survey of its visitors in order to assess the popularity of a
device which is used to provide information on the museum exhibits. The
i. Carry out an appropriate hypothesis test at the 5% significance level to see if
the device should be withdrawn and state your conclusions.
(7 marks)

This question refers to a one-sided hypothesis test examining whether the proportion of all
museum visitors who use the device is less than 20%. While the entire chapter (Chapter 8 of
the subject guide) on hypothesis testing is relevant, one can focus on the relevant section
for a single proportion, Section 8.14. Note also that reading on one-tailed (and two-tailed)
hypothesis tests is located in Section 8.10. The second part of the question looks at
p-values, and the relevant section in the subject guide is Section 8.11.
only one variable involved it will have to be a test for a single proportion, and the test
statistic can be found in the formula sheet. Make sure to substitute the relevant quantities
carefully and avoid any numerical errors in the calculation.
table for the relevant significance level, deciding whether to reject H0 , and interpreting the
results in the context of the problem. The working of the first part of the exercise is given
below.
• H0 : π = 0.2 vs. H1 : π < 0.2.
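• The sample proportion is p = 15/100 = 0.15. The standard error of the sample
proportion under H0 is $\sqrt{0.2 \times 0.8 / 100} = 0.04$. The test statistic value is:
\[ t = \frac{0.15 - 0.2}{0.04} = -1.25. \]
• For α = 0.05, the lower-tail critical value is −1.645. Since −1.25 > −1.645, we do not
reject H0 .
• No evidence that fewer than 20% of visitors make use of the device.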
The second part of the question requires the use of p-values and challenged most candidates.
The exercise does not require lengthy calculations and can be derived in a relatively
straightforward manner if one is familiar with the material of Section 8.11 of the subject
guide. Once the test statistic is calculated (t = −1.25 from the first part) one simply needs
to calculate, where Z ∼ N (0, 1):
P (Z ≤ −1.25) = 1 − Φ(1.25) = 1 − 0.8944 = 0.1056.
Note: The last three marks of the first part can also be awarded by correct use of the
p-value, see below.
• The p-value is higher than α = 0.05.
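Both the critical-value approach and the p-value approach can be reproduced as follows. This is a sketch with illustrative variable names, using the standard normal distribution from the Python standard library.

```python
from math import sqrt
from statistics import NormalDist

pi0 = 0.2          # proportion under H0
n, r = 100, 15     # sample size and number of visitors who used the device

p_hat = r / n
se = sqrt(pi0 * (1 - pi0) / n)   # standard error computed under H0
z = (p_hat - pi0) / se

# Lower-tailed test at the 5% level: reject H0 if z is below the critical value.
crit = NormalDist().inv_cdf(0.05)
reject_h0 = z < crit

# p-value for the lower-tailed test: P(Z <= z).
p_value = NormalDist().cdf(z)

print(f"z = {z:.2f}, critical value = {crit:.3f}, reject H0: {reject_h0}")
print(f"p-value = {p_value:.4f}")
```

The two approaches always agree: the test statistic falls outside the rejection region exactly when the p-value exceeds the significance level.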

i. The interquartile range of a sample is influenced by extreme values.

parameter.
iii. A sample correlation coefficient close to 1 indicates a strong positive linear
iv. A p-value of 0.08 represents a highly significant hypothesis test result.
v. Rejection of a null hypothesis might indicate that a Type II error has been
committed.
vi. A quota sample is the non-random equivalent of a systematic random sample.
(12 marks)

level in computations. Part i. concerns measures of spread that can be found in Section 4.9
of the subject guide. Part ii. enquires about the sampling distribution which is defined in
Section 6.9. Part iii. is about correlation (see Section 12.8) and types of variables (see
Section 4.6). Part iv. targets the concepts of a p-value covered in Section 8.11, whereas part
v. looks at types of error in hypothesis testing (Section 8.7). Finally, part vi. requires
material from Chapter 10 and in particular Section 10.7 on types of sampling.

or false.
i. False. The interquartile range of a sample is defined as the range of the central 50% of
the values in a dataset, so any extreme values would lie below the lower quartile and/or
above the upper quartile.
ii. False. A sampling distribution is the probability distribution of a sample statistic.
iii. False. A value of r close to 1 indicates a strong, positive linear relationship between two
measurable (continuous) variables.
iv. False. A p-value less than 0.01 represents a highly significant hypothesis test result, 0.08
is merely weakly significant.
v. False. Rejection of a true null hypothesis might indicate that a Type I error has been
committed.
vi. False. A quota sample is the non-random equivalent of a stratified random sample.
ST104a Statistics 1
Section B
Question 2
(a) A factory uses four different machines to manufacture a particular type of

machine component. A random sample of 400 components is selected from the
output of the factory. Each component in the sample is inspected to determine
whether or not it is faulty. The machine that produced the component is also
recorded. The results are as follows:
Outcome
Faulty Non-faulty Total
Machine 1 4 96 100
Machine 2 2 98 100
Machine 3 11 89 100
Machine 4 14 86 100
Total 31 369 400
i. Based on the data in the table, and without conducting any significance test,
would you say there is an association between the machine number and the
component being faulty?
(14 marks)

This part targets Chapter 8 of the subject guide on contingency tables and chi-squared
tests. Note that part i. of the question does not require any calculations, just understanding
and interpreting contingency tables. Candidates can attempt Activity A8.4 to practise. Part
ii. is a straightforward chi-squared test and the reading is also given in Chapter 8. Look also
at Activity A8.4.
i. There are some differences in the proportions of faulty components for each machine.
More specifically, 2% of the components from Machine 2 are faulty, whereas the
corresponding proportion for Machine 3 is 11%, and for Machine 4 is 14%. Hence, there
seems to be an association between machine number and the component being faulty,
although this needs to be investigated further. (Note: the conclusion of the last sentence
must be stated to get full marks.)
ii. Set out the null hypothesis that there is no association between machine number and the
component being faulty against the alternative that there is an association. Be careful to
get these the correct way round!
H0 : No association between the machine number and the component being faulty.
vs.
H1 : Association between machine number and the component being faulty.
Work out the expected values to obtain the table below.
7.75 92.25
7.75 92.25
7.75 92.25
7.75 92.25

The test statistic is:
$$\sum_{i,j} \frac{(O_{i,j} - E_{i,j})^2}{E_{i,j}}$$
which gives a value of 13.53. This is a 4 × 2 contingency table, so the degrees of freedom
are (4 − 1) × (2 − 1) = 3.
For α = 0.05, the critical value is 7.815, hence we reject H0 .
We conclude that there is evidence of an association between machine number and the
component being faulty.
earlier accurate work.
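The chi-squared calculation above can be checked with a short script. This is a minimal sketch rather than part of the marking scheme: the variable names are illustrative, and the 5% critical value is simply the table value quoted in the commentary.

```python
# Observed counts from the question: rows are Machines 1 to 4,
# columns are (faulty, non-faulty).
observed = [[4, 96], [2, 98], [11, 89], [14, 86]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected count under H0 (no association):
# row total x column total / grand total.
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

# Chi-squared statistic: sum over all cells of (O - E)^2 / E.
chi_squared = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

degrees_of_freedom = (len(observed) - 1) * (len(observed[0]) - 1)
CRITICAL_VALUE_5PCT = 7.815  # chi-squared table value for 3 df, as quoted above

print("Expected counts for Machine 1:", expected[0])
print("Test statistic:", round(chi_squared, 2))
print("Degrees of freedom:", degrees_of_freedom)
print("Reject H0 at the 5% level:", chi_squared > CRITICAL_VALUE_5PCT)
```

Because every machine contributes 100 components, all four rows share the same expected counts, which is why the expected table repeats a single row.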
(b) i. Describe how stratified random sampling is performed and explain how it
differs from quota sampling.
ii. A company producing handheld electronic devices (tablets, mobile phones
etc.) wants to understand how people of different ages rate its products. For
this reason, the company’s management has decided to use a survey of its
customers and has asked you to devise an appropriate random sampling
scheme. Outline the key components of your sampling scheme.
(11 marks)

This question was on basic material on survey designs. Background reading is given in
Learning activities of Chapter 10.
unsure of what these things are, do not write lengthy essays. This is not giving you
The marking scheme and some model answers are given below.
i. Description of stratified random sampling: the population is divided into strata, natural
groupings within the population, and a simple random sample is taken from each
stratum. See page 162 of the subject guide for a more detailed description.
Stratified random sampling is different from quota sampling in the following ways.
∗ Stratified random sampling is probability sampling, whereas quota sampling is
non-probability sampling.
∗ In stratified random sampling a sampling frame is required, whereas in quota
sampling pre-chosen frequencies in each category are sought.
ii. As mentioned earlier, it is crucial in this type of question to avoid long answers. Also,
note that there is no unique answer. A possible set of ‘ingredients’ of an answer is given
below (each bullet point corresponds to a mark).
• Propose stratified sampling since customers of all ages are to be surveyed.
• Sampling frame could be the company’s customer database.
• Take a simple random sample from each stratum.
• Stratification factors should include age.
• Other stratification factors could be gender, country of residence, etc.
• Contact method: mail, telephone or email (likely to have all details on database).
• Minimise non-response through a suitable incentive, such as discount off the next
purchase.
Question 3
(a) The data below represent heights, measured in centimetres, of women from an
adult female population:
162 164 164 165 165
166 166 166 167 167
167 167 167 168 168
168 168 168 168 169
169 169 169 170 170
170 171 172 184 185
paper provided.
ii. Find the median height among these women and the upper quartile. What
percentage of women were below 165 cm?
iii. Comment on the data given the shape of the histogram without doing any
further calculations.
represent the data.
(13 marks)

histograms can be found in Section 4.7.3, but the entire Sections 4.7, 4.8 and 4.9 are highly
relevant.
the figure. Note that it is essential (and more convenient) to draw the figure on the
graph paper provided; marks will be withdrawn otherwise.
[Figure: Histogram of Heights. Horizontal axis: heights of women in centimetres (160 to 185); vertical axis: frequency density.]
ii. • Median: 168 centimeters. Note: Raw data should be used, not grouped data. Also,
make sure to mention the units to get the full marks.
• Upper quartile: 169 centimeters. Note: Same as above.
• Percentage: 3/30 = 10%. Note: As the question asks for a percentage, make sure to
report 10%, not just 3/30 or anything else.
iii. Based on the shape of the histogram, we can see that the distribution of the data is
positively skewed. Also two women, with heights of 184 cm and 185 cm, may be regarded
as outliers. Note: It is important to identify the specific outliers (184 cm and 185 cm)
not just write ‘there are two outliers’.
iv. A boxplot, stem-and-leaf diagram or dot plot are other types of suitable graphical
displays. The reason for that is that the variable height is measurable and these graphs
(b) A random sample of 9 people tried a specific diet that lasted 2 months to lose
weight. The weights of these people, measured in kilograms, were measured
both at the beginning and the end of the diet, and are shown in the table below:
Weight before diet Weight after diet
75 73
76 72
90 92
92 93
89 89
63 61
65 62
80 76
90 84
i. Carry out an appropriate hypothesis test to determine whether the diet is
effective in helping people lose weight. State the test hypotheses, and specify
your test statistic and its distribution under the null hypothesis. Comment
on your findings.
iii. Give a 90% confidence interval for the difference between the means of the
weights before and after the diet.
(12 marks)

Look up the sections about hypothesis testing for testing a difference between two
population means. However, it is essential for this part to focus on the section regarding
paired samples (Section 8.16.4).
i. Regarding hypotheses, note that the wording ‘effective’ suggests a one-sided test. Hence
we test:
H0 : µbefore = µafter vs. H1 : µbefore > µafter .
observations for each person (before and after the diet). Hence the difference for each
person should be calculated:
−2 −4 2 1 0 −2 −3 −4 −6
The next step is to calculate $s_d = 2.598$ and $\bar{x}_d = -2.0$, in order to obtain the value of
the test statistic:
$$t = \frac{\bar{x}_d - 0}{s_d/\sqrt{n}} = -2.309.$$
one-sided test) is −1.860. Note: This is clearly a t distribution, make sure not to use the
standard normal distribution.
gives a critical value of t8, 0.99 = −2.896. Therefore, we do not reject H0 and conclude
that there is moderate evidence that the diet is effective.
ii. • Differences are normally distributed.
• Pairs of observations are independent.
iii. This is a standard exercise for confidence intervals given the appropriate formula from
the formula sheet (make sure to be able to recognise it). The requested confidence
interval is (−3.610, −0.390).
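The paired-sample working above can be reproduced as follows. This is a sketch under stated assumptions: the differences are taken as after minus before (matching the listed values), and the critical value 1.860 is the t table value for 8 degrees of freedom quoted in the commentary, not something the code derives.

```python
import math
import statistics

before = [75, 76, 90, 92, 89, 63, 65, 80, 90]
after = [73, 72, 92, 93, 89, 61, 62, 76, 84]

# Differences taken as after - before, matching the values listed above.
differences = [a - b for a, b in zip(after, before)]

n = len(differences)
mean_d = statistics.mean(differences)
sd_d = statistics.stdev(differences)   # sample standard deviation (n - 1 divisor)
standard_error = sd_d / math.sqrt(n)

# Test statistic for H0: mean difference = 0.
t_statistic = (mean_d - 0) / standard_error

# Assumption: critical value read from t tables (8 df, 5% in one tail).
# The same value gives a two-sided 90% confidence interval.
T_CRITICAL = 1.860
ci_90 = (mean_d - T_CRITICAL * standard_error,
         mean_d + T_CRITICAL * standard_error)

print("Differences:", differences)
print("Mean and standard deviation:", mean_d, round(sd_d, 3))
print("Test statistic:", round(t_statistic, 3))
print("90% confidence interval:", tuple(round(bound, 2) for bound in ci_90))
```

The interval endpoints may differ from the quoted ones in the third decimal place, depending on where intermediate values are rounded.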
Question 4
(a) The director of a local Tourism Authority would like to know whether a family’s
annual expenditure on recreation (y), measured in $000s, is related to their
annual income (x), also measured in $000s. In order to explore this potential
relationship, the variables x and y were recorded for 10 randomly selected
families that visited the area last year. The results were as follows:
Week #1 #2 #3 #4 #5 #6 #7 #8 #9 #10
x 41.2 50.1 52.0 62.0 44.5 37.7 73.5 37.5 56.7 65.2
y 2.4 2.7 2.8 8.0 3.1 2.1 12.1 2.0 3.9 8.9
Sum of y data: 48 Sum of the squares of y data: 343.74
diagram carefully.
diagram.
iv. Do you find the analyses in ii. and iii. appropriate? Justify your answer and
suggest any alternative ways to model the relationship between x and y.
(13 marks)

This is a standard linear regression question and the reading is to be found in Chapter 12 of
the subject guide. Section 12.6 provides details for scatter diagrams and is suitable for part
i., whereas the remaining parts are on correlation and regression which are covered in
Sections 12.8, 12.9 and 12.10 of the subject guide. Section 12.7 is also relevant. Sample
examination question 2 of this chapter is also recommended for practice on questions of this
type.
Candidates who drew on the ordinary paper in their answer booklet were not awarded
marks for this part of the question.
[Figure: Scatter diagram, "Annual family recreation expenditure vs. Annual family income". Horizontal axis: annual family income in $000s; vertical axis: annual family recreation expenditure in $000s.]

ii. The summary statistics can be substituted into the formula for the sample correlation
coefficient (make sure you know which one it is!) to obtain the value 0.9222. An
interpretation of this value is the following: the data suggest that the higher family
annual income, the higher the family annual recreation expenditure. The fact that the
value is very close to 1, suggests that this is a strong, positive linear relationship.
Many candidates did not mention all three words (strong, positive, linear). Note that all
of these words provide useful information on interpreting the relationship and are
formula for b is:
$$b = \frac{\sum x_i y_i - n\bar{x}\bar{y}}{\sum x_i^2 - n\bar{x}^2}$$
The formula for a is a = ȳ − bx̄, so we get a = −9.107.
Hence the regression line can be written as ŷ = −9.107 + 0.267x or
y = −9.107 + 0.267x + ε. It should also be plotted on the scatter diagram.
Many candidates incorrectly reported the regression line as y = −9.107 + 0.267x. This
expression is incorrect because it omits the error term; one of the two expressions above is required.
iv. In this case, one can note in the scatter diagram that the points seem to be ‘scattered’
around a non-linear curve rather than a straight line. Another, equivalent, way to note
this is the presence of two outliers. Hence a linear regression model does not seem to be
a good model for the relationship between family annual income and family annual
recreation expenditure. Alternative approaches may involve the Spearman’s rank
correlation coefficient or transformations of the data, for example a log-transformation.
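For checking the figures in parts ii. and iii., the correlation coefficient and the least squares estimates can be computed directly from the data in the question. The sketch below uses the corrected sums of squares form of the formulae; the variable names are illustrative only.

```python
import math

income = [41.2, 50.1, 52.0, 62.0, 44.5, 37.7, 73.5, 37.5, 56.7, 65.2]   # x
expenditure = [2.4, 2.7, 2.8, 8.0, 3.1, 2.1, 12.1, 2.0, 3.9, 8.9]       # y

n = len(income)
x_bar = sum(income) / n
y_bar = sum(expenditure) / n

# Corrected sums of squares and cross-products.
s_xy = sum(x * y for x, y in zip(income, expenditure)) - n * x_bar * y_bar
s_xx = sum(x ** 2 for x in income) - n * x_bar ** 2
s_yy = sum(y ** 2 for y in expenditure) - n * y_bar ** 2

r = s_xy / math.sqrt(s_xx * s_yy)   # sample correlation coefficient
b = s_xy / s_xx                     # slope estimate
a = y_bar - b * x_bar               # intercept estimate

print("r =", round(r, 4))
print("b =", round(b, 3))
print("a =", round(a, 3))
```

A high value of r does not by itself show that a straight line is the right model, which is the point made in part iv.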
(b) The fuel consumption of two different car models (A and B) was compared in
the following way. A random sample of 20 cars from model A and 35 cars from
model B were taken and the fuel consumption (in miles per gallon) was
measured for each car. The results are summarised in the table below.
Sample size Sample mean Sample standard deviation
Car Model A 20 30.9 6.11
Car Model B 35 27.1 6.41
i. Use an appropriate hypothesis test to determine whether the model A cars

can do more miles per gallon than model B cars. State clearly the
hypotheses, the test statistic and its distribution under the null hypothesis,
and carry out the test at two appropriate significance levels. Comment on
your findings.
iii. Provide a 95% confidence interval for the difference between the mean fuel
consumption of the two car models.
(12 marks)

The first two parts of the question refer to a one-sided hypothesis test comparing two
population means. While the entire chapter on hypothesis testing is relevant (Chapter 8),
one can focus on Section 8.16 and in particular Sections 8.16.2 and 8.16.3 as the variances
are unknown. The last part of the question requires a confidence interval for the difference
between two population means, therefore Sections 7.13.2 and 7.13.3 are most relevant.
i. Let µA denote the mean fuel consumption for car model A and µB the mean fuel
consumption for car model B.
The wording ‘can do more miles per gallon than’ implies a one-sided test, hence the
hypotheses can be written as:
H0 : µA = µB vs. H1 : µA > µB .
$$\frac{\bar{x} - \bar{y}}{\sqrt{s_A^2/n_A + s_B^2/n_B}} \quad \text{or} \quad \frac{\bar{x} - \bar{y}}{\sqrt{s_p^2\,(1/n_1 + 1/n_2)}}.$$
distribution is being used. The critical value at the 5% significance level is 1.676, hence
we reject the null hypothesis. If we take a (smaller) α of 1%, the critical value is 2.390,
so we do not reject H0 . We conclude that there is moderate evidence of a difference in
the mean fuel consumption between the car models.
ii. The assumptions for ii. were the following.
• Assumption about equal variances.
• Assumption about whether nA + nB is ‘large’ so that the normality assumption is
satisfied.
Some candidates stated assumptions in this part that were not made in part i. Marks
were not awarded in such cases. Also some other candidates just copied the phrase
‘assumption about equal variances’ and naturally were not awarded any marks. One
should state whether the calculations were based on the assumption that the unknown
variances are equal or unequal.
iii. Based on the t50 distribution and using the correct formula from the formula sheet (make
sure to be able to recognise it) the requested 95% confidence interval is (0.251, 7.349).
Note: In the solution above, the t50 distribution was used but the use of the standard
normal distribution is also justified as the sample size is relatively large. Hence a solution
based on the standard normal distribution is also acceptable.
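The pooled-variance version of this calculation can be sketched as below. Several things here are assumptions rather than part of the commentary: the third column of the table is treated as the sample standard deviation, the variances are assumed equal, and the value 2.009 is the t table value for 50 degrees of freedom with 2.5% in each tail. The critical values 1.676 and 2.390 are the ones quoted in the commentary.

```python
import math

# Summary statistics from the question (third value assumed to be
# the sample standard deviation).
n_a, mean_a, sd_a = 20, 30.9, 6.11
n_b, mean_b, sd_b = 35, 27.1, 6.41

# Pooled variance estimate, assuming equal population variances.
pooled_variance = ((n_a - 1) * sd_a ** 2 + (n_b - 1) * sd_b ** 2) / (n_a + n_b - 2)
standard_error = math.sqrt(pooled_variance * (1 / n_a + 1 / n_b))

t_statistic = (mean_a - mean_b) / standard_error

T_CRIT_5PCT_ONE_SIDED = 1.676   # quoted in the commentary
T_CRIT_1PCT_ONE_SIDED = 2.390   # quoted in the commentary
T_CRIT_95_CI = 2.009            # assumption: t50 table value, 2.5% in each tail

ci_95 = ((mean_a - mean_b) - T_CRIT_95_CI * standard_error,
         (mean_a - mean_b) + T_CRIT_95_CI * standard_error)

print("Test statistic:", round(t_statistic, 3))
print("Reject H0 at 5%:", t_statistic > T_CRIT_5PCT_ONE_SIDED)
print("Reject H0 at 1%:", t_statistic > T_CRIT_1PCT_ONE_SIDED)
print("95% confidence interval:", tuple(round(bound, 2) for bound in ci_95))
```

Small differences in the third decimal place of the interval are to be expected, since they depend on how intermediate values are rounded.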

ST104a Statistics 1
Important note

references
section.
A (50 marks) and TWO questions from Section B (25 marks each).
Section A
Question 1
(a) A random sample of athletes’ times to run 200 metres has a sample mean of
24.96 seconds. State the units of measurements for the summaries below and
justify your answers.
i. sample variance
(4 marks)

This question requires knowledge regarding measures of location and spread. Hence reading
of Sections 4.8 and 4.9 in the subject guide is essential and in particular Section 4.9.3. For
example, candidates should gain familiarity with the sample mean, median, variance and
standard deviation.

The first thing to do is check the formulae for the sample variance and standard deviation.
It is then not hard to note that the sample variance, $s^2$, involves squared deviations of the
observations about the sample mean:
$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2.$$
The units of measurement will therefore be seconds squared.

The formula for standard deviation s involves the square root of the sample variance:
$$s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$$
hence we return to the original units of measurement, i.e. seconds.

Some candidates did not provide a justification for their choices, for example just reporting
seconds or seconds squared. Justification is essential however, and therefore the mention of
the formulae was essential to get full marks.

i. $\sum_{i=2}^{4} x_i^2$   ii. $\sum_{i=1}^{3} 3x_i y_i$   iii. $y_3^3 + \sum_{i=4}^{5} \frac{y_i^4}{x_i}$
(6 marks)

guide, and in particular Activity A1.6.
This question was generally well done. The answers are as follows.
i. $\sum_{i=2}^{4} x_i^2 = (-3)^2 + (-7)^2 + 6^2 = 9 + 49 + 36 = 94$.
ii. $\sum_{i=1}^{3} 3x_i y_i = 3\sum_{i=1}^{3} x_i y_i = 3((4 \times -6) + (-3 \times 4) + (-7 \times -4)) = 3(-24 - 12 + 28) = -24$.
iii. $y_3^3 + \sum_{i=4}^{5} y_i^4/x_i = (-4)^3 + (0 + 1/2) = -63.5$.
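The three summations can be checked mechanically. In the sketch below the data values are those implied by the worked answers; the fifth pair (x = 2, y = 1) is an assumption, since the working only shows that the fifth term of the last sum equals 1/2.

```python
# Values implied by the worked answers above. The fifth pair is an
# assumption: the answers only show that y_5^4 / x_5 = 1/2.
x = [4, -3, -7, 6, 2]
y = [-6, 4, -4, 0, 1]

# Python lists are 0-indexed, so x_i in the text is x[i - 1] here.
# range(2, 5) runs over i = 2, 3, 4, matching the limits of the first sum.
part_i = sum(x[i - 1] ** 2 for i in range(2, 5))

# The constant 3 can be taken outside the sum, as in the working.
part_ii = 3 * sum(x[i - 1] * y[i - 1] for i in range(1, 4))

# y_3 cubed plus the sum of y_i^4 / x_i over i = 4, 5.
part_iii = y[3 - 1] ** 3 + sum(y[i - 1] ** 4 / x[i - 1] for i in range(4, 6))

print(part_i, part_ii, part_iii)
```

The main thing to watch is the limits of summation: each sum uses only some of the observations.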

i. State the distribution of the sample mean for simple random samples of size
n = 100.
(4 marks)

Sample examination questions are quite relevant. For the first part of the question it is
essential to check Section 6.9 of the subject guide.

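The first part just requires knowledge of the fact that if X is a normal random variable with
mean µ and variance σ², then the sample mean X̄ of a random sample of size n is also a normal
random variable with mean µ and variance σ²/n. Direct application of this fact then yields
that:
$$\bar{X} \sim N\left(76, \frac{(12)^2}{100}\right) = N(76, 1.44).$$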
For the second part, the basic property of the normal random variable for this question is
that if X ∼ N (µ, σ 2 ), then Z = (X − µ)/σ ∼ N (0, 1). Note also that:
* P (Z < a) = P (Z ≤ a) = Φ(a)
* P (Z > a) = P (Z ≥ a) = 1 − P (Z ≤ a) = 1 − P (Z < a) = 1 − Φ(a)
* P (a < Z < b) = P (a ≤ Z < b) = P (a < Z ≤ b) = P (a ≤ Z ≤ b) = Φ(b) − Φ(a).
The above is all you need to find the requested proportion. We can write:
$$P(\bar{X} < 75) = P\left(Z < \frac{75 - 76}{\sqrt{1.44}}\right) = P(Z < -0.83) = 1 - \Phi(0.83) = 1 - 0.7967 = 0.2033.$$
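As a check on the table lookup, the same probability can be computed numerically. This sketch builds Φ from the error function in the standard library; the helper name `standard_normal_cdf` is illustrative. It reports the probability both with z rounded to two decimal places (as when using tables) and without rounding.

```python
import math

def standard_normal_cdf(z):
    """Phi(z) for a standard normal variable, via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

mu, sigma, n = 76, 12, 100

# Standard error of the sample mean: sigma / sqrt(n).
standard_error = sigma / math.sqrt(n)

z = (75 - mu) / standard_error
z_rounded = round(z, 2)                       # tables use two decimal places

p_table = standard_normal_cdf(z_rounded)      # matches the table-based working
p_unrounded = standard_normal_cdf(z)          # slightly different, no rounding of z

print("z =", z_rounded)
print("P(sample mean < 75), z rounded:", round(p_table, 4))
print("P(sample mean < 75), z unrounded:", round(p_unrounded, 4))
```

The two probabilities differ only because of the rounding of z; in the examination the table-based value is what is expected.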
(d) Classify each one of the following variables as measurable (continuous) or

i. The weight of a chocolate bar produced in a factory.
ii. Responses to ‘what is your age group?’ in a questionnaire.
iii. The colour of a car.
iv. Inflation rates.
(8 marks)

i. Measurable because the weight can be measured, for example, in grammes to several
decimal places such as 499.28 g.
ii. Age groups are in a ranked order, for example [18, 30), [30, 40) etc. It is therefore a
categorical ordinal variable.
iii. Each colour (black, white, red, etc.) is a category. Also, there is no natural ordering
between the colours, for example we cannot really say that ‘blue is higher than red’. This
is therefore a categorical nominal variable.
iv. Measurable because inflation rates are quoted to several decimal places, for example
1.50%.
x 0 1 3
pX (x) 0.4 k k
(5 marks)

probability and probability distribution. Reading from Chapter 5 of the subject guide is
suggested with focus on the sections on these topics. Try Activity A5.1 and the exercises on
probability trees.
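i. $\sum_i p(x_i) = 1$, so $0.4 + 2k = 1$ and hence $k = 0.3$.
ii. $E(X) = \sum_i x_i\,p(x_i) = 0 \times 0.4 + 1 \times 0.3 + 3 \times 0.3 = 1.2$.
iii. $E(X^2) = \sum_i x_i^2\,p(x_i) = 0^2 \times 0.4 + 1^2 \times 0.3 + 3^2 \times 0.3 = 3.0$. Hence:
$$\mathrm{Var}(X) = 3.0 - (1.2)^2 = 1.56.$$
Equivalently, $\mathrm{Var}(X) = \sum_i (x_i - \mu)^2\,p(x_i)$,
where µ is found in part ii.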
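The three parts can be verified with a few lines of code. This is only a sketch of the arithmetic; it also confirms that the two ways of computing the variance agree.

```python
values = [0, 1, 3]

# Part i: probabilities must sum to 1, so 0.4 + 2k = 1.
k = (1 - 0.4) / 2
probabilities = [0.4, k, k]

# Part ii: E(X) is the probability-weighted sum of the values.
expected_value = sum(x * p for x, p in zip(values, probabilities))

# Part iii: Var(X) = E(X^2) - (E(X))^2.
expected_square = sum(x ** 2 * p for x, p in zip(values, probabilities))
variance = expected_square - expected_value ** 2

# Alternative route: Var(X) as the weighted sum of squared deviations from the mean.
variance_direct = sum((x - expected_value) ** 2 * p
                      for x, p in zip(values, probabilities))

print("k =", round(k, 2))
print("E(X) =", round(expected_value, 2))
print("Var(X) =", round(variance, 2), "and", round(variance_direct, 2))
```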
(f ) An engine encounters a standard environment with a probability of 0.9, and a

severe environment with a probability of 0.1. In a normal environment the
is 0.5.
ii. Given that failure has occurred, what is the probability that the environment
encountered was severe?
(4 marks)

guide. It is essential to practise on such exercises through the learning activities and
exercises of this chapter as well as the material on the VLE. In particular you can attempt
Learning activity A5.6 and Sample examination question 5. It is also useful to familiarise
yourself with probability trees as they can be quite handy in such exercises.
The first part was straightforward for candidates familiar with this section, requiring the use
of the law of total probability (although it can also be calculated using common intuition).
Part ii. requires knowledge of the conditional probability definition or, alternatively,
knowledge of Bayes’ theorem.
The working of the exercise is given below.

i. We have:
P (F ) = P (F | N ) P (N ) + P (F | S) P (S) = 0.03 × 0.9 + 0.5 × 0.1 = 0.077.
ii. We have:
$$P(S \mid F) = \frac{P(F \mid S)\,P(S)}{P(F)} = \frac{0.05}{0.077} = \frac{50}{77} = 0.6494 \approx 0.65.$$
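The two steps (total probability, then Bayes' theorem) are easy to mirror in code. The sketch below takes the failure probability of 0.03 in the normal environment from the working above; the variable names are illustrative.

```python
# Probabilities of each environment.
p_normal = 0.9
p_severe = 0.1

# Conditional probabilities of failure, as used in the working above.
p_fail_given_normal = 0.03
p_fail_given_severe = 0.5

# Part i: law of total probability.
p_fail = p_fail_given_normal * p_normal + p_fail_given_severe * p_severe

# Part ii: Bayes' theorem, reusing P(F) from part i as the denominator.
p_severe_given_fail = p_fail_given_severe * p_severe / p_fail

print("P(F) =", round(p_fail, 3))
print("P(S | F) =", round(p_severe_given_fail, 4))
```

Note how the severe environment, although encountered only 10% of the time, accounts for about 65% of failures.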
(g) A museum conducts a survey of its visitors in order to assess the popularity of a
device which is used to provide information on the museum exhibits. The
device will be withdrawn if fewer than 25% of all of the museum’s visitors
make use of it. Of a random sample of 100 visitors, 20 chose to use the device.
i. Carry out an appropriate hypothesis test at the 5% significance level to see if
the device should be withdrawn and state your conclusions.
(7 marks)

This question refers to a one-sided hypothesis test examining whether the proportion of all
museum visitors who use the device is less than 25%. While the entire chapter (Chapter 8 of the subject guide)
on hypothesis testing is relevant, one can focus on the relevant section for a single
proportion, Section 8.14. Note also that reading on one-tailed (and two-tailed) hypothesis
tests are located in Section 8.10. The second part of the question looks at p-values, and the
relevant section in the subject guide is Section 8.11.
only one variable involved it will have to be a test for a single proportion, and the test
statistic can be found in the formula sheet. Make sure to substitute the relevant quantities
carefully and avoid any numerical errors in the calculation.
table for the relevant significance level, deciding whether to reject H0 , and interpreting the
results in the context of the problem. The working of the first part of the exercise is given
below.
• H0 : π = 0.25 vs. H1 : π < 0.25.
• The sample proportion is $p = 20/100 = 0.20$. The standard error of the sample
proportion is $\sqrt{0.25 \times 0.75/100} = 0.0433$. The test statistic value is:
$$t = \frac{0.2 - 0.25}{0.0433} = -1.15.$$
The second part of the question requires the use of p-values and challenged most candidates.
The exercise does not require lengthy calculations and can be derived in a relatively
straightforward manner if one is familiar with the material of Section 8.11 of the subject
guide. Once the test statistic is calculated (t = −1.15 from the first part) one simply needs
to calculate, where Z ∼ N (0, 1):
P (Z ≤ −1.15) = 1 − Φ(1.15) = 1 − 0.8749 = 0.1251.
Note: The last three marks of the first part can also be awarded by correct use of the
p-value, see below.
• The p-value is higher than α = 0.05.
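The test and its p-value can be reproduced as sketched below. The helper `standard_normal_cdf` is an illustrative name built on the standard library's error function. Because the code does not round the test statistic to two decimal places, its p-value is close to, but not identical with, the table-based 0.1251.

```python
import math

def standard_normal_cdf(z):
    """Phi(z) for a standard normal variable, via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

n = 100            # sample size
users = 20         # visitors in the sample who used the device
pi_0 = 0.25        # proportion under H0
alpha = 0.05       # significance level

p_hat = users / n

# Standard error of the sample proportion, evaluated under H0.
standard_error = math.sqrt(pi_0 * (1 - pi_0) / n)

z = (p_hat - pi_0) / standard_error

# Lower-tail p-value, since H1 is pi < 0.25.
p_value = standard_normal_cdf(z)
reject_h0 = p_value < alpha

print("Standard error:", round(standard_error, 4))
print("Test statistic:", round(z, 2))
print("p-value:", round(p_value, 4))
print("Reject H0 at the 5% level:", reject_h0)
```

Either route leads to the same decision: the p-value exceeds 0.05, so H0 is not rejected.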

i. The range of a sample is influenced by extreme values.

parameter.
iii. A sample correlation coefficient close to −1 indicates a strong negative linear
v. Failure to reject a null hypothesis might indicate that a Type I error has
been committed.
vi. A stratified random sample is the random equivalent of a convenience sample.
(12 marks)

level in computations. Part i. concerns measures of spread that can be found in Section 4.9
of the subject guide. Part ii. enquires about the sampling distribution which is defined in
Section 6.9. Part iii. is about correlation (see Section 12.8) and types of variables (see
Section 4.6). Part iv. targets the concepts of a p-value covered in Section 8.11, whereas part
v. looks at types of error in hypothesis testing (Section 8.7). Finally, part vi. requires
material from Chapter 10 and in particular Section 10.7 on types of sampling.

or false.
i. True. The range is defined as $x_{(n)} - x_{(1)}$, so any extreme values would be $x_{(1)}$ and/or
$x_{(n)}$, hence influencing the range.
ii. False. A sampling distribution is the probability distribution of a sample statistic.
iii. False. A value of r close to −1 indicates a strong, negative linear relationship between
two measurable (continuous) variables.
iv. False. A p-value of 0.007 represents a highly significant hypothesis test result. Weakly
significant means a p-value between 0.05 and 0.10.
v. False. Failure to reject a null hypothesis might indicate that a Type II error has been
committed.
vi. False. A quota sample is the non-random equivalent of a stratified random sample.
Section B
Question 2
(a) A sample consisting of 400 randomly selected students was classified in terms of
personality type (introvert or extrovert) and in terms of their favourite colour
(red, yellow, green or blue). Their responses are summarised in the table below:
Personality type
Introvert Extrovert Total
Red 32 68 100
Yellow 26 74 100
Green 21 79 100
Blue 46 54 100
Total 125 275 400
would you say there is an association between the student’s type of
personality and colour preference?
(14 marks)

This part targets Chapter 8 of the subject guide on contingency tables and chi-squared
and interpreting contingency tables. Candidates can attempt Activity A8.4 to practise. Part
ii. is a straightforward chi-squared test and the reading is also given in Chapter 8. Look also
at Activity A8.4.
i. There are some differences in rates of introvert students for each colour preference. More
specifically, 21% of the students who prefer the green colour are introvert, whereas the
corresponding proportion for students who prefer red is 32%, and for students preferring
blue is 46%. Hence, there seems to be an association between personality type and
colour preference, although this needs to be investigated further. (Note: the conclusion
of the last sentence must be stated to get full marks.)
ii. Set out the null hypothesis that there is no association between personality type and
colour preference against the alternative that there is an association. Be careful to get
these the correct way round!
H0 : No association between the personality type and colour preference.
vs.
H1 : Association between personality type and colour preference.
Work out the expected values to obtain the table below.
31.25 68.75
31.25 68.75
31.25 68.75
31.25 68.75

The test statistic is:
$$\sum_{i,j} \frac{(O_{i,j} - E_{i,j})^2}{E_{i,j}}$$
which gives a value of 16.33. This is a 4 × 2 contingency table, so the degrees of freedom
are (4 − 1) × (2 − 1) = 3.
For α = 0.05, the critical value is 7.815, hence we reject H0 .

We conclude that there is evidence of an association between personality type and colour
preference.
earlier accurate work.
(b) i. Describe how quota sampling is performed and explain how it differs from
stratified random sampling.
ii. A company producing handheld electronic devices (tablets, mobile phones
etc.) wants to understand how men and women rate its products. For this
reason, the company’s management has decided to use a survey of its
customers and has asked you to devise an appropriate random sampling
scheme. Outline the key components of your sampling scheme.
(11 marks)

This question was on basic material on survey designs. Background reading is given in
Learning activities of Chapter 10.
The marking scheme and some model answers are given below.
i. Description of quota sampling: the interviewer is given specific quota controls on certain
specified characteristics, such as age, gender, social class etc. and then interviews people
until these quotas are reached. See page 159 of the subject guide for a more detailed
description.
Quota sampling is different from stratified random sampling in the following ways.
∗ Stratified random sampling is probability sampling, whereas quota sampling is
non-probability sampling.
∗ In stratified random sampling a sampling frame is required, whereas in quota
sampling pre-chosen frequencies in each category are sought.
ii. As mentioned earlier, it is crucial in this type of question to avoid long answers. Also,
note that there is no unique answer. A possible set of ‘ingredients’ of an answer is given
below (each bullet point corresponds to a mark).
• Propose stratified sampling since customers of all ages are to be surveyed.
• Sampling frame could be the company’s customer database.
• Take a simple random sample from each stratum.
• Stratification factors should include gender.
• Other stratification factors could be age, country of residence, etc.
• Contact method: mail, telephone or email (likely to have all details on database).
• Minimise non-response through a suitable incentive, such as discount off the next
purchase.
Question 3
(a) A policeman recorded the speed of 30 cars on a road with a 30 miles per hour
speed limit. The recorded data are shown below:
25.6 25.7 25.7 25.8 25.8
26.2 26.9 27.5 27.7 27.8
27.9 27.9 28.3 28.4 28.5
28.8 28.9 28.9 29.0 29.1
29.2 29.3 29.5 29.7 29.8
30.1 30.1 30.2 36.2 36.9
paper provided.
ii. Find the median speed among these cars and the upper quartile. What
percentage of drivers were exceeding the 30 miles per hour speed limit?
iii. Comment on the data given the shape of the histogram without doing any
further calculations.
represent the data.
(13 marks)

histograms can be found in Section 4.7.3, but the entire Sections 4.7, 4.8 and 4.9 are highly
relevant.
[Figure: Histogram of Speeds. Horizontal axis: speeds in miles per hour (24 to 38); vertical axis: frequency density.]
ii. • Median: 28.65 miles per hour. Note: Raw data should be used, not grouped data.
Also, make sure to mention the units to get the full marks.
• Upper quartile: 29.45 miles per hour. Note: Same as above.
• Percentage: 5/30 = 16.67%. Note: As the question asks for a percentage, make sure
to report 16.67% (17% is also fine), not just 5/30 or anything else.
iii. Based on the shape of the histogram, we can see that the distribution of the data is
positively skewed. Also two cars, with speeds 36.2 and 36.9 miles per hour, may be
regarded as outliers. Note: It is important to identify the specific outliers (36.2 and 36.9
miles per hour) not just write ‘there are two outliers’.
iv. A boxplot, stem-and-leaf diagram or dot plot are other types of suitable graphical
displays. The reason for that is that the variable speed is measurable and these graphs
(b) A random sample of 9 students received special training to improve their

performance on IQ tests. Each of the 9 students took an IQ test before and
after the training and their scores are shown in the table below:
IQ score before training IQ score after training
105 107
116 120
120 118
93 92
119 119
133 135
75 78
86 90
90 96
i. Carry out an appropriate hypothesis test to determine whether the special
training is effective for increasing the average IQ score. State the test
hypotheses, and specify your test statistic and its distribution under the null
hypothesis. Comment on your findings.
iii. Give a 90% confidence interval for the difference between the means of the
IQ scores before and after training.
(12 marks)

Look up the sections about hypothesis testing for testing a difference between two
population means. However, it is essential for this part to focus on the section regarding
paired samples (Section 8.16.4).
i. Regarding hypotheses, note that the wording ‘increasing’ suggests a one-sided test.
Hence we test:
H0 : µbefore = µafter vs. H1 : µbefore < µafter .
observations for each person (before and after the special training). Hence the difference
for each person should be calculated:
2 4 −2 −1 0 2 3 4 6
The next step is to calculate $s_d = 2.598$ and $\bar{x}_d = 2.0$, in order to obtain the value of the
test statistic:
$$t = \frac{\bar{x}_d - 0}{s_d/\sqrt{n}} = 2.309.$$
one-sided test) is 1.860. Note: This is clearly a t distribution, make sure not to use the
standard normal distribution.
gives a critical value of t8, 0.01 = 2.896. Therefore, we do not reject H0 concluding that
there is moderate evidence that the special training is effective.
ii. • Differences are normally distributed.
• Pairs of observations are independent.
iii. This is a standard exercise for confidence intervals given the appropriate formula from
the formula sheet (make sure to be able to recognise it). The requested confidence
interval is (0.390, 3.610).
Question 4
(a) An insurance company wants to relate the amount of fire damage (y) in major
residential fires to the distance between the residence and the nearest fire
station (x). For this reason, a study was conducted in a large suburb of a major
city based on a sample of 10 recent fires in this suburb. For each of these fires,
the variables x and y were recorded and are shown in the table below:
Fire #1 #2 #3 #4 #5 #6 #7 #8 #9 #10
x 3.4 1.8 4.6 2.3 3.1 5.5 0.7 3.0 2.6 4.3
y 2.6 1.8 5.9 2.3 2.8 8.6 1.4 2.3 2.0 5.7
diagram carefully.
diagram.
iv. Do you find the analyses in ii. and iii. appropriate? Justify your answer and
suggest any alternative ways to model the relationship between x and y.
(13 marks)

i., whereas the remaining parts are on correlation and regression which are covered in
Sections 12.8, 12.9 and 12.10 of the subject guide. Section 12.7 is also relevant. Sample
examination question 2 of this chapter is also recommended for practice on questions of this
type.
[Figure: scatter diagram, 'Amount of fire damage vs. Distance from nearest fire station'. Horizontal axis: distance between residence and the nearest fire station. Vertical axis: amount of fire damage.]
interpretation of this value is the following: the data suggest that the greater the
distance of the residence from the nearest fire station, the higher the amount of fire
damage. The fact that the value is very close to 1, suggests that this is a strong, positive
linear relationship.
of these words provide useful information on interpreting the relationship and are
formula for b is:
\[ b = \frac{\sum x_i y_i - n\bar{x}\bar{y}}{\sum x_i^2 - n\bar{x}^2} \]
Hence the regression line can be written as ŷ = −1.235 + 1.526x or
y = −1.235 + 1.526x + ε. It should also be plotted on the scatter diagram.
expression is false; one of the two above expressions is required.
iv. In this case, one can note in the scatter diagram that the points seem to be ‘scattered’
around a non-linear curve rather than a straight line. Another, equivalent, way to note
this is the presence of two outliers. Hence a linear regression model does not seem to be
a good model for the relationship between the amount of fire damage and the distance
from the nearest fire station. Alternative approaches may involve the Spearman’s rank
correlation coefficient or transformations of the data, for example the log-transformation.
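As a check on the least squares line reported in part iii., the sketch below applies the formula for b (and a = ȳ − b x̄) to the ten (x, y) pairs in the question's table. It also computes the sample correlation coefficient with the standard formula, although the commentary's own value for r is not reproduced in this extract, so no comparison is claimed for it. Variable names are illustrative.

```python
import math

# Data from the table in Question 4(a).
x = [3.4, 1.8, 4.6, 2.3, 3.1, 5.5, 0.7, 3.0, 2.6, 4.3]
y = [2.6, 1.8, 5.9, 2.3, 2.8, 8.6, 1.4, 2.3, 2.0, 5.7]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Corrected sums of squares and products.
s_xy = sum(xi * yi for xi, yi in zip(x, y)) - n * x_bar * y_bar
s_xx = sum(xi ** 2 for xi in x) - n * x_bar ** 2
s_yy = sum(yi ** 2 for yi in y) - n * y_bar ** 2

b = s_xy / s_xx              # slope
a = y_bar - b * x_bar        # intercept
r = s_xy / math.sqrt(s_xx * s_yy)   # sample correlation coefficient

print(f"fitted line: y_hat = {a:.3f} + {b:.3f} x")
print(f"r = {r:.3f}")
```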
(b) The 55 university students on a certain course were randomly assigned to two
class groups of size 30 and 25 students respectively. At the end of the year, all
students took the examination and their marks are summarised in the table
below.
                 Sample size   Sample mean   Sample standard deviation
Class Group 1        30           75.33              7.61
Class Group 2        25           71.40              6.37
i. Use an appropriate hypothesis test to determine whether the students of
class group 1 were better in terms of examination marks. State clearly the
and carry out the test at two appropriate significance levels. Comment on
your findings.
iii. Provide a 95% confidence interval for the difference between the mean exam
marks of the two class groups.
(12 marks)

population means. While the entire chapter on hypothesis testing is relevant (Chapter 8),
one can focus on Section 8.16 and in particular Sections 8.16.2 and 8.16.3 as the variances
are unknown. The last part of the question requires a confidence interval for the difference
between two population means, therefore Sections 7.13.2 and 7.13.3 are most relevant.
i. Let µA denote the mean examination mark for class group 1 and µB the mean
examination mark for class group 2.
The wording ‘were better in terms of examination marks’ implies a one-sided test, hence
the hypotheses can be written as:
H0 : µA = µB vs. H1 : µA > µB .
\[ \frac{\bar{x} - \bar{y}}{\sqrt{s_A^2/n_A + s_B^2/n_B}} \quad \text{or} \quad \frac{\bar{x} - \bar{y}}{\sqrt{s_p^2\,(1/n_1 + 1/n_2)}}. \]
distribution is being used. The critical value at the 5% significance level is 1.676, hence
we reject the null hypothesis. If we take a (smaller) α of 1%, the critical value is 2.390,
so we do not reject H0 . We conclude that there is moderate evidence of a difference
between the mean examination marks of the two class groups.
ii. The assumptions for i. were the following.
satisfied.
were not awarded in such cases. Also some other candidates just copied the phrase
iii. Based on the t50 distribution and using the correct formula from the formula sheet (make
sure to be able to recognise it) the requested 95% confidence interval is (0.082, 7.778).
Note: In the solution above, the t50 distribution was used but the use of the standard
normal distribution is also justified as the sample size is relatively large. Hence a solution
based on the standard normal distribution is also acceptable.
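The pooled-variance version of the test and the confidence interval can be sketched as follows. The sketch assumes that the three numeric columns of the summary table are sample size, sample mean and sample standard deviation (the column headings are missing from this extract). The critical values 1.676 and 2.390 are those quoted in the commentary; the multiplier 2.009 (upper 2.5% point of t50) is a standard table value that is not quoted above and is therefore an assumption here.

```python
import math

# Assumed column meaning: sample size, sample mean, sample standard deviation.
n1, mean1, sd1 = 30, 75.33, 7.61   # class group 1
n2, mean2, sd2 = 25, 71.40, 6.37   # class group 2

# Pooled variance estimate and standard error of the difference in means.
pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2)
std_error = math.sqrt(pooled_var * (1 / n1 + 1 / n2))

t_stat = (mean1 - mean2) / std_error

# One-sided critical values as quoted in the commentary.
crit_5, crit_1 = 1.676, 2.390
reject_at_5 = t_stat > crit_5
reject_at_1 = t_stat > crit_1

# Assumed table value for a two-sided 95% interval based on t50.
t_mult = 2.009
diff = mean1 - mean2
ci_95 = (diff - t_mult * std_error, diff + t_mult * std_error)

print(f"t = {t_stat:.3f}, reject at 5%: {reject_at_5}, reject at 1%: {reject_at_1}")
print(f"95% CI = ({ci_95[0]:.3f}, {ci_95[1]:.3f})")
```

Under these assumptions the test statistic falls between the two critical values, which matches the 'reject at 5%, do not reject at 1%' conclusion above, and the interval agrees with (0.082, 7.778) up to rounding.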

Statistics 1
Monday, 8 May 2017: 10:00 to 12:00

on this paper.
PLEASE TURN OVER

UL17/0338 Page 1 of 21 D0
SECTION A
1. (a) Suppose that y1 = −1, y2 = −4, y3 = 2, y4 = 12, y5 = 7, and z1 = 7, z2 = −11,
z3 = 9, z4 = 3, z5 = 7. Calculate the following quantities:
i. $\sum_{i=1}^{3} z_i^2$    ii. $\sum_{i=4}^{5} \sqrt{y_i z_i}$    iii. $z_5^2 + \sum_{i=1}^{3} \frac{1}{y_i}$.
(6 marks)
(b) Classify each one of the following variables as either measurable (continuous)
or categorical. If a variable is categorical, further classify it as either nominal or
ordinal. Justify your answer (no marks will be awarded without a justification).
i. Time spent in the previous week browsing the internet.
ii. Highest level of education obtained, i.e. no education, school education,
bachelor’s degree, master’s degree, doctorate.
iii. Country of residence.
iv. The rate of change of the human population.
(8 marks)
(c) The weights of a large population of animals have a mean of 7.3 kg and a
standard deviation of 1.9 kg.
i. Assuming that the weights are normally distributed, what is the
probability that a random selection of 40 animals from that population
will have a mean weight between 7.0 kg and 7.4 kg?
ii. A researcher stated that the probability you calculated is approximately
correct even if the distribution of the weights is not normal. Do you agree?
Justify your answer (no marks will be awarded without a justification).
(5 marks)
(d) The random variable X takes only the values 3, 5, 8 and 10 according to the
following probability distribution:
x 3 5 8 10
pX (x) k k k 2k
i. Determine the constant k and hence write down the probability
distribution of X.
(6 marks)
(e) A paired-difference experiment involved n = 121 adults. For each adult a

characteristic was measured under two distinct conditions and the difference
in characteristic values was recorded. The sample mean of the differences
was 1.195, whereas their sample standard deviation was 10.2. The researchers
reported a t statistic value of 1.289 when testing whether the means of the two
conditions are the same.
i. Show how the researchers obtained the t statistic value of 1.289.
ii. Calculate the p-value of the test and use the p-value to test the hypothesis
of equal means. Use a 5% significance level.
(7 marks)
(f) State whether the following are true or false and give a brief explanation (no
marks will be awarded for a simple true/false answer ).
i. The median of a random sample is influenced by extreme values.
ii. If A and B are independent events, then P (A | B) = P (A).
iii. If X ∼ N (5, 2), then P (X ≤ 5) < 0.5.
v. In stratified random sampling, elements within a stratum are
heterogeneous.
vi. A scatter diagram is used to display two categorical variables.
(12 marks)
(g) In a random sample of size n = 6 the mean of the data is 12 and the median
is 9. Another observation is then obtained and this takes the value of 5, i.e.
x7 = 5.
i. Calculate the mean of the seven observations.
ii. What can you conclude about the median of the seven observations?
(6 marks)
SECTION B
2. (a) The data below represent the weights (in kg) of 30 athletes.
57 59 61 63 64
65 73 74 74 74
75 77 77 81 82
82 82 83 83 85
87 89 91 93 96
96 98 99 99 101
ii. Find the mean and the modal group. You are given that the sum of
the data is 2420.
iv. Comment on the data, given the shape of the histogram and the
measures which you have calculated.
(13 marks)
students to determine whether they are in favour of a new examination
timetable. The table below summarises the student responses.
Subject area   Sample size   In favour of new examination timetable
Humanities         325                      221
Science            200                      120
i. Do the student responses indicate a difference between students in

humanities and science degrees in whether they are in favour of the new
examination timetable? Conduct a suitable hypothesis test at two
appropriate significance levels and comment on your results. State
any assumptions that you make.
ii. Compute a 97% confidence interval for the difference of proportions
with positive responses between humanities and science degrees in the
population.
(12 marks)
3. (a) It is assumed that there is an association between the gestational age at

birth, i.e. the number of weeks the mother was pregnant when she gave
birth (x) and the birth weight of the baby (y, in kg). An experiment was
conducted on 9 randomly-selected babies and the data are summarised in
the table below.
Baby 1 2 3 4 5 6 7 8 9
x 36.0 39.7 38.0 41.4 38.7 35.7 40.3 37.3 42.4
y 2.0 3.7 2.7 3.7 2.9 2.6 3.5 2.7 3.8
Sum of the x values: 349.5 Sum of the squares of the x values: 13615.37
Sum of the y values: 27.6 Sum of the squares of the y values: 87.82
Sum of the products of the x and y values: 1082.6
i. Draw a scatter diagram of these data on the graph paper provided.

Carefully label the diagram.
ii. Calculate the sample correlation coefficient. Interpret its value.
iii. Calculate and report the least squares line of y on x. Draw the line on
the scatter diagram.
iv. Based on the regression model above, what baby birth weight would
you expect from a mother who gave birth when she was 38 weeks
pregnant? Would you trust this value? Justify your answer.
(13 marks)
(b) An experiment was conducted to determine whether two different brands of

batteries have similar lifetimes. A random sample of batteries was obtained
from each brand and their lifetimes were measured. The measurements are
summarised in the table below.

Battery brand 1 37 22.15 2.00
i. Use an appropriate hypothesis test to determine whether there is a
difference between the mean battery lifetimes of the two brands. State
ii. State clearly any assumptions you made in (b) part i.
iii. Repeat the procedure in (b) part i. to determine whether the mean
battery lifetime of brand 1 is shorter than that of brand 2.
(12 marks)
4. (a) A sample consisting of 100 randomly-selected adults in the USA was

classified in terms of their political affiliation (Democrat or Republican)
and opinion on a tax reform bill (in favour, indifferent or opposed). The
data are summarised in the table below.
In Favour Indifferent Opposed
Democrat 12 29 16
Republican 18 11 14
i. Based on the data in the table, and without conducting any
significance test, would you say there is an association between the
political affiliation and opinion on the tax reform bill?
ii. Calculate the χ2 statistic and use it to test for independence of political
affiliation and opinion on the tax reform bill. What do you conclude?
(13 marks)
(b) You have been asked to design a cluster random sample survey from the
employees of a certain large company to examine whether job satisfaction
of employees varies between different job types.
i. Discuss how you will choose your sampling frame. Also discuss the
limitations of your choice.
ii. Propose two relevant clusters. Justify your answers.
iii. Provide two actions to reduce response bias and explain why you think
they would be successful.
iv. Briefly discuss the statistical methodology you would use to analyse the
collected data.
(12 marks)
END OF PAPER
ST104a Statistics 1

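Expected value of a discrete random variable:
\[ \mu = E[X] = \sum_{i=1}^{N} p_i x_i \]
Standard deviation of a discrete random variable:
\[ \sigma = \sqrt{\sigma^2} = \sqrt{\sum_{i=1}^{N} p_i (x_i - \mu)^2} \]
The transformation formula:
\[ Z = \frac{X - \mu}{\sigma} \]
Finding Z for the sampling distribution of the sample mean:
\[ Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \]
Finding Z for the sampling distribution of the sample proportion:
\[ Z = \frac{P - \pi}{\sqrt{\pi(1-\pi)/n}} \]
Confidence interval endpoints for a single mean (σ known):
\[ \bar{x} \pm z \frac{\sigma}{\sqrt{n}} \]
Confidence interval endpoints for a single mean (σ unknown):
\[ \bar{x} \pm t_{n-1} \frac{s}{\sqrt{n}} \]
Confidence interval endpoints for a single proportion:
\[ p \pm z \sqrt{\frac{p(1-p)}{n}} \]
Sample size determination for a mean:
\[ n \ge \frac{Z^2 \sigma^2}{e^2} \]
Sample size determination for a proportion:
\[ n \ge \frac{Z^2 p(1-p)}{e^2} \]
Z test of hypothesis for a single mean (σ known):
\[ Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \]
t test of hypothesis for a single mean (σ unknown):
\[ t = \frac{\bar{X} - \mu}{S/\sqrt{n}} \]
Z test of hypothesis for a single proportion:
\[ Z \cong \frac{p - \pi}{\sqrt{\pi(1-\pi)/n}} \]
Z test for the difference between two means (variances known):
\[ Z = \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{\sqrt{\sigma_1^2/n_1 + \sigma_2^2/n_2}} \]
t test for the difference between two means (variances unknown):
\[ t = \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{\sqrt{S_p^2 \left(\frac{1}{n_1} + \frac{1}{n_2}\right)}} \]
Confidence interval endpoints for the difference between two means:
\[ (\bar{x}_1 - \bar{x}_2) \pm t_{n_1+n_2-2} \sqrt{s_p^2 \left(\frac{1}{n_1} + \frac{1}{n_2}\right)} \]
Pooled variance estimator:
\[ S_p^2 = \frac{(n_1 - 1)S_1^2 + (n_2 - 1)S_2^2}{n_1 + n_2 - 2} \]
t test for the difference in means in paired samples:
\[ t = \frac{\bar{X}_d - \mu_d}{S_d/\sqrt{n}} \]
Confidence interval endpoints for the difference in means in paired samples:
\[ \bar{x}_d \pm t_{n-1} \frac{s_d}{\sqrt{n}} \]
Z test for the difference between two proportions:
\[ Z = \frac{(P_1 - P_2) - (\pi_1 - \pi_2)}{\sqrt{P(1-P)\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}} \]
Pooled proportion estimator:
\[ P = \frac{R_1 + R_2}{n_1 + n_2} \]
Confidence interval endpoints for the difference between two proportions:
\[ (p_1 - p_2) \pm z \sqrt{\frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}} \]
χ² test of association:
\[ \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} \]
Sample correlation coefficient:
\[ r = \frac{\sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}}{\sqrt{\left(\sum_{i=1}^{n} x_i^2 - n\bar{x}^2\right)\left(\sum_{i=1}^{n} y_i^2 - n\bar{y}^2\right)}} \]
Spearman rank correlation:
\[ r_s = 1 - \frac{6\sum_{i=1}^{n} d_i^2}{n(n^2 - 1)} \]
Simple linear regression line estimates:
\[ b = \frac{\sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}}{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}, \qquad a = \bar{y} - b\bar{x} \]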

Statistics 1
Monday, 8 May 2017: 10:00 to 12:00

on this paper.
PLEASE TURN OVER

UL17/0339 Page 1 of D0
SECTION A
1. (a) Suppose that y1 = −2, y2 = −5, y3 = 1, y4 = 16, y5 = 10, and z1 = 8, z2 = −5,

i=3 i=5 i=3

X X √ X 1
i=1 i=4 i=1
y i
(6 marks)
(b) Classify each one of the following variables as either measurable (continuous)
or categorical. If a variable is categorical, further classify it as either nominal or
i. Gross domestic product (GDP) of a country.
ii. Community type, i.e. rural, small town, large town, small city, large city.
iii. Discipline studied as the degree major.
iv. Volume of water in a bottle.
(8 marks)
i. Assuming that the weights are normally distributed, what is the
probability that a random selection of 50 animals from that population
will have a mean weight between 8.6 kg and 9.1 kg?
(5 marks)
x 2 6 7 9
pX (x) k k k 2k
i. Determine the constant k and hence write down the probability
distribution of X.
(6 marks)

characteristic was measured under two distinct conditions and the difference
in characteristic values was recorded. The sample mean of the differences was
2.326, whereas their standard deviation was 7.6. The researchers reported a t
statistic value of 2.390 when testing whether the means of the two conditions
are the same.
ii. Calculate the p-value of the test and use the p-value to test the hypothesis
of equal means. Use a 5% significance level.
(7 marks)
(f) State whether the following are true or false and give a brief explanation (no
i. The median of a random sample is not influenced by extreme values.
ii. If A and B are independent events, then P (A | B) < P (A).
iii. If X ∼ N (7, 4), then P (X ≥ 7) > 0.5.
iv. A p-value of 0.03 represents an insignificant hypothesis test result.
v. In cluster random sampling, elements within a cluster are
homogeneous.
vi. A contingency table is used to display two measurable variables.
(12 marks)
(g) In a random sample of size n = 6 the mean of the data is 15 and the median
is 11. Another observation is then obtained and this takes the value of 8, i.e.
x7 = 8.
(6 marks)
SECTION B
2. (a) The data below contain measurements of the low-density lipoproteins, also
known as the ‘bad’ cholesterol, in the blood of 30 patients. Data are
measured in milligrams per decilitre (mg/dL).
95 96 96 98 99
99 101 101 102 102
103 104 104 107 107
111 112 113 113 114
115 117 121 123 124
127 129 131 135 143
ii. Find the mean and the modal group. You are given that the sum of
the data is 3342.
iv. Comment on the data, given the shape of the histogram and the
(13 marks)
(b) A mobile telephone company gathered a random sample of 500 people to

determine whether they like the design of its latest mobile telephone. The
table below summarises the people’s responses.
Gender    Sample size   Positive view on the latest mobile telephone design
Males         225                          110
Females       275                          165
i. Do the people’s responses indicate a difference between males and

females in whether they like the design of the latest mobile telephone?
Conduct a suitable hypothesis test at two appropriate significance
levels and comment on your results. State any assumptions that you
make.
ii. Compute a 98% confidence interval for the difference of proportions
with a positive view between males and females in the population.
(12 marks)
3. (a) The table below contains information from 9 students taking a course in
Statistics. Students were asked how many hours they spent revising the
material before the examination (x values, in hours) and what their
examination mark was (y values, in %).
Student 1 2 3 4 5 6 7 8 9
x 1.8 2.6 2.8 3.4 3.6 4.2 4.8 5.2 5.4
y 54 64 60 62 68 70 76 73 76

Sum of the y values: 603 Sum of the squares of the y values: 40861
Sum of the products of the x and y values: 2336

iii. Calculate and report the least squares line of y on x. Draw the line on
the scatter diagram.
iv. Based on the regression model above, what examination mark would
you expect from a student who studied 8 hours? Would you trust this
value? Justify your answer.
(13 marks)


difference between the mean battery lifetimes of the two brands. State
iii. Repeat the procedure in (b) part i. to determine whether the mean
battery lifetime of brand 1 is longer than that of brand 2.
(12 marks)
4. (a) A sample consisting of 100 randomly-selected students in a UK university

was classified in terms of a student’s origin (either UK/EU or overseas) and
in terms of their satisfaction with university life (satisfied, indifferent or
dissatisfied). The data are summarised in the table below.
Satisfied Indifferent Dissatisfied
UK/EU 10 26 15
Overseas 20 14 15
test, would you say there is an association between the student’s origin
and satisfaction with university life?
ii. Calculate the χ2 statistic and use it to test for independence of
the student’s origin and satisfaction with university life. What do
you conclude?
(13 marks)
(b) You have been asked to design a stratified random sample survey from the
employees of a certain large company to examine whether job satisfaction
of employees varies between different job types.
ii. Propose two relevant stratification factors. Justify your answers.
iii. Provide two actions to reduce response bias and explain why you think
they would be successful.
collected data.
(12 marks)
END OF PAPER

ST104a Statistics 1
Important note

references
section.
General remarks
Learning outcomes
At the end of the half course and having completed the Essential reading and activities you should:
required
methods
example, the first part of Question 2 asked for data presentation and descriptive statistics, while
hypothesis testing and confidence intervals (for proportions) appeared in the second part. Question
3 began with correlation and linear regression, followed by further hypothesis testing (of means).
The first part of Question 4 required a chi-squared test of association, while the second part covered
survey design questions. This means that it is really important that you make sure you have a
reasonable idea of what topics are covered before you start work on the paper! We suggest you
divide your time as follows during the examination:
and subquestion.
You are not expected to write long essays where explanations or descriptions of sampling design
• If you are specifically asked to perform a hypothesis test, or calculate a confidence interval,
do so. It is not acceptable to do one rather than the other! If you are asked to use a 5%
significance level, this is what will be marked.
• the relevant detailed reference to Newbold, P., W.L. Carlson and B.M. Thorne Statistics for
prepare, and similar questions from Newbold et al. (2012).
expected. This may be due to a number of reasons, but one particular failing is ‘question
We recognise that candidates might not cover all topics in the syllabus in the same depth, but you
The syllabus can be found in the Course information sheet available on the VLE. You should read
the syllabus carefully and ensure that you cover sufficient material in preparation for the
examination. Examiners will vary the topics and questions from year to year and may well set
questions that have not appeared in past papers. Examination papers may legitimately include
questions on any topic in the syllabus. So, although past papers can be helpful during your revision,
you cannot assume that topics or specific questions that have come up in past examinations will
occur again.

ST104a Statistics 1
Important note

references
section.
Section A
Question 1
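(a) Suppose that y1 = −1, y2 = −4, y3 = 2, y4 = 12, y5 = 7, and z1 = 7, z2 = −11,
z3 = 9, z4 = 3, z5 = 7. Calculate the following quantities:
i. $\sum_{i=1}^{3} z_i^2$    ii. $\sum_{i=4}^{5} \sqrt{y_i z_i}$    iii. $z_5^2 + \sum_{i=1}^{3} \frac{1}{y_i}$.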
(6 marks)

guide and in particular Learning activity 6 in Section 2.14.
This question was generally well done.
The answers are as follows.

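i. We have:
\[ \sum_{i=1}^{3} z_i^2 = 7^2 + (-11)^2 + 9^2 = 49 + 121 + 81 = 251. \]
ii. We have:
\[ \sum_{i=4}^{5} \sqrt{y_i z_i} = \sqrt{12 \times 3} + \sqrt{7 \times 7} = 6 + 7 = 13. \]
iii. We have:
\[ z_5^2 + \sum_{i=1}^{3} \frac{1}{y_i} = 7^2 + \left(-1 - \frac{1}{4} + \frac{1}{2}\right) = 48.25. \]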
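The three answers can be verified with a few lines of Python, using the values of y and z given in the question. The only point needing care is that Python lists are indexed from 0, so the term with subscript i corresponds to list position i − 1; the variable names are illustrative.

```python
import math

# Values from the question; list position i - 1 holds the term with subscript i.
y = [-1, -4, 2, 12, 7]
z = [7, -11, 9, 3, 7]

# i. Sum of z_i squared for i = 1, 2, 3.
sum_i = sum(z[i] ** 2 for i in range(0, 3))

# ii. Sum of sqrt(y_i * z_i) for i = 4, 5.
sum_ii = sum(math.sqrt(y[i] * z[i]) for i in range(3, 5))

# iii. z_5 squared plus the sum of 1 / y_i for i = 1, 2, 3.
sum_iii = z[4] ** 2 + sum(1 / y[i] for i in range(0, 3))

print(sum_i, sum_ii, sum_iii)
```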
(b) Classify each one of the following variables as either measurable (continuous) or
i. Time spent in the previous week browsing the internet.
ii. Highest level of education obtained, i.e. no education, school education,
bachelor’s degree, master’s degree, doctorate.
iii. Country of residence.
iv. The rate of change of the human population.
(8 marks)

between nominal and ordinal categorical variables should be made by candidates.
A general tip for identifying measurable and categorical variables is to think of the possible
i. Measurable because the amount can be measured, for example, in hours or minutes to
several decimal places such as 79.28 minutes.
ii. Each education level corresponds to a category. The highest level of education is in a
ranked order, for example in terms of the list items provided. Therefore, it is a
categorical ordinal variable.
iii. Each country (UK, China, Germany etc.) is a category. Also, there is no natural
ordering between the countries, for example we cannot really say that ‘Germany is higher
than China’. Therefore, this is a categorical nominal variable.
iv. Measurable because population growth rates are quoted to several decimal places, for
example 1.50%.
Weak candidates did not provide justifications for their choices, reported nominal or ordinal
to measurable variables and sometimes answered ordinal when their justification was
pointing to a nominal variable. There were also phrases like ‘It is measurable because it can
be measured’ that were not awarded any marks.
i. Assuming that the weights are normally distributed, what is the probability
that a random selection of 40 animals from that population will have a mean
weight between 7.0 kg and 7.4 kg?
(5 marks)

Chapter 6 of the subject guide and work out the examples and activities in this section. The
sample examination questions are relevant. For the first part of the question it is essential to
check Section 6.9 of the subject guide.
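i. This first part just requires knowledge of the fact that if X is a normal random variable with
mean µ and variance σ², then the sample mean X̄ of a random sample of size n is also a normal
random variable with mean µ and variance σ²/n. Direct application of this fact then
yields that:
\[ \bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right) = N\left(7.3, \frac{(1.9)^2}{40}\right). \]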
Now note the basic property of a normal random variable that if X ∼ N (µ, σ 2 ) then
Z = (X − µ)/σ ∼ N (0, 1). Note also that:
∗ P (Z < a) = P (Z ≤ a) = Φ(a)
∗ P (Z > a) = P (Z ≥ a) = 1 − P (Z < a) = 1 − P (Z ≤ a) = 1 − Φ(a)
∗ P (a < Z < b) = P (a ≤ Z < b) = P (a < Z ≤ b) = P (a ≤ Z ≤ b) = Φ(b) − Φ(a).
The above is all you need to find the requested probability:
\[ \begin{aligned} P(7.0 \le \bar{X} \le 7.4) &= P\left(\frac{7.0 - 7.3}{1.9/\sqrt{40}} \le Z \le \frac{7.4 - 7.3}{1.9/\sqrt{40}}\right) \\ &= P(-1.00 \le Z \le 0.33) \\ &= \Phi(0.33) - (1 - \Phi(1.00)) \\ &= 0.6293 - (1 - 0.8413) \\ &= 0.4706 \end{aligned} \]
using Table 4 of the New Cambridge Statistical Tables.

ii. The researcher’s statement is justified since we can apply the central limit theorem given
the large sample size n.
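The probability in part i. can also be checked numerically. The sketch below uses `statistics.NormalDist` from the Python standard library in two ways: once with the z-values rounded to two decimal places, as when reading statistical tables, and once without rounding. The unrounded figure differs slightly from 0.4706 only because of that rounding.

```python
import math
from statistics import NormalDist

mu, sigma, n = 7.3, 1.9, 40

# Sampling distribution of the sample mean: N(mu, sigma^2 / n).
std_error = sigma / math.sqrt(n)
sampling_dist = NormalDist(mu, std_error)

# Unrounded probability that the sample mean lies between 7.0 and 7.4.
prob_exact = sampling_dist.cdf(7.4) - sampling_dist.cdf(7.0)

# Table-style calculation with z-values rounded to two decimal places.
z_low = round((7.0 - mu) / std_error, 2)
z_high = round((7.4 - mu) / std_error, 2)
standard_normal = NormalDist()
prob_table = standard_normal.cdf(z_high) - standard_normal.cdf(z_low)

print(f"z-values: {z_low:.2f}, {z_high:.2f}")
print(f"table-style: {prob_table:.4f}, unrounded: {prob_exact:.4f}")
```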
x 3 5 8 10
pX (x) k k k 2k
i. Determine the constant k and hence write down the probability distribution
of X.
(6 marks)

probability and probability distributions. Reading from Chapter 5 of the subject guide is
suggested with particular focus on the sections on these topics.
i. Since $\sum_i p(x_i) = 5k = 1$, then k = 0.2. Hence the probability distribution is:
x        3    5    8    10
pX (x)   0.2  0.2  0.2  0.4
ii. We have:
\[ E(X) = \sum_i x_i\, p(x_i) = 3 \times 0.2 + 5 \times 0.2 + 8 \times 0.2 + 10 \times 0.4 = 7.2. \]
iii. We have:
\[ E(X^2) = \sum_i x_i^2\, p(x_i) = 3^2 \times 0.2 + 5^2 \times 0.2 + 8^2 \times 0.2 + (10)^2 \times 0.4 = 59.6 \]
hence Var(X) = 59.6 − (7.2)² = 7.76.

An alternative method to find the variance is through the formula $\sum_i (x_i - \mu)^2\, p(x_i)$,
where µ was found in part ii.
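Both routes to the variance can be confirmed with exact arithmetic. The sketch below uses `fractions.Fraction` so that no floating-point rounding enters; the list `multiples_of_k` simply records the probabilities k, k, k, 2k from the question, and the names are illustrative.

```python
from fractions import Fraction

values = [3, 5, 8, 10]
multiples_of_k = [1, 1, 1, 2]   # probabilities are k, k, k, 2k

# The probabilities must sum to 1, which fixes k.
k = Fraction(1, sum(multiples_of_k))
probs = [m * k for m in multiples_of_k]

mean = sum(x * p for x, p in zip(values, probs))
second_moment = sum(x ** 2 * p for x, p in zip(values, probs))

# Variance via E(X^2) - (E(X))^2 and via the definition; both should agree.
var_shortcut = second_moment - mean ** 2
var_definition = sum((x - mean) ** 2 * p for x, p in zip(values, probs))

print(float(k), float(mean), float(second_moment), float(var_shortcut))
```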

characteristic was measured under two distinct conditions and the difference in
characteristic values was recorded. The sample mean of the differences was
1.195, whereas their sample standard deviation was 10.2. The researchers
ii. Calculate the p-value of the test and use the p-value to test the hypothesis of
equal means. Use a 5% significance level.
(7 marks)

This question refers to a paired-samples t test examining whether there is a difference in the
means of the characteristic for each adult under the two distinct conditions. While the
entire chapter on hypothesis testing is relevant (Chapter 8), one can focus on the relevant
section for the paired-samples t test, Section 8.16.4, and Exercise 8.8. Note also the reading
on one- and two-tailed tests, located in Section 8.10. The second part of the question looks
at p-values and the relevant section in the subject guide is Section 8.11.
i. The test statistic value is:
\[ \frac{\bar{x}_d}{s_d/\sqrt{n}} = \frac{1.195}{10.2/\sqrt{121}} = 1.289. \]
ii. The second part of the question requires the use of p-values and challenged most
candidates. The exercise does not require lengthy calculations and can be derived in a
relatively straightforward manner if one is familiar with the material of Section 8.11 of
the subject guide. It is also essential to decide on the distribution of the test statistic. In
this case one can use the tn−1 = t120 distribution, since the standard deviation is
unknown and estimated by sd . Nevertheless the assumption of a standard normal
distribution is also justified by the central limit theorem given the large sample size n.
Finally, note that this is a two-tailed test.
Combining the above, we get that the p-value is, where T ∼ t120 , using Table 10 of the
New Cambridge Statistical Tables:
2 × P (T ≥ 1.289) = 2 × 0.10 = 0.20.
Since the p-value is 0.20 > 0.05, the test is not significant at the 5% significance level.
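A rough numerical check of both parts is sketched below. Because the Python standard library has no t distribution, the p-value is computed with the standard normal approximation, which the commentary notes is also justified for n = 121; it should therefore be read as an approximation to the t120 value of about 0.20 obtained from tables.

```python
import math
from statistics import NormalDist

n = 121
mean_d = 1.195   # sample mean of the differences
sd_d = 10.2      # sample standard deviation of the differences

# Part i: test statistic for H0: mean difference = 0.
t_stat = mean_d / (sd_d / math.sqrt(n))

# Part ii: two-tailed p-value using the standard normal approximation.
p_value = 2 * (1 - NormalDist().cdf(abs(t_stat)))

alpha = 0.05
reject_h0 = p_value < alpha

print(f"t = {t_stat:.3f}, approximate p-value = {p_value:.3f}, reject H0: {reject_h0}")
```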
(f ) State whether the following are true or false and give a brief explanation (no
i. The median of a random sample is influenced by extreme values.
ii. If A and B are independent events, then P (A | B) = P (A).
iii. If X ∼ N (5, 2), then P (X ≤ 5) < 0.5.
v. In stratified random sampling, elements within a stratum are heterogeneous.
vi. A scatter diagram is used to display two categorical variables.
(12 marks)

level in computations. Part i. concerns measures of location that can be found in Section 4.8
of the subject guide. Part ii. requires knowledge of basic probability properties that can be
found in Section 5.10. Part iii. is about probability properties of the normal distribution, see
Section 6.8. Part iv. targets the concepts of a p-value covered in Section 8.11, whereas part
v. requires material from Chapter 10 and in particular Section 10.7 on types of sample.
Finally, part vi. looks at scatter diagrams (Section 12.6) and/or contingency tables (Section
9.7.1).
reason for each true/false response and not just a choice between the two. Some candidates
also lost marks for long rambling explanations without a decision as to whether a statement
was true or false.
i. False. The median is the midpoint of an ordered set of data, so extreme values will not
span the midpoint when the data are ordered. An alternative justification may be that
the mean of a sample is influenced by extreme values.
ii. True. If A and B are independent events, then:
\[ P(A \mid B) = \frac{P(A \cap B)}{P(B)} = \frac{P(A)\,P(B)}{P(B)} = P(A). \]
iii. False. A normal distribution is symmetric about its mean, hence P (X ≤ 5) = 0.5.
iv. False. A p-value of 0.3 represents an insignificant hypothesis test result. An alternative
justification could be that a weakly significant hypothesis test result means a p-value
between 0.05 and 0.10.
v. False. Stratified random sampling works better if the elements within a stratum are
homogeneous.
vi. False. A scatter diagram is used to display two measurable variables. An alternative
justification is that a contingency table is used to display two categorical variables.
(g) In a random sample of size n = 6 the mean of the data is 12 and the median is 9.
Another observation is then obtained and this takes the value of 5, i.e. x7 = 5.
(6 marks)

This question contains material from Section 4.8 of the subject guide, and in particular
Sections 4.8.1 and 4.8.2.

The reading required for this question is minimal, however a good level of understanding of
the formulae for the sample mean and median is required to get the full marks. The first
part was relatively straightforward, but the second part was perhaps the most challenging
exercise of this examination paper and very few candidates managed to get the full marks.
i. The working is as follows.

• We have $\sum_{i=1}^{6} x_i = n\bar{x} = 6 \times 12 = 72$.
• Hence the new sample mean is:
\[ \bar{x} = \frac{\sum_{i=1}^{7} x_i}{n} = \frac{72 + 5}{7} = 11. \]
ii. The working is as follows.

• Based on the first 6 observations, 9 = (x(3) + x(4) )/2.
• We have x(3) ≤ 9 and x(4) ≥ 9.
• Let m denote the median of the seven observations. Since x7 = 5, then:
— if x7 < x(3) , then m = x(3) and 5 < m ≤ 9
— if x7 = x(3) , then m = x7 = x(3) = 5
— if x7 > x(3) , then m = x7 = 5.
• Hence the median of the seven observations, m, is such that 5 ≤ m ≤ 9.
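The argument above can be illustrated with concrete data. The three six-value samples in the sketch are hypothetical: they are not from the question, and were chosen only because each has mean 12 and median 9. Adding the observation 5 gives a new mean of 11 every time, while the new median varies from sample to sample but stays between 5 and 9.

```python
import statistics

def add_observation(sample, new_value):
    """Return the mean and median after appending one observation."""
    extended = sample + [new_value]
    return statistics.mean(extended), statistics.median(extended)

# Hypothetical samples, each with n = 6, mean 12 and median 9.
samples = [
    [1, 2, 8, 10, 20, 31],
    [1, 2, 3, 15, 20, 31],
    [9, 9, 9, 9, 9, 27],
]

results = [add_observation(sample, 5) for sample in samples]

for sample, (new_mean, new_median) in zip(samples, results):
    print(sample, "-> mean", new_mean, "median", new_median)
```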
Section B
Question 2
(a) The data below represent the weights (in kg) of 30 athletes.
57 59 61 63 64
65 73 74 74 74
75 77 77 81 82
82 82 83 83 85
87 89 91 93 96
96 98 99 99 101
paper provided.
ii. Find the mean and the modal group. You are given that the sum of the data
is 2420.
iv. Comment on the data, given the shape of the histogram and the measures
which you have calculated.
(13 marks)

Chapter 4 of the subject guide provides all the relevant material for this question. More
specifically, reading on histograms can be found in Section 4.7.3, but the whole of Sections
4.7–4.9 are highly relevant.
[Figure: 'Histogram of athlete weights'. Horizontal axis: athlete weights (in kg), from 50 to 110. Vertical axis: frequency density.]
ii. • Mean = 2420/30 = 80.67 kg. Note: The raw data should be used, not grouped data.
Also make sure to mention the units to get the full marks.
• Modal group: [80, 90) kg. Note: Same as above.
iii. • Median = 82 kg.
• Correct position of Q1 (between 7th and 8th inclusive).
• Q1 ≈ 73.5 kg.
iv. The distribution of the data appears to be negatively/left-skewed. This is also supported
by the fact that the mean is less than the median.
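The mean, median and modal group can be reproduced from the raw data as sketched below. The class intervals of width 10 kg starting at 50 are an assumption inferred from the histogram axis and the reported modal group [80, 90). The lower quartile is left out because its value depends on the interpolation convention used.

```python
import statistics
from collections import Counter

weights = [
    57, 59, 61, 63, 64, 65, 73, 74, 74, 74,
    75, 77, 77, 81, 82, 82, 82, 83, 83, 85,
    87, 89, 91, 93, 96, 96, 98, 99, 99, 101,
]

mean_weight = sum(weights) / len(weights)
median_weight = statistics.median(weights)

# Assumed class intervals [50, 60), [60, 70), ..., [100, 110):
# each weight is mapped to the lower limit of its interval.
group_counts = Counter((w // 10) * 10 for w in weights)
modal_lower = max(group_counts, key=group_counts.get)

print(f"mean = {mean_weight:.2f} kg, median = {median_weight} kg")
print(f"modal group = [{modal_lower}, {modal_lower + 10}) kg")
print("mean < median:", mean_weight < median_weight)
```

The mean lying below the median is consistent with the negative skew noted in part iv.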
students to determine whether they are in favour of a new examination
timetable. The table below summarises the student responses.
Subject area   Sample size   In favour of new examination timetable
Humanities         325                      221
Science            200                      120
i. Do the student responses indicate a difference between students in
humanities and science degrees in whether they are in favour of the new
examination timetable? Conduct a suitable hypothesis test at two

appropriate significance levels and comment on your results. State any
assumptions that you make.
ii. Compute a 97% confidence interval for the difference of proportions with
positive responses between humanities and science degrees in the population.
(12 marks)

Read Section 8.15 of the subject guide about hypothesis testing for the difference between
two population proportions. For the second part, see Section 7.12 on confidence intervals for
the difference between two population proportions.
i. Regarding hypotheses, note that the wording ‘indicate a difference’ suggests a two-tailed
test:
H0 : π1 = π2 vs. H1 : π1 ≠ π2
where π1 refers to the population proportion of students in the humanities being in
favour of the new examination timetable, and π2 is the corresponding population
proportion for students in science.
In order to conduct this test we need the pooled sample proportion which is:
p = (221 + 120)/(325 + 200) ≈ 0.65
from which we can get:
s.e.(p1 − p2) = √(0.65 × 0.35 × (1/325 + 1/200)) = 0.043.
The value of the test statistic is then:
(p1 − p2)/s.e.(p1 − p2) = (0.68 − 0.60)/0.043 = 1.866.
Using the standard normal distribution, which is justified by the large sample sizes
according to the central limit theorem, the critical values at the 5% significance level are
±1.96. Since 1.866 < 1.96 we do not reject H0 at the 5% significance level.
Therefore, we choose a second (larger) significance level, say 10%, which gives critical
values of ±1.645, in which case we reject H0 since 1.645 < 1.866.
Hence we conclude that there is weak evidence of a difference between the population
proportions of students in humanities and science in favour of the new examination
timetable.
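The calculation above can be checked with a short script. This is only a sketch using the Python standard library; the function name two_proportion_z is a hypothetical helper, not part of the subject guide, and the p-value it returns goes beyond what the commentary reports.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z(x1, n1, x2, n2):
    """Return (z, two-sided p-value) for H0: pi1 = pi2 using the pooled proportion."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    # Standard error under H0, based on the pooled sample proportion
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Humanities: 221 of 325 in favour; Science: 120 of 200 in favour
z, p_value = two_proportion_z(221, 325, 120, 200)
print(f"z = {z:.3f}, two-sided p-value = {p_value:.3f}")
```

The p-value lies between 0.05 and 0.10, which agrees with not rejecting H0 at the 5% level but rejecting it at the 10% level.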
ii. This is a standard exercise for confidence intervals given the appropriate formula from
interval is (−0.013, 0.173), using a z-value of 2.17 from Table 4 of the New Cambridge
Statistical Tables.
Note that in order to get the confidence interval above the following formula for
s.e.(p1 − p2 ) is required:
s.e.(p1 − p2) = √(0.68 × 0.32/325 + 0.6 × 0.4/200) = 0.043.
In this case it makes no difference in the confidence interval calculation, but it could give
different answers in other questions of this type.
Question 3
(a) It is assumed that there is an association between the gestational age at birth,
i.e. the number of weeks the mother was pregnant when she gave birth (x) and
the birth weight of the baby (y, in kg). An experiment was conducted on 9
randomly-selected babies and the data are summarised in the table below.
Baby 1 2 3 4 5 6 7 8 9
x 36.0 39.7 38.0 41.4 38.7 35.7 40.3 37.3 42.4
y 2.0 3.7 2.7 3.7 2.9 2.6 3.5 2.7 3.8
Sum of the y values: 27.6 Sum of the squares of the y values: 87.82
Sum of the products of the x and y values: 1082.6
i. Draw a scatter diagram of these data on the graph paper provided. Carefully
label the diagram.
iii. Calculate and report the least squares line of y on x. Draw the line on the
scatter diagram.
iv. Based on the regression model above, what baby birth weight would you
expect from a mother who gave birth when she was 38 weeks pregnant?
Would you trust this value? Justify your answer.
(13 marks)

i., whereas the remaining parts are on correlation and linear regression that are covered in
Sections 12.8–12.10 of the subject guide. Section 12.7 is also relevant. Sample examination
question 2 of this chapter is also recommended for practice on questions of this type.
i. We have:
[Figure: Scatter diagram of Baby birth weight vs. Gestational age. x-axis: Gestational age (in weeks); y-axis: Baby birth weight (in kg).]

interpretation of this value is the following: The data suggest that the higher the
gestational age, the higher the birth weight. The fact that the value is very close to 1,
suggests that this is a strong, positive linear association.
of these words provide useful information for interpreting the association and are,
therefore, required to obtain full marks.
formula for b is:
b = (Σ xi yi − n x̄ ȳ)/(Σ xi² − n x̄²)
and by substituting in the summary statistics we get b = 0.250.
Hence the regression line can be written as:
ŷ = −6.660 + 0.250x or y = −6.660 + 0.250x + ε.
It should also be plotted on the scatter diagram.

iv. In this case one can note in the scatter diagram that the points seem to be ‘scattered’
around a straight line. Hence a linear regression model does seem to be a good model for
the association between gestational age and birth weight. According to the model the
expected birth weight for 38 weeks of pregnancy is −6.660 + 0.250 × 38 ≈ 2.84 kg.
Note also that the value 38 is well inside the limits of the x variable. Hence this value
should be trusted since this is an interpolation.
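As a check on the regression working, the following sketch recomputes the least squares line from the nine data pairs given in the question. The function name least_squares is a hypothetical helper. Because the script keeps unrounded coefficients, its prediction at 38 weeks is about 2.86 kg rather than the 2.84 kg obtained from the rounded line.

```python
x = [36.0, 39.7, 38.0, 41.4, 38.7, 35.7, 40.3, 37.3, 42.4]  # gestational age (weeks)
y = [2.0, 3.7, 2.7, 3.7, 2.9, 2.6, 3.5, 2.7, 3.8]           # birth weight (kg)


def least_squares(x, y):
    """Return (a, b) for the fitted line y-hat = a + b x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum(xi * yi for xi, yi in zip(x, y)) - n * x_bar * y_bar
    sxx = sum(xi ** 2 for xi in x) - n * x_bar ** 2
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b


a, b = least_squares(x, y)
x_new = 38
prediction = a + b * x_new
# Interpolation check: is the new x value inside the observed range?
is_interpolation = min(x) <= x_new <= max(x)
print(f"y-hat = {a:.3f} + {b:.3f}x, prediction at {x_new}: {prediction:.2f} kg")
print("interpolation" if is_interpolation else "extrapolation")
```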

i. Use an appropriate hypothesis test to determine whether there is a difference

between the mean battery lifetimes of the two brands. State clearly the
your findings.
iii. Repeat the procedure in (b) part i. to determine whether the mean battery
lifetime of brand 1 is shorter than that of brand 2.
(12 marks)

The first two parts of the question refer to a two-tailed hypothesis test comparing two
population means. While the entire subject guide chapter on hypothesis testing is relevant
(Chapter 8), one can focus on Section 8.16 and in particular Sections 8.16.2 and 8.16.3 as
the variances are unknown.
i. Let µ1 denote the mean battery lifetime for brand 1 and µ2 denote the mean battery
lifetime for brand 2.
The wording ‘whether there is a difference between the mean battery lifetimes of the two
brands’ implies a two-tailed test, hence the hypotheses can be written as:
H0 : µ1 = µ2 vs. H1 : µ1 ≠ µ2 .
The test statistic formulae, depending on whether or not a pooled variance is used, are
provided in the formula sheet, hence use:
(x̄ − ȳ)/√(sp² (1/n1 + 1/n2)) or (x̄ − ȳ)/√(s1²/n1 + s2²/n2).
If equal variances are assumed, the test statistic value is −2.084. If equal variances are
not assumed the test statistic value is −2.082.
The variances are unknown but the sample sizes are large, so the standard normal
distribution can be used due to the central limit theorem. The t60 distribution is also
acceptable. The critical values at the 5% significance level are ±1.96, hence we reject the
null hypothesis. If we take a (smaller) α and test at the 1% significance level, the critical
values are ±2.576, so we do not reject H0 .
We conclude that there is moderate evidence of a difference between the population mean
battery lifetimes of the two brands.
ii. The assumptions for i. concerned:
• an assumption about equal variances
• an assumption about whether n1 + n2 is ‘large’ so that the normality assumption is
satisfied
• an assumption about independent samples.
were not awarded in such cases. Also, some other candidates just copied the phrase
should state whether the calculations were based on the assumption that unknown
iii. Given the different wording in this part, ‘mean battery lifetime of brand 1 is shorter than
that of brand 2’, a one-tailed test is required. The hypotheses now become:
H0 : µ1 = µ2 vs. H1 : µ1 < µ2 .
The critical values, still based on the standard normal distribution, now become −1.645
for the 5% significance level and −2.326 for the 1% significance level. We reject H0 for
α = 0.05, but not α = 0.01, hence we conclude that there is moderate evidence that the
brand 1 batteries have a shorter mean battery lifetime.
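The critical values quoted in parts i. and iii. can be reproduced from the standard normal distribution. The sketch below is an illustration only: critical_value is a hypothetical helper, and the test statistic −2.084 is the pooled-variance value reported above, since the raw battery data are not reproduced here.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1


def critical_value(alpha, two_tailed=True):
    """Return the positive standard normal critical value for significance level alpha."""
    tail = alpha / 2 if two_tailed else alpha
    return std_normal.inv_cdf(1 - tail)


z = -2.084  # pooled-variance test statistic reported in part i.

for alpha in (0.05, 0.01):
    two = critical_value(alpha, two_tailed=True)
    one = critical_value(alpha, two_tailed=False)
    print(f"alpha = {alpha}: two-tailed ±{two:.3f}, reject H0: {abs(z) > two}; "
          f"lower-tailed −{one:.3f}, reject H0: {z < -one}")
```

At both stages H0 is rejected at the 5% level but not at the 1% level, which is what 'moderate evidence' refers to in the conclusions.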
Question 4
(a) A sample consisting of 100 randomly-selected adults in the USA was classified
in terms of their political affiliation (Democrat or Republican) and opinion on a
tax reform bill (in favour, indifferent or opposed). The data are summarised in
the table below.

             In favour   Indifferent   Opposed
Democrat     12          29            16
Republican   18          11            14
would you say there is an association between the political affiliation and
opinion on the tax reform bill?
ii. Calculate the χ² statistic and use it to test for independence of political
affiliation and opinion on the tax reform bill. What do you conclude?
(13 marks)

This question targets Chapter 9 of the subject guide on contingency tables and chi-squared
and interpreting contingency tables. Part ii. is a straightforward chi-squared test and the
reading is also given in Chapter 9. Candidates can attempt Learning activity 4 in Section
9.11 for further practice.
i. There are some differences in the opinion on the tax reform bill between Democrats and
Republicans. More specifically, only 40% of those in favour are Democrats, whereas more
than 50% of those opposed are Democrats. Hence there seems to be an association
between political affiliation and opinion on the tax reform bill, although this needs to be
investigated further.
ii. Set out the null hypothesis that there is no association between political affiliation and
opinion on the tax reform bill against the alternative, that there is an association. Be
careful to get these the correct way round!
H0 : No association between political affiliation and opinion on the tax reform bill
vs.
H1 : Association between political affiliation and opinion on the tax reform bill.

Democrat 17.1 22.8 17.1
Republican 12.9 17.2 12.9
Σ_{i,j} (O_{i,j} − E_{i,j})²/E_{i,j}
are (2 − 1) × (3 − 1) = 2. Hence we use Table 8 of the New Cambridge Statistical Tables.
For α = 0.05, the critical value is 5.991, hence reject H0 .
For α = 0.01, the critical value is 9.210, hence do not reject H0 .
We conclude that there is moderate evidence of an association between political
affiliation and opinion on the tax reform bill.
Many candidates looked up the statistical tables incorrectly and so failed to follow
through their earlier accurate work.
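The expected frequencies and the test statistic can be verified with a few lines of code. This is a sketch only: chi_squared is a hypothetical helper, the observed counts are those in the question, and the critical values 5.991 and 9.210 are the tabulated values quoted above. The statistic it computes is roughly 7.62.

```python
observed = [
    [12, 29, 16],  # Democrat: in favour, indifferent, opposed
    [18, 11, 14],  # Republican
]


def chi_squared(observed):
    """Return (statistic, degrees of freedom, expected frequencies) for a contingency table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    # Expected frequency under independence: row total x column total / n
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    stat = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, df, expected


stat, df, expected = chi_squared(observed)
print(f"chi-squared = {stat:.3f} on {df} degrees of freedom")
print("reject H0 at 5%:", stat > 5.991, "| reject H0 at 1%:", stat > 9.210)
```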
(b) You have been asked to design a cluster random sample survey from the
employees of a certain large company to examine whether job satisfaction of
employees varies between different job types.
ii. Propose two relevant clusters. Justify your answers.

iii. Provide two actions to reduce response bias and explain why you think they
would be successful.
collected data.
(12 marks)

This was a question on basic material on survey design. Background reading is given in
Chapter 10 of the subject guide which, along with the recommended reading, should be
looked at carefully. Candidates were expected to have studied and understood the main
learning activities in Chapter 10.
Some model answers are given below.
i. An indicative answer here would be to use an online list of schools. A limitation with
this choice is that this list may not contain all schools.
ii. Example clusters are based on areas of the city or companies. In order for cluster
sampling to be effective the clusters have to be representative of the population.
iii. Example actions are incentives and face-to-face interviews. Note here that response bias
occurs when the respondents give consistently false answers, for example claiming a
younger age.
iv. Examples here are appropriate graphs (boxplots, histograms etc.), confidence intervals
and hypothesis tests of job satisfaction measurement across different job types.

ST104a Statistics 1
Important note

references
section.
Section A
Question 1
(a) Suppose that y1 = −2, y2 = −5, y3 = 1, y4 = 16, y5 = 10, and z1 = 8, z2 = −5,

i. Σ_{i=1}^{3} zi²    ii. Σ_{i=4}^{5} √(yi zi)    iii. z4² + Σ_{i=1}^{3} 1/yi
(6 marks)

guide and in particular Learning activity 6 in Section 2.14.
The answers are as follows.

i. We have:
Σ_{i=1}^{3} zi² = 8² + (−5)² + 6² = 64 + 25 + 36 = 125.
ii. We have:
Σ_{i=4}^{5} √(yi zi) = √(16 × 4) + √(10 × 10) = 8 + 10 = 18.
iii. We have:
z4² + Σ_{i=1}^{3} 1/yi = 4² + (−1/2 − 1/5 + 1) = 16.3.
ii. Community type, i.e. rural, small town, large town, small city, large city.
iii. Discipline studied as the degree major.
iv. Volume of water in a bottle.
(8 marks)

between nominal and ordinal categorical variables should be made by candidates.

A general tip for identifying measurable and categorical variables is to think of the possible
i. Measurable because the amount can be measured, for example, in trillions of pounds to
several decimal places such as £2.65 trillion.
ii. Each community type corresponds to a category. Moreover, the categories are in a
ranked order in terms of population or size, for example a large city has more residents
than a small city. Therefore, it is a categorical ordinal variable.
iii. Each discipline (Philosophy, Mathematics, Geography etc.) is a category. Also, there is
no natural ordering between the disciplines, for example we cannot really say that
‘Philosophy is higher than Geography’. Therefore, this is a categorical nominal variable.
iv. Measurable because volume can be measured to several decimal places, for example 502
ml.
be measured’ that were not awarded any marks.
i. Assuming that the weights are normally distributed, what is the probability
that a random selection of 50 animals from that population will have a mean
weight between 8.6 kg and 9.1 kg?
(5 marks)

sample examination questions are relevant. For the first part of the question it is essential to
check Section 6.9 of the subject guide.
i. This first part just requires knowledge of the fact that if X is a normal random variable with
mean µ and variance σ², then the sample mean X̄ of n independent observations is also a normal
random variable with mean µ and variance σ²/n. Direct application of this fact then
yields that:
X̄ ∼ N(µ, σ²/n) = N(8.9, (2.1)²/50).
Now note the basic property of a normal random variable that if X ∼ N (µ, σ²) then
Z = (X − µ)/σ ∼ N (0, 1). Note also that:
∗ P (Z < a) = P (Z ≤ a) = Φ(a)
∗ P (Z > a) = P (Z ≥ a) = 1 − P (Z < a) = 1 − P (Z ≤ a) = 1 − Φ(a)
∗ P (a < Z < b) = P (a ≤ Z < b) = P (a < Z ≤ b) = P (a ≤ Z ≤ b) = Φ(b) − Φ(a).
The above is all you need to find the requested proportion:

P(8.6 ≤ X̄ ≤ 9.1) = P((8.6 − 8.9)/(2.1/√50) ≤ Z ≤ (9.1 − 8.9)/(2.1/√50))
= P(−1.01 ≤ Z ≤ 0.67)
= Φ(0.67) − (1 − Φ(1.01))
= 0.7486 − (1 − 0.8438)
= 0.5924
using Table 4 of the New Cambridge Statistical Tables.
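The same probability can be checked numerically. The sketch below uses the standard library's NormalDist in place of statistical tables; because the z-values are not rounded to two decimal places, the result (about 0.593) differs slightly from the tabulated answer of 0.5924.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 8.9, 2.1, 50

# Sampling distribution of the mean: N(mu, sigma^2 / n)
sample_mean_dist = NormalDist(mu, sigma / sqrt(n))

prob = sample_mean_dist.cdf(9.1) - sample_mean_dist.cdf(8.6)
print(f"P(8.6 <= X-bar <= 9.1) = {prob:.4f}")
```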

ii. The researcher’s statement is justified since we can apply the central limit theorem given
the large sample size n.
x 2 6 7 9
pX (x) k k k 2k
i. Determine the constant k and hence write down the probability distribution
of X.
(6 marks)

probability and probability distributions. Reading from Chapter 5 of the subject guide is
suggested with particular focus on the sections on these topics.
i. Since Σ_i p(xi) = 5k = 1, then k = 0.2. Hence the probability distribution is:
x 2 6 7 9
pX (x) 0.2 0.2 0.2 0.4
ii. We have:
E(X) = Σ_i xi p(xi) = 2 × 0.2 + 6 × 0.2 + 7 × 0.2 + 9 × 0.4 = 6.6.
iii. We have:
E(X²) = Σ_i xi² p(xi) = 2² × 0.2 + 6² × 0.2 + 7² × 0.2 + 9² × 0.4 = 50.2
hence Var(X) = 50.2 − (6.6)² = 6.64.

An alternative method to find the variance is through the formula Σ_i (xi − µ)² p(xi),
where µ was found in part ii.
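Both routes to the variance can be confirmed with exact arithmetic. This sketch uses Python's Fraction type so that no rounding is involved; the variable names are illustrative only.

```python
from fractions import Fraction

values = [2, 6, 7, 9]
weights = [1, 1, 1, 2]  # probabilities expressed as multiples of k

# The probabilities must sum to 1, so k = 1 / (sum of the multiples)
k = Fraction(1, sum(weights))
probs = [w * k for w in weights]

mean = sum(v * p for v, p in zip(values, probs))
second_moment = sum(v ** 2 * p for v, p in zip(values, probs))

# Two equivalent formulae for the variance
variance = second_moment - mean ** 2
variance_alt = sum((v - mean) ** 2 * p for v, p in zip(values, probs))

print(f"k = {float(k)}, E(X) = {float(mean)}, Var(X) = {float(variance)}")
```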

characteristic was measured under two distinct conditions and the difference in
characteristic values was recorded. The sample mean of the differences was
2.326, whereas their sample standard deviation was 7.6. The researchers
ii. Calculate the p-value of the test and use the p-value to test the hypothesis of
equal means. Use a 5% significance level.
(7 marks)

This question refers to a paired-samples t test examining whether there is a difference in the
means of the characteristic for each adult under the two distinct conditions. While the
entire chapter on hypothesis testing is relevant (Chapter 8), one can focus on the relevant
section for the paired-samples t test, Section 8.16.4, and Exercise 8.8. Note also the reading
on one- and two-tailed tests, located in Section 8.10. The second part of the question looks
at p-values and the relevant section in the subject guide is Section 8.11.
i. The test statistic value is:
x̄d/(sd/√n) = 2.326/(7.6/√61) = 2.390.
ii. The second part of the question requires the use of p-values and challenged most
candidates. The exercise does not require lengthy calculations and can be derived in a
relatively straightforward manner if one is familiar with the material of Section 8.11 of
the subject guide. It is also essential to decide on the distribution of the test statistic. In
this case one can use the tn−1 = t60 distribution, since the standard deviation is
unknown and estimated by sd . Nevertheless the assumption of a standard normal
distribution is also justified by the central limit theorem given the large sample size n.
Finally, note that this is a two-tailed test.
Combining the above, we get that the p-value is, where T ∼ t60 , using Table 10 of the
New Cambridge Statistical Tables:
2 × P (T ≥ 2.390) = 2 × 0.01 = 0.02.
Since the p-value is 0.02 < 0.05, the test is significant at the 5% significance level.
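The test statistic, and a normal-approximation version of the p-value, can be reproduced as follows. This is a sketch under the central limit theorem justification mentioned above: it uses the standard normal distribution rather than t60 (the standard library has no t distribution), so it gives a p-value of about 0.017 instead of the tabulated 0.02. The conclusion at the 5% level is unchanged.

```python
from math import sqrt
from statistics import NormalDist

d_bar, s_d, n = 2.326, 7.6, 61  # mean and standard deviation of the differences

# Paired-samples test statistic: mean difference over its estimated standard error
t_stat = d_bar / (s_d / sqrt(n))

# Two-tailed p-value using the standard normal approximation
p_normal = 2 * (1 - NormalDist().cdf(abs(t_stat)))

print(f"test statistic = {t_stat:.3f}, approximate p-value = {p_normal:.3f}")
print("significant at 5%:", p_normal < 0.05)
```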
(f ) State whether the following are true or false and give a brief explanation (no
i. The median of a random sample is not influenced by extreme values.
ii. If A and B are independent events, then P (A | B) < P (A).
iii. If X ∼ N (7, 4), then P (X ≥ 7) > 0.5.
iv. A p-value of 0.03 represents an insignificant hypothesis test result.
v. In cluster random sampling, elements within a cluster are homogeneous.
vi. A contingency table is used to display two measurable variables.
(12 marks)

level in computations. Part i. concerns measures of location that can be found in Section 4.8
of the subject guide. Part ii. requires knowledge of basic probability properties that can be
found in Section 5.10. Part iii. is about probability properties of the normal distribution, see
Section 6.8. Part iv. targets the concepts of a p-value covered in Section 8.11, whereas part
v. requires material from Chapter 10 and in particular Section 10.7 on types of sample.
Finally, part vi. looks at scatter diagrams (Section 12.6) and/or contingency tables (Section
9.7.1).
reason for each true/false response and not just a choice between the two. Some candidates
also lost marks for long rambling explanations without a decision as to whether a statement
was true or false.
i. True. The median is the midpoint of an ordered set of data, so extreme values will not
span the midpoint when the data are ordered.
ii. False. If A and B are independent events, then:
P(A | B) = P(A ∩ B)/P(B) = P(A) P(B)/P(B) = P(A).
iii. False. A normal distribution is symmetric about its mean, hence P (X ≥ 7) = 0.5.
iv. False. A p-value of 0.03 represents a moderately significant hypothesis test result. An
alternative justification could be that an insignificant hypothesis test result means a
p-value larger than 0.10.
v. False. Cluster random sampling works better if the elements within a cluster are
heterogeneous.
vi. False. A contingency table is used to display two categorical variables. An alternative
justification is that a scatter diagram is used to display two measurable variables.
(g) In a random sample of size n = 6 the mean of the data is 15 and the median is
11. Another observation is then obtained and this takes the value of 8, i.e.
x7 = 8.
(6 marks)

This question contains material from Section 4.8 of the subject guide, and in particular
Sections 4.8.1 and 4.8.2.

The reading required for this question is minimal, however a good level of understanding of
the formulae for the sample mean and median is required to get the full marks. The first
part was relatively straightforward, but the second part was perhaps the most challenging
exercise of this examination paper and very few candidates managed to get the full marks.
i. The working is as follows.
• We have Σ_{i=1}^{6} xi = n x̄ = 6 × 15 = 90.
• Hence the new sample mean is:
x̄ = (Σ_{i=1}^{7} xi)/n = (90 + 8)/7 = 14.
ii. The working is as follows.
• Based on the first 6 observations, 11 = (x(3) + x(4) )/2.
• We have x(3) ≤ 11 and x(4) ≥ 11.
• Let m denote the median of the seven observations. Since x7 = 8, then:
— if x7 < x(3) , then m = x(3) and 8 < m ≤ 11
— if x7 = x(3) , then m = x7 = x(3) = 8
— if x7 > x(3) , then m = x7 = 8.
• Hence the median of the seven observations, m, is such that 8 ≤ m ≤ 11.
Section B
Question 2
(a) The data below contain measurements of the low-density lipoproteins, also
known as the ‘bad’ cholesterol, in the blood of 30 patients. Data are measured
in milligrams per deciliters (mg/dL).
95 96 96 98 99
99 101 101 102 102
103 104 104 107 107
111 112 113 113 114
115 117 121 123 124
127 129 131 135 143
paper provided.
ii. Find the mean and the modal group. You are given that the sum of the data
is 3342.
iv. Comment on the data, given the shape of the histogram and the measures
(13 marks)

Chapter 4 of the subject guide provides all the relevant material for this question. More
specifically, reading on histograms can be found in Section 4.7.3, but the whole of Sections
4.7–4.9 are highly relevant.

[Figure: Histogram of low-density lipoprotein data. x-axis: Low-density lipoproteins (in mg/dL), from 90 to 150; y-axis: Frequency density.]
ii. • Mean = 3342/30 = 111.4 mg/dL. Note: The raw data should be used, not grouped
data. Also make sure to mention the units to get the full marks.
• Modal group: [100, 110) mg/dL. Note: Same as above.
iii. • Median = 109 mg/dL.
• Correct position of Q3 (between 22nd and 23rd inclusive).
• Q3 ≈ 119 mg/dL.
iv. The distribution of the data appears to be positively/right-skewed. This is also
supported by the fact that the mean is greater than the median.
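The summary measures above can be checked directly from the 30 raw observations. The sketch below assumes class intervals of width 10 starting at 90, as in the histogram, and takes Q3 as the midpoint of the 22nd and 23rd ordered values, following the position given in the commentary; other quartile conventions give slightly different values.

```python
from collections import Counter
from statistics import mean, median

ldl = [
    95, 96, 96, 98, 99, 99, 101, 101, 102, 102,
    103, 104, 104, 107, 107, 111, 112, 113, 113, 114,
    115, 117, 121, 123, 124, 127, 129, 131, 135, 143,
]  # already in ascending order, in mg/dL

class_width = 10
# Count observations per class [90, 100), [100, 110), ... keyed by lower limit
counts = Counter((v // class_width) * class_width for v in ldl)
modal_lower = max(counts, key=counts.get)
# Frequency density = frequency / class width (the histogram's vertical axis)
density = {lower: c / class_width for lower, c in sorted(counts.items())}

q3 = (ldl[21] + ldl[22]) / 2  # midpoint of the 22nd and 23rd ordered values

print(f"mean = {mean(ldl):.1f}, median = {median(ldl)}, Q3 = {q3}")
print(f"modal group = [{modal_lower}, {modal_lower + class_width}) mg/dL")
print("frequency densities:", density)
```

The mean (111.4) exceeding the median (109) is the numerical counterpart of the positive skew seen in the histogram.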
(b) A mobile telephone company gathered a random sample of 500 people to

determine whether they like the design of its latest mobile telephone. The table
below summarises the people’s responses.
Gender     Sample size    Positive view on the latest mobile telephone design
Males      225            110
Females    275            165
i. Do the people’s responses indicate a difference between males and females in
whether they like the design of the latest mobile telephone? Conduct a
suitable hypothesis test at two appropriate significance levels and comment
on your results. State any assumptions that you make.
ii. Compute a 98% confidence interval for the difference of proportions with a
positive view between males and females in the population.
(12 marks)

Read Section 8.15 of the subject guide about hypothesis testing for the difference between
two population proportions. For the second part, see Section 7.12 on confidence intervals for
the difference between two population proportions.
i. Regarding hypotheses, note that the wording ‘indicate a difference’ suggests a two-tailed
test:
H0 : π1 = π2 vs. H1 : π1 ≠ π2
where π1 refers to the population proportion of males with a positive view on the latest
mobile telephone design, and π2 is the corresponding population proportion for females.
In order to conduct this test we need the pooled sample proportion which is:
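p = (110 + 165)/(225 + 275) = 0.55
from which we can get:
s.e.(p1 − p2) = √(0.55 × 0.45 × (1/225 + 1/275)) = 0.045.
The value of the test statistic, taking the difference as females minus males (the sign
does not affect a two-tailed test), is then:
(p2 − p1)/s.e.(p1 − p2) = (0.60 − 0.489)/0.045 = 2.485.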
Using the standard normal distribution, which is justified by the large sample sizes
according to the central limit theorem, the critical values at the 5% significance level are
±1.96. Since 2.485 > 1.96 we reject H0 at the 5% significance level.
Therefore, we choose a second (smaller) significance level, say 1%, which gives critical
values of ±2.576, in which case we do not reject H0 since 2.485 < 2.576.
Hence we conclude that there is moderate evidence of a difference between the
population proportions of males and females with a positive view on the latest mobile
telephone design.
ii. This is a standard exercise for confidence intervals given the appropriate formula from
interval is (0.007, 0.215), using a z-value of 2.326 from Table 10 of the New Cambridge
Statistical Tables.
Note that in order to get the confidence interval above the following formula for
s.e.(p1 − p2 ) is required:
s.e.(p1 − p2) = √(0.489 × 0.511/225 + 0.6 × 0.4/275) = 0.045.
In this case it makes no difference in the confidence interval calculation, but it could give
different answers in other questions of this type.
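A short sketch of the confidence interval calculation, using the unpooled standard error, is given below. The function name diff_proportion_ci is a hypothetical helper, and the difference is taken as females minus males so that the sign matches the reported interval. Small discrepancies in the third decimal place arise because the script does not round intermediate values.

```python
from math import sqrt
from statistics import NormalDist


def diff_proportion_ci(x1, n1, x2, n2, level):
    """Return (lower, upper) limits of a confidence interval for pi1 - pi2."""
    p1, p2 = x1 / n1, x2 / n2
    # Unpooled standard error: no assumption that the two proportions are equal
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    diff = p1 - p2
    return diff - z * se, diff + z * se


# Females (165 of 275) minus males (110 of 225), 98% confidence level
lower, upper = diff_proportion_ci(165, 275, 110, 225, 0.98)
print(f"98% confidence interval: ({lower:.3f}, {upper:.3f})")
```

Since the interval excludes zero, it is consistent with rejecting H0 at any significance level of 2% or more in a two-tailed test.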
Question 3
(a) The table below contains information from 9 students taking a course in
Statistics. Students were asked how many hours they spent revising the
material before the examination (x values, in hours) and what their
examination mark was (y values, in %).
Student 1 2 3 4 5 6 7 8 9
x 1.8 2.6 2.8 3.4 3.6 4.2 4.8 5.2 5.4
y 54 64 60 62 68 70 76 73 76
Sum of the y values: 603 Sum of the squares of the y values: 40861
Sum of the products of the x and y values: 2336
label the diagram.
scatter diagram.
iv. Based on the regression model above, what examination mark would you
expect from a student who studied 8 hours? Would you trust this value?
Justify your answer.
(13 marks)

i., whereas the remaining parts are on correlation and linear regression that are covered in
Sections 12.8–12.10 of the subject guide. Section 12.7 is also relevant. Sample examination
question 2 of this chapter is also recommended for practice on questions of this type.
i. We have:
[Figure: Scatter diagram of Examination mark vs. Hours spent revising. x-axis: Hours spent revising; y-axis: Examination mark (in %).]
interpretation of this value is the following: The data suggest that the higher the hours
spent revising, the higher the examination mark. The fact that the value is very close to
1, suggests that this is a strong, positive linear association.
of these words provide useful information for interpreting the association and are,
formula for b is:
b = (Σ xi yi − n x̄ ȳ)/(Σ xi² − n x̄²)
and by substituting in the summary statistics we get b = 5.804.
ŷ = 45.203 + 5.804x or y = 45.203 + 5.804x + ε.
It should also be plotted on the scatter diagram.

Many candidates reported incorrectly the regression line as y = 45.203 + 5.804x. This
around a straight line. Hence a linear regression model does seem to be a good model for
the association between revision time and examination mark. According to the model
the expected examination mark for 8 hours of revision is 45.203 + 5.804 × 8 ≈ 92%.
However, note that the value 8 is well outside the limits of the x variable. Hence this
value should be treated with caution since this is an extrapolation.


between the mean battery lifetimes of the two brands. State clearly the
your findings.
iii. Repeat the procedure in (b) part i. to determine whether the mean battery
lifetime of brand 1 is longer than that of brand 2.
(12 marks)

The first two parts of the question refer to a two-tailed hypothesis test comparing two
population means. While the entire subject guide chapter on hypothesis testing is relevant
(Chapter 8), one can focus on Section 8.16 and in particular Sections 8.16.2 and 8.16.3 as
the variances are unknown.

i. Let µ1 denote the mean battery lifetime for brand 1 and µ2 denote the mean battery
lifetime for brand 2.
The wording ‘whether there is a difference between the mean battery lifetimes of the two
brands’ implies a two-tailed test, hence the hypotheses can be written as:
H0 : µ1 = µ2 vs. H1 : µ1 ≠ µ2 .
provided in the formula sheet, hence use:
(x̄ − ȳ)/√(sp² (1/n1 + 1/n2)) or (x̄ − ȳ)/√(s1²/n1 + s2²/n2).
If equal variances are assumed, the test statistic value is 2.426. If equal variances are not
assumed the test statistic value is 2.465.
The variances are unknown but the sample sizes are large, so the standard normal
acceptable. The critical values at the 5% significance level are ±1.96, hence we reject the
null hypothesis. If we take a (smaller) α and test at the 1% significance level, the critical
values are ±2.576, so we do not reject H0 .
We conclude that there is moderate evidence of a difference between the population mean
battery lifetimes of the two brands.
ii. The assumptions for i. concerned:
• an assumption about equal variances
• an assumption about whether n1 + n2 is ‘large’ so that the normality assumption is
satisfied
• an assumption about independent samples.
were not awarded in such cases. Also, some other candidates just copied the phrase
should state whether the calculations were based on the assumption that unknown
iii. Given the different wording in this part, ‘mean battery lifetime of brand 1 is longer than
that of brand 2’, a one-tailed test is required. The hypotheses now become:
H0 : µ1 = µ2 vs. H1 : µ1 > µ2 .
The critical values, still based on the standard normal distribution, now become 1.645 for
the 5% significance level and 2.326 for the 1% significance level. We reject H0 for
α = 0.05 and α = 0.01, hence we conclude that there is strong evidence that the brand 1
batteries have a longer mean battery lifetime.
Question 4
(a) A sample consisting of 100 randomly-selected students in a UK university was

classified in terms of a student’s origin (either UK/EU or overseas) and in
terms of their satisfaction with university life (satisfied, indifferent or
dissatisfied). The data are summarised in the table below.

            Satisfied   Indifferent   Dissatisfied
UK/EU       10          26            15
Overseas    20          14            15
would you say there is an association between the student’s origin and
satisfaction with university life?
ii. Calculate the χ² statistic and use it to test for independence of student’s
origin and satisfaction with university life. What do you conclude?
(13 marks)

This question targets Chapter 9 of the subject guide on contingency tables and chi-squared
and interpreting contingency tables. Part ii. is a straightforward chi-squared test and the
reading is also given in Chapter 9. Candidates can attempt Learning activity 4 in Section
9.11 for further practice.
i. There are some differences in the rates of satisfaction between UK/EU and overseas
students. More specifically, two thirds of the satisfied students were overseas students,
whereas only half of the dissatisfied students were overseas students. Hence there seems
to be an association between a student’s origin and satisfaction with university life,
although this needs to be investigated further.
ii. Set out the null hypothesis that there is no association between a student’s origin and
satisfaction with university life against the alternative, that there is an association. Be
careful to get these the correct way round!
H0 : No association between a student’s origin and satisfaction with university life
vs.
H1 : Association between a student’s origin and satisfaction with university life.

UK/EU 15.3 20.4 15.3
Overseas 14.7 19.6 14.7
Σ_{i,j} (O_{i,j} − E_{i,j})²/E_{i,j}
are (2 − 1) × (3 − 1) = 2. Hence we use Table 8 of the New Cambridge Statistical Tables.
We conclude that there is moderate evidence of an association between a student’s origin
and satisfaction with university life.
(b) You have been asked to design a stratified random sample survey from the
employees of a certain large company to examine whether job satisfaction of
employees varies between different job types.
ii. Propose two relevant stratification factors. Justify your answers.
iii. Provide two actions to reduce response bias and explain why you think they
would be successful.
collected data.
(12 marks)

This was a question on basic material on survey design. Background reading is given in
Some model answers are given below.
i. An indicative answer here would be to use an email list. A limitation with this choice is
that this list may not contain all the employees.
ii. Example strata are income level, gender, age group etc. In order for stratified sampling
to be effective the population within strata has to be homogeneous.
iii. Example actions are incentives and face-to-face interviews. Note here that response bias
occurs when the respondents give consistently false answers, for example claiming a
younger age.
iv. Examples here are appropriate graphs (boxplots, histograms etc.), confidence intervals
and hypothesis tests of job satisfaction measurement across different job types.
This paper is not to be removed from the Examination Hall

Statistics 1
Tuesday, 8 May 2018: 10:00 to 12:00

on this paper.
PLEASE TURN OVER

UL18/0322 Page 1 of 21
SECTION A
1. (a) Suppose that x1 = 0.5, x2 = 2.5, x3 = 2.8, x4 = 0.4, x5 = 6.1, and

y1 = 0.5, y2 = 4.0, y3 = 4.6, y4 = 2.0, y5 = 0. Calculate the following
quantities:
i. Σ_{i=3}^{5} xi²    ii. Σ_{i=1}^{2} 1/(xi yi)    iii. y4³ + Σ_{i=4}^{5} yi²/xi.
(6 marks)
(b) Classify each one of the following variables as either measurable

ii. Five possible responses to a customer satisfaction survey ranging from
‘very satisfied’ to ‘very dissatisfied’.
iii. A person’s name.
(6 marks)
(c) State whether the following are true or false and give a brief explanation. (No
marks will be awarded for a plain true/false answer.)
i. If A and B are independent events, then P (A ∩ B) = P (A)/P (B).
ii. If X ∼ N (3, 4), then P (X  3) = 0.5.
iii. A p-value can be negative.
iv. A Type I error is the failure to reject a true null hypothesis.
v. Item non-response occurs when no information is collected from a sample
member.
(10 marks)
(d) If P (B) = 0.05, P (A | B) = 0.70 and P (A | B c ) = 0.30, find P (B | A).

(5 marks)

x 1 1 2
pX (x) 0.20 k 4k
i. Determine the constant k and, hence, write out the probability distribution
of X.
ii. Find E(X) (the expected value of X).
iii. Find Var(X) (the variance of X).
(6 marks)
(f) The scores on a verbal reasoning test are normally distributed with a population
mean of µ = 100 and a population standard deviation of σ = 10.
i. What is the probability that a randomly chosen person scores at least 105?
ii. A simple random sample of size n = 20 is selected. What is the probability
that the sample mean will be between 97 and 104? (You may use the
nearest values provided in the statistical tables.)
(7 marks)
(g) You are told that a 99% confidence interval for a single population proportion
is (0.3676, 0.5324).
i. What was the sample proportion that led to this confidence interval?
ii. What was the size of the sample used?
(6 marks)
(h) i. State one advantage and one disadvantage of face-to-face interviews.

ii. State one advantage and one disadvantage of stratified sampling.
(4 marks)

SECTION B
2. (a) An experiment was conducted to examine whether age, in particular

being over 30 or not, has any effect on preferences for a digital or an
analogue watch. Specifically, 129 randomly-selected people were asked
what watch they prefer and their responses are summarised in the table
below:
                        analogue watch   undecided   digital watch
30 year old or younger  10               17          37
Over 30 years old       31               22          12
significance test, would you say there is an association between age
and watch preference? Provide a brief justification for your answer.
ii. Calculate the χ² statistic for the hypothesis of independence between
age and watch preference, and test that hypothesis. What do you
conclude?
(13 marks)
(b) You work for a market research company and your manager has asked
you to carry out a random sample survey for a mobile phone company to
identify whether a recently launched mobile phone is attractive to people
over 40 years old. Limited time and money resources are available at your
disposal. You are being asked to prepare a brief summary containing the
items below.
justification for your answer.
Briefly explain the reasons for your choices.
iii. Provide an example in which response bias may occur. State an action
(12 marks)

3. (a) An area manager in a department store wants to study the relationship

between the number of workers on duty, x, and the value of merchandise lost
to shoplifters, y, in $. To do so, the manager assigned a different number of
workers for each of 10 weeks. The results were as follows:
Week #1 #2 #3 #4 #5 #6 #7 #8 #9 #10
x 9 11 12 13 15 18 16 14 12 10
y 420 350 360 300 225 200 230 280 315 410

scatter diagram.
iv. Based on the regression model above, what will be the predicted loss from
shoplifting when there are 17 workers on duty? Would you trust this value?
(13 marks)
(b) An experiment is conducted to determine whether intensive tutoring (covering

a great deal of material in a fixed amount of time) is more effective than
standard tutoring (covering less material in the same amount of time) for
a particular course. Two randomly chosen groups of students were tutored
separately and their examination mark on the course was recorded. The data
are summarised in the table below:
                    Sample size   Average examination mark   Sample standard deviation
Intensive tutoring  22            65.33                      6.61
Standard tutoring   25            61.58                      5.37

difference between the average examination mark between the two tutoring
groups. State clearly the hypotheses, the test statistic and its distribution
under the null hypothesis, and carry out the test at two appropriate
significance levels. Comment on your findings.
ii. State clearly any assumptions you made in part i.
iii. Give a 90% confidence interval for the mean mark of the intensive tutoring
group.
(12 marks)

4. (a) A sales department monitors the distribution of orders by their value (in £s).
The data below are the values of 30 recent orders:
76 59 93 87 38
50 56 123 45 67
102 34 54 85 85
50 44 33 51 40
82 92 79 38 86
34 29 107 63 46
paper provided.
ii. Find the mean, the median, the interquartile range and the modal group
on the histogram.
iii. Comment on the data, given the shape of the histogram and the
(13 marks)

a new type of pain reliever is effective. In this context, a treatment is
considered effective if it is successful with a probability of more than 0.5.
The pain reliever was given to 30 patients and it reduced the pain for 20
of them. You are asked to use an appropriate hypothesis test to determine
whether the pain reliever is effective. State the test hypotheses, and specify
on your findings.
prescribed so that the patient will expect to get well. In some situations,
this expectation is enough for the patient to recover. This effect, also
known as the placebo effect, occurred to some extent in the second
experiment where the pain was reduced for 21 of the patients. You are
asked to consider an appropriate hypothesis test to incorporate this new
evidence with the previous data and reassess the effectiveness of the pain
reliever.
(12 marks)
END OF PAPER

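ST104a Statistics 1

Expected value of a discrete random variable:
\[ \mu = E(X) = \sum_{i=1}^{N} p_i x_i \]

Standard deviation of a discrete random variable:
\[ \sigma = \sqrt{\sigma^2} = \sqrt{\sum_{i=1}^{N} p_i (x_i - \mu)^2} \]

The transformation formula:
\[ Z = \frac{X - \mu}{\sigma} \]

Finding Z for the sampling distribution of the sample mean:
\[ Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}} \]

Finding Z for the sampling distribution of the sample proportion:
\[ Z = \frac{P - \pi}{\sqrt{\pi (1 - \pi) / n}} \]

Confidence interval endpoints for a single mean (σ known):
\[ \bar{x} \pm z_{\alpha/2} \times \frac{\sigma}{\sqrt{n}} \]

Confidence interval endpoints for a single mean (σ unknown):
\[ \bar{x} \pm t_{\alpha/2,\, n-1} \times \frac{s}{\sqrt{n}} \]

Confidence interval endpoints for a single proportion:
\[ p \pm z_{\alpha/2} \times \sqrt{\frac{p (1 - p)}{n}} \]

Sample size determination for a mean:
\[ n \ge \frac{(z_{\alpha/2})^2 \sigma^2}{e^2} \]

Sample size determination for a proportion:
\[ n \ge \frac{(z_{\alpha/2})^2 \, p (1 - p)}{e^2} \]

Z test of hypothesis for a single mean (σ known):
\[ Z = \frac{\bar{X} - \mu_0}{\sigma / \sqrt{n}} \]

t test of hypothesis for a single mean (σ unknown):
\[ T = \frac{\bar{X} - \mu_0}{S / \sqrt{n}} \]

Z test of hypothesis for a single proportion:
\[ Z \cong \frac{P - \pi_0}{\sqrt{\pi_0 (1 - \pi_0) / n}} \]

Z test for the difference between two means (variances known):
\[ Z = \frac{\bar{X}_1 - \bar{X}_2 - (\mu_1 - \mu_2)}{\sqrt{\sigma_1^2 / n_1 + \sigma_2^2 / n_2}} \]

t test for the difference between two means (variances unknown):
\[ T = \frac{\bar{X}_1 - \bar{X}_2 - (\mu_1 - \mu_2)}{\sqrt{S_p^2 (1/n_1 + 1/n_2)}} \]

Confidence interval endpoints for the difference between two means:
\[ \bar{x}_1 - \bar{x}_2 \pm t_{\alpha/2,\, n_1 + n_2 - 2} \times \sqrt{s_p^2 \left( \frac{1}{n_1} + \frac{1}{n_2} \right)} \]

Pooled variance estimator:
\[ S_p^2 = \frac{(n_1 - 1) S_1^2 + (n_2 - 1) S_2^2}{n_1 + n_2 - 2} \]

t test for the difference in means in paired samples:
\[ T = \frac{\bar{X}_d - \mu_d}{S_d / \sqrt{n}} \]

Confidence interval endpoints for the difference in means in paired samples:
\[ \bar{x}_d \pm t_{\alpha/2,\, n-1} \times \frac{s_d}{\sqrt{n}} \]

Z test for the difference between two proportions:
\[ Z = \frac{P_1 - P_2 - (\pi_1 - \pi_2)}{\sqrt{P (1 - P) (1/n_1 + 1/n_2)}} \]

Pooled proportion estimator:
\[ P = \frac{R_1 + R_2}{n_1 + n_2} \]

Confidence interval endpoints for the difference between two proportions:
\[ p_1 - p_2 \pm z_{\alpha/2} \times \sqrt{\frac{p_1 (1 - p_1)}{n_1} + \frac{p_2 (1 - p_2)}{n_2}} \]

χ² statistic for test of association:
\[ \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} \]

Sample correlation coefficient:
\[ r = \frac{\sum_{i=1}^{n} x_i y_i - n \bar{x} \bar{y}}{\sqrt{\left( \sum_{i=1}^{n} x_i^2 - n \bar{x}^2 \right) \left( \sum_{i=1}^{n} y_i^2 - n \bar{y}^2 \right)}} \]

Spearman rank correlation:
\[ r_s = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n (n^2 - 1)} \]

Simple linear regression line estimates:
\[ b = \frac{\sum_{i=1}^{n} x_i y_i - n \bar{x} \bar{y}}{\sum_{i=1}^{n} x_i^2 - n \bar{x}^2}, \qquad a = \bar{y} - b \bar{x} \]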
This paper is not to be removed from the Examination Hall

Statistics 1
Tuesday, 8 May 2018: 10:00 to 12:00

on this paper.
PLEASE TURN OVER

UL18/0323 Page 1 of 21 D0
SECTION A
1. (a) Suppose that x1 = 0.2, x2 = 2.5, x3 = 3.7, x4 = 0.8, x5 = 7.4, and

y1 = 0.2, y2 = 8.0, y3 = 3.9, y4 = 2.0, y5 = 0. Calculate the following
quantities:
i. \( \sum_{i=3}^{5} x_i^2 \)   ii. \( \sum_{i=1}^{2} \frac{1}{x_i y_i} \)   iii. \( y_4^3 + \sum_{i=4}^{5} \frac{y_i^2}{x_i} \).
(6 marks)
(b) Classify each one of the following variables as either measurable

i. A person’s nationality.
iii. Responses to a customer opinion survey ranging from ‘strongly agree’ to
‘strongly disagree’.
(6 marks)
i. If A and B are mutually exclusive events, then P (A ∪ B) = 0.
ii. If X ∼ N (8, 9), then P (X 8) = 0.5.
iii. A p-value can be greater than 1.
iv. A Type II error is to reject a false null hypothesis.
v. Unit non-response occurs when a sampled member fails to respond to a
question in the questionnaire.
(10 marks)

(5 marks)
x 1 1 3
pX (x) 0.10 k 5k
i. Determine the constant k and, hence, write out the probability distribution
of X.
(6 marks)
(f) The scores on a verbal reasoning test are normally distributed with a population
mean of µ = 100 and a population standard deviation of σ = 12.
that the sample mean will be between 96 and 103? (You may use the
nearest values provided in the statistical tables.)
(7 marks)
(g) You are told that a 90% confidence interval for a single population proportion
is (0.3853, 0.5147).
(6 marks)
(h) i. State one advantage and one disadvantage of quota sampling.

ii. State one advantage and one disadvantage of telephone interviews.
(4 marks)
SECTION B
2. (a) A survey was conducted in order to examine whether the final grade of
students taking a class is associated with their attendance of a revision
session a few days before the examination. The data, consisting of
students’ final grades and revision session attendance, are summarised in
the table below.
                                 Final Grade A   Final Grade B   Final Grade C
Attended revision session        56              34              28
Did not attend revision session  44              46              42
significance test, would you say there is an association between final
grade and attending revision? Provide a brief justification for your
answer.
ii. Calculate the χ² statistic for the hypothesis of independence between
final grade and attending revision, and test that hypothesis. What do
you conclude?
(13 marks)
(b) You work for a market research company and your manager has asked you
to carry out a random sample survey for a laptop company to identify
whether a new laptop model is attractive to females. The main concern
is to produce results of high accuracy. You are being asked to prepare a
brief summary containing the items below.
Briefly explain the reasons for your choices.
iii. Provide an example in which selection bias may occur. State an action
(12 marks)
3. (a) A study was conducted to determine whether the yield of olive oil is associated
with the average temperature of the area. The data in the table below provide
the average kilograms of olive oil per tree (y) and the average temperature
(x), measured in degrees Celsius. The data correspond to areas taken for 12
different countries.
Average temperature (x) 5 7.5 5 7 8 3 2 8 11 4 5 8

Olive oil yield (y) 10 20 15 17 25 5 2 13 30 3 20 10


scatter diagram.
iv. Based on the regression model above, what olive oil yield would you expect
in an area with average temperature of 5 degrees Celsius? Would you trust
(13 marks)
(b) A survey was conducted in order to compare the average delivery times (in
minutes) between two pizza companies operating in the same area. A random
sample was drawn consisting of various pizza orders from both companies and
the delivery times were recorded. The data are summarised in the following
table:
                 Sample size   Average delivery time   Sample standard deviation
Pizza Company A  41            29.0                    1.9
Pizza Company B  29            27.5                    1.1

difference in the average delivery times between the two companies. State
clearly the hypotheses, the test statistic and its distribution under the null
hypothesis, and carry out the test at two appropriate significance levels.
iii. Give a 98% confidence interval for the mean delivery time for Pizza
Company B.
(12 marks)
4. (a) A large company is checking the salaries of its employees regularly to get an
idea of their distribution. The data below are the salaries (in $000s per year
before tax) of 30 employees.
39 40 44 47 32
37 25 71 56 33
64 63 42 43 34
25 28 35 24 45
35 22 53 55 36
46 46 27 27 38
paper provided.
ii. Find the mean, the median, the interquartile range and the modal group
on the histogram.
iii. Comment on the data, given the shape of the histogram and the measures
(13 marks)
(b) i. A doctor is conducting an experiment to test whether a new treatment for a

disease is effective. In this context, a treatment is considered effective if it is
successful with a probability of more than 0.5. The treatment was applied
to 40 randomly sampled patients and it was successful for 27 of them.
You are asked to use an appropriate hypothesis test to determine whether
the treatment is effective in general. State the test hypotheses, and specify
on your findings.
group of 30 randomly sampled patients. A placebo pill contains no
medication and is prescribed so that the patient will expect to get well. In
some situations, this expectation is enough for the patient to recover. This
effect, also known as the placebo effect, occurred in the second experiment
where 17 patients recovered. You are asked to consider an appropriate
hypothesis test to incorporate this new evidence with the previous data
and reassess the effectiveness of the new treatment.
(12 marks)
END OF PAPER
Examiners’ commentary 2018

ST104a Statistics 1
Important note

references
section.
General remarks
Learning outcomes
At the end of the half course and having completed the Essential reading and activities you should:
required
methods
example, the first part of Question 2 asked for a chi-squared test of association and survey design
problems appeared in the second part. Question 3 began with correlation and linear regression,
followed by hypothesis testing of means and confidence interval construction. Question 4 began with
data presentation and descriptive statistics, while hypothesis testing for proportions appeared in the
second part. This means that it is really important that you make sure you have a reasonable idea of
what topics are covered before you start work on the paper! We suggest you divide your time as
follows during the examination:
and subquestion.
You are not expected to write long essays where explanations or descriptions of sampling design
• If you are specifically asked to perform a hypothesis test, or calculate a confidence interval,
do so. It is not acceptable to do one rather than the other! If you are asked to use a 5%
significance level, this is what will be marked.
• When performing calculations try to use as many decimal places as possible in intermediate
steps to reach the most accurate solution. It is advised to have at least two decimal places
in general and at least three decimal places when calculating probabilities.
• the relevant detailed reference to Newbold, P., W.L. Carlson and B.M. Thorne Statistics for
prepare, and similar questions from Newbold et al. (2012).
expected. This may be due to a number of reasons, but one particular failing is ‘question
We recognise that candidates might not cover all topics in the syllabus in the same depth, but you
The syllabus can be found in the Course information sheet available on the VLE. You should read
the syllabus carefully and ensure that you cover sufficient material in preparation for the
examination. Examiners will vary the topics and questions from year to year and may well set
questions that have not appeared in past papers. Examination papers may legitimately include
questions on any topic in the syllabus. So, although past papers can be helpful during your revision,
you cannot assume that topics or specific questions that have come up in past examinations will
occur again.
Section A
Question 1
(a) Suppose that x1 = −0.5, x2 = 2.5, x3 = −2.8, x4 = 0.4, x5 = 6.1, and y1 = −0.5,
y2 = 4.0, y3 = 4.6, y4 = −2.0, y5 = 0. Calculate the following quantities:
i. \( \sum_{i=3}^{5} x_i^2 \)   ii. \( \sum_{i=1}^{2} \frac{1}{x_i y_i} \)   iii. \( y_4^3 + \sum_{i=4}^{5} \frac{y_i^2}{x_i} \).
(6 marks)

guide, and in particular Learning activity 6.
i. We have:
\[ \sum_{i=3}^{5} x_i^2 = (-2.8)^2 + (0.4)^2 + (6.1)^2 = 7.84 + 0.16 + 37.21 = 45.21. \]
ii. We have:
\[ \sum_{i=1}^{2} \frac{1}{x_i y_i} = \frac{1}{(-0.5) \times (-0.5)} + \frac{1}{2.5 \times 4.0} = 4 + 0.1 = 4.1. \]
iii. We have:
\[ y_4^3 + \sum_{i=4}^{5} \frac{y_i^2}{x_i} = (-2.0)^3 + \frac{(-2.0)^2}{0.4} + \frac{0^2}{6.1} = -8 + 10 = 2. \]
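The three summations above can be checked numerically. The following is a minimal sketch, assuming the data values given in the question; the variable names are illustrative only.

```python
# Data from Question 1(a). Index 0 is a placeholder so that x[i] matches x_i.
x = [None, -0.5, 2.5, -2.8, 0.4, 6.1]
y = [None, -0.5, 4.0, 4.6, -2.0, 0.0]

# i. Sum of x_i^2 for i = 3, 4, 5.
sum_i = sum(x[i] ** 2 for i in range(3, 6))

# ii. Sum of 1 / (x_i * y_i) for i = 1, 2.
sum_ii = sum(1 / (x[i] * y[i]) for i in range(1, 3))

# iii. y_4^3 plus the sum of y_i^2 / x_i for i = 4, 5.
sum_iii = y[4] ** 3 + sum(y[i] ** 2 / x[i] for i in range(4, 6))

print(round(sum_i, 2), round(sum_ii, 2), round(sum_iii, 2))
```

Note that the index ranges matter: a common slip is to sum over all five observations rather than the limits stated on the summation sign.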
Justify your answer. (No marks will be awarded without a justification.)
ii. Five possible responses to a customer satisfaction survey ranging from ‘very
satisfied’ to ‘very dissatisfied’.
iii. A person’s name.
(6 marks)

distinctions between nominal and ordinal categorical variables should be made.
i. Measurable, because GDP can be measured in $bn or $tn to several decimal places.
ii. Each satisfaction level corresponds to a category. The level of satisfaction is in a ranked
order – for example, in terms of the list items provided. Therefore, this is a categorical
ordinal variable.
iii. Each name (James, Jane etc.) is a category. Also, there is no natural ordering between
the names – for example, we cannot really say that ‘James is higher than Jane’.
Therefore, this is a categorical nominal variable.
be measured’ which were not awarded any marks.
i. If A and B are independent events, then P (A ∩ B) = P (A)/P (B).
ii. If X ∼ N (3, 4), then P (X ≤ 3) = 0.5.
iii. A p-value can be negative.
iv. A Type I error is the failure to reject a true null hypothesis.

v. Item non-response occurs when no information is collected from a sample
member.
(10 marks)

level in computations. Part i. requires knowledge of basic probability properties which can
be found in Section 5.9. Part ii. is about probability properties of the normal distribution,
for which see Section 6.8. Part iii. relates to the concept of a p-value covered in Section 8.11,
whereas part iv. relates to the types of error in hypothesis testing covered in Section 8.7.
Finally, part v. requires material from Chapter 10 and in particular the Section 10.10 on
non-response and response bias.
reason why the statement is true/false and not just a choice between the two. Some
candidates also lost marks for long rambling explanations without a decision as to whether
the statement was true or false.
i. False. If A and B are independent events, then P (A ∩ B) = P (A) P (B).
ii. True. A normal distribution is symmetric about its mean, hence we have that
P (X ≤ 3) = 0.5.
iii. False. A p-value is a probability, so it must lie in [0, 1].
iv. False. A Type I error is rejecting a true null hypothesis.
v. False. Either ‘item non-response occurs when a sampled member fails to respond to a
question in the questionnaire’, or ‘unit non-response occurs when no information is
collected from a sample member’.

(5 marks)

This question covers material on basic probability which can be found in Chapter 5 of the
subject guide. Particular focus is given to the total probability and Bayes’ formulae in
Section 5.10. Related exercises to test yourself against these types of questions are the
Activities 5.2 and 5.3, the Learning activity 7 and the Sample examination question 4.
The solution of this exercise requires the following steps. Note, however, that these steps
can be performed in a different order.
• P (B^c) = 1 − P (B) = 1 − 0.05 = 0.95.
• P (A ∩ B) = P (A | B) P (B) = 0.70 × 0.05 = 0.035.
• P (A) = P (A | B) P (B) + P (A | B^c) P (B^c) = 0.70 × 0.05 + 0.30 × 0.95 = 0.32.
• P (B | A) = P (A ∩ B)/P (A) = 0.035/0.32 = 0.1094.
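The steps above (complement, total probability, then Bayes' formula) can be sketched in a few lines of code. This is only an illustrative check under the probabilities stated in the question; the variable names are assumptions.

```python
# Probabilities given in Question 1(d).
p_b = 0.05
p_a_given_b = 0.70
p_a_given_not_b = 0.30

# Complement rule.
p_not_b = 1 - p_b

# Total probability formula for P(A).
p_a = p_a_given_b * p_b + p_a_given_not_b * p_not_b

# Bayes' formula for P(B | A).
p_b_given_a = p_a_given_b * p_b / p_a

print(round(p_a, 4), round(p_b_given_a, 4))
```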
(e) The random variable X takes the values −1, 1 and 2 according to the following
x −1 1 2
pX (x) 0.20 k 4k
i. Determine the constant k and, hence, write down the probability distribution
of X.
(6 marks)

probability and probability distributions. Reading from Chapter 5 in the subject guide is
suggested with a focus on the sections of these topics. Try Activity 5.1 and the exercises on
probability trees.
i. \( \sum_i p(x_i) = 0.20 + 5k = 1 \), hence k = 0.16. Therefore, the probability distribution is:
x −1 1 2
pX (x) 0.20 0.16 0.64
ii. We have:
\[ E(X) = \sum_i x_i \, p(x_i) = (-1) \times 0.20 + 1 \times 0.16 + 2 \times 0.64 = 1.24. \]
iii. We have:
\[ E(X^2) = \sum_i x_i^2 \, p(x_i) = (-1)^2 \times 0.20 + 1^2 \times 0.16 + 2^2 \times 0.64 = 2.92 \]
hence:
\[ \mathrm{Var}(X) = 2.92 - (1.24)^2 = 1.3824. \]
Equivalently, \( \mathrm{Var}(X) = \sum_i (x_i - \mu)^2 \, p(x_i) \),
where µ = E(X) was found in part ii.
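The calculation can be verified with a short script. The sketch below assumes the distribution given in the question and computes the variance in both ways mentioned above; the names are illustrative.

```python
# Values of X and the constraint 0.20 + k + 4k = 1 from Question 1(e).
values = [-1, 1, 2]
k = (1 - 0.20) / 5
probs = [0.20, k, 4 * k]

# Expected value E(X) and second moment E(X^2).
mean = sum(v * p for v, p in zip(values, probs))
second_moment = sum(v ** 2 * p for v, p in zip(values, probs))

# Variance via E(X^2) - (E(X))^2, and via the definition as a cross-check.
var_shortcut = second_moment - mean ** 2
var_definition = sum((v - mean) ** 2 * p for v, p in zip(values, probs))

print(round(k, 2), round(mean, 2), round(var_shortcut, 4))
```

Both routes to the variance should agree up to floating-point rounding, which is a useful sanity check in hand calculations too.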
(f ) The scores on a verbal reasoning test are normally distributed with a

population mean of µ = 100 and a population standard deviation of σ = 10.
that the sample mean will be between 97 and 104? (You may use the nearest
values provided in the statistical tables.)
(7 marks)

This section examines the ideas of normal random variables. Read the relevant section of
sample examination questions are relevant. For the second part of the question it is essential
to read Section 6.9.
The first part just requires knowledge of the fact that X is a normal random variable with
mean µ = 100 and variance σ 2 = 100. However, for the second part of the exercise it is
important to note that X̄, the sample mean, is also a normal random variable with mean µ
and variance σ 2 /n. Direct application of this fact then yields that:
\[ \bar{X} \sim N\left( \mu, \frac{\sigma^2}{n} \right) = N(100, 5). \]
For both parts, the basic property of normal random variables for this question is that if
X ∼ N (µ, σ²), then:
\[ Z = \frac{X - \mu}{\sigma} \sim N(0, 1). \]
Note also that:

• P (Z < a) = P (Z ≤ a) = Φ(a)
• P (Z > a) = P (Z ≥ a) = 1 − P (Z ≤ a) = 1 − P (Z < a) = 1 − Φ(a)
The above is all you need to find the requested probabilities.
i. We have X ∼ N (100, 100), hence:
\[ P(X \ge 105) = P\left( Z \ge \frac{105 - 100}{10} \right) = P(Z \ge 0.5) = 1 - \Phi(0.5) = 1 - 0.6915 = 0.3085. \]
ii. We have X̄ ∼ N (100, 5), hence:
\[ P(97 \le \bar{X} \le 104) = P\left( \frac{97 - 100}{\sqrt{5}} \le Z \le \frac{104 - 100}{\sqrt{5}} \right) = P(-1.34 \le Z \le 1.79) \]
\[ = \Phi(1.79) - (1 - \Phi(1.34)) = 0.9633 - (1 - 0.9099) = 0.8732. \]
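As a cross-check on the table look-ups, the same probabilities can be computed directly. This is a minimal sketch using `statistics.NormalDist` from the Python standard library (Python 3.8 or later); because it does not round the z-values to two decimal places, the second answer may differ from the tabulated result in the fourth decimal place.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 100, 10, 20

# i. X ~ N(100, 10^2): probability of scoring at least 105.
score = NormalDist(mu, sigma)
p_at_least_105 = 1 - score.cdf(105)

# ii. The sample mean has standard deviation sigma / sqrt(n).
sample_mean = NormalDist(mu, sigma / sqrt(n))
p_between = sample_mean.cdf(104) - sample_mean.cdf(97)

print(round(p_at_least_105, 4), round(p_between, 4))
```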
(g) You are told that a 99% confidence interval for a single population proportion is
(0.3676, 0.5324).
(6 marks)

This question refers to confidence intervals for proportions. While all of Chapter 7 of the
subject guide on estimation is relevant, one can focus on the relevant sections covering
confidence intervals for proportions (Section 7.10) and sample size determination (Section
7.11). In terms of exercises, check Examples 7.3 and 7.6, Learning activities 4 and 7, and
Sample examination question 3.
i. The sample proportion, p, must be in the centre of the interval (0.3676, 0.5324). Adding
the two endpoints and dividing by 2 gives p = (0.3676 + 0.5324)/2 = 0.45.
ii. The (estimated) standard error when estimating a single proportion is:
\[ \sqrt{\frac{p (1 - p)}{n}} = \frac{\sqrt{0.45 \times 0.55}}{\sqrt{n}} = \frac{0.4975}{\sqrt{n}}. \]
Since this is a 100 (1 − α)% = 99% confidence interval, then α = 0.01, so the confidence
coefficient is zα/2 = z0.005 = 2.576. Therefore, to determine n we need to solve:
\[ 2.576 \times \frac{0.4975}{\sqrt{n}} = 0.5324 - 0.45 = 0.0824. \]
Solving gives n ≈ 241.9, so the correct sample size is n = 242.

Note: In questions regarding sample size determination remember to round up when the
solution is not an integer.
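The reasoning in both parts can be reproduced as follows. This is a sketch under the assumption that the tabulated value z = 2.576 is used, as in the commentary; the variable names are illustrative.

```python
from math import ceil

# Endpoints of the 99% confidence interval from Question 1(g).
lower, upper = 0.3676, 0.5324

# i. The sample proportion is the midpoint of the interval.
p_hat = (lower + upper) / 2

# ii. The half-width equals z * sqrt(p(1 - p) / n); solve for n.
half_width = (upper - lower) / 2
z = 2.576  # tabulated z-value for 99% confidence
n_exact = z ** 2 * p_hat * (1 - p_hat) / half_width ** 2

# Round up, since a sample size must be a whole number.
n = ceil(n_exact)

print(round(p_hat, 2), round(n_exact, 1), n)
```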
(h) i. State one advantage and one disadvantage of face-to-face interviews.

ii. State one advantage and one disadvantage of stratified sampling.
(4 marks)

This question contains material on sample surveys which can be found in Chapter 10 of the
subject guide. While the entire chapter is relevant and you are advised to read it all, focus
can be given to Section 10.11 regarding face-to-face interviews and Section 10.7 (more
specifically Section 10.7.2) regarding stratified sampling.

Note that for these types of questions there is no single correct answer. Below are some
suggested ‘good’ answers for each part.
i. – Possible advantages: good for personal questions; allow for probing issues in greater
depth; permit difficult concepts to be explained; can show samples (such as new
product designs).
– Possible disadvantages: (very) expensive; not always easy to obtain detailed
information on the spot.
ii. – Possible advantages: high level of accuracy; can perform statistical inference.
– Possible disadvantages: requires a sampling frame; requires knowledge of stratum
membership; possibly time-consuming.
Section B
Question 2
(a) An experiment was conducted to examine whether age, in particular being over
30 or not, has any effect on preferences for a digital or an analogue watch.
Specifically, 129 randomly-selected people were asked what watch they prefer
and their responses are summarised in the table below:

                        analogue watch   undecided   digital watch
30 year old or younger  10               17          37
Over 30 years old       31               22          12
would you say there is an association between age and watch preference?
Provide a brief justification for your answer.
ii. Calculate the χ2 statistic for the hypothesis of independence between age
and watch preference, and test that hypothesis. What do you conclude?
(13 marks)

This part examines Chapter 9 of the subject guide, i.e. contingency tables and the
chi-squared test. Note that part i. of the question does not require any calculations, just
understanding and interpreting contingency tables. Part ii. is a straightforward chi-squared
test and the reading is also given in Chapter 9. Look also at Learning activity 4.

i. There are some differences between younger and older people regarding watch preference.
More specifically, 16% of younger people prefer an analogue watch compared to 48% for
people over 30. Hence there seems to be an association between age and watch
preference, although this needs to be investigated further.
ii. Set out the null hypothesis that there is no association between age and watch preference
against the alternative, that there is an association. Be careful to get these the correct
way round!
H0 : No association between age and watch preference.
H1 : Association between age and watch preference.
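The expected frequencies under H0 are:
                        analogue watch   undecided   digital watch
30 year old or younger  20.34            19.35       24.31
Over 30 years old       20.66            19.65       24.69
The test statistic is:
\[ \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} \]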
which gives a test statistic value of 24.146. This is a 2 × 3 contingency table so the
degrees of freedom are (2 − 1) × (3 − 1) = 2.
We conclude that there is strong evidence of an association between age and watch
preference.
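The expected frequencies and the test statistic can be reproduced with a short script. This is a sketch using the observed counts from the question; the p-value line relies on the fact that, for exactly 2 degrees of freedom, the upper-tail probability of the chi-squared distribution is exp(−x/2), so it should not be reused for other table sizes.

```python
from math import exp

# Observed counts: rows are age groups, columns are
# analogue watch, undecided, digital watch.
observed = [[10, 17, 37],
            [31, 22, 12]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
total = sum(row_totals)

# Expected count = row total * column total / grand total.
expected = [[r * c / total for c in col_totals] for r in row_totals]

# Chi-squared statistic: sum of (O - E)^2 / E over all cells.
chi2 = sum((o - e) ** 2 / e
           for obs_row, exp_row in zip(observed, expected)
           for o, e in zip(obs_row, exp_row))

df = (len(observed) - 1) * (len(observed[0]) - 1)

# Upper-tail probability; this closed form is valid only when df == 2.
p_value = exp(-chi2 / 2)

print([[round(e, 2) for e in row] for row in expected])
print(round(chi2, 3), df)
```

The statistic is far above the tabulated 5% and 1% critical values for 2 degrees of freedom (5.991 and 9.210), which is consistent with the conclusion of strong evidence of an association.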
(b) You work for a market research company and your manager has asked you to
carry out a random sample survey for a mobile phone company to identify
whether a recently launched mobile phone is attractive to people over 40 years
old. Limited time and money resources are available at your disposal. You are
being asked to prepare a brief summary containing the items below.
ii. Describe the sampling frame and the method of contact you will use. Briefly
explain the reasons for your choices.
iii. Provide an example in which response bias may occur. State an action that
you would take to address this issue.
(12 marks)

One of the main things to avoid in this part is to write ‘essays’ without any structure. This
exercise asks for specific things and each one of them requires only one or two lines of
response. If you are unsure of what these things are, do not write lengthy answers. This
is a waste of your valuable examination time. If you can identify what is being asked, keep
in mind that the answer should not be long.
Note also that in some cases there is no single right answer to the question. Some suggested
answers are given below.
i. Cluster sampling is appropriate here due to the cost issue (the subject guide emphasises
its use for reasons of economy). Also, multistage sampling is an option. Although the
question mentions limited time, discussion of quota sampling (for speed) gained no
marks due to the question stressing a ‘probability sampling scheme’.
ii. The question requires:
∗ a description of a sampling frame
∗ a justification of its choice
∗ mentioning a (sensible) contact method
∗ stating an advantage of the contact method mentioned above.
A suggested answer is given below.
The sampling frame could be an email list from the records of the mobile phone
company, which should be easy to obtain. Assuming that the new type is a smartphone,
the method of contact can be via email. Alternatively, if the new type is not a
smartphone, some basic questions can be sent by text. Method of contact by post would
be too expensive compared with the other two methods.
iii. The question requires an example of response bias and an action suggested to address
this issue.
Those least comfortable with the phone are unlikely to reply, so the questionnaire should
be designed appropriately. Also, busy people may not want to spend time on such
market research. A reward of, say, ten free text messages could be offered as an incentive
so that the cost remains low.
iv. A suggested answer for the question is ‘Are attitudes to the new phone more favourable
to younger owners?’.
In terms of variables one could mention ‘age’ and ‘measure of favourableness toward the
new phone’.
Question 3
(a) An area manager in a department store wants to study the relationship between
the number of workers on duty, x, and the value of merchandise lost to
shoplifters, y, in $. To do so, the manager assigned a different number of
workers for each of 10 weeks. The results were as follows:
Week #1 #2 #3 #4 #5 #6 #7 #8 #9 #10
x 9 11 12 13 15 18 16 14 12 10
y 420 350 360 300 225 200 230 280 315 410
label the diagram.
scatter diagram.
iv. Based on the regression model above, what will be the predicted loss from
shoplifting when there are 17 workers on duty? Would you trust this value?
(13 marks)

This is a standard regression question and the relevant reading is to be found in Chapter 12
of the subject guide. Section 12.6 provides details for scatter diagrams and is suitable for
part i., whereas the remaining parts are on correlation and regression which are covered in
Sections 12.8–12.10. Section 12.7 is also relevant. Sample examination question 2 of this
chapter is also recommended for practice on questions of this type.
which should include an informative title (‘Scatter diagram’ alone will not suffice) and
labelled axes which also state the units. Far too many candidates threw away marks by
neglecting these points and, consequently, were only given one mark out of the possible
four allocated to this part of the question. Another common way of losing marks was
any marks for this part of the question.
[Figure: scatter diagram titled 'Stolen merchandise vs number of workers'. Horizontal axis: number of workers on duty. Vertical axis: value of merchandise in $'s lost to shoplifters.]
coefficient (make sure you know which one it is!) to obtain the value r = −0.9688. An
interpretation of this value is the following – the data suggest that the higher the number
of workers, the lower the loss from shoplifters. The fact that the value is very close to −1
suggests that this is a strong, negative, linear association.
Many candidates did not mention all three words (strong, negative, linear). Note that all
of these words provide useful information on interpreting the association and are,
formula for b is:
\[ b = \frac{\sum x_i y_i - n \bar{x} \bar{y}}{\sum x_i^2 - n \bar{x}^2}. \]
The formula for a is a = ȳ − bx̄, and we get a = 655.36.
ŷ = 655.36 − 26.64x or y = 655.36 − 26.64x + ε.
The line should also be plotted on the scatter diagram.

Many candidates reported incorrectly the regression line as y = 655.36 − 26.64x. This
expression is false; one of the two equations above is required.
Also, many candidates did not draw the calculated line on the scatter diagram; instead
they drew an approximate line trying to go around the points but without reference to
the above equation. No marks were awarded in such cases.
around a straight line. Hence a linear regression model does seem to be a good model
here. According to the model, the expected loss from shoplifting for 17 workers on duty
is:
655.36 − 26.64 × 17 ≈ $202.48.
Many candidates did not provide units here. It is essential to do so in order to obtain full
marks.
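The correlation coefficient, the regression estimates and the prediction can be checked with the formula-sheet expressions. The sketch below assumes the ten weekly observations from the question; the names are illustrative. Because it keeps full precision in a and b, the prediction may differ by a few cents from the value obtained with the rounded coefficients.

```python
from math import sqrt

# Weekly data: workers on duty (x) and loss to shoplifters in $ (y).
x = [9, 11, 12, 13, 15, 18, 16, 14, 12, 10]
y = [420, 350, 360, 300, 225, 200, 230, 280, 315, 410]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Corrected sums of squares and cross-products.
s_xy = sum(xi * yi for xi, yi in zip(x, y)) - n * x_bar * y_bar
s_xx = sum(xi ** 2 for xi in x) - n * x_bar ** 2
s_yy = sum(yi ** 2 for yi in y) - n * y_bar ** 2

# Sample correlation coefficient and least squares estimates.
r = s_xy / sqrt(s_xx * s_yy)
b = s_xy / s_xx
a = y_bar - b * x_bar

# Predicted loss (in $) when 17 workers are on duty.
predicted_loss = a + b * 17

print(round(r, 4), round(a, 2), round(b, 2), round(predicted_loss, 2))
```

Since x = 17 lies within the observed range of 9 to 18 workers, this is an interpolation rather than an extrapolation.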
(b) An experiment is conducted to determine whether intensive tutoring (covering

a great deal of material in a fixed amount of time) is more effective than
standard tutoring (covering less material in the same amount of time) for a
particular course. Two randomly chosen groups of students were tutored
separately and their examination mark on the course was recorded. The data
are summarised in the table below:
                     Sample size   Average examination mark   Sample standard deviation
Intensive tutoring   22            65.33                      6.61
Standard tutoring    25            61.58                      5.37

between the average examination mark between the two tutoring groups.
State clearly the hypotheses, the test statistic and its distribution under the
null hypothesis, and carry out the test at two appropriate significance levels.
iii. Give a 90% confidence interval for the mean mark of the intensive tutoring
group.
(12 marks)

population means. While all of Chapter 8 on hypothesis testing is relevant, one can focus on
Section 8.16, and in particular Sections 8.16.2 and 8.16.3 as the variances are unknown. The
third part of this exercise refers to confidence intervals for a single mean with Section 7.9
being the most relevant (with unknown variance).
i. Let µ1 denote the mean examination mark for the intensive tutoring group and µ2
denote the mean examination mark for the standard tutoring group.
The wording ‘whether there is a difference between the average examination mark’
implies a two-sided test, hence the hypotheses can be written as:
H0 : µ1 = µ2 vs. H1 : µ1 ≠ µ2.
provided on the formula sheet:
\[ \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{s_p^2 (1/n_1 + 1/n_2)}} \quad \text{or} \quad \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{s_1^2/n_1 + s_2^2/n_2}}. \]
If equal variances are assumed, the test statistic value is 2.1449. If equal variances are
not assumed, the test statistic value is 2.1164.
The variances are unknown but the sample size is large enough, so the standard normal
correct and will be used in what follows.
The critical values at the 5% significance level are ±2.021, hence we reject the null
hypothesis. If we take a (smaller) α of 1%, the critical values are ±2.704, so we do not
reject H0 . We conclude that there is moderate evidence of a difference between the two
tutoring groups.
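The two test statistic values quoted above can be checked with a short calculation. The sketch below is illustrative only: the function names are hypothetical, and the critical values ±2.021 and ±2.704 are simply the ones quoted in the commentary.

```python
import math


def pooled_t(mean1, s1, n1, mean2, s2, n2):
    """Two-sample t statistic assuming equal population variances."""
    # Pooled variance estimate: weighted average of the two sample variances
    sp2 = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)
    return (mean1 - mean2) / math.sqrt(sp2 * (1 / n1 + 1 / n2))


def unpooled_t(mean1, s1, n1, mean2, s2, n2):
    """Two-sample test statistic without the equal-variance assumption."""
    return (mean1 - mean2) / math.sqrt(s1**2 / n1 + s2**2 / n2)


# Intensive tutoring (group 1) vs standard tutoring (group 2)
t_pooled = pooled_t(65.33, 6.61, 22, 61.58, 5.37, 25)
t_unpooled = unpooled_t(65.33, 6.61, 22, 61.58, 5.37, 25)

# Two-sided decisions using the critical values quoted in the text
reject_at_5 = abs(t_pooled) > 2.021
reject_at_1 = abs(t_pooled) > 2.704
```

Rejecting at the 5% level but not at the 1% level is what the commentary describes as moderate evidence.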
ii. The assumptions for part i. relate to the following.
satisfied.
• Assumption about normality.
were not awarded in such cases. Also, some other candidates just memorised the phrase
‘assumption about equal variances’ and, naturally, were not awarded any marks. One
iii. It is important to identify the correct formula for this confidence interval and substitute
correctly the elements required. Assuming a t distribution with 21 degrees of freedom,
the correct t-value is 1.721. The interval can be worked out as:
\[ 65.33 \pm 1.721 \times \frac{6.61}{\sqrt{22}}. \]
Finally, it is important to report it as an interval, i.e. (62.905, 67.755).
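As a check on the arithmetic, the interval can be reproduced as follows. This is a minimal sketch: `t_interval` is a hypothetical helper, and the t-value 1.721 is taken from the commentary rather than computed.

```python
import math


def t_interval(mean, s, n, t_value):
    """Confidence interval for a single mean: mean +/- t * s / sqrt(n)."""
    half_width = t_value * s / math.sqrt(n)
    return mean - half_width, mean + half_width


# 90% interval for the intensive tutoring group (21 degrees of freedom)
lower, upper = t_interval(65.33, 6.61, 22, 1.721)
```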
Question 4
(a) A sales department monitors the distribution of orders by their value (in £s).
The data below are the values of 30 recent orders:
76 59 93 87 38
50 56 123 45 67
102 34 54 85 85
50 44 33 51 40
82 92 79 38 86
34 29 107 63 46
paper provided.
ii. Find the mean, the median, the interquartile range and the modal group on
the histogram.
(13 marks)

histograms can be found in Section 4.7.3, although all of Sections 4.7, 4.8 and 4.9 are highly
relevant.

Marks were awarded for including an informative title, correct labelling and accurately
drawing the figure. Note that it is essential (and more convenient) to draw the figure on
the graph paper provided; marks were withheld otherwise.
[Figure: histogram titled 'Histogram of order values of a sales department'. Horizontal axis: value of orders (in £), from 20 to 140. Vertical axis: frequency density.]
ii. ∗ Mean: £64.27. Note: Make sure to mention the units to get the full marks.
∗ Median: £57.50. Note: The raw data should be used.
∗ Modal group: between £40 and £59. Note: between £41 and £60 would also be
acceptable.
∗ Correct values of quartiles. Q1 = £44.25 and Q3 = £85.00. Note: Any reasonable
method for quartile calculations would be acceptable.
∗ Interquartile range: £85 − £44.25 = £40.75.
iv. The distribution of the data appears to be slightly positively/right-skewed. This is also
supported by the fact that the mean is larger than the median.
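The summary statistics reported in part ii. can be reproduced from the raw data. The sketch below assumes the 'inclusive' quartile method of Python's `statistics.quantiles`, which happens to match the quartiles quoted above; as the commentary notes, other reasonable quartile methods are acceptable and may give slightly different values.

```python
import statistics

# The 30 order values (in £) from the question
orders = [76, 59, 93, 87, 38, 50, 56, 123, 45, 67,
          102, 34, 54, 85, 85, 50, 44, 33, 51, 40,
          82, 92, 79, 38, 86, 34, 29, 107, 63, 46]

mean = statistics.mean(orders)
median = statistics.median(orders)

# quantiles() sorts the data itself and returns the three quartile cut points
q1, _, q3 = statistics.quantiles(orders, n=4, method="inclusive")
iqr = q3 - q1

# Consistent with slight positive skewness: the mean exceeds the median
mean_exceeds_median = mean > median
```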
(b) i. A pharmaceutical company is conducting an experiment to test whether a

new type of pain reliever is effective. In this context, a treatment is
considered effective if it is successful with a probability of more than 0.5.
The pain reliever was given to 30 patients and it reduced the pain for 20 of
them. You are asked to use an appropriate hypothesis test to determine
whether the pain reliever is effective. State the test hypotheses, and specify
on your findings.
group of 40 patients. A placebo pill contains no medication and is prescribed
so that the patient will expect to get well. In some situations, this
expectation is enough for the patient to recover. This effect, also known as
the placebo effect, occurred to some extent in the second experiment where
the pain was reduced for 21 of the patients. You are asked to consider an
appropriate hypothesis test to incorporate this new evidence with the
previous data and reassess the effectiveness of the pain reliever.
(12 marks)

While the wording of the exercise may appear complicated, it does in fact only refer to
hypothesis testing on proportions. Read Sections 8.14 and 8.15 of the subject guide about
hypothesis testing for a single proportion (the first part of the exercise) and for testing
differences between population proportions (the second part of the exercise).
i. Let πT denote the true probability for the new treatment to work. We can use the
following test.
∗ Test H0 : πT = 0.5 vs. H1 : πT > 0.5.
∗ Standard error: $\sqrt{0.5 \times (1 - 0.5)/30} = 0.091$.
∗ Test statistic value: 1.826.
∗ For α = 0.05, the critical value is 1.645.
∗ Decision: reject H0 at the 5% significance level.
∗ Evidence that the treatment is effective.
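The steps of this one-sided test can be sketched in code. The function name is hypothetical, and the critical value is obtained from the standard normal distribution via `statistics.NormalDist` instead of being read from tables.

```python
import math
from statistics import NormalDist


def one_proportion_z(successes, n, pi0):
    """z statistic for H0: pi = pi0, using the standard error under H0."""
    p_hat = successes / n
    se = math.sqrt(pi0 * (1 - pi0) / n)
    return (p_hat - pi0) / se


z = one_proportion_z(20, 30, 0.5)

# Upper-tail critical value for a one-sided test at the 5% level
critical = NormalDist().inv_cdf(0.95)
reject_h0 = z > critical
```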
ii. Let πP denote the true probability for the patient to recover with the placebo.
∗ Test H0 : πT = πP vs. H1 : πT > πP .
For reference, the test statistic is:
\[ \frac{\pi_T - \pi_P}{\text{s.e.}(\pi_T - \pi_P)} \]
which follows a standard normal distribution, approximately, due to the central limit
theorem.
∗ Calculation of standard error:
\[ \text{s.e.}(\pi_T - \pi_P) = \sqrt{\frac{41}{70} \times \frac{29}{70} \times \left(\frac{1}{40} + \frac{1}{30}\right)} = 0.119. \]
∗ The test statistic value is:
\[ \frac{20/30 - 21/40}{0.119} = 1.191. \]
∗ Do not reject H0 at the 5% significance level.
∗ Insufficient evidence of higher effectiveness than the placebo effect.
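The pooled calculation above can be verified with a few lines of code. This is a sketch under the same assumptions as the commentary (pooled proportion under H0, normal approximation); the function name is hypothetical and 1.645 is the one-sided 5% critical value.

```python
import math


def two_proportion_z(x1, n1, x2, n2):
    """Return (z, se) for H0: pi1 = pi2 using the pooled sample proportion."""
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (x1 / n1 - x2 / n2) / se
    return z, se


# Pain reliever: 20 of 30 successes; placebo: 21 of 40 successes
z, se = two_proportion_z(20, 30, 21, 40)
reject_h0 = z > 1.645
```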

ST104a Statistics 1
Important note

references
section.
Section A
Question 1
(a) Suppose that x1 = −0.2, x2 = 2.5, x3 = −3.7, x4 = 0.8, x5 = 7.4, and y1 = −0.2,
y2 = 8.0, y3 = 3.9, y4 = −2.0, y5 = 0. Calculate the following quantities:
i. $\sum_{i=3}^{5} x_i^2$   ii. $\sum_{i=1}^{2} \frac{1}{x_i y_i}$   iii. $y_4^3 + \sum_{i=4}^{5} \frac{y_i^2}{x_i}$.
(6 marks)

guide, and in particular Learning activity 6.
i. We have:
\[ \sum_{i=3}^{5} x_i^2 = (-3.7)^2 + (0.8)^2 + (7.4)^2 = 13.69 + 0.64 + 54.76 = 69.09. \]
ii. We have:
\[ \sum_{i=1}^{2} \frac{1}{x_i y_i} = \frac{1}{(-0.2) \times (-0.2)} + \frac{1}{2.5 \times 8.0} = 25 + 0.05 = 25.05. \]
iii. We have:
\[ y_4^3 + \sum_{i=4}^{5} \frac{y_i^2}{x_i} = (-2.0)^3 + \frac{(-2.0)^2}{0.8} + \frac{0^2}{7.4} = -8 + 5 + 0 = -3. \]
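The three sums can be checked numerically. In this sketch the only subtlety is indexing: the notation is 1-based while Python lists are 0-based, so index i in the question corresponds to position i - 1 in the lists.

```python
x = [-0.2, 2.5, -3.7, 0.8, 7.4]
y = [-0.2, 8.0, 3.9, -2.0, 0.0]

# i. sum of x_i^2 for i = 3, 4, 5
part_i = sum(x[i - 1] ** 2 for i in range(3, 6))

# ii. sum of 1 / (x_i * y_i) for i = 1, 2
part_ii = sum(1 / (x[i - 1] * y[i - 1]) for i in range(1, 3))

# iii. y_4^3 plus the sum of y_i^2 / x_i for i = 4, 5
part_iii = y[4 - 1] ** 3 + sum(y[i - 1] ** 2 / x[i - 1] for i in range(4, 6))
```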
Justify your answer. (No marks will be awarded without a justification.)
i. A person’s nationality.
iii. Responses to a customer opinion survey ranging from ‘strongly agree’ to
‘strongly disagree’.
(6 marks)

distinctions between nominal and ordinal categorical variables should be made.
i. Each nationality (British, Chinese, Indian etc.) is a category. Also, there is no natural
ordering between the nationalities – for example, we cannot really say that ‘British is
higher than Chinese’. Therefore, this is a categorical nominal variable.
ii. Measurable, because the unemployment rate can be measured in % to several decimal
places.
iii. Each opinion level corresponds to a category. The opinion level is in a ranked order – for
example, in terms of the list items provided. Therefore, this is a categorical ordinal
variable.
be measured’ which were not awarded any marks.
i. If A and B are mutually exclusive events, then P (A ∪ B) = 0.
ii. If X ∼ N (8, 9), then P (X ≥ 8) = 0.5.
iii. A p-value can be greater than 1.
iv. A Type II error is to reject a false null hypothesis.

v. Unit non-response occurs when a sampled member fails to respond to a
question in the questionnaire.
(10 marks)

level in computations. Part i. requires knowledge of basic probability properties which can
be found in Section 5.9. Part ii. is about probability properties of the normal distribution,
for which see Section 6.8. Part iii. relates to the concept of a p-value covered in Section 8.11,
whereas part iv. relates to the types of error in hypothesis testing covered in Section 8.7.
Finally, part v. requires material from Chapter 10 and in particular the Section 10.10 on
non-response and response bias.
reason why the statement is true/false and not just a choice between the two. Some
candidates also lost marks for long rambling explanations without a decision as to whether
the statement was true or false.
i. False. If A and B are mutually exclusive events, then P (A ∩ B) = 0.
ii. True. A normal distribution is symmetric about its mean, hence we have that
P (X ≥ 8) = 0.5.
iii. False. A p-value is a probability, so it must lie in [0, 1].
iv. False. A Type II error is failing to reject a false null hypothesis. (The probability
of rejecting a false null hypothesis is the power of the test.)
v. False. Either ‘item non-response occurs when a sampled member fails to respond to a
question in the questionnaire’, or ‘unit non-response occurs when no information is
collected from a sample member’.

(5 marks)

This question covers material on basic probability which can be found in Chapter 5 of the
subject guide. Particular focus is given to the total probability and Bayes’ formulae in
Section 5.10. Related exercises to test yourself against these types of questions are the
Activities 5.2 and 5.3, the Learning activity 7 and the Sample examination question 4.
The solution of this exercise requires the following steps. Note, however, that these steps
can be performed in a different order.
• P (B c ) = 1 − P (B) = 1 − 0.15 = 0.85.
• P (A ∩ B) = P (A | B) P (B) = 0.60 × 0.15 = 0.09.
• P (A) = P (A | B) P (B) + P (A | B c ) P (B c ) = 0.60 × 0.15 + 0.70 × 0.85 = 0.685.
• P (B | A) = P (A ∩ B)/P (A) = 0.09/0.685 = 0.1314.
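The steps above can be sketched as a direct calculation. The variable names are illustrative; the three input probabilities are the ones used in the solution.

```python
p_b = 0.15              # P(B)
p_a_given_b = 0.60      # P(A | B)
p_a_given_not_b = 0.70  # P(A | B^c)

p_not_b = 1 - p_b                              # complement rule
p_a_and_b = p_a_given_b * p_b                  # multiplication rule
p_a = p_a_and_b + p_a_given_not_b * p_not_b    # total probability formula
p_b_given_a = p_a_and_b / p_a                  # Bayes' formula
```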
(e) The random variable X takes the values −1, 1 and 3 according to the following
x −1 1 3
pX (x) 0.10 k 5k
i. Determine the constant k and, hence, write down the probability distribution
of X.
(6 marks)

probability and probability distributions. Reading from Chapter 5 in the subject guide is
suggested with a focus on the sections of these topics. Try Activity 5.1 and the exercises on
probability trees.
i. $\sum_i p(x_i) = 0.10 + 6k = 1$, hence k = 0.15. Therefore, the probability distribution is:
x −1 1 3
pX (x) 0.10 0.15 0.75
ii. We have:
\[ E(X) = \sum_i x_i \, p(x_i) = (-1) \times 0.10 + 1 \times 0.15 + 3 \times 0.75 = 2.3. \]
iii. We have:
\[ E(X^2) = \sum_i x_i^2 \, p(x_i) = (-1)^2 \times 0.10 + 1^2 \times 0.15 + 3^2 \times 0.75 = 7 \]
hence:
\[ \text{Var}(X) = 7 - (2.3)^2 = 1.71. \]
The same value follows from the definition $\text{Var}(X) = \sum_i (x_i - \mu)^2 \, p(x_i)$,
where µ = E(X) was found in part ii.
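The whole calculation for this distribution can be reproduced in a few lines. The sketch computes the variance both ways, via E(X²) − (E(X))² and via the sum of squared deviations from µ, to show that the two routes agree.

```python
values = [-1, 1, 3]

# Probabilities must sum to 1: 0.10 + k + 5k = 1
k = (1 - 0.10) / 6
probs = [0.10, k, 5 * k]

mean = sum(v * p for v, p in zip(values, probs))
second_moment = sum(v**2 * p for v, p in zip(values, probs))

variance = second_moment - mean**2
variance_from_definition = sum((v - mean) ** 2 * p for v, p in zip(values, probs))
```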
(f ) The scores on a verbal reasoning test are normally distributed with a

population mean of µ = 100 and a population standard deviation of σ = 12.
that the sample mean will be between 96 and 103? (You may use the nearest
values provided in the statistical tables.)
(7 marks)

This section examines the ideas of normal random variables. Read the relevant section of
sample examination questions are relevant. For the second part of the question it is essential
to read Section 6.9.
The first part just requires knowledge of the fact that X is a normal random variable with
mean µ = 100 and variance σ 2 = 144. However, for the second part of the exercise it is
important to note that X̄, the sample mean, is also a normal random variable with mean µ
and variance σ 2 /n. Direct application of this fact then yields that:
\[ \bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right) = N(100, 6). \]
For both parts, the basic property of normal random variables for this question is that if
X ∼ N (µ, σ 2 ), then:
\[ Z = \frac{X - \mu}{\sigma} \sim N(0, 1). \]
Note also that:

• P (Z < a) = P (Z ≤ a) = Φ(a)
• P (Z > a) = P (Z ≥ a) = 1 − P (Z ≤ a) = 1 − P (Z < a) = 1 − Φ(a)
The above is all you need to find the requested probabilities.
i. We have X ∼ N (100, 144), hence:

\[ P(X \geq 118) = P\left(Z \geq \frac{118 - 100}{12}\right) = P(Z \geq 1.5) = 1 - \Phi(1.5) = 1 - 0.9332 = 0.0668. \]
ii. We have X̄ ∼ N (100, 6), hence:

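\begin{align*}
P(96 \leq \bar{X} \leq 103) &= P\left(\frac{96 - 100}{\sqrt{6}} \leq Z \leq \frac{103 - 100}{\sqrt{6}}\right) = P(-1.63 \leq Z \leq 1.22) \\
&= \Phi(1.22) - (1 - \Phi(1.63)) \\
&= 0.8888 - (1 - 0.9484) \\
&= 0.8372.
\end{align*}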
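Both probabilities can be checked with `statistics.NormalDist` from the Python standard library. This is a sketch, not part of the examined method: the variance of 6 for the sample mean is taken directly from the text above, and the second probability is computed twice, once with z-values rounded to two decimal places (as when using tables) and once without rounding.

```python
import math
from statistics import NormalDist

# i. Single observation: X ~ N(100, 12^2)
p_single = 1 - NormalDist(mu=100, sigma=12).cdf(118)

# ii. Sample mean: X-bar ~ N(100, 6), so the standard deviation is sqrt(6)
xbar = NormalDist(mu=100, sigma=math.sqrt(6))
p_mean_exact = xbar.cdf(103) - xbar.cdf(96)

# Same probability using z-values rounded to two decimals, as in the tables
std_normal = NormalDist()
p_mean_table = std_normal.cdf(1.22) - std_normal.cdf(-1.63)
```

The small gap between the two versions of the second probability comes only from rounding the z-values before looking them up.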
(g) You are told that a 90% confidence interval for a single population proportion is
(0.3853, 0.5147).
(6 marks)

This question refers to confidence intervals for proportions. While all of Chapter 7 of the
subject guide on estimation is relevant, one can focus on the relevant sections covering
confidence intervals for proportions (Section 7.10) and sample size determination (Section
7.11). In terms of exercises, check Examples 7.3 and 7.6, Learning activities 4 and 7, and
Sample examination question 3.
i. The sample proportion, p, must be in the centre of the interval (0.3853, 0.5147). Adding
the two endpoints and dividing by 2 gives p = (0.3853 + 0.5147)/2 = 0.45.
ii. The (estimated) standard error when estimating a single proportion is:
\[ \sqrt{\frac{p(1 - p)}{n}} = \frac{\sqrt{0.45 \times 0.55}}{\sqrt{n}} = \frac{0.4975}{\sqrt{n}}. \]
Since this is a 100 (1 − α)% = 90% confidence interval, then α = 0.1, so the confidence
coefficient is zα/2 = z0.05 = 1.645. Therefore, to determine n we need to solve:
\[ 1.645 \times \frac{0.4975}{\sqrt{n}} = 0.5147 - 0.45 = 0.0647. \]
Solving for n gives n = (1.645 × 0.4975/0.0647)², which is just under 160, so the
correct sample size is n = 160.

Note: In questions regarding sample size determination remember to round up when the
solution is not an integer.
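The recovery of p and n from the interval endpoints can be sketched as below. The variable names are illustrative, and z = 1.645 is the value quoted in the solution.

```python
import math

lower, upper = 0.3853, 0.5147
z = 1.645  # z-value for a 90% confidence interval

p_hat = (lower + upper) / 2   # centre of the interval
margin = upper - p_hat        # half-width = z * standard error

# Rearranging z * sqrt(p(1 - p) / n) = margin for n
n_exact = z**2 * p_hat * (1 - p_hat) / margin**2

# Sample sizes are rounded up to the next integer
n = math.ceil(n_exact)
```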
(h) i. State one advantage and one disadvantage of quota sampling.

ii. State one advantage and one disadvantage of telephone interviews.
(4 marks)

This question contains material on sample surveys which can be found in Chapter 10 of the
subject guide. While the entire chapter is relevant and you are advised to read it all, focus
can be given to Section 10.7 (more specifically Section 10.7.1) regarding quota sampling and
Section 10.11 regarding telephone interviews.
Note that for these types of questions there is no single correct answer. Below are some
suggested ‘good’ answers for each part.
i. – Possible advantages: useful in the absence of a sampling frame; speed; cost.
– Possible disadvantages: standard errors not measurable; systematically biased due to
interviewer; no guarantee of representativeness.
ii. – Possible advantages: easy to achieve a large number of interviews; easy to check on
quality of interviewers.
– Possible disadvantages: not everyone has a telephone so the sample can be biased;
cannot usually show samples; although telephone directories exist for landline
numbers, what about mobile telephone numbers? Also, young people are more likely
to use mobile telephones rather than landline telephones, so are more likely to be
excluded.
Section B
Question 2
(a) A survey was conducted in order to examine whether the final grade of students
taking a class is associated with their attendance of a revision session a few days
before the examination. The data, consisting of students’ final grades and
revision session attendance, are summarised in the table below.
                                  Final Grade A   Final Grade B   Final Grade C
Attended revision session         56              34              28
Did not attend revision session   44              46              42
would you say there is an association between final grade and attending
revision? Provide a brief justification for your answer.
ii. Calculate the χ2 statistic for the hypothesis of independence between final
grade and attending revision, and test that hypothesis. What do you
conclude?
(13 marks)

This part examines Chapter 9 of the subject guide, i.e. contingency tables and the
chi-squared test. Note that part i. of the question does not require any calculations, just
understanding and interpreting contingency tables. Part ii. is a straightforward chi-squared
test and the reading is also given in Chapter 9. Look also at Learning activity 4.

i. There are some differences in the final grades between students who did and did not
attend the revision session. More specifically, 56% of those who got a final grade A
attended the revision session, whereas only 40% of those who got a final grade C
attended it. Hence there seems to be an association between final grade and attending
revision although this needs to be investigated further.
ii. Set out the null hypothesis that there is no association between attending revision and
final grade against the alternative, that there is an association. Be careful to get these
H0 : No association between attending revision and final grade.
H1 : Association between attending revision and final grade.
Final Grade A Final Grade B Final Grade C
Attended revision session 47.20 37.76 33.04
Did not attend revision session 52.80 42.24 36.96
\[ \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} \]
which gives a test statistic value of 5.273. This is a 2 × 3 contingency table so the degrees
of freedom are (2 − 1) × (3 − 1) = 2.
We conclude that there is weak evidence of an association between attending revision
and final grade.
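The expected frequencies and the test statistic can be reproduced from the observed table. In this sketch the critical values 4.605 (10% level) and 5.991 (5% level) for 2 degrees of freedom are standard chi-squared table values supplied here as an assumption, since they do not appear in the text above.

```python
observed = [
    [56, 34, 28],  # attended revision session
    [44, 46, 42],  # did not attend revision session
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
total = sum(row_totals)

# Expected frequency under independence: row total * column total / grand total
expected = [[r * c / total for c in col_totals] for r in row_totals]

chi2 = sum(
    (o - e) ** 2 / e
    for o_row, e_row in zip(observed, expected)
    for o, e in zip(o_row, e_row)
)
df = (len(observed) - 1) * (len(observed[0]) - 1)

# Assumed table values for 2 degrees of freedom
reject_at_10 = chi2 > 4.605
reject_at_5 = chi2 > 5.991
```

Rejecting at the 10% level but not at the 5% level corresponds to the 'weak evidence' conclusion stated above.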
(b) You work for a market research company and your manager has asked you to
carry out a random sample survey for a laptop company to identify whether a
new laptop model is attractive to females. The main concern is to produce
results of high accuracy. You are being asked to prepare a brief summary
containing the items below.
ii. Describe the sampling frame and the method of contact you will use. Briefly
explain the reasons for your choices.
iii. Provide an example in which selection bias may occur. State an action that
you would take to address this issue.
(12 marks)

One of the main things to avoid in this part is to write ‘essays’ without any structure. This
exercise asks for specific things and each one of them requires only one or two lines of
response. If you are unsure of what these things are, do not write lengthy answers. This
is a waste of your valuable examination time. If you can identify what is being asked, keep
in mind that the answer should not be long.
Note also that in some cases there is no single right answer to the question. Some suggested
answers are given below.
i. We are asked for accuracy and random (probability) sampling, so a reasonable option
is the use of stratified random sampling which is known to produce results of high
accuracy. An example of a sampling scheme could be ‘a stratified sample of those
customers who bought this laptop recently’.
ii. The question requires:
∗ a description of a sampling frame
∗ a justification of its choice
∗ mentioning a (sensible) contact method
∗ stating an advantage of the contact method mentioned above.
A suggested answer is given below.
Use a list provided by retailers to identify those who bought this laptop model recently.
The list could include the postal address, telephone or email. Stratification can be made
by area of country or by gender of buyer. Finally, an explanation as to which you would
prefer – for example, email is fast if all have it but there may be a lot of non-response.
iii. The question requires an example of selection bias and an action suggested to address
this issue.
For example, retailers’ records may be incomplete. Offer incentives to make sure they
keep accurate records.
iv. A suggested answer for the question is ‘How does preference for the laptop model
compare for men and women?’.
In terms of variables one could mention ‘gender’ and ‘buying preference’.
Question 3
(a) A study was conducted to determine whether the yield of olive oil is associated
with the average temperature of the area. The data in the table below provide
the average kilograms of olive oil per tree (y) and the average temperature (x),
measured in degrees Celsius. The data correspond to areas taken for 12
different countries.
Average temperature (x)   5    7.5   5    7    8    3    2    8    11   4    5    8
Olive oil yield (y)       10   20    15   17   25   5    2    13   30   3    20   10
label the diagram.
scatter diagram.
iv. Based on the regression model above, what olive oil yield would you expect
in an area with average temperature of 5 degrees Celsius? Would you trust
(13 marks)

This is a standard regression question and the relevant reading is to be found in Chapter 12
of the subject guide. Section 12.6 provides details for scatter diagrams and is suitable for
part i., whereas the remaining parts are on correlation and regression which are covered in
Sections 12.8–12.10. Section 12.7 is also relevant. Sample examination question 2 of this
chapter is also recommended for practice on questions of this type.
which should include an informative title (‘Scatter diagram’ alone will not suffice) and
labelled axes which also state the units. Far too many candidates threw away marks by
neglecting these points and, consequently, were only given one mark out of the possible
four allocated to this part of the question. Another common way of losing marks was
any marks for this part of the question.
[Figure: scatter diagram titled 'yield of olive oil vs average temperature'. Horizontal axis: average temperature in degrees Celsius. Vertical axis: kilograms of olive oil yield per tree.]
coefficient (make sure you know which one it is!) to obtain the value r = 0.8049. An
interpretation of this value is the following – the data suggest that the higher the average
temperature, the higher the olive oil yield. The fact that the value is very close to 1
suggests that this is a strong, positive, linear association.
of these words provide useful information on interpreting the association and are,
formula for b is:
\[ b = \frac{\sum x_i y_i - n\bar{x}\bar{y}}{\sum x_i^2 - n\bar{x}^2}. \]
The formula for a is a = ȳ − bx̄, and we get a = −2.641.
ŷ = −2.641 + 2.744x or y = −2.641 + 2.744x + ε.
The line should also be plotted on the scatter diagram.

expression is false; one of the two equations above is required.
Also, many candidates did not draw the calculated line on the scatter diagram; instead
they drew an approximate line trying to go around the points but without reference to
the above equation. No marks were awarded in such cases.
around a straight line. Hence a linear regression model does seem to be a good model
here. According to the model, the expected olive oil yield with average temperature of 5
degrees Celsius is:
−2.641 + 2.744 × 5 ≈ 11.079 kg per tree.
Many candidates did not provide units here. It is essential to do so in order to obtain full
marks.
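The correlation coefficient, the regression coefficients and the prediction for this question can all be reproduced from the twelve data points. The sketch below uses the same sums-of-products formulas as the solution; the variable names are illustrative.

```python
import math

# Average temperature (x, degrees Celsius) and olive oil yield (y, kg per tree)
x = [5, 7.5, 5, 7, 8, 3, 2, 8, 11, 4, 5, 8]
y = [10, 20, 15, 17, 25, 5, 2, 13, 30, 3, 20, 10]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Corrected sums of squares and products
s_xy = sum(xi * yi for xi, yi in zip(x, y)) - n * x_bar * y_bar
s_xx = sum(xi**2 for xi in x) - n * x_bar**2
s_yy = sum(yi**2 for yi in y) - n * y_bar**2

r = s_xy / math.sqrt(s_xx * s_yy)  # sample correlation coefficient
b = s_xy / s_xx                    # slope
a = y_bar - b * x_bar              # intercept

predicted_yield_at_5 = a + b * 5   # kg per tree
```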
(b) A survey was conducted in order to compare the average delivery times (in
minutes) between two pizza companies operating in the same area. A random
sample was drawn consisting of various pizza orders from both companies and
the delivery times were recorded. The data are summarised in the following
table:
                   Sample size   Average delivery time   Sample standard deviation
Pizza Company A    41            29.0                    1.9
Pizza Company B    29            27.5                    1.1

in the average delivery times between the two companies. State clearly the
your findings.
iii. Give a 98% confidence interval for the mean delivery time for Pizza
Company B.
(12 marks)

population means. While all of Chapter 8 on hypothesis testing is relevant, one can focus on
Section 8.16, and in particular Sections 8.16.2 and 8.16.3 as the variances are unknown. The
third part of this exercise refers to confidence intervals for a single mean with Section 7.9
being the most relevant (with unknown variance).
i. Let µ1 denote the mean delivery time for pizza company A and µ2 denote the mean
delivery time for pizza company B.
The wording ‘whether there is a difference in the average delivery times’ implies a
two-sided test, hence the hypotheses can be written as:
H0 : µ1 = µ2 vs. H1 : µ1 ≠ µ2.
provided on the formula sheet:
\[ \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{s_p^2 (1/n_1 + 1/n_2)}} \quad \text{or} \quad \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{s_1^2/n_1 + s_2^2/n_2}}. \]
If equal variances are assumed, the test statistic value is 3.8180. If equal variances are
not assumed, the test statistic value is 4.1639.
The variances are unknown but the sample size is large enough, so the standard normal
correct and will be used in what follows.
The critical values at the 5% significance level are ±2.000, hence we reject the null
hypothesis. If we take a (smaller) α of 1%, the critical values are ±2.660, so we still reject
H0 . We conclude that there is strong evidence of a difference between the two companies.
ii. The assumptions for part i. relate to the following.
satisfied.
• Assumption about normality.
were not awarded in such cases. Also, some other candidates just memorised the phrase
‘assumption about equal variances’ and, naturally, were not awarded any marks. One
iii. It is important to identify the correct formula for this confidence interval and substitute
correctly the elements required. Assuming a t distribution with 28 degrees of freedom,
the correct t-value is 2.467. The interval can be worked out as:
\[ 27.5 \pm 2.467 \times \frac{1.1}{\sqrt{29}}. \]
Finally, it is important to report it as an interval, i.e. (26.996, 28.004).
Question 4
(a) A large company is checking the salaries of its employees regularly to get an
idea of their distribution. The data below are the salaries (in $000s per year
before tax) of 30 employees.
39 40 44 47 32
37 25 71 56 33
64 63 42 43 34
25 28 35 24 45
35 22 53 55 36
46 46 27 27 38
paper provided.
ii. Find the mean, the median, the interquartile range and the modal group on
the histogram.
(13 marks)

histograms can be found in Section 4.7.3, although all of Sections 4.7, 4.8 and 4.9 are highly
relevant.

Marks were awarded for including an informative title, correct labelling and accurately
drawing the figure. Note that it is essential (and more convenient) to draw the figure on
the graph paper provided; marks were withheld otherwise.
[Figure: histogram titled 'Histogram of employee salaries'. Horizontal axis: salary of employees (in $000s), from 20 to 80. Vertical axis: frequency density.]
ii. ∗ Mean: $40,400. Note: Make sure to mention the units to get the full marks.
∗ Median: $38,500. Note: The raw data should be used.
∗ Modal group: between $30,000 and $39,000. Note: between $31,000 and $40,000
would also be acceptable.
∗ Correct values of quartiles. Q1 = $32,250 and Q3 = $46,000. Note: Any reasonable
method for quartile calculations would be acceptable.
∗ Interquartile range: $46,000 − $32,250 = $13,750.
iv. The distribution of the data appears to be slightly positively/right-skewed. This is also
supported by the fact that the mean is larger than the median.
(b) i. A doctor is conducting an experiment to test whether a new treatment for a

disease is effective. In this context, a treatment is considered effective if it is
successful with a probability of more than 0.5. The treatment was applied to
40 randomly sampled patients and it was successful for 27 of them. You are
asked to use an appropriate hypothesis test to determine whether the
treatment is effective in general. State the test hypotheses, and specify your
test statistic and its distribution under the null hypothesis. Comment on
your findings.
group of 30 randomly sampled patients. A placebo pill contains no
medication and is prescribed so that the patient will expect to get well. In
some situations, this expectation is enough for the patient to recover. This
effect, also known as the placebo effect, occurred in the second experiment
where 17 patients recovered. You are asked to consider an appropriate
hypothesis test to incorporate this new evidence with the previous data and
reassess the effectiveness of the new treatment.
(12 marks)

While the wording of the exercise may appear complicated, it does in fact only refer to
hypothesis testing on proportions. Read Sections 8.14 and 8.15 of the subject guide about
hypothesis testing for a single proportion (the first part of the exercise) and for testing
differences between population proportions (the second part of the exercise).
i. Let πT denote the true probability for the new treatment to work. We can use the
following test.
∗ Test H0 : πT = 0.5 vs. H1 : πT > 0.5.
∗ Standard error: $\sqrt{0.5 \times (1 - 0.5)/40} = 0.079$.
∗ Test statistic value: 2.214.
∗ Decision: reject H0 at the 5% significance level.
∗ Evidence that the treatment is effective.
ii. Let πP denote the true probability for the patient to recover with the placebo.
∗ Test H0 : πT = πP vs. H1 : πT > πP .
For reference, the test statistic is:
\[ \frac{\pi_T - \pi_P}{\text{s.e.}(\pi_T - \pi_P)} \]
which follows a standard normal distribution, approximately, due to the central limit
theorem.
∗ Calculation of standard error:
\[ \text{s.e.}(\pi_T - \pi_P) = \sqrt{\frac{44}{70} \times \frac{26}{70} \times \left(\frac{1}{40} + \frac{1}{30}\right)} = 0.117. \]
∗ The test statistic value is:
\[ \frac{27/40 - 17/30}{0.117} = 0.928. \]
∗ Do not reject H0 at the 5% significance level.
∗ Insufficient evidence of higher effectiveness than the placebo effect.
