
Examiners' commentaries 2013



ST104a Statistics 1 Important note
This commentary reflects the examination and assessment arrangements for this course in the academic year 2012–13. The format and structure of the examination may change in future years, and any such changes will be publicised on the virtual learning environment (VLE). A change that took place from 2011–12 onwards is the presence of a formula sheet. The purpose of this change is to encourage candidates to devote more time to understanding the key concepts of the syllabus rather than memorising a large number of formulae. Nevertheless, candidates should not rely on this formula sheet entirely but only use it for verification. The formula sheet is available on the VLE.

Information about the subject guide


Unless otherwise stated, all cross-references will be to the latest version of the subject guide (2011). You should always attempt to use the most recent edition of any Essential reading textbook, even if the commentary and/or online reading list and/or subject guide refers to an earlier edition. If different editions of Essential reading are listed, please check the VLE for reading supplements; if none are available, please use the contents list and index of the new edition to find the relevant section.

General remarks
Learning outcomes
By the end of this course, and having completed the Essential reading and activities, you should: be familiar with the key ideas of statistics that are accessible to a candidate with a moderate mathematical competence; be able to routinely apply a variety of methods for explaining, summarising and presenting data and interpreting results clearly using appropriate diagrams, titles and labels when required; be able to summarise the ideas of randomness and variability, and the way in which these link to probability theory to allow the systematic and logical collection of statistical techniques of great practical importance in many applied areas; have a grounding in probability theory and some grasp of the most common statistical methods; be able to perform inference to test the significance of common measures such as means and proportions and conduct chi-squared tests of contingency tables; be able to use simple regression and correlation analysis and know when it is appropriate to do so.


Planning your time in the examination


You have two hours to complete this paper, which is in two parts. The first part, Section A, is compulsory; it covers several subquestions and accounts for 50 per cent of the total marks. Section B contains three questions, each worth 25 per cent, from which you are asked to choose two. Remember that each of the Section B questions is likely to cover more than one topic. In 2013, for example, the first part of Question 2 asked for a chi-squared test and survey design problems appeared in the second. The first part of Question 3 was on regression and involved drawing a diagram, while the second part was a hypothesis test comparing population means using the sample data given. Question 4 had a series of questions involving drawing diagrams, hypothesis testing and confidence intervals. This means that it is really important that you make sure you have a reasonable idea of what topics are covered before you start work on the paper!
We suggest you divide your time as follows during the examination:
Spend the first 10 minutes annotating the paper. Note the topics covered in each question and subquestion.
Allow yourself 45 minutes for Section A. Don't allow yourself to get stuck on any one question, but don't just give up after two minutes!
Once you have chosen your two Section B questions, give them about 25 minutes each.
This leaves you with 15 minutes. Do not leave the examination hall at this point! Check over any questions you may not have completely finished. Make sure you have labelled and given a title to any tables or diagrams which were required and, if you did more than the two questions required in Section B, decide which one to delete. Remember that only two of your answers will be given credit in Section B and that you must choose which these are!

What are the Examiners looking for?


The Examiners are looking for very simple demonstrations from you. They want to be sure that you: have covered the syllabus as described and explained in the subject guide; know the basic formulae given there and when and how to use them; understand and answer the questions set.

You are not expected to write long essays where explanations or descriptions of sample design are required, and note form answers are acceptable. However, clear and accurate language, both mathematical and written, is expected and marked. The explanations below and in the specific commentaries for the papers for each zone should make these requirements clear.

Key steps to improvement


The most important thing you can do is answer the question set! This may sound very simple, but these are some of the things that candidates did not do, though asked, in the 2013 examinations! Remember:
If you are asked to label a diagram (which is almost always the case!), please do so. Writing 'Histogram' or 'Stem-and-leaf diagram' in itself is insufficient. What do the data describe? What are the units? What are the x and y axes?
If you are specifically asked to carry out a hypothesis test, or a confidence interval, do so. It is not acceptable to do one rather than the other! If you are asked to find a 5% value, this is what will be marked.
Do not waste time calculating things which are not required by the Examiners. If you are asked to find the line of best fit, you will get no marks for calculating the correlation coefficient as well. If you are asked to use the confidence interval you have just calculated to comment on the results, carrying out an additional hypothesis test will not help your marks.


How should you use the specific comments on each question given in the Examiners' commentaries?
We hope that you find these useful. For each question and subquestion, they give: further guidance for each question on the points made in the last section; the answers, or keys to the answers, which the Examiners were looking for; the relevant detailed reference to Newbold (seventh edition) and the subject guide (2011); where appropriate, suggested activities from the subject guide which should help you to prepare, and similar questions from Newbold.

Any further references you might need are given in the part of the subject guide to which you are referred for each answer.

Question spotting
Many candidates are disappointed to find that their examination performance is poorer than they expected. This can be due to a number of different reasons and the Examiners' commentaries suggest ways of addressing common problems and improving your performance. We want to draw your attention to one particular failing: 'question spotting', that is, confining your examination preparation to a few question topics which have come up in past papers for the course. This can have very serious consequences. We recognise that candidates may not cover all topics in the syllabus in the same depth, but you need to be aware that Examiners are free to set questions on any aspect of the syllabus. This means that you need to study enough of the syllabus to enable you to answer the required number of examination questions. The syllabus can be found in the Course information sheet in the section of the VLE dedicated to this course. You should read the syllabus very carefully and ensure that you cover sufficient material in preparation for the examination. Examiners will vary the topics and questions from year to year and may well set questions that have not appeared in past papers; every topic on the syllabus is a legitimate examination target. So although past papers can be helpful in revision, you cannot assume that topics or specific questions that have come up in past examinations will occur again. If you rely on a question spotting strategy, it is likely you will find yourself in difficulties when you sit the examination paper. We strongly advise you not to adopt this strategy.


Comments on specific questions: Zone A


Candidates should answer THREE of the following FOUR questions: QUESTION 1 of Section A (50 marks) and TWO questions from Section B (25 marks each). Candidates are strongly advised to divide their time accordingly.
Section A
Answer all parts of Question 1 (50 marks in total).
Question 1
(a) Classify each one of the following variables as measurable (continuous) or categorical. If a variable is categorical, further classify it as nominal or ordinal. Justify your answer. (Note that no marks will be awarded without justification.)
i. Country of birth.
ii. Favourite brand of soft drink.
iii. Rank of country by academic quality according to ratings given by educational specialists.
iv. Temperature in degrees Celsius.
(8 marks)


Reading for this question
This question requires identifying types of variable, so reading the relevant section in the subject guide (Section 3.6) is essential. Candidates should gain familiarity with the notion of a variable and be able to distinguish between discrete and continuous (measurable) data. In addition to identifying whether a variable is categorical or measurable, candidates should distinguish further between ordinal and nominal categorical variables.
Approaching the question
A general tip for identifying continuous and categorical variables is to think of the possible values they can take. If these are finite and represent specific entities, the variable is categorical. Otherwise, if these consist of numbers corresponding to measurements, the data are continuous and the variable is measurable. Such variables may also have measurement units or can be measured to various decimal places.
i. Each country is a category, so there is one possible value for each country. Hence, the variable is categorical. Note also that countries do not have a natural ordering, so this is a categorical nominal variable.
ii. Each brand of soft drink is a category and is a potential value of this variable. Hence, the variable is categorical. Moreover, brands of soft drinks do not have a natural ordering, therefore this categorical variable is on a nominal scale.
iii. Each rank is a category, therefore this is a categorical variable. The values of this variable are the ranks of each country. By definition the categories (ranks) are ordered, thus resulting in a (categorical) ordinal variable.
iv. The data represent temperatures that can be measured to many decimal places; e.g. 19.234 degrees Celsius. This is, therefore, a measurable variable.
Weak candidates did not provide a justification for their choices, described measurable variables as nominal or categorical, and sometimes answered ordinal when their justification pointed to a nominal variable. Writing 'It is measurable because it can be measured' will not result in a high mark.
(b) The table below contains the ages of the volunteers for a project in two different years:

2011:  20  18  38  18  20  18
2012:  20  22  18  22  20  22  24  22  20

i. Find the mean age and the median age for each year.
ii. Calculate the range of the ages for each year and give an explanation for any differences you find.
iii. Calculate the standard deviation of the ages for each year and give an explanation for any differences you find.
iv. Comment on the differences in the mean and median for the two years that you found in part i. For this data set, which do you think would give a better description of the difference in ages: the mean or the median? Explain briefly.
(12 marks)
Reading for this question
This question contains material mostly from Chapter 3 of the subject guide and in particular Section 3.8 (Measures of location) for parts (i) and (iv), and Section 3.9 (Measures of spread) for parts (ii) and (iii).
Approaching the question
It is important to do the summation carefully and divide by the correct number of observations to obtain the mean. For questions that require calculations on the median (or other percentiles like quartiles), a good strategy is to write the observations in order. Note also that this question requires these measures for both years, so the calculations should be done for each year separately.
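The summary measures asked for in parts i to iii can be checked with a short script. The following is only a sketch using Python's standard `statistics` module; the function name `summarise` and the dictionary layout are illustrative choices, not part of the commentary, and the standard deviation is assumed to be the sample version (divisor n − 1).

```python
import statistics

# Ages of the volunteers, copied from the table in part (b).
ages = {
    2011: [20, 18, 38, 18, 20, 18],
    2012: [20, 22, 18, 22, 20, 22, 24, 22, 20],
}


def summarise(values):
    """Return mean, median, range and sample standard deviation."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "range": max(values) - min(values),
        # statistics.stdev divides by n - 1 (sample standard deviation).
        "sd": statistics.stdev(values),
    }


summaries = {year: summarise(values) for year, values in ages.items()}

for year, s in summaries.items():
    print(year, {name: round(value, 2) for name, value in s.items()})
```

In an examination the same quantities still have to be worked out by hand, with the intermediate sums shown, so a script like this is useful only for checking practice answers.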


i. In order to calculate the two means, you should sum the numbers corresponding to each year and then divide them by the number of observations in each row. Doing so yields (20 + 18 + 38 + 18 + 20 + 18)/6 = 22, for 2011, and (20 + 22 + 18 + 22 + 20 + 22 + 24 + 22 + 20)/9 = 21.11, for 2012. For the median, if we put the numbers in ascending order we get 18 18 18 20 20 38, for 2011, and 18 20 20 20 22 22 22 22 24, for 2012. The median for 2011 is given by taking the average of the 3rd and the 4th numbers in the first of the rows above, resulting in a value of (18 + 20)/2 = 19. The median for 2012 is the 5th number in the 2nd row above, which is 22.
ii. Note that the range of a variable equals the difference between the maximum value and the minimum value. Hence, the range for 2011 was 38 − 18 = 20, whereas the range for 2012 was 24 − 18 = 6. Some candidates answered 'from 18 to 38'. While this is true, note that it does not correspond to the definition of the range, so it is essential to give the numbers 20 (2011) and 6 (2012) in your answer. It is also essential to comment on the different ranges between 2011 and 2012. The difference is big and is caused by the outlier 38 in 2011. Some candidates confused 'Range' and 'Interquartile range'. Make sure that you identify what is being asked.
iii. In order to answer this question, candidates should be familiar with Section 3.9.3 (on variance and standard deviation) and the chapter activities. It is very important to show your work with relevant summations of the squared deviations from the mean. In this way you may get some marks even if the numerical answer is wrong, as you are demonstrating knowledge of the method. The answer for 2011 is 7.90, whereas for 2012 it is 1.76. It is also essential to comment on the different standard deviations between 2011 and 2012. The difference is big and is caused by the outlier 38 in 2011.
iv. The mean is higher in 2011 but the median is higher in 2012. This can be attributed to the fact that 2011 contains an outlier (38) which results in a high mean. Apart from this outlier, ages tend to be higher in 2012, so the median gives a somewhat better indication of the typical age for each year.
(c) Monthly household expenditure in country A is normally distributed with a mean of 1200 per week and a standard deviation of 400 per week. In country B it is also normally distributed but with a mean of 960 per week and a standard deviation of 200 per week. Which country has a higher proportion of households spending less than 800? (4 marks)
Reading for this question
This section examines the ideas of the normal random variable. Read the relevant section of Chapter 5 and work through the examples and activities of this section. The Sample examination questions are quite relevant.
Approaching the question
The basic property of the normal random variable for this question is that if $X \sim N(\mu, \sigma^2)$, then $Z = (X - \mu)/\sigma \sim N(0, 1)$. Note also that
$P(Z < a) = P(Z \le a) = \Phi(a)$
$P(Z > a) = P(Z \ge a) = 1 - P(Z \le a) = 1 - P(Z < a) = 1 - \Phi(a)$
$P(a < Z < b) = P(a \le Z < b) = P(a < Z \le b) = P(a \le Z \le b) = \Phi(b) - \Phi(a)$.


The above is all you need to find the requested proportions:
$$P(X < 800) = P\left(\frac{X - 1200}{400} < \frac{800 - 1200}{400}\right) = P(Z < -1) = 0.1587$$
$$P(Y < 800) = P\left(\frac{Y - 960}{200} < \frac{800 - 960}{200}\right) = P(Z < -0.8) = 0.2119.$$
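These two table look-ups can be cross-checked numerically. The sketch below assumes Python 3.8 or later, where the standard library provides `statistics.NormalDist`; the variable names are illustrative only.

```python
from statistics import NormalDist

# Parameters as stated in the question: mean and standard deviation.
country_a = NormalDist(mu=1200, sigma=400)
country_b = NormalDist(mu=960, sigma=200)

threshold = 800

# cdf(x) returns P(X <= x), which equals P(X < x) for a continuous variable.
p_a = country_a.cdf(threshold)
p_b = country_b.cdf(threshold)

# The same probabilities via standardisation, Z = (X - mu) / sigma.
z = NormalDist()  # standard normal, N(0, 1)
p_a_std = z.cdf((threshold - 1200) / 400)
p_b_std = z.cdf((threshold - 960) / 200)

print(f"Country A: {p_a:.4f}, Country B: {p_b:.4f}")
```

Both routes give the same values, which is the point of standardising: one table for $N(0, 1)$ serves every normal distribution.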

So country B has a higher proportion of households spending less than 800.
(d) We would like to design a survey to estimate the average number of hours university students spend studying per week. How many students must we randomly select to be 95 percent confident that the sample mean is within 2 hours of the population mean? Assume that a previous survey has shown that the standard deviation of hours spent studying is 6.95 hours. (3 marks)
Reading for this question
All of Chapter 6 is relevant, but the main reading for this question can be found in Section 6.1 (Choosing a sample size). It is essential to read this section carefully and attempt the activities and exercises.
Approaching the question
This question asks you to determine a sample size. This is straightforward once the distribution is identified. Since the sample size is large, a normal distribution can be used. The working is given below:
Identify the correct $z$-value: 1.96.
Solve $1.96 \times \sigma/\sqrt{n} = 2$.
We can take $\sigma = 6.95$ to find $n = (1.96 \times 6.95/2)^2 = 46.39$. Round up to $n = 47$.
Some candidates forgot to round up. Remember that you are asked about a sample size.
(e) Suppose that $x_1 = 4$, $x_2 = -3$, $x_3 = 5$, $x_4 = 0$, $x_5 = 3$, and $y_1 = 3$, $y_2 = 2$, $y_3 = 1$, $y_4 = 0$, $y_5 = 1$. Calculate the following quantities:
i. $\sum_{i=1}^{5} x_i$
ii. $\sum_{i=2}^{5} 2x_i(y_i + 1)$
iii. $x_2^2 + \sum_{i=1}^{3} (x_i + y_i^3)$

(6 marks)
Reading for this question
This question refers to the basic bookwork which can be found in Section 1.9 of the subject guide and, in particular, in Activity A1.6.
Approaching the question
Be careful to leave the $x$s and $y$s in the order given and only cover the values of $i$ asked for. This question was generally done well; the answers are:
i. $\sum_{i=1}^{5} x_i = 4 + (-3) + 5 + 0 + 3 = 9$. (1 mark)
ii. $\sum_{i=2}^{5} 2x_i(y_i + 1) = 2\sum_{i=2}^{5} x_i(y_i + 1) = 2(-3 \times (2 + 1) + 5 \times (1 + 1) + 0 \times (0 + 1) + 3 \times (1 + 1)) = 2 \times 7 = 14$. (2 marks)
iii. $x_2^2 + \sum_{i=1}^{3} (x_i + y_i^3) = (-3)^2 + (4 + 3^3) + (-3 + 2^3) + (5 + 1^3) = 9 + 31 + 5 + 6 = 51$. (3 marks)
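The three sums can be verified mechanically, which is a useful way to catch slips with the index limits. This is a minimal sketch; the helper `term_range` is a hypothetical convenience for translating the one-based subscripts of the question into Python's zero-based list positions, and it assumes $x_2 = -3$ as used in the worked answer to part i.

```python
# Values from part (e); position 0 of each list holds x_1 (or y_1).
x = [4, -3, 5, 0, 3]
y = [3, 2, 1, 0, 1]


def term_range(first, last):
    """Zero-based positions for the one-based subscripts first..last."""
    return range(first - 1, last)


# i. Sum of x_i for i = 1..5.
part_i = sum(x[i] for i in term_range(1, 5))

# ii. Sum of 2 * x_i * (y_i + 1) for i = 2..5.
part_ii = sum(2 * x[i] * (y[i] + 1) for i in term_range(2, 5))

# iii. x_2 squared, plus the sum of (x_i + y_i cubed) for i = 1..3.
part_iii = x[1] ** 2 + sum(x[i] + y[i] ** 3 for i in term_range(1, 3))

print(part_i, part_ii, part_iii)  # prints: 9 14 51
```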


(f) In an introductory economics class, the numbers of males and females are 16 and 24, respectively.
i. A student is selected randomly from the class. What is the probability the student is female?
ii. A student is selected at random and removed from the class. A second student is then selected. What is the probability that one of the students is male and the other is female?
iii. What is the probability that the second student is male, given that the first student is female and removed from the class?
iv. In previous years it was found that 80% of males pass the exam and 85% of females pass the exam. Based on the available information, find the probability that a student who passes the examination is female.
(8 marks)
Reading for this question
This is a question on probability and targets mostly the material covered in Chapter 4. It is essential to practise this area by attempting the chapter activities and exercises as well as accessing the material on the VLE. In particular you should attempt Activity A4.6 and Sample examination question 4. It is also useful to familiarise yourself with probability trees as they can be quite useful when completing such exercises.
Approaching the question
The first three parts were straightforward for those who were familiar with this section. Part (iv) required knowledge of Bayes' formula or a very good understanding of probability trees. The working is shown below:
i. There are 24 females and 16 males in the class. Hence the answer is 24/(16 + 24) = 24/40 = 0.600.
ii. The correct answer here is $\frac{16}{40} \times \frac{24}{39} + \frac{24}{40} \times \frac{16}{39} = 0.492$. Although not necessary, the use of a probability tree would be quite helpful here.
iii. This part can be answered in a similar way to part (i), noting that there are now 16 males and 23 females in the class. Hence 16/39 = 0.410.
iv.
$$P(\text{female} \mid \text{pass}) = \frac{P(\text{pass} \mid \text{female})\,P(\text{female})}{P(\text{pass})} = \frac{0.85 \times 24/40}{P(\text{pass} \cap \text{female}) + P(\text{pass} \cap \text{male})} = \frac{0.85 \times 24/40}{0.85 \times 24/40 + 0.80 \times 16/40} = 0.614.$$
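The four answers above can be reproduced with exact fractions, which avoids rounding until the final step. The sketch below uses `fractions.Fraction` from the Python standard library; all variable names are illustrative.

```python
from fractions import Fraction

males, females = 16, 24
total = males + females

# i. One student drawn at random is female.
p_female = Fraction(females, total)

# ii. Two students drawn without replacement: one male and one female,
#     in either order.
p_mixed = (Fraction(males, total) * Fraction(females, total - 1)
           + Fraction(females, total) * Fraction(males, total - 1))

# iii. Second student is male, given the first was female and removed.
p_male_given_female_first = Fraction(males, total - 1)

# iv. Bayes' formula with the pass rates quoted in the question.
p_pass_given_female = Fraction(85, 100)
p_pass_given_male = Fraction(80, 100)
p_male = 1 - p_female
p_pass = p_pass_given_female * p_female + p_pass_given_male * p_male
p_female_given_pass = p_pass_given_female * p_female / p_pass

for label, p in [("i", p_female), ("ii", p_mixed),
                 ("iii", p_male_given_female_first),
                 ("iv", p_female_given_pass)]:
    print(f"{label}: {p} = {float(p):.3f}")
```

The denominator `p_pass` is the total probability of passing, which corresponds to adding the two 'pass' branches of a probability tree.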

(g) State whether the following are true or false and give a brief explanation. (Note that no marks will be awarded for a simple true/false answer.)
i. In an observational study, a control group provides an essential tool to establish causal relationships.
ii. If two variables are correlated we can conclude that one causes the other.
iii. The mean income of British households can be expected to be larger than the median income of British households.
(6 marks)
Reading for this question
This question contains material from various parts of the subject guide. Here, it is more important to have a good intuitive understanding of the relevant concepts than a technical level of knowledge in computations. Part (i) requires material from Chapter 10 and, in particular, the sections on observational studies and designed experiments. Part (ii) is about correlation and causation detailed in Section 11.7 of the subject guide. Finally part (iii) targets the material covered in Chapter 3.
Approaching the question
Candidates always find this type of question tricky. It requires a brief explanation of the reason for a true/false and not just a choice between the two. Some candidates also lost marks for long rambling explanations without a decision as to whether a statement was true or false.
i. True. A possible way to provide an explanation here is through an example: if we want to establish causal effects of fluoridated water, we need a control group without fluoride in the water, but which is as similar as possible to a group with fluoridated water. Another way is to note that randomised experiments are better tools to establish causal relations, but we may not be able to carry out a proper experiment (see p.156 of the subject guide).
ii. False; the correlation may be spurious. For example, there may be a third variable affecting both variables, leading to a correlation.
iii. In this part it is important to realise that income is typically a right (positively) skewed variable. Hence the statement is true since, due to the right skewness, the mean will be bigger than the median.

(h) In the context of sampling, explain the difference between item non-response and unit non-response. (3 marks)
Reading for this question
This question requires knowledge about sampling and sample surveys. Useful background reading may be found in Chapter 9 of the subject guide. The material directly related to this question, item non-response and unit non-response, appears on p.145. See also the references to Newbold and Carlson given in Chapter 9 of the subject guide.
Approaching the question
The relevant parts of p.145 are that: item non-response occurs when a sampled member fails to respond to a particular question; unit non-response occurs when no information is collected from a sample member. In addition to the definitions supplied above, it would also be useful to give an example.

Section B
Answer two questions from this section (25 marks each).
Question 2
(a) A social survey in the United States asked subjects, 'Would you say that homeopathy is very scientific, sort of scientific, or not at all scientific?' The table below cross-classifies their responses with their highest level of education.

                         Homeopathy is scientific
Highest degree           Very        Sort of      Not at all     Total
Less than high school    46 (11%)    168 (41%)    196 (48%)      410 (100%)
High school              100 (5%)    572 (31%)    1148 (63%)     1820 (100%)
College or higher        32 (2%)     248 (18%)    1076 (79%)     1356 (100%)
Total                    178 (5%)    988 (28%)    2420 (67%)     3586 (100%)


i. Based on the data in the table, and without doing a significance test, how would you describe the relationship between education and opinion on whether or not homeopathy is scientific? (4 marks)
ii. Calculate the $\chi^2$ statistic and use it to test for independence, using a 1% significance level. What do you conclude? (9 marks)
Reading for this question
This part targets Chapter 8 on contingency tables and chi-squared tests. Note that part (i) of the question does not require any calculations, just understanding and interpreting contingency tables. Candidates can attempt Activity A8.4 to practise. Part (ii) is a straightforward chi-squared test and the reading is also given in Chapter 8.
Approaching the question
i. Using the percentages we see that the higher someone's education, the smaller the belief that homeopathy is very scientific and the higher the belief that it is not at all scientific. For example, 79% of those who attended college or higher education responded that homeopathy is not at all scientific, whereas the corresponding proportion for those with less than high school education is 48%.
ii. Set out the null hypothesis that there is no association between education and views on homeopathy against the alternative, that there is an association. Be careful to get these the correct way round!
H0: No association between education and views on homeopathy
versus
H1: Association between education and views on homeopathy.
Work out the expected values to obtain the table below

Highest degree           Very       Sort of    Not at all
Less than high school    20.3514    112.962    276.687
High school              90.3402    501.439    1228.22
College or higher        67.3084    373.6      915.092

The test statistic formula is
$$\sum_{i,j} \frac{(O_{i,j} - E_{i,j})^2}{E_{i,j}},$$
which gives a value of 187.913. This is a 3 × 3 contingency table so the degrees of freedom are (3 − 1) × (3 − 1) = 4. For $\alpha = 0.05$ the critical value is 9.488, hence we reject H0. For a second (stronger) significance level, say 1%, the critical value is 13.277, hence again we reject H0. We conclude that the association between views on homeopathy and educational level is highly significant.
Many candidates looked up the tables incorrectly and so failed to follow through their earlier accurate work. A larger number did not expand on their results sufficiently. Saying 'we reject at the 5% level, but not at 10%' is insufficient. What does this mean? Is there a connection or not? If there is one, how strong is it? This needed to be answered if the full nine marks allocated for this question were to be given. Many candidates lost marks by failing to follow up like this.
(b) i. Define each of the following:
Simple random sampling
Stratified random sampling.
(4 marks)
ii. Why might a researcher prefer to take a stratified random sample rather than a simple random sample? Give two reasons. (3 marks)


iii. You have been asked to design a nation-wide survey in your country to find out about the smoking habits of adults. Give two stratification factors you might use, and explain why you have chosen them. (5 marks)
Reading for this question
This question on basic material on survey designs required background reading from Chapters 9 and 10 of the subject guide which, along with the recommended reading, should be looked at carefully. Candidates were expected to have studied and understood the main important constituents of design in random sampling. It is also a good idea to try the activities in Chapter 9.
Approaching the question
One of the main things to avoid here is writing an answer without any structure. This exercise asks for specific things and each one of them requires one or two lines. If you are unsure of what these specific things are, do not write lengthy essays. This is a waste of your valuable examination time. If you can identify what is being asked, keep in mind that the answer should not be long. Note also that in some cases there is no unique answer to the question.
i. Simple random sampling: Every sample has equal probability. With replacement.
Stratified random sampling: Population divided into strata (or groups). Random sample from each group.
ii. There are generally two main reasons why one would prefer stratified to simple random sampling.
Potentially more precision of parameter estimates.
Obtain information about subgroups.
iii. In this part you can choose factors based on two arguments. First, you can aim for factors whose subgroups differ regarding smoking habits (e.g. gender, ethnic groups, age groups etc.). In that way the stratified sampling scheme will have increased precision. Alternatively you can just suggest factors that are interesting from a research point of view.
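Returning to part (a) ii of Question 2, the expected counts and the test statistic can be reproduced from the observed table. The following is a sketch in plain Python rather than a statistics package; the 1% critical value is hard-coded from the commentary's own table look-up (13.277 for 4 degrees of freedom), and all names are illustrative.

```python
# Observed counts from the table in Question 2(a).
# Rows: less than high school, high school, college or higher.
# Columns: very, sort of, not at all scientific.
observed = [
    [46, 168, 196],
    [100, 572, 1148],
    [32, 248, 1076],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected count under independence: row total * column total / grand total.
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

# Chi-squared statistic: sum of (O - E)^2 / E over all cells.
chi_squared = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

degrees_of_freedom = (len(observed) - 1) * (len(observed[0]) - 1)

# 1% critical value for 4 degrees of freedom, as quoted in the commentary.
critical_value_1pct = 13.277
reject_h0 = chi_squared > critical_value_1pct

print(f"chi-squared = {chi_squared:.3f}, df = {degrees_of_freedom}, "
      f"reject H0: {reject_h0}")
```

The statistic comes out close to 188, far above the critical value, which is why the commentary describes the association as highly significant.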

Question 3
The level of infant mortality ($y$) is represented by the number of baby deaths for every 1000 births. For 12 areas these are shown in the following table. For each area, the percentage ($x$) of babies born into families earning at least £25,000 is also shown.

Area                    A   B   C   D   E   F   G   H   I   J   K   L
Percentage (x)         20   6  10  21  12  36   6  19  26  13  21  16
Infant mortality (y)    5  17  16   8  15   5  25  12  11  11   7  12

The summary statistics for these data are:
Sum of x data: 206
Sum of the squares of x data: 4356
Sum of y data: 144
Sum of the squares of y data: 2088
Sum of the products of x and y data: 2036
(a) i. Draw a scatter diagram of these data on the graph paper provided. Label the diagram carefully. (4 marks)


ii. Calculate the sample correlation coefficient. Interpret your findings. (3 marks)
iii. Calculate the least squares line of $y$ on $x$ and draw the line on the scatter diagram. (4 marks)
iv. Using the equation you found in iii., obtain the predicted infant mortality for an area where 38% of babies are born into families earning at least £25,000. Do you think this value is realistic? Justify your answer. (2 marks)
Reading for this question
This is a standard regression question and the reading is to be found in Chapter 11. Section 11.6 provides details for scatter diagrams and is suitable for part (i), whereas the remaining parts focus on correlation and regression and are covered in Sections 11.8 to 11.10 of the subject guide. Section 11.7 is also relevant. Sample examination question 2 from this chapter is recommended for practice on questions of this type.
Approaching the question
i. Candidates are reminded that they are asked to draw and label the scatter diagram, which should include a full title ('Scatter diagram' alone will not suffice) and labelled axes, including information about units. Far too many candidates threw away marks by neglecting these points and consequently were only given one mark out of the possible four allocated for this part of the question. Another common way of losing marks was failing to use the graph paper which was provided, and required, in the question. Candidates who drew on the ordinary paper in their booklet were not awarded marks for this part of the question.
[Figure: scatter diagram titled 'Infant mortality and economic class'. x-axis: percentage of babies born into families earning at least 25,000 pounds. y-axis: infant mortality (number of baby deaths for every 1000 births).]

ii. The summary statistics can be substituted into the formula for the correlation coefficient (make sure you know which one it is!) to obtain the value −0.8026. An interpretation of this value is the following: the data suggest that the higher the percentage of families earning at least a certain income, the lower the mortality. The fact that the value is close to −1 suggests that this is a strong (negative) association.
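The correlation coefficient, and the least squares coefficients needed for parts iii and iv, follow directly from the five summary statistics given in the question. The sketch below uses the corrected sums of squares and products; the names `s_xy`, `s_xx`, `s_yy` and `predict` are illustrative and not taken from the subject guide.

```python
import math

# Summary statistics quoted in Question 3.
n = 12
sum_x, sum_x2 = 206, 4356
sum_y, sum_y2 = 144, 2088
sum_xy = 2036

x_bar, y_bar = sum_x / n, sum_y / n

# Corrected sums of squares and products.
s_xy = sum_xy - n * x_bar * y_bar
s_xx = sum_x2 - n * x_bar ** 2
s_yy = sum_y2 - n * y_bar ** 2

# Sample correlation coefficient.
r = s_xy / math.sqrt(s_xx * s_yy)

# Least squares slope and intercept for the line of y on x.
b = s_xy / s_xx
a = y_bar - b * x_bar


def predict(x):
    """Fitted infant mortality (deaths per 1000 births) at percentage x."""
    return a + b * x


print(f"r = {r:.4f}, b = {b:.4f}, a = {a:.4f}, "
      f"prediction at 38% = {predict(38):.3f}")
```

Note that 38% lies above the largest observed percentage (36%), so the fitted value there is an extrapolation and the script cannot say whether it is realistic.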


iii. The regression line can be written as $\hat{y} = a + bx$ or $y = a + bx + \varepsilon$. The formula for b is
$$b = \frac{\sum x_i y_i - n\bar{x}\bar{y}}{\sum x_i^2 - n\bar{x}^2},$$
and by substituting the summary statistics we get b = −0.5319. The formula for a is $a = \bar{y} - b\bar{x}$, so we get a = 21.1314. Hence the regression line can be written as $\hat{y} = 21.1314 - 0.5319x$ or $y = 21.1314 - 0.5319x + \varepsilon$. It should also be plotted on the scatter diagram.
iv. The prediction will be $\hat{y} = 21.1314 - 0.5319 \times 38 = 0.918$ infant mortality (number of baby deaths for every 1,000 births). However, since this point is outside the observed range of x, this prediction should not be trusted as it is based on extrapolation. Many candidates did not give the measurement units here. These are essential in answering such a question and a mark is deducted if they are not specified.

(b) A survey is conducted to compare local public attitudes towards environmental policies. A number of people in two areas of interest are sampled, and asked if they are satisfied with their local environmental policy. The results of this survey are shown in the following table.

                    Area A   Area B
Sample size           168      207
Number satisfied      127      132

i. You are asked to consider an appropriate hypothesis test to determine whether there is a difference between the two areas in the proportion who are satisfied. Test at two appropriate significance levels and comment on your findings. Specify the test statistic you use and its distribution under the null hypothesis. (7 marks)
ii. State clearly any other assumptions you make. (2 marks)
iii. Give a 98% confidence interval for the proportion of people in Areas A and B combined who are satisfied, assuming the respective sample sizes are proportional to population sizes. (3 marks)

Reading for this question
The first two parts of the question refer to a two-sided hypothesis test comparing proportions. While the entire chapter on hypothesis testing is relevant, one can focus on the sections involving proportions (Sections 7.14 and 7.15). The last part of the question is on confidence intervals, which are covered in Chapter 6 and, in particular (confidence intervals for proportions), in Section 6.10.

Approaching the question
i. The null hypothesis is that the proportions of the two areas (A and B) do not differ; the alternative is that they do: $H_0: \pi_A = \pi_B$ versus $H_1: \pi_A \neq \pi_B$. The test statistic is provided in the formula sheet (note that it is based on the pooled variance):
$$Z = \frac{P_A - P_B}{\sqrt{\frac{P(1-P)}{n_A} + \frac{P(1-P)}{n_B}}},$$
where P = (127 + 132)/(168 + 207) = 0.690667. The test statistic value is 2.464 ($P_A$ = 0.7560, $P_B$ = 0.6377, pooled se = 0.0480). The critical value at the 5% level, assuming a normal approximation as the number of observations is large, is 1.96. Hence, we reject the null hypothesis, suggesting evidence for a difference between the two areas. If we take a (smaller) $\alpha$ of 1%, the critical value is 2.576, so we do not reject $H_0$. We conclude that there is some, but not strong, evidence of a difference between the two areas.
ii. The assumptions included:
Assumption about whether $\sigma_A^2 = \sigma_B^2$.
Assumption about whether $n_A + n_B - 2$ is large, hence t v. z.
Assumption about independent samples.

iii. The question is a standard exercise in confidence intervals. Note the question refers to areas A and B combined, so the interval is $P \pm z_{\alpha/2}\sqrt{P(1-P)/n}$ with P = 0.690667 and n = 168 + 207 = 375. The working is given below:
Correct quantile: $z_{\alpha/2}$ = 2.326.
Correct endpoints: 0.635 and 0.746. (Also accept two decimal places.)
Report as an interval: (0.635, 0.746). (Also accept 'between 0.635 and 0.746'.)
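The figures quoted for parts (i) and (iii) can be checked numerically. The following is a minimal sketch using only the Python standard library; the function names are illustrative and not taken from the subject guide, and the normal quantile comes from `statistics.NormalDist` rather than from printed tables.

```python
from math import sqrt
from statistics import NormalDist


def pooled_two_proportion_z(x_a, n_a, x_b, n_b):
    """Return (z, pooled proportion, pooled standard error)."""
    p_a, p_b = x_a / n_a, x_b / n_b
    p = (x_a + x_b) / (n_a + n_b)  # pooled proportion under H0
    se = sqrt(p * (1 - p) / n_a + p * (1 - p) / n_b)
    return (p_a - p_b) / se, p, se


def proportion_ci(x, n, level):
    """Normal-approximation confidence interval for one proportion."""
    p = x / n
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    half_width = z * sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width


# Area A: 127 satisfied out of 168; Area B: 132 satisfied out of 207.
z, p, se = pooled_two_proportion_z(127, 168, 132, 207)
lower, upper = proportion_ci(127 + 132, 168 + 207, 0.98)

print(f"z = {z:.3f}, pooled P = {p:.6f}, pooled se = {se:.4f}")
print(f"reject at 5%: {abs(z) > 1.96}, reject at 1%: {abs(z) > 2.576}")
print(f"98% CI for combined proportion: ({lower:.3f}, {upper:.3f})")
```

The critical values 1.96 and 2.576 are hard-coded from the commentary so that the comparison mirrors what a candidate would do with statistical tables.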

Question 4
(a) i. Carefully construct a box plot on the graph paper provided to display the following yearly incomes of a group of people, measured in £1,000s:
9 6 12 24 21 57 6 15 9 12 30 36
(8 marks)
ii. Based on the shape of the box plot you have drawn, describe the distribution of the data. (2 marks)
iii. Name two other types of graphical displays that would be suitable to represent the data. Briefly explain your choices. (3 marks)

Reading for this question
Chapter 3 provides all the relevant material for this question. More specifically, information on boxplots can be found in Section 3.9.2, but all of Sections 3.8 and 3.9 are highly relevant.

Approaching the question
i. The boxplot the Examiners were hoping to see is shown below. Marks were awarded for including the title, identifying the box and the whiskers, and noting the outlier, at a reasonable accuracy. In order to identify the box, the quartiles are needed; these are 9 and 25.5, giving an interquartile range of 16.5. The median is also needed, which is 13.5. The outlier limits lie 1.5 interquartile ranges beyond the quartiles: 9 − 1.5 × 16.5 = −15.75 and 25.5 + 1.5 × 16.5 = 50.25. Since income cannot be negative, the outlier limits are from 0 to 50.25 (−15.75 to 50.25 is also allowed). The extreme outlier limits lie 3 interquartile ranges beyond the quartiles, so they are from 0 to 75 (−40.5 to 75 is also allowed). Hence 57 is an outlier but not an extreme outlier. Note that you did not need to label the x-axis and that the plot can be transposed.

[Figure: box plot titled 'Distribution of Income'. Axis: income in thousands of pounds.]

ii. Based on the shape of the boxplot, we can see that the distribution of the data is positively skewed.
iii. A histogram or stem-and-leaf diagram are other types of suitable graphical displays. The variable income is measurable and these graphs are suitable for displaying the distribution of such variables.

(b) A new treatment has been devised with the aim of reducing blood pressure for people with high blood pressure. Each participant's blood pressure was measured before and after the programme to see if the treatment is effective. The following data were obtained:

Before   177  142  146  162  145  162  152  154  171
After    174  146  144  159  145  163  156  150  172

i. Carry out an appropriate hypothesis test to determine whether the treatment is effective for reducing blood pressure. State the test hypotheses, and specify your test statistic and its distribution under the null hypothesis. Comment on your findings. (6 marks)
ii. State any assumptions you made. (2 marks)
iii. Give a 90% confidence interval for the difference in means. (2 marks)
iv. On the basis of the data alone, would you recommend the programme to a friend who suffers from high blood pressure? Explain why or why not. (2 marks)

Reading for this question
Look up the sections about hypothesis testing for testing differences in means. However, it is essential for this part of the question to focus on the section of the subject guide regarding paired samples (Section 7.16.4).

Approaching the question
i. Regarding hypotheses, note that the word 'effective' suggests a one-sided test: $H_0: \mu_{before} = \mu_{after}$ versus $H_1: \mu_{before} > \mu_{after}$. In this part, it is also essential to realise that we have a paired sample, as we have two observations for each person (before and after treatment). Hence the difference (before minus after) should be calculated for each person:
3  −4  2  3  0  −1  −4  4  −1

The next step is to calculate $s_d = 2.991$ and $\bar{x}_d = 0.2222$, in order to obtain the value of the test statistic $\frac{\bar{x}_d - 0}{s_d/\sqrt{n}} = 0.2229$.


We have the t distribution with eight degrees of freedom, hence the critical value (for a one-sided test) is 1.860. Hence, we do not reject $H_0$ at the 5% level. Testing at the 10% level gives a critical value of $t_{8, 0.1} = 1.397$. Therefore, we still do not reject $H_0$. There is no significant evidence that the treatment is effective.
ii. Differences normally distributed (no marks for 'normally distributed blood pressure'). Pairs of observations are independent (a weaker condition which suffices is that the differences are independent, but this is unlikely if observations are not).
iii. This is a straightforward exercise for confidence intervals given the appropriate formula from the formula sheet (make sure that you can recognise it). The requested confidence interval is (−1.6316, 2.0766).
iv. The evidence in the data that the treatment works is close to negligible as can be seen, for example, from the 90% confidence interval, which includes zero, so there is no reason to recommend the treatment on the basis of the data alone.
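The paired-sample working in parts (i) and (iii) can be reproduced as follows. This is a minimal sketch using the Python standard library only; since the standard library has no t distribution, the critical value 1.860 is copied from the commentary above (an assumption of the sketch, not something the code derives).

```python
from math import sqrt
from statistics import mean, stdev

before = [177, 142, 146, 162, 145, 162, 152, 154, 171]
after = [174, 146, 144, 159, 145, 163, 156, 150, 172]

# Paired design: work with the per-person differences (before minus after).
diffs = [b - a for b, a in zip(before, after)]
n = len(diffs)
d_bar = mean(diffs)
s_d = stdev(diffs)          # sample standard deviation (divisor n - 1)
se = s_d / sqrt(n)
t_stat = (d_bar - 0) / se   # tested against t with n - 1 = 8 degrees of freedom

T_CRIT = 1.860              # upper 5% point of t_8, taken from tables
reject_at_5pct = t_stat > T_CRIT

# The same quantile gives the two-sided 90% confidence interval.
ci_90 = (d_bar - T_CRIT * se, d_bar + T_CRIT * se)

print("differences:", diffs)
print(f"mean = {d_bar:.4f}, sd = {s_d:.3f}, t = {t_stat:.4f}")
print("reject H0 at 5% (one-sided):", reject_at_5pct)
print(f"90% CI: ({ci_90[0]:.4f}, {ci_90[1]:.4f})")
```

Small differences in the last decimal place of the interval, compared with the commentary, come from how many digits of the t quantile are used.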


Examiners' commentaries 2013

ST104a Statistics 1

Important note
This commentary reflects the examination and assessment arrangements for this course in the academic year 2012-13. The format and structure of the examination may change in future years, and any such changes will be publicised on the virtual learning environment (VLE). A change that took place from 2011-12 onwards is the presence of a formula sheet. The purpose of this change is to encourage candidates to devote more time to understanding the key concepts of the syllabus rather than memorising a large number of formulae. Nevertheless, candidates should not rely on this formula sheet entirely but only use it for verification. The formula sheet is available on the virtual learning environment (VLE).

Information about the subject guide
Unless otherwise stated, all cross-references will be to the latest version of the subject guide (2011). You should always attempt to use the most recent edition of any Essential reading textbook, even if the commentary and/or online reading list and/or subject guide refers to an earlier edition. If different editions of Essential reading are listed, please check the VLE for reading supplements; if none are available, please use the contents list and index of the new edition to find the relevant section.

Comments on specific questions – Zone B


Candidates should answer THREE of the following FOUR questions: QUESTION 1 of Section A (50 marks) and TWO questions from Section B (25 marks each). Candidates are strongly advised to divide their time accordingly.

Section A
Answer all parts of Question 1 (50 marks in total).

Question 1
(a) Classify each one of the following variables as measurable (continuous) or categorical. If a variable is categorical, further classify it as nominal or ordinal. Justify your answer. (Note that no marks will be awarded without justification.)
i. Rank of a university according to its reputation.
ii. Country of residence.
iii. Birth-weight of a baby.
iv. Favourite pop group.
(8 marks)

Reading for this question
This question requires identifying types of variable, so reading the relevant section in the subject guide (Section 3.6) is essential. Candidates should gain familiarity with the notion of a variable and be able to distinguish between discrete and continuous (measurable) data. In addition to identifying whether a variable is categorical or measurable, candidates should distinguish further between ordinal and nominal categorical variables.

Approaching the question
A general tip for identifying continuous and categorical variables is to think of the possible values they can take. If these are finite and represent specific entities, the variable is categorical. Otherwise, if these consist of numbers corresponding to measurements, the data are continuous and the variable is measurable. Such variables may also have measurement units or can be measured to various decimal places.
i. Each rank is a category, therefore this is a categorical variable. The values of this variable are the ranks of each university. By definition the categories (ranks) are ordered, thus resulting in a (categorical) ordinal variable.
ii. Each country is a category, so the possible values are one for each country. Hence, the variable is categorical. Note also that countries do not have a natural ordering, so this represents a categorical nominal variable.
iii. The data represent weights of babies at birth that can be measured to many decimal places; for example 5.234 kg. This is, therefore, a measurable variable.
iv. Each pop group is a category and is also a potential value of this variable. Hence, the variable is categorical. Moreover, pop groups do not have a natural ordering, therefore this categorical variable is on a nominal scale.
Weak candidates did not provide a justification for their choices, classified measurable variables as nominal or categorical, and sometimes answered ordinal when their justification was pointing to a nominal variable. Writing 'It is measurable because it can be measured' will not result in a high mark.
(b) The table below contains the marks (out of 20) of all students taking an examination for the same course in two years:

2011:   10   9  19   9  10   9
2012:   10  11   9  11  10  11  12  11  10

i. Find the mean mark and the median mark for each year.
ii. Calculate the range of the marks for each year and give an explanation for any differences you find.
iii. Calculate the standard deviation of the marks for each year and give an explanation for any differences you find.
iv. Comment on the differences in the mean and median for the two years that you found in part i. For this data set, which do you think would give a better description of the difference in marks: the mean or the median? Explain briefly.
(12 marks)

Reading for this question
This question contains material mostly from Chapter 3 of the subject guide and, in particular, Section 3.8 (Measures of location) for parts (i) and (iv), and Section 3.9 (Measures of spread) for parts (ii) and (iii).

Approaching the question
It is important to do the summation carefully and divide by the correct number of observations to obtain the mean. For questions that require calculations on the median (or other percentiles like quartiles), a good strategy is to write the observations in order. Note also that this question requires these measures for both years, so the calculations should be done for each year separately.


i. In order to calculate the two means, you should sum the numbers corresponding to each year and then divide by the number of observations in that year. Doing so yields (10 + 9 + 19 + 9 + 10 + 9)/6 = 11, for 2011, and (10 + 11 + 9 + 11 + 10 + 11 + 12 + 11 + 10)/9 = 10.56, for 2012. For the median, if we put the numbers in ascending order we get 9 9 9 10 10 19, for 2011, and 9 10 10 10 11 11 11 11 12, for 2012. The median for 2011 is given by taking the average of the 3rd and the 4th numbers in the first of the rows above, resulting in a value of (9 + 10)/2 = 9.5. The median for 2012 is obtained from the 5th number in the 2nd row above, which is 11.
ii. Note that the range of a variable equals the difference between the maximum value and the minimum value. Hence, the range for 2011 was 19 − 9 = 10, whereas the range for 2012 was 12 − 9 = 3. Some candidates answered 'from 9 to 19'. While this is true, note that it does not correspond to the definition of the range, so it is essential to give the numbers 10 (2011) and 3 (2012) in your answer. It is also essential to comment on the different ranges between 2011 and 2012. The difference is big and is caused by the outlier 19 in 2011. Some candidates confused 'Range' and 'Interquartile range'. Make sure that you identify what is being asked.
iii. In order to answer this question, candidates should be familiar with Section 3.9.3 (on variance and standard deviation) and the chapter activities. It is very important to show your working with the relevant summations of the squared deviations from the mean. In this way you may get some marks even if the numerical answer is wrong, as you are demonstrating knowledge of the method. The answer for 2011 is 3.95, whereas for 2012 it is 0.88. It is also essential to comment on the different standard deviations between 2011 and 2012. The difference is big and is, again, caused by the outlier 19 in 2011.
iv. The mean is higher in 2011 but the median is higher in 2012. This can be attributed to the fact that 2011 contains an outlier (19) which results in a high mean. Apart from this outlier, marks tend to be higher in 2012, so the median gives a somewhat better indication of the typical mark for each year.

(c) Weekly household expenditure in country A is normally distributed with a mean of £300 per week and a standard deviation of £100 per week. In country B it is also normally distributed but with a mean of £240 per week and a standard deviation of £50 per week. Which country has a higher proportion of households spending less than £200? (4 marks)

Reading for this question
This question examines the ideas of the normal random variable. Read the relevant section of Chapter 5 and work through the examples and activities of this section. The Sample examination questions are quite relevant.

Approaching the question
The basic property of the normal random variable for this question is that if $X \sim N(\mu, \sigma^2)$, then $Z = (X - \mu)/\sigma \sim N(0, 1)$. Note also that
$P(Z < a) = P(Z \le a) = \Phi(a)$
$P(Z > a) = P(Z \ge a) = 1 - P(Z \le a) = 1 - P(Z < a) = 1 - \Phi(a)$
$P(a < Z < b) = P(a \le Z < b) = P(a < Z \le b) = P(a \le Z \le b) = \Phi(b) - \Phi(a)$.


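The above is all you need to find the requested proportions, where X and Y denote weekly household expenditure in countries A and B, respectively:
$$P(X < 200) = P\left(\frac{X - 300}{100} < \frac{200 - 300}{100}\right) = P(Z < -1) = 0.1587$$
$$P(Y < 200) = P\left(\frac{Y - 240}{50} < \frac{200 - 240}{50}\right) = P(Z < -0.8) = 0.2119.$$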
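The two standardised probabilities above can be checked without tables. The sketch below is one way to do so, assuming Python 3.8 or later for `statistics.NormalDist`; the variable names are illustrative only.

```python
from statistics import NormalDist

# Weekly household expenditure in pounds, as specified in the question.
country_a = NormalDist(mu=300, sigma=100)
country_b = NormalDist(mu=240, sigma=50)

threshold = 200
p_a = country_a.cdf(threshold)  # P(X < 200) = Phi(-1)
p_b = country_b.cdf(threshold)  # P(Y < 200) = Phi(-0.8)

print(f"Country A: {p_a:.4f}")
print(f"Country B: {p_b:.4f}")
print("Higher proportion below 200:", "B" if p_b > p_a else "A")
```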

So country B has a higher proportion of households spending less than £200.

(d) We would like to start an internet service provider and need to estimate the average weekly internet usage of households for our business plan. Internet usage is measured in minutes. How many households must we randomly select to be 95 per cent confident that the sample mean is within 2 minutes of the population mean? Assume that a previous survey of household usage has shown that the standard deviation of internet usage is 6.95 minutes. (3 marks)

Reading for this question
All of Chapter 6 is relevant, but the main reading for this question can be found in Section 6.1 (Choosing a sample size). It is essential to read this section carefully and attempt the activities and exercises.

Approaching the question
This question asks you to determine a sample size. This is straightforward once the distribution is identified. Since the sample size is large, a normal distribution can be used. The working is given below:
Identify the correct z-value: 1.96.
Solve $1.96 \times \frac{\sigma}{\sqrt{n}} = 2$.
We can take $\sigma$ = 6.95 to find $n = (1.96 \times 6.95/2)^2 = 46.39$. Round up to n = 47.
Some candidates forgot to round up. Remember that you are asked about a sample size.

(e) Suppose that $x_1 = 2$, $x_2 = -3$, $x_3 = 6$, $x_4 = 0$, $x_5 = 3$, and $y_1 = 3$, $y_2 = 2$, $y_3 = 1$, $y_4 = 0$, $y_5 = 1$. Calculate the following quantities:
i. $\sum_{i=1}^{5} x_i$
ii. $\sum_{i=2}^{5} 2x_i(y_i + 1)$
iii. $x_2^2 + \sum_{i=1}^{3} (x_i + y_i^3)$
(6 marks)

Reading for this question
This question refers to the basic bookwork which can be found in Section 1.9 of the subject guide and, in particular, in Activity A1.6.

Approaching the question
Be careful to leave the x's and y's in the order given and only cover the values of i asked for. This question was generally done well; the answers are:
i. $\sum_{i=1}^{5} x_i = 2 + (-3) + 6 + 0 + 3 = 8$. (1 mark)
ii. $\sum_{i=2}^{5} 2x_i(y_i + 1) = 2\sum_{i=2}^{5} x_i(y_i + 1) = 2(-3 \times (2 + 1) + 6 \times (1 + 1) + 0 \times (0 + 1) + 3 \times (1 + 1)) = 2 \times 9 = 18$. (2 marks)
iii. $x_2^2 + \sum_{i=1}^{3} (x_i + y_i^3) = (-3)^2 + (2 + 3^3) + (-3 + 2^3) + (6 + 1^3) = 9 + 29 + 5 + 7 = 50$. (3 marks)
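The three summations can be verified mechanically. In the sketch below, the values $x_2 = -3$ and $y = (3, 2, 1, 0, 1)$ are those implied by the worked answers above; a dummy entry at index 0 is an illustrative device so that the Python indices match the subscripts in the question.

```python
# Index 0 is a placeholder so that x[1]..x[5] match x_1..x_5 in the question.
x = [None, 2, -3, 6, 0, 3]
y = [None, 3, 2, 1, 0, 1]

# i. Sum of x_i for i = 1..5.
part_i = sum(x[i] for i in range(1, 6))

# ii. Sum of 2 * x_i * (y_i + 1) for i = 2..5 (note the lower limit).
part_ii = sum(2 * x[i] * (y[i] + 1) for i in range(2, 6))

# iii. x_2 squared, plus the sum of (x_i + y_i cubed) for i = 1..3.
part_iii = x[2] ** 2 + sum(x[i] + y[i] ** 3 for i in range(1, 4))

print(part_i, part_ii, part_iii)
```

Note that `range(2, 6)` stops at 5, which mirrors the care needed over the limits of summation in the written working.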


(f) In an introductory statistics class, the numbers of males and females are 17 and 23, respectively.
i. A student is selected randomly from the class. What is the probability the student is female?
ii. A student is selected at random and removed from the class. A second student is then selected. What is the probability that one of the students is male and the other is female?
iii. What is the probability that the second student is male, given that the first student is female and removed from the class?
iv. In previous years it was found that 80% of males pass the exam and 85% of females pass the exam. Based on the available information, find the probability that a student who passes the examination is female.
(8 marks)

Reading for this question
This is a question on probability and targets mostly the material covered in Chapter 4. It is essential to practise this area by attempting the chapter activities and exercises as well as accessing the material on the VLE. In particular you can attempt Activity A4.6 and Sample examination question 4. It is also useful to familiarise yourself with probability trees as they can be quite useful when completing such exercises.

Approaching the question
The first three parts were straightforward for those who were familiar with this section. Part (iv) required knowledge of Bayes' formula or a very good understanding of probability trees. The working is shown below:
i. There are 23 females and 17 males in the class. Hence the answer is 23/(17 + 23) = 23/40 = 0.575.
ii. The correct answer here is $\frac{17}{40} \times \frac{23}{39} + \frac{23}{40} \times \frac{17}{39} = 0.501$. Although not necessary, the use of a probability tree would be quite helpful here.
iii. This part can be answered in a similar way to part (i), noting that there are now 17 males and 22 females in the class. Hence 17/39 = 0.436.
iv. Using Bayes' formula:
$$P(\text{female} \mid \text{pass}) = \frac{P(\text{pass} \mid \text{female})\,P(\text{female})}{P(\text{pass})} = \frac{0.85 \times 23/40}{P(\text{pass} \cap \text{female}) + P(\text{pass} \cap \text{male})} = \frac{0.85 \times 23/40}{0.85 \times 23/40 + 0.80 \times 17/40} = 0.5897.$$
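All four answers can be reproduced with exact fractions, which avoids rounding until the final step. This is a minimal sketch using `fractions.Fraction` from the Python standard library; the variable names are illustrative.

```python
from fractions import Fraction

males, females = 17, 23
total = males + females

# i. Probability that a randomly selected student is female.
p_female = Fraction(females, total)
p_male = Fraction(males, total)

# ii. One male and one female in two draws without replacement
#     (male then female, or female then male).
p_mixed = (p_male * Fraction(females, total - 1)
           + p_female * Fraction(males, total - 1))

# iii. Second student is male, given the first was female and removed.
p_male_second = Fraction(males, total - 1)

# iv. Bayes' formula: P(female | pass).
p_pass_given_f = Fraction(85, 100)
p_pass_given_m = Fraction(80, 100)
joint_f = p_pass_given_f * p_female   # P(pass and female)
joint_m = p_pass_given_m * p_male     # P(pass and male)
p_female_given_pass = joint_f / (joint_f + joint_m)

for label, value in [("i", p_female), ("ii", p_mixed),
                     ("iii", p_male_second), ("iv", p_female_given_pass)]:
    print(f"{label}: {value} = {float(value):.4f}")
```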

(g) State whether the following are true or false and give a brief explanation. (Note that no marks will be awarded for a simple true/false answer.)
i. An important difference between an experimental design and an observational study is that in an observational study data are collected on units without any intervention.
ii. If two variables are correlated we can conclude that one causes the other.
iii. If a variable has a symmetric distribution, its mean and median are the same.
(6 marks)

Reading for this question
This question contains material from various parts of the subject guide. Here, it is more important to have a good intuitive understanding of the relevant concepts than a technical level of knowledge in computations. Part (i) requires material from Chapter 10 and, in particular, the sections on observational studies and designed experiments. Part (ii) is about correlation and causation, detailed in Section 11.7 of the subject guide. Finally, part (iii) targets the material covered in Chapter 3.

Approaching the question
Candidates always find this type of question tricky. It requires a brief explanation of the reason for a true/false answer and not just a choice between the two. Some candidates also lost marks for long rambling explanations without a decision as to whether a statement was true or false.
i. True. A possible way to provide an explanation here is through an example: in an experimental design some units are administered a treatment, and this is not possible in an observational study. Note: candidates should indicate in some way that they know what the assertion means, such as via an example (see p.156 of the subject guide).
ii. False; the correlation may be spurious, for example there may be a third variable affecting both variables, leading to a correlation.
iii. True; mean and median are at the centre of symmetry.

(h) In the context of sampling, explain the difference between item non-response and unit non-response. (3 marks)

Reading for this question
This question requires knowledge about sampling and sample surveys. Useful background reading may be found in Chapter 9 of the subject guide. The material directly related to this question, item non-response and unit non-response, appears on p.145. See also the references to Newbold and Carlson given in Chapter 9 of the subject guide.

Approaching the question
The relevant parts of p.145 are that:
item non-response occurs when a sampled member fails to respond to a particular question
unit non-response occurs when no information is collected from a sample member.
In addition to the definitions supplied above, it would also be useful to use an example.

Section B
Answer two questions from this section (25 marks each).

Question 2
(a) The 2006 General Social Survey in the United States asked subjects, 'Would you say that astrology is very scientific, sort of scientific, or not at all scientific?' The table below cross-classifies their responses with their highest level of education.

                               Astrology is scientific
Highest degree            Very        Sort of      Not at all    Total
Less than high school     23 (11%)    84 (41%)     98 (48%)      205 (100%)
High school               50 (5%)     286 (31%)    574 (63%)     910 (100%)
College or higher         16 (2%)     124 (18%)    538 (79%)     678 (100%)
Total                     89 (5%)     494 (28%)    1210 (67%)    1793 (100%)

i. Based on the data in the table, and without doing a significance test, how would you describe the relationship between education and opinion on whether or not astrology is scientific? (4 marks)


ii. Calculate the $\chi^2$ statistic and use it to test for independence, using a 1% significance level. What do you conclude? (9 marks)

Reading for this question
This part targets Chapter 8 on contingency tables and chi-squared tests. Note that part (i) of the question does not require any calculations, just understanding and interpreting contingency tables. Candidates can attempt Activity A8.4 to practise. Part (ii) is a straightforward chi-squared test and the reading is also given in Chapter 8.

Approaching the question
i. Using the percentages, we see that the higher someone's education, the smaller the belief that astrology is very scientific and the higher the belief that it is not at all scientific. For example, 79% of those who attended college or higher education responded that astrology is not at all scientific, whereas the corresponding proportion for those with less than high school education is 48%.
ii. Set out the null hypothesis that there is no association between education and views on astrology against the alternative, that there is an association. Be careful to get these the correct way round!
$H_0$: No association between education and views on astrology
versus
$H_1$: Association between education and views on astrology.
Work out the expected values to obtain the table below:

                          Very       Sort of     Not at all
Less than high school     10.1757    56.4808     138.344
High school               45.1701    250.719     614.11
College or higher         33.6542    186.8       457.546

The test statistic formula is
$$\sum_{i,j} \frac{(O_{i,j} - E_{i,j})^2}{E_{i,j}},$$
which gives a value of 93.9567. This is a 3 × 3 contingency table so the degrees of freedom are (3 − 1) × (3 − 1) = 4. For $\alpha$ = 0.05 the critical value is 9.488, hence we reject $H_0$. For a second (stronger) $\alpha$, say 1%, the critical value is 13.277, hence we still reject $H_0$. We conclude that the association between views on astrology and educational level is highly significant.
Many candidates looked up the tables incorrectly and so failed to follow through their earlier accurate work. A larger number did not expand on their results sufficiently. Saying 'we reject at the 5% level, but not at 10%' is insufficient. What does this mean? Is there a connection or not? If there is one, how strong is it? These questions needed to be answered if the full nine marks allocated for this question were to be given. Many candidates lost marks by failing to follow up like this.

(b) i. Define each of the following:
Simple random sampling
Stratified random sampling.
(4 marks)
ii. Why might a researcher prefer to take a stratified random sample rather than a simple random sample? Give two reasons. (3 marks)
iii. You have been asked to design a nation-wide survey in your country to find out about the smoking habits of adults. Give two stratification factors you might use, and explain why you have chosen them. (5 marks)


Reading for this question
This question on basic material on survey designs required background from Chapters 9 and 10 of the subject guide which, along with the recommended reading, should be looked at carefully. Candidates were expected to have studied and understood the main constituents of design in random sampling. It is also a good idea to try the activities in Chapter 9.

Approaching the question
One of the main things to avoid here is writing an answer without any structure. This exercise asks for specific things and each one of them requires one or two lines. If you are unsure of what these specific things are, do not write lengthy essays. This is a waste of your valuable examination time. If you can identify what is being asked, keep in mind that the answer should not be long. Note also that in some cases there is no unique answer to the question.
i. Simple random sampling:
Every sample has equal probability.
With replacement.
Stratified random sampling:
Population divided into strata (or groups).
Random sample from each group.
ii. There are generally two main reasons why one would prefer stratified to simple random sampling.
Potentially more precision of parameter estimates.
Obtain information about subgroups.
iii. In this part you can choose factors based on two arguments. First, you can aim for factors whose subgroups differ regarding smoking habits (e.g. gender, ethnic groups, age groups etc.). In that way the stratified sampling scheme will have increased precision. Alternatively, you can just suggest factors that are interesting from a research point of view.
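Returning to the chi-squared calculation in Question 2(a)(ii), the expected values and the test statistic can be reproduced from the observed counts alone. The following is a minimal sketch in plain Python; the critical values are copied from the commentary (as read from tables with 4 degrees of freedom) rather than computed.

```python
# Observed counts: rows are education levels, columns are
# 'very', 'sort of' and 'not at all' scientific.
observed = [
    [23, 84, 98],     # Less than high school
    [50, 286, 574],   # High school
    [16, 124, 538],   # College or higher
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected count under independence: row total * column total / n.
expected = [[r * c / n for c in col_totals] for r in row_totals]

chi2 = sum(
    (o - e) ** 2 / e
    for o_row, e_row in zip(observed, expected)
    for o, e in zip(o_row, e_row)
)
df = (len(row_totals) - 1) * (len(col_totals) - 1)

CRIT_5PCT, CRIT_1PCT = 9.488, 13.277  # chi-squared, 4 df, from tables

for row in expected:
    print([round(e, 4) for e in row])
print(f"chi-squared = {chi2:.4f} on {df} degrees of freedom")
print("reject at 5%:", chi2 > CRIT_5PCT, "| reject at 1%:", chi2 > CRIT_1PCT)
```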

Question 3
The level of infant mortality (y) is represented by the number of baby deaths for every 1,000 births. For 12 areas these are shown in the following table. For each area, the percentage (x) of babies born into families earning at least £25,000 is also shown.

Area                    A   B   C   D   E   F   G   H   I   J   K   L
Percentage (x)         19   5   9  20  11  35   5  18  25  12  20  15
Infant mortality (y)    3  15  14   6  13   3  23  10   9   9   5  10

The summary statistics for these data are:
Sum of x data: 194
Sum of the squares of x data: 3956
Sum of y data: 120
Sum of the squares of y data: 1560
Sum of the products of x and y data: 1504

(a) i. Draw a scatter diagram of these data on the graph paper provided. Label the diagram carefully. (4 marks)
ii. Calculate the sample correlation coefficient. Interpret your findings. (3 marks)


iii. Calculate the least squares line of y on x and draw the line on the scatter diagram. (4 marks)
iv. Using the equation you found in iii., obtain the predicted infant mortality for an area where 34% of babies are born into families earning at least £25,000. Do you think this value is realistic? Justify your answer. (2 marks)

Reading for this question
This is a standard regression question and the reading is to be found in Chapter 11. Section 11.6 provides details for scatter diagrams and is suitable for part (i), whereas the remaining parts focus on correlation and regression and are covered in Sections 11.8 to 11.10 of the subject guide. Section 11.7 is also relevant. Sample examination question 2 from this chapter is recommended for practice on questions of this type.

Approaching the question
i. Candidates are reminded that they are asked to draw and label the scatter diagram, which should include a full title ('Scatter diagram' alone will not suffice) and labelled axes, including information about units. Far too many candidates threw away marks by neglecting these points and consequently were only given one mark out of the possible four allocated for this part of the question. Another common way of losing marks was failing to use the graph paper which was provided, and required, in the question. Candidates who drew on the ordinary paper in their booklet were not awarded marks for this part of the question.
[Figure: scatter diagram titled 'Infant mortality and economic class'. x-axis: percentage of babies born into families earning at least 25,000 pounds. y-axis: infant mortality (number of baby deaths for every 1,000 births).]

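ii. The summary statistics can be substituted into the formula for the correlation coefficient (make sure you know which one it is!) to obtain the value −0.8026. An interpretation of this value is the following: the data suggest that the higher the percentage of families earning at least a certain income, the lower the infant mortality. The fact that the value is close to −1 suggests that this is a strong (negative) association.
iii. The regression line can be written as $\hat{y} = a + bx$ or $y = a + bx + \varepsilon$. The formula for b is
$$b = \frac{\sum x_i y_i - n\bar{x}\bar{y}}{\sum x_i^2 - n\bar{x}^2},$$
and by substituting the summary statistics we get b = −0.5319. The formula for a is $a = \bar{y} - b\bar{x}$, so we get a = 18.5994. Hence the regression line can be written as $\hat{y} = 18.5994 - 0.5319x$ or $y = 18.5994 - 0.5319x + \varepsilon$. It should also be plotted on the scatter diagram.
iv. The prediction will be $\hat{y} = 18.5994 - 0.5319 \times 34 = 0.51$ infant mortality (number of baby deaths for every 1,000 births). However, since this point is very close to the maximum observation of x, which is 35%, this prediction should not be trusted too much as it is almost based on extrapolation. Many candidates did not give the measurement units here. These are essential in answering such a question and a mark is deducted if they are not specified.

(b) A survey is conducted to compare public attitudes towards local policing. A number of people in two areas of interest are sampled, and asked if they are satisfied with their local police-community relationship. The results of this survey are shown in the following table.

                    Area A   Area B
Sample size           153      188
Number satisfied      115      120

i. You are asked to consider an appropriate hypothesis test to determine whether there is a difference between the two areas in the proportion who are satisfied. Test at two appropriate significance levels and comment on your findings. Specify the test statistic you use and its distribution under the null hypothesis. (7 marks)
ii. State clearly any other assumptions you make. (2 marks)
iii. Give a 98% confidence interval for the proportion of people in Areas A and B combined who are satisfied, assuming the respective sample sizes are proportional to population sizes. (3 marks)

Reading for this question
The first two parts of the question refer to a two-sided hypothesis test comparing proportions. While the entire chapter on hypothesis testing is relevant, one can focus on the sections involving proportions (Sections 7.14 and 7.15). The last part of the question is on confidence intervals, which are covered in Chapter 6 and, in particular (confidence intervals for proportions), in Section 6.10.

Approaching the question
i. The null hypothesis is that the proportions of the two areas (A and B) do not differ; the alternative is that they do: $H_0: \pi_A = \pi_B$ versus $H_1: \pi_A \neq \pi_B$. The test statistic is provided in the formula sheet (note that it is based on the pooled variance):
$$Z = \frac{P_A - P_B}{\sqrt{\frac{P(1-P)}{n_A} + \frac{P(1-P)}{n_B}}},$$
where P = (115 + 120)/(153 + 188) = 0.68915. The test statistic value is 2.249 ($P_A$ = 0.7516, $P_B$ = 0.6383, pooled se = 0.0504). The critical value at the 5% level, assuming a normal approximation as the number of observations is large, is 1.96. Hence, we reject the null hypothesis, suggesting evidence for a difference between the two areas. If we take a (smaller) $\alpha$ of 1%, the critical value is 2.576, so we do not reject $H_0$. We conclude that there is some, but not strong, evidence of a difference between the two areas.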
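For part (a) of this question, the correlation coefficient, the least squares line and the prediction can all be reproduced from the five summary statistics given in the question. This is a minimal sketch in plain Python; the intermediate names `s_xx`, `s_yy` and `s_xy` are illustrative shorthand for the corrected sums of squares and products, not notation taken from the formula sheet.

```python
from math import sqrt

# Summary statistics as given in the question (12 areas).
n = 12
sum_x, sum_x2 = 194, 3956
sum_y, sum_y2 = 120, 1560
sum_xy = 1504

x_bar, y_bar = sum_x / n, sum_y / n

# Corrected sums of squares and products.
s_xy = sum_xy - n * x_bar * y_bar
s_xx = sum_x2 - n * x_bar ** 2
s_yy = sum_y2 - n * y_bar ** 2

r = s_xy / sqrt(s_xx * s_yy)   # sample correlation coefficient
b = s_xy / s_xx                # slope of the least squares line
a = y_bar - b * x_bar          # intercept

x_new = 34                     # percentage asked about in part (iv)
prediction = a + b * x_new

print(f"r = {r:.4f}")
print(f"y-hat = {a:.4f} + ({b:.4f}) x")
print(f"predicted infant mortality at x = {x_new}: {prediction:.2f} per 1,000 births")
```

Since x = 34 is just below the largest observed percentage (35), the prediction is an interpolation only by a narrow margin, which is the point made in the commentary.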

i. You are asked to consider an appropriate hypothesis test to determine whether there is a difference between the two areas in the proportion who are satisfied. Test at two appropriate significance levels and comment on your findings. Specify the test statistic you use and its distribution under the null hypothesis. (7 marks)
ii. State clearly any other assumptions you make. (2 marks)
iii. Give a 98% confidence interval for the proportion of people in Areas A and B combined who are satisfied, assuming the respective sample sizes are proportional to population sizes. (3 marks)

Reading for this question
The first two parts of the question refer to a two-sided hypothesis test comparing proportions. While the entire chapter on hypothesis testing is relevant, one can focus on the sections involving proportions (Sections 7.14 and 7.15). The last part of the question is on confidence intervals, which are covered in Chapter 6 and, in particular (confidence intervals for proportions), in Section 6.10.

Approaching the question
i. The null hypothesis is that the proportions of the two areas (A and B) do not differ; the alternative is that they do: $H_0: \pi_A = \pi_B$ versus $H_1: \pi_A \neq \pi_B$. The test statistic is provided in the formula sheet (note that it is based on the pooled variance):
$$Z = \frac{P_A - P_B}{\sqrt{\dfrac{P(1-P)}{n_A} + \dfrac{P(1-P)}{n_B}}},$$
where $P = (115 + 120)/(153 + 188) = 0.68915$. The test statistic value is 2.249 ($P_A = 0.7516$, $P_B = 0.6383$, pooled se = 0.0504). The critical value at the 5% level, assuming a normal approximation as the number of observations is large, is 1.96. Hence, we reject the null hypothesis, suggesting evidence for a difference between the two areas. If we take a (smaller) $\alpha$ of 1%, the critical value is 2.576, so we do not reject $H_0$. We conclude that there is some, but not strong, evidence of a difference between the two areas.


ii. The assumptions included:
• Assumption about whether $\sigma_A^2 = \sigma_B^2$.
• Assumption about whether $n_A + n_B - 2$ is large, hence t v. z.
• Assumption about independent samples.

iii. The question is a standard exercise in confidence intervals. Note the question refers to areas A and B combined. The working is given below:
Correct quantile: $z_{\alpha/2} = 2.326$.
Correct endpoints: 0.631 and 0.747. (Also accept two decimal places.)
Report as an interval: (0.631, 0.747). (Also accept 'between 0.631 and 0.747'.)
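The arithmetic in parts i. and iii. can be checked with a short sketch. It is only an illustration, not part of the marking scheme: the function names `two_proportion_z` and `proportion_ci` are hypothetical helpers introduced here, and the normal quantiles come from Python's standard library rather than from statistical tables.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z(x_a, n_a, x_b, n_b):
    """Pooled z statistic for H0: pi_A = pi_B (hypothetical helper)."""
    p_a, p_b = x_a / n_a, x_b / n_b
    pooled = (x_a + x_b) / (n_a + n_b)  # pooled sample proportion P
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))  # pooled standard error
    return (p_a - p_b) / se, pooled, se


def proportion_ci(x, n, level):
    """Normal-approximation interval for one proportion (hypothetical helper)."""
    p = x / n
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)  # upper alpha/2 quantile
    half_width = z * sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width


# Survey data from the question: Area A 115 of 153, Area B 120 of 188.
z, pooled, se = two_proportion_z(115, 153, 120, 188)
crit_5 = NormalDist().inv_cdf(0.975)  # two-sided 5% critical value
crit_1 = NormalDist().inv_cdf(0.995)  # two-sided 1% critical value

# Areas A and B combined, 98% confidence level.
lower, upper = proportion_ci(115 + 120, 153 + 188, 0.98)

print(f"z = {z:.3f}, P = {pooled:.5f}, se = {se:.4f}")
print(f"reject at 5%: {abs(z) > crit_5}, reject at 1%: {abs(z) > crit_1}")
print(f"98% CI: ({lower:.3f}, {upper:.3f})")
```

Comparing `abs(z)` with both critical values mirrors the instruction to test at two significance levels; in the examination the critical values would be read from the tables provided.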

Question 4
(a) i. Carefully construct a box plot on the graph paper provided to display the following yearly incomes of a group of people, measured in £1000:
3 2 4 8 7 19 2 5 3 4 10 12
(8 marks)
ii. Based on the shape of the box plot you have drawn, describe the distribution of the data. (2 marks)
iii. Name two other types of graphical displays that would be suitable to represent the data. Briefly explain your choices. (3 marks)

Reading for this question
Chapter 3 provides all the relevant material for this question. More specifically, information on boxplots can be found in Section 3.9.2, but all of Sections 3.8 and 3.9 are highly relevant.

Approaching the question
i. The boxplot diagram the Examiners were hoping to see is shown below. Marks were awarded for including the title, identifying the box and the whiskers and noting the outlier, at a reasonable accuracy.
[Figure: box plot titled "Distribution of Income". Axis: income in thousands of pounds.]

In order to identify the box, the quartiles are needed; these are 3 and 8.5, giving an interquartile range of 5.5. The median is also needed, which is 4.5.
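As a check on these summary values, the following sketch computes the quartiles and the box plot fences for the twelve incomes. It assumes the 'inclusive' interpolation rule of Python's `statistics.quantiles`, which happens to reproduce the quartiles 3 and 8.5 quoted here; the subject guide's own quartile convention may differ, and the variable names are illustrative only.

```python
from statistics import quantiles

# Yearly incomes from the question, in thousands of pounds.
incomes = [3, 2, 4, 8, 7, 19, 2, 5, 3, 4, 10, 12]

# Assumption: 'inclusive' linear interpolation between order statistics.
q1, med, q3 = quantiles(incomes, n=4, method="inclusive")
iqr = q3 - q1

# Inner fences (1.5 x IQR) mark outliers; outer fences (3 x IQR) mark extreme outliers.
inner_low, inner_high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outer_low, outer_high = q1 - 3 * iqr, q3 + 3 * iqr

outliers = [x for x in incomes if x < inner_low or x > inner_high]
extreme = [x for x in incomes if x < outer_low or x > outer_high]

print(f"Q1 = {q1}, median = {med}, Q3 = {q3}, IQR = {iqr}")
print(f"outlier limits: {inner_low} to {inner_high}")
print(f"extreme outlier limits: {outer_low} to {outer_high}")
print(f"outliers: {outliers}, extreme outliers: {extreme}")
```

Because incomes cannot be negative, a negative lower fence can equally be reported as 0, which is why both forms of the limits are acceptable.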


Hence the outlier limits are from 0 to 16.75. (−5.25 to 16.75 is also allowed.) The extreme outlier limits are then from 0 to 25. (−13.5 to 25 is also allowed.) Hence 19 is an outlier but not an extreme outlier. Note that you did not need to label the x axis and that the plot can be transposed.
ii. Based on the shape of the boxplot, we can see that the distribution of the data is positively skewed.
iii. A histogram or stem-and-leaf diagram are other types of suitable graphical displays. The variable income is measurable and these graphs are suitable for displaying the distribution of such variables.

(b) A new fitness programme is devised for obese people. Each participant's weight in kg was measured before and after the programme to see if the fitness programme is effective in reducing their weights. The following data were obtained:

Before   After
 145      143
 116      120
 120      118
 133      130
 119      119
 133      134
 125      128
 126      123
 140      141

i. Carry out an appropriate hypothesis test to determine whether the fitness programme is effective for reducing weight. State the test hypotheses, and specify your test statistic and its distribution under the null hypothesis. Comment on your findings. (6 marks)
ii. State any assumptions you made. (2 marks)
iii. Give an 80% confidence interval for the difference in means. (2 marks)
iv. On the basis of the data alone, would you recommend the programme to a friend who wants to lose weight? Explain why or why not. (2 marks)

Reading for this question
Look up the sections about hypothesis testing for testing differences in means. However, it is essential for this part of the question to focus on the section of the subject guide regarding paired samples (Section 7.16.4).

Approaching the question
i. Regarding hypotheses, note that the word 'effective' suggests a one-sided test: $H_0: \mu_{\text{before}} = \mu_{\text{after}}$, $H_1: \mu_{\text{before}} > \mu_{\text{after}}$. In this part, it is also essential to realise that we have a paired sample, as we have two observations for each person (before and after treatment).
Hence the difference (before − after) for each person should be calculated: 2, −4, 2, 3, 0, −1, −3, 3, −1.

The next step is to calculate $s_d = 2.571$ and $\bar{x}_d = 0.1111$, in order to obtain the value of the test statistic
$$\frac{\bar{x}_d - 0}{s_d/\sqrt{n}} = 0.1296.$$
We have the t distribution with eight degrees of freedom, hence the critical value (for a one-sided test) at the 5% level is 1.860. Hence, we do not reject $H_0$ at the 5% level. Testing at the 10% level gives a critical value of $t_{8,0.1} = 1.397$. Therefore, we still do not reject $H_0$. There is no significant evidence that the fitness programme is effective.


ii. Differences normally distributed (no marks for 'normally distributed weights'). Pairs of observations are independent (a weaker condition which suffices is that the differences are independent, but this is unlikely if observations are not).
iii. This is a straightforward exercise in confidence intervals: use the appropriate formula from the formula sheet (make sure that you can recognise it), $\bar{x}_d \pm t_{8,0.1}\, s_d/\sqrt{n} = 0.1111 \pm 1.397 \times 2.571/\sqrt{9}$. The requested confidence interval is (−1.086, 1.308).
iv. The evidence in the data that the programme works is close to negligible, as can be seen, for example, from the 80% confidence interval, so there is no reason to recommend the programme on the basis of the data alone.
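The paired test in part i. can be reproduced with a few lines of Python. This is a sketch under stated assumptions: the data are read as consecutive (before, after) pairs, which reproduces the quoted mean and standard deviation of the differences, and the critical values are taken as quoted from the t tables because the standard library has no t distribution.

```python
from math import sqrt
from statistics import mean, stdev

# Weights in kg, read as consecutive (before, after) pairs (assumption).
before = [145, 116, 120, 133, 119, 133, 125, 126, 140]
after = [143, 120, 118, 130, 119, 134, 128, 123, 141]

# Paired sample: work with the within-person differences, before minus after.
diffs = [b - a for b, a in zip(before, after)]
n = len(diffs)

d_bar = mean(diffs)   # sample mean of the differences
s_d = stdev(diffs)    # sample standard deviation (divisor n - 1)
t_stat = (d_bar - 0) / (s_d / sqrt(n))  # test statistic under H0: mean difference 0

# One-sided critical values for 8 degrees of freedom, as quoted from tables.
T_CRIT_5_PERCENT = 1.860
T_CRIT_10_PERCENT = 1.397

print(f"differences: {diffs}")
print(f"mean = {d_bar:.4f}, sd = {s_d:.3f}, t = {t_stat:.4f}")
print(f"reject H0 at 5%: {t_stat > T_CRIT_5_PERCENT}")
print(f"reject H0 at 10%: {t_stat > T_CRIT_10_PERCENT}")
```

Working with the differences rather than the two columns separately is what makes this a paired test; treating the columns as independent samples would ignore the fact that each person is measured twice.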
