Math1041 2013 S1

Semester 1, 2013 MATH1041

Use a separate book clearly marked Question 1

1. [15 marks] The Tour de France (TdF) is an annual three-week-long cycling stage race that typically begins with a short Prologue stage before the race proper. This Prologue is usually between 5 km and 10 km long. Figure 1 shows the average speed (in km/h) of the winner of every Prologue stage between 5 km and 10 km long since 1972.

Figure 1: Average speeds of TdF Prologue winners [scatterplot of average speed against Year, 1980 to 2010]

    Call:
    lm(formula = Speed ~ Year)

    Residuals:
        Min      1Q  Median      3Q     Max
    -4.0360 -1.5695  0.1302  1.3488  4.4479

    Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
    (Intercept) -324.5535    62.3061  -5.209 9.96e-06 ***
    Year           0.1882     0.0313   6.013 9.28e-07 ***
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error: 2.053 on 33 degrees of freedom
    Multiple R-squared: 0.5228, Adjusted R-squared: 0.5084
    F-statistic: 36.16 on 1 and 33 DF, p-value: 9.283e-07

Figure 2: Diagnostic plots of Regression Analysis [left panel: residuals vs fitted values; right panel: normal quantile plot of residuals against theoretical quantiles]

i) [2 marks] How strong is the relationship between the Prologue winners' average speed and the year? Include an appropriate numerical summary in your answer.
ii) [3 marks] Construct a 95% confidence interval for the true regression slope.
iii) [2 marks] Explain in words what this confidence interval tells us about how the average speed of Prologue winners is changing over time.
iv) [4 marks] Is linear regression appropriate for this dataset? In your answer refer to any specific assumptions that are made, whether or not they seem reasonable, and what output you used in coming to this conclusion.
v) [2 marks] Use the above regression analysis to predict the average speed for the Prologue of the TdF last year (2012).
vi) [2 marks] The average speed of the Prologue winner in 2012 was recorded as 53.21 km/h. Using this information, compute the residual for this observation from the fitted regression line. Then comment on the athlete's performance relative to the fitted value.

Use a separate book clearly marked Question 2

2. [15 marks] In recent history, the Tour de France (TdF) has been associated with performance-enhancing drugs. As supporting evidence of prior foul play, the following research question has been posed:

    Were mountain ascent times of TdF cyclists during the 1990-2003 period faster than during the 2004-2011 period?

One way to address this question is to compare the times (in minutes) it took the cyclists to ascend a 13.8 km road section of Alpe d'Huez, a mountain in the French Alps, during those time periods. This information is summarised in Table 1, which gives the ascent times for 15 randomly selected racers during the 1990-2003 period (x) and 14 randomly selected racers during the 2004-2011 period (y).

Table 1: Numerical summary of ascent times (in minutes); values as they appear in the source

      x        y
    37.58    37.6
    38       38.57
    38.02    38.58
    38.07    38.62
    38.38    39.28
    39.03    39.35
    39.1     39.52
    39.47    40.77
    30.47    41.35
    39.5     45
    39.73    477
    39.75    41.95
    40       43.2
    40.85    43.2
    413

Figure 3: Graphical summaries of the ascent times [several graphs of x and y, labelled with Roman numerals; axis: Ascent times (mins)]

i) [1 mark] Which of the graphs in Figure 3 (labelled with Roman numerals) best answers the research question?
ii) [2 marks] Briefly summarise what your graph says about the ascent times during the period 1990-2003 compared to the ascent times during 2004-2011.
iii) [9 marks] Making any necessary assumptions, use a hypothesis test to find out if there is evidence that, on average, TdF racers during the era from 1990-2003 had faster ascent times up Alpe d'Huez than TdF racers during the period 2004-2011. In your answer:
    a) State clearly the hypotheses H0 and Ha.
    b) Calculate the test statistic, showing your working.
    c) State the distribution of the test statistic assuming H0 is true.
    d) Give an expression for the P-value, and find this P-value as accurately as you can using tables.
    e) State your conclusion in simple language.
iv) [3 marks]
    a) What assumptions were required for your hypothesis test to be valid?
    b) Do you think these assumptions were valid? In your answer, refer to the appropriate information and analyses in the above.

Use a separate book clearly marked Question 3

3. [15 marks] Consider a similar research question:

    Is there a relationship between the Alpe d'Huez ascent times of TdF cyclists and the time periods 1986-1997, 1998-2005 and 2006-2011?

The boundaries for these time periods were chosen to reflect key events in the anti-doping movement: the 1998 Festina seizure of drugs by French customs and the 2006 Operación Puerto investigation. Suppose a simple random sample of TdF cyclists who ascended Alpe d'Huez was taken for each of the three time periods. The results are summarised in the table below.

Table 2: Alpe d'Huez ascent times in recent history

                   Less than 40 mins   Greater than 40 mins
    1986 - 1997           20                   20
    1998 - 2005           22                   58
    2006 - 2011            6                   34

i) [3 marks]
    a) Find the marginal distributions of ascent time and key time period.
    b) Assuming the time periods and ascent times are independent, find the expected number of individuals with ascent time less than 40 mins during the 2006-2011 period.
ii) [9 marks] Making any necessary assumptions, use a hypothesis test to find out if there is evidence that the ascent times of the TdF racers up Alpe d'Huez are related to the three time periods in question. In your answer:
    a) State clearly the hypotheses H0 and Ha.
    b) Calculate the test statistic, showing your working.
    c) State the distribution of the test statistic assuming H0 is true.
    d) Give an expression for the P-value, and find this P-value as accurately as you can using tables.
    e) State your conclusion in simple language.
iii) [3 marks]
    a) What assumptions were required for your hypothesis test to be valid?
    b) Do you think these assumptions were valid? In your answer, refer to the appropriate information and analyses in the above.

Use a separate book clearly marked Question 4

4. [15 marks] To help combat the use of performance-enhancing drugs (PEDs), professional cyclists are tested for the presence of PEDs at random throughout the year. Suppose the anti-doping authority released (some incomplete) information about their testing frequency for the cyclists.

Figure 4: Incomplete probability histogram of the drug tests for the cyclists [x-axis: Number of random drug tests per month, 0 to 3; y-axis: Probability]

i) [5 marks] Let X be the number of drug tests completed by a professional cyclist over a month. Suppose you know that the maximum number of tests conducted on a rider was capped at three. Use the information given by Figure 4 to answer the following:
    a) What is the probability that a randomly chosen cyclist did not complete a drug test in a given month? In other words, what should the height of the zero column be in the incomplete histogram in Figure 4?
    b) Find μ_X, the mean number of drug tests completed by a cyclist in a month.
    c) Find σ_X², the variance of the number of drug tests completed by a cyclist in a month.
ii) [10 marks] Suppose that you were asked to conduct some research on the anti-doping process and want to find the probability that any rider will have to complete a drug test in a given month. Suppose you conduct a survey of 160 professional cyclists and ask if they had completed a drug test in the past month. Let X denote the number of cyclists, out of the 160 surveyed, that had completed a drug test.
    a) What distribution can be used to model X?
    b) Suppose you observe x = 144 (144 cyclists completed a drug test out of the 160 surveyed). What is the sample proportion, p̂, of cyclists that completed the drug tests?
    c) What is the approximate distribution of p̂?
    d) Construct a 95% confidence interval for p.
    e) Historically, the proportion of riders tested was claimed to be p = 0.85. Using your computed confidence interval, is there sufficient evidence that the proportion of riders tested has changed?

Semester 1, 2014 MATH1041

Use a separate book clearly marked Question 1

1. [15 marks] Two indicators of cardiac health are the systolic and diastolic blood pressure (BP) values. These two values measure the maximum and minimum blood pressure, respectively, in the arteries of an individual. They are recorded when the heart beats (systolic BP) and between heartbeats (diastolic BP). As part of a cardiac research project, a student sampled BP values (in mm Hg) on randomly chosen patients at a local hospital. The data is shown below with an accompanying regression analysis.

Figure 1: Scatterplot of BP values [x-axis: Diastolic Pressure (mmHg), 60 to 140]

    ## Call:
    ## lm(formula = Systolic ~ Diastolic)
    ##
    ## Residuals:
    ##    Min     1Q Median     3Q    Max
    ## -47.28 -10.99   1.25   9.73  38.15
    ##
    ## Coefficients:
    ##             Estimate Std. Error t value Pr(>|t|)
    ## (Intercept)   98.031     15.454    6.34  2.7e-07 ***
    ## Diastolic      0.614      0.174    3.68    0.001 **
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ##
    ## Residual standard error: 17.4 on 35 degrees of freedom
    ## Multiple R-squared: 0.268, Adjusted R-squared: 0.247
    ## F-statistic: 12.8 on 1 and 35 DF, p-value: 0.00103

Figure 2: Diagnostic plots of Regression Analysis [left panel: residuals vs fitted values; right panel: normal quantile plot of residuals against theoretical quantiles]

i) How strong is the relationship between the diastolic BP values and the systolic BP values? Include an appropriate numerical summary in your answer.
ii) Construct a 99% confidence interval for the true regression slope.
iii) Explain in words what this confidence interval tells us about the average systolic BP values.
iv) Suppose a colleague claimed that a change in diastolic BP does not impact on systolic BP values. Support or refute their claim with evidence using your computed confidence interval or any other relevant linear regression output.
v) Is linear regression appropriate for this dataset? In your answer refer to any specific assumptions that are made, whether or not they seem reasonable, and what output you used in coming to this conclusion.
vi) The typical diastolic pressure for a healthy individual is around 90 mm Hg. Using the regression model, predict the systolic BP for an individual whose diastolic pressure is 90 mm Hg.

Use a separate book clearly marked Question 2

2. [15 marks] A less known indicator of cardiac health is the Mean Arterial Pressure (MAP), which is roughly the average arterial pressure during a single cardiac cycle (heartbeat). A research question is posed:

    Do MAP values for cardiac patients decrease during the night compared to the daytime?
At noon and midnight, a student sampled MAP values (in mm Hg) on randomly chosen cardiac patients at a local hospital. Those recorded during the daytime are listed as x and those recorded at nighttime are listed as y. The dataset is shown in the table below along with some summary statistics.

Figure 3: Graphical summaries [several graphs of x (Day) and y (Night), labelled with Roman numerals]

Table 1: Numerical Summary [the column assignment of the individual values to x and y is lost in the source]

    Values as extracted, in source order:
    121 67 17 281 101 125 83 122 109 106 129 101 102 104 123 122 139 110 103
    100 91 101 106 118 105 95 108 110 90 91 103 121 105 119 122 111 153 116

    n_x = 28          n_y = 9
    x̄ = 10.71 [sic]   ȳ = 105.44
    s_x = 16.92       s_y = 11.01

i) Which of the graphs in Figure 3 (labelled with Roman numerals) best helps answer the research question? Explain your answer.
ii) Making any necessary assumptions, use a hypothesis test to find out if there is evidence that, on average, MAP values are lower during the night compared to during the daytime. In your answer:
    a) State clearly the hypotheses H0 and Ha.
    b) Calculate the test statistic, showing your working.
    c) State the distribution of the test statistic assuming H0 is true.
    d) Give an expression for the P-value, and find this P-value as accurately as you can using tables.
    e) State your conclusion in simple language.
iii) a) What assumptions were required for your hypothesis test to be valid?
     b) Do you think these assumptions were valid? In your answer, refer to the appropriate information and analyses in the above.

Use a separate book clearly marked Question 3

3. [15 marks] To study association between genetic traits, a researcher at the University of Delaware asked students to complete a survey including their hair and eye colour (see Snee (1974)). This data is shown below in Table 2.

Table 2: Counts of Eye and Hair colour

                             Eye colour
                        Brown   Blue   Hazel
    Hair colour Black   [garbled in source: 6 200016]
                Brown    119     84     54
                Blond      7     94     10

Using the information in Table 2, answer the questions below.

i) a) Find the marginal distributions of Hair colour and Eye colour.
   b) Find the probability that a randomly selected student from this study doesn't have blond hair.
   c) Suppose this researcher noticed from afar that a student was waiting outside their office for consultation. From that distance the researcher could only notice that the student doesn't have blond hair. Given this information, find the probability that this student has blue eyes.
ii) Making any necessary assumptions, use a hypothesis test to find out if there is evidence that hair colour is related to eye colour. In your answer:
    a) State clearly the hypotheses H0 and Ha.
    b) Calculate the test statistic, showing your working.
    c) State the distribution of the test statistic assuming H0 is true.
    d) Give an expression for the P-value, and find this P-value as accurately as you can using tables.
    e) State your conclusion in simple language.
iii) a) What assumptions were required for your hypothesis test to be valid?
     b) Do you think these assumptions were valid? In your answer, refer to the appropriate information and analyses in the above.

Use a separate book clearly marked Question 4

4. [15 marks]
i) Crude Birth rates can be used to help determine the rate of population growth. The Crude Birth rate gives the average annual number of births during a year per 1,000 individuals in the population at midyear. Crude Birth rates for 2012 were recorded for each nation, and a graphical summary for Oceania and Europe is shown below.

Figure 4: Graphical summary of Birth rates for Europe and Oceania [one plot each for Europe and Oceania; x-axis: Birth rate per 1000 individuals, 6 to 28]

Use the information given by Figure 4 to answer the following:
    a) Which continental area, Europe or Oceania, appears to have higher birth rates? Explain your answer with reference to the graph.
    b) In Figure 4, the mean population rate for the continents is displayed with a diamond symbol. In both cases the location of the mean is different to the location of the median. Explain why and, in your answer, refer to the graph.
ii) A controversial study was presented by Burt (1966) in the British Journal of Psychology. The intelligence of 27 pairs of identical twins was recorded by IQ score, with one twin raised in a foster home while the other was raised by their biological parents. Suppose a research claim is made:

    Biological parents are more likely to raise a child with IQ higher than 88.

Two tables, labelled (I) and (II), summarise Burt's data in two different ways.

    (I)                          Parent
                           Biological   Foster
    IQ score   IQ > 88         17         16
               IQ < 88         10         11

    (II)                        Biological parent
                               IQ > 88   IQ < 88
    Foster parent   IQ > 88      13         3
                    IQ < 88       4         7

Table (I) counts each individual with a high or low IQ separately according to biological or foster parent. Table (II) 'cross-classifies' and counts twin pairs whereby both twins had high IQs, both had low IQs, or only one of the twins had a high IQ.
    a) i. Which table, (I) or (II), is better to answer the research question? Use study design as your main consideration in answering this question.
       ii. Briefly explain why the table you chose in i. is better than the other table at answering the research question.
    b) Test whether individuals who were raised by their biological parent are more likely to have an IQ score higher than 88. In your answer:
       i. State clearly the hypotheses H0 and Ha.
       ii. Calculate the test statistic, showing your working.
       iii. State the distribution of the test statistic assuming H0 is true.
       iv. Give an expression for the P-value, and find this P-value as accurately as you can using tables.
       v. State your conclusion in simple language.
    c) i. What assumptions were required for your hypothesis test to be valid?
       ii. Do you think these assumptions were valid?
       In your answer, refer to the appropriate information and analyses in the above.
    d) A random sample of 4 pairs of twins will be chosen from the sample of 27 for a follow-up study. Let X be the number of twin pairs, from the sample of 4 drawn, in which at least one twin has an IQ score larger than 88. Use the information in the table(s) to verify that P(X = 2) = 0.23, to two decimal places.

Semester 2, 2015 MATH1041

Use a separate book clearly marked Question 1

1. Since 1979 satellites have regularly measured the extent of sea ice in the Arctic Ocean. Rapid melting of Arctic sea ice is seen as a symptom and a cause of a changing climate. The average September Sea Ice Extent (in 1,000,000 km²) recorded for each year is displayed in Figure 1.1. The left plot shows the recorded values between 1979-2002. The right plot shows the recorded values between 1979-2014.

Figure 1.1: Scatterplot of Arctic Sea Ice Extent vs Year for both datasets.

Figure 1.2: Diagnostic plots. Col. 1: 1979-2002 data. Col. 2: 1979-2014 data. [panels: residual plot (residuals vs fitted values) and normal quantile plot (standardised residuals vs theoretical quantiles) for each dataset]

Bill and Ben are two researchers studying data sets about Sea Ice Extent.

i) Use the information above to answer the following:
    a) Bill obtained the data from 1979-2002 and stated that a linear regression model is appropriate for this data. Justify Bill's statement with reference to both Figure 1.1 and Figure 1.2.
    b) Ben obtained the full data from 1979-2014 and stated that a linear regression model is not appropriate for this data.
       Justify Ben's statement with reference to both Figure 1.1 and Figure 1.2.
    c) Using Figure 1.2, suggest an alternative model that is more appropriate for the 1979-2014 data.
ii) Bill conducted a linear regression analysis for the 1979-2002 data. Some of the output is given below.

    ## Call:
    ## lm(formula = Extent ~ Year, data = sear)
    ##
    ## Coefficients:
    ##             Estimate Std. Error t value Pr(>|t|)
    ## (Intercept) 98.53965   25.8702    3.818 0.000938
    ## Year        -0.04606    0.01297  -3.853 0.001783
    ##
    ## Residual standard error: 0.4997 on 22 degrees of freedom
    ## Multiple R-squared: 0.3646, Adjusted R-squared: 0.3357

Use the regression output in answering the following questions about the 1979-2002 data:
    a) Is this a strong linear relationship? Justify your answer with a numerical summary from the output and explain.
    b) Compute a 95% confidence interval for the change in Extent of Sea Ice (in 1,000,000 km²) for each increase in year during 1979-2002.
    c) State in plain English what this confidence interval means in the context of the Sea Ice Extent during 1979-2002.
    d) Using the regression output above for the 1979-2002 data, predict the area of the ice for the year 2012.
    e) The full dataset for the years 1979-2014 shows that the observed ice area in 2012 was actually 3,565,600 km² (or y = 3.5656 in the units for the dataset).
       A) Compare this observed value with the value you predicted in d).
       B) Using concepts discussed in MATH1041, comment on the appropriateness of using this regression analysis to predict the area of ice for the year 2012.

Use a separate book clearly marked Question 2

2. i) A fair coin is flipped twice, so on each toss P(H) = 1/2 and P(T) = 1/2. Let A be the event 'the first flip is H'. Let B be the event 'both flips have the same outcome', i.e. HH or TT. Using the definition of independence given in MATH1041, prove that the events A and B are independent.
ii) About 10% of the population are left-handed.
Alex speculates that artists are more likely to be left-handed, and decides to investigate his theory. Alex asks 150 artists and finds that 18 are left-handed.
    a) Is this an experiment or an observational study? (Briefly explain your answer.)
    b) Carry out the hypothesis test for Alex. In your answer:
       A) Define the parameter which is being tested and clearly state H0 and Ha.
       B) State the test statistic. What is the distribution of the test statistic if H0 is true?
       C) State the observed value of the test statistic.
       D) Give an expression for the P-value, and give its value as best can be determined from the tables.
       E) State your conclusion in simple language.
       F) State what assumptions you (and Alex) needed to make to do this hypothesis test, and comment on whether these assumptions are reasonable to make in this study.
    c) Using no more than 40 words, write up the results of Alex's investigation and hypothesis test as a short item/article, as if you were writing it for a Science magazine for senior high school students.

Use a separate book clearly marked Question 3

3. i) The heights of young women are approximately normally distributed with a mean of 164 cm and a standard deviation of 6.4 cm. Using this approximation, answer the questions below as accurately as possible:
    a) Calculate the proportion of young women who are more than 174 cm tall.
    b) If a young woman is more than 174 cm tall, what is the probability that she is more than 180 cm tall? (That is, find the conditional probability that a young woman is more than 180 cm tall, given that she is known to be more than 174 cm tall.)
    c) Random samples of 25 young women are taken and the average height, X̄, is recorded.
       A) What is the mean of X̄?
       B) What is the standard deviation of X̄?
       C) Calculate the probability P(X̄ < 165).
ii) Studies have indicated that elite athletes in Canada may suffer from some nutritional intake deficiencies.
A sample of 114 male athletes from various Canadian Sports Centres participated in a survey. Their sample average calorie intake was 3077.0 kcal/day, with a standard deviation of 987.0.
    a) Construct a 95% confidence interval for the average calorie intake (in kcal/day) for Canadian high performance male athletes.
    b) The recommended daily calorie intake for males is 3421.7 kcal/day. Construct a 95% confidence interval for the average deficiency in calorie intake for Canadian high performance male athletes. (That is, construct a 95% confidence interval for the difference between the recommended intake and the average intake for Canadian high performance male athletes.)
    c) State what assumptions you needed to make to construct these two confidence intervals. Explain whether you can justify making any or all of these assumptions here.
    d) Using your confidence intervals (and making the same assumptions), explain whether there is statistically significant evidence that Canadian high performance male athletes are deficient in calorie intake on average. (You are NOT asked to carry out a hypothesis test here.)

Use a separate book clearly marked Question 4

4. A study collected data on all singleton births over a seven-week period. Five years later, the children and caregivers were invited to attend interviews. Some, but not all, attended the interview. Those who attended the interview are denoted as 'Reviewed'. One variable recorded was whether or not the child was covered by Medical Insurance at birth. The 'Medical Insurance' and 'Reviewed' data are summarised below in a table of counts.

                              Reviewed
                             No      Yes
    Medical Insurance  No    979     370
                       Yes   195      46
    Total                   1174     416

i) a) Of the subjects who were Reviewed, the proportion with Medical Insurance is 11.1%. Show how this can be calculated from the table. Of the subjects who were not Reviewed, the proportion with Medical Insurance is 16.6%. Show how this can be calculated.
   b) Assuming that Medical Insurance and Reviewed status are independent, verify that the expected count of subjects that have Medical Insurance and were Reviewed is 63.05.
   c) The difference between the sample conditional proportions of 11.1% and 16.6% computed in a) is statistically significant. Carry out a test to show the statistically significant relationship between Medical Insurance and Reviewed status for these subjects. You may use the fact that the observed X² = 7.36 for this dataset; you do NOT need to compute the value yourself. In your answer:
      A) Clearly state the hypotheses H0 and Ha.
      B) State the formula of the test statistic, X².
      C) State the exact distribution of X² assuming that H0 is true.
      D) Give an expression for the P-value, and provide bounds on its value.
      E) State your conclusion in terms of H0, Ha and in simple language.
   d) State what conditions are necessary for the hypothesis test in i) c) to be valid. Check to see if those conditions are valid for this dataset.
ii) The subjects in this study were classified into one of two Socioeconomic groups, denoted Group A and Group B. This data is summarised below.

                                 Group A                        Group B
                         Not Reviewed    Reviewed       Not Reviewed    Reviewed
    Medical       No      22 (17.5%)     2 (16.7%)      957 (91.3%)    368 (91.1%)
    Insurance     Yes    104 (82.5%)    10 (83.3%)       91 (8.7%)      36 (8.9%)
    Total                126            12              1048           404

This table shows that within Group A, of the subjects who were not Reviewed, the proportion who had Medical Insurance is 82.5%, compared to 83.3% of those subjects who were Reviewed.
On the other hand, within Group B, of the subjects who were not Reviewed, the proportion who had Medical Insurance is 8.7%, compared to 8.9% of those subjects who were Reviewed.
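The percentages quoted above can be reproduced directly from the counts. The following is a minimal Python sketch, not part of the original paper: the dictionary layout and the helper name `insured_pct` are illustrative assumptions, and each cell is entered as (subjects with Medical Insurance, column total) as read from the two-group table.

```python
# Counts read from the two-group table: (with Medical Insurance, column total).
# The layout and names here are illustrative assumptions, not from the exam.
counts = {
    "Group A": {"Not Reviewed": (104, 126), "Reviewed": (10, 12)},
    "Group B": {"Not Reviewed": (91, 1048), "Reviewed": (36, 404)},
}


def insured_pct(insured, total):
    """Percentage of subjects with Medical Insurance, to one decimal place."""
    return round(100 * insured / total, 1)


# Conditional proportions within each socioeconomic group.
by_group = {
    group: {status: insured_pct(*cell) for status, cell in cells.items()}
    for group, cells in counts.items()
}

# Pool the two groups to recover the whole-sample proportions from part i) a).
pooled = {}
for status in ("Not Reviewed", "Reviewed"):
    insured = sum(counts[g][status][0] for g in counts)
    total = sum(counts[g][status][1] for g in counts)
    pooled[status] = insured_pct(insured, total)

print(by_group)
print(pooled)
```

Printing both dictionaries side by side shows the within-group proportions next to the pooled ones, which is the comparison the following parts ask about.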
    a) Compare these conditional proportions of having Medical Insurance (conditional on being Reviewed or not Reviewed) within Group A and Group B, and compare them also with the conditional proportions for the whole group as given in part i) a).
    b) When hypothesis tests are carried out separately on each of Group A and Group B, it can be determined that there is NOT a statistically significant difference between the proportions when Socioeconomic group is taken into account. You are NOT asked to carry out these two hypothesis tests. Using ideas from MATH1041, comment on how this apparent reversal of conclusions and/or significance can take place.

JUNE 2016 MATH1041

Use a separate book clearly marked Question 1

1. [15 marks] Mauna Loa Observatory is a remote high-altitude atmospheric research facility that has been taking measurements for over 50 years, and is well known for its documentation of recent climate trends. Below we analyse average carbon dioxide concentration (in parts per million, ppm) and average temperature (in degrees Celsius) in January each year from 1959 to 2016, inclusive. These variables are stored as CO2 and Temp, respectively.

[Scatterplot: Temperature (degrees Celsius) against Carbon dioxide (ppm), 320 to 400]

    Call:
    lm(formula = Temp ~ CO2, data = maunaDat)

    Residuals:
         Min       1Q   Median       3Q      Max
    -3.01328 -0.94195  0.06269  1.14779  3.07680

    Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
    (Intercept) -1.330146   2.746397  -0.484   0.6300
    CO2          0.020241   0.007784   2.600   0.0119 *
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error: 1.53 on 56 degrees of freedom
    Multiple R-squared: 0.1077, Adjusted R-squared: 0.09181
    F-statistic: 6.762 on 1 and 56 DF, p-value: 0.01189

Where appropriate, use the output to answer the following questions.
i) [2 marks] Comment on the relationship between carbon dioxide concentration and temperature at the Mauna Loa Observatory.
ii) [3 marks] Many climate scientists advocate stabilising atmospheric carbon dioxide concentration below 450 ppm.
    a) What is the predicted average January temperature at Mauna Loa Observatory when carbon dioxide concentration is 450 ppm? (to two decimal places)
    b) Why should we be cautious about making such a prediction?
iii) [4 marks] Now we will make inferences about the true predicted change in temperature, given a change in carbon dioxide concentration.
    a) Construct a 95% confidence interval for the expected increase in average January temperature at Mauna Loa Observatory when carbon dioxide concentration increases by one ppm.
    b) Hence construct a 95% confidence interval for the expected increase in average January temperature at Mauna Loa Observatory when carbon dioxide concentration increases by 50 ppm (as will occur if carbon dioxide concentration is to reach 450 ppm).
iv) [1 mark] Was this an observational study or an experiment?
v) [1 mark] Sceptics argue that increasing global temperatures may not be due to increasing atmospheric carbon dioxide concentration, but may instead be due to sunspot activity or other reasons. Use ideas from MATH1041 to briefly explain whether or not the Mauna Loa data disproves the sceptics' claims.
vi) [4 marks] To answer part (iii), several assumptions were made about the data.
    a) What are these assumptions?
    b) Do you think these assumptions are reasonable? (Refer to any output that supports your answer, including the graphs below.)

[Diagnostic plots: residuals against fitted values; normal quantile plot of residuals against theoretical quantiles]

Use a separate book clearly marked Question 2

2. A Queensland University of Technology study looked at the question of whether there is a difference in literacy between boys and girls on average.
A simple random sample of 200 first grade girls and a simple random sample of 200 first grade boys were obtained, and each student's literacy was assessed using a standard exam. Below are the numerical and graphical summaries of the results.

              n      x̄      s
    girls    200    65.6   19.9
    boys     200    59.7   20.1

[Graphical summary: Literacy score (0 to 100) for girls and boys]

i) [1 mark] Briefly explain why this study is better described as an observational study rather than as an experiment.
ii) [9 marks] Test the hypothesis that literacy scores are the same, on average, for boys and girls. Include in your answer:
    a) a statement of the null hypothesis and alternative hypothesis;
    b) the test statistic and its distribution under the null hypothesis;
    c) the P-value;
    d) a conclusion in plain language.
iii) [3 marks] State the assumptions you needed to make to answer part (ii), and whether or not you think these assumptions were reasonable. Note: Make sure you explain your reasoning, and refer to any evidence in the output that supports your case.
iv) [2 marks] The computer output for the hypothesis test of part (ii) includes the following:

    95 percent confidence interval:
     -9.823379 -1.955725

Explain what this confidence interval is estimating, and what conclusion you could draw from it (in the context of the original research question).

Use a separate book clearly marked Question 3

3. [15 marks] How many cars are owned by Australian households? The Australian Bureau of Statistics 2011 census collected data to answer this question. Below is the distribution of the number of registered motor vehicles per household, as in the 2011 census:

    Number of cars |  0     1     2     3
    Probability    | 0.09  0.37  0.37  0.17

i) [1 mark] How is a census different from a sample?
ii) [4 marks] A useful measure of the centre of a random variable is the mean.
    a) Find the mean number of registered motor vehicles per household in 2011.
    b) In the 2006 census, the mean number of registered motor vehicles per household, across all Australian households, was 1.57. Do you think there is evidence that the average has changed since 2006? Explain your answer.
iii) [2 marks] A useful measure of the spread of a random variable is the standard deviation. Find the standard deviation of the number of cars per household in 2011.
iv) [2 marks] Now consider a simple random sample of 100 households, taken in 2011. Let T be the total number of registered motor vehicles across the 100 sampled households. Using your previous calculations in parts ii) and iii):
    a) Find the mean of T.
    b) Find the standard deviation of T.
v) [4 marks] Consider the question of whether or not a household has a registered motor vehicle. According to the 2011 census:
    a) Find the probability that the first household in a random sample has at least one registered motor vehicle.
    b) Find the probability that all 100 households in a random sample have at least one registered motor vehicle.
    c) Find the distribution of X, the number of households that have at least one registered motor vehicle, in a random sample of 100 Australian households.
vi) [2 marks] Briefly describe how you would take a random sample of 100 Australian households, given a list of all known household addresses in Australia.

Use a separate book clearly marked Question 4

4. Medical researchers studied the association between the amount of fish in a man's diet and prostate cancer by tracking the eating patterns of a simple random sample of 6272 Swedish men aged between 43 and 82 for many years (The Lancet, 2001, 357: 1764-66). The main results of the study are in the table below.

                                 Fish Consumption
                         Little/None   Moderate   Large
    Prostate Cancer           215         209       42
    No Prostate Cancer       2530        2769      507

The expected counts under the assumption of independence are below.
                                 Fish Consumption
                         Little/None   Moderate   Large
    Prostate Cancer         203.95      221.26     40.79
    No Prostate Cancer     2541.05     2756.74    508.21

i) [2 marks] Use a formula for calculating expected counts to verify that 203.95 is the expected count of men who were diagnosed with prostate cancer and who had little or no fish consumption.
ii) [5 marks] The computer output included an X² value of 1.4196. Use this value to carry out the hypothesis test of interest to the medical researchers, making sure to:
    a) state the null and alternative hypotheses;
    b) give the distribution of X² under the null hypothesis;
    c) give the P-value;
    d) state your conclusion in plain language.
iii) [3 marks] In order to proceed with the test of independence in part (ii), certain assumptions were made.
    a) What are these assumptions?
    b) Where possible, check if these assumptions are reasonable.
iv) [5 marks] A formula for an approximate 95% confidence interval for the true proportion p is

    \left( \hat{p} - 1.96\sqrt{\frac{\hat{p}(1-\hat{p})}{n}},\ \ \hat{p} + 1.96\sqrt{\frac{\hat{p}(1-\hat{p})}{n}} \right)

    a) Overall, 466 of the 6272 men were diagnosed with prostate cancer. Use the above formula to construct a 95% confidence interval for the proportion of all Swedish men aged between 43 and 82 who were diagnosed with prostate cancer.
    b) Explain how to derive the above formula using the following result: [formula illegible in source]
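The arithmetic behind part iv) a) can be sketched in a few lines of Python. This is a hedged illustration rather than a model answer: it assumes the interval has the usual form p̂ ± 1.96·sqrt(p̂(1 − p̂)/n), and the function name `wald_interval` is a hypothetical label chosen here, not something from the paper.

```python
import math


def wald_interval(successes, n, z=1.96):
    """Approximate CI for a proportion: p_hat +/- z * sqrt(p_hat * (1 - p_hat) / n).

    Assumes the normal approximation to the sampling distribution of p_hat.
    """
    p_hat = successes / n
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin


# 466 of the 6272 sampled men were diagnosed with prostate cancer.
lower, upper = wald_interval(466, 6272)
print(f"95% CI: ({lower:.4f}, {upper:.4f})")
```

The multiplier is left as a parameter `z` so the same sketch covers other confidence levels, for example 2.576 for a 99% interval.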
