S1 Summary

In statistics, we collect observations or measurements of some variables. These observations are known as data. Variables associated with non-numerical data are qualitative variables, and variables associated with numerical data are quantitative variables. The flowchart below shows different types of data in more detail.

[Flowchart: data are either qualitative (non-numerical data) or quantitative (numerical data); quantitative data are either discrete or continuous.]

* A variable that can take any value in a given range is a continuous variable.
* A variable that can take only specific values in a given range is a discrete variable.

The time, x seconds, taken by a random sample of females to run 400 m is measured and is shown in two different tables.
a Write down the class boundaries for the first row of each table.
b Find the midpoint and class width for the first row of each table.

Table 1
Time to run 400 m (s) | Number of females
55-65 | 2
65-70 | 25
70-75 | 30
75-90 | 13

Table 2
Time to run 400 m (s) | Number of females
55-65 | 2
66-70 | 25
71-75 | 30
76-90 | 13

a The class boundaries for Table 1 are 55 s and 65 s. The data have no gaps, so the class boundaries are the numbers of the class. The class boundaries for Table 2 are 54.5 s and 65.5 s because the data have gaps.
b The midpoint for Table 1 is $\frac{1}{2}(55 + 65) = 60$. The midpoint for Table 2 is $\frac{1}{2}(54.5 + 65.5) = 60$. The class width for Table 1 is $65 - 55 = 10$. The class width for Table 2 is $65.5 - 54.5 = 11$.

A measure of location is a single value which describes a position in a data set. If the single value describes the centre of the data, it is called a measure of central tendency. You should already know how to work out the mean, median and mode of a set of ungrouped data and from ungrouped frequency tables.

* The mode or modal class is the value or class that occurs most often.
* The median is the middle value when the data values are put in order.
* The mean can be calculated using the formula $\bar{x} = \frac{\sum x}{n}$, where $\bar{x}$ represents the mean of the data (you say "x bar"), $\sum x$ represents the sum of the data values and $n$ is the number of data values.
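As a quick check of these three definitions, the sketch below computes the mode, median and mean of a small data set using Python's standard `statistics` module. The data values are made up for illustration and do not come from the text.

```python
import statistics

# Hypothetical ungrouped data (illustrative values only).
data = [2, 3, 3, 5, 7]

# Mode: the value(s) that occur most often. multimode returns every mode,
# so bimodal data would give a list of two values.
modes = statistics.multimode(data)

# Median: the middle value once the data values are put in order.
median = statistics.median(data)

# Mean: the sum of the data values divided by the number of data values.
mean = sum(data) / len(data)

print(modes, median, mean)
```

Here the mean is pulled above the median by the largest value, which is the sensitivity to extreme values that separates the two measures.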
Combining means
If set A, of size $n_1$, has mean $\bar{x}_1$ and set B, of size $n_2$, has mean $\bar{x}_2$, then the mean of the combined set of A and B is $\bar{x} = \frac{n_1\bar{x}_1 + n_2\bar{x}_2}{n_1 + n_2}$.

You need to decide on the best measure to use in particular situations.
* Mode: This is used when data are qualitative, or when quantitative with either a single mode or two modes (bimodal). There is no mode if each value occurs just once.
* Median: This is used for quantitative data. It is usually used when there are extreme values, as they do not affect it as much as they affect the mean.
* Mean: This is used for quantitative data and uses all the pieces of data. It therefore gives a true measure of the data. However, it is affected by extreme values.

You can calculate the mean and median for discrete data presented in a frequency table.
* For data given in a frequency table, the mean can be calculated using the formula $\bar{x} = \frac{\sum xf}{\sum f}$, where $\sum xf$ is the sum of the products of the data values and their frequencies, and $\sum f$ is the sum of the frequencies.

The median describes the middle of the data set. It splits the data set into two equal (50%) halves. You can calculate other measures of location such as quartiles and percentiles.

[Diagram: lowest value, lower quartile $Q_1$, median $Q_2$, upper quartile $Q_3$, highest value.]

* To find the lower quartile for discrete data, divide $n$ by 4. If this is a whole number, the lower quartile is halfway between this data point and the one above. If it is not a whole number, round up and pick this data point.
* To find the upper quartile for discrete data, find $\frac{3}{4}$ of $n$. If this is a whole number, the upper quartile is halfway between this data point and the one above. If it is not a whole number, round up and pick this data point.

When data are presented in a grouped frequency table you can use a technique called interpolation to estimate the median, quartiles and percentiles. When you use interpolation, you are assuming that the data values are evenly distributed within each class.

The length of time (to the nearest minute) spent on the internet each evening by a group of students is shown in the table.

Length of time spent on the internet (minutes) | 30-31 | 32-33 | 34-36 | 37-39
Frequency | 2 | 25 | 30 | 13

a Find an estimate for the upper quartile.
b Find an estimate for the 10th percentile.

a Upper quartile: $\frac{3 \times 70}{4} = 52.5$, so $Q_3$ is the 52.5th value. This lies in the class 34-36, which has class boundaries 33.5 and 36.5 and cumulative frequencies 27 and 57 at those boundaries.
$\frac{Q_3 - 33.5}{36.5 - 33.5} = \frac{52.5 - 27}{57 - 27}$, so $Q_3 = 36.05$.
b The 10th percentile is the 7th data value. This lies in the class 32-33, which has class boundaries 31.5 and 33.5 and cumulative frequencies 2 and 27 at those boundaries.
$\frac{P_{10} - 31.5}{33.5 - 31.5} = \frac{7 - 2}{27 - 2}$, so $P_{10} = 31.9$.

In each case the required value lies the same fraction of the way between the class boundaries as its position lies between the cumulative frequencies. Equate these two fractions to form an equation and solve it.

A measure of spread is a measure of how spread out the data are. Measures of spread are sometimes called measures of dispersion or measures of variation. Here are two simple measures of spread.
* The range is the difference between the largest and smallest values in the data set.
* The interquartile range (IQR) is the difference between the upper quartile and the lower quartile, $Q_3 - Q_1$.

The range covers the whole data set, so it is affected by extreme values. The interquartile range is not affected by extreme values but only considers the spread of the middle 50% of the data.
* The interpercentile range is the difference between the values for two given percentiles.

The 10th to 90th interpercentile range is often used since it is not affected by extreme values but still considers 80% of the data in its calculation.

Another measure that can be used to work out the spread of a data set is the variance. This makes use of the fact that each data point deviates from the mean by the amount $x - \bar{x}$.
* Variance $= \frac{\sum (x - \bar{x})^2}{n} = \frac{\sum x^2}{n} - \left(\frac{\sum x}{n}\right)^2 = \frac{S_{xx}}{n}$, where $S_{xx} = \sum (x - \bar{x})^2 = \sum x^2 - \frac{(\sum x)^2}{n}$.

$S_{xx}$ is a summary statistic, which is used to make formulae easier to use and learn.

The second version of the formula, $\frac{\sum x^2}{n} - \left(\frac{\sum x}{n}\right)^2$, is easier to work with when given raw data. It can be thought of as "the mean of the squares minus the square of the mean". The third version, $\frac{S_{xx}}{n}$, is easier to use if you can use your calculator to find $S_{xx}$ quickly.

The units of the variance are the units of the data squared. You can find a related measure of spread that has the same units as the data.
* The standard deviation is the square root of the variance: $\sigma = \sqrt{\frac{\sum (x - \bar{x})^2}{n}} = \sqrt{\frac{\sum x^2}{n} - \left(\frac{\sum x}{n}\right)^2} = \sqrt{\frac{S_{xx}}{n}}$.

$\sigma$ is the symbol we use for the standard deviation of a data set. Hence $\sigma^2$ is used for the variance.
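The three versions of the variance formula can be checked against each other numerically. The following is a minimal sketch with made-up data. The variable names `var_1`, `var_2`, `var_3` and `s_xx` are chosen here for illustration only.

```python
import math

# Hypothetical raw data (illustrative values only).
data = [2, 4, 4, 6, 9]
n = len(data)
mean = sum(data) / n

# Version 1: the mean of the squared deviations from the mean.
var_1 = sum((x - mean) ** 2 for x in data) / n

# Version 2: "the mean of the squares minus the square of the mean".
var_2 = sum(x ** 2 for x in data) / n - (sum(data) / n) ** 2

# Version 3: S_xx / n, where S_xx = sum(x^2) - (sum(x))^2 / n.
s_xx = sum(x ** 2 for x in data) - sum(data) ** 2 / n
var_3 = s_xx / n

# Standard deviation: the square root of the variance,
# so it has the same units as the data.
sigma = math.sqrt(var_1)

print(var_1, var_2, var_3, sigma)
```

All three versions agree up to floating-point rounding, so in practice the choice depends only on which sums are already available.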
* You can use these versions of the formulae for variance and standard deviation for grouped data that is presented in a frequency table: $\sigma^2 = \frac{\sum f(x - \bar{x})^2}{\sum f} = \frac{\sum fx^2}{\sum f} - \left(\frac{\sum fx}{\sum f}\right)^2$ and $\sigma = \sqrt{\frac{\sum f(x - \bar{x})^2}{\sum f}} = \sqrt{\frac{\sum fx^2}{\sum f} - \left(\frac{\sum fx}{\sum f}\right)^2}$, where $f$ is the frequency for each group and $\sum f$ is the total frequency.

Coding is a way of simplifying statistical calculations. Each data value is coded to make a new set of data values which are easier to work with. In your exam, you will usually have to code values using a formula like this: $y = \frac{x - a}{b}$, where $a$ and $b$ are constants that you have to choose, or are given with the question.

When data are coded, different statistics change in different ways.
* If data are coded using the formula $y = \frac{x - a}{b}$:
  * the mean of the coded data is given by $\bar{y} = \frac{\bar{x} - a}{b}$
  * the standard deviation of the coded data is given by $\sigma_y = \frac{\sigma_x}{b}$, where $\sigma_x$ is the standard deviation of the original data.

You usually need to find the mean and standard deviation of the original data given the statistics for the coded data. You can rearrange the formulae as $\bar{x} = b\bar{y} + a$ and $\sigma_x = b\sigma_y$.

The variable y was measured to the nearest whole number. 60 observations were taken and are recorded in the table below.

y | 10-12 | 13-14 | 15-17 | 18-25
Frequency | 6 | 24 | 18 | 12

a Write down the class boundaries for the 13-14 class. (1 mark)
A histogram was drawn and the bar representing the 13-14 class had a width of 4 cm and a height of 6 cm. For the bar representing the 15-17 class, find:
b i the width (1 mark)
ii the height. (2 marks)
Remember that area is proportional to frequency.

The blood glucose levels of 30 females are recorded.
The results, in mmol/litre, are shown below:
1.7, 2.2, 2.3, 2.3, 2.5, 2.7, 3.1, 3.2, 3.6, 3.7, 3.7, 3.7, 3.8, 3.8, 3.8, 3.8, 3.9, 3.9, 3.9, 4.0, 4.0, 4.0, 4.0, 4.4, 4.5, 4.6, 4.7, 4.8, 5.0, 5.1

An outlier is an observation that falls either more than 1.5 × the interquartile range above the upper quartile, or more than 1.5 × the interquartile range below the lower quartile.

a $Q_1$: $\frac{30}{4} = 7.5$, so pick the 8th term, which is 3.2. $Q_3$: $\frac{3(30)}{4} = 22.5$, so pick the 23rd term, which is 4.0.
b Interquartile range $= 4.0 - 3.2 = 0.8$. Outliers are values less than $3.2 - 1.5 \times 0.8 = 2$ or greater than $4.0 + 1.5 \times 0.8 = 5.2$. Therefore 1.7 is the only outlier.

A box plot can be drawn to represent important features of the data. It shows the quartiles, maximum and minimum values and any outliers. A box plot looks like this:

[Figure: box plot showing the lowest value, lower quartile, median, upper quartile, highest value and an outlier.]

Two sets of data can be compared using box plots.

Achara recorded the resting pulse rate for the 16 boys and 23 girls in her year at school. The results were as follows:
Girls Boys 5S 80 84 91 80 92 80 60 91 65 98 40 60 64 66 72 679 78 46 96 85 88 90 76 SH nn ms s8 92 78 80 79 64 60 50 68
a Construct a back-to-back stem and leaf diagram to represent these data.
b Comment on your results.
Use the steps outlined in Example 6 to complete the stem and leaf diagram.

Another test uses the measures of location:
* Mode = median = mean describes a distribution which is symmetrical.
* Mode < median < mean describes a distribution with a positive skew.
* Mode > median > mean describes a distribution with a negative skew.

You can also calculate $\frac{3(\text{mean} - \text{median})}{\text{standard deviation}}$, which tells you how skewed the data are.

[Figure: three distribution shapes labelled negative skew, symmetrical and positive skew.]

* A value of 0 implies that the mean = median and the distribution is symmetrical.
* A positive value implies that the median < mean and the distribution is positively skewed.
* A negative value implies that median > mean and the distribution is negatively skewed.

The further from 0 the value is, the more likely the data will be skewed.

* A Venn diagram can be used to represent events graphically.
Frequencies or probabilities can be placed in the regions of the Venn diagram.

A rectangle represents the sample space, $S$, and it contains closed curves that represent events. For events A and B in a sample space $S$:
1 The event A and B (the intersection, $A \cap B$)
2 The event A or B (the union, $A \cup B$)
3 The event not A (written $A'$)

[Figure: three Venn diagrams shading the regions for A and B, A or B, and not A.]

You can write numbers of outcomes (frequencies) or the probability of the events in a Venn diagram to help solve problems.

When events have no outcomes in common they are called mutually exclusive. In a Venn diagram, the closed curves do not overlap and you can use a simple addition rule to work out combined probabilities:
* For mutually exclusive events, P(A or B) = P(A) + P(B).

When one event has no effect on another, they are independent. Therefore if A and B are independent, the probability of A happening is the same whether or not B happens.
* For independent events, P(A and B) = P(A) × P(B).

You can use this multiplication rule to determine whether or not events are independent.

The probability of an event can change depending on the outcome of a previous event. For example, the probability of you being late for work may change depending on whether or not you oversleep. Situations like this can be modelled using conditional probability. You use a vertical line symbol to indicate conditional probabilities.
* The probability that B occurs given that A has already occurred is written as $P(B \mid A)$. Similarly, $P(B \mid A')$ describes the probability of B occurring given that A has not occurred.
* For independent events, $P(A \mid B) = P(A \mid B') = P(A)$, and $P(B \mid A) = P(B \mid A') = P(B)$. You can use this condition to determine independence.

You can solve some problems involving conditional probability by considering a restricted sample space of the outcomes where one event has already occurred.

There is a formula you can use for two events that links the probability of the union and the probability of the intersection. Let $P(A) = a$ and $P(B) = b$.
Write $i = P(A \cap B)$ for the probability of the intersection. The three regions inside the curves of the Venn diagram then have probabilities $a - i$, $i$ and $b - i$, so $P(A \cup B) = (a - i) + i + (b - i) = a + b - i$. This gives the following addition formula for two events A and B:
* $P(A \cup B) = P(A) + P(B) - P(A \cap B)$

* Bivariate data are data which have pairs of values for two variables.

You can represent bivariate data on a scatter diagram. This scatter diagram shows the results from an experiment on how breath rate affects pulse rate:

[Scatter diagram: pulse rate plotted against breath rate.]

The two different variables in a set of bivariate data are often related.
* Correlation describes the nature of the linear relationship between two variables.

For negatively correlated variables, when one variable increases the other decreases. For positively correlated variables, when one variable increases, the other also increases.
* An independent (or explanatory) variable is one that is set independently of the other variable. It is plotted along the x-axis.
* A dependent (or response) variable is one whose values are determined by the values of the independent variable. It is plotted along the y-axis.

When a scatter diagram shows correlation, you can draw a line of best fit. This is a linear model that approximates the relationship between the variables. One type of line of best fit that is useful in statistics is a least squares regression line. This is the straight line that minimises the sum of the squares of the vertical distances of each data point from the line. The least squares regression line is usually just called the regression line.
* The regression line of y on x is written in the form $y = a + bx$.
* The coefficient $b$ tells you the change in y for each unit change in x.
  * If the data are positively correlated, $b$ will be positive.
  * If the data are negatively correlated, $b$ will be negative.

If you know a value of the independent variable from a bivariate data set, you can use the regression line to make a prediction or estimate of the corresponding value of the dependent variable.
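A minimal sketch of fitting and using a regression line of y on x is given below. It assumes the standard least squares results $b = S_{xy} / S_{xx}$ and $a = \bar{y} - b\bar{x}$. The data values and the function name `regression_line` are hypothetical and chosen only for illustration.

```python
def regression_line(xs, ys):
    """Return (a, b) for the least squares regression line y = a + b*x."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    # Summary statistics S_xx and S_xy.
    s_xx = sum(x * x for x in xs) - sum_x ** 2 / n
    s_xy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    b = s_xy / s_xx                    # gradient: change in y per unit change in x
    a = sum_y / n - b * sum_x / n      # intercept: y-bar minus b times x-bar
    return a, b


# Hypothetical bivariate data (illustrative values only).
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.0]

a, b = regression_line(xs, ys)

# Predict y for an x value inside the range of the data.
predicted = a + b * 3.5

print(a, b, predicted)
```

The gradient comes out positive here, which matches the positive correlation visible in the made-up data.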
* You should only use the regression line to predict the dependent variable for values of the independent variable that are within the range of the given data. This is called interpolation. Making a prediction based on a value outside the range of the given data is called extrapolation, and gives a much less reliable estimate.

When you are analysing bivariate data, you can use a least squares regression line to predict values of the dependent (response) variable for given values of the independent (explanatory) variable. If the response variable is y and the explanatory variable is x, you should use the regression line of y on x, which can be written in the form $y = a + bx$.

The least squares regression line is the line that minimises the sum of the squares of the residuals of each data point. The residual of a given data point is the difference between the observed value of the dependent variable and the predicted value of the dependent variable.

You need to be able to find the equation of a least squares regression line either by using raw data or by using summary statistics.
* The equation of the regression line of y on x is $y = a + bx$, where $b = \frac{S_{xy}}{S_{xx}}$ and $a = \bar{y} - b\bar{x}$.

$S_{xx}$ and $S_{xy}$ are known as summary statistics and you can calculate them using these formulae:
* $S_{xy} = \sum xy - \frac{\sum x \sum y}{n}$
* $S_{xx} = \sum x^2 - \frac{(\sum x)^2}{n}$

The results from an experiment in which different masses were placed on a spring and the resulting length of the spring measured, are shown below.

Mass, x (kg) | 20 | 40 | 60 | 80 | 100
Length, y (cm) | 48 | 55.1 | 56.3 | 61.2 | 68

a Calculate $S_{xx}$ and $S_{xy}$. (You may use $\sum x = 300$, $\sum x^2 = 22000$, $\bar{x} = 60$, $\sum xy = 18238$, $\sum y^2 = 16879.14$, $\sum y = 288.6$, $\bar{y} = 57.72$.)
b Calculate the regression line of y on x.
c Use your equation to predict the length of the spring when the applied mass is: i 58 kg ii 130 kg.
d Comment on the reliability of your predictions.

a $S_{xx} = \sum x^2 - \frac{(\sum x)^2}{n} = 22000 - \frac{300^2}{5} = 4000$
$S_{xy} = \sum xy - \frac{\sum x \sum y}{n} = 18238 - \frac{300 \times 288.6}{5} = 922$
b $b = \frac{S_{xy}}{S_{xx}} = \frac{922}{4000} = 0.2305$ and $a = \bar{y} - b\bar{x} = 57.72 - 0.2305 \times 60 = 43.89$, so $y = 43.89 + 0.2305x$.
c i $y = 43.89 + 0.2305 \times 58 = 57.3$ cm (3 s.f.)
ii $y = 43.89 + 0.2305 \times 130 = 73.9$ cm (3 s.f.)
d Assuming the model is reasonable, the prediction when the mass is 58 kg is reliable since this is within the range of the data. The prediction when the mass is 130 kg is less reliable since this is outside the range of the data.
The product moment correlation coefficient (PMCC) measures the linear correlation between two variables. The PMCC can take values between -1 and 1, where 1 is perfect positive linear correlation and -1 is perfect negative linear correlation.

Previously, you used the summary statistics $S_{xx}$, $S_{yy}$ and $S_{xy}$ to calculate the coefficients of a regression line equation. You can also use these summary statistics to calculate the product moment correlation coefficient.
* The product moment correlation coefficient, $r$, is given by $r = \frac{S_{xy}}{\sqrt{S_{xx}S_{yy}}}$.

Sometimes the original data are coded to make it easier to manage. Coding affects different statistics in different ways. As long as the coding is linear (with a positive multiplier), the product moment correlation coefficient will be unaffected by the coding. Examples of linear coding of a data set $x_i$ are $p_i = ax_i + b$ and $p_i = \frac{x_i - a}{b}$.
* You can think of linear coding as a change in scale on the axes of a scatter diagram.

$r$ is a measure of linear relationship:
* $r = 1$: perfect positive linear correlation
* $r = -1$: perfect negative linear correlation
* $r = 0$: no linear correlation

You can rewrite the variables x and y by using a linear coding such as $p = \frac{x - a}{b}$ and $q = \frac{y - c}{d}$.

* Values of $r$ between 0 and 1 indicate a greater or lesser degree of positive correlation. The closer to 1 the better the correlation, the closer to 0 the worse the correlation.
* Values of $r$ between -1 and 0 indicate a greater or lesser degree of negative correlation. The closer to -1 the better the correlation, the closer to 0 the worse the correlation.

Even if two variables are associated and have a linear correlation, it does not necessarily mean that a change in one of the variables causes a change in the other variable. For example, just because the number of cars on the road has increased and the number of TVs bought has decreased, it does not mean that there is a causal link between these two variables.
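The formula for $r$ and the effect of linear coding can be checked numerically. The sketch below takes the spring readings as x = 20, 40, 60, 80, 100 and y = 48, 55.1, 56.3, 61.2, 68. The function name `pmcc` and the particular coding constants are assumptions made for illustration.

```python
import math


def pmcc(xs, ys):
    """Product moment correlation coefficient r = S_xy / sqrt(S_xx * S_yy)."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    s_xx = sum(x * x for x in xs) - sum_x ** 2 / n
    s_yy = sum(y * y for y in ys) - sum_y ** 2 / n
    s_xy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    return s_xy / math.sqrt(s_xx * s_yy)


# Spring readings: mass in kg and length in cm.
mass = [20, 40, 60, 80, 100]
length = [48, 55.1, 56.3, 61.2, 68]

r = pmcc(mass, length)

# Linear coding p = (x - 20) / 20 and q = 10 * (y - 48); both constants
# are illustrative choices with positive multipliers.
p = [(x - 20) / 20 for x in mass]
q = [10 * (y - 48) for y in length]
r_coded = pmcc(p, q)

print(r, r_coded)
```

Both values agree up to rounding, and the result is close to 1, which is consistent with a strong positive linear correlation between mass and length.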
A certain amount of critical thought is required when answering questions about correlation and causation.

A random variable is a variable whose value depends on the outcome of a random event. The probability that the random variable X takes a particular value x is written as P(X = x).

It is often useful to standardise normally distributed random variables. You do this by coding the data so that it can be modelled by the standard normal distribution.
* The standard normal distribution has mean 0 and standard deviation 1. The standard normal variable is written as $Z \sim N(0, 1^2)$.
* If $X \sim N(\mu, \sigma^2)$ is a normal random variable with mean $\mu$ and standard deviation $\sigma$, then you can code X using the formula $Z = \frac{X - \mu}{\sigma}$. The resulting z-values will be normally distributed with mean 0 and standard deviation 1.

If $X = x$ then the corresponding value of Z will be $z = \frac{x - \mu}{\sigma}$. The mean of the coded data will be $\frac{\mu - \mu}{\sigma} = 0$ and the standard deviation will be $\frac{\sigma}{\sigma} = 1$.

* If $P(Z > a)$ is less than 0.5, then $a$ will be > 0.
* If $P(Z > a)$ is greater than 0.5, then $a$ will be < 0.

For the standard normal curve $Z \sim N(0, 1^2)$, the probability $P(Z < a)$ is sometimes written as $\Phi(a)$. You can find it by entering $\mu = 0$ and $\sigma = 1$ into the normal cumulative distribution function on your calculator or by using the tables.

Here is a summary of the results.
* $P(Z \geq c) = 1 - P(Z \leq c)$
* $P(Z \leq -c) = 1 - P(Z \leq c)$
* $P(Z \geq -c) = P(Z \leq c)$
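The standardisation formula and the symmetry results can be illustrated with `statistics.NormalDist` from Python's standard library, used here in place of a calculator or tables. The distribution parameters (mean 50, standard deviation 4) and the values of x and c are hypothetical.

```python
from statistics import NormalDist

# Standard normal variable Z ~ N(0, 1^2).
Z = NormalDist(mu=0, sigma=1)


def phi(a):
    """Phi(a) = P(Z < a) for the standard normal distribution."""
    return Z.cdf(a)


# Hypothetical normal variable X ~ N(50, 4^2).
mu, sigma = 50, 4
x = 56

# Code x to a z-value, then use the standard normal distribution.
z = (x - mu) / sigma
p_standardised = phi(z)

# The same probability found directly from the distribution of X.
p_direct = NormalDist(mu, sigma).cdf(x)

# Symmetry results for an illustrative value c.
c = 1.2
upper_tail = 1 - phi(c)      # P(Z >= c) = 1 - P(Z <= c)
lower_tail = phi(-c)         # P(Z <= -c), which equals 1 - P(Z <= c)
upper_from_neg = 1 - phi(-c) # P(Z >= -c), which equals P(Z <= c)

print(z, p_standardised, p_direct, upper_tail, lower_tail)
```

The standardised and direct probabilities agree, which is the reason coding to Z lets a single table or function serve every normal distribution.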
