Unit 4 Probability and Statistics Concepts Revision

TOPIC 4: STATISTICS AND PROBABILITY

SAMPLING

We obtain data from a sample of a population when it is impractical to obtain data from the entire population.

You should know the four main categories of error that arise from sampling:
• Sampling errors occur when a characteristic of a sample differs from that of the population.
• Measurement errors are inaccuracies in measurement during data collection.
• Coverage errors occur when a sample does not truly reflect the population.
• Non-response errors occur when a large number of people selected for a survey choose not to respond.

SAMPLING METHODS

• In simple random sampling:
    » Each member of the population has the same chance of being selected in the sample.
    » Each set of n members of the population has the same chance of being selected as any other set of n members.
• In systematic sampling, the sample is created by selecting members of the population at regular intervals.
• In convenience sampling, members are chosen for the sample because they are easier to select or more likely to respond.
• In stratified sampling or quota sampling, the population is divided into subgroups, and the number of members sampled from each subgroup is proportional to the fraction of the population represented by that subgroup. If the members of each subgroup are randomly selected, the sample is a stratified sample. If the members are specifically chosen, the sample is a quota sample.

TYPES OF DATA AND ITS REPRESENTATION

Categorical data refers to data which describes a particular quality or characteristic.
Discrete data can take any of a set of exact number values $\{x_1, x_2, x_3, \ldots\}$. It is normally counted.
Continuous data can take any numerical value within a certain range. It is normally measured.
Grouped data is numerical data which is collected in groups or classes. The modal class is the class with the highest frequency.

A column graph is used to display discrete data and grouped data. The columns have spaces between them.
A frequency histogram is used to display continuous data. The classes are of equal width, and there are no spaces between the columns.

Data may be symmetric, positively skewed, or negatively skewed.
[Figure: three distribution shapes labelled symmetric, positively skewed, and negatively skewed]

We use a cumulative frequency graph to display the cumulative frequency for each data value in a distribution. This enables us to read off the values at each percentile.

MEASURING THE CENTRE OF DATA

The mean of a set of scores is their arithmetic average. For a large population, the population mean $\mu$ is generally unknown. The sample mean $\bar{x}$ is used as an approximation for $\mu$.
• For ungrouped data, $\bar{x} = \frac{\sum x}{n}$.
• For data in a frequency table, $\bar{x} = \frac{\sum f x}{\sum f}$, where $f$ is the frequency of each value.
• For grouped data we can only estimate the mean. We use the central value within each group to represent all scores within that group.

The median is the middle value of an ordered data set.
• For an odd number of data, the median is one of the original data values.
• For an even number of data, the median is the average of the two middle values, and may not be in the original data set.

The mode is the most frequently occurring score. If there are two modes we say the data is bimodal. For continuous data we refer to a modal class.

PERCENTILES

The $k$th percentile is the score $a$ such that $k\%$ of the scores are less than $a$.
The lower quartile ($Q_1$) is the 25th percentile. The median ($Q_2$) is the 50th percentile. The upper quartile ($Q_3$) is the 75th percentile.
You should know how to generate a cumulative frequency graph and use it to estimate $Q_1$, $Q_2$, and $Q_3$.

BOX AND WHISKER DIAGRAMS

A box and whisker diagram or box plot illustrates the five-number summary of a data set:
• minimum value
• $Q_1$
• median
• $Q_3$
• maximum value
[Figure: box plot with the minimum, $Q_1$, median, $Q_3$, and maximum labelled]
An outlier is indicated by an asterisk.

BIVARIATE STATISTICS

Correlation refers to the relationship between two numerical variables. We can use a scatter diagram to help identify outliers and to describe the correlation between variables.
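The measures of centre and the five-number summary described above can be sketched in a few lines of Python. This is only an illustration: the function name five_number_summary and the scores are made up, and the quartiles use the "median of each half" convention, which is an assumption and may differ slightly from the rule your calculator applies.

```python
from statistics import mean, median

def five_number_summary(data):
    """Return (min, Q1, median, Q3, max) using the median-of-halves convention.

    Assumes at least two data values.
    """
    xs = sorted(data)
    n = len(xs)
    half = n // 2
    lower = xs[:half]                              # values below the median position
    upper = xs[half + 1:] if n % 2 else xs[half:]  # values above the median position
    return xs[0], median(lower), median(xs), median(upper), xs[-1]

scores = [2, 3, 3, 4, 5, 5, 5, 6, 7, 9]  # made-up data for illustration
x_bar = mean(scores)                      # arithmetic average, sum(x) / n
summary = five_number_summary(scores)

print("mean:", x_bar)
print("min, Q1, median, Q3, max:", summary)
```

The five values returned are exactly the ones a box plot displays, so the same function could be used to check a diagram drawn by hand.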
We consider direction, strength, and linearity.
[Figure: scatter diagrams illustrating strong positive, moderate positive, strong negative, moderate negative, and weak negative correlation]

If a change in one variable causes a change in the other variable then we say there is a causal relationship between them.

To measure the strength of the relationship between two variables, we use Pearson's product-moment correlation coefficient $r$. The correlation coefficient lies in the range $-1 \le r \le 1$.
• The sign of $r$ indicates the direction of the correlation.
    » A positive value for $r$ indicates the variables are positively correlated.
    » A negative value for $r$ indicates the variables are negatively correlated.
• The size of $r$ indicates the strength of correlation.
    » A value of $r$ close to $+1$ or $-1$ indicates strong correlation between the variables.
    » A value of $r$ close to zero indicates weak correlation between the variables.

Line of best fit
If two variables are linearly correlated, we can draw a line of best fit to illustrate their relationship.
We can draw a line of best fit by eye, which passes through the mean point $(\bar{x}, \bar{y})$, and which fits the trend of the data.
To get a more accurate line of best fit, we use a method called linear regression. The line obtained is called the least squares regression line. You should be able to find this line using your calculator.
When using a line of best fit to estimate values, interpolation is usually reliable, whereas extrapolation may not be.

Spearman's rank correlation coefficient of a bivariate data set is defined as the Pearson product-moment correlation coefficient of the variables' ranks. It is often used when the data is clearly non-linear, but has an upward or downward trend.

PROBABILITY

A trial occurs each time we perform an experiment.
The possible results from each trial of an experiment are called its outcomes.
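To make the link between the two correlation coefficients in the bivariate statistics section concrete, here is a minimal sketch that computes Pearson's $r$ from its usual sums of squares and then obtains Spearman's coefficient by applying the same function to the ranks. The function names and the data are illustrative assumptions; tied values are given the average of their ranks, which is a common convention but not one stated in these notes.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's product-moment correlation coefficient of paired data."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def ranks(values):
    """Rank values from 1 (smallest); ties share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman_rs(xs, ys):
    """Spearman's coefficient: Pearson's r applied to the ranks."""
    return pearson_r(ranks(xs), ranks(ys))

x = [1, 2, 3, 4, 5]
y = [1, 8, 27, 64, 125]   # y = x^3: non-linear, but always increasing

print("Pearson r  :", pearson_r(x, y))
print("Spearman rs:", spearman_rs(x, y))
```

Because the data is strictly increasing, the ranks of x and y agree exactly and Spearman's coefficient is 1, while Pearson's $r$ is a little below 1 because the relationship is not linear.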
The sample space $U$ is the set of all possible outcomes of an experiment.

Experimental probability
In many situations, we can only measure the probability of an event by experimentation.
experimental probability = relative frequency of event

Theoretical probability
If all outcomes are equally likely, the probability of event $A$ is $P(A) = \frac{n(A)}{n(U)}$.
For any event $A$, $0 \le P(A) \le 1$.
For any event $A$, $A'$ is the event that $A$ does not occur. $A$ and $A'$ are complementary events, and $P(A) + P(A') = 1$.
The event that both $A$ and $B$ occur is written $A \cap B$.
The event that $A$ or $B$ or both occur is written $A \cup B$.
$P(A \cup B) = P(A) + P(B) - P(A \cap B)$
For disjoint or mutually exclusive events, $P(A \cap B) = 0$.

Making predictions using probability
If there are $n$ trials of an experiment, and an event has probability $p$ of occurring in each of the trials, then the number of times we expect the event to occur is $np$.

Independent events
Two events are independent if the occurrence of each of them does not affect the probability that the other occurs. An example of this is sampling with replacement.
For independent events $A$ and $B$, $P(A \cap B) = P(A) \times P(B)$.

Dependent events
Two events are dependent if the occurrence of one of them does affect the probability that the other occurs. An example of this is sampling without replacement.
For dependent events $A$ and $B$, $P(A \cap B) = P(A) \times P(B \text{ given that } A \text{ has occurred})$.

Conditional probability
For any two events $A$ and $B$, "$A \mid B$" represents the event "$A$ given that $B$ has occurred", and $P(A \mid B) = \frac{P(A \cap B)}{P(B)}$.

DISCRETE RANDOM VARIABLES

A random variable represents the possible numerical outcomes of an experiment.
A discrete random variable can take any of a set of distinct values.
If $X$ is a discrete random variable with possible values $\{x_1, x_2, \ldots, x_n\}$ and corresponding probabilities $\{p_1, p_2, \ldots, p_n\}$, then
• $0 \le p_i \le 1$ for all $i = 1, \ldots, n$
• $\sum_{i=1}^{n} p_i = p_1 + p_2 + \cdots + p_n = 1$
• $\{p_1, \ldots, p_n\}$ describes the probability distribution of $X$.
We can also describe the probability distribution of $X$ using a probability mass function $P(x) = P(X = x)$.
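The probability laws in the probability section above (the addition law, conditional probability, and the test for independence) can be checked on a small sample space. The sketch below assumes one roll of a fair six-sided die, so all outcomes are equally likely; the events A and B and the helper prob are invented for the illustration.

```python
from fractions import Fraction

U = set(range(1, 7))   # sample space for one roll of a fair die
A = {2, 4, 6}          # event: the roll is even
B = {4, 5, 6}          # event: the roll is at least 4

def prob(event, space=U):
    """P(event) = n(event) / n(U), valid when outcomes are equally likely."""
    return Fraction(len(event & space), len(space))

# Addition law: P(A or B) = P(A) + P(B) - P(A and B)
p_union = prob(A) + prob(B) - prob(A & B)

# Conditional probability: P(A | B) = P(A and B) / P(B)
p_a_given_b = prob(A & B) / prob(B)

# A and B are independent only if P(A and B) = P(A) * P(B)
independent = prob(A & B) == prob(A) * prob(B)

print("P(A or B) =", p_union)
print("P(A | B)  =", p_a_given_b)
print("independent?", independent)
```

Using Fraction keeps the probabilities exact, so the addition law can be compared directly with counting the outcomes in the union of A and B.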
The expectation of a discrete random variable $X$ is $E(X) = \mu = \sum_{i=1}^{n} x_i p_i$.
A game where $X$ is the gain to the player is said to be fair if $E(X) = 0$.
The mode is the data value $x_i$ whose probability $p_i$ is the highest.

THE BINOMIAL DISTRIBUTION

In a binomial experiment there are two possible results: success and failure.
Suppose there are $n$ independent trials of the same experiment with the probability of success being a constant $p$ for each trial. If $X$ represents the number of successes in the $n$ trials, then $X$ has a binomial distribution, and we write $X \sim B(n, p)$.
The binomial probability mass function is $P(X = x) = \binom{n}{x} p^x (1 - p)^{n - x}$ where $x = 0, 1, 2, \ldots, n$.
You should be able to use your calculator to find:
• $P(X = x)$ using the binomial probability distribution function
• $P(X \le x)$ or $P(X \ge x)$ using the binomial cumulative distribution function.
If $X \sim B(n, p)$, then
• $E(X) = \mu = np$
• $\mathrm{Var}(X) = \sigma^2 = np(1 - p)$
• $\sigma = \sqrt{np(1 - p)}$

THE NORMAL DISTRIBUTION

If the random variable $X$ has a normal distribution with mean $\mu$ and variance $\sigma^2$, we write $X \sim N(\mu, \sigma^2)$.
The probability density function is $f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2}$. It is a bell-shaped curve which is symmetric about $x = \mu$, and it has the property that:
• $\approx 68\%$ of all scores lie between $\mu - \sigma$ and $\mu + \sigma$
• $\approx 95\%$ of all scores lie between $\mu - 2\sigma$ and $\mu + 2\sigma$
• $\approx 99.7\%$ of all scores lie between $\mu - 3\sigma$ and $\mu + 3\sigma$.
You should be able to use your calculator to find normal probabilities of the form $P(X \le a)$ and $P(a \le X \le b)$.

[...]

HYPOTHESIS TESTING

• For a test of the null hypothesis $H_0$: $\mu = \mu_0$, the alternative hypothesis $H_1$ takes one of the forms:
    » $H_1$: $\mu > \mu_0$ (one-tailed hypothesis)
    » $H_1$: $\mu < \mu_0$ (one-tailed hypothesis)
    » $H_1$: $\mu \ne \mu_0$ (two-tailed hypothesis, as $\mu \ne \mu_0$ could mean $\mu > \mu_0$ or $\mu < \mu_0$).
• A test statistic is a random variable that summarises the information in a sample.
• The distribution of the test statistic under the assumptions of $H_0$ is called the null distribution.
• The p-value of a test statistic is the probability of a result that is as or more "extreme" being observed if $H_0$ is true.
• The significance level $\alpha$ of a statistical hypothesis test is the largest p-value that would result in rejecting $H_0$.
Any p-value less than or equal to $\alpha$ results in $H_0$ being rejected.
If a statistical hypothesis test has significance level $\alpha$, the probability of a Type I error is $\alpha$.
The significance level may be given as a decimal or a percentage.

General procedure
Step 1: Formulate statistical hypotheses.
Step 2: Choose a significance level for the test. This is a threshold for making a decision, like the confidence levels we saw previously.
Step 3: Use data from a sample to calculate a test statistic.
Step 4: Calculate a p-value for the test statistic. This is the probability of a result at least as extreme as that test statistic occurring under the assumptions of the null hypothesis.
Step 5: Make decisions about the hypotheses.
Step 6: Interpret the decision in the context of the problem.

The one-sample t-test
The t-test is used to test hypotheses about a population mean $\mu$ when:
• the population is normally distributed
• the population variance is unknown.
For a test of $H_0$: $\mu = \mu_0$ using a sample of size $n$ with sample mean $\bar{x}$ and sample standard deviation $s$:
• the test statistic is $t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}$
• the null distribution is $T \sim t_{n-1}$
• the p-value calculation depends on $H_1$:
    » If $H_1$: $\mu > \mu_0$, p-value $= P(T \ge t)$
    » If $H_1$: $\mu < \mu_0$, p-value $= P(T \le t)$
    » If $H_1$: $\mu \ne \mu_0$, p-value $= 2 \times P(T \ge |t|)$.

The two-sample t-test
The two-sample t-test is used to compare the means of two populations, using a sample from each.
If the populations have means $\mu_1$ and $\mu_2$, the null hypothesis has the form $H_0$: $\mu_1 = \mu_2$, or equivalently $H_0$: $\mu_1 - \mu_2 = 0$.
You should be able to use technology to calculate the test statistic and p-value. In this course you are expected to assume equal variances and hence use the pooled two-sample t-test on your calculator.

The $\chi^2$ goodness of fit test
The $\chi^2$ goodness of fit test is used to determine whether a probability distribution fits a set of data.
Consider a scenario with $k$ categories. Let $p_i$ be the population proportion of individuals in category $i$, where $p_1 + p_2 + \cdots + p_k = 1$.
The hypotheses in a $\chi^2$ goodness of fit test have the form:
$H_0$: $p_1 = p_{01},\ p_2 = p_{02},\ \ldots,\ p_k = p_{0k}$
$H_1$: at least one of $p_i \ne p_{0i}$
where $p_{0i}$ is the population proportion of category $i$ under the null hypothesis.
The test statistic for a $\chi^2$ goodness of fit test is $\chi^2_{\text{calc}} = \sum \frac{(f_{\text{obs}} - f_{\text{exp}})^2}{f_{\text{exp}}}$, where $f_{\text{obs}}$ is an observed frequency and $f_{\text{exp}}$ is an expected frequency.
Degrees of freedom (df) refers to the number of values that are free to vary. For a $\chi^2$ goodness of fit test, df = number of categories $- 1$.
p-value = probability of observing a value greater than or equal to $\chi^2_{\text{calc}}$ under the null distribution.
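As a rough illustration of the $\chi^2$ goodness of fit statistic above, the following sketch tests whether a die is fair. The observed counts are invented, and the function name chi_squared_gof is hypothetical. The code stops at the test statistic and the degrees of freedom; the p-value would still come from your calculator, as the notes expect.

```python
def chi_squared_gof(observed, proportions):
    """Return (chi2_calc, df) for a chi-squared goodness of fit test.

    observed    : observed frequency for each category
    proportions : proportions p_0i claimed by the null hypothesis
    """
    total = sum(observed)
    expected = [total * p for p in proportions]   # f_exp = n * p_0i
    chi2_calc = sum((f_obs - f_exp) ** 2 / f_exp
                    for f_obs, f_exp in zip(observed, expected))
    df = len(observed) - 1                        # number of categories - 1
    return chi2_calc, df

# Made-up data: 60 rolls of a die, H0 says each face has proportion 1/6
observed = [8, 12, 9, 11, 6, 14]
proportions = [1 / 6] * 6

chi2_calc, df = chi_squared_gof(observed, proportions)
print("chi-squared statistic:", chi2_calc)
print("degrees of freedom   :", df)
```

Each expected frequency here is 10, so the statistic is the sum of the squared differences 4, 4, 1, 1, 16, 16 divided by 10.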

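For the one-sample t-test described earlier, the test statistic $t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}$ can be computed directly. The sketch below uses an invented sample and a hypothetical function name, one_sample_t; it returns the statistic and the degrees of freedom $n - 1$ of the null distribution, and leaves the p-value to the calculator.

```python
from math import sqrt
from statistics import mean, stdev

def one_sample_t(sample, mu0):
    """Return (t, df) for a one-sample t-test of H0: mu = mu0."""
    n = len(sample)
    x_bar = mean(sample)
    s = stdev(sample)                   # sample standard deviation (divides by n - 1)
    t = (x_bar - mu0) / (s / sqrt(n))
    return t, n - 1                     # null distribution is t with n - 1 df

sample = [2, 4, 4, 4, 5, 5, 7, 9]       # made-up data
t, df = one_sample_t(sample, mu0=4)
print("t =", t, "with", df, "degrees of freedom")
```

For this sample the mean is 5 and the sample variance is 32/7, so the statistic simplifies to $\sqrt{7}/2$, roughly 1.32.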

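The binomial results from the binomial distribution section can also be verified numerically. This sketch assumes $X \sim B(5, 0.2)$, chosen only for illustration, and the function names binomial_pmf and binomial_cdf are hypothetical stand-ins for the calculator functions the notes refer to.

```python
from math import comb

def binomial_pmf(x, n, p):
    """P(X = x) for X ~ B(n, p)."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

def binomial_cdf(x, n, p):
    """P(X <= x) for X ~ B(n, p), summing the probability mass function."""
    return sum(binomial_pmf(k, n, p) for k in range(x + 1))

n, p = 5, 0.2                           # assumed parameters for the example

p_exactly_2 = binomial_pmf(2, n, p)     # P(X = 2)
p_at_most_1 = binomial_cdf(1, n, p)     # P(X <= 1)

# Expectation from the definition E(X) = sum of x * P(X = x); should equal n * p
expectation = sum(x * binomial_pmf(x, n, p) for x in range(n + 1))
variance = n * p * (1 - p)

print("P(X = 2)  =", p_exactly_2)
print("P(X <= 1) =", p_at_most_1)
print("E(X) =", expectation, " Var(X) =", variance)
```

Computing the expectation from the general definition for a discrete random variable and comparing it with $np$ is a quick way to see that the two formulas agree.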