Study Guide

Handout #8, Chapter 5

5.1. Correlation

FIGURE 5.1 Scatterplots illustrating various types of relationship between two variables (positive linear, negative linear, curved, and no relationship).

• Pearson's sample correlation coefficient

DEFINITION
Pearson's sample correlation coefficient r is given by

    r = (1/(n − 1)) Σ [(xᵢ − x̄)/s_x] · [(yᵢ − ȳ)/s_y]

where s_x and s_y are the sample standard deviations of the x and y values. Although there are several different correlation coefficients, Pearson's correlation coefficient is by far the most commonly used, so the name "Pearson's" is often omitted and it is referred to as simply the correlation coefficient.

FIGURE 5.2 Viewing a scatterplot according to the signs of the deviations (xᵢ − x̄) and (yᵢ − ȳ): (a) a positive relationship, (b) a negative relationship, (c) no strong relationship.

• Properties of r

1. The value of r does not depend on the unit of measurement for either variable.
2. The value of r does not depend on which of the two variables is considered x.
3. The value of r is between −1 and +1.

FIGURE 5.4 Describing the strength of a linear relationship: weak for r between −0.5 and 0.5; moderate for r between −0.8 and −0.5 or between 0.5 and 0.8; strong for r between −1 and −0.8 or between 0.8 and 1.

A correlation coefficient of r = 1 occurs only when all the points in a scatterplot of the data lie exactly on a straight line that slopes upward. Similarly, r = −1 occurs only when all the points lie exactly on a downward-sloping line.

The value of r is a measure of the extent to which x and y are LINEARLY related. A value of r close to 0 does not rule out any strong relationship between x and y; there could still be a strong relationship that is not linear.

• The population correlation coefficient

The sample correlation coefficient r measures how strongly the x and y values in a sample of pairs are linearly related to one another. There is an analogous measure of how strongly x and y are related in the ENTIRE population of pairs from which the sample was obtained. It is called the POPULATION CORRELATION COEFFICIENT and is denoted by ρ.
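The z-score form of r in the definition above can be sketched directly in code. This is a minimal illustration, and the data values below are invented:

```python
import math

def pearson_r(xs, ys):
    """Pearson's sample correlation via the z-score form of the definition."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    # Sample standard deviations (divide by n - 1)
    sx = math.sqrt(sum((x - xbar) ** 2 for x in xs) / (n - 1))
    sy = math.sqrt(sum((y - ybar) ** 2 for y in ys) / (n - 1))
    # r = average product of z-scores, with an n - 1 divisor
    return sum(((x - xbar) / sx) * ((y - ybar) / sy)
               for x, y in zip(xs, ys)) / (n - 1)

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 6]
print(round(pearson_r(xs, ys), 4))  # 0.8528
```

Because r is built from z-scores, rescaling either variable (say, converting x from inches to centimeters) leaves r unchanged, which is property 1 above.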
Properties of ρ:

1. ρ is a number between −1 and +1 that does not depend on the unit of measurement for either x or y, or on which variable is labeled x and which is labeled y.
2. ρ = +1 or −1 if and only if all (x, y) pairs in the population lie exactly on a straight line, so ρ measures the extent to which there is a linear relationship in the population.

• Correlation and causation

CORRELATION DOES NOT IMPLY CAUSATION. Correlation measures the extent of association, but association does not imply causation. It frequently happens that two variables are highly correlated not because one is causally related to the other but because both are strongly related to a third variable.

Ex. Hot chocolate consumption and crime rate are negatively correlated: both are responses to cold weather.

Exercise 5.2 on pg. 221
5.2 Is the following statement correct? Explain why or why not. "A correlation coefficient of 0 implies that no relationship exists between the two variables under study."

Exercise 5.4 on pg. 221
5.4 The article "That's Rich: More You Drink, More You Earn" (Calgary Herald, April 16, 2002) reported that there was a positive correlation between alcohol consumption and income. Is it reasonable to conclude that increasing alcohol consumption will increase income? Give at least two reasons or examples to support your answer.

5.2. Linear Regression: Fitting a Line to Bivariate Data

• Fitting a straight line: the principle of least squares

FIGURE 5.9 Line I gives a poor fit and Line II (slope = 2, vertical intercept = 10) gives a good fit to the data.
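Figure 5.9's comparison can be made concrete in code: the quality of a candidate line is judged by its sum of squared vertical deviations, and the least-squares line is the one that minimizes that sum. The data and the two candidate lines below are invented for illustration (Line II borrows the figure's slope 2 and intercept 10):

```python
def sum_sq_dev(a, b, xs, ys):
    # Sum of squared vertical deviations of the points from y = a + b*x
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

xs = [5, 10, 15, 20, 25, 30]
ys = [20, 30, 38, 52, 59, 71]

line_I = sum_sq_dev(30, 0.5, xs, ys)   # a poorly chosen line
line_II = sum_sq_dev(10, 2.0, xs, ys)  # close to the least-squares line
print(line_I > line_II)  # True
```

Line II's sum of squared deviations is far smaller, which is exactly what "gives a good fit" means under the principle of least squares.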
DEFINITION
Calculating formula for the slope of the least-squares line:

    b = [Σxᵢyᵢ − (Σxᵢ)(Σyᵢ)/n] / [Σxᵢ² − (Σxᵢ)²/n]

The intercept is a = ȳ − b·x̄, and the least-squares line is ŷ = a + bx.

FIGURE 5.13 Partial Minitab output for Example 5.6 (regression of perceived astringency on tannin concentration).

• Regression

The least-squares line is often called the sample regression line. This terminology comes from the relationship between the least-squares line and Pearson's correlation coefficient. To understand this relationship, we first need alternative expressions for the slope b and the equation of the line itself. With s_x and s_y denoting the sample standard deviations of the x's and y's, respectively, a bit of algebraic manipulation gives

    b = r · (s_y / s_x)        ŷ = ȳ + r · (s_y / s_x) · (x − x̄)

You do not need to use these formulas in any computations, but several of their implications are important for appreciating what the least-squares line does.

1. When x = x̄ is substituted in the equation of the line, ŷ = ȳ results. That is, the least-squares line passes through the point of averages (x̄, ȳ).

2. Suppose for the moment that r = 1, so that all points lie exactly on the line whose equation is

    ŷ = ȳ + (s_y / s_x)(x − x̄)

Now substitute x = x̄ + s_x, which is 1 standard deviation above x̄: the result is ŷ = ȳ + s_y, so y is predicted to be 1 standard deviation above its mean. In general (substituting into ŷ = ȳ + r(s_y/s_x)(x − x̄)), the prediction is ŷ = ȳ + r·s_y: when x is 1 standard deviation above its mean, y is predicted to be only r standard deviations above its mean.

Exercise 5.26 on pg. 234
5.26 The sales manager of a large company selected a random sample of n = 10 salespeople and determined for each one the values of x = years of sales experience and y = annual sales (in thousands of dollars). A scatterplot of the resulting (x, y) pairs showed a linear pattern.
a. Suppose that the sample correlation coefficient is r = .75 and that the average annual sales is ȳ = 100. If a particular salesperson is 2 standard deviations above the mean in terms of experience, what would you predict for that person's annual sales?
b. If a particular person whose sales experience is 1.5 standard deviations below the average experience is predicted to have an annual sales value that is 1 standard deviation below the average annual sales, what is the value of r?

5.3. Assessing the Fit of a Line

• Predicted values and residuals

The predicted value corresponding to the first observation in a data set is obtained by substituting that x value, x₁, into the regression equation to obtain ŷ₁, where

    ŷ₁ = a + b·x₁

The difference between the actual y value for the first observation, y₁, and the corresponding predicted value is

    y₁ − ŷ₁

This difference, called a residual, is the vertical deviation of a point in the scatterplot from the regression line.

FIGURE 5.14 The least-squares line and the residuals.

• Plotting the residuals

DEFINITION
A residual plot is a scatterplot of the (x, residual) pairs. Isolated points or a pattern of points in the residual plot indicate potential problems.

FIGURE 5.20 Scatterplots and fitted least-squares lines illustrating the effect of an influential observation.

Unusual points in a bivariate data set are those that fall away from most of the other points in the scatterplot in either the x direction or the y direction.

An observation is potentially an INFLUENTIAL observation if it has an x value that is far away from the rest of the data (separated from the rest of the data in the x direction). To determine if the observation is in fact influential, we assess whether removal of this observation has a large impact on the value of the slope or intercept of the least-squares line.

An observation is an OUTLIER if it has a large residual.
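The predicted values and residuals defined above can be computed directly. This sketch uses invented data, and also checks a useful property of least-squares fits: the residuals always sum to (essentially) zero.

```python
def fit(xs, ys):
    # Least-squares slope and intercept (same formulas as Section 5.2)
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    b = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
         / sum((x - xbar) ** 2 for x in xs))
    return ybar - b * xbar, b

xs = [5, 10, 15, 20, 25, 30]
ys = [20, 30, 38, 52, 59, 71]
a, b = fit(xs, ys)

predicted = [a + b * x for x in xs]                 # the y-hat values
residuals = [y - p for y, p in zip(ys, predicted)]  # vertical deviations

# The (x, residual) pairs are what a residual plot displays; a point
# with an unusually large residual is an outlier candidate.
print([round(r, 2) for r in residuals])
print(abs(sum(residuals)) < 1e-9)  # True
```

Scanning the printed residuals for values far from 0 is the numerical counterpart of looking for isolated points in a residual plot.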
Outlier observations fall far away from the least-squares line in the y direction.

• Coefficient of determination

DEFINITION
The coefficient of determination, denoted by r², gives the proportion of variation in y that can be attributed to an approximate linear relationship between x and y. The value of r² is often converted to a percentage (by multiplying by 100) and interpreted as the percentage of variation in y that can be explained by an approximate linear relationship between x and y.

DEFINITION
The total sum of squares, denoted by SSTo, is defined as

    SSTo = (y₁ − ȳ)² + (y₂ − ȳ)² + ··· + (yₙ − ȳ)² = Σ(y − ȳ)²

The residual sum of squares (sometimes referred to as the error sum of squares), denoted by SSResid, is defined as

    SSResid = (y₁ − ŷ₁)² + (y₂ − ŷ₂)² + ··· + (yₙ − ŷₙ)² = Σ(y − ŷ)²

These sums of squares can be found as part of the regression output from most standard statistical packages or can be obtained using the following computational formulas:

    SSTo = Σy² − (Σy)²/n
    SSResid = Σy² − aΣy − bΣxy

The coefficient of determination is then

    r² = 1 − SSResid/SSTo

FIGURE 5.22 Interpreting sums of squares: (a) SSResid = sum of squared vertical deviations from the least-squares line; (b) SSTo = sum of squared vertical deviations from the horizontal line at height ȳ.

FIGURE 5.21 Minitab output for the data of Example 5.10 (regression of distance traveled on distance to debris), showing the fitted equation, R-Sq, and an analysis-of-variance table.

• Standard deviation about the least-squares line

DEFINITION
The standard deviation about the least-squares line is

    s_e = √(SSResid / (n − 2))

It is roughly the typical amount by which an observation deviates from the least-squares line.
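Putting the definitions together, this sketch computes SSTo, SSResid, r² = 1 − SSResid/SSTo, and s_e = √(SSResid/(n − 2)) for a small invented data set, and verifies that the computational formulas agree with the definitional ones:

```python
import math

xs = [5, 10, 15, 20, 25, 30]
ys = [20, 30, 38, 52, 59, 71]
n = len(xs)

# Least-squares fit
xbar, ybar = sum(xs) / n, sum(ys) / n
b = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
     / sum((x - xbar) ** 2 for x in xs))
a = ybar - b * xbar

preds = [a + b * x for x in xs]
ssto = sum((y - ybar) ** 2 for y in ys)                 # definitional form
ssresid = sum((y - p) ** 2 for y, p in zip(ys, preds))  # definitional form

# Computational forms should agree with the definitions above
ssto_alt = sum(y * y for y in ys) - sum(ys) ** 2 / n
ssresid_alt = (sum(y * y for y in ys) - a * sum(ys)
               - b * sum(x * y for x, y in zip(xs, ys)))

r_sq = 1 - ssresid / ssto             # coefficient of determination
s_e = math.sqrt(ssresid / (n - 2))    # std. deviation about the line
print(round(r_sq, 4), round(s_e, 2))
```

For this data r² comes out close to 1, matching the visual impression of a strong linear pattern, and s_e gives the typical size of a residual in the units of y.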
