strength of the association (linear relationship) between two variables.
Regression analysis is used to:
- Predict the value of a DV based on the value of at least one IV
- Explain the impact of changes in an IV on the DV
Example: income and spending, wage
DV: the variable we wish to predict or explain
IV: the variable used to predict or explain the DV

Positive - upward
Negative - downward

Equation of Simple Linear Regression:
Yi = β0 + β1Xi + εi
β0 is the constant
Only one IV, X
Relationship between X and Y is described by a linear function
Changes in Y are assumed to be related to changes in X
Representation of the linearity of the relationship: if the relationship is weak, the points are scattered.

Measures of Variation

When using the least-squares method to determine the regression coefficients for a set of data, you need to compute three measures of variation. The first measure, the total sum of squares (SST), is a measure of variation of the Yi values around their mean, Ȳ. The total variation, or total sum of squares, is subdivided into explained variation and unexplained variation. The explained variation, or regression sum of squares (SSR), represents variation that is explained by the relationship between X and Y, and the unexplained variation, or error sum of squares (SSE), represents variation due to factors other than the relationship between X and Y. Figure 13.6 shows these different measures of variation.

Assumptions

When hypothesis testing and the analysis of variance were discussed in Chapters 9 through 12, the importance of the assumptions to the validity of any conclusions reached was emphasized. The assumptions necessary for regression are similar to those of the analysis of variance because both are part of the general category of linear models (reference 4).

The four assumptions of regression (known by the acronym LINE) are as follows:
• Linearity
• Independence of errors
• Normality of error
• Equal variance

The first assumption, linearity, states that the relationship between variables is linear. Relationships between variables that are not linear are discussed in Chapter 15.

The second assumption, independence of errors, requires that the errors εi are independent of one another. This assumption is particularly important when data are collected over a period of time. In such situations, the errors in a specific time period are sometimes correlated with those of the previous time period.

The third assumption, normality, requires that the errors are normally distributed at each value of X. Like the t test and the ANOVA F test, regression analysis is fairly robust against departures from the normality assumption. As long as the distribution of the errors at each level of X is not extremely different from a normal distribution, inferences about β0 and β1 are not seriously affected.

The fourth assumption, equal variance, or homoscedasticity, requires that the variance of the errors εi be constant for all values of X. In other words, the variability of Y values is the same when X is a low value as when X is a high value. The equal-variance assumption is important when making inferences about β0 and β1. If there are serious departures from this assumption, you can use either data transformations or weighted least-squares methods (see reference 4).

Residual Analysis

Residual analysis visually evaluates these assumptions and helps you to determine whether the regression model that has been selected is appropriate.

The residual, or estimated error value, ei, is the difference between the observed (Yi) and predicted (Ŷi) values of the dependent variable for a given value of Xi. A residual appears on a scatter plot as the vertical distance between an observed value of Y and the prediction line. Equation (13.14) defines the residual.

RESIDUAL
The residual is equal to the difference between the observed value of Y and the predicted value of Y.
ei = Yi - Ŷi (13.14)

Evaluating the Assumptions

Recall from Section 13.4 that the four assumptions of regression (known by the acronym LINE) are linearity, independence, normality, and equal variance.

Linearity. To evaluate linearity, you plot the residuals on the vertical axis against the corresponding Xi values of the independent variable on the horizontal axis. If the linear model is appropriate for the data, you will not see any apparent pattern in the plot. However, if the linear model is not appropriate, in the residual plot, there will be a relationship between the Xi values and the residuals, ei.

You can see such a pattern in Figure 13.9. Panel A shows a situation in which, although there is an increasing trend in Y as X increases, the relationship seems curvilinear because the upward trend decreases for increasing values of X. This quadratic effect is highlighted in Panel B, where there is a clear relationship between Xi and ei. By plotting the residuals, the linear trend of X with Y has been removed, thereby exposing the lack of fit in the simple linear model. Thus, a quadratic model is a better fit and should be used in place of the simple linear model. (See Section 15.1 for further discussion of fitting curvilinear models.)

To determine whether the simple linear regression model is appropriate, return to the evaluation of the Sunflowers Apparel data. Figure 13.10 displays the predicted annual sales values and residuals.

To assess linearity, the residuals are plotted against the independent variable (store size, in thousands of square feet) in Figure 13.11. Although there is widespread scatter in the residual plot, there is no clear pattern or relationship between the residuals and X. The residuals appear to be evenly spread above and below 0 for different values of X. You can conclude that the linear model is appropriate for the Sunflowers Apparel data.

Independence. You can evaluate the assumption of independence of the errors by plotting the residuals in the order or sequence in which the data were collected. If the values of Y are part of a time series (see Section 2.6), one residual may sometimes be related to the previous residual. If this relationship exists between consecutive residuals (which violates the assumption of independence), the plot of the residuals versus the time in which the data were collected will often show a cyclical pattern. Because the Sunflowers Apparel data were collected during the same time period, you do not need to evaluate the independence assumption for these data.

Normality. You can evaluate the assumption of normality in the errors by organizing the residuals into a frequency distribution as shown in Table 13.3. You cannot construct a meaningful histogram because the sample size is too small. And with such a small sample size (n = 14), it can be difficult to evaluate the normality assumption by using a stem-and-leaf display (see Section 2.5), a boxplot (see Section 3.3), or a normal probability plot (see Section 6.3).

From the normal probability plot of the residuals in Figure 13.12, the data do not appear to depart substantially from a normal distribution. The robustness of regression analysis with modest departures from normality enables you to conclude that you should not be overly concerned about departures from this normality assumption in the Sunflowers Apparel data.

Equal Variance. You can evaluate the assumption of equal variance from a plot of the residuals with Xi. For the Sunflowers Apparel data of Figure 13.11 on page 540, there do not appear to be major differences in the variability of the residuals for different Xi values. Thus, you can conclude that there is no apparent violation in the assumption of equal variance at each level of X.

To examine a case in which the equal-variance assumption is violated, observe Figure 13.13, which is a plot of the residuals with Xi for a hypothetical set of data. This plot is fan shaped because the variability of the residuals increases dramatically as X increases. Because this plot shows unequal variances of the residuals at different levels of X, the equal-variance assumption is invalid.

Multiple regression models use two or more independent variables to predict the value of a dependent variable.

Compare the multiple regression model to the simple linear regression model [Equation (13.1) on page 522]:
Yi = β0 + β1Xi + εi

In the simple linear regression model, the slope, β1, represents the change in the mean of Y per unit change in X and does not take into account any other variables. In the multiple regression model with two independent variables [Equation (14.2)], the slope, β1, represents the change in the mean of Y per unit change in X1, taking into account the effect of X2.

As in the case of simple linear regression, you use the least-squares method to compute sample regression coefficients (b0, b1, and b2) as estimates of the population parameters (β0, β1, and β2). Equation (14.3) defines the regression equation for a multiple regression model with two independent variables.

MULTIPLE REGRESSION EQUATION WITH TWO INDEPENDENT VARIABLES
Ŷi = b0 + b1X1i + b2X2i (14.3)

r², Adjusted r², and the Overall F Test

This section discusses three methods that you can use to evaluate the overall multiple regression model: the coefficient of multiple determination, r², the adjusted r², and the overall F test.

Coefficient of Multiple Determination

Recall from Section 13.3 that the coefficient of determination, r², measures the proportion of the variation in Y that is explained by the independent variable X in the simple linear regression model. In multiple regression, the coefficient of multiple determination represents the proportion of the variation in Y that is explained by the set of independent variables. Equation (14.4) defines the coefficient of multiple determination for a multiple regression model with two or more independent variables.
r² = SSR / SST (14.4)

In the OmniPower example, from Figure 14.2 on page 580, SSR = 39,472,730.77 and SST = 52,093,677.44. Thus,
r² = 39,472,730.77 / 52,093,677.44 = 0.7577

The coefficient of multiple determination (r² = 0.7577) indicates that 75.77% of the variation in sales is explained by the variation in the price and in the promotional expenditures. The coefficient of multiple determination also appears in the Figure 14.2 results on page 580, and is labeled R Square in the Excel results and R-Sq in the Minitab results.

Adjusted r²

When considering multiple regression models, some statisticians suggest that you should use the adjusted r² to take into account both the number of independent variables in the model and the sample size. Reporting the adjusted r² is extremely important when you are comparing two or more regression models that predict the same dependent variable but have a different number of independent variables. Equation (14.5) defines the adjusted r².
adjusted r² = 1 - [(1 - r²)(n - 1) / (n - k - 1)] (14.5)
where k is the number of independent variables in the regression equation.

Therefore, 74.21% of the variation in sales is explained by the multiple regression model adjusted for the number of independent variables and sample size. The adjusted r² also appears in the Figure 14.2 results on page 580, and is labeled Adjusted R Square in the Excel results and R-Sq(adj) in the Minitab results.

Test for the Significance of the Overall Multiple Regression Model

You use the overall F test to determine whether there is a significant relationship between the dependent variable and the entire set of independent variables (the overall multiple regression model). Because there is more than one independent variable, you use the following null and alternative hypotheses:
H0: β1 = β2 = ... = βk = 0
H1: At least one βj ≠ 0, j = 1, 2, ..., k

Using a 0.05 level of significance, the critical value of the F distribution with 2 and 31 degrees of freedom found from Table E.5 is approximately 3.32 (see Figure 14.4 below). From Figure 14.2 on page 580, the FSTAT test statistic given in the ANOVA summary table is 48.4771. Because 48.4771 > 3.32, or because the p-value = 0.000 < 0.05, you reject H0 and conclude that at least one of the independent variables (price and/or promotional expenditures) is related to sales.

--------------------------------------------------
Factor Analysis

Exploratory factor analysis (EFA)
- only used if you can't find or identify a variable in the RRL.
- processed by forming the variables into a structure called factors.
- variables come from item-indicators or item-questions that describe the factor.
- basis: higher factor loading - the item with the highest loading in the group determines the name of the factor.
- respondents: min. 100 observations.
- if you are going to split the sample for validation purposes, the number of respondents must be 200 or higher.

General rule:
The minimum is to have at least five times as many observations as the number of variables to be analyzed. More acceptable is 10:1 (10 respondents per variable).

Interpretation:
Measures of Sampling Adequacy (MSA) through KMO and Bartlett's Test of Sphericity are needed to proceed with factor analysis.

KMO (Kaiser-Meyer-Olkin Test): it should be greater than 0.5 for a satisfactory factor analysis to proceed.
0.90 or above, marvelous
0.80 or above, meritorious
0.70 or above, middling
0.60 or above, mediocre
0.50 or above, miserable
Below 0.50, unacceptable

Bartlett's test: it should be significant (0.05 or less) to make the factor analysis reliable.

Total Variance Explained:
The cumulative variance explained should be 60% or higher to make it valid and reliable.

Communalities of Variables
A variable below 0.50 is a candidate for deletion, because it fails to explain the other variables. If it falls at or above 0.40 but below 0.50, it is at the researcher's discretion whether to remove it or to keep it as a significant factor.

Scree Plot
- the eigenvalue should be greater than 1 to be significant.
- if below 1, the factors are no longer considered in the analysis.
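
Worked sketch for the Measures of Variation notes: the Python below is a minimal illustration, assuming a small made-up data set (the x and y values are hypothetical and are not the Sunflowers Apparel data) and helper names invented for this example. It fits the least-squares line and computes SST, SSR, and SSE so that the split of total variation into explained and unexplained parts can be checked.

```python
# Hypothetical data for illustration only (not from the textbook examples).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

def least_squares(x, y):
    """Return (b0, b1) for the prediction line Y-hat = b0 + b1 * X."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1

def variation_measures(x, y):
    """Return (SST, SSR, SSE) for the least-squares line through (x, y)."""
    b0, b1 = least_squares(x, y)
    y_bar = sum(y) / len(y)
    y_hat = [b0 + b1 * xi for xi in x]
    sst = sum((yi - y_bar) ** 2 for yi in y)               # total variation
    ssr = sum((yh - y_bar) ** 2 for yh in y_hat)           # explained variation
    sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))  # unexplained variation
    return sst, ssr, sse

sst, ssr, sse = variation_measures(x, y)
r2 = ssr / sst  # proportion of variation in Y explained by X
print(f"SST={sst:.2f}  SSR={ssr:.2f}  SSE={sse:.2f}  r2={r2:.2f}")
```

For this made-up data the line is Y-hat = 2.2 + 0.6X, and SST = 6 splits into SSR = 3.6 and SSE = 2.4.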
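
Worked sketch for the Residual Analysis notes: a minimal Python illustration of the residual, ei = Yi - Ŷi, assuming a made-up curvilinear data set (hypothetical values, not taken from Figure 13.9) and a helper name invented here. A straight line fitted to such data tends to leave residuals that are negative at both ends and positive in the middle, which is the kind of pattern the linearity check looks for.

```python
# Hypothetical data in which the upward trend flattens as X increases.
x = [1, 2, 3, 4, 5]
y = [1.0, 4.0, 6.0, 7.0, 7.5]

def fit_line(x, y):
    """Least-squares intercept b0 and slope b1."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
          / sum((xi - x_bar) ** 2 for xi in x))
    return y_bar - b1 * x_bar, b1

b0, b1 = fit_line(x, y)
predicted = [b0 + b1 * xi for xi in x]
residuals = [yi - yh for yi, yh in zip(y, predicted)]  # e_i = Y_i - Y-hat_i

# Print (X_i, e_i) pairs; plotting them gives the residual plot.
for xi, ei in zip(x, residuals):
    print(f"X={xi}  residual={ei:+.2f}")
```

Plotting these pairs with X on the horizontal axis and the residuals on the vertical axis would show the curved pattern; with real data, the same pairs can be passed to any plotting tool.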
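
Worked sketch for the OmniPower figures in the r², adjusted r², and overall F test notes: the Python below recomputes the reported values from SSR and SST. The sample size n = 34 is an assumption inferred from the stated 2 and 31 degrees of freedom (n - k - 1 = 31 with k = 2); the variable names are invented for this example.

```python
# Values quoted in the notes (Figure 14.2).
ssr = 39_472_730.77
sst = 52_093_677.44
k = 2    # independent variables: price and promotional expenditures
n = 34   # assumption: inferred from 31 error degrees of freedom (n - k - 1)

sse = sst - ssr                                   # unexplained variation
r2 = ssr / sst                                    # coefficient of multiple determination
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)     # adjusted r2
f_stat = (ssr / k) / (sse / (n - k - 1))          # overall F statistic = MSR / MSE

f_critical = 3.32  # approximate critical value quoted in the notes (alpha = 0.05)

print(f"r2 = {r2:.4f}")
print(f"adjusted r2 = {adj_r2:.4f}")
print(f"F = {f_stat:.4f}, reject H0: {f_stat > f_critical}")
```

The results agree with the values in the notes (r² = 0.7577, adjusted r² = 0.7421, FSTAT about 48.48), which supports the inferred sample size.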
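
Worked sketch for the Factor Analysis rules of thumb: a minimal Python illustration that encodes the thresholds listed in these notes (KMO labels, sample-size rules, and the eigenvalue-greater-than-1 rule). The function names are hypothetical, and the KMO value and eigenvalues are assumed to come from whatever statistics package is being used; nothing here computes them.

```python
def kmo_label(kmo):
    """Map a KMO value to the label used in the notes."""
    if kmo >= 0.90:
        return "marvelous"
    if kmo >= 0.80:
        return "meritorious"
    if kmo >= 0.70:
        return "middling"
    if kmo >= 0.60:
        return "mediocre"
    if kmo >= 0.50:
        return "miserable"
    return "unacceptable"

def sample_size_check(n_obs, n_vars):
    """Return (meets_minimum, meets_preferred) per the notes' rules:
    minimum is 100 observations and a 5:1 ratio; preferred is 10:1."""
    meets_minimum = n_obs >= 100 and n_obs >= 5 * n_vars
    meets_preferred = meets_minimum and n_obs >= 10 * n_vars
    return meets_minimum, meets_preferred

def retained_factors(eigenvalues):
    """Keep only factors whose eigenvalue is greater than 1."""
    return [ev for ev in eigenvalues if ev > 1]

# Hypothetical inputs for illustration.
print(kmo_label(0.85))
print(sample_size_check(120, 20))
print(retained_factors([3.2, 1.4, 0.9, 0.5]))
```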