
APPENDIX

http://en.wikipedia.org/wiki/Linear_regression

Linear regression

In statistics, linear regression refers to any approach to modeling the relationship between one or more
variables denoted y and one or more variables denoted X, such that the model depends linearly on the unknown
parameters to be estimated from the data. Such a model is called a "linear model." Most commonly, linear
regression refers to a model in which the conditional mean of y given the value of X is an affine function of X.
Less commonly, linear regression could refer to a model in which the median, or some other quantile, of the
conditional distribution of y given X is expressed as a linear function of X. Like all forms of regression analysis,
linear regression focuses on the conditional probability distribution of y given X, rather than on the joint probability
distribution of y and X, which is the domain of multivariate analysis.

Linear regression was the first type of regression analysis to be studied rigorously, and to be used extensively in
practical applications. This is because models which depend linearly on their unknown parameters are easier to
fit than models which are non-linearly related to their parameters and because the statistical properties of the
resulting estimators are easier to determine.

Linear regression has many practical uses. Most applications of linear regression fall into one of the following two
broad categories:

• If the goal is prediction, or forecasting, linear regression can be used to fit a predictive model to an
observed data set of y and X values. After developing such a model, if an additional value of X is then
given without its accompanying value of y, the fitted model can be used to make a prediction of the value
of y.

• Given a variable y and a number of variables X1, ..., Xp that may be related to y, linear regression
analysis can be applied to quantify the strength of the relationship between y and the Xj, to assess which
Xj may have no relationship with y at all, and to identify which subsets of the Xj contain redundant
information about y, so that once one of them is known, the others are no longer informative.

Linear regression models are often fitted using the least squares approach, but they may also be fitted in other
ways, such as by minimizing the "lack of fit" in some other norm, or by minimizing a penalized version of the least
squares loss function as in ridge regression. Conversely, the least squares approach can be used to fit models
that are not linear models. Thus, while the terms "least squares" and linear model are closely linked, they are not
synonymous.
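
As a rough sketch of the distinction drawn above, the example below fits the same hypothetical data by ordinary
least squares and by ridge regression (a penalized version of the least-squares loss). The synthetic data, the
penalty value lam, and the use of the normal equations are illustrative assumptions, not anything prescribed by
the text.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.3, size=50)

# Ordinary least squares: minimize ||y - X b||^2.
b_ols = np.linalg.solve(X.T @ X, X.T @ y)

# Ridge regression: minimize ||y - X b||^2 + lam * ||b||^2,
# which adds lam * I to the normal equations and shrinks the coefficients.
lam = 1.0
b_ridge = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

print("OLS:  ", b_ols)
print("ridge:", b_ridge)
```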

Introduction to linear regression

Given a data set of n statistical units, a linear regression model assumes that the
relationship between the dependent variable yi and the p-vector of regressors xi is approximately linear. This
approximate relationship is modeled through a so-called “disturbance term” εi — an unobserved random variable
that adds noise to the linear relationship between the dependent variable and regressors. Thus the model takes
the form

yi = x′iβ + εi,   i = 1, ..., n,

where x′iβ is the inner product between vectors xi and β.

Often these n equations are stacked together and written in vector form as

y = Xβ + ε,

where y is the n-vector of responses yi, X is the n×p matrix whose i-th row is the regressor vector x′i, β is the
p-vector of unknown parameters, and ε is the n-vector of disturbance terms εi.
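
As a concrete illustration of this stacking, the sketch below builds the design matrix X row by row from
hypothetical regressor values and recovers β by ordinary least squares. The data, the variable names, and the
choice of NumPy's least-squares solver are illustrative assumptions, not part of the original text.

```python
import numpy as np

# Hypothetical example: n = 5 observations, p = 2 regressors plus an intercept.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 5.0])
y  = np.array([3.1, 3.9, 7.2, 7.8, 10.9])

# Stack the n equations: each row of X is one regressor vector x_i
# (with a leading 1 so that the first coefficient is the intercept).
X = np.column_stack([np.ones_like(x1), x1, x2])

# Ordinary least squares estimate of beta: solves min ||y - X beta||^2.
beta, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
print("estimated beta:", beta)

# Fitted values and disturbances (residuals) epsilon_i = y_i - x_i' beta.
fitted = X @ beta
residuals = y - fitted
```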

Applications of linear regression

Linear regression is widely used in biological, behavioral and social sciences to describe possible relationships
between variables. It ranks as one of the most important tools used in these disciplines.

Trend line

For trend lines as used in technical analysis, see Trend lines (technical analysis)

A trend line represents a trend, the long-term movement in time series data after other components have been
accounted for. It tells whether a particular data set (say GDP, oil prices or stock prices) has increased or
decreased over a period of time. A trend line could simply be drawn by eye through a set of data points, but
more properly its position and slope are calculated using statistical techniques such as linear regression. Trend lines
typically are straight lines, although some variations use higher degree polynomials depending on the degree of
curvature desired in the line.

Trend lines are sometimes used in business analytics to show changes in data over time. This has the advantage
of being simple. Trend lines are often used to argue that a particular action or event (such as training, or an
advertising campaign) caused observed changes at a point in time. This is a simple technique, and does not
require a control group, experimental design, or a sophisticated analysis technique. However, it suffers from a
lack of scientific validity in cases where other potential changes can affect the data.

Epidemiology

As one example, early evidence relating tobacco smoking to mortality and morbidity came from studies
employing regression. Researchers usually include several variables in their regression analysis in an effort to
remove factors that might produce spurious correlations. For the cigarette smoking example, researchers might
include socio-economic status in addition to smoking to ensure that any observed effect of smoking on mortality
is not due to some effect of education or income. However, it is never possible to include all possible confounding
variables in a study employing regression. For the smoking example, a hypothetical gene might increase mortality
and also cause people to smoke more. For this reason, randomized controlled trials are often able to generate
more compelling evidence of causal relationships than correlational analysis using linear regression. When
controlled experiments are not feasible, variants of regression analysis such as instrumental variables and other
methods may be used to attempt to estimate causal relationships from observational data.

Finance

The capital asset pricing model uses linear regression as well as the concept of Beta for analyzing and
quantifying the systematic risk of an investment. This comes directly from the Beta coefficient of the linear
regression model that relates the return on the investment to the return on all risky assets.

Regression may not be the appropriate way to estimate beta in finance given that it is supposed to provide the
volatility of an investment relative to the volatility of the market as a whole. This would require that both these
variables be treated in the same way when estimating the slope. Regression, however, treats all variability as being
in the investment returns variable; that is, it considers only residuals in the dependent variable. [19]

Environmental science

Linear regression finds application across a wide range of environmental science problems.

Software tools


Main article: List of statistical packages


• In Microsoft Excel, the LINEST spreadsheet function performs linear regression analysis with optional
calculation of confidence intervals.
• The free open source software package "R" offers several programs for linear regression and related
methods.
• The Unscrambler can perform multiple linear regression (MLR), partial least squares regression (PLS-R),
and 3-way PLS regression.

http://www.graphpad.com/curvefit/linear_regression.htm

Linear regression

Introduction to linear regression

Linear regression analyzes the relationship between two variables, X and Y. For each subject (or experimental
unit), you know both X and Y and you want to find the best straight line through the data. In some situations, the
slope and/or intercept have a scientific meaning. In other cases, you use the linear regression line as a standard
curve to find new values of X from Y, or Y from X.

The term "regression", like many statistical terms, is used in statistics quite differently than it is used in other
contexts. The method was first used to examine the relationship between the heights of fathers and sons. The
two were related, of course, but the slope is less than 1.0. A tall father tended to have sons shorter than himself;
a short father tended to have sons taller than himself. The height of sons regressed to the mean. The term
"regression" is now used for many sorts of curve fitting.
Prism determines and graphs the best-fit linear regression line, optionally including a 95% confidence interval or
95% prediction interval bands. You may also force the line through a particular point (usually the origin), calculate
residuals, calculate a runs test, or compare the slopes and intercepts of two or more regression lines.

In general, the goal of linear regression is to find the line that best predicts Y from X. Linear regression does this
by finding the line that minimizes the sum of the squares of the vertical distances of the points from the line.

Note that linear regression does not test whether your data are linear (except via the runs test). It assumes that
your data are linear, and finds the slope and intercept that make a straight line best fit your data.

How linear regression works

Minimizing sum-of-squares

The goal of linear regression is to adjust the values of slope and intercept to find the line that best predicts Y from
X. More precisely, the goal of regression is to minimize the sum of the squares of the vertical distances of the
points from the line. Why minimize the sum of the squares of the distances?  Why not simply minimize the sum of
the actual distances?

If the random scatter follows a Gaussian distribution, it is far more likely to have two medium-size deviations (say
5 units each) than to have one small deviation (1 unit) and one large one (9 units). A procedure that minimized the
sum of the absolute values of the distances would have no preference between a line that was 5 units away from two
points and one that was 1 unit away from one point and 9 units away from another. The sum of the distances (more
precisely, the sum of the absolute value of the distances) is 10 units in each case. A procedure that minimizes the
sum of the squares of the distances prefers to be 5 units away from two points (sum-of-squares = 50) rather than
1 unit away from one point and 9 units away from another (sum-of-squares = 82). If the scatter is Gaussian (or
nearly so), the line determined by minimizing the sum-of-squares is most likely to be correct.
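
The arithmetic in this example can be checked directly; the short sketch below (hypothetical deviations only)
reproduces the 10-versus-10 and 50-versus-82 comparison.

```python
# Two candidate lines leave these vertical distances from two data points:
dev_a = [5, 5]   # 5 units from each of two points
dev_b = [1, 9]   # 1 unit from one point, 9 units from the other

sum_abs_a = sum(abs(d) for d in dev_a)   # 10
sum_abs_b = sum(abs(d) for d in dev_b)   # 10 -> absolute-distance criterion cannot choose
sum_sq_a  = sum(d**2 for d in dev_a)     # 50
sum_sq_b  = sum(d**2 for d in dev_b)     # 82 -> least squares prefers line A
print(sum_abs_a, sum_abs_b, sum_sq_a, sum_sq_b)
```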

The calculations are shown in every statistics book, and are entirely standard.

Slope and intercept

Prism reports the best-fit values of the slope and intercept, along with their standard errors and confidence
intervals.

The slope quantifies the steepness of the line. It equals the change in Y for each unit change in X. It is expressed
in the units of the Y-axis divided by the units of the X-axis. If the slope is positive, Y increases as X increases. If
the slope is negative, Y decreases as X increases.

The Y intercept is the Y value of the line when X equals zero. It defines the elevation of the line.

The standard error values of the slope and intercept can be hard to interpret, but their main purpose is to
compute the 95% confidence intervals. If you accept the assumptions of linear regression, there is a 95% chance
that the 95% confidence interval of the slope contains the true value of the slope,  and  that the 95% confidence
interval for the intercept contains the true value of the intercept.
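
For readers who want to reproduce these quantities outside Prism, the sketch below computes the slope, intercept,
their standard errors, and 95% confidence intervals from the standard simple-regression formulas. The data are
hypothetical and scipy's linregress is used only for the point estimates, so treat this as an illustration rather
than Prism's own calculation.

```python
import numpy as np
from scipy import stats

# Hypothetical data (replace with your own X and Y values).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 11.9])

n = len(x)
res = stats.linregress(x, y)               # best-fit slope and intercept
slope, intercept = res.slope, res.intercept

# Standard errors of slope and intercept from the residual variance.
resid = y - (intercept + slope * x)
s2 = np.sum(resid**2) / (n - 2)            # residual variance
sxx = np.sum((x - x.mean())**2)
se_slope = np.sqrt(s2 / sxx)
se_intercept = np.sqrt(s2 * (1/n + x.mean()**2 / sxx))

# 95% confidence intervals use the t distribution with n - 2 degrees of freedom.
tcrit = stats.t.ppf(0.975, df=n - 2)
print("slope:    ", slope, "+/-", tcrit * se_slope)
print("intercept:", intercept, "+/-", tcrit * se_intercept)
```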

r2, a measure of goodness-of-fit of linear regression


The value r2 is a fraction between 0.0 and 1.0, and has no units. An r2 value of  0.0 means that knowing X does
not help you predict Y. There is no linear relationship between X and Y, and the best-fit line is a horizontal line
going through the mean of all Y values.  When r2 equals 1.0, all points lie exactly on a straight line with no
scatter. Knowing X lets you predict Y perfectly.

This figure demonstrates how Prism computes r2.

The left panel shows the best-fit linear regression line. This line minimizes the sum-of-squares of the vertical
distances of the points from the line. Those vertical distances are also shown on the left panel of the figure. In this
example, the sum of squares of those distances (SSreg) equals 0.86. Its units are the units of the Y-axis squared.
To use this value as a measure of goodness-of-fit, you must compare it to something.

The right half of the figure shows the null hypothesis -- a horizontal line through the mean of all the Y values.
Goodness-of-fit of this model (SStot) is also calculated as the sum of squares of the vertical distances of the
points from the line, 4.907 in this example. The ratio of the two sum-of-squares values compares the regression
model with the null hypothesis model. The equation to compute r2 is shown in the figure. In this example r2 is
0.8248. The regression model fits the data much better than the null hypothesis, so SSreg is much smaller than
SStot, and r2 is near 1.0. If the regression model were not much better than the null hypothesis, r2 would be near
zero.

You can think of r2 as the fraction of the total variance of Y that is "explained" by variation in X. The value of r2
(unlike the regression line itself) would be the same if X and Y were swapped. So r2 is also the fraction of the
variance in X that is "explained" by variation in Y. In other words, r2 is the fraction of the variation that is shared
between X and Y.

In this example, 84% of the total variance in Y is "explained" by the linear regression model. That leaves the rest
of the variance (16% of the total) as variability of the data from the model (SSreg).
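
A minimal sketch of that computation, using the SSreg/SStot notation of the text and hypothetical data (np.polyfit
stands in for whatever fitting routine you actually use):

```python
import numpy as np

# Hypothetical data; the line is assumed to have been fitted by least squares.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.2, 1.9, 3.2, 3.8, 5.1])
slope, intercept = np.polyfit(x, y, 1)

# SSreg (text's notation): scatter of the points around the regression line.
ss_reg = np.sum((y - (intercept + slope * x))**2)

# SStot: scatter around the null-hypothesis model, a horizontal line at the mean of Y.
ss_tot = np.sum((y - y.mean())**2)

r2 = 1.0 - ss_reg / ss_tot
print("r^2 =", r2)
```
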
The dashed lines that demarcate the confidence interval are curved. This does not mean that the confidence
interval includes the possibility of curves as well as straight lines. Rather, the curved lines are the boundaries of
all possible straight lines. The figure below shows four possible linear regression lines (solid) that lie within the
confidence interval (dashed).

Given the assumptions of linear regression, you can be 95% confident that the two curved confidence bands
enclose the true best-fit linear regression line, leaving a 5% chance that the true line is outside those boundaries.

Many data points will be outside the 95% confidence interval boundary. The confidence interval is 95% sure to
contain the best-fit regression line. This is not the same as saying it will contain 95% of the data points.

Prism can also plot the 95% prediction interval. The prediction bands are further from the best-fit line than the
confidence bands, a lot further if you have many data points. The 95% prediction interval is the area in which you
expect 95% of all data points to fall. In contrast, the 95% confidence interval is the area that has a 95% chance of
containing the true regression line. This graph shows both prediction and confidence intervals (the curves
defining the prediction intervals are further from the regression line).
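
The sketch below computes the half-widths of both bands at a chosen X value from the standard simple-regression
formulas; this is an assumed textbook implementation, not Prism's internal code. The leverage term is what makes
both bands curve away from the line at the extremes of X, and the extra "1 +" is what keeps the prediction band
wider than the confidence band.

```python
import numpy as np
from scipy import stats

def bands(x, y, x0):
    """Half-widths of the 95% confidence band and 95% prediction band at x0
    (a sketch using standard simple-regression formulas, not Prism's code)."""
    n = len(x)
    slope, intercept = np.polyfit(x, y, 1)
    resid = y - (intercept + slope * x)
    s = np.sqrt(np.sum(resid**2) / (n - 2))      # residual standard deviation
    sxx = np.sum((x - x.mean())**2)
    tcrit = stats.t.ppf(0.975, df=n - 2)
    leverage = 1/n + (x0 - x.mean())**2 / sxx
    conf = tcrit * s * np.sqrt(leverage)         # band for the true line
    pred = tcrit * s * np.sqrt(1 + leverage)     # band for a new data point
    return conf, pred
```
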
Comparing slopes and intercepts

Prism can test whether the slopes and intercepts of two or more data sets are significantly different. It compares
linear regression lines using the method explained in Chapter 18 of J Zar, Biostatistical Analysis, 2nd edition,
Prentice-Hall, 1984.

Prism compares slopes first. It calculates a P value (two-tailed) testing the null hypothesis that the slopes are all
identical (the lines are parallel). The P value answers this question: If the slopes really were identical, what is the
chance that randomly selected data points would have slopes as different (or more different) than those you observed?
If the P value is less than 0.05, Prism concludes that the lines are significantly different. In that case, there is no
point in comparing the intercepts. The intersection point of two lines Y = Intercept1 + Slope1·X and
Y = Intercept2 + Slope2·X is:

Xintersection = (Intercept1 − Intercept2) / (Slope2 − Slope1),   Yintersection = Intercept1 + Slope1·Xintersection

If the P value for comparing slopes is greater than 0.05, Prism concludes that the slopes are not significantly
different and  calculates a single slope for all the lines. Now the question is whether the lines are parallel or
identical. Prism calculates a second P value testing the null hypothesis that the lines are identical. If this P value
is low, conclude that the lines are not identical (they are distinct but parallel). If this second P value is high, there
is no compelling evidence that the lines are different.

This method is equivalent to an Analysis of Covariance (ANCOVA), although ANCOVA can be extended to more
complicated situations.
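
One way to reproduce this two-step comparison outside Prism is the ANCOVA-style model below, which tests the slope
difference through an interaction term and then, if the slopes look parallel, tests the intercept difference with a
common slope. The data frame, column names, and use of statsmodels are illustrative assumptions, not Prism's own
method.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical stacked data: two data sets labelled in a 'group' column.
rng = np.random.default_rng(1)
x = np.tile(np.arange(1.0, 11.0), 2)
df = pd.DataFrame({
    "x": x,
    "y": np.where(np.arange(20) < 10, 2.0 + 1.0 * x, 3.0 + 1.2 * x)
         + rng.normal(0, 0.5, 20),
    "group": ["A"] * 10 + ["B"] * 10,
})

# Step 1: test whether the slopes differ, via the x-by-group interaction term.
full = smf.ols("y ~ x * group", data=df).fit()
print("P value for equal slopes:", full.pvalues["x:group[T.B]"])

# Step 2 (only if the slopes are not significantly different): fit a common
# slope and test whether the intercepts (elevations) differ.
parallel = smf.ols("y ~ x + group", data=df).fit()
print("P value for identical lines:", parallel.pvalues["group[T.B]"])
```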

Standard Curve

To read unknown values from a standard curve, you must enter unpaired X or Y values below the X and Y values
for the standard curve.

Depending on which option(s) you selected in the Parameters dialog, Prism calculates Y values for all the
unpaired X values and/or X values for all unpaired Y values and places these on new output views.
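
The back-calculation itself is simple once the line is fitted; the sketch below uses hypothetical slope and
intercept values to show reading Y from an unpaired X and X from an unpaired Y.

```python
# Reading unknowns from a standard curve fitted as Y = intercept + slope * X
# (a sketch of the back-calculation, not Prism's own routine).
slope, intercept = 0.052, 0.013      # hypothetical fitted values

def y_from_x(x):
    return intercept + slope * x     # predict Y for an unpaired X

def x_from_y(y):
    return (y - intercept) / slope   # back-calculate X from a measured Y

print(x_from_y(0.42))
```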

How to think about the results of linear regression

Your approach to linear regression will depend on your goals.

If your goal is to analyze a standard curve, you won't be very interested in most of the results. Just make sure
that r2 is high and that the line goes near the points. Then go straight to the standard curve results.

In many situations, you will be most interested in the best-fit values for slope and intercept. Don't just look at the
best-fit values, also look at the 95% confidence interval of the slope and intercept. If the intervals are too wide,
repeat the experiment with more data.

If you forced the line through a particular point, look carefully at the graph of the data and best-fit line to make
sure you picked an appropriate point.

Consider whether a linear model is appropriate for your data. Do the data seem linear? Is the P value for the runs
test high? Are the residuals random? If you answered no to any of those questions, consider whether it makes
sense to use nonlinear regression instead.

http://www.stat.yale.edu/Courses/1997-98/101/linreg.htm

Linear Regression

Linear regression attempts to model the relationship between two variables by fitting a linear equation to
observed data. One variable is considered to be an explanatory variable, and the other is considered to be a
dependent variable. For example, a modeler might want to relate the weights of individuals to their heights using
a linear regression model.

Before attempting to fit a linear model to observed data, a modeler should first determine whether or not there is
a relationship between the variables of interest. This does not necessarily imply that one variable causes the
other (for example, higher SAT scores do not cause higher college grades), but that there is some significant
association between the two variables. A scatterplot can be a helpful tool in determining the strength of the
relationship between two variables. If there appears to be no association between the proposed explanatory and
dependent variables (i.e., the scatterplot does not indicate any increasing or decreasing trends), then fitting a
linear regression model to the data probably will not provide a useful model. A valuable numerical measure of
association between two variables is the correlation coefficient, which is a value between -1 and 1 indicating the
strength of the association of the observed data for the two variables.

A linear regression line has an equation of the form Y = a + bX, where X is the explanatory variable and Y is the
dependent variable. The slope of the line is b, and a is the intercept (the value of y when x = 0).

Least-Squares Regression
The most common method for fitting a regression line is the method of least-squares. This method calculates the
best-fitting line for the observed data by minimizing the sum of the squares of the vertical deviations from each
data point to the line (if a point lies on the fitted line exactly, then its vertical deviation is 0). Because the
deviations are first squared, then summed, there are no cancellations between positive and negative values.
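
For a single explanatory variable, minimizing the sum of squared vertical deviations has a closed-form solution,
sketched below with a small made-up data set: b is the ratio of the cross-products to the squared deviations of X,
and a is chosen so that the line passes through the point of means.

```python
import numpy as np

def least_squares_line(x, y):
    """Closed-form least-squares estimates for Y = a + bX (standard textbook formulas)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)
    a = y.mean() - b * x.mean()
    return a, b

a, b = least_squares_line([1, 2, 3, 4], [2.0, 2.9, 4.2, 4.9])
print("intercept a =", a, "slope b =", b)
```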

Example

The dataset "Televisions, Physicians, and Life Expectancy" contains, among other variables, the number of
people per television set and the number of people per physician for 40 countries. Since both variables probably
reflect the level of wealth in each country, it is reasonable to assume that there is some positive association
between them. After removing 8 countries with missing values from the dataset, the remaining 32 countries have
a correlation coefficient of 0.852 for number of people per television set and number of people per physician. The
r² value is 0.726 (the square of the correlation coefficient), indicating that 72.6% of the variation in one variable
may be explained by the other. (Note: see correlation for more detail.) Suppose we choose to consider number of
people per television set as the explanatory variable, and number of people per physician as the dependent
variable. Using the MINITAB "REGRESS" command gives the following results:

The regression equation is People.Phys. = 1019 + 56.2 People.Tel.


To view the fit of the model to the observed data, one may plot the computed regression line over the actual data
points to evaluate the results. For this example, the plot appears to the right, with number of individuals per
television set (the explanatory variable) on the x-axis and number of individuals per physician (the dependent
variable) on the y-axis. While most of the data points are clustered towards the lower left corner of the plot
(indicating relatively few individuals per television set and per physician), there are a few points which lie far
away from the main cluster of the data. These points are known as outliers, and depending on their location may
have a major impact on the regression line (see below).

Data source: The World Almanac and Book of Facts 1993 (1993), New York: Pharos Books. Dataset available
through the JSE Dataset Archive.

Outliers and Influential Observations

After a regression line has been computed for a group of data, a point which lies far from the line (and thus has a
large residual value) is known as an outlier. Such points may represent erroneous data, or may indicate a poorly
fitting regression line. If a point lies far from the other data in the horizontal direction, it is known as an influential
observation. The reason for this distinction is that these points may have a significant impact on the slope
of the regression line. Notice, in the above example, the effect of removing the observation in the upper right
corner of the plot:

With this influential observation removed, the regression equation is now

People.Phys = 1650 + 21.3 People.Tel.

The correlation between the two variables has dropped to 0.427, which reduces the r² value to 0.182. With this
influential observation removed, less than 20% of the variation in number of people per physician may be explained
by the number of people per television. Influential observations are also visible in the new model, and their
impact should also be investigated.

Residuals

Once a regression model has been fit to a group of data, examination of the residuals (the deviations from the
fitted line to the observed values) allows the modeler to investigate the validity of his or her assumption that a
linear relationship exists.
Plotting the residuals on the y-axis against the explanatory variable on the x-axis reveals any possible non-linear
relationship among the variables, or might alert the modeler to investigate lurking variables. In our example, the
residual plot amplifies the presence of outliers.
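
A minimal residual-plot sketch is shown below; the numbers are invented stand-ins for the televisions/physicians
data (which is not reproduced here), and matplotlib is an assumed choice of plotting library.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical data; in the text's example these would be people per television
# set (x) and people per physician (y) for each country.
x = np.array([3.0, 4.0, 6.0, 9.0, 15.0, 30.0, 200.0, 600.0])
y = np.array([400., 450., 700., 900., 1500., 2500., 12000., 36000.])

slope, intercept = np.polyfit(x, y, 1)
residuals = y - (intercept + slope * x)

# Residuals on the y-axis against the explanatory variable on the x-axis;
# curvature or fanning here suggests a non-linear relationship or lurking variables.
plt.scatter(x, residuals)
plt.axhline(0.0, linestyle="--")
plt.xlabel("People per television set")
plt.ylabel("Residual (people per physician)")
plt.show()
```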

Lurking Variables

If non-linear trends are visible in the relationship between an explanatory and dependent variable, there may be
other influential variables to consider. A lurking variable exists when the relationship between two variables is
significantly affected by the presence of a third variable which has not been included in the modeling effort. Since
such a variable might be a factor of time (for example, the effect of political or economic cycles), a time series
plot of the data is often a useful tool in identifying the presence of lurking variables.
