Linear Regression


Linear Regression Analysis

Linear Regression
Linear regression is a common statistical data analysis technique. It is used to determine the extent to which there is a linear relationship between a dependent variable and one or more independent variables. There are two types of linear regression: simple linear regression and multiple linear regression.
Difference between simple linear and
multiple linear regression analysis
In simple linear regression, a single independent variable is used to predict the value of a dependent variable. In multiple linear regression, two or more independent variables are used to predict the value of a dependent variable. The difference between the two is the number of independent variables. In both cases there is only a single dependent variable.
Correlation and Regression
Simple linear regression is similar to correlation in that
the purpose is to measure to what extent there is a
linear relationship between two variables.
The major difference between the two is that correlation makes no distinction between independent and dependent variables, while linear regression does. In particular, the purpose of linear regression is to "predict" the value of the dependent variable based upon the values of one or more independent variables.
Example
For example, you could use linear regression to
understand whether exam performance can be
predicted based on revision time; whether cigarette
consumption can be predicted based on smoking
duration; and so forth. If you have two or more independent variables, rather than just one, you need to use multiple linear regression.
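The slides carry out the analysis in SPSS Statistics; purely as an illustration, the minimal Python sketch below (using the statsmodels library, with made-up variable names such as revision_hours, iq_score, and exam_score, and simulated data) shows the difference between fitting a simple and a multiple linear regression:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
revision_hours = rng.uniform(0, 40, size=100)     # hypothetical predictor 1
iq_score = rng.normal(100, 15, size=100)          # hypothetical predictor 2
exam_score = 20 + 1.5 * revision_hours + 0.2 * iq_score + rng.normal(0, 5, size=100)

# Simple linear regression: one independent variable
X_simple = sm.add_constant(revision_hours)        # adds the intercept term
simple_model = sm.OLS(exam_score, X_simple).fit()

# Multiple linear regression: two or more independent variables
X_multiple = sm.add_constant(np.column_stack([revision_hours, iq_score]))
multiple_model = sm.OLS(exam_score, X_multiple).fit()

print(simple_model.params)      # intercept and one slope
print(multiple_model.params)    # intercept and two slopes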
Assumptions
Assumption #1:
Your two variables should be measured at the continuous level (i.e., they are either interval or ratio variables). Examples of continuous variables include revision time (measured in hours), intelligence (measured using IQ score), exam performance (measured from 0 to 100), weight (measured in kg), and so forth.
Assumption #2:
 There needs to be a linear relationship between the
two variables. Linearity means that the predictor
variable in the regression has a straight-line
relationship with the outcome variable. If your
residuals are normally distributed and homoscedastic,
you do not have to worry about linearity.

Use a scatterplot to check linearity. Your scatterplot may look something like one of the following:
If the relationship displayed in your scatterplot is not linear, you will have to either run a non-linear regression analysis, perform a polynomial regression, or "transform" your data, which you can do using SPSS Statistics.
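As a rough illustration of the scatterplot check, not part of the original slides and using the same kind of simulated revision-time and exam-score data as above, a linearity scatterplot could be drawn in Python with matplotlib:

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
revision_hours = rng.uniform(0, 40, size=100)                        # hypothetical data
exam_score = 20 + 1.5 * revision_hours + rng.normal(0, 5, size=100)

plt.scatter(revision_hours, exam_score)
plt.xlabel("Revision time (hours)")
plt.ylabel("Exam performance (0-100)")
plt.title("Scatterplot to check linearity")
plt.show()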
Assumption #3: 
There should be no significant outliers.
An outlier is an observed data point that has a
dependent variable value that is very different from the
value predicted by the regression equation. As such, an
outlier will be a point on a scatterplot that is
(vertically) far away from the regression line
indicating that it has a large residual, as highlighted
below:
The problem with outliers is that they can have a negative
effect on the regression analysis (e.g., reduce the fit of the
regression equation) that is used to predict the value of the
dependent (outcome) variable based on the independent
(predictor) variable. This will change the output that SPSS
Statistics produces and reduce the predictive accuracy of your
results. Fortunately, when using SPSS Statistics to run a linear
regression on your data, you can easily include criteria to help
you detect possible outliers.
To check for outliers, we will estimate Cook's distance.
The value of Cook's distance should not be greater than 1.
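SPSS Statistics reports Cook's distance as part of its casewise diagnostics; as an assumed illustration only (simulated data, hypothetical variable names), the same quantity can be computed in Python with statsmodels:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.uniform(0, 40, size=100)                  # hypothetical predictor
y = 20 + 1.5 * x + rng.normal(0, 5, size=100)     # hypothetical outcome
model = sm.OLS(y, sm.add_constant(x)).fit()

# Cook's distance for every observation; values greater than 1 flag
# possible outliers, per the rule of thumb in the slides.
cooks_d = model.get_influence().cooks_distance[0]
print("Largest Cook's distance:", cooks_d.max())
print("Observations with Cook's distance > 1:", np.where(cooks_d > 1)[0])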
Assumption #4: You should have independence of
observations, which you can easily check using the
Durbin-Watson statistic, which is a simple test to run
using SPSS Statistics.

The value of the Durbin-Watson statistic should fall between the two critical values: 1.5 < d < 2.5.
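Again as a hedged sketch rather than the slides' SPSS procedure, the Durbin-Watson statistic can be computed on the model residuals with statsmodels (simulated data, hypothetical variable names):

import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(0)
x = rng.uniform(0, 40, size=100)
y = 20 + 1.5 * x + rng.normal(0, 5, size=100)
model = sm.OLS(y, sm.add_constant(x)).fit()

# A value near 2 suggests no autocorrelation; the slides accept 1.5 < d < 2.5.
print("Durbin-Watson:", durbin_watson(model.resid))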
Assumption #5: Your data needs to show homoscedasticity,
which is where the variances along the line of best fit remain
similar as you move along the line.
Homoscedasticity refers to whether these residuals are equally
distributed, or whether they tend to bunch together at some
values, and at other values, spread far apart. In the context of t-tests and ANOVAs, you may hear this same concept referred to as
equality of variances or homogeneity of variances. Your data is
homoscedastic if it looks somewhat like a shotgun blast of
randomly distributed data. The opposite of homoscedasticity is
heteroscedasticity, where you might find a cone or fan shape in
your data. You check this assumption by plotting the predicted
values and residuals on a scatterplot, which we will show you.
Take a look at the three scatterplots below, which provide three simple examples:
two of data that fail the assumption (called heteroscedasticity) and one of data that
meets this assumption (called homoscedasticity):
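A minimal sketch, with simulated data and hypothetical variable names rather than the slides' own examples, of the predicted-values-versus-residuals scatterplot used to judge homoscedasticity:

import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.uniform(0, 40, size=100)
y = 20 + 1.5 * x + rng.normal(0, 5, size=100)
model = sm.OLS(y, sm.add_constant(x)).fit()

# Homoscedastic data looks like a random cloud with no cone or fan shape.
plt.scatter(model.fittedvalues, model.resid)
plt.axhline(0, linestyle="--")
plt.xlabel("Predicted values")
plt.ylabel("Residuals")
plt.show()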
Assumption #6: 
Finally, you need to check that the residuals (errors) of the regression line are approximately normally distributed; that is, the residuals of the regression should follow a normal distribution. The residuals are simply the error terms, or the differences between the observed value of the dependent variable and the predicted value.
Two common methods to check this assumption include using either a histogram (with a superimposed normal curve) or a normal P-P plot. If we examine a normal P-P (probability-probability) plot, we can determine if the residuals are normally distributed. If they are, they will conform to the diagonal normality line indicated in the plot.
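A minimal sketch, again with simulated data, of the two checks mentioned above (a residual histogram and a normal P-P plot), using Python's statsmodels and matplotlib:

import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.graphics.gofplots import ProbPlot

rng = np.random.default_rng(0)
x = rng.uniform(0, 40, size=100)
y = 20 + 1.5 * x + rng.normal(0, 5, size=100)
model = sm.OLS(y, sm.add_constant(x)).fit()

fig, axes = plt.subplots(1, 2, figsize=(8, 4))
axes[0].hist(model.resid, bins=20)                     # histogram of residuals
axes[0].set_title("Residual histogram")
ProbPlot(model.resid).ppplot(line="45", ax=axes[1])    # normal P-P plot
axes[1].set_title("Normal P-P plot")
plt.show()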
Regression equation

Y = a + bX
or
Y = bX + a
A linear regression line has an equation of the form Y = a + bX, where X is the explanatory variable and Y is the dependent variable. The slope of the line is b, and a is the intercept (the value of Y when X = 0).
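As a small worked illustration (the numbers below are invented), the intercept a and slope b can be estimated by least squares, for example with numpy.polyfit:

import numpy as np

# Hypothetical X and Y values, invented purely for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.0, 6.2, 7.9, 10.1])

b, a = np.polyfit(x, y, 1)    # degree-1 fit returns [slope, intercept]
print("intercept a:", a)      # value of Y when X = 0
print("slope b:", b)          # change in Y per one-unit change in X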
Important symbols and their meanings

1. B is the unstandardized coefficient: the rate of change in the dependent variable per one-unit change in the independent variable.
2. Beta (β) works like a correlation coefficient, ranging from 0 to 1 (or 0 to -1); the higher the value of beta, the stronger the association between the variables.
There are five symbols that easily confuse students in
a regression table: the unstandardized beta (B), the
standard error for the unstandardized beta (SE B), the
standardized beta (β), the t test statistic (t), and the
probability value (p). Typically, the only two values
examined are the B and the p. However, all of them are
useful to know.
The first symbol is the unstandardized beta (B). This
value represents the slope of the line between the
predictor variable and the dependent variable. So, for the variable in the table above, this would mean that for every one-unit increase in the independent variable, the dependent variable increases by .564 units.
The next symbol is the standard error for the
unstandardized beta (SE B). This value is similar to the
standard deviation for a mean.  The larger the number,
the more spread out the points are from the regression
line. The more spread out the numbers are, the less
likely that significance will be found.
The third symbol is the standardized beta (β). This
works very similarly to a correlation coefficient. It will
range from 0 to 1 or 0 to -1, depending on the direction
of the relationship. The closer the value is to 1 or -1,
the stronger the relationship. With this symbol, you
can actually compare the variables to see which had
the strongest relationship with the dependent variable,
since all of them are on the 0 to 1 scale. In the table
above, Variable 3 had the strongest relationship.
The fourth symbol is the t test statistic (t). This is the
test statistic calculated for the individual predictor
variable. This is used to calculate the p value.
The last symbol is the probability level (p). This tells
whether or not an individual variable significantly
predicts the dependent variable. You can have a
significant model, but a non-significant predictor
variable. Typically, if the p value is below .050, the
value is considered significant.
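To make the five symbols concrete, the following hedged Python sketch (simulated data and made-up predictor names x1 and x2, not the slides' own table) assembles B, SE B, t, and p from a statsmodels fit, and obtains the standardized beta by refitting on z-scored variables:

import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
df = pd.DataFrame({"x1": rng.normal(size=200), "x2": rng.normal(size=200)})
df["y"] = 1.0 + 0.564 * df["x1"] + 0.2 * df["x2"] + rng.normal(size=200)

model = sm.OLS(df["y"], sm.add_constant(df[["x1", "x2"]])).fit()

# Unstandardized B, its standard error, t, and p for each predictor
table = pd.DataFrame({
    "B": model.params,
    "SE B": model.bse,
    "t": model.tvalues,
    "p": model.pvalues,
})

# Standardized beta: refit after z-scoring every variable
z = (df - df.mean()) / df.std()
table["beta"] = sm.OLS(z["y"], sm.add_constant(z[["x1", "x2"]])).fit().params
print(table)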
Difference between R2 and the adjusted R2
There is one main difference between R2 and the adjusted R2:
R-squared measures the proportion of the variation in your dependent variable (Y) explained by your independent variables (X) for a linear regression model.
Adjusted R-squared adjusts the statistic based on the number of independent variables in the model.
A low R-squared value indicates that your independent variable is not explaining much of the variation in your dependent variable. How high R2 needs to be depends on your research work, but an R2 value above 50% together with a low RMSE value is generally acceptable to the scientific research community. Results with a low R2 value of 25% to 30% can still be valid, because they represent your findings.
The most common interpretation of r-squared is how well the regression model fits the observed data. For example, an r-squared of 60% indicates that 60% of the variation in the dependent variable is explained by the model. Generally, a higher r-squared indicates a better fit for the model.
The value of R2 is between 0 and 1. R2 can fall outside this range (for example, become negative) only when an invalid (or nonstandard) equation is used to compute R2 and when the chosen model (with constraints, if any) fits the data really poorly, worse than the fit of a horizontal line.
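As an assumed illustration of the relationship between R2 and adjusted R2 (simulated data, not from the slides), statsmodels reports both, and the adjusted value can also be computed by hand from adj R2 = 1 - (1 - R2)(n - 1)/(n - k - 1):

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n, k = 100, 2                                   # n observations, k predictors
X = rng.normal(size=(n, k))
y = 1 + X @ np.array([0.8, 0.3]) + rng.normal(size=n)

model = sm.OLS(y, sm.add_constant(X)).fit()
r2 = model.rsquared
adj_r2 = model.rsquared_adj

# Adjusted R2 penalizes R2 for the number of predictors in the model.
manual_adj = 1 - (1 - r2) * (n - 1) / (n - k - 1)
print("R2:", r2, "adjusted R2:", adj_r2, "by hand:", manual_adj)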
Interpretation according to APA
