Data Analysis - Multiple Regression P-Value


BIG DATA

Week 10 – Regression and p-values


April 2023
TODAY’S OBJECTIVES

Today’s entire session will be done in JASP. We will be dealing with:
1. Multiple regression
2. Hypothesis testing
3. P-value tests

CORRELATION VS. REGRESSION

What are the differences between the two?

What to expect from regression?


PURPOSE OF REGRESSION ANALYSIS

The purpose of regression analysis is to analyze relationships among variables:

• Answer the question of how much y changes with changes in each of the X’s
• Forecast or predict the value of y based on the values of the X’s
SIMPLE LINEAR REGRESSION- JASP

Choice of dependent variable

Choice of independent variable
MULTIPLE LINEAR REGRESSION

Simple vs. Multiple Regression


Simple regression
• One dependent variable Y predicted from one independent variable X
• One regression coefficient
• r2: proportion of variation in dependent variable Y predictable from X
Y = a*X + b

Multiple regression
• One dependent variable Y predicted from a set of independent variables (X1, X2, …, Xn)
• One regression coefficient for each independent variable
• R2: proportion of variation in dependent variable Y predictable by the set of independent variables (X’s)
Y = a1*X1 + a2*X2 + … + an*Xn + b
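As a rough illustration of the simple case, the sketch below fits Y = a*X + b by ordinary least squares in plain Python and computes r2. The data points and the function name `fit_simple_regression` are made up for this example; JASP performs the equivalent computation for you.

```python
def fit_simple_regression(xs, ys):
    """Least-squares fit of Y = a*X + b; returns (a, b, r2)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of squares and cross-products around the means
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx                    # slope
    b = mean_y - a * mean_x          # intercept
    r2 = (sxy ** 2) / (sxx * syy)    # proportion of variation in Y explained by X
    return a, b, r2


# Hypothetical data, invented for illustration only (exactly Y = 2*X + 1)
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]
a, b, r2 = fit_simple_regression(xs, ys)
print(f"Y = {a:.2f}*X + {b:.2f}, r2 = {r2:.3f}")
```

Multiple regression follows the same least-squares idea, but it estimates one slope per independent variable at once, which is why a tool such as JASP is used in practice.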
MULTIPLE LINEAR REGRESSION- JASP

Choice of dependent variable

Choice of several independent variables

For multiple regression we now have several independent variables

Y = b + a1*X1 + a2*X2 + … + an*Xn
MULTIPLE LINEAR REGRESSION- JASP

R2: the model’s accuracy in explaining the dependent variable

Coefficient for b (intercept in linear regression formula)

Coefficients for each a (slopes in linear regression formula)
Transactions = 431877.9 - 46.512*Population + 19.482*Labour force - 1462.846*Unemployment

Y = b + a1*X1 + a2*X2 + … + an*Xn
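To see how the fitted equation is used for prediction, the sketch below plugs values into it. The coefficients are the ones shown on the slide; the function name `predict_transactions` and the input values are hypothetical, and the slide does not state the units of each variable.

```python
def predict_transactions(population, labour_force, unemployment):
    """Evaluate the fitted regression equation from the slide."""
    intercept = 431877.9                  # b
    return (intercept
            - 46.512 * population         # a1 * Population
            + 19.482 * labour_force       # a2 * Labour force
            - 1462.846 * unemployment)    # a3 * Unemployment


# Hypothetical inputs, chosen only to show the call
print(predict_transactions(population=1000, labour_force=500, unemployment=10))
```

Each coefficient is the change in predicted Transactions for a one-unit change in that variable while the others are held fixed.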
BUILDING GOOD REGRESSION MODELS

Systematic Approach to Building Good Multiple Regression Models


• Construct a model with all available independent variables and check the significance of each.
• Identify the largest p-value that is greater than .05.
• Remove that variable and evaluate adjusted R2.
• Continue until all variables are significant.

Find the model with the highest adjusted R2.

(Do not use unadjusted R2, since it never decreases when variables are added.)
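The procedure above can be sketched as a loop. This is a minimal illustration, not JASP's implementation: `fit` is a hypothetical callable that stands in for running the regression, and the p-values and adjusted R2 values in `fake_results` are invented. The `adjusted_r2` helper uses the standard formula for n observations and k independent variables.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R2 for n observations and k independent variables."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


def backward_eliminate(variables, fit, alpha=0.05):
    """Drop the least significant variable until all p-values are < alpha.

    `fit` is a hypothetical callable supplied by the caller: given a list of
    variable names it returns (p_values_by_name, adjusted_r2_of_that_model).
    Returns every model tried as a list of (variables, adjusted_r2) tuples.
    """
    current = list(variables)
    tried = []
    while current:
        p_values, adj = fit(current)
        tried.append((tuple(current), adj))
        worst = max(current, key=lambda name: p_values[name])
        if p_values[worst] < alpha:
            break                  # every remaining variable is significant
        current.remove(worst)      # remove the largest p-value and refit
    return tried


# Made-up results standing in for JASP output (illustration only)
fake_results = {
    ("x1", "x2", "x3"): ({"x1": 0.001, "x2": 0.40, "x3": 0.03}, 0.71),
    ("x1", "x3"): ({"x1": 0.001, "x3": 0.02}, 0.73),
}
models = backward_eliminate(["x1", "x2", "x3"], lambda v: fake_results[tuple(v)])
best = max(models, key=lambda m: m[1])   # model with the highest adjusted R2
print(best)
print(adjusted_r2(0.5, n=11, k=2))
```

In this made-up run, x2 is removed because its p-value is the largest one above .05, and the smaller model ends up with the higher adjusted R2.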
HYPOTHESIS TESTING – DEFINITION

• Hypothesis testing is an act in statistics whereby an analyst tests an assumption regarding a population parameter.
• Hypothesis testing is used to assess the plausibility of a hypothesis by using sample data.

Steps of Hypothesis testing

1. State the two hypotheses so that only one can be right
2. Specify the level of significance
3. State the decision rule
4. Analyze the results and either reject the null hypothesis, or state that the null hypothesis is plausible, given the data.
LEVEL OF SIGNIFICANCE
Defines the unlikely values of the sample statistic if the null hypothesis is true

Defines the rejection region of the sampling distribution

Margin of error: level of significance

Is designated by α (level of significance)

Typical values are .01, .05, or .10 (1%, 5% or 10%)

Is selected by the researcher at the beginning of the test

Provides the critical value(s) of the test



HYPOTHESIS TESTING – NULL AND ALTERNATIVE HYPOTHESIS EXAMPLE

We take a sample and analyze its average height. We measure that the average height in our sample is 1.75m.

We now question whether that is also the height for the entire population or just for our sample! This is hypothesis testing!

Null hypothesis -> H0: The population’s average height is 1.75m

vs.

Alternative hypothesis -> H1: The population’s average height is NOT 1.75m

A decision is then made based on the level of significance and decision rules
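One way such a decision could be made is a two-sided z-test for the mean, sketched below. Only the hypothesised value of 1.75 m comes from the example; the sample mean, the standard deviation (assumed known) and the sample size are made up for illustration, and `z_test_mean` is a hypothetical helper name.

```python
from math import sqrt
from statistics import NormalDist


def z_test_mean(sample_mean, hypothesised_mean, sd, n):
    """Two-sided z-test for a mean; returns (z, p_value)."""
    z = (sample_mean - hypothesised_mean) / (sd / sqrt(n))
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# All numbers except 1.75 are assumptions made up for this illustration
z, p = z_test_mean(sample_mean=1.78, hypothesised_mean=1.75, sd=0.10, n=100)
alpha = 0.05                      # level of significance chosen beforehand
print(f"z = {z:.2f}, p = {p:.4f}")
print("reject H0" if p < alpha else "do not reject H0")
```

With these invented numbers the sample mean lies about three standard errors from 1.75 m, so H0 would be rejected at the 5% level.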
HYPOTHESIS TESTING – NULL AND ALTERNATIVE HYPOTHESIS DEFINITION JASP

Every statistical analysis performed in JASP already has hypothesis testing built in.

The analyst only interprets the results and draws a conclusion based on the decision rule they set.

Hypothesis test:
H0: The slope of the regression line is 0 (a=0) (no connection between the variables)
H1: The slope of the regression line is not 0 (a≠0) (a connection exists between the variables)
P-VALUES- DEFINITION

• A p-value is the probability of observing a difference at least as large as the one in the sample if the null hypothesis were true, that is, if only random chance were at work.
• The lower the p-value, the greater the statistical significance of the observed difference.

The p-value can be used as an alternative to, or in addition to, pre-selected confidence levels for hypothesis testing.
P-VALUE DECISION RULE

1. Compare the obtained p-value with α (level of significance)

Decision
If p-value < α, then reject H0
• (it means the result is strong enough to be generalized to the population)

If p-value ≥ α, then do not reject H0
• (it means the result is NOT strong enough to be generalized to the population)
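The decision rule is simple enough to write as a small function. This is only a sketch: the name `decide` and the example p-values are hypothetical, and α defaults to the 5% level used in the later slides.

```python
def decide(p_value, alpha=0.05):
    """Apply the p-value decision rule for a chosen significance level."""
    if p_value < alpha:
        return "reject H0"
    return "do not reject H0"


# Hypothetical p-values, for illustration only
for p in (0.003, 0.049, 0.2):
    print(p, "->", decide(p))
```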
CORRELATION ANALYSIS AND P-VALUE TESTS

Level of significance (α) = 5%

Hypothesis test:
H0: the correlation between the variables (pair) is zero at the population level
H1: the correlation between the variables (pair) is not zero at the population level

P-value test on the hypothesis (example)

For salary versus salbegin the p-value is <0.01, hence less than 0.05, so we reject H0.
That implies that the connection between salary and salbegin is statistically significant at the population level.

NOTE

Remember that different combinations of variable types use different correlation coefficients (Pearson and Spearman).
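For intuition, the sketch below computes a Pearson correlation by hand and an approximate two-sided p-value for H0: correlation = 0. The p-value uses the Fisher z-transformation with a normal approximation, which is an assumption of this sketch and not necessarily the exact test JASP applies. The numbers are invented and are not the real salary and salbegin data.

```python
from math import atanh, sqrt
from statistics import NormalDist


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def approx_p_value(r, n):
    """Approximate two-sided p-value for H0: correlation = 0 (Fisher z).

    Requires -1 < r < 1 and n > 3.
    """
    z = atanh(r) * sqrt(n - 3)
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Invented numbers standing in for salbegin and salary (not the real dataset)
salbegin = [15, 18, 21, 24, 27, 30, 33, 36]
salary = [28, 35, 38, 50, 49, 62, 63, 75]
r = pearson_r(salbegin, salary)
p = approx_p_value(r, len(salary))
print(f"r = {r:.3f}, approximate p = {p:.4f}")
print("reject H0" if p < 0.05 else "do not reject H0")
```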

REGRESSION ANALYSIS AND P-VALUE TESTS - ANOVA

Level of significance (α) = 5%

Hypothesis test:
H0: The independent variables do not reliably predict the dependent variable
H1: The independent variables reliably predict the dependent variable

P-value test on the hypothesis (example)

For this model the p-value is <0.01, hence less than 0.05, so we reject H0.
That implies that the model can reliably be used to predict salary at the population level.

NOTE

ANOVA is about the model (Y = a*X + b) and its ability to predict: how reliable is the prediction at the population level?
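The p-value in the ANOVA table comes from an F statistic for the whole model. As a hedged sketch, the function below computes that statistic from R2 using the standard formula for a model with an intercept, n observations and k independent variables. The name `f_statistic` and the input values are hypothetical, and converting F into a p-value is left to JASP.

```python
def f_statistic(r2, n, k):
    """Overall F statistic for a regression with k predictors and n rows."""
    explained = r2 / k                      # explained variation per predictor
    unexplained = (1 - r2) / (n - k - 1)    # residual variation per degree of freedom
    return explained / unexplained


# Hypothetical values, for illustration only
print(f_statistic(r2=0.5, n=13, k=2))
```

A larger F means the independent variables explain much more variation than is left unexplained, which corresponds to a smaller p-value.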

REGRESSION ANALYSIS AND P-VALUE TESTS - COEFFICIENTS

Level of significance (α) = 5%

Simple regression line
Y = a*X + b

Hypothesis test:
H0: The slope of the regression line is 0 (a=0) (no connection between the variables)
H1: The slope of the regression line is not 0 (a≠0) (a connection exists between the variables)

P-value test on the hypothesis (example)

For salbegin the p-value is <0.01, hence less than 0.05, so we reject H0.
That implies that there is a connection between salary and salbegin at the population level.

NOTE

Only the p-value that belongs to the independent variable is tested.

MULTIPLE REGRESSION ANALYSIS AND P-VALUE TESTS - COEFFICIENTS

Level of significance (α) = 5%

Multiple regression line
Y = a1*X1 + a2*X2 + … + b

Hypothesis test:
H0: EACH slope of the regression line is 0 (a1 = a2 = … = 0)
(no connection between the variables)
H1: EACH slope of the regression line is not 0 (a1 ≠ 0, a2 ≠ 0, …)
(a connection exists between the variables)

P-value test on the hypothesis (example)

For each one the p-value is <0.01, hence less than 0.05, so we reject H0.
That implies that there is a connection between salary and the group of independent variables (age, educ, salbegin, jobtime, prevexp) at the population level.

NOTE

This is a group test: if it fails for one, it fails for all. That means that we need to try again with a smaller set.