Pink Green Bright Aesthetic Playful Math Class Presentation


Welcome to our

MATH CLASS
ABOUT MULTIPLE
REGRESSION AND
CHI SQUARE
KEY TAKEAWAYS:
Regression analysis is a series of statistical modeling processes that helps analysts
estimate relationships between one or more independent variables and a dependent
variable.
You can represent multiple regression analysis using the formula:
Y = b0 + b1X1 + b2X2 + ... + bpXp
Multiple regression analysis has many applications, from business to marketing to statistics.
WHAT IS MULTIPLE REGRESSION?
Multiple regression, also known as multiple linear regression (MLR), is a
statistical technique that uses two or more explanatory variables to predict
the outcome of a response variable. It can explain the relationship between
multiple independent variables against one dependent variable. These
independent variables serve as predictor variables, while the single
dependent variable serves as the criterion variable. You can use this
technique in a variety of contexts, studies and disciplines, including in
econometrics and financial inference.
WHAT IS THE MULTIPLE REGRESSION ANALYSIS FORMULA?
To perform a regression analysis, first calculate the multiple regression of your data. You can use
this formula:

Y = b0 + b1X1 + b2X2 + ... + bpXp


In this formula:

Y stands for the predicted value, or dependent variable.


The variables (X1), (X2) and so on through (Xp) represent the predictive values, or
independent variables, causing a change in Y. It's important to note that each X factor
represents a distinct predictive value.
The variable (b0) represents the Y-value when all the independent variables (X1 through Xp)
are equal to zero.
The variables (b1) through (bp) represent the regression coefficients.
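
To make the formula concrete, here is a minimal Python sketch that evaluates Y from an intercept and a list of coefficients. The numbers are hypothetical, chosen only to illustrate the arithmetic:

```python
# A minimal sketch of the multiple regression formula, using
# hypothetical coefficients (b) and predictor values (x).
def predict(b0, coefficients, predictors):
    """Compute Y = b0 + b1*X1 + b2*X2 + ... + bp*Xp."""
    return b0 + sum(b * x for b, x in zip(coefficients, predictors))

# Hypothetical example: intercept 2.0 and two predictors.
y = predict(2.0, [0.5, -1.2], [10.0, 3.0])
print(y)  # 2.0 + 0.5*10.0 + (-1.2)*3.0 = 3.4
```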
WHEN TO USE MULTIPLE REGRESSION ANALYSIS
Multiple regression analysis is a useful tool in a wide range of applications. From business, marketing and sales
analytics to environmental, medical and technological applications, multiple regression analysis helps professionals
evaluate diverse data that supports goals, processes and outcomes in many industries. Here are several ways
multiple regression analysis can benefit a business or organization:

Gives insight into predictive factors


Conducting a multiple regression analysis is useful for determining what factors are affecting
different aspects of a business' processes. For instance, revenue can be one type of Y-value,
where different independent variables like the number of sales and cost of goods sold affect
business revenue. With multiple regression analysis, analysts can identify the individual activities
that affect specific metrics they want to measure, giving them better insight into how to
improve efficiency and productivity.
Predicts factors affecting outcomes

When companies can analyze the factors that affect certain business operations,
management can better predict which independent variables influence the dependent
functions of the business. For example, a business analyst can predict which factors
are likely to affect an organization's future profitability, based on the results of a
multiple regression analysis.

In this case, the analyst may calculate the regression using the formula where profit
is the predictive variable and factors like overhead, liabilities and total sales revenue
represent the (b) and (X) values in the formula. When the analyst understands how
much these factors affect profits, they can better predict the variables that may
affect profits in the future.
Creates models for cause-and-effect analysis

Understanding the mathematical data that multiple regression analysis can provide
allows professionals to model the information in a graph or chart. Displaying multiple
regression—how external variables cause changes in a dependent variable—in this
way can help you model the cause-and-effect relationship to better see the changes
taking place in real time. This can be especially beneficial for financial activities like
investing in stocks and securities, where traders can see the cause-and-effect
relationship in a chart to understand how economic factors are influencing current
market shares.
ADVANTAGES AND DISADVANTAGES OF
MULTIPLE REGRESSION
The main advantages and disadvantages of Multiple Regression are tabulated below.

Advantages
It has the ability to determine the relative influence of one or more predictor variables on the criterion value.
It also has the ability to identify outliers, or anomalies.

Disadvantages
It requires high-level mathematics and dedicated statistical software to analyze the data.
Its results can be difficult to interpret because the technique relies on several assumptions and requires a
large sample of data to produce reliable results.
MULTIPLE LINEAR REGRESSION
FORMULA
Here's the formula for multiple linear regression, which produces a more specific
calculation:

y = β0 + β1x1 + β2x2 + ... + βpxp

The variables in this equation are:


y is the predicted or expected value of the dependent variable.
x1, x2, ..., xp are the independent or predictor variables.
β0 is the value of y when all the independent variables are equal to zero.
β1, β2, ..., βp are the estimated regression coefficients. Each regression
coefficient represents the change in y relative to a one-unit change in the respective
independent variable.
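
To show how the coefficients β0 through βp can be estimated in practice, here is a minimal sketch using NumPy's least-squares solver; the data values are hypothetical:

```python
import numpy as np

# Hypothetical data: 6 observations, 2 predictor variables.
X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 4.0],
              [4.0, 3.0],
              [5.0, 6.0],
              [6.0, 5.0]])
y = np.array([4.1, 4.9, 8.2, 8.8, 12.1, 12.9])

# Prepend a column of ones so the first fitted value is the intercept β0.
A = np.column_stack([np.ones(len(X)), X])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
print(beta)  # [β0, β1, β2]: each βi is the change in y per one-unit change in xi
```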
MULTIPLE LINEAR REGRESSION
FORMULA
Because of the multiple variables, which can be linear or nonlinear, this
regression analysis model allows for more variance and precision when it comes
to predicting outcomes and understanding the impact of each explanatory
variable on the model's total variance.
CALCULATING MULTIPLE REGRESSION
To understand the calculations of multiple regression analysis, assume a financial analyst wants to predict the
price changes in a stock share of a major fuel company. Using this example, follow the steps below to understand
how the analyst calculates multiple regression:

1. Determine all predictive variables

Using the example, the financial analyst must first determine all the factors that can cause the share prices to
fluctuate. While stock prices can have many influencing factors, assume the predictive variables the analyst
evaluates include interest rates, crude oil prices and prices to move fuel resources. The analyst determines:
The X1 variable is a 5% interest rate, or 0.05.
The X2 variable is a current price of $50 per barrel of crude oil.
The Xp variable is the current transport price of $25 per load of 100 barrels.
The analyst plugs these values into the formula:
Y = b0 + b1X1 + b2X2 + ... + bpXp = b0 + b1(0.05) + b2(50) + bp(25)
CALCULATING MULTIPLE REGRESSION
2. Determine the regression coefficient at time zero

Once the analyst knows the independent variables affecting share price, they can identify the value
of the regression coefficient, or the relationship between predictive variables and responses in Y, at
time zero. Time zero refers to the value of the stock at the moment of evaluation. If the stock price
is $50 when the analyst begins their assessment, the b0 value is $50:
Y = b0 + b1X1 + b2X2 + ... + bpXp = (50) + b1(0.05) + b2(50) + bp(25)
CALCULATING MULTIPLE REGRESSION
3. Identify the regression coefficients for b variables
After calculating the predictive variables and the regression coefficient at time zero, the analyst can find the
regression coefficients for each X predictive factor. The regression coefficient for the X1 variable represents the
change in interest rates from time zero, the regression coefficient for the X2 variable is the change in the price
of crude oil and the regression coefficient for the Xp variable is the change in transportation costs. The
regression coefficients, or change rates, the analyst calculates come from the differences in prices between
previous and current years. Assume the analyst uses these values in the formula:

Y = (50) + b1(0.05) + b2(50) + bp(25), where b1 represents the change in interest rates, b2 is the change in the
price of crude oil and bp is the change in transportation costs between the previous and current years. The analyst
uses b1 = 0.015, b2 = 0.33 and bp = 0.8 in the formula:
Y = (50) + (0.015)(0.05) + (0.33)(50) + (0.8)(25)
CALCULATING MULTIPLE REGRESSION
4. Sum these values

Once the analyst has all values in the formula, they can find the total sum, or the value of Y. It looks like this:
Y = (50) + (0.015)(0.05) + (0.33)(50) + (0.8)(25)
= (50) + (0.00075) + (16.5) + (20) ≈ 86.5
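
A few lines of Python can verify the arithmetic in steps 1 through 4:

```python
# Verifying the worked example: b0 = 50 (the stock price at time zero),
# with the coefficient and predictor values given above.
b0, b1, b2, bp = 50, 0.015, 0.33, 0.8
x1, x2, xp = 0.05, 50, 25

Y = b0 + b1 * x1 + b2 * x2 + bp * xp
print(Y)  # 50 + 0.00075 + 16.5 + 20 = 86.50075, approximately 86.5
```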

5. Evaluate the results

The multiple regression sum represents the likelihood of changes occurring because of the changes in the independent
variables affecting the dependent factor. In the example of the financial analyst evaluating the advantages of company
stocks, the value of Y is approximately 86.5, or 86.5%.
This shows that the fuel company's stock price has an 86.5% chance of fluctuating based on changes in external
factors. While this value doesn't determine whether the fluctuations are increases or decreases in price, a multiple
regression rate of 86.5% can give the analyst valuable insight into just how volatile the company's stock prices are.
EXAMPLE:
Example: Multiple Linear Regression by Hand
Suppose we have the following dataset with one response variable y and two predictor variables X1 and X2:

Use the following steps to fit a multiple linear regression model to this dataset.
Step 1: Calculate X1², X2², X1y, X2y and X1X2.

Step 2: Calculate Regression Sums.


Next, make the following regression sum calculations:
Σx1² = ΣX1² – (ΣX1)² / n = 38,767 – (555)² / 8 = 263.875
Σx2² = ΣX2² – (ΣX2)² / n = 2,823 – (145)² / 8 = 194.875
Σx1y = ΣX1y – (ΣX1)(Σy) / n = 101,895 – (555)(1,452) / 8 = 1,162.5
Σx2y = ΣX2y – (ΣX2)(Σy) / n = 25,364 – (145)(1,452) / 8 = –953.5
Σx1x2 = ΣX1X2 – (ΣX1)(ΣX2) / n = 9,859 – (555)(145) / 8 = –200.375
Step 3: Calculate b0, b1, and b2.

The formula to calculate b1 is: [(Σx2²)(Σx1y) – (Σx1x2)(Σx2y)] / [(Σx1²)(Σx2²) – (Σx1x2)²]


Thus, b1 = [(194.875)(1,162.5) – (–200.375)(–953.5)] / [(263.875)(194.875) – (–200.375)²] = 3.148

The formula to calculate b2 is: [(Σx1²)(Σx2y) – (Σx1x2)(Σx1y)] / [(Σx1²)(Σx2²) – (Σx1x2)²]


Thus, b2 = [(263.875)(–953.5) – (–200.375)(1,162.5)] / [(263.875)(194.875) – (–200.375)²] = –1.656
The formula to calculate b0 is: ȳ – b1x̄1 – b2x̄2, where the sample means are ȳ = Σy / n = 1,452 / 8 = 181.5,
x̄1 = ΣX1 / n = 555 / 8 = 69.375 and x̄2 = ΣX2 / n = 145 / 8 = 18.125.
Thus, b0 = 181.5 – 3.148(69.375) – (–1.656)(18.125) = –6.867
Step 4: Place b0, b1, and b2 in the estimated linear regression equation.

The estimated linear regression equation is: ŷ = b0 + b1*x1 + b2*x2


In our example, it is ŷ = -6.867 + 3.148x1 – 1.656x2
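
The hand calculation above can be checked in a few lines of Python, using the regression sums and the sample means from the example (ΣX1 = 555, ΣX2 = 145, Σy = 1,452, n = 8):

```python
# Reproducing the hand calculation from the regression sums above.
Sx1x1, Sx2x2 = 263.875, 194.875
Sx1y, Sx2y, Sx1x2 = 1162.5, -953.5, -200.375
x1_bar, x2_bar, y_bar = 69.375, 18.125, 181.5  # sample means from the example

denom = Sx1x1 * Sx2x2 - Sx1x2 ** 2
b1 = (Sx2x2 * Sx1y - Sx1x2 * Sx2y) / denom   # ≈ 3.148
b2 = (Sx1x1 * Sx2y - Sx1x2 * Sx1y) / denom   # ≈ -1.656
b0 = y_bar - b1 * x1_bar - b2 * x2_bar       # ≈ -6.867
print(b0, b1, b2)
```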

How to Interpret a Multiple Linear Regression Equation


Here is how to interpret this estimated linear regression equation: ŷ = -6.867 + 3.148x1 – 1.656x2

b0 = -6.867. When both predictor variables are equal to zero, the mean value for y is -6.867.
b1 = 3.148. A one unit increase in x1 is associated with a 3.148 unit increase in y, on average,
assuming x2 is held constant.
b2 = -1.656. A one unit increase in x2 is associated with a 1.656 unit decrease in y, on average,
assuming x1 is held constant.
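
As a quick illustration, the estimated equation can be evaluated directly; the input values below are hypothetical:

```python
# Predicting y from the estimated equation above; x1 = 60 and x2 = 22
# are hypothetical inputs chosen only to show the calculation.
def y_hat(x1, x2):
    return -6.867 + 3.148 * x1 - 1.656 * x2

print(y_hat(60, 22))  # -6.867 + 188.88 - 36.432 = 145.581
```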
SUMMARY
Multiple Linear Regression is the analysis to use when the response variable is
quantitative and there is more than one explanatory variable. Much of the analysis is
similar to simple linear regression. The major difference between simple and multiple
regression is what is being tested with the F-test and with the t-test. In multiple
regression, the F-test is like an initial screening – it will tell us if at least one of the
explanatory variables is a significant predictor of the response variable and, therefore,
whether we need to continue the analysis or not. If the conclusions from the F-test
tell us that there’s evidence that at least one explanatory variable helps to explain the
response variable, then we do a t-test on each explanatory variable to determine if
that explanatory variable helps explain the response variable after accounting for the
effects of the other explanatory variables in the model.
SUMMARY
This last part is important – if one of the explanatory variables was not in the model,
conclusions about the remaining explanatory variables may change. We use this idea when
using the backwards selection process to find a model that includes only significant
predictors of the response variable – the backwards selection process removes the
explanatory variable with the highest p-value from the t-tests as long as its p-value is
greater than 0.05 (or so). The process continues until all remaining explanatory variables
have p-values less than 0.05, or so. Such explanatory variables are included in the final
model and an analysis is performed on this final model.
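
Here is a minimal sketch of that backwards selection process, assuming statsmodels and a pandas DataFrame of predictors; it illustrates the idea rather than a full model-building workflow:

```python
import statsmodels.api as sm

def backward_select(y, X, threshold=0.05):
    """Drop the predictor with the largest t-test p-value until all
    remaining predictors have p-values below the threshold.
    X is a pandas DataFrame of predictors; y is the response."""
    predictors = list(X.columns)
    while predictors:
        model = sm.OLS(y, sm.add_constant(X[predictors])).fit()
        pvals = model.pvalues.drop("const")   # ignore the intercept's p-value
        worst = pvals.idxmax()
        if pvals[worst] <= threshold:
            break                             # all remaining predictors are significant
        predictors.remove(worst)              # remove the least significant variable
    return predictors
```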
A researcher collected data in a project to predict the annual growth per acre of upland boreal forests in
southern Canada. They hypothesized that cubic foot volume growth (y) is a function of stand basal area
per acre (x1), the percentage of that basal area in black spruce (x2), and the stand’s site index for black
spruce (x3). α = 0.05.

Table 3. Observed data for cubic feet, stand basal area, percent basal area in black spruce, and site index.
Scatterplots of the response variable versus each predictor variable were created along with a correlation
matrix.
Figure 1. Scatterplots of cubic feet versus basal area, percent basal area in black spruce, and site index.
Table 4. Correlation matrix.
As you can see from the scatterplots and the
correlation matrix, BA/ac has the strongest linear
relationship with CuFt volume (r = 0.816) and %BA
in black spruce has the weakest linear relationship
(r = 0.413). Also of note is the moderately strong
correlation between the two predictor variables,
BA/ac and SI (r = 0.588). All three predictor
variables have significant linear relationships with
the response variable (volume) so we will begin by
using all variables in our multiple linear regression
model. The Minitab output is given below.
We begin by testing the following null and
alternative hypotheses:
H0: β1 = β2 = β3 = 0
H1: At least one of β1, β2 , β3 ≠0
General Regression Analysis: CuFt versus BA/ac, SI,
%BA Bspruce
The F-test statistic (and associated p-value) is used to answer this question and is found in the ANOVA
table. For this example, F = 170.918 with a p-value of 0.00000. The p-value is smaller than our level of
significance (0.0000 < 0.05), so we will reject the null hypothesis. At least one of the predictor variables
significantly contributes to the prediction of volume.
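
For readers without Minitab, a similar analysis can be run in Python with statsmodels. The sketch below uses hypothetical stand data (the study's actual observations are in Table 3), so its output will not match the numbers quoted here:

```python
import pandas as pd
import statsmodels.api as sm

# Hypothetical stand data standing in for Table 3.
df = pd.DataFrame({
    "CuFt":      [55.0, 68.2, 72.1, 60.5, 80.3, 65.4, 74.8, 58.9],
    "BAac":      [90, 120, 130, 100, 150, 110, 140, 95],
    "SI":        [40, 48, 45, 50, 47, 42, 44, 41],
    "BABspruce": [50, 60, 65, 55, 70, 58, 68, 52],
})

X = sm.add_constant(df[["BAac", "SI", "BABspruce"]])
model = sm.OLS(df["CuFt"], X).fit()

print(model.fvalue, model.f_pvalue)  # overall F-test of H0: β1 = β2 = β3 = 0
print(model.summary())               # t-statistics and p-values for each coefficient
```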
The coefficients for the three predictor variables are all positive, indicating that as they increase, cubic foot
volume will also increase. For example, if we hold values of SI and %BA Bspruce constant, this equation tells
us that as basal area increases by 1 sq. ft., volume will increase an additional 0.591004 cu. ft. The signs of
these coefficients are logical, and what we would expect. The adjusted R2 is also very high at 94.97%.
The next step is to examine the individual t-tests for each predictor variable. The test statistics and
associated p-values are found in the Minitab output and repeated below:

The predictor variables BA/ac and %BA Bspruce have t-statistics of 13.7647 and 9.3311 and p-values of
0.0000, indicating that both are significantly contributing to the prediction of volume. However, SI has a
t-statistic of 0.7991 with a p-value of 0.432. This variable does not significantly contribute to the prediction
of cubic foot volume.
This result may surprise you as SI had the second strongest relationship with volume, but don’t forget
about the correlation between SI and BA/ac (r = 0.588). The predictor variable BA/ac had the strongest
linear relationship with volume, and using the sequential sums of squares, we can see that BA/ac is already
accounting for 70% of the variation in cubic foot volume (3611.17/5176.56 = 0.6976). The information from
SI may be too similar to the information in BA/ac, and SI only explains about 13% of the variation in volume
(686.37/5176.56 = 0.1326) given that BA/ac is already in the model.

The next step is to examine the residual and normal probability plots. A single outlier is evident in the
otherwise acceptable plots.
Figure 2. Residual and normal probability plots.

So where do we go from here?


We will remove the non-significant variable and re-fit the model, excluding the SI data. The
Minitab output is given below.
General Regression Analysis: CuFt versus BA/ac, %BA Bspruce

We will repeat the steps followed with our first model. We begin by again testing the following hypotheses:
H0: β1 = β2 = 0
H1: At least one of β1, β2 ≠ 0
This reduced model has an F-statistic equal to 259.814 and a p-value of 0.0000. We will reject the null
hypothesis. At least one of the predictor variables significantly contributes to the prediction of volume. The
coefficients are still positive (as we expected) but the values have changed to account for the different
model.
The individual t-tests for each coefficient (repeated below) show that both predictor variables are
significantly different from zero and contribute to the prediction of volume.
General Regression Analysis: CuFt versus BA/ac, %BA Bspruce

Notice that the adjusted R2 has increased from 94.97% to 95.04% indicating a slightly better fit to the
data. The regression standard error has also changed for the better, decreasing from 3.17736 to 3.15431,
indicating slightly less variation of the observed data about the model.
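
Continuing the hypothetical statsmodels sketch above, the two models can be compared on exactly these criteria:

```python
import statsmodels.api as sm

# Comparing the full and reduced models on adjusted R² and the regression
# standard error (re-uses the hypothetical DataFrame df from the earlier sketch).
full = sm.OLS(df["CuFt"], sm.add_constant(df[["BAac", "SI", "BABspruce"]])).fit()
reduced = sm.OLS(df["CuFt"], sm.add_constant(df[["BAac", "BABspruce"]])).fit()

print(full.rsquared_adj, reduced.rsquared_adj)          # adjusted R²: higher is better
print(full.mse_resid ** 0.5, reduced.mse_resid ** 0.5)  # regression standard error: lower is better
```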

Figure 3. Residual and normal probability plots


The residual and normal probability plots have changed little, still not indicating any
issues with the regression assumptions. By removing the non-significant variable, the
model has improved.

Model Development and Selection


There are many different reasons for creating a multiple linear regression model and
its purpose directly influences how the model is created. Listed below are several of the
more common uses for a regression model:
1. Describing the behavior of your response variable
2. Predicting a response or estimating the average response
3. Estimating the parameters (β0, β1, β2, …)
4. Developing an accurate model of the process
Using the Chi-square test of independence
The Chi-square test of independence checks whether two variables are likely to be related or not. We have counts for
two categorical or nominal variables. We also have an idea that the two variables are not related. The test gives us a
way to decide if our idea is plausible or not.
The sections below discuss what we need for the test, how to do the test, how to understand the results,
statistical details and how to interpret p-values.

What do we need?
For the Chi-square test of independence, we need two variables. Our idea is that the variables are not related. Here are
a couple of examples:
We have a list of movie genres; this is our first variable. Our second variable is whether or not the patrons of
those genres bought snacks at the theater. Our idea (or, in statistical terms, our null hypothesis) is that the type
of movie and whether or not people bought snacks are unrelated. The owner of the movie theater wants to
estimate how many snacks to buy. If movie type and snack purchases are unrelated, estimating will be simpler
than if the movie types impact snack sales.
A veterinary clinic has a list of dog breeds they see as patients. The second variable is whether owners feed dry
food, canned food or a mixture. Our idea is that the dog breed and types of food are unrelated. If this is true, then
the clinic can order food based only on the total number of dogs, without consideration for the breeds.
For a valid test, we need:
Data values that are a simple random sample from the population of interest.
Two categorical or nominal variables. Don't use the independence test with continuous variables that define the
category combinations. However, the counts for the combinations of the two categorical variables will be
continuous.
For each combination of the levels of the two variables, we need at least five expected values. When we have
fewer than five for any one combination, the test results are not reliable.

Let’s take a closer look at the movie snacks example. Suppose we collect data for 600 people at our theater. For each
person, we know the type of movie they saw and whether or not they bought snacks.
Let’s start by answering: Is the Chi-square test of independence an appropriate method to evaluate the relationship
between movie type and snack purchases?
We have a simple random sample of 600 people who saw a movie at our theater. We meet this requirement.
Our variables are the movie type and whether or not snacks were purchased. Both variables are categorical. We
meet this requirement.
The last requirement is for at least five expected values for each combination of the two variables. To confirm
this, we need to know the total counts for each type of movie and the total counts for whether snacks were
bought or not. For now, we assume we meet this requirement and will check it later.
It appears we have indeed selected a valid method. (We still need to check that at least five values are expected for
each combination.)
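
Here is a sketch of the test in Python with scipy; the contingency counts are hypothetical, chosen only so that 600 patrons split across four genres:

```python
from scipy.stats import chi2_contingency

# Hypothetical counts for the movie-snacks example: rows are movie
# genres, columns are "bought snacks" / "did not buy snacks",
# totaling 600 patrons.
observed = [[50, 75],    # action
            [125, 175],  # comedy
            [90, 30],    # family
            [45, 10]]    # horror

chi2, p, dof, expected = chi2_contingency(observed)
print(chi2, p, dof)  # reject independence when p is below the significance level
print(expected)      # check that every expected count is at least 5
```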
THANK
YOU! BY GROUP 6:
MARGARET LOYOLA
KATE NICOLE CAPATID
JUAN MIGUEL SIATAN
GIAN ANGELO FAELDONEA
