Assignment no. 2
Question No. 1
Answer:
The Dummy Variable Trap occurs when two or more dummy variables
created by one-hot encoding are highly correlated (multicollinear). This means that
one variable can be predicted from the others, making it difficult to interpret
the estimated regression coefficients. In other words, the individual
effect of each dummy variable on the prediction cannot be interpreted reliably
because of multicollinearity.
With one-hot encoding, a new dummy variable is created for each category of
a categorical variable to represent the presence (1) or absence (0) of that
category. For example, if tree species is a categorical variable
taking the values pine or oak, then tree species can be represented as
dummy variables by converting each value to a one-hot vector. This means that
a separate column is obtained for each category, where the first column
represents whether the tree is pine and the second column represents whether the tree is oak.
Each column contains a 1 if the tree in question is of that column's species, and a 0 otherwise.
These two columns are multicollinear, since if a tree is pine, then we know it is
not oak, and vice versa. Because a 1 in the pine column implies a 0 in the oak
column, we can write x_pine = 1 - x_oak. This results in two multicollinear dummy
variables, so the dummy variable trap may occur in regression analysis.
To overcome the Dummy variable Trap, we drop one of the columns created
when the categorical variables were converted to dummy variables by one-hot
encoding. This can be done because the dummy variables include redundant
information.
Suppose the regression equation is y = β0 + β1·x_pine + β2·x_oak. Substituting
x_oak = 1 - x_pine gives y = (β0 + β2) + (β1 - β2)·x_pine. The regression
equation can thus be rewritten using only x_pine, where the new coefficients to
be estimated are (β0 + β2) and (β1 - β2). By dropping a dummy variable column,
we can avoid this trap.
This example shows two categories, but the approach extends to any
number of categories. In general, if a categorical variable has p
categories, we use p - 1 dummy variables, dropping one dummy variable to
guard against the dummy variable trap.
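The drop-one-column fix can be sketched in plain Python. This is a minimal sketch; the function name and the tree-species data are made up for illustration:

```python
def one_hot_drop_first(values, categories=None):
    """Encode category labels as p-1 dummy columns, dropping the
    first (baseline) category to avoid the dummy variable trap."""
    if categories is None:
        categories = sorted(set(values))
    kept = categories[1:]  # drop the first category; it becomes the baseline
    return [[1 if v == c else 0 for c in kept] for v in values]

# Two categories -> a single dummy column (here "pine"; "oak" is the baseline).
species = ["pine", "oak", "pine", "oak"]
encoded = one_hot_drop_first(species)
```

A row of all zeros then unambiguously means the baseline category, so no information is lost by dropping the column.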
Question No. 2
What is the use of the Chow test? Describe the steps applied in the Chow test.
If we used one regression line to summarize the pattern in the entire dataset,
it might mask a change in the underlying relationship partway through the data.
The Chow test allows us to test for whether or not the regression coefficients
of each regression line are equal. If the test determines that the coefficients
are not equal between the regression lines, this means there is significant
evidence that a structural break exists in the data. In other words, the pattern
in the data is significantly different before and after that structural break point.
Suppose we fit the following regression model to the entire dataset:

yt = a + b·x1t + c·x2t + ε
Then suppose we split our data into two groups based on some structural
break point and fit the following regression models to each group:
yt = a1 + b1·x1t + c1·x2t + ε
yt = a2 + b2·x1t + c2·x2t + ε
We would use the following null and alternative hypotheses for the Chow test:

H0: a1 = a2, b1 = b2, and c1 = c2 (the coefficients of the two regression lines are equal).
HA: At least one of the coefficients is not equal.
If we reject the null hypothesis, we have sufficient evidence to say that there
is a structural break point in the data and two regression lines can fit the data
better than one.
If we fail to reject the null hypothesis, we do not have sufficient evidence to
say that there is a structural break point in the data. In this case, we say
that the regression lines can be “pooled” into a single regression line that
represents the pattern in the data sufficiently well.
Step 2: Calculate the test statistic.
If we define SC as the residual sum of squares from the pooled regression, S1
and S2 as the residual sums of squares from the two group regressions, N1 and
N2 as the two group sample sizes, and k as the number of estimated parameters
per model, then the test statistic is:

F = [(SC - (S1 + S2)) / k] / [(S1 + S2) / (N1 + N2 - 2k)]

This test statistic follows the F-distribution with k and N1 + N2 - 2k degrees of
freedom.
If the p-value associated with this test statistic is less than a certain
significance level, we can reject the null hypothesis and conclude that there is
a structural break point in the data.
Fortunately, most statistical software is capable of performing a Chow test so
you will likely never have to perform the test by hand.
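As a sketch of the computation, here is the Chow statistic for a simple one-regressor model in plain Python. The data, variable names, and break point are hypothetical, and k = 2 (intercept plus slope per model):

```python
def rss_linear(xs, ys):
    """Residual sum of squares from a simple OLS fit y = a + b*x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = my - b * mx
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

def chow_f(x1, y1, x2, y2, k=2):
    """Chow F statistic for a candidate break between two groups."""
    s_c = rss_linear(x1 + x2, y1 + y2)               # pooled RSS
    s1, s2 = rss_linear(x1, y1), rss_linear(x2, y2)  # per-group RSS
    n = len(x1) + len(x2)
    return ((s_c - (s1 + s2)) / k) / ((s1 + s2) / (n - 2 * k))

# Hypothetical data with an obvious change in slope after x = 4.
x1, y1 = [1, 2, 3, 4], [1.1, 1.9, 3.2, 3.9]
x2, y2 = [5, 6, 7, 8], [10.0, 21.0, 29.0, 41.0]
f_stat = chow_f(x1, y1, x2, y2)  # large F suggests a structural break
```

Comparing f_stat against the F critical value with k and N1 + N2 - 2k degrees of freedom completes the test.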
Question No. 3
Explain the concept of ANOVA. How is an ANOVA table constructed?
Answer:
We will next illustrate the ANOVA procedure using the five step approach.
Because the computation of the test statistic is involved, the computations are often
organized in an ANOVA table. The ANOVA table breaks down the components of
variation in the data into variation between treatments and error or residual variation.
Statistical computing packages also produce ANOVA tables as part of their standard
output for ANOVA, and the ANOVA table is set up as follows:
Source of Variation    Sums of Squares (SS)     Degrees of Freedom (df)   Mean Squares (MS)    F
Between Treatments     SSB = ∑ nj(X̄j - X̄)²      k-1                       MSB = SSB/(k-1)      F = MSB/MSE
Error (or Residual)    SSE = ∑∑ (X - X̄j)²       N-k                       MSE = SSE/(N-k)
Total                  SST = ∑∑ (X - X̄)²        N-1
where
X = individual observation,
X̄j = sample mean of the jth treatment group,
X̄ = overall sample mean,
k = number of treatment groups, and
N = total number of observations.
The fourth column contains the mean squares (MS), which are computed by
dividing sums of squares (SS) by degrees of freedom (df), row by row.
Specifically, MSB = SSB/(k-1) and MSE = SSE/(N-k). Dividing SST by (N-1)
produces the variance of the total sample. The F statistic is in the rightmost
column of the ANOVA table and is computed as the ratio MSB/MSE.
Example:
A clinical trial is run to compare weight loss programs and participants are randomly
assigned to one of the comparison programs and are counseled on the details of the
assigned program. Participants follow the assigned program for 8 weeks. The
outcome of interest is weight loss, defined as the difference in weight measured at
the start of the study (baseline) and weight measured at the end of the study (8
weeks), and measured in pounds. Three popular weight loss programs are
considered. The first is a low calorie diet. The second is a low fat diet and the third is
a low carbohydrate diet. For comparison purposes, a fourth group is considered as a
control group. Participants in the fourth group are told that they are participating in a
study of healthy behaviors with weight loss only one component of interest. The
control group is included here to assess the placebo effect (i.e., weight loss due to
simply participating in the study). A total of twenty patients agree to participate in the
study and are randomly assigned to one of the four diet groups. Weights are
measured at baseline and patients are counseled on the proper implementation of
the assigned diet (with the exception of the control group). After 8 weeks, each
patient's weight is again measured and the difference in weights is computed by
subtracting the 8 week weight from the baseline weight. Positive differences indicate
weight losses and negative differences indicate weight gains. For interpretation
purposes, we refer to the differences in weights as weight losses and the observed
weight losses are shown below.
Low Calorie   Low Fat   Low Carbohydrate   Control
8             2         3                  2
9             4         5                  2
6             3         4                  -1
7             5         2                  0
3             1         3                  3
Is there a statistically significant difference in the mean weight loss among the
four diets?
We will run the ANOVA using the five-step approach.
Step 1. Set up hypotheses and determine the level of significance.
H0: μ1 = μ2 = μ3 = μ4
H1: The means are not all equal.
α = 0.05

Step 2. Select the appropriate test statistic: F = MSB/MSE.

Step 3. Set up the decision rule.
The appropriate critical value can be found in a table of probabilities for the F
distribution (see "Other Resources"). In order to determine the critical value of F we
need degrees of freedom, df1 = k-1 and df2 = N-k. In this example, df1 = k-1 = 4-1 = 3 and
df2 = N-k = 20-4 = 16. The critical value is 3.24 and the decision rule is as follows:
Reject H0 if F > 3.24.
Step 4. Compute the test statistic.
             Low Calorie   Low Fat   Low Carbohydrate   Control
n            5             5         5                  5
Group mean   6.6           3.0       3.4                1.2

The overall mean is X̄ = (33 + 15 + 17 + 6)/20 = 3.55. Next we compute
SSB = ∑ nj(X̄j - X̄)² = 5(6.6 - 3.55)² + 5(3.0 - 3.55)² + 5(3.4 - 3.55)² + 5(1.2 - 3.55)² = 75.75.
SSE requires computing the squared differences between each observation and
its group mean. We will compute SSE in parts. For the participants in the low calorie
diet:
X        (X - X̄1)   (X - X̄1)²
8        1.4        2.0
9        2.4        5.8
6        -0.6       0.4
7        0.4        0.2
3        -3.6       13.0
Totals   0          21.4
Thus, ∑ (X - X̄1)² = 21.4.
For the participants in the low fat diet:
X        (X - X̄2)   (X - X̄2)²
2        -1.0       1.0
4        1.0        1.0
3        0.0        0.0
5        2.0        4.0
1        -2.0       4.0
Totals   0          10.0
Thus, ∑ (X - X̄2)² = 10.0.
For the participants in the low carbohydrate diet:

X        (X - X̄3)   (X - X̄3)²
3        -0.4       0.2
5        1.6        2.6
4        0.6        0.4
2        -1.4       2.0
3        -0.4       0.2
Totals   0          5.4

Thus, ∑ (X - X̄3)² = 5.4.
For the participants in the control group:
X        (X - X̄4)   (X - X̄4)²
2        0.8        0.6
2        0.8        0.6
-1       -2.2       4.8
0        -1.2       1.4
3        1.8        3.2
Totals   0          10.6
Thus, ∑ (X - X̄4)² = 10.6.
Therefore, SSE = 21.4 + 10.0 + 5.4 + 10.6 = 47.4. The mean squares are
MSB = SSB/(k-1) = 75.75/3 = 25.25 and MSE = SSE/(N-k) = 47.4/16 = 2.96, so
F = MSB/MSE = 25.25/2.96 = 8.53.

Step 5. Conclusion.
Since F = 8.53 > 3.24, we reject H0 and conclude that there is statistically
significant evidence of a difference in mean weight loss among the four diets.
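The ANOVA computations for this example can be sketched in plain Python. This is a minimal sketch using the example's data; exact arithmetic gives F ≈ 8.56, slightly different from the hand computation, which rounds the intermediate squared deviations:

```python
def one_way_anova(groups):
    """One-way ANOVA: return (SSB, SSE, F) for a list of sample groups."""
    all_obs = [x for g in groups for x in g]
    N, k = len(all_obs), len(groups)
    grand_mean = sum(all_obs) / N
    means = [sum(g) / len(g) for g in groups]
    # Between-treatments sum of squares: SSB = sum of n_j * (mean_j - grand_mean)^2
    ssb = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Error sum of squares: squared deviations of each observation from its group mean
    sse = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    msb = ssb / (k - 1)   # mean square between
    mse = sse / (N - k)   # mean square error
    return ssb, sse, msb / mse

diets = [
    [8, 9, 6, 7, 3],    # low calorie
    [2, 4, 3, 5, 1],    # low fat
    [3, 5, 4, 2, 3],    # low carbohydrate
    [2, 2, -1, 0, 3],   # control
]
ssb, sse, f = one_way_anova(diets)
```

Since f exceeds the critical value 3.24 (df1 = 3, df2 = 16), the code reaches the same conclusion as the hand computation: reject H0.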
Question No. 4
Answer:
Monday 10% 5%
Tuesday 5% -2%
Friday -5%
Question No. 5
Answer: