Multiple Linear Regression
The dataset contains 18,207 observations (rows) and 80 features (variables). In this exercise we
want to look at how variables like international reputation, skill moves, long shots, strength and
vision affect the wage each footballer in the dataset earns.
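The multiple linear regression model is:
Wage = β + α·InternationalReputation + κ·SkillMoves + γ·Strength + θ·LongShots + ρ·Vision + ε
Where:
β is the intercept.
α, κ, γ, θ and ρ are the coefficients that measure how strongly Wage depends on
InternationalReputation, SkillMoves, Strength, LongShots and Vision respectively.
ε is the error term.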
A duplicates report was generated and there were no duplicates in the entire dataset. The
command and output were:
. duplicates report
  Copies | Observations    Surplus
       1 |        18207          0
• Cleaning variable Wage and generating Salary
The dependent variable contained the euro currency symbol and a “K” suffix for thousands, so
Stata read it as a string variable and it needed cleaning. The euro symbol was split off and the
remaining content was destringed, ignoring the “K” at the end. The new variable was named
salary. It was then multiplied by 1000 to convert the figures from thousands of euros to euros.
The commands and output were:
. split Wage , parse(€) generate (wage_y)
variables created as string:
wage_y1 wage_y2
.
. destring wage_y2, generate(salary) ignore("K")
wage_y2: character K removed; salary generated as int
(241 missing values generated)
.
. replace salary = salary * 1000
variable salary was int now long
(17,966 real changes made)
The multiple linear regression model is therefore restated in terms of salary:
salary = β + α·InternationalReputation + κ·SkillMoves + γ·Strength + θ·LongShots + ρ·Vision + ε
Where:
β is the intercept.
α, κ, γ, θ and ρ are the coefficients that measure how strongly salary depends on
InternationalReputation, SkillMoves, Strength, LongShots and Vision respectively.
ε is the error term.
Of all the variables, 6 were needed for the model; together with the variables Name and A this
makes 8. The keep command was therefore used so that Stata kept these 8 and dropped the rest.
The command was:
. keep A Name InternationalReputation SkillMoves Strength Vision LongShots salary
. display (1+2+3+4+5)/5
3
This average position, 3, was then substituted for the missing values in these two variables.
The commands were:
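The replacement commands themselves are not reproduced in this document. A minimal sketch of what they could look like is given here; it assumes that the two variables in question are the 1-to-5 categorical variables InternationalReputation and SkillMoves, which the surviving text does not state explicitly.

```stata
* Assumption: the "two variables" are InternationalReputation and SkillMoves.
* Substitute the mid-scale value 3 wherever the rating is missing.
replace InternationalReputation = 3 if missing(InternationalReputation)
replace SkillMoves = 3 if missing(SkillMoves)
```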
To handle the missing values of the variable salary, the mean of the observed values in the
column was calculated and substituted for the missing values. The command and output
for generating the mean were:
. mean salary
The command for replacing missing values in the salary variable was:
. replace salary = 9861.85 if salary == .
variable salary was long now double
(241 real changes made)
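As an alternative to typing the rounded mean by hand, the stored result of summarize can be used directly. This is only a sketch of the same mean imputation, not the command used in the original do-file:

```stata
* Compute the mean of the non-missing salaries; meanonly skips the full table
quietly summarize salary, meanonly
* r(mean) holds the unrounded mean left behind by summarize
replace salary = r(mean) if missing(salary)
```

Using r(mean) avoids the small rounding error that comes from copying 9861.85 out of the output.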
.
. rename A Time
The variable Strength had missing values, so the mean of the column was calculated and
substituted for them. The commands and output were:
. mean Strength
The variable LongShots had missing values, so the mean of the column was calculated and
substituted for them. The commands and output were:
. mean LongShots
The variable Vision had missing values, so the mean of the column was calculated and
substituted for them. The commands and output were:
. mean Vision
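The replace commands for these three variables are not shown in this document. One possible way to perform the same mean imputation for all three in a single loop is sketched below; it is an illustration, not the original code.

```stata
* Mean-impute each continuous predictor in turn
foreach v of varlist Strength LongShots Vision {
    * r(mean) is the mean of the non-missing values of the current variable
    quietly summarize `v', meanonly
    replace `v' = r(mean) if missing(`v')
}
```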
Conclusion:
The data was summarized after the cleaning procedure. The command and output were:
. summarize Time InternationalReputation SkillMoves Strength LongShots Vision salary Names
• Time: The mean is 9103, standard deviation is 5256.053, minimum is 0 and maximum is
18206.
• InternationalReputation: The mean is 1.118196, standard deviation is 0.4052307,
minimum is 1 and maximum is 5.
• SkillMoves: The mean is 2.362992, standard deviation is 0.7558765, minimum is 1 and
maximum is 5.
• Strength: The mean is 65.31197, standard deviation is 12.54044, minimum is 17 and
maximum is 97.
• LongShots: The mean is 47.10977, standard deviation is 19.23512, minimum is 3 and
maximum is 94.
• Vision: The mean is 53.4009, standard deviation is 14.12822, minimum is 10 and
maximum is 94.
• salary: The mean is 9861.85, standard deviation is 21970.4, minimum is 1000 and
maximum is 565000.
• Names: A mean, standard deviation, minimum and maximum are reported, but we treat
them as invalid: the variable holds player names that were encoded from a string, so its
numeric codes carry no quantitative meaning.
salary                         Coef.   Std. Err.        t    P>|t|     [95% Conf. Interval]
InternationalReputation
  2                         19552.05    471.2686    41.49    0.000     18628.32    20475.78
  3                            56872    844.8184    67.32    0.000     55216.07    58527.92
  4                         158901.5    2168.295    73.28    0.000     154651.5    163151.6
  5                           286925    6338.772    45.27    0.000     274500.4    299349.6
SkillMoves
  2                         -4862.77     481.343   -10.10    0.000    -5806.248   -3919.292
  3                        -3332.904    595.9187    -5.59    0.000    -4500.961   -2164.847
  4                         7805.529    819.2849     9.53    0.000     6199.654    9411.405
  5                         9032.809    2289.064     3.95    0.000     4546.028    13519.59
For the categorical variables (InternationalReputation and SkillMoves) dummy variables were
generated.
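The regression command is not reproduced in this document. A sketch that would produce dummy variables for levels 2 to 5 of each categorical predictor, with level 1 as the base as in the output above, uses Stata's factor-variable prefix i.; the exact command in the original do-file may differ.

```stata
* i. tells Stata to expand a categorical variable into indicator (dummy) terms,
* omitting the lowest level (1) as the reference category
regress salary i.InternationalReputation i.SkillMoves Strength LongShots Vision
```

Each dummy coefficient is then read as the expected difference in salary relative to level 1, holding the other predictors fixed.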
Interpretation
The p-value of the F test (Prob > F) is 0.0000, which is less than 0.05, so we reject the
hypothesis that all coefficients of the variables in this linear regression model are zero.
The adjusted R^2 is 0.5169, which means that 51.69% of the variation in salary is explained by
InternationalReputation, SkillMoves, Strength, LongShots and Vision.
All p-values are 0.000, meaning that all variables are significant in this model.
TESTING ASSUMPTIONS
. tab salary
. summarize salary
From this we conclude that Strength, LongShots and Vision are continuous, while
InternationalReputation and SkillMoves are categorical.
Furthermore, under this assumption some summary statistics were produced for the independent
variables.
Testing assumption 3a): There needs to be a linear relationship between the dependent
variable and each of the independent variables.
A log transformation was applied to the variable salary in order to shift the line of best fit up
in all the plots done for this assumption. The command was:
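The transformation command and the plotting commands are not shown in this document. A minimal sketch follows; the variable name lnsalary is hypothetical, and the plot is shown for Strength only (the same pattern would apply to LongShots and Vision).

```stata
* Hypothetical name: lnsalary holds the natural log of salary
generate lnsalary = ln(salary)
* Scatter plot with a fitted straight line overlaid to judge linearity
twoway (scatter lnsalary Strength) (lfit lnsalary Strength)
```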
[Figures: plots against Strength (x-axis 20 to 100), LongShots (x-axis 0 to 100) and Vision (x-axis 0 to 100)]
Testing assumption 3b): There should be a linear relationship between the dependent
variable and the independent variables collectively.
This assumption is tested using partial regression plots, also called added-variable plots. The
command below produces an added-variable plot for every variable in the multiple linear
regression model, including the dummy variables.
. avplots
4) Your data must not show multicollinearity, which occurs when two or more independent
variables are highly correlated with each other.
This assumption was tested using variance inflation factors. The command and output were:
. estat vif
Variable           VIF       1/VIF
Internatio~n
  2               1.12    0.894554
  3               1.07    0.933454
  4               1.03    0.975212
  5               1.03    0.967540
SkillMoves
  2               4.51    0.221882
  3               6.43    0.155575
  4               2.51    0.398925
  5               1.14    0.875024
Strength          1.10    0.912298
LongShots         3.67    0.272388
Vision            2.68    0.373396
Conclusion: All the variance inflation factors (VIF) lie between 1 and 10 (the largest is 6.43),
so the correlation among the independent variables is at most moderate and there is no
problematic multicollinearity.
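To see where a VIF comes from, it can be reproduced by hand: regress one predictor on all the others and compute 1/(1 - R^2) from that auxiliary regression. The sketch below does this for LongShots; it assumes the main model uses the five predictors named in this document.

```stata
* Auxiliary regression: LongShots explained by the remaining predictors
quietly regress LongShots i.InternationalReputation i.SkillMoves Strength Vision
* e(r2) is the R-squared of the auxiliary regression; 1 - e(r2) is the tolerance (1/VIF)
display "VIF for LongShots = " 1/(1 - e(r2))
* Note: this replaces the stored estimates, so re-run the main regression
* before using further postestimation commands
```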
5) There should be homoscedasticity
This assumption was tested using a residual versus fitted values plot (rvfplot). The command and
output were:
. rvfplot, yline(0)
[Figure: residual versus fitted values plot; Residuals axis from -400000 to 400000, reference line at 0]
Conclusion: The residuals fan out from the reference line at 0 instead of forming a band of
constant width, so the error variance is not constant: the homoscedasticity assumption is violated.
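The visual check can be complemented by a formal test. The following is not part of the original analysis; it is a sketch using Stata's Breusch-Pagan / Cook-Weisberg test, run directly after the main regress command.

```stata
* Null hypothesis: the residual variance is constant (homoscedasticity).
* A small p-value is evidence against constant variance.
estat hettest
```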
. estat dwatson
Conclusion: The Durbin-Watson statistic is 1.421951, which is below 2 and therefore suggests
positive autocorrelation. This statistic ranges between 0 and 4; a value of 2 indicates no
autocorrelation.
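estat dwatson only runs on data that have been declared as a time series. The sketch below shows the full sequence; it assumes that Time (the renamed column A) is the ordering index and that the model uses the five predictors named in this document.

```stata
* Declare Time as the time index (it must contain unique integer values)
tsset Time
* Re-estimate the model, then request the Durbin-Watson d statistic
regress salary i.InternationalReputation i.SkillMoves Strength LongShots Vision
estat dwatson
```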
7) The residuals should be normally distributed.
In order to test this assumption, studentized residuals were generated and then plotted in a
histogram with a normal density curve superimposed. The commands and output were:
. predict stres, rstudent
[Figure: histogram of studentized residuals with normal density curve; x-axis "Studentized residuals" from -20 to 20]
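The histogram command is not shown above. A minimal sketch, assuming the studentized residuals are stored in stres as created by the predict command:

```stata
* Histogram of the studentized residuals with a normal density curve overlaid
histogram stres, normal
```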
8) There should be no significant outliers, high leverage points and highly influential
points
Outliers
To check for outliers, a stem-and-leaf display was generated, and the outliers above and below
were identified. Those above are few, so all of them were listed. Those below are many, so only
10 are listed in this document; Stata outputs all of them, since the command for that is in the
do-file.
The commands and output were:
i) For those above
. list InternationalReputation SkillMoves Strength LongShots Vision salary stres if stres <= -5
 Obs   InternationalReputation   SkillMoves   Strength   LongShots   Vision   salary      stres
  1.                         5            4         59          94       94   565000   18.30344
  5.                         4            4         75          91       94   355000   11.10393
  6.                         4            4         66          80       89   340000    10.3053
  7.                         4            4         58          82       92   420000   15.72673
  8.                         5            3         83          85       84   455000   10.90395
  9.                         4            3         83          59       63   380000   13.90365
 12.                         4            3         73          92       86   355000   11.96119
 15.                         3            2         76          69       79   225000   10.23313
 19.                         3            1         79          10       69   240000    11.1949
 21.                         4            3         77          54       87   315000   9.366471
Leverage points
In order to check for leverage points, they were predicted using the following command:
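The predict command itself is not reproduced in this document. Since the later listing refers to a variable named leverage, a likely form (an assumption, to be run after the main regress command) is:

```stata
* Store the leverage (diagonal of the hat matrix) of each observation
predict leverage, leverage
```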
Then the cut off for leverage points was calculated. The formula is (2k + 2)/n where k is the number
of variables used in the multiple linear regression model and n is the total number of observations.
The command and output were:
. display ((2*6)+2)/18207
.00076894
Then the leverage points greater than the cut-off of 0.00076894 were listed. Only the first 10
are shown in this document; Stata shows all of them, since the command is in the do-file. The
command and output were:
. list InternationalReputation SkillMoves Strength LongShots Vision salary leverage if leverage > .00076894 in
> 1/10
 Obs   InternationalReputation   SkillMoves   Strength   LongShots   Vision   salary   leverage
  1.                         5            4         59          94       94   565000   .1726412
  2.                         5            5         79          93       82   405000   .1719912
  3.                         5            5         49          82       87   290000   .1720172
  4.                         4            1         64          12       68   260000     .02072
  5.                         4            4         75          91       94   355000   .0202646
  6.                         4            4         66          80       89   340000   .0202131
  7.                         4            4         58          82       92   420000   .0202777
  8.                         5            3         83          85       84   455000    .172036
  9.                         4            3         83          59       63   380000   .0201346
 10.                         3            1         78          12       70    94000   .0040192
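Rather than scrolling through the full listing, the number of high-leverage observations can be counted. The sketch below stores the cut-off in a scalar (lev_cutoff is a hypothetical name) and uses the same k = 6 and n = 18207 as the calculation above.

```stata
* Same cut-off as above, (2k + 2)/n with k = 6 and n = 18207
scalar lev_cutoff = (2*6 + 2)/18207
* Stata treats missing values as larger than any number, so exclude them explicitly
count if leverage > lev_cutoff & !missing(leverage)
```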
Influential points
In order to check for influential points, DFITS (difference in fits) values were predicted using
the following command:
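The predict command itself is not reproduced in this document. Since the later listing refers to a variable named dfits, a likely form (an assumption, to be run after the main regress command) is:

```stata
* Store the DFITS influence measure of each observation
predict dfits, dfits
```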
Then the cut off for influential points was calculated. The formula is (2*sqrt(k/n)) where k is the
number of variables used in the multiple linear regression model and n is the total number of
observations. The command and output were:
. display 2*sqrt(6/18207)
.03630667
Then the influential points greater than the cut-off of 0.03630667 were listed. Only the first 10
are shown in this document; Stata shows all of them, since the command is in the do-file. The
command and output were:
. list InternationalReputation SkillMoves Strength LongShots Vision salary dfits if dfits>.03630667 in 1/11
 Obs   InternationalReputation   SkillMoves   Strength   LongShots   Vision   salary      dfits
  1.                         5            4         59          94       94   565000   8.360995
  2.                         5            5         79          93       82   405000   2.931811
  4.                         4            1         64          12       68   260000   .8741187
  5.                         4            4         75          91       94   355000   1.596951
  6.                         4            4         66          80       89   340000   1.480169
  7.                         4            4         58          82       92   420000    2.26254
  8.                         5            3         83          85       84   455000   4.970357
  9.                         4            3         83          59       63   380000   1.993044
 10.                         3            1         78          12       70    94000    .099716
 11.                         4            4         84          84       77   205000   .1810446
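DFITS can be negative as well as positive, so the cut-off is usually compared with its absolute value. The command above checks the positive side only; a sketch of the two-sided version, with dfits_cutoff as a hypothetical scalar name, is:

```stata
* Same cut-off as above, 2*sqrt(k/n) with k = 6 and n = 18207
scalar dfits_cutoff = 2*sqrt(6/18207)
* abs() catches large negative values too; missing values are excluded explicitly
list InternationalReputation SkillMoves Strength LongShots Vision salary dfits ///
    if abs(dfits) > dfits_cutoff & !missing(dfits) in 1/11
```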
In order to treat unusual points such as outliers, a robust regression was used.
Robust regression is an iterative procedure that seeks to identify outliers and minimize their
impact on the coefficient estimates.
In this document the robust regression model is shown without the iterations. The full
output appears in Stata, since the command is in the do-file.
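The robust regression command is not reproduced in this document. A sketch using Stata's rreg, assuming the same predictors as the main model, is given here; the nolog option suppresses the iteration log, matching the output shown without iterations.

```stata
* Robust regression: observations with large residuals are down-weighted iteratively
rreg salary i.InternationalReputation i.SkillMoves Strength LongShots Vision, nolog
```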
salary                         Coef.   Std. Err.        t    P>|t|     [95% Conf. Interval]
InternationalReputation
  2                         6796.234    94.90796    71.61    0.000     6610.205    6982.262
  3                         48224.09    170.1606   283.40    0.000     47890.56    48557.62
  4                         156367.1    436.6914   358.07    0.000     155511.1    157223.1
  5                        -3.67e-10    2221.061    -0.00    1.000    -4353.488    4353.488
SkillMoves
  2                        -1504.482    96.93651   -15.52    0.000    -1694.486   -1314.477
  3                         -229.601    120.0113    -1.91    0.056    -464.8344    5.632478
  4                         3018.416    165.0164    18.29    0.000     2694.968    3341.864
  5                         2914.861    467.8699     6.23    0.000     1997.792     3831.93
Since some levels of the categorical variables were not significant in this treatment, one of
the categorical variables, SkillMoves, was dropped and the robust regression model was run a
final time.
salary                         Coef.   Std. Err.        t    P>|t|     [95% Conf. Interval]
InternationalReputation
  2                         11701.48    95.58984   122.41    0.000     11514.11    11888.84
  3                          50277.6    171.8411   292.58    0.000     49940.78    50614.43
  4                         156413.4    445.3501   351.21    0.000     155540.5    157286.4
  5                         284990.4    3159.159    90.21    0.000     278798.2    291182.7
Interpretation
There are 18,202 observations, since some unusual points have been dropped.
The p-value of the F test (Prob > F) is 0.0000, which is less than 0.05, so we reject the
hypothesis that all coefficients of the variables in this robust regression model are zero.
All p-values are 0.000, meaning that all variables are significant in this model.
END