Multiple Linear Regression


MAKERERE UNIVERSITY

COLLEGE OF BUSINESS AND MANAGEMENT SCIENCES
SCHOOL OF STATISTICS AND PLANNING
DEPARTMENT OF STATISTICS AND ACTUARIAL SCIENCES
DATA ANALYSIS 3 COURSEWORK REPORT

NAME                          STUDENT NO.   REG. NO.
NALUTAAYA AGNES LUNKUSE       1800700023    18/U/023
TUMWEBAZE CLARITY             1800700022    18/U/022
KAGGA IVAN CLIFF MAZZI        217005264     17/U/4354/PS
TUHAIRWE DUNCAN               1800723176    18/U/23176/PS
SSEKABIRA CAROL NANDAWULA     216004779     16/U/11552/PS
MUNEZERO BONIVENTURE          1800723165    18/U/23165/PS
ATWIINE GLORIA                1800741655    18/U/41655
TSIKHABI JOSHUA I MASAWI      1800714171    18/U/14171/PS


INTRODUCTION
The dataset was downloaded from https://www.kaggle.com/datasets and saved to the desktop.
Since the site provides datasets only as CSV files, the file was converted to an Excel workbook,
imported into Stata and then used in the analysis.

The dataset contains 18,207 observations (rows) and 80 variables (columns). This exercise examines
how international reputation, skill moves, long shots, strength and vision affect the wage that
each footballer in the dataset earns.

We propose the following multiple linear regression model:

Wage = β + α*InternationalReputation + κ*SkillMoves + γ*Strength + θ*LongShots + ρ*Vision

Where:

Wage is the dependent variable.

InternationalReputation, SkillMoves, Strength, LongShots and Vision are the independent variables.

β is the intercept.

α, κ, γ, θ and ρ are the coefficients that measure how Wage changes with InternationalReputation,
SkillMoves, Strength, LongShots and Vision respectively, holding the other variables constant.

PART ONE: DATA CLEANING


First, the codebook command was run to obtain a description of the data held in every variable
in the dataset.

• Dealing with duplicates

A duplicates report was generated and there were no duplicates in the entire dataset. The
command and output were:
. duplicates report

Duplicates in terms of all variables

copies observations surplus

1 18207 0
• Cleaning variable Wage and generating Salary

The dependent variable, Wage, contained the euro currency symbol and a “K” suffix for thousands,
so Stata read it as a string variable and it had to be cleaned. The euro symbol was split off and
the remaining text was converted to a number with destring, ignoring the “K” at the end.

The new variable was named salary. It was then multiplied by 1000 to convert the figures from
thousands of euros to euros. The command and output were:
. split Wage , parse(€) generate (wage_y)
variables created as string:
wage_y1 wage_y2

.
. destring wage_y2, generate(salary) ignore("K")
wage_y2: character K removed; salary generated as int
(241 missing values generated)

.
. replace salary = salary * 1000
variable salary was int now long
(17,966 real changes made)
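
The same cleaning step can be sketched outside Stata. The Python function below is only an illustration: the name parse_wage and the sample strings are hypothetical, and, unlike the Stata commands above, it multiplies by 1000 only when a "K" suffix is actually present.

```python
def parse_wage(text):
    """Convert a wage string such as '€565K' to a number of euros.

    Returns None when nothing is left after removing the symbols,
    which plays the role of a Stata missing value.
    """
    cleaned = text.replace("€", "").strip()
    multiplier = 1
    if cleaned.upper().endswith("K"):
        multiplier = 1000          # "K" marks thousands
        cleaned = cleaned[:-1]
    if not cleaned:
        return None
    return int(round(float(cleaned) * multiplier))


# Hypothetical sample values in the format described above.
for raw in ["€565K", "€1K", "€"]:
    print(raw, "->", parse_wage(raw))
```

Returning None for an empty value keeps missing wages distinct from a wage of zero, so they can be handled in a later step.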

The multiple linear regression model is therefore restated with salary as the dependent variable:

salary = β + α*InternationalReputation + κ*SkillMoves + γ*Strength + θ*LongShots + ρ*Vision

Where:

salary is the dependent variable.

InternationalReputation, SkillMoves, Strength, LongShots and Vision are the independent variables.

β is the intercept.

α, κ, γ, θ and ρ are the coefficients that measure how salary changes with InternationalReputation,
SkillMoves, Strength, LongShots and Vision respectively, holding the other variables constant.

• Dropping some variables and keeping some variables.

Of the 80 variables, the six model variables were needed, together with Name and A, giving eight in
total. The keep command was used so that Stata kept these eight and dropped the rest. The command was:
. keep A Name InternationalReputation SkillMoves Strength Vision LongShots salary

• Cleaning variable Name


The string variable Name was encoded as a labelled numeric variable that Stata can manipulate. The
new variable generated was Names. The command was:

. encode Name, generate(Names)

• Dealing with missing values in InternationalReputation and SkillMoves variables

The variables InternationalReputation and SkillMoves take the discrete integer values 1, 2, 3, 4
and 5. To handle their missing values, the average of these five possible values, which is the
midpoint of the scale, was calculated in Stata. The command and output were:

. display (1+2+3+4+5)/5
3

This average, 3, was then substituted for the missing values in both variables. The commands and
output were:

. replace InternationalReputation = 3 if InternationalReputation == .
(48 real changes made)

. replace SkillMoves = 3 if SkillMoves == .
(48 real changes made)

• Dealing with missing values in variable salary

To handle the missing values of the variable salary, the mean of the observed values in the column
was calculated and substituted for them. The command and output for generating the mean were:
. mean salary

Mean estimation Number of obs = 17,966

Mean Std. Err. [95% Conf. Interval]

salary 9861.85 165.0083 9538.418 10185.28

The command and output for replacing the missing values in the salary variable were:
. replace salary = 9861.85 if salary == .
variable salary was long now double
(241 real changes made)
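
The mean imputation used here, and again for Strength, LongShots and Vision, can be sketched in a few lines of Python. This is an illustration only: impute_mean is a hypothetical helper and the sample list is made up, not taken from the dataset.

```python
def impute_mean(values):
    """Replace None entries with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("no observed values to average")
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]


# Made-up sample: two missing salaries among five players.
sample = [1000, None, 3000, 2000, None]
print(impute_mean(sample))
```

Filling gaps with the mean leaves the column mean unchanged but reduces its standard deviation, because every imputed value sits exactly at the centre of the distribution.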

• Generating variable Time from variable A


The variable A, which indexes the rows from 0 to 18206, was renamed Time because the Durbin-Watson
d test for autocorrelation requires a time variable. This is why A was kept rather than dropped
during the earlier cleaning step. The commands were:
. label variable A "time"

.
. rename A Time

• Dealing with missing values in variable Strength

The variable Strength had missing values, so the column mean was calculated and substituted for
them. The commands and output were:
. mean Strength

Mean estimation Number of obs = 18,159

Mean Std. Err. [95% Conf. Interval]

Strength 65.31197 .0931837 65.12932 65.49462

. replace Strength = 65.31197 if Strength == .
variable Strength was byte now float
(48 real changes made)

• Dealing with missing values in variable LongShots

The variable LongShots had missing values, so the column mean was calculated and substituted for
them. The commands and output were:
. mean LongShots

Mean estimation Number of obs = 18,159

Mean Std. Err. [95% Conf. Interval]

LongShots 47.10997 .1429296 46.82982 47.39013

. replace LongShots = 47.10997 if LongShots == .
variable LongShots was byte now float
(48 real changes made)

• Dealing with missing values in variable Vision

The variable Vision had missing values, so the column mean was calculated and substituted for
them. The commands and output were:
. mean Vision

Mean estimation Number of obs = 18,159

Mean Std. Err. [95% Conf. Interval]

Vision 53.4009 .104982 53.19513 53.60668

. replace Vision = 53.4009 if Vision == .
variable Vision was byte now float
(48 real changes made)

Conclusion:

The data was summarized after the cleaning procedure. The command and output were:
. summarize Time InternationalReputation SkillMoves Strength LongShots Vision salary Names

Variable        Obs       Mean        Std. Dev.   Min     Max

Time            18,207    9103        5256.053    0       18206
Internatio~n    18,207    1.118196    .4052307    1       5
SkillMoves      18,207    2.362992    .7558765    1       5
Strength        18,207    65.31197    12.54044    17      97
LongShots       18,207    47.10997    19.23512    3       94
Vision          18,207    53.4009     14.12822    10      94
salary          18,207    9861.85     21970.4     1000    565000
Names           18,207    8562.34     4944.38     1       17194

The total number of observations for all variables is 18,207.

• Time: The mean is 9103, standard deviation is 5256.053, minimum is 0 and maximum is
18206.
• InternationalReputation: The mean is 1.118196, standard deviation is 0.4052307,
minimum is 1 and maximum is 5.
• SkillMoves: The mean is 2.362992, standard deviation is 0.7558765, minimum is 1 and
maximum is 5.
• Strength: The mean is 65.31197, standard deviation is 12.54044, minimum is 17 and
maximum is 97.
• LongShots: The mean is 47.10997, standard deviation is 19.23512, minimum is 3 and
maximum is 94.
• Vision: The mean is 53.4009, standard deviation is 14.12822, minimum is 10 and
maximum is 94.
• salary: The mean is 9861.85, standard deviation is 21970.4, minimum is 1000 and
maximum is 565000.
• Names: The mean, standard deviation, minimum and maximum reported for this variable are
not meaningful, because its values are numeric codes for player names produced when the
string variable was encoded.

PART TWO: MULTIPLE LINEAR REGRESSION AND TESTING ASSUMPTIONS

Running the multiple linear regression model.

The command and output were:


. regress salary i.InternationalReputation i.SkillMoves Strength LongShots Vision

Source      SS            df        MS            Number of obs  =  18,207
                                                  F(11, 18195)   =  1772.09
Model       4.5453e+12    11        4.1321e+11    Prob > F       =  0.0000
Residual    4.2427e+12    18,195    233177953     R-squared      =  0.5172
                                                  Adj R-squared  =  0.5169
Total       8.7880e+12    18,206    482698402     Root MSE       =  15270

salary                     Coef.        Std. Err.    t         P>|t|    [95% Conf. Interval]

InternationalReputation
  2                        19552.05     471.2686     41.49     0.000    18628.32     20475.78
  3                        56872        844.8184     67.32     0.000    55216.07     58527.92
  4                        158901.5     2168.295     73.28     0.000    154651.5     163151.6
  5                        286925       6338.772     45.27     0.000    274500.4     299349.6

SkillMoves
  2                        -4862.77     481.343      -10.10    0.000    -5806.248    -3919.292
  3                        -3332.904    595.9187     -5.59     0.000    -4500.961    -2164.847
  4                        7805.529     819.2849     9.53      0.000    6199.654     9411.405
  5                        9032.809     2289.064     3.95      0.000    4546.028     13519.59

Strength                   177.3617     9.44834      18.77     0.000    158.8421     195.8814
LongShots                  52.97948     11.2732      4.70      0.000    30.88294     75.07603
Vision                     173.3927     13.10883     13.23     0.000    147.6981     199.0872
_cons                      -13400.04    807.4742     -16.60    0.000    -14982.77    -11817.31

For the categorical variables (InternationalReputation and SkillMoves) dummy variables were
generated.

Interpretation

The total number of observations is 18207.

The F statistic is 1772.09 and its p value (Prob > F) is 0.0000, which is less than 0.05, so we
reject the hypothesis that all slope coefficients in this linear regression model are zero.
The adjusted R^2 is 0.5169, which means that 51.69% of the variation in salary is explained by
InternationalReputation, SkillMoves, Strength, LongShots and Vision.
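
The summary figures in the header of the regress output follow from its sums of squares and degrees of freedom. The Python sketch below recomputes them from the rounded values printed in the table; anova_summary is a hypothetical helper name, and small differences from Stata's figures are due to that rounding.

```python
def anova_summary(ss_model, ss_residual, df_model, df_residual):
    """Return (F, R-squared, adjusted R-squared) from an ANOVA table."""
    ss_total = ss_model + ss_residual
    # F is the ratio of the model and residual mean squares.
    f_stat = (ss_model / df_model) / (ss_residual / df_residual)
    r2 = ss_model / ss_total
    # (n - 1) / (n - k - 1) equals total df over residual df.
    adj_r2 = 1 - (1 - r2) * (df_model + df_residual) / df_residual
    return f_stat, r2, adj_r2


# Values copied from the regress output above.
f_stat, r2, adj_r2 = anova_summary(4.5453e12, 4.2427e12, 11, 18195)
print(round(f_stat, 2), round(r2, 4), round(adj_r2, 4))
```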

Each effect below holds the other variables constant. For the categorical variables, level 1 is
the base category.

A change in InternationalReputation from 1 to 2 increases salary by 19552.05.

A change in InternationalReputation from 1 to 3 increases salary by 56872.

A change in InternationalReputation from 1 to 4 increases salary by 158901.5.

A change in InternationalReputation from 1 to 5 increases salary by 286925.

A change in SkillMoves from 1 to 2 decreases salary by 4862.77.

A change in SkillMoves from 1 to 3 decreases salary by 3332.904.

A change in SkillMoves from 1 to 4 increases salary by 7805.529.

A change in SkillMoves from 1 to 5 increases salary by 9032.809.

A unit increase in Strength increases salary by 177.3617.

A unit increase in LongShots increases salary by 52.97948.

A unit increase in Vision increases salary by 173.3927.

All p values are 0.000, meaning that all variables are significant in this model at the 5% level.

The intercept is -13400.04.

The multiple linear model is:

salary = -13400.04 + 19552.05*InternationalReputation2 + 56872*InternationalReputation3 +
158901.5*InternationalReputation4 + 286925*InternationalReputation5 -
4862.77*SkillMoves2 - 3332.904*SkillMoves3 + 7805.529*SkillMoves4 +
9032.809*SkillMoves5 + 177.3617*Strength + 52.97948*LongShots + 173.3927*Vision
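
To show how the fitted equation is used, the Python sketch below evaluates it for one player. The function name predict_salary and the example player are hypothetical. The coefficients are those reported above, with level 1 of each categorical variable as the base category (coefficient 0).

```python
INTERCEPT = -13400.04
# Dummy-variable coefficients; level 1 is the base category.
REPUTATION = {1: 0.0, 2: 19552.05, 3: 56872.0, 4: 158901.5, 5: 286925.0}
SKILL_MOVES = {1: 0.0, 2: -4862.77, 3: -3332.904, 4: 7805.529, 5: 9032.809}
STRENGTH, LONG_SHOTS, VISION = 177.3617, 52.97948, 173.3927


def predict_salary(reputation, skill_moves, strength, long_shots, vision):
    """Evaluate the fitted linear model for one player (salary in euros)."""
    return (INTERCEPT
            + REPUTATION[reputation]
            + SKILL_MOVES[skill_moves]
            + STRENGTH * strength
            + LONG_SHOTS * long_shots
            + VISION * vision)


# Hypothetical player: reputation 1, skill moves 2, Strength 65,
# LongShots 47, Vision 53.
print(round(predict_salary(1, 2, 65, 47, 53), 2))
```

For this hypothetical player the model predicts a salary of about 4945.55.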

TESTING ASSUMPTIONS

1) Your dependent variable must be measured at a continuous level/ scale.


We checked this assumption by tabulating and summarizing salary. The commands and output were:

. tab salary

. summarize salary

Variable Obs Mean Std. Dev. Min Max

salary 18,207 9861.85 21970.4 1000 565000


Conclusion: The variable salary is continuous.

2) You have two or more independent variables measured at continuous or categorical level.

We checked this assumption by using multiple one-way frequency tables for
InternationalReputation, SkillMoves, Strength, LongShots and Vision using the command:

. tab1 InternationalReputation SkillMoves Strength LongShots Vision

From these tables we conclude that Strength, LongShots and Vision are continuous, while
InternationalReputation and SkillMoves are categorical.

In addition, summary statistics were produced for the variables in the model.

The command and output are:


. summarize InternationalReputation SkillMoves Strength LongShots Vision salary

Variable        Obs       Mean        Std. Dev.   Min     Max

Internatio~n    18,207    1.118196    .4052307    1       5
SkillMoves      18,207    2.362992    .7558765    1       5
Strength        18,207    65.31197    12.54044    17      97
LongShots       18,207    47.10997    19.23512    3       94
Vision          18,207    53.4009     14.12822    10      94
salary          18,207    9861.85     21970.4     1000    565000

3) There needs to be a linear relationship between:
a) the dependent variable and each independent variable, and
b) the dependent variable and the independent variables collectively.

Testing assumption 3a): There needs to be a linear relationship between the dependent variable
and each independent variable.

Because salary is strongly right-skewed (it ranges from 1000 to 565000 with a mean of 9861.85), a
log transformation was applied to it so that the line of best fit could be judged more clearly in
all plots done for this assumption. The command was:

. generate logsalary = ln( salary)

Strength

It was tested by using a scatter plot of salary and Strength.

The command and output before the log transformation were:

. twoway (scatter salary Strength) (lfit salary Strength)

[Figure: scatter plot of salary (0 to 600000) against Strength (20 to 100), with fitted values]

The command and output after the log transformation were:

. twoway (scatter logsalary Strength) (lfit logsalary Strength)

[Figure: scatter plot of logsalary (6 to 14) against Strength (20 to 100), with fitted values]


Conclusion: There is linearity between Strength and salary.

LongShots

It was tested by using a scatter plot of salary and LongShots.

The command and output before the log transformation were:


. twoway (scatter salary LongShots) (lfit salary LongShots)

[Figure: scatter plot of salary (0 to 600000) against LongShots (0 to 100), with fitted values]

The command and output after the log transformation were:


. twoway (scatter logsalary LongShots) (lfit logsalary LongShots)

[Figure: scatter plot of logsalary (6 to 14) against LongShots (0 to 100), with fitted values]

Conclusion: There is linearity between LongShots and salary.

Vision

It was tested by using a scatter plot of salary and Vision.

The command and output before the log transformation were:

. twoway (scatter salary Vision) (lfit salary Vision)

[Figure: scatter plot of salary (0 to 600000) against Vision (0 to 100), with fitted values]

The command and output after the log transformation were:

. twoway (scatter logsalary Vision) (lfit logsalary Vision)

[Figure: scatter plot of logsalary (6 to 14) against Vision (0 to 100), with fitted values]


Conclusion: There is linearity between Vision and salary.

Testing assumption 3b): There should be a linear relationship between the dependent
variable and the independent variables collectively.

This assumption was tested using partial regression plots, also called added variable plots. The
avplots command produces one added variable plot for every term in the multiple linear regression
model, including each dummy variable.

The command was:

. avplots

Conclusion: There is linearity between salary and all independent variables.

4) Your data must not show multicollinearity, which occurs when you have two or more
independent variables that are highly correlated.
This assumption was tested using variance inflation factors. The command and output were:
. estat vif

Variable        VIF     1/VIF

Internatio~n
  2             1.12    0.894554
  3             1.07    0.933454
  4             1.03    0.975212
  5             1.03    0.967540
SkillMoves
  2             4.51    0.221882
  3             6.43    0.155575
  4             2.51    0.398925
  5             1.14    0.875024
Strength        1.10    0.912298
LongShots       3.67    0.272388
Vision          2.68    0.373396

Mean VIF        2.39

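Conclusion: All the variance inflation factors (VIF) lie between 1 and 10, with the largest being
6.43 for SkillMoves level 3. This indicates at most moderate correlation among the independent
variables and hence no serious multicollinearity.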
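
Each VIF is the reciprocal of the tolerance shown in the 1/VIF column. The Python sketch below rebuilds the VIF column and the mean VIF from those tolerances and checks them against a threshold of 10. The dictionary keys are hypothetical labels, and the threshold is the rule of thumb used in the conclusion above.

```python
# Tolerances (1/VIF) copied from the estat vif output above.
tolerances = {
    "InternationalReputation=2": 0.894554,
    "InternationalReputation=3": 0.933454,
    "InternationalReputation=4": 0.975212,
    "InternationalReputation=5": 0.967540,
    "SkillMoves=2": 0.221882,
    "SkillMoves=3": 0.155575,
    "SkillMoves=4": 0.398925,
    "SkillMoves=5": 0.875024,
    "Strength": 0.912298,
    "LongShots": 0.272388,
    "Vision": 0.373396,
}

vifs = {name: 1 / tol for name, tol in tolerances.items()}
mean_vif = sum(vifs.values()) / len(vifs)
flagged = [name for name, vif in vifs.items() if vif >= 10]

print(round(mean_vif, 2))   # mean VIF
print(flagged)              # terms at or above the threshold
```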
5) There should be homoscedasticity
This assumption was tested using a residual versus fitted values plot (rvfplot). The command and
output were:
. rvfplot, yline(0)

[Figure: residuals (-400000 to 400000) against fitted values (0 to 300000), with a horizontal line at 0]

Conclusion: The residuals spread out from the line at 0 rather than forming a band of constant
width, so their variance is not constant. The homoscedasticity assumption is therefore violated.

6) There should be no autocorrelation


The assumption was tested using the Durbin-Watson d statistic. Because this test applies only to
time series data, the data were first declared as a time series using the Time variable and then
the test was applied. The commands and output were:
. tsset Time
time variable: Time, 0 to 18206
delta: 1 unit

. estat dwatson

Durbin-Watson d-statistic( 12, 18207) = 1.421951

Conclusion: The Durbin-Watson statistic ranges between 0 and 4, and a value of 2 indicates no
autocorrelation. The observed value of 1.421951 is below 2, which indicates positive
autocorrelation in the residuals.
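
For reference, the d statistic is the sum of squared differences between successive residuals divided by the sum of squared residuals. The Python sketch below computes it for made-up residual sequences and uses the common approximation d ≈ 2(1 - r) to turn the reported value into an implied first-order autocorrelation r. Both function names are hypothetical.

```python
def durbin_watson(residuals):
    """Durbin-Watson d for residuals listed in time order."""
    numerator = sum((residuals[t] - residuals[t - 1]) ** 2
                    for t in range(1, len(residuals)))
    denominator = sum(e * e for e in residuals)
    return numerator / denominator


def implied_autocorrelation(d):
    """Approximate first-order autocorrelation from d, using d ~ 2(1 - r)."""
    return 1 - d / 2


# Made-up residuals: alternating signs push d above 2,
# identical residuals push it down to 0.
print(durbin_watson([1, -1, 1, -1]))
print(durbin_watson([1, 1, 1, 1]))
print(round(implied_autocorrelation(1.421951), 3))
```

Under this approximation, the reported statistic corresponds to a positive autocorrelation of roughly 0.29 between residuals of neighbouring rows.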
7) The residuals should be normally distributed.

To test this assumption, studentized residuals were generated and then plotted in a histogram
with a normal density curve superimposed. The commands and output were:
. predict stres, rstudent

. histogram stres, normal
(bin=42, start=-21.936043, width=.95808297)

[Figure: histogram of studentized residuals (-20 to 20, density 0 to .8) with normal density curve]

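Conclusion: Based on the histogram, the studentized residuals were judged to be approximately
normally distributed, although values as extreme as -21.9 and 18.3 indicate heavier tails than a
normal distribution would produce.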

8) There should be no significant outliers, high leverage points and highly influential points
Outliers
To check for outliers, a stem-and-leaf display was generated and the outliers at each end were
identified using the studentized residuals (stres). The large negative residuals are few, so all
of them are listed. The large positive residuals are many, so only 10 are listed in this
document; in Stata all of them will be output since the command for that is in the do-file.
The commands and output were:
i) Large negative studentized residuals (stres <= -5)
. list InternationalReputation SkillMoves Strength LongShots Vision salary stres if stres <= -5

        Intern~n   SkillM~s   Strength   LongSh~s   Vision   salary    stres

 23.    5          1          80         16         70       130000    -12.34081
 42.    4          1          69         13         50       77000     -5.966961
 69.    4          4          67         86         86       100000    -5.60563
 77.    4          4          58         71         93       21000     -10.7845
109.    4          2          86         56         48       57000     -7.300009
110.    5          5          86         82         79       15000     -21.93604
207.    4          4          86         79         74       55000     -8.657025
222.    4          5          60         71         84       72000     -7.453211
281.    4          4          68         66         86       67000     -7.738674
315.    4          4          55         78         78       62000     -7.867957
318.    4          1          63         11         53       60000     -7.052768
319.    4          1          70         13         65       10000     -10.61177
379.    4          4          92         90         75       25000     -10.77981
548.    4          2          72         55         68       20000     -9.823951
551.    4          3          75         77         81       11000     -10.79005
553.    4          3          78         80         82       13000     -10.71431
677.    3          4          66         71         81       1000      -5.238172

ii) Large positive studentized residuals (stres >= 9), first 10 shown

. list InternationalReputation SkillMoves Strength LongShots Vision salary stres if stres >= 9 in 1/21

        Intern~n   SkillM~s   Strength   LongSh~s   Vision   salary    stres

  1.    5          4          59         94         94       565000    18.30344
  5.    4          4          75         91         94       355000    11.10393
  6.    4          4          66         80         89       340000    10.3053
  7.    4          4          58         82         92       420000    15.72673
  8.    5          3          83         85         84       455000    10.90395
  9.    4          3          83         59         63       380000    13.90365
 12.    4          3          73         92         86       355000    11.96119
 15.    3          2          76         69         79       225000    10.23313
 19.    3          1          79         10         69       240000    11.1949
 21.    4          3          77         54         87       315000    9.366471

Conclusion: There are outliers in the model.

Leverage points

To check for high leverage points, the leverage of each observation was predicted using the following command:

. predict leverage, leverage

Then the cutoff for leverage points was calculated. The formula is (2k + 2)/n, where k is the
number of variables used in the multiple linear regression model (taken as 6 here) and n is the
total number of observations (18,207). The command and output were:
. display ((2*6)+2)/18207
.00076894

Observations with leverage greater than the cutoff of 0.00076894 were then listed. Only the first
10 are shown in this document; all of them will appear in Stata since the command is in the
do-file. The command and output were:
. list InternationalReputation SkillMoves Strength LongShots Vision salary leverage if leverage > .00076894 in
> 1/10

        Intern~n   SkillM~s   Strength   LongSh~s   Vision   salary    leverage

  1.    5          4          59         94         94       565000    .1726412
  2.    5          5          79         93         82       405000    .1719912
  3.    5          5          49         82         87       290000    .1720172
  4.    4          1          64         12         68       260000    .02072
  5.    4          4          75         91         94       355000    .0202646
  6.    4          4          66         80         89       340000    .0202131
  7.    4          4          58         82         92       420000    .0202777
  8.    5          3          83         85         84       455000    .172036
  9.    4          3          83         59         63       380000    .0201346
 10.    3          1          78         12         70       94000     .0040192

Conclusion: There are leverage points in the model.

Influential points

To check for influential points, the DFITS (difference in fits) statistic was predicted for each
observation using the following command:

. predict dfits, dfits

Then the cutoff for influential points was calculated. The formula is 2*sqrt(k/n), where k is the
number of variables used in the multiple linear regression model (taken as 6 here) and n is the
total number of observations (18,207). The command and output were:

. display 2*sqrt(6/18207)
.03630667
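
Both cutoffs used in this section can be reproduced with the short Python sketch below. The function names are hypothetical, and k = 6 and n = 18207 are the values used in the display commands above.

```python
import math


def leverage_cutoff(k, n):
    """Rule-of-thumb cutoff (2k + 2)/n for leverage values."""
    return (2 * k + 2) / n


def dfits_cutoff(k, n):
    """Rule-of-thumb cutoff 2*sqrt(k/n) for DFITS."""
    return 2 * math.sqrt(k / n)


k, n = 6, 18207
print(round(leverage_cutoff(k, n), 8))
print(round(dfits_cutoff(k, n), 8))
```

Both cutoffs grow with k, so counting each dummy variable as a separate term would raise them and flag fewer observations. Since DFITS can be negative, it is usual to compare its absolute value with the cutoff.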

Observations with DFITS greater than the cutoff of 0.03630667 were then listed. Only the first 10
are shown in this document; all of them will appear in Stata since the command is in the
do-file. The command and output were:
. list InternationalReputation SkillMoves Strength LongShots Vision salary dfits if dfits>.03630667 in 1/11

        Intern~n   SkillM~s   Strength   LongSh~s   Vision   salary    dfits

  1.    5          4          59         94         94       565000    8.360995
  2.    5          5          79         93         82       405000    2.931811
  4.    4          1          64         12         68       260000    .8741187
  5.    4          4          75         91         94       355000    1.596951
  6.    4          4          66         80         89       340000    1.480169
  7.    4          4          58         82         92       420000    2.26254
  8.    5          3          83         85         84       455000    4.970357
  9.    4          3          83         59         63       380000    1.993044
 10.    3          1          78         12         70       94000     .099716
 11.    4          4          84         84         77       205000    .1810446

Conclusion: There are influential points in the model.

Treatment of unusual points like outliers.

In order to treat unusual points like outliers, a robust regression was used.

Robust regression is an iterative procedure that seeks to identify outliers and minimize their
impact on the coefficient estimates.
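
The idea of down-weighting outliers can be illustrated with a Huber-type weight function, one common choice in robust regression. The Python sketch below is a generic illustration and not a description of the exact procedure behind the rreg command. The function name and the tuning constant c = 1.345 are assumptions made for this example.

```python
def huber_weight(scaled_residual, c=1.345):
    """Huber weight: 1 inside [-c, c], shrinking as c/|u| outside it.

    scaled_residual is a residual divided by a robust scale estimate.
    c is a tuning constant (1.345 is assumed here for illustration).
    """
    u = abs(scaled_residual)
    return 1.0 if u <= c else c / u


# Small residuals keep full weight; large ones are down-weighted.
for u in [0.5, 2.69, -13.45]:
    print(u, round(huber_weight(u), 3))
```

In an iteratively reweighted fit, weights of this kind are recomputed from the residuals after each pass and the regression is re-estimated until the coefficients stabilise.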

In this document, the robust regression output is shown without the iteration log. The full
output will be shown in Stata since the command is in the do-file.

The command and output were:

. rreg salary i.InternationalReputation i.SkillMoves Strength LongShots Vision


Robust regression                                 Number of obs  =  18,203
                                                  F(11, 18191)   =  22023.11
                                                  Prob > F       =  0.0000

salary                     Coef.        Std. Err.    t         P>|t|    [95% Conf. Interval]

InternationalReputation
  2                        6796.234     94.90796     71.61     0.000    6610.205     6982.262
  3                        48224.09     170.1606     283.40    0.000    47890.56     48557.62
  4                        156367.1     436.6914     358.07    0.000    155511.1     157223.1
  5                        -3.67e-10    2221.061     -0.00     1.000    -4353.488    4353.488

SkillMoves
  2                        -1504.482    96.93651     -15.52    0.000    -1694.486    -1314.477
  3                        -229.601     120.0113     -1.91     0.056    -464.8344    5.632478
  4                        3018.416     165.0164     18.29     0.000    2694.968     3341.864
  5                        2914.861     467.8699     6.23      0.000    1997.792     3831.93

Strength                   58.55993     1.902957     30.77     0.000    54.82995     62.2899
LongShots                  23.77543     2.27064      10.47     0.000    19.32476     28.2261
Vision                     33.79158     2.640441     12.80     0.000    28.61607     38.96709
_cons                      -2412.282    162.6184     -14.83    0.000    -2731.03     -2093.535

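In this model the coefficients for InternationalReputation level 5 (p = 1.000) and SkillMoves
level 3 (p = 0.056) are not significant at the 5% level. One of the categorical variables,
SkillMoves, was therefore dropped and the robust regression was run again as the final model.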

The command and output were:

. rreg salary i.InternationalReputation Strength LongShots Vision

Robust regression                                 Number of obs  =  18,202
                                                  F(7, 18194)    =  34959.35
                                                  Prob > F       =  0.0000

salary                     Coef.        Std. Err.    t         P>|t|    [95% Conf. Interval]

InternationalReputation
  2                        11701.48     95.58984     122.41    0.000    11514.11     11888.84
  3                        50277.6      171.8411     292.58    0.000    49940.78     50614.43
  4                        156413.4     445.3501     351.21    0.000    155540.5     157286.4
  5                        284990.4     3159.159     90.21     0.000    278798.2     291182.7

Strength                   47.29808     1.90894      24.78     0.000    43.55638     51.03979
LongShots                  16.49655     1.869364     8.82      0.000    12.83242     20.16068
Vision                     51.34202     2.598845     19.76     0.000    46.24804     56.436
_cons                      -3127.501    163.6803     -19.11    0.000    -3448.329    -2806.672

Interpretation

There are 18,202 observations because the robust regression dropped 5 of the 18,207 observations
as unusual points.
The F statistic is 34959.35 and its p value (Prob > F) is 0.0000, which is less than 0.05, so we
reject the hypothesis that all slope coefficients in this robust regression model are zero.

A change in InternationalReputation from 1 to 2 increases salary by 11701.48.

A change in InternationalReputation from 1 to 3 increases salary by 50277.6.

A change in InternationalReputation from 1 to 4 increases salary by 156413.4.

A change in InternationalReputation from 1 to 5 increases salary by 284990.4.

A unit increase in Strength increases salary by 47.29808.

A unit increase in LongShots increases salary by 16.49655.

A unit increase in Vision increases salary by 51.34202.

All p values are 0.000 meaning that all variables are significant in this model.

The intercept is -3127.501.

The robust model is:

salary = -3127.501 + 11701.48*InternationalReputation2 + 50277.6*InternationalReputation3 +
156413.4*InternationalReputation4 + 284990.4*InternationalReputation5 +
47.29808*Strength + 16.49655*LongShots + 51.34202*Vision

END
