Stern's MBA 1 Students Expect To Make The Big Bucks After Graduation!
At first glance, several things stand out in the data. None of the 40 students is unsure about which industry to enter after graduation, and none plans to work in an industry other than finance, consulting, or marketing. Moreover, the average expected salary of $84,250 is much higher than the actual salary of graduating MBA students in 1996, $70,000 (footnote). This discrepancy suggests that MBA 1 students are rather optimistic.
Salaries are often right-skewed. Let's check the distributions of both expected and past salaries.
[Figure: histograms of expected salary and past salary (frequency versus salary).]
Both salary distributions are right-skewed, so it may help to take logs of the salaries.
[Figure: histograms of log expected salary and log past salary.]
[Figure: plots of expected salary and past salary, and of expected salary by plan fin., plan cons., and plan mark.]
Only 2 of the 40 students plan to work in marketing, so this variable is unlikely to be useful. Students who plan to work in finance are coded 1. Interestingly, students who plan to work in finance and those who do not have the same median expected salary ($80,000). The finance and consulting indicators also appear to be negatively correlated.
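To illustrate why indicators for mutually exclusive plans are negatively correlated, and why one of them becomes redundant once a constant is in the model, here is a minimal sketch. The split of 20 finance and 18 consulting students is a hypothetical assumption; only the totals (40 students, 2 in marketing) come from the data described above.

```python
import math

# Hypothetical split: only n = 40 and the 2 marketing students come from the report.
n_fin, n_cons, n_mark = 20, 18, 2

plan_fin = [1] * n_fin + [0] * n_cons + [0] * n_mark
plan_cons = [0] * n_fin + [1] * n_cons + [0] * n_mark
plan_mark = [0] * n_fin + [0] * n_cons + [1] * n_mark

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

r_fin_cons = pearson(plan_fin, plan_cons)
print(f"corr(plan fin., plan cons.) = {r_fin_cons:.3f}")

# Every student has exactly one plan, so the three indicators sum to 1:
# any one of them is an exact linear function of the other two.
always_one = all(f + c + m == 1 for f, c, m in zip(plan_fin, plan_cons, plan_mark))
print("indicators always sum to 1:", always_one)
```

Under this assumed split the correlation is close to -0.9. The exact value depends on the real counts, but it is strongly negative whenever nearly everyone chooses one of the two plans.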
Regression Analysis
* plan mark. is highly correlated with other X variables
* plan mark. has been removed from the equation
* plan other has all values = 0
* plan other has been removed from the equation
* unsure has all values = 0
* unsure has been removed from the equation

The regression equation is
Expected salary = -44885 + 3352 age + 647 num. of yrs. of exp.
                  - 2591 plan fin. + 4025 plan cons. + 3171 500 comp?
                  + 432 Hrs. of work + 0.125 past salary

Predictor        Coef      StDev        T       P    VIF
Constant       -44885      18068    -2.48   0.018
age            3351.8      849.5     3.95   0.000    3.6
num. of         647.5      514.7     1.26   0.217    1.3
plan fin        -2591       5082    -0.51   0.614    4.6
plan con         4025       5344     0.75   0.457    4.4
500 comp         3171       2887     1.10   0.280    1.7
Hrs. of         431.6      122.2     3.53   0.001    1.9
past sal      0.12506    0.09684     1.29   0.206    3.2

S = 6999    R-Sq = 86.4%    R-Sq(adj) = 83.5%

Analysis of Variance
Source        DF            SS            MS       F       P
Regression     7    9988126411    1426875202   29.13   0.000
Error         32    1567373589      48980425
Total         39   11555500000

Source      DF       Seq SS
age          1   8185071089
num. of      1     75586010
plan fin     1    531542752
plan con     1    158051549
500 comp     1     84132640
Hrs. of      1    872058157
past sal     1     81684215

Unusual Observations
Obs    age   Expected      Fit   StDev Fit   Residual   St Resid
 16   25.0      70000    86541        3156     -16541     -2.65R
 34   27.0     120000   101312        3493      18688      3.08R
R denotes an observation with a large standardized residual

Durbin-Watson statistic = 1.64
Minitab automatically drops 3 variables: planning to work in marketing, planning to work in another industry, and being unsure about the industry. I would also remove the consulting variable, because it is highly negatively correlated with the finance variable. The overall regression is statistically significant. However, several variables have P-values above .05.
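As a check on how the summary statistics follow from the Analysis of Variance table, the short sketch below recomputes R-Sq, R-Sq(adj), F, and S from the sums of squares and degrees of freedom printed above. The variable names are my own.

```python
# Sums of squares and degrees of freedom copied from the ANOVA table above.
ss_regression = 9_988_126_411
ss_error = 1_567_373_589
df_regression = 7
df_error = 32

ss_total = ss_regression + ss_error
df_total = df_regression + df_error

# Share of the total variation explained by the regression.
r_sq = ss_regression / ss_total
# The adjusted version penalizes the number of predictors.
r_sq_adj = 1 - (ss_error / df_error) / (ss_total / df_total)
# Overall F statistic: regression mean square over error mean square.
f_stat = (ss_regression / df_regression) / (ss_error / df_error)
# Residual standard error.
s = (ss_error / df_error) ** 0.5

print(f"R-Sq = {100 * r_sq:.1f}%  R-Sq(adj) = {100 * r_sq_adj:.1f}%")
print(f"F = {f_stat:.2f}  S = {s:.0f}")
```

The results agree with the printed values (86.4%, 83.5%, 29.13, and 6999).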
[Figure: residual plots (response is Expected): standardized residuals versus observation order, standardized residuals versus fitted values, normal probability plot of the residuals, and histogram of the residuals.]
The distribution of the residuals looks approximately normal. However, a couple of outliers are noticeable.
Now let's try a regression with logged past and expected salaries, keeping the same variables.
Regression Analysis
* plan mark. is highly correlated with other X variables
* plan mark. has been removed from the equation
* plan other has all values = 0
* plan other has been removed from the equation
* unsure has all values = 0
* unsure has been removed from the equation

The regression equation is
log expected = 4.26 + 0.0178 age + 0.00289 num. of yrs. of exp.
               - 0.0137 plan fin. + 0.0190 plan cons. + 0.0125 500 comp?
               + 0.00202 Hrs. of work + 0.000001 past salary

Predictor          Coef        StDev        T       P    VIF
Constant        4.26054      0.08967    47.52   0.000
age            0.017759     0.004216     4.21   0.000    3.6
num. of        0.002887     0.002554     1.13   0.267    1.3
plan fin       -0.01366      0.02522    -0.54   0.592    4.6
plan con        0.01899      0.02652     0.72   0.479    4.4
500 comp        0.01251      0.01433     0.87   0.389    1.7
Hrs. of       0.0020215    0.0006063     3.33   0.002    1.9
past sal     0.00000057   0.00000048     1.18   0.246    3.2

S = 0.03473    R-Sq = 86.2%    R-Sq(adj) = 83.2%

Analysis of Variance
Source        DF         SS         MS       F       P
Regression     7   0.241162   0.034452   28.56   0.000
Error         32   0.038600   0.001206
Total         39   0.279762

Source      DF     Seq SS
age          1   0.201723
num. of      1   0.001471
plan fin     1   0.012164
plan con     1   0.003731
500 comp     1   0.001379
Hrs. of      1   0.019007
past sal     1   0.001687

Unusual Observations
Obs    age   log expe       Fit   StDev Fit   Residual   St Resid
 16   25.0    4.84510   4.92668     0.01566   -0.08158     -2.63R
 34   27.0    5.07918   4.99767     0.01733    0.08151      2.71R
R denotes an observation with a large standardized residual

Durbin-Watson statistic = 2.01
Logging the salaries does not change the regression results materially: R-Sq is 86.2% instead of 86.4%, and the same predictors are significant. Thus, I will keep the unlogged salaries.
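For readers comparing the two scales, the sketch below shows how a coefficient in the logged model can be read as a percent change in expected salary. It assumes base-10 logs, which is consistent with the logged values in the output (for example, 4.84510 for an expected salary of 70000). The function name is my own.

```python
import math

# Coefficients copied from the logged regression output above (base-10 logs assumed).
coef_age = 0.017759
coef_hours = 0.0020215

def pct_change_per_unit(log10_coef):
    """Percent change in the response for a one-unit increase in the predictor."""
    return (10 ** log10_coef - 1) * 100

# Check of the log base: observation 16 has an expected salary of 70000.
log_70000 = math.log10(70000)
print(f"log10(70000) = {log_70000:.5f}")

pct_age = pct_change_per_unit(coef_age)
pct_hours = pct_change_per_unit(coef_hours)
print(f"one more year of age:  about {pct_age:.1f}% higher expected salary")
print(f"one more hour of work: about {pct_hours:.2f}% higher expected salary")
```

On this reading, each additional year of age goes with roughly 4% higher expected salary, holding the other predictors fixed.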
Let's see how the regression looks without the plan consulting variable.
Regression Analysis
The regression equation is
Expected salary = -46346 + 3477 age + 574 num. of yrs. of exp.
                  - 5884 plan fin. + 2553 500 comp? + 466 Hrs. of work
                  + 0.113 past salary

Predictor        Coef      StDev        T       P    VIF
Constant       -46346      17846    -2.60   0.014
age            3476.8      827.6     4.20   0.000    3.4
num. of         573.8      501.9     1.14   0.261    1.2
plan fin        -5884       2573    -2.29   0.029    1.2
500 comp         2553       2750     0.93   0.360    1.6
Hrs. of         465.9      112.6     4.14   0.000    1.6
past sal      0.11303    0.09489     1.19   0.242    3.1

S = 6953    R-Sq = 86.2%    R-Sq(adj) = 83.7%

Analysis of Variance
Source        DF            SS            MS       F       P
Regression     6    9960345863    1660057644   34.34   0.000
Error         33    1595154137      48338004
Total         39   11555500000

Source      DF       Seq SS
age          1   8185071089
num. of      1     75586010
plan fin     1    531542752
500 comp     1     22521046
Hrs. of      1   1077031091
past sal     1     68593876

Unusual Observations
Obs    age   Expected      Fit   StDev Fit   Residual   St Resid
 10   33.0     110000   111467        5049      -1467     -0.31X
 16   25.0      70000    86410        3131     -16410     -2.64R
 34   27.0     120000   101124        3461      18876      3.13R
R denotes an observation with a large standardized residual
X denotes an observation whose X value gives it large influence.

Durbin-Watson statistic = 1.67
[Figure: residual plots (response is Expected): histogram of the residuals, standardized residuals versus fitted values, and normal probability plot of the residuals; observations 16 and 34 are labeled.]
Note that the outliers are still present. Now let's run a best subsets regression to find out which variables are best for our model.
Response is Expected
Candidate predictors: age, num. of, plan fin, 500 comp, Hrs. of, past sal

Vars   R-Sq   R-Sq(adj)    C-p         S
   1   70.8        70.1   33.7    9417.8
   1   57.8        56.7   64.8     11325
   2   81.2        80.2   10.9    7660.7
   2   75.5        74.1   24.6    8752.2
   3   84.7        83.4    4.6    7012.4
   3   83.2        81.8    8.1    7334.3
   4   85.3        83.6    5.2    6971.0
   4   85.2        83.5    5.5    7000.3
   5   85.8        83.8    5.9    6938.4
   5   85.6        83.5    6.3    6983.8
   6   86.2        83.7    7.0    6952.6
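The C-p column can be reproduced from the error sums of squares of the individual regressions in this report. The sketch below uses the usual definition of Mallows' C-p, with the six-predictor regression as the full model. Matching each table row to a set of predictors is my own inference from the agreement of S and R-Sq with the regression outputs.

```python
n = 40  # students in the full data set

# Error mean square of the regression on all six candidate predictors.
mse_full = 48_338_004

# Error sums of squares from the regression outputs, keyed by number of predictors.
sse = {
    3: 1_770_251_214,  # age, plan fin., Hrs. of work
    4: 1_715_135_581,  # age, plan fin., past salary, Hrs. of work
    6: 1_595_154_137,  # all six candidates
}

def mallows_cp(sse_p, p):
    """C-p for a model with p predictors plus a constant."""
    return sse_p / mse_full - n + 2 * (p + 1)

cp = {p: round(mallows_cp(sse_p, p), 1) for p, sse_p in sse.items()}
for p, value in cp.items():
    print(f"{p} predictors: C-p = {value}")
```

These agree with the table rows whose C-p values are 4.6, 5.5, and 7.0. A model with little bias has C-p close to its number of estimated coefficients, p + 1.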
My choice is between the two possibilities in bold. One reason is that they have a small S and a relatively high R-Sq. Another reason is that C-p should be approximately p + 1 = 6. Thus, I picked the one that has a C-p of 5.5 and an S of 7000.3. Here is the new regression:
Regression Analysis
The regression equation is
Expected salary = -57732 + 4034 age - 6828 plan fin. + 0.0975 past salary
                  + 468 Hrs. of work

Predictor        Coef      StDev        T       P    VIF
Constant       -57732      16415    -3.52   0.001
age            4034.1      753.2     5.36   0.000    2.8
plan fin        -6828       2392    -2.85   0.007    1.0
past sal      0.09755    0.09198     1.06   0.296    2.9
Hrs. of         468.5      111.7     4.19   0.000    1.6

S = 7000    R-Sq = 85.2%    R-Sq(adj) = 83.5%

Analysis of Variance
Source        DF            SS            MS       F       P
Regression     4    9840364419    2460091105   50.20   0.000
Error         35    1715135581      49003874
Total         39   11555500000

Source      DF       Seq SS
age          1   8185071089
plan fin     1    536172676
past sal     1    257832450
Hrs. of      1    861288204

Unusual Observations
Obs    age   Expected      Fit   StDev Fit   Residual   St Resid
  8   29.0     125000   111564        2946      13436      2.12R
 10   33.0     110000   115013        4497      -5013     -0.93X
 16   25.0      70000    87329        2846     -17329     -2.71R
 34   27.0     120000   101545        3129      18455      2.95R
 39   30.0     100000   107987        4449      -7987     -1.48X
 40   31.0     110000   114851        4466      -4851     -0.90X
R denotes an observation with a large standardized residual
X denotes an observation whose X value gives it large influence.

Durbin-Watson statistic = 1.57
Past salary still has a P-value above .05, so I remove this variable from the regression.
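Dropping a single predictor can also be checked with a partial F test, which compares the error sums of squares of the regressions with and without past salary. The sketch below takes both values from the Analysis of Variance tables in this report. For one dropped predictor, the square root of the partial F equals the absolute t statistic of that predictor.

```python
import math

# Error rows of the ANOVA tables with and without past salary.
sse_full, df_full = 1_715_135_581, 35        # age, plan fin., past salary, Hrs. of work
sse_reduced, df_reduced = 1_770_251_214, 36  # age, plan fin., Hrs. of work

# Increase in unexplained variation per dropped predictor, relative to the
# error mean square of the larger model.
f_partial = ((sse_reduced - sse_full) / (df_reduced - df_full)) / (sse_full / df_full)
t_equiv = math.sqrt(f_partial)

print(f"partial F = {f_partial:.2f}, |t| = {t_equiv:.2f}")
```

The result matches the printed T of 1.06 for past salary, so removing it costs very little explanatory power.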
Regression Analysis
The regression equation is
Expected salary = -69455 + 4583 age - 6843 plan fin. + 501 Hrs. of work

Predictor        Coef      StDev        T       P
Constant       -69455      12157    -5.71   0.000
age            4583.5      547.7     8.37   0.000
plan fin        -6843       2396    -2.86   0.007
Hrs. of         500.9      107.7     4.65   0.000

S = 7012    R-Sq = 84.7%    R-Sq(adj) = 83.4%

Analysis of Variance
Source        DF            SS            MS       F       P
Regression     3    9785248786    3261749595   66.33   0.000
Error         36    1770251214      49173645
Total         39   11555500000

Source      DF       Seq SS
age          1   8185071089
plan fin     1    536172676
Hrs. of      1   1064005021

Durbin-Watson statistic = 1.64: no evidence of strong autocorrelation. This agrees with the residuals versus order plot.

Unusual Observations
Obs    age   Expected      Fit   StDev Fit   Residual   St Resid
  8   29.0     125000   111047        2910      13953      2.19R
 10   33.0     110000   116859        4154      -6859     -1.21X
 16   25.0      70000    87704        2829     -17704     -2.76R
 34   27.0     120000   101880        3118      18120      2.88R
R denotes an observation with a large standardized residual
X denotes an observation whose X value gives it large influence.
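For reference, the Durbin-Watson statistic reported above is the sum of squared differences between successive residuals divided by the sum of squared residuals. Values near 2 suggest little first-order autocorrelation. The sketch below uses two short hypothetical residual sequences, not the residuals of this regression, to show the two extremes.

```python
def durbin_watson(residuals):
    """Sum of squared successive differences divided by the sum of squares."""
    num = sum((residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals)))
    den = sum(e ** 2 for e in residuals)
    return num / den

# Hypothetical residual sequences, not taken from the report.
alternating = [1, -1, 1, -1]   # neighbors have opposite signs
drifting = [-2, -1, 0, 1, 2]   # neighbors are similar

print(durbin_watson(alternating))  # 3.0
print(durbin_watson(drifting))     # 0.4
```

Values well above 2 point to negative autocorrelation and values well below 2 to positive autocorrelation.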
[Figure: residual plots (response is Expected): standardized residuals versus observation order, standardized residuals versus fitted values, normal probability plot of the residuals, and histogram of the residuals; observations 16 and 34 are labeled.]
Now we have a model in which every predictor is statistically significant (all P-values are below .05). However, two outliers are still visible in the residual plots. We can try removing these 2 outliers (observations 16 and 34).
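To see where the standardized residuals of observations 16 and 34 come from, the sketch below rebuilds them from the Unusual Observations table of the three-predictor regression. It assumes the usual relation leverage = (StDev Fit / S) squared. Because the printed inputs are rounded, the results match the printed values only to about two decimals.

```python
import math

s = 7012  # residual standard error of the three-predictor regression

# obs: (expected salary, fit, StDev Fit), from the Unusual Observations table.
rows = {16: (70000, 87704, 2829), 34: (120000, 101880, 3118)}

def standardized_residual(observed, fit, stdev_fit, s):
    """Residual divided by its estimated standard deviation."""
    leverage = (stdev_fit / s) ** 2  # assumed relation between StDev Fit and leverage
    return (observed - fit) / (s * math.sqrt(1 - leverage))

st_resid = {obs: standardized_residual(*vals, s) for obs, vals in rows.items()}
for obs, value in st_resid.items():
    print(f"observation {obs}: standardized residual about {value:.2f}")
```

Both values exceed 2 in absolute value, which is why Minitab flags them with R.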
Regression Analysis
The regression equation is
Expected salary = -65739 + 4508 age - 6676 plan fin. + 475 Hrs. of work

Predictor        Coef      StDev        T       P    VIF
Constant       -65739      10088    -6.52   0.000
age            4508.2      474.3     9.51   0.000    1.6
plan fin        -6676       2071    -3.22   0.003    1.0
Hrs. of         474.7      100.6     4.72   0.000    1.6

S = 5742    R-Sq = 88.9%    R-Sq(adj) = 87.9%

Analysis of Variance
Source        DF            SS            MS       F       P
Regression     3    8941383570    2980461190   90.41   0.000
Error         34    1120826956      32965499
Total         37   10062210526

Source      DF       Seq SS
age          1   7937259218
plan fin     1    270764351
Hrs. of      1    733360001

Unusual Observations
Obs    age   Expected      Fit   StDev Fit   Residual   St Resid
  8   29.0     125000   110091        2846      14909      2.99R
 10   33.0     110000   116257        3415      -6257     -1.36X
 13   24.0      70000    80430        2798     -10430     -2.08R
R denotes an observation with a large standardized residual
X denotes an observation whose X value gives it large influence.

Durbin-Watson statistic = 1.65
[Figure: residual plots (response is Expected): standardized residuals versus the order of the data, standardized residuals versus fitted values, and normal probability plot of the residuals.]
Regression Analysis
The regression equation is
Expected salary = -63050 + 4653 age - 4558 plan fin. + 351 Hrs. of work

Predictor        Coef      StDev        T       P    VIF
Constant       -63050       8827    -7.14   0.000
age            4652.8      415.5    11.20   0.000    1.6
plan fin        -4558       1908    -2.39   0.023    1.1
Hrs. of        351.10      94.81     3.70   0.001    1.7

S = 5004    R-Sq = 90.1%    R-Sq(adj) = 89.2%

Analysis of Variance
Source        DF            SS            MS       F       P
Regression     3    7482915093    2494305031   99.63   0.000
Error         33     826165988      25035333
Total         36    8309081081

Source      DF       Seq SS
age          1   7066013767
plan fin     1     73551726
Hrs. of      1    343349601

Unusual Observations
Obs    age   Expected      Fit   StDev Fit   Residual   St Resid
  9   33.0     110000   115070        2996      -5070     -1.27X
 34   27.0      70000    80840        1093     -10840     -2.22R
R denotes an observation with a large standardized residual
X denotes an observation whose X value gives it large influence.

Durbin-Watson statistic = 2.00
[Figure: Residuals Versus the Fitted Values (response is Expected): standardized residuals versus fitted values.]
Heteroscedasticity means non-constant variance. The plot of the residuals versus the fitted values suggests that the variance is not constant. To explore this further, we would have to run a Levene's test. Hopefully, logging the variables will take care of this. Let's run a regression with logged expected salaries.
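A Levene's test was not run in this report. As a rough sketch of what it would involve, the code below computes the Levene W statistic using absolute deviations from each group median (the Brown-Forsythe variant) for residuals split into a low-fitted-value and a high-fitted-value group. The residuals are hypothetical, and the function name is my own.

```python
from statistics import mean, median

def levene_w(*groups):
    """Levene's W statistic using absolute deviations from each group median."""
    z = [[abs(y - median(g)) for y in g] for g in groups]
    n_total = sum(len(g) for g in z)
    k = len(z)
    z_bar = sum(sum(g) for g in z) / n_total
    # Spread of the group mean deviations around the overall mean deviation.
    between = sum(len(g) * (mean(g) - z_bar) ** 2 for g in z)
    # Spread of the deviations within each group.
    within = sum((v - mean(g)) ** 2 for g in z for v in g)
    return (n_total - k) / (k - 1) * between / within

# Hypothetical standardized residuals, split by low versus high fitted value.
low_fitted = [-0.3, 0.2, -0.1, 0.4, -0.2, 0.1]
high_fitted = [-1.6, 1.3, -0.9, 1.7, -1.2, 0.8]

w = levene_w(low_fitted, high_fitted)
print(f"Levene W = {w:.2f}")
```

W is compared with an F distribution with k - 1 and N - k degrees of freedom; a large W indicates unequal spread between the groups. In practice a statistics package would supply the p-value, for example `scipy.stats.levene` with `center='median'`.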
[Figure: residual plots (response is Expected): standardized residuals versus observation order, normal probability plot of the residuals, and histogram of the residuals.]
St Resid    Leverage    Cook's D
-0.15638    0.058155    0.000378
-1.58437    0.284480    0.249509
-0.91533    0.063633    0.014234
 0.76627    0.063603    0.009971
 0.23355    0.128017    0.002002
-1.44970    0.231559    0.158323
-0.75144    0.054100    0.008074
-0.39198    0.079616    0.003323
 1.54343    0.100642    0.066644
-1.68307    0.092249    0.071967
 0.94590    0.060364    0.014370
-0.01353    0.076361    0.000004
 0.35320    0.063603    0.002118
-0.30542    0.095103    0.002451
 0.59163    0.084579    0.008085
 1.66542    0.197385    0.170528
 0.21357    0.105834    0.001350
-1.20202    0.170243    0.074111
 0.04569    0.064749    0.000036
 0.42040    0.043466    0.002008
 0.37278    0.145178    0.005900
 1.33865    0.097529    0.048414
-0.91533    0.063633    0.014234
-2.22013    0.047736    0.061771
-0.15638    0.058155    0.000378
-0.38165    0.091066    0.003648
 0.38048    0.134487    0.005624
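The three columns of diagnostics above carry no headers in the source. They appear to be standardized residuals, leverages, and Cook's distances, because the third value in each row follows from the first two through the standard identity D = (r^2 / p) * h / (1 - h), with p = 4 coefficients (a constant plus three predictors). The sketch below checks this for three rows.

```python
P_COEFFICIENTS = 4  # constant + age + plan fin. + Hrs. of work (assumed model)

def cooks_distance(st_resid, leverage, p=P_COEFFICIENTS):
    """Cook's distance from a standardized residual and a leverage."""
    return (st_resid ** 2 / p) * leverage / (1 - leverage)

# (standardized residual, leverage, printed third column), copied from the rows above.
rows = [
    (-1.58437, 0.284480, 0.249509),
    (1.66542, 0.197385, 0.170528),
    (-2.22013, 0.047736, 0.061771),
]

for r, h, printed in rows:
    print(f"computed {cooks_distance(r, h):.6f}   printed {printed:.6f}")
```

The row with the largest value (about 0.25) combines a fairly large residual with the highest leverage in the list, so it is the most influential of the observations shown.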
Regression Analysis
The regression equation is
log exp = 4.18 + 0.0231 age - 0.0256 plan fin. + 0.00186 Hrs. of work

Predictor          Coef        StDev        T       P    VIF
Constant        4.18204      0.04868    85.92   0.000
age            0.023056     0.002291    10.06   0.000    1.6
plan fin       -0.02560      0.01052    -2.43   0.021    1.1
Hrs. of       0.0018643    0.0005228     3.57   0.001    1.7

S = 0.02759    R-Sq = 88.3%    R-Sq(adj) = 87.2%

Analysis of Variance
Source        DF         SS         MS       F       P
Regression     3   0.188991   0.062997   82.74   0.000
Error         33   0.025126   0.000761
Total         36   0.214116

Source      DF     Seq SS
age          1   0.176882
plan fin     1   0.002427
Hrs. of      1   0.009681

Unusual Observations
Obs    age   log exp       Fit   StDev Fit   Residual   St Resid
  8   25.0   4.90309   4.85166     0.01077    0.05143      2.02R
  9   33.0   5.04139   5.07340     0.01652   -0.03201     -1.45X
 20   25.0   4.77815   4.83539     0.00838   -0.05724     -2.18R
 34   27.0   4.84510   4.90015     0.00603   -0.05505     -2.04R
R denotes an observation with a large standardized residual
X denotes an observation whose X value gives it large influence.

Durbin-Watson statistic = 2.34
[Figure: Residuals Versus the Fitted Values (response is logexp): standardized residuals versus fitted values.]
It appears that there is still non-constant variance. The only reasonable thing to do now is weighted least squares.
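Weighted least squares was not carried out in this report. As a minimal sketch of the idea, the code below fits a straight line with one predictor, giving each observation a weight equal to the inverse of its assumed error variance. The data, the variances, and the function name are hypothetical; the actual model has three predictors and would need a matrix solution or a statistics package.

```python
def wls_line(x, y, w):
    """Weighted least squares fit of y = a + b*x; returns (a, b)."""
    sw = sum(w)
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxy = sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    b = sxy / sxx
    return y_bar - b * x_bar, b

# Hypothetical data: age, expected salary, and an assumed relative error variance.
age = [24, 26, 28, 30, 32]
salary = [72000, 80000, 91000, 97000, 112000]
variance = [1.0, 1.0, 2.0, 4.0, 8.0]  # assumed to grow with age
weights = [1 / v for v in variance]   # noisier points count less

a, b = wls_line(age, salary, weights)
print(f"intercept = {a:.0f}, slope = {b:.0f}")
```

Observations with a larger assumed variance pull the fitted line less, which is how the method compensates for non-constant variance. The hard part in practice is choosing the weights, for example from the spread of the residuals within groups of similar fitted values.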
[Figure: residual plots (response is logexp): standardized residuals versus observation order, normal probability plot of the residuals, and histogram of the residuals.]