
Sterns MBA 1 students expect to make the big bucks after graduation!

Data File: salary.doc


The Stern School of Business is a very reputable school. Students generally attend
it not only to broaden their knowledge of business but also to increase their
salary. I am interested in finding out what leads MBA 1 students to believe that they are
going to earn a certain salary after graduation. In other words, what factors affect the
expected salary of students? As I could not find enough meaningful data on the topic, I
decided to conduct my own survey. I was actually very surprised to see how cooperative
students were in filling out my survey when I gave them candy in return!
Here are some descriptive statistics for all 40 MBA 1 students surveyed. Variables include
expected salary after graduation (Expected), age of the person (age), the number of years
of working experience in the industry chosen after graduation (num.of), whether the person
plans to work in finance (planfin), in consulting (plancon), in marketing (planmar) or in
other industries (planoth), whether the person is unsure about the industry (unsure),
whether the person plans to work for a Fortune 500 company (500comp), the number of hours
he/she plans to work per week (Hrs.of), and his/her past salary (pastsal).
Descriptive Statistics

Variable     N      Mean    Median    TrMean     StDev   SE Mean
Expected    40     84250     80000     83333     17213      2722
age         40    26.825    26.000    26.694     2.490     0.394
num.of      40     1.725     0.000     1.500     2.449     0.387
planfin     40    0.6750    1.0000    0.6944    0.4743    0.0750
plancon     40    0.2500    0.0000    0.2222    0.4385    0.0693
planmar     40    0.0750    0.0000    0.0278    0.2667    0.0422
planoth     40   0.00000   0.00000   0.00000   0.00000   0.00000
unsure      40   0.00000   0.00000   0.00000   0.00000   0.00000
500comp     40    0.5000    0.5000    0.5000    0.5064    0.0801
Hrs.of      40     70.62     70.00     70.42     12.57      1.99
pastsal     40     54250     48500     51667     20664      3267

Variable       Min       Max        Q1        Q3
Expected     60000    125000     70000     97500
age         23.000    33.000    25.000    29.000
num.of       0.000     8.000     0.000     3.000
planfin     0.0000    1.0000    0.0000    1.0000
plancon     0.0000    1.0000    0.0000    0.7500
planmar     0.0000    1.0000    0.0000    0.0000
planoth    0.00000   0.00000   0.00000   0.00000
unsure     0.00000   0.00000   0.00000   0.00000
500comp     0.0000    1.0000    0.0000    1.0000
Hrs.of       50.00     95.00     60.00     80.00
pastsal      30000    125000     40000     60000
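As a quick consistency check on the descriptive statistics, the SE Mean column should equal StDev divided by the square root of N, and the mean of a 0/1 indicator multiplied by N gives the number of students in that group. The sketch below uses only figures copied from the table; the function names are mine and purely illustrative.

```python
import math

def se_mean(stdev, n):
    """Standard error of the mean: StDev / sqrt(N)."""
    return stdev / math.sqrt(n)

def count_from_indicator_mean(mean, n):
    """Number of 1s in a 0/1 variable, recovered from its mean."""
    return round(mean * n)

N = 40

# (StDev, reported SE Mean) copied from the descriptive statistics
reported = {
    "Expected": (17213, 2722),
    "age": (2.490, 0.394),
    "pastsal": (20664, 3267),
}
for name, (stdev, se) in reported.items():
    print(name, "computed:", round(se_mean(stdev, N), 3), "reported:", se)

# Means of the industry indicators, copied from the table
indicator_means = {"planfin": 0.675, "plancon": 0.25, "planmar": 0.075}
counts = {name: count_from_indicator_mean(m, N)
          for name, m in indicator_means.items()}
print(counts)
```

The three industry counts add up to the 40 students surveyed, which is consistent with every student naming exactly one of finance, consulting or marketing.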

At first look, several things stand out in the data. None of the 40 students claims
to be unsure about which industry to go into after graduation, and none plans to work
in an industry other than finance, consulting or marketing. Moreover, the average
expected salary of $84,250 is much higher than the actual salary of graduating MBA
students in 1996, $70,000 (footnote). This discrepancy could mean that MBA 1 students
are rather optimistic.
Salaries are often right-skewed. Let's check the distribution of both expected and
past salaries.

[Figure: histograms of expected salary and past salary (vertical axis: Frequency)]

Both salary distributions are right-skewed. Thus, it might be helpful to log the
salaries.
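The log transformation can be sketched as follows. I am assuming base-10 logs, because the logged values that appear later in the regression output (for example 4.84510 and 5.07918) match the base-10 logs of 70,000 and 120,000. The function name is illustrative.

```python
import math

def log_salary(salary):
    """Base-10 log of a salary (the base is an assumption, see above)."""
    return math.log10(salary)

# Salaries spanning the observed range of expected salary
for salary in (60000, 70000, 120000, 125000):
    print(salary, round(log_salary(salary), 5))
```

On this scale the observed range of expected salary, 60,000 to 125,000, is compressed to roughly 4.78 to 5.10, which pulls in the long right tail.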

[Figure: histograms of log expected and log past (vertical axis: Frequency)]

Let's check potential outliers in both expected and past salaries:

[Figure: plots of expected salary and past salary]

There are apparently 3 outliers among the past salary observations.


Now, let's look at expected salary across the different industries students plan to work in.

[Figure: expected salary plotted against plan fin., plan cons. and plan mark. (coded 0/1)]

Only 3 of the 40 students plan to work in the marketing industry (the mean of planmar is
0.075), so this variable can carry little weight in the analysis. Students who plan to work
in finance are coded 1. It is interesting to see that students who plan to work in finance
and those who do not have the same median expected salary ($80,000). It also looks like
planning to go into finance and planning to go into consulting are negatively correlated.
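The size of that negative correlation can be worked out from the indicator means alone, under the assumption that each student named exactly one industry (the means of planfin, plancon and planmar sum to 1, which is consistent with this). For two 0/1 indicators that are never both 1, the covariance is simply minus the product of their means. A minimal sketch, with an illustrative function name:

```python
import math

def exclusive_dummy_correlation(p1, p2):
    """Correlation between two 0/1 indicators that are never both 1.

    p1 and p2 are the proportions of 1s in each indicator.
    """
    covariance = -p1 * p2  # E[XY] = 0 because the groups do not overlap
    return covariance / math.sqrt(p1 * (1 - p1) * p2 * (1 - p2))

# Means of planfin and plancon from the descriptive statistics
r = exclusive_dummy_correlation(0.675, 0.25)
print(round(r, 3))
```

A correlation of this size between two predictors is enough to inflate their standard errors noticeably when both are kept in the same regression.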

Regression Analysis

* planmark. is highly correlated with other X variables
* planmark. has been removed from the equation
* planother has all values = 0
* planother has been removed from the equation
* unsure has all values = 0
* unsure has been removed from the equation

The regression equation is
Expected salary = - 44885 + 3352 age + 647 num. of yrs. of exp.
                  - 2591 plan fin. + 4025 plan cons. + 3171 500 comp?
                  + 432 Hrs. of work + 0.125 past salary

Predictor       Coef     StDev        T        P    VIF
Constant      -44885     18068    -2.48    0.018
age           3351.8     849.5     3.95    0.000    3.6
num.of         647.5     514.7     1.26    0.217    1.3
planfin        -2591      5082    -0.51    0.614    4.6
plancon         4025      5344     0.75    0.457    4.4
500comp         3171      2887     1.10    0.280    1.7
Hrs.of         431.6     122.2     3.53    0.001    1.9
pastsal      0.12506   0.09684     1.29    0.206    3.2

S = 6999    R-Sq = 86.4%    R-Sq(adj) = 83.5%

Analysis of Variance
Source        DF            SS            MS        F        P
Regression     7    9988126411    1426875202    29.13    0.000
Error         32    1567373589      48980425
Total         39   11555500000

Source     DF        Seq SS
age         1    8185071089
num.of      1      75586010
planfin     1     531542752
plancon     1     158051549
500comp     1      84132640
Hrs.of      1     872058157
pastsal     1      81684215

Unusual Observations
Obs    age   Expected      Fit   StDev Fit   Residual   St Resid
 16   25.0      70000    86541        3156     -16541     -2.65R
 34   27.0     120000   101312        3493      18688      3.08R

R denotes an observation with a large standardized residual

Durbin-Watson statistic = 1.64

We can see that Minitab automatically drops 3 variables: planning to work in marketing,
planning to work in other industries, and being unsure about the industry. I would also
remove the consulting variable because it is highly negatively correlated with the finance
variable. The overall regression is statistically significant (F = 29.13, P = 0.000).
However, several variables have P-values over .05.
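The summary figures for this first regression (S, R-Sq, R-Sq(adj) and F) all follow from the sums of squares and degrees of freedom in its Analysis of Variance table. The sketch below recomputes them from the Regression and Error rows; the function and key names are mine.

```python
def anova_summary(ss_regression, ss_error, df_regression, df_error):
    """Recompute S, R-Sq, R-Sq(adj) and F from an ANOVA table."""
    ss_total = ss_regression + ss_error
    df_total = df_regression + df_error
    ms_regression = ss_regression / df_regression
    ms_error = ss_error / df_error
    return {
        "s": ms_error ** 0.5,                           # residual standard deviation
        "r_sq": ss_regression / ss_total,               # share of variation explained
        "r_sq_adj": 1 - ms_error / (ss_total / df_total),
        "f": ms_regression / ms_error,                  # overall F statistic
    }

# Regression and Error rows of the seven-predictor model
full = anova_summary(9988126411, 1567373589, 7, 32)
for key, value in full.items():
    print(key, round(value, 4))
```

R-Sq(adj) penalizes the seven predictors for the degrees of freedom they use, which is why it sits a few points below R-Sq.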
[Figures: Residuals Versus the Fitted Values; Residuals Versus the Order of the Data; Normal Probability Plot of the Residuals; Histogram of the Residuals (response is Expected)]


The distribution of the residuals looks normal. However, we can notice a couple of outliers.
Now, let's try a regression with logged expected salary while keeping the same variables
(past salary enters the output below unlogged).
Regression Analysis

* planmark. is highly correlated with other X variables
* planmark. has been removed from the equation
* planother has all values = 0
* planother has been removed from the equation
* unsure has all values = 0
* unsure has been removed from the equation

The regression equation is
log expected = 4.26 + 0.0178 age + 0.00289 num. of yrs. of exp.
               - 0.0137 plan fin. + 0.0190 plan cons. + 0.0125 500 comp?
               + 0.00202 Hrs. of work + 0.000001 past salary

Predictor          Coef        StDev        T        P    VIF
Constant        4.26054      0.08967    47.52    0.000
age            0.017759     0.004216     4.21    0.000    3.6
num.of         0.002887     0.002554     1.13    0.267    1.3
planfin        -0.01366      0.02522    -0.54    0.592    4.6
plancon         0.01899      0.02652     0.72    0.479    4.4
500comp         0.01251      0.01433     0.87    0.389    1.7
Hrs.of        0.0020215    0.0006063     3.33    0.002    1.9
pastsal      0.00000057   0.00000048     1.18    0.246    3.2

S = 0.03473    R-Sq = 86.2%    R-Sq(adj) = 83.2%

Analysis of Variance
Source        DF          SS          MS        F        P
Regression     7    0.241162    0.034452    28.56    0.000
Error         32    0.038600    0.001206
Total         39    0.279762

Source     DF      Seq SS
age         1    0.201723
num.of      1    0.001471
planfin     1    0.012164
plancon     1    0.003731
500comp     1    0.001379
Hrs.of      1    0.019007
pastsal     1    0.001687

Unusual Observations
Obs    age   log expe       Fit   StDev Fit   Residual   St Resid
 16   25.0    4.84510   4.92668     0.01566   -0.08158     -2.63R
 34   27.0    5.07918   4.99767     0.01733    0.08151      2.71R

R denotes an observation with a large standardized residual

Durbin-Watson statistic = 2.01

Logging the expected salary does not change the regression materially: R-Sq is 86.2%
versus 86.4%, and the same predictors are significant. Thus, I will keep the unlogged data.
Let's see how the regression looks without the plan consulting variable.

Regression Analysis

The regression equation is
Expected salary = - 46346 + 3477 age + 574 num. of yrs. of exp.
                  - 5884 plan fin. + 2553 500 comp? + 466 Hrs. of work
                  + 0.113 past salary

Predictor       Coef     StDev        T        P    VIF
Constant      -46346     17846    -2.60    0.014
age           3476.8     827.6     4.20    0.000    3.4
num.of         573.8     501.9     1.14    0.261    1.2
planfin        -5884      2573    -2.29    0.029    1.2
500comp         2553      2750     0.93    0.360    1.6
Hrs.of         465.9     112.6     4.14    0.000    1.6
pastsal      0.11303   0.09489     1.19    0.242    3.1

S = 6953    R-Sq = 86.2%    R-Sq(adj) = 83.7%

Analysis of Variance
Source        DF            SS            MS        F        P
Regression     6    9960345863    1660057644    34.34    0.000
Error         33    1595154137      48338004
Total         39   11555500000

Source     DF        Seq SS
age         1    8185071089
num.of      1      75586010
planfin     1     531542752
500comp     1      22521046
Hrs.of      1    1077031091
pastsal     1      68593876

Unusual Observations
Obs    age   Expected      Fit   StDev Fit   Residual   St Resid
 10   33.0     110000   111467        5049      -1467     -0.31 X
 16   25.0      70000    86410        3131     -16410     -2.64R
 34   27.0     120000   101124        3461      18876      3.13R

R denotes an observation with a large standardized residual
X denotes an observation whose X value gives it large influence.

Durbin-Watson statistic = 1.67
[Figures: Histogram of the Residuals; Residuals Versus the Fitted Values; Normal Probability Plot of the Residuals (response is Expected). Observations 16 and 34 are labeled on the plots.]


Note that the outliers are still here. Now, let's run a best subsets regression to find out which
variables are best to choose for our model.
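To see why observations 16 and 34 keep being flagged, the standardized residuals can be recomputed from the Unusual Observations of the regression without the consulting variable. The sketch relies on the standard relationship that an observation's leverage equals (StDev Fit / S) squared, and takes each residual as Expected minus Fit, so it is negative whenever the fit exceeds the stated expectation. The function name is illustrative.

```python
import math

def standardized_residual(expected, fit, stdev_fit, s):
    """Standardized residual from an Unusual Observations row."""
    residual = expected - fit
    leverage = (stdev_fit / s) ** 2       # h = Var(fit) / S^2
    return residual / (s * math.sqrt(1 - leverage))

S = 6953  # residual standard deviation of the six-predictor model

# (Expected, Fit, StDev Fit) for observations 10, 16 and 34
rows = {10: (110000, 111467, 5049),
        16: (70000, 86410, 3131),
        34: (120000, 101124, 3461)}
st_resid = {obs: standardized_residual(e, f, sd, S)
            for obs, (e, f, sd) in rows.items()}
for obs, value in st_resid.items():
    print(obs, round(value, 2))
```

Observation 10 has a small residual but a large StDev Fit, which is why it is marked X (high leverage) rather than R (large residual).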

Best Subsets Regression

Response is Expected
Candidate predictors: age, num.of, planfin, 500comp, Hrs.of, pastsal

Vars   R-Sq   R-Sq(adj)    C-p         S   Predictors (X = included)
   1   70.8        70.1   33.7    9417.8   X
   1   57.8        56.7   64.8     11325   X
   2   81.2        80.2   10.9    7660.7   X X
   2   75.5        74.1   24.6    8752.2   X X
   3   84.7        83.4    4.6    7012.4   X X X
   3   83.2        81.8    8.1    7334.3   X X X
   4   85.3        83.6    5.2    6971.0   X X X X
   4   85.2        83.5    5.5    7000.3   X X X X
   5   85.8        83.8    5.9    6938.4   X X X X X
   5   85.6        83.5    6.3    6983.8   X X X X X
   6   86.2        83.7    7.0    6952.6   X X X X X X

My choice is between two of the models above. One reason is that they have a small S
and a relatively high R-Sq. Another reason is that C-p should be approximately p + 1, where
p is the number of predictors in the candidate model (5 for a four-predictor model).
Thus, I picked the one with a C-p of 5.5 and an S of 7000.3. Here is the new regression:
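The C-p values can be reproduced from the Analysis of Variance tables in this report. Mallows' C-p compares a subset model's error sum of squares with the mean squared error of the full model. I am assuming the full model is the six-predictor regression without the consulting variable (MS Error = 48338004), since best subsets was run on those six candidates. The function name is mine.

```python
def mallows_cp(sse_subset, mse_full, n, num_predictors):
    """Mallows' C-p for a subset model with an intercept."""
    num_params = num_predictors + 1       # predictors plus the constant
    return sse_subset / mse_full - (n - 2 * num_params)

N = 40
MSE_FULL = 48338004  # assumed full model: six predictors, no plancon

# Error sums of squares taken from the ANOVA tables of each model
cp_four = mallows_cp(1715135581, MSE_FULL, N, 4)   # age, planfin, pastsal, Hrs.of
cp_three = mallows_cp(1770251214, MSE_FULL, N, 3)  # age, planfin, Hrs.of
cp_full = mallows_cp(1595154137, MSE_FULL, N, 6)   # all six candidates
print(round(cp_four, 1), round(cp_three, 1), round(cp_full, 1))
```

By construction the full model always has C-p equal to its number of parameters, so the criterion is only informative for the smaller subsets.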
Regression Analysis

The regression equation is
Expected salary = - 57732 + 4034 age - 6828 plan fin. + 0.0975 past salary
                  + 468 Hrs. of work

Predictor       Coef     StDev        T        P    VIF
Constant      -57732     16415    -3.52    0.001
age           4034.1     753.2     5.36    0.000    2.8
planfin        -6828      2392    -2.85    0.007    1.0
pastsal      0.09755   0.09198     1.06    0.296    2.9
Hrs.of         468.5     111.7     4.19    0.000    1.6

S = 7000    R-Sq = 85.2%    R-Sq(adj) = 83.5%

Analysis of Variance
Source        DF            SS            MS        F        P
Regression     4    9840364419    2460091105    50.20    0.000
Error         35    1715135581      49003874
Total         39   11555500000

Source     DF        Seq SS
age         1    8185071089
planfin     1     536172676
pastsal     1     257832450
Hrs.of      1     861288204

Unusual Observations
Obs    age   Expected      Fit   StDev Fit   Residual   St Resid
  8   29.0     125000   111564        2946      13436      2.12R
 10   33.0     110000   115013        4497      -5013     -0.93 X
 16   25.0      70000    87329        2846     -17329     -2.71R
 34   27.0     120000   101545        3129      18455      2.95R
 39   30.0     100000   107987        4449      -7987     -1.48 X
 40   31.0     110000   114851        4466      -4851     -0.90 X

R denotes an observation with a large standardized residual
X denotes an observation whose X value gives it large influence.

Durbin-Watson statistic = 1.57

Past salary still has a P-value above .05 (0.296). So, I decided to take this variable out
of the regression.
Regression Analysis

The regression equation is
Expected salary = - 69455 + 4583 age - 6843 plan fin. + 501 Hrs. of work

Predictor       Coef     StDev        T        P
Constant      -69455     12157    -5.71    0.000
age           4583.5     547.7     8.37    0.000
planfin        -6843      2396    -2.86    0.007
Hrs.of         500.9     107.7     4.65    0.000

S = 7012    R-Sq = 84.7%    R-Sq(adj) = 83.4%

Analysis of Variance
Source        DF            SS            MS        F        P
Regression     3    9785248786    3261749595    66.33    0.000
Error         36    1770251214      49173645
Total         39   11555500000

Source     DF        Seq SS
age         1    8185071089
planfin     1     536172676
Hrs.of      1    1064005021

Durbin-Watson statistic = 1.64
(No autocorrelation. This confirms the residuals versus order plot.)

Unusual Observations
Obs    age   Expected      Fit   StDev Fit   Residual   St Resid
  8   29.0     125000   111047        2910      13953      2.19R
 10   33.0     110000   116859        4154      -6859     -1.21 X
 16   25.0      70000    87704        2829     -17704     -2.76R
 34   27.0     120000   101880        3118      18120      2.88R

R denotes an observation with a large standardized residual
X denotes an observation whose X value gives it large influence.
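For reference, the Durbin-Watson statistic is the sum of squared differences between consecutive residuals divided by the sum of squared residuals. Values near 2 suggest no first-order autocorrelation, values well below 2 suggest positive autocorrelation, and values well above 2 suggest negative autocorrelation. The survey residuals are not listed in this report, so the two residual series below are made up purely to illustrate the calculation.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic for residuals in observation order."""
    numerator = sum((residuals[i] - residuals[i - 1]) ** 2
                    for i in range(1, len(residuals)))
    denominator = sum(e * e for e in residuals)
    return numerator / denominator

# Hypothetical residuals, not taken from the survey data
alternating = [1, -1, 1, -1]            # sign flips every observation
persistent = [2, 1.5, 1, -1, -1.5, -2]  # neighbours tend to be alike

print(durbin_watson(alternating))
print(round(durbin_watson(persistent), 3))
```

A cross-sectional survey has no natural time order, so the statistic here mainly checks that the order in which surveys were collected did not matter.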

[Figures: Residuals Versus the Order of the Data; Residuals Versus the Fitted Values; Normal Probability Plot of the Residuals; Histogram of the Residuals (response is Expected). Observations 16 and 34 are labeled on the plots.]


Now we have a model in which every predictor is statistically significant (all P-values below .05).
However, two outliers are still visible in the residual plots. We can try removing these 2 outliers
(observations 16 and 34).

Regression Analysis

The regression equation is
Expected salary = - 65739 + 4508 age - 6676 plan fin. + 475 Hrs. of work

Predictor       Coef     StDev        T        P    VIF
Constant      -65739     10088    -6.52    0.000
age           4508.2     474.3     9.51    0.000    1.6
planfin        -6676      2071    -3.22    0.003    1.0
Hrs.of         474.7     100.6     4.72    0.000    1.6

S = 5742    R-Sq = 88.9%    R-Sq(adj) = 87.9%

Analysis of Variance
Source        DF            SS            MS        F        P
Regression     3    8941383570    2980461190    90.41    0.000
Error         34    1120826956      32965499
Total         37   10062210526

Source     DF        Seq SS
age         1    7937259218
planfin     1     270764351
Hrs.of      1     733360001

Unusual Observations
Obs    age   Expected      Fit   StDev Fit   Residual   St Resid
  8   29.0     125000   110091        2846      14909      2.99R
 10   33.0     110000   116257        3415      -6257     -1.36 X
 13   24.0      70000    80430        2798     -10430     -2.08R

R denotes an observation with a large standardized residual
X denotes an observation whose X value gives it large influence.

Durbin-Watson statistic = 1.65
[Figures: Residuals Versus the Order of the Data; Residuals Versus the Fitted Values; Normal Probability Plot of the Residuals (response is Expected)]


Regression Analysis

The regression equation is
Expected salary = - 63050 + 4653 age - 4558 plan fin. + 351 Hrs. of work

Predictor       Coef     StDev        T        P    VIF
Constant      -63050      8827    -7.14    0.000
age           4652.8     415.5    11.20    0.000    1.6
planfin        -4558      1908    -2.39    0.023    1.1
Hrs.of        351.10     94.81     3.70    0.001    1.7

S = 5004    R-Sq = 90.1%    R-Sq(adj) = 89.2%

Analysis of Variance
Source        DF            SS            MS        F        P
Regression     3    7482915093    2494305031    99.63    0.000
Error         33     826165988      25035333
Total         36    8309081081

Source     DF        Seq SS
age         1    7066013767
planfin     1      73551726
Hrs.of      1     343349601

Unusual Observations
Obs    age   Expected      Fit   StDev Fit   Residual   St Resid
  9   33.0     110000   115070        2996      -5070     -1.27 X
 34   27.0      70000    80840        1093     -10840     -2.22R

R denotes an observation with a large standardized residual
X denotes an observation whose X value gives it large influence.

Durbin-Watson statistic = 2.00
[Figures: Residuals Versus the Fitted Values; Residuals Versus the Order of the Data; Normal Probability Plot of the Residuals; Histogram of the Residuals (response is Expected)]

Heteroscedasticity is non-constant variance. It appears that there is non-constant
variance in the plot of residuals versus fitted values. To explore that aspect further, we
would have to do a Levene's test. Hopefully, logging the response will take care of this.
Let's do a regression with logged expected salaries.


Let's check the standardized residuals (SRES5), leverage values (HI5) and Cook's distance (COOK5).

Row      SRES5         HI5      COOK5
  1    0.59872    0.079655   0.007756
  2    0.27269    0.116828   0.002459
  3   -0.01353    0.076361   0.000004
  4    0.97156    0.103741   0.027315
  5   -1.26510    0.044574   0.018667
  6    1.15005    0.063633   0.022470
  7    1.27035    0.105834   0.047753
  8    1.99149    0.152316   0.178160
  9   -1.26524    0.358517*  0.223670
 10   -0.73194    0.143014   0.022351
 11   -0.15638    0.058155   0.000378
 12   -1.58437    0.284480   0.249509
 13   -0.91533    0.063633   0.014234
 14    0.76627    0.063603   0.009971
 15    0.23355    0.128017   0.002002
 16   -1.44970    0.231559   0.158323
 17   -0.75144    0.054100   0.008074
 18   -0.39198    0.079616   0.003323
 19    1.54343    0.100642   0.066644
 20   -1.68307    0.092249   0.071967
 21    0.94590    0.060364   0.014370
 22   -0.01353    0.076361   0.000004
 23    0.35320    0.063603   0.002118
 24   -0.30542    0.095103   0.002451
 25    0.59163    0.084579   0.008085
 26    1.66542    0.197385   0.170528
 27    0.21357    0.105834   0.001350
 28   -1.20202    0.170243   0.074111
 29    0.04569    0.064749   0.000036
 30    0.42040    0.043466   0.002008
 31    0.37278    0.145178   0.005900
 32    1.33865    0.097529   0.048414
 33   -0.91533    0.063633   0.014234
 34   -2.22013    0.047736   0.061771
 35   -0.15638    0.058155   0.000378
 36   -0.38165    0.091066   0.003648
 37    0.38048    0.134487   0.005624


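Leverage values should be less than 2.5*(p+1)/n, where p is the number of predictors. With
p = 3 and n = 37, that is 2.5*(3+1)/37 = .27.

The leverage value marked (*), .36, clearly exceeds this guideline, and one other value (.28)
is just above it.
Cook's distance should be less than 1, which is true for every observation.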
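The COOK5 column can be reproduced from the other two columns: Cook's distance equals the squared standardized residual divided by the number of estimated coefficients, times h/(1 - h), where h is the leverage. The sketch below checks a few rows of the listing and evaluates the 2.5*(p+1)/n leverage guideline. Counting p as the three predictors (age, planfin, Hrs.of) plus a constant is my assumption about how the rule is meant to be applied, and the function names are illustrative.

```python
def cooks_distance(std_resid, leverage, num_params):
    """Cook's distance from a standardized residual and a leverage value."""
    return std_resid ** 2 / num_params * leverage / (1 - leverage)

def leverage_cutoff(num_predictors, n, multiplier=2.5):
    """Rule-of-thumb leverage cutoff: multiplier * (p + 1) / n."""
    return multiplier * (num_predictors + 1) / n

NUM_PARAMS = 4  # assumed: constant plus age, planfin and Hrs.of

# (SRES5, HI5) copied from rows 8, 9, 12 and 34 of the listing
rows = {8: (1.99149, 0.152316), 9: (-1.26524, 0.358517),
        12: (-1.58437, 0.284480), 34: (-2.22013, 0.047736)}
cook = {row: cooks_distance(r, h, NUM_PARAMS) for row, (r, h) in rows.items()}
for row, value in cook.items():
    print(row, round(value, 4))

cutoff = leverage_cutoff(3, 37)
print(round(cutoff, 3))
```

Row 34 has the largest standardized residual in the listing but a small Cook's distance, because its leverage is low; row 9 is the reverse case.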

Regression Analysis

The regression equation is
logexp = 4.18 + 0.0231 age - 0.0256 plan fin. + 0.00186 Hrs. of work

Predictor         Coef       StDev        T        P    VIF
Constant       4.18204     0.04868    85.92    0.000
age           0.023056    0.002291    10.06    0.000    1.6
planfin       -0.02560     0.01052    -2.43    0.021    1.1
Hrs.of       0.0018643   0.0005228     3.57    0.001    1.7

S = 0.02759    R-Sq = 88.3%    R-Sq(adj) = 87.2%

Analysis of Variance
Source        DF          SS          MS        F        P
Regression     3    0.188991    0.062997    82.74    0.000
Error         33    0.025126    0.000761
Total         36    0.214116

Source     DF      Seq SS
age         1    0.176882
planfin     1    0.002427
Hrs.of      1    0.009681

Unusual Observations
Obs    age    logexp       Fit   StDev Fit   Residual   St Resid
  8   25.0   4.90309   4.85166     0.01077    0.05143      2.02R
  9   33.0   5.04139   5.07340     0.01652   -0.03201     -1.45 X
 20   25.0   4.77815   4.83539     0.00838   -0.05724     -2.18R
 34   27.0   4.84510   4.90015     0.00603   -0.05505     -2.04R

R denotes an observation with a large standardized residual
X denotes an observation whose X value gives it large influence.

Durbin-Watson statistic = 2.34
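Coefficients in a log-response model are easier to read once converted to percentage effects on the original salary scale. The sketch below does that for the three predictors and back-transforms one prediction. Two things are assumptions on my part: the logs are base 10 (the logged salaries in the output match base-10 logs), and the planfin coefficient is negative. The student profile used for the prediction is hypothetical.

```python
# Coefficients of the logged model (planfin taken as negative, see above)
COEF = {"const": 4.18204, "age": 0.023056,
        "planfin": -0.02560, "hrs": 0.0018643}

def percent_effect(coef):
    """Percent change in salary for a one-unit increase in a predictor."""
    return (10 ** coef - 1) * 100

def predicted_salary(age, planfin, hrs):
    """Back-transform the fitted log10 salary to dollars."""
    log_salary = (COEF["const"] + COEF["age"] * age
                  + COEF["planfin"] * planfin + COEF["hrs"] * hrs)
    return 10 ** log_salary

for name in ("age", "planfin", "hrs"):
    print(name, round(percent_effect(COEF[name]), 2), "%")

# Hypothetical student: 27 years old, plans finance, 70 hours per week
print(round(predicted_salary(27, 1, 70)))
```

Read this way, each additional year of age goes with an expected salary a little over 5% higher, and each additional planned weekly hour with roughly 0.4% more.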
[Figure: Residuals Versus the Fitted Values (response is logexp)]


It appears that there is still non-constant variance. The only reasonable thing to do now is
weighted least squares.
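Weighted least squares is not carried out in this report, so the following is only a minimal sketch of the idea for a single predictor: each observation is weighted by the inverse of its assumed error variance, so noisier observations pull less on the fitted line. The ages, log salaries and weights below are hypothetical and are not the survey data.

```python
def weighted_least_squares(x, y, w):
    """Fit y = intercept + slope * x by weighted least squares."""
    total_w = sum(w)
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / total_w
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / total_w
    s_xy = sum(wi * (xi - x_bar) * (yi - y_bar)
               for wi, xi, yi in zip(w, x, y))
    s_xx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    slope = s_xy / s_xx
    return y_bar - slope * x_bar, slope

# Hypothetical data for illustration only
ages = [24, 25, 27, 29, 31, 33]
log_salaries = [4.80, 4.85, 4.88, 4.95, 5.00, 5.08]
weights = [1.0, 1.0, 1.0, 0.5, 0.5, 0.25]  # 1 / assumed error variance

intercept, slope = weighted_least_squares(ages, log_salaries, weights)
print(round(intercept, 4), round(slope, 4))
```

In practice the weights would have to be estimated, for example from how the spread of the residuals changes with the fitted values, and the full model would need a multiple-regression routine rather than this one-predictor formula.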

[Figures: Residuals Versus the Order of the Data; Normal Probability Plot of the Residuals; Histogram of the Residuals (response is logexp)]
