
MBA Starting Salaries – Class Demonstration

Summary:

Marie Dear, an aspiring MBA applicant, was very interested in the starting salaries of graduating
students. She was able to track down a dataset from an anonymous college. The data included the
graduates' age, gender, years of work experience, GMAT information, fall and spring MBA average,
quartile ranking, and their native language.

She is interested in understanding the starting salaries, the factors influencing the starting salaries,
and the satisfaction of the graduates with their MBA program.

Now, where do we start??

First, we must understand the dataset received. We need to get a feel for the data to know what
we could do with it. We can start with understanding the data size, the missing values, the number
of variables present, the level of measurement of every variable, and how many variables are
categorical and how many are continuous.

Why do you think we need to know all these??

This information helps us make sense of the dataset in our hand. It helps us understand what
analysis we could do with the data and whether we can answer Marie's concerns.

Let's start with the dataset. We are using SPSS to do the analysis, and hence from here onwards,
the procedures will be explained based on how they work in SPSS.

Open a new SPSS file. You may know that when we open the SPSS file, there will be two windows
open. First, there will be a dataset file (.sav) comprising data view and variable view. Second, there
is an output file (.spv).

We have our dataset in an Excel sheet. We want to import it from the Excel sheet into the SPSS
file. We will go to File – Import Data – Excel as shown below.
Then the Excel sheet will be selected. A pop-up window appears asking for the settings to use while
converting the .xls file to the .sav file.

The process starts with reading the variable names for the SPSS file from the first row of data in
the Excel sheet.

Once the data is open, we will save the data file.


Now the data is here!! What do we check first??

It is good to check the missing data. The Case mentions that the value 998 was entered where there
was no response and 999 where the salary was not disclosed. The Case also presents
the data size as 274. Let's check it once.

We go to the 'Analyze' tab – Descriptive Statistics – Frequencies, as shown below.

You select all the variables on the left side and move them to the right box. Then you tick 'Display
frequency tables' and click OK.

The result will be presented in the output file as shown below.


The output shows no missing data. This is because the missing responses were already replaced with
the codes 998 and 999 described in the Case, so SPSS counts them as valid values. The output also
provides the frequency distribution of every variable mentioned above. It will help you understand
how the values are distributed.

If you are interested, you can check the descriptive statistics of scale variables. We can test the
assumptions as well.

Let's now see what Marie is interested in knowing. She was keen on gaining information
regarding the starting salaries. So, the variable we are interested in is Salary. Let's check the
frequency distribution of the Salary. You may do that by choosing the variable 'salary'.

salary
Value  Frequency  Percent  Valid Percent  Cumulative Percent
Valid 0 90 32.8 32.8 32.8
998 46 16.8 16.8 49.6
999 35 12.8 12.8 62.4
64000 1 .4 .4 62.8
77000 1 .4 .4 63.1
78256 1 .4 .4 63.5
82000 1 .4 .4 63.9
85000 4 1.5 1.5 65.3
86000 2 .7 .7 66.1
88000 1 .4 .4 66.4
88500 1 .4 .4 66.8
90000 3 1.1 1.1 67.9
92000 3 1.1 1.1 69.0
93000 3 1.1 1.1 70.1
95000 7 2.6 2.6 72.6
96000 4 1.5 1.5 74.1
96500 1 .4 .4 74.5
97000 2 .7 .7 75.2
98000 10 3.6 3.6 78.8
99000 1 .4 .4 79.2
100000 9 3.3 3.3 82.5
100400 1 .4 .4 82.8
101000 2 .7 .7 83.6
101100 1 .4 .4 83.9
101600 1 .4 .4 84.3
102500 1 .4 .4 84.7
103000 1 .4 .4 85.0
104000 2 .7 .7 85.8
105000 11 4.0 4.0 89.8
106000 3 1.1 1.1 90.9
107000 1 .4 .4 91.2
107300 1 .4 .4 91.6
107500 1 .4 .4 92.0
108000 2 .7 .7 92.7
110000 1 .4 .4 93.1
112000 3 1.1 1.1 94.2
115000 5 1.8 1.8 96.0
118000 1 .4 .4 96.4
120000 4 1.5 1.5 97.8
126710 1 .4 .4 98.2
130000 1 .4 .4 98.5
145800 1 .4 .4 98.9
146000 1 .4 .4 99.3
162000 1 .4 .4 99.6
220000 1 .4 .4 100.0
Total 274 100.0 100.0

What do we understand from the output?

There are 46 + 35 = 81 missing values (coded 998 and 999). Another important fact is that 90
respondents stated their Salary as 0. What does that mean? And what will we do about it? How will
we make sense of this data to arrive at the graduates' starting salaries?

Since Salary is a continuous variable, let's get the measures of descriptive statistics. What are the
measures of descriptive statistics? There are four groups of them. If you do not know the answer,
you should be writing an imposition!!!!

Do the following to obtain the required measures.

The output will be as follows:

How do you interpret the output? Does it make sense to state that the mean Salary is 39025? The
standard deviation is larger than the mean. Let's see how the histogram looks. Do we have a
normal distribution here?

The number of values between zero and 1000 is 81 + 90 = 171. Hence, we can say that 171 data
points are either zero or missing values. Do you think we should remove them and then examine the
variable 'salary'? Yes/No. Why?
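
The counts and the misleading mean discussed above can be cross-checked outside SPSS. The sketch below is only an illustration in Python (the class itself uses SPSS): it rebuilds the salary column from the frequency table shown earlier and counts the placeholder codes. The names `SALARY_FREQ`, `salary`, and `unusable` are made up for this example.

```python
from collections import Counter
from statistics import mean

# Salary frequency table copied from the SPSS output: value -> frequency.
SALARY_FREQ = {
    0: 90, 998: 46, 999: 35, 64000: 1, 77000: 1, 78256: 1, 82000: 1,
    85000: 4, 86000: 2, 88000: 1, 88500: 1, 90000: 3, 92000: 3, 93000: 3,
    95000: 7, 96000: 4, 96500: 1, 97000: 2, 98000: 10, 99000: 1,
    100000: 9, 100400: 1, 101000: 2, 101100: 1, 101600: 1, 102500: 1,
    103000: 1, 104000: 2, 105000: 11, 106000: 3, 107000: 1, 107300: 1,
    107500: 1, 108000: 2, 110000: 1, 112000: 3, 115000: 5, 118000: 1,
    120000: 4, 126710: 1, 130000: 1, 145800: 1, 146000: 1, 162000: 1,
    220000: 1,
}

# Expand the table back into one salary value per respondent.
salary = [value for value, n in SALARY_FREQ.items() for _ in range(n)]

counts = Counter(salary)
no_response = counts[998]     # code for "no response"
not_disclosed = counts[999]   # code for "salary not disclosed"
zero_salary = counts[0]       # respondents who reported a salary of 0
unusable = no_response + not_disclosed + zero_salary

# Mean with the placeholder codes left in (this is what SPSS reported).
raw_mean = mean(salary)

print(f"cases = {len(salary)}, zero or coded missing = {unusable}")
print(f"mean including codes = {raw_mean:.0f}")
```

The mean comes out close to 39025 only because 171 of the 274 values are 0, 998, or 999, which is why it says little about actual starting salaries.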

Let's remove the values and understand. How do we select the responses other than 0 and missing
values? We do not need values that are 0, 998, and 999.

Since all the remaining salary values in the dataset are more than 1000, we can select the
remaining responses with the condition salary > 1000, in the following manner.

Once we apply the filter, the dataset will look like the figure below.

It shows that the responses with Salary 0, 998, 999 were not selected, and the others were
selected.

Now, we can understand the descriptive statistics for the rest of the responses. Please check the
previous steps to understand the procedure to determine descriptive statistics. We will also review
the histogram (which is available under the 'Frequencies' dialog).
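
For readers who want to see the selection and the descriptive statistics in one place, here is a minimal sketch in Python, assuming the salary column is rebuilt from the frequency table shown earlier. It mirrors the condition salary > 1000; the variable names are made up for this example and it is not part of the SPSS procedure.

```python
from statistics import mean, median, stdev

# Salary frequency table copied from the SPSS output: value -> frequency.
SALARY_FREQ = {
    0: 90, 998: 46, 999: 35, 64000: 1, 77000: 1, 78256: 1, 82000: 1,
    85000: 4, 86000: 2, 88000: 1, 88500: 1, 90000: 3, 92000: 3, 93000: 3,
    95000: 7, 96000: 4, 96500: 1, 97000: 2, 98000: 10, 99000: 1,
    100000: 9, 100400: 1, 101000: 2, 101100: 1, 101600: 1, 102500: 1,
    103000: 1, 104000: 2, 105000: 11, 106000: 3, 107000: 1, 107300: 1,
    107500: 1, 108000: 2, 110000: 1, 112000: 3, 115000: 5, 118000: 1,
    120000: 4, 126710: 1, 130000: 1, 145800: 1, 146000: 1, 162000: 1,
    220000: 1,
}
salary = [value for value, n in SALARY_FREQ.items() for _ in range(n)]

# Keep only real salaries: this drops 0, 998 and 999 in one condition.
selected = [s for s in salary if s > 1000]

n_selected = len(selected)
mean_salary = mean(selected)
median_salary = median(selected)
sd_salary = stdev(selected)                    # sample standard deviation (n - 1)
salary_range = max(selected) - min(selected)

print(f"N = {n_selected}")
print(f"mean = {mean_salary:.2f}, median = {median_salary}")
print(f"std. deviation = {sd_salary:.2f}, range = {salary_range}")
```

Comparing the mean with the median here is a quick way to see the direction of the skew before looking at the histogram.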
What do you understand from this output??

The data roughly follows a normal distribution. However, we can see outliers. So, we can find the
outliers (please refer to the slides to do the same), remove them, and see how the data fits the
normal distribution curve. The long right tail, with the outliers on the right side, indicates that the
data is positively skewed, suggesting that the mean value will be slightly higher than the median value.

The dispersion of the data could be seen from range and standard deviation. It's important to
understand the dispersion. A large dispersion in a small sample can affect the mean value. What is
your opinion regarding the range and standard deviation value?

What was Marie's next concern? Marie wanted to understand whether any variables, such as Gender,
Age, GMAT score, language, etc., affect the starting salary.

Let's consider the effect of various variables on the graduates' starting Salary. Since we are
concerned with starting salaries, should we go with the same 103 data points, or should we
consider the initial total data points (274)? What do you think?
What are the data types of the variables selected?

1. Salary – continuous variable
2. Gender – categorical variable
3. First language – categorical variable
4. Quartile ranks – categorical variable
5. Age – continuous variable
6. GMAT total score – continuous variable
7. Work experience – continuous variable

Now what??

Let's start with…

1. Gender and Salary

There are two values assigned in the data, 1=Male and 2=Female. So first, we need to set the values
in our datasheet. We will see how to do that.

Good!! We assigned the values.
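
Assigning value labels simply means attaching a readable name to each numeric code. As a rough illustration outside SPSS, the same idea in Python looks like this. The mapping 1 = Male, 2 = Female comes from the text above; the list `coded` is a made-up handful of codes, not rows from the dataset.

```python
# Value labels from the text: 1 = Male, 2 = Female.
SEX_LABELS = {1: "Male", 2: "Female"}

# Hypothetical coded responses, for illustration only (not real data rows).
coded = [1, 2, 1, 1, 2]

# Replace every code with its label, as the value labels do in SPSS output.
labelled = [SEX_LABELS[code] for code in coded]

print(labelled)
```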

How do we understand the influence of Gender on Salary? What test should we use?

We know that IV is Gender (categorical with two groups) and DV is Salary (continuous variable). So
based on this information, which method is appropriate?
Independent Sample t-Test??

What will be the hypothesis?

H1: Gender (Male/Female) influences Salary

(or) H1: There is a difference in salaries received by males and females (the mean values of the two
groups in the population are not the same)

How do we conduct an independent sample t-test? It is shown in the figure below. You can also
refer to the class slides.
The output will be as follows:

Group Statistics
         sex      N    Mean        Std. Deviation   Std. Error Mean
salary   Male     72   104970.97   13672.283        1611.294
         Female   31    98524.39   24762.405        4447.459

How do we interpret this??

The dataset comprises 72 males and 31 females. The mean Salary of males = 104970.97, and the mean
Salary of females = 98524.39. What does this tell you?

The sample suggests a difference in the mean Salary of males and females. Now, we need to know
whether the result holds for the population. How do we do that??

Remember!! Statistical significance?? What does that imply? Once we conduct the test and get a
p-value below the chosen significance level, we call the result statistically significant. So, the null
hypothesis will be rejected, and the difference in mean values can be generalized to the population.

But before we go with the t-test, we are concerned about the homogeneity of variance (the basic
underlying assumption for mean comparison). Let's try to understand the result.

Levene's test of equality of variance states that

H0: The variances of the two groups are equal

H1: The variances of the two groups are not equal

The p-value for Levene's test is 0.507, so we fail to reject the null hypothesis at a
significance level of 5%. Hence, equal variance is assumed.

Now the t-test shows that the p-value is 0.093. So we fail to reject the null
hypothesis at a significance level of 5%. Hence, the sample gives no statistically significant
evidence that Gender influences Salary.
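
The t-test result can be cross-checked from the Group Statistics table alone. The sketch below is an illustration in Python using SciPy's `ttest_ind_from_stats`, with equal variances assumed as Levene's test allowed. It is not the SPSS procedure itself, and the dictionary names are made up for this example.

```python
from scipy import stats

# Summary values copied from the Group Statistics table.
male = {"mean": 104970.97, "sd": 13672.283, "n": 72}
female = {"mean": 98524.39, "sd": 24762.405, "n": 31}

# Independent-samples t-test computed from summary statistics,
# pooling the variances (equal variance assumed).
result = stats.ttest_ind_from_stats(
    mean1=male["mean"], std1=male["sd"], nobs1=male["n"],
    mean2=female["mean"], std2=female["sd"], nobs2=female["n"],
    equal_var=True,
)
t_stat, p_value = result.statistic, result.pvalue

alpha = 0.05                      # 5% significance level used in the text
reject_null = p_value < alpha

print(f"t = {t_stat:.3f}, p = {p_value:.3f}, reject H0: {reject_null}")
```

The p-value agrees with the 0.093 reported above, so the null hypothesis is not rejected at the 5% level.
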
2. First Language and Salary

There are two values assigned in the data, 1=English and 2=Others. So first, we need to set the
values in our datasheet.

How do we understand the influence of the first language on Salary? What test should we use?

We know that IV is the First language (categorical with two groups) and DV is Salary (continuous
variable). So based on this information, which method is appropriate?

Independent Sample t-Test?? Again!! Yes!!

What will be the hypothesis?

H1: First language (English/Others) influences Salary

(or) H1: There is a difference in salaries received by graduates having English as the first language
and others (the mean values of the two groups in the population are not the same)

Please do the t-test like the previous one. The output will be as follows:

Group Statistics
         frstlang   N    Mean        Std. Deviation   Std. Error Mean
salary   English    96   101748.60   13923.953         1421.108
         Others      7   120614.29   44399.040        16781.260

How do we interpret this??

The dataset comprises 96 respondents whose first language is English and 7 whose first language is
not English. The mean Salary of English = 101748.60, and the mean Salary of others = 120614.29.

Levene's test of equality of variance states that

H0: The variances of the two groups are equal

H1: The variances of the two groups are not equal

The p-value for Levene's test is 0.000 (that is, below 0.001), so we reject the null hypothesis at a
significance level of 5%. Hence, equal variance cannot be assumed, and the assumption behind the
standard t-test is not satisfied. SPSS also prints an 'Equal variances not assumed' row for this
situation, but with only 7 respondents in the 'Others' group, there is not enough data to draw
meaningful conclusions on the influence of the First language on Salary.
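
When equal variance cannot be assumed, the usual alternative is Welch's t-test, which is what the 'Equal variances not assumed' row in SPSS corresponds to. The sketch below shows, as an illustration only, how that version can be computed in Python from the Group Statistics table. The variable names are made up for this example, and with just 7 cases in one group the result should be read with caution.

```python
from scipy import stats

# Summary values copied from the Group Statistics table.
english = {"mean": 101748.60, "sd": 13923.953, "n": 96}
others = {"mean": 120614.29, "sd": 44399.040, "n": 7}

# Welch's t-test: equal_var=False means the variances are NOT pooled.
result = stats.ttest_ind_from_stats(
    mean1=english["mean"], std1=english["sd"], nobs1=english["n"],
    mean2=others["mean"], std2=others["sd"], nobs2=others["n"],
    equal_var=False,
)
t_stat, p_value = result.statistic, result.pvalue

alpha = 0.05
reject_null = p_value < alpha

print(f"Welch t = {t_stat:.3f}, p = {p_value:.3f}, reject H0: {reject_null}")
```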

3. Quartile Ranks and Salary

There are four values assigned in the data, 1=Top and 4=Bottom. So first, we need to set the values
in our datasheet.

How do we understand the influence of quartile ranks on Salary? What test should we use?

We know that IV is quartile ranks (categorical with four groups) and DV is Salary (continuous
variable). So based on this information, which method is appropriate?

Independent Sample t-Test??

Can we do a t-test for more than two groups??

No. Then what will be the option??

One-way ANOVA!!!

What will be the hypothesis?

H1: Quartile rank influences Salary

(or) H1: There is a difference in salaries received by graduates with different ranks (the mean values
of the four groups in the population are not all the same)

How do we conduct one-way ANOVA? It is shown in the figure below. You can also refer to the class
slides.
The output will be as follows:

Test of Homogeneity of Variances
                                                Levene Statistic   df1   df2      Sig.
salary   Based on Mean                          2.646              3     99       .053
         Based on Median                        2.137              3     99       .100
         Based on Median and with adjusted df   2.137              3     41.036   .110
         Based on trimmed mean                  2.124              3     99       .102

ANOVA
salary
                 Sum of Squares     df    Mean Square     F      Sig.
Between Groups     936893650.930     3    312297883.643   .977   .407
Within Groups    31631098668.992    99    319506047.162
Total            32567992319.922   102

How do we interpret this??

First, we check the homogeneity of variance (the basic underlying assumption for mean
comparison). Let's try to understand the result.

Levene's test of equality of variance states that

H0: The variances of the groups are equal

H1: The variances of the groups are not equal

The p-value for Levene's test is 0.053, so we fail to reject the null hypothesis at a
significance level of 5%. Hence, equal variance is assumed.

Now we will check the ANOVA results. The result shows that the p-value is 0.407. So
we fail to reject the null hypothesis at a significance level of 5%. Hence, the sample gives no
statistically significant evidence that Quartile rank influences Salary.
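
It helps to see where the F value and its p-value come from. The sketch below is an illustration in Python: it recomputes both from the sums of squares and degrees of freedom printed in the ANOVA table, using SciPy's F distribution. The variable names are made up for this example.

```python
from scipy import stats

# Values copied from the ANOVA table.
ss_between, df_between = 936893650.930, 3
ss_within, df_within = 31631098668.992, 99

# Mean square = sum of squares / degrees of freedom.
ms_between = ss_between / df_between
ms_within = ss_within / df_within

# F is the ratio of between-group to within-group mean squares.
f_stat = ms_between / ms_within

# p-value: probability of an F at least this large if H0 is true.
p_value = stats.f.sf(f_stat, df_between, df_within)

print(f"F = {f_stat:.3f}, p = {p_value:.3f}")
```

An F value below 1 means the differences between the quartile groups are smaller than the spread within the groups, which is why the result is not significant.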

4. Age, GMAT total score, Work experience on Salary

When both IV and DV are continuous variables, what will be the test to understand the impact of IV
on DV?

Regression analysis!!

But it is always good to do correlation before regression analysis. Why? We did mention this in our
class. Why??
Because it is always better to understand a relationship before understanding the influence. We
need not do the regression if there is no relationship.

Let's first do correlation analysis. This is how we do it. You can refer to the class slides as well.

The output is as follows:


Correlations
                                  age      gmat_tot   work_yrs   salary
age        Pearson Correlation    1        -.079      .881**     .500**
           Sig. (2-tailed)                 .429       .000       .000
           N                      103      103        103        103
gmat_tot   Pearson Correlation    -.079    1          -.123      -.091
           Sig. (2-tailed)        .429                .217       .362
           N                      103      103        103        103
work_yrs   Pearson Correlation    .881**   -.123      1          .455**
           Sig. (2-tailed)        .000     .217                  .000
           N                      103      103        103        103
salary     Pearson Correlation    .500**   -.091      .455**     1
           Sig. (2-tailed)        .000     .362       .000
           N                      103      103        103        103
**. Correlation is significant at the 0.01 level (2-tailed).

What do we understand here?

Age and salary = 0.500 (significant at 0.01 level)

Work experience and salary = 0.455 (significant at 0.01 level)

Age and Work experience = 0.881 (significant at 0.01 level)
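
The 'Sig. (2-tailed)' values in the correlation table follow from the correlation coefficient and the sample size. As a rough check in Python (an illustration, not the SPSS procedure), each r can be converted to a t statistic with n - 2 degrees of freedom. The function name `correlation_p_value` is made up for this example, and because the r values are rounded to three decimals the p-values only match approximately.

```python
from math import sqrt
from scipy import stats

n = 103  # number of cases in the correlation table

# Pearson correlations copied from the table (rounded to 3 decimals).
r_values = {
    ("age", "salary"): 0.500,
    ("work_yrs", "salary"): 0.455,
    ("age", "work_yrs"): 0.881,
    ("gmat_tot", "salary"): -0.091,
}


def correlation_p_value(r, n):
    """Two-tailed p-value for H0: no linear correlation in the population."""
    t = r * sqrt((n - 2) / (1 - r * r))
    return 2 * stats.t.sf(abs(t), df=n - 2)


p_values = {pair: correlation_p_value(r, n) for pair, r in r_values.items()}

for pair, p in p_values.items():
    print(f"{pair[0]} and {pair[1]}: r = {r_values[pair]:.3f}, p = {p:.3f}")
```

The GMAT score is included to show the contrast: its correlation with salary is small and its p-value is far above 0.01.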

These three relationships were significant. However, we can see that two IVs (age and work
experience) are highly correlated. Do we have to be concerned about it?? Or can we go ahead??

What is our concern?? A high correlation coefficient between IVs could indicate multicollinearity.
So we will have to test for the assumption of multicollinearity. We can check the
VIF (variance inflation factor) to check the presence of multicollinearity.
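
With exactly two IVs, the VIF can be anticipated from their pairwise correlation, because the VIF of each predictor reduces to 1 / (1 - r²). The sketch below is a quick illustration in Python using the rounded correlation between age and work experience; the limit of 3.3 is the one this handout applies to the regression output, and the variable names are made up for this example.

```python
# Correlation between the two IVs, copied from the correlation table.
r_age_work = 0.881

# Tolerance is the share of one IV NOT explained by the other IV.
tolerance = 1 - r_age_work ** 2

# VIF is the reciprocal of tolerance.
vif = 1 / tolerance

VIF_LIMIT = 3.3   # allowable limit used in this handout
exceeds_limit = vif > VIF_LIMIT

print(f"tolerance = {tolerance:.3f}, VIF = {vif:.3f}, above limit: {exceeds_limit}")
```

The value is close to the VIF that SPSS reports in the regression output; the small gap comes from rounding the correlation to three decimals.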

What are the other assumptions that we need to test??

Normality, Randomness, Linearity, Data Independence!!!


a. Normality (K-S Test)

H0: The data follows a normal distribution

H1: The data is not normally distributed

The output is as follows:


One-Sample Kolmogorov-Smirnov Test
                                             age        work_yrs   salary
N                                            103        103        103
Normal Parameters (a,b)     Mean             26.78      3.68       103030.74
                            Std. Deviation   3.272      3.010      17868.801
Most Extreme Differences    Absolute         .182       .249       .196
                            Positive         .182       .249       .196
                            Negative         -.140      -.201      -.123
Test Statistic                               .182       .249       .196
Asymp. Sig. (2-tailed)                       .000 (c)   .000 (c)   .000 (c)
a. Test distribution is Normal.
b. Calculated from data.
c. Lilliefors Significance Correction.

The output suggests that we reject the null hypothesis (p < 0.05 for all three variables), so
non-normality exists. But according to Chou & Bentler (1995), we can check the skewness and
kurtosis values before concluding on normality.

                     N           Skewness                 Kurtosis
                     Statistic   Statistic   Std. Error   Statistic   Std. Error
work_yrs             103         2.553       .238         7.438       .472
salary               103         3.274       .238         18.496      .472
age                  103         1.980       .238         5.371       .472
Valid N (listwise)   103

The permissible range is -3 to +3 for skewness and -10 to +10 for kurtosis. The values for age
and work_yrs fall within these limits, so we can treat those two variables as approximately normal.
However, salary (skewness 3.274, kurtosis 18.496) falls outside both limits, suggesting
non-normality with high peakedness.
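
The rule applied above is a simple threshold check, so it can be written down explicitly. The sketch below is an illustration in Python using the skewness and kurtosis statistics from the table and the limits quoted in the text; the names `shape` and `within_limits` are made up for this example.

```python
# Permissible limits quoted in the text.
SKEW_LIMIT = 3.0
KURT_LIMIT = 10.0

# (skewness, kurtosis) statistics copied from the table.
shape = {
    "work_yrs": (2.553, 7.438),
    "salary": (3.274, 18.496),
    "age": (1.980, 5.371),
}

# A variable passes only if BOTH statistics are inside their limits.
within_limits = {
    name: abs(skew) <= SKEW_LIMIT and abs(kurt) <= KURT_LIMIT
    for name, (skew, kurt) in shape.items()
}

for name, ok in within_limits.items():
    verdict = "within limits" if ok else "outside limits"
    print(f"{name}: {verdict}")
```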

b. Randomness (Runs test)

H0: The responses collected are random

H1: The responses collected are non-random


The output is as follows:

Runs Test
                         age      work_yrs   salary
Test Value (a)           26       3          100000
Cases < Test Value       45       47         46
Cases >= Test Value      58       56         57
Total Cases              103      103        103
Number of Runs           42       50         8
Z                        -1.948   -.420      -8.798
Asymp. Sig. (2-tailed)   .051     .674       .000
a. Median

The result suggests that salary data is non-random as we rejected the null hypothesis.
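
To see how the Z values in the Runs Test table arise, the sketch below recomputes them in Python from the case counts and the number of runs, using the standard large-sample (normal approximation) formula for the runs test without a continuity correction. It is an illustration only; the function name `runs_test_z` is made up for this example.

```python
from math import erfc, sqrt


def runs_test_z(n_below, n_at_or_above, runs):
    """Z statistic and two-tailed p-value for the runs test."""
    n = n_below + n_at_or_above
    pairs = 2 * n_below * n_at_or_above
    expected_runs = pairs / n + 1
    variance = pairs * (pairs - n) / (n ** 2 * (n - 1))
    z = (runs - expected_runs) / sqrt(variance)
    p = erfc(abs(z) / sqrt(2))   # two-tailed p from the standard normal
    return z, p


# (cases < test value, cases >= test value, number of runs) from the table.
table = {
    "age": (45, 58, 42),
    "work_yrs": (47, 56, 50),
    "salary": (46, 57, 8),
}

results = {name: runs_test_z(*values) for name, values in table.items()}

for name, (z, p) in results.items():
    print(f"{name}: Z = {z:.3f}, p = {p:.3f}")
```

Salary has only 8 runs where about 52 would be expected under randomness, which is what drives its very large negative Z.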

c. Linearity (Test of Linearity)

H0: The relationship is non-linear

H1: The relationship is linear


The output is as follows:

The result suggests that we can reject the null hypothesis in both cases. Hence, the relationship is
linear.

d&e. Data independence and multicollinearity can be tested while doing the regression analysis.

Let's do Regression Analysis

H1: Age influences the Salary of a graduate

H2: Work experience influences the Salary of a graduate

How do we do it?

You can check the class slides to understand how to execute regression analysis.

It is shown below.
The output is as follows:

Model Summary (b)
Model   R          R Square   Adjusted R Square   Std. Error of the Estimate   Durbin-Watson
1       .501 (a)   .251       .236                15622.490                    1.238
a. Predictors: (Constant), work_yrs, age
b. Dependent Variable: salary

ANOVA (a)
Model            Sum of Squares     df    Mean Square      F        Sig.
1   Regression    8161772405.307     2    4080886202.654   16.721   .000 (b)
    Residual     24406219914.615   100     244062199.146
    Total        32567992319.922   102
a. Dependent Variable: Salary
b. Predictors: (Constant), work_yrs, age

Coefficients (a)
                 Unstandardized Coefficients   Standardized Coefficients                   Collinearity Statistics
Model            B           Std. Error        Beta                        t       Sig.   Tolerance   VIF
1   (Constant)   36967.455   23323.769                                     1.585   .116
    age          2413.760    997.441           .442                        2.420   .017   .225        4.451
    work_yrs     388.835     1084.015          .066                        .359    .721   .225        4.451
a. Dependent Variable: salary

The result suggests that the Durbin-Watson value (1.238) is outside the permissible range of 1.5 to
2.5, showing a threat to data independence. The VIF values (4.451) are also above the allowable
limit of 3.3, indicating multicollinearity.

The regression results suggest that age significantly impacts the graduate's Salary while the
influence of work experience on Salary is not significant at the 5% level.

The adjusted R squared value of 0.236 states that the model (age and work experience together)
explains 23.6 per cent of the variance in Salary.
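
The figures in the three regression tables are tied together by simple arithmetic. The sketch below is an illustration in Python: it recomputes R squared, adjusted R squared, F, and the t values from the numbers printed in the output, and applies the Durbin-Watson range used in this handout. The variable names are made up for this example.

```python
n, k = 103, 2   # number of cases and number of predictors (age, work_yrs)

# Sums of squares copied from the regression ANOVA table.
ss_regression = 8161772405.307
ss_residual = 24406219914.615
ss_total = ss_regression + ss_residual

# R squared: share of the variance in salary explained by the model.
r_squared = ss_regression / ss_total

# Adjusted R squared penalises the number of predictors.
adj_r_squared = 1 - (1 - r_squared) * (n - 1) / (n - k - 1)

# F = regression mean square / residual mean square.
f_stat = (ss_regression / k) / (ss_residual / (n - k - 1))

# t = unstandardized coefficient / its standard error.
t_age = 2413.760 / 997.441
t_work_yrs = 388.835 / 1084.015

# Durbin-Watson check against the 1.5 to 2.5 range used in the text.
durbin_watson = 1.238
independence_ok = 1.5 <= durbin_watson <= 2.5

print(f"R2 = {r_squared:.3f}, adjusted R2 = {adj_r_squared:.3f}, F = {f_stat:.3f}")
print(f"t(age) = {t_age:.3f}, t(work_yrs) = {t_work_yrs:.3f}")
print(f"Durbin-Watson within range: {independence_ok}")
```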

Now!!

Should we go with the results of regression analysis?? Yes/No??

I would say No!!


Why? Because most of the assumptions were not satisfied. When assumptions are not satisfied,
hypothesis testing doesn't make sense.

What do you think about the data? Can Marie depend on this data to make the decision?
