
Business Statistics

Data Analysis Project

Submitted by- Bhanupratap Singh Sisodiya


Concordia ID - 40202831
Admitted to John Molson School of Business

Introduction
I was asked to perform an analysis using a sample of 73 firms, considering their cost
effectiveness and size. The firms were categorized by industry risk and by the importance
they place on analytics, each at a low, medium or high level. The analysis examines the
relationship between the cost effectiveness and size of the firms, as well as how these
two variables relate to industry risk and to the importance the firms place on analytics.
Statistic            1. Cost effectiveness   2. Size
Mean                 10.97329                658.9726
Standard Error       1.891222                88.92274
Median               6.08                    390
Mode                 4.07                    350
Standard Deviation   16.15861                759.7562
Sample Variance      261.1007                577229.5
Kurtosis             16.12227                6.689311
Skewness             3.793826                2.514191
Range                97.35                   3994
Minimum              0.2                     19
Maximum              97.55                   4013
Sum                  801.05                  48105
Count                73                      73
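As a quick consistency check on the summary table, the mean can be recomputed as sum divided by count and the standard error as standard deviation divided by the square root of the count. The sketch below uses only figures copied from the table; the function names are illustrative and not part of the original analysis.

```python
import math

def mean_from_sum(total, n):
    """Mean recomputed from the reported sum and count."""
    return total / n

def standard_error(sd, n):
    """Standard error of the mean: sample standard deviation over sqrt(n)."""
    return sd / math.sqrt(n)

# Figures copied from the summary table (all 73 firms).
cost = {"sum": 801.05, "count": 73, "sd": 16.15861}
size = {"sum": 48105, "count": 73, "sd": 759.7562}

for name, s in (("cost", cost), ("size", size)):
    print(name,
          round(mean_from_sum(s["sum"], s["count"]), 5),
          round(standard_error(s["sd"], s["count"]), 5))
```

Both recomputed values agree with the Mean and Standard Error rows of the table to the precision shown.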

1b. To check whether the distributions of COST and SIZE are symmetrical, I built a
frequency distribution for each variable and drew a histogram.

• Histogram: COST

• Histogram: SIZE

• As we can observe from both histograms, neither size nor cost is symmetrically
distributed. In fact, the data is strongly right-skewed in both cases: the skewness
is positive (3.79 for cost, 2.51 for size) and the mean is well above the median.
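The direction of skew can be read from the summary statistics alone: a positive skewness coefficient together with a mean above the median points to a long right tail. A minimal sketch of that rule, using the reported values for cost and size (the helper name is hypothetical):

```python
def skew_direction(mean, median, skewness):
    """Classify skew direction from summary statistics.

    Positive skewness with mean > median suggests a long right tail;
    negative skewness with mean < median suggests a long left tail.
    """
    if skewness > 0 and mean > median:
        return "right-skewed"
    if skewness < 0 and mean < median:
        return "left-skewed"
    return "inconclusive"

# Values copied from the descriptive statistics for all 73 firms.
print("cost:", skew_direction(10.97329, 6.08, 3.793826))
print("size:", skew_direction(658.9726, 390, 2.514191))
```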
1c. To compare the cost effectiveness and size of the firms across industry-risk levels,
I computed descriptive statistics separately for each risk group.

• Descriptive statistics for cost effectiveness of low-risk and medium-risk firms:

Statistic            Low risk cost   Medium risk cost
Mean                 8.44            20.50727
Standard Error       1.473641        7.839449
Median               5.27            13.57
Mode                 15              #N/A
Standard Deviation   11.41477        26.00051
Sample Variance      130.297         676.0265
Kurtosis             25.25309        10.02561
Skewness             4.400223        3.10657
Range                79.1            93.66
Minimum              0.2             3.89
Maximum              79.3            97.55
Sum                  506.4           225.58
Count                60              11

• Descriptive statistics for cost effectiveness of high-risk firms:

Statistic            High risk cost
Mean                 34.535
Standard Error       30.465
Median               34.535
Mode                 #N/A
Standard Deviation   43.08402
Sample Variance      1856.232
Kurtosis             #DIV/0!
Skewness             #DIV/0!
Range                60.93
Minimum              4.07
Maximum              65
Sum                  69.07
Count                2

Next, the descriptive statistics for SIZE by industry risk.

• Descriptive statistics for size of low-risk firms:

Statistic            Low risk size
Mean                 696.85
Standard Error       105.7365
Median               388.5
Mode                 350
Standard Deviation   819.0314
Sample Variance      670812.4
Kurtosis             5.381978
Skewness             2.340202
Range                3994
Minimum              19
Maximum              4013
Sum                  41811
Count                60

• Descriptive statistics for size of medium-risk firms:

Statistic            Medium risk size
Mean                 517.9091
Standard Error       109.6382
Median               440
Mode                 #N/A
Standard Deviation   363.6288
Sample Variance      132225.9
Kurtosis             -1.46794
Skewness             0.430084
Range                966
Minimum              139
Maximum              1105
Sum                  5697
Count                11

• Descriptive statistics for size of high-risk firms:

Statistic            High risk size
Mean                 298.5
Standard Error       218.5
Median               298.5
Mode                 #N/A
Standard Deviation   309.0057
Sample Variance      95484.5
Kurtosis             #DIV/0!
Skewness             #DIV/0!
Range                437
Minimum              80
Maximum              517
Sum                  597
Count                2

• Across the risk groups, cost effectiveness was lowest among the low-risk firms (mean
8.44, median 5.27). Their data is not symmetric: the mean lies well above the median and
the skewness is 4.40, so the distribution is strongly right-skewed. By contrast, cost
effectiveness was higher in the medium-risk firms (mean 20.51, median 13.57) and highest
in the high-risk firms (mean and median 34.54). Overall, high-risk firms seem to have the
highest cost effectiveness, although that group contains only 2 firms, so the comparison
should be read with caution.

• In terms of size, the low-risk firms' data is right-skewed (skewness 2.34): the median,
388.5, is well below the mean, 696.85. Among medium-risk firms the distribution is closer
to symmetric (skewness 0.43), with a mean size of 517.9 and a median of 440. The high-risk
firms had a mean and median of 298.5.
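One way to cross-check the group means against the overall figure is to recombine them as a count-weighted average, which should reproduce the overall mean size of 658.9726. The sketch below uses the counts and means from the three size tables; the function name is illustrative.

```python
def pooled_mean(groups):
    """Count-weighted average of group means; groups is a list of (count, mean)."""
    total_n = sum(n for n, _ in groups)
    return sum(n * m for n, m in groups) / total_n

# (count, mean size) for low-, medium- and high-risk firms, from the tables above.
size_by_risk = [(60, 696.85), (11, 517.9091), (2, 298.5)]

print(round(pooled_mean(size_by_risk), 4))
```

The same check works for cost effectiveness, and it is a convenient way to spot a mistyped group mean.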

1d.
• Descriptive statistics for cost effectiveness of low importance firms:

Statistic            Low importance cost
Mean                 6.058182
Standard Error       1.357232
Median               5.25
Mode                 #N/A
Standard Deviation   4.501428
Sample Variance      20.26286
Kurtosis             0.253617
Skewness             0.852926
Range                14.72
Minimum              0.28
Maximum              15
Sum                  66.64
Count                11

• Descriptive statistics for cost effectiveness of medium importance firms:

Statistic            Medium importance cost
Mean                 12.39706
Standard Error       2.646439
Median               6.08
Mode                 #N/A
Standard Deviation   18.89936
Sample Variance      357.1857
Kurtosis             11.07349
Skewness             3.236072
Range                97.35
Minimum              0.2
Maximum              97.55
Sum                  632.25
Count                51
• Descriptive statistics for cost effectiveness of high importance firms:

Statistic            High importance cost
Mean                 9.287273
Standard Error       1.750181
Median               9.13
Mode                 #N/A
Standard Deviation   5.804695
Sample Variance      33.69448
Kurtosis             -0.88403
Skewness             0.266543
Range                18.45
Minimum              0.93
Maximum              19.38
Sum                  102.16
Count                11

• Next, the descriptive statistics for firm size by importance of analytics.

• Descriptive statistics for size of low importance firms:

Statistic            Low importance size
Mean                 923.0909
Standard Error       347.1019
Median               402
Mode                 #N/A
Standard Deviation   1151.207
Sample Variance      1325277
Kurtosis             5.551639
Skewness             2.321453
Range                3832
Minimum              181
Maximum              4013
Sum                  10154
Count                11

• Descriptive statistics for size of medium importance firms:

Statistic            Medium importance size
Mean                 630.6078
Standard Error       100.5514
Median               350
Mode                 350
Standard Deviation   718.0803
Sample Variance      515639.3
Kurtosis             4.54756
Skewness             2.237093
Range                3138
Minimum              19
Maximum              3157
Sum                  32161
Count                51

• Descriptive statistics for size of high importance firms:

Statistic            High importance size
Mean                 526.3636
Standard Error       117.7045
Median               487
Mode                 #N/A
Standard Deviation   390.3817
Sample Variance      152397.9
Kurtosis             1.382614
Skewness             1.162375
Range                1294
Minimum              110
Maximum              1404
Sum                  5790
Count                11

• Grouping the firms by the importance they place on analytics, the mean and median cost
effectiveness of the low importance firms were 6.06 and 5.25, which are close enough to
suggest a roughly symmetric distribution. The mean for the medium importance firms was
much higher, at 12.40, but the median (6.08) was similar to that of the low importance
firms. The medium importance data is therefore right-skewed (skewness 3.24), pulled upward
by a few large values, and its overall level is higher than that of the low importance
firms. Both of these groups lean to the right in their histograms, although the low
importance group does so only mildly (skewness 0.85). For the high importance firms the
data is again roughly symmetric (mean 9.29, median 9.13), with a mean above that of the
low importance firms but below that of the medium importance firms. Comparing the three
groups, the medium importance firms had both the highest mean cost effectiveness and the
widest range (97.35, against 14.72 and 18.45 for the low and high importance firms).

• In terms of size, the low importance firms had a mean of 923, far above their median of
402, which again indicates right-skewed data. A similar pattern was found in the medium
importance firms, with a mean of 631 and a median of 350, both lower than for the low
importance firms. The high importance firms have a mean of 526 and a median of 487,
indicating a more nearly symmetric distribution. It can be concluded that importance
ranking does not appear to have a substantial impact on the size of the firms.

2.

• Stem and leaf diagram for all the firms according to size (stem = hundreds, leaf = last
two digits, e.g. 40 | 13 = 4013):

Stem   Leaf
0      19,54,80
1      10,10,20,39,50,60,67,72,81,81,81,98
2      0,0,10,30,70,86
3      1,10,20,23,29,39,50,50,50,50,53,87,90,98,98
4      2,40,49,82,87
5      1,1,17,17,32,38
6      0,0
7      4,71,79
8      2,18,43,96
9      41
10     0
11     5
12     96
13
14     4
15
16
17
18
19     15
20     95
21
22     47
23
24     83
25
26
27
28     85
29
30
31     57
32
33
34
35
36
37
38
39
40     13

• After plotting the box and whisker plot and the stem and leaf diagram, it can be
seen that the data is skewed to the right. The data points 4013 and 3157 are far
above the median (390), which indicates that these values are outliers.
b. There are 7 outliers in total which are above the value of 1500.
c. After plotting a box and whisker plot of size for each of the three industry-risk groups,
the widest spread is seen in group B (medium-risk firms), while groups A and C (low- and
high-risk firms) show less spread, as can be seen in the box and whisker plot.
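The count of values above 1500 can be reproduced from the upper part of the stem and leaf diagram. The sketch below decodes stems 9 and above only, assuming each stem is the hundreds part and each leaf the last two digits (so stem 40, leaf 13 is 4013); the function name and the 1500 cut-off variable are illustrative.

```python
def decode_stem_leaf(stem_leaf):
    """Turn {stem: [leaves]} into sorted values, with value = 100 * stem + leaf."""
    return sorted(100 * stem + leaf
                  for stem, leaves in stem_leaf.items()
                  for leaf in leaves)

# Non-empty stems from 9 upward, copied from the stem and leaf diagram.
upper_stems = {
    9: [41], 10: [0], 11: [5], 12: [96], 14: [4],
    19: [15], 20: [95], 22: [47], 24: [83], 28: [85],
    31: [57], 40: [13],
}

CUTOFF = 1500  # threshold quoted in the text
values = decode_stem_leaf(upper_stems)
above_cutoff = [v for v in values if v > CUTOFF]

print(above_cutoff)
print(len(above_cutoff))
```

This is a simple threshold count, not the 1.5 x IQR rule that a box plot usually applies, so the two approaches may flag different points.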

• Pie chart for industry risk (legend: Low, Medium, High)

• Histogram for industry risk

• Pie chart for importance of analysis (legend: Low, Medium, High)

• Histogram of frequency by analysis category
From the pie charts and histograms, it can be seen that for importance of analytics,
medium importance firms are the most frequent (51 of 73), with high and low importance
firms equally frequent (11 each). For industry risk, low-risk firms dominate (60 of 73),
followed by medium-risk (11) and high-risk (2) firms. Overall, the analytics category is
symmetric around the medium level, whereas the industry risk data is concentrated in the
low category, with a thin tail toward high risk.

3.

Scatter plot: Cost vs Size (x-axis: Cost, 0 to 120; y-axis: Size, 0 to 4500)

As we can see, the bivariate relationship between cost and size is negative (inverse) and
roughly linear, though not strictly proportional. Overall, as size increases, cost
effectiveness tends to decrease.

4. Probability tables

• Risk analysis, counts (rows: importance of analytics; columns: industry risk)

Importance     High risk   Low risk   Medium risk   Total
High           1           9          1             11
Low            0           11         0             11
Medium         1           40         10            51
Final total    2           60         11            73

• Risk analysis, row percentages (share of each risk level within an importance level)

Importance     High risk   Low risk   Medium risk   Total
High           9.09%       81.82%     9.09%         100%
Low            0%          100.00%    0%            100%
Medium         1.96%       78.43%     19.61%        100%
Final total    2.74%       82.19%     15.07%        100%

• Firm analysis, counts (rows: industry risk; columns: importance of analytics)

Risk           High importance   Low importance   Medium importance   Total
High           1                 0                1                   2
Low            9                 11               40                  60
Medium         1                 0                10                  11
Final total    11                11               51                  73

• Firm analysis, row percentages (share of each importance level within a risk level)

Risk           High importance   Low importance   Medium importance   Total
High           50%               0%               50%                 100%
Low            15%               18.33%           66.67%              100%
Medium         9.09%             0%               90.91%              100%
Final total    15.07%            15.07%           69.86%              100%
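The counts above are enough to derive marginal, joint, conditional and union probabilities directly. The following is a minimal sketch using exact fractions; the dictionary layout and function names are illustrative choices, not part of the original workbook.

```python
from fractions import Fraction

# counts[importance][risk], copied from the count table above.
counts = {
    "high":   {"high": 1, "low": 9,  "medium": 1},
    "low":    {"high": 0, "low": 11, "medium": 0},
    "medium": {"high": 1, "low": 40, "medium": 10},
}
TOTAL = sum(sum(row.values()) for row in counts.values())

def p_importance(level):
    """Marginal probability of an importance level."""
    return Fraction(sum(counts[level].values()), TOTAL)

def p_risk(level):
    """Marginal probability of an industry-risk level."""
    return Fraction(sum(row[level] for row in counts.values()), TOTAL)

def p_joint(importance, risk):
    """Joint probability: both categories hold for the same firm."""
    return Fraction(counts[importance][risk], TOTAL)

def p_importance_given_risk(importance, risk):
    """Conditional probability of an importance level within a risk level."""
    return p_joint(importance, risk) / p_risk(risk)

def p_union(importance, risk):
    """Probability that at least one of the two categories holds."""
    return p_importance(importance) + p_risk(risk) - p_joint(importance, risk)

print(float(p_importance("low")))
print(float(p_joint("high", "medium")))
print(float(p_importance_given_risk("high", "high")))
print(float(p_union("medium", "low")))
```

Keeping joint and conditional probabilities as separate functions makes it harder to confuse an "and" question (divide by all 73 firms) with a "given" question (divide by the row or column total).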

a. In each pair of tables, the first shows the count of firms in every combination of
industry risk and importance of analytics, and the second shows the same data in
percentage form (each row sums to 100%). I made two versions: one for the risk
analysis and one for the firm analysis.
b. The probability that a firm places a low emphasis on analytics is 11/73 = 15.07%.
c. The probability of selecting a firm that is in a medium-risk industry and places a high
importance on analytics is 1/73 = 1.37%. The 9.09% shown in the tables is the
conditional probability of high importance given a medium-risk industry (1/11).
d. The probability of selecting a firm that is in a high-risk industry and places a high
importance on analytics is 1/73 = 1.37%. The 50% shown in the tables is the
conditional probability of high importance given a high-risk industry (1/2).
e. The probability of selecting a firm that is in a low-risk industry or places a medium
importance on analytics is (60 + 51 - 40)/73 = 71/73 = 97.26%.

f. Single-factor ANOVA

Summary
Groups      Count   Sum   Average   Variance
Column 1    2       2     1         0
Column 2    3       60    20        301
Column 3    2       11    5.5       40.5

ANOVA
Source of variation   SS         df   MS         F        P-value    F crit
Between groups        501.2143   2    250.6071   1.5602   0.315581   6.944272
Within groups         642.5      4    160.625
Total                 1143.714   6

• As the p-value (0.3156) is greater than 0.05, and F (1.5602) is below F crit (6.9443),
the differences between the groups are not significant, so the group of high-risk firms
and medium importance can be treated as independent.
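The ANOVA figures can be reproduced by hand from sums of squares. The sketch below assumes the three groups are the non-zero cell counts in each industry-risk column of the count table (high: 1, 1; low: 9, 11, 40; medium: 1, 10), which matches the Count, Sum and Variance rows of the summary; that reading of the input is an inference, and the function name is illustrative.

```python
def one_way_anova(groups):
    """Return (SS between, SS within, F) for a list of groups of numbers."""
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]

    # Between-group variation: group size times squared distance of group mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Within-group variation: squared distance of each value from its group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)

    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return ss_between, ss_within, f_stat

# Assumed inputs: non-zero cell counts per risk column (high, low, medium).
groups = [[1, 1], [9, 11, 40], [1, 10]]
ss_between, ss_within, f_stat = one_way_anova(groups)
print(round(ss_between, 4), round(ss_within, 4), round(f_stat, 4))
```

Note that an ANOVA on cell counts compares average counts per column; a chi-square test on the contingency table is the more usual check of independence between two categorical variables.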

5. The relationship between the cost and size of the firms is inverse, with a correlation
coefficient of -0.21332. With a magnitude of about 0.21, the strength can be described
as weak. On average, as cost increases, size tends to decrease, and vice versa.
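For reference, a correlation coefficient of this kind can be computed and labelled as follows. This is a sketch only: the toy data is hypothetical (it is not the 73-firm sample), and the 0.3 and 0.7 cut-offs for "weak", "moderate" and "strong" are one common rule of thumb, assumed here for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def strength(r):
    """Label |r| using assumed rule-of-thumb cut-offs (0.3 and 0.7)."""
    size = abs(r)
    if size < 0.3:
        return "weak"
    if size < 0.7:
        return "moderate"
    return "strong"

# Hypothetical toy data, NOT the project's firms.
toy_cost = [2.0, 5.0, 9.0, 20.0, 60.0]
toy_size = [900, 400, 1200, 350, 300]

print(round(pearson_r(toy_cost, toy_size), 3))
print(strength(-0.21332))  # the coefficient reported in the text
```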
___________________________________________________________________________

Part 2

For the second half of the assignment, data for 112 employees was collected to
determine whether there is discrimination between genders and age groups. The
variables used are age, gender, experience and education, which were compared against
the employees' salaries at the time of the study (CSAL) and at the time of hiring (HSAL).
I used tables and graphs to illustrate the behaviour clearly.

• Frequency distribution values

Variable   Highest value   Smallest value   Class length
CSAL       32000           7260             3534
HSAL       7800            4200             514
RANK       98              62               5
AGE        740             276              66
EXP.       265             0                37.9

The formula I used for calculating the class length is: class length = (largest
measurement - smallest measurement) / number of classes, with seven classes for each variable.
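The class lengths in the table follow from this formula with seven classes, which is the number of classes used in the frequency tables. A minimal sketch (the function name is illustrative):

```python
def class_width(largest, smallest, n_classes=7):
    """Class length = (largest - smallest) / number of classes."""
    return (largest - smallest) / n_classes

# (highest value, smallest value) copied from the table above.
ranges = {
    "CSAL": (32000, 7260),
    "HSAL": (7800, 4200),
    "RANK": (98, 62),
    "AGE": (740, 276),
    "EXP": (265, 0),
}

for name, (hi, lo) in ranges.items():
    print(name, round(class_width(hi, lo), 2))
```

The reported class lengths are these values rounded (to a whole number, or to one decimal for EXP).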
• Salary during the study (CSAL)

Classes        Upper limit   Frequency   Relative frequency   Frequency (%)
7260-10793     10793         51          0.4554               45.54
10794-14327    14327         51          0.4554               45.54
14328-17861    17861         9           0.0804               8.04
17862-21395    21395         0           0                    0
21396-24929    24929         0           0                    0
24930-28463    28463         0           0                    0
28464-32000    32000         1           0.0089               0.89

Scatter graph: Frequency (y-axis, 0 to 60) against Salary (x-axis, 5000 to 35000)

• The above distribution is right-skewed because there are 2 values that are outliers,
much higher than the others. Three classes contain no data and the majority of the data
lies in the first two classes, so the outliers pull the mean upward. Therefore, the mean
is not a good indicator of the average value for this data.

• Salary at the time of hiring (HSAL)

Classes      Upper limit   Frequency   Relative frequency   Frequency (%)
4200-4713    4713          35          0.3125               31.25
4714-5228    5228          13          0.1161               11.61
5229-5742    5742          16          0.1429               14.29
5743-6256    6256          10          0.0893               8.93
6257-6770    6770          28          0.25                 25
6771-7285    7285          7           0.0625               6.25
7286-7800    7800          3           0.0268               2.68

I could not draw the graph because the data was very scattered, so the analysis is based on
the frequency table alone. This data is also right-skewed, but less so than CSAL.
From the HSAL data, I conclude that the median and mode are important to consider
alongside the mean in order to describe the data accurately.
• Now, I will do the same with ages (AGE)

Classes    Upper limit   Frequency   Relative frequency   Frequency (%)
276-341    341           67          0.5982               59.82
342-407    407           34          0.3036               30.36
408-473    473           1           0.0089               0.89
474-539    539           4           0.0357               3.57
540-605    605           1           0.0089               0.89
606-671    671           1           0.0089               0.89
672-740    740           4           0.0357               3.57

I could not draw the graph for age because the data is so scattered that it was difficult
to plot. About 60% of the measurements fall in the first class and about 90% in the first
two classes. This data is more right-skewed than all the other data categories. Overall
there are 11 outliers in this data group (the observations above the second class).
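The relative and cumulative shares quoted for the age classes can be recomputed from the frequency column. A short sketch, with illustrative function names:

```python
from itertools import accumulate

def relative(freqs):
    """Relative frequency of each class."""
    total = sum(freqs)
    return [f / total for f in freqs]

def cumulative(freqs):
    """Cumulative relative frequency up to and including each class."""
    return list(accumulate(relative(freqs)))

# Frequencies of the seven age classes, copied from the table above.
age_freq = [67, 34, 1, 4, 1, 1, 4]

for rel, cum in zip(relative(age_freq), cumulative(age_freq)):
    print(f"{rel:.4f}  {cum:.4f}")
```

The cumulative column shows how quickly the data piles up in the lowest classes, which is what makes the long upper tail stand out.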

2. 90% confidence interval of CSAL for both genders:

• Intervals:

Group     90% CI
Male      (12903.5, 13280.96)
Female    (10392.44, 10581.68)
Both      (11157.67, 11444.69)

Comments: The 90% CI for the average CSAL of men lies entirely above that for women.
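The reported intervals are consistent with the usual normal-based form, mean ± z × standard error, where z is the two-sided 90% critical value (about 1.645). The sketch below reproduces them from the CSAL summary table given under question 3, treating the third numeric column of that table as the standard error; that reading is an assumption made here because it matches the intervals, and the function name is illustrative.

```python
from statistics import NormalDist

def normal_ci(mean, se, confidence=0.90):
    """Two-sided normal confidence interval: mean +/- z * se."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return mean - z * se, mean + z * se

# (mean, assumed standard error) from the CSAL summary table in question 3.
csal = {
    "Male": (13092.23, 114.73),
    "Female": (10487.06, 57.52),
    "Both": (11301.18, 87.24),
}

for group, (mean, se) in csal.items():
    low, high = normal_ci(mean, se)
    print(group, round(low, 2), round(high, 2))
```

Small differences in the last decimal place come from using the exact critical value rather than the rounded 1.645.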
3. The difference between the salaries of men and women during the study is shown using
graphs and tables. In the scatter graph, the male CSAL trendline is above the female
trendline. In the line graph, the salaries for men are again higher, staying above 9000,
while female salaries stay above 7000 only. Therefore, the data supports the conclusion
that men were paid more than women during the study.

CSAL     Population mean   Standard deviation   Sample standard deviation
Male     13092.23          3839.48              114.73
Female   10487.06          1924.96              57.52
Both     11301.18          2919.55              87.24

Line graph: CSAL (y-axis, 0 to 35000) by observation (x-axis, 1 to 33); series: Men CSAL, Female CSAL

4. The data does not indicate a strong relationship between age and salary at the time of
hiring. The correlation coefficient is 0.4646, which indicates a low-to-moderate positive
correlation, so there is not enough evidence to support the belief that the two variables
are strongly related. The coefficient of variation was approximately 0.2159, which suggests
that the data had a lot of variability.

Scatter plot: HSAL during hiring; AGE (y-axis, 0 to 800) against HSAL (x-axis, 4000 to 8000)

Gender   Count   Average HSAL   Average CSAL
Female   77      5233           10487
Male     35      6262           13092
Total    112     5555           11301
5.

• Out of the 77 females, 53 had salaries below the average CSAL.
• To test the management's claim that no more than 50% of the female employees had
salaries below the average CSAL, I conducted a hypothesis test. The null hypothesis
is that no more than 50% of the female employees had income below the average
CSAL (H0: p ≤ 0.5). The alternative hypothesis is that more than 50% did
(H1: p > 0.5), so the test is right-tailed. I used 5% as the significance level.

• I got a z-score of 3.3049 and a p-value of 0.0005. Since the z-score is higher
than the critical value (3.3049 > 1.6449), H0 is rejected. So we have enough
evidence to say that the claim is not valid. In conclusion, more than half
of the females have salaries below the average CSAL.
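The test statistic and p-value above can be reproduced with a one-sample z-test for a proportion, using the standard error under the null hypothesis. A minimal sketch with only the figures from the text (53 of 77, p0 = 0.5, 5% level); the function name is illustrative.

```python
import math
from statistics import NormalDist

def right_tailed_proportion_z(successes, n, p0):
    """Right-tailed z-test for H0: p <= p0 against H1: p > p0."""
    p_hat = successes / n
    se = math.sqrt(p0 * (1 - p0) / n)   # standard error under H0
    z = (p_hat - p0) / se
    p_value = 1 - NormalDist().cdf(z)   # area in the upper tail
    return z, p_value

z, p_value = right_tailed_proportion_z(53, 77, 0.5)
critical = NormalDist().inv_cdf(0.95)   # 5% significance, right tail

print(round(z, 4), round(critical, 4))
print("reject H0" if z > critical else "fail to reject H0")
```

Comparing z with the critical value and comparing the p-value with 0.05 are equivalent decisions, so either can be reported.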
6. When looking at the salary at the time of hiring, men have a greater presence in the high
income intervals.

• The data in question 3 suggests a relationship between the salaries of employees and their
gender. The scatter graph illustrated a significant pay gap between men and women, with
men having the higher salaries.

• Finally, 53 of the 77 females (about 69%) had income below the average CSAL for the
company as a whole, which is more than half of the female employees.
___________________________________________________________________________

End of Assignment
