T.A. Pai Management Institute Manipal: Submitted To: Prof. Sudhindra S
Submitted by:
Group AN1
Ankur Inani-18S711
Darshika Goel-18S716
Mayur Phalak-18S726
Prithviraj Padgalwar-18S734
Thomas Kuncheria-18S758
Executive summary: Using data on the salaries of different classes of employees in each town of
France, we applied relevant statistical techniques to analyze the salary information of various
groups.
Introduction: We used simple random sampling to select samples from the population. The margin
of error and interval estimates were used to find the interval in which the population mean
lies. We determined the sample size for an interval estimate of a population mean. Hypothesis
testing involved gathering evidence in support of a research hypothesis. We began with the
alternative hypothesis, stated as the conclusion that the researcher wants to support. The null
hypothesis was treated as an assumption to be challenged; when the sample data gave sufficient
statistical evidence against it, we rejected the null hypothesis. We carried out hypothesis
tests about a population mean using the p-value approach, and also used the critical value
approach. We made inferences about the difference between two population means when the
standard deviations are unknown. Inference about a population variance was done using the
chi-square distribution. We also ran a test of independence between two categorical variables,
and used matched samples to make inferences about the difference between two population means.
Problem statement
The National Institute of Statistics and Economic Studies, France has released data on the
salaries of different classes of employees for each town of France. The data consists of 5136
records. It includes the mean net salary of the entire population, the mean net salary for
women and for men, the mean salary for different age groups, and the mean net salary for
different roles in an organization. The institute wanted to analyze the data to understand the
salary differences among the different sections of employees. INSEE (Institut national de la
statistique et des études économiques) is the national institute for statistics and economic
studies in France. It collects and publishes information about the French economy, various
economic indicators and economic factors about people, and carries out the national census.
We have taken data from a 2017 INSEE report on mean salary among different classes (gender
based, age based, post based) across 5136 towns. Our goal is to explore inequality between men
and women, younger and older workers, and working / social classes. We have formed questions to
determine whether salaries are equal among the different classes.
This data is also used by Espectra, a major employment agency in France. The agency wants to
hire professional employees at a lower rate with the help of this data, so it wants to know
whether there is any salary disparity among different classes that could reduce hiring costs.
Objectives of the study
1) To study how to use hypothesis testing in real-life situations to understand data and make
decisions
3) To apply relevant statistical inference techniques to business data and draw fitting
conclusions
4) To examine and justify the applications of various statistical hypothesis testing tools while
exploring business situations
Methodology
Source of data
The data was collected by the National Institute of Statistics and Economic Studies, France,
which collects, produces, analyzes and disseminates information on the French economy and
society.
Link: https://www.kaggle.com/etiennelq/french-employment-by-town
Group AN1.xlsx
1) The institute has issued reports on the salaries of different classes of employees for each
town of France. It reported that the mean salary per hour in 2017 is $14.84/hr for men and
$12.03/hr for women. Sufficient historical data is available for all towns on the mean salaries
of men and women, so the population standard deviations are taken as known: 3.17 for men and
1.78 for women.
a. What is the probability that a sample of the mean salary per hour for men in 100 towns will
provide a sample mean within $0.5 of the population mean of $14.84?
b. What is the probability that a sample of the mean salary per hour for women in 100 towns will
provide a sample mean within $0.5 of the population mean of $12.03?
First, we find the z values for the endpoints of each interval, using
z = (x̄ - µ)/(σ/sqrt(n))
a. The sample mean must lie between 14.34 and 15.34 to be within $0.5 of the population mean.
The standard error is 3.17/sqrt(100) = 0.317, so z = ±0.5/0.317 = ±1.58.
The required probability is the difference between the cumulative probabilities at these two
z values:
P(14.34 ≤ x̄ ≤ 15.34) = 0.8853
b. The sample mean must lie between 11.53 and 12.53 to be within $0.5 of the population mean.
The standard error is 1.78/sqrt(100) = 0.178, so z = ±0.5/0.178 = ±2.81.
P(11.53 ≤ x̄ ≤ 12.53) = 0.9950
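The two probabilities can be cross-checked numerically. The following is a minimal sketch using only the Python standard library; the helper name `prob_within` is ours, not part of the report, and the result may differ in the last digit from table look-ups because the table rounds z to two decimals. It should give roughly 0.885 for men and 0.995 for women.

```python
from math import sqrt
from statistics import NormalDist

def prob_within(margin, sigma, n):
    """P(|x_bar - mu| <= margin) when x_bar ~ Normal(mu, sigma / sqrt(n))."""
    z = margin / (sigma / sqrt(n))          # z value of the upper endpoint
    std_normal = NormalDist()               # standard normal, mean 0, sd 1
    return std_normal.cdf(z) - std_normal.cdf(-z)

# Inputs taken from question 1: sigma = 3.17 (men), 1.78 (women), n = 100 towns.
p_men = prob_within(0.5, 3.17, 100)
p_women = prob_within(0.5, 1.78, 100)
print(f"men: {p_men:.4f}, women: {p_women:.4f}")
```

The smaller standard deviation for women is what pushes the second probability closer to 1: the same $0.5 margin covers more standard errors.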
2) For France, the mean net salary is 13.7. A sample of 50 women showed a sample mean of
11.946. The population standard deviation is 2.93.
Formulate hypotheses for a test to determine whether the sample data supports the conclusion
that the mean female net salary of the population is less than the mean of 13.7 for the total
population.
Ho: µ ≥ 13.7
Ha: µ < 13.7
Test statistic
z = (11.946-13.7)/(2.93/sqrt50) = -4.23
Critical value
For α = .05 in a lower-tail test, -Z.05 = -1.645
Since z = -4.23 ≤ -1.645, we reject Ho.
Conclusion:
There is enough statistical evidence to infer that the null hypothesis is false. Hence, we can
say the mean net salary of women is less than the mean net salary of men and women combined.
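As a cross-check of this lower-tail test, here is a minimal sketch with the Python standard library. The function name `lower_tail_z_test` is our own; the inputs are the figures quoted in question 2. Because the test is one-tailed, the critical value comes from the lower 5% point of the standard normal.

```python
from math import sqrt
from statistics import NormalDist

def lower_tail_z_test(x_bar, mu0, sigma, n, alpha=0.05):
    """Return (z, p_value, z_crit) for Ho: mu >= mu0 vs Ha: mu < mu0, sigma known."""
    z = (x_bar - mu0) / (sigma / sqrt(n))
    p_value = NormalDist().cdf(z)           # area in the lower tail
    z_crit = NormalDist().inv_cdf(alpha)    # lower-tail critical value
    return z, p_value, z_crit

z, p_value, z_crit = lower_tail_z_test(11.946, 13.7, 2.93, 50)
reject = z <= z_crit
print(f"z = {z:.3f}, critical value = {z_crit:.3f}, reject Ho: {reject}")
```

Both the p-value and the critical value rule lead to rejecting Ho here, since z lies far below the critical value.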
3) The institute wanted to know whether the salaries earned by people are independent of
gender. It also wanted to know whether the p-value and critical value methods give the same
result.
For this we use the test of independence, in which a chi-square test uses sample data to test
for the independence of two categorical variables. The null hypothesis is that the two
categorical variables are independent. We take a sample of 200 males and females and ask them
whether they think salaries are dependent on gender.
Sample Results
Proportion of people who agree salaries are independent of gender = (105/200) = 0.525
Proportion of people who agree salaries are not independent of gender = (95/200) = 0.475
Computation of the chi-square test statistic for the test of independence between gender and
salaries
With r rows and c columns, the chi-square distribution has (r-1)(c-1) degrees of freedom,
provided the expected frequency is at least 5 for each cell. Thus, the degrees of freedom are
(2-1)(2-1) = 1.
[Bar chart: proportion of Yes and No responses for Male and Female respondents]
We can now use the upper tail of the chi-square distribution with 1 degree of freedom and the
p-value approach to determine whether the null hypothesis that salaries are independent of
gender can be rejected.
The χ² test statistic falls between the table values for upper-tail areas of 0.90 and 0.10, so
the corresponding p-value must be between 0.10 and 0.90. With p ≤ 0.05 we would reject the null
hypothesis. But here p is greater than 0.10, so we do not reject the null hypothesis: the sample
gives no evidence that salaries depend on gender.
The critical value approach leads to the same conclusion. With α = 0.05 and 1 degree of
freedom, the critical value for the chi-square test statistic is χ²_0.05 = 3.841. The upper-tail
rejection rule becomes
Reject Ho if χ² ≥ 3.841
Since the p-value exceeds 0.10, the test statistic is below 3.841, so Ho is not rejected.
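The mechanics of the test can be sketched as follows. The report's contingency table did not survive, so the cell counts below are HYPOTHETICAL: they were chosen only to match the reported column totals of 105 and 95 out of 200, and an equal split of 100 men and 100 women is also an assumption. The function name is ours. For 1 degree of freedom the upper-tail area of the chi-square distribution has a closed form, which lets the sketch stay within the standard library.

```python
from math import erfc, sqrt

def chi_square_independence_2x2(table):
    """Return (chi2, p_value) for a 2x2 table of observed counts (df = 1)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence: row total * column total / n
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed - expected) ** 2 / expected
    # With 1 degree of freedom, P(chi-square > x) = erfc(sqrt(x / 2)).
    p_value = erfc(sqrt(chi2 / 2))
    return chi2, p_value

# HYPOTHETICAL counts (not from the report): rows = Male, Female; columns = two answers.
observed = [[55, 45], [50, 50]]
chi2, p_value = chi_square_independence_2x2(observed)
reject = chi2 >= 3.841                      # critical value for alpha = 0.05, df = 1
print(f"chi2 = {chi2:.3f}, p = {p_value:.3f}, reject Ho: {reject}")
```

With these illustrative counts the p-value lands between 0.10 and 0.90, the same range the report describes, but the actual statistic depends on the real cell counts.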
4) The institute had the variance of the mean salaries per hour of the population as 6.548. A
new list was then prepared by the institute. Its administrators would like the variance of the
mean salaries per hour in the new list to remain at the historical level, and they want to
evaluate the variance of the new list. Use a level of significance of 0.05 to conduct the
hypothesis test. Check the result using the p-value and critical value methods.
B. Find the interval estimate of the population standard deviation at 95% confidence
Ho: σ² = 6.548
Rejection of Ho will indicate that a change in the variance has occurred and suggest that some
changes are needed to make the variance of the new list similar to the old value. A sample of
50 records from the new list is used for the analysis.
The sample of 50 mean salaries per hour gave a variance of 2.59. The value of the chi-square
test statistic is as follows
χ² = (n-1)s²/σ² = (49)(2.59)/6.548 = 19.38
Let us now compute the p-value. Using the chi-square distribution table for 49 degrees of
freedom and χ² = 19.38, the area in the upper tail is approximately 1. With p-value ≥ 0.05, we
do not reject the null hypothesis.
Now let us check the result using the critical value method. With α = 0.05, χ²_0.05 provides the
critical value for the upper-tail hypothesis test. For 49 degrees of freedom, χ²_0.05 = 66.34.
We reject Ho if χ² ≥ 66.34. Since our value of 19.38 is less than 66.34, we do not reject the
null hypothesis.
Conclusion: the p-value and critical value methods generated the same result.
B.
The institute was interested in finding the interval estimate of the population variance with
a sample of 50 mean salaries. The sample variance was 2.69 (using the same previous sample).
With a sample size of 50, we have 49 degrees of freedom. Since the level of significance is
0.05, we need χ²_0.025 and χ²_0.975:
(n-1)s²/χ²_0.025 ≤ σ² ≤ (n-1)s²/χ²_0.975
1.87 ≤ σ² ≤ 4.17
Taking the square root of these values provides the following 95% confidence interval for the
population standard deviation
1.37 ≤ σ ≤ 2.04
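The arithmetic for both parts can be sketched in a few lines. The Python standard library has no chi-square quantile function, so the two table values for 49 degrees of freedom (70.222 and 31.555) are entered as constants; they are our table look-ups, not figures quoted in the report. The function names are also ours. The sample variances are used exactly as the report states them for each part.

```python
from math import sqrt

def chi_square_stat(sample_var, n, sigma0_sq):
    """Test statistic (n - 1) * s^2 / sigma0^2 for a test about a variance."""
    return (n - 1) * sample_var / sigma0_sq

def variance_interval(sample_var, n, chi2_upper, chi2_lower):
    """Interval for sigma^2; chi2_upper / chi2_lower are the alpha/2 and 1 - alpha/2 table values."""
    df = n - 1
    return df * sample_var / chi2_upper, df * sample_var / chi2_lower

# ASSUMED table values for df = 49 (not given in the report).
CHI2_0025_DF49 = 70.222
CHI2_0975_DF49 = 31.555

stat = chi_square_stat(2.59, 50, 6.548)                 # hypothesis test part
var_low, var_high = variance_interval(2.69, 50, CHI2_0025_DF49, CHI2_0975_DF49)
sd_low, sd_high = sqrt(var_low), sqrt(var_high)         # interval for sigma
print(f"chi2 = {stat:.2f}")
print(f"{var_low:.2f} <= sigma^2 <= {var_high:.2f}")
print(f"{sd_low:.2f} <= sigma <= {sd_high:.2f}")
```

Note that the larger table value goes in the denominator of the lower limit, which is why the arguments are passed in that order.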
5) The institute takes two samples of salaries: mean net salary/hr for women and mean net
salary/hr for female executives. It is interested in finding the interval estimate of the
difference between the two population means with σ1 and σ2 unknown, taking 0.05 as the level of
significance.
Sample 1 (female executives): n1 = 60, sample mean x̄1 = 21.008, sample standard deviation
s1 = 2.322
Sample 2 (women): n2 = 50, sample mean x̄2 = 11.946, sample standard deviation s2 = 1.107
The interval estimate is
(x̄1 - x̄2) ± tα/2 sqrt(s1²/n1 + s2²/n2)
Degrees of freedom
df = (s1²/n1 + s2²/n2)² / [ (s1²/n1)²/(n1-1) + (s2²/n2)²/(n2-1) ]
Putting s1 = 2.322, n1 = 60, s2 = 1.107, n2 = 50 in the above formula and rounding down, we get
df = 87.
Now we develop the 95% confidence interval estimate of the difference between the two
population means by using t.025 = 1.988 for df = 87 with the sample values above:
(21.008 - 11.946) ± 1.988 sqrt(2.322²/60 + 1.107²/50) = 9.062 ± 0.672
Conclusion
The point estimate of the difference between the two population means of salary per hour is
9.062. The margin of error is 0.672, and the 95% confidence interval estimate of the difference
between the two population means is 8.39 to 9.734.
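The degrees of freedom and the interval can be re-derived directly from the sample summaries. This is a minimal sketch with function names of our choosing. The t value is not computed, since the standard library has no t quantile function: 1.988 is entered as a table look-up for 87 degrees of freedom, which is our assumption. Evaluating the degrees-of-freedom formula with the quoted inputs gives about 87.7, rounded down to 87, and a margin of error near 0.672.

```python
from math import floor, sqrt

def welch_df(s1, n1, s2, n2):
    """Degrees of freedom for the two-sample t interval with unknown, unequal sigmas."""
    v1, v2 = s1 ** 2 / n1, s2 ** 2 / n2
    return (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))

def diff_interval(x1, s1, n1, x2, s2, n2, t_crit):
    """Return (point estimate, margin of error, (lower, upper)) for mu1 - mu2."""
    se = sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)
    margin = t_crit * se
    diff = x1 - x2
    return diff, margin, (diff - margin, diff + margin)

# Sample 1 = female executives, sample 2 = women (figures from question 5).
df = floor(welch_df(2.322, 60, 1.107, 50))
T_CRIT = 1.988  # ASSUMED table value of t.025 for df = 87
diff, margin, (low, high) = diff_interval(21.008, 2.322, 60, 11.946, 1.107, 50, T_CRIT)
print(f"df = {df}, estimate = {diff:.3f}, margin = {margin:.3f}, CI = ({low:.2f}, {high:.3f})")
```

Rounding the degrees of freedom down is the conservative convention, since it gives a slightly larger t value and a slightly wider interval.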
6) Is it possible to conclude, using a .05 level of significance, that the average of the mean
net salaries of male executives is more than the average of the mean net salaries of female
executives?
σ1 = 3.42 (S.D. for population of mean net salaries of male executives)
σ2 = 2.32 (S.D. for population of mean net salaries of female executives)
1) Develop the hypotheses
Ho: µ1 - µ2 ≤ 0
Ha: µ1 - µ2 > 0
2) Specify the level of significance
α = .05
3) Compute the test statistic from the sampling distribution of x̄1 - x̄2
Z = ((x̄1 - x̄2) - D0)/sqrt(σ1²/n1 + σ2²/n2)
Z = ((25.946-20.972) - 0)/sqrt(3.421²/50 + 2.321²/50)
= 8.51
For z = 8.51, the p-value is approximately 0.
Critical value
For α = .05 in an upper-tail test, Z.05 = 1.645. Since z = 8.51 ≥ 1.645, we reject Ho.
Conclusion
At the .05 level of significance, the sample evidence indicates that the average of mean net
salary of male executive is greater than the average of mean net salary of female executive.
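The test statistic can be checked with a short sketch using the Python standard library. The function name `two_sample_z` is ours, and the inputs are the sample means, standard deviations and sample sizes of 50 used in the calculation for this question. Note that the variances of the two sample means add under the square root, because the samples are independent.

```python
from math import sqrt
from statistics import NormalDist

def two_sample_z(x1, x2, sigma1, sigma2, n1, n2, d0=0.0):
    """z statistic for Ho: mu1 - mu2 <= d0 with known population standard deviations."""
    se = sqrt(sigma1 ** 2 / n1 + sigma2 ** 2 / n2)   # standard error of x1_bar - x2_bar
    return ((x1 - x2) - d0) / se

z = two_sample_z(25.946, 20.972, 3.421, 2.321, 50, 50)
z_crit = NormalDist().inv_cdf(0.95)                  # upper-tail critical value, alpha = .05
p_value = 1 - NormalDist().cdf(z)                    # area in the upper tail
print(f"z = {z:.2f}, critical value = {z_crit:.3f}, reject Ho: {z >= z_crit}")
```

With these inputs the statistic comes out near 8.5, far above the critical value, so the conclusion of the test holds comfortably.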
7) Espectra, the French employment agency, wants to hire some new employees at a lower cost, so
it wants to know whether there is a disparity of salaries among towns.
It took two equal-sized samples of towns (size 50), paired them row by row, and compared the
mean hourly salary (for all). It initially assumed that the two samples have the same mean
hourly salary.
Ho: µ1 - µ2 = 0
Ha: µ1 - µ2 ≠ 0
So, the null hypothesis will be rejected if the two sample means differ significantly.
Let µd = the mean of the differences in salary between paired towns
Town (sample 1)      Mean net salary   Town (sample 2)        Mean net salary   di
Saulieu              11.2              Montréal-la-Cluse      12.4              -1.2
Fontaine             12.9              Balan                  13.9              -1
Bouvigny-Boyeffles   13.8              Marboz                 12.7               1.1
Chantilly            17.8              Saint-André-de-Corcy   15                 2.8
Only 4 of the 50 town pairs are shown here; the actual sample size is 50.
The last column shows the difference in salary between the paired towns (row-wise).
Calculation:
d̄ = Σdi/n = 0.656
sd = sqrt(Σ(di - d̄)²/(n-1))
sd = sqrt(438.06/49) = sqrt(8.94)
sd = 2.99
Now we calculate the test statistic for hypothesis tests involving matched samples
t = (d̄ - µd)/(sd/sqrt(n))
t = (0.656-0)/(2.99/sqrt50)
t = 0.656/0.4228
t = 1.551381
Now we calculate the p-value for this two-tailed test. Because t = 1.55 > 0, the test statistic
is in the upper tail of the t distribution. With t = 1.55, the area in the upper tail to the
right of the test statistic is found from the t distribution table with 50-1 = 49 degrees of
freedom.
From the t distribution table, the area in the upper tail is between 0.05 and 0.10; since this
is a two-tailed test, the p-value is between 0.10 and 0.20.
This p-value is greater than α = 0.05, thus the null hypothesis Ho: µ1 - µ2 = 0 is not rejected.
In addition, we can obtain a 95% interval estimate of the difference between the two population
means, d̄ ± t.025 sd/sqrt(n)
= -0.1922 to 1.50
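The matched-sample calculation can be reproduced from the summary figures alone. In this sketch, 438.06 is read as the sum of squared deviations of the differences and 0.656 as the mean difference; both readings are inferred from the calculation above. The critical value 2.010 for 49 degrees of freedom is our table look-up, and the function name is ours.

```python
from math import sqrt

def matched_sample_t(d_bar, sum_sq_dev, n, mu_d0=0.0):
    """Return (sd, standard error, t) for a matched-sample test of Ho: mu_d = mu_d0."""
    sd = sqrt(sum_sq_dev / (n - 1))         # sample standard deviation of the differences
    se = sd / sqrt(n)
    t = (d_bar - mu_d0) / se
    return sd, se, t

T_CRIT_DF49 = 2.010  # ASSUMED table value of t.025 for df = 49

sd, se, t = matched_sample_t(0.656, 438.06, 50)
interval = (0.656 - T_CRIT_DF49 * se, 0.656 + T_CRIT_DF49 * se)
reject = abs(t) >= T_CRIT_DF49              # two-tailed critical value rule
print(f"sd = {sd:.2f}, t = {t:.3f}, reject Ho: {reject}")
print(f"95% interval: {interval[0]:.3f} to {interval[1]:.3f}")
```

The interval contains 0, which agrees with the decision not to reject Ho at the 0.05 level.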
8. The INSEE report states that the mean hourly salary for all persons aged 18-25 years (men
and women together) is $9.54/hr. We took samples of 50 towns for women aged 18-25 and for men
aged 18-25 from the same population to see whether the mean hourly salary for these samples
differs from the reported mean salary for all (men & women). The result will help in
understanding pay disparity based on gender.
a. Formulate the null and alternative hypotheses that can be used to determine whether the
mean hourly salary for women (18-25) differs from the population mean hourly salary for all
(men & women).
b. A sample of 50 towns showed a sample mean of $9.19/hr for women (18-25 years). The sample
standard deviation is $0.425. Compute the p-value.
d. Formulate the null and alternative hypotheses that can be used to determine whether the
mean hourly salary for men (18-25) differs from the population mean hourly salary for all
(men & women).
e. Suppose a sample of 50 towns showed a sample mean of $9.86/hr for men (18-25 years). The
sample standard deviation is $0.4260. Compute the p-value.
a.
Ho: µ (18-25 all)-µ (18-25-year female) = 0
Ha: µ (18-25 all)-µ (18-25-year female) ≠ 0
b.
t = (9.19-9.54)/(0.42/sqrt50)
t = -5.750973964
We can see with the help of Excel that with 49 degrees of freedom and a t value equal to
-5.7509739, the two-tailed p-value comes out to be almost zero. So, a sample mean this far
below 9.54 would be almost impossible if the mean salary for women were equal to the overall
mean.
We conclude that there is a disparity in salaries between women and the overall population.
c. Because the p-value is almost zero, it is less than α = 0.05, so the null hypothesis is
rejected. This indicates that salaries are not equal for men and women, which is why the
overall mean salary is not close to the women's mean salary.
d.
Ho: µ (18-25 all)-µ (18-25-year male) = 0
Ha: µ (18-25 all)-µ (18-25-year male) ≠ 0
e.
t = (9.86-9.54)/(0.42/sqrt50)
t = 5.377125759
We can see with the help of Excel that with 49 degrees of freedom and a t value equal to
5.377125759, the two-tailed p-value comes out to be almost zero. So, a sample mean this far
above 9.54 would be almost impossible if the mean salary for men were equal to the overall
mean. It implies that because women are getting a lower salary while men are getting a higher
salary, neither is equal to the overall average.
We conclude that there is a disparity in salaries between men and the overall population.
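Both one-sample t statistics can be recomputed from the rounded summary figures given in the question. This is a sketch under that assumption: with the stated standard deviations of 0.425 and 0.426 it gives about -5.82 and 5.31, slightly different from the Excel values reported above, which were presumably computed from unrounded sample data. The critical value 2.010 for 49 degrees of freedom is our table look-up, and the function name is ours.

```python
from math import sqrt

def one_sample_t(x_bar, mu0, s, n):
    """t statistic for Ho: mu = mu0 when the population standard deviation is unknown."""
    return (x_bar - mu0) / (s / sqrt(n))

T_CRIT_DF49 = 2.010  # ASSUMED table value of t.025 for df = 49

t_women = one_sample_t(9.19, 9.54, 0.425, 50)   # women aged 18-25
t_men = one_sample_t(9.86, 9.54, 0.426, 50)     # men aged 18-25
for label, t in (("women", t_women), ("men", t_men)):
    # Two-tailed rule: reject Ho when |t| is at least the critical value.
    print(f"{label}: t = {t:.2f}, reject Ho: {abs(t) >= T_CRIT_DF49}")
```

Either way, both statistics are far beyond the critical value, so the conclusions for women and for men do not depend on the rounding.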
Results and analysis
We were able to use sampling techniques to select a sample from the population and to estimate
the sample mean and the sampling distribution of x̄ for the mean hourly salary. We were able to
find interval estimates of the population mean with the standard deviation known. We were able
to test tentative assumptions about the population parameters using hypothesis testing, with
both one-tailed and two-tailed tests. The chi-square distribution gave us a hypothesis test
about the population variance, and inference about the difference between two population means
was made using the t distribution with the standard deviations unknown. The matched-sample
concept gave inference about the difference between two population means. The chi-square test
was used for the test of independence between two categorical variables.