
SMDM Project Report

Submitted by:
Kratika Vijayvergiya
Question 1.1 Use methods of descriptive statistics to summarize data. Which
Region and which Channel seems to spend more? Which Region and which
Channel seems to spend less?

Basic EDA

● Find the shape of the data and the data type of individual columns


● Descriptive stats of numerical columns
● Find the distribution of numerical columns and the associated skewness and
presence of outliers
● Distribution of categorical columns

Spend as per the regions

● The Region that spends more is Other: 10165489


● The Region that spends less is Oporto: 1500582
Spend as per the Channels

● The Channel that spends more is Hotel: 7577614


● The Channel that spends less is Retail: 6370943
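
The totals above come from summing spend across items and grouping by Region and by Channel. A minimal sketch of that aggregation follows, assuming a pandas DataFrame; the column names and the four rows are illustrative assumptions, not the report's dataset.

```python
import pandas as pd

# Illustrative rows only; the real dataset is not reproduced here.
# Column names (Region, Channel, Fresh, Milk, Grocery) are assumptions.
df = pd.DataFrame({
    "Region":  ["Other", "Other", "Oporto", "Lisbon"],
    "Channel": ["Hotel", "Retail", "Hotel", "Retail"],
    "Fresh":   [300, 100, 50, 80],
    "Milk":    [60, 30, 20, 40],
    "Grocery": [90, 20, 10, 70],
})

# Total spend per customer is the sum over all item columns.
item_cols = ["Fresh", "Milk", "Grocery"]
df["Total"] = df[item_cols].sum(axis=1)

# Aggregate total spend by each categorical variable, largest first.
spend_by_region = df.groupby("Region")["Total"].sum().sort_values(ascending=False)
spend_by_channel = df.groupby("Channel")["Total"].sum().sort_values(ascending=False)

print(spend_by_region)
print(spend_by_channel)
```

The first and last entries of each sorted series give the highest and lowest spending group.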

Question 1.2 There are 6 different varieties of items considered. Do all varieties
show similar behavior across Region and Channel?

● All varieties show different behaviour across Region and Channel.

Behaviour across Region and Channel


Behaviour across Channel

Behaviour across Region

Question 1.3 On the basis of the descriptive measure of variability, which item
shows the most inconsistent behavior? Which items show the least
inconsistent behavior?
Standard Deviation of the items

● Fresh items have the highest standard deviation (12647.32), so they show the
most inconsistent behaviour.
● Delicatessen items have the smallest standard deviation (2820.10), so they show
the least inconsistent behaviour.
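
As a rough illustration of this comparison, the sketch below computes the sample standard deviation per item with Python's statistics module. The values are made up for the example. The coefficient of variation is printed as an optional, scale-free alternative; it is not the measure used in this report.

```python
import statistics

# Illustrative values only; not taken from the report's dataset.
spend = {
    "Fresh": [12000, 3000, 25000, 800, 15000],
    "Delicatessen": [1200, 900, 1500, 1100, 1300],
}

for item, values in spend.items():
    sd = statistics.stdev(values)      # sample standard deviation (n - 1)
    cv = sd / statistics.mean(values)  # coefficient of variation, unitless
    print(f"{item}: sd={sd:.2f}, cv={cv:.2f}")

# Rank items by standard deviation to find the most and least variable.
std_by_item = {item: statistics.stdev(v) for item, v in spend.items()}
most_variable = max(std_by_item, key=std_by_item.get)
least_variable = min(std_by_item, key=std_by_item.get)
print(most_variable, least_variable)
```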

Question 1.4 Are there any outliers in the data?

We have checked the distribution of all varieties using boxplots

A box plot shows the three quartile values of the distribution along with extreme
values. The “whiskers” extend to points that lie within 1.5 IQRs of the lower and
upper quartile, and observations that fall outside this range are displayed
independently as outliers.
All the varieties contain outliers, which suggests that their distributions are skewed.
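
The 1.5 IQR rule described above can also be applied numerically. The following is a small sketch using only the standard library; the helper name iqr_outliers and the sample values are assumptions made for illustration.

```python
import statistics

def iqr_outliers(values):
    """Return the values lying more than 1.5 IQRs outside the quartiles."""
    # The "inclusive" method interpolates linearly between data points.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < lower or v > upper]

# Made-up spend values with one obviously extreme observation.
spend = [1, 2, 3, 4, 5, 6, 7, 8, 100]
outliers = iqr_outliers(spend)
print(outliers)  # [100]
```

Plotting libraries may compute quartiles with slightly different interpolation, so borderline points can be classified differently from the boxplot.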
Question 1.5 On the basis of this report, what are the recommendations?

Insights Summary:

Spend by Region
Max: Other
Min: Oporto

Spend by Channel
Max: Hotel
Min: Retail

The behaviour of the different varieties across Region and Channel was also examined and found to be highly inconsistent.

● Fresh items show the most inconsistent behaviour.


● Delicatessen are the least inconsistent.
All varieties show the presence of outliers, which were identified using
boxplots.

2.1 For this data, construct the following contingency tables


(Keep Gender as row variable)

2.1.1. Gender and Major

2.1.2. Gender and Grad Intention

2.1.3. Gender and Employment


2.1.4. Gender and Computer
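
A contingency table of this kind can be built with pandas.crosstab. The sketch below uses a handful of made-up records, and the column names "Gender" and "Major" are assumptions about how the survey data is labelled.

```python
import pandas as pd

# Illustrative records only; "Gender" and "Major" are assumed column names.
survey = pd.DataFrame({
    "Gender": ["Male", "Female", "Female", "Male", "Female", "Male"],
    "Major":  ["CIS", "Accounting", "Management",
               "Management", "CIS", "Accounting"],
})

# Gender is the row variable, Major is the column variable.
# margins=True adds row and column totals labelled "All".
gender_major = pd.crosstab(survey["Gender"], survey["Major"], margins=True)
print(gender_major)
```

The same call with a different second column would give the other three tables.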

Question 2.2 Assume that the sample is representative of the population of
CMSU. Based on the data, answer the following question:

1. What is the probability that a randomly selected CMSU student will be male?

P(Male) = count of males / total count of students

The probability that a randomly selected CMSU student will be male is 0.47

2. What is the probability that a randomly selected CMSU student will be
female?

P(Female) = count of females / total count of students

The probability that a randomly selected CMSU student will be female is 0.53

Question 2.3 Assume that the sample is representative of the population of
CMSU. Based on the data, answer the following question:

2.3.1 Find the conditional probability of different majors among the male
students in CMSU.

● P(Accounting | Male) = count of males selecting Accounting / male count = 0.138
● P(CIS | Male) = count of males selecting CIS / male count = 0.034
● P(Economics or Finance | Male) = count of males selecting Economics or
Finance / male count = 0.138
● P(International Business | Male) = count of males selecting IB / male count =
0.069
● P(Management | Male) = count of males selecting Management / male count =
0.207
● P(Other | Male) = count of males selecting Other / male count = 0.138
● P(Retailing or Market | Male) = count of males selecting Retailing or Market
/ male count = 0.172
● P(Undecided | Male) = count of males undecided / male count = 0.103

2.3.2 Find the conditional probability of different majors among the female
students of CMSU.

Similarly, the conditional probability of different majors among the female students of
CMSU is:
● Accounting- 0.091
● CIS- 0.091
● Economics or Finance- 0.212
● International business- 0.121
● Management- 0.121
● Other- 0.091
● Retailing or Market- 0.273
● Undecided- 0
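
Conditional probabilities of this form are the row proportions of the Gender and Major contingency table. A minimal sketch, assuming made-up records and the column names "Gender" and "Major":

```python
import pandas as pd

# Illustrative records only; "Gender" and "Major" are assumed column names.
survey = pd.DataFrame({
    "Gender": ["Male", "Male", "Male", "Male", "Female", "Female"],
    "Major":  ["Management", "Management", "CIS", "Accounting",
               "Accounting", "CIS"],
})

# normalize="index" divides each row by its row total,
# so every cell becomes P(Major | Gender).
major_given_gender = pd.crosstab(
    survey["Gender"], survey["Major"], normalize="index"
)
print(major_given_gender.round(3))
```

Each row of the result sums to 1, which is a quick sanity check on any list of conditional probabilities.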

Question 2.4 Assume that the sample is representative of the population of
CMSU. Based on the data, answer the following question:

a) Find the probability that a randomly chosen student is a male and intends
to graduate.
Prob(Male AND Intends to graduate) = P(M ∩ G) = count of (M ∩ G) / Total students

The probability that a randomly chosen student is a male and intends to graduate is
0.274

b) Find the probability that a randomly selected student is a female and does
NOT have a laptop.

Prob(Female AND Does not have a laptop) = P(F ∩ Lc) = count of (F ∩ Lc) / Total students, where Lc is the event of not having a laptop

The probability that a randomly selected student is a female and does NOT have a
laptop is 0.065

Question 2.5 Assume that the sample is representative of the population of
CMSU. Based on the data, answer the following question:

a) Find the probability that a randomly chosen student is either a male or has a
full-time employment?

P(A ∪ B) = P(A) + P(B) - P(A ∩ B)

Prob = Prob(Male OR Full-time employment)

The probability that a randomly chosen student is either a male or has full-time
employment is 0.516
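
The addition rule can be checked against a direct count. The sketch below does this on a hypothetical five-student sample (not the CMSU data), using exact fractions so that the two results compare equal.

```python
from fractions import Fraction

# Hypothetical mini-sample of (gender, employment) pairs; not the CMSU data.
students = [
    ("Male", "Full-Time"),
    ("Male", "Part-Time"),
    ("Female", "Full-Time"),
    ("Female", "Part-Time"),
    ("Female", "Unemployed"),
]
n = len(students)

p_male = Fraction(sum(g == "Male" for g, _ in students), n)
p_full = Fraction(sum(e == "Full-Time" for _, e in students), n)
p_both = Fraction(sum(g == "Male" and e == "Full-Time" for g, e in students), n)

# Addition rule: subtract the overlap so it is not counted twice.
p_union = p_male + p_full - p_both

# Direct count of students who are male or employed full time.
p_direct = Fraction(sum(g == "Male" or e == "Full-Time" for g, e in students), n)

print(p_union, p_direct)  # 3/5 3/5
```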

b) Find the conditional probability that given a female student is randomly
chosen, she is majoring in international business or management.

Prob= Prob(International Business OR Management | Female)

The conditional probability that given a female student is randomly chosen, she is
majoring in international business or management is 0.2424.

Question 2.6 Construct a contingency table of Gender and Intent to Graduate
at 2 levels (Yes/No). The Undecided students are not considered now and the
table is a 2x2 table. Do you think graduate intention and being female are
independent events?

Contingency Table of Gender and Intent to Graduate at 2 levels (Yes/No):

(Note: the total sample size differs here, because we created a subset of the
data that excludes the undecided students.)

To check whether graduate intention and being female are independent events, we
use the following condition: if being female and graduate intention are independent,
then P(F ∩ Yes) = P(F)P(Yes).

● P(F ∩ Yes) = 0.275
● P(F)P(Yes) = 0.5 * 0.7 = 0.35

Conclusion:

Since P(F ∩ Yes) is not equal to P(F)P(Yes), they are not independent
events.
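
The comparison above is simple arithmetic and can be reproduced as follows. The three input probabilities are the values reported in this section; the tolerance is an arbitrary choice made for this illustration.

```python
import math

# Values reported above for the 2x2 table (undecided students excluded).
p_f_and_yes = 0.275  # P(F ∩ Yes)
p_f = 0.5            # P(F)
p_yes = 0.7          # P(Yes)

# Under independence the joint probability equals the product of marginals.
expected_if_independent = p_f * p_yes

# The tolerance of 0.005 is an assumption made for this illustration.
looks_independent = math.isclose(
    p_f_and_yes, expected_if_independent, abs_tol=0.005
)

print(expected_if_independent, looks_independent)
```

An exact equality check is strict for sample data, so a small tolerance is used instead of ==.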

Question 2.7 Note that there are four numerical (continuous) variables in the
data set, GPA, Salary, Spending and Text Messages. Answer the following
questions based on the data

a) If a student is chosen randomly, what is the probability that his/her GPA is
less than 3?

● Prob = count of students with GPA < 3 / Total students

Count of students with GPA < 3 = 17

The probability that a randomly selected student's GPA is less than 3 is 0.274.

b) Find conditional probability that a randomly selected male earns 50 or
more. Find conditional probability that a randomly selected female earns 50 or
more.

● P(S | M), where S is the event of earning 50 or more and M is being male:

Prob = count of males who earn 50 or more / Total males

Count of males who earn 50 or more = 14

The probability that a randomly selected male earns 50 or more is 0.483.

● P(S | F), where S is the event of earning 50 or more and F is being female:

Prob = count of females who earn 50 or more / Total females

Count of females who earn 50 or more = 18

The probability that a randomly selected female earns 50 or more is 0.545.
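
The two conditional probabilities can be reproduced from counts. In the sketch below the counts 14 and 18 are the ones stated above, while the group totals of 29 males and 33 females are assumptions inferred from the reported proportions rather than figures stated in this report.

```python
# Counts of students earning 50 or more, as stated above.
males_50_plus = 14
females_50_plus = 18

# Group totals are assumptions inferred from the reported proportions.
total_males = 29
total_females = 33

# Conditional probability = count in the group meeting the condition
# divided by the size of the group.
p_s_given_male = males_50_plus / total_males
p_s_given_female = females_50_plus / total_females

print(round(p_s_given_male, 3), round(p_s_given_female, 3))  # 0.483 0.545
```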


3) ABC Asphalt Shingles

3.1) Do you think there is evidence that mean moisture contents in both types
of shingles are within the permissible limits? State your conclusions clearly
showing all steps.

Step 1: Define the Null and Alternate Hypothesis


Null Hypothesis states that the mean moisture content is within the permissible
limits.
Alternate Hypothesis states that the mean moisture content is not within the
permissible limits.

● H0: mean ≤ 0.35


● HA: mean > 0.35

Step 2: Decide the significance level

Here we select 𝛼 = 0.05.

Step 3: Identify the test statistic.


We do not know the population standard deviation and n= 36 for shingles A and n=
31 for shingles B. So, we use the t distribution and the tSTAT test statistic.

Step 4: Calculate the p-value and test statistic


scipy.stats.ttest_1samp calculates the t-test for the mean of one sample, given the
sample observations and the expected value in the null hypothesis.
Sample sizes: n= 36 for shingles A and n= 31 for shingles B. Because the samples
differ in size, we pass nan_policy = 'omit' to the t-test function so that missing
values are ignored.
By default, this function returns the t-statistic and a two-tailed p-value, so we
divide the p-value by 2 to get the one-tailed p-value.
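
A minimal sketch of this step follows, using made-up moisture readings rather than the shingles data. One caveat worth stating: halving the two-tailed p-value gives the upper-tail p-value only when the t-statistic is positive, that is, when the sample mean lies on the side named in HA; otherwise the upper-tail p-value is 1 minus half the two-tailed value. Recent SciPy releases also accept an alternative argument on ttest_1samp, which avoids the manual conversion.

```python
import numpy as np
from scipy import stats

# Made-up moisture readings with one missing value; not the report's data.
sample = np.array([0.40, 0.38, 0.36, 0.41, 0.37, np.nan, 0.39])
limit = 0.35  # permissible limit used in H0: mean <= 0.35

# nan_policy="omit" drops the missing value before computing the test.
t_stat, p_two_sided = stats.ttest_1samp(sample, popmean=limit, nan_policy="omit")

# Convert to the upper-tail p-value for HA: mean > limit.
# Halving is only correct when t_stat has the sign predicted by HA.
if t_stat > 0:
    p_one_sided = p_two_sided / 2
else:
    p_one_sided = 1 - p_two_sided / 2

reject_h0 = p_one_sided < 0.05
print(float(t_stat), float(p_one_sided), bool(reject_h0))
```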

Step 5: Decide to reject or accept null hypothesis


In this example, the p-value for A shingles is 0.075, which is greater than the 5% level of
significance, so we fail to reject the null hypothesis.
Thus, there is not enough evidence to conclude that the population mean moisture
content for A shingles is greater than 0.35 pound per 100 square feet, so it is
taken to be within the permissible limits.

The p-value for B shingles is 0.002, which is less than the 5% level of significance, so
we reject the null hypothesis.
Thus, we conclude that the population mean moisture content for B shingles is
greater than 0.35 pound per 100 square feet, which means it is not within the
permissible limits.
Conclusion:
At the 5% significance level, there is not sufficient evidence that the mean
moisture content for Shingles A exceeds the permissible limit, so it is taken to be within the limits.
For Shingles B, the mean moisture content is not within the permissible limits.

3.2 Do you think that the population mean for shingles A and B are
equal? Form the hypothesis and conduct the test of the hypothesis.
What assumption do you need to check before the test for equality of
means is performed

Step 1: Define the Null and Alternate Hypothesis

In testing whether the population means for shingles A and B are equal, the null
hypothesis states that the population means for both shingles are the same: μA
equals μB. The alternate hypothesis states that the population means for the two
shingles are different: μA is not equal to μB.

● H0: μA - μB= 0 i.e μA= μB


● HA: μA - μB ≠ 0 i.e μA≠ μB

Step 2: Decide the significance level

Here we select 𝛼 = 0.05; the population standard deviation is unknown.

Step 3: Identify the test statistic.


We have two samples and we do not know the population standard deviation.
Sample sizes: n= 36 for shingles A and n= 31 for shingles B. Because the samples
differ in size, we pass nan_policy = 'omit' to the t-test function so that missing
values are ignored.
So, we use the t distribution and the tSTAT test statistic for a two-sample
unpaired test.

Step 4: Calculate the p-value and test statistic


We use scipy.stats.ttest_ind to calculate the t-test for the means of two independent
samples, given the two sets of sample observations. This function returns the
t-statistic and a two-tailed p-value.
This is a two-sided test for the null hypothesis that 2 independent samples have
identical average (expected) values. By default, this test assumes that the populations
have identical variances, so this assumption should be checked before the test.
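
The sketch below illustrates this step on made-up readings, not the shingles data. It also shows one possible way to check the equal-variance assumption, Levene's test from scipy.stats, and uses the result to choose between the pooled and Welch versions of the test. That choice is an assumption of this example rather than the procedure followed in this report.

```python
import numpy as np
from scipy import stats

# Made-up readings; the shorter sample is padded with NaN, as in a
# two-column data file where one column has fewer observations.
a = np.array([0.30, 0.32, 0.35, 0.31, 0.33, 0.34])
b = np.array([0.25, 0.27, 0.26, 0.28, np.nan, np.nan])

# Levene's test for equal variances; NaNs are removed explicitly first.
b_clean = b[~np.isnan(b)]
lev_stat, lev_p = stats.levene(a, b_clean)
equal_var = lev_p > 0.05  # keep the equal-variance assumption if not rejected

# Two-sample unpaired t-test; nan_policy="omit" ignores the missing values.
t_stat, p_value = stats.ttest_ind(a, b, equal_var=equal_var, nan_policy="omit")

reject_h0 = p_value < 0.05
print(float(t_stat), float(p_value), bool(reject_h0))
```

Setting equal_var=False runs Welch's t-test, which does not assume identical variances.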
Step 5: Decide to reject or accept null hypothesis

In this example, the two-sample t-test p-value is 0.004, which is less than the 5% level of
significance. So the statistical decision is to reject the null hypothesis, i.e., we accept the
alternate hypothesis.
Hence, we conclude that the population means for shingles A and B are not equal.

Conclusion:
So at the 5% significance level there is sufficient evidence to conclude that the
population means for shingles A and B are different.

THE END
