
SMDM PROJECT

Bollibathula Vani
PGP-DSBA Online
January 2022
13/03/2022
List of contents:
Problem 1
1.1 Use methods of descriptive statistics to summarize data. Which Region and which Channel
spent the most? Which Region and which Channel spent the least?
1.2 There are 6 different varieties of items that are considered. Describe and
comment/explain all the varieties across Region and Channel? Provide a detailed justification
for your answer.
1.3 On the basis of a descriptive measure of variability, which item shows the most
inconsistent behaviour? Which items show the least inconsistent behaviour?
1.4 Are there any outliers in the data? Back up your answer with a suitable plot/technique
with the help of detailed comments.
1.5 On the basis of your analysis, what are your recommendations for the business? How can
your analysis help the business to solve its problem? Answer from the business
perspective.
Problem 2
2.1. For this data, construct the following contingency tables (Keep Gender as row variable)
2.1.1. Gender and Major
2.1.2. Gender and Grad Intention
2.1.3. Gender and Employment
2.1.4. Gender and Computer
2.2. Assume that the sample is representative of the population of CMSU. Based on the data,
answer the following question
2.2.1. What is the probability that a randomly selected CMSU student will be male?
2.2.2. What is the probability that a randomly selected CMSU student will be female?
2.3. Assume that the sample is representative of the population of CMSU. Based on the data,
answer the following question
2.3.1. Find the conditional probability of different majors among the male students in
CMSU
2.3.2. Find the conditional probability of different majors among the female students of
CMSU
2.4. Assume that the sample is representative of the population of CMSU. Based on the
data, answer the following question
2.4.1. Find the probability that a randomly chosen student is a male and intends to
graduate
2.4.2. Find the probability that a randomly selected student is a female and does NOT have a
laptop
2.5. Assume that the sample is representative of the population of CMSU. Based on the data,
answer the following question
2.5.1. Find the probability that a randomly chosen student is a male or has full-time
employment?
2.5.2. Find the conditional probability that given a female student is randomly chosen, she is
majoring in international business or management
2.6. Construct a contingency table of Gender and Intent to Graduate at 2 levels (Yes/No). The
Undecided students are not considered now and the table is a 2x2 table. Do you think the
graduate intention and being female are independent events?
2.7. Note that there are four numerical (continuous) variables in the data set, GPA, Salary,
Spending and Text Messages. Answer the following questions based on the data
2.7.1. If a student is chosen randomly, what is the probability that his/her GPA is less than
3?
2.7.2. Find the conditional probability that a randomly selected male earns 50 or more. Find
the conditional probability that a randomly selected female earns 50 or more
2.8. Note that there are four numerical (continuous) variables in the data set, GPA, Salary,
Spending, and Text Messages. For each of them comment whether they follow a normal
distribution. Write a note summarizing your conclusions

Problem 3
3.1 Do you think there is evidence that the mean moisture contents in both types of shingles are
within the permissible limits? State your conclusions clearly showing all steps.
3.2 Do you think that the population mean for shingles A and B are equal? Form the
hypothesis and conduct the test of the hypothesis. What assumption do you need to check
before the test for equality of means is performed?

List of Tables
Table 1: Describe table of wholesale customer dataset
Table 2: It is a table of the total spend by Channel and Region
Table 3: Table of Total spend by Region
Table 4: Contingency table of Gender and Major
Table 5: Contingency table of Gender and Grad Intention
Table 6: Contingency table of Gender and Employment
Table 7: Contingency table of Gender and Computer
Table 8: Contingency table of Gender and Grad Intention at 2 levels

List of Figures
Figure 1: Bar plot of channel vs Total
Figure 2: Bar plot of region vs Total
Figure 3: Bar plot of region and channel by Total Spend
Figure 4: Bar plot of Channel vs Fresh
Figure 5: Bar plot of Region vs Fresh
Figure 6: Bar plot of Channel vs Milk
Figure 7: Bar plot of Region vs Milk
Figure 8: Bar plot of Channel vs Grocery
Figure 9: Bar plot of Region vs Grocery
Figure 10: Bar plot of Channel vs Frozen
Figure 11: Bar plot of Region vs Frozen
Figure 12: Bar plot of Channel vs Detergents paper
Figure 13: Bar plot of Region vs Detergents paper
Figure 14: Bar plot of Channel vs Delicatessen
Figure 15: Bar plot of Region vs Delicatessen
Figure 16: figure of outliers of wholesale customer Dataset
Figure 17: Dist plot of GPA
Figure 18: Dist plot of Salary
Figure 19: Dist plot of Spending
Figure 20: Dist plot of Text Messages
Problem 1
Wholesale Customers Analysis
Problem Statement:
A wholesale distributor operating in different regions of Portugal has information on the annual
spending on several items in its stores across different regions and channels. The data consist of
440 large retailers’ annual spending on 6 different varieties of products in 3 different regions (Lisbon,
Oporto, Other) and across different sales channels (Hotel, Retail).

1 – Wholesale Customer Data Analysis:

We imported the ‘Wholesale Customer data’ dataset into Python to analyse the spend
on each item across regions and channels and to answer each question. The detailed
approach and answers are given below.

1.1 Use methods of descriptive statistics to summarize data. Which Region and
which Channel spent the most? Which Region and which Channel spent the
least?
Solution:

Using the describe function in Python, we first looked at the basic descriptive statistics of the
data set.

Table 1: Describe table of wholesale customer dataset.


The dataset covers sales of 6 categories of products across 3 regions and 2 channels.
Region frequency: total = 440 rows (Lisbon 77 rows, Oporto 47 rows and Other 316 rows).
Channel frequency: total = 440 rows (Hotel 298 rows and Retail 142 rows).

We then created a new column, Total, holding each customer's spend summed over all items, to
determine which Region and which Channel spent the most and the least.

Table 2: It is a table of the total spend by Channel and Region.


We grouped the data by Channel and summed Total to get the spend by Channel.

Below is the output from Python:

Table 3: Table of Total spend by Channel

Figure 1: Bar plot of channel vs Total


We grouped the data by Region and summed Total to get the spend by Region.

Below is the output from Python:

Table 3: Table of Total spend by Region.

Figure 2: Bar plot of region vs Total.
By Region, the highest spend is from Other and the lowest spend is from Oporto.
By Channel, the highest spend is from Hotel and the lowest spend is from Retail.

We then grouped the data by Region and Channel and summed Total to get the total spend for each Region/Channel pair.
Below is the output from Python:

Figure 3: Bar plot of region and channel by Total Spend.

By Region/Channel, the highest spend is from Other/Hotel.
By Region/Channel, the lowest spend is from Oporto/Hotel.
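
The grouping steps described above can be sketched as follows. This is only an illustration: the rows are made up, only two of the six item columns are included, and the column names (Channel, Region, Fresh, Milk, Total) are assumed to match the dataset.

```python
import pandas as pd

# Hypothetical rows; the real file has 440 rows and six item columns.
df = pd.DataFrame({
    "Channel": ["Hotel", "Retail", "Hotel", "Retail"],
    "Region": ["Lisbon", "Lisbon", "Other", "Oporto"],
    "Fresh": [100, 50, 300, 20],
    "Milk": [40, 90, 60, 30],
})

# Total spend per customer = sum over the item columns.
item_cols = ["Fresh", "Milk"]
df["Total"] = df[item_cols].sum(axis=1)

# Total spend by Channel, by Region, and by Region/Channel pair.
by_channel = df.groupby("Channel")["Total"].sum()
by_region = df.groupby("Region")["Total"].sum()
by_both = df.groupby(["Region", "Channel"])["Total"].sum()

# idxmax / idxmin return the label of the highest / lowest total.
print(by_channel.idxmax(), by_channel.idxmin())
print(by_region.idxmax(), by_region.idxmin())
print(by_both.idxmax(), by_both.idxmin())
```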

1.2 There are 6 different varieties of items that are considered. Describe and
comment/explain all the varieties across Region and Channel? Provide a
detailed justification for your answer.
Solution:
Using a bar plot for each category to compare spend across Region and Channel, we get the following
outputs:

Figure 4: Bar plot of Channel vs Fresh. Figure 5: Bar plot of Region vs Fresh.

Figure 6: Bar plot of Channel vs Milk. Figure 7: Bar plot of Region vs Milk.

Figure 8: Bar plot of Channel vs Grocery. Figure 9: Bar plot of Region vs Grocery.

Figure 10: Bar plot of Channel vs Frozen Figure 11: Bar plot of Region vs Frozen

Figure 12: Bar plot of Channel vs Detergents paper Figure 13: Bar plot of Region vs Detergents paper

Figure 14: Bar plot of Channel vs Delicatessen Figure 15: Bar plot of Region vs Delicatessen

Looking at the above plots, we see that some categories, namely Milk, Grocery and Detergents_Paper,
have higher spend in the Retail channel than in the Hotel channel, across all regions. On the other
hand, Fresh and Frozen have higher spend in the Hotel channel than in the Retail channel, across all
regions. Also, if we plot a box plot, we can summarize that the spend on Fresh and Grocery is the
highest across region and channel, while the spend on Delicatessen is the lowest across region and channel.

1.3 On the basis of a descriptive measure of variability, which item shows the
most inconsistent behaviour? Which items show the least inconsistent
behaviour?
Solution:
Using the coefficient of variation (standard deviation divided by the mean), the data show that:
Most inconsistent behaviour is shown by the item Delicatessen (1.85).
Least inconsistent behaviour is shown by the item Fresh (1.05).
Below is the output from Python:
Coefficient of Variation for Fresh is 1.0527196084948245
Coefficient of Variation for Milk is 1.2718508307424503
Coefficient of Variation for Grocery is 1.193815447749267
Coefficient of Variation for Frozen is 1.5785355298607762
Coefficient of Variation for Detergents Paper is 1.6527657881041729
Coefficient of Variation for Delicatessen is 1.8473041039189306

The Fresh item has the lowest coefficient of variation, so it is the most consistent.
The Delicatessen item has the highest coefficient of variation, so it is the most inconsistent.
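
As an illustration of the measure used above, the sketch below computes the coefficient of variation for two made-up columns. The report does not state whether the population or the sample standard deviation was used; this sketch assumes the population form (ddof=0), and the data values are hypothetical.

```python
import pandas as pd

# Hypothetical spend values for two items (not the real data).
df = pd.DataFrame({
    "Fresh": [10, 12, 11, 13],
    "Delicatessen": [1, 20, 2, 30],
})

# Coefficient of variation = standard deviation / mean, per column.
# ddof=0 (population standard deviation) is an assumption of this sketch.
cv = df.std(ddof=0) / df.mean()

most_inconsistent = cv.idxmax()   # column with the largest CV
least_inconsistent = cv.idxmin()  # column with the smallest CV
print(cv)
print(most_inconsistent, least_inconsistent)
```

Because the coefficient of variation is unit-free, it allows items with very different average spend to be compared on the same scale.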

1.4 Are there any outliers in the data? Back up your answer with a suitable
plot/technique with the help of detailed comments.
Solution:
To find outliers we drew a boxplot of the data. The plot shows that every item in the data
contains outliers.

Figure 16: figure of outliers of wholesale customer Dataset.
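
A boxplot conventionally marks as outliers the points lying more than 1.5 times the interquartile range beyond the quartiles. The same rule can be applied numerically to count outliers per item. The sketch below assumes that 1.5 x IQR convention; the helper name iqr_outlier_mask and the data are hypothetical.

```python
import pandas as pd

def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return True for points outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Hypothetical spend values with one very large customer.
spend = pd.Series([10, 12, 11, 13, 12, 100])
mask = iqr_outlier_mask(spend)
n_outliers = int(mask.sum())
print(n_outliers, spend[mask].tolist())
```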

1.5 On the basis of your analysis, what are your recommendations for the
business? How can your analysis help the business to solve its problem? Answer
from the business perspective.
Solution:
As per the analysis, there are inconsistencies in spending across the different items
(measured by the coefficient of variation), which the business should try to reduce. Spending in the
Hotel and Retail channels differs, whereas it should be more or less equal, and spending should also
be more even across regions. The business also needs to focus on items other than “Fresh” and “Grocery”.

Problem 2
Survey:
The Student News Service at Clear Mountain State University (CMSU) has decided to gather
data about the undergraduate students that attend CMSU. CMSU creates and distributes a survey
of 14 questions and receives responses from 62 undergraduates (stored in the Survey data set).

2-CMSU Survey Data Analysis:


We imported the ‘CMSU Survey-1’ dataset into Python to analyse the data about the
undergraduate students who attend CMSU. The detailed approach and answers are given below.

2.1. For this data, construct the following contingency tables (Keep Gender as
row variable)
2.1.1. Gender and Major
Solution:
Below is the output from Python:

Table 4: Contingency table of Gender and Major.

2.1.2. Gender and Grad Intention
Solution:
Below is the output from python:

Table 5: Contingency table of Gender and Grad Intention.


2.1.3. Gender and Employment
Solution:
Below is the output from python:

Table 6: Contingency table of Gender and Employment.

2.1.4. Gender and Computer
Solution:
Below is the output from python:

Table 7: Contingency table of Gender and Computer.
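
Contingency tables like Tables 4 to 7 can be built with pandas.crosstab, keeping Gender as the row variable. The sketch below uses a few made-up survey rows; the column names Gender and Computer are assumed to match the survey file.

```python
import pandas as pd

# Hypothetical survey rows (the real survey has 62 responses).
survey = pd.DataFrame({
    "Gender": ["Male", "Female", "Female", "Male", "Female"],
    "Computer": ["Laptop", "Laptop", "Desktop", "Laptop", "Tablet"],
})

# Gender as rows, Computer as columns; margins=True adds "All" totals.
table = pd.crosstab(survey["Gender"], survey["Computer"], margins=True)
print(table)
```

The "All" row and column give the totals that the probability questions in the following sections divide by.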

2.2. Assume that the sample is representative of the population of CMSU.
Based on the data, answer the following question:
2.2.1. What is the probability that a randomly selected CMSU student will be male?
Solution:
In this data set we have 29 male students and 33 female students, 62 students in total.
The probability is the number of male students divided by the total number of students, 29/62.
The probability that a randomly selected CMSU student is male is therefore 46.77%.
2.2.2. What is the probability that a randomly selected CMSU student will be female?
Solution:
The probability is the number of female students divided by the total number of students, 33/62.
The probability that a randomly selected CMSU student is female is therefore 53.23%.

2.3. Assume that the sample is representative of the population of CMSU.
Based on the data, answer the following question:
2.3.1. Find the conditional probability of different majors among the male students in
CMSU.
Solution:
Using the contingency table of Gender and Major, we took the total number of males and the
number of males opting for each major, and divided each count by the total number of males.
Below is the output from python:
Among MALE candidates:
Probability of opting for Accounting: 0.13793103448275862.
Probability of opting for CIS: 0.034482758620689655.
Probability of opting for Economics/Finance: 0.13793103448275862.
Probability of opting for International Business: 0.06896551724137931.
Probability of opting for Management: 0.20689655172413793.
Probability of opting for Other: 0.13793103448275862.
Probability of opting for Retailing/Marketing: 0.1724137931034483.
Probability of opting for Undecided: 0.10344827586206896.

From this output, Management is the most preferred major among male students and CIS is the
least preferred one.

2.3.2 Find the conditional probability of different majors among the female students of
CMSU.
Solution:
Using the contingency table of Gender and Major, we took the total number of females and the
number of females opting for each major, and divided each count by the total number of females.
Below is the output from python:
Among FEMALE candidates:
Probability of opting for Accounting: 0.09090909090909091.
Probability of opting for CIS: 0.09090909090909091.
Probability of opting for Economics/Finance: 0.21212121212121213.
Probability of opting for International Business: 0.12121212121212122.
Probability of opting for Management: 0.12121212121212122.
Probability of opting for Other: 0.09090909090909091.
Probability of opting for Retailing/Marketing: 0.2727272727272727.
Probability of opting for Undecided: 0.0.
From this output, Retailing/Marketing is the most preferred major among female students.
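
The conditional probabilities of major given gender are the row-wise proportions of the Gender by Major table. The sketch below assumes the counts implied by the probabilities reported above (each probability multiplied by 29 males or 33 females); with the raw survey data, pd.crosstab(..., normalize="index") would give the same proportions directly.

```python
import pandas as pd

majors = ["Accounting", "CIS", "Economics/Finance", "International Business",
          "Management", "Other", "Retailing/Marketing", "Undecided"]

# Counts recovered from the reported probabilities (an assumption of this sketch).
counts = pd.DataFrame(
    [[3, 3, 7, 4, 4, 3, 9, 0],   # Female, 33 students
     [4, 1, 4, 2, 6, 4, 5, 3]],  # Male, 29 students
    index=["Female", "Male"],
    columns=majors,
)

# P(major | gender): divide each row by its row total.
cond = counts.div(counts.sum(axis=1), axis=0)
print(cond.round(4))
print(cond.loc["Male"].idxmax(), cond.loc["Female"].idxmax())
```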

2.4. Assume that the sample is representative of the population of CMSU.
Based on the data, answer the following question:
2.4.1. Find the probability that a randomly chosen student is a male and intends to
graduate.
Solution:

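Using the contingency table of Gender and Grad Intention, we took the number of males who
intend to graduate (17 of the 29 males) and divided it by the total number of students (62).

➢ Probability that a student is male and intends to graduate is 17/62 = 27.42%. The figure
58.62% = 17/29 is the conditional probability of intending to graduate given that the student is male.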

2.4.2 Find the probability that a randomly selected student is a female and does NOT have
a laptop.
Solution:

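Using the contingency table of Gender and Computer, we took the number of females who do
not have a laptop (4 of the 33 females) and divided it by the total number of students (62).

➢ Probability that a randomly selected student is female and does NOT have a laptop is 4/62 = 6.45%.
The figure 12.12% = 4/33 is the conditional probability of not having a laptop given that the student is female.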
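
A joint probability such as P(male and intends to graduate) divides the joint count by all 62 students, whereas dividing by the number of males gives a conditional probability. A minimal sketch of the arithmetic, assuming the counts 17 and 4 recovered from the percentages quoted in this report (17/29 = 58.62%, 4/33 = 12.12%):

```python
# Totals stated in the report.
n_total, n_male, n_female = 62, 29, 33

# Joint counts recovered from the reported percentages (assumption).
male_grad_yes = 17      # males who intend to graduate
female_no_laptop = 4    # females without a laptop

# Joint probability: divide by ALL students.
p_male_and_grad = male_grad_yes / n_total
p_female_and_no_laptop = female_no_laptop / n_total

# Conditional probability: divide by the size of the conditioning group.
p_grad_given_male = male_grad_yes / n_male
p_no_laptop_given_female = female_no_laptop / n_female

print(round(100 * p_male_and_grad, 2), round(100 * p_grad_given_male, 2))
print(round(100 * p_female_and_no_laptop, 2), round(100 * p_no_laptop_given_female, 2))
```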

2.5. Assume that the sample is representative of the population of CMSU.
Based on the data, answer the following question:
2.5.1. Find the probability that a randomly chosen student is a male or has full-time
employment?
Solution:

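Using the contingency table of Gender and Employment, we need the number of males, the number of
full-time employed students and the number of males who are full-time employed, because
P(Male or Full-time) = P(Male) + P(Full-time) - P(Male and Full-time).

➢ The figure 24.13% (approximately 7/29) is the conditional probability that a male student has
full-time employment, not the probability of the union. All 29 males count towards the union, so the
probability that a randomly chosen student is male or has full-time employment is
(29 + number of full-time employed females in Table 6)/62, which is at least 29/62 = 46.77%.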

2.5.2. Find the conditional probability that given a female student is randomly chosen, she
is majoring in international business or management.
Solution:

Using the contingency table of Gender and Major, we took the number of females majoring in
International Business or Management and divided it by the total number of females (33).

➢ Probability that a randomly chosen female student is majoring in International
Business or Management is 24.24%.

2.6. Construct a contingency table of Gender and Intent to Graduate at 2 levels (Yes/No).
The Undecided students are not considered now and the table is a 2x2 table. Do you think
the graduate intention and being female are independent events?
Solution:

Table 8: Contingency table of Gender and Grad Intention at 2 levels.

Are graduate intention and being female independent events?

If the two events are independent, knowing that a student is female does not change the probability
that the student intends to graduate, i.e. P (Grad Intention Yes | Female) = P (Grad Intention Yes).
P (Grad Intention Yes) = 28/40 = 0.7.
P (Grad Intention Yes | Female) = 11/20 = 0.55.
These probabilities are not equal. This suggests that the two events are not independent.
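
The independence check above can be written out as a short calculation using the counts quoted from Table 8 (28 of 40 students intend to graduate; 11 of 20 females intend to graduate). This sketch compares the observed proportions directly; a formal test such as a chi-square test of independence is a separate step that the report does not perform.

```python
import math

# Counts quoted in the report from the 2x2 table.
n_students = 40
grad_yes = 28
n_female = 20
female_grad_yes = 11

p_yes = grad_yes / n_students                 # P(Grad Intention Yes)
p_yes_given_female = female_grad_yes / n_female  # P(Yes | Female)

# Equivalent check: P(Female and Yes) versus P(Female) * P(Yes).
p_female_and_yes = female_grad_yes / n_students
p_product = (n_female / n_students) * p_yes

independent = math.isclose(p_yes, p_yes_given_female)
print(p_yes, p_yes_given_female, independent)
print(p_female_and_yes, p_product)
```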

2.7 Note that there are four numerical (continuous) variables in the data set,
GPA, Salary, Spending and Text Messages. Answer the following questions
based on the data
2.7.1 If a student is chosen randomly, what is the probability that his/her GPA is less than
3?
Solution:
Using the contingency table of Gender and GPA, we took the number of students with a GPA less than 3
and divided it by the total number of students (62).
The probability that a randomly chosen student has a GPA less than 3 is 17/62 = 27.42%.

2.7.2. Find the conditional probability that a randomly selected male earns 50 or more.
Find the conditional probability that a randomly selected female earns 50 or more.
Solution:

Using the contingency table of Gender and Salary, we took the total numbers of males and females and
the numbers of males and females earning 50 or more.
After calculation we find that:
Probability that a randomly selected male earns 50 or more is about 0.48 (48%).
Probability that a randomly selected female earns 50 or more is about 0.54 (54%).
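
The conditional probability of earning 50 or more within each gender is the share of rows in that gender meeting the condition. A minimal sketch with made-up salaries; the column names Gender and Salary are assumed to match the survey file.

```python
import pandas as pd

# Hypothetical rows (not the real survey data).
survey = pd.DataFrame({
    "Gender": ["Male", "Male", "Male", "Male", "Female", "Female", "Female"],
    "Salary": [45, 50, 60, 40, 55, 50, 42],
})

# Boolean flag per student, then the mean of the flag within each gender
# gives P(Salary >= 50 | Gender).
earns_50_plus = survey["Salary"] >= 50
p_by_gender = earns_50_plus.groupby(survey["Gender"]).mean()
print(p_by_gender)
```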

2.8. Note that there are four numerical (continuous) variables in the data set,
GPA, Salary, Spending, and Text Messages. For each of them comment
whether they follow a normal distribution. Write a note summarizing your
conclusions.
Solution:

skew value of GPA is -0.3146000894506981
skew value of Salary is 0.5347008436225946
skew value of Spending is 1.5859147414045331
skew value of Text Messages is 1.2958079731054333

Figure 17: Dist plot of GPA. Figure 18: Dist plot of Salary.

Figure 19: Dist plot of Spending. Figure 20: Dist plot of Text Messages.

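The probability plot can be used to check whether a variable follows a normal distribution: the
closer the points lie to a straight line, the closer the variable is to normal. In our dataset the
points roughly follow a straight line, so GPA, Salary, Spending and Text Messages can be treated as
approximately normal. The fit is closer for GPA and Salary than for Spending and Text Messages.

Looking at the skew value: if the value is zero the data are symmetric, a negative value
indicates the data are skewed left, and a positive value indicates the data are skewed right.
GPA (-0.31) is slightly left-skewed and Salary (0.53) is slightly right-skewed, while Spending (1.59)
and Text Messages (1.30) are noticeably right-skewed.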
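
Skewness can be complemented with a numerical normality check. The sketch below uses synthetic data, not the survey, and the Shapiro-Wilk test from SciPy, which is an addition for illustration and not a method used in this report.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic examples: one roughly symmetric variable, one right-skewed.
symmetric = rng.normal(loc=3.0, scale=0.4, size=500)
right_skewed = rng.exponential(scale=200.0, size=500)

# Skewness near 0 suggests symmetry; large positive values a long right tail.
skew_sym = stats.skew(symmetric)
skew_right = stats.skew(right_skewed)

# Shapiro-Wilk: a small p-value is evidence against normality.
_, p_sym = stats.shapiro(symmetric)
_, p_right = stats.shapiro(right_skewed)

print(skew_sym, p_sym)
print(skew_right, p_right)
```

A strongly skewed variable will typically give a very small Shapiro-Wilk p-value, matching the curved pattern it produces on a probability plot.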

CONCLUSION:
We have a dataset of students answering the survey, with 62 responses from both male and
female students. Many students intend to graduate. Retailing/Marketing has been chosen as a
major by quite a number of students, and about 2/3 of the students are looking for a
part-time job. The mean salary seems to be around 50.

Problem 3
Shingles:
An important quality characteristic used by the manufacturers of ABC asphalt shingles is the
amount of moisture the shingles contain when they are packaged. Customers may feel that they
have purchased a product lacking in quality if they find moisture and wet shingles inside the
packaging. In some cases, excessive moisture can cause the granules attached to the shingles for
texture and colouring purposes to fall off the shingles resulting in appearance problems. To monitor
the amount of moisture present, the company conducts moisture tests. A shingle is weighed and
then dried. The shingle is then reweighed, and based on the amount of moisture taken out of the
product, the pounds of moisture per 100 square feet are calculated. The company would like to
show that the mean moisture content is less than 0.35 pounds per 100 square feet.
The file includes 36 measurements (in pounds per 100 square feet) for A shingles and 31 for B
shingles.
3 – Shingles Data Analysis:
We imported the ‘A & B shingles’ dataset in python to analyse the data about the Asphalt Shingles.
Below is the detailed approach and answer.

3.1 Do you think there is evidence that the mean moisture contents in both
types of shingles are within the permissible limits? State your conclusions
clearly showing all steps.
Solution:

Define Null and alternate hypothesis for sample A

Step 1: Null and Alternate hypothesis
Testing whether the moisture content is less than the permissible limit.
The null hypothesis states that the mean moisture content of sample A is greater than or equal to the
permissible limit, 𝜇 ≥ 0.35.
The alternative hypothesis states that the mean moisture content of sample A is less than the
permissible limit, 𝜇 < 0.35.
𝐻0: 𝜇 ≥ 0.35
𝐻𝐴: 𝜇 < 0.35
Step 2: Decide the significance level
Here we use 𝛼 = 0.05 as given in the question.
Step 3: Identify the test statistic
We have two samples (A and B) and we do not know the population standard deviation.
Sample sizes for both samples are not the same.
The sample size is n > 30 (36 measurements for A). So, we use the t distribution and the 𝑡𝑆𝑇𝐴𝑇 test
statistic for a one-sample, one-tailed test on sample A.
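Step 4: Calculate the p-value and test statistic
Input from python:
from scipy.stats import ttest_1samp
# One-sample t-test of sample A against the limit 0.35; missing values are omitted
t_statistic, p_value = ttest_1samp(df['A'],0.35, nan_policy='omit')
print('tstat',t_statistic)
# Halve the two-sided p-value for the one-tailed test (valid here because tstat < 0)
print('P Value',p_value/2)
Below is the output from python:
tstat -1.4735046253382782
P Value 0.07477633144907513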

Step 5: Decide to reject or fail to reject the null hypothesis
Input from python:
print("one-sample t-test p-value=", p_value/2)
alpha_level = 0.05
if (p_value/2) < alpha_level:
    print('We have enough evidence to reject the null hypothesis in favour of alternative hypothesis')
else:
    print('We do not have enough evidence to reject the null hypothesis in favour of alternative hypothesis')
    print('We cannot conclude that the moisture content is less than permissible limit in sample A.')
Below is the output from python:
one-sample t-test p-value= 0.07477633144907513
We do not have enough evidence to reject the null hypothesis in favour of alternative hypothesis
We cannot conclude that the moisture content is less than permissible limit in sample A.

Define Null and alternate hypothesis for sample B

Step 1: Null and Alternate hypothesis
Testing whether the moisture content is less than the permissible limit.
The null hypothesis states that the mean moisture content of sample B is greater than or equal to the
permissible limit, 𝜇 ≥ 0.35.
The alternative hypothesis states that the mean moisture content of sample B is less than the
permissible limit, 𝜇 < 0.35.
𝐻0 : 𝜇 ≥ 0.35
𝐻𝐴 : 𝜇 < 0.35
Step 2: Decide the significance level
Here we select 𝛼 = 0.05 as given in the question.
Step 3: Identify the test statistic
We have two samples (A and B) and we do not know the population standard deviation.
Sample sizes for both samples are not the same.
The sample size is n > 30 (31 measurements for B). So we use the t distribution and the 𝑡𝑆𝑇𝐴𝑇 test
statistic for a one-sample, one-tailed test on sample B.
Step 4: Calculate the p - value and test statistic
Input from python:
t_statistic, p_value = ttest_1samp(df['B'],0.35, nan_policy='omit')
print('tstat',t_statistic)
print('P Value',p_value/2)
Below is the output from python:
tstat -3.1003313069986995
P Value 0.0020904774003191826
Step 5: Decide to reject or fail to reject the null hypothesis
Input from python:
print("one-sample t-test p-value=", p_value/2)
alpha_level = 0.05
if (p_value/2) < alpha_level:
    print('We have enough evidence to reject the null hypothesis in favour of alternative hypothesis')
    print('We conclude that the moisture content is less than permissible limit in sample B.')
else:
    print('We do not have enough evidence to reject the null hypothesis in favour of alternative hypothesis')
Below is the output from python:
one-sample t-test p-value= 0.0020904774003191826
We have enough evidence to reject the null hypothesis in favour of alternative hypothesis
We conclude that the moisture content is less than permissible limit in sample B.
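
Halving the two-sided p-value gives the one-tailed p-value only when the t statistic points in the direction of the alternative (here, tstat < 0). Recent SciPy versions (1.6 and later) also accept an alternative argument on ttest_1samp that computes the one-tailed p-value directly. The sketch below uses made-up moisture values, not the shingles data, to show that the two approaches agree.

```python
import math
from scipy.stats import ttest_1samp

# Hypothetical moisture measurements (pounds per 100 square feet).
sample = [0.30, 0.28, 0.36, 0.33, 0.25, 0.31, 0.38, 0.29]
limit = 0.35

# Two-sided test (the default), as used in the report.
t_two, p_two = ttest_1samp(sample, limit)

# One-tailed test of H0: mu >= limit against HA: mu < limit.
t_less, p_less = ttest_1samp(sample, limit, alternative="less")

# When t < 0 the one-tailed p-value is half the two-sided one;
# when t > 0 it would be 1 - p_two / 2 instead.
print(t_two, p_two / 2, p_less)
```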

3.2 Do you think that the population mean for shingles A and B are equal?
Form the hypothesis and conduct the test of the hypothesis. What
assumption do you need to check before the test for equality of means is
performed?
Solution:

Step 1: Null and Alternate hypothesis
In testing whether the means for shingles A and shingles B are the same, the null hypothesis states
that the mean of shingle A and the mean of shingle B are the same, 𝜇A equals 𝜇B. The alternative
hypothesis states that the means are different, 𝜇A is not equal to 𝜇B.

• 𝐻0: μA - μB = 0 i.e. μA = μB
• 𝐻𝐴: μA - μB ≠ 0 i.e. μA ≠ μB

Step 2: Decide the significance level
Here we select 𝛼 = 0.05 and the population standard deviation is not known.
Step 3: Identify the test statistic
We have two samples and we do not know the population standard deviation. Sample sizes for both
samples are not the same (36 for A and 31 for B), and both are larger than 30. So we use the
t distribution and the 𝑡𝑆𝑇𝐴𝑇 test statistic for a two-sample test.
Step 4: Calculate the p - value and test statistic
Input from python:
t_statistic, p_value = ttest_ind(df['A'],df['B'],nan_policy='omit')
print('tstat',t_statistic)
print('P Value',p_value)
Below is the output from python
tstat 1.2896282719661123
P Value 0.2017496571835306
Step 5: Decide to reject or fail to reject the null hypothesis
Input from python:
print("two-sample t-test p-value=", p_value)
alpha_level = 0.05
if p_value < alpha_level:
    print('We have enough evidence to reject the null hypothesis in favour of alternative hypothesis')
else:
    print('We do not have enough evidence to reject the null hypothesis in favour of alternative hypothesis')
    print('We cannot conclude that the means for shingles A and shingles B are different')
Below is the output from python:
two-sample t-test p-value= 0.2017496571835306
We do not have enough evidence to reject the null hypothesis in favour of alternative hypothesis
We cannot conclude that the means for shingles A and shingles B are different
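
The question also asks which assumption should be checked before the test for equality of means. The pooled two-sample t-test, which is the default behaviour of ttest_ind, assumes that the two populations have equal variances and that the data are approximately normal. One common check, which is not part of the analysis above, is Levene's test for equal variances; if it rejects, Welch's t-test (equal_var=False) can be used instead. A sketch with made-up measurements:

```python
from scipy.stats import levene, ttest_ind

# Hypothetical moisture measurements for two shingle types.
a = [0.31, 0.29, 0.35, 0.33, 0.27, 0.30, 0.36, 0.28]
b = [0.25, 0.30, 0.22, 0.28, 0.26, 0.31, 0.24]

# Levene's test: H0 is that the two groups have equal variances.
lev_stat, p_var = levene(a, b)
equal_var = bool(p_var >= 0.05)

# Pooled t-test if equal variances are plausible, otherwise Welch's t-test.
t_stat, p_mean = ttest_ind(a, b, equal_var=equal_var)
print(p_var, equal_var)
print(t_stat, p_mean)
```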
