Statistical Method For Decision Making Assignment - January 2021
PGP - DSBA
Submitted by:
Aamir Haque
Post Graduate Program in Data Science and Business Analytics
Contents:
1 – Wholesale Customer Data Analysis
2.1 Problem 2.1
2.2 Problem 2.2
SMDM JAN-2021
Executive Summary
This project analyzes three different problem areas using the data currently available for
each. The primary objective of this project is to analyze the data thoroughly using several
statistical methods and to draw conclusions and recommendations, so that decisions are backed
by the results of statistical analysis of the data.
Note:
#Kindly refer to the input datasets on the SMDM project Olympus portal for all three problems.
#Kindly refer to the Python notebook file (SMDM Assignment Jan 2021 - Aamir Haque.ipynb)
for the code-wise execution and solution of the assignment.
Problem Statement 1
1. The Channel column has two unique values, of which ‘Hotel’ is the most frequent
(298 rows). The Region column has three unique values, of which ‘Other’ is the most
frequent (316 rows).
2. The mean and standard deviation of each item are as follows:
No. Items Mean Standard Deviation
The above data shows that there are no null values in the data.
All the variables are numerical except Region and Channel, which are of object type.
There are 440 rows and 9 columns in total in the dataset.
Question 1.1: - Which Region and which Channel seems to spend more? Which Region and
which Channel seems to spend less?
Answer 1.1: -
To calculate the amount spent by Region and by Channel, the data needs to be
aggregated separately by Region and by Channel. Based on this aggregation, the annual spend is
higher in the ‘Hotel’ sales channel than in Retail. The descriptive statistics above show that the
average spending on Fresh is 12,000, Milk is 5,796,
Grocery is 7,951, Frozen is 3,071, Detergents_Paper is 2,881 and Delicatessen is 1,525. From this
result we can say that the highest average spending is on Fresh.
1. The data shows that spending in the Other region far exceeds that of the other two
regions combined. Annual spending in Other is 10,677,599, compared with 2,386,813 in
Lisbon and 1,555,088 in Oporto.
2. Spending in the Lisbon region is about 53.5% higher than in Oporto. Oporto spends
the least, at 1,555,088.
3. Hotels as a Channel spend more (7,999,569) than Retail (6,619,931).
4. That is, Hotels spend 20.8% more than Retail.
5. The Retail channel and the Oporto region spend the least.
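The aggregation described in this answer can be sketched as follows. This is a minimal illustration, assuming a pandas DataFrame with the column names used in this report; the three rows in `toy` are made up and are not the project data, and the helper names `total_spend_by` and `pct_more` are hypothetical.

```python
import pandas as pd

# Item column names as used in this report (assumed to match the dataset).
ITEMS = ["Fresh", "Milk", "Grocery", "Frozen", "Detergents_Paper", "Delicatessen"]

def total_spend_by(df, key):
    """Total spend across all item columns, summed per group of `key`."""
    row_totals = df[ITEMS].sum(axis=1)
    return row_totals.groupby(df[key]).sum().sort_values(ascending=False)

def pct_more(a, b):
    """Percentage by which a exceeds b."""
    return (a - b) / b * 100

# Made-up rows for illustration only (not the project data).
toy = pd.DataFrame({
    "Channel": ["Hotel", "Retail", "Hotel"],
    "Region": ["Other", "Lisbon", "Oporto"],
    "Fresh": [100, 50, 30], "Milk": [20, 80, 10], "Grocery": [30, 90, 10],
    "Frozen": [40, 10, 20], "Detergents_Paper": [5, 60, 5],
    "Delicatessen": [5, 10, 5],
})
by_channel = total_spend_by(toy, "Channel")
by_region = total_spend_by(toy, "Region")

# Percentage gaps computed from the totals reported in this answer.
hotel_vs_retail = pct_more(7_999_569, 6_619_931)
lisbon_vs_oporto = pct_more(2_386_813, 1_555_088)
```

Sorting the grouped totals in descending order puts the highest-spending group first and the lowest-spending group last.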
Question 1.2: - There are 6 different varieties of items considered. Do all varieties show
similar behavior across Region and Channel?
Answer 1.2: -
The behavioral characteristics of the data can be determined by aggregating it
with the describe function of pandas in Python. The data below shows the count, mean, std, min,
max and quartiles of all the varieties across the three regions and two channels.
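A grouped call to describe might look like the sketch below. The four rows are made up for illustration, and grouping by Channel alone is an assumption; grouping by a list such as ["Region", "Channel"] works the same way.

```python
import pandas as pd

# Made-up rows for illustration only (not the project data).
toy = pd.DataFrame({
    "Channel": ["Hotel", "Hotel", "Retail", "Retail"],
    "Fresh": [10.0, 30.0, 5.0, 15.0],
})

# One row per channel; columns are count, mean, std, min, 25%, 50%, 75%, max.
summary = toy.groupby("Channel")["Fresh"].describe()
```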
1. The graph clearly shows that the amount spent on ‘Fresh’ items is higher through the Hotel
channel than through the Retail channel. Also, in every region, spending on Fresh
items is higher in the Hotel channel than in the Retail channel.
2. Looking at the above graphs, we see that some categories such as Milk, Grocery and
Detergents_Paper have higher spend in the Retail channel than in Hotel, across all regions.
On the other hand, Fresh and Frozen have higher spend in the Hotel channel than in
Retail, across all regions.
3. The average annual spending on ‘Fresh’ items is higher in the Lisbon region than in the
Oporto region in the Hotel channel, and the opposite trend can be seen in the Retail channel.
4. The average annual spending on ‘Milk’ items is higher in the Retail channel than in
Hotel. Lisbon spends more than the Oporto region on Milk items
through both channels.
5. The average amount spent on ‘Grocery’ items is higher in the Retail channel than in
Hotel. Also, the Lisbon region spends more on Grocery items via the Retail
channel than via Hotel, whereas in the Hotel channel Oporto’s annual spending is the
highest across the country.
6. The average annual spending on ‘Frozen’ items is higher in Hotel than in the Retail
channel. Oporto region is the major contributor for Hotel whereas the average spending is
highest in Retail channel.
7. The average annual spending on ‘Detergents_Paper’ is much higher in the Retail channel
than in Hotel. The Oporto region spends the most via Retail, whereas
Hotels dominate the spending in the Lisbon region.
8. The average annual spending on ‘Delicatessen’ items is lower in Hotel than in the
Retail channel. The Lisbon region, on average, spends the most via Retail.
The skewness results for all the items are displayed above. The data is right skewed for
all the columns, and ‘Delicatessen’ is the most highly right skewed and asymmetric, with a
value of 11.11. Since the data is positively skewed, most data points have low values,
with a long tail of high values.
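Skewness can be computed in more than one way, and the two common estimators give slightly different numbers. The sketch below, on made-up values, contrasts the bias-corrected estimator of pandas with the default (biased) estimator of scipy.stats.skew; which of the two the notebook used is not stated in this report.

```python
import pandas as pd
from scipy.stats import skew

# Made-up values with one large observation, for illustration only.
values = pd.Series([1, 2, 2, 3, 3, 3, 40])

sample_skew = values.skew()      # pandas: bias-corrected sample skewness
population_skew = skew(values)   # scipy default: biased (population) skewness

# A right-skewed variable typically has its mean above its median.
mean_above_median = values.mean() > values.median()
```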
Question 1.3: - On the basis of descriptive measure of variability, which item shows the
most inconsistent behavior? Which items show the least inconsistent behaviour?
Answer 1.3: -
Descriptive measures of variability describe the amount of variability or spread in a set
of data. The most common measures of variability are the range, the interquartile range (IQR),
variance, standard deviation, and coefficient of variation. We will use the coefficient of
variation (CV) here, defined as the standard deviation divided by the mean:
CV = σ / μ
From the above results, we found that Delicatessen shows the most inconsistent behavior and Fresh
shows the least inconsistent behavior.
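The coefficient of variation can be computed per column as sketched below. The two columns hold made-up values, and the use of the sample standard deviation (the pandas default, ddof=1) is an assumption about the notebook.

```python
import pandas as pd

def coefficient_of_variation(df):
    """CV = standard deviation / mean, computed for every column."""
    return df.std() / df.mean()

# Made-up values for illustration only (not the project data).
toy = pd.DataFrame({
    "Fresh": [90, 100, 110],
    "Delicatessen": [1, 10, 100],
})

# Sorted ascending: the most consistent item first, the least consistent last.
cv = coefficient_of_variation(toy).sort_values()
```

Because CV is unit-free, it allows items with very different average spend to be compared on the same scale.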
The histograms of the items show that ‘Fresh’ and ‘Grocery’ items are the most widely
spread and have the highest standard deviations, whereas ‘Delicatessen’ items have the lowest
standard deviation in absolute terms. Relative to its mean, however, Delicatessen is the most
variable, which is why the coefficient of variation is used to compare items.
When a distribution has lower variability, the values in a dataset are more consistent. However,
when the variability is higher, the data points are more dissimilar and extreme values become more
likely.
Answer 1.4: -
Several plots and methods are available to check for outliers in the dataset.
Plots that can be used to check for outliers include the box plot and the scatter plot. A data
point lying far from the other data points in the plot is considered an outlier.
We can also calculate the interquartile range (IQR) of the variable. By the usual rule, a data
point below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR is considered an outlier.
The results for the earlier questions, such as the skewness and the behavioral characteristics,
indicate that there are many outliers in the data. We can confirm this by drawing a box plot for
all the items.
The box plots clearly show that every item in the data has outliers. We can also notice that
the median is not centered within the IQR for some items, such as Detergents_Paper. There are
extreme outliers for all of these items.
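The IQR-based check can be sketched as below, using the common 1.5 × IQR fences (the same convention that box plots usually draw). The values are made up, and the helper name `iqr_outliers` is hypothetical.

```python
import pandas as pd

def iqr_outliers(series, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return series[(series < lower) | (series > upper)]

# Made-up values for illustration only (not the project data).
spend = pd.Series([10, 12, 13, 14, 15, 16, 18, 95])
outliers = iqr_outliers(spend)
```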
Question 1.5 - On the basis of your analysis, what are your recommendations for the
business? How can your analysis help the business to solve its problem? Answer from the
business perspective.
Answer 1.5: -
Business Recommendations-
1. It can be noticed that the overall sales in Hotels are much higher than the sales in Retail. The
distributor may consider the Retail channel as a target area for further expansion and growth. Spend in
Hotel needs to be increased in Milk, Grocery and Detergents_Paper. Spend in Retail needs to be
increased in Fresh, Frozen and Delicatessen.
2. The annual spending on Grocery items appears to increase with the number of Retailers in the
region. So, the Retail channel should spend most on Grocery items. The spending should be done
carefully as Grocery items are also very inconsistent.
3. The annual spending through both channels by all the regions should be managed carefully,
especially in the case of Fresh items. Fresh items have the highest standard deviation even though
they are the least inconsistent relative to their mean. So, the spending on this item should be done carefully.
4. The data is not normally distributed, owing to the presence of many outliers. This indicates that a
large number of sales can be attributed to a few specific buyers. Additional consideration
should be given to these buyers to ensure long-term retention.
5. As the skewness of Delicatessen is very high, it indicates that sales depend on a few buyers,
which is a high risk.
6. There needs to be a focus on increasing the spend in the Lisbon and Oporto regions and in the
Retail channel, to balance the spend and reduce risk while increasing business.
Problem Statement 2
The Student News Service at Clear Mountain State University (CMSU) has decided to gather
data about the undergraduate students that attend CMSU. CMSU creates and distributes a survey
of 14 questions and receives responses from 62 undergraduates.
The data has 14 variables. ID is the variable that holds the unique row number for each response.
1. There are 6 categorical variables: Gender, Class, Major, Grad Intent, Employment
and Computer.
2. There are 5 integer variables: Age, Social Networking, Satisfaction,
Spending and Text Messages.
3. There are 2 float variables: GPA and Salary.
Question 2.1: - For this data, construct the following contingency tables
Answer 2.1.1: - The table below is created using the crosstab function and shows the
Major chosen by Male and Female students.
Answer 2.1.2: - The table below is created using the crosstab function and shows the intention to
graduate for Male and Female students.
Answer 2.1.3: - The table below is created using the crosstab function and shows the Employment
status of Male and Female students.
Answer 2.1.4: - The table below is created using the crosstab function and shows the type of
computer available to Male and Female students.
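A contingency table of this kind can be built with pandas.crosstab, as sketched below. The five survey rows are made up for illustration and are not the CMSU responses; the column names Gender and Major follow the variable names listed in this report.

```python
import pandas as pd

# Made-up responses for illustration only (not the CMSU survey data).
survey = pd.DataFrame({
    "Gender": ["Male", "Female", "Female", "Male", "Female"],
    "Major": ["CIS", "CIS", "Accounting", "Accounting", "CIS"],
})

# Rows are Gender, columns are Major; margins=True adds the "All" totals.
table = pd.crosstab(survey["Gender"], survey["Major"], margins=True)
```

The row and column totals in "All" are the denominators used for the marginal and conditional probabilities in the later questions.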
Question 2.2: - Assume that the sample is representative of the population of CMSU. Based
on the data, answer the following question:
2.2.1. What is the probability that a randomly selected CMSU student will be male?
Answer 2.2.1: -
Total Students=62
2.2.2. What is the probability that a randomly selected CMSU student will be female?
Answer 2.2.2: -
Question 2.3: - Assume that the sample is representative of the population of CMSU. Based
on the data, answer the following question:
2.3.1 Find the conditional probability of different majors among the male students in
CMSU
Answer 2.3.1: -
Number of males = 29
Probability of a male selecting Economics/Finance = 4/29
P (Economics/Finance| Male) = 0.138 or 13.8%
2.3.2 Find the conditional probability of different majors among the female students in
CMSU
Answer 2.3.2: -
Number of females = 33
2. Number of females who prefer CIS = 3
P (CIS ∩ Female) = 3/62
P (Female) = 33/62
P (CIS| Female) = P (CIS ∩ Female) / P (Female)
P (CIS| Female) = 3/33
P (CIS| Female) = 0.091 or 9.1%
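The conditional probability above can be reproduced with exact fractions, as in this small sketch. The counts (3 females preferring CIS, 33 females, 62 students) are the ones reported in this answer; the helper name `conditional` is hypothetical.

```python
from fractions import Fraction

def conditional(p_joint, p_given):
    """P(A | B) = P(A and B) / P(B)."""
    return p_joint / p_given

p_cis_and_female = Fraction(3, 62)
p_female = Fraction(33, 62)

p_cis_given_female = conditional(p_cis_and_female, p_female)
p_as_decimal = round(float(p_cis_given_female), 3)
```

The total of 62 cancels out, which is why the result equals 3/33.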
Question 2.4: - Assume that the sample is representative of the population of CMSU.
Based on the data, answer the following question:
2.4.1 Find the probability that a randomly chosen student is a male and intends to
graduate.
Answer 2.4.1: -
Number of males = 29
2.4.2 Find the probability that a randomly chosen student is a female and does NOT
have a laptop.
Answer 2.4.2: -
Number of females = 33
Question 2.5: - Assume that the sample is representative of the population of CMSU. Based
on the data, answer the following question:
2.5.1 Find the probability that a randomly chosen student is either a male or has full-
time employment?
Answer 2.5.1: -
Number of males = 29
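An "either A or B" probability of this kind follows the addition rule, P(A ∪ B) = P(A) + P(B) - P(A ∩ B). A minimal sketch is given below; the counts passed in are made up for illustration and are not the survey counts, and the helper name `p_union` is hypothetical.

```python
from fractions import Fraction

def p_union(n_a, n_b, n_both, n_total):
    """P(A or B) from counts, subtracting the overlap counted twice."""
    return Fraction(n_a + n_b - n_both, n_total)

# Made-up counts for illustration only: 4 in A, 3 in B, 2 in both, 10 overall.
p_either = p_union(n_a=4, n_b=3, n_both=2, n_total=10)
```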
2.5.2 Find the conditional probability that given a female student is randomly chosen, she is
majoring in international business or management.
Answer 2.5.2: -
Number of females = 33
Question 2.6: - Construct a contingency table of Gender and Intent to Graduate at 2 levels
(Yes/No). The Undecided students are not considered now and the table is a 2x2 table. Do
you think the graduate intention and being female are independent events?
Answer 2.6: -
Number of females = 20
P (F ∩ Yes) = 13.8%
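Two events A and B are independent when P(A ∩ B) = P(A) × P(B). The sketch below checks that condition from counts in a 2x2 table; the counts are made up for illustration and are not the survey counts, and the helper name `is_independent` is hypothetical. With sample data the two sides are rarely exactly equal, so in practice the size of the gap matters as well.

```python
from fractions import Fraction

def is_independent(n_ab, n_a, n_b, n_total):
    """True when P(A and B) equals P(A) * P(B) exactly."""
    p_ab = Fraction(n_ab, n_total)
    return p_ab == Fraction(n_a, n_total) * Fraction(n_b, n_total)

# Made-up counts for illustration only.
independent_case = is_independent(n_ab=20, n_a=50, n_b=40, n_total=100)
dependent_case = is_independent(n_ab=30, n_a=50, n_b=40, n_total=100)
```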
Question 2.7: - Note that there are four numerical (continuous) variables in the data set,
GPA, Salary, Spending, and Text Messages. Answer the following questions based on the
data
2.7.1. If a student is chosen randomly, what is the probability that his/her GPA is
less than 3?
Answer 2.7.1: -
2.7.2. Find the conditional probability that a randomly selected male earns 50 or
more. Find the conditional probability that a randomly selected female earns 50 or more.
Answer 2.7.2: -
2. Number of females = 33
Number of females with a salary of 50 or more = 18
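Conditional probabilities on a numerical variable can be read directly from the data as the share of rows in a subgroup that meet the condition. The sketch below uses made-up rows (not the CMSU responses) and also evaluates the ratio of the two female counts reported above.

```python
import pandas as pd

# Made-up responses for illustration only (not the CMSU survey data).
toy = pd.DataFrame({
    "Gender": ["Female", "Female", "Female", "Male", "Male"],
    "Salary": [40.0, 50.0, 60.0, 45.0, 55.0],
})

# Share of each gender whose salary is 50 or more.
p_female_50_plus = (toy.loc[toy["Gender"] == "Female", "Salary"] >= 50).mean()
p_male_50_plus = (toy.loc[toy["Gender"] == "Male", "Salary"] >= 50).mean()

# Ratio of the counts reported in this answer: 18 of 33 females.
reported_female_share = 18 / 33
```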
Question 2.8: - Note that there are four numerical (continuous) variables in the data set,
GPA, Salary, Spending, and Text Messages. For each of them comment whether they
follow a normal distribution. Write a note summarizing your conclusions
Answer 2.8: -
We use histograms, box plots and a comparison of the mean, median and mode to assess whether
each of the four numerical (continuous) variables in the data set (GPA, Salary, Spending and
Text Messages) follows a normal distribution.
GPA Variable:
Salary Variable:
The visual representations above indicate that the data does not follow a normal
distribution. The mean and median of the Salary column differ slightly, which means that
the data is slightly skewed. The boxplot shows that it has outliers.
Spending Variable:
The visual representations above indicate that the data does not follow a normal
distribution. The mean and median of the Spending column are not the same, which means
that the variable is skewed. It also has outliers.
Text Messages Variable:
The visual representations above indicate that the data does not follow a normal
distribution. The mean and median of the Text Messages column differ greatly, so the
data is highly right skewed. It also has outliers.
The GPA box plot is roughly symmetric, with whiskers of similar length, which is consistent with
an approximately normal distribution, whereas the box plots of Salary, Spending and Text Messages
have whiskers of different lengths and hence do not suggest a normal distribution.
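The conclusions above rest on plots and on comparing the mean with the median. As a numerical complement, which is not part of the original analysis, a Shapiro-Wilk test could be run alongside those checks. The sketch below applies it to a synthetic right-skewed sample generated with NumPy; the data and the seed are assumptions made purely for illustration.

```python
import numpy as np
from scipy import stats

# Synthetic right-skewed sample, for illustration only.
rng = np.random.default_rng(0)
skewed = rng.exponential(scale=100, size=200)

# Informal checks used in this report: mean vs median, and skewness.
mean_above_median = skewed.mean() > np.median(skewed)
sample_skewness = stats.skew(skewed)

# Shapiro-Wilk test: a small p-value is evidence against normality.
w_stat, p_value = stats.shapiro(skewed)
looks_non_normal = p_value < 0.05
```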
Problem Statement 3
An important quality characteristic used by the manufacturers of ABC asphalt shingles is the
amount of moisture the shingles contain when they are packaged. Customers may feel that they
have purchased a product lacking in quality if they find moisture and wet shingles inside the
packaging. In some cases, excessive moisture can cause the granules attached to the shingles for
texture and coloring purposes to fall off the shingles resulting in appearance problems. To
monitor the amount of moisture present, the company conducts moisture tests. A shingle is
weighed and then dried. The shingle is then reweighed, and based on the amount of moisture
taken out of the product, the pounds of moisture per 100 square feet are calculated. The company
would like to show that the mean moisture content is less than 0.35 pound per 100 square feet.
The dataset has 2 variables, A and B, which hold the measured moisture content in pounds per
100 sq. ft. Both variables are of float data type.
Question 3.1. Do you think there is evidence that mean moisture contents in both types of
shingles are within the permissible limits? State your conclusions clearly showing all steps.
Answer 3.1.: -
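Let 𝜇A and 𝜇B denote the population mean moisture content, in pounds per 100 square feet, of
shingles A and B.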
For Shingle A
Step I
Null hypothesis (𝐻0) states that the mean moisture content in shingles A, 𝜇A, is greater than
or equal to 0.35.
Alternative hypothesis (𝐻𝐴) states that the mean moisture content in shingles A, 𝜇A, is less
than 0.35. This is the claim the company would like to show.
𝐻0: 𝜇A ≥ 0.35
𝐻𝐴: 𝜇A < 0.35
Step II
Level of significance: α = 0.05.
Step III
We do not know the population standard deviation and n = 36. So, we use the t
distribution and the 𝑡𝑆𝑇𝐴𝑇 test statistic.
Step IV
We conduct a one-sample t test on Shingles A (n = 36) using a Python function that
compares the mean of a sample to a pre-specified value and tests for a deviation from that
value.
One sample t test
From Python notebook:
t statistic: -1.473505
P value: 0.07477633 (lower tail)
Step V
The p-value (0.0748) is greater than 0.05. So, the statistical decision is to fail to reject
the null hypothesis at the 5% level of significance.
Conclusion
Hence, at the 5% level of significance, there is not sufficient evidence to conclude that the
mean moisture content in A shingles is less than 0.35 pound per 100 square feet.
For Shingle B
Step I
Null hypothesis (𝐻0) states that the mean moisture content in shingles B, 𝜇B, is greater than
or equal to 0.35 pound per 100 square feet.
Alternative hypothesis (𝐻𝐴) states that the mean moisture content in shingles B, 𝜇B, is less
than 0.35 pound per 100 square feet.
𝐻0: 𝜇B ≥ 0.35
𝐻𝐴: 𝜇B < 0.35
Step II
Level of significance: α = 0.05.
Step III
We do not know the population standard deviation and n = 31. So, we use the t
distribution and the 𝑡𝑆𝑇𝐴𝑇 test statistic.
Step IV
t statistic: -3.10033130
P value: 0.0020904774 (lower tail)
Step V
The p-value (0.0021) is less than 0.05. So, the statistical decision is to reject the null
hypothesis at the 5% level of significance.
Conclusion
Hence, at the 5% level of significance, there is sufficient evidence to conclude that the mean
moisture content in B shingles is less than 0.35 pound per 100 square feet.
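The reported p-values correspond to the lower tail of the t distribution, which can be checked from the reported t statistics and sample sizes. The sketch below does that check and also shows one way to run a lower-tail one-sample t test on raw data with scipy.stats.ttest_1samp; the helper name `lower_tail_t_test` and the five sample values are made up for illustration and are not the shingle measurements.

```python
import numpy as np
from scipy import stats

def lower_tail_t_test(sample, mu0):
    """One-sample t test of H0: mu >= mu0 against HA: mu < mu0."""
    x = np.asarray(sample, dtype=float)
    x = x[~np.isnan(x)]  # drop missing values, if any
    t_stat, p_two_sided = stats.ttest_1samp(x, popmean=mu0)
    # Convert the two-sided p-value to a lower-tail p-value.
    p_lower = p_two_sided / 2 if t_stat < 0 else 1 - p_two_sided / 2
    return t_stat, p_lower

# Lower-tail p-values implied by the reported t statistics (df = n - 1).
p_a = stats.t.cdf(-1.473505, df=36 - 1)
p_b = stats.t.cdf(-3.10033130, df=31 - 1)

# Made-up measurements for illustration only (not the shingle data).
t_toy, p_toy = lower_tail_t_test([0.30, 0.32, 0.31, 0.33, 0.29], mu0=0.35)
```

Halving the two-sided p-value by hand keeps the sketch independent of the SciPy version; recent SciPy releases also accept an `alternative` argument on ttest_1samp for the same purpose.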
Question 3.2: - Do you think that the population mean for shingles A and B are equal?
Form the hypothesis and conduct the test of the hypothesis. What assumption do you need
to check before the test for equality of means is performed?
Answer 3.2: -
1. We assume that the samples are random and that both populations are normally distributed.
2. We assume unequal variances of the populations.
3. We assume that the scale of measurement of the collected data is continuous or ordinal,
as required for a t-test.
4. We assume that the sample size is reasonably large. A larger sample size means the
distribution of the sample mean should approach a normal bell-shaped curve.
Step I
Null hypothesis (𝐻0) states that the population mean in shingles is the same, 𝜇𝐴 equals
𝜇𝐵.
Alternative hypothesis (𝐻𝐴) states that the population mean in shingles is different, 𝜇𝐴 is
not equal to 𝜇𝐵.
𝐻0: 𝜇𝐴 - 𝜇𝐵 = 0 i.e. 𝜇𝐴 = 𝜇𝐵
𝐻𝐴: 𝜇𝐴 - 𝜇𝐵 ≠ 0 i.e. 𝜇𝐴 ≠ 𝜇𝐵
Step II
Step III
We have two independent samples and we do not know the population standard deviations.
So, we use the t distribution and the 𝑡𝑆𝑇𝐴𝑇 test
statistic for a two-sample unpaired test.
Step IV
We use scipy.stats.ttest_ind to calculate the t-test for the means of two
independent samples of shingles, given the two sets of sample observations. This function
returns the t statistic and the two-tailed p value. This is a two-sided test for the null hypothesis
that 2 independent samples have identical average (expected) values. By default
(equal_var=True), this test assumes that the populations have identical variances; with
equal_var=False it performs Welch's t-test, which does not.
Step V
So, the statistical decision is to fail to reject the null hypothesis at the 5% level of
significance.
Conclusion
Hence, at the 5% level of significance, there is not sufficient evidence to conclude that the
population mean for shingles A differs from the population mean for shingles B.
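A two-sample test of this kind can be sketched as below. The two arrays are made up for illustration and are not the shingle measurements; the equal_var=False option (Welch's t-test) is shown because unequal variances are listed among the assumptions, and whether the notebook used it is not stated here.

```python
import numpy as np
from scipy import stats

# Made-up measurements for illustration only (not the shingle data).
sample_a = np.array([0.30, 0.32, 0.31, 0.33, 0.29])
sample_b = np.array([0.31, 0.29, 0.30, 0.32, 0.28])

# Welch's two-sample t test: H0: mu_A = mu_B against HA: mu_A != mu_B.
t_stat, p_value = stats.ttest_ind(sample_a, sample_b, equal_var=False)

# Decision at the 5% level of significance.
reject_h0 = p_value < 0.05
```

Failing to reject the null hypothesis means only that the data do not show a difference; it does not prove that the two means are equal.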