SMDM Assignment: Problem 1

SMDM

ASSIGNMENT
By:- Manas Vikram Singh

Problem 1
We imported the ‘Wholesale Customer data’ dataset into Python to analyze
spending on each item category across regions and channels. The detailed
approach and the answer to each problem follow.
1.1 Use methods of descriptive statistics to summarize data. Which Region
and which Channel seems to spend more? Which Region and which
Channel seems to spend less?
Solution:
The data set covers 440 buyers/spenders across different regions of Portugal,
grouped into 3 categories: Lisbon, Oporto and Other. It is also divided
into 2 channels: Hotel and Retail.
In a Jupyter notebook we created a summary of total spend by channel.
• The Hotel channel has the highest total spend, 7999569.
• The Retail channel has the lowest total spend, 6619931.
Below is the output from Python
Channel
Hotel 7999569
Retail 6619931
Similarly, we grouped the totals by region.
The Other region has the highest total spend, 10677599, and the Oporto
region has the lowest total spend, 1555088.
Below is the output from Python:
Region
Lisbon 2386813
Oporto 1555088
Other 10677599
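The grouping described above can be sketched as follows. This is a minimal illustration, not the notebook code used for the assignment: the small data frame, its values and the column names are hypothetical stand-ins for the 440-row data set.

```python
import pandas as pd

# Hypothetical miniature data set (the real file has 440 rows and 6 item columns).
df = pd.DataFrame({
    "Channel": ["Hotel", "Retail", "Hotel", "Retail", "Hotel"],
    "Region": ["Lisbon", "Oporto", "Other", "Other", "Oporto"],
    "Fresh": [100, 50, 300, 80, 20],
    "Milk": [40, 90, 60, 120, 10],
})

# Total spend per customer across the item columns.
item_cols = ["Fresh", "Milk"]
df["Total"] = df[item_cols].sum(axis=1)

# Total spend per channel and per region.
by_channel = df.groupby("Channel")["Total"].sum()
by_region = df.groupby("Region")["Total"].sum()

print(by_channel)
print(by_region)
print("Highest channel:", by_channel.idxmax(), "| lowest channel:", by_channel.idxmin())
print("Highest region:", by_region.idxmax(), "| lowest region:", by_region.idxmin())
```

idxmax and idxmin return the group label with the largest and smallest total, which answers the "spends more / spends less" question directly.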
1.2 Problem There are 6 different varieties of items considered.
Do all varieties show similar behavior across Region and Channel?
Provide justification for your answer.
Solution:
Using a bar graph for each category and checking spend across Channel, we
get the following outputs from Python:
Looking at the graphs, we see that Milk, Grocery, Detergents Paper and
Delicatessen have higher spend in the Retail channel than in Hotel, across
all regions. Fresh and Frozen, on the other hand, have higher spend in the
Hotel channel than in Retail, across all regions.
Similarly, we used a bar graph for each category to check spend across
Region.
Looking at that graph, we see that Grocery, Frozen and Detergents Paper are
consumed most in Oporto compared with the other regions. Delicatessen, Fresh
and Milk are consumed more in the Other region than in Lisbon and Oporto.
So the varieties do not all behave the same way across Region and Channel.

1.3 Problem On the basis of the descriptive measure of variability, which
item shows the most inconsistent behavior? Which item shows the
least inconsistent behavior?
Solution:
Using the coefficient of variation (standard deviation divided by mean), we
find that the lowest value is for the category “Delicatessen” (0.541) and the
highest value is for the category “Fresh” (0.949).
So, from the given data, the most inconsistent behavior is shown by the
item Fresh, and the least inconsistent behavior by the item Delicatessen.
Below is the output from Python:
Coefficient of Variation for Fresh is 0.949
Coefficient of Variation for Milk is 0.785
Coefficient of Variation for Frozen is 0.633
Coefficient of Variation for Grocery is 0.837
Coefficient of Variation for Detergents paper is 0.604
Coefficient of Variation for Delicatessen is 0.541
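A coefficient of variation of this kind can be computed as sketched below. The two columns and their values are hypothetical, and the use of the sample standard deviation (the pandas default, ddof=1) is an assumption, since the assignment does not say which form was used.

```python
import pandas as pd

# Hypothetical spend columns; not the assignment's data.
spend = pd.DataFrame({
    "Fresh": [10, 20, 30, 40],
    "Milk": [24, 25, 25, 26],
})

# Coefficient of variation = standard deviation / mean, per column.
# DataFrame.std() uses the sample standard deviation (ddof=1) by default.
cv = spend.std() / spend.mean()

for item, value in cv.items():
    print(f"Coefficient of Variation for {item} is {value:.3f}")

print("Most inconsistent:", cv.idxmax())
print("Least inconsistent:", cv.idxmin())
```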
1.4 Problem Are there any outliers in the data?
Solution:
To find outliers, we drew a box plot for every item category. Given below is
the output from Python.
The output shows that there are outliers in every item category.
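A box plot flags points that lie more than 1.5 times the interquartile range beyond the quartiles. The same rule can be applied numerically, as in this sketch. The helper name iqr_outliers and the sample values are hypothetical.

```python
import pandas as pd

def iqr_outliers(values: pd.Series) -> pd.Series:
    """Return the values lying outside the 1.5 * IQR whiskers of a box plot."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return values[(values < lower) | (values > upper)]

# Hypothetical spend values with one obviously extreme point.
values = pd.Series([10, 12, 11, 13, 12, 11, 100])
outliers = iqr_outliers(values)
print(outliers.tolist())
```

Applying the helper to each item column gives a count of outliers per category, which is easier to report than a visual reading of the plot.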

1.5 Problem On the basis of your analysis, what are your
recommendations for the business? How can your analysis help the
business to solve its problem? Answer from the business perspective.
Solution:
The analysis shows that spending is inconsistent across the different items
(as measured by the coefficient of variation), and this inconsistency should
be minimized. Spending differs between the Hotel and Retail channels and
should be brought closer to equal. Spending also differs across regions and
should likewise be evened out. In consumption of each item, the Lisbon region
is always either the lowest or sits between Oporto and the Other region.
Problem 2
2.1. Problem For this data, construct the following contingency tables
(Keep Gender as row variable)
Problem 2.1.1 Gender and Major
Solution:
Below is output from Python
Major   Accounting  CIS  Economics/  International  Management  Other  Retailing/  Undecided
                         Finance     Business                          Marketing
Gender
Female  3           3    7           4              4           3      9           0
Male    4           1    4           2              6           4      5           3

Problem 2.1.2 Gender and Grad Intention
Solution:
Below is output from Python
Grad Intention  No  Undecided  Yes
Gender
Male            3   9          17
Female          9   13         11

Problem 2.1.3 Gender and Employment
Solution:
Below is output from Python
EMPLOYMENT  Full-Time  Part-Time  Unemployed
GENDER
Male        7          19         3
Female      3          24         6

Problem 2.1.4 Gender and Computer
Solution:
Below is output from Python
COMPUTER  Desktop  Laptop  Tablet
GENDER
Female    2        29      2
Male      3        26      0
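Contingency tables like the ones above are commonly built with pandas.crosstab, passing Gender first so that it becomes the row variable. The sketch below uses a hypothetical six-row survey, not the 62-student data.

```python
import pandas as pd

# Hypothetical survey rows; the real data set has 62 students.
survey = pd.DataFrame({
    "Gender": ["Female", "Male", "Female", "Male", "Female", "Male"],
    "Computer": ["Laptop", "Laptop", "Tablet", "Desktop", "Laptop", "Laptop"],
})

# First argument becomes the rows, second the columns.
table = pd.crosstab(survey["Gender"], survey["Computer"])
print(table)
```

Row and column labels come out sorted alphabetically, which is why Female can appear above Male in the output.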

2.2 Problem Assume that the sample is representative of the population
of CMSU. Based on the data, answer the following question:
Problem 2.2.1 What is the probability that a randomly selected CMSU
student will be male?
Solution:
Total students = 62
Total males = 29
Probability that a randomly selected CMSU student is male = total
males / total students
= 29/62
= 0.468
So the probability that a randomly selected student is male is 46.8%.

Problem 2.2.2 What is the probability that a randomly selected CMSU
student will be female?
Solution:
Total students = 62
Total females = 33
Probability of selecting a female = total females / total students
= 33/62
= 0.532
So the probability that a randomly selected student is female is 53.2%.

2.3 Problem Assume that the sample is representative of the population
of CMSU. Based on the data, answer the following question:
Problem 2.3.1 Find the conditional probability of different majors
among the male students in CMSU.
Problem 2.3.2 Find the conditional probability of different majors
among the female students in CMSU.

Problem 2.3.1 Solution:
Total males = 29
Using the contingency table of Gender and Major, we got the total number of
males and the number of males opting for each major. Each conditional
probability is the count for that major divided by 29.
Below is the output from Python:
Probability of male opting for Accounting is 13.79%
Probability of male opting for CIS is 3.45%
Probability of male opting for Economics/Finance is 13.79%
Probability of male opting for International Business is 6.90%
Probability of male opting for Management is 20.69%
Probability of male opting for Other is 13.79%
Probability of male opting for Retailing/Marketing is 17.24%
Probability of male opting for Undecided is 10.34%

Problem 2.3.2 Solution:
Using a similar approach, we find the probability of different majors among
females.
Total females = 33
Using the contingency table of Gender and Major, we got the total number of
females and the number of females opting for each major. Each conditional
probability is the count for that major divided by 33.
Below is the output from Python:
Probability of female opting for Accounting is 9.09%
Probability of female opting for CIS is 9.09%
Probability of female opting for Economics/Finance is 21.21%
Probability of female opting for International Business is 12.12%
Probability of female opting for Management is 12.12%
Probability of female opting for Other is 9.09%
Probability of female opting for Retailing/Marketing is 27.27%
Probability of female opting for Undecided is 0.00%
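The conditional probabilities in 2.3.1 and 2.3.2 are row-wise proportions of the Gender by Major counts. The sketch below uses the counts implied by the percentages listed above (for example 4/29 = 13.79% and 3/33 = 9.09%). The variable names are illustrative, and this is one possible way to compute them rather than the notebook's actual code.

```python
import pandas as pd

majors = [
    "Accounting", "CIS", "Economics/Finance", "International Business",
    "Management", "Other", "Retailing/Marketing", "Undecided",
]

# Counts implied by the listed percentages: 33 females and 29 males in total.
counts = pd.DataFrame(
    [[3, 3, 7, 4, 4, 3, 9, 0],
     [4, 1, 4, 2, 6, 4, 5, 3]],
    index=["Female", "Male"],
    columns=majors,
)

# P(major | gender) = count in the cell / total for that gender's row.
cond = counts.div(counts.sum(axis=1), axis=0)
print((cond * 100).round(2))
```

With raw survey rows instead of counts, pd.crosstab(gender, major, normalize="index") yields the same row-wise proportions in one step.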

2.4. Assume that the sample is representative of the population of
CMSU. Based on the data, answer the following question:
2.4.1. Find the probability that a randomly chosen student is a male and
intends to graduate.
2.4.2. Find the probability that a randomly selected student is a female
and does NOT have a laptop.
Solution:

Using the contingency table of Gender and Grad Intention, we got the number
of males who intend to graduate (17) out of all 62 students.
Given below is the output from Python:
Probability that a randomly chosen student is male and intends to graduate is
17/62 = 27.419%
Using the contingency table of Gender and Computer, we got the number of
females who do NOT have a laptop (2 with a desktop + 2 with a tablet = 4).
Given below is the output from Python:
Probability that a randomly chosen student is female and does NOT have a
laptop is 4/62 = 6.452%
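Both answers are joint probabilities: a single cell (or sum of cells) of a contingency table divided by the grand total. A minimal sketch using the counts from the tables in 2.1.2 and 2.1.4; the variable names are illustrative.

```python
# Counts read from the Gender x Grad Intention and Gender x Computer tables.
total_students = 62
male_yes = 17                          # males who intend to graduate
female_desktop, female_tablet = 2, 2   # females without a laptop

# Joint probability = cell count / grand total.
p_male_and_yes = male_yes / total_students
p_female_no_laptop = (female_desktop + female_tablet) / total_students

print(f"P(male and intends to graduate) = {p_male_and_yes * 100:.3f}%")
print(f"P(female and no laptop) = {p_female_no_laptop * 100:.3f}%")
```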
2.5. Assume that the sample is representative of the population of CMSU. Based
on the data, answer the following question:
2.5.1. Find the probability that a randomly chosen student is either a male or has
full-time employment?
2.5.2. Find the conditional probability that given a female student is randomly
chosen, she is majoring in international business or management.

2.5.1 Solution

Using the contingency table of Gender and Employment, we got the number of
males, the number of students with full-time employment, and the number of
males with full-time employment.
Given below is the output from Python:

Probability of male = 29/62 = 0.468

Probability of a student having full-time employment = 10/62 = 0.161

Probability of a student being male and having full-time employment = 7/62 = 0.113

Probability = (probability of male + probability of student having full-time
employment - probability of student being male and having full-time
employment) * 100
= (0.468 + 0.161 - 0.113) * 100

= 51.6%

So the probability that a randomly selected student is either a male or has
full-time employment is 51.6%.
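The calculation in 2.5.1 is the addition rule, P(A or B) = P(A) + P(B) - P(A and B). Working from the raw counts in the Gender and Employment table avoids accumulating rounding error from the three-decimal probabilities. A small sketch with illustrative variable names:

```python
# Counts from the Gender x Employment table.
total_students = 62
males = 29
full_time = 7 + 3      # full-time males + full-time females
male_full_time = 7

# Addition rule applied to counts: (29 + 10 - 7) / 62 = 32/62.
p_male_or_full_time = (males + full_time - male_full_time) / total_students

print(f"P(male or full-time) = {p_male_or_full_time:.3f}")
```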
2.5.2 Solution

Using the contingency table of Gender and Major, we got the total number of
females and the number of females opting for each major.

Total females = 33

Total females majoring in international business or management = 4 + 4 = 8

Probability = (total females majoring in international business or
management / total females)

= 8/33

= 0.242

So the probability that a randomly chosen female student is majoring in
international business or management is 24.2%.
2.6. Construct a contingency table of Gender and Intent to Graduate at 2 levels (Yes/No).
The Undecided students are not considered now and the table is a 2x2 table. Do you
think the graduate intention and being female are independent events?

Below is output from python


Grad Intention NO YES
Gender
Male 3 17
Female 9 11

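No. Among these 40 students, P(Female) = 20/40 = 0.50 and P(Yes) = 28/40 =
0.70, so independence would require P(Female and Yes) = 0.50 * 0.70 = 0.35.
The observed value is 11/40 = 0.275, so graduate intention and being female
are not independent events in this sample.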
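Two events A and B are independent when P(A and B) = P(A) * P(B). One way to check this against the 2x2 table above is sketched below. It compares the sample proportions directly and is not a significance test; the variable names are illustrative.

```python
import pandas as pd

# The 2x2 table of Gender by Grad Intention (Undecided removed).
table = pd.DataFrame(
    [[3, 17], [9, 11]],
    index=["Male", "Female"],
    columns=["No", "Yes"],
)

n = table.values.sum()
p_female = table.loc["Female"].sum() / n
p_yes = table["Yes"].sum() / n
p_female_and_yes = table.loc["Female", "Yes"] / n

# Independence requires the joint probability to equal the product of marginals.
product = p_female * p_yes
independent = abs(p_female_and_yes - product) < 1e-9

print(f"P(Female) * P(Yes) = {product:.3f}")
print(f"P(Female and Yes)  = {p_female_and_yes:.3f}")
print("Independent in this sample:", independent)
```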

2.7.1. If a student is chosen randomly, what is the probability that his/her GPA is
less than 3?
2.7.2. Find the conditional probability that a randomly selected male earns 50 or
more. Find the conditional probability that a randomly selected female earns 50 or
more.

2.7.1 Solution:

Using Python, the number of students having a GPA less than 3 = 17

Total students = 62

So the probability that a randomly selected student has a GPA less than 3 is
17/62 = 27.4%

2.7.2 Solution

Using Python we get the following output:

The number of males who earn 50 or more is 14

Total males = 29

Probability that a randomly selected male earns 50 or more is 14/29 = 48.3%

The number of females who earn 50 or more is 18

Total females = 33

Probability that a randomly selected female earns 50 or more is 18/33 = 54.5%
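Conditional probabilities of this kind can be read off a boolean condition grouped by gender, since the mean of a True/False column is the proportion of True values. The sketch below uses a hypothetical six-row survey with assumed column names "Gender" and "Salary"; the last two lines redo the arithmetic with the counts reported above.

```python
import pandas as pd

# Hypothetical rows; column names are assumptions, not taken from the data file.
survey = pd.DataFrame({
    "Gender": ["Male", "Male", "Female", "Female", "Female", "Male"],
    "Salary": [55.0, 40.0, 60.0, 48.0, 45.0, 70.0],
})

# P(Salary >= 50 | Gender): proportion of True values within each gender.
earns_50_plus = survey["Salary"] >= 50
p_by_gender = earns_50_plus.groupby(survey["Gender"]).mean()
print(p_by_gender)

# The same arithmetic using the counts reported in the solution.
p_male = 14 / 29
p_female = 18 / 33
print(f"Male: {p_male * 100:.1f}%  Female: {p_female * 100:.1f}%")
```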

2.8. Note that there are four numerical (continuous) variables in the data
set, GPA, Salary, Spending, and Text Messages. For each of them
comment whether they follow a normal distribution. Write a note
summarizing your conclusions.

By using Python, we calculated the skewness of each variable:


GPA = -0.314
Salary = 0.534
Spending = 1.585
Text Messages = 1.295
From the above result we can say that “Spending” and “Text Messages” are
skewed right, whereas “Salary” is slightly skewed right. “GPA”, on the other
hand, is slightly left skewed. So none of the variables shows an ideal normal
distribution. “GPA” comes closest to following a normal distribution.
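Skewness near 0 suggests a roughly symmetric distribution, positive values a longer right tail and negative values a longer left tail. A minimal sketch of the calculation with pandas follows. The two columns are hypothetical, and DataFrame.skew() is one common choice; the assignment does not say which skewness function was used.

```python
import pandas as pd

# Hypothetical columns: one symmetric, one with a long right tail.
data = pd.DataFrame({
    "symmetric": [1, 2, 3, 4, 5],
    "right_skewed": [1, 1, 2, 2, 10],
})

# DataFrame.skew() returns the bias-corrected sample skewness per column.
skewness = data.skew()
print(skewness.round(3))
```

Skewness alone does not establish normality, so a histogram or a formal test is a useful companion check.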

Problem 3
An important quality characteristic used by the manufacturers of ABC asphalt
shingles is the amount of moisture the shingles contain when they are
packaged. Customers may feel that they have purchased a product lacking in
quality if they find moisture and wet shingles inside the packaging. In some
cases, excessive moisture can cause the granules attached to the shingles for
texture and coloring purposes to fall off the shingles resulting in appearance
problems. To monitor the amount of moisture present, the company conducts
moisture tests. A shingle is weighed and then dried. The shingle is then
reweighed, and based on the amount of moisture taken out of the product, the
pounds of moisture per 100 square feet is calculated. The company would like
to show that the mean moisture content is less than 0.35 pound per 100
square feet.

The file (A & B shingles.csv) includes 36 measurements (in pounds per 100
square feet) for A shingles and 31 for B shingles.

3.1 Problem Do you think there is evidence that mean moisture contents
in both types of shingles are within the permissible limits? State your
conclusions clearly showing all steps.
Solution:
Sample A shingles
Null hypothesis (H0): mean moisture content >= 0.35
Alternate hypothesis (Ha): mean moisture content < 0.35
alpha = 0.05
By using Python, we got the following output
One sample t test
t statistic: -1.4735046253382782 p value: 0.07477633144907513
Since p-value > 0.05, we do not reject H0 (null hypothesis). There is not
enough evidence to conclude that the mean moisture content for Sample A
shingles is less than 0.35 pounds per 100 square feet. p-value = 0.0748. If
the population mean moisture content is in fact no less than 0.35 pounds per
100 square feet, the probability of observing a sample of 36 shingles that
will result in a sample mean moisture content of 0.3167 pounds per 100 square
feet or less is 0.0748.
Sample B shingles
Output from Python
One sample t test
t statistic: -3.1003313069986995 p value: 0.0020904774003191826
Since p-value < 0.05, we reject H0 (null hypothesis). There is enough evidence
to conclude that the mean moisture content for Sample B shingles is less
than 0.35 pounds per 100 square feet. p-value = 0.0021. If the population
mean moisture content is in fact no less than 0.35 pounds per 100 square feet,
the probability of observing a sample of 31 shingles that will result in a
sample mean moisture content of 0.2735 pounds per 100 square feet or less is
0.0021.
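For a lower-tailed one-sample t test, the p-value is the area to the left of the t statistic under a t distribution with n - 1 degrees of freedom. The sketch below recomputes the p-values from the reported t statistics and sample sizes, then runs the same test on a small hypothetical sample. The helper name lower_tail_p and the sample values are illustrative, and halving the two-sided p-value is an assumed approach, not necessarily what the notebook did.

```python
import numpy as np
from scipy import stats

def lower_tail_p(t_stat: float, n: int) -> float:
    """One-sided (lower-tail) p-value for a one-sample t test with n observations."""
    return stats.t.cdf(t_stat, df=n - 1)

# Reported t statistics and sample sizes for shingles A and B.
p_a = lower_tail_p(-1.4735046253382782, 36)
p_b = lower_tail_p(-3.1003313069986995, 31)
print(f"A: p = {p_a:.4f}   B: p = {p_b:.4f}")

# Hypothetical moisture readings, only to show the test end to end.
sample = np.array([0.30, 0.34, 0.28, 0.36, 0.31, 0.33, 0.29, 0.35])
t_stat, p_two_sided = stats.ttest_1samp(sample, popmean=0.35)

# ttest_1samp is two-sided by default; convert to the lower-tail p-value.
p_one_sided = p_two_sided / 2 if t_stat < 0 else 1 - p_two_sided / 2
print(f"hypothetical sample: t = {t_stat:.3f}, one-sided p = {p_one_sided:.4f}")
```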
3.2 Problem Do you think that the population means for shingles A and
B are equal? Form the hypothesis and conduct the test of the
hypothesis. What assumption do you need to check before the test for
equality of means is performed?
Solution:
Null hypothesis (H0) : mean (A) = mean(B)
Alternate hypothesis (Ha) : mean(A) not equal to mean(B)
alpha = 0.05
By using Python, we got the following output
t statistic = 1.29 and p-value = 0.202
As the p-value > alpha, we do not reject H0 (null hypothesis). There is not
enough evidence to conclude that the population means for shingles A and B
are different.
TEST ASSUMPTIONS:-
When running a two-sample t-test, the basic assumptions are that the
distributions of the two populations are normal, and that the variances of the
two distributions are the same. If those assumptions are not likely to be met,
another testing procedure could be used.
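The assumption check and the test can be sketched together as below. The two samples are randomly generated stand-ins, not the shingles data, and using Levene's test to choose between the pooled and Welch versions is one common convention rather than the method confirmed by the assignment. The last lines recompute the two-sided p-value from the reported t statistic, assuming pooled degrees of freedom (36 + 31 - 2 = 65).

```python
import numpy as np
from scipy import stats

# Hypothetical samples with the same sizes as the A and B shingle files.
rng = np.random.default_rng(0)
a = rng.normal(loc=0.32, scale=0.13, size=36)
b = rng.normal(loc=0.27, scale=0.14, size=31)

# Levene's test: H0 is that the two groups have equal variances.
levene_stat, levene_p = stats.levene(a, b)
equal_var = levene_p > 0.05

# Two-sample t test; equal_var=False would give Welch's t test instead.
t_stat, p_value = stats.ttest_ind(a, b, equal_var=equal_var)
print(f"Levene p = {levene_p:.3f}, t = {t_stat:.3f}, p = {p_value:.3f}")

# Two-sided p-value for the reported t = 1.29, assuming 65 degrees of freedom.
p_reported = 2 * stats.t.sf(1.29, df=36 + 31 - 2)
print(f"p for reported t statistic = {p_reported:.3f}")
```

Normality of each sample can be checked separately, for example with a histogram or a Q-Q plot.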
