
SMDM PROJECT – MOHD FAKHAR ADIL

Business Report
Problem-1: A wholesale distributor operating in different regions of Portugal has information on
annual spending of several items in their stores across different regions and channels. The data
consists of 440 large retailers’ annual spending on 6 different varieties of products in 3 different
regions (Lisbon, Oporto, Other) and across different sales channels (Hotel, Retail).

Before we get into the questions, let us understand some basic things the data tells us.

A) Look at the data, first and last rows

Head -

B) Tail –

C) Information of the Data


D) The data has no null values

E) Descriptive Statistics of Data

In this table, the standard deviation is greater than the mean for every item. The mean also differs
considerably from the median, so we can assume that every item has outliers. In the describe()
output, all items show a similar pattern.
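The two checks described above (standard deviation against mean, mean against median) can be sketched as follows. This is a minimal illustration on made-up values, not the report's data; the column names are assumed.

```python
import pandas as pd

# Hypothetical spending values for illustration only (not the report's data).
df = pd.DataFrame({
    "Fresh": [200, 300, 400, 500, 9000],
    "Milk": [100, 150, 200, 250, 5000],
})

# describe() gives count, mean, std, min, quartiles and max; transpose for one row per item.
summary = df.describe().T
summary["median"] = df.median()  # same as the 50% column, added for readability

# Flags for the two observations made in the text.
summary["std_gt_mean"] = summary["std"] > summary["mean"]
summary["mean_gt_median"] = summary["mean"] > summary["median"]

print(summary[["mean", "median", "std", "std_gt_mean", "mean_gt_median"]])
```

A mean well above the median, together with a standard deviation larger than the mean, is what one would expect from a few very large values pulling the distribution to the right.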
F) Distributions of Numerical and Categorical Variables

6 Histograms are as follows:

All 6 items have a right-skewed distribution: the tail extends to the right of the peak for
Delicatessen, Milk, Grocery, Detergent Paper, Fresh, and Frozen.
Count Plots of Region and Channel –

Of the 440 observations, the Hotel channel has more observations than the Retail channel.

Of the 440 observations, the Other region has the most, followed by Lisbon, with Oporto having the
fewest.

1.1 Use methods of descriptive statistics to summarize data. Which Region and which Channel
seems to spend more? Which Region and which Channel seems to spend less?
Answer - We calculated the total amount spent for each observation and then grouped the totals in
two ways: by channel and by region.
Hotels/restaurants spent more than Retail with the wholesaler. Here, the x-axis shows the amount
spent in lakhs.
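The total-and-group step can be sketched as below. The rows and column names are hypothetical; only the technique (row-wise sum, then groupby) reflects the description above.

```python
import pandas as pd

# Hypothetical rows for illustration; column names are assumed.
df = pd.DataFrame({
    "Channel": ["Hotel", "Hotel", "Retail", "Retail"],
    "Region": ["Other", "Lisbon", "Other", "Oporto"],
    "Fresh": [100, 200, 50, 30],
    "Milk": [20, 30, 80, 60],
})
items = ["Fresh", "Milk"]

# Total spent per observation (row-wise sum over the item columns).
df["Total"] = df[items].sum(axis=1)

# Totals grouped by channel and by region, largest first.
by_channel = df.groupby("Channel")["Total"].sum().sort_values(ascending=False)
by_region = df.groupby("Region")["Total"].sum().sort_values(ascending=False)

# Breakdown of each region by channel (missing combinations shown as 0).
by_both = df.groupby(["Region", "Channel"])["Total"].sum().unstack(fill_value=0)

print(by_channel, by_region, by_both, sep="\n\n")
```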

The second grouping is by Region –

The Other region spent more than Lisbon. Here, the x-axis is the amount spent in
millions.

Let us go deeper to understand each region with respect to each channel –
This is a breakdown of the previous graph, with each region split into Hotels and Retail. The
figure shows that in Lisbon, Hotels/restaurants spent more than Retail, and the same holds for
the Other region.

1.2 There are 6 different varieties of items considered. Do all varieties show similar behaviour
across Region and Channel?

Answer - I present the analysis in the same way as for Total Spent, but for each item individually,
since we need to look at each item to understand the data more deeply.
Before we get into the visualizations, a few points are worth noting:

Put another way: in either channel, the Other region accounts for a larger share than Lisbon and
Oporto. Some points change from item to item.

The items Fresh, Frozen, and Delicatessen show similar patterns and are related to each other.
Likewise, Detergent Paper, Grocery, and Milk show similar patterns to each other.

The items are as follows:

A) Fresh

Fresh products are purchased more by hotels/restaurants than by retail stores. The amount here is
given in millions.
The common pattern in every region is that hotels/restaurants spend more on fresh products than
retail stores.

B) Milk

Milk is bought mostly through Retail, as it is a daily household essential.
As before, the Other region spent more on milk than Lisbon and Oporto.

As mentioned above, milk is mostly purchased by Retail.

C) Grocery

The same point made for milk applies to groceries as well, because households need groceries too.
Here again, the Other region spent more than Lisbon and Oporto. However, more money is spent on
groceries than on milk.

As far as Retail is concerned, we already know that it purchased more than Hotels.

D) Frozen

Frozen foods are available at both Retail and Hotels/restaurants, but people tend to buy them
when they go out partying at Hotels/restaurants.
There is a large disparity between the channels in every region, with Hotels/restaurants ahead.

E) Detergent_Paper

Detergent paper is also a daily-use item. It is needed at hotels/restaurants too; nonetheless,
households would be acquiring more.
Interestingly, the amount spent on detergent paper is almost equal to the amount spent on Frozen
in terms of regions.

We know that Retail purchased more than Hotels from the wholesale distributor.

F) Delicatessen

Delicatessens are places selling cooked meats, cheeses, and unusual or foreign prepared foods, so
we can say that these items are mostly used in hotels and restaurants.
From this we might infer that the Other region has more hotels than Lisbon and Oporto.

The pattern does not change here either, but in Oporto there is no significant difference between
the channels.

1.3 On the basis of descriptive measure of variability, which item shows the most inconsistent
behaviour? Which items show the least inconsistent behaviour?

Answer - Among the descriptive measures of variability, the Coefficient of Variation (CV) is the
natural measure of inconsistency. All the items have a CV greater than 1, which is very high.

The right-hand part of the figure shows the values of the coefficient of variation computed in Python.
As we can see, the three items shaded light red or greyish have a comparatively lower CV: Fresh,
Grocery, and Milk. The remaining items are shaded darker red, as they have a CV of more
than 1.5.
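The coefficient of variation is the standard deviation divided by the mean. A minimal sketch on hypothetical values (the column names and numbers are illustrative, not taken from the report):

```python
import pandas as pd

# Hypothetical values for illustration only.
df = pd.DataFrame({
    "Grocery": [10, 20, 30],
    "Delicatessen": [1, 1, 10],
})

# CV = sample standard deviation / mean, computed per column.
cv = df.std() / df.mean()

print(cv.sort_values())
print("Most inconsistent:", cv.idxmax())
print("Least inconsistent:", cv.idxmin())
```

Because the CV is unit-free, it allows items with very different average spending to be compared on the same scale.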

1.4 Are there any outliers in the data?

Answer - Yes, there are outliers in the data. Only the numeric (continuous) columns can have
outliers.
Looking at the visualizations, we can also infer that all our items are right-skewed.
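The report does not state which rule was used to identify outliers. One common choice, shown here as an assumption, is the 1.5 × IQR rule that boxplots use for their whiskers. The values are hypothetical.

```python
import pandas as pd

# Hypothetical spending values for one item; the last value is deliberately extreme.
s = pd.Series([10, 12, 13, 15, 16, 18, 100], name="Fresh")

# Interquartile range and the usual 1.5 * IQR fences (an assumed rule, not the report's stated method).
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Values outside the fences are flagged as outliers.
outliers = s[(s < lower) | (s > upper)]
print(outliers.tolist())
```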

1.5 On the basis of this report, what are the recommendations?

Answer - The data is dominated by the Other region rather than Lisbon and Oporto, and it is not
clear whether the Other region refers to a city, a region, or a group of regions.

The wholesale distributor needs to understand the hotels/restaurants in the Other region and
learn about that market.

The distributor needs to increase tie-ups with a greater number of retailers.

The wholesale distributor needs to increase collaboration with more retailers and
hotels/restaurants to increase value.

Correlation between the items


Milk is highly correlated with Detergent Paper and Grocery. Moreover, Grocery is highly correlated
with Detergent Paper.

We can use these associations to understand how people buy these items and increase the
business.
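A correlation matrix of this kind can be produced with pandas. The sketch below uses hypothetical values in which Milk and Grocery move together and Fresh does not; the column names are assumed.

```python
import pandas as pd

# Hypothetical values for illustration only.
df = pd.DataFrame({
    "Milk": [1, 2, 3, 4, 5],
    "Grocery": [2, 4, 6, 8, 11],
    "Fresh": [5, 1, 4, 2, 3],
})

# Pearson correlation between every pair of item columns.
corr = df.corr()
print(corr.round(2))
```

Values close to 1 indicate items that tend to be bought together in similar proportions; values near 0 indicate little linear relationship.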

Problem-2: The Student News Service at Clear Mountain State University (CMSU) has decided to
gather data about the undergraduate students that attend CMSU. CMSU creates and distributes a
survey of 14 questions and receives responses from 62 undergraduates (stored in the Survey data
set).

2.1. For this data, construct the following contingency tables (Keep Gender as row variable)

2.1.1 Gender and Major:

2.1.2 Gender and Grad Intention


2.1.3. Gender and Employment:

2.1.4 Gender and Computer
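Contingency tables like the four above can be built with pandas crosstab, keeping Gender as the row variable. The rows below are hypothetical and the column names are assumed.

```python
import pandas as pd

# Hypothetical survey rows for illustration; column names are assumed.
survey = pd.DataFrame({
    "Gender": ["Female", "Male", "Male", "Female", "Male"],
    "Major": ["CIS", "Accounting", "CIS", "Management", "CIS"],
})

# Gender as rows, Major as columns; margins=True adds the "All" totals.
table = pd.crosstab(survey["Gender"], survey["Major"], margins=True)
print(table)
```

The same call with a different second column (for example Grad Intention, Employment, or Computer) would give the other tables.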

2.2. Assume that the sample is representative of the population of CMSU. Based on the data,
answer the following question:

2.2.1. What is the probability that a randomly selected CMSU student will be male?

Answer – 29 of the 62 students are male, so the probability is 29/62 ≈ 0.47.

2.2.2. What is the probability that a randomly selected CMSU student will be female?

Answer – 33 of the 62 students are female, so the probability is 33/62 ≈ 0.53.
2.3. Assume that the sample is representative of the population of CMSU. Based on the data,
answer the following question:

2.3.1. Find the conditional probability of different majors among the male students in CMSU.

Answer –

A = Major, one of {Accounting, CIS, Economics/Finance, International Business, Management, Other,
Retail/Marketing, Undecided}

B = student is male; total number of males = 29

P(A ∩ B) = joint probability (count from the contingency table divided by the sample size)

P(A | B) = P(A ∩ B) / P(B)

The probabilities of the majors among males are as follows:

2.3.2 Find the conditional probability of different majors among the female students of CMSU.

A = Major, one of {Accounting, CIS, Economics/Finance, International Business, Management, Other,
Retail/Marketing, Undecided}

B = student is female; total number of females = 33

P(A ∩ B) = joint probability (count from the contingency table divided by the sample size)

P(A | B) = P(A ∩ B) / P(B)

The probabilities of the majors among females are as follows:


How is this done in Python?

Answer - Use normalize='index' (row-wise), so that each row of the contingency table sums to 1.
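A minimal sketch of the row-wise normalization, assuming the table is built with pandas crosstab; the rows and column names are hypothetical.

```python
import pandas as pd

# Hypothetical survey rows for illustration; column names are assumed.
survey = pd.DataFrame({
    "Gender": ["Female", "Male", "Male", "Female", "Male"],
    "Major": ["CIS", "Accounting", "CIS", "Management", "CIS"],
})

# normalize="index" divides each row by its row total,
# so each cell is P(Major | Gender).
cond = pd.crosstab(survey["Gender"], survey["Major"], normalize="index")
print(cond.round(3))
```

Each row then reads directly as the conditional distribution of majors for that gender.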

2.4. Assume that the sample is a representative of the population of CMSU. Based on the data,
answer the following question:

2.4.1. Find the probability that a randomly chosen student is a male and intends to graduate

Answer - This is a joint probability question. From 2.1.2, fetch the count where the row (gender) is
“Male” and the column is “Yes”.

The count is 17. The probability is 17 divided by the total sample size, 62, which is 0.274194.

How is this done in Python?

Answer – Use normalize='all', which divides every cell by the total sample size.

2.4.2 Find the probability that a randomly selected student is a female and does NOT have a
laptop.

Answer - The selected student has a desktop or a tablet instead of a laptop.

The probability that a randomly selected student is a female and does not have a laptop is
2/62 + 2/62 = 4/62 ≈ 0.06.
How is this done in Python?

Answer – Use normalize='all'.

Adding the cells for female and desktop and for female and tablet, 0.03 + 0.03, gives 0.06.

2.5. Assume that the sample is representative of the population of CMSU. Based on the data,
answer the following question

2.5.1. Find the probability that a randomly chosen student is either a male or has full-time
employment?

Answer –

P(A) = student is a male = 29/62

P(B) = has full-time employment = 10/62

P(A ∩ B) = student is male and has full-time employment (joint probability) = 7/62

P(A ∪ B) = P(A) + P(B) - P(A ∩ B) = 32/62 ≈ 0.52

How is this done in Python?

Answer - We use normalize='all' to get P(A ∩ B); P(A) and P(B) can be calculated with
value_counts().
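The addition rule can be sketched as follows. The survey rows and column names are hypothetical; the last lines repeat the arithmetic with the counts quoted in the answer above (29, 10, and 7 out of 62).

```python
import pandas as pd

# Hypothetical survey rows for illustration; column names are assumed.
survey = pd.DataFrame({
    "Gender": ["Male", "Male", "Female", "Female", "Male", "Female"],
    "Employment": ["Full-Time", "Part-Time", "Full-Time",
                   "Part-Time", "Unemployed", "Part-Time"],
})

# Joint probabilities: every cell divided by the total sample size.
joint = pd.crosstab(survey["Gender"], survey["Employment"], normalize="all")

# Marginal probabilities from value_counts(normalize=True).
p_male = survey["Gender"].value_counts(normalize=True)["Male"]
p_full = survey["Employment"].value_counts(normalize=True)["Full-Time"]
p_both = joint.loc["Male", "Full-Time"]

# Addition rule: P(A or B) = P(A) + P(B) - P(A and B).
p_either = p_male + p_full - p_both
print(round(p_either, 4))

# The same rule applied to the counts quoted in the report.
report_value = (29 + 10 - 7) / 62
print(round(report_value, 3))
```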

2.5.2. Find the conditional probability that given a female student is randomly chosen, she is
majoring in international business or management.

Here, P(A) is 4/33 (female majoring in international business).

P(B) is 4/33 (female majoring in management).

A student has only one major, so A and B cannot occur together and P(A ∩ B) = 0.

Therefore, by the addition law of probability, P(A) + P(B) is 0.12 + 0.12, which is 0.24.

How is this done in Python?

Answer - Use normalize='index' (row-wise).


2.6. Construct a contingency table of Gender and Intent to Graduate at 2 levels (Yes/No). The
Undecided students are not considered now, and the table is a 2x2 table. Do you think the
graduate intention and being female are independent events?

Answer –

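Let A = the student intends to graduate and B = the student is female. The events are independent
if P(A ∩ B) is equal to P(A) * P(B).

P(A) = 28/40, P(B) = 20/40, P(A ∩ B) = 11/40

P(A) * P(B) = 0.35, which is not equal to P(A ∩ B) = 0.275, so graduate intention and being female
are not independent events.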
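The independence check is a comparison of two numbers. A small sketch using exact fractions and the three probabilities quoted above:

```python
from fractions import Fraction

# Probabilities quoted in the answer (2x2 table, 40 students).
p_a = Fraction(28, 40)   # P(A)
p_b = Fraction(20, 40)   # P(B)
p_ab = Fraction(11, 40)  # P(A and B)

# Independence requires P(A and B) == P(A) * P(B).
product = p_a * p_b
independent = (p_ab == product)

print(float(product), float(p_ab), independent)
```

Exact fractions avoid floating-point rounding when testing for equality; with real survey data an exact match is rare, so a formal test of independence would usually be more appropriate than this equality check.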

2.7. Note that there are four numerical (continuous) variables in the data set, GPA, Salary,
Spending, and Text Messages.

Answer the following questions based on the data

2.7.1. If a student is chosen randomly, what is the probability that his/her GPA is less than 3?

Answer - 17 of the 62 students have a GPA less than 3, so the probability is 17/62 ≈ 0.27.

2.7.2. Find the conditional probability that a randomly selected male earns 50 or more. Find the
conditional probability that a randomly selected female earns 50 or more.

Answer - There are 14 males out of 29 who earn 50 or more, so the probability is 14/29, which is
0.48.
There are 18 females out of 33 who earn 50 or more, so the probability is 18/33, which is
0.55.
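Proportions like these can be computed from boolean masks, since the mean of a True/False column is the share of True values. The rows and column names below are hypothetical.

```python
import pandas as pd

# Hypothetical survey rows for illustration; column names are assumed.
survey = pd.DataFrame({
    "Gender": ["Male", "Male", "Female", "Female", "Female"],
    "Salary": [55.0, 40.0, 50.0, 60.0, 45.0],
    "GPA": [2.8, 3.2, 3.5, 2.9, 3.1],
})

# P(GPA < 3): share of rows where the condition holds.
p_gpa_below_3 = (survey["GPA"] < 3).mean()

# P(Salary >= 50 | Gender): share of True values within each gender.
p_salary_by_gender = (survey["Salary"] >= 50).groupby(survey["Gender"]).mean()

print(p_gpa_below_3)
print(p_salary_by_gender)
```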

2.8. Note that there are four numerical (continuous) variables in the data set, GPA, Salary,
Spending, and Text Messages. For each of them comment whether they follow a normal
distribution. Write a note summarizing your conclusions

After looking at the histograms, we can visually say that the variables are not normally
distributed. There is skewness.
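The report relies on a visual check of the histograms. A numeric check could complement it; the sketch below computes the sample skewness and, as an addition not used in the report, a Shapiro-Wilk test from SciPy. The values are hypothetical.

```python
import pandas as pd
from scipy import stats

# Hypothetical right-skewed values for illustration only.
values = pd.Series([1, 1, 1, 2, 2, 2, 3, 3, 4, 5, 8, 15, 40], name="Text Messages")

# Sample skewness: positive means a longer right tail, near 0 means roughly symmetric.
skewness = values.skew()

# Shapiro-Wilk test (assumed extra check): a small p-value is evidence against normality.
w_stat, p_value = stats.shapiro(values)

print(f"skewness={skewness:.2f}, W={w_stat:.3f}, p={p_value:.4f}")
```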

Problem-3: An important quality characteristic used by the manufacturers of ABC asphalt shingles
is the amount of moisture the shingles contain when they are packaged. Customers may feel that
they have purchased a product lacking in quality if they find moisture and wet shingles inside the
packaging. In some cases, excessive moisture can cause the granules attached to the shingles for
texture and coloring purposes to fall off the shingles resulting in appearance problems. To monitor the
amount of moisture present, the company conducts moisture tests. A shingle is weighed and then
dried. The shingle is then reweighed and based on the amount of moisture taken out of the
product, the pounds of moisture per 100 square feet are calculated. The company would like to
show that the mean moisture content is less than 0.35 pound per 100 square feet. The file (A & B
shingles.csv) includes 36 measurements (in pounds per 100 square feet) for A shingles and 31 for B
shingles.

3.1 Do you think there is evidence that mean moisture contents in both types of shingles are
within the permissible limits? State your conclusions clearly showing all steps.

The means of A and B are computed through the describe() function. The means of A and B are
0.316667 and 0.273548, respectively. The standard deviations of A and B are 0.135731 and
0.137296, respectively.
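(a) Hypothesis formulation for A shingles:

Ho: µA >= 0.35 pounds per 100 square feet

Ha: µA < 0.35 pounds per 100 square feet

The test statistic is computed as (Xbar - µ)/(SD/√n) with 35 degrees of freedom for shingles A. This
yields a value of -1.4735, which is not less than the critical value of -1.69 at the 5% significance
level, so we fail to reject the null hypothesis. There is not enough evidence that the mean moisture
content of A shingles is below 0.35.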

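(b) Hypothesis formulation for B shingles:

Ho: µB >= 0.35 pounds per 100 square feet

Ha: µB < 0.35 pounds per 100 square feet

The test statistic is computed as (Xbar - µ)/(SD/√n) with 30 degrees of freedom for shingles B. This
yields a value of -3.10, which is much less than the critical value of -1.70 at the 5% significance
level, so we reject the null hypothesis. There is evidence that the mean moisture content of B
shingles is below 0.35.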
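Both one-sample tests can be reproduced from the summary statistics quoted above. This is a sketch: the helper name one_sample_t is made up for illustration, and a one-sided 5% significance level is assumed.

```python
import math
from scipy import stats

def one_sample_t(mean, sd, n, mu0, alpha=0.05):
    """Lower-tailed one-sample t-test from summary statistics.

    Returns (t statistic, critical value, p-value). Hypothetical helper for illustration.
    """
    t_stat = (mean - mu0) / (sd / math.sqrt(n))
    df = n - 1
    t_crit = stats.t.ppf(alpha, df)    # lower-tail critical value
    p_value = stats.t.cdf(t_stat, df)  # P(T <= t_stat) under Ho
    return t_stat, t_crit, p_value

# Summary statistics quoted in the report.
t_a, crit_a, p_a = one_sample_t(0.316667, 0.135731, 36, 0.35)
t_b, crit_b, p_b = one_sample_t(0.273548, 0.137296, 31, 0.35)

print(f"A: t={t_a:.4f}, critical={crit_a:.4f}, reject Ho: {t_a < crit_a}")
print(f"B: t={t_b:.4f}, critical={crit_b:.4f}, reject Ho: {t_b < crit_b}")
```

Ho is rejected only when the test statistic falls below the critical value, which is equivalent to the p-value being smaller than alpha.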

3.2 Do you think that the population mean for shingles A and B are equal? Form the hypothesis
and conduct the test of the hypothesis. What assumption do you need to check before the test for
equality of means is performed?

Answer - Hypothesis formulation:

Ho: µA = µB

Ha: µA ≠ µB

The assumptions are as follows:

The populations are normally distributed.

Each value is sampled independently from each other value

The data are assumed to be drawn independently from a normally distributed population. Since
the sample sizes are 36 and 31, both above 30, the sample means can be treated as approximately normal.
From the histograms and the calculated skewness of samples A and B, the skewness values are 0.91
and 0.48 for A and B, respectively.

Computing the t-statistic and p-value, we get 1.29 and 0.20, respectively. Since the p-value is
greater than alpha = 0.05, we fail to reject the null hypothesis. At the 5% significance level, there
is not enough evidence to conclude that the population means of A and B differ.
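The two-sample test can be approximated from the summary statistics in 3.1 using SciPy. The report does not say whether equal variances were assumed; the sketch below assumes a pooled-variance (equal_var=True) test, which is reasonable here since the two standard deviations are close.

```python
from scipy import stats

# Summary statistics quoted in the report for shingles A and B.
result = stats.ttest_ind_from_stats(
    mean1=0.316667, std1=0.135731, nobs1=36,
    mean2=0.273548, std2=0.137296, nobs2=31,
    equal_var=True,  # assumption: pooled-variance two-sample t-test
)
t_stat, p_value = result.statistic, result.pvalue

alpha = 0.05
print(f"t={t_stat:.2f}, p={p_value:.2f}, reject Ho: {p_value < alpha}")
```

With the raw measurements available, stats.ttest_ind on the two columns would be the more direct route.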
