DS Module 2 Project

Business Report
SMDM Project
Sudhanva Saralaya 6/13/21 PGP-DSBA

Contents
1 Packages and Common procedures followed across this Assignment...............................3
2 Problem 1...........................................................................................................................4
A wholesale distributor operating in different regions of Portugal has information on annual
spending of several items in their stores across different regions and channels. The data
consists of 440 large retailers’ annual spending on 6 different varieties of products in 3
different regions (Lisbon, Oporto, Other) and across different sales channel (Hotel, Retail).. .4
2.1 Use methods of descriptive statistics to summarize data. Which Region and which
Channel spent the most? Which Region and which Channel spent the least?.......................4
2.2 There are 6 different varieties of items that are considered. Describe and
comment/explain all the varieties across Region and Channel? Provide a detailed
justification for your answer...................................................................................................5
2.3 Based on a descriptive measure of variability, which item shows the most
inconsistent behaviour? Which items show the least inconsistent behaviour?......................9
2.4 Are there any outliers in the data? Back up your answer with a suitable
plot/technique with the help of detailed comments..............................................................10
2.5 Based on your analysis, what are your recommendations for the business? How can
your analysis help the business to solve its problem? Answer from the business
perspective............................................................................................................................10
3 Problem 2.........................................................................................................................12
3.1 For this data, construct the following contingency tables (Keep Gender as row
variable)................................................................................................................................12
For creating a contingency table, we can use the crosstab function that summarizes the
table with the conditions we give. For example, we have conditions such as gender and
major in the below questions................................................................................................12
3.1.1 Gender and Major ............................................................................................12
3.1.2 Gender and Grad Intention ...............................................................................12
3.1.3 Gender and Employment....................................................................................12
3.1.4 Gender and Computer........................................................................................12
3.2 Assume that the sample is representative of the population of CMSU. Based on the
data, answer the following question:....................................................................................13
3.2.1 What is the probability that a randomly selected CMSU student will be male?13
3.2.2 What is the probability that a randomly selected CMSU student will be female?
13
1.1 Find the conditional probability of different majors among the male students in
CMSU...............................................................................................................................13
3.3.1 Find the conditional probability of different majors among the female students
of CMSU...........................................................................................................................13
3.4 Assume that the sample is a representative of the population of CMSU. Based on the
3.4.1 Find the probability That a randomly chosen student is a male and intends to
graduate.............................................................................................................................14
3.4.2 Find the probability that a randomly selected student is a female and does NOT
have a laptop.....................................................................................................................14
3.5.1 Find the probability that a randomly chosen student is a male or has full-time
employment?.....................................................................................................................14
3.5.2 Find the conditional probability that given a female student is randomly chosen,
she is majoring in international business or management................................................15
3.6 Construct a contingency table of Gender and Intent to Graduate at 2 levels (Yes/No).
The Undecided students are not considered now, and the table is a 2x2 table. Do you think
the graduate intention and being female are independent events?.......................................16
3.7 Note that there are four numerical (continuous) variables in the data set, GPA,
Salary, Spending, and Text Messages. Answer the following questions based on the data 16
3.7.1 If a student is chosen randomly, what is the probability that his/her GPA is less
than 3? 16
3.7.2 Find the conditional probability that a randomly selected male earns 50 or
more. Find the conditional probability that a randomly selected female earns 50 or more.
17
3.8 Note that there are four numerical (continuous) variables in the data set, GPA,
Salary, Spending, and Text Messages. For each of them comment whether they follow a
normal distribution. Write a note summarizing your conclusions.......................................18
4 Problem 3.........................................................................................................................21
4.1 Do you think there is evidence that means moisture contents in both types of
shingles are within the permissible limits? State your conclusions clearly showing all steps.
21
4.2 Do you think that the population mean for shingles A and B are equal? Form the
hypothesis and conduct the test of the hypothesis. What assumption do you need to check
before the test for equality of means is performed?.............................................................23
1 Packages and Common procedures followed across this Assignment.
 Packages
o import numpy as np
o import pandas as pd
o import matplotlib.pyplot as plt
o %matplotlib inline
o import seaborn as sns
o sns.set(color_codes=True)
o from warnings import filterwarnings #ignore warning wen it shows
o filterwarnings("ignore")
o import scipy.stats as stats
o from scipy.stats import ttest_ind
o import math
 Process Followed in common Across the Project

o Importing the data with pd.read_csv()
o Checking if the data is imported or not.
o Assigning the database to a variable
o Going through the data and dropping unnecessary data using drop()
o Checking the description of the data using describe()
o Checking the Information of the data using Info()
o Checking for null values in the dataset
o Checking the question and getting an answer
2 Problem 1
A wholesale distributor operating in different regions of Portugal has information on

annual spending of several items in their stores across different regions and channels.
The data consists of 440 large retailers’ annual spending on 6 different varieties of
products in 3 different regions (Lisbon, Oporto, Other) and across different sales channel
(Hotel, Retail).
2.1 Use methods of descriptive statistics to summarize data. Which Region and which
Channel spent the most? Which Region and which Channel spent the least?
 For answering this question, we need to make a new data frame with the data adding a
new column called total to know the total.
 We can get basic description for the data by using Describe () function. This helps us
to know the details such as standard deviation mean etc in statistics perspective.
Table 1
 Now to get an accurate data on which region spends the most we can group each data
on basis of region and channel. We can use groupby(). sum() function to do so.
Fig 1
Fig 2
 Groupby(). sum() gives us the total of all the rows and columns based on the way we
have grouped them.
 Fig 1 shows data that is grouped and totalled based on Channels. Here we see that
hotel is a channel that spends the most when we compare it to retail channel is the
lowest spender.
 In the same way Fig 2 shows us the data grouped and totalled based on region. When
we see the fig 2, we see that other is the highest spender and Oporto is the lowest
spender.
2.2 There are 6 different varieties of items that are considered. Describe and
comment/explain all the varieties across Region and Channel? Provide a detailed
justification for your answer.
Pivoting the table given we get the below details.
Table 2
 Understanding the Summary

o Now as we look at the summary of the entire table, we first notice how the
spending pattern is spread out in the data. This data is related to Portugal.
Portugal has a total of 18 districts and Lisbon and Oporto or commonly called
as Porto are shown and rest of the 16 districts are clustered together as
“others” in the data set.
o This shows that Lisbon and Oporto have the greatest number of people who
spend money on the items when compared to the other 16 districts.
o One more important thing we can draw a conclusion of is that “others” region
will always have the maximum total spenders as it consists of the rest of
Portugal.
o Next, we see that there are 2 types of channels through which the products are
sold. They are Hotels and Retails. Moreover, we see that the products are sold
more in hotels than retails when it is related to food items and food related
items. On the flip side we see that items that are not food items are sold more
using retail channel.
o The below inferences have been made after using pivot on individual elements
therefore we can see individual elements and their details.
o Now going on individual items we see that there are overall 6 different items
provided in the data.
 Delicatessen
 Detergent paper
 Fresh
 Frozen
 Grocery
 Milk
 Individual items Analysis

o Delicatessen
o Here when we take the summary as per one item (i.e., Delicatessen), we see
some interesting things. Starting with the way the total sales is distributed. We
see that total spending done for Delicatessen is 670943. We see region wise it
is spread out. Lisbon has a total of 104327, Oporto has 54506 and finally other
regions have around 512110. Now we know that there are around 16 regions
in Portugal, so it becomes easy to estimate how much is the spending done in
other regions. The remaining states have contributed approx. 42,675 each on
average. Going further based on channels of sales we see hotel channel has
nearly double the total spending than retail. This shows when it comes to
delicatessen people prefer to spend in hotel to acquire the item. Further we see
we have division between the item based on region and channel. For example,
in Lisbon total spending done on hotels is 70632 for delicatessen. In the same
way we can refer the above table and understand how the spending is done in
hotel and retail for delicatessen in different regions.
 Detergent Paper
 Individual analysis of detergent paper shows the details like how it is being
sold and how are the channels impacted. Again, we see total spending
done on detergent paper is 1267857. So, on an average each region spends
70436.5 on detergent paper. Here we see that people prefer spending on
Detergent paper in retails as it is the place where we get it properly and it
is not a food commodity and hence hotel has a less sum of spending done.
 Fresh
 When we see the commodity called fresh it can be safe to assume that we
are speaking in terms of fresh fruits and vegetables or anything that needs
to be fresh and has a small-time frame for expiry. Usually, it is related to
food items and other consumables. Here we see that 5280131 is the total
spending done on fresh items and again since it is related to food items, we
see maximum spending done on the hotel channel. We also notice that
until now Fresh is a commodity for which people have spent most money
on. On an average in 18 regions, we see that people spend 2,93,341 for the
fresh items. We also see Lisbon and Oporto have gone ahead of average
and have spent more than it. This shows that fresh items are more in
demand in those places than the other 16 regions of Portugal.
 Frozen
 Here we can be safe to assume that we are talking about stuff that is frozen
and it is again related to food items. We see that 1351650 is the total
spending done on the frozen items. Again, we see that majority of the
spending is done in hotel across regions than in in retail.
 Milk
 Here we see that 2550357 is the total spending done for milk. Here we see
that retail is a channel that tops in this aspect as well as we see nearly
1521743 is the total spending done for milk in retail channel across
regions. One more thing that we can see is that there is that it is easy to
understand why we see retail topping this aspect of sales. It is because
people prefer buying Milk from mainly retails and hotels usually don’t or
may have less amount of sales when it comes to milk.
 Grocery
 Here we see that 3498562 is the total spending done for grocery. Here we
see that retail is a channel that tops in this aspect as well as we see nearly
2317845 is the total spending done for grocery in retail channel across
regions. One more thing that we can see is that there is that it is easy to
understand why we see retail topping this aspect of sales. It is because
people prefer buying grocery from mainly retails and hotels usually don’t
or may have less amount of sales when it comes to grocery.
 With this we have com to the end of individual analysis of each commodity . it is
interesting to see that anything that is related to food preparations and frozen food
is usually having high amount of sales done in Hotel Channel. But when it comes
to essentials or daily need products such as grocery, milk detergent paper etc we
see a huge margin of sales done by the retail channel. Through this we can easily
come into conclusions that daily need items can make more profit when sold via
retail region and frozenm fresh and delicatessen can be easily sold via the hotel
channel.
2.3 Based on a descriptive measure of variability, which item shows the most inconsistent
behaviour? Which items show the least inconsistent behaviour?
 For measuring inconsistent behaviour, we can use the method of variance. Now in
statistics, variance measures variability from the average or mean. It is calculated
by taking the differences between each number in the data set and the mean, then
squaring the differences to make them positive, and finally dividing the sum of the
squares by the number of values in the data set. the data having lower coefficient
of Variation is more consistent and vice - versa.
 The table when converted to normal integers we get the below:
Fresh 159954900.00
Milk 54469970.00
Grocery 90310100.00
Frozen 23567850.00
Detergents Paper 22732440.00
Delicatessen 7952997.00
 The item showing the most consistency is Delicatessen with a variation of

7952997 and the product showing the least consistency Is fresh with a variation of
159954900.
2.4 Are there any outliers in the data? Back up your answer with a suitable plot/technique
with the help of detailed comments.
Yes, as we see in the below boxplot there are outlier in every single item has an
outlier. The best way to check for outliers are boxplots. all the dots after the top
quartile are called as outlier.
2.5 Based on your analysis, what are your recommendations for the business?
How can your analysis help the business to solve its problem? Answer from
the business perspective.
 Based on the analysis done so far, we see that the out of 18 different location
majority of the products are sold only in two locations. We can start expanding to
the other regions slowly by which the target consumer counts can increase.
 Sales should be made consistent as it increases chances of product being sold
more. When there is inconsistent supply and sales of product people tend to
change products.
 Must make sure to provide enough product to each channel based on their
requirement and cut down the sales in channels where it is not needed. For
example, fresh is most sold in Channel hotel whereas detergent paper is sold more
in retails. So, merchant must make sure to know which channel is more important
for the products he supplies.
 This is a contingency plan for a situation where there is a case of shortage of
products to sell. When the merchant knows what his channel is selling and how
are the channel being used by his consumers, he can make sure to provide
products to the perfect channel. Now it will not make sense giving fresh products
to retail when the sales are more in hotel. So, he must know how to do all the
division during case of shortages.
 Further we see that delicatessen is an exceptionally low demanded product.
Therefore, if possible, must increase its demand by more advertising or over all
drop the product which will decrease the cost to make them, and that money can
be used for the ones where demand is high.
3 Problem 2
The Student News Service at Clear Mountain State University (CMSU) has decided to gather
data about the undergraduate students that attend CMSU. CMSU creates and distributes a
survey of 14 questions and receives responses from 62 undergraduates (stored in
the Survey data set).
3.1 For this data, construct the following contingency tables (Keep Gender as row variable)
For creating a contingency table, we can use the crosstab function that summarizes the
table with the conditions we give. For example, we have conditions such as gender and
major in the below questions.
3.1.1 Gender and Major
3.1.2 Gender and Grad Intention
3.1.3 Gender and Employment
3.1.4 Gender and Computer

data, answer the following question:
To answer the questions below we can use the basic statistical formula
P(A|B)=P(A∩B)/P(B). after substituting the the entries from the Gender major
contingency table we get the belo results.
3.2.1 What is the probability that a randomly selected CMSU student will be male?
3.2.2 What is the probability that a randomly selected CMSU student will be female?
Gender Probability Percent
Male 0.467741935 46.77
Female 0.532258065 53.22
1.1 Find the conditional probability of different majors among the male students in CMSU.
Major Probability Percent

Accounting 0.137931034 13.79
CIS 0.034482759 3.45
Economics/Finance 0.137931034 13.79
International Business 0.068965517 6.897
Management 0.206896552 20.69
Other 0.137931034 13.79
Retailing/Marketing 0.172413793 17.24
Undecided 0.103448276 10.34
3.3.1 Find the conditional probability of different majors among the female students of CMSU.
Major Probability Percent

Accounting 0.090909091 9.09
CIS 0.090909091 9.09
Economics/Finance 0.212121212 21.21
International Business 0.121212121 12.12
Management 0.121212121 12.12
Other 0.090909091 9.09
Retailing/Marketing 0.272727273 27.27
Undecided 0 0.00
3.4 Assume that the sample is a representative of the population of CMSU. Based on the
3.4.1 Find the probability That a randomly chosen student is a male and intends to graduate.
To answer this we use the Gender and graduation table for data and again we use the P(A|
B) formula
The probability that a randomly chosen student is a male and intends to graduate is
0.5862068965517241 or 58.62%
3.4.2 Find the probability that a randomly selected student is a female and does NOT have a
laptop.
To answer this we use the Gender and computer table for data and again we use the P(A|
B) formula
The probability that a randomly selected student is a female and does NOT have a laptop
is 0.12121212121212122 or 12.12%.
3.5.1 Find the probability that a randomly chosen student is a male or has full-time
employment?
We can first take a look at the table and understand what we got here as per the question.
So we want to know the probability that a randomly chosen student is a male or has full-
time employment.
We see that there is total of 29 male students and 10 full time employees. Further we see
a union of full time male employees
Now to calculate probability we use the formula (p(A)+p(B))-p(AUB)
Which gives us the below
P(A) = 29/62
P(B) =10/62
P(AB)=7/62
Now substituting in the formula we get
P(AUB) = (0.47+0.16) - 0.11
= 0.63-0.11
= 0.5190
Putting it in percentage3 we get 51.9%. therefore the probability of randomly selected
student to be
3.5.2 Find the conditional probability that given a female student is randomly chosen, she is
majoring in international business or management
So as per the question we are safe to assume following

A) Gender is female i.e total number of events is 33 as we are only considering a female
B) Total number of female persuing International Business is 4
C) Total number of female persuing Management is 4
D) Now since both events are mutually exculsive I.e., one event does not affect or
coincide with the other, so we can say that a female who takes International Business
wont be able to take up Management as well
So the probablilty that given a female student is randomly chosen, she is majoring in
international business or management is as below
p_fibm= (4/33)+(4/33)
p_fibm= 0.24242424242424243
in percentage it is 24.24%
3.6 Construct a contingency table of Gender and Intent to Graduate at 2 levels (Yes/No).
The Undecided students are not considered now, and the table is a 2x2 table. Do you
think the graduate intention and being female are independent events?
 Here we have a contingency table for the data with gender and graduation
intention. We have skipped the data for student whose intention is still undecided
to get an accurate picture for the data.
 Now seeing the data, we see that it is 40 students is the total number of students.
Further we see these 40 students consists of 20 Male and 20 Female.
 Seeing the table, we can assume that graduation intention is gender dependent as
we see that nearly 17 males of 20 have an intention to graduate whereas when we
see the data for female, we see that only 11 have an intention to graduate.
 Going ahead we can calculate the conditional probability for it as well.
 We can check in the format of A given B where B is the gender and A is if they
have an intention to graduate or simply the yes column.
 When calculating we get graduation intent yes given the student is male we get
17/20=0.85 or simply 85%
 In the same way, when calculating we get graduation intent yes given the student
is female we get.
11/20=0.55 or simply 55%
 This calculation clearly shows graduation intent is dependent on the student being
a female.
3.7 Note that there are four numerical (continuous) variables in the data set, GPA, Salary,
Spending, and Text Messages. Answer the following questions based on the data
3.7.1 If a student is chosen randomly, what is the probability that his/her GPA is less than
3?
 The most easy way to do this would be making a contengency table of GPA and
gender (used for a second argument) and we get the below table
 With this table we can get the total off students from 2.4 to 2.9 GPA as we need
the probability of the student lying in this range
 That would be which totals up to 14
 Now it is a simple calculation to know the probability is 14/40 = 0.35 or 35%
3.7.2 Find the conditional probability that a randomly selected male earns 50 or more. Find
the conditional probability that a randomly selected female earns 50 or more.
We first construct a contingency table for gender and salary as below
Now the values that we need are as below
 So here we see 2 different questions.

 Conditional probability that a randomly selected male earns 50 or more
We can do it by adding the total number of values for male whose salary is 50
or more and divide it by total number of males to get the answer for it . the
values are as follows
which equals to 10
Now the probability that a randomly selected male earns 50 or more is 10/20=
0.5 which is 50%
 Conditional probability that a randomly selected female earns 50 or more

We can do it by adding the total number of values for male whose salary is 50
or more and divide it by total number of males to get the answer for it . the
values are as follows
which again equals to 10
Now the probability that a randomly selected female earns 50 or more is

10/20= 0.5 which is 50%
 With this we see that both male and female have 50% of there population who
earn a salary of 50 or more
3.8 Note that there are four numerical (continuous) variables in the data set, GPA, Salary,
Spending, and Text Messages. For each of them comment whether they follow a
normal distribution. Write a note summarizing your conclusions.
 Here we can first pull out the data what we have and know what are the mean and
standard diviation that we have. We use the Describe() funcion for it. When we
run the code we get the below as the results.
 Now there is a slight problem with this table. That is we got a lot of unnecessary
columns in this like ID age etc. so what we can do is we can assign this result to a
variable and then drop the unnecessary columns that we don’t need using the
drop() function.
 Once we have done this we get a perfect table with all the information we need to
check for normal distribution
 Now we can even plot a histogram and check if the data follows normality or not.
First we will see how we do it. We use the seaborn distplot function to get the
histogram as well as a line graph which indicates how the data points are flowing.
When we use the fiunction on the below case we see that we get the following
histograph
 Now taking a look at the above charts we see that GPA and Salary seems to be
bell shaped and we can conclude that they can be normally distributed. But the
problem with the histogram is that even though this is a useful tool to visually
summarize the data, a major drawback is that the bin size can greatly affect how
the data look. Therefore, we will go for a more standardized test for knowing if
the data is normalized or not. Here we can use check the skewness of the data to
understand how the data is skewed. If the value, we get is close to 0 it means it is
normally distributed but if it tends to b away from 0 it means it is not normally
distributed. Now to check that we can use stats.skew() function from SciPy.
 The results of the tests are as follows:
Column Name Skewness

GPA -0.205068
Spending 1.555092
Salary -0.035779
Text 1.207536
Messages
 By seeing the skewness test results we can conclude that GPA and Salary
are almost normally distributed, and spending, text messages are not
normally distributed.
4 Problem 3
An important quality characteristic used by the manufacturers of ABC asphalt shingles is the
amount of moisture the shingles contain when they are packaged. Customers may feel that
they have purchased a product lacking in quality if they find moisture and wet shingles inside
the packaging. In some cases, excessive moisture can cause the granules attached to the
shingles for texture and colouring purposes to fall off the shingles resulting in appearance
problems. To monitor the amount of moisture present, the company conducts moisture tests.
A shingle is weighed and then dried. The shingle is then reweighed and based on the amount
of moisture taken out of the product, the pounds of moisture per 100 square feet are
calculated. The company would like to show that the mean moisture content is less than 0.35
pounds per 100 square feet.
4.1 Do you think there is evidence that means moisture contents in both types of shingles
are within the permissible limits? State your conclusions clearly showing all steps.
 Here we first begin by importing the table using read CSV.
 After importing we check if the table is working fine by checking the first 5
lines of the table by using head() function. For my case I have used head(8) for
checking the first 8 lines and below is the output for it.
 Next, we check the description of the data using the describe function.
 In the describe function we get the basic statistical data such as mean standard
deviation minimum and maximum etc.
 Now after these preliminary steps we can start answering the question.
 The question asks us to check is there is evidence that means moisture contents in
both types of shingles are within the permissible limits.
o So starting with Shingles A

 We use 1 sample T Test to get the answer.
 First we declare the Hypothesis
o H0= There is evidence that shows moisture contents in A is 0.35
o HA= There is evidence that shows moisture contents A are not in
A is 0.35
 Level of significance is 0.05.
 Formula for 1 sample T test is,
x−μ
 T= 5x 0
s
 Where s x= √n
 In jupyter we can use the function ttest_1samp to get the answer.
 Here we get following as our result
 t_statistic -1.4735046253382782
 p_value 0.14955266289815025
 since P_Value I > 0.05 we fail to reject the null hypothesis therefore,
There is evidence that shows moisture contents in A is 0.35.
o Now for Shingles B

 We again use 1 sample T Test to get the answer.
 First we declare the Hypothesis
o H0= There is evidence that shows moisture contents in B is 0.35
o HA= There is evidence that shows moisture contents B are not in A
is 0.35
 Formula for 1 sample T test is,
x−μ
 T= 5x 0
s
 Where s x= √n
 In jupyter we can use the function ttest_1samp to get the answer.
 Here we get following as our result
 t_statistic -3.1003313069986995
 p_value 0.004180954800638365
 since P_Value I < 0.05 we reject the null hypothesis therefore, there is no
evidence that shows moisture contents in B is 0.35.
4.2 Do you think that the population mean for shingles A and B are equal? Form the
hypothesis and conduct the test of the hypothesis. What assumption do you need to
check before the test for equality of means is performed?
o The main assumption to be done before answering this question are
o The two populations have the same variance.
o The populations are normally distributed.
o Each value is sampled independently from each other value. This assumption requires
that each subject provide only one value.
 Now back to answering with the above assumptions we use the 2 sample T test for
this question.
 First, we declare the null hypothesis (H0) and alt hypothesis (H1)
o H0= Ho means of both the samples are equal
o H0= Means of both the samples are not equal
( x 1−x 2 )
 T=
√ s1 S 2
+
N1 N 2
 We can use the function ttest_ind form the SciPy.Stats library.

 When we use that we get the T test value as well as the P value
 Now when we do the t test on the given data we get.
 t_statistic 1.2885080295255027
 p_value 0.2022582205021782
 Our alpha is 0.05 and P value is 0.2017 therefore we accept the null hypothesis and
can conclude that there is evidence that means moisture contents in both types of
shingles are within the permissible limits.
 Now here we see a major problem i.e., there is no way our hypothesis can be accepted
as the means when calculated individually are different. The means are as below
 In the above table we see the means are different. The variance is also different.
Below are the variance of the data
 A 0.000511746
 B 0.000608075
 So by this our first assumption is violated
 Going forward we can plot the 2 columns of the data using Distplot() and we get the
below.
 Now seeing the plots, we can clearly say that it is not bell shaped. Both seem to be a
positively Skewed. Therefore, our second assumption has failed too because a
distribution is normal only when it can be seen as a bell-shaped curve on the plots.
 Now moving on to the 3rd assumption which says that the samples in the distribution
should be independent. Two events are independent if the result of the second event
is not affected by the result of the first event. But here we see clearly that A
shingles is A shingle is weighed and then dried. Therefore, event A or the weight
before drying clearly effects the weight of the Shingle after drying. So even the 3rd
assumption fails. Therefore, any kind of 2 sample T test done on the hypothesis will
give us a wrong answer and even though we see that the 2 means are not equal we still
accept the null hypothesis which says that the 2 means are equal.
****************THE END****************

DS Module 2 Project

Uploaded by

Document Information

Original Description:

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

DS Module 2 Project

Uploaded by

Copyright:

Available Formats

Business Report

Sudhanva Saralaya 6/13/21 PGP-DSBA

 Process Followed in common Across the Project

A wholesale distributor operating in different regions of Portugal has information on

 Understanding the Summary

 Individual items Analysis

 The table when converted to normal integers we get the below:

 The item showing the most consistency is Delicatessen with a variation of

3.1.1 Gender and Major

3.1.2 Gender and Grad Intention

3.1.3 Gender and Employment

3.1.4 Gender and Computer

Major Probability Percent

Major Probability Percent

So as per the question we are safe to assume following

 That would be which totals up to 14

 Now it is a simple calculation to know the probability is 14/40 = 0.35 or 35%

We first construct a contingency table for gender and salary as below

Now the values that we need are as below

 So here we see 2 different questions.

 Conditional probability that a randomly selected female earns 50 or more

which again equals to 10

Now the probability that a randomly selected female earns 50 or more is

 The results of the tests are as follows:

Column Name Skewness

o So starting with Shingles A

o Now for Shingles B

 We can use the function ttest_ind form the SciPy.Stats library.

You might also like