Descriptive Statistics: Prepared By: Maira Sami


Business Analytics

Descriptive Statistics
Prepared By: Maira Sami
Modified By : Sophia Ajaz

1
Objectives
• Analyzing Distributions
• Measures of Association Between Two Variables
• Data Cleansing

2
Analyzing Distributions
• Distributions are very useful for interpreting and analyzing
data.
• A distribution describes the overall variability of the observed
values of a variable.
• In this section we introduce additional ways of analyzing
distributions.

3
Analyzing Distributions
• Percentiles
• Quartiles
• z-Score
• Empirical Rule
• Box Plots

4
Percentile
• A percentile is the value of a variable at which a specified
(approximate) percentage of observations are below that
value.

• The pth percentile tells us the point in the data where
approximately p% of the observations have values less than
the pth percentile; hence, approximately (100 − p)% of the
observations have values greater than the pth percentile.

5
• Percentile:
The value below which a given percentage of the data falls.

6
Example
• If a child's weight is at the 50th percentile line, that means
that out of 100 normal children her age, 50 will be bigger than
she is and 50 smaller.
• Similarly, if she is in the 75th percentile, that means that she
is bigger than 75 children and smaller than only 25, compared
with 100 children her age.

7
Examples:
Colleges and universities frequently report admission test scores in
terms of percentiles.

• For instance, suppose an applicant obtains a raw score of 54 on
the verbal portion of an admission test. How this student
performed in relation to other students taking the same test
may not be readily apparent.

• However, if the raw score of 54 corresponds to the 70th
percentile, we know that approximately 70% of the students
scored lower than this individual, and approximately 30% of the
students scored higher.

8
To calculate the pth percentile for a data set containing n observations, we must first
arrange the data in ascending order (smallest value to largest value). The location of
the pth percentile in the sorted data is then Lp = (p/100)(n + 1).

9
Exercise:

Compute the 85th percentile for the home sales data in Table
2.9.

10
11
#  Selling price ($)
1 108,000
2 138,000
3 138,000
4 142,000
5 186,000
6 199,500
7 208,000
8 254,000
9 254,000
10 257,500
11 298,000
12 456,250
12
• With p = 85 and n = 12, the location of the 85th percentile is
L85 = (85/100)(12 + 1) = 11.05

13
• L85 = 11.05 is interpreted as follows: the 85th percentile is 5% of
the way between the value in position 11 and the value in position 12.
• Equivalently, the 85th percentile is the value in position 11 (298,000)
plus 0.05 times the difference between the value in position 12
(456,250) and the value in position 11 (298,000).

14
Thus, 85th percentile
= 298,000 + 0.05(456,250 - 298,000)
= 298,000 + 0.05(158,250)
= 305,912.50

• Therefore, $305,912.50 represents the 85th percentile of the
home sales data.
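
The steps above (sort, locate, interpolate) can be sketched in Python. This is a minimal illustration, assuming the location rule Lp = (p/100)(n + 1) implied by L85 = 11.05; the function name `percentile` and the clamping at the ends of the data are assumptions made for this sketch, not part of the slides.

```python
def percentile(values, p):
    """pth percentile using the location rule L_p = (p/100)(n + 1)."""
    data = sorted(values)                 # step 1: ascending order
    n = len(data)
    location = p * (n + 1) / 100          # 1-based position, may be fractional
    lower = int(location)                 # whole part: position of the lower neighbour
    fraction = location - lower           # fractional part: how far toward the next value
    # Assumption for this sketch: clamp to the smallest/largest value
    # when the location falls outside positions 1..n.
    if lower < 1:
        return data[0]
    if lower >= n:
        return data[-1]
    return data[lower - 1] + fraction * (data[lower] - data[lower - 1])


home_sales = [108_000, 138_000, 138_000, 142_000, 186_000, 199_500,
              208_000, 254_000, 254_000, 257_500, 298_000, 456_250]

print(f"{percentile(home_sales, 85):,.2f}")   # prints 305,912.50
```

Other software may use a different location rule, so percentiles reported by a spreadsheet or statistics package can differ slightly from this hand calculation.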

15
Quartiles
• When the data are divided into four equal parts:
– Each part contains approximately 25% of the
observations
– Division points are referred to as quartiles

Q1 = first quartile, or 25th percentile
Q2 = second quartile, or 50th percentile (also the median)
Q3 = third quartile, or 75th percentile

16
17
• To demonstrate quartiles, the home sales data are again
arranged in ascending order
#  Selling price ($)
1 108,000
2 138,000
3 138,000
4 142,000
5 186,000
6 199,500
7 208,000
8 254,000
9 254,000
10 257,500
11 298,000
12 456,250
18
• We already identified Q2, the second quartile (median) as
203,750 (as calculated earlier)
• To find Q1 and Q3 we must find the 25th and 75th
percentiles.

19
20
• Therefore, for the home sales data:
• The 25th percentile (Q1) is $139,000.
• The 50th percentile (Q2) is $203,750.
• The 75th percentile (Q3) is $256,625.
• The quartiles divide the home sales data into four parts,
with each part containing 25% of the observations.

108,000 138,000 138,000 | 142,000 186,000 199,500 | 208,000 254,000 254,000 | 257,500 298,000 456,250
(the three dividing points are Q1 = 139,000, Q2 = 203,750, and Q3 = 256,625)

21
• The difference between the third and first quartiles is often
referred to as the interquartile range, or IQR.
• For the home sales data,
IQR = Q3 - Q1 = 256,625 - 139,000 = 117,625.
Because it excludes the smallest and largest 25% of values in the
data, the IQR is a useful measure of variation for data that have
extreme values or are highly skewed.
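
As a quick check of the quartiles and the IQR, the following sketch uses `statistics.quantiles` from the Python standard library. Its `method="exclusive"` option places the cut points at positions (p/100)(n + 1), which matches the hand calculation for this data set; the variable names are chosen for this illustration only.

```python
from statistics import quantiles

home_sales = [108_000, 138_000, 138_000, 142_000, 186_000, 199_500,
              208_000, 254_000, 254_000, 257_500, 298_000, 456_250]

# n=4 asks for the three cut points that split the data into four parts.
q1, q2, q3 = quantiles(home_sales, n=4, method="exclusive")
iqr = q3 - q1   # spread of the middle 50% of the observations

print(f"Q1 = {q1:,.0f}, Q2 = {q2:,.0f}, Q3 = {q3:,.0f}, IQR = {iqr:,.0f}")
```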

22
z-Scores
• A z-score allows us to measure the relative location of a value
in the data set. More specifically, a z-score helps us determine
how far a particular value is from the mean relative to the
data set’s standard deviation.
• The z-score is often called the standardized value.

23
• A z-score tells us how many standard deviations a value is from the mean.

In this example, the value 1.7 is 2 standard deviations away
from the mean of 1.4 (so one standard deviation is 0.15), and 1.7 has a z-score of 2.
Similarly, 1.85 has a z-score of 3.

24
• For example, z1 = 1.2 indicates that x1 is 1.2 standard deviations
greater than the sample mean.
• Similarly, z2 = − 0.5 indicates that x2 is 0.5, or 1/2, standard
deviation less than the sample mean.
• A z-score of zero indicates that the value of the observation is
equal to the mean.
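
A z-score is computed as z = (x - mean) / (standard deviation). The sketch below shows one way to do this in Python. The five values are illustrative numbers chosen for easy arithmetic (mean 44, sample standard deviation 8), and the function name `z_scores` is an assumption of this example.

```python
from statistics import mean, stdev


def z_scores(values):
    """Return the z-score of every observation in a sample."""
    x_bar = mean(values)
    s = stdev(values)          # sample standard deviation (divides by n - 1)
    return [(x - x_bar) / s for x in values]


sample = [46, 54, 42, 46, 32]  # illustrative values only
for x, z in zip(sample, z_scores(sample)):
    print(f"x = {x:2d}   z = {z:+.2f}")
```

Positive z-scores mark values above the mean, negative z-scores mark values below it, and the z-scores of a sample always sum to zero.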

25
The z-scores for the class size data are computed in Table 2.13.

26
Empirical Rule
• When the data exhibit a symmetric, bell-shaped
distribution, as shown in Figure 2.21, the empirical
rule can be used to determine the percentage of data values
that are within a specified number of standard deviations of
the mean.

27
Empirical Rule
• For data having a bell-shaped distribution:
• Approximately 68% of the data values will be within 1 standard
deviation of the mean.
• Approximately 95% of the data values will be within 2 standard
deviations of the mean.
• Almost all of the data values will be within 3 standard deviations of
the mean.

28
The height of adult males in the United States has a bell-shaped
distribution similar to that shown in Figure 2.21, with a mean of
approximately 69.5 inches and standard deviation of approximately 3
inches.

Using the empirical rule, we can draw the following conclusions.

• Approximately 68% of adult males in the United States have heights
between 69.5 - 3 = 66.5 and 69.5 + 3 = 72.5 inches.

• Approximately 95% of adult males in the United States have heights
between 69.5 - 2(3) = 63.5 and 69.5 + 2(3) = 75.5 inches.

• Almost all adult males in the United States have heights between
60.5 and 78.5 inches.
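
The three intervals are simply the mean plus or minus 1, 2, and 3 standard deviations. A small sketch of that arithmetic follows; the function name `empirical_rule_intervals` is hypothetical, and the percentages apply only when the data are roughly bell-shaped.

```python
def empirical_rule_intervals(mean, std_dev):
    """Intervals covering roughly 68%, 95%, and almost all bell-shaped data."""
    coverage = {1: "approximately 68%", 2: "approximately 95%", 3: "almost all"}
    return {
        k: (mean - k * std_dev, mean + k * std_dev, label)
        for k, label in coverage.items()
    }


# Adult male heights from the example: mean 69.5 inches, standard deviation 3 inches.
intervals = empirical_rule_intervals(69.5, 3)
for k, (low, high, label) in intervals.items():
    print(f"within {k} standard deviation(s): {low} to {high} inches ({label})")
```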
29
Box Plots
• A box plot is a graphical summary of the distribution of data.
A box plot is developed from the quartiles for a data set.
Figure 2.22 is a box plot for the home sales data.
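
The numbers behind a box plot can be sketched as follows. The slides state only that the box is built from the quartiles; the 1.5 × IQR limits used here for the whiskers and for flagging outliers are a common convention and an assumption of this example, as are the function and key names.

```python
from statistics import quantiles


def box_plot_summary(values, whisker_factor=1.5):
    """Quartiles, whisker ends, and outliers for a box plot."""
    data = sorted(values)
    q1, median, q3 = quantiles(data, n=4, method="exclusive")
    iqr = q3 - q1
    lower_limit = q1 - whisker_factor * iqr    # assumed convention: 1.5 x IQR
    upper_limit = q3 + whisker_factor * iqr
    inside = [x for x in data if lower_limit <= x <= upper_limit]
    return {
        "q1": q1, "median": median, "q3": q3, "iqr": iqr,
        "whisker_low": inside[0],              # smallest value inside the limits
        "whisker_high": inside[-1],            # largest value inside the limits
        "outliers": [x for x in data if x < lower_limit or x > upper_limit],
    }


home_sales = [108_000, 138_000, 138_000, 142_000, 186_000, 199_500,
              208_000, 254_000, 254_000, 257_500, 298_000, 456_250]
summary = box_plot_summary(home_sales)
print(summary)
```

Under this convention the upper limit is 256,625 + 1.5(117,625) = 433,062.50, so the sale at 456,250 would be drawn as an outlier.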

30
EXAMPLE

31
• Box plots are also very useful for comparing different data
sets.
• For instance, if we want to compare home sales from several
different communities, we could create box plots for recent
home sales in each community.

32
Measures of Association Between Two Variables
• Often a manager or decision maker is interested in the
relationship between two variables. In this section, we
present covariance and correlation as descriptive measures of
the relationship between two variables.
• Scatter chart
• Covariance
• Correlation Coefficient

33
To illustrate these concepts:
• We consider the case of the sales manager of Queensland
Amusement Park, who is in charge of ordering bottled water to
be purchased by park customers.
• The sales manager believes that daily bottled water sales in
the summer are related to the outdoor temperature.
• Table 2.14 shows data for high temperatures and bottled
water sales for 14 summer days. The data have been sorted by
high temperature from lowest value to highest value.

34
35
Scatter Charts
• A scatter chart is a useful graph for analyzing the relationship
between two variables. Figure 2.26 shows a scatter chart for sales of
bottled water versus the high temperature experienced on 14
consecutive days.
• The scatter chart in the figure suggests that higher daily high
temperatures are associated with higher bottled water sales. This is
an example of a positive relationship, because when one variable
(high temperature) increases, the other variable (sales of bottled
water) generally also increases. The scatter chart also suggests that a
straight line could be used as an approximation for the relationship
between high temperature and sales of bottled water.

36
37
Covariance

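• Covariance is a descriptive measure of the linear association
between two variables. For a sample of size n with the
observations (x1 , y1), (x2 , y2 ), and so on, the sample
covariance is defined as follows:

Sxy = [ (x1 - x̄)(y1 - ȳ) + (x2 - x̄)(y2 - ȳ) + ... + (xn - x̄)(yn - ȳ) ] / (n - 1)

where x̄ and ȳ are the sample means of x and y.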

38
39
The covariance calculated is Sxy = 12.8.
Because the covariance is greater than 0, it indicates a positive
relationship between the high temperature and sales of bottled
water.

This agrees with the relationship we saw in the scatter chart: as
the high temperature for a day increases, sales of bottled water
generally increase.
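
The sample covariance is the sum of the products of the deviations from each mean, divided by n - 1. The sketch below shows the calculation in Python. Table 2.14 is not reproduced in these slides, so the five temperature and sales pairs are hypothetical and will not give 12.8; the function name is also an assumption.

```python
def sample_covariance(x, y):
    """Sample covariance: sum of (xi - x_bar)(yi - y_bar), divided by n - 1."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length, at least 2")
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    return sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)


# Hypothetical data, NOT the values from Table 2.14.
high_temps = [78, 80, 82, 84, 86]   # daily high temperature
cases_sold = [20, 22, 21, 25, 27]   # bottled water sales

print(sample_covariance(high_temps, cases_sold))   # 8.5 (positive relationship)
```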

40
Correlation Coefficient
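• The correlation coefficient measures the strength of the linear
relationship between two variables. It always lies between -1 and +1
and, unlike the covariance, is not affected by the units of measurement.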
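
The sample correlation coefficient divides the covariance by the product of the two standard deviations. A minimal sketch follows, using hypothetical temperature and sales data (not the slides' Table 2.14) and a hypothetical function name. It also converts the temperatures from Fahrenheit to Celsius to illustrate that changing the units leaves the correlation unchanged.

```python
from statistics import stdev


def correlation_coefficient(x, y):
    """Sample correlation: covariance divided by the two standard deviations."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)
    return s_xy / (stdev(x) * stdev(y))


# Hypothetical data for illustration only.
temps_f = [78, 80, 82, 84, 86]
cases_sold = [20, 22, 21, 25, 27]
temps_c = [(t - 32) * 5 / 9 for t in temps_f]   # same days, different units

r_f = correlation_coefficient(temps_f, cases_sold)
r_c = correlation_coefficient(temps_c, cases_sold)
print(round(r_f, 4), round(r_c, 4))   # both about 0.922
```

Values near +1 indicate a strong positive linear relationship, values near -1 a strong negative one, and values near 0 little or no linear relationship.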

41
42
43
• For example, the scatter diagram in Figure 2.29 shows the
relationship between the amount spent by a small retail store
for environmental control (heating and cooling) and the daily
high outside temperature for 100 consecutive days.

• Figure 2.29 provides strong visual evidence of a nonlinear
relationship. That is, we can see that as the daily high outside
temperature increases, the money spent on environmental
control first decreases as less heating is required and then
increases as greater cooling is required.

44
45
Data Cleansing
• The data in a data set are often said to be “dirty” and “raw”
before they have been put into a form that is best suited for
investigation, analysis, and modeling.

• Data preparation makes heavy use of the descriptive statistics
and data-visualization methods to gain an understanding of
the data.

• Common tasks in data preparation include treating missing
data, identifying erroneous data and outliers, and defining the
appropriate way to represent variables.

46
• Missing Data
• Identification of Erroneous Outliers and Other Erroneous
Values
• Variable Representation
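
The first two tasks can be sketched with the descriptive statistics covered earlier. The example below locates missing entries and flags values more than 3 standard deviations from the mean as candidates for review. The data are hypothetical, and both the use of `None` for missing values and the threshold of 3 are assumptions of this sketch; a flagged value still has to be checked before it is treated as an error.

```python
from statistics import mean, stdev

# Hypothetical daily sales; None marks a missing entry, 250 is a suspected typo.
raw = [20, 21, 22, None, 23, 24, 25, 26, 27, None,
       28, 29, 30, 31, 32, 33, 250]

# Step 1: find the missing data.
missing_positions = [i for i, v in enumerate(raw) if v is None]
observed = [v for v in raw if v is not None]

# Step 2: flag possible outliers using z-scores (assumed cutoff: |z| > 3).
x_bar, s = mean(observed), stdev(observed)
suspects = [v for v in observed if abs((v - x_bar) / s) > 3]

print("missing at positions:", missing_positions)
print("values to review:", suspects)
```

In small samples a single extreme value inflates the standard deviation, so a z-score cutoff can miss outliers; a box plot is a useful second check.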

47
THANK YOU!!

48
