03 Descriptive-Numerical


Engineering Statistics

Descriptive Statistics – Numerical Method


Learning objectives

§ To describe the properties of location or central
tendency, variation and shape in numerical data
§ To calculate descriptive summary measures for a
population
§ To construct and interpret a box-and-whisker plot
§ To describe the covariance and coefficient of
correlation

2
Definitions

§ The central tendency is the extent to which all the data
values group around a typical or central value.
§ The variation is the amount of dispersion, or scattering,
of values
§ The shape is the pattern of the distribution of values
from the lowest value to the highest value.

3
Course outline

§ Measures of location/central tendency


§ Measures of variation
§ Measures of distribution shapes, relative location and
detecting outliers
§ Exploratory data analysis
§ Measures of association between two variables
§ The weighted mean and working with grouped data

4
Measures of location/central tendency

§ Mean
§ Median
§ Mode
§ Percentiles
§ Quartiles

5
Measures of central tendency
the arithmetic mean
§ The mean of a data set is the average of all the data
values.
§ The mean provides a measure of central location for the
data.
§ If the data are for a sample, the mean is denoted by 𝑥̅
§ If the data are for a population, the mean is denoted by
the Greek letter µ.

6
Measures of central tendency
the arithmetic mean

7
Measures of central tendency
the arithmetic mean

The arithmetic mean (mean) is the most common measure of
central tendency.

For a sample of size n:

$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} = \frac{X_1 + X_2 + \cdots + X_n}{n}$$

where n is the sample size and X_1, ..., X_n are the observed values.

8
Measures of central tendency
the arithmetic mean
§ The most common measure of central tendency
§ Mean = sum of values divided by the number of values
§ Affected by extreme values (outliers)

[Figure: two dot plots on a 0–10 scale]

Data 1, 2, 3, 4, 5: Mean = (1 + 2 + 3 + 4 + 5)/5 = 15/5 = 3
Data 1, 2, 3, 4, 10: Mean = (1 + 2 + 3 + 4 + 10)/5 = 20/5 = 4
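The effect of a single extreme value on the mean can be checked directly. The following Python sketch recomputes the two means from the example; the function name `mean` and the variable names are illustrative choices, not part of the original slides.

```python
def mean(values):
    """Arithmetic mean: sum of the values divided by the number of values."""
    return sum(values) / len(values)


base = [1, 2, 3, 4, 5]
with_outlier = [1, 2, 3, 4, 10]  # same data, largest value replaced by 10

print(mean(base))          # 3.0
print(mean(with_outlier))  # 4.0
```

Changing one observation from 5 to 10 moves the mean by a full unit, which is why the mean is described as sensitive to outliers.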
9
Example: Apartment rents

10
Measures of central tendency
the median
§ The median is another measure of central location.
§ The median is the value in the middle when the data are arranged
in ascending order (smallest value to largest value).
§ The median is the measure of location most often reported for
annual income and property value data.
§ A few extremely large incomes or property values can inflate the
mean.
§ A sample median is denoted by x̃ and a population median is
denoted by μ̃

11
Measures of central tendency
the median

§ In an ordered array, the median is the “middle” number (50%
above, 50% below)

[Figure: two dot plots on a 0–10 scale; Median = 4 in both]

§ Not affected by extreme values
§ Note that $\frac{n+1}{2}$ is NOT the value of the median,
only the position of the median in the ranked data.
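The distinction between the position (n + 1)/2 and the value of the median can be sketched in Python as follows. The data are hypothetical, and averaging the two middle values for an even-sized sample is the usual convention, assumed here rather than taken from the slides.

```python
def median(values):
    """Median via the ranked position (n + 1) / 2."""
    ordered = sorted(values)
    n = len(ordered)
    position = (n + 1) / 2  # 1-based position in the ranked data, not the value
    if n % 2 == 1:
        # Odd n: the position is a whole number, so take that ranked value.
        return ordered[int(position) - 1]
    # Even n: the position falls between two ranks, so average the neighbours.
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2


print(median([3, 1, 7, 5, 9]))  # 5   (position 3 of 1, 3, 5, 7, 9)
print(median([4, 1, 3, 2]))     # 2.5 (position 2.5, between 2 and 3)
```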
12
Example: apartment rents

13
Measures of central tendency
the mode
§ The mode of a data set is the value that occurs with
greatest frequency.
§ The greatest frequency can occur at two or more
different values.
§ If the data have exactly two modes, the data are
bimodal.
§ If the data have more than two modes, the data are
multimodal.

14
Measures of central tendency
the mode
§ Value that occurs most often
§ Not affected by extreme values
§ Used for either numerical or categorical data
§ There may be no mode
§ There may be several modes

[Figure: two dot plots; left, on a 0–14 scale, Mode = 9; right, on a 0–6 scale, No Mode]
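The cases listed above (one mode, several modes, no mode) can be illustrated with a short Python sketch. The helper name `modes` and the sample data are hypothetical, and the function assumes a non-empty input; "no mode" is taken to mean that every value occurs exactly once.

```python
from collections import Counter


def modes(values):
    """Return all values that occur with the greatest frequency."""
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:
        return []  # every value occurs once: no mode
    return sorted(v for v, c in counts.items() if c == top)


print(modes([9, 9, 9, 5, 12]))      # [9]
print(modes([1, 1, 2, 2, 3]))       # [1, 2]  (bimodal)
print(modes([0, 1, 2, 3]))          # []      (no mode)
print(modes(["red", "blue", "red"]))  # ['red'] (categorical data)
```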

15
Example: apartment rents

16
Measures of central tendency
which measure to choose?
§ The mean is generally used, unless extreme values
(outliers) exist.
§ In that case the median is often used, since the median
is not sensitive to extreme values. For example, median
home prices may be reported for a region because the
median is less sensitive to outliers.

17
Percentiles

§ A percentile provides information about how the data
are spread over the interval from the smallest value to
the largest value.
§ Admission test scores for colleges and universities are
frequently reported in terms of percentiles.

18
Percentiles

19
Example: apartment rents

20
Quartiles
§ Quartiles are specific percentiles that divide the data into four
parts, each containing approximately one-fourth, or 25%, of the
observations
§ First Quartile = 25th Percentile
§ Second Quartile = 50th Percentile = Median
§ Third Quartile = 75th Percentile
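As a rough illustration, the three quartiles can be computed with Python's standard library. Textbooks and software use slightly different interpolation conventions for percentiles, so the values may not match a hand calculation exactly; the `"exclusive"` method and the sample data are assumptions made for this sketch.

```python
import statistics


def quartiles(values):
    """Return (Q1, Q2, Q3), i.e. the 25th, 50th and 75th percentiles."""
    q1, q2, q3 = statistics.quantiles(values, n=4, method="exclusive")
    return q1, q2, q3


data = [1, 2, 3, 4, 5, 6, 7]  # hypothetical data
q1, q2, q3 = quartiles(data)
print(q1, q2, q3)  # 2.0 4.0 6.0
```

The second quartile coincides with the median of the data.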

21
Example: apartment rents

22
Measures of central tendency
the geometric mean
§ Geometric mean
§ Used to measure the rate of change of a variable over time

$$\bar{X}_G = (X_1 \times X_2 \times \cdots \times X_n)^{1/n}$$

§ Geometric mean rate of return
§ Measures the status of an investment over time

$$\bar{R}_G = [(1 + R_1) \times (1 + R_2) \times \cdots \times (1 + R_n)]^{1/n} - 1$$

§ Where R_i is the rate of return in time period i

23
Measures of central tendency
the geometric mean

An investment of $100,000 declined to $50,000 at the end of
year one and rebounded to $100,000 at the end of year two:

X1 = $100,000 → X2 = $50,000 (50% decrease) → X3 = $100,000 (100% increase)

The overall two-year return is zero, since it started and ended
at the same level.
24
Measures of central tendency
the geometric mean
Use the 1-year returns to compute the arithmetic mean and the
geometric mean:

Arithmetic mean rate of return (misleading result):

$$\bar{X} = \frac{(-0.5) + (1)}{2} = 0.25$$

Geometric mean rate of return (more accurate result):

$$\bar{R}_G = [(1 + R_1) \times (1 + R_2) \times \cdots \times (1 + R_n)]^{1/n} - 1$$

$$= [(1 + (-0.5)) \times (1 + (1))]^{1/2} - 1 = [(0.50) \times (2)]^{1/2} - 1 = 1^{1/2} - 1 = 0\%$$
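The comparison can be reproduced with a few lines of Python. This is a minimal sketch of the geometric mean rate of return for the two yearly returns in the example (-50% and +100%); the function and variable names are illustrative.

```python
import math


def geometric_mean_return(returns):
    """Geometric mean rate of return: [(1+R1)(1+R2)...(1+Rn)]^(1/n) - 1."""
    growth = math.prod(1 + r for r in returns)
    return growth ** (1 / len(returns)) - 1


yearly_returns = [-0.5, 1.0]  # 50% decrease, then 100% increase

arithmetic = sum(yearly_returns) / len(yearly_returns)
geometric = geometric_mean_return(yearly_returns)

print(arithmetic)  # 0.25
print(geometric)   # 0.0
```

The arithmetic mean suggests a 25% average yearly gain, while the geometric mean correctly reports that the investment ended where it started.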

25
Measures of central tendency
summary
Central Tendency

§ Arithmetic Mean: $\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}$
§ Median: middle value in the ordered array
§ Mode: most frequently observed value
§ Geometric Mean: $\bar{X}_G = (X_1 \times X_2 \times \cdots \times X_n)^{1/n}$

26
Measures of variation
§ It is often desirable to consider measures of variability
(dispersion), as well as measures of location.
§ For example, in choosing supplier A or supplier B we might
consider not only the average delivery time for each, but
also the variability in delivery time for each.

27
Measures of variation

§ Range
§ Interquartile range
§ Variance
§ Standard deviation
§ Coefficient of variation

28
Range

§ The range of a data set is the difference between the largest and
smallest data values.
§ It is the simplest measure of variation.
§ It is very sensitive to the smallest and largest data values.

Range = Xlargest – Xsmallest

Example (data plotted on a 0–14 scale, smallest value 1, largest value 13):

Range = 13 - 1 = 12
29
Example: apartment rents

30
Measures of variation
disadvantages of the range

§ Ignores the way in which data are distributed

[Figure: two dot plots on a 7–12 scale with different distributions of values]
Range = 12 - 7 = 5 in both cases

§ Sensitive to outliers
1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,5
Range = 5 - 1 = 4

1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,120
Range = 120 - 1 = 119
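The two ranges above can be verified with a short Python sketch that uses the same 25 values; the helper name `value_range` is an illustrative choice.

```python
def value_range(values):
    """Range: largest value minus smallest value."""
    return max(values) - min(values)


# Eleven 1s, eight 2s, four 3s, one 4, and a final value of 5 or 120.
typical = [1] * 11 + [2] * 8 + [3] * 4 + [4, 5]
with_outlier = [1] * 11 + [2] * 8 + [3] * 4 + [4, 120]

print(value_range(typical))       # 4
print(value_range(with_outlier))  # 119
```

Only one of the 25 observations differs between the two data sets, yet the range grows from 4 to 119.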
31
Interquartile range
§ Problems caused by outliers can be eliminated by using the
interquartile range.
§ The IQR sets aside the lowest 25% and the highest 25% of the values
and measures the range of the middle 50%.
§ Interquartile range = 3rd quartile – 1st quartile
= Q3 – Q1
Example:

Minimum = 12, Q1 = 30, Median (Q2) = 45, Q3 = 57, Maximum = 70,
with 25% of the data in each of the four segments.

Interquartile range = 57 – 30 = 27
Example: apartment rents

33
Measures of variation
standard deviation
§ The variance is a measure of variability that utilizes all
the data.
§ It is based on the difference between the value of each
observation (xi) and the mean (𝑥̅ for a sample, µ for a
population).
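A minimal Python sketch of these deviation-based measures follows. It assumes the usual definitions, which are not spelled out in the text above: the sample variance divides the sum of squared deviations by n - 1, the population variance divides by N, and the standard deviation is the square root of the variance. The data are hypothetical.

```python
import math


def sample_variance(values):
    """Sum of squared deviations from the sample mean, divided by n - 1."""
    n = len(values)
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)


def population_variance(values):
    """Sum of squared deviations from the population mean, divided by N."""
    n = len(values)
    mu = sum(values) / n
    return sum((x - mu) ** 2 for x in values) / n


data = [1, 2, 3, 4, 5]  # hypothetical data; mean is 3
s2 = sample_variance(data)
s = math.sqrt(s2)  # sample standard deviation

print(s2)                         # 2.5
print(population_variance(data))  # 2.0
```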

34
Measures of variation
standard deviation

35
Measures of variation
standard deviation

36
Example – mean class

37
Example – starting salary

38
Measures of variation
standard deviation

39
Example – starting salary

40
Measures of variation
comparing standard deviation

[Figure: three dot plots on an 11–21 scale]

Data A: Mean = 15.5, S = 3.338
Data B: Mean = 15.5, S = 0.926
Data C: Mean = 15.5, S = 4.570

41
Measures of variation
comparing standard deviation

Small standard deviation

Large standard deviation

42
Measures of variation
summary characteristics
§ The more the data are spread out, the greater the range,
interquartile range, variance, and standard deviation.
§ The more the data are concentrated, the smaller the
range, interquartile range, variance, and standard
deviation.
§ If the values are all the same (no variation), all these
measures will be zero.
§ None of these measures are ever negative.

43
Coefficient of variation

44
Example – mean class

45
Measures of distribution shape, relative location
and detecting outliers

§ Distribution Shape
§ z-Scores
§ Chebyshev’s Theorem
§ Empirical Rule
§ Detecting Outliers

46
Shape of distribution
§ Describes how data are distributed
§ Measures of shape
§ Symmetric or skewed

Left-Skewed: Mean < Median
Symmetric: Mean = Median
Right-Skewed: Median < Mean

47
Shape of distribution

[Figure: example distributions illustrating Mean < Median, Mean = Median, and Median < Mean]

48
General descriptive statistics using
Microsoft Excel

1. Select Tools.

2. Select Data Analysis.

3. Select Descriptive
Statistics and click OK.

49
General descriptive statistics using
Microsoft Excel

4. Enter the cell range.
5. Check the Summary Statistics box.
6. Click OK

50
z-scores

51
Example – mean class

52
Chebyshev’s theorem

53
Example - Chebyshev’s theorem

54
Example - Chebyshev’s theorem

For the test scores between 58 and 82:

$$z = \frac{58 - 70}{5} = -2.4$$ indicates 58 is 2.4 standard deviations below the mean.

$$z = \frac{82 - 70}{5} = 2.4$$ indicates 82 is 2.4 standard deviations above the mean.

Applying Chebyshev's theorem with z = 2.4, we have:

$$1 - \frac{1}{z^2} = 1 - \frac{1}{(2.4)^2} = 0.826$$

At least 82.6% of the students must have test scores between
58 and 82.
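The calculation can be sketched in Python as below. The mean of 70 and standard deviation of 5 are the values implied by the example (58 and 82 each lie 2.4 standard deviations from the mean); the function name is an illustrative choice.

```python
def chebyshev_lower_bound(z):
    """Minimum proportion of data within z standard deviations of the mean.

    Chebyshev's theorem only gives a useful bound for z > 1.
    """
    if z <= 1:
        raise ValueError("Chebyshev's theorem requires z > 1")
    return 1 - 1 / z ** 2


mean, sd = 70, 5        # implied by the test-score example
z = (82 - mean) / sd    # 2.4 standard deviations above the mean
bound = chebyshev_lower_bound(z)

print(round(bound, 3))  # 0.826
```

The bound holds for any distribution shape, which is why it is stated as "at least" 82.6%.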
55
Empirical rule

56
Empirical rule

57
Empirical rule

58
Empirical rule

59
Example – apartment rents

𝑥̅ = 490.8
𝑠 = 54.74

60
Example – apartment rents

𝑥̅ = 490.8
𝑠 = 54.74

61
Example – apartment rents

𝑥̅ = 490.8
𝑠 = 54.74

62
Detecting outliers

§ Sometimes a data set will have one or more
observations with unusually large or unusually small
values.
§ These extreme values are called outliers.
§ Experienced statisticians take steps to identify outliers
and then review each one carefully.
§ An outlier may be a data value that has been
incorrectly recorded. → If so, it can be corrected
before further analysis.

63
Detecting outliers

§ An outlier may also be from an observation that was
incorrectly included in the data set. → If so, it can be
removed.
§ An outlier may be an unusual data value that has been
recorded correctly and belongs in the data set. → If so,
it should remain.
§ Standardized values (z-scores) can be used to identify
outliers.
§ In using z-scores to identify outliers, we recommend
treating any data value with a z-score less than -3 or
greater than 3 as an outlier.
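The z-score rule can be sketched in Python as follows. The data are hypothetical, the sample standard deviation (n - 1 divisor) is assumed, and the function assumes the values are not all identical, since the standard deviation would then be zero.

```python
import statistics


def find_outliers(values, threshold=3.0):
    """Return the values whose z-score is below -threshold or above +threshold."""
    x_bar = statistics.mean(values)
    s = statistics.stdev(values)  # sample standard deviation
    return [x for x in values if abs((x - x_bar) / s) > threshold]


# Hypothetical data: twenty values between 10 and 13, plus one extreme value.
data = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11,
        12, 10, 11, 13, 12, 11, 10, 12, 11, 13, 60]

print(find_outliers(data))  # [60]
```

A flagged value is only a candidate for review; whether it is corrected, removed, or kept depends on the checks described above.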

64
Example – detecting outliers

65
Example: apartment rents

66
Exploratory data analysis

§ Five-Number Summary
1. Minimum/smallest value
2. First quartile (Q1)
3. Median (Q2)
4. Third quartile (Q3)
5. Maximum/largest value
§ Box-and-Whisker Plot

67
Five-number summary

§ Five-Number Summary
1. Minimum/smallest value
2. First quartile (Q1)
3. Median (Q2)
4. Third quartile (Q3)
5. Maximum/largest value
§ Box-and-Whisker Plot
Box-and-whisker plot

§ A box plot is a graphical summary of data that is based on a
five-number summary.
§ A key to the development of a box plot is the computation
of the median and the quartiles, Q1 and Q3. The
interquartile range, IQR = Q3 − Q1, is also used.
o A box is drawn with the ends of the box located at the first and
third quartiles. For the salary data, Q1 = 3465 and Q3 = 3600. This
box contains the middle 50% of the data.
o A vertical line is drawn in the box at the location of the median
(3505 for the salary data).
o By using the interquartile range, IQR = Q3 − Q1, limits are
located. The limits for the box plot are 1.5(IQR) below Q1 and
1.5(IQR) above Q3.
o The dashed lines are called whiskers. The whiskers are drawn from
the ends of the box to the smallest and largest values inside the limits.
o Finally, the location of each outlier is drawn using an asterisk symbol (*).
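The limit and whisker steps can be sketched in Python using the quartiles quoted for the salary data (Q1 = 3465, Q3 = 3600). The list of salaries below is hypothetical and only serves to show how values are split into whisker ends and outliers; the function names are illustrative.

```python
def box_plot_limits(q1, q3):
    """Lower and upper limits located 1.5 * IQR beyond the quartiles."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def whiskers_and_outliers(values, lower, upper):
    """Whisker ends are the extreme values inside the limits; the rest are outliers."""
    inside = [x for x in values if lower <= x <= upper]
    outliers = [x for x in values if x < lower or x > upper]
    return min(inside), max(inside), outliers


lower, upper = box_plot_limits(3465, 3600)
print(lower, upper)  # 3262.5 3802.5

salaries = [3300, 3480, 3505, 3600, 3950]  # hypothetical values
print(whiskers_and_outliers(salaries, lower, upper))  # (3300, 3600, [3950])
```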
Box plot

70
Measures of association between two
variables
§ Covariance
§ Correlation coefficient

71
Covariance

72
Example - stereo and sound equipment
store

73
Scatter diagram and calculation of sample
covariance for the stereo and sound equipment store

74
Correlation coefficient

75
The correlation coefficient

[Figure: five scatter plots of Y against X illustrating r = -1, r = -.6, r = 0, r = +1, and r = +.3]
76
Correlation coefficient

77
Example - stereo and sound equipment store

78
Perfect linear relationship

79
The correlation coefficient
using Microsoft Excel

1. Select Tools/Data Analysis


2. Choose Correlation from
the selection menu
3. Click OK . . .

80
The correlation coefficient
using Microsoft Excel

4. Input data range and select appropriate options
5. Click OK to get output

81
The correlation coefficient
using Microsoft Excel

§ r = .733
§ There is a relatively strong positive linear relationship between
test score #1 and test score #2.
§ Students who scored high on the first test tended to score high
on the second test.

[Figure: Scatter Plot of Test Scores; x-axis Test #1 Score, y-axis Test #2 Score, both from 70 to 100]
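For readers who want to compute such a coefficient outside Excel, the following Python sketch assumes the standard definitions: the sample covariance divides the sum of cross-deviations by n - 1, and the correlation coefficient is the covariance divided by the product of the two sample standard deviations. The data are hypothetical and are not the test scores from the slide.

```python
import math


def sample_covariance(xs, ys):
    """Sum of (x - mean_x)(y - mean_y) divided by n - 1."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)


def correlation(xs, ys):
    """Sample correlation coefficient r = s_xy / (s_x * s_y)."""
    s_x = math.sqrt(sample_covariance(xs, xs))  # covariance of x with itself is its variance
    s_y = math.sqrt(sample_covariance(ys, ys))
    return sample_covariance(xs, ys) / (s_x * s_y)


xs = [1, 2, 3, 4]
r_up = correlation(xs, [2, 4, 6, 8])    # perfect positive linear relationship
r_down = correlation(xs, [8, 6, 4, 2])  # perfect negative linear relationship
print(round(r_up, 6), round(r_down, 6))  # 1.0 -1.0
```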

82
The weighted mean and
working with grouped data
§ Weighted mean
§ Mean for grouped data
§ Variance for grouped data
§ Standard deviation for grouped data

83
Weighted mean

84
Example

85
Grouped data

§ The weighted mean computation can be used to obtain
approximations of the mean, variance, and standard
deviation for the grouped data.
§ To compute the weighted mean, we treat the midpoint
of each class as though it were the mean of all
items in the class.
§ We compute a weighted mean of the class midpoints
using the class frequencies as weights.
§ Similarly, in computing the variance and standard
deviation, the class frequencies are used as weights.
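The approach described above can be sketched in Python. The formulas assumed here are the usual grouped-data approximations: the mean is the frequency-weighted average of the class midpoints, and the sample variance is the frequency-weighted sum of squared deviations of the midpoints divided by n - 1. The class midpoints and frequencies are hypothetical.

```python
import math


def grouped_mean_and_variance(midpoints, frequencies):
    """Approximate sample mean and variance from class midpoints and frequencies."""
    n = sum(frequencies)
    mean = sum(f * m for m, f in zip(midpoints, frequencies)) / n
    variance = sum(f * (m - mean) ** 2
                   for m, f in zip(midpoints, frequencies)) / (n - 1)
    return mean, variance


midpoints = [5, 15, 25]   # hypothetical class midpoints
frequencies = [2, 6, 2]   # hypothetical class frequencies (n = 10)

g_mean, g_var = grouped_mean_and_variance(midpoints, frequencies)
g_sd = math.sqrt(g_var)

print(g_mean)           # 15.0
print(round(g_var, 2))  # 44.44
```

Because each observation is replaced by its class midpoint, these results are approximations of the values that would be obtained from the raw data.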
86
Mean and sample variance for grouped
data

87
Mean and sample variance for grouped
data

88
Example

89
Example

90

91
