Topic 01 - Descriptive Statistics PDF
DECISION SCIENCE ‐1
DATA REPRESENTATION:
DATA AND STATISTICS
Term 1: 2019
Statistics and Role in Decision Making
• The term statistics can refer to numerical facts such as averages, medians, and
percentages that help us understand a variety of business and economic
situations.
• Statistics can also refer to the art and science of collecting, analyzing,
presenting, and interpreting data.
• Used by people working in accounting, finance, economics, marketing,
production, information systems, etc.
June‐August, 2019 Decision Science ‐1, PGP Term 1, Section E 2
Prof. Ananth Krishnamurthy 1
Decision Science ‐1, PGP‐Term1, Section E 2019
Data Categorization
• Categorization of Data: What do these mean?
– Qualitative data: numeric or non-numeric
– Quantitative data: numeric
Scale of Measurement
• Nominal Scale
– Data are labels or names used to identify an attribute of the element
– Non‐numeric (AMEX, NYSE,..) or numeric code (AMEX = 1, NYSE = 2, …)
• Ordinal Scale
– The data have the properties of nominal data and the order or rank of the data is
meaningful.
– Non‐numeric (Excellent, Good,..) or numeric code (1 for Excellent, 2 for Good, …)
Scale of Measurement
• Interval Scale
– They have the properties of ordinal data, the interval between observations is expressed
in terms of a fixed unit of measure. They are always numeric.
– Time of day: Between 10:00 am and 11:30 am; Between 10:30 pm and 11:30 pm
– Temperature: Between 10 °C and 15 °C; Between -10 °C and -15 °C
• Ratio Scale
– The data have all the properties of interval data and the ratio of two values is
meaningful.
– Money: Rs 2,00,000 is twice as large as Rs 1,00,000
Data Categorization
• Cross-sectional
– Data are collected at the same or approximately the same point in time.
– Example: Permits issued in November in each of the districts in Karnataka.
• Time series
– Data are collected over several time periods.
– Example: Permits issued each month in Karnataka for the last 36 months.
Statistical Inference
• Population: The set of all elements of interest for a specific study.
– Example: The population consists of all tune-ups. The average cost of parts is unknown.
• Sample: A subset of the population.
– Example: A sample of 50 engine tune-ups is examined.
• Statistical Inference: The process of using data obtained from a sample to make estimates and test hypotheses about the characteristics of a population.
DECISION SCIENCE ‐1
DATA REPRESENTATION:
DESCRIPTIVE STATISTICS ‐ DISPLAYS
Term 1: 2019
Descriptive Statistics
• Most of the statistical information in company reports and other publications
consists of data that are summarized and presented in a form that is easy to
understand.
• Such summaries of data, which may be tabular, graphical, or numerical, are
referred to as descriptive statistics.
Summarizing Categorical and Quantitative Data
• We will review a few techniques to summarize data:
– Categorical data: Frequency Distribution, Relative Frequency Distribution, Percent Frequency Distribution, Bar Chart, Pie Chart
– Quantitative data: Frequency Distribution, Relative Frequency Distribution, Percent Frequency Distribution, Dot Plot, Histogram, Cumulative Distribution, Stem and Leaf Diagram
Example: Summarizing Categorical Data
• Customers coming to Moe’s Tavern in Springfield were
asked to rate the quality of their service as being
excellent, above average, average, below average, or
poor.
• The ratings provided by a sample of 20 customers are:
Frequency, Bar Chart and Pareto Diagram
• Frequency, Relative Frequency, Percentage Frequency
– When the bars are arranged in descending order of height from left to right (with the
most frequently occurring cause appearing first) the bar chart is called a Pareto diagram.
Example: Summarizing Quantitative Data
• The manager of Springfield Auto would like to have a
better understanding of the cost of parts used in the
engine tune‐ups performed in his shop. He examines
50 customer invoices for tune‐ups. The costs of parts,
rounded to the nearest dollar, are listed below:
91 78 93 57 75 52 99 80 97 62
71 69 72 89 66 75 79 75 72 76
104 74 62 68 97 105 77 65 80 109
85 97 88 68 83 68 71 69 67 74
62 82 98 101 79 105 79 69 62 73
Example: Summarizing Quantitative Data
• The most common numerical descriptive statistic is the average (or mean or expected value)
– It is a measure of the central tendency (central location of the data for a variable)
– Springfield Auto’s average cost of parts, based on the 50 tune‐ups studied, is $79.
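The average quoted above can be checked directly from the 50 invoice values. The sketch below also builds a frequency distribution; the class width of 10 dollars (50-59, 60-69, ...) is an assumption made for illustration, not something the slide specifies.

```python
from collections import Counter

# Parts costs (dollars) from the 50 tune-up invoices listed in the example.
costs = [
    91, 78, 93, 57, 75, 52, 99, 80, 97, 62,
    71, 69, 72, 89, 66, 75, 79, 75, 72, 76,
    104, 74, 62, 68, 97, 105, 77, 65, 80, 109,
    85, 97, 88, 68, 83, 68, 71, 69, 67, 74,
    62, 82, 98, 101, 79, 105, 79, 69, 62, 73,
]

mean_cost = sum(costs) / len(costs)

# Assumption: classes of width 10, keyed by their lower limit (50, 60, ...).
class_width = 10
frequency = Counter((c // class_width) * class_width for c in costs)
relative_frequency = {lower: n / len(costs) for lower, n in frequency.items()}

print(f"mean cost of parts = {mean_cost:.2f}")
for lower in sorted(frequency):
    upper = lower + class_width - 1
    print(f"{lower}-{upper}: {frequency[lower]:2d}  ({relative_frequency[lower]:.0%})")
```

A different class width would give a different table, so the choice should follow the level of detail the reader needs.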
Cross Tabulation
• Crosstabulation is a method for summarizing the
data for two variables.
• Example: The number of homes sold for each style and price for the past two years is shown below:
                                Home Style
Price Range           Colonial   Log   Split   A-Frame   Total
Less than $200,000        18       6     19       12       55
$200,000 or more          12      14     16        3       45
Total                     30      20     35       15      100
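A crosstabulation is easier to read once counts are converted to row or column percentages. The following is a minimal sketch using the home sales counts from the table; the variable names are illustrative only.

```python
styles = ["Colonial", "Log", "Split", "A-Frame"]
counts = {
    "Less than $200,000": [18, 6, 19, 12],
    "$200,000 or more": [12, 14, 16, 3],
}

# Marginal totals of the crosstabulation.
row_totals = {price: sum(row) for price, row in counts.items()}
col_totals = [sum(col) for col in zip(*counts.values())]
grand_total = sum(row_totals.values())

# Row percentages: share of each home style within a price range.
row_percent = {
    price: [100 * c / row_totals[price] for c in row]
    for price, row in counts.items()
}

for price, row in row_percent.items():
    cells = ", ".join(f"{s}: {p:.1f}%" for s, p in zip(styles, row))
    print(f"{price} -> {cells}")
print("column totals:", dict(zip(styles, col_totals)), "grand total:", grand_total)
```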
Simpson’s Paradox
• We must be careful in drawing conclusions about the relationship between the two variables in the aggregated crosstabulation.
• In some cases the conclusions based upon an aggregated crosstabulation can be completely reversed if we look at the unaggregated data.
• The reversal of conclusions based on aggregate and unaggregated data is called Simpson’s paradox.
• Note: Edward H. Simpson first described this phenomenon in a technical paper in 1951, but the statisticians Karl Pearson et al., in 1899, and Udny Yule, in 1903, had mentioned similar effects earlier. The name Simpson's paradox was introduced by Colin R. Blyth in 1972.
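The reversal can be seen in a small numerical sketch. The counts below are illustrative values chosen to show the effect; they are not data from this course, and the option and group names are hypothetical.

```python
# Illustrative (successes, trials) for two options across two groups.
data = {
    "A": {"easy": (81, 87), "hard": (192, 263)},
    "B": {"easy": (234, 270), "hard": (55, 80)},
}

# Success rate within each group (unaggregated view).
per_group = {
    option: {group: s / n for group, (s, n) in groups.items()}
    for option, groups in data.items()
}

# Success rate after pooling the groups (aggregated view).
overall = {
    option: sum(s for s, _ in groups.values()) / sum(n for _, n in groups.values())
    for option, groups in data.items()
}

for group in ("easy", "hard"):
    print(group, {opt: round(per_group[opt][group], 3) for opt in data})
print("overall", {opt: round(rate, 3) for opt, rate in overall.items()})
```

Option A has the higher rate within each group, yet option B has the higher pooled rate, because the two options were applied to the groups in very different proportions.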
Simpson’s Paradox: Example
Example: https://www.youtube.com/watch?v=ebEkn-BiW5k
DECISION SCIENCE ‐1
DATA REPRESENTATION:
DESCRIPTIVE STATISTICS ‐ NUMERICAL
Term 1: 2019
Descriptive Statistics: Numerical Measures
• Measures of Location
– Mean
– Median
– Mode
– Percentiles
– Quartiles
• Measures of Variability
– Range
– Interquartile range
– Variance
– Standard Deviation
– Coefficient of Variation
• Population: N data points. Sample: n data points.
• Population Parameter: the measure is computed for data from the population.
• Sample Statistic: the measure is computed using data from a sample.
• Point Estimator: A sample statistic is a point estimator of the corresponding population parameter.
Example: Apartment Rent in Springfield
• Seventy (70) efficiency apartments were randomly sampled in a small town.
The monthly rent prices for these apartments are listed below:
445 615 430 590 435 600 460 600 440 615
440 440 440 525 425 445 575 445 450 450
465 450 525 450 450 460 435 460 465 480
450 470 490 472 475 475 500 480 570 465
600 485 580 470 490 500 549 500 500 480
570 515 450 445 525 535 475 550 480 510
510 575 490 435 600 435 445 435 430 440
• What is the average rent?
• What is the standard deviation?
Mean
• Mean provides a measure of central location.
• The mean of a data set is the average of all the data values.
• The sample mean, $\bar{x}$, is the point estimator of the population mean, $\mu$.
– Population: $\mu = \frac{\sum x_i}{N}$
– Sample: $\bar{x} = \frac{\sum x_i}{n}$
• Trimmed Mean: Obtained by deleting a percentage of the smallest and largest values and then computing the mean of the remaining values.
Example: Apartment Rent in Springfield
• Seventy efficiency apartments were randomly sampled in a small town. The
monthly rent prices for these apartments are listed below:
445 615 430 590 435 600 460 600 440 615
440 440 440 525 425 445 575 445 450 450
465 450 525 450 450 460 435 460 465 480
450 470 490 472 475 475 500 480 570 465
600 485 580 470 490 500 549 500 500 480
570 515 450 445 525 535 475 550 480 510
510 575 490 435 600 435 445 435 430 440
• Sample mean: $\bar{x} = \frac{\sum x_i}{n} = \frac{34356}{70} = 490.80$
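The sample mean and a trimmed mean for the 70 rents can be sketched as follows. The helper `trimmed_mean` is hypothetical, and the choice to round the number of trimmed values down is an assumption; textbooks and software differ on this detail.

```python
rents = [
    445, 615, 430, 590, 435, 600, 460, 600, 440, 615,
    440, 440, 440, 525, 425, 445, 575, 445, 450, 450,
    465, 450, 525, 450, 450, 460, 435, 460, 465, 480,
    450, 470, 490, 472, 475, 475, 500, 480, 570, 465,
    600, 485, 580, 470, 490, 500, 549, 500, 500, 480,
    570, 515, 450, 445, 525, 535, 475, 550, 480, 510,
    510, 575, 490, 435, 600, 435, 445, 435, 430, 440,
]

def trimmed_mean(data, percent):
    """Drop `percent` % of the values from each end, then average the rest."""
    ordered = sorted(data)
    k = int(len(ordered) * percent / 100)  # assumption: round down
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

sample_mean = sum(rents) / len(rents)
print(f"sum = {sum(rents)}, n = {len(rents)}, mean = {sample_mean:.2f}")
print(f"5% trimmed mean = {trimmed_mean(rents, 5):.2f}")
```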
Median
• The median of a data set is the value in the middle when the data
items are arranged in ascending order.
• When the data has extreme values, the median can be a useful
measure of central location (property value, annual income).
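– Odd number of observations (n = 7): 26, 18, 27, 12, 14, 27, 19. Re-arranged in ascending order: 12, 14, 18, 19, 26, 27, 27. The median is the middle (4th) value, 19.
– Even number of observations (n = 8): 26, 18, 27, 12, 14, 27, 19, 30. Re-arranged in ascending order: 12, 14, 18, 19, 26, 27, 27, 30. The median is the average of the two middle values, (19 + 26)/2 = 22.5.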
Mode and Percentiles
• Mode
– The mode of a data set is the value that occurs with greatest frequency.
– The greatest frequency can occur for two or more different values (bimodal, multimodal)
• Percentile
– A percentile provides information about how the data is spread
– The pth percentile of a data set is the value such that at least p percent of the items take on this value or
less and at least (100‐p) percent of the items take on this value or more.
– Arrange the data in ascending order and compute the index i = (p/100)·n. If i is not an integer, round up; the pth percentile is the value in that position.
If i is an integer, the pth percentile is the average of the values in positions i and i+1.
• Quartiles
– First quartile – 25th percentile, Second quartile – 50th percentile, Third quartile – 75th percentile
Example: Apartment Rent in Springfield
• Seventy (70) efficiency apartments were randomly sampled in a small town.
The monthly rent prices for these apartments are listed below:
445 615 430 590 435 600 460 600 440 615
440 440 440 525 425 445 575 445 450 450
465 450 525 450 450 460 435 460 465 480
450 470 490 472 475 475 500 480 570 465
600 485 580 470 490 500 549 500 500 480
570 515 450 445 525 535 475 550 480 510
510 575 490 435 600 435 445 435 430 440
• What is the median? What is the mode?
• What is the 80th percentile?
Example: Apartment Rent in Springfield
• The same seventy (70) monthly rent prices, arranged in ascending order:
425 430 430 435 435 435 435 435 440 440
440 440 440 445 445 445 445 445 450 450
450 450 450 450 450 460 460 460 465 465
465 470 470 472 475 475 475 480 480 480
480 485 490 490 490 500 500 500 500 510
510 515 525 525 525 535 549 550 570 570
575 575 580 590 600 600 600 600 615 615
• What is the median? What is the mode?
• What is the 80th percentile?
Example: Apartment Rent in Springfield
• Median and Mode:
– Averaging the 35th and 36th data points, Median = 475.
– Mode = 450 (occurs seven times)
425 430 430 435 435 435 435 435 440 440
440 440 440 445 445 445 445 445 450 450
450 450 450 450 450 460 460 460 465 465
465 470 470 472 475 475 475 480 480 480
480 485 490 490 490 500 500 500 500 510
510 515 525 525 525 535 549 550 570 570
575 575 580 590 600 600 600 600 615 615
• 80th Percentile:
– Compute i = 80*70/100 = 56
– Compute the average of the 56th and 57th values (after arranging in ascending order): (535+549)/2 = 542
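The index rule used in this example can be sketched in code. The function name `percentile` is illustrative, and the sketch assumes p is a whole number between 1 and 99; statistical packages often use other interpolation rules and may return slightly different values.

```python
from collections import Counter

rents = [
    445, 615, 430, 590, 435, 600, 460, 600, 440, 615,
    440, 440, 440, 525, 425, 445, 575, 445, 450, 450,
    465, 450, 525, 450, 450, 460, 435, 460, 465, 480,
    450, 470, 490, 472, 475, 475, 500, 480, 570, 465,
    600, 485, 580, 470, 490, 500, 549, 500, 500, 480,
    570, 515, 450, 445, 525, 535, 475, 550, 480, 510,
    510, 575, 490, 435, 600, 435, 445, 435, 430, 440,
]

def percentile(data, p):
    """pth percentile using the index rule i = p * n / 100 (p an integer, 1..99)."""
    ordered = sorted(data)
    whole, remainder = divmod(p * len(ordered), 100)
    if remainder:  # i is not an integer: round up and take that position
        return ordered[whole]  # position whole + 1, zero-based index whole
    # i is an integer: average the values in positions i and i + 1
    return (ordered[whole - 1] + ordered[whole]) / 2

median = percentile(rents, 50)
mode, mode_count = Counter(rents).most_common(1)[0]
p80 = percentile(rents, 80)
print(f"median = {median}, mode = {mode} (occurs {mode_count} times), 80th percentile = {p80}")
```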
Measures of Dispersion (Variability)
• Range: Difference between the largest and smallest value in a data set.
• Interquartile Range: Difference between the third quartile and the first quartile (it is the
range for the middle 50% of the data set).
• Variance: Based on the difference between the value of each observation and the mean.
• Standard Deviation: Positive square root of the variance.
• Coefficient of Variation (CV): Ratio of the standard deviation to the mean.
– Population variance: $\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}$
– Sample variance: $s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$
Example: Apartment Rent in Springfield
• Variance: $s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} = 2996.16$
• Standard Deviation: $s = \sqrt{2996.16} = 54.74$
• Coefficient of Variation: $CV = \frac{54.74}{490.80} \times 100 = 11.15\%$
425 430 430 435 435 435 435 435 440 440
440 440 440 445 445 445 445 445 450 450
450 450 450 450 450 460 460 460 465 465
465 470 470 472 475 475 475 480 480 480
480 485 490 490 490 500 500 500 500 510
510 515 525 525 525 535 549 550 570 570
575 575 580 590 600 600 600 600 615 615
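These measures of variability can be reproduced with Python's standard `statistics` module, whose `variance` and `stdev` functions use the sample (n - 1) denominator. The interquartile range is left out here because it depends on the percentile rule chosen.

```python
import statistics

rents = [
    425, 430, 430, 435, 435, 435, 435, 435, 440, 440,
    440, 440, 440, 445, 445, 445, 445, 445, 450, 450,
    450, 450, 450, 450, 450, 460, 460, 460, 465, 465,
    465, 470, 470, 472, 475, 475, 475, 480, 480, 480,
    480, 485, 490, 490, 490, 500, 500, 500, 500, 510,
    510, 515, 525, 525, 525, 535, 549, 550, 570, 570,
    575, 575, 580, 590, 600, 600, 600, 600, 615, 615,
]

data_range = max(rents) - min(rents)
mean = statistics.mean(rents)
sample_variance = statistics.variance(rents)  # divides by n - 1
sample_sd = statistics.stdev(rents)           # square root of the sample variance
cv_percent = sample_sd / mean * 100

print(f"range = {data_range}")
print(f"s^2 = {sample_variance:.2f}, s = {sample_sd:.2f}, CV = {cv_percent:.2f}%")
```

For data that make up an entire population, `statistics.pvariance` and `statistics.pstdev` apply the N denominator instead.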
Descriptive Statistics: Numerical Measures
• What other statistics are useful to measure?
Distribution Shape
• Skewness: An important measure of distribution shape
$\text{Skewness} = \frac{n}{(n-1)(n-2)} \sum \left( \frac{x_i - \bar{x}}{s} \right)^3$
[Figure: four relative frequency histograms illustrating distribution shape]
– Symmetric: mean and median are equal
– Skewed left: mean is usually less than median
– Skewed right (two panels): mean is usually more than median
Example: Apartment Rent in Springfield
• Seventy (70) efficiency apartments were randomly sampled in a small town.
The monthly rent prices for these apartments are listed below:
445 615 430 590 435 600 460 600 440 615
440 440 440 525 425 445 575 445 450 450
465 450 525 450 450 460 435 460 465 480
450 470 490 472 475 475 500 480 570 465
600 485 580 470 490 500 549 500 500 480
570 515 450 445 525 535 475 550 480 510
510 575 490 435 600 435 445 435 430 440
• What is the skewness?
– What is the mean? What is the median?
Example: Apartment Rent in Springfield
• Skewness $= \frac{n}{(n-1)(n-2)} \sum \left( \frac{x_i - \bar{x}}{s} \right)^3 = 0.92$
425 430 430 435 435 435 435 435 440 440
440 440 440 445 445 445 445 445 450 450
450 450 450 450 450 460 460 460 465 465
465 470 470 472 475 475 475 480 480 480
480 485 490 490 490 500 500 500 500 510
510 515 525 525 525 535 549 550 570 570
575 575 580 590 600 600 600 600 615 615
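The skewness value can be checked with a short sketch of the sample skewness formula. The function name `sample_skewness` is illustrative; the calculation uses the sample standard deviation s.

```python
import statistics

rents = [
    425, 430, 430, 435, 435, 435, 435, 435, 440, 440,
    440, 440, 440, 445, 445, 445, 445, 445, 450, 450,
    450, 450, 450, 450, 450, 460, 460, 460, 465, 465,
    465, 470, 470, 472, 475, 475, 475, 480, 480, 480,
    480, 485, 490, 490, 490, 500, 500, 500, 500, 510,
    510, 515, 525, 525, 525, 535, 549, 550, 570, 570,
    575, 575, 580, 590, 600, 600, 600, 600, 615, 615,
]

def sample_skewness(data):
    """n / ((n-1)(n-2)) times the sum of cubed standardized deviations."""
    n = len(data)
    mean = statistics.mean(data)
    s = statistics.stdev(data)
    return n / ((n - 1) * (n - 2)) * sum(((x - mean) / s) ** 3 for x in data)

skew = sample_skewness(rents)
print(f"skewness = {skew:.2f}")
print(f"mean = {statistics.mean(rents):.2f}, median = {statistics.median(rents)}")
```

A positive value goes with a longer right tail, which is consistent with the mean of these rents exceeding the median.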
z‐Scores
• The z-score is often called the standardized value. It denotes the number of standard deviations a data value, $x_i$, is from the mean: $z_i = \frac{x_i - \bar{x}}{s}$
• z-score of smallest value, 425: $z = \frac{425 - 490.80}{54.74} = -1.202$
• z-scores of the 70 rents (in ascending order):
-1.202 -1.111 -1.111 -1.019 -1.019 -1.019 -1.019 -1.019 -0.928 -0.928
-0.928 -0.928 -0.928 -0.837 -0.837 -0.837 -0.837 -0.837 -0.745 -0.745
-0.745 -0.745 -0.745 -0.745 -0.745 -0.563 -0.563 -0.563 -0.471 -0.471
-0.471 -0.380 -0.380 -0.343 -0.289 -0.289 -0.289 -0.197 -0.197 -0.197
-0.197 -0.106 -0.015 -0.015 -0.015 0.168 0.168 0.168 0.168 0.351
0.351 0.442 0.625 0.625 0.625 0.807 1.063 1.081 1.447 1.447
1.538 1.538 1.630 1.812 1.995 1.995 1.995 1.995 2.269 2.269
Chebyshev’s Theorem
• At least 75% of the data values must be within 2 standard deviations of the mean.
• At least 89% of the data values must be within 3 standard deviations of the mean.
• At least 94% of the data values must be within 4 standard deviations of the mean.
Markov’s Inequality
• If $X$ is a non-negative random variable with finite mean, $E[X]$, and variance, $\mathrm{Var}[X]$, and $c > 0$, then: $P(X \ge c) \le \frac{E[X]}{c}$
• Markov’s Inequality can be used to prove the Chebyshev’s Theorem.
Example: Apartment Rent in Springfield
• If a random variable has mean $\mu$ and variance $\sigma^2$, then, for $c > 1$: $P(\mu - c\sigma \le X \le \mu + c\sigma) \ge 1 - \frac{1}{c^2}$
445 615 430 590 435 600 460 600 440 615
440 440 440 525 425 445 575 445 450 450
465 450 525 450 450 460 435 460 465 480
450 470 490 472 475 475 500 480 570 465
600 485 580 470 490 500 549 500 500 480
570 515 450 445 525 535 475 550 480 510
510 575 490 435 600 435 445 435 430 440
• What if c = 1.5?
– Then… at least ____% of the data values must be within 1.5 standard deviations of the mean.
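One way to explore this question is to compare the Chebyshev lower bound with the share of rents that actually fall within c standard deviations of the mean. The sketch below standardizes each rent to a z-score first; it uses the sample standard deviation, and the variable names are illustrative.

```python
import statistics

rents = [
    445, 615, 430, 590, 435, 600, 460, 600, 440, 615,
    440, 440, 440, 525, 425, 445, 575, 445, 450, 450,
    465, 450, 525, 450, 450, 460, 435, 460, 465, 480,
    450, 470, 490, 472, 475, 475, 500, 480, 570, 465,
    600, 485, 580, 470, 490, 500, 549, 500, 500, 480,
    570, 515, 450, 445, 525, 535, 475, 550, 480, 510,
    510, 575, 490, 435, 600, 435, 445, 435, 430, 440,
]

mean = statistics.mean(rents)
s = statistics.stdev(rents)
z_scores = [(x - mean) / s for x in rents]

c = 1.5
chebyshev_bound = 1 - 1 / c**2                   # guaranteed minimum share
within = sum(1 for z in z_scores if abs(z) <= c)  # rents with |z| <= c
observed_share = within / len(rents)

print(f"Chebyshev lower bound for c = {c}: {chebyshev_bound:.1%}")
print(f"Observed: {within} of {len(rents)} rents ({observed_share:.1%})")
```

The observed share is well above the bound, which is expected: the theorem holds for any distribution and is therefore conservative.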
Example: Apartment Rent in Springfield
425 430 430 435 435 435 435 435 440 440
440 440 440 445 445 445 445 445 450 450
450 450 450 450 450 460 460 460 465 465
465 470 470 472 475 475 475 480 480 480
480 485 490 490 490 500 500 500 500 510
510 515 525 525 525 535 549 550 570 570
575 575 580 590 600 600 600 600 615 615
Application to the Normal Distribution
• 68.27% of the values of a normal random variable are within +/- 1 standard deviation of its mean.
• 95.45% of the values of a normal random variable are within +/- 2 standard deviations of its mean.
• 99.73% of the values of a normal random variable are within +/- 3 standard deviations of its mean.
• More of this to follow…
[Figure: normal curve with the intervals of +/- 1, +/- 2 and +/- 3 standard deviations around the mean marked]
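These percentages follow from the standard normal distribution: for a normal random variable, the probability of falling within k standard deviations of the mean equals erf(k / sqrt(2)). A minimal check using only the standard library is sketched below; figures quoted from rounded z-tables can differ from these in the last digit.

```python
import math

def normal_share_within(k):
    """P(|Z| <= k) for a standard normal Z, via the error function."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"within +/- {k} standard deviation(s): {normal_share_within(k):.2%}")
```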
Example on Chebyshev Theorem and Markov Inequality
• A group of students in the agricultural college are examining the growth of corn. Data collection reveals that
for a two‐week old plant the height varies, but has a mean of 29.4 inches and a standard deviation of 2.1
inches.
– What can you say about the probability that a two‐week old corn plant will have a height of less than 35 inches?
– What does Chebyshev’s theorem say about the height of a two-week old corn plant for c = 2 and c = 3?
• Solution
– From Markov Inequality, we have: P(X < 35) = 1 ‐ P(X ≥ 35), but P(X ≥ 35) ≤ 29.4/35, i.e. P(X < 35) ≥ (35‐29.4)/35 = 0.16
– From Chebyshev’s Theorem: P(25.2 ≤ X ≤ 33.6) ≥ 0.75; P(23.1 ≤ X ≤ 35.7) ≥ 0.89.
• Take away: In this example, Chebyshev’s Theorem provides better bounds on probability. So, while these
inequalities are useful, the degree of usefulness depends on the data (since they are just bounds).
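The arithmetic in this solution can be retraced with a short sketch. The helper names `markov_upper_bound` and `chebyshev_interval` are hypothetical; the Markov bound relies on height being a non-negative quantity.

```python
def markov_upper_bound(mean, c):
    """Upper bound on P(X >= c) for a non-negative variable with the given mean."""
    return min(1.0, mean / c)

def chebyshev_interval(mean, sd, c):
    """Interval mean +/- c*sd and the minimum probability of falling inside it."""
    return mean - c * sd, mean + c * sd, 1 - 1 / c**2

mean_height, sd_height = 29.4, 2.1

# Markov: P(X < 35) = 1 - P(X >= 35) >= 1 - mean / 35
p_below_35 = 1 - markov_upper_bound(mean_height, 35)
print(f"P(X < 35) >= {p_below_35:.2f}")

for c in (2, 3):
    low, high, prob = chebyshev_interval(mean_height, sd_height, c)
    print(f"c = {c}: P({low:.1f} <= X <= {high:.1f}) >= {prob:.2f}")
```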
Descriptive Statistics: Numerical Measures
• But… that is a lot of statistics that could be useful.
• Is there a simple way to show the key statistics?
Five Number Summaries and Box Plot
• Five Number Summary
– Smallest Value (425)
– First Quartile (445)
– Median (475)
– Third Quartile (525)
– Largest Value (615)
• Sometimes the mean is also
listed (490.80)
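The five-number summary for the apartment rents can be sketched as below, using the same index rule for percentiles as earlier in the deck. The function name `percentile` is illustrative, and other quartile conventions (as used by some software) may give slightly different quartiles.

```python
rents = [
    445, 615, 430, 590, 435, 600, 460, 600, 440, 615,
    440, 440, 440, 525, 425, 445, 575, 445, 450, 450,
    465, 450, 525, 450, 450, 460, 435, 460, 465, 480,
    450, 470, 490, 472, 475, 475, 500, 480, 570, 465,
    600, 485, 580, 470, 490, 500, 549, 500, 500, 480,
    570, 515, 450, 445, 525, 535, 475, 550, 480, 510,
    510, 575, 490, 435, 600, 435, 445, 435, 430, 440,
]

def percentile(data, p):
    """pth percentile using the index rule i = p * n / 100 (p an integer, 1..99)."""
    ordered = sorted(data)
    whole, remainder = divmod(p * len(ordered), 100)
    if remainder:
        return ordered[whole]  # i rounded up, as a zero-based index
    return (ordered[whole - 1] + ordered[whole]) / 2

five_number_summary = (
    min(rents),
    percentile(rents, 25),
    percentile(rents, 50),
    percentile(rents, 75),
    max(rents),
)
iqr = five_number_summary[3] - five_number_summary[1]

print("min, Q1, median, Q3, max =", five_number_summary)
print("interquartile range =", iqr)
```

A box plot draws the box from Q1 to Q3 with a line at the median, so these five values are all that is needed to sketch one.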
Example: Covariance and Correlation
• Homer Simpson is interested in the relationship between driving distance (yards) and 18-hole score.
Driving Distance (yards)   18-Hole Score
277.6                      69
259.5                      71
269.1                      70
267.0                      70
255.6                      71
272.9                      69
Covariance and Correlation Coefficient
• Relationship between two variables is measured using Covariance and Correlation
Coefficient.
• Covariance is a measure of linear relationship between two variables
– Positive value indicates a positive relationship
– Negative value indicates a negative relationship
• Population: $\sigma_{xy} = \frac{\sum (x_i - \mu_x)(y_i - \mu_y)}{N}$
• Sample: $s_{xy} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$
Covariance and Correlation Coefficient
• Relationship between two variables is measured using Covariance and Correlation
Coefficient.
• Correlation is a measure of linear association, not necessarily causation.
– Values near ‐1 indicate a strong negative linear relationship
– Values near +1 indicate a strong positive linear relationship
– Values near 0 indicate a weak relationship
• Population: $\rho_{xy} = \frac{\sigma_{xy}}{\sigma_x \sigma_y}$
• Sample: $r_{xy} = \frac{s_{xy}}{s_x s_y}$
Example: Covariance and Correlation
• Homer Simpson is interested in the relationship between driving distance (yards) and 18-hole score.
x       y     x − x̄    y − ȳ    (x − x̄)(y − ȳ)
277.6   69    10.65    -1.00    -10.65
259.5   71    -7.45     1.00     -7.45
269.1   70     2.15     0.00      0.00
267.0   70     0.05     0.00      0.00
255.6   71   -11.35     1.00    -11.35
272.9   69     5.95    -1.00     -5.95
Total                           -35.40
• $\bar{x} = 266.95$ (about 267); $\bar{y} = 70$
• $s_{xy} = \frac{-35.40}{6 - 1} = -7.08$; $s_x = 8.22$; $s_y = 0.894$
• $r_{xy} = \frac{s_{xy}}{s_x s_y} = \frac{-7.08}{8.22 \times 0.894} = -0.963$
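The sample covariance and correlation for the golf data can be recomputed as a check on the signs and magnitudes. The sketch below follows the sample formulas with the n - 1 denominator; the variable names are illustrative.

```python
import statistics

distance = [277.6, 259.5, 269.1, 267.0, 255.6, 272.9]  # driving distance (yards)
score = [69, 71, 70, 70, 71, 69]                        # 18-hole score

n = len(distance)
x_bar = statistics.mean(distance)
y_bar = statistics.mean(score)

# Sample covariance: sum of products of deviations, divided by n - 1.
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(distance, score)) / (n - 1)

s_x = statistics.stdev(distance)
s_y = statistics.stdev(score)
r_xy = s_xy / (s_x * s_y)

print(f"x_bar = {x_bar:.2f}, y_bar = {y_bar}")
print(f"s_xy = {s_xy:.2f}, s_x = {s_x:.2f}, s_y = {s_y:.3f}, r_xy = {r_xy:.3f}")
```

The negative sign indicates that longer drives go with lower (better) scores in this small sample; it does not by itself establish causation.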
Questions
Proof of Markov Inequality
• If $X$ is a non-negative random variable with finite mean, $E[X]$, and variance, $\mathrm{Var}[X]$, and $c > 0$, then: $P(X \ge c) \le \frac{E[X]}{c}$
• Proof:
• Consider the function $f(x) = \mathbb{I}\{x \ge c\}$, such that $f(x) = 0$ for $x < c$ and $f(x) = 1$ for $x \ge c$ (blue line in the figure).
• Then define $g(x) = c \cdot f(x)$. Therefore, $g(x) = 0$ for $x < c$ and $g(x) = c$ for $x \ge c$ (red dotted line in the figure).
• Then for $0 \le x < c$, $g(x) = 0$, and so $g(x) \le x$.
• For $x \ge c$, $g(x) = c$, and so $g(x) \le x$.
• This implies $g(x) = c \cdot \mathbb{I}\{x \ge c\} \le x$ for all $x \ge 0$.
• Taking expectations, we have $c \cdot P(X \ge c) \le E[X]$.
• This implies $P(X \ge c) \le \frac{E[X]}{c}$.
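Proof of Chebyshev’s Theorem
• If a random variable has mean $\mu$ and variance $\sigma^2$, then, for $c > 1$: $P(\mu - c\sigma \le X \le \mu + c\sigma) \ge 1 - \frac{1}{c^2}$
• Proof:
• From Markov’s Inequality: $P(X \ge c) \le \frac{E[X]}{c}$ for any non-negative random variable $X$ where $E[X] < \infty$, $\mathrm{Var}[X] < \infty$, and any $c > 0$.
• So it holds for the non-negative random variable $(X - \mu)^2$ with the threshold $c^2\sigma^2$. This implies $P\left((X - \mu)^2 \ge c^2\sigma^2\right) \le \frac{E[(X - \mu)^2]}{c^2\sigma^2}$.
• Further, $P(|X - \mu| \ge c\sigma) = P\left((X - \mu)^2 \ge c^2\sigma^2\right)$.
• So, $P(|X - \mu| \ge c\sigma) \le \frac{E[(X - \mu)^2]}{c^2\sigma^2}$ (applying Markov’s Inequality).
• Then, $P(|X - \mu| \ge c\sigma) \le \frac{\sigma^2}{c^2\sigma^2} = \frac{1}{c^2}$ (from the definition of $\mathrm{Var}[X]$).
• Then, $P(|X - \mu| < c\sigma) \ge 1 - \frac{1}{c^2}$,
• i.e. $P(\mu - c\sigma \le X \le \mu + c\sigma) \ge 1 - \frac{1}{c^2}$.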