Topic 01 - Descriptive Statistics PDF

Decision Science ‐1, PGP‐Term1, Section E 2019
DECISION SCIENCE ‐1
DATA REPRESENTATION:
DATA AND STATISTICS
Term 1: 2019
June‐August, 2019 Decision Science ‐1, PGP Term 1, Section E 1
Statistics and Role in Decision Making
• The term statistics can refer to numerical facts such as averages, medians, and
percentages that help us understand a variety of business and economic
situations.
• Statistics can also refer to the art and science of collecting, analyzing,
presenting, and interpreting data.
• Used by people working in accounting, finance, economics, marketing,
production, information systems, etc.
Prof. Ananth Krishnamurthy
Elements, Variables and Observations
• Elements:
– Entities for which data is collected
• Variable:
– Characteristic of interest for the elements
• Observation:
– Set of measurements obtained for a specific element
• Example (each company is an element, each column is a variable, each row is an observation):
Company                     Exchange   Ticker Symbol   Market Capital ($ millions)   Gross Profit Margin (%)
DeWolfe Companies           AMEX       DWL             36.4                          36.7
MarineMax Inc               NYSE       HZO             111.5                         23.8
Environmental Tectonics     AMEX       ETC             51.1                          35.9
Barnwell Industries, Inc.   AMEX       BRN             27.3                          73.4
SEMCO Energy Inc.           NYSE       SEN             193.4                         23.6
Data Categorization
• Categorization of Data (what do these mean?)
– Qualitative data: numeric or non-numeric; the scale is nominal or ordinal in either case
– Quantitative data: always numeric; the scale is interval or ratio
Scale of Measurement
• Nominal Scale
– Data are labels or names used to identify an attribute of the element
– Non‐numeric (AMEX, NYSE,..) or numeric code (AMEX = 1, NYSE = 2, …)
• Ordinal Scale
– The data have the properties of nominal data and the order or rank of the data is
meaningful.
– Non‐numeric (Excellent, Good,..) or numeric code (1 for Excellent, 2 for Good, …)
Scale of Measurement
• Interval Scale
– The data have the properties of ordinal data, and the interval between observations is expressed
in terms of a fixed unit of measure. Interval data are always numeric.
– Time of day: Between 10:00 am and 11:30 am; Between 10:30 pm and 11:30 pm
– Temperature: Between 10 °C and 15 °C; Between -10 °C and -15 °C
• Ratio Scale
– The data have all the properties of interval data and the ratio of two values is
meaningful.
– Money: Rs 2,00,000 is twice as large as Rs 1,00,000
Data Categorization
• Cross-sectional
– Data are collected at the same or approximately the same point in time.
– Example: Permits issued in November in each of the districts in Karnataka.
• Time series
– Time series data are collected over several time periods.
– Example: Permits issued each month in Karnataka for the last 36 months.
Statistical Inference
• Population: The set of all elements of interest for a specific study
• Sample: A subset of the population
• Statistical Inference: The process of using data obtained from a sample to make estimates and test
hypotheses about the characteristics of a population
• Census: Collecting data for the entire population
• Sample Survey: Collecting data for a sample
• Example (tune-up parts cost):
– The population consists of all tune-ups. The average cost of parts is unknown.
– A sample of 50 engine tune-ups is examined.
– The sample data provide a sample average parts cost of $120 per tune-up.
– The sample average is used to estimate the population average.
DESCRIPTIVE STATISTICS ‐ DISPLAYS
Term 1: 2019
Descriptive Statistics
• Most of the statistical information in company reports, and other publications
consists of data that are summarized and presented in a form that is easy to
understand.
• Such summaries of data, which may be tabular, graphical, or numerical, are
referred to as descriptive statistics.
Summarizing Categorical and Quantitative Data
• We will review a few techniques to summarize data:
Categorical Data                   Quantitative Data
Frequency Distribution             Frequency Distribution
Relative Frequency Distribution    Relative Frequency Distribution
Percent Frequency Distribution     Percent Frequency Distribution
Bar Chart                          Dot Plot
Pie Chart                          Histogram
                                   Cumulative Distribution
                                   Stem and Leaf Diagram
Example: Summarizing Categorical Data
• Customers coming to Moe’s Tavern in Springfield were
asked to rate the quality of their service as being
excellent, above average, average, below average, or
poor.
• The ratings provided by a sample of 20 customers are:
Below Average Average Above Average

Above Average Above Average Above Average
Above Average Below Average Below Average
Average Poor Poor
Above Average Excellent Above Average
Average Above Average Average
Above Average Average
Frequency, Bar Chart and Pareto Diagram
• Frequency, Relative Frequency, Percentage Frequency
– When the bars are arranged in descending order of height from left to right (with the
most frequently occurring cause appearing first) the bar chart is called a Pareto diagram.
Rating           Frequency   Relative Frequency   Percentage Frequency
Poor             2           0.10                 10
Below Average    3           0.15                 15
Average          5           0.25                 25
Above Average    9           0.45                 45
Excellent        1           0.05                 5
Total            20          1.00                 100
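As a minimal sketch in Python (the names `ratings`, `categories` and `frequency_table` are illustrative and not part of the slides), the frequency, relative frequency and percentage frequency of the 20 ratings could be tallied like this:

```python
from collections import Counter

# Ratings of the 20 sampled customers, in the order listed on the slide
ratings = [
    "Below Average", "Average", "Above Average",
    "Above Average", "Above Average", "Above Average",
    "Above Average", "Below Average", "Below Average",
    "Average", "Poor", "Poor",
    "Above Average", "Excellent", "Above Average",
    "Average", "Above Average", "Average",
    "Above Average", "Average",
]

# Categories of the ordinal scale, from worst to best
categories = ["Poor", "Below Average", "Average", "Above Average", "Excellent"]

counts = Counter(ratings)
n = len(ratings)


def frequency_table(counts, categories, n):
    """Rows of (category, frequency, relative frequency, percentage frequency)."""
    return [(c, counts[c], counts[c] / n, 100 * counts[c] / n) for c in categories]


table = frequency_table(counts, categories, n)
for category, freq, rel, pct in table:
    print(f"{category:<15}{freq:>4}{rel:>8.2f}{pct:>8.0f}")

# Pareto ordering: categories sorted by descending frequency
pareto_order = [category for category, _ in counts.most_common()]
print(pareto_order)
```

Sorting the categories by descending frequency gives the left-to-right bar order of a Pareto diagram.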
Example: Summarizing Quantitative Data
• The manager of Springfield Auto would like to have a
better understanding of the cost of parts used in the
engine tune‐ups performed in his shop. He examines
50 customer invoices for tune‐ups. The costs of parts,
rounded to the nearest dollar, are listed below:
91 78 93 57 75 52 99 80 97 62
71 69 72 89 66 75 79 75 72 76
104 74 62 68 97 105 77 65 80 109
85 97 88 68 83 68 71 69 67 74
62 82 98 101 79 105 79 69 62 73
Example: Summarizing Quantitative Data
• The most common numerical descriptive statistic is the average (or mean or expected value)
– It is a measure of the central tendency (central location of the data for a variable)
– Springfield Auto’s average cost of parts, based on the 50 tune-ups studied, is $79.
Part Cost ($)   Frequency   Percent Frequency   Cumulative Frequency   Cumulative Percent Frequency
50-59           2           4                   2                      4
60-69           13          26                  15                     30
70-79           16          32                  31                     62
80-89           7           14                  38                     76
90-99           7           14                  45                     90
100-109         5           10                  50                     100
Total           50          100
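The class counts for the tune-up parts costs can be reproduced with a short Python sketch. The class limits follow the table; the function name `binned_frequency` is an illustrative choice, not something defined in the slides.

```python
# Parts costs (in dollars) from the 50 tune-up invoices
costs = [
    91, 78, 93, 57, 75, 52, 99, 80, 97, 62,
    71, 69, 72, 89, 66, 75, 79, 75, 72, 76,
    104, 74, 62, 68, 97, 105, 77, 65, 80, 109,
    85, 97, 88, 68, 83, 68, 71, 69, 67, 74,
    62, 82, 98, 101, 79, 105, 79, 69, 62, 73,
]

# Inclusive class limits used in the frequency table
bins = [(50, 59), (60, 69), (70, 79), (80, 89), (90, 99), (100, 109)]


def binned_frequency(values, bins):
    """Rows of (class, frequency, percent, cumulative frequency, cumulative percent)."""
    n = len(values)
    rows = []
    cumulative = 0
    for low, high in bins:
        freq = sum(low <= v <= high for v in values)
        cumulative += freq
        rows.append((f"{low}-{high}", freq, 100 * freq / n,
                     cumulative, 100 * cumulative / n))
    return rows


rows = binned_frequency(costs, bins)
for label, freq, pct, cum, cum_pct in rows:
    print(f"{label:<9}{freq:>4}{pct:>6.0f}{cum:>5}{cum_pct:>6.0f}")

mean_cost = sum(costs) / len(costs)
print(f"Average parts cost: ${mean_cost:.2f}")
```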
Cross Tabulation
• Crosstabulation is a method for summarizing the
data for two variables.
• Example: The number of homes sold for each style
and price for the past two years is shown below:
                         Home Style
Price Range              Colonial   Log   Split   A-Frame   Total
Less than $200,000       18         6     19      12        55
$200,000 or more         12         14    16      3         45
Total                    30         20    35      15        100
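A crosstabulation is often easier to read once the counts are turned into row percentages. The sketch below assumes the home-sales counts are stored in a plain dictionary; the names `crosstab` and `row_percent` are illustrative only.

```python
styles = ["Colonial", "Log", "Split", "A-Frame"]

# Counts from the home-sales crosstabulation (rows: price range, columns: style)
crosstab = {
    "Less than $200,000": [18, 6, 19, 12],
    "$200,000 or more": [12, 14, 16, 3],
}

# Marginal totals
row_totals = {price: sum(counts) for price, counts in crosstab.items()}
column_totals = [sum(column) for column in zip(*crosstab.values())]
grand_total = sum(row_totals.values())

# Row percentages: share of each home style within a price range
row_percent = {
    price: [round(100 * count / row_totals[price], 1) for count in counts]
    for price, counts in crosstab.items()
}

for price, percents in row_percent.items():
    print(price, dict(zip(styles, percents)))
```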
Simpson’s Paradox
• We must be careful in drawing conclusions about the
relationship between the two variables in the aggregated
crosstabulation.
• In some cases the conclusions based upon an aggregated
crosstabulation can be completely reversed if we look at
the unaggregated data.
• The reversal of conclusions based on aggregate and
unaggregated data is called Simpson’s paradox.
• Note: Edward H. Simpson first described this phenomenon in a technical
paper in 1951, but the statisticians Karl Pearson et al., in 1899, and
Udny Yule, in 1903, had mentioned similar effects earlier. The name
Simpson's paradox was introduced by Colin R. Blyth in 1972.
Simpson’s Paradox: Example
Example: https://www.youtube.com/watch?v=ebEkn-BiW5k
DESCRIPTIVE STATISTICS ‐ NUMERICAL
Term 1: 2019
Descriptive Statistics: Numerical Measures
• Measures of Location
– Mean
– Median
– Mode
– Percentiles
– Quartiles
• Measures of Variability
– Range
– Interquartile range
– Variance
– Standard Deviation
– Coefficient of Variation
• Population (N data points) versus Sample (n data points)
– Population Parameter: If the measure is computed for data from the population
– Sample Statistic: If the measure is computed using data from a sample
– Point Estimator: A sample statistic is a point estimator of the corresponding population parameter.
Example: Apartment Rent in Springfield
• Seventy (70) efficiency apartments were randomly sampled in a small town.
The monthly rent prices for these apartments are listed below:
445 615 430 590 435 600 460 600 440 615
440 440 440 525 425 445 575 445 450 450
465 450 525 450 450 460 435 460 465 480
450 470 490 472 475 475 500 480 570 465
600 485 580 470 490 500 549 500 500 480
570 515 450 445 525 535 475 550 480 510
510 575 490 435 600 435 445 435 430 440
• What is the average rent?
• What is the standard deviation?
Mean
• Mean provides a measure of central location.
• The mean of a data set is the average of all the data values.
• The sample mean, $\bar{x}$, is the point estimator of the population mean, $\mu$.
– Population: $\mu = \frac{\sum x_i}{N}$
– Sample: $\bar{x} = \frac{\sum x_i}{n}$
• Trimmed Mean: Obtained by deleting a percentage of the smallest and largest
values and then computing the mean of the remaining values.
• For the seventy apartment rents listed above:
$\bar{x} = \frac{\sum x_i}{n} = \frac{34356}{70} = 490.80$
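The mean of the rent data can be checked in a few lines of Python. The `trimmed_mean` helper is a sketch under one assumption that the slides leave open: the given percentage is removed from each end of the sorted data, with the count rounded down.

```python
# Monthly rents of the 70 sampled efficiency apartments
rents = [
    445, 615, 430, 590, 435, 600, 460, 600, 440, 615,
    440, 440, 440, 525, 425, 445, 575, 445, 450, 450,
    465, 450, 525, 450, 450, 460, 435, 460, 465, 480,
    450, 470, 490, 472, 475, 475, 500, 480, 570, 465,
    600, 485, 580, 470, 490, 500, 549, 500, 500, 480,
    570, 515, 450, 445, 525, 535, 475, 550, 480, 510,
    510, 575, 490, 435, 600, 435, 445, 435, 430, 440,
]


def mean(values):
    """Arithmetic mean: sum of the values divided by their count."""
    return sum(values) / len(values)


def trimmed_mean(values, percent):
    """Mean after dropping `percent` % of the values from EACH end (assumption)."""
    ordered = sorted(values)
    k = int(len(ordered) * percent / 100)  # number dropped per end, rounded down
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


print(f"Sum of rents: {sum(rents)}")
print(f"Mean rent: {mean(rents):.2f}")
print(f"5% trimmed mean: {trimmed_mean(rents, 5):.2f}")
```

Because the few high rents are dropped along with the few low ones, the trimmed mean comes out below the ordinary mean for this data set.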
Median
• The median of a data set is the value in the middle when the data
items are arranged in ascending order.
• When the data has extreme values, the median can be a useful
measure of central location (property value, annual income).
– Odd number of observations (n = 7): 26 18 27 12 14 27 19
Re-arranged in ascending order: 12 14 18 19 26 27 27
Median = 19 (the 4th value)
– Even number of observations (n = 8): 26 18 27 12 14 27 19 30
Re-arranged in ascending order: 12 14 18 19 26 27 27 30
Median = (19+26)/2 = 22.5 (average of the 4th and 5th values)
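The odd and even cases of the median can be written out directly. This is an illustrative sketch; Python's built-in `statistics.median` follows the same rule and is used here only as a cross-check.

```python
import statistics

odd_sample = [26, 18, 27, 12, 14, 27, 19]   # n = 7
even_sample = odd_sample + [30]             # n = 8


def median(values):
    """Middle value of the sorted data; average of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median(odd_sample))    # middle (4th) value
print(median(even_sample))   # average of the 4th and 5th values
```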
Mode and Percentiles
• Mode
– The mode of a data set is the value that occurs with greatest frequency.
– The greatest frequency can occur for two or more different values (bimodal, multimodal)
• Percentile
– A percentile provides information about how the data is spread
– The pth percentile of a data set is the value such that at least p percent of the items take on this value or
less and at least (100‐p) percent of the items take on this value or more.
– Arrange the data in ascending order and compute the index i = (p*n/100). If i is not an integer, round up;
the pth percentile is the value in that position. If i is an integer, the pth percentile is the average of the
values in positions i and i+1.
• Quartiles
– First quartile – 25th percentile, Second quartile – 50th percentile, Third quartile – 75th percentile
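The index rule for percentiles translates almost line by line into code. The sketch below applies it to the apartment rent data; the function name `percentile` is illustrative, and other software may use different interpolation rules and give slightly different answers.

```python
import statistics

# Monthly rents of the 70 sampled efficiency apartments
rents = [
    445, 615, 430, 590, 435, 600, 460, 600, 440, 615,
    440, 440, 440, 525, 425, 445, 575, 445, 450, 450,
    465, 450, 525, 450, 450, 460, 435, 460, 465, 480,
    450, 470, 490, 472, 475, 475, 500, 480, 570, 465,
    600, 485, 580, 470, 490, 500, 549, 500, 500, 480,
    570, 515, 450, 445, 525, 535, 475, 550, 480, 510,
    510, 575, 490, 435, 600, 435, 445, 435, 430, 440,
]


def percentile(values, p):
    """pth percentile using the index rule i = p*n/100 (0 < p < 100)."""
    if not 0 < p < 100:
        raise ValueError("p must be strictly between 0 and 100")
    ordered = sorted(values)
    position, remainder = divmod(p * len(ordered), 100)
    position = int(position)
    if remainder:
        # i is not an integer: round up. The 1-based position floor(i) + 1
        # is the 0-based index `position`.
        return ordered[position]
    # i is an integer: average the values in 1-based positions i and i + 1
    return (ordered[position - 1] + ordered[position]) / 2


print("Median:", percentile(rents, 50))
print("Mode:", statistics.mode(rents))
print("80th percentile:", percentile(rents, 80))
print("Quartiles:", percentile(rents, 25), percentile(rents, 50), percentile(rents, 75))
```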
In ascending order…
425 430 430 435 435 435 435 435 440 440
440 440 440 445 445 445 445 445 450 450
450 450 450 450 450 460 460 460 465 465
465 470 470 472 475 475 475 480 480 480
480 485 490 490 490 500 500 500 500 510
510 515 525 525 525 535 549 550 570 570
575 575 580 590 600 600 600 600 615 615
• What is the median? What is the mode?
• What is the 80th percentile?
• Median and Mode:
– Averaging the 35th and 36th data points, Median = 475.
– Mode = 450 (occurs seven times)
• 80th Percentile:
– Compute i = 80*70/100 = 56
– Compute the average of the 56th and 57th values (after arranging in ascending order): (535+549)/2 = 542
Measures of Dispersion (Variability)
• Range: Difference between the largest and smallest value in a data set.
• Interquartile Range: Difference between the third quartile and the first quartile (it is the
range for the middle 50% of the data set).
• Variance: Based on the difference between the value of each observation and the mean.
• Standard Deviation: Positive square root of the variance.
• Coefficient of Variation (CV): Ratio of the standard deviation to the mean.
– Population variance: $\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}$
– Sample variance: $s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$
• Variance: $s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} = 2996.16$
• Standard Deviation: $s = \sqrt{2996.16} = 54.74$
• Coefficient of Variation: $CV = \frac{54.74}{490.80} \times 100 = 11.15\%$
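These measures of variability for the rent data can be recomputed with Python's standard `statistics` module, whose `variance` and `stdev` functions use the sample (n - 1) denominator. The variable names are illustrative.

```python
import statistics

# Monthly rents of the 70 sampled efficiency apartments
rents = [
    445, 615, 430, 590, 435, 600, 460, 600, 440, 615,
    440, 440, 440, 525, 425, 445, 575, 445, 450, 450,
    465, 450, 525, 450, 450, 460, 435, 460, 465, 480,
    450, 470, 490, 472, 475, 475, 500, 480, 570, 465,
    600, 485, 580, 470, 490, 500, 549, 500, 500, 480,
    570, 515, 450, 445, 525, 535, 475, 550, 480, 510,
    510, 575, 490, 435, 600, 435, 445, 435, 430, 440,
]

data_range = max(rents) - min(rents)         # largest minus smallest value
variance = statistics.variance(rents)        # sample variance (divides by n - 1)
std_dev = statistics.stdev(rents)            # positive square root of the variance
cv = 100 * std_dev / statistics.mean(rents)  # coefficient of variation, in percent

print(f"Range: {data_range}")
print(f"Variance: {variance:.2f}")
print(f"Standard deviation: {std_dev:.2f}")
print(f"Coefficient of variation: {cv:.2f}%")
```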
• What other statistics are useful to measure?
Distribution Shape
• Skewness: An important measure of distribution shape
$Skewness = \frac{n}{(n-1)(n-2)} \sum \left( \frac{x_i - \bar{x}}{s} \right)^3$
• Symmetric (not skewed)
– Skewness = 0
– Mean and median are equal
• Moderately Skewed Left (-0.31)
– Skewness is negative
– Mean is usually less than median
Distribution Shape
• Moderately Skewed Right (+0.31)
– Skewness is positive
– Mean is usually more than median
• Highly Skewed Right
– Skewness is positive (often more than 1)
– Mean is usually more than median
• What is the skewness?
– What is the mean? What is the median?
• $Skewness = \frac{n}{(n-1)(n-2)} \sum \left( \frac{x_i - \bar{x}}{s} \right)^3 = 0.92$
• The skewness is positive: the mean (490.80) is greater than the median (475), so the rent data
are skewed right.
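The skewness formula can be applied to the rent data as in the sketch below. The function name `sample_skewness` is illustrative; the formula is the adjusted sample skewness given on the slide.

```python
import statistics

# Monthly rents of the 70 sampled efficiency apartments
rents = [
    445, 615, 430, 590, 435, 600, 460, 600, 440, 615,
    440, 440, 440, 525, 425, 445, 575, 445, 450, 450,
    465, 450, 525, 450, 450, 460, 435, 460, 465, 480,
    450, 470, 490, 472, 475, 475, 500, 480, 570, 465,
    600, 485, 580, 470, 490, 500, 549, 500, 500, 480,
    570, 515, 450, 445, 525, 535, 475, 550, 480, 510,
    510, 575, 490, 435, 600, 435, 445, 435, 430, 440,
]


def sample_skewness(values):
    """n / ((n-1)(n-2)) times the sum of cubed standardized deviations."""
    n = len(values)
    x_bar = statistics.mean(values)
    s = statistics.stdev(values)  # sample standard deviation (n - 1)
    return n / ((n - 1) * (n - 2)) * sum(((x - x_bar) / s) ** 3 for x in values)


skewness = sample_skewness(rents)
print(f"Skewness: {skewness:.2f}")
print(f"Mean: {statistics.mean(rents):.2f}, Median: {statistics.median(rents)}")
```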
z‐Scores
• The z-score is often called the standardized value. It denotes the
number of standard deviations a data value, $x_i$, is from the mean:
$z_i = \frac{x_i - \bar{x}}{s}$
• z-score of smallest value, 425:
$z = \frac{425 - 490.80}{54.74} = -1.202$
• z-scores of the 70 rents (in ascending order):
-1.202 -1.111 -1.111 -1.019 -1.019 -1.019 -1.019 -1.019 -0.928 -0.928
-0.928 -0.928 -0.928 -0.837 -0.837 -0.837 -0.837 -0.837 -0.745 -0.745
-0.745 -0.745 -0.745 -0.745 -0.745 -0.563 -0.563 -0.563 -0.471 -0.471
-0.471 -0.380 -0.380 -0.343 -0.289 -0.289 -0.289 -0.197 -0.197 -0.197
-0.197 -0.106 -0.015 -0.015 -0.015 0.168 0.168 0.168 0.168 0.351
0.351 0.442 0.625 0.625 0.625 0.807 1.063 1.081 1.447 1.447
1.538 1.538 1.630 1.812 1.995 1.995 1.995 1.995 2.269 2.269
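A short Python sketch of the z-score calculation for the rent data follows. It uses the sample standard deviation from the `statistics` module, and the variable names are illustrative.

```python
import statistics

# Monthly rents of the 70 sampled efficiency apartments
rents = [
    445, 615, 430, 590, 435, 600, 460, 600, 440, 615,
    440, 440, 440, 525, 425, 445, 575, 445, 450, 450,
    465, 450, 525, 450, 450, 460, 435, 460, 465, 480,
    450, 470, 490, 472, 475, 475, 500, 480, 570, 465,
    600, 485, 580, 470, 490, 500, 549, 500, 500, 480,
    570, 515, 450, 445, 525, 535, 475, 550, 480, 510,
    510, 575, 490, 435, 600, 435, 445, 435, 430, 440,
]

x_bar = statistics.mean(rents)
s = statistics.stdev(rents)  # sample standard deviation (n - 1)

# Standardize each value: number of standard deviations from the mean
z_scores = [(x - x_bar) / s for x in sorted(rents)]

print(f"z-score of the smallest value ({min(rents)}): {z_scores[0]:.3f}")
print(f"z-score of the largest value ({max(rents)}): {z_scores[-1]:.3f}")
```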
Chebyshev’s Theorem
• If a random variable $X$ has mean $\mu$ and variance $\sigma^2$, then, for $c > 1$:
$P(\mu - c\sigma \le X \le \mu + c\sigma) \ge 1 - \frac{1}{c^2}$
• At least (1 - 1/c^2) of the items in any data set will be within c standard deviations of the
mean, where c is any value greater than 1 (need not be an integer).
• At least 75% of the data values must be within 2 standard deviations of the mean.
Markov’s Inequality
• If $X$ is a non-negative random variable with finite mean, $E[X]$, and variance,
$Var(X)$, and $c > 0$, then:
$P(X \ge c) \le \frac{E[X]}{c}$
• Markov’s Inequality can be used to prove Chebyshev’s Theorem.
• Andrey Andreyevich Markov (1856 – 1922) was a Russian mathematician best known for his
work on stochastic processes. A primary subject of his research later became known as Markov
chains and Markov processes.
• Pafnuty Lvovich Chebyshev (1821 – 1894) was a Russian mathematician. Chebyshev is
known for his work in the fields of probability, statistics, mechanics, and number theory.
• What if c = 1.5 for the apartment rent data?
• Then… at least ____% of the data values must be within 1.5 standard
deviations of the mean.
• We have $\bar{x}$ = 490.80 and $s$ = 54.74
• For c = 1.5, we should have at least 56% of the data between:
$\bar{x} - 1.5 s \approx 409$ and $\bar{x} + 1.5 s \approx 573$
• Actually, 85.7% of the values lie in the range.
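The comparison between Chebyshev's guaranteed minimum and the share actually observed in the rent data can be reproduced as follows. This is a sketch with an illustrative helper name, `chebyshev_bound`.

```python
import statistics

# Monthly rents of the 70 sampled efficiency apartments
rents = [
    445, 615, 430, 590, 435, 600, 460, 600, 440, 615,
    440, 440, 440, 525, 425, 445, 575, 445, 450, 450,
    465, 450, 525, 450, 450, 460, 435, 460, 465, 480,
    450, 470, 490, 472, 475, 475, 500, 480, 570, 465,
    600, 485, 580, 470, 490, 500, 549, 500, 500, 480,
    570, 515, 450, 445, 525, 535, 475, 550, 480, 510,
    510, 575, 490, 435, 600, 435, 445, 435, 430, 440,
]


def chebyshev_bound(c):
    """Minimum fraction of values within c standard deviations of the mean (c > 1)."""
    if c <= 1:
        raise ValueError("c must be greater than 1")
    return 1 - 1 / c ** 2


c = 1.5
x_bar = statistics.mean(rents)
s = statistics.stdev(rents)
lower, upper = x_bar - c * s, x_bar + c * s

bound = chebyshev_bound(c)
observed = sum(lower <= x <= upper for x in rents) / len(rents)

print(f"Interval: {lower:.0f} to {upper:.0f}")
print(f"Chebyshev guarantees at least {100 * bound:.0f}%")
print(f"Observed: {100 * observed:.1f}%")
```

The observed share is well above the bound, which is expected: Chebyshev's Theorem holds for any distribution, so it is usually conservative for a particular data set.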
Application to the Normal Distribution
More of this to follow…
• About 68.27% of the values of a normal random variable are within +/- 1 standard deviation
of its mean.
• About 95.45% of the values of a normal random variable are within +/- 2 standard deviations
of its mean.
• About 99.73% of the values of a normal random variable are within +/- 3 standard deviations
of its mean.
Example on Chebyshev Theorem and Markov Inequality
• A group of students in the agricultural college are examining the growth of corn. Data collection reveals that
for a two‐week old plant the height varies, but has a mean of 29.4 inches and a standard deviation of 2.1
inches.
– What can you say about the probability that a two‐week old corn plant will have a height of less than 35 inches?
– What does Chebyshev’s theorem say about the height of a two-week old corn plant for c = 2 and c = 3?
• Solution
– From Markov Inequality (height is non-negative), we have: P(X < 35) = 1 - P(X ≥ 35), but P(X ≥ 35) ≤ 29.4/35 = 0.84, i.e. P(X < 35) ≥ (35 - 29.4)/35 = 0.16
– From Chebyshev’s Theorem: P(25.2 ≤ X ≤ 33.6) ≥ 0.75 for c = 2; P(23.1 ≤ X ≤ 35.7) ≥ 0.89 for c = 3.
• Take away: In this example, Chebyshev’s Theorem provides better bounds on probability. So, while these
inequalities are useful, the degree of usefulness depends on the data (since they are just bounds).
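The two bounds in the corn example are simple arithmetic, sketched here in Python. The function names are illustrative, and the only inputs are the mean and standard deviation stated in the problem.

```python
mean_height = 29.4  # inches
sd_height = 2.1     # inches


def markov_upper_bound(mean, c):
    """Markov: P(X >= c) <= E[X] / c for a non-negative X and c > 0."""
    return mean / c


def chebyshev_interval(mean, sd, c):
    """Chebyshev: (lower, upper, minimum probability) for mean +/- c standard deviations."""
    return mean - c * sd, mean + c * sd, 1 - 1 / c ** 2


# Markov gives a lower bound on P(X < 35) through the complement
p_less_than_35 = 1 - markov_upper_bound(mean_height, 35)
print(f"Markov: P(X < 35) >= {p_less_than_35:.2f}")

for c in (2, 3):
    low, high, prob = chebyshev_interval(mean_height, sd_height, c)
    print(f"Chebyshev (c = {c}): P({low:.1f} <= X <= {high:.1f}) >= {prob:.2f}")
```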
But… that is a lot of statistics that could be useful.
Is there a simple way to show the key statistics?
Five Number Summaries and Box Plot
• Five Number Summary
– Smallest Value (425)
– First Quartile (445)
– Median (475)
– Third Quartile (525)
– Largest Value (615)
• Sometimes the mean is also
listed (490.80)
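The five-number summary of the rent data can be assembled from the minimum, the quartiles and the maximum. The sketch below assumes the quartiles are computed with the index rule i = p*n/100 used earlier in these slides; the helper names are illustrative.

```python
# Monthly rents of the 70 sampled efficiency apartments
rents = [
    445, 615, 430, 590, 435, 600, 460, 600, 440, 615,
    440, 440, 440, 525, 425, 445, 575, 445, 450, 450,
    465, 450, 525, 450, 450, 460, 435, 460, 465, 480,
    450, 470, 490, 472, 475, 475, 500, 480, 570, 465,
    600, 485, 580, 470, 490, 500, 549, 500, 500, 480,
    570, 515, 450, 445, 525, 535, 475, 550, 480, 510,
    510, 575, 490, 435, 600, 435, 445, 435, 430, 440,
]


def percentile(values, p):
    """pth percentile with the index rule i = p*n/100 (0 < p < 100)."""
    ordered = sorted(values)
    position, remainder = divmod(p * len(ordered), 100)
    position = int(position)
    if remainder:
        return ordered[position]  # i not an integer: round up
    return (ordered[position - 1] + ordered[position]) / 2  # average positions i, i+1


def five_number_summary(values):
    """(smallest, first quartile, median, third quartile, largest)."""
    return (min(values), percentile(values, 25), percentile(values, 50),
            percentile(values, 75), max(values))


summary = five_number_summary(rents)
iqr = summary[3] - summary[1]  # interquartile range: Q3 - Q1
print("Five-number summary:", summary)
print("Interquartile range:", iqr)
```

These five values are the ones a box plot draws: the box spans the first to the third quartile with a line at the median.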
Example: Covariance and Correlation
• Homer Simpson is interested in the relationship between driving distance (yards)
and 18-hole score.
Driving Distance (yards)   18-Hole Score
277.6                      69
259.5                      71
269.1                      70
267.0                      70
255.6                      71
272.9                      69
Covariance and Correlation Coefficient
• Relationship between two variables is measured using Covariance and Correlation
Coefficient.
• Covariance is a measure of linear relationship between two variables
– Positive value indicates a positive relationship
– Negative value indicates a negative relationship
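– Population covariance: $\sigma_{xy} = \frac{\sum (x_i - \mu_x)(y_i - \mu_y)}{N}$
– Sample covariance: $s_{xy} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$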
Covariance and Correlation Coefficient
• Relationship between two variables is measured using Covariance and Correlation
Coefficient.
• Correlation is a measure of linear association, not necessarily causation.
– Values near ‐1 indicate a strong negative linear relationship
– Values near +1 indicate a strong positive linear relationship
– Values near 0 indicate a weak relationship
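– Population correlation coefficient: $\rho_{xy} = \frac{\sigma_{xy}}{\sigma_x \sigma_y}$
– Sample correlation coefficient: $r_{xy} = \frac{s_{xy}}{s_x s_y}$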
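Example: Covariance and Correlation
• Homer Simpson is interested in the relationship between driving distance (yards)
and 18-hole score.
$x_i$    $y_i$   $x_i - \bar{x}$   $y_i - \bar{y}$   $(x_i - \bar{x})(y_i - \bar{y})$
277.6    69      10.65             -1.00             -10.65
259.5    71      -7.45             1.00              -7.45
269.1    70      2.15              0.00              0.00
267.0    70      0.05              0.00              0.00
255.6    71      -11.35            1.00              -11.35
272.9    69      5.95              -1.00             -5.95
Total                                                -35.40
• $\bar{x}$ = 266.95 (about 267); $\bar{y}$ = 70; $s_x$ = 8.22; $s_y$ = 0.894
• $s_{xy} = \frac{-35.40}{6 - 1} = -7.08$
• $r_{xy} = \frac{s_{xy}}{s_x s_y} = \frac{-7.08}{8.22 \times 0.894} = -0.963$
• The correlation is close to -1: longer driving distances go with lower 18-hole scores.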
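The sample covariance and correlation for the golf data can be verified with plain Python. This sketch computes the formulas by hand rather than relying on a library routine; the function names are illustrative.

```python
import math

driving_distance = [277.6, 259.5, 269.1, 267.0, 255.6, 272.9]  # yards
score = [69, 71, 70, 70, 71, 69]                               # 18-hole score


def sample_covariance(xs, ys):
    """Sum of products of deviations from the means, divided by n - 1."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)


def sample_std(values):
    """Sample standard deviation (n - 1 denominator)."""
    return math.sqrt(sample_covariance(values, values))


def sample_correlation(xs, ys):
    """Sample covariance divided by the product of the sample standard deviations."""
    return sample_covariance(xs, ys) / (sample_std(xs) * sample_std(ys))


s_xy = sample_covariance(driving_distance, score)
r_xy = sample_correlation(driving_distance, score)

print(f"s_x = {sample_std(driving_distance):.2f}, s_y = {sample_std(score):.3f}")
print(f"s_xy = {s_xy:.2f}")
print(f"r_xy = {r_xy:.3f}")
```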
Questions
Proof of Markov Inequality
• If $X$ is a non-negative random variable with finite mean, $E[X]$, and variance, $Var(X)$, and $c > 0$, then:
$P(X \ge c) \le \frac{E[X]}{c}$
• Proof:
• Consider the function $f(x) = \mathbb{I}(x \ge c)$, such that $f(x) = 0$ for $x < c$ and $f(x) = 1$ for $x \ge c$ (blue line in the figure).
• Then define $g(x) = c \cdot f(x)$. Therefore, $g(x) = 0$ for $x < c$ and $g(x) = c$ for $x \ge c$ (red dotted line in the figure).
• Then for $0 \le x < c$, $g(x) = 0$, and so $g(x) \le x$.
• For $x \ge c$, $g(x) = c$, and so $g(x) \le x$.
• This implies $g(x) = c \cdot \mathbb{I}(x \ge c) \le x$ for all $x \ge 0$.
• Taking expectations, we have $c \cdot P(X \ge c) \le E[X]$.
• This implies $P(X \ge c) \le \frac{E[X]}{c}$.
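Proof of Chebyshev’s Theorem
• Claim: $P(\mu - c\sigma \le X \le \mu + c\sigma) \ge 1 - \frac{1}{c^2}$
• Proof:
• From Markov’s Inequality:
• $P(X \ge c) \le \frac{E[X]}{c}$, for any non-negative random variable $X$, where $E[X] < \infty$, $Var(X) < \infty$, and any $c > 0$.
• So it holds for the non-negative random variable $(X - \mu)^2$.
• Further, $P(|X - \mu| \ge c\sigma) = P((X - \mu)^2 \ge c^2\sigma^2)$.
• Also, $P((X - \mu)^2 \ge c^2\sigma^2) \le \frac{E[(X - \mu)^2]}{c^2\sigma^2}$ (applying Markov Inequality).
• Then, $P(|X - \mu| \ge c\sigma) \le \frac{\sigma^2}{c^2\sigma^2} = \frac{1}{c^2}$ (from definition of $Var(X)$).
• Then, $P(|X - \mu| < c\sigma) \ge 1 - \frac{1}{c^2}$.
• i.e. $P(\mu - c\sigma < X < \mu + c\sigma) \ge 1 - \frac{1}{c^2}$, and hence also $P(\mu - c\sigma \le X \le \mu + c\sigma) \ge 1 - \frac{1}{c^2}$.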
