Lecture 02 - SP - Probability Review

Statistics
A Review
 Population. Universe. The entire category under
consideration. The population size is usually
indicated by a capital N.
 Examples: every lawyer in the United States; all
single women in the United States.
 Sample. That portion of the population that is
Key Terms available, or to be made available, for analysis. A

good sample is representative of the population. The
sample size is shown as small n.
 If your company manufactures one million
laptops, they might take a sample of say, 500, of
them to test quality.
 The population size is N = 1,000,000 and the
sample size is n= 500.
Statistics Review March 22 2

Parameter. A characteristic of a population. The population
mean, µ and the population standard deviation, σ, are two
examples of population parameters.
Statistic. A statistic is a measure that is derived from the sample
data. For example, the sample mean, , and the sample standard
deviation, s, are statistics. They are used to estimate the
population parameters.
Key Terms Statistical Inference. The process of using sample statistics to

draw conclusions about population parameters is known as
statistical inference.
For instance, using (based on a sample of, say, n=1000) to draw
conclusions about µ (population of, say, 300 million).
Descriptive Statistics. Those statistics that summarize a sample
of numerical data in terms of averages and other measures for the
purpose of description, such as the mean and standard deviation.

An Example for Quality Control (QC)
 GE manufactures LED bulbs and wants to know how many are defective. Suppose one million
bulbs a year are produced in its new plant in Staten Island. The company might sample, say, 500
bulbs to estimate the proportion of defectives.
 N = 1,000,000 and n = 500
 If 5 out of 500 bulbs tested are defective, the sample proportion of defectives will be 1%
(5/500).
 This statistic may be used to estimate the true proportion of defective bulbs (the population
proportion).

Types of Data
Qualitative data result in One way to determine whether data
Quantitative data result in
categorical responses. Also is continuous, is to ask yourself
numerical responses and may
called Nominal, or categorical whether you can add several
be discrete or continuous. decimal places to the answer.
data
• Example: Gender • Discrete data arise from a • For example, you may weigh
 MALE counting process. 150 pounds but in actuality
 FEMALE • Example: How many courses may weigh 150.23568924567
have you taken at this College? pounds.
____
• Continuous data arise from a
• On the other hand, if you
measuring process. have 2 children, you do not
• Example: How much do you have 2.3217638 children.
weigh? ____

 Numerical data may be summarized according to several
characteristics.
 Measures of Location
 Measures of central tendency: Mean; Median; Mode
 Measures of noncentral tendency - Quantiles
Numerical  Quartiles; Quintiles; Percentiles
Measures of Dispersion
Data – Single

 Range
Variable Interquartile range


 Variance
 Standard Deviation
 Coefficient of Variation
 Measures of Shape
 Skewness

 The sample mean is the sum of all the observations
(∑Xi) divided by the number of observations (n):
The Mean
 Example. 1, 2, 2, 4, 5, 10. Calculate the mean.
Note: n = 6 (six observations)

 The median is the middle value of the
ordered data
The Median To get the median, we must first rearrange


the data into an ordered array (in ascending

or descending order).
0 2 3 5 20 99 100
= minimum
10 20 30 40 50 60
 The mode is the value of the data that occurs
with the greatest frequency.
Example. 1, 1, 1, 2, 3, 4, 5
Answer. The mode is 1 since it occurs three times.
The Mode The other values each appear only once in the data
set.
Example. 5, 5, 5, 6, 8, 10, 10, 10.
Answer. The mode is: 5, 10.
There are two modes. This is a bi-modal dataset.

 Measures of non-central location used to
summarize a set of data
Examples of commonly used quantiles:
Quantiles  Quartiles
 Quintiles
 Deciles
 Percentiles

 Quartiles split a set of ordered data into four parts.
 Imagine cutting a chocolate bar into four equal
pieces… How many cuts would you make? (yes,
3!)
 Q1 is the First Quartile
 25% of the observations are smaller than Q1 and
75% of the observations are larger
Quartiles  Q2 is the Second Quartile
 50% of the observations are smaller than Q2 and
50% of the observations are larger. Same as the
Median. It is also the 50th percentile.
 Q3 is the Third Quartile
 75% of the observations are smaller than Q3and
25% of the observations are larger
Quartiles
 A quartile, like the median, either takes
the value of one of the observations, or
the value halfway between two
observations.
 If n/4 is an integer, the first quartile (Q1)
has the value halfway between the
(n/4)th observation and the next
observation.
 If n/4 is not an integer, the first quartile
has the value of the observation whose
position corresponds to the next highest
integer.
 The method we are using is an
approximation. If you solve this in MS
Excel, which relies on a formula, you
may get an answer that is slightly
different.

Exercise # 1
Computer Sales (n = 12 salespeople)

Original Data: 3, 10, 2, 5, 9, 8, 7, 12, 10, 0, 4, 6
Compute the mean, median, mode, quartiles.

Exercise # 1
Computer Sales (n = 12 salespeople)
Original Data: 3, 10, 2, 5, 9, 8, 7, 12, 10, 0, 4, 6
First order the data:

0, 2, 3, 4, 5, 6, 7, 8, 9, 10, 10, 12
= 76 / 12 = 6.33 computers sold

Median = 6.5 computers
Mode = 10 computers
Q1 = 3.5 computers, Q3= 9.5 computers

 We use 99 percentiles to divide a data set into 100
equal portions.
 Percentiles are used in analyzing the results of
standardized exams. For instance, a score of 40 on a
Percentiles standardized test might seem like a terrible grade, but
if it is the 99th percentile, don’t worry about telling
your parents. ☺
 We will always use computer software to obtain the
percentiles.

Exercise # 2
Data (n=16):
1, 1, 2, 2, 2, 2, 3, 3, 4, 4, 5, 5, 6, 7, 8, 10

Exercise # 2
Data (n=16):
1, 1, 2, 2, 2, 2, 3, 3, 4, 4, 5, 5, 6, 7, 8, 10
Answer.
1 1 2 2 ┋ 2 2 3 3 ┋ 4 4 5 5 ┋ 6 7 8 10
Mean = 65/16 = 4.06
Median = 3.5
Mode = 2
Q1 = 2
Q2 = Median = 3.5
Q3 = 5.5

Exercise # 3
Data – number of absences (n=13) :

0, 5, 3, 2, 1, 2, 4, 3, 1, 0, 0, 6, 12

Exercise # 3
Data – number of absences (n=13) :
0, 5, 3, 2, 1, 2, 4, 3, 1, 0, 0, 6, 12
Answer. First order the data:

0, 0, 0,┋ 1, 1, 2, 2, 3, 3, 4,┋ 5, 6, 12
Mean = 39/13 = 3.0 absences
Median = 2 absences
Mode = 0 absences
Q1 = .5 absences
Q3 = 4.5 absences

Exercise # 4
Data: Reading Levels of 16 eighth graders.

5, 6, 6, 6, 5, 8, 7, 7, 7, 8, 10, 9, 9, 9, 9, 9

Exercise # 4
Data: Reading Levels of 16 eighth graders.
5, 6, 6, 6, 5, 8, 7, 7, 7, 8, 10, 9, 9, 9, 9, 9
Answer. First, order the data:

5 5 6 6 ┋ 6 7 7 7 ┋ 8 8 9 9 ┋ 9 9 9 10
Sum=120.
Mean= 120/16 = 7.5 This is the average reading level of the 16 students.
Median = Q2 = 7.5
Q1 = 6, Q3 = 9
Mode = 9

Exercise # 4
Reading Level - Xi Frequency - fi

5 2
5 5 6 6 ┋ 6 7 7 7 ┋ 8 8 9 9 ┋ 9 9 9 10
6 3
Alternate method: The data can also be set up as a 7 3
frequency distribution.
8 2
Measures can be computed using methods for 9 5
grouped data. 10 1
Note that the sum of the frequencies, 16

Exercise # 4
Xi fi (Xi)(fi)
5 2 10
6 3 18
7 3 21
8 2 16
= ∑Xifi/n = 120 / 16 = 7.5
9 5 45
We see that the column total- ∑Xifi-is the sum 10 1 10
of the ungrouped data. 16 120

 Dispersion is the amount of spread, or variability, in a set of
data.
Why do we need to look at measures of dispersion?
Measures of

Consider this example:
Dispersion A company is about to buy computer chips that must have

an average life of 10 years. The company has a choice of
two suppliers. Whose chips should they buy? They take a
sample of 10 chips from each of the suppliers and test them.
See the data on the next slide.

Supplier A chips Supplier B chips
(life in years) (life in years)
11 170
Measures of Dispersion 11 1
10 1
10 160
11 2
11 150
We see that supplier B’s chips have a
longer average life. 11 150
11 170
However, what if the company offers
a 3-year warranty? 10 2
Then, computers manufactured using 12 140

the chips from supplier A will have no
returns
while using supplier B will result in
4/10 or 40% returns.
MedianA = 11 years MedianB= 145 years
sA = 0.63 years sB = 80.6 years
RangeA = 2 years RangeB= 169 years

 Range = Largest Value – Smallest Value
Example: 1, 2, 3, 4, 5, 8, 9, 21, 25, 30
Answer: Range = 30 – 1 = 29.
The Range  The range is simple to use and to explain to
others.
 One problem with the range is that it is
influenced by extreme values at either end.

 IQR = Q3 – Q1
 Example
 (n = 15):
0, 0, 2, 3, 4, 7, 9, 12, 17, 18, 20, 22, 45, 56, 98
Inter-Quartile Q1 = 3, Q3 = 22
Range (IQR) 
IQR = 22 – 3 = 19 (Range = 98)
This is basically the range of the central 50% of the
observations in the distribution.
 Problem: The interquartile range does not take into
account the variability of the total data (only the
central 50%). We are “throwing out” half of the data.

 The standard deviation, s, measures a kind of “average”
deviation about the mean. It is not really the “average”
deviation, even though we may think of it that way.
 Why can’t we simply compute the average deviation about
the mean, if that’s what we want?
Standard  If you take a simple mean, and then add up the deviations
Deviation about the mean, as above, this sum will be equal to 0.

Therefore, a measure of “average deviation” will not work.
 Instead, we use:
 This is the “definitional formula” for standard deviation.

Standard Deviation Xi Yi
1 0
Example. Two data sets, X and Y. Which of the two data sets has
greater variability? Calculate the standard deviation for each.
2 0
We note that both sets of data have the same mean:
=3
=3
3 0
4 5
5 10

X (X-) (X-)2
Standard Deviation
1 3 -2 4
2 3 -1 1
3 3 0 0
4 3 1 1
5 3 2 4
∑=0 10
Y (Y-) (Y-)2
SX== 1.58 0 3 -3 9
0 3 -3 9
0 3 -3 9
5 3 2 4
SY== = 4.47 10 3 7 49
∑=0 80

The variance, s2, is the standard deviation (s) squared.
Conversely, .
Variance Definitional formula:
Computational formula:

 The problem with s2 and s is that they are both, like the
mean, in the “original” units.
 This makes it difficult to compare the variability of two data
sets that are in different units or where the magnitude of the
numbers is very different in the two sets.
Coefficient of  For example,
Variation
 Suppose you wish to compare two stocks and one is in dollars and the other is in yen;
if you want to know which one is more volatile, you should use the coefficient of
variation.
(CV) 

It is also not appropriate to compare two stocks of vastly different prices even if both
are in the same units.
The standard deviation for a stock that sells for around $300 is going to be very
different from one with a price of around $0.25.
 The coefficient of variation will be a better measure of

dispersion in these cases than the standard deviation

Which stock is more volatile (unstable)?
Closing prices over the last 8 months:
Stock A Stock B
$1.62
Example: Stock
JAN
FEB
$1.00
1.50
$180
175
CVA =
$1.70
x 100% = 95.3%
Prices
MAR 1.90 182
APR .60 186 CVB =
$11 .33
x 100% = 6.0%
MAY 3.00 188 $188.88
JUN .40 190
JUL 5.00 200
AUG .20 210 Answer: The standard deviation of B is higher than for A, but A is more
volatile i.e., unstable
Mean $1.70 $188.88
s2 2.61 128.41
s $1.62 $11.33
Statistics Review
March 22 33
Exercise # 5
Data (n=10): 0, 0, 40, 50, 50, 60, 70, 90, 100, 100
Compute the mean, median, mode, quartiles (Q1,
Q2, Q3), range, interquartile range, variance,
standard deviation, and coefficient of variation. We
shall refer to all these as the descriptive (or
summary) statistics for a set of data.

Exercise # 5
Data (n=10): 0, 0, 40, 50, 50, 60, 70, 90, 100, 100
Compute the mean, median, mode, quartiles (Q1, Q2, Q3), range, interquartile range,
variance, standard deviation, and coefficient of variation. We shall refer to all these as
the descriptive (or summary) statistics for a set of data.
Answer. First order the data:

0, 0, 40, 50, 50 ┋ 60, 70, 90, 100, 100
 Mean: ∑Xi = 560 and n = 10, so = 560/10 = 56.
Median = Q2 = 55
 Q1= 40 ; Q3 = 90 (Note: Excel gives these as Q1 = 42.5, Q3 = 85.)
 Mode = 0, 50, 100
Range = 100 – 0 = 100
 IQR = 90 – 40 = 50
 s2 = 11,840/9 = 1315.5
 s = √1315.5 = 36.27
 CV = (36.27/56) x 100% = 64.8%

 A third important property of data – after location and
dispersion - is its shape.
 Shape can be described by degree of asymmetry (i.e.,
skewness).
 mean > median positive or right-skewness
Shape 

mean = median
mean < median
symmetric or zero-skewness
negative or left-skewness
Positive skewness can arise when the mean is increased by
some unusually high values.
 Negative skewness can arise when the mean is decreased by
some unusually low values.

Skewness
Left skewed:
Right skewed:
Symmetric:
Statistics Review 37
 A frequency distribution records data grouped
into classes and the number of observations
that fell into each class.
 A frequency distribution can be used for:
Working with  categorical data
Frequencies  numerical data that can be grouped into

categories
 numerical data with repeated observations
 A percentage distribution records the percent
of the observations that fell into each class.

Working with
Frequencies Take-home pay frequency percentage
520 and under 530 6 3 %
Example. A sample was taken of
200 professors at a (fictitious) 530 " " 540 30 15
local college. Each was asked for
his or her (take-home) weekly 540 " " 550 38 19
salary. The responses ranged from 550 " " 560 52 26
about$520 to $590. If we wanted
to display the data in, say, 7 equal 560 " " 570 42 21
intervals, we would use an interval 570 " " 580 24 12
width of $10. 580 to 590 8 4
Width of interval = = = $10/class.
200 100 %

Working with
Frequencies Take-home pay frequency percentage
less than 520 0 0
A Cumulative Distribution " " 530 6 3
focuses on the number or " " 540 36 18
percentage of cases that lie
below or above specified " " 550 74 37
values rather than within " " 560 126 63
intervals.
" " 570 168 84
" " 580 192 96
" " 590 200 100

Working with
Frequencies
The Frequency Histogram:

Working with
Frequencies
The Frequency Polygon

Working with
Frequencies
The Cumulative Frequency

Distribution

Descriptive Statistics – 2 variables
 Categorical Data – graphical representation

 Contingency Table
 Side-by-Side Bar Chart
 Numerical Data – looking for relationships in bivariate data

 Scatter Plot
 Correlation
 The Regression Line

The Contingency Male Female
Table Republican Candidate 250 250 500
Democrat Candidate 150 350 500
Total 400 600 1000
Two categorical variables
are most easily displayed in
a contingency table. This is
a table of two-way
frequencies.
Example: “Who would
you vote for in the next
election?”
This also works for two-
way percentages:

The Side-by-Side
Bar Chart
Chart: Relative Performance (Source: Microsoft.com)

The Scatter Plot Y (Grade)
X (Height)
100
73
95
79
90
62
80
69
70
74
65
77
60
81
40
63
30
68
20
74
What can we do with 2

numerical variables? We can
graph them.
Example – Grade and

Height (in inches)
• Correlation coefficient is r = .12

• Coefficient of determination is r2 = .01

Lecture 02 - SP - Probability Review

Uploaded by

Copyright:

Available Formats

You might also like

Lecture 02 - SP - Probability Review

Uploaded by

Document Information

Original Description:

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Lecture 02 - SP - Probability Review

Uploaded by

Copyright:

Available Formats

Statistics

Key Terms available, or to be made available, for analysis. A

Statistics Review March 22 2

Key Terms Statistical Inference. The process of using sample statistics to

Statistics Review March 22 3

Statistics Review March 22 4

Statistics Review March 22 5

Variable Interquartile range

Statistics Review March 22 6

Statistics Review March 22 7

The Median To get the median, we must first rearrange

the data into an ordered array (in ascending

Statistics Review March 22 9

Statistics Review March 22 10

Statistics Review March 22 12

Computer Sales (n = 12 salespeople)

Statistics Review March 22 13

First order the data:

= 76 / 12 = 6.33 computers sold

Statistics Review March 22 14

Statistics Review March 22 15

Statistics Review March 22 16

Statistics Review March 22 17

Data – number of absences (n=13) :

Statistics Review March 22 18

Answer. First order the data:

Statistics Review March 22 19

Data: Reading Levels of 16 eighth graders.

Statistics Review March 22 20

Answer. First, order the data:

Statistics Review March 22 21

Reading Level - Xi Frequency - fi

Statistics Review March 22 22

Statistics Review March 22 23

Consider this example:

Dispersion A company is about to buy computer chips that must have

Statistics Review March 22 24

Then, computers manufactured using 12 140

Statistics Review March 22 25

Statistics Review March 22 26

Statistics Review March 22 27

Deviation about the mean, as above, this sum will be equal to 0.

 This is the “definitional formula” for standard deviation.

Statistics Review March 22 28

Statistics Review March 22 29

Statistics Review March 22 30

Statistics Review March 22 31

 The coefficient of variation will be a better measure of

Statistics Review March 22 32

Statistics Review March 22 34

Answer. First order the data:

Statistics Review March 22 35

Statistics Review March 22 36

Frequencies  numerical data that can be grouped into

Statistics Review March 22 38

Statistics Review March 22 39

Statistics Review March 22 40

The Frequency Histogram: