
Mathematical Statistics

Instructor:

Dr. Deshi Ye

Course homepage: http://www.cs.zju.edu.cn/people/yedeshi/


Course information
What is it for?
This course provides an elementary
introduction to mathematical statistics with
applications.

Topics include: statistical estimation;
hypothesis testing; confidence intervals;
calculation of a P-value; nonparametric
testing; curve fitting; analysis of variance; and
factorial experimental design.
Grading
Grades for the course will be based on the
following weighting
1) Class attendance: 10%
2) Homework assignment: 26%
3) Unit quiz: 24% (12%, 12%)
4) Final exam: 40%
Introduction
Probability theory is devoted to the study
of uncertainty and variability

Statistics can be described as the study of
how to make inferences and decisions in
the face of uncertainty and variability
Brief History
Blaise Pascal and Pierre de Fermat: the
origins of probability are found in their
correspondence concerning a popular dice game,
which led to the fundamental principles of probability theory.
Pierre-Simon Laplace:
Before him, the theory was mainly concerned with
the analysis of games of chance.
Laplace applied probabilistic ideas to many
scientific and practical problems.
A case study
Visually inspecting data to improve
product quality
Population and Sample
When investigating a physical phenomenon,
production process, or manufactured unit, we study
items that share some common characteristics.
Relevant data must be collected.
Unit: the source of each measurement;
a single entity, usually an object or person.
Population: the entire collection of units.
Examples
Population                                  Unit      Variables
All students currently enrolled in school   student   GPA; number of credits
All books in library                        book      Replacement cost
Sample
Statistical population: the set of all
measurements corresponding to each unit
in the entire population of units about
which information is sought.
Sample: A sample from a statistical
population is the subset of measurements
that are actually collected in the course of
investigation.
Ch2: Treatment of data
Outline
Pareto diagrams, dot diagrams
Histograms (Frequency distributions)
Stem-and-leaf display
Box-plot (Quartiles and Percentiles)
The calculation of the mean x̄ and standard deviation s
Pareto Diagram
For a computer-controlled lathe whose
performance was below par, workers
recorded the following causes and their
frequencies:
power fluctuations 6
controller not stable 22
operator error 13
worn tool not replaced 2
other 5
Minitab14
1. Stat->Quality tools->Pareto chart
2. Choose chart defects table as follows
Output
Pareto diagram
Pareto diagram: depicts Pareto's empirical
law that any assortment of events consists
of a few major and many minor elements.
Typically, two or three elements will
account for more than half of the total
frequency.
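As a rough illustration of the ordering behind a Pareto diagram, the sketch below sorts the lathe causes by frequency and accumulates their share of the total. The function name `pareto_table` is hypothetical, and the sort is a plain descending sort (charting tools often place "other" last regardless of its count).

```python
# Causes and frequencies recorded for the lathe (from the table above).
causes = {
    "power fluctuations": 6,
    "controller not stable": 22,
    "operator error": 13,
    "worn tool not replaced": 2,
    "other": 5,
}


def pareto_table(freqs):
    """Return (cause, frequency, cumulative percent) rows, largest first."""
    total = sum(freqs.values())
    rows, running = [], 0
    ordered = sorted(freqs.items(), key=lambda kv: kv[1], reverse=True)
    for cause, count in ordered:
        running += count
        rows.append((cause, count, 100.0 * running / total))
    return rows


rows = pareto_table(causes)
for cause, count, cum_pct in rows:
    print(f"{cause:<24}{count:>4}{cum_pct:>8.1f}%")
```

Here the two largest causes already account for more than half of the 48 recorded events, which is the pattern the empirical law describes.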
Dot diagram
Observation on the deviations of cutting
speed from the target value set by the
controller.
EX. Cutting speed - target speed:
3 6 2 4 7 4

In Minitab: stat->dotplots->simple
Dot diagram
This diagram visually summarizes the
information that the lathe is generally
running fast.
Data001.
80 measurements of emissions (in tons) of sulfur
oxides from an industrial plant
15.8 26.4 17.3 11.2 23.9 24.8 18.7 13.9 9.0 13.2 22.7 9.8
6.2 14.7 17.5 26.1 12.8 28.6 17.6 23.7 26.8

22.7 18.0 20.5 11.0 20.9 15.5 19.4 16.7 10.7 19.1 15.2
22.9 26.6 20.4 21.4 19.2 21.6 16.9 19.0 18.5 23.0

24.6 20.1 16.2 18.0 7.7 13.5 23.5 14.5 14.4 29.6 19.4
17.0 20.8 24.3 22.5 24.6 18.4 18.1 8.3 21.9 12.3

22.3 13.3 11.8 19.3 20.0 25.7 31.8 25.9 10.5 15.9 27.5
18.1 17.9 9.4 24.1 20.1 28.5
Frequency distributions
A frequency distribution is a tabular
arrangement of data whereby the data is
grouped into different intervals, and then
the number of observations that belong to
each interval is determined.
Data that is presented in this manner are
known as grouped data.
Class limits & frequency
Class limits    Frequency
5.0 - 8.9       3
9.0 - 12.9      10
13.0 - 16.9     14
17.0 - 20.9     25
21.0 - 24.9     17
25.0 - 28.9     9
29.0 - 32.9     2
Total           80
Class limit and width
Lower class limit: the smallest value that can belong to
a given interval.
Upper class limit: the largest value that can belong to
the interval.
Class width: the difference between the upper class
limit and the lower class limit of an interval.
When designing the intervals to be used in a frequency
distribution, it is preferable that the class widths of all
intervals be the same.
Class limits & frequency
Class limits Frequency
[5.0, 9.0) 3
[9.0, 13.0) 10
[13.0, 17.0) 14
[17.0, 21.0) 25
[21.0, 25.0) 17
[25.0, 29.0) 9
[29.0, 33.0) 2
Total 80
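The table above can be reproduced by counting how many Data001 values fall in each half-open interval. The following is a minimal sketch; the names `frequency_table` and `edges` are illustrative, not part of the course material.

```python
# Data001: 80 sulfur oxide emission measurements (in tons).
data = [
    15.8, 26.4, 17.3, 11.2, 23.9, 24.8, 18.7, 13.9, 9.0, 13.2, 22.7, 9.8,
    6.2, 14.7, 17.5, 26.1, 12.8, 28.6, 17.6, 23.7, 26.8,
    22.7, 18.0, 20.5, 11.0, 20.9, 15.5, 19.4, 16.7, 10.7, 19.1, 15.2,
    22.9, 26.6, 20.4, 21.4, 19.2, 21.6, 16.9, 19.0, 18.5, 23.0,
    24.6, 20.1, 16.2, 18.0, 7.7, 13.5, 23.5, 14.5, 14.4, 29.6, 19.4,
    17.0, 20.8, 24.3, 22.5, 24.6, 18.4, 18.1, 8.3, 21.9, 12.3,
    22.3, 13.3, 11.8, 19.3, 20.0, 25.7, 31.8, 25.9, 10.5, 15.9, 27.5,
    18.1, 17.9, 9.4, 24.1, 20.1, 28.5,
]

# Class boundaries: interval i is [edges[i], edges[i + 1]).
edges = [5.0, 9.0, 13.0, 17.0, 21.0, 25.0, 29.0, 33.0]


def frequency_table(values, edges):
    """Count the values falling in each half-open class [lower, upper)."""
    counts = [0] * (len(edges) - 1)
    for x in values:
        for i in range(len(counts)):
            if edges[i] <= x < edges[i + 1]:
                counts[i] += 1
                break
    return counts


counts = frequency_table(data, edges)
for i, c in enumerate(counts):
    print(f"[{edges[i]:.1f}, {edges[i + 1]:.1f})  {c}")
```

Using half-open intervals means a value such as 9.0 belongs to exactly one class, so no observation is counted twice.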
Variants of frequency distribution
The cumulative frequency distribution is obtained
by computing the cumulative frequency, defined
as the total frequency of all values less than the
upper class limit of a particular interval, for all
intervals.
Relative frequency: the ratio of the number of
observations in the interval to the total number of
observations
The percentage frequency distribution is arrived
at by multiplying the relative frequencies of each
interval by 100%.
Cumulative frequency
Class limits Cumulative frequency
Less than 5 0
Less than 9 3
Less than 13 13
Less than 17 27
Less than 21 52
Less than 25 69
Less than 29 78
Less than 33 80
Percentage distribution
Class limits Perc. Dist. Frequency
[5.0, 9.0) 3.75% 3
[9.0, 13.0) 12.5% 10
[13.0, 17.0) 17.5% 14
[17.0, 21.0) 31.25% 25
[21.0, 25.0) 21.25% 17
[25.0, 29.0) 11.25% 9
[29.0, 33.0) 2.5% 2
Total 100% 80
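The cumulative and percentage distributions follow directly from the class frequencies. A short sketch, using only the counts from the frequency table (variable names are illustrative):

```python
from itertools import accumulate

# Class frequencies for [5, 9), [9, 13), ..., [29, 33).
frequencies = [3, 10, 14, 25, 17, 9, 2]
total = sum(frequencies)

# Cumulative frequency: number of observations below each upper class limit.
cumulative = list(accumulate(frequencies))

# Relative frequency, and the same quantity expressed as a percentage.
relative = [f / total for f in frequencies]
percentage = [100 * f / total for f in frequencies]

for f, c, p in zip(frequencies, cumulative, percentage):
    print(f"{f:>3} {c:>3} {p:>6.2f}%")
```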
Histogram
The most common form of graphical
presentation of a frequency distribution is
the histogram.
Histogram: constructed of adjacent
rectangles; the heights of the rectangles represent
the class frequencies, and the bases of the
rectangles extend between successive
class boundaries.
Histogram in Minitab
1. Graph->histogram->simple
2. Graph variables: c4
3. Edit bars: Click the bars in the output figures, in
Binning, Interval type select midpoint and interval
definition select midpoint/cutpoint, and then input 7
11 15 19 23 27 31 as illustrated in the following
Density histogram
When a histogram is constructed from a
frequency table having classes of unequal
widths, the height of each rectangle must be
changed to

Height = relative frequency / width.

The area of the rectangle then represents the
relative frequency for the class and the total
area of the histogram is 1.
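To see the height rule at work, the sketch below uses a hypothetical regrouping of the emission data in which the last two classes are merged into one wider class, [25, 33), with frequency 9 + 2 = 11. This merged table is an assumption made only for illustration.

```python
# Hypothetical unequal-width classes: the last class is twice as wide.
edges = [5.0, 9.0, 13.0, 17.0, 21.0, 25.0, 33.0]
frequencies = [3, 10, 14, 25, 17, 11]
total = sum(frequencies)

widths = [upper - lower for lower, upper in zip(edges, edges[1:])]

# Height = relative frequency / width, so area = relative frequency.
heights = [(f / total) / w for f, w in zip(frequencies, widths)]
areas = [h * w for h, w in zip(heights, widths)]

for lower, upper, h in zip(edges, edges[1:], heights):
    print(f"[{lower:.0f}, {upper:.0f})  height = {h:.5f}")
print("total area =", sum(areas))
```

Without the division by width, the merged class would look as tall as a class holding 11 observations in a width of 4, overstating how dense the data are in that range.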
Density histogram
Cumulative histogram
1) Graph->histogram->simple
2) Dataview->Datadisplay: check symbols only
Smoother: check lowess, and enter 0 in degree of
smoothing and 1 in number of steps.
Stem-and-leaf Display
A frequency table shows how many observations fall in
each class, but the original data points
have been lost.
Stem-and-leaf: serves the same purpose as a
histogram but preserves the original data
points.
Example: 10 numbers:
12, 13, 21, 27, 33, 34, 35, 37, 40, 40
Frequency table
Class limits    Frequency
10 - 19         2
20 - 29         2
30 - 39         4
40 - 49         2
Stem-and-leaf
Stem-and-leaf: each row has a stem, and
each digit to the right of the vertical
line on that row is a leaf.
The "stem" is the left-hand column, which
contains the tens digits.
The "leaves" are the lists in the right-hand
column, showing all the ones digits for each
of the tens, twenties, thirties, and forties.

Key: 4|0 means 40
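A stem-and-leaf display for the ten example numbers can be built by splitting each value into its tens digit (stem) and ones digit (leaf). This is a minimal sketch for two-digit whole numbers; the function name `stem_and_leaf` is illustrative.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Group the ones digits (leaves) under their tens digits (stems)."""
    stems = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        stems[stem].append(leaf)
    return dict(stems)


numbers = [12, 13, 21, 27, 33, 34, 35, 37, 40, 40]
display = stem_and_leaf(numbers)

for stem in sorted(display):
    leaves = "".join(str(leaf) for leaf in display[stem])
    print(f"{stem} | {leaves}")
```

Each row's length matches the class frequency, and the original values can still be read back from the stems and leaves.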


Stem-and-leaf in Minitab
The display has three columns:
The leaves (right) - Each value in the leaf column
represents a digit from one observation.
The stem (middle) - The stem value represents the
digit immediately to the left of the leaf digit.
Counts (left) - If the median value for the sample is
included in a row, the count for that row is enclosed in
parentheses. The values for rows above and below
the median are cumulative.
Stem-and-leaf for DATA001
Stem-and-leaf of frequencies N = 80
Leaf Unit = 1.0

2 0 67
6 0 8999
11 1 00111
17 1 223333
24 1 4445555
32 1 66677777
(13) 1 8888888999999
35 2 0000000111
25 2 222223333
16 2 4444455
9 2 66667
4 2 889
1 3 1
Ch2.5: Descriptive measures
Mean: the sum of the observations divided by the
sample size:
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$
Median: the center, or location, of a set of data. If
the observations are arranged in an ascending or
descending order:
If the number of observations is odd, the median is the
middle value.
If the number of observations is even, the median is
the average of the two middle values.
Example
15 14 2 27 13
Mean:
$$\bar{x} = \frac{15 + 14 + 2 + 27 + 13}{5} = 14.2$$

Ordering the data from smallest to largest:
2 13 14 15 27
The median is the third (middle) value, 14.
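The two definitions translate directly into code. The sketch below is illustrative (Python's standard `statistics` module offers equivalent `mean` and `median` functions) and checks the five-observation example.

```python
def mean(values):
    """Sum of the observations divided by the sample size."""
    return sum(values) / len(values)


def median(values):
    """Middle value if n is odd; average of the two middle values if even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


sample = [15, 14, 2, 27, 13]
print("mean   =", mean(sample))
print("median =", median(sample))
```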
Sample variance
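Deviations from the mean: $x_i - \bar{x}$
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} = \frac{n \sum_{i=1}^{n} x_i^2 - \left( \sum_{i=1}^{n} x_i \right)^2}{n(n - 1)}$$

Standard deviation s:
$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$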
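As a check that the definitional and computational forms of the sample variance agree, the sketch below evaluates both on the earlier five-observation example (15, 14, 2, 27, 13). The function names are illustrative.

```python
import math


def variance_definition(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    xbar = sum(values) / n
    return sum((x - xbar) ** 2 for x in values) / (n - 1)


def variance_shortcut(values):
    """Computational form: (n * sum(x^2) - (sum x)^2) / (n * (n - 1))."""
    n = len(values)
    sum_x = sum(values)
    sum_x2 = sum(x * x for x in values)
    return (n * sum_x2 - sum_x ** 2) / (n * (n - 1))


sample = [15, 14, 2, 27, 13]
s2 = variance_definition(sample)
s = math.sqrt(s2)
print("s^2 (definition) =", s2)
print("s^2 (shortcut)   =", variance_shortcut(sample))
print("s                =", s)
```

For this sample the sum of squared deviations is 314.8, so s² = 314.8 / 4 = 78.7; the shortcut gives (5 · 1323 − 71²) / 20 = 1574 / 20 = 78.7 as well.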
Quartiles and Percentiles
Quartiles: values in a given set of
observations that divide the data into 4 equal parts.
The first quartile, Q1, is a value that has one
fourth, or 25%, of the observations below its
value.
The sample 100p-th percentile is a value such
that at least 100p% of the observations are at or
below this value, and at least 100(1-p)% are at or
above this value.
Example
Example in P34:
$$Q_1 = \frac{14.7 + 15.2}{2} = 14.95$$
$$Q_2 = \frac{19.0 + 19.1}{2} = 19.05$$
$$Q_3 = \frac{22.9 + 23.0}{2} = 22.95$$
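The quartile values above can be reproduced from Data001 with one common rule that satisfies the percentile definition: sort the data and compute n·p; if it is not an integer, round up and take that ordered value; if it is an integer k, average the k-th and (k+1)-th ordered values. That this is exactly the rule used in the textbook example is an assumption, and `sample_percentile` is a hypothetical name.

```python
# Data001: 80 sulfur oxide emission measurements (in tons).
data = [
    15.8, 26.4, 17.3, 11.2, 23.9, 24.8, 18.7, 13.9, 9.0, 13.2, 22.7, 9.8,
    6.2, 14.7, 17.5, 26.1, 12.8, 28.6, 17.6, 23.7, 26.8,
    22.7, 18.0, 20.5, 11.0, 20.9, 15.5, 19.4, 16.7, 10.7, 19.1, 15.2,
    22.9, 26.6, 20.4, 21.4, 19.2, 21.6, 16.9, 19.0, 18.5, 23.0,
    24.6, 20.1, 16.2, 18.0, 7.7, 13.5, 23.5, 14.5, 14.4, 29.6, 19.4,
    17.0, 20.8, 24.3, 22.5, 24.6, 18.4, 18.1, 8.3, 21.9, 12.3,
    22.3, 13.3, 11.8, 19.3, 20.0, 25.7, 31.8, 25.9, 10.5, 15.9, 27.5,
    18.1, 17.9, 9.4, 24.1, 20.1, 28.5,
]


def sample_percentile(values, percent):
    """Sample percentile for an integer percent strictly between 0 and 100."""
    if not 0 < percent < 100:
        raise ValueError("percent must be strictly between 0 and 100")
    ordered = sorted(values)
    n = len(ordered)
    # k = floor(n * p); remainder is nonzero when n * p is not an integer.
    k, remainder = divmod(n * percent, 100)
    if remainder:
        # Round up to position k + 1 (1-based), i.e. index k (0-based).
        return ordered[k]
    # n * p is an integer: average the k-th and (k + 1)-th ordered values.
    return (ordered[k - 1] + ordered[k]) / 2


q1 = sample_percentile(data, 25)
q2 = sample_percentile(data, 50)
q3 = sample_percentile(data, 75)
print(q1, q2, q3)
```

With n = 80, the products 20, 40, and 60 are integers, so each quartile is the average of two adjacent ordered values, as in the example.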
Boxplots
A boxplot is a way of summarizing
information contained in the quartiles (or
on an interval)
Box length = interquartile range = Q3 - Q1
Modified boxplot
Outlier: an observation too far from the
quartiles, i.e. more than
1.5 × (interquartile range)
above the third quartile (or, symmetrically,
below the first quartile).
Modified boxplot:
identifies outliers and
reduces their effect on the
shape of the boxplot.
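The outlier rule can be sketched as a pair of fences placed 1.5 interquartile ranges beyond the quartiles. The example below uses the quartiles Q1 = 14.95 and Q3 = 22.95 from the emission data; the function names and the test value 40.0 are hypothetical.

```python
def outlier_fences(q1, q3, k=1.5):
    """Return (lower, upper) fences at k interquartile ranges past Q1 and Q3."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def find_outliers(values, q1, q3):
    """Values lying outside the fences are flagged as outliers."""
    low, high = outlier_fences(q1, q3)
    return [x for x in values if x < low or x > high]


q1, q3 = 14.95, 22.95
low, high = outlier_fences(q1, q3)
print("fences:", low, high)

# 6.2 and 31.8 are the extremes of Data001; 40.0 is a made-up value.
print(find_outliers([6.2, 31.8, 40.0], q1, q3))
```

Since the smallest (6.2) and largest (31.8) emission values both lie inside the fences, the rule flags no outliers in Data001.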
