CH1 and CH2 Definitions and Descriptive Statistics

HIT 215: Statistics for Engineers
Oliver Mhlanga
Harare Institute of Technology

omhlanga@hit.ac.zw, 0712531415
July 25, 2022
Introduction to Statistics
Oliver Mhlanga (HIT) July 25, 2022 1 / 29

Overview
1 Introduction to Statistics
Definitions
2 Descriptive Statistics
Measures of Location
Measures of Variability
Frequency tables and Graphical Descriptions

Definition
Data: any observations that have been collected.
Definition
Statistics is concerned with scientific methods for collecting, organizing,
summarising, presenting, and analyzing data as well as with drawing valid
conclusions and making reasonable decisions on the basis of such analysis.
Definition
Population is defined as the complete set of all elements being studied.
Definition
Sample: some subset of a population.

Definition
Information: data that have been recorded, classified, organised, related,
or interpreted within a framework so that meaning emerges.
Definition
Probability as a specific term is a measure of the likelihood that a
particular event will occur.
Definition
A random sample is one in which every member of the population has an
equal likelihood of appearing.

Definition
Descriptive Statistics: deals with procedures used to summarise the
information contained in a set of measurements.
Definition
Inferential Statistics: deals with procedures used to make inferences
(predictions) about a population parameter from information contained in
a sample.
Definition
A variable is a symbol, such as X, Y, H, x, or B, that can assume any of a
prescribed set of values, called the domain of the variable. If the variable
can assume only one value, it is called a constant.

Definition
Sampling: the practice concerned with the selection of a subset of
individuals/items from within a population to yield some knowledge about
the whole population.
Types of data:
Population Parameter (P.P): a characteristic of a population.
Sample Statistic (S.S): a characteristic of a sample.
Definition
Census: collection of data from every member of the population.
Definition
Survey: a method used to collect information systematically from
a sample of individuals/items.
Two types of data:

Qualitative data (categorical, ordinal): non-numeric, e.g. colour,
gender, race, religion.
Mathematical operations are meaningless.
Quantitative data: numerical, e.g. height, weight, speed, wages,
temperature, time.
Mathematical operations are meaningful.
Two types of quantitative data:

Discrete data: countable or finite, e.g. number of oranges or goats;
usually counts.
Continuous data: infinite number of possible values; usually
measurements.

Levels of measurements:
Nominal data: categories not ordered, e.g. religion.
Ordinal data: can be ordered, but differences are meaningless, e.g. colour
(spectrum).
Interval: ordered, differences are meaningful, but no "natural zero", e.g.
temperature.
Ratio: ordered, differences are meaningful, with a "natural zero", e.g.
amount of money.

Elements of a Statistical problem:

1 a clear definition of the population and variable of interest.
2 a design of the experiment or sampling procedure.
3 collection and analysis of data (gathering and summarising data).
4 procedure for making predictions about the population based on
sample information.
5 a measure of ’goodness’ or reliability for the procedure.

Basic types of studies and the corresponding methods for collecting data:
A retrospective study using historical data. Data collected in the past
for other purposes.
An observational study. Data, presently collected by a passive
observer.
A designed experiment. Data collected in response to process input
changes.

Measures of location provide the analyst with quantitative values for
where the center, or some other location, of the data lies.
Definition
Suppose that the observations in a sample are $x_1, x_2, \ldots, x_n$. The sample
mean, denoted by $\bar{x}$, is
\[ \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{x_1 + x_2 + \cdots + x_n}{n} \tag{1} \]
If some results occur more than once, it is convenient to take frequencies
into account. If $f_i$ stands for the frequency of result $x_i$, equation (1)
becomes
\[ \bar{x} = \frac{\sum_i x_i f_i}{\sum_i f_i}, \tag{2} \]
where the sums run over all $i$.

The sample mean, x̄ , is a reasonable estimate of the population mean, µ.
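Equations (1) and (2) can be checked against each other with a short sketch. The function names and the small data set are illustrative assumptions, not part of the course material.

```python
def sample_mean(xs):
    """Equation (1): sum of the observations divided by n."""
    return sum(xs) / len(xs)


def frequency_mean(values, freqs):
    """Equation (2): sum of x_i * f_i divided by the sum of f_i."""
    return sum(x * f for x, f in zip(values, freqs)) / sum(freqs)


# Hypothetical data: 2 occurs three times, 5 once, 7 twice.
raw = [2, 2, 2, 5, 7, 7]
values, freqs = [2, 5, 7], [3, 1, 2]

print(sample_mean(raw))                # mean from the raw observations
print(frequency_mean(values, freqs))   # same mean from the frequency table
```

Both calls describe the same six observations, so they should agree.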

Other Measures of Location.
Definition
Trimmed mean: a p% trimmed mean is obtained by eliminating the
smallest (p/2)% data values and the largest (p/2)% data values and
averaging the remaining data values.
Example
Calculate a 40% trimmed mean of the following observations: 6, 8.1, 8.3,
9.1, 9.9.
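Solution: a 40% trim removes the smallest 20% and the largest 20% of the
five values, i.e. one value from each end (6 and 9.9), so
$\bar{x}_p = (8.1 + 8.3 + 9.1)/3 = 8.5$.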
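A minimal sketch of the trimmed mean, applied to the example data. The function name is an assumption, and so is the choice to round down when (p/2)% of n is not a whole number; the definition above does not say how to handle that case.

```python
def trimmed_mean(xs, p):
    """p% trimmed mean: drop the smallest (p/2)% and largest (p/2)% of values."""
    data = sorted(xs)
    n = len(data)
    # Number of values removed from each end. Rounding down is an
    # assumption for cases where (p/2)% of n is not an integer.
    k = int(n * p / 200)
    kept = data[k:n - k]
    return sum(kept) / len(kept)


obs = [6, 8.1, 8.3, 9.1, 9.9]
print(trimmed_mean(obs, 40))   # averages 8.1, 8.3 and 9.1
```

With p = 0 nothing is removed and the function reduces to the ordinary sample mean.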

Definition
Given that the observations in a sample are $x_1, x_2, \ldots, x_n$, arranged in
increasing order of magnitude, the sample median is
\[ x_{\text{median}} = \begin{cases} x_{(n+1)/2}, & \text{if } n \text{ is odd} \\ \tfrac{1}{2}\left(x_{n/2} + x_{n/2+1}\right), & \text{if } n \text{ is even.} \end{cases} \]
For grouped data,
\[ x_{\text{median}} = L + \left(\frac{\frac{n}{2} - CF}{f}\right) h, \]
where L = lower limit of the median class, CF = cumulative frequency of the classes
below the median class, h = class width of the median class, and f = frequency of
the median class.
Example
Suppose the data set is the following: 1.7, 2.2, 3.9, 3.11, and 14.7.
Arranged in increasing order the values are 1.7, 2.2, 3.11, 3.9, 14.7, so the
sample mean and median are, respectively, $\bar{x} = 5.12$ and $x_{\text{median}} = 3.11$.
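The two median rules can be sketched as follows. The function names and the frequency table are hypothetical, and the grouped version assumes all classes share the same width h.

```python
def sample_median(xs):
    """Median of raw data: sort first, then take the middle value(s)."""
    data = sorted(xs)
    n = len(data)
    mid = n // 2
    if n % 2 == 1:
        return data[mid]                       # position (n+1)/2, 1-based
    return 0.5 * (data[mid - 1] + data[mid])   # average of positions n/2 and n/2+1


def grouped_median(lower_limits, freqs, h):
    """Grouped-data median: L + ((n/2 - CF) / f) * h."""
    n = sum(freqs)
    cf = 0                                     # cumulative frequency below the class
    for lower, f in zip(lower_limits, freqs):
        if cf + f >= n / 2:                    # first class that reaches n/2
            return lower + (n / 2 - cf) / f * h
        cf += f


print(sample_median([4, 1, 3]))       # odd n
print(sample_median([4, 1, 3, 2]))    # even n

# Hypothetical classes 0-10, 10-20, 20-30, 30-40 with these frequencies.
print(grouped_median([0, 10, 20, 30], [5, 8, 12, 5], h=10))
```

Sorting inside `sample_median` matters whenever the observations are not already listed in increasing order.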

Definition
The sample mode is the most frequently occurring data value.
Mode of grouped data:
\[ \text{mode} = L + \left(\frac{f_1 - f_0}{2f_1 - f_0 - f_2}\right) h, \]
where L = lower class limit of the modal class, $f_1$ is the frequency of the
modal class, $f_0$ is the frequency of the class below the modal class, $f_2$ is
the frequency of the class above the modal class, while h is the class width
of the modal class.
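A rough sketch of the grouped-data mode formula. The function name and the frequency table are hypothetical; treating a missing neighbouring class as having frequency 0 is an assumption for the case where the modal class is the first or last class.

```python
def grouped_mode(lower_limits, freqs, h):
    """Grouped-data mode: L + ((f1 - f0) / (2*f1 - f0 - f2)) * h."""
    i = max(range(len(freqs)), key=lambda k: freqs[k])    # index of the modal class
    f1 = freqs[i]
    f0 = freqs[i - 1] if i > 0 else 0                     # class below (assumed 0 if none)
    f2 = freqs[i + 1] if i + 1 < len(freqs) else 0        # class above (assumed 0 if none)
    return lower_limits[i] + (f1 - f0) / (2 * f1 - f0 - f2) * h


# Hypothetical classes 0-10, 10-20, 20-30, 30-40.
print(grouped_mode([0, 10, 20, 30], [5, 8, 12, 5], h=10))
```

The formula breaks down when the modal class and both neighbours have equal frequencies, because the denominator is then zero.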

The quartiles of a set of values are the three points that divide the data
into four equal parts each representing a fourth of the population being
sampled.
first quartile (designated Q1) = lower quartile = cuts off lowest 25%
of data = 25th percentile = $x_{\left(\frac{n+1}{4}\right)}$
second quartile (designated Q2) = median = cuts data set in half =
50th percentile = $x_{\left(\frac{n+1}{2}\right)}$
third quartile (designated Q3) = upper quartile = cuts off highest
25% of data, or lowest 75% = 75th percentile = $x_{\left(\frac{3(n+1)}{4}\right)}$
Here $x_{(k)}$ denotes the value at position k in the ordered data.
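The position rule for the quartiles can be sketched as below. The helper names are hypothetical, and linear interpolation for fractional positions is an assumption; the definitions above only give the positions (n+1)/4, (n+1)/2 and 3(n+1)/4.

```python
def value_at(sorted_data, pos):
    """Value at 1-based position `pos` in already-sorted data.

    Fractional positions are linearly interpolated between the two
    neighbouring values (an assumption, not stated in the slides).
    """
    n = len(sorted_data)
    pos = min(max(pos, 1), n)      # keep the position inside 1..n
    lower = int(pos)               # whole part of the position
    frac = pos - lower             # fractional part
    if frac == 0:
        return sorted_data[lower - 1]
    return sorted_data[lower - 1] + frac * (sorted_data[lower] - sorted_data[lower - 1])


def quartiles(xs):
    """Q1, Q2, Q3 at positions (n+1)/4, (n+1)/2 and 3(n+1)/4."""
    data = sorted(xs)
    n = len(data)
    return tuple(value_at(data, k * (n + 1) / 4) for k in (1, 2, 3))


print(quartiles([1, 2, 3, 4, 5, 6, 7]))   # positions 2, 4 and 6
print(quartiles([4, 1, 3, 2]))            # positions 1.25, 2.5 and 3.75
```

Other textbooks and software packages use slightly different position rules, so small differences in reported quartiles are normal.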

Another feature of interest is the spread (or variability or dispersion or
scatter): how widely spread the data are about the mean (or other
measure of location).
Definition
The range is a very simple measure of spread defined, as its name
suggests, by the difference between the largest and smallest observations
in the data set.
\[ \text{Range} = \max_i (x_i) - \min_i (x_i). \]
Definition
The interquartile range (IQR) is another measure of spread which is like
the range but which is not affected by the data extremes. First we must
define the quartiles of a set of data.
The inter-quartile range is defined as Q3 − Q1.
The $p$th percentile corresponds to the $\left(\frac{p}{100} \times n + \frac{1}{2}\right)$th data value.
The variability or scatter in the data may be described by the sample
variance or the sample standard deviation.
Definition
Sample standard deviation: is the most commonly used measure of
spread. It is essentially a measure of how far on average the observations
are from the mean.
If $x_1, x_2, \ldots, x_n$ is a sample of n observations, the sample standard
deviation is
\[ s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}} = \sqrt{\frac{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}{n-1}} \tag{3} \]
The sample variance, $s^2$, is the square of the sample standard
deviation. Analogous to the sample variance $s^2$, the variability in the
population is defined by the population variance ($\sigma^2$).

Grouped data sample variance [H/W].
Example
An engineer is interested in testing the bias in a pH meter. Data are
collected on the meter by measuring the pH of a neutral substance (pH =
7.0). A sample of size 10 is taken, with results given by
7.07 7.00 7.10 6.97 7.00 7.03 7.01 7.01 6.98 7.08.
\[ \bar{x} = \frac{7.07 + 7.00 + 7.10 + \cdots + 7.08}{10} = 7.0250 \]
\[ s^2 = \frac{1}{9}\left[(7.07 - 7.025)^2 + (7.00 - 7.025)^2 + (7.10 - 7.025)^2 + \cdots + (7.08 - 7.025)^2\right] = 0.001939 \]
As a result, the sample standard deviation is given by
\[ s = \sqrt{0.001939} = 0.044. \]
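The pH example can be reproduced with a short script. It evaluates both forms of equation (3), the definition and the computational shortcut, and cross-checks them against Python's standard `statistics` module; the variable names are illustrative.

```python
import math
import statistics

ph = [7.07, 7.00, 7.10, 6.97, 7.00, 7.03, 7.01, 7.01, 6.98, 7.08]
n = len(ph)
mean = sum(ph) / n

# Definition form: squared deviations from the mean, divided by n - 1.
var_def = sum((x - mean) ** 2 for x in ph) / (n - 1)

# Shortcut form: (sum of x_i^2 - n * mean^2) / (n - 1).
var_short = (sum(x * x for x in ph) - n * mean ** 2) / (n - 1)

s = math.sqrt(var_def)

print(f"mean = {mean:.4f}, s^2 = {var_def:.6f}, s = {s:.3f}")
print(f"library check: s^2 = {statistics.variance(ph):.6f}")
```

The shortcut form subtracts two nearly equal large numbers, so in floating point it can lose accuracy when the mean is large relative to the spread; the definition form is usually the safer choice in code.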

Frequency tables and Graphical Descriptions
Stem-and-leaf diagram
A stem-and-leaf diagram is a good way to obtain an informative visual
display of a data set x1 , x2 , ..., xn where each number xi consists of at least
two digits. To construct a stem-and-leaf diagram, use the following steps:
1 Divide each number xi into two parts: a stem, consisting of one or
more of the leading digits and a leaf, consisting of the remaining digit.
2 List the stem values in a vertical column.
3 Record the leaf for each observation beside its stem.
4 Write the units for stems and leaves on the display.
The ordered stem-and-leaf display makes it relatively easy to find data
features such as percentiles, quartiles, and the median.
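The four steps can be sketched in a few lines. This is a minimal version that assumes non-negative integers, with the last digit as the leaf and the leading digits as the stem; the function names and the data are hypothetical.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Step 1: split each number into a stem (leading digits) and a leaf (last digit)."""
    groups = defaultdict(list)
    for v in sorted(values):            # sorting gives an ordered display
        stem, leaf = divmod(v, 10)
        groups[stem].append(leaf)
    return dict(groups)


def render(groups):
    """Steps 2 to 4: stems in a vertical column, leaves beside them, units noted."""
    lines = []
    for stem in range(min(groups), max(groups) + 1):    # keep stems with no leaves
        leaves = "".join(str(leaf) for leaf in groups.get(stem, []))
        lines.append(f"{stem:>2} | {leaves}")
    lines.append("Stem unit: 10, leaf unit: 1")
    return "\n".join(lines)


data = [62, 64, 66, 67, 65, 68, 61, 70, 59]   # hypothetical measurements
print(render(stem_and_leaf(data)))
```

Because the leaves are sorted within each stem, the median and quartiles can be read off by counting along the rows.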

Frequency distribution
A frequency distribution is a more compact summary of data than a
stem-and-leaf diagram. To construct a frequency distribution, we must
divide the range of the data into intervals, which are usually called class
intervals, cells, or bins.
Choosing the number of bins approximately equal to the square root of the
number of observations often works well in practice.
The histogram is a visual display of the frequency distribution. The stages
for constructing a histogram:
1 Label the bin (class interval) boundaries on a horizontal scale.
2 Mark and label the vertical scale with the frequencies or the relative
frequencies.
3 Above each bin, draw a rectangle whose height is equal to the
frequency (or relative frequency) corresponding to that bin.
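A small sketch of building a frequency distribution with equal-width bins, using the square-root rule as the default number of bins. The function name and data are hypothetical, and placing the largest observation in the last bin is a design choice of this sketch.

```python
import math


def frequency_distribution(xs, bins=None):
    """Return (bin edges, bin counts) for equal-width bins over the data range."""
    n = len(xs)
    if bins is None:
        bins = max(1, round(math.sqrt(n)))     # square-root rule of thumb
    lo, hi = min(xs), max(xs)
    width = (hi - lo) / bins
    if width == 0:                             # all observations identical
        return [lo, hi], [n]
    counts = [0] * bins
    for x in xs:
        i = min(int((x - lo) / width), bins - 1)   # the maximum goes in the last bin
        counts[i] += 1
    edges = [lo + k * width for k in range(bins + 1)]
    return edges, counts


data = list(range(1, 17))                      # hypothetical: 16 observations
edges, counts = frequency_distribution(data)
for left, right, c in zip(edges, edges[1:], counts):
    print(f"[{left:5.2f}, {right:5.2f})  {'*' * c}")   # crude text histogram
```

In practice the bin boundaries are often rounded to convenient values so that no observation falls exactly on an edge.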

The histogram, like the stem-and-leaf diagram, provides a visual
impression of the shape of the distribution of the measurements and
information about the central tendency and scatter or dispersion in
the data. The display often gives insight about possible choices of
probability distribution to use as a model for the population.
When the bins are of unequal width, the rectangle's area (not its
height) should be proportional to the bin frequency. This implies that
the rectangle height should be
\[ \text{Rectangle height (frequency density)} = \frac{\text{bin frequency}}{\text{bin width}}. \]
In passing from either the original data or stem-and-leaf diagram to a
frequency distribution or histogram, information is lost because we no
longer have the individual observations.
Histograms are relatively sensitive to the number of bins and their
width.
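For unequal bin widths, the frequency-density rule can be illustrated as follows. The bin edges and frequencies are made up for the illustration.

```python
def frequency_density(edges, freqs):
    """Rectangle height for each bin: bin frequency divided by bin width."""
    return [f / (right - left) for left, right, f in zip(edges, edges[1:], freqs)]


# Hypothetical bins [0, 10), [10, 20), [20, 50) with unequal widths.
edges = [0, 10, 20, 50]
freqs = [5, 10, 15]

heights = frequency_density(edges, freqs)
print(heights)

# Area of each rectangle (height * width) recovers the bin frequency.
areas = [h * (right - left) for h, left, right in zip(heights, edges, edges[1:])]
print(areas)
```

The third bin has the largest frequency but, being three times as wide, ends up no taller than the first; plotting raw frequencies instead would exaggerate it.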
The box-plot is a graphical display that simultaneously describes several
important features of a data set, such as center, spread, departure from
symmetry, and identification of unusual observations or outliers.
A box plot displays the three quartiles, the minimum, and the maximum of
the data on a rectangular box, aligned either horizontally or vertically.
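The five numbers a box plot displays can be computed with Python's standard `statistics` module. This is a sketch under the assumption that the module's default quartile method ("exclusive") is acceptable; it places the quartiles at positions k(n+1)/4 in the ordered data and interpolates linearly between neighbours. The function name and data are hypothetical.

```python
import statistics


def five_number_summary(xs):
    """Minimum, Q1, median, Q3 and maximum: the values drawn on a box plot."""
    q1, q2, q3 = statistics.quantiles(xs, n=4)   # default method="exclusive"
    return min(xs), q1, q2, q3, max(xs)


sample = [1, 2, 3, 4, 5, 6, 7]                   # hypothetical data
low, q1, med, q3, high = five_number_summary(sample)

print(f"min = {low}, Q1 = {q1}, median = {med}, Q3 = {q3}, max = {high}")
print(f"range = {high - low}, IQR = {q3 - q1}")
```

The box spans Q1 to Q3 with a line at the median, so its length is the interquartile range.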

Descriptive Statistics
Example
The female students in an undergraduate engineering core course at ASU
self-reported their heights to the nearest inch. The data are
62 64 66 67 65 68 61 65 67 65 64 63 67
68 64 66 68 69 65 67 62 66 68 67 66 65
69 65 70 65 67 68 65 63 64 67 67
(a) Calculate the sample mean and standard deviation of height.
(b) Construct a stem-and-leaf diagram for the height data and comment
on any important features that you notice.
(c) What is the median height of this group of female engineering
students?
(d) Construct a histogram for the female student height data.
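As a rough alternative to statistical software, parts (a) and (c) can be checked with Python's standard `statistics` module. The variable names are illustrative; the data are the 37 heights listed above.

```python
import statistics

heights = [
    62, 64, 66, 67, 65, 68, 61, 65, 67, 65, 64, 63, 67,
    68, 64, 66, 68, 69, 65, 67, 62, 66, 68, 67, 66, 65,
    69, 65, 70, 65, 67, 68, 65, 63, 64, 67, 67,
]

n = len(heights)
mean = statistics.mean(heights)       # part (a): sample mean
sd = statistics.stdev(heights)        # part (a): sample standard deviation (n - 1)
median = statistics.median(heights)   # part (c): middle value of the sorted data

print(f"n = {n}, mean = {mean:.2f}, s = {sd:.2f}, median = {median}")
```

With these data the mean is about 65.81 inches, the standard deviation about 2.11 inches, and the median 66 inches.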

Using Minitab


Example
An article in the Transactions of the Institution of Chemical Engineers
(Vol. 34, 1956, pp. 280-293) reported data from an experiment
investigating the effect of several process variables on the vapor phase
oxidation of naphthalene. A sample of the percentage mole conversion of
naphthalene to maleic anhydride follows: 4.2, 4.7, 4.7, 5.0, 3.8, 3.6, 3.0,
5.1, 3.1, 3.8, 4.8, 4.0, 5.2, 4.3, 2.8, 2.0, 2.8, 3.3, 4.8, 5.0.
(a) Calculate the sample mean.
(b) Calculate the sample variance and sample standard deviation.
(c) Construct a box plot of the data.

Using Minitab


The End
