CH1 and CH2 Definitions and Descriptive Statistics

HIT 215: Statistics for Engineers

Oliver Mhlanga

Harare Institute of Technology, 0712531415

July 25, 2022

Introduction to Statistics

1 Introduction to Statistics

2 Descriptive Statistics
Measures of Location
Measures of Variability
Frequency tables and Graphical Descriptions

Introduction to Statistics

Data: any observations that have been collected.

Statistics is concerned with scientific methods for collecting, organizing,
summarising, presenting, and analyzing data as well as with drawing valid
conclusions and making reasonable decisions on the basis of such analysis.

Population is defined as the complete set of all elements being studied.

Sample: some subset of a population.


Information: data that have been recorded, classified, organised, related,
or interpreted within a framework so that meaning emerges.

Probability as a specific term is a measure of the likelihood that a
particular event will occur.

A random sample is one in which every member of the population has an
equal likelihood of appearing.
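As a minimal sketch of this definition, Python's standard random module can draw a simple random sample without replacement, so that every member of the population is equally likely to be chosen. The population of ID numbers and the seed below are hypothetical and serve only as an illustration.

```python
import random

# Hypothetical population: ID numbers 1..100 (not taken from the lecture).
population = list(range(1, 101))

rng = random.Random(2022)            # fixed seed so the draw can be repeated
sample = rng.sample(population, 10)  # simple random sample without replacement
print(sorted(sample))
```

Each run with the same seed returns the same ten IDs; removing the seed gives a fresh sample each time.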


Descriptive Statistics: deals with procedures used to summarise the
information contained in a set of measurements.

Inferential Statistics: deals with procedures used to make inferences
(predictions) about a population parameter from information contained in
a sample.

A variable is a symbol, such as X, Y, H, x, or B, that can assume any of a
prescribed set of values, called the domain of the variable. If the variable
can assume only one value, it is called a constant.

Sampling: the practice of selecting a subset of individuals/items from
within a population to yield some knowledge about the whole population.

Types of data:
Population Parameter (P.P): a characteristic of a population.
Sample Statistic (S.S): a characteristic of a sample.

Census: collection from every member of the population.

Survey: a method used to collect information in a systematic way from
a sample of individuals/items.

Two types of data:

Qualitative data (categorical, ordinal): non-numeric, e.g. colour,
gender, race, religion.
Mathematical operations are meaningless.
Quantitative data: numerical, e.g. height, weight, speed, wages,
temperature, time.
Mathematical operations are meaningful.

Two types of quantitative data:

Discrete data: countable or finite, e.g. number of oranges, goats etc.;
usually counts.
Continuous data: infinite number of possible values; usually measurements.


Levels of measurement:
Nominal data: categories not ordered, e.g. religion.
Ordinal data: can be ordered, differences are meaningless, e.g. ratings such as poor, fair, good.
Interval: ordered, differences are meaningful, but no "natural zero", e.g.
temperature in degrees Celsius.
Ratio: ordered, differences are meaningful, with a "natural zero", e.g.
amount of money.


Elements of a Statistical problem:

1 a clear definition of the population and variable of interest.
2 a design of the experiment or sampling procedure.
3 collection and analysis of data (gathering and summarising data).
4 procedure for making predictions about the population based on
sample information.
5 a measure of ’goodness’ or reliability for the procedure.


Basic types of studies and the corresponding methods for collecting data:

A retrospective study using historical data. Data collected in the past
for other purposes.
An observational study. Data presently collected by a passive observer.
A designed experiment. Data collected in response to process input changes.

Measures of Location
Measures of location are designed to provide the analyst with some
quantitative values of where the center, or some other location, of data is
located.
Suppose that the observations in a sample are $x_1, x_2, \ldots, x_n$. The sample
mean, denoted by $\bar{x}$, is
\[ \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{x_1 + x_2 + \cdots + x_n}{n} \tag{1} \]

If some results occur more than once, it is convenient to take frequencies
into account. If $f_i$ stands for the frequency of result $x_i$, equation (1)
becomes
\[ \bar{x} = \frac{\sum_i x_i f_i}{\sum_i f_i}, \tag{2} \]
where the sums run over all $i$.


The sample mean, x̄ , is a reasonable estimate of the population mean, µ.
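Equations (1) and (2) can be checked against each other with a short sketch. The observations below are hypothetical; the function names are illustrative and not part of the course material.

```python
def sample_mean(xs):
    """Equation (1): sum of the observations divided by n."""
    return sum(xs) / len(xs)


def frequency_mean(values, freqs):
    """Equation (2): sum of x_i * f_i divided by the sum of f_i."""
    return sum(x * f for x, f in zip(values, freqs)) / sum(freqs)


raw = [2, 3, 3, 5, 5, 5]                # hypothetical observations
values, freqs = [2, 3, 5], [1, 2, 3]    # the same data as value/frequency pairs

print(sample_mean(raw))
print(frequency_mean(values, freqs))
```

Both calls return the same number, since the frequency form only groups repeated values before summing.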

Other Measures of Locations.
Trimmed mean: a p% trimmed mean is obtained by eliminating the
smallest (p/2)% data values and the largest (p/2)% data values and
averaging the remaining data values.

Calculate a 40% trimmed mean of the following observations: 6, 8.1, 8.3,
9.1, 9.9.
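Solution: removing the smallest 20% (the value 6) and the largest 20% (the
value 9.9) of the five observations leaves
$\bar{x}_p = (8.1 + 8.3 + 9.1)/3 = 8.5$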
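The trimming procedure can be sketched as follows. Rounding the number of trimmed values down to a whole number is an assumption made here for cases where (p/2)% of n is not an integer; the slides do not say how to handle that case. The sketch also assumes p is small enough that some values remain after trimming.

```python
def trimmed_mean(xs, p):
    """p% trimmed mean: drop the smallest and largest (p/2)% of the values."""
    data = sorted(xs)
    k = int(len(data) * p / 200)    # values cut from each end (rounded down: an assumption)
    kept = data[k:len(data) - k]
    return sum(kept) / len(kept)


observations = [6, 8.1, 8.3, 9.1, 9.9]
print(trimmed_mean(observations, 40))
```

With p = 0 the function reduces to the ordinary sample mean.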


Given that the observations in a sample are x1 , x2 , ..., xn , arranged in
increasing order of magnitude, the sample median is
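\[ x_{\text{median}} = x_{(n+1)/2} \ \text{if } n \text{ is odd}, \qquad x_{\text{median}} = \tfrac{1}{2}\left(x_{n/2} + x_{n/2+1}\right) \ \text{if } n \text{ is even}. \]
For grouped data,
\[ x_{\text{median}} = L + \left(\frac{\frac{n}{2} - CF}{f}\right) h, \]
where L = lower limit of median class, CF = cumulative frequency of classes
below the median class, h = class width of median class, f = frequency of
median class.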

Suppose the data set is the following: 1.7, 2.2, 3.9, 3.11, and 14.7.
Arranged in increasing order the values are 1.7, 2.2, 3.11, 3.9, 14.7, so the
sample mean and median are, respectively, $\bar{x} = 5.12$ and $x_{\text{median}} = 3.11$.
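A minimal sketch of both median formulas follows. Note that the values must be sorted before the middle one is picked: 3.11 is smaller than 3.9, so it is the third value in increasing order. The grouped-data classes and frequencies are hypothetical and chosen only to exercise the formula.

```python
import statistics


def sample_median(xs):
    """Median of ungrouped data; the values are sorted first."""
    data = sorted(xs)
    n = len(data)
    mid = n // 2
    if n % 2 == 1:
        return data[mid]                       # x_{(n+1)/2} in 1-based indexing
    return (data[mid - 1] + data[mid]) / 2     # average of x_{n/2} and x_{n/2+1}


def grouped_median(L, n, CF, f, h):
    """Grouped-data median: L + ((n/2 - CF) / f) * h."""
    return L + (n / 2 - CF) / f * h


data = [1.7, 2.2, 3.9, 3.11, 14.7]
print(sample_median(data), statistics.median(data))

# Hypothetical classes 0-10, 10-20, 20-30 with frequencies 5, 8, 7 (n = 20).
# The median class is 10-20, so L = 10, CF = 5, f = 8, h = 10.
print(grouped_median(L=10, n=20, CF=5, f=8, h=10))
```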


The sample mode is the most frequently occurring data value.
Mode of grouped data:
\[ \text{mode} = L + \left(\frac{f_1 - f_0}{2f_1 - f_0 - f_2}\right) h, \]
where L = lower class limit of the modal class, $f_1$ is the frequency of the
modal class, $f_0$ is the frequency of the class below the modal class, $f_2$ is
the frequency of the class above the modal class, while h is the class width
of the modal class.
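The two definitions of the mode can be sketched as below. The data and the class frequencies are hypothetical; when several values tie for the highest count, this sketch simply returns the one met first.

```python
from collections import Counter


def sample_mode(xs):
    """Most frequently occurring value (the first one met if there is a tie)."""
    return Counter(xs).most_common(1)[0][0]


def grouped_mode(L, f0, f1, f2, h):
    """Grouped-data mode: L + ((f1 - f0) / (2*f1 - f0 - f2)) * h."""
    return L + (f1 - f0) / (2 * f1 - f0 - f2) * h


print(sample_mode([2, 3, 3, 5, 5, 5]))

# Hypothetical classes 0-10, 10-20, 20-30 with frequencies 5, 8, 7.
# The modal class is 10-20, so L = 10, f0 = 5, f1 = 8, f2 = 7, h = 10.
print(grouped_mode(L=10, f0=5, f1=8, f2=7, h=10))
```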


The quartiles of a set of values are the three points that divide the data
into four equal parts, each representing a fourth of the population being
sampled. With the data arranged in increasing order:
first quartile (designated Q1) = lower quartile = cuts off lowest 25%
of data = 25th percentile = $x_{\left(\frac{n+1}{4}\right)}$

second quartile (designated Q2) = median = cuts data set in half =
50th percentile = $x_{\left(\frac{n+1}{2}\right)}$

third quartile (designated Q3) = upper quartile = cuts off highest
25% of data, or lowest 75% = 75th percentile = $x_{\left(\frac{3(n+1)}{4}\right)}$
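The quartile positions (n+1)/4, (n+1)/2 and 3(n+1)/4 can be sketched as follows. When a position is not a whole number, this sketch interpolates linearly between the two neighbouring ordered values; that interpolation rule is an assumption, since the slides only give the positions. The data are hypothetical.

```python
def value_at_position(sorted_xs, pos):
    """Value at 1-based position pos, interpolating linearly between
    neighbours when pos is not a whole number (an assumption)."""
    lo = int(pos)            # whole part of the position
    frac = pos - lo          # fractional part
    if lo < 1:
        return sorted_xs[0]
    if lo >= len(sorted_xs):
        return sorted_xs[-1]
    return sorted_xs[lo - 1] + frac * (sorted_xs[lo] - sorted_xs[lo - 1])


def quartiles(xs):
    """Q1, Q2, Q3 at positions (n+1)/4, 2(n+1)/4 and 3(n+1)/4."""
    data = sorted(xs)
    n = len(data)
    return tuple(value_at_position(data, k * (n + 1) / 4) for k in (1, 2, 3))


data = [3, 7, 8, 5, 12, 14, 21, 13, 18]   # hypothetical observations, n = 9
print(quartiles(data))
```

Here the positions are 2.5, 5 and 7.5, so Q1 and Q3 fall halfway between two ordered values and Q2 is the fifth ordered value.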

Measures of Variability
Another feature of interest is the spread (or variability or dispersion or
scatter): how widely spread the data are about the mean (or other
measure of location).
The range is a very simple measure of spread defined, as its name
suggests, by the difference between the largest and smallest observations
in the data set.
\[ \text{Range} = \max_i (x_i) - \min_i (x_i). \]

The interquartile range (IQR) is another measure of spread which is like
the range but which is not affected by the data extremes. First we must
define the quartiles of a set of data.
The inter-quartile range is defined as Q3 − Q1.
The $p$th percentile corresponds to the $\left(\frac{p}{100} \times n + \frac{1}{2}\right)$th data value.
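The range, the inter-quartile range and the percentile position can be sketched as below. The quartiles are taken from statistics.quantiles in the Python standard library, whose default method places them at (n+1)-based positions; the data are hypothetical.

```python
import statistics


def data_range(xs):
    """Largest observation minus smallest observation."""
    return max(xs) - min(xs)


def interquartile_range(xs):
    """Q3 - Q1, with quartiles from the standard library's default method."""
    q1, _, q3 = statistics.quantiles(xs, n=4)
    return q3 - q1


def percentile_position(p, n):
    """1-based position of the p-th percentile: (p/100) * n + 1/2."""
    return p / 100 * n + 0.5


data = [3, 7, 8, 5, 12, 14, 21, 13, 18]   # hypothetical observations, n = 9
print(data_range(data))
print(interquartile_range(data))
print(percentile_position(50, len(data)))
```

For p = 50 the percentile position agrees with the median position (n+1)/2; for other values of p the two positioning rules can differ slightly.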

The variability or scatter in the data may be described by the sample
variance or the sample standard deviation.
Sample standard deviation: is the most commonly used measure of
spread. It is essentially a measure of how far on average the observations
are from the mean.
If $x_1, x_2, \ldots, x_n$ is a sample of $n$ observations, the sample standard
deviation is
\[ s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}} = \sqrt{\frac{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}{n-1}} \tag{3} \]

The sample variance, $s^2$, is the square of the sample standard
deviation. Analogous to the sample variance $s^2$, the variability in the
population is defined by the population variance ($\sigma^2$).
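The two forms in equation (3) can be compared numerically with a short sketch. The data are hypothetical, and the function names are illustrative only.

```python
import math
import statistics


def sample_variance(xs):
    """Definitional form: sum of squared deviations from the mean over n - 1."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)


def sample_variance_shortcut(xs):
    """Shortcut form: (sum of x_i^2 - n * mean^2) over n - 1."""
    n = len(xs)
    mean = sum(xs) / n
    return (sum(x * x for x in xs) - n * mean ** 2) / (n - 1)


data = [2, 4, 4, 4, 5, 5, 7, 9]    # hypothetical observations
s2 = sample_variance(data)
s = math.sqrt(s2)                  # sample standard deviation
print(s2, sample_variance_shortcut(data), statistics.variance(data), s)
```

The shortcut form needs only one pass over the data, but it subtracts two large numbers and can lose accuracy in floating-point arithmetic when the mean is large relative to the spread.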


Grouped data sample variance [H/W].

An engineer is interested in testing the bias in a pH meter. Data are
collected on the meter by measuring the pH of a neutral substance (pH =
7.0). A sample of size 10 is taken, with results given by
7.07 7.00 7.10 6.97 7.00 7.03 7.01 7.01 6.98 7.08.
The sample mean is
\[ \bar{x} = \frac{7.07 + 7.00 + 7.10 + \cdots + 7.08}{10} = 7.0250 \]
and the sample variance is
\[ s^2 = \frac{1}{9}\left[(7.07 - 7.025)^2 + (7.00 - 7.025)^2 + (7.10 - 7.025)^2 + \cdots + (7.08 - 7.025)^2\right] = 0.001939. \]
As a result, the sample standard deviation is given by
\[ s = \sqrt{0.001939} = 0.044. \]
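The pH figures above can be reproduced with Python's standard statistics module, whose variance and stdev functions use the n - 1 divisor. This is only a check of the arithmetic, not part of the original example.

```python
import statistics

# pH readings from the example above
ph = [7.07, 7.00, 7.10, 6.97, 7.00, 7.03, 7.01, 7.01, 6.98, 7.08]

mean = statistics.mean(ph)
variance = statistics.variance(ph)   # sample variance, divisor n - 1
stdev = statistics.stdev(ph)         # sample standard deviation

print(f"mean = {mean:.4f}, s^2 = {variance:.6f}, s = {stdev:.3f}")
```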

Frequency tables and Graphical Descriptions

Stem-and-leaf diagram
A stem-and-leaf diagram is a good way to obtain an informative visual
display of a data set x1 , x2 , ..., xn where each number xi consists of at least
two digits. To construct a stem-and-leaf diagram, use the following steps:
1 Divide each number xi into two parts: a stem, consisting of one or
more of the leading digits and a leaf, consisting of the remaining digit.
2 List the stem values in a vertical column.
3 Record the leaf for each observation beside its stem.
4 Write the units for stems and leaves on the display
The ordered stem-and-leaf display makes it relatively easy to find data
features such as percentiles, quartiles, and the median.
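The construction steps above can be sketched for non-negative whole numbers, taking the units digit as the leaf and the remaining leading digits as the stem. The data are hypothetical, and stems that have no leaves are simply omitted in this sketch.

```python
from collections import defaultdict


def stem_and_leaf(xs):
    """Map each stem (leading digits) to its ordered list of leaves (units digit)."""
    stems = defaultdict(list)
    for x in sorted(xs):
        stem, leaf = divmod(x, 10)   # step 1: split each number into stem and leaf
        stems[stem].append(leaf)     # step 3: record the leaf beside its stem
    return dict(stems)


data = [23, 25, 31, 38, 38, 42, 47, 49, 50]   # hypothetical two-digit observations

for stem, leaves in stem_and_leaf(data).items():   # step 2: stems in a vertical column
    print(f"{stem} | {' '.join(str(leaf) for leaf in leaves)}")
print("Key: 2 | 3 means 23")                       # step 4: state the units
```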


Frequency distribution
A frequency distribution is a more compact summary of data than a
stem-and-leaf diagram. To construct a frequency distribution, we must
divide the range of the data into intervals, which are usually called class
intervals, cells, or bins.
Choosing the number of bins approximately equal to the square root of the
number of observations often works well in practice.
The histogram is a visual display of the frequency distribution. The stages
for constructing a histogram:
1 Label the bin (class interval) boundaries on a horizontal scale.
2 Mark and label the vertical scale with the frequencies or the relative
frequencies.
3 Above each bin, draw a rectangle whose height is equal to the
frequency (or relative frequency) corresponding to that bin.
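A frequency distribution with roughly the square root of n bins can be sketched as below. Equal-width bins spanning the minimum to the maximum are an assumption made for simplicity, as is placing the largest value in the last bin; the data are hypothetical and must not all be equal.

```python
import math


def frequency_distribution(xs, num_bins=None):
    """Return (bin edges, counts) for equal-width bins from min(xs) to max(xs)."""
    n = len(xs)
    if num_bins is None:
        num_bins = max(1, round(math.sqrt(n)))   # square-root rule of thumb
    lo, hi = min(xs), max(xs)
    width = (hi - lo) / num_bins
    counts = [0] * num_bins
    for x in xs:
        idx = min(int((x - lo) / width), num_bins - 1)   # the maximum goes in the last bin
        counts[idx] += 1
    edges = [lo + i * width for i in range(num_bins + 1)]
    return edges, counts


data = [0, 3, 7, 12, 15, 18, 19, 21, 22, 25, 28, 30, 33, 36, 39, 40]   # hypothetical, n = 16
edges, counts = frequency_distribution(data)

# Text histogram: one row per bin, bar length equal to the frequency.
for left, right, count in zip(edges, edges[1:], counts):
    print(f"[{left:5.1f}, {right:5.1f})  {'#' * count}")
```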


The histogram, like the stem-and-leaf diagram, provides a visual
impression of the shape of the distribution of the measurements and
information about the central tendency and scatter or dispersion in
the data. The display often gives insight about possible choices of
probability distribution to use as a model for the population.
When the bins are of unequal width, the rectangle's area (not its
height) should be proportional to the bin frequency. This implies that
the rectangle height should be
\[ \text{Rectangle height (frequency density)} = \frac{\text{bin frequency}}{\text{bin width}}. \]
In passing from either the original data or stem-and-leaf diagram to a
frequency distribution or histogram, information is lost because we no
longer have the individual observations.
Histograms are relatively sensitive to the number of bins and their widths.
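For the unequal-width case described above, the rectangle heights can be sketched as frequency densities. The bin edges and frequencies below are hypothetical.

```python
def frequency_density(edges, counts):
    """Rectangle height for each bin: bin frequency divided by bin width."""
    return [c / (right - left) for left, right, c in zip(edges, edges[1:], counts)]


edges = [0, 10, 20, 50]    # hypothetical bins: widths 10, 10 and 30
counts = [5, 10, 15]       # hypothetical bin frequencies

heights = frequency_density(edges, counts)
print(heights)

# Area of each rectangle (height * width) recovers the bin frequency.
areas = [h * (right - left) for h, left, right in zip(heights, edges, edges[1:])]
print(areas)
```

The third bin has the largest frequency but, being three times as wide, is drawn no taller than the first.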
The box-plot is a graphical display that simultaneously describes several
important features of a data set, such as center, spread, departure from
symmetry, and identification of unusual observations or outliers.
A box plot displays the three quartiles, the minimum, and the maximum of
the data on a rectangular box, aligned either horizontally or vertically.
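The five values a box plot displays can be computed with a short sketch. The quartiles come from statistics.quantiles with its default method, which is one of several quartile conventions and is an assumption here; the data are hypothetical.

```python
import statistics


def five_number_summary(xs):
    """Minimum, Q1, median (Q2), Q3 and maximum of the data."""
    q1, q2, q3 = statistics.quantiles(xs, n=4)
    return min(xs), q1, q2, q3, max(xs)


data = [3, 7, 8, 5, 12, 14, 21, 13, 18]   # hypothetical observations
minimum, q1, q2, q3, maximum = five_number_summary(data)
print(minimum, q1, q2, q3, maximum)
```

The box spans Q1 to Q3 with a line at Q2, and the whiskers extend toward the minimum and the maximum.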

Descriptive Statistics

The female students in an undergraduate engineering core course at ASU
self-reported their heights to the nearest inch. The data are
62 64 66 67 65 68 61 65 67 65 64 63 67
68 64 66 68 69 65 67 62 66 68 67 66 65
69 65 70 65 67 68 65 63 64 67 67
(a) Calculate the sample mean and standard deviation of height.
(b) Construct a stem-and-leaf diagram for the height data and comment
on any important features that you notice.
(c) What is the median height of this group of female engineering students?
(d) Construct a histogram for the female student height data.

Using Minitab


An article in the Transactions of the Institution of Chemical Engineers
(Vol. 34, 1956, pp. 280-293) reported data from an experiment
investigating the effect of several process variables on the vapor phase
oxidation of naphthalene. A sample of the percentage mole conversion of
naphthalene to maleic anhydride follows: 4.2, 4.7, 4.7, 5.0, 3.8, 3.6, 3.0,
5.1, 3.1, 3.8, 4.8, 4.0, 5.2, 4.3, 2.8, 2.0, 2.8, 3.3, 4.8, 5.0.
(a) Calculate the sample mean.
(b) Calculate the sample variance and sample standard deviation.
(c) Construct a box plot of the data.


Using Minitab


The End

