CH1 and CH2 Definitions and Descriptive Statistics


HIT 215: Statistics for Engineers

Oliver Mhlanga

Harare Institute of Technology


omhlanga@hit.ac.zw, 0712531415

July 25, 2022

Introduction to Statistics

Oliver Mhlanga (HIT) July 25, 2022 1 / 29


Overview

1 Introduction to Statistics
Definitions

2 Descriptive Statistics
Measures of Location
Measures of Variability
Frequency tables and Graphical Descriptions



Introduction to Statistics

Definition
Data: any observations that have been collected.

Definition
Statistics is concerned with scientific methods for collecting, organizing,
summarising, presenting, and analyzing data as well as with drawing valid
conclusions and making reasonable decisions on the basis of such analysis.

Definition
Population is defined as the complete set of all elements being studied.

Definition
Sample: some subset of a population.



Introduction to Statistics

Definition
Information: data that have been recorded, classified, organised, related,
or interpreted within a framework so that meaning emerges.

Definition
Probability as a specific term is a measure of the likelihood that a
particular event will occur.

Definition
A random sample is one in which every member of the population has an
equal likelihood of appearing.



Introduction to Statistics

Definition
Descriptive Statistics: deals with procedures used to summarise the
information contained in a set of measurements.

Definition
Inferential Statistics: deals with procedures used to make inferences
(predictions) about a population parameter from information contained in
a sample.

Definition
A variable is a symbol, such as X, Y, H, x, or B, that can assume any of a
prescribed set of values, called the domain of the variable. If the variable
can assume only one value, it is called a constant.



Introduction to Statistics
Definition
Sampling: the practice of selecting a subset of individuals/items from
within a population to yield some knowledge about the whole population.

Two kinds of summary measures:
Population Parameter (P.P): a characteristic of a population.
Sample Statistic (S.S): a characteristic of a sample.

Definition
Census: collection of data from every member of the population.

Definition
Survey: a method used to collect information in a systematic way from
a sample of individuals/items.
Introduction to Statistics

Two types of data:


Qualitative data (categorical; nominal or ordinal): non-numeric, e.g. colour,
gender, race, religion.
Arithmetic operations on the values are meaningless.
Quantitative data: numerical, e.g. height, weight, speed, wages,
temperature, time.
Arithmetic operations on the values are meaningful.

Two types of quantitative data:


Discrete data: a finite or countable set of possible values, usually
counts, e.g. number of oranges or goats.
Continuous data: any value in an interval is possible (infinitely many
values), usually measurements.



Introduction to Statistics

Levels of measurements:
Nominal data: categories are not ordered, e.g. religion.
Ordinal data: categories can be ordered, but differences are not
meaningful, e.g. colour (spectrum).
Interval data: ordered, differences are meaningful, but there is no
"natural zero", e.g. temperature in °C.
Ratio data: ordered, differences are meaningful, and there is a
"natural zero", e.g. amount of money.



Introduction to Statistics

Elements of a Statistical problem:


1 a clear definition of the population and variable of interest.
2 a design of the experiment or sampling procedure.
3 collection and analysis of data (gathering and summarising data).
4 procedure for making predictions about the population based on
sample information.
5 a measure of ’goodness’ or reliability for the procedure.



Introduction to Statistics

Basic types of studies and the corresponding methods for collecting data:
A retrospective study using historical data. Data collected in the past
for other purposes.
An observational study. Data, presently collected by a passive
observer.
A designed experiment. Data collected in response to process input
changes.



Measures of Location
Measures of location are designed to provide the analyst with quantitative
values indicating where the center, or some other location, of the data lies.
Definition
Suppose that the observations in a sample are $x_1, x_2, \ldots, x_n$. The sample
mean, denoted by $\bar{x}$, is
\[
\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{x_1 + x_2 + \cdots + x_n}{n} \tag{1}
\]

If some results occur more than once, it is convenient to take frequencies
into account. If $f_i$ stands for the frequency of result $x_i$, equation (1)
becomes
\[
\bar{x} = \frac{\sum_i x_i f_i}{\sum_i f_i}, \tag{2}
\]
where the sums run over all distinct results $x_i$.
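As a rough illustration of equations (1) and (2), the following Python sketch computes the mean both ways. The function names and the small data set are made up for this example and are not part of the course material.

```python
from collections import Counter

def sample_mean(xs):
    """Equation (1): sum of the observations divided by their number."""
    return sum(xs) / len(xs)

def frequency_mean(values, freqs):
    """Equation (2): sum of x_i * f_i divided by the sum of f_i."""
    return sum(x * f for x, f in zip(values, freqs)) / sum(freqs)

# Hypothetical sample in which some results repeat
data = [2, 3, 3, 5, 5, 5, 7]
counts = Counter(data)                # result -> frequency
values = sorted(counts)               # distinct results x_i
freqs = [counts[v] for v in values]   # matching frequencies f_i

print(sample_mean(data))
print(frequency_mean(values, freqs))
```

Both calls print the same number, since equation (2) only regroups the terms of equation (1).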



Measures of Location

The sample mean, x̄ , is a reasonable estimate of the population mean, µ.


Other Measures of Location.
Definition
Trimmed mean: a p% trimmed mean is obtained by eliminating the
smallest (p/2)% data values and the largest (p/2)% data values and
averaging the remaining data values.

Example
Calculate a 40% trimmed mean of the following observations: 6, 8.1, 8.3,
9.1, 9.9.
Solution: dropping the smallest 20% (the value 6) and the largest 20% (the value 9.9)
leaves $\bar{x}_p = (8.1 + 8.3 + 9.1)/3 = 8.5$.
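The trimming rule can be sketched in Python as follows. The function name is hypothetical, and rounding the number of trimmed values down when $(p/2)\%$ of $n$ is not a whole number is an assumption, since the definition does not say how to handle that case.

```python
def trimmed_mean(xs, p):
    """p% trimmed mean: drop the smallest (p/2)% and largest (p/2)% of the values."""
    ordered = sorted(xs)
    n = len(ordered)
    k = int(n * p / 200)          # values cut from each end (rounded down: an assumption)
    kept = ordered[k:n - k]
    if not kept:
        raise ValueError("p is too large: no observations remain")
    return sum(kept) / len(kept)

observations = [6, 8.1, 8.3, 9.1, 9.9]
print(round(trimmed_mean(observations, 40), 4))   # 40% trimmed mean of the five observations
```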



Measures of Location

Definition
Given that the observations in a sample are $x_1, x_2, \ldots, x_n$, arranged in
increasing order of magnitude, the sample median is
\[
x_{\text{median}} =
\begin{cases}
x_{(n+1)/2}, & \text{if } n \text{ is odd},\\
\frac{1}{2}\left(x_{n/2} + x_{n/2+1}\right), & \text{if } n \text{ is even}.
\end{cases}
\]
For grouped data,
\[
x_{\text{median}} = L + \left(\frac{\frac{n}{2} - CF}{f}\right) h,
\]
where $L$ = lower limit of the median class, $CF$ = cumulative frequency of the
classes below the median class, $h$ = class width of the median class, and
$f$ = frequency of the median class.

Example
Suppose the data set is the following: 1.7, 2.2, 3.9, 3.11, and 14.7. Arranged in
increasing order the values are 1.7, 2.2, 3.11, 3.9, 14.7, so the sample mean and
median are, respectively, $\bar{x} = 5.12$ and $x_{\text{median}} = 3.11$.
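A minimal Python sketch of both median formulas is given below. The function names, the equal class width, and the grouped frequency table are assumptions made for illustration only. Note that the data must be sorted before the middle value is picked.

```python
def sample_median(xs):
    """Median of raw data: middle value (odd n) or mean of the two middle values (even n)."""
    ordered = sorted(xs)              # the formula assumes increasing order
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def grouped_median(lowers, h, freqs):
    """Grouped-data median L + ((n/2 - CF) / f) * h for classes of equal width h."""
    n = sum(freqs)
    cf = 0                            # cumulative frequency below the current class
    for lower, f in zip(lowers, freqs):
        if cf + f >= n / 2:           # first class whose cumulative frequency reaches n/2
            return lower + (n / 2 - cf) / f * h
        cf += f
    raise ValueError("empty frequency table")

print(sample_median([1.7, 2.2, 3.9, 3.11, 14.7]))

# Hypothetical classes 0-10, 10-20, 20-30, 30-40 with frequencies 5, 8, 12, 5
print(round(grouped_median([0, 10, 20, 30], 10, [5, 8, 12, 5]), 3))
```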



Measures of Location

Definition
The sample mode is the most frequently occurring data value.
Mode of grouped data:
\[
\text{mode} = L + \left(\frac{f_1 - f_0}{2f_1 - f_0 - f_2}\right) h,
\]
where $L$ = lower class limit of the modal class, $f_1$ is the frequency of the
modal class, $f_0$ is the frequency of the class below the modal class, $f_2$ is
the frequency of the class above the modal class, while $h$ is the class width
of the modal class.
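The grouped-data mode formula can be sketched in Python as below. The function name and the frequency table are hypothetical, and treating a missing neighbouring class as having frequency 0 is an assumption. The formula breaks down when the modal class and both neighbours have equal frequencies, because the denominator is then zero.

```python
def grouped_mode(lowers, h, freqs):
    """Grouped-data mode L + ((f1 - f0) / (2*f1 - f0 - f2)) * h for equal class width h."""
    m = max(range(len(freqs)), key=lambda i: freqs[i])   # index of the modal class
    f1 = freqs[m]
    f0 = freqs[m - 1] if m > 0 else 0                    # class below (0 if none: assumption)
    f2 = freqs[m + 1] if m < len(freqs) - 1 else 0       # class above (0 if none: assumption)
    return lowers[m] + (f1 - f0) / (2 * f1 - f0 - f2) * h

# Hypothetical classes 0-10, 10-20, 20-30, 30-40 with frequencies 5, 8, 12, 5
print(round(grouped_mode([0, 10, 20, 30], 10, [5, 8, 12, 5]), 3))
```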



Measures of Location

The quartiles of a set of values are the three points that divide the data
into four equal parts each representing a fourth of the population being
sampled.
first quartile (designated Q1) = lower quartile = cuts off lowest 25%
of data = 25th percentile = $x_{\left(\frac{n+1}{4}\right)}$
second quartile (designated Q2) = median = cuts data set in half =
50th percentile = $x_{\left(\frac{n+1}{2}\right)}$
third quartile (designated Q3) = upper quartile = cuts off highest
25% of data, or lowest 75% = 75th percentile = $x_{\left(\frac{3(n+1)}{4}\right)}$
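The positional rule above can be sketched in Python as follows. The helper names are hypothetical, and using linear interpolation between neighbouring values when a position such as $(n+1)/4$ is not a whole number is an assumption, since the slide does not specify that case.

```python
def value_at_position(ordered, pos):
    """Value at a 1-based, possibly fractional, position in sorted data."""
    n = len(ordered)
    pos = min(max(pos, 1), n)         # keep the position inside 1..n
    lo = int(pos)                     # whole part of the position
    frac = pos - lo                   # fractional part, used for interpolation
    if lo >= n:
        return ordered[-1]
    return ordered[lo - 1] + frac * (ordered[lo] - ordered[lo - 1])

def quartiles(xs):
    """Q1, Q2, Q3 at positions (n+1)/4, (n+1)/2 and 3(n+1)/4."""
    ordered = sorted(xs)
    n = len(ordered)
    return tuple(value_at_position(ordered, k * (n + 1) / 4) for k in (1, 2, 3))

print(quartiles([7, 1, 4, 2, 6, 3, 5]))        # n = 7: positions 2, 4, 6
print(quartiles([1, 2, 3, 4, 5, 6, 7, 8]))     # n = 8: positions 2.25, 4.5, 6.75
```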



Measures of Variability
Another feature of interest is the spread (or variability or dispersion or
scatter): how widely spread the data are about the mean (or other
measure of location).
Definition
The range is a very simple measure of spread defined, as its name
suggests, by the difference between the largest and smallest observations
in the data set.
\[
\text{Range} = \max_i (x_i) - \min_i (x_i).
\]

Definition
The interquartile range (IQR) is another measure of spread which is like
the range but which is not affected by the data extremes. First we must
define the quartiles of a set of data.
The inter-quartile range is defined as Q3 − Q1.
The $p$th percentile corresponds to the $\left(\frac{p}{100} \times n + \frac{1}{2}\right)$th data value.
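To see why the IQR is less affected by extremes than the range, the sketch below computes both for two hypothetical samples that differ only in their largest value. It relies on `statistics.quantiles` from the Python standard library (Python 3.8+), whose default `'exclusive'` method places the quartiles at positions based on $n+1$ with linear interpolation; this may differ slightly from the percentile position rule quoted above.

```python
import statistics

def spread(xs):
    """Return (range, IQR) of a sample."""
    q1, _, q3 = statistics.quantiles(xs, n=4)   # Q1, Q2, Q3
    return max(xs) - min(xs), q3 - q1

# Hypothetical samples: identical except for the largest observation
typical = [3, 5, 7, 8, 12, 13, 14, 18, 21, 23, 25]
extreme = [3, 5, 7, 8, 12, 13, 14, 18, 21, 23, 100]

print(spread(typical))
print(spread(extreme))
```

The range jumps when the largest value becomes extreme, while the IQR is unchanged.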
Measures of Variability

The variability or scatter in the data may be described by the sample
variance or the sample standard deviation.
Definition
Sample standard deviation: is the most commonly used measure of
spread. It is essentially a measure of how far on average the observations
are from the mean.
If $x_1, x_2, \ldots, x_n$ is a sample of $n$ observations, the sample standard deviation is
\[
s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}} = \sqrt{\frac{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}{n-1}} \tag{3}
\]

The sample variance, $s^2$, is the square of the sample standard
deviation. Analogous to the sample variance $s^2$, the variability in the
population is defined by the population variance ($\sigma^2$).



Measures of Variability

Grouped data sample variance [H/W].

Example
An engineer is interested in testing the bias in a pH meter. Data are
collected on the meter by measuring the pH of a neutral substance (pH =
7.0). A sample of size 10 is taken, with results given by
7.07 7.00 7.10 6.97 7.00 7.03 7.01 7.01 6.98 7.08.
\[
\bar{x} = \frac{7.07 + 7.00 + 7.10 + \cdots + 7.08}{10} = 7.0250
\]
\[
s^2 = \frac{1}{9}\left[(7.07 - 7.025)^2 + (7.00 - 7.025)^2 + (7.10 - 7.025)^2 + \cdots + (7.08 - 7.025)^2\right] = 0.001939
\]
As a result, the sample standard deviation is given by
$s = \sqrt{0.001939} = 0.044$.
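As a cross-check of the pH meter example, the sketch below evaluates both forms of the variance formula (the definition and the computational shortcut with $\sum x_i^2 - n\bar{x}^2$) in Python. The function names are illustrative only.

```python
import math

def sample_variance(xs):
    """Definition: sum of squared deviations from the mean, divided by n - 1."""
    n = len(xs)
    xbar = sum(xs) / n
    return sum((x - xbar) ** 2 for x in xs) / (n - 1)

def sample_variance_shortcut(xs):
    """Shortcut form: (sum of x_i^2 - n * xbar^2) / (n - 1)."""
    n = len(xs)
    xbar = sum(xs) / n
    return (sum(x * x for x in xs) - n * xbar ** 2) / (n - 1)

ph = [7.07, 7.00, 7.10, 6.97, 7.00, 7.03, 7.01, 7.01, 6.98, 7.08]

print(round(sum(ph) / len(ph), 4))                 # sample mean
print(round(sample_variance(ph), 6))               # sample variance
print(round(sample_variance_shortcut(ph), 6))      # same value via the shortcut
print(round(math.sqrt(sample_variance(ph)), 3))    # sample standard deviation
```

The shortcut form subtracts two large, nearly equal numbers, so in floating-point arithmetic the definition form is usually the safer choice.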



Frequency tables and Graphical Descriptions

Stem-and-leaf diagram
A stem-and-leaf diagram is a good way to obtain an informative visual
display of a data set x1 , x2 , ..., xn where each number xi consists of at least
two digits. To construct a stem-and-leaf diagram, use the following steps:
1 Divide each number xi into two parts: a stem, consisting of one or
more of the leading digits and a leaf, consisting of the remaining digit.
2 List the stem values in a vertical column.
3 Record the leaf for each observation beside its stem.
4 Write the units for stems and leaves on the display.
The ordered stem-and-leaf display makes it relatively easy to find data
features such as percentiles, quartiles, and the median.
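The construction steps can be sketched in Python as below. This is a minimal version that assumes non-negative integers and takes the last digit as the leaf; the function names and the sample values are hypothetical.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Map each stem (leading digits) to its sorted list of leaves (last digit)."""
    plot = defaultdict(list)
    for v in sorted(values):              # sorting gives an ordered display
        stem, leaf = divmod(v, 10)        # step 1: split into stem and leaf
        plot[stem].append(leaf)           # step 3: record the leaf beside its stem
    return dict(plot)

def print_stem_and_leaf(values):
    plot = stem_and_leaf(values)
    for stem in range(min(plot), max(plot) + 1):   # step 2: stems in a vertical column
        leaves = "".join(str(d) for d in plot.get(stem, []))
        print(f"{stem:>3} | {leaves}")
    print("Key: stem | leaf = 10 * stem + leaf")   # step 4: state the units

sample = [62, 64, 66, 67, 65, 68, 61, 70, 59, 73]
print_stem_and_leaf(sample)
```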



Frequency tables and Graphical Descriptions

Frequency distribution
A frequency distribution is a more compact summary of data than a
stem-and-leaf diagram. To construct a frequency distribution, we must
divide the range of the data into intervals, which are usually called class
intervals, cells, or bins.
Choosing the number of bins approximately equal to the square root of the
number of observations often works well in practice.
The histogram is a visual display of the frequency distribution. The stages
for constructing a histogram:
1 Label the bin (class interval) boundaries on a horizontal scale.
2 Mark and label the vertical scale with the frequencies or the relative
frequencies.
3 Above each bin, draw a rectangle where height is equal to the
frequency (or relative frequency) corresponding to that bin.
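A frequency distribution with the square-root rule for the number of bins might be built as in the sketch below. The function name and data are hypothetical, and the choices of equal-width bins spanning the data range and of placing the maximum in the last bin are assumptions.

```python
import math

def frequency_distribution(xs, bins=None):
    """Return (bin edges, counts) for equal-width bins over the data range."""
    n = len(xs)
    k = bins or round(math.sqrt(n))       # square-root rule when bins is not given
    lo, hi = min(xs), max(xs)
    width = (hi - lo) / k
    if width == 0:                        # all observations identical: one bin
        return [lo, hi], [n]
    counts = [0] * k
    for x in xs:
        idx = min(int((x - lo) / width), k - 1)   # the maximum goes in the last bin
        counts[idx] += 1
    edges = [lo + i * width for i in range(k + 1)]
    return edges, counts

data = [1, 2, 2, 3, 3, 3, 4, 4, 5, 5, 6, 7, 7, 8, 9, 9]   # n = 16, so 4 bins
edges, counts = frequency_distribution(data)
for left, right, c in zip(edges, edges[1:], counts):
    print(f"[{left}, {right}): {c}  {'#' * c}")            # crude text histogram
```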



Frequency tables and Graphical Descriptions

The histogram, like the stem-and-leaf diagram, provides a visual
impression of the shape of the distribution of the measurements and
information about the central tendency and scatter or dispersion in
the data. The display often gives insight about possible choices of
probability distribution to use as a model for the population.
When the bins are of unequal width, the rectangle's area (not its
height) should be proportional to the bin frequency. This implies that
the rectangle height should be
\[
\text{rectangle height (frequency density)} = \frac{\text{bin frequency}}{\text{bin width}}.
\]
In passing from either the original data or stem-and-leaf diagram to a
frequency distribution or histogram, information is lost because we no
longer have the individual observations.
Histograms are relatively sensitive to the number of bins and their
width.
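For bins of unequal width, the frequency density can be computed as in the short Python sketch below; the bin edges and frequencies are made-up values for illustration.

```python
def frequency_density(edges, counts):
    """Rectangle height for each bin: bin frequency divided by bin width."""
    return [c / (right - left) for left, right, c in zip(edges, edges[1:], counts)]

# Hypothetical bins 0-10, 10-20, 20-50 with frequencies 5, 10, 15
edges = [0, 10, 20, 50]
counts = [5, 10, 15]
heights = frequency_density(edges, counts)
print(heights)

# Each rectangle's area (height * width) recovers the bin frequency
areas = [h * (right - left) for h, left, right in zip(heights, edges, edges[1:])]
print(areas)
```

Although the widest bin has the largest frequency, its rectangle is no taller than the first one, which is the point of plotting density rather than raw frequency.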
Frequency tables and Graphical Descriptions
The box-plot is a graphical display that simultaneously describes several
important features of a data set, such as center, spread, departure from
symmetry, and identification of unusual observations or outliers.
A box plot displays the three quartiles, the minimum, and the maximum of
the data on a rectangular box, aligned either horizontally or vertically.
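The five numbers a box plot displays can be computed as in the sketch below, using `statistics.quantiles` from the Python standard library (Python 3.8+, default `'exclusive'` method). The slide does not say how unusual observations are identified; the $1.5 \times \text{IQR}$ fences used here are a common convention and are an assumption, as are the function names and the data.

```python
import statistics

def five_number_summary(xs):
    """Minimum, Q1, median, Q3 and maximum of a sample."""
    q1, q2, q3 = statistics.quantiles(xs, n=4)
    return min(xs), q1, q2, q3, max(xs)

def flag_outliers(xs, k=1.5):
    """Values beyond k * IQR from the quartiles (k = 1.5 is a convention, not from the slide)."""
    _, q1, _, q3, _ = five_number_summary(xs)
    iqr = q3 - q1
    return [x for x in xs if x < q1 - k * iqr or x > q3 + k * iqr]

data = [2, 4, 4, 5, 6, 7, 40]          # hypothetical sample with one extreme value
print(five_number_summary(data))
print(flag_outliers(data))
```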



Descriptive Statistics

Example
The female students in an undergraduate engineering core course at ASU
self-reported their heights to the nearest inch. The data are
62 64 66 67 65 68 61 65 67 65 64 63 67
68 64 66 68 69 65 67 62 66 68 67 66 65
69 65 70 65 67 68 65 63 64 67 67
(a) Calculate the sample mean and standard deviation of height.
(b) Construct a stem-and-leaf diagram for the height data and comment
on any important features that you notice.
(c) What is the median height of this group of female engineering
students?
(d) Construct a histogram for the female student height data.
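As a rough way to check answers to parts (a) and (c), the summary statistics can be computed with Python's standard `statistics` module. This is only a sketch for verification, not the method used in the course.

```python
import statistics

heights = [
    62, 64, 66, 67, 65, 68, 61, 65, 67, 65, 64, 63, 67,
    68, 64, 66, 68, 69, 65, 67, 62, 66, 68, 67, 66, 65,
    69, 65, 70, 65, 67, 68, 65, 63, 64, 67, 67,
]

mean = statistics.mean(heights)       # (a) sample mean
sd = statistics.stdev(heights)        # (a) sample standard deviation (n - 1 divisor)
median = statistics.median(heights)   # (c) median of the 37 heights

print(len(heights), round(mean, 3), round(sd, 3), median)
```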



Descriptive Statistics
Using Minitab



Descriptive Statistics

Example
An article in the Transactions of the Institution of Chemical Engineers
(Vol. 34, 1956, pp. 280-293) reported data from an experiment
investigating the effect of several process variables on the vapor phase
oxidation of naphthalene. A sample of the percentage mole conversion of
naphthalene to maleic anhydride follows: 4.2, 4.7, 4.7, 5.0, 3.8, 3.6, 3.0,
5.1, 3.1, 3.8, 4.8, 4.0, 5.2, 4.3, 2.8, 2.0, 2.8, 3.3, 4.8, 5.0.
(a) Calculate the sample mean.
(b) Calculate the sample variance and sample standard deviation.
(c) Construct a box plot of the data.
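A quick numerical check of parts (a) to (c) can be sketched with Python's `statistics` module. The quartiles come from `statistics.quantiles` with its default `'exclusive'` method (positions based on $n+1$, linear interpolation), so other software may report slightly different quartile values.

```python
import statistics

conversion = [4.2, 4.7, 4.7, 5.0, 3.8, 3.6, 3.0, 5.1, 3.1, 3.8,
              4.8, 4.0, 5.2, 4.3, 2.8, 2.0, 2.8, 3.3, 4.8, 5.0]

mean = statistics.mean(conversion)             # (a) sample mean
var = statistics.variance(conversion)          # (b) sample variance (n - 1 divisor)
sd = statistics.stdev(conversion)              # (b) sample standard deviation
q1, q2, q3 = statistics.quantiles(conversion, n=4)   # (c) quartiles for the box

print(round(mean, 3), round(var, 3), round(sd, 3))
print(min(conversion), round(q1, 3), round(q2, 3), round(q3, 3), max(conversion))
```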



Descriptive Statistics
Using Minitab



The End
