
ENGINEERING DATA ANALYSIS
MODULE 1.1: Introduction to Statistics and Data Analysis
DEFINITION
In the plural sense, STATISTICS means a set of numerical facts and figures. We say, for example, “statistics on birth, statistics on crime, statistics on unemployment or statistics on marriage”.

In the singular sense, STATISTICS is a body of knowledge concerned with collection, presentation, analysis and interpretation of data.
CATEGORIES
Descriptive Statistics: deals with the methods of organizing, summarizing and presenting a mass of data.

Inferential Statistics: concerned with making generalizations about a body of data where only a part of it is examined.
TERMS
1. Universe- the set of all the individuals or
entities under consideration
2. Measurement- the assignment of
numbers to objects or events, observations
according to logically accepted rules
3. Random Variable- a characteristic of
interest measurable on each and every
individual of the universe.
TERMS
4. Quantitative Variable- a variable which may take
on numerical values; observations vary in degree
5. Qualitative variable- a variable which takes on non-
numerical values; observations vary in kind
6. Discrete variable- observations take a finite or countable
number of values
7. Continuous variable- observations have an infinite
number of values; a variable which can assume any
value in a given interval of values.
TERMS
8. Constant- observations do not vary
9. Population – the set of all possible values of
a variable
10. Sample- a subset of the population
11. Parameter- a numerical characteristic of a
population
12. Statistic- a quantity calculated from the
observations in a sample
SCALES OF
MEASUREMENT
1. Nominal
2. Ordinal
3. Interval
4. Ratio
SAMPLING PROCEDURES & COLLECTION OF DATA
Once the research problem and design have been identified, the next step is to collect information.
Sampling methods often use what are called random numbers to select samples.

There are two types of sampling techniques:
1. Probability Sampling Technique
2. Non-Probability Sampling Technique
Simple random sampling
In the simple random sampling technique,
every item in the population has an equal
chance of being selected in the sample.
Since the selection of items depends entirely
on chance, this method is known as the
“Method of Chance Selection”. When the
sample size is large and the items are
chosen randomly, it is also known as
“Representative Sampling”.
SIMPLE RANDOM SAMPLING
Example:
Suppose we want to select a simple random
sample of 200 students from a school. Here, we
can assign a number to every student in the
school database from 1 to 500 and use a
random number generator to select a sample of
200 numbers.
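
The selection described in this example can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `simple_random_sample` and the `seed` parameter are assumptions introduced here, and student IDs are assumed to be the integers 1 to 500.

```python
import random

def simple_random_sample(population_size, sample_size, seed=None):
    """Draw sample_size distinct IDs from 1..population_size, each equally likely."""
    rng = random.Random(seed)                 # seed only makes the draw repeatable
    ids = range(1, population_size + 1)       # hypothetical student numbers
    return sorted(rng.sample(ids, sample_size))  # sampling without replacement

sample = simple_random_sample(500, 200, seed=42)
print(len(sample), sample[:5])
```

Because `random.sample` draws without replacement, no student can be picked twice.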
SYSTEMATIC
SAMPLING
In the systematic sampling method, items
are selected from the target population by
choosing a random starting point and then
selecting every subsequent item at a fixed
sampling interval. The interval is
calculated by dividing the total
population size by the desired
sample size.
SYSTEMATIC SAMPLING
Example:
Suppose the names of 300 students of a school
are sorted in reverse alphabetical order. To
select a systematic sample with a sampling
interval of 15, we randomly select a starting
number, say 5. From number 5 onward, we select
every 15th person from the sorted list (5, 20,
35, ..., 290). We end up with a sample of 20
students.
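
As a rough sketch of this procedure, the following Python snippet takes every 15th position from a list of 300, starting at position 5. The function name `systematic_sample` and its parameters are assumptions made for illustration; positions are assumed to be numbered from 1.

```python
import random

def systematic_sample(population_size, interval, start=None, seed=None):
    """Return the 1-based positions picked by systematic sampling."""
    if start is None:
        # Random starting point somewhere within the first interval.
        start = random.Random(seed).randint(1, interval)
    return list(range(start, population_size + 1, interval))

picked = systematic_sample(300, 15, start=5)
print(len(picked), picked[:3], picked[-1])  # 20 positions: 5, 20, 35, ..., 290
```

With 300 names and an interval of 15, the sample contains 300 / 15 = 20 students.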
STRATIFIED
SAMPLING
In the stratified sampling method,
the total population is divided
into smaller groups (strata) to
complete the sampling process. The
groups are formed based on a few
shared characteristics in the
population. After separating the
population into these groups,
the statisticians randomly select
the sample from each group.
STRATIFIED SAMPLING

Example:
Suppose there are three bags (A, B and C), each with different balls. Bag A
has 50 balls, bag B has 100 balls, and bag C has 200 balls. We have to choose a
sample of balls from each bag proportionally: say, 5 balls from bag A, 10 balls
from bag B and 20 balls from bag C.
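
The proportional allocation in this example (10% of each bag) might be sketched as follows. The helper name `proportional_stratified_sample` and the ball labels are hypothetical, introduced only for illustration.

```python
import random

def proportional_stratified_sample(strata, fraction, seed=None):
    """Randomly draw the same fraction of items from every stratum."""
    rng = random.Random(seed)
    sample = {}
    for name, items in strata.items():
        k = round(len(items) * fraction)      # stratum sample size
        sample[name] = rng.sample(items, k)   # simple random sample within the stratum
    return sample

# Hypothetical labels for the balls in bags A, B and C.
bags = {
    "A": [f"A{i}" for i in range(50)],
    "B": [f"B{i}" for i in range(100)],
    "C": [f"C{i}" for i in range(200)],
}
chosen = proportional_stratified_sample(bags, 0.10, seed=1)
sizes = {name: len(balls) for name, balls in chosen.items()}
print(sizes)  # {'A': 5, 'B': 10, 'C': 20}
```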
CLUSTERED
SAMPLING
In the clustered sampling
method, clusters or groups of
people are formed from the
population set. The groups have
similar significant
characteristics, and each has
an equal chance of being part
of the sample. This method applies
simple random sampling to the
clusters of the population.
CLUSTERED SAMPLING
Example:
An educational institution has ten branches across the
country with almost the same number of students. If we
want to collect some data regarding facilities and
other things, we can’t travel to every unit to collect
the required data. Hence, we can use random
sampling to select three or four branches as clusters.
All these four methods can be understood in a better
manner with the help of the figure given below. The
figure contains various examples of how samples will
be taken from the population using different
techniques.
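
The clustered sampling example with ten branches can be sketched as below. The branch names and the function `cluster_sample` are assumptions made for illustration; in practice every student in a selected branch would then be surveyed.

```python
import random

def cluster_sample(clusters, n_clusters, seed=None):
    """Pick whole clusters by simple random sampling."""
    rng = random.Random(seed)
    return sorted(rng.sample(clusters, n_clusters))

branches = [f"Branch {i}" for i in range(1, 11)]  # hypothetical branch names
selected = cluster_sample(branches, 3, seed=7)
print(selected)
```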
NON-PROBABILITY
SAMPLING TECHNIQUE
In non-probability sampling, not every member of the
population has an equal chance of being
selected. Selection can rely on the subjective
judgment of the researcher.
This method of sampling is
resorted to when it is difficult to estimate
the population of the study because its
members are moving or in transition in a given
location. This method is also useful in
exploratory or descriptive studies with a
qualitative implication. Its purpose is to
characterize the direction of the study
rather than to quantify it.
Thus, the respondents are chosen
on the basis of specific criteria
formulated by the researcher, rather
than randomly selected.
ACCIDENTAL OR
CONVENIENCE
SAMPLING
This method is implemented
by seeking out elements who are
readily available to respond to a
question. In other words, the first
person who comes along and
typifies the candidate serves as
the respondent of the study.
PURPOSIVE SAMPLING
Purposive sampling is a useful methodology in
qualitative or exploratory studies, since the qualities of the
key person or informant are identified by the researcher.
The purpose is to get information from respondents
who are involved in the situation.
QUOTA SAMPLING
Quota sampling entails grouping elements
according to certain characteristics and
ensuring that each group is represented. This is
similar to stratified sampling minus
randomization.
SNOWBALL OR
REFERRAL SAMPLING
This involves having respondents
refer other people who are in a position to
answer some of the questions of the
researcher.
This sampling technique is useful for
highly sensitive topics where the identity of
respondents is difficult to divulge. Hence, for
topics that are highly confidential, a
referral system is appropriate.
ENGINEERING
DATA ANALYSIS
MODULE 1.2: Measures of Central Tendency and Dispersion
MEASURES OF
CENTRAL
TENDENCY
UNGROUPED DATA
MEASURES OF CENTRAL TENDENCY

Measures of Central Tendency or Location- a single value about which the set
of observations tends to cluster.
MEASURES OF CENTRAL TENDENCY

Arithmetic Mean- the sum of all the observations divided by the
total number of observations; denoted as 𝜇 (Greek letter mu)

$$\mu = \frac{1}{N}\sum_{i=1}^{N} X_i$$

Where $X_i$ is the value of the ith observation, i = 1, …, N
$N$ is the total number of observations
MEASURES OF CENTRAL TENDENCY

Median- a single value which divides an array (a data set
arranged in ascending or descending order) of
observations into two equal parts such that 50% of
the observations fall below it and 50% of the
observations fall above it; denoted as Md.

If N (no. of observations) is odd, the median is the
middle value of the array.
If N is even, the median is the mean of the two middle
values of the array.
MEASURES OF CENTRAL TENDENCY

Mode- the value which occurs most frequently in the
data set; denoted as Mo

For an ungrouped data set, the mode is simply the observation
with the highest frequency.
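
The three definitions for ungrouped data can be checked with a short Python sketch. The data values below are hypothetical, and the function names are chosen here for illustration. The `modes` helper returns every value tied for the highest frequency, since a data set may have more than one mode.

```python
from collections import Counter

def mean(values):
    """Arithmetic mean: sum of the observations divided by their count."""
    return sum(values) / len(values)

def median(values):
    """Middle value of the sorted array (mean of the two middle values if N is even)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def modes(values):
    """All values that occur with the highest frequency."""
    counts = Counter(values)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)

data = [3, 5, 5, 8, 9]  # hypothetical observations
print(mean(data), median(data), modes(data))  # 6.0 5 [5]
```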
MEASURES OF CENTRAL TENDENCY

Geometric mean
- the Nth root of the product of N positive numbers
- used mainly to average ratios, rates of change,
economic indices, etc.
- in practice, geometric means are calculated by
making use of the fact that the logarithm of the
geometric mean of a set of positive numbers equals the
arithmetic mean of their logarithms.
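
The logarithm shortcut mentioned in the last point can be sketched as follows. The function name `geometric_mean` is an illustrative choice and the input values are hypothetical.

```python
import math

def geometric_mean(values):
    """Geometric mean computed as exp(arithmetic mean of the logarithms)."""
    if any(v <= 0 for v in values):
        raise ValueError("geometric mean requires positive numbers")
    mean_log = sum(math.log(v) for v in values) / len(values)
    return math.exp(mean_log)

# The 2nd root of 2 * 8 = 16 is 4, so the result should be very close to 4.
gm = geometric_mean([2, 8])
print(round(gm, 6))
```

Working with logarithms avoids the very large or very small products that can arise when N is large.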
Comparison among measures of central tendency

MEAN
• Reflects the magnitude of every observation
• Easily affected by the presence of extreme values
• Most commonly used measure of central tendency (mct) because of its good statistical properties
• Most meaningful mct when there are no extreme values

MEDIAN
• It is a positional value and hence is not affected by the presence of extreme values (suggested mct when there are few extreme values)
• The median of grouped data can be calculated even with open-ended intervals provided the median class is not open-ended

MODE
• Determined by the frequency and not by the values of the observations
• Used when a quick measure of location is needed
• It cannot be manipulated algebraically
• Can be defined with quantitative as well as qualitative random variables
• Very much affected by the method of grouping data
• Can be computed with open-ended intervals provided the modal class is not open-ended
MEASURES OF
CENTRAL
TENDENCY
GROUPED DATA
MEDIAN OF GROUPED DATA
For grouped data, the median cannot be read off directly
from the observations, because the middle value lies
somewhere inside a class interval. It is therefore
necessary to find the value inside that class interval
that divides the whole distribution into two halves. To
do so, we must first find the median class.
To find the median class, we must find the cumulative
frequencies of all the classes and n/2. After that, locate
the first class whose cumulative frequency is greater than
(and nearest to) n/2. This class is called the median class.
MEDIAN OF GROUPED DATA

After finding the median class, use the formula below to find the median value.

$$\text{Median} = l + \left(\frac{\frac{n}{2} - cf}{f}\right) \times h$$

Where
l is the lower limit of the median class
n is the number of observations
f is the frequency of the median class
h is the class size
cf is the cumulative frequency of the class preceding the median class.


MEDIAN OF GROUPED DATA
EXAMPLE
The following data represent a survey of the heights (in cm) of 51 girls of Class
X. Find the median height.

Answer: Median = 149.03
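
The grouped-median calculation can be sketched in Python using the quantities defined for the median class (l, n, f, h, cf). The frequency table for this example is not reproduced in the text, so the class limits and frequencies below are illustrative assumptions, chosen so that they total 51 observations and give a result consistent with the stated answer.

```python
def grouped_median(lower_limits, frequencies, class_size):
    """Median of grouped data: l + ((n/2 - cf) / f) * h."""
    n = sum(frequencies)
    half = n / 2
    cumulative = 0  # cumulative frequency of the classes before the current one (cf)
    for lower, freq in zip(lower_limits, frequencies):
        # The median class is the first class whose cumulative frequency reaches n/2.
        if freq > 0 and cumulative + freq >= half:
            return lower + (half - cumulative) / freq * class_size
        cumulative += freq
    raise ValueError("frequencies must contain at least one positive count")

# Assumed classes 135-140, 140-145, ..., 160-165 (not taken from the source).
lower_limits = [135, 140, 145, 150, 155, 160]
frequencies = [4, 7, 18, 11, 6, 5]
median_height = grouped_median(lower_limits, frequencies, class_size=5)
print(round(median_height, 2))  # 149.03
```

Here n/2 = 25.5, the median class is 145-150 with cf = 11 and f = 18, so the median is 145 + (14.5 / 18) × 5.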


MODE OF GROUPED DATA
In the case of grouped data, it is not possible to identify the mode of the data
by looking at the frequencies alone. Instead, we locate the class with the
maximum frequency, called the modal class. Inside the modal class, we can
locate the mode value of the data by using the formula,
MODE OF GROUPED DATA

$$\text{Mode} = l + \left(\frac{f_1 - f_0}{2f_1 - f_0 - f_2}\right) \times h$$

Where,
f1 is the frequency of the modal class
f0 is the frequency of the class preceding the modal class
f2 is the frequency of the class succeeding the modal class
h is the size of the class intervals
l is the lower limit of the modal class


MODE OF GROUPED DATA
EXAMPLE
A survey has been conducted by a group of students on 20 households in a locality as
shown in the following frequency distribution table. Find the mode for the given data.

Answer: Mode = 3.286.
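
A minimal Python sketch of the grouped-mode calculation follows, using the quantities l, f0, f1, f2 and h. The frequency table for this example is not reproduced in the text, so the classes and frequencies below are illustrative assumptions that total 20 households and give a result consistent with the stated answer.

```python
def grouped_mode(lower_limits, frequencies, class_size):
    """Mode of grouped data: l + ((f1 - f0) / (2*f1 - f0 - f2)) * h."""
    # Index of the modal class (first class with the maximum frequency).
    i = max(range(len(frequencies)), key=frequencies.__getitem__)
    f1 = frequencies[i]
    f0 = frequencies[i - 1] if i > 0 else 0                      # preceding class
    f2 = frequencies[i + 1] if i < len(frequencies) - 1 else 0   # succeeding class
    denominator = 2 * f1 - f0 - f2
    if denominator == 0:
        raise ValueError("formula is undefined when 2*f1 - f0 - f2 is zero")
    return lower_limits[i] + (f1 - f0) / denominator * class_size

# Assumed classes 1-3, 3-5, 5-7, 7-9, 9-11 (not taken from the source).
lower_limits = [1, 3, 5, 7, 9]
frequencies = [7, 8, 2, 2, 1]
mode_value = grouped_mode(lower_limits, frequencies, class_size=2)
print(round(mode_value, 3))  # 3.286
```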


MEASURES OF
VARIATION
MEASURES OF DISPERSION

Measures of Dispersion- a quantity that measures the spread
or variability of the observations in a given population

Illustration:
Data Set 1: 3,3,3,3,3
Data Set 2: 1,2,3,4,5
Data Set 3: 2,2,3,4,4

All three data sets have mean equal to 3 yet they are not
identical. There is a need for another quantity to measure the
spread of the values in a given population.
Some common measures of dispersion:

1. Range
2. Variance
3. Standard Deviation
4. Coefficient of Variation
Range- difference between the highest value and the lowest
value of the population

Example.
The range of the actual body weight values is 46.8 - 8.00 = 38.8.

Properties:
1. It is a quick but rough measure of dispersion.
2. The larger the value of the range, the more dispersed are
the observations.
3. It considers only the highest and lowest observations in the
population. Hence, it may not be reflective of the dispersion
characteristic of the majority.
Variance- mean of the squared deviations of the observations
from the mean, denoted by 𝜎²

$$\sigma^2 = \frac{\sum_{i=1}^{N}(X_i-\mu)^2}{N} = \frac{\sum_{i=1}^{N}X_i^2}{N} - \mu^2$$
Properties:

1. The variance is always non-negative.


2. A large variance corresponds to a highly dispersed set of
values.
3. The variance is easy to manipulate for further
mathematical treatment.
4. The variance makes use of all observations.
5. The variance comes in a unit of measure that is the square
of the unit of measure of the given set of values
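
To make the definition concrete, the sketch below computes the population variance of the three illustrative data sets given earlier (each with mean 3), both from the definition and from the equivalent shortcut form. The function names are illustrative choices.

```python
import math

def population_variance(values):
    """Mean of the squared deviations from the mean."""
    n = len(values)
    mu = sum(values) / n
    return sum((x - mu) ** 2 for x in values) / n

def population_variance_shortcut(values):
    """Equivalent form: mean of the squares minus the square of the mean."""
    n = len(values)
    mu = sum(values) / n
    return sum(x * x for x in values) / n - mu ** 2

data_sets = {
    "Data Set 1": [3, 3, 3, 3, 3],
    "Data Set 2": [1, 2, 3, 4, 5],
    "Data Set 3": [2, 2, 3, 4, 4],
}
for name, values in data_sets.items():
    var = population_variance(values)
    # The positive square root of the variance is the standard deviation.
    print(name, round(var, 4), round(math.sqrt(var), 4))
```

The variances are 0, 2 and 0.8 respectively, which separates the three data sets even though their means are identical.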
Standard deviation - the positive square root of the variance. That is,

$$\sigma = \sqrt{\sigma^2}$$

Properties: The standard deviation has the same set of properties as
the variance except that its unit of measurement is the same as the unit
of measurement of the observations.

Example. In actual body weight of sheep, the standard deviation is

$$\sigma = \sqrt{66.3533917} \approx 8.145759091$$

Remark: The standard deviation, coupled with the arithmetic mean, gives a
lot of information about the distribution of a given population.
Interquartile range- the difference between the third and the
first quartiles of a set of data. It is denoted by IR. It provides a
measure of the range of the middle 50% of the observations.

Quartiles are values from a given array of data which divide the
array into four equal parts.

The First Quartile, denoted by Q1, is the value for which 25% of
the observations are less than Q1 and 75% are greater than it.

The Third Quartile, denoted by Q3, is the value for which 75% of
the observations are less than Q3 and 25% are greater than it.
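
A small sketch of the interquartile range in Python is given below. Several conventions exist for locating quartiles in a finite data set; this sketch assumes the "inclusive" interpolation method of the standard library function `statistics.quantiles` (Python 3.8 or later), so other software may give slightly different values. The data are hypothetical.

```python
import statistics

def interquartile_range(values):
    """Return (Q1, Q3, IR) using the 'inclusive' quartile convention."""
    q1, _q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q1, q3, q3 - q1

data = [1, 2, 3, 4, 5, 6, 7, 8, 9]  # hypothetical observations
q1, q3, ir = interquartile_range(data)
print(q1, q3, ir)  # 3.0 7.0 4.0
```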
The Empirical Rule states that if the
distribution of our data values appears to be
mound-shaped or bell-shaped with mean 𝜇 and
standard deviation 𝜎, then approximately

a) 68% of the population values lie between 𝜇 − 𝜎 and 𝜇 + 𝜎
b) 95% of the population values lie between 𝜇 − 2𝜎 and 𝜇 + 2𝜎
c) 99.7% of the population values lie between 𝜇 − 3𝜎 and 𝜇 + 3𝜎
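
The percentages in the Empirical Rule can be checked against an exactly normal distribution, which is the idealized bell shape. The sketch below uses the standard library class `statistics.NormalDist`; the helper name `share_within` is an assumption made for illustration.

```python
from statistics import NormalDist

def share_within(k, mu=0.0, sigma=1.0):
    """Proportion of a normal population lying within k standard deviations of the mean."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(mu + k * sigma) - dist.cdf(mu - k * sigma)

for k in (1, 2, 3):
    print(k, f"{share_within(k):.4f}")
```

The results are roughly 0.68, 0.95 and 0.997, matching the rule. Real data that are only approximately bell-shaped will deviate somewhat from these figures.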
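A Russian mathematician named Chebychev
has shown that, for any distribution, at least
1 − 1/k² of the observations fall within k standard
deviations from the mean (k > 1). In particular:
• At least 75% of the observations fall within 2
standard deviations from the mean
• At least 88.9% (about 89%) of the observations fall within 3
standard deviations from the mean
• At least 93.75% (about 94%) of the observations fall within 4
standard deviations from the mean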
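
The percentages quoted for Chebychev's result follow from the general bound 1 − 1/k² for k standard deviations. A brief sketch, with the function name `chebyshev_lower_bound` chosen here for illustration:

```python
def chebyshev_lower_bound(k):
    """Minimum proportion of observations within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("the bound is informative only for k > 1")
    return 1 - 1 / k ** 2

for k in (2, 3, 4):
    print(k, f"{chebyshev_lower_bound(k):.4f}")  # 0.7500, 0.8889, 0.9375
```

Unlike the Empirical Rule, this bound makes no assumption about the shape of the distribution, which is why its percentages are lower.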
Coefficient of Variation – ratio of the standard deviation to the
mean
- denoted as CV

CV = 𝜎/𝜇, provided 𝜇 is not equal to zero

Properties:

1. CV can be expressed as a decimal or as a percentage.
2. CV is a relative measure of dispersion.
3. The CV, being unitless, can be used to compare the
dispersion of two or more populations measured in
different units.
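
The use of CV to compare populations measured in different units can be sketched as follows. The height and weight values are hypothetical, and `coefficient_of_variation` is an illustrative function name.

```python
import math

def coefficient_of_variation(values):
    """CV = population standard deviation divided by the mean."""
    n = len(values)
    mu = sum(values) / n
    if mu == 0:
        raise ValueError("CV is undefined when the mean is zero")
    sigma = math.sqrt(sum((x - mu) ** 2 for x in values) / n)
    return sigma / mu

heights_cm = [150, 160, 170]  # hypothetical heights in centimetres
weights_kg = [50, 60, 70]     # hypothetical weights in kilograms

cv_heights = coefficient_of_variation(heights_cm)
cv_weights = coefficient_of_variation(weights_kg)
print(f"{cv_heights:.1%}", f"{cv_weights:.1%}")
```

Both samples have the same standard deviation in their own units, but the weights are more dispersed relative to their mean, so their CV is larger.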
GRAPHICAL
SUMMARY
Stem-and-Leaf Plots
A stem-and-leaf plot is a simple way to summarize a data set.
Figure 1.5 presents a stem-and-leaf plot of the geyser data. Each item in the sample is
divided into two parts: a stem, consisting of the leftmost one or two digits, and the leaf,
which consists of the next digit.
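
The splitting of each item into a stem and a leaf can be sketched in Python. The values below are hypothetical two-digit integers, not the geyser data, and the sketch assumes non-negative integers whose last digit serves as the leaf.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Group non-negative integers by stem (leading digits) with the last digit as leaf."""
    groups = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        groups[stem].append(leaf)
    return {stem: groups[stem] for stem in sorted(groups)}

durations = [42, 45, 51, 55, 55, 58, 63, 67, 71, 74, 78, 80]  # hypothetical data
plot = stem_and_leaf(durations)
for stem, leaves in plot.items():
    print(f"{stem} | {''.join(str(leaf) for leaf in leaves)}")
```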
Dotplots
A dotplot is a graph that can be used to give a rough impression of the shape of a sample. It
is useful when the sample size is not too large and when the sample contains some
repeated values. Figure 1.7 presents a dotplot for the geyser data in Table 1.3.
Histograms
A histogram is a graphic that gives an idea of the “shape” of a sample, indicating regions
where sample points are concentrated and regions where they are sparse.

We will construct a histogram for the PM emissions of 62 vehicles driven at high altitude,
as presented in Table 1.2. The sample values range from a low of 1.11 to a high of 23.38, in
units of grams of emissions per gallon of fuel.
The first step is to construct a frequency table, shown in Table
1.4.
Histogram for the data in Table 1.4. In this histogram the heights
of the rectangles are the relative frequencies. Since the class widths
are all the same, the frequencies, relative frequencies, and densities
are proportional to one another, so it would have been equally
appropriate to set the heights equal to the frequencies or to the
densities.
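
The relationship between frequencies, relative frequencies and densities can be illustrated with a small frequency-table sketch. The emission values below are hypothetical (Table 1.2 is not reproduced in the text), and the class width of 4 and the left-closed intervals [lower, upper) are assumptions made for illustration.

```python
def frequency_table(values, start, width, n_classes):
    """Rows of (lower, upper, frequency, relative frequency, density) for equal-width classes."""
    counts = [0] * n_classes
    for v in values:
        idx = int((v - start) // width)  # class index for the interval [lower, upper)
        if not 0 <= idx < n_classes:
            raise ValueError(f"value {v} falls outside the class intervals")
        counts[idx] += 1
    total = len(values)
    rows = []
    for i, count in enumerate(counts):
        lower = start + i * width
        relative = count / total
        rows.append((lower, lower + width, count, relative, relative / width))
    return rows

emissions = [1.11, 2.4, 3.7, 4.2, 5.9, 6.3, 7.8, 9.1, 11.5, 23.38]  # hypothetical values
for row in frequency_table(emissions, start=0, width=4, n_classes=6):
    print(row)
```

Because every class has the same width, the density column is just the relative frequency divided by a constant, which is why the three choices of rectangle height give histograms of the same shape.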
