ENS 185 Module 1
DATA
ANALYSIS
MODULE 1.1: Introduction to Statistics and
Data Analysis
DEFINITION
In the plural sense, STATISTICS means a set
of numerical facts and figures. We say, for
example, “statistics on birth, statistics on
crime, statistics on unemployment, or statistics
on marriage”.
Sampling Procedures & Collection of Data
Once the research problem
and design have been identified, the
process of collecting
information is necessary.
There are two types of sampling techniques:
1. Probability Sampling Technique
2. Non-Probability Sampling Technique
Simple random sampling
In the simple random sampling technique,
every item in the population has an equal
chance of being selected in the
sample. Since item selection depends
entirely on chance, this method is
known as the “Method of Chance Selection”.
When the sample size is large and the items are
chosen randomly, it is also known as
“Representative Sampling”.
SIMPLE RANDOM SAMPLING
Example:
Suppose we want to select a simple random
sample of 200 students from a school. Here, we
can assign a number to every student in the
school database from 1 to 500 and use a
random number generator to select a sample of
200 numbers.
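The selection described in this example can be sketched in Python. This is a minimal illustration, not a prescribed procedure: the function name `simple_random_sample` and the `seed` parameter are assumptions introduced here, and the standard-library `random` module stands in for the random number generator.

```python
import random

def simple_random_sample(population_size, sample_size, seed=None):
    """Draw sample_size distinct ID numbers from 1..population_size.

    Every ID has the same chance of being chosen (sampling without replacement).
    """
    rng = random.Random(seed)              # seed is optional, for reproducibility
    ids = range(1, population_size + 1)    # student numbers 1..population_size
    return rng.sample(ids, sample_size)

# 500 numbered students, sample of 200, as in the example above.
sample = simple_random_sample(500, 200, seed=1)
print(len(sample), min(sample), max(sample))
```

Because the draw is without replacement, no student number can appear twice in the sample.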
SYSTEMATIC
SAMPLING
In the systematic sampling method, the
items are selected from the target
population by choosing a random
starting point and then selecting the other
items at a fixed sampling interval.
The interval is calculated by dividing the total
population size by the desired
sample size.
SYSTEMATIC SAMPLING
Example:
Suppose the names of 300 students of a school
are sorted in reverse alphabetical order. To
select a sample by the systematic sampling
method with a sampling interval of 15, we begin by
randomly selecting a starting number, say
5. From number 5 onwards, we select every
15th person from the sorted list (numbers 5, 20, 35, ..., 290). We
end up with a sample of 20 students.
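A minimal Python sketch of this procedure follows, assuming a sorted list of 300 students, a starting number of 5, and selection of every 15th person. The function name `systematic_sample` and its parameters are illustrative assumptions, not part of the module.

```python
import random

def systematic_sample(items, interval, start=None, seed=None):
    """Select items at 1-based positions start, start + interval, ...

    If start is not given, it is drawn at random from 1..interval.
    """
    if interval < 1:
        raise ValueError("interval must be at least 1")
    if start is None:
        start = random.Random(seed).randint(1, interval)
    # Convert the 1-based starting position to a 0-based index, then step.
    return items[start - 1::interval]

students = list(range(1, 301))   # positions 1..300 in the sorted list
sample = systematic_sample(students, interval=15, start=5)
print(len(sample), sample[:3], sample[-1])
```

With 300 students and an interval of 15, the selected positions are 5, 20, 35, and so on up to 290, which gives 20 students.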
STRATIFIED
SAMPLING
In a stratified sampling method,
the total population is divided
into smaller groups to complete
the sampling process. The small
group is formed based on a few
characteristics in the
population. After separating the
population into smaller groups,
the statisticians randomly select
a sample from each group.
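The idea can be sketched in Python as follows. This is one possible reading, assuming the same sampling fraction is applied to every group (proportional allocation), which the text does not specify. The population, the `stratum_of` function, and the `fraction` parameter are hypothetical.

```python
import random
from collections import defaultdict

def stratified_sample(items, stratum_of, fraction, seed=None):
    """Split items into strata, then randomly sample the same fraction of each."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for item in items:
        strata[stratum_of(item)].append(item)   # group by the chosen characteristic
    sample = []
    for key in sorted(strata):
        members = strata[key]
        k = round(len(members) * fraction)      # sample size for this stratum
        sample.extend(rng.sample(members, k))   # simple random sample within it
    return sample

# Hypothetical population: (student_id, year_level), 60 first-years and 40 second-years.
population = [(i, "first-year" if i <= 60 else "second-year") for i in range(1, 101)]
sample = stratified_sample(population, stratum_of=lambda s: s[1], fraction=0.1, seed=0)
print(len(sample))
```

Each year level contributes in proportion to its size, so the sample mirrors the composition of the population on that characteristic.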
STRATIFIED SAMPLING
Population mean:
$$\mu = \frac{\sum_{i=1}^{N} X_i}{N}$$
Geometric mean
- the Nth root of the product of N positive numbers
- used mainly to average ratios, rates of change,
economic indices, etc.
- in practice, geometric means are calculated by
making use of the fact that the logarithm of the
geometric mean of a set of positive numbers equals the
arithmetic mean of their logarithms.
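The logarithm approach can be sketched in Python as below. The function name `geometric_mean` and the growth factors are hypothetical, chosen only to illustrate that the log-based calculation agrees with the direct Nth-root definition.

```python
import math

def geometric_mean(values):
    """Geometric mean computed as exp(arithmetic mean of the logarithms)."""
    if not values or any(v <= 0 for v in values):
        raise ValueError("geometric mean needs a non-empty set of positive numbers")
    mean_log = sum(math.log(v) for v in values) / len(values)
    return math.exp(mean_log)

# Hypothetical yearly growth factors (1.10 means +10%, 0.90 means -10%).
growth = [1.10, 1.20, 0.90]
via_logs = geometric_mean(growth)
direct = math.prod(growth) ** (1 / len(growth))   # Nth root of the product
print(via_logs, direct)
```

Working with logarithms avoids forming a very large or very small product when N is large.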
Comparison among measures of central tendency
After finding the median class, use the below formula to find the median value.
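For reference, a commonly used form of the grouped-data median formula is sketched below. The symbols are assumptions introduced here and may differ from the notation used in the original slide.

```latex
\[
\text{Median} = l + \left( \frac{\frac{n}{2} - cf}{f} \right) h
\]
% l  : lower boundary of the median class
% n  : total number of observations
% cf : cumulative frequency of the class preceding the median class
% f  : frequency of the median class
% h  : class width
```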
Illustration:
Data Set 1: 3,3,3,3,3
Data Set 2: 1,2,3,4,5
Data Set 3: 2,2,3,4,4
All three data sets have mean equal to 3 yet they are not
identical. There is a need for another quantity to measure the
spread of the values in a given population.
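The point of the illustration can be checked with a short Python sketch. The three lists are the data sets above; the dictionary name `data_sets` and the helper `value_range` are assumptions introduced here.

```python
from statistics import mean

data_sets = {
    "Data Set 1": [3, 3, 3, 3, 3],
    "Data Set 2": [1, 2, 3, 4, 5],
    "Data Set 3": [2, 2, 3, 4, 4],
}

def value_range(values):
    """Highest value minus lowest value."""
    return max(values) - min(values)

means = {name: mean(xs) for name, xs in data_sets.items()}
ranges = {name: value_range(xs) for name, xs in data_sets.items()}

for name in data_sets:
    print(name, "mean =", means[name], "range =", ranges[name])
```

The means coincide while the ranges do not, which is why a separate measure of spread is needed.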
Some common measures of dispersion:
1. Range
2. Variance
3. Standard Deviation
4. Coefficient of Variation
Range - difference between the highest value and the lowest
value of the population
Example.
The range of the actual body weight values is 46.8 - 8.00 = 38.8.
Properties:
1. It is a quick but rough measure of dispersion.
2. The larger the value of the range, the more dispersed are
the observations.
3. It considers only the highest and lowest observations in the
population. Hence, it may not be reflective of the dispersion
characteristic of the majority.
Variance - mean of the squared deviations of the observations
from the mean, denoted by $\sigma^2$
$$\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N} = \frac{\sum_{i=1}^{N} X_i^2}{N} - \mu^2$$
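The population variance can be computed either from the definition (mean of squared deviations) or from the equivalent shortcut (mean of the squares minus the square of the mean). A minimal Python sketch, with function names that are assumptions introduced here, shows that both give the same value:

```python
import math

def variance_by_definition(xs):
    """Mean of the squared deviations from the population mean."""
    n = len(xs)
    mu = sum(xs) / n
    return sum((x - mu) ** 2 for x in xs) / n

def variance_by_shortcut(xs):
    """Mean of the squares minus the square of the mean."""
    n = len(xs)
    mu = sum(xs) / n
    return sum(x * x for x in xs) / n - mu ** 2

data = [1, 2, 3, 4, 5]
var_def = variance_by_definition(data)
var_short = variance_by_shortcut(data)
std_dev = math.sqrt(var_def)   # standard deviation is the square root of the variance
print(var_def, var_short, std_dev)
```

The shortcut form needs only one pass over the data, but it can lose precision when the mean is large relative to the spread.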
Properties:
$$\sigma = \sqrt{66.3533917} = 8.145759091$$
Quartiles are values from a given array of data which divide the
array into four equal parts.
The First Quartile, denoted by Q1, is the value for which 25% of
the observations are less than Q1 and 75% are greater than it.
The Third Quartile, denoted by Q3, is the value for which 75% of
the observations are less than Q3 and 25% are greater than it.
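Quartiles can be obtained in Python with `statistics.quantiles`. The sketch below uses a small hypothetical array. Note that several interpolation conventions exist, so other software or textbooks may report slightly different quartile values for the same data; the `"inclusive"` method is chosen here only as an assumption.

```python
from statistics import quantiles

# Hypothetical array of nine observations.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9]

# n=4 asks for the three cut points that split the data into four parts.
q1, q2, q3 = quantiles(data, n=4, method="inclusive")
print("Q1 =", q1, "Q2 (median) =", q2, "Q3 =", q3)
```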
The Empirical Rule states that if the
distribution of our data values appears to be
mound-shaped or bell-shaped with mean 𝜇 and
standard deviation 𝜎, then approximately
68% of the observations lie within 𝜇 ± 𝜎,
95% lie within 𝜇 ± 2𝜎, and
99.7% lie within 𝜇 ± 3𝜎.
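The shares behind the Empirical Rule can be computed for an exactly normal distribution with Python's `statistics.NormalDist`. This is a sketch for the idealized bell curve, and real data will only match it approximately. The helper name `share_within` is an assumption introduced here.

```python
from statistics import NormalDist

def share_within(k, mu=0.0, sigma=1.0):
    """Probability that a normal value lies within k standard deviations of the mean."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(mu + k * sigma) - dist.cdf(mu - k * sigma)

shares = {k: share_within(k) for k in (1, 2, 3)}
for k, p in shares.items():
    print(f"within {k} standard deviation(s): {p:.4f}")
```

The results do not depend on the particular values of the mean and standard deviation, only on the number of standard deviations k.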
Properties:
We will construct a histogram for the PM emissions of 62 vehicles driven at high altitude,
as presented in Table 1.2. The sample values range from a low of 1.11 to a high of 23.38, in
units of grams of emissions per gallon of fuel.
The first step is to construct a frequency table, shown in Table
1.4.
Histogram for the data in Table 1.4. In this histogram the heights
of the rectangles are the relative frequencies. Since the class widths
are all the same, the frequencies, relative frequencies, and densities
are proportional to one another, so it would have been equally
appropriate to set the heights equal to the frequencies or to the
densities.
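The relationship between frequency, relative frequency, and density can be sketched in Python. The values below are hypothetical and are not the emissions data of Table 1.2. The function name `frequency_table`, the half-open class intervals, and the class boundaries are assumptions made for illustration.

```python
def frequency_table(values, start, width, num_classes):
    """Build equal-width classes [lo, lo + width) and tabulate each class."""
    counts = [0] * num_classes
    for v in values:
        idx = int((v - start) // width)    # index of the class containing v
        if 0 <= idx < num_classes:
            counts[idx] += 1
    total = sum(counts)
    if total == 0:
        raise ValueError("no values fall inside the classes")
    rows = []
    for i, count in enumerate(counts):
        lo = start + i * width
        rel = count / total                # relative frequency
        rows.append({
            "class": (lo, lo + width),
            "frequency": count,
            "relative_frequency": rel,
            "density": rel / width,        # relative frequency per unit of class width
        })
    return rows

# Hypothetical sample values.
values = [1.5, 2.2, 3.8, 4.1, 4.9, 6.3, 7.7, 9.0]
table = frequency_table(values, start=1, width=2, num_classes=5)
for row in table:
    print(row)
```

With equal class widths, each density is the relative frequency divided by the same constant, so histograms drawn with any of the three heights have the same shape.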