
BASIC CONCEPTS

A. Importance of Statistics - Statistics is important in almost all fields. Its study has become more important because of the increasing
amount of data that is collected and disseminated to the public.

Statistics
- the collection, organization, presentation, analysis, or interpretation of numerical data, especially as a
branch of mathematics in which deductions are made on the assumption that the relationships between
a sufficient sample of numerical data are characteristic of those between all such data.
- it is a science which deals with the collection, presentation, analysis, and use of data to make decisions,
solve problems, and design products and processes.

B. Categories of Statistics

Methods of statistical analysis can be categorized into two:


1. Descriptive Statistics
2. Inferential Statistics

Descriptive Statistics
- It comprises those methods concerned with collecting and describing a set of data so as to yield
meaningful information.
- It deals with the methods of organizing, summarizing and presenting a mass of data.

Inferential Statistics
- It is concerned with making generalizations about a population or other groups of data based on the study
of the sample.
- It comprises those methods concerned with the analysis of a subset of data leading to predictions or
inferences about the entire set of data.

C. Population and Sample

Population
- it consists of the totality of the observations with which we are concerned.
- it refers to the total group of people, objects, or reactions that can be described as having a
unique quality or combination of qualities.
- It refers to an entire group that is being studied. Each member of the population is called a unit.

Sample
- It refers to a finite number of objects selected from a well-defined population.
- It is a collection of some elements in a population which is a representative of the entire population.

D. Enumerative vs Analytic Study


An enumerative study is one in which a sample is used to make an inference about the population from which the sample
is selected, while an analytic study is one in which a sample is used to make an inference about a conceptual (future) population.

D. Variable
A variable is any property, characteristic or attribute which is of interest about each individual unit of a population or of
a sample.
Types:
1. Qualitative – also called categorical variables. It describes data which fit into categories.
2. Quantitative – They represent a measurable quantity.

Two types of variables essential in an experiment:


1. Independent Variables
- Frequently referred to as the input variable or predictor variable because it is systematically manipulated by
the researcher and it is used to predict the outcome.
2. Dependent Variables
- It is the quantitative variable that the investigation or experiment measures to determine the effect of the
independent variable.

E. Data
- it is the raw material for the statistical investigation.

Types of Data (According to Nature of Variable)


1. Qualitative Data
– refers to characteristics of the sample. The degree could be described but not quantified.
2. Quantitative Data
– refers to numerical information gathered about a sample. It involves numbers that are the results of counting
or measuring.
a. discrete
– data obtained through counting. These are expressed as whole numbers and are always exact.
b. continuous
– data which are results of measurement and are not necessarily whole numbers.

Type of Data (According to Source of Data)


1. Primary Data
– refers to data obtained through first-hand account or observation.
2. Secondary Data
– refers to data obtained from a secondary source, published materials, books and the like.

F. Levels or Scales of Measurement of Data


- was first proposed by the American psychologist Stanley Smith Stevens in 1946.

1. Nominal Scale – It involves categorizing cases according to the presence or absence of some attribute. It is generally
used for the purpose of classification. Data gathered from variables measured at a nominal level can be categorized
but cannot be ranked, as there are no quantitative differences between and among them.

2. Ordinal Scale – The simplest scale which orders people, objects, or events along some continuum. The name of this
level is derived from the use of ordinal numbers for ranking. Numbers are used only to place objects in order and the
difference between consecutive values does not have a meaning.

3. Interval Scale – The scale on which zero is arbitrary. It does not reflect the absence of an attribute. Data gathered
from variables measured at an interval scale can be categorized, ranked, and can be added or subtracted.
Difference between two values has a meaning.

4. Ratio Scale – Possesses all of the characteristics of interval scales but has a true zero point. A variable measured at
this level includes not only the concepts of order and interval but also the idea of “nothingness”. Thus, a value
of 0 on the scale indicates the total absence of the property being measured.
G. Parameter and Statistic

Parameter – it is a summary measure which is calculated from population data. It is represented by Greek letters.

Statistic – it is a summary measure or value which is calculated from sample data. Letters of the English alphabet are
used to represent statistics.

SAMPLING AND DATA COLLECTION

A. SAMPLING

Sampling is the process of selecting units, like people, organizations, or objects from a population of interest in order to
study and fairly generalize the results back to the population from which the sample was taken.

Sample Size Criteria

There are usually three criteria that need to be specified to determine the appropriate sample size: the level of
precision, the level of confidence or risk, and the degree of variability in the attributes that are being measured (Miaoulis and
Michener, 1976).

1. The level of precision


 Sometimes called sampling error.
 The range in which the true value of the population is estimated to be.
 Often expressed in percentage points (example, ±5%).

2. The confidence level


 Also referred to as risk level.
 Key idea is encompassed in the Central Limit Theorem (CLT).
 Central Limit Theorem: If X1, X2, . . ., Xn is a random sample of size n taken from a population (either finite
or infinite) with mean µ and variance σ², and if X̄ is the sample mean, then the limiting form of the distribution of
$$ Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}} $$
as n → ∞ is the standard normal distribution (Montgomery, D.C. and Runger, G.C., 2003).

3. Degree of Variability
 It refers to the distribution of the attributes in the population.
 The more heterogeneous a population is, the larger the sample size required to obtain a given level of
precision. The more homogeneous a population is, the smaller the sample size.
 The variance is usually an unknown quantity.

Strategies for Determining the Sample Size


1. Using a census for small populations
2. Using a sample size of a similar study
3. Using published tables (see Yamane, 1967)
4. Using formulas to calculate sample sizes.

Determining the sample size

Several formulas to calculate the sample size for a certain study have been developed and suggested, such as Parten’s
formula (1950), Lehr’s rule (1992), the formula by Berlowitz, Watson’s formula (2001), and Cochran’s formula (1963). Most
of these formulas consider the case where the population distribution is approximately normal.

The following are some formulas that are used to calculate the size of a sample.

Calculating the size of a Sample for Proportions

For populations that are large, Cochran (1963:75) developed the following equation to yield a representative
sample for proportions.

$$ n_0 = \frac{Z^2 p q}{e^2} $$
Where: n0 - is the sample size
Z - is the abscissa of the normal curve that cuts off an area α at the tails (1 - α equals
the desired confidence level)
e - is the desired level of precision
p - is the estimated proportion of an attribute that is present in the population
q - is 1 - p

* The value for Z is found in the statistical table of areas under the normal curve.

Finite Population Correction for Proportions

If the population is small, then the sample size can be reduced slightly.

$$ n = \frac{n_0}{1 + \dfrac{n_0 - 1}{N}} $$
Where: n - is the sample size.
N - is the population size
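
The two formulas above can be chained in a short calculation. The sketch below is only an illustration under assumed inputs: a 95% confidence level (Z = 1.96), maximum variability (p = 0.5), a precision of ±5%, and a hypothetical population of N = 2,000. None of these values come from the notes, and the function names are made up for this example.

```python
import math

def cochran_n0(z, p, e):
    """Cochran's sample size for a proportion in a large population."""
    q = 1 - p
    return (z ** 2) * p * q / (e ** 2)

def finite_population_correction(n0, N):
    """Reduce n0 when the population size N is small."""
    return n0 / (1 + (n0 - 1) / N)

# Assumed inputs (hypothetical): 95% confidence, p = 0.5, e = 0.05, N = 2000
n0 = cochran_n0(z=1.96, p=0.5, e=0.05)
n = finite_population_correction(n0, N=2000)

# Sample sizes are rounded up so the stated precision is still met
print("Large population:", math.ceil(n0))
print("After correction:", math.ceil(n))
```

Rounding up rather than to the nearest integer is a conservative choice, since rounding down would give slightly less than the requested precision.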

Yamane’s Formula (Simplified Formula for Proportions)

If the behavior of the population is not certain or the researcher is not familiar with the population’s behavior, Yaro
Yamen’s formula (1980) or Taro Yamane’s formula (1967) may be used. The formula is:

$$ n = \frac{N}{1 + N e^2} $$

Where: n - is the sample size


N - is the population size
e - is the level of precision.
Example. From a population of 10,000 individuals of a certain town, what sample size is needed in order to get
an accurate result for a certain study using a margin of error of a.) 1% ; b.) 2.5% ; c.) 5%
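
One way to work the example is to apply Yamane's formula directly for each margin of error. This is a minimal sketch; the function name is illustrative, and rounding up to the next whole person is an assumption about how fractional results should be handled.

```python
import math

def yamane_sample_size(N, e):
    """Yamane's simplified formula: n = N / (1 + N * e^2)."""
    return N / (1 + N * e ** 2)

N = 10_000
sizes = {}
for e in (0.01, 0.025, 0.05):
    n = yamane_sample_size(N, e)
    # round() first guards against floating-point noise before rounding up
    sizes[e] = math.ceil(round(n, 6))
    print(f"margin of error {e:.1%}: n = {sizes[e]}")
```

Under these assumptions the required sample sizes are 5,000 for 1%, 1,380 for 2.5%, and 385 for 5%, which shows how quickly the sample size grows as the margin of error shrinks.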

Computing the size of a Sample for the Mean

For polytomous or continuous variables, one way of determining the sample size is to combine responses into
two categories and then compute for the sample size based on proportion (Smith, 1983).
Another method to determine the sample size for the mean is to use the following formula:

Z2  2
n0 
e2

Where: n0 - is the sample size


Z - is the abscissa of the standard normal curve that cuts off an area  at the tails
e - is the desired level of precision (in the same unit of measure as the variance.
 2 - is the variance of an attribute in the population.
* Often, a “good” estimate of the variance is not available. Furthermore, the sample size can vary widely from one
attribute to another because each is likely to have a different variance. These are the problems which make it
disadvantageous to compute for the sample size based on the mean. Because of these problems, it is frequently
preferred to use the sample size for proportion.

Other Considerations with regard to the sample size:

 For the aforementioned methods/approaches of determining the sample size, the sampling design is
assumed to be simple random sampling. More complex designs must take into account the variances of
subpopulation strata, or clusters, before making an estimate of the variability in the population as a whole.
 The sample size should be appropriate for the analysis that is planned.
 An adjustment in the sample size may be needed to accommodate a comparative analysis of subgroups.
o Sudman (1976) suggests a minimum of 100 elements for each major group or subgroup in the
sample and a sample of 20 to 50 elements is necessary for each minor subgroup.
o According to Kish (1965), when the attribute of interest is present 20% to 80% of the time (the
distribution approaches normality), 30 to 200 elements are then sufficient. For skewed distributions, a
larger sample or a census is required. Skewed distributions can result in serious departures from
normality even for moderate size samples (Kish, 1965).
 Researchers commonly add 10% to the sample size to compensate for persons that the researcher is unable
to contact.
 The sample size is also often increased by 30% to compensate for nonresponse.

Probability Sampling
A probability sampling method is any method of sampling that utilizes some form of random selection. Random selection
is performed by selecting a group of subjects (a sample) for study from a larger group (a population). Each individual is chosen
entirely by chance.

1. Simple random sampling


- it is the simplest form of random sampling. Each individual is chosen entirely by chance and each member of
the population has an equal chance of being included in the sample.

2. Systematic random sampling (with a random start)


- it is a method of selecting a sample by taking every kth unit from an ordered population, the first unit being
selected at random. Here, k is called the sampling interval and the reciprocal 1/k is the sampling fraction.

Example. From a population size of 300 items, 30 are to be selected randomly using systematic random sampling. Which
elements or units in the population are to be taken for the sample?
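
The example can be sketched as follows, assuming the 300 items are numbered 1 to 300. The sampling interval is k = 300/30 = 10, and the random start is drawn from the first k units. The function name and the fixed seed are illustrative choices only.

```python
import random

def systematic_sample(N, n, rng=random):
    """Select every k-th unit from units numbered 1..N, with a random start."""
    k = N // n                   # sampling interval
    start = rng.randint(1, k)    # random start among the first k units
    return [start + i * k for i in range(n)]

# A seeded generator is used only to make the illustration repeatable
units = systematic_sample(300, 30, random.Random(0))
print(units)
```

If the random start is r, the selected units are r, r + 10, r + 20, and so on up to r + 290.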

3. Stratified random sampling


- this sampling method involves dividing the population into homogeneous subgroups and then taking a simple
random sample in each subgroup. The objective is to divide the population into non-overlapping groups.

Example. From the data below on the number of employees per department of a certain company, determine the
number of employees that are to be taken from each of the departments needed to represent the population using
equal allocation and proportional allocation of samples.
Department No. of Employees
Engineering 150
Production 500
Marketing 325
Management 100
Total 1,075
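
The example does not state the total sample size, so the sketch below assumes a hypothetical total of n = 292 employees purely for illustration. Equal allocation divides n evenly among the departments, while proportional allocation gives each department a share of n equal to its share of the population.

```python
strata = {
    "Engineering": 150,
    "Production": 500,
    "Marketing": 325,
    "Management": 100,
}

n = 292                      # hypothetical total sample size (not given in the example)
N = sum(strata.values())     # population size, 1,075

# Equal allocation: the same number from every department
equal = {name: n // len(strata) for name in strata}

# Proportional allocation: n * (stratum size / population size), rounded
proportional = {name: round(n * size / N) for name, size in strata.items()}

for name in strata:
    print(f"{name:12s} equal = {equal[name]:3d}  proportional = {proportional[name]:3d}")
```

With other values of n, the rounded proportional allocations may not add up exactly to n, in which case one stratum is usually adjusted by one unit.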

4. Cluster random sampling


- this sampling method involves dividing the whole population into clusters, usually along geographic
boundaries, then randomly taking samples of clusters, and measuring all units within sampled clusters.

5. Multi-Stage sampling
- Uses several stages or phases in the process of sampling from a population. Very useful in conducting
nationwide surveys or any survey that involves a very large population.

Non-Probability Sampling
This is a sampling method that does not involve random selection of samples. With non-probability samples, the
population may or may not be represented well, and it will often be difficult to know how well the population has been
represented. Some forms of non-probability sampling are:

1. Accidental or Haphazard or Convenience sampling


- one of the most common methods of sampling. The results are normally biased since the
researcher considers his/her own convenience in the collection of the data.

2. Purposive sampling
- sampling is based on certain criteria laid down by the researcher. People who satisfy the criteria are
interviewed.

Subcategories of Purposive sampling:

a. Modal instance sampling


- When we do modal instance sampling, we are sampling the most frequent case. The problem with modal
instance sampling is identifying the “modal” case. Modal instance sampling is only sensible for informal
sampling contexts.

b. Expert sampling
- Involves the assembling of a sample of persons with known or demonstrable experience and expertise in
some area.
Two reasons we might do expert sampling:
1. It would be the best way to elicit the views of persons who have specific expertise.
2. To provide evidence for the validity of another sampling approach you’ve chosen.

c. Quota sampling
- Select items nonrandomly according to some fixed quota.

d. Snowball sampling
- Begin by identifying someone who meets the criteria for inclusion in your study. You then ask them to
recommend others who they may know who also meet the criteria.

B. METHODS OF DATA COLLECTION


In order to have accurate data, the researcher must know the right sources and the right way of collecting them.

Methods of Data Collection


1. Interview or Direct Method
- uses at least two persons exchanging information. This method gives precise information because clarifications
can be made.

2. Questionnaire or Indirect Method


- a method where written answers are given to prepared questions. This method gives the respondent a sense
of freedom in honestly answering the questions because of anonymity.

3. Registration method
- a method enforced by certain laws.

4. Observation method
- it is a method which observes the behavior of individuals or organizations in the study. This is also used when the
respondents cannot read nor write.
5. Experiment Method
- used when the objective of the study is to determine the cause and effect of certain phenomena or event.

Characteristics of a Good Question


1. A good question is unbiased.
- questions must not be worded in a manner that influences the answer of a respondent in a certain way, that
is, to favor a certain response or be against it.
- An unbiased question is stated in neutral language and there is no element of pressure

2. A good question must be clear and simply stated.


- a question that is simple and clear is easier to understand and is more likely to be answered truthfully.

3. Questions must be precise


- Questions must not be vague. The question should indicate clearly how the answers must be
given.

4. Good questionnaires lend themselves to easy analyses.

Two Categories of Survey questions


1. Open question
- allows for a free response.
2. Closed question
- allows only a fixed response.

Planning the study


1. Make an estimate of the number of items in the population, if the exact number cannot be determined.
2. Assess resources such as time and money which are available to pursue the research.
3. Determine the sample size needed.

DATA ORGANIZATION AND DATA PRESENTATION

Organization or Tabulation of Data in Microsoft Excel (or in SPSS)


*SPSS – Statistical Package for the Social Sciences

*Preparing a codebook for the Data File

Preparing a codebook involves deciding (and documenting) how you will go about:
1. Defining and labeling each of the variables.
2. Assigning numbers to each of the possible categorical responses.

Table 1. Example of a Codebook:

Variable                   Windows Excel Variable Name   Coding instructions
Identification number      ID                            Number assigned to each questionnaire
Sex (or Gender)            Sex                           1 = Males; 2 = Females
Age                        Age                           Age in years
Marital Status             Marital                       1 = Single; 2 = Steady relationship; 3 = Married for the first time;
                                                         4 = Remarried; 5 = Divorced/Separated; 6 = Widowed
Optimism scale item 1-4    Op 1 to Op 4                  Enter the number circled
Height                     Height                        Enter the value provided or measured

*Variable Names
- Each question or item must have a unique variable name (these names shall clearly identify the information.)

Rules for naming of Variables:


Variable names:
- unique, that is, each variable in a data set must have a different name;

Note: The first variable in any data set should be the ID (respondent number)

IDs and demographic variables are usually placed at the beginning of the file. Then, entry of other variables should follow
some sort of logical order.

Each row in the data file represents one case. Each column represents a separate variable and the variable labels at
the top of the columns should make it clear what is being measured.

Methods of Data Presentation

A. Textual
- Collected data may be organized and presented in a narrative or textual form.

Example:
2000 Census of Population
The population of the Philippines as of May 1, 2000 is 75.33 million. This figure is higher by 6.71 million
than the 1995 population.
The annual growth rate from 1995 to 2000 is 2.02 percent, which is lower by 0.30 percentage points
than the 1995 figure of 2.32 percent and by 0.33 percentage points than the 1990 figure of 2.35 percent.

Source: NSO Monthly Bulletin of Statistics, August 2000


B. Graphical
- A pictorial representation of a set of data.

Data Patterns in Statistics


Graphic displays are useful for seeing patterns in data. The data patterns are commonly described in terms of
the center, spread, shape, and other unusual features.
 Center
o The point in a graphic display where about half of the observations are on either side.

 Spread
o This refers to the variability of the data. If the observations cover a wide range, then the spread
is larger. The spread is smaller, on the other hand, when the observations are clustered around a
single value.

 Shape
o It is described by the following characteristics:
 Symmetry. Graph can be divided at the center so that each half is a mirror image of
the other.

 Number of peaks. A distribution with one peak is referred to as unimodal while a


distribution with two peaks is bimodal.

 Skewness. Some distributions have more observations on one side of the graph than
the other. A distribution with fewer observations on the right (toward higher values) is
said to be skewed to the right. On the other hand, a distribution with fewer observations
on the left (toward lower values) is said to be skewed to the left.

 Uniform. Data distribution is equally spread across the range of the distribution.

 Unusual features.
o Gaps. Areas of a distribution where there are no observations.
o Outliers. Distributions of data are sometimes characterized by extreme values that differ greatly
from the other observations.
1. Dotplot
- A graphic display that is used to compare frequency counts within a small number of categories or groups,
usually with small sets of data.
- The pattern of data in a dotplot can be described in terms of symmetry and skewness, only if the categories
are quantitative. If the categories are qualitative, the dotplot cannot be described in those terms.

Example:

2. Bar charts and Histograms


- These are graphs or charts that are used to compare the sizes of different groups.

Bar Chart. It represents the frequency or magnitudes of quantities of each of the categories as a bar rising vertically
from the horizontal axis with the height of each bar proportional to the frequency or magnitude of the
corresponding category.
It may be simple, compound and can be vertically or horizontally arranged. It is used for both qualitative and
quantitative data.

Example:

Figure 1. Monthly mean particulate matter (PM10) level in Baguio City for 2010.

Histogram. It is made up of columns plotted on a graph where there is no space between adjacent columns. The
columns are positioned over a label that represents a continuous, quantitative variable.
A histogram is distinct from a bar chart based on the type of variable that is being presented. With this distinction, it
can be appropriate to talk about skewness of a histogram.

Example:

3. Stem and leaf plot


- A chart that shows how individual values are distributed within a set of data. It is used to display quantitative
data from small data sets.

Example: Stem and leaf plot of IQ scores of 30 individuals.


4. Boxplot (box and whisker plot)
- A type of graph which is used to display patterns of quantitative data.
- It splits the data set into quartiles. The body of the boxplot consists of a “box” which goes from the first quartile
(Q1) to the third quartile (Q3).
- Within the box, a vertical line is drawn at the median (Q2) of the data set.
- The whiskers are two horizontal lines from the front and the back of the box. The front whisker goes from Q1 to
the smallest non-outlier in the data set while the back whisker goes from Q3 to the largest non-outlier.
- Outliers in the data set are plotted separately as points on the chart.

Example:

Interpreting a boxplot or box and whisker plot:


 Range. It is the horizontal distance between the smallest value and the largest value which includes any
outliers.

 Interquartile Range (IQR). The interquartile range is represented by the width of the box.

 Shape of the data set.

5. Scatterplot
- A graphic tool used to display the relationship between two quantitative variables.
- Each dot on the scatterplot represents an ordered pair of observations from a data set.
- Used to analyze patterns in bivariate data. The patterns are described in terms of linearity, slope, and strength.
Examples:
6. Line chart.
- Graphical presentation of data especially useful for showing trends over a period of time.

Example:

Figure 2. Age at First Marriage in the United States


Following a sharp decline during and after World War II, the age at
which men and women in the United States first marry has steadily
increased. In the mid-1990s, the age of first marriage for women was
higher and closer to the age at which men first marry than at any time in
the previous 100 years.

Four ways to describe Data Sets:


When comparing two or more data sets, we focus on four features:
 Center
 Spread
 Shape
 Unusual features

MEASURES OF CENTRAL TENDENCY


Measures of Central Tendency or Central Location are numerical values that tend to locate in some sense the middle of a set of
data when arranged in increasing or decreasing order. The term average is often associated with these measures. The most
important measures of central tendency are (1) the mean, (2) the median, and (3) the mode.
A. MEAN, 𝜇 or 𝑥̅
1. Arithmetic Mean – it is obtained by adding all the observations and dividing the sum by the number of observations, thus
it is called a computational average.
Population mean: If a set of data 𝑥1 , 𝑥2 … 𝑥𝑁 represents a finite population of size 𝑁, then the population mean 𝜇 is
$$ \mu = \frac{\sum_{i=1}^{N} x_i}{N} $$
Sample Mean: If a set of data 𝑥1 , 𝑥2 … 𝑥𝑛 represents a finite sample of size 𝑛, then the sample mean 𝑥̅ is
$$ \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} $$
Example 1
Suppose you are to choose ten people who enter the campus and whose ages are as follows:
15 25 18 20 25 18 18 20 25 15
What is the mean age of this sample?
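
Example 1 can be checked with a few lines of code. This is a minimal sketch that applies the sample mean formula directly; the variable names are illustrative.

```python
ages = [15, 25, 18, 20, 25, 18, 18, 20, 25, 15]

# Sample mean: sum of the observations divided by the number of observations
mean_age = sum(ages) / len(ages)
print(mean_age)
```

The ages add up to 199, so the mean age of the ten people is 19.9 years.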

2. Weighted Mean – if the data set 𝑥1 , 𝑥2 … 𝑥𝑘 have assigned weights 𝑤1 , 𝑤2 … 𝑤𝑘 , respectively, then the weighted mean is
computed as follows:
$$ \bar{x} = \frac{\sum_{i=1}^{k} w_i x_i}{\sum_{i=1}^{k} w_i} $$

Example 2
The table provides the grades obtained by a student in the different criteria for grading and the corresponding weight
for each criterion. Find his weighted average.
Criteria Grade Weight
Long Tests 80 0.30
Quizzes 85 0.20
Departmental Exam 82 0.25
Class Participation 88 0.10
Homework and Projects 85 0.15
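
A short sketch of Example 2 follows, applying the weighted mean formula to the table. The dictionary layout is simply a convenient way to pair each grade with its weight.

```python
# criterion: (grade, weight)
criteria = {
    "Long Tests": (80, 0.30),
    "Quizzes": (85, 0.20),
    "Departmental Exam": (82, 0.25),
    "Class Participation": (88, 0.10),
    "Homework and Projects": (85, 0.15),
}

total_weight = sum(w for _, w in criteria.values())
weighted_average = sum(g * w for g, w in criteria.values()) / total_weight
print(round(weighted_average, 2))
```

Because the weights add up to 1, the weighted average is just the sum of the products: 24 + 17 + 20.5 + 8.8 + 12.75 = 83.05.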

Example 3
Mall goers were asked to rate the level of effectiveness of the inspection being done by security forces in preventing
crimes in malls.
Level of Effectiveness Very Effective (4) Moderately Effective (3) Least Effective (2) Not Effective (1)
Number of Mall goers 97 132 176 170

*Likert Scale: Interval width = (highest rate – lowest rate) / no. of ratings = (4 - 1) / 4 = 0.75

Rating   Range of Values   Qualitative Description
4        3.25 – 4.00       Very Effective
3        2.50 – 3.24       Moderately Effective
2        1.75 – 2.49       Least Effective
1        1.00 – 1.74       Not Effective
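
Example 3 can be worked by treating the number of mall goers at each rating as the weight of that rating. The sketch below computes the weighted mean and looks up its qualitative description; the `describe` helper is a hypothetical function written for this illustration, with cut-offs taken from the table above.

```python
# rating: number of mall goers who gave that rating
responses = {4: 97, 3: 132, 2: 176, 1: 170}

total = sum(responses.values())
weighted_mean = sum(rating * count for rating, count in responses.items()) / total

def describe(mean):
    """Map a mean rating to its qualitative description (intervals of width 0.75)."""
    if mean >= 3.25:
        return "Very Effective"
    if mean >= 2.50:
        return "Moderately Effective"
    if mean >= 1.75:
        return "Least Effective"
    return "Not Effective"

print(round(weighted_mean, 2), describe(weighted_mean))
```

With 575 respondents, the weighted mean is 1306/575, or about 2.27, which falls in the "Least Effective" interval.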

B. MEDIAN, 𝜇̃ or 𝑥̃
- a value that divides the distribution into two equal parts (after arranging the values/scores in ascending or descending order).
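As such, it is a positional average. With the values arranged in order as 𝑥1 , 𝑥2 … 𝑥𝑛 , the median is defined by
$$ \tilde{\mu} \text{ or } \tilde{x} = \begin{cases} x_{(n+1)/2} & \text{if } n \text{ is odd} \\ \dfrac{x_{n/2} + x_{n/2+1}}{2} & \text{if } n \text{ is even} \end{cases} $$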

Example 4
Find the median: (a) 12, 15, 18, 8, 9,10, 6; (b) 23, 18, 15, 12, 10, 9, 8, 6
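
The definition translates into code as sketched below. The values are sorted first, and since Python lists are indexed from 0, the positions in the formula are shifted by one. The function name is illustrative.

```python
def median(values):
    """Positional average: middle value, or mean of the two middle values."""
    x = sorted(values)
    n = len(x)
    mid = n // 2
    if n % 2 == 1:
        return x[mid]                  # position (n + 1)/2, counting from 1
    return (x[mid - 1] + x[mid]) / 2   # positions n/2 and n/2 + 1, counting from 1

median_a = median([12, 15, 18, 8, 9, 10, 6])
median_b = median([23, 18, 15, 12, 10, 9, 8, 6])
print(median_a, median_b)
```

For (a), the ordered values are 6, 8, 9, 10, 12, 15, 18, so the median is 10. For (b), the two middle values are 10 and 12, so the median is 11.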

C. MODE, 𝜇̂ or 𝑥̂
- the value in the distribution with the highest frequency. It locates the point where the observation values occur with the greatest
density. It can be used for quantitative as well as qualitative data.

Example 5
Find the mode of the following data: 15 12 4 9 6 10 5 15
12 4 12 6 12 5 15 12 4 15 4 6 5
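
Counting frequencies by hand is error-prone for longer lists, so Example 5 is sketched here with a frequency count. The code returns every value tied for the highest frequency, which also covers data sets with more than one mode.

```python
from collections import Counter

data = [15, 12, 4, 9, 6, 10, 5, 15,
        12, 4, 12, 6, 12, 5, 15, 12, 4, 15, 4, 6, 5]

counts = Counter(data)                 # value -> frequency
highest = max(counts.values())
modes = sorted(value for value, freq in counts.items() if freq == highest)
print(modes, highest)
```

The value 12 occurs five times, more often than any other value, so the mode is 12.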

Evidently, a distribution can have no mode, one mode, or more than one mode. Thus, the mode is not a very reliable measure of
central tendency. However, there are instances when no other measure can be used except the mode. In determining the
prevalent gender, civil status, or highest educational attainment, only the mode can be used because no numerical values can
be assigned to these variables.

D. MIDRANGE
- the mean of the largest and smallest values in the data set.

Remarks
Mean:
1. All the scores or measurements are considered in the computation of the mean.
2. Very high or very low scores or measurements affect the mean.
Median:
1. Only the middle scores or measurements are considered in the computation of the median.
2. Very high or very low scores do not affect the median.
Mode:
1. It is very easy to compute but is seldom used because it is very unstable.
2. It is most appropriate for nominal scale as a measure of popularity.

Exercises
1. Find the mean, median and mode of the following examination scores given in a stem-and-leaf plot.
Exam Scores
4 568
5 34569
6 2356699
7 01133455578
8 122369
2. The numbers of incorrect answers on a true or false competency test for a random sample of 15 students were recorded
as follows: 2, 1, 3, 0, 1, 3, 6, 0, 3, 3, 5, 2, 1, 4, and 2. Find a. mean, b. median, c. mode, d. midrange.
3. A student had accumulated 20 credits with the grade of A, 25 credits with B’s, 10 credits with C’s, and 2 credits with D’s.
The school uses the grading scale in which A = 4 grade points, B = 3, C = 2 and D = 1. Determine the grade point average
of the student.
4. A student was taking six subjects in college during the first semester. Find his average grade if his final grades were as
follows:
Subject Math Physics English Speech Statistics
Grade 1.75 2.50 2.25 1.50 3.0
Units 3 5 3 2 4
5. An economist studying trends in gasoline prices within a city takes a sample of 30 of the city’s gas stations, determining
for each station the price per litre (in dollars) of unleaded regular gasoline. The results are given below. Find the mean
and median.
Price ($) 1.05 1.07 1.08 1.10 1.11 1.12
Frequency 1 3 1 12 8 5
MEASURES OF VARIABILITY OR DISPERSION

The measures of central tendency do not by themselves give an adequate description of the data. It is also very important for us
to know how the observations spread out from the average. The measures of variation indicate the extent to which individual
items in a series are scattered about the average. It is used to determine the extent of the scatter so that steps may be taken to
control the existing variation.

Let us consider the following measurements for two samples of data:

Sample A P24,500 20,700 22,900 26,000 24,100 23,800 22,500


Sample B P24,900 17,500 21,600 29,700 25,300 23,800 21,700

Both samples have the same mean, but it is quite obvious that the measurements for sample A are more uniform, that is, the values are
closer to each other, as compared to sample B.

General Classifications of Measures of Variation


 Measures of Absolute Dispersion
 Measures of Relative Dispersion

Measures of Absolute Dispersion


The measures of absolute dispersion are expressed in the units of the original observations. They cannot be used to
compare variations of two data sets when the averages of these data sets differ a lot in value or when the observations differ in
units of measurement. The most common statistics for measuring the variability of a set of data are the range, variance, and the
standard deviation.

RANGE
The range measures the distance between the largest and the smallest values and, as such, gives an idea of the spread of the
data set. However, the range does not use the concept of deviation. It is affected by outliers but does not consider all values in
the data set. Thus, it is not a very useful measure of variability.
𝑅𝑎𝑛𝑔𝑒 (𝑅) = 𝑚𝑎𝑥𝑖𝑚𝑢𝑚 𝑣𝑎𝑙𝑢𝑒 – 𝑚𝑖𝑛𝑖𝑚𝑢𝑚 𝑣𝑎𝑙𝑢𝑒

MEAN ABSOLUTE DEVIATION


The mean absolute deviation (MAD) utilizes deviations of the data values from the mean in its computation. The MAD is the
average of the absolute values of the deviations from the mean, computed using the formula
$$ \text{population: } MAD = \frac{\sum |x_i - \mu|}{N} \qquad \text{sample: } MAD = \frac{\sum |x_i - \bar{x}|}{n} $$
If a data set A has a greater MAD than data set B, then it is reasonable to believe that the values in data set A are more spread
out (variable) than the values in set B.
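
As an illustration, the sample formula can be applied to Samples A and B given at the start of this section. This is a minimal sketch; the function name is an assumption made for this example.

```python
sample_a = [24500, 20700, 22900, 26000, 24100, 23800, 22500]
sample_b = [24900, 17500, 21600, 29700, 25300, 23800, 21700]

def mean_absolute_deviation(x):
    """Average of the absolute deviations from the mean."""
    mean = sum(x) / len(x)
    return sum(abs(xi - mean) for xi in x) / len(x)

mad_a = mean_absolute_deviation(sample_a)
mad_b = mean_absolute_deviation(sample_b)
print(round(mad_a, 2), round(mad_b, 2))
```

Both samples have a mean of 23,500, but the MAD of Sample B (about 2,771.43) is more than twice that of Sample A (about 1,257.14), which agrees with the observation that Sample A is more uniform.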
VARIANCE AND STANDARD DEVIATION
The variance and the standard deviation are the most common and useful measures of variability. These two measures provide
information about how the data vary about the mean. The variance 𝜎 2 or 𝑠 2 is a measure of variation which considers the position
of each observation relative to the mean of the set. It is an approximate average of the squared deviations from the sample
mean. The standard deviation 𝜎 or 𝑠 is the square root of the variance.

Population Variance: Given the finite population 𝑥1 , 𝑥2 … 𝑥𝑁 , the population variance, which is exact, is
$$ \sigma^2 = \frac{\sum (x_i - \mu)^2}{N} \qquad \text{or} \qquad \sigma^2 = \frac{N \sum x_i^2 - \left(\sum x_i\right)^2}{N^2} $$
Sample Variance: Given a random sample 𝑥1 , 𝑥2 … 𝑥𝑛 , the sample variance is
$$ s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} \qquad \text{or} \qquad s^2 = \frac{n \sum x_i^2 - \left(\sum x_i\right)^2}{n(n - 1)} $$
where: 𝜎 = population standard deviation
𝑠 = sample standard deviation
𝜇 = population mean
𝑥̅ = sample mean
𝑥𝑖 = 𝑖th observation
𝑁 = population size
𝑛 = sample size

If the data are clustered around the mean, then the variance and the standard deviation will be relatively small. If, however,
the data are widely scattered about the mean, the variance and the standard deviation will be relatively large.
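
The variance formulas can be checked numerically. The Python sketch below uses a hypothetical data set and computes the variance from the definition and from the computational formula, treating the same values first as a population (divisor N) and then as a sample (divisor n − 1). The standard-library functions statistics.pvariance and statistics.variance are used only as a cross-check.

```python
import math
import statistics

# Hypothetical data: 8 values with sum 40, so the mean is 5.
data = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(data)
mean = sum(data) / n

# Definitional formulas: sum of squared deviations from the mean.
squared_deviations = sum((x - mean) ** 2 for x in data)    # 32
population_variance = squared_deviations / n               # divide by N
sample_variance = squared_deviations / (n - 1)             # divide by n - 1

# Computational formulas: only the sums of x and x^2 are needed.
sum_x = sum(data)                                           # 40
sum_x2 = sum(x * x for x in data)                           # 232
population_variance_short = (n * sum_x2 - sum_x ** 2) / n ** 2
sample_variance_short = (n * sum_x2 - sum_x ** 2) / (n * (n - 1))

# The standard deviation is the square root of the variance.
population_sd = math.sqrt(population_variance)
sample_sd = math.sqrt(sample_variance)

print(population_variance, population_variance_short, statistics.pvariance(data))
print(sample_variance, sample_variance_short, statistics.variance(data))
print(population_sd, sample_sd)
```

Both routes give the same variance; the computational formula is convenient for hand calculation because it avoids computing each deviation separately.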

Notes:
1. We divide by the quantity 𝑛 − 1 in order to make the sample variance an unbiased estimator of the population variance.
(An estimator is unbiased if its average value is equal to the parameter it is estimating.)
2. The unit of the standard deviation is the same as that of the raw data, so it is preferable to use the standard deviation
as a measure of variability instead of the variance.
3. The range is a quick but rough measure of variation since it considers only the highest value and the lowest value of the
observations.
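
Note 1 can be illustrated by brute force. The sketch below assumes a small hypothetical population and assumes that samples of size 2 are drawn with replacement, so every ordered pair is equally likely. It averages the sample variance over all possible samples, once with divisor n − 1 and once with divisor n, using exact fractions to avoid rounding.

```python
from fractions import Fraction
from itertools import product

# Hypothetical population of N = 4 values (mean 3).
population = [1, 2, 3, 6]
n = 2  # sample size (assumption for this illustration)


def variance(values, divisor):
    """Sum of squared deviations from the mean, divided by `divisor`."""
    mean = Fraction(sum(values), len(values))
    return sum((x - mean) ** 2 for x in values) / divisor


# Exact population variance (divisor N).
sigma2 = variance(population, len(population))

# Every equally likely ordered sample of size n, drawn with replacement.
samples = list(product(population, repeat=n))

# Average of the estimator over all samples, for each choice of divisor.
avg_with_n_minus_1 = sum(variance(s, n - 1) for s in samples) / len(samples)
avg_with_n = sum(variance(s, n) for s in samples) / len(samples)

print(sigma2)              # 7/2
print(avg_with_n_minus_1)  # 7/2, equal to the population variance
print(avg_with_n)          # 7/4, too small on average
```

Under these assumptions, dividing by n − 1 gives an estimator whose average equals the population variance, while dividing by n underestimates it on average.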

In attempting to develop a sense for the standard deviation, we consider the following results from Chebyshev’s theorem:
At least 75% of all scores will fall within two standard deviations of the mean.
At least 89% of all scores will fall within three standard deviations of the mean.
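
Both figures come from the general statement of Chebyshev's theorem: for any k > 1, at least 1 − 1/k² of the values lie within k standard deviations of the mean. A short Python sketch of this bound follows; the function name chebyshev_lower_bound is hypothetical.

```python
def chebyshev_lower_bound(k):
    """Minimum proportion of values within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("the bound is informative only for k > 1")
    return 1 - 1 / k ** 2


for k in (2, 3):
    percent = round(100 * chebyshev_lower_bound(k), 1)
    print(f"k = {k}: at least {percent}%")   # 75.0% for k = 2, 88.9% for k = 3
```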

Let us also consider the empirical rule, which applies to data that is approximately bell shaped. For these bell-shaped
distributions, the empirical rule states that:

About 68% of all scores fall within one standard deviation of the mean.
About 95% of all scores fall within two standard deviations of the mean.
About 99.7% of all scores fall within three standard deviations of the mean.
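
The percentages in the empirical rule can be reproduced if the bell-shaped distribution is assumed to be exactly normal. Under that assumption, the proportion of values within k standard deviations of the mean is erf(k/√2), where erf is the error function available in Python's math module. The function name proportion_within is hypothetical.

```python
import math


def proportion_within(k):
    """Proportion of a normal distribution within k standard deviations of its mean."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(k, round(100 * proportion_within(k), 2))   # 68.27, 95.45, 99.73
```

Real data are only approximately bell shaped, so observed percentages will usually differ somewhat from these values.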

Measures of Relative Dispersion


The measures of relative dispersion are unitless and are used when one wishes to compare the dispersion of one
distribution with that of another distribution.

COEFFICIENT OF VARIATION (CV)


The coefficient of variation standardizes the variation by dividing the standard deviation by the mean and expressing the result
as a percentage. Because the units cancel, it can be used to compare the variation of variables measured in different units.
population: $CV = \left(\frac{\sigma}{\mu}\right) \times 100\%$
sample: $CV = \left(\frac{s}{\bar{x}}\right) \times 100\%$
A larger coefficient of variation implies a more spread out or more dispersed data set.
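
A minimal Python sketch of the sample formula is given below. The height and weight values are hypothetical and were chosen so that both samples have the same standard deviation; the function name coefficient_of_variation is also hypothetical.

```python
import statistics


def coefficient_of_variation(values):
    """Sample CV in percent: (s / x-bar) * 100."""
    return statistics.stdev(values) / statistics.mean(values) * 100


# Hypothetical samples measured in different units.
heights_cm = [150, 160, 170, 180, 190]   # mean 170
weights_kg = [50, 60, 70, 80, 90]        # mean 70

cv_height = coefficient_of_variation(heights_cm)
cv_weight = coefficient_of_variation(weights_kg)

# Both samples have the same standard deviation (about 15.81),
# but relative to their means the weights are more dispersed.
print(round(cv_height, 2), round(cv_weight, 2))
```

Because the standard deviation is divided by the mean, the CV is meaningful only when the mean is not zero.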

Exercises
1. A sample of seven taxicabs from a large fleet of taxicabs used the following amounts of gasoline in one day: 10.9, 19.3,
14.7, 13.8, 15.3, 11.4, and 12.6 gallons. Compute the range, mean absolute deviation, variance, and standard
deviation of the sample data.
2. The manager of a small dry cleaner employs six people. As part of their personnel file, she asked each one to record to
the nearest one-tenth of a mile the distance they travel one way from home to work. The six distances are listed below:
17.6, 22.9, 29.8, 29.7, 12.2, and 15.8. Determine the range and the standard deviation.
3. The following are the gains and losses (in thousands of pesos) of two commodities for 10 business days.
Commodity 1: 6 4 2 -3 4 0 -2 5 4 5
Commodity 2: 3 2 0 -1 -4 3 5 6 5 5
a.) Calculate the mean and standard deviation of each of the samples.
b.) Which commodity shows the more consistent performance?

4. A written test administered to 2 sections of Math 5C gave the following mean and standard deviation.
                          Section A     Section B
Mean                         60            76
Standard Deviation           10            12
Determine which of the 2 sections has greater variability of scores.
5. The mean stature of college women is 5’2” with a standard deviation of 2.5”, while their mean weight is 105 lbs. with a
standard deviation of 8 lbs. Which is more variable, the height or the weight of college women?
The numbers of minutes spent in the computer lab by a sample of 20 students working on a project are given below.
Find the mean, range, variance, standard deviation, and coefficient of variation for this sample.
Numbers of Minutes
30 | 0 2 5 5 6 6 6 8
40 | 0 2 2 5 7 9
50 | 0 1 3 5
60 | 1 3
6. Find the mean, range, variance, standard deviation, and coefficient of variation for the following data set given in a
stem-and-leaf plot.
4 | 568
5 | 34569
6 | 2356699
7 | 01133455578
8 | 12369
9 | 3578
7. The following scores represent the final examination scores for a business statistics course:
23 60 79 32 57 74 52 70 82 36
80 77 81 95 41 65 92 85 55 76
52 10 64 75 78 25 80 98 81 67
41 71 83 54 64 72 88 62 74 43
60 78 89 76 84 48 84 90 15 79
34 67 17 82 69 74 63 80 85 61
Compute the mean, variance, standard deviation and coefficient of variation of the data.
8. A study of the effects of smoking on sleep patterns is conducted. The measure observed is the time, in minutes, that it
takes to fall asleep. These data are obtained:
Smokers: 69.3, 56.0, 22.1, 47.6, 53.2, 48.1, 52.7, 34.4, 60.2, 43.8, 23.2, 13.8
Nonsmokers: 28.6, 25.1, 26.4, 34.9, 29.8, 28.4, 38.5, 30.2, 30.6, 31.8, 41.6, 21.1, 36.0, 37.9, 13.9
a. Find the sample mean for each group.
b. Find the sample standard deviation for each group.
c. Find the coefficient of variation for each group.
d. Comment on what kind of impact smoking appears to have on the time required to fall asleep.
9. The weights of 10 boxes of a certain brand of cereal have a mean content of 278 grams with a standard deviation of
9.64 grams. If these boxes were purchased at 10 different stores and the average price per box is $1.29 with a standard
deviation of $0.09, can you conclude that the weights are relatively more homogeneous than the prices?
SLU-SAMCIS\sir h
