
MODULE 7: DATA MANAGEMENT

Our target learning outcomes for this module are to: a) use a variety of statistical tools
to process and manage numerical data; b) use the methods of linear regression and
correlation to predict the value of a variable given certain conditions; and c) advocate the
use of statistical data in making important decisions.

Introduction to Statistics and Definitions

What is Statistics?
Statistics is the science of collecting, organizing, presenting, analyzing, and
interpreting data to assist in making more effective decisions.

Why study statistics?


Data is everywhere. Statistical techniques are used to make many decisions that
affect our lives. No matter what your career, you will make professional decisions that
involve data. An understanding of statistical methods will help you make these decisions
effectively.

A. Divisions of Statistics

1. Descriptive Statistics. It deals with the methods of organizing, summarizing, and
presenting a mass of data to yield meaningful information. It includes anything done
to the data to summarize or describe them, without any attempt to draw inferences
or conclusions about the gathered data.
Activities:
• Collect data
  e.g. Survey
• Present data
  e.g. Tables and graphs
• Summarize data
  e.g. Sample mean

2. Inferential Statistics. It is concerned with generalizing about a population or other
groups of data based on the study of the sample. It comprises those methods
concerned with the analysis of a subset of data leading to predictions or inferences
about the entire set of data.
Activities:
• Estimation
  e.g. Estimate the population mean weight using the sample mean weight
• Hypothesis testing
  e.g. Test the claim that the population mean weight is 70 kg

Property of and for the exclusive use of SLU. Reproduction, storing in a retrieval system, distributing, uploading or posting online, or transmitting in any form or by any
means, electronic, mechanical, photocopying, recording, or otherwise of any part of this document, without the prior written permission of SLU, is strictly prohibited. 112
B. Population and Sample

1. Population. It consists of the totality of the observations with which we are
concerned. It refers to a group of a total number of people, objects, or reactions
that can be described as having a unique quality or combination of qualities.
Populations can be either finite or infinite.
• A parameter is any numerical value describing a characteristic of a
  population, usually represented by Greek letters.
  Examples:
  • If we consider all math classes to be the population, then the
    average number of points earned per student over all the math
    classes is an example of a parameter.
  • There are 35,000 students enrolled in a university and 15% of
    them are enrolled in math. The figure of 15% is a parameter
    because it is based on the entire population of all enrolled
    students.

2. Sample. It refers to a finite number of objects selected from the population. It is a
collection of some elements in a population and is meant to be representative of the
entire population.
• A statistic is any numerical value describing a characteristic of a sample,
  usually represented by the letters of the English alphabet.
  Examples:
  • If we consider one math class to be a sample of the population
    of all math classes, then the average number of points earned
    by students in that one math class at the end of the term is an
    example of a statistic. The statistic is an estimate of a population
    parameter, in this case the mean.
  • An institution polled 2.3 million adults in the Philippines and 80%
    said that they would vote in the presidential election. That figure
    of 80% is a statistic because it is based on a sample (of 2.3 million),
    not the entire population of all adults in the Philippines.

An illustration below is given to differentiate population and sample.

C. Sample Size Determination

The number of respondents or subjects that form a sample is termed the sample
size. There are many ways to determine the sample size, depending on the conditions:

1. Cochran (1977) presented a set of formulas that can be used to determine the
sample size. Separate formulas apply to estimating a population mean and to
estimating a population proportion, each for a finite and known population size, N,
and for an infinite or unknown population size, N.

In these formulas, n is the sample size, $Z_{\alpha/2}$ is the two-tailed z-score
corresponding to the level of significance $\alpha$, s is the known standard deviation,
e is the margin of error, p is the past estimate of the population proportion, and
q = 1 - p.

NOTE
a. The level of significance $\alpha$ can take any of the standard values, namely 0.01,
0.05, and 0.10. Theoretically, the level of significance is the probability of a
Type I error in hypothesis testing.
b. The following table presents the values of $Z_{\alpha/2}$ corresponding to the standard
values of $\alpha$:

   $\alpha$      $Z_{\alpha/2}$
   0.01          2.575
   0.05          1.96
   0.10          1.645

c. The standard deviation, s, can be estimated from a pilot data set or the value
can be adopted from a previous study that considered the same or similar
population.
d. In the same manner as s, p can be the past estimate of the population
proportion or can be computed from a pilot data set.

2. Yamane’s Formula (Simplified Formula for Proportions)
If the behavior of the population is not certain or the researcher is not familiar
with the population’s behavior, Yaro Yamen’s formula (1980) or Taro Yamane’s
formula (1967) may be used. The formula is:

$$n = \frac{N}{1 + N e^2}$$

where N is the population size and e is the margin of error.

Example 1. 41% of Jacksonville residents said that they had been in a hurricane. How many
adults should be surveyed to estimate the true proportion of adults who have been in a
hurricane, with a 95% confidence level and a 3% margin of error?
Solution
41% is a past estimate of the population proportion, and the population size is unknown.
Hence, we use the following formula.
$\alpha = 0.05$
$p = 0.41$
$q = 1 - 0.41 = 0.59$
$Z_{\alpha/2} = 1.96$

$$n \geq \frac{Z_{\alpha/2}^{2} \, p q}{e^{2}}$$

$$n \geq \frac{(1.96)^{2} (0.41)(0.59)}{(0.03)^{2}}$$

$$n \geq 1{,}032.54 \approx 1{,}033$$

Example 2. From a population of 10,000 individuals of a certain town, what sample size is
needed in order to get an accurate result for a certain study using a margin of error of 3%?
Solution

$$n = \frac{N}{1 + N e^{2}}$$

$$n = \frac{10{,}000}{1 + (10{,}000)(0.03)^{2}}$$

$$n = 1{,}000$$

Hence, the sample size needed in order to get an accurate result for a certain study
using a margin of error of 3% is 1,000 individuals.
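
The two worked examples above can be checked with a short script. This is only a minimal Python sketch, not part of the original module: the function names `cochran_proportion` and `yamane` are illustrative, and the use of exact fractions is an assumed implementation choice so that rounding up is not disturbed by floating-point error.

```python
from fractions import Fraction
from math import ceil


def _exact(value):
    """Convert a decimal such as 0.03 to an exact fraction (3/100)."""
    return Fraction(str(value))


def cochran_proportion(z, p, e):
    """Sample size for a proportion with unknown population size:
    n >= z^2 * p * q / e^2, rounded up to a whole respondent."""
    z, p, e = _exact(z), _exact(p), _exact(e)
    q = 1 - p
    return ceil(z ** 2 * p * q / e ** 2)


def yamane(population_size, e):
    """Yamane's formula n = N / (1 + N * e^2), rounded up."""
    e = _exact(e)
    return ceil(population_size / (1 + population_size * e ** 2))


print(cochran_proportion(1.96, 0.41, 0.03))  # Example 1
print(yamane(10_000, 0.03))                  # Example 2
```

Both functions round up, in line with the round-off rule used for sample sizes in this module.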

D. Sampling Techniques

Sampling is the process of selecting units, like people, organizations, or objects from
a population of interest in order to study and fairly generalize the results back to the
population from which the sample was taken. The two types of sampling are:

1. Random Sampling Techniques
Members from the population are selected in such a way that each individual
member in the population has a chance of being selected.

a. Simple Random Sampling. Every case in the population being sampled
has an equal chance of being chosen. It is an equal probability of selection
method (EPSEM).
Basic Steps:
1. Make a list of the population units and number them from 1 to
N, where N is the population size. This list is called the sampling
frame.
2. Select n random numbers from 1 to N using some random
process.
3. Employ any of the following selection procedures:
• Draw lots
• Lottery
• Use of gadgets such as a calculator or computer to
  generate random numbers
• Table of random numbers
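
As an illustration of the computer-based option, the steps can be sketched in Python. This is a hedged example rather than a prescribed procedure: the function name `simple_random_sample` and the population and sample sizes are hypothetical, and the `seed` argument is there only to make a draw repeatable.

```python
import random


def simple_random_sample(population_size, sample_size, seed=None):
    """Draw `sample_size` distinct unit numbers from the frame 1..N."""
    rng = random.Random(seed)                      # seed only for reproducibility
    frame = range(1, population_size + 1)          # step 1: the sampling frame
    return sorted(rng.sample(frame, sample_size))  # step 2: n numbers, no repeats


# Hypothetical frame of 500 units from which 10 are selected.
chosen = simple_random_sample(population_size=500, sample_size=10, seed=2024)
print(chosen)
```

`random.sample` draws without replacement, so no unit can be selected twice.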

b. Systematic Random Sampling. We select some starting point randomly
and then select every kth (such as every 50th) element in the population
until the desired sample size is achieved.
Basic Steps:
1. Construct the sampling frame.
2. Determine the sample size.
3. Determine the sample interval, k: $k = \frac{N}{n}$
4. Identify the random start, r, using SRS, where $1 \leq r \leq k$.
5. Commencing on the random start, select every kth item until
the desired sample size is reached.
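
The steps above can be sketched in Python as follows. The function name `systematic_sample` and the sizes used are hypothetical. One assumption goes beyond the text: when N/n is not a whole number, the interval is floored to keep every selected unit inside the frame.

```python
import random


def systematic_sample(population_size, sample_size, seed=None):
    """Select every k-th unit of the frame 1..N after a random start."""
    rng = random.Random(seed)
    k = population_size // sample_size  # step 3: interval k = N / n (floored, an assumption)
    r = rng.randint(1, k)               # step 4: random start with 1 <= r <= k
    return [r + j * k for j in range(sample_size)]  # step 5: r, r + k, r + 2k, ...


# Hypothetical frame of 1,000 units and a desired sample of 20, so k = 50.
print(systematic_sample(population_size=1000, sample_size=20, seed=7))
```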

c. Stratified Random Sampling. We subdivide the population into at least two
different subgroups (or strata) so that subjects within the same subgroup
share the same characteristics (such as gender or age bracket), then we
draw a sample from each subgroup (or stratum).

We use Proportional Allocation to draw a sample from each stratum to
reach the desired sample size:

$$n_h = \frac{N_h}{N} \, n$$

where: $n_h$ = sample size for each stratum
       $N_h$ = stratum size
       $N$ = population size
       $n$ = sample size

Example 3. Suppose a school has five departments composed of the
following number of students. Determine the number of students to be part of
the sample when the researcher needs 363 respondents.

Department                        $N_h$      $n_h$
Business Administration (BA)      1,500      140
Management (M)                    1,200      112
Finance (F)                       850        80
Entrepreneurship (E)              200        19
Culinary Arts (CA)                150        14
Total                             3,900

Solution:

$$n_{BA} = \frac{1{,}500}{3{,}900}(363) = 139.62 \approx 140$$

$$n_{M} = \frac{1{,}200}{3{,}900}(363) = 111.69 \approx 112$$

$$n_{F} = \frac{850}{3{,}900}(363) = 79.12 \approx 80$$

$$n_{E} = \frac{200}{3{,}900}(363) = 18.62 \approx 19$$

$$n_{CA} = \frac{150}{3{,}900}(363) = 13.96 \approx 14$$

Hence, 140 students from Business Administration, 112 students
from Management, 80 students from Finance, 19 students from
Entrepreneurship, and 14 students from Culinary Arts are part of the
sample.
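
The allocation in Example 3 can be reproduced with a few lines of Python. This is a sketch only: the function name `proportional_allocation` and the dictionary layout are illustrative choices, and integer arithmetic is used (an assumption of this sketch) so that rounding up is exact.

```python
def proportional_allocation(strata_sizes, sample_size):
    """Return n_h = (N_h / N) * n for each stratum, rounded up."""
    total = sum(strata_sizes.values())  # N, the population size
    # -(-a // b) is the integer ceiling of a / b, with no floating-point error.
    return {
        name: -(-size * sample_size // total)
        for name, size in strata_sizes.items()
    }


departments = {"BA": 1500, "M": 1200, "F": 850, "E": 200, "CA": 150}
allocation = proportional_allocation(departments, sample_size=363)
print(allocation)
print(sum(allocation.values()))
```

Because every stratum is rounded up, the allocations total 365, slightly more than the 363 respondents originally required.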

Sample Size Round-off
Rule: When the calculated sample size is not a whole number, it should
be rounded up to the next higher whole number. Rounding up is the
conservative choice: it ensures that the sample is never smaller than
the size the calculation requires.

Example 4. A sample size calculation determined that 2006.083 data
points were necessary to represent the population. In this case, 2007
data points should be taken.

d. Cluster Random Sampling. Divide the population into sections (or clusters),
then randomly select some of those clusters, and then choose all
members from those selected clusters.

e. Multi-Stage Sampling. This method uses several stages or phases in getting
random samples from the general population.
It is commonly used if the research is of national scope.
• We divide the country into regions
• Regions into municipalities and cities
• Municipalities and cities into barangays
• Barangays into sitios or sections

2. Non-Random Sampling Techniques

a. Accidental, Haphazard, or Convenience Sampling. It is one of the most
common methods of sampling. The results are normally biased
since the researcher considers his/her own convenience in the collection of
the data.

b. Purposive Sampling. It is based on certain criteria laid down by the
researcher. People who satisfy the criteria are interviewed. The
sub-categories of purposive sampling are:

1. Modal Instance Sampling. When we do modal instance sampling, we
are sampling the most frequent cases. The problem with modal instance
sampling is identifying the “modal” case.
2. Expert Sampling. It involves the assembling of a sample of persons with
known or demonstrable experience and expertise in some area.
3. Quota Sampling. Selecting items non randomly according to some
fixed quota.
4. Snowball Sampling. Begin by identifying someone who meets the
criteria for inclusion in your study. You ask them to recommend others
who they may know who also meet the criteria.

E. Statistical Data

Statistical data are the raw materials of research or any statistical investigations
usually obtained by counting or measuring items. Data are categorized:

A. according to description:

a. Qualitative (Categorical) Data are generally described by words or letters. They are
not as widely used as quantitative data because many numerical techniques
do not apply to qualitative data. For example, it does not make sense to
find an average hair color or an average of similar attributes of the population.
Example 5.
• The gender (male, female) of survey respondents
• The numbers 24, 28, 17, 54, and 31 sewn on the shirts of the
  basketball team are categorical data. These numbers are
  substitutes for names. They do not count or measure anything.

Qualitative data can be separated into two subgroups:

1. Dichotomic data take the form of a word with two options, such as
gender (male or female).
2. Polynomic data take the form of a word with more than two options,
such as education (primary school, secondary school, and
university).
b. Quantitative (Numerical) Data are always numbers and are the result of
counting or measuring attributes of a population.
Example 6.
• The ages (in years) of survey respondents
• Distance traveled
• Number of children in a family
Quantitative data can be separated into two subgroups:

1. Discrete data are the result of counting. They are expressed as whole
numbers and are always exact.

Example 7.
• The numbers of eggs that hens lay are discrete data
  because they represent counts.
• The number of students of a given ethnic group in a
  class.
• The number of books on a shelf.

2. Continuous data are the result of measuring. They are not necessarily
whole numbers.
Example 8.
• The amounts of milk from cows are continuous data
  because they are measurements that can assume any
  value over a continuous span. During a year,
  a cow might yield an amount of milk that can be any
  value between 0 and 7000 liters. It would be possible to
  get 5678.1234 liters because the cow is not restricted to
  the discrete amounts of 0, 1, 2, . . . , 7000 liters.
• Distance traveled
• Weight of luggage

B. according to source:

1. Primary data refers to information which is gathered directly from an
original source or which is based on direct or first-hand experience, using
methods like surveys, interviews, or experiments.
2. Secondary data refers to information taken from published or unpublished
materials that have been previously gathered by other individuals,
researchers, or agencies.

C. according to level of measurement

1. Nominal Scale. It involves categorizing cases according to the presence or
absence of some attribute. It is generally used for the purpose of
classification. Data gathered from variables measured at a nominal level can
be categorized but cannot be ranked, as there are no quantitative
differences between and among them.
Example 9.
• gender
• religious affiliation
• eye color

2. Ordinal Scale. It is the simplest scale which orders people, objects, or events
along some continuum. Values of variables measured at the ordinal level
offer at least a rough indication of quantitative differences. They can also be
categorized and ranked, but the numbers are used only to place objects in order.

Example 10.
• year level
• job position

3. Interval Scale. It is the scale on which zero is arbitrary. It does not reflect the
absence of an attribute. Data gathered from variables measured at an
interval scale can be categorized, ranked, and can be added or
subtracted.
Example 11.
• IQ scores
• temperature

4. Ratio Scale. It possesses all of the characteristics of interval scales but has a
true zero point. Thus, a case where 0 is on a scale indicates the total absence
of the property being measured. For values at this level, differences and ratios
are both meaningful.
Example 12.
• Distances (in km) traveled by cars (0 km represents no distance
  traveled, and 400 km is twice as far as 200 km.)
• Prices of books (P0.00 does represent no cost, and a P300.00
  book does cost twice as much as a P150.00 book.)
• height
• weight

METHODS OF DATA COLLECTION

A. Interview (Direct) Method – a method of person-to-person exchange between the
interviewer and the interviewee.
Positive:
1. It provides consistent and more precise information since clarification
may be given by the interviewee.
2. Questions may be repeated or modified to suit the
interviewee’s level of understanding.
Negative:
1. Time-consuming
2. Expensive
3. Limited field coverage
B. Questionnaire (Indirect) Method – in this method written responses are given to
prepared questions. A questionnaire is used to elicit answers to the problems of the
study. Questionnaires may be mailed or hand-carried.
Positive:
1. Inexpensive
2. Can cover a wide area in a shorter span of time.
3. Respondents may feel a greater sense of freedom to express views
and opinions because their anonymity is maintained.

Negative:
1. There’s a strong possibility of non-response, especially when
questionnaires are mailed.
2. Questions not easily understood may not be answered.

C. Observation Method – the investigator observes the behavior of the
subject/respondent. It is used when the subjects cannot talk or write.
Positive:
• The recording of behavior at the appropriate time and situation is
  made possible.

D. Experiment Method - this method is used when the objective is to determine the
cause-and-effect relationship of certain phenomena under controlled conditions. It
is usually used by scientific researchers.

E. Registration Method – this method of gathering information is enforced by law.
Example 13.
• registration of births
• deaths
• vehicles
• licenses
Positive:
1. Information is kept systematized.
2. Information is always made available to the public.

Characteristics of a Good Question


1. A good question is unbiased.
2. Questions must not be worded in a manner that influences the answer of a
respondent in a certain way, that is, to favor a certain response or be against it.
3. An unbiased question is stated in neutral language and there is no element of
pressure
4. A good question must be clear and simply stated.
5. A question that is simple and clear is easier to understand and is more likely to
be answered truthfully.
6. Questions must be precise.
7. Questions must not be vague. The question should indicate clearly the manner in
which the answers must be given.
8. Good questionnaires lend themselves to easy analyses.

METHODS OF DATA PRESENTATION

1. Textual Presentation – This type of presentation incorporates data in a set of
narrative sentences or paragraphs. It emphasizes and compares important figures.
However, it can be tedious to read, especially if it consists of lengthy paragraphs and
some figures or words are repeated many times.

2. Tabular Presentation – This is a systematic way of categorizing related data in rows
and columns. This methodical arrangement, called a statistical table, presents data
more concisely and in greater detail than textual or graphical form.

3. Graphical Method – This is a method of presenting quantitative data in pictorial form.
It produces a device which is often referred to as a graph or chart. Graphs have visual
appeal that can better attract, and hold for longer, the reader’s interest.

Kinds of Graphs and Diagrams


1. Bar Graph. A bar graph uses bars of equal width to show frequencies of
categories of qualitative data. The vertical scale represents frequencies or
relative frequencies. The horizontal scale identifies the different categories of
qualitative data. It is best used for large changes over time/category.

2. Frequency Polygon. One type of statistical graph involves the class midpoints.
A frequency polygon uses line segments connected to points located directly
above class midpoint values. A variation of the basic frequency polygon is
the relative frequency polygon, which uses relative frequencies (proportions
or percentages) for the vertical scale. When trying to compare two data sets,
it is often very helpful to graph two relative frequency polygons on the same
axes.

3. Ogive. Another type of statistical graph called an ogive (pronounced “oh-jive”)
involves cumulative frequencies. Ogives are useful for determining the
number of values below some particular value, as illustrated in Example 3. An
ogive is a line graph that depicts cumulative frequencies. An ogive uses class
boundaries along the horizontal scale, and cumulative frequencies along the
vertical scale.

4. Pie Chart. A pie chart is a graph that depicts qualitative data as slices of a circle, in
which the size of each slice is proportional to the frequency count for the category.

5. Stemplot. A stemplot (or stem-and-leaf plot) represents quantitative data by separating
each value into two parts: the stem (such as the leftmost digit) and the leaf
(such as the rightmost digit).
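
To make the stem/leaf split concrete, here is a small Python sketch. It assumes non-negative whole-number data, takes the tens part as the stem and the ones digit as the leaf, and omits stems that have no leaves. The function name `stemplot` and the ten sample values are illustrative only.

```python
from collections import defaultdict


def stemplot(values):
    """Return the rows of a stem-and-leaf plot for non-negative integers."""
    stems = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)  # e.g. 26 -> stem 2, leaf 6
        stems[stem].append(leaf)
    return [
        f"{stem} | {' '.join(str(leaf) for leaf in leaves)}"
        for stem, leaves in sorted(stems.items())
    ]


for row in stemplot([18, 24, 20, 35, 19, 23, 26, 23, 19, 20]):
    print(row)
```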

MEASURES OF CENTRAL TENDENCY AND OTHER LOCATIONS

Measures of Central Tendency

These are numerical values that tend to locate in some sense the middle of a set of
data when arranged in increasing or decreasing order. The term average is often
associated with these measures: mean, median, mode, midrange.

1. Mean ($\mu$ or $\bar{x}$)

a. Arithmetic Mean. It is obtained by adding all the observations and
dividing the sum by the number of observations; thus it is called the
computational average.

1. Population Mean: If $x_1, x_2, \ldots, x_N$ represents a finite population of size N,
the population mean is given by

$$\mu = \frac{\sum_{i=1}^{N} x_i}{N}$$

2. Sample Mean: If $x_1, x_2, \ldots, x_n$ represents a finite sample of size n,
the sample mean is given by

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$

Example 14: Suppose you chose ten people who entered the campus
and whose ages are as follows: 15 25 18 20 25 18 18 20 20 25
What is the mean age of this sample?

Solution:

$$\bar{x} = \frac{15 + 25 + 18 + 20 + 25 + 18 + 18 + 20 + 20 + 25}{10} = \frac{204}{10} = 20.40$$

The mean age of the sample is 20.40.

b. Weighted Mean. If the data values $x_1, x_2, \ldots, x_k$ have assigned weights
$w_1, w_2, \ldots, w_k$, respectively, the mean is given by

$$\bar{x} = \frac{\sum_{i=1}^{k} w_i x_i}{\sum_{i=1}^{k} w_i}$$

Example 15: A student was taking 5 subjects last semester. Find his average
if his final grades were as follows:

Solution:

$$\bar{x} = \frac{3(1.75) + 5(2.50) + 3(2.25) + 2(1.50) + 4(3.0)}{3 + 5 + 3 + 2 + 4} = 2.32$$
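
Examples 14 and 15 can be verified with a short Python sketch. The helper `weighted_mean` is an illustrative name. The grade and weight lists are read off the solution above (the multipliers 3, 5, 3, 2, 4 are taken to be the weights), which is an assumption since the original table of grades is not reproduced here.

```python
from statistics import mean


def weighted_mean(values, weights):
    """Sum of weight * value divided by the sum of the weights."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)


ages = [15, 25, 18, 20, 25, 18, 18, 20, 20, 25]  # Example 14
print(mean(ages))

grades = [1.75, 2.50, 2.25, 1.50, 3.0]           # Example 15, values
weights = [3, 5, 3, 2, 4]                        # Example 15, assumed weights
print(round(weighted_mean(grades, weights), 2))
```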

Characteristics of Mean
1. It is appropriate for interval and ratio measurements.
2. All the scores or measurements are considered in the computation of the
mean.
3. Very high or very low scores or measurements affect the mean.

2. Mode ($\hat{\mu}$ or $\hat{x}$)
It is the value in the distribution with the highest frequency. It locates the point
where the observation values occur with the greatest density. It can be used for
quantitative as well as qualitative data.
A data set can have one mode, more than one mode, or no mode.
 When two data values occur with the same greatest frequency, each
one is a mode and the data set is bimodal.
 When more than two data values occur with the same greatest
frequency, each is a mode and the data set is said to be multimodal.
 When no data value is repeated, we say that there is no mode.

Example 16
Observe the given ungrouped data below:
a. 1, 2, 3, 4, 5, 6, 7 (no mode)
b. 15.2, 12.3, 4.6, 12.3, 6.5, 12.3, 5.5 ($\hat{x} = 12.3$)
c. 15, 12, 4, 15, 4, 6, 5 ($\hat{x} = 15$ and $\hat{x} = 4$)
d. 3, 4, 5, 1, 3, 2, 4, 5, 7, 10 ($\hat{x} = 3$, $\hat{x} = 4$, and $\hat{x} = 5$)
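
The four cases in Example 16 can be checked with a small Python sketch. The function name `modes` is illustrative. It follows the convention stated above that a data set with no repeated value has no mode, and it assumes the data set is not empty.

```python
from collections import Counter


def modes(data):
    """Return all modes in ascending order, or [] when no value repeats."""
    counts = Counter(data)
    highest = max(counts.values())
    if highest == 1:  # every value occurs once: no mode
        return []
    return sorted(value for value, count in counts.items() if count == highest)


print(modes([1, 2, 3, 4, 5, 6, 7]))                     # case a
print(modes([15.2, 12.3, 4.6, 12.3, 6.5, 12.3, 5.5]))   # case b
print(modes([15, 12, 4, 15, 4, 6, 5]))                  # case c
print(modes([3, 4, 5, 1, 3, 2, 4, 5, 7, 10]))           # case d
```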

Characteristics of Mode
1. It is very easy to compute but is seldom used because it is very unstable.
2. It is used when a rough or quick estimate of a central value is wanted.
3. It is most appropriate for the nominal scale as a measure of popularity.

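3. Median ($\tilde{\mu}$ or $\tilde{x}$)
It is a value that divides the distribution into two equal parts (after arranging
the values in ascending or descending order). As such, it is a positional average. With
the ordered values written as $x_1 \leq x_2 \leq \cdots \leq x_n$, the median is defined by

$$\tilde{x} = \begin{cases} x_{(n+1)/2}, & \text{if } n \text{ is odd} \\[4pt] \dfrac{x_{n/2} + x_{n/2+1}}{2}, & \text{if } n \text{ is even} \end{cases}$$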
Example 17: During the first marking period, Nicole's math quiz scores were

90, 92, 93, 88, 95, 88, 97, 87, and 98. What was the median quiz score?

Solution: Ordering the data from least to greatest, we get:

87, 88, 88, 90, 92, 93, 95, 97, 98

Since n = 9 (odd), the median is the middle (5th) value, $x_5 = 92$.

The median quiz score is 92. (Four quiz scores were higher than 92 and four
were lower.)

Example 18: The ages of 10 college students are listed below. Find the median.
18, 24, 20, 35, 19, 23, 26, 23, 19, 20

Solution: Ordering the data from least to greatest, we get:

18, 19, 19, 20, 20, 23, 23, 24, 26, 35

Since n = 10 (even), the median is the average of the 5th and 6th values:
$\frac{20 + 23}{2} = 21.5$

The median age of the college students is 21.5.
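
Both median examples can be reproduced with the following Python sketch, which sorts the data and then picks the middle value (odd n) or averages the two middle values (even n). The function name `median` is an illustrative choice. Python's standard library also provides `statistics.median`.

```python
def median(data):
    """Middle value of the ordered data, or the average of the two middle values."""
    ordered = sorted(data)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                       # odd n: the single middle value
    return (ordered[mid - 1] + ordered[mid]) / 2  # even n: average the middle pair


print(median([90, 92, 93, 88, 95, 88, 97, 87, 98]))      # Example 17
print(median([18, 24, 20, 35, 19, 23, 26, 23, 19, 20]))  # Example 18
```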

Characteristics of Median
1. It is appropriate for ordinal or ranked measurements.
2. Only the middle scores or measurements are considered in the computation
of the median.
3. Very high or very low scores do not affect the median.
4. It is preferred when there are extreme cases, that is, when the distribution is
markedly skewed.
5. It is used when we desire to know whether the cases fall within the upper half or
the lower half of a distribution.

Measures of Relative Location

It describes or locates the position of certain non-central pieces of data relative to
the entire set of data. These measures, also known as quantiles or fractiles, are values
below which a specific fraction or percentage of the observations in a given data set must
fall. These are percentiles, deciles, and quartiles.

1. Percentiles are values that divide a set of observations into 100 equal parts. These
values, denoted by $P_1, P_2, \ldots, P_{99}$, are such that 1% of the data falls below $P_1$,
2% falls below $P_2$, …, and 99% falls below $P_{99}$.
The kth percentile, $P_k$ ($k = 1, 2, 3, \ldots, 99$), can be determined using the
following procedure:
a. Arrange the data in increasing order and compute the value of the index
$i = \frac{k}{100} n$, where n is the number of observations.
b. If i is an integer, $P_k = \frac{x_i + x_{i+1}}{2}$. If i is not an integer, use the rounded up
value for i and take $P_k = x_i$. Note that $x_i$ here pertains to the ith score in the
ordered data set.

2. Deciles are values that divide a set of observations into 10 equal parts. These
values, denoted by $D_1, D_2, \ldots, D_9$, are such that 10% of the data falls below $D_1$,
20% falls below $D_2$, …, and 90% falls below $D_9$.
The kth decile, $D_k$ ($k = 1, 2, \ldots, 9$), can be determined using the following
procedure:
a. Arrange the data in increasing order and compute the value of the index
$i = \frac{k}{10} n$, where n is the number of observations.
b. If i is an integer, $D_k = \frac{x_i + x_{i+1}}{2}$. If i is not an integer, use the rounded up
value for i and take $D_k = x_i$.

3. Quartiles are values that divide a set of observations into 4 equal parts. These
values, denoted by $Q_1$, $Q_2$, and $Q_3$, are such that 25% of the data falls below $Q_1$,
50% falls below $Q_2$, and 75% falls below $Q_3$.
The kth quartile, $Q_k$ ($k = 1, 2, 3$), can be determined using the following
procedure:
a. Arrange the data in increasing order and compute the value of the index
$i = \frac{k}{4} n$, where n is the number of observations.
b. If i is an integer, $Q_k = \frac{x_i + x_{i+1}}{2}$. If i is not an integer, use the rounded up
value for i and take $Q_k = x_i$.

Example 19: As part of a quality-control study aimed at improving a production line, the
weights (in ounces) of 50 bars of soap are measured. The results are as follows, sorted from
smallest to largest. Find the first quartile, the 9th decile, and the 43rd percentile.

11.6 12.6 12.7 12.8 13.1 13.3 13.6 13.7 13.8 14.1
14.3 14.3 14.6 14.8 15.1 15.2 15.6 15.6 15.7 15.8
15.8 15.9 15.9 16.1 16.2 16.2 16.3 16.4 16.5 16.5
16.5 16.6 17.0 17.1 17.3 17.3 17.4 17.4 17.4 17.6
17.7 18.1 18.3 18.3 18.3 18.5 18.5 18.8 19.2 20.3

a. 43rd Percentile

We compute the index. Note that $k = 43$ and $n = 50$.
Then $i = \frac{43}{100}(50) = 21.5 \approx 22$ (round up).
From the data set, $x_{22} = 15.9$.
Hence, we have $P_{43} = x_{22} = 15.9$.
Hence, 43% of the values lie below 15.9.

b. 9th Decile

Let us compute the index i given that 𝑘 = 9 and 𝑛 = 50


9
𝑖= 50 = 45
10

Since $i$ is an integer, $D_9 = \frac{x_{45} + x_{46}}{2}$. From the data set, $x_{45} = 18.3$ and $x_{46} = 18.5$.
Thus, we have $D_9 = \frac{18.3 + 18.5}{2} = 18.4$.
Hence, 90% of the values lie below 18.4.

c. First quartile

We compute the index. Note that $k = 1$ and $n = 50$.

Then $i = \frac{1}{4}(50) = 12.5 \approx 13$ (round up).
From the data set, $x_{13} = 14.6$.
Hence, we have $Q_1 = x_{13} = 14.6$.
Hence, 25% of the values lie below 14.6.
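
The index rule used for percentiles, deciles, and quartiles can be sketched as one small Python function. This is only an illustration of the procedure in this module, not a standard library routine: the name `position_measure` and its parameters are assumptions made for this sketch, and other textbooks and software use slightly different quantile rules, so their answers may differ.

```python
def position_measure(sorted_data, k, parts):
    """Return the k-th value that splits the data into `parts` equal parts.

    parts=4 gives quartiles, parts=10 deciles, parts=100 percentiles.
    Uses the index rule from this module: i = (k / parts) * n.
    """
    n = len(sorted_data)
    # Use integer arithmetic on k*n so that "is i an integer?" is decided exactly.
    quotient, remainder = divmod(k * n, parts)
    if remainder == 0:
        i = quotient
        # Average x_i and x_(i+1); the text counts from 1, Python from 0.
        return (sorted_data[i - 1] + sorted_data[i]) / 2
    i = quotient + 1  # i is not an integer, so round up
    return sorted_data[i - 1]


# Soap-bar weights (ounces) from Example 19, already sorted
weights = [
    11.6, 12.6, 12.7, 12.8, 13.1, 13.3, 13.6, 13.7, 13.8, 14.1,
    14.3, 14.3, 14.6, 14.8, 15.1, 15.2, 15.6, 15.6, 15.7, 15.8,
    15.8, 15.9, 15.9, 16.1, 16.2, 16.2, 16.3, 16.4, 16.5, 16.5,
    16.5, 16.6, 17.0, 17.1, 17.3, 17.3, 17.4, 17.4, 17.4, 17.6,
    17.7, 18.1, 18.3, 18.3, 18.3, 18.5, 18.5, 18.8, 19.2, 20.3,
]

p43 = position_measure(weights, 43, 100)  # 43rd percentile
d9 = position_measure(weights, 9, 10)     # 9th decile
q1 = position_measure(weights, 1, 4)      # first quartile
print(p43, d9, q1)
```

Up to floating-point rounding, the three printed values should agree with the answers 15.9, 18.4, and 14.6 obtained by hand in Example 19.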

Measures of Variability or Dispersion

Measures of variability indicate the extent to which individual items in a series are
scattered about the average. They are used to determine the extent of the scatter so that
steps may be taken to control the existing variation. Measures of variation are generally classified as:
• Measures of Absolute Dispersion
• Measures of Relative Dispersion

A. Measures of Absolute Dispersion: These are expressed in the units of the observations. They
cannot be used to compare variations of two data sets when the averages of these data
sets differ or when the observations differ in units of measurement.
1. Range. It is the difference between the largest and smallest values. It gives an
idea of the spread of the data set but is affected by outliers and does not
consider all values in the data set.

$\text{Range} = \text{Highest Value} - \text{Lowest Value}$

2. Variance and the Standard Deviation are the most common and useful
measures of variability. These two measures provide information about how
the data vary about the mean.
• The variance $\sigma^2$ or $s^2$ is a measure of variation which considers the
position of each observation relative to the mean of the set.

• Given a finite population $x_1, x_2, \ldots, x_N$, the population variance is
given by

$$\sigma^2 = \frac{\sum (x_i - \mu)^2}{N} \quad \text{or} \quad \sigma^2 = \frac{N \sum x_i^2 - \left(\sum x_i\right)^2}{N^2}$$

• Given a finite sample $x_1, x_2, \ldots, x_n$, the sample variance is given
by

$$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} \quad \text{or} \quad s^2 = \frac{n \sum x_i^2 - \left(\sum x_i\right)^2}{n(n - 1)}$$

• The standard deviation $\sigma$ or $s$ is the square root of the variance.

• Population Standard Deviation:

$$\sigma = \sqrt{\frac{\sum (x_i - \mu)^2}{N}}$$

• Sample Standard Deviation:

$$s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$$

where: $\sigma$ = population standard deviation
$x_i$ = $i$th observation
$s$ = sample standard deviation
$\mu$ = population mean
$\bar{x}$ = sample mean
$N$ = population size
$n$ = sample size

If the data are clustered around the mean, then the variance and the
standard deviation will be small. If, however, the data are widely scattered about
the mean, the variance and the standard deviation will be large.
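
Each variance is given above in two forms, a definitional form and a computational (shortcut) form. The two are algebraically equivalent. A short derivation for the sample case, using $\bar{x} = \frac{\sum x_i}{n}$, is sketched below; the population case follows in the same way with $\mu$ and $N$ in place of $\bar{x}$ and $n$.

$$
\begin{aligned}
\sum (x_i - \bar{x})^2 &= \sum x_i^2 - 2\bar{x}\sum x_i + n\bar{x}^2 \\
&= \sum x_i^2 - \frac{2\left(\sum x_i\right)^2}{n} + \frac{\left(\sum x_i\right)^2}{n} \\
&= \frac{n\sum x_i^2 - \left(\sum x_i\right)^2}{n}
\end{aligned}
$$

Dividing both sides by $n - 1$ gives $s^2 = \frac{n \sum x_i^2 - \left(\sum x_i\right)^2}{n(n - 1)}$, which needs only the totals $\sum x_i$ and $\sum x_i^2$.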

Example 20. A high school teacher at a small private school assigns
trigonometry practice problems to be worked via the net. Students must use a
password to access the problems, and the times of log-in and log-off are
automatically recorded for the teacher. At the end of the week, the teacher
examines the amount of time each student spent working on the assigned problems.
The data are given below, in minutes.

15 28 25 48 22 43 49 34 22 33 27 25 22 20 39

Find the range, standard deviation, and variance of the above data.
Solution: For this data set, $n = 15$, $\sum x_i = 452$, and $\sum x_i^2 = 15160$. These can be
computed using a scientific calculator.

For the standard deviation:
$$s = \sqrt{\frac{n \sum x_i^2 - \left(\sum x_i\right)^2}{n(n - 1)}} = \sqrt{\frac{15(15160) - (452)^2}{15(15 - 1)}}$$
$s = 10.48718$
For the variance:
$$s^2 = \frac{15(15160) - (452)^2}{15(15 - 1)}$$
$s^2 = 109.98095$

For the range:
The highest value is 49 and the lowest value is 15, so
$\text{Range} = 49 - 15 = 34$
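
The totals and results of Example 20 can also be checked with a few lines of Python instead of a scientific calculator. The sketch below applies the computational formula for the sample variance and cross-checks it against the `statistics` module of the Python standard library; the variable names are chosen only for this illustration.

```python
import math
import statistics

# Log-in times (minutes) from Example 20
minutes = [15, 28, 25, 48, 22, 43, 49, 34, 22, 33, 27, 25, 22, 20, 39]

n = len(minutes)
sum_x = sum(minutes)                  # total of the observations
sum_x2 = sum(x * x for x in minutes)  # total of the squared observations

# Computational formula for the sample variance
variance = (n * sum_x2 - sum_x ** 2) / (n * (n - 1))
std_dev = math.sqrt(variance)
data_range = max(minutes) - min(minutes)

# Cross-check with the standard library's sample variance and standard deviation
assert math.isclose(variance, statistics.variance(minutes))
assert math.isclose(std_dev, statistics.stdev(minutes))

print(n, sum_x, sum_x2)
print(round(variance, 5), round(std_dev, 5), data_range)
```

Rounded to five decimal places, the variance and standard deviation should match the values 109.98095 and 10.48718 found in the solution.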

B. Measures of Relative Dispersion. These are used to compare the dispersion of
two data sets when the averages of these data sets differ or when the observations
differ in units of measurement. They are unitless.

1. Coefficient of Variation. It indicates how large the standard deviation is
in relation to the mean. It can be used to compare variations for
different variables with different units. The larger the coefficient of
variation, the more dispersed the observations are.

Population:
$$CV = \frac{\sigma}{\mu} \times 100\%$$
Sample:
$$CV = \frac{s}{\bar{x}} \times 100\%$$

Example 21: If we have a standard deviation of 1.5 and a mean of 5, the ratio of
the standard deviation to the mean is 0.3. In other words, the standard deviation
is 30% of the mean. When comparing two data sets, the general rule of thumb
you should follow is:
The higher the coefficient of variation, the higher the variability of the data
set. This means that, when comparing two or more data sets, the one with the
highest coefficient of variation can be said to have the highest variation.
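
To see the rule of thumb at work on variables with different units, the sketch below computes the sample coefficient of variation for two columns of data. As an assumption made for illustration, it borrows the study hours and examination scores that appear in Example 22 of this module; the helper name `coefficient_of_variation` is also introduced only for this sketch.

```python
import statistics

def coefficient_of_variation(sample):
    """Sample coefficient of variation in percent: (s / mean) * 100."""
    return statistics.stdev(sample) / statistics.mean(sample) * 100

# Figures from Example 21: s = 1.5 and mean = 5
cv_example_21 = 1.5 / 5 * 100

# Data borrowed from Example 22: hours of study and examination scores
hours = [8, 5, 11, 13, 10, 5, 18, 15, 2, 8]
scores = [56, 44, 79, 72, 70, 54, 94, 85, 33, 65]

cv_hours = coefficient_of_variation(hours)
cv_scores = coefficient_of_variation(scores)

print(round(cv_example_21, 2), round(cv_hours, 2), round(cv_scores, 2))
```

Although the scores have the larger standard deviation in absolute terms, the hours of study have the larger coefficient of variation (roughly 52% against roughly 29%), so relative to their means the hours are the more dispersed of the two.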

CORRELATION AND REGRESSION

A. Correlation

Correlation measures the strength of the association or relationship between variables.
Correlation does not imply causation (a cause-and-effect relationship).
Here we assume that the association is linear, that is, one variable increases or decreases by a
fixed amount for each unit increase or decrease in the other.

Pearson Correlation Coefficient

• denoted by $r$
• used to measure the degree of linear association or relationship
• measured on a scale that varies from $-1$ through $0$ to $+1$
• formula is

$$r = \frac{n \sum xy - \sum x \sum y}{\sqrt{\left[n \sum x^2 - \left(\sum x\right)^2\right]\left[n \sum y^2 - \left(\sum y\right)^2\right]}}$$

The value of r is interpreted as follows:

Absolute value of r     Interpretation
1.0                     Perfect positive/negative correlation
0.80-0.99               Very strong positive/negative correlation
0.60-0.79               Strong positive/negative correlation
0.40-0.59               Moderate positive/negative correlation
0.20-0.39               Weak positive/negative correlation
0.01-0.19               Very weak positive/negative correlation
0.0                     No correlation

[Figure: scatter plots labeled "Perfect positive correlation", "Perfect negative correlation", "No correlation", and "Use other measures of correlation"]

Example 22. Given the following data on the number of hours of study (x) for an
examination and the scores (y) received by a random sample of 10 students,
compute the Pearson correlation coefficient.

Student     x     y      xy      y²     x²
1           8    56     448    3136     64
2           5    44     220    1936     25
3          11    79     869    6241    121
4          13    72     936    5184    169
5          10    70     700    4900    100
6           5    54     270    2916     25
7          18    94    1692    8836    324
8          15    85    1275    7225    225
9           2    33      66    1089      4
10          8    65     520    4225     64
Total      95   652    6996   45688   1121

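$$r = \frac{n \sum xy - \sum x \sum y}{\sqrt{\left[n \sum x^2 - \left(\sum x\right)^2\right]\left[n \sum y^2 - \left(\sum y\right)^2\right]}}$$

$$r = \frac{10(6996) - (95)(652)}{\sqrt{\left[10(1121) - (95)^2\right]\left[10(45688) - (652)^2\right]}}$$

$r = 0.9625$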

There is a very strong positive linear relationship between the number of hours
of study (𝑥) for an examination and the scores (𝑦) received by a random sample of
10 students.

Example 23: Consider the scores obtained in Math and Statistics by 10 students.

Student       1    2    3    4    5    6    7    8    9   10
Math Score    5    8   10   12   12   14   15   16   18   20
Stat Score    2    7    8    9   10   12   14   10   16   12

Student            1    2    3    4    5    6    7    8    9   10   Total
Math Score (x)     5    8   10   12   12   14   15   16   18   20     130
Stat Score (y)     2    7    8    9   10   12   14   10   16   12     100
xy                10   56   80  108  120  168  210  160  288  240    1440
x²                25   64  100  144  144  196  225  256  324  400    1878
y²                 4   49   64   81  100  144  196  100  256  144    1138

101440   130 100 


r
101878   130  101138   100 
2 2

r  0.8692

There is a very strong positive linear relationship between math and stat scores.
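
The raw-sum formula for $r$ used in Examples 22 and 23 translates directly into code. The sketch below is one possible way to write it; the function name `pearson_r` is an assumption made for this illustration, and the data are copied from the two examples.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient computed from the raw sums."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt(
        (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
    )
    return numerator / denominator

# Example 22: hours of study (x) and examination scores (y)
hours = [8, 5, 11, 13, 10, 5, 18, 15, 2, 8]
scores = [56, 44, 79, 72, 70, 54, 94, 85, 33, 65]
r_example_22 = pearson_r(hours, scores)

# Example 23: Math scores (x) and Stat scores (y)
math_scores = [5, 8, 10, 12, 12, 14, 15, 16, 18, 20]
stat_scores = [2, 7, 8, 9, 10, 12, 14, 10, 16, 12]
r_example_23 = pearson_r(math_scores, stat_scores)

print(round(r_example_22, 4), round(r_example_23, 4))
```

Rounded to four decimal places, the two results should agree with the hand computations, 0.9625 and 0.8692.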

B. Regression

Regression is used to examine the relationship between one dependent and one
independent variable and to predict the dependent variable (Y) when the independent
variable (X) is known.
It finds the best line (the regression line) that predicts Y from X.

The Regression Line

It is a line that is as close as possible to all the data points at once.

The Regression Equation

It is an equation that represents the relationship between one dependent
and one independent variable.
$y = a + bx$

The slope is
$$b = \frac{n \sum xy - \sum x \sum y}{n \sum x^2 - \left(\sum x\right)^2}$$

The y-intercept is
$$a = \frac{\sum y}{n} - b\left(\frac{\sum x}{n}\right)$$
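
The slope and intercept formulas can be sketched as a short Python function. This is only an illustration: the name `fit_line` is an assumption, and the data are the hours of study and examination scores from Example 22, which the module itself uses for correlation rather than regression.

```python
def fit_line(xs, ys):
    """Return (a, b) for the regression line y = a + b*x."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    # Slope: b = (n*Sxy - Sx*Sy) / (n*Sxx - Sx^2)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    # Intercept: a = mean(y) - b * mean(x)
    a = sum_y / n - b * (sum_x / n)
    return a, b

# Data borrowed from Example 22
hours = [8, 5, 11, 13, 10, 5, 18, 15, 2, 8]
scores = [56, 44, 79, 72, 70, 54, 94, 85, 33, 65]

a, b = fit_line(hours, scores)
print(round(a, 4), round(b, 4))
```

For these data the slope is about 3.67 and the intercept about 30.33, so each additional hour of study is associated with roughly 3.67 more points on the examination. Because $a$ is computed from the two means, the fitted line always passes through the point $(\bar{x}, \bar{y})$.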

Coefficient of Determination (𝑹𝟐 )

It is the square of the correlation coefficient.

It is interpreted as the proportion of the variance in the dependent variable that is
predictable from the independent variable.
It indicates how closely the data points fall to the regression line.
$R^2 = 1$ (all points lie exactly on a straight line, with no points scattered about the line)
means that the dependent variable is predicted without error from the
independent variable X.
$R^2 = 0$ means that the dependent variable cannot be predicted from the
independent variable X.
An $R^2$ between 0 and 1 indicates the extent to which the dependent variable is
predictable.
An $R^2$ of 0.10 means that 10 percent of the variance in Y is predictable from X;
an $R^2$ of 0.20 means that 20 percent is predictable; and so on.

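$$R^2 = \frac{SSR}{SSY} \times 100$$

where:
$SSR = b_1 \, SPXY$ (here $b_1$ is the slope $b$ of the regression line)
$SPXY = \sum x_i y_i - \frac{\left(\sum x_i\right)\left(\sum y_i\right)}{n}$
$SSY = \sum y_i^2 - \frac{\left(\sum y_i\right)^2}{n}$
$SSX = \sum x_i^2 - \frac{\left(\sum x_i\right)^2}{n}$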
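
As a brief check, this ratio agrees with the earlier statement that the coefficient of determination is the square of the correlation coefficient. The sketch assumes the usual sum-of-squares definitions $SPXY = \sum x_i y_i - \frac{(\sum x_i)(\sum y_i)}{n}$, $SSX = \sum x_i^2 - \frac{(\sum x_i)^2}{n}$, and $SSY = \sum y_i^2 - \frac{(\sum y_i)^2}{n}$. Under these definitions the slope is $b = \frac{SPXY}{SSX}$ and the correlation coefficient is $r = \frac{SPXY}{\sqrt{SSX \cdot SSY}}$, so

$$\frac{SSR}{SSY} = \frac{b \cdot SPXY}{SSY} = \frac{SPXY^2}{SSX \cdot SSY} = r^2$$

Here the ratio is written as a proportion; multiplying by 100 expresses it as a percentage.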

Example 24: The paired data below consist of the costs of advertising (in thousands
of pesos) and the number of products sold (in thousand units).

Cost (x)     Products Sold (y)   xy                  x²               y²
9,000.00     85,000.00           765,000,000.00      81,000,000.00    7,225,000,000.00
2,000.00     52,000.00           104,000,000.00      4,000,000.00     2,704,000,000.00
3,000.00     55,000.00           165,000,000.00      9,000,000.00     3,025,000,000.00
4,000.00     68,000.00           272,000,000.00      16,000,000.00    4,624,000,000.00
2,000.00     67,000.00           134,000,000.00      4,000,000.00     4,489,000,000.00
5,000.00     86,000.00           430,000,000.00      25,000,000.00    7,396,000,000.00
9,000.00     83,000.00           747,000,000.00      81,000,000.00    6,889,000,000.00
10,000.00    73,000.00           730,000,000.00      100,000,000.00   5,329,000,000.00
Total
44,000.00    569,000.00          3,347,000,000.00    320,000,000.00   41,681,000,000.00

1. Plot a scatter diagram

2. Find the equation of the regression line to predict weekly sales from
advertising expenditures.

Thus, the equation is 𝑦 = 55788.25 + 2.7885𝑥

3. Estimate the number of products sold when the advertising cost is P4,500.

$y = 55788.25 + 2.7885x$
number of products sold = 55788.25 + 2.7885(4,500)
number of products sold = 68,336.50 units

4. Determine the coefficient of determination

Therefore, 50.08 % of the variance in the number of products sold is predictable from
the cost of advertising.
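
The whole example can be recomputed from the table with the sums-of-squares quantities SPXY, SSX, and SSY. The sketch below is one way to do this in Python, assuming the usual definitions in which each correction term is divided by $n$; the variable names are chosen for illustration only.

```python
# Values from the table: advertising cost (x) and products sold (y)
cost = [9000, 2000, 3000, 4000, 2000, 5000, 9000, 10000]
sold = [85000, 52000, 55000, 68000, 67000, 86000, 83000, 73000]

n = len(cost)
sum_x, sum_y = sum(cost), sum(sold)

# Sums of products and squares, each corrected by the totals divided by n
spxy = sum(x * y for x, y in zip(cost, sold)) - sum_x * sum_y / n
ssx = sum(x * x for x in cost) - sum_x ** 2 / n
ssy = sum(y * y for y in sold) - sum_y ** 2 / n

b = spxy / ssx                   # slope
a = sum_y / n - b * (sum_x / n)  # y-intercept
predicted = a + b * 4500         # estimate at an advertising cost of 4,500
r_squared = b * spxy / ssy * 100 # coefficient of determination, in percent

print(round(b, 4), round(a, 2), round(predicted, 2), round(r_squared, 2))
```

Carried at full precision, the slope rounds to 2.7885 as in the text, while the intercept comes out near 55,788.46, the estimate near 68,336.54, and the coefficient of determination near 50.09%. The small differences from 55,788.25, 68,336.50, and 50.08% appear to come from rounding intermediate values in the hand computation, and they do not change the interpretation.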

Practice Exercise 7-1: My Score:

Write your solutions (if any) and answers on a clean sheet of paper. Submit the image of
your HANDWRITTEN SOLUTIONS as a single pdf file in the submission bin for this activity in the
Classroom. You may use image scanning apps on your phone (CamScanner or Tap
Scanner) to save several images into 1 pdf file, or place your images in a document and
save as a pdf file.

A. Classify the following statements as to whether they belong to the area of
descriptive statistics or inferential statistics.

___________________ 1. At most 5% of SLU students are smokers.


___________________ 2. Assuming that less than 20% of the Kalinga coffee beans were
destroyed by a typhoon these past months, we should expect
an increase of no more than P30 for a kilogram of coffee by the
end of the year.
___________________ 3. An employee generalized that the average monthly salary of a
regular employee in a certain company is P12,000.
___________________ 4. A study found that, of all customers who received a gift
certificate from a store, 75% went back to the store to shop.
___________________ 5. The average grade in statistics of 50 students is 83.60.

B. At what level are the following variables measured?


___________________ 1. Students rated as superior, above average, average,
below average, or poor

___________________ 2. The scores of students in a statistics quiz
___________________ 3. The main source of income
___________________ 4. The birth order of children in the family
___________________ 5. Age of students availing a discount
___________________ 6. Weights of a sample of bags of raw materials for the
production of a certain product, measured in grams.
___________________ 7. The natural eye color of a sample of 100 children.
___________________ 8. The economic status of a sample of families in a certain
area.
___________________ 9. The final grade of graduate students taking up Statistics.
___________________ 10. The school in which a graduate student is enrolled.

C. Classify the following variables as quantitative or qualitative variables. If the variable is
quantitative, identify whether it is discrete or continuous.

_______________ _______________ 1. The type of payment used by customers


_______________ _______________ 2. The evaluation rating of instructors
_______________ _______________ 3. The classification of employees in a
company
_______________ _______________ 4. The weekly allowance of students
_______________ _______________ 5. The length of telephone calls made by
students to their parents

D. In each of the following situations, identify the population, each variable, and
determine if the variable is qualitative or quantitative.
1. A quality control worker with Sweet-Tooth Candy weighs every 100th candy
bar to make sure it is very close to the published weight.


2. John decides to group his employees according to the type of skill possessed.
3. A researcher is studying the effect of a newly formulated method in glue
laminating wood. She performs an experiment where she compares the
shear stress of the “gluelam” wood using the new method and commercially
available “gluelam” wood. She used 10 items for each method (new and
commercial).

E. To assign workers to two stores, the owner has the workers count off by two to divide
them into teams. Is this (team) a qualitative or quantitative variable?

F. A school is studying its students' test scores by grade. Explain how the characteristic
'grade' could be considered either a categorical or a numerical variable.

G. Which of the following situations will result in probability or non-probability sampling?
1. Population: All residents of a big city
Sampling technique: For one week, researchers stop every fourth person who
passes by a busy downtown street corner.
2. Population: All students in a large high school
Sampling technique: selecting the first 50 students reporting to school on a
Wednesday morning.
3. Population: All the 72 guests at a birthday party.
Sampling technique: The name of each person is written on a slip of paper
then all are placed in a box, mixed, then drawn one after the other for the
available ten door prizes.
4. Population: Business owners with less than 15 employees
Sampling technique: Get information from the DTI (business permits section),
then select a sample size of about 30% from each of the included 12
barangays

H. In each of the following situations, a random sample must be obtained. Determine
whether a cluster, stratified, or systematic random sampling would be appropriate.
Explain in detail how the sampling is to be conducted. Do not discuss expected results
or conclusions.
1. A large convenience store chain wishes to determine its customers' level of
satisfaction with regard to their service.

2. A nationwide survey on charter change is to be conducted. (Note: there are
seventeen regions in the Philippines.)

3. An educational researcher wants to compare the difference in career goals
between male and female students of Otto Hahn University, which has 10,000
students.

4. A social researcher wants to determine whether electronic engineers who
work in the communications field earn more than those who are in
semiconductor industries.

5. A market analyst would like to compare the durability, in terms of mean time
before wear, of two leading brands of car tires.

I. The numbers of incorrect answers on a true or false competency test for a random
sample of 15 students were recorded as follows: 2, 1, 3, 0, 1, 3, 6, 0, 3, 3, 5, 2, 1, 4, and 2.
Find a. mean, b. median, c. mode. Show your solution.

J. A student was taking five subjects in college during the first semester. Find his average
grade if his final grades were as follows. Show your solution.

Subject   Math   Physics   English   Speech   Statistics
Grade     1.75   2.50      2.25      1.50     3.0
Units     3      5         3         2        4

K. A corporation administers an aptitude test to all new sales representatives.
Management is interested in the extent to which this test is able to predict their
eventual success. The accompanying table records average weekly sales (in thousands
of pesos) and aptitude test scores for a random sample of eight representatives.

1. Plot a scatter diagram


2. Estimate the linear regression of weekly sales on aptitude test scores.
3. Estimate the weekly sales when the test score is 70.
4. Determine the coefficient of determination

It’s time to answer Quiz 3!
