Statistics Modules
The term statistics came from the Latin phrase “ratio status”, which means the study of practical
politics or the statesman’s art.
In the middle of the 18th century, the term statistik (a term due to Achenwall) was used, a
German term defined as “the political science of several countries”.
From statistik it became statistics, defined as a statement in figures and facts of the present
condition of a state.
Examples:
1) Comparing the effects of five kinds of fertilizers on the yield of a particular variety of corn.
2) Determining the income distribution of Filipino families.
3) Comparing the effectiveness of two diet programs.
4) Prediction of daily temperatures.
5) Evaluation of student performance.
Hence, Statistics is a scientific body of knowledge that deals with the collection, organization
or presentation, analysis, and interpretation of data.
Descriptive Statistics may answer questions such as:
1. How many male and female students are interested in Market Research?
2. What are the highest and lowest scores obtained in the standardized admission exam?
3. What are the characteristics of the most successful students according to research?
4. Which group of employees has produced more outputs?
Inferential Statistics draws inferences about the population based on the data gathered
from samples using the techniques of descriptive statistics. Descriptive statistics is therefore the
backbone of inferential statistics. It may answer questions like:
1. Is there a significant difference between the performance of the male and female students in
statistics?
2. Is there a significant difference between the proportion of students who are interested in taking
Coursera courses online and those who are not?
3. Is there a significant correlation between educational attainment and job satisfaction?
With inferential statistics, we are trying to reach conclusions that extend beyond the
immediate data alone. We may also use inferential statistics to judge the probability
that an observed difference between groups is a dependable one or one that might have
happened by chance. It is a matter of deciding between reality and coincidence.
Key Definitions
Parameters are numerical measures that describe the population or universe of interest. They are
usually denoted by Greek letters: μ (mu), σ (sigma), ρ (rho), λ (lambda), τ (tau), θ (theta),
α (alpha) and β (beta).
Example: There are 8,756 math education students enrolled in Davao Region this school year.
The average age of math education students is 20.
N = 8,756 and μ = 20 are parameters because they both describe the population.
Sample is a small portion or part of a population; a representative of the population in a
research study.
Example: Out of the 8,756 students enrolled in Education (major in Math), 2,890 are male. The
average age of male students is 21.
Variable - a variable is a characteristic of objects, people, or events that can take on different
values. It can vary in quantity (e.g. weight of people) or quality (e.g. hair color of people).
Variables can be classified in different ways.
Types of Variables. There are basically two types of random variable yielding two types of data:
qualitative and quantitative.
VARIABLES
    Qualitative
    Quantitative
        Discrete
        Continuous
Quantitative variable (numerical values) - A variable that is conceptualized and analyzed along
an implied continuum. It differs in amount or degree.
In the broadest sense, all collected data are “measured” in some form. For example, even discrete
quantitative data can be thought of as arising by a process of “measurement through
counting.” There are four widely recognized levels of measurement: nominal, ordinal, interval and
ratio.
Levels of Measurements
Exhaustive is a property of a set of categories such that each individual or object must appear in
a category.
3. Interval level of measurement - is used to classify, order and differentiate between classes or
categories in terms of degrees of difference. Interval data are either discrete or continuous
variables, e.g. temperature (in °C or °F).
Equal intervals; no absolute zero
4. Ratio level of measurement - differs from the interval level of measurement in only one aspect:
it has a true zero point (complete absence of the attribute being measured). With an absolute zero
point, it can be said that one observation is “twice as fast”, “half as long”, and so on, relative to another.
Ratio data are either discrete or continuous variables. Has absolute zero.
Example: weight in pounds, age (in years or days), salary (in Philippine peso)
What is sampling?
One of the most important steps in the research process is to select the sample of
individuals who will participate as a part of the study. Sampling refers to the process of selecting
these individuals.
The purpose of sampling is for the researchers to be able to draw conclusions about the
population from the study of samples. We use inferential statistics, which enables us to
determine a population’s characteristics by directly observing or studying only a portion (or
sample) of the population. We use a sample rather than a complete enumeration (a census) of the
population because it is more convenient and cheaper to observe a small part rather than the whole.
1. Reduced Cost
2. Greater Speed or Timeliness
3. Greater Efficiency and Accuracy
4. Greater Scope
5. Convenience
6. Necessity
7. Ethical Considerations
PROBABILITY SAMPLES
Samples are obtained using some objective chance mechanism, thus involving
randomization.
They require the use of a complete listing of the elements of the universe called the
sampling frame.
The probabilities of selection are known.
They are generally referred to as random samples.
They allow drawing of valid generalizations about the universe/population.
1. Simple Random Sampling is a process of selecting a sample of size n from the population via
random numbers or through a lottery.
2. Systematic Sampling is a process of selecting every kth element in the population until the desired
number of subjects or respondents is attained.
Example: Suppose we have the data shown below and we want to take every 5th element on the
list.
23 34 12 14 13 23 24 39 27 23
12 15 16 23 26 28 23 22 19 34
24 22 18 30 23 24 17 18 15 12
Therefore, taking every 5th element from left to right gives the sample 13, 23, 26, 34, 23 and 12.
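The selection above can be reproduced with a short sketch. It assumes the list is read left to right, row by row, and that counting starts at the first value; the function name `systematic_sample` is only illustrative.

```python
# Data from the example, read left to right, row by row.
data = [
    23, 34, 12, 14, 13, 23, 24, 39, 27, 23,
    12, 15, 16, 23, 26, 28, 23, 22, 19, 34,
    24, 22, 18, 30, 23, 24, 17, 18, 15, 12,
]


def systematic_sample(values, k):
    """Return every k-th element, starting from the k-th (1-based) element."""
    if k < 1:
        raise ValueError("k must be a positive integer")
    # Python indexes from 0, so the k-th element sits at index k - 1.
    return values[k - 1::k]


sample = systematic_sample(data, 5)
print(sample)
```

In practice the starting point is often chosen at random among the first k elements; the fixed start here simply mirrors the example.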
3. Stratified sampling is a process of subdividing the population into subgroups or strata and
drawing members at random from each subgroup or stratum.
Example: Given the population of a certain university and a target sample size of 2,069,
determine the sample size of each subgroup or course.

Field of Specialization      Population
Education                    6,000
Agricultural Engineering     500
Information Technology       2,000
Agribusiness                 1,000
Accountancy                  2,500
Total                        12,000

To determine the sample size, we will use Slovin’s formula:
$$n = \frac{N}{1 + Ne^2}$$
Where: n = the desired sample size
N = population size
e = margin of error
Example 1: Find n if N = 12,000 and e = 2%.
$$n = \frac{N}{1 + Ne^2} = \frac{12{,}000}{1 + 12{,}000(0.02)^2} = \frac{12{,}000}{1 + 4.8} = \frac{12{,}000}{5.8} \approx 2{,}069$$
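As a rough sketch, Slovin's formula and a split of the sample across the strata can be computed as follows. The text does not say how the sample is divided among the subgroups, so proportional allocation (each stratum's share equals its share of the population) is an assumption here, as are the function names and the choice to round n up.

```python
import math


def slovin(population_size, margin_of_error):
    """Slovin's formula n = N / (1 + N * e^2), rounded up to a whole unit."""
    n = population_size / (1 + population_size * margin_of_error ** 2)
    return math.ceil(n)


def proportional_allocation(strata, sample_size):
    """Assumed rule: each stratum gets a share proportional to its size."""
    total = sum(strata.values())
    return {name: round(sample_size * size / total)
            for name, size in strata.items()}


strata = {
    "Education": 6000,
    "Agricultural Engineering": 500,
    "Information Technology": 2000,
    "Agribusiness": 1000,
    "Accountancy": 2500,
}

n = slovin(sum(strata.values()), 0.02)   # N = 12,000 and e = 2%
allocation = proportional_allocation(strata, n)

print("Sample size n:", n)
for name, share in allocation.items():
    print(f"{name:<26}{share:>6}")
```

Because each share is rounded separately, the shares may differ from n by a unit or two in total; a researcher would normally adjust one stratum by hand.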
NON-PROBABILITY SAMPLES
Non-probability sampling is a procedure in which samples are selected in a deliberate manner with
little or no attention to randomization.
After the research problem has been laid out, the next step is to determine the methods to collect
data. Here are the five basic methods of collecting data:
1. Direct or Interview Method. It is a face-to-face encounter between the interviewer and the
interviewee. The interview may vary according to the preference of either or both parties.
However, this method is time-consuming, expensive, and has limited field coverage.
2. Indirect or Questionnaire Method. Unlike the direct method, this method uses questionnaires
to obtain information. It can be done by mail or hand-carried to the intended respondents.
3. Registration Method. This method of gathering information is governed by laws. Examples:
birth certificates, death certificates, and licenses.
4. Observation Method. This method is used for data pertaining to the behavior of an
individual or a group of individuals at the time a given situation occurs; such data are best
obtained by observation. One limitation of this method is that observation can be made only at the
time the appropriate events occur.
5. Experiment Method. This is used to determine the cause-and-effect relationship of certain
phenomena under controlled conditions. This method is usually employed by scientific researchers.
1. Textual Method. This method presents the collected data in narrative and paragraph forms.
2. Tabular Method. This method presents the collected data in tables, which are orderly arranged
in rows and columns for easier and more comprehensive comparison of figures.
3. Graphical Method. This method presents the collected data in visual or pictorial form to get a
clear view of data (e.g. histogram, pie chart, pareto chart, pictograph, etc.)
When conducting statistical research, an investigation or a study, the researcher must gather data for
the particular variable under investigation. To describe situations, make conclusions, and draw
inferences about events, the researcher must organize the data gathered in some meaningful way.
The easiest and most widely used way of organizing data is to construct a frequency distribution. A
frequency distribution is a grouping of the data into categories showing the number of
observations in each of the non-overlapping classes.
After organizing the data, the next move of the researcher is to present the data so they
can be understood easily by those who will benefit from reading the study. The most useful
method of presenting the data is by constructing graphs and charts. There are a number of ways to
plot graphs and charts, and each one has a specific purpose.
TEXTUAL PRESENTATION OF DATA
Good statistical presentation requires making it easy for readers to understand and
interpret the data, and to identify key patterns or trends.
Data presented in paragraphs or in sentences are said to be in textual form. This includes
enumerating important characteristics, emphasizing the most significant features and
highlighting the most striking attributes of the set of data. Please see the example below.
The data are the Math test scores of 15 students on a 50-item test: 47, 48, 49, 42, 42, 36, 38, 40,
35, 50, 44, 45, 45, 50, 50. Make a simple analysis by writing findings, drawing conclusions and
making an inference.
Writing the data in numerical order may help to analyze the data:
35, 36, 38, 40, 42, 42, 44, 45, 45, 47, 48, 49, 50, 50, 50
Findings: The lowest score is 35, and the highest is 50. Three students got a perfect score of 50;
one student each got 35, 36, 38, 40, 44, 47, 48 and 49, while two students each got 42 and 45. If
the passing mark is 70% (35 out of 50), nobody failed the test.
Conclusions: I therefore conclude that the students performed well in the test.
Inference: If this trend continues, then it is likely that nobody will fail in this Math class.
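The findings above can be checked with a small sketch that orders the scores and tallies them. The variable names are illustrative, and the passing mark is taken as 70% of the 50 items, as in the example.

```python
from collections import Counter

scores = [47, 48, 49, 42, 42, 36, 38, 40, 35, 50, 44, 45, 45, 50, 50]
total_items = 50
passing_mark = 70 * total_items / 100   # 70% of 50 items = 35 points

ordered = sorted(scores)                # numerical order, lowest to highest
tally = Counter(ordered)                # how many students got each score
failed = [s for s in scores if s < passing_mark]

print("Ordered scores:", ordered)
print("Lowest:", ordered[0], "Highest:", ordered[-1])
print("Scores shared by more than one student:",
      {score: count for score, count in tally.items() if count > 1})
print("Number of students who failed:", len(failed))
```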
Definitions
Though analysis can be done from the text, it is recommended to organize the
data in tables for better comparison of values and quicker and better analysis of details.
Furthermore, if data are presented in plain text, readers sometimes get bored; thus tables and
graphs are often used.
Defining Some Terms
Before we start constructing a frequency distribution, we must define some terms
that are essential for a deeper understanding of the nature of the data displayed in a frequency
distribution.
Raw data are the data collected in original form.
Range is the difference between the highest value and the lowest value in a distribution.
Frequency distribution is the organization of data in tabular form, using mutually
exclusive classes and showing the number of observations in each.
Class Limits are the highest and lowest values describing a class.
Class Boundaries are the upper and lower values of a class in a grouped frequency
distribution; they have one more decimal place than the class limits and end with
the digit 5.
Interval (width) is the distance between the class lower boundary and the class upper
boundary, and it is denoted by the symbol i.
Frequency (f) is the number of values in a specific class of a frequency distribution.
Relative Frequency is the value obtained when the frequency in each class of the
frequency distribution is divided by the total number of values.
Percentage is obtained by multiplying the relative frequency by 100%.
Cumulative Frequency (cf) is the sum of the frequencies accumulated up to the upper
boundary of a class in a frequency distribution.
Midpoint is the point halfway between the class limits of each class and is representative
of the data within that class.
A grouped frequency distribution is used when the range of the data set is large; the
data must be grouped into classes, whether they are categorical data or interval data. For interval
data, the classes are more than one unit in width. The procedure for constructing the frequency
distribution is discussed in the succeeding sections.
$$i = \frac{\text{Range}}{\text{Number of Classes}} = \frac{HV - LV}{k} \qquad \text{(Formula 2-1)}$$
where: HV = highest value in the data set; k = number of classes
LV = lowest value in the data set; i = suggested class interval
$$i = \frac{\text{Range}}{1 + 3.322\,\log N}$$
where N = the number of observations
3. Rule 3: Another guideline to determine the class interval is to have an ideal
number of classes, then apply Formula 2-3.
$$i = \frac{HV - LV}{\text{Number of Classes}} \qquad \text{(Formula 2-3)}$$
Example: EPA Travel Agency, a nationwide local travel agency, offers special rates during the summer
period. The owner wants additional information on the ages of those people taking travel tours.
A random sample of 50 customers taking travel tours last summer revealed these ages.
18 29 42 57 61 67 37 49 53 47
24 34 45 58 63 70 39 51 54 48
28 36 46 60 66 77 40 52 56 49
19 31 44 58 62 68 38 50 54 48
27 36 46 59 64 74 39 51 55 48
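A minimal sketch of the class-interval calculation for these 50 ages is shown below. It applies the formula with 1 + 3.322 log N in the denominator (commonly known as Sturges' rule) and, for comparison, the range divided by a chosen number of classes. Using a base-10 logarithm, rounding the interval up to a whole number and choosing 7 classes are assumptions made for illustration.

```python
import math

ages = [
    18, 29, 42, 57, 61, 67, 37, 49, 53, 47,
    24, 34, 45, 58, 63, 70, 39, 51, 54, 48,
    28, 36, 46, 60, 66, 77, 40, 52, 56, 49,
    19, 31, 44, 58, 62, 68, 38, 50, 54, 48,
    27, 36, 46, 59, 64, 74, 39, 51, 55, 48,
]

n = len(ages)
data_range = max(ages) - min(ages)          # HV - LV

# Interval from the 1 + 3.322 log N rule (base-10 logarithm assumed).
classes_by_rule = 1 + 3.322 * math.log10(n)
interval_by_rule = data_range / classes_by_rule

# Interval from a chosen number of classes (7 is an illustrative choice).
chosen_classes = 7
interval_by_choice = data_range / chosen_classes

print("Range:", data_range)
print("Suggested interval (log rule):", math.ceil(interval_by_rule))
print("Suggested interval (7 classes):", math.ceil(interval_by_choice))
```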
Step 5: Determine the relative frequency. (Formula: f/N)

Class Limits   Class Boundaries   Frequency   Relative Frequency
18-26          17.5-26.5          3           0.06
27-35          26.5-35.5          5           0.10
36-44          35.5-44.5          9           0.18
45-53          44.5-53.5          14          0.28
54-62          53.5-62.5          11          0.22
63-71          62.5-71.5          6           0.12
72-80          71.5-80.5          2           0.04
Total                             50
Step 6: Determine the percentage. (Formula: (f/N) × 100%)

Class Limits   Class Boundaries   Frequency   Relative Frequency   Percentage
18-26          17.5-26.5          3           0.06                 6%
27-35          26.5-35.5          5           0.10                 10%
36-44          35.5-44.5          9           0.18                 18%
45-53          44.5-53.5          14          0.28                 28%
54-62          53.5-62.5          11          0.22                 22%
63-71          62.5-71.5          6           0.12                 12%
72-80          71.5-80.5          2           0.04                 4%
Total                             50                               100%
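The table can be rebuilt from the raw ages with a short sketch like the one below. The starting lower limit (18), the class width (9) and the number of classes (7) are read off the table; the variable names are illustrative.

```python
ages = [
    18, 29, 42, 57, 61, 67, 37, 49, 53, 47,
    24, 34, 45, 58, 63, 70, 39, 51, 54, 48,
    28, 36, 46, 60, 66, 77, 40, 52, 56, 49,
    19, 31, 44, 58, 62, 68, 38, 50, 54, 48,
    27, 36, 46, 59, 64, 74, 39, 51, 55, 48,
]

lower, width, n_classes = 18, 9, 7   # classes 18-26, 27-35, ..., 72-80
n = len(ages)

rows = []
for c in range(n_classes):
    lo = lower + c * width           # lower class limit
    hi = lo + width - 1              # upper class limit
    f = sum(1 for a in ages if lo <= a <= hi)
    rows.append((lo, hi, f))

print(f"{'Class Limits':<14}{'Boundaries':<13}{'f':>3}{'f/N':>7}{'%':>6}")
for lo, hi, f in rows:
    limits = f"{lo}-{hi}"
    boundaries = f"{lo - 0.5}-{hi + 0.5}"   # limits widened by 0.5 each way
    rel = f / n                             # relative frequency
    print(f"{limits:<14}{boundaries:<13}{f:>3}{rel:>7.2f}{rel * 100:>5.0f}%")
```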
Now that the data are arranged in a frequency distribution table, it is easier to give
findings, draw informed conclusions and make sound inferences.
Findings:
A. Basic findings are those which you can see directly from the table;
Stem-and-Leaf Plot
A statistician named John Tukey introduced the stem-and-leaf plot. The objective of this
method is to overcome, to some extent, the loss of actual observations brought about by the
histogram. The advantage of the stem-and-leaf plot over the histogram is that we can see the
actual observations.
The stem is the leading digit or digits and the leaf is the trailing digit. The stem is placed
in the first column and the leaf in the second column.
Example: EPA Travel Agency, a nationwide local travel agency, offers special rates during the summer
period. The owner wants additional information on the ages of those people taking travel tours.
A random sample of 50 customers taking travel tours last summer revealed these ages.
18 29 37 42 47 49 53 57 61 67
19 31 38 44 48 50 54 58 62 68
24 34 39 45 48 51 54 58 63 70
27 36 39 46 48 51 55 59 64 74
28 36 40 46 49 52 56 60 66 77
Stem (tens digit,     Leaf (units digit,
leading digit)        trailing digits)
1                     8, 9
2                     4, 7, 8, 9
3                     1, 4, 6, 6, 7, 8, 9, 9
4                     0, 2, 4, 5, 6, 6, 7, 8, 8, 8, 9, 9
5                     0, 1, 1, 2, 3, 4, 4, 5, 6, 7, 8, 8, 9
6                     0, 1, 2, 3, 4, 6, 7, 8
7                     0, 4, 7
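A stem-and-leaf display like the one above can be produced with a few lines of code. This is a minimal sketch that assumes two-digit values, so the tens digit is the stem and the units digit is the leaf.

```python
from collections import defaultdict

ages = [
    18, 29, 37, 42, 47, 49, 53, 57, 61, 67,
    19, 31, 38, 44, 48, 50, 54, 58, 62, 68,
    24, 34, 39, 45, 48, 51, 54, 58, 63, 70,
    27, 36, 39, 46, 48, 51, 55, 59, 64, 74,
    28, 36, 40, 46, 49, 52, 56, 60, 66, 77,
]

stems = defaultdict(list)
for age in sorted(ages):
    stem, leaf = divmod(age, 10)   # tens digit is the stem, units digit the leaf
    stems[stem].append(leaf)

for stem in sorted(stems):
    leaves = ", ".join(str(leaf) for leaf in stems[stem])
    print(f"{stem} | {leaves}")
```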
[Figure: Age of Travellers. x-axis: Age (Class Limits), 18-26 to 72-80; y-axis: Frequency.]
[Figure: Age of Travellers. x-axis: Age (Class Midpoints); y-axis: Frequency.]
[Figure: Age of Travellers. x-axis: Age (Upper Boundary); y-axis: Cumulative Frequency.]
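The values plotted in the charts of traveller ages (class midpoints, upper class boundaries and cumulative frequencies) can be derived from the frequency table. The sketch below assumes the class limits and frequencies from that table; the variable names are illustrative.

```python
from itertools import accumulate

class_limits = [(18, 26), (27, 35), (36, 44), (45, 53),
                (54, 62), (63, 71), (72, 80)]
frequencies = [3, 5, 9, 14, 11, 6, 2]

# Midpoint: halfway between the lower and upper class limits.
midpoints = [(lo + hi) / 2 for lo, hi in class_limits]

# Upper boundary: upper class limit plus 0.5.
upper_boundaries = [hi + 0.5 for _, hi in class_limits]

# Cumulative frequency: running total of the class frequencies.
cumulative = list(accumulate(frequencies))

for (lo, hi), x, ub, f, cf in zip(class_limits, midpoints,
                                  upper_boundaries, frequencies, cumulative):
    print(f"{lo}-{hi}: midpoint={x}, upper boundary={ub}, f={f}, cf={cf}")
```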
Products Sales
Junk Foods 135
Candy 250
Ice Cream 185
Chocolate 210
Others 90
Solution:
a. Constructing a Pareto Chart
Steps: 1. Arrange the data from highest to lowest according to frequency.
Products Sales
Candy 250
Chocolate 210
Ice Cream 185
Junk Foods 135
Others 90
2. Draw and label the x-axis (Products) and y-axis (Sales).
3. Construct the chart by arranging the frequencies from highest to lowest, from left to
right. Make bars of the same width and draw each height to correspond to its frequency.
Figure 2.4: Pareto Chart of Favorite Snacks
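The ordering step of a Pareto chart can be sketched in a few lines. The text bars printed here (one `#` per 10 units of sales) are only a rough stand-in for the drawn chart, not a replacement for it.

```python
sales = {
    "Junk Foods": 135,
    "Candy": 250,
    "Ice Cream": 185,
    "Chocolate": 210,
    "Others": 90,
}

# Arrange the products from highest to lowest sales.
ordered = sorted(sales.items(), key=lambda item: item[1], reverse=True)

# Print one text bar per product, left to right becoming top to bottom.
for product, amount in ordered:
    bar = "#" * (amount // 10)
    print(f"{product:<12}{amount:>4} {bar}")
```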
Steps: 1. Since there are 360° in a circle, the frequency of each class must be converted into a proportional
part of the circle. This conversion is done by applying the formula
$$\text{Degrees} = \left(\frac{f}{n}\right)(360^\circ)$$
where f = frequency; n = sum of frequencies
Hence, the following conversions are obtained. The degrees should total 360°.
Candy: $\left(\frac{250}{870}\right)(360^\circ) \approx 103^\circ$
Chocolate: $\left(\frac{210}{870}\right)(360^\circ) \approx 87^\circ$
2. Each frequency must also be converted to a percentage, and the percentages should total 100%. This
conversion is done by applying the formula
$$\text{Percentage} = \left(\frac{f}{n}\right)(100\%)$$
Candy: $\left(\frac{250}{870}\right)(100\%) \approx 29\%$
Junk Foods: $\left(\frac{135}{870}\right)(100\%) \approx 16\%$
Chocolate: $\left(\frac{210}{870}\right)(100\%) \approx 24\%$
Ice Cream: $\left(\frac{185}{870}\right)(100\%) \approx 21\%$
3. Using a protractor, graph each section and write its name and appropriate percentage, as shown in
Figure 2.5.
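The two conversions used for the pie chart can be sketched as follows for all five products. Rounding to the nearest whole degree and whole percent is an assumption that matches the worked values in the steps.

```python
sales = {
    "Candy": 250,
    "Chocolate": 210,
    "Ice Cream": 185,
    "Junk Foods": 135,
    "Others": 90,
}

n = sum(sales.values())   # sum of frequencies

# Degrees = (f / n) * 360 and Percentage = (f / n) * 100, rounded to whole numbers.
degrees = {product: round(f / n * 360) for product, f in sales.items()}
percent = {product: round(f / n * 100) for product, f in sales.items()}

for product in sales:
    print(f"{product:<12}{degrees[product]:>4} degrees{percent[product]:>5}%")
print("Total:", sum(degrees.values()), "degrees,", sum(percent.values()), "%")
```

With other data, rounded values may not add up to exactly 360° or 100%, so one section is usually adjusted by hand.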
Example 2. Using the information in the table about the dollar to peso exchange rate from January to
December of 2019, construct a time series graph.
Solution:
Steps: 1. Draw and label the x-axis and y-axis.
2. Label the x-axis for months and the y-axis for Peso per US dollar.
3. Plot each point according to the table.
4. Draw line segments connecting adjacent points.
Abstraction: Any data set can be characterized by measuring its central tendency. A measure of central
tendency, commonly referred to as an average, is a single value that represents a data set. Its purpose is
to locate the center of a data set.

The arithmetic mean, often called the mean, is the most frequently used measure of central tendency.
The mean is the only common measure in which all values play an equal role: to determine its value, you
need to consider all the values of the given data set. The mean is appropriate for determining the central
tendency of interval or ratio data. The symbol $\bar{x}$, called “x bar”, is used to represent the mean of a
sample, and the symbol μ, called “mu”, is used to denote the mean of a population.

A. Properties of Mean
1. A set of data has only one mean.
2. The mean can be applied to interval and ratio data.
3. All values in the data set are included in computing the mean.
4. The mean is very useful in comparing two or more data sets.
5. The mean is affected by extremely small or large values in a data set.
6. The mean cannot be computed for the data in a frequency distribution with an open-ended class.

B. Mean for Ungrouped Data
Sample Mean:
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{x_1 + x_2 + x_3 + \dots + x_n}{n}$$
Population Mean:
$$\mu = \frac{\sum_{i=1}^{N} X_i}{N} = \frac{X_1 + X_2 + X_3 + \dots + X_N}{N}$$

Example 1: The daily rates of eight employees of a certain municipality of Davao del Sur are Php 550, 420,
560, 500, 700, 670, 860 and 480. Find the mean daily rate of the employees.
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{550 + 420 + 560 + 500 + 700 + 670 + 860 + 480}{8} = \frac{4{,}740}{8} = 592.50$$
The sample mean daily rate of the employees is Php 592.50.

Example 2: Find the population mean of the ages of the middle-management employees of a
certain company. The ages are 53, 45, 59, 48, 54, 46, 51, 58 and 55.
Solution:
$$\mu = \frac{\sum_{i=1}^{N} X_i}{N} = \frac{53 + 45 + 59 + 48 + 54 + 46 + 51 + 58 + 55}{9} = \frac{469}{9} = 52.11$$
The population mean age of the middle-management employees is 52.11.

C. Sample Mean for the Grouped Data
Sample Mean:
$$\bar{x} = \frac{\sum fx}{n}$$
where: $\bar{x}$ = sample mean
f = frequency
x = the class midpoint, representing the value of any particular observation in the class
Σfx = sum of all the products of f and x
n = total number of values in the sample

Example 3: Using the example provided in Module 2, EPA Travel Agency, determine the mean of the
frequency distribution of the ages of 50 people taking travel tours.
Solution:

Class Limits   Frequency (f)   Midpoint (x)   fx
18-26          3               22             66
27-35          5               31             155
36-44          9               40             360
45-53          14              49             686
54-62          11              58             638
63-71          6               67             402
72-80          2               76             152
Total          50                             Σfx = 2,459

Applying the formula to obtain the value of the sample mean:
$$\bar{x} = \frac{\sum fx}{n} = \frac{2{,}459}{50} = 49.18$$

Weighted Mean, Geometric Mean and Combined Mean

A. Weighted Mean
The weighted mean is particularly useful when various classes or groups contribute differently to the
total. The weighted mean is found by multiplying each value by its corresponding weight and dividing by
the sum of the weights.
$$\bar{x}_w = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i} = \frac{w_1 x_1 + w_2 x_2 + \dots + w_n x_n}{w_1 + w_2 + \dots + w_n}$$
where: $\bar{x}_w$ = weighted mean
$w_i$ = corresponding weight
$x_i$ = the value of any particular observation or measurement

Example 1: At the Mathematics Department of Davao del Sur State College there are 18 instructors,
12 assistant professors, 7 associate professors, and 3 professors. Their monthly salaries are Php 30,500,
33,700, 38,600 and 45,000, respectively. What is the weighted mean salary?
Solution:
$$\bar{x}_w = \frac{18(30{,}500) + 12(33{,}700) + 7(38{,}600) + 3(45{,}000)}{18 + 12 + 7 + 3} = \frac{1{,}358{,}600}{40} = 33{,}965$$
The weighted mean salary is Php 33,965.00.

B. Geometric Mean
The geometric mean of a set of n positive numbers is defined as the nth root of the product of the n
numbers. There are two main applications of the geometric mean: the first is to average percents, indexes
and relatives; the second is to establish the average percent increase in production, sales or other
business transactions or economic series from one period of time to another.
$$GM = \sqrt[n]{x_1 x_2 x_3 \cdots x_n}$$
$$GM = \sqrt[n-1]{\frac{\text{value at the end of the period}}{\text{value at the start of the period}}} - 1$$
where: GM = geometric mean
$x_i$ = the value of any particular observation or measurement
n = number of observations

Example 2: Suppose the profits earned by the EPA Construction Company on five projects
were 5, 6, 4, 8 and 10 percent, respectively. What is the geometric mean profit?
Solution: $x_1 = 5$, $x_2 = 6$, $x_3 = 4$, $x_4 = 8$, $x_5 = 10$, n = 5
$$GM = \sqrt[n]{x_1 x_2 x_3 \cdots x_n} = \sqrt[5]{(5)(6)(4)(8)(10)} = \sqrt[5]{9{,}600} = 6.26$$
The geometric mean profit is 6.26 percent.

Example 3: Badminton as a sport grew rapidly in 2008. From January to December 2008, the number of
badminton clubs in Metro Manila increased from 20 to 155. Compute the mean monthly percent
increase in the number of badminton clubs.
Solution:
$$GM = \sqrt[n-1]{\frac{\text{value at the end of the period}}{\text{value at the start of the period}}} - 1 = \sqrt[12-1]{\frac{155}{20}} - 1 = \sqrt[11]{7.75} - 1 = 0.2046$$
Hence, badminton clubs are increasing at a rate of almost 0.2046 or 20.46% per month.

Note: The geometric mean cannot be computed if one of the numbers is zero or negative.

C. Combined Mean
The combined mean is the grand mean of all the values in all groups when two or more groups are
combined. There will be times when we want to determine a mean from a number of other means. In
order to compute the combined mean for a group of means, we must know the size of each group, or N.
The formula is
$$\bar{X}_{CM} = \frac{\sum_{i=1}^{n} N_i \bar{X}_i}{\sum_{i=1}^{n} N_i} = \frac{N_1 \bar{X}_1 + N_2 \bar{X}_2 + \dots + N_n \bar{X}_n}{N_1 + N_2 + \dots + N_n}$$
where: $\bar{X}_{CM}$ = combined mean
$\bar{X}_i$ = sample means
$N_i$ = sample size

Example 4: A study comparing the typical household incomes for 3 districts in the City of Davao was
initiated to see where differences in household incomes lie across districts. The mean household
incomes for a sample of 45 different families in three districts of Davao are shown in the following
table. Calculate a combined mean to obtain the average household income for all 45 families in the
Davao sample.

District     Mean income (Php)          Sample size
District 1   $\bar{X}_1$ = 30,400       $N_1$ = 12
District 2   $\bar{X}_2$ = 27,300       $N_2$ = 18
District 3   $\bar{X}_3$ = 42,500       $N_3$ = 15

Solution:
$$\bar{X}_{CM} = \frac{30{,}400(12) + 27{,}300(18) + 42{,}500(15)}{12 + 18 + 15} = \frac{1{,}493{,}700}{45} = 33{,}193.33$$
Thus, the combined mean household income in the three districts of Davao City is Php 33,193.33.
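The worked examples in this section can be checked with a short sketch. The function names below are illustrative rather than taken from the module, and the four salary weights are assumed to be listed in the same order as the salaries.

```python
import math


def mean(values):
    """Arithmetic mean of ungrouped data."""
    return sum(values) / len(values)


def weighted_mean(values, weights):
    """Sum of weight * value divided by the sum of the weights.

    The grouped mean (weights = frequencies, values = midpoints) and the
    combined mean (weights = group sizes, values = group means) use the
    same calculation.
    """
    return sum(w * x for w, x in zip(weights, values)) / sum(weights)


def geometric_mean(values):
    """n-th root of the product of n positive numbers."""
    if any(v <= 0 for v in values):
        raise ValueError("all values must be positive")
    return math.prod(values) ** (1 / len(values))


def mean_growth_rate(start, end, n):
    """Average rate of increase per period over n observations (n - 1 steps)."""
    return (end / start) ** (1 / (n - 1)) - 1


daily_rates = [550, 420, 560, 500, 700, 670, 860, 480]
ages = [53, 45, 59, 48, 54, 46, 51, 58, 55]
midpoints = [22, 31, 40, 49, 58, 67, 76]
frequencies = [3, 5, 9, 14, 11, 6, 2]

print("Mean daily rate:", mean(daily_rates))
print("Mean age:", round(mean(ages), 2))
print("Grouped mean age:", weighted_mean(midpoints, frequencies))
print("Weighted mean salary:",
      weighted_mean([30500, 33700, 38600, 45000], [18, 12, 7, 3]))
print("Geometric mean profit:", round(geometric_mean([5, 6, 4, 8, 10]), 2))
print("Mean monthly increase:", round(mean_growth_rate(20, 155, 12), 4))
print("Combined mean income:",
      round(weighted_mean([30400, 27300, 42500], [12, 18, 15]), 2))
```

Reusing `weighted_mean` for the grouped and combined means is a design choice of this sketch; it reflects that all three formulas share the same form.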