

PRIMER IN STATISTICS

Meaning of Statistics

We have entered the age of computerization and are becoming rich in information at a very fast rate. However, the
data gathered will not make sense unless we know how to use the available information to make good decisions. Statistics
helps with this problem because it is a science that deals with the collection of data, the presentation of the collected
data, and the analysis and interpretation of the results so as to yield meaningful information. Proper interpretation of
the results will help us make better decisions for the future.

We practice statistical thinking in our everyday lives. For example, a student budgets his or her allowance based
on what he or she knows about past prices and about the projects he or she needs to do. Parents do the same. Most
young children nowadays can predict other people’s behavior based on what they have experienced in the
past. During an election period, we try to predict the outcome based on what we hear or observe around us. In short, we
habitually make decisions or predictions based on previous observations. In the professional world, statistics is widely
used since decisions are usually based on past data or on data collected through experiments and surveys.

Importance of Statistics

Based on the definition of Statistics, we can derive some reasons why it is important to gain knowledge in Statistics.
Some of the reasons are as follows:

(1) Statistics helps in proper and efficient planning of a statistical inquiry in any field of study, like aiding in the method
of collecting data, deciding on the sample size and the process of selecting the sample. It will also help one in
deciding the appropriate data to be collected.
(2) It aids in presenting complex data in a suitable tabular and graphical form for an easy and clear comprehension of
the data.
(3) It provides us with tools for understanding the nature and pattern of variability of a phenomenon or a set
of observations.
(4) It can help us make reliable inferences about the population even when the data come only from a sample.
(5) It can help us understand and discover the relationships between variables.
(6) It helps us obtain reliable forecasts.
(7) It can aid in making decisions on how to improve processes.

Population and Sample

Before proceeding further, it is important to understand the basic concepts about the set of observations to be
gathered in a study. The totality of the observations with which a study is concerned is called the population of the study.
Population can refer to the subjects of the study or to the observations themselves. For example, if a study is conducted to
determine the opinion of MSU-Marawi students on a possible tuition fee increase, and there are 16,000 students, we say
that we have a finite population of size 16,000. If one is interested in the choice among the senatorial candidates
in a coming election, the population of the study is the set of registered voters in that election. A study that takes
data from the entire population is called a census.

Taking the whole population into the study is costly, laborious, time-consuming and sometimes impossible. Suppose
your study deals with the opinion of the recipients of the 4P’s in the different barangays of Marawi City on whether the
program greatly helps their livelihood or not. A census would be too costly, since you would need to go to each
recipient in each barangay and ask for his or her opinion. It would also be very laborious and would take a long time to
finish. It might even be impossible to get the opinion of all recipients, because some of them might be away on vacation
when you conduct the study. Thus, there arises a need to study only a part of the population, which we call a sample.
Ideally, the sample must be taken in such a way that it represents the population very well.

STT 041/STT041.1, First Semester, A.Y. 2020 – 2021 Mathematics Dep’t, MSU, Marawi City

POPULATION: It refers to the totality of the observations with which the study is concerned.
SAMPLE: It refers to a part or subset of a population.

Variables and Types of Data

In order to gain information about seemingly haphazard events, statisticians collect data for the variables used to
describe such events.

A variable is a characteristic or attribute that changes or varies over time, or which changes for different individuals
or objects under consideration.

An experimental unit is the individual or object on which a variable is measured. A single measurement or data
value results when a variable is actually measured on an experimental unit.

Data are the values (measurements or observations) that the variables can assume. Variables whose values are
determined by chance are called random variables.

A collection of data values forms a data set. Each value in the data set is called a data value or datum.
Variables can be classified into two categories: qualitative and quantitative.
Qualitative variables are variables that measure a quality or characteristic of each experimental unit. They
produce data that can be categorized according to similarities or differences in kind, often called
categorical data. Examples are gender, political affiliation and religious preference.
Quantitative variables are variables that measure a numerical quantity or amount on each experimental
unit. Their values can be ordered or ranked. Examples are height, weight, volume and body temperature.

Quantitative variables are classified into two groups: discrete and continuous.
1. Discrete variables
A discrete variable can assume only a finite or countable number of values. Its values can be listed, such
as 0, 1, 2, 3, and are said to be countable.
Examples:
i. number of children in a family
ii. number of students in a classroom
iii. number of calls received by a call center agent each day for one month
iv. number of dropouts in Math 31 in the previous semester

2. Continuous variables
A continuous variable can assume an infinite number of values between any two specific values. Its values are
obtained by measuring and often include fractions and decimals.
Examples:
i. temperature
ii. height
iii. weight
iv. distance
v. age


Data can be used in different ways. Depending on how they are used, the body of knowledge on statistical methods
is divided into two main areas or branches, namely: Descriptive Statistics and Inferential Statistics.

Two Major Areas in Statistics

(1) Descriptive Statistics. It comprises those statistical tools that deal with the presentation of observed data in
various forms such as tables, graphs and diagrams, or with describing the data through the computation of measures that
summarize the characteristics of the set of data.

Example:
Consider the national census conducted by the National Statistics Office (NSO) every 10 years. Results of this
census give the average age, average income, and other characteristics of the Philippine population. The NSO also presents
the data in some meaningful form such as charts, graphs, or tables.

(2) Inferential Statistics. It consists of those statistical tools concerned with generalizing results from random samples to
populations, performing estimations and hypothesis tests, determining relationships among variables, and making
predictions.

Example:
Suppose we want to know the percentage of unemployed in our country. We take a random sample from the
population and find the proportion of unemployed in the sample. With the aid of other statistical measures and probability,
we make some inferences or general statements about the population proportion of unemployed.

Exercise 1. In each statement that follows identify whether descriptive or inferential statistics is used.

1. The average price of a unit of Camella homes sold in Cagayan de Oro during the week of April 22-28, 2012 was
PhP 1,051,053.

2. According to the Provincial Statistics Office, 85% of the workers in their province get to work by public utility vehicle.

3. The National Eye Institute has halted a clinical trial on a type of eye surgery, calling it ineffective and possibly harmful
to a person’s vision.

4. “Allergy therapy may make bees go away” (Prevention, April 1995).

5. Drinking decaffeinated coffee can raise cholesterol levels by 7%.

6. It is predicted that the average number of automobiles each household owns will increase next year.

7. The average number of students in a class in Mindanao State University is 22.6.

8. Last year’s total attendance at the Azkals’ football games was 50,000.

9. According to the Court of Justice of the Philippines, 14% of trial-ready civil actions and equity cases during 1993 were
decided in less than six months.

10. It is estimated that, on the average, 3 SUVs are carnapped each day in Metro Manila.


II. Sampling Procedures

In sampling, only a relatively small number of respondents or experimental units is involved, which is why it is
commonly used in practice. We examine some of its advantages below.

Advantages of Sampling

1. It entails less cost and less effort, and it is less time-consuming.

a. Since the size of the sample is small compared to the population, the time, cost and effort involved in a sample
study are much less than in a study of the whole population. A study of the whole population requires a large fund
because of the resources to be used, which may include more manpower and materials.
b. It also takes a much shorter period of time to gather data from a sample than from a population. Thus,
sampling can also lead to more timely results.

2. It is less cumbersome and more practical to administer.

It is easier to handle and manage, and less of a burden on your part, if you take data from a smaller number
of respondents.

3. Some experiments are destructive, so it is not possible to involve the whole population.
For example, a car manufacturer might want to test the durability of the cars being produced. Obviously, not every car
can be crash-tested to determine its durability, or else the company would have nothing left to sell. To overcome
this problem, samples are taken from the population, and estimates about the whole population are made based on
information derived from the sample.

Sampling also has disadvantages, the biggest of which is that the sample may not truly reflect the characteristics of the
population, which would lead to wrong conclusions. Hence, care must be taken in choosing a sample. Also, a sample
must be large enough to give a good representation of the population, but small enough to be manageable.

Types of Sampling Procedures

The method of drawing a sample has a big impact on the validity and reliability of the results of the study. It also
influences the kind of inferences that can be made about the population. Samples can be drawn either randomly (this
is called probability or random sampling) or by non-random procedures.

A. Probability Sampling or Random Sampling


In probability sampling, each element has a known probability of selection, and a chance method such as “draw
lots” or using numbers from a random number table is used in selecting the specific units to be included in the sample.

(1) Simple Random Sampling (SRS)


This is the simplest form of random sampling where every subset of size n of the population has an equal chance of
being selected.
A simple random sample can be drawn using the “fishbowl” method, the “draw lots” method, or random numbers.
In drawing a simple random sample, the researcher is in effect mixing up the units in the population before a sample of n
units is selected.

Steps in Simple Random Sampling (SRS)

(1) Obtain a list of all the units of the population.
(2) Assign a number to each element of the population using the numbers from 1 to N.
(3) Select n numbers from 1 to N using a random process such as the fishbowl method or drawing lots, or use random
numbers, which can be generated by a scientific calculator.


Steps in obtaining a random number from a scientific calculator

(i.) Press INV or Shift, then press Ran#. A number between 0 and 1 with three decimal places will appear.
(NOTE: Your results will differ from those of your classmates, and a different number will appear every time you
press these buttons.)
(ii.) Multiply this random number by the population size, N, and round it off to a whole number. The result corresponds
to the number of an element in the list.
(iii.) Repeat (i) and (ii) until you get the desired sample size with distinct elements.
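
The calculator steps above can be imitated on a computer. The following is only a minimal sketch in Python: the function
name calculator_style_sample and the seed parameter are hypothetical, and the rule for skipping a result of 0 is an
assumption, since the steps above do not say what to do when the rounded number matches no element in the list.

```python
import random


def calculator_style_sample(N, n, seed=None):
    """Pick n distinct unit numbers from 1..N by imitating the Ran# key."""
    if n > N:
        raise ValueError("the sample size cannot exceed the population size")
    rng = random.Random(seed)
    chosen = []
    while len(chosen) < n:
        ran = round(rng.random(), 3)   # step (i): a number between 0 and 1, three decimals
        unit = int(ran * N + 0.5)      # step (ii): multiply by N and round off
        # Assumption: 0 matches no unit in the list, so it is skipped.
        # Repeated numbers are skipped so that the elements are distinct (step iii).
        if 1 <= unit <= N and unit not in chosen:
            chosen.append(unit)
    return chosen


units = calculator_style_sample(N=36, n=6, seed=1)
print(units)
```

Fixing the seed only makes the run repeatable; leaving it as None gives a different sample each time, just as pressing
the calculator keys does.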

Examples:
1. Each name in a telephone book could be numbered sequentially. If the sample size is to include 1,000 people, then 1,000
numbers could be randomly generated by computer or numbers could be picked out using a random process, for instance,
draw lots. These numbers could then be matched to names in the telephone book, thereby providing a list of 1,000 people.

2. Choose a random sample of 6 students from the following 36 students using random numbers generated from your
calculator.
Suzette Audrey Jamilah Chad Mohammad Allan
Marielle Charmie Aliah Christian Antonio Paul
Norjehan Kristine Norjannah Emil Eduard Jaime
Mia Mariel Amor Martin Carlo Gabriel
Jana Janice Noralyn Jacob Nathaniel Jullian
Melai April Carima Johary Angelo Lawrence
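
One possible way to carry out Example 2 on a computer is sketched below. It assumes that the students are numbered
1 to 36 row by row and that the intended sample size is 6; the function name draw_students is hypothetical. Python's
random.sample is used in place of the calculator because it already returns distinct numbers.

```python
import random

# The 36 students, numbered 1 to 36 row by row (an assumption about the numbering).
STUDENTS = [
    "Suzette", "Audrey", "Jamilah", "Chad", "Mohammad", "Allan",
    "Marielle", "Charmie", "Aliah", "Christian", "Antonio", "Paul",
    "Norjehan", "Kristine", "Norjannah", "Emil", "Eduard", "Jaime",
    "Mia", "Mariel", "Amor", "Martin", "Carlo", "Gabriel",
    "Jana", "Janice", "Noralyn", "Jacob", "Nathaniel", "Jullian",
    "Melai", "April", "Carima", "Johary", "Angelo", "Lawrence",
]


def draw_students(n, seed=None):
    """Return n (number, name) pairs chosen by simple random sampling."""
    rng = random.Random(seed)
    numbers = sorted(rng.sample(range(1, len(STUDENTS) + 1), n))
    return [(i, STUDENTS[i - 1]) for i in numbers]


sample = draw_students(6, seed=2020)
for number, name in sample:
    print(number, name)
```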

A disadvantage of simple random sampling is that we can never be assured that all sectors or groups are represented
in the sample. For instance, in Example (2) above, it is possible that all elements drawn will be girls or all will be
boys.
To reduce this possibility, we can employ other sampling procedures that lead to a more representative
sample, in which the sample units are spread evenly over the entire population. One such
procedure is called systematic random sampling.

(2) Systematic Random Sampling


This is also called interval sampling. It means that there is a gap or interval between each selection. Researchers
obtain systematic samples by numbering each subject of the population and then selecting every kth element in the population,
where the first unit is chosen at random.

Steps in taking a Systematic Random Sample

(1) Assign a number to each element of the population using the numbers from 1 to N.
(2) Compute the sampling interval, k, where

k = N/n, where N = population size and n = sample size

NOTE: If k is not a whole number, then it is rounded to the nearest whole number. For example, suppose
N = 400 and n = 15; then k = 400/15 = 26.67, which is rounded off to the nearest whole number, 27.

(3) Select a random start, r, where 1 ≤ r ≤ k. The sample will then include the rth element, the (r + k)th, the (r + 2k)th,
the (r + 3k)th, and so on until you reach the desired sample size.

Examples:

1. Suppose a population consists of a class of 35 currently enrolled students in Math 31. If we select a systematic sample
of 5 students, the sampling interval would be:

k = N/n = 35/5 = 7.

The starting point would be chosen by selecting a random number between 1 and 7. Suppose this number is 5. Then the
sample will consist of the rth = 5th student, the (r + k)th = (5 + 7)th = 12th student, the (r + 2k)th = (5 + 2(7))th = 19th
student, the (r + 3k)th = (5 + 3(7))th = 26th student, and the (r + 4k)th = (5 + 4(7))th = 33rd student.

If the starting point is r = 3, then we will get a different sample. The sample will consist of the following students in
the list: 3rd student, 10th student, 17th student, 24th student, and 31st student.

2. In a population of 1200 individuals, choose a systematic random sample of size 9.

Solution: The sampling interval, k, is k = N/n = 1200/9 = 133.33. Since this is not a whole number, we round it
off to the nearest whole number, which is 133. Since k = 133, the possible random starts are r = 1, 2, 3, …, 133. If we
choose to start at r = 3, the sample points will be the 3rd person, the 136th person, the 269th person, the 402nd person, the
535th person, the 668th person, the 801st person, the 934th person, and the 1,067th person.
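
The two examples above can be checked with a short sketch. The function name systematic_sample is hypothetical, and the
random start r is passed in by hand so that the results can be compared with the worked examples; in practice r could be
drawn with random.randint(1, k).

```python
def systematic_sample(N, n, r):
    """Return the positions r, r + k, r + 2k, ... for a systematic sample of size n."""
    k = int(N / n + 0.5)  # sampling interval, rounded to the nearest whole number
    if not 1 <= r <= k:
        raise ValueError("the random start r must lie between 1 and k")
    positions = [r + j * k for j in range(n)]
    if positions[-1] > N:
        # Can happen when k was rounded up; how to resolve it is left open here.
        raise ValueError("the last position falls outside the list")
    return positions


print(systematic_sample(35, 5, 5))    # [5, 12, 19, 26, 33]
print(systematic_sample(35, 5, 3))    # [3, 10, 17, 24, 31]
print(systematic_sample(1200, 9, 3))  # [3, 136, 269, 402, 535, 668, 801, 934, 1067]
```

When k is rounded up, as with N = 400 and n = 15 (k = 27), a late start can push the last position past N. The sketch
flags that case with an error rather than guessing how it should be handled.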

NOTE: The list from which the systematic sample is drawn should be checked to make sure it has no periodic pattern,
because such a pattern could result in a biased sample.

For example, if our list is the monthly sales from 2005 to 2008 arranged in chronological order, then there are 48
months to choose from. If we select a sample of 12, then k = 48/12 = 4. If we start choosing from the 3rd element,
then the next elements will be the 7th, 11th, 15th, 19th, 23rd, 27th, 31st, 35th, 39th, 43rd and 47th. These correspond to the
sales of March 2005, July 2005, November 2005, March 2006, July 2006, November 2006, March 2007 and so on, with the last
element being November 2008. This sample is biased since it represents only the months of March, July and November.
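
The periodic pattern in this example is easy to see when the 48 months are listed by a short program. The sketch below
only rebuilds the month labels; no sales figures are assumed, and the variable names are illustrative.

```python
MONTHS = ["January", "February", "March", "April", "May", "June",
          "July", "August", "September", "October", "November", "December"]

# The 48 months from January 2005 to December 2008 in chronological order.
labels = [f"{month} {year}" for year in range(2005, 2009) for month in MONTHS]

k = 48 // 12  # sampling interval
r = 3         # start at the 3rd element
picked = [labels[r - 1 + j * k] for j in range(12)]

# Which calendar months are represented in the sample?
months_hit = sorted({label.split()[0] for label in picked}, key=MONTHS.index)

print(picked)
print(months_hit)  # ['March', 'July', 'November']
```

Because the interval k = 4 divides the 12-month cycle evenly, every start r reaches only three calendar months, so the
problem does not go away by choosing a different r.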

In a case like this where there is a periodic pattern, the appropriate sampling procedure to use is Simple Random Sampling
(SRS). For example, if we use the fishbowl method, draw lots, or random numbers, the first element might be March 2005, then
July 2006, and so on until we get the desired sample size.

Knowing the danger in systematic random sampling when choosing a sampling interval that corresponds to
periodicity, we will study another sampling procedure which may be much more efficient than simple random sampling.
This sampling procedure is called Stratified Random Sampling. This is carried out by dividing the population into
homogeneous subpopulations and then selecting a simple random sample from each subpopulation.

(3) Stratified Random Sampling


In this sampling procedure, the population of N units is first divided into homogeneous subpopulations called strata
(homogeneous with respect to the characteristics of interest), and then a sample is drawn from each stratum. This type of
sampling assures that all groups or strata are represented in the sample. For instance, some stratification variables commonly
used in Social Weather Stations (SWS) surveys are location, age and sex. Other stratification variables may be religion,
academic ability or marital status.

Steps in taking a Stratified Random Sample

(1) Classify the population into at least two homogeneous strata. The basis for classification must be closely related
to the variable of interest. Suppose we are interested in determining the students’ opinion on the tuition fee increase;
it may be logical to subdivide the population of students by income of parents, by college, by year level, by tribe,
or by a combination of these.
(2) Draw a sample from each stratum by simple or systematic random sampling.


How many units shall we take from each stratum? The most commonly used formula is the proportional allocation
formula, where the number of units to be taken from each stratum is proportional to the size of the subpopulation; that is,
between two strata of different sizes, a bigger sample will be taken from the bigger stratum.

Proportional Allocation. If the population of size N is divided into k homogeneous subpopulations or strata of
sizes N1, N2, …, Nk, then the sample size to be taken from stratum i is obtained using the formula

ni = (Ni / N) x n for i = 1, 2, …, k

NOTE: If ni is not a whole number, then it is rounded off to the nearest whole number.

Examples:
(1) The manager of a girls’ dormitory wants to learn how the students feel about the dorm’s services. The students were
classified according to the following scheme:
CLASSIFICATION    NUMBER OF STUDENTS
Freshmen          220
Sophomore         195
Junior            163
Senior            150

If we use proportional allocation to select a stratified random sample of size n = 40, how large a sample must be taken from
each stratum?

Solution: Since n = 40 and N = 220 + 195 + 163 + 150 = 728, then

n1 = (N1/N) x n = (220/728) x 40 = 12.088 ≈ 12
n2 = (N2/N) x n = (195/728) x 40 = 10.714 ≈ 11
n3 = (N3/N) x n = (163/728) x 40 = 8.956 ≈ 9
n4 = 40 – 12 – 11 – 9 = 8

(2) In an election survey in Makati City, registered voters are classified according to the following scheme:

Economic Status Number of People


A (Upper Class or Rich People) 725
B (Middle Class) 3489
C (Lower Class or Poor People) 2146

If one uses proportional allocation to select a stratified random sample of size n=345, how large a sample must be
taken from each stratum?
Solution:
Since n = 345 and N = 725 + 3489 + 2146 = 6360, then

n1 = (N1/N) x n = (725/6360) x 345 = 39.328 ≈ 39
n2 = (N2/N) x n = (3489/6360) x 345 = 189.262 ≈ 189
n3 = 345 – 39 – 189 = 117

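
The two solutions above follow the same pattern: round the allocation for every stratum except the last, then give the
last stratum whatever remains so that the total is exactly n. A minimal sketch of that pattern follows; the function
name proportional_allocation is hypothetical, and taking the last stratum as the remainder is read off the worked
solutions rather than stated as a rule in the text.

```python
def proportional_allocation(strata_sizes, n):
    """Allocate a sample of size n to strata in proportion to their sizes."""
    N = sum(strata_sizes)
    # Round ni = (Ni / N) x n to the nearest whole number for all but the last stratum.
    allocation = [int(Ni / N * n + 0.5) for Ni in strata_sizes[:-1]]
    # The last stratum takes the remainder so that the allocations add up to n.
    allocation.append(n - sum(allocation))
    return allocation


print(proportional_allocation([220, 195, 163, 150], 40))  # [12, 11, 9, 8]
print(proportional_allocation([725, 3489, 2146], 345))    # [39, 189, 117]
```

In the second example, rounding (2146/6360) x 345 = 116.41 on its own would give 116 and a total of 344, which is why
the remainder is used for the last stratum.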
The advantage of stratified sampling lies not only in the assurance that all strata are represented; it can also lead
to better estimates of the population parameters compared to Simple Random Sampling.

In many statistical studies, we can reduce the cost of sampling relative to simple random sampling by
randomly selecting groups of elements from a population and then sampling some or all of the elements within the selected
groups. This sampling procedure is usually used when the population is widely distributed geographically or occurs in
natural clusters such as households, schools or business establishments. If the population is the set of workers of NGOs
(non-government organizations) in the Philippines, it is much cheaper to sample NGOs and interview every worker in the
selected NGOs than to interview a Simple Random Sample (SRS) of NGO workers, because with SRS you might need to
travel to an NGO office just to interview one worker. Thus, it is usually cheaper to sample in clusters than by SRS or
stratified sampling. This sampling procedure is known as cluster sampling.

(4) Cluster Sampling


Cluster sampling assumes that the population is naturally separated by groups or clusters. A number of clusters are
selected randomly and then all or parts of the units within the selected clusters are included in the sample. No units from
the non-selected clusters are included in the sample. It differs from stratified sampling, because in the latter, sample units
are selected from every group.
You may be able to save considerable resources with cluster sampling compared to SRS or Stratified Random Sampling, but
cluster sampling leads to less precise estimates. This is because units within the same cluster tend to give
similar information, which may differ from the information in clusters that were not selected.

Steps in taking a Cluster Sample:

(1) Divide the population area into clusters.
(2) Randomly select a few of these clusters.
(3) Take all the elements from the selected clusters, or select only a portion of them.

Examples:

(1) Suppose the population of a study is residents of a condominium in a large city. If there are 10 condominium buildings
in this city, the researcher can select two buildings randomly from the 10 and interview all (or a subsample) of the residents
from these buildings.

(2) Suppose an organization wishes to find out which sports senior students in the Philippines are participating in. It would
be too costly and would take too long to survey every student, or even some students from every school. Instead, 100 schools
are randomly selected from all over the Philippines. These schools are considered to be clusters. Then every senior student
in these 100 schools is surveyed. In effect, students in the sample of 100 schools represent all senior students in the
Philippines.
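
The condominium example (1) above can be sketched as follows. The building names and resident labels are made up for
illustration, with 5 residents per building assumed, and the function name cluster_sample is hypothetical. The sketch
takes every unit in the selected clusters; subsampling within clusters is not shown.

```python
import random


def cluster_sample(clusters, m, seed=None):
    """Randomly select m clusters and take every unit inside them."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), m)                     # select the clusters at random
    units = [unit for name in chosen for unit in clusters[name]]  # take all their units
    return chosen, units


# Hypothetical data: 10 condominium buildings with 5 residents each.
buildings = {
    f"Building {b}": [f"Resident {b}-{i}" for i in range(1, 6)]
    for b in range(1, 11)
}

chosen, residents = cluster_sample(buildings, m=2, seed=0)
print(chosen)
print(residents)
```

No resident from the eight buildings that were not selected appears in the sample, which is the feature that
distinguishes cluster sampling from stratified sampling.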

The advantages of Cluster sampling are: reduced costs; simplification of the fieldwork and more convenient
administration. Instead of having a sample scattered over the entire coverage area, the sample is more localized in relatively
few centers. However, it often gives less accurate results due to higher sampling error than for simple random sampling
with the same sample size. In the above example, you might expect to get more accurate estimates from randomly selecting
students across all schools than from randomly selecting 100 schools and taking every student in those chosen schools.

Exercise 2

I. Classify each sampling procedure as simple random sampling, systematic, stratified, or cluster.

(1) In a school district consisting of 5 buildings, two buildings were randomly selected and all teachers from the selected
buildings are interviewed to determine whether they believe that students have less homework to do now than in previous
years.

(2) Nursing supervisors are selected using random numbers in order to determine average annual salaries.

(3) Every hundredth hamburger manufactured is checked to determine its fat content.

(4) Mail carriers of a large city are divided into four groups according to gender (male or female) and according to whether
they walk or ride on their routes. Then 10 are selected from each group and interviewed to determine whether they have
been bitten by a dog in the last year.


(5) There are 2000 subjects in the population and a sample of 50 is needed. So, every 40th subject is selected at random.

(6) A city’s telephone book lists 100,000 people. Suppose the telephone book is the frame for a study and Suzette wants to
interview every 200th person.

(7) An aquarium consists of 100 different species of fishes. A researcher randomly selects 10 fishes.

II. Use the appropriate sampling procedure.

(1) A population of 70 cities is numbered from 1 to 70. Select a systematic random sample of 15 cities. Choose your own
starting sample point.

(2) Pulse Asia is conducting an Exit Poll on the recently concluded National Election. A certain barangay has been
considered for the survey. Of the 150 households (numbered 1 to 150) in the said barangay only 10 are to be included in the
survey.
(a) Use systematic random sampling to select the 10 households using your selected random start.
(b) Use the method of simple random sampling to choose the 10 households.

(3) A researcher is interested in determining the academic performance of the CNSM students. The students have been
classified according to their departments:
DEPARTMENT     NUMBER OF STUDENTS
Biology        700
Chemistry      190
Mathematics    430
Physics        180

Using proportional allocation, how many students should be taken from each department if a random sample of size
400 is to be chosen?

B. Non-probability Sampling
This is a sampling procedure in which individuals or items are chosen in a manner that does not involve a random selection
process. It is usually used when the size of the population is unknown or when its members cannot be individually
identified. Here, personal preferences are applied. Because chance is not used to select items, the techniques are called
non-probability techniques. They are not desirable for gathering data to be analyzed by the methods of statistical
inference, because the reliability of the measures cannot be determined objectively.

(1) Convenience Sampling


The elements in convenience sampling are selected for convenience of the researcher. Usually the researcher
chooses those that are readily available, nearby, or willing to participate. The result of such study usually leads to less varied
observations than the population because in many environments the extreme elements of the population are not readily
available.

Examples:
(a) A convenience sample of homes for door-to-door interviews might include houses where people are at home, houses
with no dogs, houses near the street, first-floor apartments, and houses with friendly people.
(b) If the research firm is located in a mall, a convenience sample might be selected by interviewing only shoppers who
pass the shop and look friendly.

(2) Quota Sampling



Quota sampling appears similar to stratified random sampling in that certain population subclasses, such as
age group, gender, or geographic region, are used as strata. However, instead of sampling randomly from each stratum,
the researcher uses a nonrandom sampling method to gather data from a stratum until the desired quota of samples is
filled. The quota is often filled using available, recent, or applicable elements. Quota sampling is less expensive than
most random sampling techniques because it is essentially a form of convenience sampling. It also speeds up data
gathering: the researcher does not have to call back or send out a second questionnaire if he does not receive a
response; rather, he just moves on to the next element.

Examples:
(a) Instead of randomly interviewing people to obtain a quota of Italian Americans, the researcher would go to the
Italian area of the city and interview there until enough responses are obtained to fill the quota.
(b) Suppose the researcher wants to stratify the population into owners of different types of cars but fails to find any
lists of Toyota van owners. Through quota sampling, the researcher would proceed by interviewing all car owners
and casting out non-Toyota van owners until the quota of Toyota van owners is filled.
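
Example (b) can be pictured with a short sketch: respondents are taken in whatever order they happen to be met, those
outside the group of interest are cast out, and interviewing stops once the quota is filled. The list of owners below is
made up for illustration, and the function name quota_sample is hypothetical.

```python
def quota_sample(respondents, in_quota_group, quota):
    """Keep respondents in the order they are met until the quota is filled."""
    kept = []
    for person in respondents:
        if len(kept) == quota:
            break                    # quota filled: stop interviewing
        if in_quota_group(person):
            kept.append(person)      # belongs to the group of interest
        # anyone else is cast out; no call-back is made
    return kept


# Hypothetical car owners in the order the researcher happens to meet them.
owners = [
    ("Owner 1", "sedan"), ("Owner 2", "Toyota van"), ("Owner 3", "pickup"),
    ("Owner 4", "Toyota van"), ("Owner 5", "Toyota van"), ("Owner 6", "Toyota van"),
]

van_owners = quota_sample(owners, lambda owner: owner[1] == "Toyota van", quota=3)
print(van_owners)
```

Nothing in this procedure is random: the sample depends entirely on who is met first, which is why the methods of
statistical inference do not apply to it.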

(3) Judgment or Purposive Sampling


The elements selected for the sample are chosen by the judgment of the researcher. Researchers often believe they
can obtain a representative sample by using sound judgment, or they purposely choose those who can provide the best
information to achieve the objectives of the study, which saves time and money. The researcher only goes to people who,
in his/her opinion, are likely to have the required information and are willing to share it. This is important when you want to
construct a historical reality, describe a phenomenon or develop something about which only a little is known.

Example:
A student conducted a study on the history of CNSM. To get proper information, he interviewed past deans, chairmen
and pioneering faculty and staff of the college.

(4) Snowball Sampling


The survey subjects in snowball sampling are selected based on referrals from other survey respondents; that is, the
sample is selected using networking. This technique is usually used for censored or sensitive topics like prostitution,
drug trafficking, kidnapping, and same sex relationships. This technique is done by first identifying a person who fits
the profile of subjects for the study. The researcher then asks this person for the names and locations of others who would
also fit the profile. This process is continued until the required number, in terms of the information being sought, has
been reached.
Through these referrals, survey subjects can be identified cheaply and efficiently, which is particularly useful when survey
subjects are difficult to locate. This sampling technique is useful if you know little about the group or organization you wish
to study, as you only need to make contact with a few individuals, who can then direct you to the other members of the
group.

Example:
A researcher wanted to study the reasons why some students occasionally use prohibited drugs. He intended
to get 50 students, but he only knew 5 students who used them. By getting the cooperation of these 5 students, he was
referred to other drug users, who in turn provided additional contacts. In this way, he was able to get the
number of students he needed.
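
The referral process in this example can be sketched as a walk through a network of contacts. The subject labels and the
referral list below are made up for illustration, and the function name snowball_sample is hypothetical; the sketch simply
visits referrals in the order they are received.

```python
from collections import deque


def snowball_sample(seeds, referrals, target):
    """Start from known subjects and follow referrals until the target size is reached."""
    sample = []
    seen = set(seeds)
    queue = deque(seeds)
    while queue and len(sample) < target:
        person = queue.popleft()
        sample.append(person)
        # Ask this person for others who fit the profile of the study.
        for contact in referrals.get(person, []):
            if contact not in seen:
                seen.add(contact)
                queue.append(contact)
    return sample


# Hypothetical referrals: each subject names other subjects he or she knows.
referrals = {"S1": ["S6", "S7"], "S2": ["S7", "S8"], "S6": ["S9"]}
sample = snowball_sample(seeds=["S1", "S2"], referrals=referrals, target=5)
print(sample)  # ['S1', 'S2', 'S6', 'S7', 'S8']
```

If the referrals run out before the target is reached, the sketch returns the smaller sample that it was able to collect.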

NOTE: Probability samples can be further analyzed using methods in statistical inference but this is not valid for non-
probability samples.

Exercise 3
Determine what type of non-random sampling procedure is used in the following:

(1) A researcher wanted to study about the participation of Meranao female students in sports. To get a sample of 50 students,
he went to every college and asked female students whether they are Meranao or not. If they are, then they are asked to
participate in the study by answering the questionnaire.

STT 041/STT041.1, First Semester, A.Y. 2020 – 2021 Mathematics Dep’t, MSU, Marawi City
11

(2) A researcher wanted to study about the sufficiency of the facilities of the CNSM library for the needs of the CNSM
students in their CNSM subjects. He planned to get a sample of 100 students. Due to time constraint and difficulty of getting
a random sample, he decided to get his sample during the CNSM orientation. From there he asked 100 students to participate
in his study.

(3) A researcher wanted to understand the attitudes of minority managers toward a system for assessing management
performance. To get the proper information, he interviewed managers who are members of a minority group and who work
in medium-scale to large-scale firms.

(4) A sociologist conducted a study about the opinions of employed adult women about government funding for day care.
She went around an area knocking on doors during weekends, when women are likely to be at home. She asked to speak to
the woman of the house. Her first question was whether the woman is employed or not. The interview was conducted only if
the respondent was employed.

(5) Manufacturers and advertising agencies wanted to know about the habits of consumers and the effectiveness of ads.
They needed to interview a sample of 1,000 consumers. To get the required sample, they went to some shopping malls and
interviewed consumers until they obtained the required number of consumers.

(6) Suzanne wanted to know how to manage a good quality business. In order for her to have better information for her
study, she interviewed different managers on different business stations.

(7) Physical Education students wanted to know the different views on the health benefits of taking Physical
Education 4. To get information from 100 students, they stood at the grandstand and asked each student there
whether he or she had taken Physical Education 4 in the previous semester. If so, they then asked that student about
his or her views on taking Physical Education 4.

(8) Emil needs to have information about the history of his place. He interviewed his ancestors and also the past officials in
their place in order for him to get the information he wanted.

(9) A marketing company wants to test a new personal computer. They need to interview 50 users. To get
the desired number of users, they went to different internet cafés and interviewed those people who looked friendly.

So far, we have only discussed how to select the sample for a study, which is also called the sampling design. Sampling
design is critical to the interpretation of observational studies, that is, studies in which the researcher merely “observes” the
study units, making one or more measurements on each.
A study in which the researcher intervenes (“experiments”) in some way to affect the manner in which the study
units (now called “experimental units”) respond is called an experimental study. This is the topic of the
next section.

III. Design of Experiments

The sampling procedures above are applicable to survey research. But for research that involves experiments in a
laboratory or in an agricultural field, or the application of different teaching methods, the question is “How will you assign
the different treatments to the experimental units?” We call this design of experiments or experimental design.

By an experimental design, we mean a plan used to collect the data relevant to the problem under study in such a way
as to provide a basis for valid and objective inference about the stated problem. The plan usually consists of the selection
of treatments whose effects are to be studied, the specification of the experimental layouts, and the assignment of treatments
to the experimental units and the collection of observations for analysis. All these steps are accomplished before any
experiment is performed.

Two of the most common designs are the Completely Randomized Design (CRD) and the Randomized Complete Block
Design (RCBD). Interested students may read books that discuss experimental design, like Black (2004) and Walpole
(1982), or may search the internet.
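
As a rough illustration of assigning treatments to experimental units in a completely randomized design, the sketch below shuffles the units and deals them out to the treatments in equal-sized groups. The function name `assign_crd`, the plot names, and the treatment labels are hypothetical, and equal replication per treatment is an assumption made only for this example.

```python
import random

def assign_crd(units, treatments, seed=None):
    """Randomly assign each unit to one treatment, with equal replication (assumed)."""
    if len(units) % len(treatments) != 0:
        raise ValueError("number of units must be a multiple of the number of treatments")
    rng = random.Random(seed)      # a fixed seed makes the layout reproducible
    shuffled = list(units)
    rng.shuffle(shuffled)          # every ordering of the units is equally likely
    reps = len(units) // len(treatments)
    # The first `reps` shuffled units get the first treatment, and so on.
    return {unit: treatments[i // reps] for i, unit in enumerate(shuffled)}

# Hypothetical experiment: 12 plots, 3 treatments, 4 plots per treatment.
layout = assign_crd([f"plot{i}" for i in range(1, 13)], ["A", "B", "C"], seed=1)
```

Because the assignment depends only on the shuffle, no plot is more likely than another to receive a given treatment.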

IV. Methods of Collecting Data

In the planning stage of a study, one of the critical things to be decided upon is the method to be used in collecting
the data. Five methods of data collection are discussed below, and each has its own strengths and weaknesses.
The choice will depend upon the availability of time and resources, the appropriateness of the method, the type of sample
units to be studied, and other factors.

A. Interview Method
This is a person-to-person encounter between the one soliciting information (also known as the interviewer) and
the one supplying the information (also known as the interviewee). It can be conducted in person or through telephone
conversation.
• Advantages:
(1) Questions can be repeated, rephrased, or modified for better understanding.
(2) Answers may be clarified, thus ensuring more precise information.
(3) Information can be evaluated since the interviewer can observe the facial expression of the interviewee.
• Disadvantages:
(1) It is costly because you might need to spend a lot on transportation, aside from other incidental expenses.
(2) It can cover only a limited number of individuals in a given period of time. Hence, you need a longer time to
finish the data collection.
(3) Interviewees may feel pressured for on-the-spot responses.
(4) People may give different answers to different interviewers.
(5) People may say what they think an interviewer wants to hear or what they think will impress the interviewer.
(6) A particular interviewer may affect the accuracy of the response by misreading questions, recording responses
inaccurately, or antagonizing the respondent.

B. Questionnaire Method
A questionnaire could be mailed or hand-carried (delivered in person) to the respondents.
• Advantages:
(1) It is less expensive and has a greater scope than the interview method.
(2) Respondents have enough time to formulate appropriate responses.
• Disadvantages:
(1) Low return rate. Only a few would care to mail back the questionnaire.
(2) People do not always understand the questions, and sometimes certain words mean different things to different
people. Moreover, respondents have no way to ask for clarification before they answer the questionnaire.

C. Observation Method
This is appropriate for obtaining data on the behavior of an individual or group of individuals at the time a given
situation occurs. Subjects may be observed individually or collectively.
Examples:
(a) If your study deals with how many people are involved in a fight and why they are fighting in a particular
place, then you must be there before the conflict ends so that you can witness what is happening and learn the
reason/s behind it. Do not arrive late, because the conflict cannot be replayed unless a video of it was taken.
(b) Suppose you want to study the boxing matches of Manny Pacquiao. To have an unbiased study, you
need to be in the arena before the match starts.
Limitation: Observation can be made only at the time the event/s of interest occur.

D. Experimentation Method
This is applied in obtaining data from an experiment.
• Advantage:
The experiment can be repeated.
• Disadvantage:
It takes a long time and great effort to wait for the result, especially when the first experiment fails, because
in that case you must repeat the experiment to get a good outcome or result.

E. Use of Existing Data
The data come from:
(a) documents (books and magazines, hospital records, public files, registrations, etc.)
(b) the internet.
• Advantages:
(1) They provide information about the incidence (the number of new cases), prevalence (the number of existing cases),
and rate (the proportion of a population with the particular concern) in a population (Rossi and Freeman, 1993).
(2) They aid in the definition and selection of the target population.
(3) They help improve the planning and design of a new study.
• Disadvantages:
(1) If you are using agency records, your information will apply only to those individuals participating in that
program. Agency records exclude data on individuals who are not participating.
(2) If you are using published reports or data collected by outside sources, you will not have enough information
about the individuals involved in the specific study you are evaluating.
(3) Existing data cannot give precise information about a geographic area unless they were collected specifically in
that area. For example, if you access the records on teen pregnancy rates, those rates may not accurately reflect the
pregnancy rates in your own community.
(4) Published reports will not let you determine the impact of your study on its actual participants.

V. Types of Level of Measurement

Another way of classifying data is according to their level of measurement. You may notice that one can measure
the exact difference between two persons with height 5 feet and 4 feet. But you cannot measure the exact difference between
two persons whose opinion on a certain issue is strongly agree and the other is agree. However, you know that one has a
higher level of agreement compared to the other. And if you compare their gender, you can only tell whether they belong
to the same category or not but you can never tell which is a stronger gender between them. This property of data leads to
four (4) classifications or levels namely: nominal, ordinal, interval and ratio.

A. Nominal Level
This is the lowest level of measurement. The values of the data of this measurement fall into unordered categories or
classes.
Nominal data are used to distinguish different categories of qualitative variables and can be used as measures of
identity. When data are processed using computer packages, the encoder gives the same number to members of the same
category and different numbers to members of different categories. In other words, the numbers here are essentially "dummy
codes": the data can be coded, but the codes have neither an ordering property nor mathematical significance.

Examples:
(a) A sample of college instructors classified according to subject taught such as English, History, Psychology, or
Mathematics.
(b) Classifying respondents as male or female.
(c) Classifying residents according to zip codes. There is no meaningful order or ranking among zip codes.
(d) Political party such as Liberal, United Nationalist Alliance, Independent.
(e) Religion such as Lutheran, Jewish, Catholic, Methodist, etc.
(f) Marital status such as married, divorced, widowed, separated.
(g) Blood type: 1-type A, 2-type B, 3-type AB, 4-type O

The numbers 1, 2, 3, 4 in Example (g) above have no inherent mathematical properties, that is, assigning 4 to type
O and 1 to type A does not mean that type O is better than type A. Moreover, the assignment of codes is not unique.
For instance, 0 may be assigned to type A, 1 to type B, and so on.

The codes have no mathematical significance; thus, we cannot add, subtract, or otherwise do arithmetic on the data.
If we do, as in 4 - 1 = 3, we end up with a wrong interpretation: subtracting a person with blood type A from a person
with blood type O would give a person with blood type AB, which is absurd.

The numbers are used only to facilitate data analysis using the computer.
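
To see why the codes carry no mathematical meaning, the sketch below encodes the same hypothetical sample of blood types under two equally valid codings. The sample, the second coding, and the helper names `mode_label` and `mean_code` are assumptions made for illustration. The most frequent category is the same under both codings, while the arithmetic mean of the codes changes with the coding and describes nothing about the sample.

```python
from collections import Counter

# Coding from Example (g), and an equally valid alternative coding.
codes_a = {"A": 1, "B": 2, "AB": 3, "O": 4}
codes_b = {"A": 0, "B": 1, "AB": 2, "O": 3}

# Hypothetical sample of respondents' blood types.
sample = ["O", "A", "O", "B", "AB", "O", "A"]

def mode_label(labels, coding):
    """Encode the labels, find the most frequent code, and decode it back."""
    encoded = [coding[x] for x in labels]
    code, _count = Counter(encoded).most_common(1)[0]
    decode = {v: k for k, v in coding.items()}
    return decode[code]

def mean_code(labels, coding):
    """Mean of the codes: computable, but not meaningful for nominal data."""
    encoded = [coding[x] for x in labels]
    return sum(encoded) / len(encoded)
```

Counting members of each category is the only summary that survives a change of codes, which is why frequencies and the mode are the natural descriptions of nominal data.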

B. Ordinal Level
More often, ordinal data are categorical data but categories can be ranked; however, precise differences between the
categories do not exist. But sometimes ordinal data uses numbers. In this case, the numbers indicate position in an ordered
series of the categories. But it does not indicate how much of a difference exists between the successive positions on the
scale. It means that it involves data that may be arranged in some order but difference between data values either cannot be
determined or is meaningless.
When ordinal data are encoded in the computer for analysis, they are converted to numbers such that the numbers
indicate positions in an ordered series.

Examples:
(a) Rank of students in a graduating class (1-valedictorian, 2-salutatorian, and so on). A rank of 5 is better than a
rank of 10, but the difference of 5 between the 5th and 10th ranks is not meaningful, i.e., it is not necessarily the
same as the difference between ranks 20 and 25.

(b) Military rank, position in the office, opinion on an issue (strongly disagree, disagree, neutral, agree, strongly
agree), final grade in Math 1 (1.00, 1.25, 1.50, etc.)

(c) Speakers might be ranked as superior, average, or poor.

(d) Floats in homecoming parade might be ranked as first place, second place, etc.

(e) Letter grades (such as A, B, C, D, E, F) or numerical grades such as 1.0, 1.25, 1.50, etc. These are ordinal because
1.0 corresponds to a score of 90-100, 1.25 corresponds to a score of 80-90, and so on; thus, we can say that getting
a grade of 1.0 is better than getting a grade of 1.25.

C. Interval Level
Interval-level data are numerical; hence, they can be ranked, and precise differences between units of measure do
exist. However, there is no absolute zero, that is, the scale lacks an inherent zero starting point (absolute zero means
the total absence of the characteristic being measured). The starting point is arbitrary.
Examples:
(a) temperature in degrees Fahrenheit or degrees Celsius
The freezing point of water is 0° in Celsius while it is 32° in Fahrenheit. Moreover, 30° Celsius is hotter than
15° Celsius, but it is wrong to conclude that 30° is twice as hot as 15°, since 0° is not an absolute zero point. That
is, 0° does not mean the total absence of heat. In fact, there are countries that have negative temperatures during
winter, like -10°C.

(b) IQ is an example of an interval scale. There is a meaningful difference of one point between an IQ of 109 and an
IQ of 110. But we cannot say that a person who has an IQ of 100 is twice as intelligent as one with an IQ of 50.
An IQ of zero (0) does not mean that the person who took the IQ test has no intelligence.
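
The point about temperature can be checked numerically. The sketch below converts 30 °C and 15 °C to Fahrenheit and to Kelvin; the Kelvin scale is not part of the discussion above and is brought in here only as an example of a temperature scale with a true zero. The ratio of the two readings changes with the scale, while the difference merely changes units.

```python
def c_to_f(celsius):
    """Convert degrees Celsius to degrees Fahrenheit."""
    return celsius * 9 / 5 + 32

def c_to_k(celsius):
    """Convert degrees Celsius to kelvins (a scale with a true zero)."""
    return celsius + 273.15

hot, mild = 30.0, 15.0

ratio_c = hot / mild                    # 2.0, but only because of where 0 °C sits
ratio_f = c_to_f(hot) / c_to_f(mild)    # 86 / 59, not 2
ratio_k = c_to_k(hot) / c_to_k(mild)    # close to 1 on the Kelvin scale

diff_c = hot - mild                     # 15 Celsius degrees
diff_f = c_to_f(hot) - c_to_f(mild)     # the same gap in Fahrenheit degrees
```

Since the "twice as hot" reading disappears under a mere change of units, it cannot be a property of the temperatures themselves.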

D. Ratio Level
This possesses all the characteristics of interval measurement and there exists a true zero, meaning it has an inherent
zero starting point. Like the interval scale, differences are meaningful. The ratio of two measures is also meaningful. For
example, a person who is 4 feet tall is twice as tall as a person who is 2 feet tall, since the true starting point is zero.
This is the highest level of measurement.

Examples:
(a) Monthly salary: Php 0 means no salary.

(b) Other examples of ratio-level measurements are height, weight, area, number of phone calls received,
number of children, etc.

Exercise 4

1. Classify each variable as categorical or numerical.


(a) Colors of jackets in a men’s clothing store.
(b) Number of seats in classrooms.
(c) Classification of children in a day care center (infant, toddler, preschool).
(d) Length of fish caught in a certain stream.
(e) Number of students who fail their first statistics examination.
(f) Number of hours spent reading a novel
(g) Academic major
(h) Birth order (1st born, 2nd born, 3rd born, …)
(i) Political party (Liberal Party, LAKAS, UNA, …)
(j) SASE Score

2. Classify each variable as discrete or continuous.


(a) Number of loaves of bread baked each day at a local bakery.
(b) Water temperature of the saunas (steam bath) at a given health spa.
(c) Income of single parents who attend a community college.
(d) Lifetimes of a certain type of batteries in a tape recorder.
(e) Weights of newborn infants at a certain hospital

3. Classify each as nominal, ordinal, interval, or ratio level.


(a) Horsepower of motorcycle engines.
(b) Ratings of newscasts in the Philippines (poor, fair, good, excellent).
(c) Temperature of automatic popcorn poppers.
(d) Time required by drivers to complete a course.
(e) Salaries of cashiers of Day-Night grocery stores.
(f) Marital status of respondents to a survey on savings account.
(g) Ages of students enrolled in martial arts course.
(h) Weights of beef cattle fed a special diet.
(i) Rankings of weight lifters.
(j) Number of exams given in a statistics course.
(k) Ratings of word-processing programs as user-friendly.
(l) Temperatures of a sample of automobile tires tested at 55 miles per hour for six minutes.
(m) Weights of suitcases on a selected commercial airline flight.
(n) Classification of students according to major field.
(o) Data are classified according to color.
(p) Years of service in a company.

To describe results, draw conclusions, or make inferences about events after collecting the data needed for a study,
the researcher must organize and present the data in some meaningful way. The next section shows the different methods
of presenting or organizing data in a meaningful way.

VI. Methods of Presenting Data

This section shows how to organize data and to construct appropriate graphs to represent the data in a concise, easy-
to-understand form. There are three methods of presenting data: textual presentation, tabular presentation and graphical
presentation.

A. Textual Presentation
The first method of presenting data is textual presentation. The data collected are presented
in sentence form.

Example: Twenty of the respondents are male and thirty of the respondents are female.

B. Tabular Presentation
A tabular presentation is an arrangement of statistical data in rows and columns. Rows are horizontal arrangements
whereas columns are vertical arrangements.

Example:
The table below shows the average weight of respondents grouped according to gender.

GENDER    AVERAGE WEIGHT (KILOS)
Male      60
Female    52

A special type of table that is important in statistical analysis is called frequency distribution table.

Definition: A frequency distribution is a summary of the data presented in the form of classes and frequencies. The data
can be presented in a one-way or two-way frequency distribution table.

(1) One-way frequency distribution


The data are tabulated according to a single variable.

Example: Frequency Distribution of Respondents According to Year Level

Year Level    Number of Students
First         35
Second        50
Third         48
Fourth        24

(2) Two-way frequency distribution


The data are tabulated according to two variables. It is also called a cross-tabulation or contingency table.

Example:
YEAR LEVEL    GENDER            TOTAL
              MALE    FEMALE
First         29      45         74
Second        12      21         33
Third         15      18         33
Fourth        12       5         17
Total         68      89        157
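
A cross-tabulation can be built directly from raw records by counting pairs. The sketch below uses a small set of hypothetical (year level, gender) records, not the data of the example table, and checks the defining property of a two-way table: each marginal total is the sum of the cells in its row or column.

```python
from collections import Counter

# Hypothetical raw records: one (year level, gender) pair per respondent.
records = [
    ("First", "Male"), ("First", "Female"), ("First", "Female"),
    ("Second", "Male"), ("Second", "Female"),
    ("Third", "Female"), ("Third", "Male"), ("Third", "Male"),
]

cells = Counter(records)                                 # joint counts
row_totals = Counter(year for year, _g in records)       # totals by year level
col_totals = Counter(gender for _y, gender in records)   # totals by gender
grand_total = len(records)

# Each row total must equal the sum of the cells in that row.
for year in row_totals:
    assert row_totals[year] == cells[(year, "Male")] + cells[(year, "Female")]
```

The same check applied to the columns and to the grand total is a quick way to catch tabulation mistakes.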

For numerical data with a wide range of values, it is more practical to group the observations into class
intervals like in the example below.
AGE (years)    NUMBER OF STUDENTS
15-16          40
17-18          56
19-20          42
21-22          30
23-24          15
TOTAL          183

This frequency table summarizes the data into 5 classes (called class intervals). The class interval 15-16 has a lower
class limit of 15 and an upper class limit of 16. The interval 15-16 actually includes ages ranging from 14.5 to 16.5,
since ages in this range become 15 or 16 when rounded to whole numbers.

When data are organized into a frequency distribution, they are called grouped data. If they have not been
summarized in any way, they are called raw data or ungrouped data.

You might ask how frequency distribution is constructed. The following is your guide.

Construction of Frequency Distribution


The following steps are involved in the construction of a frequency distribution.

(1) Find the range (R) of the raw data: The range is the difference between the largest and the smallest values. That is,

R = (highest value) – (lowest value)

(2) Decide on the number of class intervals (or classes), k: There are no hard rules for the number of classes. Walpole
(1982) recommended that there should not be fewer than 5 and not more than 20 classes. Having too few classes would lead
to wider class intervals, thereby losing much information. On the other hand, having too many classes fails to
aggregate the data enough to be useful. Others, like H.A. Sturges (1926), have given a formula for determining the number
of classes. We can compute k using the formula

k = 1 + 3.322 log10 N where N = number of observations

(Note: Round off k to the nearest whole number.)

Example:
If the total number of observations is 50, the number of class intervals would be
k = 1+ 3.322 log10 N
k = 1+ 3.322 log10 50
k = 1+ 3.322 (1.69897)
k = 1+ 5.644
k = 6.644
k ≈ 7 classes
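
The computation above can be sketched in a few lines. The function name `sturges_classes` is hypothetical, and rounding a half upward is an assumption, since the note only says to round to the nearest whole number.

```python
import math

def sturges_classes(n):
    """Number of classes k = 1 + 3.322 * log10(n), rounded to the nearest whole number."""
    k = 1 + 3.322 * math.log10(n)
    # Round half up; Python's built-in round() would round exact halves to even.
    return int(k + 0.5)

k_for_50 = sturges_classes(50)   # the worked example above, with N = 50
```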

(3) Determine the class size or class width, c: This is obtained by dividing the range of the raw data by the number of
classes. But the result is rounded up to the nearest higher value whose precision is the same as those of the raw data.

c > R / k = (range) / (number of classes)

Examples:
(1) Suppose a set of data has 100 observations with a lowest observation of 20 and highest observation of 85.
Find the class width, c.
Solution: First, we compute the range. In this case, the range R = 85 – 20 = 65.
Next, we estimate the number of class intervals, k.
k = 1 + 3.322 log10 N
k = 1 + 3.322 log10 (100)
k = 7.644
k ≈ 8 classes.
Then, c > 65/8 = 8.125. And since the given observations are whole numbers, we take c = 9 (the class
width c is rounded up with the same precision as the given data).

(2) Suppose the lowest blood potassium level (in milliequivalents per liter) obtained in a study of 40 men is 3.2
and the highest blood potassium level is 5.8. Compute the number of classes, k, and the class width c.
Solution: We have
R = 5.8 – 3.2 = 2.6

k = 1 + 3.322 log10 N
k = 1 + 3.322 log10 (40)
k = 6.322
k ≈ 6 classes

c > 2.6/6 ≈ 0.43. Since the given observations are in one decimal place, we take c = 0.5.
Thus, 6 classes or categories of blood potassium levels can be made with a class width of 0.5.

(3) Suppose thirty automobiles were tested for fuel efficiency, in miles per gallon (mpg), and the lowest mpg is
7.55 and the highest mpg is 32.67. Compute the number of classes, k, and the class width c.
Solution: We have
R = 32.67 – 7.55 = 25.12

k = 1 + 3.322 log10 N
k = 1 + 3.322 log10 (30)
k = 5.91
k ≈ 6 classes

c > 25.12/6 ≈ 4.18667. Since the given observations are in two decimal places, we take c = 4.19.
Thus, 6 classes can be made with a class width of 4.19.
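
The rounding-up rule in the three examples above can be sketched as follows. The function name `class_width` and its string arguments are hypothetical choices; `Decimal` is used so that values such as 0.1 and 0.01 are handled exactly. One assumption to note: if the range divided by k is already an exact multiple of the unit, this sketch returns it unchanged rather than moving to the next higher value.

```python
from decimal import Decimal, ROUND_CEILING

def class_width(low, high, k, unit):
    """Round (high - low) / k up to a multiple of `unit`, the precision of the data.

    `low`, `high`, and `unit` are passed as strings so Decimal keeps them exact.
    """
    data_range = Decimal(high) - Decimal(low)
    u = Decimal(unit)
    # Number of whole units needed to cover range / k, rounded upward.
    steps = (data_range / k / u).to_integral_value(rounding=ROUND_CEILING)
    return steps * u

# The three worked examples: whole numbers, one decimal place, two decimal places.
c1 = class_width("20", "85", 8, "1")
c2 = class_width("3.2", "5.8", 6, "0.1")
c3 = class_width("7.55", "32.67", 6, "0.01")
```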

(4) Determine the class limits of the k classes: The starting class limit must be equal to or lower than the lowest value in
the raw data. When the lowest class limit has been decided, add the class size to the lowest class limit to get the lower limit
of the next class. The remaining lower class limits are determined by adding the class size repeatedly until you reach k
classes. The appropriate upper class limits are determined next.

The upper limit (UL) of the first class can be obtained by subtracting one unit of measure from the lower limit of
the next class. The upper limits of the rest of the classes can then be obtained in a similar fashion or by adding c to the upper
limit of the preceding class. Finally, we check if the highest observation is contained in the last class. If not, then we simply
add another class interval.
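
Step (4) can be sketched as a short loop. The function name `class_limits` and its parameters are hypothetical, and the starting lower limit is simply supplied by the caller, since the text leaves that choice to the researcher. The loop keeps adding classes until there are k of them and the highest observation is covered.

```python
from decimal import Decimal

def class_limits(start, width, k, unit, highest):
    """Return [(lower, upper), ...] for k classes, adding more if `highest` is not covered."""
    start, width, unit, highest = map(Decimal, (start, width, unit, highest))
    limits = []
    lower = start
    while len(limits) < k or limits[-1][1] < highest:
        upper = lower + width - unit   # one unit of measure below the next lower limit
        limits.append((lower, upper))
        lower += width                 # next lower limit
    return limits

# Whole-number data from 40 to 81 with k = 6 and c = 7.
weight_classes = class_limits("40", "7", 6, "1", "81")
```

When the k-th class ends below the highest observation, the loop appends one more class, which mirrors the final check described in this step.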

(5) Tally the observations in the frequency column.


After determining the class limits of k classes, tally or count the number of observations in each class.

Example:
A random sample of 40 Math 31 students was selected and their weights (in kilograms) were recorded as shown below:

Weights (in kg) of Math 31 Students


63 59 43 60 41 53 56 81
50 66 62 52 49 48 52 40
64 64 47 53 47 54 62 56
58 53 50 47 79 70 45 47
46 58 56 55 56 45 73 49

Step 1. Compute the range: R = 81 – 40 = 41
Step 2. Compute the number of classes: k = 1 + 3.322 log10 (40) = 6.322 ≈ 6.
Step 3. Compute the class width: c > 41/6 ≈ 6.833. Since the data are whole numbers, we take c = 7.
Then, the frequency distribution for the weights of Math 31 students is as follows:

Class Limits (Weights in kilogram)    Tally               Frequency (No. of Observations)
40-46                                 ||||| |             6
47-53                                 ||||| ||||| ||||    14
54-60                                 ||||| |||||         10
61-67                                 ||||| |             6
68-74                                 ||                  2
75-81                                 ||                  2
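
The tallying step can be checked with a short sketch that counts how many of the 40 recorded weights fall within each class. The helper name `tally` is hypothetical; the data and class limits are those of the example.

```python
weights = [
    63, 59, 43, 60, 41, 53, 56, 81,
    50, 66, 62, 52, 49, 48, 52, 40,
    64, 64, 47, 53, 47, 54, 62, 56,
    58, 53, 50, 47, 79, 70, 45, 47,
    46, 58, 56, 55, 56, 45, 73, 49,
]
limits = [(40, 46), (47, 53), (54, 60), (61, 67), (68, 74), (75, 81)]

def tally(data, class_limits):
    """Count the observations falling within each (lower, upper) class, limits included."""
    return [sum(lower <= x <= upper for x in data) for lower, upper in class_limits]

frequencies = tally(weights, limits)
```

A useful sanity check is that the frequencies add up to the number of observations.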

There are several columns that we can add to a frequency distribution. These are the columns for the class boundary,
class mark, relative frequency, and cumulative frequency.
The class interval 40-46 actually contains all weights ranging from 39.5 to 46.5. Also the interval 47-53 contains
the weights from 46.5 to 53.5. These true class limits are called the class boundaries. The class boundaries are 39.5-46.5,
46.5-53.5, 53.5-60.5, 60.5-67.5, 67.5-74.5 and 74.5-81.5. It is important to note that the upper class boundary of a class
coincides with the lower class boundary of the next class.
We can compute class boundaries using the following formula:
Lower Class Boundary (LCB) = LL – ½ * (one unit of measure)
Upper Class Boundary (UCB) = UL + ½ * (one unit of measure)

Example:
Class Interval    Class Boundaries
50-55             49.5-55.5
56-61             55.5-61.5
(1/2) * (one unit of measure) = 1/2(1) = 0.5

Class Interval    Class Boundaries
19.6-20.0         19.55-20.05
20.1-20.5         20.05-20.55
(1/2) * (one unit of measure) = 1/2(0.1) = 0.05

Class Interval    Class Boundaries
1.56-1.65         1.555-1.655
1.66-1.75         1.655-1.755
(1/2) * (one unit of measure) = 1/2(0.01) = 0.005
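
The boundary formulas translate directly into code. In this sketch the function name `class_boundaries` is hypothetical, and the limits and the unit of measure are passed as strings so that `Decimal` represents values like 0.05 exactly.

```python
from decimal import Decimal

def class_boundaries(lower_limit, upper_limit, unit):
    """Return (LCB, UCB) = (LL - unit/2, UL + unit/2)."""
    half_unit = Decimal(unit) / 2
    return Decimal(lower_limit) - half_unit, Decimal(upper_limit) + half_unit

b_whole = class_boundaries("50", "55", "1")            # whole-number data
b_tenths = class_boundaries("19.6", "20.0", "0.1")     # one decimal place
b_hundredths = class_boundaries("1.56", "1.65", "0.01")  # two decimal places
```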

Class Mark or Midpoint


The class mark or midpoint is the mean of the lower and upper class limits (or class boundaries) of a class, so it
divides the class into two equal parts. It is obtained by dividing the sum of the lower and upper class limits or class
boundaries of a class by 2. That is,

x̄_i = (LL_i + UL_i) / 2   or   x̄_i = (LCB_i + UCB_i) / 2

Example: The class mark or midpoint of the class interval 60 – 69 is (60 + 69) / 2 = 129 / 2 = 64.5
or, if we use the class boundaries, the class mark is (59.5 + 69.5) / 2 = 129 / 2 = 64.5.

Relative Frequency (Rf_i)

This is the frequency of a class expressed in proportion to the total number of observations:

Rf_i = frequency / n

Cumulative Frequency (F_i)
It is the accumulated frequency of a class. It is the total number of observations whose values do not exceed the
upper limit or upper class boundary of a class.

Example: Consider the frequency distribution of the weights of Math 31 students.

Frequency Distribution Table of Weights (in kg) of Math 31 Students

Class      Frequency    Class Boundaries    Class Mark, x̄_i    Relative Frequency, Rf_i    Cumulative Frequency, F_i
40 – 46    6            39.5 – 46.5         43                  0.15                        6
47 – 53    14           46.5 – 53.5         50                  0.35                        20
54 – 60    10           53.5 – 60.5         57                  0.25                        30
61 – 67    6            60.5 – 67.5         64                  0.15                        36
68 – 74    2            67.5 – 74.5         71                  0.05                        38
75 – 81    2            74.5 – 81.5         78                  0.05                        40
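
The extra columns follow mechanically from the class limits and frequencies. The sketch below, with variable names chosen only for illustration, computes the class marks, relative frequencies, and cumulative frequencies for the weights of the Math 31 students.

```python
from itertools import accumulate

limits = [(40, 46), (47, 53), (54, 60), (61, 67), (68, 74), (75, 81)]
frequencies = [6, 14, 10, 6, 2, 2]
n = sum(frequencies)                               # total number of observations

class_marks = [(lower + upper) / 2 for lower, upper in limits]
relative = [f / n for f in frequencies]            # Rf_i = frequency / n
cumulative = list(accumulate(frequencies))         # F_i = running total of frequencies
```

Two checks are worth making on any such table: the relative frequencies should sum to 1, and the last cumulative frequency should equal n.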

C. Graphical Presentation
After the data have been organized into a frequency distribution, they can be presented in graphical forms. The
purpose of graph in statistics is to convey the data in pictorial form. It is easier to detect trends, low and high points in
graphs, than in frequency tables. Graphs are also useful in getting the reader’s attention in a publication or in a presentation.
They can be used to discuss an issue, reinforce a critical point, or summarize a data set.

(a) Bar Chart


This is a graph where the different classes are represented by rectangles or bars. The width of the rectangle is the
length of the interval, represented by the class limits in the horizontal axis, or categories for nominal data. The length of the
rectangle, corresponding to the class frequency, is drawn in the vertical axis. For the data on weights, the bar chart is shown
below.
[Figure: Bar Chart. Horizontal axis: Weights (in kg) of Math 31 Students, by class limits 40-46 to 75-81; vertical axis: Frequency.]

(b) Histogram
This closely resembles the bar chart with the basic difference that a bar chart uses the class limits for the horizontal
axis while the histogram employs the class boundaries. Using the class boundaries eliminates the spaces between rectangles,
thus giving it a solid appearance.
[Figure: Histogram. Horizontal axis: Weights (in kg) of Math 31 Students, by class boundaries 39.5-46.5 to 74.5-81.5; vertical axis: Frequency.]

(c) Frequency Polygon


It is constructed by plotting the class marks against the frequencies. Straight lines then connect the set of points
formed by the class marks and their corresponding frequencies, together with an additional class mark of zero frequency
at the beginning and at the end of the distribution.
[Figure: Frequency Polygon. Horizontal axis: Weights (in kg) of Math 31 Students, marked at 36, 43, 50, 57, 64, 71, 78, 85; vertical axis: Frequency.]
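
The points of a frequency polygon can be listed before any plotting is done. In the sketch below, one extra class mark with zero frequency is added on each side so that the polygon closes on the horizontal axis. Padding both ends is an assumption here, suggested by the axis values 36 and 85 of the figure, and the variable names are illustrative only.

```python
class_marks = [43, 50, 57, 64, 71, 78]
frequencies = [6, 14, 10, 6, 2, 2]
width = class_marks[1] - class_marks[0]   # distance between consecutive class marks

# Add one class mark with zero frequency before the first and after the last class.
xs = [class_marks[0] - width] + class_marks + [class_marks[-1] + width]
ys = [0] + frequencies + [0]

points = list(zip(xs, ys))   # joined in order by straight lines to form the polygon
```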

(d) Frequency Ogive


It represents a cumulative frequency distribution. It is constructed by plotting class boundaries on the horizontal
scale and the cumulative frequency less than the upper class boundaries in the vertical scale.
[Figure: Frequency Ogive. Horizontal axis: Weights (in kg) of Math 31 Students, by class boundaries 39.5 to 81.5; vertical axis: Frequency.]

(e) Pie Chart
This is a circle divided into pie-shaped sections, which look like slices of a pizza. The angle of a sector is
proportional in size to the frequency or relative frequency of its class.
Angle of a sector = Rf_i × 360°

Solution for getting the angle of a sector:

• Rf = 5% or 0.05
Angle of a sector = Rf × 360° = 0.05 × 360° = 18°
• Rf = 15% or 0.15
Angle of a sector = Rf × 360° = 0.15 × 360° = 54°
• Rf = 25% or 0.25
Angle of a sector = Rf × 360° = 0.25 × 360° = 90°
• Rf = 35% or 0.35
Angle of a sector = Rf × 360° = 0.35 × 360° = 126°
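
The same angles can be computed for all classes at once. This sketch, with the hypothetical helper `sector_angles`, takes the class frequencies of the weights data and multiplies before dividing so that the results stay exact for these values.

```python
def sector_angles(frequencies):
    """Angle of each sector in degrees: (frequency / total) * 360."""
    total = sum(frequencies)
    return [f * 360 / total for f in frequencies]

# Frequencies of the six weight classes (40-46 up to 75-81).
angles = sector_angles([6, 14, 10, 6, 2, 2])
```

Since the relative frequencies sum to 1, the sector angles always sum to 360°.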

Exercise 5

1. Find the class boundaries, midpoints, and width for each class interval given.
a. 11-15 b. 17-39 c. 29.3-35.3 d. 11.8-14.7 e. 3.13-3.93

2. Construct a frequency distribution of the scores of 50 students in a Prelim Exam in Math 1. Their scores are given below:
23 50 38 42 63 75 12 33 26 39 35 47 43 52
56 59 64 77 15 21 51 54 72 68 36 65 52 60
27 34 47 48 55 58 59 62 51 48 50 41 57 65
54 43 56 44 30 46 67 53

3. The ages of the signers of the Declaration of Independence of USA are shown below. Construct a frequency distribution
for the given ages. (Source: John W. Wright, ed., The Universal Almanac, Andrews and McMeel, 1994, p.53)

41 39 42 31 53 35 30 34 27 50 50 34
44 60 50 55 50 37 69 42 63 49 39 52
44 48 46 42 45 38 33 70 33 36 60 32
35 45 42 45 62 35 40 54 50 52 27 63
43 34 46 46 39 47 40

4. In a study of 40 women, the following data of blood potassium levels, in milliequivalents per liter, were obtained.
Construct a frequency distribution.

3.2 5.8 6.0 4.5 4.2 4.3 2.7 5.1 4.9 5.3 4.7 5.0
3.9 4.9 4.0 5.2 3.8 4.6 4.3 4.4 3.8 5.6 4.2 4.7
3.6 5.1 4.2 5.8 3.7 3.7 4.3 4.5 3.4 3.6 4.4 3.9
4.2 4.1 4.3 4.9

5. The weights (to the nearest tenth of a kilogram) of 30 students were measured and recorded as follows:

59.2 60.4 58.4 61.4 59.0 61.9 61.9 59.8 61.2 60.2
60.0 61.4 61.2 61.1 61.6 61.5 58.9 62.2 58.4 60.2
65.7 61.7 62.1 60.7 56.3 59.3 60.9 62.4 60.8 62.7

a. Construct a frequency distribution table.


b. Construct a frequency ogive.

6. Thirty AA size batteries were tested to determine how long they lasted. The results, to the nearest hundredth, were
recorded as follows (unit of measurement is 0.01):

4.23 3.71 4.31 4.00 3.96 3.69 3.77 4.01 3.81 3.72
3.87 3.89 3.63 3.99 4.10 4.11 4.09 3.91 4.15 4.19
3.93 3.92 4.05 4.28 3.86 3.94 4.08 3.82 4.22 3.90

a. Construct a frequency distribution table.


b. Construct its frequency bar chart and frequency histogram.

7. For 108 randomly selected college students, the following IQ frequency distribution was obtained. Construct histogram,
frequency polygon, and ogive for the data.
Class limits Frequency
90 - 98 6
99 - 107 22
108 - 116 43
117 - 125 28
126 - 134 9

8. Thirty automobiles were tested for fuel efficiency, in miles per gallon (mpg). The following frequency distribution was
obtained. Construct a histogram, frequency polygon, and bar chart for the data.
Class boundaries Frequency
7.5 - 12.5 3
12.5 - 17.5 5
17.5 - 22.5 15
22.5 - 27.5 5
27.5 - 32.5 2

9. In an insurance company study of the causes of 1000 deaths, the following data were obtained. Construct a pie graph to
represent the data.

Cause of death Number of deaths


Heart disease 432
Cancer 227
Stroke 93
Accidents 24
Other 224

10. If the class marks of a frequency distribution of weights of miniature poodles are 5.0, 6.5, 8.0, 9.5 and 11.0 kilograms,
find: a. the class width b. the class boundaries c. the class limits

11. Given the frequency ogive:

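[Figure: frequency ogive. Horizontal axis: class boundaries. Vertical axis: cumulative frequency. The plotted points are listed below.]

Class boundary         69.5  74.5  79.5  84.5  89.5  94.5  99.5  104.5  109.5  114.5  119.5  124.5  129.5
Cumulative frequency   0     1     4     6     10    14    20    30     38     44     47     49     50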

a. Reconstruct the frequency distribution table.


b. What is the class width?
c. What is the total frequency?
d. Based on the ogive,
 how many observations were below 99.5?
 how many were above 99.5?
 what is the total number of observations?

12. Fill in the missing values in the table.

Class Interval    Class Boundaries    Class Mark    Frequency    Relative Frequency    Cum. Freq.
3.5-___           ____ - ____         ____          5            ____                  5
4.5-___           ____ - ____         ____          9            ____                  ____
____ - ____       ____ - ____         ____          15           ____                  ____
____ - ____       ____ - ____         ____          6            ____                  ____
____ - ____       ____ - ____         ____          3            ____                  38

VII. Summation Notation

Many of the computations in statistics involve a summation of the observed data. In this section, we discuss the
notation used and its basic properties.

The summation notation, $\sum_{i=1}^{n} x_i$, read as “the sum of the x_i’s where i ranges from 1 to n,” is defined as follows:

$\sum_{i=1}^{n} x_i = x_1 + x_2 + x_3 + \dots + x_n$

where i is called the index of summation, 1 is the lower limit and n is the upper limit of the summation.

Examples:
a. $\sum_{i=1}^{5} x_i = x_1 + x_2 + x_3 + x_4 + x_5$

b. $\sum_{i=1}^{3} (x_i + y_i) = (x_1 + y_1) + (x_2 + y_2) + (x_3 + y_3)$

c. $\sum_{i=1}^{2} 2x_i = 2x_1 + 2x_2$

d. $\sum_{i=1}^{4} 3 = 3 + 3 + 3 + 3 = 3(4) = 12$

Rules of Summation:
a. $\sum_{i=1}^{n} (x_i + y_i) = \sum_{i=1}^{n} x_i + \sum_{i=1}^{n} y_i$

b. $\sum_{i=1}^{n} a x_i = a \sum_{i=1}^{n} x_i$, where a is any constant.

c. $\sum_{i=1}^{n} a = na$, where a is any constant.

Example:
Given x1 = 3, x2 = 4, x3 = 8, x4 = -2, y1 = -6, y2 = -1, y3 = 5 and y4 = 0.

a. $\sum_{i=1}^{4} x_i^2 = x_1^2 + x_2^2 + x_3^2 + x_4^2 = 3^2 + 4^2 + 8^2 + (-2)^2 = 9 + 16 + 64 + 4 = 93$

b. $\left(\sum_{i=1}^{4} x_i\right)^2 = (x_1 + x_2 + x_3 + x_4)^2 = (3 + 4 + 8 - 2)^2 = (13)^2 = 169$

c. $\sum_{i=1}^{2} x_i y_i = x_1 y_1 + x_2 y_2 = (3)(-6) + (4)(-1) = -18 - 4 = -22$

d. $\sum_{i=3}^{4} (3x_i + y_i) = (3x_3 + y_3) + (3x_4 + y_4) = ((3)(8) + 5) + ((3)(-2) + 0) = 29 - 6 = 23$
or
$\sum_{i=3}^{4} (3x_i + y_i) = 3\sum_{i=3}^{4} x_i + \sum_{i=3}^{4} y_i = 3(x_3 + x_4) + (y_3 + y_4) = 3(8 + (-2)) + (5 + 0) = 3(6) + 5 = 23$

e. $\sum_{i=1}^{3} (3x_i + 2) = (3x_1 + 2) + (3x_2 + 2) + (3x_3 + 2) = (3(3) + 2) + (3(4) + 2) + (3(8) + 2) = 11 + 14 + 26 = 51$
or
$\sum_{i=1}^{3} (3x_i + 2) = 3\sum_{i=1}^{3} x_i + 2(3) = 3(x_1 + x_2 + x_3) + 2(3) = 3(3 + 4 + 8) + 6 = 3(15) + 6 = 45 + 6 = 51$
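
The worked summations above can be reproduced with a short script. This is only an illustrative Python sketch; the variable names are our own, and the one thing to keep in mind is that Python lists start at index 0, so x_i in the notes corresponds to x[i - 1] in the code.

```python
x = [3, 4, 8, -2]    # x1, x2, x3, x4
y = [-6, -1, 5, 0]   # y1, y2, y3, y4

sum_of_squares = sum(xi ** 2 for xi in x)                 # (a) sum of x_i^2, i = 1..4
square_of_sum = sum(x) ** 2                               # (b) (sum of x_i)^2, i = 1..4
sum_xy = sum(x[i] * y[i] for i in range(0, 2))            # (c) i = 1..2
sum_3x_plus_y = sum(3 * x[i] + y[i] for i in range(2, 4)) # (d) i = 3..4
sum_3x_plus_2 = sum(3 * x[i] + 2 for i in range(0, 3))    # (e) i = 1..3

# Rules (a)-(c) of summation give the same result as (e) computed directly.
by_rules = 3 * sum(x[0:3]) + 2 * 3

print(sum_of_squares, square_of_sum, sum_xy, sum_3x_plus_y, sum_3x_plus_2, by_rules)
```

Note that (a) and (b) differ: the sum of the squares is not the same as the square of the sum.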

Exercise 6

Given x1 = 3, x2 = 4, x3 = 8, x4 = -2, y1 = -6, y2 = 1, y3 = 5 and y4 = 0, find the value of the following.
(In items 4, 5 and 7, the symbol “?” marks an operator, + or −, that is illegible in the source.)

1. $\sum_{i=1}^{4} x_i^2$

2. $\left(\sum_{i=1}^{4} x_i\right)^2$

3. $\sum_{i=1}^{3} x_i y_i$

4. $\sum_{i=1}^{3} (3x_i \;?\; y_i)$

5. $\sum_{i=1}^{4} (2x_i \;?\; y_i)^2$

6. $\left(\sum_{i=1}^{4} x_i^2\right)\left(\sum_{i=1}^{4} y_i\right)$

7. $\sum_{i=1}^{3} (5x_i \;?\; 10)$

8. $\left(\sum_{i=2}^{4} x_i\right)\left(\sum_{i=2}^{4} y_i\right)$

9. $\left(\sum_{i=1}^{4} y_i\right) + 5$

VIII. Statistical Description of Data

The previous discussion showed how one can gain useful information from raw data by organizing it into frequency
distribution, then presenting the data by using various graphs. This chapter shows other statistical methods that can be used
to summarize the data.

In this section, we will examine different statistical measures that are computed when given a set of data. Some of
these measures are applicable for both numerical and non-numerical data (categorical data) but many of these are applicable
only to numerical data. Recall that statistical measures can be computed from a sample or from a population. A measure computed
from the whole population is called a parameter, while a measure computed from a sample is referred to as a statistic.

A statistic is a characteristic or measure obtained by using the data values from a sample.
A parameter is a characteristic or measure obtained by using all the data values for a specific population.

Computing Statistical Measures of Data

Measures of Central Location (Measures of Average)


The measures of central location describe the center or middle part of a group of data. Here, we will consider the
mean, median, mode and weighted mean.

A. The Arithmetic Mean


The mean is the sum of the values divided by the total number of values. This is commonly called the average in
layman’s term. In statistics, all measures of center are also called average. For a sample, the mean is denoted by 𝑥̅ and this
statistic is computed as:

$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{x_1 + x_2 + x_3 + \dots + x_n}{n}$, where n represents the total number of values in the sample.

While for a population, the mean is denoted by the Greek letter $\mu$ (read as “mu”) and the parameter is given as:

$\mu = \frac{\sum_{i=1}^{N} x_i}{N} = \frac{x_1 + x_2 + x_3 + \dots + x_N}{N}$, where N represents the total number of values in the population.

Examples:
(1) The ages in weeks of all kittens at an animal shelter are 3, 8, 5, 12, 14 and 12. Find the mean.

Solution:
$\mu = \frac{\sum_{i=1}^{N} x_i}{N} = \frac{3 + 8 + 5 + 12 + 14 + 12}{6} = \frac{54}{6} = 9$ weeks.

Thus, the mean age of the kittens is 9 weeks.

(2) The fat contents in grams for one serving of a sample of 11 brands of packaged foods, as determined by the U.S.
Department of Agriculture, are given as follows: 6.5, 6.5, 9.5, 8.0, 14.0, 8.5, 3.0, 7.5, 16.5, 7.0, 8.0. Find the
mean.
Solution:
Since the data come from a sample, we use the sample notation $\bar{x}$ and n:

$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{6.5 + 6.5 + 9.5 + 8.0 + 14.0 + 8.5 + 3.0 + 7.5 + 16.5 + 7.0 + 8.0}{11} = \frac{95}{11} \approx 8.64$ grams

Thus, the mean fat content for one serving of the 11 brands of packaged foods is 8.64 grams.
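
As a cross-check of the two examples, the mean can be computed in a few lines of Python. This is a minimal sketch; the function name mean and the variable names are our own choices for illustration.

```python
def mean(values):
    """Arithmetic mean: the sum of the values divided by the number of values."""
    return sum(values) / len(values)


kitten_ages = [3, 8, 5, 12, 14, 12]  # population of kittens, ages in weeks
fat_contents = [6.5, 6.5, 9.5, 8.0, 14.0, 8.5, 3.0, 7.5, 16.5, 7.0, 8.0]  # sample, grams

print(mean(kitten_ages))             # 9.0
print(round(mean(fat_contents), 2))  # 8.64
```

The same function serves for a population or a sample; only the notation ($\mu$ versus $\bar{x}$) differs.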

Properties of the Mean:


(1) It is unique, meaning it has only one value.
(2) It can be computed for numerical data only, that is interval or ratio level data.
(3) It is easily affected by extreme values in the data. Thus, one should be cautious in using the mean when there are
extreme observations or outliers. If the outlier is extremely low, it pulls down the value of the mean. If the outlier is
a very big value, it magnifies the mean. If the mean is greatly affected, then our summary description of the data is
distorted.

Example: Suppose we change two of the values of the data set in Example (2) above. That is, suppose we have
6.5, 6.5, 9.5, 8.0, 14.0, 8.5, 3.0, 7.5, 0.1, 7.0, 0.2.
Then the mean is

$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{6.5 + 6.5 + 9.5 + 8.0 + 14.0 + 8.5 + 3.0 + 7.5 + 0.1 + 7.0 + 0.2}{11} = \frac{70.8}{11} \approx 6.44$ grams

So, in this case the outliers are so low that they pull the value of the mean down from 8.64 to 6.44 grams.

Suppose we change the values again, to 6.5, 6.5, 9.5, 8.0, 14.0, 8.5, 3.0, 7.5, 19.5, 7.0, 31.0.
The mean of this new set of values is

$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{121}{11} = 11$ grams.

In this case the outliers are so high that they pull the value of the mean up from 8.64 to 11 grams.

B. The Median
When the data are arranged in increasing or decreasing order, the median is the point halfway (or middle value) in
a data set. Meaning, the median is the data point which divides the distribution into two equal parts.

 Steps in Computing the Median from a Set of Data:


Step 1. Arrange the data in increasing (or decreasing) order of the magnitude.
Step 2. Select the middle point.

Case 1: When the number of observations is odd, there is only one middle value and this is the median. This value
is located at the $\left(\frac{N+1}{2}\right)$th data point. The median is denoted by $\tilde{X}$ for a sample and $\tilde{\mu}$ for a population.

Parameter: $\tilde{\mu} = X_{\frac{N+1}{2}}$        Statistic: $\tilde{X} = X_{\frac{n+1}{2}}$

Example: The weights (in pounds) of a sample of seven army recruits are 180, 201, 220, 191, 219, 209 and 186.
Find the median.
Solution:
Step 1. Arrange the data in increasing order: 180, 186, 191, 201, 209, 219, 220
Step 2. Since there are seven (7) observations, the position of the middle value is the
$\frac{7+1}{2} = \frac{8}{2}$ = 4th data point, which is the weight 201 pounds. That is, $\tilde{X} = 201$ pounds.

Case 2: When the number of observations is even, there are two middle values. The median is the mean or average
of the two middle values, which are located at positions $\frac{N}{2}$ and $\frac{N}{2} + 1$.

Parameter: $\tilde{\mu} = \frac{X_{\frac{N}{2}} + X_{\frac{N}{2}+1}}{2}$        Statistic: $\tilde{X} = \frac{X_{\frac{n}{2}} + X_{\frac{n}{2}+1}}{2}$

Example:
The ages of a sample of 10 college students are 18, 24, 20, 35, 19, 23, 26, 23, 19, 20. Find the median.
Solution:
Step 1. Arrange the data in increasing order. We have 18, 19, 19, 20, 20, 23, 23, 24, 26 and 35.
Step 2. The two (2) middle values are at the $\frac{10}{2}$ = 5th and $\frac{10}{2} + 1$ = 6th positions. Therefore,
$\tilde{X} = \frac{5\text{th observation} + 6\text{th observation}}{2} = \frac{20 + 23}{2} = \frac{43}{2} = 21.5$. Therefore, the median age is 21.5
years.
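
The two cases (odd and even number of observations) can be sketched in Python as follows. This is a minimal illustration with a function name of our own choosing; positions in the notes count from 1, while Python indexes from 0.

```python
def median(values):
    """Median following the two cases above (odd or even number of observations)."""
    ordered = sorted(values)        # Step 1: arrange in increasing order
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd n: the (n+1)/2-th value, which is index mid when counting from 0.
        return ordered[mid]
    # Even n: average of the (n/2)-th and (n/2 + 1)-th values.
    return (ordered[mid - 1] + ordered[mid]) / 2


recruit_weights = [180, 201, 220, 191, 219, 209, 186]
student_ages = [18, 24, 20, 35, 19, 23, 26, 23, 19, 20]

print(median(recruit_weights))  # 201
print(median(student_ages))     # 21.5
```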

The median is a good alternative measure of the center when there are extreme values. It is easy to compute if there
are few observations. However, if we have a large set of data, the use of computers is essential in arranging these data.

Properties of Median:
(a) It is unique (for numerical data).
(b) It can be computed for ordinal, interval or ratio level data.
(c) It is not affected by extreme values since the median uses only the middle value/s.

C. The Mode
The third measure of average is called the mode. The mode is the value that occurs most often in a data set. In this sense,
the mode is the most typical value. A data set can have more than one mode or no mode at all. The sample mode is
denoted by $\hat{X}$ while the parameter is denoted by $\hat{\mu}$.

Examples:
(a) The following data represent the duration (in days) of US space shuttle voyages for the years 1992-94 (Source:
The Universal Almanac 1995, p. 563). Find the mode.
8, 9, 9, 14, 8, 8, 10, 7, 6, 9, 7, 8, 10, 14, 11, 8, 14, 11
Answer:
It is helpful to arrange the data in order, although it is not necessary.
6, 7, 7, 8, 8, 8, 8, 8, 9, 9, 9, 10, 10, 11, 11, 14, 14, 14
Since 8-day voyages occurred five times—a frequency larger than any other number—the mode for the data set is 8.

(b) Six strains of bacteria were tested to see how long they could remain alive outside their normal environment.
The time, in minutes, is recorded as 2, 3, 5, 7, 8 and 10. Find the mode.
Answer:
Since each value occurs once, there is no mode.
Note: Do not say that the mode is zero. That would be incorrect, because in some data sets zero can be an actual data value.

(c) Eleven different automobiles were tested at a speed of 15 miles per hour for stopping distances. The data, in
feet, are 15, 18, 18, 18, 20, 22, 24, 24, 24, 26 and 26. Find the mode.
Answer:
Since 18 and 24 both occur three times, the modes are 18 and 24 feet. This data set is said to be bimodal.

(d) Ten (10) students are asked for their opinion on the tuition fee increase and their responses are: in favor, in favor,
not in favor, not in favor, not in favor, neutral, in favor, in favor, not in favor and neutral. Find the mode.
Answer:
Since “in favor” and “not in favor” both occur four times, the modes are the opinions “in favor” and “not in favor”.
This data set is said to be bimodal.

(e) The heights (in inches) of six female police officer candidates are 62, 63, 63, 62, 63 and 62. Find the mode.
Answer:
Since 62 and 63 both occur with the same frequency (i.e., three times), we can say that they are the modes. This
data set is said to be bimodal.
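
The examples above can be reproduced with a short Python sketch based on the standard-library class collections.Counter. The function name modes is our own, and the rule “no mode when every value occurs only once” is the convention used in example (b); other texts may use a different convention.

```python
from collections import Counter


def modes(values):
    """Return the list of modes; an empty list means there is no mode."""
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        # Every value occurs exactly once: no mode (convention of example (b)).
        return []
    # Keep every value that reaches the highest frequency, in first-seen order.
    return [value for value, count in counts.items() if count == highest]


voyages = [8, 9, 9, 14, 8, 8, 10, 7, 6, 9, 7, 8, 10, 14, 11, 8, 14, 11]
bacteria = [2, 3, 5, 7, 8, 10]
stopping = [15, 18, 18, 18, 20, 22, 24, 24, 24, 26, 26]
opinions = ["in favor", "in favor", "not in favor", "not in favor", "not in favor",
            "neutral", "in favor", "in favor", "not in favor", "neutral"]

print(modes(voyages))   # [8]
print(modes(bacteria))  # []
print(modes(stopping))  # [18, 24]
print(modes(opinions))  # ['in favor', 'not in favor']
```

Because the mode only counts occurrences, the same function works for non-numerical (nominal) data such as the opinions in example (d).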

Properties of the Mode:


(a) It can be computed for any type of data whether it is nominal, ordinal, interval, or ratio level data.
(b) It may not be unique, since a data set can have more than one mode, as in example (d) above.
(c) It may not exist, as in example (b) above.

D. The Weighted Mean


Sometimes, one must find the mean of a data set in which not all values have the same degree of importance. An example
is a data set containing the scores of a student in the quizzes, exams and assignments of a particular subject, where scores in major
exams weigh more than those in quizzes. The measure that takes these differing weights into account is called the weighted
mean, which is denoted as $\bar{x}_w$.

We find the weighted mean of a variable X by multiplying each value by its corresponding weight and dividing the
sum of the products by the sum of the weights:

$\bar{x}_w = \frac{w_1 x_1 + w_2 x_2 + \dots + w_n x_n}{w_1 + w_2 + \dots + w_n} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$, where w1, w2, …, wn are the weights and x1, x2, …, xn are the values.

Examples:
(a)Anna is a Science Scholar student in MSU. She got the following grades in her subjects last semester:

Subject Grade Units


Math 17 2.25 6
English 1 1.50 3
History 1 1.75 3
Filipino 1 2.00 3
English 3 1.75 3
Phys. Educ. 1 1.25 2

Compute the grade point average (GPA) of Anna. Will she be able to maintain her scholarship if the grade
maintenance is at least 1.75?

Solution:
Subject          Grade (xi)    Units (wi)    Units (wi) × Grade (xi)
Math 17          2.25          6             13.5
English 1        1.50          3             4.5
History 1        1.75          3             5.25
Filipino 1       2.00          3             6.0
English 3        1.75          3             5.25
Phys. Educ. 1    1.25          2             2.5
TOTAL                          20            37

$\bar{x}_w = \frac{\sum_{i=1}^{6} w_i x_i}{\sum_{i=1}^{6} w_i} = \frac{37}{20} = 1.85$
Therefore, she is not able to maintain her scholarship, since her GPA of 1.85 falls short of the required 1.75. (On this grading
scale, a numerically smaller grade is the better grade.)

(b) Suppose a survey asked a sample of 30 respondents to rate a movie on its cinematography. The rating is from 1 to 5, with
1 being the lowest. A summary of the data shows that twelve (12) gave a rating of five (5), eight (8) gave a rating of four (4),
seven (7) gave a rating of three (3), and three (3) gave a rating of two (2). Find the average rating.
Solution:
Number of respondents (wi)    Rating (xi)    Number of respondents (wi) × Rating (xi)
12                            5              60
8                             4              32
7                             3              21
3                             2              6
30                                           119

$\bar{x}_w = \frac{\sum w_i x_i}{\sum w_i} = \frac{119}{30} \approx 3.97 \approx 4$
Therefore, the average rating is about 4. It means that the movie's cinematography was rated highly, since an average
of about 4 is close to the highest possible rating of 5.
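
Both weighted-mean examples can be verified with a small Python sketch. The function name weighted_mean and the variable names are our own choices for illustration.

```python
def weighted_mean(values, weights):
    """Sum of weight * value divided by the sum of the weights."""
    total = sum(w * x for x, w in zip(values, weights))
    return total / sum(weights)


# Example (a): grades are the values, units are the weights.
grades = [2.25, 1.50, 1.75, 2.00, 1.75, 1.25]
units = [6, 3, 3, 3, 3, 2]

# Example (b): ratings are the values, numbers of respondents are the weights.
ratings = [5, 4, 3, 2]
respondents = [12, 8, 7, 3]

print(round(weighted_mean(grades, units), 2))         # 1.85
print(round(weighted_mean(ratings, respondents), 2))  # 3.97
```

When all weights are equal, the weighted mean reduces to the ordinary arithmetic mean.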

Exercises 7

For numbers 1-3, find the (a) mean, (b) median and the (c) mode.

(1) The grade point averages of 10 students who applied for financial aid are shown below:
3.62 2.54 2.81 3.97 1.85 1.93 2.63 2.50 2.80

(2) The calories per servings of 11 fruit juices are 150, 110 , 100, 35, 60, 130, 40 , 140,120, 160 and 110 (Source:
Consumer Reports, February 1995, p.79).

(3) During 1993, the major earthquakes had the following Richter magnitudes: 7.0, 6.2, 7.7, 8.0, 6.4, 6.2, 7.2, 5.4, 6.4, 6.5,
7.2 and 5.4 (Source: The Universal Almanac 1995, p.567).

(4) A student received an A in English Composition 1 (3 credits), a C in Introduction to Psychology (3 credits), a B in
Biology 1 (4 credits), and a D in Physical Education (2 credits). Assuming A = 4 grade points, B = 3 grade points, C = 2
grade points and D = 1 grade point, find the student’s grade point average.

Course            Units (wi)    Grade (xi)
Eng Comp 1        3             A (4 points)
Intro to Psych    3             C (2 points)
Biology           4             B (3 points)
Phys Ed           2             D (1 point)

(5) Suppose a student made an average score of 64% in the quizzes, 48% in the first prelim, 35% in the second prelim, and
55% in the final exam. What is the final average score of the student if the quizzes weigh 30% of the total grade, first prelim
20%, second prelim 20%, and final exam weigh 30%? Did the student pass the course if the passing cut-off score is 45%?

Measures of Dispersion or Measures of Variation


In statistics, in order to describe a data set accurately, statisticians must know more than the measures of central
tendency. Measures of dispersion or variation are measures of the degree to which numerical data are scattered or spread.
To describe the spread or variability of a data set, four measures are commonly used, namely: range, variance, standard deviation,
and coefficient of variation.

A. Range
The range is the simplest of these measures. It is the difference between the highest and the lowest values in a set of data. That is,

Range, R = highest value – lowest value

The range is considered a poor measure of dispersion in the sense that it only considers two values in its
computation. Thus, it cannot accurately determine how spread out the values are in a given data set. For example, the two sets
of data below have the same range.
A. 3 4 4 7 5 7 20
R = highest value – lowest value
R = 20 – 3 = 17
B. 3 7 11 12 15 18 20
R = highest value – lowest value
R = 20 – 3 = 17
However, data set B has more varied values than data set A.

Example:
The given data below represents the lifespan of the paints expressed in terms of months.
Brand A: 45, 60, 50, 55, 48, 56, 57
Brand B: 35, 25, 45, 28, 39, 40, 44
Find the range for each brand.
Solution:
For Brand A, the range is R = 60 – 45 = 15 months. For Brand B, the range is
R = 45 – 25 = 20 months. Therefore, by this measure, the lifespans of Brand A are less varied than those of Brand B.
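
The range is simple enough to check in one line of code. Here is a minimal Python sketch; the function name data_range is our own (chosen to avoid clashing with Python's built-in range).

```python
def data_range(values):
    """Range: highest value minus lowest value."""
    return max(values) - min(values)


brand_a = [45, 60, 50, 55, 48, 56, 57]
brand_b = [35, 25, 45, 28, 39, 40, 44]

print(data_range(brand_a))  # 15
print(data_range(brand_b))  # 20

# Data sets A and B from the discussion above share the same range
# even though their values are spread differently.
print(data_range([3, 4, 4, 7, 5, 7, 20]), data_range([3, 7, 11, 12, 15, 18, 20]))  # 17 17
```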

B. Variance
The most popular method of measuring dispersion in statistics is to measure the distance of each observation from the
mean of the observations (i.e., $x_i - \bar{x}$ or $x_i - \mu$). Since bigger samples would usually lead to a greater sum of the distances
than small samples, the result is adjusted by the sample size. One measure of dispersion computed this way is the
variance.
The variance is the average of the squares of the distances of each data value from the mean. The symbol for the
population variance is $\sigma^2$ (read as “sigma squared”) and for the sample, it is denoted as $s^2$. The definitional formula of
the variance is given below:

Definitional Formula:
Parameter: $\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$        Statistic: $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}$

You might ask, why should each term of the numerator be squared? This is because $\sum_{i=1}^{n} (x_i - \bar{x}) = 0$ and
$\sum_{i=1}^{N} (x_i - \mu) = 0$; that is, the sum of the deviations from the mean will always be zero.

The formulas for the population variance and the sample variance are almost the same except for the denominator.
The denominator of the sample variance, $s^2$, is n−1 and not n because dividing by n−1 makes the sample variance an unbiased
estimator of the population variance, whereas dividing by n tends to underestimate it. But for a large sample size n (say over 30), it makes
little difference whether we divide by n or n−1 because the results are almost the same, and both are acceptable.

The disadvantage of the definitional formula is that it could lead to serious rounding-off errors, especially when the
value of the mean is also a rounded-off value. Hence, we have the alternative formulas below, which can minimize this error.
These formulas were derived by expanding the definitional formula.

Computational Formula:
Parameter: $\sigma^2 = \frac{N \sum_{i=1}^{N} x_i^2 - \left(\sum_{i=1}^{N} x_i\right)^2}{N^2}$        Statistic: $s^2 = \frac{n \sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2}{n(n-1)}$

Example:
A comparison of coffee prices at 4 randomly selected grocery stores showed increases from the previous month of
12, 15, 17 and 20 cents for 200-gram jar. Find the variance of this random sample of price increases.

Solution:
Note that the data were collected from a random sample of 4 grocery stores. If we use the definitional formula we
have the following computations:

Mean, $\bar{x} = \frac{12 + 15 + 17 + 20}{4} = \frac{64}{4} = 16$ cents

Sample variance, $s^2 = \frac{\sum (x_i - \bar{x})^2}{n-1} = \frac{(12-16)^2 + (15-16)^2 + (17-16)^2 + (20-16)^2}{4-1} = \frac{16 + 1 + 1 + 16}{3} = \frac{34}{3} \approx 11.3$ cents².

Therefore, the average of the squared deviations of the data values from the mean (using the divisor n−1) is 11.3 cents².

If we use the computational formula, we have the following computations for the sample variance:

$n \sum x_i^2 = 4(12^2 + 15^2 + 17^2 + 20^2) = 4(144 + 225 + 289 + 400) = 4(1058) = 4232$

$\left(\sum x_i\right)^2 = (12 + 15 + 17 + 20)^2 = (64)^2 = 4096$

Thus, $s^2 = \frac{n \sum x_i^2 - \left(\sum x_i\right)^2}{n(n-1)} = \frac{4232 - 4096}{4(3)} = \frac{136}{12} \approx 11.3$ cents².
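
The agreement between the definitional and the computational formula for the sample variance can be checked numerically. The following Python sketch uses function names of our own choosing and the coffee price data from the example.

```python
def sample_variance_definitional(values):
    """s^2 = sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)


def sample_variance_computational(values):
    """s^2 = (n * sum(x^2) - (sum x)^2) / (n * (n - 1))."""
    n = len(values)
    total = sum(values)
    total_of_squares = sum(x * x for x in values)
    return (n * total_of_squares - total ** 2) / (n * (n - 1))


increases = [12, 15, 17, 20]  # price increases in cents

print(round(sample_variance_definitional(increases), 1))   # 11.3
print(round(sample_variance_computational(increases), 1))  # 11.3

# The deviations from the mean always sum to zero, which is why they are squared.
mean = sum(increases) / len(increases)
print(sum(x - mean for x in increases))  # 0.0
```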

Remarks:
• The result of the computations will be the same as long as we do not round off the intermediate computations, only the final
answer.
• The value of the variance cannot be negative.
• The unit of measurement for the variance is in square units of the original measure.

C. Standard Deviation
The standard deviation is the positive square root of the variance. It has the same unit of measurement as the
given data. It can be used to compare the variability of two or more sets of data having the same units of measurement and
approximately the same mean. It enables us to determine, with a great deal of accuracy, where the values of a distribution
are located in relation to the mean. The symbols for the population and sample standard deviations are given below:

Parameter: $\sigma = \sqrt{\sigma^2}$        Statistic: $s = \sqrt{s^2}$

Example:
Find the standard deviation of the coffee price increases in the previous example.
Solution:
$s = \sqrt{s^2} = \sqrt{11.3 \text{ cents}^2} \approx 3.36$ cents.
Therefore, the standard deviation of the coffee price increases is 3.36 cents.
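
For a quick check, Python's standard-library statistics module provides variance and stdev, which both use the sample divisor n − 1. The sketch below is illustrative only. Note that taking the square root of the unrounded variance 34/3 gives about 3.37, while taking the square root of the rounded value 11.3 gives the 3.36 shown in the solution; the small difference comes purely from rounding the variance first.

```python
import math
import statistics

increases = [12, 15, 17, 20]  # coffee price increases in cents

s2 = statistics.variance(increases)  # sample variance (divisor n - 1)
s = statistics.stdev(increases)      # sample standard deviation

print(round(s2, 2))                   # 11.33
print(round(s, 2))                    # 3.37 (from the unrounded variance)
print(round(math.sqrt(11.3), 2))      # 3.36 (from the variance rounded to 11.3)
```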

D. Coefficient of Variation
Whenever two or more samples have the same units of measure and approximately the same mean, the standard
deviation for each can be compared directly. A statistic that allows one to compare standard deviations, especially when the
units of measurement are different or the means differ considerably, is the coefficient of variation.
The coefficient of variation expresses the standard deviation as a fraction of the mean. The result is
expressed as a percentage.
Parameter: $CV = \frac{\sigma}{\mu} \times 100\%$        Statistic: $cv = \frac{s}{\bar{x}} \times 100\%$
Examples:
(a) The mean height of Math 31 students is 62 inches with a standard deviation of 2 inches. Compute the coefficient of
variation.
Solution:
$cv = \frac{s}{\bar{x}} \times 100\% = \frac{2 \text{ inches}}{62 \text{ inches}} \times 100\% \approx 3.2\%$
Therefore, the standard deviation of the heights is only 3.2% of the size of the mean.

(b) The mean of the number of cars sold over a three-month period in branches of Toyota, Incorporated is 87 and the
standard deviation is 5. The mean of the commissions is $5225 and the standard deviation is $773. Compare the variations
of the two.
Solution:
Since the units of measurement are different (number of cars versus dollars), we use the coefficient of variation to compare their relative
variability.

Cars sold: $CV = \frac{\sigma}{\mu} \times 100\% = \frac{5}{87} \times 100\% \approx 5.75\%$ for cars sold

Commission: $CV = \frac{\sigma}{\mu} \times 100\% = \frac{\$773}{\$5225} \times 100\% \approx 14.79\%$ for commissions

Since the coefficient of variation is larger for commissions, the commissions are more varied than the sales.

(c) The mean of the number of pages of a sample of women’s fitness magazines is 132, with a variance of 23; the mean of
the number of pages of a sample of men’s fitness magazines is 182, with a variance of 62. Compare the variations of pages
of the two magazines.
Solution:
Although the two groups have the same unit of measurement, their means, 132 and 182, are quite different;
hence their variations will be compared using the coefficient of variation. The standard deviations are the square roots of
the given variances: $s = \sqrt{23} \approx 4.8$ for women and $s = \sqrt{62} \approx 7.87$ for men.

women: $cv = \frac{s}{\bar{x}} \times 100\% = \frac{4.8}{132} \times 100\% \approx 3.64\%$ for women

men: $cv = \frac{s}{\bar{x}} \times 100\% = \frac{7.87}{182} \times 100\% \approx 4.32\%$ for men

The number of pages in the men’s magazines is more varied than in the women’s magazines, since the
coefficient of variation is larger for the men’s magazines.
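
The coefficient-of-variation comparisons above can be sketched in Python as follows. The function name coefficient_of_variation is our own. For example (c) the code works from the unrounded square roots of the variances, so its percentages can differ in the last digit from the hand computation, which rounds the standard deviations first; the comparison between the two groups is unaffected.

```python
import math


def coefficient_of_variation(std_dev, mean):
    """Standard deviation expressed as a percentage of the mean."""
    return std_dev / mean * 100


# Example (a): heights in inches.
print(round(coefficient_of_variation(2, 62), 1))      # 3.2

# Example (b): number of cars sold versus commissions in dollars.
print(round(coefficient_of_variation(5, 87), 2))      # 5.75
print(round(coefficient_of_variation(773, 5225), 2))  # 14.79

# Example (c): the variances are given, so take square roots first.
cv_women = coefficient_of_variation(math.sqrt(23), 132)
cv_men = coefficient_of_variation(math.sqrt(62), 182)
print(cv_men > cv_women)                              # True
```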

Finding the mean, standard deviation, and variance using the SD mode of the calculator

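[Figure: keypad of a scientific calculator, with labels pointing to the MODE button, the Shift (or 2nd F) key, the DATA (or DT or X) key, and the keys where S-SUM and S-VAR can be found. The statistical functions labeled on the keypad are n, ∑x, ∑x², ∑y, ∑y², ∑xy, x̄, σx, sx, ȳ, σy and sy.]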

Step 1. Activate the STAT or SD mode.

There are special functions the reader must know and use in finding the mean, standard deviation, and variance using
the SD mode of the calculator. They are the following:

• S-SUM, composed of the functions ∑x², ∑x and n.

• S-VAR, composed of the functions x̄, xσn and xσn-1. Take note that xσn is the standard deviation of the
population while xσn-1 is the standard deviation of a sample.

• DATA or X, the button to be pressed in order to input the data into the calculator.

Step 2. Clear the memory of the calculator by pressing Shift and MODE. Find “Scl”. Press the number that corresponds
to “Scl” and then the equal (=) sign.

Step 3. Input the data. Suppose our data are: 3, 5, 6 and 23.

Press number 3, then press DATA or X. Then n=1 appears on the screen of your calculator.
Press number 5, then press DATA or X. Then n=2 appears on the screen of your calculator.
Press number 6, then press DATA or X. Then n=3 appears on the screen of your calculator.
Press number 23, then press DATA or X. Then n=4 appears on the screen of your calculator.

If you input an incorrect number and have already pressed DATA or X, say 4 instead of 3, you should go
back to Step 2.
If you have not yet pressed DATA or X, you can simply erase and re-enter the correct data.

Step 4. Check if the number of data entered is correct by pressing Shift and S-SUM. On your screen will appear

∑x²    ∑x    n
1      2     3

Press number 3, which corresponds to n, and then press the equal (=) sign. The number that should appear for
this example is 4.

Step 5. Get the values of the mean, standard deviation, and the variance.

To find the mean, press Shift and S-VAR. On your screen will appear

x̄     xσn    xσn-1
1      2      3

Press number 1 (that is, x̄).

To find the standard deviation of the population, press Shift and S-VAR. The same menu will appear on your screen.
Press number 2 (that is, xσn). If you want to find the standard deviation for the sample, press
number 3 (that is, xσn-1). After the standard deviation appears on the screen, you can square the value by
pressing x²; the result is equal to the variance.
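
If a calculator is not at hand, the same quantities can be cross-checked with Python's standard-library statistics module. In this sketch, pstdev plays the role of the calculator's xσn (population standard deviation) and stdev plays the role of xσn-1 (sample standard deviation); the data are the four values entered in Step 3.

```python
import statistics

data = [3, 5, 6, 23]

n = len(data)                          # what Step 4 checks (should be 4)
mean = statistics.mean(data)           # x-bar
pop_sd = statistics.pstdev(data)       # corresponds to x-sigma-n
sample_sd = statistics.stdev(data)     # corresponds to x-sigma-(n-1)

print(n, mean)                                     # 4 9.25
print(round(pop_sd, 2), round(sample_sd, 2))       # 8.01 9.25
# Squaring a standard deviation gives the corresponding variance.
print(round(pop_sd ** 2, 4), round(sample_sd ** 2, 4))
```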

Exercises 8

For numbers 1 and 2, find the (a) range, (b) variance and (c) standard deviation. Assume that the data represent
samples. Check your answers for variance and standard deviation with the result using the SD mode of your calculator.

(1) Seven students were selected and asked how many hours each studied for the final exam in statistics. Their answers are
recorded here.
8 6 3 0 5 0 9

(2) Shown below are the numbers of stories in the 11 tallest buildings in St. Paul, Minnesota (Source: The World Almanac
and Book of Facts 1995, p.689).
32 36 46 20 32 18 16 34 26 27 26

(3) The numbers of goals scored by a college lacrosse team for a given season are 4, 9, 0, 1, 3, 24, 12, 3, 30, 12, 7, 13, 18, 4,
5 and 15. Treating the data as a population, compute the coefficient of variation.

(4) The weights of 10 boxes of a certain brand of cereal have a mean content of 278 grams with a standard deviation of 9.64
grams. If these boxes were purchased at 10 different stores and the average price per box is $1.29 with a standard deviation of
$0.09, can you conclude that the weights are relatively more homogeneous than the prices?

Measures of Position (Measures of Non-Central Location)


In addition to measures of central tendency and measures of variation, there are also measures of position, which locate
a value either at the center or at any other point in the distribution of the data. These measures include percentiles, deciles, quartiles
and z-scores.

A. Percentiles
The percentiles are values that divide a set of observations (arranged in increasing order) into 100 equal parts. We use $P_k$
(k = 1, 2, …, 99) to denote the kth percentile, the value such that k% of the observations fall below it.

• Steps in computing $P_k$:

(a) Arrange the data in increasing order of magnitude.
(b) Find the location of the kth percentile by computing $L = \frac{nk}{100}$.
(c) If L is an integer, then the desired value is the average of the Lth and (L+1)th observations. If L is not an integer,
round L up to the next integer. The desired value is the observation located at the rounded-up value of L.

Example:
The numbers of movies attended last month by a random sample of 12 students are recorded as follows: 2, 0, 3, 1, 6, 4,
7, 5, 8, 9, 10 and 11. Find the following:

• $P_{48}$
Arrange the data in increasing order. That is, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11.
$L = \frac{48}{100} \times 12 = 5.76$, and since this is not a whole number we round it up to 6. Then $P_{48}$ = 6th observation = 5.
Therefore, about 48% of the observations fall below 5.

• $P_{75}$
$L = \frac{75}{100} \times 12 = 9$, and since this is a whole number,
$P_{75} = \frac{9\text{th observation} + 10\text{th observation}}{2} = \frac{8 + 9}{2} = \frac{17}{2} = 8.5$. Therefore, 75% of the observations fall below 8.5.

B. Deciles
Deciles are values that divide the set of observations into 10 equal parts. The kth decile is denoted by $D_k$ (k = 1, 2, ..., 9),
the value such that (10 × k)% of the observations fall below it.

Example:
(a) From the example above, find $D_3$.
$L = \frac{3}{10} \times 12 = 3.6$, and since this is not a whole number we round it up to 4. Then $D_3$ = 4th observation = 3. Therefore,
we can say that about 30% of the observations fall below 3.

(b) A teacher gives a 20-point test to 10 students. The scores are 18, 15, 12, 6, 8, 2, 3, 5, 20 and 10. Find $D_8$.
Solution:
Arrange the scores in ascending order. That is, 2, 3, 5, 6, 8, 10, 12, 15, 18, 20. Then solve for $D_8$.
$L = \frac{8}{10} \times 10 = 8$. Since it is a whole number,
$D_8 = \frac{8\text{th observation} + 9\text{th observation}}{2} = \frac{15 + 18}{2} = \frac{33}{2} = 16.5$. Therefore, 80% of the observations fall below 16.5.

C. Quartiles
Quartiles are the values that divide the set of observations into 4 equal parts. The kth quartile is denoted by $Q_k$ (k = 1, 2, 3),
the value such that (25 × k)% of the observations fall below it.

Example:
a. Refer to example (b) under deciles above, and find $Q_1$.
$L = \frac{1}{4} \times 10 = 2.5$, and since this is not a whole number it is rounded up to the next whole number, which is 3. So,
$Q_1$ = 3rd observation = 5. Therefore, we can say that about 25% of the observations fall below 5.

b. Find $Q_3$ for the test scores 5, 12, 15, 16, 20 and 21.
Solution:
$L = \frac{3}{4} \times 6 = 4.5$, and since this is not a whole number we round it up to 5. Then $Q_3$ = 5th observation = 20. Therefore,
about 75% of the observations fall below 20.
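
The location rule used for percentiles, deciles and quartiles can be written as one small function. The Python sketch below is our own illustration: the name fractile and its parameter parts (100 for percentiles, 10 for deciles, 4 for quartiles) are assumptions made for this example. It follows the rule stated in these notes; statistical software often uses other interpolation conventions, so library results may differ slightly.

```python
def fractile(values, k, parts):
    """k-th fractile when the data are divided into `parts` equal parts.

    Location rule: L = n * k / parts. If L is a whole number, average the
    L-th and (L+1)-th observations; otherwise round L up and take that observation.
    Assumes 1 <= k <= parts - 1.
    """
    ordered = sorted(values)
    n = len(ordered)
    whole, remainder = divmod(n * k, parts)  # integer arithmetic avoids float issues
    if remainder == 0:
        # L is a whole number: positions L and L + 1 are indexes whole - 1 and whole.
        return (ordered[whole - 1] + ordered[whole]) / 2
    # L is not a whole number: rounded-up position whole + 1 is index whole.
    return ordered[whole]


movies = [2, 0, 3, 1, 6, 4, 7, 5, 8, 9, 10, 11]
scores = [18, 15, 12, 6, 8, 2, 3, 5, 20, 10]

print(fractile(movies, 48, 100))  # P48 -> 5
print(fractile(movies, 75, 100))  # P75 -> 8.5
print(fractile(movies, 3, 10))    # D3  -> 3
print(fractile(scores, 8, 10))    # D8  -> 16.5
print(fractile(scores, 1, 4))     # Q1  -> 5
print(fractile([5, 12, 15, 16, 20, 21], 3, 4))  # Q3 -> 20
```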

There is an old saying that states, “You can’t compare apples and oranges.” But with the use of statistics, it can be
done to some extent. Suppose that a student scored 90 on a music test and 45 on an English exam. Direct comparison of raw
scores is impossible, since the exams are not equivalent in terms of number of questions, value of each question, and so on.
However, a comparison of a relative standard similar to both things can be made. This comparison uses the mean and the
standard deviation and is called a z-score (standard score).
z-score or Standard Score

The z-score represents the number of standard deviations that a data value falls above or below the mean. A standard score or
z-score for a value is obtained by subtracting the mean from the value and dividing the result by the standard deviation. The
symbol for the standard score is z. That is,
$z = \frac{\text{value} - \text{mean}}{\text{standard deviation}} = \frac{x - \mu}{\sigma}$
Examples:
(a) A student scored 65 on a calculus test that had a mean of 50 and a standard deviation of 10; she scored 30 on a history test with a mean of 25 and a standard deviation of 5. Compare her relative positions on the two tests.

Solution:
First, find the z scores. For calculus, the z score is
$$z = \frac{x - \mu}{\sigma} = \frac{65 - 50}{10} = 1.5.$$
For history, the z score is
$$z = \frac{30 - 25}{5} = 1.0.$$
Since the z score for calculus is larger, her relative position in the calculus class is higher than her relative position in the history class.

(b) An aptitude test has a mean of 220 and a standard deviation of 10. Solve for the z-score if the test score is 218.
Solution:
$$z = \frac{x - \mu}{\sigma} = \frac{218 - 220}{10} = -0.2.$$
Since the z-score is negative when the test score is 218, we can say that this score is below average.
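As a quick check of the two examples above, the z-score formula can be written as a one-line function. The name `z_score` is an illustrative choice, not something defined in the course material.

```python
def z_score(value, mean, std_dev):
    """Number of standard deviations `value` lies above (+) or below (-) the mean."""
    return (value - mean) / std_dev


calculus = z_score(65, 50, 10)    # example (a), calculus test
history = z_score(30, 25, 5)      # example (a), history test
aptitude = z_score(218, 220, 10)  # example (b)
print(calculus, history, aptitude)
```

A larger z score means a higher relative position, so the calculus score ranks higher than the history score.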

Exercises 9

(1) Given the weights in pounds: 78, 82, 86, 88, 92, 97, 85, 90, 98, 79, 89, 95 and 99, what value corresponds to the 69th percentile?

(2) For the given test scores 12, 28, 35, 42, 47, 49 and 50, what value corresponds to the 46th percentile?

(3) In the given data set in number 1, what value corresponds to the 4th quartile?

(4) In the given data set in number 2, what value corresponds to the 8th decile?

(5) A student scores 60 on a mathematics test that has a mean of 54 and a standard deviation of 3, and she scores 80 on a
history test with a mean of 75 and a standard deviation of 2. On which test did she do better than the rest of the class?

(6) If an IQ test has mean of 100 and a standard deviation of 15, find the corresponding z-score for each IQ.
a. 115 b. 122 c. 93 d. 100 e. 85

(7) Suppose 8 samples from a car race are given by acceleration in meters per second squared (m/s²): -50, -34, -78, 20, 67, -40, 48 and 65. Compute the following measures of position:
a. P20 b. P58 c. D5 d. D7 e. Q2

(8) Suppose two tests, A and B, of the effectiveness of medicine are given below. Find the z- score for each test and state
which has the higher effect on the patient.

Test A    x = 38    x̄ = 40     s = 5
Test B    x = 94    x̄ = 100    s = 10

(10) A final examination for a psychology course has a mean of 84 and a standard deviation of 4. Find the corresponding z-
score for each raw score.
a. 87 b. 79 c.93 d.76 e.82

Computing Measures for Grouped Data

For data summarized in a frequency distribution table, the individual observations are unknown, so we need a different way of computing the statistical measures. Each observation in a class is estimated by its class mark. We use this approach only if the raw data are not available. If they are available, we compute the statistical measures using the formulas discussed previously.

A. Mean
The procedure for finding the mean for grouped data is similar to that for ungrouped data, except that the midpoints of the classes (the class marks) are used for the x values. The formula for the mean of grouped data is

$$\mu = \frac{\sum_{i=1}^{k} x_i f_i}{\sum_{i=1}^{k} f_i}$$
where $x_i$ = class mark of the ith class and $f_i$ = frequency of the ith class.

Example:
More and more employers are using psychological testing as an aid in determining whether the applicant is fit for
the work in the company. The following data shows the distribution of the scores of a group of applicants who took the
psychological test administered by a company.
Score Class    Frequency
41-50          5
51-60          7
61-70          10
71-80          16
81-90          11
91-100         9
Total          58
Estimate the mean score.

Solution:
First, we compute the class mark for each class and then multiply them by the corresponding frequency as shown
below.

Scores Class    Frequency (f_i)    Class Mark (x_i)    x_i f_i
41-50           5                  45.5                227.5
51-60           7                  55.5                388.5
61-70           10                 65.5                655
71-80           16                 75.5                1208
81-90           11                 85.5                940.5
91-100          9                  95.5                859.5
Total           58                                     4279

Then the mean is
$$\mu = \frac{\sum_{i=1}^{6} x_i f_i}{\sum_{i=1}^{6} f_i} = \frac{4279}{58} = 73.78.$$
Therefore, the mean score of the applicants is 73.78.
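The grouped-mean computation above can be reproduced with a short sketch. The function name `grouped_mean` and the representation of each class as a (lower limit, upper limit, frequency) triple are assumptions made for this illustration.

```python
def grouped_mean(classes):
    """Estimate the mean from (lower limit, upper limit, frequency) triples."""
    total_xf = 0.0
    total_f = 0
    for lower, upper, freq in classes:
        class_mark = (lower + upper) / 2   # midpoint of the class
        total_xf += class_mark * freq      # accumulates the sum of x_i * f_i
        total_f += freq                    # accumulates the sum of f_i
    return total_xf / total_f


score_classes = [(41, 50, 5), (51, 60, 7), (61, 70, 10),
                 (71, 80, 16), (81, 90, 11), (91, 100, 9)]
mean_score = grouped_mean(score_classes)
print(round(mean_score, 2))  # 73.78
```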

B. Median:
$$\tilde{x} = LCB_m + c\left[\frac{\frac{n}{2} - F_m}{f_m}\right]$$
where $LCB_m$ = lower class boundary of the median class
$c$ = class size (width of the median class)
$n$ = total number of observations
$F_m$ = total number of observations of the classes before the median class
$f_m$ = frequency of the median class
(the median class is the class where the middle observation belongs)

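Example: Compute the median for the data of the above example.
Solution. $\frac{n}{2} = \frac{58}{2} = 29$. The 29th observation is located in the 4th class (71-80), whose lower class boundary is 70.5. Therefore,

$$\text{Median} = 70.5 + 10\left[\frac{29 - (5 + 7 + 10)}{16}\right] = 70.5 + 10\left(\frac{7}{16}\right) = 70.5 + 4.375 = 74.875$$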
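The median formula for grouped data can be sketched as follows. The function name `grouped_median` and the `gap` parameter (the distance between one class's upper limit and the next class's lower limit, which is 1 for whole-number class limits) are assumptions introduced for this illustration.

```python
def grouped_median(classes, gap=1):
    """Median for grouped data.

    classes: (lower limit, upper limit, frequency) triples in ascending order.
    gap: distance between an upper limit and the next lower limit.
    """
    n = sum(freq for _, _, freq in classes)
    half = n / 2
    cumulative = 0                       # F_m: observations before the current class
    for lower, upper, freq in classes:
        if cumulative + freq >= half:    # first class that reaches the middle observation
            lcb = lower - gap / 2        # lower class boundary
            width = (upper + gap / 2) - lcb
            return lcb + width * (half - cumulative) / freq
        cumulative += freq
    raise ValueError("classes must contain at least one observation")


score_classes = [(41, 50, 5), (51, 60, 7), (61, 70, 10),
                 (71, 80, 16), (81, 90, 11), (91, 100, 9)]
median_score = grouped_median(score_classes)
print(median_score)  # 74.875
```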

C. Variance:

Parameter: $$\sigma^2 = \frac{N\sum_{i=1}^{k} x_i^2 f_i - \left(\sum_{i=1}^{k} x_i f_i\right)^2}{N^2}$$

Statistic: $$s^2 = \frac{n\sum_{i=1}^{k} x_i^2 f_i - \left(\sum_{i=1}^{k} x_i f_i\right)^2}{n(n-1)}$$

Examples:
(a) Compute the sample variance for the data:

Score Class    Frequency
41-50          5
51-60          7
61-70          10
71-80          16
81-90          11
91-100         9
Total          58

Solution:
Scores Class    Frequency (f_i)    Class mark (x_i)    x_i f_i    x_i²       x_i² f_i
41-50           5                  45.5                227.5      2070.25    10351.25
51-60           7                  55.5                388.5      3080.25    21561.75
61-70           10                 65.5                655        4290.25    42902.5
71-80           16                 75.5                1208       5700.25    91204.0
81-90           11                 85.5                940.5      7310.25    80412.75
91-100          9                  95.5                859.5      9120.25    82082.25
Total           58                                     4,279                 328,514.5

Then
$$s^2 = \frac{n\sum_{i=1}^{k} x_i^2 f_i - \left(\sum_{i=1}^{k} x_i f_i\right)^2}{n(n-1)} = \frac{58(328{,}514.5) - (4279)^2}{58(57)} = \frac{19{,}053{,}841 - 18{,}309{,}841}{3306} = \frac{744{,}000}{3306} = 225.05$$

(b) This frequency distribution represents the data obtained from a sample of word-processor repairers. The values are the days between service calls on 80 machines.
Class boundaries (in days)    Frequency
25.5-28.5                     5
28.5-31.5                     9
31.5-34.5                     32
34.5-37.5                     20
37.5-40.5                     12
40.5-43.5                     2

Solution:
Class boundaries    Frequency (f_i)    Class mark (x_i)    x_i f_i    x_i²    x_i² f_i
25.5-28.5           5                  27                  135        729     3645
28.5-31.5           9                  30                  270        900     8100
31.5-34.5           32                 33                  1056       1089    34848
34.5-37.5           20                 36                  720        1296    25920
37.5-40.5           12                 39                  468        1521    18252
40.5-43.5           2                  42                  84         1764    3528
Total               80                                     2,733              94,293

Then
$$s^2 = \frac{n\sum_{i=1}^{k} x_i^2 f_i - \left(\sum_{i=1}^{k} x_i f_i\right)^2}{n(n-1)} = \frac{80(94{,}293) - (2{,}733)^2}{80(79)} = \frac{7{,}543{,}440 - 7{,}469{,}289}{6320} = \frac{74{,}151}{6320} = 11.73$$
Therefore, the variance is approximately 11.73 (in squared days).

D. Standard Deviation:
Parameter: $\sigma = \sqrt{\sigma^2}$        Statistic: $s = \sqrt{s^2}$

Examples:
(a) Compute the standard deviation in example (a) regarding psychological testing.

Solution: $s = \sqrt{s^2} = \sqrt{225.05} = 15.002$. Therefore, the standard deviation of the applicants’ scores on the psychological test is 15.002.

(b) Compute the standard deviation in example (b) regarding the word-processor repairers.

Solution:
$s = \sqrt{s^2} = \sqrt{11.73} = 3.425$. Therefore, the standard deviation of the number of days between service calls is 3.425.

E. Range
The range can be measured by getting the difference between the highest class boundary and lowest class boundary,
that is,
range R= highest class boundary - lowest class boundary
Examples:
(a) Compute the range in example (a) regarding psychological testing.
Solution:
The highest class limit is 100. Since we need the class boundary, we add 0.5 to 100, which gives 100.5. For the lowest class boundary, we subtract 0.5 from 41, which gives 40.5. Hence, R = 100.5 - 40.5 = 60.0 is the range of the scores in the psychological test.

(b) Compute the range in example (b) regarding the word-processor repairers.
Solution:
The highest class boundary is already given, and that is 43.5. The lowest class boundary is 25.5. Therefore, R = 43.5 - 25.5 = 18 is the range of the number of days between service calls.

Measures of Symmetry and Skewness

A frequency histogram helps us evaluate the shape of the distribution of the data. An analysis of the shape is important because it tells us about the behavior of the data (for example, where the data are concentrated). However, looking at the histogram alone is not enough. In this section, we discuss some measures that describe the shape of the distribution.
The three most popular shapes of the frequency distribution are:

(a) Symmetric Distribution


In symmetric form, the data values are similarly distributed on both sides of the mean. It is usually in bell-shaped
form and could also be rectangular. Symmetrical distribution has its mean equal to median. And if it is unimodal, then the
mean, median and mode are equal, which are located at the center of the distribution.

[Figure: symmetric distributions. Bell-shaped (unimodal) curve: mean = median = mode. Rectangular curve: mean = median.]

(b) Skewed to the left (negatively skewed)


It has a longer left tail. The mean is less than the median because the mean is pulled down by the low values in the left tail. Skewness refers to the tail of the distribution.

[Figure: negatively skewed distribution, with the mean to the left of the median.]
Example:
Suppose the majority of the students scored high on an examination while the rest have scores scattered on the left side. These scores will tend to cluster on the right of the distribution.

(c) Skewed to the right (positively skewed)


It has a long right tail. The mean is greater than the median because the value of the mean is pulled up due to the
presence of high values at the right.

[Figure: positively skewed distribution, with the median to the left of the mean.]
Examples:
(a) If an instructor gave an examination and most of the students did poorly, their scores would tend to cluster on the left
side of the distribution. A few high scores would constitute the tail of the distribution, which would be on the right side.

(b) The income distribution of Filipino households. Most incomes cluster on the left side of the distribution; those with high incomes are only a minority and are in the right tail of the distribution.

There is a measure that can help determine the degree of skewness of a distribution, called the Pearson Coefficient of Skewness. The formula is
$$Sk = \frac{3(\text{mean} - \text{median})}{\text{standard deviation}}$$

In general, the value of Sk is between -3 and 3. If Sk is negative, the distribution is skewed to the left. If positive, it is skewed to the right. If zero, it is symmetric. If the value of Sk is near zero, we can say it is approximately symmetric.

Example:
The following scores represent the final examination grade for an elementary statistics course:

23 32 52 52 57
60 60 64 70 74
74 79 82 85 86

Describe the shape of the distribution of the final examination grades by computing the coefficient of skewness.

Solution:
Using a scientific calculator,

Mean = 63.33

Standard deviation = 18.47

Median: $\frac{15}{2} = 7.5$, which is rounded up to the 8th observation, so the median is 64.

Thus the coefficient of skewness is

$$Sk = \frac{3(\text{mean} - \text{median})}{\text{standard deviation}} = \frac{3(63.33 - 64)}{18.47} = -0.11$$
[slightly skewed to the left, or approximately symmetric, since Sk is near zero].
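The coefficient of skewness for the final examination grades can also be computed with Python's standard `statistics` module. This is a sketch only; the function name `pearson_skewness` is an assumption, and `statistics.stdev` is used on the assumption that the calculator value above is the sample standard deviation (divisor n - 1).

```python
import statistics


def pearson_skewness(data):
    """Pearson coefficient of skewness: 3 * (mean - median) / standard deviation."""
    mean = statistics.mean(data)
    median = statistics.median(data)
    std_dev = statistics.stdev(data)   # sample standard deviation (divides by n - 1)
    return 3 * (mean - median) / std_dev


grades = [23, 32, 52, 52, 57,
          60, 60, 64, 70, 74,
          74, 79, 82, 85, 86]
sk = pearson_skewness(grades)
print(round(statistics.mean(grades), 2), statistics.median(grades),
      round(statistics.stdev(grades), 2), round(sk, 2))
```

A negative value close to zero indicates a distribution that is slightly skewed to the left.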

Exercise 10
(1) The data below corresponds to the frequency distribution of battery lives:

Class Intervals (Battery Life in years)    Frequency
1.5 – 1.9                                  2
2.0 – 2.4                                  1
2.5 – 2.9                                  4
3.0 – 3.4                                  15
3.5 – 3.9                                  10
4.0 – 4.4                                  5
4.5 – 4.9                                  3

Compute the mean, median, sample variance and standard deviation.

(2) The fuel capacity in gallons of 50 randomly selected cars is shown below (Source: Consumer Reports, April 1995,
pp.274-279).
Class Frequency
(Fuel capacity in gallon)
10-12 6
13-15 4
16-18 14
19-21 15
22-24 8
25-27 2
28-30 1

Find each of the following.


a. Mean b. Variance c. Standard deviation

(3) Eighty randomly selected light bulbs were tested to determine their lifetimes (in hours). The following frequency distribution was obtained. Find the variance and standard deviation.

Class boundaries (lifetimes of light bulbs)    Frequency


52.5-63.5 6
63.5-74.5 12
74.5-85.5 25
85.5-96.5 18
96.5-107.5 14
107.5-118.5 5

(4) Given the hypothetical frequency distribution of monthly income of 80 wage earners below:
Monthly income Frequency
Below 5,000 15
5,000 – 9,999 22
10,000 – 14,999 16
15,000 – 19,999 18
20,000 above 9
Compute the following:
a. mean b. median c. standard deviation d. Sk and interpret

(5) Shown below is a frequency distribution for the number of inches of rain received in 1 year in 25 selected cities in the United States.
Number of Inches (class boundaries)    Frequency
5.5 – 20.5 2
20.5 – 35.5 3
35.5 – 50.5 8
50.5 – 65.5 6
65.5 – 80.5 3
80.5 – 95.5 3

Find each of the following:


a. mean b. median class c. variance d. coefficient of variation

IX. Measures of Correlation

This area involves determining whether a relationship exists between two variables. For example, a business person
may want to know whether the volume of sales in a month is related to the amount of advertising the firm spent that month.
Educators may be interested in determining whether the number of hours a student spent on reviewing her lesson is related
to the student’s grade. These are only a few of the many questions that can be answered by using the techniques of
correlation.

A correlation analysis usually involves:


i.) determining the pattern of the relationship
ii.) measuring the strength of the relationship by computing its correlation coefficient, and
iii.) conducting a test of hypothesis on the relationship of the variables
The third part of correlation will be taken in the last part of the semester because it is under statistical inference.

A correlation analysis involves at least two variables. One is called the independent variable and the other is called
the dependent variable.

(1) Independent Variable is the variable that is supposed to have caused the change
(2) Dependent Variable is the outcome variable which is supposed to be affected by the independent variable.

There are also cases in which we can interchange the role of independent and dependent variables. For example, in
a correlation study between openness of communication and the degree of love and respect in a relationship, the relationship
of these two variables can be a two-way process, thus the role of the two variables can be interchanged.

Although we define the independent and dependent variables this way, a correlation study does not always imply a cause-and-effect relationship, because other factors can interfere with the relationship. We can always compute the correlation coefficient for any two variables, even ones that seem unrelated. For example, we can compute the correlation coefficient between height and IQ, but we know that height has nothing to do with IQ. Thus, correlation analysis should only be done when you have a reason to believe that the two variables are related.

In this chapter we discuss some measures of correlation, also known as correlation coefficients. Correlation
coefficient will measure the extent of association between two variables.

A. Measure of Correlation for Categorical Data

Only one type of correlation coefficient will be discussed in this section, the Cramer’s V coefficient. The value of V is
between 0 and 1. A value of 1 indicates that there is perfect relationship between the variables that is you can predict,
without errors, the value of the dependent variable when you are given value of the independent variable; and 0 means there
is no relationship. If we get a value near 0, then we can say that the relationship of two variables is weak. If we get a value
near 1, then we can say that the relationship of the two variables is strong.

FORMULA:
$$\text{Cramer’s } V = \sqrt{\frac{X^2 / n}{\min\{(r-1),\,(c-1)\}}}$$

where $X^2 = \sum_i \sum_j \frac{(o_{ij} - e_{ij})^2}{e_{ij}}$ (the chi-square value)
$o_{ij}$ – observed frequency in the (i,j)th cell
$e_{ij}$ – expected frequency in the (i,j)th cell = $\frac{(\text{row } i \text{ total}) \times (\text{column } j \text{ total})}{\text{grand total}}$
n – number of observations
r – number of rows
c – number of columns
min{(r − 1), (c − 1)} means we choose the smaller value between r − 1 and c − 1.

Sample table:

                                        Factors
Variables         Factor 1               Factor 2               …    Factor n               Row i total
Variable 1        O11                    O12                    …    O1n                    O11 + O12 + … + O1n
Variable 2        O21                    O22                    …    O2n                    O21 + O22 + … + O2n
.                 .                      .                      .    .                      .
.                 .                      .                      .    .                      .
.                 .                      .                      .    .                      .
Variable n        On1                    On2                    …    Onn                    On1 + On2 + … + Onn
Column j total    O11 + O21 + … + On1    O12 + O22 + … + On2    …    O1n + O2n + … + Onn    GRAND TOTAL

Steps in Computing Cramer’s V

(a) Get the expected frequency in the (i,j)th cell, that is,
$$e_{ij} = \frac{(\text{row } i \text{ total}) \times (\text{column } j \text{ total})}{\text{grand total}}$$

(b) Determine the chi-square (X²) value. The formula is
$$X^2 = \frac{(O_{11} - e_{11})^2}{e_{11}} + \frac{(O_{12} - e_{12})^2}{e_{12}} + \cdots + \frac{(O_{nn} - e_{nn})^2}{e_{nn}}$$

(c) Get the minimum value between (number of rows − 1) and (number of columns − 1), that is, min{(r − 1), (c − 1)}.

(d) Substitute the values of X², the grand total n, and the minimum value between (r − 1) and (c − 1) into the formula for Cramer’s V.

Example:
Suppose a study was conducted to determine if there is a correlation between smoking status and the presence or
absence of cervical cancer. A survey was conducted on 656 women, and they were classified as having cancer and without
cancer and whether they are smokers or non-smokers. The results are summarized in the two-way frequency table below:

Smoker Non-smoker Total


With Cancer 108 117 225
No Cancer 163 268 431
Total 271 385 656

a. Determine the possible pattern of correlation.


b. Compute the Cramer’s V.
c. Interpret the results.

Answer on a. To determine the pattern of correlation, we compare the row percentage distributions (or the column percentage distributions). A big difference in the percentage distributions indicates a strong correlation. No difference indicates no correlation. Slight differences indicate a weak correlation.
The choice of row or column percentages usually follows where the independent variable is located. If it is placed along the side (as the row labels), we compute the row percentage distribution. If it is placed in the top portion (as the column labels), we compute the column percentage distribution.
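$$\text{Row \%} = \left(\frac{\text{frequency}}{\text{row total}}\right) \times 100\% \qquad \text{Column \%} = \left(\frac{\text{frequency}}{\text{column total}}\right) \times 100\%$$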

In our example, since smoking status is the independent variable and the presence of cancer is the dependent
variable, it is appropriate to compute the column percentage distribution.

Smokers Non-smokers
With Cancer (108/271) x 100% = 40% (117/385) x 100% = 30%
No Cancer 60% 70%

We observe that a higher percentage of the smokers have cancer compared to the non-smokers. But we notice that
the difference is quite small so we expect the correlation coefficient to be weak.
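The column percentage distribution for the smoking example can be sketched in a few lines. The function name `column_percentages` and the list-of-rows layout of the table are assumptions made for this illustration.

```python
def column_percentages(table):
    """Convert each observed frequency to a percentage of its column total.

    table: list of rows, each row a list of observed frequencies.
    """
    column_totals = [sum(col) for col in zip(*table)]
    return [[100 * cell / total for cell, total in zip(row, column_totals)]
            for row in table]


# Rows: With Cancer, No Cancer; columns: Smoker, Non-smoker
observed = [[108, 117],
            [163, 268]]
percentages = column_percentages(observed)
for label, row in zip(["With Cancer", "No Cancer"], percentages):
    print(label, [round(p) for p in row])
```

Each column of the result sums to 100%, so the two columns can be compared directly.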

Answer on b. We compute the strength of the correlation with the Cramer’s V correlation coefficient, using the formula:
$$\text{Cramer’s } V = \sqrt{\frac{X^2 / n}{\min\{(r-1),\,(c-1)\}}}$$
where $X^2 = \sum_i \sum_j \frac{(o_{ij} - e_{ij})^2}{e_{ij}}$ (the chi-square value)
$o_{ij}$ – observed frequency in the (i,j)th cell
$e_{ij}$ – expected frequency in the (i,j)th cell = $\frac{(\text{row } i \text{ total}) \times (\text{column } j \text{ total})}{\text{grand total}}$
n – number of observations
r – number of rows
c – number of columns
min{(r − 1), (c − 1)} means we choose the smaller value between r − 1 and c − 1.
And so,
$$e_{11} = \frac{(225)(271)}{656} = 92.9 \qquad e_{12} = \frac{(225)(385)}{656} = 132.1$$
$$e_{21} = \frac{(431)(271)}{656} = 178.1 \qquad e_{22} = \frac{(431)(385)}{656} = 252.9$$

It is convenient to write each expected frequency beside the corresponding observed frequency before X² is computed.
108 (92.9)     117 (132.1)
163 (178.1)    268 (252.9)

$$X^2 = \frac{(108 - 92.9)^2}{92.9} + \frac{(117 - 132.1)^2}{132.1} + \frac{(163 - 178.1)^2}{178.1} + \frac{(268 - 252.9)^2}{252.9}$$
$$X^2 = 2.45 + 1.73 + 1.28 + 0.90 = 6.36$$

$$\text{Cramer’s } V = \sqrt{\frac{6.36 / 656}{1}} = 0.098$$

Answer on c. Interpretation of the results. Since the value is near 0, we say that there is a weak correlation between smoking status and the presence or absence of cervical cancer. A weak correlation means that there are other important factors that determine the presence of cancer. Knowing a woman’s smoking habit is clearly not enough information to predict whether she is likely to have cervical cancer or not.
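The steps for Cramer’s V can be sketched as a single function that works on any two-way frequency table. The function name `cramers_v` and the list-of-rows layout are assumptions for this illustration. Because the expected frequencies here are not rounded to one decimal place, the chi-square value differs slightly from the hand computation (about 6.32 rather than 6.36), while V still rounds to 0.098.

```python
import math


def cramers_v(table):
    """Return (V, chi-square) for a two-way table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    column_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)                                      # grand total
    chi_square = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * column_totals[j] / n  # e_ij
            chi_square += (observed - expected) ** 2 / expected
    smaller = min(len(table) - 1, len(table[0]) - 1)         # min{(r-1), (c-1)}
    return math.sqrt(chi_square / n / smaller), chi_square


# Rows: With Cancer, No Cancer; columns: Smoker, Non-smoker
v, chi_square = cramers_v([[108, 117],
                           [163, 268]])
print(round(chi_square, 2), round(v, 3))  # 6.32 0.098
```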

NOTE:
Cramer’s V is also applicable for numerical data by grouping the values into categories or classes. For example,
data on weights can be grouped into 40-44, 45-49, and so on. Then a two-way frequency table is constructed.

Example:
Suppose a study was conducted to determine if there is a correlation between gender (male and female) and the amount spent on internet use among freshmen students who have no free access to the internet.

Monthly Expenses (in Php) Male Female Total


Less than Php 300 500 450 950
Php 300 or more 300 379 679
Total 800 829 1629

a. Determine the possible pattern of correlation.


b. Compute the Cramer’s V.
c. Interpret the results.

Answer on a. To determine the pattern of correlation, we check whether there is a difference in the row percentage distributions (or the column percentage distributions). The choice of row or column percentages usually follows where the independent variable is located. If it is placed along the side (as the row labels), we compute the row percentage distribution. If it is placed in the top portion (as the column labels), we compute the column percentage distribution.

$$\text{Row \%} = \left(\frac{\text{frequency}}{\text{row total}}\right) \times 100\% \qquad \text{Column \%} = \left(\frac{\text{frequency}}{\text{column total}}\right) \times 100\%$$

In our example, since gender is the independent variable and the amount spent is the dependent variable, it is
appropriate to compute the column percentage distribution.

Male Female
Less than Php 300 (500/800) x 100% = 62.5 % (450/829) x 100% = 54.28%
Php 300 or more 37.5 % 45.72%

We observe that a higher percentage of the male spent less than Php 300 compared to female. But we notice that
the difference is quite small so we expect the correlation coefficient to be weak.
Answer on b. We compute the strength of the correlation with the Cramer’s V correlation coefficient, using the formula:
$$\text{Cramer’s } V = \sqrt{\frac{X^2 / n}{\min\{(r-1),\,(c-1)\}}}$$
where $X^2 = \sum_i \sum_j \frac{(o_{ij} - e_{ij})^2}{e_{ij}}$ (the chi-square value)
$o_{ij}$ – observed frequency in the (i,j)th cell
$e_{ij}$ – expected frequency in the (i,j)th cell = $\frac{(\text{row } i \text{ total}) \times (\text{column } j \text{ total})}{\text{grand total}}$
n – number of observations
r – number of rows
c – number of columns
min{(r − 1), (c − 1)} means we choose the smaller value between r − 1 and c − 1.
And so,
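$$e_{11} = \frac{(950)(800)}{1629} = 466.54 \qquad e_{12} = \frac{(950)(829)}{1629} = 483.46$$
$$e_{21} = \frac{(679)(800)}{1629} = 333.46 \qquad e_{22} = \frac{(679)(829)}{1629} = 345.54$$
So,
$$X^2 = \frac{(500 - 466.54)^2}{466.54} + \frac{(450 - 483.46)^2}{483.46} + \frac{(300 - 333.46)^2}{333.46} + \frac{(379 - 345.54)^2}{345.54}$$
$$X^2 = 2.40 + 2.32 + 3.36 + 3.24 = 11.32$$

$$\text{Cramer’s } V = \sqrt{\frac{11.32 / 1629}{1}} = 0.08$$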

Answer on c. Interpretation of the results. Since the value is near 0, we say that there is weak correlation between gender
and amount spent on internet use. A weak correlation means that there are other important factors that determine their
internet use. Knowing their gender is clearly not enough information to predict a person’s expenditure on internet use.

Exercises 11
(1) A random sample of 800 Marawi City residents is classified according to economic status (low, middle, or high) and opinion on whether or not they are in favor of the new re-routing scheme on Quezon Avenue. The observed frequencies are presented below:
Decision Low Middle High
For 152 183 143
Against 104 128 90

(a) Compute the percentage distribution of each income category. Which income category has the highest percentage against
the re-routing scheme?
(b) Compute the Cramer’s V and interpret the result.

(2) A researcher conducted a survey to determine whether having a working mother affects the performance of 8-year-old children in school. The following data were taken from a random sample of 200 8-year-old students.

Performance of 8 year old students in school


Low Average High
Working Mother 42 44 34
Non-working Mother 46 22 12

(a) Which do you think is the independent variable? Compute the percentage distribution and compare them. What do you
observe?
(b) Compute the Cramer’s V and interpret the result.

(3) A random sample of 200 married men, all retired, was classified according to education and number of children.

Number of Children
Education
0-1 2-3 Over 3
Elementary 14 37 32
Secondary 19 42 17
Tertiary 12 17 10

(a) Compute the percentage distribution of each level of educational attainment. Compare the results.
(b) Compute the Cramer’s V coefficient. Interpret the result.

B. Correlation Between Numerical (Interval or Ratio) Variables

Definition: The Pearson Correlation Coefficient, ρ, measures the strength of the linear relationship between two numerical random variables X and Y. The estimated sample correlation coefficient, denoted by r, is
$$r = \frac{n\sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{\sqrt{\left[n\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2\right]\left[n\sum_{i=1}^{n} y_i^2 - \left(\sum_{i=1}^{n} y_i\right)^2\right]}}$$
where n is the sample size.

The size of r expresses the degree of a relationship and may range from -1 to +1. Zero indicates no linear relationship. Values that are near +1 or -1 indicate a strong relationship. The strength of the linear relationship is summarized below.

Since the variables are interval or ratio scaled, the pair of observations will not be put in a two-way frequency table but
rather we plot each pair of observations in a Cartesian plane to check if the pattern of the relationship is linear or not.
Remember that the Pearson coefficient is a measure of the linear relationship. The plot is also called as a scatter diagram.
Here are some possible patterns that can be observed in a scatter diagram:

It can be seen above that if r is positive, then as x value increases, y value also increases. Or if the value of x is low,
the value of y is also low. If the value of r is negative, then as the value of x increases, the value of y decreases. Or if the
value of X is decreased, the value of Y increases.

Note that in the last figure, r = -0.15 even though a nonlinear pattern is observed in the scatter diagram. Because the pattern is not linear, Pearson r is near 0, indicating a very weak linear relationship. In this case, alternative methods must be sought.

Example:
(1) A study wishes to investigate the relationship between the grade in English and the grade in Math of high school students. A random sample of 5 students was selected and their grades are as follows:

Student English Grade (X) Math Grade (Y)


1 85 86
2 88 85
3 87 88
4 88 86
5 90 88
Compute r and interpret the result.
Solution:
$\sum x_i y_i$ = (85)(86) + (88)(85) + (87)(88) + (88)(86) + (90)(88) = 37,934
$\sum x_i$ = 85 + 88 + 87 + 88 + 90 = 438
$\sum y_i$ = 86 + 85 + 88 + 86 + 88 = 433
$\sum x_i^2$ = 85² + ... + 90² = 38,382
$\sum y_i^2$ = 86² + ... + 88² = 37,505
Therefore,
$$r = \frac{n\sum x_i y_i - \sum x_i \sum y_i}{\sqrt{\left[n\sum x_i^2 - \left(\sum x_i\right)^2\right]\left[n\sum y_i^2 - \left(\sum y_i\right)^2\right]}} = \frac{5(37{,}934) - (438)(433)}{\sqrt{[5(38{,}382) - (438)^2][5(37{,}505) - (433)^2]}}$$
$$r = \frac{189{,}670 - 189{,}654}{\sqrt{[191{,}910 - 191{,}844][187{,}525 - 187{,}489]}} = \frac{16}{\sqrt{(66)(36)}} = \frac{16}{\sqrt{2{,}376}} = 0.328.$$

We can see from the result that the correlation is positive but weak. Positive means that, those with high grades in
English also tend to obtain high grades in Math while those with low grades in English tend to obtain low grades in Math.
Since the correlation is weak, there are other important factors that influence performance in English or Math.
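The computing formula for r can be checked with a short sketch that builds the same five sums used in the solution above. The function name `pearson_r` is an illustrative assumption.

```python
import math


def pearson_r(xs, ys):
    """Sample correlation coefficient from the raw-sum computing formula."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator


english = [85, 88, 87, 88, 90]
math_grades = [86, 85, 88, 86, 88]
r = pearson_r(english, math_grades)
print(round(r, 3))  # 0.328
```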

(2) Suppose the study on the number of absences and the final grades of seven randomly selected students from a statistics
class is obtained. The data are shown here.

Student Number of absences, x Final grade, y (%)


1 6 82
2 2 92
3 15 72
4 9 78
5 12 74
6 5 86
7 8 88

Compute r and interpret the result.

Solution:
$\sum x_i y_i$ = (6)(82) + (2)(92) + (15)(72) + (9)(78) + (12)(74) + (5)(86) + (8)(88) = 4480
$\sum x_i$ = 6 + 2 + 15 + 9 + 12 + 5 + 8 = 57
$\sum y_i$ = 82 + 92 + 72 + 78 + 74 + 86 + 88 = 572
$\sum x_i^2$ = 6² + ... + 8² = 579
$\sum y_i^2$ = 82² + ... + 88² = 47,072
Therefore,
$$r = \frac{n\sum x_i y_i - \sum x_i \sum y_i}{\sqrt{\left[n\sum x_i^2 - \left(\sum x_i\right)^2\right]\left[n\sum y_i^2 - \left(\sum y_i\right)^2\right]}} = \frac{7(4480) - (57)(572)}{\sqrt{[7(579) - (57)^2][7(47{,}072) - (572)^2]}}$$
$$r = \frac{31{,}360 - 32{,}604}{\sqrt{[4053 - 3249][329{,}504 - 327{,}184]}} = \frac{-1244}{\sqrt{(804)(2320)}} = \frac{-1244}{1{,}365.75} = -0.91$$

Since the value of r is negative and close to -1, the points tend to follow a downward linear pattern, and there is a strong linear relationship between the number of absences and the final grades.
This means that those who are frequently absent from class will likely have a low grade, while those who attend class regularly will likely have a higher grade.

The computations seem to be a tedious process especially when there are several observations. The calculator can help
ease this problem using its regression mode.

Computing Pearson r Using the Regression mode of the calculator:

Step 1. Activate the REG mode of your calculator by pressing the MODE button twice. The screen shows
    SD   REG   BASE
    1    2     3
Press 2. The screen then shows
    Lin   Log   Exp
    1     2     3
Select 1, that is, Lin.
• S-SUM is composed of the functions
    Σx²   Σx   n
    1     2    3
  Press the right key for the other functions of S-SUM. The screen then shows
    Σy²   Σy   Σxy
    1     2    3
• S-VAR is composed of the functions
    x̄    xσn   xσn−1
    1     2     3
  Press the right key for the other functions of S-VAR. The screen then shows
    ȳ    yσn   yσn−1
    1     2     3
  Press the right key again for the other functions. There you can see
    A   B   r
    1   2   3
  Press the right key again for the last functions. They are
    x̂   ŷ
    1    2

Step 2. Clear the memory by pressing Shift


then .

Step 3. Input the data. Enter the value of x, then press the comma “,” button; enter the value of y, then press the DATA button.

Example:
(a) For the data in example (1) above, we do the following:

85 , 86 and then press DATA
88 , 85 and then press DATA
87 , 88 and then press DATA
88 , 86 and then press DATA
90 , 88 and then press DATA

Step 4. Retrieve the results. Press INV or Shift and then S-SUM, or INV and then S-VAR.

(b) Verify the results of example (2) above using the REG mode of your scientific calculator.

Exercises 12
1. A researcher wishes to see if there is a relationship between the number of calories a sandwich has and the sodium content
of the sandwich. Several different types of sandwiches are used. The data follow (Source: Consumer Reports 60, no. 9,
September 1995).

Calories, x 419 419 386 240 354 231 174


Sodium (mg), y 495 645 1202 581 990 1159 787
2. A researcher wishes to determine if there is a relationship between the energy efficiency rating (EER) of an air conditioner
and its cost. The EER is determined by the Appliance Manufacturers’ Association (Source: Consumer Reports 60, no.6,
June 1995, p.406). The data follow.

EER, x 9.5 10.0 10.0 10.0 10.3 9.1 8.6


Cost ($), y 370 360 400 400 420 350 310

3. An educator wants to see the relationship between a student’s score on a test and his or her grade point average. The data
obtained from the sample follow.

Test score, x 98 105 100 106 95 116 112


GPA, y 2.6 2.1 2.3 2.0 2.4 1.7 1.9

4. An English instructor is interested in finding the strength of the relationship between the final exam grades of students
enrolled in English 1 and English 2 classes. The data are given below in percentages. Check your answer using your calculator.

English 1, x 83 97 80 95 73 78 91 86
English 2, y 78 95 83 97 78 72 90 80

Regression Analysis

Finding an equation that relates the variables X and Y is called regression analysis.
Regression is a statistical method used to describe the nature of the relationship between variables, that is, whether it is
positive or negative, linear or nonlinear.

If the relationship between X and Y can be described by a straight-line equation, it is called simple regression
analysis. The equation can be estimated by the model

$$\hat{y} = a + bx$$

where

$$b = \frac{n\sum_{i=1}^{n} x_i y_i - \left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{n\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2}$$

is the slope of the line, and

$$a = \bar{y} - b\bar{x}$$

is the y-intercept ($\bar{y}$ is the mean of y and $\bar{x}$ is the mean of x).

The symbol ŷ is used here to distinguish between the predicted value given by the regression line and an actual
value of y for some value of x.
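The slope and intercept formulas translate directly into code. The following is a minimal sketch in Python; the function name `fit_line` and its error handling are illustrative choices, not part of the original text.

```python
def fit_line(xs, ys):
    """Return (a, b) for the line y_hat = a + b*x using the sum formulas."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("xs and ys must have the same length, at least 2")

    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)

    # Denominator of the slope formula; zero when every x is identical.
    denominator = n * sum_x2 - sum_x ** 2
    if denominator == 0:
        raise ValueError("all x values are equal, so the slope is undefined")

    b = (n * sum_xy - sum_x * sum_y) / denominator  # slope
    a = sum_y / n - b * (sum_x / n)                 # intercept: y_bar - b * x_bar
    return a, b


# Points lying exactly on y = 1 + 2x recover a = 1 and b = 2.
a, b = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(a, b)
```

Working from the running sums mirrors what the calculator does in REG mode, so intermediate values can be compared step by step.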

There are many possible lines that can be used as the model, but the one described above is called the least squares
model. The name least squares comes from the fact that this line is chosen because it has the smallest sum of squared
errors in estimating y, that is, it minimizes $\sum (y_i - \hat{y}_i)^2 = \sum (\text{error}_i)^2$.
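For readers who want to see where the formulas for a and b come from, the standard calculus argument is sketched below. The symbol S for the sum of squared errors is introduced here only for this derivation.

```latex
% Sum of squared errors as a function of the intercept a and slope b
S(a, b) = \sum_{i=1}^{n} \left( y_i - a - b x_i \right)^2

% Setting both partial derivatives to zero gives the normal equations
\frac{\partial S}{\partial a} = -2 \sum_{i=1}^{n} \left( y_i - a - b x_i \right) = 0
  \;\Longrightarrow\; \sum y_i = n a + b \sum x_i

\frac{\partial S}{\partial b} = -2 \sum_{i=1}^{n} x_i \left( y_i - a - b x_i \right) = 0
  \;\Longrightarrow\; \sum x_i y_i = a \sum x_i + b \sum x_i^2

% Solving the two equations for a and b
a = \bar{y} - b \bar{x}, \qquad
b = \frac{n \sum x_i y_i - \left( \sum x_i \right)\left( \sum y_i \right)}
         {n \sum x_i^2 - \left( \sum x_i \right)^2}
```

The first normal equation, divided by n, is exactly the intercept formula; substituting it into the second gives the slope formula.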

Example:
In a study of the relationship between mathematics and physics grades, the following grades of 10 randomly selected students were obtained.

Mathematics grade, x 75 86 87 90 79 93 79 78 79 81
Physics grade, y 73 92 74 83 84 86 80 77 83 88

a. Plot the given points.

b. Find the linear equation relating X and Y.

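From the data, $n = 10$, $\sum x_i = 827$, $\sum y_i = 820$, $\sum x_i y_i = 67{,}940$, and $\sum x_i^2 = 68{,}707$, so that

$n\sum x_i y_i = 10(67{,}940) = 679{,}400$ and $n\sum x_i^2 = 10(68{,}707) = 687{,}070$.

$$b = \frac{n\sum x_i y_i - \left(\sum x_i\right)\left(\sum y_i\right)}{n\sum x_i^2 - \left(\sum x_i\right)^2} = \frac{679{,}400 - (827)(820)}{687{,}070 - (827)^2} = \frac{1260}{3141} = 0.401$$

$$a = \bar{y} - b\bar{x} = 82 - (0.401)(82.7) \approx 48.83$$

(The value 48.83 is obtained by carrying the unrounded value of b through the computation.)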

Using the REG mode of the calculator, we find a = 48.83 and b = 0.401. Hence, our linear model is given by

$\hat{y} = a + bx = 48.83 + 0.401x$.
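The sums and coefficients in part b can be checked with a short Python script. This is a sketch that simply recomputes the quantities from the grade table; the variable names are illustrative.

```python
# Grades of the 10 students from the example table.
math_grades = [75, 86, 87, 90, 79, 93, 79, 78, 79, 81]
physics_grades = [73, 92, 74, 83, 84, 86, 80, 77, 83, 88]

n = len(math_grades)
sum_x = sum(math_grades)
sum_y = sum(physics_grades)
sum_xy = sum(x * y for x, y in zip(math_grades, physics_grades))
sum_x2 = sum(x * x for x in math_grades)

# Numerator and denominator of the slope formula.
numerator = n * sum_xy - sum_x * sum_y
denominator = n * sum_x2 - sum_x ** 2

b = numerator / denominator          # slope
a = sum_y / n - b * (sum_x / n)      # intercept, using the unrounded slope

print(f"sum_x={sum_x}, sum_y={sum_y}, sum_xy={sum_xy}, sum_x2={sum_x2}")
print(f"b = {numerator}/{denominator} = {b:.3f}")
print(f"a = {a:.2f}")
```

Rounding only at the end matters here: the intercept is computed from the unrounded slope, which is how the calculator arrives at its value.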

c. Estimate the physics grade of a student who obtained a math grade of 80. Compute the amount of error between
the actual observed value and the predicted value.

To estimate the physics grade, we substitute x = 80 into our linear model:

$\hat{y} = 48.83 + 0.401x = 48.83 + 0.401(80) = 48.83 + 32.08 = 80.91$

No student in the sample has a math grade of 80, so there is no observed value with which to compute an error here.

d. Estimate the physics grade when the math grade is 86. Compute the amount of error between the actual observed
value and the predicted value.

We substitute x = 86 into the linear model:

$\hat{y} = 48.83 + 0.401x = 48.83 + 0.401(86) = 48.83 + 34.49 = 83.32$

Since the actual value is 92 when the math grade is 86, as can be seen from the table above, the error between the
actual observed value and the predicted value is

92 − 83.32 = 8.68.
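The two predictions and the error can be reproduced with a small Python sketch. It uses the rounded coefficients a = 48.83 and b = 0.401 from the example; the function name `predict` is an illustrative choice.

```python
# Rounded coefficients of the fitted line from the example.
A = 48.83
B = 0.401


def predict(math_grade):
    """Predicted physics grade for a given math grade."""
    return A + B * math_grade


# Prediction for a math grade of 80 (no observed value to compare with).
predicted_80 = predict(80)

# Prediction for a math grade of 86, where the observed physics grade is 92.
predicted_86 = predict(86)
observed_86 = 92
error_86 = observed_86 - predicted_86  # observed minus predicted

print(f"x = 80: predicted {predicted_80:.2f}")
print(f"x = 86: predicted {predicted_86:.2f}, error {error_86:.2f}")
```

The error is taken as observed minus predicted, matching the subtraction in the example, so a positive value means the line underestimates that student's physics grade.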

Note:
The equation above should be used for prediction only within the interval of the observed values of X, because we cannot
be certain that the relationship behaves the same way outside that interval.
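One way to respect this note in practice is to have the prediction step refuse values of X outside the observed range. The sketch below is an assumption about how one might enforce the rule, not something prescribed by the text; the function name `predict_in_range` is hypothetical.

```python
def predict_in_range(x, observed_xs, a, b):
    """Predict y_hat = a + b*x, but only for x inside the observed range."""
    lowest, highest = min(observed_xs), max(observed_xs)
    if not lowest <= x <= highest:
        raise ValueError(
            f"x = {x} is outside the observed interval [{lowest}, {highest}]"
        )
    return a + b * x


# Math grades from the example; the observed interval is 75 to 93.
math_grades = [75, 86, 87, 90, 79, 93, 79, 78, 79, 81]

print(predict_in_range(80, math_grades, 48.83, 0.401))  # inside the interval

try:
    predict_in_range(100, math_grades, 48.83, 0.401)     # outside the interval
except ValueError as problem:
    print("Refused:", problem)
```

Raising an error rather than returning a number makes an extrapolation an explicit decision instead of a silent one.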

• How do we know if the model is good?

This can be answered partially by the coefficient of determination, which is equal to r² × 100%. It measures the
proportion of the total variation in Y that is explained by the linear relationship between X and Y. We desire a high value
of r². The value of r² ranges from 0 to 1. If r² = 0, none of the variation in Y is explained by its linear relationship
with X. If r² is low, the model should be improved by adding more independent variables or by adding
non-linear terms to the model.

For example, if r = 0.98, then r² × 100% = (0.98)² × 100% = 96.04%. Therefore, we can say that the model is good.
If we get r = 0.14, then r² × 100% = (0.14)² × 100% = 1.96%. In that case, the model needs to be
improved by adding more independent variables to the model.
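The coefficient of determination can also be computed directly from data. The sketch below assumes r is the usual Pearson correlation coefficient computed from the sum formulas and applies it to the math and physics grades from the regression example; the function name `pearson_r` is illustrative.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient computed from running sums."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator


math_grades = [75, 86, 87, 90, 79, 93, 79, 78, 79, 81]
physics_grades = [73, 92, 74, 83, 84, 86, 80, 77, 83, 88]

r = pearson_r(math_grades, physics_grades)
r_squared_percent = r ** 2 * 100  # coefficient of determination, in percent

print(f"r = {r:.3f}, r^2 x 100% = {r_squared_percent:.2f}%")
```

For these grades the percentage comes out low, which by the criterion above suggests that math grade alone explains only a small part of the variation in physics grade.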

Exercises 13
(1) In a study of the relationship between the amount of rainfall (X, in 0.01 cm) and the quantity of air pollution removed
(Y, in micrograms per cubic meter), the following data were collected:
X 4.3 4.5 5.9 5.6 6.1 5.2 3.8 2.1 7.5
Y 126 121 116 118 114 118 132 141 108

a.) Find a linear equation relating X and Y.

b.) Estimate the amount of air pollution removed when the amount of rainfall is 4.7.
c.) Estimate the amount of air pollution removed when the amount of rainfall is 4.5. Compute the amount of error between
the actual observed value and the predicted value.

(2) A physician wishes to know whether there is a relationship between father’s weight (in pounds) and his newborn son’s
weight (in pounds). The data are given here.

Father’s weight, x 176 160 187 210 196 142 205 215
Son’s weight, y 6.6 8.2 9.2 7.1 8.8 9.3 7.4 8.6

a.) Find a linear equation relating X and Y.

b.) Estimate the son’s weight when the father’s weight is 7.9.
c.) Estimate the son’s weight when the father’s weight is 8.2. Compute the amount of error
between the actual observed value and the predicted value.

(3) Suppose a study is conducted to see whether there is a relationship between a student’s grade point average and the number of
hours the student studies per week. Predict the GPA of a student who studies 10 hours a week. The data are shown here.
Check the answer using your calculator.

Hours, x 3 12 9 15 5 7 16
GPA, y 2.5 3.0 2.75 1.5 1.75 2.25 1.0

4. A study is conducted to determine the relationship between a driver’s age and the number of accidents he or she has over
a one-year period. Predict the number of accidents of a driver who is 28. Check the answer using your calculator.
The data are shown here.

Driver’s age, x 16 24 18 17 23 27 32
No. of accidents, y 3 2 5
