
UNIT IV. DATA MANAGEMENT


by Jaynelle G. Domingo

Overview

This unit gives an overview of how statistics is applied in managing gathered information, or data. Data management
is broadly defined as the development, execution, and supervision of plans, policies, programs, and practices that
control, deliver, and enhance the value of data and information assets. This process closely parallels the definition of
statistics, both descriptive and inferential, which refers to the collection, organization, processing, analysis, and
interpretation of data. Thus, data gathering procedures, some descriptive statistics, and correlation and regression
analysis are reviewed in this unit, since most of these topics were already discussed in basic education (high school).
The discussion is limited to ungrouped data, as these are more practical to use in statistical analyses. Different
statistical treatments and quantities are highlighted to further establish that statistical tools derived from mathematics
are useful in processing and managing numerical data in order to describe a phenomenon and predict values.

Learning Objectives:

At the end of the unit, I am able to:


1. use a variety of statistical tools to process and manage numerical data;
2. use the methods of linear regression and correlation to predict the value of a variable given certain
conditions;
3. advocate the use of statistical data in making important decisions.

Setting Up (Unit 4)

Name: _______________________________________________________Score: ______________________________

Course/Year/Section: ___________________________________________ Date: _______________________________

Directions: Answer the following questions.

1. On a scale of 1-10, how would you rate your learning in statistics during your Junior-Senior High School?
Answer: _________
2. On a scale of 1-10, how comfortable are you with statistics?
Answer: _________
3. What comes to your mind when you hear the word statistics?
Answer:
__________________________________________________________________________________________
__________________________________________________________________________________________

4. Name the different statistical symbols or acronyms that you can remember below.

__________________________a. FDT __________________________i. $N$
__________________________b. $r$ __________________________j. $n$
__________________________c. $\bar{x}$ __________________________k. $\tilde{x}$ or Md
__________________________d. $\hat{x}$ or Mo __________________________l. $R$
__________________________e. $\sigma$ __________________________m. MAD
__________________________f. $s^2$ __________________________n. $\mu$
__________________________g. MCT __________________________o. $Q_3$
__________________________h. $\Sigma$

Lesson Proper

Gathering, Organizing, Representing and Interpreting Data

Data Gathering
Data are needed whenever we undertake studies or research. They are used to address particular problems
or to provide a basis on which decisions are made. There are different methods of gathering data.

1. Direct or Interview Method – data are gathered through person-to-person encounter between the source of
information and the one who gathers the information
2. Indirect or Questionnaire Method – data are gathered through a questionnaire which consists of sets of questions
regarding the needed information
3. Registration Method – data are gathered from the records to which the sources of information have registered
4. Observation Method – data such as behaviors of individuals or groups are obtained through observation given a
particular period of time or in a given situation
5. Experimental Method – data (laboratory data) from scientific inquiries are obtained through the results of a series of
experiments

In this module, numerical data are emphasized. These data are sets of values that measure specific
quantities about an individual or a group. In practice, numerical data are used for forecasting and as a basis for
decision making.

Data Organization and Presentation

The mere gathering of the information or data is not a small task. A greater task is to make the data
comprehensible and meaningful.

Common Ways to Present Data


a. Textual Form
The data are incorporated in the text of the report. This form is commonly used when there are only a few numerical
data to be enumerated or compared.
b. Tabular Form
The data are presented in rows and columns. A tabular presentation should be simple and should make the
meaning and significance of the information clear. Statistical tables should have a heading,
box (column) head, stub, footnote (usually a legend), and a source note stating where the data came from.
c. Graphical Form
The data are presented in graphic form for “easy to digest” information. A graph shows numerical values or
relationships in pictorial form. Different types of graphs are used depending on the data
being presented. These are as follows:
§ Line Graph – It is used when data cover a long period of time, several series are compared, movements
are to be emphasized, trends are to be established, and estimates are to be forecasted.
§ Bar Graph – It is used when numerical values of an item over a period of time are compared.
§ Pie Graph – It is used to show how the parts compare with the whole.
§ Pictograph – It is used to immediately suggest the nature of the data.

Methods of Organizing Data


a. Array – an ordering of the observations from smallest to largest or vice versa. Its advantage is that the low
and high values can be readily perceived. The process is tedious, especially if the raw data are numerous.
b. Stem-and-Leaf Diagram – each number (raw datum) is split into a tens digit (stem) and a units digit (leaf), and the
units digits that share the same tens digit are tallied together.
c. Frequency Distribution Table (FDT) – a condensed version of an array in which data are arranged in tabular form by
frequency (the number of times an observation occurs). Below are the parts of an FDT.
• Classes – mutually exclusive categories defined by a lower limit and an upper limit, with equal intervals. In
a class (e.g., 27 − 34), 27 is called the lower limit while 34 is called the upper limit of the given class.
• Class Frequency – the number of observations in each class
• Class Mark or Class Midpoint – the middle value, or the average of the lower limit and the upper limit, of a
given class
• Cumulative Frequency (CF) – the running total of the frequencies up to (<CF) or from (>CF) a particular class of interest
• Relative Frequency – the percentage of the observations that fall in a particular class of interest

From raw data, the steps in constructing a frequency distribution table are as follows:
1. Determine the range $R$ of the numerical (raw) data.
$$R = \text{Highest Value} - \text{Lowest Value}$$

2. Determine the number of classes $K$ into which the data are to be grouped using Sturges' Approximation
Formula:
$$K = 1 + 3.322 \log N$$
where $N$ = total number of observations and $\log$ is the common (base-10) logarithm.
Note: The computed value of $K$ is rounded up.

3. Determine the class size $C$ by dividing the range ($R$) by the number of classes ($K$).
$$C = R/K$$
4. Determine the lower limit of the lowest class. Set the lowest value in the data as the lower limit of the lowest
class.
5. Construct the class intervals and determine the class frequencies.
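
The five steps above can be sketched in Python as follows. This is only an illustrative sketch: the ten values are the library durations used in the illustration of measures of central tendency in this unit, the variable names are made up for the example, and rounding the class size up to a whole number is an assumption (the steps above do not say how to round $C$).

```python
import math

# Ten library durations (minutes) from this unit's central-tendency illustration.
data = [44, 20, 35, 33, 40, 33, 33, 15, 42, 34]

# Step 1: range = highest value - lowest value
data_range = max(data) - min(data)

# Step 2: number of classes by Sturges' formula, rounded up
num_classes = math.ceil(1 + 3.322 * math.log10(len(data)))

# Step 3: class size; rounding up to a whole number is an assumption of this sketch
class_size = math.ceil(data_range / num_classes)

# Steps 4 and 5: start at the lowest value, build the intervals, count frequencies
lowest = min(data)
classes = []
for i in range(num_classes):
    lower = lowest + i * class_size
    upper = lower + class_size - 1
    frequency = sum(lower <= x <= upper for x in data)
    classes.append((lower, upper, frequency))

for lower, upper, frequency in classes:
    print(f"{lower}-{upper}: {frequency}")
```

With only ten observations some classes come out empty, which is one reason an FDT is usually reserved for larger data sets.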

Sample Frequency Distribution Table and its Parts


Class    Frequency    Class Mark    Class Boundary    Relative Frequency (%)    <CF    >CF
54-60    5            57            53.5-60.5         12.5                      5      40
61-67    6            64            60.5-67.5         15                        11     35
68-74    11           71            67.5-74.5         27.5                      22     29
75-81    15           78            74.5-81.5         37.5                      37     18
82-88    3            85            81.5-88.5         7.5                       40     3
         N = 40                                       100
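
As a rough check of the table, the relative frequency and the two cumulative frequency columns can be derived from the class frequencies alone. The sketch below assumes only the frequency column of the sample table; the variable names are illustrative.

```python
from itertools import accumulate

# Class frequencies copied from the sample table above.
frequencies = [5, 6, 11, 15, 3]
n = sum(frequencies)

# Relative frequency: share of all observations in each class, in percent
relative = [100 * f / n for f in frequencies]

# <CF: running total from the first class downward
less_than_cf = list(accumulate(frequencies))

# >CF: running total from the last class upward
greater_than_cf = list(accumulate(reversed(frequencies)))[::-1]

print(relative)
print(less_than_cf)
print(greater_than_cf)
```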

Data Analysis and Interpretation

Data analysis and interpretation is the process of making sense of numerical data that have been collected,
analyzed, and presented. Descriptive statistics are used to describe and summarize the data collected, often giving a
single value that represents the set of values. Inferential statistics, on the other hand, are techniques in which samples
are used to make generalizations about the populations from which the samples were drawn.

Measures of Central Tendency

Statisticians often collect data from a small portion (sample) of a large group (population) in order to determine
information about the group. These data are usually represented by a single value referred to as a measure of central
tendency or central location. A Measure of Central Tendency (MCT) is a measure indicating the center of a set of data
arranged in order of magnitude. Three MCTs are commonly used: the mean, the median, and the
mode.

1. Mean – simply the average. It is the most commonly used MCT. The mean is denoted by $\mu$ for the population mean
and $\bar{x}$ for the sample mean. There are two types of mean: the arithmetic mean and the weighted mean.

Properties of Mean
§ The mean reflects the magnitude of every observation, since every observation contributes to the value of
the mean.
§ The mean can be easily affected by the presence of an extreme value; hence it is not a good
MCT when extreme values occur.

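Arithmetic Mean is computed by adding all the values and dividing the sum by the number of observations.
Population Mean: $\mu = \dfrac{\sum x}{N}$        Sample Mean: $\bar{x} = \dfrac{\sum x}{n}$
where $x$ = data values (observations)
$N$ = number of observations in the population; $n$ = number of observations in the sample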

Weighted Mean is often used when some data values are more important than others.
Population Mean: $\mu_w = \dfrac{\sum wx}{\sum w}$        Sample Mean: $\bar{x}_w = \dfrac{\sum wx}{\sum w}$
where $x$ = data values (observations)
$w$ = weight assigned to each data value
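
A short sketch may help show how the weights change the result. The grades and course units below are hypothetical values invented for this example; they do not come from the module.

```python
# Hypothetical grades and course units (the units act as weights).
grades = [90, 85, 78]
units = [3, 5, 2]

# Weighted mean: sum of (weight * value) divided by the sum of the weights
weighted_mean = sum(w * x for w, x in zip(units, grades)) / sum(units)

# Arithmetic mean of the same grades, for comparison
arithmetic_mean = sum(grades) / len(grades)

print(weighted_mean, arithmetic_mean)
```

The 5-unit course pulls the weighted mean toward 85, while the arithmetic mean treats all three grades equally.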

2. Median – the middle score for a set of data arranged in order (array data). It is denoted by $Md$ or $\tilde{x}$.

Properties of Median
§ Median is a positional value and hence is not affected by the presence of an extreme value unlike the
mean.
§ The median is not amenable for further computation and hence medians of subgroups cannot be
combined in the same manner as the mean.

3. Mode – the most frequent score or value in the data set. It is sometimes considered as the most popular option
and is denoted by $Mo$ or $\hat{x}$. A particular data set can have no mode, one mode (unimodal), two modes (bimodal),
and so on.

Properties of Mode
§ Since mode is the most frequently occurring value, it may not be the center of the data.
§ Mode does not make use of all observations.
§ Mode is difficult to manipulate algebraically.
§ Mode is ideal for qualitative type of data.

Illustration of MCT:

Koko recorded the duration of his stay in the library for 10 school days. His data are as follows:

Day    Duration (in minutes)
1      44
2      20
3      35
4      33
5      40
6      33
7      33
8      15
9      42
10     34
Mean:
$$\bar{x} = \frac{\sum x}{n} = \frac{44 + 20 + 35 + 33 + 40 + 33 + 33 + 15 + 42 + 34}{10} = \frac{329}{10}$$

$\bar{x} = 32.9$ mins

Median:
Arrange first the data from lowest to highest.

15 20 33 33 33 34 35 40 42 44

Since there is an even number of data values, there are two middle scores. Add the two middle scores and divide the sum by 2.
$$\tilde{x} = \frac{33 + 34}{2} = \frac{67}{2}$$

$\tilde{x} = 33.5$ mins

Mode:
In the data set, 33 appears three times. Thus, 33 is the mode and the data set is unimodal.

$\hat{x} = 33$ mins
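
The three results above can be cross-checked with Python's standard `statistics` module. This is a minimal sketch; the variable names are illustrative, and `multimode` is included only to show how a data set with several modes would be reported.

```python
import statistics

# Koko's durations of stay in the library (minutes)
durations = [44, 20, 35, 33, 40, 33, 33, 15, 42, 34]

mean = statistics.mean(durations)        # sum of the values divided by n
median = statistics.median(durations)    # average of the two middle values when n is even
mode = statistics.mode(durations)        # most frequent value
modes = statistics.multimode(durations)  # list of all most-frequent values

print(mean, median, mode, modes)
```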

Measures of Dispersion

Measures of dispersion (variation) identify how a set of values spreads or fluctuates. They enable you to know
how varied the observations are, whether there are extreme values in the distribution, or whether
the values are very close to each other. If the measure of variation is zero, there is no variation at all and
the observations are all alike, or homogeneous. Otherwise, they are heterogeneous. The common measures of
variation are the range, mean absolute deviation, variance, standard deviation, coefficient of variation, quartile deviation,
and percentile range. However, only the range, variance, and standard deviation of ungrouped data are discussed
in this section, as these three are the most commonly used and the most practical when it comes to inferential statistics.

1. Range – the difference between the greatest data value and the lowest data value. It is the simplest measure of
dispersion but the least reliable. It does not reflect variations in the data set that lie in between the highest and
lowest data value.

Example:

In Koko’s data in his duration of stay in the library, the highest data value is 44 and the lowest data value is 15.
Thus,
𝑅 = 44 − 15
𝑹 = 𝟐𝟗

2. Variance – unlike the range, considers the deviation of every data value in the data set. It is the
average of the squared deviations of each data value from the mean of the data set. It is denoted by $\sigma^2$ for the
population variance and $s^2$ for the sample variance. As with the range, the higher the computed value of the variance, the
more dispersed the data set. It is always nonnegative.

Formula:
Population Variance: $\sigma^2 = \dfrac{\sum (x - \mu)^2}{N}$
Sample Variance: $s^2 = \dfrac{\sum (x - \bar{x})^2}{n - 1}$ or $s^2 = \dfrac{n\sum x^2 - (\sum x)^2}{n(n - 1)}$

where: $N$ = number of data values in the population
$n$ = number of data values in the sample
$x$ = data values (observations)
$\mu$ = population mean
$\bar{x}$ = sample mean

Example:
Let us consider Koko's data on the duration of his stay in the library (treated as a sample). The computed mean is
$\bar{x} = 32.9$ and $n = 10$. For convenience, make a table as a guide for computation.

$x$     $x - \bar{x}$     $(x - \bar{x})^2$
44      11.1      123.21
20      -12.9     166.41
35      2.1       4.41
33      0.1       0.01
40      7.1       50.41
33      0.1       0.01
33      0.1       0.01
15      -17.9     320.41
42      9.1       82.81
34      1.1       1.21
$\sum (x - \bar{x})^2 = 748.9$

$$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1} = \frac{748.9}{10 - 1} = \frac{748.9}{9}$$

variance ($s^2$) = 83.21
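
The two sample variance formulas given earlier should agree. The sketch below, with illustrative variable names, computes both for Koko's data so the shortcut form can be compared with the table-based computation.

```python
# Koko's durations of stay in the library (minutes), treated as a sample
durations = [44, 20, 35, 33, 40, 33, 33, 15, 42, 34]
n = len(durations)
mean = sum(durations) / n

# Definitional form: sum of squared deviations from the mean, divided by n - 1
definitional = sum((x - mean) ** 2 for x in durations) / (n - 1)

# Shortcut form: [n * sum(x^2) - (sum x)^2] / [n(n - 1)]
sum_x = sum(durations)
sum_x2 = sum(x * x for x in durations)
shortcut = (n * sum_x2 - sum_x ** 2) / (n * (n - 1))

print(round(definitional, 2), round(shortcut, 2))
```

The shortcut form needs only $\sum x$ and $\sum x^2$, which is convenient when working by hand or with a basic calculator.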

3. Standard Deviation – computed as the positive square root of the variance. Similar to the variance, it is based on the
deviations of all data values in the data set. Standard deviation is considered the most reliable measure of
dispersion, as this value is associated with the characteristics of common data sets which are normally distributed.
In statistics, it is denoted by $\sigma$ for the population standard deviation and $s$ for the sample standard deviation, but in
research, it is denoted by $SD$.

Formula:

Population Standard Deviation: $\sigma = \sqrt{\dfrac{\sum (x - \mu)^2}{N}}$
Sample Standard Deviation: $s = \sqrt{\dfrac{\sum (x - \bar{x})^2}{n - 1}}$

Example:
Let us consider Koko's data on the duration of his stay in the library (treated as a sample). The computed
sample variance is $s^2 = 83.21$. Thus,

$$s = \sqrt{83.21}$$

standard deviation ($s$) = 9.12
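
For a quick check of all three measures of dispersion, Python's standard `statistics` module offers sample and population versions. This is a sketch with illustrative variable names; the population figure is shown only to contrast the $n - 1$ and $N$ divisors.

```python
import statistics

# Koko's durations of stay in the library (minutes)
durations = [44, 20, 35, 33, 40, 33, 33, 15, 42, 34]

data_range = max(durations) - min(durations)

sample_variance = statistics.variance(durations)  # divides by n - 1
sample_sd = statistics.stdev(durations)           # square root of the sample variance

population_sd = statistics.pstdev(durations)      # divides by N instead of n - 1

print(data_range, round(sample_variance, 2), round(sample_sd, 2), round(population_sd, 2))
```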

Note: If you have a large data set, it is suggested that you use applications such as MS Office Excel for convenience.
However, you must still understand what the measures of dispersion mean and how to apply them.

Measures of Relative Position

Fractiles or quantiles are measures of location or position which include not only central location but also any
position based on the number of equal divisions in a given distribution. If we divide the distribution into four equal
divisions, then we have quartiles, denoted by $Q_1$, $Q_2$, $Q_3$ and $Q_4$. The most commonly used fractiles are the quartiles,
deciles (division by 10), percentiles (division by 100), and the $z$-score (relative position expressed as the number of
standard deviations from the mean, which corresponds to $z = 0$).

Probabilities and Normal Distribution

A normal distribution is an arrangement of a data set in which most values cluster in the middle of the range and
the rest taper off symmetrically toward either extreme. Its graph is called the normal curve.

Properties:

§ The mean, median, and mode coincide at the exact center of the distribution.
§ The curve has a single maximum at 𝑥 = 𝜇.
§ The curve is symmetrical about the vertical line 𝑥 = 𝜇. Thus, the height of the curve at some point say, 𝑥 = 𝜇 + 𝜎
is the same as the height of the curve at 𝑥 = 𝜇 − 𝜎.
§ The total area under the curve is 1.00, and since the curve is symmetrical about 𝑥 = 𝜇, it follows that the area on
either side of the vertical line 𝑥 = 𝜇 is 0.5. The total area represents the total number of cases (𝑁).
§ As 𝑥 moves away on either side of the mean µ, the height of the curve decreases but remains non-negative for all
real values of 𝑥.
Note: In actual practice, it is unnecessary to extend the tails of the curve very far, since the area under the curve
becomes negligible as we move more than three standard deviations away from the mean of the distribution.

Normal curves can describe a large number of groups of data differentiated only by their mean and their tendency
to spread.

Any normal distribution is defined by two measures: the mean which locates the center and the standard deviation
which measures the spread around the center.

The standard deviation permits us to determine quite accurately where the values in a distribution are located
relative to their mean. With the use of the standard deviation, we can measure with great precision the percentage of
items that fall within specific ranges under a normal distribution.

The Standard Score (Z-Score)

The standard scores or z-scores are measures of distance on the normal curve in terms of standard deviations from
the mean. A z-score shows how far a data value is from the mean. This distance is expressed in standard
deviations for easy comparison with other data sets.

The equivalent z-score of a data value from a normal distribution can be computed, given the mean and standard
deviation of that data set, by the following formula:

Population: $z = \dfrac{x - \mu}{\sigma}$        Sample: $z = \dfrac{x - \bar{x}}{s}$

Examples:

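A. Convert the following real scores into standard scores.

Given: $\mu = 60$ and $\sigma = 4$

1. $x = 68$
Solution:
$$z = \frac{x - \mu}{\sigma} = \frac{68 - 60}{4} = \frac{8}{4}$$
$z = 2$
2. $x = 50$
Solution:
$$z = \frac{x - \mu}{\sigma} = \frac{50 - 60}{4} = \frac{-10}{4}$$
$z = -2.5$
3. $x = 60$
Solution:
$$z = \frac{x - \mu}{\sigma} = \frac{60 - 60}{4} = \frac{0}{4}$$
$z = 0$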

B. Consider the following problem.


Shown below are Mr. Domingo's scores, together with the mean and the standard deviation, on each of three tests given
to 500 applicants for a job position.

Test                  Score    Mean    S.D.
Personality           72.2     68.2    10.5
Abstract Reasoning    65       73      8.6
Aptitude              56.5     54.5    5.1

On which test did Mr. Domingo stand highest? lowest?


Solution:
To determine the answer, convert the score on each test into a standard score.

Personality Test
x = 72.2 µ = 68.2 σ = 10.5
$$z = \frac{x - \mu}{\sigma} = \frac{72.2 - 68.2}{10.5} = \frac{4}{10.5}$$
z = 0.38 (rounded off to two decimal places)

Abstract Reasoning Test
x = 65 µ = 73 σ = 8.6
$$z = \frac{x - \mu}{\sigma} = \frac{65 - 73}{8.6} = \frac{-8}{8.6}$$
z = −0.93 (rounded off to two decimal places)

Aptitude Test
x = 56.5 µ = 54.5 σ = 5.1
$$z = \frac{x - \mu}{\sigma} = \frac{56.5 - 54.5}{5.1} = \frac{2}{5.1}$$
z = 0.39 (rounded off to two decimal places)

Answer: Since z-score indicates the deviation of the score from the mean in each distribution, then
a. Highest: Aptitude Test (it has the highest z-score);
b. Lowest: Abstract Reasoning (it has the lowest z-score).
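
The comparison above can be sketched in a few lines of Python. The dictionary layout and variable names are illustrative choices; the scores, means, and standard deviations are those given in the problem.

```python
# (score, mean, standard deviation) for each test, from the problem above
tests = {
    "Personality": (72.2, 68.2, 10.5),
    "Abstract Reasoning": (65, 73, 8.6),
    "Aptitude": (56.5, 54.5, 5.1),
}

# z = (x - mean) / standard deviation for each test
z_scores = {name: (x - mu) / sigma for name, (x, mu, sigma) in tests.items()}

# The test with the largest z-score is the relatively highest standing
highest = max(z_scores, key=z_scores.get)
lowest = min(z_scores, key=z_scores.get)

for name, z in z_scores.items():
    print(f"{name}: z = {z:.2f}")
print("Highest:", highest, "| Lowest:", lowest)
```

Note that the raw score on the Personality test is the largest of the three, yet its z-score is slightly below that of the Aptitude test, which is why standardized scores are used for the comparison.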

Probabilities under Normal Curve and its Application

To calculate the area under the normal curve, we use the standard normal or z-score table (see the attachment after
the unit activities). Areas under the normal curve denote probabilities. In the z-score table, the leftmost column gives
the z-score (the number of standard deviations from the mean) to one decimal place, the top row gives the second
decimal place, and the intersection of a row and a column gives the probability that a value lies between the mean and
that z-score. For a negative z-score, look up its absolute value, since the curve is symmetrical about the mean.

Steps in Calculating Probabilities under Normal Curve

1. Translate the problem into one of the following:


§ To the right means 𝑃(𝑥 > 𝑎) or probability of 𝑥 greater than a certain point 𝑎
§ To the left means 𝑃(𝑥 < 𝑎) or probability of 𝑥 less than a certain point 𝑎
§ Between means 𝑃(𝑎 < 𝑥 < 𝑏) or probability of 𝑥 between certain points 𝑎 and 𝑏
2. Draw a picture of normal curve. Shade in the area of the given probability 𝑃 on the picture. This will help you to
visualize the problem.
3. Standardize 𝑎 (and/or 𝑏) to a z-score using the z-formula if it is not yet standardized.
4. Look up the z-score on the Z-table and find its corresponding probability.
5. Consider the following cases, where "table value" means the area read from the z-table.
a. Positive z-score and P(x>a): subtract the table value from 0.5.
b. Negative z-score and P(x<a): subtract the table value from 0.5.
c. P(a<x<b), where a is negative and b is positive: add the two table values.
d. P(a<x<b), where both a and b are positive: table value for b minus table value for a.
e. P(a<x<b), where both a and b are negative: table value for |a| minus table value for |b|.
f. Positive z-score and P(x<a): add the table value to 0.5.
g. Negative z-score and P(x>a): add the table value to 0.5.

Practice:

A. Directions: Locate the following z-scores and shade the area asked for in each item under the standard
normal curve.

1. To the left of 𝑧 = 0.5 2. To the left of 𝑧 = 1.43

3. To the right of z= - 2.54 4. In between z=0.9 and z=-2.0

B. Directions: Verify the area (probability) under the standard normal curve for each of the following conditions. Refer to
the z-table and to the concept notes for guidance.

1. to the right of z = 0.66 Answer: 0.2546


2. to the left of z = -1.53 Answer: 0.0630
3. to the right of z = -2.34 Answer: 0.9904
4. to the right of z = 1.30 Answer: 0.0968
5. between z = -0.78 and z = 0.56 Answer: 0.4946
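
The five answers above can also be checked without the printed table. The sketch below uses `statistics.NormalDist` from the Python standard library; the helper `table_area` is a made-up name that mimics what the z-table reports (the area between the mean and z), so small differences in the fourth decimal place against the table are expected.

```python
from statistics import NormalDist

std_normal = NormalDist()  # standard normal: mean 0, standard deviation 1


def table_area(z):
    """Area between the mean (z = 0) and z, as tabulated in the z-table."""
    return std_normal.cdf(abs(z)) - 0.5


answers = {
    "right of 0.66": 0.5 - table_area(0.66),
    "left of -1.53": 0.5 - table_area(-1.53),
    "right of -2.34": 0.5 + table_area(-2.34),
    "right of 1.30": 0.5 - table_area(1.30),
    "between -0.78 and 0.56": table_area(-0.78) + table_area(0.56),
}

for condition, area in answers.items():
    print(f"{condition}: {area:.4f}")
```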

Sample Problem:

Consider the problem below, which illustrates an application of the area under the normal curve in real life.

You are a farmer about to harvest your crop. To describe the uncertainty in the size of the harvest, you feel that it
may be described by a normal distribution with a mean of 80,000 bushels and a standard deviation of 2,500 bushels.
Find the probability that your harvest will exceed 84,400 bushels.

Solution: The z-score of 84,400 is $z = (84{,}400 - 80{,}000)/2{,}500 = 1.76$. The table value for z = 1.76 is 0.4608. The
phrase "will exceed" means P(x>a); thus, subtract 0.4608 from 0.5. Hence the answer is 0.0392 or 3.92%.
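
The same problem can be sketched in Python by describing the harvest directly as a normal distribution, so that no standardization by hand is needed. The variable names are illustrative; `NormalDist` comes from the standard `statistics` module.

```python
from statistics import NormalDist

# Harvest size modeled as normal with mean 80,000 and standard deviation 2,500 bushels
harvest = NormalDist(mu=80_000, sigma=2_500)

threshold = 84_400
z = (threshold - harvest.mean) / harvest.stdev  # standardized threshold
p_exceed = 1 - harvest.cdf(threshold)           # P(x > 84,400)

print(f"z = {z:.2f}, P(x > {threshold}) = {p_exceed:.4f}")
```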

Correlation and Linear Regression

Correlation Analysis

Correlation is the measure of relation/association between paired data. An analysis of correlation is a method of
measuring the strength of such relationship. Paired data is usually called bivariate data.

Pearson Correlation Coefficient (Pearson’s r)

Pearson's $r$ is an index of the linear relationship between two variables. The value of $r$ ranges from −1 to +1. If the
value of $r$ is +1 or −1, there is a perfect correlation between the paired data. If $r$ is equal to zero, there is no
linear relationship between x and y.

Pearson’s 𝑟 Formula:

$$r = \frac{n\sum xy - \sum x \sum y}{\sqrt{[n\sum x^2 - (\sum x)^2][n\sum y^2 - (\sum y)^2]}}$$

where
$r$ = Pearson Product Moment Coefficient of Correlation

$n$ = sample size

$\sum xy$ = the summation of the products of x and y

$\sum x \sum y$ = the product of the summation of x and the summation of y

$\sum x^2$ = summation of the squares of x

$\sum y^2$ = summation of the squares of y

The quantitative interpretation of the degree of linear relationship existing is shown in the following range of
values.
r                    Interpretation
±1.00                Perfect positive (negative) correlation
±0.91 to ±0.99       Very high positive (negative) correlation
±0.71 to ±0.90       High positive (negative) correlation
±0.51 to ±0.70       Moderately positive (negative) correlation
±0.31 to ±0.50       Low positive (negative) correlation
±0.01 to ±0.30       Negligible positive (negative) correlation
0.00                 No correlation

Example:

Consider the following paired data and see how correlation is done.

Below are the midterm (x) and final (y) examination grades of 10 students. Compute Pearson's $r$.

x     y
75    80
70    75
65    65
90    95
85    90
85    85
80    90
70    75
65    70
90    90

Solution:

To compute Pearson's r conveniently, add columns and a row of totals for the other quantities in
the formula. That is,

x     y     $x^2$    $y^2$    xy
75    80    5625     6400     6000
70    75    4900     5625     5250
65    65    4225     4225     4225
90    95    8100     9025     8550
85    90    7225     8100     7650
85    85    7225     7225     7225
80    90    6400     8100     7200
70    75    4900     5625     5250
65    70    4225     4900     4550
90    90    8100     8100     8100

$\sum x = 775$, $\sum y = 815$, $\sum x^2 = 60{,}925$, $\sum y^2 = 67{,}325$, $\sum xy = 64{,}000$

Computation for r:
$$r = \frac{n\sum xy - \sum x \sum y}{\sqrt{[n\sum x^2 - (\sum x)^2][n\sum y^2 - (\sum y)^2]}}$$

$$r = \frac{10(64{,}000) - 775(815)}{\sqrt{[10(60{,}925) - (775)^2][10(67{,}325) - (815)^2]}}$$

$r = 0.949$

Thus, there is a very high positive correlation between the midterm and final grades.
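
The column totals and the value of r can be reproduced with a short Python sketch. The variable names are illustrative; the computation follows the Pearson's r formula given in this section.

```python
import math

# Midterm (x) and final (y) examination grades of the 10 students
midterm = [75, 70, 65, 90, 85, 85, 80, 70, 65, 90]
final = [80, 75, 65, 95, 90, 85, 90, 75, 70, 90]
n = len(midterm)

# Totals corresponding to the last row of the computation table
sum_x = sum(midterm)
sum_y = sum(final)
sum_x2 = sum(x * x for x in midterm)
sum_y2 = sum(y * y for y in final)
sum_xy = sum(x * y for x, y in zip(midterm, final))

numerator = n * sum_xy - sum_x * sum_y
denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
r = numerator / denominator

print(sum_x, sum_y, sum_x2, sum_y2, sum_xy, round(r, 3))
```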

Testing the Significance of a Correlation Coefficient

The values of $x$ and $y$ used to determine a coefficient of correlation can be considered a sample from the
population of all possible values of $x$ and $y$. The values of $x$ and the corresponding values of $y$ are often called sample
pairs, and the assumption made when determining the significance of a correlation coefficient is that the sample pairs are
part of a normal distribution of all sample pairs forming the population.
Using the test of hypothesis, let the null hypothesis $H_0: \rho = 0$ mean that there is no significant correlation
(relationship) between x and y, and let the alternative hypothesis $H_a: \rho \neq 0$ mean that there is a significant correlation
between the values of $x$ and $y$. The test statistic to apply is
$$t = r\sqrt{\frac{n - 2}{1 - r^2}}$$
with $n - 2$ degrees of freedom.

This testing method tells whether the relationship that exists between the paired data is significant or not.
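
As a rough illustration, the test statistic can be computed for the midterm and final grades example ($r = 0.949$, $n = 10$). The significance level is not specified in this unit, so the sketch below assumes a two-tailed test at the 0.05 level; the critical value 2.306 for 8 degrees of freedom is taken from a standard t-table and is likewise an assumption of the example.

```python
import math

r = 0.949  # Pearson's r from the midterm/final grades example
n = 10     # number of sample pairs

# Test statistic t = r * sqrt((n - 2) / (1 - r^2)) with n - 2 degrees of freedom
degrees_of_freedom = n - 2
t = r * math.sqrt((n - 2) / (1 - r ** 2))

# Assumed for illustration: two-tailed test, alpha = 0.05, df = 8 (standard t-table value)
critical_t = 2.306
significant = abs(t) > critical_t

print(f"t = {t:.2f}, df = {degrees_of_freedom}, significant: {significant}")
```

Under these assumptions the computed t is well above the critical value, so the null hypothesis of no correlation would be rejected.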

Linear Regression

Regression analysis is used to predict the behavior of a variable. In linear regression, the regression equation
is the equation of a line, which means that there should be a significant linear relationship between the paired data.
The graph of the paired data should illustrate the line of best fit, or the least squares regression line.
The least squares regression line for a set of bivariate data is the line that minimizes the sum of the squares of the
vertical deviations from each data point to the line. The equation of the least squares regression line for the n ordered pairs
$(x_1, y_1), (x_2, y_2), (x_3, y_3), \ldots, (x_n, y_n)$ is

$$\hat{y} = a + bx$$

where: $\hat{y}$ = criterion measure/dependent variable
x = predictor/independent variable
a = ordinate of the point where the regression line crosses the y-axis (y-intercept)
b = beta weight or the slope of the line

To get the regression equation, the values of a and b are computed using the formulas below.

$$b = \frac{n\sum xy - \sum x \sum y}{n(\sum x^2) - (\sum x)^2}$$

$$a = \frac{\sum y - b\sum x}{n}$$

Example:

Study the following example. Pay attention to how the regression equation is computed.

The data in the table represent the memberships at a university mathematics club during the past 5 years.

Year (x)    Membership (y)
1           25
2           30
3           32
4           45
5           50

Using linear regression, predict the membership 5 years from now.

Solution:

$\sum y = 182$, $\sum x = 15$, $\sum xy = 611$, $n = 5$, $\sum x^2 = 55$

$$b = \frac{n\sum xy - \sum x \sum y}{n(\sum x^2) - (\sum x)^2} = \frac{5(611) - 15(182)}{5(55) - 15^2} = 6.50$$

$$a = \frac{\sum y - b\sum x}{n} = \frac{182 - 6.5(15)}{5} = 16.90$$

$\hat{y} = a + bx$
$\hat{y} = 16.90 + 6.5x$

Since you need to predict the membership five years from now, that is, at year 10, substitute 10 for x in the equation.

Thus, 5 years from now, $\hat{y} = 16.90 + 6.5(10)$

$\hat{y} = 81.9$
$\hat{y} \approx 82$ members

Five years from now, the club would have 82 members.
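
The slope, intercept, and prediction can be reproduced with the short sketch below. The function name `predict` and the variable names are illustrative; the formulas are the ones for a and b given in this section.

```python
# Year (x) and club membership (y) for the past 5 years
years = [1, 2, 3, 4, 5]
members = [25, 30, 32, 45, 50]
n = len(years)

sum_x = sum(years)
sum_y = sum(members)
sum_xy = sum(x * y for x, y in zip(years, members))
sum_x2 = sum(x * x for x in years)

# Slope and intercept of the least squares regression line
b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
a = (sum_y - b * sum_x) / n


def predict(x):
    """Predicted membership for year x from the fitted line y = a + b*x."""
    return a + b * x


prediction = predict(10)  # 5 years from now is year 10
print(f"y = {a:.2f} + {b:.2f}x, predicted membership at year 10: {prediction:.1f}")
```

Year 10 lies outside the observed range of years 1 to 5, so the prediction is an extrapolation that assumes the same linear trend continues.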

References

Walpole, R. E. (2006). Introduction to Probability and Statistics Third Edition, Pearson Education South Asia Pte Ltd.

Parreño, E. B., and Jimenez, R. O. (2006). Basic Statistics: A Work text, C & E Publishing, Inc.

Reyes, J. L. et al. (2018). Mathematics in the Modern World. Panday-Lahi Publishing House, Inc., Muntinlupa City

Assessing Learning (Unit 4-A)

Name: _______________________________________________________Score: ______________________________

Course/Year/Section: ___________________________________________ Date: _______________________________

Directions: Answer the following items. Write your answer in the space provided.

A. Consider the following problem.


A class of 50 students took an 80-item test and their scores are as follows:
44 12 56 22 68 72 41 59 27 54
56 45 34 37 23 49 44 60 36 39
42 38 59 51 47 50 26 52 47 36
35 32 46 33 50 22 39 49 52 29
31 55 16 40 45 35 60 25 72 63

1. Arrange the data in an array.

2. Arrange the data in a Stem-and-Leaf Diagram.

3. Construct a Frequency Distribution Table (FDT) having the following parts: class, frequency, class mark, class
boundary, relative frequency, less than cumulative frequency and greater than cumulative frequency.

4. From the FDT, present the data in a bar graph.


B. Consider the following data regarding the height (in cm) of 10 BS Chemistry students.
170, 165, 155, 160, 150, 149, 152, 161, 163, 175

1. Compute the mean, median, and mode.

2. Compute the range, variance, and standard deviation.



Assessing Learning (Unit 4-B)

Name: _______________________________________________________Score: ______________________________

Course/Year/Section: ___________________________________________ Date: _______________________________

Directions: Answer the following items regarding normal distributions. Write your answer in the space provided.

A. A normal distribution of scores has a standard deviation of 10 and mean of 52.3. Find the z-scores corresponding to
each of the following values.
Answer:
1. Score = 75 __________________________
2. Score = 32 __________________________
3. Score = 45 __________________________
4. Score = 62 __________________________
5. Score = 73 __________________________

B. A normal distribution of scores has a standard deviation of 5 and mean of 72. Find the real scores corresponding to
each of the following values.
Answer:
1. z = 2 __________________________
2. z = -0.8 __________________________
3. z = -2.6 __________________________
4. z = 2.6 __________________________
5. z = 1.9 __________________________

C. Sketch the following z-scores on the standard normal curve. Find the area under the standard normal curve
that satisfies each of the following conditions.
1. to the right of z = 0.78

2. to the left of z = -2.53

3. to the right of z = -1.34

4. to the left of z = 0.35

5. between z = -0.70 and z = 0.51


D. Answer the following problems.
1. Three students take equivalent stress tests. Which is the highest relative score?
A. A score of 144 on a test with a mean of 128 and a standard deviation of 34.
B. A score of 90 on a test with a mean of 86 and a standard deviation of 18.
C. A score of 18 on a test with a mean of 15 and a standard deviation of 5.

Why?

2. Suppose a night-shift technician of an electric company finished his work in 7.4 hours and a day-shift technician of
the same company finished his job in 6.9 hours. Suppose also that the mean and the standard deviation of the
night-shift technicians' completion times are 5.5 and 0.5 hours, respectively, while the mean and standard deviation of
the day-shift technicians' completion times are 6.4 and 0.5 hours, respectively. Which of the two technicians is
the better worker relative to the shift to which he belongs?

3. The salaries of employees of a certain company in Baguio City have a mean of P6,500 and a standard deviation
of P1,500. What is the probability that an employee selected at random will have a salary of
a. more than P6,500

b. less than P7,000

c. between P5,000 and P7,000

4. Sir Jay has 184 students in his college mathematics lecture class. The scores on the midterm exam are normally
distributed with a mean of 72.3 and a standard deviation of 8.9. How many students in the class can be expected
to receive a score between 82 and 90?

Assessing Learning (Unit 4-C)

Name: _______________________________________________________Score: ______________________________

Course/Year/Section: ___________________________________________ Date: _______________________________

Directions: Answer the following items regarding correlation and linear regression. Write your answer in the space
provided.

1. Solve for the Pearson Correlation Coefficient of the following data and interpret.
x y
45 23
54 45
65 34
46 65

2. Solve for the Pearson's $r$ of the following data regarding the height and weight of 10 students and interpret the results.
A table for computing the other values is already arranged for you.

Weight (x)    Height (y)    $x^2$    $y^2$    xy
38            135
38            140
38            137
44            141
44            147
51            145
32            132
51            149
77            164
32            130

$\sum x =$        $\sum y =$        $\sum x^2 =$        $\sum y^2 =$        $\sum xy =$

3. Refer to the data below regarding the grades in English and Mathematics.
Student English Grade (x) Math Grade (y)
A 85 75
B 82 76
C 79 90
D 85 78
E 85 92
F 90 90
G 85 95
H 84 85
I 74 82
J 81 82

Use regression analysis to predict the grade of a student in math if his grade in English is:
a. 77 b. 98 c. 67

Z-SCORE TABLE
AREAS UNDER THE NORMAL CURVE

Z 0 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09


0 0 0.004 0.008 0.012 0.016 0.0199 0.0239 0.0279 0.0319 0.0359
0.1 0.0398 0.0438 0.0478 0.0517 0.0557 0.0596 0.0636 0.0675 0.0714 0.0753
0.2 0.0793 0.0832 0.0871 0.091 0.0948 0.0987 0.1026 0.1064 0.1103 0.1141
0.3 0.1179 0.1217 0.1255 0.1293 0.1331 0.1368 0.1406 0.1443 0.148 0.1517
0.4 0.1554 0.1591 0.1628 0.1664 0.17 0.1736 0.1772 0.1808 0.1844 0.1879
0.5 0.1915 0.195 0.1985 0.2019 0.2054 0.2088 0.2123 0.2157 0.219 0.2224
0.6 0.2257 0.2291 0.2324 0.2357 0.2389 0.2422 0.2454 0.2486 0.2517 0.2549
0.7 0.258 0.2611 0.2642 0.2673 0.2704 0.2734 0.2764 0.2794 0.2823 0.2852
0.8 0.2881 0.291 0.2939 0.2967 0.2995 0.3023 0.3051 0.3078 0.3106 0.3133
0.9 0.3159 0.3186 0.3212 0.3238 0.3264 0.3289 0.3315 0.334 0.3365 0.3389
1 0.3413 0.3438 0.3461 0.3485 0.3508 0.3531 0.3554 0.3577 0.3599 0.3621
1.1 0.3643 0.3665 0.3686 0.3708 0.3729 0.3749 0.377 0.379 0.381 0.383
1.2 0.3849 0.3869 0.3888 0.3907 0.3925 0.3944 0.3962 0.398 0.3997 0.4015
1.3 0.4032 0.4049 0.4066 0.4082 0.4099 0.4115 0.4131 0.4147 0.4162 0.4177
1.4 0.4192 0.4207 0.4222 0.4236 0.4251 0.4265 0.4279 0.4292 0.4306 0.4319
1.5 0.4332 0.4345 0.4357 0.437 0.4382 0.4394 0.4406 0.4418 0.4429 0.4441
1.6 0.4452 0.4463 0.4474 0.4484 0.4495 0.4505 0.4515 0.4525 0.4535 0.4545
1.7 0.4554 0.4564 0.4573 0.4582 0.4591 0.4599 0.4608 0.4616 0.4625 0.4633
1.8 0.4641 0.4649 0.4656 0.4664 0.4671 0.4678 0.4686 0.4693 0.4699 0.4706
1.9 0.4713 0.4719 0.4726 0.4732 0.4738 0.4744 0.475 0.4756 0.4761 0.4767
2 0.4772 0.4778 0.4783 0.4788 0.4793 0.4798 0.4803 0.4808 0.4812 0.4817
2.1 0.4821 0.4826 0.483 0.4834 0.4838 0.4842 0.4846 0.485 0.4854 0.4857
2.2 0.4861 0.4864 0.4868 0.4871 0.4875 0.4878 0.4881 0.4884 0.4887 0.489
2.3 0.4893 0.4896 0.4898 0.4901 0.4904 0.4906 0.4909 0.4911 0.4913 0.4916
2.4 0.4918 0.492 0.4922 0.4925 0.4927 0.4929 0.4931 0.4932 0.4934 0.4936
2.5 0.4938 0.494 0.4941 0.4943 0.4945 0.4946 0.4948 0.4949 0.4951 0.4952
2.6 0.4953 0.4955 0.4956 0.4957 0.4959 0.496 0.4961 0.4962 0.4963 0.4964
2.7 0.4965 0.4966 0.4967 0.4968 0.4969 0.497 0.4971 0.4972 0.4973 0.4974
2.8 0.4974 0.4975 0.4976 0.4977 0.4977 0.4978 0.4979 0.4979 0.498 0.4981
2.9 0.4981 0.4982 0.4982 0.4983 0.4984 0.4984 0.4985 0.4985 0.4986 0.4986
3 0.4987 0.4987 0.4987 0.4988 0.4988 0.4989 0.4989 0.4989 0.499 0.499
