Topic 5 Data Management (Statistics)

Statistics

Introduction and Terminologies

Statistics is a branch of mathematics dealing with the collection, presentation, analysis, and interpretation of data.
History of Statistics

The term statistics came from the Latin phrase “ratio status”
which means study of practical politics or the statesman’s
art.

In the middle of 18th century, the term statistik (a term due
to Achenwall) was used, a German term defined as “the
political science of several countries”

From statistik it became statistics, defined as a statement in figures and facts of the present condition of a state.
Divisions of Statistics
Descriptive
- includes the techniques which are concerned with summarizing and describing numerical data.
- pertains to the methods dealing with the collection, organization and analysis of a set of data without making conclusions, predictions or inferences about a larger set.
Inferential
- demands a higher order of critical and mathematical judgment.
- pertains to the methods dealing with making inferences, estimation or prediction about a larger set of data using the information gathered from a subset of this larger set.
Descriptive statistics
Methods concerned with collecting, describing, and analyzing a set of data without drawing conclusions (or inferences) about a large group.
- A college dean wants to determine the average semestral enrolment in the past 5 school years.
- An instructor wants to know the exact number of students who pass in her subject.
Inferential statistics
Methods concerned with the analysis of a subset of data leading to predictions or inferences about the entire set of data.
- A college dean wants to forecast the average semestral enrolment based on the enrolment for the last 5 school years.
- An instructor would like to predict the number of students who will pass in her subject based on the number of failures last year.
Example of Descriptive Statistics
Present the Philippine population by constructing a graph indicating the total number of Filipinos counted during the last census by age group and sex.
Example of Inferential Statistics
A new milk formulation designed to improve the psychomotor development of infants was tested on randomly selected infants. Based on the results, it was concluded that the new milk formulation is effective in improving the psychomotor development of infants.
Inferential Statistics
[Diagram: a smaller set (n units/observations) is taken from a larger set (N units/observations); inferences and generalizations about the larger set are drawn from the smaller set.]
Definition of Some Basic
Statistical Terms

1. Universe is the set of all individuals or objects under consideration or study.
2. Variable is a characteristic or attribute of individuals or objects which takes on different values or labels.
3. Population is the collection of all elements in which the researcher is interested in a statistical study.
4. Sample is a part or a subset of the population from which the information is collected.
5. Statistic is a descriptive numerical measure computed from a sample.
6. Parameter is a descriptive numerical measure computed from a population.
Data is the base unit of information in statistics; it is a collection of all observations.
- Primary data is data which comes from an original source or first-hand information. Examples of this include surveys and censuses, experiments, or observations.
- Secondary data is data coming from compilations in files or records. Data that fall under this category may include information from previously recorded research or investigations.
Two types of data

Qualitative data
▪ Describes the quality or character of something

Quantitative data
▪ Describes the amount or number of something

a. Discrete
-countable
b. Continuous
-measurable (measured using a continuous scale
such as kilos, cms, grams)
Levels of Measurement

1. Nominal level – classifies data into uniquely different categories in which no order or ranking is made on the data. Examples are gender, political affiliation, and nationality.
2. Ordinal level – classifies data into categories which can be ranked, but clear-cut differences between ranks do not exist. Examples are grades (A, B, C…) and socio-economic status of families.
3. Interval level – classifies and orders the data and also specifies that the distances between each interval on the scale are equidistant. Examples are I.Q. and temperature.
4. Ratio level – ranks the data similar to the interval level of measurement and has an absolute zero point. It allows all arithmetic operations. Examples are height, weight, and time.
Methods of Data Collection
1. Survey Method
· Telephone survey
· Mailed questionnaire
· Personal interview
2. Observation method
The researcher only observes the behavior of individuals in the study
and tries to draw conclusions based on these observations.
3. Experimental method
The researcher influences one of the variables and tries to find out
how the manipulation affects the other variables.
4. Use of existing studies
The researcher uses published or unpublished materials such as
magazines, books, newspapers, journals, and thesis.
5. Registration method
This method is enforced by certain laws. Examples are data derived
from car registration, birth registration, and marriage registration.
Classification of Data collection:
1. Census or complete inventory is a
method of collecting data from every
element in the population.
2. Survey sampling is a method of
collecting data from each selected
sample in a given population.
Sampling Techniques
1. Probability Sampling - Every element in the population has a known chance (in the simplest designs, an equal chance) of being chosen for the sample. The basic types of probability sampling are simple random sampling, systematic sampling, stratified sampling, cluster sampling and multi-stage sampling.

(a) Simple Random Sampling involves selecting a sample of size (n) from the universe (N) such that each member of the population has an equal chance of being included in the sample and all possible combinations of size (n) have an equal chance of being selected as the sample. This sampling method requires a listing of the elements of the population called the sampling frame.
(b) Systematic Sampling involves selecting every kth
element of a series representing the population. A
complete listing of elements is also required in this method.

(c) Stratified Sampling is an extension of simple random sampling which permits different homogeneous groups, called strata, in the population to be represented in the sample. If one wishes to use stratification, the following questions must be asked:
Are there different groups within the population?
Are these differences important to the investigation?
If the answer to both is yes, then stratified sampling is essential.
(d) Cluster Sampling divides the population into groups, called clusters. A random sample of clusters is selected, and then the sampled clusters are subjected to complete enumeration.
(e) Multi-stage Sampling is a method where the elements
in the targeted population are grouped into some kind of
hierarchy of units, and sampling is done in succession. Most
of the surveys conducted by Social Weather Stations and
Pulse Asia are done by this method of sampling. For
example, sampling could be done by Region, then a
sampling of some provinces is done among the selected
regions, then within the selected provinces, a sampling of
cities/municipalities is done. Then, a sample of barangays is
taken within the selected cities/municipalities, and lastly,
households within the barangay.
2. Non-probability sampling
(a) Haphazard or Accidental Sampling involves an unsystematic selection of sample units. Some fields of study, e.g., archaeology, history, and even medicine, draw conclusions from whatever information is made available. Other disciplines, e.g., astronomy, experimental physics and chemistry, often do not care about the representativeness of their specimens.
(b) Convenient Sampling
(c) Quota Sampling
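The simple random and systematic sampling schemes described above can be sketched in a few lines of Python. This is only an illustration: the sampling frame of 100 ID numbers, the function names, and the seed are assumptions made for the example, not part of the lecture.

```python
import random

def simple_random_sample(frame, n, seed=None):
    """Draw n elements without replacement; every subset of size n is equally likely."""
    rng = random.Random(seed)
    return rng.sample(frame, n)

def systematic_sample(frame, n, seed=None):
    """Pick a random start, then take every k-th element, with k = N // n."""
    k = len(frame) // n
    rng = random.Random(seed)
    start = rng.randrange(k)          # random start among the first k elements
    return frame[start::k][:n]

# Hypothetical sampling frame: ID numbers 1..100 (N = 100)
frame = list(range(1, 101))
print(simple_random_sample(frame, 10, seed=1))
print(systematic_sample(frame, 10, seed=1))
```

Both functions need the complete listing of elements (the sampling frame), which mirrors the requirement stated for these two methods.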
Data Presentation
Textual method – narrative description of the data
gathered
Tabular method – shows relationships or comparisons of
data gathered
Graphical – illustrative description of the data
Graphical presentation of discrete type of data:
Bar graph – horizontal or vertical which shows the
length representing the quantity or frequency of each
type of category
Pie chart – is a circle that is divided into wedges
according to the percentage of the frequencies in each
category
Line chart – represents data that occur over a specific
period of time.
Graphical presentation of continuous
type of data:
Histogram
Frequency polygon
Ogive
Measures of Central
Tendency
Central Tendency is any measure
indicating the center of the set of
data, arranged in an increasing or
decreasing order of magnitude. The
most common methods are the
mean, median, and the mode.
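Mean
The mean of a set of numerical data is the most important and most reliable measure, and probably the most frequently used measure. Usually, the mean is denoted by $\mu$ for populations and $\bar{x}$ for samples.

The mean of n numbers is equal to the sum of the numbers divided by n, that is

$$\text{Mean } (\mu \text{ or } \bar{x}) = \frac{\sum x}{n}$$

F1 (ungrouped data, where N = n is the number of values):
$$\bar{X} = \frac{x_1 + x_2 + x_3 + \dots + x_n}{N} = \frac{\sum_{i=1}^{n} x_i}{N}$$

F2 (data with frequencies):
$$\bar{X} = \frac{f_1 X_1 + f_2 X_2 + f_3 X_3 + \dots + f_n X_n}{f_1 + f_2 + f_3 + \dots + f_n} = \frac{\sum_{i=1}^{n} f_i X_i}{\sum_{i=1}^{n} f_i} \quad \text{or} \quad \frac{\sum_{i=1}^{n} f_i X_i}{N}$$
where $f_1 + f_2 + f_3 + \dots + f_n = \sum f_i = N$.

F3 (weighted mean):
$$\bar{X} = \frac{w_1 x_1 + w_2 x_2 + w_3 x_3 + \dots + w_n x_n}{w_1 + w_2 + w_3 + \dots + w_n} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$$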
Examples

1. The number of employees at 5 different department stores are 5, 9, 10, 11, and 15. Find the mean number of employees for the 5 stores.

Solution:
$$\bar{X} = \frac{5 + 9 + 10 + 11 + 15}{5} = 10$$

2. The salaries of 6 employees were P11,000, P9,000, P12,500, P20,000, P15,000, and P24,000. What is the average salary?

Solution:
$$\bar{X} = \frac{11{,}000 + 9{,}000 + 12{,}500 + 20{,}000 + 15{,}000 + 24{,}000}{6} = 15{,}250$$
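The mean formulas above can be checked with a short Python sketch. The function names are illustrative, and the weighted example (grades 90 and 80 with weights 3 and 1) is a made-up case, not data from the lecture. F2 is treated as F3 with the frequencies used as weights.

```python
def mean(values):
    """F1: sum of the values divided by how many there are."""
    return sum(values) / len(values)

def weighted_mean(values, weights):
    """F3: sum of w_i * x_i divided by the sum of the weights.
    F2 is the same computation with frequencies f_i as the weights."""
    total = sum(w * x for w, x in zip(weights, values))
    return total / sum(weights)

# Examples 1 and 2 from the text
print(mean([5, 9, 10, 11, 15]))                           # 10.0
print(mean([11000, 9000, 12500, 20000, 15000, 24000]))    # 15250.0

# Hypothetical weighted case: grade 90 with weight 3, grade 80 with weight 1
print(weighted_mean([90, 80], [3, 1]))                    # 87.5
```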
Examples
3. Cloe has test scores in her Mathematics class of 84, 92, 78, 82, and 90. Find the mean of her test scores.
4. The number of employees at 5 different department stores are 6, 9, 12, 11, and 15. Find the mean number of employees for the 5 stores.
5. Out of 100 numbers, 20 were 5’s, 40 were 4’s, 35 were 7’s and 5 were 3’s. Find the mean.
6. The average IQ of 10 students in Math 314 is 115. If there are 2 students with IQ 101, 3 with IQ 125, 1 with IQ 130, and 3 with IQ 98, what must be the IQ of the other student?
Median
Any list of numbers that is arranged in numerical
order from smallest to largest or from largest to
smallest is a ranked list. The median of a ranked
list is the middle number if there is an odd number
of values in the list, or the arithmetic mean of the
2 middle values if there is an even number of
values in the list.
The median of a ranked list of n numbers is:
- the middle number if n is odd.
- the mean of the two middle numbers if n is even.

The median is the middle or center of the set of data. To compute it, arrange the scores from lowest to highest. If N is odd, the median is the value of the middle term, the $\frac{N+1}{2}$th term. If N is even, there are 2 middle terms, the $\frac{N}{2}$th term and the next term; the median is the average of these two values.
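Examples:

1. A = 5, 7, 9, 10, 11, 11, 12. What is the middle score?

$$n = 7; \quad \tilde{X} = X_{\frac{n+1}{2}} = X_{\frac{7+1}{2}} = X_4 = 10$$
$$\therefore \tilde{X} = 10$$

2. B = 2, 4, 6, 8, 10, 12. What is the median score?

$$n = 6; \quad \tilde{X} = \frac{X_{\frac{n}{2}} + X_{\frac{n+2}{2}}}{2} = \frac{X_{\frac{6}{2}} + X_{\frac{6+2}{2}}}{2} = \frac{X_3 + X_4}{2} = \frac{6 + 8}{2} = 7$$
$$\therefore \tilde{X} = 7$$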
The heights, in inches, of the 5 male faculty members of
the Mathematics Department of the CIT – University are 63
in., 68 in., 60 in., 70 in., and 65 in. Find the median of the
heights.
Solution:
Arranging the heights from lowest to highest, we have
60, 63, 65, 68, 70.
Since n = 5 (odd), the median is the middle value, 65.
Example 3: Find the median for the data in the
following lists. 5, 18, 12, 4, 21, 16.
Solution:
Arranging the values from lowest to highest, we have
4, 5, 12, 16, 18, 21.
Since n = 6 (even), the median is the arithmetic mean of the 2 middle values, 12 and 16.
$$\text{Median} = \frac{12 + 16}{2} = 14$$
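The ranking rule for the median can be written as a small function. This is a minimal sketch; the function name is an assumption, and the data are the worked examples above.

```python
def median(values):
    """Middle value of the ranked list, or the mean of the two middle values."""
    ranked = sorted(values)
    n = len(ranked)
    mid = n // 2
    if n % 2 == 1:
        return ranked[mid]                        # odd n: the (n + 1)/2-th term
    return (ranked[mid - 1] + ranked[mid]) / 2    # even n: average the two middle terms

print(median([63, 68, 60, 70, 65]))       # 65
print(median([5, 18, 12, 4, 21, 16]))     # 14.0
```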
Mode

The mode of a set of data is the value which occurs most often or with
the greatest frequency.

Not all sets of data have modes. There are some cases in which all the numbers occur with equal frequency, hence the set has no mode or it is non-modal. On the other hand, some sets of data may have several values that occur with equal greatest frequency. In this case, there is more than one mode (bi-modal, tri-modal, etc.).

Example 4: During the fire last March in Subangdako, Mandaue City, the first 10 donors who extended monetary help to the victims gave P 200, P 500, P 400, P 200, P 300, P 100, P 200, P 300, P 1,000, P 800. Find the mode of the donations.
Examples: Identify the mode(s) of the following data sets.

1. 2 5 2 3 5 2 1
   $\hat{X} = 2$

2. 1 2 3 3 2 1 4
   $\hat{X} = 1, 2,$ and $3$

3. Red Blue White Yellow Blue Pink Green
   $\hat{X}$ = Blue
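Counting frequencies makes the mode easy to find, including the multi-modal and non-modal cases. The sketch below uses `collections.Counter` from the standard library; the function name `modes` and the choice to return an empty list for a non-modal set are assumptions of this example.

```python
from collections import Counter

def modes(values):
    """Return every value that occurs with the greatest frequency.
    Returns [] when all distinct values are equally frequent (non-modal)."""
    counts = Counter(values)
    top = max(counts.values())
    if len(counts) > 1 and all(c == top for c in counts.values()):
        return []
    return [v for v, c in counts.items() if c == top]

print(modes([2, 5, 2, 3, 5, 2, 1]))     # [2]
print(modes([1, 2, 3, 3, 2, 1, 4]))     # [1, 2, 3]
print(modes(["Red", "Blue", "White", "Yellow", "Blue", "Pink", "Green"]))  # ['Blue']
```

Because only counting is involved, the same function works for qualitative data such as the colors in the third example.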
Measures of Dispersion
The average does not give adequate description of the
set of data. We may know the middle value but it does
not tell the whole story at all.
Consider two basketball teams with 10 members, team A
and team B, where they have the same average age of
35. Knowing only the 35 year old average, we may think
that both teams have players within 30 to 40. Now, if we
show the ages of the team members, we have
Team A: 32, 35, 36, 38, 30, 40, 36, 39, 30, 34
Team B: 54, 20, 32, 45, 18, 30, 58, 20, 49, 24
As we can see, team A’s members have ages very close
to the average, but team B has some members very
much older and some are very much younger from the
average age.
In describing a set of numbers, it is useful not only to designate an average but also to indicate the variability or dispersion of the numbers. Dispersion refers to the spread of the numbers in the set about the average. Dispersion can be measured in terms of the range, the variance, and the standard deviation.
The Range
The Range of a set of data is the difference between the largest and the
smallest number in the set.

Example 5: In a small department of a production plant with 6 employees, the employees’ salaries are P 24,800, P 32,750, P 12,400, P 8,200, P 15,000, and P 44,200. Find the range of the salaries.

Solution:
Range = P 44,200 – P8,200
Range = P 36,000

While the range is the most easily calculated measure of variability, the
range is dependent entirely on 2 extreme values and is insensitive to what is
happening in between. Thus, it is considered as the least satisfactory measure
of dispersion.
The Variance
Variance is the measure of variability that indicates dispersion around the mean. It makes use of the individual amount that each data value deviates from the mean. These deviations, represented by $(x - \mu)$, are positive when the data value is greater than the mean, and negative when the data value is less than the mean. The sum of all the deviations $(x - \mu)$ is 0 for all sets of data, and thus cannot be used in computing variance; instead we make use of the squares of the deviations.

Variance is represented by $\sigma^2$ for a population and $s^2$ for a sample, and is defined by the following equations.

If $x_1, x_2, x_3, \dots, x_n$ is a population of n numbers with a mean of $\mu$, then the variance of the population is
$$\sigma^2 = \frac{\sum (x - \mu)^2}{n}.$$

If $x_1, x_2, x_3, \dots, x_n$ is a sample of n numbers with a mean of $\bar{x}$, then the variance of the sample is
$$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}.$$
Example 6: During the Bb. Pilipinas Beauty Pageant last April 18, 2018, the
scores of the 6 judges for the winning candidate were 92, 94, 88, 97, 85, and
90. Compute the variance of the scores.

Solution:

$$\mu = \frac{92 + 94 + 88 + 97 + 85 + 90}{6} = 91$$

$$\sum (x - \mu)^2 = (92 - 91)^2 + (94 - 91)^2 + (88 - 91)^2 + (97 - 91)^2 + (85 - 91)^2 + (90 - 91)^2 = 92$$

$$\sigma^2 = \frac{\sum (x - \mu)^2}{n} = \frac{92}{6} \approx 15.33$$
Example 7: The following are samples taken for the volume content of a
500 – ml pack orange juice by Sunkist Orange.

501 ml, 498 ml, 505 ml, 492 ml, 500 ml.

Compute the variance.

Solution:

$$\bar{x} = \frac{501 + 498 + 505 + 492 + 500}{5} = 499.2$$

$$\sum (x - \bar{x})^2 = (501 - 499.2)^2 + (498 - 499.2)^2 + (505 - 499.2)^2 + (492 - 499.2)^2 + (500 - 499.2)^2 = 90.8$$

$$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1} = \frac{90.8}{5 - 1} = 22.7$$
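The two variance formulas differ only in the divisor, which a short sketch makes explicit. The function names are illustrative; the data are those of Examples 6 and 7.

```python
def population_variance(values):
    """Sum of squared deviations from the mean, divided by n."""
    n = len(values)
    mu = sum(values) / n
    return sum((x - mu) ** 2 for x in values) / n

def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    xbar = sum(values) / n
    return sum((x - xbar) ** 2 for x in values) / (n - 1)

# Example 6: judges' scores treated as a population
print(round(population_variance([92, 94, 88, 97, 85, 90]), 2))   # 15.33

# Example 7: juice volumes treated as a sample
print(round(sample_variance([501, 498, 505, 492, 500]), 2))      # 22.7
```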
The Standard Deviation
Standard Deviation is the square root of the variance. Typically, the standard deviation is represented by $\sigma$ for a population and s for a sample, and is defined in the same manner as variance.

If $x_1, x_2, x_3, \dots, x_n$ is a population of n numbers with a mean of $\mu$, then the standard deviation of the population is
$$\sigma = \sqrt{\frac{\sum (x - \mu)^2}{n}}.$$

If $x_1, x_2, x_3, \dots, x_n$ is a sample of n numbers with a mean of $\bar{x}$, then the standard deviation of the sample is
$$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}.$$
Example 8: The table below shows the measurements, in liters, for 2 samples of soft drinks bottled by companies X and Y.

Sample X   1.07   0.97   0.95   1.00   1.01
Sample Y   1.06   1.12   0.90   0.91   1.01

1. Compute the standard deviation of X.
2. Compute the standard deviation of Y.
3. What does the result imply?
Solution:
1. Sample X:
$$\bar{x} = \frac{1.07 + 0.97 + 0.95 + 1.00 + 1.01}{5} = 1$$
$$\sum (x - \bar{x})^2 = (1.07 - 1)^2 + (0.97 - 1)^2 + (0.95 - 1)^2 + (1 - 1)^2 + (1.01 - 1)^2 = 0.0084$$
$$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}} = \sqrt{\frac{0.0084}{5 - 1}} \approx 0.0458$$

2. Sample Y:
$$\bar{x} = \frac{1.06 + 1.12 + 0.90 + 0.91 + 1.01}{5} = 1$$
$$\sum (x - \bar{x})^2 = (1.06 - 1)^2 + (1.12 - 1)^2 + (0.90 - 1)^2 + (0.91 - 1)^2 + (1.01 - 1)^2 = 0.0362$$
$$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}} = \sqrt{\frac{0.0362}{5 - 1}} \approx 0.0951$$
3. Since the standard deviation of Company X is smaller, the soft drinks bottled by Company X are more consistent in volume content than those bottled by Company Y.

If a set of data has a small standard deviation, we would expect most of the values to be located closely around the mean. However, a large value of the standard deviation indicates that the values are more spread out from the mean.
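Python's standard `statistics` module already provides both versions of the standard deviation: `statistics.stdev` divides by n − 1 (sample) and `statistics.pstdev` divides by n (population). The snippet below is a quick check of Example 8 under the assumption that both lists are samples; the variable names are illustrative.

```python
import statistics

sample_x = [1.07, 0.97, 0.95, 1.00, 1.01]
sample_y = [1.06, 1.12, 0.90, 0.91, 1.01]

s_x = statistics.stdev(sample_x)   # sample standard deviation (divisor n - 1)
s_y = statistics.stdev(sample_y)

print(round(s_x, 4), round(s_y, 4))
print("X is more consistent" if s_x < s_y else "Y is more consistent")
```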
Measures of Relative Position

In the previous section, we have seen several ways of choosing a value to represent the center of a set of data. It is also important to talk about the position of any other value. In some situations, one may be interested in where a specific value falls in a given set of data.
Simple Ranking

Simple Ranking involves the arrangement of the values in some order and noting where in the order a particular value falls. For example, Nathan ranked second in a graduating class of 254 students. A basketball player on a team of 15 knows that he is the 5th best free throw shooter. Even without numerical values associated with the elements, simple ranking is still useful, as in ranking energy drinks on their effectiveness without numerical measurements of strength.
Percentile Ranking
Most standardized examinations provide scores in terms of percentiles, which indicate what percent of all scores fall below the value under consideration. Percentile Ranking is useful in comparing positions with different bases, such as comparing a rank of 120 out of 480 with a rank of 176 out of 880. Taking rank 1 as the highest, we can easily compare by noting that the first has a percentile rank of 75% (360 of the 480 fall below it) while the second has a percentile rank of 80% (704 of the 880 fall below it).

The pth Percentile
A value x is called the pth percentile of a data set, provided p% of the data values are less than x.

$$\text{Percentile of } x = \frac{\text{number of data values less than } x + 0.5}{\text{total number of data values}} \times 100$$
Example 9: Record on Series Bus Liner shows that the median of the annual salary of their bus drivers is P 172,840. If the 30th percentile of the salary is P 105,782, find the percentage of their drivers whose salaries are
1. less than P 172,840
2. more than P 105,782
3. between P 105,782 and P 172,840
Solution:
1. Since P 172,840 is the median, 50% of the drivers have an annual salary less than P 172,840.
2. Since P 105,782 is the 30th percentile, 100% - 30% = 70% of the drivers have an annual salary more than P 105,782.
3. 50% - 30% = 20% of the drivers have a salary between P 105,782 and P 172,840.
A teacher gives a 20-point test to 10 students. The scores are shown below. Find the percentile rank of a score of 15.
10, 20, 3, 5, 6, 8, 18, 12, 15, 2

Solution:
Ranked scores: 2, 3, 5, 6, 8, 10, 12, 15, 18, 20

$$\text{Percentile of } x = \frac{\text{number of data values less than } x + 0.5}{\text{total number of data values}} \times 100 = \frac{7 + 0.5}{10} \times 100 = 75$$

Therefore, the score 15 is the 75th percentile.
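The percentile-rank formula above translates directly into code. The function name is an assumption; the data are the ten test scores from the example.

```python
def percentile_rank(values, x):
    """(number of values less than x + 0.5) / (total number of values) * 100."""
    below = sum(1 for v in values if v < x)
    return (below + 0.5) / len(values) * 100

scores = [10, 20, 3, 5, 6, 8, 18, 12, 15, 2]
print(percentile_rank(scores, 15))   # 75.0
```

Note that the data do not need to be ranked first, since only the count of smaller values matters.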
For ungrouped data
Examples
Determine the values of P20, P50, Q1, and D4 of the given data:

5, 6, 6, 8, 20, 15, 14, 16, 18, 11, 13, 14, 16, 9, 8, 10.

Solution:

Arrange the scores (L – H):

5, 6, 6, 8, 8, 9, 10, 11, 13, 14, 14, 15, 16, 16, 18, 20

1. $P_{20}$ is at $\frac{20}{100}(16) = 3.2$, which is not a whole number, so take the next higher whole number as the position. $P_{20}$ is the 4th score.
∴ $P_{20} = 8$. This implies that 20% of the set of scores is below 8 or 80% is above 8.

2. $P_{50}$ is at $\frac{50}{100}(16) = 8$, which is a whole number, so take the position halfway between the 8th and 9th scores.
$$P_{50} = \frac{8\text{th} + 9\text{th}}{2} = \frac{11 + 13}{2} = 12$$

3. $Q_1$ is at $\frac{1}{4}(16) = 4$, which is a whole number, so take the position halfway between the 4th and 5th scores.
$$Q_1 = \frac{4\text{th} + 5\text{th}}{2} = \frac{8 + 8}{2} = 8$$

4. $D_4$ is at $\frac{4}{10}(16) = 6.4$, which is not a whole number, so take the next higher whole number as the position. $D_4$ is the 7th score.
∴ $D_4 = 10$

5. $D_9$ is at $\frac{9}{10}(16) = 14.4$, which is not a whole number, so take the 15th score.
∴ $D_9 = 18$
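The position rule used in these solutions (round up when the position is not whole, average two neighbors when it is) can be sketched as one function. Quartiles and deciles are expressed as percentiles here (Q1 = P25, D4 = P40, D9 = P90); the function name and the integer arithmetic used to avoid rounding problems are choices made for this example.

```python
def percentile_value(values, p):
    """Value at the p-th percentile (0 < p < 100, whole percent) for ungrouped data."""
    if not 0 < p < 100:
        raise ValueError("p must be between 0 and 100")
    ranked = sorted(values)
    n = len(ranked)
    whole, rem = divmod(p * n, 100)    # position = p/100 * n, kept as integers
    if rem == 0:
        # Whole-number position: average that score and the next one
        return (ranked[whole - 1] + ranked[whole]) / 2
    # Otherwise take the next higher whole-number position (1-based: whole + 1)
    return ranked[whole]

data = [5, 6, 6, 8, 20, 15, 14, 16, 18, 11, 13, 14, 16, 9, 8, 10]
print(percentile_value(data, 20))   # P20 -> 8
print(percentile_value(data, 50))   # P50 -> 12.0
print(percentile_value(data, 25))   # Q1  -> 8.0
print(percentile_value(data, 40))   # D4  -> 10
print(percentile_value(data, 90))   # D9  -> 18
```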
Quartile Ranking
Quartiles are values that divide a set of data into 4 equal
parts, denoted by Q1, Q2, and Q3, such that 25% of the
data lie below Q1, 50% of the data lie below Q2, and 75%
of the data lie below Q3. Q1 is called the first quartile, Q2
is called the second quartile and it is the median of the
set, and Q3 is called the third quartile.
A method using medians can easily be employed in
finding quartiles and has the following steps below.
1. Form a ranked list of the data.
2. Find the median of the ranked data. This is the second
quartile, Q2.
3. The first quartile, Q1 is the median of the values less
than Q2.
4. The third quartile, Q3 is the median of the values
greater than Q2.
The following are the result of testing a sample of 9 batteries for the battery
life, in hours, of Ever Glow Battery.

6.2 6.4 7.1 5.6 8.3 6.8 5.3 7.2 9.3

Find the quartiles of the given set of data.

Solution:

Rank the data: 5.3 5.6 6.2 6.4 6.8 7.1 7.2 8.3 9.3

Median (Q2) = 6.8, the 5th of the 9 ranked values.
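The median-based steps for finding quartiles can be sketched with `statistics.median` from the standard library. The function name `quartiles_by_medians` is an assumption; the halves are formed from the values strictly less than and strictly greater than Q2, as the steps state.

```python
import statistics

def quartiles_by_medians(values):
    """Q2 is the median; Q1 and Q3 are the medians of the values below and above Q2."""
    ranked = sorted(values)
    q2 = statistics.median(ranked)
    lower = [v for v in ranked if v < q2]
    upper = [v for v in ranked if v > q2]
    return statistics.median(lower), q2, statistics.median(upper)

battery_life = [6.2, 6.4, 7.1, 5.6, 8.3, 6.8, 5.3, 7.2, 9.3]
q1, q2, q3 = quartiles_by_medians(battery_life)
print(round(q1, 2), q2, round(q3, 2))
```

For the battery data this gives Q1 = 5.9, Q2 = 6.8, and Q3 = 7.75 hours. Other textbooks and software use slightly different quartile conventions, so results may differ for other data sets.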
The z – score
One problem that may arise in statistics is comparing 2
observations from 2 different populations. Let’s say for
example, the supervisor of a construction project
evaluates the speed of his 2 workers for possible salary
increase. The first worker is laying hollow blocks, and he
can lay 120 hollow blocks per day. The second worker is
setting tiles and he can set 60 tiles in one day. Which
one is the faster worker? Of course we cannot decide
unless we have the basis for comparison.

A method of ranking the observations is to convert them into standard units known as z-scores or z-values. The z-score for a given data value x is the number of standard deviations that x is above or below the mean of the data. The following formulas show how to calculate the z-score for a data value x in a population and in a sample.

$$\text{Population: } z = \frac{x - \mu}{\sigma} \qquad \text{Sample: } z = \frac{x - \bar{x}}{s}$$

Note: z is positive if x is greater than (or above) the mean, and z is negative if x is less than (or below) the mean.
Now, consider that in hollow block laying, a worker can lay 100 hollow blocks in one day on average with a standard deviation of 5, and that in tile setting a worker averages 50 tiles in one day with a standard deviation of 2. Converting their speeds to z-scores, we have

$$\text{Worker 1: } z_1 = \frac{x - \mu}{\sigma} = \frac{120 - 100}{5} = 4 \quad \text{(4 standard deviations above the mean)}$$
$$\text{Worker 2: } z_2 = \frac{x - \mu}{\sigma} = \frac{60 - 50}{2} = 5 \quad \text{(5 standard deviations above the mean)}$$

Comparing z-scores, we see that the second worker is faster compared to the first.
Akio Zachary took a test in Mathematics and scored 82 for which the mean of
all scores was 68 and a standard deviation of 8. He also took an examination
in Chemistry and scored 89 for which the mean was 80 and a standard
deviation of 6. Did he do better in Mathematics or in Chemistry?

Solution:
$$\text{Mathematics: } z_1 = \frac{x - \mu}{\sigma} = \frac{82 - 68}{8} = 1.75$$
$$\text{Chemistry: } z_2 = \frac{x - \mu}{\sigma} = \frac{89 - 80}{6} = 1.5$$

Therefore he did better in Mathematics.
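Comparing observations from different distributions reduces to one subtraction and one division. The sketch below reuses the numbers from the two comparisons above; the function name is illustrative.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd

# Hollow-block layer vs. tile setter
print(z_score(120, 100, 5), z_score(60, 50, 2))    # 4.0 5.0

# Akio Zachary: Mathematics vs. Chemistry
math_z = z_score(82, 68, 8)
chem_z = z_score(89, 80, 6)
print(math_z, chem_z)                               # 1.75 1.5
print("Mathematics" if math_z > chem_z else "Chemistry")
```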


AllBright Company, Inc. is a company producing light bulbs. Test record
showed that on the average the life expectancy of their bulbs was 784 hours. A
bulb was tested and it lasted for 676 hours and they found out that this is 1.5
standard deviations below the mean. What is the standard deviation of the life
of bulbs?

Solution:

The z-score formula involves 4 variables. If any 3 of them are known, the fourth can be solved for by using the formula.
$$z = \frac{x - \bar{x}}{s}$$
$$-1.5 = \frac{676 - 784}{s}$$
$$s = \frac{-108}{-1.5} = 72$$
Frequency Distribution
Oftentimes, we encounter a problem
of assessing a large group of data.
Such problem can be simplified by
grouping the data into different classes
and determining the number of
observations that fall in each class.
Such data is called grouped data and
the arrangement in tabular form at
which they are grouped is called a
frequency distribution.
Ordered array - is a listing of values
from the smallest to largest values or
conversely.

32 43 48 55 62 69 74 83
38 43 49 59 63 72 76 84
38 45 49 59 63 72 77 85
40 45 51 62 64 72 79 85
40 46 54 62 65 74 83 93
Stem and leaf display of data - a device that is
useful for representing relatively small quantitative
data sets.
(Using the set of scores above.)

STEM LEAF

3 2,8,8
4 0,0,3,3,5,5,6,8,9,9
5 1,4,5,9,9
6 2,2,2,3,3,4,5,9
7 2,2,2,4,4,6,7,9
8 3,3,4,5,5
9 3
The Frequency Distribution Table (FDT)

- Frequency Distribution - refers to the tabular arrangement of data by classes or categories together with their corresponding class frequencies.
- Class frequency - refers to the number of observations belonging to a class interval, or the number of items within a category.
- Class interval - is a grouping or category defined by a lower limit and an upper limit.
How to construct frequency
distribution
1. Determine (k) the desired number of class intervals.
STURGES' RULE: k = 1 + 3.3221 log N
2. Determine the highest and lowest values of the given data set
and find the range (R) of values.
Range = HS - LS
3. Determine the class size or the width (w) of the class intervals
w = R/k
4. Determine the lower limit and the upper limit of the first class
interval.
5. Determine the frequency of values falling within each class
interval, and check if the sum of these frequencies is equal to the
sample size.
6. Tally the scores/observations falling in each class.
Other Columns in the Frequency Distribution Table
Class Boundaries – true limits, halfway between succeeding intervals

Lower Class Boundary
LCB = (Upper Limit of the previous class interval + Lower Limit) ÷ 2
Upper Class Boundary
UCB = (Upper Limit + Lower Limit of the next class interval) ÷ 2
Class Marks (Xi) – midpoint of the class interval, where the observations tend to cluster.
Xi = ½ (LL + UL) or Xi = ½ (LCB + UCB)
Cumulative Frequency - arrangement of data by class intervals whose frequencies are cumulated or accumulated

Less than Cf (<Cf) – total number of observations whose values are less than or equal to the upper limit of the class.

Greater than Cf (>Cf) – total number of observations whose values are greater than or equal to the lower limit of the class.

Relative Frequency - is an arrangement of data showing the proportion, in percent, of each frequency to the total frequency.
Rf = frequency ÷ N
%Rf = (frequency ÷ N) × 100%
Example
Consider the following final exam scores of 40
students:

48 79 83 84 62 62 43 72
45 46 59 93 64 59 32 54
83 55 45 76 72 40 51 72
65 49 62 85 74 40 74 49
69 38 85 77 63 38 43 63

1. Make an ordered array.


2. Make a stem and leaf display of data.
3. Construct the FDT and its component.
4. Graph: frequency polygon, histogram and ogive.
1. Ordered array presentation

32 43 48 55 62 69 74 83
38 43 49 59 63 72 76 84
38 45 49 59 63 72 77 85
40 45 51 62 64 72 79 85
40 46 54 62 65 74 83 93
2. Stem and leaf display of data

STEM LEAF
3 2,8,8
4 0,0,3,3,5,5,6,8,9,9
5 1,4,5,9,9
6 2,2,2,3,3,4,5,9
7 2,2,2,4,4,6,7,9
8 3,3,4,5,5
9 3
The Frequency Distribution Table (FDT)
k = 1 + 3.3221 log 40 = 6.3222, rounded up to 7

R = HS – LS = 93 – 32 = 61

w = R/k = 61/7 = 8.7143, rounded up to 9

Class interval   Class frequency (f)   Class boundaries (L.B. – U.B.)   Class mark (Xi)   <Cf   >Cf   %Rf
32 – 40          5                     31.5 – 40.5                      36                5     40    12.5
41 – 49          8                     40.5 – 49.5                      45                13    35    20
50 – 58          3                     49.5 – 58.5                      54                16    27    7.5
59 – 67          9                     58.5 – 67.5                      63                25    24    22.5
68 – 76          7                     67.5 – 76.5                      72                32    15    17.5
77 – 85          7                     76.5 – 85.5                      81                39    8     17.5
86 – 94          1                     85.5 – 94.5                      90                40    1     2.5
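The construction steps can be automated as a check on the table. The sketch below assumes whole-number scores, starts the first class at the lowest score, and rounds both k and w up as in the computation above; the function name and the dictionary keys are illustrative.

```python
import math

scores = [48, 79, 83, 84, 62, 62, 43, 72,
          45, 46, 59, 93, 64, 59, 32, 54,
          83, 55, 45, 76, 72, 40, 51, 72,
          65, 49, 62, 85, 74, 40, 74, 49,
          69, 38, 85, 77, 63, 38, 43, 63]

def frequency_table(data):
    n = len(data)
    k = math.ceil(1 + 3.3221 * math.log10(n))    # Sturges' rule, rounded up
    low, high = min(data), max(data)
    # Round R/k up to a whole width; the "+ 1" form also guarantees that the
    # highest score fits when R/k is already a whole number.
    w = (high - low) // k + 1
    rows, cum = [], 0
    for i in range(k):
        ll = low + i * w                  # lower limit
        ul = ll + w - 1                   # upper limit
        f = sum(1 for x in data if ll <= x <= ul)
        cum += f
        rows.append({
            "interval": (ll, ul),
            "f": f,
            "boundaries": (ll - 0.5, ul + 0.5),
            "mark": (ll + ul) / 2,
            "less_cf": cum,
            "greater_cf": n - cum + f,
            "pct_rf": 100 * f / n,
        })
    return rows

for row in frequency_table(scores):
    print(row)
```

Running it on the 40 exam scores reproduces the seven classes and the frequency, cumulative frequency, and relative frequency columns shown in the table.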
The Frequency Polygon
[Figure: frequency polygon; x-axis: class marks 27, 36, 45, 54, 63, 72, 81, 90, 99; y-axis: class frequency]
Histogram
➢ Histogram is a chart in which the rectangular bars are constructed at the boundaries of each class.
[Figure: histogram; x-axis: class boundaries 31.5, 40.5, 49.5, 58.5, 67.5, 76.5, 85.5, 94.5; y-axis: class frequency]
Ogive
➢ ogive (< ogive and > ogive)
[Figure: less-than and greater-than ogives; x-axis: class boundaries 31.5, 40.5, 49.5, 58.5, 67.5, 76.5, 85.5, 94.5; y-axis: cumulative frequency]

For the < ogive, the x-axis values are the upper class boundaries and the y-axis values are the < cumulative frequencies.

For the > ogive, the x-axis values are the lower class boundaries and the y-axis values are the > cumulative frequencies.

NOTE: At the point of intersection (x, y) of the two ogives, x = median and y = N/2.
Skewness
Measures of Skewness
Measures the deviation from the symmetry

Sample:SK =

Population: SK =
NORMAL DISTRIBUTION
Properties of a normal curve:
1. It is symmetrical about the mean.
2. The mean is equal to the median, which is also equal to
the mode.
3. The tails or ends are asymptotic relative to the horizontal
line.
4. The total area under the normal curve is equal to 1 or
100%.
5. The normal curve area may be subdivided into at least
three standard scores each to the left and to the right of the
vertical axis.
[Figure: normal curve centered at the mean; the area to the left of the mean is 0.5 and the area to the right is 0.5; the horizontal axis runs from −∞ to +∞]
THE STANDARD NORMAL
RANDOM VARIABLE
A normal random variable x is standardized by expressing its value as the number of standard deviations ($\sigma$) it lies to the left or right of its mean ($\mu$). The standardized normal random variable (z) is defined as
$$z = \frac{x - \mu}{\sigma} \quad \text{or equivalently} \quad x = \mu + z\sigma.$$
For a sample,
$$z = \frac{x - \text{mean}}{s}.$$
Note:
1. When x is less than the mean, the value of z is negative.
2. When x is greater than the mean, the value of z is positive.
3. When x = mean, the value of z = 0
Examples
1. The mean and the standard deviation on an
examination are 70 and 10 respectively. Find
the scores in standard units of the students
receiving the marks
a) 65 b) 70 c) 87
2. Referring to the preceding problem, find the
marks corresponding to the standards scores
a) -1 b) 0.5 c) 1.25 d) –1.75
Example
Intelligence Quotient ( IQ ) scores are
distributed normally with mean of 100 and
standard deviation of 15.
1. What percentage of the population has
an IQ score below 85?
2. What percentage of the population has
an IQ score between 85 and 115?
3. What percentage of the population has
an IQ score above 120?
Solution:

1. Perform the following steps.

Step 1: Convert the score x = 85 to its z-score.
$$z = \frac{x - \mu}{\sigma} = \frac{85 - 100}{15} = -1$$
This means that the score 85 is 1 standard deviation below the mean.
[Figure: standard normal distribution with z = -1 and 0 marked]

Step 2: Set the calculator mode to STAT.
Press: MODE, then 3: STAT, then AC

Step 3: Compute the area to the left of z = -1.
Press: SHIFT, then 1, then 5: Distr, then 1: P(, then -1, then ), then =

The calculator will give a value of 0.15866.

Therefore 15.866% of the population has scores less than 85.
2. Since P( z ) gives us the area under the normal curve to the left, if
we are asked for the area for a given interval, simply subtract the 2
values of P( z ).

From 1, 85 has a z – score = - 1, and we will call it z1 = - 1.


Computing the z-score of 115, we have

z2 = (x − μ) / σ = (115 − 100) / 15 = 1

This means that 115 is one standard deviation above the mean.

[Figure: standard normal distribution, with z = −1, z = 0, and z = 1 marked.]
Following the same calculator procedure, we get
P = P( 1 ) – P( - 1 ) = 0.68268

Therefore 68.268% of the population has scores between 85 and 115.
3. Compute the z-score of 120.

z = (x − μ) / σ = (120 − 100) / 15 ≈ 1.33

[Figure: standard normal distribution, with z = 0 and z = 1.33 marked.]

Since the total area under the normal curve is 1, to get the area to the right,
we simply subtract the area to the left of z = 1.33 from 1. We have
P = 1 – P( 1.33 ) = 0.09176

Therefore 9.176% of the population has scores above 120.
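As a cross-check on the calculator steps, the three IQ percentages can be reproduced in Python. This is only a sketch: the helper name phi is an assumption, and it computes the left-tail area P( z ) from the error function in the standard library instead of the calculator's Distr menu.

```python
import math


def phi(z):
    """Area under the standard normal curve to the left of z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


MEAN, SD = 100, 15  # IQ mean and standard deviation from the example

z_85 = (85 - MEAN) / SD
z_115 = (115 - MEAN) / SD
z_120 = round((120 - MEAN) / SD, 2)  # rounded to two decimals, as in the slides

below_85 = phi(z_85)                     # question 1: area to the left
between_85_115 = phi(z_115) - phi(z_85)  # question 2: difference of two left areas
above_120 = 1.0 - phi(z_120)             # question 3: complement of the left area

print(f"{below_85:.5f} {between_85_115:.5f} {above_120:.5f}")
```

The printed values can be compared with the calculator results in the solution; small differences in the last digit come from rounding.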


Examples
3. Find the area under the standard normal curve which lies:
a) to the right of z = 0.56
b) to the right of z = -0.75
c) to the left of z = 0.72
X = 0.7642
d) to the left of z = -0.10
X = 0.4602
e) between z = -0.97 and z = -0.67
X = 0.2514 – 0.1660
X = 0.0854
f) between z = -0.94 and z = 2.25
X = 0.9878 – 0.1736
X = 0.8142
a) to the right of z = 0.56

Solution:
Let X be the area under the standard normal curve.

[Figure: standard normal curve with z = 0.56 marked.]

X = 1 – P( 0.56 )
= 1 – 0.7123
= 0.2877
b) to the right of z = -0.75

[Figure: standard normal curve with z = -0.75 marked.]

X = 1 – P( -0.75 )
= 1 – 0.2266
= 0.7734
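The three kinds of area in Example 3 (to the left, to the right, and between two z values) can also be computed directly. The sketch below assumes Python 3.8 or later, where the standard library provides statistics.NormalDist; the helper names left, right, and between are illustrative.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal distribution: mean 0, standard deviation 1


def left(z):
    """Area to the left of z."""
    return Z.cdf(z)


def right(z):
    """Area to the right of z (total area is 1)."""
    return 1.0 - Z.cdf(z)


def between(z1, z2):
    """Area between z1 and z2, with z1 < z2."""
    return Z.cdf(z2) - Z.cdf(z1)


# The six parts of Example 3, using the z values as stated in the problem
areas = {
    "a": right(0.56),
    "b": right(-0.75),
    "c": left(0.72),
    "d": left(-0.10),
    "e": between(-0.97, -0.67),
    "f": between(-0.94, 2.25),
}

for label, area in areas.items():
    print(f"{label}) {area:.4f}")
```

Each result is rounded to four decimal places so it can be compared with a printed z table.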
4. Suppose the temperature last May was normally
distributed with mean 30°C and standard deviation
5.33°C. Find the probability that the temperature is
between 34.2°C and 36.45°C.

5. A normal distribution has a mean of 118 and a
standard deviation of 11. What are the two scores
containing the middle 85% of the distribution?

6. In a qualifying examination for admittance to
the College of Arts and Sciences, the mean score
was 65 and the standard deviation was 8. If 40
students scored between 60 and 75, how many
students took the qualifying examination?
Examples
7. The salaries of employees of a certain company in Metro
Manila have a mean of P5,000 and a standard deviation of
P1,000. What is the probability that an employee selected at
random will have a salary of
a. more than P5,000?
b. between P5,000 and P6,000?
c. less than P7,000?
Using a scientific calculator

fx-991ES PLUS
* Frequency on: SHIFT – MODE – scroll down – 4 (STAT)
* MODE – 3 – 1 – [input the data] – AC – SHIFT – 1 – 4 – x̄ – sx – σx

Old model
* MODE – SD – x1 – M+ – x2 – M+ – … – xn – M+ – SHIFT – S-Var – x̄ – sx – σx
Linear Regression and
Correlation
Many decisions are based on a
relationship between two variables.
For instance, a person’s blood
pressure may vary inversely with the
amount of hypertension medicine taken.
A company’s market share may increase
directly with its advertising cost.
Correlation is a statistical method used to determine
whether a relationship between variables exists.
Regression is a statistical method used to describe
the nature of the relationship between variables,
that is, positive or negative, linear or nonlinear.
A scatter plot is a graph of the ordered pairs (x, y) of
numbers consisting of the independent variable x
and the dependent variable y.
Scatter Diagram
Consider the study made by a retail merchant to determine the relation
between monthly advertising expenditure and sales.

Advertising Cost    Sales
(in P1,000’s)       (in P1,000’s)
1.5                 36
1.7                 44
2.0                 48
2.4                 71
2.7                 78
3.0                 90
3.2                 95
3.5                 100
A graph of the ordered pairs is called a
scatter diagram. The variable we wish to
portray is called the dependent or
response variable, while the variable that
affects the dependent variable is called
the independent variable. In the graph,
the dependent variable (sales) is plotted
on the y-axis, and the independent
variable (advertising cost) is plotted on
the x-axis.
[Figure: scatter diagram. x-axis: advertising cost (in P1,000’s); y-axis: sales (in P1,000’s).]


The Linear Regression
The paired data we have plotted on the scatter
diagram are called bivariate data. After a relationship
between bivariate data has been established, a
relationship equation has to be determined. A method
of determining a linear relationship for bivariate data is
called linear regression. Many lines can be drawn
such that the points lie close to the line, but the line
of greatest interest is the least-squares regression
line. This is the line that minimizes the sum of the
squares of the differences between the observed
values and the values predicted by the line. The
least-squares regression line is determined by using
the following formula.
The equation of the least-squares line for the n ordered pairs
(x1, y1), (x2, y2), (x3, y3), . . . , (xn, yn) is the line

ŷ − ȳ = m(x − x̄) or ŷ = ȳ + m(x − x̄)

where: x̄ = mean of the variable x
ȳ = mean of the variable y
m = slope of the line, m = [Σ(xy) − n·x̄·ȳ] / [Σx² − n(x̄)²]

The symbol ŷ (pronounced y-hat) is
used in place of y in the least-squares line
to differentiate it from the y-values in the
given ordered pairs.
Determine the equation of the least – squares line for the sales and advertising
relationship above.

Solution:

x̄ = (1.5 + 1.7 + 2.0 + 2.4 + 2.7 + 3.0 + 3.2 + 3.5) / 8 = 2.5

ȳ = (36 + 44 + 48 + 71 + 78 + 90 + 95 + 100) / 8 = 70.25

Σ(xy) = 1.5(36) + 1.7(44) + 2.0(48) + 2.4(71) + 2.7(78) + 3.0(90)
+ 3.2(95) + 3.5(100)
Σ(xy) = 1,529.8

Σx² = (1.5)² + (1.7)² + (2)² + (2.4)² + (2.7)² + (3)² + (3.2)² + (3.5)²
Σx² = 53.68
m = [Σ(xy) − n·x̄·ȳ] / [Σx² − n(x̄)²]
= [1,529.8 − 8(2.5)(70.25)] / [53.68 − 8(2.5)²]
= 33.91

The equation is ŷ = ȳ + m(x − x̄).

ŷ = 70.25 + 33.91(x − 2.5)

ŷ = 33.91x − 14.525

[Figure: scatter diagram with the least-squares line drawn; x-axis: advertising cost (in P1,000’s).]

The least-squares line is shown above.
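The slope and intercept computed above can be reproduced with a short Python sketch that applies the same formula for m. The function name least_squares and the variable names are illustrative and do not come from the slides; the data are the eight advertising and sales pairs from the table.

```python
# Advertising cost (x) and sales (y), both in P1,000's, from the table
ad_cost = [1.5, 1.7, 2.0, 2.4, 2.7, 3.0, 3.2, 3.5]
sales = [36, 44, 48, 71, 78, 90, 95, 100]


def least_squares(xs, ys):
    """Return slope m and intercept b of the least-squares line y-hat = m*x + b."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    # m = [sum(xy) - n * x_bar * y_bar] / [sum(x^2) - n * x_bar^2]
    m = (sum_xy - n * x_bar * y_bar) / (sum_x2 - n * x_bar ** 2)
    # Rearranging y-hat = y_bar + m(x - x_bar) gives the intercept
    b = y_bar - m * x_bar
    return m, b


m, b = least_squares(ad_cost, sales)
print(f"slope = {m:.2f}, intercept = {b:.3f}")
```

The slope agrees with 33.91. The intercept comes out slightly different from −14.525 in the third decimal place because the slides round m to 33.91 before expanding the equation, while the code keeps full precision.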


The Correlation Coefficient

To gauge whether or not the relationship
between variables is strong enough so that
making use of the regression line is
meaningful, statisticians use a statistic
called the correlation coefficient. The
correlation coefficient is denoted by r and
is defined in the following manner.
Correlation Coefficient

r = [n·Σ(xy) − (Σx)(Σy)] / √{ [n(Σx²) − (Σx)²] · [n(Σy²) − (Σy)²] }
Compute the correlation coefficient for the
previous data.
Solution:

Σx = 1.5 + 1.7 + 2.0 + 2.4 + 2.7 + 3.0 + 3.2 + 3.5 = 20

Σy = 36 + 44 + 48 + 71 + 78 + 90 + 95 + 100 = 562

x̄ = 20 / 8 = 2.5

ȳ = 562 / 8 = 70.25

Σ(xy) = 1.5(36) + 1.7(44) + 2.0(48) + 2.4(71) + 2.7(78)
+ 3.0(90) + 3.2(95) + 3.5(100)
Σ(xy) = 1,529.8

Σx² = 1.5² + 1.7² + 2² + 2.4² + 2.7² + 3² + 3.2² + 3.5² = 53.68

Σy² = 36² + 44² + 48² + 71² + 78² + 90² + 95² + 100² = 43,786
r = [n·Σ(xy) − (Σx)(Σy)] / √{ [n(Σx²) − (Σx)²] · [n(Σy²) − (Σy)²] }

r = [8(1,529.8) − (20)(562)] / √{ [8(53.68) − (20)²] · [8(43,786) − (562)²] }

r = 0.9915
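The same computation can be written as a short Python sketch that follows the sums used in the solution term by term. The function name pearson_r is illustrative; the data are the advertising and sales pairs from the scatter diagram example.

```python
import math

# Advertising cost (x) and sales (y), both in P1,000's
ad_cost = [1.5, 1.7, 2.0, 2.4, 2.7, 3.0, 3.2, 3.5]
sales = [36, 44, 48, 71, 78, 90, 95, 100]


def pearson_r(xs, ys):
    """Correlation coefficient r from the raw sums."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt(
        (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
    )
    return numerator / denominator


r = pearson_r(ad_cost, sales)
print(f"r = {r:.4f}")
```

A value of r this close to 1 indicates that the points lie very near the least-squares line, so using the line for estimation is reasonable within the range of the data.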
Activity
A real estate company conducts a survey of 15 of its agents. The table
below shows the number of minutes spent with each customer and the
number of sales in a month.
X 20 21 25 26 26 25 22 18 23 20 27 29 30 30 33
Y 8 10 12 15 11 10 10 9 11 11 12 14 14 15 18

X represents the number of minutes and Y represents the sales in a month.

a) Find the equation of the regression line.
b) Compute the correlation coefficient for the given data.
c) Estimate the number of sales if the agent spends 45 minutes with each
customer.
Below is a guide for interpreting coefficients of correlation.

Coefficient                        Correlation                      Interpretation
0.90 to 1.00 (-0.90 to -1.00)      Very high positive (negative)    Very dependable relationship
0.70 to 0.89 (-0.70 to -0.89)      High positive (negative)         Marked relationship
0.50 to 0.69 (-0.50 to -0.69)      Moderate positive (negative)     Relationship is substantial
0.30 to 0.49 (-0.30 to -0.49)      Low positive (negative)          Small relationship
0.00 to 0.29 (0.00 to -0.29)       Little, if any                   Almost negligible relationship
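The guide can be turned into a small lookup function. This is a sketch under one stated assumption: the guide lists its bands to two decimal places, so the code treats each band as starting at its lower bound (0.90, 0.70, 0.50, 0.30), which places a value such as 0.895 in the band below. The function name interpret_r is illustrative.

```python
def interpret_r(r):
    """Label a correlation coefficient using the bands of the guide.

    Assumption: a band starts at its lower bound, so |r| >= 0.90 is
    "very high", |r| >= 0.70 is "high", and so on.
    """
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie between -1 and 1")
    size = abs(r)
    if size < 0.30:
        return "little, if any correlation"
    direction = "positive" if r > 0 else "negative"
    if size >= 0.90:
        level = "very high"
    elif size >= 0.70:
        level = "high"
    elif size >= 0.50:
        level = "moderate"
    else:
        level = "low"
    return f"{level} {direction} correlation"


print(interpret_r(0.9915))
print(interpret_r(-0.75))
```

For the advertising and sales data, r = 0.9915 falls in the top band of the guide.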
