UNIVERSITY OF BAGUIO

RESEARCH AND DEVELOPMENT CENTER


IRC-MATHEMATICS AND STATISTICS

BASICS OF STATISTICS - I

STATISTICS WITH SPSS LECTURE NOTES


Dr. Victor Hafalla Jr., RME, REE, MAAS, Ph.D.
University Statistician (UB)
UB Graduate School
Reaction!

There are many cases of the so-called “exception to the
rule,” as individual cases sometimes vary. When do we
apply the general rule and when do we apply the exception?
When do we become an automaton and adhere to the
general rule, and when do we become human and give
special considerations to individual cases? While we say to
our children that they are unique in their own way, we give
them education that is tailored to conform to the general.
“While the individual man is an
insoluble puzzle, in the aggregate he
becomes a mathematical certainty.
You can, for example, never foretell
what any one man will be up to, but
you can say with precision what an
average number will be up to.
Individuals vary, but percentages
remain constant.”

Sir Arthur Conan Doyle (1859-1930)
Q. How do you present statistics and data?
Methods of Describing and Summarizing a Set
of Measurements
◻ Textual form

◻ Tabular presentation

◻ Construction of graphs

◻ Computation of summary measures (e.g., mean, median, mode, etc.)
Textual Form
◻ is used in presenting data in paragraph or narrative
form

◻ It is appropriate when there are few numbers or
statistics to be presented
Example

The grade point average (GPA) of those who belong in group
‘0’ (80.85) is significantly different from group ‘1’ (82.90) with
a p-value of 0.017, which is less than 0.05, though at first glance
the averages seemed to be very close. More importantly, the
average number of failed subjects (FAILURES) by those who
belong in group ‘0’ (18.04) is significantly different from those
who belong to group ‘1’ (5.44) (p<0.01). This hints that one of
the determinants of passing the respondents’ board exam is the
number of failed subjects during their academic years.

Source: Hafalla, V. Jr. & Calub, E. (2011) Modeling the performance of electronics
and communications engineering students in the licensure examination. UB Research
Journal, 35(1)
Tabular Presentation

◻ process of condensing data into a table and arranging
them systematically in rows and columns
Example:

Table 8
Results of the F-Test on the BOD Levels of the Different Points Along the Magsaysay Creek

Source                            Type III Sum   df       Mean Square   F       p-value   Noncent.    Observed
                                  of Squares                                              Parameter   Power*
BOD          Sphericity Assumed   17947.6        6        2991.267      5.699   0.000     34.194      0.994
             Greenhouse-Geisser   17947.6        3.108    57737.924     5.699   0.003     17.715      0.920
             Huynh-Feldt          17947.6        4.937    36355.966     5.699   0.000     28.134      0.985
             Lower-bound          17947.6        1.000    179473.60     5.699   0.041     5.699       0.567
Error (BOD)  Sphericity Assumed   283425.5       54       5248.621
             Greenhouse-Geisser   283425.5       27.976   10131.111
             Huynh-Feldt          283425.5       44.429   6379.279
             Lower-bound          283425.5       9.000    31491.727

*Computed using alpha = 0.05
Source: Hafalla, V. & Ferrer, V. (2008) Magsaysay creek, a Sagudin-Balili river tributary: An evaluation of the BOD readings, UB Research Journal, 37(1).

Callouts on the slide label the parts of this table: heading, boxhead, stub, field, footnote, and source note.
Parts of a Statistical Table
◻ heading - consists of a table number, title and
headnotes
◻ boxhead - identifies the contents of a particular
column; includes column heads or captions and the
spanner head
◻ stub - identifies the contents of a particular row; includes
stubhead or box, center heads and subheads, and line
captions
◻ field - main part of the table; contains the substance or the
figures of one’s data
◻ footnote - any statement or note inserted at the bottom
of the table
◻ source note - an exact citation of the source of data
presented in the table (should always be placed
when the figures are not original)
Some Guidelines in Presenting Tables
◻ A title should be provided which includes the type of
information given and any relevant dates.
◻ Row and column labels must be precise.
◻ Categories should not be overlapping.
◻ The units of measure should be clearly stated.
◻ Show any relevant total, subtotals, percentages, etc.
◻ Indicate if the data were taken from another publication by
including a source note.
◻ Formal tables should be self-explanatory, although they may
be accompanied by a paragraph interpreting or directing
attention to important figures.
Graphical Presentation

◻ method of presenting numerical values or
relationships in pictorial form
Things to Consider in Constructing a
Graph
◻ specific purpose and function
◻ nature of the data (level of measurement, qualitative
or quantitative)
◻ characteristics of the users/readers
◻ type of equipment and materials available
Qualities of a Good Graph
◻ Accuracy - a graph should not be deceptive, distorted,
misleading, or in any way susceptible to wrong
interpretations as a result of inaccurate or careless
construction
◻ Simplicity - the basic design of a graph should be simple,
straightforward, not loaded with irrelevant or trivial symbols
and ornamentation
◻ Clarity - the graph should be easily read and understood by
its target readers; there should be a forceful and unmistakable
focus on the message that the graph is trying to communicate
◻ Appearance - a good graph is one that is designed and
constructed to attract and hold attention by holding a neat,
dignified, and professional appearance
Common Types of Graphs

◻ Line Graph
◻ Time Series graph
◻ Bar Graph
◻ Pie Chart
◻ Scatterplot
◻ Stem-and-Leaf displays
◻ Boxplot
Line Graph

◻ Graphical presentation of data especially useful for
showing trends over a period of time
Example 1:

Company ABC Monthly Profit for Year 2010


Example 2:

Figure 3. pH Readings Trend by Sampling Date
[Line graph: pH (vertical axis, 6.80 to 8.00) against Sampling Points 1 to 11 (horizontal axis), with one line per sampling date from 19-Feb-03 to 20-Mar-03.]

Source: Hafalla & Ferrer (2008)


Common Errors in Constructing Line
Graphs
1. Improper choice of scales and chart
proportions
-the scales chosen for both axes should result in a well
proportioned chart
-the resultant scales should not minimize or exaggerate
variations in the curves

2. Missing break in the vertical scale


-when only the upper portion of a coordinate field is necessary
to portray the data, it is permissible to eliminate the lower
portion of the field
-since it is imperative to retain the zero or base line, a break must
be chosen on the vertical scale at a point that will not interfere with
the remaining coordinate field required for plotting the data

3. Crudely designed coordinate lines or grid lines


-coordinate or grid lines should be kept to a minimum, but there
always should be a sufficient number to read clearly and
accurately the values represented by the curves on the chart
Bar Graph
◻ consists of a series of rectangular bars where the
length of the bar represents the magnitude to be
demonstrated

Remark:
◻ use horizontal arrangement of the individual bars
when comparison of categories is being made
◻ use vertical arrangement of the individual bars when
chronological comparisons are being made
Examples:
1. Simple Vertical Bar Graph
Student Population of College of Engineering by Course, SY 2009-2010
2. Simple Horizontal Bar Graph
Ozone Concentration in Selected Regions in Antarctica
4. Grouped Bar/Column Chart
Year-on-Year Comparison in Car Manufacturing Defects in 3 Factories
5. Subdivided Bar/Column Chart
Metal Concentration in 3 Different Alloy Samples
6. Deviation Bar/Column Chart
Revenue Growth in Different Corporate Investment Firms
Common Errors in Constructing Bar
Graphs
1. The use of broken scales
-the use of a broken scale is contrary to the one-dimensional
character of the bar and column chart and is never considered
an acceptable practice
2. Missing zero reference line
-the amputation of columns or bars by eliminating the zero
reference line is unacceptable; such a practice is untruthful
and misleading and destroys the basic comparability of the
chart as a whole
3. Use of unequal intervals
-in portraying time series, both the vertical and horizontal
scales of column charts should be designed in accordance with
accepted principles and practice
-the spacing of the bars/columns should be adjusted to the
unequal time intervals shown by the data
Pie Chart

◻ A circle which is divided into sectors in such a way
that the area of each sector is proportional to the size
of the quantity represented by that sector
Example:

Figure 1. Student Population of College of Engineering by Course,


SY 2009-2010
Common Error in Constructing Pie Charts

Large Number of Sectors


a pie chart with an extraordinarily large number of sectors
is manifestly useless as a medium of visual
communication
Scatterplot
◻ A scatterplot or scatter diagram consists of a series of data points
located on the rectangular coordinate system.
◻ It is commonly used when we want to graphically inspect the trend
of association between two variables
◻ A scatterplot does not specify dependent or independent variables.
Either of the variables can be plotted on either axis.
◻ If the patterns of dots slope from lower left to upper right, it
suggests a positive correlation between the variables. If the pattern
of dots slopes from upper left to lower right, it suggests a negative
correlation.
◻ A line of best fit can be drawn in order to study the correlation
between the variables.
Example 1:
Example 2:

Figure 4. Plot of Cases on the Canonical Discriminant Axes


Source: Hafalla, V. (2007) Derivation of canonical discriminant models for the academic performance of
pre-major engineering students of the University of Baguio, UB Research Journal, 36(2)
Stem and Leaf Displays
◻ Sometimes abbreviated SALD, the stem-and-leaf display
shows how wide a range of values the data covers, where the
values are concentrated, how nearly symmetric the
distribution is, whether there are gaps in the data, and whether
any values stray markedly from the rest (outliers).

◻ To construct a stem-and-leaf display, we divide each number
into two parts: the stem, consisting of one or more leading
digits, and a leaf, consisting of the remaining digits.
Example:
Construct a stem-and-leaf display of the given raw data below on
the BOD measurements of 30 sampling points in Chico River
(mg/l).

56.1 22.3 13.8 17.9 28.6 48.7


33.7 28.7 38.1 36.7 32.1 21.6
20.9 36.6 51.9 44.8 33.6 36.9
20.7 37.8 37.2 31.6 27.1 42.6
28.1 35.6 45.8 40.3 36.2 35.3
Using a double stem for clarity of the actual distribution of the
data and taking the tens digit as the stem and the remaining
digits as the leaves, we have:
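The display can also be generated programmatically. The following is a minimal Python sketch, not the layout of the original slide: it assumes the tens digit is the stem, the units-and-tenths part is the leaf (so 56.1 becomes stem 5, leaf 6.1), and each stem is split into a low row (leaves below 5) and a high row. The function name `double_stem_display` is illustrative only.

```python
from collections import defaultdict

# BOD measurements (mg/l) of the 30 sampling points from the example above
bod = [56.1, 22.3, 13.8, 17.9, 28.6, 48.7,
       33.7, 28.7, 38.1, 36.7, 32.1, 21.6,
       20.9, 36.6, 51.9, 44.8, 33.6, 36.9,
       20.7, 37.8, 37.2, 31.6, 27.1, 42.6,
       28.1, 35.6, 45.8, 40.3, 36.2, 35.3]


def double_stem_display(values):
    """Return the rows of a double-stem stem-and-leaf display as strings."""
    rows = defaultdict(list)
    for v in sorted(values):
        stem = int(v // 10)                  # tens digit
        leaf = round(v - 10 * stem, 1)       # units and tenths
        half = 0 if leaf < 5 else 1          # low row or high row of the stem
        rows[(stem, half)].append(leaf)

    first = min(stem for stem, _ in rows)
    last = max(stem for stem, _ in rows)
    lines = []
    for stem in range(first, last + 1):
        for half in (0, 1):
            leaves = "  ".join(f"{leaf:.1f}" for leaf in rows.get((stem, half), []))
            lines.append(f"{stem} | {leaves}".rstrip())
    return lines


display = double_stem_display(bod)
print("\n".join(display))
```

The longest row (stem 3, high half) shows where the values are concentrated, and the single leaf on the last row is the value that strays from the rest.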
Boxplot
◻ Also called the box-and-whiskers plot, a boxplot is a graph
which simultaneously displays the first, second and third
quartile, the minimum and maximum values in the data, the
range and the inter-quartile range.
◻ A boxplot conveys the following information regarding the
distribution of the data: location, spread, symmetry and
extreme values.
◻ It is a good graph in detecting actual and possible outliers in
the data set. Outliers are observations found in the data set that
are markedly different from the rest.
Procedure for Constructing Boxplots

i. Create a scale and locate the lowest and highest value of the
data in this scale.
ii. Construct a rectangle on this scale with one end depicting
the first quartile (Q1) and the other end the third quartile
(Q3).
iii. Put a vertical line across the interior of the rectangle at the
median.
iv. Mark the lowest and highest values in the data set by vertical
lines. Identify the outliers and mark them ‘O’ on the scale.
An outlier in the boxplot is an observation falling below
Q1 - 1.5 IQR (lower fence) or above Q3 + 1.5 IQR (upper
fence), where IQR = Q3 - Q1.
Example:
From the previous data set, construct a boxplot. Determine if
there are outliers in the data set.

We calculate the following values:


HV (highest value) = 56.1, LV (lowest value) = 13.8
Q1: (1/4)(30) = 7.5, rounded up to the 8th observation, so Q1 = 28.1
Q3: (3/4)(30) = 22.5, rounded up to the 23rd observation, so Q3 = 38.1
Md = Q2 = mean of 15th and 16th obs = (35.3 + 35.6)/2 = 35.45
IQR = Q3 - Q1 = 38.1 - 28.1 = 10
LF (lower fence) = Q1 - 1.5 IQR = 28.1 - 1.5(10) = 13.1
UF (upper fence) = Q3 + 1.5 IQR = 38.1 + 1.5(10) = 53.1
Since the observation 56.1 is outside the upper fence (53.1),
it is an outlier (marked ‘O’ in the boxplot above).
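The same calculation can be sketched in Python. This is an illustrative sketch that assumes the positional rule used in these notes (round a fractional position up; average two neighbours when the position is a whole number); statistical packages may use other quartile conventions and return slightly different values. The helper name `percentile` is an assumption of this sketch.

```python
# BOD measurements (mg/l) of the 30 sampling points
bod = [56.1, 22.3, 13.8, 17.9, 28.6, 48.7,
       33.7, 28.7, 38.1, 36.7, 32.1, 21.6,
       20.9, 36.6, 51.9, 44.8, 33.6, 36.9,
       20.7, 37.8, 37.2, 31.6, 27.1, 42.6,
       28.1, 35.6, 45.8, 40.3, 36.2, 35.3]


def percentile(ordered, i):
    """i-th percentile of already-sorted data, using the notes' positional rule."""
    k, remainder = divmod(len(ordered) * i, 100)
    if remainder == 0:                        # whole-number position: average k-th and (k+1)-th
        return (ordered[k - 1] + ordered[k]) / 2
    return ordered[k]                         # otherwise take the (k+1)-th ordered value


data = sorted(bod)
q1, md, q3 = (percentile(data, i) for i in (25, 50, 75))
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(f"Q1={q1}, Md={md:.2f}, Q3={q3}, IQR={iqr:.1f}")
print(f"fences: ({lower_fence:.1f}, {upper_fence:.1f}), outliers: {outliers}")
```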
Summary Statistics

◻ numerical measures that are used to describe certain
characteristics of the data
Common Types of Summary Measures

◻ Measures of Central Tendency or Averages
(e.g., mean, weighted mean, median, mode)

◻ Measures of Location
(e.g., quartiles, percentiles, deciles)

◻ Measures of Dispersion or Variation
(e.g., range, variance, standard deviation, coefficient of
variation)
Measures of Central Tendency

◻ any single value which is used to identify the
“center” of the data, oftentimes referred to as the
average

◻ It is a value that is representative of a data set

◻ this value lies centrally among the data


Mean
◻ sum of all values of the observations divided by the
number of observations in the data set

Population Mean (for a finite population):
◻ is the mean of all N observations in the population
$\mu = \frac{\sum_{i=1}^{N} X_i}{N}$

Sample Mean
◻ is the mean of the n sample observations
$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$
Example 1:
◻ The achievement test scores in General Science of all 50
freshmen students from a certain college are as follows
(maximum score: 100):
43 51 53 55 57 58 58 59 61 61
61 62 63 64 65 65 66 66 67 68
68 69 69 69 69 70 70 70 71 71
72 73 73 74 74 75 76 76 77 78
79 79 81 82 82 85 87 89 91 96

The mean of the population is:
$\mu = \frac{3498}{50} = 69.96$

Example 2:
◻ Suppose that a sample of seven students from this college
yielded the following observations:
70 82 77 96 55 85 64
Then the corresponding sample mean is:
$\bar{x} = \frac{529}{7} \approx 75.57$

◻ Suppose another sample of students of the same size was
taken and resulted in the following scores:
58 72 77 89 63 85 51
The sample mean for this is:
$\bar{x} = \frac{495}{7} \approx 70.71$
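The two examples can be checked with a short Python sketch that computes the population mean and both sample means and reports how far each sample mean falls from the population mean. The variable names are illustrative; `statistics.mean` is from the Python standard library.

```python
from statistics import mean

# Achievement test scores of all 50 freshmen (the population)
population = [43, 51, 53, 55, 57, 58, 58, 59, 61, 61,
              61, 62, 63, 64, 65, 65, 66, 66, 67, 68,
              68, 69, 69, 69, 69, 70, 70, 70, 71, 71,
              72, 73, 73, 74, 74, 75, 76, 76, 77, 78,
              79, 79, 81, 82, 82, 85, 87, 89, 91, 96]

# The two samples of seven students
sample_1 = [70, 82, 77, 96, 55, 85, 64]
sample_2 = [58, 72, 77, 89, 63, 85, 51]

mu = mean(population)
print(f"population mean = {mu:.2f}")
for label, sample in (("sample 1", sample_1), ("sample 2", sample_2)):
    x_bar = mean(sample)
    print(f"{label}: mean = {x_bar:.2f}, difference from population mean = {x_bar - mu:+.2f}")
```

Two samples of the same size drawn from the same population give different means, neither of which equals the population mean exactly.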
Q.
How close are the sample means to the actual
population mean?
How do we increase the accuracy of the sample mean
as an estimate of the population mean?
How confident are we that the computed sample
mean is very close to the population mean?
What is the desirable margin of error for the mean?
Weighted Mean
◻ when several types of data make different contributions to the
mean, each data value should be assigned a weight proportional
to its importance
◻ It is the sum of the products of the observations and their
corresponding weights, divided by the total weight:
$\bar{x}_w = \frac{\sum w_i x_i}{\sum w_i}$
Example 1:
◻ Company Y, a manufacturing venture, pays its employees using
the following scheme: Php 650/hr for its 18 administrative
employees, Php 350/hr for its 45 supervisors and department
directors and Php 255/hr for its 300 laborers and clerical office
workers. Calculate the mean hourly rate of Company Y.

Using weighted mean, we have
$\bar{x}_w = \frac{650(18) + 350(45) + 255(300)}{18 + 45 + 300} = \frac{103950}{363} \approx 286.36$ (Php per hour)

Example 2:
◻ The following data present the opinion poll of factory workers
on a work referendum. Determine the weighted opinion for each
provision.

                              Frequency Counts
Provisions                    Agree   Agree but with   Disagree
                                      reservations
1. compulsory vacation        44      32               24
2. unpaid unexcused leaves    18      52               30
3. shifting schedules         23      22               55

First assign weights to the opinions: (agree-3, agree but with
reservations-2, disagree-1)
Calculate the weighted opinion for each provision, for example,
for provision 1: $\bar{x}_w = \frac{3(44) + 2(32) + 1(24)}{100} = 2.2$

                              Frequency Counts
Provisions                    Agree   Agree but with      Disagree   n     Opinion   Interpretation
                              (A)     reservations (AR)   (D)              (xw)
1. compulsory vacation        44      32                  24         100   2.2       AR
2. unpaid unexcused leaves    18      52                  30         100   1.88      AR
3. shifting schedules         23      22                  55         100   1.68      AR
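Both weighted-mean examples can be reproduced with one small helper. This is a minimal sketch; the function name `weighted_mean` and the dictionary layout are illustrative choices. In Example 1 the values are hourly rates weighted by head count; in Example 2 the values are the opinion scores (3, 2, 1) weighted by the frequency counts.

```python
def weighted_mean(values, weights):
    """Sum of value*weight products divided by the total weight."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Example 1: hourly rates (Php) weighted by number of employees
hourly_rate = weighted_mean([650, 350, 255], [18, 45, 300])
print(f"mean hourly rate = Php {hourly_rate:.2f}")

# Example 2: opinion scores weighted by frequency counts (A, AR, D)
scores = [3, 2, 1]
poll = {
    "compulsory vacation": [44, 32, 24],
    "unpaid unexcused leaves": [18, 52, 30],
    "shifting schedules": [23, 22, 55],
}
opinions = {provision: weighted_mean(scores, counts) for provision, counts in poll.items()}
for provision, opinion in opinions.items():
    print(f"{provision}: weighted opinion = {opinion:.2f}")
```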
Characteristics of the Mean

◻ It is the most familiar measure of central tendency


used, and it employs all available information.
◻ It is strongly influenced by extreme values.
◻ Since the mean is a calculated number, it may not be
an actual number in the data set.
◻ It can be applied to data that are measured at least at the
interval level.
◻ The mean is also unique, that is, there is only one
mean in a given data set.
Reaction!

Decision makers, policy makers, statisticians, … the
general public, always have a fascination with the
average. The average performance, the average
weight, the average run-time, the average poll,… all
point to the average man. But in fact, individuals are
always above or below the average, and society judges
individuals by this benchmark, ignoring our
own individuality and uniqueness.
Median

◻ the value found at the exact middle of a data set when
the observations are arranged in numerical order
(descending or ascending).

◻ a value that divides an ordered set of data (array) into
two equal parts and is commonly denoted by Md
To get the median:
◻ when the number of observations is odd:
Md = middle value in the array
= (n+1)/2 th observation in the array

◻ when the number of observations is even:


Md = mean of the two middle values in the array
= mean of (n/2)th and (n/2 + 1)th observation in
the array
Example 1:
The following are the total yearly expenses of 7 companies (in
millions of pesos):
1.2 7.2 12.5 6.5 50.6 4.5 10.4

The array corresponding to the above data in ascending order


is given by:
1.2 4.5 6.5 7.2 10.4 12.5 50.6

Thus the median is 7.2.


Example 2:
The following are the number of years of operation of 8
manufacturing companies:
8 10 17 18 11 16 17 10

The array in ascending order is given by


8 10 10 11 16 17 17 18

The median is (11 + 16)/2 = 13.5.
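The odd/even rule for locating the median can be written out directly. The sketch below is illustrative (the function name `median` is chosen here; Python's standard library also provides `statistics.median`), and it is applied to the two examples above.

```python
def median(values):
    """Median by the positional rule: middle value, or mean of the two middle values."""
    array = sorted(values)
    n = len(array)
    if n % 2 == 1:
        return array[(n + 1) // 2 - 1]                 # the (n+1)/2-th observation
    return (array[n // 2 - 1] + array[n // 2]) / 2     # mean of the (n/2)-th and (n/2 + 1)-th


expenses = [1.2, 7.2, 12.5, 6.5, 50.6, 4.5, 10.4]      # Example 1, n = 7 (odd)
years = [8, 10, 17, 18, 11, 16, 17, 10]                # Example 2, n = 8 (even)

print(median(expenses))
print(median(years))
```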
Characteristics of the Median
◻ It is a positional measure.
◻ It is not influenced by extreme values. Hence it is
preferable to the mean when extreme values are
present in the data set or when the distribution is
skewed.
◻ It can be applied to data that are measured at least at the
ordinal level.
Mode

◻ the value in the data set that occurs with the greatest
frequency

◻ usually denoted by Mo
Example
◻ A psychologist has developed a new technique intended to improve
rote memory. To test the method, 30 high school students representing
three sections are selected at random, and each is taught the new
technique. The students are then asked to memorize a list of 100 word
phrases using the technique. The following are the number of word
phrases memorized correctly by the students selected from each
section.

A: 83 64 98 66 83 87 83 93 86 80 93 83 75
B: 87 76 96 77 94 92 88 85 89
C: 68 84 79 79 84 75 80

The mode for each section is: MoA = 83; MoB does not exist;
MoC = 79 and 84 (two modes)
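A short Python sketch of the same tally, using `collections.Counter` from the standard library. The function name `modes` and the convention of returning an empty list when no value repeats are assumptions of this sketch, chosen to match the "does not exist" case above.

```python
from collections import Counter


def modes(values):
    """Return all values with the greatest frequency, or [] if no value repeats."""
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:                     # every value occurs once: no mode
        return []
    return sorted(v for v, c in counts.items() if c == top)


section_a = [83, 64, 98, 66, 83, 87, 83, 93, 86, 80, 93, 83, 75]
section_b = [87, 76, 96, 77, 94, 92, 88, 85, 89]
section_c = [68, 84, 79, 79, 84, 75, 80]

for name, data in (("A", section_a), ("B", section_b), ("C", section_c)):
    print(name, modes(data))
```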
Characteristics of the Mode
◻ It is the easiest to interpret among the measures of
central tendency.
◻ It is not affected by extreme values.
◻ the mode is not unique, that is, it may be possible that
a given data set may contain more than one mode or
none at all. If a data set has two modes, we call it
bimodal, if there are three modes, we call it trimodal
and so on.
◻ one advantage of the mode is that it can be applied to
observations that are measured in the nominal level.
Measures of Location

◻ numbers below which a specified amount or
percentage of data must lie and are oftentimes used to
find the position of a specific piece of data in relation
to the entire set of data
Percentile
◻ values that divide an ordered set of data into 100 equal parts
◻ the ith percentile (i=1,2,…,99), denoted by Pi, is a value
below which i% of the data must lie

To determine Pi, we have the following steps:


i. arrange the data from lowest to highest
ii. if ni/100 is a whole number, Pi is the mean of the
(ni/100)th and (ni/100 + 1)th ordered values
iii. if ni/100 is not a whole number, Pi is the kth ordered
value where k is the closest whole number greater
than ni/100
Deciles

◻ values that divide an ordered set of data into 10 equal
parts

◻ the ith decile (i=1,2,…,9), denoted by Di, is a value below
which 10i% of the data must lie
Quartiles
◻ values that divide an ordered set of data into 4 equal parts

◻ the ith quartile (i=1,2,3), denoted by Qi, is a value below which
25i% of the data must lie
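The three-step percentile rule, and the relations Di = P(10i) and Qi = P(25i), can be sketched as follows. This is an illustrative sketch of the rule stated in these notes, not of any particular software package (other conventions exist). It is applied to the 50 achievement test scores from the mean example, and the function names are assumptions.

```python
def percentile(values, i):
    """i-th percentile (i = 1..99) by the positional rule in the notes."""
    ordered = sorted(values)                       # step i: arrange lowest to highest
    k, remainder = divmod(len(ordered) * i, 100)   # n*i/100 split into whole part and remainder
    if remainder == 0:                             # step ii: mean of the k-th and (k+1)-th values
        return (ordered[k - 1] + ordered[k]) / 2
    return ordered[k]                              # step iii: the (k+1)-th value (index k)


def decile(values, i):
    return percentile(values, 10 * i)


def quartile(values, i):
    return percentile(values, 25 * i)


scores = [43, 51, 53, 55, 57, 58, 58, 59, 61, 61,
          61, 62, 63, 64, 65, 65, 66, 66, 67, 68,
          68, 69, 69, 69, 69, 70, 70, 70, 71, 71,
          72, 73, 73, 74, 74, 75, 76, 76, 77, 78,
          79, 79, 81, 82, 82, 85, 87, 89, 91, 96]

print("Q1, Q2, Q3 =", [quartile(scores, i) for i in (1, 2, 3)])
print("D2 =", decile(scores, 2))
print("P49 =", percentile(scores, 49))
```

With n = 50, the position for D2 is a whole number (10), so step ii averages the 10th and 11th ordered values, while the position for P49 is 24.5, so step iii takes the 25th ordered value.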
Example
◻ The data from 50 measurements of the traffic noise level at an
intersection are already ordered from smallest to largest in the
table given below. Locate the quartiles, 2nd decile and 49th
percentile.
The quartiles are as follows:

Q1 = P25: (25/100)(50) = 12.5, so Q1 is the 13th observation = 57.2
Q2 = D5 = P50 = (60.8 + 61)/2 = 60.9
Q3 = P75: (75/100)(50) = 37.5, so Q3 is the 38th observation = 64.6

The second decile and 49th percentile are as follows:

D2 = P20: (20/100)(50) = 10, giving the 10th observation = 56.4
P49: (49/100)(50) = 24.5, so P49 is the 25th observation = 60.8
Measures of Dispersion
◻ numerical descriptive measures which indicate the
extent to which individual observations in a set of data
are scattered about an average

◻ the degree to which numerical data tend to spread


about an average value

◻ also called measures of variability


Example

[Figure: distributions of Data set A and Data set B]
Data set A has less variability than Data set B because its data
have a smaller ‘spread’ along the horizontal axis.
Some Uses of Measuring Dispersion
◻ to determine the extent of scatter so that steps may be
taken to control the existing variation

◻ used as a measure of reliability of an average


Range
◻ the difference between the largest and smallest values
in a data set, that is,
R= highest value – lowest value
Example
What is the range in the traffic noise data below?

The range is
R = 77.1 - 52.0 = 25.1
Characteristics of the Range

◻ It is the simplest measure of dispersion.

◻ It is sensitive to extreme values.

◻ It fails to communicate any information about the clustering or


the lack of clustering of the values between the extremes.

◻ The range is best used for symmetric or nearly symmetric data


sets with no outliers.
Standard Deviation and Variance
◻ the standard deviation is a measure of dispersion which
indicates the extent of scattering of the observations from the
mean
◻ the standard deviation of a set of numbers is the square root of
the sum of the squared deviations from the mean divided by
n-1 for a sample and N for a population.
◻ two standard deviations exist: the sample standard deviation
(s) and the population standard deviation (σ)
◻ the square of the standard deviation is called the variance. The
sample variance is denoted by s² and the population variance
is denoted by σ²
Population Standard Deviation vs Sample
Standard Deviation
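Population standard deviation:
$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}}$

Sample standard deviation:
$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}}$
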
Example
Refer to the achievement test scores in General Science of 50
freshmen students.
43 51 53 55 57 58 58 59 61 61
61 62 63 64 65 65 66 66 67 68
68 69 69 69 69 70 70 70 71 71
72 73 73 74 74 75 76 76 77 78
79 79 81 82 82 85 87 89 91 96

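where the mean of the population is $\mu = 69.96$, and for the
sample of seven students

58 72 77 89 63 85 51

the corresponding sample mean is $\bar{x} \approx 70.71$.

The population standard deviation is given by
$\sigma = \sqrt{\frac{5505.92}{50}} \approx 10.49$
and the population variance is $\sigma^2 \approx 110.12$.

For the sample of seven students, the sample standard deviation
is computed as
$s = \sqrt{\frac{1189.43}{6}} \approx 14.08$
and the sample variance is $s^2 \approx 198.24$.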
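The difference between the two divisors (N for a population, n-1 for a sample) can be made explicit in a short Python sketch. The function name `std_dev` and its `is_sample` flag are illustrative; the standard library's `statistics.pstdev` and `statistics.stdev` compute the same two quantities.

```python
import math


def std_dev(values, is_sample):
    """Square root of the summed squared deviations over n-1 (sample) or N (population)."""
    n = len(values)
    m = sum(values) / n
    squared_deviations = sum((x - m) ** 2 for x in values)
    divisor = n - 1 if is_sample else n
    return math.sqrt(squared_deviations / divisor)


population = [43, 51, 53, 55, 57, 58, 58, 59, 61, 61,
              61, 62, 63, 64, 65, 65, 66, 66, 67, 68,
              68, 69, 69, 69, 69, 70, 70, 70, 71, 71,
              72, 73, 73, 74, 74, 75, 76, 76, 77, 78,
              79, 79, 81, 82, 82, 85, 87, 89, 91, 96]
sample = [58, 72, 77, 89, 63, 85, 51]

sigma = std_dev(population, is_sample=False)
s = std_dev(sample, is_sample=True)
print(f"sigma = {sigma:.2f}, sigma^2 = {sigma ** 2:.2f}")
print(f"s = {s:.2f}, s^2 = {s ** 2:.2f}")
```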


Computing Standard Deviation in FDTs
If the data are presented in a frequency distribution table (FDT),
the sample standard deviation can be approximated as follows:
i. compute for the mean
ii. subtract the mean from each of the class marks in the FDT
and square these deviations
iii. multiply the squared deviations by the corresponding class
frequency
iv. sum the products obtained in (iii) and divide the resulting
value by (n-1)
v. take the square root of the obtained quotient
Example
Consider the frequency distribution of the ozone concentration in 77
places in Antarctica. Find the standard deviation.

Concentration of Ozone    Number of    Class Mark
in the Atmosphere (ppm)   Areas (fi)   (Xi)         fiXi     (Xi-49.70)   (Xi-49.70)²   fi(Xi-49.70)²
10 - 19                   6            14.5         87       -35.2        1239.04       7434.24
20 - 29                   5            24.5         122.5    -25.2        635.04        3175.20
30 - 39                   14           34.5         483      -15.2        231.04        3234.56
40 - 49                   11           44.5         489.5    -5.2         27.04         297.44
50 - 59                   17           54.5         926.5    4.8          23.04         391.68
60 - 69                   15           64.5         967.5    14.8         219.04        3285.60
70 - 79                   4            74.5         298      24.8         615.04        2460.16
80 - 89                   2            84.5         169      34.8         1211.04       2422.08
90 - 99                   3            94.5         283.5    44.8         2007.04       6021.12
Total                     77                        3826.5                              28722.08
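The value 49.70 is the mean and is computed as
$\bar{x} = \frac{\sum f_i X_i}{n} = \frac{3826.5}{77} \approx 49.7$

The approximated sample standard deviation is given by
$s = \sqrt{\frac{\sum f_i (X_i - \bar{x})^2}{n-1}} = \sqrt{\frac{28722.08}{76}} \approx 19.44$

And the sample variance is $s^2 \approx 377.92$.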
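Steps i to v for a frequency distribution table can be sketched in Python as follows. This is an illustrative sketch with hypothetical variable names. It assumes the class frequencies 11 (for 40-49) and 17 (for 50-59), which are the ones consistent with the fiXi and fi(Xi-49.70)² columns and with the column totals, and it uses the unrounded mean rather than 49.70.

```python
import math

# Class limits (ppm) and class frequencies from the ozone example
class_limits = [(10, 19), (20, 29), (30, 39), (40, 49), (50, 59),
                (60, 69), (70, 79), (80, 89), (90, 99)]
frequencies = [6, 5, 14, 11, 17, 15, 4, 2, 3]

# Class mark = midpoint of each class
marks = [(low + high) / 2 for low, high in class_limits]
n = sum(frequencies)

# Step i: mean from the class marks
mean = sum(f * x for f, x in zip(frequencies, marks)) / n

# Steps ii-iii: squared deviations of the class marks, weighted by frequency
weighted_squares = sum(f * (x - mean) ** 2 for f, x in zip(frequencies, marks))

# Steps iv-v: divide by n-1 and take the square root
variance = weighted_squares / (n - 1)
s = math.sqrt(variance)

print(f"n = {n}, mean = {mean:.2f}, s = {s:.2f}, s^2 = {variance:.2f}")
```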


Coefficient of Variation
◻ the ratio of the standard deviation to the mean;
oftentimes expressed as a percentage value
◻ This measure of dispersion is independent of the units
of measurement used hence it is a useful tool in
comparing the distribution of two or more data sets.

It is computed as follows:
$CV = \frac{s}{\bar{x}} \times 100\%$
Example:
◻ In an experiment, two groups of cattle were fed differently
with one using the usual hay mix and the other group the
improved high protein mix. After a year, the following
statistics were collected on the two groups of cattle:

Which group of cattle gives less variability?


The coefficients of variation of the two groups of cattle are:

Thus, group A has less variability.
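The same kind of comparison can be sketched in Python. Purely as an illustration, the sketch below uses the two samples of seven student scores from the mean example rather than the cattle data; the function name `coefficient_of_variation` is an assumption, while `statistics.mean` and `statistics.stdev` are standard-library functions.

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean, as a percentage."""
    return stdev(values) / mean(values) * 100


sample_1 = [70, 82, 77, 96, 55, 85, 64]
sample_2 = [58, 72, 77, 89, 63, 85, 51]

cv_1 = coefficient_of_variation(sample_1)
cv_2 = coefficient_of_variation(sample_2)
print(f"CV of sample 1 = {cv_1:.1f}%")
print(f"CV of sample 2 = {cv_2:.1f}%")
print("sample with less relative variability:", "sample 1" if cv_1 < cv_2 else "sample 2")
```

Because the coefficient of variation is unit-free, the comparison holds even when the two data sets are measured on different scales.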


Kurtosis
◻ Kurtosis is the measure of peakedness of the distribution of the data in
relation to a normally peaked distribution (mesokurtic). A high-peaked
distribution, called leptokurtic, has a high concentration of values near the
mean of the data set, whereas a low-peaked distribution, called platykurtic,
tends to have a much more dispersed concentration of values.
◻ Normally distributed data have CK = 3, a platykurtic
distribution has CK < 3 and a leptokurtic distribution
has CK > 3.

◻ The formulas in Table 6.1 are used to compute the kurtosis.


Skewness
◻ Skewness is the measure of symmetry of the distribution of
the data.
◻ Normally distributed data have a coefficient of skewness
equal to zero (SK = 0), whereas data with a long left tail
have SK < 0 and data with a long right tail have SK > 0.
◻ A distribution with a long left tail is called a negatively
skewed distribution and the concentration of data will be
found on the right of the median while a long right tail
distribution is called a positively skewed distribution and
most of the data are concentrated to the left of the median.
◻ The coefficient of skewness may be calculated using the
Pearsonian Coefficient of Skewness formula:
$SK = \frac{3(\bar{x} - Md)}{s}$
◻ where $\bar{x}$ is the mean of the data, Md is the median and s is the
sample standard deviation.
Example:
◻ Determine the degree of peakedness and skewness
of the given frequency distribution.
◻ The following measures were determined from previous examples:
$\bar{x}$ = 10.62, Md = 10.606, s = 0.64. Hence,
SK = 3(10.62 - 10.606)/0.64 ≈ 0.0656.
❑ The distribution is positively skewed
(SK = 0.0656 > 0, although nearly symmetric) and
platykurtic (CK = 2.11 < 3). Also, since $\bar{x}$ = 10.62 >
Md = 10.606 > Mo = 10.245, we can fairly say that
the distribution is positively skewed.
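The skewness calculation in this example can be reproduced with a few lines of Python. This is a minimal sketch of the Pearsonian formula SK = 3(mean - median)/s; the function names are illustrative.

```python
def pearson_skewness(mean, median, s):
    """Pearsonian coefficient of skewness: 3 * (mean - median) / s."""
    return 3 * (mean - median) / s


def describe(sk):
    """Direction of skew from the sign of the coefficient."""
    if sk > 0:
        return "positively skewed"
    if sk < 0:
        return "negatively skewed"
    return "symmetric"


# Measures quoted in the example: mean = 10.62, Md = 10.606, s = 0.64
sk = pearson_skewness(10.62, 10.606, 0.64)
print(f"SK = {sk:.3f} ({describe(sk)})")
```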
