
2023

1. THE NATURE OF PROBABILITY AND STATISTICS

Objectives:

At the end of this chapter, the students are expected to:


1. Define statistics;
2. Differentiate descriptive and inferential statistics;
3. Distinguish primary and secondary data;
4. Make a distinction between qualitative and quantitative data;
5. Identify discrete and continuous data;
6. Classify data according to variable type and appropriate level of measurement; and
7. Discuss some applications of statistics.

Introduction

Decision makers make better decisions when they use all available information in an effective
and meaningful way. The primary role of statistics is to provide decision makers with methods for
obtaining and analyzing information to help make these decisions. Statistics is used to answer long-
range planning questions, such as when and where to locate facilities to handle future sales.
The word statistics is derived from the Latin word status, meaning “state”. In the beginning,
statistics involved the compilation of data and graphs describing various aspects of a state or country.
The word statistics means different things to different people. To some, statistics means the actual
numbers derived from data; others refer to statistics as a method of analysis. Specifically, statistics is
defined as the science of collecting, organizing, presenting, analyzing and interpreting numerical data
for the purpose of assisting in making more effective decisions.
Statistical methods are vital tools in research in education, psychology, medicine,
business, agriculture, and other disciplines.

Types of Statistics
Statistics is a tool which helps us develop general and meaningful conclusions that go beyond
the original data. There are two types of statistical analyses: Descriptive and Inferential or Inductive
Statistics.
1. Descriptive Statistics covers all the methods used to collect, organize, summarize or present data,
usually to make the data easier to understand. It is concerned with summary calculations such as
averages and percentages, and with the construction of graphs, charts and tables.
2. Inferential Statistics is concerned with the formulation of conclusions or generalizations
about a population based on an observation or a series of observations of a sample drawn from that
population. It consists of performing hypothesis tests, determining relationships among variables,
and making predictions. For example, the average family income of the residents in Region 2 can be
estimated from figures obtained from a few hundred families (the sample).
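
As a rough illustration of the two branches, the sketch below first summarizes a small sample (descriptive) and then treats the sample mean as a point estimate of the population mean (inferential). The income figures and variable names are invented for illustration; they are not actual Region 2 data.

```python
import statistics

# Hypothetical monthly incomes (in pesos) of sampled families; invented data.
sample_incomes = [18000, 22500, 15000, 30000, 27500, 21000, 19500, 24000]

# Descriptive statistics: summarize the data we actually have.
sample_mean = statistics.mean(sample_incomes)
share_above_20k = 100 * sum(x > 20000 for x in sample_incomes) / len(sample_incomes)

# Inferential statistics: use the sample to say something about the population.
# Here the sample mean serves as a point estimate of the population mean.
estimated_population_mean = sample_mean

print(f"Sample mean: {sample_mean:.2f}")
print(f"Percent of sampled families above 20,000: {share_above_20k:.1f}%")
print(f"Estimated population mean: {estimated_population_mean:.2f}")
```

How far such an estimate can be trusted depends on how the sample was drawn and how large it is, which the later sections on sampling take up.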

Quantitative and Qualitative Variables or Data


In doing research, we first have to define the variables relevant to the data. The term
variable means an item of interest that can take on different values, while a collection of
such values is called data. If a given value does not vary, or is fixed, it
is called a constant. There are two major classifications of variables: qualitative and quantitative.
1. Qualitative Variables are nonnumeric variables and cannot be measured. Examples include
gender (male, female), religious affiliation (Roman Catholic, Iglesia ni Cristo, Methodist, etc.), and
ethnicity (Ilocano, Tagalog, Ibanag, etc.).
2. Quantitative Variables are numerical variables and can be measured. Examples include
the balance in your checking account and the number of children in your family.

Prepared by MARIANNE JANE ANTOINETTE D. PUA, M.S. -Page 1



Some quantitative variables can take on only specific or isolated values along a scale. For
example, the number of children in a family may be 1, 2, 3, or any other whole number, but it can
never be 1.25 or 0.5. The values of such a variable can only be obtained through the process of
counting, and the variable is referred to as a discrete or discontinuous variable.
Specifically, quantitative variables can be ordered and ranked. They can be classified into two
groups: discrete and continuous.
Discrete variables take values that are obtained by counting. The results are whole numbers. For
example, the number of students in the room.
Continuous variables take values that are obtained by measuring. The results can be any value
between two specific values. For example, if you measure the height of every student in the room, you
could get any number between two reasonable amounts. So height is a continuous variable.

Levels of Measurement:
Variables can also be classified according to the level of measurement. There are four levels of
measurement: Nominal, Ordinal, Interval and Ratio.
1. Nominal Data: The weakest level of measurement. Numbers are used only to represent an item or
characteristic. Examples include: names, gender, religious affiliation, civil status, college majors. Note
that such data should not be treated as numerical, since relative size has no meaning.
2. Ordinal or Rank Data: This can be ordered or ranked, but a specific difference between the levels
cannot be determined. For example, the performance rating (Outstanding, Very Satisfactory,
Satisfactory, Poor) can be ordered. You know that Outstanding is higher than Very Satisfactory
and Very Satisfactory is higher than Satisfactory, etc., but there is no exact difference between any two
of them. For example, the grades behind Outstanding and Very Satisfactory may be close (4.65 and 4.45)
or may be far apart (5.00 and 4.25), so the exact difference cannot be determined.
3. Interval Data: This can be ordered and has an exact difference between any two units but has no
meaningful zero or starting point. For example, temperature (in degrees Celsius or Fahrenheit) is
interval data: readings can be ordered and there is an exact difference between two degrees, but zero
is not a starting point, since there can be temperatures below zero.
4. Ratio Data: This is the highest level of measurement and allows for all basic arithmetic operations,
including division and multiplication. Data at this level can be ordered, have exact differences between
units, and have a meaningful zero. Things that are counted are usually ratio level, for example, business
data such as cost, revenue and profit.
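
The practical difference between the interval and ratio levels can be seen in a short sketch. It compares the same two temperatures on the Celsius scale (interval) and on the Kelvin scale (ratio). The Kelvin comparison is not part of the handout; it is brought in here only as an illustration, and the two readings are arbitrary.

```python
# Two temperatures on the Celsius scale (interval level: zero is arbitrary).
t1_c, t2_c = 10.0, 20.0

# The same temperatures on the Kelvin scale (ratio level: zero is absolute).
t1_k, t2_k = t1_c + 273.15, t2_c + 273.15

# Differences are meaningful at the interval level: both scales agree.
diff_c = t2_c - t1_c
diff_k = t2_k - t1_k

# Ratios are meaningful only at the ratio level: "20 is twice 10" holds for
# the Celsius numbers, but the physical temperatures are nowhere near 2:1.
ratio_c = t2_c / t1_c
ratio_k = t2_k / t1_k

print("difference:", diff_c, round(diff_k, 2))
print("ratio:", ratio_c, round(ratio_k, 3))
```

Because the difference survives a change of scale and the ratio does not, division and multiplication are reserved for ratio-level data.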

Data Collection: Data can be collected in various ways:


1. Focus Group
2. Telephone Interview
3. Mail Questionnaires
4. Door-to-Door Survey
5. Mall Intercept
6. New Product Registration
7. Personal Interview
8. Experiments

Sources of Data:
1. Secondary Data: Data which are already available. For example, ISU enrollment data.
Secondary data is less expensive; however, it may not satisfy the researcher’s need.
2. Primary Data: Data which must be collected.

Sampling Techniques:
Sampling techniques are used when only a part of the population is to be surveyed. If it takes too
long or is too expensive to interview the whole population, a sample is used. If a sample is chosen
correctly to represent the population, it is called unbiased, while if it does not represent the whole
population, it is called biased.

There are many ways to collect a sample, statistical or non-statistical. The most commonly used
methods are:

A. Statistical Sampling:
1. Simple Random Sampling: This is used to see that all possible elements of the population
have an equal opportunity of being selected for the sample.
2. Stratified Random Sampling: This is obtained by selecting simple random samples from strata
(or mutually exclusive sets). Some of the criteria for dividing a population into strata are: Gender (male,
female); Age (under 18, 18 to 28, 29 to 39); Occupation (blue-collar, professional, other).
3. Cluster Sampling: This is a simple random sample of groups or clusters of elements. Cluster
sampling is useful when it is difficult or costly to generate a simple random sample. For example, to
estimate the average annual household income in a large city we use cluster sampling, because to use
simple random sampling we need a complete list of households in the city from which to sample. To use
stratified random sampling, we would again need the list of households. A less expensive way is to let
each block within the city represent a cluster. A sample of clusters could then be randomly selected, and
every household within these clusters could be interviewed to find the average annual household
income.
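
The three statistical sampling methods above can be sketched with Python's standard `random` module. The household list, the `block` and `gender_of_head` fields, and the function names are all hypothetical, chosen only to mirror the examples in the text; this is a minimal sketch, not a full survey design.

```python
import random

random.seed(1)  # fixed seed so the sketch is repeatable

# Hypothetical population: 200 households, each tagged with a block (cluster)
# and the gender of the household head (stratum).
households = [
    {"id": i, "block": i % 10, "gender_of_head": "male" if i % 2 else "female"}
    for i in range(1, 201)
]

def simple_random_sample(items, n):
    """Every element has the same chance of being selected."""
    return random.sample(items, n)

def stratified_sample(items, key, n_per_stratum):
    """Take a simple random sample separately within each stratum."""
    strata = {}
    for item in items:
        strata.setdefault(item[key], []).append(item)
    chosen = []
    for members in strata.values():
        chosen.extend(random.sample(members, n_per_stratum))
    return chosen

def cluster_sample(items, key, n_clusters):
    """Randomly pick whole clusters, then keep every element inside them."""
    clusters = sorted({item[key] for item in items})
    picked = set(random.sample(clusters, n_clusters))
    return [item for item in items if item[key] in picked]

srs = simple_random_sample(households, 20)
strat = stratified_sample(households, "gender_of_head", 10)
clus = cluster_sample(households, "block", 2)
print(len(srs), len(strat), len(clus))
```

Note that only the cluster version avoids sampling from the full household list: it needs the list of blocks, and then every household in the chosen blocks is taken.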

B. Nonstatistical Sampling:
1. Judgment Sampling: In this case, the person taking the sample has direct or indirect control
over which items are selected for the sample.
2. Convenience Sampling: In this method, the decision maker selects a sample from the
population in a manner that is relatively easy and convenient.
3. Quota Sampling: In this method, the decision maker requires the sample to contain a certain
number of items with a given characteristic. Many political polls are, in part, quota sampling.

Note: A random number table provides lists of numbers that are randomly generated and can be used
to select random samples. Computer packages can also be used to generate lists of random numbers. For
the table, refer to any statistics textbook.

Parameter and Statistic


A specific, well-defined characteristic of a population is known as a parameter of that population,
while a specific characteristic of a sample is called a statistic of that sample. For instance, suppose the
population is the set of temperature readings taken at hourly intervals on December 12, 2019 at various
locations around Santiago City, and the sample is the set of readings taken at 1:00 P.M. local time on
that day. Then the highest of all the hourly readings is a parameter, while the highest reading at
1:00 P.M. is a statistic.

Population and Sample


In statistics, the term population refers to a particular set of items, objects, phenomena, or people
being analyzed. These items, also called elements, can be actual subjects such as people or animals,
but they can also be numbers or definable quantities expressed in physical units.
A sample of a population is a subset of that population. It can be a set consisting of only one
value, reading, or measurement singled out from a population, or it can be a subset that is identified
according to certain characteristics. The physical unit (if any) that defines a sample is always the same
as the physical unit that defines the main, or parent, population. A single element of a sample is called
an event.
When a sample consists of the whole population, it is called a census. When a sample consists of
a subset of a population whose elements are chosen at random, it is called a random sample.

[Figure: Population, Sample, Infer]

Generating Random Variables using MS Excel


A random variable is a discrete or continuous variable whose value cannot be predicted in any
given instance. Such a variable is usually defined within a certain range of values, such as 1 through 6
in the case of a thrown die. In order for a variable to be random, the only requirement is that it must
be impossible to predict its value in any single instance. For instance, we can’t predict what number
will turn up if we throw a die one time.
A random sample is also called a probability sample, or scientific sample. Random sampling is a
type of sampling in which every item in a population of interest, or target population, has a known, and
usually equal, chance of being chosen for inclusion in the sample. Having such a sample ensures that
the sample items are chosen without bias and provides the statistical basis for determining the
confidence that can be associated with the inferences. The four principal methods of random sampling
are the simple, systematic, stratified, and cluster sampling methods.
A simple random sample is one in which individual items are chosen from the target population
on the basis of chance. Such chance selection is similar to the random drawing of numbers in a lottery.
However, in statistical sampling a table of random numbers or a random-number generator computer
program generally is used to identify the numbered items in the population that are to be selected for the
sample.
A systematic sample is a random sample in which the items are selected from the population at a
uniform interval of a listed order, such as choosing every tenth account receivable for the sample. The
first account to be included in the sample would be chosen randomly from among the first 10 accounts
(perhaps by reference to a table of random numbers). A particular concern with systematic sampling is
the existence of any periodic, or cyclical, factor in the population listing that could lead to a
systematic error in the sample results.
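
A minimal sketch of the every-tenth-account idea follows. The account numbers and the function name `systematic_sample` are hypothetical; the random start within the first interval is what keeps the procedure a random sample.

```python
import random

def systematic_sample(items, interval):
    """Pick a random start within the first `interval` items,
    then take every `interval`-th item after it."""
    start = random.randrange(interval)  # index 0 .. interval - 1
    return items[start::interval]

accounts = list(range(1, 101))          # hypothetical account numbers 1..100
sample = systematic_sample(accounts, 10)
print(sample)
```

If the listing itself repeats with a period of 10 (for example, every tenth account belonging to the same branch), this scheme would pick up that pattern, which is the cyclical risk the paragraph above warns about.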
In stratified sampling the items in the population are first classified into separate subgroups, or
strata, by the researcher on the basis of one or more important characteristics. Then a simple random or
systematic sample is taken separately from each stratum. Such a sampling plan can be used to ensure
proportionate representation of various population subgroups in the sample. Further, the required
sample size to achieve a given level of precision typically is smaller than it is with simple random
sampling, thereby reducing sampling cost.
Cluster sampling is a type of random sampling in which the population items occur naturally in
subgroups. Entire subgroups, or clusters, are then randomly sampled.
Although a nonrandom sample can turn out to be representative of the population, there is
difficulty in assuming beforehand that it will be unbiased, or in expressing statistically the confidence
that can be associated with inferences from such a sample. A judgment sample is one in which an
individual selects the items to be included in the sample. The extent to which such a sample is
representative of the population then depends on the judgment of that individual and cannot be
statistically assessed. A convenience sample includes the most easily accessible measurements, or
observations, as is implied by the word convenience.

[Figure: two sources of sampling bias. Voluntary Response Sample: members of the population select
themselves ("I'm in!", "Me too!", "And me!"). Investigator intervention: the investigator rejects items
from the population ("This one's far too small").]

A strict random sample is not usually feasible since only readily available items or transactions
can easily be inspected. In order to capture changes that are taking place in the quality of process
output, small samples are taken at regular intervals of time. Such a sampling scheme is called the
method of rational subgroups. Such sample data are treated as if random samples were taken at each
point in time, with the understanding that one should be alert to any known reasons why such a
sampling scheme could lead to biased results.

NOTE: For the purpose of statistical inference a representative sample is desired. Yet, the methods of
statistical inference require only that a random sample be obtained. There is no sampling method that
can guarantee a representative sample. The best we can do is to avoid any consistent or systematic bias
by the use of random (probability) sampling. Some causes of bias in sampling are voluntary response,
investigator intervention, or the effects of periodic, seasonal and/or systematic gathering of data.
While a random sample rarely will be exactly representative of the target population from which
it was obtained, use of this procedure does guarantee that only chance factors underlie the amount of
difference between the sample and the population.
Excel can also be used to generate a random sample from a given population, as the following
problem illustrates.

Problem: A researcher wishes to obtain a simple random sample of 100 households from 876
households in San Fabian, Echague, Isabela. (For convenience, the households are identified
by the ID numbers 1 through 876.) Use Excel to obtain the 100 ID numbers of the sampled
households to be included in the study.
Steps:
(1) Open Excel. Place the integers from 1 to 876 in column A of the worksheet by first
entering the number 1 in cell A1. With cell A1 active (by clicking away from and back to A1, for
instance), CLICK EDIT, FILL, SERIES to open the Series dialog box.

(2) Select the Series in Columns button with a Step value of 1 and a Stop value of 876. CLICK OK,
and the integers 1 to 876 will appear in column A.


(3) To identify the 100 households to be sampled, CLICK TOOLS, DATA ANALYSIS,
SAMPLING. Designate the Input Range as $A$1:$A$876, the Sampling Method as Random, the
number of samples as 100, and the Output Range as $B$1. CLICK OK, and the IDs of the randomly
selected households will appear in rows 1 through 100 of column B.


And the result is:
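
For readers without Excel, the same problem can be sketched in Python. This is an alternative to the Excel steps, not a reproduction of them: `random.sample` draws without replacement, so no household ID appears twice. The variable names are illustrative.

```python
import random

household_ids = range(1, 877)   # household IDs 1 through 876

# Draw 100 distinct IDs at random, then sort them for easier reading.
sampled_ids = sorted(random.sample(household_ids, 100))

print(sampled_ids)
```

Each run gives a different set of 100 IDs; calling `random.seed(...)` with a fixed number first would make the draw repeatable.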

STATISTICS OF SAMPLING
In the preceding lecture, terms such as population parameter, sample statistic, and sampling bias
were introduced. Now, we will try to understand what these terms mean and how they are related to
each other.
When you measure a certain observation from a given unit, such as a person’s response to a
Likert-scaled item (as shown in the item table below), that observation is called a response.
In other words, a response is a measurement value provided by a sampled unit. Each respondent will
give you different responses to different items in an instrument. Responses from different respondents
to the same item or observation can be graphed into a frequency distribution based on their frequency of
occurrences.
For a large number of responses in a sample, this frequency distribution tends to resemble a bell-
shaped curve called a normal distribution, which can be used to estimate overall characteristics of the
entire sample, such as sample mean (average of all observations in a sample) or standard deviation
(variability or spread of observations in a sample). These sample estimates are called sample statistics
(a “statistic” is a value that is estimated from observed data).


Item names: attitude 1 to attitude 5

No.   attitude 1   attitude 2   attitude 3   attitude 4   attitude 5
  1       3            4            3            2            4
  2       3            3            2            2            2
  3       3            1            1            1            1
  4       3            3            1            4            3
  5       3            4            2            2            2
  6       5            3            2            1            4
  7       2            3            4            4            4
  8       3            4            2            2            1
  9       3            2            3            3            1
 10       3            2            1            3            3
 11       1            3            3            3            3
 12       3            4            2            2            0   (missing value)
 13       3            3            3            3            2
 14       3            2            3            2            1
 15       3            3            3            3            3
 16       4            4            1            3            4
 17       4            3            3            3            2
 18       3            3            1            3            1
 19       3            3            1            4            1
 20       3            3            3            2            1

Each cell is an individual response. Each row contains all responses from one respondent. Each
column contains all responses from all respondents in one item. Note: the mean or SD of such a
column is a SAMPLE STATISTIC.

Populations also have means and standard deviations that could be obtained if we could sample
the entire population. However, since the entire population can never be sampled, population
characteristics are always unknown, and are called population parameters (and not “statistic” because
they are not statistically estimated from data).
Sample statistics may differ from population parameters if the sample is not perfectly
representative of the population; the difference between the two is called sampling error. Theoretically,
if we could gradually increase the sample size so that the sample approaches closer and closer to the
population, then sampling error will decrease and a sample statistic will increasingly approximate the
corresponding population parameter.
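
The shrinking of sampling error as the sample grows can be seen in a small simulation. The population of scores below is invented and the function name `sampling_error` is illustrative. Individual runs fluctuate, so the error is not guaranteed to fall at every step; it only tends to, and it reaches zero when the sample is the whole population.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is repeatable

# Hypothetical population of 5,000 test scores (integers from 0 to 100).
population = [random.randint(0, 100) for _ in range(5000)]
population_mean = statistics.mean(population)    # the population parameter

def sampling_error(n):
    """Absolute difference between one sample mean and the population mean."""
    sample = random.sample(population, n)        # drawn without replacement
    return abs(statistics.mean(sample) - population_mean)

for n in (10, 100, 1000, 5000):
    print(n, round(sampling_error(n), 3))
```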
If a sample is truly representative of the population, then the estimated sample statistics should
be identical to corresponding theoretical population parameters. There is a need for you to understand
the concept of a sampling distribution to be able to know when your samples are at least reasonably
close to the population parameters.
A sampling distribution is a frequency distribution of a sample statistic (like sample mean) from
a set of samples, while the commonly referenced frequency distribution is the distribution of a response
(observation) from a single sample. Just like a frequency distribution, the sampling distribution will also
tend to have more sample statistics clustered around the mean (which presumably is an estimate of a
population parameter), with fewer values scattered around the mean. With an infinitely large number of
samples, this distribution will approach a normal distribution. The variability or spread of a sample
statistic in a sampling distribution (i.e., the standard deviation of a sampling statistic) is called its
standard error. In contrast, the term standard deviation is reserved for variability of an observed
response from a single sample.
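
A sampling distribution can also be built by simulation: draw many samples, record each sample mean, and look at the spread of those means. In the sketch below the population, the sample size of 25, and the 2,000 repetitions are arbitrary choices. The comparison value sigma / sqrt(n) is the standard textbook expression for the standard error of a mean; it is not stated in this handout and is included only as a check.

```python
import math
import random
import statistics

random.seed(7)  # fixed seed so the sketch is repeatable

# Hypothetical population of 10,000 scores.
population = [random.gauss(50, 10) for _ in range(10000)]
sigma = statistics.pstdev(population)        # population standard deviation

n = 25               # size of each sample
num_samples = 2000   # number of samples drawn

# One sample mean per sample: together they form the sampling distribution.
sample_means = [
    statistics.fmean(random.sample(population, n)) for _ in range(num_samples)
]

standard_error = statistics.stdev(sample_means)   # spread of the sample means
theoretical_se = sigma / math.sqrt(n)              # textbook value for comparison

print(round(standard_error, 3), round(theoretical_se, 3))
```

The standard deviation of a single sample describes how individual responses vary; the standard error computed here describes how the sample mean itself varies from sample to sample.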
The mean value of a sample statistic in a sampling distribution is presumed to be an estimate of
the unknown population parameter. Based on the spread of this sampling distribution (i.e., based on
the standard error), it is also possible to estimate confidence intervals for that population
parameter. A confidence interval is an interval of sample statistic values within which the population
parameter is estimated to lie with a stated probability. All normal distributions follow a 68-95-99 percent
rule (see Figure below), which says that over 68% of the cases in the distribution lie within one standard
deviation of the mean value (μ ± 1σ), over 95% of the cases in the distribution lie within two standard
deviations of the mean (μ ± 2σ), and over 99% of the cases in the distribution lie within three standard
deviations of the mean value (μ ± 3σ). Since a sampling distribution with an infinite number of samples
will approach a normal distribution, the same 68-95-99 rule applies, and it can be said that:


• (Sample statistic ± one standard error) represents a 68% confidence interval for the population
parameter.
• (Sample statistic ± two standard errors) represents a 95% confidence interval for the population
parameter.
• (Sample statistic ± three standard errors) represents a 99% confidence interval for the population
parameter.
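
As a worked sketch of these three intervals, the code below uses the 20 "attitude 1" responses from the item table shown earlier. Estimating the standard error as the sample standard deviation divided by the square root of the sample size is an assumption added here (the handout does not give a formula for it), and the variable names are illustrative.

```python
import math
import statistics

# "attitude 1" responses of the 20 respondents in the item table.
attitude_1 = [3, 3, 3, 3, 3, 5, 2, 3, 3, 3, 1, 3, 3, 3, 3, 4, 4, 3, 3, 3]

mean = statistics.mean(attitude_1)
sd = statistics.stdev(attitude_1)            # sample standard deviation
se = sd / math.sqrt(len(attitude_1))         # assumed estimate of the standard error

# Sample statistic plus or minus 1, 2, and 3 standard errors.
intervals = {
    "68%": (mean - 1 * se, mean + 1 * se),
    "95%": (mean - 2 * se, mean + 2 * se),
    "99%": (mean - 3 * se, mean + 3 * se),
}

for level, (low, high) in intervals.items():
    print(level, round(low, 2), round(high, 2))
```

With only 20 respondents these intervals are rough; they are meant to show the arithmetic of the rule rather than a finished analysis.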

[Figure: normal curve with the horizontal axis marked from μ − 3σ to μ + 3σ. 68% of data are within
1 standard deviation of the mean, 95% within 2 standard deviations, and 99.7% within 3 standard
deviations. Area by band on each side of the mean: 34%, 13.5%, 2.4%, and 0.1%.]

A sample is “biased” (i.e., not representative of the population) if its sampling distribution
cannot be estimated or if the sampling distribution violates the 68-95-99 percent rule. As an aside, note
that in most regression analysis where we examine the significance of regression coefficients with
p<0.05, we are attempting to see if the sampling statistic (regression coefficient) predicts the
corresponding population parameter (true effect size) with a 95% confidence interval. Interestingly, the
“six sigma” standard attempts to identify manufacturing defects outside the 99% confidence interval or
six standard deviations (standard deviation is represented using the Greek letter sigma), representing
significance testing at p<0.01.

DETERMINING THE SAMPLE SIZE

The sample size depends on three factors: (1) the degree of accuracy required; (2) the amount of
variability inherent in the population from which the sample was taken; and (3) the nature and
complexity of the characteristics of the population under consideration. There are various formulas for
calculating the required sample size based upon whether the data collected is to be of a categorical or
quantitative nature (e.g., is to estimate a proportion or a mean). These formulas require knowledge of
the variance or proportion in the population and a determination as to the maximum desirable error, as
well as the acceptable Type I error risk (e.g., confidence level).

Required Sample Size
from: The Research Advisors

                 Confidence = 95.0% (3.841459)      Confidence = 99.0% (6.634897)
                 Degree of Accuracy/Margin of Error Degree of Accuracy/Margin of Error
Population Size    0.05   0.035   0.025    0.01       0.05   0.035   0.025    0.01
             10      10      10      10      10         10      10      10      10
             20      19      20      20      20         19      20      20      20
             30      28      29      29      30         29      29      30      30
             50      44      47      48      50         47      48      49      50
             75      63      69      72      74         67      71      73      75
            100      80      89      94      99         87      93      96      99
            150     108     126     137     148        122     135     142     149
            200     132     160     177     196        154     174     186     198
            250     152     190     215     244        182     211     229     246
            300     169     217     251     291        207     246     270     295
            400     196     265     318     384        250     309     348     391
            500     217     306     377     475        285     365     421     485
            600     234     340     432     565        315     416     490     579
            700     248     370     481     653        341     462     554     672
            800     260     396     526     739        363     503     615     763
            900     269     419     568     823        382     541     672     854
          1,000     278     440     606     906        399     575     727     943
          1,200     291     474     674    1067        427     636     827    1119
          1,500     306     515     759    1297        460     712     959    1376
          2,000     322     563     869    1655        498     808    1141    1785
          2,500     333     597     952    1984        524     879    1288    2173
          3,500     346     641    1068    2565        558     977    1510    2890
          5,000     357     678    1176    3288        586    1066    1734    3842
          7,500     365     710    1275    4211        610    1147    1960    5165
         10,000     370     727    1332    4899        622    1193    2098    6239
         25,000     378     760    1448    6939        646    1285    2399    9972
         50,000     381     772    1491    8056        655    1318    2520   12455
         75,000     382     776    1506    8514        658    1330    2563   13583
        100,000     383     778    1513    8762        659    1336    2585   14227
        250,000     384     782    1527    9248        662    1347    2626   15555
        500,000     384     783    1532    9423        663    1350    2640   16055
      1,000,000     384     783    1534    9512        663    1352    2647   16317
      2,500,000     384     784    1536    9567        663    1353    2651   16478
     10,000,000     384     784    1536    9594        663    1354    2653   16560
    100,000,000     384     784    1537    9603        663    1354    2654   16584
    264,000,000     384     784    1537    9603        663    1354    2654   16586
† Copyright, The Research Advisors (2006). All rights reserved.


The formula used for these calculations was:

This formula is the one used by Krejcie & Morgan in their 1970 article “Determining Sample Size for
Research Activities” (Educational and Psychological Measurement, #30, pp. 607-610).
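
The formula itself is missing from this copy. As an assumption, the sketch below uses the form commonly attributed to Krejcie & Morgan: n = X² N P(1 − P) / (d²(N − 1) + X² P(1 − P)), where X² is the chi-square value for the chosen confidence level (3.841459 for 95% and 6.634897 for 99%, as in the table header), N is the population size, P is the population proportion, and d is the margin of error. Taking P = 0.5 and rounding to the nearest whole number are further assumptions; with them, the function reproduces the table entries checked here (for example, 278 and 399 for a population of 1,000 at a 0.05 margin of error). The function name is hypothetical.

```python
def required_sample_size(population, margin_of_error, chi_square=3.841459, p=0.5):
    """Assumed Krejcie & Morgan form; p = 0.5 and rounding are assumptions."""
    numerator = chi_square * population * p * (1 - p)
    denominator = margin_of_error ** 2 * (population - 1) + chi_square * p * (1 - p)
    return round(numerator / denominator)

# Population of 1,000 with a 0.05 margin of error.
print(required_sample_size(1000, 0.05))                        # 95% confidence
print(required_sample_size(1000, 0.05, chi_square=6.634897))   # 99% confidence
```

For any population size not listed in the table, the same call gives the corresponding sample size under these assumptions.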

Proportional Allocation of Samples:

$$n_i = \frac{n \, N_i}{N}$$

where i = 1, 2, 3, ... is the group number; $n_i$ = group sample allocation; $n$ = desired/estimated
sample size; $N_i$ = group population; and $N$ = total population.
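
The allocation formula can be applied group by group, as in the sketch below. The three colleges, their enrollments, and the function name are invented for illustration.

```python
def proportional_allocation(group_populations, sample_size):
    """Allocate sample_size across groups in proportion to group size:
    n_i = n * N_i / N."""
    total = sum(group_populations.values())
    return {
        group: sample_size * size / total
        for group, size in group_populations.items()
    }

# Hypothetical enrollment per college (N = 1,000) and a desired sample of 100.
groups = {"College A": 500, "College B": 300, "College C": 200}
allocation = proportional_allocation(groups, 100)
print(allocation)
```

When the results are not whole numbers, they have to be rounded, and the rounded allocations should be checked so that they still add up to the desired sample size.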

Guidelines with regard to the minimum number of items needed for a representative sample:

• Descriptive studies: a minimum of 100.
• Correlational studies: a sample of at least 30 is deemed necessary to establish the existence of a
relationship.
• Experimental and causal comparative studies: a minimum of 30 per group. Sometimes
experimental studies with only 15 items in each group can be defended if they are very tightly
controlled.

If the sample is randomly selected and is sufficiently large, an accurate view of the
population can be obtained, provided that no bias enters the selection process.


Worksheet no. 1

1. What is statistics? Give one specific application of statistics in each of the following fields:
a. Education d. Biology
b. Business e. Economics
c. Psychology
2. Differentiate the following and give example for each.
a. Descriptive and inferential statistics.
b. Sample and population
c. Discrete and continuous variables.
d. Qualitative and quantitative data.
3. Write the correct answer in the space provided for.
1. A collection of all the objects to be studied.
2. The highest level of measurement.
3. The level of measurement that can only be classified
into groups.
4. The level of measurement in rating a teacher as
outstanding, very satisfactory, satisfactory and poor.
5. A subset or a part of the subjects to be studied.
6. The level of measurement of the variable I.Q.
7. A sample that does not represent a population
correctly.
8. A sampling that subdivided the population into
subgroups with similar characteristics.
9. The use of preexisting groups in a sample.
10. A sampling procedure done by taking every third item
to be tested.
11. A sampling procedure done by numbering the items to
determine which ones to test

4. On the space provided after each number, write Q if the variable is qualitative and if it is
quantitative, write D if it is discrete and C if continuous.
1. Educational attainment 8. Brand of watches
2. ID number 9. Student number
3. IQ score 10. Height of the building
4. Political affiliation 11. Number of years in school
5. Rank of teachers 12. Speed of cars
6. Place of residence 13. Weight of children
7. Time required to take the examination 14. Height of the tree
