Stats Bio Supp. 1
Objectives:
Introduction
Decision makers make better decisions when they use all available information in an effective
and meaningful way. The primary role of statistics is to provide decision makers with methods for
obtaining and analyzing information to help make these decisions. Statistics is used to answer long-
range planning questions, such as when and where to locate facilities to handle future sales.
The word statistics is derived from the Latin word status, meaning “state”. In the beginning,
statistics involved the compilation of data and graphs describing various aspects of a state or country.
The word statistics means different things to different people. To some, statistics means the actual
numbers derived from data; others refer to statistics as a method of analysis. Specifically, statistics is
defined as the science of collecting, organizing, presenting, analyzing and interpreting numerical data
for the purpose of assisting in making a more effective decision.
Statistical methods are vital tools for research in education, psychology, medicine,
business, agriculture, and other disciplines.
Types of Statistics
Statistics is a tool which helps us develop general and meaningful conclusions that go beyond
the original data. There are two types of statistical analyses: Descriptive and Inferential or Inductive
Statistics.
1. Descriptive Statistics covers all the methods used to collect, organize, summarize or present data,
usually to make the data easier to understand. It is concerned with summary calculations such as
averages and percentages, and with the construction of graphs, charts and tables.
2. Inferential Statistics is concerned with the formulation of conclusions or generalizations
about a population based on an observation or a series of observations of a sample drawn from that
population. It consists of performing hypothesis testing, determining relationships among variables,
and making predictions. For example, the average family income of the residents in Region 2 can be
estimated from figures obtained from a few hundred families (the sample).
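The distinction between the two types can be sketched in a few lines of Python. This is only an illustration: the income figures below are made up, and the standard error of the mean is used here as one simple way of expressing how uncertain the estimate of the population average is.

```python
import statistics

# Hypothetical monthly family incomes from a small sample of families
# (illustrative numbers only, not real survey data).
sample_incomes = [18000, 22000, 25000, 19500, 30000, 21000, 27500, 24000]

# Descriptive statistics: summarize the data that were actually collected.
sample_mean = statistics.mean(sample_incomes)
share_above_20k = 100 * sum(x > 20000 for x in sample_incomes) / len(sample_incomes)

# Inferential statistics: use the sample to say something about the population.
# The sample mean serves as a point estimate of the unknown population mean,
# and the standard error indicates how much that estimate varies between samples.
sample_sd = statistics.stdev(sample_incomes)
standard_error = sample_sd / len(sample_incomes) ** 0.5

print(f"Sample mean income: {sample_mean:.2f}")
print(f"Families earning more than 20,000: {share_above_20k:.1f}%")
print(f"Estimated population mean: {sample_mean:.2f} (standard error {standard_error:.2f})")
```

The first two printed values only describe the eight sampled families; the last line is an inference about all families from which the sample was drawn.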
Some quantitative variables can take on only specific or isolated values along a scale. For
example, the number of children in a family may be 1, 2, 3, or any other whole number, but it can
never be 1.25 or 0.5. Such a variable has values that can only be obtained through the process of
counting and is referred to as a discrete or discontinuous variable.
Specifically, quantitative variables can be ordered and ranked. They can be classified into two
groups: Discrete and Continuous.
Discrete variables have values that are obtained by counting. The results are whole numbers. For
example, the number of students in the room.
Continuous variables have values that are obtained by measuring. The results can be any value
between two specific values. For example, if you measure the height of every student in the room, you
could get any number between two reasonable amounts. So height is a continuous variable.
Levels of Measurement:
Variables can also be classified according to the level of measurement. There are four levels of
measurement: Nominal, Ordinal, Interval and Ratio.
1. Nominal Data: The weakest level of measurement. Numbers are used only to represent an item or
characteristic. Examples include: names, gender, religious affiliation, civil status, college majors. Note
that such data should not be treated as numerical, since relative size has no meaning.
2. Ordinal or Rank Data: These can be ordered or ranked, but a specific difference between the levels
cannot be determined. For example, a performance rating (Outstanding, Very Satisfactory,
Satisfactory, Poor) can be ordered. You know that Outstanding is higher than Very Satisfactory
and Very Satisfactory is higher than Satisfactory, etc., but there is no exact difference between any two
of them. For example, the grades behind Outstanding and Very Satisfactory may be close (4.65 and 4.45) or
may be far apart (5.00 and 4.25), so the exact difference cannot be determined.
3. Interval Data: These can be ordered and have an exact difference between any two units, but have no
meaningful zero or starting point. For example, temperature (in degrees Celsius or Fahrenheit) is
interval data: values can be ordered and there is an exact difference between any two degrees, but zero
does not mark a starting point, since there can be temperatures below zero.
4. Ratio Data: This is the highest level of measurement and allows for all basic arithmetic operations,
including division and multiplication. Data at this level can be ordered, have exact differences between
units, and have a meaningful zero. Things that are counted are usually ratio level, for example, business
data such as cost, revenue and profit.
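A small sketch may help show why the missing true zero matters for interval data. It assumes temperatures recorded in degrees Celsius and uses the Kelvin scale, which does have a true zero, for comparison; the two readings are arbitrary illustrative values.

```python
def celsius_to_kelvin(celsius):
    """Convert a Celsius reading to Kelvin, a scale with a true zero."""
    return celsius + 273.15

t_morning, t_noon = 10.0, 20.0  # hypothetical readings in degrees Celsius

# Differences are meaningful on an interval scale.
difference = t_noon - t_morning

# Ratios are not: 20 degrees C is not "twice as hot" as 10 degrees C.
naive_ratio = t_noon / t_morning

# On a ratio scale (Kelvin) the ratio is meaningful, and it is close to 1.
kelvin_ratio = celsius_to_kelvin(t_noon) / celsius_to_kelvin(t_morning)

print(f"Difference: {difference} degrees")
print(f"Ratio of Celsius values: {naive_ratio:.2f}")
print(f"Ratio of Kelvin values:  {kelvin_ratio:.3f}")
```

The difference of 10 degrees is the same on both scales, while the ratio changes completely, which is why division is reserved for ratio-level data.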
Sources of Data:
1. Secondary Data: Data which are already available. For example, ISU enrollment data.
Secondary data is less expensive; however, it may not satisfy the researcher’s need.
2. Primary Data: Data which must be collected.
Sampling Techniques:
Sampling Techniques are used when only a part of the population is to be surveyed. If it takes too
long or is too expensive to interview the whole population, a sample is used. If a sample is chosen
correctly so that it represents the population, it is called unbiased; if it does not represent the whole
population, it is called biased.
There are many ways to collect a sample, statistical or non-statistical. The most commonly used
methods are:
A. Statistical Sampling:
1. Simple Random Sampling: This ensures that all possible elements of the population
have an equal opportunity of being selected for the sample.
2. Stratified Random Sampling: This is obtained by selecting simple random samples from strata
(or mutually exclusive sets). Some of the criteria for dividing a population into strata are: Gender (male,
female); Age (under 18, 18 to 28, 29 to 39); Occupation (blue-collar, professional, other).
3. Cluster Sampling: This is a simple random sample of groups, or clusters, of elements. Cluster
sampling is useful when it is difficult or costly to generate a simple random sample. For example, to
estimate the average annual household income in a large city we might use cluster sampling, because
simple random sampling would require a complete list of households in the city from which to sample.
Stratified random sampling would again require the list of households. A less expensive way is to let
each block within the city represent a cluster. A sample of clusters could then be randomly selected, and
every household within these clusters could be interviewed to find the average annual household
income.
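The three statistical sampling methods listed above can be compared side by side in a short Python sketch. Everything in it is hypothetical: a made-up population of 60 households, six city blocks serving as clusters, three occupation groups serving as strata, and an arbitrary random seed so that the result is reproducible.

```python
import random

rng = random.Random(42)  # fixed seed (an arbitrary choice) for reproducibility

# Hypothetical population: 60 households, each tagged with a city block
# (the cluster) and an occupation group (the stratum).
occupations = ("blue-collar", "professional", "other")
population = [
    {"id": i, "block": i % 6, "occupation": occupations[i % 3]}
    for i in range(1, 61)
]

# 1. Simple random sampling: every household has the same chance of selection.
simple_sample = rng.sample(population, 12)

# 2. Stratified random sampling: a simple random sample from each stratum.
strata = {}
for unit in population:
    strata.setdefault(unit["occupation"], []).append(unit)
stratified_sample = [
    unit for members in strata.values() for unit in rng.sample(members, 4)
]

# 3. Cluster sampling: randomly select whole blocks, then include
#    every household inside the selected blocks.
blocks = sorted({unit["block"] for unit in population})
chosen_blocks = rng.sample(blocks, 2)
cluster_sample = [unit for unit in population if unit["block"] in chosen_blocks]

print(len(simple_sample), len(stratified_sample), len(cluster_sample))
```

Note that only the cluster approach avoids needing a list of all households before sampling begins; it needs only a list of blocks.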
B. Nonstatistical Sampling:
1. Judgement Sampling: In this case, the person taking the sample has direct or indirect control
over which items are selected for the sample.
2. Convenience Sampling: In this method, the decision maker selects a sample from the
population in a manner that is relatively easy and convenient.
3. Quota Sampling: In this method, the decision maker requires the sample to contain a certain
number of items with a given characteristic. Many political polls are, in part, quota sampling.
Note: A random number table provides lists of numbers that are randomly generated and can be used
to select random samples. Computer packages can also be used to generate lists of random numbers. For the
table, refer to any statistics textbook.
Infer
defines a sample is always the same as the physical unit that defines the main, or parent, population. A
single element of a sample is called an event.
When a sample consists of the whole population, it is called a census. When a sample consists of
a subset of a population whose elements are chosen at random, it is called a random sample.
Prepared by MARIANNE JANE ANTOINETTE D. PUA, M.S. -Page 4
2023
A strict random sample is not usually feasible since only readily available items or transactions
can easily be inspected. In order to capture changes that are taking place in the quality of process
output, small samples are taken at regular intervals of time. Such a sampling scheme is called the
method of rational subgroups. Such sample data are treated as if random samples were taken at each
point in time, with the understanding that one should be alert to any known reasons why such a
sampling scheme could lead to biased results.
NOTE: For the purpose of statistical inference a representative sample is desired. Yet, the methods of
statistical inference require only that a random sample be obtained. There is no sampling method that
can guarantee a representative sample. The best we can do is to avoid any consistent or systematic bias
by the use of random (probability) sampling. Some causes of bias in sampling are voluntary response,
investigator intervention, or the effects of periodic, seasonal and/or systematic gathering of data.
While a random sample rarely will be exactly representative of the target population from which
it was obtained, use of this procedure does guarantee that only chance factors underlie the amount of
difference between the sample and the population.
In statistical sampling, a table of random numbers or a random-number generator computer
program generally is used to identify the numbered items in the population that are to be selected for the
sample. Excel is also a powerful tool in generating a sample from a given population.
Problem: A researcher wishes to obtain a simple random sample of 100 households from 876
households in San Fabian, Echague, Isabela. (For convenience, the households are identified
by the ID numbers 1 through 876.) Use Excel to obtain the 100 ID numbers of the sampled
households to be included in the study.
Steps:
(1) Open Excel. Place the integers from 1 to 876 in column A of the worksheet by first
entering the number 1 in cell A1. With cell A1 active (by clicking away from and back to A1, for instance),
CLICK EDIT, FILL, SERIES to open the Series dialog box.
(2) Select the Series in Columns button with a Step value of 1 and a Stop value of 876. CLICK OK,
and the integers 1 to 876 will appear in column A.
(3) To identify the 100 households to be sampled, CLICK TOOLS, DATA ANALYSIS,
SAMPLING. Designate the Input Range as $A$1:$A$876, the Sampling Method as Random, the
number of samples as 100, and the Output Range as $B$1. CLICK OK, and the IDs of the randomly
selected households will appear in rows 1 through 100 of column B.
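For readers without Excel, the same problem can be approached with Python's standard library. This is a sketch rather than a replacement for the steps above; the seed value is an arbitrary assumption, included only so that the same 100 IDs are produced on every run.

```python
import random

# Household ID numbers 1 through 876, as in the problem statement.
household_ids = range(1, 877)

rng = random.Random(2023)  # arbitrary seed, used only for reproducibility

# random.sample draws without replacement, so no household ID is repeated.
sampled_ids = sorted(rng.sample(household_ids, 100))

print(sampled_ids)
```

Removing the seed (calling `random.Random()` with no argument) yields a different sample on each run.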
STATISTICS OF SAMPLING
In the preceding lecture, terms such as population parameter, sample statistic, and sampling bias
were introduced. Now, we will try to understand what these terms mean and how they are related to
each other.
When you measure a certain observation from a given unit, such as a person’s response to a
Likert-scaled item (as shown in the figure on the following page), that observation is called a response.
In other words, a response is a measurement value provided by a sampled unit. Each respondent will
give you different responses to different items in an instrument. Responses from different respondents
to the same item or observation can be graphed into a frequency distribution based on their frequency of
occurrences.
For a large number of responses in a sample, this frequency distribution tends to resemble a bell-
shaped curve called a normal distribution, which can be used to estimate overall characteristics of the
entire sample, such as sample mean (average of all observations in a sample) or standard deviation
(variability or spread of observations in a sample). These sample estimates are called sample statistics
(a “statistic” is a value that is estimated from observed data).
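As a rough illustration of these ideas, the sketch below tabulates a frequency distribution and computes the sample mean and standard deviation. The twelve responses are hypothetical answers to a single 5-point Likert item, invented for this example.

```python
import statistics
from collections import Counter

# Hypothetical responses of 12 respondents to one 5-point Likert item.
responses = [4, 5, 3, 4, 2, 4, 5, 3, 4, 1, 3, 4]

# Frequency distribution: how often each response value occurs.
frequency = Counter(responses)
for value in sorted(frequency):
    print(f"{value}: {'#' * frequency[value]} ({frequency[value]})")

# Sample statistics estimated from the observed data.
sample_mean = statistics.mean(responses)  # average of all observations
sample_sd = statistics.stdev(responses)   # spread of the observations

print(f"Sample mean: {sample_mean:.2f}")
print(f"Sample standard deviation: {sample_sd:.2f}")
```

With only twelve responses the text histogram is lumpy; the bell shape described above is something to look for only with a large number of responses.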
(Sample statistic ± one standard error) represents an approximately 68% confidence interval for the
population parameter.
(Sample statistic ± two standard errors) represents an approximately 95% confidence interval for the
population parameter.
(Sample statistic ± three standard errors) represents an approximately 99.7% confidence interval
(commonly rounded to 99%) for the population parameter.
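[Figure: normal distribution curve. About 68% of values lie within 1 standard deviation of the mean
(34% on each side) and about 95% within 2 standard deviations (a further 13.5% on each side); the outer
bands are labeled 2.4% and 0.1% on each side.]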
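The percentages attached to one, two and three standard errors can be checked numerically. The sketch below assumes an exactly normal distribution and uses the error function from Python's math module; the function name is introduced here for illustration only.

```python
import math

def coverage_within(k):
    """Probability that a normally distributed value lies within
    k standard deviations of its mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"Within {k} standard deviation(s): {100 * coverage_within(k):.2f}%")
```

Under this assumption the exact 95% interval uses about 1.96 rather than 2 standard errors, which is why the whole-number multiples are described as approximate.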
A sample is “biased” (i.e., not representative of the population) if its sampling distribution
cannot be estimated or if the sampling distribution violates the 68-95-99 percent rule. As an aside, note
that in most regression analysis where we examine the significance of regression coefficients with
p<0.05, we are attempting to see if the sampling statistic (regression coefficient) predicts the
corresponding population parameter (true effect size) with a 95% confidence interval. Interestingly, the
“six sigma” standard attempts to identify manufacturing defects outside the 99% confidence interval or
six standard deviations (standard deviation is represented using the Greek letter sigma), representing
significance testing at p<0.01.
Required Sample Size
from: The Research Advisors
This formula is the one used by Krejcie & Morgan in their 1970 article “Determining Sample Size for
Research Activities” (Educational and Psychological Measurement, #30, pp. 607-610).
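The formula itself does not appear in this text. As a sketch only, the function below implements the form in which the Krejcie and Morgan formula is commonly published, s = X²NP(1 − P) / (d²(N − 1) + X²P(1 − P)), with the usual default values X² = 3.841 (chi-square for 1 degree of freedom at 95% confidence), P = 0.5 and d = 0.05. Both the formula as written here and the defaults are assumptions that should be checked against the 1970 article before use; the function and parameter names are hypothetical.

```python
def krejcie_morgan_sample_size(population_size, chi_square=3.841,
                               proportion=0.5, margin=0.05):
    """Required sample size for a finite population, using the commonly
    published form of the Krejcie and Morgan (1970) formula (assumed here)."""
    variability = chi_square * proportion * (1 - proportion)
    numerator = variability * population_size
    denominator = margin ** 2 * (population_size - 1) + variability
    return numerator / denominator

# Example: the 876 households used in the Excel problem earlier in this handout.
required = krejcie_morgan_sample_size(876)
print(f"Required sample size for N = 876: about {round(required)}")
```

Under these assumptions the required sample size grows much more slowly than the population size, so large populations do not need proportionally large samples.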
Proportional Allocation of Samples: $n_i = \dfrac{n \, N_i}{N}$
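A short sketch of proportional allocation follows, reading n as the total sample size, N_i as the number of units in stratum i, N as the population size and n_i as the number of units to sample from stratum i. The stratum names and sizes are hypothetical; they were chosen only so that the population totals 876.

```python
def proportional_allocation(stratum_sizes, total_sample):
    """Allocate a total sample across strata in proportion to stratum size:
    n_i = n * N_i / N."""
    population_size = sum(stratum_sizes.values())
    return {
        name: total_sample * size / population_size
        for name, size in stratum_sizes.items()
    }

# Hypothetical strata making up a population of 876 households.
stratum_sizes = {"Stratum A": 400, "Stratum B": 300, "Stratum C": 176}

allocation = proportional_allocation(stratum_sizes, total_sample=100)
for name, share in allocation.items():
    print(f"{name}: {share:.2f} -> {round(share)} households")
```

Because each n_i is rounded to a whole number, the rounded values may not always add up exactly to n; in that case one stratum is usually adjusted by one unit.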
Worksheet no. 1
1. What is statistics? Give one specific application of statistics in each of the following fields:
a. Education d. Biology
b. Business e. Economics
c. Psychology
2. Differentiate the following and give example for each.
a. Descriptive and inferential statistics.
b. Sample and population
c. Discrete and continuous variables.
d. Qualitative and quantitative data.
3. Write the correct answer in the space provided.
1. A collection of all the objects to be studied.
2. The highest level of measurement.
3. The level of measurement that can only be classified
into groups.
4. The level of measurement in rating a teacher as
outstanding, very satisfactory, satisfactory and poor.
5. A subset or a part of the subjects to be studied.
6. The level of measurement of the variable I.Q.
7. A sample that does not represent a population
correctly.
8. A sampling method that subdivides the population into
subgroups with similar characteristics.
9. The use of preexisting groups in a sample.
10. A sampling procedure done by taking every third item
to be tested.
11. A sampling procedure done by numbering the items to
determine which ones to test
4. On the space provided after each number, write Q if the variable is qualitative and if it is
quantitative, write D if it is discrete and C if continuous.
1. Educational attainment 8. Brand of watches
2. ID number 9. Student number
3. IQ score 10. Height of the building
4. Political affiliation 11. Number of years in school
5. Rank of teachers 12. Speed of cars
6. Place of residence 13. Weight of children
7. Time required to take the examination 14. Height of the tree