Stat 101C Lecture Notes 1

West Visayas State University

(Formerly Iloilo Normal School)


COLLEGE OF ARTS AND SCIENCES
Luna St., La Paz, Iloilo City 5000
Iloilo, Philippines
* Trunkline: (063) (033) 320-0870 loc 1602 * Telefax No.: (033) 320-0879
* Website: www.wvsu.edu.ph * Email Address: cas@wvsu.edu.ph

Course notes
in
Stat 101 – Fundamentals of Statistics

Unit I Descriptive statistics


1. Graphical methods for describing data
2. Numerical methods for describing data
3. Examining categorical data using contingency tables
4. Sampling techniques

Learning outcomes
1. Demonstrate understanding of descriptive statistics by practical application of quantitative reasoning and data visualization.
2. Compare the common methods of gathering sample data and identify the sampling techniques for different problem
situations.
3. Organize data using tables, charts and graphs and interpret the results.
4. Identify and calculate the measures of the center of the data.
5. Identify and calculate the measures of the variability of the data.

Introduction

As our society becomes more technologically complex, greater demands are being placed on professionals to understand and use the
results of research designed to solve applied problems. This generally requires a working understanding of statistical methods.
Knowledge of statistical analysis also helps to foster new and creative ways of thinking about problems. These skills can be
applied to any area of inquiry and hence are extremely useful.

Social scientists attempt to explain and predict human behavior. They also take “educated guesses” about the nature of social reality,
although in a far more precise and structured manner. In the process, social scientists examine characteristics of human behavior
called variables – characteristics that differ or vary from one individual to another (for example, age, social class, and attitude) or
from one point in time to another (for example, unemployment, crime rate, and population).

Basic statistics terms

Statistics is any numerical data or quantitative analysis. It is also a certain kind of measure used to evaluate a selected property of
the collection of items under consideration. As a branch of science, it is concerned with the scientific methods of collecting,
organizing, summarizing, presenting and analyzing data, as well as drawing valid conclusions and making reasonable decisions
on the basis of such analysis.

Types of statistics
1. Descriptive statistics is the method of collecting, organizing, and utilizing numerical data derived from the empirical world.
It is the phase of statistics that seeks to describe and analyze a given group without drawing any conclusions or inferences
about a larger group.
Descriptive statistics is concerned with
a) characterizing what is “typical” or common in a group
b) indicating how widely the individuals in the group vary
c) presenting other aspects of the distribution of values with respect to the variable(s) being considered.
Examples: mean, percentages, proportions, standard deviation, regression and
correlation coefficient, construction of tables, charts and graphs.

2. Inferential statistics comprises some methods concerned with the analysis of a subset of data leading to predictions or
inferences about the entire set of data. Among the common types of analysis are:

a) testing for the existence of an association between variables
b) identifying the form of an observed relationship
c) refining observed associations into causal relationships
d) generalizing and predicting on the basis of observed data.
Examples: estimation, hypothesis testing.

The stages of Social research

Systematically testing our ideas about the nature of social reality often demands carefully planned and executed research
with the following elements:
1. The problem to be studied is reduced to a testable hypothesis
2. An appropriate set of instruments is developed
3. The data are collected
4. The data are analyzed
5. The results of the analysis are interpreted and communicated to an audience (say, through a lecture, journal article, or press release)

General uses of statistics


1. Summarizes data for public use
2. Aids in decision making
i. Provides comparison
ii. Explains action that has taken place
iii. Justifies a claim or assertion
iv. Predicts future outcomes
v. Estimates unknown quantities

Population and sample

Population is a totality of all actual or conceivable objects of a certain class under consideration. It can be finite or infinite.
Sample is a finite number of objects or persons selected from the population. It is a set of measurements that constitutes part of the
totality of all possible measurements of the same quantities.

Sampling – selection of a part of the population that is a representative cross section of it

Representative – property of a portion of the population that reflects the characteristics of the population
Survey – the collection of the information on a defined population to satisfy a definite need

Parameter – a value calculated from a population distribution

Statistic – a value calculated from a sample distribution

Sampling frame – a complete list of all units from which the sample is drawn

Variable is any quantity, measure, or characteristic which may possess different numerical values or categories.

Variables may also be classified as qualitative [differ in quality] or quantitative [differ in magnitude] or may be dependent [criterion
variables] or independent [predictor variables].

Some criteria in the selection of variables


1. appropriateness
2. clarity
3. measurability
4. comparability

Series of numbers in social science research


Uses: 1. Classify or categorize at the nominal level of measurement
2. Rank or order at the ordinal level of measurement
3. Assign a score at the interval or ratio level of measurement

Levels of measurements
1. Nominal measurement. This type of measurement is the lowest level, consisting of classifying items or individuals into two or
more categories. Its basic requirement is that one must be able to assign an item or individual to one and only one category
and specify the criteria for placing individuals into classes.

The empirical operation is the determination of sameness or equivalence between items with respect to a given
characteristic. The only relationship between two items belonging to the same category is that they are the same with respect
to a particular characteristic; there is no implication that one has more or less of the same characteristic than the other.

2. Ordinal measurement – specifies the relative position of items or individuals with respect to a given characteristic with no
indication as to the distance between the positions. The basic requirement is that one must be able to determine whether an
item has more, the same, or less of the attribute being considered than other items.

3. Interval measurement – property defined by an operation which permits making statements of equality of intervals rather than
just statements of difference and of greater than or less than. It can compare differences. It does not have an absolute 0, although
0 may be arbitrarily assigned.

4. Ratio measurement. Numbers on a ratio scale indicate the actual amounts of the characteristics being measured; hence it is
possible to say that an item has none of the characteristic or that the item with a score of 8 has twice as much as an item with a
score of 4. It has an origin or an absolute 0 point thus can consider relative positions.

Different ways to measure the same variable


1. Nominal level
Question: Are you currently in pain? Yes / No
Question: How would you characterize the type of pain? Sharp, Dull, Throbbing
2. Ordinal level
Question: How bad is the pain right now? None, Mild, Moderate, Severe
Question: Compared with yesterday, is the pain less severe, about the same, or more severe?
3. Interval/ratio level
0-10 numerical scale

|_____|_____|_____|_____|_____|_____|_____|_____|_____|_____|
0     1     2     3     4     5     6     7     8     9     10
No pain                                 Worst pain imaginable

Validity and reliability of measurement
In discussing validity of measurement, consider

“Certain basic questions must be asked about measuring instrument: What does it measure? Are the data it
provides relevant to the characteristics in which one is interested? To what extent do the differences in scores represent
true differences in the characteristic we are trying to measure? To what extent do they reflect also the influence of other
factors?”

The validity of a measuring instrument may be defined as the extent to which differences in scores on it reflect
true differences among individuals on the characteristic that we seek to measure, rather than constant or random errors.

The reliability of a measure is simply its consistency. A measure is reliable if the measurement does not change when the concept
being measured remains constant in value. However, if the concept being measured does change in value, the reliable measure will
indicate that change. In the case of a social-research instrument such as a questionnaire, the unreliability also lies within the scale
and may be due to such things as questions or answer categories so ambiguous that the respondent is unsure how he or she should
answer, and thus does not answer consistently.

Methods of Collecting Data

1. Questionnaire and Interview

Questionnaire – a survey instrument which is limited to the written responses of subjects to prearranged questions. It can be
mailed or handed to the informant with minimum explanation. It ensures anonymity.

Interview – allows for greater flexibility in eliciting information since the interviewer and the person interviewed are both present
when the questions are asked and answered.

Advantages:
• Questions can be repeated or rephrased for better understanding and clarity
• Offers greater opportunity for appraising the validity of reports
• Technique for revealing information about complex, emotionally-laden topics or for probing sentiments
underlying an expressed opinion
Types:
a. standardized or structured – questions are presented with exactly the same
wordings in the same order to all subjects.

1. fixed alternative questions – limits the subject’s response to stated alternatives


2. open-ended questions – permits free responses from the subjects by merely raising
the issue without providing any structure to the respondent's reply

b. unstructured – neither the questions to be asked nor the responses permitted the subject are determined before the
interview. It is used for intensive study of perceptions, attitudes, motivation, etc., which requires spontaneous, highly
specific, concrete, self-revealing, personal answers.

2. Observation – recording of behavior at the time of occurrence.

a. Unstructured observation – provides complete flexibility


1. participant observation – observer is a group member and participates in the functioning
of a group.
2. non-participant observation – observer watches the group but refrains from any
interaction with the observed.

b. Structured observation – use detailed formal instruments developed in advance.

3. Use of available records – use of data accumulated for purposes of administration or of historical description

Sources of Data

1. Documentary Sources – data contained in published and unpublished documents, reports, statistics, manuscripts, letters,
diaries, etc.

a) Primary Sources – first hand data wherein the responsibility for their compilation and promulgation remains under the
same authority that originally gathered them.
b) Secondary Sources – data that have been transcribed or compiled from original sources.

2. Field Sources – include living persons who have the fundamental knowledge about, or have been in intimate contact with social
conditions and changes over a considerable period of time. Source is more personal and direct.

Presentation of Data
1) Textual presentation – A textual presentation of data consists of describing the sample
data in expository form, i.e., it shows and emphasizes significant characteristics and results of the data gathered in paragraph
form. It should be arranged according to the importance of the data, emphasizing certain figures. It should also justify or explain
irregularities in figures. It is adequate for limited amounts of information. However, if there are many facts involved, this
method of presentation should not be used alone, because of the difficulty in reading and assimilating a repetitious list of facts
and figures. On the other hand, alongside either a tabular or graphic presentation, one or more accompanying paragraphs
can greatly enhance the understandability of the data.

2) Tabular presentation – Data presentation through tables

Statistical tables – a systematic way of arranging data in columns and rows. A table usually makes significant details and
relationships easy to grasp.

A statistical table has 6 major parts:


a) Table number. If more than one table is included in a report or article, the table
number is required for easy reference
b) Title. This usually indicates what is presented, where the data apply, how the data
are classified, and when, or for what time, the data are pertinent. The place or period of reference of the data is
sometimes omitted when no confusion or ambiguity can result. The title should be brief and concise. If the main
title cannot contain all necessary descriptive information, a subtitle or prefatory notes may be added.
c) Stub. Indicates the basis of classification in the rows
d) Column Headings. Indicate the bases of classification in the columns
e) Body of the table. Main part of the table properly containing the figures presented
f) Sources and footnotes. The source note should be complete, giving author, title, volume, page, publisher and date so
that any reader wishing to refer to the source will be able to do so conveniently. Footnotes contain explanations
concerning an individual figure, a row, or a column

3. Graphical Presentation – effective way of presenting quantitative data to an extensive and heterogeneous group of readers since
less effort in comprehension is required.

a) graph or chart – any device for showing numerical values or relationship in pictorial
form. It should be clear and simple.

Parts of a chart or graph


a) chart number
b) title
c) vertical scale captions – indicate the scale units and the basis of classification
along the vertical axis.
d) horizontal scale captions – indicate the scale units and the basis of classification
along the horizontal axis.
e) body
f) source notes and footnotes / legend

Kinds of Statistical charts


a) Line diagrams or curves – drawn with reference to arithmetic scales, the scale being one on which a given distance
will represent the same numerical value; should be drawn with horizontal or vertical dimensions bearing a reasonable
proportion to each other. It should be used when the interest lies more in depicting the movement or trend of the
values rather than the actual values themselves.
b) Bar charts – series of rectangles; used for comparison of categories where labels will be affixed to the individual bars.
It may be a vertical bar chart (or column chart) or a horizontal bar chart.
c) Pie chart – a circle divided into sectors; categories are compared through the angles of their sectors at the center of the circle
d) Pictographs – use small or large pictures of varying sizes to indicate quantities or magnitude

e) Statistical Maps – show numerical measurements and location

Sample graphs (figures not reproduced here): line graphs, 2-D and 3-D area graphs, pie graph, scatterplot, bar graphs, box plot

Measures of Central Tendency

A. Definition of terms

Central tendency - a central value between the upper and lower limits of a distribution around which the scores are distributed
average - value which is typical or representative of a set of data
- any single value/number that could represent the data set
- “center of gravity” of the data set
mean - the arithmetic average of all the observed values or groups of values in a distribution

median - a point in the distribution of observed values above which 50% of the values fall and below which 50% of the values
fall

mode - the most frequently observed value or group of values

B. The arithmetic mean


- “center of gravity” of the data set
Arithmetic mean for ungrouped data:
\[ \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} \]
Example 1.
Find the arithmetic mean of
1. the first 10 digits
2. 17, 45, 38, 27, 6, 48, 11, 57, 34, and 22

Example 2.
The heights (in meters) of the sampled volcanoes are as follows:

Table 1. Heights of Selected Volcanoes

Volcano      Height (m)
Mariveles    1420
Smith         688
Biliran      1187
Bulusan      1559
Calayo        302

The sample mean of the heights of these volcanoes can be readily computed as (1420 + 688 + 1187 + 1559 + 302)/5 = 1031.2.

For the systematic sample: Babuyan Claro, Cagua, Iraya, Mandalugan, and Ragang, the heights respectively are 837, 1158,
1008, 1880, 2815 so that here, the sample mean is (837+ 1158+ 1008+ 1880+ 2815)/5= 1539.6.

Thus, we see that a different sample will likely yield a different sample mean.
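The two computations above can be checked with a short Python sketch. It is only an illustration and not part of the course materials; the variable names are made up for this example, and the heights are copied from the two samples in the text.

```python
from statistics import mean

# Heights (in meters) of the two samples of volcanoes quoted in the notes.
sample_1 = [1420, 688, 1187, 1559, 302]    # Mariveles, Smith, Biliran, Bulusan, Calayo
sample_2 = [837, 1158, 1008, 1880, 2815]   # Babuyan Claro, Cagua, Iraya, Mandalugan, Ragang

# The sample mean is the sum of the observations divided by their number.
mean_1 = mean(sample_1)
mean_2 = mean(sample_2)

print("Sample 1 mean:", mean_1)
print("Sample 2 mean:", mean_2)
```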

If the numbers x1, x2, … , xk occur f1, f2, … , fk times respectively, the arithmetic mean is
\[ \bar{x} = \frac{\sum_{j=1}^{k} f_j x_j}{\sum_{j=1}^{k} f_j} \]

e.g. Find the arithmetic mean of the following scores of students in their quiz:
5, 3, 6, 5, 4, 5, 2, 8, 6, 5, 4, 8, 3, 4, 5, 4, 8, 2, 5, 4.
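One possible way to work this example is sketched below in Python. The sketch tallies the frequency of each distinct score and then applies the frequency formula; the names `freq` and `freq_mean` are illustrative only.

```python
from collections import Counter

# Quiz scores from the example.
scores = [5, 3, 6, 5, 4, 5, 2, 8, 6, 5, 4, 8, 3, 4, 5, 4, 8, 2, 5, 4]

# freq maps each distinct score x_j to its frequency f_j.
freq = Counter(scores)

# Mean = (sum of f_j * x_j) / (sum of f_j).
weighted_sum = sum(f * x for x, f in freq.items())
total_freq = sum(freq.values())
freq_mean = weighted_sum / total_freq

for x in sorted(freq):
    print(f"score {x}: frequency {freq[x]}")
print("mean:", freq_mean)
```

The result is the same as adding the 20 scores directly and dividing by 20; the frequency form only groups equal values first.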

Arithmetic mean for grouped data

Using class marks or midpoints


\[ \bar{x} = \frac{\sum_{j=1}^{k} f_j x_j}{n} \]
e.g. A sample of television viewers was asked to rate a new soap opera. The highest
possible score is 100. The scores were organized into a frequency distribution.

Rating     Frequency
50 – 59    2
60 – 69    6
70 – 79    10
80 – 89    8
90 – 99    4
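A minimal Python sketch of the class-mark method for this example follows. It assumes the class intervals are 50–59, 60–69, 70–79, 80–89 and 90–99, and it takes the class mark as the midpoint of the stated class limits; the variable names are illustrative.

```python
# Each entry: ((lower limit, upper limit), frequency).
# Assumption: the five classes are 50-59, 60-69, 70-79, 80-89, 90-99.
classes = [((50, 59), 2), ((60, 69), 6), ((70, 79), 10), ((80, 89), 8), ((90, 99), 4)]

# n is the total frequency.
n = sum(f for _, f in classes)

# The class mark x_j is the midpoint of the class limits.
grouped_mean = sum(f * (lo + hi) / 2 for (lo, hi), f in classes) / n

print("n =", n)
print("grouped mean =", grouped_mean)
```

Because the raw ratings are replaced by class marks, the grouped mean is an approximation of the mean of the original scores.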

C. The weighted arithmetic mean

Some numbers x1, … , xk are associated with certain weighting factors or weights w1, … , wk, respectively, depending on the
significance or importance attached to the numbers.

\[ \bar{x} = \frac{\sum_{j=1}^{k} w_j x_j}{\sum_{j=1}^{k} w_j} \]

e.g. A final exam in a course is weighted three times as much as a quiz. A student
has a final grade of 85 and quiz grades of 70 and 90. What is his mean grade?
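The example can be worked as in the following Python sketch, which gives the final exam a weight of 3 and each quiz a weight of 1. The list names are illustrative only.

```python
# Grades and their weights: the final exam counts three times as much as a quiz.
grades = [85, 70, 90]    # final exam, quiz 1, quiz 2
weights = [3, 1, 1]

# Weighted mean = (sum of w_j * x_j) / (sum of w_j).
weighted_mean = sum(w * x for w, x in zip(weights, grades)) / sum(weights)

print("weighted mean grade:", weighted_mean)
```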

D. The median
- the observation occupying the middle position when observations are arranged in an array

For ungrouped data,


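With the observations arranged in increasing order as x1 ≤ x2 ≤ … ≤ xn,
\[ \tilde{x} = x_{k+1} \quad \text{if } n = 2k + 1 \]
\[ \tilde{x} = \tfrac{1}{2}\,[x_k + x_{k+1}] \quad \text{if } n = 2k \]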
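As an illustration, the rule for ungrouped data can be sketched in Python as follows: sort the observations, then take the middle value when n is odd or the average of the two middle values when n is even. The function name `median` and the two test lists (taken from earlier examples in these notes) are choices made for this sketch.

```python
def median(values):
    """Return the median of a list of numbers (ungrouped data)."""
    ordered = sorted(values)          # arrange the observations in an array
    n = len(ordered)
    k = n // 2
    if n % 2 == 1:
        return ordered[k]             # odd n: the single middle observation
    # even n: average of the two middle observations
    return (ordered[k - 1] + ordered[k]) / 2

odd_case = median([1420, 688, 1187, 1559, 302])                # n = 5
even_case = median([17, 45, 38, 27, 6, 48, 11, 57, 34, 22])    # n = 10
print(odd_case, even_case)
```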

E. The Mode

For ungrouped data, find the observation that occurs most often.

Find the mode of the following scores of students in their quiz:


5, 3, 6, 5, 4, 5, 2, 8, 6, 5, 4, 8, 3, 4, 5, 4, 8, 2, 5, 4, 5
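A short Python sketch for this exercise is given below. It counts how often each score occurs and reports every score that reaches the highest count, so it would also handle a bimodal data set; the variable names are illustrative.

```python
from collections import Counter

# Quiz scores from the exercise (21 observations).
scores = [5, 3, 6, 5, 4, 5, 2, 8, 6, 5, 4, 8, 3, 4, 5, 4, 8, 2, 5, 4, 5]

counts = Counter(scores)              # frequency of each distinct score
top = max(counts.values())            # the highest frequency observed
modes = sorted(x for x, c in counts.items() if c == top)

print("mode(s):", modes, "with frequency", top)
```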

F. Uses of the three averages

If the item values of the distribution are considerably concentrated or substantially close to each other, the mean is used to
describe this set of data. The mean is easy to use, compute and comprehend and mathematically tractable. It is used for interval and
ratio data.

Other points to consider about the mean


• most appropriate for unimodal symmetrical distribution
• used when the purpose is to consider the value of each observation and when further or more advanced statistical
computation is needed
• employs all available information
• more stable than the median
• strongly influenced by extreme values
• difficult to estimate the actual value in open-ended distributions

Some points to consider about the median


• used for ordinal, interval, and ratio data
• easy to compute if number of observations is relatively small
• not influenced by extreme values, hence gives “true” average for skewed distributions
• not as highly influenced by frequency occurrence of a single value as the mode
• can also be used for more advanced statistical computations

Some points to consider about the mode


• used for nominal, ordinal, ratio, and interval data
• appropriate for bimodal distribution
• requires no calculation (for ungrouped data)
• used for both quantitative and qualitative data

Measures of Dispersion or Variability

variation or dispersion - the degree to which numerical data tend to spread about an average
range of a data set - a distance measure between the largest and the smallest observed value
mean deviation / mean absolute deviation / average absolute deviation - the arithmetic
average distance of the observations from the average/mean
variance - the average of the squared distance between the mean and each item in the
population/sample
standard deviation - the positive square root of the variance
coefficient of variation - measure of relative dispersion that relates the magnitude of the standard
deviation with the magnitude of the mean

Sample variance:
\[ s^2 = \frac{\sum_{j=1}^{n} (x_j - \bar{x})^2}{n - 1} \]
\[ s^2 = \frac{\sum_{j=1}^{k} f_j (x_j - \bar{x})^2}{n - 1} \quad \text{(grouped data)} \]
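The measures defined above can be sketched in Python as shown below, using the heights of the first volcano sample as illustrative data. The function names are invented for this sketch, and the coefficient of variation is expressed here as the standard deviation divided by the mean, times 100, which is the usual convention and is assumed rather than stated in these notes.

```python
from math import sqrt

def sample_variance(values):
    """s^2 = sum of (x_j - mean)^2 divided by (n - 1), for ungrouped data."""
    n = len(values)
    xbar = sum(values) / n
    return sum((x - xbar) ** 2 for x in values) / (n - 1)

def grouped_sample_variance(marks, freqs):
    """s^2 = sum of f_j * (x_j - mean)^2 divided by (n - 1), for grouped data."""
    n = sum(freqs)
    xbar = sum(f * x for x, f in zip(marks, freqs)) / n
    return sum(f * (x - xbar) ** 2 for x, f in zip(marks, freqs)) / (n - 1)

heights = [1420, 688, 1187, 1559, 302]       # volcano heights in meters
xbar = sum(heights) / len(heights)

data_range = max(heights) - min(heights)     # largest minus smallest value
mean_abs_dev = sum(abs(x - xbar) for x in heights) / len(heights)
s2 = sample_variance(heights)
s = sqrt(s2)                                 # standard deviation
cv = 100 * s / xbar                          # coefficient of variation, in percent (assumed convention)

print(data_range, mean_abs_dev, s2, s, cv)
```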

DATA SET 1

The Labor Force Survey (LFS) adopted the 2003 Master Sample Design with a sample size of approximately 50,000 households.
Labor force refers to the population 15 years old and over who contribute or seek to contribute to the production of goods and
services as defined in the System of National Accounts production boundary. It comprises the employed and unemployed. (PSA)

Name - Labor Force by Region

Source - Philippine Statistics Authority (PSA)

Units – Percent

2017Q1 2017Q2 2017Q3 2017Q4 2018Q1 2018Q2 2018Q3 2018Q4 2019Q1 2019Q2 2019Q3 2019Q4 2020Q1 2020Q2 2020Q3 2020Q4

Philippines 60.7 61.4 60.6 61.2 62.2 60.9 60.1 60.6 60.2 61.4 62.1 61.5 61.7 55.7 61.9 58.7

NCR 61.3 60.5 60.5 61.1 60.6 59.8 60 60.7 60.5 60.7 61.5 60.6 60.1 54.2 59.2 56.7

CAR - Cordillera Admin 60.1 62 64.5 62.7 62.2 60.2 63 62.2 61.6 62.5 61.9 62.4 63 56 64.6 62

Reg I - Ilocos 60.7 56.6 58.7 58.9 63.3 62.3 60 61.3 61.3 59.9 62.3 63.4 63.4 60.8 64.7 61.7

Reg II - Cagayan Valley 63.7 62.6 61.7 63.4 65.3 64.8 62.4 63.1 62 63.3 62.8 64.1 64.4 59.6 64.8 56.7

Reg III - Central Luzon 57.9 57.4 60.5 58.7 60.7 60.2 60.5 58 59.6 60 60.4 60 59.6 51.9 58.9 57.3

Reg IVA-CALABARZON 63.2 63.7 62.4 63.7 62.9 62.1 62.3 63.3 63.1 63.4 65.9 64.1 64 58.3 63.9 60.6

Reg IVB - MIMAROPA 61 63.5 63.6 64 65.9 61.9 60.8 59.4 59.8 61.3 61.7 58.6 58.8 53.4 64.2 62

Region V- Bicol 59 59.1 59.6 60.1 62 62.5 57.7 61.3 55.9 60.7 62.3 61.6 61.7 55.2 62.1 59.1

Reg VI - WVisayas 61.4 60.6 61.6 61.6 62 61.8 60.9 60.3 59.6 56.8 59.5 60.4 60.4 56.1 60.8 57.4

Reg VII - CVisayas 64.9 67.2 62.7 65.1 63.1 61.8 59.7 60.7 63.1 62.5 60.6 63.2 63.9 57.3 57.8 55.9

Reg VIII - EVisayas 56.2 64.2 60.7 60.3 61.6 63.2 61.8 58.4 58.9 61 61.5 58.3 59.5 56.2 60.9 56.2

Reg IX - Zamboanga 56.3 60.9 57.7 58.5 59.3 54.7 54.5 56.9 55.8 56.5 55.9 57.5 59.4 52 60.8 55.7

Reg X – N Mindanao 63 64.7 60.2 63.8 72 66 62 65.2 62.4 73.5 72.5 66.8 70.1 62.8 68.8 63.8

Reg XI - Davao 61.7 64 59.9 62.7 62.2 59.6 59 60.4 60.2 58.2 59.3 61.6 58.9 55.3 59.5 56.5

RegXIISOCCSKSARGEN 62.8 60.8 61.3 62.2 62.4 60.4 61.7 62.3 63.3 64.8 65.4 62.7 65.4 57.4 66.4 62.5

Reg XIII - Caraga 59.2 65.2 61.7 62.1 67.1 66.1 62.5 62 59.2 65.3 63.4 61.4 64.8 57 68.6 63.8

ARMM 44.3 48.2 46.5 46.1 46.1 44.3 46.5 49.5 47.7 55 53.3 53.4 50.9 41.1 62.3 59.4

NIR - Negros Island 66.1 63.9 63.7 - - - - - - - - - - . . .

2020Q1 2020Q2 2020Q3 2020Q4
NCR 60.1 54.2 59.2 56.7
CAR - Cordillera Admin 63 56 64.6 62
Reg I - Ilocos 63.4 60.8 64.7 61.7
Reg II - Cagayan Valley 64.4 59.6 64.8 56.7
Reg III - Central Luzon 59.6 51.9 58.9 57.3
Reg IVA-CALABARZON 64 58.3 63.9 60.6
Reg IVB - MIMAROPA 58.8 53.4 64.2 62
Region V- Bicol 61.7 55.2 62.1 59.1
Reg VI - WVisayas 60.4 56.1 60.8 57.4
Reg VII - CVisayas 63.9 57.3 57.8 55.9
Reg VIII - EVisayas 59.5 56.2 60.9 56.2
Reg IX - Zamboanga 59.4 52 60.8 55.7
Reg X – N Mindanao 70.1 62.8 68.8 63.8
Reg XI - Davao 58.9 55.3 59.5 56.5
RegXIISOCCSKSARGEN 65.4 57.4 66.4 62.5
Reg XIII - Caraga 64.8 57 68.6 63.8
ARMM 50.9 41.1 62.3 59.4

DATA SET 2. Test scores of students in an entrance examination (%) & strand
Student Strand WVSUCAT Communication Science Math
1 HUMSS 52 54 50 42
2 HUMSS 51 56 50 42
3 HUMSS 42 62 36 24
4 HUMSS 52 64 52 36
5 HUMSS 48 62 42 30
6 HUMSS 49 60 42 28
7 STEM 38 44 40 26
8 STEM 39 40 38 24
9 HUMSS 43 52 40 20
10 HUMSS 33 24 36 26
11 HUMSS 41 48 38 26
12 HUMSS 26 26 34 16
13 HUMSS 41 46 46 34
14 HUMSS 29 26 32 22
15 HUMSS 34 48 34 16
16 HUMSS 42 42 38 34
17 HUMSS 36 38 28 24
18 GAS 54 68 46 44
19 GAS 44 52 38 44
20 GAS 41 48 42 38
21 GAS 44 48 44 30
22 GAS 35 28 50 22
23 GAS 39 44 38 26
24 GAS 40 40 44 20
25 GAS 43 46 36 36
26 GAS 37 46 40 18
27 GAS 32 34 30 22
28 GAS 31 34 30 20
29 GAS 27 30 26 20
30 GAS 42 50 46 34
31 GAS 42 48 38 26
32 GAS 38 48 28 22
33 STEM 40 42 42 32
34 STEM 35 48 44 20
35 STEM 43 54 46 44
36 STEM 46 58 34 40
37 STEM 38 40 38 26
38 STEM 40 56 50 42
39 ABM 36 40 36 22
40 ABM 36 48 42 32
41 ABM 48 52 44 50
42 ABM 30 34 36 22
43 ABM 36 54 38 18
44 ABM 28 30 30 26
45 ABM 43 44 34 38
46 ABM 55 68 44 46
47 ABM 53 62 42 48
48 ABM 39 46 40 28
49 ABM 38 48 40 14
50 ABM 31 34 28 24

Exercises:
1. Fill out the table below using Data Set 2.

Strand    Frequency    Percentage
ABM
GAS
HUMSS
STEM
Total

2. Enter Data Set 2 using Microsoft Excel. Using the statistical function of Microsoft Excel, solve for the mean (average) and
standard deviation (stdev) of WVSU-CAT. Then, using sort, solve for the mean and standard deviation of WVSU-CAT when students
are classified by strand, and fill out the table below.

Strand    WVSU College Admission Test
          Mean    Standard deviation
ABM
GAS
HUMSS
STEM
Total

3. Using your Microsoft Excel file of Data Set 2, solve for the mean and standard deviation of the following components of WVSU-
CAT: Communication test scores, Science, and Math test scores when students were classified by strand and fill out the table below.
For those who know how to apply SPSS, you can use the Compare means option under Analyze and solve for the values required
below.

Strand    Communication       Science             Mathematics
          Mean    Std dev     Mean    Std dev     Mean    Std dev
ABM
GAS
HUMSS
STEM
Total

Lesson 3

Examining categorical data using contingency tables

When we have a categorical variable, one of the first things we do with it is count the number of cases in each category. If we don’t
have too many categories, we can display the counts in a table. We can also display them as percentages.
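A frequency table of this kind can be produced with a few lines of Python. The sketch below uses a small made-up list of answers to the pain-type question from Unit I; the data are hypothetical and serve only to show the counting and percentage steps.

```python
from collections import Counter

# Hypothetical responses to "How would you characterize the type of pain?"
responses = ["Sharp", "Dull", "Throbbing", "Dull", "Sharp", "Dull", "Sharp", "Dull"]

counts = Counter(responses)           # number of cases in each category
total = sum(counts.values())

# Percentage of cases in each category, rounded to one decimal place.
percentages = {category: round(100 * count / total, 1)
               for category, count in counts.items()}

print(f"{'Category':<12}{'Count':>6}{'Percent':>9}")
for category in sorted(counts):
    print(f"{category:<12}{counts[category]:>6}{percentages[category]:>9.1f}")
print(f"{'Total':<12}{total:>6}{100.0:>9.1f}")
```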

In statistics, a contingency table (also known as a cross tabulation or crosstab or two-way table) is a type of table in a matrix format
that displays the (multivariate) frequency distribution of the variables. They provide a basic picture of the interrelation between two
variables and can help find interactions between them. A contingency table is a special type of frequency distribution table.

The data below show the number of people aboard the Titanic, the ship that sank on its first voyage, by ticket class.

Distribution of Ticket Class

Possible problem: Did chance of survival depend on ticket class?

Below is a contingency table of the 2201 people aboard the Titanic showing ticket class and survival. Here we are looking at counts.

In a contingency table, looking at the totals in the bottom row or the rightmost column is the same as looking at each variable
separately. We just looked at the distribution of ticket class. In a contingency table with two variables, this distribution is
called the marginal distribution of ticket class.

A contingency table of Class by Survival with the table percentages.


We can see that the 178 third class passengers who survived made up 8.1% of everyone aboard the ship.

Column Percentages.

A contingency table of Class by Survival with only counts and column percentages. Each column represents the conditional
distribution of Survival for a given category of ticket Class.
Here we can see that 41.4% of the second class passengers survived as opposed to 58.6% that did not.

Row percentages.

We learn that 35.4% of those who didn’t survive had third class tickets.
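The three kinds of percentages can be reproduced with the Python sketch below. The cell counts are an assumption: they are the commonly published Titanic counts by class and survival, entered by hand, and they agree with the totals and percentages quoted in the text (2201 people, 178 third class survivors). The layout puts survival in the rows and ticket class in the columns.

```python
# Assumed cell counts (widely published Titanic data, entered by hand for this sketch).
classes = ["First", "Second", "Third", "Crew"]
counts = {
    "Alive": [203, 118, 178, 212],
    "Dead":  [122, 167, 528, 673],
}

grand_total = sum(sum(row) for row in counts.values())
col_totals = [sum(counts[s][j] for s in counts) for j in range(len(classes))]   # marginal totals by class
row_totals = {s: sum(row) for s, row in counts.items()}                         # marginal totals by survival

def pct(part, whole):
    """Percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)

# Table percentages: each cell as a share of everyone aboard.
table_pct = {s: [pct(c, grand_total) for c in row] for s, row in counts.items()}
# Column percentages: distribution of survival within each ticket class.
col_pct = {s: [pct(c, col_totals[j]) for j, c in enumerate(row)] for s, row in counts.items()}
# Row percentages: distribution of ticket class within each survival outcome.
row_pct = {s: [pct(c, row_totals[s]) for c in row] for s, row in counts.items()}

print(table_pct["Alive"][2], col_pct["Alive"][1], col_pct["Dead"][1], row_pct["Dead"][2])
```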

Another sample contingency table, in which American states are classified by region and by political party control:

Party Control of State Legislatures by Region, 2013


Party Control     Northeast    Midwest    South     West
Democrat            77.8%       16.7%     18.8%    53.8%
Republican          11.1        66.7      68.8     46.2
Split               11.1         8.3      12.5      0.0
Non-partisan         0.0         8.3       0.0      0.0
Totals             100.0       100.0     100.0    100.0
N                      9          12        16       13

Source: National Conference of State Legislatures, "2012 Live Election Night Coverage of State
Legislative Races," http://www.ncsl.org. Accessed November 10, 2012.

Lesson 4

Sampling techniques

A population is the entire group of items or individuals of interest in the study. In sample surveys, two populations are considered:
1. Target population is the population for which representative information is desired.
2. Sampling population is the population from which a sample will actually be taken as determined by the sampling frame. The
frame is merely a list of sampling units (e.g. persons) representing the population.

Types of sampling
1. Probability or random sampling: each individual (element) in the population is given a non-zero probability (chance) of being
selected. It has the greatest freedom from bias but may be costly in terms of time and energy for a given level of sampling
error.
2. Non-probability or nonrandom sampling: not all individuals are given non-zero probability (chance) of being selected.

Probability sampling procedures


1. Simple random sampling is the process of selecting a sample giving each sampling unit an equal chance of being included in the
sample. Every possible sample of n observations from the population has the same chance of being selected. Random sampling may be with
replacement or without replacement. In random sampling with replacement, a chosen element is always replaced (or returned
to the list) before the next selection is made so that an element may be chosen more than once.

Steps in conducting simple random sampling (SRS)
1. Make a list of the sampling units and number them from 1 to N (where N is the population size)
2. Select n (distinct) random numbers (where n is the sample size) ranging from 1 to N, using a table of randomly
assorted digits. The sample consists of the units corresponding to the selected numbers.
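The two steps above can be sketched in Python. Note the substitution: instead of a printed table of random digits, the sketch uses Python's pseudo-random number generator, and the function names and the `seed` parameter are choices made for this illustration only.

```python
import random

def srs_without_replacement(N, n, seed=None):
    """Select n distinct unit numbers from 1..N, each unit equally likely."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(1, N + 1), n))

def srs_with_replacement(N, n, seed=None):
    """Select n unit numbers from 1..N; a unit may be chosen more than once."""
    rng = random.Random(seed)
    return [rng.randint(1, N) for _ in range(n)]

# Example: a sample of 5 units from a numbered list of 50 units.
sample_a = srs_without_replacement(50, 5, seed=1)
sample_b = srs_with_replacement(50, 5, seed=1)
print(sample_a)
print(sample_b)
```

Fixing the seed only makes the example repeatable; in an actual survey the selection should not be predictable in advance.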

When to use random sampling


1. If the population is not widely spread geographically.
2. If the population is more or less homogeneous with respect to the characteristics under study.

How to use the Table of Random Numbers


1. Determine how many digits there are in the population size N. Assume that there are m digits in N.
2. Using any m columns of the table, read the random numbers down the column.
3. Choose the elements whose serial numbers are the same as those of the random numbers read.
4. Ignore random numbers greater than N. Ignore also those that have already been selected.
5. If the random numbers of the first m columns have been exhausted, continue reading the next m columns. Stop
when n elements have been chosen.

2. Systematic sampling is a method of selecting a sample by taking every kth unit from an ordered population, the first unit being
selected at random.

Note: A spot map can be used for systematic sampling especially when you want to sample houses or households.

Procedure:
1. Number the units of the population consecutively from 1 to N.
2. Determine k, the sampling interval, by the formula:
\[ k = \frac{N}{n} = \frac{\text{population size}}{\text{sample size}} \]
3. Use a table of random numbers to choose r, where 1 ≤ r ≤ N. The unit corresponding to r is the first unit of the
sample.
4. Consider the list of units of the population as a circular list, i.e., the last unit in the list is followed by the first. The
other units chosen are r + k, r + 2k, r + 3k, … until you have selected n units.
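A minimal Python sketch of this procedure is shown below. It treats the list as circular by wrapping unit numbers past N back to 1. Two points are assumptions of the sketch rather than part of the notes: the random start r comes from a pseudo-random generator instead of a table of random numbers, and k is rounded down when N/n is not a whole number.

```python
import random

def systematic_sample(N, n, seed=None):
    """Circular systematic sample of n units from a population numbered 1..N."""
    if not 0 < n <= N:
        raise ValueError("sample size must satisfy 0 < n <= N")
    rng = random.Random(seed)
    k = N // n                      # sampling interval (rounded down: an assumption)
    r = rng.randint(1, N)           # random start, 1 <= r <= N
    # Units r, r + k, r + 2k, ...; numbers past N wrap around to the start of the list.
    return [(r - 1 + i * k) % N + 1 for i in range(n)]

# Example: 5 units from a population of 50, so k = 10.
chosen = systematic_sample(50, 5, seed=3)
print(chosen)
```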

When to use:
1. If the ordering of the population is essentially random.
2. If there is slight stratification in the population.
3. When stratification with numerous data is used.

3. Stratified sampling is a method of selecting a sample where the population is divided or stratified into more or less homogeneous
subpopulations or strata before sampling is done.

Procedure:
1. Stratify the population into strata so that each stratum will consist of more or less homogeneous units.
2. After the population has been stratified, a (random) sample can be selected from each stratum.
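The procedure can be sketched in Python as follows. The notes do not say how many units to take from each stratum, so the sketch assumes proportional allocation (the same sampling fraction in every stratum); the strata, their sizes, and the function name are hypothetical.

```python
import random

def stratified_sample(strata, fraction, seed=None):
    """Draw a simple random sample from each stratum.

    strata   : dict mapping stratum name -> list of units
    fraction : sampling fraction applied to every stratum (proportional allocation, an assumption)
    """
    rng = random.Random(seed)
    sample = {}
    for name, units in strata.items():
        size = max(1, round(fraction * len(units)))   # at least one unit per stratum
        sample[name] = rng.sample(units, size)
    return sample

# Hypothetical population of 100 numbered units in three strata.
strata = {
    "A": list(range(1, 41)),      # 40 units
    "B": list(range(41, 71)),     # 30 units
    "C": list(range(71, 101)),    # 30 units
}
chosen = stratified_sample(strata, 0.10, seed=7)
for name, units in chosen.items():
    print(name, sorted(units))
```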

When to use:
1. If the population is such that the distribution of the characteristic under consideration is very sporadic or concentrated
in small, scattered points of the population.
2. If the precise estimates are desired for certain parts of the population.
3. If sampling problems differ in the various sections of the populations.

4. Cluster sampling is a method of selecting a sample of distinct groups, or clusters, of smaller units called elements. Similar to
strata in stratified sampling, clusters are mutually exclusive subpopulations which together comprise the entire population.
Unlike strata, however, clusters are preferably formed with heterogeneous, rather than homogeneous, elements so that each
cluster will be typical of the population.

Procedure:
1. List the clusters and number them from 1 to N.
2. Using a table of random numbers, obtain n numbers. The clusters corresponding to the selected numbers form the
sample of clusters.
3. Observe all the elements in each sample cluster.
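
The procedure can be sketched in Python as follows. The function name cluster_sample and the toy clusters are hypothetical, and a seeded generator replaces the table of random numbers.

```python
import random


def cluster_sample(clusters, n, seed=0):
    """Select n whole clusters at random and observe every element in them."""
    rng = random.Random(seed)
    N = len(clusters)                          # step 1: clusters are numbered 1..N
    picked = rng.sample(range(1, N + 1), n)    # step 2: n distinct cluster numbers
    # Step 3: observe all the elements in each sampled cluster.
    elements = [e for number in picked for e in clusters[number - 1]]
    return picked, elements


# Made-up population: 8 clusters (e.g. city blocks) of 5 households each.
blocks = [[f"block{b}-house{h}" for h in range(1, 6)] for b in range(1, 9)]
picked, households = cluster_sample(blocks, n=3)
print(picked, len(households))
```

Note that randomness enters only when choosing clusters. Once a cluster is chosen, all of its elements are included.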

When to use:
1. Clustering is used rather than individual selection when the lower cost per element more than compensates for its
disadvantages.
2. If the population can be grouped into clusters where individual population elements are known to be different with
respect to characteristics under study.

5. Multi-stage sampling is a method in which the selection of the sample is accomplished in two or more steps.

Procedure:
1. Number the first-stage units consecutively from 1 to N in the frame.
2. Using the table of random numbers, select n numbers, then choose the first-stage units bearing those numbers.
3. Number the second-stage units consecutively from 1 to M in the frame for each of the n selected first-stage units.
4. Using the table of random numbers, obtain n sets of m random numbers each.
5. In each of the n first-stage units, select the m second-stage units corresponding to the selected numbers.
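
A two-stage version of this procedure can be sketched in Python. The function name two_stage_sample, the dictionary layout of the frame, and the barangay and household labels are all assumptions made for illustration.

```python
import random


def two_stage_sample(frame, n, m, seed=0):
    """Select n first-stage units, then m second-stage units within each of them."""
    rng = random.Random(seed)
    first_stage_units = sorted(frame)                # steps 1-2: numbered first-stage units
    selected = rng.sample(first_stage_units, n)      # n first-stage units at random
    # Steps 3-5: within each selected first-stage unit, draw m second-stage units.
    return {unit: rng.sample(frame[unit], m) for unit in selected}


# Made-up frame: 6 barangays (first stage), each with 20 households (second stage).
frame = {
    f"barangay{b}": [f"barangay{b}-household{h}" for h in range(1, 21)]
    for b in range(1, 7)
}
result = two_stage_sample(frame, n=2, m=5)
print(result)
```

Unlike cluster sampling, only a sample of the second-stage units is observed in each selected first-stage unit.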

When to use:
1. When the sub-units within the selected population unit give similar results, it is uneconomical to measure them
all, so only a sample of sub-units is selected.

Nonprobability sampling
1. Haphazard or accidental sampling. Many fields in the social and biological sciences, like archeology, history, and
medicine, use as samples whatever items come to hand. It is assumed, often incorrectly, that items picked this way are
typical of the population they come from. A haphazard sample, therefore, is not a random sample.

2. Judgment or purposive sampling. It is a strategy in which particular settings, persons, or events are selected deliberately
in order to provide important information that cannot be obtained from other choices. It is a sampling procedure whereby a
“representative” sample of a population is selected in accordance with an expert’s subjective judgment. This type of
sample may yield good results if the expert has had a long experience with a particular situation and knows many
important facts about it. This is, for example, the common practice in national planning, where technocrats pick “typical”
cities and barrios to represent the country’s urban and rural populations. Experts may differ, of course, in their choices of a
representative sample.

3. Quota sampling. A form of purposive sampling with the added specifications that the sample units must be spread over the
population and that the sample must be roughly proportional to the population. Often census data and other information
are used to “stratify” the population according to certain characteristics, after which, the sample is proportionally chosen
from these strata. “Quotas” are set up for the different strata and enumerators are instructed to keep picking items or
respondents until the quotas are filled. The selection of sample units depends on what the enumerator thinks is typical of
the population.

4. Snowball sampling is a nonrandom sampling method that uses a few cases to help encourage other cases to take part in
the study, thereby increasing sample size. This approach is most applicable in small populations that are difficult to access
due to their closed nature, e.g., secret societies, less acceptable behaviors like drug addiction, and inaccessible
professions.

Advantages of probability sampling over nonprobability sampling


1. A sample based on the laws of chance can provide a measure of how precise estimates are. This is not so for
nonprobability samples.
2. If we know how precise our estimates are, and if we are certain that the conditions imposed by the use of a particular
probability method are satisfied, then we will know how much confidence to place in the results of the study.
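
To make "a measure of how precise estimates are" concrete, here is a sketch for one common case: the mean of a simple random sample drawn without replacement from a population of size N. The standard error formula used, sqrt((1 - n/N) * s^2 / n), is the usual textbook result for that design and is not stated in these notes. The function name and the sample values are made up for illustration.

```python
import math
import statistics


def mean_with_standard_error(sample, N):
    """Sample mean and its estimated standard error under simple random sampling."""
    n = len(sample)
    mean = statistics.mean(sample)
    s2 = statistics.variance(sample)          # sample variance (divides by n - 1)
    se = math.sqrt((1 - n / N) * s2 / n)      # (1 - n/N) is the finite population correction
    return mean, se


# Made-up sample of 4 measurements from a population of 40 units.
mean, se = mean_with_standard_error([2, 4, 6, 8], N=40)
print(mean, se)
```

No comparable formula is available for a haphazard, judgment, or quota sample, because the chance of each unit being selected is unknown.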
