Stat 101C Lecture Notes 1
Course notes in Stat 101 – Fundamentals of Statistics
Learning outcomes
1. Demonstrate understanding of descriptive statistics by practical application of quantitative reasoning and data visualization.
2. Compare the common methods of gathering sample data and identify the sampling techniques for different problem
situations.
3. Organize data using tables, charts and graphs and interpret the results.
4. Identify and calculate the measures of the center of the data.
5. Identify and calculate the measures of the variability of the data.
Introduction
As our society becomes more technologically complex, greater demands are being placed on professionals to understand and use the
results of research designed to solve applied problems. This generally requires a working understanding of statistical methods.
Knowledge of statistical analysis also helps to foster new and creative ways of thinking about problems. These skills can be
applied to any area of inquiry and hence are extremely useful.
Social scientists attempt to explain and predict human behavior. They also take “educated guesses” about the nature of social reality,
although in a far more precise and structured manner. In the process, social scientists examine characteristics of human behavior
called variables – characteristics that differ or vary from one individual to another (for example, age, social class, and attitude) or
from one point in time to another (for example, unemployment, crime rate, and population).
Statistics is any numerical data or quantitative analysis. It is also a certain kind of measure used to evaluate a selected property of
the collection of items under consideration. As a branch of science, it is concerned with the scientific methods of collecting,
organizing, summarizing, presenting and analyzing data, as well as drawing valid conclusions and making reasonable decisions
on the basis of such analysis.
Types of statistics
1. Descriptive statistics is the method of collecting, organizing, and utilizing numerical data derived from the empirical world.
It is the phase of statistics that seeks to describe and analyze a given group without drawing any conclusions or inferences
about a larger group.
Descriptive statistics is concerned with
a) characterizing what is “typical” or common in a group
b) indicating how widely the individuals in the group vary
c) presenting other aspects of the distribution values with respect to the variable(s) being considered.
Examples: mean, percentages, proportions, standard deviation, regression and
correlation coefficient, construction of tables, charts and graphs.
2. Inferential statistics comprises some methods concerned with the analysis of a subset of data leading to predictions or
inferences about the entire set of data. Among the common types of analysis are:
a) testing for the existence of an association between variables
b) identifying the form of an observed relationship
c) refining observed associations into causal relationships
d) generalizing and predicting on the basis of observed data.
Examples: estimation, hypothesis testing.
Systematically testing our ideas about the nature of social reality often demands carefully planned and executed research
with the following elements:
1. The problem to be studied is reduced to a testable hypothesis
2. An appropriate set of instruments is developed
3. The data are collected
4. The data are analyzed
5. Results of the analysis are interpreted and communicated to an audience (say, lecture, journal article, press release)
Population is a totality of all actual or conceivable objects of a certain class under consideration. It can be finite or infinite.
Sample is a finite number of objects or persons selected from the population. It is a set of measurements that constitute part of the
totality of all possible measurements of the same quantities.
Representative – property of a portion of the population if that portion reflects the characteristics of the population
Survey – the collection of the information on a defined population to satisfy a definite need
Sampling frame – a complete list of all units from which the sample is drawn
Variable is any quantity, measure, or characteristic which may possess different numerical values or categories.
Variables may also be classified as qualitative [differ in quality] or quantitative [differ in magnitude] or may be dependent [criterion
variables] or independent [predictor variables].
Levels of measurements
1. Nominal measurement. This type of measurement is the lowest level, consisting of classifying items or individuals into two or
more categories. Its basic requirement is that one must be able to assign an item or individual to one and only one category
and specify the criteria for placing individuals into classes.
The empirical operation is the determination of sameness or equivalence between items with respect to a given
characteristic. The only relationship between two items belonging to the same category is that they are the same with respect
to a particular characteristic; there is no implication that one has more or less of the same characteristic as the other.
2. Ordinal measurement – specifies the relative position of items or individuals with respect to a given characteristic with no
indication as to the distance between the positions. The basic requirement is that one must be able to determine whether an
item has more, the same, or less of the attribute being considered than other items.
3. Interval measurement – property defined by an operation which permits making statements of equality of intervals rather than
just statements of difference and greater than or less than. It can compare differences. It does not have an absolute 0, although
0 may be arbitrarily assigned.
4. Ratio measurement. Numbers on a ratio scale indicate the actual amounts of the characteristics being measured; hence it is
possible to say that an item has none of the characteristic or that the item with a score of 8 has twice as much as an item with a
score of 4. It has an origin or an absolute 0 point thus can consider relative positions.
Pain scale: 0 (no pain) 1 2 3 4 5 6 7 8 9 10 (worst pain imaginable)
Validity and reliability of measurement
In discussing validity of measurement, consider
“Certain basic questions must be asked about measuring instrument: What does it measure? Are the data it
provides relevant to the characteristics in which one is interested? To what extent do the differences in scores represent
true differences in the characteristic we are trying to measure? To what extent do they reflect also the influence of other
factors?”
The validity of a measuring instrument may be defined as the extent to which differences in scores on it reflect
true differences among individuals on the characteristic that we seek to measure, rather than constant or random errors.
The reliability of a measure is simply its consistency. A measure is reliable if the measurement does not change when the concept
being measured remains constant in value. However, if the concept being measured does change in value, the reliable measure will
indicate that change. In the case of a social-research instrument such as a questionnaire, the unreliability also lies within the scale
and may be due to such things as questions or answer categories so ambiguous that the respondent is unsure how he or she should
answer, and thus does not answer consistently.
Questionnaire – a survey instrument which is limited to the written responses of subjects to prearranged questions. It can be
mailed or handed to the informant with minimum explanation. It ensures anonymity.
Interview – allows for greater flexibility in eliciting information since the interviewer and the person interviewed are both present
when the questions are asked and answered.
Advantages:
• Questions can be repeated or rephrased for better understanding and clarity
• Offers greater opportunity for appraising the validity of reports
• Technique for revealing information about complex, emotionally-laden topics or for probing sentiments
underlying an expressed opinion
Types:
a. standardized or structured – questions are presented with exactly the same
wordings in the same order to all subjects.
b. unstructured – neither the questions to be asked nor the responses permitted the subject are determined before the
interview. It is used for intensive study of perceptions, attitudes, motivation, etc., which requires spontaneous, highly
specific, concrete, self-revealing, personal answers.
Sources of Data
1. Documentary Sources – data contained in published and unpublished documents, reports, statistics, manuscripts, letters,
diaries, etc.
a) Primary Sources – first hand data wherein the responsibility for their compilation and promulgation remains under the
same authority that originally gathered them.
b) Secondary Sources – data that have been transcribed or compiled from original sources.
2. Field Sources – include living persons who have the fundamental knowledge about, or have been in intimate contact with social
conditions and changes over a considerable period of time. Source is more personal and direct.
Presentation of Data
1) Textual presentation – A textual presentation of data consists of describing the sample
data in expository form, i.e. it shows and emphasizes significant characteristics and results of data gathered in paragraph
form. It should be arranged according to data importance emphasizing certain figures. It should also justify or explain
irregularities in figures. It is adequate for limited amounts of information. However, if there are many facts involved, this
method of presentation should not be used alone, because of the difficulty in reading and assimilating a repetitious list of facts
and figures. On the other hand, alongside either a tabular or graphic presentation, one or more accompanying paragraphs
can greatly enhance the understandability of the data.
2) Statistical tables – a systematic way of arranging data in columns and rows. It usually conveys significant details and
relationships.
3) Graphical presentation – an effective way of presenting quantitative data to an extensive and heterogeneous group of readers, since
less effort in comprehension is required.
a) graph or chart – any device for showing numerical values or relationship in pictorial
form. It should be clear and simple.
e) Statistical Maps – show numerical measurements and location
Sample graphs: line graphs, pie graph, scatterplot, bar graphs, box plot
A. Definition of terms
Central tendency - a central value between the upper and lower limits of a distribution around which the scores are distributed
average - value which is typical or representative of a set of data
- any single value/number that could represent the data set
- “center of gravity” of the data set
mean - the arithmetic average of all the observed values or groups of values in a distribution
median - a point in the distribution of observed values above which 50% of the values fall and below which 50% of the values
fall
Example 2.
The heights (in meters) of the sampled volcanoes are as follows:
Volcano Height
Mariveles 1420
Smith 688
Biliran 1187
Bulusan 1559
Calayo 302
The sample mean of the heights of these volcanoes can be readily computed as 1031.2
For the systematic sample: Babuyan Claro, Cagua, Iraya, Mandalugan, and Ragang, the heights respectively are 837, 1158,
1008, 1880, 2815 so that here, the sample mean is (837+ 1158+ 1008+ 1880+ 2815)/5= 1539.6.
Thus, we see that a different sample will likely yield a different sample mean.
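The two sample means above can be checked with a short Python sketch. The variable and function names are illustrative only; the heights are the ones listed in the example.

```python
# Heights (in meters) of the two samples of volcanoes from the example above.
first_sample = {"Mariveles": 1420, "Smith": 688, "Biliran": 1187,
                "Bulusan": 1559, "Calayo": 302}
systematic_sample = {"Babuyan Claro": 837, "Cagua": 1158, "Iraya": 1008,
                     "Mandalugan": 1880, "Ragang": 2815}


def sample_mean(values):
    """Arithmetic mean: sum of the observations divided by their count."""
    values = list(values)
    return sum(values) / len(values)


mean_first = sample_mean(first_sample.values())
mean_systematic = sample_mean(systematic_sample.values())
print(round(mean_first, 1), round(mean_systematic, 1))
```

Both samples come from the same population, yet their means differ, which is the point made in the text.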
If the numbers x1, x2, … , xk occur f1, f2, … , fk times respectively, the arithmetic mean is
\[ \bar{x} = \frac{\sum_{j=1}^{k} f_j x_j}{\sum_{j=1}^{k} f_j} \]
e.g. Find the arithmetic mean of the following scores of students in their quiz:
5, 3, 6, 5, 4, 5, 2, 8, 6, 5, 4, 8, 3, 4, 5, 4, 8, 2, 5, 4.
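One way to work this example is to tally the frequency of each distinct score and then apply the frequency formula. The sketch below does this in Python; the use of `collections.Counter` is a convenience chosen here, not something the notes prescribe.

```python
from collections import Counter

# Quiz scores from the example above.
scores = [5, 3, 6, 5, 4, 5, 2, 8, 6, 5, 4, 8, 3, 4, 5, 4, 8, 2, 5, 4]

# Frequency f_j of each distinct score x_j.
frequencies = Counter(scores)

# Arithmetic mean = (sum of f_j * x_j) / (sum of f_j).
total_fx = sum(x * f for x, f in frequencies.items())
total_f = sum(frequencies.values())
mean_score = total_fx / total_f

print(dict(sorted(frequencies.items())))
print(total_fx, total_f, mean_score)
```

The result is the same as adding the 20 scores directly and dividing by 20; the frequency form only groups equal scores together.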
Rating Frequency
50 – 59 2
60 – 69 6
70 – 79 10
80 – 89 8
90 – 99 4
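For a grouped table like this one, a common approach is to let the class midpoint stand in for every observation in the class and then apply the frequency formula. The sketch below assumes that approach and takes the first class as 50 – 59; neither assumption is stated explicitly in the notes at this point.

```python
# (lower limit, upper limit, frequency) for each rating class.
# ASSUMPTION: the first class is 50-59 and class midpoints represent each class.
classes = [(50, 59, 2), (60, 69, 6), (70, 79, 10), (80, 89, 8), (90, 99, 4)]

midpoints = [(low + high) / 2 for low, high, _ in classes]
freqs = [f for _, _, f in classes]

# Grouped mean = (sum of f_j * midpoint_j) / (sum of f_j).
grouped_mean = sum(f * m for f, m in zip(freqs, midpoints)) / sum(freqs)
print(midpoints, sum(freqs), grouped_mean)
```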
Some numbers x1, … , xk are associated with certain weighting factors or weights w1, … , wk, respectively, depending on the
significance or importance attached to the numbers. The weighted arithmetic mean is
\[ \bar{x} = \frac{\sum_{j=1}^{k} w_j x_j}{\sum_{j=1}^{k} w_j} \]
e.g. A final exam in a course is weighted three times as much as a quiz. A student
has a final grade of 85 and quiz grades of 70 and 90. What is his mean grade?
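A minimal Python sketch of this example, assuming each quiz carries weight 1 and the final exam weight 3 as stated:

```python
# Grades and their weights: the final exam counts three times as much as a quiz.
grades = [85, 70, 90]   # final exam, quiz 1, quiz 2
weights = [3, 1, 1]

# Weighted mean = (sum of w_j * x_j) / (sum of w_j).
weighted_mean = sum(w * x for w, x in zip(weights, grades)) / sum(weights)
print(weighted_mean)  # (3*85 + 70 + 90) / 5
```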
D. The median
- the observation occupying the middle position when observations are arranged in an array
E. The Mode
For ungrouped data, find the observation that occurs most often.
If the item values of the distribution are considerably concentrated or substantially close to each other, the mean is used to
describe this set of data. The mean is easy to use, compute and comprehend and mathematically tractable. It is used for interval and
ratio data.
variation or dispersion - the degree to which numerical data tend to spread about an average
range of a data set - a distance measure between the largest and the smallest observed value
mean deviation / mean absolute deviation / average absolute deviation - the arithmetic
average distance of the observations from the average/mean
variance - the average of the squared distance between the mean and each item in the
population/sample
standard deviation - the positive square root of the variance
coefficient of variation - measure of relative dispersion that relates the magnitude of the standard
deviation with the magnitude of the mean
\[ s^2 = \frac{\sum_{j=1}^{n} (x_j - \bar{x})^2}{n - 1} \]
\[ s^2 = \frac{\sum_{j=1}^{k} f_j (x_j - \bar{x})^2}{n - 1} \quad \text{(grouped data)} \]
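As an illustration of the ungrouped formula, the sketch below computes the sample variance, standard deviation, and coefficient of variation for the first sample of volcano heights used earlier. The coefficient of variation is taken here as s divided by the mean, which is an assumption about the intended definition; the function name is illustrative.

```python
import math
import statistics

# Heights (in meters) of the first sample of volcanoes from the earlier example.
heights = [1420, 688, 1187, 1559, 302]


def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)


s2 = sample_variance(heights)
s = math.sqrt(s2)                       # standard deviation
cv = s / (sum(heights) / len(heights))  # ASSUMPTION: CV = s / mean

# Cross-check against the standard library's sample variance.
print(s2, statistics.variance(heights))
print(round(s, 2), round(cv, 3))
```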
DATA SET 1
The Labor Force Survey (LFS) adopted the 2003 Master Sample Design with a sample size of approximately 50000 households.
Labor Force – refers to the population 15 years old and over who contribute or seek to contribute to the production of goods and
services as defined in the System of National Accounts production boundary. It comprises the employed and unemployed. (PSA)
Units – Percent
2017Q1 2017Q2 2017Q3 2017Q4 2018Q1 2018Q2 2018Q3 2018Q4 2019Q1 2019Q2 2019Q3 2019Q4 2020Q1 2020Q2 2020Q3 2020Q4
Philippines 60.7 61.4 60.6 61.2 62.2 60.9 60.1 60.6 60.2 61.4 62.1 61.5 61.7 55.7 61.9 58.7
NCR 61.3 60.5 60.5 61.1 60.6 59.8 60 60.7 60.5 60.7 61.5 60.6 60.1 54.2 59.2 56.7
CAR - Cordillera Admin 60.1 62 64.5 62.7 62.2 60.2 63 62.2 61.6 62.5 61.9 62.4 63 56 64.6 62
Reg I - Ilocos 60.7 56.6 58.7 58.9 63.3 62.3 60 61.3 61.3 59.9 62.3 63.4 63.4 60.8 64.7 61.7
Reg II - Cagayan Valley 63.7 62.6 61.7 63.4 65.3 64.8 62.4 63.1 62 63.3 62.8 64.1 64.4 59.6 64.8 56.7
Reg III - Central Luzon 57.9 57.4 60.5 58.7 60.7 60.2 60.5 58 59.6 60 60.4 60 59.6 51.9 58.9 57.3
Reg IVA-CALABARZON 63.2 63.7 62.4 63.7 62.9 62.1 62.3 63.3 63.1 63.4 65.9 64.1 64 58.3 63.9 60.6
Reg IVB - MIMAROPA 61 63.5 63.6 64 65.9 61.9 60.8 59.4 59.8 61.3 61.7 58.6 58.8 53.4 64.2 62
Region V- Bicol 59 59.1 59.6 60.1 62 62.5 57.7 61.3 55.9 60.7 62.3 61.6 61.7 55.2 62.1 59.1
Reg VI - WVisayas 61.4 60.6 61.6 61.6 62 61.8 60.9 60.3 59.6 56.8 59.5 60.4 60.4 56.1 60.8 57.4
Reg VII - CVisayas 64.9 67.2 62.7 65.1 63.1 61.8 59.7 60.7 63.1 62.5 60.6 63.2 63.9 57.3 57.8 55.9
Reg VIII - EVisayas 56.2 64.2 60.7 60.3 61.6 63.2 61.8 58.4 58.9 61 61.5 58.3 59.5 56.2 60.9 56.2
Reg IX - Zamboanga 56.3 60.9 57.7 58.5 59.3 54.7 54.5 56.9 55.8 56.5 55.9 57.5 59.4 52 60.8 55.7
Reg X – N Mindanao 63 64.7 60.2 63.8 72 66 62 65.2 62.4 73.5 72.5 66.8 70.1 62.8 68.8 63.8
Reg XI - Davao 61.7 64 59.9 62.7 62.2 59.6 59 60.4 60.2 58.2 59.3 61.6 58.9 55.3 59.5 56.5
Reg XII - SOCCSKSARGEN 62.8 60.8 61.3 62.2 62.4 60.4 61.7 62.3 63.3 64.8 65.4 62.7 65.4 57.4 66.4 62.5
Reg XIII - Caraga 59.2 65.2 61.7 62.1 67.1 66.1 62.5 62 59.2 65.3 63.4 61.4 64.8 57 68.6 63.8
ARMM 44.3 48.2 46.5 46.1 46.1 44.3 46.5 49.5 47.7 55 53.3 53.4 50.9 41.1 62.3 59.4
2020Q1 2020Q2 2020Q3 2020Q4
NCR 60.1 54.2 59.2 56.7
CAR - Cordillera Admin 63 56 64.6 62
Reg I - Ilocos 63.4 60.8 64.7 61.7
Reg II - Cagayan Valley 64.4 59.6 64.8 56.7
Reg III - Central Luzon 59.6 51.9 58.9 57.3
Reg IVA-CALABARZON 64 58.3 63.9 60.6
Reg IVB - MIMAROPA 58.8 53.4 64.2 62
Region V- Bicol 61.7 55.2 62.1 59.1
Reg VI - WVisayas 60.4 56.1 60.8 57.4
Reg VII - CVisayas 63.9 57.3 57.8 55.9
Reg VIII - EVisayas 59.5 56.2 60.9 56.2
Reg IX - Zamboanga 59.4 52 60.8 55.7
Reg X – N Mindanao 70.1 62.8 68.8 63.8
Reg XI - Davao 58.9 55.3 59.5 56.5
Reg XII - SOCCSKSARGEN 65.4 57.4 66.4 62.5
Reg XIII - Caraga 64.8 57 68.6 63.8
ARMM 50.9 41.1 62.3 59.4
DATA SET 2. Test scores of students in an entrance examination (%) & strand
Student Strand WVSUCAT Communication Science Math
1 HUMSS 52 54 50 42
2 HUMSS 51 56 50 42
3 HUMSS 42 62 36 24
4 HUMSS 52 64 52 36
5 HUMSS 48 62 42 30
6 HUMSS 49 60 42 28
7 STEM 38 44 40 26
8 STEM 39 40 38 24
9 HUMSS 43 52 40 20
10 HUMSS 33 24 36 26
11 HUMSS 41 48 38 26
12 HUMSS 26 26 34 16
13 HUMSS 41 46 46 34
14 HUMSS 29 26 32 22
15 HUMSS 34 48 34 16
16 HUMSS 42 42 38 34
17 HUMSS 36 38 28 24
18 GAS 54 68 46 44
19 GAS 44 52 38 44
20 GAS 41 48 42 38
21 GAS 44 48 44 30
22 GAS 35 28 50 22
23 GAS 39 44 38 26
24 GAS 40 40 44 20
25 GAS 43 46 36 36
26 GAS 37 46 40 18
27 GAS 32 34 30 22
28 GAS 31 34 30 20
29 GAS 27 30 26 20
30 GAS 42 50 46 34
31 GAS 42 48 38 26
32 GAS 38 48 28 22
33 STEM 40 42 42 32
34 STEM 35 48 44 20
Strand Frequency Percentage
ABM
GAS
HUMSS
STEM
Total
Exercises:
1. Fill out the tables below using the DATA SET 2.
2. Enter Data Set 2 using Microsoft Excel. Using the statistical function of Microsoft Excel, solve for the mean (average) and
standard deviation (stdev) of WVSU-CAT. Then using sort, solve for the mean and standard deviation of WVSU-CAT when students
were classified by strand and fill out the table below.
3. Using your Microsoft Excel file of Data Set 2, solve for the mean and standard deviation of the following components of WVSU-
CAT: Communication test scores, Science, and Math test scores when students were classified by strand and fill out the table below.
For those who know how to apply SPSS, you can use the Compare means option under Analyze and solve for the values required
below.
Lesson 3
When we have a categorical variable, one of the first things we do with it is count the number of cases in each category. If we don’t
have too many categories, we can display the counts in a table. We can also display them as percentages.
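Counting cases per category and converting the counts to percentages can be sketched in Python as follows. For brevity the example uses only the strands of students 1 to 10 from Data Set 2; the variable names are illustrative.

```python
from collections import Counter

# Strand of students 1-10 in Data Set 2 (a small subset, for illustration).
strands = ["HUMSS"] * 6 + ["STEM"] * 2 + ["HUMSS"] * 2

counts = Counter(strands)        # number of cases in each category
total = sum(counts.values())
percentages = {strand: 100 * c / total for strand, c in counts.items()}

for strand in sorted(counts):
    print(f"{strand:6s} {counts[strand]:3d} {percentages[strand]:6.1f}%")
print(f"{'Total':6s} {total:3d} {100.0:6.1f}%")
```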
In statistics, a contingency table (also known as a cross tabulation or crosstab or two-way table) is a type of table in a matrix format
that displays the (multivariate) frequency distribution of the variables. They provide a basic picture of the interrelation between two
variables and can help find interactions between them. A contingency table is a special type of frequency distribution table.
The data below show the number of people on the Titanic, the ship that sank on its first voyage, by ticket class.
Possible problem: Did chance of survival depend on ticket class?
Below is a contingency table of the 2201 people aboard the Titanic showing ticket class and survival. Here we are looking at counts.
In a contingency table, looking at the totals in the bottom row or the rightmost column is the same as looking at each variable
separately. We just looked at the distribution of ticket class. When we have a contingency table with two variables, this distribution is
called the marginal distribution of ticket class.
Column Percentages.
A contingency table of Class by Survival with only counts and column percentages. Each column represents the conditional
distribution of Survival for a given category of ticket Class.
Here we can see that 41.4% of the second class passengers survived as opposed to 58.6% that did not.
Row percentages.
We learn that 35.4% of those who didn’t survive had third class tickets.
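The marginal, column, and row percentages described above can be reproduced with a short Python sketch. The counts used here are an assumption: they are the commonly cited Titanic figures, chosen because they agree with the total of 2201 people and with the 41.4%, 58.6%, and 35.4% quoted in the text. The dictionary layout and names are illustrative.

```python
# Counts by ticket class and survival.
# ASSUMPTION: commonly cited Titanic figures, consistent with the totals
# and percentages quoted in the notes.
counts = {
    "First":  {"Alive": 203, "Dead": 122},
    "Second": {"Alive": 118, "Dead": 167},
    "Third":  {"Alive": 178, "Dead": 528},
    "Crew":   {"Alive": 212, "Dead": 673},
}
outcomes = ("Alive", "Dead")

# Marginal distributions: totals for each variable taken separately.
class_totals = {c: sum(row.values()) for c, row in counts.items()}
survival_totals = {s: sum(counts[c][s] for c in counts) for s in outcomes}
grand_total = sum(class_totals.values())

# Column percentages: distribution of survival within each ticket class.
column_pct = {c: {s: 100 * counts[c][s] / class_totals[c] for s in outcomes}
              for c in counts}

# Row percentages: distribution of ticket class within each survival outcome.
row_pct = {s: {c: 100 * counts[c][s] / survival_totals[s] for c in counts}
           for s in outcomes}

print(grand_total)
print(round(column_pct["Second"]["Alive"], 1), round(column_pct["Second"]["Dead"], 1))
print(round(row_pct["Dead"]["Third"], 1))
```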
Another sample contingency table, in which American states are classified by region and by political party control.
Source: National Conference of State Legislatures, "2012 Live Election Night Coverage of State
Legislative Races," http://www.ncsl.org. Accessed November 10, 2012.
Lesson 4
Sampling techniques
A population is the entire group of items or individuals of interest in the study. In sample surveys, two populations are considered:
1. Target population is the population for which representative information is desired.
2. Sampling population is the population from which a sample will actually be taken as determined by the sampling frame. The
frame is merely a list of sampling units (e.g. persons) representing the population.
Types of sampling
1. Probability or random sampling: each individual (element) in the population is given a non-zero probability (chance) of being
selected. It has the greatest freedom from bias but may be costly in terms of time and energy for a given level of sampling
error.
2. Non-probability or nonrandom sampling: not all individuals are given non-zero probability (chance) of being selected.
Steps in conducting simple random sampling (SRS)
1. Make a list of the sampling units and number them from 1 to N (where N is the population size)
2. Select n (distinct) random numbers (where n is the sample size) ranging from 1 to N, using a table of randomly
assorted digits. The sample consists of the units corresponding to the selected numbers.
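The two steps can be sketched in Python, with a pseudorandom number generator standing in for the table of random digits (a substitution made here for convenience). The function name and the `seed` parameter are illustrative.

```python
import random


def simple_random_sample(units, n, seed=None):
    """Select n distinct units by drawing n distinct numbers from 1 to N."""
    rng = random.Random(seed)  # seed only makes the illustration repeatable
    N = len(units)
    # Step 1: units are implicitly numbered 1..N by their position in the list.
    # Step 2: draw n distinct random numbers between 1 and N.
    numbers = rng.sample(range(1, N + 1), n)
    return [units[i - 1] for i in numbers]


# Sampling frame: student numbers 1 to 34, as in Data Set 2.
students = list(range(1, 35))
sample = simple_random_sample(students, 5, seed=1)
print(sample)
```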
2. Systematic sampling is a method of selecting a sample by taking every kth unit from an ordered population, the first unit being
selected at random.
Note: A spot map can be used for systematic sampling especially when you want to sample houses or households.
Procedure:
1. Number the units of the population consecutively from 1 to N.
2. Determine k, the sampling interval, by the formula: k = N/n = (population size)/(sample size)
3. Use a table of random numbers to choose r, where 1 ≤ r ≤ N. The unit corresponding to r is the first unit of the
sample.
4. Consider the list of units of the population as a circular list, i.e., the last unit in the list is followed by the first. The
other units chosen are r + k, r + 2k, r + 3k, … until you have selected n units.
When to use:
1. If the ordering of the population is essentially random.
2. If there is slight stratification in the population.
3. When stratification with numerous data is used.
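The systematic sampling procedure, including the circular list, might be sketched as follows. Two choices here are assumptions rather than part of the notes: N/n is rounded down to a whole number when it is not an integer, and a pseudorandom generator replaces the table of random numbers.

```python
import random


def systematic_sample(units, n, seed=None):
    """Take every k-th unit from a circular list, starting at a random unit r."""
    rng = random.Random(seed)
    N = len(units)
    if not 1 <= n <= N:
        raise ValueError("sample size must be between 1 and N")
    k = N // n                 # sampling interval (ASSUMPTION: rounded down)
    r = rng.randint(1, N)      # random start, 1 <= r <= N
    # Units r, r + k, r + 2k, ... wrapping around the end of the list.
    positions = [(r - 1 + i * k) % N for i in range(n)]
    return [units[p] for p in positions]


population = list(range(1, 21))   # units numbered 1 to 20
sample = systematic_sample(population, 5, seed=3)
print(sample)
```

Because n times k never exceeds N, the wrap-around cannot select the same unit twice.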
3. Stratified sampling is a method of selecting a sample where the population is divided or stratified into more or less homogeneous
subpopulations or strata before sampling is done.
Procedure:
1. Stratify the population into strata so that each stratum will consist of more or less homogeneous units.
2. After the population has been stratified, a (random) sample can be selected from each stratum.
When to use:
1. If the population is such that the distribution of the characteristic under consideration is very sporadic or concentrated
in small, scattered points of the population.
2. If the precise estimates are desired for certain parts of the population.
3. If sampling problems differ in the various sections of the populations.
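A minimal sketch of the stratified procedure, using the strands of Data Set 2 as strata and student numbers as units. The per-stratum sample sizes are arbitrary values chosen for illustration; the notes do not specify how to allocate the sample among strata.

```python
import random


def stratified_sample(strata, sizes, seed=None):
    """Draw a simple random sample of the given size from each stratum."""
    rng = random.Random(seed)
    return {name: rng.sample(units, sizes[name]) for name, units in strata.items()}


# Step 1: stratify the population (student numbers grouped by strand, Data Set 2).
strata = {
    "HUMSS": list(range(1, 7)) + list(range(9, 18)),
    "STEM": [7, 8, 33, 34],
    "GAS": list(range(18, 33)),
}
# Step 2: sample within each stratum (sizes are hypothetical).
sizes = {"HUMSS": 3, "STEM": 2, "GAS": 3}

sample = stratified_sample(strata, sizes, seed=7)
print(sample)
```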
4. Cluster sampling is a method of selecting a sample of distinct groups or clusters, of smaller units called elements. Similar to
strata in stratified sampling, clusters are mutually exclusive subpopulations which together comprise the entire population.
Unlike strata, however, clusters are preferably formed with heterogeneous, rather than homogeneous elements so that cluster
will be typical of the population.
Procedure:
1. List the clusters and number them from 1 to N.
2. Using a table of random numbers, obtain n numbers. The clusters corresponding to the selected numbers form the
sample of clusters.
3. Observe all the elements in each sample cluster.
When to use:
1. Clustering is used rather than individual selection when the lower cost per element more than compensates for its
disadvantages.
2. If the population can be grouped into clusters where individual population elements are known to be different with
respect to characteristics under study.
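The cluster sampling procedure can be sketched as below. The cluster names and household labels are hypothetical, and a pseudorandom generator stands in for the table of random numbers.

```python
import random


def cluster_sample(clusters, n, seed=None):
    """Select n whole clusters at random and observe every element in them."""
    rng = random.Random(seed)
    names = sorted(clusters)         # step 1: list the clusters (numbered by position)
    chosen = rng.sample(names, n)    # step 2: pick n clusters at random
    # Step 3: all elements of each selected cluster enter the sample.
    return {name: list(clusters[name]) for name in chosen}


# Hypothetical clusters: villages, each with its own list of households.
clusters = {
    "Village A": ["A1", "A2", "A3"],
    "Village B": ["B1", "B2"],
    "Village C": ["C1", "C2", "C3", "C4"],
    "Village D": ["D1", "D2", "D3"],
}
sample = cluster_sample(clusters, 2, seed=5)
print(sample)
```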
5. Multi-stage sampling is a method in which the selection of the sample is accomplished in two or more steps.
Procedure:
1. Number the first-stage units consecutively from 1 to N in the frame.
2. Using the table of random numbers, select n numbers, then choose the first-stage units corresponding to those numbers.
3. Number the second-stage units consecutively from 1 to M in the frame for each of the n selected first-stage units.
4. Using the table of random numbers, obtain n sets of m random numbers each.
5. In each of the n first-stage units, select the m second-stage units corresponding to the selected numbers.
When to use:
1. When the sub-units within the selected population unit give similar results, it seems uneconomical to measure them
all, therefore select just a sample of sub-units.
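A two-stage version of the procedure might look like the sketch below. The frame of schools and students is hypothetical, and every first-stage unit is assumed to contain at least m second-stage units.

```python
import random


def two_stage_sample(frame, n, m, seed=None):
    """Select n first-stage units, then m second-stage units within each."""
    rng = random.Random(seed)
    # Stage 1: choose n first-stage units from the frame.
    first_stage = rng.sample(sorted(frame), n)
    # Stage 2: within each chosen unit, choose m of its second-stage units.
    return {unit: rng.sample(frame[unit], m) for unit in first_stage}


# Hypothetical frame: schools (first stage) and their students (second stage).
frame = {
    "School 1": ["s11", "s12", "s13", "s14"],
    "School 2": ["s21", "s22", "s23"],
    "School 3": ["s31", "s32", "s33", "s34", "s35"],
}
sample = two_stage_sample(frame, n=2, m=2, seed=11)
print(sample)
```

Unlike cluster sampling, only a subsample of each selected first-stage unit is observed.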
Nonprobability sampling
1. Haphazard or accidental sampling. Many fields in the social and biological sciences, like archeology, history, and
medicine, use as samples whatever items come to hand. It is assumed, often incorrectly, that items picked this way are
typical of the population they come from. A haphazard sample, therefore, is not a random sample.
2. Judgment or purposive sampling. It is a strategy in which particular settings, persons, or events are selected deliberately
in order to provide important information that cannot be obtained from other choices. It is a sampling procedure whereby a
“representative” sample of a population is selected in accordance with an expert’s subjective judgment. This type of
sample may yield good results if the expert has had a long experience with a particular situation and knows many
important facts about it. This is, for example, the common practice in the national planning of technocrats’ picking “typical”
cities and barrios to represent the country’s urban and rural populations. Experts may differ, of course, in their choices of a
representative sample.
3. Quota sampling. A form of purposive sampling with the added specifications that the sample units must be spread over the
population and that the sample must be roughly proportional to the population. Often census data and other information
are used to “stratify” the population according to certain characteristics, after which, the sample is proportionally chosen
from these strata. “Quotas” are set up for the different strata and enumerators are instructed to keep picking items or
respondents until the quotas are filled. The selection of sample units depends on what the enumerator thinks is typical of
the population.
4. Snowball sampling is a nonrandom sampling method that uses a few cases to help encourage other cases to take part in
the study, thereby increasing sample size. This approach is most applicable in small populations that are difficult to access
due to their closed nature, e.g., secret societies, less acceptable behaviors like drug addiction, and inaccessible
professions.