ENGINEERING DATA ANALYSIS
__________________________________________
COURSE MATERIAL
Prepared by:
Engr. Marizen Contreras
Engr. Siddartha Valle
Engr. Framces Thea De Mesa
Engr. Nadine Alex Bravo
COURSE TECHNICALITIES

Course Details

PO  Statement
a   Apply knowledge of mathematics and science to solve engineering problems
b   Design and conduct experiments, as well as analyze and interpret data
i   Recognize the need for, and an ability to engage in, life-long learning
j   Apply knowledge of contemporary issues
n   Participate in the generation of new knowledge or in research and development projects

Course Outcomes

Mapping of PO to CO
Rules of Probability
Joint Probability Distribution

3  Discrete Probability Distributions
   Introduction
   Binomial Distributions
   Geometric and Negative Binomial Distributions
   Hypergeometric Distributions
   Poisson Distributions

4  Continuous Probability Distributions
   Continuous Random Variable
   Probability Density Function
   Normal Distribution

5  Sampling Distributions and Estimation
   Sampling Distribution
   Estimation
   Sample Size

6  Test of Hypothesis
   Introduction to Hypothesis
   P-value
   PARAMETRIC TESTS
      T-Test for Independent Samples
      T-Test for Correlated Samples
      One Sample Mean Tests
      Two Sample Mean Test
      One-Way ANOVA (F Test)
      Scheffe's Test
      Two-Way ANOVA
      Three-Way ANOVA
      Simple Linear Regression
      Multiple Regression and Its Significance
      Pearson Coefficient of Correlation r
   NON-PARAMETRIC TESTS
      Chi-Square Test of Goodness of Fit
      Chi-Square Test of Homogeneity
      Chi-Square Test of Independence
      Mann-Whitney U Test
      Kruskal-Wallis Test
      Sign Test for Two Independent Samples
      Fisher Sign Test
      Median Test for K Independent Samples
      Spearman Rank-Order Coefficient of Correlation rs
      Friedman Fr Test for Randomized Block Design
      McNemar's Test for Correlated Proportions
      Kendall's Coefficient of Concordance

7  Reliability and Validity Tests
   Introduction to Reliability and Validity Tests
   Methods Used for Reliability and Validity Testing

8  Computer Aided Statistics
   Basic Introduction to the Use of MS Excel in Statistical Data Analysis
Obtaining Data
____________________________________________________
MODULE 1
Session 1
Role of Statistics in Engineering
Lecture:
Statistical methods are used to help us describe and understand variability. By variability,
we mean that successive observations of a system or phenomenon do not produce
exactly the same result. We all encounter variability in our everyday lives, and statistical
thinking can give us a useful way to incorporate this variability into our decision-making
processes.
For example, consider the gasoline mileage performance of your car. Do you always get
exactly the same mileage performance on every tank of fuel? Of course not—in fact,
sometimes the mileage performance varies considerably. This observed variability in
gasoline mileage depends on many factors, such as the type of driving that has occurred
most recently (city versus highway), the changes in condition of the vehicle over time
(which could include factors such as tire inflation, engine compression, or valve wear),
the brand and/or octane number of the gasoline used, or possibly even the weather
conditions that have been recently experienced.
These factors represent potential sources of variability in the system. Statistics gives us
a framework for describing this variability and for learning about which potential sources
of variability are the most important or which have the greatest impact on the gasoline
mileage performance.
Consider, as an example, an engineer measuring the pull-off force of prototype connectors. Because the pull-off force measurements exhibit variability, we consider the pull-off force to be a random variable. A convenient way to think of a random variable, say X, that represents a measurement is to use the model

X = μ + ε

where μ is a constant and ε is a random disturbance.

The constant remains the same with every measurement, but small changes in the environment, the test equipment, differences in the individual parts themselves, and so forth change the value of ε.
If there were no disturbances, ε would always equal zero and X would always be equal to the constant μ. However, this never happens in the real world, so the actual measurements X exhibit variability. We often need to describe, quantify, and ultimately reduce variability.
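A short simulation makes this model concrete. This is an illustrative sketch, not part of the original material: the value of μ and the spread of ε below are assumed purely for demonstration.

```python
import random

random.seed(1)

MU = 13.0          # the constant part of the measurement (assumed value)
NOISE_SD = 0.3     # spread of the random disturbance epsilon (assumed value)

def measure():
    """One realization of the model X = mu + epsilon."""
    epsilon = random.gauss(0, NOISE_SD)
    return MU + epsilon

# Successive measurements differ even though mu never changes,
# because epsilon takes a different value each time.
observations = [measure() for _ in range(8)]
print(observations)
```

Running this repeatedly shows the essential point of the model: the variability in the printed values comes entirely from ε, not from μ.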
The dot diagram is a very useful plot for displaying a small body of data, say, up to about 20 observations. This plot allows us to easily see two features of the data: the location, or middle, and the scatter, or variability. When the number of observations is small, it is usually difficult to identify any specific patterns in the variability, although the dot diagram is a convenient way to see any unusual data features.
From testing the initial prototypes, the engineer knows that the average pull-off force is 13.0 pounds. However, he thinks that this may be too low for the intended application, so he decides to consider an alternative design with a greater wall thickness, 1/8 inch. Eight prototypes of this design are built, and the observed pull-off force measurements are 12.9, 13.7, 12.8, 13.9, 14.2, 13.2, 13.5, and 13.1 pounds. The average is 13.4 pounds.
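The average for the thicker design can be checked directly. The eight measurements below are the ones listed in the text; the 13.0-pound figure for the original design is quoted from the text rather than computed here from raw data.

```python
# Pull-off force measurements (pounds) for the 1/8-inch wall design,
# as listed in the text.
thicker_design = [12.9, 13.7, 12.8, 13.9, 14.2, 13.2, 13.5, 13.1]

mean_thicker = sum(thicker_design) / len(thicker_design)
mean_original = 13.0  # average for the original design, quoted from the text

print(round(mean_thicker, 1))                   # 13.4, matching the text
print(round(mean_thicker - mean_original, 1))   # an apparent increase of 0.4 pounds
```

Whether this 0.4-pound difference reflects a real effect of wall thickness or only the inherent variability of the system is exactly the question statistical inference addresses.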
Comparing the dot diagrams of the two samples gives the impression that increasing the wall thickness has led to an increase
in pull-off force. However, there are some obvious questions to ask. For instance, how do
we know that another sample of prototypes will not give different results? Is a sample of
eight prototypes adequate to give reliable results? If we use the test results obtained so
far to conclude that increasing the wall thickness increases the strength, what risks are
associated with this decision?
For example, is it possible that the apparent increase in pull-off force observed in the
thicker prototypes is only due to the inherent variability in the system and that increasing
the thickness of the part (and its cost) really has no effect on the pull-off force?
Often, physical laws (such as Ohm’s law and the ideal gas law) are applied to help design
products and processes. We are familiar with this reasoning from general laws to specific
cases. But it is also important to reason from a specific set of measurements to more
general cases to answer the previous questions. This reasoning is from a sample (such
as the eight connectors) to a population (such as the connectors that will be sold to
customers). The reasoning is referred to as statistical inference.
In some cases, the sample is actually selected from a well-defined population. The
sample is a subset of the population. For example, in a study of resistivity a sample of
three wafers might be selected from a production lot of wafers in semiconductor
manufacturing. Based on the resistivity data collected on the three wafers in the sample,
we want to draw a conclusion about the resistivity of all of the wafers in the lot.
In other cases, the population is conceptual (such as with the connectors), but it might be
thought of as future replicates of the objects in the sample. In this situation, the eight
prototype connectors must be representative, in some sense, of the ones that will be
manufactured in the future. Clearly, this analysis requires some notion of stability as an
additional assumption. For example, it might be assumed that the sources of variability in
the manufacture of the prototypes (such as temperature, pressure, and curing time) are
the same as those for the connectors that will be manufactured in the future and ultimately
sold to customers.
Figure 5. Enumerative Versus Analytic Study
An effective data collection procedure can greatly simplify the analysis and lead to
improved understanding of the population or process that is being studied.
Illustrative Example:
Production personnel obtain and archive the following records: the concentration of acetone in an hourly test sample of output product; the reboil temperature log, which is a plot of the reboil temperature over time; the condenser temperature controller log; and the nominal reflux rate each hour. The reflux rate should be held constant for this process, so production personnel change it very infrequently.

The study objective might be to discover the relationships between the two temperatures, the reflux rate, and the acetone concentration in the output product stream. Using the archived historical records for this purpose raises several problems:
1. We may not be able to see the relationship between the reflux rate and acetone
concentration, because the reflux rate didn’t change much over the historical
period.
2. The archived data on the two temperatures (which are recorded almost
continuously) do not correspond perfectly to the acetone concentration
measurements (which are made hourly). It may not be obvious how to construct
an approximate correspondence.
3. Production maintains the two temperatures as closely as possible to desired
targets or set points. Because the temperatures change so little, it may be
difficult to assess their real impact on acetone concentration.
4. Within the narrow ranges that they do vary, the condensate temperature tends
to increase with the reboil temperature. Consequently, the effects of these two
process variables on acetone concentration may be difficult to separate.
As we can see, a retrospective study may involve a lot of data, but that data may contain
relatively little useful information about the problem. Furthermore, some of the relevant
data may be missing, there may be transcription or recording errors resulting in outliers
(or unusual values), or data on other important factors may not have been collected and
archived. In the distillation column, for example, the specific concentrations of butyl
alcohol and acetone in the input feed stream are a very important factor, but they are not
archived because the concentrations are too hard to obtain on a routine basis.
In an observational study, the engineer would design a form to record the two temperatures and the reflux rate at the times when acetone concentration measurements are made. It may even be possible to measure the input feed stream concentrations so that the impact of this factor could be studied.
Generally, an observational study tends to solve problems 1 and 2 above and goes a long
way toward obtaining accurate and reliable data. However, observational studies may not
help resolve problems 3 and 4.
Designed experiments are a very powerful approach to studying complex systems, such
as the distillation column. This process has three factors, the two temperatures and the
reflux rate, and we want to investigate the effect of these three factors on output acetone
concentration.
A good experimental design for this problem must ensure that we can separate the effects
of all three factors on the acetone concentration. The specified values of the three factors
used in the experiment are called factor levels. Typically, we use a small number of levels for each factor, such as two or three. For this distillation column problem, suppose we use a "high" and a "low" level (denoted +1 and -1, respectively) for each of the factors. We would thus use two levels for each of the three factors.
A very reasonable experiment design strategy uses every possible combination of the
factor levels to form a basic experiment with eight different settings for the process. This
type of experiment is called a factorial experiment.
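The eight settings of this two-level, three-factor design can be enumerated mechanically. The sketch below uses the factor names of the distillation-column example; the names themselves are just labels for illustration.

```python
from itertools import product

# Two-level, three-factor factorial design for the distillation column:
# each factor is run at a low (-1) and a high (+1) level.
factors = ["reboil_temp", "condensate_temp", "reflux_rate"]
levels = (-1, +1)

# Every possible combination of the factor levels: 2**3 = 8 runs.
runs = list(product(levels, repeat=len(factors)))
for run in runs:
    print(dict(zip(factors, run)))

print(len(runs))  # 8 distinct process settings
```

Each dictionary printed is one setting of the process at which the column would be run to equilibrium and the acetone concentration measured.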
With each setting of the process conditions, we allow the column to reach equilibrium,
take a sample of the product stream, and determine the acetone concentration. We then
can draw specific inferences about the effect of these factors. Such an approach allows
us to proactively study a population or process.
Often data are collected over time. In this case, it is usually very helpful to plot the data
versus time in a time series plot. Phenomena that might affect the system or process
often become more visible in a time-oriented plot and the concept of stability can be better
judged.
Figure 6. Dot diagram of acetone concentration taken hourly from the distillation column
The large variation displayed on the dot diagram indicates a lot of variability in the
concentration, but the chart does not help explain the reason for the variation.
In a time series plot of the same data, however, a shift in the process mean level is visible, and an estimate of the time of the shift can be obtained.
In a famous experiment, W. Edwards Deming dropped marbles through a funnel toward a target. The funnel was aligned as closely as possible with the center of the target. He then used two different strategies to operate the process.
1. He never moved the funnel. He just dropped one marble after another and
recorded the distance from the target.
2. He dropped the first marble and recorded its location relative to the target. He
then moved the funnel an equal and opposite distance in an attempt to
compensate for the error. He continued to make this type of adjustment after
each marble was dropped.
After both strategies were completed, he noticed that the variability of the distance from
the target for strategy 2 was approximately 2 times larger than for strategy 1. The
adjustments to the funnel increased the deviations from the target. The explanation is that
the error (the deviation of the marble’s position from the target) for one marble provides
no information about the error that will occur for the next marble. Consequently,
adjustments to the funnel do not decrease future errors. Instead, they tend to move the
funnel farther from the target.
This interesting experiment points out that adjustments to a process based on random disturbances can actually increase the variation of the process. This is referred to as over-control or tampering. Therefore, adjustments should be applied only to compensate for a nonrandom shift in the process.
The effect can be simulated with a process whose target value is 10 units, comparing the data with and without adjustments applied to the process mean in an attempt to produce data closer to the target. Each adjustment is equal and opposite to the deviation of the previous measurement from the target. For example, when the measurement is 11 (one unit above target), the mean is reduced by one unit before the next measurement is generated. The over-control has increased the deviations from the target.
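A short simulation of the two funnel strategies reproduces Deming's result. The noise distribution and the number of drops below are arbitrary choices for illustration.

```python
import random

random.seed(42)
TARGET = 10.0
N = 10_000

# Strategy 1: never move the funnel; each result is target + random noise.
noise = [random.gauss(0, 1) for _ in range(N)]
s1 = [TARGET + e for e in noise]

# Strategy 2: after each drop, move the funnel by an amount equal and
# opposite to the previous deviation from target (over-control).
s2 = []
funnel = TARGET
for e in noise:
    x = funnel + e
    s2.append(x)
    funnel -= (x - TARGET)   # compensating adjustment

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

print(variance(s1))  # close to 1, the noise variance
print(variance(s2))  # close to 2: the adjustments roughly double the variance
```

Because each deviation under strategy 2 is the difference of two independent errors, its variance is the sum of their variances, which is why the adjustments approximately double the spread rather than reduce it.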
Figure 10. Time series of Deming’s funnel experiment with applied adjustment at the
point of process mean shift.
Figure 10 displays the data without adjustment except that the measurements after
observation number 50 are increased by two units to simulate the effect of a shift in the
mean of the process. When there is a true shift in the mean of a process, an adjustment
can be useful. It also displays the data obtained when one adjustment (a decrease of two
units) is applied to the mean after the shift is detected (at observation number 57). Note
that this adjustment decreases the deviations from target.
The question of when to apply adjustments (and by what amounts) begins with an
understanding of the types of variation that affect a process. A control chart is an
invaluable way to examine the variability in time-oriented data.
In a control chart of the concentration data, the center line is the average of the concentration measurements for the first 20 samples (x̄ = 91.5 g/l), when the process is stable. The upper control limit
and the lower control limit are a pair of statistically derived limits that reflect the inherent
or natural variability in the process. These limits are located three standard deviations
of the concentration values above and below the center line. If the process is operating
as it should, without any external sources of variability present in the system, the
concentration measurements should fluctuate randomly around the center line, and
almost all of them should fall between the control limits.
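Computing a center line and three-sigma control limits is a short calculation. The 20 concentration values below are invented for illustration (the 91.5 g/l center line in the text comes from data not reproduced here); only the procedure is the point.

```python
import statistics

# Hypothetical concentration measurements (g/l) for the first 20 stable samples.
samples = [91.2, 92.0, 90.8, 91.7, 91.4, 92.3, 90.9, 91.6, 91.1, 92.1,
           91.8, 90.7, 91.5, 92.2, 91.0, 91.9, 91.3, 91.6, 91.4, 91.5]

center = statistics.mean(samples)          # the center line
sigma = statistics.stdev(samples)          # estimate of inherent variability
ucl = center + 3 * sigma                   # upper control limit
lcl = center - 3 * sigma                   # lower control limit

def out_of_control(x):
    """Flag a new measurement that falls outside the control limits."""
    return x < lcl or x > ucl

print(f"center={center:.2f}, LCL={lcl:.2f}, UCL={ucl:.2f}")
```

A new measurement flagged by `out_of_control` is the kind of strong signal, described above, that some external source of variability has entered the process.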
In this control chart the visual frame of reference provided by the center line and the
control limits indicates that some upset or disturbance has affected the process around
sample 20 because all of the following observations are below the center line and two of
them actually fall below the lower control limit. This is a very strong signal that corrective
action is required in this process. If we can find and eliminate the underlying cause of this
upset, we can improve process performance considerably.
Control charts are a very important application of statistics for monitoring, controlling, and
improving a process. The branch of statistics that makes use of control charts is called
statistical process control, or SPC.
Activity 1:
State any experiment you have done before that you can repeatedly perform in the comfort of your home. Record measurements and observe the variation. What do you think are the factors that might have contributed to the variation? Among these factors, did you find any variable that can be controlled? If yes, what is it, and how can it be controlled?
Session 2
Methods of Collecting Data, Planning and Conducting Surveys,
Planning and Conducting Experiments
Lecture:
INTRODUCTION
Statistics plays an important role in nearly all phases of our lives. It is used in agriculture,
biology and natural sciences, business and economics, electronics and computer
sciences, education, political science and sociology, and other fields of science and
engineering as mentioned in session 1.
Statistics or statistical methods are the mathematical techniques used to facilitate the
interpretation of numerical data secured from entities, individuals, or observations.
Statistics is a mathematical science including methods of collecting, organizing and
analyzing data in such a way that meaningful conclusions can be drawn from them. In
general, its investigations and analyses fall into two broad categories called descriptive
and inferential statistics.
Descriptive statistics deals with the processing of data without attempting to draw any
inferences from it. The data are presented in the form of tables and graphs. The
characteristics of the data are described in simple terms. Events that are dealt with include everyday happenings such as accidents, prices of goods, business, incomes, epidemics, sports data, and population data. Inferential statistics, in contrast, is a scientific discipline that uses mathematical tools to make forecasts and projections by analyzing the given data. It makes use of generalizations, predictions, estimations, or approximations in the face of uncertainty.
The basic idea behind all statistical methods of data analysis is to make inferences about a population by studying a small sample chosen from it. A population often consists of a large group of specifically defined elements. For example, the population of a specific country means all the people living within the boundaries of that country. Usually, it is not possible or practical to measure data for every element of the population under study. We therefore randomly select a small group of elements from the population and call it a sample. Inferences about the population are then made on the basis of one or more samples. A number that describes a population is called a parameter, while a number that describes a sample is called a statistic.
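The distinction between a parameter and a statistic can be seen in a small simulation. The population below is synthetic, generated only to illustrate the idea.

```python
import random

random.seed(7)

# A synthetic population of 10,000 measurements.
population = [random.gauss(50, 5) for _ in range(10_000)]

# The population mean is a PARAMETER: a fixed number describing the population.
parameter = sum(population) / len(population)

# A random sample of 30 elements drawn from the population.
sample = random.sample(population, 30)

# The sample mean is a STATISTIC: it varies from sample to sample.
statistic = sum(sample) / len(sample)

print(round(parameter, 2), round(statistic, 2))
```

Drawing a different sample gives a different statistic, while the parameter stays fixed; statistical inference is about judging how close the statistic is likely to be to the parameter.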
STATISTICAL INQUIRY
A statistical inquiry seeks to answer a question by collecting and examining relevant data, for example, gathering information about how many students have passed the entrance exam for an MBA program and how many have not.
Information can be gathered from a variety of sources. Importantly, there is no single best method of data collection. In principle, how data are collected depends on the nature of the research or the phenomenon being studied.

Data collection is a crucial aspect of any level of research work. If data are inaccurately collected, this will surely affect the findings of the study, leading to false or invalid outcomes.
Data collection is a systematic method of collecting and measuring data gathered from
different sources of information in order to provide answers to relevant questions. An
accurate evaluation of collected data can help researchers predict future phenomenon
and trends.
Data collection can be classified into two types, namely primary and secondary. Primary data are raw data, i.e., fresh data collected for the first time through personal experiences or evidence, particularly for research. They are also described as first-hand information. The investigator supervises and controls the data collection process directly. Mostly the data are collected through observations, physical testing, mailed questionnaires, surveys, personal interviews, telephone
interviews, case studies, and focus groups. Secondary data, on the other hand, are data that were previously collected and tested. These are second-hand data, already collected and recorded by some researcher for their own purpose and not for the current research problem. They are accessible in the form of data collected from different sources such as government publications, censuses, internal records of organizations, books, journal articles, websites, and reports. This method of gathering data is affordable and readily available, and it saves cost and time.
In summary, primary data are first-hand data collected by the investigator specifically for the problem at hand, while secondary data are second-hand data collected by others for some other purpose, and are cheaper and faster to obtain.
Primary data can be divided into two categories: quantitative and qualitative.
Quantitative data comes in the form of numbers, quantities and values. It describes things
in concrete and easily measurable terms. Because quantitative data is numeric and
measurable, it lends itself well to analytics. When you analyze quantitative data, you may
uncover insights that can help you better understand your audience. Because this kind of
data deals with numbers, it is very objective and has a reputation for reliability.
Qualitative data is descriptive rather than numeric. It is less concrete and less easily measurable than quantitative data, and it may contain descriptive phrases and opinions. Qualitative data helps explain the "why" behind the information quantitative data reveals. For this reason, it is useful for supplementing quantitative data, which will form the foundation of your data strategy.
There are many different techniques for collecting different types of quantitative data, but
there’s a fundamental process you’ll typically follow, no matter which method of data
collection you’re using. This process consists of the following five steps.
1. The first thing you need to do is choose what details you want to collect. You'll need to decide what topics the information will cover, who you want to collect it from, and how much data you need.

2. Next, you can start formulating your plan for how you'll collect your data. In the early stages of your planning process, you should establish a timeframe for your data collection. You may want to gather some types of data continuously.

3. At this step, you will choose the data collection method that will make up the core of your data-gathering strategy. To select the right collection method, you'll need to consider the type of information you want to collect, the timeframe over which you'll obtain it, and the other aspects you determined.

4. Once you have finalized your plan, you can implement your data collection strategy and start collecting, storing, and organizing your data. Be sure to stick to your plan and check on its progress regularly. It may be useful to create a schedule for checking in on how your data collection is proceeding, especially if you are collecting data continuously. You may want to update your plan as conditions change and you get new information.

5. Once you've collected all of your data, it's time to analyze it and organize your findings. The analysis phase is crucial because it turns raw data into valuable insights that you can use to enhance your marketing strategies, products, and business decisions.
DATA COLLECTION
The system of data collection is based on the type of study being conducted. Depending
on the researcher’s research plan and design, there are several ways data can be
collected.
The most commonly used methods are: published literature sources, surveys (email and
mail), interviews (telephone, face-to-face or focus group), observations, documents and
records, and experiments.
1. Literature sources
This involves the collection of data from already published text available in the public
domain. Literature sources can include: textbooks, government or private companies’
reports, newspapers, magazines, online published papers and articles.
2. Surveys
A survey is another method of gathering information for research purposes. Information is gathered through questionnaires, mostly based on individual or group experiences regarding a particular phenomenon.

There are several ways in which this information can be collected. The most notable are web-based questionnaires and paper-based questionnaires (printed forms). The results of this method of data collection are generally easy to analyze.
3. Interviews
An interview is a qualitative method of data collection whose results are based on intensive engagement with respondents about a particular study. Usually, interviews are used to collect in-depth responses from the professionals being interviewed.
4. Observations
Observation is a method of collecting data by watching and recording behaviors or events as they occur, with the researcher acting either as a participant or as a non-participant.

5. Documents and records
This is the process of examining existing documents and records of an organization for tracking changes over a period of time. Records can be tracked by examining call logs, email logs, databases, minutes of meetings, staff reports, information logs, etc.

For instance, an organization may want to understand why there are many negative reviews and complaints from customers about its products or services. In this case, the organization will look into the records of its products or services and the recorded interactions of employees with customers.
6. Experiments
Experimental research is a research method in which the causal relationship between two variables is examined. One of the variables is manipulated, and the other is measured; these are classified as the independent and dependent variables, respectively. In experimental research, data are mostly collected based on the cause and effect of the two variables being studied. This type of research is common among medical researchers, and it uses a quantitative research approach. Additional topics will be discussed in the introduction to design of experiments.
Collecting data is valuable because you can use it to make informed decisions. The more
relevant, high-quality data you have, the more likely you are to make good choices when
it comes to marketing, sales, customer service, product development and many other
areas of your business. Some specific uses of customer data include the following:
It can be difficult or impossible to get to know every one of your customers personally,
especially if you run a large business or an online business. The better you understand
your customers, though, the easier it will be for you to meet their expectations. Data
collection enables you to improve your understanding of who your audience is and
disseminate that information throughout your organization. Through the primary data
collection methods described above, you can learn about who your customers are, what
they’re interested in and what they want from you as a company.
Collecting and analyzing data helps you see where your company is doing well and where
there is room for improvement. It can also reveal opportunities for expanding your
business.
Looking at transactional data, for example, can show you which of your products are the
most popular and which ones do not sell as well. This information might lead you to focus
more on your bestsellers, and develop other similar products. You could also look at
customer complaints about a product to see which aspects are causing problems.
Data is also useful for identifying opportunities for expansion. For example, say you run
an e-commerce business and are considering opening a brick-and-mortar store. If you
look at your customer data, you can see where your customers are and launch your first
store in an area with a high concentration of existing customers. You could then expand
to other similar areas.
Analyzing the data you collect can help you predict future trends, enabling you to prepare
for them. As you look at the data for your new website, for instance, you may discover
videos are consistently increasing in popularity, as opposed to articles. This observation
would lead you to put more resources into your videos. You might also be able to predict
more temporary patterns and react to them accordingly. If you run a clothing store, you
might discover pastel colors are popular during spring and summer, while people gravitate
toward darker shades in the fall and winter. Once you realize this, you can introduce the
right colors to your stores at the right times to boost your sales.
You can even make predictions on the level of the individual customer. Say you sell
business software. Your data might show companies with a particular job title often have
questions for tech support when it comes time to update their software. Knowing this in
advance allows you to offer support proactively, making for an excellent customer
experience.
When you know more about your customers or site visitors, you can tailor the messaging
you send them to their interests and preferences. This personalization applies to
marketers designing ads, publishers choosing which ads to run and content creators
deciding what format to use for their content.
Using data collection in marketing can help you craft ads that target a given audience.
For example, say you’re a marketer looking to advertise a new brand of cereal. If your
customer data shows most people who eat the cereal are in their 50s and 60s, you can
use actors in those age ranges in your ads. If you’re a publisher, you likely have
information about what topics your site visitors prefer to read about. You can group your
audience based on the characteristics they share and then show visitors with those
characteristics content about topics popular with that group.
It would normally be impractical to study a whole population, for example when doing a
questionnaire survey. Sampling is a method that allows researchers to infer information
about a population based on results from a subset of the population, without having to
investigate every individual. Reducing the number of individuals in a study reduces the
cost and workload, and may make it easier to obtain high quality information, but this has
to be balanced against having a large enough sample size with enough power to detect
a true association.
There are several different sampling techniques available, and they can be subdivided
into two groups: probability sampling and non-probability sampling. In probability
(random) sampling, you start with a complete sampling frame of all eligible individuals
from which you select your sample. In this way, all eligible individuals have a chance of
being chosen for the sample, and you will be more able to generalize the results from
your study. Probability sampling methods tend to be more time-consuming and expensive
than non-probability sampling. In non-probability (non-random) sampling, you do not start
with a complete sampling frame, so some individuals have no chance of being selected.
Consequently, you cannot estimate the effect of sampling error and there is a significant
risk of ending up with a non-representative sample which produces non-generalizable
results. However, non-probability sampling methods tend to be cheaper and more
convenient, and they are useful for exploratory research and hypothesis generation.
PROBABILITY SAMPLING METHODS

1. Simple random sampling
In a simple random sample, every member of the population has an equal chance of being selected, for example by drawing names or numbers at random from the complete sampling frame.

2. Systematic sampling
Systematic sampling selects individuals at regular intervals from the sampling frame, for example every tenth person on a list, starting from a randomly chosen point.
3. Stratified sampling
Stratified random sampling allows researchers to obtain a sample population that best
represents the entire population being studied. It involves dividing the entire population
into homogeneous groups called strata and then drawing a random sample from each
stratum.
4. Clustered sampling
The researcher divides the population into separate groups, called clusters. Then a
simple random sample of clusters is selected from the population, and the analysis is
conducted on data from the sampled clusters.
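As a rough illustration of these probability sampling methods, here is a minimal Python sketch using the standard random module; the sampling frame and its stratum and cluster labels are entirely hypothetical:

```python
import random

# Hypothetical sampling frame: 100 individuals, each assigned one of
# two strata and one of ten clusters.
random.seed(42)
frame = [{"id": i, "stratum": i % 2, "cluster": i // 10} for i in range(100)]

# Simple random sampling: every individual has an equal chance.
srs = random.sample(frame, k=10)

# Systematic sampling: every k-th individual after a random start.
k = len(frame) // 10
start = random.randrange(k)
systematic = frame[start::k]

# Stratified sampling: draw a random sample from each stratum.
stratified = []
for s in (0, 1):
    stratum = [ind for ind in frame if ind["stratum"] == s]
    stratified.extend(random.sample(stratum, k=5))

# Cluster sampling: randomly select whole clusters, keep all members.
chosen = random.sample(range(10), k=2)
clustered = [ind for ind in frame if ind["cluster"] in chosen]

print(len(srs), len(systematic), len(stratified), len(clustered))  # 10 10 10 20
```

Note how stratified sampling guarantees both strata appear in the sample, while cluster sampling keeps whole groups intact.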
NON-PROBABILITY SAMPLING METHODS
1. Convenience sampling
Convenience sampling (also known as availability sampling) is a specific type of non-
probability sampling method that relies on data collection from population members who
are conveniently available to participate in the study. Facebook polls or questions are a
popular example of convenience sampling.
2. Quota sampling
Quota sampling takes a tailored sample that is in proportion to some characteristic or trait
of a population. For example, you could divide a population by the state they live in,
income or education level, or sex. The population is divided into groups (also called strata)
and samples are taken from each group to meet a quota.
3. Judgement (or Purposive) Sampling
In judgement (or purposive) sampling, the researcher relies on his or her own judgement
to select the members of the population who are most useful for the purposes of the study.
4. Snowball sampling
Snowball sampling or chain-referral sampling is a non-probability sampling technique
used when the samples have traits that are rare to find: existing subjects provide referrals
to recruit the further samples required for a research study.
BIAS IN SAMPLING
There are five important potential sources of bias that should be considered when
selecting a sample, irrespective of the method used. Sampling bias may be introduced
when:
1. Any pre-agreed sampling rules are deviated from
2. People in hard-to-reach groups are omitted
3. Selected individuals are replaced with others, for example if they are difficult to
contact
4. There are low response rates
5. An out-of-date list is used as the sample frame (for example, if it excludes people
who have recently moved to an area)
h. Re-train staff (if necessary)
I. SURVEY OBJECTIVES
An effective survey begins with specific objectives. Choose survey questions with the
objectives in mind. Every question should refer back to the objectives.
Example
1: To determine the prevalence of HIV infection by race, ethnicity, age, and geographic
location of women patients accessing our downtown clinic from July 1 to December 31.
2: To identify the most common misconceptions about HIV among high school seniors at
Washington High School from September 1 through 30.
A. QUESTIONNAIRES
Self-Administered
o Surveys that respondents complete by themselves. Questionnaires can be
printed and distributed in-person, made available on location, mailed or
available on-line.
Group-Administered
o Surveys that are administered by a group facilitator (i.e., questions are read
aloud to respondents) and respondents complete the survey in the group
setting
Face-To-Face Interviews
o Surveys that are conducted between two people, one who asks the
questions and records the answers and the respondent who answers the
questions. These can be in-person or over the phone.
C. STRUCTURED OBSERVATIONS
Surveys where data are collected visually by observers in a systematic way to gather
information.
How can you be sure that you have an adequate amount of resources (time, experience
and money) to conduct your survey? Answer these seven questions:
GUIDELINES FOR ASSESSING REASONABLE RESOURCES
A survey’s resources are reasonable if they adequately cover the financial costs
and time needed for all activities of the survey in the time planned. You need to
consider the following:
QUANTITATIVE VARIABLES
When you collect quantitative data, the numbers you record represent real amounts that
can be added, subtracted, divided, etc. There are two types of quantitative variables:
discrete and continuous.
Discrete vs Continuous Variables
CATEGORICAL VARIABLES
Categorical variables represent groupings of some kind. They are sometimes recorded
as numbers, but the numbers represent categories rather than actual amounts of things.
There are three types of categorical variables: binary, nominal, and ordinal variables.
Statistical studies usually include one or more independent variables and one dependent
variable. The independent variable in an experimental study is the one that is being
manipulated by the researcher. The independent variable is also called the explanatory
variable. The resultant variable is called the dependent variable or the outcome variable.
You will probably also have variables that you hold constant (control variables) in order
to focus on your experimental treatment.
Independent vs Dependent vs Control Variables
Once you have defined your independent and dependent variables and determined
whether they are categorical or quantitative, you will be able to choose the correct
statistical test.
But there are many other ways of describing variables that help with interpreting your
results. Some useful types of variable are listed below.
measurements of plant health in our salt-addition experiment.
Composite variables: A variable that is made by combining multiple variables in an
experiment. These variables are created when you analyze data, not when you measure
it. For example, the three plant health variables could be combined into a single
plant-health score to make it easier to present your findings.
The most commonly used terms in the DOE methodology include: controllable and
uncontrollable input factors, responses, hypothesis testing, blocking, replication and
interaction.
Controllable input factors, or x factors, are those input parameters that can be modified
in an experiment or process. For example, in cooking rice, these factors include the
quantity and quality of the rice and the quantity of water used for boiling.
Uncontrollable input factors are those parameters that cannot be changed. In the rice-
cooking example, this may be the temperature in the kitchen. These factors need to be
recognized to understand how they may affect the response.
Responses, or output measures, are the elements of the process outcome that gauge
the desired effect. In the cooking example, the taste and texture of the rice are the
responses.
The controllable input factors can be modified to optimize the output. The relationship
between the factors and responses is shown in the figure.
Hypothesis testing helps determine the significant factors using statistical methods.
There are two possibilities in a hypothesis statement: the null and the alternative. The null
hypothesis represents the status quo; the alternative hypothesis holds if the status
quo is not valid. Testing is done at a level of significance, which is based on a probability.
Blocking and replication: Blocking is an experimental technique to avoid any unwanted
variations in the input or experimental process. For example, an experiment may be
conducted with the same equipment to avoid any equipment variations. Practitioners also
replicate experiments, performing the same combination run more than once, in order to
get an estimate for the amount of random error that could be part of the process.
2. A "full factorial" design that studies the response of every combination of factors
and factor levels, and an attempt to zero in on a region of values where the
process is close to optimization.
3. A response surface designed to model the response.
Use DOE when more than one input factor is suspected of influencing an output. For
example, it may be desirable to understand the effect of temperature and pressure on
the strength of a glue bond.
DOE can also be used to confirm suspected input/output relationships and to develop a
predictive equation suitable for performing what-if analysis.
When you are involved in conducting a research project, you generally go through the
steps described below, either formally or informally. Some of these are directly involved
in designing the experiment to test the hypotheses required by the project. The following
steps are generally used in conducting a research project.
1. Review pertinent literature to learn what has been done in the field and to become
familiar enough with the field to allow you to discuss it with others. The best ideas
often cross disciplines and species, so a broad approach is important. For example,
recent research in controlling odors in swine waste has exciting implications for fly
and nematode control.
2. Define your objectives and the hypotheses that you are going to test. You can't be
vague. You must be specific. A good hypothesis is:
a. Clear enough to be tested
b. Adequate to explain the phenomenon
c. Good enough to permit further prediction
d. As simple as possible
3. Devise procedures to test the hypotheses. The experiments required to solve these
problems vary greatly in scope and complexity
and also in resource requirements.
4. Evaluate the feasibility of testing the hypothesis. One should be relatively certain that
an experiment can be set up to adequately test the hypotheses with the available
resources. Therefore, a list should be made of the costs, materials, personnel,
equipment, etc., to be sure that adequate resources are available to carry out the
research. If not, modifications will have to be made to design the research to fit the
available resources.
a. Selection of treatment design is very crucial and can make the difference between
success and failure in achieving the objectives. One should seek the help of a statistical
resource person (statistician) or of others more experienced in the field. Statistical
help should be sought when planning an experiment rather than afterward, when a
statistician is expected to extract meaningful conclusions from a poorly designed
experiment. An example of a poor selection of treatments is the experiment which
demonstrated that each of three treatments, Scotch and water, gin and water, and
Bourbon and water, taken orally in sufficient quantities, produces some degree of
intoxication. Will this experiment provide information on which ingredient or mixture
causes intoxication? Why? How can this experiment be improved? An example
related to agriculture is an experiment with 2 treatments, Ammonium Sulfate and
Calcium Nitrate, selected to determine whether or not maize responds to N fertilizer
on a Typic Paleudult soil. Will this experiment provide the desired information? What
is lacking? What sources of confusion are included in the treatments?
different climates? With animal experiments, you can measure just the increase in
weight or also total food intake, components of blood, food digestibility etc.
d. Selection of the unit of observation, i.e., the individual plant, one row, or a whole plot,
etc? One animal or a group of animals?
f. Probable results: Make an outline of pertinent summary tables and probable results.
Using information gained in the literature review write out the results you expect.
Essentially perform the experiment in theory and predict the results expected.
g. Make an outline of statistical analyses to be performed. Before you plant the first pot
or plot or feed the first animal, you should have set up an outline of the statistical
analysis of your experiment to determine whether or not you are able to test the
factors you wish with the precision you desire. One of the best ways to do this is to
write out the analysis of variance table (source of variation and df) and determine the
appropriate error terms for testing the effects of interest. A cardinal rule is to be sure
you can analyze the experiment yourself and will not require a statistician to do it for
you--he might not be there when you need him. Another danger in this age of the
computer and statistical programs is to believe that you can just run the data through
the statistical program and the data will be analyzed for you. While this is true to
a certain extent, you must remember that the computer is a perfect idiot and does
only what you tell it to do. Therefore, if you do not know what to tell the computer to
do and/or if you don't know what the computer is doing, you may end up with a lot of
useless output: garbage! Also, there is the little matter of interpreting all the computer
output that you can get in a very short time. This is your responsibility, and you had
better know what it is all about.
is grading until after he has graded it. Have two people do the data collection: one
grades and the other records.
9. Make a complete analysis of the data: Be sure to have a plan of analysis, e.g., which
analyses will be done and in what order? Interpret the results in the light of the
experimental conditions and hypotheses tested. Statistics do not prove anything and
there is always the possibility that your conclusions may be wrong. One must consider
the consequences of drawing an incorrect conclusion and modify the interpretation
accordingly. Do not jump to a conclusion just because an effect is significant. This is
especially so if the conclusion doesn't agree with previously established facts. The
experimental data should be checked very carefully if this occurs, as the results must
make sense!
10. Finally, prepare a complete, correct, and readable report of the experiment. This may
be a report to the farmer or researcher or an extension publication. There is no such
thing as a negative result. If the null hypothesis is not rejected, it is positive evidence
that there may be no real difference among the treatments tested.
1. Replicate: This provides a measure of variation (an error term) which is used in
evaluating the effects observed in the experiment. This is the only way that the validity
of your conclusions from the experiment can be measured.
3. Request Help: Ask for help when in doubt about how to design, execute or analyze
your experiment. Not everyone is a statistician, but everyone should know the important
principles of scientific experimentation. Be on guard against common pitfalls and ask
for help when you need it. Do this when planning an experiment, not after it is
completed.
Activity:
Find a published study online on any engineering field that is of interest to you, and is
aligned with your undergraduate program. Use GOOGLE SCHOLAR as your search
engine, and limit your search results to studies published within the last 5 years.
Design your own study based on the research of your interest. You may replicate the
study while altering the control variables, such as changing the materials to be used,
adjusting physical dimensions of an experimental setup, varying mixture proportion,
different time and place of sampling, etc. Identify the rationale of your proposed study,
and list down at least two (2) objectives. In each objective, identify the independent and
dependent variables. An example is provided for your reference.
Example:
Proposed Research Title : Modal Analysis of the Bridge of Promise via Ambient
Vibration Test (AVT)*
Reference Literature : Feng, M., Fukuda, Y., Mizuta, M., & Ozer, E. (2015). Citizen
sensors for SHM: Use of accelerometer data from smartphones. Sensors, 15:
2980-2998
Rationale of the Study : To provide a quantitative assessment of the structural health
of the Bridge using the smartphone accelerometers
Session 3
Organizing Data
Lecture:
DATA PRESENTATION
The main portion of statistics is the display of summarized data. Data are initially collected
from a given source, whether from experiments, surveys, or observation. Presentation
of data also needs careful planning. If data are properly and interestingly
presented, the benefits will go not only to the readers or users but more so to the
statisticians who will make the analysis and interpretation of the data gathered. The
selection of the method of presentation depends on the type of data, the method of analysis,
and the type of information sought from the data. The data gathered are summarized and
presented in any of three methods: textual form, tabular form and graphical form.
TEXTUAL PRESENTATION
Textual presentation of data means presenting data in the form of words, sentences and
paragraphs. The opposite of textual presentation is graphical presentation of data. While
graphical presentation of data is the most popular and widely used in the research, textual
presentation allows the researcher to present qualitative data that cannot be presented
in graphical or tabular forms.
The textual presentation of data is very helpful in presenting contextual data. It helps the
researcher explain and analyze specific points in data. While presenting data in textual
form the researcher should consider the following factors.
1. The researcher should know the target audience who are going to read it. The
researcher should use a language in the presentation of data that is easy to
understand and highlights the main points of the data findings.
2. The author should use wording that does not introduce bias in research. Avoid
the use of biased, slanted, or emotional language.
3. Accuracy should be maintained in presenting data and the numbers and
percentages presented in the textual data should be reviewed to avoid any
mistakes in the presentation.
4. To make it easier for the audience to comprehend the important points in data, the
researcher should avoid unnecessary details. Too much detail can make it difficult
for the audience to concentrate on the key points in data.
5. Do not repeat the same point again and again; it will ruin the purpose of textual
presentation. Your data presentation will become monotonous if there is repetition
of data findings.
6. Try to shorten longer phrases wherever possible, and merge two phrases when they
can be combined into one.
7. One mistake that researchers often make is to use general descriptive words like
too much, little, exactly, all, always, never, and must. These words should be
avoided as they only add unnecessary detail to your data presentation. Numbers
and percentages better describe and fulfill the aim of data presentation.
8. Another point to consider is to avoid decorative language while making sure that you
use scholarly language in your data presentation.
ADVANTAGES OF TEXTUAL PRESENTATION
1. Textual presentation of data allows the researcher to interpret the data more
elaborately during the presentation. According to In and Lee (2017), text is the
principal method for explaining findings, outlining trends, and providing contextual
information.
2. It allows the researcher to present qualitative data that cannot be presented in
graphical or tabular form.
3. Textual presentation can help in emphasizing some important points in data. It
allows the researcher to explain the data in contextual manner and the reader can
draw meaning out of it.
4. Small sets of data can be easily presented through textual presentation. For
example, simple data like "there are 30 students in the class, 20 of whom are girls
while 10 are boys" is easier to understand through textual presentation. No table or
graph is required to present this data as it can be comprehended through text.
DISADVANTAGES OF TEXTUAL PRESENTATION
TABULAR PRESENTATION
A table helps present even a large amount of data in an engaging, easy-to-read
and coordinated manner. The data is arranged in rows and columns. This is one of the
most popularly used forms of presentation of data, as data tables are simple to prepare
and read.
The most significant benefit of tabulation is that it coordinates data for additional statistical
treatment and decision-making. The analysis used in tabulation is of 4 types:
OBJECTIVES OF TABULATION:
Saving of Space
LIMITATIONS OF TABULATION
1. Lacks Description
a. The table represents only figures and not attributes.
b. It ignores the qualitative aspects of facts.
2. Incapable of Presenting Individual Items
a. It does not present individual items.
b. It presents aggregate data.
3. Needs Special Knowledge
a. The understanding of the table requires special knowledge.
b. It cannot be easily used by the layman.
Table Number: The table number is the very first item mentioned on the top of each
table, for easy identification and further reference.
Title: The title of the table is the second item, shown just above the table. It narrates
the contents of the table, so it has to be very clear, brief and carefully worded.
Headnote: The third item, shown after the title and just above the table. It gives
information about the unit of the data, like "Amount in Rupees or $" or "Quantity in
Tonnes". It is generally given in brackets.
Captions or Column Headings: At the top of each column in a table, a column
designation/head is given to explain the figures of the column. This column heading
is called a "Caption".
Stubs or Row Headings: The titles of the horizontal rows are called "Stubs".
Body of the Table: It contains the numeric information and reveals the whole story
of the investigated facts. Columns are read vertically from top to bottom and rows
are read horizontally from left to right.
Source Note: A brief statement or phrase indicating the source of the data presented
in the table.
Footnote: It explains a specific feature of the table which is not self-explanatory and
has not been explained earlier, for example, points of exception if any.
There are many ways to construct a good table. However, some basic ideas are:
1. The title should be in accordance with the objective of study: The title of a table
should provide a quick insight into the table.
2. Comparison: If a need might arise to compare any two rows or columns, then these
should be kept close to each other.
3. Alternative location of stubs: If the rows in a data table are lengthy, then the stubs
can be placed on the right-hand side of the table.
4. Headings: Headings should be written in a singular form. For example, ‘good’ must
be used instead of ‘goods’.
5. Footnote: A footnote should be given only if needed.
6. Size of columns: Size of columns must be uniform and symmetrical.
7. Use of abbreviations: Headings and sub-headings should be free of abbreviations.
8. Units: There should be a clear specification of units above the columns.
GRAPHICAL PRESENTATION
Line Graphs – Line graphs are used to display continuous data and are useful for
predicting future events over time.
Bar Graphs – Bar Graph is used to display the category of data and it compares
the data using solid bars to represent the quantities.
Histograms – The graph that uses bars to represent the frequency of numerical
data that are organized into intervals. Since all the intervals are equal and
continuous, all the bars have the same width.
Line Plot – It shows the frequency of data on a given number line. An 'x' is placed
above the number line each time that data value occurs.
Frequency Table – The table shows the number of pieces of data that fall within a
given interval.
Circle Graph – Also known as a pie chart, it shows the relationships of the parts to
the whole. The circle represents 100%, and each category occupies a slice with its
specific percentage, like 15%, 56%, etc.
Stem and Leaf Plot – In a stem and leaf plot, the data are organized from least
value to greatest value. The digit in the lowest place value forms the leaf and the
next place value digits form the stem.
Box and Whisker Plot – The plot summarizes the data by dividing it into
four parts. A box and whisker plot shows the range (spread) and the middle (median) of
the data.
There are certain rules to effectively present the data and information in the graphical
representation. They are:
Suitable Title: Make sure that the appropriate title is given to the graph which
indicates the subject of the presentation.
Measurement Unit: Mention the measurement unit in the graph
Proper Scale: To represent the data in an accurate manner, choose a proper scale.
Index: Index the appropriate colors, shades, lines, design in the graphs for better
understanding
Data Sources: Include the source of information wherever it is necessary at the
bottom of the graph.
Keep it Simple: Construct a graph in an easy way that everyone can understand.
Neat: Choose the correct size, lettering, colors etc. in such a way that the graph
should be a visual aid for the presentation of information.
Graphs and charts provide major benefits. First, they can quickly provide information
related to trends and comparisons by allowing for a global view of the data. They also allow
members of the audience who may be less versed in numerical analysis to follow the
information and understand the presentation more fully. Secondly, graphs and charts
provide a visual version of data, which can be helpful for visual learners.
However, a poorly designed chart can give a misleading view of the data. Attempting to correct this can make charts overly complex,
which can make their value in aiding a presentation less useful. Finally, it is important to
use the correct chart and/or graph when presenting information, though this can be
difficult to identify for ambiguous data.
A stem and leaf display is an exploratory data analysis tool. It is a technique used to
classify either discrete or continuous variables. It is a graphical method of displaying data.
It is particularly useful when your data are not too numerous.
Stem and Leaf Diagrams allow you to display raw data visually. Each raw score is divided
into a stem and a leaf. The leaf is typically the last digit of the raw value. The stem is the
remaining digits of the raw value. To generate a stem and leaf diagram you must first
create a vertical column that contains all of the stems. Then list each leaf next to the
corresponding stem. In these diagrams, all of the scores are represented in the diagram
without the loss of any information. The graph below is an example of a Stem and Leaf
Diagram:
0 | 69
1 | 2244456888899
From this, we can now easily arrange our data from lowest to highest value.
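The diagram above can be generated programmatically. The sketch below assumes the raw scores behind the display (inferred from the leaves shown) and groups each value's last digit under its leading digit:

```python
from collections import defaultdict

def stem_and_leaf(data):
    """Split each value into a stem (leading digits) and a leaf (last digit)."""
    groups = defaultdict(list)
    for value in sorted(data):
        stem, leaf = divmod(value, 10)  # e.g. 18 -> stem 1, leaf 8
        groups[stem].append(leaf)
    return dict(sorted(groups.items()))

# Raw scores inferred from the leaves in the display above.
scores = [6, 9, 12, 12, 14, 14, 14, 15, 16, 18, 18, 18, 18, 19, 19]
for stem, leaves in stem_and_leaf(scores).items():
    print(stem, "|", "".join(str(leaf) for leaf in leaves))
# 0 | 69
# 1 | 2244456888899
```

Because the values are sorted before grouping, reading the stems top to bottom already arranges the data from lowest to highest.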
HISTOGRAM
A frequency distribution shows how often each different value in a set of data occurs. A
histogram is the most commonly used graph to show frequency distributions. It looks very
much like a bar chart, but there are important differences between them; the bars always
touch and the width of the bar represents a quantitative value, such as age. This helpful
data collection and analysis tool is considered one of the seven basic quality tools.
A Histogram consists of a number of bars placed side by side. The y-axis uses Frequency
as its label so Histograms are generally referred to as Frequency Histograms. The width
of each bar on the x-axis indicates the interval size. The height of each bar indicates the
frequency of the interval. To create a Frequency Histogram you must first determine the
frequency of each interval. Then draw the bars of the histogram. The bars
are drawn from the lower limits to the upper limits along the x-axis. Since the intervals
are connected the bars of the histogram should also be connected. The graph below is
an example of a Frequency Histogram:
7. You wish to communicate the distribution of data quickly and easily to others
FREQUENCY POLYGONS
Frequency polygons are a graphical device for understanding the shapes of distributions.
They serve the same purpose as histograms, but are especially helpful for comparing
sets of data. Frequency polygons are also a good choice for displaying cumulative
frequency distributions.
FREQUENCY POLYGON
Relative frequency is used to compare two distributions that have different numbers of
subjects. Relative frequency can be graphed as a Relative Frequency Polygon. Relative
frequency polygons are created in the same manner as the frequency polygon. The only
difference being that you use relative frequency instead of frequency values. The graph
below is an example of a Relative Frequency Polygon:
Cumulative polygons display the number of data values in a distribution that fall below a
particular score. Cumulative Polygons are created in the same manner as the polygon. The
only difference is that you use cumulative values, and the upper real limits of the intervals
are used instead of the midpoints. The s-shaped curve of the graph below is called an
"ogive", pronounced "oh-jive."
HOW TO DEVELOP A FREQUENCY DISTRIBUTION, HISTOGRAM, FREQUENCY
POLYGON, RELATIVE FREQUENCY POLYGON AND OGIVES
• Arrange data in ascending order. We can use a stem and leaf display.
• Determine the class width. If we decide on the number of classes, it should be
between 5 and 15 classes only. With fewer than 5 classes we lose too much
information, and with more than 15 classes our graph will be too crowded. We use
the formula:
CW = (highest value − lowest value) / number of classes
If we do not have any desired number of classes, the number of classes k can be
estimated with Sturges' rule, k = 1 + 3.322 log₁₀ n, where n is the number of
observations.
• Computed class width (CW) should be adjusted to the next whole number.
• Create the distinct classes (class limits). We use the convention that the lower
class limit of the first class is the smallest data value. Add the class width to this
number to get the lower class limit of the next class.
• Determine the class boundaries. Apply the continuity correction (subtract 0.5 from
the lower limit and add 0.5 to the upper limit).
• Compute the midpoint (class mark) for each class. Add the upper limit and lower
limit then divide the sum by 2.
• Tally the data into classes. Each data value should fall into exactly one class. Total
the tallies to obtain each class frequency. This will be the basis for the histogram
and frequency polygon.
• For each class, compute the relative frequency f/n, where f is the class frequency
and n is the total sample size. This will be the basis for the relative frequency
polygon.
• For the cumulative frequency, we have the less than cumulative frequency and
the greater than cumulative frequency. For the less than cumulative frequency,
frequencies are added from top to bottom. For the greater than cumulative
frequency, frequencies are added from bottom to top. This is the basis for the
cumulative frequency polygon, or simply the ogive.
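The whole procedure can be sketched as one Python function. This is a minimal illustration for integer-valued data; the 12-value demo dataset is invented purely to show the output and is not taken from the examples below:

```python
def frequency_distribution(data, num_classes):
    """Follow the steps above: class width, limits, boundaries, midpoints,
    frequency, relative frequency, and less-than cumulative frequency."""
    data = sorted(data)
    spread = data[-1] - data[0]
    # Adjust the computed class width to the next whole number.
    cw = spread // num_classes + 1
    n = len(data)
    rows, cum, lower = [], 0, data[0]
    for _ in range(num_classes):
        upper = lower + cw - 1
        f = sum(lower <= x <= upper for x in data)  # tally this class
        cum += f
        rows.append({
            "limits": (lower, upper),
            "boundaries": (lower - 0.5, upper + 0.5),  # continuity correction
            "midpoint": (lower + upper) / 2,
            "f": f,
            "rel_f": round(f / n, 3),
            "cum_f": cum,  # less-than cumulative frequency
        })
        lower = upper + 1
    return rows

demo = [1, 3, 4, 6, 7, 9, 10, 12, 14, 15, 17, 19]
for row in frequency_distribution(demo, 3):
    print(row)
```

For the demo data the spread is 18, so the class width is 18 // 3 + 1 = 7, giving classes 1-7, 8-14 and 15-21 with frequencies 5, 4 and 3.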
Example 1: Given below are the one-way commuting distances in miles for 60 workers in
Downtown Dallas.
13 47 10 3 16 20 17 40 4 2
7 25 8 21 19 15 3 17 14 6
12 45 1 8 4 16 11 18 23 12
6 2 14 13 7 15 46 12 9 18
34 13 41 28 36 17 24 27 29 9
14 26 10 24 37 31 8 16 12 16
(a) Construct the stem and leaf plot for the data above.
(b) Determine the class width, and setup a frequency distribution.
(c) Construct the frequency histogram, relative frequency polygon and the cumulative
frequency ogives.
Stem Leaf
0 1 2 2 3 3 4 4 6 6 7 7 8 8 8 9 9
1 0 0 1 2 2 2 2 3 3 3 4 4 4 5 5 6 6 6 6 7 7 7 8 8 9
2 0 1 3 4 4 5 6 7 8 9
3 1 4 6 7
4 0 1 5 6 7
Arranged data:
1 7 12 15 19 29
2 8 12 16 20 31
2 8 12 16 21 34
3 8 13 16 23 36
3 9 13 16 24 37
4 9 13 17 24 40
4 10 14 17 25 41
6 10 14 17 26 45
6 11 14 18 27 46
7 12 15 18 28 47
Frequency Distribution
N = 60
Relative Frequency Histogram and Polygon
Example 2: Many airline passengers seem weighted down with their carry-on luggage.
Just how much weight are they carrying? The carry-on luggage weights for a random
sample of 40 passengers returning from a vacation to Hawaii were recorded in pounds.
Use 6 classes.
Weights of Carry-On Luggage in Pounds
30 27 12 42 35 47 38 36 27 35
22 17 29 3 21 10 38 32 41 33
26 45 18 43 18 32 31 32 19 21
33 31 28 29 51 12 32 18 21 26
Stem Leaf
0 3
1 7 2 8 8 0 2 8 9
2 2 6 7 9 8 9 1 7 1 1 6
3 0 3 1 5 2 8 8 1 2 6 2 2 5 3
4 5 2 3 7 1
5 1
Arranged data
3 17 19 22 27 30 32 33 38 43
10 18 21 26 28 31 32 35 38 45
12 18 21 26 29 31 32 35 41 47
12 18 21 27 29 32 33 36 42 51
CW = (51 − 3) / 6 = 8 ≈ 9
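The class width and class frequencies for Example 2 can be checked with a few lines of Python; the class limits follow the convention above, starting the first lower limit at the smallest value, 3:

```python
weights = [30, 27, 12, 42, 35, 47, 38, 36, 27, 35,
           22, 17, 29, 3, 21, 10, 38, 32, 41, 33,
           26, 45, 18, 43, 18, 32, 31, 32, 19, 21,
           33, 31, 28, 29, 51, 12, 32, 18, 21, 26]

# (51 - 3) / 6 = 8, adjusted to the next whole number: 9
cw = (max(weights) - min(weights)) // 6 + 1

# Class limits: 3-11, 12-20, 21-29, 30-38, 39-47, 48-56
limits = [(3 + i * cw, 3 + i * cw + cw - 1) for i in range(6)]
freqs = [sum(lo <= w <= hi for w in weights) for lo, hi in limits]
print(cw, freqs)  # 9 [2, 7, 11, 14, 5, 1]
```

The frequencies sum to 40, matching the sample size, and each weight falls into exactly one class.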
Frequency Distribution
N = 40
Relative Frequency Histogram and Polygon
Activity:
1) The following scores represent the final examination grade for an elementary statistics
course:
23 60 79 32 57 74 52 70 82 36
80 77 81 95 41 65 92 85 55 76
52 10 64 75 78 25 80 98 81 67
41 71 83 54 54 72 88 62 74 43
60 78 89 76 76 48 84 90 15 79
34 67 17 82 82 74 63 80 85 61
a) Construct a stem and leaf plot for the examination grades in which the stems
are 1, 2, 3, …, 9.
b) Set up a relative frequency distribution.
c) Construct the frequency histogram, relative frequency polygon and the cumulative
frequency ogives.
2) The following data represent the length of life in years, measured to the nearest tenth,
of 30 similar fuel pumps:
2.0 3.0 0.3 3.3 1.3 0.4
0.2 6.0 5.5 6.5 0.2 2.3
1.5 4.0 5.9 1.8 4.7 0.7
4.5 0.3 1.5 0.5 2.5 5.0
1.0 6.0 5.6 6.0 1.2 0.2
a) Construct a stem and leaf plot for the life in years of the fuel pumps using the digit
to the left of the decimal point as the stem for each observation.
b) Set up a relative frequency distribution.
c) Construct the frequency histogram, relative frequency polygon and the cumulative
frequency ogives.
3) The following data represent the length of life in seconds of 50 fruit flies to a new spray
in a controlled laboratory experiment.
17 20 10 9 23 13 12 19 18 24
12 14 6 9 13 6 7 10 13 7
16 18 8 13 3 32 9 7 10 11
13 7 18 7 10 4 27 19 16 8
7 10 5 14 15 10 9 6 7 15
a) Construct a double-stem and leaf plot for the life span of the fruit flies using the stems
Session 4
Measures of Central Tendency and Dispersion
1. Define, illustrate, and distinguish the different measures of central tendency and
dispersion for grouped and ungrouped data;
2. Interpret the computations for grouped and ungrouped data
Lecture:
INTRODUCTION
A measure of central tendency is a single value that attempts to describe a set of data by
identifying the central position within that set of data. As such, measures of central
tendency are sometimes called measures of central location. They are also classed as
summary statistics. The mean (often called the average) is most likely the measure of
central tendency that you are most familiar with, but there are others, such as the median
and the mode.
The mean, median and mode are all valid measures of central tendency, but under
different conditions, some measures of central tendency become more appropriate to use
than others. In the following sections, we will look at the mean, mode and median, and
learn how to calculate them and under what conditions they are most appropriate to be
used.
MEAN
The mean (or average) is the most popular and well known measure of central tendency.
It can be used with both discrete and continuous data, although its use is most often with
continuous data. The mean is equal to the sum of all the values in the data set divided by
the number of values in the data set. So, if we have n values in a data set and they have
values x1, x2, x3…xn, the sample mean, usually denoted by 𝑥̅ , (pronounced "x bar"), is:
𝑥̅ = (𝑥1 + 𝑥2 + 𝑥3 + ... + 𝑥𝑛) / 𝑛
This formula is usually written in a slightly different manner using the Greek capital letter ∑, pronounced "sigma", which means "sum of...":
𝑥̅ = ∑𝑥 / 𝑛
You may have noticed that the above formula refers to the sample mean. So, why have
we called it a sample mean? This is because, in statistics, samples and populations have
very different meanings and these differences are very important, even if, in the case of
the mean, they are calculated in the same way. To acknowledge that we are calculating
the population mean and not the sample mean, we use the Greek lower case letter "mu",
denoted as µ:
µ = ∑𝑥 / 𝑁
The mean is essentially a model of your data set: a single value that represents the data set as a whole.
You will notice, however, that the mean is not often one of the actual values that you have
observed in your data set. However, one of its important properties is that it minimizes
error in the prediction of any one value in your data set. That is, it is the value that
produces the lowest amount of error from all other values in the data set.
An important property of the mean is that it includes every value in your data set as part
of the calculation. In addition, the mean is the only measure of central tendency where
the sum of the deviations of each value from the mean is always zero.
The mean has one main disadvantage: it is particularly susceptible to the influence of
outliers. These are values that are unusual compared to the rest of the data set by being
especially small or large in numerical value. For example, consider the wages of staff at
a factory below:
Staff 1 2 3 4 5 6 7 8 9 10
Salary 15k 18k 16k 14k 15k 15k 12k 17k 90k 95k
The mean salary for these ten staff is $30.7k. However, inspecting the raw data suggests
that this mean value might not be the best way to accurately reflect the typical salary of a
worker, as most workers have salaries in the $12k to 18k range. The mean is being
skewed by the two large salaries. Therefore, in this situation, we would like to have a
better measure of central tendency. We can apply a resistant measure called trimmed
mean.
Example:
Many airline passengers seem weighted down with their carry-on luggage. Just how much
weight are they carrying? The carry-on luggage weights for a random sample of 40
passengers returning from a vacation to Hawaii were recorded in pounds.
𝑥̅ = ∑𝑥 / 𝑛 = 1141 / 40 = 28.53
3 17 19 22 27 30 32 33 38 43
10 18 21 26 28 31 32 35 38 45
12 18 21 26 29 31 32 35 41 47
12 18 21 27 29 32 33 36 42 51
Applying a trimmed mean that discards the two smallest values (3 and 10) and the two largest values (47 and 51) leaves 36 observations:

𝑥̅ = ∑𝑥 / 𝑛 = 1030 / 36 = 28.61
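The two computations above can be reproduced with a short Python sketch (illustrative only; the helper name `trimmed_mean` is ours, not from the text):

```python
# Mean and trimmed mean of the 40 carry-on luggage weights.
weights = [3, 17, 19, 22, 27, 30, 32, 33, 38, 43,
           10, 18, 21, 26, 28, 31, 32, 35, 38, 45,
           12, 18, 21, 26, 29, 31, 32, 35, 41, 47,
           12, 18, 21, 27, 29, 32, 33, 36, 42, 51]

mean = sum(weights) / len(weights)        # 1141 / 40 = 28.525

def trimmed_mean(data, k):
    """Mean after discarding the k smallest and k largest values."""
    trimmed = sorted(data)[k:len(data) - k]
    return sum(trimmed) / len(trimmed)

tmean = trimmed_mean(weights, 2)          # 1030 / 36, about 28.61
```

Dropping two values from each tail reproduces the 1030/36 computation above.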
MEDIAN
The median is the middle score for a set of data that has been arranged in order of
magnitude. The median is less affected by outliers and skewed data. In order to calculate
the median, suppose we have the data below:
65 55 89 56 35 14 56 55 87 45 92
We first need to rearrange that data into order of magnitude (smallest first):
14 35 45 55 55 56 56 65 87 89 92
Our median mark is the middle mark - in this case, 56 (highlighted in bold). It is the middle
mark because there are 5 scores before it and 5 scores after it. This works fine when you
have an odd number of scores, but what happens when you have an even number of
scores? What if you had only 10 scores? Well, you simply have to take the middle two
scores and average the result. So, if we look at the example below:
65 55 89 56 35 14 56 55 87 45
14 35 45 55 55 56 56 65 87 89
Only now we have to take the 5th and 6th score in our data set and average them to get
a median of 55.5.
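The procedure just described can be expressed as a small Python helper (a sketch; the function name is ours):

```python
def median(data):
    """Middle score after sorting; average of the two middle scores when n is even."""
    s = sorted(data)
    n = len(s)
    mid = n // 2
    if n % 2 == 1:
        return s[mid]
    return (s[mid - 1] + s[mid]) / 2

print(median([65, 55, 89, 56, 35, 14, 56, 55, 87, 45, 92]))  # 56
print(median([65, 55, 89, 56, 35, 14, 56, 55, 87, 45]))      # 55.5
```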
MODE
The mode is the most frequent score in our data set. On a bar chart or histogram it corresponds to the highest bar. You can, therefore, sometimes consider the mode
as being the most popular option. An example of a mode is presented below:
EXAMPLES OF THE MODE
For example, in the following list of numbers, 16 is the mode since it appears more times
in the set than any other number:
A set of numbers can have more than one mode (this is known as bimodal if there are
two modes) if there are multiple numbers that occur with equal frequency, and more times
than the others in the set.
In the above example, both the number 3 and the number 16 are modes as they each
occur three times and no other number occurs more often.
If no number in a set of numbers occurs more than once, that set has no mode.
A set of numbers with two modes is bimodal, a set of numbers with three modes is trimodal, and a set of numbers with four or more modes is multimodal.
A mean can be determined for grouped data, or data that is placed in intervals. Unlike
listed data, the individual values for grouped data are not available, and you are not able
to calculate their sum. To calculate the mean of grouped data, the first step is to determine
the midpoint (also called a class mark) of each interval, or class. These midpoints must
then be multiplied by the frequencies of the corresponding classes. The sum of the
products divided by the total number of values will be the value of the mean.
In other words, the mean for a population can be found by dividing ∑xf by N, where x is the midpoint of the class and f is the frequency. As a result, the formula µ = ∑xf / N can be written to summarize the steps used to determine the value of the mean for a set of grouped data. If the set of data represents a sample instead of a population, the process remains the same, and the formula is written as 𝑥̅ = ∑xf / n.
Example:
Class Limits      Class Boundaries     Midpoint   freq
Lower   Upper     Lower     Upper      (x)        (f)    f<    f>    RF
1       7         0.5       7.5        4          11     11    60    0.18
8       14        7.5       14.5       11         18     29    49    0.30
15      21        14.5      21.5       18         14     43    31    0.23
22      28        21.5      28.5       25         7      50    17    0.12
29      35        28.5      35.5       32         3      53    10    0.05
36      42        35.5      42.5       39         4      57    7     0.07
43      49        42.5      49.5       46         3      60    3     0.05
                                       n = 60
𝑥̅ = ∑𝑥𝑓 / 𝑛 = 1059 / 60 = 17.65
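The grouped mean is a frequency-weighted average of the class midpoints, which takes only a few lines of Python (values copied from the table above):

```python
# Midpoints and frequencies from the frequency distribution table.
midpoints   = [4, 11, 18, 25, 32, 39, 46]
frequencies = [11, 18, 14, 7, 3, 4, 3]

n = sum(frequencies)   # 60
grouped_mean = sum(x * f for x, f in zip(midpoints, frequencies)) / n
print(round(grouped_mean, 2))  # 17.65
```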
𝑥̃ = 𝐿𝑜 + [(𝑛/2 − 𝐶𝐹<) / 𝑓] × 𝐶𝑊
This formula is used to find the median of grouped data with class intervals. The median is the value in the middle position of the set when the data are arranged in numerical order; the class containing that middle position is called the median class, and it is where the median is located.
where:
Lo is the lower class boundary of the group containing the median
n is the sum of frequencies
CF< is the cumulative frequency before the median class.
f is the frequency of the median group
CW is the class width
Example:
Class Limits      Class Boundaries     Midpoint   freq
Lower   Upper     Lower     Upper      (x)        (f)    f<    f>    RF
1       7         0.5       7.5        4          11     11    60    0.18
8       14        7.5       14.5       11         18     29    49    0.30
15      21        14.5      21.5       18         14     43    31    0.23
22      28        21.5      28.5       25         7      50    17    0.12
29      35        28.5      35.5       32         3      53    10    0.05
36      42        35.5      42.5       39         4      57    7     0.07
43      49        42.5      49.5       46         3      60    3     0.05
                                       n = 60

With n/2 = 30, the first class whose cumulative frequency reaches 30 is the class 15 - 21, so it is the median class (its CF< is 29).

𝑥̃ = 𝐿𝑜 + [(𝑛/2 − 𝐶𝐹<) / 𝑓] × 𝐶𝑊

𝑥̃ = 14.5 + [(60/2 − 29) / 14] × 7 = 15
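The class search and the interpolation formula above translate directly into Python (a sketch; names are ours):

```python
def grouped_median(lower_boundaries, frequencies, class_width):
    """Median of grouped data: locate the class whose cumulative frequency reaches n/2."""
    n = sum(frequencies)
    cum = 0                                   # CF< of the current class
    for lo, f in zip(lower_boundaries, frequencies):
        if cum + f >= n / 2:                  # median class found
            return lo + (n / 2 - cum) / f * class_width
        cum += f

boundaries  = [0.5, 7.5, 14.5, 21.5, 28.5, 35.5, 42.5]
frequencies = [11, 18, 14, 7, 3, 4, 3]
print(round(grouped_median(boundaries, frequencies, 7), 2))  # 15.0
```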
71
We can estimate the mode using the following formula:

𝑥̂ = 𝐿𝑚𝑜 + [(𝑓𝑚𝑜 − 𝑓1) / (2𝑓𝑚𝑜 − 𝑓1 − 𝑓2)] × 𝐶𝑊

where:
𝐿𝑚𝑜 is the lower class boundary of the modal class (the class with the highest frequency)
𝑓𝑚𝑜 is the frequency of the modal class
𝑓1 is the frequency of the class before the modal class
𝑓2 is the frequency of the class after the modal class
𝐶𝑊 is the class width

For the same table, the modal class is 8 - 14, so:

𝑥̂ = 7.5 + [(18 − 11) / (2(18) − 11 − 14)] × 7 = 11.95
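The same estimate in Python (a sketch; names are ours, and the neighbouring frequencies are taken as 0 at the ends of the table):

```python
def grouped_mode(lower_boundaries, frequencies, class_width):
    """Mode of grouped data via the modal-class formula."""
    i = frequencies.index(max(frequencies))                     # modal class
    f_mo = frequencies[i]
    f1 = frequencies[i - 1] if i > 0 else 0                     # class before
    f2 = frequencies[i + 1] if i < len(frequencies) - 1 else 0  # class after
    return lower_boundaries[i] + (f_mo - f1) / (2 * f_mo - f1 - f2) * class_width

boundaries  = [0.5, 7.5, 14.5, 21.5, 28.5, 35.5, 42.5]
frequencies = [11, 18, 14, 7, 3, 4, 3]
print(round(grouped_mode(boundaries, frequencies, 7), 2))  # 11.95
```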
Dispersion is the state of being dispersed or spread out. Statistical dispersion means the extent to which numerical data are likely to vary about an average value. In other words, dispersion helps us understand the distribution of the data.
MEASURES OF DISPERSION
In statistics, the measures of dispersion help to interpret the variability of data, i.e., to know how homogeneous or heterogeneous the data are. In simple terms, they show how squeezed or scattered the variable is.
TYPES OF MEASURES OF DISPERSION
There are two main types of dispersion measures in statistics: absolute measures and relative measures.
An absolute measure of dispersion carries the same unit as the original data set. Absolute dispersion methods express the variation in terms of the average deviation of the observations, such as the standard deviation or mean deviation. They include the range, standard deviation, quartile deviation, etc.
1. Range: It is simply the difference between the maximum value and the minimum
value given in a data set.
2. Variance: Subtract the mean from each value in the set, square each of the differences, add the squares, and divide the sum by the total number of values in the data set to obtain the variance.
𝜎² = ∑(𝑥 − 𝜇)²𝑓 / 𝑁
Sample Variance:
𝑠² = ∑(𝑥 − 𝑥̅)²𝑓 / (𝑛 − 1)
3. Standard Deviation: The square root of the variance is known as the standard
deviation.
𝜎 = √[∑(𝑥 − 𝜇)²𝑓 / 𝑁]
Sample Standard Deviation:
𝑠 = √[∑(𝑥 − 𝑥̅)²𝑓 / (𝑛 − 1)]
4. Quartiles and Quartile Deviation: The quartiles are values that divide a list of
numbers into quarters. The quartile deviation is half of the distance between the
third and the first quartile.
For Ungrouped Data:
First Quartile (Q1):
𝑄1 = (𝑛/4)th observation
Third Quartile (Q3):
𝑄3 = (3𝑛/4)th observation
Quartile Deviation (QD):
𝑄𝐷 = (𝑄3 − 𝑄1) / 2
For Grouped Data:
First Quartile (Q1):
𝑄1 = 𝐿𝑜 + [(𝑛/4 − 𝐶𝐹<) / 𝑓] × 𝐶𝑊
Third Quartile (Q3):
𝑄3 = 𝐿𝑜 + [(3𝑛/4 − 𝐶𝐹<) / 𝑓] × 𝐶𝑊
5. Mean Deviation: The mean deviation (also called the mean absolute deviation) is the mean of the absolute deviations of a set of data about the data's mean. For a sample of size n, the mean deviation for ungrouped data is

𝑀𝐷 = ∑|𝑥 − 𝑥̅| / 𝑛

and for grouped data

𝑀𝐷 = ∑|𝑥 − 𝑥̅|𝑓 / 𝑛
The relative measures of dispersion are used to compare the distributions of two or more data sets. These measures compare values without units. Common relative dispersion methods include:
1. Coefficient of Range
2. Coefficient of Variation
3. Coefficient of Standard Deviation
4. Coefficient of Quartile Deviation
5. Coefficient of Mean Deviation
COEFFICIENT OF DISPERSION
The coefficients of dispersion are calculated along with the measure of dispersion when
two series are compared which differ widely in their averages. The dispersion coefficient
is also used when two series with different measurement unit are compared. It is denoted
as C.D.
EXAMPLE:
Consider the following sample of eight observations: 3, 10, 12, 12, 17, 18, 18, 18
Sample Variance:

x     𝑥̅      (𝑥 − 𝑥̅)²
3     13.5    110.25
10    13.5    12.25
12    13.5    2.25
12    13.5    2.25
17    13.5    12.25
18    13.5    20.25
18    13.5    20.25
18    13.5    20.25
              ∑(𝑥 − 𝑥̅)² = 200

𝑠² = ∑(𝑥 − 𝑥̅)² / (𝑛 − 1) = 200 / 7 = 28.57
𝑠 = √28.57 = 5.345
Quartile Deviation:
For this data set of n = 8 values, Q1 is the (8/4) = 2nd observation, which is 10, and Q3 is the (3 × 8/4) = 6th observation, which is 18:

𝑄𝐷 = (18 − 10) / 2 = 4
Mean Deviation:
x 𝑥̅ |𝑥 − 𝑥̅ |
3 13.5 10.5
10 13.5 3.5
12 13.5 1.5
12 13.5 1.5
17 13.5 3.5
18 13.5 4.5
18 13.5 4.5
18 13.5 4.5
              ∑|𝑥 − 𝑥̅| = 34

𝑀𝐷 = ∑|𝑥 − 𝑥̅| / 𝑛 = 34 / 8 = 4.25
COEFFICIENT OF DISPERSION:
Coefficient of Range = (𝑥𝑚𝑎𝑥 − 𝑥𝑚𝑖𝑛) / (𝑥𝑚𝑎𝑥 + 𝑥𝑚𝑖𝑛) = (18 − 3) / (18 + 3) = 0.714
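The remaining ungrouped measures for the same eight observations can be sketched in Python; note that the n/4 and 3n/4 position rules give Q1 = 10 (2nd observation) and Q3 = 18 (6th observation):

```python
import math

data = sorted([3, 10, 12, 12, 17, 18, 18, 18])
n = len(data)

# Quartile positions n/4 and 3n/4, rounded up when fractional (1-based).
q1 = data[math.ceil(n / 4) - 1]        # 2nd observation -> 10
q3 = data[math.ceil(3 * n / 4) - 1]    # 6th observation -> 18
qd = (q3 - q1) / 2                     # quartile deviation

xbar = sum(data) / n                   # 13.5
md = sum(abs(x - xbar) for x in data) / n   # mean deviation

coeff_range = (max(data) - min(data)) / (max(data) + min(data))
print(qd, md, round(coeff_range, 3))  # 4.0 4.25 0.714
```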
Class Limits      Class Boundaries     Midpoint   freq
Lower   Upper     Lower     Upper      (x)        (f)    f<    f>    RF
1       7         0.5       7.5        4          11     11    60    0.18
8       14        7.5       14.5       11         18     29    49    0.30
15      21        14.5      21.5       18         14     43    31    0.23
22      28        21.5      28.5       25         7      50    17    0.12
29      35        28.5      35.5       32         3      53    10    0.05
36      42        35.5      42.5       39         4      57    7     0.07
43      49        42.5      49.5       46         3      60    3     0.05
                                       n = 60
𝑥̅ = 17.65
Range = Highest boundary – Lowest boundary = 49.5 - 0.5 = 49
Sample Variance:
𝑠² = ∑(𝑥 − 𝑥̅)²𝑓 / (𝑛 − 1) = 8077.65 / 59 = 136.9093
Standard Deviation:
𝑠 = √[∑(𝑥 − 𝑥̅)²𝑓 / (𝑛 − 1)] = √(8077.65 / 59) = 11.7
First Quartile (Q1):

𝑄1 = 𝐿𝑜 + [(𝑛/4 − 𝐶𝐹<) / 𝑓] × 𝐶𝑊 = 7.5 + [(60/4 − 11) / 18] × 7 = 9.06

Third Quartile (Q3):

𝑄3 = 𝐿𝑜 + [(3𝑛/4 − 𝐶𝐹<) / 𝑓] × 𝐶𝑊 = 21.5 + [(3(60)/4 − 43) / 7] × 7 = 23.5
Mean Deviation:

𝑀𝐷 = ∑|𝑥 − 𝑥̅|𝑓 / 𝑛 = 539.7 / 60 = 8.995
COEFFICIENT OF DISPERSION:
Coefficient of Range:
(𝑥𝑚𝑎𝑥 − 𝑥𝑚𝑖𝑛) / (𝑥𝑚𝑎𝑥 + 𝑥𝑚𝑖𝑛) = (49.5 − 0.5) / (49.5 + 0.5) = 0.98
QUANTILES
Sorting the values we have is a common first step in data analysis. When we sort values, normally in increasing order, we obtain the order statistics. Having the order statistics helps us locate data in the given data set, compare it with others in terms of ranking, or obtain an idea about the shape of the data distribution, since we can approximate where data are concentrated and where they are sparse. A particular kth entry in an increasing-order list is the kth order statistic, and approximately (k/n)(100%) of the observations in the list fall at or below the kth entry.
Quantiles refer to the set of n − 1 values of a variable that partition it into n equal proportions. If a set of data is divided into 2 equal proportions, then we have 2 − 1 = 1 value that partitions the data set, and this is the most common quantile, the median. The median, or the middle datum, has (1/2)(100%) = 50% of the observations falling at or below it. There may be different ways to partition a data set, but the most common are quartiles, deciles and percentiles.
The quartiles of a data set divide it into quarters, or 4 equal parts, which means every data set has three quartiles, named Q1, Q2, and Q3. Basically, the first quartile, Q1, is the value that has 25% of the data set at or below it and 75% of the data set at or above it. The second quartile, Q2, is the value that has 50% of the data set at or below and above it, which shows that Q2 is equal to the median. And Q3, the third quartile, is the value that separates the lower 75% of the data set from the upper 25%.
On the other hand, deciles, divide the data set into 10 equal parts using 9 partition data
or 9 deciles which are named as D1, D2, D3, ... , and D9. The first decile, D1, is the data
that has 10% of the data set at or below it and 90% of the data set at or above it while,
the second decile, D2, is the data that has 20% of the data set at or below it and the
remaining 80% of the data set at or above it. Note that D5 is the same as Q2 and the
median.
Lastly, percentiles, divide the data set into 100 equal parts using 99 partition data or 99
percentiles which are named as P1, P2, P3, ... , and P99. The first percentile, P1, is the data
that has 1% of the data set at or below it and 99% of the data set at or above it while, the
second percentile, P2, is the data that has 2% of the data set at or below it and 98% of
the data at or above it. It is worth noting that P50 = D5 = Q2 = 𝑥̃ and that other quantiles may also coincide, such as P10 = D1, and so on.
In practical cases, percentiles are also referred to as "quantiles", such that P1 is referred to as the 0.01 quantile, P25 as the 0.25 quantile, and so on.
To find a particular quantile in a set of ungrouped data, the following must be followed:
1. Arrange the data in increasing order to obtain the order statistics.
2. Locate the position of the quantile using the appropriate formula.
For an nth Quartile:
𝑄𝑛 = (𝑛𝑁/4)th observation
For an nth Decile:
𝐷𝑛 = (𝑛𝑁/10)th observation
For an nth Percentile:
𝑃𝑛 = (𝑛𝑁/100)th observation
Note: N represents the total number of observations.
3. If the formula gives an integer as an answer, simply locate the data in the order statistics. But if the formula gives a non-integer answer, round up to the next whole number and then locate the data in the order statistics.
Example:
The following observations represent the oxidation-induction time (min) for various
commercial oils:
First, rearrange the data set based on order of magnitude (smallest first):
Where N = 19
𝑄1 = (𝑛𝑁/4) = (1 × 19 / 4)th observation = 4.75th, rounded up to the 5th observation, which is 103

𝑄2 = (𝑛𝑁/4) = (2 × 19 / 4)th observation = 9.5th, rounded up to the 10th observation, which is 132, and this also represents the median

𝑄3 = (𝑛𝑁/4) = (3 × 19 / 4)th observation = 14.25th, rounded up to the 15th observation, which is 153

𝐷2 = (𝑛𝑁/10) = (2 × 19 / 10)th observation = 3.8th, rounded up to the 4th observation, which is 99
To solve for D5:
𝐷5 = (𝑛𝑁/10) = (5 × 19 / 10)th observation = 9.5th, rounded up to the 10th observation, which is 132 and is equal to Q2 and the median

𝑃20 = (𝑛𝑁/100) = (20 × 19 / 100)th observation = 3.8th, rounded up to the 4th observation, which is 99 and is equal to D2

𝑃50 = (𝑛𝑁/100) = (50 × 19 / 100)th observation = 9.5th, rounded up to the 10th observation, which is 132 and is equal to Q2, D5, and the median

𝑃75 = (𝑛𝑁/100) = (75 × 19 / 100)th observation = 14.25th, rounded up to the 15th observation, which is 153 and is equal to Q3.
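The locator-and-round-up rule used in all of these computations can be captured in one small Python helper (the function name is ours; d = 4 for quartiles, 10 for deciles, 100 for percentiles):

```python
import math

def quantile_position(k, d, n):
    """1-based position of the k-th d-quantile in n sorted observations."""
    pos = k * n / d
    # Integer positions are used as-is; fractional positions round up.
    return math.ceil(pos) if pos != int(pos) else int(pos)

n = 19                                # as in the oxidation-time example
print(quantile_position(1, 4, n))     # Q1:  4.75 rounds up to 5
print(quantile_position(2, 10, n))    # D2:  3.8  rounds up to 4
print(quantile_position(50, 100, n))  # P50: 9.5  rounds up to 10
```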
Since the actual values of observations may not be obtained if grouped data are given, the quartiles, deciles and/or percentiles may be approximated using the following formulas:

For an nth Quartile:

𝑄𝑛 = 𝐿𝑜 + [(𝑛𝑁/4 − 𝐶𝐹<) / 𝑓] × 𝐶𝑊

where:
𝐿𝑜 is the lower boundary of the class containing the nth quartile
𝑁 is the sum of frequencies or total observations
𝐶𝐹< is the cumulative frequency before the class containing the nth quartile
𝑓 is the frequency of the class containing the nth quartile
𝐶𝑊 is the class width
For an nth Decile:

𝐷𝑛 = 𝐿𝑜 + [(𝑛𝑁/10 − 𝐶𝐹<) / 𝑓] × 𝐶𝑊
where:
𝐿𝑜 is the lower boundary of the class containing the nth decile
𝑁 is the sum of frequencies or total observations
𝐶𝐹< is the cumulative frequencies before the class containing the nth decile
𝑓 is the frequency of the class containing the nth decile
𝐶𝑊 is the class width
For an nth Percentile:

𝑃𝑛 = 𝐿𝑜 + [(𝑛𝑁/100 − 𝐶𝐹<) / 𝑓] × 𝐶𝑊

where:
𝐿𝑜 is the lower boundary of the class containing the nth percentile
𝑁 is the sum of frequencies or total observations
𝐶𝐹< is the cumulative frequency before the class containing the nth percentile
𝑓 is the frequency of the class containing the nth percentile
𝐶𝑊 is the class width
Example:
Class Limits      Class Boundaries     Midpoint   freq
Lower   Upper     Lower     Upper      (x)        (f)    f<    f>    RF
1       7         0.5       7.5        4          11     11    60    0.18
8       14        7.5       14.5       11         18     29    49    0.30
15      21        14.5      21.5       18         14     43    31    0.23
22      28        21.5      28.5       25         7      50    17    0.12
29      35        28.5      35.5       32         3      53    10    0.05
36      42        35.5      42.5       39         4      57    7     0.07
43      49        42.5      49.5       46         3      60    3     0.05
                                       n = 60
Using the same grouped data, solve for the Q1, D7 and P85.
To solve for Q1:
𝑄1 = 𝐿𝑜 + [(𝑛𝑁/4 − 𝐶𝐹<) / 𝑓] × 𝐶𝑊

Take the value of nN/4 and locate which class the observation falls in. A quantile is considered to be within the first class whose CF< is greater than or equal to nN/4.

For Q1, nN/4 = 60/4 = 15, and the first CF< greater than 15 is 29, which is in the second class with limits 8 - 14; this class will be considered the class containing the first quartile.
Therefore:
𝑄1 = 𝐿𝑜 + [(𝑛𝑁/4 − 𝐶𝐹<) / 𝑓] × 𝐶𝑊 = 7.5 + [(15 − 11) / 18] × 7 = 9.06
To solve for D7:

𝐷7 = 𝐿𝑜 + [(𝑛𝑁/10 − 𝐶𝐹<) / 𝑓] × 𝐶𝑊

Take the value of nN/10 and locate which class the observation falls in.

For D7, nN/10 = (7)(60)/10 = 42, and the first CF< greater than 42 is 43, which is in the third class with limits 15 - 21; this class will be considered the class containing the seventh decile.
Therefore:
𝐷7 = 𝐿𝑜 + [(𝑛𝑁/10 − 𝐶𝐹<) / 𝑓] × 𝐶𝑊 = 14.5 + [(42 − 29) / 14] × 7 = 21
To solve for P85:

𝑃85 = 𝐿𝑜 + [(𝑛𝑁/100 − 𝐶𝐹<) / 𝑓] × 𝐶𝑊

Take the value of nN/100 and locate which class the observation falls in.

For P85, nN/100 = (85)(60)/100 = 51, and the first CF< greater than 51 is 53, which is in the fifth class with limits 29 - 35; this class will be considered the class containing the 85th percentile.
Therefore:
𝑃85 = 𝐿𝑜 + [(𝑛𝑁/100 − 𝐶𝐹<) / 𝑓] × 𝐶𝑊 = 28.5 + [(51 − 50) / 3] × 7 = 30.83
BOX-AND-WHISKER PLOTS
The box plot is a graphical display that has been used successfully to describe several of
a data set's most prominent features. These features include (1) center, (2) spread, (3)
the extent and nature of any departure from symmetry, and (4) identification of “outliers,”
observations that lie unusually far from the main body of the data. Because even a single
outlier can drastically affect the values of the mean and standard deviation, a box plot is
based on measures that are “resistant” to the presence of a few outliers—the median and
a measure of variability called the interquartile range.
A box plot is built from the five-number summary:
Lowest value
Lower quartile (Q1)
Median
Upper quartile (Q3)
Highest value
To make a box-and-whiskers plot:
Draw a vertical scale to include the highest and lowest values in the data set
Draw a box from Q1 to Q3. The horizontal width of the box is irrelevant while the
height of the box is exactly the IQR.
Up from the Q3, measure out a distance of 1.5 times the IQR and draw a so-called
whisker up to the highest value that lies within this distance, and put a horizontal
line.
Down from the Q1, measure out a distance of 1.5 times the IQR and draw a so-
called whisker to the lowest value that lies within this distance, and put a horizontal
line.
All observations beyond the whiskers may be marked with a small circle (∘) or an asterisk (*). A point more than 1.5 IQR but less than 3 IQR from the box edge is considered an outlier, while points more than 3 IQR from the box edge are considered extreme outliers.
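The whisker construction can be sketched numerically, without plotting (names are ours; the value 60 is a made-up observation added only to force an outlier):

```python
import math

def box_plot_fences(data):
    """Quartiles via the round-up position rule, plus the 1.5*IQR fences."""
    s = sorted(data)
    n = len(s)

    def value_at(p):                  # value at 1-based position p
        idx = math.ceil(p) if p != int(p) else int(p)
        return s[idx - 1]

    q1, q3 = value_at(n / 4), value_at(3 * n / 4)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = [x for x in s if x < lower or x > upper]
    return q1, q3, iqr, outliers

print(box_plot_fences([3, 10, 12, 12, 17, 18, 18, 18, 60]))  # (12, 18, 6, [60])
```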
Skewness and kurtosis are two different measures that characterize the shape of a data set's distribution. Skewness refers to the degree of symmetry, or departure from symmetry, of a set of data, while kurtosis is a measure of the degree to which a unimodal distribution is peaked.
SKEWNESS
A distribution is said to be right-skewed (or positively skewed) if the right tail seems to be
stretched from the center. A left-skewed (or negatively skewed) distribution is stretched
to the left side. A symmetric distribution has a graph that is balanced about its center, in
the sense that half of the graph may be reflected about a central line of symmetry to match
the other half.
The degree of skewness is measured by the coefficient of skewness, sk, where
𝑠𝑘 = 3(𝑀𝑒𝑎𝑛 − 𝑀𝑒𝑑𝑖𝑎𝑛) / 𝑆𝑡𝑎𝑛𝑑𝑎𝑟𝑑 𝐷𝑒𝑣𝑖𝑎𝑡𝑖𝑜𝑛
A distribution where the 𝑥̅ < 𝑥̃ is negatively skewed or skewed to the left. Negatively
skewed distributions have negative sk down to -3 and contains extremely low scores with
low frequency, but does not contain extreme high scores with corresponding low
frequency.
A distribution where the 𝑥̃ < 𝑥̅ is positively skewed or skewed to the right. Positively
skewed distributions have positive sk up to 3 and contains extremely high scores that
have low frequency, but does not contain extreme low scores with corresponding low
frequency.
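The coefficient can be computed directly. For the eight-value sample from the dispersion example (mean 13.5, median 14.5), sk comes out negative, consistent with a distribution skewed to the left (a sketch; the function name is ours):

```python
import math

def skewness_coefficient(data):
    """Pearson's coefficient sk = 3(mean - median) / s."""
    n = len(data)
    mean = sum(data) / n
    s = sorted(data)
    mid = n // 2
    median = s[mid] if n % 2 else (s[mid - 1] + s[mid]) / 2
    std = math.sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))
    return 3 * (mean - median) / std

sk = skewness_coefficient([3, 10, 12, 12, 17, 18, 18, 18])
print(round(sk, 3))  # -0.561
```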
KURTOSIS
The term kurtosis was introduced by Karl Pearson. This word literally means "the amount
of hump", and is used to represent the degree of peakedness or flatness of a unimodal
frequency curve. One measure of kurtosis is the percentile coefficient of kurtosis (ku)
which is defined by the formula
ku = (Q3 − Q1) / [2(P90 − P10)]

and whose value ranges from 0 to 0.5.
There are three categories of kurtosis that can be displayed by a set of data. All measures of kurtosis are compared against a standard normal distribution, or bell curve.
The first category of kurtosis is a mesokurtic distribution. This type of kurtosis is the most similar to a standard normal distribution in that it also resembles a bell curve. This type of kurtosis is considered normally distributed, and for normality the data must have ku = 0.263.
The second category is a leptokurtic distribution, which has fatter tails and a taller, narrower peak than a mesokurtic distribution; its ku values come out to be less than 0.263.
The final type of distribution is a platykurtic distribution. These types of distributions have slender tails and a peak that is smaller than that of a mesokurtic distribution. The prefix "platy-" means "broad," and it describes a short and broad-looking peak. With platykurtic distributions, the ku values come out to be greater than 0.263. Uniform distributions are platykurtic.
[Figure: mesokurtic, leptokurtic, and platykurtic curves]
CHEBYSHEV’S THEOREM
Chebyshev's theorem states that, for any k > 1, at least 1 − 1/k² of the data lies within k standard deviations of the mean.
Using this formula and plugging in the value 2, we get a resultant value of 1-1/22, which
is equal to 75%. This means that at least 75% of the data for a set of numbers lies within
2 standard deviations of the mean. The number could be greater. It could be all, 100%,
but it's guaranteed to be at least 75%. And this is what Chebyshev's theorem computes.
If we plug in 3 for k, then the resultant value is 88.89%. This means that at least 88.89%
of a data set lies within 3 standard deviations of the mean.
If we plug in 4 for k, then the resultant value is 93.75%. This means that at least 93.75% of a data set lies within 4 standard deviations of the mean.
Chebyshev's theorem is a great tool to find out approximately what percentage of a population lies within a certain number of standard deviations above or below the mean. It tells us the minimum percentage of the data set that must fall within that number of standard deviations.
For any chosen value of k (the number of standard deviations from the mean), the quantity 1 − 1/k² gives the minimum percentage of the data set that falls within k standard deviations of the mean.
at least ¾ of the data lie within two standard deviations of the mean, that is, in the
interval with endpoints 𝑥̅ ± 2𝑠 for samples and with endpoints μ±2σ for populations;
at least 8/9 of the data lie within three standard deviations of the mean, that is, in
the interval with endpoints 𝑥̅ ± 3𝑠 for samples and with endpoints μ±3σ for
populations;
at least 1−1/k2 of the data lie within k standard deviations of the mean, that is, in
the interval with endpoints 𝑥̅ ± 𝑘𝑠 for samples and with endpoints μ±kσ for
populations, where k is any positive whole number that is greater than 1.
It is important to pay careful attention to the words “at least” at the beginning of each of
the three parts of Chebyshev’s Theorem. The theorem gives the minimum proportion of
the data which must lie within a given number of standard deviations of the mean; the
true proportions found within the indicated regions could be greater than what the theorem
guarantees.
EXAMPLE
The East Coast Independent News periodically runs ads in its own classified section offering a month's free subscription to those who respond. In this way, management can get a sense of the number of subscribers who read the classified section each day. Over a period of 2 years, careful records have been kept. The mean number of responses per ad is 525 with a standard deviation of 30.
Determine the interval about the mean in which at least 88.9% of the data fall.
µ - 3σ to µ + 3σ
435 to 615
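The example can be verified with a short sketch (the function name is ours): Chebyshev's bound 1 − 1/k² with k = 3 gives at least 88.9%, over the interval µ ± 3σ.

```python
def chebyshev_interval(mu, sigma, k):
    """Minimum proportion within k standard deviations, and the interval itself."""
    bound = 1 - 1 / k ** 2
    return bound, (mu - k * sigma, mu + k * sigma)

bound, interval = chebyshev_interval(525, 30, 3)
print(round(bound, 4), interval)  # 0.8889 (435, 615)
```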
Activity:
1) The following scores represent the final examination grade for an elementary statistics
course:
23 60 79 32 57 74 52 70 82 36
80 77 81 95 41 65 92 85 55 76
52 10 64 75 78 25 80 98 81 67
41 71 83 54 54 72 88 62 74 43
60 78 89 76 76 48 84 90 15 79
34 67 17 82 82 74 63 80 85 61
For the ungrouped data, find the sample mean and median. Determine the range, variance, standard deviation, quartile deviation, mean deviation and the dispersion coefficients.
Set up a relative frequency distribution. Then, for the grouped data, find the sample mean and median, and determine the range, variance, standard deviation, quartile deviation, mean deviation and the dispersion coefficients.
2) The following data represent the length of life in years, measured to the nearest tenth,
of 30 similar fuel pumps:
2.0 3.0 0.3 3.3 1.3 0.4
0.2 6.0 5.5 6.5 0.2 2.3
1.5 4.0 5.9 1.8 4.7 0.7
4.5 0.3 1.5 0.5 2.5 5.0
1.0 6.0 5.6 6.0 1.2 0.2
For the ungrouped data, find the sample mean and median. Determine the range, variance, standard deviation, quartile deviation, mean deviation and the dispersion coefficients.
Set up a relative frequency distribution. Then, for the grouped data, find the sample mean and median, and determine the range, variance, standard deviation, quartile deviation, mean deviation and the dispersion coefficients.
3) The following data represent the length of life in seconds of 50 fruit flies to a new spray
in a controlled laboratory experiment.
17 20 10 9 23 13 12 19 18 24
12 14 6 9 13 6 7 10 13 7
16 18 8 13 3 32 9 7 10 11
13 7 18 7 10 4 27 19 16 8
7 10 5 14 15 10 9 6 7 15
For the ungrouped data, find the sample mean and median. Determine the range, variance, standard deviation, quartile deviation, mean deviation and the dispersion coefficients.
Set up a relative frequency distribution. Then, for the grouped data, find the sample mean and median, and determine the range, variance, standard deviation, quartile deviation, mean deviation and the dispersion coefficients.
Probability
________________________________________________
MODULE 2
Session 1
Elementary Probability Theory
Lecture:
The mathematical theory of probability gives us the basic tools for constructing and analyzing mathematical models for random phenomena. In studying a random phenomenon, we are dealing with an experiment whose outcome is not predictable in advance.
In science and engineering, random phenomena describe a wide variety of situations that can be grouped into two broad classes: physical or natural phenomena involving uncertainty, and those exhibiting variability. Since uncertainty and variability are present in our modeling of all real phenomena, it is only natural that probabilistic modeling and analysis occupy a central place in the study of a wide variety of topics in science and engineering.
PROBABILITY
The classical probability of an event is the number of outcomes favorable to the event divided by the total number of equally likely outcomes.
Example 1:
What is the probability of drawing one card from a standard deck of 52 cards and having
it be a king?
When you select a card from a deck, there are 52 possible outcomes, 4 of which are favorable (the four kings). Thus, the probability of drawing a king is 4/52 = 1/13.
Example 2:
Human eye color is controlled by a single pair of genes, one of which comes from the
mother and one of which comes from the father, called a genotype. Brown eye color, B,
is dominant over blue eye color, l. Therefore, in the genotype Bl, which consists of one
brown gene B and one blue gene l, the brown gene dominates. A person with a genotype
Bl will have brown eyes.
If both parents have genotype Bl, what is the probability that their child will have blue
eyes?
Considering every possible eye-color genotype for the child, the table shows the possible
outcomes:
              F
          B       l
  M  B   BB      Bl
     l   lB      ll

Since all four possible outcomes are equally likely to occur and blue eyes can only occur
on the ll genotype, there is only one favourable outcome for blue eyes. Thus, the
probability that the child has blue eyes is 1/4.
Example 3:
Find the probability of getting a non-defective bulb from 200 bulbs already examined,
where 12 bulbs are found to be defective.
P(non-defective bulb) = (200 − 12) non-defective bulbs / 200 bulbs = 188/200 = 47/50
PROPERTIES OF PROBABILITY
1. The probability of an event always lies in the range 0 to 1, that is, 0 ≤ P(Ei) ≤ 1.
2. The probability of an event that cannot occur (an impossible event) is 0.
3. The probability of an event that is certain to occur (a sure event) is 1.
4. The sum of the probabilities of all events (or final outcomes) for an experiment,
denoted by ∑ 𝑃(𝐸𝑖 ) , is always 1.
∑ 𝑷(𝑬𝒊 ) = 𝑷(𝑬𝟏 ) + 𝑷(𝑬𝟐 ) + 𝑷(𝑬𝟑 ) + ⋯ + 𝑷(𝑬𝒏 ) = 𝟏
To illustrate the last property, consider the experiment of tossing a coin twice, with the
four equally likely outcomes HH, HT, TH, and TT. Let E1 = the event of getting 2 heads,
E2 = the event of getting exactly 1 head, and E3 = the event of getting no head. Then
E1 = {HH}, E2 = {HT, TH}, E3 = {TT}, and
∑ P(Ei) = P(E1) + P(E2) + P(E3) = 1/4 + 2/4 + 1/4 = 1
Example 1:
Five of the 100 randomly delivered plastic connectors of assorted colors are found to be
green. What is the probability of getting the color green in the next delivery of connectors?
Let E denote the event of having the color green in the next delivery of connectors. Using
the relative frequency concept of probability with n = 100 and f = 5, then we obtain:
P(E) = P(next connector is green) = f/n = 5/100 = 0.05
Example 2:
All 275 employees of a factory were asked if they are smokers or non-smokers and
whether or not they are college graduates. Based on this information, the two-way
classification table below was prepared. If one employee is selected at random from this
factory, find the probability that this employee is (a) a college graduate; (b) a non-smoker.
a. Let E1 be the event that an employee is a college graduate. Then, since there are
105 graduates out of 275 employees, then
P(E1) = 105/275 ≈ 0.38
b. Let E2 be the event that an employee is a nonsmoker. Since there are 210
nonsmokers out of 275 employees, then
P(E2) = 210/275 ≈ 0.76
Activity
1. A mutual fund company offers its customers a variety of funds. The percentages of its
customers owning shares in just one of the following funds are:
Money-market 20%
Short bond 15%
Intermediate bond 10%
Long bond 5%
Moderate-risk stock 25%
High-risk stock 18%
Balanced 7%
Suppose an individual who owns shares in just one fund is selected at random.
a. What is the probability that the selected individual owns shares in the balanced
fund?
b. What is the probability that the individual owns shares in a bond fund?
c. What is the probability that the selected individual does not own shares in a
stock fund?
2. A hat contains 40 marbles, 16 of which are red and 24 are green. If one marble is
randomly selected out of this hat, what is the probability that this marble is a) red?
b) green?
5. The following table gives a two-way classification of all the scientists and
engineers. If one person is selected at random from these scientists and
engineers, find the probability that this person is a) an engineer? b) a male?
Male Female
Scientist 58 79
Engineer 64 37
Session 2
Sample Spaces and Relationship Among Events
1. Understand and describe sample spaces and events for random experiments
with graphs, tables, lists, or tree diagrams
2. Identify the relationships existing between events and differentiate them
Lecture:
RANDOM EXPERIMENTS
An experiment that can result in different outcomes, even though it is repeated in the
same manner every time, is called a random experiment.
SAMPLE SPACES
Sample spaces are simply sets whose elements describe all the possible outcomes of
the random experiment in which we are interested. The sample space is denoted as S.
A sample space is often defined based on the objectives of the analysis. It is useful to
distinguish between two types of sample spaces: discrete and continuous. A sample
space is discrete if it consists of a finite or countable infinite set of outcomes. A sample
space is continuous if it contains an interval (either finite or infinite) of real numbers.
Example 1:
Consider an experiment in which you select a molded plastic part, such as a connector,
and measure its thickness. The possible values for thickness depend on the resolution of
the measuring instrument, and they also depend on upper and lower bounds for
thickness. However, it might be convenient to define the sample space as simply the
positive real line
𝑆 = 𝑅+ = {𝑥 |𝑥 > 0}
because a negative value for thickness cannot occur. This sample space is continuous.
If it is known that all connectors will be between 10 and 11 millimeters thick, the continuous
sample space could be
S = {x | 10 < x < 11}
If the objective of the analysis is to consider only whether a particular part is low, medium,
or high for thickness, the sample space might be taken to be the set of three outcomes
S = {low, medium, high}
If the objective of the analysis is to consider only whether or not a particular part conforms
to the manufacturing specifications, the discrete sample space might be simplified to the
set of two outcomes
𝑆 = {𝑦𝑒𝑠, 𝑛𝑜}
Example 2:
In random experiments in which items are selected from a batch, we will indicate whether
or not a selected item is replaced before the next one is selected.
If a batch consists of three items {a, b, c} and our experiment is to select two items without
replacement, the sample space can be represented as
Swithout = {ab, ac, ba, bc, ca, cb}
If the items are replaced before the next one is selected, the sampling is referred to as
with replacement. Then the possible ordered outcomes are
𝑆𝑤𝑖𝑡ℎ = {𝑎𝑎, 𝑎𝑏, 𝑎𝑐, 𝑏𝑎, 𝑏𝑏, 𝑏𝑐, 𝑐𝑎, 𝑐𝑏, 𝑐𝑐}
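The two ordered sample spaces above can be enumerated directly with Python's itertools (a sketch; the set names mirror the notation used here):

```python
from itertools import permutations, product

batch = ['a', 'b', 'c']

# Ordered selection of two items without replacement
s_without = {''.join(p) for p in permutations(batch, 2)}

# Ordered selection of two items with replacement
s_with = {''.join(p) for p in product(batch, repeat=2)}

print(sorted(s_without))  # ['ab', 'ac', 'ba', 'bc', 'ca', 'cb']
print(sorted(s_with))     # ['aa', 'ab', 'ac', 'ba', 'bb', 'bc', 'ca', 'cb', 'cc']
```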
Example 3:
Sometimes it is not necessary to specify the exact item selected, but only a property of
the item.
Suppose that there are 5 defective parts and 95 good parts in a batch. To study the quality
of the batch, two are selected without replacement. Let g denote a good part and d denote
a defective part. It might be sufficient to describe the sample space (ordered) in terms of
the quality of each part selected as
S = {gg, gd, dg, dd}
Also, if there were only one defective part in the batch, there would be fewer possible
outcomes:
S = {gg, gd, dg}
TREE DIAGRAMS
An effective way of graphically describing a sample space is through the use of tree
diagrams. When a sample space can be constructed in several steps or stages, we can
represent each of the 𝑛1 ways of completing the first step as a branch of a tree. Each of
the ways of completing the second step can be represented as 𝑛2 branches starting from
the ends of the original branches, and so forth.
Example:
If the sample space consists of the set of all possible vehicle types, what is the number
of outcomes in the sample space?
Per the tree diagram, the sample space contains 48 outcomes.
EVENTS
We can also be interested in describing new events from combinations of existing events.
Because events are subsets, we can use basic set operations such as unions,
intersections, and complements to form other events of interest. Some of the basic set
operations are summarized below in terms of events:
The union of two events A and B, denoted by the symbol A ∪ B, is the event
containing all the elements that belong to A or B or both.
The intersection of two events A and B, denoted by the symbol A ∩ B, is the event
containing all the elements that are common to A and B.
Let V = {a, e, i, o, u} and C = {l, r, s, t}; then it follows that V ∩ C = φ. That
is, V and C have no elements in common and, therefore, cannot both
simultaneously occur.
For certain statistical experiments it is by no means unusual to define two events, A and
B, which cannot both occur simultaneously. The events A and B are then said to be
mutually exclusive or disjoint, if A ∩ B = φ, that is, if A and B have no elements in
common.
The complement of an event A with respect to S, denoted by A’, is the subset of all
elements of S that are not in A.
Illustration: Let R be the event that a red card is selected from an ordinary deck of 52
playing cards and let S be the entire deck. Then R’ is the event that the card
selected from the deck is not a red card but a black card.
Example:
Consider the sample space S = {copper, sodium, nitrogen, potassium, uranium, oxygen,
zinc} and the events
A = {copper, sodium, zinc}
B = {sodium, nitrogen, potassium}
C = {oxygen}
VENN DIAGRAMS
Diagrams are often used to portray relationships between sets, and these diagrams are
also used to describe relationships between events. We can use Venn diagrams to
represent a sample space and events in a sample space.
The sample space of the random experiment is represented as the points in the rectangle
S. The events A and B are the subsets of points in the indicated regions.
(Figure: Venn diagrams shading the regions A ∩ B ∩ C and (A ∩ C)’.)
• If two events have no outcomes in common, they are called mutually exclusive
events. The intersection of two such events is a null set.
Activity
1. Suppose that vehicles taking a particular freeway exit can turn right (R), turn left
(L), or go straight (S). Consider observing the direction for each of three successive
vehicles.
a. List all outcomes in the event A that all three vehicles go in the same direction.
b. List all outcomes in the event B that all three vehicles take different directions.
c. List all outcomes in the event C that exactly two of the three vehicles turn right.
d. List all outcomes in the event D that exactly two vehicles go in the same
direction.
e. List outcomes in D', C∪D, and C∩D.
Session 3
COUNTING RULES USEFUL IN PROBABILITY
Lecture:
PRINCIPLE OF COUNTING
If a choice consists of two steps, of which the first can be made in n1 ways and the second
can be made in n2 ways, then the whole choice can be made in n1 × n2 ways.
Example 1:
In the design of a casing for a gear housing, we can use four different types of fasteners,
three different bolt lengths, and three different bolt locations. How many casing designs
are possible?
4 × 3 × 3 = 36 casing designs
Example 2:
In a medical study, patients are classified in 8 ways according to whether they have blood
type AB+, AB-, A+, A-, B+, B-, O+, or O-, and also according to whether their blood
pressure is low, normal, or high. In how many ways can a patient be classified?
8 × 3 = 24 classifications
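The principle of counting amounts to forming the Cartesian product of the choice sets, which itertools.product makes explicit (a sketch for Example 2; the lists come from the example above):

```python
from itertools import product

blood_types = ['AB+', 'AB-', 'A+', 'A-', 'B+', 'B-', 'O+', 'O-']
pressures = ['low', 'normal', 'high']

# Every classification is one (blood type, pressure) pair
classifications = list(product(blood_types, pressures))
print(len(classifications))  # 24
```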
Example 3:
A developer of a new subdivision offers buyers a choice of 4 exterior styles and 3 floor
plans. In how many different ways can a buyer order one of these homes?
4 × 3 = 12 possible homes
PERMUTATIONS
A permutation is an arrangement of all or part of a set of distinct objects, where the order
of arrangement matters. The number of permutations of n distinct objects taken all at a
time is
𝒏𝑷𝒏 = 𝒏!
Illustration:
How many permutations can be made from three distinct objects if all are taken at a time?
𝑛! = 3! = 3 × 2 × 1 = 6 permutations
The number of permutations of n distinct objects taken r at a time is
𝒏𝑷𝒓 = 𝒏!/(𝒏 − 𝒓)!
Illustration:
In the game of Jai Alai, players are numbered from 1 to 10. The first player to score 9
points will be the first placer and the next two highest scorers will be declared 2nd and 3rd
placers. Each bet must consist therefore of three numbers corresponding to the player
number of the 1st placer, 2nd placer, and the 3rd placer respectively. Neglecting the
possibility of not having a 2nd and 3rd placer, how many bets are possible?
𝑛𝑃𝑟 = 10𝑃3 = 10!/(10 − 3)! = 720 bets
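Python's math module provides perm for exactly this count (Python 3.8+); a quick check of the Jai Alai computation:

```python
from math import perm, factorial

# 10P3: ordered top-three finishes among 10 players
bets = perm(10, 3)
print(bets)  # 720

# Same result from the definition nPr = n!/(n - r)!
assert bets == factorial(10) // factorial(10 - 3)
```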
The number of distinct permutations of n things taken all at a time, of which some are
identical, is
𝒏𝑷𝒏𝒊 = 𝒏!/(𝒊𝟏! 𝒊𝟐! ⋯)
where 𝑖1 = number of type 1 identical things
𝑖2 = number of type 2 identical things
Illustration:
How many permutations can you make out of the letters of the word PHILIPPINES if all
letters are taken at a time?
Since the 11 letters of PHILIPPINES include 3 P’s and 3 I’s,
𝑛𝑃𝑛𝑖 = 11!/(3! 3!) = 1,108,800 permutations
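The repeated-letters formula can be verified by dividing 11! by the factorial of each letter's multiplicity (a sketch using collections.Counter):

```python
from math import factorial
from collections import Counter

word = 'PHILIPPINES'
counts = Counter(word)   # P and I each occur 3 times; the rest occur once

n_perms = factorial(len(word))
for c in counts.values():
    n_perms //= factorial(c)   # divide out indistinguishable rearrangements

print(n_perms)  # 1108800
```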
The number of circular permutations of n distinct objects is
𝒏𝑷𝒄𝒏 = (𝒏 − 𝟏)!
Illustration:
In how many ways can three couples be seated at a circular table of 6 seats if one person
occupies a particular position?
𝑛𝑃𝑐𝑛 = (6 − 1)! = 120 ways
For circular permutations where some things are identical,
𝒏𝑷𝒄𝒏 = (𝒏 − 𝟏)!/(𝒊𝟏! 𝒊𝟐! ⋯)
where the given formula may only be applied provided that at least one thing is distinct.
Illustration:
In how many ways can you arrange 6 chairs in a circular position if 2 of which are
identical?
𝑛𝑃𝑐𝑛 = (6 − 1)!/2! = 60 ways
Illustration:
How many circular permutations of 4 objects can you make out of 6 different objects?
The number of circular permutations of r objects taken from n different objects is 𝑛𝑃𝑟/𝑟,
so we have 6𝑃4/4 = 360/4 = 90 circular permutations.
The number of permutations of distinct objects arranged in G groups, where the members
of each group must stay together, is
𝑷𝑮 = 𝑮! 𝒏𝟏! 𝒏𝟐! ⋯
where n1, n2, … are the numbers of objects in each group.
Illustration:
In how many ways can you line up 4 marines and 6 army soldiers if all the marines must
be side by side with each other, and likewise the army soldiers?
𝑃𝐺 = 2! 4! 6! = 34,560 ways
In the last problem, what if all the marines must be together while the army soldiers need
not necessarily be together; how many ways are possible?
𝑃𝐺 = 7! 4! = 120,960 ways
COMBINATIONS
A combination also concerns arrangements, but without regard to the order. This means
that the order or arrangement in which the elements are taken is not important. For
example, ab will be the same as ba, abc is the same arrangement as acb or bac. Also,
the number of ways in the combination of the letters a, b, and c taken 2 at a time is 3, and
these are ab, ac and bc. We note that ab = ba, ac = ca and bc = cb.
The number of combinations of n distinct objects taken r at a time is
𝒏𝑪𝒓 = 𝒏𝑷𝒓/𝒓! = 𝒏!/(𝒓! (𝒏 − 𝒓)!)
Illustration:
How many number combinations are there in Lotto 6/49? (Note that each combination
consists of 6 different numbers selected from 1 to 49 wherein the arrangement is not
important)
49𝐶6 = 49𝑃6/6! = 49!/(6! (49 − 6)!) = 13,983,816 combinations
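A one-line check with math.comb, together with the identity nCr = nPr/r! used above (Python 3.8+):

```python
from math import comb, perm, factorial

combos = comb(49, 6)
print(combos)  # 13983816

# Same count from ordered selections divided by the 6! orderings of each ticket
assert combos == perm(49, 6) // factorial(6)
```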
A semiconductor company will hire 7 men and 4 women. In how many ways can the
company choose from 9 men and 6 women who qualified for the positions?
9𝐶7 × 6𝐶4 = 36 × 15 = 540 ways
The number of ways of selecting one or more objects from n distinct objects is
𝑪 = 2^𝒏 − 𝟏
Illustration:
In how many ways can you fill a box with book(s) if you can choose from 6 different books?
𝐶 = 2^6 − 1 = 63 ways
PARTITION
A partition is defined as the division of n things into smaller groups of sizes g1, g2, …,
and the number of ways of partitioning is given by the formula
𝐷 = [𝑛!/(𝑔1! 𝑔2! ⋯)] × [1/(𝑟1! 𝑟2! ⋯)]
where r1, r2, … are the numbers of groups having the same size, since groups of equal
size are interchangeable.
Illustration:
In how many ways can you divide 17 students into 5 committees, namely: 2 committees
with 2 members each, 3 committees with 3 members each and the last, a 4-member
committee?
𝐷 = [17!/(2! 2! 3! 3! 3! 4!)] × [1/(2! 3! 1!)] = 1,429,428,000 ways
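As a cross-check of the committee count with the standard library (a sketch): the multinomial quotient counts the division when the committees are treated as distinguishable; dividing further by r1! r2! ⋯ treats same-size committees as interchangeable.

```python
from math import factorial

sizes = [2, 2, 3, 3, 3, 4]          # committee sizes; 17 students in total
assert sum(sizes) == 17

# Multinomial coefficient: committees treated as distinguishable
d = factorial(17)
for g in sizes:
    d //= factorial(g)
print(d)  # 17153136000

# Same-size committees interchangeable: divide by r1! r2! ... =
# (two size-2 committees)! * (three size-3 committees)! * (one size-4 committee)!
d_unordered = d // (factorial(2) * factorial(3) * factorial(1))
print(d_unordered)  # 1429428000
```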
Supplementary Examples
1. In a city, the bus route numbers consist of a natural number less than 100, followed
by one of the letters A, B, C, D, E and F. How many different bus routes are
possible?
2. There are 3 questions in a question paper. If the questions have 4, 3 and 2 solutions
respectively, find the total number of solutions.
3. In how many ways can an animal trainer arrange 5 lions and 4 tigers in a row so
that no two lions are together?
4. In how many ways can the letters of the word LEADING be arranged in such a
way that the vowels always come together?
5. In how many ways can 4 girls and 5 boys be arranged in a row so that all the girls
are together?
6. 12 points lie on a circle. How many cyclic quadrilaterals can be drawn by using
these points?
7. In a box, there are 5 black pens, 3 white pens and 4 red pens. In how many ways
can 2 black pens, 2 white pens and 2 red pens be chosen?
8. A question paper consists of 10 questions divided into two parts A and B. Each
part contains five questions. A candidate is required to attempt six questions in all
of which at least 2 should be from part A and at least 2 from part B. In how many
ways can the candidate select the questions if he can answer all questions equally
well?
9. A committee of 5 persons is to be formed from 6 men and 4 women. In how many
ways can this be done when:
a. At least 2 women are included?
b. At most 2 women are included?
10. The Indian Cricket team consists of 16 players. It includes 2 wicket keepers and 5
bowlers. In how many ways can a cricket eleven be selected if we have to select
1 wicket keeper and at least 4 bowlers?
Activity
1. An order for a personal digital assistant can specify any one of five memory sizes,
any one of three types of displays, any one of four sizes of a hard disk, and can
either include or not include a pen tablet. How many different systems can be
ordered?
2. In a manufacturing operation, a part is produced by machining, polishing, and
painting. If there are three machine tools, four polishing tools, and three painting
tools, how many different routings (consisting of machining, followed by polishing,
and followed by painting) for a part are possible?
3. New designs for a wastewater treatment tank have proposed three possible
shapes, four possible sizes, three locations for input valves, and four locations for
output valves. How many different product designs are possible?
Session 4
APPROACHES TO AND RULES OF PROBABILITY
1. Identify the different events that exist in probability problems and differentiate
them from one another when applicable.
2. Apply the different probability rules when computing probabilities.
Lecture:
APPROACHES TO PROBABILITY
1. Marginal Probability
2. Conditional Probability
MARGINAL PROBABILITY
Marginal probability is the probability of a single event without consideration of any other
event. Marginal probability is also called simple probability.
Suppose 40 students in a certain school were asked whether they are in favor of
or against the wearing of school uniform. The table below gives a two-way classification
of the responses of these 40 students.
In Favor Against
Male 6 9
Female 17 8
As we can observe, the probability that a male will be selected is obtained by dividing the
total of the row labeled “Male” (15) by the grand total (40). That is, P(male) = 15/40 =
0.375.
Similarly,
P(female) = number of females / total number of students = 25/40 = 0.625
P(in favor) = number in favor / total number of students = 23/40 = 0.575
P(against) = number against / total number of students = 17/40 = 0.425
Let I = the event of having the opinion “in favor” and M = the event that a male student is
selected. Then P(I) and P(M) are marginal probabilities.
CONDITIONAL PROBABILITY
Conditional probability is the probability that an event will occur given that another
event has already occurred. If A and B are two events, then the conditional probability of
A is written as P(A/B) and read as “the probability of A given that B has already
occurred.”
𝑃(𝐴/𝐵) = 𝑃(𝐴 ∩ 𝐵)/𝑃(𝐵)
Suppose we are being asked to compute for P(in favor/male) in the two-way classification
of the responses of 40 students regarding the wearing of school uniform.
P(in favor/male) = P(in favor ∩ male)/P(male) = (6/40)/(15/40) = 6/15 = 0.40
Similarly, the probability that a randomly selected student is against (given that this
student is female) is
P(against/female) = P(against ∩ female)/P(female) = (8/40)/(25/40) = 8/25 = 0.32
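Marginal and conditional probabilities from a two-way table reduce to row, column, and cell counts, as the following sketch shows (the dictionary encodes the 40-student table above; the function names are ours):

```python
# Two-way classification of the 40 students (counts from the table above)
table = {
    ('male', 'in favor'): 6,    ('male', 'against'): 9,
    ('female', 'in favor'): 17, ('female', 'against'): 8,
}
grand_total = sum(table.values())       # 40

def marginal(value):
    # P(value) = total count matching the row or column label / grand total
    return sum(n for key, n in table.items() if value in key) / grand_total

def conditional(opinion, gender):
    # P(opinion | gender) = joint count / total count for that gender
    gender_total = sum(n for (g, _), n in table.items() if g == gender)
    return table[(gender, opinion)] / gender_total

print(marginal('male'))                  # 0.375
print(marginal('in favor'))              # 0.575
print(conditional('in favor', 'male'))   # 0.4
print(conditional('against', 'female'))  # 0.32
```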
EVENTS ON PROBABILITY
The following events are commonly observed when solving probability problems:
1. Mutually Exclusive Events
Events that cannot occur together in a single trial of an experiment are said to be
mutually exclusive events.
2. Independent Events
Two events are said to be independent if the occurrence of one does not affect the
probability of the occurrence of the other.
3. Complementary Events
The complement of an event A, denoted by A’, is the event that includes all the
outcomes for an experiment that are not in A.
Because two complementary events, taken together, include all the outcomes for
an experiment and because the sum of the probabilities of all outcomes is 1, it is
obvious that P(A) + P(A’) = 1.
PROBABILITY RULES
MULTIPLICATION RULE
The probability that events A and B can happen together or the probability of the
intersection of two events A and B is called the joint probability of A and B and is
written as P(A and B).
The probability of the intersection of two events is obtained by multiplying the marginal
probability of one event by the conditional probability of the second event. This rule is
called the multiplication rule.
The probability of the intersection of two events A and B or the joint probability of
A and B is
P(A and B) = P(A) P(B/A) = P(B) P(A/B)
Example 1:
A lot contains 40 batteries, 4 of which are defective. If 2 batteries are selected at random
(without replacement) from the lot, what is the probability that
a) both are defective?
b) both are nondefective?
c) the first is defective but the second is not?
d) If two batteries are selected out of 40 with replacement where 4 are defective, what
is the probability that both are defective?
a) Let B be the event that the first battery selected is defective and D the event that
the second battery selected is defective. To calculate the probability P(D/B), we
know that the first battery selected is defective because B has already occurred.
Because the selections are made without replacement, there are 39 total batteries
and 3 of them are defective at the time of the second selection. Therefore,
P(D/B) = 3/39
Hence, the required probability is
P(B and D) = P(B) P(D/B) = (4/40)(3/39) = 1/130 ≈ 0.0077
b) Let the dependent events A and C = the event of selecting 2 non-defective (or
good) batteries. Since there are 36 good batteries, then
P(A and C) = (36/40)(35/39) = 21/26 ≈ 0.8077
c) Let the dependent events B and C = the event of selecting the first battery
(defective) and the second battery (nondefective), then
P(B and C) = (4/40)(36/39) = 6/65 ≈ 0.0923
d) Let the independent events B and D = the event of selecting 2 defective batteries.
Then
P(B and D) = (4/40)(4/40) = 1/100 = 0.01
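All four battery probabilities follow from the multiplication rule; exact fractions avoid rounding (a sketch, with the lot sizes taken from the example and our own variable names):

```python
from fractions import Fraction

N, D = 40, 4          # batteries in the lot, defectives
G = N - D             # nondefective (good) batteries

# Without replacement: the second draw depends on the first
both_defective   = Fraction(D, N) * Fraction(D - 1, N - 1)   # part (a)
both_good        = Fraction(G, N) * Fraction(G - 1, N - 1)   # part (b)
first_def_2nd_ok = Fraction(D, N) * Fraction(G, N - 1)       # part (c)

# With replacement: the two draws are independent
both_def_repl = Fraction(D, N) * Fraction(D, N)              # part (d)

print(both_defective)    # 1/130
print(both_good)         # 21/26
print(first_def_2nd_ok)  # 6/65
print(both_def_repl)     # 1/100
```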
Example 2:
The probability that a patient is allergic to penicillin is 0.30. Suppose the drug is
administered to three patients.
a) Find the probability that all three of them are allergic to it.
b) Find the probability that at least one of them is not allergic to it.
Let A, B and C denote the events that the first, second, and third patients are allergic to
penicillin, respectively.
a) We are to find the joint probability of A, B, and C. All three events are independent
because whether or not one patient is allergic does not depend on whether or not
any of the other patients is allergic. Hence,
P(A and B and C) = (0.30)(0.30)(0.30) = 0.027
b) Let us define the following events: D = all three patients are allergic; E = at least
one patient is not allergic.
“At least one patient is not allergic to penicillin” means that one, two, or all three of
the patients are not allergic. Thus, we merely exclude the possibility of all three
patients being allergic to penicillin. Notice that events D and E are two
complementary events. Event D consists of the intersection of events A, B, and C.
Hence, from part (a),
P(D) = P(A and B and C) = 0.027
Therefore, P(E) = 1 − P(D) = 1 − 0.027 = 0.973
ADDITION RULE
The method used to calculate the probability of the union of events is called the addition
rule. It is defined as follows:
P(A or B) = P(A) + P(B) − P(A and B)
Thus, to calculate the probability of the union of two events A and B, we add their
marginal probabilities and subtract their joint probability from this sum. We must subtract
the joint probability of A and B from the sum of their marginal probabilities to avoid
double counting due to common outcomes in A and B.
Example 1:
In a college graduating class of 100 students, 58 studied math, 70 studied history, and 30
studied both math and history. If one student is selected at random, find the probability
that
a) the student takes math or history;
b) the student does not take either of these subjects;
c) the student takes history but not math.
In the Venn diagram that depicts the given situation, let M be event that a student takes
Math and H be the event that a student takes History.
Since there are 30 who studied both subjects and math has 58 students in all, then there
are 58 – 30 = 28 students who studied math alone. Also, there are 70 – 30 = 40 students
who studied history alone.
a) Let M and H be the events that the student studied math and history respectively,
then
P(M) = 58/100 = 0.58 and P(H) = 70/100 = 0.70.
P(M or H) = P(M) + P(H) – P(M and H) = 58/100 + 70/100 – 30/100 = 98/100 or 0.98
b) Let (M or H)’ denote the event that the student did not study any of the two
subjects. Then using the complementary event rule, we have
P[(M or H)’] = 1 – P(M or H) = 1 – 0.98 = 0.02
c) Let A be the event that the student takes history only but not math. Then based
on the Venn diagram, we have
P(A) = 40/100 = 0.40
Example 2:
The table below lists the history of 940 wafers in a semiconductor manufacturing process.
Suppose one wafer is selected at random. Let H denote the event that the wafer contains
high levels of contamination, and let C denote the event that the wafer is in the center of
a sputtering tool.
Since P(H) = 358/940 and P(C) = 626/940, and 112 of the wafers both contain high levels
of contamination and are in the center, so that P(H ∩ C) = 112/940, the probability that
the wafer selected contains high levels of contamination or is in the center of a sputtering
tool is
P(H ∪ C) = 358/940 + 626/940 – 112/940 = 872/940 ≈ 0.93
Example 3:
The Cost Less Clothing Store carries seconds in slacks. If you buy a pair of slacks in your
regular waist size without trying them on, the probability that the waist will be too tight is
0.30 and the probability that it will be too loose is 0.10.
If you choose a pair of slacks at random in your regular waist size, what is the probability
that the waist will be too tight or too loose?
Based on the given in the problem,
P(too tight or too loose) = P(too tight) + P(too loose) = 0.30 + 0.10 = 0.40
Example:
Suppose that in a semiconductor manufacturing, the probability is 0.10 that a chip that is
subjected to high levels of contamination during manufacturing causes a product failure.
The probability is 0.005 that a chip that is not subjected to high contamination levels
during manufacturing causes a product failure. In a particular production run, 20% of the
chips are subject to high levels of contamination. What is the probability that a product
using one of these chips fails?
Let F denote the event that the product fails and H denote the event that the chip is
exposed to high levels of contamination.
P(F) = P(F/H) P(H) + P(F/H’) P(H’) = (0.10)(0.20) + (0.005)(0.80) = 0.024
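This is the total probability rule, P(F) = P(F/H)P(H) + P(F/H’)P(H’); note that (0.20)(0.10) + (0.80)(0.005) = 0.024. A quick check (variable names are ours):

```python
# Total probability rule: P(F) = P(F|H) P(H) + P(F|H') P(H')
p_h = 0.20                   # chip subjected to high contamination
p_f_given_h = 0.10           # failure rate for highly contaminated chips
p_f_given_not_h = 0.005      # failure rate for the rest

p_f = p_f_given_h * p_h + p_f_given_not_h * (1 - p_h)
print(round(p_f, 4))  # 0.024
```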
BAYES' THEOREM
If we have two events A and B and we are given the conditional probability of A given B
known as P(A/B), we can use Bayes’ Theorem to find P(B/A) using the formula
𝑷(𝑩/𝑨) = 𝑷(𝑩) 𝑷(𝑨/𝑩) / [𝑷(𝑩) 𝑷(𝑨/𝑩) + 𝑷(𝑩′) 𝑷(𝑨/𝑩′)]
Example:
In a factory, there are two machines manufacturing bolts. Machine A manufactures 75%
of the bolts and the Machine B manufactures the remaining 25%. From machine A, 5%
of the bolts are defective and from machine B, 8% of the bolts are defective. A bolt is
selected at random, what is the probability that the bolt came from machine A, given that
it is defective?
Let A be the event that the bolt came from machine A, B the event that it came from
machine B, and D the event that the selected bolt is defective. Then,
𝑃(𝐴/𝐷) = 𝑃(𝐴) 𝑃(𝐷/𝐴) / [𝑃(𝐴) 𝑃(𝐷/𝐴) + 𝑃(𝐴′) 𝑃(𝐷/𝐴′)]
𝑃(𝐴/𝐷) = 𝑃(𝐴) 𝑃(𝐷/𝐴) / [𝑃(𝐴) 𝑃(𝐷/𝐴) + 𝑃(𝐵) 𝑃(𝐷/𝐵)]
𝑃(𝐴/𝐷) = (0.75)(0.05) / [(0.75)(0.05) + (0.25)(0.08)] = 0.6522
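The same computation as a short sketch (variable names are ours); the denominator is the total probability of a defective bolt:

```python
# Bayes' theorem for the two-machine bolt example
p_a, p_b = 0.75, 0.25        # shares of bolts from machines A and B
p_d_given_a = 0.05           # defective rate of machine A
p_d_given_b = 0.08           # defective rate of machine B

p_d = p_a * p_d_given_a + p_b * p_d_given_b   # total probability of a defect
p_a_given_d = p_a * p_d_given_a / p_d
print(round(p_a_given_d, 4))  # 0.6522
```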
Activity
1. A box contains six 40-W bulbs, five 60-W bulbs, and four 75-W bulbs. If bulbs are
selected one by one in random order, what is the probability that at least two bulbs
must be selected to obtain one that is rated 75 W?
2. Human visual inspection of solder joints on printed circuit boards can be very
subjective. Part of the problem stems from the numerous types of solder defects
(e.g., pad nonwetting, knee visibility, voids) and even the degree to which a joint
possesses one or more of these defects. Consequently, even highly trained
inspectors can disagree on the disposition of a particular joint. In one batch of
10,000 joints, inspector A found 724 that were judged defective, inspector B found
751 such joints, and 1159 of the joints were judged defective by at least one of the
inspectors. Suppose that one of the 10,000 joints is randomly selected.
a. What is the probability that the selected joint was judged to be defective by
neither of the two inspectors?
b. What is the probability that the selected joint was judged to be defective by
inspector B but not by inspector A?
4. Semiconductor lasers used in optical storage products require higher power levels
for write operations than for read operations. High-power-level operations lower
the useful life of the laser.
Lasers in products used for backup of higher speed magnetic disks primarily write,
and the probability that the useful life exceeds five years is 0.95. Lasers that are in
products that are used for main storage spend approximately an equal amount of
time reading and writing, and the probability that the useful life exceeds five years
is 0.995. Now, 25% of the products from a manufacturer are used for backup and
75% of the products are used for main storage.
Let A denote the event that a laser’s useful life exceeds five years, and let B denote
the event that a laser is in a product that is used for backup. Use a tree diagram to
determine the following:
a. 𝑃(𝐵)
b. 𝑃(𝐴/𝐵)
c. 𝑃(𝐴/𝐵′)
d. 𝑃(𝐴 ∩ 𝐵)
e. 𝑃(𝐴 ∩ 𝐵′)
f. 𝑃(𝐴)
g. What is the probability that the useful life of a laser exceeds five years?
h. What is the probability that a laser that failed before five years came from a
product used for backup?
5. Samples of emissions from three suppliers are classified for conformance to air-
quality specifications. The results from 100 samples are summarized as follows:
               Conforms
               Yes    No
Supplier 1      22     8
Supplier 2      25     5
Supplier 3      30    10
Let A denote the event that a sample is from supplier 1, and let B denote the event
that a sample conforms to specifications.
Session 5
Joint Probability Distribution
Lecture:
RANDOM VARIABLES
For a given sample space S of some experiment, a random variable (rv) is any rule that
associates a number with each outcome in S. In mathematical language, a random
variable is a function whose domain is the sample space and whose range is the set of
real numbers.
Random variables are customarily denoted by uppercase letters, such as X and Y, near
the end of our alphabet. In contrast to our previous use of a lowercase letter, such as x,
to denote a variable, we will now use lowercase letters to represent some particular value
of the corresponding random variable. The notation X(s) = x means that x is the value
associated with the outcome s by the rv X.
Example 1:
When a student calls a university help desk for technical support, he/she will either
immediately be able to speak to someone (S, for success) or will be placed on hold (F,
for failure). With S={S,F}, define an rv X by
𝑋 (𝑆 ) = 1 𝑋 (𝐹 ) = 0
The rv X indicates whether (1) or not (0) the student can immediately speak to someone.
The rv X in the example was specified by explicitly listing each element of S and the
associated number. Such a listing is tedious if S contains more than a few outcomes, but
it can frequently be avoided.
Example 2:
Consider the experiment in which a telephone number in a certain area code is dialed
using a random number dialer (such devices are used extensively by polling
organizations), and define an rv Y by
Y = 1 if the selected number is unlisted; Y = 0 if the selected number is listed
A discrete random variable is an rv whose possible values either constitute a finite set
or else can be listed in an infinite sequence in which there is a first element, a second
element, and so on (“countably” infinite).
A random variable is continuous if both of the following apply:
1. Its set of possible values consists either of all numbers in a single interval on the
number line (possibly infinite in extent, e.g., from −∞ to ∞) or all numbers in a
disjoint union of such intervals (e.g., [0,10] ∪ [20,30] ).
2. No possible value of the variable has positive probability, that is, P(X = c) = 0 for
any possible value c.
Given below is an example of a probability distribution (or probability mass function, pmf)
for a single rv X.
𝑥 0 1 2
𝑓(𝑥) 0.50 0.20 0.30
Examples for continuous random variables:
Time when bus driver picks you up vs. Quantity of caffeine in bus driver's system
Dosage of a drug (ml) vs. Blood compound measure (percentage)
In general, if X and Y are two random variables, the probability distribution that defines
their simultaneous behaviour is called a joint probability distribution.
Shown below is a table for two discrete random variables, which gives 𝑃(𝑋 = 𝑥, 𝑌 = 𝑦).
              x
          1      2      3
   y  1   0     1/6    1/6
      2  1/6     0     1/6
      3  1/6    1/6     0
Shown below is a graphic for two continuous random variables as 𝑓𝑋,𝑌 (𝑥, 𝑦).
If X and Y are discrete, this distribution can be described with a joint probability mass
function.
If X and Y are continuous, this distribution can be described with a joint probability
density function.
Dimensions of plastic covers for CDs are measured. Measurements for the length and
width of rectangular plastic CD covers are rounded to the nearest mm, so they are
considered discrete. Let X denote the length and let Y denote the width.
The possible values of X are 129, 130, and 131 mm while the possible values of Y are 15
and 16 mm. Thus, both X and Y are discrete. There are 6 possible pairs (X,Y). We show
the probability for each pair in the following table:
                     x = length
                  129     130     131
  y = width  15   0.12    0.42    0.06
             16   0.08    0.28    0.04
The sum of all probabilities is 1.0. The combination with the highest probability is (130,15).
The combination with the lowest probability is (131, 16).
The joint probability mass function is the function 𝑓𝑋𝑌 (𝑥, 𝑦) = 𝑃(𝑋 = 𝑥, 𝑌 = 𝑦). For
example, we have 𝑓𝑋𝑌 (129,15) = 0.12.
If we are given a joint probability distribution for X and Y, we can obtain the individual
probability distribution for X or Y and these are called the Marginal Probability
Distributions.
Example:
Referring to the example of plastic covers of CDs, find the probability that a CD cover has
a length of 129 mm (i.e. X=129).
                     x = length
                  129     130     131
  y = width  15   0.12    0.42    0.06
             16   0.08    0.28    0.04
𝑃(𝑋 = 129) = 𝑃(𝑋 = 129 𝑎𝑛𝑑 𝑌 = 15) + 𝑃(𝑋 = 129 𝑎𝑛𝑑 𝑌 = 16)
𝑃(𝑋 = 129) = 0.12 + 0.08 = 𝟎. 𝟐𝟎
                     x = length
                  129     130     131
  y = width  15   0.12    0.42    0.06
             16   0.08    0.28    0.04
  Column totals   0.20    0.70    0.10
The probability distribution for X appears in the column totals, resulting in

  x        129     130     131
  fX(x)    0.20    0.70    0.10
We've used a subscript X in the probability mass function of X, or 𝑓𝑋 (𝑥), for clarification
since we're considering more than one variable at a time now.
We can do the same for the Y random variable:
                     x = length            Row totals
                  129     130     131
  y = width  15   0.12    0.42    0.06        0.60
             16   0.08    0.28    0.04        0.40
  Column totals   0.20    0.70    0.10        1
  y        15      16
  fY(y)    0.60    0.40
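Both marginal distributions can be accumulated from the joint table in one pass (a sketch; the dictionary keys are the (x, y) pairs from the table above):

```python
from collections import defaultdict

# Joint pmf of CD-cover length X (mm) and width Y (mm), from the table above
joint = {
    (129, 15): 0.12, (130, 15): 0.42, (131, 15): 0.06,
    (129, 16): 0.08, (130, 16): 0.28, (131, 16): 0.04,
}

fx, fy = defaultdict(float), defaultdict(float)
for (x, y), p in joint.items():
    fx[x] += p     # column totals: marginal pmf of X
    fy[y] += p     # row totals: marginal pmf of Y

print({x: round(p, 2) for x, p in fx.items()})  # {129: 0.2, 130: 0.7, 131: 0.1}
print({y: round(p, 2) for y, p in fy.items()})  # {15: 0.6, 16: 0.4}
```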
Because the probability mass functions for X and Y appear in the margins of the table
(i.e. column and row totals), they are often referred to as the Marginal Distributions for
X and Y. When there are two random variables of interest, we also use the term bivariate
probability distribution or bivariate distribution to refer to the joint distribution.
The joint probability mass function of the discrete random variables X and Y, denoted as
𝑓𝑋𝑌 (𝑥, 𝑦), satisfies
1. 𝑓𝑋𝑌 (𝑥, 𝑦) ≥ 0
2. ∑𝑥 ∑𝑦 𝑓𝑋𝑌 (𝑥, 𝑦) = 1
3. 𝑓𝑋𝑌 (𝑥, 𝑦) = 𝑃(𝑋 = 𝑥, 𝑌 = 𝑦)
If X and Y are discrete random variables with joint probability mass function 𝑓𝑋𝑌 (𝑥, 𝑦), then
the marginal probability mass functions of X and Y are
𝑓𝑋(𝑥) = ∑𝑦 𝑓𝑋𝑌(𝑥, 𝑦) and 𝑓𝑌(𝑦) = ∑𝑥 𝑓𝑋𝑌(𝑥, 𝑦)
Where the sum for 𝑓𝑋 (𝑥 ) is over all points in the range of (X,Y) for which 𝑋 = 𝑥 and the
sum for 𝑓𝑌 (𝑦) is over all points in the range of (X,Y) for which 𝑌 = 𝑦.
When asked for 𝐸(𝑋) or 𝑉(𝑋) (i.e. values related to only 1 of the 2 variables) but you are
given a joint probability distribution, first calculate the marginal distribution 𝑓𝑋 (𝑥) and work
it similar to a univariate case (i.e. for a single random variable).
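The marginal calculation above can be sketched in a few lines of Python (a minimal illustration; the joint probabilities are taken from the CD-cover table):

```python
# Joint pmf of (length X, width Y) from the CD-cover table above.
joint = {
    (129, 15): 0.12, (130, 15): 0.42, (131, 15): 0.06,
    (129, 16): 0.08, (130, 16): 0.28, (131, 16): 0.04,
}

def marginal(joint, axis):
    """Marginal pmf: sum the joint pmf over the other variable (axis 0 = X, 1 = Y)."""
    out = {}
    for pair, p in joint.items():
        out[pair[axis]] = out.get(pair[axis], 0.0) + p
    return out

fX = marginal(joint, 0)   # column totals: {129: 0.20, 130: 0.70, 131: 0.10}
fY = marginal(joint, 1)   # row totals:    {15: 0.60, 16: 0.40}
```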
Example:
Suppose that 2 batteries are randomly chosen without replacement from the following
group of 12 batteries: 3 new, 4 used but working, and 5 defective. Let X denote the
number of new batteries chosen and Y the number of used (but working) batteries
chosen. Find 𝑓𝑋𝑌(𝑥, 𝑦).
Though X can take on values 0,1, and 2, and Y can take on values 0,1, and 2, when we
consider them jointly, 𝑋 + 𝑌 ≤ 2. So, not all combinations of (𝑋, 𝑌) are possible. But, there
are 6 possible cases:
𝑓𝑋𝑌(0,0) = 5C2/12C2 = 10/66
𝑓𝑋𝑌(0,1) = (4C1)(5C1)/12C2 = 20/66
𝑓𝑋𝑌(0,2) = 4C2/12C2 = 6/66
𝑓𝑋𝑌(1,0) = (3C1)(5C1)/12C2 = 15/66
𝑓𝑋𝑌(1,1) = (3C1)(4C1)/12C2 = 12/66
𝑓𝑋𝑌(2,0) = 3C2/12C2 = 3/66
Recall that for events A and B, the conditional probability is
𝑃(𝐴|𝐵) = 𝑃(𝐴 ∩ 𝐵)/𝑃(𝐵)
We now apply this conditioning to random variables X and Y. Given random variables X
and Y with joint probability 𝑓𝑋𝑌 (𝑥, 𝑦), the conditional probability distribution of Y given 𝑋 =
𝑥 is
𝑓𝑌|𝑥(𝑦) = 𝑓𝑋𝑌(𝑥, 𝑦)/𝑓𝑋(𝑥) for 𝑓𝑋(𝑥) > 0.
The conditional probability can be stated as the joint probability over the marginal
probability.
Example:
a) Find the probability that a CD cover has a length of 130 mm GIVEN that the width
is 15 mm.
𝑃(𝑋 = 129|𝑌 = 15) = 0.12/0.60 = 0.20
𝑃(𝑋 = 130|𝑌 = 15) = 0.42/0.60 = 0.70
𝑃(𝑋 = 131|𝑌 = 15) = 0.06/0.60 = 0.10
This gives the distribution of the lengths (X) within the subset of covers whose width
is 15 mm.
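The conditional distribution just computed can be reproduced by dividing each joint probability by the marginal f_Y(15) (a small sketch using the table values):

```python
# Conditional pmf of length X given width Y = 15, using the joint table above.
joint = {
    (129, 15): 0.12, (130, 15): 0.42, (131, 15): 0.06,
    (129, 16): 0.08, (130, 16): 0.28, (131, 16): 0.04,
}
f_y15 = sum(p for (x, y), p in joint.items() if y == 15)     # marginal: 0.60

cond = {x: joint[(x, 15)] / f_y15 for x in (129, 130, 131)}
# cond is {129: 0.20, 130: 0.70, 131: 0.10}, and it sums to 1 like any pmf
```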
Activity
1. Show that the following function satisfies the properties of a joint probability
mass function.
4. Determine the value of c that makes the function 𝑓(𝑥, 𝑦) = 𝑐(𝑥 + 𝑦) a joint
probability mass function over the nine points with x=1,2,3 and y=1,2,3.
5. In problem 4, determine the probabilities of the following:
a. 𝑃(𝑋 = 1, 𝑌 < 4)
b. 𝑃(𝑋 = 1)
c. 𝑃(𝑌 = 2)
d. 𝑃(𝑋 < 2, 𝑌 < 2)
Session 1
Introduction to Discrete Random Variables and Discrete Probability
Distributions
Lecture:
RANDOM VARIABLES
A random variable is a function that assigns a real number to each outcome in the sample
space of a random experiment. A random variable is denoted by an uppercase letter such
as X. After an experiment is conducted, the measured value of the random variable is
denoted by a lowercase letter.
In other experiments, we might record a count such as the number of transmitted bits that
are received in error. Then the measurement is limited to integers. Or we might record
that a proportion such as 0.0042 of the 10,000 transmitted bits were received in error.
Then the measurement is fractional, but it is still limited to discrete points on the real line.
Whenever the measurement is limited to discrete points on the real line, the random
variable is said to be a discrete random variable.
A discrete random variable is a random variable with a finite (or countably infinite) range.
A continuous random variable is a random variable with an interval (either finite or infinite) of real
numbers for its range.
In some cases, the random variable X is actually discrete but, because the range of
possible values is so large, it might be more convenient to analyze X as a continuous
random variable. For example, suppose that current measurements are read from a
digital instrument that displays the current to the nearest one-hundredth of a milliampere.
Because the possible measurements are limited, the random variable is discrete.
However, it might be a more convenient, simple approximation to assume that the current
measurements are values of a continuous random variable.
Example 2: In a semiconductor manufacturing process, two wafers from a lot are tested.
Each wafer is classified as pass or fail. Assume that the probability that a wafer passes
the test is 0.8 and that wafers are independent. Then the random variable X is the number
of wafers that passed the test.
DISCRETE PROBABILITY DISTRIBUTION
For a discrete random variable, the distribution is often specified by just a list of the
possible values along with the probability of each. In some cases, it is convenient to
express the probability in terms of a formula.
Example 4: In a semiconductor manufacturing process, two wafers from a lot are tested.
Each wafer is classified as pass or fail. Assume that the probability that a wafer passes
the test is 0.8 and that wafers are independent. If the discrete random variable X
represents the number of wafers that passed the test. Find the discrete probability
distribution of random variable X.
Solution:
𝑓𝑜𝑟 𝑥 = 2;
𝑃(𝑝, 𝑝) = 0.8 𝑥 0.8 = 0.64
𝑓𝑜𝑟 𝑥 = 1;
𝑃(𝑝, 𝑓) = 0.8 𝑥 0.2 = 0.16
𝑃(𝑓, 𝑝) = 0.8 𝑥 0.2 = 0.16
𝑓𝑜𝑟 𝑥 = 0;
𝑃(𝑓, 𝑓) = 0.2 𝑥 0.2 = 0.04
The sample space for the experiment and associated probabilities are shown below:
Outcome (p, p) (p, f) (f, p) (f, f)
𝑥 2 1 1 0
Probability 0.64 0.16 0.16 0.04
μ or E(X) = ∑ 𝑥𝑃(𝑥)
From the probability distribution of Example 4,
𝜇 = 𝐸(𝑋) = 0(0.04) + 1(0.32) + 2(0.64) = 1.6
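The wafer distribution above can be tabulated and its mean and variance computed directly from the definitions (a minimal sketch):

```python
# Pmf of X = number of wafers passing the test (p = 0.8, two independent wafers),
# assembled from the outcome probabilities computed in Example 4.
pmf = {0: 0.2 * 0.2, 1: 2 * (0.8 * 0.2), 2: 0.8 * 0.8}   # {0: 0.04, 1: 0.32, 2: 0.64}

mean = sum(x * p for x, p in pmf.items())                 # E(X) = 1.6
var = sum((x - mean) ** 2 * p for x, p in pmf.items())    # V(X) = 0.32
```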
Example 5. Two new product designs are to be compared on the basis of revenue
potential. Marketing feels that the revenue from design A can be predicted quite
accurately to be $3 million. The revenue potential of design B is more difficult to assess.
Marketing concludes that there is a probability of 0.3 that the revenue from design B will
be $7 million, but there is a 0.7 probability that the revenue will be only $2 million. Which
design do you prefer?
Because there is no uncertainty in the revenue from design A, we can model the
distribution of the random variable X as $3 million with probability 1.
Therefore, 𝝁𝒙 or E(X) = $3 million
However, the expected revenue from design B is 𝐸(𝑋) = 7(0.3) + 2(0.7) = $3.5 million,
and the variability of the result from design B is larger. That is,
𝝈𝟐 = (7 − 3.5)2 (0.3) + (2 − 3.5)2 (0.7)
= 𝟓. 𝟐𝟓 𝐦𝐢𝐥𝐥𝐢𝐨𝐧 𝐝𝐨𝐥𝐥𝐚𝐫𝐬 𝐬𝐪𝐮𝐚𝐫𝐞𝐝
Because the units of the variables in this example are millions of dollars, and because the
variance of a random variable squares the deviations from the mean, the units of variance
are millions of dollars squared. These units make interpretation difficult.
Because the units of standard deviation are the same as the units of the random variable,
the standard deviation is easier to interpret.
In this example, we can summarize our results as "the average deviation of the design
B revenue from its mean is 𝝈 = $2.29 million."
Example 6. The number of messages sent per hour over a computer network has the
following distribution:
Determine the mean and standard deviation of the number of messages sent per hour.
𝜇 = (10)(0.08) + (11)(0.15) + (12)(0.3) + (13)(0.2) + (14)(0.2) + (15)(0.07)
𝝁 = 𝟏𝟐.𝟓
𝝈² = (−2.5)²(0.08) + (−1.5)²(0.15) + (−0.5)²(0.3) + (0.5)²(0.2) + (1.5)²(0.2) + (2.5)²(0.07)
𝝈² = 37/20 = 1.85
𝝈 = √(37/20) = 1.36
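Example 6 can be checked numerically; the distribution is recovered from the terms of the mean computation above:

```python
# Message-count pmf from Example 6.
pmf = {10: 0.08, 11: 0.15, 12: 0.30, 13: 0.20, 14: 0.20, 15: 0.07}

mu = sum(x * p for x, p in pmf.items())                 # 12.5
var = sum((x - mu) ** 2 * p for x, p in pmf.items())    # 37/20 = 1.85
sigma = var ** 0.5                                      # ≈ 1.36
```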
Activity:
2. Marketing estimates that a new instrument for the analysis of soil samples will be
very successful, moderately successful, or unsuccessful, with probabilities 0.3,
0.6, and 0.1, respectively. The yearly revenue associated with a very successful,
moderately successful, or unsuccessful product is $10 million, $5 million, and $1
million, respectively. Let the random variable X denote the yearly revenue of the
product. Determine the probability distribution of the random variable.
3. If the range of X is the set {0, 1, 2, 3, 4} and P(X=x)=0.2 determine the mean and
variance of the random variable.
Session 2
Binomial Distributions
Lecture:
The outcome from each trial either meets the criterion that X counts or it does not;
consequently, each trial can be summarized as resulting in either a success or a failure.
For example, in the multiple choice experiment, for each question, only the choice that is
correct is considered a success. Choosing any one of the three incorrect choices results
in the trial being summarized as a failure.
A trial with only two possible outcomes is used so frequently as a building block of a
random experiment that it is called a Bernoulli trial.
PROBABILITY OF BINOMIAL DISTRIBUTION
It is usually assumed that the trials that constitute the random experiment are
independent. This implies that the outcome from one trial has no effect on the outcome
to be obtained from any other trial.
Example 1: The chance that a bit transmitted through a digital transmission channel is
received in error is 0.1. Also, assume that the transmission trials are independent. Let X
be the number of bits in error in the next four bits transmitted. Determine 𝑃(𝑥 = 2).
Solution: Let the letter E denote a bit in error, and let the letter O denote that the bit is
okay, that is, received without error. We can represent the outcomes of this experiment
as a list of four letters that indicate the bits that are in error and those that are okay.
Using the assumption that the trials are independent, the probability of {𝐸𝐸𝑂𝑂} is:
𝑃(𝐸𝐸𝑂𝑂) = 𝑃(𝐸)𝑃(𝐸)𝑃(𝑂)𝑃(𝑂) = (0.1)²(0.9)² = 0.0081
Also, any one of the six mutually exclusive outcomes for which X=2 has the same
probability of occurring. Therefore,
𝑃(𝑋 = 2) = 6(0.0081) = 0.0486
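The counting argument in Example 1 is an instance of the binomial pmf, P(X = x) = nCx p^x (1 − p)^(n − x); a small sketch:

```python
from math import comb

def binom_pmf(x, n, p):
    """Binomial pmf: probability of exactly x successes in n independent trials."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

p2 = binom_pmf(2, 4, 0.1)     # comb(4, 2) = 6 outcomes, each 0.0081, so 0.0486
```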
Solution:
Let X be the number of samples that contain the pollutant in the next 18 samples
analyzed.
Example 3. The chance that a bit transmitted through a digital transmission channel is
received in error is 0.1. Also, assume that the transmission trials are independent. Let X
be the number of bits in error in the next four bits transmitted. Find the mean and variance
of X.
𝝁 = 𝑬(𝑿) = 𝒏𝒑 = 𝟒(𝟎. 𝟏) = 𝟎. 𝟒
𝝈𝟐 = 𝑽(𝑿) = 𝒏𝒑(𝟏 − 𝒑) = 𝟒(𝟎. 𝟏)(𝟏 − 𝟎. 𝟏) = 𝟎. 𝟑𝟔
Activity:
1. The random variable X has a binomial distribution with n=10 and p=0.5. Determine
the following probabilities: P(X=5), P(X≤2), P(X≥9), P(3≤X<5). Also find mean and
variance of the random variable X.
2. Because not all airline passengers show up for their reserved seat, an airline sells
125 tickets for a flight that holds only 120 passengers. The probability that a
passenger does not show up is 0.10, and the passengers behave independently.
a. What is the probability that every passenger who shows up can take the
flight?
b. What is the probability that the flight departs with empty seats?
Session 3
Geometric and Negative Binomial Distributions
Lecture:
GEOMETRIC DISTRIBUTIONS
The geometric distribution involves a series of Bernoulli trials, just like the binomial
distribution. However, instead of a fixed number of trials, trials are conducted until a
success is obtained.
In a series of Bernoulli trials (independent trials with constant probability p of a success), let the
random variable X denote the number of trials until the first success.
Then X is a geometric random variable with parameter 0 < 𝑝 < 1 and
𝑃(𝑥) = (1 − 𝑝)^(𝑥−1) 𝑝,  𝑥 = 1, 2, …
Example 1: The probability that a bit transmitted through a digital transmission channel is
received in error is 0.1. Assume the transmissions are independent events, and let the
random variable X denote the number of bits transmitted until the first error.
Solution:
Then, P(X = 5) is the probability that the first four bits are transmitted correctly and the
fifth bit is in error. This event can be denoted as {OOOOE}, where O denotes an okay bit.
Because the trials are independent and the probability of a correct transmission is 0.9,
𝑃(𝑋 = 5) = 𝑃(𝑂𝑂𝑂𝑂𝐸) = (0.9)⁴(0.1) = 0.066
Example 2: The probability that a wafer contains a large particle of contamination is 0.01.
If it is assumed that the wafers are independent, what is the probability that exactly 125
wafers need to be analyzed before a large particle is detected?
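Both examples follow directly from the geometric pmf; a minimal sketch:

```python
def geom_pmf(x, p):
    """Geometric pmf: first success occurs on trial x in Bernoulli(p) trials."""
    return (1 - p) ** (x - 1) * p

p5 = geom_pmf(5, 0.1)        # Example 1: (0.9)^4 * 0.1 ≈ 0.066
p125 = geom_pmf(125, 0.01)   # Example 2: (0.99)^124 * 0.01 ≈ 0.0029
```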
MEAN AND VARIANCE OF GEOMETRIC DISTRIBUTIONS
If X is a geometric random variable with parameter p, then 𝜇 = 𝐸(𝑋) = 1/𝑝 and
𝜎² = 𝑉(𝑋) = (1 − 𝑝)/𝑝².
Example 3: The probability that a bit transmitted through a digital transmission channel is
received in error is 0.1. Assume the transmissions are independent events, and let the
random variable X denote the number of bits transmitted until the first error. Find the mean
number of transmissions until the first error.
Solution:
The mean number of transmissions until the first error is:
𝜇 = 1/𝑝 = 1/0.1 = 10
The standard deviation of the number of transmissions before the first error is
𝜎 = [(1 − 𝑝)/𝑝²]^(1/2) = [(1 − 0.1)/(0.1)²]^(1/2) = 9.49
Example 4: Suppose the probability that a bit transmitted through a digital transmission
channel is received in error is 0.1. Assume the transmissions are independent events,
and let the random variable X denote the number of bits transmitted until the fourth error.
Find the probability that the fourth error bit is transmitted on the 10th trial.
Solution:
Then, X has a negative binomial distribution with r = 4. P(X = 10) is the probability that
exactly three errors occur in the first nine trials and then trial 10 results in the fourth error.
𝑃(𝑋 = 𝑥) = [(𝑥 − 1)C(𝑟 − 1)](1 − 𝑝)^(𝑥−𝑟) 𝑝^𝑟
𝑃(𝑋 = 10) = (9C3)(0.9)⁶(0.1)⁴ = 4.4641 × 10⁻³
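Example 4's calculation is the negative binomial pmf; a sketch:

```python
from math import comb

def negbinom_pmf(x, r, p):
    """Negative binomial pmf: the r-th success occurs on trial x
    (r - 1 successes in the first x - 1 trials, then a success)."""
    return comb(x - 1, r - 1) * (1 - p) ** (x - r) * p ** r

p10 = negbinom_pmf(10, 4, 0.1)   # comb(9, 3) * 0.9**6 * 0.1**4 ≈ 4.464e-3
```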
MEAN AND VARIANCE OF NEGATIVE BINOMIAL DISTRIBUTIONS
If X is a negative binomial random variable with parameters p and r, then
𝜇 = 𝐸(𝑋) = 𝑟/𝑝 and 𝜎² = 𝑉(𝑋) = 𝑟(1 − 𝑝)/𝑝².
Example 5: A Web site contains three identical computer servers. Only one is used to
operate the site, and the other two are spares that can be activated in case the primary
system fails. The probability of a failure in the primary computer (or any activated spare
system) from a request for service is 0.0005. Assuming that each request represents an
independent trial, what is the mean number of requests until failure of all three servers?
𝜇 = 𝑟/𝑝 = 3/0.0005 = 6000 𝑟𝑒𝑞𝑢𝑒𝑠𝑡𝑠
Activity:
1. Assume that each of your calls to a popular radio station has a probability of 0.02
of connecting, that is, of not obtaining a busy signal. Assume that your calls are
independent.
a. What is the probability that your first call that connects is your tenth call?
b. What is the probability that it requires more than five calls for you to connect?
c. What is the mean number of calls needed to connect?
2. An electronic scale in an automated filling operation stops the manufacturing line
after three underweight packages are detected. Suppose that the probability of an
underweight package is 0.001 and each fill is independent.
a. What is the mean number of fills before the line is stopped?
b. What is the standard deviation of the number of fills before the line is
stopped?
Session 4
Hypergeometric and Poisson Distributions
Lecture:
HYPERGEOMETRIC DISTRIBUTIONS
Example 1: A batch of parts contains 100 parts from a local supplier of tubing and 200
parts from a supplier of tubing in the next state. If four parts are selected randomly and
without replacement, what is the probability they are all from the local supplier?
Solution:
𝑃(𝑋 = 4) = (100C4)(200C0)/(300C4) = 0.0119
From Example 1, what is the probability that two or more parts in the sample are from
the local supplier?
From Example 1, what is the probability that at least one part in the sample is from the
local supplier?
𝑃(𝑋 ≥ 1) = 1 − 𝑃(𝑋 = 0) = 1 − (100C0)(200C4)/(300C4) = 1 − 0.196 = 0.804

𝐸(𝑋) = 𝜇 = 𝑛𝑝
𝜎² = 𝑉(𝑋) = 𝑛𝑝(1 − 𝑝)[(𝑁 − 𝑛)/(𝑁 − 1)]
Where: 𝑝 = 𝐾/𝑁
From Example 1, what is the mean or expected value of X? Also find its variance.
𝑝 = 𝐾/𝑁 = 100/300 = 1/3
𝜇 = 𝑛𝑝 = (4)(1/3) = 4/3 = 1.3333
𝜎² = 𝑉(𝑋) = 𝑛𝑝(1 − 𝑝)[(𝑁 − 𝑛)/(𝑁 − 1)] = (4)(1/3)(2/3)[(300 − 4)/(300 − 1)] = 0.88
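The hypergeometric results for Example 1 can be verified with `math.comb` (a sketch; N = 300 parts, K = 100 local, n = 4 drawn):

```python
from math import comb

N, K, n = 300, 100, 4    # population size, local parts, sample size

def hyper_pmf(x):
    """Hypergeometric pmf: x local parts in a sample of n drawn without replacement."""
    return comb(K, x) * comb(N - K, n - x) / comb(N, n)

p_all_local = hyper_pmf(4)                          # ≈ 0.0119
p_at_least_one = 1 - hyper_pmf(0)                   # ≈ 0.804
p = K / N                                           # 1/3
mean = n * p                                        # 4/3 ≈ 1.3333
var = n * p * (1 - p) * (N - n) / (N - 1)           # ≈ 0.88
```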
POISSON DISTRIBUTION
Given an interval of real numbers, assume counts occur at random throughout the
interval. If the interval can be partitioned into subintervals of small enough length such
that:
1. the probability of more than one count in a subinterval is zero,
2. the probability of one count in a subinterval is the same for all subintervals and
proportional to the length of the subinterval, and
3. the count in each subinterval is independent of other subintervals, the random
experiment is called a Poisson process.
The random variable X that equals the number of counts in the interval is a Poisson
random variable with parameter 𝜆 > 0. The probability of random variable X can be
computed as:
𝑃(𝑥) = 𝑒^(−𝜆) 𝜆^𝑥 / 𝑥!   𝑥 = 0, 1, 2, …
If a Poisson random variable represents the number of counts in some interval, the mean
of the random variable must equal the expected number of counts in the same length of
interval.
Example 2: Flaws occur at random along the length of a thin copper wire. Suppose that
the number of flaws follows a Poisson distribution with a mean of 2.3 flaws per millimeter.
Determine the probability of exactly 2 flaws in 1 millimeter of wire.
Solution:
𝑃(𝑥) = 𝑒^(−𝜆) 𝜆^𝑥 / 𝑥! = 𝑒^(−2.3) (2.3)² / 2! = 0.265
From Example 2: Determine the probability of 10 flaws in 5 millimeters of wire.
Solution:
𝜆 = (5 mm)(2.3 flaws/mm) = 11.5 flaws
Therefore,
𝑃(10) = 𝑒^(−𝜆) 𝜆^𝑥 / 𝑥! = 𝑒^(−11.5) (11.5)¹⁰ / 10! = 0.113
From Example 2: Determine the probability of at least 1 flaw in 2 millimeters of wire.
Solution:
𝜆 = (2 mm)(2.3 flaws/mm) = 4.6 flaws
Therefore,
𝑃(𝑥 ≥ 1) = 1 − 𝑃(𝑥 = 0) = 1 − 𝑒^(−4.6) (4.6)⁰ / 0! = 0.9899
Example 3: Contamination is a problem in the manufacture of optical storage disks. The
number of particles of contamination that occur on an optical disk has a Poisson
distribution, and the average number of particles per centimeter squared of media surface
is 0.1. The area of a disk under study is 100 squared centimeters. Find the probability
that 12 particles occur in the area of a disk under study.
Solution:
𝜆 = (100 cm²)(0.1 particles/cm²) = 10 particles
Therefore,
𝑃(𝑥 = 12) = 𝑒^(−10) (10)¹² / 12! = 0.095
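All of the flaw and contamination calculations above use the same Poisson pmf; a compact sketch:

```python
from math import exp, factorial

def poisson_pmf(x, lam):
    """Poisson pmf: probability of exactly x counts when the mean count is lam."""
    return exp(-lam) * lam ** x / factorial(x)

p_2_in_1mm = poisson_pmf(2, 2.3)              # ≈ 0.265
p_10_in_5mm = poisson_pmf(10, 5 * 2.3)        # lam = 11.5, ≈ 0.113
p_ge1_in_2mm = 1 - poisson_pmf(0, 2 * 2.3)    # lam = 4.6, ≈ 0.9899
p_12_particles = poisson_pmf(12, 100 * 0.1)   # lam = 10, ≈ 0.095
```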
12!
Activity:
1. Suppose that the number of customers that enter a bank in an hour is a Poisson
random variable, and suppose that P(X=0) = 0.05. Determine the mean and
variance of X.
2. Printed circuit cards are placed in a functional test after being populated with
semiconductor chips. A lot contains 140 cards, and 20 are selected without
replacement for functional testing.
a. If 20 cards are defective, what is the probability that at least 1 defective card
is in the sample?
b. If 5 cards are defective, what is the probability that at least 1 defective card
appears in the sample?
Continuous Probability Distributions
____________________________________________________
MODULE 4
Session 1
Introduction to Continuous Random Variables and Probability Density
Functions
By the end of this session, you should be able to:
3. Determine probabilities of continuous random variables from probability density
functions.
4. Calculate means and variances for continuous random variables.
Lecture:
For a continuous random variable X, a probability density function is a function such that:
1. 𝑓(𝑥) ≥ 0
2. ∫_−∞^∞ 𝑓(𝑥) 𝑑𝑥 = 1
3. 𝑃(𝑎 ≤ 𝑋 ≤ 𝑏) = ∫_𝑎^𝑏 𝑓(𝑥) 𝑑𝑥 = area under 𝑓(𝑥) from 𝑎 to 𝑏
A probability density function is zero for x values that cannot occur and it is
assumed to be zero wherever it is not specifically defined.
A histogram is an approximation to a probability density function.
For each interval of the histogram, the area of the bar equals the relative frequency
(proportion) of the measurements in the interval.
The relative frequency is an estimate of the probability that a measurement falls in
the interval.
Similarly, the area under f(x) over any interval equals the true probability that a
measurement falls in the interval.
PROBABILITY OF A CONTINUOUS RANDOM VARIABLE
Example 1: Let the continuous random variable X denote the current measured in a
thin copper wire in milliamperes. Assume that the range of X is [0, 20
mA], and assume that the probability density function of X is 𝑓(𝑥 ) = 0.05
for 0 ≤ 𝑥 ≤ 20. What is the probability that a current measurement is less
than 10 milliamperes?
𝑃(𝑋 < 10) = ∫_0^10 𝑓(𝑥) 𝑑𝑥 = ∫_0^10 0.05 𝑑𝑥 = 0.5
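The integral above is just the area of a rectangle; a quick numerical check (a sketch using a left Riemann sum):

```python
# Uniform current density on [0, 20] mA: f(x) = 0.05 there, 0 elsewhere.
def f(x):
    return 0.05 if 0 <= x <= 20 else 0.0

exact = 0.05 * (10 - 0)                           # area of the rectangle: 0.5
n = 100_000
dx = 10 / n
riemann = sum(f(i * dx) * dx for i in range(n))   # left Riemann sum over [0, 10]
```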
Example 2: Let the continuous random variable X denote the diameter of a hole
drilled in a sheet metal component. The target diameter is 12.5
millimeters. Most random disturbances to the process result in larger
diameters. Historical data show that the distribution of X can be modeled
by a probability density function 𝑓(𝑥) = 20𝑒^(−20(𝑥−12.5)) for 𝑥 ≥ 12.5. If a part with
a diameter larger than 12.60 millimeters is scrapped, what proportion of
parts is scrapped?
Solution:
𝑃(𝑋 > 12.60) = ∫_12.6^∞ 20𝑒^(−20(𝑥−12.5)) 𝑑𝑥 = [−𝑒^(−20(𝑥−12.5))]_12.6^∞ = 𝑒^(−2) = 0.1353
About 13.53% of parts are scrapped.
From Example 2, what proportion of parts is between 12.5 and 12.6 millimeters?
Solution:
𝑃(12.5 < 𝑋 < 12.6) = ∫_12.5^12.6 20𝑒^(−20(𝑥−12.5)) 𝑑𝑥 = 1 − 𝑒^(−2) = 0.8647

MEAN AND VARIANCE OF A CONTINUOUS RANDOM VARIABLE
𝜇 = 𝐸(𝑋) = ∫_−∞^∞ 𝑥𝑓(𝑥) 𝑑𝑥
𝜎² = 𝑉(𝑋) = ∫_−∞^∞ (𝑥 − 𝜇)²𝑓(𝑥) 𝑑𝑥 = ∫_−∞^∞ 𝑥²𝑓(𝑥) 𝑑𝑥 − 𝜇²
From Example 1, assume that the range of X is [0, 20 mA], and assume that the
probability density function of X is 𝑓 (𝑥 ) = 0.05 for 0 ≤ 𝑥 ≤ 20. What is the
mean and variance of the random variable X?
Solution:
𝜇 = 𝐸(𝑋) = ∫_−∞^∞ 𝑥𝑓(𝑥) 𝑑𝑥 = ∫_0^20 𝑥(0.05) 𝑑𝑥 = 0.05(𝑥²/2) |_0^20 = 10
𝜎² = 𝑉(𝑋) = ∫_−∞^∞ (𝑥 − 𝜇)²𝑓(𝑥) 𝑑𝑥 = ∫_0^20 (𝑥 − 10)²(0.05) 𝑑𝑥 = 0.05[(𝑥 − 10)³/3] |_0^20 = 33.33
From Example 2, Historical data show that the distribution of X can be modeled by
a probability density function 𝑓(𝑥 ) = 20𝑒 −20(𝑥−12.5) , 𝑥 ≥ 12.5. Determine
the mean of the random variable X.
Solution:
𝜇 = 𝐸(𝑋) = ∫_12.5^∞ 𝑥𝑓(𝑥) 𝑑𝑥 = ∫_12.5^∞ 𝑥(20𝑒^(−20(𝑥−12.5))) 𝑑𝑥
By integration by parts, with 𝑢 = 𝑥, 𝑑𝑢 = 𝑑𝑥, 𝑑𝑣 = 20𝑒^(−20(𝑥−12.5)) 𝑑𝑥, and
𝑣 = −𝑒^(−20(𝑥−12.5)):
𝜇 = 𝑢𝑣 − ∫ 𝑣 𝑑𝑢 = [−𝑥𝑒^(−20(𝑥−12.5))]_12.5^∞ + ∫_12.5^∞ 𝑒^(−20(𝑥−12.5)) 𝑑𝑥
𝜇 = 12.5 + [−(1/20)𝑒^(−20(𝑥−12.5))]_12.5^∞ = 12.5 + 0.05 = 12.55
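The integration-by-parts result can be sanity-checked numerically (a sketch; the density is truncated where it becomes negligibly small):

```python
from math import exp

def f(x):
    """Density of the drilled-hole diameter, defined for x >= 12.5 mm."""
    return 20 * exp(-20 * (x - 12.5))

# Left Riemann sum for mu = integral of x * f(x) dx; beyond x = 14 the density
# is about exp(-30), which contributes nothing visible.
dx = 1e-5
mu = sum((12.5 + i * dx) * f(12.5 + i * dx) * dx for i in range(int(1.5 / dx)))
# mu ≈ 12.55, matching the closed-form answer
```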
If X is a continuous uniform random variable over 𝑎 ≤ 𝑥 ≤ 𝑏 (that is, if its probability
density function is constant), then
𝐸(𝑋) = (𝑎 + 𝑏)/2 and 𝑉(𝑋) = (𝑏 − 𝑎)²/12
From Example 1, assume that the range of X is [0, 20 mA], and assume that the
probability density function of X is 𝑓 (𝑥 ) = 0.05 for 0 ≤ 𝑥 ≤ 20. What is the
mean and standard deviation of random variable X?
Solution:
𝐸(𝑋) = (𝑎 + 𝑏)/2 = (0 + 20)/2 = 10 mA
𝜎² = 𝑉(𝑋) = (𝑏 − 𝑎)²/12 = (20 − 0)²/12 = 33.33, so 𝜎 = √33.33 = 5.77 mA
Activity
2. The probability density function of the net weight in pounds of a packaged chemical
herbicide is 𝑓(𝑥 ) = 2.0 𝑓𝑜𝑟 49.75 < 𝑥 < 50.25 pounds.
a. Determine the probability that a package weighs more than 50 pounds.
b. How much chemical is contained in 90% of all packages?
c. Determine the mean and variance of the weight of packages.
Session 2
NORMAL DISTRIBUTION
Lecture:
NORMAL DISTRIBUTION
Random variables with different means and variances can be modeled by normal
probability density functions with appropriate choices of the center and width of the curve.
The value of E(X) = 𝜇 determines the center of the probability density function and the
value of 𝑉 (𝑋) = 𝜎 2 determines the width.
The figure illustrates several normal probability density functions with selected
values of 𝜇 and 𝜎 2 . Each has the characteristic symmetric bell-shaped curve, but the
centers and dispersions differ.
In a normal distribution for any normal random variable, the probability distribution
can be approximated by the figure below:
The probability density function decreases as x moves farther from μ.
Consequently, the probability that a measurement falls far from µ is small,
and at some distance from µ the probability of an interval can be
approximated as zero.
STANDARD NORMAL DISTRIBUTION
PROBABILITIES USING THE CUMULATIVE STANDARD NORMAL DISTRIBUTION
Example 3: Solve for 𝑃(𝑍 ≤ 1.5)
The intersection of row (1.5) and column (0.00) is 0.933193,
Therefore: 𝑷(𝒁 ≤ 𝟏. 𝟓) = 𝟎. 𝟗𝟑𝟑𝟏
Example 4: Solve for 𝑃(𝑍 > 1.26)
The intersection of row (1.2) and column (0.06) (1.2 + 0.06 = 1.26) is
0.896165, this is 𝑃 (𝑍 ≤ 1.26). But the required probability is 𝑃(𝑍 > 1.26)
Example 5: Solve for 𝑃(𝑍 < −0.86)
Solution:
This probability is from −∞ 𝑡𝑜 − 0.86 which is graphically,
𝑃(𝑍 ≤ 0.37) = 0.644309
𝑃(𝑍 ≤ −3.99) = 0.000033
Since 𝑃(𝑍 ≤ −4.6) is less than 𝑃(𝑍 ≤ −3.99), therefore 𝑃(𝑍 ≤ −4.6) can
be approximated as almost zero.
Example 8. Find the value of 𝑧 such that 𝑃(𝑍 > 𝑧) = 0.05
Solution:
Since the total probability is 1, if the probability that Z is greater than z is
0.05, we can say that the probability that Z is less than or equal to z is
(1 − .05) = 0.95.
Therefore we can say that in the expression 𝑃(𝑍 > 𝑧) = 0.05, 𝑧 = 1.65
Example 9: Find the value of 𝑧 such that 𝑃 (−𝑧 < 𝑍 < 𝑧) = 0.99
Solution:
Because of the symmetry of the normal distribution, if the area of the
shaded region is to equal 0.99, the area in each tail of the distribution must
equal (1 − 0.99)/2 = 0.005.
Therefore we can say that in the expression 𝑃(−𝑧 < 𝑍 < 𝑧) = 0.99,
𝒛 = 𝟐. 𝟓𝟖
TRANSFORMATION TO STANDARD NORMAL RANDOM VARIABLE (Z)
The preceding examples show how to calculate probabilities for standard normal
random variables. To use the same approach for an arbitrary normal random variable
would require a separate table for every possible pair of values for 𝜇 and 𝜎.
Fortunately, all normal probability distributions are related algebraically, and
Appendix Table II can be used to find the probabilities associated with an arbitrary normal
random variable by first using a simple transformation.
𝑍 = (𝑋 − 𝜇)/𝜎
is a normal random variable with 𝐸(𝑍) = 0 and 𝑉(𝑍) = 1. That is, Z is a standard
normal random variable.
𝑃(𝑍 > 1.5) = 1 − 𝑃 (𝑍 ≤ 1.5)
Then from the table
𝑃(𝑍 < −0.5) = 0.308538
𝑇ℎ𝑒𝑟𝑒𝑓𝑜𝑟𝑒:
𝑃 (9 < 𝑋 < 11) = 𝑃 (−0.5 < 𝑍 < 0.5)
= 0.691462 − 0.308538
= 𝟎. 𝟑𝟖𝟐𝟗𝟐𝟒
c. Determine the value of X for which the probability that a current
measurement is below this value is 0.98.
This can be expressed as 𝑥 𝑠𝑢𝑐ℎ 𝑡ℎ𝑎𝑡 𝑃(𝑋 < 𝑥 ) = 0.98. First we
have to find the value of 𝑧 such that 𝑃(𝑍 < 𝑧) = 0.98
𝑃(𝑍 < 2.05) = 0.979818
𝑃(𝑍 < 2.05) ≈ 0.98
𝑇ℎ𝑒𝑟𝑒𝑓𝑜𝑟𝑒:
𝑇ℎ𝑒 𝑣𝑎𝑙𝑢𝑒 𝑜𝑓 𝑧 𝑠𝑢𝑐ℎ 𝑡ℎ𝑎𝑡 𝑃(𝑍 < 𝑧) = 0.98 is 2.05
𝑇ℎ𝑒𝑛 𝑡ℎ𝑒 𝑣𝑎𝑙𝑢𝑒 𝑥 𝑠𝑢𝑐ℎ 𝑡ℎ𝑎𝑡 𝑃(𝑋 < 𝑥 ) = 0.98 𝑖𝑠:
𝑧 = (𝑥 − 𝜇)/𝜎
2.05 = (𝑥 − 10)/2
𝒙 = 𝟏𝟒. 𝟏 𝒎𝑨
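Python's standard library can stand in for the printed table here; `statistics.NormalDist` exposes the cdf and its inverse (a sketch reproducing the lookups above):

```python
from statistics import NormalDist

Z = NormalDist()                      # standard normal: mu = 0, sigma = 1

p = Z.cdf(1.5)                        # ≈ 0.9332, the table value for P(Z <= 1.5)
z98 = Z.inv_cdf(0.98)                 # ≈ 2.05, the z with P(Z < z) = 0.98

# Example 10: current X ~ Normal(mu = 10, sigma = 2); x with P(X < x) = 0.98
x = NormalDist(mu=10, sigma=2).inv_cdf(0.98)    # ≈ 14.1 mA
```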
Example 11. Assume that in the detection of a digital signal the background noise
follows a normal distribution with a mean of 0 volt and standard deviation
of 0.45 volt. The system assumes a digital 1 has been transmitted when
the voltage exceeds 0.9.
a. What is the probability of detecting a digital 1 when none was
sent?
b. Determine symmetric bounds about 0 that include 99% of all noise
readings.
Solution:
a. What is the probability of detecting a digital 1 when none was
sent?
Let the random variable N denote the voltage of noise. The
requested probability is 𝑃(𝑁 > 0.9).
Converting it to standard score Z:
𝑍 = (𝑋 − 𝜇)/𝜎 = (𝑁 − 𝜇)/𝜎 = (0.9 − 0)/0.45 = 2
𝑇ℎ𝑒𝑟𝑒𝑓𝑜𝑟𝑒: 𝑃(𝑁 > 0.9) = 𝑃(𝑍 > 2)
𝑃(𝑁 > 0.9) = 𝑃(𝑍 > 2) = 1 − 𝑃(𝑍 ≤ 2)
𝑃 (𝑍 > 2) = 1 − 𝑃(𝑍 ≤ 2) = 1 − 0.977250 = 0.02275
Therefore: 𝑷(𝑵 > 𝟎. 𝟗) = 𝟎. 𝟎𝟐𝟐𝟕𝟓
This probability can be described as the probability of a false detection.
b. Determine symmetric bounds about 0 that include 99% of all noise
readings.
The question requires us to find x such that 𝑃(−𝑥 < 𝑁 < 𝑥 ) = 0.99
Graphically,
𝑃(𝑍 < −2.58) = 0.004940 ≈ 0.005
Therefore the value of 𝑧 such that 𝑃(−𝑧 < 𝑍 < 𝑧) = 0.99 is 2.58
Converting to N:
𝑧 = (𝑁 − 𝜇)/𝜎; at 𝑧 = 2.58:
2.58 = (𝑁 − 0)/0.45, so 𝑁 = 1.161 V
P(Z < 1.4) = 0.919243
Since P(Z < −4.6) is less than the smallest entry in the table
𝑃 (𝑍 < −3.99) = 0.000033, it is approximated as zero.
P(−4.6 < Z < 1.4) = P(Z < 1.4) − P(Z < −4.6)
= 0.919243 − 0
= 0.919243
Therefore:
𝑷(𝟎. 𝟐𝟒𝟖𝟓 < 𝑿 < 𝟎. 𝟐𝟓𝟏𝟓) = 𝟎. 𝟗𝟏𝟗𝟐𝟒𝟑
This means that 91.92% of the shafts conforms to the specifications.
From Example 12, most of the nonconforming shafts are too large, because the
process mean is located very near to the upper specification limit. If the
process is centered so that the process mean is equal to the target value
of 0.2500, the required probability 𝑃(0.2485 < X < 0.2515) will be
transformed to the standard normal random variable as 𝑍 = (𝑋 − 𝜇)/𝜎:
At 𝑋 = 0.2485: 𝑍 = (0.2485 − 0.2500)/0.0005 = −3
At 𝑋 = 0.2515: 𝑍 = (0.2515 − 0.2500)/0.0005 = 3
Therefore: 𝑃(0.2485 < X < 0.2515) = P(−3 < Z < 3)
P(−3 < Z < 3) = P(Z < 3) − P(Z < −3)
𝑷(𝒁 < −𝟑) = 𝟎. 𝟎𝟎𝟏𝟑𝟓𝟎
Therefore:
𝑷(−𝟑 < 𝒁 < 𝟑) = 𝑷(𝒁 < 𝟑) − 𝑷(𝒁 < −𝟑)
= 0.998650 − 0.001350
= 0.9973
This means that by recentering the process, the yield is increased to
approximately 99.73%.
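The recentering comparison can be reproduced with `statistics.NormalDist` (a sketch using the shaft example's specification limits and sigma = 0.0005; the off-center mean of 0.2508 is inferred from the z-values 1.4 and −4.6 used above):

```python
from statistics import NormalDist

spec_lo, spec_hi, sigma = 0.2485, 0.2515, 0.0005

def conforming_fraction(mu):
    """P(spec_lo < X < spec_hi) for shaft diameter X ~ Normal(mu, sigma)."""
    d = NormalDist(mu, sigma)
    return d.cdf(spec_hi) - d.cdf(spec_lo)

off_center = conforming_fraction(0.2508)   # ≈ 0.9192, as in Example 12
centered = conforming_fraction(0.2500)     # P(-3 < Z < 3) ≈ 0.9973
```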
Activity
3. The time it takes a cell to divide (called mitosis) is normally distributed with an
average time of one hour and a standard deviation of 5 minutes.
a. What is the probability that a cell divides in less than 45 minutes?
b. What is the probability that it takes a cell more than 65 minutes to divide?
c. What is the time that it takes approximately 99% of all cells to complete
mitosis?
Sampling Distribution and Estimation
______________________________________________
MODULE 5
Session 1
Sampling Distribution
Lecture:
INTRODUCTION
In statistics, a population is the entire pool from which a statistical sample is drawn. A
population may refer to an entire group of people, objects, events, hospital visits, or
measurements. A population can thus be said to be an aggregate observation of subjects
grouped together by a common feature.
Now suppose that instead of taking just one sample of 100 newborn weights from each
continent, the medical researcher takes repeated random samples from the general
population, and computes the sample mean for each sample group. So, for North
America, he pulls up data for 100 newborn weights recorded in the US, Canada and
Mexico as follows: four sets of 100 samples from select hospitals in the US, five sets of
70 samples from Canada, and three sets of 150 records from Mexico, for a total of 1,200
weights of newborn babies grouped in 12 sets. He also collects sample data of 100 birth weights from each of the
12 countries in South America.
The collection of average weights computed across the sample sets forms the sampling
distribution of the mean. Not only the mean can be calculated from a sample. Other statistics, such as the
standard deviation, variance, proportion, and range can be calculated from sample data.
The standard deviation and variance measure the variability of the sampling distribution.
Knowing how spread apart the mean of each of the sample sets are from each other and
from the population mean will give an indication of how close the sample mean is to the
population mean. The standard error of the sampling distribution decreases as the sample
size increases.
SPECIAL CONSIDERATIONS
A population or one sample set of numbers will have a normal distribution. However,
because a sampling distribution includes multiple sets of observations, it will not
necessarily have a bell-curved shape.
Following our example, the population average weight of babies in North America and in
South America has a normal distribution because some babies will be underweight (below
the mean) or overweight (above the mean), with most babies falling in between (around
the mean). If the average weight of newborns in North America is seven pounds, the
sample mean weight in each of the 12 sets of sample observations recorded for North
America will be close to seven pounds as well.
However, if you graph each of the averages calculated in each of the 1,200 sample
groups, the resulting shape may result in a uniform distribution, but it is difficult to predict
with certainty what the actual shape will turn out to be. The more samples the researcher
uses from the population of over a million weight figures, the more the graph will start
forming a normal distribution.
UNDERSTANDING DIFFERENT SAMPLING METHODS
When you conduct research about a group of people, it’s rarely possible to collect data
from every person in that group. Instead, you select a sample. The sample is the group
of individuals who will actually participate in the research.
To draw valid conclusions from your results, you have to carefully decide how you will
select a sample that is representative of the group as a whole. As discussed in the
previous lectures, there are two types of sampling methods: probability sampling and
non-probability sampling.
POPULATION VS SAMPLE
First, you need to understand the difference between a population and a sample, and
identify the target population of your research.
The population is the entire group that you want to draw conclusions about.
The sample is the specific group of individuals that you will collect data from.
The population can be defined in terms of geographical location, age, income, and many
other characteristics.
It can be very broad or quite narrow: maybe you want to make inferences about the whole
adult population of your country; maybe your research focuses on customers of a certain
company, patients with a specific health condition, or students in a single school.
It is important to carefully define your target population according to the purpose and
practicalities of your project.
If the population is very large, demographically mixed, and geographically dispersed, it
might be difficult to gain access to a representative sample.
SAMPLING FRAME
The sampling frame is the actual list of individuals that the sample will be drawn from.
Ideally, it should include the entire target population (and nobody who is not part of that
population).
Example:
You are doing research on working conditions at Company X. Your population is all 1000
employees of the company. Your sampling frame is the company’s HR database which
lists the names and contact details of every employee.
SAMPLE SIZE
The number of individuals in your sample depends on the size of the population, and on
how precisely you want the results to represent the population as a whole.
You can use a sample size calculator or compute using different methods that will be
discussed later to determine how big your sample should be. In general, the larger the
sample size, the more accurately and confidently you can make inferences about the
whole population.
The central limit theorem states that if you have a population with mean μ and standard
deviation σ and take sufficiently large random samples from the population with
replacement, then the distribution of the sample means will be
approximately normally distributed. This will hold true regardless of whether the source
population is normal or skewed, provided the sample size is sufficiently large (usually n >
30). If the population is normal, then the theorem holds true even for samples smaller
than 30. In fact, this also holds true even if the population is binomial, provided that
min(np, n(1-p))> 5, where n is the sample size and p is the probability of success in the
population. This means that we can use the normal probability model to quantify
uncertainty when making inferences about a population mean based on the sample
mean.
For the random samples we take from the population, we can compute the mean of the
sample means:
μ_x̄ = μ
Before illustrating the use of the Central Limit Theorem (CLT), we will first illustrate the
result. In order for the result of the CLT to hold, the sample must be sufficiently large (n
≥ 30). Again, there are two exceptions to this. If the population is normal, then the result
holds for samples of any size (i.e., the sampling distribution of the sample means will be
approximately normal even for samples of size less than 30). If the population is binomial,
the result holds provided that min(np, n(1−p)) > 5.
If we take simple random samples (with replacement) of size n=10 from the population
and compute the mean for each of the samples, the distribution of sample means should
be approximately normal according to the Central Limit Theorem. Note that the sample
size (n=10) is less than 30, but the source population is normally distributed, so this is not
a problem. The distribution of the sample means is illustrated below. Note that the
horizontal axis is different from the previous illustration, and that the range is narrower.
The mean of the sample means is 75 and the standard deviation of the sample means is
2.53, computed as follows:

σ_x̄ = σ/√n = 8/√10 = 2.53
If we were to take samples of n=5 instead of n=10, we would get a similar distribution, but
the variation among the sample means would be larger. In fact, when we did this we got
a sample mean = 75 and a sample standard deviation = 3.6.
The Central Limit Theorem applies even to binomial populations like this provided that
the minimum of np and n (1-p) is at least 5, where "n" refers to the sample size, and "p"
is the probability of "success" on any given trial. In this case, we will take samples of n=20
with replacement, so min (np, n (1-p)) = min (20(0.3), 20(0.7)) = min (6, 14) = 6. Therefore,
the criterion is met.
We saw previously that the population mean and standard deviation for a binomial
distribution are:
𝜇 = 𝑛𝑝
Standard deviation:
𝜎 = √𝑛𝑝𝑞
The distribution of sample means based on samples of size n=20 is shown below.
μ_x̄ = np = 20(0.3) = 6

σ_x̄ = σ/√n = √(npq)/√n = √(20(0.3)(0.7))/√20 = 0.46
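These calculations can be checked with a few lines of Python mirroring the numbers above:

```python
import math

n, p = 20, 0.3
q = 1 - p

# CLT criterion for a binomial population: min(np, n(1-p)) must be at least 5.
criterion = min(n * p, n * q)
mu_xbar = n * p                              # mean of the sample means
se = math.sqrt(n * p * q) / math.sqrt(n)     # sigma / sqrt(n) = sqrt(npq) / sqrt(n)

print(criterion >= 5)        # the criterion is met
print(round(mu_xbar), round(se, 2))
```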
Now, instead of taking samples of n=20, suppose we take simple random samples (with
replacement) of size n=10. Note that in this scenario we do not meet the sample size
requirement for the Central Limit Theorem (i.e., min (np, n (1-p)) = min (10(0.3), 10(0.7))
= min(3, 7) = 3). The distribution of sample means based on samples of size n=10 is
shown on the right, and you can see that it is not quite normally distributed. The sample
size must be larger in order for the distribution to approach normality.
The Poisson distribution is another probability model that is useful for modeling discrete
variables such as the number of events occurring during a given time interval. For
example, suppose you typically receive about 4 spam emails per day, but the number
varies from day to day. Today you happened to receive 5 spam emails. What is the
probability of that happening, given that the typical rate is 4 per day? The Poisson
probability is:
P(x; μ) = (e^(−μ) · μ^x) / x!
Mean = µ
Standard deviation = 𝜎 = √𝜇
The mean for the distribution is μ (the average or typical rate), "X" is the actual number
of events that occur ("successes"), and "e" is the constant approximately equal to
2.71828. So, in the example above
P(5; 4) = (2.71828^(−4))(4^5) / 5! = 0.1563
Now let's consider another Poisson distribution, with μ = 3 and σ = 1.73. The distribution is
shown in the figure below.
This population is not normally distributed, but the Central Limit Theorem will apply if n >
30. In fact, if we take samples of size n=30, we obtain samples distributed as shown in
the first graph below with a mean of 3 and standard deviation = 0.32. In contrast, with
small samples of n=10, we obtain samples distributed as shown in the lower graph. Note
that n=10 does not meet the criterion for the Central Limit Theorem, and the small
samples on the right give a distribution that is not quite normal. Also note that the
standard deviation of the sample means (the "standard error") is larger with smaller
samples, because it is obtained by dividing the population standard deviation by the
square root of the sample size. Another way of thinking about this is that extreme values
will have less impact on the sample mean when the sample size is large.
σ_x̄ = σ/√n = 1.73/√30 = 0.3158

σ_x̄ = σ/√n = 1.73/√10 = 0.5471
EXAMPLE 1:
Data from the Framingham Heart Study found that subjects over age 50 had a mean HDL
of 54 mg/dl and a standard deviation of 17 mg/dl. Suppose a physician has 40 patients
over age 50 and wants to determine the probability that the mean HDL cholesterol for this
sample of 40 patients is 60 mg/dl or more (i.e., low risk). Probability questions about a
sample mean can be addressed with the Central Limit Theorem, as long as the sample
size is sufficiently large. In this case n = 40, so the sample mean is likely to be
approximately normally distributed, and we can compute the probability that the sample
mean HDL exceeds 60 by using the standard normal distribution table.
The population mean is 54, but the question is what is the probability that the sample
mean will be >60?
In general,
z = (x − μ) / σ

For a sample mean,

z = (x̄ − μ_x̄) / σ_x̄ = (x̄ − μ) / (σ/√n)

z = (60 − 54) / (17/√40) = 2.23
P(z > 2.23) can be looked up in the standard normal distribution table; because the table
gives P(z ≤ 2.23) = 0.9871, we compute it as P(z > 2.23) = 1 − 0.9871 = 0.0129.
Therefore, the probability that the mean HDL in these 40 patients will exceed 60 is 1.29%.
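The same computation can be done in Python, using the standard library's error function in place of a printed z table:

```python
import math

def phi(z):
    """Standard normal CDF, via the error function in the math module."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

mu, sigma, n = 54, 17, 40
z = (60 - mu) / (sigma / math.sqrt(n))
p = 1 - phi(z)

print(round(z, 2))        # 2.23
print(round(p, 4))        # about 0.013, matching the table value 0.0129
```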
EXAMPLE 2:
Assume that the weight of 2,000 students from TUP are normally distributed with a mean
of 45 kgs and standard deviation of 2 kgs. If 100 samples consisting of 25 students each
are obtained, what would be the expected mean and standard deviation of the resulting
sampling distribution of means if sampling were done with replacement? What is the
probability that the mean is between a) 44.7 and 46.1? b) less than 46 kgs?
μ_x̄ = μ = 45,  σ_x̄ = σ/√n = 2/√25 = 0.4

a) P(44.7 ≤ x̄ ≤ 46.1) = P((44.7 − 45)/0.4 ≤ z ≤ (46.1 − 45)/0.4) = P(−0.75 ≤ z ≤ 2.75)
= 0.2734 + 0.4970 = 0.7704 or 77.04%

b) P(x̄ ≤ 46 kgs): z = (46 − 45)/0.4 = 2.5, so P(x̄ ≤ 46) = P(z ≤ 2.5) = 0.9938 or 99.38%
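Both parts can be checked in Python, again approximating the z table with the error function:

```python
import math

def phi(z):
    """Standard normal CDF."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

mu, sigma, n = 45, 2, 25
se = sigma / math.sqrt(n)                     # 0.4

p_between = phi((46.1 - mu) / se) - phi((44.7 - mu) / se)
p_less = phi((46 - mu) / se)

print(round(p_between, 4))    # about 0.7704
print(round(p_less, 4))       # about 0.9938
```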
Activity:
1) Given a standard normal distribution, find the area under the curve that lies (a) to the
right of z = 1.84, and (b) between z = – 1.97 and z = 0.86.
2) Given a standard normal distribution, find the value of k such that P (z > k) = 0.3015.
3) Given a standard normal distribution, find the value of k such that P (k < z < –
   0.18) = 0.4197.
4) Given a random variable x having a normal distribution with μ = 50 and σ = 10, find the
probability that x assumes a value between 45 and 62.
5) Given that x has a normal distribution with μ = 300 and σ = 50, find the probability that
x assumes a value greater than 362.
6) Given a normal distribution with μ = 40 and σ = 6, find the value of x that has (a)
   45% of the area to the left, and (b) 14% of the area to the right.
7) A certain type of storage battery lasts, on average, 3.0 years, with a standard
deviation of 0.5 year. Assuming that the battery lives are normally distributed, find the
probability that a given battery will last less than 2.3 years.
8) An electrical firm manufactures light bulbs that have a life, before burn-out, that is
normally distributed with mean equal to 800 hours and a standard deviation of 40
hours. Find the probability that a bulb burns between 778 and 834 hours.
10) In an industrial process the diameter of a ball bearing is an important component part.
    The buyer sets specifications on the diameter to be 3.0 ± 0.01 cm. The implication is
    that no part falling outside these specifications will be accepted. It is known that in the
    process the diameter of a ball bearing has a normal distribution with mean 3.0 and
    standard deviation 0.005. On the average, how many manufactured ball bearings will
    be scrapped? Hint: x1 = 3.0 − 0.01, and x2 = 3.0 + 0.01.
Session 2
Estimation
By the end of this session you should be able to:
1. Explain the concept of an interval estimate of the population mean.
2. Apply formulas for a confidence interval for a population mean or proportion and for
   the difference between two population means or proportions.
Lecture:
INTRODUCTION
In statistics, estimation refers to the process by which one makes inferences about a
population, based on information obtained from a sample.
CONFIDENCE INTERVALS
Statisticians use a confidence interval to express the precision and uncertainty associated
with a particular sampling method. A confidence interval consists of three parts.
A confidence level.
A statistic.
A margin of error.
The confidence level describes the uncertainty of a sampling method. The statistic and
the margin of error define an interval estimate that describes the precision of the method.
The interval estimate of a confidence interval is defined by the sample statistic ± margin
of error.
For example, suppose we compute an interval estimate of a population parameter. We
might describe this interval estimate as a 95% confidence interval. This means that if we
used the same sampling method to select different samples and compute different interval
estimates, the true population parameter would fall within a range defined by the sample
statistic ± margin of error 95% of the time.
CONFIDENCE LEVEL
The probability part of a confidence interval is called a confidence level. The confidence
level describes the likelihood that a particular sampling method will produce a confidence
interval that includes the true population parameter.
Here is how to interpret a confidence level. Suppose we collected all possible samples
from a given population, and computed confidence intervals for each sample. Some
confidence intervals would include the true population parameter; others would not. A
95% confidence level means that 95% of the intervals contain the true population
parameter; a 90% confidence level means that 90% of the intervals contain the population
parameter; and so on.
MARGIN OF ERROR
In a confidence interval, the range of values above and below the sample statistic is called
the margin of error.
For example, suppose the local newspaper conducts an election survey and reports that
the independent candidate will receive 30% of the vote. The newspaper states that the
survey had a 5% margin of error and a confidence level of 95%. These findings result in
the following confidence interval: We are 95% confident that the independent candidate
will receive between 25% and 35% of the vote.
Note: Many public opinion surveys report interval estimates, but not confidence intervals.
They provide the margin of error, but not the confidence level. To clearly interpret survey
results you need to know both! We are much more likely to accept survey findings if the
confidence level is high (say, 95%) than if it is low (say, 50%).
ESTIMATING µ WITH LARGE SAMPLES
The most fundamental point and interval estimation process involves the estimation of a
population mean. Suppose it is of interest to estimate the population mean, μ, for a
quantitative variable. Data collected from a simple random sample can be used to
compute the sample mean, x̄, where the value of x̄ provides a point estimate of μ.
When the sample mean is used as a point estimate of the population mean, some error
can be expected owing to the fact that a sample, or subset of the population, is used to
compute the point estimate. The absolute value of the difference between the sample
mean, x̄, and the population mean, μ, written |x̄ − μ|, is called the sampling error. Interval
estimation incorporates a probability statement about the magnitude of the sampling
error. The sampling distribution of x̄ provides the basis for such a statement.
Statisticians have shown that the mean of the sampling distribution of x̄ is equal to the
population mean, μ, and that the standard deviation is given by σ/√n, where σ is the
population standard deviation. The standard deviation of a sampling distribution is called
the standard error. For large sample sizes, the central limit theorem indicates that the
sampling distribution of x̄ can be approximated by a normal probability distribution. As a
matter of practice, statisticians usually consider samples of size 30 or more to be large.
• the reliability of an estimate will be measured by the confidence level, c
• zc is the critical value for a confidence level of c
ERROR OF ESTIMATE
E is the maximal error tolerance on the error of estimate |x̄ − μ| for a given confidence
level c, computed using the formula:

E = z_c · s/√n
The confidence interval for µ is:
𝑥̅ − 𝐸 < 𝜇 < 𝑥̅ + 𝐸
EXAMPLE:
Julia enjoys jogging. She has been jogging over a period of several years, during which
time her physical condition has remained constantly good. Usually she jogs 2 miles per
day. During the past year Julia has sometimes recorded her times required to run 2 miles.
She has a sample of 90 of these times. For these 90 times the mean was 15.60 minutes
and the standard deviation was 1.80. Find a 0.95 confidence interval for µ.
E = z_c · s/√n = 1.96 × 1.80/√90 = 0.37
𝑥̅ − 𝐸 < 𝜇 < 𝑥̅ + 𝐸
15.6 − 0.37 < 𝜇 < 15.6 + 0.37
15.23 < 𝜇 < 15.97
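A short Python sketch reproducing this interval:

```python
import math

xbar, s, n = 15.60, 1.80, 90
zc = 1.96                      # critical value for a 0.95 confidence level

E = zc * s / math.sqrt(n)      # maximal error of estimate
lo, hi = xbar - E, xbar + E

print(round(E, 2))                  # 0.37
print(round(lo, 2), round(hi, 2))   # 15.23 15.97
```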
If the population standard deviation is unknown and the sample size n is small then when
we substitute the sample standard deviation s for σ the normal approximation is no longer
valid. The solution is to use a different distribution, called Student’s t-distribution with n−1
degrees of freedom. Student’s t-distribution is very much like the standard normal
distribution in that it is centered at 0 and has the same qualitative bell shape, but it has
heavier tails than the standard normal distribution does, as indicated by the figure below,
in which the curve (in brown) that meets the dashed vertical line at the lowest point is the
t-distribution with two degrees of freedom, the next curve (in blue) is the t-distribution with
five degrees of freedom, and the thin curve (in red) is the standard normal distribution. As
also indicated by the figure, as the sample size n increases, Student’s t-distribution ever
more closely resembles the standard normal distribution. Although there is a different t-
distribution for every value of n, once the sample size is 30 or more it is typically
acceptable to use the standard normal distribution instead, as we will always do in this
text.
Student’s t distribution
Just as the symbol zc stands for the value that cuts off a right tail of area c in the standard
normal distribution, so the symbol tc stands for the value that cuts off a right tail of area c
in Student's t-distribution. This gives us the following confidence interval formulas.
t = (x̄ − μ) / (s/√n)
Degrees of freedom: d.f. = n-1
E = t_c · s/√n
The confidence interval for µ is:
𝑥̅ − 𝐸 < 𝜇 < 𝑥̅ + 𝐸
EXAMPLE:
A company has a new process for manufacturing large artificial sapphires. The production
of each gem is expensive, so the number available for examination is limited. In a trial run
12 sapphires are produced. The mean weight for these 12 gems is 6.75 carats, and the
sample standard deviation is 0.33 carats. Find a 95% confidence interval for the mean.
𝑡𝑐 = 2.201 𝑤𝑖𝑡ℎ 𝑑. 𝑓. = 12 − 1 = 11
E = t_c · s/√n

E = 2.201 × 0.33/√12 = 0.21
𝑥̅ − 𝐸 < 𝜇 < 𝑥̅ + 𝐸
6.75 − 0.21 < 𝜇 < 6.75 + 0.21
6.54 < 𝜇 < 6.96
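The same interval in Python. The t critical value 2.201 is taken from a t table, as in the example, since the standard library has no t-distribution:

```python
import math

xbar, s, n = 6.75, 0.33, 12
tc = 2.201                     # t critical value, d.f. = 11, 95% confidence (from a table)

E = tc * s / math.sqrt(n)
print(round(E, 2))                             # 0.21
print(round(xbar - E, 2), round(xbar + E, 2))  # 6.54 6.96
```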
ESTIMATING P IN BINOMIAL DISTRIBUTION
The binomial distribution is completely determined by the number of trials n and the
probability p of success in a single trial. For most experiments, the number of trials is
chosen in advance. Then the distribution is completely determined by p.
Point estimate for p
p̂ = r/n, where r is the number of successes in n trials

q̂ = 1 − p̂

E = z_c √(p̂q̂ / n)
𝑝̂ − 𝐸 < 𝑝 < 𝑝̂ + 𝐸
E = maximal error tolerance of the error of estimate |𝑝̂ − 𝑝| for a confidence level c.
EXAMPLE:
A random sample of 188 books purchased showed that 66 of the books were murder
mysteries.
What is the point estimate for p?
Find a 90% confidence interval for p.
p̂ = r/n = 66/188 = 0.35

q̂ = 1 − p̂ = 1 − 0.35 = 0.65

E = z_c √(p̂q̂/n) = 1.645 × √(0.35 × 0.65/188) = 0.06
𝑝̂ − 𝐸 < 𝑝 < 𝑝̂ + 𝐸
0.35 − 0.06 < 𝑝 < 0.35 + 0.06
0.29 < 𝑝 < 0.41
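A Python sketch of this proportion interval:

```python
import math

r, n = 66, 188
p_hat = r / n                  # point estimate for p
q_hat = 1 - p_hat
zc = 1.645                     # critical value for 90% confidence

E = zc * math.sqrt(p_hat * q_hat / n)
print(round(p_hat, 2))                            # 0.35
print(round(p_hat - E, 2), round(p_hat + E, 2))   # 0.29 0.41
```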
Suppose we wish to compare the means of two distinct populations. The figure below
illustrates the conceptual framework of our investigation in this and the next section. Each
population has a mean and a standard deviation. We arbitrarily label one population as
Population 1 and the other as Population 2, and subscript the parameters with the
numbers 1 and 2 to tell them apart. We draw a random sample from Population 1 and
label the sample statistics it yields with the subscript 1. Without reference to the first
sample we draw a sample from Population 2 and label its sample statistics with the
subscript 2.
Our goal is to use the information in the samples to estimate the difference μ1−μ2 in the
means of the two populations and to make statistically valid inferences about it.
POINT ESTIMATION OF (μ1 − μ2), LARGE SAMPLES

E = z_c √(s1²/n1 + s2²/n2)
(𝑥̅1 − 𝑥̅ 2 ) − 𝐸 < 𝜇1 − 𝜇2 < (𝑥̅1 − 𝑥̅ 2 ) + 𝐸
EXAMPLE:
Company 1 Company 2
n1 = 174 n2 = 355
𝑥̅ 1 = 3.51 𝑥̅ 2 = 3.24
s1 = 0.51 s2 = 0.52
Construct a point estimate and a 99% confidence interval for μ1−μ2, the difference in
average satisfaction levels of customers of the two companies as measured on this five-
point scale.
E = z_c √(s1²/n1 + s2²/n2)
E = 2.58 × √(0.51²/174 + 0.52²/355) = 0.123
(𝑥̅1 − 𝑥̅ 2 ) − 𝐸 < 𝜇1 − 𝜇2 < (𝑥̅1 − 𝑥̅ 2 ) + 𝐸
(3.51 − 3.24) − 0.123 < 𝜇1 − 𝜇2 < (3.51 − 3.24) + 0.123
0.27 − 0.123 < 𝜇1 − 𝜇2 < 0.27 + 0.123
0.147 < 𝜇1 − 𝜇2 < 0.393
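The same computation in Python:

```python
import math

n1, x1, s1 = 174, 3.51, 0.51   # Company 1
n2, x2, s2 = 355, 3.24, 0.52   # Company 2
zc = 2.58                      # critical value for 99% confidence

E = zc * math.sqrt(s1**2 / n1 + s2**2 / n2)
d = x1 - x2                    # point estimate of mu1 - mu2

print(round(E, 3))                         # 0.123
print(round(d - E, 3), round(d + E, 3))    # 0.147 0.393
```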
POINT ESTIMATION OF (μ1 − μ2), SMALL SAMPLES

For small samples (n < 30) with approximately equal population variances, use the
Student's t-distribution with d.f. = n1 + n2 − 2 and the pooled standard deviation s:

E = t_c · s · √(1/n1 + 1/n2)
(𝑥̅1 − 𝑥̅ 2 ) − 𝐸 < 𝜇1 − 𝜇2 < (𝑥̅1 − 𝑥̅ 2 ) + 𝐸
EXAMPLE:
Group I
16 19.6 19.9 20.9 20.1 20.1 16.4 20.6
20.1 22.3 18.8 19.1 17.4 21.1 22.1
Group II
8.2 5.4 6.8 6.5 4.7 5.9 2.9 7.6
10.2 6.4 8.8 5.4 8.3 5.4
Group 1 Group 2
n1=15 n2=14
𝑥̅ 1=19.63 𝑥̅ 2=6.61
s1=1.86 s2=1.89
t_c = 1.703 (90% confidence level), d.f. = 15 + 14 − 2 = 27
E = t_c · s · √(1/n1 + 1/n2)

with pooled standard deviation s = √(((15 − 1)(1.86)² + (14 − 1)(1.89)²)/27) = 1.87, so

E = (1.703)(1.87)√(1/15 + 1/14) = 1.18
(𝑥̅1 − 𝑥̅ 2 ) − 𝐸 < 𝜇1 − 𝜇2 < (𝑥̅1 − 𝑥̅ 2 ) + 𝐸
(19.63 − 6.61) − 1.18 < 𝜇1 − 𝜇2 < (19.63 − 6.61) + 1.18
13.02 − 1.18 < 𝜇1 − 𝜇2 < 13.02 + 1.18
11.84 < 𝜇1 − 𝜇2 < 14.2
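A Python sketch that also shows where the pooled standard deviation s ≈ 1.87 comes from (the t critical value is again taken from a table):

```python
import math

n1, x1, s1 = 15, 19.63, 1.86   # Group I
n2, x2, s2 = 14, 6.61, 1.89    # Group II
df = n1 + n2 - 2

# Pooled standard deviation of the two samples.
s = math.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / df)
tc = 1.703                     # t critical value for d.f. = 27 (from a t table)

E = tc * s * math.sqrt(1 / n1 + 1 / n2)
d = x1 - x2

print(round(s, 2))             # 1.87
print(round(E, 2))             # close to the 1.18 obtained above
```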
EXAMPLE:
The book Survey Responses: An Evaluation of Their Validity, by E. J. Wentland and K.
Smith, includes studies reporting the accuracy of answers to survey questions. A study
by Locander et al. considered the question, "Are you a registered voter?" Accuracy of
response was confirmed by a check of city voting records. Two methods of survey were
used: a face-to-face interview and a telephone interview. A random sample of 93 people
was asked the voter registration question during a telephone interview; 84 respondents
gave accurate answers. Another random sample of 83 people was asked the same
question face to face; 69 respondents gave accurate answers. Find a 95% confidence
interval for p1 − p2.
p̂1 = 84/93 = 0.90,  q̂1 = 0.10

p̂2 = 69/83 = 0.83,  q̂2 = 0.17

E = 1.96 √(0.90 × 0.10/93 + 0.83 × 0.17/83) = 0.10
(𝑝̂1 − 𝑝̂2 ) − 𝐸 < 𝑝1 − 𝑝2 < (𝑝̂1 − 𝑝̂2 ) + 𝐸
(0.90 − 0.83) − 0.10 < 𝑝1 − 𝑝2 < (0.90 − 0.83) + 0.10
0.07 − 0.10 < 𝑝1 − 𝑝2 < 0.07 + 0.10
−0.03 < p1 − p2 < 0.17
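A Python check of this two-proportion interval:

```python
import math

r1, n1 = 84, 93    # telephone interview: accurate answers, sample size
r2, n2 = 69, 83    # face-to-face interview
zc = 1.96          # critical value for 95% confidence

p1, p2 = r1 / n1, r2 / n2
E = zc * math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
d = p1 - p2

print(round(d, 2))                        # 0.07
print(round(d - E, 2), round(d + E, 2))   # -0.03 0.17
```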
SAMPLE SIZE
SLOVIN'S FORMULA
Slovin's formula is used to calculate the sample size (n) given the population size (N) and
a margin of error (e). It is used with random sampling to estimate the required sample size.
It is computed as
n = N / (1 + Ne²)
where:
n = no. of samples
N = total population
e = error margin / margin of error
If a sample is taken from a population, a formula must be used to take into account
confidence levels and margins of error. When taking statistical samples, sometimes a lot
is known about a population, sometimes a little and sometimes nothing at all. For
example, we may know that a population is normally distributed (e.g., for heights, weights
or IQs), we may know that there is a bimodal distribution (as often happens with class
grades in mathematics classes) or we may have no idea about how a population is going
to behave (such as polling college students to get their opinions about quality of student
life). Slovin's formula is used when nothing about the behavior of a population is known
at all.
EXAMPLE:
To use the formula, first figure out what you want your margin of error to be. For
example, you may be happy with a confidence level of 95 percent (giving a margin of error
of 0.05), or you may require a tighter accuracy of a 98 percent confidence level (a margin
of error of 0.02). Plug your population size and required margin of error into the formula.
The result is the number of samples you need. For a population of N = 1000 and a margin
of error of 0.05:
n = N / (1 + Ne²) = 1000 / (1 + 1000 × 0.05²) = 285.71 ≈ 286
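Slovin's formula is easy to wrap as a small helper; the sketch below rounds up because a sample size must be a whole number:

```python
import math

def slovin(N, e):
    """Slovin's formula: required sample size for population N and margin of error e."""
    return math.ceil(N / (1 + N * e**2))

print(slovin(1000, 0.05))   # 286
```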
A wildlife study is designed to find the mean weight of salmon caught by an Alaskan
fishing company. As a preliminary study, a random sample of 50 freshly caught salmon
is weighed. The sample standard deviation is 2.15 lbs. How large a sample should be
taken to be 99% confident that the sample mean is within 0.20 lb of the true mean weight?
n = (z_c σ / E)²

n = (2.58 × 2.15 / 0.20)² = 769.23, so take n = 770 salmon
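The same computation in Python, rounding up to the next whole salmon:

```python
import math

zc, sigma, E = 2.58, 2.15, 0.20   # 99% confidence, preliminary s, desired error (lb)

n = (zc * sigma / E) ** 2
print(round(n, 2))     # 769.23
print(math.ceil(n))    # round up: 770 salmon
```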
EXAMPLE:
A buyer wants to estimate the proportion p of popcorn kernels that will pop, and wants to
be 95% sure that the point estimate for p will be in error either way by less than 0.01.
With no preliminary estimate for p:

n = (1/4)(z_c / E)²

n = (1/4)(1.96 / 0.01)² = 9604
Now consider the scenario in which the buyer already has a preliminary estimate that 90%
of the kernels will pop:

n = p(1 − p)(z_c / E)²
n = 0.90(1 − 0.90)(1.96 / 0.01)² = 3457.44, so take n = 3458
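Both sample-size calculations in one Python sketch:

```python
zc, E = 1.96, 0.01     # 95% confidence, desired maximal error

# Without a preliminary estimate, the conservative choice p = 0.5
# gives n = (1/4)(zc/E)^2.
n_worst = 0.25 * (zc / E) ** 2

# With a preliminary estimate p = 0.90, the required sample is much smaller.
p = 0.90
n_est = p * (1 - p) * (zc / E) ** 2

print(round(n_worst))    # 9604
print(round(n_est, 2))   # 3457.44
```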
Activity:
2) How large a sample is required in Problem [1] if we want to be 95% confident that our
estimate of μ is off by less than 0.05?
3) The contents of 7 similar containers of sulfuric acid are 9.8, 10.2, 10.4, 9.8, 10.0, 10.2,
and 9.6 liters. Find a 95% confidence interval for the mean of all such containers,
assuming an approximate normal distribution.
4) In a random sample of n = 500 families owning television sets in the city of Hamilton,
Canada, it is found that x = 340 subscribed to HBO. Find a 95% confidence interval
for the actual proportion of families in this city who subscribe to HBO.
5) How large a sample is required in Problem [4] if we want to be at least 95% confident
that our estimate of p is within 0.02?
milligrams per liter. Find a 95% confidence interval for the difference in the true
average orthophosphorus contents at these two stations, assuming that the
observations came from normal populations with different variances.
Test of Hypothesis
______________________________________________
MODULE 6
Session 1
Introduction to Hypothesis Testing
Lecture:
INTRODUCTION
The statistician R. A. Fisher explained the concept of hypothesis testing with a story of a
lady tasting tea. Here we will present an example based on James Bond who insisted
that martinis should be shaken rather than stirred. Let's consider a hypothetical
experiment to determine whether Mr. Bond can tell the difference between a shaken and
a stirred martini. Suppose we gave Mr. Bond a series of 16 taste tests. In each test, we
flipped a fair coin to determine whether to stir or shake the martini. Then we presented
the martini to Mr. Bond and asked him to decide whether it was shaken or stirred. Let's
say Mr. Bond was correct on 13 of the 16 taste tests. Does this prove that Mr. Bond has
at least some ability to tell whether the martini was shaken or stirred?
This result does not prove that he does; it could be he was just lucky and guessed right
13 out of 16 times. But how plausible is the explanation that he was just lucky? To assess
its plausibility, we determine the probability that someone who was just guessing would
be correct 13/16 times or more. This probability can be computed from the binomial
distribution, and the binomial distribution calculator shows it to be 0.0106. This is a pretty
low probability, and therefore someone would have to be very lucky to be correct 13 or
more times out of 16 if they were just guessing. So either Mr. Bond was very lucky, or he
can tell whether the drink was shaken or stirred. The hypothesis that he was guessing is
not proven false, but considerable doubt is cast on it. Therefore, there is strong evidence
that Mr. Bond can tell whether a drink was shaken or stirred.
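This tail probability can be verified directly from the binomial distribution with a few lines of Python:

```python
from math import comb

n, k = 16, 13
# Probability of guessing correctly 13 or more times out of 16 when each
# guess is right with probability 0.5 (binomial upper tail).
p_value = sum(comb(n, i) for i in range(k, n + 1)) / 2**n
print(round(p_value, 4))   # 0.0106
```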
STATISTICAL HYPOTHESIS
A statistical hypothesis is an assumption or claim about a population parameter; it can
really be anything at all as long as you can put it to the test.
HYPOTHESIS TESTING
Hypothesis testing in statistics is a way for you to test the results of a survey or experiment
to see if you have meaningful results. You’re basically testing whether your results are
valid by figuring out the odds that your results have happened by chance. If your results
may have happened by chance, the experiment won’t be repeatable and so has little use.
Hypothesis testing can be one of the most confusing aspects for students, mostly
because before you can even perform a test, you have to know what your null hypothesis
is. Often, those tricky word problems that you are faced with can be difficult to decipher.
But it’s easier than you think; all you need to do is follow the steps in hypothesis testing:
• Problem identification
• State the null hypothesis (Ho) and the alternative hypothesis (Ha)
• Choose the level of significance (α)
• Select the appropriate test statistic and establish the critical region.
• Collect the data and compute the value of the test statistic from the sample data.
• Make the decision. Reject Ho if the value of the test statistic belongs in the critical
  region. Otherwise, fail to reject Ho.
• Draw conclusion about the population
The null hypothesis (Ho) is the commonly accepted fact that researchers work to reject,
nullify or disprove. Researchers come up with an alternative hypothesis, one that they
think explains a phenomenon, and then work to reject the null hypothesis.
The alternative hypothesis (Ha) is the operational statement of the theory that the
investigator believes to be true and wishes to prove. This is the hypothesis that sample
observations are influenced by some non-random cause. It is the contradiction of the null
hypothesis.
EXAMPLE 1:
It’s an accepted fact that ethanol boils at 173.1°F; you have a theory that ethanol actually
has a different boiling point, of over 174°F. The accepted fact (“ethanol boils at 173.1°F”)
is the null hypothesis; your theory (“ethanol boils at temperatures over 174°F”) is the
alternative hypothesis.
EXAMPLE 2:
A classroom full of students at a certain elementary school is performing at lower than
average levels on standardized tests. The low test scores are thought to be due to poor
teacher performance. However, you have a theory that the students are performing poorly
because their classroom is not as well ventilated as the other classrooms in the school.
The accepted theory (“low test scores are due to poor teacher performance”) is the null
hypothesis; your theory (“low test scores are due to inadequate ventilation in the
classroom”) is the alternative hypothesis.
The rejection region (or critical region) is the set of values of the test statistic for which
the null hypothesis will be rejected, and the acceptance region is the set of values of the
test statistic for which the null hypothesis will not be rejected. The rejection and
acceptance regions are separated by a critical value of the test statistic.
The statistical test used will be one-tailed or two-tailed depending on the nature of the
null hypothesis and the alternative hypothesis. The following hypotheses apply to a test
for the mean:
One-tailed test:
When we are interested only in extreme values that are greater than or less than a
comparative value (μ0), we use a one-tailed test for significance. When we are interested
in determining that things are different or not equal, we use a two-tailed test.
TEST STATISTIC
It is a value calculated from sample measurements and on which the statistical decision
will be based. This involves specifying the statistic that will be used to assess the validity
of the null hypothesis.
DECISION RULE
There are two types of errors that can result from a decision rule: Type I and Type II
errors. The chances of committing these two types of errors are inversely related:
decreasing the Type I error rate increases the Type II error rate, and vice versa.
A Type I error is also known as a false positive and occurs when a researcher incorrectly
rejects a true null hypothesis. This means that you report that your findings are significant
when in fact they have occurred by chance.
A type II error is also known as a false negative and occurs when a researcher fails to
reject a null hypothesis which is really false. Here a researcher concludes there is not a
significant effect, when actually there really is.
The probability of making a type I error is represented by your alpha level (α), which is
the p-value below which you reject the null hypothesis. A p-value of 0.05 indicates that
you are willing to accept a 5% chance that you are wrong when you reject the null
hypothesis.
You can reduce your risk of committing a type I error by using a lower value for p. For
example, a p-value of 0.01 would mean there is a 1% chance of committing a Type I error.
However, using a lower value for alpha means that you will be less likely to detect a true
difference if one really exists (thus risking a type II error).
The probability of making a type II error is called Beta (β), and this is related to the power
of the statistical test (power = 1- β). You can decrease your risk of committing a type II
error by ensuring your test has enough power.
You can do this by ensuring your sample size is large enough to detect a practical
difference when one truly exists.
SELECTION OF STATISTICAL TESTS
The statistical tests are classified into parametric and non-parametric tests. Parametric
tests assume underlying statistical distributions in the data. Therefore, several conditions
of validity must be met so that the result of a parametric test is reliable. For example,
Student’s t-test for two independent samples is reliable only if each sample follows a
normal distribution and if sample variances are homogeneous. Nonparametric tests do
not rely on any distribution. They can thus be applied even if parametric conditions of
validity are not met. Parametric tests often have nonparametric equivalents.
Nonparametric tests are more robust than parametric tests. In other words, they are valid
in a broader range of situations (fewer conditions of validity).
P-VALUE
The P value, or calculated probability, is the probability of finding the observed, or more
extreme, results when the null hypothesis (H0) of a study question is true – the definition
of ‘extreme’ depends on how the hypothesis is being tested. P is also described in terms
of rejecting H0 when it is actually true, however, it is not a direct probability of this state.
The null hypothesis is usually a hypothesis of "no difference" e.g. no difference between
blood pressures in group A and group B. Define a null hypothesis for each study question
clearly before the start of your study.
The only situation in which you should use a one sided P value is when a large change
in an unexpected direction would have absolutely no relevance to your study. This
situation is unusual; if you are in any doubt then use a two sided P value.
The term significance level (alpha) is used to refer to a pre-chosen probability and the
term "P value" is used to indicate a probability that you calculate after a given study.
The alternative hypothesis (H1) is the opposite of the null hypothesis; in plain language
terms this is usually the hypothesis you set out to investigate. For example, the question is
"is there a significant (not due to chance) difference in blood pressures between groups
A and B if we give group A the test drug and group B a sugar pill?" and the alternative
hypothesis is "there is a difference in blood pressures between groups A and B if we give
group A the test drug and group B a sugar pill".
If your P value is less than the chosen significance level then you reject the null hypothesis
i.e. accept that your sample gives reasonable evidence to support the alternative
hypothesis. It does NOT imply a "meaningful" or "important" difference; that is for you to
decide when considering the real-world relevance of your result.
The choice of significance level at which you reject H0 is arbitrary. Conventionally the 5%
(less than 1 in 20 chance of being wrong), 1% and 0.1% (P < 0.05, 0.01 and 0.001) levels
have been used. These numbers can give a false sense of security.
In the ideal world, we would be able to define a "perfectly" random sample, the most
appropriate test and one definitive conclusion. We simply cannot. What we can do is try
to optimize all stages of our research to minimize sources of uncertainty. When presenting
P values some groups find it helpful to use the asterisk rating system as well as quoting
the P value:
P < 0.05 *
P < 0.01 **
P < 0.001 ***
Most authors refer to statistically significant as P < 0.05 and statistically highly significant
as P < 0.001 (less than one in a thousand chance of being wrong).
The asterisk system avoids the woolly term "significant". Please note, however, that many
statisticians do not like the asterisk rating system when it is used without showing P
values. As a rule of thumb, if you can quote an exact P value then do. You might also
want to refer to a quoted exact P value as an asterisk in text narrative or tables of contrasts
elsewhere in a report.
At this point, a word about error. Type I error is the false rejection of the null hypothesis
and type II error is the false acceptance of the null hypothesis. As an aide-mémoire: think
that our cynical society rejects before it accepts.
The significance level (alpha) is the probability of type I error. The power of a test is one
minus the probability of type II error (beta). Power should be maximized when selecting
statistical methods. If you want to estimate sample sizes then you must understand all of
the terms mentioned here.
The following table shows the relationship between power and error in hypothesis testing:

TRUTH          | DECISION: Accept H0              | DECISION: Reject H0
H0 is true     | correct decision, P = 1 - alpha  | type I error, P = alpha (significance)
H0 is false    | type II error, P = beta          | correct decision, P = 1 - beta (power)

H0 = null hypothesis
P = probability
Activity:
Essay
1. What is the importance of hypothesis testing in engineering data analysis?
2. What is the role of P value in hypothesis testing?
3. Give 5 examples of events where the researcher might commit the two types of
errors.
Session 2
Parametric Statistics
Lecture:
INTRODUCTION
THE T-TEST
The t-test is used to compare two means: the means of two independent samples or two
independent groups, or the means of correlated samples taken before and after a treatment.
Ideally the t-test is used when there are fewer than 30 samples, but some researchers use
the t-test even if there are more than 30 samples.
The formula for the t-test for two independent samples is:

t = (x̄1 − x̄2) / √[((SS1 + SS2)/(n1 + n2 − 2)) (1/n1 + 1/n2)]

Where:
t = the t-test statistic
x̄1, x̄2 = the means of groups 1 and 2
SS1, SS2 = the sums of squares of groups 1 and 2
n1, n2 = the sample sizes of groups 1 and 2
EXAMPLE:
The following are the score of 10 male and 10 female AB students in spelling. Test the
null hypothesis that there is no significant difference between the performance of male
and female AB students in the test. Use the t-test at 0.05 level of significance.
Male Female
X1 X2
14 12
18 9
17 11
16 5
4 10
14 3
12 7
10 2
9 6
17 13
Problem:
Is there a significant difference between the performance of male and female students in
spelling?
Hypotheses:
Ho: There is no significant difference between the performance of male and female AB
students in spelling. (𝐻𝑜: 𝜇1 = 𝜇2 )
Ha: There is a significant difference between the performance of male and female AB
students in spelling. (𝐻𝑎: 𝜇1 ≠ 𝜇2)
Level of Significance:
α = 0.05
d.f. = n1+n2-2
= 10+10-2
= 18
t0.05 = 2.101 t-tabular value at 0.05 (two tailed test)
Statistics:
t-test for two independent samples
Computations:
Male Female
X1 X2
14 12
18 9
17 11
16 5
4 10
14 3
12 7
10 2
9 6
17 13
Σx1 = 131    Σx2 = 78
Σx1² = 1891    Σx2² = 738
x̄1 = 13.1    x̄2 = 7.8
SSx1 = Σx1² − (Σx1)²/n1

SSx1 = 1891 − 131²/10 = 174.9

SSx2 = Σx2² − (Σx2)²/n2

SSx2 = 738 − 78²/10 = 129.6
t = (x̄1 − x̄2) / √[((SS1 + SS2)/(n1 + n2 − 2)) (1/n1 + 1/n2)]

t = (13.1 − 7.8) / √[((174.9 + 129.6)/(10 + 10 − 2)) (1/10 + 1/10)] = 2.88
Decision Rule:
If the computed value is greater than or beyond the t-tabular/ critical value,
reject Ho.
Conclusion:
Since the t-computed value of 2.88 is greater than the t-tabular value of 2.101 at 0.05
level of significance with 18 degrees of freedom, the Ho is rejected in favor of the research
hypothesis. This means that there is a significant difference between the performance of
the male and female AB students in spelling.
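For readers who want to check the computation, the same test can be run in Python (a sketch, assuming SciPy is available; `equal_var=True` matches the pooled-variance formula used above):

```python
# Verify the worked independent-samples t-test above with SciPy.
from scipy import stats

male = [14, 18, 17, 16, 4, 14, 12, 10, 9, 17]
female = [12, 9, 11, 5, 10, 3, 7, 2, 6, 13]

# Pooled-variance t-test, as in the textbook formula (d.f. = 18).
t_stat, p_value = stats.ttest_ind(male, female, equal_var=True)
# t_stat is approximately 2.88 and p_value < 0.05, so Ho is rejected.
```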
For correlated (paired) samples, the t-test formula is:

t = D̄ / √[(ΣD² − (ΣD)²/n) / (n(n − 1))]

Where:
D̄ = the mean difference between the pretest and the posttest
ΣD² = the sum of the squared differences between the pretest and posttest
n = sample size
EXAMPLE:
An experimental study was conducted on the effect of programmed materials in English
on the performance of 20 selected college students. Before the program was
implemented the pre-test was administered and after 5 months the same instrument was
used to get the post test result. The following is the result of the experiment.
Problem:
Is there a significant difference between the pretest and the posttest on the use of
programmed materials in English?
Hypotheses:
Ho: There is no significant difference between the pretest and posttest or the use of
programmed materials did not affect the performance of students in English(𝐻𝑜: 𝜇1 = 𝜇2 )
Ha: The posttest result is higher than the pretest result. (𝐻1: 𝜇1 < 𝜇2 )
Level of Significance:
α = 0.05
d.f. = n-1 = 19
t0.05 = -1.729 (one-tailed test)
Computations:
Pre Post
test test D D2
20 25 -5 25
30 35 -5 25
10 25 -15 225
15 25 -10 100
20 20 0 0
10 20 -10 100
18 22 -4 16
14 20 -6 36
15 20 -5 25
20 15 5 25
18 30 -12 144
15 10 5 25
15 16 -1 1
20 25 -5 25
18 10 8 64
40 45 -5 25
10 15 -5 25
10 10 0 0
12 18 -6 36
20 25 -5 25
ΣD = −81    ΣD² = 947
D̄ = −81/20 = −4.05

t = D̄ / √[(ΣD² − (ΣD)²/n) / (n(n − 1))]

t = −4.05 / √[(947 − (−81)²/20) / (20(20 − 1))] = −3.17
Decision Rule:
If the t-computed value is greater than or beyond the critical value, reject Ho.
Conclusion:
The t-computed value of -3.17 is beyond the t-critical value of -1.729 at 0.05 level of
significance with 19 degrees of freedom, the null hypothesis is therefore rejected in favor
of the research hypothesis. This means that the posttest result is higher than the pretest
result. It implies that the use of the programmed materials in English is effective.
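The paired computation can likewise be checked with SciPy's `ttest_rel` (a sketch, assuming SciPy is available):

```python
# Verify the correlated-samples (paired) t-test above with SciPy.
from scipy import stats

pre = [20, 30, 10, 15, 20, 10, 18, 14, 15, 20,
       18, 15, 15, 20, 18, 40, 10, 10, 12, 20]
post = [25, 35, 25, 25, 20, 20, 22, 20, 20, 15,
        30, 10, 16, 25, 10, 45, 15, 10, 18, 25]

# D = pre - post, matching the worked solution (d.f. = 19).
t_stat, p_two_sided = stats.ttest_rel(pre, post)
one_tailed_p = p_two_sided / 2  # Ha is directional (posttest higher)
# t_stat is approximately -3.17, so Ho is rejected at the 0.05 level.
```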
THE Z-TEST
The z-test is another test under parametric statistics which requires normality of the
distribution. It utilizes the two population parameters μ and σ. It is used to compare a
sample mean with the perceived population mean, and to compare two sample means taken
from the same population. It is used when the sample size is equal to or greater than 30.
The z-test can be applied in two ways: the One-Sample Mean Test and the Two-Sample
Mean Test.
The one-sample mean test is used when the sample mean is being compared to the
perceived population mean. However, if the population standard deviation is not known,
the sample standard deviation can be used as a substitute.
z = (x̄ − μ) / (σ/√n)

Where:
x̄ = sample mean
μ = population mean
σ = population standard deviation
n = sample size
EXAMPLE:
A tire manufacturing plant produces 15.2 tires per hour. This yield has an established
variance of 2.5 tires/hour. New machines are recommended, but will be expensive to
install. Before deciding to implement the change, 32 new machines are tested. They
produce 16.8 tires per hour. Is it worth buying the new machines? Before testing, verify
that the data comes from a normal distribution (an assumption of the test).
Problem:
Is the claim true that the mean yield of new machines is greater than 15.2 tires per hour?
Hypotheses:
Ho: µ=15.2(ie. Mean yield of new machines is equal to 15.2)
Ha: µ>15.2(ie. Mean yield of new machines is greater than 15.2)
Level of Significance:
α = 0.05
z0.05 = 1.645 (one-tailed test)

Computation:

z = (x̄ − μ)/(σ/√n) = (16.8 − 15.2)/(√2.5/√32) = 5.73
Decision Rule: If the z computed value is greater than or beyond the z tabular value, reject
Ho.
Conclusion: Since the z computed value of 5.73 is beyond the critical value of 1.645 at
0.05 level of significance the null hypothesis is rejected in favor of the research
hypothesis.
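The arithmetic is a one-liner; the sketch below reproduces it with the standard library only (full precision gives about 5.72, which the example rounds to 5.73):

```python
# One-sample z-test for the tire-plant example.
import math

mu0 = 15.2      # hypothesized population mean (tires per hour)
variance = 2.5  # established population variance
n = 32          # machines tested
xbar = 16.8     # observed sample mean

z = (xbar - mu0) / (math.sqrt(variance) / math.sqrt(n))
# z is approximately 5.72, well beyond the one-tailed critical value 1.645.
```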
The two-sample mean test is used when comparing two separate samples drawn at
random from a normal population. To test whether the difference between the two
values x̄1 and x̄2 is significant or can be attributed to chance, the formula used is:

z = (x̄1 − x̄2) / √(s1²/n1 + s2²/n2)
EXAMPLE:
An admission test was administered to incoming freshmen in the Colleges of Nursing and
Veterinary Medicine, with 100 students from each college randomly selected. The mean
scores of the given samples were 90 and 85 and the variances of the test scores were 40
and 35, respectively. Is there a significant difference between the two groups? Use the
0.01 level of significance.
Problem:
Is there a significant difference between the test scores of the incoming freshmen in the
Colleges of Nursing and Veterinary Medicine?
Hypotheses:
Ho: 𝜇1 = 𝜇2
Ha: 𝜇1 ≠ 𝜇2
Level of Significance:
α = 0.01
z = 2.58
Computation:

z = (x̄1 − x̄2) / √(s1²/n1 + s2²/n2)

z = (90 − 85) / √(40/100 + 35/100) = 5.77
Decision Rule: If the z computed value is greater than or beyond the z tabular value, reject
Ho.
Conclusion: Since the z computed value of 5.77 is beyond the critical value of 2.58 at
0.01 level of significance the null hypothesis is rejected in favor of the research
hypothesis.
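The two-sample computation is equally direct (standard library only):

```python
# Two-sample mean z-test for the admission-test example.
import math

xbar1, xbar2 = 90, 85  # sample means (Nursing, Veterinary Medicine)
var1, var2 = 40, 35    # sample variances
n1 = n2 = 100          # sample sizes

z = (xbar1 - xbar2) / math.sqrt(var1 / n1 + var2 / n2)
# z is approximately 5.77 > 2.58, so Ho is rejected at the 0.01 level.
```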
THE F-TEST (ONE WAY ANOVA)
The F-Test is the analysis of variance (ANOVA). This is used in comparing the means of
two or more independent groups. One-way ANOVA is used when two variables are
involved: the column and the row variables. The researcher is interested to know if there
are significant differences between and among columns and rows. This is also used in
looking at the interaction effect between the variables being analyzed.
Steps in ANOVA
EXAMPLE:
A B C D
7 9 2 4
3 8 3 5
5 8 4 7
6 7 5 8
9 6 6 3
4 9 4 4
3 10 2 5
Perform the analysis of variance and test the hypothesis at 0.05 level of significance that
the average sales of the four brands of shampoo are equal.
Problem:
Is there a significant difference in the average sales of the four brands of shampoo?
Hypotheses:
Ho: 𝜇1 = 𝜇2 = 𝜇3 = 𝜇4
Ha: At least one of the means differs from the others.
Level of Significance:
α = 0.05
d.f.bet = k − 1 = 4 − 1 = 3
d.f.with = N − k = 28 − 4 = 24
F0.05(3, 24) = 3.01
Computations:
A    A²    B    B²    C    C²    D    D²
7    49    9    81    2    4     4    16
3    9     8    64    3    9     5    25
5    25    8    64    4    16    7    49
6    36    7    49    5    25    8    64
9    81    6    36    6    36    3    9
4    16    9    81    4    16    4    16
3    9     10   100   2    4     5    25
ΣA = 37   ΣA² = 225   ΣB = 57   ΣB² = 475   ΣC = 26   ΣC² = 110   ΣD = 36   ΣD² = 204
Σx_tot = ΣA + ΣB + ΣC + ΣD = 37 + 57 + 26 + 36 = 156

Σx²_tot = ΣA² + ΣB² + ΣC² + ΣD² = 225 + 475 + 110 + 204 = 1014

SS_TOT = Σx²_TOT − (Σx_TOT)²/N = 1014 − 156²/28 = 144.857

SS_BET = Σ[(Σx_i)²/n_i] − (Σx_TOT)²/N = 37²/7 + 57²/7 + 26²/7 + 36²/7 − 156²/28 = 72.286

SS_WITH = Σ[Σx_i² − (Σx_i)²/n_i] = (225 − 37²/7) + (475 − 57²/7) + (110 − 26²/7) + (204 − 36²/7) = 72.571

MS_BET = SS_BET/d.f.bet = 72.286/3 = 24.095
MS_WITH = SS_WITH/d.f.with = 72.571/24 = 3.024

F = MS_BET/MS_WITH = 24.095/3.024 = 7.97

Since the F-computed value of 7.97 is greater than the F-tabular value of 3.01, the
decision is to reject the null hypothesis, and we need to look into where the variation
lies by using Scheffe's test.
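The whole one-way ANOVA can be checked in one call (a sketch, assuming SciPy is available):

```python
# One-way ANOVA for the four shampoo brands, via SciPy.
from scipy import stats

brand_a = [7, 3, 5, 6, 9, 4, 3]
brand_b = [9, 8, 8, 7, 6, 9, 10]
brand_c = [2, 3, 4, 5, 6, 4, 2]
brand_d = [4, 5, 7, 8, 3, 4, 5]

f_stat, p_value = stats.f_oneway(brand_a, brand_b, brand_c, brand_d)
# f_stat is approximately 7.97 > F0.05(3, 24) = 3.01, so Ho is rejected.
```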
SCHEFFE’S TEST
The F-test tells us that there is a significant difference in the average sales of the four
brands of shampoo but as to where the difference lies, it has to be tested further by
another test, the Scheffe’s test.
F′ = (x̄1 − x̄2)² / [MS_w(n1 + n2)/(n1·n2)]

Where:
F′ = Scheffe's test statistic
x̄1 = mean of group 1
x̄2 = mean of group 2
MS_w = within-groups mean square
n1, n2 = number of samples in groups 1 and 2
AxB:
F′ = (5.29 − 8.14)² / [3.02(7 + 7)/(7 × 7)] = 9.41

AxC:
F′ = (5.29 − 3.71)² / [3.02(7 + 7)/(7 × 7)] = 2.89

AxD:
F′ = (5.29 − 5.14)² / [3.02(7 + 7)/(7 × 7)] = 0.026

BxC:
F′ = (8.14 − 3.71)² / [3.02(7 + 7)/(7 × 7)] = 22.74

BxD:
F′ = (8.14 − 5.14)² / [3.02(7 + 7)/(7 × 7)] = 10.43

CxD:
F′ = (3.71 − 5.14)² / [3.02(7 + 7)/(7 × 7)] = 2.37
Sources of Variation   F′ computed   F′0.05 = F0.05 × (k − 1) = 3.01 × 3   Remarks
AxB                    9.41          9.03                                  significant
AxC                    2.89          9.03                                  not significant
AxD                    0.026         9.03                                  not significant
BxC                    22.74         9.03                                  significant
BxD                    10.43         9.03                                  significant
CxD                    2.37          9.03                                  not significant

Brand B is common to all the significant comparisons, which means that the variation
lies in Brand B.
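There is no Scheffe routine in the Python standard library, but the formula above is easy to code directly (a sketch; the helper name `scheffe_f` is ours):

```python
# Pairwise Scheffe comparisons for the shampoo example.
groups = {
    "A": [7, 3, 5, 6, 9, 4, 3],
    "B": [9, 8, 8, 7, 6, 9, 10],
    "C": [2, 3, 4, 5, 6, 4, 2],
    "D": [4, 5, 7, 8, 3, 4, 5],
}
ms_within = 72.571 / 24  # within-groups mean square from the ANOVA

def scheffe_f(g1, g2):
    """F' = (mean1 - mean2)^2 / [MSw * (n1 + n2) / (n1 * n2)]."""
    m1, m2 = sum(g1) / len(g1), sum(g2) / len(g2)
    n1, n2 = len(g1), len(g2)
    return (m1 - m2) ** 2 / (ms_within * (n1 + n2) / (n1 * n2))

f_bc = scheffe_f(groups["B"], groups["C"])  # approximately 22.7 -> significant
f_ad = scheffe_f(groups["A"], groups["D"])  # near zero -> not significant
```

Small differences from the hand computation come from rounding the means to two decimals in the worked solution.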
A two-way ANOVA test is a statistical test used to determine the effect of two nominal
predictor variables on a continuous outcome variable. ANOVA stands for analysis of
variance and tests for differences in the effects of independent variables on a dependent
variable.
A two-way ANOVA tests the effect of two independent variables on a dependent variable.
It analyzes the effect of the independent variables on the expected outcome along with
their relationship to the outcome itself. Random factors would be considered to have no
statistical influence on a data set, while systematic factors would be considered to have
statistical significance.
By using ANOVA, a researcher is able to determine whether the variability of the
outcomes is due to chance or to the factors in the analysis. ANOVA has many applications
in finance, economics, science, medicine, and social science.
Where:
Column 1
SV = sources of variations
Factor A = effect of factor A
Factor B = effect of factor B
A B = interaction effect of factors A and B
Within = variations within groups
Total = sum of all variations
Column 2
SS = sum of squares
SSA = factor A sum of squares
SSB = factor B sum of squares
SSA B = interaction A B sum of squares
Within = within groups sum of squares
Total = total sum of squares
Column 3
DF = degrees of freedom
DFA = factor A degrees of freedom
DFB = factor B degrees of freedom
DFA B = interaction A B degrees of freedom
DFwit = within groups degrees of freedom
DFtot = total degrees of freedom
Column 4
MSS = mean sum of squares
MSSA = factor A mean sum of squares
MSSB = factor B mean sum of squares
MSSA B = interaction A B mean sum of squares
MSSwit = within groups mean sum of squares
Column 5
F = f-statistic
FA = factor A computed value of f-statistic
FB = factor B computed value of f-statistic
FA B = interaction A B computed value of f-statistic
SS_A = Σ[(Σx_Ai)²/n_Ai] − (Σx_i)²/N

SS_B = Σ[(Σx_Bi)²/n_Bi] − (Σx_i)²/N

SS_bet = Σ[(Σx_AiBi)²/n_AiBi] − (Σx_i)²/N

SS_A×B = SS_bet − SS_A − SS_B

SS_tot = Σx_i² − (Σx_i)²/N

SS_wit = SS_tot − SS_bet
Column 3

DF_A = A − 1 (categories in A − 1)
DF_B = B − 1 (categories in B − 1)
DF_A×B = (A − 1)(B − 1)
DF_wit = N − AB
DF_tot = N − 1

Column 4

MSS_A = SS_A / DF_A
MSS_B = SS_B / DF_B
MSS_A×B = SS_A×B / DF_A×B
MSS_wit = SS_wit / DF_wit

Column 5

F_A = MSS_A / MSS_wit
F_B = MSS_B / MSS_wit
F_A×B = MSS_A×B / MSS_wit
Where:
x = observed value
i = individual observation or cell
A = the first given factor
B = the second given factor
A B = interaction of factor A and B
n = number of samples in a particular category
N = total samples
Example:
Determine the effects of socio-economic status and location of residence on the number
of absences incurred by the students in the past semester. Test at α = 0.05.
Problems:
Do the socio-economic status and location of residence have significant effect on the
number of absences incurred by the students in the past semester?
Is there an interaction effect between the socio-economic status and the location of
residence?
Hypotheses:
HO1: Socio-economic status has no significant effect on the number of absences
incurred by the students in the past semester.
HO2: The location of the residence has no significant effect on the number of absences
incurred by the students in the past semester.
HO3: Socio-economic status and location of residence have no interaction effect.
Level of Significance:
∝= 0.05
d.f._A = A − 1 = 3 − 1 = 2
d.f._B = B − 1 = 2 − 1 = 1
d.f._A×B = (A − 1)(B − 1) = (3 − 1)(2 − 1) = 2
d.f._WIT = N − AB = 48 − (3)(2) = 42
d.f._TOTAL = N − 1 = 48 − 1 = 47
Location of Residence × Socio-economic status (8 observations per cell):

Within the city  — High: 8 3 2 5 4 6 7 9 (total 44); Average: 5 8 6 2 4 5 0 3 (total 33); Low: 1 9 3 4 6 2 8 5 (total 38)
Outside the city — High: 4 2 3 9 4 7 1 6 (total 36); Average: 6 1 4 5 8 4 3 2 (total 33); Low: 4 1 7 2 6 4 2 5 (total 31)

Column totals: High = 80, Average = 66, Low = 69; row totals: Within = 115, Outside = 100; grand total = 215.
SS_TOT = Σx² − (Σx)²/N = 1233 − 215²/48 = 269.979
SS_BET = (44² + 33² + 38² + 36² + 33² + 31²)/8 − 215²/48 = 13.854
SS_WIT = SS_TOT − SS_BET = 269.979 − 13.854 = 256.125
Sources of Variation   Sum of Squares   d.f.   Mean Squares   F ratio   F0.05   Decision
A                      6.792            2      3.396          0.557     3.22    Fail to reject Ho
B                      4.688            1      4.688          0.769     4.07    Fail to reject Ho
A × B                  2.374            2      1.187          0.1947    3.22    Fail to reject Ho
Within                 256.125          42     6.098
Total                  269.979          47
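The sums of squares in the table can be reproduced from the cell, row, and column totals (a sketch; variable names are ours, and Σx² = 1233 is taken from the raw data above):

```python
# Two-way ANOVA sums of squares from the totals of the 2 x 3 design above.
N = 48           # total observations
n_cell = 8       # observations per cell
sum_x_sq = 1233  # sum of all squared observations
cf = 215 ** 2 / N  # correction factor, (grand total)^2 / N

row_totals = [115, 100]    # within / outside the city (n = 24 each)
col_totals = [80, 66, 69]  # high / average / low status (n = 16 each)
cell_totals = [44, 33, 38, 36, 33, 31]

ss_a = sum(t ** 2 for t in col_totals) / 16 - cf  # status, ~6.792
ss_b = sum(t ** 2 for t in row_totals) / 24 - cf  # location, ~4.688
ss_cells = sum(t ** 2 for t in cell_totals) / n_cell - cf
ss_ab = ss_cells - ss_a - ss_b                    # interaction, ~2.374
ss_tot = sum_x_sq - cf                            # ~269.979
ss_wit = ss_tot - ss_cells                        # ~256.125
```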
THREE-WAY ANOVA
[Layout of a three-way data table: factor A in the columns, factor B in blocks (B1 … Bm), and factor C levels (C1 … Cn) as rows within each block.]
Where:
Column 1
SV = sources of variations
A = effect of factor A
B = effect of factor B
C = effect of factor C
AB = interaction effect of factor A B
BC = interaction effect of factor B C
AC = interaction effect of factor A C
ABC = interaction effect of factor A B C
Within = variation within groups
Total = sum of all variations
Column 2
SS = sum of squares
SS A = factor A sum of squares
SS B = factor B sum of squares
SS C = factor C sum of squares
SS A B = interaction A B sum of squares
SS B C = interaction B C sum of squares
SS A C = interaction A C sum of squares
SS A B C = interaction A B C sum of squares
SS wit = within groups sum of squares
SS tot = total sum of squares
Column 3
DF = degrees of freedom
DFA = factor A degrees of freedom
DFB = factor B degrees of freedom
DFC = factor C degrees of freedom
DFA B = interaction A B degrees of freedom
DFB C = interaction B C degrees of freedom
DFA C = interaction A C degrees of freedom
DFA B C = interaction A B C degrees of freedom
DFwit = within groups degrees of freedom
DFtot = total degrees of freedom
Column 4
MSS = mean sum of squares
MSS A = factor A mean sum of squares
MSS B = factor B mean sum of squares
MSS C = factor C mean sum of squares
MSS A B = interaction A B mean sum of squares
MSS B C = interaction B C mean sum of squares
MSS A C = interaction A C mean sum of squares
MSS A B C = interaction A B C mean sum of squares
MSS wit = within groups mean sum of squares
Column 5
F = f-statistics
FA = factor A computed value of f-statistics
FB = factor B computed value of f-statistics
FC = factor C computed value of f-statistics
FA B = interaction A B computed value of f-statistics
FB C = interaction B C computed value of f-statistics
FA C = interaction A C computed value of f-statistics
FA B C = interaction A B C computed value of f-statistics
Column 2

SS_A = Σ(Σx_Ai)²/(nBC) − (Σx_i)²/N

SS_B = Σ(Σx_Bi)²/(nAC) − (Σx_i)²/N

SS_C = Σ(Σx_Ci)²/(nAB) − (Σx_i)²/N

SS_A×B = Σ(Σx_AiBj)²/(nC) − Σ(Σx_Ai)²/(nBC) − Σ(Σx_Bj)²/(nAC) + (Σx_i)²/N

SS_B×C = Σ(Σx_BiCj)²/(nA) − Σ(Σx_Bi)²/(nAC) − Σ(Σx_Cj)²/(nAB) + (Σx_i)²/N

SS_A×C = Σ(Σx_AiCj)²/(nB) − Σ(Σx_Ai)²/(nBC) − Σ(Σx_Cj)²/(nAB) + (Σx_i)²/N

SS_A×B×C = Σ(Σx_AiBjCk)²/n − Σ(Σx_AiBj)²/(nC) − Σ(Σx_BjCk)²/(nA) − Σ(Σx_AiCk)²/(nB)
           + Σ(Σx_Ai)²/(nBC) + Σ(Σx_Bj)²/(nAC) + Σ(Σx_Ck)²/(nAB) − (Σx_i)²/N

SS_wit = Σx_i² − Σ(Σx_AiBjCk)²/n

SS_tot = Σx_i² − (Σx_i)²/N
Column 3

DF_A = A − 1 (categories in A − 1)
DF_B = B − 1 (categories in B − 1)
DF_C = C − 1 (categories in C − 1)
DF_A×B = (A − 1)(B − 1)
DF_B×C = (B − 1)(C − 1)
DF_A×C = (A − 1)(C − 1)
DF_A×B×C = (A − 1)(B − 1)(C − 1)
DF_wit = N − ABC
DF_tot = N − 1

Column 4

MSS_A = SS_A / DF_A
MSS_B = SS_B / DF_B
MSS_C = SS_C / DF_C
MSS_A×B = SS_A×B / DF_A×B
MSS_B×C = SS_B×C / DF_B×C
MSS_A×C = SS_A×C / DF_A×C
MSS_A×B×C = SS_A×B×C / DF_A×B×C
MSS_wit = SS_wit / DF_wit

Column 5

F_A = MSS_A / MSS_wit
F_B = MSS_B / MSS_wit
F_C = MSS_C / MSS_wit
F_A×B = MSS_A×B / MSS_wit
F_B×C = MSS_B×C / MSS_wit
F_A×C = MSS_A×C / MSS_wit
F_A×B×C = MSS_A×B×C / MSS_wit
Where:
x = observed value
i = individual observation or cell
A = the first given factor
B = the second given factor
C = the third given factor
A B = interaction of factor A and B
B C = interaction of factor B and C
A C = interaction of factor A and C
A B C = interaction of factor A, B and C
n = number of samples in a particular category
N = total samples
Example:
A1 A2 A3
C1 C2 C1 C2 C1 C2
5 2 8 2 1 9
6 7 7 8 5 2
B1 7 1 3 6 4 3
9 6 5 4 9 8
3 8 4 1 7 6
7 5 8 7 8 2
4 7 4 1 1 3
B2 5 6 9 4 3 5
8 8 6 2 6 6
1 3 7 5 2 9
4 8 5 7 4 2
6 7 9 6 1 6
B3
7 9 4 2 3 6
8 5 5 9 5 3
2 7 8 4 7 1
1 3 8 1 5 6
3 4 3 9 3 5
B4 5 9 1 2 8 2
6 2 4 7 9 7
4 5 2 5 6 3
Table A B
A1 A2 A3 Total
B1 54 48 54 156
B2 54 53 45 152
B3 63 53 38 154
B4 42 42 54 138
Total 213 196 191 600
Table B C
C1 C2 Total
B1 83 79 156
B2 79 73 152
B3 72 82 154
B4 68 70 138
Total 302 298 600
Table A C
A1 A2 A3 Total
C1 101 104 97 302
C2 112 92 94 298
Total 213 196 191 600
Table A B C
A1 A2 A3 Total
C1 C2 C1 C2 C1 C2
B1 30 24 27 21 26 28 156
B2 25 29 34 19 20 25 152
B3 27 36 25 28 20 18 154
B4 19 23 18 24 31 23 138
Total 101 112 104 92 97 94 600
Column 2

SS_A = Σ(Σx_Ai)²/(nBC) − (Σx_i)²/N
SS_A = (213² + 196² + 191²)/((5)(4)(2)) − 600²/120
SS_A = 3,006.65 − 3,000.00 = 6.65

SS_B = Σ(Σx_Bi)²/(nAC) − (Σx_i)²/N
SS_B = (156² + 152² + 154² + 138²)/((5)(3)(2)) − 600²/120
SS_B = 3,006.67 − 3,000.00 = 6.67

SS_C = Σ(Σx_Ci)²/(nAB) − (Σx_i)²/N
SS_C = (302² + 298²)/((5)(3)(4)) − 600²/120
SS_C = 3,000.13 − 3,000.00 = 0.13

SS_A×B = (54² + 48² + 54² + … + 54²)/((5)(2)) − 3,006.65 − 3,006.67 + 3,000.00
SS_A×B = 3,055.20 − 3,013.32 = 41.88
SS_B×C = Σ(Σx_BiCj)²/(nA) − Σ(Σx_Bi)²/(nAC) − Σ(Σx_Cj)²/(nAB) + (Σx_i)²/N
SS_B×C = (83² + 79² + 79² + … + 70²)/((5)(3)) − 3,006.67 − 3,000.13 + 3,000.00
SS_B×C = 3,014.67 − 3,006.80 = 7.87

SS_A×C = Σ(Σx_AiCj)²/(nB) − Σ(Σx_Ai)²/(nBC) − Σ(Σx_Cj)²/(nAB) + (Σx_i)²/N
SS_A×C = (101² + 112² + 104² + … + 94²)/((5)(4)) − 3,006.65 − 3,000.13 + 3,000.00
SS_A×C = 3,013.50 − 3,006.78 = 6.72
SS_A×B×C = Σ(Σx_AiBjCk)²/n − Σ(Σx_AiBj)²/(nC) − Σ(Σx_BjCk)²/(nA) − Σ(Σx_AiCk)²/(nB)
           + Σ(Σx_Ai)²/(nBC) + Σ(Σx_Bj)²/(nAC) + Σ(Σx_Ck)²/(nAB) − (Σx_i)²/N
SS_A×B×C = 3,110.40 − (3,055.20 + 3,014.67 + 3,013.50) + (3,006.65 + 3,006.67 + 3,000.13) − 3,000.00
SS_A×B×C = 40.48

SS_wit = Σx_i² − Σ(Σx_AiBjCk)²/n
SS_wit = (5² + 2² + 8² + … + 3²) − 3,110.40
SS_wit = 603.60

SS_tot = Σx_i² − (Σx_i)²/N
SS_tot = (5² + 2² + 8² + … + 3²) − 3,000.00
SS_tot = 714.00
Column 3

DF_A = A − 1 = 3 − 1 = 2
DF_B = B − 1 = 4 − 1 = 3
DF_C = C − 1 = 2 − 1 = 1
DF_A×B = (3 − 1)(4 − 1) = 6
DF_B×C = (4 − 1)(2 − 1) = 3
DF_A×C = (3 − 1)(2 − 1) = 2
DF_A×B×C = (3 − 1)(4 − 1)(2 − 1) = 6
DF_wit = N − ABC = 120 − (3)(4)(2) = 96
DF_tot = N − 1 = 119
Column 4

MSS_A = 6.65/2 = 3.33      MSS_B = 6.67/3 = 2.22      MSS_C = 0.13/1 = 0.13
MSS_A×B = 41.88/6 = 6.98   MSS_B×C = 7.87/3 = 2.62    MSS_A×C = 6.72/2 = 3.36
MSS_A×B×C = 40.48/6 = 6.75     MSS_wit = 603.60/96 = 6.29

Column 5

F_A = 3.33/6.29 = 0.53     F_B = 2.22/6.29 = 0.35     F_C = 0.13/6.29 = 0.02
F_A×B = 6.98/6.29 = 1.11   F_B×C = 2.62/6.29 = 0.42   F_A×C = 3.36/6.29 = 0.53
F_A×B×C = 6.75/6.29 = 1.07
After computing all the required statistics, prepare the three-way ANOVA table as shown:
Three-Way ANOVA Table
Conclusion

Since the F ratio is less than F0.05 for factors A, B and C and for the interactions A×B,
B×C, A×C and A×B×C, fail to reject H0 in every case and conclude that neither the three
factors nor any of their interactions has a significant effect.
SIMPLE LINEAR REGRESSION
In simple linear regression, we predict scores on one variable from the scores on a second
variable. The variable we are predicting is called the criterion variable and is referred to
as Y. The variable we are basing our predictions on is called the predictor variable and is
referred to as X. When there is only one predictor variable, the prediction method is called
simple regression. In simple linear regression, the topic of this section, the predictions of
Y when plotted as a function of X form a straight line.
The example data are plotted figure below. You can see that there is a positive
relationship between X and Y. If you were going to predict Y from X, the higher the value
of X, the higher your prediction of Y.
Example data:
X Y
1 1
2 2
3 1.3
4 3.75
5 2.25
Linear regression consists of finding the best-fitting straight line through the points. The
best-fitting line is called a regression line. The black diagonal line in the figure below is
the regression line and consists of the predicted score on Y for each possible value of X.
The vertical lines from the points to the regression line represent the errors of prediction.
As you can see, the red point is very near the regression line; its error of prediction is
small. By contrast, the yellow point is much higher than the regression line and therefore
its error of prediction is large.
A scatter plot of the example data. The black line consists of the predictions, the points
are the actual data, and the vertical lines between the points and the black line
represent errors of prediction.
The error of prediction for a point is the value of the point minus the predicted value (the
value on the line). The table below shows the predicted values (Y') and the errors of
prediction (Y-Y'). For example, the first point has a Y of 1.00 and a predicted Y (called
Y') of 1.21. Therefore, its error of prediction is -0.21.
Example data
X Y Y' Y-Y' (Y-Y')2
1 1 1.21 -0.21 0.044
2 2 1.635 0.365 0.133
3 1.3 2.06 -0.76 0.578
4 3.75 2.485 1.265 1.6
5 2.25 2.91 -0.66 0.436
You may have noticed that we did not specify what is meant by "best-fitting line." By far,
the most commonly-used criterion for the best-fitting line is the line that minimizes the
sum of the squared errors of prediction. That is the criterion that was used to find the
line in the figure above. The last column in the table shows the squared errors of
prediction. The sum of the squared errors of prediction shown in the table is lower than
it would be for any other regression line.
The slope b and intercept a of the regression line are computed as:

b = (nΣxy − ΣxΣy) / (nΣx² − (Σx)²)

a = ȳ − b·x̄
For X = 2,
Y' = (0.425)(2) + 0.785 = 1.635
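The slope and intercept for the example data follow directly from these formulas (standard library only):

```python
# Least-squares slope and intercept for the example data.
x = [1, 2, 3, 4, 5]
y = [1, 2, 1.3, 3.75, 2.25]
n = len(x)

sum_x, sum_y = sum(x), sum(y)
sum_xy = sum(a * b for a, b in zip(x, y))
sum_x2 = sum(a * a for a in x)

b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)  # slope = 0.425
a = sum_y / n - b * (sum_x / n)                               # intercept = 0.785

predicted = [a + b * xi for xi in x]  # 1.21, 1.635, 2.06, 2.485, 2.91
```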
To test the hypothesis on the slope (b), the test statistic is:

t = (b − β0) / (s/√Sxx),   d.f. = n − 2

To test the hypothesis on the intercept (a), the test statistic is:

t = (a − α0) / (s·√(Σxi²/(n·Sxx))),   d.f. = n − 2

Where:

Sxx = nΣx² − (Σx)²
Syy = nΣy² − (Σy)²
Sxy = nΣxy − (Σx)(Σy)
s = √[(Syy − b·Sxy)/(n − 2)]
MULTIPLE REGRESSION AND ITS SIGNIFICANCE
Multiple regression models the relationship between two or more explanatory
(independent) variables and a response (dependent) variable by fitting a linear equation
to the observed data:

yi = β0 + β1x1i + β2x2i + … + βpxpi + ϵ

Where:
yi = dependent variable
xi = explanatory variables
β0 = y-intercept (constant term)
βp = slope coefficients for each explanatory variable
ϵ = the model's error term (also known as the residuals)
R2 by itself can't thus be used to identify which predictors should be included in a model
and which should be excluded. R2 can only be between 0 and 1, where 0 indicates that
the outcome cannot be predicted by any of the independent variables and 1 indicates that
the outcome can be predicted without error from the independent variables.
When interpreting the results of a multiple regression, beta coefficients are valid while
holding all other variables constant ("all else equal"). The output from a multiple
regression can be displayed horizontally as an equation, or vertically in table form.
y′ = b1x1 + b2x2 + a

a = M_Y − b1·M_X1 − b2·M_X2 = ȳ − b1x̄1 − b2x̄2

where M = mean

R² = SS_regression / SS_y, so SS_regression = R²·SS_y
The standard error of estimate is:

standard error of estimate = √MS_residual = √(SS_residual / d.f.)

R² can also be computed as

R² = (b1·SP_X1Y + b2·SP_X2Y) / SS_y

which describes the proportion of the total variability of the Y scores that is accounted
for by the regression equation.
Example 1:
Person Y X1 X2
A 11 4 10
B 5 5 6
C 7 3 7
D 3 2 4
E 4 1 3
F 12 7 5
G 10 8 8
H 4 2 4
I 8 7 10
J 6 1 3
∑ 𝑌 = 70 ∑ 𝑋1 = 40 ∑ 𝑋2 = 60
∑ 𝑋1 𝑌 = 334
∑ 𝑋2 𝑌 = 467
∑ 𝑋1 𝑋2 = 282
SS_X1 = ΣX1² − (ΣX1)²/n = 222 − 40²/10 = 62
SS_X2 = ΣX2² − (ΣX2)²/n = 424 − 60²/10 = 64

SS_Y = ΣY² − (ΣY)²/n = 580 − 70²/10 = 90

SP_X1Y = Σ(X1Y) − (ΣX1)(ΣY)/n = 334 − (40)(70)/10 = 54

SP_X2Y = Σ(X2Y) − (ΣX2)(ΣY)/n = 467 − (60)(70)/10 = 47

SP_X1X2 = Σ(X1X2) − (ΣX1)(ΣX2)/n = 282 − (40)(60)/10 = 42

b1 = [(SP_X1Y)(SS_X2) − (SP_X1X2)(SP_X2Y)] / [(SS_X1)(SS_X2) − (SP_X1X2)²]
b1 = [(54)(64) − (42)(47)] / [(62)(64) − (42)²] = 0.672

b2 = [(SP_X2Y)(SS_X1) − (SP_X1X2)(SP_X1Y)] / [(SS_X1)(SS_X2) − (SP_X1X2)²]
b2 = [(47)(62) − (42)(54)] / [(62)(64) − (42)²] = 0.293

a = ȳ − b1x̄1 − b2x̄2 = 7 − (0.672)(4) − (0.293)(6) = 2.554
𝑦 ′ = 𝑏1 𝑥1 + 𝑏2 𝑥2 + 𝑎
R² = (b1·SP_X1Y + b2·SP_X2Y)/SS_Y = [(0.672)(54) + (0.293)(47)]/90 = 0.5562

SS_regression = R²·SS_Y = 50.059

SS_residual = SS_Y − SS_regression = 39.942
MS_residual = SS_residual / d.f. = 39.942/(10 − 3) = 5.706

MS_regression = SS_regression / d.f. = 50.059/(3 − 1) = 25.0295

F = MS_regression / MS_residual = 25.0295/5.706 = 4.3865
Since the computed F of 4.3865 does not exceed the critical value F0.05(2, 7) = 4.74, we
cannot conclude that the regression equation accounts for a significant portion of the
variance of the Y scores.
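The worked example can be cross-checked with a generic least-squares fit (a sketch, assuming NumPy is available; note the exact intercept is about 2.552 — the 2.554 above comes from using rounded slopes):

```python
# Multiple regression for Example 1 via NumPy least squares.
import numpy as np

y = np.array([11, 5, 7, 3, 4, 12, 10, 4, 8, 6], dtype=float)
x1 = np.array([4, 5, 3, 2, 1, 7, 8, 2, 7, 1], dtype=float)
x2 = np.array([10, 6, 7, 4, 3, 5, 8, 4, 10, 3], dtype=float)

design = np.column_stack([x1, x2, np.ones_like(y)])
(b1, b2, a), *_ = np.linalg.lstsq(design, y, rcond=None)
# b1 ~ 0.672, b2 ~ 0.293, a ~ 2.552

residuals = y - design @ np.array([b1, b2, a])
r_squared = 1 - (residuals ** 2).sum() / ((y - y.mean()) ** 2).sum()
# r_squared ~ 0.556, matching the worked value
```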
Example 2
Evaluate the relationship between churches and crime while controlling population.
[Data table omitted: 15 cities with number of churches (X1), population (X2), and crimes (Y); the last row reads 17, 3, 13. The totals below are computed from it.]
∑ 𝑋1 = 135 ∑ 𝑋2 = 30 ∑ 𝑌 = 135
∑ 𝑋1 2 = 1605 ∑ 𝑋2 2 = 70 ∑ 𝑌 2 = 1605
𝑥̅1 = 9 𝑥̅ 2 = 2 𝑦̅ = 9
∑ 𝑋1 𝑌 = 1578
∑ 𝑋2 𝑌 = 330
∑ 𝑋1 𝑋2 = 330
SS_X1 = ΣX1² − (ΣX1)²/n = 1605 − 135²/15 = 390

SS_X2 = ΣX2² − (ΣX2)²/n = 70 − 30²/15 = 10

SS_Y = ΣY² − (ΣY)²/n = 1605 − 135²/15 = 390

SP_X1Y = Σ(X1Y) − (ΣX1)(ΣY)/n = 1578 − (135)(135)/15 = 363

SP_X2Y = Σ(X2Y) − (ΣX2)(ΣY)/n = 330 − (30)(135)/15 = 60

SP_X1X2 = Σ(X1X2) − (ΣX1)(ΣX2)/n = 330 − (135)(30)/15 = 60

b1 = [(SP_X1Y)(SS_X2) − (SP_X1X2)(SP_X2Y)] / [(SS_X1)(SS_X2) − (SP_X1X2)²]
b1 = [(363)(10) − (60)(60)] / [(390)(10) − (60)²] = 0.1

b2 = [(SP_X2Y)(SS_X1) − (SP_X1X2)(SP_X1Y)] / [(SS_X1)(SS_X2) − (SP_X1X2)²]
b2 = [(60)(390) − (60)(363)] / [(390)(10) − (60)²] = 5.4

a = ȳ − b1x̄1 − b2x̄2 = 9 − (0.1)(9) − (5.4)(2) = −2.7
𝑦 ′ = 𝑏1 𝑥1 + 𝑏2 𝑥2 + 𝑎
R² = (b1·SP_X1Y + b2·SP_X2Y)/SS_Y = [(0.1)(363) + (5.4)(60)]/390 = 0.9239

SS_regression = R²·SS_Y = 360.321

SS_residual = SS_Y − SS_regression = 29.679
MS_residual = SS_residual / d.f. = 29.679/(15 − 3) = 2.47325

MS_regression = SS_regression / d.f. = 360.321/(3 − 1) = 180.1605

F = MS_regression / MS_residual = 180.1605/2.47325 = 72.8436
Since the computed F of 72.8436 far exceeds the critical value F0.05(2, 12) = 3.89, we can
conclude that the regression equation accounts for a significant portion of the variance
of Y. The R² of 0.9239 means that 92.4% of the variance in Y is accounted for by the
predictors.

√MS_residual = √2.47325 = 1.573
THE PEARSON PRODUCT MOMENT CORRELATION COEFFICIENT, r
A positive correlation (e.g., +0.32) means there is a positive relationship between X and
Y. For example, a positive correlation between height and weight means that as height
increases, so does weight.
A negative correlation (e.g., -0.25) means there is a relationship between X and Y that
moves in the opposite direction. For example, a negative correlation between people's
waist size and running speed means that the larger the waist size of a person, the slower
they run.
The Pearson correlation coefficient is the most common and widely used measure of the
degree of linear relationship between two variables. It should be noted that the Pearson
product moment correlation tells us whether there is a linear relationship between two
variables but it does not tell us anything about causality. That is, we do not know if X
causes Y or if Y causes X, or if some other variable causes both X and Y.
𝑛 ∑ 𝑥𝑦 − (∑ 𝑥)(∑ 𝑦)
𝑟=
√(𝑛 ∑ 𝑥 2 − (∑ 𝑥)2 )(𝑛 ∑ 𝑦 2 − (∑ 𝑦)2 )
Where:
r= the Pearson Product Moment Coefficient of Correlation
n= sample size
Σxy = sum of the product of x and y
ΣxΣy = the product of the sum of Σx and the sum of Σy
Σx2 = sum of squares of x
Σy2 = sum of squares of y
Test statistic is
𝑟√𝑛 − 2
𝑡=
√1 − 𝑟 2
df = n-2
Where:
n= number of pair of x and y
r = correlation coefficient
Example:
x 75 70 65 90 85 85 80 70 65 90
y 80 75 65 95 90 85 90 75 70 90
Problem:
Is there a significant relationship between the midterm and final examinations of 10
students in Mathematics?
Hypotheses:
Ho: There is no significant relationship between the midterm grades and the final
examination/ grades of 10 students in Mathematics.
Ha: There is a significant relationship between the midterm grades and the final
examination/ grades of 10 students in Mathematics.
Level of Significance:
α = 0.05
df = n-2 = 10-2 = 8
t0.05 = 2.306 (tabular value)
Computation:
x        y        x²        y²        xy
75       80       5625      6400      6000
70       75       4900      5625      5250
65       65       4225      4225      4225
90       95       8100      9025      8550
85       90       7225      8100      7650
85       85       7225      7225      7225
80       90       6400      8100      7200
70       75       4900      5625      5250
65       70       4225      4900      4550
90       90       8100      8100      8100
∑x=775   ∑y=815   ∑x²=60925 ∑y²=67325 ∑xy=64000
x̅ = 77.5          y̅ = 81.5
r = [n∑xy − (∑x)(∑y)] / √[(n∑x² − (∑x)²)(n∑y² − (∑y)²)]
r = [10(64000) − (775)(815)] / √[(10(60925) − 775²)(10(67325) − 815²)] = 0.94925
t = r√(n − 2) / √(1 − r²)
t = 0.94925√(10 − 2) / √(1 − 0.94925²) = 8.536
Decision Rule: If the t computed value is greater than or beyond the t tabular value, reject
Ho.
Conclusion: Since the t computed value of 8.536 is beyond the critical value of 2.306 at
0.05 level of significance, we reject Ho and accept the research hypothesis, which means
that there is a significant relationship between the midterm grades and the final
examination grades of 10 students in Mathematics.
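As a cross-check on the arithmetic above, the whole computation can be reproduced in a few lines of Python. The code is an illustrative sketch, not part of the course text; the data are the ten score pairs from the example.

```python
import math

# Midterm (x) and final (y) grades of the 10 students from the example
x = [75, 70, 65, 90, 85, 85, 80, 70, 65, 90]
y = [80, 75, 65, 95, 90, 85, 90, 75, 70, 90]

n = len(x)
sum_x, sum_y = sum(x), sum(y)
sum_xy = sum(a * b for a, b in zip(x, y))
sum_x2 = sum(a * a for a in x)
sum_y2 = sum(b * b for b in y)

# Pearson r: r = [nΣxy - (Σx)(Σy)] / sqrt[(nΣx² - (Σx)²)(nΣy² - (Σy)²)]
r = (n * sum_xy - sum_x * sum_y) / math.sqrt(
    (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
)

# Test statistic: t = r * sqrt(n - 2) / sqrt(1 - r²), df = n - 2
t = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

# Coefficient of determination; the text's 90.06% rounds r to 0.949 first
cd = r ** 2 * 100

print(round(r, 5), round(t, 3), round(cd, 2))
```

Carrying full precision gives t ≈ 8.537 and CD ≈ 90.11%; the small differences from the text come from rounding r before substituting.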
COEFFICIENT OF DETERMINATION
The formula for computing the coefficient of determination for a linear regression model
with one independent variable is given below.
CD = r² x 100%
= (0.949)² x 100%
= 90.06%
Activity:
2. A random sample of companies in electric utilities (I), financial services (II), and
food processing (III) gave the following information regarding annual profits per
employee (units in thousands of dollars).
I 49.1 43.4 32.9 27.8 38.3 36.1 20.2
II 55.6 25.0 41.3 29.9 39.5
III 39.0 37.3 10.8 32.5 15.8 42.6
Shall we reject or not the claim that there is no difference in population mean
annual profits per employee in each of the three types of companies? Use 1% level
of significance.
3. The following data represent the operating time in hours for 4 types of pocket
calculators before a recharge is required.
Fx101 Fx202 Fx303 Fx404
6.4 5.9 7.1 5.3
6.1 5.8 7.1 4.9
6.5 5.9 7.2 6.1
6.2 5.1 7.3 7.1
6.3 5.0 7.4 4.9
Test at 0.01 level of significance if the operating times for all four calculators are
equal.
4. Two groups of experimental animals are given two brands of feed supplement and
the following weight gains are recorded in grams.
Group I 260 204 105 275 210 100 110 143 170 189
Group II 65 78 140 182 195 48 79 80 69 89
Test at 0.05 level of significance whether the two samples differ.
5. Ten students were given remedial instruction in reading comprehension. Pretest
and posttest were also administered, the results of which follow:
Pretest Posttest
25 30
20 25
20 20
8 12
15 20
8 8
15 20
8 8
20 20
21 20
35 38
20 30
Test at 0.05 level of significance to find out if there is significant difference between
the pretest and posttest.
6. Is there a significant difference on the performance of male and female Grade V
pupils in three different subjects using two different methods? Use α=.05
              Method I            Method II
Subject       Male    Female     Male    Female
1             90      87         85      80
              88      85         90      95
              82      78         88      89
2             80      75         89      82
              91      88         85      80
              85      80         90      92
3             81      76         85      80
              89      86         87      82
              83      79         88      85
7. The following data are based on information from the book Life in America’s Small
Cities (by G.S. Thomas, Prometheus Books). Let x be the percentage of 16- to 19-
year-olds not in school and not high school graduates. Let y be the reported violent
crimes per 1000 residents. Six small cities in Arkansas (Blytheville, El Dorado, Hot
Springs, Jonesboro, Rogers, and Russellville) reported the following information
about x and y:
8. Determine the effects of gender and type of high school graduated from on the
college entrance examination scores of the students. Test at 0.05 level of
significance
10. A teacher wants to find out whether a relationship exists between her students’
GWA and the result of their exam in the Entrance Test for Graduate School
Program. Test at 0.05 level of significance
GWA     Entrance Exam Results
2.55    98
2.82    90
2.73    95
2.72    93
2.88    99
2.80    97
2.53    98
2.09    97
2.75    94
1.55    99
2.62    95
2.86    95
2.35    95
2.31    97
2.66    98
Session 3
Nonparametric Statistics
Lecture:
INTRODUCTION
Nonparametric statistics refers to a statistical method in which the data are not assumed
to come from prescribed models that are determined by a small number of parameters;
examples of such models include the normal distribution model and the linear regression
model. Nonparametric statistics sometimes uses data that are ordinal, meaning the
analysis does not rely on the numbers themselves but rather on a ranking or order of
sorts. For example, a survey recording consumer preferences ranging from like to dislike
would be considered ordinal data.
Nonparametric statistics does not assume that data is drawn from a normal distribution.
Instead, the shape of the distribution is estimated under this form of statistical
measurement. While there are many situations in which a normal distribution can be
assumed, there are also some scenarios in which the true data generating process is far
from normally distributed.
SPECIAL CONSIDERATIONS
Nonparametric statistics have gained appreciation due to their ease of use. As the need
for parameters is relieved, the data becomes more applicable to a larger variety of tests.
This type of statistics can be used without the mean, sample size, standard deviation, or
the estimation of any other related parameters when none of that information is available.
Since nonparametric statistics makes fewer assumptions about the sample data, its
application is wider in scope than parametric statistics. In cases where parametric testing
is more appropriate, nonparametric methods will be less efficient. This is because
nonparametric statistics discard some information that is available in the data, unlike
parametric statistics.
They are discussed in the following sections.
THE CHI-SQUARE TEST
This is a test of the difference between the observed and expected frequencies. The Chi-
Square is considered a unique test due to its three functions, which are as follows: the
test of goodness of fit, the test of homogeneity, and the test of independence.
CHI-SQUARE TEST OF GOODNESS OF FIT
Chi-Square goodness of fit test is a non-parametric test that is used to find out how the
observed value of a given phenomena is significantly different from the expected value.
In Chi-Square goodness of fit test, the term goodness of fit is used to compare the
observed sample distribution with the expected probability distribution. Chi-Square
goodness of fit test determines how well theoretical distribution (such as normal, binomial,
or Poisson) fits the empirical distribution. In Chi-Square goodness of fit test, sample data
is divided into intervals. Then the numbers of points that fall into the interval are
compared, with the expected numbers of points in each interval.
A. Null hypothesis: In Chi-Square goodness of fit test, the null hypothesis assumes that
there is no significant difference between the observed and the expected value.
Compute the value of the Chi-Square goodness of fit test using the following formula:
x² = ∑(O − E)²/E
Where:
O = observed frequency
E = expected frequency
Degree of freedom: In the Chi-Square goodness of fit test, the degrees of freedom
depend on the distribution of the sample; when the data are grouped into h categories,
df = h − 1.
Hypothesis testing: Hypothesis testing in the Chi-Square goodness of fit test is the same
as in other tests, like the t-test, ANOVA, etc. The calculated value of the Chi-Square
goodness of fit test is compared with the table value. If the calculated value is greater
than the table value, we reject the null hypothesis and conclude that there is a significant
difference between the observed and the expected frequencies. If the calculated value is
less than the table value, we fail to reject the null hypothesis and conclude that there is
no significant difference between the observed and expected frequencies.
Example:
The theory of Mendel regarding crossing of peas is in the ratio of 9:3:3:1, meaning 9 parts
are smooth yellow, 3 parts wrinkled yellow, 3 parts smooth green and one part wrinkled
green. The researcher conducted an experiment and the result was that out of 560 peas,
310 were smooth yellow, 100 were wrinkled yellow, 110 were smooth green and 40 were
wrinkled green. Is there a significant difference between the observed and the expected?
Use X2 – test at 0.05 level of significance.
Problem: Is there a significant difference between the observed (actual experiment) and
the expected (theory) frequencies?
Hypotheses:
Ho: There is no significant difference between the observed and the expected
frequencies.
Ha: There is a significant difference between the observed and the expected frequencies.
Level of Significance
α = 0.05
df = h-1 = 4-1 = 3
X2.05 = 7.815 tabular value
Attributes        Ratio   Actual Result (Observed)   Theory (Expected)   O − E
Smooth Yellow     9       310                        315                 -5
Wrinkled Yellow   3       100                        105                 -5
Smooth Green      3       110                        105                 5
Wrinkled Green    1       40                         35                  5
Total             16      560                        560
For the Theory or Expected values: divide 560 by 16 to get 35, then multiply 35 by each
part of the 9:3:3:1 ratio to obtain 315, 105, 105, and 35.
x² = (-5)²/315 + (-5)²/105 + 5²/105 + 5²/35 = 1.27
Decision Rule: If the 𝑥 2 computed value is greater than or beyond the 𝑥 2 tabular value,
reject Ho.
Conclusion: Since the 𝑥 2 computed value of 1.27 is less than the critical value of 7.815
at 0.05 level of significance with 3 degrees of freedom, we fail to reject the null hypothesis,
there is no significant difference between the observed and the expected frequencies.
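The Mendel computation can be verified with a short Python sketch (illustrative only, not part of the course text):

```python
# Chi-square goodness of fit for the Mendel pea example:
# observed counts vs. counts expected under the 9:3:3:1 ratio.
observed = [310, 100, 110, 40]   # smooth yellow, wrinkled yellow, smooth green, wrinkled green
ratio = [9, 3, 3, 1]

total = sum(observed)            # 560
unit = total / sum(ratio)        # 560 / 16 = 35
expected = [unit * part for part in ratio]   # [315, 105, 105, 35]

chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(expected, round(chi2, 2))  # [315.0, 105.0, 105.0, 35.0] 1.27
```

Since 1.27 < 7.815, the code reaches the same decision as the text: fail to reject Ho.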
CHI-SQUARE TEST OF HOMOGENEITY
The chi-square test of homogeneity tests whether different columns (or rows) of data in
a table come from the same population or not (i.e., whether the differences are consistent
with being explained by sampling error alone). For example, in a table showing political
party preference in the rows and states in the columns, the test has the null hypothesis
that each state has the same party preferences.
x² = N(ad − bc)²/(klmn)
Where:
𝑥 2 = chi square
N = grand total
klmn = the product of the rows and columns
for a 2x2 contingency table, label the different cells as follows:
          Col 1   Col 2   Total
Row 1     a       b       k
Row 2     c       d       l
Total     m       n       N
Use this formula for any form of contingency table:
x² = ∑(O − E)²/E
Where:
O = observed frequency
E = expected frequency
Example:
To illustrate this, we can evaluate the attitude of a sample of Lakas and Laban parties on
the issue of peace and order in Mindanao. To carry out such study, a separate random
sample of members of each party is drawn from the nationwide population of Lakas and
Laban and each individual in both samples responds to the scale. Scores are then
classified into favorable or unfavorable categories. The following frequencies are
obtained:
Favorable Unfavorable Total
Lakas 65 35 100
Laban 50 50 100
Total 115 85 200
Problem: Is there a significant difference between the attitudes of the two political parties
on the issue of peace and order in Mindanao?
Hypotheses:
Ho: There is no significant difference between the attitudes of the two political parties on
the issue of peace and order in Mindanao
Ha: There is a significant difference between the attitudes of the two political parties on
the issue of peace and order in Mindanao
Level of Significance:
α=0.05
d.f. = (c-1)(r-1)
= (2-1)(2-1)
= 1
x²0.05 = 3.841
Computation:
x² = N(ad − bc)²/(klmn)
x² = 200(3250 − 1750)²/(100×100×115×85) = 4.604
Decision Rule: If the chi square computed value is greater than the chi square tabular
value, reject the Ho.
Conclusion:
Since the chi square computed value of 4.604 is greater than the chi square tabular value
of 3.841 at 0.05 level of significance with 1 degree of freedom, the research hypothesis
is confirmed. There is a significant difference between the attitudes of the two political
parties on the issue of peace and order in Mindanao.
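For a 2x2 table, the shortcut formula can be checked directly in Python (an illustrative sketch with the Lakas/Laban frequencies from the example):

```python
# 2x2 chi-square via the shortcut x² = N(ad - bc)² / (klmn),
# using the Lakas/Laban table from the example.
a, b = 65, 35   # Lakas: favorable, unfavorable
c, d = 50, 50   # Laban: favorable, unfavorable

k, l = a + b, c + d          # row totals: 100, 100
m, n = a + c, b + d          # column totals: 115, 85
N = k + l                    # grand total: 200

chi2 = N * (a * d - b * c) ** 2 / (k * l * m * n)
print(round(chi2, 3))  # 4.604
```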
This test utilizes a contingency table to analyze the data. A contingency table (also known
as a cross-tabulation, crosstab, or two-way table) is an arrangement in which data is
classified according to two categorical variables. The categories for one variable appear
in the rows, and the categories for the other variable appear in columns. Each variable
must have two or more categories. Each cell reflects the total count of cases for a specific
pair of categories.
The Chi-Square Test of Independence can only compare categorical variables. It cannot
make comparisons between continuous variables or between categorical and continuous
variables. Additionally, the Chi-Square Test of Independence only assesses associations
between categorical variables, and cannot provide any inferences about causation.
Your data must meet the following requirements:
1. Two categorical variables.
2. Two or more categories (groups) for each variable.
3. Independence of observations.
a. There is no relationship between the subjects in each group.
b. The categorical variables are not "paired" in any way (e.g. pre-test/post-test
observations).
4. Relatively large sample size.
a. Expected frequencies for each cell are at least 1.
b. Expected frequencies should be at least 5 for the majority (80%) of the cells.
The null hypothesis (H0) and alternative hypothesis (H1) of the Chi-Square Test of
Independence can be expressed in two different but equivalent ways:
H0: The two categorical variables are independent; H1: The two categorical variables are dependent
or
H0: There is no association between the two variables; H1: There is an association between the two variables
The test statistic is
x² = ∑(O − E)²/E
Where:
O = observed frequency
E = expected frequency
Example:
Ninety individuals, male and female, were given a test in psychomotor skills and their
scores were classified into high and low. Use the x2 test of independence at 0.05 level of
significance. The table is shown.
Scores
Sex High Low Total
Male 18 28 46
Female 32 12 44
Total 50 40 90
Problem: Is there a significant relationship between sex and scores in psychomotor skill?
Hypotheses:
Ho: There is no significant relationship between sex and scores in psychomotor skill
Ha: There is a significant relationship between sex and scores in psychomotor skill
Level of Significance:
α=0.05
d.f= (c-1)(r-1)
= (2-1)(2-1)
=1
x²0.05 = 3.841
Computation:
                 Scores
          High            Low             Total
Sex       O      E        O      E
Male      18     25.56    28     20.44    46
Female    32     24.44    12     19.56    44
Total     50              40              90
For expected values: Multiply the column total with the row total and divide the product
by the grand total.
x² = ∑(O − E)²/E
x² = (18 − 25.56)²/25.56 + (32 − 24.44)²/24.44 + (28 − 20.44)²/20.44 + (12 − 19.56)²/19.56 = 10.292
Decision Rule: If the chi square computed value is greater than the chi square tabular
value, reject the Ho.
Conclusion:
Since the chi square computed value of 10.292 is greater than the chi square tabular
value of 3.841 at 0.05 level of significance with 1 degree of freedom, the research
hypothesis is confirmed. There is a significant relationship between sex and scores in
psychomotor skill.
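The same result follows by computing the expected cell counts from the margins in Python (an illustrative sketch; the slight difference from 10.292 comes from the text rounding E to two decimals):

```python
# Chi-square test of independence for the sex-by-score table:
# expected cell counts come from (row total x column total) / grand total.
observed = [[18, 28],   # male: high, low
            [32, 12]]   # female: high, low

row_totals = [sum(row) for row in observed]          # [46, 44]
col_totals = [sum(col) for col in zip(*observed)]    # [50, 40]
grand = sum(row_totals)                              # 90

chi2 = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        e = row_totals[i] * col_totals[j] / grand
        chi2 += (o - e) ** 2 / e

print(round(chi2, 2))  # 10.28
```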
MANN WHITNEY U TEST
The modules on hypothesis testing presented techniques for testing the equality of means
in two independent samples. An underlying assumption for appropriate use of the tests
described was that the continuous outcome was approximately normally distributed or
that the samples were sufficiently large (usually n1 > 30 and n2 > 30) to justify their use
based on the Central Limit Theorem. When comparing two independent samples whose
outcome is not normally distributed and the samples are small, a nonparametric test is
appropriate.
In contrast, the null and two-sided research hypotheses for the nonparametric test are
stated as follows:
Ho: The two populations are equal
Ha: The two populations are not equal
This test is often performed as a two-sided test and, thus, the research hypothesis
indicates that the populations are not equal as opposed to specifying directionality. A one-
sided research hypothesis is used if interest lies in detecting a positive or negative shift
in one population as compared to the other. The procedure for the test involves pooling
the observations from the two samples into one combined sample, keeping track of which
sample each observation comes from, and then ranking the pooled observations from
lowest to highest, from 1 to n1+n2.
EXAMPLE:
Consider a Phase II clinical trial designed to investigate the effectiveness of a new drug
to reduce symptoms of asthma in children. A total of n=10 participants are randomized to
receive either the new drug or a placebo. Participants are asked to record the number of
episodes of shortness of breath over a 1 week period following receipt of the assigned
treatment. The data are shown below.
Placebo 7 5 6 4 12
New Drug 3 6 4 2 1
In this example, the outcome is a count and in this sample the data do not follow a normal
distribution.
Note that if the null hypothesis is true (i.e., the two populations are equal), we expect to
see similar numbers of episodes of shortness of breath in each of the two treatment
groups, and we would expect to see some participants reporting few episodes and some
reporting more episodes in each group. This does not appear to be the case with the
observed data. A test of hypothesis is needed to determine whether the observed data is
evidence of a statistically significant difference in populations.
The first step is to assign ranks, and to do so we order the data from smallest to largest.
This is done on the combined or total sample (i.e., pooling the data from the two treatment
groups (n=10)), assigning ranks from 1 to 10 and giving tied values the mean of the ranks
they would otherwise occupy. We also need to keep track of the group assignments in
the total sample.
Note that the lower ranks (e.g., 1, 2 and 3) are assigned to responses in the new drug
group while the higher ranks (e.g., 9, 10) are assigned to responses in the placebo group.
Again, the goal of the test is to determine whether the observed data support a difference
in the populations of responses. Recall that in parametric tests (discussed in the modules
on hypothesis testing), when comparing means between two groups, we analyzed the
difference in the sample means relative to their variability and summarized the sample
information in a test statistic. A similar approach is employed here. Specifically, we
produce a test statistic based on the ranks.
First, we sum the ranks in each group. In the placebo group, the sum of the ranks is 37;
in the new drug group, the sum of the ranks is 18. Recall that the sum of the ranks will
always equal n(n+1)/2. As a check on our assignment of ranks, we have n(n+1)/2 =
10(11)/2=55 which is equal to 37+18 = 55.
For the test, we call the placebo group 1 and the new drug group 2 (assignment of groups
1 and 2 is arbitrary). We let R1 denote the sum of the ranks in group 1 (i.e., R 1=37), and
R2 denote the sum of the ranks in group 2 (i.e., R 2=18). If the null hypothesis is true (i.e.,
if the two populations are equal), we expect R1 and R2 to be similar. In this example, the
lower values (lower ranks) are clustered in the new drug group (group 2), while the higher
values (higher ranks) are clustered in the placebo group (group 1). This is suggestive, but
is the observed difference in the sums of the ranks simply due to chance? To answer this
we will compute a test statistic to summarize the sample information and look up the
corresponding value in a probability distribution.
The test statistic for the Mann Whitney U Test is denoted U and is the smaller of U1 and
U2, defined as follows:
U1 = n1n2 + [n1(n1 + 1)]/2 − R1
U2 = n1n2 + [n2(n2 + 1)]/2 − R2
Where R1 = sum of the ranks for group 1 and R2 = sum of the ranks for group 2.
In our example, U=3. Is this evidence in support of the null or research hypothesis? Before
we address this question, we consider the range of the test statistic U in two different
situations.
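The ranking and U computation for this example can be sketched in Python (illustrative only; mid-ranks are used for tied observations, as in the text):

```python
# Mann-Whitney U for the asthma example: pool, rank with mid-ranks for ties,
# sum ranks per group, then U = min(U1, U2).
placebo = [7, 5, 6, 4, 12]
new_drug = [3, 6, 4, 2, 1]

pooled = sorted(placebo + new_drug)

def rank(value):
    # mid-rank: average position of all pooled observations equal to `value`
    first = pooled.index(value) + 1
    count = pooled.count(value)
    return first + (count - 1) / 2

R1 = sum(rank(v) for v in placebo)    # 37.0
R2 = sum(rank(v) for v in new_drug)   # 18.0
n1, n2 = len(placebo), len(new_drug)

U1 = n1 * n2 + n1 * (n1 + 1) / 2 - R1   # 3.0
U2 = n1 * n2 + n2 * (n2 + 1) / 2 - R2   # 22.0
U = min(U1, U2)
print(R1, R2, U)  # 37.0 18.0 3.0
```

Note the built-in check: R1 + R2 = 55 = n(n+1)/2 for n = 10, as stated above.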
EXAMPLE:
A new approach to prenatal care is proposed for pregnant women living in a rural
community. The new program involves in-home visits during the course of pregnancy in
addition to the usual or regularly scheduled visits. A pilot randomized trial with 15 pregnant
women is designed to evaluate whether women who participate in the program deliver
healthier babies than women receiving usual care. The outcome is the APGAR score
measured 5 minutes after birth. Recall that APGAR scores range from 0 to 10 with scores
of 7 or higher considered normal (healthy), 4-6 low and 0-3 critically low. The data are
shown below.
Usual Care 8 7 6 2 5 8 7 3
New Program 9 9 7 8 10 9 6
Hypotheses:
Ho: The two populations of APGAR scores are equal
Ha: The two populations of APGAR scores are not equal
Level of Significance
• α =0.05
• n1 = 8, n2 = 7
• U = 10
Computation:
U1 = n1n2 + [n1(n1 + 1)]/2 − R1
U1 = 8(7) + [8(8 + 1)]/2 − 45.5 = 46.5
U2 = n1n2 + [n2(n2 + 1)]/2 − R2
U2 = 8(7) + [7(7 + 1)]/2 − 74.5 = 9.5
2
Decision:
The appropriate critical value can be found in a table of critical values for the Mann
Whitney U statistic. To determine the appropriate critical value we need the sample sizes
(n1=8 and n2=7) and our two-sided level of significance (α=0.05). The critical value for
this test with n1=8, n2=7 and α=0.05 is 10, and the decision rule is as follows: Reject H0
if U < 10
Conclusion:
We reject H0 because 9.5 < 10. We have statistically significant evidence at α =0.05 to
show that the populations of APGAR scores are not equal in women receiving usual
prenatal care as compared to the new program of prenatal care.
KRUSKAL-WALLIS TEST
The Kruskal-Wallis H test (sometimes also called the "one-way ANOVA on ranks") is a
rank-based nonparametric test that can be used to determine if there are statistically
significant differences between two or more groups of an independent variable on a
continuous or ordinal dependent variable. It is considered the nonparametric alternative
to the one-way ANOVA, and an extension of the Mann-Whitney U test that allows the
comparison of more than two independent groups.
H = [12/(n(n + 1))] ∑(Ri²/ni) − 3(n + 1)
Where:
H = the Kruskal-Wallis test statistic
n = the total number of observations
ni = the number of observations in the i-th group
Ri = the sum of the ranks in the i-th group
12, 3, and 1 = constants
• Rank the data/observations of all the groups from the lowest to the highest value
• Assign the ranks to the corresponding observations
• Get the rank sums R1, R2, … of each group
• Determine the sample size of every group
• Use the formula of H test
• Compare the computed value to the critical value (𝑥 2 table)
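The steps above can be sketched in Python. The three groups of scores below are invented for illustration (the example's own data table is not reproduced in this material), so the resulting H value is not the text's 10.458.

```python
# Kruskal-Wallis H for three hypothetical groups of scores
# (illustrative data, not from the text's example).
groups = {
    "Method 1": [94, 88, 90, 85, 92],
    "Method 2": [75, 70, 78, 72, 74],
    "Method 3": [82, 80, 86, 79, 84],
}

pooled = sorted(v for scores in groups.values() for v in scores)
n = len(pooled)

def rank(value):
    # mid-rank in the pooled sample (handles ties)
    first = pooled.index(value) + 1
    return first + (pooled.count(value) - 1) / 2

# H = [12 / (n(n+1))] * sum(Ri^2 / ni) - 3(n + 1)
H = 12 / (n * (n + 1)) * sum(
    sum(rank(v) for v in scores) ** 2 / len(scores) for scores in groups.values()
) - 3 * (n + 1)
print(round(H, 3))
```

The computed H is then compared with the chi-square critical value for h − 1 degrees of freedom, exactly as in the worked example below.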
Example:
Consider the examination scores of samples of high school students who are taught in
English using three different methods: Method 1 (classroom instruction and language
laboratory), Method 2 (only classroom instruction) and Method 3 (only self-study in
language laboratory). Use the H-test at 0.05 level of significance to test the null
hypothesis that their means are equal. Consider the following data:
Problem: Is the claim true that there are no significant differences in the examination
scores of high school students using three different methods?
Hypotheses:
Ho: There are no significant differences in the examination scores of high school students
using three different methods
Ha: There are significant differences in the examination scores of high school students
using three different methods
Level of Significance:
α = 0.05
d.f = h-1=3-1=2
x2 = 5.99
Computations:
H = [12/(n(n + 1))] ∑(Ri²/ni) − 3(n + 1)
Decision Rule:
If the x2 computed value is greater than the x2 tabular value, disconfirm the Ho.
Conclusion:
Since the H computed value of 10.458 is greater than the x² tabular value of 5.99 at 0.05
level of significance with 2 degrees of freedom, the null hypothesis of no significant
difference among the three groups is rejected.
THE SIGN TEST FOR TWO INDEPENDENT SAMPLES (MEDIAN TEST TWO-SAMPLE
CASE)
The Sign test is a non-parametric test used to test whether two groups differ in location,
that is, whether they share a common median; in this two-sample form it is known as the
median test. The sign test is also used when dependent samples are ordered in pairs,
where the bivariate random variables are mutually independent. It is based on the
direction of the plus and minus signs of the observations, and not on their numerical
magnitude. It is also called the binomial sign test, with p = .5. The sign test is considered
a weaker test because it only records whether each value lies above or below the median
and does not measure the size of the difference. This is the counterpart of the t-test under
the parametric tests.
The data consist of two independent samples of n1 and n2 observations. The median of
the two samples taken jointly is obtained. In each sample, the observations above the
joint median are assigned a plus sign and those at or below it a minus sign. Then the
number of + and − signs for each sample is obtained. A x² test is used to determine
whether the observed frequencies of + and − signs differ significantly.
x² = N(ad − bc)²/(klmn)
Where:
x2 = Chi-square test
a and c = observed (+) frequencies
b and d = observed (-) frequencies
k and l = the row total
m and n = the column total
N = grand total
            +    −    Total
Group 1     a    b    k
Group 2     c    d    l
Total       m    n    N
Example:
Consider the test scores of 12 female and 9 male students in a spelling test.
Female 12 26 25 10 10 10 22 20 19 17 17 15
Male 6 22 19 7 8 12 16 8 19
Problem: Is there a significant difference between the spelling test scores of the female
and male students?
Hypotheses:
Ho: There is no significant difference between the spelling test scores of the female and
male students.
Ha: There is a significant difference between the spelling test scores of the female and
male students.
Level of significance:
α = 0.05
df = (c-1)(r-1) = (2-1)(2-1) = 1
x20.05 = 3.841
Computations:
The median of the female and male observations taken jointly is 16. Assigning a + to
values above the median and a − to values at or below it, we have the following result.
Female - + + - - - + + + + + -
Male - + + - - - - - +
          +       −       Total
Female    a = 7   b = 5   k = 12
Male      c = 3   d = 6   l = 9
Total     m = 10  n = 11  N = 21
x² = N(ad − bc)²/(klmn)
x² = 21(42 − 15)²/[(12)(9)(10)(11)] = 1.288
Decision Rule:
If the x2 computed value is greater than the x2 tabular value, disconfirm the H o.
Conclusion:
Since the x² computed value of 1.288 is less than the x² tabular value of 3.841 at 0.05
level of significance with 1 degree of freedom, the null hypothesis of no significant
difference in the performance of the two groups is confirmed.
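The whole procedure, from the joint median to the 2x2 chi-square, can be reproduced in Python (an illustrative sketch with the spelling data from the example):

```python
# Median test for the spelling example: find the joint median, count + / -
# per group, then apply the 2x2 shortcut x² = N(ad - bc)² / (klmn).
female = [12, 26, 25, 10, 10, 10, 22, 20, 19, 17, 17, 15]
male = [6, 22, 19, 7, 8, 12, 16, 8, 19]

pooled = sorted(female + male)
median = pooled[len(pooled) // 2]   # 21 values -> the 11th value, 16

a = sum(1 for v in female if v > median)   # female above median: 7
b = len(female) - a                        # female at or below: 5
c = sum(1 for v in male if v > median)     # male above median: 3
d = len(male) - c                          # male at or below: 6

k, l, m, n, N = a + b, c + d, a + c, b + d, len(pooled)
chi2 = N * (a * d - b * c) ** 2 / (k * l * m * n)
print(median, (a, b, c, d), round(chi2, 3))  # 16 (7, 5, 3, 6) 1.289
```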
FISHER SIGN TEST
This test is under nonparametric statistics. It is the counterpart of the t-test for correlated
samples under the parametric tests. The Fisher Sign Test compares two correlated
samples and is applicable to data composed of N paired observations. The difference
between each pair of observations is obtained. This test is based on the idea that, under
the null hypothesis, half the differences between the paired observations will be positive
and the other half negative.
z = (|D| − 1)/√N
Where:
D = the difference between the number of plus signs and the number of minus signs
N = the number of pairs with nonzero differences
Example:
Consider the pretest and the posttest results before and after the implementation of the
program.
Pretest Posttest
(x) (y)
15 19
19 30
31 26
36 8
10 10
11 6
19 17
15 13
10 22
16 8
Problem:
Is there a significant difference between the pretest and posttest results of the 10
students?
Hypothesis:
Ho: There is no significant difference between the pretest and posttest results of the 10
students
Ha: There is a significant difference between the pretest and posttest results of the 10
students.
Level of Significance:
α = 0.05
Z0.05 = ±1.96
Pretest (x)   Posttest (y)   Sign of (x − y)
15            19             −
19            30             −
31            26             +
36            8              +
10            10             0
11            6              +
19            17             +
15            13             +
10            22             −
16            8              +
In this example, there are 6 + signs, 3 − signs, and 1 zero. The zero is disregarded, so
N = 9. It may be shown that:
z = (|D| − 1)/√N
z = (|6 − 3| − 1)/√9 = 2/3 = 0.67
Decision Rule:
If the z computed value is greater than the z tabular value, disconfirm the Ho.
Conclusion:
Since the z computed value of 0.67 is less than the z tabular value of 1.96 at 0.05 level
of significance, the null hypothesis is confirmed which means that there is no significant
difference between the pretest and the posttest results of the ten students.
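Counting the signs and forming z can be automated with a small Python sketch (illustrative only; the data are the ten pretest/posttest pairs above):

```python
# Fisher sign test for the pretest/posttest example: count the signs of
# x - y, drop zeros, and compute z = (|D| - 1) / sqrt(N).
import math

pretest  = [15, 19, 31, 36, 10, 11, 19, 15, 10, 16]
posttest = [19, 30, 26, 8, 10, 6, 17, 13, 22, 8]

diffs = [x - y for x, y in zip(pretest, posttest)]
plus  = sum(1 for d in diffs if d > 0)   # 6
minus = sum(1 for d in diffs if d < 0)   # 3
N = plus + minus                          # zeros are disregarded -> 9

z = (abs(plus - minus) - 1) / math.sqrt(N)
print(plus, minus, round(z, 2))  # 6 3 0.67
```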
MEDIAN TEST FOR K INDEPENDENT SAMPLES
This test is under the nonparametric tests. It is a straightforward extension of the median
test for two independent samples. The x² test is used for k independent samples.
x² = ∑(O − E)²/E
Where:
O = observed frequency
E = expected frequency
Example:
A sampling of the acidity of rain for 10 randomly selected rainfalls was recorded at three
different locations in the province of Northern Samar: Biri Island, Catarman, and Silvino
Lubos. The pH readings for these 30 rainfalls are shown in the table. (Note that the pH
readings range from 0 to 14; 0 is acid, 14 is alkaline. Pure water falling through clean air
has a pH reading of 5.7).
Biri    Catarman    Silvino Lubos
4.4     4.6         4.7
4.0     4.5         4.8
4.1     4.3         5.0
3.5     3.8         4.9
2.4     4.2         3.9
3.8     4.5         4.5
4.2     4.7         4.6
3.9     4.3         4.3
4.1     4.5         4.0
4.2     4.8         4.7
Use the median test at 0.05 level of significance to test the null hypothesis that there is
no significant difference among the pH readings of the three different municipalities of
Northern Samar.
Problem: Is there a significant difference in the pH readings among the three different
municipalities of Northern Samar?
Hypotheses:
Ho: There is no significant difference in the pH readings among the three municipalities
in Northern Samar.
Ha: There is a significant difference in the pH readings among the three municipalities in
Northern Samar.
Level of Significance:
𝛼 = 0.05
d.f. = (c − 1)(r − 1) = (2 − 1)(3 − 1) = 2
x²0.05 = 5.991
Computation:
Biri    Catarman    Silvino Lubos
+       +           +
−       +           +
−       −           +
−       −           +
−       −           −
−       +           +
−       +           +
−       −           −
−       +           −
−       +           +
                  Above 4.3        At or Below 4.3
Municipalities    (+)              (−)               Total
                  O      E         O      E
Biri              1      4.7       9      5.3        10
Catarman          6      4.7       4      5.3        10
Silvino Lubos     7      4.7       3      5.3        10
Total             14               16                30
x² = ∑(O − E)²/E = 8.299
Decision Rule:
If the computed value is greater than the tabular value, disconfirm the Ho.
Conclusion:
Since the computed value of 8.299 is greater than the tabular value of 5.991 at 0.05 level
of significance with 2 degrees of freedom, we reject the Ho; there is a significant difference
in the pH readings among the three different municipalities in Northern Samar.
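The grand median, the +/− counts, and the chi-square can all be reproduced in Python (an illustrative sketch; the small difference from 8.299 comes from the text rounding E to one decimal):

```python
# Median test for the pH example: grand median, + / - counts per
# municipality, expected counts from the margins, then x² = Σ(O - E)²/E.
data = {
    "Biri":          [4.4, 4.0, 4.1, 3.5, 2.4, 3.8, 4.2, 3.9, 4.1, 4.2],
    "Catarman":      [4.6, 4.5, 4.3, 3.8, 4.2, 4.5, 4.7, 4.3, 4.5, 4.8],
    "Silvino Lubos": [4.7, 4.8, 5.0, 4.9, 3.9, 4.5, 4.6, 4.3, 4.0, 4.7],
}

pooled = sorted(v for values in data.values() for v in values)
median = (pooled[14] + pooled[15]) / 2   # 30 readings -> mean of 15th and 16th = 4.3

above = {town: sum(1 for v in values if v > median) for town, values in data.items()}
below = {town: len(values) - above[town] for town, values in data.items()}

total_above = sum(above.values())   # 14
total_below = sum(below.values())   # 16
grand = total_above + total_below   # 30

chi2 = 0.0
for town, values in data.items():
    e_above = len(values) * total_above / grand   # 10 * 14 / 30 ≈ 4.67
    e_below = len(values) * total_below / grand   # 10 * 16 / 30 ≈ 5.33
    chi2 += (above[town] - e_above) ** 2 / e_above
    chi2 += (below[town] - e_below) ** 2 / e_below

print(median, round(chi2, 2))  # 4.3 8.3
```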
THE SPEARMAN RANK ORDER CORRELATION COEFFICIENT rs
rs = 1 − [6∑D²/(n(n² − 1))]
Where:
rs = the Spearman rank correlation coefficient
D = the difference between the ranks of each pair of observations
n = the number of pairs of observations
EXAMPLE:
The following are the number of hours which 12 students spent in studying for a midterm
examination and the grades they obtained in English. Calculate rs at 0.05 level of
significance.
Number of Hours Studied (x)    Midterm Grades (y)
5                              50
6                              60
11                             79
20                             90
19                             85
20                             92
10                             80
12                             82
8                              65
15                             85
18                             94
10                             70
Problem:
Is there a significant relationship between the number of hours spent in studying English
and the corresponding grades in the midterm examination?
Hypothesis:
Ho: There is no significant relationship between the number of hours spent in studying
English and the corresponding grades in the midterm examination.
Ha: There is a significant relationship between the number of hours spent in studying
English and the corresponding grades in the midterm examination.
Level of significance:
α = 0.05
df = 12 – 1 = 11
rs = 0.532
Computation:
Number of Hours    Midterm
Studied (x)        Grades (y)    Rx     Ry     D      D²
5                  50            12     12     0      0
6                  60            11     11     0      0
11                 79            7      8      -1     1
20                 90            1.5    3      -1.5   2.25
19                 85            3      4.5    -1.5   2.25
20                 92            1.5    2      -0.5   0.25
10                 80            8.5    7      1.5    2.25
12                 82            6      6      0      0
8                  65            10     10     0      0
15                 85            5      4.5    0.5    0.25
18                 94            4      1      3      9
10                 70            8.5    9      -0.5   0.25
                                               ∑D² = 17.5
rs = 1 − [6∑D²/(n(n² − 1))]
rs = 1 − [6(17.5)/(12(12² − 1))] = 0.94
Decision Rule:
If the rs computed value is greater than the rs tabular value, disconfirm the Ho.
Conclusion:
Since the rs computed value of 0.94 is greater than the rs tabular value of 0.532 at 0.05
level of significance with 11 degrees of freedom, the research hypothesis is confirmed. A
significant relationship between the number of hours spent in studying English and the
grade in the midterm examination in English is established. It implies that the more hours
are devoted to studying, the higher the result in the examination.
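The rank assignment (including the mid-ranks for tied values) and the rs formula can be checked with a Python sketch (illustrative only; the data are from the example):

```python
# Spearman rs for the hours/grades example, using mid-ranks for ties.
hours  = [5, 6, 11, 20, 19, 20, 10, 12, 8, 15, 18, 10]
grades = [50, 60, 79, 90, 85, 92, 80, 82, 65, 85, 94, 70]

def ranks(values):
    # mid-rank: tied values share the average of the positions they occupy
    ordered = sorted(values)
    return [ordered.index(v) + 1 + (ordered.count(v) - 1) / 2 for v in values]

n = len(hours)
d2 = sum((rx - ry) ** 2 for rx, ry in zip(ranks(hours), ranks(grades)))
rs = 1 - 6 * d2 / (n * (n ** 2 - 1))
print(d2, round(rs, 2))  # 17.5 0.94
```

Ranking from lowest to highest (as here) or from highest to lowest (as in the table above) gives the same ∑D², since both variables are ranked in the same direction.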
FRIEDMAN Fr TEST FOR RANDOMIZED BLOCK DESIGNS
The Friedman Fr test is a nonparametric test used for comparing the distributions of
measurements for k treatments laid out in b blocks in a randomized block design. The
procedure for conducting the test is similar to that used for the Kruskal-Wallis H Test.
When either the number k of treatments or the number b of blocks is larger than five, the
sampling distribution of Fr can be approximated by a chi-square distribution with (k − 1)
df. The formula for the Fr test is:
Fr = [12/(bk(k + 1))] ∑Ti² − 3b(k + 1)
Where:
b = the number of blocks
k = the number of treatments
Ti = the rank sum for the i-th treatment
Example
Antibiotics
Child 1 2 3 4
1 5.8 2.5 6.7 6.2
2 9 9 6.6 9.5
3 5 2.6 3.5 6.6
4 7.9 9.4 5.3 8.4
5 3.9 7.5 2.5 2.5
Problem: Is there a significant difference in the reaction of the five children to the 4
different antibiotics?
Hypotheses:
Ho: There is no significant difference in the reactions of the five children to the 4 different
antibiotics.
Ha: There is a significant difference in the reactions of the five children to the 4 different
antibiotics.
Level of Significance:
α=0.05
df=k-1 =4-1 = 3
x2 = 7.815
Statistics:
             Antibiotic 1     Antibiotic 2     Antibiotic 3     Antibiotic 4
Child        Value   T1       Value   T2       Value   T3       Value   T4
1            5.8     2        2.5     1        6.7     4        6.2     3
2            9.0     2.5      9.0     2.5      6.6     1        9.5     4
3            5.0     3        2.6     1        3.5     2        6.6     4
4            7.9     2        9.4     4        5.3     1        8.4     3
5            3.9     3        7.5     4        2.5     1.5      2.5     1.5
Rank Sum             12.5             12.5             9.5              15.5
Fr = [12/(bk(k + 1))] ∑Ti² − 3b(k + 1)
Fr = [12/((5)(4)(4 + 1))][12.5² + 12.5² + 9.5² + 15.5²] − 3(5)(5) = 2.16
Decision Rule: If the computed value of Fr is greater than the tabular value of chi square,
reject the Ho.
Conclusion: Since the computed value of Fr = 2.16 is less than the chi square tabular
value of 7.815, we fail to reject Ho; there is no significant difference in the reactions of the
five children to the four antibiotics.
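The within-block ranking and the Fr formula can be reproduced with a Python sketch (illustrative only; the data are the antibiotic reaction values from the example):

```python
# Friedman Fr for the antibiotics example: rank within each child (block),
# sum ranks per antibiotic (treatment), then apply the Fr formula.
scores = [
    [5.8, 2.5, 6.7, 6.2],
    [9.0, 9.0, 6.6, 9.5],
    [5.0, 2.6, 3.5, 6.6],
    [7.9, 9.4, 5.3, 8.4],
    [3.9, 7.5, 2.5, 2.5],
]
b, k = len(scores), len(scores[0])   # 5 blocks (children), 4 treatments

def row_ranks(row):
    # mid-ranks within a single block (ties share the average rank)
    ordered = sorted(row)
    return [ordered.index(v) + 1 + (ordered.count(v) - 1) / 2 for v in row]

ranked = [row_ranks(row) for row in scores]
T = [sum(col) for col in zip(*ranked)]   # rank sums: [12.5, 12.5, 9.5, 15.5]

Fr = 12 / (b * k * (k + 1)) * sum(t ** 2 for t in T) - 3 * b * (k + 1)
print(T, round(Fr, 2))  # [12.5, 12.5, 9.5, 15.5] 2.16
```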
McNEMAR'S TEST
McNemar's test assesses the significance of the difference between two correlated
proportions, such as might be found in the case where the two proportions are based on
the same sample of subjects or on matched-pair samples. It is a chi square test for
situations where the samples are matched, that is, not independent. It suits a before-and-
after design, which is tested to find out whether there is a significant change between the
before and after situations.
(𝒃 − 𝒄)𝟐
𝒙𝟐 =
𝒃+𝒄
Where:
Example:
Consider the data on seat belt use before and after involvement in auto accidents
for a sample of 100 accident victims; the two discordant cells are b = 6 and c = 19.
Problem: Is there a significant difference in the use of seat belt before and after
involvement in an automobile accident?
Hypotheses:
Ho: There is no significant difference in the use of seat belt before and after involvement
in an automobile accident.
Ha: There is a significant difference in the use of seat belt before and after involvement
in an automobile accident.
Level of Significance:
α=0.05
x2 = 3.841
Computation:
x² = (b − c)² / (b + c)

x² = (6 − 19)² / (6 + 19) = 6.76
Decision rule: If the computed value is greater than the tabular value, reject the Ho.
Conclusion:
Since the computed value of 6.76 is greater than the tabular value of 3.841 at the
0.05 level of significance with 1 degree of freedom, we reject the Ho: there is a significant
difference in the use of seat belt before and after involvement in an automobile accident.
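The arithmetic of the test is a single expression; a minimal Python sketch (the function name is ours) reproduces the computation above:

```python
def mcnemar_chi2(b, c):
    """McNemar chi-square for two correlated proportions: (b - c)^2 / (b + c)."""
    return (b - c) ** 2 / (b + c)

chi2 = mcnemar_chi2(6, 19)   # discordant cells from the seat-belt example
print(round(chi2, 2))        # 6.76, which exceeds the tabular 3.841
```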
KENDALL'S COEFFICIENT OF CONCORDANCE W
The formula is:
W = 12 ∑D² / [m²(N)(N² − 1)]

Where:
m = the number of raters or judges
N = the number of objects or individuals being ranked
D = the difference between the individual sum of ranks of the raters or judges and the
average of the sums of ranks of the objects or individuals
Significance tests:
In the case of complete ranks, a commonly used significance test for W against a null
hypothesis of no agreement is the chi-square statistic

x² = m(n − 1)W

with (n − 1) degrees of freedom.
Example:

Individual     Judges' Ranks
Projects       A    B    C    D
1              1    2    3    4
2              3    1    2    2
3              4    4    1    3
4              5    5    5    1
5              2    6    7    6
6              8    3    4    7
7              6    8    6    5
8              7    7    8    9
9              10   10   9    8
10             9    9    10   10
Hypotheses:
Ho: There is no agreement or concordance among the 4 judges regarding the 10 projects.
Ha: There is an agreement or concordance among the 4 judges regarding the 10 projects.
Level of Significance:
α = 0.05
d.f. = n-1=10-1=9
x2 = 16.92
Computations:
Individual     Judges' Ranks           Sum of     D = R̄ − Sum
Projects       A    B    C    D        Ranks      of Ranks         D²
1              1    2    3    4        10         12               144
2              3    1    2    2        8          14               196
3              4    4    1    3        12         10               100
4              5    5    5    1        16         6                36
5              2    6    7    6        21         1                1
6              8    3    4    7        22         0                0
7              6    8    6    5        25         -3               9
8              7    7    8    9        31         -9               81
9              10   10   9    8        37         -15              225
10             9    9    10   10       38         -16              256
Total                                  220                         1048
R̄ = 220 / 10 = 22
W = 12 ∑D² / [m²(N)(N² − 1)]

W = 12(1048) / [4²(10)(10² − 1)] = 0.79

x² = m(n − 1)W
x² = 4(10 − 1)(0.79)
x² = 28.44
Decision rule:
If the computed value is greater than the tabular value, reject the Ho.
Conclusion:
Since the computed value of x² = 28.44 is greater than the tabular value of x² = 16.92 at
the 0.05 level of significance with d.f. = 9, we reject the Ho. There is an agreement or
concordance in the ranking of the judges regarding the 10 projects.
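The W and chi-square computation can be reproduced with a plain-Python sketch. Like the worked example, it rounds W to two decimals before computing the chi-square statistic:

```python
# judges' ranks for the 10 projects (rows) by judges A-D (columns)
ranks = [[1, 2, 3, 4], [3, 1, 2, 2], [4, 4, 1, 3], [5, 5, 5, 1], [2, 6, 7, 6],
         [8, 3, 4, 7], [6, 8, 6, 5], [7, 7, 8, 9], [10, 10, 9, 8], [9, 9, 10, 10]]
m = len(ranks[0])                  # number of judges (4)
N = len(ranks)                     # number of projects (10)
sums = [sum(row) for row in ranks]
mean = sum(sums) / N               # average sum of ranks, R-bar = 22
sum_d2 = sum((mean - s) ** 2 for s in sums)                # 1048
W = round(12 * sum_d2 / (m ** 2 * N * (N ** 2 - 1)), 2)    # 0.79
chi2 = round(m * (N - 1) * W, 2)                           # 28.44
print(W, chi2)   # 0.79 28.44
```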
Activity:
2. The following data are the results of 4 groups of animals given four different
treatments.
T1 T2 T3 T4
5 8 17 10
6 15 18 12
16 12 15 20
15 15 12 20
15 10 13 25
Test at 0.05 level of significance to find out if there is a significant difference in
the 4 groups of animals given four different treatments assuming data is not
normal.
4. Six subjects were exposed to 4 treatments and the following data were recorded.
Treatments
Subjects T1 T2 T3 T4
1 9 5 6 2
2 10 10 3 5
3 8 7 9 10
4 5 6 3 4
5 10 9 8 7
6 5 6 8 9
Test at 0.05 level of significance to test the null hypothesis that there is no
significant difference among the six subjects on four different treatments.
5. Fifteen women were enrolled in a slimming program for a period of 8 weeks. Their
weights were taken before and after the program, and are shown below.
Before After Before After
130 125 141 135
120 120 130 130
125 126 125 125
140 115 114 110
140 148 115 105
135 120 120 110
115 100 135 120
148 128
Test at 0.05 level of significance if there was a significant difference before and
after the program. Assuming that data is not normal.
6. A group of six subjects, each taught using three different methods of teaching
English, had the following scores.
Method 1 Method 2 Method 3
37 38 40
38 40 70
40 45 68
30 50 70
35 49 38
36 30 49
Test at 0.05 level of significance. Assume data is not normal.
7. Data on charter change before and after a televised debate for a sample of 50
registered voters are found below.
Before the debate     After the Debate          Total
                      Yes         No
Yes                   19          11            30
No                    8           12            20
Total                 27          23            50
Test at 0.05 level of significance to find out if televised debate will change the
opinions of 50 registered voters.
Reliability and Validity Tests
______________________________________________
MODULE 7
Reliability and Validity Tests
1. Define reliability, including the different types and how they are assessed.
2. Define validity, including the different types and how they are assessed.
3. Describe the kinds of evidence that would be relevant to assessing the reliability
and validity of a particular measure.
Lecture:
INTRODUCTION
Reliability and validity are concepts used to evaluate the quality of research. They indicate
how well a method, technique or test measures something. Reliability is about the
consistency of a measure, and validity is about the accuracy of a measure.
It’s important to consider reliability and validity when you are creating your research
design, planning your methods, and writing up your results, especially in quantitative
research.
RELIABILITY VS VALIDITY
How do they relate?
Reliability: A reliable measurement is not always valid: the results might be
reproducible, but they're not necessarily correct.
Validity: A valid measurement is generally reliable: if a test produces accurate results,
they should be reproducible.
Reliability and validity are closely related, but they mean different things. A measurement
can be reliable without being valid. However, if a measurement is valid, it is usually also
reliable.
WHAT IS RELIABILITY?
Reliability refers to how consistently a method measures something. If the same result
can be consistently achieved by using the same methods under the same circumstances,
the measurement is considered reliable.
You measure the temperature of a liquid sample several times under identical conditions.
The thermometer displays the same temperature every time, so the results are reliable.
A doctor uses a symptom questionnaire to diagnose a patient with a long-term medical
condition. Several different doctors use the same questionnaire with the same patient but
give different diagnoses. This indicates that the questionnaire has low reliability as a
measure of the condition.
WHAT IS VALIDITY?
Validity refers to how accurately a method measures what it is intended to measure.
High reliability is one indicator that a measurement is valid. If a method is not reliable, it
probably isn't valid.
If the thermometer shows different temperatures each time, even though you have
carefully controlled conditions to ensure the sample’s temperature stays the same, the
thermometer is probably malfunctioning, and therefore its measurements are not valid.
If a symptom questionnaire results in a reliable diagnosis when answered at different
times and with different doctors, this indicates that it has high validity as a measurement
of the medical condition.
However, reliability on its own is not enough to ensure validity. Even if a test is reliable, it
may not accurately reflect the real situation.
The thermometer that you used to test the sample gives reliable results. However, the
thermometer has not been calibrated properly, so the result is 2 degrees lower than the
true value. Therefore, the measurement is not valid.
A group of participants take a test designed to measure working memory. The results are
reliable, but participants’ scores correlate strongly with their level of reading
comprehension. This indicates that the method might have low validity: the test may be
measuring participants’ reading comprehension instead of their working memory.
Validity is harder to assess than reliability, but it is even more important. To obtain useful
results, the methods you use to collect your data must be valid: the research must be
measuring what it claims to measure. This ensures that your discussion of the data and
the conclusions you draw are also valid.
TYPES OF RELIABILITY

Test-retest
The consistency of a measure across time: do you get the same results when you repeat
the measurement?
Example: A group of participants complete a questionnaire designed to measure
personality traits. If they repeat the questionnaire days, weeks or months apart and give
the same answers, this indicates high test-retest reliability.

Interrater
The consistency of a measure across raters or observers: do you get the same results
when different people conduct the same measurement?
Example: Based on an assessment criteria checklist, five examiners submit substantially
different results for the same student project. This indicates that the assessment checklist
has low inter-rater reliability (for example, because the criteria are too subjective).

Parallel forms
It measures the correlation between two equivalent versions of a test. You use it when
you have two different assessment tools or sets of questions designed to measure the
same thing.
Example: A set of questions is formulated to measure financial risk aversion in a group
of respondents. The questions are randomly divided into two sets, and the respondents
are randomly divided into two groups. Both groups take both tests: group A takes test A
first, and group B takes test B first. The results of the two tests are compared, and the
results are almost identical, indicating high parallel forms reliability.

Split-half
The split-half method assesses the internal consistency of a test, such as psychometric
tests and questionnaires. It measures the extent to which all parts of the test contribute
equally to what is being measured.
Example: The Minnesota Multiphasic Personality Inventory has subscales measuring
different behaviors such as depression, schizophrenia and social introversion. Therefore,
the split-half method would not be an appropriate method to assess the reliability of this
personality test.
TYPES OF VALIDITY
The validity of a measurement can be estimated based on three main types of evidence.
Each type can be evaluated through expert judgement or statistical methods.
Criterion
The extent to which the result of a measure corresponds to other valid measures of the
same concept.
Example: A survey is conducted to measure the political opinions of voters in a region. If
the results accurately predict the later outcome of an election in that region, this indicates
that the survey has high criterion validity.
The reliability and validity of your results depend on creating a strong research design,
choosing appropriate methods and samples, and conducting the research carefully and
consistently.
ENSURING VALIDITY
For example, to collect data on a personality trait, you could use a standardized
questionnaire that is considered reliable and valid. If you develop your own questionnaire,
it should be based on established theory or findings of previous studies, and the questions
should be carefully and precisely worded.
To produce valid generalizable results, clearly define the population you are researching
(e.g. people from a specific age range, geographical location, or profession). Ensure that
you have enough participants and that they are representative of the population.
ENSURING RELIABILITY
Reliability should be considered throughout the data collection process. When you use a
tool or technique to collect data, it’s important that the results are precise, stable and
reproducible.
Apply your methods consistently
Plan your method carefully to make sure you carry out the same steps in the same way
for each measurement. This is especially important if multiple researchers are involved.
For example, if you are conducting interviews or observations, clearly define how specific
behaviors or responses will be counted, and make sure questions are phrased the same
way each time.
Standardize the conditions of your research
When you collect your data, keep the circumstances as consistent as possible to reduce
the influence of external factors that might create variation in the results.
For example, in an experimental setup, make sure all participants are given the same
information and tested under the same conditions.
Where to write about reliability and validity in a thesis
It’s appropriate to discuss reliability and validity in various sections of your thesis or
dissertation. Showing that you have taken them into account in planning your research
and interpreting the results makes your work more credible and trustworthy.
Reliability and validity in a thesis
Section             Discuss
Literature review   What have other researchers done to devise and improve methods
                    that are reliable and valid?
Methodology         How did you plan your research to ensure the reliability and validity
                    of the measures used? This includes the chosen sample set and
                    size, sample preparation, external conditions and measuring
                    techniques.
Results             If you calculate reliability and validity, state these values alongside
                    your main results.
Discussion          This is the moment to talk about how reliable and valid your results
                    actually were. Were they consistent, and did they reflect true
                    values? If not, why not?
Conclusion          If reliability and validity were a big problem for your findings, it might
                    be helpful to mention this here.
Rating scale: Retain = 3; Needs Improvement = 2; Delete = 1

1.  Curriculum enrichment               4, 3, 2, 1
2.  Theory and practice scheme          4, 3, 2, 1
3.  Brainstorming method                4, 3, 2, 1
4.  Experimental method                 4, 3, 2, 1
5.  Supervised or directed study        4, 3, 2, 1
6.  Laboratory approach                 4, 3, 2, 1
7.  Discovery approach                  4, 3, 2, 1
8.  Guided discovery approach           4, 3, 2, 1
9.  Unstructured approach               4, 3, 2, 1
10. Project method                      4, 3, 2, 1
11. Others (please specify)             4, 3, 2, 1
The researcher requires a selected group of experts to validate the content of the
questionnaire on the basis of the foregoing questions. If the weighted mean is 2.5 and
above, the item is retained; 1.5 to 2.4, it needs improvement; and below 1.5, it is deleted.
For instance, suppose there are five experts to validate the above questionnaire. In Item 1,
"Curriculum enrichment," 4 experts rated 3 and 1 rated 2. The weighted mean is used.
The computation is as follows:
x       f     xf
3       4     12
2       1     2
Total   5     14

x̄ = ∑fx / ∑f = 14 / 5 = 2.8 ≈ 3 (retain)
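The same decision rule can be scripted. The sketch below recomputes the Item 1 weighted mean from the experts' ratings (the function names are ours, chosen for illustration):

```python
def weighted_mean(freq):
    """freq maps each rating x to the number of experts f who gave it."""
    return sum(x * f for x, f in freq.items()) / sum(freq.values())

def decision(mean):
    """Apply the retain / needs improvement / delete cut-offs from the text."""
    if mean >= 2.5:
        return "retain"
    if mean >= 1.5:
        return "needs improvement"
    return "delete"

m = weighted_mean({3: 4, 2: 1})   # Item 1: four experts rated 3, one rated 2
print(m, decision(m))             # 2.8 retain
```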
For the Test-retest method, the same instrument is administered twice to the same group
of subjects and the correlation coefficient between the two sets of scores is determined.
Spearman rho Computation of the First and Second administration of Achievement Test
in Computer
Respondents   1st    2nd    R1     R2     D      D²
1             75     75     6      6      0      0
2             53     55     13     12.5   0.5    0.25
3             47     48     15     15     0      0
4             83     80     2      3.5    -1.5   2.25
5             70     75     8.5    6      2.5    6.25
6             69     70     10     9.5    0.5    0.25
7             70     70     8.5    9.5    -1     1
8             55     55     12     12.5   -0.5   0.25
9             77     75     5      6      -1     1
10            85     85     1      1      0      0
11            79     80     4      3.5    0.5    0.25
12            57     55     11     12.5   -1.5   2.25
13            81     82     3      2      1      1
14            50     55     14     12.5   1.5    2.25
15            71     71     7      8      -1     1
                                          ∑D²    18
rs = 1 − 6 ∑D² / (N³ − N) = 1 − 6(18) / (15³ − 15) = 0.97 (very high reliability)
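The rank-and-difference table can be reproduced in a few lines of Python. Scores are ranked from highest (rank 1) downward, averaging ties, as in the table above; the `ranks_desc` helper is ours:

```python
def ranks_desc(scores):
    """Rank scores with rank 1 for the highest value, mid-ranks for ties."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(scores):
        j = i
        while j + 1 < len(scores) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1          # average position of the tied run, 1-based
        for idx in order[i:j + 1]:
            ranks[idx] = avg
        i = j + 1
    return ranks

first  = [75, 53, 47, 83, 70, 69, 70, 55, 77, 85, 79, 57, 81, 50, 71]
second = [75, 55, 48, 80, 75, 70, 70, 55, 75, 85, 80, 55, 82, 55, 71]
r1, r2 = ranks_desc(first), ranks_desc(second)
sum_d2 = sum((a - b) ** 2 for a, b in zip(r1, r2))   # 18.0
N = len(first)
rs = 1 - 6 * sum_d2 / (N ** 3 - N)
print(sum_d2, round(rs, 2))   # 18.0 0.97
```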
For the Split-half method. The test in this method may be administered once, but the test
items are divided into halves. The common procedure is to divide a test into odd and even
items. The two halves of the test must be similar but not identical in content, number of
items, difficulty, means, and standard deviations. Each student obtains two scores, one
on the odd and the other on the even items in the same test. The scores obtained in the
two halves are correlated. The result is reliability coefficient for a half test. Since the
reliability holds only for a half test, the reliability coefficient of a whole test is estimated by
using the Spearman Brown formula. This formula is as follows:
rwt = 2(rht) / (1 + rht)

Where:
rwt = the reliability of the whole test
rht = the reliability of the half test
Respondents   Odd    Even   R1     R2     D      D²
4             43     50     10.5   9.5    1      1
5             35     31     12     12     0      0
6             64     72     6      3      3      9
7             57     57     7      8      -1     1
8             70     67     4      6      -2     4
9             69     69     5      5      0      0
10            48     50     9      9.5    -0.5   0.25
11            43     41     10.5   11     -0.5   0.25
12            75     75     1      2      -1     1
                                          ∑D²    25.5
rs = 1 − 6 ∑D² / (N³ − N) = 1 − 6(25.5) / (12³ − 12) = 0.91 (reliability of the half test)

rwt = 2(rht) / (1 + rht) = 2(0.91) / (1 + 0.91) = 0.95 (very high reliability)
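The Spearman-Brown step-up is a one-line correction; a minimal sketch (the function name is ours):

```python
def spearman_brown(r_half):
    """Step up a half-test reliability to the estimated whole-test reliability."""
    return 2 * r_half / (1 + r_half)

print(round(spearman_brown(0.91), 2))  # 0.95, matching the computation above
```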
For the Internal-consistency method. This method is used with psychological tests which
consist of dichotomously scored items. The examinee either passes or fails an item. A
score of 1 (one) is assigned for a pass and 0 (zero) for a failure. The reliability coefficient
in this method is determined by the Kuder-Richardson Formula 20. This formula is a
measure of the internal consistency or homogeneity of the research instrument. The
formula is as follows:

rxx = [N / (N − 1)] [(SD² − ∑piqi) / SD²]
Where:
N = number of items
SD² = variance of the test scores, ∑(x − x̄)² / (n − 1)
The proportion of individuals passing item i is denoted by the symbol pi and the proportion
failing by qi, where qi = 1 − pi. If all items are perfectly correlated, a situation which can only
arise when all have the same difficulty, rxx = 1.0.
The steps in applying Kuder-Richardson Formula 20 are as follows:
Step 1. Compute for the variance (𝑆𝐷2 ) of the test scores for the whole group.
Step 2. Find the proportion passing each item (pi) and the proportion failing each
item (qi). For instance, ten of the twelve students passed or got the correct answer for
Item 2, hence pi = 10/12 = 0.83. There are two students who failed Item 2, thus qi = 2/12 =
0.17. It can also be taken by qi = 1 − pi, or 1 − 0.83 = 0.17.
Step 3. Multiply pi and qi for each item. For instance, 0.83 × 0.17 = 0.1411. This gives the
piqi value. Get the sum of the piqi column to get ∑piqi.
Step 4. Substitute the calculated values into the formula.
For illustration purposes, consider the example below. Suppose a test of 15 items
has been administered to a group of twelve students. The table below presents the
computation of the Kuder-Richardson Formula 20 for the twelve students' responses to a
fifteen-item test.
Computation of Kuder-Richardson Formula 20 of the Twelve Students Responses for a
Fifteen-Item Test
Students
Item 1 2 3 4 5 6 7 8 9 10 11 12 f pi qi piqi
1 1 1 1 1 1 1 1 1 1 1 1 0 11 0.92 0.08 0.08
2 1 1 1 1 1 1 1 1 1 1 0 0 10 0.83 0.17 0.14
3 1 1 1 1 1 1 1 1 0 0 0 0 8 0.67 0.33 0.22
4 1 1 1 1 1 1 1 1 0 0 0 0 8 0.67 0.33 0.22
5 1 1 1 1 1 1 1 1 0 0 0 0 8 0.67 0.33 0.22
6 1 1 1 1 1 1 1 0 0 0 0 0 7 0.58 0.42 0.24
7 1 1 1 1 1 1 1 0 0 0 0 0 7 0.58 0.42 0.24
8 1 1 1 1 1 1 1 0 0 0 0 0 7 0.58 0.42 0.24
9 1 1 1 1 1 1 1 0 0 0 0 0 7 0.58 0.42 0.24
10 1 1 1 1 1 0 0 0 0 0 0 0 5 0.42 0.58 0.24
11 1 1 1 1 1 0 0 0 0 0 0 0 5 0.42 0.58 0.24
12 1 1 1 1 0 0 0 0 0 0 0 0 4 0.33 0.67 0.22
13 1 1 1 0 0 0 0 0 0 0 0 0 3 0.25 0.75 0.19
14 1 1 1 0 0 0 0 0 0 0 0 0 3 0.25 0.75 0.19
15 1 1 0 0 0 0 0 0 0 0 0 0 2 0.17 0.83 0.14
Total 15 15 14 12 11 9 9 5 2 2 1 0 ∑piqi 3.08
Variance (SD²) Computation

Student   x     x − x̄    (x − x̄)²
1         15    7.08     50.1264
2         15    7.08     50.1264
3         14    6.08     36.9664
4         12    4.08     16.6464
5         11    3.08     9.4864
6         9     1.08     1.1664
7         9     1.08     1.1664
8         5     -2.92    8.5264
9         2     -5.92    35.0464
10        2     -5.92    35.0464
11        1     -6.92    47.8864
12        0     -7.92    62.7264
Total     95             354.9168
x̄ = ∑x / n = 95 / 12 = 7.92

SD² = ∑(x − x̄)² / (n − 1) = 354.9168 / (12 − 1) = 32.27

rxx = [N / (N − 1)] [(SD² − ∑piqi) / SD²]

rxx = [15 / (15 − 1)] [(32.27 − 3.08) / 32.27] = 0.97 (very high reliability)
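The whole KR-20 computation can be checked with a short script. It works from the same 15 × 12 response matrix (rows = items, columns = students) and keeps full precision, so the small rounding differences of the hand computation disappear:

```python
items = [
    [1,1,1,1,1,1,1,1,1,1,1,0],
    [1,1,1,1,1,1,1,1,1,1,0,0],
    [1,1,1,1,1,1,1,1,0,0,0,0],
    [1,1,1,1,1,1,1,1,0,0,0,0],
    [1,1,1,1,1,1,1,1,0,0,0,0],
    [1,1,1,1,1,1,1,0,0,0,0,0],
    [1,1,1,1,1,1,1,0,0,0,0,0],
    [1,1,1,1,1,1,1,0,0,0,0,0],
    [1,1,1,1,1,1,1,0,0,0,0,0],
    [1,1,1,1,1,0,0,0,0,0,0,0],
    [1,1,1,1,1,0,0,0,0,0,0,0],
    [1,1,1,1,0,0,0,0,0,0,0,0],
    [1,1,1,0,0,0,0,0,0,0,0,0],
    [1,1,1,0,0,0,0,0,0,0,0,0],
    [1,1,0,0,0,0,0,0,0,0,0,0],
]
N = len(items)        # 15 items
n = len(items[0])     # 12 students

# sum of p_i * q_i over items
sum_pq = sum((sum(row) / n) * (1 - sum(row) / n) for row in items)

# each student's total score, and the variance of those scores (n - 1 divisor)
scores = [sum(items[i][s] for i in range(N)) for s in range(n)]
mean = sum(scores) / n
sd2 = sum((x - mean) ** 2 for x in scores) / (n - 1)

rxx = (N / (N - 1)) * ((sd2 - sum_pq) / sd2)
print(round(sum_pq, 2), round(sd2, 2), round(rxx, 2))   # 3.08 32.27 0.97
```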
Activity:
Essay:
3. How do you administer the test-retest method in testing the reliability of a research
instrument?
4. Which of the qualities of the research instrument is most practical? Why?
5. How are parallel-forms method of estimating the reliability of a research instrument
constructed?
Problem Solving:
1. Using the data below, find out if the research instrument is reliable applying the
split half method administered to 15 students.
Even-Odd Even-Odd Even-Odd Even-Odd Even-Odd
51-52 21-23 51-50 47-49 40-42
28-30 20-19 23-25 40-41 32-30
55-52 38-37 22-23 29-25 57-52
Students
Items 1 2 3 4 5 6 7 8 9 10 11 12
1 1 1 1 1 1 1 1 1 1 1 1 1
2 1 1 1 1 1 1 1 1 0 1 1 1
3 1 1 1 1 1 1 1 1 0 1 1 1
4 0 1 1 1 1 1 1 1 0 1 1 1
5 0 0 1 0 1 1 1 1 1 1 1 0
6 0 0 0 0 1 1 1 1 0 0 1 0
7 0 0 0 0 1 1 1 0 0 0 0 0
Total 3 4 5 4 7 7 7 6 2 5 6 4
Using Excel for Statistical Data Analysis
______________________________________________
MODULE 8
Using Excel for Statistical Data Analysis
By the end of this session you should be able to use Microsoft Excel for data summary,
presentation and statistical analysis.
Lecture:
INTRODUCTION
Microsoft Excel is a spreadsheet program. That means it's used to create grids of text,
numbers and formulas specifying calculations. That's extremely valuable for many
businesses, which use it to record expenditures and income, plan budgets, chart data
and succinctly present fiscal results.
It can be programmed to pull in data from external sources such as stock market feeds,
automatically running the data through formulas such as financial models to update such
information in real time. Like Microsoft Word, Excel has become a de facto standard in
the business world, with Excel spreadsheets frequently emailed and otherwise shared to
exchange data and perform various calculations.
Excel also contains fairly powerful programming capabilities, for those who wish to use
them, that can be used to develop relatively sophisticated financial and scientific
computations.
You can perform statistical analysis with the help of Excel. It is used by many data
scientists who need to understand statistical concepts and the behavior of their data.
But when the data set is huge or you need a specialized analysis model such as linear
regression, you should go for advanced tools such as Python or R.
Here, we will go through the basic concept of statistical analysis and will apply the
concepts to our own data.
Before starting, you need to check whether the Excel Analysis ToolPak is enabled (it is
an add-in provided by Microsoft Excel). To check, go to Excel → Data and see whether
the Data Analysis option appears at the top right corner. If it is not there, go to Excel →
File → Options → Add-ins and enable the Analysis ToolPak by selecting the Excel
Add-ins option in the Manage tab, then click Go. This will open a small window; select the
Analysis ToolPak option and enable it.
These are some of the tests you can perform using Excel Statistical Analysis.
DESCRIPTIVE ANALYSIS
You can find descriptive analysis by going to Excel → Data → Data Analysis → Descriptive
Statistics. It is the most basic set of analyses that can be performed on any data set. It
gives you the general behavior and pattern of the data. It is helpful when you have a
set of data and want a summary of that dataset. This will show the following
statistics for the chosen dataset.
Example:
For example, you may have the scores of 14 participants for a test.
To generate descriptive statistics for these scores, execute the following steps.
6. Click OK.
Result:
ANOVA
It is a data analysis method which shows whether the means of two or more data sets are
significantly different from each other or not. In other words, it analyses two or more
groups simultaneously and finds out whether any relationship is there among the groups
of the data set or not. For example, you can use ANOVA if you want to analyse the traffic of
three different cities and find out which one is more efficient in handling the traffic (or if
there are no significant differences among the traffic).
If you have three groups of datasets and want to check whether there is any significant
difference between these groups or not, you can use ANOVA single factor.
If the P-value in the ANOVA summary table is less than 0.05, you can say that there
is a significant difference between the groups.
EXAMPLE:
This example teaches you how to perform a single factor ANOVA (analysis of variance)
in Excel. A single factor or one-way ANOVA is used to test the null hypothesis that the
means of several populations are all equal.
Below you can find the salaries of people who have a degree in economics, medicine or
history.
H0: μ1 = μ2 = μ3
H1: at least one of the means is different.
3. Click in the Input Range box and select the range A2:C10.
5. Click OK.
Result:
Conclusion: if F > F crit, we reject the null hypothesis. This is the case, 15.196 > 3.443.
Therefore, we reject the null hypothesis. The means of the three populations are not all
equal. At least one of the means is different. However, the ANOVA does not tell you
where the difference lies. You need a t-Test to test each pair of means.
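Excel's ANOVA: Single Factor output can be reproduced by hand. The sketch below uses small made-up salary groups (the real figures are in the worksheet screenshot, which is not reproduced here), so the numbers are illustrative only:

```python
def anova_single_factor(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA."""
    n = sum(len(g) for g in groups)
    k = len(groups)
    grand = sum(sum(g) for g in groups) / n
    # between-groups and within-groups sums of squares
    ssb = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups)
    ssw = sum(sum((x - sum(g) / len(g)) ** 2 for x in g) for g in groups)
    return (ssb / (k - 1)) / (ssw / (n - k)), k - 1, n - k

# hypothetical salaries (in $1000s) for economics, medicine and history degrees
F, dfb, dfw = anova_single_factor([[40, 45, 50], [55, 60, 65], [70, 75, 80]])
print(round(F, 2), dfb, dfw)   # 27.0 2 6
```

As in Excel's summary table, the computed F is then compared against F crit for (df_between, df_within) degrees of freedom.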
MOVING AVERAGE
Moving average is usually applicable to time series data such as stock prices, weather
reports, attendance in class, etc. For example, it is heavily used on stock prices as a technical
indicator. If you want to predict the stock price of today, the last ten days' data would be
more relevant than the last year's. So, you can plot the moving average of the stock
over a 10-day period and you can then predict the price to some extent. The same
applies to the temperature of a city. The recent temperature of a city can be estimated
by taking the average of the last few weeks rather than previous months.
Example
This example teaches you how to calculate the moving average of a time series in Excel.
A moving average is used to smooth out irregularities (peaks and valleys) to easily
recognize trends.
3. Select Moving Average and click OK.
4. Click in the Input Range box and select the range B2:M2.
7. Click OK.
Explanation: because we set the interval to 6, the moving average is the average of the
previous 5 data points and the current data point. As a result, peaks and valleys are
smoothed out. The graph shows an increasing trend. Excel cannot calculate the moving
average for the first 5 data points because there are not enough previous data points.
Conclusion: The larger the interval, the more the peaks and valleys are smoothed out.
The smaller the interval, the closer the moving averages are to the actual data points.
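What Excel's Moving Average tool does can be written out directly. Each output value averages the current point and the (interval − 1) points before it, and the first (interval − 1) positions have no value, just as described above (the monthly figures below are made up for illustration):

```python
def moving_average(series, interval):
    """Trailing moving average; returns one value per complete window."""
    return [sum(series[i - interval + 1:i + 1]) / interval
            for i in range(interval - 1, len(series))]

monthly = [20, 22, 21, 25, 24, 28, 27, 30]   # hypothetical time series
print([round(v, 2) for v in moving_average(monthly, 3)])
# [21.0, 22.67, 23.33, 25.67, 26.33, 28.33]
```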
RANK AND PERCENTILE
It calculates the rank and percentile of each value in the data set. For example, if you are
managing a business with several products and want to find out which product is
contributing the most revenue, you can use this rank method in Excel.
In the left table, we have our data on the revenues of different products. And we want to
rank this data of products based on their revenue.
With the help of rank and percentile, we can get the table shown on the right. You can
observe that now the data is sorted and respective rank is also marked with each data.
Percentile shows the category in which the data belongs, such as top 50%, top 30% etc.
In the summary table, the rank of product 7 is 4. As the total number of data is 7, we can
easily say that it belongs to the top 50% of the data.
REGRESSION
This is the window you will get once you click regression option in data analysis. Here,
you have to provide a dependent variable in input Y range and an independent variable
in Input X range. In our example, we have revenue of the product as a dependent variable
and spending on the advertisement as the independent variable (If you have a label in
the data, you can mark the checkbox of the labels). And at last, provide the range of cells
where you want to see the output.
EXAMPLE:
This example teaches you how to run a linear regression analysis in Excel and how to
interpret the Summary Output.
Below you can find our data. The big question is: is there a relation between Quantity
Sold (Output) and Price and Advertising (Input). In other words: can we predict Quantity
Sold if we know Price and Advertising?
1. On the Data tab, in the Analysis group, click Data Analysis.
3. Select the Y Range (A1:A8). This is the response variable (also called the dependent
variable).
4. Select the X Range (B1:C8). These are the explanatory variables (also called
independent variables). These columns must be adjacent to each other.
5. Check Labels.
7. Check Residuals.
8. Click OK.
Excel produces the following Summary Output (rounded to 3 decimal places).
R SQUARE
R Square equals 0.962, which is a very good fit. 96% of the variation in Quantity Sold is
explained by the independent variables Price and Advertising. The closer to 1, the better
the regression line (read on) fits the data.
To check if your results are reliable (statistically significant), look at Significance F (0.001).
If this value is less than 0.05, you're OK. If Significance F is greater than 0.05, it's probably
better to stop using this set of independent variables. Delete a variable with a high P-
value (greater than 0.05) and rerun the regression until Significance F drops below 0.05.
Most or all P-values should be below 0.05. In our example this is the case (0.000,
0.001 and 0.005).
COEFFICIENTS
The regression line is: y = Quantity Sold = 8536.214 − 835.722 * Price + 0.592 *
Advertising. In other words, for each unit increase in Price, Quantity Sold decreases by
835.722 units. For each unit increase in Advertising, Quantity Sold increases by 0.592
units. This is valuable information.
You can also use these coefficients to do a forecast. For example, if price equals $4 and
Advertising equals $3000, you might be able to achieve a Quantity Sold of 8536.214 -
835.722 * 4 + 0.592 * 3000 = 6970.
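The forecast above is just the fitted equation evaluated at new inputs; a sketch using the rounded coefficients from the Summary Output (the function name is ours):

```python
def predict_quantity(price, advertising):
    """Fitted line from the Summary Output (rounded coefficients)."""
    return 8536.214 - 835.722 * price + 0.592 * advertising

q = predict_quantity(4, 3000)
print(round(q))   # 6969, i.e. roughly the 6970 quoted above
```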
RESIDUALS
The residuals show you how far away the actual data points are from the predicted data
points (using the equation). For example, the first data point equals 8500. Using the
equation, the predicted data point equals 8536.214 − 835.722 * 2 + 0.592 * 2800 ≈
8523.009, giving a residual of 8500 − 8523.009 = -23.009. (Excel computes residuals with
the unrounded coefficients, so a hand computation with the rounded values differs slightly.)
CORRELATION
The correlation coefficient (a value between -1 and +1) tells you how strongly two
variables are related to each other. We can use the CORREL function or the Analysis
Toolpak add-in in Excel to find the correlation coefficient between two variables.
A correlation coefficient near 0 indicates no correlation.
To use the Analysis Toolpak add-in in Excel to quickly generate correlation coefficients
between multiple variables, execute the following steps.
3. For example, select the range A1:C6 as the Input Range.
6. Click OK.
Result.
Conclusion: variables A and C are positively correlated (0.91). Variables A and B are not
correlated (0.19). Variables B and C are also not correlated (0.11). You can verify these
conclusions by looking at the graph.
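The CORREL function computes the Pearson correlation coefficient; the formula behind it can be sketched in plain Python (the data below are toy values, not the worksheet figures from the screenshot):

```python
from math import sqrt

def correl(x, y):
    """Pearson correlation coefficient, as Excel's CORREL computes it."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mx) ** 2 for a in x) *
                      sum((b - my) ** 2 for b in y))

print(correl([1, 2, 3, 4], [2, 4, 6, 8]))            # 1.0 (perfect correlation)
print(round(correl([1, 2, 3, 4], [8, 5, 6, 2]), 2))  # negative: inverse relation
```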
Although you can find a simple function to generate a series of random numbers, this
option in data analysis gives you more flexibility in the random number generation
process. It gives us more control over the generated data.
This is the screenshot of the Random Number Generation window. The Number of
Variables field takes the number of columns in which you need random numbers to be
generated, and the Number of Random Numbers field takes the number of rows to be
filled with random data. If you set the number of variables to 10 and the number of
random numbers to 3, it will give the following result.
You can also select the type of data being generated such as Uniform, Normal, Bernoulli,
Binomial, Poisson etc.
SAMPLING
This option is the data analysis tool which is used for creating samples from a huge
population. You can randomly select data from the dataset or select every nth item from
the set. For example, if you want to measure the effectiveness of call centre employees,
you can use this tool to randomly select a few calls every month, listen to the recordings
and give a rating based on the selected calls.
This is a screenshot of the sampling option in data analysis. In input range, you have to
give the reference of input population data set (check the Labels checkbox if your data
has labels). Next, you have to specify the sampling method. If you are selecting periodic
sampling method, you have to give the number as a period. So, it will create a sample
from the population taking the nth data from the population. For example, if your period
is 5, it will select every 5th value from the dataset and create a sample. And if you are
choosing at random, it will randomly select an n number of the data from the population
dataset. The second method can give a closer idea to the actual population as the data
is being selected randomly but there are chances of duplicate data in the sample dataset.
And lastly, you have to specify the output location from the output options.
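The two sampling methods described above can be sketched in plain Python. The helper names `periodic_sample` and `random_sample` are hypothetical, introduced here only for illustration; they are not Excel features.

```python
import random

def periodic_sample(population, period):
    """Take every `period`-th item, like the Periodic method described above."""
    return population[period - 1::period]

def random_sample(population, n, seed=None):
    """Draw n items with replacement, so duplicates are possible,
    matching the behaviour described for the Random method."""
    rng = random.Random(seed)
    return [rng.choice(population) for _ in range(n)]

calls = list(range(1, 26))           # e.g. 25 recorded calls in a month
print(periodic_sample(calls, 5))     # [5, 10, 15, 20, 25]
print(random_sample(calls, 5, seed=1))
```

Note how the periodic sample is fully determined by the period, while the random sample changes from run to run unless a seed is fixed, and may repeat a call.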
The statistical analysis tools are among the most useful features in Excel, and it is worth running them on any new dataset to understand it better. As the examples above showed, you can get a general sense of the data with descriptive statistics, compute a moving average to forecast future values, rank each observation and find its percentile, compare two groups with a statistical test, generate large sets of random data from a specific distribution, and quantify the relationship between two variables using regression. Further options not covered here are documented on the official Microsoft Excel website.
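As a small illustration of two of the tools summarized above, a moving average and basic descriptive statistics can be sketched in plain Python. The `moving_average` helper and the sample data are illustrative assumptions, not Excel functions.

```python
import statistics

def moving_average(values, window):
    """Simple moving average over a sliding window, as used for forecasting."""
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]

sales = [10, 12, 13, 15, 14, 16, 18]   # hypothetical monthly figures

# Each entry averages the current value with the previous window-1 values
print([round(v, 2) for v in moving_average(sales, 3)])  # [11.67, 13.33, 14.0, 15.0, 16.0]

# Descriptive summary of the same series
print(statistics.mean(sales), statistics.stdev(sales))
```

The 3-period moving average smooths short-term fluctuation, which is why the last smoothed value is commonly used as a naive forecast of the next period.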
Activity:
1. A random sample of companies in electric utilities (I), financial services (II), and
food processing (III) gave the following information on annual profits per
employee (in thousands of dollars).
Should we reject the claim that there is no difference in the population mean
annual profit per employee among the three types of companies? Use a 1% level
of significance.
2. The following data represent the operating times in hours for four types of pocket
calculators before a recharge is required.
Test at the 0.01 level of significance whether the operating times of all four
calculators are equal.
3. Ten students were given remedial instruction in reading comprehension. A pretest
and a posttest were administered, with the following results:
Pretest Posttest
25 30
20 25
20 20
8 12
15 20
8 8
15 20
8 8
20 20
21 20
35 38
20 30
Test at the 0.05 level of significance whether there is a significant difference between
the pretest and posttest scores.
4. The following data are based on information from the book Life in America's Small
Cities (by G. S. Thomas, Prometheus Books). Let x be the percentage of 16- to 19-
year-olds not in school and not high school graduates. Let y be the reported violent
crimes per 1,000 residents. Six small cities in Arkansas (Blytheville, El Dorado, Hot
Springs, Jonesboro, Rogers, and Russellville) reported the following information
about x and y:
5. A teacher wants to find out whether a relationship exists between her students'
GWA and their scores on the Entrance Test for the Graduate School Program.
Test at the 0.05 level of significance.
GWA   Entrance Exam Result
2.55 98
2.82 90
2.73 95
2.72 93
2.88 99
2.8 97
2.53 98
2.09 97
2.75 94
1.55 99
2.62 95
2.86 95
2.35 95
2.31 97
2.66 98
Test at the 0.05 level of significance.
Test at the 0.01 level of significance whether the additional instruction affects the
average grade, assuming the data are normal.
COURSE REFERENCES
A. Books
1. Probability and Statistics, Altares, Tayag, Torres, Yap, Ymas, 2015
2. Statistics for the Behavioral Sciences, Frederick J. Gravetter, Larry B. Wallnau,
2015
3. Introduction to Probability and Statistics for Engineers and Scientists, 5/e, Ross,
Sheldon M., 2014
4. Applied Statistics and Probability for Engineers, 5/e, Montgomery, Douglas, 2011
5. Schaum's Outline of Probability, 2nd ed., Lipschutz, 2011
6. Basic statistics with Calculator and Narag, E. 2010
7. Probability, Random Variables, and Random Processes, 3/e, Hsu, Hwei P., 2014
8. Business Statistics, Cabero, Salamat, Sta. Maria, 2013
9. Nonparametric Statistics, Broto, 2008
10. Understandable Statistics: Concepts and Methods, 9th ed., Brase and Brase, 2009
B. Journals
Taper, M. L., & Ponciano, J. M. (2016). Evidential statistics as a statistical
modern synthesis to support 21st century science. Population Ecology, 58(1), 9-
29. doi:http://dx.doi.org/10.1007/s10144-015-0533-y
Liu, K., Luedtke, A., & Tintle, N. (2013). Optimal methods for using posterior
probabilities in association testing. Human Heredity, 75(1), 2-11.
doi:http://dx.doi.org/10.1159/000349974
Statistics with JMP: Graphs, descriptive statistics, and probability (online access
included) (2015). Beaverton: Ringgold Inc. Retrieved from
https://search.proquest.com/docview/1686840904?accountid=160143
C. Internet Sources
http://onlinestatbook.com/2/introduction/inferential.html
https://statistics.laerd.com/statistical-guides/descriptive-inferential-statistics.php
http://www.statisticshowto.com/probability-and-statistics/hypothesis-testing/
https://onlinecourses.science.psu.edu/statprogram/node/138/
http://stattrek.com/hypothesis-test/hypothesis-testing.aspx
https://www.lotame.com/what-are-the-methods-of-data-collection/
https://www.investopedia.com
https://magoosh.com/excel/using-excel-statistical-analysis/
https://www.excel-easy.com/examples/descriptive-statistics.html
https://www.scribbr.com/methodology/reliability-vs-validity/