Engineering Data Analysis: Instructional Materials in STAT 20023

Instructional Materials in STAT 20023

ENGINEERING DATA ANALYSIS
For the sole noncommercial use of the
Faculty of the Department of Mathematics and Statistics
Polytechnic University of the Philippines
2020

Contributors:

Elizon, Katrina
Usona, Laurence
Aranas, Peter John
Bautista, Lincoln
Baccay, Edcon

Republic of the Philippines
POLYTECHNIC UNIVERSITY OF THE PHILIPPINES
COLLEGE OF SCIENCE
Department of Mathematics and Statistics
Course Title : ENGINEERING DATA ANALYSIS

Course Code : STAT 20023
Course Credit : 3 UNITS
Pre-Requisite :
Course Description : This course focuses on the conceptual understanding of everyday
statistics and basic statistical procedures. Topics include basic
concepts of statistics, descriptive statistics, and inferential statistics,
especially parametric and non-parametric estimation and
hypothesis testing, all illustrated and applied to practical
situations. It also gives students competence in basic computer
technology by generating descriptive statistics and performing
statistical analysis using R and other programming software.
Week / Dates / Topics and Subtopics
Week 1 (9/14 – 9/20)
• Definitions and Terminology
• Process of Statistics
• Qualitative and Quantitative
• Discrete and Continuous
• Levels of Measurement
Week 2 (9/21 – 9/27)
• Data Collection
• Sources of Data
Week 3 (9/28 – 10/4)
• Getting Started with R
• Basic Data Types in R
• Data Structures in R (Vector)
Week 4 (10/5 – 10/11)
• Data Structures in R (Matrix, List and Data Frame)
Week 5 (10/12 – 10/18)
• Conditional Statement
• Mean and Standard Deviation in R
• Data Manipulation in R
Week 6 (10/19 – 10/25)
• Visualization and Graphics in R
Week 7 (10/26 – 10/31)
• Procedure for Hypothesis Testing
• Assessing and Testing Normality of the Data
Week 8 (11/3 – 11/8)
• One Sample T-Test
• Dependent Sample T-Test
Week 9 (11/9 – 11/15)
• Independent Sample T-Test
• One-Way Analysis of Variance (ANOVA)
Week 10 (11/16 – 11/22)
• Pearson Product Moment Correlation
Week 11 (11/23 – 11/27)
• One-Sample Sign Test
• Wilcoxon Signed Rank Test
Week 12 (12/1 – 12/6)
• Mann-Whitney U-Test
Week 13 (12/7 – 12/13)
• Kruskal Wallis H-Test
Week 14 (12/14 – 12/20)
• Spearman Rank Correlation
• Chi-Square Test
COURSE GRADING SYSTEM
The final grade will be based on the weighted average of the student’s scores on each test
assigned at the end of each lesson. The final SIS grade equivalent will be based on the
following table according to the approved University Student Handbook.
Class Standing (CS) = (Weighted Average of all the Activities x 50) + 50
Midterm and/or Final Exam (MFE) = (Weighted Average of the Midterm and/or Final Tests x 50) + 50
Final Grade = (70% x CS) + (30% x MFE)
SIS Grade / Final Grade Equivalent / Description
1.00 97.00-100 Excellent
1.25 94.00-96.99 Excellent
1.50 91.00-93.99 Very Good
1.75 88.00-90.99 Very Good
2.00 85.00-87.99 Good
2.25 82.00-84.99 Good
2.50 79.00-81.99 Satisfactory
2.75 77.00-78.99 Satisfactory
3.00 75.00-76.99 Passing
5.00 65.00-74.99 Failure
INC Incomplete
W Withdrawn
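The grading formulas and the grade table can be checked with a few lines of R. This is only a sketch: it assumes the weighted averages are expressed as proportions between 0 and 1, and the function and variable names (final_grade, lower_bound, sis_grade) are made up for illustration.

```r
# Hypothetical helper: both weighted averages are assumed to be proportions in [0, 1].
final_grade <- function(activities_avg, exams_avg) {
  cs  <- activities_avg * 50 + 50   # Class Standing (CS)
  mfe <- exams_avg * 50 + 50        # Midterm and/or Final Exam (MFE)
  0.70 * cs + 0.30 * mfe            # Final Grade
}

# Lower bound of each band in the grade table, with the matching SIS grade.
lower_bound <- c(65, 75, 77, 79, 82, 85, 88, 91, 94, 97)
sis_grade   <- c("5.00", "3.00", "2.75", "2.50", "2.25",
                 "2.00", "1.75", "1.50", "1.25", "1.00")

grade <- final_grade(activities_avg = 0.90, exams_avg = 0.80)
grade                                         # 93.5
sis <- sis_grade[findInterval(grade, lower_bound)]
sis                                           # "1.50"
```

Final grades below 65 fall outside the table and are not handled by this sketch.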
Prepared by:
Katrina D. Elizon
Faculty Member, Department of Mathematics and Statistics
College of Science
Contents
1 Introduction to Statistical Concepts
1.1 Definitions and Terminology
1.2 Process of Statistics
1.3 Qualitative and Quantitative Variables
1.4 Discrete and Continuous Variables
1.5 Levels of Measurement
1.6 Data Collection
1.7 Sources of Data
2 Introduction to R
2.1 What is R and Difference of R and RStudio
2.2 Steps to Install R, RStudio and R Packages
2.3 Basic Data Types in R
2.4 Data Structures in R
Vector
Matrix
List
Data Frame
2.5 Conditional Statement
2.6 Mean and Standard Deviation
2.7 Data Manipulation
2.8 Visualization and Graphics in R
3 Parametric Statistics with R Software
3.1 Definition of Parametric Statistics
3.2 Procedures for Testing Hypothesis
3.3 Assessing and Testing Normality of the Data
3.4 One Sample T-Test
3.5 Dependent Sample T-Test
3.6 Independent Sample T-Test
3.7 One-Way Analysis of Variance
3.8 Pearson Product Moment Correlation
4 Non-Parametric Statistics with R Software
4.1 Definition of Non-Parametric Statistics
4.2 One Sample Sign Test
4.3 Wilcoxon Signed Rank Test
4.4 Mann Whitney U-Test
4.5 Kruskal Wallis H-Test
4.6 Spearman Correlation
4.7 Chi-Square Test
MODULE 1: BASIC CONCEPTS IN STATISTICS
OBJECTIVES:
After successful completion of this module, you should be able to:
• Define statistics.
• Explain the process of statistics.
• Know the difference between descriptive and inferential statistics.
• Distinguish between qualitative and quantitative variables.
• Distinguish between discrete and continuous variables.
• Determine the level of measurement of a variable.
• Determine the sources of the data (primary and secondary data).
• Distinguish the different methods under primary and secondary data.

What is STATISTICS?
Statistics is the science of collecting, organizing, summarizing, and analyzing information to draw conclusions or answer questions.
In addition, statistics is about providing a measure of confidence in any conclusions.
What information is referred to in the definition?
The information referred to in the definition is the data.
According to the Merriam Webster dictionary, data are “factual information used as a basis for reasoning, discussion, or calculation”.
Definition:
Universe is the set of all entities under study.
Population is the set of all possible values of the variable.
An individual is a person or object that is a member of the population being studied.
Sample is the subset of the universe or the population.
Parameter is a numerical summary of a population.
Statistic is a numerical value that describes a sample or a number computed from the sample data.

Understand the Process of Statistics
1. Identify the research objective.
A researcher must determine the question(s) he or she wants answered. The question(s) must be detailed so that they identify a group that is to be studied and the questions that are to be answered. The group to be studied is called the population.
2. Collect the information needed to answer the questions.
Everybody collects and uses information, much of it in numerical or statistical form, in day-to-day life. Gaining access to an entire population is often difficult and expensive. In conducting research, we typically look at a subset of the population called a sample.
3. Organize and summarize the information.
This step in the process is referred to as descriptive statistics. Descriptive statistics describe the information collected through numerical measurements, charts, graphs, and tables. The main purpose of descriptive statistics is to provide an overview of the information collected.
4. Draw conclusions from the information.
In this step the information collected from the sample is generalized to the population. This process is referred to as inferential statistics. Inferential statistics uses methods that take results obtained from a sample, extend them to the population, and measure the reliability of the result.

Exercises:
A research objective is presented. For each research objective, identify the population and sample in the study.
1. The Philippine Mental Health Association contacts 1,028 teenagers who are 13 to 17 years of age and live in Antipolo City and asks whether or not they had been prescribed medications for any mental disorders, such as depression or anxiety.
Answer:
Population: Teenagers 13 to 17 years of age who live in Antipolo City.
Sample: 1,028 teenagers 13 to 17 years of age who live in Antipolo City.
2. A farmer wanted to learn about the weight of his soybean crop. He randomly sampled 100 plants and weighed the soybeans on each plant.
Answer:
Population: Entire soybean crop.
Sample: The 100 soybean plants selected.
Exercises:
For each of the following statements, decide whether it belongs to the field of descriptive statistics or inferential statistics.
1. A badminton player wants to know his average score for the past 10 games. Answer: Descriptive Statistics
2. A car manufacturer wishes to estimate the average lifetime of batteries by testing a sample of 50 batteries. Answer: Inferential Statistics
3. Janine wants to determine the variability of her six exam scores in Algebra. Answer: Descriptive Statistics
4. A shipping company wishes to estimate the number of passengers traveling via their ships next year using their data on the number of passengers in the past three years. Answer: Inferential Statistics
5. A politician wants to determine the total number of votes his rival obtained in the past election based on his copies of the tally sheet of electoral returns. Answer: Descriptive Statistics

TAKE NOTE!
If the entire population is studied, then inferential statistics is not necessary, because descriptive statistics will provide all the information that we need regarding the population.
Distinction between Qualitative and Quantitative Variables
Variables are the characteristics that differentiate every individual within the population/sample.
Variables can be classified into two groups:
1. A qualitative variable is a variable that yields categorical responses. It is a word or a code that represents a class or category.
2. A quantitative variable takes on numerical values representing an amount or quantity.

Exercises:
Determine whether the following variables are qualitative or quantitative.
1. Hair color Ans: Qualitative
2. Temperature Ans: Quantitative
3. Number of hamburgers sold Ans: Quantitative
4. Number of children Ans: Quantitative
5. Zip code Ans: Qualitative

Distinction between Discrete and Continuous Variables
Quantitative variables may be further classified into:
A discrete variable is a quantitative variable that has either a finite number of possible values or a countable number of possible values. The term countable means that the values result from counting, such as 0, 1, 2, 3, and so on.
A continuous variable is a quantitative variable that has an infinite number of possible values that are not countable.

Exercises:
Determine whether the following quantitative variables are discrete or continuous.
1. The number of heads obtained after flipping a coin five times. Answer: Discrete
2. The number of cars that arrive at a McDonald’s drive-through between 12:00 P.M. and 1:00 P.M. Answer: Discrete
3. The distance a 2005 Toyota Prius can travel in city conditions with a full tank of gas. Answer: Continuous
4. Number of words correctly spelled. Answer: Discrete
5. Time of a runner to finish one lap. Answer: Continuous

Levels of Measurement
Nominal Level
Identify, name, classify, or categorize objects or events.
Example: Method of payment (cash, check, debit card, credit card), Type of school (public vs. private), Eye Color (Blue, Green, Brown)
Ordinal Level
Like nominal scales, identify, name, classify, or categorize objects or events, but have an additional property of a logical or natural order to the categories or values.
Example: Food Preferences, Rank of a Military officer, Social Economic Class (First, Middle, Lower)
Interval Level
Identify, have ordered values, and have the additional property of equal distances or intervals between scale values.
Example: Temperature on Fahrenheit/Celsius Thermometer, Trait anxiety (e.g., high anxious vs. low anxious), IQ (e.g., high IQ vs. average IQ vs. low IQ)
Ratio Level
Identify, order, represent equal distances between score values, and have an absolute zero point.
Example: Height, Weight, Number of words correctly spelled
Exercises:
Categorize each of the following as nominal, ordinal, interval or ratio measurement.
1. Ranking of college athletic teams Ans. Ordinal
2. Employee number Ans. Nominal
3. Number of vehicles registered Ans. Ratio
4. Brands of soft drinks Ans. Nominal
5. Number of cars passing along C5 on a given day Ans. Ratio

Data Collection
Data collection is the process of gathering and measuring information on variables of interest, in an established systematic fashion that enables one to answer stated research questions, test hypotheses, and evaluate outcomes.
Importance of Data Collection
✦ Data empowers you to make informed decisions.
✦ Data helps you identify problems.
✦ Data allows you to develop accurate theories.
✦ Data will back up your arguments.
✦ Data helps you get your hands on funding.
✦ Data increases your return on assets.
✦ Data improves quality of life.
Consequences from Improperly Collected Data
✦ Inability to answer research questions accurately
✦ Inability to repeat and validate the study
✦ Distorted findings resulting in wasted resources
✦ Misleading other researchers to pursue fruitless avenues of investigation
✦ Compromising decisions for public policy
✦ Causing harm to human participants and animal subjects

Steps in Data Gathering
1. Set the objectives for collecting data.
2. Determine the data needed based on the set objectives.
3. Determine the method to be used in data gathering and define the comprehensive data collection points.
4. Design data gathering forms to be used.
5. Collect data.
Sources of Data
1. Primary sources provide a first-hand account of an event or time period and are considered to be authoritative. They represent original thinking, reports on discoveries or events, or they can share new information. Often these sources are created at the time the events occurred, but they can also include sources that are created later.
2. Secondary sources offer an analysis, interpretation or a restatement of primary sources and are considered to be persuasive. They often involve generalisation, synthesis, interpretation, commentary or evaluation in an attempt to convince the reader of the creator's argument.

Methods of Collecting Primary Data
The primary data can be collected by the following five methods:
1. Direct personal interviews. The researcher has direct contact with the interviewee. The researcher gathers information by asking questions to the interviewee.
2. Indirect/Questionnaire Method. This method of data collection involves sourcing and accessing existing data that were originally collected for the purpose of the study.
3. A focus group is a group interview of approximately six to twelve people who share similar characteristics or common interests. A facilitator guides the group based on a predetermined set of topics.
4. Experiment is a method of collecting data where there is direct human intervention on the conditions that may affect the values of the variable of interest.
5. Observation is a method of collecting data on the phenomenon of interest by recording the observations made about the phenomenon as it actually happens.

Methods of Collecting Secondary Data
The secondary data can be collected by the following five methods:
1. Published reports in newspapers and periodicals
2. Financial data reported in annual reports
3. Records maintained by the institution
4. Internal reports of the government departments
5. Information from official publications

TAKE NOTE!
✦ Always investigate the validity and reliability of the data by examining the collection method employed by your source.
✦ Do not use inappropriate data for your research.

ACTIVITIES/ASSESSMENTS:
I. A research objective is presented. For each, identify the (A) population and (B) sample in the study.
1. A polling organization contacts 2141 male university graduates who have a white-collar job and asks whether or not they had received a raise at work during the past 4 months.
A. __________________________________________________
B. __________________________________________________
2. Every year the PSA releases the Current Population Report based on a survey of 50,000 households. The goal of this report is to learn the demographic characteristics, such as income, of all households within the Philippines.
A. __________________________________________________
B. __________________________________________________

II. Indicate whether the following statements require the use of descriptive or inferential statistics.
_________1. A teacher wants to know the attitudes of all students towards abortion.
_________2. A market analyst of a sales firm draws a chart showing the sales figures of a given product for the period 2006-2007.
_________3. A forecaster predicts the results of an election using the number of votes cast in 15 out of 25 barangays.
_________4. Men are better in math than women.
_________5. Forty percent of the employees of an organization were recorded tardy for at least 15 working days.
_________6. There are very few gender-related occupations.
_________7. An accountant predicts the accuracy rate of a client’s financial resources.
_________8. A quality control manager wishes to check production output.
_________9. Records indicated that 75% of the faculty in the graduate school are doctoral degree holders.
_________10. There is no relationship between educational qualification of parents and academic achievement of their children.

III. Identify the qualitative and quantitative variables and indicate the highest level of measurement required in each. If quantitative, classify whether discrete or continuous.
_________________1. Occupation
_________________2. Number of government officials
_________________3. Favorite color
_________________4. Temperature in Celsius degrees
_________________5. Type of school
_________________6. Volume of mineral water sold daily
_________________7. Employee number
_________________8. Civil status
_________________9. Zip code numbers
_________________10. Brands of soft drinks
_________________11. Socioeconomic status
_________________12. Employment status
_________________13. Number of vehicles registered
_________________14. Jersey number
_________________15. Number of employees collecting retirement benefits from GSIS

IV. Determine if the source would be a primary or a secondary source.
_________1. Government records
_________2. Dictionary
_________3. Artifact
_________4. A TV show explaining what happened in the Philippines.
_________5. Autobiography about Rodrigo Duterte.
_________6. Enrile's diary describing what he thought about World War II.
_________7. Audio and video recordings.
_________8. Speeches
_________9. Newspaper
_________10. Review articles
References
Sullivan, Michael, III. Statistics: Informed Decisions Using Data. Fifth Edition.
Lohr, Sharon L. Sampling: Design and Analysis. Second Edition.

MODULE 2: INTRODUCTION TO R
OBJECTIVES:
After successful completion of this module, you should be able to:
• Identify the Console, Script, Environment, and final panes.
• Install and load add-in packages and import data from the .xlsx (Excel) file format for data processing and statistical analysis.
• Understand the different data types in R.
• Understand the different data structures in R.
• Do some useful data manipulation in R.
• Compute basic summary statistics.
• Produce data visualisations with the basic and enhanced graphics.
Introduction to R
• R is a language and environment for statistical computing and graphics.
• R is trusted not only in academia; many large companies also use the R programming language. Data analysis with R is done in a series of steps: programming, transforming, discovering, modeling, and communicating the results.
Why use R for data analysis?
1. R is free to download and use.
2. R is open-source.
3. Data processing in R is very easy.
4. Data visualization tools in R are very extensive.
5. Advanced functionality often used in practice by scientists is available in R.
6. It will improve one’s understanding of statistics.
7. It is very easy to share your output from R.
8. R provides reproducibility for your analyses.

The Difference between R and RStudio
RStudio is actually an add-on to R: it takes the R software and adds to it a very user-friendly graphical interface. Thus, when one uses RStudio, they are still using the full version of R while also getting the benefit of greater functionality and usability due to an improved user interface. As a result, when using R, one should always use RStudio; working with R itself is very cumbersome.
Since RStudio is an add-on to R, you must first download and install R as well as RStudio. On your computer, you will see R and RStudio as separate installed programs. When using R for data analysis, you will always open and work in RStudio; you must leave R installed on the computer for RStudio to work, even though you will likely never open R itself.

Steps to Install R and RStudio
Install R:
Go to https://cran.r-project.org, select the version of R applicable to your computer, and install the software.
Install RStudio:
Once you are done, download the RStudio installer. Go to https://www.rstudio.com/products/rstudio/download/, select the applicable version of RStudio, and install the software. Make sure to read and follow the instructions that appear while installing the program.
Four Panes of RStudio
• Console pane: where output and error messages are displayed.
• R Script or Source pane: where you can type and save your commands and make notes to yourself about projects.
• Environment and History pane: where you will see the different objects you create or the different datasets you import.
• Final pane: contains everything else, including help, plots, packages, etc.

Steps to Install R Packages
1. Go to your final pane. Click Packages, then click Install.
2. In the Install Packages dialog, write the name of the package you want to install under the Packages field and then click Install. This will install the package you searched for or give you a list of matching packages based on your package text.
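The same installation can also be done from the console instead of the Packages dialog. The sketch below uses readxl only as an example package name; it assumes an internet connection and a configured CRAN mirror (RStudio sets one by default).

```r
pkg <- "readxl"  # example package name; any CRAN package works here

# Install only when the package is not yet available on this computer.
if (!requireNamespace(pkg, quietly = TRUE)) {
  install.packages(pkg)
}

# Load the package so its functions can be used in the current session.
library(pkg, character.only = TRUE)
```

A package has to be installed only once, but it must be loaded with library( ) in every new R session.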

R Function
A function is a set of statements organized together to perform a specific task. R has a large number of built-in functions and the user can create their own functions.

LOADING DATA INTO R
Import Excel File
1. Download and install the package readxl to read Excel files.
2. Click “Import Dataset” in the Environment pane, then select “From Excel”. The dialog box will appear.
Select the Excel file you want to import.
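An Excel file can also be imported with code rather than through the Import Dataset dialog. This is a minimal sketch that assumes readxl is already installed; it reads a sample workbook that ships with the package, so the object name my_data and the file used here are illustrative only.

```r
library(readxl)

# readxl_example() returns the path to a sample workbook bundled with readxl.
# Replace it with the path to your own file, e.g. "data/my_file.xlsx" (hypothetical).
path <- readxl_example("datasets.xlsx")

excel_sheets(path)                      # list the sheet names in the workbook
my_data <- read_excel(path, sheet = 1)  # read the first sheet into a data frame
head(my_data)                           # show the first six rows
```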

Keep in Mind!!!
• Make sure to attach your file to the R search path with attach( ). This means that the database is searched by R when evaluating a variable, so objects in the database can be accessed by simply giving their names. Detach the dataset when you are done with detach( ).
• R is case sensitive: it distinguishes between uppercase and lowercase.
• Respect the naming rules for objects (no spaces, does not start with a number).

Assigning value to variables:
Make sure that the name you assign your variable is accurately descriptive and understandable to another reader.
The command for naming an object:
= or <-
Print out the values of the variable:
When we say print out, it simply means that all the values of the variable are displayed in the console. When assigning a value to an object, R does not print anything. You can force R to print the value by wrapping the assignment in parentheses or by typing the object name.
The assigned object will be saved in your Environment pane. It will also appear in your console. If there is an error in your command, an error message will appear there.
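A short sketch of assignment, printing, and case sensitivity follows. The object names and values are made up for illustration.

```r
my_age <- 21          # assignment with <- prints nothing
my_name = "Maria"     # = also assigns; <- is the more common style

my_age                # typing the object name prints its value
(my_height <- 1.65)   # parentheses around an assignment assign and print

My_age <- 30          # R is case sensitive: My_age and my_age are different objects
my_age == My_age      # FALSE
```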

BASIC DATA TYPES IN R

R works with numerous data types. Some of
the most basic types to get started are:
• Decimal values like 3.6 are called numerics.
• Whole numbers like 7 are called integers (written with an L suffix, e.g. 7L, when they should be stored as integer type). Integers are also numerics.
• Boolean values (TRUE or FALSE) are called logical.
• Text (or string) values are called characters.
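The basic types can be inspected with the class( ) function, which reports the type of an object. The variable names below are illustrative.

```r
my_numeric   <- 3.6
my_integer   <- 7L          # the L suffix stores 7 as an integer
my_logical   <- TRUE
my_character <- "statistics"

class(my_numeric)    # "numeric"
class(my_integer)    # "integer"
class(7)             # "numeric": without the L suffix, R stores 7 as a double
class(my_logical)    # "logical"
class(my_character)  # "character"
```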
WHAT’S THAT DATA TYPE?
To check the data type of a variable, use the class( ) function.
Example: suppose you want to know what type of data “my_var” is. Calling class(my_var) shows that the variable my_var contains numeric data.

Data Structures
Basic data structures defined in R:
• Vector
• Matrix
• List
• Data Frame
WHAT IS A VECTOR?
A vector is a type of array that is one dimensional. Vectors are a logical element in programming languages that are used for storing data.
How to create a vector in R?
In R, you create a vector with the combine function c( ).
Take note that all elements must be of the same type.
In a vector such as c("a", "b", "c", "d", "e"), the quotation marks indicate that the elements are characters.
Always use capital letters for logical/Boolean values (TRUE, FALSE).
All the vectors you create will be saved in your Environment pane if you assign a name to each vector.
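As a minimal sketch of vector creation with c( ), using made-up values and names:

```r
numeric_vector   <- c(1, 10, 49)
character_vector <- c("a", "b", "c", "d", "e")
logical_vector   <- c(TRUE, FALSE, TRUE)

class(character_vector)   # "character"

# Mixing types does not raise an error:
# R silently converts every element to one common type.
mixed <- c(1, "a", TRUE)
mixed          # "1"    "a"    "TRUE"
class(mixed)   # "character"
```

This is why all elements of a vector end up with the same type, even when different types are supplied.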
If we want to create a vector of consecutive numbers, the : operator is very helpful.
More complex sequences can be created using the seq( ) function, for example by defining the number of points in an interval or the step size.
length.out indicates the number of points in an interval.
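The : operator and the two seq( ) styles can be compared side by side. The endpoints below are arbitrary example values.

```r
consecutive <- 1:10                               # 1 2 3 ... 10
by_step     <- seq(from = 0, to = 1, by = 0.25)   # step size of 0.25
by_points   <- seq(from = 0, to = 10, length.out = 5)  # 5 equally spaced points

consecutive
by_step     # 0.00 0.25 0.50 0.75 1.00
by_points   # 0.0 2.5 5.0 7.5 10.0
```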

VECTOR SELECTION
Our goal is to select specific elements of the vector. To select one element of a vector you can use square brackets [ ]. Between the square brackets, you indicate which elements to select.
To select multiple elements, put a vector of positions inside the brackets: [c( )].

NAMING A VECTOR
You can give a name to the elements of a vector with the names ( ) function.
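Selection by position and by name can be sketched as follows. The vector scores and its element names are made up for illustration.

```r
scores <- c(85, 92, 78, 88)   # made-up values

scores[2]          # second element: 92
scores[c(1, 4)]    # first and fourth elements: 85 88
scores[2:3]        # second to third elements: 92 78

# Give each element a name, then select by name instead of position.
names(scores) <- c("Quiz1", "Quiz2", "Quiz3", "Quiz4")
scores["Quiz3"]    # the element named Quiz3
```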
WHAT’S A FACTOR?
The term factor refers to a statistical data
type used to store categorical variables.
The difference between a categorical
variable and a continuous variable is that
a categorical variable can belong to a
limited number of categories. A
continuous variable, on the other hand,
can correspond to an infinite number of
values.
NOMINAL AND ORDINAL
Two types of categorical variables:
• Nominal categorical variable
• Ordinal categorical variable
A nominal variable is a categorical variable without an implied order. This means that it is impossible to say that 'one is worth more than the other’.
In contrast, ordinal variables do have a natural ordering.

HOW TO CREATE FACTORS IN R?
To create factors in R, you make use of the function factor( ).
The first thing that you have to do is create a vector that contains all the observations that belong to a limited number of categories.
EXAMPLE:
Out of 60 respondents considered in the survey, thirty indicate that they are single, twenty are married, and ten are separated. Create a factor based on the information given.
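One possible way to work this example is sketched below. The object names are illustrative, and rep( ) is used here only to avoid typing all 60 responses by hand.

```r
# Build the vector of 60 responses, then convert it to a (nominal) factor.
civil_status <- c(rep("Single", 30), rep("Married", 20), rep("Separated", 10))
civil_factor <- factor(civil_status)

levels(civil_factor)   # "Married" "Separated" "Single" (alphabetical by default)
table(civil_factor)    # counts per category: 20, 10, 30
```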

EXAMPLE:
I asked 70 students if they like watching television. Fifty of them answered “Most of the Time”, fifteen answered “Sometimes”, and only five answered “Hardly Ever”. Create a factor based on the information given.

To create an ordered factor, you have to add two

additional arguments: ordered and levels. By setting the
argument ordered to TRUE in the function factor ( ), you
indicate that the factor is ordered. With the argument
levels you give the values of the factor in the correct order.
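The television example can be sketched as an ordered factor. The level order used here (from least to most frequent viewing) is an assumption, and the object names are illustrative.

```r
tv <- c(rep("Most of the Time", 50), rep("Sometimes", 15), rep("Hardly Ever", 5))

# ordered = TRUE marks the factor as ordinal; levels gives the order, lowest first.
tv_factor <- factor(tv, ordered = TRUE,
                    levels = c("Hardly Ever", "Sometimes", "Most of the Time"))

levels(tv_factor)
table(tv_factor)                # 5, 15, 50
tv_factor[1] > tv_factor[70]    # TRUE: "Most of the Time" ranks above "Hardly Ever"
```

Comparisons such as > only work on ordered factors; on a nominal factor they return NA with a warning.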
WHAT IS A MATRIX?
A matrix is a two-dimensional array of numbers, having a fixed number of rows and columns, and containing a number at the intersection of each row and each column.
By storing values in a matrix rather than as individual variables, a program can access and perform operations on the data more efficiently.
How to create a matrix in R?
matrix(data, nrow = 1, ncol = 1, byrow = FALSE, dimnames = NULL)

EXAMPLE:
Create a matrix based on the table:
Educational Attainment    Gender: Female    Gender: Male
High School               10                22
College                   15                8
Masteral                  20                10
PhD                       8                 9
EXAMPLE:
Create a 6 x 5 matrix that contains the numbers from 1 to 30. Fill the matrix by rows. Make sure that the names of the rows are A, B, C, D, E, and F, and the names of the columns are Blue, Red, White, Green, and Yellow.
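Both matrix examples can be sketched with matrix( ). The object names m and educ are illustrative; the values come from the examples themselves.

```r
# 6 x 5 matrix holding 1 to 30, filled row by row, with row and column names.
m <- matrix(1:30, nrow = 6, ncol = 5, byrow = TRUE,
            dimnames = list(c("A", "B", "C", "D", "E", "F"),
                            c("Blue", "Red", "White", "Green", "Yellow")))
m
m["B", "Red"]   # 7

# Educational attainment by gender, entered one table row at a time.
educ <- matrix(c(10, 22,
                 15,  8,
                 20, 10,
                  8,  9),
               nrow = 4, ncol = 2, byrow = TRUE,
               dimnames = list(c("High School", "College", "Masteral", "PhD"),
                               c("Female", "Male")))
educ
```

With byrow = FALSE (the default) the same values would instead fill the matrix column by column.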


EXTRACT A ROW AND COLUMN FROM A MATRIX
The entire m-th row of a matrix can be extracted as matrixname[m,].
Similarly, the n-th column of a matrix can be extracted by matrixname[,n].
To extract more than one row or column at a time:
Multiple rows:
matrixname[c( ),]
Multiple columns:
matrixname[,c( )]

WHAT IS A LIST IN R?
When data cannot be represented as an array or a data frame, the list is the best choice. This is because lists can contain all kinds of other objects, including other lists, matrices or data frames, and in that sense, they are very flexible.
How to create a list in R?
A list is created using the list( ) function in R.
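For the matrix extraction syntax in this part of the module (matrixname[m,], matrixname[,n], and the c( ) variants), a small sketch on a made-up 3 x 4 matrix may help:

```r
m <- matrix(1:12, nrow = 3, ncol = 4, byrow = TRUE)
m

m[2, ]          # entire 2nd row: 5 6 7 8
m[, 3]          # entire 3rd column: 3 7 11
m[c(1, 3), ]    # rows 1 and 3
m[, c(2, 4)]    # columns 2 and 4
```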
EXAMPLE:
Create a list that contains a numeric vector (1 to 30), a sequence of numbers from 1 to 10 with a step size of 0.5, and a character vector that contains your top 3 favourite subjects.
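A possible solution is sketched below. The element names and the three subjects are placeholders; substitute your own.

```r
my_list <- list(
  numbers  = 1:30,
  sequence = seq(from = 1, to = 10, by = 0.5),
  subjects = c("Statistics", "Calculus", "Physics")  # placeholder subjects
)

str(my_list)                 # compact overview of each element
my_list$sequence             # access an element by name
my_list[["subjects"]][1]     # first subject: "Statistics"
length(my_list$sequence)     # 19
```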

HOW TO CREATE A DATA FRAME IN R?
A data frame is a table or a two-dimensional array-like structure in which each column contains values of one variable and each row contains one set of values from each column.
To create a data frame, use the data.frame( ) function.
Following are the characteristics of a data frame:
• The column names should be non-empty.
• The row names should be unique.
• The data stored in a data frame can be of numeric, factor, or character type.
• Each column should contain the same number of data items.

EXTRACT A VARIABLE FROM A

DATA FRAME
If you want to extract a particular variable from
a data frame, use dataname$variablename.
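A minimal sketch of creating a data frame and extracting one variable with $ follows. The data frame respondents and its values are made up for illustration.

```r
respondents <- data.frame(
  name   = c("Ana", "Ben", "Carla", "Dan"),
  age    = c(23, 35, 41, 29),
  gender = c("Female", "Male", "Female", "Male")
)

str(respondents)         # one row per respondent, one column per variable
respondents$age          # 23 35 41 29
mean(respondents$age)    # 32
```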
SUBSETTING A DATA FRAME
To take a subset of a data frame, use the subset command and assign the result to a new data frame:
subset(dataname, condition)

THE (LOGICAL) COMPARISON OPERATORS
< for less than
> for greater than
<= for less than or equal to
>= for greater than or equal to
== for equal to each other
!= for not equal to each other
The operator & is used to combine multiple conditions.

Import “Encoded Data” file.
EXAMPLE:
Create a data frame for the profile
of the respondents. Include only
the male respondents that are
single, at most 40 years old, and
high school graduate.
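Because the “Encoded Data” file is not reproduced here, the sketch below uses a small made-up data frame in its place. The column names and category labels are assumptions and will likely differ from those in the actual file.

```r
# Made-up stand-in for the imported file; names and codes are hypothetical.
profile <- data.frame(
  id           = 1:6,
  gender       = c("Male", "Female", "Male", "Male", "Female", "Male"),
  civil_status = c("Single", "Single", "Married", "Single", "Married", "Single"),
  age          = c(25, 31, 38, 45, 52, 40),
  education    = c("High School", "College", "High School",
                   "High School", "PhD", "High School")
)

# Keep only male, single, at most 40 years old, high school graduates.
selected <- subset(profile,
                   gender == "Male" & civil_status == "Single" &
                     age <= 40 & education == "High School")
selected   # rows with id 1 and 6
```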


CONDITIONS
If/else statements
In R, we can write a conditional if/else
statement as follows:
ifelse(condition on data, true value
returned, false returned)

EXAMPLE:
Suppose we want to create a variable
called grades that is assigned as follows:
E for score less than or equal to 60
D for score 61 to 70
C for score 71 to 80
B for score 81 to 90
A for score at least 91
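The grading rule can be written as nested ifelse( ) calls, where each false branch passes the remaining scores to the next condition. The scores are made-up values, and the sketch assumes whole-number scores.

```r
score <- c(55, 60, 61, 75, 80, 90, 91, 100)   # made-up scores

grades <- ifelse(score <= 60, "E",
          ifelse(score <= 70, "D",
          ifelse(score <= 80, "C",
          ifelse(score <= 90, "B", "A"))))

data.frame(score, grades)
```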

Computation of Mean and Standard Deviation in R

Mean is calculated by taking the sum of the values and dividing by the number of values in a data series. The function mean( ) is used to calculate this in R.
Standard Deviation is a measure of how far away items in a data set are from the mean. The larger the standard deviation, the more variation there is in the data set. The function sd( ) is used to calculate this in R.
If there are missing values, then the mean and sd functions return NA.
To drop the missing values from the calculation use na.rm = TRUE, which means remove the NA values.
Example:
mean(data, na.rm=TRUE)
sd(data, na.rm=TRUE)

Data Manipulation in R

Introduction to dplyr
Download the dplyr package. It aims to provide a function for each basic verb of data manipulation:
filter( ): select cases based on their values.
arrange( ): reorder the cases.
select( ): select variables based on their names.
rename( ): rename columns.
mutate( ): add new variables that are functions of existing variables.
summarise( ): condense multiple values to a single value.
sample_n( ) and sample_frac( ): take random samples.
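
As a quick check of how mean( ) and sd( ) treat missing values, here is a minimal sketch on a hypothetical data series containing one NA. The values are made up for illustration.

```r
# Hypothetical data series with one missing value.
data <- c(10, 12, NA, 14, 16)

print(mean(data))   # NA, because of the missing value
print(sd(data))     # NA as well

# na.rm = TRUE drops the NA before computing.
avg    <- mean(data, na.rm = TRUE)
spread <- sd(data, na.rm = TRUE)   # sample standard deviation (divides by n - 1)
print(c(mean = avg, sd = spread))
```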
PIPES OPERATOR
The dplyr package allows the use of the forward-pipe chaining operator (%>%) for combining multiple operations.
The shortcut in RStudio is Ctrl + Shift + M for Windows.
The shortcut in RStudio is Command + Shift + M for Mac.

Filter
Use filter to keep the rows that meet a required condition.
How many students in the BSA course came from SUC and have an average grade of greater than or equal to 98?

What courses have students with grades above 97 in
English, Mathematics, and Science when they were in
high school?

How many students have an average grade of above 96

from BSSTAT, BSIT, and BACR course?

Arrange
Use arrange to reorder rows.
Arrange the SAMPLE_DATA into ascending order based on the PUPCET score.
Arrange the SAMPLE_DATA into descending order based on the PUPCET score.

Select
Use select to pick columns by name.
Select from the SAMPLE_DATA the variables Gender, schooltype, and average.

Select
To select a series of columns.
Select from the SAMPLE_DATA the variables eng, mat, sci, average, Language, GenInfo, Science, Numerical, and NonVerbal.

Select
To select the columns that need to be removed.
Remove the variables Gender and schooltype.

Rename
Use rename to rename column(s).
Rename the variables eng, mat, and sci.

Mutate
Use mutate to create new variables.
Create a column "Score", calculated as the sum of Language, GenInfo, Science, Numerical, and NonVerbal.

Summarise
Use summarise to find insights from the data.
You can use the summary( ) function to produce the mean, median, minimum, maximum, and 1st and 3rd quartiles.
What is the average score of the students based on their gender?

Sample of Observations
To draw a sample of observations without replacement.
Randomly select 8 observations from SAMPLE_DATA.

Sample of Observations
To keep only the first set of observations.
Take the first 8 observations from SAMPLE_DATA.

Sample of Observations
To keep only the last set of observations.
Take the last 8 observations from SAMPLE_DATA.
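
The dplyr verbs described above can be chained with %>%. The sketch below assumes the dplyr package is installed and uses a small hypothetical stand-in for SAMPLE_DATA: the column names follow the text, but the values are made up. The group_by( ) call is an additional dplyr function not listed above, used here to summarise by gender.

```r
library(dplyr)

# Hypothetical stand-in for SAMPLE_DATA; values are made up.
sample_data <- data.frame(
  Gender     = c("F", "M", "F", "M", "F", "M"),
  schooltype = c("suc", "private", "public", "suc", "private", "suc"),
  average    = c(98, 95, 97, 99, 96, 94),
  Language   = c(40, 35, 38, 42, 37, 30),
  Numerical  = c(30, 28, 35, 33, 29, 25)
)

result <- sample_data %>%
  filter(average >= 95) %>%                          # keep rows meeting a condition
  select(Gender, average, Language, Numerical) %>%   # pick columns by name
  rename(Average = average) %>%                      # new_name = old_name
  mutate(Score = Language + Numerical) %>%           # add a computed column
  arrange(desc(Score))                               # sort in descending order

by_gender <- result %>%
  group_by(Gender) %>%
  summarise(mean_score = mean(Score))                # one row per gender

print(result)
print(by_gender)
print(head(sample_data, 3))   # first 3 observations
print(tail(sample_data, 3))   # last 3 observations
```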

EXAMPLE:
From SAMPLE_DATA:
• Take the first 30% of the observations and name it "NEWDATA".
• Remove the variables Language, GenInfo, Science, Numerical, and NonVerbal.
• Rename the variable "schooltype" as "SchoolType" and "average" as "Average".
• Include only female students whose school type is suc or private.
• Create an additional column named "If_FChoice" to determine if her course now is her first choice.
• Arrange "NEWDATA" into descending order based on the Score.
• Summarize your data to find out who has the highest average grades based on first choice and not first choice.
Visualization, Graphics in R
R graphic systems: base and ggplot2
Base graphics are usually constructed piecemeal, with each aspect of the plot handled separately through a series of function calls; this is often conceptually simpler and allows plotting to mirror the thought process.
ggplot2 combines concepts from both base and lattice graphics but uses an independent implementation.
Make sure to download the ggplot2 package.

Base Graphics: HISTOGRAM
Base Graphics: SCATTER PLOT
Base Graphics: BOX PLOT

Base Graphics: Parameter

Many base plotting functions share a set of parameters. Here
are a few key ones:
• pch: the plotting symbol (default is open circle)
• lty: the line type (default is solid line), can be dashed,
dotted, etc.
• lwd: the line width, specified as an integer multiple
• col: the plotting color, specified as a number, string, or hex
code; the colors() function gives you a vector of colors by
name
• xlab: character string for the x-axis label
• ylab: character string for the y-axis label
Base Graphics: Parameter
The par( ) function is used to specify global graphics
parameters that affect all plots in an R session.
• las: the orientation of the axis labels on the plot
• bg: the background color
• mar: the margin size
• oma: the outer margin size (default is 0 for all sides)
• mfrow: number of plots per row, column (plots are filled
row-wise)
• mfcol: number of plots per row, column (plots are filled
column-wise)
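
The plotting parameters and par( ) settings listed above can be tried on the built-in faithful data set. This is only a sketch: the colors, symbols, and titles are arbitrary choices, and the pdf(NULL) line is an assumption added so the code also runs without a display (remove it to see the plots in RStudio).

```r
# Draw to a null PDF device so the sketch also runs without a display.
pdf(NULL)

# Two plots side by side: 1 row, 2 columns, filled row-wise.
old_par <- par(mfrow = c(1, 2), las = 1)

# Histogram of waiting times from the built-in faithful data set.
hist(faithful$waiting, col = "lightblue",
     xlab = "Waiting time to next eruption (min)",
     main = "Histogram")

# Scatter plot with a filled plotting symbol (pch) and a dashed line (lty, lwd).
plot(faithful$eruptions, faithful$waiting,
     pch = 19, col = "darkgreen",
     xlab = "Eruption time (min)",
     ylab = "Waiting time to next eruption (min)",
     main = "Scatter plot")
abline(h = mean(faithful$waiting), lty = 2, lwd = 2)

par(old_par)   # restore the previous settings
dev.off()
```

Saving the value returned by par( ) and restoring it afterwards keeps the global settings from leaking into later plots.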

Base Graphics: Plotting Functions

• plot: make a scatterplot, or other type of plot depending on the class of the object being plotted
• lines: add lines to a plot, given a vector of x values and a corresponding vector of y values (or a 2-column matrix); this function just connects the dots
• points: add points to a plot
• text: add text labels to a plot using specified x, y coordinates
• title: add annotations to x, y axis labels, title, subtitle, outer margin
• mtext: add arbitrary text to the margins (inner or outer) of the plot
• axis: add axis ticks/labels

ggplot2
A "third" graphic system for R (along with base and lattice)
There are two plotting functions available:
qplot( )
ggplot( )
The qplot() function is similar to plot() but with many built-in features.
For complex customization, use ggplot()

ggplot2: qplot
ggplot2: ggplot

Basic components of a ggplot plot
• A data frame
• Aesthetic mappings: how data are mapped to color, size
• Geoms: geometric objects like points, lines, shapes
• Facets: for conditional plots
• Stats: statistical transformations like binning, quantiles,
smoothing
• Scales: what scale an aesthetic map uses (example: male =
red, female = blue)
• Coordinate system
Plots are built up by layers
• Plot the data
• Overlay summary
• Metadata and annotation
ggplot2: ggplot SCATTER PLOT
ggplot2: ggplot BAR CHART

To Construct Scatter Plot

To create a scatter plot using the enhanced graphics of RStudio:
print(qplot(x, y, data = dataname,
xlab = "label of x-axis",
ylab = "label of y-axis",
main = "Title of the graph"))
ACTIVITIES/ASSESSMENTS:
1. Use the "faithful" data in RStudio. Extract the waiting variable and compute the average of only those waiting times to the next eruption that are less than or equal to 50 minutes.
2. Create a 8 x 5 matrix that contains a number from 1 to 40. Fill
matrix by columns. Make sure that the names of the rows are
Apple, Banana, Orange, Grapes, Mango, Limes, Watermelons and
Apricots, and the names of the columns are Blue, Red, White,
Green, and Yellow.
A. Extract row of Apple, Grapes and Apricots.
3. Create a factor, based on the information given.
I asked 150 residents of Barangay Dela Paz to rate their level of
agreement about mass testing. Most of the respondents answered
“Strongly Agree” with frequency of 60. Fifty of them answered
“Agree”, followed by “Disagree” and Strongly Disagree” with
frequency of 30 and 10, respectively.
ACTIVITIES/ASSESSMENTS:
4. Create a scatter plot using the basic and enhanced graphics of RStudio. Use the "faithful" data in RStudio.
Take Note:
xlab="Eruption time (min)",
ylab="Waiting time to next eruption (min)",
main="Eruptions of Old Faithful"
5. Let's say you're a student taking seven classes. Here's a table containing your final grades for each class:
Class Exams
Mathematics 88.00
Chemistry 87.67
Writing 86.00
Art 91.33
History 84.00
Music 91.00
Physical Education 89.33
A. Create a vector containing the final grades for each class using the variable name "final scores".
B. Create a vector of character data called "class names" containing your classes.
C. Assign the class name to each grade in your final scores vector.
D. Extract elements from the final scores vector to create two new vectors:
"liberal arts": Containing your writing and history final grades.
"fine arts": Containing your art and music final grades.
E. Calculate the average of each new vector.
F. Calculate your grade point average from the "final scores" vector that we created earlier. Store the result of your calculation in the variable "GPA".
G. Compare "final scores" to "GPA" to see whether the grade in each class is higher than the "GPA". Store the logical output in a vector named "above average".
References
https://uoftcoders.github.io/rcourse/lec02-basic-r.html
https://monashbioinformaticsplatform.github.io/r-intro/start.html
https://bookdown.org/kdonovan125/ibis_data_analysis_r4/introducing-r-and-rstudio.html
https://www.datacamp.com/
Engineering Data Analysis
Module 3: Parametric Statistics with R Software
K.Elizon
P.Aranas
L.Usona
L.Bautista
E.Baccay
Sta. Mesa, Manila
July 31, 2020

Objectives
After successful completion of this module, you should be able to:
1 Differentiate the null and alternative hypotheses.
2 Formulate the appropriate null and alternative hypotheses.
3 Explain the logic of hypothesis testing.
4 Assess and test if the data follow a normal distribution.
5 Distinguish between independent and dependent sampling.
6 Identify the appropriate test statistics for normally distributed data.
7 Solve real-life problems involving parametric tests.
Parametric Statistics
Parametric statistical procedures are inferential procedures that rely on testing claims regarding parameters such as the population mean, the population standard deviation, or the population proportion.
They assume underlying statistical distributions in the data. Therefore, several conditions of validity must be met so that the result of a parametric test is reliable.
They apply to data in ratio scale, and some apply to data in interval scale.

Parametric Statistics
Two Common Forms of Statistical Inference
1 Estimation
It is used to approximate the value of an unknown population parameter.
2 Hypothesis Testing
It is a procedure based on sample evidence and probability, used to test claims regarding a characteristic of one or more populations.
Hypothesis Testing
Procedures for Testing Hypothesis
1. State the null and alternative hypothesis.
2. Set the level of significance or alpha level (α).
3. Determine the appropriate test to use.
4. Determine the p-value.
5. Make Statistical decision.
6. Draw conclusion.

State the null and alternative hypothesis
What is HYPOTHESIS?
Hypothesis testing is a procedure based on sample evidence and probability, used to test claims regarding a characteristic of one or more populations.
A hypothesis is a preconceived idea, assumed to be true but has to be tested for its truth or falsity.
Hypothesis Testing
Two Types of Hypothesis
1. Null Hypothesis
Denoted by H0.
The statement being tested.
Assumed true until evidence indicates otherwise.
Must contain the condition of equality and must be written with the symbol =, ≤, or ≥.
2. Alternative Hypothesis
Denoted by Ha or H1.
Statement that must be true if the null hypothesis is false.
Sometimes referred to as the research hypothesis.
Must not contain the condition of equality and must be written with the symbol ≠, <, or >.

Set the level of significance or alpha level (α)
You should establish a predetermined level of significance, below which you will reject the null hypothesis.
The generally accepted levels are 0.10, 0.05, and 0.01.
Determine the appropriate test to use
Parametric Test
1. One-Sample t-Test
2. Dependent Sample t-Test
3. Independent Sample t-Test
4. One-Way Analysis of Variance
5. Pearson Product Moment Correlation
Make sure to verify that the assumptions of every statistical test are satisfied.

Determine the p-value.
Perform the statistical analysis using statistical software. Use R to calculate the p-value.
Make Statistical decision
In stating your decision you can use:
1 Fail to reject the null hypothesis/ Do not reject the null hypothesis/ Retain the null hypothesis
2 Reject the null hypothesis.
Take Note!
It is important to recognize that we never accept the null hypothesis. We are merely saying that the sample evidence is not strong enough to warrant rejection of the null hypothesis.

Make Statistical decision
Decision Rule:
If the p-value is less than or equal to the 0.05 level of significance, reject H0; otherwise fail to reject H0.
Example:
If the level of significance is α = 0.05:
1. and if the computed p-value is 0.01, then the decision is reject H0.
2. and if the computed p-value is 0.05, then the decision is reject H0.
3. and if the computed p-value is 0.10, then the decision is fail to reject H0.
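
The decision rule can be written as a small helper function. The name decide is hypothetical and not part of R or of this module; the sketch simply applies "reject H0 when p-value ≤ α" to the three p-values from the example.

```r
# Decision rule from the text: reject H0 when p-value <= alpha.
# 'decide' is a hypothetical helper name used only in this sketch.
decide <- function(p_value, alpha = 0.05) {
  ifelse(p_value <= alpha, "Reject H0", "Fail to reject H0")
}

decisions <- decide(c(0.01, 0.05, 0.10))
print(decisions)
```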
Draw Conclusion
Record conclusions and recommendations in a report, and associate interpretations to justify your conclusion or recommendations.

Assumptions of Parametric Statistics
Common Assumptions
1 Approximately Normally Distributed
2 Homogeneity of Variances
3 Samples must be independent of each other
Testing the Assumptions
Testing Normality of the Data
To determine if the data follow a normal distribution, we can use the graphical or numerical method.
Graphical
1. Histogram
2. Normal Q-Q Plot
Numerical
1. Kolmogorov Smirnov Test
2. Lilliefors
3. Anderson - Darling Test
4. Shapiro Wilk Test

Histogram
A histogram plots the observed values against their frequency and gives a visual estimate of whether the distribution is bell shaped or not. If the histogram forms a bell shape, the data are considered to follow a normal distribution.

Normal Q-Q Plot
Q-Q probability plots display the observed values against normally distributed data (represented by the line). If the points are close to the diagonal line, the data are considered to follow a normal distribution.

How to create a histogram in R?
Command for Histogram
To construct a histogram use the command:
hist(x, probability=TRUE, col="color")
lines(density(x), col="color")
"x" is a numeric vector of data values.
How to create a normal Q-Q plot in R?

Command for Normal Q-Q Plot
To construct a normal Q-Q plot use the command:
qqnorm(x)
qqline(x)
"x" is a numeric vector of data values.
Caveat: Just because you meet sample size requirements (n in the
above table), this does not guarantee that the test result is
efficient and powerful. Almost all normality test methods perform
poorly for small sample sizes (less than or equal to 30).
Testing for Normality

Command for Shapiro-Wilk Test
shapiro.test(x)
Command for Lilliefors Test
To use the Lilliefors command, you need to download the package nortest.
lillie.test(x)
Command for Anderson-Darling Test
To use the Anderson-Darling command, you need to download the package nortest.
ad.test(x)
Command for Kolmogorov-Smirnov Test
ks.test(x, "pnorm")
"x" is a numeric vector of data values.
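
The graphical and numerical checks above can be tried together on a simulated sample. This is only a sketch: the data are generated with rnorm( ) and are not course data, and pdf(NULL) is added so the code runs without a display. Note that "pnorm" in ks.test( ) defaults to mean 0 and standard deviation 1, so this sketch passes the sample mean and standard deviation explicitly. That is an assumption of the sketch, and estimating these from the same sample makes the Kolmogorov-Smirnov p-value only approximate.

```r
set.seed(123)                       # make the simulated sample reproducible
x <- rnorm(50, mean = 70, sd = 5)   # hypothetical sample, not course data

pdf(NULL)                           # null device; drop this line to see the plots
hist(x, probability = TRUE, col = "lightgray")
lines(density(x), col = "red")
qqnorm(x)
qqline(x)
dev.off()

sw <- shapiro.test(x)               # Shapiro-Wilk test, available in base R
ks <- ks.test(x, "pnorm", mean = mean(x), sd = sd(x))  # Kolmogorov-Smirnov test
print(sw$p.value)
print(ks$p.value)
```

The Lilliefors and Anderson-Darling tests would be called the same way with lillie.test(x) and ad.test(x) once the nortest package is installed.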
Hypotheses of Normality Test
Null Hypothesis:
The sample data follows a normal distribution.
Alternative Hypothesis:
The sample data does not follow a normal distribution.

Testing the Assumptions
Testing the Homogeneity of Variances
There are many ways of testing data for homogeneity of variance. Two methods are shown here.
1. Bartlett's test
2. Levene's test
Testing the Homogeneity of Variances
Bartlett's test
If the data is normally distributed, this is the best test to use. It is sensitive to departures from normality; it is more likely to return a "false positive" when the data is non-normal.
Command for Bartlett's Test
bartlett.test(x ~ group, data = dataname)
"x" a numeric vector of data values.
"group" factor of the data.

Testing the Homogeneity of Variances
Levene's test
More robust to departures from normality than Bartlett's test.
Command for Levene's Test
To use the Levene's test command, you need to download the packages car and carData.
leveneTest(x ~ group, data = dataname)
"x" a numeric vector of data values.
"group" factor of the data.
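
As an illustration of the formula interface, the sketch below runs Bartlett's test on the shallow and deep processing recall data that appear in Independent Sample t-Test Example 1 of this module. The data frame name recall and the column names words and group are assumptions made for this sketch.

```r
# Recall data from Independent Sample t-Test Example 1 (10 scores per group).
recall <- data.frame(
  words = c(13, 12, 11, 9, 11, 13, 14, 14, 14, 15,    # shallow processing
            12, 15, 14, 14, 13, 12, 15, 14, 16, 17),  # deep processing
  group = factor(rep(c("shallow", "deep"), each = 10))
)

# H0: the group variances are equal.
bt <- bartlett.test(words ~ group, data = recall)
print(bt)

# With the car package installed, Levene's test uses the same formula:
# car::leveneTest(words ~ group, data = recall)
```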
Hypotheses
Null Hypothesis
Equal Variances Assumed.
Alternative Hypothesis
Equal Variances Not Assumed.

One Sample t-Test
One-sample t-test is used to compare the mean of one sample to a known standard (theoretical or hypothetical) mean (µ).
Command for One Sample t-Test
t.test(x, mu=0, alternative="less", conf.level=0.95)
t.test(x, mu=0, alternative="greater", conf.level=0.95)
t.test(x, mu=0, alternative="two.sided", conf.level=0.95)
"x" a numeric vector of data values.
One Sample t-Test
Null and Alternative Hypothesis
H0 : µ = µ0
Ha : µ ≠ µ0 two-tailed: two.sided
H0 : µ ≤ µ0
Ha : µ > µ0 one-tailed: greater
H0 : µ ≥ µ0
Ha : µ < µ0 one-tailed: less

One Sample t-Test
Assumptions
Samples must be independent of each other.
Approximately Normally Distributed.
One Sample t-Test
Example 1:
A psychologist makes use of a test instrument for measuring depression. The instrument is known to have a mean score of 70 for normal individuals. Twenty individuals who have been described as severely depressed by a therapist take the test. The psychologist is interested to know if the average test score made by severely depressed individuals is greater than 70. The data are reflected on the table below.
Scores Made by Severely Depressed Individuals
No. Score No. Score
1 75 11 75
2 77 12 74
3 73 13 63
4 80 14 76
5 65 15 67
6 74 16 70
7 75 17 69
8 69 18 76
9 71 19 65
10 72 20 68

Testing the Normality of the Data
Since p-value is greater than 0.05, we fail to reject H0. Therefore, the sample data is normally distributed.
Procedures for Testing Hypothesis
Step 1: H0 : µ ≤ 70 and Ha : µ > 70
Step 2: α = 0.05
Step 3: Since we are comparing the mean of one sample to a known standard mean, we will use the one sample t-test.
Step 4: [R output of the test]
Step 5: Since p-value (0.058) is greater than the 0.05 level of significance, we fail to reject H0.
Step 6: There is no sufficient evidence to conclude that the average test score made by severely depressed individuals is greater than 70.

One Sample t-Test
Example 2:
Dr. Nuguid has developed a test for measuring the vocabulary skills of 6 year olds. They score 60 on the average. She administers the test to a sample of single children (i.e., only child). Test if the mean score made by single children is less than 60. The data are reflected on the table below.
No. Score No. Score
1 47 9 50
2 45 10 58
3 49 11 54
4 52 12 55
5 48 13 61
6 47 14 49
7 45 15 52
8 69
Testing the Normality of the Data
Since p-value is greater than 0.05, we fail to reject H0. Therefore, the sample data is normally distributed.

Procedures for Testing Hypothesis
Step 1: H0 : µ ≥ 60 and Ha : µ < 60
Step 2: α = 0.05
Step 3: Since we are comparing the mean of one sample to a known standard mean, we will use the one sample t-test.
Step 4: [R output of the test]
Step 5: Since p-value (reported as 0.000, i.e. less than 0.001) is less than the 0.05 level of significance, we reject H0.
Step 6: There is sufficient evidence to conclude that the mean score made by single children is less than 60.
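
Step 4 for the two one-sample examples can be sketched as follows, using the scores from the tables above. The vector names depressed and single are assumptions made for this sketch.

```r
# Example 1: scores of 20 severely depressed individuals; Ha: mu > 70.
depressed <- c(75, 77, 73, 80, 65, 74, 75, 69, 71, 72,
               75, 74, 63, 76, 67, 70, 69, 76, 65, 68)
ex1 <- t.test(depressed, mu = 70, alternative = "greater", conf.level = 0.95)

# Example 2: vocabulary scores of 15 single children; Ha: mu < 60.
single <- c(47, 45, 49, 52, 48, 47, 45, 69,
            50, 58, 54, 55, 61, 49, 52)
ex2 <- t.test(single, mu = 60, alternative = "less", conf.level = 0.95)

print(ex1)   # compare the p-value with alpha = 0.05
print(ex2)
```

The alternative argument must match the direction of Ha in Step 1, otherwise the reported p-value answers a different question.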
Inference About Two Means
To perform inference on the difference of two population means, we must first determine whether the data come from an independent or dependent sample.

Inference About Two Means
Distinguish between Independent and Dependent Sample
A sampling method is independent when the individuals selected for one sample do not dictate which individuals are to be in a second sample.
A sampling method is dependent when the individuals selected to be in one sample are used to determine the individuals to be in the second sample.
Dependent Sample t-Test
Dependent Sample t-Test (also called the paired sample t-test) compares the means of two related groups to determine whether there is a statistically significant difference between these means.
Command for Dependent Sample t-Test
t.test(a, b, mu = 0, alternative = "less", paired = TRUE, conf.level = 0.95)
"a" a numeric vector of data values.
"b" a numeric vector of data values.

Dependent Sample t-Test
Null and Alternative Hypothesis
H0 : µ1 = µ2
Ha : µ1 ≠ µ2 two-tailed: two.sided
H0 : µ1 ≤ µ2
Ha : µ1 > µ2 one-tailed: greater
H0 : µ1 ≥ µ2
Ha : µ1 < µ2 one-tailed: less
Dependent Sample t-Test
Assumptions
Your dependent variable should be measured at the interval or ratio level (i.e., it is continuous).
Your independent variable should consist of two categorical, "related groups" or "matched pairs".
There should be no significant outliers in the differences between the two related groups.
The distribution of the differences in the dependent variable between the two related groups should be approximately normally distributed.

Dependent Sample t-Test
Example 1:
In the show "The Pickup Artist", "Mystery" (the host) wants the artists in training to change from being plain old Average Frustrated Chumps, into master pickup artists. He insists that one way to help increase your confidence and your ability to demonstrate higher value is simply to smell the part. He has each contestant go out to a club and try to get as many digits (telephone numbers) as they can without any cologne of any kind (pre-test). Then he has them go out on the next night to the same club after dousing themselves in "Mystery's Freaky Funk" cologne, to see if the number of digits they receive increases (post-test). The results are shown below. Although Mystery may be fashion challenged, is he correct in his assertion that cologne helps when picking up women?
Dependent Sample t-Test
Artist Pre Post
1 14 16
2 14 16
3 14 15
4 6 12
5 18 16
6 10 12
7 18 16
8 10 13
9 6 5
10 10 11
11 10 12
12 10 14
13 6 6
14 2 9
15 14 12

Testing the Normality of the Data
Since p-value is greater than 0.05, we fail to reject H0. Therefore, the difference of the two related sample data is normally distributed.

Step 1: H0 : µpre ≥ µpost and Ha : µpre < µpost
Step 2: α = 0.05
Step 3: Since we are comparing the means of two related groups, we will use the dependent sample t-test.
Step 4: [R output of the test]
Procedures for Testing Hypothesis
Step 5: Since p-value (0.024) is less than the 0.05 level of significance, we reject H0.
Step 6: There is sufficient evidence to conclude that the cologne helps when picking up women.

Dependent Sample t-Test
Example 2:
A teacher is interested to know if the new learning program will help to increase the number of correctly remembered words. 10 subjects learn a list of 50 words. Learning performance is measured using a recall test. After the first test all subjects are instructed how to use the learning program and then learn a second list of 50 words. Learning performance is again measured with the recall test. In the following table the number of correctly remembered words are listed for both tests.
Subject Score 1 Score 2
1 24 26
2 17 24
3 32 31
4 14 17
5 16 17
6 22 25
7 26 25
8 19 24
9 19 22
10 22 23

Testing the Normality of the Data
Since p-value is greater than 0.05, we fail to reject H0. Therefore, the difference of the two related sample data is normally distributed.

Step 1: H0 : µ1 ≥ µ2 and Ha : µ1 < µ2
Step 2: α = 0.05
Step 3: Since we are comparing the means of two related groups, we will use the dependent sample t-test.
Step 4: [R output of the test]
... that the new learning program will help to increase
the number of correct remembered words.
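
The two dependent sample examples can be run as a sketch with the data from the tables above. The vector names are assumptions; alternative = "less" matches Ha : µ1 < µ2, where the first vector is the earlier measurement.

```r
# Example 1: digits collected without (pre) and with (post) cologne.
pre  <- c(14, 14, 14, 6, 18, 10, 18, 10, 6, 10, 10, 10, 6, 2, 14)
post <- c(16, 16, 15, 12, 16, 12, 16, 13, 5, 11, 12, 14, 6, 9, 12)

# Normality is checked on the paired differences, not on each column.
print(shapiro.test(pre - post))

ex1 <- t.test(pre, post, mu = 0, alternative = "less",
              paired = TRUE, conf.level = 0.95)

# Example 2: words recalled before and after the learning program.
score1 <- c(24, 17, 32, 14, 16, 22, 26, 19, 19, 22)
score2 <- c(26, 24, 31, 17, 17, 25, 25, 24, 22, 23)
ex2 <- t.test(score1, score2, mu = 0, alternative = "less",
              paired = TRUE, conf.level = 0.95)

print(ex1)
print(ex2)
```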
Independent Sample t-Test
Independent Sample t-Test is used to test whether population means are significantly different from each other, using the means from randomly drawn samples.
Command for Independent Sample t-Test
t.test(x ~ group, mu = 0, alternative = "less", var.equal = TRUE, conf.level = 0.95)
"x" a numeric vector of data values.
"group" factor of the data.

Independent Sample t-Test
Null and Alternative Hypothesis
H0 : µ1 = µ2
Ha : µ1 ≠ µ2 two-tailed: two.sided
H0 : µ1 ≤ µ2
Ha : µ1 > µ2 one-tailed: greater
H0 : µ1 ≥ µ2
Ha : µ1 < µ2 one-tailed: less
Assumptions
Your dependent variable should be measured on a continuous scale (i.e., it is measured at the interval or ratio level).
Your independent variable should consist of two categorical, independent groups.
You should have independence of observations, which means that there is no relationship between the observations in each group or between the groups themselves.

Assumptions
There should be no significant outliers.
Your dependent variable should be approximately normally distributed for each group of the independent variable.
There needs to be homogeneity of variances. If the variances of the two independent groups are not equal, set var.equal=FALSE (the default of t.test) so that R calculates the Welch two sample t-test instead of the independent sample t-test.
Example 1:
Twenty participants were given a list of 20 words to process. The 20 participants were randomly assigned to one of two treatment conditions. Half were instructed to count the number of vowels in each word (shallow processing). Half were instructed to judge whether the object described by each word would be useful if one were stranded on a desert island (deep processing). After a brief distractor task, all subjects were given a surprise free recall task. The number of words correctly recalled was recorded for each subject. Here are the data:
Shallow Processing Deep Processing
13 12
12 15
11 14
9 14
11 13
13 12
14 15
14 14
14 16
15 17
Did the instructions given to the participants significantly affect their level of recall at 10% level of significance? Discuss your answer.
Testing the Normality of the Data
The p-value for treatment "shallow" is 0.450, and for "deep" is 0.673 which are greater than 0.10, therefore, fail to reject H0. This implies that the dependent variable is normally distributed for each group of the independent variable.

Testing the Homogeneity of Variances
The p-value for Levene's test is 0.691 which is greater than 0.05, therefore, fail to reject H0. It means that we will assume equal variances for this example.
We will use var.equal=TRUE in the command of independent sample t-test.

Procedures for Testing Hypothesis
Step 1: H0 : µshallow = µdeep and Ha : µshallow ≠ µdeep
Step 2: α = 0.10
Step 3: Since we are comparing the means of two independent groups, we will use the independent sample t-test.
Step 4: [R output of the test]
... that the given instructions significantly affect
the level of recall.
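
Step 4 for this example can be sketched with the recall data from the table above. The data frame name recall and its column names are assumptions; var.equal = TRUE follows the equal-variance decision stated above, and conf.level = 0.90 matches α = 0.10.

```r
shallow <- c(13, 12, 11, 9, 11, 13, 14, 14, 14, 15)
deep    <- c(12, 15, 14, 14, 13, 12, 15, 14, 16, 17)

# Long format: one column of values and one grouping factor.
recall <- data.frame(
  words = c(shallow, deep),
  group = factor(rep(c("shallow", "deep"), each = 10))
)

res <- t.test(words ~ group, data = recall, mu = 0,
              alternative = "two.sided", var.equal = TRUE,
              conf.level = 0.90)
print(res)   # compare the p-value with alpha = 0.10
```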
Independent Sample t-Test
Example 2:
Researchers wanted to know whether there was a difference in comprehension among students learning a computer program based on the style of the text. They randomly divided 18 students into two groups of 9 each. The researchers verified that the 18 students were similar in terms of educational level, age, and so on. Group 1 individuals learned the software using a visual manual (multimodal instruction), while Group 2 individuals learned the software using a textual manual (unimodal instruction). The following data represent scores the students received on an exam given to them after they studied from the manuals.
Visual Textual
51.08 64.55
57.03 57.60
44.85 68.59
75.21 50.75
56.87 49.63
75.28 43.58
57.07 57.40
80.30 49.48
52.20 49.57

Testing the Normality of the Data
The p-value for visual manual is 0.148, and for textual manual is 0.380 which are greater than 0.05, therefore, fail to reject H0. This implies that the dependent variable is normally distributed for each group of the independent variable.
Testing the Homogeneity of Variances
The p-value for Levene's test is 0.423 which is greater than 0.05, therefore, fail to reject H0. It means that we will assume equal variances for this example.
We will use var.equal=TRUE in the command of independent sample t-test.

Procedures for Testing Hypothesis
Step 1: H0 : µvisual = µtextual and Ha : µvisual ≠ µtextual
Step 2: α = 0.05
Step 3: Since we are comparing the means of two independent groups, we will use the independent sample t-test.
Step 4: [R output of the test]
Step 5: Since p-value (0.209) is greater than the 0.05 level of significance, we fail to reject H0.
Step 6: There is no significant difference in comprehension among students learning a computer program based on the style of the text.

One-way Analysis of Variance
One-way analysis of variance (ANOVA) is a method of testing the equality of two or more population means by analyzing sample variances.
It is actually a more general form of the t-test that is appropriate to use with three or more data groups.
One-way Analysis of Variance
This command corrects for non-homogeneity, but doesn't give much information. Only F, the p-value and the dfs for numerator and denominator are given; no mean square etc.
Command for (ANOVA) Test
The command is in the package stats, which is installed and loaded with R by default.
oneway.test(x ~ group, data = dataname, var.equal = FALSE)
The default is equal variances not assumed; to change this, set the "var.equal=" option to TRUE.

One-way Analysis of Variance
If you want information about the sum of squares and mean square of the ANOVA, this command is applicable.
Command for (ANOVA) Test
The command is also in the package stats.
summary(aov(x ~ group, data = dataname))
"x" a numeric vector of data values.
"group" factor of the data.
Null and Alternative Hypothesis
H0 : µ1 = µ2 = ... = µk
Ha : At least one of the population means is different from the others.

Assumptions
Your dependent variable should be measured at the interval or ratio level (i.e., it is continuous).
Your independent variable should consist of two or more categorical, independent groups.
You should have independence of observations, which means that there is no relationship between the observations in each group or between the groups themselves.

Assumptions
There should be no significant outliers.
Your dependent variable should be approximately normally distributed for each group of the independent variable.
There needs to be homogeneity of variances. If the variances of the independent groups are not equal, set var.equal=FALSE in oneway.test so that R calculates Welch's ANOVA for unequal variances.

Example 1:
Suppose the National Transportation Safety Board (NTSB) wants to examine the safety of compact cars, midsize cars, and full-size cars. It collects a sample of three for each of the treatments (car types). Using the hypothetical data provided below, test whether the mean pressure applied to the driver's head during a crash test is not equal for each type of car. Use α = 5%.
Compact Cars Midsize Cars Full-size Cars
643 469 484
655 427 456
702 525 402
451 532 623
678 562 711
509 571 488
The p-value for compact, midsize, and full-size are 0.169, 0.457, and 0.381, respectively. All groups have p values greater than 0.05, therefore, fail to reject H0. This implies that the dependent variable is normally distributed for each group of the independent variable.

The p-value for Levene's test is 0.614 which is greater than 0.05, therefore, fail to reject H0. It means that we will assume equal variances for this example.
We will use var.equal=TRUE in the command of ANOVA.

Step 1: H0 : The mean pressure applied to the driver's head during a crash test is equal for each type of car. Ha : The mean pressure applied to the driver's head during a crash test is not equal for each type of car.
Step 2: α = 0.05
Step 3: Since we are comparing the means of one independent variable that consists of two or more categorical groups, we will use the one-way ANOVA.
Step 4: [R output of the test]
Step 5: Since p-value (0.223) is greater than the 0.05 level of significance, we fail to reject H0.
Step 6: There is no sufficient evidence to conclude that the mean pressure applied to the driver's head during a crash test differs among the types of car.
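
Both ANOVA commands can be tried on the crash-test data from the table above. This is a sketch: the data frame name crash and the column names pressure and car are assumptions, and var.equal = TRUE follows the equal-variance decision stated above.

```r
crash <- data.frame(
  pressure = c(643, 655, 702, 451, 678, 509,    # compact
               469, 427, 525, 532, 562, 571,    # midsize
               484, 456, 402, 623, 711, 488),   # full-size
  car = factor(rep(c("compact", "midsize", "fullsize"), each = 6))
)

# F statistic, degrees of freedom and p-value only.
fit <- oneway.test(pressure ~ car, data = crash, var.equal = TRUE)

# Full ANOVA table with sums of squares and mean squares.
tab <- summary(aov(pressure ~ car, data = crash))

print(fit)
print(tab)
```

With var.equal = TRUE, oneway.test( ) and aov( ) report the same F statistic; they differ only in how much of the ANOVA table is shown.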
One-way Analysis of Variance Testing the Normality of the Data
Example 2:
A firm wishes to compare four programs for training workers to perform a
certain manual task. Twenty new employees are randomly assigned to
the training programs, with 5 in each program. At the end of the training
period, a test is conducted to see how quickly trainees can perform the
task. The number of times the task is performed per minute is recorded
for each trainee, with the following results:
Program 1 Program 2 Program 3 Program 4

9 10 12 9
12 6 14 8
14 9 11 11
11 9 13 7
13 10 11 8
Using α = 0.05, determine whether the treatments differ in their

effectiveness.
The p-values for program 1, program 2, program 3, and program 4 are 0.928, 0.054, 0.421, and 0.493, respectively. All groups have p-values greater than 0.05; therefore, we fail to reject H0. This implies that the dependent variable is normally distributed for each group of the independent variable.
The p-value for Levene's test is 0.917, which is greater than 0.05; therefore, we fail to reject H0. This means that we will assume equal variances for this example. We will use var.equal=TRUE in the command of ANOVA.
Step 1: H0: There is no significant difference between the effectiveness of the four programs. Ha: There is a significant difference between the effectiveness of the four programs.
Step 2: α = 0.05
Step 3: Since we are comparing the means of one independent variable that consists of two or more categorical groups, we will use the one-way ANOVA.

Step 4:
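A possible way to run this step in R, assuming the four columns of the table are entered as a data frame and reshaped with stack(). The object names (programs, long, equal_var, welch) are illustrative. The Welch version is included only to show what changes when equal variances are not assumed.

```r
# Tasks per minute for the five trainees in each program (from the table)
programs <- data.frame(
  program1 = c(9, 12, 14, 11, 13),
  program2 = c(10, 6, 9, 9, 10),
  program3 = c(12, 14, 11, 13, 11),
  program4 = c(9, 8, 11, 7, 8)
)

# stack() gives a long table with columns 'values' and 'ind' (the program)
long <- stack(programs)

# One-way ANOVA with equal variances, as decided from Levene's test
equal_var <- oneway.test(values ~ ind, data = long, var.equal = TRUE)
print(equal_var)

# Welch's ANOVA, for comparison only
welch <- oneway.test(values ~ ind, data = long, var.equal = FALSE)
print(welch)
```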

Step 6: This means that the training programs differ in their effectiveness.
Pearson Product Moment Correlation
Pearson product moment correlation coefficient (Pearson r) is a measure of the strength of a linear association between two variables and is denoted by r.
Command for Pearson r
cor.test(x, y, method = "pearson", conf.level = 0.95)
"x" is the independent variable.
"y" is the dependent variable.
H0: There is no significant relationship between two continuous variables.
Ha: There is a significant relationship between two continuous variables.
Pearson Product Moment Correlation
Assumptions
Your two variables should be measured at the interval or ratio level (i.e., they are continuous).
There is a linear relationship between your two variables.
There should be no significant outliers.
Your variables should be approximately normally distributed.
Example 1:
The Rip-off Vending Machine Company operates coffee vending machines in office buildings. The company wants to study the relationship, if any, between the number of cups sold per day and the number of persons working in each building. Sample data for the study were collected by the company and are presented below. Test the significance at the 0.05 level.
No. Working at Location    No. of Cups
5                          10
6                          20
14                         30
19                         40
15                         30
11                         20
18                         40
22                         40
26                         50
It is required that both variables are individually normally distributed. Since the calculated p-values (0.890 and 0.663) are greater than 0.05, we fail to reject H0. Therefore, both variables are normally distributed.
Step 1:
H0: There is no significant relationship between the number of cups sold per day and the number of persons working in each building.
Ha: There is a significant relationship between the number of cups sold per day and the number of persons working in each building.
Step 2: α = 0.05
Step 3: Since we are testing the significant relationship of two variables, we will use Pearson r.
Step 4:
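The computation step for this example can be sketched in R as follows, assuming the table is entered as two vectors (workers and cups are illustrative names) and that normality is checked with the Shapiro-Wilk test, which is an assumption here.

```r
# Data from the vending machine table
workers <- c(5, 6, 14, 19, 15, 11, 18, 22, 26)
cups    <- c(10, 20, 30, 40, 30, 20, 40, 40, 50)

# Normality of each variable (Shapiro-Wilk, assumed)
print(shapiro.test(workers))
print(shapiro.test(cups))

# Pearson r with its significance test
pearson_fit <- cor.test(workers, cups, method = "pearson", conf.level = 0.95)
print(pearson_fit)
```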
Procedures for Testing Hypothesis
Step 5: Since the p-value (< 0.001) is less than the 0.05 level of significance, we reject H0.
Step 6: Therefore, there is a significant relationship between the number of cups sold per day and the number of persons working in each building, and the relationship is very strong based on the correlation coefficient (0.968).
Pearson Product Moment Correlation
Example 2:
A golf pro wants to investigate the relation between the club-head speed of a golf club (measured in miles per hour) and the distance (in yards) that the ball will travel. He realizes other variables besides club-head speed determine the distance a ball will travel (such as club type, ball type, golfer, and weather conditions). To eliminate the variability due to these variables, the pro uses a single model of club and ball, one golfer, and a clear, 70-degree day with no wind. The pro records the club-head speed, measures the distance the ball travels, and collects the data in the table.
Club-head Speed (mph)    Distance (yards)
100                      257
102                      264
103                      274
101                      266
105                      277
100                      263
99                       258
105                      275
Source: Paul Stephenson, student at Joliet Junior College
Testing the Normality of the Data
It is required that both variables are individually normally distributed. Since the calculated p-values (0.356 and 0.341) are greater than 0.05, we fail to reject H0. Therefore, both variables are normally distributed.
Step 1:
H0: There is no significant relationship between the club-head speed of a golf club and the distance that the ball will travel.
Ha: There is a significant relationship between the club-head speed of a golf club and the distance that the ball will travel.
Step 2: α = 0.05
Step 3: Since we are testing the significant relationship of two variables, we will use Pearson r.
Step 4:
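To make the coefficient less of a black box, the sketch below computes r directly from its definition (sum of cross-products divided by the square root of the product of the sums of squares) and then compares it with cor.test(). The variable names are illustrative.

```r
# Data from the golf table
speed    <- c(100, 102, 103, 101, 105, 100, 99, 105)
distance <- c(257, 264, 274, 266, 277, 263, 258, 275)

# Deviations from the means
dx <- speed - mean(speed)
dy <- distance - mean(distance)

# Pearson r from its definition
r_manual <- sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))
print(r_manual)

# The same coefficient, with its p-value, from the command used in this module
golf_fit <- cor.test(speed, distance, method = "pearson", conf.level = 0.95)
print(golf_fit)
```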
Step 5: If the p-value (0.001) is less than or equal to the 0.05 level of significance, we reject H0; otherwise, we fail to reject H0.
Step 6: Since the p-value (0.001) is less than 0.05, we reject H0. Therefore, there is a significant relationship between the club-head speed of a golf club and the distance that the ball will travel, and the relationship is very strong based on the correlation coefficient (0.939).
Activities/Assessments
Directions: Determine whether the sampling is dependent or independent.
(1.) A researcher wishes to compare academic aptitudes of married mathematicians and their spouses. She obtains a random sample of 287 such couples who take an academic aptitude test and determines each spouse's academic aptitude.
(2.) A political scientist wants to know how a random sample of 18- to 25-year-olds feel about Democrats and Republicans in Congress. She obtains a random sample of 1030 registered voters 18 to 25 years of age and asks, "Do you have a favorable/unfavorable opinion of the Democratic/Republican party?" Each individual was asked to disclose his or her opinion about each party.
(3.) An educator wants to determine whether a new curriculum significantly improves standardized test scores for third grade students. She randomly divides 80 third-graders into two groups. Group 1 is taught using the new curriculum, while group 2 is taught using the traditional curriculum. At the end of the school year, both groups are given the standardized test and the mean scores are compared.
(4.) A stock analyst wants to know if there is a difference between the mean rate of return from energy stocks and that from financial stocks. He randomly selects 13 energy stocks and computes the rate of return for the past year. He randomly selects 13 financial stocks and computes the rate of return for the past year.
(5.) An urban economist believes that commute times to work in the South are less than commute times to work in the Midwest. He randomly selects 40 employed individuals in the South and 45 employed individuals in the Midwest and determines their commute times.
1. The fun size of a Snickers bar is supposed to weigh 20 grams. Because the penalty for selling candy bars under their advertised weight is severe, the manufacturer calibrates the machine so the mean weight is 20.1 grams. The quality-control engineer at M&M-Mars, the Snickers manufacturer, is concerned about the calibration. He obtains a random sample of 10 candy bars, weighs them, and obtains the data shown in the table. Should the machine be shut down and calibrated? Because shutting down the plant is very expensive, he decides to conduct the test at the α = 0.01 level of significance.
No.     1     2     3     4     5     6     7     9     10
Weight  19.6  19.9  20.5  21.5  20.6  20.6  20.3  19.7  19.5
2. The mean waiting time at the drive-through of a fast-food restaurant from the time an order is placed to the time the order is received is 84.3 seconds. A manager devises a new drive-through system that he believes will decrease wait time. To test this claim, he initiates the new system at his restaurant and measures the wait time for 10 randomly selected orders. The wait times are provided in the table. Use the α = 0.10 level of significance.
No.   1     2     3     4     5     6     7     8     9     10
Time  90.1  80.6  67.3  95.5  58.1  86.8  75.9  70.2  65.5  70.1
Source: Michael Carlisle, student at Joliet Junior College
3. Professor Katrina measured the time (in seconds) required to catch a falling meter stick for 10 randomly selected students' dominant hand and non-dominant hand. Professor Katrina claims that the reaction time in an individual's dominant hand is less than the reaction time in their non-dominant hand. Test the claim at the α = 0.10 level of significance. The data obtained are presented:
Student            1      2      3      4      5
Dominant Hand      0.177  0.210  0.186  0.189  0.198
Non-Dominant Hand  0.193  0.194  0.160  0.209  0.164
Student            6      7      8      9      10
Dominant Hand      0.194  0.160  0.163  0.166  0.152
Non-Dominant Hand  0.179  0.202  0.208  0.184  0.215
4. Researchers wanted to determine if carpeted rooms contained more bacteria than uncarpeted rooms. To determine the amount of bacteria in a room, researchers pumped the air from the room over a Petri dish at the rate of 1 cubic foot per minute for eight carpeted rooms and eight uncarpeted rooms. Colonies of bacteria were allowed to form in the 16 Petri dishes. The results are presented in the table. A normal probability plot and boxplot indicate that the data are approximately normally distributed with no outliers. Do carpeted rooms have more bacteria than uncarpeted rooms at the α = 0.05 level of significance?
Carpeted (bacteria/cubic foot)    Uncarpeted (bacteria/cubic foot)
11.8                              12.1
8.2                               8.3
10.8                              1.0
10.1                              11.1
7.1                               3.8
13.0                              7.2
14.6                              10.1
14.0                              13.7
5. A researcher is interested in whether a training course increases the teaching performance of the teachers who attended the training courses. Test at the 10% level of significance. The data are shown below:
Case  Before  After    Case  Before  After
1     85      95       11    89      97
2     84      98       12    87      98
3     86      97       13    82      95
4     87      92       14    81      95
5     89      96       15    86      92
6     82      93       16    89      91
7     80      94       17    89      94
8     84      95       18    84      95
9     86      90       19    85      96
10    82      82       20    88      97
6. A pediatrician wants to determine the relation that may exist between a child's height and head circumference. She randomly selects eleven 3-year-old children from her practice, measures their heights and head circumferences, and obtains the data shown in the table below.
Height (inches)    Head Circumference (inches)
27.75              17.5
24.50              17.1
25.50              17.1
26.00              17.3
25.00              16.9
27.75              17.6
26.50              17.3
27.00              17.5
26.75              17.3
26.75              17.5
27.50              17.5
7. A stock analyst wondered whether the mean rate of return of financial, energy, and utility stocks differed over the past five years. He obtained a simple random sample of eight companies from each of the three sectors and obtained the five-year rates of return shown in the following table (in percent):
Financial  Energy  Utilities
10.76      12.72   11.88
15.05      13.91   5.86
17.01      6.43    13.46
5.07       11.19   9.90
19.50      18.79   3.95
8.16       20.73   3.44
10.38      9.60    7.11
6.75       17.40   15.70
Source: Morningstar.com
Are the mean rates of return different at the α = 0.05 level of significance?
8. At a community college, the mathematics department has been experimenting with four different delivery mechanisms for content in their Intermediate Algebra courses. One method is the traditional lecture (method I), the second is a hybrid format in which half the class time is online and the other half is face-to-face (method II), the third is online (method III), and the fourth is an emporium model from which students obtain their lectures and do their work in a lab with an instructor available for assistance (method IV). To assess the effectiveness of the four methods, students in each approach are given a final exam with the results shown next. Do the data suggest that any method has a different mean score from the others?
Method I  Method II  Method III  Method IV
81        85         81          86
81        53         59          90
85        80         70          81
67        75         70          61
88        64         64          84
72        39         78          72
80        75         75          56
63        60         80          68
62        61         52          82
92        83         45          98
82        66         87          79
49        75         85          74
69        66         79          82
66        90
74
80
References
https://wolfweb.unr.edu/homepage/ania/stat352f12lectures/352lecture21f12.pdf
Statistics: Informed Decisions Using Data by Michael Sullivan, III. Fifth Edition.
Probability and Statistics for Engineers and Scientists by Walpole. Ninth Edition.
Engineering Data Analysis
Module 4: Nonparametric Statistics with R Software
K.Elizon
P.Aranas
L.Usona
L.Bautista
E.Baccay
Polytechnic University of the Philippines
Sta. Mesa, Manila
July 31, 2020
Objectives
After successful completion of this module, you should be able to:
1 Distinguish between parametric and nonparametric statistical procedures.
2 Conduct the one-sample sign test.
3 Test a hypothesis about the difference between the medians of two dependent samples.
4 Test a hypothesis about the difference between the medians of two independent samples.
5 Conduct Spearman rank correlation.
6 Conduct a test for two categorical variables.
7 Identify the appropriate test statistics for data that are not normally distributed.
8 Solve real-life problems involving nonparametric data.
Nonparametric Statistics
It refers to a statistical method in which the data is not required to fit a normal distribution. For this reason, nonparametric tests are sometimes referred to as distribution-free tests.
Nonparametric tests serve as an alternative to parametric tests.
Most nonparametric tests apply to data in an ordinal scale, and some apply to data in nominal scale.
Take Note!
Do not use nonparametric procedures if parametric procedures can be used.
Advantages of Nonparametric Statistical Procedures
1 Most nonparametric tests have very few requirements, so it is unlikely that these tests will be used improperly.
2 For some nonparametric procedures, the computations are fairly easy.
3 The procedures can be used for count data or rank data, so nonparametric methods can be used on ordinal data, such as the rankings of a movie as excellent, good, fair, or poor.
Disadvantages of Nonparametric Statistical Procedures
1 Nonparametric procedures are less efficient than parametric procedures.
2 Nonparametric procedures often discard useful information. For example, the sign test uses only the sign of the data and rank tests merely preserve order, so the magnitudes of the actual data values are lost. As a result, nonparametric procedures are typically less powerful.
3 Because fewer requirements must be satisfied to conduct these tests, researchers sometimes incorrectly use these procedures when parametric procedures can be used. Make sure to verify that the assumptions of every statistical test are satisfied.
Determine the appropriate test to use
Nonparametric Test
1. One-Sample Sign Test
2. Wilcoxon Signed Rank Test
3. Mann Whitney U-Test
4. Kruskal Wallis H-Test
5. Spearman Rank Correlation Test
6. Chi-square Test
One Sample Sign Test
One Sample Sign Test is a nonparametric equivalent of tests regarding a single population mean. It makes inferences regarding the median, rather than the mean.
Command for One Sample Sign Test
To use the command for the one sample sign test, you need to download the package signmedian.test.
signmedian.test(x, mu = 0, alternative = "greater", conf.level = 0.95)
Assumptions
The only assumption behind the test is that the samples must be independent of each other.
One Sample Sign Test
Null and Alternative Hypothesis
H0: M = M0
Ha: M ≠ M0    two-tailed: two.sided
H0: M ≤ M0
Ha: M > M0    one-tailed: greater
H0: M ≥ M0
Ha: M < M0    one-tailed: less
Example 1:
A website administrator for a company claims that the median number of visitors per day to the company's website is more than 1500. An employee doubts the accuracy of this claim. The number of visitors per day for 20 randomly selected days are listed below. At α = 0.05, can the employee reject the administrator's claim?
No.  No. of Visitors    No.  No. of Visitors
1    1469               11   1525
2    1463               12   1568
3    1487               13   1602
4    1579               14   1544
5    1462               15   1548
6    1476               16   1492
7    1523               17   1500
8    1620               18   1452
9    1634               19   1511
10   1570               20   1649

Step 1: H0: M ≤ 1500 and Ha: M > 1500
Step 2: α = 0.05
Step 3: Since we are comparing the median of one sample to a known standard median, we will use the one sample sign test.
Step 4:
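The module runs this step with signmedian.test(). The sketch below swaps in base R's binom.test() instead, to show what the sign test does: it counts the values above and below the hypothesized median, drops ties, and tests the count against a Binomial(n, 0.5) distribution. The variable names are illustrative.

```r
# Visitors per day for the 20 sampled days (from the table)
visitors <- c(1469, 1463, 1487, 1579, 1462, 1476, 1523, 1620, 1634, 1570,
              1525, 1568, 1602, 1544, 1548, 1492, 1500, 1452, 1511, 1649)
m0 <- 1500   # hypothesized median

# Signs of the differences from m0; values equal to m0 are dropped
n_above <- sum(visitors > m0)
n_below <- sum(visitors < m0)

# Upper-tailed sign test (Ha: M > 1500) as an exact binomial test
sign_fit <- binom.test(n_above, n_above + n_below, p = 0.5,
                       alternative = "greater")
print(sign_fit)
```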
Procedures for Testing Hypothesis
Step 5: Since the p-value (0.180) is greater than the 0.05 level of significance, we fail to reject H0.
Step 6: There is no sufficient evidence to support the claim of the website administrator.
One Sample Sign Test
Example 2:
Recent studies of the private practices of physicians who saw no Medicaid patients suggested that the median length of each patient visit was 22 minutes. It is believed that the median visit length in practices with a large Medicaid load is shorter than 22 minutes. A random sample of 20 visits in practices with a large Medicaid load yielded, in order, the following visit lengths:
No.  Time (minutes)    No.  Time (minutes)
1    9.4               11   16.8
2    19.3              12   23.4
3    13.4              13   18.1
4    20.1              14   23.5
5    15.6              15   18.7
6    20.4              16   24.8
7    16.2              17   18.9
8    21.6              18   24.9
9    16.4              19   19.1
10   21.9              20   26.8
Based on these data, is there sufficient evidence to conclude that the median visit length in practices with a large Medicaid load is shorter than 22 minutes?
Procedures for Testing Hypothesis
Step 1: H0: M ≥ 22 and Ha: M < 22
Step 2: α = 0.05
Step 3: Since we are comparing the median of one sample to a known standard median, we will use the one sample sign test.

Step 4:
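As a worked illustration of the arithmetic behind the lower-tailed sign test, the sketch below computes the p-value directly from the binomial distribution with base R, rather than with signmedian.test(). Under H0, each visit length that differs from 22 falls above 22 with probability 0.5. The variable names are illustrative.

```r
# Visit lengths in minutes (from the table, No. 1 to 20)
visit <- c(9.4, 19.3, 13.4, 20.1, 15.6, 20.4, 16.2, 21.6, 16.4, 21.9,
           16.8, 23.4, 18.1, 23.5, 18.7, 24.8, 18.9, 24.9, 19.1, 26.8)
m0 <- 22   # hypothesized median

n_above <- sum(visit > m0)
n_below <- sum(visit < m0)
n <- n_above + n_below          # ties with m0 would be excluded

# Lower-tailed p-value: P(X <= n_above) for X ~ Binomial(n, 0.5)
p_lower <- pbinom(n_above, size = n, prob = 0.5)
print(c(n_above = n_above, n_below = n_below, p_value = p_lower))

# Cross-check with the exact binomial test
check <- binom.test(n_above, n, p = 0.5, alternative = "less")
```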

that the median visit length in practices with a large
Medicaid load is shorter than 22 minutes.
Wilcoxon Signed Rank Test
Wilcoxon Signed Rank Test is a nonparametric equivalent of the t-test for two related samples.
Command for Wilcoxon Signed Rank Test
wilcox.test(a, b, mu = 0, alternative = "less", paired = TRUE, conf.level = 0.95, exact = FALSE)
If there is a tie in your data, it is necessary to add the command exact=FALSE to avoid a warning message in the console.
"a" a numeric vector of data values.
"b" a numeric vector of data values.
Null and Alternative Hypothesis
H0: M1 = M2
Ha: M1 ≠ M2    two-tailed: two.sided
H0: M1 ≤ M2
Ha: M1 > M2    one-tailed: greater
H0: M1 ≥ M2
Ha: M1 < M2    one-tailed: less
Wilcoxon Signed Rank Test
Assumptions
Your dependent variable should be measured at the ordinal or continuous level.
Your independent variable should consist of two categorical, "related groups" or "matched pairs".
The distribution of the differences is symmetric.
Example 1:
Ten women participate in a study. A physical therapist measures the women's waistlines before and 8 weeks after a rigorous exercise program begins. Test whether the program decreased the median waistline at the α = 0.01 level of significance.
Before (inches)    After (inches)
18.75              19.50
19.50              19.75
23.00              22.00
24.25              22.50
25.00              25.00
19.00              20.25
36.25              34.50
35.25              35.00
56.00              51.75
32.75              31.25

Step 1:
H0: Mbefore ≤ Mafter and Ha: Mbefore > Mafter
Step 2: α = 0.01
Step 3: Since we are comparing the median of two related groups, we will use the Wilcoxon signed rank test.
Step 4:
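A sketch of this step in R, with illustrative variable names. Because the first argument is before, the alternative that matches Ha: Mbefore > Mafter is "greater". The same call with "less" is shown only to make the effect of the tail choice visible: it returns the opposite tail, so the direction has to be matched to the order of the two vectors.

```r
# Waistlines in inches (from the table)
before <- c(18.75, 19.50, 23.00, 24.25, 25.00, 19.00, 36.25, 35.25, 56.00, 32.75)
after  <- c(19.50, 19.75, 22.00, 22.50, 25.00, 20.25, 34.50, 35.00, 51.75, 31.25)

# Ha: Mbefore > Mafter, so the alternative is "greater" for (before, after)
wsr_greater <- wilcox.test(before, after, mu = 0, alternative = "greater",
                           paired = TRUE, conf.level = 0.95, exact = FALSE)
print(wsr_greater)

# Opposite tail, for comparison only
wsr_less <- wilcox.test(before, after, mu = 0, alternative = "less",
                        paired = TRUE, conf.level = 0.95, exact = FALSE)
print(wsr_less$p.value)
```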
Procedures for Testing Hypothesis
Step 5: Since the p-value (0.069) is greater than the 0.01 level of significance, we fail to reject H0.
Step 6: There is no sufficient evidence to conclude that the program helped to decrease the median waistline.
Wilcoxon Signed Rank Test
Example 2:
An analyst might want to determine whether there is a difference in the cost per mile of airfares in the United States between 1979 and 2009 for various cities. The data in the table represent the costs per mile of airline tickets for a sample of 17 cities for both 1979 and 2009.
City  1979  2009    City  1979  2009
1     20.3  22.8    10    20.3  20.9
2     19.5  12.7    11    19.2  22.6
3     18.6  14.1    12    19.5  16.9
4     20.9  16.1    13    18.7  20.6
5     19.9  25.2    14    17.7  18.5
6     18.6  20.2    15    21.6  23.4
7     19.6  14.9    16    22.4  21.3
8     23.2  21.3    17    20.8  17.4
9     21.8  18.7

Step 1:
H0: M1979 = M2009 and Ha: M1979 ≠ M2009
Step 2: α = 0.05
Step 3: Since we are comparing the median of two related groups, we will use the Wilcoxon signed rank test.
Step 4:
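This step might be run as in the sketch below, assuming the two columns are typed in as vectors in city order (the names cost_1979 and cost_2009 are illustrative). The two-sided alternative matches Ha: M1979 ≠ M2009.

```r
# Cost per mile for the 17 cities (from the table, cities 1 to 17)
cost_1979 <- c(20.3, 19.5, 18.6, 20.9, 19.9, 18.6, 19.6, 23.2, 21.8,
               20.3, 19.2, 19.5, 18.7, 17.7, 21.6, 22.4, 20.8)
cost_2009 <- c(22.8, 12.7, 14.1, 16.1, 25.2, 20.2, 14.9, 21.3, 18.7,
               20.9, 22.6, 16.9, 20.6, 18.5, 23.4, 21.3, 17.4)

# Two-sided Wilcoxon signed rank test on the paired costs
airfare_fit <- wilcox.test(cost_1979, cost_2009, mu = 0,
                           alternative = "two.sided", paired = TRUE,
                           conf.level = 0.95, exact = FALSE)
print(airfare_fit)
```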
Procedures for Testing Hypothesis
Step 5: Since the p-value (0.309) is greater than the 0.05 level of significance, we fail to reject H0.
Step 6: There is no sufficient evidence to conclude that there is a difference in the cost per mile of airfares in the United States between 1979 and 2009 for various cities.
Mann Whitney U-Test
Mann Whitney U-Test is a nonparametric procedure that is used to test the equality of two population medians from independent samples. It is the nonparametric equivalent of the independent sample t-test.
Command for Mann Whitney U-Test
wilcox.test(x ~ group, data = data_frame, mu = 0, alternative = "less", conf.level = 0.95, exact = FALSE)
If there is a tie in your data, it is necessary to add the command exact=FALSE to avoid a warning message in the console.
"x" a numeric vector of data values.
"group" a factor with two levels that identifies the independent groups.
Mann Whitney U-Test
Null and Alternative Hypothesis
H0: M1 = M2
Ha: M1 ≠ M2    two-tailed: two.sided
H0: M1 ≤ M2
Ha: M1 > M2    one-tailed: greater
H0: M1 ≥ M2
Ha: M1 < M2    one-tailed: less
Assumptions
Your dependent variable should be measured at the ordinal or continuous level.
Your independent variable should consist of two categorical, "independent groups".
Mann Whitney U-Test
Example 1:
When exposed to an infection, a person typically develops antibodies. The extent to which the antibodies respond can be measured by looking at a person's titer, which is a measure of the number of antibodies present. The higher the titer is, the more antibodies that are present. The data in the table represent the titers of 11 ill people and 11 healthy people exposed to the tularemia virus in Vermont. Is the level of titer in the ill group greater than the level of titer in the healthy group? Use the α = 0.10 level of significance.
Ill    Healthy
640    10
80     320
1280   320
160    320
640    80
640    160
1280   10
640    640
160    160
320    320
160    320

Step 1:
H0: Mill ≤ Mhealthy and Ha: Mill > Mhealthy
Step 2: α = 0.10
Step 3: Since we are comparing the median of two independent groups, we will use the Mann Whitney U-Test.
Step 4:
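A sketch of this step in R, with illustrative names. With the formula interface, the alternative is read relative to the first level of the grouping factor, and factor levels are alphabetical by default ("healthy" before "ill"). The sketch therefore sets the level order explicitly so that "greater" means ill > healthy, and also runs the default ordering to show that it returns the opposite tail.

```r
# Titers (from the table)
ill     <- c(640, 80, 1280, 160, 640, 640, 1280, 640, 160, 320, 160)
healthy <- c(10, 320, 320, 320, 80, 160, 10, 640, 160, 320, 320)
status  <- rep(c("ill", "healthy"), each = 11)

# "ill" is made the first level so that "greater" tests ill > healthy
titer <- data.frame(value = c(ill, healthy),
                    group = factor(status, levels = c("ill", "healthy")))
mw_fit <- wilcox.test(value ~ group, data = titer, mu = 0,
                      alternative = "greater", conf.level = 0.95,
                      exact = FALSE)
print(mw_fit)

# Default (alphabetical) level order: "greater" now tests healthy > ill
titer_default <- data.frame(value = c(ill, healthy), group = factor(status))
mw_default <- wilcox.test(value ~ group, data = titer_default, mu = 0,
                          alternative = "greater", conf.level = 0.95,
                          exact = FALSE)
print(mw_default$p.value)
```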
Procedures for Testing Hypothesis
Step 5: Since the p-value (0.047) is less than the 0.10 level of significance, we reject H0.
Step 6: There is sufficient evidence to conclude that the level of titer in the ill group is greater than the level of titer in the healthy group.
Mann Whitney U-Test
Example 2:
An engineer is comparing the time to failure (in flight hours) of two different air conditioners for airplanes and wants to determine if the median time to failure for model Y is longer than the median time to failure for model X. She obtains a random sample of 26 failure times for model X and an independent random sample of 17 failure times for model Y. Do the data in the table suggest that the time to failure for model Y is longer? Use the α = 0.05 level of significance.
Mann Whitney U-Test
Model X  Model Y  Model X  Model Y
7        115      109      168
20       55       33       118
5        219      25       122
52       245      19       253
103      239      59
17       130      287
7        412      128
4        62       68
76       225      3
19       129      4
25       71       91
4        12       472
76       200      28
Procedures for Testing Hypothesis
Step 1:
H0: Mx ≥ My and Ha: Mx < My
Step 2: α = 0.05
Step 3: Since we are comparing the median of two independent groups, we will use the Mann Whitney U-Test.

Step 4:
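One way to run this step, assuming the rows with three entries in the table are read as Model X, Model Y, Model X (which gives the stated 26 and 17 observations). The names are illustrative. "X" sorts before "Y", so alternative = "less" tests X < Y.

```r
# Failure times in flight hours (from the table)
model_x <- c(7, 20, 5, 52, 103, 17, 7, 4, 76, 19, 25, 4, 76,
             109, 33, 25, 19, 59, 287, 128, 68, 3, 4, 91, 472, 28)
model_y <- c(115, 55, 219, 245, 239, 130, 412, 62, 225, 129, 71, 12, 200,
             168, 118, 122, 253)

failure <- data.frame(
  hours = c(model_x, model_y),
  model = factor(rep(c("X", "Y"), times = c(length(model_x), length(model_y))))
)

# Ha: Mx < My; "X" is the first factor level, so the alternative is "less"
failure_fit <- wilcox.test(hours ~ model, data = failure, mu = 0,
                           alternative = "less", conf.level = 0.95,
                           exact = FALSE)
print(failure_fit)
```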
Step 5: Since the p-value (< 0.001) is less than the 0.05 level of significance, we reject H0.
Step 6: There is sufficient evidence to conclude that the median time to failure for model Y is longer than the median time to failure for model X.
Kruskal Wallis H-Test
Kruskal Wallis H-Test is a rank-based nonparametric test that can be used to determine if there are statistically significant differences between two or more groups of an independent variable on a continuous or ordinal dependent variable. It is the nonparametric equivalent of one-way ANOVA.
Command for Kruskal Wallis H-Test
kruskal.test(x ~ group, data = data_frame)
H0: µ1 = µ2 = ... = µk
Ha: At least one of the population means is different from the others.
Kruskal Wallis H-Test
Assumptions
One independent variable with two or more levels (independent groups). The test is more commonly used when you have three or more levels.
The level of measurement of the dependent variable is ordinal, interval, or ratio.
Your observations should be independent.
Example 1:
Researchers wanted to compare math test scores of students at the end of secondary school from various cities. Eight randomly selected students from each of Makati, Manila, and Quezon City were administered the same exam; the results are presented in the following table. Can the researchers conclude that the distribution of exam scores is different for each city at the α = 0.01 level of significance?
Makati  Manila  Quezon City
578     568     506
548     530     518
521     571     485
555     569     480
548     563     458
530     535     456
502     561     513
492     513     491

Step 1:
H0: The distribution of exam scores is the same for each city.
Ha: The distribution of exam scores is different for each city.
Step 2: α = 0.01
Step 3: Since we are comparing the median of more than two independent groups, we will use the Kruskal Wallis H-Test.
Step 4:
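The computation for this example can be sketched as follows, with the three columns entered as one score vector and a city factor (illustrative names).

```r
# Exam scores (from the table)
makati <- c(578, 548, 521, 555, 548, 530, 502, 492)
manila <- c(568, 530, 571, 569, 563, 535, 561, 513)
quezon <- c(506, 518, 485, 480, 458, 456, 513, 491)

scores <- data.frame(
  score = c(makati, manila, quezon),
  city  = factor(rep(c("Makati", "Manila", "Quezon City"), each = 8))
)

# Kruskal-Wallis rank sum test across the three cities
kw_fit <- kruskal.test(score ~ city, data = scores)
print(kw_fit)
```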
Procedures for Testing Hypothesis
Step 5: Since the p-value (0.001) is less than the 0.01 level of significance, we reject H0.
Step 6: This means that the distribution of exam scores is different for each city.
Kruskal Wallis H-Test
Example 2:
A family doctor claims that the distributions of HDL cholesterol in males for the age groups 20 to 29 years old, 40 to 49 years old, and 60 to 69 years old are different. He obtains a simple random sample of 12 individuals from each age group and determines their HDL cholesterol. The results are presented in the table. Test the doctor's claim at the α = 0.05 level of significance.
Kruskal Wallis H-Test
No.  20-29 yrs. old  40-49 yrs. old  60-69 yrs. old
1    54              61              44
2    43              41              65
3    38              44              62
4    30              47              53
5    61              33              51
6    53              29              49
7    35              59              49
8    34              35              42
9    39              34              35
10   46              74              44
11   50              50              37
12   35              65              38
Procedures for Testing Hypothesis
Step 1:
H0: The distributions of HDL cholesterol in males for the age groups 20 to 29 years old, 40 to 49 years old, and 60 to 69 years old are the same.
Ha: The distributions of HDL cholesterol in males for the age groups 20 to 29 years old, 40 to 49 years old, and 60 to 69 years old are different.
Step 2: α = 0.05
Step 3: Since we are comparing the median of more than two independent groups, we will use the Kruskal Wallis H-Test.

Step 4:
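A sketch of this step that builds the long-format data from a wide data frame with stack(), as an alternative to typing a grouping column by hand. The column and object names are illustrative.

```r
# HDL cholesterol by age group (from the table)
hdl_wide <- data.frame(
  age_20_29 = c(54, 43, 38, 30, 61, 53, 35, 34, 39, 46, 50, 35),
  age_40_49 = c(61, 41, 44, 47, 33, 29, 59, 35, 34, 74, 50, 65),
  age_60_69 = c(44, 65, 62, 53, 51, 49, 49, 42, 35, 44, 37, 38)
)

# Long format: 'values' holds HDL, 'ind' holds the age group
hdl_long <- stack(hdl_wide)

# Kruskal-Wallis rank sum test across the three age groups
hdl_fit <- kruskal.test(values ~ ind, data = hdl_long)
print(hdl_fit)
```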

Step 6: There is no sufficient evidence to support the claim of the doctor.
Spearman Rank Correlation
Spearman Rank Correlation (Spearman Rho) is used to measure the strength and direction of association between two ordinal or continuous variables. It is a nonparametric version of the Pearson product-moment correlation.
Command for Spearman Rho
cor.test(x, y, method = "spearman", conf.level = 0.95)
"x" is the independent variable.
"y" is the dependent variable.
Null and Alternative Hypothesis
H0: There is no significant relationship between two continuous/ordinal variables.
Ha: There is a significant relationship between two continuous/ordinal variables.
Spearman Rank Correlation
Assumptions
The two variables should be measured on an ordinal or continuous scale.
There needs to be a monotonic relationship between the two variables.
Example 1:
Here is the data of 10 participants in a triathlon. Is there a relationship between the individual ranks obtained in swimming and cycling at the 0.05 level of significance?
Swimming Rank  Cycling Rank
46             50
45             70
18             10
22             25
17             16
31             32
48             48
1              2
61             59
5              8

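Step 1:
H0: There is no significant relationship between the individual ranks obtained in swimming and cycling.
Ha: There is a significant relationship between the individual ranks obtained in swimming and cycling.
Step 2: α = 0.01
Step 3: Since we are testing the significant relationship of two ordinal variables, we will use Spearman Rho.
Step 4: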
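A sketch of this step with illustrative names. Since there are no ties in either column, rho can also be obtained from the rank-difference formula 1 - 6 Σd² / (n(n² - 1)), which the sketch computes alongside cor.test() as a check.

```r
# Swimming and cycling values for the 10 participants (from the table)
swimming <- c(46, 45, 18, 22, 17, 31, 48, 1, 61, 5)
cycling  <- c(50, 70, 10, 25, 16, 32, 48, 2, 59, 8)

# Rank-difference formula (valid here because there are no ties)
d <- rank(swimming) - rank(cycling)
n <- length(d)
rho_manual <- 1 - 6 * sum(d^2) / (n * (n^2 - 1))
print(rho_manual)

# Spearman rho with its significance test
spearman_fit <- cor.test(swimming, cycling, method = "spearman",
                         conf.level = 0.95)
print(spearman_fit)
```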
Procedures for Testing Hypothesis
Step 5: Since the p-value (0.001) is less than the 0.01 level of significance, we reject H0.
Step 6: Therefore, there is a significant relationship between the individual ranks obtained in swimming and cycling, and the relationship is very strong based on the correlation coefficient (0.903).
Spearman Rank Correlation
Example 2:
The following are the ranks in statistics and the ranks in mathematics of 10 students in an examination. Determine if there is a relationship between the ranks of students in the two subjects. Use the 0.05 level of significance.
Subject  Statistics  Mathematics
1        56          66
2        75          70
3        45          40
4        71          60
5        62          65
6        64          56
7        58          59
8        80          77
9        76          67
10       61          67

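Step 1:
H0: There is no significant relationship between the ranks of students in statistics and mathematics subjects.
Ha: There is a significant relationship between the ranks of students in statistics and mathematics subjects.
Step 2: α = 0.05
Step 3: Since we are testing the significant relationship of two ordinal variables, we will use Spearman Rho.
Step 4: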
Procedures for Testing Hypothesis
Step 5: Since the p-value (0.039) is less than the 0.05 level of significance, we reject H0.
Step 6: Therefore, there is a significant relationship between the ranks of students in statistics and mathematics subjects, and the relationship is moderately strong based on the correlation coefficient (0.673).
Chi-Square Test
Chi-Square Test for independence is used to discover if there is an association between two categorical variables.
Command for Chi-Square Test
chisq.test(x, y)
"x" a numeric vector or matrix.
"y" a numeric vector; ignored if x is a matrix.
Chi-Square Test
Null and Alternative Hypothesis
H0: The two categorical variables are independent.
Ha: The two categorical variables are dependent.
Assumptions
There are 2 variables, and both are measured as categories, usually at the nominal level. However, categories may be ordinal. Interval or ratio data that have been collapsed into ordinal categories may also be used.
The two variables should consist of two or more categorical, independent groups.
The data in the cells should be frequencies, or counts of cases, rather than percentages or some other transformation of the data.
For a 2 by 2 table, all expected frequencies > 5.
For a larger table, all expected frequencies > 1 and no more than 20% of all cells may have expected frequencies < 5.
Example 1:
Educators are always looking for novel ways in which to teach statistics to undergraduates as part of a non-statistics degree course (e.g., psychology). With current technology, it is possible to present how-to guides for statistical programs online instead of in a book. However, different people learn in different ways. An educator would like to know whether gender (male/female) is associated with the preferred type of learning medium (online vs. books). Import excel file "chi-square data" sheet name "example 1".
Testing the Assumption
Contingency Table
To Construct Contingency Table
table(a, b)
"a" is a numeric vector that will represent the rows of the contingency table.
"b" is a numeric vector that will represent the columns of the contingency table.
This is a 2 by 2 contingency table. All expected frequencies are greater than 5; this means that the assumption is satisfied.
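The Excel file for this example is not reproduced here, so the sketch below uses small hypothetical vectors (not the module's data) purely to show how table() builds the contingency table and how the expected frequencies can be read from the chisq.test() result to check the assumption.

```r
# Hypothetical responses, for illustration only (NOT the module's Excel data)
gender <- c(rep("female", 30), rep("male", 30))
medium <- c(rep("online", 20), rep("books", 10),
            rep("online", 10), rep("books", 20))

# Rows come from the first vector, columns from the second
tab <- table(gender, medium)
print(tab)

# Expected frequencies under independence, used to check the assumption
chi_fit <- chisq.test(tab)
print(chi_fit$expected)
assumption_ok <- all(chi_fit$expected > 5)
print(assumption_ok)
```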
Step 1:
H0: Gender is not associated with the preferred type of learning medium.
Ha: Gender is associated with the preferred type of learning medium.
Step 2: α = 0.05
Step 3: Since we are testing the significant relationship of two categorical variables, we will use Chi-square test.
Step 4:
Step 5: Since the p-value (0.026) is less than the 0.05 level of significance, we reject H0.
Step 6: Therefore, there is sufficient evidence based on sample data that the gender of students is associated with the preferred type of learning medium.
Example 2:
The Gallup Organization conducted a survey in 2014 asking individuals questions pertaining to social well-being such as strength of relationship with spouse, partner, or closest friend, making time for trips or vacations, and having someone who encourages them to be healthy. Social well-being scores were determined based on answers to these questions and used to categorize individuals as thriving, struggling, or suffering in their social well-being. In addition, body mass index (BMI) was determined based on height and weight of the individual. This allowed for classification as obese, overweight, normal weight, or underweight.
The data in the following contingency table are based on the results of this survey.

                Thriving   Struggling   Suffering
Obese              202        250          102
Overweight         294        302          110
Normal Weight      300        295          103
Underweight         17         17            8

Researchers wanted to determine whether the sample data suggest there is an association between weight classification and social well-being.
The data are presented as a contingency table; the raw data are not given. To solve this problem, we need to construct a matrix.

Chi-Square Test
Procedures for Testing Hypothesis
Step 1:
H0 : There is no association between weight classification and social well-being.
Ha : There is an association between weight classification and social well-being.
Step 2: α = 0.05
Step 3: Since we are testing the significant relationship of two categorical variables, we will use Chi-square test.
Step 4:
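For Example 2, building the matrix from the table of counts and running the test might look like the sketch below. The object names wellbeing and result, and the dimension labels Weight and Social, are chosen here for illustration; the counts are those given in the Example 2 table.

```r
# Counts from the Example 2 contingency table, entered row by row.
wellbeing <- matrix(
  c(202, 250, 102,
    294, 302, 110,
    300, 295, 103,
     17,  17,   8),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    Weight = c("Obese", "Overweight", "Normal Weight", "Underweight"),
    Social = c("Thriving", "Struggling", "Suffering")
  )
)

# Pearson chi-square test of independence on the matrix of counts.
result <- chisq.test(wellbeing)

print(result$expected)   # expected frequencies, used to check the assumption
print(result$statistic)  # chi-square test statistic
print(result$parameter)  # degrees of freedom: (4 - 1) * (3 - 1)
print(result$p.value)    # compare with alpha = 0.05
```

Setting byrow = TRUE lets the counts be typed in the same order as they appear in the printed table, which makes transcription errors easier to spot.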
Step 5: Since the p-value (0.306) is greater than the 0.05 level of significance, we fail to reject H0 .
Step 6: Therefore, there is not sufficient evidence to conclude that there is an association between weight classification and social well-being.

Example 3:
A survey was conducted at a community college of 102 randomly selected students who dropped a course in the current semester to learn why students drop courses. Personal drop reasons include financial, transportation, family issues, health issues, and lack of child care. Course drop reasons include reducing one's load, being unprepared for the course, the course was not what was expected, dissatisfaction with teaching, and not getting the desired grade. Work drop reasons include an increase in hours, a change in shift, and obtaining full-time employment. Test whether gender is independent of drop reason at the α = 0.1 level of significance. Import excel file “chi-square data” sheet name “example 3”.
Contingency Table
This is a 2 by 5 contingency table. All expected frequencies are greater than 1, and no more than 20% of the cells have expected frequencies less than 5, which means that the assumption is satisfied.

Procedures for Testing Hypothesis
Step 1:
H0 : The gender of the students is independent of their drop reason.
Ha : The gender of the students is dependent on their drop reason.
Step 2: α = 0.01
Step 3: Since we are testing the significant relationship of two categorical variables, we will use Chi-square test.
Step 4:

Step 6: Therefore, we do not have enough evidence to conclude that the gender of the students is dependent on their drop reason.
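The expected-frequency assumption stated for Example 3 can also be checked in code rather than by eye. The helper check_expected below is a hypothetical name introduced for this sketch, and the 2 by 5 counts in drops are placeholder values, not the contents of the “example 3” sheet.

```r
# Returns TRUE when every expected count is at least 1 and
# no more than 20% of the cells have an expected count below 5.
# (The module words the first condition as "greater than 1".)
check_expected <- function(counts) {
  # suppressWarnings() hides the small-sample warning; only $expected is used.
  expected <- suppressWarnings(chisq.test(counts)$expected)
  all(expected >= 1) && mean(expected < 5) <= 0.20
}

# Hypothetical 2 x 5 table of counts (placeholder values only).
drops <- matrix(
  c(12,  9, 11, 10,  8,
    10, 11,  9, 12, 10),
  nrow = 2, byrow = TRUE,
  dimnames = list(Gender = c("Female", "Male"),
                  Reason = paste("Reason", 1:5))
)

print(check_expected(drops))
```

With the real data, the same function can be applied to the table produced by table(a, b) before moving on to Step 4.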
1. Patients are instructed to do the exercise program 3 times per week for 6 weeks. After 6 weeks, systolic blood pressures are again measured. The data are shown.

Systolic Blood Pressure of Patient
Patient   Before   After
1         125      118
2         132      134
3         138      130
4         120      124
5         125      105
6         127      130
7         136      130
8         139      132
9         131      123
10        132      128

Is there a difference in systolic blood pressures after participating in the exercise program as compared to before? Use α = 0.01 level of significance.

2. An economist believes that the median income of lawyers who recently graduated from law school is more than $64,000. He queries a random sample of 12 lawyers and obtains the accompanying data. Do the data support the economist's belief at the α = 0.05 level of significance?

No.   Income
1     85,000
2     63,000
3     62,000
4     70,000
5     91,000
6     67,000
7     68,500
8     86,000
9     70,500
10    71,000
11    69,000
12    60,500
3. A sociologist feels that the median age at which women marry in Palawan is less than the median age of 26.9 throughout the Philippines. Based on a random sample of 20 marriage certificates from the province, she obtains the ages shown in the following table:

No.   Age   No.   Age
1     31    11    24
2     27    12    28
3     30    13    25
4     25    14    23
5     21    15    22
6     27    16    24
7     32    17    24
8     23    18    22
9     30    19    26
10    24    20    27

Do the data support the sociologist's feelings at the α = 0.05 level of significance?

4. Is there a difference between health service workers and educational service workers in the amount of compensation employers pay them per hour? Suppose a random sample of seven health service workers is taken along with a random sample of eight educational service workers from different parts of the country.

Health Service Worker   Educational Service Worker
20.10                   26.19
19.80                   23.88
22.36                   25.50
18.75                   21.64
21.90                   24.85
22.96                   25.30
20.75                   24.12
                        23.45
5. Agribusiness researchers are interested in determining the conditions under which Christmas trees grow fastest. A random sample of equivalent-size seedlings is divided into four groups. The trees are all grown in the same field. One group is left to grow naturally (Group 1), one group is given extra water (Group 2), one group is given fertilizer spikes (Group 3), and one group is given fertilizer spikes and extra water (Group 4). At the end of one year, the seedlings are measured for growth (in height). These measurements are shown for each group.

Group 1   Group 2   Group 3   Group 4
5         12        14        20
7         11        10        16
11        9         16        15
9         13        17        14
6         12        12        22

Determine whether there is a significant difference in the growth of trees in these groups. Use α = 0.01.

6. A random sample of 395 people were surveyed and each person was asked to report the highest education level they obtained. The data that resulted from the survey is summarized in the following table:

         Highschool   Bachelor   Masters   Ph.D.
Female   60           54         46        41
Male     40           44         53        57

7. The following are the ranks in population and the ranks in crime rate of 5 cities. Determine if there is a relationship between the ranks of cities in the two measures. Use 0.10 level of significance.

City         1    2    3    4    5
Crime Rate   13   34   5    12   17
Population   9    41   10   2    20
References
https://wolfweb.unr.edu/homepage/ania/stat352f12lectures/352lecture21f12.pdf
Statistics: Informed Decisions Using Data by Michael Sullivan, III. Fifth Edition
Probability and Statistics for Engineers and Scientists by Walpole. Ninth Edition

ENGINEERING DATA ANALYSIS

MIDTERM EXAMINATION
Name: Course & Section:
Directions: Classify the variable as qualitative or quantitative.

1. Address
2. Number of students at a university
3. Number of cars owned
4. Miles per hour at which a car is traveling
5. Amount of money spent on computers this year
6. License number
Directions: Determine whether the quantitative variable is discrete or continuous.
1. Volume of liquid in a glass.
2. Number of beats in a song.
3. Number of coins in a jar.
4. Distance between sides of a street.
5. Air pressure in pounds per square inch in an automobile tire.
6. The length of a leaf.
Directions: Determine the level of measurement of each variable.
1. Birth order among siblings in a family.
2. Favorite movie.
3. Volume of water used by a household in a day
4. Eye color
5. Number of siblings
6. City of birth
Directions: Indicate whether each of the following statements involves descriptive or inferential statistics.
1. Based on the data of the National Telecommunication, the number of subscriptions increased by
30% from 2015 to 2016.
2. The yearly expenditure on food for 15 families is estimated to be P240,000.
3. The chance that a person will be robbed in a certain city is 15%.
4. A national poll in November indicated that the presidential election would be very close.
5. 20% of students in a certain school experienced being robbed.
6. A random sample of 50 residents of a certain metropolitan area was asked the time that they
spend commuting to work (one way). The mean time spent by these 50 residents was 35 minutes.
Directions: A research objective is presented. For each, identify the (a) population and (b) sample in the study.
1. A quality-control manager randomly selects 70 bottles of ketchup that were filled on July 17 to assess the
calibration of the filling machine.
(a)
(b)
2. Researchers want to determine whether or not higher folate intake is associated with a lower risk of hypertension
(high blood pressure) in women (27 to 44 years of age). To make this determination, they look at 7373 cases
of hypertension in these women and find that those who consume at least 1000 micrograms per day (µg /d) of
total folate had a decreased risk of hypertension compared with those who consume less than 200 µg /d.
(a)
(b)
Directions: Read each item carefully. Write the R commands/code based on the information requested in each item.
1. Out of 27 respondents considered in the survey, 12, 5, and 10 respondents named mathematics, statistics, and science,
respectively, as their favorite subject. Create a factor based on the information given.
2. Create a list that contains a numeric vector (1 to 30, 34, 45, 47, and 50 to 70), a character vector that
repeats the elements A, B, and C 30 times, and a 2 x 3 matrix that contains the numbers from 101 to 106 filled by
rows.
3. Create a data frame based on the given information.
Name Math Grades Stat Grades

Elena 89 76
Mae 78 90
Karen 90 94
Laila 88 79
Eric 92 77
Joshua 87 95
Francis 50 80
Kat 78 90
Erika 80 85
Camille 89 75
Rhea 89 91
4. Select students with grades in mathematics at most 85 and at least 85 in statistics.
ENGINEERING DATA ANALYSIS

FINAL EXAMINATION
Name: Course & Section:
Directions: Read each item carefully. For each item, write the letter corresponding to the best answer on yellow paper.
Write NONE if no correct choice is given.
1. Which of the following is a null hypothesis?

(a) There will be no difference between the length of time taken to complete a test online and the time taken
to complete a test on paper.
(b) Tests completed online will be completed faster than tests completed on paper.
(c) There will be a difference between the length of time taken to complete tests online and tests completed on
paper, and if there is, it is due to chance.
(d) All of the above
2. It is a nonparametric procedure that is used to test the equality of two population medians from independent
samples.
(a) Independent Sample t - Test (c) Mann - Whitney U - Test

(b) Dependent Sample t - Test (d) Wilcoxon Signed Rank Test
3. It is the nonparametric equivalent of the t-test for two related samples.
(a) Independent Sample t - Test (c) Mann - Whitney U - Test

(b) Dependent Sample t - Test (d) Wilcoxon Signed Rank Test
4. It is an assumption of the ______ test that the distribution of the differences in the dependent variable
between the two related groups should be approximately normally distributed.
(a) Independent Sample t - Test (c) Pearson r

(b) Dependent Sample t - Test (d) Spearman Rho
5. It is an assumption of the ______ test that the two variables should be measured at the interval or ratio level
(i.e., they are continuous) given that the data is not normal.
(a) Independent Sample t - Test (c) Pearson r

6. It is a general form of the independent sample t-test that is appropriate to use with three or more data groups.
(a) Dependent Sample t - Test (c) Kruskal Wallis H - Test

(b) Welch Analysis of Variance (d) One - Way Analysis of Variance
7. It is a method of testing the equality of two or more population means by analyzing sample variances. (Assuming
normal and not equal variances)
(a) Dependent Sample t - Test (c) Kruskal Wallis H - Test
(b) Welch Analysis of Variance (d) One - Way Analysis of Variance
8. It is an assumption of the ______ test that the independent variable should consist of two or more categorical,
independent groups.
(a) One - Way Analysis of Variance (c) Kruskal Wallis H - Test

(b) Welch Analysis of Variance (d) All of the Above
9. It is an assumption of the ______ test that the dependent variable should be measured at the interval or
ratio level (i.e., they are continuous).
(a) Dependent Sample t - Test (c) One - Way Analysis of Variance

(b) Independent Sample t - Test (d) All of the Above
10. It is a nonparametric equivalent of tests regarding a single population mean.
(a) Kruskal Wallis H - Test (c) Mann Whitney U - Test

(b) Wilcoxon Signed Rank Test (d) One sample Sign Test
11. It refers to a statistical method in which the data is not required to fit a normal distribution. Due to such
reason, they are sometimes referred to as distribution-free tests.
(a) Shapiro Wilk Test (c) Levene’s Test

(b) Non-parametric Statistics (d) Parametric Statistics
12. It refers to statistical methods that apply to data on a ratio scale and, in some cases, to data on an interval scale.
(a) Shapiro Wilk Test (c) Levene’s Test

(b) Non-parametric Statistics (d) Parametric Statistics
13. The null hypothesis of Levene’s test is ______.
(a) Equal variances assumed (c) Data follows a Normal Distribution

(b) Equal variances Not assumed (d) Data does not follow a Normal Distribution
14. It is a statement about the population parameter that is contradictory to the null hypothesis, and is accepted
as true only if there is convincing evidence in favor of it.
(a) Hypothesis Testing (c) Null Hypothesis

(b) Alternative Hypothesis (d) Statement of the Problem
15. When the value of the x variable increases and the value of the y variable also increases, the relationship is known as ______.
(a) No Relationship (c) Inverse Relationship

(b) Direct Relationship (d) None of the above
16. If the computed correlation coefficient of two continuous variables is 0.967, then describe the relationship.
(a) Weak Negative and Inverse Relationship
(b) Strong Negative and Inverse Relationship
(c) Strong Positive and Direct Relationship
(d) Weak Positive and Direct Relationship
17. A company believes that it controls more than 30% of the total market share for one of its products. To test
this belief, a random sample of 144 purchasers of this product is contacted. It is found that 50 of the 144
purchased this company’s brand of the product. If a researcher wants to conduct a statistical test for this
problem, the alternative hypothesis would be ______.
(a) the population proportion is less than 0.30
(b) the population proportion is greater than 0.30
(c) the population proportion is not equal to 0.30
(d) the population mean is less than 40
18. If the computed value for Pearson r is negative, this implies that there is a/an ______ relationship between
variables x and y.
(a) No Relationship (c) Inverse Relationship

(b) Direct Relationship (d) Undefined
19. It is a test for single mean when the population mean and standard deviation are known, and follow a normal
distribution.
(a) One Sample t - Test (c) Pearson r

20. The null hypothesis of the Shapiro Wilk test is ______.
(a) Equal variances assumed (c) Data follows a Normal Distribution

(b) Equal variances Not assumed (d) Data does not follow a Normal Distribution
Directions: Follow the 6 steps procedure of hypothesis testing.

1. Tim is doing a research project involving pet preferences among students at his college. He took random
samples of 300 female and 250 male students. Each sample member responded to the survey question “If you
could own only one pet, what kind would you choose?” The possible responses were: dog, cat, other pet, no pet.
The results of the study are as follows:
Gender Dog Cat Other Pet No Pet

Female 120 132 18 30
Male 135 70 20 25
Is the gender of the students associated with pet preference? Use a 1% level of significance.
(a) Step 1:
(b) Step 2:
(c) Step 3:
Check the assumptions.
(d) Step 4:
(e) Step 5:
(f) Step 6:
2. Some studies have shown that in the Philippines, men spend more than women buying gifts and cards on
Valentine's Day. Suppose a researcher wants to test this hypothesis by randomly sampling nine men and 10
women with comparable demographic characteristics from various large cities across the Philippines to be in a
study. Each study participant is asked to keep a log beginning one month before Valentine's Day and record all
purchases made for Valentine's Day during that one-month period. The resulting data are shown below. Use
these data and a 1% level of significance to determine if, on average, men actually do spend significantly
more than women on Valentine's Day.
Men (in pesos)   Women (in pesos)

10,748           12,598
14,361           4,553
9,019            5,635
12,553           8,062
7,070            4,637
8,300            4,434
12,963           7,521
15,422           6,848
9,380            8,582
                 12,611
(a) Step 1:
(b) Step 2:
(c) Step 3:
Check the assumptions.
(d) Step 4:
(e) Step 5:
(f) Step 6:

