Session 4 Correlation and Regression


CORRELATION AND REGRESSION
Objectives:
 Discuss and perform correlation and regression using MS Excel and SPSS;
 Discuss and perform the z-test, t-test, and F-test (ANOVA, Analysis of Variance) using MS Excel and SPSS;
 Construct and perform statistical analyses and interpret the resulting statistics.
REVIEW OF HYPOTHESIS TESTING
Hypothesis
 A statistical hypothesis is a conjecture about a population parameter.
 This conjecture may or may not be true (Bluman, 2016).
Types of Hypothesis:
 The null hypothesis, symbolized by H0, is a
statistical hypothesis that states that there is
no difference between a parameter and a
specific value, or that there is no difference
between two parameters.
 The alternative hypothesis, symbolized by H1,
is a statistical hypothesis that states the
existence of a difference between a parameter
and a specific value, or states that there is a
difference between two parameters.
 A statistical test uses the data
obtained from a sample to make a
decision about whether the null
hypothesis should be rejected.
 The numerical value obtained from a
statistical test is called the test value.
Types of Error:
 A type I error (alpha error) occurs if
you reject the null hypothesis when it
is true.
 A type II error (beta error) occurs if you do not reject the null hypothesis when it is false.
Level of Significance (α)
 The level of significance is the maximum
probability of committing a type I error. This
probability is symbolized by α (Greek letter
alpha). That is, P(type I error) = α.
 Statisticians generally agree on using three
arbitrary significance levels: the 0.10, 0.05,
and 0.01 levels.
Level of Significance (α)
 That is, if the null hypothesis is rejected, the
probability of a type I error will be 10%, 5%, or
1%, depending on which level of significance is
used.
 Here is another way of putting it:
 When α=0.10, there is a 10% chance of rejecting
a true null hypothesis;
 when α=0.05, there is a 5% chance of rejecting a
true null hypothesis;
 and when α=0.01, there is a 1% chance of
rejecting a true null hypothesis.
Level of Significance (α)
 In a hypothesis-testing situation, the
researcher decides what level of
significance to use.
 It does not have to be the 0.10, 0.05,
or 0.01 level.
 It can be any level, depending on
the seriousness of the type I error.
 0.05 is commonly used in educational research.
Methods of Hypothesis Testing:
1. Using the Critical Value (Traditional Method)
2. Using the P-Value Method
3. Using a Confidence Interval
Using the Critical Value (Traditional Method)
 The critical value separates the critical region from
the noncritical region. The symbol for critical value is
C.V.
 The critical or rejection region is the range of
values of the test value that indicates that there is a
significant difference and that the null hypothesis
should be rejected.
 The noncritical or nonrejection region is the
range of values of the test value that indicates that
the difference was probably due to chance and that
the null hypothesis should not be rejected.
Using P-Value Method
 The P-value (or probability value) is the
probability of getting a sample statistic (such
as the mean) or a more extreme sample
statistic in the direction of the alternative
hypothesis when the null hypothesis is true.
 In summary, then, if the P-value is less than
α, reject the null hypothesis.
 If the P-value is greater than α, do not reject
the null hypothesis.
Using Confidence Interval
 If zero is included in the
confidence interval, do not reject
the null hypothesis.
 If zero is not included in the
confidence interval, reject the
null hypothesis.
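
The two decision rules above (P-value and confidence interval) are easy to mechanize. A minimal Python sketch, with hypothetical helper names, is shown here only to make the rules concrete:

# Decision rules for hypothesis testing (illustrative helpers, not part of the slides).
def decide_by_p_value(p_value, alpha=0.05):
    # Reject H0 when the P-value is less than alpha.
    return "reject H0" if p_value < alpha else "do not reject H0"

def decide_by_confidence_interval(lower, upper):
    # Reject H0 when zero lies outside the confidence interval.
    return "do not reject H0" if lower <= 0 <= upper else "reject H0"

print(decide_by_p_value(0.003, alpha=0.01))       # reject H0
print(decide_by_confidence_interval(-1.2, 0.4))   # do not reject H0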
CORRELATION AND REGRESSION
Objectives:
 Discuss and perform correlation and regression using MS Excel and SPSS;
 Discuss and perform the z-test, t-test, and F-test (ANOVA, Analysis of Variance) using MS Excel and SPSS;
 Construct and perform statistical analyses and interpret the resulting statistics.
Correlation
A quantitative relationship between two interval- or ratio-level variables.

   X                            Y
   Hours of Training            Number of Accidents
   Shoe Size                    Height
   Cigarettes Smoked per Day    Lung Capacity
   Score on NAT                 Grade Point Average
   Height                       IQ

It is used to measure the strength of the relationship between two variables and to use this measure of strength to decide whether or not any significant linear relationship exists.
Correlation
 Measures and describes the strength and direction of the relationship.
 Bivariate techniques require two variable scores from the same individuals (Variable 1 and Variable 2).
 Multivariate techniques involve more than two variables, e.g., more than one independent variable (the effect of advertising and prices on sales).
 Variables must be on a ratio or interval scale.
(Diagram: bivariate relates Variable 1 to Variable 2; multivariate relates Variable 1 and Variable 2 to Variable 3.)
Scatter Plots and Types of Correlation
x = hours of training (horizontal axis)
y = number of accidents (vertical axis)
(Scatter plot: number of accidents against hours of training; the points fall as training hours rise.)
Negative Correlation – as x increases, y decreases


Scatter Plots and Types of Correlation
x = NAT score (Math)
y = GPA
(Scatter plot: GPA against Math NAT score; the points rise with the score.)
Positive Correlation – as x increases, y increases
Scatter Plots and Types of Correlation
x = height, y = IQ
(Scatter plot: IQ against height; the points show no pattern.)
No linear correlation
Scatter Plots and Types of Correlation
(Scatter plot: a strong, negative relationship, but non-linear!)
Correlation Coefficient “r”
 A measure of the strength and direction of a linear relationship between two variables.
 The range of r is from –1.0 to 1.0.
 If r is close to –1, there is a strong negative correlation.
 If r is close to 0, there is no linear correlation.
 If r is close to 1, there is a strong positive correlation.
Sample Application
   Absences (x)   Final Grade (y)
        8               78
        2               92
        5               90
       12               58
       15               43
        9               74
        6               81
(Scatter plot: final grade against absences.)
Computation of r
        x      y      xy     x²      y²
 1      8     78     624     64    6084
 2      2     92     184      4    8464
 3      5     90     450     25    8100
 4     12     58     696    144    3364
 5     15     43     645    225    1849
 6      9     74     666     81    5476
 7      6     81     486     36    6561
 Σ     57    516    3751    579   39898
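As a check on the totals above, the usual computational form of Pearson's r (the slide's own formula did not survive conversion, so this is a reconstruction from the column sums) gives:

\[
r = \frac{n\sum xy - (\sum x)(\sum y)}{\sqrt{\left[n\sum x^{2}-(\sum x)^{2}\right]\left[n\sum y^{2}-(\sum y)^{2}\right]}}
  = \frac{7(3751)-(57)(516)}{\sqrt{\left[7(579)-57^{2}\right]\left[7(39898)-516^{2}\right]}}
  = \frac{-3155}{\sqrt{(804)(13030)}} \approx -0.975
\]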


Interpreting r results…
   r                  Interpretation
   ±1.00              perfect linear relationship
   ±0.81 to ±0.99     very strong linear relationship
   ±0.61 to ±0.80     strong linear relationship
   ±0.41 to ±0.60     moderate linear relationship
   ±0.21 to ±0.40     weak linear relationship
   ±0.01 to ±0.20     very weak linear relationship
   0                  no linear relationship
Example:
 Thus, the computed r of –0.975 indicates a very strong negative relationship between absences and the final grades of the students.
 In short, as a student's absences increase, the grade tends to decrease.
Example:
 Likewise, a computed r = 0.86 in a study of the relationship between length of administrative experience and productivity of school heads means that the longer the school heads stay in their jobs, the higher their productivity.
Strength of the Association
 The coefficient of determination, r², measures the strength of the association and is the ratio of the explained variation in y to the total variation in y.
Strength of the Association
 The correlation coefficient of the number of
times absent and the final grade is r = –0.975.
 The coefficient of determination is
r2 = (–0.975)2 = 0.9506.
Interpretation:
 About 95% of the variation in final grades can be explained by the number of times a student is absent.
 The other 5% is unexplained and can be due to sampling error or other variables such as intelligence, amount of time studied, etc.
COMMON ERRORS OF CORRELATION ANALYSIS
 The coefficient of correlation is sometimes interpreted as a percentage. This can be a serious mistake.
 For example, if a correlation coefficient of r = 0.7 is interpreted as meaning that 70 percent of the variation in y is explained, this is significantly above the 49 percent that is actually explained by the coefficient of determination, r² = 0.49.
COMMON ERRORS AND LIMITATIONS OF CORRELATION ANALYSIS
 The coefficient of determination is also subject to misinterpretation. It is sometimes interpreted as the percentage of the variation in the dependent variable caused by the independent variable.
 This is simply nonsense. It should always be remembered that it is the variation in the dependent variable that is being explained or accounted for (but not necessarily caused) by the x variable.
The Line of Regression
The equation of a line may be written as y = mx + b, where m is the slope of the line and b is the y-intercept.
The line of regression is: ŷ = mx + b
The slope m is: m = [nΣxy – (Σx)(Σy)] / [nΣx² – (Σx)²]
The y-intercept is: b = ȳ – m·x̄ = (Σy – m·Σx) / n
(xᵢ, yᵢ) = a data point
(xᵢ, ŷᵢ) = a point on the line with the same x-value
dᵢ = yᵢ – ŷᵢ = a residual
(Scatter plot: revenue against advertising spend (Ad Php), with the best-fitting straight line drawn through the points.)
        x      y      xy     x²      y²
 1      8     78     624     64    6084
 2      2     92     184      4    8464
 3      5     90     450     25    8100
 4     12     58     696    144    3364
 5     15     43     645    225    1849
 6      9     74     666     81    5476
 7      6     81     486     36    6561
 Σ     57    516    3751    579   39898

 Write the equation of the line of regression, with x = number of absences and y = final grade.
Calculate m and b.
The line of regression is: ŷ = –3.924x + 105.667
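Substituting the same column totals into the least-squares formulas stated earlier (a worked reconstruction of the slide's computation):

\[
m = \frac{n\sum xy - (\sum x)(\sum y)}{n\sum x^{2} - (\sum x)^{2}} = \frac{7(3751)-(57)(516)}{7(579)-57^{2}} = \frac{-3155}{804} \approx -3.924
\]
\[
b = \bar{y} - m\bar{x} = \frac{516}{7} - (-3.924)\left(\frac{57}{7}\right) \approx 73.714 + 31.953 \approx 105.667
\]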


The Line of Regression
m = –3.924 and b = 105.667
The line of regression is: ŷ = –3.924x + 105.667
(Scatter plot: final grade against absences, with the regression line drawn through the points.)
Note that the point (x̄, ȳ) = (8.143, 73.714) is on the line.
Predicting y Values
 The regression line can be used to predict values of y for values of x falling within the range of the data.
 The regression equation for number of times absent and final grade is: ŷ = –3.924x + 105.667
 Use this equation to predict the expected grade for a student with (a) 3 absences, and (b) 12 absences.
(a) ŷ = –3.924(3) + 105.667 = 93.895
(b) ŷ = –3.924(12) + 105.667 = 58.579
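A minimal Python sketch of the same prediction step (Python is used here only for illustration; the workshops in this deck use MS Excel and SPSS):

# Predict a final grade from the number of absences using the fitted line above.
def predict_grade(absences, m=-3.924, b=105.667):
    # y-hat = m*x + b; meaningful only for x within the observed range (2 to 15 absences).
    return m * absences + b

print(round(predict_grade(3), 3))    # 93.895
print(round(predict_grade(12), 3))   # 58.579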
The Line of Regression
 Regression indicates the degree to which the variation in one variable, Y, is related to or can be explained by the variation in another variable, X.
 Once you know there is a significant linear correlation, you can write an equation describing the relationship between the x and y variables.
 This equation is called the line of regression or least squares line.
Hypothesis Test for Correlation
Recall: Application
   Absences (x)   Final Grade (y)
        8               78
        2               92
        5               90
       12               58
       15               43
        9               74
        6               81
(Scatter plot: final grade against absences.)
Computation of r
        x      y      xy     x²      y²
 1      8     78     624     64    6084
 2      2     92     184      4    8464
 3      5     90     450     25    8100
 4     12     58     696    144    3364
 5     15     43     645    225    1849
 6      9     74     666     81    5476
 7      6     81     486     36    6561
 Σ     57    516    3751    579   39898


Interpreting r results…
   r                  Interpretation
   ±1.00              perfect linear relationship
   ±0.81 to ±0.99     very strong linear relationship
   ±0.61 to ±0.80     strong linear relationship
   ±0.41 to ±0.60     moderate linear relationship
   ±0.21 to ±0.40     weak linear relationship
   ±0.01 to ±0.20     very weak linear relationship
   0                  no linear relationship
 Thus, the computed r of –0.975 indicates a very strong negative relationship between absences and the final grades of the students.
 In short, as a student's absences increase, the grade tends to decrease.
Hypothesis Test for Significance
 r is the correlation coefficient for the sample.
 The correlation coefficient for the population is ρ (rho).
 For a two-tailed test for significance:
   H0: ρ = 0 (The correlation is not significant)
   H1: ρ ≠ 0 (The correlation is significant)
 The sampling distribution for r is a t-distribution with n – 2 degrees of freedom (df).
Test of Significance
The correlation between the number of times absent and final grade is r = –0.975. There were seven pairs of data. Test the significance of this correlation. Use α = 0.01.
1. Write the null and alternative hypotheses.
   H0: ρ = 0 (The correlation is not significant)
   H1: ρ ≠ 0 (The correlation is significant)
2. State the level of significance.
   α = 0.01
3. Identify the sampling distribution.
   A t-distribution with n – 2 = 7 – 2 = 5 degrees of freedom
Rejection Regions
Critical values: ±t0 = ±4.032 (from the t-table: df = 5, α = 0.01, two-tailed)
4. Find the critical values: t0 = ±4.032.
5. Find the rejection region: t < –4.032 or t > 4.032.
6. Find the test statistic:
   t = r·sqrt[(n – 2) / (1 – r²)] = –0.975·sqrt[5 / (1 – 0.950625)] ≈ –9.811
Rejection Regions
(t-distribution sketch: the test value tc = –9.811 lies far beyond the lower critical value –4.032.)
7. Make your decision.
   tc = –9.811 falls in the rejection region. Therefore, reject the null hypothesis.
8. Interpret your decision.
   There is a significant negative correlation between the number of times absent and final grades.
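For reference, the same test can be reproduced in a few lines of Python (an illustrative sketch only; the workshop below uses MS Excel and SPSS). scipy.stats.pearsonr returns r together with a two-sided P-value:

# Correlation and its significance test for the absences data above.
from math import sqrt
from scipy import stats

absences = [8, 2, 5, 12, 15, 9, 6]
grades = [78, 92, 90, 58, 43, 74, 81]

r, p_value = stats.pearsonr(absences, grades)
n = len(absences)
t = r * sqrt((n - 2) / (1 - r**2))   # test statistic, df = n - 2 = 5

print(round(r, 3), round(t, 2), round(p_value, 4))
# Approximately -0.975, -9.76, 0.0002; the slide's t = -9.811 uses r rounded to -0.975 first.
# Since p < 0.01 (and t lies beyond -4.032), reject the null hypothesis.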
WORKSHOP ON CORRELATION AND REGRESSION USING EXCEL AND SPSS
Exercises
1. A manager of an automobile dealership wishes to investigate the relationship between the number of radio ads aired per week and the number of automobiles sold. A random sample of the records for 8 weeks produces the following data pairs, where x is the number of radio ads aired and y is the number of automobiles sold.
   x:   3   16    7    4   15    7    8    5
   y:   4   40   16    9   38   16   17   10
2. Below are the data obtained in a study of the relationship between the weight and the chest size of infants at birth.
   X (kg):    3.5   2.5   3.41   3.52   3.21   3.32   2.31   4.3
   Y (cm):   29.5  26.3   32.2   36.5   27.2   27.7   28.3  30.3
Exercises
3. An anti-drug agent did a study on the relationship between the frequency of drug information campaigns and the annual drug cases filed in a certain city. The following data show this.
   Year:                           2005  2006  2007  2008  2009  2010
   No. of information campaigns:      1     2     4     3     5     2
   Drug cases:                       21    18    10    15     4    17
 Is there a significant relationship between the frequency of information campaigns and drug cases in the city?
Exercises
4. The following data show the student enrolment for the past 5 years in a certain school.
   Year:        2007  2008  2009  2010  2011
   Enrolment:    124   243   229   321   357
 Determine the regression equation. From it, predict the enrolment for 2013 and for 2015.
Exercises
1. Suppose you have computed rc = 0.44 in a study of 36 samples. Is it significant at the 0.05 level? At the 0.01 level?
2. When a study was conducted to find out the relationship between the number of homework assignments given to 89 randomly selected Grade 4 pupils and their GPA in Mathematics, it resulted in a correlation coefficient of –0.71. Is this value significant at the .05 level?
 What if this result had been obtained from only 25 pupil-samples? Would it still be significant?
WORKSHOP ON CORRELATION AND REGRESSION USING A RESEARCH STUDY
Recall: Statement of the Problems / Research Objectives:
 Refer to the research instrument about the teachers' job satisfaction survey taken from:
JOB SATISFACTION LEVEL OF K TO 12 TEACHERS UTILIZING MULTIPLE STATISTICAL TOOLS
Glorineil D. Romero, Ph.D. (c)ᵃ, Dr. Nimrod F. Bantigueᵇ
a. University of the Philippines, Diliman, Quezon City, Philippines
b. Oriental Mindoro National High School, Calapan, Oriental Mindoro, Philippines
Corresponding email: neildromero21@hotmail.com
This study will answer the following questions:
1. What is the profile of selected K to 12 teachers in terms of age, basic salary, gender, civil status, number of dependents, and net take-home pay?
2. What is the level of K to 12 teachers' job satisfaction in terms of job security, work environment, job responsibilities, and community linkage/attachments?
3. Is there a significant difference between married and single K to 12 public school teachers in terms of job security, work environment, job responsibilities, and community attachment?
4. Is there a correlation in job satisfaction among teachers in terms of job security and work environment? Job security and job responsibilities?
5. Is there a significant relationship among public school teachers' job satisfaction in terms of security, work environment, job responsibilities, and community attachments?
Romero & Bantigue, 2017
THANK YOU…
AK (Answer Key):
1. t = 2.85
2. t = –9.40
End of Session
ADDITIONAL CONTENT ON SAMPLING
 Important qualitative factors in determining the sample size:
   the importance of the decision
   the nature of the research
   the number of variables
   the nature of the analysis
   sample sizes used in similar studies
   completion rates
   resource constraints
Classification of Sampling Techniques
 Non-Probability Sampling Techniques: Convenience Sampling, Judgmental Sampling, Quota Sampling, Snowball Sampling
 Probability Sampling Techniques: Simple Random Sampling, Systematic Sampling, Stratified Sampling, Cluster Sampling, Other Sampling Techniques
Convenience Sampling
Convenience sampling attempts to obtain a sample of convenient elements. Often, respondents are selected because they happen to be in the right place at the right time.
 use of students and members of social organizations
 mall-intercept interviews without qualifying the respondents
 department stores using charge account lists
 “people on the street” interviews
Judgmental Sampling
Judgmental sampling is a form of convenience sampling in which the population elements are selected based on the judgment of the researcher.
 purchase engineers selected in industrial marketing research
 expert witnesses used in court
Quota Sampling
Quota sampling may be viewed as two-stage restricted judgmental sampling.
 The first stage consists of developing control categories, or quotas, of population elements.
 In the second stage, sample elements are selected based on convenience or judgment.

Control characteristic: Sex
              Population composition    Sample composition
              Percentage                Percentage    Number
   Male           48                        48           480
   Female         52                        52           520
   Total         100                       100          1000
Snowball Sampling
In snowball sampling, an initial group of respondents is selected, usually at random.
 After being interviewed, these respondents are asked to identify others who belong to the target population of interest.
 Subsequent respondents are selected based on the referrals.
Simple Random Sampling
 Each element in the population has a
known and equal probability of selection.
 Each possible sample of a given size (n) has
a known and equal probability of being the
sample actually selected.
 This implies that every element is selected
independently of every other element.
Systematic Sampling
 The sample is chosen by selecting a random starting point and then
picking every ith element in succession from the sampling frame.
 The sampling interval, i, is determined by dividing the population
size N by the sample size n and rounding to the nearest integer.
 When the ordering of the elements is related to the characteristic
of interest, systematic sampling increases the representativeness of
the sample.
 If the ordering of the elements produces a cyclical pattern,
systematic sampling may decrease the representativeness of the
sample.
For example, there are 100,000 elements in the population and a
sample of 1,000 is desired. In this case the sampling interval, i, is
100. A random number between 1 and 100 is selected. If, for
example, this number is 23, the sample consists of elements 23,
123, 223, 323, 423, 523, and so on.
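A minimal Python sketch of this worked example (illustration only; a spreadsheet works just as well):

# Systematic sampling: N = 100,000 elements, n = 1,000 desired, so the interval i = 100.
import random

N, n = 100_000, 1_000
i = N // n                              # sampling interval
start = random.randint(1, i)            # random starting point, e.g. 23
sample = list(range(start, N + 1, i))   # elements start, start+i, start+2i, ...

print(start, sample[:5], len(sample))   # e.g. 23 [23, 123, 223, 323, 423] 1000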
Stratified Sampling
 A two-step process in which the population is
partitioned into subpopulations, or strata.
 The strata should be mutually exclusive and
collectively exhaustive in that every population
element should be assigned to one and only one
stratum and no population elements should be
omitted.
 Next, elements are selected from each stratum by a
random procedure, usually SRS.
 A major objective of stratified sampling is to increase
precision without increasing cost.
Stratified Sampling
 The elements within a stratum should be as homogeneous as
possible, but the elements in different strata should be as
heterogeneous as possible.
 The stratification variables should also be closely related to the
characteristic of interest.
 Finally, the variables should decrease the cost of the stratification
process by being easy to measure and apply.
 In proportionate stratified sampling, the size of the sample drawn
from each stratum is proportionate to the relative size of that
stratum in the total population.
 In disproportionate stratified sampling, the size of the sample from
each stratum is proportionate to the relative size of that stratum
and to the standard deviation of the distribution of the
characteristic of interest among all the elements in that stratum.
Cluster Sampling
The target population is first divided into mutually exclusive and
collectively exhaustive subpopulations, or clusters.
Then a random sample of clusters is selected, based on a
probability sampling technique such as SRS.
For each selected cluster, either all the elements are included in
the sample (one-stage) or a sample of elements is drawn
probabilistically (two-stage).
Elements within a cluster should be as heterogeneous as possible,
but clusters themselves should be as homogeneous as possible.
Ideally, each cluster should be a small-scale representation of the
population.
In probability proportionate to size sampling, the clusters
are sampled with probability proportional to size. In the second
stage, the probability of selecting a sampling unit in a selected
cluster varies inversely with the size of the cluster.
Types of Cluster Sampling
Cluster Sampling:
 One-Stage Sampling
 Two-Stage Sampling: Simple Cluster Sampling, Probability Proportionate to Size Sampling
 Multistage Sampling
Strengths and Weaknesses of Basic Sampling Techniques
Nonprobability sampling:
 Convenience sampling. Strengths: least expensive, least time-consuming, most convenient. Weaknesses: selection bias, sample not representative, not recommended for descriptive or causal research.
 Judgmental sampling. Strengths: low cost, convenient, not time-consuming. Weaknesses: does not allow generalization, subjective.
 Quota sampling. Strengths: sample can be controlled for certain characteristics. Weaknesses: selection bias, no assurance of representativeness.
 Snowball sampling. Strengths: can estimate rare characteristics. Weaknesses: time-consuming.
Probability sampling:
 Simple random sampling (SRS). Strengths: easily understood, results projectable. Weaknesses: difficult to construct sampling frame, expensive, lower precision, no assurance of representativeness.
 Systematic sampling. Strengths: can increase representativeness, easier to implement than SRS, sampling frame not necessary. Weaknesses: can decrease representativeness.
 Stratified sampling. Strengths: includes all important subpopulations, precision. Weaknesses: difficult to select relevant stratification variables, not feasible to stratify on many variables, expensive.
 Cluster sampling. Strengths: easy to implement, cost-effective. Weaknesses: imprecise, difficult to compute and interpret results.
Procedures for Drawing Probability Samples: Simple Random Sampling
1. Select a suitable sampling frame.
2. Each element is assigned a number from 1 to N (population size).
3. Generate n (sample size) different random numbers between 1 and N.
4. The numbers generated denote the elements that should be included in the sample.
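A minimal Python sketch of these four steps, using hypothetical sizes N = 500 and n = 20:

# Simple random sampling from a numbered frame.
import random

N, n = 500, 20                      # population size and desired sample size
frame = list(range(1, N + 1))       # step 2: elements numbered 1 to N
sample = random.sample(frame, n)    # steps 3-4: n distinct random numbers pick the elements

print(sorted(sample))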
Procedures for Drawing Probability Samples: Systematic Sampling
1. Select a suitable sampling frame.
2. Each element is assigned a number from 1 to N (population size).
3. Determine the sampling interval i, where i = N/n. If i is a fraction, round it to the nearest integer.
4. Select a random number, r, between 1 and i, as explained in simple random sampling.
5. The elements with the following numbers will comprise the systematic random sample: r, r+i, r+2i, r+3i, r+4i, ..., r+(n–1)i.
Procedures for Drawing Probability Samples: Stratified Sampling
1. Select a suitable frame.
2. Select the stratification variable(s) and the number of strata, H.
3. Divide the entire population into H strata. Based on the classification variable, each element of the population is assigned to one of the H strata.
4. In each stratum, number the elements from 1 to Nh (the population size of stratum h).
5. Determine the sample size of each stratum, nh, based on proportionate or disproportionate stratified sampling, where the nh sum to n: Σ(h = 1 to H) nh = n.
6. In each stratum, select a simple random sample of size nh.
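A minimal Python sketch of proportionate allocation followed by SRS within each stratum (the stratum sizes are hypothetical; with rounding, the nh may need a small adjustment so they sum exactly to n):

# Proportionate stratified sampling: nh is proportional to the stratum size Nh.
import random

stratum_sizes = {"stratum_A": 600, "stratum_B": 300, "stratum_C": 100}   # Nh
n = 50                                                                   # total sample size
N = sum(stratum_sizes.values())

allocation = {h: round(n * Nh / N) for h, Nh in stratum_sizes.items()}   # nh for each stratum
samples = {h: random.sample(range(1, Nh + 1), allocation[h])             # SRS of size nh in stratum h
           for h, Nh in stratum_sizes.items()}

print(allocation)   # {'stratum_A': 30, 'stratum_B': 15, 'stratum_C': 5}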


Procedures for Drawing Probability Samples: Cluster Sampling
1. Assign a number from 1 to N to each element in the population.
2. Divide the population into C clusters, of which c will be included in the sample.
3. Calculate the sampling interval i, where i = N/c (round to the nearest integer).
4. Select a random number r between 1 and i, as explained in simple random sampling.
5. Identify elements with the following numbers: r, r+i, r+2i, ..., r+(c–1)i.
6. Select the clusters that contain the identified elements.
7. Select sampling units within each selected cluster based on SRS or systematic sampling.
8. Remove clusters exceeding the sampling interval i. Calculate the new population size N*, the number of clusters to be selected C* = C – 1, and the new sampling interval i*.
Procedures for Drawing Probability Samples: Cluster Sampling (continued)
Repeat the process until each of the remaining clusters has a population less than the sampling interval. If b clusters have been selected with certainty, select the remaining c – b clusters according to steps 1 through 7. The fraction of units to be sampled with certainty is the overall sampling fraction = n/N. Thus, for the clusters selected with certainty, we would select ns = (n/N)(N1 + N2 + ... + Nb) units. The units selected from the clusters selected under PPS sampling will therefore be n* = n – ns.
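A minimal Python sketch of the simpler one-stage case (hypothetical clusters): clusters are selected by SRS and every element of each selected cluster enters the sample:

# One-stage cluster sampling.
import random

clusters = {                                   # cluster label -> element IDs
    "barangay_1": [1, 2, 3, 4],
    "barangay_2": [5, 6, 7],
    "barangay_3": [8, 9, 10, 11, 12],
    "barangay_4": [13, 14, 15],
}
c = 2                                          # number of clusters to select
chosen = random.sample(list(clusters), c)      # SRS of clusters
sample = [e for name in chosen for e in clusters[name]]   # take all elements of each chosen cluster

print(chosen, sample)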
Choosing Non-probability vs. Probability Sampling
Conditions favoring the use of each technique:
 Nature of research: exploratory (nonprobability sampling); conclusive (probability sampling)
 Relative magnitude of sampling and nonsampling errors: nonsampling errors are larger (nonprobability sampling); sampling errors are larger (probability sampling)
 Variability in the population: homogeneous, low (nonprobability sampling); heterogeneous, high (probability sampling)
 Statistical considerations: unfavorable (nonprobability sampling); favorable (probability sampling)
 Operational considerations: favorable (nonprobability sampling); unfavorable (probability sampling)