CORRELATION ANALYSIS

Correlation analysis is a statistical measure which shows the relationship between two or more variables moving in the same or in opposite directions.

Types of correlation

• Simple, multiple & partial
• Positive & negative
• Linear & non-linear
Methods of correlation
• Scatter diagram
• Karl Pearson’s coefficient of correlation
• Rank correlation

Scatter diagram

• Perfectly positive correlation
• Perfectly negative correlation
• Zero correlation

(Scatter plots illustrating each case are omitted here.)
Karl Pearson correlation coefficient

r = Σxy / √(Σx² · Σy²)

where

x = X − X̄ and y = Y − Ȳ
Problem

From the following data, find the coefficient of correlation by Karl Pearson’s method:

X: 6 2 10 4 8
Y: 9 11 5 8 7

Sol.

X    Y    x = X−6   y = Y−8   x²   y²   x·y
6    9     0         1         0    1     0
2    11   −4         3        16    9   −12
10   5     4        −3        16    9   −12
8    8     2         0         4    0     0
4    7    −2        −1         4    1     2
30   40    0         0        40   20   −22
Sol. cont.

X̄ = ΣX/N = 30/5 = 6
Ȳ = ΣY/N = 40/5 = 8

r = Σxy / √(Σx² · Σy²) = −22 / √(40 × 20) = −22 / √800 ≈ −0.78
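The arithmetic can be checked with a short Python sketch (using numpy; the snippet simply re-implements the formula above):

```python
import numpy as np

X = np.array([6, 2, 10, 4, 8])
Y = np.array([9, 11, 5, 8, 7])

# Deviations from the means (X-bar = 6, Y-bar = 8)
x = X - X.mean()
y = Y - Y.mean()

# r = sum(xy) / sqrt(sum(x^2) * sum(y^2))
r = (x * y).sum() / np.sqrt((x ** 2).sum() * (y ** 2).sum())
print(round(r, 2))                        # -0.78
print(round(np.corrcoef(X, Y)[0, 1], 2))  # same value from numpy's built-in
```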
Spearman’s rank correlation

R = 1 − 6ΣD² / (N(N² − 1))

where

D = Rx − Ry
Rx = rank of X, Ry = rank of Y
Problem

Calculate Spearman’s rank correlation coefficient between advertising cost and sales from the following data:

Advt. cost: 39 65 62 90 82 75 25 98 36 78
Sales (lakhs): 47 53 58 86 62 68 60 91 51 84

Sol.

X    Y    Rx   Ry   D    D²
39   47    8   10   −2    4
65   53    6    8   −2    4
62   58    7    7    0    0
90   86    2    2    0    0
82   62    3    5   −2    4
75   68    5    4    1    1
25   60   10    6    4   16
98   91    1    1    0    0
36   51    9    9    0    0
78   84    4    3    1    1
                   ΣD² = 30
Sol. cont.

R = 1 − 6ΣD² / (N(N² − 1))
  = 1 − (6 × 30) / (10 × (10² − 1))
  = 1 − 180/990
  ≈ 0.82
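As a cross-check, scipy computes the same coefficient directly (a minimal sketch):

```python
from scipy.stats import spearmanr

advt_cost = [39, 65, 62, 90, 82, 75, 25, 98, 36, 78]
sales     = [47, 53, 58, 86, 62, 68, 60, 91, 51, 84]

# spearmanr ranks both series internally and applies the same formula
rho, p_value = spearmanr(advt_cost, sales)
print(round(rho, 2))  # 0.82
```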
Chi square test (χ²)
• It measures the difference between what is observed and what is expected according to a null hypothesis.

• Applications of the Chi-square Test:


1) Goodness of fit- How close are sample results to the
expected results?

Example: In tossing a coin, you expect half heads and


half tails. You tossed a coin 100 times. You expected
50 heads and 50 tails. However, you obtained 48
heads and 52 tails. Are 48 heads and 52 tails close
enough to call the coin fair?
• 2) Test for Homogeneity: This test can also be used to test whether the occurrence of events is uniform or not.
Example: whether the admission of patients to a government hospital is uniform across all days of the week can be tested with the help of the chi-square test.
The null hypothesis is that there is uniformity in the occurrence of the events (uniformity in the admission of patients throughout the week).
• 3) Test of Independence/Association: This test enables us to explain whether or not two attributes are associated.
• For instance, if we are interested in knowing whether a new medicine is effective in controlling fever, the χ² test is useful.
• In such a situation, we proceed with the null
hypothesis that the two attributes (viz., new medicine
and control of fever) are independent which means
that new medicine is not effective in controlling fever.
Goodness of Fit

χ² = Σ (O − E)² / E

• where
• O = observed count in each category
• E = expected count in each category based on the experimenter’s hypothesis
Goodness of Fit
• Department store, A, has four competitors: B,C,D, and
E. Store A hires a consultant to determine if the
percentage of shoppers who prefer each of the five
stores is the same. A survey of 1100 randomly
selected shoppers is conducted, and the results about
which one of the stores shoppers prefer are below. Is
there enough evidence using a significance level α =
0.05 to conclude that the proportions are really the
same?
Store                A    B    C    D    E
Number of shoppers   262  234  204  190  210
(i) The null hypothesis H0:the population frequencies are equal
to the expected frequencies (to be calculated below).
(ii) The alternative hypothesis, Ha: the null hypothesis is false.
(iii) α = 0.05
(iv) The degrees of freedom: k − 1 = 5 − 1 = 4.
(v) The test statistic can be calculated using a table:
Preference   % of Shoppers   E                  O     O−E    (O−E)²   (O−E)²/E
A            20%             0.2 × 1100 = 220   262    42    1764     8.018
B            20%             0.2 × 1100 = 220   234    14     196     0.891
C            20%             0.2 × 1100 = 220   204   −16     256     1.164
D            20%             0.2 × 1100 = 220   190   −30     900     4.091
E            20%             0.2 × 1100 = 220   210   −10     100     0.455

• χ² = Σ (O − E)² / E = 14.618

• From α = 0.05 and k − 1 = 4 degrees of freedom, the critical value is 9.488.
• Is there enough evidence to reject H0? Since χ² ≈ 14.618 > 9.488, there is enough statistical evidence to reject the null hypothesis and to conclude that customers do not prefer each of the five stores equally.
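The same test can be run in a few lines of Python (a sketch using scipy):

```python
from scipy.stats import chisquare, chi2

observed = [262, 234, 204, 190, 210]
expected = [220] * 5  # equal preference: 0.2 * 1100 per store

stat, p_value = chisquare(observed, f_exp=expected)
critical = chi2.ppf(0.95, df=4)  # critical value at alpha = 0.05

print(round(stat, 3))      # 14.618
print(round(critical, 3))  # 9.488
print(stat > critical)     # True -> reject H0
```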
Test of Independence/ Association-
• The procedure for the hypothesis test is essentially the same.
The differences are that: (i) H0 is that the two variables are
independent.
• (ii) Ha is that the two variables are not independent (they are
dependent).
• (iii) The expected frequency Er,c for the entry in row r, column c is calculated using:
• Er,c = (sum of row r × sum of column c) / sample size
• The degrees of freedom:
(number of rows - 1)×(number of columns - 1)
• The side effects of a new drug are being tested
against a placebo. A simple random sample of 565
patients yields the results below. At a significance
level of α = 0.05, is there enough evidence to
conclude that the treatment is independent of the
side effect of nausea?
Result Drug Placebo Total
Nausea 36 13 49
No nausea 254 262 516
Total 290 275 565
(i) The null hypothesis H0: the treatment and
response are independent.
(ii) The alternative hypothesis, Ha: the treatment
and response are dependent.
(iii) α = 0.05.
(iv) The degrees of freedom:
(number of rows - 1)×(number of columns - 1) =
(2 − 1) × (2 − 1) = 1 × 1 = 1. (v) The test statistic
can be calculated using a table:
Row, Column   E                      O     O−E      (O−E)²   (O−E)²/E
1,1           49·290/565 = 25.15     36     10.85   117.72   4.681
1,2           49·275/565 = 23.85     13    −10.85   117.72   4.936
2,1           516·290/565 = 264.85   254   −10.85   117.72   0.444
2,2           516·275/565 = 251.15   262    10.85   117.72   0.469
• χ² = Σ (O − E)² / E = 10.53
• Since χ² ≈ 10.53 exceeds the critical value 3.841 (α = 0.05, 1 degree of freedom), H0 is rejected: the treatment and the side effect of nausea are not independent.
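A sketch of the same test with scipy; correction=False reproduces the plain Pearson statistic used above (scipy's default applies Yates' continuity correction for 2×2 tables):

```python
from scipy.stats import chi2_contingency

table = [[36, 13],    # nausea:    drug, placebo
         [254, 262]]  # no nausea: drug, placebo

stat, p_value, dof, expected = chi2_contingency(table, correction=False)

print(round(stat, 2))  # 10.53
print(dof)             # 1
print(expected)        # approx [[25.15, 23.85], [264.85, 251.15]]
```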
Linear Regression Analysis
• Simple linear regression is a statistical method that allows us to
summarize and study relationships between two continuous
(quantitative) variables:
• One variable, denoted x, is regarded as the predictor, explanatory, or independent variable.
• The other variable, denoted y, is regarded as the response, outcome, or dependent variable.
• In simple linear regression a single independent variable is used
to predict the value of a dependent variable. In multiple linear
regression two or more independent variables are used to predict the
value of a dependent variable
Example: How strong is the linear relationship between the number of stories a building has and its height?
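As an illustrative sketch (the stories/height figures below are hypothetical), a simple linear regression can be fitted by least squares with numpy:

```python
import numpy as np

# Hypothetical data: number of stories vs building height (metres)
stories = np.array([10, 20, 30, 40, 50, 60])
height  = np.array([45, 82, 120, 158, 195, 235])

# Fit height = a + b * stories by least squares
b, a = np.polyfit(stories, height, 1)
print(f"height = {a:.1f} + {b:.2f} * stories")

# Predict the height of a 35-storey building
print(round(a + b * 35, 1))
```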
• Difference between Correlation and Regression
• Both the techniques are directed towards a
common purpose of establishing the degree and
direction of relationship between two or more
variables but the methods of doing so are
different. The choice of one or the other will
depend on the purpose.
Two Regression Lines

When there is a reasonable amount of scatter, we can


draw two different regression lines depending upon which
variable we consider to be the most accurate. The first is a
line of regression of y on x, which can be used to estimate
y given x. The other is a line of regression of x on y, used
to estimate x given y.

• If there is a perfect correlation between the data (in


other words, if all the points lie on a straight line), then
the two regression lines will be the same.
• Example: regressing profit on sales and, vice versa, sales on profit; see the sketch below.
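A minimal sketch (with hypothetical sales/profit figures) showing that the two lines differ unless correlation is perfect:

```python
import numpy as np

sales  = np.array([10, 12, 15, 18, 20, 24])  # hypothetical data
profit = np.array([2, 3, 3, 5, 5, 7])

# Line of regression of y on x: estimate profit given sales
b_yx, a_yx = np.polyfit(sales, profit, 1)

# Line of regression of x on y: estimate sales given profit
b_xy, a_xy = np.polyfit(profit, sales, 1)

print(f"profit = {a_yx:.2f} + {b_yx:.2f} * sales")
print(f"sales  = {a_xy:.2f} + {b_xy:.2f} * profit")

# The product of the two slopes equals r^2, which is 1 only when
# all points lie on a straight line and the two lines coincide.
r = np.corrcoef(sales, profit)[0, 1]
print(round(b_yx * b_xy, 3), round(r ** 2, 3))
```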
i) Degree and Nature of Relationship : The correlation coefficient is a measure
of degree of covariability between two variables whereas regression analysis is
used to study the nature of relationship between the variables so that we can
predict the value of one on the basis of the other. The reliance on the estimates or predictions depends upon the closeness of the relationship between the variables.
(ii) Cause and Effect Relationship : The cause and effect relationship is
explained by regression analysis. Correlation is only a tool to ascertain the
degree of relationship between two variables and we can not say that one
variable is the cause and the other the effect. A high degree of correlation between price and demand for a commodity at a particular point of time may not suggest which is the cause and which is the effect. However, in regression analysis the cause and effect relationship is clearly expressed: one variable is taken as dependent and the other as independent.
• The variable which is the basis of prediction is called independent variable
and the variable that is to be predicted is called dependent variable. The
independent variable is represented by X and the dependent variable by Y.
Basis of Comparison: Correlation vs Regression

• Meaning: Correlation is a statistical measure which determines the co-relationship or association of two variables. Regression describes how an independent variable is numerically related to the dependent variable.
• Usage: Correlation is used to represent the linear relationship between two variables. Regression is used to fit a best line and estimate one variable on the basis of another variable.
• Dependent and independent variables: In correlation there is no difference between the variables. In regression the two variables are distinct (dependent and independent).
• Indicates: The correlation coefficient indicates the extent to which two variables move together. Regression indicates the impact of a unit change in the known variable (x) on the estimated variable (y).
Test Of Significance
• The test involves comparison of the observed
values with the hypothesis.
• The test establishes whether there is a relationship between the variables or whether pure chance could have produced the observed results.
Example: whether an increase in the sale of bags is due to quality or whether the sale has increased by chance.
• On the basis of sample size:
• Small sample (less than 30): t-test, F-test
• Large sample (more than 30): Chi-square, Z-test
t-Test
• A t-test is an analysis framework used to
determine the difference between two sample
means from two normally distributed
populations with unknown variances. Analysts
commonly use a t-test with two samples with
small sample sizes, testing the difference
between the samples when they do not know
the variances of two normal distributions.
Example-an analyst wants to study the amount that Pennsylvanians
and Californians spend on clothing per month. It would not be practical
to record the spending habits of every individual or family in both
states. Thus, a sample of spending habits is taken from a selected
group of individuals from each state. The group may be of any small to
moderate size — for this example, assume that the sample group is
200 individuals.
The average amount for Pennsylvanians comes out to $500, and the
average amount for Californians is $1,000. The t-test questions
whether the difference between the groups represents a true
difference between people in Pennsylvania and people in California or
if it is likely a meaningless statistical difference. In this example, if all
Pennsylvanians spent $500 per month on clothing and all Californians
spent $1,000 per month, it is highly unlikely that 200 randomly
selected individuals all spent that exact amount, respective to state.
Thus, if an analyst or statistician obtained the results listed in the example above, it is safe to conclude that the difference between the two samples reflects a real difference between the populations of the two states.
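A sketch of such a two-sample t-test with scipy (the spending figures below are simulated, not real survey data):

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)

# Simulated monthly clothing spend, 200 sampled individuals per state
pennsylvania = rng.normal(500, 120, 200)
california   = rng.normal(1000, 150, 200)

# Welch's t-test: does not assume the two variances are equal
t_stat, p_value = ttest_ind(pennsylvania, california, equal_var=False)
print(round(t_stat, 2), p_value)  # large |t|, tiny p -> real difference
```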
F-test
• The F-test is used to find out whether there is any difference between the variances of two samples. The F-statistic is the ratio of the variances of the two samples.
• E.g., suppose in a manufacturing plant there are two machines producing the same product, and the management wants to understand whether there is any variability among the products produced by these two machines. The researcher will take samples from both machines, find the variability, and test it against the null hypothesis (i.e., the prescribed limit).
• The F-statistic also forms the basis for ANOVA.
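A sketch of the variance-ratio F-test on simulated measurements from the two machines (the data are hypothetical):

```python
import numpy as np
from scipy.stats import f

rng = np.random.default_rng(1)

# Simulated part dimensions from two machines
machine_a = rng.normal(50, 1.0, 25)
machine_b = rng.normal(50, 1.5, 25)

# F is the ratio of the two sample variances (larger variance on top)
var_a = machine_a.var(ddof=1)
var_b = machine_b.var(ddof=1)
F = max(var_a, var_b) / min(var_a, var_b)

# Two-tailed p-value from the F distribution with (n1-1, n2-1) df
p_value = 2 * f.sf(F, machine_a.size - 1, machine_b.size - 1)
print(round(F, 2), round(p_value, 4))
```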
Z- test
Z test is a statistical procedure used to test an
alternative hypothesis against a null
hypothesis.
Z-test is any statistical hypothesis test used to determine whether two samples’ means are different when the variances are known and the sample is large (n ≥ 30).
It is a comparison of the means of two independent groups of samples, taken from one population with known variance.
• A principal at a school claims that the students
in his school are above average intelligence. A
random sample of thirty students’ IQ scores
have a mean score of 112.5. Is there sufficient
evidence to support the principal’s claim?
• The mean population IQ is 100 with a standard
deviation of 15.
• State the Null hypothesis. The accepted fact is
that the population mean is 100, so: H0: μ=100.
• State the Alternate Hypothesis. The claim is that
the students have above average IQ scores, so: H1:
μ > 100.
• Find Z using this formula: z = (x̄ − μ) / (σ / √n)
• For this set of data: Z = (112.5 − 100) / (15/√30) ≈ 4.56
• Since 4.56 exceeds the one-tailed critical value of 1.645 at α = 0.05, the null hypothesis is rejected and the principal’s claim is supported.
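The same computation in a few lines of Python (a sketch using scipy for the one-tailed p-value):

```python
from math import sqrt
from scipy.stats import norm

mu, sigma = 100, 15    # population mean and standard deviation
x_bar, n = 112.5, 30   # sample mean and sample size

z = (x_bar - mu) / (sigma / sqrt(n))
p_value = norm.sf(z)   # one-tailed: P(Z > z)

print(round(z, 2))     # 4.56
print(p_value < 0.05)  # True -> reject H0
```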
Non-parametric Test
• Nonparametric statistics uses data that is often ordinal, meaning it relies not on numbers but rather on a ranking or order of sorts. For example, a survey capturing consumer preferences ranging from like to dislike would be considered ordinal data.
• Nonparametric statistics makes no assumption about
the sample size or whether the observed data is
quantitative.
• This method is useful when the data has no clear
numerical interpretation, and is best to use with data
that has a ranking of sorts.
• For example, a personality assessment test may have a
ranking of its metrics set as strongly disagree,
disagree, indifferent, agree, and strongly agree. In
this case, nonparametric methods should be used.
Binomial- Sign test
Its name comes from the fact that it is based on
the direction or the plus or minus signs of
observations in a sample and not on their
numerical magnitudes.
• Find the + and − signs for the given distribution. Put a + sign for a value greater than the mean value, a − sign for a value smaller than the mean value, and a 0 for a value equal to the mean value.

• Denote the total no. of signs (ignoring zeros) by ‘n’ and the no. of
less frequent signs by ‘S’.

• Obtain the critical value (k) of the less frequent signs ‘S’, preferably at the 5% level of significance, by using the formula:
K = (n − 1)/2 − 0.98√n

• Compare the value of ‘S’ with the critical value (k). If the value of ‘S’ is greater than the value of (k), the null hypothesis is accepted; otherwise it is rejected.
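A minimal sketch of the procedure (the sample values and the hypothesized mean of 50 are made up):

```python
import math

data = [52, 48, 55, 50, 47, 53, 51, 49, 56, 45, 54, 50]
mean_value = 50  # hypothesized value

# Step 1: signs of deviations, dropping zeros
deviations = [x - mean_value for x in data if x != mean_value]
n = len(deviations)

# Step 2: S = number of less frequent signs
plus = sum(1 for d in deviations if d > 0)
S = min(plus, n - plus)

# Step 3: critical value at the 5% level, per the slide's formula
K = (n - 1) / 2 - 0.98 * math.sqrt(n)

# Step 4: S > K -> accept H0, otherwise reject
print(n, S, round(K, 2), "accept H0" if S > K else "reject H0")
```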
Run Test for randomness
• This test finds out whether the observations in a sample occur in a certain order or occur at random.
• H0: The sequence of observations is random.
• H1: The sequence of observations is not random.
• First, all the observations are arranged in the order in which they were collected.
• Then the median is calculated.
• Observations greater than the median are given a + sign, and those less than the median a − sign.
ANOVA
• Analysis of Variance (ANOVA) is a statistical method
used to test differences between two or more means.
• Key Assumptions-
1. Independence of case: There should not be any
pattern in the selection of the sample.
2. Normality: Distribution of each group should be
normal.
3. Homogeneity: Homogeneity means the variance between the groups should be the same.
• One way analysis: When we are comparing two or more groups based on one factor variable, it is said to be one way analysis of variance (ANOVA). For example, comparing whether or not the mean output of three workers is the same based on the working hours of the three workers.

• Two way analysis: When there are two factor variables, it is said to be two way analysis of variance (ANOVA). For example, based on working condition and working hours, we can compare whether or not the mean output of three workers is the same.
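A one-way ANOVA on hypothetical outputs for three workers can be sketched with scipy:

```python
from scipy.stats import f_oneway

# Hypothetical daily output of three workers
worker_1 = [55, 60, 58, 62, 59]
worker_2 = [52, 54, 51, 56, 53]
worker_3 = [61, 65, 63, 60, 64]

# One-way ANOVA: H0 is that the three mean outputs are equal
F, p_value = f_oneway(worker_1, worker_2, worker_3)
print(round(F, 2), round(p_value, 4))  # small p -> reject H0
```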
