
Analysis of Economic Data (20606)

Chapters 4 and 5

Bivariate analysis of variables


Correlation and Contingency Table

We now try to determine whether there exists a relation between two variables.


Given any two variables A and B we can build a contingency table, where n_ij is the number of observations that present characteristic i of variable A and characteristic j of variable B.

Thus, a contingency table is a double-entry table in which each cell shows the number of cases or individuals that possess one level of one of the characteristics analyzed and a given level of the other characteristic.
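Schematically, for generic categories a_1, …, a_I of A and b_1, …, b_J of B (an illustrative layout only, not a table from the course material):

            B = b_1   B = b_2   ...   B = b_J
A = a_1     n_11      n_12      ...   n_1J
A = a_2     n_21      n_22      ...   n_2J
...         ...       ...       ...   ...
A = a_I     n_I1      n_I2      ...   n_IJ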
Marginal distributions
When analyzing a two-dimensional distribution, one can focus the study on the
behavior of one variable, regardless of how the other behaves. We would then
calculate the marginal distributions:

Defining:

n_{i\cdot} = \sum_{j=1}^{J} n_{ij} \qquad n_{\cdot j} = \sum_{i=1}^{I} n_{ij}
are the marginal absolute frequencies of the variables A and B, respectively.

f_{i\cdot} = \sum_{j=1}^{J} \frac{n_{ij}}{n} \qquad f_{\cdot j} = \sum_{i=1}^{I} \frac{n_{ij}}{n}
are the marginal relative frequencies of the variables A and B, respectively.
Using these marginal distributions we can construct the following from the contingency table (a small computational sketch follows the list):
a) Marginal distributions

b) Distributions of relative frequencies


c) Row profiles

d) Column profiles
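As a rough illustration of how these four tables can be obtained in practice (a sketch only; the data frame `df` and its columns `A` and `B` are hypothetical, not data from the course):

```python
import pandas as pd

# Hypothetical categorical data for two variables A and B
df = pd.DataFrame({
    "A": ["a1", "a1", "a2", "a2", "a2", "a3"],
    "B": ["b1", "b2", "b1", "b1", "b2", "b2"],
})

# a) Contingency table with marginal absolute frequencies (row/column totals)
table = pd.crosstab(df["A"], df["B"], margins=True, margins_name="Total")

# b) Relative frequencies n_ij / n
rel = pd.crosstab(df["A"], df["B"], normalize="all")

# c) Row profiles: each row sums to 1
row_profiles = pd.crosstab(df["A"], df["B"], normalize="index")

# d) Column profiles: each column sums to 1
col_profiles = pd.crosstab(df["A"], df["B"], normalize="columns")

print(table, rel, row_profiles, col_profiles, sep="\n\n")
```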
Example: the Balearic Islands as a second home. In order to see the evolution and
structure of tourist expenditure, the Balearic Government conducted an annual
survey on tourist expenditure in the Balearic Islands. Among the information
published for 1990 is the desire of the tourists to select the Balearics as a possible
second residence. It is considered that this desire may be a function of the zone
where they stay, i.e. the answers to the question "would you choose the Balearic
Islands as a second home?" have been crossed with the place of stay. Possible
answers to the question are: (i) no, (ii) yes, in the coming years, (iii) yes, when I
retire, (iv) does not know. Places of stay were classified into the following
areas: (1) Palma, (2) Costa de Ponent, (3) Costa de Tramuntana, (4) bay of
Pollença, (5) Badia d'Alcudia, (6) Costa Llevant, (7) Platja de Palma-El Arenal, (8)
Minorca, (9) Eivissa-Formentera.
Contingency table, row profiles and column profiles for this example (tables not reproduced here).
Conditional distribution: the distribution of one variable when the other meets a specific condition, e.g. the frequencies of X when Y takes a specific value:

x_i       n_i (frequency when Y = specific value)
x_1       n_1
x_2       n_2
…         …
x_{n-1}   n_{n-1}
x_n       n_n

X: spending on school supplies; Y: number of children.

Conditional distribution: for example, spending on school supplies when the number of children is < 3. The condition could also simply be Y = a single value; in that case one just takes the corresponding column of the contingency table directly.

Spending (X)   Frequency when Y < 3
0              5
50             8   (sum of the frequencies with Y = 0, 1, 2 that have an expenditure of 50)
100            5
150            8
200            4
Example of a table of joint relative frequencies n_ij/N, with the marginal relative frequencies n_i./N (rows) and n_.j/N (columns). X = number of minor breakdowns ("averías leves"), Y = number of serious breakdowns ("averías graves"):

                     Y (serious)
X (minor)     0        1        2        3        Marginal of X (n_i./N)
0             0.2308   0.0385   0.0077   0.0000   0.2769
1             0.1692   0.0615   0.0231   0.0077   0.2615
2             0.0769   0.0385   0.0154   0.0154   0.1462
3             0.0923   0.0615   0.0077   0.0154   0.1769
4             0.0615   0.0308   0.0000   0.0077   0.1000
5             0.0308   0.0077   0.0000   0.0000   0.0385
Marginal of Y
(n_.j/N)      0.6615   0.2385   0.0538   0.0462   1

If, for all i and j,

\frac{n_{i\cdot}}{N} \cdot \frac{n_{\cdot j}}{N} = \frac{n_{ij}}{N}

then the variables are independent.
• If we have a population we can use the
definition in a strict sense, that is, if the equality
is not fulfilled, we have dependence, but keep in
mind that the dependence can be very weak,
even irrelevant. (This analysis is found in this
chapter).
• If we have a sample, we have a hypothesis of
independence. That is, it is possible that the
equality is not fulfilled, but we still cannot reject
the hypothesis of independence. (This analysis
is introduced in Chapter 7.)
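A minimal sketch of the population check described above, using the joint relative frequencies of the breakdowns table (treated here as a complete population):

```python
import numpy as np

# Joint relative frequencies n_ij / N from the breakdowns table
# (rows: X = minor breakdowns 0..5, columns: Y = serious breakdowns 0..3)
F = np.array([
    [0.2308, 0.0385, 0.0077, 0.0000],
    [0.1692, 0.0615, 0.0231, 0.0077],
    [0.0769, 0.0385, 0.0154, 0.0154],
    [0.0923, 0.0615, 0.0077, 0.0154],
    [0.0615, 0.0308, 0.0000, 0.0077],
    [0.0308, 0.0077, 0.0000, 0.0000],
])

f_row = F.sum(axis=1)                # n_i. / N
f_col = F.sum(axis=0)                # n_.j / N
expected = np.outer(f_row, f_col)    # (n_i./N)(n_.j/N) under independence

# Strict independence requires F == expected in every cell
print(np.abs(F - expected).max())    # > 0 here, so the equality is not fulfilled
```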
Coefficients of association

Contingency tables are useful tools for trying to determine whether there is some degree of association or dependence between two variables.

In the example above: is the response about considering the Balearic Islands as a second residence influenced by the area in which the individual spent the holidays?

A summary measure of the degree of association is calculated from the


comparison between observed values and values that one could expect in the
case of non-association. If, in the example above, we do not expect any
relationship, the relative frequency distribution in terms of column profiles would
be the one shown in the following table:
1) Chi-Square Coefficient of association (χ2):

n_{ij} = observed frequency, \qquad e_{ij} = \frac{n_{i\cdot}\, n_{\cdot j}}{n} = expected frequency

\chi^2 = \sum_{i=1}^{I} \sum_{j=1}^{J} \frac{(n_{ij} - e_{ij})^2}{e_{ij}}
If χ² ≈ 0, the association (relation) is nonexistent.
Problem: χ² has no upper limit and, accordingly, it cannot by itself indicate the degree of association.
As a solution:

2) Coefficient "C" of contingency by Karl Pearson:

C = \sqrt{\frac{\chi^2}{\chi^2 + n}}, \qquad \text{limit}_{\max} = \sqrt{1 - \frac{1}{\min(I, J)}} \qquad (0 \le C \le 1)
If C ≈0 nonexistent association
If C ≈1 perfect association between the variables

Note: C can only get close to 1 if the contingency table is very large. You must
compare C with its maximum limit to interpret the degree (i.e., strength) of the association:
the closer C is to the maximum limit, the stronger the association between the variables.
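A hedged sketch of both coefficients for a generic table of observed counts (the numbers below are illustrative only, not the Balearic survey data):

```python
import numpy as np

# Illustrative observed frequencies n_ij
n_obs = np.array([
    [30, 10, 5],
    [20, 25, 10],
])
n = n_obs.sum()

# Expected frequencies under no association: e_ij = n_i. * n_.j / n
e = np.outer(n_obs.sum(axis=1), n_obs.sum(axis=0)) / n

# Chi-square coefficient of association
chi2 = ((n_obs - e) ** 2 / e).sum()

# Pearson's contingency coefficient C and its maximum attainable limit
C = np.sqrt(chi2 / (chi2 + n))
C_max = np.sqrt(1 - 1 / min(n_obs.shape))

print(chi2, C, C_max)   # compare C with C_max to judge the strength of association
```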
Important:

1) We use the Chi-square coefficient to answer the question:


Is there an association or not?
[If we have a sample, we can only answer if we reject a hypothesis of independence or
not]

2) We use the coefficient C of contingency to answer the question:


In the data that we analyze, do the variables have a strong or weak relation?

3) Even if we study two ordinal variables, neither of these measures shows the
sign of the relation.

If we are interested in the sign of the association we can use other


measures, such as Gamma (γ), tau-b and tau-c, but these are not included
in this course. In an exercise we show how to use row or column profiles
to evaluate the sign (or the implication) of the relation.

(Note: it does not make sense to talk about the sign of the
association if at least one of the variables is nominal!)
Bivariate analysis for quantitative variables.
Concept of dependence or linear association

We say that a relation of exact linear dependence between X and Y exists when
there are constants a and b such that:

Y_i = a + bX_i,   i = 1, …, n

b > 0 → positive linear dependence;   b < 0 → negative linear dependence


Concept of linear dependence

NOTE: we could have perfect relations between X and Y, but we are
considering only the linear type. A quadratic relation, for instance, is a
perfect relation that is not linear.
Concept of linear dependence

We use the simple linear correlation coefficient to determine the degree of


linear relation between 2 variables. Examples:

Positive non-exact linear dependence; absence of linear dependence.


Covariance
Covariance between X and Y:

\sigma_{xy} = \frac{\sum_{i=1}^{n} (x_i - \mu_x)(y_i - \mu_y)\, n_i}{N}
\qquad
s_{xy} = \frac{n}{n-1}\left(\frac{\sum_{i=1}^{n} x_i y_i\, n_i}{n} - \bar{x}\,\bar{y}\right)

The covariance indicates whether there is a linear association between X and Y,
and whether it is positive or negative, but not its degree (strength).
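A minimal numerical sketch (made-up paired observations, each with frequency n_i = 1, so the grouped-data formulas reduce to simple means and sums):

```python
import numpy as np

# Made-up paired observations
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Population covariance: mean of the products of the deviations from the means
sigma_xy = np.mean((x - x.mean()) * (y - y.mean()))

# Sample (corrected) covariance, dividing by n - 1
s_xy = ((x - x.mean()) * (y - y.mean())).sum() / (len(x) - 1)

print(sigma_xy, s_xy, np.cov(x, y, ddof=1)[0, 1])  # the last value matches s_xy
```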
Linear correlation coefficient
The value of the covariance depends on the values of the variables and
therefore on the units in which they are measured. We use the
coefficient of linear correlation (r_xy) in order to remove the
units and obtain a dimensionless measure:

r_{xy} = \frac{\sigma_{xy}}{\sigma_x \sigma_y} = \frac{S_{xy}}{S_x S_y}
The measure is invariant to linear transformations (change of origin and scale) of the
variables. (Exception: If the transformation(s) include(s) a change in the sign of one (but
not both!) of the variables. In that case r changes sign, while the magnitude remains the
same).
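A quick numerical check of this invariance (made-up data; np.corrcoef is used here simply as a convenient way to compute r):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2 * x + rng.normal(size=100)

def corr(a, b):
    # Linear correlation coefficient r between two 1-D arrays
    return np.corrcoef(a, b)[0, 1]

print(corr(x, y))                     # original r
print(corr(3 * x + 5, 0.5 * y - 2))   # same r: change of origin and scale
print(corr(-x, y))                    # same magnitude, opposite sign
```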
Properties:
• It is a dimensionless coefficient: −1 ≤ r ≤ 1
• If there is a positive linear relation, r > 0 and near 1
• If there is a negative linear relation, r < 0 and near −1
• The closer r is to −1 or 1, the stronger the degree of linear association between the variables
• If there is no linear relation, r approaches 0 (IMPORTANT!)
• If X and Y are independent, S_xy = 0 and accordingly r = 0
Important:

If two variables are independent, their covariance is zero. We cannot assert the converse: if two variables have zero
covariance, this does not mean that they are independent. There is no linear relation between them, but they may still be dependent.
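A minimal illustration of this asymmetry: with Y = X² and X symmetric around zero, the covariance (and hence r) is exactly zero even though Y is completely determined by X:

```python
import numpy as np

x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y = x ** 2                      # Y depends perfectly (but nonlinearly) on X

s_xy = np.mean((x - x.mean()) * (y - y.mean()))
print(s_xy)                     # 0.0: no linear relation, yet X and Y are dependent
```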
Correlation Matrix (R)

If we have k variables we can calculate the correlation coefficients for each pair
of the variables and present them in a matrix of correlations:

Properties:

• the main diagonal is always equal to one: r_xx = 1
• it is symmetric: r_xy = r_yx

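A short sketch with pandas (made-up columns): `DataFrame.corr()` returns exactly such a matrix, with ones on the diagonal and r_xy = r_yx:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
x = rng.normal(size=50)
df = pd.DataFrame({
    "x": x,
    "y": 2 * x + rng.normal(size=50),   # positively related to x
    "z": rng.normal(size=50),           # unrelated noise
})

R = df.corr()   # k x k correlation matrix: symmetric, ones on the main diagonal
print(R)
```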