
Analysis of Economic Data (20606)

Chapters 4 and 5

Bivariate analysis of variables


Correlation and Contingency Table

We now try to determine whether there exists a relation between two variables.


Given any two variables A and B we can build a contingency table, where n_ij is the number of observations that present characteristic i of variable A and characteristic j of variable B.

Thus, a contingency table is a double-entry table in which each cell shows the number of cases or individuals that possess one level of one of the characteristics analyzed and a given level of the other characteristic.
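Schematically, for generic categories a_1, …, a_I of A and b_1, …, b_J of B (an illustrative layout only, not a table from the course material):

            B = b_1   B = b_2   ...   B = b_J
A = a_1     n_11      n_12      ...   n_1J
A = a_2     n_21      n_22      ...   n_2J
...         ...       ...       ...   ...
A = a_I     n_I1      n_I2      ...   n_IJ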
Marginal distributions
When analyzing a two-dimensional distribution, one can focus the study on the
behavior of one variable, regardless of how the other behaves. We would then
calculate the marginal distributions:

Defining:

n_{i\cdot} = \sum_{j=1}^{J} n_{ij} \qquad n_{\cdot j} = \sum_{i=1}^{I} n_{ij}
are the marginal absolute frequencies of the variables A and B, respectively.

f_{i\cdot} = \sum_{j=1}^{J} \frac{n_{ij}}{n} \qquad f_{\cdot j} = \sum_{i=1}^{I} \frac{n_{ij}}{n}
are the marginal relative frequencies of the variables A and B, respectively.
Using these marginal distributions we can construct the following from the contingency table (a small computational sketch follows the list):
a) Marginal distributions

b) Distributions of relative frequencies


c) Row profiles

d) Column profiles
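As a rough illustration of how these four tables can be obtained in practice (a sketch only; the data frame `df` and its columns `A` and `B` are hypothetical, not data from the course):

```python
import pandas as pd

# Hypothetical categorical data for two variables A and B
df = pd.DataFrame({
    "A": ["a1", "a1", "a2", "a2", "a2", "a3"],
    "B": ["b1", "b2", "b1", "b1", "b2", "b2"],
})

# a) Contingency table with marginal absolute frequencies (row/column totals)
table = pd.crosstab(df["A"], df["B"], margins=True, margins_name="Total")

# b) Relative frequencies n_ij / n
rel = pd.crosstab(df["A"], df["B"], normalize="all")

# c) Row profiles: each row sums to 1
row_profiles = pd.crosstab(df["A"], df["B"], normalize="index")

# d) Column profiles: each column sums to 1
col_profiles = pd.crosstab(df["A"], df["B"], normalize="columns")

print(table, rel, row_profiles, col_profiles, sep="\n\n")
```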
Example: the Balearic Islands as a second home. In order to see the evolution and
structure of tourist expenditure, the Balearic Government conducted an annual
survey on tourist expenditure in the Balearic Islands. Among the information
published for 1990 is the desire of the tourists to select the Balearics as a possible
second residence. It is considered that this desire may be a function of the zone
where they stay, i.e. the answers to the question "would you choose the Balearic
Islands as a second home?" have been crossed with the place of stay. Possible
answers to the question are: (i) no, (ii) yes, in the coming years, (iii) yes, when I
retire, (iv) does not know. Places of stay were classified into the following
areas: (1) Palma, (2) Costa de Ponent, (3) Costa de Tramuntana, (4) bay of
Pollença, (5) Badia d'Alcudia, (6) Costa Llevant, (7) Platja de Palma-El Arenal, (8)
Minorca, (9) Eivissa-Formentera.
Contingency table, row profiles and column profiles for this example (tables not reproduced here).
Conditional distribution: the distribution of one variable when the other meets a specific condition, e.g. the frequencies of X when Y takes a specific value:

x_i       n_i (frequency when Y = specific value)
x_1       n_1
x_2       n_2
…         …
x_{n-1}   n_{n-1}
x_n       n_n

X: spending on school supplies; Y: number of children.

Conditional distribution: for example, spending on school supplies when the number of children is < 3. The condition could also simply be Y = a single value; in that case one just takes the corresponding column of the contingency table directly.

Spending (X)   Frequency when Y < 3
0              5
50             8   (sum of the frequencies with Y = 0, 1, 2 that have an expenditure of 50)
100            5
150            8
200            4
Example of a table of joint relative frequencies n_ij/N, with the marginal relative frequencies n_i./N (rows) and n_.j/N (columns). X = number of minor breakdowns ("averías leves"), Y = number of serious breakdowns ("averías graves"):

                     Y (serious)
X (minor)     0        1        2        3        Marginal of X (n_i./N)
0             0.2308   0.0385   0.0077   0.0000   0.2769
1             0.1692   0.0615   0.0231   0.0077   0.2615
2             0.0769   0.0385   0.0154   0.0154   0.1462
3             0.0923   0.0615   0.0077   0.0154   0.1769
4             0.0615   0.0308   0.0000   0.0077   0.1000
5             0.0308   0.0077   0.0000   0.0000   0.0385
Marginal of Y
(n_.j/N)      0.6615   0.2385   0.0538   0.0462   1

If, for all i and j,

\frac{n_{i\cdot}}{N} \cdot \frac{n_{\cdot j}}{N} = \frac{n_{ij}}{N}

then the variables are independent.
• If we have a population we can use the
definition in a strict sense, that is, if the equality
is not fulfilled, we have dependence, but keep in
mind that the dependence can be very weak,
even irrelevant. (This analysis is found in this
chapter).
• If we have a sample, we have a hypothesis of
independence. That is, it is possible that the
equality is not fulfilled, but we still cannot reject
the hypothesis of independence. (This analysis
is introduced in Chapter 7.)
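A minimal sketch of the population check described above, using the joint relative frequencies of the breakdowns table (treated here as a complete population):

```python
import numpy as np

# Joint relative frequencies n_ij / N from the breakdowns table
# (rows: X = minor breakdowns 0..5, columns: Y = serious breakdowns 0..3)
F = np.array([
    [0.2308, 0.0385, 0.0077, 0.0000],
    [0.1692, 0.0615, 0.0231, 0.0077],
    [0.0769, 0.0385, 0.0154, 0.0154],
    [0.0923, 0.0615, 0.0077, 0.0154],
    [0.0615, 0.0308, 0.0000, 0.0077],
    [0.0308, 0.0077, 0.0000, 0.0000],
])

f_row = F.sum(axis=1)                # n_i. / N
f_col = F.sum(axis=0)                # n_.j / N
expected = np.outer(f_row, f_col)    # (n_i./N)(n_.j/N) under independence

# Strict independence requires F == expected in every cell
print(np.abs(F - expected).max())    # > 0 here, so the equality is not fulfilled
```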
Coefficients of association

Contingency tables are useful tools for trying to determine whether there is some degree of association or dependence between two variables.

In the example above: is the response about considering the Balearic Islands as a second residence influenced by the area in which the individual spent the holidays?

A summary measure of the degree of association is calculated from the


comparison between observed values and values that one could expect in the
case of non-association. If, in the example above, we do not expect any
relationship, the relative frequency distribution in terms of column profiles would
be the one shown in the following table:
1) Chi-Square Coefficient of association (χ2):

n_{ij} = observed frequency, \qquad e_{ij} = \frac{n_{i\cdot}\, n_{\cdot j}}{n} = expected frequency

\chi^2 = \sum_{i=1}^{I} \sum_{j=1}^{J} \frac{(n_{ij} - e_{ij})^2}{e_{ij}}
If χ² ≈ 0, the association (relation) is nonexistent.
Problem: χ² has no upper limit and, accordingly, it cannot by itself indicate the degree of association.
As a solution:

2) Coefficient "C" of contingency by Karl Pearson:

C = \sqrt{\frac{\chi^2}{\chi^2 + n}}, \qquad \text{limit}_{\max} = \sqrt{1 - \frac{1}{\min(I, J)}} \qquad (0 \le C \le 1)
If C ≈0 nonexistent association
If C ≈1 perfect association between the variables

Note: C can only get close to 1 if the contingency table is very large. You must
compare C with its maximum limit to interpret the degree (i.e., strength) of the association:
the closer C is to the maximum limit, the stronger the association between the variables.
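A hedged sketch of both coefficients for a generic table of observed counts (the numbers below are illustrative only, not the Balearic survey data):

```python
import numpy as np

# Illustrative observed frequencies n_ij
n_obs = np.array([
    [30, 10, 5],
    [20, 25, 10],
])
n = n_obs.sum()

# Expected frequencies under no association: e_ij = n_i. * n_.j / n
e = np.outer(n_obs.sum(axis=1), n_obs.sum(axis=0)) / n

# Chi-square coefficient of association
chi2 = ((n_obs - e) ** 2 / e).sum()

# Pearson's contingency coefficient C and its maximum attainable limit
C = np.sqrt(chi2 / (chi2 + n))
C_max = np.sqrt(1 - 1 / min(n_obs.shape))

print(chi2, C, C_max)   # compare C with C_max to judge the strength of association
```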
Important:

1) We use the Chi-square coefficient to answer the question:


Is there an association or not?
[If we have a sample, we can only answer if we reject a hypothesis of independence or
not]

2) We use the coefficient C of contingency to answer the question:


In the data that we analyze, do the variables have a strong or weak relation?

3) Even if we study two ordinal variables, neither of these measures shows the
sign of the relation.

If we are interested in the sign of the association we can use other


measures, such as Gamma (γ), tau-b and tau-c, but these are not included
in this course. In an exercise we show how to use row or column profiles
to evaluate the sign (or the implication) of the relation.

(Note: it does not make sense to talk about the sign of the
association if at least one of the variables is nominal!)
Bivariate analysis for quantitative variables.
Concept of dependence or linear association

We say that a relation of exact linear dependence between X and Y exists when
there are constants a and b such that:

Y_i = a + bX_i,   i = 1, …, n

b > 0 → positive linear dependence;   b < 0 → negative linear dependence


Concept of linear dependence

NOTE: we could have perfect relations between X and Y, but we are
considering only the linear type. A quadratic relation, for instance, is a
perfect relation that is not linear.
Concept of linear dependence

We use the simple linear correlation coefficient to determine the degree of


linear relation between 2 variables. Examples:

Positive non-exact linear dependence; absence of linear dependence.


Covariance
Covariance between X and Y:

\sigma_{xy} = \frac{\sum_{i=1}^{n} (x_i - \mu_x)(y_i - \mu_y)\, n_i}{N}
\qquad
s_{xy} = \frac{n}{n-1}\left(\frac{\sum_{i=1}^{n} x_i y_i\, n_i}{n} - \bar{x}\,\bar{y}\right)

The covariance indicates whether there is a linear association between X and Y,
and whether it is positive or negative, but not its degree (strength).
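A minimal numerical sketch (made-up paired observations, each with frequency n_i = 1, so the grouped-data formulas reduce to simple means and sums):

```python
import numpy as np

# Made-up paired observations
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Population covariance: mean of the products of the deviations from the means
sigma_xy = np.mean((x - x.mean()) * (y - y.mean()))

# Sample (corrected) covariance, dividing by n - 1
s_xy = ((x - x.mean()) * (y - y.mean())).sum() / (len(x) - 1)

print(sigma_xy, s_xy, np.cov(x, y, ddof=1)[0, 1])  # the last value matches s_xy
```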
Linear correlation coefficient
The value of the covariance depends on the values of the variables and
therefore on the units in which they are measured. We use the
coefficient of linear correlation (r_xy) in order to remove the
units and obtain a dimensionless measure:

r_{xy} = \frac{\sigma_{xy}}{\sigma_x \sigma_y} = \frac{S_{xy}}{S_x S_y}
The measure is invariant to linear transformations (change of origin and scale) of the
variables. (Exception: If the transformation(s) include(s) a change in the sign of one (but
not both!) of the variables. In that case r changes sign, while the magnitude remains the
same).
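A quick numerical check of this invariance (made-up data; np.corrcoef is used here simply as a convenient way to compute r):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2 * x + rng.normal(size=100)

def corr(a, b):
    # Linear correlation coefficient r between two 1-D arrays
    return np.corrcoef(a, b)[0, 1]

print(corr(x, y))                     # original r
print(corr(3 * x + 5, 0.5 * y - 2))   # same r: change of origin and scale
print(corr(-x, y))                    # same magnitude, opposite sign
```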
Properties:
• It is a dimensionless coefficient: −1 ≤ r ≤ 1
• If there is a positive linear relation, r > 0 and near 1
• If there is a negative linear relation, r < 0 and near −1
• The closer r is to −1 or 1, the stronger the degree of linear association between the variables
• If there is no linear relation, r approaches 0 (IMPORTANT!)
• If X and Y are independent, S_xy = 0 and accordingly r = 0
Important:

If two variables are independent, their covariance is zero. We cannot assert the converse: if two variables have zero
covariance, this does not mean that they are independent. There is no linear relation between them, but they may still be dependent.
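A minimal illustration of this asymmetry: with Y = X² and X symmetric around zero, the covariance (and hence r) is exactly zero even though Y is completely determined by X:

```python
import numpy as np

x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y = x ** 2                      # Y depends perfectly (but nonlinearly) on X

s_xy = np.mean((x - x.mean()) * (y - y.mean()))
print(s_xy)                     # 0.0: no linear relation, yet X and Y are dependent
```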
Correlation Matrix (R)

If we have k variables we can calculate the correlation coefficients for each pair
of the variables and present them in a matrix of correlations:

Properties:

• the main diagonal is always equal to one: r_xx = 1
• it is symmetric: r_xy = r_yx

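A short sketch with pandas (made-up columns): `DataFrame.corr()` returns exactly such a matrix, with ones on the diagonal and r_xy = r_yx:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
x = rng.normal(size=50)
df = pd.DataFrame({
    "x": x,
    "y": 2 * x + rng.normal(size=50),   # positively related to x
    "z": rng.normal(size=50),           # unrelated noise
})

R = df.corr()   # k x k correlation matrix: symmetric, ones on the main diagonal
print(R)
```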