Lecture 3 Identification and Causal Diagrams

ET2013 Introduction to Econometrics

Giacomo Pasini

Ca’ Foscari University of Venice

Lecture overview

1 Identification

2 Causal Diagrams

3 Drawing Causal Diagrams

Reference: NHK (2022), Ch. 5, 6 & 7

1 Identification

Data Generating Process

Scientists believe that there are regular laws that govern the way the universe works.
These laws are an example of a data generating process (DGP).
We can see that if you let go of a ball, it drops to the ground. That’s our observation,
our data.
Gravity is a part of the data generating process for that ball. That’s the underlying law.
DGPs in the social sciences are generally not as well-behaved and precise as the
ones in the physical sciences.
Regardless, if we believe that observational data comes from at least somewhat
regular laws, we are saying there’s a DGP.
What is the DGP useful for?

We typically start with a research question and data.

The DGP, whether we know it or not, describes all the different processes working
behind the scenes and generating the data.
So, we want to use everything we know about the DGP in order to make sense of
the data and...
isolate the part of the variation we are interested in.
Too abstract? Let’s look at an example.
Weekly Sales of Avocados in California, Jan 2015-March 2018

1 What conclusion can you draw from the figure?

Try not to see anything that isn’t really there.
2 Think about the relationship between avocado
prices and quantities more broadly. What kinds
of research questions might we have about this
relationship?
3 Can you answer your research question from
Step 2 using the graph? Why or why not?
What conclusion can you draw from the figure?

There’s clearly a negative relationship. What this tells us is that avocado sales tend
to be lower in weeks where the price of avocados is high.
What we can see in the graph is the negative covariation or correlation between
price and quantity of avocados. THAT’S ALL!!!
We might be tempted to say something like an increase in the price of avocados
drives down sales. But that’s not actually on the graph!
Or we might want to say that an increase in prices makes people demand fewer
avocados, but that’s not on there either.
DGP: both supply and demand are part of what generates market prices and
quantities, and we’ve got no way of pulling just the demand parts out yet.
What research questions might we have about the relationship between price and quantity?

Perhaps we’re interested in how price-sensitive consumers are - what is the effect
of a price increase on the number of avocados people buy?
Maybe we should ask: what is the effect of the number of avocados brought to
market on the price that sellers choose to charge?
What is the effect of the price on the number of avocados brought to market?
What is the effect of the quantity sold one week on the number of avocados people
will want to buy the next week?
...and many more

Each of these questions would require us to dig out a different part of the variation.
Can we answer any of these questions by looking at the graph? NO!

The graph shows the covariation of price and quantity - how they move together or
apart.
But these variables move around for all sorts of reasons! Focus on two consecutive
data points:

1 Did a drop in price make people buy more?

2 Is it because the market was flooded with avocados
so people wouldn’t pay as much for them?
3 Is it because the high price in January made suppliers
bring way more avocados to market in February?
Identify the relevant variation I

Variables move around for all sorts of reasons. Those reasons would be reflected in
the DGP.
But when we have a research question in mind, we are usually interested in
only one of those reasons.
How can we find the variation in the data that answers our question?
Somewhere inside the data, our reason for variation is hiding. How can we get it
out?
Identify the relevant variation II

First: what is the variation that we want to find?

Research question: we want to figure out what the effect of the price is on how
many avocados people want to buy.
We want variation in people buying avocados (rather than people selling them) that
is driven by changes in the price (rather than, say, avocados becoming less popular).
Use what you know about the DGP

Demand and supply of avocados in California determine equilibrium prices and
quantities every week.
Let’s imagine that we know for a fact that at the beginning of each month,
avocado suppliers make a plan for what avocado prices will be each week in that
month, and never change their plans until the next month.
Then the “suppliers set prices” and “suppliers set quantities” explanations only matter
between months.
The variation in price and quantity from week to week in the same month will isolate
variation in people buying avocados and get rid of variation from people selling
them.
Further, because the price is set by the sellers, the variation in quantity we’re looking
at can only be driven by changes in the price.
Within-month variation

Let’s look at changes in price within each month.

For each of the months, there’s a negative
relationship, ignoring any differences between
the months.
So, given the data and what we know (assume)
about how sellers operate, an increase in price
does reduce how many avocados people want to
buy.
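
To make the within-month idea concrete, here is a minimal simulation sketch. It does not use the avocado data: the DGP, every coefficient and the names `simulate`, `demean_within` and `ols_slope` are assumptions made up for illustration. Each month has a popularity shock that raises both demand and the price level suppliers plan, while planned week-to-week price changes inside a month are unrelated to demand. The true effect of price on quantity demanded is set to -2.

```python
import random
from collections import defaultdict


def ols_slope(x, y):
    """Slope from a simple regression of y on x."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


def demean_within(groups, values):
    """Subtract each group's mean, keeping only within-group variation."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for g, v in zip(groups, values):
        totals[g] += v
        counts[g] += 1
    return [v - totals[g] / counts[g] for g, v in zip(groups, values)]


def simulate(n_months=200, weeks=4, seed=0):
    """Hypothetical DGP: all numbers below are illustrative assumptions."""
    rng = random.Random(seed)
    month, price, quantity = [], [], []
    for m in range(n_months):
        popularity = rng.gauss(0, 1)           # monthly demand shock
        base_price = 1.5 + 0.3 * popularity    # suppliers plan higher prices in popular months
        for _ in range(weeks):
            p = base_price + rng.gauss(0, 0.1)  # pre-planned weekly price
            q = 10 + 3 * popularity - 2.0 * p + rng.gauss(0, 0.1)  # demand response
            month.append(m)
            price.append(p)
            quantity.append(q)
    return month, price, quantity


month, price, quantity = simulate()

# Raw covariation mixes between-month and within-month variation.
raw_slope = ols_slope(price, quantity)

# Within-month variation blocks out everything that is fixed within a month.
within_slope = ols_slope(demean_within(month, price), demean_within(month, quantity))

print(f"raw slope:    {raw_slope:.2f}")
print(f"within slope: {within_slope:.2f}")
```

In this made-up DGP the raw slope even has the wrong sign, while the within-month slope lands close to the assumed -2. The conclusion rests entirely on the assumption that suppliers fix their plans at the start of the month.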
The path to identification

1 Using theory and what you know about the context the data come from,
paint the most accurate picture possible of what the data generating process looks
like.
2 Use that data generating process to figure out the reasons our data might look the
way it does that don’t answer our research question.
3 Find ways to block out those alternate reasons and so dig out the variation we need.
2 Causal Diagrams

What is causality? We can say “X causes Y,” but what do we specifically mean by
that?

Many research questions we are interested in are causal in nature.
We don’t want to know if countries with higher minimum wages have less poverty;
we want to know if raising the minimum wage reduces poverty.

We can say that X causes Y if, were we to intervene and change the value of X, then
the distribution of Y would also change as a result.
Correlation is not Causation

We can observe that the number of people who wear shorts is much higher on days
when people eat ice cream.
If we were to intervene and swap out someone’s pants for shorts, would it make
them more likely to eat ice cream? Probably not! So this is a non-causal
correlation.
Surely the price of cigarettes has no direct causal effect on your health.
But if we were to intervene and raise the price of cigarettes, that would likely
reduce the number of cigarettes smoked.
So the price of cigarettes causes cigarette smoking (to go down).
Also, if we were to intervene and reduce the number of cigarettes smoked, that
would cause your health (to improve).
In sum, the price of cigarettes causes health, indirectly, through its effect on smoking.
Causality and Probability distribution

Important: we’ll still say X causes Y even if changing X doesn’t always change Y,
but just changes the probability that Y occurs. As I said earlier, it changes the
distribution of Y.
Does buying a child a copy of Alice in Wonderland cause them to read it?
Not always! Some kids won’t read it no matter what, and some kids would manage
to read it on their own without you buying it for them.
But in general, buying children copies of Alice in Wonderland increases the
probability that a child reads it, and so we’d say that buying them the book causes
them to read it.
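
The Alice in Wonderland argument can be sketched as a small simulation. The three kid types and their shares (30% never read it, 20% read it anyway, 50% read it only if given a copy) are invented numbers, and `reads_book` and `share_reading` are hypothetical helper names. The point is only that the intervention shifts the probability of reading without determining the outcome for every child.

```python
import random


def reads_book(kid_type, given_book):
    """Whether a child reads the book, by (assumed) type and intervention."""
    if kid_type == "never":
        return False       # won't read it no matter what
    if kid_type == "always":
        return True        # manages to read it without being given a copy
    return given_book      # "if_given": reads it only when given a copy


def share_reading(kid_types, given_book):
    """Share of children who read the book under a given intervention."""
    return sum(reads_book(k, given_book) for k in kid_types) / len(kid_types)


rng = random.Random(1)
kid_types = rng.choices(
    ["never", "always", "if_given"],
    weights=[0.3, 0.2, 0.5],   # illustrative shares, not estimates
    k=10_000,
)

p_without_book = share_reading(kid_types, given_book=False)
p_with_book = share_reading(kid_types, given_book=True)

print(f"P(reads | not given the book) = {p_without_book:.2f}")
print(f"P(reads | given the book)     = {p_with_book:.2f}")
```

Neither probability is 0 or 1, yet the intervention changes the distribution of the outcome, which is all the definition of causality above requires.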
Causal diagram

A causal diagram is a graphical representation of a data generating process (DGP).

A causal diagram contains only two things:
The variables in the DGP, each represented by a node on the diagram
The causal relationships in the DGP, each represented by an arrow from the cause
variable to the caused variable
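
Since a causal diagram is just nodes and arrows, it maps naturally onto a small data structure. The sketch below stores the cigarette example from earlier as a dictionary from each cause to the variables it causes. The node labels `CigarettePrice`, `Smoking` and `Health` and the helper names are illustrative choices, not notation from the lecture.

```python
# Each key is a node; its list holds the variables it points to (its effects).
causal_diagram = {
    "CigarettePrice": ["Smoking"],
    "Smoking": ["Health"],
    "Health": [],
}


def nodes(diagram):
    """All variables in the diagram."""
    return sorted(diagram)


def arrows(diagram):
    """All (cause, effect) pairs in the diagram."""
    return [(cause, effect) for cause, effects in diagram.items() for effect in effects]


def parents(diagram, node):
    """Variables with an arrow pointing directly into `node`."""
    return [cause for cause, effects in diagram.items() if node in effects]


print(nodes(causal_diagram))
print(arrows(causal_diagram))
print(parents(causal_diagram, "Health"))
```

Note what the structure does not record: the sign or size of any effect. Like the diagram itself, it only says which variable causes which.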
A simple example

each variable on the graph may take multiple values: this is different from a
The arrow just tells us that one variable causes another. It doesn’t say anything
about whether that causal effect is positive or negative.
A more complex example

We must include the causal relationships between all the variables. One variable
might cause multiple things (like CoinFlip), and other variables might be caused by
multiple things (like Money).
When one variable is caused by multiple things, the diagram doesn’t tell us exactly
how those things come together.
All (non-trivial) variables relevant to the data generating process should be
included, even if we can’t measure or see them.
Unobserved variables 1

We will show unobserved variables in gray.

In addition to being key parts of the DGP, they can sometimes help us account for
correlations in the diagram.
Unobserved variables 2: latent variables

People are more likely to wear shorts on days they eat ice cream, but shorts don’t
cause you to eat ice cream and ice cream doesn’t cause you to wear shorts.
There has to be something between them, otherwise they wouldn’t be correlated.
But we can’t have an arrow from one to the other, since neither causes the other.
In these cases we imagine that there’s a latent variable causing both of them, and
we can put that on the diagram.
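
A short simulation can show how a latent common cause produces a correlation with no arrow between the two observed variables. Here the latent variable is a hypothetical "hot day" indicator, and all probabilities are made-up values. Ice cream depends only on the weather, never on shorts, so forcing everyone into shorts (or out of them) leaves ice cream eating unchanged.

```python
import random


def simulate_days(n, seed, force_shorts=None):
    """Simulate (shorts, ice_cream) pairs; `force_shorts` mimics an intervention."""
    rng = random.Random(seed)
    days = []
    for _ in range(n):
        hot = rng.random() < 0.5                          # latent common cause (assumed)
        shorts = rng.random() < (0.8 if hot else 0.2)      # HotDay -> Shorts
        ice_cream = rng.random() < (0.7 if hot else 0.1)   # HotDay -> IceCream
        if force_shorts is not None:
            shorts = force_shorts                          # intervention overrides the weather
        days.append((shorts, ice_cream))
    return days


def ice_cream_rate(days, shorts_value=None):
    """Share of days with ice cream, optionally among days with a given shorts value."""
    selected = [ice for shorts, ice in days if shorts_value is None or shorts == shorts_value]
    return sum(selected) / len(selected)


observed = simulate_days(20_000, seed=42)
rate_given_shorts = ice_cream_rate(observed, shorts_value=True)
rate_given_pants = ice_cream_rate(observed, shorts_value=False)

# Intervene on shorts: same days, same weather, only the clothing is forced.
rate_forced_shorts = ice_cream_rate(simulate_days(20_000, seed=42, force_shorts=True))
rate_forced_pants = ice_cream_rate(simulate_days(20_000, seed=42, force_shorts=False))

print(f"observed:   {rate_given_shorts:.2f} with shorts vs {rate_given_pants:.2f} without")
print(f"intervened: {rate_forced_shorts:.2f} with shorts vs {rate_forced_pants:.2f} without")
```

The observed rates differ sharply, while the rates under intervention are identical. That is the pattern a diagram with a latent variable pointing to both Shorts and IceCream encodes.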
Police presence and crime rate

Relation between Police presence and crime rate

Just as important as what’s on the diagram is what’s not on the diagram!

Every variable and arrow that’s not on the diagram is an assumption we’re making.
Some assumptions we’ve made:

LaggedCrime doesn’t cause LawAndOrderPolitics.
PovertyRate isn’t a part of the data generating process.
LaggedPolicePerCapita doesn’t cause PolicePerCapita (or anything else for that
matter).
RecentPopularCrimeMovie doesn’t cause Crime.
Likely, not all those assumptions are true.
There’s a balance to strike when drawing causal diagrams.
On one hand, the simpler the diagram is, the easier it is to understand, and the
more likely it is that we’ll be able to figure out how to identify the answer to our
research question.
On the other hand, we might end up leaving out something that’s really important.
Research questions and Causal diagrams

Does additional police presence reduce crime?

We should be able to see which parts of the diagram answer our research question
of interest.
We should be interested in any parts of it that allow PolicePerCapita to cause
Crime.
Research questions and Causal diagrams

Direct effect: PolicePerCapita → Crime.

Indirect effect: PolicePerCapita → ExpectedCrimePayout → Crime.
So the variation in our data that answers our research question has to do with
PolicePerCapita causing Crime, and to do with PolicePerCapita causing
ExpectedCrimePayout, which then affects Crime.
To identify our answer, we have to dig out that part of the variation and block out
the alternative explanations.
We need to get rid of the variation due to LaggedCrime and LawAndOrderPolitics
in order to isolate just the variation we need.
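
Finding the paths that answer the research question can be sketched as a search over the diagram. The two PolicePerCapita paths below come from the text. The arrows out of LaggedCrime and LawAndOrderPolitics are assumptions, since the original figure is not reproduced here. `directed_paths` is a hypothetical helper name.

```python
diagram = {
    "PolicePerCapita": ["Crime", "ExpectedCrimePayout"],
    "ExpectedCrimePayout": ["Crime"],
    "LaggedCrime": ["PolicePerCapita", "Crime"],          # assumed arrows
    "LawAndOrderPolitics": ["PolicePerCapita", "Crime"],  # assumed arrows
    "Crime": [],
}


def directed_paths(graph, start, end, path=None):
    """All paths from `start` to `end` that follow the arrows."""
    path = (path or []) + [start]
    if start == end:
        return [path]
    found = []
    for child in graph.get(start, []):
        if child not in path:  # guard against revisiting a node
            found.extend(directed_paths(graph, child, end, path))
    return found


# Paths through which PolicePerCapita causes Crime: the variation we want.
causal_paths = directed_paths(diagram, "PolicePerCapita", "Crime")

# Variables pointing at both treatment and outcome: variation to block out.
common_causes = [
    node for node, effects in diagram.items()
    if "PolicePerCapita" in effects and "Crime" in effects
]

for p in causal_paths:
    print(" -> ".join(p))
print("block out:", common_causes)
```

Under these assumed arrows, the search returns the direct and the indirect path, and flags LaggedCrime and LawAndOrderPolitics as the sources of variation to remove.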
3 Drawing Causal Diagrams

The real world is complex. The true data generating process is too.
Problem: the whole point of having a model like a causal diagram is to help us
make sense of the DGP and, eventually, figure out how we can use it to identify the
answer to our research question.
We want to simplify where we can without getting so simple that our diagram no
longer represents the true DGP.
How to simplify

Unimportance. If the arrows coming in and out of a variable are likely to be tiny
and unimportant effects, we can probably remove the variable.
Redundancy. If several variables on the diagram have arrows coming in from the
same variables and going out to the same variables, we can probably
combine them and describe them together.
Mediators. If one variable is only on the graph as a way for one variable to affect
another (i.e. B in A → B → C where nothing else connects to B), then we can
probably remove it and just have A → C directly.
Irrelevance. Some variables are an important part of the data generating process
but irrelevant to the research question at hand. If a variable isn’t on any path
between the treatment and outcome variables, we can probably remove the variable.
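
As one illustration of these rules, the mediator rule can be written as a small graph operation. This is a sketch under the assumption that the diagram is stored as a dictionary from each cause to its effects; `parents_of`, `is_pure_mediator` and `drop_mediator` are hypothetical helper names.

```python
def parents_of(graph, node):
    """Variables with an arrow pointing directly into `node`."""
    return [cause for cause, effects in graph.items() if node in effects]


def is_pure_mediator(graph, node):
    """True if exactly one arrow comes in and exactly one goes out."""
    return len(parents_of(graph, node)) == 1 and len(graph[node]) == 1


def drop_mediator(graph, node):
    """Return a new graph with A -> node -> C replaced by A -> C."""
    if not is_pure_mediator(graph, node):
        raise ValueError(f"{node} is connected to other variables; keep it")
    (child,) = graph[node]
    simplified = {}
    for cause, effects in graph.items():
        if cause == node:
            continue
        rerouted = [child if effect == node else effect for effect in effects]
        # dict.fromkeys drops a duplicate arrow if the parent already pointed at child
        simplified[cause] = list(dict.fromkeys(rerouted))
    return simplified


chain = {"A": ["B"], "B": ["C"], "C": []}
simplified = drop_mediator(chain, "B")
print(simplified)
```

Whether dropping B is acceptable remains a judgment about the research question: if the mechanism through B is itself of interest, the variable should stay on the diagram.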
In a causal diagram there cannot be a cycle. You shouldn’t be able to start at one
variable, follow down the path of the arrows, and end up back where you started.
Problem: there are plenty of real-world data generating processes with feedback
loops: The rich get richer, and if I punch you that makes you punch me, which
makes me punch you.
Break the Cycles

Let’s pay attention to when these punches are thrown, as is common in statistical
applications where time is a factor.

Whenever we have a cycle in our diagram, we can get out of it by thinking about
adding a time dimension: time’s arrow only moves in one direction.
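
The punching example can be checked mechanically. The sketch below uses a depth-first search to detect a cycle, then shows that the same story with time-stamped variables has none. The node names and the three time periods are illustrative assumptions.

```python
def has_cycle(graph):
    """Return True if following the arrows can lead back to the starting node."""
    UNVISITED, IN_PROGRESS, DONE = 0, 1, 2
    state = {node: UNVISITED for node in graph}

    def visit(node):
        state[node] = IN_PROGRESS
        for child in graph.get(node, []):
            child_state = state.get(child, UNVISITED)
            if child_state == IN_PROGRESS:
                return True  # we came back to a node on the current path
            if child_state == UNVISITED and visit(child):
                return True
        state[node] = DONE
        return False

    return any(state[node] == UNVISITED and visit(node) for node in list(graph))


# Feedback loop without time: not a valid causal diagram.
punch_loop = {
    "IPunchYou": ["YouPunchMe"],
    "YouPunchMe": ["IPunchYou"],
}

# Same story with a time index on every variable: arrows only run forward.
punch_over_time = {
    "IPunchYou_t1": ["YouPunchMe_t2"],
    "YouPunchMe_t2": ["IPunchYou_t3"],
    "IPunchYou_t3": [],
}

print(has_cycle(punch_loop))
print(has_cycle(punch_over_time))
```

Treating my punch at time 1 and my punch at time 3 as different variables is what removes the loop, because no arrow can point backward in time.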

