Big Data for Economic Applications
Alessio Farcomeni
University of Rome “Tor Vergata”
alessio.farcomeni@uniroma2.it
Syllabus
Introduction
The four V's of big data:
1 Volume
2 Variety
3 Velocity
4 Veracity
Volume
Example applications (from deepindex.org)
New challenges
New opportunities
New strengths
New problems and new limitations
The methods are not actually new. What is new is the availability of large data corpora, and the computing power to process them.
Challenges
U.S. general election: an opportunistic survey combining polls conducted by the media for the 2016 US presidential election had information on n ≈ 2.3 × 10^6 Americans (≈ 1% of the population).
Problems: non-random sampling, measurement error, differing polling strategies, non-response biases.
Question: let n_s denote the sample size of a survey with random sampling and no (or, say, well-controlled) bias. What value of n_s gives data with the same information as the poll above?
Meng (2018) Annals of Applied Statistics
n_s = 400.
New opportunities: an example
They considered the time series of the number of searches for 50 million terms.
They selected the 45 terms most strongly associated with the time series of the incidence of infection.
They obtained a score based on the (standard deviation of the) sum of searches for these 45 terms, and used linear regression to predict infection.
The approach actually did not work as well as initially expected in the long run (predictably).
Moral: big data is not a substitute for good statistical analysis.
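As a hedged illustration (not the actual pipeline used in practice), the screen-then-regress recipe above can be sketched on synthetic data: rank candidate series by correlation with the target, keep the top few, aggregate them into a score, and regress the target on that score. All names, the data, and the top-5 cutoff are illustrative assumptions.

```python
# Sketch: correlation screening followed by simple linear regression.
# Synthetic stand-ins: "target" plays the role of weekly incidence,
# "terms" the candidate search-count series (illustrative names).
import numpy as np

rng = np.random.default_rng(0)
target = rng.normal(size=52)                # e.g. 52 weeks of incidence
terms = rng.normal(size=(52, 500))          # 500 candidate search series
terms[:, :3] += target[:, None]             # plant a few truly related terms

# Screen: correlation of each candidate series with the target
corr = np.array([np.corrcoef(terms[:, j], target)[0, 1]
                 for j in range(terms.shape[1])])
top = np.argsort(-np.abs(corr))[:5]         # keep the 5 strongest terms
score = terms[:, top].sum(axis=1)           # aggregate into a single score

# Regress the target on the score
slope, intercept = np.polyfit(score, target, 1)
pred = intercept + slope * score
```

With many noise series and few observations, spurious correlations can sneak into the selected set, which is one reason such predictors degrade over time.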
Other examples
Statistical methods: your usual statistical methods with some tweaks (e.g., regularization or subsampling). Advantages: interpretable, with theoretical guarantees, optimizing objective functions. Disadvantages: sometimes less accurate, in terms of predictive accuracy, than machine learning methods.
Machine learning methods: algorithmic methods. Advantages: can often be tuned to adapt to very complex/non-linear problems, and scale well. Disadvantages: heuristic, sometimes difficult to interpret (black boxes; "only the machine is learning").
A rap song dedicated to this seeming conflict: Baba Brinkman, "Data Science".
Why the disadvantages?
More than a single CPU can sometimes be used.
Sometimes tricks and tweaks are needed; sometimes the task is "embarrassingly parallel".
For example: stratified analyses, replicated operations (e.g., simulations, replicated subsampling).
Amdahl's law: S(n) = n / (1 + (n − 1) f), where n is the number of processors and f is the fraction of the task that cannot be parallelized.
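A quick numerical reading of the law (a sketch; the 5% serial fraction f is an arbitrary assumption): speedup saturates at 1/f no matter how many processors are added.

```python
# Amdahl's law: speedup S(n) = n / (1 + (n - 1) * f) with n processors
# when a fraction f of the work is inherently serial.
def amdahl_speedup(n, f):
    return n / (1 + (n - 1) * f)

for n in (2, 8, 64, 1024):
    print(n, round(amdahl_speedup(n, f=0.05), 2))
# As n grows, S(n) approaches the ceiling 1/f (here, 20).
```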
(Replicated) sub-sampling
Batching (with early stopping)
Subsampling
Problem: inf_θ l(θ), with l(θ) = Σ_{i=1}^n l_i(θ)/n
Newton-Raphson: θ_j = θ_{j−1} − H^{−1}(θ_{j−1}) ∇l(θ_{j−1})
Gradient descent: θ_j = θ_{j−1} − α ∇l(θ_{j−1}) for some learning rate α
Stochastic gradient descent (batch size: 1): θ_i = θ_{i−1} − α ∇l_i(θ_{i−1})
At each epoch units are randomly permuted (until early stopping or convergence)
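The batch-size-1 update can be sketched on a toy loss l_i(θ) = (x_i − θ)²/2, whose average is minimized at the sample mean; the function name, step size, and epoch count are illustrative assumptions:

```python
# Sketch of stochastic gradient descent with batch size 1, following
# theta_i = theta_{i-1} - alpha * grad(l_i(theta_{i-1})).
# For l_i(theta) = (x_i - theta)^2 / 2, the gradient is theta - x_i.
import random

def sgd(x, alpha=0.01, epochs=200, seed=0):
    rng = random.Random(seed)
    theta = 0.0
    for _ in range(epochs):
        idx = list(range(len(x)))
        rng.shuffle(idx)            # randomly permute units each epoch
        for i in idx:
            grad = theta - x[i]     # gradient of l_i at current theta
            theta -= alpha * grad   # one stochastic update per unit
    return theta

x = [1.0, 2.0, 3.0, 6.0]
print(sgd(x))  # hovers near the minimizer, mean(x) = 3.0
```

With a constant learning rate the iterates oscillate around the minimizer rather than converging exactly, which is one motivation for early stopping or a decaying step size.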