Introduction
Big Data for Economic Applications

Alessio Farcomeni
University of Rome “Tor Vergata”

alessio.farcomeni@uniroma2.it
Syllabus

Syllabus for the class of 2020.


Pre-requisites

Mandatory: Descriptive statistics
Mandatory: Statistical inference
Mandatory: Linear regression models
Required: Non-linear regression models (logistic regression)
R software: how are we doing on that?
If you have never covered generalized linear models, or at least logistic regression, send me an email and I will send you some material to study them on your own.
Warning

We will work with several example data sets.
Due to time constraints in class, they will often be only a small subset of big data sets.
The same operations, given more time, also apply to the full big data sets.
Learning

Statistical or machine learning: slight differences
Supervised learning: training data have labels. Unsupervised: training data do not.
Tasks: prediction, interpretation, segmentation, anomaly detection
Many applications in this context are focused on prediction
Peculiarities of big data: the 4 Vs

1 Volume
2 Variety
3 Velocity
4 Veracity
Volume

Scale of data (large sample size and/or large dimensionality)
Storage and processing issues.
Variety

Different types of data (e.g., text)
Structured, semi-structured, unstructured
Different forms can exist within the same data set
Mostly an issue with pre-processing
Velocity

Velocity of new data production
Velocity of analysis (streaming data)
Computational issues in general
Veracity

Biases and contamination
Subgroups and rare events
Lower data quality: uncertainty about data reliability
Veracity is linked to Value
Typical sources: opportunistic sampling

Administrative data (not collected for research purposes)
Web scraping (including social media)
Text, images, sounds
Mobile devices
An internet minute in 2019 (figure)
Example applications (from deepindex.org)

Recommender systems (movies, music, stuff to buy, who to hire, investments, etc.)
Detect frauds, spam, etc.
Manage hedge funds, simulate financial markets
Generate fake text, photos, videos, etc.
Use opportunistic non-conventional data (text, images, sounds) for prediction
Other examples (from the “Big Data Imperatives” book)
Retail: customer relationship management, logistics & supply-chain optimization, fraud detection and prevention, dynamic pricing
Finance: algorithmic trading, risk analysis, fraud detection, portfolio analysis, forecasting
Advertising: customer satisfaction, targeted advertising, demand discovery
In summary

New challenges
New opportunities
New strengths
New problems and new limitations
Methods are not actually new. What is new is the availability of large data corpora, and the computing power to process them.
Challenges

Computational challenges: processing times (and storage issues)
Sparsity challenges: high-dimensional data sets often exhibit sparsity, in that most of the predictors have negligible association with the outcomes.
Paradox of big data: inferential methods often become useless as bias dominates random fluctuation. Everything is significant (p < α) when n is huge; see the sketch below.
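A minimal R sketch of the "everything is significant" point, using simulated data with a practically negligible slope; the sample size and effect size are purely illustrative.

# With a huge n, even a negligible effect is "significant"
set.seed(1)
n <- 2e6                            # huge sample size
x <- rnorm(n)
y <- 0.005 * x + rnorm(n)           # slope of 0.005: practically irrelevant
fit <- lm(y ~ x)
summary(fit)$coefficients["x", ]    # tiny estimate, yet the p-value is essentially zero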
Opportunities & strengths

The possibility to combine new sources of information, and extract information that could not be reached before
The possibility to automatically obtain accurate predictions
The possibility to collect micro data from a huge number of subjects
Limitations

Garbage in, garbage out. A pile of garbage is only a big pile of garbage.
Costs escalate fast. Privacy issues.
Poor data quality: an example

U.S. general election.
An opportunistic survey combining polls conducted by the media for the 2016 US presidential election had information on n ≈ 2.3 × 10^6 Americans (≈ 1% of the population).
Problems: non-random sampling, measurement error, differing polling strategies, non-response biases.
Question: let n_s denote the sample size of a survey with random sampling and no (say, well-controlled) bias. What is the value of n_s that gives data with the same information as the poll above?
Meng (2018) Annals of Applied Statistics

n_s = 400.
New opportunities: an example

A Google research team showed that by crunching the trending terms in Google search they could predict flu outbreaks 3-7 days before the CDC and other official health authorities
Google Trends is now public and can be used to assess search term counts in any area/time frame; see the sketch below
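A minimal sketch of querying Google Trends from R, assuming the third-party gtrendsR package; the keyword, region, and time window are illustrative choices.

library(gtrendsR)
flu <- gtrends(keyword = "flu", geo = "US", time = "2019-01-01 2019-12-31")
head(flu$interest_over_time)   # relative search volume for "flu" in the US over 2019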
How did they do it?

They considered the time series of the number of searches for 50 million terms
They selected the 45 terms most strongly associated with the time series of infection incidence
They obtained a score based on the (standard deviation of the) sum of searches for these 45 terms, and used linear regression to predict infection
The approach actually did not work as well as initially expected in the long run (predictably)
Moral: big data is not a substitute for good statistical analysis.
Other examples

An algorithm can predict your biological sex from your typing rhythm on a keyboard.
An algorithm can predict your psychological profile using your most recent “likes” on Facebook.
The 30 most recent “likes” can do better than a close relative. The 60 most recent “likes” can do better than self-assessment.
More on the sources: examples

Social media: Twitter, Facebook, etc.
Web scraping in general (e.g., from websites, news sites, etc.); see the sketch below
User-generated content: Tripadvisor, Craigslist and similar sites, Reddit, comments on websites, etc.
Remote sensing (satellite images, GPS tracking, etc.)
Google: maps, flights, trends, translate, etc.
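A minimal web-scraping sketch, assuming the rvest package; the URL and CSS selector are hypothetical placeholders for an actual site.

library(rvest)
page <- read_html("https://example.com/news")                 # hypothetical news site
headlines <- html_text2(html_elements(page, "h2.headline"))   # hypothetical CSS selector
head(headlines)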
Web crawling

A Web crawler is an automated script which browses the Internet and downloads content satisfying acceptance criteria
In general, it starts with a list of URLs to visit
Each URL is then also used to recursively identify hyperlinks to additional pages, up to a stopping criterion
E.g., library(Rcrawler) or Apache Nutch; see the sketch below
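A minimal crawling sketch, assuming the Rcrawler package; the seed URL is a hypothetical placeholder, and the argument names should be checked against the package documentation.

library(Rcrawler)
Rcrawler(Website = "https://example.com",   # seed URL to start from
         no_cores = 4, no_conn = 4,         # parallel workers and simultaneous connections
         MaxDepth = 2)                      # follow hyperlinks at most two levels deep
# Crawled pages and their metadata are stored in a local repository by the package.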
Methods

Statistical methods: your usual statistical methods with some tweaks (e.g., regularization or subsampling). Advantages: interpretable, with theoretical guarantees, optimizing objective functions. Disadvantages: sometimes less accurate, in terms of predictive accuracy, than machine learning methods.
Machine learning methods: algorithmic methods. Advantages: can often be tuned to adapt to very complex/non-linear problems, and scale well. Disadvantages: heuristic, sometimes difficult to interpret (black boxes, “only the machine is learning”).
A rap song dedicated to this seeming conflict: Baba Brinkman, “Data Science”.
Why the disadvantages?

Statistical rationale: optimize a parsimonious objective function. Parsimony can become stinginess, for instance. Optimization guarantees theoretical properties under conditions that might not be met in practice.
Machine learning rationale: get well-performing solutions for a (sometimes ridiculously) over-parameterized objective function.
Examples of Tools (Google AI)

Language identification: text in input, language in output
Text recognition: image in input, the text it contains in output
Translator: you know what this does
Image labeling: image in input, a list of the objects it contains in output
Speech API: audio to text
Google Trends, as mentioned
Architectures and Systems for Big Data

We will essentially use the R software
Enterprises use a range of architectures and systems for storage and (parallel) computing.
We will only briefly list them now
Hadoop

Hadoop is an operating system for applications on clusters of computers
HDFS: redundant and reliable storage
MapReduce: cluster resource management and data processing language
It operates in batch, interactive, or online/streaming mode
In R one can use the sparklyr library, in batch mode; see the sketch below.
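A minimal sketch of working on a cluster engine from R, assuming the sparklyr package and a local Spark installation; the data set and aggregation are purely illustrative.

library(sparklyr)
library(dplyr)
sc <- spark_connect(master = "local")        # connect to a local Spark instance
mtcars_tbl <- copy_to(sc, mtcars, "mtcars")  # ship a small example table to Spark
mtcars_tbl %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg)) %>%        # computed by Spark, not in R memory
  collect()                                  # bring the (small) result back to R
spark_disconnect(sc)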
Cloud Computing and Storage

Several services offer resources for cloud computing and storage
Computation happens on a remote server (or cluster of servers)
Storage happens on a remote server (or cluster of servers)
Examples: Dropbox, Google Cloud, Amazon Cloud
Parallel computing

More than a single CPU can sometimes be used
Sometimes tricks and tweaks are needed. Sometimes the task is “embarrassingly parallel”
For example, stratified analyses, replicated operations (e.g., simulations, replicated subsampling); see the sketch below
Amdahl’s law:

S(n) = n / (1 + (n − 1) f),

where n is the number of CPUs and f is the fraction of operations that is serial and cannot be made parallel.
Speed-up S(n) is with respect to the slowest parallel job.
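A minimal sketch of an embarrassingly parallel task (replicated simulations), assuming the base parallel package; the replicated function and the core count are illustrative.

library(parallel)
one_replicate <- function(i) {
  x <- rnorm(1e4)        # any independent, self-contained computation
  mean(x)
}
results <- mclapply(1:100, one_replicate, mc.cores = 4)   # forks processes; on Windows use parLapply with a cluster
summary(unlist(results))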
Working with large sample size

Assume you have a large sample size n and want to perform your usual statistical analyses.
Conceptual issues: give more attention to effect size and less to inference (p-values, confidence intervals). See slides on bias and inference.
Computational issues: time needed to obtain results can be prohibitive (next)
Simple computational tricks

(Replicated) subsampling
Batching (with early stopping)
Subsampling

A sample of size v ≪ n is obtained through simple/stratified random sampling
The model is estimated only on the subsample; see the sketch below
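A minimal subsampling sketch; big_df, the model formula, and the subsample size are hypothetical placeholders for your own data and analysis.

set.seed(1)
v <- 10000                               # subsample size, with v << n
idx <- sample(nrow(big_df), v)           # simple random sampling without replacement
fit <- glm(y ~ x1 + x2, data = big_df[idx, ], family = binomial)
summary(fit)                             # usual inference, computed on the subsample only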
Replicated Subsampling

k > 1 samples of size v ≪ n are obtained through simple/stratified random sampling (usually, with replacement)
The model is estimated only on the subsamples
The subsample with the best objective function is retained, or the results are somehow averaged over k; see the sketch below
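A minimal replicated-subsampling sketch in which the k coefficient vectors are simply averaged; big_df, the model formula, k, and v are hypothetical placeholders.

set.seed(1)
k <- 20
v <- 10000
coefs <- replicate(k, {
  idx <- sample(nrow(big_df), v, replace = TRUE)   # one random subsample
  coef(glm(y ~ x1 + x2, data = big_df[idx, ], family = binomial))
})
rowMeans(coefs)                                    # average the estimates over the k subsamples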
Does it work?

All laws of statistical inference apply, given that subsamples are obtained at random
The big issue is with small subgroups and rare events, which are invariably lost.
Batching

In iterative algorithms (e.g., Newton-Raphson) data can be used sequentially, in batches.
The n observations are partitioned into batches of size v_1, ..., v_p.
Iteration j uses batch v_{j mod p}. Each time the sample has been used entirely, an epoch has passed; see the sketch below.
Problems clearly involve convergence (iterations are wiggly)
Example: stochastic gradient descent

Problem: inf_θ l(θ), with l(θ) = Σ_{i=1}^n l_i(θ)/n
Newton-Raphson: θ_j = θ_{j-1} − H^{-1}(θ_{j-1}) ∇l(θ_{j-1})
Gradient descent: θ_j = θ_{j-1} − α ∇l(θ_{j-1}) for some learning rate α
Stochastic gradient descent (batch size 1): θ_i = θ_{i-1} − α ∇l_i(θ_{i-1})
At each epoch, units are randomly permuted (until early stopping or convergence); see the sketch below.
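A minimal stochastic-gradient-descent sketch for least squares with batch size 1; the simulated data, learning rate, and number of epochs are illustrative choices, not part of the slides.

set.seed(1)
n <- 10000
x <- cbind(1, rnorm(n))                  # design matrix with an intercept
beta_true <- c(1, 2)
y <- as.numeric(x %*% beta_true + rnorm(n))

alpha <- 0.01                            # learning rate
theta <- c(0, 0)                         # starting value
for (epoch in 1:5) {
  for (i in sample(n)) {                 # random permutation of the units at each epoch
    grad_i <- -2 * x[i, ] * (y[i] - sum(x[i, ] * theta))   # gradient of l_i(θ) = (y_i − x_i'θ)^2
    theta <- theta - alpha * grad_i
  }
}
theta                                    # approximately (1, 2); compare with lm(y ~ x[, 2])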
