Introduction
Big Data for Economic Applications

Alessio Farcomeni
University of Rome “Tor Vergata”

alessio.farcomeni@uniroma2.it
Syllabus

Syllabus for the class of 2020.


Pre-requisites

Mandatory: Descriptive statistics
Mandatory: Statistical inference
Mandatory: Linear regression models
Required: Non-linear regression models (logistic regression)
R software: how are we doing on that?
If you have never covered generalized linear models, or at least logistic regression, send me an email and I will send you some material to study them on your own.
Warning

We will work with several example data sets.
Due to time constraints in class, they will often be only a small subset of big data sets.
The same operations, given more time, also apply to the full big data sets.
Learning

Statistical or machine learning: slight differences
Supervised learning: training data have labels. Unsupervised: training data do not.
Tasks: prediction, interpretation, segmentation, anomaly detection
Many applications in this context are focused on prediction
Peculiarities of big data: the 4 Vs

1 Volume
2 Variety
3 Velocity
4 Veracity
Volume

Scale of data (large sample size and/or large dimensionality)
Storage and processing issues.
Variety

Different types of data (e.g., text)
Structured, semi-structured, unstructured
Different forms can exist within the same data set
Mostly an issue with pre-processing
Velocity

Velocity of new data production
Velocity of analysis (streaming data)
Computational issues in general
Veracity

Biases and contamination
Subgroups and rare events
Lower data quality: uncertainty about data reliability
Veracity is linked to Value
Typical sources: opportunistic sampling

Administrative data (not collected for research purposes)
Web scraping (including social media)
Text, images, sounds
Mobile devices
An internet minute in 2019 (figure)
Example applications (from deepindex.org)

Recommender systems (movies, music, stuff to buy, who to hire, investments, etc.)
Detect frauds, spam, etc.
Manage hedge funds, simulate financial markets
Generate fake text, photos, videos, etc.
Use opportunistic non-conventional data (text, images, sounds) for prediction
Other examples (from the “Big Data Imperatives” book)
Retail: customer relationship management, logistics & supply-chain optimization, fraud detection and prevention, dynamic pricing
Finance: algorithmic trading, risk analysis, fraud detection, portfolio analysis, forecasting
Advertising: customer satisfaction, targeted advertising, demand discovery
In summary

New challenges
New opportunities
New strengths
New problems and new limitations
Methods are not actually new. What is new is the availability of large data corpora, and the computing power to process them.
Challenges

Computational challenges: processing times (and storage issues)
Sparsity challenges: high-dimensional data sets often exhibit sparsity, in that most of the predictors have negligible association with the outcomes.
Paradox of big data: inferential methods often become useless as bias dominates random fluctuation. Everything is significant (p < α) when n is huge; see the sketch below.
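A minimal R sketch of the "everything is significant" point, using simulated data with a practically negligible slope; the sample size and effect size are purely illustrative.

# With a huge n, even a negligible effect is "significant"
set.seed(1)
n <- 2e6                            # huge sample size
x <- rnorm(n)
y <- 0.005 * x + rnorm(n)           # slope of 0.005: practically irrelevant
fit <- lm(y ~ x)
summary(fit)$coefficients["x", ]    # tiny estimate, yet the p-value is essentially zero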
Opportunities & strengths

The possibility to combine new sources of information, and extract information that could not be reached before
The possibility to automatically obtain accurate predictions
The possibility to collect micro data from a huge number of subjects
Limitations

Garbage in, garbage out. A pile of garbage is only a big pile of garbage.
Costs escalate fast. Privacy issues.
Poor data quality: an example

U.S. general election.
An opportunistic survey combining polls conducted by the media for the 2016 US presidential election had information on n ≈ 2.3 × 10^6 Americans (≈ 1% of the population).
Problems: non-random sampling, measurement error, differing polling strategies, non-response biases.
Question: let n_s denote the sample size of a survey with random sampling and no (say, well-controlled) bias. What is the value of n_s that gives data with the same information as the poll above?
Meng (2018) Annals of Applied Statistics

n_s = 400.
New opportunities: an example

A Google research team showed that by crunching the trending terms in Google search they could predict flu outbreaks 3-7 days before the CDC and other official health authorities
Google Trends is now public and can be used to assess search term counts in any area/time frame; see the sketch below
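A minimal sketch of querying Google Trends from R, assuming the third-party gtrendsR package; the keyword, region, and time window are illustrative choices.

library(gtrendsR)
flu <- gtrends(keyword = "flu", geo = "US", time = "2019-01-01 2019-12-31")
head(flu$interest_over_time)   # relative search volume for "flu" in the US over 2019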
How did they do it?

They considered the time series of the number of searches for 50 million terms
They selected the 45 terms most strongly associated with the time series of infection incidence
They obtained a score based on the (standard deviation of the) sum of searches for these 45 terms, and used linear regression to predict infection
The approach actually did not work as well as initially expected in the long run (predictably)
Moral: big data is not a substitute for good statistical analysis.
Other examples

An algorithm can predict your biological sex from your typing rhythm on a keyboard.
An algorithm can predict your psychological profile using your most recent “likes” on Facebook.
The 30 most recent “likes” can do better than a close relative. The 60 most recent “likes” can do better than self-assessment.
More on the sources: examples

Social media: Twitter, Facebook, etc.
Web scraping in general (e.g., from websites, news sites, etc.); see the sketch below
User-generated content: Tripadvisor, Craigslist and similar sites, Reddit, comments on websites, etc.
Remote sensing (satellite images, GPS tracking, etc.)
Google: maps, flights, trends, translate, etc.
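A minimal web-scraping sketch, assuming the rvest package; the URL and CSS selector are hypothetical placeholders for an actual site.

library(rvest)
page <- read_html("https://example.com/news")                 # hypothetical news site
headlines <- html_text2(html_elements(page, "h2.headline"))   # hypothetical CSS selector
head(headlines)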
Web crawling

A Web crawler is an automated script which browses the Internet and downloads content satisfying acceptance criteria
In general, it starts with a list of URLs to visit
Each URL is then also used to recursively identify hyperlinks to additional pages, up to a stopping criterion
E.g., library(Rcrawler) or Apache Nutch; see the sketch below
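A minimal crawling sketch, assuming the Rcrawler package; the seed URL is a hypothetical placeholder, and the argument names should be checked against the package documentation.

library(Rcrawler)
Rcrawler(Website = "https://example.com",   # seed URL to start from
         no_cores = 4, no_conn = 4,         # parallel workers and simultaneous connections
         MaxDepth = 2)                      # follow hyperlinks at most two levels deep
# Crawled pages and their metadata are stored in a local repository by the package.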
Methods

Statistical methods: your usual statistical methods with some tweaks (e.g., regularization or subsampling). Advantages: interpretable, with theoretical guarantees, optimizing objective functions. Disadvantages: sometimes less accurate, in terms of predictive accuracy, than machine learning methods.
Machine learning methods: algorithmic methods. Advantages: can often be tuned to adapt to very complex/non-linear problems, and scale well. Disadvantages: heuristic, sometimes difficult to interpret (black boxes, “only the machine is learning”).
A rap song dedicated to this seeming conflict: Baba Brinkman, “Data Science”.
Why the disadvantages?

Statistical rationale: optimize a parsimonious objective function. Parsimony can become stinginess, for instance. Optimization guarantees theoretical properties under conditions that might not be met in practice.
Machine learning rationale: get well-performing solutions for a (sometimes ridiculously) over-parameterized objective function.
Examples of Tools (Google AI)

Language identification: text in input, language in output
Text recognition: image in input, the text it contains in output
Translator: you know what this does
Image labeling: image in input, a list of the objects it contains in output
Speech API: audio to text
Google Trends, as mentioned
Architectures and Systems for Big Data

We will essentially use the R software
Enterprises use a range of architectures and systems for storage and (parallel) computing.
We will only briefly list them now
Hadoop

Hadoop is an operating system for applications on clusters of computers
HDFS: redundant and reliable storage
MapReduce: cluster resource management and data processing language
It operates in batch, interactive, or online/streaming mode
In R one can use the sparklyr library, in batch mode; see the sketch below.
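A minimal sketch of working on a cluster engine from R, assuming the sparklyr package and a local Spark installation; the data set and aggregation are purely illustrative.

library(sparklyr)
library(dplyr)
sc <- spark_connect(master = "local")        # connect to a local Spark instance
mtcars_tbl <- copy_to(sc, mtcars, "mtcars")  # ship a small example table to Spark
mtcars_tbl %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg)) %>%        # computed by Spark, not in R memory
  collect()                                  # bring the (small) result back to R
spark_disconnect(sc)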
Cloud Computing and Storage

Several services offer resources for cloud computing and storage
Computation happens on a remote server (or cluster of servers)
Storage happens on a remote server (or cluster of servers)
Examples: Dropbox, Google Cloud, Amazon Cloud
Parallel computing

More than a single CPU can sometimes be used
Sometimes tricks and tweaks are needed. Sometimes the task is “embarrassingly parallel”
For example, stratified analyses, replicated operations (e.g., simulations, replicated subsampling); see the sketch below
Amdahl’s law:

S(n) = n / (1 + (n − 1) f),

where n is the number of CPUs and f is the fraction of operations that is serial and cannot be made parallel.
Speed-up S(n) is with respect to the slowest parallel job.
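A minimal sketch of an embarrassingly parallel task (replicated simulations), assuming the base parallel package; the replicated function and the core count are illustrative.

library(parallel)
one_replicate <- function(i) {
  x <- rnorm(1e4)        # any independent, self-contained computation
  mean(x)
}
results <- mclapply(1:100, one_replicate, mc.cores = 4)   # forks processes; on Windows use parLapply with a cluster
summary(unlist(results))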
Working with large sample size

Assume you have a large sample size n and want to perform your usual statistical analyses.
Conceptual issues: give more attention to effect size and less to inference (p-values, confidence intervals). See slides on bias and inference.
Computational issues: time needed to obtain results can be prohibitive (next)
Simple computational tricks

(Replicated) subsampling
Batching (with early stopping)
Subsampling

A sample of size v ≪ n is obtained through simple/stratified random sampling
The model is estimated only on the subsample; see the sketch below
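A minimal subsampling sketch; big_df, the model formula, and the subsample size are hypothetical placeholders for your own data and analysis.

set.seed(1)
v <- 10000                               # subsample size, with v << n
idx <- sample(nrow(big_df), v)           # simple random sampling without replacement
fit <- glm(y ~ x1 + x2, data = big_df[idx, ], family = binomial)
summary(fit)                             # usual inference, computed on the subsample only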
Replicated Subsampling

k > 1 samples of size v ≪ n are obtained through simple/stratified random sampling (usually, with replacement)
The model is estimated only on the subsamples
The subsample with the best objective function is retained, or the results are somehow averaged over k; see the sketch below
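A minimal replicated-subsampling sketch in which the k coefficient vectors are simply averaged; big_df, the model formula, k, and v are hypothetical placeholders.

set.seed(1)
k <- 20
v <- 10000
coefs <- replicate(k, {
  idx <- sample(nrow(big_df), v, replace = TRUE)   # one random subsample
  coef(glm(y ~ x1 + x2, data = big_df[idx, ], family = binomial))
})
rowMeans(coefs)                                    # average the estimates over the k subsamples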
Does it work?

All laws of statistical inference apply, given that subsamples are obtained at random
The big issue is with small subgroups and rare events, which are invariably lost.
Batching

In iterative algorithms (e.g., Newton-Raphson) data can be used sequentially, in batches.
The n observations are partitioned into batches of size v_1, ..., v_p.
Iteration j uses batch v_{j mod p}. Each time the sample has been used entirely, an epoch has passed; see the sketch below.
Problems clearly involve convergence (iterations are wiggly)
Example: stochastic gradient descent

Problem: inf_θ l(θ), with l(θ) = Σ_{i=1}^n l_i(θ)/n
Newton-Raphson: θ_j = θ_{j-1} − H^{-1}(θ_{j-1}) ∇l(θ_{j-1})
Gradient descent: θ_j = θ_{j-1} − α ∇l(θ_{j-1}) for some learning rate α
Stochastic gradient descent (batch size 1): θ_i = θ_{i-1} − α ∇l_i(θ_{i-1})
At each epoch, units are randomly permuted (until early stopping or convergence); see the sketch below.
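A minimal stochastic-gradient-descent sketch for least squares with batch size 1; the simulated data, learning rate, and number of epochs are illustrative choices, not part of the slides.

set.seed(1)
n <- 10000
x <- cbind(1, rnorm(n))                  # design matrix with an intercept
beta_true <- c(1, 2)
y <- as.numeric(x %*% beta_true + rnorm(n))

alpha <- 0.01                            # learning rate
theta <- c(0, 0)                         # starting value
for (epoch in 1:5) {
  for (i in sample(n)) {                 # random permutation of the units at each epoch
    grad_i <- -2 * x[i, ] * (y[i] - sum(x[i, ] * theta))   # gradient of l_i(θ) = (y_i − x_i'θ)^2
    theta <- theta - alpha * grad_i
  }
}
theta                                    # approximately (1, 2); compare with lm(y ~ x[, 2])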
