https://doi.org/10.1257/jel.20181020
Text as Data†
Matthew Gentzkow, Bryan Kelly, and Matt Taddy*
536 Journal of Economic Literature, Vol. LVII (September 2019)
In this paper, we provide an overview of methods for analyzing text and a survey of current applications in economics and related social sciences. The methods discussion is forward looking, providing an overview of methods that are currently applied in economics as well as those that we expect to have high value in the future. Our discussion of applications is selective and necessarily omits many worthy papers. We highlight examples that illustrate particular methods and use text data to make important substantive contributions even if they do not apply methods close to the frontier.

A number of other excellent surveys have been written in related areas. See Evans and Aceves (2016) and Grimmer and Stewart (2013) for related surveys focused on text analysis in sociology and political science, respectively. For methodological surveys, Bishop (2006), Hastie, Tibshirani, and Friedman (2009), and Murphy (2012) cover contemporary statistics and machine learning in general, while Jurafsky and Martin (2009) overview methods from computational linguistics and natural language processing. The Spring 2014 issue of the Journal of Economic Perspectives contains a symposium on "big data," which surveys broader applications of high-dimensional statistical methods to economics.

In section 2 we discuss representing text data as a manageable (though still high-dimensional) numerical array C; in section 3 we discuss methods from data mining and machine learning for predicting V from C. Section 4 then provides a selective survey of text analysis applications in social science, and section 5 concludes.

2. Representing Text as Data

When humans read text, they do not see a vector of dummy variables, nor a sequence of unrelated tokens. They interpret words in light of other words, and extract meaning from the text as a whole. It might seem obvious that any attempt to distill text into meaningful data must similarly take account of complex grammatical structures and rich interactions among words.

The field of computational linguistics has made tremendous progress in this kind of interpretation. Most of us have mobile phones that are capable of complex speech recognition. Algorithms exist to efficiently parse grammatical structure, disambiguate different senses of words, distinguish key points from secondary asides, and so on. Yet virtually all analysis of text in the social sciences, like much of the text analysis in machine learning more generally, ignores the lion's share of this complexity. Raw text consists of an ordered sequence of language elements: words, punctuation, and white space. To reduce this to a simpler representation suitable for statistical analysis, we typically make three kinds of simplifications: dividing the text into individual documents i, reducing the number of language elements we consider, and limiting the extent to which we encode dependence among elements within documents. The result is a mapping from raw text to a numerical array C. A row ci of C is a numerical vector with each element indicating the presence or count of a particular language token in document i.

2.1 What Is a Document?

The first step in constructing C is to divide raw text into individual documents {i}. In many applications, this is governed by the level at which the attributes of interest V are defined. For spam detection, the outcome of interest is defined at the level of individual emails, so we want to divide text that way too. If V is daily stock price movements that we wish to predict from the prior day's news text, it might make sense to divide the news text by day as well.

In other cases, the natural way to define a document is not so clear. If we wish to
stop words are another's subject of interest. Dropping numerals from political text means missing references to "the first 100 days" or "September 11." In online communication, even punctuation can no longer be stripped without potentially significant information loss :-(.

2.3 n-grams

Producing a tractable representation also requires that we limit dependence among language elements. A fairly mild step in this direction, for example, might be to parse documents into distinct sentences and encode features of these sentences while ignoring the order in which they occur. The most common methodologies go much further.

The simplest and most common way to represent a document is called bag-of-words. The order of words is ignored altogether, and ci is a vector whose length is equal to the number of words in the vocabulary and whose elements cij are the number of times word j occurs in document i. Suppose that the text of document i is

  Good night, good night!
  Parting is such sweet sorrow.

After stemming, removing stop words, and removing punctuation, we might be left with "good night good night part sweet sorrow." The bag-of-words representation would then have cij = 2 for j ∈ {good, night}, cij = 1 for j ∈ {part, sweet, sorrow}, and cij = 0 for all other words in the vocabulary.

This scheme can be extended to encode a limited amount of dependence by counting unique phrases rather than unique words. A phrase of length n is referred to as an n-gram. For example, in our snippet above, the count of 2-grams (or "bigrams") would have cij = 2 for j = good.night, cij = 1 for j including night.good, night.part, part.sweet, and sweet.sorrow, and cij = 0 for all other possible 2-grams. The bag-of-words representation then corresponds to counts of 1-grams.

Counting n-grams of order n > 1 yields data that describe a limited amount of the dependence between words. Specifically, the n-gram counts are sufficient for estimation of an n-order homogeneous Markov model across words (i.e., the model that arises if we assume that word choice is only dependent upon the previous n words). This can lead to richer modeling. In analysis of partisan speech, for example, single words are often insufficient to capture the patterns of interest: "death tax" and "tax break" are phrases with strong partisan overtones that are not evident if we look at the single words "death," "tax," and "break" (see, e.g., Gentzkow and Shapiro 2010).

Unfortunately, the dimension of ci increases exponentially quickly with the order n of the phrases tracked. The majority of text analyses consider n-grams up to two or three at most, and the ubiquity of these simple representations (in both machine learning and social science) reflects a belief that the return to richer n-gram modeling is usually small relative to the cost. Best practice in many cases is to begin analysis by focusing on single words. Given the accuracy obtained with words alone, one can then evaluate if it is worth the extra time to move on to 2-grams or 3-grams.

2.4 Richer Representations

While rarely used in the social science literature to date, there is a vast array of methods from computational linguistics that capture richer features of text and may have high return in certain applications. One basic step beyond the simple n-gram counting above is to use sentence syntax to inform the text tokens used to summarize a document. For example, Goldberg and Orwant (2013) describe syntactic n-grams where words are grouped together whenever their meaning depends upon each
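For illustration (our sketch, not part of the original article), the 1-gram and 2-gram counts for the snippet in section 2.3 above can be reproduced in a few lines, using the dot-joined phrase labels from that example:

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count the n-grams in a token sequence, labeling each phrase by
    joining its words with dots (e.g., "good.night")."""
    return Counter(".".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

# The snippet after stemming, stop-word removal, and punctuation removal.
tokens = "good night good night part sweet sorrow".split()

bag_of_words = ngram_counts(tokens, 1)  # 1-gram counts
bigrams = ngram_counts(tokens, 2)       # 2-gram counts
```

This reproduces the counts given in the text: cij = 2 for good.night and cij = 1 for each of the remaining bigrams.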
which the documents are informative. There can also be latent attributes of interest, such as the topics being discussed in a congressional debate or in news articles.

Methods to connect counts ci to attributes vi can be roughly divided into four categories. The first, which we will call dictionary-based methods, do not involve statistical inference at all: they simply specify v̂i = f(ci) for some known function f(·). This is by far the most common method in the social science literature using text to date. In some cases, researchers define f(·) based on a prespecified dictionary of terms capturing particular categories of text. In Tetlock (2007), for example, ci is a bag-of-words representation and the outcome of interest vi is the latent "sentiment" of Wall Street Journal columns, defined along a number of dimensions such as "positive," "optimistic," and so on. The author defines the function f(·) using a dictionary called the General Inquirer, which provides lists of words associated with each of these sentiment categories.3 The elements of f(ci) are defined to be the sum of the counts of words in each category. (As we discuss below, the main analysis then focuses on the first principal component of the resulting counts.) In Baker, Bloom, and Davis (2016), ci is the count of articles in a given newspaper-month containing a set of prespecified terms such as "policy," "uncertainty," and "Federal Reserve," and the outcome of interest vi is the degree of "policy uncertainty" in the economy. The authors define f(·) to be the raw count of the prespecified terms divided by the total number of articles in the newspaper-month, averaged across newspapers. We do not provide additional discussion of dictionary-based methods in this section, but we return to them in section 3.5 and in our discussion of applications in section 4.

3 http://www.wjh.harvard.edu/~inquirer/.

The second and third groups of methods are distinguished by whether they begin from a model of p(vi | ci) or a model of p(ci | vi). In the former case, which we will call text regression methods, we directly estimate the conditional outcome distribution, usually via the conditional expectation E[vi | ci] of attributes vi. This is intuitive: if we want to predict vi from ci, we would naturally regress the observed values of the former (V^train) on the corresponding values of the latter (C^train). Any generic regression technique can be applied, depending upon the nature of vi. However, the high dimensionality of ci, where p is often as large as or larger than n_train, requires use of regression techniques appropriate for such a setting, such as penalized linear or logistic regression.

In the latter case, we begin from a generative model of p(ci | vi). To see why this is intuitive, note that in many cases the underlying causal relationship runs from outcomes to language rather than the other way around. For example, Google searches about the flu do not cause flu cases to occur; rather, people with the flu are more likely to produce such searches. Congresspeople's ideology is not determined by their use of partisan language; rather, people who are more conservative or liberal to begin with are more likely to use such language. From an economic point of view, the correct "structural" model of language in these cases maps from vi to ci, and as in other cases familiar to economists, modeling the underlying causal relationships can provide powerful guidance to inference and make the estimated model more interpretable.

Generative models can be further divided by whether the attributes are observed or latent. In the first case of unsupervised methods, we do not observe the true value of vi for any documents. The function relating ci and vi is unknown, but we are willing to impose sufficient structure on it to allow us to infer vi from ci. This class includes methods
such as topic modeling and its variants (e.g., latent Dirichlet allocation, or LDA). In the second case of supervised methods, we observe training data V^train and we can fit our model, say fθ(ci; vi) for a vector of parameters θ, to this training set. The fitted model fθ̂ can then be inverted to predict vi for documents in the test set and can also be used to interpret the structural relationship between attributes and text. Finally, in some cases, vi includes both observed and latent attributes for a semi-supervised analysis.

Lastly, we discuss word embeddings, which provide a richer representation of the underlying text than the token counts that underlie other methods. They have seen limited application in economics to date, but their dramatic successes in deep learning and other machine learning domains suggest they are likely to have high value in the future.

We close in section 3.5 with some broad recommendations for practitioners.

3.1 Text Regression

Predicting an attribute vi from counts ci is a regression problem like any other, except that the high dimensionality of ci makes ordinary least squares (OLS) and other standard techniques infeasible. The methods in this section are mainly applications of standard high-dimensional regression methods to text.

3.1.1 Penalized Linear Models

The most popular strategy for very high-dimensional regression in contemporary statistics and machine learning is the estimation of penalized linear models, particularly with L1 penalization. We recommend this strategy for most text regression applications: linear models are intuitive and interpretable; fast, high-quality software is available for big sparse input matrices like our C. For simple text-regression tasks with input dimension on the same order as the sample size, penalized linear models typically perform close to the frontier in terms of out-of-sample prediction.

Linear models in the sense we mean here are those in which vi depends on ci only through a linear index ηi = α + x′i β, where xi is a known transformation of ci. In many cases, we simply have E[vi | xi] = ηi. It is also possible that E[vi | xi] = f(ηi) for some known link function f(·), as in the case of logistic regression.

Common transformations are the identity xi = ci, normalization by document length xi = ci/mi with mi = ∑j cij, or the positive indicator xij = 1[cij > 0]. The best choice is application specific, and may be driven by interpretability: does one wish to interpret βj as the added effect of an extra count for token j (if so, use xij = cij) or as the effect of the presence of token j (if so, use xij = 1[cij > 0])? The identity is a reasonable default in many settings.

Write l(α, β) for an unregularized objective proportional to the negative log likelihood, −log p(vi | xi). For example, in Gaussian (linear) regression, l(α, β) = ∑i (vi − ηi)², and in binomial (logistic) regression, l(α, β) = −∑i [ηi vi − log(1 + e^ηi)] for vi ∈ {0, 1}. A penalized estimator is then the solution to

(1)  min_{α,β}  { l(α, β) + nλ ∑_{j=1}^p κj(|βj|) },

where λ > 0 controls overall penalty magnitude and κj(·) are increasing "cost" functions that penalize deviations of the βj from zero.

A few common cost functions are shown in figure 1. Those that have a non-differentiable spike at zero (lasso, elastic net, and log) lead to sparse estimators, with some coefficients set to exactly zero. The curvature of the penalty away from zero dictates the weight of shrinkage imposed on the nonzero coefficients: L2 costs increase with coefficient size; lasso's L1 penalty has zero curvature and imposes constant shrinkage; and as curvature
[Figure 1. Four penalty cost functions plotted against β: β², |β|, |β| + 0.1 × β², and log(1 + |β|).]

Note: From left to right, L2 costs (ridge, Hoerl and Kennard 1970), L1 (lasso, Tibshirani 1996), the "elastic net" mixture of L1 and L2 (Zou and Hastie 2005), and the log penalty (Candès, Wakin, and Boyd 2008).
goes toward −∞ one approaches the L0 penalty of subset selection. The lasso's L1 penalty (Tibshirani 1996) is extremely popular: it yields sparse solutions with a number of desirable properties (e.g., Bickel, Ritov, and Tsybakov 2009; Wainwright 2009; Belloni, Chernozhukov, and Hansen 2013; Bühlmann and van de Geer 2011), and the number of nonzero estimated coefficients is an unbiased estimator of the regression degrees of freedom (which is useful in model selection; see Zou, Hastie, and Tibshirani 2007).4

4 Penalties with a bias that diminishes with coefficient size—such as the log penalty in figure 1 (Candès, Wakin, and Boyd 2008), the smoothly clipped absolute deviation (SCAD) of Fan and Li (2001), or the adaptive lasso of Zou (2006)—have been promoted in the statistics literature as improving upon the lasso by providing consistent variable selection and estimation in a wider range of settings. These diminishing-bias penalties lead to increased computation costs (due to a non-convex loss), but there exist efficient approximation algorithms (see, e.g., Fan, Xue, and Zou 2014; Taddy 2017b).

Focusing on L1 regularization, rewrite the penalized linear model objective as

(2)  min_{α,β}  { l(α, β) + nλ ∑_{j=1}^p ωj |βj| }.

A common strategy sets ωj so that the penalty cost for each coefficient is scaled by the sample standard deviation of that covariate. In text analysis, where each covariate corresponds to some transformation of a specific text token, this type of weighting is referred to as "rare feature up-weighting" (e.g., Manning, Raghavan, and Schütze 2008) and is generally thought of as good practice: rare words are often most useful in differentiating between documents.5

5 This is the same principle that motivates "inverse-document frequency" weighting schemes, such as tf–idf.

Large λ leads to simple model estimates in the sense that most coefficients will be set at or close to zero, while as λ → 0 we approach maximum likelihood estimation (MLE). Since there is no way to define an optimal λ a priori, standard practice is to compute estimates for a large set of possible λ and then use some criterion to select the one that yields the best fit.

Several criteria are available to choose an optimal λ. One common approach is to leave out part of the training sample in estimation and then choose the λ that yields the best out-of-sample fit according to some criterion such as mean squared error. Rather than work with a single leave-out sample, researchers most often use K-fold cross-validation (CV).
This splits the sample into K disjoint subsets, and then fits the full regularization path K times, excluding each subset in turn. This yields K realizations of the mean squared error or other out-of-sample fit measure for each value of λ. Common rules are to select the value of λ that minimizes the average error across these realizations, or (more conservatively) to choose the largest λ with mean error no more than one standard error away from the minimum.

Analytic alternatives to cross-validation are Akaike's information criterion (AIC; Akaike 1973) and the Bayesian information criterion (BIC) of Schwarz (1978). In particular, Flynn, Hurvich, and Simonoff (2013) describe a bias-corrected AIC objective for high-dimensional problems that they call AICc. It is motivated as an approximate likelihood maximization subject to a degrees of freedom (dfλ) adjustment: AICc(λ) = 2l(αλ, βλ) + 2dfλ n/(n − dfλ − 1). Similarly, the BIC objective is BIC(λ) = l(αλ, βλ) + dfλ log n, and is motivated as an approximation to the Bayesian posterior marginal likelihood in Kass and Wasserman (1995). AICc and BIC selection choose λ to minimize their respective objectives. The BIC tends to choose simpler models than cross-validation or AICc. Zou, Hastie, and Tibshirani (2007) recommend BIC for lasso penalty selection whenever variable selection, rather than predictive performance, is the primary goal.

3.1.2 Dimension Reduction

Another common solution for taming high-dimensional prediction problems is to form a small number of linear combinations of predictors and to use these derived indices as variables in an otherwise standard predictive regression. Two classic dimension reduction techniques are principal components regression (PCR) and partial least squares (PLS).

Penalized linear models use shrinkage and variable selection to manage high dimensionality by forcing the coefficients on most regressors to be close to (or, for lasso, exactly) zero. This can produce suboptimal forecasts when predictors are highly correlated. A transparent illustration of this problem would be a case in which all of the predictors are equal to the forecast target plus an i.i.d. noise term. In this situation, choosing a subset of predictors via lasso penalty is inferior to taking a simple average of the predictors and using this as the sole predictor in a univariate regression. This predictor averaging, as opposed to predictor selection, is the essence of dimension reduction.

PCR consists of a two-step procedure. In the first step, principal components analysis (PCA) combines regressors into a small set of K linear combinations that best preserve the covariance structure among the predictors. This amounts to solving the problem

(3)  min_{Γ,B}  trace[(C − ΓB′)(C − ΓB′)′],

subject to rank(Γ) = rank(B) = K.

The count matrix C consists of n rows (one for each document) and p columns (one for each term). PCA seeks a low-rank representation ΓB′ that best approximates the text data C. This formulation has the character of a factor model. The n × K matrix Γ captures the prevalence of K common components, or "factors," in each document. The p × K matrix B describes the strength of association between each word and the factors. As we will see, this reduced-rank decomposition bears a close resemblance to other text analytic methods such as topic modeling and word embeddings.

In the second step, the K components are used in standard predictive regression. As an
example, Foster, Liberman, and Stine (2013) use PCR to build a hedonic real estate pricing model that takes textual content of property listings as an input.6 With text data, where the number of features tends to vastly exceed the observation count, regularized versions of PCA such as predictor thresholding (e.g., Bai and Ng 2008) and sparse PCA (Zou, Hastie, and Tibshirani 2006) help exclude the least informative features to improve the predictive content of the dimension-reduced text.

6 See Stock and Watson (2002a, b) for development of the PCR estimator and an application to macroeconomic forecasting with a large set of numerical predictors.

A drawback of PCR is that it fails to incorporate the ultimate statistical objective—forecasting a particular set of attributes—in the dimensionality reduction step. PCA condenses text data into indices based on the covariation among the predictors. This happens prior to the forecasting step and without consideration of how predictors associate with the forecast target.

In contrast, PLS performs dimension reduction by directly exploiting covariation of predictors with the forecast target.7 Suppose we are interested in forecasting a scalar attribute vi. PLS regression proceeds as follows. For each element j of the feature vector ci, estimate the univariate covariance between vi and cij. This covariance, denoted φj, reflects the attribute's "partial" sensitivity to each feature j. Next, form a single predictor by averaging all attributes into a single aggregate predictor v̂i = ∑j φj cij / ∑j φj. This forecast places the highest weight on the strongest univariate predictors, and the least weight on the weakest. In this way, PLS performs its dimension reduction with the ultimate forecasting objective in mind. The description of v̂i reflects the K = 1 case, i.e., when text is condensed into a single predictive index. To use additional predictive indices, both vi and cij are orthogonalized with respect to v̂i, the above procedure is repeated on the orthogonalized data set, and the resulting forecast is added to the original v̂i. This is iterated until the desired number of PLS components K is reached. Like PCR, PLS components describe the prevalence of K common factors in each document. And also like PCR, PLS can be implemented with a variety of regularization schemes to aid its performance in the ultra-high-dimensional world of text. Section 4 discusses applications using PLS in text regression.

7 See Kelly and Pruitt (2013, 2015) for the asymptotic theory of PLS regression and its application to forecasting risk premia in financial markets.

PCR and PLS share a number of common properties. In both cases, K is a user-controlled parameter which, in many social science applications, is selected ex ante by the researcher. But, like any hyperparameter, K can be tuned via cross-validation. And neither method is scale invariant—the forecasting model is sensitive to the distribution of predictor variances. It is therefore common to variance-standardize features before applying PCR or PLS.

3.1.3 Nonlinear Text Regression

Penalized linear models are the most widely applied text regression tools due to their simplicity, and because they may be viewed as a first-order approximation to potentially nonlinear and complex data generating processes (DGPs). In cases where a linear specification is too restrictive, there are several other machine learning tools that are well suited to represent nonlinear associations between text ci and outcome attributes vi. Here we briefly describe four such nonlinear regression methods—generalized linear models, support vector machines, regression trees, and deep learning—and provide references for readers interested in thorough treatments of each.
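Before turning to those nonlinear methods, the K = 1 PLS step described above is simple enough to state in a few lines. The sketch below (our illustration, on made-up counts; it stops at the first component, with no orthogonalization step) forms each φj as a token's covariance with the target and then takes the weighted average v̂i = ∑j φj cij / ∑j φj:

```python
def pls_first_component(C, v):
    """The K = 1 PLS pass: weight each token by its univariate covariance
    with the target, then collapse the counts into one aggregate index."""
    n, p = len(C), len(C[0])
    v_bar = sum(v) / n
    col_bar = [sum(C[i][j] for i in range(n)) / n for j in range(p)]
    # phi_j: covariance between the attribute v_i and token count c_ij.
    phi = [sum((C[i][j] - col_bar[j]) * (v[i] - v_bar) for i in range(n)) / n
           for j in range(p)]
    # v_hat_i = sum_j phi_j c_ij / sum_j phi_j (assumes the phis do not sum to zero).
    denom = sum(phi)
    return [sum(phi[j] * C[i][j] for j in range(p)) / denom for i in range(n)]

# Made-up counts: token 0 co-moves with the target, token 1 barely does.
C = [[1.0, 1.0], [2.0, 0.0], [3.0, 1.0], [4.0, 0.0]]
v = [1.0, 2.0, 3.0, 4.0]
v_hat = pls_first_component(C, v)
```

The resulting index weights the strongly covarying token far more heavily than the weak one, exactly the behavior the text describes; a full PLS routine would orthogonalize and iterate to reach K components.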
GLMs and SVMs.—One way to capture nonlinear associations between ci and vi is with a generalized linear model (GLM). These expand the linear model to include nonlinear functions of ci such as polynomials or interactions, while otherwise treating the problem with the penalized linear regression methods discussed above.

A related method used in the social science literature is the support vector machine, or SVM (Vapnik 1995). This is used for text classification problems (when V is categorical), the prototypical example being email spam filtering. A detailed discussion of SVMs is beyond the scope of this review, but from a high level, the SVM finds hyperplanes in a basis expansion of C that partition the observations into sets with equal response (i.e., so that vi are all equal in each region).8

8 Hastie, Tibshirani, and Friedman (2009, chapter 12) and Murphy (2012, chapter 14) provide detailed overviews of GLMs and SVMs. Joachims (1998) and Tong and Koller (2001) (among others) study text applications of SVMs.

GLMs and SVMs both face the limitation that, without a priori assumptions for which basis transformations and interactions to include, they may overfit and require extensive tuning (Hastie, Tibshirani, and Friedman 2009; Murphy 2012). For example, multi-way interactions increase the parameterization combinatorially and can quickly overwhelm the penalization routine, and their performance suffers in the presence of many spurious "noise" inputs (Hastie, Tibshirani, and Friedman 2009).9

9 Another drawback of SVMs is that they cannot be easily connected to the estimation of a probabilistic model, and the resulting fitted model can sometimes be difficult to interpret. Polson and Scott (2011) provide a pseudo-likelihood interpretation for a variant of the SVM objective. Our own experience has led us to lean away from SVMs for text analysis in favor of more easily interpretable models. Murphy (2012, chapter 14.6) attributes the popularity of SVMs in some application areas to an ignorance of alternatives.

Regression Trees.—Regression trees have become a popular nonlinear approach for incorporating multi-way predictor interactions into regression and classification problems. The logic of trees differs markedly from traditional regressions. A tree "grows" by sequentially sorting data observations into bins based on values of the predictor variables. This partitions the data set into rectangular regions, and forms predictions as the average value of the outcome variable within each partition (Breiman et al. 1984). This structure is an effective way to accommodate rich interactions and nonlinear dependencies.

Two extensions of the simple regression tree have been highly successful thanks to clever regularization approaches that minimize the need for tuning and avoid overfitting. Random forests (Breiman 2001) average predictions from many trees that have been randomly perturbed in a bootstrap step. Boosted trees (e.g., Friedman 2002) recursively combine predictions from many oversimplified trees.10

10 Hastie, Tibshirani, and Friedman (2009) provide an overview of these methods. In addition, see Wager, Hastie, and Efron (2014) and Wager and Athey (2018) for results on confidence intervals for random forests, and see Taddy et al. (2015) and Taddy et al. (2016) for an interpretation of random forests as a Bayesian posterior over potentially optimal trees.

The benefits of regression trees—nonlinearity and high-order interactions—are sometimes lessened in the presence of high-dimensional inputs. While we would generally recommend tree models, and especially random forests, they are often not worth the effort for simple text regression. Oftentimes, a more beneficial use of trees is in a final prediction step after some dimension reduction derived from the generative models in section 3.2.

Deep Learning.—There is a host of other machine learning techniques that have been applied to text regression. The most common techniques not mentioned thus far are neural networks, which typically allow the inputs to act on the response through one
or more layers of interacting nonlinear basis functions (e.g., see Bishop 1995). A main attraction of neural networks is their status as universal approximators, a theoretical result describing their ability to mimic general, smooth nonlinear associations.

In high-dimensional and very noisy settings, such as in text analysis, classical neural nets tend to suffer from the same issues referenced above: they often overfit and are difficult to tune. However, the recently popular "deep" versions of neural networks (with many layers, and fewer nodes per layer) incorporate a number of innovations that allow them to work better, faster, and with little tuning, even in difficult text analysis problems. Such deep neural nets (DNNs) are now the state-of-the-art solution for many machine learning tasks (LeCun, Bengio, and Hinton 2015).11 DNNs are now employed in many complex natural language processing tasks, such as translation (Sutskever, Vinyals, and Le 2014; Wu et al. 2016) and syntactic parsing (Chen and Manning 2014), as well as in exercises of relevance to social scientists—for example, Iyyer et al. (2014) infer political ideology from text using a DNN. They are frequently used in conjunction with richer text representations such as word embeddings, described more below.

3.1.4 Bayesian Regression Methods

The penalized methods above can all be interpreted as posterior maximization under some prior. For example, ridge regression maximizes the posterior under independent Gaussian priors on each coefficient, while Park and Casella (2008) and Hans (2009) give Bayesian interpretations to the lasso. See also the horseshoe of Carvalho, Polson, and Scott (2010) and the double Pareto of Armagan, Dunson, and Lee (2013) for Bayesian analogues of diminishing bias penalties like the log penalty on the right of figure 1.

For those looking to do a full Bayesian analysis for high-dimensional (e.g., text) regression, an especially appealing model is the spike-and-slab introduced in George and McCulloch (1993). This models the distribution over regression coefficients as a mixture between two densities centered at zero—one with very small variance (the spike) and another with large variance (the slab). This model allows one to compute posterior variable inclusion probabilities as, for each coefficient, the posterior probability that it came from the slab and not the spike component. Due to a need to integrate over the posterior distribution, e.g., via Markov chain Monte Carlo (MCMC), inference for spike-and-slab models is much more computationally intensive than fitting the penalized regressions of section 3.1.1. However, Yang, Wainwright, and Jordan (2016) argue that spike-and-slab estimates based on short MCMC samples can be useful in application, while Scott and Varian (2014) have engineered efficient implementations of the spike-and-slab model for big data applications. These procedures give a full accounting of parameter uncertainty, which we miss in a quick penalized regression.

3.2 Generative Language Models

Text regression treats the token counts as generic high-dimensional input variables, without any attempt to model structure that is specific to language data. In many settings it is useful to instead propose a generative model for the text tokens to learn about how the attributes influence word choice and account for various dependencies among words and among attributes. In this approach, the words in a document are viewed as the realization of a generative process defined through a probability model for p(c_i | v_i).

11 Goodfellow, Bengio, and Courville (2016) provide a thorough textbook overview of these "deep learning" technologies, while Goldberg (2016) is an excellent primer on their use in natural language processing.
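The generative view just described can be made concrete with a small simulation. The sketch below is a toy illustration only: the six-word vocabulary, the two "topic" probability vectors, and the document weights are all invented for the example, anticipating the multinomial topic specification discussed in the next subsection.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 6-word vocabulary and two invented "topics":
# each topic is a probability vector over the vocabulary.
vocab = ["rate", "inflation", "policy", "game", "team", "win"]
theta = np.array([
    [0.40, 0.35, 0.20, 0.01, 0.02, 0.02],   # a "monetary policy" topic
    [0.02, 0.01, 0.02, 0.35, 0.30, 0.30],   # a "sports" topic
])

# Document attributes v_i: the share of the document devoted to each topic.
v_i = np.array([0.8, 0.2])

# Document-specific token probabilities q_i, then token counts c_i given
# document length m_i: a single multinomial draw realizes the document.
q_i = v_i @ theta
m_i = 100
c_i = rng.multinomial(m_i, q_i)

print(dict(zip(vocab, c_i)))  # token counts summing to m_i
```

Because v_i puts 80 percent of its weight on the first topic, the realized counts concentrate on the "monetary policy" words; refitting v_i from c_i is exactly the inference problem the unsupervised methods below address.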
3.2.1 Unsupervised Generative Models

In the unsupervised setting, we have no direct observations of the true attributes v_i. Our inference about these attributes must therefore depend entirely on strong assumptions that we are willing to impose on the structure of the model p(c_i | v_i). Examples in the broader literature include cases where the v_i are latent factors, clusters, or categories. In text analysis, the leading application has been the case in which the v_i are topics.

A typical generative model implies that each observation c_i is a conditionally independent draw from the vocabulary of possible tokens according to some document-specific token probability vector, say q_i = [q_{i1} ⋯ q_{ip}]′. Conditioning on document length, m_i = ∑_j c_{ij}, this implies a multinomial distribution for the counts

(4) c_i ∼ MN(q_i, m_i).

This multinomial model underlies the vast majority of contemporary generative models for text.

Under the basic model in (4), the function q_i = q(v_i) links attributes to the distribution of text counts. A leading example of this link function is the topic model specification of Blei, Ng, and Jordan (2003),12 where

(5) q_i = v_{i1} θ_1 + v_{i2} θ_2 + ⋯ + v_{ik} θ_k.

Many readers will recognize the model in (5) as a factor model for the vector of normalized counts for each token in document i, c_i/m_i. Indeed, a topic model is simply a factor model for multinomial data. Each topic is a probability vector over possible tokens, denoted θ_l, l = 1, …, k (where θ_{lj} ≥ 0 and ∑_{j=1}^p θ_{lj} = 1). A topic can be thought of as a cluster of tokens that tend to appear in documents. The latent attribute vector v_i is referred to as the set of topic weights (formally, a distribution over topics, v_{il} ≥ 0 and ∑_{l=1}^k v_{il} = 1). Note that v_{il} describes the proportion of language in document i devoted to the lth topic. We can allow each document to have a mix of topics, or we can require that one v_{il} = 1 while the rest are zero, so that each document has a single topic.13

Since its introduction into text analysis, topic modeling has become hugely popular.14 (See Blei 2012 for a high-level overview.) The model has been especially useful in political science (e.g., Grimmer 2010), where researchers have been successful in attaching political issues and beliefs to the estimated latent topics.

Since the v_i are of course latent, estimation for topic models tends to make use of some alternating inference for V | Θ and Θ | V. One possibility is to employ a version of the expectation-maximization (EM) algorithm to either maximize the likelihood implied by
(4) and (5) or, after incorporating the usual Dirichlet priors on v_i and θ_l, to maximize the posterior; this is the approach taken in Taddy (2012; see this paper also for a review of topic estimation techniques). Alternatively, one can target the full posterior distribution p(Θ, V | c_i). Estimation, say for Θ, then proceeds by maximization of the estimated marginal posterior, say p(Θ | c_i).

Due to the size of the data sets and dimension of the models, posterior approximation for topic models usually uses some form of variational inference (Wainwright and Jordan 2008) that fits a tractable parametric family to be as close as possible (e.g., in Kullback–Leibler divergence) to the true posterior. This variational approach was used in the original Blei, Ng, and Jordan (2003) paper and in many applications since. Hoffman et al. (2013) present a stochastic variational inference algorithm that takes advantage of techniques for optimization on massive data; this algorithm is used in many contemporary topic modeling applications. Another approach, which is more computationally intensive but can yield more accurate posterior approximations, is the MCMC algorithm of Griffiths and Steyvers (2004). Alternatively, for quick estimation without uncertainty quantification, the posterior maximization algorithm of Taddy (2012) is a good option.

The choice of k, the number of topics, is often fairly arbitrary. Data-driven choices do exist: Taddy (2012) describes a model selection process for k that is based upon Bayes factors, Airoldi et al. (2010) provide a cross-validation (CV) scheme, while Teh et al. (2006) use Bayesian nonparametric techniques that view k as an unknown model parameter. In practice, however, it is very common to simply start with a number of topics on the order of ten, and then adjust the number of topics in whatever direction seems to improve interpretability. Whether this ad hoc procedure is problematic depends on the application. As we discuss below, in many applications of topic models to date, the goal is to provide an intuitive description of text, rather than inference on some underlying "true" parameters; in these cases, the ad hoc selection of the number of topics may be reasonable.

The basic topic model has been generalized and extended in a variety of ways. A prominent example is the dynamic topic model of Blei and Lafferty (2006), which considers documents that are indexed by date (e.g., publication date for academic articles) and allows the topics, say Θ_t, to evolve smoothly in time. Another example is the supervised topic model of Blei and McAuliffe (2007), which combines the standard topic model with an extra equation relating the weights v_i to some additional attribute y_i in p(y_i | v_i). This pushes the latent topics to be relevant to y_i as well as the text c_i. In these and many other extensions, the modifications are designed to incorporate available document metadata (in these examples, time and y_i respectively).

3.2.2 Supervised Generative Models

In supervised models, the attributes v_i are observed in a training set and thus may be directly harnessed to inform the model of text generation. Perhaps the most common supervised generative model is the so-called naive Bayes classifier (e.g., Murphy 2012), which treats counts for each token as independent with class-dependent means. For example, the observed attribute might be author identity for each document in the corpus with the model specifying different mean token counts for each author.

In naive Bayes, v_i is a univariate categorical variable and the token count distribution is factorized as p(c_i | v_i) = ∏_j p_j(c_{ij} | v_i), thus "naively" specifying conditional independence between tokens j. This rules out the possibility that by choosing to say one token (say, "hello") we reduce the probability
after conditioning on the other attributes in v_i. Taddy (2015a) details use of such sufficient projections in a variety of applications, including attribute prediction, treatment effect estimation, and document indexing.

New techniques are arising that combine MNIR techniques with the latent structure of topic models. For example, Rabinovich and Blei (2014) directly combine the logistic regression in (7) with the topic model of (5) in a mixture specification. Alternatively, the structural topic model of Roberts et al. (2013) allows both topic content (θ_l) and topic prevalence (latent v_i) to depend on observable document attributes. Such semi-supervised techniques seem promising for their combination of the strong text-attribute connection of MNIR with topic modeling's ability to account for latent clustering and dependency within documents.

3.3 Word Embeddings

Throughout this article, documents have been represented through token count vectors, c_i. This is a crude language summarization. It abstracts from any notion of similarity between words (such as run, runner, jogger) or syntactical richness. One of the frontiers of textual analysis is in developing new representations of text data that more faithfully capture its meaning.

Instead of identifying words only as an index for location in a long vocabulary list, imagine representing words as points in a large vector space, with similar words colocated, and an internally consistent arithmetic on the space for relating words to one another. For example, suppose our vocabulary consists of six words: {king, queen, prince, man, woman, child}. The vector space representation of this vocabulary based on similarity of their meaning might look something like the figure 2 panel A.15

15 This example is motivated by https://blog.acolyer.org/2016/04/21/the-amazing-power-of-word-vectors/.

In the vector space, words are relationally oriented and we can begin to draw meaning from term positions, something that is not possible in simple bag-of-words approaches. For example, in the right figure, we can see that by subtracting the vector man from the vector king, and then adding to this woman, we arrive spatially close to queen. Likewise, the combination king − man + child lies in close proximity to the vector prince.

Such word embeddings, also known as distributed language representations, amount to a preprocessing of the text data to replace word identities—encoded as binary indicators in a vocabulary-length vector—with an embedding (location) of each vocabulary word in ℝ^K, where K is the dimension of the latent representation space. The dimensions of the vector space correspond to various aspects of meaning that give words their content. Continuing from the simplified example vocabulary, the latent (and, in reality, unlabeled) dimensions and associated word embeddings might look like:

Dimension     king   queen  prince  man    woman  child
Royalty       0.99   0.99   0.95    0.01   0.02   0.01
Masculinity   0.94   0.06   0.02    0.99   0.02   0.49
Age           0.73   0.81   0.15    0.61   0.68   0.09
…

This type of text representation has long been applied in natural language processing (Rumelhart, Hinton, and Williams 1986; Morin and Bengio 2005). The embeddings must be estimated and are chosen to optimize, perhaps approximately, an objective function defined on the original text (such as a likelihood for word occurrences). They form the basis for many deep learning applications involving textual data (see, e.g., Chen and Manning 2014; Goldberg 2016). They are also valuable in their own right for mapping from language to a vector space where we can compute distances
[Figure 2. Word embeddings for the example vocabulary. Panel A: man, woman, child, king, queen, and prince located in the vector space. Panel B: vector arithmetic in the space, where king − man + woman lands near queen.]
and angles between words for fundamental tasks such as classification, and have begun to be adopted by social scientists as a useful summary representation of text data.

Some popular embedding techniques are Word2Vec (Mikolov et al. 2013) and Global Vector for Word Representation (GloVe, Pennington, Socher, and Manning 2014). The key preliminary step in these methods is to settle on a notion of co-occurrence among terms. For example, consider a p × p matrix denoted CoOccur, whose (i, j) entry counts the number of times in your corpus that the terms i and j appear within, say, b words of each other. This is known as the skip-gram definition of co-occurrences.

To embed CoOccur in a K-dimensional vector space, where K is much smaller than p (say a few hundred), we solve the same type of problem that PCA used to summarize the word count matrix in equation (3). In particular, we can find rank-K matrices Γ and B that best approximate co-occurrences among terms:

CoOccur ≈ ΓB′.

The jth rows of Γ and B (denoted γ_j and β_j) give a K-dimensional embedding of the jth word, so co-occurrences of terms i and j are approximated as γ_i β_j′. This geometric representation of the text has an intuitive interpretation. The inner product of terms' embeddings, which measures the closeness of the pair in the K-dimensional vector space, describes how likely the pair is to co-occur.

Researchers are beginning to connect these vector-space language models with the sorts of document attributes that are of interest in social science. For example, Le and Mikolov (2014) estimate latent document scores in a vector space, while Taddy (2015b) develops an inversion rule for document classification based upon Word2Vec. In one especially compelling application, Bolukbasi et al. (2016) estimate the direction of gender in an embedding space by averaging the angles between female and male descriptors. They then show that stereotypically male and female jobs, for example, live at the corresponding ends of the implied gender vector. This information is used to derive an algorithm for removing these gender biases, so as to provide a more "fair" set of inputs
for machine learning tasks. Approaches like this, which use embeddings as the basis for mathematical analyses of text, can play a role in the next generation of text-as-data applications in social science.

3.4 Uncertainty Quantification

The machine learning literature on text analysis is focused on point estimation and predictive performance. Social scientists often seek to interpret parameters or functions of the fitted models, and hence desire strategies for quantifying the statistical uncertainty around these targets—that is, for statistical inference.

Many machine learning methods for text analysis are based upon a Bayesian modeling approach, where uncertainty quantification is often available as part of the estimation process. In MCMC sampling, as in the Bayesian regression of Scott and Varian (2014) or the topic modeling of Griffiths and Steyvers (2004), the software returns samples from the posterior and thus inference is immediately available. For estimators relying on variational inference—i.e., fitting a tractable distribution as closely as possible to the true posterior—one can simulate from the approximate distribution to conduct inference.16

Frequentist uncertainty quantification is often favored by social scientists, but analytic sampling distributions are unavailable for most of the methods discussed here. Some results exist for the lasso in stylized settings (especially Knight and Fu 2000), but these assume low-dimensional asymptotic scenarios that may be unrealistic for text analysis. More promising are computation algorithms that approximate the sampling distribution, the most common being the familiar nonparametric bootstrap (Efron 1979). This repeatedly draws samples with replacement of the same size as the original data set and reestimates parameters of interest on the bootstrapped samples, with the resulting set of estimated parameters approximating the sampling distribution.17

Unfortunately, the nonparametric bootstrap fails for many of the algorithms used on text. For example, it is known to fail for methods that involve non-differentiable loss functions (e.g., the lasso), and with-replacement resampling produces overfit in the bootstrap samples (repeated observations make prediction seem easier than it actually is). Hence, for many applications, it is better to look to methods more suitable for high-dimensional estimation algorithms. The two primary candidates are the parametric bootstrap and subsampling.

The parametric bootstrap generates new unrepeated observations for each bootstrap sample given an estimated generative model (or other assumed form for the data generating process).18 In doing so, it avoids pathologies of the nonparametric bootstrap that arise from using the empirical sample distribution. The cost is that the parametric bootstrap is, of course, parametric: It makes strong assumptions about the underlying generative model, and one must bear in mind that the resulting inference is conditional upon these assumptions.19

16 Due to its popularity in the deep learning community, variational inference is a common feature in newer machine learning frameworks; see, for example, Edward (Tran et al. 2016, 2017) for a Python library that builds variational inference on top of the TensorFlow platform. Edward and similar tools can be used to implement topic models and the other kinds of text models that we discussed above.
17 See, e.g., Horowitz (2003) for an overview.
18 See Efron (2012) for an overview that also makes interesting connections to Bayesian inference.
19 For example, in a linear regression model, the parametric bootstrap requires simulating errors from an assumed, say Gaussian, distribution. One must make assumptions on the exact form of this distribution, including whether the errors are homoskedastic or not. This contrasts with our usual approaches to standard errors for linear regression that are robust to assumptions on the functional form of the errors.
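As a minimal sketch of the parametric bootstrap in the simplest possible setting (a Gaussian linear regression rather than a text model), the code below regenerates data from the fitted model and refits the slope; all data-generating numbers are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated "observed" data from a linear model (invented numbers).
n = 200
x = rng.normal(size=n)
y = 2.0 + 1.5 * x + rng.normal(scale=0.7, size=n)

def fit_ols(x, y):
    """Return (intercept, slope) by least squares."""
    X = np.column_stack([np.ones_like(x), x])
    return np.linalg.lstsq(X, y, rcond=None)[0]

a_hat, b_hat = fit_ols(x, y)
resid = y - (a_hat + b_hat * x)
sigma_hat = resid.std(ddof=2)

# Parametric bootstrap: regenerate y from the *fitted* model with Gaussian
# errors (the strong distributional assumption noted above), then refit.
# The spread of refitted slopes approximates the sampling distribution.
B = 2000
boot_slopes = np.empty(B)
for b in range(B):
    y_star = a_hat + b_hat * x + rng.normal(scale=sigma_hat, size=n)
    boot_slopes[b] = fit_ols(x, y_star)[1]

boot_se = boot_slopes.std(ddof=1)
analytic_se = sigma_hat / np.sqrt(((x - x.mean()) ** 2).sum())
print(boot_se, analytic_se)  # the two should be close
```

In this well-behaved case the bootstrap standard error agrees with the textbook analytic one; the point of the method is that the same recipe still runs when, as in the text models above, no analytic formula exists, at the price of conditioning on the assumed error distribution.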
found that it is usually unwise to attempt to learn flexible functional forms unless n is much larger than p. When this is not the case, we generally recommend linear regression methods. Given the availability of fast and robust tools (gamlr and glmnet in R, and scikit-learn in Python), and the typically high dimensionality of text data, many prediction tasks in social science with text inputs can be efficiently addressed via penalized linear regression.

When there are multiple attributes of interest, and one wishes to resolve or control for interdependencies between these attributes and their effects on language, then one will need to work with a generative model for text. Multinomial logistic regression and its extensions can be applied to such situations, particularly via distributed multinomial regression. Alternatively, for corpora of many unlabeled documents (or when the labels do not tell the whole story that one wishes to investigate), topic modeling is the obvious approach. Word embeddings are also becoming an option for such questions. In the spirit of contemporary machine learning, it is also perfectly fine to combine techniques. For example, a common setting will have a large corpus of labeled documents as well as a smaller set of documents about which some metadata exist. One approach is to fit a topic model on the larger corpus, and to then use these topics as well as the token counts for supervised text regression on the smaller labeled corpus.

3.5.2 Model Validation and Interpretation

Ex ante criteria for selecting an empirical approach are suggestive at best. In practice, it is also crucial to validate the performance of the estimation approach ex post. Real research often involves an iterative tuning process with repeated rounds of estimation, validation, and adjustment.

When the goal is prediction, the primary tool for validation is checking out-of-sample predictive performance on data held out from the main estimation sample. In section 3.1.1, we discussed the technique of cross-validation (CV) for penalty selection, a leading example. More generally, whenever one works with complex and high-dimensional data, it is good practice to reserve a test set of data to use in estimation of the true average prediction error. Looping across multiple test sets, as in CV, is a common way of reducing the variance of these error estimates. (See Efron 2004 for a classic overview.)

In many social science applications, the goal is to go beyond prediction and use the values V̂ in some subsequent descriptive or causal analysis. In these cases, it is important to also validate the accuracy with which the fitted model is capturing the economic or descriptive quantity of interest.

One approach that is often effective is manual audits: cross-checking some subset of the fitted values against the coding a human would produce by hand. An informal version of this is for a researcher to simply inspect a subset of documents alongside the fitted V̂ and evaluate whether the estimates align with the concept of interest. A formal version would involve having one or more people manually classify each document in a subset and evaluating quantitatively the consistency between the human and machine codings. The subsample of documents does not need to be large in order for this exercise to be valuable—often as few as twenty or thirty documents is enough to provide a sense of whether the model is performing as desired.

This kind of auditing is especially important for dictionary methods. Validity hinges on the assumption that a particular function of text features—counts of positive or negative words, an indicator for the presence of certain keywords, etc.—will be a valid predictor of the true latent variable V. In a setting where we have sufficient prior information to justify this assumption, we
typically also have enough prior information to evaluate whether the resulting classification looks accurate. An excellent example of this is Baker, Bloom, and Davis (2016), who perform a careful manual audit to validate their dictionary-based method for identifying articles that discuss policy uncertainty.

Audits are also valuable in studies using other methods. In Gentzkow and Shapiro (2010), for example, the authors perform an audit of news articles that their fitted model classifies as having a right-leaning or left-leaning slant. They do not compare this against hand coding directly, but rather count the number of times the key phrases that are weighted by the model are used straightforwardly in news text, as opposed to occurring in quotation marks or in other types of articles such as letters to the editor.

A second approach to validating a fitted model is inspecting the estimated coefficients or other parameters of the model directly. In the context of text regression methods, however, this needs to be approached with caution. While there is a substantial literature on statistical properties of estimated parameters in penalized regression models (see Bühlmann and van de Geer 2011 and Hastie, Tibshirani, and Wainwright 2015), the reality is that these coefficients are typically only interpretable in cases where the true model is extremely sparse, so that the model is likely to have selected the correct set of variables with high probability. Otherwise, multicollinearity means the set of variables selected can be highly unstable.

These difficulties notwithstanding, inspecting the most important coefficients to see if they make intuitive sense can still be useful as a validation and sanity check. Note that "most important" can be defined in a number of ways; one can rank estimated coefficients by their absolute values, or by absolute value scaled by the standard deviation of the associated covariate, or perhaps by the order in which they first become nonzero in a lasso path of decreasing penalties. Alternatively, see Gentzkow, Shapiro, and Taddy (2016) for application-specific term rankings.

Inspection of fitted parameters is generally more informative in the context of a generative model. Even there, some caution is in order. For example, Taddy (2015a) finds that for MNIR models, getting an interpretable set of word loadings requires careful penalty tuning and the inclusion of appropriate control variables. As in text regression, it is usually worthwhile to look at the largest coefficients for validation but not take the smaller values too seriously.

Interpretation or story building around estimated parameters tends to be a major focus for topic models and other unsupervised generative models. Interpretation of the fitted topics usually proceeds by ranking the tokens in each topic according to token probability, θ_{lj}, or by token lift, θ_{lj}/p̄_j with p̄_j = (1/n) ∑_i c_{ij}/m_i. For example, if the five highest lift tokens in topic l for a model fit to a corpus of restaurant reviews are another.minute, flag.down, over.minute, wait.over, arrive.after, we might expect that reviews with high v_{il} correspond to negative experiences where the patron was forced to wait for service and food (example from Taddy 2012). Again, however, we caution against the overinterpretation of these unsupervised models: the posterior distributions informing parameter estimates are often multimodal, and multiple topic model runs can lead to multiple different interpretations. As argued in Airoldi and Bischof (2016) and in a comment by Taddy (2017a), the best way to build interpretability for topic models may be to add some supervision (i.e., to incorporate external information on the topics for some set of cases).

4. Applications

We now turn to applications of text analysis in economics and related social sciences.
Rather than presenting a comprehensive lit- undisputed, to train a naive Bayes classifier
erature survey, the goal of this section is to (a supervised generative model, as discussed
present a selection of illustrative papers to in section 3.2) in which the probabilities
give the reader a sense of the wide diversity p( cij | vi)of each phrase jare assumed to be
of questions that may be addressed with tex- independent Poisson or negative binomial
tual analysis and to provide a flavor of how random variables, and the inferences for the
some of the methods in section 3 are applied unknown documents are made by Bayes’ rule.
in practice. The results provide overwhelming evidence
that all of the disputed papers were authored
4.1 Authorship by Madison.
Stock and Trebbi (2003) apply similar
A classic descriptive problem is inferring methods to answer an authorship question
the author of a document. While this is not of more direct interest to economists: who
usually a first-order research question for invented instrumental variables? The ear-
social scientists, it provides a particularly liest known derivation of the instrumental
clean example, and a good starting point to variables estimator appears in an appendix
understand the applications that follow. to The Tariff on Animal and Vegetable Oils,
In what is often seen as the first mod- a 1928 book by statistician Philip Wright.
ern statistical analysis of text data, Mosteller While the bulk of the book is devoted to
and Wallace (1963) use text analysis to infer a “painfully detailed treatise on animal
the authorship of the disputed Federalist and vegetable oils, their production, uses,
Papers that had alternatively been attributed markets and tariffs,” the appendix is of an
to either Alexander Hamilton or James entirely different character, with “a suc-
Madison. They define documents i to be indi- cinct and insightful explanation of why data
vidual Federalist Papers, the data features on price and quantity alone are in general
ciof interest to be counts of function words inadequate for estimating either supply or
such as “an,” “of,” and “upon” in each doc- demand; two separate and correct deriva-
ument, and the outcome vito be an indica- tions of the instrumental variables estimators
tor for the identity of the author. Note that of the supply and demand elasticities; and
the function words the authors focus on are an empirical application” (Stock and Trebbi
exactly the “stop words” that are frequently 2003, p. 177). The contrast between the
excluded from analysis (as discussed in sec- two parts of the book has led to speculation
tion 2 above). The key feature of these words that the appendix was not written by Philip
is that their use by a given author tends to be Wright, but rather by his son Sewall. Sewall
stable regardless of the topic, tone, or intent Wright was an economist who had origi-
of the piece of writing. This means they pro- nated the method of “path coefficients” used
vide little valuable information if the goal is in one of the derivations in the appendix.
to infer characteristics such as political slant Several authors including Manski (1988) are
or discussion of policy uncertainty that are on record attributing authorship to Sewall;
independent of the idiosyncratic styles of others including Angrist and Krueger (2001)
particular authors. When such styles are the attribute it to Philip.
object of interest, however, function words In Stock and Trebbi’s (2003) study, the
become among the most informative text outcome viis an indicator for authorship
characteristics. Mosteller and Wallace (1963) by either Philip or Sewall. The data fea-
use a sample of Federalist Papers, whose tures ci = [c func i c gram
i ]are counts of the
authorship by either Madison or Hamilton is same function words used by Mosteller and
558 Journal of Economic Literature, Vol. LVII (September 2019)
Wallace (1963) plus counts of a set of grammatical constructions (e.g., “noun followed by adverb”) measured using an algorithm due to Mannion and Dixon (1997). The training sample C^train consists of forty-five documents known to have been written by either Philip or Sewall, and the test sample C^test, in which v_i is unobserved, consists of eight blocks of text from the appendix plus one block of text from chapter 1 of The Tariff on Animal and Vegetable Oils included as a validity check. The authors apply principal components analysis, which we can think of as an unsupervised cousin of the topic modeling approach discussed in section 3.2.1. They extract the first four principal components from c_i^func and c_i^gram, respectively, and then run regressions of the binary authorship variable on the principal components, resulting in predicted values v̂_i^func and v̂_i^gram.

The results provide overwhelming evidence that the disputed appendix was in fact written by Philip. Figure 3 plots the values (v̂_i^func, v̂_i^gram) for all of the documents in the sample. Each point in the figure is a document i, and the labels indicate whether the document is known to be written by Philip (“P” or “1”), known to be written by Sewall (“S”), or of uncertain authorship (“B”). The measures clearly distinguish the two authors, with documents by each forming clear, nonoverlapping clusters. The uncertain documents all fall squarely within the cluster attributed to Philip, with the predicted values v̂_i^func, v̂_i^gram ≈ 1.

4.2 Stock Prices

An early example analyzing news text for stock price prediction appears in Cowles (1933). He subjectively categorizes the text of editorial articles of Peter Hamilton, chief editor of the Wall Street Journal from 1902–29, as “bullish,” “bearish,” or “doubtful.” Cowles then uses these classifications to predict future returns of the Dow Jones Industrial Average. Hamilton’s track record is unimpressive. A market-timing strategy based on his Wall Street Journal editorials underperforms a passive investment in the Dow Jones Industrial Average by 3.5 percentage points per year.

In its modern form, the implementation of text-based prediction in finance is computationally driven, but it applies methods that are conceptually similar to Cowles’s approach, seeking to predict the target quantity V (the Dow Jones return, in the example of Cowles) from the array of token counts C. We discuss three examples of recent papers that study equity return prediction in the spirit of Cowles: one relying on a preexisting dictionary (as discussed at the beginning of section 3), one using regression techniques (as discussed in section 3.1), and another using generative models (as discussed in section 3.2).

Tetlock’s 2007 paper is a leading dictionary-based example of analyzing media sentiment and the stock market. He studies word counts c_i in the Wall Street Journal’s widely read “Abreast of the Market” column. Counts from each article i are converted into a vector of sentiment scores v̂_i in seventy-seven different sentiment dimensions based on the Harvard IV-4 psychosocial dictionary.22 The time series of daily sentiment scores for each category (v̂_i) are condensed into a single principal component, which Tetlock names the “pessimism factor” due to the component’s especially close association with the “pessimism” dimension of the sentiment categories.

The second stage of the analysis uses this pessimism score to forecast stock market

22 While it has only recently been used in economics and finance, the Harvard dictionary and associated General Inquirer software for textual content analysis dates to the 1960s and has been widely used in linguistics, psychology, sociology, and anthropology.
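A stylized sketch of the principal-components authorship classification described above (the count vectors, the power-iteration PCA, and the OLS step are illustrative stand-ins of my own, not Stock and Trebbi's actual data or code):

```python
import math

def first_pc_scores(X, iters=200):
    """Project rows of X onto the first principal component
    (leading eigenvector of the covariance matrix, via power iteration)."""
    n, d = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(d)]
    Xc = [[row[j] - means[j] for j in range(d)] for row in X]
    C = [[sum(r[a] * r[b] for r in Xc) / n for b in range(d)] for a in range(d)]
    v = [1.0] * d
    for _ in range(iters):
        w = [sum(C[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return [sum(r[j] * v[j] for j in range(d)) for r in Xc]

# Hypothetical function-word count vectors (one row per document)
philip = [[10, 2, 1], [9, 3, 1], [11, 2, 2]]   # known Philip
sewall = [[2, 9, 1], [3, 10, 2], [2, 8, 1]]    # known Sewall
disputed = [[10, 3, 1]]                        # unknown authorship

scores = first_pc_scores(philip + sewall + disputed)
train_s, train_y = scores[:6], [1, 1, 1, 0, 0, 0]  # 1 = Philip

# OLS of the binary authorship indicator on the component score
ms, my = sum(train_s) / 6, sum(train_y) / 6
b = sum((s - ms) * (y - my) for s, y in zip(train_s, train_y)) / \
    sum((s - ms) ** 2 for s in train_s)
a = my - b * ms
v_hat = a + b * scores[6]  # values above 0.5 attribute the disputed block to Philip
```

Because the regression step absorbs the arbitrary sign of the eigenvector, the predicted value is invariant to which direction the power iteration converges to.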
[Figure 3 appears here. Source: Stock and Trebbi (2003). Copyright American Economic Association; reproduced with the permission of the Journal of Economic Perspectives.]
offer improved stock return forecasts relative to dictionary methods.

The authors propose the following regression model to capture correlations between occurrences of individual words and subsequent stock return realizations around regulatory filing dates:

  v_i = a + b ∑_j w_j (c_ij / ∑_j c_ij) + ε_i.

Documents i are defined to be annual reports filed by firms at the Securities and Exchange Commission. The outcome variable v_i is a stock’s cumulative four-day return beginning on the filing day. The independent variable c_ij is a count of occurrences of word j in annual report i. The coefficient w_j summarizes the average association between an occurrence of word j and the stock’s subsequent return. The authors show how to estimate w_j from a cross-sectional regression, along with a subsequent rescaling of all coefficients to remove the common influence parameter b. Finally, variables to predict returns are built from the estimated weights, and are shown to have stronger out-of-sample forecasting performance than dictionary-based indices from Loughran and McDonald (2011). The results highlight the limitations of using fixed dictionaries for diverse predictive problems, and show that these limitations are often surmountable by estimating application-specific weights via regression.

Manela and Moreira (2017) take a regression approach to construct an index of news-implied market volatility based on text from the Wall Street Journal from 1890–2009. They apply support vector machines, a nonlinear regression method that we discuss in section 3.1.3. This approach applies a penalized least squares objective to identify a small subset of words whose frequencies are most useful for predicting outcomes—in this case, turbulence in financial markets. Two important findings emerge from their analysis. First, the terms most closely associated with market volatility relate to government policy and wars. Second, high levels of news-implied volatility forecast high future stock market returns. These two facts together give insight into the types of risks that drive investors’ valuation decisions.23

The closest modern analog of Cowles’s study is Antweiler and Frank (2004), who take a generative modeling approach to ask: how informative are the views of stock market prognosticators who post on internet message boards? Similar to Cowles’s analysis, these authors classify postings on stock message boards as “buy,” “sell,” or “hold” signals. But the vast number of postings, roughly 1.5 million in the analyzed sample, makes subjective classification of messages infeasible. Instead, generative techniques allow the authors to automatically classify messages.

The authors create a training sample of one thousand messages, and form V^train by manually classifying messages into one of the three categories. They then use the naive Bayes method described in section 3.2.2 to estimate a probability model that maps word counts of postings C into classifications V̂ for the remaining 1.5 million messages. Finally, the buy/sell/hold classification of each message is aggregated into an index that is used to forecast stock returns. Consistent with the conclusions of Cowles, message board postings show little ability to predict stock returns.

23 While Manela and Moreira (2017) study aggregate market volatility, Kogan et al. (2009) and Boudoukh et al. (2016) use text from news and regulatory filings to predict firm-specific volatility. Chinco, Clark-Joseph, and Ye (2017) apply lasso in high frequency return prediction using preprocessed financial news text sentiment as an explanatory variable.
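The naive Bayes classification step can be sketched with a toy multinomial classifier; the messages, labels, and Laplace smoothing here are my own illustrations, not Antweiler and Frank's actual implementation:

```python
import math
from collections import Counter

class NaiveBayes:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, docs, labels):
        self.classes = set(labels)
        self.priors = {c: math.log(labels.count(c) / len(labels))
                       for c in self.classes}
        self.word_counts = {c: Counter() for c in self.classes}
        self.totals = {c: 0 for c in self.classes}
        self.vocab = set()
        for doc, c in zip(docs, labels):
            for w in doc.split():
                self.word_counts[c][w] += 1
                self.totals[c] += 1
                self.vocab.add(w)
        return self

    def predict(self, doc):
        V = len(self.vocab)
        def log_posterior(c):
            s = self.priors[c]
            for w in doc.split():
                s += math.log((self.word_counts[c][w] + 1) / (self.totals[c] + V))
            return s
        return max(self.classes, key=log_posterior)

# Hand-labeled training messages (the V^train analog, heavily shrunken)
train = [("strong rally buy now", "buy"), ("upside momentum buy", "buy"),
         ("sell losses ahead", "sell"), ("crash warning sell", "sell"),
         ("hold steady neutral", "hold"), ("wait and hold", "hold")]
nb = NaiveBayes().fit([d for d, _ in train], [c for _, c in train])
label = nb.predict("strong buy rally")
```

In practice the fitted model would then be applied to the full unlabeled corpus and the per-message labels aggregated into an index.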
They do, however, possess significant and economically meaningful information about stock volatility and trading volume.24

Bandiera et al. (2017) apply unsupervised machine learning—topic modeling (LDA)—to a large panel of CEO diary data. They uncover two distinct behavioral types that they classify as “leaders,” who focus on communication and coordination activities, and “managers,” who emphasize production-related activities. They show that, due to horizontal differentiation of firm and manager types, appropriately matched firms and CEOs enjoy better firm performance. Mismatches are more common in lower-income economies, and mismatches can account for 13 percent of the labor productivity gap between firms in high- and middle/low-income countries.

4.3 Central Bank Communication

A related line of research analyzes the impact of communication from central banks on financial markets. As banks rely more on these statements to achieve policy objectives, an understanding of their effects is increasingly relevant.

Lucca and Trebbi (2009) use the content of Federal Open Market Committee (FOMC) statements to predict fluctuations in Treasury securities. To do this, they use two different dictionary-based methods (section 3)—Google and Factiva semantic orientation scores—to construct v̂_i, which quantifies the direction and intensity of the ith FOMC statement. In the Google score, c_i counts how many Google search hits occur when searching for phrases in i plus one of the words from a list of antonym pairs signifying positive or negative sentiment (e.g., “hawkish” versus “dovish”). These counts are mapped into v̂_i by differencing the frequency of positive and negative searches and averaging over all phrases in i. The Factiva score is calculated similarly. Next, the central bank sentiment proxies v̂_i are used to predict Treasury yields in a vector autoregression (VAR). They find that changes in statement content, as opposed to unexpected deviations in the federal funds target rate, are the main driver of changes in interest rates.

Born, Ehrmann, and Fratzscher (2014) extend this idea to study the effect of central bank sentiment on stock market returns and volatility. They construct a financial stability sentiment index v̂_i from Financial Stability Reports (FSRs) and speeches given by central bank governors. Their approach uses a sentiment dictionary to assign optimism scores to word counts c_i from central bank communications. They find that optimistic FSRs tend to increase equity prices and reduce market volatility during the subsequent month.

Hansen, McMahon, and Prat (2018) research how FOMC transparency affects debate during meetings by studying a change in disclosure policy. Prior to November 1993, FOMC meeting transcripts were secret, but following a policy shift transcripts became public with a time lag. There are potential costs and benefits of increased transparency, such as the potential for more efficient and informed debate due to increased accountability of policy makers. On the other hand, transparency may make committee members more cautious, biased toward the status quo, or prone to group-think.

The authors use topic modeling (section 3.2.1) to study 149 FOMC meeting transcripts during Alan Greenspan’s tenure. The unit of observation is a member-meeting. The vector c_i counts the words used by FOMC member m in meeting t, and i is defined as the pair (m, t).

24 Other papers that use naive Bayes and similar generative models to study behavioral finance questions include Buehlmaier and Whited (2018), Li (2010), and Das and Chen (2007).
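The antonym-differencing idea behind a semantic orientation score of the kind Lucca and Trebbi use can be sketched as follows; the hit counts and the normalization are illustrative assumptions, not their exact formula:

```python
def semantic_orientation(phrase_hits):
    """phrase_hits: one (positive_hits, negative_hits) pair per phrase in a
    statement, e.g., search hits for the phrase plus 'hawkish' versus the
    phrase plus 'dovish'. Returns the average normalized difference."""
    scores = [(pos - neg) / (pos + neg) for pos, neg in phrase_hits if pos + neg]
    return sum(scores) / len(scores)

# Two phrases from a hypothetical statement, mostly 'hawkish' co-hits
score = semantic_orientation([(8, 2), (6, 4)])  # 0.6 and 0.2 average to 0.4
```

A positive score indicates that the statement's phrases co-occur more often with the positive pole of the antonym pair.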
The outcome of interest, v_i, is a vector that includes the proportion of i’s language devoted to the K different topics (estimated from the fitted topic model), the concentration of these topic weights, and the frequency of data citation by i. Next, a difference-in-differences regression estimates the effects of the change in transparency on v̂_i. The authors find that, after the move to a more transparent system, inexperienced members discuss a wider range of topics and make more references to data when discussing economic conditions (consistent with increased accountability); but they also speak more like Chairman Greenspan during policy discussions (consistent with increased conformity). Overall, the accountability effect appears stronger, as inexperienced members’ topics appear to be more influential in shaping future deliberation after transparency.

4.4 Nowcasting

Important variables such as unemployment, retail sales, and GDP are measured at low frequency, and estimates are released with a significant lag. Others, such as racial prejudice or local government corruption, are not captured by standard measures at all. Text produced online, such as search queries, social media posts, listings on job websites, and so on, can be used to construct alternative real-time estimates of the current values of these variables. By contrast with the standard exercise of forecasting future variables, this process of using diverse data sources to estimate current variables has been termed “nowcasting” in the literature (Bańbura et al. 2013).

A prominent early example is the Google Flu Trends project. Zeng and Wagner (2002) note that the volume of searches or web hits seeking information related to a disease may be a strong predictor of its prevalence. Johnson et al. (2004) provide an early data point suggesting that browsing influenza-related articles on the website healthlink.com is correlated with traditional surveillance data from the Centers for Disease Control (CDC). In the late 2000s, a group of Google engineers built on this idea to create a product that predicts flu prevalence from Google searches using text regression.

The results are reported in a widely cited Nature article by Ginsberg et al. (2009). Their raw data consist of “hundreds of billions of individual searches from 5 years of Google web search logs.” Aggregated search counts are arranged into a vector c_i, where a document i is defined to be a particular US region in a particular week, and the outcome of interest v_i is the true prevalence of flu in the region–week. In the training data, this is taken to be equal to the rate measured by the CDC. The authors first restrict attention to the fifty million most common terms, then select those most diagnostic of an outbreak using text regression (section 3.1), specifically a variant of partial least squares regression. They first run fifty million univariate regressions of log(v_i/(1 − v_i)) on log(c_ij/(1 − c_ij)), where c_ij is the share of searches in i containing search term j. They then fit a sequence of multivariate regression models of v_i on the top n terms j, as ranked by average predictive power across regions, for n ∈ {1, 2, …}. Next, they select the value of n that yields the best fit on a hold-out sample. This yields a regression model with n = 45 terms. The model produces accurate flu rate estimates for all regions approximately 1–2 weeks ahead of the CDC’s regular report publication dates.25

25 A number of subsequent papers debate the longer-term performance of Google Flu Trends. Lazer et al. (2014), for example, show that the accuracy of the Google Flu Trends model—which has not been recalibrated or updated based on more recent data—has deteriorated dramatically, and that in recent years it is outperformed by simple extrapolation from prior CDC estimates. This may reflect changes in both search patterns and the epidemiology of the flu, and it suggests a general lesson that the predictive relationship mapping text to a real outcome of interest may not be stable over time. On the other hand, Preis and Moat (2014) argue that an adaptive version of the model that more flexibly accounts for joint dynamics in flu incidence and search volume significantly improves real-time influenza monitoring.
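A heavily simplified sketch of this select-then-validate procedure. I rank terms by the fit of their univariate logit-space regressions and, as a stand-in for Ginsberg et al.'s multivariate refit, average the top-n univariate predictions; all data are invented:

```python
import math

def logit(p):
    return math.log(p / (1 - p))

def ols(x, y):
    # Univariate OLS of y on x; returns (intercept, slope)
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = sum((a - c) * (u - my) for a, c, u in zip(x, [mx] * n, y)) / \
        sum((a - mx) ** 2 for a in x)
    return my - b * mx, b

def select_terms(shares_tr, v_tr, shares_ho, v_ho):
    """shares_*[i][j]: share of searches in region-week i containing term j;
    v_*: measured flu rates. Returns indices of the chosen terms."""
    y = [logit(v) for v in v_tr]
    d = len(shares_tr[0])
    fits = [ols([logit(row[j]) for row in shares_tr], y) for j in range(d)]
    # Rank terms by in-sample squared error of their univariate fit
    def sse(j):
        a, b = fits[j]
        return sum((a + b * logit(row[j]) - yi) ** 2
                   for row, yi in zip(shares_tr, y))
    order = sorted(range(d), key=sse)
    # Choose n by squared error on the hold-out sample
    yh = [logit(v) for v in v_ho]
    best_n, best_err = 1, float("inf")
    for n in range(1, d + 1):
        preds = [sum(fits[j][0] + fits[j][1] * logit(row[j])
                     for j in order[:n]) / n
                 for row in shares_ho]
        err = sum((p - t) ** 2 for p, t in zip(preds, yh))
        if err < best_err:
            best_n, best_err = n, err
    return order[:best_n]

# Term 0 tracks the flu rate; terms 1 and 2 are noise
v_tr = [0.02, 0.05, 0.10, 0.15]
shares_tr = [[0.010, 0.012, 0.020],
             [0.025, 0.011, 0.010],
             [0.050, 0.012, 0.020],
             [0.075, 0.011, 0.010]]
v_ho = [0.03, 0.08]
shares_ho = [[0.015, 0.012, 0.015],
             [0.040, 0.011, 0.015]]
chosen = select_terms(shares_tr, v_tr, shares_ho, v_ho)
```

On this toy data the hold-out step keeps only the genuinely informative term, mirroring the role of the validation sample in pinning down n.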
Related work in economics attempts to nowcast macroeconomic variables using data on the frequency of Google search terms. In Choi and Varian (2012) and Scott and Varian (2014, 2015), search term counts are aggregated by week and by geographic location, then converted to location-specific frequency indices. They estimate spike-and-slab Bayesian forecasting models, discussed in section 3.1.4 above. Forecasts of regional retail sales, new housing starts, and tourism activity are all significantly improved by incorporating a few search term indices that are relevant for each category in linear models. Their results suggest a potential for large gains in forecasting power using web browser search data.

Saiz and Simonsohn (2013) use web search results to estimate the current extent of corruption in US cities. Standard corruption measures based on surveys are available at the country and state level, but not for smaller geographies. The authors use a dictionary approach in which the index v̂_i of corruption is defined to be the ratio of search hits for the name of a geographic area i plus the word “corruption” divided by hits for the name of the geographic area alone. These counts are extracted from search engine results. As a validation, the authors first show that country-level and state-level versions of their measure correlate strongly with established corruption indices and covary in a similar way with country- and state-level demographics. They then compute their measure for US cities and study its observable correlates.

Stephens-Davidowitz (2014) uses the frequency of racially charged terms in Google searches to estimate levels of racial animus in different areas of the United States. Estimating animus via traditional surveys is challenging because individuals are often reluctant to state their true attitudes. The paper’s results suggest Google searches provide a less filtered, and therefore more accurate, measure. The author uses a dictionary approach in which the index v̂_i of racial animus in area i is the share of searches originating in that area that contain a set of racist words. He then uses these measures to estimate the impact of racial animus on votes for Barack Obama in the 2008 election, finding a statistically significant and economically large negative effect on Obama’s vote share relative to the Democratic vote share in the previous election.

4.5 Policy Uncertainty

Among the most influential applications of text analysis in the economics literature to date is a measure of economic policy uncertainty (EPU) developed by Baker, Bloom, and Davis (2016). Uncertainty about both the path of future government policies and the impact of current government policies has the potential to increase risk for economic actors and so potentially depress investment and other economic activity. The authors use text from news outlets to provide a high-frequency measure of EPU and then estimate its economic effects.

Baker, Bloom, and Davis (2016) define the unit of observation i to be a country–month. The outcome v_i of interest is the true level of economic policy uncertainty. The authors apply a dictionary method to produce estimates v̂_i based on digital archives of ten leading newspapers in the United States. An element of the input data c_ij is a count of the number of articles in newspaper j in country–month i containing at least one keyword from each of three categories defined by hand: one related to the economy, a second related to policy, and a third related to uncertainty. The raw counts are scaled by the total number of articles in the corresponding newspaper–month and normalized to have standard deviation one. The predicted value v̂_i is then defined to be a simple average of these scaled counts across newspapers.
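The scale-normalize-average construction of an EPU-style index can be sketched directly; the counts are invented, and any further rescaling of the published series is omitted here:

```python
import math

def epu_index(counts, totals):
    """counts[j][t]: qualifying EPU articles in newspaper j, month t;
    totals[j][t]: all articles in newspaper j, month t."""
    J, T = len(counts), len(counts[0])
    # Scale by newspaper-month article volume
    shares = [[counts[j][t] / totals[j][t] for t in range(T)] for j in range(J)]
    # Normalize each newspaper's series to unit standard deviation
    scaled = []
    for s in shares:
        m = sum(s) / T
        sd = math.sqrt(sum((x - m) ** 2 for x in s) / T)
        scaled.append([x / sd for x in s])
    # Simple average across newspapers
    return [sum(scaled[j][t] for j in range(J)) / J for t in range(T)]

# Two hypothetical newspapers over three months, with rising EPU coverage
index = epu_index([[1, 2, 4], [2, 4, 8]], [[100, 100, 100], [200, 200, 200]])
```

The per-newspaper standardization keeps a single prolific (or terse) outlet from dominating the average.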
The simplicity of the manner in which the index is created allows for a high degree of flexibility across a broad range of applications. For instance, by including a fourth, policy-specific category of keywords, the authors can estimate narrower indices related to Federal Reserve policy, inflation, and so on. Baker, Bloom, and Davis (2016) validate v̂_i using a human audit of twelve thousand articles from 1900–2012. Teams manually scored articles on the extent to which they discuss economic policy uncertainty and the specific policies they relate to. The resulting human-coded index has a high correlation with v̂_i.

With the estimated v̂_i in hand, the authors analyze the micro- and macro-level effects of EPU. Using firm-level regressions, they first measure how firms respond to this uncertainty and find that it leads to reduced employment and investment, and greater asset price volatility, for that firm. Then, using both US and international panel VAR models, the authors find that increased v̂_i is a strong predictor of lower investment, employment, and production.

Hassan et al. (2017) measure political risk at the firm level by analyzing quarterly earnings call transcripts. Their measure captures the frequency with which policy-oriented language and “risk” synonyms co-occur in a transcript. Firms with high levels of political risk actively hedge these risks by lobbying more intensively and donating more to politicians. When a firm’s political risk rises, it tends to retrench hiring and investment, consistent with the findings of Baker, Bloom, and Davis (2016) at the aggregate level. Their findings indicate that political shocks are an important source of idiosyncratic firm-level risk.

4.6 Media Slant

A text analysis problem that has received significant attention in the social science literature is measuring the political slant of media content. Media outlets have long been seen as having a uniquely important role in the political process, with the power to potentially sway both public opinion and policy. Understanding how and why media outlets slant the information they present is important to understanding the role media play in practice, and to informing the large body of government regulation designed to preserve a diverse range of political perspectives.

Groseclose and Milyo (2005) offer a pioneering application of text analysis methods to this problem. In their setting, i indexes a set of large US media outlets, and documents are defined to be the complete news text or broadcast transcripts for an outlet i. The outcome of interest v_i is the political slant of outlet i. To give this measure content, the authors use speeches by politicians in the US Congress to form a training sample, and define v_i within this sample to be a politician’s Americans for Democratic Action (ADA) score, a measure of left–right political ideology based on congressional voting records. The predicted values v̂_i for the media outlets thus place them on the same left–right scale as the politicians, and answer the question “what kind of politician does this news outlet’s content sound most similar to?”

The raw data are the full text of speeches by congresspeople and news reports by media outlets over a period spanning the 1990s to the early 2000s.26 The authors dramatically reduce the dimensionality of the data in an initial step by deciding to focus on a particularly informative subset of phrases: the names of two hundred think tanks. These think tanks are widely viewed as having clear political positions (e.g., the Heritage Foundation on the right and the NAACP on the left). The relative frequency

26 For members of Congress, the authors use all entries in the Congressional Record from January 1, 1993 to December 31, 2002. The text includes both floor speeches and documents the member chose to insert in the record but did not read on the floor. For news outlets, the time period covered is different for different outlets, with start dates as early as January 1990 and end dates as late as July 2004.
with which a politician cites conservative as opposed to liberal think tanks turns out to be strongly correlated with a politician’s ideology. The paper’s premise is that the citation frequencies of news outlets will then provide a good index of those outlets’ political slants. The features of interest c_i are a (1 × 50) vector of citation counts for each of forty-four highly cited think tanks plus six groups of smaller think tanks.

The text analysis is based on a supervised generative model (section 3.2.2). The utility that congress member or media firm i derives from citing think tank j is U_ij = a_j + b_j v_i + e_ij, where v_i is the observable ADA score of a congress member i or the unobserved slant of media outlet i, and e_ij is an error distributed type-I extreme value. The coefficient b_j captures the extent to which think tank j is cited relatively more by conservatives. The model is fit by maximum likelihood with the parameters (a_j, b_j) and the unknown slants v_i estimated jointly. This is an efficient but computationally intensive approach to estimation, and it constrains the authors’ focus to twenty outlets. This limitation can be sidestepped using more recent approaches such as Taddy’s (2013b) multinomial inverse regression.

Figure 4 shows the results, which suggest three main findings. First, the media outlets are all relatively centrist: they are all to the left of the average Republican and to the right of the average Democrat, with one exception. Second, the ordering matches conventional wisdom, with the New York Times and Washington Post on the left, and Fox News and the Washington Times on the right.27 Third, the large majority of outlets fall to the left of the average in Congress, which is denoted in the figure by “average US voter.” The last fact underlies the authors’ main conclusion, which is that there is an overall liberal bias in the media.

Gentzkow and Shapiro (2010) build on the Groseclose and Milyo (2005) approach to measure the slant of 433 US daily newspapers. The main difference in approach is that Gentzkow and Shapiro (2010) omit the initial step that restricts the space of features to mentions of think tanks, and instead consider all phrases that appear in the 2005 Congressional Record as potential predictors, letting the data select those that are most diagnostic of ideology. These could potentially be think tank names, but they turn out instead to be politically charged phrases such as “death tax,” “bring our troops home,” and “war on terror.”

After standard preprocessing—stemming and omitting stop words—the authors produce counts of all 2-grams and 3-grams by speaker. They then select the top one thousand phrases (five hundred of each length) by a χ² criterion that captures the degree to which each phrase is diagnostic of the speaker’s party. This is the standard χ²-test statistic for the null hypothesis that phrase j is used equally often by Democrats and Republicans, and it will be high for phrases that are both used frequently and used asymmetrically by the parties.28 Next, a two-stage supervised generative method is used to predict newspaper slant v_i from the selected features.

27 The one notable exception is the Wall Street Journal, which is generally considered to be right-of-center but which is estimated by Groseclose and Milyo (2005) to be the most left-wing outlet in their sample. This may reflect an idiosyncrasy specific to the way they cite think tanks; Gentzkow and Shapiro (2010) use a broader sample of text features and estimate a much more conservative slant for the Wall Street Journal.

28 The statistic is

  χ²_j = (f_jr f_∼jd − f_jd f_∼jr)² / [(f_jr + f_jd)(f_jr + f_∼jd)(f_∼jr + f_jd)(f_∼jr + f_∼jd)],

where f_jd and f_jr denote the number of times phrase j is used by Democrats or Republicans, respectively, and f_∼jd and f_∼jr denote the number of times phrases other than j are used by Democrats and Republicans, respectively.
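The χ² screening statistic is simple to compute from party-level phrase counts; a sketch (the phrase counts are invented, not Gentzkow and Shapiro's data):

```python
def chi2(f_jr, f_jd, total_r, total_d):
    """Pearson statistic for phrase j. f_jr / f_jd: counts of phrase j by
    Republicans / Democrats; total_*: total phrase counts by party, so the
    '~j' counts are the totals minus the phrase's own count."""
    f_njr, f_njd = total_r - f_jr, total_d - f_jd
    num = (f_jr * f_njd - f_jd * f_njr) ** 2
    den = (f_jr + f_jd) * (f_jr + f_njd) * (f_njr + f_jd) * (f_njr + f_njd)
    return num / den

# Hypothetical (Republican, Democrat) counts for four phrases
counts = {"death tax": (60, 8), "war on terror": (45, 12),
          "the budget": (40, 41), "bring our troops home": (9, 50)}
total_r = sum(r for r, d in counts.values())
total_d = sum(d for r, d in counts.values())
# Most diagnostic phrases first
ranked = sorted(counts, key=lambda p: -chi2(*counts[p], total_r, total_d))
```

A phrase used symmetrically by the parties scores zero, while frequent, one-sided phrases rise to the top of the ranking.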
[Figure 4 appears here. Source: Groseclose and Milyo (2005). Reprinted with permission from the Quarterly Journal of Economics.]
In the first stage, the authors run a separate regression for each phrase j of counts (c_ij) on speaker i’s ideology, which is measured as the 2004 Republican vote share in the speaker’s district. They then use the estimated coefficients β̂_j to produce predicted slant v̂_i ∝ ∑_{j=1}^{1,000} β̂_j c_ij for the unknown newspapers i.29

The main focus of the study is characterizing the incentives that drive newspapers’ choice of slant. With the estimated v̂_i in hand, the authors estimate a model of consumer demand in which a consumer’s utility from reading newspaper i depends on the distance between i’s slant v_i and an ideal slant v*, which is greater the more conservative the consumer’s ideology. Estimates of this model using zipcode-level circulation data imply a level of slant that newspapers would choose if their only incentive was to maximize profits. The authors then compare this profit-maximizing slant to the level actually chosen by newspapers, and ask whether the deviations can be predicted by the identity of the newspaper’s owner or by other nonmarket factors such as the party of local incumbent politicians. They find that profit maximization fits the data well, and that ownership plays no role in explaining the choice of slant. In this study, v̂_i is both an independent variable of interest (in the demand analysis) and an outcome of interest (in the supply analysis).

Note that both Groseclose and Milyo (2005) and Gentzkow and Shapiro (2010) use a two-step procedure where they reduce the dimensionality of the data in a first stage and then estimate a predictive model in the second. Taddy (2013b) shows how to combine a more sophisticated generative model with a novel algorithm for estimation to estimate the predictive model in a single step using the full set of phrases in the data. He shows that this substantially increases the in-sample predictive power of the measure.

Greenstein, Gu, and Zhu (2016) analyze the extent of bias and slant among Wikipedia contributors using similar methods. They find that contributors tend to edit articles with slants in opposition to their own slants. They also show that contributors’ slants become less extreme as they become more experienced, and that the bias reduction is largest for those with the most extreme initial biases.

4.7 Market Definition and Innovation Impact

Many important questions in industrial organization hinge on the appropriate definition of product markets. Standard industry definitions can be an imperfect proxy for the economically relevant concept. Hoberg and Phillips (2016) provide a novel way of classifying industries based on product descriptions in the text of company disclosures. This allows for flexible industry classifications that may vary over time as firms and economies evolve, and allows the researchers to analyze the effect of shocks on competition and product offerings.

Each publicly traded firm in the United States must file an annual 10-K report describing, among other aspects of their business, the products that they offer. The unit of analysis i is a firm–year. Token counts from the business description section of the ith 10-K filing are represented in the vector c_i. A pairwise cosine similarity score, s_ij, based on the angle between c_i and c_j, describes the closeness of product offerings for each pair i and j in the same filing year. Industries are then defined by clustering firms according to their cosine similarities. The clustering algorithm begins by assuming each firm is its own industry, and gradually agglomerates firms into industries by grouping a firm to the cluster with its nearest neighbor according to s_ij.

29 As Taddy (2013b) notes, this method (which Gentzkow and Shapiro 2010 derive in an ad hoc fashion) is essentially partial least squares. It differs from the standard implementation in that the variables v_i and c_ij would normally be standardized. Taddy (2013b) shows that doing so increases the in-sample predictive power of the measure from 0.37 to 0.57.
The algorithm terminates when the number of industries (clusters) reaches three hundred, a number chosen for comparability to Standard Industrial Classification and North American Industrial Classification System codes.30 After establishing an industry assignment for each firm–year, v̂_i, the authors examine the effect of military and software industry shocks to competition and product offerings among firms. As an example, they find that the events of September 11, 2001, increased entry in high-demand military markets and pushed products in this industry toward “non-battlefield information gathering and products intended for potential ground conflicts.”

In a similar vein, Kelly et al. (2018) use cosine similarity among patent documents to create new indicators of patent quality. They assign higher quality to patents that are novel, in that they have low similarity with the existing stock of patents, and impactful, in that they have high similarity with subsequent patents. They then show that text-based novelty and similarity scores correlate strongly with measures of market value. Atalay et al. (2017) use text from job ads to measure task content and use their measure to show that within-occupation task content shifts are at least as important as employment shifts across occupations in describing the large secular reallocation of routine tasks from humans to machines.

4.8 Topics in Research, Politics, and Law

A number of studies apply topic models (section 3.2.1) to describe how the focus of attention in a specific text corpus shifts over time.

A seminal contribution in this vein is Blei and Lafferty’s (2007) analysis of topics in Science. Documents i are individual articles, the data c_i are counts of individual words, and the outcome of interest v_i is a vector of weights indicating the share of a given article devoted to each of one hundred latent topics. The authors extend the baseline LDA model of Blei, Ng, and Jordan (2003) to allow the importance of one topic in a particular article to be correlated with the presence of other topics. They fit the model using all Science articles from 1990–99. The results deliver an automated classification of article content into semantically coherent topics such as evolution, DNA and genetics, cellular biology, and volcanoes.

Applying similar methods in the political domain, Quinn et al. (2010) use a topic model to identify the issues being discussed in the US Senate over the period 1997–2004. Their approach deviates from the baseline LDA model in two ways. First, they assume that each speech is associated with a single topic. Second, their model incorporates time-series dynamics that allow the proportion of speeches generated by a given topic to gradually evolve over the sample, similar to the dynamic topic model of Blei and Lafferty (2006). Their preferred specification is a model with forty-two topics, a number chosen to maximize the subjective interpretability of the resulting topics.

Table 1 shows the words with the highest weights in each of twelve fitted topics. The labels “Judicial Nominations,” “Constitutional,” and so on are assigned by hand by the authors. The results suggest that the automated procedure successfully isolates coherent topics of congressional debates. After discussing the structure of topics in the fitted model, the authors then track the relative importance of the topics across congressional sessions and argue that spikes in discussion of particular topics track, in an intuitive way, the occurrence of important debates and external events.

30 Best, Hjort, and Szakonyi (2017) use a similar approach to classify products in their study of public pro-

Sim, Routledge, and Smith (2015) esti-
curement and organizational bureaucracy in Russia. mate a topic model from the text of amicus
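Both the pairwise similarity sij behind the industry clustering above and the Kelly et al. (2018) patent novelty and impact scores reduce to cosine similarity between word-count vectors. A minimal sketch follows; the documents and vocabulary are hypothetical, and it omits the standardization of the count variables that Taddy (2013b) shows improves fit:

```python
import math
from collections import Counter

def cosine_sim(a, b):
    """Cosine similarity between two bag-of-words count vectors."""
    dot = sum(n * b[w] for w, n in a.items())
    norm = lambda c: math.sqrt(sum(n * n for n in c.values()))
    na, nb = norm(a), norm(b)
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical word counts for three patent abstracts.
prior_art = Counter("radar antenna signal radar".split())
new_patent = Counter("radar signal receiver".split())
unrelated = Counter("protein cell assay".split())

# A patent is "novel" when its similarity to the existing stock is low,
# and "impactful" when its similarity to subsequent patents is high.
print(round(cosine_sim(new_patent, prior_art), 3))  # → 0.707 (high overlap)
print(round(cosine_sim(new_patent, unrelated), 3))  # → 0.0 (no shared words)
```

A single-linkage agglomeration of firms, as in the clustering above, would then repeatedly merge the pair of clusters whose members have the highest such similarity until the target number of industries remains.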
TABLE 1
Congressional Record Topics and Key Words
1. Judicial nominations nomine, confirm, nomin, circuit, hear, court, judg, judici, case, vacanc
2. Constitutional case, court, attornei, supreme, justic, nomin, judg, m, decis, constitut
3. Campaign finance campaign, candid, elect, monei, contribut, polit, soft, ad, parti, limit
4. Abortion procedur, abort, babi, thi, life, doctor, human, ban, decis, or
5. Crime 1 [violent] enforc, act, crime, gun, law, victim, violenc, abus, prevent, juvenil
6. Child protection gun, tobacco, smoke, kid, show, firearm, crime, kill, law, school
7. Health 1 [medical] diseas, cancer, research, health, prevent, patient, treatment, devic, food
8. Social welfare care, health, act, home, hospit, support, children, educ, student, nurs
9. Education school, teacher, educ, student, children, test, local, learn, district, class
10. Military 1 [manpower] veteran, va, forc, militari, care, reserv, serv, men, guard, member
11. Military 2 [infrastructure] appropri, defens, forc, report, request, confer, guard, depart, fund, project
12. Intelligence intellig, homeland, commiss, depart, agenc, director, secur, base, defens
Source: Quinn et al. (2010). Reprinted with permission from John Wiley and Sons.
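Topic assignments like those summarized in Table 1 come from models in the LDA family. As a minimal illustration, a collapsed Gibbs sampler for baseline LDA (Blei, Ng, and Jordan 2003) is sketched below on a hypothetical toy corpus; this is not the single-membership, dynamic specification Quinn et al. (2010) actually estimate:

```python
import random
from collections import Counter

def lda_gibbs(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for baseline LDA; docs is a list of token lists."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    # z[d][n] is the topic assigned to token n of document d
    z = [[rng.randrange(n_topics) for _ in doc] for doc in docs]
    n_dk = [[0] * n_topics for _ in docs]        # document-topic counts
    n_kw = [Counter() for _ in range(n_topics)]  # topic-word counts
    n_k = [0] * n_topics                         # total tokens per topic
    for d, doc in enumerate(docs):
        for n, w in enumerate(doc):
            k = z[d][n]
            n_dk[d][k] += 1; n_kw[k][w] += 1; n_k[k] += 1
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                k = z[d][n]  # remove the current assignment from the counts
                n_dk[d][k] -= 1; n_kw[k][w] -= 1; n_k[k] -= 1
                # resample from the full conditional p(z_dn = j | everything else)
                weights = [(n_dk[d][j] + alpha) * (n_kw[j][w] + beta)
                           / (n_k[j] + vocab_size * beta) for j in range(n_topics)]
                k = rng.choices(range(n_topics), weights=weights)[0]
                z[d][n] = k
                n_dk[d][k] += 1; n_kw[k][w] += 1; n_k[k] += 1
    # return the highest-weight words per topic, as in Table 1
    return [[w for w, _ in n_kw[k].most_common(3)] for k in range(n_topics)]

# Toy corpus of word stems echoing the science and judicial topics above.
docs = [["gene", "dna", "cell"] * 4, ["court", "judg", "nomin"] * 4,
        ["dna", "genom", "cell"] * 4, ["judg", "court", "vote"] * 4]
top_words = lda_gibbs(docs, n_topics=2)
```

With alpha and beta small, the sampler tends to concentrate each stem family in one of the two topics, mirroring how the fitted model behind Table 1 surfaces its highest-weight key words.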
briefs to the Supreme Court of the United States. They show that the overall topical composition of briefs for a given case, particularly along a conservative–liberal dimension, is highly predictive for how individual judges vote in the case.

5. Conclusion

Digital text provides a rich repository of information about economic and social activity. Modern statistical tools give researchers the ability to extract this information and encode it in a quantitative form amenable to descriptive or causal analysis. Both the availability of text data and the frontier of methods are expanding rapidly, and we expect the importance of text in empirical economics to continue to grow.

The review of applications above suggests a number of areas where innovation should proceed rapidly in coming years. First, a large share of text analysis applications continue to rely on ad hoc dictionary methods rather than deploying more sophisticated methods for feature selection and model training. As we have emphasized, dictionary methods are appropriate in cases where prior information is strong and the availability of appropriately labeled training data is limited. Experience in other fields, however, suggests that modern methods will likely outperform ad hoc approaches in a substantial share of cases.

Second, some of the workhorse methods of text analysis such as penalized linear or logistic regression have still seen limited application in social science. In other contexts, these methods provide a robust baseline that performs similarly to or better than more complex methods. We expect the domains in which these methods are applied to grow.

Finally, virtually all of the methods applied to date, including those we would label as sophisticated or on the frontier, are based on fitting predictive models to simple counts of text features. Richer representations, such as word embeddings (3.3), and linguistic models that draw on natural language processing
tools have seen tremendous success elsewhere, and we see great potential for their application in economics.

The rise of text analysis is part of a broader trend toward greater use of machine learning and related statistical methods in economics. With the growing availability of high-dimensional data in many domains—from consumer purchase and browsing behavior, to satellite and other spatial data, to genetics and neuro-economics—the returns are high to economists investing in learning these methods and to increasing the flow of ideas between economics and fields such as statistics and computer science, where frontier innovations in these methods are taking place.

References

Airoldi, Edoardo M., and Jonathan M. Bischof. 2016. “Improving and Evaluating Topic Models and Other Models of Text.” Journal of the American Statistical Association 111 (516): 1381–403.
Airoldi, Edoardo M., Elena A. Erosheva, Stephen E. Fienberg, Cyrille Joutard, Tanzy Love, and Suyash Shringarpure. 2010. “Reconceptualizing the Classification of PNAS Articles.” Proceedings of the National Academy of Sciences of the United States of America 107 (49): 20899–904.
Akaike, H. 1973. “Information Theory and an Extension of the Maximum Likelihood Principle.” In Proceedings of the 2nd International Symposium on Information Theory, edited by B. N. Petrov and F. Csaki, 267–81. Budapest: Akademiai Kiado.
Angrist, Joshua D., and Alan B. Krueger. 2001. “Instrumental Variables and the Search for Identification: From Supply and Demand to Natural Experiments.” Journal of Economic Perspectives 15 (4): 69–85.
Antweiler, Werner, and Murray Z. Frank. 2004. “Is All That Talk Just Noise? The Information Content of Internet Stock Message Boards.” Journal of Finance 59 (3): 1259–94.
Armagan, Artin, David B. Dunson, and Jaeyong Lee. 2013. “Generalized Double Pareto Shrinkage.” Statistica Sinica 23 (1): 119–43.
Atalay, Enghin, Phai Phongthiengtham, Sebastian Sotelo, and Daniel Tannenbaum. 2017. “The Evolving U.S. Occupational Structure.” Washington Center for Equitable Growth Working Paper 12052017.
Athey, Susan, and Guido Imbens. 2016. “Recursive Partitioning for Heterogeneous Causal Effects.” Proceedings of the National Academy of Sciences of the United States of America 113 (27): 7353–60.
Bai, Jushan, and Serena Ng. 2008. “Forecasting Economic Time Series Using Targeted Predictors.” Journal of Econometrics 146 (2): 304–17.
Baker, Scott R., Nicholas Bloom, and Steven J. Davis. 2016. “Measuring Economic Policy Uncertainty.” Quarterly Journal of Economics 131 (4): 1593–636.
Bańbura, Marta, Domenico Giannone, Michele Modugno, and Lucrezia Reichlin. 2013. “Now-Casting and the Real-Time Data Flow.” In Handbook of Economic Forecasting, Vol. 2A, edited by Allan Timmermann and Graham Elliot, 195–237. Sebastopol: O’Reilly Media.
Bandiera, Oriana, Stephen Hansen, Andrea Prat, and Raffaella Sadun. 2017. “CEO Behavior and Firm Performance.” NBER Working Paper 23248.
Belloni, Alexandre, Victor Chernozhukov, and Christian B. Hansen. 2013. “Inference for High-Dimensional Sparse Econometric Models.” In Advances in Economics and Econometrics: Tenth World Congress, Vol. 3, 245–95. Cambridge: Cambridge University Press.
Best, Michael Carlos, Jonas Hjort, and David Szakonyi. 2017. “Individuals and Organizations as Sources of State Effectiveness.” NBER Working Paper 23350.
Bickel, Peter J., Ya’acov Ritov, and Alexandre B. Tsybakov. 2009. “Simultaneous Analysis of Lasso and Dantzig Selector.” Annals of Statistics 37 (4): 1705–32.
Bishop, Christopher M. 1995. Neural Networks for Pattern Recognition. Oxford: Oxford University Press.
Bishop, Christopher M. 2006. Pattern Recognition and Machine Learning. Berlin: Springer.
Blei, David M. 2012. “Probabilistic Topic Models.” Communications of the ACM 55 (4): 77–84.
Blei, David M., and John D. Lafferty. 2006. “Dynamic Topic Models.” In Proceedings of the 23rd International Conference on Machine Learning, edited by W. Cohen and A. Moore, 113–20. New York: Association for Computing Machinery.
Blei, David M., and John D. Lafferty. 2007. “A Correlated Topic Model of Science.” Annals of Applied Statistics 1 (1): 17–35.
Blei, David M., and Jon D. McAuliffe. 2007. “Supervised Topic Models.” In Proceedings of the 20th International Conference on Neural Information Processing Systems, edited by J. C. Platt, D. Koller, Y. Singer, and S. T. Roweis, 121–28. Red Hook: Curran Associates.
Blei, David M., Andrew Y. Ng, and Michael I. Jordan. 2003. “Latent Dirichlet Allocation.” Journal of Machine Learning Research 3: 993–1022.
Bollen, Johan, Huina Mao, and Xiaojun Zeng. 2011. “Twitter Mood Predicts the Stock Market.” Journal of Computational Science 2 (1): 1–8.
Bolukbasi, Tolga, Kai-Wei Chang, James Y. Zou, Venkatesh Saligrama, and Adam T. Kalai. 2016. “Man Is to Computer Programmer as Woman Is to Homemaker? Debiasing Word Embeddings.” In Proceedings of the 29th International Conference on Neural Information Processing Systems, edited by D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, 4349–57. Red Hook: Curran Associates.
Born, Benjamin, Michael Ehrmann, and Marcel
Fratzscher. 2014. “Central Bank Communication on Financial Stability.” Economic Journal 124 (577): 701–34.
Boudoukh, Jacob, Ronen Feldman, Shimon Kogan, and Matthew Richardson. 2016. “Information, Trading, and Volatility: Evidence from Firm-Specific News.” Available on SSRN at 2193667.
Breiman, Leo. 2001. “Random Forests.” Machine Learning 45 (1): 5–32.
Breiman, Leo, Jerome H. Friedman, Richard A. Olshen, and Charles J. Stone. 1984. Classification and Regression Trees. Routledge: Taylor and Francis.
Buehlmaier, Matthias M., and Toni M. Whited. 2018. “Are Financial Constraints Priced? Evidence from Textual Analysis.” Review of Financial Studies 31 (7): 2693–728.
Bühlmann, Peter, and Sara van de Geer. 2011. Statistics for High-Dimensional Data. Berlin: Springer.
Candès, Emmanuel J., Michael B. Wakin, and Stephen P. Boyd. 2008. “Enhancing Sparsity by Reweighted ℓ1 Minimization.” Journal of Fourier Analysis and Applications 14 (5–6): 877–905.
Carvalho, Carlos M., Nicholas G. Polson, and James G. Scott. 2010. “The Horseshoe Estimator for Sparse Signals.” Biometrika 97 (2): 465–80.
Chen, Danqi, and Christopher Manning. 2014. “A Fast and Accurate Dependency Parser Using Neural Networks.” In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, 740–50. Stroudsburg: Association for Computational Linguistics.
Chernozhukov, Victor, et al. 2018. “Double/Debiased Machine Learning for Treatment and Structural Parameters.” Econometrics Journal 21 (1): C1–C68.
Chinco, Alexander M., Adam D. Clark-Joseph, and Mao Ye. 2017. “Sparse Signals in the Cross-Section of Returns.” NBER Working Paper 23933.
Choi, Hyunyoung, and Hal Varian. 2012. “Predicting the Present with Google Trends.” Economic Record 88 (S1): 2–9.
Cook, R. Dennis. 2007. “Fisher Lecture: Dimension Reduction in Regression.” Statistical Science 22 (1): 1–26.
Cowles, Alfred. 1933. “Can Stock Market Forecasters Forecast?” Econometrica 1 (3): 309–24.
Das, Sanjiv R., and Mike Y. Chen. 2007. “Yahoo! for Amazon: Sentiment Extraction from Small Talk on the Web.” Management Science 53 (9): 1375–88.
Deerwester, Scott, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, and Richard Harshman. 1990. “Indexing by Latent Semantic Analysis.” Journal of the Association for Information Science 41 (6): 391–407.
Denny, Matthew J., and Arthur Spirling. 2018. “Text Preprocessing for Unsupervised Learning: Why It Matters, When It Misleads, and What to Do about It.” Political Analysis 26 (2): 168–89.
Efron, B. 1979. “Bootstrap Methods: Another Look at the Jackknife.” Annals of Statistics 7 (1): 1–26.
Efron, B. 2004. “The Estimation of Prediction Error: Covariance Penalties and Cross-Validation.” Journal of the American Statistical Association 99 (467): 619–32.
Efron, B. 2012. “Bayesian Inference and the Parametric Bootstrap.” Annals of Applied Statistics 6 (4): 1971–97.
Engelberg, Joseph E., and Christopher A. Parsons. 2011. “The Causal Impact of Media in Financial Markets.” Journal of Finance 66 (1): 67–97.
Evans, James A., and Pedro Aceves. 2016. “Machine Translation: Mining Text for Social Theory.” Annual Review of Sociology 42: 21–50.
Fan, Jianqing, and Runze Li. 2001. “Variable Selection via Nonconcave Penalized Likelihood and Its Oracle Properties.” Journal of the American Statistical Association 96 (456): 1348–60.
Fan, Jianqing, Lingzhou Xue, and Hui Zou. 2014. “Strong Oracle Optimality of Folded Concave Penalized Estimation.” Annals of Statistics 42 (3): 819–49.
Flynn, Cheryl J., Clifford M. Hurvich, and Jeffrey S. Simonoff. 2013. “Efficiency for Regularization Parameter Selection in Penalized Likelihood Estimation of Misspecified Models.” Journal of the American Statistical Association 108 (503): 1031–43.
Foster, Dean P., Mark Liberman, and Robert A. Stine. 2013. “Featurizing Text: Converting Text into Predictors for Regression Analysis.” http://www-stat.wharton.upenn.edu/~stine/research/regressor.pdf.
Friedman, Jerome H. 2002. “Stochastic Gradient Boosting.” Computational Statistics and Data Analysis 38 (4): 367–78.
Gentzkow, Matthew, and Jesse M. Shapiro. 2010. “What Drives Media Slant? Evidence from U.S. Daily Newspapers.” Econometrica 78 (1): 35–71.
Gentzkow, Matthew, Jesse M. Shapiro, and Matt Taddy. 2016. “Measuring Group Differences in High-Dimensional Choices: Method and Application to Congressional Speech.” NBER Working Paper 22423.
George, Edward I., and Robert E. McCulloch. 1993. “Variable Selection via Gibbs Sampling.” Journal of the American Statistical Association 88 (423): 881–89.
Ginsberg, Jeremy, Matthew H. Mohebbi, Rajan S. Patel, Lynnette Brammer, Mark S. Smolinski, and Larry Brilliant. 2009. “Detecting Influenza Epidemics Using Search Engine Query Data.” Nature 457 (7232): 1012–14.
Goldberg, Yoav. 2016. “A Primer on Neural Network Models for Natural Language Processing.” Journal of Artificial Intelligence Research 57 (1): 345–420.
Goldberg, Yoav, and Jon Orwant. 2013. “A Dataset of Syntactic-Ngrams over Time from a Very Large Corpus of English Books.” In Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 1: Proceedings of the Main Conference and the Shared Task: Semantic Textual Similarity, edited by Mona Diab, Tim Baldwin, and Marco Baroni, 241–47. Stroudsburg: Association for Computational Linguistics.
Goodfellow, Ian, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. Cambridge: MIT Press.
Greenstein, Shane, Yuan Gu, and Feng Zhu. 2016.
“Ideological Segregation among Online Collaborators: Evidence from Wikipedians.” NBER Working Paper 22744.
Griffiths, Thomas L., and Mark Steyvers. 2004. “Finding Scientific Topics.” Proceedings of the National Academy of Sciences of the United States of America 101 (S1): 5228–35.
Grimmer, Justin. 2010. “A Bayesian Hierarchical Topic Model for Political Texts: Measuring Expressed Agendas in Senate Press Releases.” Political Analysis 18 (1): 1–35.
Grimmer, Justin, and Brandon M. Stewart. 2013. “Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts.” Political Analysis 21 (3): 267–97.
Groseclose, Tim, and Jeffrey Milyo. 2005. “A Measure of Media Bias.” Quarterly Journal of Economics 120 (4): 1191–237.
Hans, Chris. 2009. “Bayesian Lasso Regression.” Biometrika 96 (4): 835–45.
Hansen, Stephen, Michael McMahon, and Andrea Prat. 2018. “Transparency and Deliberation within the FOMC: A Computational Linguistics Approach.” Quarterly Journal of Economics 133 (2): 801–70.
Hassan, Tarek A., Stephan Hollander, Laurence van Lent, and Ahmed Tahoun. 2017. “Firm-Level Political Risk: Measurement and Effects.” NBER Working Paper 24029.
Hastie, Trevor, Robert Tibshirani, and Jerome Friedman. 2009. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. New York: Springer.
Hastie, Trevor, Robert Tibshirani, and Martin Wainwright. 2015. Statistical Learning with Sparsity: The Lasso and Generalizations. New York: Taylor and Francis.
Hoberg, Gerard, and Gordon Phillips. 2016. “Text-Based Network Industries and Endogenous Product Differentiation.” Journal of Political Economy 124 (5): 1423–65.
Hoerl, Arthur E., and Robert W. Kennard. 1970. “Ridge Regression: Biased Estimation for Nonorthogonal Problems.” Technometrics 12 (1): 55–67.
Hoffman, Matthew D., David M. Blei, Chong Wang, and John Paisley. 2013. “Stochastic Variational Inference.” Journal of Machine Learning Research 14 (1): 1303–47.
Hofmann, Thomas. 1999. “Probabilistic Latent Semantic Indexing.” In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 50–57. New York: ACM.
Horowitz, Joel L. 2003. “The Bootstrap in Econometrics.” Statistical Science 18 (2): 211–18.
Iyyer, Mohit, Peter Enns, Jordan Boyd-Graber, and Philip Resnik. 2014. “Political Ideology Detection Using Recursive Neural Networks.” In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, Vol. 1, edited by Kristina Toutanova and Hua Wu, 1113–22. Stroudsburg: Association for Computational Linguistics.
Jegadeesh, Narasimhan, and Di Wu. 2013. “Word Power: A New Approach for Content Analysis.” Journal of Financial Economics 110 (3): 712–29.
Joachims, Thorsten. 1998. “Text Categorization with Support Vector Machines: Learning with Many Relevant Features.” In 10th European Conference on Machine Learning, edited by Claire Nédellec and Céline Rouveirol, 137–42. Berlin: Springer.
Johnson, Heather A., et al. 2004. “Analysis of Web Access Logs for Surveillance of Influenza.” Studies in Health Technology and Informatics 107 (2): 1202–06.
Jurafsky, Daniel, and James H. Martin. 2009. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. 2nd ed. London: Pearson.
Kass, Robert E., and Larry Wasserman. 1995. “A Reference Bayesian Test for Nested Hypotheses and Its Relationship to the Schwarz Criterion.” Journal of the American Statistical Association 90 (431): 928–34.
Kelly, Bryan, Dimitris Papanikolaou, Amit Seru, and Matt Taddy. 2018. “Measuring Technological Innovation over the Long Run.” NBER Working Paper 25266.
Kelly, Bryan, and Seth Pruitt. 2013. “Market Expectations in the Cross-Section of Present Values.” Journal of Finance 68 (5): 1721–56.
Kelly, Bryan, and Seth Pruitt. 2015. “The Three-Pass Regression Filter: A New Approach to Forecasting Using Many Predictors.” Journal of Econometrics 186 (2): 294–316.
Knight, Keith, and Wenjiang Fu. 2000. “Asymptotics for Lasso-Type Estimators.” Annals of Statistics 28 (5): 1356–78.
Kogan, Shimon, Dimitry Levin, Bryan R. Routledge, Jacob S. Sagi, and Noah A. Smith. 2009. “Predicting Risk from Financial Reports with Regression.” In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, 272–80. Stroudsburg: Association for Computational Linguistics.
Lazer, David, Ryan Kennedy, Gary King, and Alessandro Vespignani. 2014. “The Parable of Google Flu: Traps in Big Data Analysis.” Science 343 (6176): 1203–05.
Le, Quoc, and Tomas Mikolov. 2014. “Distributed Representations of Sentences and Documents.” Proceedings of Machine Learning Research 32: 1188–96.
LeCun, Yann, Yoshua Bengio, and Geoffrey Hinton. 2015. “Deep Learning.” Nature 521 (7553): 436–44.
Li, Feng. 2010. “The Information Content of Forward-Looking Statements in Corporate Filings—A Naïve Bayesian Machine Learning Approach.” Journal of Accounting Research 48 (5): 1049–102.
Loughran, Tim, and Bill McDonald. 2011. “When Is a Liability Not a Liability? Textual Analysis, Dictionaries, and 10-Ks.” Journal of Finance 66 (1): 35–65.
Lucca, David O., and Francesco Trebbi. 2009. “Measuring Central Bank Communication: An Automated Approach with Application to FOMC Statements.” NBER Working Paper 15367.
Manela, Asaf, and Alan Moreira. 2017. “News Implied Volatility and Disaster Concerns.” Journal of Financial Economics 123 (1): 137–62.
Manning, Christopher D., Prabhakar Raghavan, and Hinrich Schütze. 2008. Introduction to Information Retrieval. Cambridge: Cambridge University Press.
Mannion, David, and Peter Dixon. 1997. “Authorship Attribution: The Case of Oliver Goldsmith.” Journal of the Royal Statistical Society, Series D 46 (1): 1–18.
Manski, Charles F. 1988. Analog Estimation Methods in Econometrics. New York: Chapman and Hall.
Mikolov, Tomas, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. “Distributed Representations of Words and Phrases and Their Compositionality.” In Proceedings of the 26th International Conference on Neural Information Processing Systems, edited by C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, 3111–19. Red Hook: Curran Associates.
Morin, Frederic, and Yoshua Bengio. 2005. “Hierarchical Probabilistic Neural Network Language Model.” In Proceedings of the Tenth International Workshop on Artificial Intelligence and Statistics, 246–52. New Jersey: Society for Artificial Intelligence and Statistics.
Mosteller, Frederick, and David L. Wallace. 1963. “Inference in an Authorship Problem.” Journal of the American Statistical Association 58 (302): 275–309.
Murphy, Kevin P. 2012. Machine Learning: A Probabilistic Perspective. Cambridge: MIT Press.
Ng, Andrew Y., and Michael I. Jordan. 2001. “On Discriminative versus Generative Classifiers: A Comparison of Logistic Regression and Naive Bayes.” In Proceedings of the 14th International Conference on Neural Information Processing Systems: Natural and Synthetic, edited by T. G. Dietterich, S. Becker, and Z. Ghahramani, 841–48. Cambridge: MIT Press.
Pang, Bo, Lillian Lee, and Shivakumar Vaithyanathan. 2002. “Thumbs up? Sentiment Classification Using Machine Learning Techniques.” In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 79–86. Stroudsburg: Association for Computational Linguistics.
Park, Trevor, and George Casella. 2008. “The Bayesian Lasso.” Journal of the American Statistical Association 103 (482): 681–86.
Pennington, Jeffrey, Richard Socher, and Christopher Manning. 2014. “GloVe: Global Vectors for Word Representation.” In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), edited by Alessandro Moschitti, Bo Pang, and Walter Daelemans, 1532–43. Stroudsburg: Association for Computational Linguistics.
Politis, Dimitris N., Joseph P. Romano, and Michael Wolf. 1999. Subsampling. Berlin: Springer.
Polson, Nicholas G., and Steven L. Scott. 2011. “Data Augmentation for Support Vector Machines.” Bayesian Analysis 6 (1): 1–24.
Porter, M. F. 1980. “An Algorithm for Suffix Stripping.” Program 14 (3): 130–37.
Preis, Tobias, and Helen Susannah Moat. 2014. “Adaptive Nowcasting of Influenza Outbreaks Using Google Searches.” Royal Society Open Science 1 (2): Article 140095.
Pritchard, Jonathan K., Matthew Stephens, and Peter Donnelly. 2000. “Inference of Population Structure Using Multilocus Genotype Data.” Genetics 155 (2): 945–59.
Quinn, Kevin M., Burt L. Monroe, Michael Colaresi, Michael H. Crespin, and Dragomir R. Radev. 2010. “How to Analyze Political Attention with Minimal Assumptions and Costs.” American Journal of Political Science 54 (1): 209–28.
Rabinovich, Maxim, and David M. Blei. 2014. “The Inverse Regression Topic Model.” Proceedings of Machine Learning Research 32: 199–207.
Roberts, Margaret E., Brandon M. Stewart, Dustin Tingley, and Edoardo M. Airoldi. 2013. “The Structural Topic Model and Applied Social Science.” Paper presented at the Advances in Neural Information Processing Systems Workshop on Topic Models: Computation, Application, and Evaluation, Lake Tahoe.
Rumelhart, David E., Geoffrey E. Hinton, and Ronald J. Williams. 1986. “Learning Representations by Back-Propagating Errors.” Nature 323 (6088): 533–36.
Saiz, Albert, and Uri Simonsohn. 2013. “Proxying for Unobservable Variables with Internet Document-Frequency.” Journal of the European Economic Association 11 (1): 137–65.
Schwarz, Gideon. 1978. “Estimating the Dimension of a Model.” Annals of Statistics 6 (2): 461–64.
Scott, Steven L., and Hal R. Varian. 2014. “Predicting the Present with Bayesian Structural Time Series.” International Journal of Mathematical Modeling and Numerical Optimisation 5 (1–2): 4–23.
Scott, Steven L., and Hal R. Varian. 2015. “Bayesian Variable Selection for Nowcasting Economic Time Series.” In Economic Analysis of the Digital Economy, edited by Avi Goldfarb, Shane M. Greenstein, and Catherine E. Tucker, 119–35. Chicago: University of Chicago Press.
Sim, Yanchuan, Bryan R. Routledge, and Noah A. Smith. 2015. “The Utility of Text: The Case of Amicus Briefs and the Supreme Court.” In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, 2311–17. Palo Alto: AAAI Press.
Stephens-Davidowitz, Seth. 2014. “The Cost of Racial Animus on a Black Candidate: Evidence Using Google Search Data.” Journal of Public Economics 118 (C): 26–40.
Stock, James H., and Francesco Trebbi. 2003. “Retrospectives: Who Invented Instrumental Variable Regression?” Journal of Economic Perspectives 17 (3): 177–94.
Stock, James H., and Mark W. Watson. 2002a. “Forecasting Using Principal Components from a Large Number of Predictors.” Journal of the American Statistical Association 97 (460): 1167–79.
Stock, James H., and Mark W. Watson. 2002b. “Macroeconomic Forecasting Using Diffusion Indexes.” Journal of Business and Economic Statistics 20 (2):