
Journal of Economic Literature 2019, 57(3), 535–574

https://doi.org/10.1257/jel.20181020

Text as Data†
Matthew Gentzkow, Bryan Kelly, and Matt Taddy*

An ever-increasing share of human interaction, communication, and culture is recorded as digital text. We provide an introduction to the use of text as an input to economic research. We discuss the features that make text different from other forms of data, offer a practical overview of relevant statistical methods, and survey a variety of applications. (JEL C38, C55, L82, Z13)

* Gentzkow: Stanford University. Kelly: Yale University and AQR Capital Management. Taddy: University of Chicago Booth School of Business. Go to https://doi.org/10.1257/jel.20181020 to visit the article page and view author disclosure statement(s).

1.  Introduction

New technologies have made available vast quantities of digital text, recording an ever-increasing share of human interaction, communication, and culture. For social scientists, the information encoded in text is a rich complement to the more structured kinds of data traditionally used in research, and recent years have seen an explosion of empirical economics research using text as data.

To take just a few examples: In finance, text from financial news, social media, and company filings is used to predict asset price movements and study the causal impact of new information. In macroeconomics, text is used to forecast variation in inflation and unemployment, and estimate the effects of policy uncertainty. In media economics, text from news and social media is used to study the drivers and effects of political slant. In industrial organization and marketing, text from advertisements and product reviews is used to study the drivers of consumer decision making. In political economy, text from politicians' speeches is used to study the dynamics of political agendas and debate.

The most important way that text differs from the kinds of data often used in economics is that text is inherently high dimensional. Suppose that we have a sample of documents, each of which is w words long, and suppose that each word is drawn from a vocabulary of p possible words. Then the unique representation of these documents has dimension p^w. A sample of thirty-word Twitter messages that use only the one thousand most common words in the English language, for example, has roughly as many dimensions as there are atoms in the universe.

A consequence is that the statistical methods used to analyze text are closely related to those used to analyze high-dimensional data in other domains, such as machine learning and computational biology. Some methods, such as lasso and other penalized regressions, are applied to text more or less exactly as they are in other settings. Other methods, such as topic models and multinomial inverse regression, are close cousins of more general

methods adapted to the specific structure of text data.

In all of the cases we consider, the analysis can be summarized in three steps:

1. Represent raw text 𝒟 as a numerical array C;

2. Map C to predicted values V̂ of unknown outcomes V; and

3. Use V̂ in subsequent descriptive or causal analysis.

In the first step, the researcher must impose some preliminary restrictions to reduce the dimensionality of the data to a manageable level. Even the most cutting-edge high-dimensional techniques can make nothing of 1,000^30-dimensional raw Twitter data. In almost all the cases we discuss, the elements of C are counts of tokens: words, phrases, or other predefined features of text. This step may involve filtering out very common or uncommon words; dropping numbers, punctuation, or proper names; and restricting attention to a set of features such as words or phrases that are likely to be especially diagnostic. The mapping from raw text 𝒟 to C leverages prior information about the structure of language to reduce the dimensionality of the data prior to any statistical analysis.

The second step is where high-dimensional statistical methods are applied. In a classic example, the data is the text of emails, and the unknown variable of interest V is an indicator for whether the email is spam. The prediction V̂ determines whether or not to send the email to a spam filter. Another classic task is sentiment prediction (e.g., Pang, Lee, and Vaithyanathan 2002), where the unknown variable V is the true sentiment of a message (say positive or negative), and the prediction V̂ might be used to identify positive reviews or comments about a product. A third task is predicting the incidence of local flu outbreaks from Google searches, where the outcome V is the true incidence of flu.

In these examples, and in the vast majority of settings where text analysis has been applied, the ultimate goal is prediction rather than causal inference. The interpretation of the mapping from C to V̂ is not usually an object of interest. Why certain words appear more often in spam, or why certain searches are correlated with flu, is not important so long as they generate highly accurate predictions. For example, Scott and Varian (2014, 2015) use data from Google searches to produce high-frequency estimates of macroeconomic variables such as unemployment claims, retail sales, and consumer sentiment that are otherwise available only at lower frequencies from survey data. Groseclose and Milyo (2005) compare the text of news outlets to speeches of congresspeople in order to estimate the outlets' political slant. A large literature in finance following Antweiler and Frank (2004) and Tetlock (2007) uses text from the internet or the news to predict stock prices.

In many social science studies, however, the goal is to go further and, in the third step, use text to infer causal relationships or the parameters of structural economic models. Stephens-Davidowitz (2014) uses Google search data to estimate local areas' racial animus, then studies the causal effect of racial animus on votes for Barack Obama in the 2008 election. Gentzkow and Shapiro (2010) use congressional and news text to estimate each news outlet's political slant, then study the supply and demand forces that determine slant in equilibrium. Engelberg and Parsons (2011) measure local news coverage of earnings announcements, then use the relationship between coverage and trading by local investors to separate the causal effect of news from other sources of correlation between news and stock prices.

In this paper, we provide an overview of methods for analyzing text and a survey of current applications in economics and related social sciences. The methods discussion is forward looking, providing an overview of methods that are currently applied in economics as well as those that we expect to have high value in the future. Our discussion of applications is selective and necessarily omits many worthy papers. We highlight examples that illustrate particular methods and use text data to make important substantive contributions even if they do not apply methods close to the frontier.

A number of other excellent surveys have been written in related areas. See Evans and Aceves (2016) and Grimmer and Stewart (2013) for related surveys focused on text analysis in sociology and political science, respectively. For methodological surveys, Bishop (2006); Hastie, Tibshirani, and Friedman (2009); and Murphy (2012) cover contemporary statistics and machine learning in general, while Jurafsky and Martin (2009) overview methods from computational linguistics and natural language processing. The Spring 2014 issue of the Journal of Economic Perspectives contains a symposium on "big data," which surveys broader applications of high-dimensional statistical methods to economics.

In section 2 we discuss representing text data as a manageable (though still high-dimensional) numerical array C; in section 3 we discuss methods from data mining and machine learning for predicting V from C. Section 4 then provides a selective survey of text analysis applications in social science, and section 5 concludes.

2.  Representing Text as Data

When humans read text, they do not see a vector of dummy variables, nor a sequence of unrelated tokens. They interpret words in light of other words, and extract meaning from the text as a whole. It might seem obvious that any attempt to distill text into meaningful data must similarly take account of complex grammatical structures and rich interactions among words.

The field of computational linguistics has made tremendous progress in this kind of interpretation. Most of us have mobile phones that are capable of complex speech recognition. Algorithms exist to efficiently parse grammatical structure, disambiguate different senses of words, distinguish key points from secondary asides, and so on.

Yet virtually all analysis of text in the social sciences, like much of the text analysis in machine learning more generally, ignores the lion's share of this complexity. Raw text consists of an ordered sequence of language elements: words, punctuation, and white space. To reduce this to a simpler representation suitable for statistical analysis, we typically make three kinds of simplifications: dividing the text into individual documents i, reducing the number of language elements we consider, and limiting the extent to which we encode dependence among elements within documents. The result is a mapping from raw text 𝒟 to a numerical array C. A row c_i of C is a numerical vector with each element indicating the presence or count of a particular language token in document i.

2.1 What Is a Document?

The first step in constructing C is to divide raw text 𝒟 into individual documents {𝒟_i}. In many applications, this is governed by the level at which the attributes of interest V are defined. For spam detection, the outcome of interest is defined at the level of individual emails, so we want to divide text that way too. If V is daily stock price movements that we wish to predict from the prior day's news text, it might make sense to divide the news text by day as well.

In other cases, the natural way to define a document is not so clear. If we wish to

predict legislators' partisanship from their floor speeches (Gentzkow, Shapiro, and Taddy 2016), we could aggregate speech so a document is a speaker–day, a speaker–year, or all speech by a given speaker during the time she is in Congress. When we use methods that treat documents as independent (which is true most of the time), finer partitions will typically ease computation at the cost of limiting the dependence we are able to capture. Theoretical guidance for the right level of aggregation is often limited, so this is an important dimension along which to check the sensitivity of results.

2.2 Feature Selection

To reduce the number of features to something manageable, a common first step is to strip out elements of the raw text other than words. This might include punctuation, numbers, HTML tags, proper names, and so on.

It is also common to remove a subset of words that are either very common or very rare. Very common words, often called "stop words," include articles ("the," "a"), conjunctions ("and," "or"), forms of the verb "to be," and so on. These words are important to the grammatical structure of sentences, but they typically convey relatively little meaning on their own. The frequency of "the" is probably not very diagnostic of whether an email is spam, for example. Common practice is to exclude stop words based on a predefined list.1 Very rare words do convey meaning, but their added computational cost in expanding the set of features that must be considered often exceeds their diagnostic value. A common approach is to exclude all words that occur fewer than k times for some arbitrary small integer k.

1 There is no single stop word list that has become a standard. How aggressive one wants to be in filtering stop words depends on the application. The web page http://www.ranks.nl/stopwords shows several common stop word lists, including the one built into the database software SQL and the list claimed to have been used in early versions of Google search. (Modern Google search does not appear to filter any stop words.)

An approach that excludes both common and rare words and has proved very useful in practice is filtering by "term frequency–inverse document frequency" (tf–idf). For a word or other feature j in document i, term frequency (tf_ij) is the count c_ij of occurrences of j in i. Inverse document frequency (idf_j) is the log of one over the share of documents containing j: log(n/d_j), where d_j = Σ_i 1[c_ij > 0] and n is the total number of documents. The object of interest tf–idf is the product tf_ij × idf_j. Very rare words will have low tf–idf scores because tf_ij will be low. Very common words that appear in most or all documents will have low tf–idf scores because idf_j will be low. (Note that this improves on simply excluding words that occur frequently, because it will keep words that occur frequently in some documents but do not appear in others; these often provide useful information.) A common practice is to keep only the words within each document i with tf–idf scores above some rank or cutoff.

A final step that is commonly used to reduce the feature space is stemming: replacing words with their root such that, e.g., "economic," "economics," and "economically" are all replaced by the stem "economic." The Porter stemmer (Porter 1980) is a standard stemming tool for English language text.

All of these cleaning steps reduce the number of unique language elements we must consider and thus the dimensionality of the data. This can provide a massive computational benefit, and it is also often key to getting more interpretable model fits (e.g., in topic modeling). However, each of these steps requires careful decisions about the elements likely to carry meaning in a particular application.2 One researcher's stop words are another's subject of interest.

2 Denny and Spirling (2018) discuss the sensitivity of unsupervised text analysis methods such as topic modeling to preprocessing steps.

Dropping numerals from political text means missing references to "the first 100 days" or "September 11." In online communication, even punctuation can no longer be stripped without potentially significant information loss :-(.

2.3 n-grams

Producing a tractable representation also requires that we limit dependence among language elements. A fairly mild step in this direction, for example, might be to parse documents into distinct sentences and encode features of these sentences while ignoring the order in which they occur. The most common methodologies go much further.

The simplest and most common way to represent a document is called bag-of-words. The order of words is ignored altogether, and c_i is a vector whose length is equal to the number of words in the vocabulary and whose elements c_ij are the number of times word j occurs in document i. Suppose that the text of document i is

Good night, good night!
Parting is such sweet sorrow.

After stemming, removing stop words, and removing punctuation, we might be left with "good night good night part sweet sorrow." The bag-of-words representation would then have c_ij = 2 for j ∈ {good, night}, c_ij = 1 for j ∈ {part, sweet, sorrow}, and c_ij = 0 for all other words in the vocabulary.

This scheme can be extended to encode a limited amount of dependence by counting unique phrases rather than unique words. A phrase of length n is referred to as an n-gram. For example, in our snippet above, the count of 2-grams (or "bigrams") would have c_ij = 2 for j = good.night, c_ij = 1 for j including night.good, night.part, part.sweet, and sweet.sorrow, and c_ij = 0 for all other possible 2-grams. The bag-of-words representation then corresponds to counts of 1-grams.

Counting n-grams of order n > 1 yields data that describe a limited amount of the dependence between words. Specifically, the n-gram counts are sufficient for estimation of an n-order homogeneous Markov model across words (i.e., the model that arises if we assume that word choice is only dependent upon the previous n words). This can lead to richer modeling. In analysis of partisan speech, for example, single words are often insufficient to capture the patterns of interest: "death tax" and "tax break" are phrases with strong partisan overtones that are not evident if we look at the single words "death," "tax," and "break" (see, e.g., Gentzkow and Shapiro 2010).

Unfortunately, the dimension of c_i increases exponentially quickly with the order n of the phrases tracked. The majority of text analyses consider n-grams up to two or three at most, and the ubiquity of these simple representations (in both machine learning and social science) reflects a belief that the return to richer n-gram modeling is usually small relative to the cost. Best practice in many cases is to begin analysis by focusing on single words. Given the accuracy obtained with words alone, one can then evaluate if it is worth the extra time to move on to 2-grams or 3-grams.
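The bag-of-words and bigram counts from the snippet in section 2.3 can be reproduced in a few lines, assuming the stemmed and filtered tokens given there:

```python
from collections import Counter

# The example document after stemming and stop-word/punctuation removal
tokens = "good night good night part sweet sorrow".split()

# 1-grams: the bag-of-words vector c_i (sparse representation)
bag = Counter(tokens)

# 2-grams: counts of adjacent word pairs
bigrams = Counter(zip(tokens, tokens[1:]))
```

The counts match the text: bag["good"] = 2, bigrams[("good", "night")] = 2, and all other bigrams appear once.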

2.4 Richer Representations

While rarely used in the social science literature to date, there is a vast array of methods from computational linguistics that capture richer features of text and may have high return in certain applications. One basic step beyond the simple n-gram counting above is to use sentence syntax to inform the text tokens used to summarize a document. For example, Goldberg and Orwant (2013) describe syntactic n-grams where words are grouped together whenever their meaning depends upon each other, according to a model of language syntax.

An alternative approach is to move beyond treating documents as counts of language tokens, and to instead consider the ordered sequence of transitions between words. In this case, one would typically break the document into sentences, and treat each as a separate unit for analysis. A single sentence of length s (i.e., containing s words) is then represented as a binary p × s matrix S, where the nonzero elements of S indicate occurrence of the row-word in the column-position within the sentence, and p is the length of the vocabulary. Such representations lead to a massive increase in the dimensions of the data to be modeled, and analysis of this data tends to proceed through word embedding: the mapping of words to a location in ℝ^K for some K ≪ p, such that the sentences are then sequences of points in this K-dimensional space. This is discussed in detail in section 3.3.

2.5 Other Practical Considerations

It is worth mentioning two details that can cause practical social science applications of these methods to diverge a bit from the ideal case considered in the statistics literature. First, researchers sometimes receive data in a pre-aggregated form. In the analysis of Google searches, for example, one might observe the number of searches containing each possible keyword on each day, but not the raw text of the individual searches. This means documents must be similarly aggregated (to days, rather than individual searches), and it also means that the natural representation where c_ij is the number of occurrences of word j on day i is not available. This is probably not a significant limitation, as the missing information (how many times per search a word occurs conditional on occurring at least once) is unlikely to be essential, but it is useful to note when mapping practice to theory.

A more serious issue is that researchers sometimes do not have direct access to the raw text and must access it through some interface such as a search engine. For example, Gentzkow and Shapiro (2010) count the number of newspaper articles containing partisan phrases by entering the phrases into a search interface (e.g., for the database ProQuest) and counting the number of matches they return. Baker, Bloom, and Davis (2016) perform similar searches to count the number of articles mentioning terms related to policy uncertainty. Saiz and Simonsohn (2013) count the number of web pages containing combinations of city names and terms related to corruption by entering queries in a search engine. Even if one can automate the searches in these cases, it is usually not feasible to produce counts for very large feature sets (e.g., every two-word phrase in the English language), and so the initial feature selection step must be relatively aggressive. Relatedly, interacting through a search interface means that there is no simple way to retrieve objects like the set of all words occurring at least twenty times in the corpus of documents, or the inputs to computing tf–idf.

3.  Statistical Methods

This section considers methods for mapping the document-token matrix C to predictions V̂ of an attribute V. In some cases, the observed data is partitioned into submatrices C^train and C^test, where the matrix C^train collects rows for which we have observations V^train of V, and the matrix C^test collects rows for which V is unobserved. The dimension of C^train is n^train × p, and the dimension of V^train is n^train × k, where k is the number of attributes we wish to predict.

Attributes in V can include observable quantities such as the frequency of flu cases, the positive or negative rating of movie reviews, or the unemployment rate, about which the documents are informative.
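The train/test partition just described can be set up directly from the count matrix. A minimal sketch with invented counts and outcomes, where None marks rows for which V is unobserved:

```python
# Document-token count matrix C (n x p; here n = 4 documents, p = 3 tokens).
C = [
    [2, 0, 1],
    [0, 1, 0],
    [1, 1, 3],
    [0, 2, 1],
]
V = [1.0, 0.0, None, None]  # attribute observed for the first two rows only

# C_train: rows with observed outcomes V_train; C_test: rows with V unobserved
C_train = [c for c, v in zip(C, V) if v is not None]
V_train = [v for v in V if v is not None]
C_test = [c for c, v in zip(C, V) if v is None]

n_train, p = len(C_train), len(C_train[0])  # dimensions n_train x p
```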

There can also be latent attributes of interest, such as the topics being discussed in a congressional debate or in news articles.

Methods to connect counts c_i to attributes v_i can be roughly divided into four categories. The first, which we will call dictionary-based methods, do not involve statistical inference at all: they simply specify v̂_i = f(c_i) for some known function f(·). This is by far the most common method in the social science literature using text to date. In some cases, researchers define f(·) based on a prespecified dictionary of terms capturing particular categories of text. In Tetlock (2007), for example, c_i is a bag-of-words representation and the outcome of interest v_i is the latent "sentiment" of Wall Street Journal columns, defined along a number of dimensions such as "positive," "optimistic," and so on. The author defines the function f(·) using a dictionary called the General Inquirer, which provides lists of words associated with each of these sentiment categories.3 The elements of f(c_i) are defined to be the sum of the counts of words in each category. (As we discuss below, the main analysis then focuses on the first principal component of the resulting counts.) In Baker, Bloom, and Davis (2016), c_i is the count of articles in a given newspaper-month containing a set of prespecified terms such as "policy," "uncertainty," and "Federal Reserve," and the outcome of interest v_i is the degree of "policy uncertainty" in the economy. The authors define f(·) to be the raw count of the prespecified terms divided by the total number of articles in the newspaper-month, averaged across newspapers. We do not provide additional discussion of dictionary-based methods in this section, but we return to them in section 3.5 and in our discussion of applications in section 4.

3 http://www.wjh.harvard.edu/~inquirer/.

The second and third groups of methods are distinguished by whether they begin from a model of p(v_i | c_i) or a model of p(c_i | v_i). In the former case, which we will call text regression methods, we directly estimate the conditional outcome distribution, usually via the conditional expectation E[v_i | c_i] of attributes v_i. This is intuitive: if we want to predict v_i from c_i, we would naturally regress the observed values of the former (V^train) on the corresponding values of the latter (C^train). Any generic regression technique can be applied, depending upon the nature of v_i. However, the high dimensionality of c_i, where p is often as large as or larger than n^train, requires use of regression techniques appropriate for such a setting, such as penalized linear or logistic regression.

In the latter case, we begin from a generative model of p(c_i | v_i). To see why this is intuitive, note that in many cases the underlying causal relationship runs from outcomes to language rather than the other way around. For example, Google searches about the flu do not cause flu cases to occur; rather, people with the flu are more likely to produce such searches. Congresspeople's ideology is not determined by their use of partisan language; rather, people who are more conservative or liberal to begin with are more likely to use such language. From an economic point of view, the correct "structural" model of language in these cases maps from v_i to c_i, and as in other cases familiar to economists, modeling the underlying causal relationships can provide powerful guidance to inference and make the estimated model more interpretable.

Generative models can be further divided by whether the attributes are observed or latent. In the first case of unsupervised methods, we do not observe the true value of v_i for any documents. The function relating c_i and v_i is unknown, but we are willing to impose sufficient structure on it to allow us to infer v_i from c_i. This class includes methods such as topic modeling and its variants (e.g., latent Dirichlet allocation, or LDA).
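A dictionary-based mapping v̂_i = f(c_i) of the kind described above can be sketched in a few lines. The word lists here are invented stand-ins for General Inquirer categories, which are far larger in practice:

```python
from collections import Counter

# Hypothetical sentiment word lists (illustrative only, not the real dictionary)
categories = {
    "positive": {"gain", "strong", "optimistic"},
    "negative": {"loss", "weak", "pessimistic"},
}

def dictionary_score(tokens):
    # v-hat_i = f(c_i): the sum of counts of words in each category
    c = Counter(tokens)
    return {k: sum(c[w] for w in words) for k, words in categories.items()}

scores = dictionary_score(["strong", "gain", "gain", "despite", "weak", "demand"])
```

Because f(·) is fixed in advance, no statistical fitting is involved; the entire method is the choice of word lists.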

In the second case of supervised methods, we observe training data V^train and we can fit our model, say f_θ(c_i; v_i) for a vector of parameters θ, to this training set. The fitted model f_θ̂ can then be inverted to predict v_i for documents in the test set, and can also be used to interpret the structural relationship between attributes and text. Finally, in some cases, v_i includes both observed and latent attributes for a semi-supervised analysis.

Lastly, we discuss word embeddings, which provide a richer representation of the underlying text than the token counts that underlie other methods. They have seen limited application in economics to date, but their dramatic successes in deep learning and other machine learning domains suggest they are likely to have high value in the future.

We close in section 3.5 with some broad recommendations for practitioners.

3.1 Text Regression

Predicting an attribute v_i from counts c_i is a regression problem like any other, except that the high dimensionality of c_i makes ordinary least squares (OLS) and other standard techniques infeasible. The methods in this section are mainly applications of standard high-dimensional regression methods to text.

3.1.1 Penalized Linear Models

The most popular strategy for very high-dimensional regression in contemporary statistics and machine learning is the estimation of penalized linear models, particularly with L1 penalization. We recommend this strategy for most text regression applications: linear models are intuitive and interpretable, and fast, high-quality software is available for big sparse input matrices like our C. For simple text-regression tasks with input dimension on the same order as the sample size, penalized linear models typically perform close to the frontier in terms of out-of-sample prediction.

Linear models in the sense we mean here are those in which v_i depends on c_i only through a linear index η_i = α + x_i′β, where x_i is a known transformation of c_i. In many cases, we simply have E[v_i | x_i] = η_i. It is also possible that E[v_i | x_i] = f(η_i) for some known link function f(·), as in the case of logistic regression.

Common transformations are the identity x_i = c_i, normalization by document length x_i = c_i/m_i with m_i = Σ_j c_ij, or the positive indicator x_ij = 1[c_ij > 0]. The best choice is application specific, and may be driven by interpretability: does one wish to interpret β_j as the added effect of an extra count for token j (if so, use x_ij = c_ij) or as the effect of the presence of token j (if so, use x_ij = 1[c_ij > 0])? The identity is a reasonable default in many settings.

Write l(α, β) for an unregularized objective proportional to the negative log likelihood, −log p(v_i | x_i). For example, in Gaussian (linear) regression, l(α, β) = Σ_i (v_i − η_i)², and in binomial (logistic) regression, l(α, β) = −Σ_i [η_i v_i − log(1 + e^η_i)] for v_i ∈ {0, 1}. A penalized estimator is then the solution to

(1)  min_{α,β} { l(α, β) + nλ Σ_{j=1}^{p} κ_j(|β_j|) },

where λ > 0 controls overall penalty magnitude and κ_j(·) are increasing "cost" functions that penalize deviations of the β_j from zero.

A few common cost functions are shown in figure 1. Those that have a non-differentiable spike at zero (lasso, elastic net, and log) lead to sparse estimators, with some coefficients set to exactly zero. The curvature of the penalty away from zero dictates the weight of shrinkage imposed on the nonzero coefficients: L2 costs increase with coefficient size; lasso's L1 penalty has zero curvature and imposes constant shrinkage; and as curvature goes toward −∞ one approaches the L0 penalty of subset selection.

[Figure 1. Penalty cost functions: A. Ridge, β²; B. Lasso, |β|; C. Elastic net, |β| + 0.1 × β²; D. log, log(1 + |β|).]

Note: From left to right, L2 costs (ridge, Hoerl and Kennard 1970), L1 (lasso, Tibshirani 1996), the "elastic net" mixture of L1 and L2 (Zou and Hastie 2005), and the log penalty (Candès, Wakin, and Boyd 2008).
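The four cost functions plotted in figure 1 are simple to state and evaluate pointwise (the 0.1 elastic net mixing weight matches panel C):

```python
from math import log

def ridge(b):
    return b ** 2                   # L2 cost: smooth, no sparsity

def lasso(b):
    return abs(b)                   # L1 cost: kink at zero gives sparsity

def elastic_net(b):
    return abs(b) + 0.1 * b ** 2    # L1 + L2 mixture from panel C

def log_penalty(b):
    return log(1 + abs(b))          # concave: shrinkage diminishes for large b

# Only penalties with a non-differentiable spike at zero (lasso, elastic net,
# log) can set coefficients exactly to zero.
```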

goes toward −∞ one approaches the L0 penalty of subset selection. The lasso's L1 penalty (Tibshirani 1996) is extremely popular: it yields sparse solutions with a number of desirable properties (e.g., Bickel, Ritov, and Tsybakov 2009; Wainwright 2009; Belloni, Chernozhukov, and Hansen 2013; Bühlmann and van de Geer 2011), and the number of nonzero estimated coefficients is an unbiased estimator of the regression degrees of freedom (which is useful in model selection; see Zou, Hastie, and Tibshirani 2007).4

Focusing on L1 regularization, rewrite the penalized linear model objective as

(2) min { l(α, β) + nλ ∑_{j=1}^{p} ω_j |β_j| }.

A common strategy sets ω_j so that the penalty cost for each coefficient is scaled by the sample standard deviation of that covariate. In text analysis, where each covariate corresponds to some transformation of a specific text token, this type of weighting is referred to as "rare feature up-weighting" (e.g., Manning, Raghavan, and Schütze 2008) and is generally thought of as good practice: rare words are often most useful in differentiating between documents.5

Large λ leads to simple model estimates in the sense that most coefficients will be set at or close to zero, while as λ → 0 we approach maximum likelihood estimation (MLE). Since there is no way to define an optimal λ a priori, standard practice is to compute estimates for a large set of possible λ and then use some criterion to select the one that yields the best fit.

Several criteria are available to choose an optimal λ. One common approach is to leave out part of the training sample in estimation and then choose the λ that yields the best out-of-sample fit according to some criterion such as mean squared error. Rather than work with a single leave-out sample, researchers most often use K-fold cross-validation (CV).

4 Penalties with a bias that diminishes with coefficient size—such as the log penalty in figure 1 (Candès, Wakin, and Boyd 2008), the smoothly clipped absolute deviation (SCAD) of Fan and Li (2001), or the adaptive lasso of Zou (2006)—have been promoted in the statistics literature as improving upon the lasso by providing consistent variable selection and estimation in a wider range of settings. These diminishing-bias penalties lead to increased computation costs (due to a non-convex loss), but there exist efficient approximation algorithms (see, e.g., Fan, Xue, and Zou 2014; Taddy 2017b).

5 This is the same principle that motivates "inverse-document frequency" weighting schemes, such as tf–idf.
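As a concrete sketch of this workflow (fitting an L1-penalized least-squares objective like (2) and selecting λ by K-fold cross-validation), here is a minimal numpy implementation. This is our own illustrative code, not from the article; it uses cyclic coordinate descent with soft-thresholding and uniform weights ω_j = 1.

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=200):
    """Minimize 0.5*||y - X @ beta||^2 + n*lam*sum(|beta_j|) by cyclic
    coordinate descent; the soft-threshold step sets small coefficients
    to exactly zero."""
    n, p = X.shape
    beta = np.zeros(p)
    col_ss = (X ** 2).sum(axis=0)  # per-column sums of squares
    for _ in range(n_iter):
        for j in range(p):
            # partial residual correlation, adding coordinate j's term back in
            rho = X[:, j] @ (y - X @ beta) + col_ss[j] * beta[j]
            beta[j] = np.sign(rho) * max(abs(rho) - n * lam, 0.0) / col_ss[j]
    return beta

def cv_lambda(X, y, grid, K=5, seed=0):
    """Choose lambda from grid by minimizing average K-fold out-of-sample MSE."""
    folds = np.random.default_rng(seed).permutation(len(y)) % K
    avg_mse = []
    for lam in grid:
        errs = []
        for k in range(K):
            b = lasso_cd(X[folds != k], y[folds != k], lam)
            errs.append(((y[folds == k] - X[folds == k] @ b) ** 2).mean())
        avg_mse.append(np.mean(errs))
    return grid[int(np.argmin(avg_mse))]
```

With a document-term count matrix in place of X, `lasso_cd` returns mostly zero coefficients, and `cv_lambda` scans a grid of penalty weights exactly as described above; the "one standard error" rule would replace the argmin with the largest λ within one standard error of it.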
544 Journal of Economic Literature, Vol. LVII (September 2019)

This splits the sample into K disjoint subsets, and then fits the full regularization path K times excluding each subset in turn. This yields K realizations of the mean squared error or other out-of-sample fit measure for each value of λ. Common rules are to select the value of λ that minimizes the average error across these realizations, or (more conservatively) to choose the largest λ with mean error no more than one standard error away from the minimum.

Analytic alternatives to cross-validation are Akaike's information criterion (AIC; Akaike 1973) and the Bayesian information criterion (BIC) of Schwarz (1978). In particular, Flynn, Hurvich, and Simonoff (2013) describe a bias-corrected AIC objective for high-dimensional problems that they call AICc. It is motivated as an approximate likelihood maximization subject to a degrees of freedom (df_λ) adjustment: AICc(λ) = 2l(α_λ, β_λ) + 2 df_λ n/(n − df_λ − 1). Similarly, the BIC objective is BIC(λ) = l(α_λ, β_λ) + df_λ log n, and is motivated as an approximation to the Bayesian posterior marginal likelihood in Kass and Wasserman (1995). AICc and BIC selection choose λ to minimize their respective objectives. The BIC tends to choose simpler models than cross-validation or AICc. Zou, Hastie, and Tibshirani (2007) recommend BIC for lasso penalty selection whenever variable selection, rather than predictive performance, is the primary goal.

3.1.2 Dimension Reduction

Another common solution for taming high dimensional prediction problems is to form a small number of linear combinations of predictors and to use these derived indices as variables in an otherwise standard predictive regression. Two classic dimension reduction techniques are principal components regression (PCR) and partial least squares (PLS).

Penalized linear models use shrinkage and variable selection to manage high dimensionality by forcing the coefficients on most regressors to be close to (or, for lasso, exactly) zero. This can produce suboptimal forecasts when predictors are highly correlated. A transparent illustration of this problem would be a case in which all of the predictors are equal to the forecast target plus an i.i.d. noise term. In this situation, choosing a subset of predictors via lasso penalty is inferior to taking a simple average of the predictors and using this as the sole predictor in a univariate regression. This predictor averaging, as opposed to predictor selection, is the essence of dimension reduction.

PCR consists of a two-step procedure. In the first step, principal components analysis (PCA) combines regressors into a small set of K linear combinations that best preserve the covariance structure among the predictors. This amounts to solving the problem

(3) min_{Γ,B} trace[(C − ΓB′)(C − ΓB′)′], subject to rank(Γ) = rank(B) = K.

The count matrix C consists of n rows (one for each document) and p columns (one for each term). PCA seeks a low-rank representation ΓB′ that best approximates the text data C. This formulation has the character of a factor model. The n × K matrix Γ captures the prevalence of K common components, or "factors," in each document. The p × K matrix B describes the strength of association between each word and the factors. As we will see, this reduced-rank decomposition bears a close resemblance to other text analytic methods such as topic modeling and word embeddings.

In the second step, the K components are used in standard predictive regression. As an

example, Foster, Liberman, and Stine (2013) use PCR to build a hedonic real estate pricing model that takes textual content of property listings as an input.6 With text data, where the number of features tends to vastly exceed the observation count, regularized versions of PCA such as predictor thresholding (e.g., Bai and Ng 2008) and sparse PCA (Zou, Hastie, and Tibshirani 2006) help exclude the least informative features to improve predictive content of the dimension-reduced text.

A drawback of PCR is that it fails to incorporate the ultimate statistical objective—forecasting a particular set of attributes—in the dimensionality reduction step. PCA condenses text data into indices based on the covariation among the predictors. This happens prior to the forecasting step and without consideration of how predictors associate with the forecast target.

In contrast, PLS performs dimension reduction by directly exploiting covariation of predictors with the forecast target.7 Suppose we are interested in forecasting a scalar attribute v_i. PLS regression proceeds as follows. For each element j of the feature vector c_i, estimate the univariate covariance between v_i and c_ij. This covariance, denoted φ_j, reflects the attribute's "partial" sensitivity to each feature j. Next, form a single predictor by averaging all attributes into a single aggregate predictor v̂_i = ∑_j φ_j c_ij / ∑_j φ_j. This forecast places the highest weight on the strongest univariate predictors, and the least weight on the weakest. In this way, PLS performs its dimension reduction with the ultimate forecasting objective in mind. The description of v̂_i reflects the K = 1 case, i.e., when text is condensed into a single predictive index. To use additional predictive indices, both v_i and c_ij are orthogonalized with respect to v̂_i, the above procedure is repeated on the orthogonalized data set, and the resulting forecast is added to the original v̂_i. This is iterated until the desired number of PLS components K is reached. Like PCR, PLS components describe the prevalence of K common factors in each document. And also like PCR, PLS can be implemented with a variety of regularization schemes to aid its performance in the ultra-high-dimensional world of text. Section 4 discusses applications using PLS in text regression.

PCR and PLS share a number of common properties. In both cases, K is a user-controlled parameter which, in many social science applications, is selected ex ante by the researcher. But, like any hyperparameter, K can be tuned via cross-validation. And neither method is scale invariant—the forecasting model is sensitive to the distribution of predictor variances. It is therefore common to variance-standardize features before applying PCR or PLS.

3.1.3 Nonlinear Text Regression

Penalized linear models are the most widely applied text regression tools due to their simplicity, and because they may be viewed as a first-order approximation to potentially nonlinear and complex data generating processes (DGPs). In cases where a linear specification is too restrictive, there are several other machine learning tools that are well suited to represent nonlinear associations between text c_i and outcome attributes v_i. Here we briefly describe four such nonlinear regression methods—generalized linear models, support vector machines, regression trees, and deep learning—and provide references for readers interested in thorough treatments of each.

GLMs and SVMs.—One way to capture nonlinear associations between c_i and v_i is

6 See Stock and Watson (2002a, b) for development of the PCR estimator and an application to macroeconomic forecasting with a large set of numerical predictors.

7 See Kelly and Pruitt (2013, 2015) for the asymptotic theory of PLS regression and its application to forecasting risk premia in financial markets.

with a generalized linear model (GLM). These expand the linear model to include nonlinear functions of c_i such as polynomials or interactions, while otherwise treating the problem with the penalized linear regression methods discussed above.

A related method used in the social science literature is the support vector machine, or SVM (Vapnik 1995). This is used for text classification problems (when V is categorical), the prototypical example being email spam filtering. A detailed discussion of SVMs is beyond the scope of this review, but from a high level, the SVM finds hyperplanes in a basis expansion of C that partition the observations into sets with equal response (i.e., so that v_i are all equal in each region).8

GLMs and SVMs both face the limitation that, without a priori assumptions for which basis transformations and interactions to include, they may overfit and require extensive tuning (Hastie, Tibshirani, and Friedman 2009; Murphy 2012). For example, multi-way interactions increase the parameterization combinatorially and can quickly overwhelm the penalization routine, and their performance suffers in the presence of many spurious "noise" inputs (Hastie, Tibshirani, and Friedman 2009).9

Regression Trees.—Regression trees have become a popular nonlinear approach for incorporating multi-way predictor interactions into regression and classification problems. The logic of trees differs markedly from traditional regressions. A tree "grows" by sequentially sorting data observations into bins based on values of the predictor variables. This partitions the data set into rectangular regions, and forms predictions as the average value of the outcome variable within each partition (Breiman et al. 1984). This structure is an effective way to accommodate rich interactions and nonlinear dependencies.

Two extensions of the simple regression tree have been highly successful thanks to clever regularization approaches that minimize the need for tuning and avoid overfitting. Random forests (Breiman 2001) average predictions from many trees that have been randomly perturbed in a bootstrap step. Boosted trees (e.g., Friedman 2002) recursively combine predictions from many oversimplified trees.10

The benefits of regression trees—nonlinearity and high-order interactions—are sometimes lessened in the presence of high-dimensional inputs. While we would generally recommend tree models, and especially random forests, they are often not worth the effort for simple text regression. Oftentimes, a more beneficial use of trees is in a final prediction step after some dimension reduction derived from the generative models in section 3.2.

Deep Learning.—There is a host of other machine learning techniques that have been applied to text regression. The most common techniques not mentioned thus far are neural networks, which typically allow the inputs to act on the response through one

8 Hastie, Tibshirani, and Friedman (2009, chapter 12) and Murphy (2012, chapter 14) provide detailed overviews of GLMs and SVMs. Joachims (1998) and Tong and Koller (2001) (among others) study text applications of SVMs.

9 Another drawback of SVMs is that they cannot be easily connected to the estimation of a probabilistic model, and the resulting fitted model can sometimes be difficult to interpret. Polson and Scott (2011) provide a pseudo-likelihood interpretation for a variant of the SVM objective. Our own experience has led us to lean away from SVMs for text analysis in favor of more easily interpretable models. Murphy (2012, chapter 14.6) attributes the popularity of SVMs in some application areas to an ignorance of alternatives.

10 Hastie, Tibshirani, and Friedman (2009) provide an overview of these methods. In addition, see Wager, Hastie, and Efron (2014) and Wager and Athey (2018) for results on confidence intervals for random forests, and see Taddy et al. (2015) and Taddy et al. (2016) for an interpretation of random forests as a Bayesian posterior over potentially optimal trees.

or more layers of interacting nonlinear basis functions (e.g., see Bishop 1995). A main attraction of neural networks is their status as universal approximators, a theoretical result describing their ability to mimic general, smooth nonlinear associations.

In high-dimensional and very noisy settings, such as in text analysis, classical neural nets tend to suffer from the same issues referenced above: they often overfit and are difficult to tune. However, the recently popular "deep" versions of neural networks (with many layers, and fewer nodes per layer) incorporate a number of innovations that allow them to work better, faster, and with little tuning, even in difficult text analysis problems. Such deep neural nets (DNNs) are now the state-of-the-art solution for many machine learning tasks (LeCun, Bengio, and Hinton 2015).11 DNNs are now employed in many complex natural language processing tasks, such as translation (Sutskever, Vinyals, and Le 2014; Wu et al. 2016) and syntactic parsing (Chen and Manning 2014), as well as in exercises of relevance to social scientists—for example, Iyyer et al. (2014) infer political ideology from text using a DNN. They are frequently used in conjunction with richer text representations such as word embeddings, described more below.

3.1.4 Bayesian Regression Methods

The penalized methods above can all be interpreted as posterior maximization under some prior. For example, ridge regression maximizes the posterior under independent Gaussian priors on each coefficient, while Park and Casella (2008) and Hans (2009) give Bayesian interpretations to the lasso. See also the horseshoe of Carvalho, Polson, and Scott (2010) and the double Pareto of Armagan, Dunson, and Lee (2013) for Bayesian analogues of diminishing bias penalties like the log penalty on the right of figure 1.

For those looking to do a full Bayesian analysis for high-dimensional (e.g., text) regression, an especially appealing model is the spike-and-slab introduced in George and McCulloch (1993). This models the distribution over regression coefficients as a mixture between two densities centered at zero—one with very small variance (the spike) and another with large variance (the slab). This model allows one to compute posterior variable inclusion probabilities as, for each coefficient, the posterior probability that it came from the slab and not the spike component. Due to a need to integrate over the posterior distribution, e.g., via Markov chain Monte Carlo (MCMC), inference for spike-and-slab models is much more computationally intensive than fitting the penalized regressions of section 3.1.1. However, Yang, Wainwright, and Jordan (2016) argue that spike-and-slab estimates based on short MCMC samples can be useful in application, while Scott and Varian (2014) have engineered efficient implementations of the spike-and-slab model for big data applications. These procedures give a full accounting of parameter uncertainty, which we miss in a quick penalized regression.

3.2 Generative Language Models

Text regression treats the token counts as generic high-dimensional input variables, without any attempt to model structure that is specific to language data. In many settings it is useful to instead propose a generative model for the text tokens to learn about how the attributes influence word choice and account for various dependencies among words and among attributes. In this approach, the words in a document are viewed as the realization of a generative process defined through a probability model for p(c_i | v_i).

11 Goodfellow, Bengio, and Courville (2016) provide a thorough textbook overview of these "deep learning" technologies, while Goldberg (2016) is an excellent primer on their use in natural language processing.
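The generative viewpoint can be made concrete with a small simulation. The sketch below is our own illustrative code with made-up dimensions, not from the article; it draws token-count vectors c_i from a mixture-of-topics specification of p(c_i | v_i), anticipating the multinomial topic model introduced next: each document mixes k topic distributions over a p-word vocabulary, and token counts are multinomial given document length m.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, k, m = 200, 50, 3, 100  # documents, vocabulary size, topics, tokens per doc

# Each topic is a probability vector over the p tokens.
theta = rng.dirichlet(np.full(p, 0.1), size=k)  # k x p, rows sum to 1

# Each document mixes the k topics with weights v_i (a distribution over topics).
V = rng.dirichlet(np.ones(k), size=n)           # n x k, rows sum to 1

# Token probabilities q_i = sum_l v_il * theta_l, then counts c_i ~ MN(q_i, m).
Q = V @ theta                                   # n x p
C = np.vstack([rng.multinomial(m, q) for q in Q])

print(C.shape)             # (200, 50)
print(C.sum(axis=1)[:5])   # every document has exactly m = 100 tokens
```

Estimation reverses this simulation: given only the count matrix C, one tries to recover the topics and the per-document weights, which is the subject of the next section.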

3.2.1 Unsupervised Generative Models

In the unsupervised setting, we have no direct observations of the true attributes v_i. Our inference about these attributes must therefore depend entirely on strong assumptions that we are willing to impose on the structure of the model p(c_i | v_i). Examples in the broader literature include cases where the v_i are latent factors, clusters, or categories. In text analysis, the leading application has been the case in which the v_i are topics.

A typical generative model implies that each observation c_i is a conditionally independent draw from the vocabulary of possible tokens according to some document-specific token probability vector, say q_i = [q_i1 ⋯ q_ip]′. Conditioning on document length, m_i = ∑_j c_ij, this implies a multinomial distribution for the counts

(4) c_i ∼ MN(q_i, m_i).

This multinomial model underlies the vast majority of contemporary generative models for text.

Under the basic model in (4), the function q_i = q(v_i) links attributes to the distribution of text counts. A leading example of this link function is the topic model specification of Blei, Ng, and Jordan (2003),12 where

(5) q_i = v_i1 θ_1 + v_i2 θ_2 + ⋯ + v_ik θ_k = Θ v_i.

Many readers will recognize the model in (5) as a factor model for the vector of normalized counts for each token in document i, c_i / m_i. Indeed, a topic model is simply a factor model for multinomial data. Each topic is a probability vector over possible tokens, denoted θ_l, l = 1, …, k (where θ_lj ≥ 0 and ∑_{j=1}^{p} θ_lj = 1). A topic can be thought of as a cluster of tokens that tend to appear in documents. The latent attribute vector v_i is referred to as the set of topic weights (formally, a distribution over topics, v_il ≥ 0 and ∑_{l=1}^{k} v_il = 1). Note that v_il describes the proportion of language in document i devoted to the lth topic. We can allow each document to have a mix of topics, or we can require that one v_il = 1 while the rest are zero, so that each document has a single topic.13

Since its introduction into text analysis, topic modeling has become hugely popular.14 (See Blei 2012 for a high-level overview.) The model has been especially useful in political science (e.g., Grimmer 2010), where researchers have been successful in attaching political issues and beliefs to the estimated latent topics.

Since the v_i are of course latent, estimation for topic models tends to make use of some alternating inference for V | Θ and Θ | V. One possibility is to employ a version of the expectation-maximization (EM) algorithm to either maximize the likelihood implied by

12 Standard least-squares factor models have long been employed in "latent semantic analysis" (LSA; Deerwester et al. 1990), which applies PCA (i.e., singular value decompositions) to token count transformations such as x_i = c_i/m_i or x_ij = c_ij log(d_j) where d_j = ∑_i 1[c_ij > 0]. Topic modeling and its precursor, probabilistic LSA, are generally seen as improving on such approaches by replacing arbitrary transformations with a plausible generative model.

13 Topic modeling is alternatively labeled "latent Dirichlet allocation" (LDA), which refers to the Bayesian model in Blei, Ng, and Jordan (2003) that treats each v_i and θ_l as generated from a Dirichlet-distributed prior. Another specification that is popular in political science (e.g., Quinn et al. 2010) keeps θ_l as Dirichlet-distributed but requires each document to have a single topic. This may be most appropriate for short documents, such as press releases or single speeches.

14 The same model was independently introduced in genetics by Pritchard, Stephens, and Donnelly (2000) for factorizing gene expression as a function of latent populations; it has been similarly successful in that field. Latent Dirichlet allocation is also an extension of a related mixture modeling approach in the latent semantic analysis of Hofmann (1999).

(4) and (5) or, after incorporating the usual on the application. As we discuss below, in
Dirichlet priors on v​ ​​ i​​​ and ​​θ​l​​​, to maximize the many applications of topic models to date,
posterior; this is the approach taken in Taddy the goal is to provide an intuitive description
(2012; see this paper also for a review of of text, rather than inference on some under-
topic estimation techniques). Alternatively, lying “true” parameters; in these cases, the
one can target the full posterior distribution​ ad hoc selection of the number of topics may
p​(Θ, V ∣ ​ci​)​​ ​​. Estimation, say for ​Θ​, then pro- be reasonable.
ceeds by maximization of the estimated mar- The basic topic model has been general-
ginal posterior, say p ​ ​(Θ ∣ ​ci​)​​ ​​. ized and extended in variety of ways. A prom-
Due to the size of the data sets and dimen- inent example is the dynamic topic model
sion of the models, posterior approximation of Blei and Lafferty (2006), which considers
for topic models usually uses some form documents that are indexed by date (e.g.,
of variational inference (Wainwright and publication date for academic articles) and
Jordan 2008) that fits a tractable paramet- allows the topics, say Θ​ ​​ t​​​, to evolve smoothly
ric family to be as close as possible (e.g., in in time. Another example is the super-
­Kullback–Leibler divergence) from the true vised topic model of Blei and McAuliffe
posterior. This variational approach was (2007), which combines the standard topic
used in the original Blei, Ng, and Jordan model with an extra equation relating the
(2003) paper and in many applications since. weights ​​vi​​​​to some additional attribute ​​y​i​​​ in
Hoffman et al. (2013) present a stochastic ​p​(​yi​​​  | ​vi​​​)​​. This pushes the latent topics to be
variational inference algorithm that takes relevant to y​ ​​ i​​​as well as the text c​ ​​ i​​​. In these
advantage of techniques for optimization on and many other extensions, the modifica-
massive data; this algorithm is used in many tions are designed to incorporate available
contemporary topic modeling applications. document metadata (in these examples,
Another approach, which is more computa- time and y​ ​​ i​​​ respectively).
tionally intensive but can yield more accu-
3.2.2 Supervised Generative Models
rate posterior approximations, is the MCMC
algorithm of Griffiths and Steyvers (2004). In supervised models, the attributes ​​v​i​​​ are
Alternatively, for quick estimation without observed in a training set and thus may be
uncertainty quantification, the posterior directly harnessed to inform the model of
maximization algorithm of Taddy (2012) is a text generation. Perhaps the most common
good option. supervised generative model is the ­so-called
The choice of k​ ​, the number of topics, is naive Bayes classifier (e.g., Murphy 2012),
often fairly arbitrary. ­ Data-driven choices which treats counts for each token as inde-
do exist: Taddy (2012) describes a model pendent with class-dependent means. For
selection process for k​​that is based upon example, the observed attribute might be
Bayes factors, Airoldi et al. (2010) provide author identity for each document in the
a ­cross-validation (CV) scheme, while Teh corpus with the model specifying different
et al. (2006) use Bayesian nonparametric mean token counts for each author.
techniques that view ​k​as an unknown model In naive Bayes, ​​v​i​​​is a univariate categor-
parameter. In practice, however, it is very ical variable and the token count distribu-
common to simply start with a number of tion is factorized as ​p​(​ci​​​  | ​vi​)​​ ​  = ​∏ j​ ​​ ​pj​​​​(​cij​ ​​  | ​vi​​​)​​,
topics on the order of ten, and then adjust thus “naively” specifying conditional inde-
the number of topics in whatever direction pendence between tokens j​​. This rules out
seems to improve interpretability. Whether the possibility that by choosing to say one
this ad hoc procedure is problematic depends token (say, “hello”) we reduce the ­probability

that we say some other token (say, "hi"). The parameters of each independent token distribution are estimated, yielding p̂_j for j = 1, …, p. The model can then be inverted for prediction, with classification probabilities for the possible class labels obtained via Bayes's rule as

(6) p(V | c_i) = p(c_i | V) π_V / ∑_a p(c_i | a) π_a = [∏_j p_j(c_ij | V)] π_V / ∑_a [∏_j p_j(c_ij | a)] π_a,

where π_a is the prior probability on class a (usually just one over the number of possible classes). In text analysis, Poisson naive Bayes procedures, with p(c_ij | V) = Pois(c_ij; θ_vj) where E[c_ij | V] = θ_vj, have been used as far back as Mosteller and Wallace (1963). Some recent social science applications use binomial naive Bayes, which sets p(c_ij | V) = Bin(c_ij; θ_vj) where E[c_ij / m_i | V] = θ_vj. The Poisson model has some statistical justification in the analysis of text counts (Taddy 2015a), but the binomial specification seems to be more common in off-the-shelf software.

A more realistic sampling model for text token counts is the multinomial model of (4). This introduces limited dependence between token counts, encoding the fact that using one token for a given utterance must slightly lower the expected count for all other tokens. Combining such a sampling scheme with generalized linear models, Taddy (2013b) advocates the use of multinomial logistic regression to connect text counts with observable attributes. The generative model specifies probabilities in the multinomial distribution of (4) as

(7) q_ij = e^{η_ij} / ∑_{h=1}^{p} e^{η_ih}, where η_ij = α_j + v_i′ φ_j.

Taddy (2013a, b) applies these models in the setting of univariate or low-dimensional v_i, focusing on their use for prediction of future document attributes through an inversion strategy discussed below. More recently, Taddy (2015a) provides a distributed-computing strategy that allows the model implied by (7) to be fit (using penalized deviance methods as detailed in section 3.1.1) for high-dimensional v_i on massive corpora. This facilitates language models containing a large number of sources of heterogeneity (even document-specific random effects), thus allowing social scientists to apply familiar regression analysis tools in their text analyses.

Application of the logistic regression text models implied by (7) often requires an inversion step for inference about attributes conditional on text—that is, to map from p(c_i | v_i) to p(v_i | c_i). The simple Bayes's rule technique of (6) is difficult to apply beyond a single categorical attribute. Instead, Taddy (2013b) uses the inverse regression ideas of Cook (2007) in deriving sufficient projections from the fitted models. Say Φ = [φ_1 ⋯ φ_p] is the matrix of regression coefficients from (7) across all tokens j; then the token count projection Φc_i is a sufficient statistic for v_i in the sense that

(8) v_i ⫫ c_i | Φc_i,

i.e., the attributes are independent of the text counts conditional upon the projection. Thus, the fitted logistic regression model yields a map from high-dimensional text to the presumably lower dimensional attributes of interest, and this map can be used instead of the full text counts for future inference tasks. For example, to predict variable v_i you can fit the low dimensional OLS regression of v_i on Φc_i. Use of projections built in this way is referred to as multinomial inverse regression (MNIR). This idea can also be applied to only a subset of the variables in v_i, yielding projections that are sufficient for the text content relevant to those variables

after conditioning on the other attributes in v_i. Taddy (2015a) details use of such sufficient projections in a variety of applications, including attribute prediction, treatment effect estimation, and document indexing.

New techniques are arising that combine MNIR techniques with the latent structure of topic models. For example, Rabinovich and Blei (2014) directly combine the logistic regression in (7) with the topic model of (5) in a mixture specification. Alternatively, the structural topic model of Roberts et al. (2013) allows both topic content (θ_l) and topic prevalence (latent v_i) to depend on observable document attributes. Such semi-supervised techniques seem promising for their combination of the strong text-attribute connection of MNIR with topic modeling's ability to account for latent clustering and dependency within documents.

3.3 Word Embeddings

Throughout this article, documents have been represented through token count vectors, c_i. This is a crude language summarization. It abstracts from any notion of similarity between words (such as run, runner, jogger) or syntactical richness. One of the frontiers of textual analysis is in developing new representations of text data that more faithfully capture its meaning.

Instead of identifying words only as an index for location in a long vocabulary list, imagine representing words as points in a large vector space, with similar words colocated, and an internally consistent arithmetic on the space for relating words to one another. For example, suppose our vocabulary consists of six words: {king, queen, prince, man, woman, child}. The vector space representation of this vocabulary based on similarity of their meaning might look something like figure 2, panel A.15

In the vector space, words are relationally oriented and we can begin to draw meaning from term positions, something that is not possible in simple bag-of-words approaches. For example, in the right figure, we can see that by subtracting the vector man from the vector king, and then adding to this woman, we arrive spatially close to queen. Likewise, the combination king − man + child lies in close proximity to the vector prince.

Such word embeddings, also known as distributed language representations, amount to a preprocessing of the text data to replace word identities—encoded as binary indicators in a vocabulary-length vector—with an embedding (location) of each vocabulary word in ℝ^K, where K is the dimension of the latent representation space. The dimensions of the vector space correspond to various aspects of meaning that give words their content. Continuing from the simplified example vocabulary, the latent (and, in reality, unlabeled) dimensions and associated word embeddings might look like:

Dimension     king   queen  prince  man    woman  child
Royalty       0.99   0.99   0.95    0.01   0.02   0.01
Masculinity   0.94   0.06   0.02    0.99   0.02   0.49
Age           0.73   0.81   0.15    0.61   0.68   0.09
…

This type of text representation has long been applied in natural language processing (Rumelhart, Hinton, and Williams 1986; Morin and Bengio 2005). The embeddings must be estimated and are chosen to optimize, perhaps approximately, an objective function defined on the original text (such as a likelihood for word occurrences). They form the basis for many deep learning applications involving textual data (see, e.g., Chen and Manning 2014; Goldberg 2016). They are also valuable in their own right for mapping from language to a vector space where we can compute distances

15 This example is motivated by https://blog.acolyer.org/2016/04/21/the-amazing-power-of-word-vectors/.
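To make the estimation idea concrete, the sketch below counts skip-gram co-occurrences in a toy corpus and then computes a rank-K least squares factorization of the co-occurrence matrix via truncated SVD. This is a simplified stand-in for estimators such as Word2Vec and GloVe, not an implementation of either; the corpus, window size b, and dimension K are all invented for illustration.

```python
import numpy as np

# Toy corpus; in practice this would be millions of documents.
corpus = [
    "the king and the queen rule the land",
    "the man and the woman walk with the child",
    "the prince is the child of the king and the queen",
]
tokens = [doc.split() for doc in corpus]
vocab = sorted({w for doc in tokens for w in doc})
idx = {w: j for j, w in enumerate(vocab)}

b = 2  # skip-gram window: count terms appearing within b words of each other
p = len(vocab)
cooccur = np.zeros((p, p))
for doc in tokens:
    for t, w in enumerate(doc):
        for s in range(max(0, t - b), min(len(doc), t + b + 1)):
            if s != t:
                cooccur[idx[w], idx[doc[s]]] += 1

# Rank-K approximation CoOccur ≈ Γ B′, the same kind of least squares
# factorization that PCA applies to the word count matrix.
K = 3  # embedding dimension; a few hundred in real applications
U, s, Vt = np.linalg.svd(cooccur)
Gamma = U[:, :K] * s[:K]   # row j = K-dimensional embedding of word j
B = Vt[:K].T
approx = Gamma @ B.T       # best rank-K approximation to the co-occurrence matrix

print(np.linalg.norm(cooccur - approx))  # reconstruction error
```

The rows of Gamma then serve as word embeddings: terms with large inner products are terms the factorization predicts will co-occur often.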
552 Journal of Economic Literature, Vol. LVII (September 2019)

[Figure 2 here. Panel A locates the six example words as points in the embedding space; panel B traces the analogy arithmetic, with king − man + woman landing near queen.]

Figure 2. A Graphical Example of Word Embeddings
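The analogy arithmetic pictured in panel B can be checked numerically with the toy embedding values from the table above; a minimal sketch (the numbers are the illustrative table's, not estimates from any corpus):

```python
import numpy as np

# Hypothetical 3-dimensional embeddings on the (Royalty, Masculinity,
# Age) dimensions from the illustrative table in the text.
emb = {
    "king":   np.array([0.99, 0.94, 0.73]),
    "queen":  np.array([0.99, 0.06, 0.81]),
    "prince": np.array([0.95, 0.02, 0.15]),
    "man":    np.array([0.01, 0.99, 0.61]),
    "woman":  np.array([0.02, 0.02, 0.68]),
    "child":  np.array([0.01, 0.49, 0.09]),
}

def nearest(vec, exclude=()):
    """Vocabulary word whose embedding has the highest cosine
    similarity with `vec`, skipping the words in `exclude`."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max((w for w in emb if w not in exclude),
               key=lambda w: cos(emb[w], vec))

# king - man + woman lands nearest to queen ...
print(nearest(emb["king"] - emb["man"] + emb["woman"],
              exclude=("king", "man", "woman")))  # queen
# ... and king - man + child lands nearest to prince.
print(nearest(emb["king"] - emb["man"] + emb["child"],
              exclude=("king", "man", "child")))  # prince
```

Even with these made-up coordinates, the relational structure the figure illustrates falls out of simple vector arithmetic.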

and angles between words for fundamental tasks such as classification, and have begun to be adopted by social scientists as a useful summary representation of text data.

Some popular embedding techniques are Word2Vec (Mikolov et al. 2013) and Global Vector for Word Representation (GloVe, Pennington, Socher, and Manning 2014). The key preliminary step in these methods is to settle on a notion of co-occurrence among terms. For example, consider a p × p matrix denoted CoOccur, whose (i, j) entry counts the number of times in your corpus that the terms i and j appear within, say, b words of each other. This is known as the skip-gram definition of co-occurrences.

To embed CoOccur in a K-dimensional vector space, where K is much smaller than p (say a few hundred), we solve the same type of problem that PCA used to summarize the word count matrix in equation (3). In particular, we can find rank-K matrices Γ and B that best approximate co-occurrences among terms:

CoOccur ≈ ΓB′.

The jth rows of Γ and B (denoted γ_j and β_j) give a K-dimensional embedding of the jth word, so co-occurrences of terms i and j are approximated as γ_i β_j′. This geometric representation of the text has an intuitive interpretation. The inner product of terms' embeddings, which measures the closeness of the pair in the K-dimensional vector space, describes how likely the pair is to co-occur.

Researchers are beginning to connect these vector-space language models with the sorts of document attributes that are of interest in social science. For example, Le and Mikolov (2014) estimate latent document scores in a vector space, while Taddy (2015b) develops an inversion rule for document classification based upon Word2Vec. In one especially compelling application, Bolukbasi et al. (2016) estimate the direction of gender in an embedding space by averaging the angles between female and male descriptors. They then show that stereotypically male and female jobs, for example, live at the corresponding ends of the implied gender vector. This information is used to derive an algorithm for removing these gender biases, so as to provide a more “fair” set of inputs
for machine learning tasks. Approaches like this, which use embeddings as the basis for mathematical analyses of text, can play a role in the next generation of text-as-data applications in social science.

3.4 Uncertainty Quantification

The machine learning literature on text analysis is focused on point estimation and predictive performance. Social scientists often seek to interpret parameters or functions of the fitted models, and hence desire strategies for quantifying the statistical uncertainty around these targets—that is, for statistical inference.

Many machine learning methods for text analysis are based upon a Bayesian modeling approach, where uncertainty quantification is often available as part of the estimation process. In MCMC sampling, as in the Bayesian regression of Scott and Varian (2014) or the topic modeling of Griffiths and Steyvers (2004), the software returns samples from the posterior and thus inference is immediately available. For estimators relying on variational inference—i.e., fitting a tractable distribution as closely as possible to the true posterior—one can simulate from the approximate distribution to conduct inference.16

Frequentist uncertainty quantification is often favored by social scientists, but analytic sampling distributions are unavailable for most of the methods discussed here. Some results exist for the lasso in stylized settings (especially Knight and Fu 2000), but these assume low-dimensional asymptotic scenarios that may be unrealistic for text analysis. More promising are computational algorithms that approximate the sampling distribution, the most common being the familiar nonparametric bootstrap (Efron 1979). This repeatedly draws samples with replacement of the same size as the original data set and reestimates parameters of interest on the bootstrapped samples, with the resulting set of estimated parameters approximating the sampling distribution.17

Unfortunately, the nonparametric bootstrap fails for many of the algorithms used on text. For example, it is known to fail for methods that involve non-differentiable loss functions (e.g., the lasso), and with-replacement resampling produces overfit in the bootstrap samples (repeated observations make prediction seem easier than it actually is). Hence, for many applications, it is better to look to methods more suitable for high-dimensional estimation algorithms. The two primary candidates are the parametric bootstrap and subsampling.

The parametric bootstrap generates new unrepeated observations for each bootstrap sample given an estimated generative model (or other assumed form for the data generating process).18 In doing so, it avoids pathologies of the nonparametric bootstrap that arise from using the empirical sample distribution. The cost is that the parametric bootstrap is, of course, parametric: It makes strong assumptions about the underlying generative model, and one must bear in mind that the resulting inference is conditional upon these assumptions.19

16 Due to its popularity in the deep learning community, variational inference is a common feature in newer machine learning frameworks; see, for example, Edward (Tran et al. 2016, 2017) for a Python library that builds variational inference on top of the TensorFlow platform. Edward and similar tools can be used to implement topic models and the other kinds of text models that we discussed above.

17 See, e.g., Horowitz (2003) for an overview.

18 See Efron (2012) for an overview that also makes interesting connections to Bayesian inference.

19 For example, in a linear regression model, the parametric bootstrap requires simulating errors from an assumed, say Gaussian, distribution. One must make assumptions on the exact form of this distribution, including whether the errors are homoskedastic or not. This contrasts with our usual approaches to standard errors for linear regression that are robust to assumptions on the functional form of the errors.
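The contrast between the two bootstraps can be sketched for the simple case of an OLS slope; the data generating process, sample size, and Gaussian error assumption below are all invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)  # toy linear DGP

def ols_slope(x, y):
    """Slope from an OLS fit of y on x with an intercept."""
    slope, _ = np.polyfit(x, y, 1)
    return slope

# Nonparametric bootstrap: resample (x, y) rows with replacement and
# re-estimate; the spread of the draws approximates the sampling
# distribution of the slope estimator.
nonparam = [ols_slope(x[i], y[i])
            for i in (rng.integers(0, n, size=n) for _ in range(500))]

# Parametric bootstrap: hold x fixed and simulate fresh errors from the
# fitted model under an assumed Gaussian distribution, instead of
# resampling rows.
slope_hat, intercept_hat = np.polyfit(x, y, 1)
sigma_hat = (y - (intercept_hat + slope_hat * x)).std(ddof=2)
param = [ols_slope(x, intercept_hat + slope_hat * x
                   + rng.normal(scale=sigma_hat, size=n))
         for _ in range(500)]

print(np.std(nonparam), np.std(param))  # two bootstrap standard errors
```

In this well-behaved, differentiable setting the two standard errors agree closely; the text's caution is that for lasso-type estimators the nonparametric version can fail while the parametric version remains conditional on its distributional assumptions.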
An alternative method, subsampling, provides a nonparametric approach to inference that remains robust to estimation features such as non-differentiable losses and model selection. The book by Politis, Romano, and Wolf (1999) gives a comprehensive overview. In subsampling, the data are partitioned into subsamples without replacement (the number of subsamples being a function of the total sample size) and the target parameters are reestimated separately on each subsample. The advantage of subsampling is that, because each subsample is a draw from the true DGP, it works for a wide variety of estimation algorithms. However, since each subsample is smaller than the sample of interest, one needs to know the estimator's rate of convergence in order to translate between the uncertainty in the subsamples and in the full sample.20

Finally, one may consider sample splitting with methods that involve a model selection step (i.e., those setting some parameter values to zero, such as lasso). Model selection is performed on one “selection” sample, then the standard inference is performed on the second “estimation” sample conditional upon the selected model.21 Econometricians have successfully used this approach to obtain accurate inference for machine learning procedures. For example, Chernozhukov et al. (2018) use it in the context of treatment effect estimation via lasso, and Athey and Imbens (2016) use sample splitting for inference about heterogeneous treatment effects in the context of a causal tree model.

3.5 Some Practical Advice

The methods above will be compared and contrasted in our subsequent discussion of applications. In broad terms, however, we can make some rough recommendations for practitioners.

3.5.1 Choosing the Best Approach for a Specific Application

Dictionary-based methods heavily weight prior information about the function mapping features c_i to outcomes v_i. They are therefore most appropriate in cases where such prior information is strong and reliable, and where information in the data is correspondingly weak. An obvious example is a case where the outcomes v_i are not observed for any i, so there is no training data available to fit a supervised model, and where the mapping of interest does not match the factor structure of unsupervised methods such as topic models. In the setting of Baker, Bloom, and Davis (2016), for example, there is no ground truth data on the actual level of policy uncertainty reflected in particular articles, and fitting a topic model would be unlikely to endogenously pick out policy uncertainty as a topic. A second example is where some training data do exist, but it is sufficiently small or noisy that the researcher believes a prior-driven specification of f(⋅) is likely to be more reliable.

Text regression is generally a good choice for predicting a single attribute, especially when one has a large amount of labeled training data available. As described in Ng and Jordan (2001) and Taddy (2013c), supervised generative techniques such as naive Bayes and MNIR can improve prediction when p is large relative to n; however, these gains diminish with the sample size due to the asymptotic efficiency of many text regression techniques. In text regression, we have

20 For many applications, doing this translation under the assumption of a standard √n learning rate is a reasonable choice. For example, Knight and Fu (2000) show conditions under which the √n rate holds for the lasso. However, in many text-as-data applications the dimension of the data (the size of the vocabulary) will be large and growing with the number of observations. In such settings √n learning may be optimistic, and more sophisticated methods may need to be used to infer the learning rate (see, e.g., Politis, Romano, and Wolf 1999).

21 For example, when using lasso, one might apply OLS in the second stage using only the covariates selected by lasso in the first stage.
found that it is usually unwise to attempt to learn flexible functional forms unless n is much larger than p. When this is not the case, we generally recommend linear regression methods. Given the availability of fast and robust tools (gamlr and glmnet in R, and scikit-learn in Python), and the typically high dimensionality of text data, many prediction tasks in social science with text inputs can be efficiently addressed via penalized linear regression.

When there are multiple attributes of interest, and one wishes to resolve or control for interdependencies between these attributes and their effects on language, then one will need to work with a generative model for text. Multinomial logistic regression and its extensions can be applied to such situations, particularly via distributed multinomial regression. Alternatively, for corpora of many unlabeled documents (or when the labels do not tell the whole story that one wishes to investigate), topic modeling is the obvious approach. Word embeddings are also becoming an option for such questions. In the spirit of contemporary machine learning, it is also perfectly fine to combine techniques. For example, a common setting will have a large corpus of unlabeled documents as well as a smaller set of documents about which some metadata exist. One approach is to fit a topic model on the larger corpus, and to then use these topics as well as the token counts for supervised text regression on the smaller labeled corpus.

3.5.2 Model Validation and Interpretation

Ex ante criteria for selecting an empirical approach are suggestive at best. In practice, it is also crucial to validate the performance of the estimation approach ex post. Real research often involves an iterative tuning process with repeated rounds of estimation, validation, and adjustment.

When the goal is prediction, the primary tool for validation is checking out-of-sample predictive performance on data held out from the main estimation sample. In section 3.1.1, we discussed the technique of cross-validation (CV) for penalty selection, a leading example. More generally, whenever one works with complex and high-dimensional data, it is good practice to reserve a test set of data to use in estimation of the true average prediction error. Looping across multiple test sets, as in CV, is a common way of reducing the variance of these error estimates. (See Efron 2004 for a classic overview.)

In many social science applications, the goal is to go beyond prediction and use the values V̂ in some subsequent descriptive or causal analysis. In these cases, it is important to also validate the accuracy with which the fitted model is capturing the economic or descriptive quantity of interest.

One approach that is often effective is manual audits: cross-checking some subset of the fitted values against the coding a human would produce by hand. An informal version of this is for a researcher to simply inspect a subset of documents alongside the fitted V̂ and evaluate whether the estimates align with the concept of interest. A formal version would involve having one or more people manually classify each document in a subset and evaluating quantitatively the consistency between the human and machine codings. The subsample of documents does not need to be large in order for this exercise to be valuable—often as few as twenty or thirty documents is enough to provide a sense of whether the model is performing as desired.

This kind of auditing is especially important for dictionary methods. Validity hinges on the assumption that a particular function of text features—counts of positive or negative words, an indicator for the presence of certain keywords, etc.—will be a valid predictor of the true latent variable V. In a setting where we have sufficient prior information to justify this assumption, we
typically also have enough prior information to evaluate whether the resulting classification looks accurate. An excellent example of this is Baker, Bloom, and Davis (2016), who perform a careful manual audit to validate their dictionary-based method for identifying articles that discuss policy uncertainty.

Audits are also valuable in studies using other methods. In Gentzkow and Shapiro (2010), for example, the authors perform an audit of news articles that their fitted model classifies as having a right-leaning or left-leaning slant. They do not compare this against hand coding directly, but rather count the number of times the key phrases that are weighted by the model are used straightforwardly in news text, as opposed to occurring in quotation marks or in other types of articles such as letters to the editor.

A second approach to validating a fitted model is inspecting the estimated coefficients or other parameters of the model directly. In the context of text regression methods, however, this needs to be approached with caution. While there is a substantial literature on statistical properties of estimated parameters in penalized regression models (see Bühlmann and van de Geer 2011 and Hastie, Tibshirani, and Wainwright 2015), the reality is that these coefficients are typically only interpretable in cases where the true model is extremely sparse, so that the model is likely to have selected the correct set of variables with high probability. Otherwise, multicollinearity means the set of variables selected can be highly unstable.

These difficulties notwithstanding, inspecting the most important coefficients to see if they make intuitive sense can still be useful as a validation and sanity check. Note that “most important” can be defined in a number of ways; one can rank estimated coefficients by their absolute values, or by absolute value scaled by the standard deviation of the associated covariate, or perhaps by the order in which they first become nonzero in a lasso path of decreasing penalties. Alternatively, see Gentzkow, Shapiro, and Taddy (2016) for application-specific term rankings.

Inspection of fitted parameters is generally more informative in the context of a generative model. Even there, some caution is in order. For example, Taddy (2015a) finds that for MNIR models, getting an interpretable set of word loadings requires careful penalty tuning and the inclusion of appropriate control variables. As in text regression, it is usually worthwhile to look at the largest coefficients for validation but not take the smaller values too seriously.

Interpretation or story building around estimated parameters tends to be a major focus for topic models and other unsupervised generative models. Interpretation of the fitted topics usually proceeds by ranking the tokens in each topic according to token probability, θ_lj, or by token lift θ_lj / p̄_j, with p̄_j = (1/n) Σ_i c_ij / m_i. For example, if the five highest lift tokens in topic l for a model fit to a corpus of restaurant reviews are another.minute, flag.down, over.minute, wait.over, arrive.after, we might expect that reviews with high v_il correspond to negative experiences where the patron was forced to wait for service and food (example from Taddy 2012). Again, however, we caution against the overinterpretation of these unsupervised models: the posterior distributions informing parameter estimates are often multimodal, and multiple topic model runs can lead to multiple different interpretations. As argued in Airoldi and Bischof (2016) and in a comment by Taddy (2017a), the best way to build interpretability for topic models may be to add some supervision (i.e., to incorporate external information on the topics for some set of cases).

4. Applications

We now turn to applications of text analysis in economics and related social sciences.
Rather than presenting a comprehensive literature survey, the goal of this section is to present a selection of illustrative papers to give the reader a sense of the wide diversity of questions that may be addressed with textual analysis and to provide a flavor of how some of the methods in section 3 are applied in practice.

4.1 Authorship

A classic descriptive problem is inferring the author of a document. While this is not usually a first-order research question for social scientists, it provides a particularly clean example, and a good starting point to understand the applications that follow.

In what is often seen as the first modern statistical analysis of text data, Mosteller and Wallace (1963) use text analysis to infer the authorship of the disputed Federalist Papers that had alternatively been attributed to either Alexander Hamilton or James Madison. They define documents i to be individual Federalist Papers, the data features c_i of interest to be counts of function words such as “an,” “of,” and “upon” in each document, and the outcome v_i to be an indicator for the identity of the author. Note that the function words the authors focus on are exactly the “stop words” that are frequently excluded from analysis (as discussed in section 2 above). The key feature of these words is that their use by a given author tends to be stable regardless of the topic, tone, or intent of the piece of writing. This means they provide little valuable information if the goal is to infer characteristics such as political slant or discussion of policy uncertainty that are independent of the idiosyncratic styles of particular authors. When such styles are the object of interest, however, function words become among the most informative text characteristics. Mosteller and Wallace (1963) use a sample of Federalist Papers, whose authorship by either Madison or Hamilton is undisputed, to train a naive Bayes classifier (a supervised generative model, as discussed in section 3.2) in which the probabilities p(c_ij | v_i) of each phrase j are assumed to be independent Poisson or negative binomial random variables, and the inferences for the unknown documents are made by Bayes' rule. The results provide overwhelming evidence that all of the disputed papers were authored by Madison.

Stock and Trebbi (2003) apply similar methods to answer an authorship question of more direct interest to economists: who invented instrumental variables? The earliest known derivation of the instrumental variables estimator appears in an appendix to The Tariff on Animal and Vegetable Oils, a 1928 book by statistician Philip Wright. While the bulk of the book is devoted to a “painfully detailed treatise on animal and vegetable oils, their production, uses, markets and tariffs,” the appendix is of an entirely different character, with “a succinct and insightful explanation of why data on price and quantity alone are in general inadequate for estimating either supply or demand; two separate and correct derivations of the instrumental variables estimators of the supply and demand elasticities; and an empirical application” (Stock and Trebbi 2003, p. 177). The contrast between the two parts of the book has led to speculation that the appendix was not written by Philip Wright, but rather by his son Sewall. Sewall Wright was an economist who had originated the method of “path coefficients” used in one of the derivations in the appendix. Several authors including Manski (1988) are on record attributing authorship to Sewall; others including Angrist and Krueger (2001) attribute it to Philip.

In Stock and Trebbi's (2003) study, the outcome v_i is an indicator for authorship by either Philip or Sewall. The data features c_i = [c_i^func c_i^gram] are counts of the
same function words used by Mosteller and Wallace (1963) plus counts of a set of grammatical constructions (e.g., “noun followed by adverb”) measured using an algorithm due to Mannion and Dixon (1997). The training sample C^train consists of forty-five documents known to have been written by either Philip or Sewall, and the test sample C^test in which v_i is unobserved consists of eight blocks of text from the appendix plus one block of text from chapter 1 of The Tariff on Animal and Vegetable Oils included as a validity check. The authors apply principal components analysis, which we can think of as an unsupervised cousin of the topic modeling approach discussed in section 3.2.1. They extract the first four principal components from c_i^func and c_i^gram respectively, and then run regressions of the binary authorship variable on the principal components, resulting in predicted values v̂_i^func and v̂_i^gram.

The results provide overwhelming evidence that the disputed appendix was in fact written by Philip. Figure 3 plots the values (v̂_i^func, v̂_i^gram) for all of the documents in the sample. Each point in the figure is a document i, and the labels indicate whether the document is known to be written by Philip (“P” or “1”), known to be written by Sewall (“S”), or of uncertain authorship (“B”). The measures clearly distinguish the two authors, with documents by each forming clear, nonoverlapping clusters. The uncertain documents all fall squarely within the cluster attributed to Philip, with the predicted values v̂_i^func, v̂_i^gram ≈ 1.

4.2 Stock Prices

An early example analyzing news text for stock price prediction appears in Cowles (1933). He subjectively categorizes the text of editorial articles of Peter Hamilton, chief editor of the Wall Street Journal from 1902–29, as “bullish,” “bearish,” or “doubtful.” Cowles then uses these classifications to predict future returns of the Dow Jones Industrial Average. Hamilton's track record is unimpressive. A market-timing strategy based on his Wall Street Journal editorials underperforms a passive investment in the Dow Jones Industrial Average by 3.5 percentage points per year.

In its modern form, the implementation of text-based prediction in finance is computationally driven, but it applies methods that are conceptually similar to Cowles's approach, seeking to predict the target quantity V (the Dow Jones return, in the example of Cowles) from the array of token counts C. We discuss three examples of recent papers that study equity return prediction in the spirit of Cowles: one relying on a preexisting dictionary (as discussed at the beginning of section 3), one using regression techniques (as discussed in section 3.1), and another using generative models (as discussed in section 3.2).

Tetlock's 2007 paper is a leading dictionary-based example of analyzing media sentiment and the stock market. He studies word counts c_i in the Wall Street Journal's widely read “Abreast of the Market” column. Counts from each article i are converted into a vector of sentiment scores v̂_i in seventy-seven different sentiment dimensions based on the Harvard IV-4 psychosocial dictionary.22 The time series of daily sentiment scores for each category (v̂_i) are condensed into a single principal component, which Tetlock names the “pessimism factor” due to the component's especially close association with the “pessimism” dimension of the sentiment categories.

The second stage of the analysis uses this pessimism score to forecast stock market

22 While it has only recently been used in economics and finance, the Harvard dictionary and associated General Inquirer software for textual content analysis dates to the 1960s and has been widely used in linguistics, psychology, sociology, and anthropology.
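A schematic version of the Stock and Trebbi PCA-plus-regression procedure can be sketched with synthetic data; the counts below are invented stand-ins for their function-word and grammatical-construction features:

```python
import numpy as np

rng = np.random.default_rng(2)

# 20 known documents: first 10 by author "P" (label 1), last 10 by "S".
labels = np.concatenate([np.ones(10), np.zeros(10)])
base_p = np.array([12.0, 3.0, 8.0, 1.0])  # mean feature usage, author P
base_s = np.array([4.0, 9.0, 2.0, 7.0])   # mean feature usage, author S
C_train = np.vstack([rng.poisson(base_p, size=(10, 4)),
                     rng.poisson(base_s, size=(10, 4))]).astype(float)

# PCA via SVD: center the counts, then project onto the top components.
mu = C_train.mean(axis=0)
U, s, Vt = np.linalg.svd(C_train - mu, full_matrices=False)
scores = (C_train - mu) @ Vt[:2].T        # first two principal components

# Regress the binary authorship indicator on the components, then read
# off the predicted value for a disputed document.
X = np.column_stack([np.ones(len(labels)), scores])
beta, *_ = np.linalg.lstsq(X, labels, rcond=None)

disputed = rng.poisson(base_p, size=4).astype(float)  # secretly by P
z = (disputed - mu) @ Vt[:2].T
v_hat = beta[0] + z @ beta[1:]
print(round(float(v_hat), 2))  # typically close to 1, i.e., attributed to P
```

As in figure 3, well-separated authors produce nonoverlapping clusters in the component space, so the disputed document's predicted value lands near its true author's label.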
Figure 3. Scatterplot of Authorship Predictions from PCA Method

Source: Stock and Trebbi (2003). Copyright American Economic Association; reproduced with the permission of the Journal of Economic Perspectives.

activity. High pessimism significantly negatively forecasts one-day-ahead returns on the Dow Jones Industrial Average. This effect is transitory, and the short-term index dip associated with media pessimism reverts within a week, consistent with the interpretation that article text is informative regarding media and investor sentiment, as opposed to containing fundamental news that permanently impacts prices.

The number of studies using dictionary methods to study asset pricing phenomena is growing. Loughran and McDonald (2011) demonstrate that the widely used Harvard dictionary can be ill-suited for financial applications. They construct an alternative, finance-specific dictionary of positive and negative terms and document its improved predictive power over existing sentiment dictionaries. Bollen, Mao, and Zeng (2011) document a significant predictive correlation between Twitter messages and the stock market using other dictionary-based tools such as OpinionFinder and Google's Profile of Mood States. Wisniewski and Lambe (2013) show that negative media attention of the banking sector, summarized via ad hoc pre-defined word lists, Granger-causes bank stock returns during the 2007–2009 financial crisis and not the reverse, suggesting that journalistic views have the potential to influence market outcomes, at least in extreme states of the world.

The use of text regression for asset pricing is exemplified by Jegadeesh and Wu (2013). They estimate the response of company-level stock returns, v_i, to text information in the company's annual report (token counts, c_i). The authors' objective is to determine whether regression techniques
offer improved stock return forecasts relative to dictionary methods.

The authors propose the following regression model to capture correlations between occurrences of individual words and subsequent stock return realizations around regulatory filing dates:

  v_i = a + b ∑_j w_j (c_ij / ∑_j c_ij) + ε_i.

Documents i are defined to be annual reports filed by firms at the Securities and Exchange Commission. The outcome variable v_i is a stock's cumulative four-day return beginning on the filing day. The independent variable c_ij is a count of occurrences of word j in annual report i. The coefficient w_j summarizes the average association between an occurrence of word j and the stock's subsequent return. The authors show how to estimate w_j from a cross-sectional regression, along with a subsequent rescaling of all coefficients to remove the common influence parameter b. Finally, variables to predict returns are built from the estimated weights, and are shown to have stronger out-of-sample forecasting performance than dictionary-based indices from Loughran and McDonald (2011). The results highlight the limitations of using fixed dictionaries for diverse predictive problems, and show that these limitations are often surmountable by estimating application-specific weights via regression.

Manela and Moreira (2017) take a regression approach to construct an index of news-implied market volatility based on text from the Wall Street Journal from 1890–2009. They apply support vector machines, a nonlinear regression method that we discuss in section 3.1.3. This approach applies a penalized least squares objective to identify a small subset of words whose frequencies are most useful for predicting outcomes—in this case, turbulence in financial markets. Two important findings emerge from their analysis. First, the terms most closely associated with market volatility relate to government policy and wars. Second, high levels of news-implied volatility forecast high future stock market returns. These two facts together give insight into the types of risks that drive investors' valuation decisions.23

23 While Manela and Moreira (2017) study aggregate market volatility, Kogan et al. (2009) and Boudoukh et al. (2016) use text from news and regulatory filings to predict firm-specific volatility. Chinco, Clark-Joseph, and Ye (2017) apply lasso in high frequency return prediction using preprocessed financial news text sentiment as an explanatory variable.

The closest modern analog of Cowles's study is Antweiler and Frank (2004), who take a generative modeling approach to ask: how informative are the views of stock market prognosticators who post on internet message boards? Similar to Cowles's analysis, these authors classify postings on stock message boards as "buy," "sell," or "hold" signals. But the vast number of postings, roughly 1.5 million in the analyzed sample, makes subjective classification of messages infeasible. Instead, generative techniques allow the authors to automatically classify messages. The authors create a training sample of one thousand messages, and form V^train by manually classifying messages into one of the three categories. They then use the naive Bayes method described in section 3.2.2 to estimate a probability model that maps word counts of postings C into classifications V̂ for the remaining 1.5 million messages. Finally, the buy/sell/hold classification of each message is aggregated into an index that is used to forecast stock returns. Consistent with the conclusions of Cowles, message board postings show little ability to predict stock returns. They do, however, possess significant and economically meaningful
information about stock volatility and trading volume.24

24 Other papers that use naive Bayes and similar generative models to study behavioral finance questions include Buehlmaier and Whited (2018), Li (2010), and Das and Chen (2007).

Bandiera et al. (2017) apply unsupervised machine learning—topic modeling (LDA)—to a large panel of CEO diary data. They uncover two distinct behavioral types that they classify as "leaders" who focus on communication and coordination activities, and "managers" who emphasize production-related activities. They show that, due to horizontal differentiation of firm and manager types, appropriately matched firms and CEOs enjoy better firm performance. Mismatches are more common in lower income economies, and mismatches can account for 13 percent of the labor productivity gap between firms in high- and middle/low-income countries.

4.3 Central Bank Communication

A related line of research analyzes the impact of communication from central banks on financial markets. As banks rely more on these statements to achieve policy objectives, an understanding of their effects is increasingly relevant.

Lucca and Trebbi (2009) use the content of Federal Open Market Committee (FOMC) statements to predict fluctuations in Treasury securities. To do this, they use two different dictionary-based methods (section 3)—Google and Factiva semantic orientation scores—to construct v̂_i, which quantifies the direction and intensity of the ith FOMC statement. In the Google score, c_i counts how many Google search hits occur when searching for phrases in i plus one of the words from a list of antonym pairs signifying positive or negative sentiment (e.g., "hawkish" versus "dovish"). These counts are mapped into v̂_i by differencing the frequency of positive and negative searches and averaging over all phrases in i. The Factiva score is calculated similarly. Next, the central bank sentiment proxies v̂_i are used to predict Treasury yields in a vector autoregression (VAR). They find that changes in statement content, as opposed to unexpected deviations in the federal funds target rate, are the main driver of changes in interest rates.

Born, Ehrmann, and Fratzscher (2014) extend this idea to study the effect of central bank sentiment on stock market returns and volatility. They construct a financial stability sentiment index v̂_i from Financial Stability Reports (FSRs) and speeches given by central bank governors. Their approach uses a sentiment dictionary to assign optimism scores to word counts c_i from central bank communications. They find that optimistic FSRs tend to increase equity prices and reduce market volatility during the subsequent month.

Hansen, McMahon, and Prat (2018) research how FOMC transparency affects debate during meetings by studying a change in disclosure policy. Prior to November 1993, the FOMC meeting transcripts were secret, but following a policy shift transcripts became public with a time lag. There are potential costs and benefits of increased transparency, such as the potential for more efficient and informed debate due to increased accountability of policy makers. On the other hand, transparency may make committee members more cautious, biased toward the status quo, or prone to group-think.

The authors use topic modeling (section 3.2.1) to study 149 FOMC meeting transcripts during Alan Greenspan's tenure. The unit of observation is a member-meeting. The vector c_i counts the words used by FOMC member m in meeting t, and i is defined as the pair (m, t). The outcome of interest, v_i, is a vector that includes the proportion of i's language devoted to the K different topics (estimated from the fitted topic model),
the concentration of these topic weights, and the frequency of data citation by i. Next, a difference-in-differences regression estimates the effects of the change in transparency on v̂_i. The authors find that, after the move to a more transparent system, inexperienced members discuss a wider range of topics and make more references to data when discussing economic conditions (consistent with increased accountability); but also speak more like Chairman Greenspan during policy discussions (consistent with increased conformity). Overall, the accountability effect appears stronger, as inexperienced members' topics appear to be more influential in shaping future deliberation after transparency.

4.4 Nowcasting

Important variables such as unemployment, retail sales, and GDP are measured at low frequency, and estimates are released with a significant lag. Others, such as racial prejudice or local government corruption, are not captured by standard measures at all. Text produced online such as search queries, social media posts, listings on job websites, and so on can be used to construct alternative real-time estimates of the current values of these variables. By contrast with the standard exercise of forecasting future variables, this process of using diverse data sources to estimate current variables has been termed "nowcasting" in the literature (Bańbura et al. 2013).

A prominent early example is the Google Flu Trends project. Zeng and Wagner (2002) note that the volume of searches or web hits seeking information related to a disease may be a strong predictor of its prevalence. Johnson et al. (2004) provide an early data point suggesting that browsing influenza-related articles on the website healthlink.com is correlated with traditional surveillance data from the Centers for Disease Control (CDC). In the late 2000s, a group of Google engineers built on this idea to create a product that predicts flu prevalence from Google searches using text regression.

The results are reported in a widely cited Nature article by Ginsberg et al. (2009). Their raw data consist of "hundreds of billions of individual searches from 5 years of Google web search logs." Aggregated search counts are arranged into a vector c_i, where a document i is defined to be a particular US region in a particular week, and the outcome of interest v_i is the true prevalence of flu in the region–week. In the training data, this is taken to be equal to the rate measured by the CDC. The authors first restrict attention to the fifty million most common terms, then select those most diagnostic of an outbreak using text regression (section 3.1), specifically a variant of partial least squares regression. They first run fifty million univariate regressions of log(v_i/(1 − v_i)) on log(c_ij/(1 − c_ij)), where c_ij is the share of searches in i containing search term j. They then fit a sequence of multivariate regression models of v_i on the top n terms j as ranked by average predictive power across regions for n ∈ {1, 2, …}. Next, they select the value of n that yields the best fit on a hold-out sample. This yields a regression model with n = 45 terms. The model produces accurate flu rate estimates for all regions approximately 1–2 weeks ahead of the CDC's regular report publication dates.25

25 A number of subsequent papers debate the longer-term performance of Google Flu Trends. Lazer et al. (2014), for example, show that the accuracy of the Google Flu Trends model—which has not been recalibrated or updated based on more recent data—has deteriorated dramatically, and that in recent years it is outperformed by simple extrapolation from prior CDC estimates. This may reflect changes in both search patterns and the epidemiology of the flu, and it suggests a general lesson that the predictive relationship mapping text to a real outcome of interest may not be stable over time. On the other hand, Preis and Moat (2014) argue that an adaptive version of the model that more flexibly accounts for joint dynamics in flu incidence and search volume significantly improves real-time influenza monitoring.
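The two-step selection procedure described above can be sketched as follows. This is a stylized re-implementation on synthetic data, not the authors' code: the sample sizes, the number of candidate terms, and the simple hold-out rule are all illustrative assumptions (the actual study screens fifty million terms and validates across regions).

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for the real inputs: v[i] is the flu rate for
# region-week i, and C[i, j] is the share of searches in i containing term j.
n_docs, n_terms = 200, 50
C = rng.uniform(0.01, 0.2, size=(n_docs, n_terms))
# By construction only the first five terms carry signal about flu prevalence.
v = 1 / (1 + np.exp(-(2 * C[:, :5].sum(axis=1) - 1)))

def logit(x):
    return np.log(x / (1 - x))

y = logit(v)   # log-odds of flu prevalence
X = logit(C)   # log-odds of each term's search share

# Step 1: a univariate regression of y on each term, recording in-sample
# fit as the ranking criterion.
def univariate_r2(x, y):
    slope, intercept = np.polyfit(x, y, 1)
    resid = y - (slope * x + intercept)
    return 1 - resid.var() / y.var()

scores = np.array([univariate_r2(X[:, j], y) for j in range(n_terms)])
ranked = np.argsort(scores)[::-1]  # terms ordered by univariate fit

# Step 2: fit multivariate models on the top-n terms for growing n and
# keep the n with the best fit on a hold-out sample.
train, hold = np.arange(0, 150), np.arange(150, n_docs)
best_n, best_mse = None, np.inf
for n in range(1, 26):
    cols = ranked[:n]
    A = np.column_stack([np.ones(len(train)), X[train][:, cols]])
    beta, *_ = np.linalg.lstsq(A, y[train], rcond=None)
    pred = np.column_stack([np.ones(len(hold)), X[hold][:, cols]]) @ beta
    mse = np.mean((pred - y[hold]) ** 2)
    if mse < best_mse:
        best_n, best_mse = n, mse

print("selected n:", best_n)
```

Capping the model at the hold-out-selected n is what guards against overfitting when the number of candidate terms vastly exceeds the number of region–weeks, which is exactly the situation the paper confronts.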
Related work in economics attempts to nowcast macroeconomic variables using data on the frequency of Google search terms. In Choi and Varian (2012) and Scott and Varian (2014, 2015), search term counts are aggregated by week and by geographic location, then converted to location-specific frequency indices. They estimate spike and slab Bayesian forecasting models, discussed in section 3.1.4 above. Forecasts of regional retail sales, new housing starts, and tourism activity are all significantly improved by incorporating a few search term indices that are relevant for each category in linear models. Their results suggest a potential for large gains in forecasting power using web browser search data.

Saiz and Simonsohn (2013) use web search results to estimate the current extent of corruption in US cities. Standard corruption measures based on surveys are available at the country and state level, but not for smaller geographies. The authors use a dictionary approach in which the index v̂_i of corruption is defined to be the ratio of search hits for the name of a geographic area i plus the word "corruption" divided by hits for the name of the geographic area alone. These counts are extracted from search engine results. As a validation, the authors first show that country-level and state-level versions of their measure correlate strongly with established corruption indices and covary in a similar way with country- and state-level demographics. They then compute their measure for US cities and study its observable correlates.

Stephens-Davidowitz (2014) uses the frequency of racially charged terms in Google searches to estimate levels of racial animus in different areas of the United States. Estimating animus via traditional surveys is challenging because individuals are often reluctant to state their true attitudes. The paper's results suggest Google searches provide a less filtered, and therefore more accurate, measure. The author uses a dictionary approach in which the index v̂_i of racial animus in area i is the share of searches originating in that area that contain a set of racist words. He then uses these measures to estimate the impact of racial animus on votes for Barack Obama in the 2008 election, finding a statistically significant and economically large negative effect on Obama's vote share relative to the Democratic vote share in the previous election.

4.5 Policy Uncertainty

Among the most influential applications of text analysis in the economics literature to date is a measure of economic policy uncertainty (EPU) developed by Baker, Bloom, and Davis (2016). Uncertainty about both the path of future government policies and the impact of current government policies has the potential to increase risk for economic actors and so potentially depress investment and other economic activity. The authors use text from news outlets to provide a high-frequency measure of EPU and then estimate its economic effects.

Baker, Bloom, and Davis (2016) define the unit of observation i to be a country–month. The outcome v_i of interest is the true level of economic policy uncertainty. The authors apply a dictionary method to produce estimates v̂_i based on digital archives of ten leading newspapers in the United States. An element of the input data c_ij is a count of the number of articles in newspaper j in country–month i containing at least one keyword from each of three categories defined by hand: one related to the economy, a second related to policy, and a third related to uncertainty. The raw counts are scaled by the total number of articles in the corresponding newspaper–month and normalized to have standard deviation one. The predicted value v̂_i is then defined to be a simple average of these scaled counts across newspapers.

The simplicity of the manner in which the index is created allows for a high amount of
flexibility across a broad range of applications. For instance, by including a fourth, policy-specific category of keywords, the authors can estimate narrower indices related to Federal Reserve policy, inflation, and so on. Baker, Bloom, and Davis (2016) validate v̂_i using a human audit of twelve thousand articles from 1900–2012. Teams manually scored articles on the extent to which they discuss economic policy uncertainty and the specific policies they relate to. The resulting human-coded index has a high correlation with v̂_i.

With the estimated v̂_i in hand, the authors analyze the micro- and macro-level effects of EPU. Using firm-level regressions, they first measure how firms respond to this uncertainty and find that it leads to reduced employment and investment, and greater asset price volatility for that firm. Then, using both US and international panel VAR models, the authors find that increased v̂_i is a strong predictor of lower investment, employment, and production.

Hassan et al. (2017) measure political risk at the firm level by analyzing quarterly earnings call transcripts. Their measure captures the frequency with which policy-oriented language and "risk" synonyms co-occur in a transcript. Firms with high levels of political risk actively hedge these risks by lobbying more intensively and donating more to politicians. When a firm's political risk rises, it tends to retrench hiring and investment, consistent with the findings of Baker, Bloom, and Davis (2016) at the aggregate level. Their findings indicate that political shocks are an important source of idiosyncratic firm-level risk.

4.6 Media Slant

A text analysis problem that has received significant attention in the social science literature is measuring the political slant of media content. Media outlets have long been seen as having a uniquely important role in the political process, with the power to potentially sway both public opinion and policy. Understanding how and why media outlets slant the information they present is important to understanding the role media play in practice, and to informing the large body of government regulation designed to preserve a diverse range of political perspectives.

Groseclose and Milyo (2005) offer a pioneering application of text analysis methods to this problem. In their setting, i indexes a set of large US media outlets, and documents are defined to be the complete news text or broadcast transcripts for an outlet i. The outcome of interest v_i is the political slant of outlet i. To give this measure content, the authors use speeches by politicians in the US Congress to form a training sample, and define v_i within this sample to be a politician's Americans for Democratic Action (ADA) score, a measure of left-right political ideology based on congressional voting records. The predicted values v̂_i for the media outlets thus place them on the same left-right scale as the politicians, and answer the question "what kind of politician does this news outlet's content sound most similar to?"

The raw data are the full text of speeches by congresspeople and news reports by media outlets over a period spanning the 1990s to the early 2000s.26 The authors dramatically reduce the dimensionality of the data in an initial step by deciding to focus on a particularly informative subset of phrases: the names of two hundred think tanks. These think tanks are widely viewed as having clear political positions (e.g., the Heritage Foundation on the right and the NAACP on the left). The relative frequency

26 For members of Congress, the authors use all entries in the Congressional Record from January 1, 1993 to December 31, 2002. The text includes both floor speeches and documents the member chose to insert in the record but did not read on the floor. For news outlets, the time period covered is different for different outlets, with start dates as early as January 1990 and end dates as late as July 2004.
with which a politician cites conservative as opposed to liberal think tanks turns out to be strongly correlated with a politician's ideology. The paper's premise is that the citation frequencies of news outlets will then provide a good index of those outlets' political slants. The features of interest c_i are a (1 × 50) vector of citation counts for each of forty-four highly cited think tanks plus six groups of smaller think tanks.

The text analysis is based on a supervised generative model (section 3.2.2). The utility that congress member or media firm i derives from citing think tank j is U_ij = a_j + b_j v_i + e_ij, where v_i is the observable ADA score of a congress member i or unobserved slant of media outlet i, and e_ij is an error distributed type-I extreme value. The coefficient b_j captures the extent to which think tank j is cited relatively more by conservatives. The model is fit by maximum likelihood with the parameters (a_j, b_j) and the unknown slants v_i estimated jointly. This is an efficient but computationally intensive approach to estimation, and it constrains the authors' focus to twenty outlets. This limitation can be sidestepped using more recent approaches such as Taddy's (2013b) multinomial inverse regression.

Figure 4 shows the results, which suggest three main findings. First, the media outlets are all relatively centrist: they are all to the left of the average Republican and to the right of the average Democrat with one exception. Second, the ordering matches conventional wisdom, with the New York Times and Washington Post on the left, and Fox News and the Washington Times on the right.27 Third, the large majority of outlets fall to the left of the average in congress, which is denoted in the figure by "average US voter." The last fact underlies the authors' main conclusion, which is that there is an overall liberal bias in the media.

Gentzkow and Shapiro (2010) build on the Groseclose and Milyo (2005) approach to measure the slant of 433 US daily newspapers. The main difference in approach is that Gentzkow and Shapiro (2010) omit the initial step that restricts the space of features to mentions of think tanks, and instead consider all phrases that appear in the 2005 Congressional Record as potential predictors, letting the data select those that are most diagnostic of ideology. These could potentially be think tank names, but they turn out instead to be politically charged phrases such as "death tax," "bring our troops home," and "war on terror."

After standard preprocessing—stemming and omitting stop words—the authors produce counts of all 2-grams and 3-grams by speaker. They then select the top one thousand phrases (five hundred of each length) by a χ² criterion that captures the degree to which each phrase is diagnostic of the speaker's party. This is the standard χ²-test statistic for the null hypothesis that phrase j is used equally often by Democrats and Republicans, and it will be high for phrases that are both used frequently and used asymmetrically by the parties.28 Next, a two-stage supervised generative method is used to predict newspaper slant v_i from the selected features. In the first stage, the authors run a separate regression for each phrase j of counts (c_ij)

27 The one notable exception is the Wall Street Journal, which is generally considered to be right-of-center but which is estimated by Groseclose and Milyo (2005) to be the most left-wing outlet in their sample. This may reflect an idiosyncrasy specific to the way they cite think tanks; Gentzkow and Shapiro (2010) use a broader sample of text features and estimate a much more conservative slant for the Wall Street Journal.

28 The statistic is

  χ²_j = (f_jr f_∼jd − f_jd f_∼jr)² / [(f_jr + f_jd)(f_jr + f_∼jr)(f_jd + f_∼jd)(f_∼jr + f_∼jd)],

where f_jd and f_jr denote the number of times phrase j is used by Democrats or Republicans, respectively, and f_∼jd and f_∼jr denote the number of times phrases other than j are used by Democrats and Republicans, respectively.
Figure 4. Distribution of Political Orientation: Media Outlets and Members of Congress

Source: Groseclose and Milyo (2005). Reprinted with permission from the Quarterly Journal of Economics.
on speaker i's ideology, which is measured as the 2004 Republican vote share in the speaker's district. They then use the estimated coefficients β̂_j to produce predicted slant v̂_i ∝ ∑_{j=1}^{1,000} β̂_j c_ij for the unknown newspapers i.29

29 As Taddy (2013b) notes, this method (which Gentzkow and Shapiro 2010 derive in an ad hoc fashion) is essentially partial least squares. It differs from the standard implementation in that the variables v_i and c_ij would normally be standardized. Taddy (2013b) shows that doing so increases the in-sample predictive power of the measure from 0.37 to 0.57.

The main focus of the study is characterizing the incentives that drive newspapers' choice of slant. With the estimated v̂_i in hand, the authors estimate a model of consumer demand in which a consumer's utility from reading newspaper i depends on the distance between i's slant v_i and an ideal slant v* which is greater the more conservative the consumer's ideology. Estimates of this model using zipcode-level circulation data imply a level of slant that newspapers would choose if their only incentive was to maximize profits. The authors then compare this profit-maximizing slant to the level actually chosen by newspapers, and ask whether the deviations can be predicted by the identity of the newspaper's owner or by other nonmarket factors such as the party of local incumbent politicians. They find that profit maximization fits the data well, and that ownership plays no role in explaining the choice of slant. In this study, v̂_i is both an independent variable of interest (in the demand analysis) and an outcome of interest (in the supply analysis).

Note that both Groseclose and Milyo (2005) and Gentzkow and Shapiro (2010) use a two-step procedure where they reduce the dimensionality of the data in a first stage and then estimate a predictive model in the second. Taddy (2013b) shows how to combine a more sophisticated generative model with a novel algorithm for estimation to estimate the predictive model in a single step using the full set of phrases in the data. He shows that this substantially increases the in-sample predictive power of the measure.

Greenstein, Gu, and Zhu (2016) analyze the extent of bias and slant among Wikipedia contributors using similar methods. They find that contributors tend to edit articles with slants in opposition to their own slants. They also show that contributors' slants become less extreme as they become more experienced, and that the bias reduction is largest for those with the most extreme initial biases.

4.7 Market Definition and Innovation Impact

Many important questions in industrial organization hinge on the appropriate definition of product markets. Standard industry definitions can be an imperfect proxy for the economically relevant concept. Hoberg and Phillips (2016) provide a novel way of classifying industries based on product descriptions in the text of company disclosures. This allows for flexible industry classifications that may vary over time as firms and economies evolve, and allows the researchers to analyze the effect of shocks on competition and product offerings.

Each publicly traded firm in the United States must file an annual 10-K report describing, among other aspects of their business, the products that they offer. The unit of analysis i is a firm–year. Token counts from the business description section of the ith 10-K filing are represented in the vector c_i. A pairwise cosine similarity score, s_ij, based on the angle between c_i and c_j, describes the closeness of product offerings for each pair i and j in the same filing year. Industries are then defined by clustering firms according to their cosine similarities. The clustering algorithm begins by assuming each firm is its own industry, and gradually agglomerates firms into industries by grouping a firm to the cluster with its nearest neighbor according to s_ij. The
algorithm terminates when the number of industries (clusters) reaches three hundred, a number chosen for comparability to Standard Industrial Classification and North American Industrial Classification System codes.30

30 Best, Hjort, and Szakonyi (2017) use a similar approach to classify products in their study of public procurement and organizational bureaucracy in Russia.

After establishing an industry assignment for each firm–year, v̂_i, the authors examine the effect of military and software industry shocks to competition and product offerings among firms. As an example, they find that the events of September 11, 2001, increased entry in high-demand military markets and pushed products in this industry toward "non-battlefield information gathering and products intended for potential ground conflicts."

In a similar vein, Kelly et al. (2018) use cosine similarity among patent documents to create new indicators of patent quality. They assign higher quality to patents that are novel in that they have low similarity with the existing stock of patents and are impactful in that they have high similarity with subsequent patents. They then show that text-based novelty and similarity scores correlate strongly with measures of market value. Atalay et al. (2017) use text from job ads to measure task content and use their measure to show that within-occupation task content shifts are at least as important as employment shifts across occupations in describing the large secular reallocation of routine tasks from humans to machines.

4.8 Topics in Research, Politics, and Law

A number of studies apply topic models (section 3.2.1) to describe how the focus of attention in a specific text corpus shifts over time.

A seminal contribution in this vein is Blei and Lafferty's (2007) analysis of topics in Science. Documents i are individual articles, the data c_i are counts of individual words, and the outcome of interest v_i is a vector of weights indicating the share of a given article devoted to each of one hundred latent topics. The authors extend the baseline LDA model of Blei, Ng, and Jordan (2003) to allow the importance of one topic in a particular article to be correlated with the presence of other topics. They fit the model using all Science articles from 1990–99. The results deliver an automated classification of article content into semantically coherent topics such as evolution, DNA and genetics, cellular biology, and volcanoes.

Applying similar methods in the political domain, Quinn et al. (2010) use a topic model to identify the issues being discussed in the US Senate over the period 1997–2004. Their approach deviates from the baseline LDA model in two ways. First, they assume that each speech is associated with a single topic. Second, their model incorporates time-series dynamics that allow the proportion of speeches generated by a given topic to gradually evolve over the sample, similar to the dynamic topic model of Blei and Lafferty (2006). Their preferred specification is a model with forty-two topics, a number chosen to maximize the subjective interpretability of the resulting topics.

Table 1 shows the words with the highest weights in each of twelve fitted topics. The labels "Judicial Nominations," "Constitutional," and so on are assigned by hand by the authors. The results suggest that the automated procedure successfully isolates coherent topics of congressional debates. After discussing the structure of topics in the fitted model, the authors then track the relative importance of the topics across congressional sessions and argue that spikes in discussion of particular topics track, in an intuitive way, the occurrence of important debates and external events.

Sim, Routledge, and Smith (2015) estimate a topic model from the text of amicus
TABLE 1
Congressional Record Topics and Key Words

Topic (Short Label) Keys

  1. Judicial nominations nomine, confirm, nomin, circuit, hear, court, judg, judici, case, vacanc
  2. Constitutional case, court, attornei, supreme, justic, nomin, judg, m, decis, constitut
  3. Campaign finance campaign, candid, elect, monei, contribut, polit, soft, ad, parti, limit
  4. Abortion procedur, abort, babi, thi, life, doctor, human, ban, decis, or
  5. Crime 1 [violent] enforc, act, crime, gun, law, victim, violenc, abus, prevent, juvenil
  6. Child protection gun, tobacco, smoke, kid, show, firearm, crime, kill, law, school
  7. Health 1 [medical] diseas, cancer, research, health, prevent, patient, treatment, devic, food
  8. Social welfare care, health, act, home, hospit, support, children, educ, student, nurs
  9. Education school, teacher, educ, student, children, test, local, learn, district, class
10. Military 1 [manpower] veteran, va, forc, militari, care, reserv, serv, men, guard, member
11. Military 2 [infrastructure] appropri, defens, forc, report, request, confer, guard, depart, fund, project
12. Intelligence intellig, homeland, commiss, depart, agenc, director, secur, base, defens
Source: Quinn et al. (2010). Reprinted with permission from John Wiley and Sons.

Sim, Routledge, and Smith (2015) estimate a topic model from the text of amicus briefs to the Supreme Court of the United States. They show that the overall topical composition of briefs for a given case, particularly along a conservative–liberal dimension, is highly predictive for how individual judges vote in the case.

5.  Conclusion

Digital text provides a rich repository of information about economic and social activity. Modern statistical tools give researchers the ability to extract this information and encode it in a quantitative form amenable to descriptive or causal analysis. Both the availability of text data and the frontier of methods are expanding rapidly, and we expect the importance of text in empirical economics to continue to grow.

The review of applications above suggests a number of areas where innovation should proceed rapidly in coming years. First, a large share of text analysis applications continue to rely on ad hoc dictionary methods rather than deploying more sophisticated methods for feature selection and model training. As we have emphasized, dictionary methods are appropriate in cases where prior information is strong and the availability of appropriately labeled training data is limited. Experience in other fields, however, suggests that modern methods will likely outperform ad hoc approaches in a substantial share of cases.

Second, some of the workhorse methods of text analysis such as penalized linear or logistic regression have still seen limited application in social science. In other contexts, these methods provide a robust baseline that performs similarly to or better than more complex methods. We expect the domains in which these methods are applied to grow.

Finally, virtually all of the methods applied to date, including those we would label as sophisticated or on the frontier, are based on fitting predictive models to simple counts of text features. Richer representations, such as word embeddings (3.3), and linguistic models that draw on natural language processing
570 Journal of Economic Literature, Vol. LVII (September 2019)

tools have seen tremendous success elsewhere, and we see great potential for their application in economics.

The rise of text analysis is part of a broader trend toward greater use of machine learning and related statistical methods in economics. With the growing availability of high-dimensional data in many domains—from consumer purchase and browsing behavior, to satellite and other spatial data, to genetics and neuro-economics—the returns are high to economists investing in learning these methods and to increasing the flow of ideas between economics and fields such as statistics and computer science, where frontier innovations in these methods are taking place.

References

Airoldi, Edoardo M., and Jonathan M. Bischof. 2016. “Improving and Evaluating Topic Models and Other Models of Text.” Journal of the American Statistical Association 111 (516): 1381–403.
Airoldi, Edoardo M., Elena A. Erosheva, Stephen E. Fienberg, Cyrille Joutard, Tanzy Love, and Suyash Shringarpure. 2010. “Reconceptualizing the Classification of PNAS Articles.” Proceedings of the National Academy of Sciences of the United States of America 107 (49): 20899–904.
Akaike, H. 1973. “Information Theory and an Extension of the Maximum Likelihood Principle.” In Proceedings of the 2nd International Symposium on Information Theory, edited by B. N. Petrov and F. Csaki, 267–81. Budapest: Akademiai Kiado.
Angrist, Joshua D., and Alan B. Krueger. 2001. “Instrumental Variables and the Search for Identification: From Supply and Demand to Natural Experiments.” Journal of Economic Perspectives 15 (4): 69–85.
Antweiler, Werner, and Murray Z. Frank. 2004. “Is All That Talk Just Noise? The Information Content of Internet Stock Message Boards.” Journal of Finance 59 (3): 1259–94.
Armagan, Artin, David B. Dunson, and Jaeyong Lee. 2013. “Generalized Double Pareto Shrinkage.” Statistica Sinica 23 (1): 119–43.
Atalay, Enghin, Phai Phongthiengtham, Sebastian Sotelo, and Daniel Tannenbaum. 2017. “The Evolving U.S. Occupational Structure.” Washington Center for Equitable Growth Working Paper 12052017.
Athey, Susan, and Guido Imbens. 2016. “Recursive Partitioning for Heterogeneous Causal Effects.” Proceedings of the National Academy of Sciences of the United States of America 113 (27): 7353–60.
Bai, Jushan, and Serena Ng. 2008. “Forecasting Economic Time Series Using Targeted Predictors.” Journal of Econometrics 146 (2): 304–17.
Baker, Scott R., Nicholas Bloom, and Steven J. Davis. 2016. “Measuring Economic Policy Uncertainty.” Quarterly Journal of Economics 131 (4): 1593–636.
Bańbura, Marta, Domenico Giannone, Michele Modugno, and Lucrezia Reichlin. 2013. “Now-Casting and the Real-Time Data Flow.” In Handbook of Economic Forecasting, Vol. 2A, edited by Allan Timmermann and Graham Elliot, 195–237. Sebastopol: O’Reilly Media.
Bandiera, Oriana, Stephen Hansen, Andrea Prat, and Raffaella Sadun. 2017. “CEO Behavior and Firm Performance.” NBER Working Paper 23248.
Belloni, Alexandre, Victor Chernozhukov, and Christian B. Hansen. 2013. “Inference for High-Dimensional Sparse Econometric Models.” In Advances in Economics and Econometrics: Tenth World Congress, Vol. 3, 245–95. Cambridge: Cambridge University Press.
Best, Michael Carlos, Jonas Hjort, and David Szakonyi. 2017. “Individuals and Organizations as Sources of State Effectiveness.” NBER Working Paper 23350.
Bickel, Peter J., Ya’acov Ritov, and Alexandre B. Tsybakov. 2009. “Simultaneous Analysis of Lasso and Dantzig Selector.” Annals of Statistics 37 (4): 1705–32.
Bishop, Christopher M. 1995. Neural Networks for Pattern Recognition. Oxford: Oxford University Press.
Bishop, Christopher M. 2006. Pattern Recognition and Machine Learning. Berlin: Springer.
Blei, David M. 2012. “Probabilistic Topic Models.” Communications of the ACM 55 (4): 77–84.
Blei, David M., and John D. Lafferty. 2006. “Dynamic Topic Models.” In Proceedings of the 23rd International Conference on Machine Learning, edited by W. Cohen and A. Moore, 113–20. New York: Association for Computing Machinery.
Blei, David M., and John D. Lafferty. 2007. “A Correlated Topic Model of Science.” Annals of Applied Statistics 1 (1): 17–35.
Blei, David M., and Jon D. McAuliffe. 2007. “Supervised Topic Models.” Proceedings of the 20th International Conference on Neural Information Processing Systems, edited by J. C. Platt, D. Koller, Y. Singer, and S. T. Roweis, 121–28. Red Hook: Curran Associates.
Blei, David M., Andrew Y. Ng, and Michael I. Jordan. 2003. “Latent Dirichlet Allocation.” Journal of Machine Learning Research 3: 993–1022.
Bollen, Johan, Huina Mao, and Xiaojun Zeng. 2011. “Twitter Mood Predicts the Stock Market.” Journal of Computational Science 2 (1): 1–8.
Bolukbasi, Tolga, Kai-Wei Chang, James Y. Zou, Venkatesh Saligrama, and Adam T. Kalai. 2016. “Man Is to Computer Programmer as Woman Is to Homemaker? Debiasing Word Embeddings.” In Proceedings of the 29th International Conference on Neural Information Processing Systems, edited by D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, 4349–57. Red Hook: Curran Associates.
Born, Benjamin, Michael Ehrmann, and Marcel
Fratzscher. 2014. “Central Bank Communication on Financial Stability.” Economic Journal 124 (577): 701–34.
Boudoukh, Jacob, Ronen Feldman, Shimon Kogan, and Matthew Richardson. 2016. “Information, Trading, and Volatility: Evidence from Firm-Specific News.” Available on SSRN at 2193667.
Breiman, Leo. 2001. “Random Forests.” Machine Learning 45 (1): 5–32.
Breiman, Leo, Jerome H. Friedman, Richard A. Olshen, and Charles J. Stone. 1984. Classification and Regression Trees. Routledge: Taylor and Francis.
Buehlmaier, Matthias M., and Toni M. Whited. 2018. “Are Financial Constraints Priced? Evidence from Textual Analysis.” Review of Financial Studies 31 (7): 2693–728.
Bühlmann, Peter, and Sara van de Geer. 2011. Statistics for High-Dimensional Data. Berlin: Springer.
Candès, Emmanuel J., Michael B. Wakin, and Stephen P. Boyd. 2008. “Enhancing Sparsity by Reweighted ℓ1 Minimization.” Journal of Fourier Analysis and Applications 14 (5–6): 877–905.
Carvalho, Carlos M., Nicholas G. Polson, and James G. Scott. 2010. “The Horseshoe Estimator for Sparse Signals.” Biometrika 97 (2): 465–80.
Chen, Danqi, and Christopher Manning. 2014. “A Fast and Accurate Dependency Parser Using Neural Networks.” In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, 740–50. Stroudsburg: Association for Computational Linguistics.
Chernozhukov, Victor, et al. 2018. “Double/Debiased Machine Learning for Treatment and Structural Parameters.” Econometrics Journal 21 (1): C1–C68.
Chinco, Alexander M., Adam D. Clark-Joseph, and Mao Ye. 2017. “Sparse Signals in the Cross-Section of Returns.” NBER Working Paper 23933.
Choi, Hyunyoung, and Hal Varian. 2012. “Predicting the Present with Google Trends.” Economic Record 88 (S1): 2–9.
Cook, R. Dennis. 2007. “Fisher Lecture: Dimension Reduction in Regression.” Statistical Science 22 (1): 1–26.
Cowles, Alfred. 1933. “Can Stock Market Forecasters Forecast?” Econometrica 1 (3): 309–24.
Das, Sanjiv R., and Mike Y. Chen. 2007. “Yahoo! for Amazon: Sentiment Extraction from Small Talk on the Web.” Management Science 53 (9): 1375–88.
Deerwester, Scott, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, and Richard Harshman. 1990. “Indexing by Latent Semantic Analysis.” Journal of the Association for Information Science 41 (6): 391–407.
Denny, Matthew J., and Arthur Spirling. 2018. “Text Preprocessing for Unsupervised Learning: Why It Matters, When It Misleads, and What to Do about It.” Political Analysis 26 (2): 168–89.
Efron, B. 1979. “Bootstrap Methods: Another Look at the Jackknife.” Annals of Statistics 7 (1): 1–26.
Efron, B. 2004. “The Estimation of Prediction Error: Covariance Penalties and Cross-Validation.” Journal of the American Statistical Association 99 (467): 619–32.
Efron, B. 2012. “Bayesian Inference and the Parametric Bootstrap.” Annals of Applied Statistics 6 (4): 1971–97.
Engelberg, Joseph E., and Christopher A. Parsons. 2011. “The Causal Impact of Media in Financial Markets.” Journal of Finance 66 (1): 67–97.
Evans, James A., and Pedro Aceves. 2016. “Machine Translation: Mining Text for Social Theory.” Annual Review of Sociology 42: 21–50.
Fan, Jianqing, and Runze Li. 2001. “Variable Selection via Nonconcave Penalized Likelihood and Its Oracle Properties.” Journal of the American Statistical Association 96 (456): 1348–60.
Fan, Jianqing, Lingzhou Xue, and Hui Zou. 2014. “Strong Oracle Optimality of Folded Concave Penalized Estimation.” Annals of Statistics 42 (3): 819–49.
Flynn, Cheryl J., Clifford M. Hurvich, and Jeffrey S. Simonoff. 2013. “Efficiency for Regularization Parameter Selection in Penalized Likelihood Estimation of Misspecified Models.” Journal of the American Statistical Association 108 (503): 1031–43.
Foster, Dean P., Mark Liberman, and Robert A. Stine. 2013. “Featurizing Text: Converting Text into Predictors for Regression Analysis.” http://www-stat.wharton.upenn.edu/~stine/research/regressor.pdf.
Friedman, Jerome H. 2002. “Stochastic Gradient Boosting.” Computational Statistics and Data Analysis 38 (4): 367–78.
Gentzkow, Matthew, and Jesse M. Shapiro. 2010. “What Drives Media Slant? Evidence from U.S. Daily Newspapers.” Econometrica 78 (1): 35–71.
Gentzkow, Matthew, Jesse M. Shapiro, and Matt Taddy. 2016. “Measuring Group Differences in High-Dimensional Choices: Method and Application to Congressional Speech.” NBER Working Paper 22423.
George, Edward I., and Robert E. McCulloch. 1993. “Variable Selection via Gibbs Sampling.” Journal of the American Statistical Association 88 (423): 881–89.
Ginsberg, Jeremy, Matthew H. Mohebbi, Rajan S. Patel, Lynnette Brammer, Mark S. Smolinski, and Larry Brilliant. 2009. “Detecting Influenza Epidemics Using Search Engine Query Data.” Nature 457 (7232): 1012–14.
Goldberg, Yoav. 2016. “A Primer on Neural Network Models for Natural Language Processing.” Journal of Artificial Intelligence Research 57 (1): 345–420.
Goldberg, Yoav, and Jon Orwant. 2013. “A Dataset of Syntactic-Ngrams over Time from a Very Large Corpus of English Books.” In Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 1: Proceedings of the Main Conference and the Shared Task: Semantic Textual Similarity, edited by Mona Diab, Tim Baldwin, and Marco Baroni, 241–47. Stroudsburg: Association for Computational Linguistics.
Goodfellow, Ian, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. Cambridge: MIT Press.
Greenstein, Shane, Yuan Gu, and Feng Zhu. 2016.
“Ideological Segregation among Online Collaborators: Evidence from Wikipedians.” NBER Working Paper 22744.
Griffiths, Thomas L., and Mark Steyvers. 2004. “Finding Scientific Topics.” Proceedings of the National Academy of Sciences of the United States of America 101 (S1): 5228–35.
Grimmer, Justin. 2010. “A Bayesian Hierarchical Topic Model for Political Texts: Measuring Expressed Agendas in Senate Press Releases.” Political Analysis 18 (1): 1–35.
Grimmer, Justin, and Brandon M. Stewart. 2013. “Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts.” Political Analysis 21 (3): 267–97.
Groseclose, Tim, and Jeffrey Milyo. 2005. “A Measure of Media Bias.” Quarterly Journal of Economics 120 (4): 1191–237.
Hans, Chris. 2009. “Bayesian Lasso Regression.” Biometrika 96 (4): 835–45.
Hansen, Stephen, Michael McMahon, and Andrea Prat. 2018. “Transparency and Deliberation within the FOMC: A Computational Linguistics Approach.” Quarterly Journal of Economics 133 (2): 801–70.
Hassan, Tarek A., Stephan Hollander, Laurence van Lent, and Ahmed Tahoun. 2017. “Firm-Level Political Risk: Measurement and Effects.” NBER Working Paper 24029.
Hastie, Trevor, Robert Tibshirani, and Jerome Friedman. 2009. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. New York: Springer.
Hastie, Trevor, Robert Tibshirani, and Martin Wainwright. 2015. Statistical Learning with Sparsity: The Lasso and Generalizations. New York: Taylor and Francis.
Hoberg, Gerard, and Gordon Phillips. 2016. “Text-Based Network Industries and Endogenous Product Differentiation.” Journal of Political Economy 124 (5): 1423–65.
Hoerl, Arthur E., and Robert W. Kennard. 1970. “Ridge Regression: Biased Estimation for Nonorthogonal Problems.” Technometrics 12 (1): 55–67.
Hoffman, Matthew D., David M. Blei, Chong Wang, and John Paisley. 2013. “Stochastic Variational Inference.” Journal of Machine Learning Research 14 (1): 1303–47.
Hofmann, Thomas. 1999. “Probabilistic Latent Semantic Indexing.” In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 50–57. New York: ACM.
Horowitz, Joel L. 2003. “The Bootstrap in Econometrics.” Statistical Science 18 (2): 211–18.
Iyyer, Mohit, Peter Enns, Jordan Boyd-Graber, and Philip Resnik. 2014. “Political Ideology Detection Using Recursive Neural Networks.” In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, Vol. 1, edited by Kristina Toutanova and Hua Wu, 1113–22. Stroudsburg: Association for Computational Linguistics.
Jegadeesh, Narasimhan, and Di Wu. 2013. “Word Power: A New Approach for Content Analysis.” Journal of Financial Economics 110 (3): 712–29.
Joachims, Thorsten. 1998. “Text Categorization with Support Vector Machines: Learning with Many Relevant Features.” In 10th European Conference on Machine Learning, edited by Claire Nédellec and Céline Rouveirol, 137–42. Berlin: Springer.
Johnson, Heather A., et al. 2004. “Analysis of Web Access Logs for Surveillance of Influenza.” Studies in Health Technology and Informatics 107 (2): 1202–06.
Jurafsky, Daniel, and James H. Martin. 2009. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. 2nd ed. London: Pearson.
Kass, Robert E., and Larry Wasserman. 1995. “A Reference Bayesian Test for Nested Hypotheses and Its Relationship to the Schwarz Criterion.” Journal of the American Statistical Association 90 (431): 928–34.
Kelly, Bryan, Dimitris Papanikolaou, Amit Seru, and Matt Taddy. 2018. “Measuring Technological Innovation over the Long Run.” NBER Working Paper 25266.
Kelly, Bryan, and Seth Pruitt. 2013. “Market Expectations in the Cross-Section of Present Values.” Journal of Finance 68 (5): 1721–56.
Kelly, Bryan, and Seth Pruitt. 2015. “The Three-Pass Regression Filter: A New Approach to Forecasting Using Many Predictors.” Journal of Econometrics 186 (2): 294–316.
Knight, Keith, and Wenjiang Fu. 2000. “Asymptotics for Lasso-Type Estimators.” Annals of Statistics 28 (5): 1356–78.
Kogan, Shimon, Dimitry Levin, Bryan R. Routledge, Jacob S. Sagi, and Noah A. Smith. 2009. “Predicting Risk from Financial Reports with Regression.” In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, 272–80. Stroudsburg: Association for Computational Linguistics.
Lazer, David, Ryan Kennedy, Gary King, and Alessandro Vespignani. 2014. “The Parable of Google Flu: Traps in Big Data Analysis.” Science 343 (6176): 1203–05.
Le, Quoc, and Tomas Mikolov. 2014. “Distributed Representations of Sentences and Documents.” Proceedings of Machine Learning Research 32: 1188–96.
LeCun, Yann, Yoshua Bengio, and Geoffrey Hinton. 2015. “Deep Learning.” Nature 521 (7553): 436–44.
Li, Feng. 2010. “The Information Content of Forward-Looking Statements in Corporate Filings—A Naïve Bayesian Machine Learning Approach.” Journal of Accounting Research 48 (5): 1049–102.
Loughran, Tim, and Bill McDonald. 2011. “When Is a Liability Not a Liability? Textual Analysis, Dictionaries, and 10-Ks.” Journal of Finance 66 (1): 35–65.
Lucca, David O., and Francesco Trebbi. 2009. “Measuring Central Bank Communication: An Automated Approach with Application to FOMC Statements.” NBER Working Paper 15367.
Manela, Asaf, and Alan Moreira. 2017. “News Implied Volatility and Disaster Concerns.” Journal of Financial Economics 123 (1): 137–62.
Manning, Christopher D., Prabhakar Raghavan, and Hinrich Schütze. 2008. Introduction to Information Retrieval. Cambridge: Cambridge University Press.
Mannion, David, and Peter Dixon. 1997. “Authorship Attribution: The Case of Oliver Goldsmith.” Journal of the Royal Statistical Society, Series D 46 (1): 1–18.
Manski, Charles F. 1988. Analog Estimation Methods in Econometrics. New York: Chapman and Hall.
Mikolov, Tomas, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. “Distributed Representations of Words and Phrases and Their Compositionality.” In Proceedings of the 26th International Conference on Neural Information Processing Systems, edited by C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, 3111–19. Red Hook: Curran Associates.
Morin, Frederic, and Yoshua Bengio. 2005. “Hierarchical Probabilistic Neural Network Language Model.” In Proceedings of the Tenth International Workshop on Artificial Intelligence and Statistics, 246–52. New Jersey: Society for Artificial Intelligence and Statistics.
Mosteller, Frederick, and David L. Wallace. 1963. “Inference in an Authorship Problem.” Journal of the American Statistical Association 58 (302): 275–309.
Murphy, Kevin P. 2012. Machine Learning: A Probabilistic Perspective. Cambridge: MIT Press.
Ng, Andrew Y., and Michael I. Jordan. 2001. “On Discriminative versus Generative Classifiers: A Comparison of Logistic Regression and Naive Bayes.” In Proceedings of the 14th International Conference on Neural Information Processing Systems: Natural and Synthetic, edited by T. G. Dietterich, S. Becker, and Z. Ghahramani, 841–48. Cambridge: MIT Press.
Pang, Bo, Lillian Lee, and Shivakumar Vaithyanathan. 2002. “Thumbs up? Sentiment Classification Using Machine Learning Techniques.” In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 79–86. Stroudsburg: Association for Computational Linguistics.
Park, Trevor, and George Casella. 2008. “The Bayesian Lasso.” Journal of the American Statistical Association 103 (482): 681–86.
Pennington, Jeffrey, Richard Socher, and Christopher Manning. 2014. “GloVe: Global Vectors for Word Representation.” In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), edited by Alessandro Moschitti, Bo Pang, and Walter Daelemans, 1532–43. Stroudsburg: Association for Computational Linguistics.
Politis, Dimitris N., Joseph P. Romano, and Michael Wolf. 1999. Subsampling. Berlin: Springer.
Polson, Nicholas G., and Steven L. Scott. 2011. “Data Augmentation for Support Vector Machines.” Bayesian Analysis 6 (1): 1–24.
Porter, M. F. 1980. “An Algorithm for Suffix Stripping.” Program 14 (3): 130–37.
Preis, Tobias, and Helen Susannah Moat. 2014. “Adaptive Nowcasting of Influenza Outbreaks Using Google Searches.” Royal Society Open Science 1 (2): Article 140095.
Pritchard, Jonathan K., Matthew Stephens, and Peter Donnelly. 2000. “Inference of Population Structure Using Multilocus Genotype Data.” Genetics 155 (2): 945–59.
Quinn, Kevin M., Burt L. Monroe, Michael Colaresi, Michael H. Crespin, and Dragomir R. Radev. 2010. “How to Analyze Political Attention with Minimal Assumptions and Costs.” American Journal of Political Science 54 (1): 209–28.
Rabinovich, Maxim, and David M. Blei. 2014. “The Inverse Regression Topic Model.” Proceedings of Machine Learning Research 32: 199–207.
Roberts, Margaret E., Brandon M. Stewart, Dustin Tingley, and Edoardo M. Airoldi. 2013. “The Structural Topic Model and Applied Social Science.” Paper presented at the Advances in Neural Information Processing Systems Workshop on Topic Models: Computation, Application, and Evaluation, Lake Tahoe.
Rumelhart, David E., Geoffrey E. Hinton, and Ronald J. Williams. 1986. “Learning Representations by Back-Propagating Errors.” Nature 323 (6088): 533–36.
Saiz, Albert, and Uri Simonsohn. 2013. “Proxying for Unobservable Variables with Internet Document-Frequency.” Journal of the European Economic Association 11 (1): 137–65.
Schwarz, Gideon. 1978. “Estimating the Dimension of a Model.” Annals of Statistics 6 (2): 461–64.
Scott, Steven L., and Hal R. Varian. 2014. “Predicting the Present with Bayesian Structural Time Series.” International Journal of Mathematical Modeling and Numerical Optimisation 5 (1–2): 4–23.
Scott, Steven L., and Hal R. Varian. 2015. “Bayesian Variable Selection for Nowcasting Economic Time Series.” In Economic Analysis of the Digital Economy, edited by Avi Goldfarb, Shane M. Greenstein, and Catherine E. Tucker, 119–35. Chicago: University of Chicago Press.
Sim, Yanchuan, Bryan R. Routledge, and Noah A. Smith. 2015. “The Utility of Text: The Case of Amicus Briefs and the Supreme Court.” In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, 2311–17. Palo Alto: AAAI Press.
Stephens-Davidowitz, Seth. 2014. “The Cost of Racial Animus on a Black Candidate: Evidence Using Google Search Data.” Journal of Public Economics 118 (C): 26–40.
Stock, James H., and Francesco Trebbi. 2003. “Retrospectives: Who Invented Instrumental Variable Regression?” Journal of Economic Perspectives 17 (3): 177–94.
Stock, James H., and Mark W. Watson. 2002a. “Forecasting Using Principal Components from a Large Number of Predictors.” Journal of the American Statistical Association 97 (460): 1167–79.
Stock, James H., and Mark W. Watson. 2002b. “Macroeconomic Forecasting Using Diffusion Indexes.” Journal of Business and Economic Statistics 20 (2):
147–62.
Sutskever, Ilya, Oriol Vinyals, and Quoc V. Le. 2014. “Sequence to Sequence Learning with Neural Networks.” In Advances in Neural Information Processing Systems, Vol. 27, edited by Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, 3104–12. La Jolla: Neural Information Processing Systems Foundation.
Taddy, Matt. 2012. “On Estimation and Selection for Topic Models.” In Proceedings of the 15th International Conference on Artificial Intelligence and Statistics, 1184–93. New York: Association for Computing Machinery.
Taddy, Matt. 2013a. “Measuring Political Sentiment on Twitter: Factor Optimal Design for Multinomial Inverse Regression.” Technometrics 55 (4): 415–25.
Taddy, Matt. 2013b. “Multinomial Inverse Regression for Text Analysis.” Journal of the American Statistical Association 108 (503): 755–70.
Taddy, Matt. 2013c. “Rejoinder: Efficiency and Structure in MNIR.” Journal of the American Statistical Association 108 (503): 772–74.
Taddy, Matt. 2015a. “Distributed Multinomial Regression.” Annals of Applied Statistics 9 (3): 1394–414.
Taddy, Matt. 2015b. “Document Classification by Inversion of Distributed Language Representations.” In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, Vol. 2, edited by Chengqing Zong and Michael Strube, 45–49. Stroudsburg: Association for Computational Linguistics.
Taddy, Matt. 2017a. “Comment: A Regularization Scheme on Word Occurrence Rates That Improves Estimation and Interpretation of Topical Content.” https://github.com/TaddyLab/reuters/raw/master/comment/comment-AiroldiBischof.pdf.
Taddy, Matt. 2017b. “One-Step Estimator Paths for Concave Regularization.” Journal of Computational and Graphical Statistics 26 (3): 525–36.
Taddy, Matt, Chun-Sheng Chen, Jun Yu, and Mitch Wyle. 2015. “Bayesian and Empirical Bayesian Forests.” Proceedings of Machine Learning Research 37: 967–76.
Taddy, Matt, Matt Gardner, Liyun Chen, and David Draper. 2016. “A Nonparametric Bayesian Analysis of Heterogenous Treatment Effects in Digital Experimentation.” Journal of Business and Economic Statistics 34 (4): 661–72.
Teh, Yee Whye, Michael I. Jordan, Matthew J. Beal, and David M. Blei. 2006. “Hierarchical Dirichlet Processes.” Journal of the American Statistical Association 101 (476): 1566–81.
Tetlock, Paul C. 2007. “Giving Content to Investor Sentiment: The Role of Media in the Stock Market.” Journal of Finance 62 (3): 1139–68.
Tibshirani, Robert. 1996. “Regression Shrinkage and Selection via the Lasso.” Journal of the Royal Statistical Society, Series B (Methodological) 58 (1): 267–88.
Tong, Simon, and Daphne Koller. 2001. “Support Vector Machine Active Learning with Applications to Text Classification.” Journal of Machine Learning Research 2 (1): 45–66.
Tran, Dustin, Matthew D. Hoffman, Rif A. Saurous, Eugene Brevdo, Kevin Murphy, and David M. Blei. 2017. “Deep Probabilistic Programming.” Paper presented at the 5th International Conference on Learning Representations, Toulon.
Tran, Dustin, Alp Kucukelbir, Adji B. Dieng, Maja Rudolph, Dawen Liang, and David M. Blei. 2016. “Edward: A Library for Probabilistic Modeling, Inference, and Criticism.” Available on arXiv at 1610.09787.
Vapnik, Vladimir N. 1995. The Nature of Statistical Learning Theory. Berlin: Springer.
Wager, Stefan, and Susan Athey. 2018. “Estimation and Inference of Heterogeneous Treatment Effects Using Random Forests.” Journal of the American Statistical Association 113 (523): 1228–42.
Wager, Stefan, Trevor Hastie, and Bradley Efron. 2014. “Confidence Intervals for Random Forests: The Jackknife and the Infinitesimal Jackknife.” Journal of Machine Learning Research 15 (1): 1625–51.
Wainwright, Martin J. 2009. “Sharp Thresholds for High-Dimensional and Noisy Sparsity Recovery Using ℓ1-Constrained Quadratic Programming (Lasso).” IEEE Transactions on Information Theory 55 (5): 2183–202.
Wainwright, Martin J., and Michael I. Jordan. 2008. “Graphical Models, Exponential Families, and Variational Inference.” Foundations and Trends in Machine Learning 1 (1–2): 1–305.
Wisniewski, Tomasz Piotr, and Brendan Lambe. 2013. “The Role of Media in the Credit Crunch: The Case of the Banking Sector.” Journal of Economic Behavior and Organization 85 (C): 163–75.
Wu, Yonghui, et al. 2016. “Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation.” Available on arXiv at 1609.08144.
Yang, Yun, Martin J. Wainwright, and Michael I. Jordan. 2016. “On the Computational Complexity of High-Dimensional Bayesian Variable Selection.” Annals of Statistics 44 (6): 2497–532.
Zeng, Xiaoming, and Michael Wagner. 2002. “Modeling the Effects of Epidemics on Routinely Collected Data.” Journal of the American Medical Informatics Association 9 (6): S17–22.
Zou, Hui. 2006. “The Adaptive Lasso and Its Oracle Properties.” Journal of the American Statistical Association 101 (476): 1418–29.
Zou, Hui, and Trevor Hastie. 2005. “Regularization and Variable Selection via the Elastic Net.” Journal of the Royal Statistical Society, Series B (Methodological) 67 (2): 301–20.
Zou, Hui, Trevor Hastie, and Robert Tibshirani. 2006. “Sparse Principal Component Analysis.” Journal of Computational and Graphical Statistics 15 (2): 265–86.
Zou, Hui, Trevor Hastie, and Robert Tibshirani. 2007. “On the ‘Degrees of Freedom’ of the Lasso.” Annals of Statistics 35 (5): 2173–92.
