
Article

Journal of Marketing Research
2023, Vol. 60(1) 130-154
© American Marketing Association 2022
Article reuse guidelines: sagepub.com/journals-permissions
DOI: 10.1177/00222437221107549
journals.sagepub.com/home/mrj

Vertical Versus Horizontal Variance in Online Reviews and Their Impact on Demand

Nah Lee, Bryan Bollinger, and Richard Staelin

Abstract
This article examines the differential impact of variances in the quality and taste comments found in online customer reviews on
firm sales. Using an analytic model, the authors show that although increased variance in consumer reviews about taste mismatch
normally decreases subsequent demand, it can increase demand when mean ratings are low and/or quality variance is high. In
contrast, increased variance in quality always decreases subsequent demand, although this effect is moderated by the amount
of variance in tastes. Since these theoretical demand effects are predicated on the assumption that consumers can differentiate
between the two sources of variation in ratings, the authors conduct a survey to test this assumption, demonstrating that par-
ticipants are indeed able to reliably distinguish quality from taste evaluations in two subsets of 5,000 reviews taken from larger
data sets of reviews for 4,305 restaurants and 3,460 hotels. The authors use these responses to construct sets of reviews that
they use in a controlled laboratory experiment on restaurant choice, finding strong support for the theoretical predictions. These
responses are also used to train classifiers using a bag-of-words model to predict the degree to which each review in the larger
data sets relates to quality and/or taste, allowing the authors to estimate the two types of review variances. Finally, the authors
estimate the effects of these variances in overall ratings on establishment sales, again finding support for the theoretical results.

Keywords
review variance, vertical and horizontal content, text analysis, machine learning, quality and taste variance, crowdsourced data
Online supplement: https://doi.org/10.1177/00222437221107549

Nah Lee is Assistant Professor of Marketing, Sungkyunkwan Graduate School of Business, Sungkyunkwan University, Korea (email: nah.lee@skku.edu). Bryan Bollinger is Associate Professor of Marketing, Stern School of Business, New York University, USA (email: bryan.bollinger@stern.nyu.edu). Richard Staelin is Gregory Mario and Jeremy Mario Professor of Business Administration, Fuqua School of Business, Duke University, USA (email: rstaelin@duke.edu).

Consumers have long sought the opinions of others before deciding to purchase a product offering. Often these opinions focus on the latent quality level and the positioning of the offering. Before the availability of online reviews, most of this information was obtained in the form of word of mouth in personal conversations. However, now consumers frequently supplement this interpersonal information with online reviews that contain an overall evaluation of the reviewer’s experience, often in terms of a star rating and an accompanying discussion of this experience. This article focuses on how future potential consumers use not only the star ratings but also the quality and taste variance inferred from the text content of the reviews to determine if they want to purchase the focal product offering. We center our attention on the service industry, where customer experiences vary not only in the degree to which the service encounter meets the person’s taste but also in the quality of the delivered service. We show that potential customers can use the text discussions in the reviews to determine the source of the variance in the ratings. Specifically, they can assess (1) the expected level of the quality of the service and the possible range of service encounters and (2) the positioning of the focal firm in terms of the specific features relative to the individual’s ideal preferences, and the importance of the service not providing these ideal features. We then show how this acquired information affects future firm demand.

Our multimethod article extends the analytic work of Sun (2012) and Zimmermann et al. (2018) by allowing the rating variance to be composed of two different, but continuous random variables, one associated with variation in the observed quality of the delivered service and the other associated with consumers’ preferences with respect to the horizontal attributes of the service. In addition, we extend the empirical work of Liu, Lee, and Srinivasan (2019), who look at the effects of both the ratings and the content of reviews on future demand (but do not explicitly consider the two types of heterogeneity, i.e., quality of service and preference in taste). Our focus is on industries
where the rating variance comes from two very different sources that have very different implications. We make explicit why, and under what conditions, variances in prior experiences in quality may have a different impact on future sales than variances in expressed preferences for the taste aspects of the service encounter. After deriving two implications flowing from our analytic model, we test specific aspects of the model using two surveys, one lab experiment, and two field studies. In each case we find strong support for the tested components and implications of the analytic model.

A key insight flowing from the analytic model is that the effects of changes in both types of variances depend on the level of the other variance. Thus, although increases in quality variance always decrease future sales, this effect is moderated by the amount of taste variance. Likewise, quality variance positively moderates the effect of taste variance on demand (while the mean rating negatively moderates it). Here, increases in taste variance increase future sales if and only if mean ratings are low and/or quality variance is high. Otherwise, the effect is negative. The possible increase in sales is due to the fact that increases in taste variance imply higher average quality of the service as well as higher taste mismatch cost, when the mean rating is held fixed. In other words, taste variance also signals the latent average quality of the establishment. One implication of this interrelationship of the two variances is that when a firm invests in increasing its service reliability, it may find that this strategy has only marginal value because this lower quality variance can change the effect of taste variance on future demand from positive to negative, a finding we demonstrate empirically.

The analytic model explicitly outlines a process by which readers determine the variances of the two types of review content, since this information allows second-period consumers to learn about the firm’s latent mean quality, the service reliability (variance in quality), the firm’s positioning, and the importance of not having the firm provide the individual’s ideal features (taste mismatch cost). Underlying this process is the assumption that readers of the reviews can partition the content of the reviews into quality comments and taste comments. Using survey data of 5,000 restaurant and 5,000 hotel reviews, we show that consumers can reliably partition the content of these reviews into the two components. We then use this classification information to further predict the proportion of quality and taste content in each of over 900,000 text reviews for restaurants and hotels. Then, using the same assumed process that consumers use, we calculate the quality and taste variances for each restaurant and hotel in our sample. Next, using a controlled laboratory setting, we use a 2 × 2 × 2 within-person design to show that consumers behave consistently with the predictions of our model, that is, (1) participants have a lower purchase intent for an establishment that has a higher variance in the content discussing quality than for an establishment with a lower variance in quality-related content, all else equal, and (2) participants have a higher purchase intent for an establishment that has a higher variation in the review content discussing taste issues compared with an establishment that has a lower variation in these reviews, all else equal, but only when the mean rating is low and/or the quality variance is high. Using the reviews and sales figures for restaurants in the San Francisco area and hotels located in Texas, we replicate the results using instrumental variable (IV) regressions, using the within-establishment variation in demand and ratings.

Our contributions are threefold. First, by building on the model of Sun (2012), we not only allow for the star ratings to vary with the two different product attributes (i.e., stochastic service quality and a mismatch of the individuals’ preferences with the establishment’s features) but also explicate the process by which consumers of a product offering write specific types of content in their reviews and how readers of these reviews use this information to update their prior beliefs about the mean quality level and its reliability, the positioning of the focal firm, and the mismatch costs, all from the mean rating and the two variance components. This theoretical framework enables us to derive results regarding how vertical quality and horizontal taste attributes differentially affect sales outcomes. Importantly, these results have managerial implications for the effectiveness of service quality initiatives and repositioning strategies. Second, since these theoretical results assume that consumers can reliably distinguish between vertically and horizontally oriented comments in reviews, we empirically show this ability in two different industry settings. Third, we show that the demand predictions that result from our analytic model hold both in a controlled laboratory setting and in more generalizable field settings, and that these predictions differ from those in the literature. Finally, in the process, we illustrate that a simple and efficient machine learning methodology can be used to disentangle different sources of variance in reviews.

Related Literature

Our investigation builds on a diverse set of literatures and topics including service quality and its effects on firm and consumer behavior, the effects of variable messages on consumer choice, and the effect of review variability on subsequent purchase behavior. One way of partitioning this literature is by whether the consumer has multiple experiences, and thus is trying to learn about the latent value underlying the experience, or is looking for a priori information to assess the expected latent value for the initial product experience. The latter situation is the most relevant for our study, since our second-period consumers use the reviews to assess the latent quality value and the possible range of outcomes of their first encounter. However, findings pertaining to the former situation are relevant to the firm, since any firm behavior could affect returning customers (retention), possible future reviews from new customers, and thus the dynamic nature of reviews. With this noted, the dominant view is that consumers, in general, downgrade services that exhibit service quality variation in terms of both adoption (Meyer 1981) and retention (Boulding et al.
1993; Rust et al. 1999; Sriram, Chintagunta, and Manchanda 2015).

Although theoretically consumers do not like variation in service quality, they may respond positively to uncertainty in terms of ratings (but not quality). West and Broniarczyk (1998) propose that consumers form an aspiration rating level as their reference point, and their reaction to review dispersion depends on whether the average rating is above or below the reference point: consumers dislike dispersion in ratings for products with the average rating above their reference point but prefer review disagreement when it is below the reference point. Sun (2012) finds similar results, in which the effect of review variance associated with horizontal attributes depends on the average rating, although these results are derived, not from prospect theory, but in a game theoretic setting with risk-neutral consumers. In a similar spirit, Clemons, Gao, and Hitt (2006) provide empirical evidence that high-variance items can serve as a hyperdifferentiation strategy, which is helpful for brand growth and new product introduction. More recently, Rozenkrants, Wheeler, and Shiv (2017) find that people view polarizing products as a vehicle for self-expression and prefer them when they experience low self-concept clarity and the product attributes are related to self-expression (i.e., style but not quality).

The conflicting results in the literature about the effect of review variability on demand may stem from not specifying the source of variability in reviews. A large set of work focuses on products with deterministic quality, such as books (Chen, Wu, and Yoon 2004; Chevalier and Mayzlin 2006; Sun 2012), movies (Duan, Gu, and Whinston 2008a, b; Liu 2006; Zhang and Dellarocas 2006), and video games (Zhu and Zhang 2010). For these products, most of the review variance likely stems from individuals’ different preferences (although there certainly could be different opinions about the objective quality of the product). In contrast, service product offerings in industries such as restaurants, cruise ship vacations, airlines, hairdressers, spas, and hotels naturally have variation in delivered quality, due to differing human and/or product interactions with each transaction. In these latter service situations, it is important to consider not only review variance but also the source of the review variance when its impact on future sales is examined. In prior work on the impact of reviews on restaurants (Anderson and Magruder 2012; Luca 2016) and hotels (Vermeulen and Seegers 2009), the review variance is ignored altogether. Only Sun (2012) considers review variance coming from heterogeneous tastes. No research of which we are aware empirically addresses the differential effect of vertical and horizontal information variation on subsequent purchases.

With this noted, the work of Zimmermann et al. (2018) most closely relates to our conceptualization, since the authors incorporate vertical quality risk into Sun’s (2012) theoretical framework. In contrast to our findings, they find that a higher variance caused by taste differences always results in a higher price and lower demand; but as they point out, their model varies from that of Sun in that the support for their mean rating has a higher lower bound than in Sun, whose results depend on the mean rating being sufficiently low. Importantly, their model assumes that quality risk only manifests in terms of a possible quality failure, resulting in a rating of zero. This simplification makes their model parsimonious, while still incorporating variance coming from the vertical quality dimension. However, it also means that consumers can easily differentiate between quality variance and taste variance, since quality failures are visible as zero ratings in the entire distribution of ratings. This sidesteps the question of whether consumers can disentangle the variance from the two sources in a more general setting. Quality issues in many industries with stochastic quality realizations normally do not arise in a binary fashion, but instead occur to a varying degree for each consumption activity. Under this alternative data-generating process, the quality variance would be blended with the taste variance in a way that the two cannot be distinguished just by looking at the distribution of ratings. Whether consumers can separate the two using review content is an empirical question, and one addressed in this article.

Although they do not study review variance from taste-related reviews, Tucker and Zhang (2011) examine the interaction between “ratings” (actually popularity) and the breadth of the market. They propose that low popularity may be due to either lower quality or narrower appeal. They note that “the same level of popularity implies higher quality for narrow-appeal products than for broad-appeal products” (p. 828). Similarly, we show that higher variance in reviews due to taste heterogeneity implies a higher mean quality level. In a different setting (i.e., earnings report), Harbaugh, Maxwell, and Shue (2016) use a Bayesian updating formulation to study the effects of ratings from multiple reports on the reader’s posterior beliefs and find empirical support that good news (i.e., high ratings) is more persuasive when the ratings are more consistent, and bad news (i.e., low ratings) is less damaging when the ratings are less consistent.

Finally, several recent papers (e.g., Bondi 2019; Chen et al. 2021; Hu, Pavlou, and Zhang 2017) look at the dynamic aspect of reviews on the establishment’s future customer base. In this newer stream of research, the current reviews affect which consumers choose to purchase the product today, possibly changing the composition of subsequent reviewers. Such dynamics may lead to cyclical patterns in ratings and demand, in which a lower rating leads consumers to buy the product only if there is a very good taste match between them and the product, which leads to higher ratings, which then leads to higher demand from consumers with lower taste match, who then leave lower reviews, and so on. Although we do not directly model such dynamics, in the conclusion section we briefly discuss the implication of our empirical results for review dynamics.

Model

We develop a model that explicates the process by which first-period customers write reviews of their service experience and
second-period potential customers use this review information to determine if they want to choose this focal establishment. In this way, we derive the firm’s second-period demand function, which is a function of our three key review variables, namely the mean review rating ($M$) and the two variance components of this rating, one associated with the firm delivering stochastic quality experiences ($V^q$) and the second coming from customers having heterogeneous preferences for the firm’s horizontal features ($V^t$). Then, using this second-period demand, we derive comparative statics to generate our two testable hypotheses.

Given our interest in the service industry, we broaden Sun’s (2012) model (which assumes that ratings vary only because of heterogeneous consumer preferences) by allowing ratings to also vary because of variances in the firm’s delivered service. Thus, the product in our model (a good or service) is characterized along two dimensions: (1) a quality distribution from which stochastic quality is drawn, as a result of factors such as server variability (vertical dimension), and (2) the product’s positioning relative to individuals’ preferences for its features (horizontal dimension). Following the convention of the service quality literature (e.g., Boulding et al. 1993; Rust et al. 1999), we treat quality as a random variable. This has two implications. First, the uncertain service quality implies that consumers face an a priori risk. Second, since most consumers do not like uncertainty (i.e., they are risk averse), quality variance in the service industry is bad. Moreover, since quality lies in the vertical dimension, higher quality results in higher utility for all consumers. With quality held fixed, different consumers can also experience different utility for an offering due to the firm’s positioning not meeting the specific individual’s ideal preference. We model heterogeneity only in consumer tastes, assuming identical consumer valuations for quality. We model taste preference as a fixed, time-unchanging, idiosyncratic attribute for each consumer for each product. In our model, the uncertainty in the taste mismatch gets resolved through reviews, while the uncertainty about quality is quantified in terms of the average quality level and its variance.

A key feature of our model is the explication of a process in which first-period customers decide what information to convey in their reviews and second-period readers interpret and use this information. Specifically, we assume that (1) these first-period consumers write about experiences that deviate from their expectations and (2) the percentage of the review devoted to quality ($q_i$) versus taste ($1 - q_i$) reflects the relative magnitude of the deviations between (1) the realized and expected quality and (2) the individual’s taste mismatch versus the average consumer’s mismatch.1 The second-period consumers know this is the process that generates the reviews, and thus they can partition the reviewer’s deviation from the mean rating for all of the focal firm’s reviews into the quality deviation and the taste deviation. Once these two deviations are known for all of the focal establishment’s reviews, the second-period consumer can calculate the two component variances, $V^q$ and $V^t$, which allows the individual to determine not only the establishment’s average quality level and reliability of the service but also the positioning of the firm as well as the individual’s mismatch costs. Consumers then use this information to decide whether to buy the product after the firm sets the price for the second period. Although our primary model specification assumes that firms adjust prices (optimally) in the second period, as in Zimmermann et al. (2018), our results are robust to an alternative scenario in which prices are held constant or set at some other observed level in the second period.2

Given this overview, we next lay out the specific elements of our model. We characterize a product experience in terms of quality and taste mismatch cost. Quality, which is stochastic in nature, is uniformly distributed from $\bar{v} - r$ to $\bar{v}$, where $\bar{v}$ is the maximum possible delivered quality and $r$ is the range of possible delivered quality experiences. Consumer tastes are uniformly distributed in the horizontal dimension. Consumers will consider purchasing a product if they lie within a distance of 1 from a particular product’s taste location. Consumers are fully aware of their ideal taste preference, that is, their exact location in the taste dimension. When a consumer located at $x \in [0, 1]$ distance away from the focal product purchases it at a price $p$ and consumes it, this consumer’s utility is $\dot{v} - tx - p$, where $\dot{v}$ is the realization of stochastic quality and $t > 0$ is the taste mismatch parameter.3

Before the anticipated product launch, the product’s quality and taste mismatch cost are unknown to both the seller and the consumers. However, everyone has unbiased prior beliefs for $\bar{v}$ and $r$, and the joint probability density function is denoted $f(\bar{v}, r)$. Likewise, it is common knowledge that each product has a fixed attribute $t$, and the unbiased prior belief on $t$ is $g(t)$. Given this uncertainty, the indifferent first-period consumer is located at $D_1$ distance away from the product, such that

$$E_{\bar{v},r,t}\left(\text{C.E.}[v - tD_1 \mid f(\bar{v}, r), g(t)]\right) - p_1 = 0, \tag{1}$$

where $\text{C.E.}[v - tx]$ stands for the certainty equivalent of the utility derived from uncertain product attributes, for a consumer at $x$ distance from the product. Assuming identical prior beliefs and degree of risk aversion, any consumers located at $x \in [0, D_1]$ from the product would purchase it, making the first-period demand $D_1$, with a unit mass assumption.4

1. This assumption is similar to the Bayesian concept that data only has value if it alters a person’s prior belief. It also has behavioral support in that consumers like to talk about experiences that excite or frustrate them.
2. In fact, we control the second-period price in our laboratory experiment, in contrast to our field studies, where we assume that sellers provide appropriate price response.
3. By assuming that consumers within a distance of 1 from a product are the only ones to consider it, we are normalizing on $x$ across the products and have $t$ capture the variance in the taste dimension for each product.
4. A necessary condition for sales to occur in the first period is $E_{\bar{v},r}\left(\text{C.E.}[v \mid f(\bar{v}, r)]\right) - p_1 > 0$. A consumer whose taste is exactly matched with the product ($x = 0$) would purchase it.

Once each consumer visits the service, $v$ and $t$ are realized and (by assumption) each consumer leaves a product rating
reflecting their individual realized utility: $s(x) = \dot{v} - tx$. The distribution of these ratings reflects the distribution of the consumer utilities, shifted by the price, and, as shown in Figure 1, is the sum of two uniform distributions.5,6

Figure 1. Early-Stage Distribution of Ratings.

It is easy to show that the mean ($M$) and the variance ($V$) of these ratings left by the first-period consumers are

$$M = \bar{v} - \frac{r}{2} - \frac{tD_1}{2} \quad \text{and} \quad V = \frac{1}{12}\left(r^2 + t^2 D_1^2\right), \tag{2}$$

in which $D_1$ is the first-period demand.7

Note that the variance of the ratings across these first-period individuals can be decomposed into two parts: the part associated with the product’s delivered stochastic quality, as captured by the range of possible quality outcomes, $r$, and the part associated with the taste mismatch parameter, $t$, that is,

$$V^q = \frac{1}{12} r^2 \quad \text{and} \quad V^t = \frac{1}{12} t^2 D_1^2. \tag{3}$$

These variances reflect the variations in the individuals’ quality and taste deviations from their expectations, which, according to our assumed review writing process, determines the proportion of their reviews devoted to quality and taste. For the remainder of the article, we let $V^q$ denote the partial variance arising from quality variation and let $V^t$ denote the counterpart variance arising from taste mismatch, where the sum of the two equals the total variance, $V$, that is, $V = V^q + V^t$.

We next assume that the second-period consumers, after reading the first-period reviews, are capable of decomposing each rating into these two components, in expectation.8 Once they have determined the quality and taste deviations across many reviews, the second-period consumers can infer $V^q$ and $V^t$. Using this information, along with the observable mean rating $M$ (and the first-period demand $D_1$, which can be inferred9), the second-period consumers can determine the true underlying product attributes (i.e., the quality distribution parameters, $\bar{v}$ and $r$, and the taste mismatch parameter, $t$) and thus the disutility of not getting the most preferred option, that is,

$$\bar{v} = M + \sqrt{3V^q} + \sqrt{3V^t}, \quad r = 2\sqrt{3V^q}, \quad \text{and} \quad t = \frac{2\sqrt{3V^t}}{D_1}. \tag{4}$$

From this knowledge, the second-period consumers fully resolve any uncertainty in $t$ and can make their purchase decisions, conditional on second-period prices. This allows us to determine second-period demand; that is, $D_2$ is derived by noting that the indifferent consumer in the second period satisfies

$$\text{C.E.}[v \mid \bar{v}, r] - tD_2 - p_2 = 0, \tag{5}$$

where $\text{C.E.}[v \mid \bar{v}, r]$ depends on the now-known $\bar{v}$ and $r$.10,11,12 Expecting this demand function, the seller finds the optimal second-period price that maximizes the profit, by solving $\max_{p_2} \, p_2\left(\text{C.E.}[v \mid \bar{v}, r] - p_2\right)/t$. The equilibrium demand for the second period is found as

$$D_2^* = \frac{\text{C.E.}[v \mid \bar{v}, r]}{2t}. \tag{6}$$

To further solve Equation 6, we assume constant absolute risk aversion; that is, the degree of risk aversion is constant. This specific utility is represented as $U(v) = 1 - e^{-\alpha v}$, with the absolute risk aversion coefficient $A(v) = -\frac{U''(v)}{U'(v)} = \alpha > 0$. The certainty equivalent of $v$ is then derived from the definition of certainty equivalent leading to the following:

$$\text{C.E.}[v \mid \bar{v}, r] = \bar{v} - \frac{1}{\alpha}\left[\ln\frac{1}{\alpha r} + \ln\left(e^{\alpha r} - 1\right)\right]. \tag{7}$$

5. We do not consider binning or censoring for the ratings, so theoretically the domain for ratings is $[-\infty, \infty]$.
6. We admit that our two-stage model is a discrete simplification of what, in reality, would result in a continuous evolvement of review distributions. However, even if we abstract away from the uniform distributional assumptions for both quality and taste space, and thus the mean and the variance no longer contain full information about both product attributes, we find that once consumers can correctly distinguish the two underlying distributions of quality and taste, the qualitative insights driven from our simpler model continue to hold as long as consumers are risk averse toward the stochasticity in quality, and truth telling is satisfied so that consumers can correctly identify the taste parameter of each product. This is why it is important for consumers to be able to distinguish the two dimensions of product attributes in others’ feedback.
7. Our model accounts for risk aversion due to uncertainty in $\bar{v}$, $r$, and $t$ as well, which will impact the first-period demand and, through this demand, the distribution of ratings. With more uncertainty in the prior beliefs of $\bar{v}$, $r$, and $t$, only consumers with less of a taste mismatch will purchase in the first period.
8. The detailed discussion of how consumers infer the expected $V^q$ and $V^t$ is found in the Appendix.
9. The first-period demand $D_1$ can be inferred by the second-period consumers if they share the same prior beliefs and risk aversion with first-period consumers and know the first-period price; that is, the second-period consumers derive $D_1$ from Equation 1: $\iint \text{C.E.}[v - tD_1 \mid f(\bar{v}, r), g(t)] \, dF(\bar{v}, r) \, dG(t) = p_1$.
10. A necessary condition for sales to occur in the second period is $\text{C.E.}[v \mid \bar{v}, r] - p_2 > 0$.
11. We assume that the market is never fully covered to avoid discussion of corner solutions. A sufficient condition for incomplete market coverage is: for products with stochastic quality $v \in [\bar{v} - r, \bar{v}]$, $\max(\bar{v}) < \min(t)$.
12. See Web Appendix A for a summary of information sets in the consumer decision process.
Lee et al. 135

Figure 2. Second-Period Demand. √q 1   √q 


Notes: Second-period demand is plotted against Vt and $w = M + \sqrt{3V^{q}} - \frac{1}{\alpha}\left[\ln\frac{1}{2\alpha\sqrt{3V^{q}}} + \ln\left(e^{2\alpha\sqrt{3V^{q}}} - 1\right)\right]$ (a constant D1 = 4 is assumed).

Substituting v and r from Equation 4, we get

$$\mathrm{C.E.}[v \mid M, V^{q}, V^{t}] = M + \sqrt{3V^{q}} + \sqrt{3V^{t}} - \frac{1}{\alpha}\left[\ln\frac{1}{2\alpha\sqrt{3V^{q}}} + \ln\left(e^{2\alpha\sqrt{3V^{q}}} - 1\right)\right]. \tag{8}$$

Substituting Equations 4 and 8 into Equation 6, we find the second-period equilibrium solution for demand in terms of M, Vq, and Vt:

$$D_{2}^{*} = \frac{D_{1}\left\{M + \sqrt{3V^{q}} + \sqrt{3V^{t}} - \frac{1}{\alpha}\left[\ln\frac{1}{2\alpha\sqrt{3V^{q}}} + \ln\left(e^{2\alpha\sqrt{3V^{q}}} - 1\right)\right]\right\}}{4\sqrt{3V^{t}}}. \tag{9}$$

Using this demand function, we can determine the marginal effect of the three observable review statistics, M, Vq, and Vt, on the second-period demand:

$$\frac{\partial D_{2}^{*}}{\partial M} > 0, \quad \frac{\partial D_{2}^{*}}{\partial V^{q}} < 0, \quad \text{and} \quad \frac{\partial D_{2}^{*}}{\partial V^{t}} > 0 \ \text{ iff } \ M < -\sqrt{3V^{q}} + \frac{1}{\alpha}\left[\ln\frac{1}{2\alpha\sqrt{3V^{q}}} + \ln\left(e^{2\alpha\sqrt{3V^{q}}} - 1\right)\right]. \tag{10}$$

Moreover, the conditional effect of Vt implies $\frac{\partial^{2} D_{2}^{*}}{\partial V^{t}\,\partial M} < 0$ and $\frac{\partial^{2} D_{2}^{*}}{\partial V^{t}\,\partial V^{q}} > 0$.

These comparative static results lead to the following predictions on how the second-period demand is altered by each type of variance:

H1: An increase in review variance due to quality inconsistency across reviewers’ experiences always leads to lower sales.

H2: An increase in review variance due to taste mismatch costs leads to higher sales if and only if the mean rating is sufficiently low and/or the quality variance is sufficiently high; otherwise, taste variance leads to lower sales.

H1 is intuitive and is addressed in Zimmermann et al. (2018) and Boulding et al. (1993), albeit with different model assumptions. In addition, since demand is a multiplicative function of Vq and Vt, we expect the marginal effect of Vq to depend on the level of Vt. We discuss this moderating effect in more detail subsequently. H2 reconciles the contradictory results found in Zimmermann et al. and Sun (2012) by noting how the actual sign reversal of Vt depends on not only M, as in Sun, but also Vq for products with stochastic quality. This sign reversal is made clear in Figure 2, where we plot the second-period demand against the taste variance, Vt, and a composite index, w (which increases in the mean, M, and decreases in the quality variance, Vq). The figure shows that Vt positively affects second-period demand if and only if w is less than zero.
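The sign reversal of the taste variance can be checked numerically. The following is a minimal sketch, not the authors' code: it evaluates the composite index w from the notes to Figure 2 and the demand function in Equation 9. The function names and the parameter values in the usage note (α = 1, Vq = 1/3, and the chosen values of M and Vt) are illustrative assumptions only.

```python
import math


def composite_index(M, Vq, alpha):
    """Composite index w: M + sqrt(3Vq) - (1/alpha)[ln(1/x) + ln(e^x - 1)], x = 2*alpha*sqrt(3Vq)."""
    root_q = math.sqrt(3.0 * Vq)
    x = 2.0 * alpha * root_q
    # math.expm1(x) computes e^x - 1 accurately for small x.
    return M + root_q - (math.log(1.0 / x) + math.log(math.expm1(x))) / alpha


def second_period_demand(M, Vq, Vt, alpha, D1=4.0):
    """Equation 9 rewritten as D1 * (w + sqrt(3Vt)) / (4 * sqrt(3Vt))."""
    root_t = math.sqrt(3.0 * Vt)
    return D1 * (composite_index(M, Vq, alpha) + root_t) / (4.0 * root_t)
```

Under these illustrative values, `composite_index(0.0, 1/3, 1.0)` is negative, and `second_period_demand` rises when Vt goes from 1 to 2; with M = 1.0 the index is positive and demand falls over the same change in Vt, in line with Equation 10.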

Before discussing the intuition behind H2, we note that the two hypotheses are “all else” statements, since they come from comparative statics analyses. Consequently, they can be viewed (and tested) by comparing two firms that differ only on one particular review variable at one point in time (as in our laboratory experiment) or by observing how within-firm changes over time in a given review variable affect sales (as in our field studies).

With this preamble stated, we now discuss the intuition behind the effect of Vt in H2. Holding fixed M and Vq, consumers can determine that a firm with a higher Vt in ratings, compared with an otherwise identical firm, indicates that the taste parameter t is large (i.e., they should seriously consider the cost of taste mismatch). However, as seen from Equation 4, it also means that the underlying average quality level of the focal firm, v − r/2, is higher than that of the otherwise identical firm. This results in two counterbalancing effects: second-period demand shifts outward because of the high average quality, increasing the number of potential customers, but the high cost of mismatch implies that fewer second-period customers would choose the focal product offering. Which effect is greater depends on the quality variance, Vq, as well as the mean of the first-period rating, M. If M is low (and/or Vq is high), then a larger Vt results in larger second-period sales (see Figure 2). In this situation, the effect of higher average quality level is greater than the effect of the taste mismatch cost. However, if M is high (and/or Vq is low), then the taste mismatch effect is greater, and a larger Vt results in smaller second-period sales for the focal firm.

Although these implications provide very specific predictions on how demand responds to changes in the statistics of the review ratings, H1 and H2 only follow if consumers can distinguish between the two different sources of variance and use this information in their purchase decision process. To date, the differential effects of the two different sources of variance have never been empirically tested. Therefore, we next demonstrate that consumers are able to disentangle the two sources of rating variance. Then, in subsequent sections, we test our two demand hypotheses in a controlled laboratory setting and two field studies, one for restaurants and the other for hotels.

Product Experience Data

Consumer Reviews

We start our discussion by describing the two large data sets (restaurant reviews and hotel reviews) that we use in our subsequent analyses. We focus our attention on industries in which, a priori, we believed that product experiences reflect both vertical and horizontal attributes. Restaurants and hotels fit this description well. This approach led us to collect reviews for businesses in these two industries that are found on Google Places. Google allows business owners to register their firms and post information about the business. When consumers search on Google Maps, this information is displayed, along with the reviews posted by customers (see Figure 3). These reviews always include a star rating, and about 60% of the restaurant reviews sampled and 70% of the hotel reviews sampled also include text detailing the customer’s experience. Once these establishments were matched with revenue data, the vast majority of matched businesses (over 95%) were found to have text reviews. We define our estimation samples as the two sets of matched data.

The first sample of businesses includes 4,305 restaurants, bars, cafes, and bakeries, the overwhelming majority being restaurants, located in San Francisco, California, and in neighboring cities; the second sample includes 3,460 franchised lodging establishments located in Texas.13 We collected the sample of restaurant reviews in May 2018. The first reviews were written in the early 1990s, although almost all of the reviews are post-2010, which is when the service was integrated into its current format.14 The hotel reviews were collected in July 2020 and span the years 2006 to 2019, although most are post-2010.15 When calculating review statistics for a given establishment at a given time point, we assumed that the reviews posted up to the date under investigation were available to consumers.

Given our setting, each rating incorporates both the quality level experienced by the reviewer and that person’s taste mismatch cost. However, only if the rating is accompanied with review text can consumers disentangle the contributions of these two sources on the rating variation. Consequently, in the main analyses we limit our analysis to reviews with text.16

Content Coding

Overview. The plan for analysis of our samples of reviews centers on three different (yet related) objectives. First, we wanted to determine if consumers can reliably separate the review comments into statements on vertical quality and horizontal taste matches. Second, assuming the respondents were successful in identifying these two aspects of a given review, we wanted to use the information from this survey of respondents to better understand which words are associated with statements about quality versus taste for the focal industry.

13 The exact target area for the restaurant sample includes the latitude range of [37.690, 37.906] and the longitude range of [−122.518, −122.200] in degrees. For our hotel sample, we limited our scope to franchised lodging establishments that file their monthly revenues to the State of Texas, in order to exclude Airbnb’s and other individual vacation rentals.

14 A large number of reviews were posted in 2017 (because of exponential growth that Google Reviews experienced in 2016–2017), but the remaining reviews were approximately evenly distributed over the remaining six years.

15 We dated reviews using Google’s posted date for the review (e.g., 3 weeks ago, 1 year ago). Thus, the posted times are approximations of the true time when the given review was available for a consumer to view it.

16 We compared the mean of the ratings given without any text with those with text. We find that these means are very similar; the average across all restaurants (hotels) is 4.245 (4.079) for rating-only reviews and 4.184 (3.873) for text reviews, although the average variances are slightly lower for no-text reviews. Thus, there appears to be little concern for selection bias. In addition, when we include the statistics of the ratings with no text as control variables in a robustness check, their effect is insignificant and the effect of the text reviews remains the same.

Figure 3. Business Information Landing Page (Left) and Business Reviews (Right) on Google.

This understanding of word use allowed us to create reviews in which we manipulated M, Vq, and Vt within a laboratory experiment. Third, we used these responses to train two classifier algorithms for each sample that calculate the degrees to which each review discusses quality-related topics (i.e., our qi measure) and taste-related topics (1 − qi), based on the text of the review. We used this classification for each review, along with the deviation of the reviewer’s rating from the mean rating, to calculate an expected Vq and Vt for each establishment in our total sample in our field studies. We describe our text analysis methodology and the responses of the classification surveys next.

Methodology. We started with the entire set of 283,069 restaurant text reviews, which contained 28,572 unique words in total, and the entire set of 634,121 hotel text reviews, which contained 31,080 unique words in total. We reduced the dimensionality by eliminating words that occurred only once, because their effect on classification would be minimal. This left us with 17,902 words for restaurants and 18,577 words for hotels. We selected a sample of 5,000 text reviews for both industries to be classified by survey respondents. Our goal was to choose the 5,000 restaurant reviews and 5,000 hotel reviews that offered as large a coverage of the words as possible, favoring “important” words, that is, ones that affect the ratings. Specifically, we ran elastic net regression (which combines L1 and L2 penalties) of ratings on the entire set of terms to identify words that were highly associated with the ratings. Iteratively using sets of regularization thresholds, we then approximately matched the number of surviving words to what 5,000 reviews could possibly contain. The net result was the identification of a set of 5,000 restaurant reviews that contained 12,798 words, or about 71% of those 17,902 words, and a set of 5,000 hotel reviews that contained 10,211 words, or about 55% of those 18,577 words. We then assigned these sets of 5,000 reviews to two different groups of workers (one for restaurants, one for hotels) to classify their assigned set of reviews along the two dimensions of interest. We note in passing that the 5,000 restaurant reviews also contained 4,983 words that appeared only once, so these 5,000 reviews contained 17,781 words in total, and the 5,000 hotel reviews contained 6,114 words that appeared only once, so these 5,000 reviews contained 16,325 words in total.

After obtaining this subset of 5,000 reviews for each industry, we divided them into bundles of ten reviews, resulting in
500 surveys. The surveys were distributed to experienced Amazon Mechanical Turk (MTurk) workers who were asked to classify the parts of each review in their bundle associated with quality and with taste. The length of each review ranged from about ten to 250 words. To keep the task similar for each survey, we grouped the reviews, keeping the total length of each set approximately equal. Thus, each survey of ten reviews consisted of three long, four medium-length, and three short reviews; that is, length (and not content) was the criterion used to place the reviews in batches of ten. Workers were asked to read generic definitions of quality and taste statements (i.e., quality concerns aspects of the experience where everyone would agree that high quality is better than low quality, and taste refers to aspects where individual preferences differ across the population). The workers were asked to read the first text review in full and then pick out the phrases or sentences that are related to quality issues. Then, they repeated this process for personal taste or fit issues expressed in that first review. Phrases not selected were considered unrelated to that topic and thus unrelated to the rating (see Figure 4 for an example response to these two tasks for restaurants).

These text selection questions enabled us to let the respondents decide how they distinguish vertically oriented quality issues from horizontally oriented personal taste issues. Some respondents selected an entire paragraph as a discussion about a topic, whereas others directly pointed to specific words or phrases that they felt were central to the discussion, thereby providing more focused measures of which words are driving the nature of the review. Since the workers were free to use any classification strategy, some sentences were selected as being related to both quality and taste, and other answers were left blank if the worker believed that the review covered only one or neither dimension. In addition, the workers were asked two multiple-choice questions associated with each review, one indicating the relative importance of quality- versus taste-related issues and the other indicating the overall helpfulness of the text review. The respondent repeated this process nine more times, once for each review.

We took two approaches to ensure that the responses were high-quality and reliable. First, we screened all answers by comparing an individual’s multiple-choice response to the question about the relative importance of quality- versus taste-related issues with the amount of text input provided for quality- and taste-related content for a given review. We used this comparison to identify and filter out fraudulent submissions (those automatically filled out by bots, or someone who picked random or identical answers to all questions). We also included an attention check question near the end of the survey (see Web Appendix B for the format of this checkpoint), to verify that the worker was reading each review carefully. Any submission without a correct answer to this attention check was dropped.

Second, we selected 150 reviews (representing 15 unique sets of reviews) to be analyzed by 6 to 12 unique respondents for each set of classifier reviews for restaurants, and by 4 to 6 unique respondents for each set of classifier reviews for hotels. We used these responses to calculate a Cronbach’s alpha for each set of reviews, based on the answers to the multiple-choice questions. The average across 15 statistics for restaurants (hotels) is .824 (.900), and the median is .814 (.909). All sets had a Cronbach’s alpha greater than the commonly acceptable threshold of .7. Although the remaining 485 surveys in each sample were rated only once, we note that many of the words (and especially the frequently occurring ones) appeared multiple times throughout the samples of 5,000 reviews. Thus, the words related to the common topics were actually rated many times. We provide the exact survey instrument and other details of the administration in Web Appendix B.

Survey responses. We next examined which words the MTurk workers identified as being in the quality- and taste-related phrases and found that some words occur frequently in discussion of both topics. Consequently, to determine which words are more strongly associated with one type of topic rather than with the other and thus better for classifying a review with respect to its quality or taste focus, we calculated the Z-test statistics for differences in two population proportions.17 From the statistic values, we were able to identify words most likely to be quality related, rather than taste related, and vice versa (see Figures 5 and 6). For example, the words “service,” “order,” “wait,” “seat,” “staff,” “place,” and “price” (including the “$” sign, which was coded as the word “dollar”) are among those found in the quality-related comments for restaurants. On the other hand, specific menu items as well as descriptive adjectives regarding food taste (e.g., sweet, spicy), ambience, location, parking, and so forth are found to be more likely in taste-related comments. We used this information to help us create the 40 restaurant reviews to use in the laboratory experiment, which is discussed next. In addition, the survey responses were also used to train classifier algorithms for predicting the existence of quality and taste topics, and the details are discussed in the “Field Study” section.

Laboratory Experiment for Restaurants

Overview of the Design

The goal of the laboratory experiment was to explicitly test our two hypotheses, which involve not only main effects but also interactions between our review variables M, Vq, and Vt. We do this via a 2 (low and high M) × 2 (low and high Vq) × 2 (low and high Vt) within-subject design, in which each respondent looks at all eight restaurants (one per each cell) and the associated set of reviews. We vary M, Vq, and Vt by manipulating the star ratings and text of the reviews associated with a given restaurant. To

17 The Z-statistic for testing for the difference in two population proportions is $Z = (\hat{p}_{1} - \hat{p}_{2}) \big/ \sqrt{\hat{p}(1 - \hat{p})\left(\frac{1}{n_{1}} + \frac{1}{n_{2}}\right)}$, where $\hat{p}_{1}$ and $\hat{p}_{2}$ are proportions, $n_{1}$ and $n_{2}$ are the total counts in two populations, and $\hat{p}$ is the overall proportion in the entire population.

Figure 4. Text Selection Question Measuring Which Words Are Related to Quality Issues (Top) and Taste Issues (Bottom).
Notes: In this figure, Xq indicates a vector of predictors (i.e., words) for Yq (i.e., whether a quality topic exists). Similarly, the variables for a taste topic are denoted
with t subscripts.

Figure 5. Words Most Likely to Be Quality Related (Left) and Taste Related (Right) in Restaurant Reviews.

keep the task manageable and to hold the number of reviews fixed, each of the eight restaurants had five unique reviews, each consisting of a star rating and a short text review about the reviewer’s experience. To control for prior beliefs about the restaurants, we fixed the general location (local), cuisine (American), menu types, price, and review volume to be the same for all eight restaurants. Respondents were also told that these hypothetical restaurants were new to the respondent. In addition, we controlled for the order of information presentation by randomizing the order of the restaurants across respondents, as well as the order of the five reviews associated with each restaurant. After reading the five reviews for a given restaurant, the respondent provided their intent to dine at that restaurant (our measure of sales).

This 2 × 2 × 2 design allowed us to compare restaurants that vary with only one variable of interest (M, Vq, or Vt), holding the other two variables fixed, to assess what happens if a restaurant changes only along that dimension. This approach allowed us to test H1 by determining if respondents’ stated purchase likelihoods decrease with increases in Vq and to test H2 by determining if purchase intent increases with increases in Vt from the base case (high M and low Vq) as M becomes low and/or Vq becomes high.18

Manipulated Reviews

We constructed 40 different reviews (5 reviews for each of the eight restaurants) that were used to vary M, Vq, and Vt. The text of each review describes experiences that reflect vertical and/or horizontal features. By constructing different sets of reviews, we were able to manipulate the mean of the reviews, M, and the variances due to vertical and horizontal experiences. We focused the review content for a specific review just on quality, just on taste, or equally on both. (For each restaurant, four of the reviews only discussed either quality or taste, i.e., qi was 0 or 1, while the fifth described both dimensions equally, i.e., qi = .5). The text was written to be compatible with the star rating given, where positive reviews were associated with four- and five-star ratings, negative reviews with one- and two-star ratings, and neutral reviews with a three-star rating. For example, if the restaurant was associated with a high Vq, then the text across the five reviews for this restaurant would report a large dispersion of opinions on the quality of the restaurant. Relying on findings reported previously on words most likely to be associated with quality (taste) issues, we constructed text reviews talking about wait times and interactions with the server in order to manipulate the low and high quality variance within the set of reviews, whereas reviews discussing the horizontal attribute of the spiciness of the menu items were used to manipulate Vt within a restaurant. For example, one text review focusing on the taste dimension reads: “The spicy chicken was too spicy for me. Too spicy that I couldn’t finish my food and didn’t enjoy it at all. Maybe others will like it. (2 stars).” Each review was written to be within a similar length, and the total amount of review text for each restaurant was also similar.

We assigned Restaurants 1–8 to Experimental Cells 1–8. Figure 7 shows the exact star ratings we used for the reviews to manipulate the quality variance and the taste variance for the four (2 × 2) restaurants with a low average rating, which we designated as being in Experimental Cells 1–4. For example, for Restaurant 1 (i.e., Cell 1), there were three reviews that discussed quality issues and also indicated variation in star ratings for these reviews. This restaurant also had three reviews that mentioned taste issues, but there was no variation in star ratings for these three reviews. Thus, this Cell 1 restaurant represents the high Vq, low Vt cell. The four high-mean restaurants were in Experimental Cells 5–8 and used the same ordering as for Cells 1–4, but now each rating was shifted by one star for all 20 ratings, resulting in 20 new (more positive) reviews. Thus, the four restaurants with high mean ratings had a mean of four stars, while having

18 We also tested the significance of the effects at the endpoints, that is, low M and high Vq versus high M and low Vq, since our theory only predicts that the change in sales switches somewhere over the range of the values in the M and Vq pair. We specifically tested that the effect of Vt is significantly positive under the low M and high Vq condition and significantly negative under the high M and low Vq condition. Technically, we only needed to test if the effect in the former condition is greater than the effect in the latter condition.

Figure 6. Words Most Likely to Be Quality Related (Left) and Taste Related (Right) in Hotel Reviews.

Figure 7. Manipulation of Quality Variance and Taste Variance for Low-Mean Restaurants.

the identical (total and partial) variance structure as the low mean restaurants. The 40 different written reviews used in the experiment are available in Web Appendix C.

We pretested our manipulations of quality variance and taste variance using a sample of 45 MTurk respondents who were asked to quantify the amount of quality variance and taste variance associated with each of the eight restaurants and who also passed a screening task. All eight of the appropriate contrasts were statistically significant and in the correct direction. The pooled contrast of all pairs of comparisons, after each participant’s responses were standardized, also shows significant differences (p < .0001) and in the desired direction. The survey instructions and questions and the standardized cell means for the pretest survey are available in Web Appendix C.

Survey Questions and Administration

The experiment was administered in a laboratory setting. The respondents (average age of 29.6 years, ranging from 18 to 75 years; 63% female) were part of Duke University’s behavioral

Figure 8. Purchase Intent for Restaurants in a 2 × 2 × 2 Design.


Notes: Number in brackets indicates experimental cell. Error bars indicate standard errors.

research pool of people who expressed interest in participating in research studies, and in our case the respondents took part in multiple short studies in one sitting. After completing all the studies, they were compensated for their participation.

At the beginning of our experiment, respondents were instructed that they were to decide how likely they were to go to a number of different local restaurants that were new to them. Then, the respondents were shown the five reviews, in a random order, for the first randomly presented restaurant on a computer screen. Below the reviews on the same screen page, the respondents indicated their purchase intent for the restaurant by responding to the question “How likely are you (personally) to choose this restaurant?” using a seven-item scale anchored by “Not likely at all” and “Highly likely.” This process was repeated seven more times on the following screen pages for the remaining seven restaurants. Respondents were encouraged to navigate back and forth between restaurants in order to scale their answers across restaurants. The exact survey instructions and instruments are available in Web Appendix C.

After completing the survey questions for the eight restaurants, the respondents were presented with a ninth restaurant, for which one of the five new, but otherwise similar, reviews had a sentence embedded in the text instructing the respondent to choose a specific answer for the questions that followed. This task was given to verify that the person was paying strict attention to the task at the end of the study.

Experiment Results

Out of the 178 respondents who finished our survey, N = 90 (50.56%) read each review carefully enough to pass our very subtle attention check. Given the low rate of “attention,” we first compared the two groups and looked for differences between them. For the 90 respondents who passed the attention test, their average age was 29.9 years, 69% were female, and they took on average 286 seconds to complete the survey. In contrast, those who failed the attention check had an average age of 29.4 years, 57% were female, and they took about a minute less (i.e., 223 seconds on average) to complete the survey. Although the amount of time spent on taking the survey is not an absolute measure of “attention,” we note that the failed group still spent, on average, about 25 seconds per restaurant. We take this observation to indicate that they still gave substantial attention to their task. However, given our a priori decision to use only “qualified” respondents, we present the results for those who passed our very strict attention test. With this noted, our hypothesis test results for the combined passed/failed group were qualitatively similar to our reported results for the group that passed only.19 The only difference occurs when we limit our tests to

19 We conjecture that the additional noise in the responses of the group that failed negated the added power from almost doubling the sample size.

the group that failed, in which case we do not see all of our Field Study
hypotheses supported. The additional test results from the differ-
ent combinations of groups are found in Web Appendix D. Data
Before presenting our formal tests of our two hypotheses, Classifying all reviews. Unlike the laboratory study where we
we display the mean responses for the eight cells in Figure 8 created each review, we need to determine Vq and Vt for
with standard error bars. As is evident from this figure, we see each establishment for our field studies. This involves a multistep
increases in purchase intent for high-mean cells compared process. First, we need to classify the degree to which each text
with low-mean cells, all else equal. Similarly, we see higher review discusses vertical versus horizontal attributes, that is, create
purchase intent for low-Vq cells compared to high-Vq cells, a measure of qi . To do this, we built two classifier algorithms for
all else equal. However, the effect of Vt varies depending each sample. Given the voluminous quantity of text reviews for
on the levels of M and Vq . With this noted, H1 involves the both types of establishments, we decided to use a bag-of-words
marginal effect of Vq across the total population, and H2 model that considers words as the building blocks of textual
states that sales increase with increases in Vt when M is review content23 and a support vector machine (SVM) algorithm
low and/or Vq is high. Thus, in Table 1 we present multiple to classify each review in terms of its emphasis on quality and
contrasts20 where we tested the set of inequalities simultane- taste. We first transformed all review texts into a set of words.
ously (thereby avoiding the issue of alpha inflation due to Common abbreviations and internet usage terms were converted to
multiple hypothesis tests), by using a bootstrapping method standard English, and misspelled words were corrected using a spell-
that also addresses the issue of test hypotheses being corre- check program. We removed punctuation after converting the dollar
lated as a result of shared data (Westfall and Young sign (“$”) to the word “dollars.” Any meaningless “stop words” (e.g.,
1993).21 We used 100,000 resamples, and the adjusted “a,” “is,” “the”) were removed before stemming of all words into
p-values correspond to using the bootstrapping method. In their root form (for more details, see Web Appendix E).
conducting these tests, we standardized the purchase intent Next, we used the MTurk workers’ classification provided
responses by individual across the eight restaurants to for the two samples of 5,000 reviews to train the SVM classifi-
remove all individual effects related to the tendency to rate ers, with a linear kernel, for each data set. Out of the entire set of
using high/low and wider/narrower responses. words contained in each review text, we decomposed the text
Consistent with the visual view of the data, we find strong input answer for the phrases related to quality issues for that
support for our hypotheses, that is, sales decrease with Vq review into words, and then linked those words (Xq variables)
across different levels of M and Vt , while the effect of Vt is to the binary response variable Yq = 1, indicating the existence
more positive when M is low and/or Vq is high.22 In addition, of quality-related content. We linked all remaining words in the
we note that the significance of the contrast “[8]–[7] < [6]–[5]” review text to Yq = 0, indicating no existence of quality-related
(which captures the positive interaction between Vq and Vt , content. Thus, we divided each review into subparts to make the
under high M), is equivalent to testing whether the effect of Vq training set of 10,000 observations (half of which predict Yq = 1
is less negative when Vt is high (i.e., [8]–[6] < [7]–[5]). Again, and the other half Yq = 0). We did the same for the response var-
these differences are visually noticeable in Figure 8. iable Yt for the existence of taste-related topics (see Figure 4).
Using these two trained SVM models, one for quality and the
20
other for taste, we predicted the posterior probability that a
Although it was not part of our stated hypotheses, we also tested whether the purchase intent is higher for high-mean restaurants than for low-mean restaurants (effect of M) as a manipulation check.

review discusses quality- and taste-related content, using a logistic mapping function. Next, we scaled these two posterior probabilities to add to 1, giving us our measures of qi and (1 − qi) for each review. We display the distribution of these predicted values in Figure 9. Because of the uncertainty in the measurement of these posterior probabilities and the observed empirical distribution of the predicted values, we binned these continuous measures, letting them take five discrete values: 0, .25, .5, .75, or 1, using the following bins: [0, .1), [.1, .4], (.4, .6), [.6, .9], and (.9, 1]. Our next step was to assume that the proportion of each review i devoted to quality (qi) and taste (1 − qi) is indicative of how much the quality realization and taste mismatch cost deviated from the

21
A bootstrapping method creates pseudo data sets by randomly sampling with replacement from the observation data. Each randomly created pseudo data set represents the empirical distribution of the null of all treatments being equal. The p-values of the hypothesis tests are calculated on the resampled data set, and the minimum p-value is recorded for each random pseudo set. A large number of resampling is performed, and the adjusted p-value is calculated as the proportion of the resampled pseudo p-values that are less than or equal to the raw p-value. Thus, the resampling methods implicitly account for all forms of correlations (intertest and intervariable). The resampling-class methods have been shown to have consistent Type I errors under various correlation levels and structures (Blakesley et al. 2009), and the step-down version, which uses a subset of resamples to increase power (Holm 1979; Shaffer 1986; Westfall and Young 1993), is helpful for detecting true differences while controlling the family-wise error rates across all multiple hypothesis tests under study (Romano, Shaikh, and Wolf 2010). We refer the readers to the cited literature on multiple hypothesis testing for more details.
22
We find that these results are robust to other testing specifications, such as the step-down version of the bootstrap resampling method, with the heterogeneous variance instead of the homogeneity assumption (Satterthwaite approximation), and even the Bonferroni method (a more conservative test under the independence assumption).
23
Accommodating more complex sentence structures and using deep learning algorithms might have further improved this classification. However, the texts in our study are all reviews about restaurants and hotels, and consequently, there is little reason to expect the need to take into consideration the specific context of discussion when determining the classification ability of any given word. In addition, this simplifies our methodology and also makes the algorithm for training and prediction computationally efficient.
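
The normalization and binning of the quality share described in the text can be illustrated with a minimal Python sketch. The function names (`normalize_quality_share`, `bin_quality_share`) and the example probabilities are hypothetical and are not taken from the authors' code; only the rescaling to sum to 1 and the five bins [0, .1), [.1, .4], (.4, .6), [.6, .9], and (.9, 1] come from the article.

```python
def normalize_quality_share(p_quality: float, p_taste: float) -> float:
    """Rescale two classifier probabilities so they sum to 1; return q_i."""
    total = p_quality + p_taste
    if total <= 0:
        raise ValueError("at least one probability must be positive")
    return p_quality / total


def bin_quality_share(q: float) -> float:
    """Map a continuous q_i in [0, 1] to one of 0, .25, .5, .75, 1."""
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must lie in [0, 1]")
    if q < 0.1:      # [0, .1)
        return 0.0
    if q <= 0.4:     # [.1, .4]
        return 0.25
    if q < 0.6:      # (.4, .6)
        return 0.5
    if q <= 0.9:     # [.6, .9]
        return 0.75
    return 1.0       # (.9, 1]


# Hypothetical review: the classifiers return .6 (quality) and .2 (taste).
q_i = bin_quality_share(normalize_quality_share(0.6, 0.2))
```

Under this sketch, the taste share of the same review is simply `1 - q_i`.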

Table 1. Adjusted p-Values Under Multiple Hypothesis Testing.

Alternative Hypothesis on Purchase Intent                          Raw p-Value   Adjusted p-Value

Effect of M is positive                                            <.0001        <.0001
  Restaurants [1],[2],[3],[4] < [5],[6],[7],[8] (pooled)
Effect of Vq is negative                                           <.0001        <.0001
  Restaurants [1],[2],[5],[6] < [3],[4],[7],[8] (pooled)
Effect of Vt is more positive compared with high M and low Vq
  When M is low: Restaurants [8]–[7] < [4]–[3]                     <.0001        .0001
  When Vq is high: Restaurants [8]–[7] < [6]–[5]                   <.0001        <.0001
  When M is low and Vq is high: Restaurants [8]–[7] < [2]–[1]      <.0001        <.0001

Notes: Number in brackets indicates experimental cell.
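
The resampling logic behind adjusted p-values of this kind can be sketched in a few lines of Python. This is a simplified, single-step min-p illustration under stated assumptions, not the authors' procedure: it uses simple pairwise contrasts between cells (rather than the pooled and difference contrasts in Table 1), a normal-approximation test statistic, and no step-down refinement. The names `one_sided_p` and `min_p_adjusted` are hypothetical.

```python
import math
import random
from statistics import mean, variance


def one_sided_p(low, high):
    """Approximate p-value for H1: mean(low) < mean(high) (normal approximation)."""
    se = math.sqrt(variance(low) / len(low) + variance(high) / len(high))
    z = (mean(high) - mean(low)) / se
    return 0.5 * math.erfc(z / math.sqrt(2.0))  # upper-tail normal probability


def min_p_adjusted(cells, contrasts, n_boot=2000, seed=0):
    """Single-step min-p adjustment via bootstrap under the 'all cells equal' null.

    cells: dict mapping cell name -> list of purchase-intent scores
    contrasts: list of (low_cell, high_cell) name pairs
    """
    rng = random.Random(seed)
    raw = [one_sided_p(cells[a], cells[b]) for a, b in contrasts]
    # Pooling all observations imposes the null that every treatment is equal.
    pooled = [x for values in cells.values() for x in values]
    min_p = []
    for _ in range(n_boot):
        pseudo = {name: rng.choices(pooled, k=len(values))
                  for name, values in cells.items()}
        min_p.append(min(one_sided_p(pseudo[a], pseudo[b]) for a, b in contrasts))
    # Adjusted p-value: share of resampled minimum p-values <= the raw p-value.
    adjusted = [sum(p_star <= p for p_star in min_p) / n_boot for p in raw]
    return raw, adjusted
```

Because the minimum p-value is taken over all contrasts within each pseudo data set, correlation among the tests is reflected in the adjustment without being modeled explicitly.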

first-period consumer’s prior expectations. The valence of the review content, that is, the direction of the shock, can be determined for the most discussed content (either quality or taste) by the sign of the difference between the review rating and the average rating of the firm. For example, if a review gives a restaurant that averages four stars a three-star rating, and the review focuses more on quality, that is, qi > .5, then the second-period consumer can easily infer that there was a negative deviation in quality from the consumer’s prior expectation. What is less clear is whether the taste deviation (if any) was also negative, or if it was positive. The sum of the deviations must equal the total deviation in rating, but it could be the case that there was a negative shock to both quality and taste, or a very negative shock to quality that was not fully compensated by a smaller, positive shock to taste. (We note, however, that when the review is entirely about quality or taste, i.e., qi = 1 or 0, there is no uncertainty: the deviation in the rating from the average rating is equal to the shock along the dimension discussed in the review.) In our main specification, we assume that consumers integrate out over this uncertainty (see Appendix for details). As a robustness check, we assume that second-period consumers can perfectly assess the valence of both shocks when both quality and taste are discussed in the review.

Finally, using these quality and taste deviations, consumers (and we) can calculate for each establishment the expected variance of the ratings that results from the variation in vertical quality experiences and the variation in horizontal taste experiences. Namely, the variance of the quality shocks (from quality expectation) would be the variance of the data-generating process for quality, and likewise for taste shocks. We use these expected variances to estimate their causal effects on revenue using IVs to control for endogeneity. Our instruments come from the entire review history of every reviewer who provided a relevant review for the establishments in our sample. Each Google user’s landing page contains all reviews written by that user. For each business’s set of review writers, we aggregate each reviewer’s other ratings, which allows us to measure whether these specific reviewers are harsh or lenient. We assume that the mix over time of harsh and lenient reviewers who visit an establishment is exogenous (controlling for the baseline mix with establishment fixed effects), which allows us to create instruments for M, Vq, and Vt using these reviewers’ past reviews of other establishments (the instruments being the mean of past reviews of other establishments across reviewers, and the variances across those past means).

Business demand. We attempted to collect sales data for each restaurant and hotel in our sample. Monthly sales data for each hotel are available from the Texas comptroller’s office on request. We aggregated these data to be at the quarterly level. However, revenue figures for restaurants are not normally publicly available. Estimates of annual sales are available in a database maintained by Data Axle (called Infogroup at the time of the study), which is considered to be one of the most comprehensive business listing databases available. This database provides detailed information about each business location including industry codes, number of employees working at the specific location, estimated annual sales, and whether it is part of a franchise. We used the Infogroup estimate of annual sales (in thousands of dollars) after matching Infogroup restaurant location with the restaurant location found in our sample database.24

24
There are two limitations in using this database for the sales figures for restaurants. First, these sales figures are only estimates, although there is no reason to believe that the Infogroup methodology should induce a systematic bias associated with our independent variables (see Web Appendix F for more details on the variables used to predict annual sales). Thus, using their estimated sales only adds noise to our dependent variable. Second, this database only included information for about 56% of all of the businesses in our initial sample of 7,663 restaurants that had reviews posted on the Google website (see Web Appendix G for the exact procedure we used to match businesses across the two different databases). However, when we compared those businesses for which we obtained matches and ultimately used the data in our demand analyses with those for which we were unable to obtain sales data from the Infogroup database, we find the two subsamples had almost identical average ratings and average variance in the ratings. In addition, the distribution of the number of reviews was very similar. Perhaps just as importantly, when we did an analogous analysis for our hotel database and compared the annual estimates based on the Infogroup data with the revenue figures obtained from the state comptroller’s office, we find that the correlation of the two measures is .80 (see Figure WF.1 in Web Appendix F). In addition, when we regressed the difference in actual and estimated hotel sales against our three parameters of interest, after controlling for firm fixed effects and time dummies (the control independent variables in our empirical model), we found that these three parameters of interest are insignificantly related to the dependent variable, that is, they are independent of the measurement error related to using the Infogroup data. Thus, we do not believe there is any significant issue with measurement errors or selection bias by using our estimates of restaurant sales.

Figure 9. Distribution of q in Restaurant and Hotel Reviews.

Summary statistics. Table 2 displays the relevant summary statistics for our samples of restaurants and hotels, in which the unit of analysis is at the annual level for restaurants and at the quarterly level for hotels. Let $Rev_{jt}$ indicate revenue for business j during period t. The mean of the ratings for establishment j is $M_{jt}$, and our calculated variances are denoted with superscripts: $V^q_{jt}$ is the expected variance coming from quality inconsistency, and $V^t_{jt}$ is the expected variance associated with taste mismatch costs. We also control for policy changes, such as renovations or changes in management ($Pol_{jt}$), measured by screening the review texts for such mentions, and whether the business is a franchisee ($Fran_{jt}$). In the bottom four rows, we show the within-establishment variation in the dependent variable and key regressors. Although the cross-sectional variation in the mean rating and rating variances is much larger, there is still within-firm variation that we utilize in estimation.25

25
To assess our estimation approach and underlying model assumptions, in addition to the concerns about using estimated restaurant revenues and the potential lack of sufficient variability in data, we run a simulation to demonstrate how data generated using similar setups can be used to recover two-stage least squares estimates. Please find the detailed information about this simulation analysis in Web Appendix H.

We report the correlation among the variables in Table 3. There do not appear to be any serious signs of potential multicollinearity, other than the moderate negative association between the mean and the two variances. This negative relationship is not surprising, since the average rating is approximately 4 out of 5 in both data sets and thus the distribution of ratings is skewed left. Consequently, higher variance ratings tend to have lower average mean ratings. We also note the negative correlation (across businesses) between log sales and mean rating for restaurants (but not hotels). We conjecture that this negative relationship is due, at least in part, to fast-food outlets having higher sales but lower mean ratings compared with smaller, higher-end restaurants. We account for this type of relationship in our estimation by including fixed firm effects in the estimation model.

The full data set includes 15,241 restaurant-year observations and 28,393 hotel-quarter observations. The distribution of the number of text reviews for both types of establishments has a long upper tail, with a median of 12 and a maximum of 1,539 for restaurants and a median of 30 and a maximum of 948 for hotels. Within a given establishment, we limit our attention to the 20 text reviews that appear at the top of the list on the Google reviews page in each time period (if the establishment has fewer than 20 reviews, we look at all the available reviews). The default ordering is by what Google considers “Most Relevant,” which is typically by recency, but occasionally an older review appears above newer ones if it is deemed to be more helpful. We do this for two overarching reasons. First, our theoretical model is based on what consumers take away from the reviews they read, and consumers typically focus on the more recent reviews that appear on top when large numbers of reviews are available. In addition, the variation in Vq and Vt over time will be limited if they capture the entire history of reviews, since any additional review will have limited impact in the calculation of these variables.

Model
We present two related econometric models. The first is analogous to our laboratory setup in that it captures the heterogeneous effect of Vt for 2 × 2 bins of low and high M and Vq. The second formulation directly tests the stated H2. The dependent variable in both formulations is log of revenue, for establishment j at time t, and both formulations include several control variables. Our first formulation, the 2 × 2 specification, is as follows:

Table 2. Descriptive Statistics of the Variables.

                                 Restaurants (Yearly Panel)                  Hotels (Quarterly Panel)
Variable                         Obs.     Mean    SD      Min     Max        Obs.     Mean     SD     Min     Max

ln sales(a) (log(Rev))           15,241   5.575   1.456   2.303   11.641     28,393   12.896   .967   4.554   17.148
Avg. rating (M)                  15,241   4.029   .531    1       5          28,393   3.816    .715   .424    5
Variance due to quality (Vq)     15,241   .450    .432    0       5.333      28,393   .565     .478   0       8
Variance due to taste (Vt)       15,241   .311    .306    0       4          28,393   .509     .433   0       8
Policy change (Pol)              15,241   .022    .146    0       1          28,393   .096     .294   0       1
Franchise (Fran)                 15,241   .115    .319    0       1          28,393   1        0      1       1

Within-Establishment Variances
ln sales(a) (log(Rev))           3,506    2.051   2.265   0       29.239     1,629    .073     .267   0       5.587
Avg. rating (M)                  3,506    .072    .145    0       3.684      1,629    .126     .197   0       2.044
Variance due to quality (Vq)     3,506    .104    .289    0       6.297      1,629    .132     .466   0       10.049
Variance due to taste (Vt)       3,506    .054    .136    0       2.082      1,629    .103     .317   0       .5

(a) In thousands for restaurants only.

\[
\begin{aligned}
\log(Rev_{jt}) ={} & \gamma_1 M_{jt} + \gamma_2 V^q_{jt} + \gamma_3 V^t_{jt} + \gamma_4 \left(V^t_{jt} \times \mathbb{1}_{\text{low } M_{jt} \,\&\, \text{low } V^q_{jt}}\right) \\
& + \gamma_5 \left(V^t_{jt} \times \mathbb{1}_{\text{high } M_{jt} \,\&\, \text{high } V^q_{jt}}\right) \\
& + \gamma_6 \left(V^t_{jt} \times \mathbb{1}_{\text{low } M_{jt} \,\&\, \text{high } V^q_{jt}}\right) \\
& + \theta_1 Pol_{jt} + \theta_2 Fran_{jt} + \phi_j + \tau_t + \epsilon_{jt}.
\end{aligned}
\tag{11}
\]

We discuss the control variables first and then detail the specific interaction formulation around the review variables. Our models allow for business-specific fixed effects, denoted by $\phi_j$, that capture each firm’s intrinsic characteristics (including quality of service and taste location), which are consistent over time. Any variation over time due to economic conditions or other common macro shocks is absorbed by the time dummies, $\tau_t$. We also include $Pol_{jt}$ as an indicator for a management change occurring during time t, while $Fran_{jt}$ is an indicator for being a franchisee or part of a chain, and $\epsilon_{jt}$ is an idiosyncratic error. Consistent with previous literature, we take the natural logarithm of the revenue to estimate relative effects. Including the time-invariant fixed effects helps alleviate endogeneity concerns. Any remaining demand shocks that also affect our review variables and are unobserved to the econometrician are controlled for by instrumenting for the endogenous variables, namely $M_{jt}$, $V^q_{jt}$, and $V^t_{jt}$.

In our 2 × 2 specification, we divide the sample into four bins of low and high M and Vq 26 to allow for a heterogeneous effect of Vt, using three indicator variables: $\mathbb{1}_{\text{low } M_{jt} \,\&\, \text{low } V^q_{jt}}$ is the indicator variable for the low M and low Vq bin, $\mathbb{1}_{\text{high } M_{jt} \,\&\, \text{high } V^q_{jt}}$ is for the high M and high Vq bin, and $\mathbb{1}_{\text{low } M_{jt} \,\&\, \text{high } V^q_{jt}}$ is for the low M and high Vq bin, according to our definition. Thus, γ3 is the effect of Vt in the high M and low Vq bin, and the effects in other bins are relative to the effect in this bin. This 2 × 2 specification allows us to test the effect that lowering M or increasing Vq has on the effect of Vt, from the baseline bin where the effect of Vt is the most negative. While this result is interesting, to confirm that each of M and Vq has a moderating effect on Vt, our H2 states that having low M and/or high Vq affects the impact that Vt has on revenue. Therefore, we also utilize an alternative specification that directly tests our hypotheses:

\[
\begin{aligned}
\log(Rev_{jt}) ={} & \gamma_1 M_{jt} + \gamma_2 V^q_{jt} + \gamma_3 V^t_{jt} \\
& + \gamma_7 \left(V^t_{jt} \times \mathbb{1}_{\text{low } M_{jt} \text{ or high } V^q_{jt}}\right) + \theta_1 Pol_{jt} \\
& + \theta_2 Fran_{jt} + \phi_j + \tau_t + \epsilon_{jt},
\end{aligned}
\tag{12}
\]

where $\mathbb{1}_{\text{low } M_{jt} \text{ or high } V^q_{jt}}$ is the indicator variable for being in the low M and/or high Vq condition. Similar to the 2 × 2 specification, γ3 is the effect of Vt in the high M and low Vq bin where neither of the required conditions is met, and γ7, the effect when at least one of the conditions is met, is relative to this base condition. H1 is supported if γ2 < 0, while H2 implies γ7 > 0.27

26
Low and high M are defined using a median split. On the other hand, low and high Vq are defined relative to M because M and Vq tend to covary in our empirical setting of skewed rating distributions. We account for this covariation by finding the best fit line between M and Vq, and then dividing the sample into two groups: the one with Vq higher than the predicted value relative to its M according to the best fit line, and the one with Vq lower than the predicted value.
27
Note that γ3 + γ7 may still be negative since our binning may not capture the condition where M is low enough and/or Vq is high enough to generate positive sales. We discuss this more when presenting the results.

Estimation
We must address several issues during estimation. First, our parsimonious two-period theory model assumes that second-period consumers are able to observe the actual distribution of ratings across the population; in actuality, they observe a set of draws from that distribution. Thus, even if second-period consumers are able to perfectly decompose the rating deviations into their quality and taste components, there will still be

Table 3. Correlation of Variables.

Restaurants
            log(Rev)   M         Vq        Vt        Pol       Fran
log(Rev)    1
M           −.1143     1
Vq          −.0548     −.4405    1
Vt          .0468      −.4011    .2244     1
Pol         −.0840     −.0091    .0217     .0303     1
Fran        .1032      −.3125    .1434     .1506     .0195     1

Hotels
            log(Rev)   M         Vq        Vt        Pol       Fran
log(Rev)    1
M           .3250      1
Vq          −.1699     −.3564    1
Vt          −.1215     −.3676    .2185     1
Pol         .0398      .0032     .0229     .0135     1
Fran        —          —         —         —         —         —

variation over time in both Vq and Vt based on the actual draws from the quality and taste distributions. Furthermore, we expect heterogeneity in the degree to which reviewers use more positive ratings versus more negative ratings to express the same utility realization, reflecting how harsh the reviewer is when attaching a rating to a review. In our field studies, we allow for the rating left by first-period consumers to have an extra additive stochastic component that reflects how harsh a specific reviewer is when assigning their ratings. Referring back to the theory model, the expected mean rating left by the first-period consumers is unchanged, and the expected variance is now $V = \frac{1}{12}\left(r^2 + t^2 D_1^2\right) + V_\eta = V^q + V^t + V_\eta$, where $V_\eta$ is the variance due to this extra source of variation. If second-period consumers are unaware of the fact that ratings reflect this extra stochastic component, then second-period consumers proceed exactly as described in the theory model. If, however, they are aware of this third source of variation, they will want to adjust the magnitude of their decomposition of V into Vq and Vt. In such a case, we make the empirically supported assumption that these second-period consumers will not take the effort to determine the individual reviewers’ tendencies of harshness, but instead just discount their estimates of Vq and Vt when determining the underlying parameters of our model. Because this extra source of variation is expected to not change over time, our hypotheses are not affected, but the interpretation of the coefficient estimates changes slightly, since the second-period consumers only use the discounted estimates of the actual variance in the quality and taste distributions, whereas we assume in estimation that there is no such discounting. Consequently, the coefficients obtained using our approach would be underestimates of the actual effect of changes in the variances of the quality and taste distributions.

Another issue that needs to be addressed in estimation is endogeneity. There are at least two potential sources. First, even with establishment fixed effects, random shocks in unobserved quality parameters (i.e., the distribution from which quality realizations are drawn) over time could impact both ratings and sales. This would lead us to overestimate the impact of mean reviews. It also may lead to biases in the estimates of the variance of reviews, due to the observed negative correlation in the mean and the variances of reviews. Specifically, a positive correlation between any unobservable with mean reviews and sales implies a negative correlation in the unobservable and the variances of reviews, potentially leading to underestimation of the effect of review variances. A second possible source of endogeneity results from the fact that users probably do not read all of the reviews, yet our data set does not allow us to determine which reviews they read, leading to possible measurement error. Our setting is in contrast to the one in Liu, Lee, and Srinivasan (2019), who had a data set that allowed them to infer the reviews read. They show that incorrectly assuming all reviews are read versus using the actual reviews read leads to substantially biased estimates. To help address the concern that not all reviews are read by the second-period consumers, we use (up to) the most relevant 20 reviews in estimation to calculate the variance variables, recognizing that the measured variables are still not necessarily the same as the variance variables utilized by consumers, if they read a different subset. If we assume that the variances of the reviews that are read are equal to the variances of the 20 reviews we use, plus some measurement error, then the ordinary least squares (OLS) estimates will be subject to (potentially severe) attenuation bias.28 This not only will lead to underestimates of the variance effects when OLS is used but also will bias the estimated effect of the mean because of the collinearity between the mean and the variances.

28
Huang and Sudhir (2019) also find that the downward bias due to measurement error dominates the better-known upward bias due to common method in OLS estimates, thereby leading to a significant underestimation of causal effects.

We address both sources of endogeneity using IV analyses. We do this by leveraging the plausibly exogenous variation in the types of consumers who previously were patrons and left reviews for the given establishment, as described previously. For example, if the reviews for an establishment come from a set of consumers who generally leave lower evaluations, then we can expect the ratings for this establishment to also be lower for reasons uncorrelated with the actual quality of the reviewed establishment.29 Note that using this reviewer information as instruments means we are implicitly assuming that (1) additional

29
Using the IV estimation also addresses possible omitted variable bias. In the field data, there could be potential deviations from our theoretical model. For example, even though our analytical model assumes the location of each establishment to be fixed on the horizontal taste dimension, establishments may change their menus and thereby affect the reviews without being observed by the econometrician. Such menu turnovers, however, are unrelated to our IVs, which capture the exogenous variation due to the “harshness” user characteristics.

variation in ratings comes from the relative harshness of reviewers (i.e., the degree to which ratings left by the reviewers for a business are lower or higher than the actual utility realizations), and (2) consumers of review content for a given establishment do not search through the review history of all the reviewers to assess whether they are harsh or generous reviewers. The actual instrument used for the mean rating of an establishment is $\sum_i m_i / N$, the average across reviewers of each reviewer’s historic average mean rating, $m_i$, not associated with the focal establishment.

Similarly, the quality and taste variances of ratings can be instrumented by taking the variances of the deviations across reviewers from the relevant reviewers’ historic mean ratings. The construction of these instruments is analogous to that of the expected variances, except that the deviations in the ratings $(s_i - \bar{s})$ are replaced with the deviations in the users’ historic mean ratings $(m_i - \bar{m})$, where $\bar{m}$ is the mean as defined previously. Thus, the instrument captures whether the reviewers who left reviews for an establishment are harsh or lenient raters relative to the expectation, and how much this harshness varies for a set of reviewers. The instruments are subject to the same integration over the unobserved shocks, as we assume that qi is exogenous, that is, the data-generating processes for quality and taste distributions are independent. In other words, which mix of quality and taste realizations is experienced by a particular consumer is independent from the endogenous demand factors of the focal establishment.

Results
Main results. We present our estimation results30 in Table 4 for both specifications, for both restaurant and hotel samples. We report heteroskedasticity and autocorrelation consistent standard errors (Newey–West), assuming an AR(1) process.31 We assume this autoregressive process to account for possible autocorrelation in the errors, as the same review may be included in calculation of the variances for more than one time period. The first-stage identification test results are also included in the result tables. All individual first-stage regressions for each endogenous variable were strongly identified, and the joint weak identification tests had reasonable F-statistics over ten.

30
First-stage estimates are included in Web Appendix I.
31
We also ran analyses with longer serial correlation and found that the results did not materially change.

We find that the main effect of Vq (i.e., γ2) is negative, as hypothesized for both restaurants and hotels, although this effect is significant only for hotels at 5% (for restaurants, it is significant at 10% using the “direct test of hypotheses” specification). We find γ4, γ5, γ6 and γ7 > 0 as predicted by our second hypothesis for both restaurants and hotels, and the effects are significant at 5%. These positive coefficients imply that, compared with the baseline bin where the effect of Vt is most negative (high M and low Vq condition), the effect of Vt becomes more positive when either M is low or Vq is high. We also provide adjusted p-values that account for the worst-case bias in the test statistics due to relatively modest first-stage F-statistics, as described in Lee et al. (2020). We still find support for H1 for hotels and H2 for both restaurants and hotels at 5% significance (and support for H1 for restaurants at 10%) using the adjusted p-values in the specification that directly tests both hypotheses.

Placebo test. Future information should not affect past sales. We conduct this placebo test by regressing sales on future review statistics. For each unit of analysis in our main estimation, we calculate M, Vq, and Vt using future reviews, rather than previously posted reviews. To match the amount of information available for each establishment, we use (up to) the next n reviews posted in the future to calculate the mean and the variances, where n is the number of reviews used in the main estimation. We show that information from future reviews does not predict past sales, and the estimation results are presented in Table 5. The placebo test results tell us that reverse causality is not at play here, since otherwise past sales would be related to future reviews.32

32
Our identification strategy also prevents reverse causality from driving the effects in our main results, since the instruments are unrelated to our dependent variables (i.e., high sales do not cause the reviewers to become more harsh or lenient).

Robustness checks. As mentioned previously, it could also be the case that second-period consumers can perfectly infer the valence of both the quality and taste deviations from the review content for all reviews. In this case, they do not need to calculate the expected Vq and Vt, as they can calculate them with certainty, implying a difference between the consumer’s information and that of the researcher. In this situation we must integrate over our uncertainty in the estimation procedure. We do this by using the same set of 1,000 simulations of the realizations of the quality and taste shocks described in the Appendix. However, instead of calculating the consumers’ expected Vq and Vt by averaging across these simulations, we estimate the empirical model for every simulated set of draws, yielding 1,000 independent estimates of our coefficients, each based on a different imputed value of Vq and Vt (using heteroskedasticity and autocorrelation consistent errors as before). Following the procedure recommended by Marshall et al. (2009), we calculate the mean of each of these coefficients and the associated variances of these 1,000 estimated coefficients. Finally, we include the imputation uncertainty within our standard errors using Rubin’s rule (Rubin 1987) to determine the overall variance that captures model fit as well as missing data uncertainty. Although the standard errors are larger under this alternative data-generating assumption because of the added researcher uncertainty, we find that qualitative interpretations of the results do not materially change, and the hypothesized effects of the variances continue to hold with significance (see Table WJ.1 in Web Appendix J).
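
The pooling step in the robustness check can be illustrated with the standard form of Rubin's rules: the pooled coefficient is the mean across simulated data sets, and the total variance is the average within-simulation variance plus an inflated between-simulation variance. The Python sketch below is a generic illustration, not the authors' code; the function name `pool_rubin` and the example numbers are hypothetical.

```python
import math
from statistics import mean, variance


def pool_rubin(estimates, std_errors):
    """Pool one coefficient across m imputed/simulated data sets.

    estimates:  coefficient estimate from each simulated data set
    std_errors: the corresponding standard errors
    Returns (pooled estimate, pooled standard error).
    """
    m = len(estimates)
    if m < 2 or m != len(std_errors):
        raise ValueError("need at least two estimates with matching standard errors")
    pooled = mean(estimates)
    within = mean([se ** 2 for se in std_errors])   # average sampling variance
    between = variance(estimates)                   # variance across simulations
    total = within + (1.0 + 1.0 / m) * between      # Rubin's total variance
    return pooled, math.sqrt(total)


# Hypothetical example with three simulated data sets (the article uses 1,000).
beta, se = pool_rubin([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
```

The between-simulation term is what makes the pooled standard errors larger than those from any single simulated data set, consistent with the larger standard errors reported for this check.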
Table 4. Main Estimation Results.

Restaurants (Yearly Panel)
                                                       2 × 2 Specification                                Direct Test of Hypotheses
log(Rev)                                 Prediction    Coef.     Newey SE   p-Value   Adjusted p-Value    Coef.     Newey SE   p-Value   Adjusted p-Value
M (γ1)                                                 −.1297    (.1718)    .225      .225                −.1025    (.1414)    .234      .234
Vq (γ2)                                  γ2 < 0        −.4216    (.3459)    .111      .111                −.3342    (.2502)    .091*     .091*
Vt (γ3)                                                −1.2938   (.5710)    .012**    .042**              −1.1938   (.4880)    .007***   .024**
Vt × 1[low M & low Vq] (γ4)              γ4 > 0        .6908     (.3362)    .020**    .048**
Vt × 1[high M & high Vq] (γ5)            γ5 > 0        .7257     (.3959)    .033**    .057*
Vt × 1[low M & high Vq] (γ6)             γ6 > 0        .9880     (.5597)    .039**    .060*
Vt × 1[low M or high Vq] (γ7)            γ7 > 0                                                           .7049     (.3281)    .016**    .032**
Pol                                                    −.1230    (.0636)    .027**    .053*               −.1318    (.0608)    .015**    .031**
Fran                                                   .0320     (.0801)    .345      .345                .0231     (.0792)    .385      .385
Firm fixed effects                                     Included                                           Included
Time fixed effects                                     Included                                           Included
Number of obs.                                         15,241                                             15,241
Number of panelists                                    3,506                                              3,506
Avg. obs./panelist                                     4.3                                                4.3
First-stage underidentification test                   χ2 = 25.05, p = .0000                              χ2 = 41.62, p = .0000
First-stage weak identification test                   F = 12.51                                          F = 27.83

Hotels (Quarterly Panel)
                                                       2 × 2 Specification                                Direct Test of Hypotheses
log(Rev)                                 Prediction    Coef.     Newey SE   p-Value   Adjusted p-Value    Coef.     Newey SE   p-Value   Adjusted p-Value
M (γ1)                                                 −.0238    (.0370)    .260      .260                −.0160    (.0306)    .300      .300
Vq (γ2)                                  γ2 < 0        −.1641    (.0764)    .016**    .039**              −.1359    (.0558)    .007***   .021**
Vt (γ3)                                                −.1696    (.0922)    .033**    .050*               −.1673    (.0894)    .031**    .038**
Vt × 1[low M & low Vq] (γ4)              γ4 > 0        .1031     (.0509)    .021**    .043**
Vt × 1[high M & high Vq] (γ5)            γ5 > 0        .1378     (.0768)    .036**    .052*
Vt × 1[low M & high Vq] (γ6)             γ6 > 0        .1913     (.0939)    .021**    .043**
Vt × 1[low M or high Vq] (γ7)            γ7 > 0                                                           .1320     (.0609)    .015**    .028**
Pol                                                    .0008     (.0065)    .448      .448                .0013     (.0064)    .420      .420
Firm fixed effects                                     Included                                           Included
Time fixed effects                                     Included                                           Included
Number of obs.                                         28,393                                             28,393
Number of panelists                                    1,629                                              1,629
Avg. obs./panelist                                     17.4                                               17.4
First-stage underidentification test                   χ2 = 24.85, p = .0000                              χ2 = 33.95, p = .0000
First-stage weak identification test                   F = 17.63                                          F = 36.70

*Significant at 10%, **significant at 5%, ***significant at 1% (one-tailed tests).

Table 5. Placebo Test Results.

Restaurants (Yearly Panel)
                                                       2 × 2 Specification                                Direct Test of Hypotheses
log(Rev)                                 Prediction    Coef.     Newey SE   p-Value   Adjusted p-Value    Coef.     Newey SE   p-Value   Adjusted p-Value
M (γ1)                                                 .0179     (.0367)    .312      .312                .0115     (.0339)    .367      .367
Vq (γ2)                                  γ2 < 0        .0712     (.0648)    .136      .136                .0428     (.0480)    .186      .186
Vt (γ3)                                                .1920     (.1694)    .129      .129                .1948     (.1563)    .106      .106
Vt × 1[low M & low Vq] (γ4)              γ4 > 0        −.1007    (.1391)    .235      .235
Vt × 1[high M & high Vq] (γ5)            γ5 > 0        −.2061    (.1210)    .044**    .056*
Vt × 1[low M & high Vq] (γ6)             γ6 > 0        −.2041    (.1802)    .129      .129
Vt × 1[low M or high Vq] (γ7)            γ7 > 0                                                           −.1317    (.1284)    .153      .153
Pol                                                    .0246     (.0241)    .153      .153                .0220     (.0242)    .182      .182
Fran                                                   .0214     (.0744)    .387      .387                .0284     (.0734)    .350      .350
Firm fixed effects                                     Included                                           Included
Time fixed effects                                     Included                                           Included
Number of obs.                                         13,410                                             13,410
Number of panelists                                    3,051                                              3,051
Avg. obs./panelist                                     4.4                                                4.4
First-stage underidentification test                   χ2 = 28.77, p = .0000                              χ2 = 32.24, p = .0000
First-stage weak identification test                   F = 18.13                                          F = 33.68

Hotels (Quarterly Panel)
                                                       2 × 2 Specification                                Direct Test of Hypotheses
log(Rev)                                 Prediction    Coef.     Newey SE   p-Value   Adjusted p-Value    Coef.     Newey SE   p-Value   Adjusted p-Value
M (γ1)                                                 .0261     (.0560)    .320      .320                .0189     (.0453)    .338      .338
Vq (γ2)                                  γ2 < 0        .0934     (.1169)    .212      .212                .0750     (.0851)    .189      .189
Vt (γ3)                                                .0828     (.1407)    .278      .278                .0941     (.1437)    .256      .256
Vt × 1[low M & low Vq] (γ4)              γ4 > 0        −.0430    (.0760)    .286      .286
Vt × 1[high M & high Vq] (γ5)            γ5 > 0        −.1111    (.1389)    .212      .212
Vt × 1[low M & high Vq] (γ6)             γ6 > 0        −.0886    (.1403)    .264      .264
Vt × 1[low M or high Vq] (γ7)            γ7 > 0                                                           −.0699    (.0964)    .234      .234
Pol                                                    .0014     (.0067)    .418      .418                .0013     (.0067)    .425      .425
Firm fixed effects                                     Included                                           Included
Time fixed effects                                     Included                                           Included
Number of obs.                                         26,990                                             26,990
Number of panelists                                    1,581                                              1,581
Avg. obs./panelist                                     17.1                                               17.1
First-stage underidentification test                   χ2 = 5.48, p = .0192                               χ2 = 8.75, p = .0031
First-stage weak identification test                   F = 4.14                                           F = 8.27

*Significant at 10%, **significant at 5%, ***significant at 1% (one-tailed tests).



We also report in Table WJ.2 in Web Appendix J the estimation results assuming all quality and taste deviations to be of the more general “same sign” type, as an additional robustness check. Since the opposite-signed deviations represent less likely scenarios with larger magnitudes of deviations, we verify that the significant effects of the variances are not driven by the extreme values simulated on the basis of our distributional assumptions on the deviations. Estimation results are robust to plausible variations in distributional parameters for quality and taste deviations, and even when we assume them to be of the same sign for all reviews (where no distributional assumption is needed), the significant effects of the variances continue to hold with the same significance level.

Discussion and Conclusion
This article extends the literature on online reviews by considering the effect of the rating variance from two sources, within-firm stochastic quality and across-consumer taste mis-

review variables vary over time. This variation could come from multiple sources. The first is the stochastic nature of the quality realizations that the first-period consumers receive, since second-period consumers do not observe the distribution from which the ratings are drawn but observe only a set of draws from that distribution. The second is the relative harshness of the first-period reviewers. A third source of variation could come from firms shifting their quality parameters (e.g., investing in quality) or taste positioning. The first two sources of variation are exogenous, and we leverage the second in our IV estimation strategy. The third source of variation is what leads to long-term implications of the effect of these two variances. To calculate the review parameters, we focus on the most relevant recent reviews instead of the complete history of all the reviews, since consumers rely more on recent reviews (because they feel these reviews more accurately reflect the current characteristics of the focal establishment). This leads to varying incentives for the firm to change its quality and taste variances, depending on the historical set of reviews. Lower
match costs, and in doing so broadens the applicability of mean reviews can potentially lead firms to market to a broader set
such inquiries to a large number of service industries in of consumers, leading to greater taste variance and potential increases
which service quality varies across consumer experiences. in sales coming from this larger market. Similarly, hotels that market
We find that the results of a controlled laboratory experiment only to a niche audience, and have lower taste variance, may have
and two field studies are compatible with the theoretical pre- more incentive to reduce the variation in quality realizations by con-
dictions based on our assumed process of how reviews left by centrating on reducing the lower end of the service quality distribu-
past customers influence the purchase decision of future cus- tion, since the marginal effect of Vq is greater with low Vt . In
tomers. The overarching takeaway from both the theoretical addition, since mean ratings are reported by review sites for the
model and empirical findings is that the two sources of vari- entire set of reviews, ongoing changes in quality or positioning
ance interact, and thus their marginal effects depend on the will be reflected more, initially, in the review content itself, implying
level of the other. The effect on future demand for taste var- that these variances take on added importance to both the firm and the
iance can either be positive or negative, depending on the consumer.
level of quality variance, whereas the effect of quality vari- Firms also can learn from monitoring the changes in Vq
ance on future demand is always negative, but marginally and Vt over time, since the most recent Vt and M determine
smaller with high levels of taste variance. the firm’s average quality of service, as perceived by the cus-
Our analytic model explicitly delineates how customers deter- tomers, whereas the most recent Vq determines the variability
mine what to include in their text reviews as well as how prospec- of the current service quality. Thus, although it is well estab-
tive customers interpret the information to gauge (1) the average lished that variance in service quality can be harmful for
quality level and the degree of uncertainty in quality and (2) the repeat business, our theoretical and empirical results show
importance of mismatch costs associated with the product offer- that higher quality variance positively moderates the effect
ing. Although we find strong support for the predictions coming of taste variance, and because these two variances interact,
from this analytic model, we note that we directly test for only the converse is also true. Thus, increases in Vt can result in
one element associated with the process, namely that consumers attracting new customers because the average quality level
can reliably differentiate between statements concerning quality can be inferred to be higher (i.e., the positive effect of Vt
and taste. Thus, we do not assume customers strictly follow on demand) when quality variance is large. Similarly, as
our assumed process; instead, their actions may be like those shown empirically in our results, low values of Vt increase
of a pool player who knows how to bounce a shot off the side the marginal effect of increasing the reliability of the estab-
of the pool table to sink a ball into the pouch without using lishment’s service quality. Thus, the net impact of both
exact geometry. However, we believe the predictions coming types of variation is not straightforward but is tempered by
from our model provide important prescriptive validity. With the level of the other variation.
this noted, we next discuss some implications that flow from Collectively, these findings suggest that firms should track
our model and results. the source of the total variance in their ratings, as the two
Our two-period analytical model is silent on the long-term types of variance have different implications for future
implications of the effect of the reviews. In fact, it implicitly demand. Using textual review information, a firm’s managers
assumes that all of the underlying parameters are known by can monitor the source of variance in reviews, and if it is due
the start of the second period and thus Vq and Vt are known to customers’ heterogeneous tastes, they can further investigate
and fixed. However, empirically we find that these two whether such variance is helpful or whether they should work to
152 Journal of Marketing Research 60(1)

realign the firm’s offerings to better fit the taste of the target customers. On the other hand, if there is large quality variance in customer feedback, they should determine whether improving the firm’s service consistency by reducing negative experiences will be cost effective, noting that such a change may nullify the positive lift to sales from taste discovery. Our findings should generalize to other familiar product categories consisting of feature-based and experiential attributes, which can be decomposed into vertical and horizontal dimensions. In addition, our methodology presents a simple and practical implementation strategy to scan, learn, and track the information contained in online text reviews.

Our findings also have implications for the review platforms, which may enhance user experience by providing the summaries of the quality- and taste-related review content. Some platforms nowadays provide key information, such as positively and negatively valenced keywords, but since consumers utilize quality and taste information in different ways in making purchase decisions, it would be helpful to present each type of content separately (or perhaps present positive/negative keywords for each dimension).

While our study’s data collection of Google reviews provides exhaustive coverage of available review information on the largest search website, it also has limitations because consumers also use other online sites to view customer reviews, and some industries have had other popular sites, such as Yelp for restaurants and Tripadvisor for hotels. This limitation was circumvented with our laboratory study, in which the provided reviews were the only source of available information and were independent of the specific platform.33 Regarding generalizability, the results may only apply to familiar product categories, as consumers may not be capable of differentiating vertical and horizontal attributes if the product class is not well understood by most consumers. In addition, although our results hold for our samples in general, we believe the results are most applicable to early reviews in new markets, since it is these markets where consumers have the most diffuse prior beliefs and are most likely to seek additional information.

An interesting extension would be to use our process model to investigate the effect of reviews when conditions change and new consumers use the reviews to set their priors before getting a service experience. Do the consumers’ prior beliefs affect the way they interpret the text information? That is, does confirmatory bias alter the updating process? Does this altered process affect subsequent reviews and thus subsequent sales? Such an exploration can provide insights on the impact of early and more recent reviews, as well as the characteristics of the reviewers and inferences about them. This type of exploration is similar to the work of Bondi (2019), who investigates a problem of self-selection. The effect of reviews on demand under such biases would be an interesting empirical question. We leave such extensions regarding incomplete information scenarios to future studies. A normative study on how firms should respond to reviews to enhance revenues would also be an interesting extension.

33 Even in our field studies, although we were unable to control for the external information on other platforms, our results using the instruments still stand the test of significance, under the assumption that our instruments are independent from external information.

Appendix: Quality and Taste Deviations

Once each review is classified in terms of the amount of content devoted to quality- and taste-related content, we calculate the two partial variances of the ratings for each firm as described here. Let $s_i$ denote the rating associated with the $i$th review and $s$ be the total mean of all text review ratings for that establishment at that time, and define $\Delta_i$ as the observed deviation of $s_i$ from $s$. The rating given by the reviewer is $s_i = v_i - t x_i$. The quality realization can be written as a deviation from the expectation, $v_i = E[v \mid f(v, r)] + \Delta_{qi}$, and the taste realization can similarly be written as $-t x_i = E[-t x_i \mid g(t)] + \Delta_{ti}$, in which $\Delta_{qi} + \Delta_{ti} = \Delta_i$. Once $\Delta_{qi}$ and $\Delta_{ti}$ are known for all text reviews, the quality and taste variance can be calculated as the variance of the set of $\Delta_{qi}$ and $\Delta_{ti}$ values, respectively. We next describe how the $\Delta_{qi}$ and $\Delta_{ti}$ values are determined.

Recall that we assume that the content of the review (i.e., the amount of the review focused on quality vs. taste), represented by $q_i$, reflects the relative deviations from the consumer’s expectations. If $q_i$ is equal to 0 or 1 for review $i$, then the total deviation, $\Delta_i$, is attributed to taste or quality, respectively, and thus can be associated with either $V_t$ or $V_q$. If $q_i = \tfrac{1}{2}$, that is, the review discusses quality and taste in equal proportions, then $\Delta_{qi} = \Delta_{ti} = \tfrac{1}{2}\Delta_i$ when $\Delta_i \neq 0$. If $\Delta_i = 0$, that is, the rating is equal to the mean rating, then either $\Delta_{qi} = \Delta_{ti} = 0$ or $\Delta_{qi} = -\Delta_{ti}$, that is, there are an infinite number of possible quality and taste shocks that can explain both the observed rating and $q_i$, but they need to be of equal magnitude. If $0 < q_i < 1$ and $q_i \neq \tfrac{1}{2}$, then there are two possible sets of deviations:

$$\Delta_{qi} = q_i \Delta_i \quad \text{and} \quad \Delta_{ti} = (1 - q_i)\Delta_i; \text{ or} \tag{A1}$$

$$\Delta_{qi} = \frac{-q_i}{1 - 2q_i}\Delta_i \quad \text{and} \quad \Delta_{ti} = \frac{1 - q_i}{1 - 2q_i}\Delta_i. \tag{A2}$$

The first solution is when the quality and taste deviations are in the same direction, and the second occurs when these deviations are in opposite directions, in which case the deviation of the more talked-about factor (quality or taste) has the same sign as the overall deviation, and the sign of the less talked-about deviation has the opposite sign.

Figure A1. Distribution of Quality and Taste Deviations in Restaurant (Top) and Hotel (Bottom) Reviews.

In our main specification, we assume that consumers cannot distinguish between the two solutions (or in the case of $q_i = \tfrac{1}{2}$ and $\Delta_i = 0$, the infinite number of solutions). However, they may infer the sign of the more talked-about deviation when $q_i \neq \tfrac{1}{2}$ (as the valence of the main discussion should be obvious), but this is not sufficient to assess whether the less talked-about deviation is of the same or opposite valence. Thus, we assume that consumers form expectations over the unobserved deviations. To integrate over the unobserved shocks, we (and consumers)
draw the likelihood of deviations from a normal distribution. In other words, small deviations from the expectation are highly likely to occur (for both quality and taste), whereas large deviations are less likely. We infer the distributional parameters from the observed $\Delta_{qi}$ and $\Delta_{ti}$ values from the set of reviews with known $\Delta_{qi}$ and $\Delta_{ti}$ (i.e., those reviews with $q_i$ equal to 0 or 1, or $q_i = \tfrac{1}{2}$ and $\Delta_i \neq 0$). As shown in Figure A1, the distribution of both $\Delta_q$ and $\Delta_t$ is approximately normal with mean 0 and standard deviation of .6 for the restaurant reviews and mean 0 and standard deviation of .7 for the hotel reviews.

Acknowledgments

The authors would like to acknowledge the valuable feedback from Carl Mela throughout this research project. Additionally, they thank Monic Sun, Hana Choi, Xiao Liu, Jennifer Cutler and Kathryn (Sharpe) Wessling for providing helpful comments that improved this manuscript. They are also grateful to seminar participants at Duke Marketing Department, especially Chris Moorman. The first author would like to acknowledge the technical guidance of David Motsinger on data collection.

Associate Editor

Raghuram Iyengar

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the Marketing PhD student research fund at the Fuqua School of Business at Duke University.

ORCID iDs

Nah Lee https://orcid.org/0000-0002-2598-6297
Bryan Bollinger https://orcid.org/0000-0001-8596-6418

References

Anderson, Michael and Jeremy Magruder (2012), “Learning from the Crowd: Regression Discontinuity Estimates of the Effects of an Online Review Database,” Economic Journal, 122 (563), 957–89.
Blakesley, Richard E., Sati Mazumdar, Mary Amanda Dew, Patricia R. Houck, Gong Tang, Charles F. Reynolds III, et al. (2009), “Comparisons of Methods for Multiple Hypothesis Testing in Neuropsychological Research,” Neuropsychology, 23 (2), 255–64.
Bondi, Tommaso (2019), “Alone, Together: Product Discovery Through Consumer Ratings,” NET Institute Working Paper No. 19-09, NYU Stern School of Business, SSRN, https://ssrn.com/abstract=3468433.
Boulding, William, Ajay Kalra, Richard Staelin, and Valarie A. Zeithaml (1993), “A Dynamic Process Model of Service Quality: From Expectations to Behavioral Intentions,” Journal of Marketing Research, 30 (1), 7–27.
Chen, Peiyu, Lorin M. Hitt, Yili Hong, and Shinyi Wu (2021), “Measuring Product Type and Purchase Uncertainty with Online Product Ratings: A Theoretical Model and Empirical Application,” Information Systems Research, 32 (4), 1470–89.
Chen, Pei-Yu, Shin-yi Wu, and Jungsun Yoon (2004), “The Impact of Online Recommendations and Consumer Feedback on Sales,” in ICIS 2004 Proceedings, 58, http://aisel.aisnet.org/icis2004/58.
Chevalier, Judith A. and Dina Mayzlin (2006), “The Effect of Word of Mouth on Sales: Online Book Reviews,” Journal of Marketing Research, 43 (3), 345–54.
Clemons, Eric K., Guodong Gordon Gao, and Lorin M. Hitt (2006), “When Online Reviews Meet Hyperdifferentiation: A Study of the Craft Beer Industry,” Journal of Management Information Systems, 23 (2), 149–71.
Duan, Wenjing, Bin Gu, and Andrew B. Whinston (2008a), “Do Online Reviews Matter? An Empirical Investigation of Panel Data,” Decision Support Systems, 45 (4), 1007–16.
Duan, Wenjing, Bin Gu, and Andrew B. Whinston (2008b), “The Dynamics of Online Word-of-Mouth and Product Sales—An Empirical Investigation of the Movie Industry,” Journal of Retailing, 84 (2), 233–42.
Harbaugh, Rick, John Maxwell, and Kelly Shue (2016), “Consistent Good News and Inconsistent Bad News,” working paper.
Holm, Sture (1979), “A Simple Sequentially Rejective Multiple Test Procedure,” Scandinavian Journal of Statistics, 6 (2), 65–70.
Hu, Nan, Paul A. Pavlou, and Jie Jennifer Zhang (2017), “On Self-Selection Biases in Online Product Reviews,” MIS Quarterly, 41 (2), 449–71.
Huang, Guofang and K. Sudhir (2019), “The Causal Effect of Service Satisfaction on Customer Loyalty,” SSRN, https://ssrn.com/abstract=3391242.
Lee, David S., Justin McCrary, Marcelo J. Moreira, and Jack Porter (2020), “Valid t-Ratio Inference for IV,” preprint, arXiv, https://arxiv.org/abs/2010.05058.
Liu, Xiao, Dokyun Lee, and Kannan Srinivasan (2019), “Large-Scale Cross-Category Analysis of Consumer Review Content on Sales Conversion Leveraging Deep Learning,” Journal of Marketing Research, 56 (6), 918–43.
Liu, Yong (2006), “Word of Mouth for Movies: Its Dynamics and Impact on Box Office Revenue,” Journal of Marketing, 70 (3), 74–89.
Luca, Michael (2016), “Reviews, Reputation, and Revenue: The Case of Yelp.com,” Harvard Business School NOM Unit Working Paper No. 12-016, SSRN, https://ssrn.com/abstract=1928601.
Marshall, Andrea, Douglas G. Altman, Roger L. Holder, and Patrick Royston (2009), “Combining Estimates of Interest in Prognostic Modelling Studies After Multiple Imputation: Current Practice and Guidelines,” BMC Medical Research Methodology, 9 (57), doi.org/10.1186/1471-2288-9-57.
Meyer, Robert J. (1981), “A Model of Multiattribute Judgments Under Attribute Uncertainty and Informational Constraint,” Journal of Marketing Research, 18 (4), 428–41.
Romano, Joseph P., Azeem M. Shaikh, and Michael Wolf (2010), “Multiple Testing,” in The New Palgrave Dictionary of Economics, Matias Vernengo, Esteban Perez Caldentey, and Barkley J. Rosser Jr., eds. London: Palgrave Macmillan.
Rozenkrants, Bella, S. Christian Wheeler, and Baba Shiv (2017), “Self-Expression Cues in Product Rating Distributions: When People Prefer Polarizing Products,” Journal of Consumer Research, 44 (4), 759–77.
Rubin, Donald B. (1987), Multiple Imputation for Survey Nonresponse. New York: John Wiley & Sons.
Rust, Roland T., J. Jeffrey Inman, Jianmin Jia, and Anthony Zahorik (1999), “What You Don’t Know About Customer-Perceived Quality: The Role of Customer Expectation Distributions,” Marketing Science, 18 (1), 77–92.
Shaffer, Juliet Popper (1986), “Modified Sequentially Rejective Multiple Test Procedures,” Journal of the American Statistical Association, 81 (395), 826–31.
Sriram, S., Pradeep K. Chintagunta, and Puneet Manchanda (2015), “Service Quality Variability and Termination Behavior,” Management Science, 61 (11), 2739–59.
Sun, Monic (2012), “How Does the Variance of Product Ratings Matter?” Management Science, 58 (4), 696–707.
Tucker, Catherine and Juanjuan Zhang (2011), “How Does Popularity Information Affect Choices? A Field Experiment,” Management Science, 57 (5), 828–42.
Vermeulen, Ivar E. and Daphne Seegers (2009), “Tried and Tested: The Impact of Online Hotel Reviews on Consumer Consideration,” Tourism Management, 30 (1), 123–27.
West, Patricia M. and Susan M. Broniarczyk (1998), “Integrating Multiple Opinions: The Role of Aspiration Level on Consumer Response to Critic Consensus,” Journal of Consumer Research, 25 (1), 38–51.
Westfall, Peter H. and S. Stanley Young (1993), Resampling-Based Multiple Testing: Examples and Methods for p-Value Adjustment, Wiley Series in Probability and Statistics, Vol. 279. New York: John Wiley & Sons.
Zhang, Xiaoquan and Chrysanthos Dellarocas (2006), “The Lord of the Ratings: Is a Movie’s Fate Is Influenced by Reviews?” ICIS 2006 Proceedings, 117, https://aisel.aisnet.org/icis2006/117.
Zhu, Feng and Xiaoquan Zhang (2010), “Impact of Online Consumer Reviews on Sales: The Moderating Role of Product and Consumer Characteristics,” Journal of Marketing, 74 (2), 133–48.
Zimmermann, Steffen, Philipp Herrmann, Dennis Kundisch, and Barrie R. Nault (2018), “Decomposing the Variance of Consumer Ratings and the Impact on Price and Demand,” Information Systems Research, 29 (4), 984–1002.
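
As a supplementary illustration of the Appendix decomposition in Equations (A1) and (A2), the sketch below lists the candidate quality and taste deviations for a single review, given the share of the review devoted to quality and the deviation of its rating from the mean. The function name `candidate_deviations` and its arguments `q` and `delta` are hypothetical and not taken from the article. The handling of the case with q = 1/2 and a zero deviation is a simplification: only the zero pair is returned, although the Appendix notes that infinitely many offsetting pairs are possible there.

```python
from typing import List, Tuple


def candidate_deviations(q: float, delta: float) -> List[Tuple[float, float]]:
    """Return candidate (quality, taste) deviation pairs for one review.

    q     -- share of the review text about quality, between 0 and 1
    delta -- rating minus the establishment's mean rating
    Every returned pair sums to delta.
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must lie in [0, 1]")
    if q in (0.0, 1.0):
        # The whole deviation is attributed to taste (q = 0) or quality (q = 1).
        return [(q * delta, (1.0 - q) * delta)]
    if q == 0.5:
        # Equal emphasis: an even split. For delta == 0 this returns only
        # the zero pair (simplifying assumption, see the lead-in).
        return [(0.5 * delta, 0.5 * delta)]
    # (A1): both deviations share the sign of delta.
    same_sign = (q * delta, (1.0 - q) * delta)
    # (A2): deviations have opposite signs; the more discussed factor
    # keeps the sign of delta.
    scale = 1.0 - 2.0 * q
    opposite_sign = (-q / scale * delta, (1.0 - q) / scale * delta)
    return [same_sign, opposite_sign]


# A review that is 75% about quality and rated one point above the mean.
for dq, dt in candidate_deviations(0.75, 1.0):
    print(dq, dt)
```

Given a set of such pairs, the quality and taste variances would then be computed over the chosen deviations for each establishment; how the two candidates are weighted follows the expectation step described in the Appendix and is not reproduced here.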
