Automated Marketing Research Using Online Customer Reviews

Automated Marketing Research Using
Online Customer Reviews

Thomas Y. Lee
Adjunct Assistant Professor, Operations and Information Management Department
The Wharton School of the University of Pennsylvania
Visiting Assistant Professor, Operations and Information Technology Management
F402 Haas School of Business, #1930
University of California, Berkeley
Berkeley, CA 94720-1930
thomasyl@haas.berkeley.edu
(415) 672-5801 (Phone)
Eric T. Bradlow*
The K.P. Chao Professor; Professor of Marketing, Statistics and Education
Vice-Dean and Director, Wharton Doctoral Programs
Co-Director, Wharton Interactive Media Initiative
The Wharton School of the University of Pennsylvania
761 JMHH; 3730 Walnut Street
Philadelphia, PA 19104
ebradlow@wharton.upenn.edu
(215) 898-8255 (Phone)
* Author Note. The authors would like to thank Esther Chen, Ellen Ngai, and Sojeong Hong for
helping us with the data coding, and Steven O. Kimbrough, Balaji Padmanabhan, Yoram Wind,
Paul E. Green, Abba M. Krieger, attendees of the Utah Winter Information Systems Conference
and Florida Decision and Information Sciences Workshop, and the anonymous reviewers for
useful suggestions and comments. This work was partially supported by the Fishman Davidson
Center for Service and Operations Management at the Wharton School of the University of
Pennsylvania, the Wharton eBusiness Initiative, and the Wharton Interactive Media Initiative.
Electronic copy available at: http://ssrn.com/abstract=1726055
Automated Marketing Research Using

Online Customer Reviews
Abstract
Market structure analysis is a basic pillar of marketing research. Classic challenges in
marketing such as pricing, campaign management, brand positioning, and new product
development are rooted in an analysis of product substitutes and complements inferred from
market structure.
In this paper, we present a method to support the analysis and visualization of market
structure by automatically eliciting product attributes, and brands relative positions, from online
customer reviews. First, we discover attributes and attribute dimensions using the "Voice of the
Consumer, as reflected in customer reviews, rather than that of manufacturers. Second, the
approach runs automatically. Third, we support rather than supplant managerial judgment by
reinforcing or augmenting attributes and dimensions found through traditional surveys and focus
groups.
We test the approach on six years of customer reviews for digital cameras during a period
of rapid market evolution. We analyze and visualize results in several different ways including
comparisons to expert buying guides, a laboratory survey, and Correspondence Analysis of
automatically discovered product attributes. Managerial insights drawn from the analysis are
evaluated with respect to proprietary market research reports from the same time period
analyzing digital imaging products.
Key words: market structure analysis; online customer reviews; text mining
Lee and Bradlow
Electronic copy available at: http://ssrn.com/abstract=1726055
INTRODUCTION
Marketing research, the set of methods to collect and draw inferences from market-level
customer and business information, have been the lifeblood of the field of marketing practice and
the focus of much academic research in the last 30+ years. Simply put, marketing research, the
methods that surround it, and the inferences derived from it, have put marketing as an academic
discipline and as a functional area within the firm on the map. From a practical perspective,
this has brought forward the stalwarts and toolbox of the marketing researcher including
methods such as: preference data collection via Conjoint Analysis (Green and Srinivasan 1978)
inferring market structure via multi-dimensional scaling (Elrod 1988; Elrod 1991; Elrod, Russell
et al. 2002), inferring market segments through clustering routines (DeSarbo, Howard et al.
1991), or simply understanding the sentiment and Voice of the Consumer (Griffin and Hauser
1993). While these methods are here to stay, the radical changes wrought by the Internet and
user-generated media promise to fundamentally alter the data and collection methods that we use
to perform these methods.
In this paper, we propose to harness the growing body of free, unsolicited, user-generated
online content for automated market research. Specifically, we describe a novel text-mining
algorithm for analyzing online customer reviews to facilitate the analysis of market structure in
two ways. First, the Voice of the Consumer, as presented in user-generated comments, provides
a simple, principled approach to generating and selecting product attributes for market structure
analysis. Traditional methods, by contrast, rely upon a pre-defined set of product attributes
(external analysis) or ex-post interpretation of derived dimensions from consumer surveys
(internal analysis). Second, the preponderance of opinion as represented in the continuous
Lee and Bradlow
stream of reviews over time provides practical input to augment traditional approaches (such as
surveys or focus groups) for conducting brand sentiment analysis, and can be done (unlike
traditional methods) continuously, automatically, very inexpensively, and in real-time.
Our focus on market structure analysis is not by chance, but rather due to its centrality in
marketing practice and its fit with text mining of user-generated content. Analysis of market
structure is a key step in the design and development of new products as well as the repositioning
of existing products (Urban and Hauser 1993). Market structure analysis describes the
substitution and complementary relationships between the brands (alternatives) that define the
market (Elrod, Russell et al. 2002). In addition to descriptive modeling, market structure
analysis is used for predicting marketplace responses to changes such as pricing (Kamakura and
Russell 1989), marketing strategy (Erdem and Keane 1996), product design, and new product
introduction (Srivastava, Alpert et al. 1984). Hence, if automated in a fast, inexpensive way (as
described here), it can have significant impact on marketing research and the decisions that
emanate from it.
There is a long history of research in market structure analysis. Approaches vary by the
type of data analyzed (panel-level scanner data, aggregate sales, consumer survey response) and
by the analytic approach (Elrod, Russell et al. 2002). Regardless of an internal or external
approach, with few exceptions, market structure analysis begins with a predetermined set of
attributes and their underlying dimensions, the same set of survey or transaction sales data, and
assumes that all customers perceive all products the same way and differ only in their
evaluation of product attributes (Elrod, Russell et al. 2002). Models that incorporate customer
uncertainty about product attributes (Erdem and Keane 1996) serve to highlight the colloquial
wisdom, garbage in, garbage out. Surprisingly, despite its importance, there is little extant
Lee and Bradlow
research to guide attribute selection for these methods (Wittink, Krishnamurthi et al. 1982).
There is literature on the sensitivity of conjoint results to changing attribute selection, omitting
an important attribute, level spacing, etc. (Green and Srinivasan 1978) but not much on how to
choose those attributes in the first place. We propose to fill this gap using automated analysis of
online customer reviews. Specifically, in this paper, we visualize market structure by applying
correspondence analysis to product attributes mined from the Voice of the Consumer.
We note, however, that ours is by far not the first use of user-generated content or even
specifically online reviews for the purposes of marketing action. The impact of customer
reviews on consumer behavior has long been a source of study. A large body of work explores
how reviews reflect or shape a seller's reputation (Eliashberg and Shugan 1997; Dellarocas 2003;
Ghose, Ipeirotis, et al. 2006). Other researchers have studied the implications of customer
reviews for marketing strategy (Chen and Xie 2004). To stress its importance and ubiquitous
nature, a 2009 conference co-sponsored by the Marketing Science Institute and the Wharton
Interactive Media Initiative had over 50+ applicants all doing work on user-generated content.
With that said, there has been comparatively little work on what marketers might learn
from customer reviews for purposes of studying market structure. Early work combining
marketing and text mining focused on limited sets of attributes for purposes of analyzing price
premiums associated with specific characteristics (Archak, Ghose et al. 2007; Ghose and
Ipeirotis 2008). By comparison, our objective is to learn the full range of product attributes and
attribute dimensions voiced in product reviews and to reflect that in a visualization of market
structure that can be used for marketing actions. Work that analyzes online text to identify
market competitors (Pant and Sheng 2009) focuses on the corporate level and relies upon
network linkages between Web pages and online news. Social networks have also been used to
Lee and Bradlow
segment customers for planning marketing campaigns, but relies upon directly observable
relationships between individuals (e.g. friends-and-family calling plans) with no reference to
user-generated comments (Hill, Provost et al. 2006). In this paper, we focus on the productbrand level using user-generated product reviews.
Most recently, researchers have applied natural language processing (NLP) to
automatically extract and visualize direct comparative relationships between product brands from
online blogs (Feldman, Fresko et al. 2007; Feldman, Fresko et al. 2008). Our work is
complementary in at least three ways. First, we present a simpler set of text-mining techniques
that is less dependent upon complex language processing and hand-coded parsing rules, requires
minimal human intervention (only in the analysis phase), and is better suited to customer reviews
(Hu and Liu 2004; Lee 2005). Second, our focus is on learning attributes as well as their
underlying dimensions and levels. Our goal in learning attributes, dimensions and levels is not
only to facilitate direct analysis but also to inform more traditional market structure and conjoint
analysis methods. Third, we specifically highlight the Voice of the Consumer. Different
customer segments such as residential home users of personal computers versus hard core
gamers may refer to the same product attribute(s) using different terminology (Randall,
Terwiesch et al. 2007). These subtle differences in vocabulary may prove particularly useful in
identifying unique submarkets (Urban, Johnson et al. 1984).
Therefore, to summarize, we present a novel combination of existing text-mining and
marketing methodologies to expand the traditional scope of market structure analysis. We
describe an automated process for identifying and analyzing online product reviews that is easily
repeatable over reviews for both physical products and services and requires minimal
Lee and Bradlow
human/managerial intervention. The process extends traditional market structure analysis in the
following ways:
x
Provides a principled approach for generating and selecting attributes for market structure
analysis by identifying what attributes customers are commenting on. These include
attributes that are not highlighted using more traditional methods for eliciting attribute
and dimensions such as those used in generating expert-buying guides.
Captures not only what attributes customers are speaking about but also subtle
differences in vocabulary (West, Brown et al. 1996) that may separate brands or identify
unique submarkets. These differences manifest themselves in two ways: the granularity
with which customers describe an attribute and the synonyms that customers might use.
Facilitates market structure analysis over time by enabling the periodic (re)estimation of
market structure through discrete sampling of the continuous stream of reviews.
Enables more detailed analysis by exploiting the natural segregation of comments by

sentiment polarity (pros versus cons) in user-generated online product reviews.
Possesses high face validity in attribute and dimension selection for practical significance
because marketing managers are familiar with and have easy access to online reviews.
In the remainder of this paper, we describe our methodology, apply our approach to six years of
online product review data for digital cameras, and evaluate the approach in several different
ways including comparisons to expert buying guides, a user survey, and proprietary market
research reports.
METHODLOGY
Lee and Bradlow
Our approach is summarized in Figure 1. We review the entire process here in a highlevel fashion, as many of these techniques are new to the marketing audience; specific details are
provided in Appendix 1 for those who would like to replicate our approach. The process begins
with a set of online reviews in a product category over a specified time frame. For example, in
this paper, we consider the reviews for all digital cameras available at Epinions.com between
July 5, 2004 and January 1, 2008. Figure 1, Step 1 shows three reviews, one for a camera
manufactured by Olympus, one for a camera by HP (Hewlett Packard), and one for a camera by
Fuji. In Step 2, screen scraping software automatically extracts details from each review
including the brand and explicitly labeled lists of Pros and Cons.
Our goal is to group all
phrases discussing a common attribute into one or more clusters to reveal what customers are
saying about the product space, the level of detail used to describe an attribute, and the specific
word choices that customers make. While some review sites do not provide user-authored Pro
and Con summaries (e.g. Amazon.com), many others including Epinions.com, BizRate, and
CNet do (Hu and Liu 2004). Exploiting the structure provided by Pro and Con lists allows us to
avoid numerous complexities and limitations of automated language processing of prose-like
text. This allows us to have our process be "automated", whereas extant research less so.
[Insert Figure 1 about here]
All of the Pros and Cons are then separated into individual phrases as depicted in column
1 of the table in Figure 1, Step 3. Preprocessing transforms the phrases from column 1 into a
normalized form. Column 2 depicts one step of pre-processing, the elimination of uninformative
stop-words such as articles ("the") and prepositions ("of"). Each phrase is then rendered as a
word vector. The matrix of word vectors is depicted in the remaining columns of the table in
Step 3. Each row is the vector for one phrase. Each vector index represents a word and the
Lee and Bradlow
vector value records a weighted, normalized count of the number of times that the indexed word
appears in the corresponding phrase.
Phrases are automatically grouped together based upon their similarity. Similarity is
measured as the cosine angular distance between word vectors and is identical to the Hamming
distance used in computer-science based research. Step 4 depicts the K-means clustering of the
phrases from Step 1. Conceptually, we may think of each cluster representing a different product
attribute. The example shows clusters for "zoom," "battery life," "picture quality," and
"memory."
While any number of clustering algorithms is acceptable, we selected K-means for its
simplicity and its familiarity to both the text-mining and marketing communities. More complex
topic clustering algorithms like Probabilistic Latent Semantic Analysis (PLSA) and Latent
Dirichlet Analysis (LDA) also require parameter estimation and marketing approaches such as
Latent Structure MDS (LSMDS) begin with a pre-defined set of product attributes. As noted
earlier, the principle contribution of our work is the elicitation of product attributes directly from
the customer.
In product design, a product architecture defines the hierarchical decomposition of a
product into components (Ulrich and Eppinger 2003). In the same way, Step 5 depicts the
hierarchical decomposition of a product attribute into its constituent dimensions (i.e. attribute
levels a la conjoint analysis). In Step 5a, we show a conceptual decomposition of the digital
camera product attribute memory. In Step 5b, we show an actual decomposition using only
the phrases from Step 1. The decomposition is treated as a linear programming assignment
problem (Hillier and Lieberman 2001). The objective is to assign each word in the attribute
cluster to an attribute dimension. Each word phrase defines a constraint on the assignment: any
Lee and Bradlow
two words that co-occur in the same phrase cannot be assigned to the same attribute dimension.
Thus, we know that smart and media cannot appear as a value for the attribute dimension
quantity (4, 8, 16) or for the attribute dimension of memory unit (mb). Note that not all
phrases include a word for every dimension. Intuitively, this is both reasonable and an important
aspect of capturing the Voice of the Consumer. We wish to know not only what customers say
but also the level of detail with which they say it. For the algorithm, phrases that do not include
a word for each attribute simply represent a smaller set of co-occurrence constraints than a
phrase containing more words.
ANALYSIS AND EVALUATION
We evaluate the quality and efficacy of analyzing market structure using online product
reviews in three ways: First, we ask whether we find any new information from user generated
product reviews not otherwise obtained from existing methods. Second, we measure the
importance of attributes discovered from within product reviews. Finally, we use
correspondence analysis to analyze the Voice of the Consumer and ask whether any meaningful
managerial insight is gained by visualizing market structure using product reviews. The first two
measures support the use of reviews as a complementary method for generating possible
attributes for use in market structure analysis. The third measure supports reviews and their
corresponding word counts as a complementary method for selecting attributes and deriving
pairwise brand distances for visualizing market structure. In this section, we evaluate our
approach utilizing all three measures on an actual set of digital camera product reviews.
Lee and Bradlow
Digital Cameras
Our initial data set consists of 8,226 online digital camera reviews downloaded from
Epinions.com on July 5, 2004. The reviews span 575 different products and product bundles that
range in price from $45 to more than $1000. Parsing the Pro and Con lists produces the
aforementioned phrase u word matrix that is 14,081 phrases by 3,364 words. We set k (the
number of maximum clusters for K-means) at 50. There are a number of general, statistical
approaches for initializing k including the Gap (Tibshirani, Walther et al. 2005) and KL
(Krzanowski and Lai 1988) statistics. Because of its computational simplicity, we plotted k
versus KL in a range from 45 to 55 and found that k maximizes the value of KL at 50.
Processing details are in Appendix 1.
Given an initial set of 50 clusters (from K-means), our next step is to further filter the
initial clusters into attribute dimensions. The CLP process produced a total of 171 sub-clusters
describing attributes and dimensions within the 50 initial clusWHUV$SSO\LQJD2 threshold of
.001 and further filtering the results using the Spearman Rank test rs reduces those 171 subclusters to 99. Within each cluster, sub-clusters may represent noise from K-means or different
ways in which customers express the dimensions of an attribute (e.g. the number of batteries, the
type of batteries, or battery life). To this point, the entire research process is fully automated
with no human intervention whatsoever. Finally, a manual reading reveals which of the
remaining sub-clusters discuss a common product attribute, and which are noise; albeit, in the
future even this could be automated. Our final reading identifies 39 clusters of product attributes
(see Table 1). Though we might have expected 50 sub-clusters, one for each of the initial
clusters, this is not the case. For some initial clusters, none of the sub-clusters pass the statistical
Lee and Bradlow
10
filters. In other cases, multiple sub-clusters within a single cluster may indicate that the parent
cluster does not cleanly distinguish a common product attribute.
To facilitate the presentation, we apply a nave convention for naming each cluster (a
common practice in marketing studies): scan the cluster for the most frequent word(s) in each
cluster. Some resulting cluster names may only have meaning in the product context.
Comments are inserted in parentheses to provide context to the automatically generated name as
well as to indicate where certain product attributes are duplicated. A listing of automatically
generated dimensions for each of the 39 attributes is available in Appendix 2.
[Insert Table 1 about here.]
Generating Attributes from Product Reviews

The purpose of mining the text of user-generated online reviews is to complement rather
than replace existing methods for analyzing market structure. As a measure of the value within
the text, we ask whether reviews reveal product attributes not found using traditional measures
such as those used to create expert buying guides. As a metric, we compute precision (P)
(Salton and McGill 1983) the number of automatically generated attributes and dimensions also
used by experts in published buying guides. Conversely, recall (R) counts the number of
attributes and levels named in professional buying guides that are automatically discovered in the
Voice of the Consumer. More formally, if X is the set of attributes from the Voice of the
Consumer and Y is the set of set of attributes identified in professional guides, P and R are
defined as:
P
Lee and Bradlow
X Y
X
and R
X Y
Y
11
However, exact matching is an aggressive standard in comparing hierarchies. Optical

zoom is an attribute in some reference guides while the automatic process identified optical as
a dimension of zoom. Borrowing from Popescu et al. (2004), we relax the definition of
precision and recall so that specific terms match more general terms, provided that the more
specific term appears as a dimension or level, and vice versa.
The columns of Table 2 correspond to each of ten different expert guides aimed at the
same consumer audience and available during the time period covered by our reviews. For this
study, the guides represent traditional methods for identifying meaningful product attributes.
Epinions (A) comprises the attributes and levels by which customers can browse the digital
camera product mix at Epinions (in 2005) and Epinions (B) represents a buying guide available
on the Epinions website (in 2005). CR02 CR05 represent print buying guides regarding digital
cameras from Consumer Reports for the years 2002 through 2005. The "Auto" column
represents the attributes and dimensions derived automatically from online reviews. Row one
indicates the average precision and row two, the average recall between the column source and
all other reference sources.
[Insert Table 2 about here]
Comparing the expert guides to one another reveals that there is no consensus among the
experts on what the "important" attributes are. The absence of consistency indicates that
analyzing online product reviews is not necessarily a redundant exercise. The Voice of the
Consumer reveals interest in specific product attributes that are not identified by experts. High
precision at any level of recall would indicate that polling the Voice of the Consumer is
redundant. Consumers do not reveal any information not already captured by existing methods.
Low precision and high recall would indicate that, in their reviews, customers mention every
Lee and Bradlow
12
possible attribute and offers no discriminatory power. The data indicates that analyzing reviews
yields high recall when compared to expert attributes. This means that users do report nearly all
of the attributes that experts do. In pairwise comparisons, the Voice of the Consumer has better
recall than any other single expert. Additionally, median precision suggests that consumers do
mention what the experts do but that consumers also mention some additional attributes.
The evidence suggests that reviews are neither a perfect reflection of the guides nor a
mindless list of every conceivable attribute offering no value in discriminating among attributes.
Product reviews do reveal information not used in the traditional methods. The managerial
question is therefore not whether online product reviews provide information; but instead, the
question is what value that information provides. Do reviews include important "unseen"
attributes? We address that next.
Importance of Attributes from Product Reviews

Through comparison to expert guides, we find that customers do mention unseen
attributes. To determine significance, we conducted a laboratory survey that asked subjects to
evaluate the importance of different attributes for the purpose of purchasing a new digital
camera. We find that automated analysis of online product reviews can support managerial
decision-making in at least two ways. First, our approach can identify significant attributes that
are otherwise overlooked by the experts. Second, the reviews can serve as a filter for other
attribute elicitation methods; attributes that are identified by experts but also named by
customers may have more salience for purposes of product marketing and design.
Specifically, our survey, which took less than 5 minutes to complete, was administered as
part of a 60 to 90 minute sequence of experiments where subjects were compensated at the
Lee and Bradlow
13
average rate of $10/hour. Pre-testing suggested no interference between our survey and the
unrelated studies conducted during the same session. In total, 181 subjects at a large
Northeastern university participated in our web-based survey. Based upon validity checks and
response times, usable results were obtained for 164 subjects.
A set of product attributes for testing was constructed by reconciling the 39 attributes
shown in Table 1 with all of the attributes identified in the 10 different reference buying guides
listed in Table 2. After duplicate attributes were eliminated, the resulting set of 55 attributes was
divided into overlapping thirds to reduce any individual respondents burden. See Appendix 3
for the complete list. A few attributes were repeated in each third as an additional validity check.
Each subject viewed between 20 and 21 attributes. Subjects were asked to rate their "familiarity
with" and the "importance of" each attribute using a 1- 7 scale. The specific questions were:
Imagine that you are about to buy a new digital camera. In the table below, we give a
list of camera attributes. For each attribute, please answer two questions:
First, from 1 7, please rate how familiar you are with the attribute. [1] means that you
have no idea what this attribute is and a [7] means that you know exactly what this is.
Second, from 1 7, please rate how much you care about this attribute when thinking
about buying a new digital camera. [1] means that you do not care about this attribute at all
and [7] means that this is critical. You would not think of buying without first asking about this
attribute.
Subjects were prompted to answer all questions. In particular, subjects were reminded to
answer the second question for each attribute even if they answered [1] for the first question. In
Lee and Bradlow
14
Part 2 of the survey, to understand the role that expertise might play, subjects were asked to
provide their self-assessed expertise on digital cameras (1= novice and 7 = expert). 1 Finally,
subjects were asked for a standard set of demographic variables such as age, gender, educationlevel, etc. These variables were used as covariates to verify our main findings.
Our main results are summarized in Figure 2 where we plot the mean familiarity versus
mean importance for each of the 55 attributes. Attributes that appeared only in one or more
buying guides are labeled "Expert Only" and symbolized by diamonds. Attributes that appear in
at least one buying guide and also in our automated results are labeled "Expert + VOC (Voice of
the Consumer)" and are identified by squares. Finally, attributes that emerged only from our
automated analysis of reviews are labeled "VOC Only" and plotted as "X" symbols. As we
would expect, the graph indicates a general trend upwards and to the right. Users are more
familiar with those product attributes that they tend to consider important. Similarly, if a user is
unfamiliar with a particular attribute, they are unlikely to place a high value on that attribute. A
complete table of attributes and their respective familiarity and importance means is provided in
Appendix 3. We note that, in general, "importance" does not vary greatly by gender, expertise,
or any other demographic variable. In several instances, whether the subject owns a digital
camera does exhibit some significance. The significance of covariates should be checked
separately in each applied domain.
[Insert Figure 2. about here]
Figure 2 suggests two significant managerial implications. First, there are eight attributes
labeled "VOC Only," and they tend towards the upper right-hand corner of the plot: (camera)
size, body (design), (computer) download, feel (durability), instructions, lcd (brightness), shutter
lag, and a cover (twist lcd for protection). The existence of product attributes that are both
Lee and Bradlow
15
familiar and important to users suggests the value of processing online reviews to augment
existing methods for identifying salient product attributes.
Moreover, by comparing the "Expert Only" plot to the "Expert + VOC" label in Figure 2,
we see that most high-importance, high-familiarity attributes appear as "Expert + VOC" while
the lower-left hand region is populated primarily by "Expert Only" attributes. Comparing the
"Expert Only" attributes to "Expert + VOC" attributes, we find a statistically significant
difference (p < .01) in average familiarity (4.2 to 5.1) and in average importance (4.1 to 4.9).
The difference (p < .05) between "Expert Only" and "VOC Only" is equally large and suggests a
second managerial application of automated analysis. The "Voice of the Consumer" as
represented in online product reviews can serve as a filter, highlighting meaningful product
attributes.
The potential for applying "VOC" as a filter is also seen in the wide disagreement among
the Expert Buying Guides. There are only seven product attributes that appear in at least 50%
(five) of the Expert guides used in our evaluation. Labeled by "+" symbols in Figure 2, we see
that even when the Experts agree, there is a wide variance in familiarity and importance that is
mediated by the "VOC."
At the same time, it is worth noting that our procedure does miss some high value
attributes. In particular, two sub-classes of attributes stand-out. "Brand" is certainly a prominent
missing attribute. Our approach relies on word frequencies. Specific references to any one
brand (e.g. Canon) may not appear with sufficient frequency to cluster as an attribute. We
might instead map all explicit brand names to a single common word (e.g. brand).
The second sub-class of attributes missed by our approach is largely due to differences in
how terms are classified. In Figure 2, the attribute ranked second in importance among "Guide
Lee and Bradlow
16
Only" results is "battery source." While our automated approach does capture characteristics of
batteries and even battery type, these values are classified as properties of the "battery" attribute.
Likewise, we distinguish between shutter lag and shutter delay. Although the terms refer to
the same physical characteristic, distinctions in vocabulary may reveal distinct submarkets. We
revisit this point further below. In this way, our analysis highlights differences between the
Voice of the Consumer as represented in product reviews with that of the manufacturer and
retailers marketing literature.
Selecting Attributes and Visualizing Market Structure Using Product Reviews

Through comparisons to expert guides and consumer surveys, we demonstrate that
customer reviews can complement existing methods for generating attributes used in market
structure analysis. As a third measure of efficacy, we use attribute counts within customer
reviews to select attributes, calculate brand distances, and visualize market structure. A product
brand is associated with each online review. Using the automatically generated attribute clusters,
we generate a brand by attribute matrix, counting the number of brand occurrences (number of
phrases) for each attribute. We row (brand) normalize the matrix by the total number of phrases
for that brand (Nagpaul 1999). To then turn this into a visual map, we utilize Correspondence
Analysis. Correspondence Analysis (CA) is a technique for analyzing two-way, two-mode
frequency data (Everitt and Dunn 2001) making it more appropriate for this task than
continuously-scaled MDS procedures that are commonly used for market structure maps. CA is
designed to help generate hypotheses by representing the data in a reduced space as determined
by consulting the eigenvalues and the corresponding scree plot (Greenacre 1992).
Lee and Bradlow
17
To help interpret the dimensions in the reduced space, we use (stepwise) regression of
each brands (x,y) coordinates on the derived attributes. The probability for entry is 0.05 and
probability for removal is 0.1, albeit other values were tested and yielded high robustness. To
ensure stability of the regression results, we use a rotationally invariant, asymmetric
correspondence analysis (Bond and Michalidis 1997).
To make the dimensions both interpretable and actionable, we follow the "House of
Quality" (Hauser and Clausing 1988) in relating customer needs, as represented by usergenerated reviews, to actionable manufacturer specifications. Specifically, product attributes
elicited from online reviews include different granularities ("zoom" versus "10x optical zoom")
or reflect different levels of technical sophistication ("low-light settings" vs. "iso settings").
Consulting the set of professional buying guides, we manually mapped our automatically
generated attributes onto a coarser but actionable categorization of specifications. The
visualizations are generated and interpreted using these meta-attributes (see Appendix 4).
For attribute selection, Figure 3 depicts the brand map in two dimensions based upon the
initial set of digital camera reviews analyzed for this study alongside the Scree plot of the
eigenvalues and cumulative percentage of inertia. In this instance, the average percentage
explained is 20%, indicating a clear separation between two and three dimensions. More
generally, we recognize that the reliability of any such model is dependent upon fit as
represented by the percentage of inertia captured by those dimensions. At the limit, we imagine
that marketing managers might use the two-dimensional figures as a point of departure for
interpretation with respect to decision-making. For example, we generated the additional F1-F3
and F2-F3 plots and found that the results are consistent with our 2D interpretation.
[Insert Figure 3. about here]
Lee and Bradlow
18
Regressing the CA coordinates (F1 and F2) on the meta-attributes defines F1 as a

combination of comments about in-camera navigation, storage, and video. We might
characterize F1 as those first-order attributes that the lay consumer would notice: easy to use
menus and navigation interfaces, the number of pictures available, and video capabilities. We
note that video was only just emerging as a standard feature of digital cameras during the initial
study period. By contrast, F2 is defined by attributes more relevant to technically sophisticated
consumers: low-light or iso-controls, lens characteristics (e.g. name brand optics like Zeiss).
Price also emerges as an attribute of F2, but may reflect the correlation of price with higher-end
products. A summary of the stepwise results for variable selection are included as Appendix 4.
An alternative approach to naming the dimensions would have been to regress on the actual,
manufacturer-provided physical dimensions. However, this is exactly what makes our approach
novel/different. Figure 3 is unique in that it reflects perceptions as revealed through automated
analysis of customer reviews.
When interpreting CA maps, it is important to note that the dimensions only reflect
relative relationships between brands and the underlying attributes. Being further from the origin
does not necessarily imply higher term counts along one dimension or another. Rather, distance
from the origin indicates greater deviation from the "average" brand.
We next assess the degree to which the brand map we derived from the Voice of the
Consumer, as derived in our research, is consistent with empirical market share and documented
brand strategies. Based upon 2004 market share numbers from IDC, the top three manufacturers,
in descending order, were Kodak, Sony and Canon. All other brands had share percentages of
10% or less. The corresponding brand map reveals three distinct clusters with Nikon as a
Lee and Bradlow
19
notable outlier. In the minds of the customer, the secondary manufacturers all cluster by one of
the three market leaders in a classic imitation strategy.
The positioning is also consistent with documented brand strategy, based upon
proprietary market research reports from the period. HP and Kodak explicitly sought to
differentiate themselves by emphasizing the camera-to-PC and camera-to-printer connection,
linking them in the minds of the consumer (Worthington 2003). Among the best known vendors,
Nikon was the brand of professional photographers rather than lay consumers, late in its
commitment to digital, and focused on the advanced amateur (Lyra 2003). It is therefore
unsurprising that Nikon would be furthest from the origin (brand "average").
At the same time, the brand map derived from the Voice of the Consumer reveals
information that complements market share numbers and the proprietary market research reports.
Traditional market research might cluster Fuji and Olympus based upon their shared
commitment to and introduction of the xD-Picture card memory format (Lyra 2003). Likewise,
the "Four-Thirds" partnership between Olympus and Kodak to standardize image-sensor size and
a removable lens mount might lead one to cluster those two brands (Lyra 2003). Instead, we see
that in the mind of the consumer, Olympus is more closely associated with market leader Sony
and fellow follower Panasonic. Despite introducing a unique memory format with Olympus at
the time, Fuji is paired with market leader Canon and Casio in the minds of the consumer.
Viewing market structure through product reviews enables three additional levels of
analysis. First, automation enables managers to easily track market structure evolution by
sampling reviews over time. Second, managers can drill into specific brand-attribute
relationships by examining word choice: how consumers express themselves. Third, managers
Lee and Bradlow
20
can gain additional insight into market structure by visualizing brand relationships by sentiment
polarity rather than overall comments.
One advantage of analyzing market structure based upon user generated product reviews
is the ability to quickly repeat the process and analyze changes in market structure over time. To
this end, we collected a parallel set of 5567 digital camera reviews from Epinions.com dated
between January 1, 2005 and January 28, 2008. The new set of reviews revealed five new
attributes replacing five previously formed clusters. The substitutions are listed in Table 3.
Although it is difficult to discern the underlying causes, the changes have face validity. As the
customer base becomes increasingly sophisticated and increasingly connected online, the need
for instructions and support has shifted towards online self-service. Likewise, the ubiquity of
personal computers and online photo management software may have shifted such functions
away from the camera. In keeping with the theme of a more technically sophisticated audience,
functions such as ISO settings, multiple shot modes, and white/color balance are more significant
or reflect active campaigns.
[Insert Table 3 about here]
Even though many attributes remain the same across time periods, changes in both
customers and the marketplace may drive (or reflect) brand repositioning over time. To visualize
these changes, we use the supplemental points method of Correspondence Analysis (Nagpaul
1999). We construct a brand by attribute count matrix for period two; applying the
transformation weights from the period one decomposition, we map the period two brand
positions onto the period one factors. Using supplemental points, we can trace market evolution
between two (or more) periods on a common set of factors (axes). Figure 4 overlays period two
(year >= 2005) on the two-dimensional space defined for period one (year < 2005).
Lee and Bradlow
21

Market share numbers for 2006 and 2007 published by the NPD group and Hon Hai
Worldwide show that Kodak, Sony, and Canon remained the three major manufacturers although
by 2007, Canon emerges as the overall market leader with Kodak falling to third. Consistent
with market share, the period two market structure continues to echo the overall positioning of a
market leader with a cluster of imitators around each. However, in period two, the clusters
exhibit a convergence where even Nikon has now drawn closer to the three market leaders.
The timing of this convergence matches the competitive forecasts from proprietary
market research for the digital imaging market (Lyra 2003) and is consistent with expectations
for a rapidly evolving technology-driven marketplace. From period one to period two, Nikon
made a deliberate effort to extend its brand down-market to pursue the beginning photographer.
Likewise, as the digital image market approached saturation, the remaining manufacturers were
predicted to (and did) focus on expanding their product portfolios to cover all market segments
(Lyra 2003). Successful new innovations were quickly adopted by competitors. Consider, for
example, Nikon's early attempts to differentiate using swivel LCD screens followed by the
subsequent diffusion of flip or rotating LCDs into rival product lines.
In addition to looking at market evolution, managers can drill into specific brand-attribute
relationships by examining word choice: how consumers express themselves. In order to
visualize market structure, the Voice of the Consumer is mapped onto a set of product
specifications (Hauser and Clausing 1988). Different users may refer to the same product
attribute using different terminology such as "iso-settings" versus "photos in dim lighting."
General terms "easy to use" need to be translated into actionable features such as "menus,"
"navigation," "auto settings," etc. As noted earlier, in this research, we followed the House of
Lee and Bradlow
22
Quality (Hauser and Clausing 1988) in manually rolling-up different user terms into a common,
actionable specification.
However, the manager may discover knowledge within specific consumer word choices.
Focusing on F1 and F2 from our original market structure analysis, we analyze the degree of
association between the three market leaders (Kodak, Canon, and Sony) and customer word
choice on F1 and customer word choice on F2. Word choice in F1 is separated into two classes:
generic terms "ease of use," "easy to use" and specifications "navigation," "menus," etc.
Likewise, word choice in F2 is separated into two classes: lay terminology "low-light"
photography, "dim-light," etc. and advanced terminology "iso settings," "iso adjustment," etc.
Because we manually map different word choices onto a common factor, it is entirely possible
that we are missing additional terminology. For example, consumers citing "ease of use" could
also be implicitly referring to ergonomic factors such as "grip" or "lightweight." Our analysis
here is intended to convey the potential of drilling down on word choice rather than provide an
authoritative accounting for a specific product attribute.
For F1, a Chi-square of 9.087 p-value = 0.011 and for F2, a Chi-square of 21.409, pvalue < 0.0001 suggests a clear relationship between brand and word choice. We constructed the
corresponding 2x2 contingency tables and report both the Chi-square and the z-statistic from the
log(odds ratio) in Tables 4 and 5.
[Insert tables 4 and 5 about here]
We see that of the big three players, Kodak is clearly the brand most associated with lay
terminology such as "low-light photography" (F2) rather than "iso" and generic words such as
"ease of use" (F1). By contrast, Canon users clearly tend towards more sophisticated terms such
as "iso settings." These findings reinforce market research from the time period which reports
Lee and Bradlow
23
that gender, age, and education-level all exhibit strong brand preferences for Kodak and Canon
(Lyra 2005). For example, older consumers and those most likely to have stopped schooling
after high school or a few years of college prefer Kodak. Canon users are typically younger and
have at least a college degree.
Processing consumer reviews also enables a third level of market structure analysis.
Because we explicitly use product reviews that separate user comments in structured Pro-Con
lists, we can immediately identify the sentiment polarity (Pro phrases convey positive polarity
and Con phrases convey negative polarity). Using the meta-attributes derived from clustering
review phrases and rolled-up to actionable specifications, we label each meta-attribute based
upon whether it includes only Pro phrases, only Con phrases, or both. In Figure 5, we construct
a polarity space by mapping brands based upon Con phrases and use supplemental points
analysis to map brands by Pro phrases onto the same dimensions. The Scree plot shows a clear
separation between F2 and F3 where average percentage explained is 12.5%. We constructed a
similar space based upon Pro phrases, but only provide the Con analysis here for space reasons.
The visualization characterizes the "Must Have" product attributes as defined by Kano's
needs hierarchy (Ulrich and Eppinger 2003). "Must Have's" define basic product requirements.
Customers criticize products for their failure on Must Have's; they do not praise products for
Must Have's because Must Have's are expected. Consistent with this definition, the figure
exhibits the expected brand differentiation based on Cons. Positive comments mapped onto the
same (F1, F2) dimensions reveal a single cluster of all brands, which share a limited number of
Pro remarks. Although attributes are not strictly "needs" in Kano's terms, we can think in terms
of those attributes that support specific needs.
Lee and Bradlow
24
Using stepwise regression, F1 defines the minimum that any digital camera is expected to
provide: picture quality, start-up time, lens cover, and shutter-lag. These attributes certainly
match our intuitive sense of Must Have's for digital cameras. By contrast, F2 defines those Must
Have attributes that distinguish specific product segments within the market space. Specifically,
F2 (size, interchangeability of lenses, and the degree of in-camera programmability e.g. menus
and options) separate the compact, prosumer, and digital SLR product categories.
The structural map by polarity is most revealing when interpreted in the context of our
two-period market structure map (Figure 4). Situated at the origin on both Pro and Con, Canon
essentially defines the space. This is consistent with the Canon's position as a leader whose
market share grew from period one to period two and is further evidenced by the period two
convergence of all groups towards the Canon brand cluster.
Market structure suggests three distinct clusters with leaders and imitators (Figure 4);
focusing on polarity helps explains differences in market share. We know that Kodak and HP
share a similar market strategy. However, HP clearly lags behind in sales; performance in the
Must Have's and the relative importance of the particular attribute dimensions helps to explain
this difference. Likewise, imitators Fuji and Casio are closer to leader Canon than Olympus and
Panasonic are to leader Sony. This again makes sense when taken into account with overall
market convergence towards the Canon, Fuji, Casio cluster in period two.
Nikon, like Canon, is effectively neutral with respect to Pros and Cons. This is consistent
with Nikon's strategy during the study period to produce excellent technology but to focus on a
different customer base than the rest. Casio, with market share numbers too low to break out by
IDC, NPD, or Hon Hai during the study period, is an outlier in the polarity maps.
Lee and Bradlow
25
DISCUSSION AND CONCLUSIONS
In this paper, we have presented a system for automatically processing the text from
online customer reviews. Phrases are parsed from the original text. The phrases are then
normalized and clustered. A novel logical assignment approach exploits the structure of Pro-Con
review summaries to further separate phrases into sub-clusters and then assigns individual words
to unique categories. Notably, our work differs from sentiment-based strategies in that we do not
rely upon complex natural language processing techniques and do not rely upon the user to first
provide a set of representative examples or keywords (Turney 2002 ; Nasukawa and Yi 2003; Hu
and Liu 2004). We conclude by discussing the managerial implications of analyzing market
structure based upon the Voice of the Consumer, limitations of our current approach, and future
work.
Managerial Implications
The automated analysis of online product reviews promises to support rather than
supplant managerial decision making both descriptively and prescriptively. First, analysis of
online reviews can confirm the manager's marketing strategy at several levels. Describing
market structure through the Voice of the Consumer reveals a brand's position relative to its
peers, highlights salient product attributes, and reveals underlying segments. First, relative brand
position can help confirm the success of a campaign's objectives. For example, consumer
perceptions reflect the similarities in market strategy between HP and Kodak during the study
period. Conversely, a brand's perception of itself may not match the consumer's, revealing a
Lee and Bradlow
26
failure in strategy. There is little evidence that Fuji or Olympus, explicitly or deliberately
engaged in a strategy of imitation.
Analysis of reviews also confirms which attributes are salient for describing consumer
perception. For brands like Nikon who were known to focus on advanced consumers but seek to
move down market (Lyra 2003), the market structure map reveals those product attributes which
differentiate position in the minds of the consumer. Although HP and Kodak pursued similar
printing and PC connection strategies, consumers equated those brands (and distinguished among
others) using attributes like menus, storage, and video.
Finally, the explicit word choices in reviews can confirm the success of targeted
marketing and segmentation strategies. Proprietary market research discovered distinct brand
preferences along gender, age, and educational levels (Lyra 2005). Demographic variables are
known to correlate with language, and we demonstrate that these same segments and brand
preferences are revealed in the Voice of the Consumer (e.g. technical wording versus layman's
speech). In this way, vocabulary may help identify unique submarkets (Urban, Johnson et al.
1984).
In addition to describing the current state of the market to confirm existing strategies,
manager's can analyze product reviews in a prescriptive manner. Conjoint Analysis, a form of
external market structure analysis (Elrod, Russell et al. 2002), is used by manager's for new
product introduction (Wittink and Cattin 1989; Michalek, Feinberg et al. 2005), optimal product
repositioning (Moore, Louviere et al. 1999) and pricing (Goldberg, Green et al. 1984), and
segmenting customers (Green and Krieger 1991). However, conjoint is dependent on the
representation of a customers utility as an agglomeration of preferences for the underlying
attribute levels that have been selected. This dependence holds regardless of either format
Lee and Bradlow
27
(choice-based, ratings-based, ranking-based, constant sum, or self-explicated) or method for

determining the profiles (Huber and Zwerina 1996; Moore, Gray-Lee et al. 1998; Toubia,
Simester et al. 2003; Evgeniou, Boussios et al. 2005).
Automated analysis of online product reviews can prescriptively assist in the design of
conjoint studies in at least three ways. For attribute generation, customer reviews discover
meaningful attributes not found by traditional means. For attribute selection, polarity analysis
helps separate Kano's must-have requirements from linear satisfiers and delighters. Finally,
analysis of reviews can help produce meaningful levels for conjoint study design.
Limitations
To generate initial clusters, the system requires phrases. Even though Epinions' customer
reviews provide lists of phrases, variations in human input are a source of noise. Moreover, we
assume that a phrase represents a single concept and that individual words represent distinct
levels of attribute dimensions. It is easily seen how this assumption creates difficulties with the
natural language in reviews (e.g. people who write lists like "digital and optical zoom").
One possible solution is to apply more sophisticated NLP techniques. Beginning with a
representative set of meaningful words, there are ways of expanding the set of words,
distinguishing between distinct attributes, and identifying relationships between attributes
(Hearst 1992; Popescu, Yates et al. 2004). However, our Pro-Con summaries are simply lists of
phrases with no associated linguistic context; Liu et al. (2005) demonstrate that techniques which
rely upon representative examples perform markedly less well in the context of Pro-Con
phrases. Our constrained optimization approach dispenses with the need for representative
examples and knowledge of grammatical rules (Lee 2005). An entirely different approach that
Lee and Bradlow
28
would preserve the unsupervised nature of our work is to attempt to identify phrases through
frequent item-set analysis. Hu and Liu (2004) demonstrate that by tuning support and
confidence thresholds, it is possible to discover whether a word pair represents one concept or
two (e.g. compact flash versus 3x zoom).
We need the ability to assess the stability of our clusters and concomitant product
features. One instance of stability is sensitivity to data sample size. Here, we relied upon a large
data set to yield the phrases from which we cluster and optimize the assignment. The large data
set is also a boon because we can liberally discard phrases to minimize the effects of nave
parsing. To measure the sensitivity to sample size, we would cross-validate on smaller sets of
review samples. We can plot the trade-off between sample size and evaluation metrics to
identify diminishing returns and attempt to estimate a minimal number of required reviews. Care
needs to be taken to ensure sufficient heterogeneity in the sample selection with respect to
different product features and the corresponding feature attributes.
While our approach is not limited to evolving product domains, our dependence on
sources that provide phrase-like strings is a limitation. At least two factors ameliorate this
limitation. First, there are other domains where phrase-like text-strings apply as opposed to
prose. Progress notes in medical records and online movie reviews (Eliashberg and Shugan
1997) are two such examples. Second, recognizing the current limitations of natural language
processing tools, more online sources are soliciting customer feedback in the form of phrases
rather than prose to facilitate automated processing (Google 2007).
Finally, the distinction between attributes, dimensions, and levels is imprecise.
Conceptually, each attribute is parameterized by one or more dimensions and each dimension is
defined by the levels (the set of values) that a particular dimension may take. Operationally,
Lee and Bradlow
29
attributes, dimensions, and levels are defined by the clustering. Attributes name the clusters
from K-means. Each sub-cluster formed from the assignment algorithm represents one attribute
dimension. The values within each sub-cluster define the dimensions levels. Knowing that the
distinction between attributes, dimensions, and levels is noisy emphasizes the role of textprocessing as a decision-aid rather than as a substitute for traditional processes.
Future Work
Beyond the existing method, there are a number of avenues ways in which we might
enrich our understanding of the Voice of the Consumer for market structure analysis. First, we
have only scratched the surface of linguistically analyzing word choice. Recognizing that word
choice can reflect distinct segments, we might attempt to simultaneously induce specific market
segments and their corresponding language using strategies such as co-clustering (Dhillon,
Mallela, et al. 2002). A complementary strategy might seek to learn relationships between
different word clusters. For example, different customers may address a concept using parallel
categories. How does 32 mb of storage compare to a customer who has 130 images? We can
apply concept clustering (Gibson, Kleinberg et al. 1998; Ganti, Gehrke et al. 1999) in
conjunction with our CLP approach to group words from parallel categories.
Structure exists not only at the market level but also at the level of individual consumers
(Elrod, Russell et al. 2002). A deeper understanding of the distinctions between segments
requires understanding not only the differences between product attributes but also the
differences in the underlying customer needs (Srivastava, Alpert et al. 1984; Allenby, Fennell et
al. 2002; Yang, Allenby et al. 2002). Unlike traditional data sources for market structure
analysis, online product reviews include comments about not only what and how, but why. In the
Lee and Bradlow
30
text of reviews, customers often relate why they purchased a product and what they use that
product for. Research mining user needs from online reviews (Lee 2004; Lee 2009) could be
incorporated into market structure analysis.
We provide an approach for descriptive modeling of market structure based upon the
Voice of the Consumer as captured in online product reviews. In this paper, we applied the
method to reviews of digital cameras representing a period of rapid evolution and compared our
analysis to proprietary market research. Analyzing additional domains including mature markets
and service goods are beyond the scope of this paper. However, our preliminary analyses of
toaster ovens and hotel reviews suggest equal promise. Moreover, traditional approaches to
market structure analysis produce models useful not only for describing existing markets but also
for predictive purposes. Exploring the integration of the Voice of the Consumer and productive
reviews into a predictive market structure analysis across multiple product domains is a great
opportunity (Allenby, Fennell et al. 2002; Elrod, Russell et al. 2002). We believe that our
research offers an important first step.
Lee and Bradlow
31
REFERENCES
Allenby, G., G. Fennell, et al. (2002). "Market Segmentation Research: Beyond Within and
Across Group Differences." Marketing Letters 13(3): 11.
Archak, N., A. Ghose, et al. (2007). Show me the money! Deriving the Pricing Power of Product
Features by Mining Consumer Reviews. ACM KDD. San Jose, CA.
Bond, J. and G. Michailidis (1997). "Interactive Correspondence Analysis in a Dynamic ObjectOriented Environment." Journal of Statistical Sofware 2(8): 1 31.
Chen, Y. and J. Xie (2004). Online Consumer Review: A New Element of Marketing
Communications Mix. SSRN, http://ssrn.com/abstract=618782
Dellarocas, C. (2003). "The Digitization of Word of Mouth: Promise and Challenges of Online
Feedback Mechanisms." Management Science 49(10): 1407-1424.
DeSarbo, W. S., D. J. Howard, et al. (1991). "Multiclus: A new method for simultaneously
performing multidimensional scaling and cluster analysis." Psychometrika 56(1): 16.
Dhillon, I. S., S. Mallela et al. (2002). "Enhanced Word Clustering for Hierarchical Text
Classification." ACM SIGKDD.
Eliashberg, J. and S. Shugan (1997). "Film Critics: Influencers or Predictors?" Journal of
Marketing 61: 68-78.
Elrod, T. (1988). "Choice Map: Inferring a Product-Market Map from Panel Data." Marketing
Science 7(1): 19.
Elrod, T. (1991). "Internal Analysis of Market Structure: Recent Developments and Future
Prospects." Marketing Letters 2(3): 14.
Elrod, T., G. J. Russell, et al. (2002). "Inferring Market Structure from Customer Response to
Competing and Complementary Products." Marketing Letters 13(3): 12.
Erdem, T. and M. P. Keane (1996). "Decision-Making Under Uncertainty: Capturing Dynamic
Choice Processes in Turbulent Consumer Good Markets." Marketing Science 15(1): 20.
Everitt, B. S. and G. Dunn (2001). Applied Multivariate Data Analysis. New York, Oxford
University Press.
Evgeniou, T., C. Boussios, et al. (2005). "Generalized Robust Conjoint Estimation." Marketing
Science 24(3): 415-429.
Feldman, R., M. Fresko, et al. (2007). Extracting Product Comparisons from Discussion Boards.
IEEE ICDM.
Lee and Bradlow
32
Feldman, R., M. Fresko, et al. (2008). Using Text Mining to Analyze User Forums. International
Conference on Service Systems and Service Management.
Ganti, V., J. Gehrke, et al. (1999). CACTUS - Clustering categorical data using summaries.
ACM SIGKDD, San Diego, CA.
Ghose, A. and P. Ipeirotis (2008). Estimating the Socio-Economic Impact of Product Reviews:
Mining Text and Reviewer Characteristics. New York, New York University.
Ghose, A., P. Ipeirotis, et al. (2006). The Dimensions of Reputation in Electronic Markets, New
York University: 32.
Gibson, D., J. Kleinberg, et al. (1998). Clustering categorical data: an approach based on
dynamical systems. 24th International Conference on Very Large Databases (VLDB).
Goldberg, S. M., P. E. Green, et al. (1984). "Conjoint Analysis of Price Preimums for Hotel
Amenities." Journal of Business 57(1): 111-132.
Google (2007). About Google Base, Google
Green, P. E. and A. M. Krieger (1991). "Segmenting Markets with Conjoint Analysis." Journal
of Marketing 55: 20-31.
Green, P. E. and V. Srinivasan (1978). "Conjoint Analysis in Cosumer Research: Issues and
Outlook." Journal of Consumer Research 5: 103-123.
Greenacre, M. J. (1992). "Correspondence Analysis in Medical Research." Statistical Methods in
Medical Research 1: 97-117.
Griffin, A. and J. R. Hauser (1993). "The Voice of the Customer." Marketing Science 12(1): 127.
Hauser, J.R. and D. Clausing (1988). "The House of Quality." Harvard Business Review 3(MayJune): 63 73.
Hearst, M. (1992). Automatic acquisition of hyponyms from large text corpora. COLING.
Nantes, France: 539-545.
Hill, S., F. Provost, et al. (2006). "Network-based Marketing: Identifying likely adopters via
consumer networks." Statistical Science 21: 21.
Hillier, F. S. and G. J. Lieberman (2001). Introduction to Operations Research, McGraw-Hill.
Hu, M. and B. Liu (2004). Mining and Summarizing Customer Reviews. KDD04. Seattle, WA.
Huber, J. and K. Zwerina (1996). "The Importance of Utility Balance in Efficient Choice
Designs." Journal of Marketing Research 33: 307-317.
Lee and Bradlow
33
Kamakura, W. A. and G. J. Russell (1989). "A Probabilistic Choice Model for Market
Segmentation and Elasticity Structure." Journal of Marketing Research 26(4): 12.
Krzanowski, W. J. and Y. T. Lai (1988). "A criterion for determing the number of groups in a
data set using sum-of-squares clustering." Biometrics 44: 12.
Lee, T. (2004). Use-centric mining of customer reviews. WITS. Washington, D.C.: 146-151.
Lee, T. (2005). Ontology Induction for Mining Experiential Knowledge from Customer
Reviews. Utah Winter Information Systems Conference. Salt Lake City, UT.
Lee, T. Y. (2009). Automatically Learning User Needs from Online Reviews for New Product
Design. AMCIS. San Francisco, CA.
Liu, B., M. Hu, et al. (2005). Opinion Observer: Analyzing and Comparing Opinions on the
Web. WWW. Chiba, Japan, ACM.
Lyra Research, Inc. (2005). Inside the Mind of the Digital Photographer: A Complete Photo
Behavior Profile by Gender, Age, Education and Income. July.
Lyra Research, Inc. (2003). The Digital Camera Market: Worldwide Trends and Forecast
through 2006. January.
Michalek, J. J., F. M. Feinberg, et al. (2005). "Linking Marketing and Engineering Product
Design Decisions via Analytical Target Cascading." Journal of Product Innovation
Management 22: 42-62.
Moore, W. L., J. Gray-Lee, et al. (1998). "A Cross-Validity Comparison of Conjoint Analysis
and Choice Models at Different Levels of Aggregation." Marketing Letters 9(2): 195-207.
Moore, W. L., J. J. Louviere, et al. (1999). "Using Conjoint Analysis to Help Design Product
Platforms." Journal of Product Innovation Management 16: 27-39.
Nagpaul, P.S. (1999). "Guide to Advanced Data Analysis using IDAMS Software." UNESCO.
http://www.unesco.org/webworld/idams/advguide/TOC.htm.
Nasukawa, T. and J. Yi (2003). Sentiment Analysis: Capturing Favorability Using Natural
Language Processing. K-CAP. Sanibel Island, Florida, ACM: 70-7.
Pant, G. and O. Sheng (2009). Avoiding the Blind Spots: Competitor Identification Using Web
Text and Linkage Structure. ICIS. Phoenix, AZ.
Popescu, A.-M., A. Yates, et al. (2004). Class extraction from the World Wide Web. AAAI
Workshop on ATEM.
Lee and Bradlow
34
Randall, T., C. Terwiesch, et al. (2007). "User Design of Customized Products." Marketing
Science 26(2): 268-280.
Salton, G. and M. McGill (1983). Introduction to modern information retrieval. New York,
McGraw-Hill.
Srivastava, R. K., M. I. Alpert, et al. (1984). "A Customer-Oriented Approach for Determining
Market Structures." Journal of Marketing 48(2): 14.
Tibshirani, R., G. Walther, et al. (2005). "Cluster validation by prediction strength." Journal of
Computational & Graphical Statistics 14(3): 18.
Toubia, O., D. I. Simester, et al. (2003). "Fast Polyhederal Adaptive Conjoint Estimation."
Marketing Science 22(3): 273-303.
Turney, P. (2002). Thumbs Up or Thumbs Down? Semantic Orientation Applied to
Unsupervised Classification of Reviews. ACL: 417-424.
Ulrich, K. and S. Eppinger (2003). Product Design and Development, McGraw-Hill.
Urban, G. L. and J. R. Hauser (1993). Design and Marketing of New Products, Prentice Hall.
Urban, G. L., P. L. Johnson, et al. (1984). "Competitive Market Structure." Marketing Science
3(2): 30.
West, P. M., C. L. Brown, et al. (1996). "Consumption Vocabulary and Preference Formation."
Journal of Consumer Research 23(2): 16.
Wittink, D. R. and P. Cattin (1989). "Commercial Use of Conjoint Analysis: An Update."
Journal of Marketing 53: 91-96.
Wittink, D. R., L. Krishnamurthi, et al. (1982). "Comparing derived importance weights across
attributes." Journal of Consumer Research 10(1): 471-74.
Worthington, P. (2003). Digital Cameras: Strategies to Reach the Late Majority. Future Image,
Inc. Market Analysis.
Yang, S., G. Allenby, et al. (2002). "Modeling Variation in Brand Preference: The Roles of
Objective Environment and Motivating Conditions." Marketing Science 21: 18.
TABLES
Lee and Bradlow
35
LCD
Red Eye
Price
Focus
Features
Software
Memory/Screen
Shutter (delay, lag)
Lens
(cap, quality, mfr)
Picture
(what, where)
Edit (in camera)
Adapter (AC)
Image (quality)
Print
(size, quality, output)
Cover
(lens, LCD, battery)
Disk
MB (memory)
Macro (lens)
Optical
(zoom, viewfinder)
Floppy
(storage media)
Support
(service)
Shoot
Slow (start-up,
turn on, recovery)
Flash (memory
card, photo)
Body (design,
construction)
USB
Feel
(mfr, construction)
Instruction
Low light
Megapixel
Battery
(life, use, type)
Photo quality
Picture quality
Movie
(audio, visual)
Menu
Resolution
Size
Control
Zoom
Table 1: Automatically generated product attributes
Auto
E(A)
DP
Mega
Biz
Cnet
E(B)
CR02
CR03
CR04
CR05
Mean
Precision
0.37
0.69
0.33
0.57
0.41
0.27
0.48
0.24
0.26
0.27
0.37
0.42
Recall
0.72
0.23
0.55
0.23
0.48
0.33
0.46
0.38
0.37
0.37
0.50
0.39
Table 2: Internal consistency: average precision and recall between one source and all others
Before 2005
After 2005
Size
ISO
Support (service)
Modes
Feel (mfr)
Accessories
Instruction
Easy to use
Edit (in camera)

White/color balance
Table 3. Changes in automatically generated attributes before and after 2005
Kodak
Canon
Sony
Kodak
Canon
5.568**(-2.335)
9.117*(2.966)
1.266(1.123)
Sony
Table 4. Comparing consumer word choice (ease of use v. navigation, menus) by brand
Chi-square(z-statistic) *p < 0.005 **p < 0.05
Kodak
Canon
Sony
Kodak
Canon
14.535*(-3.673)
1.218(1.213)
12.695*(-3.532)
Sony
Table 5. Comparing consumer word choice (low-light v. iso) by brand

Chi-square(z-statistic) *p < 0.0001
Lee and Bradlow
36
FIGURES
Figure 1: Heuristic Description of Text Processing of User-Generated Pro-Con lists
Lee and Bradlow
37
Figure 2. User assessments of familiarity vs. importance
Lee and Bradlow
38
Figure 3. Mapping the market using customer reviews
Lee and Bradlow
39
Figure 4. Evolution of market structure with period two as supplementary points
Lee and Bradlow
40
Figure 5. Market structure by Cons with Pros as supplementary points
Lee and Bradlow
41
Appendix 1. Algorithmic Details

In this Appendix, we elaborate on specific details of the text-processing algorithms.
[1] Data collection and pre-processing
After identifying the source of reviews, we wrote a program in the Python programming
language. For each review, we get the product identifier used by Epinions.com to uniquely identify a
product, the list of Pros, and the list of Cons. Product brand names are excerpted from the Epinions.com
product identifier 2 and inserted into a MySQL database. It is also important to note that one could
separate the selected reviews by pre-defined segments (i.e. demographic clusters, and hence produce
segment-level Pro-Con lists that would be analyzed distinctly), or attempt to (but is beyond the scope of
this research) to simultaneously infer latent segments and market structure simultaneously.
To construct the matrix of word vectors, we focus on each phrase. For now, we do not
distinguish between whether a phrase appears as a Pro or a Con, focusing only on grouping together those
phrases that discuss a common product attribute. As a standard preprocessing step in text mining, we
normalize words as follows. Delete all stop-words and stem the remaining text (Salton and McGill 1983).
Stop-words, like grammatical articles, conjunctions, prepositions, etc. are meaningless for purposes of
product attribute identification so they are removed. For example, after pruning, the phrase "Only 8 mb
Smart media card included" becomes "8 mb Smart media card included." Reduce words to their root
form by stemming. We use the Porter stemmer to find equivalences between singular, plural, past and
present tense forms of individual words used by customers. Thus, "includes" and included" are both
reduced to the root "includ."
[2] Vector space model and word importance

Borrowing from the information retrieval community, our phrase u word matrix is a
representation of the vector-space model (VSM). More formally, j J is a word in the set of all words; i
I is a phrase. A phrase is simply a finite sequence of words and J is a subset of the set of finite word
Lee and Bradlow
42
sequences I = {<j>| j J}. We define an initial phrase u word matrix as a simple variation on the termfrequency inverse-document-frequency (TF-IDF) VSM (Salton and McGill 1983):
Matrix(i,j) = (TFij u IPFj)
(A1.1)
where the term frequency TFij counts the total number of occurrences of word j in the instances of
phrase i. The inverse phrase frequency IPFj = log(|I|/nj) is a weighting factor for words that are more
helpful in distinguishing between different product attributes because they only appear in a fraction of the
total number of unique phrases. If |I| represents the total number of unique phrases in the review
collection, nj counts the total number of unique phrases containing word j.
A limitation of the TF-IPF weighting is that there are still some terms (e.g. sentiment words like
"great" or "good") that are neither stop words nor product attributes yet appear with product attributes in
the TF-IDF matrix. As an additional discount factor beyond IPF, we automatically gather words from a
second set of K phrases using online reviews for an unrelated product domain. Intuitively, words
appearing in the reviews for unrelated products are less likely to represent relevant product attributes for
the focal one. For example, words describing digital camera attributes are less likely to also appear in
vacuum cleaner reviews.
Formally, for a set of (I') phrases drawn from the set of finite word sequences over j J, we
calculate rank(j) = rank(TF'ijuIPF'j) where higher weighted frequencies correspond to higher rank. Note
that multiple words may share the same rank; if we define words that do not appear in any phrase as
having IPF'j = 0, then we may say:
Matrix(i,j) = TFij u rank j u IPF j IPF ' j
(A1.2)
Thus, we scale TF by the rank of the word in the unrelated product domain and scale the IPF by IPF'
Lee and Bradlow
43
[3] Phrase clustering

We cluster the vectors (the matrix rows) so that all phrases describing the same product attribute
are grouped together. More formally, given the phrase u word matrix(i,j) over the set of I phrases and the
set of words J, we seek to separate phrases into a set C of k mutually exclusive and exhaustive clusters
We use the cosine measure of angular distance between vectors to calculate similarity. The cosine
measure is then applied to the phrase u word matrix using the K-means clustering algorithm. As noted
earlier, while any number of clustering algorithms is acceptable, we selected K-means for its simplicity
and its familiarity to both the text-mining and marketing communities.
The quality, QC, of a K-means clustering, C, is calculated by the sum of the distances from each
vector in a cluster to that vector's centroid. Following (Zhao and Karypis 2002), this metric is more
simply defined as the sum of the length of the composite vectors:
QC
cosv, centroid c compositec
ci C vci
where compositeci
ci C
(1)
vci
Because K-means is known to be extremely sensitive to its initial conditions, we repeat the algorithm ten
times, beginning with a new, random set of k centers and pick the solution that maximizes QC.
[4] Attributes and their dimensions

A critical step in our approach is to discover not only what product attributes customers are
discussing but also the granularity with which those attributes are discussed. Specifically, we seek to
elicit the attribute dimensions that customers use in their reviews. To discover attributes, we assume that
each phrase corresponds to a distinct product attribute. To discover attribute dimensions, we will assume
that each word in the phrase corresponds to a distinct dimension. Discovering attributes then reduces to
the assignment of particular words to attribute dimensions.
Lee and Bradlow
44
Conceptually, we model this process as a constrained optimization problem. Abusing our

previous notation slightly, assume a set of phrases I composed from the set of words J and a set of
attribute dimensions D. We have J u D binary decision variables Xjd where Xjd is 1 if word j is
assigned to dimension d. There are I u J {0, 1} variables representing a constraint matrix where Yij is
1 or 0 depending upon whether word j appears in phrase i. Thus, our objective is to:
max X jd
s.t. i J Yij * X jd d 1
X jd
binary
The graph partitioning algorithm used to set the parameters I, J, and D and the constrained logic
program (CLP) by which we solve the optimization are implemented in Python and detailed next.
[5] Graph representations

To discover attributes, we assume that each customer review phrase corresponds to a distinct
product attribute. To discover attribute dimensions, we assume that each word in the phrase corresponds
to a distinct dimension. Discovering attribute dimensions then reduces to the assignment of particular
words to attribute dimensions. But how do we know how many dimensions there are in the assignment
problem? Is it possible that the assignment optimization has no feasible solution because of conflicting
constraints due to noise from the vagaries of human language? To solve this problem, we generate a
graph of all words in the cluster. Each word is a node and arcs are defined by the co-occurrence of two
words in the same phrase. We partition the graph into (possibly overlapping) sub-graphs by searching for
maximal cliques. Intuitively, each sub-graph represents a maximal subset of words and phrases for which
an optimal solution exists. The size of the maximal clique sets the number of attributes |D|. The subgraph (words J and phrases I) define the optimization.
More formally, we assume that phrases and words are preprocessed and normalized into words as
before. A graph G = (V,E) is a pair of the set of vertices V and the set of edges E. An edge in E is a
Lee and Bradlow
45
connection between two vertices and may be represented as a pair (vi,vj) V. Each phrase (word)
represents a vertex v in the graph; edges are defined by phrase pairs within a review (word pairs within a
phrase). An N-partite graph is a connected graph where there are no edges in any set of vertices Vi. A
clique of size N simulates a plays the role of arelational schema and can be extended to an N-partite
graph by substituting each vertice vi of the clique with a set of vertices Vi. A database table with disjoint
columns thus represents an N-partite graph where the size of the clique defines the number of columns
and each word in the clique names a column. A maximal-complete-N-partite graph is a complete-Npartite graph not contained in any other such graph; in other words, the initial clique is maximal. The
corresponding database table of phrases represents the existing product attribute space, and the maximalcomplete-N-partite graph includes possibly novel combinations of previously unpaired attributes and/or
attribute properties.
To relate the graph back to customer reviews, we say that a product attribute is constructed from k
dimensions. Each dimension names a domain (D). Each domain D is defined by a finite set of words that
includes the value NULL for review phrases where customers fail to mention one or more attribute
dimension(s). The Cartesian product of domains D1 Dk is the set of all k-tuples {t1tk | ti Di}. Each
phrase is simply one such k-tuple and the set of all phrases in the cluster simply defines a finite subset of
the Cartesian product. A relational schema is simply a mapping of attribute properties A1 Ak to domains
D1 Dk. Note the strong, implicit assumption that a maximal clique, taken over a word graph, is a proxy
for the proper number of attribute dimensions. Under this assumption, it is easy to see how searching for
cliques within the graph results in a table.
Lee and Bradlow
46
[6] Constrained Logic Programming

To align words into their corresponding attribute dimensions, we frame the task as a
mathematical assignment problem and resolve the problem using a bounds consistency approach. We
define the assignment using the maximal clique that corresponds to the schema for each product attribute
table (see Figure A1.1). In the bounds consistency approach, we invert the constraints (tok_exclusion) to
express the complementary set of candidate assignments (tok_candidates) for each attribute dimension. If
the phrase constraints, taken together, are internally consistent, then the candidate assignments
(tok_assign)for a given token are simply the intersection of all candidate assignments as defined by all
phrases in the cluster containing that token.
We transform the mutual exclusivity constraint represented by each phrase into a set of candidate
assignments using the algorithm in Figure A1.2. Note that we need only propagate the mutual exclusivity
of words that are previously unassigned. Accordingly, for each unassigned token in a given phrase, the
set of candidate assignments is the intersection of the possible assignments based upon the current phrase
and all candidate assignments from earlier phrases containing the same token. We maintain a list of
active tokens boundary_list to avoid rescanning the set of all tokens every time the possible assignments
for a given token is updated.
Finally, the K-means clustering used to separate review phrases into distinct product attributes is
a noisy process. The clustering can easily result in the inclusion of spurious phrases. Both the initial
clustering of phrases into product attributes and the subsequent assignment of words to attribute
properties are inherently imperfect. Inconsistencies may emerge for any number of reasons including:
process_phrases(p_list)
[1] schema = find_maximal_clique(p_list)
[2] order phrases by length
[3] for each phrase p:
[4]
# initialize data structures
[5]
tok_exclusion for each tok, mutually exclusive tokens
[6]
tok_candidates for each tok, valid candidate assignments
[7]
tok_assign for each tok, the dimension assignment
[8]
# propagate the constraints for each successive phrase
[9]
tok_candidates, tok_exclusion, tok_assign =
[10]
propagate_bounds(phrase, tok_candidates,
[11]
tok_exclusion, tok_assign, schema)
Figure A1.1 Logical Assignment
Lee and Bradlow
47
propagate_bounds(phrase, tok_candidates, tok_exclusion, tok_assign, schema)

[1]
[2]
[3]
[4]
[5]
[6]
[7]
[8]
# marshall prior assignments

unassigned_tok = {t|tphrase tassign_d}
unassigned_attr = {a|aschema t(tphrase atok_assign[t])}
for each t in unassigned_tok:
tok_exclusion[t] = (t u (unassigned_tok t))tok_exclusion[t]
possible_assign = {a|a(unassigned_attr tok_candidates[t])}
boundary_list = {(t,[possible_assign])} boundary_list

recurse_boundary(boundary_list, tok_exclusion, tok_assign)
Figure A1.2 Propagate boundary constraints
Poor parsing, the legitimate appearance of one word multiple times within a single phrase (e.g. the phrase
digital zoom and optical zoom duplicates the word zoom) or even inaccuracies by the human
reviewers who write the text that is being automatically processed. This could result in a single attribute
property divided over multiple table columns. For example, some reviews might write "SmartMedia" as a
single word and others might use "Smart" and "media" as two separate words. Alternatively, multiple
product attributes may appear in the same cluster. '[C]ompact flash' and 'compact camera' are clustered
together based upon their common use of the word 'compact,' yet refer to distinct attributes.
To address the problem of robustness in the face of noisy clusters that include references to additional
product attributes or have different properties for the same attributes, we extend our CLP approach to
simultaneously cluster phrases and assign words. By modeling reviews as a graph of phrases, we can
apply the same CLP in a pre-assignment step to filter a single (noisy) cluster of phrases. As alluded to in
Appendix 2.2, we generate a graph where phrases are nodes, and edges represent the co-occurrence of
two phrases within the same review. The extended CLP then prunes phrases by recursively applying cooccurrence constraints; two phrases in the same review cannot describe the same attribute just as two
words in the same phrase cannot describe the same attribute dimension. The same assignment
representation removes phrases that are not central to the product attribute at the heart of a particular
phrase cluster. Phrases that are not connected in the graphical sense of a connected component or
represent conflicting constraints are simply excluded from the subcluster.
Unfortunately, even the extended CLP approach is imperfect. Some of the tables will represent
distinct product attributes. Others will simply constitute random noise. Individual tables are supposed to
Lee and Bradlow
48
represent distinct product attributes, so we assume that meaningful tables should contain minimal word
overlap. With this in mind, we apply a two-stage statistical filter to further filter noisy clusters.
First, because each table itself separates tokens into attribute properties (columns), meaningful
tables will not hold too small a percentage of the overall number of tokens. Second, we assume that
meaningful tables comprise a (predominately) disjoint token subset. If the tokens in a table appear in no
other table, then the intra-table token frequency should match the frequency of the initial k-means cluster;
likewise, the table's tokens, when ordered by frequency, should match the relative frequency-based order
of the same tokens within the initial cluster. The first stage of our statistical filter is evaluation of a F2
statistic, comparing each table to its corresponding initial cluster. Although there is no hypothesis to be
tested per se, there is a history of applying the F2 statistic in linguistics research to compare different sets
of text with a measure that weights higher-frequency tokens with greater significance than lower
frequency tokens (Kilgarriff 2001). In our case, we set a minimum threshold on the F2 statistic to ensure
that individual tables reflect an appropriate percentage of tokens from the initial cluster.
After filtering out tables that do not satisfy the F2 threshold, we use the same cluster token counts
to calculate rank order statistics. We compare the token rank order from each constituent table to that in
the corresponding initial cluster using a modified Spearman rank correlation co-efficient (rs). As a minor
extension, we use the relative token rank, meaning that we maintain order but keep only tokens that are in
both the initial and the iterated CLP cluster(s). We select as significant those tables that maximize rs. In
the event that two or more tables maximize rs we promote all such subclusters either as a noisy cluster or
as synonymous words for the same product attribute as determined by a manual reading.
Lee and Bradlow
49
Appendix 2. Automatically generated attributes and their corresponding properties and levels
1. Picture (what/where)
tough
wildlif, waterspot, vivid, underexpos, unclear, unattract, took, tant, surprisingli, sun, streaki, srisp, sharpest, regret,
recommend, quailti, printabl, proof, outsid, opp, nois, minimum, maximum, lost, kid, immeadi, hit, guess, ful,
frame, far, excel, even, definit, daytim, counter, construct, base, adequ, absolut
vibrant
plenti, geou, galleri
consist
stong, inaccuraci, blotchi
pictur
col
underxposur
2. Flash (memory, card, photo)

1
flash type
card
slot
2
immedi
meg
bigger
us
cheap
compact
fragil
ard
3. Slow (start-up, turn on, recovery)
shutter
reaction,
slow, fast, long, no,
exposur
high,
us
speed
automat
recoveri lag,
second
wonder sluggish
turbo
1
need
virtual
shot
hard
4. Resolution
capac
resolut
5. Megapixel
mpix
megapixel
pixel
pixl
stick
megapixel
cost
lcd
low
realli
mode, set, control, , activ,

qualiti, wide, action
bit
featur
button, oper
releas
2, respons
Lee and Bradlow
top
littl
time, adjust, manual, take, rang
delai
press
effect
high
1.5
4.0
5
re
improv
6.3
4.10, 4.1, 3.4, 2.6, 2.3, 2.0, 11, 1.1, 0.8, 0
6. Lens (cap, quality, mfr)

Easily
cap
pro
leash
avail
variabl
telephoto option averag
loos
nikon
attach
famou, distanc
canon
auto
lost
no
damag
len
access
set
mega
meg
map
come
life
qualiti
leica
amaz
50
7. Optical (zoom, viewfinder)

low
blurri
16
x3
x10
smallest, power, bigger, 7, larg, lack, featur, decent, cheat,
avail, 2.8, 2.5x
8. Body (design, construction)
bodi
9. Print (size, quality, output)

print photo
hard
qualiti
produc beauti match
film
4x6
perfect copi
10. Zoom
capabl,
definit
us, nice, fast
non
littl, function, somewhat
true, realli
feel
look
built
compact
camera
viewfind
chintzi
wide
varianc
option
rection
adjust, accur, mislead,
inaccur
zoom
digit
telephoto
len
plastic
ugli
nice
metal
indo
averag
imag
high
clear, digit,
incred
loos
featur
benefit
3
11. Feel (mfr, construction)

solid,
film
thing, simpler
35
cheap, plenti
starter, get
fragil
flimsi
small
solid
fairli
optic
camera
lack, fuzzi, limit, effici

len
tricki
no
2.5x
lack, fuzzi, limit, effici
zoom
look, old
feel
want, supp, resembl, bulkier, ard
wai, finish
built
extra, faulti
2.5
useless
stink
seamless
1.6x, dist, crap, 32
12
littl
point
right
kind, somewhat
design
nice
build
us
awesom
hold
12. Menu
control,
tough
menu
rel
versu
screen
us
read
easi
manual
up
custom,
exposur
complic
preview
option
set, multipl, lot,

full, extens, lack,
no
hidden
compress
option
quirki, min
slow
sensit
scatter
function, scheme
simpl, complic,
hidden,
awkward
navig
decent
familiar
featur
minim
interfac, graphic
nice, layout, larger, imbed,

hopeless, clear, camera, maze,
ergonom, annoi
design
13. Shoot
shoot adjust
stitch slew, rapid, preset, multipl, lot, lack, numer
movi
function
option
Lee and Bradlow
control
manual, ,
cumbersom, buri,
plai, inexpens
difficult
mode
sound
amic
clip, need
infinit
fast
digit
point
look
just
petroglyph
camera
repetit
51
14. Support (service)

supp
product ribl
service tech
inadequ
need
indiffer
ganiz
quick
15. LCD
see light
sunlight
night
displai,
pretti
review
daylight
rel
custom
screen
hard
soft
panel
automat
difficult
lcd
bright
dim
outsid
expos
16. Movie (audio, visual)

control sound
set
second 30
movi
quicktim
lot
audio
vga
quality captur
capabl
unlimit
take
no
mpeg
useless
featur
17. USB
platf, comput
dock
usb, conveni
sound
archiv
cabl
serial
av
easi, weird
file
brainless, flawless
station
option
slow
quick
problem
18. Focus
focus
unimpress, twist, stripe, dull, sharper, lacklustr, innov, decept,

placement, locat, flexibl, clariti, acur, 1.5
rotat
unprotect, articul, huge, clear
littl
upload
transfer
charg
video, come,
larg
travel
download
link
auto
length
price
clip
video
length
avi
mode
60
given
us
pictur
speed
connect
mean
pc
camera
tv
fast
soft
nice
19. Software
interfac
includ, bundl, ,
download, view,
easyshar, zoombrows
comput
proprietari
fun
mac, somewhat, 5
Lee and Bradlow
benefit
us, capabl, ,
kodak, improv,
instal
clunki
complic
manag
set
imag
tv, extern, provid, bundl,
us, ted, compatibl, slightli,
twain
comput, user
special
clumsi
weak, wonder, captur,
suck, option, littl, kodak,
packag
low, instal, sai, reload
difficult
no, slideshow, easi

method, easili,
custom
friendli, averag
window
tricki, function
includ, basic, ing,
decent,
lack,
creativ
softwar
output, interface, box, wai,
pretty, extrem, particularly,
gener
possibl
retain
contrast
hookup
program
xp
52
20. Cover (lens, LCD, battery)

lcd
cover
camera
strap, built, auto, automat
integr
rubber
retract
21. Price
price place
long
easi, eas
speed autofocu
perf
fast
cost
littl
point, fair
rang
super
cheap
22. Memory/Screen
need
fairli
us
proprietari
monei
mem
screen
window
panel
screen
protect,
len
lens
afterthought
batteri
larg
travel
addit
soni
us,
load
transfer
medium, space
limit, choic
cheap
lock
flimsi
step, interfer, care

tricki
zoom
open
mechan
l.c.d
view, review,
protect
dirti
upgrad, remov
built
easi
regular
floppi
1.44mb
remov
25. Battery (life, use, type)

sure
pretti, littl, aa, tom
eat, run, 4, power, recharg, extra,
ion, suppli, built,
infolithium
take, back
life, time
displai
eas
remov, camera
bring
expens
terribl
real
system
Lee and Bradlow
close
slide
fulli
higher, nice, reason

us
time
high
startup,
sluggish
23. Floppy (storage media)

floppi access disk versatil
us
no
ag
oper
cheap
24. Disk
disc
comput
easier disk
media easi
diskett
ag
no
displai
side
up
lot, quick, nimh, 2,
pack, charger, right,
wonder
thing
price
long
low
us
includ, inexpens,
keep, drain, suck
cruddi
weigh
last
charg
preview
fast, quickli, alkalin, realli,
regular, lithium, , hour,
no, need
just
becaus
easi
lose
batteri
betteri,
batt
give
sock
liion
53
26. Size
small, broke, conveni, durabl, intuit,
mechan, , near, big, problem, pleas,
case, bigger, smaller, larger, larg,
versatil, po, cumbersom, monei, statu,
heavi, gui, nice, no, clearer, easili, dim
non, awkward
coupl
true, nearli, jean
expect
crack
design, tad, weight, slightli, kinda, lot,

frustrat, expens, icon, capabl, incred,
extrem, stick, chassi, function, mirr,
possibl, perfect, littl, right, requir, ard,
cheap, somewhat, bit, , 2.2, digit, pretti,
compar, damag, pocket, flip
a101
time
wish
unus, separ
unit, shape, moveabl
sleek, lightweight, cam, , size, displai,

quirk, button, turn, col, complic, ton,
fairli, pro, ergonom, solid, bulki, ergo,
rug, viewfind, basic, highli, carri,
pressur
stuf
finepix
drop
sftwr
preset
27. Photo quality

photo
loos
unfocuss wash
min
trait
touch, suburb, white, tint, suction, stun, stitch, soft, share, scape, rupt, retouch, result, realist, qualliti, pretti, plan,
noisi, move, manag, lost, length, ing, geou, fuzzi, ful, file, fabul, eras, ard, amic, alter, actual, accur, 250, floppi,
downtim
unexpect
0
28. Low light
auto
lowlight
dim
light
object
us
lamp
aid
adjust
littl
touchi
hunt, finicki
blind
hard, long,
inabl, difficult
no, problem
focu
set
low
certain
condit
assist
difficult, option
mix
auto, lot
judg
annoi,
awkward,
inadequ, laser
bit
situat
question, night
harder
fast, take, averag, assist,
slightli, rang, system,
level
room
condit, troubl, iffi,

motion, wai, lag, need,
set, rel, , occasion
abl
capabl
graini
accur
especi, limit, le, ks, us,
sensit, focu, dark, dismal,
fare, cost, nois, weak,
finicki
imperfect
29. Control
control
dist, underexposur,
overexposur
stigma
frame, refund, size
slightli
obviou, parallax
solid, ergonom, unintuit, tini, simpler, sensit, place, pad, odd, ing, finicki, familiar, dummi, col,
clear, basic, analog, clumsi, individu
casio
slight
full
guess, time, big
30. Macro (lens)

macro design, zero, unimpress, built, averag, lack, awesom
no, amaz, fussi,
super, great.fantast, po
nice
incred
closeup, unsurpass
Lee and Bradlow
mode
real, function
stupend
possibil, abil
featur
54
31. MB (memory)
mb
media, pictur
flash
no
compactflash
come, skimpi
need
32. Edit (in camera)

edit softwar capabl
fec onboard imag
33. Red eye

problem flash
built
massiv
pic
occasion
anti
lot
show
littl
avail
8
lowli
4, 32
2, 7, 11, provid
128
size
smart
qualiti
16
stick, take
wai
measli
card
mem
flashcard
onboard, intern, run, wimpi, pricei
smartcard
recommend
packag
limit
no
averag
ey
indo
po
34. Shutter (delay, lag)

shutter
lag
us
speed
recoveri sluggish
wonder reaction,
exposur
virtual
1
shot
includ
small
usb
red
low
yellow
catch
hue
background
look,
light
tint
unbear, frequent, easier, deflect, appear, elimin, control
turbo
automat
second
slow, fast, long, no,
high,
need
hard
button, oper
bit
featur
mode, set, control, activ, qualiti, wide,
action
releas
2, respons
35. Features
featur manual, newer, g2, extend, , switch, chang, readili
place practic, neat, cool, extra
point wide, incred, ard, access
small, larg, huge, sell, limit, impress, basic
pro
long, competit
delai
top
littl
time, adjust, manual, take,
rang
auto
press
lot, addit, 6, need, avail, expert, document, lack

rang, arrai
easi, ton, hard, difficult
set, want, dabl,
semi, level
list
36. Instruction
instruct clear
uniqu
rom
lot
lose
lack
answer
37. Adapter (AC)
separ ac
purchas
need Level 1: includ
50
charger
usb
Lee and Bradlow
no
option
avail
power
come
adapt
extern
case
55
38. Picture quality

extremely, switch
surprisingly, printable, mediocre, amazingly
print
zoom
durable
lens
poor
picture
image
design
distance
quality
cap
39. Image (quality)

inconsist
webcam, super, profession, margin, ard, over, ok, nikon, mediocr, lcd, indiffer, wonder, unaccept, satisfact, ribl,
problem, overprocess, nice, mean, fair, expect, class, aw, averag, astound, astonish, accept, medium, ing, adic,
horribl, gener, control, case, bef
incred
col
produc
detail
Lee and Bradlow
imag
qualiti
sharp
need
colour
56
Appendix 3. User survey

In this Appendix, we list the digital camera product attributes (with duplicates eliminated) that are
found exclusively in one or more online buying guides (Expert Only), learned automatically from the
product reviews (VOC Only), or in both (Expert + VOC). The means for Familiarity and Importance as
collected from our survey are reported on a 1 to 7 scale. To help align the attribute names used here with
those in Table 2, we include a mapping from the automatically derived attribute clusters (auto) to those 55
attributes used in the consumer survey. Note that in some cases, an automatically derived attribute is
mapped to more than one survey (expert) attribute name and vice versa due to inconsistencies between the
granularity with which an attribute is discussed in the expert guides and/or by the Voice of the Consumer.
Expert Only
Survey attribute
Familiarity
Importance
battery source
6.42
6.05
flash ext
4.91
3.09
flash range
4.17
3.90
image compress
4.80
4.62
image sensor
2.37
3.25
image stab
4.88
5.28
man light sens
3.69
3.69
manual exp
2.69
3.10
manual light meter
2.38
2.86
manual shut
3.09
3.24
mem qty built-in
5.68
5.05
movie fps
4.43
4.35
movie output
3.90
4.20
music play
4.57
3.80
num sensors
2.25
3.18
power adapt
6.05
4.87
time lapse
4.45
3.76
wide angle
3.57
3.57
Lee and Bradlow
57
VOC Only
Survey attribute
Auto
Familiarity
Importance
camera size
size
6.35
5.87
body (design)
body
5.82
5.48
download time
USB
5.68
4.75
feel (durability)
feel; support service
5.30
5.43
instructions
instruction
6.07
4.18
lcd brightness
screen
4.62
4.57
shutter lag
slow; shutter
3.87
3.80
twist lcd
cover
3.80
3.62
Survey attribute
Auto
Familiarity
Importance
battery life
battery
6.50
6.30
cam soft
edit
5.32
3.57
cam type
shoot
4.65
5.06
comp cxn
USB
6.13
5.75
Expert + VOC
comp soft
software
5.77
4.60
ergonomic feel
feel
5.15
4.83
flash built-in
red-eye
6.45
6.26
flash mode
low-light; red-eye
5.48
5.12
lcd viewfinder
lcd
5.00
5.15
lens cap
cover; lens
4.92
4.25
lens type
macro
3.80
4.20
manual aper
control
2.40
2.86
manual focus
control; focus
5.14
3.72
mem capacity
mb
5.30
5.68
mem stor type
disk; floppy; flash (drive)
5.63
5.47
movie audio
movie
4.85
4.48
movie length
movie
5.73
5.47
movie res
mpeg
4.97
5.15
navigation
menu
5.93
5.40
optical viewfinder
4.50
4.33
picture quality
optical
photo qual; picture; print;
image qual
6.35
6.60
price
price
6.31
6.45
resolution
resolution; megapixel
5.35
5.56
shot modes
features
5.55
4.67
Lee and Bradlow
58
shutter delay
slow; shutter
4.27
3.95
shutter speed
control
4.81
4.60
white bal
features
3.40
3.74
zoom dig
zoom
5.07
4.47
zoom opt
zoom; optical
5.52
5.21
Lee and Bradlow
59
Appendix 4. Interpreting the Correspondence Analysis dimensions.

Correspondence Analysis is an approach to dimension reduction when analyzing highdimensional, two-mode, two-way count data (Everitt and Dunn 2001). To interpret the dimensions in the
reduced space, we regressed each brands factor scores on the derived attributes. For each figure in the
paper, we report the results of the stepwise regression on F1 and F2. The reported results assume a
probability for entry of 0.05 and a probability of removal of 0.1.
To make the dimensions both interpretable and actionable, we follow the marketing literature in
relating customer needs, as represented by user generated reviews, to actionable manufacturer
specifications, as in the "House of Quality" (Hauser and Clausing 1988). Specifically, product attributes
elicited from online reviews include different granularities ("zoom" versus "10x optical zoom") or reflect
different levels of technical sophistication ("low-light settings" vs. "iso settings"). Consulting the set of
professional buying guides, we manually mapped our 39 automatically generated attributes onto a coarser
but actionable categorization of specifications. The visualizations are generated and interpreted using
these meta-attributes.
lens cap/cover
feel durability
start-up time
shutter-lag
shot-delay
resolution
price
pc connect
flash bult - in
camera type
manual focus
movie/video
navigation
storage type
white balance
lens
zoom
weight
size
lcd
rechargeable battery
image stabil
software
low-light/light control
autofocus
image resolution/print qual

manual controls
picture types
service/warranty
Table A4.1 Actionable meta-attributes for regressing explaining CA dimensions
Figure 3. Mapping the market using customer reviews

Summary of attribute selection for dimension F1.
No. of
variables
Attribute
1
navigation
2
storage type
3
movie/video
MSE
0.000
0.000
0.000
R
0.897
0.992
1.000
Adjusted
R
0.882
0.989
0.999
Lee and Bradlow
60
No. of
variables
Attribute
1
low light
2
lens
3
price
MSE
0.000
0.000
0.000
R
0.913
0.988
1.000
Adjusted
R
0.900
0.983
0.999
Figure 5. Market structure by Cons with Pros as supplementary points

No. of
variables
1
2
3
4
5
Attribute
image resolution/print qual
start-up time
lens/cap cover
shutter-lag
service-warranty
MSE
0.000
0.000
0.000
0.000
0.000
R
0.929
0.972
0.989
0.996
1.000
Adjusted
R
0.918
0.963
0.982
0.992
0.999

No. of
variables
Attribute
1
lens
2
navigation
3
size
MSE
0.001
0.000
0.000
R
0.689
0.933
0.973
Adjusted
R
0.644
0.911
0.956
ENDNOTES
1
Each subject also completed a six-item digital camera quiz after providing their self-
assessment. The correlation was high (r = 0.4) and hence we use the self-assessment score in
subsequent analysis.
2
Interested readers may contact the corresponding author for the source code for all of the
algorithms described within this research.
Lee and Bradlow
61

Automated Marketing Research Using Online Customer Reviews

Uploaded by

Document Information

Original Description:

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Automated Marketing Research Using Online Customer Reviews

Uploaded by

Copyright:

Available Formats

Automated Marketing Research Using

Online Customer Reviews

Electronic copy available at: http://ssrn.com/abstract=1726055

Automated Marketing Research Using

Lee and Bradlow

Electronic copy available at: http://ssrn.com/abstract=1726055

Lee and Bradlow

Lee and Bradlow

Lee and Bradlow

Lee and Bradlow

Enables more detailed analysis by exploiting the natural segregation of comments by

Lee and Bradlow

Our goal is to group all

Lee and Bradlow

Lee and Bradlow

ANALYSIS AND EVALUATION

Lee and Bradlow

Lee and Bradlow

Generating Attributes from Product Reviews

Lee and Bradlow

However, exact matching is an aggressive standard in comparing hierarchies. Optical

Lee and Bradlow

Importance of Attributes from Product Reviews

Lee and Bradlow

Lee and Bradlow

Lee and Bradlow

Lee and Bradlow

Selecting Attributes and Visualizing Market Structure Using Product Reviews

Lee and Bradlow

Lee and Bradlow

Regressing the CA coordinates (F1 and F2) on the meta-attributes defines F1 as a

Lee and Bradlow

Lee and Bradlow

Lee and Bradlow

[Insert Figure 4 about here]

Lee and Bradlow

Lee and Bradlow

Lee and Bradlow

Lee and Bradlow

DISCUSSION AND CONCLUSIONS

Lee and Bradlow

Lee and Bradlow

(choice-based, ratings-based, ranking-based, constant sum, or self-explicated) or method for

Lee and Bradlow

Lee and Bradlow

Lee and Bradlow

Lee and Bradlow

Lee and Bradlow

Lee and Bradlow

Shutter (delay, lag)

Table 1: Automatically generated product attributes

Edit (in camera)

Table 3. Changes in automatically generated attributes before and after 2005

Table 5. Comparing consumer word choice (low-light v. iso) by brand

Lee and Bradlow

Lee and Bradlow

Figure 2. User assessments of familiarity vs. importance

Lee and Bradlow

Figure 3. Mapping the market using customer reviews

Lee and Bradlow

Figure 4. Evolution of market structure with period two as supplementary points

Lee and Bradlow

Figure 5. Market structure by Cons with Pros as supplementary points

Matrix(i,j) = TFij u rank j u IPF j IPF ' j

cosv, centroid c compositec