

User perception of sentiment-integrated critiquing in recommender systems


Li Chen a,∗, Dongning Yan a,b, Feng Wang a
a Department of Computer Science, Hong Kong Baptist University, Hong Kong, China
b School of Mechanical Engineering, Shandong University, Shandong, China

Keywords: Critiquing-based recommender systems, Product reviews, Feature-based sentiment analysis, E-commerce, User evaluation

Abstract

Critiquing in recommender systems has been accepted as an effective feedback mechanism that allows users to incrementally refine their preferences for product attributes, especially in complex decision environments and high-investment product domains where users' initial preferences are usually uncertain and incomplete. However, traditional critiquing methods are limited in that they are based only on static attribute values (such as a digital camera's screen size, effective pixels, and optical zoom). Considering that product reviews contain other customers' sentiments (also called opinions) expressed on specific features, in this manuscript we propose a sentiment-integrated critiquing approach for helping users formulate and refine their preferences. Through both before-after and within-subjects experiments, we find that incorporating feature sentiments into the critiquing interface can significantly improve users' product knowledge, preference certainty, decision confidence, perceived information usefulness, and purchase intention. The results can hence be constructive for enhancing current critiquing-based recommender systems.

© 2017 Elsevier Ltd. All rights reserved.

1. Introduction

In recommender systems, critiquing has been recognized as a distinct feedback mechanism to solve the popular cold-start problem in high-investment product domains (e.g., digital cameras, laptops, cars, apartments) (Chen and Pu, 2012). As users in those domains are usually new and do not initially have well-defined, fixed preferences, the critiquing system is designed to elicit users' preferences for product attributes on site and to allow them to incrementally refine those preferences by posting critiques on the recommended product (such as "I would like something cheaper" or "with higher optical zoom" if the product is a digital camera). In this way, the system is able to improve recommendations in the next interaction cycle. Thus, in such a system, the initial user preference model does not determine the accuracy of the final decision. Rather, it is the subsequent process of incremental critiquing that assists users in making more informed and confident decisions. According to prior experiments (Chen and Pu, 2006; 2007b), a number of critiquing cycles are often required for a user to finally reach her/his ideal product. Studies from the areas of decision theory and consumer behavior also show that users are likely to construct their preferences in a context-dependent and adaptive manner during the decision process (Payne et al., 1993; 1999; Tversky and Simonson, 1993), and that a typical buyer has some latent constraints and preferences that s/he may only become aware of as s/he sees more options (Pu and Faltings, 2000; 2002).

However, though critiquing has been popularly adopted in preference-based recommender systems (Chen and Pu, 2007c; 2010), knowledge-based recommender systems (Burke, 2000; Burke et al., 1997), and conversational recommender systems (McCarthy et al., 2005; Shimazu, 2002; Smyth et al., 2004), the current methods mainly rely on products' static attribute values (such as a digital camera's screen size, effective pixels, and optical zoom) to elicit users' critiques. Little work has studied whether and how other customers' reviews could be leveraged in the critiquing interface to aid the current user in constructing her/his preferences. For example, suppose a user initially does not know the meaning of "optical zoom" when she searches for a digital camera; after seeing the review "Nice 38X optical zoom lens for capturing beautiful close-ups of faraway action", she may be able to specify a preference for not only the optical zoom's static value (e.g., ">= 38X") but also its associated sentiment (e.g., "> 3" if the sentiment is in the range [1,5] from "least negative" to "very positive"). This implies that product reviews could be potentially useful for users to learn from other customers' experiences (Aciar et al., 2007; Kim and Srivastava, 2007; Wu et al., 2013), and hence possibly increase their own product knowledge and preference certainty.


∗ Corresponding author.
E-mail addresses: lichen@comp.hkbu.edu.hk (L. Chen), yandongning@sdu.edu.cn (D. Yan), fwang@comp.hkbu.edu.hk (F. Wang).

https://doi.org/10.1016/j.ijhcs.2017.09.005
Received 9 January 2017; Received in revised form 1 September 2017; Accepted 24 September 2017


Fig. 1. The preference-based organization interface, as a typical example of system-suggested critiquing method, where the recommended products, except the top candidate, are organized
into several categories with the category titles (e.g., “They are cheaper and lighter, but have fewer megapixels”) as the suggested critiques for users to consider (Chen and Pu, 2007c).

Therefore, in this article, we propose a novel critiquing method that extracts feature sentiments (i.e., opinions that other customers have expressed on specific features in their reviews) and integrates them with products' static attribute values for users to perform critiques. The user's preference model is hence built on both static values and sentiments, which the system uses to compute product utility and to return the product with the highest utility as the recommendation in each interaction cycle. In the experiment, we report results of two user studies, before-after and within-subjects, which compared our method with the traditional critiquing system (without feature sentiments) in two different experimental settings. Both studies validate the superior performance of our method in terms of improving user perceptions, such as product knowledge, preference certainty, decision confidence, perceived information usefulness, and purchase intention.

The remainder of this article is organized as follows. We first introduce related work in two branches, critiquing-based recommender systems and review-based recommender systems (Section 2). We then describe our method, i.e., sentiment-integrated critiquing, in Section 3, followed by the two experiments' setup, materials, participants, and results analysis in Sections 4 and 5. Finally, we summarize our major findings and discuss their practical implications for the research field (Sections 6 and 7).

2. Related work

2.1. Critiquing-based recommender systems

Earlier critiquing-based recommender systems mainly focused on pro-actively producing a set of critiques for users to pick from (called system-suggested critiques), as improvements on the current recommendation. For example, one typical system is FindMe (Burke et al., 1996; 1997), which allows users to critique the currently recommended apartment by selecting one of the system's pre-designed tweaks like "cheaper", "bigger", "nicer", "safer". However, because each suggested critique is fixed and limited to a single attribute (called a "unit critique" in McCarthy et al., 2005), some researchers have aimed to generate dynamic, compound critiques that are not only representative of the remaining products' properties but also operate over multiple attributes simultaneously (Reilly et al., 2004). For instance, the Dynamic Critiquing system adopts an association rule mining tool to discover critique candidates, each representing a set of matching products (e.g., "Less Optical Zoom, More Digital Zoom, and A Different Storage Type", which refers to a set of products all having less optical zoom, more digital zoom, and a different storage type, in comparison with the currently recommended camera) (McCarthy et al., 2005; Reilly et al., 2004). In another work (Chen and Pu, 2007c; 2010), the authors take into account the user's current preferences for selecting critique suggestions and present them in the Preference-based Organization (Pref-ORG) interface (see Fig. 1). Through experiments, they showed that Pref-ORG can be more effective in improving critique prediction accuracy and recommendation accuracy, as well as in saving users' decision effort (Chen and Pu, 2007c; 2010).

Another type of critiquing support, user-initiated critiquing systems, emphasizes aiding users to create critiques on their own. For example, in the Example Critiquing interface (see Fig. 2) (Chen and Pu, 2006; Pu and Chen, 2005), users can freely choose what attributes to critique and how. To be specific, along with each attribute, there are three critiquing options: "Keep" (keeping the attribute's current value), "Improve" (improving its value), and "Take any suggestion" (accepting a compromised value of this attribute so as to achieve intended improvements on more important attributes). This interface essentially stimulates users to make value tradeoffs among attributes (i.e., accepting an outcome that is undesirable in some respects but advantageous in others) (Payne et al., 1993; 1999), which enhances their decision quality. A user study empirically demonstrated its advantage of offering maximal user control and enabling users to make confident decisions (Chen and Pu, 2006).

In follow-up work (Chen and Pu, 2007b), we compared the two critiquing methods, i.e., system-suggested and user-initiated, and found that combining them in a hybrid system can best exert their respective advantages: system-suggested critiques can educate users about properties of existing products and accelerate their critiquing process if a suggested critique matches their requirement; otherwise, users can still compose critiques and define tradeoff criteria by themselves via the user-initiated critiquing support. Our prior user evaluation revealed that in this hybrid system users are more motivated to perform critiques and able to achieve a higher level of decision accuracy while consuming less cognitive effort (Chen and Pu, 2007a; 2007b).

More recently, some researchers have attempted to develop speech-based critiquing interfaces (Grasch et al., 2013), eye-based critiquing inference (Chen and Wang, 2016), or experience-based critiquing that harnesses other users' critiquing histories to guide the current user's interaction process (McCarthy et al., 2010; Xie et al., 2014), all with the purpose of reducing users' critiquing effort and improving the system's recommending efficiency.


Fig. 2. The Example Critiquing interface, as a typical example of user-initiated critiquing method, where there are three options, “Keep”, “Improve”, and “Take any suggestion”, for users
to choose for critiquing the current recommendation’s attribute values (Chen and Pu, 2006).

Limitation. However, in the above-mentioned literature, the critiques are mainly based on products' static attribute values for users to refine preferences. Little innovation has been made to incorporate other types of features, especially those associated with other customers' sentiments as embedded in their reviews, into the critiquing system. Therefore, in this work, we aim to improve our previous work on the hybrid critiquing system (Chen and Pu, 2007a; 2007b), with a focus on integrating feature sentiments into both system-suggested critiques and user-initiated critiquing support.

2.2. Review-based recommender systems

The growing popularity of social media and e-commerce sites has encouraged users to freely write reviews to describe their assessment of items (e.g., music, movies, books, electronic products). These reviews are usually free textual comments that explain whether they like or dislike an item based on their usage experiences, and why. In recommender systems, user reviews have mainly been exploited for the following purposes (Chen et al., 2015).

In collaborative filtering (CF) systems, the sentiments expressed in opinion words are aggregated to infer a reviewer's overall opinion (e.g., '1' for positive and '−1' for negative), which is called a virtual rating in Zhang et al. (2013) for augmenting user-based CF. Some researchers have attempted to directly leverage reviews' contained topics or feature sentiments in a collaborative filtering model such as matrix factorization for rating prediction (Seroussi et al., 2011; Wang et al., 2012).

In preference-based recommender systems, feature sentiments have been used to derive users' weight preferences (Chen and Wang, 2013; Liu et al., 2013) or value preferences for different product attributes (Wang et al., 2013). For instance, the authors of Chen and Wang (2013) extend the latent class regression model (LCRM) to utilize feature opinions <feature, opinion>, so as to accommodate inherent preference homogeneity among users and recover each reviewer's weight preferences. They compared this method with a standard approach based on the probabilistic regression model (PRM), and found the LCRM-based method more accurate in generating recommendations. Furthermore, they map feature opinions to static specifications in the form of <feature, opinion, specification> to derive reviewers' value preferences. For example, <"weight", 1, 200g> indicates that a reviewer is positive ('1' for positive) about the camera's weight, whose static specification value is 200g. The derived value preferences are used to strengthen the products' ranking (Wang et al., 2013).

In some papers, reviewers' feature sentiments have been used to build product profiles, such as an ontology in Aciar et al. (2007) and product cases in Dong et al. (2013a; 2013b). Taking the work of Dong et al. (2013a) as an example, each product case is first constructed based on feature sentiment and feature popularity (which refers to a feature's occurrence frequency in the product's reviews). Then, products which enjoy relatively higher sentiment improvement, as well as being similar to the user's query case, are recommended. The experimental comparison with similarity-based approaches showed their method can achieve the optimal balance between query-product similarity and ratings benefit.

Limitation. However, most existing review-based recommender systems have aimed to serve repeat users who have left reviews in the system, but not new users who use the system for the first time (and hence have no historical records). For those users, it is critical to elicit their preferences for products on site. Therefore, in this paper, our objective has been to utilize feature sentiments extracted from other customers' reviews to benefit those new users when they state their preferences via critiquing.

3. Sentiment-integrated critiquing

In this section, we introduce a sentiment-integrated critiquing approach to address the above-mentioned limitations of related work. First of all, let us see how users usually interact with a critiquing-based recommender system (Fig. 3) (Chen and Pu, 2012):

• Step 1: the user is asked to first specify a reference product as the starting query, or to give some specific value preferences for the product's attributes (e.g., searching criteria for the digital camera's price, screen size, optical zoom, etc.);
• Step 2: the system then recommends a product according to the user's initial preferences;
• Step 3: at this point, the user can select the product as her/his final choice and terminate the interaction. Otherwise, if the product does not fully match the user's interests, s/he can make a critique on it.
• Step 4: once the critique is made, the system will update the user's preference model and return a new recommendation in the next interaction cycle (go back to Step 2).
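This interaction model can be summarized in a few lines of pseudo-Python. The following is only an illustrative sketch, not the authors' implementation; `recommend`, `get_user_feedback`, and `apply_critique` are hypothetical placeholders for the components described in the rest of Section 3.

```python
def critiquing_session(products, initial_prefs, recommend, get_user_feedback, apply_critique):
    """Steps 1-4 of the interaction model: recommend, let the user critique, refine, repeat."""
    prefs = list(initial_prefs)                    # Step 1: initial preferences
    while True:
        candidate = recommend(products, prefs)     # Step 2: best-matching product
        critique = get_user_feedback(candidate)    # Step 3: None means "accept this product"
        if critique is None:
            return candidate                       # final choice, terminate the interaction
        prefs = apply_critique(prefs, critique)    # Step 4: refine the preference model
```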


Fig. 3. Users’ interaction model with the critiquing-based recommender system (Chen and Pu, 2012).

Fig. 4. The procedure of extracting feature sentiments (opinion ratings) from product reviews.

As mentioned in Section 2.1, we focus on studying how to incorporate reviewers' feature sentiments into a hybrid critiquing system. Specifically, we aim to extend the hybrid critiquing interface that combines the Preference-based Organization (Pref-ORG) technique for producing system-suggested critiques and Example Critiquing (EC) for facilitating user-initiated critiquing (Chen and Pu, 2007a). In the following, we describe our methodology in detail.

3.1. Feature-based sentiment analysis

We first describe how we extract feature sentiments from product reviews (Fig. 4). Feature-based sentiment analysis (also called opinion mining) has become an established subject in the area of natural language processing (Liu, 2010). Various approaches have been developed to capture customers' opinions towards specific features mentioned in their reviews, such as statistical methods (Hu and Liu, 2004a; 2004b), machine learning methods like those based on lexicalized Hidden Markov Models (L-HMMs) (Jin et al., 2009) and Conditional Random Fields (CRFs) (Miao et al., 2010; Qi and Chen, 2010), and Latent Dirichlet Allocation (LDA) based methods that identify features directly (McAuley and Leskovec, 2013).

In our system, we adopt a popular statistical approach (Hu and Liu, 2004a; 2004b), because it is domain-independent, without the need to manually label words and train an inference model. This approach can also reach reasonable accuracy relative to model-based methods (Hu and Liu, 2004a). Concretely, given that a feature is normally expressed as a noun or a noun phrase (e.g., "display", "size", "image quality") in raw review texts, we first perform association rule mining to discover all frequent noun words/phrases¹ as feature candidates (those that exceed a certain frequency threshold, i.e., with minimum support value 1%). Those feature candidates are then mapped to the product's major attributes² (e.g., the digital camera's price, screen size, effective pixels, optical zoom, weight, image quality, video quality, ease of use). Concretely, for each attribute, we pre-defined a set of seed words (see Table 1). The similarity score between a feature candidate f and an attribute a_i's seed words S_i is computed as:

sim(f, S_i) = \frac{1}{|S_i|} \sum_{s \in S_i} sim(f, s)    (1)

Table 1
Pre-defined set of seed words for each attribute.

Seed words for the digital camera's major attributes:
• Price: Price, value, cost, money, dollar, pay, payment, sale, deal
• Screen size: Screen, lcd, size, touchscreen, monitor
• Effective pixels: Resolution, pixel, megapixel, ccd
• Optical zoom: Zoom, range
• Weight: Weight, heavy, light
• Image quality: Image, picture, photo, raw, color, look
• Video quality: Video, film, movie
• Ease of use: Use, usage, control, setting, button, menu, option, easy, hard, instruction, hand

Seed words for the laptop's major attributes:
• Price: Price, value, cost, money, dollar, pay, payment, sale, deal
• Processor speed: cpu, speed, performance, compute, boot, time, core, processor
• RAM: Memory, load, ram
• Hard drive: Disk, capacity, io, storage, disc
• Screen size: Screen, size, lcd, monitor, inch, touch, touchscreen
• Weight: Weight, heavy, light
• Display: Display, look, resolution, graphic, screen, brightness
• Battery life: Battery, charge, life, last, power
• Portability: Portability, size, portable, thin, lightweight
• Quietness: Noise, sound, silent

¹ Through the Core-NLP package for Part-of-Speech (POS) tagging: http://nlp.stanford.edu/software/corenlp.shtml.
² For each product catalog, like digital camera, we identified several major attributes according to those that usually appear in e-commerce websites for describing the product's basic properties and for users to specify filtering criteria.


where sim(f, s) is the lexical similarity between f and a seed word s as defined in WordNet (Fellbaum, 1998). The feature candidate is then mapped to the attribute with the highest sim(f, S_i), or put in the category "others" if the maximal similarity score is not very high.

The sentiment associated with each identified feature is further extracted by looking for adjacent adjective words in a review sentence (within a 3-word distance to the feature, e.g., "vivid" in "The LCD display screen provides very vivid previews of photos or video."). The polarity value of an adjective word w is formally determined via SentiWordNet (Esuli and Sebastiani, 2006) as follows:

polarity(w) = neg(w) \cdot r_{min} + pos(w) \cdot r_{max} + obj(w) \cdot \frac{r_{min} + r_{max}}{2}    (2)

where neg(w), pos(w), and obj(w) respectively denote the three polarity scores negativity, positivity, and objectivity (with pos(w) + neg(w) + obj(w) = 1), and r_{min} and r_{max} are set to 1 and 5 respectively, restricting the value of polarity(w) to the range [1,5] (from "least negative" to "very positive"). If an odd number of negation words (e.g., "not", "don't", "no", "didn't") appears in the same review sentence, the adjective word's polarity value is reversed.

Lastly, an attribute's sentiment score is calculated by aggregating all polarity values of features that are mapped to that attribute a_i of a product p:

senti_i(p) = \frac{1}{|R(a_i, p)|} \sum_{r \in R(a_i, p)} senti_i(r)    (3)

where R(a_i, p) is the set of reviews of product p that contain opinions on attribute a_i, and senti_i(r) is computed as:

senti_i(r) = \frac{\sum_{w \in SW(a_i, r)} polarity(w)^2}{\sum_{w \in SW(a_i, r)} polarity(w)}    (4)

where SW(a_i, r) is the set of opinion words that are associated with all features mapped to attribute a_i in a review r.

Each product will then be formalized as {(a_i, speci_i, senti_i)_{1:m}, (a_j, senti_j)_{m+1:n}}, where a_i refers to an attribute that has both a static specification value speci_i (e.g., 18.0 megapixels) and a sentiment score senti_i (e.g., 4 out of 5), and a_j refers to an attribute that only has a sentiment score (e.g., "image quality", "video quality", "ease of use").
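To make the aggregation in Eqs. (2)-(4) concrete, here is a minimal Python sketch. It assumes the SentiWordNet-style scores pos(w), neg(w), obj(w) and the per-review opinion words have already been obtained; the data structures and the "reverse = mirror around the scale midpoint" reading of the negation rule are assumptions for illustration, not the authors' implementation.

```python
R_MIN, R_MAX = 1, 5  # polarity range, from "least negative" to "very positive"

def polarity(pos, neg, obj, negated=False):
    """Eq. (2): map SentiWordNet-style scores to a polarity in [1, 5].
    If an odd number of negation words co-occurs, the value is reversed
    (here read as mirroring around the midpoint of the scale)."""
    p = neg * R_MIN + pos * R_MAX + obj * (R_MIN + R_MAX) / 2
    return (R_MIN + R_MAX) - p if negated else p

def review_sentiment(opinion_words):
    """Eq. (4): aggregate the opinion words tied to one attribute in one review.
    `opinion_words` is a list of (pos, neg, obj, negated) tuples."""
    pols = [polarity(*w) for w in opinion_words]
    return sum(p * p for p in pols) / sum(pols)

def product_sentiment(reviews):
    """Eq. (3): average the per-review sentiment scores for one attribute of one product.
    `reviews` is a list of opinion-word lists, one per review mentioning the attribute."""
    scores = [review_sentiment(r) for r in reviews if r]
    return sum(scores) / len(scores)
```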

3.2. Sentiment-integrated user preference model

To model a user's product preferences for those attributes (a_i and a_j), we revise the traditional weighted additive form of value functions (Payne et al., 1993) in the following way:

Utility_u(p) = \sum_{i=1}^{m} W_i \cdot [\alpha \cdot V(speci_i(p)) + (1 - \alpha) \cdot V(senti_i(p))] + \sum_{j=m+1}^{n} W_j \cdot V(senti_j(p))    (5)

where Utility_u(p) represents the utility of a product p in terms of its matching degree to the user's preferences, V(speci_i(p)) denotes the user's preference for attribute a_i's specification value speci_i, V(senti_i(p)) gives her/his preference for the attribute's sentiment score senti_i, W_i is the attribute's importance (set to 3 out of 5 by default), and \alpha indicates the relative importance between an attribute's static specification speci_i and its sentiment senti_i for the user (set to 0.5 by default). Theoretically, this model is grounded in Multi-Attribute Utility Theory (MAUT) (Keeney and Raiffa, 1993), which explicitly accommodates the tradeoff relationship between attributes via weight parameters.

In our system, each user is first asked to specify her/his initial preferences in the form of (attribute, condition, pref_value, weight). For example, (effective pixels, ≥, 16, 4) indicates that the user prefers a digital camera with effective pixels greater than or equal to 16 megapixels and that this attribute's weight is 4. Default functions are assigned to attributes for which the user does not state any actual preferences (e.g., for price, "the cheaper, the better"; for the camera's screen size, "the larger, the better"; for sentiment scores, "the higher, the better"). The following equations give how the value function V() in Eq. (5) is defined based on the user's initial preferences:

V_{num}(x) = \begin{cases} 1.0 & \text{if } (condition = \text{"}\le\text{"} \text{ and } x \le pref\_value) \text{ or } (condition = \text{"}\ge\text{"} \text{ and } x \ge pref\_value) \\ 1.0 - \frac{|x - pref\_value|}{\max(a) - \min(a)} & \text{if } (condition = \text{"}\le\text{"} \text{ and } x > pref\_value) \text{ or } (condition = \text{"}\ge\text{"} \text{ and } x < pref\_value) \\ \frac{x - \min(a)}{\max(a) - \min(a)} & \text{if } condition = \text{"any"} \text{ and } a \text{ is "the higher, the better"} \\ \frac{\max(a) - x}{\max(a) - \min(a)} & \text{if } condition = \text{"any"} \text{ and } a \text{ is "the lower, the better"} \end{cases}    (6)

V_{cag}(x) = \begin{cases} 1.0 & \text{if } condition = \text{"="} \text{ and } x = pref\_value \\ 0 & \text{otherwise} \end{cases}    (7)

where V_{num}(x) is for numerical attributes (e.g., price, screen size, effective pixels), for which max(a) and min(a) give the maximal and minimal values of attribute a over all products respectively, and V_{cag}(x) is for categorical attributes (e.g., manufacturer). Products are then ranked by their utilities as computed via Eq. (5), and the top k products with the highest utilities are retrieved.

Fig. 5. The screenshot of our sentiment-integrated hybrid critiquing interface (Senti-CBRS), which is composed of both user-initiated critiquing support (left lower part) and system-
suggested critiques (right hand).
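As a concrete illustration of the preference model of Section 3.2 (Eqs. (5)-(7)), the sketch below scores one product against a user's stated preferences. The default weight of 3 and α = 0.5 follow the text; the data layout (dictionaries keyed by attribute, with value scores already mapped through V()) is an assumption for illustration, not the authors' implementation.

```python
def v_num(x, condition, pref_value, a_min, a_max, higher_better=True):
    """Eq. (6): value function for a numerical attribute."""
    if condition == "<=":
        return 1.0 if x <= pref_value else 1.0 - abs(x - pref_value) / (a_max - a_min)
    if condition == ">=":
        return 1.0 if x >= pref_value else 1.0 - abs(x - pref_value) / (a_max - a_min)
    # condition == "any": fall back to the attribute's default direction
    return (x - a_min) / (a_max - a_min) if higher_better else (a_max - x) / (a_max - a_min)

def v_cag(x, condition, pref_value):
    """Eq. (7): value function for a categorical attribute."""
    return 1.0 if condition == "=" and x == pref_value else 0.0

def utility(scores, weights, alpha=0.5):
    """Eq. (5): weighted additive utility. `scores[a]` = (V(speci_a) or None, V(senti_a)),
    i.e. attribute values already mapped through Eq. (6)/(7); `weights[a]` defaults to 3."""
    total = 0.0
    for a, (v_spec, v_senti) in scores.items():
        w = weights.get(a, 3)
        if v_spec is None:            # attribute that only has a sentiment score
            total += w * v_senti
        else:                         # attribute with both specification and sentiment
            total += w * (alpha * v_spec + (1 - alpha) * v_senti)
    return total
```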

3.3. Sentiment-integrated hybrid critiquing

In our critiquing interface (see Fig. 5), the top-ranked product among those top k candidates is returned as the current recommendation. The user-initiated critiquing support is provided below this product for the user to create critiques by her/himself, and a set of system-suggested critiques is presented right next to the product for the user to select.

3.3.1. User-initiated critiquing

In the user-initiated critiquing panel (see Fig. 5, left lower part), attributes that have both static specification values and sentiment scores are shown first, for which the user can choose "Improve" to improve either the static value (e.g., improving the current effective pixels from 16 to ">= 17"), the sentiment score (e.g., improving it from 3.7 to ">= 4"), or both. The user can also compromise the specification value or sentiment score by choosing "Take any", which allows her/him to indicate a tradeoff preference between this attribute and others. Moreover, in order to allow users to learn about an attribute's static specification (if they are unfamiliar with it) from other customers' reviews, we make each sentiment score clickable, leading users to the original review sentences that contain that attribute. In this way, they may get to know why other customers expressed positive or negative opinions about that attribute.

Attributes that only have a sentiment score (e.g., the camera's image quality, video quality, ease of use) can also be critiqued, again with three options: "Keep", "Improve", and "Take any". For example, if the user is not satisfied with a low sentiment score (e.g., 2) on "image quality" of the current recommendation, s/he can select "Improve" to specify an improvement criterion (e.g., >= 4).

After the user makes critiques, the system refines her/his preference model (see Section 3.4) and returns a new recommendation that obtains the highest matching utility (by Eq. (5)).

3.3.2. System-suggested critiques

System-suggested critiques are generated by discovering representative tradeoff properties within the remaining k-1 products (of the top k products retrieved in the previous step; see Section 3.2). Specifically, we first convert each product p' into a tradeoff vector by comparing it with the current recommendation p: {(a_i, tradeoff_i)}, where tradeoff_i is either improved (↑) or compromised (↓), indicating that the attribute a_i's specification value or sentiment score is better or worse than that of the current recommendation. The equations below give how the tradeoff vector is formally defined.

If p's sentiment on a_i is negative (i.e., senti_i(p) ≤ 3):

tradeoff_i(p', p) = \begin{cases} \uparrow_{vo} & \text{if } 1 \le i \le m, V(speci_i(p')) \ge V(speci_i(p)) \text{ and } senti_i(p') > senti_i(p) \\ \uparrow_{o} & \text{if } m < i \le n \text{ and } senti_i(p') > senti_i(p) \\ \downarrow_{v} & \text{if } 1 \le i \le m \text{ and } V(speci_i(p')) < V(speci_i(p)) \end{cases}    (8)

Otherwise, if the sentiment is positive (senti_i(p) > 3):

tradeoff_i(p', p) = \begin{cases} \uparrow_{v} & \text{if } 1 \le i \le m \text{ and } V(speci_i(p')) > V(speci_i(p)) \\ \downarrow_{v} & \text{if } 1 \le i \le m \text{ and } V(speci_i(p')) < V(speci_i(p)) \end{cases}    (9)

It can be seen that when p's sentiment on attribute a_i is negative (≤ 3), our main focus is on judging whether the compared product p' has better sentiment (↑_o) on a_i or not, while if the sentiment is positive (> 3), we emphasize disclosing a better specification value (↑_v) of a_i.

Subsequently, the Apriori algorithm (a popular association rule mining tool for retrieving frequent patterns (Agrawal et al., 1993)) is performed over all the k-1 products' tradeoff vectors, in order to discover frequently occurring subsets of (a_i, tradeoff_i) pairs. Each subset hence represents a critique candidate (also called a category), e.g., "{(screen_size, ↑_v), (optical_zoom, ↑_o), (effective_pixels, ↓_v)}", representing a group of products that all have a bigger screen size and better sentiment on optical zoom, but lower effective pixels, than the current recommendation.


Apriori will likely return a large number of critique candidates, since a product might belong to multiple categories if it shares different tradeoff properties with different groups of products, so we need to select the most prominent critiques. Similar to our previous work (Chen and Pu, 2007c; 2010), we favor critiques that can potentially match the user's inherent needs, for which the critiques with higher tradeoff benefits, as well as being diverse among each other, are selected:

F(C) = TradeoffBenefit(C) \times Diversity(C, SC)    (10)

The TradeoffBenefit is formally calculated as:

TradeoffBenefit(C) = \left( \sum_{i=1}^{|C|} W_i \times tradeoff_i \right) \times \left( \frac{1}{|SR(C)|} \sum_{p \in SR(C)} Utility_u(p) \right)    (11)

where C denotes the currently considered critique candidate as represented by a set of (a_i, tradeoff_i) pairs, W_i is the attribute a_i's weight, SR(C) denotes the set of products that satisfy C, and Utility_u(p) is a product's utility (see Eq. (5)). The tradeoff value tradeoff_i is set by default to 0.75 if improved (↑), or 0.25 if compromised (↓).

The Diversity degree of C is defined in terms of both the critique itself and the set of products that satisfy it:

Diversity(C, SC) = \min_{C_i \in SC} \left( \left(1 - \frac{|C \cap C_i|}{|C|}\right) \times \left(1 - \frac{|SR(C) \cap SR(C_i)|}{|SR(C)|}\right) \right)    (12)

where SC includes the critiques selected so far.

Therefore, the first selected critique is the one with the highest tradeoff benefit, and each subsequent critique is selected if it has the highest score F(C) (via Eq. (10)) among the remaining non-selected candidates. This selection process ends when the desired N critiques have been selected. Fig. 5 (right hand) shows an example of the resulting interface, where each suggested critique is displayed as a category title, e.g., "They have better value at screen size, and better opinion at optical zoom, but worse value at effective pixels", followed by some sample products (with higher utilities) that satisfy this critique.
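The greedy selection just described can be sketched as follows; it is an illustrative reading of Eqs. (10)-(12), not the authors' implementation. `candidates` (mapping each critique candidate, here a frozenset of (attribute, direction) pairs, to the set of ids of products satisfying it) and `benefit` (a callable implementing Eq. (11)) are assumed inputs.

```python
def diversity(c, sr_c, selected):
    """Eq. (12): dissimilarity of candidate c to the already selected critiques."""
    if not selected:
        return 1.0
    return min((1 - len(c & ci) / len(c)) * (1 - len(sr_c & sr_ci) / len(sr_c))
               for ci, sr_ci in selected)

def select_critiques(candidates, benefit, n=4):
    """Eq. (10): greedily pick N critiques by TradeoffBenefit x Diversity.
    The first pick has diversity 1.0, so it is simply the highest-benefit candidate."""
    remaining = dict(candidates)          # {frozenset of (attr, dir): set of product ids}
    selected = []                         # [(critique, satisfying product ids), ...]
    while remaining and len(selected) < n:
        best = max(remaining,
                   key=lambda c: benefit(c, remaining[c]) *
                                 diversity(c, remaining[c], selected))
        selected.append((best, remaining.pop(best)))
    return [c for c, _ in selected]
```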

3.4. Preference refinement

After the user critiques a product, her/his preferences are refined by the system. Concretely, if the user creates critiques by her/himself in the user-initiated critiquing panel, her/his preference (attribute, condition, pref_value, weight) is adjusted in the following way:

\begin{cases} condition = \text{"="},\ pref\_value = p_{a_i},\ weight = 3 & \text{if critique = "Keep"} \\ condition = \text{">="},\ pref\_value = c_{a_i},\ weight = 5 & \text{if critique = "Improve" and the criterion is } "\ge c_{a_i}" \text{ for a numerical attribute} \\ condition = \text{"="},\ pref\_value = c_{a_i},\ weight = 5 & \text{if critique = "Improve" and the criterion is } "= c_{a_i}" \text{ for a categorical attribute} \\ condition = \text{"="},\ pref\_value = \text{"any"},\ weight = 1 & \text{if critique = "Take any"} \end{cases}    (13)

In the above equation, p_{a_i} denotes the specification value or sentiment score of the current recommendation p's a_i, and c_{a_i} is the new acceptable value for a_i that the user specifies in the improvement criterion. In consequence, Eq. (5) will be updated, for which the parameters W_i and \alpha are recalculated as:

W_i = \begin{cases} 5 & \text{if } attribute\_speci.weight = 5 \text{ or } attribute\_senti.weight = 5 \\ 1 & \text{else, if } attribute\_speci.weight = 1 \text{ or } attribute\_senti.weight = 1 \\ 3 & \text{otherwise} \end{cases}    (14)

\alpha = \begin{cases} 0.8 & \text{if } attribute\_speci.weight = 5 \text{ and } attribute\_senti.weight \ne 5 \\ 0.2 & \text{if } attribute\_speci.weight \ne 5 \text{ and } attribute\_senti.weight = 5 \\ 0.5 & \text{otherwise} \end{cases}    (15)

where attribute_speci and attribute_senti respectively denote an attribute's static specification and sentiment, and their weights (e.g., attribute_speci.weight) are defined in Eq. (13). It is then possible to re-rank the available products via Eq. (5) and start a new round of recommendation and critiquing (going back to the steps described in Section 3.3).

In the other case, where the user is interested in the system-suggested critiques, s/he will choose a product under one suggested critique as the new reference product (to replace the current recommendation). Again, we refine her/his preference model according to her/his selected critique. In detail, (attribute, condition, pref_value, weight) is adjusted as follows: if a better value or better opinion at a_i appears in the user-selected critique, then condition = ">", pref_value = p_{a_i}, weight = 5; otherwise, if a worse value at a_i appears, it implies the user is willing to accept a compromised value of a_i, so condition = "=", pref_value = "any", weight = 1; for an attribute that does not appear in the selected critique, condition = "=", pref_value = p_{a_i}, weight = 3. Eq. (5) is then updated accordingly, by which we retrieve a new set of products with higher matching utilities for producing the system-suggested critiques (see Section 3.3.2) for the new reference product p. During the new interaction cycle, the user can critique the product p through either self-initiated critiquing or picking a suggested critique.

As mentioned before, the above interaction process continues until the user makes the final choice.
Fig. 6. The screenshot of the original hybrid critiquing interface (CBRS), which is purely based on products’ static attribute values for users to make critiques.
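Before moving on to the user studies, here is a compact, illustrative sketch of the Section 3.4 refinement rules for a user-initiated critique, i.e., Eq. (13) together with the weight and α updates of Eqs. (14) and (15). The function and argument names are hypothetical; only the rule values follow the text above.

```python
def refine_preference(critique, current_value, new_value, numerical=True):
    """Eq. (13): update (condition, pref_value, weight) after a user-initiated critique."""
    if critique == "Keep":
        return ("=", current_value, 3)
    if critique == "Improve":
        return (">=", new_value, 5) if numerical else ("=", new_value, 5)
    if critique == "Take any":
        return ("=", "any", 1)
    raise ValueError(critique)

def combined_weight(spec_weight, senti_weight):
    """Eq. (14): attribute weight W_i from its specification/sentiment weights."""
    if spec_weight == 5 or senti_weight == 5:
        return 5
    if spec_weight == 1 or senti_weight == 1:
        return 1
    return 3

def alpha(spec_weight, senti_weight):
    """Eq. (15): relative importance of specification vs. sentiment for one attribute."""
    if spec_weight == 5 and senti_weight != 5:
        return 0.8
    if spec_weight != 5 and senti_weight == 5:
        return 0.2
    return 0.5
```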

4. User studies

We performed two experiments to empirically measure the performance of our sentiment-integrated critiquing interface (Senti-CBRS). In this section, we describe the study materials, the two experiments' setups, participants, evaluation criteria, and our hypotheses.

4.1. Materials

Both experiments took the form of comparative studies. That is, we compared Senti-CBRS with the original hybrid critiquing interface (CBRS) that is purely based on static attribute values (Chen and Pu, 2007a). The goal was to identify the actual impact that feature sentiments have on improving user perceptions of their preference construction and decision process, as well as of the recommender system's competence.

The entries to the two compared systems, Senti-CBRS and CBRS, are identical: both require users to specify their initial preferences for the product's attributes. Then, in CBRS (see Fig. 6), the product that best matches the user's preferences is displayed at the top, accompanied by a brief description including reviewers' average rating, the total number of reviews, and the static specification values of several major attributes. The area below the product is composed of three tabs: "Better Features" (active by default, for users to make self-initiated critiques), "Specifications" (linking to a list of full specifications that include more detailed information about the product), and "Reviews" (linking to the list of raw reviews). On the right hand side, four system-suggested critiques are displayed, which are generated by the standard Preference-based Organization method (Chen and Pu, 2007c; 2010). Each critique contains up to six satisfying products (with three listed by default, and the others hidden until users click "Expand").

In Senti-CBRS (see Fig. 5), the system workflow is basically the same as that of CBRS, except that sentiments extracted from product reviews are displayed in both the product description and the critiquing areas. Specifically, in the product description, every major attribute is associated with a sentiment score (opinion rating) and the number of reviews that commented on this attribute. A tool tip window is automatically shown if users hover over the number of reviews, listing the associated opinion words and the occurrence frequency of each word in raw reviews. The number of reviews is also clickable, leading users to the original review texts, where sentences containing the corresponding attribute are highlighted. In addition to associating sentiment scores with attributes that have static specification values, features that only appear in reviews (such as "ease of use") are also shown in the product description and are available for users to critique (see details in Section 3.3).

In both systems, after users save one or more products in their shopping cart, they can go to the "Checkout" page to make the final choice. For the purpose of conducting a within-subjects experiment (see Section 4.2), we implemented each system in two product catalogs: digital camera and laptop. The products' information was crawled from Amazon.com in February 2014. To ensure the reliability of inferred sentiment scores, we filtered out products with fewer than 10 reviews. As a result, the digital camera catalog consists of 346 products. Each product is described by 6 major attributes (i.e., brand, price, screen size, effective pixels, optical zoom, weight) and 54 full specifications (i.e., technical details such as sensor type, image stabilization, number of focus points). The total number of reviews posted to these products (till February 2014) is 46,434 (mean = 134.2), from which we extracted 8 features (5 corresponding to major attributes, and 3 only having sentiment scores, namely image quality, video quality, and ease of use). As for laptops, there are 303 products (with 16,047 reviews in total and on average 52.96 reviews per product). Each laptop is mainly described by 11 attributes (i.e., brand, price, processor speed, RAM, hard drive, screen size, weight, display, battery life, portability, and quietness, where the latter four only have sentiment scores) and 20 full specifications (e.g., optical drive type, color, processor count). Table 2 lists the two product datasets' descriptive statistics.

Table 2
Description of product datasets used to implement the critiquing systems.
• #Products: 346 (digital camera); 303 (laptop)
• #Reviews: 46,434 (digital camera); 16,047 (laptop)
• #Avg. reviews per product: 134.20 (digital camera); 52.96 (laptop)
• #Attributes associated with both static specification value and sentiment score: 5 for the digital camera (price, screen size, effective pixels, optical zoom, weight); 6 for the laptop (price, processor speed, RAM, hard drive, screen size, weight)
• #Attributes associated with only sentiment score: 3 for the digital camera (image quality, video quality, ease of use); 4 for the laptop (display, battery life, portability, quietness)
Fig. 7. The evaluation procedures of the two experiments: before-after and within-subjects.

4.2. Two experiments

In order to measure the changes in users' decision behavior and perceptions after the critiquing interface is incorporated with feature sentiments, we set up a before-after experiment (see the procedure in Fig. 7). It asks users to first make a choice in CBRS, and then to use Senti-CBRS to decide whether the product they just chose is truly the best, or whether they prefer another one. The two compared systems use the same product catalog (i.e., digital camera). When users come to use Senti-CBRS, the choice they made in CBRS is displayed at the top, for them to critique if they wish. Therefore, this experiment requires each participant to evaluate both systems in a fixed order, first CBRS then Senti-CBRS, to address this question: Would s/he be motivated to make a better choice in Senti-CBRS or not?

Meanwhile, we performed a more standard within-subjects experiment³, for the purpose of evaluating the two systems under equivalent circumstances. To be specific, participants still use the two systems in sequence, but the order and searching tasks differ among them (see Fig. 7). For example, some users are asked to use Senti-CBRS first to find a digital camera to "buy", and then use CBRS to find a laptop to "buy". As the two systems are used in two independent scenarios, we are able to compare users' perceptions and interaction behavior between them more fairly. Another popular experimental design, called between-group, can also be used for a similar purpose, by which all participants are evenly divided into two groups, with each group assigned only one system to evaluate (Charness et al., 2012). The rationale behind our choice of the within-subjects approach, rather than between-group⁴, is due to the following considerations: (1) It allows us to increase the number of subjects in both the CBRS and Senti-CBRS conditions (in a between-group setup, the number of subjects is reduced by half in either condition). (2) It can reduce the amount of error arising from natural variances associated with individual differences (such as age, gender, education, nationality, and e-commerce experience in our case). With a within-subjects design, those variables are exactly the same for the two compared systems, because the participants are the same and they should behave consistently. Therefore, the variability in measurements will more likely be due to differences between systems than to differences between participants (MacKenzie, 2013). (3) It allows us to measure each participant's system preference (i.e., "Which system, CBRS or Senti-CBRS, do you prefer to use for searching for products?"), since s/he is given the chance to experience both systems.

³ Strictly speaking, before-after is also a type of within-subjects design. Here we use within-subjects to refer to the particular setup in which the two evaluated systems are used in two different product catalogs and their evaluation order is altered among participants.
⁴ We leave the validation of our results by using the between-group design as our future work.

In order to compensate for the carryover effects (also called learning effects) that may occur in the within-subjects experiment, we adopted a complete counterbalancing procedure (MacKenzie, 2013; Martin, 2007). That is, we placed participants in groups and presented the systems to each group in a different order (see Fig. 7). More concretely, the manipulated factors are the systems' order (CBRS first or Senti-CBRS first) and the product catalogs' order (digital camera first or laptop first), so there are four groups (2 product catalogs × 2 evaluation orders). Because the order of system presentation is different for each group, the learning effect tends to balance out (MacKenzie, 2013; Martin, 2007), and the bias in users' post-task subjective responses can be reduced.

We developed an online procedure, which contains instructions, the evaluated interfaces, and questionnaires, for participants to carry out the experiment at their convenience. All users' clicking actions and answers were automatically recorded in a log file. Each experiment was conducted in the following four steps:

• Step 1: The participant was first debriefed on the experiment's objective and upcoming tasks. In particular, s/he was asked to compare two product finder systems and determine which one is more effective in terms of supporting her/him in making a purchase decision. Thereafter, a short questionnaire was filled out about her/his demographics and e-commerce experience. The user then started evaluating the two systems one by one according to the order assigned beforehand. In order for the user to be familiar with each evaluated system before s/he formally started using it, a demo video (lasting around 2 min) was played. S/he was also required to indicate her/his knowledge about the product (initial product knowledge level).
• Step 2:
  • Before-after experiment: In this experimental setup, the user was asked to use CBRS to find a digital camera that s/he would purchase if given the opportunity.
  • Within-subjects experiment: The user was asked to use either CBRS or Senti-CBRS (according to the order defined in her/his group) to find a digital camera or a laptop that s/he would purchase if given the opportunity.
  After the user found a product to "buy" by using the assigned system, s/he was prompted to answer a questionnaire about her/his overall opinions on the system.
• Step 3:
  • Before-after experiment: The product that the user chose in Step 2 was shown again. S/he was asked to use Senti-CBRS to decide whether this product is truly the best, or whether s/he prefers another one.
  • Within-subjects experiment: The user used the other system (e.g., CBRS if s/he just used Senti-CBRS in Step 2) to find a favorite product from a new product catalog (e.g., laptop if s/he just searched for a digital camera in Step 2). For the new type of product, s/he was also required to indicate her/his initial product knowledge level before starting.
  After this task was done, s/he gave opinions on the used system in a post-task questionnaire.
• Step 4: At the end, the user was asked to compare the two evaluated systems and indicate which one s/he prefers to use for searching for products.

4.3. Participants

We recruited participants through an internal email list, advertisements in public forums, and crowdsourcing (via Amazon Mechanical Turk). Each person who successfully completed the experiment was rewarded with an incentive (around HKD 20). In total, 351 participants took part in our experiments. We filtered out records with incomplete/invalid answers to our questionnaires or a total duration of less than 5 minutes. As a result, 229 participants' data were retained, among whom 129 (68 females) participated in the before-after experiment, and 100 (54 females) joined the within-subjects experiment (with 25 in each condition group).

Those participants are mainly from 20 to 40 years old, with different nationalities (e.g., USA, India, China, etc.), education degrees (from high school to PhD), majors (e.g., computer science, marketing, finance, business, education, etc.), and professions (e.g., student, teacher, engineer, salesman, etc.). As for their e-commerce experience, all of them had visited e-commerce sites before, and 95.6% of them had purchased products online. Table 3 lists their demographic profiles.

Table 3
Demographic profiles of participants (the number of users is in brackets).

Before-after experiment (129):
• Gender: Male (61), Female (68)
• Age: below 20 (8), 21–30 (59), 31–40 (33), above 40 (29)
• Nationality: USA (51), India (42), China (26), others (10)
• Education: High school (16), College (27), Bachelor (56), Master (26), PhD (4)
• Major: Computer science, Marketing, Finance, Engineering, Biology, Psychology, etc.
• Profession: Student, Engineer, Teacher, Manager, Homemaker, Graphic designer, etc.
• E-commerce site visits: 4.26 (1–3 times a month), st.d. = .809
• E-shopping frequency: 2.80 (a few times every 3 months), st.d. = .961

Within-subjects experiment (100):
• Gender: Male (46), Female (54)
• Age: below 20 (7), 21–30 (41), 31–40 (27), above 40 (25)
• Nationality: USA (45), India (33), China (11), others (11)
• Education: High school (7), College (26), Bachelor (47), Master (17), PhD (3)
• Major: Computer science, Business, Engineering, Education, Agriculture, Physics, etc.
• Profession: Student, Teacher, Salesman, Tradesman, Freelance writer, Accountant, etc.
• E-commerce site visits: 4.35 (1–3 times a month), st.d. = .801
• E-shopping frequency: 2.74 (a few times every 3 months), st.d. = .816

Note: The answers to "e-commerce site visits" and "e-shopping frequency" were given on a five-point Likert scale from 1 "least frequent" to 5 "very frequent".

We examined the differences in these demographic properties between participants of the before-after experiment and those of the within-subjects experiment. There are no significant differences in terms of gender (χ²(1)=.037, p=.847 by Chi-square test), age (χ²(3)=.539, p=.910), nationality (χ²(3)=3.98, p=.264), education (χ²(4)=2.725, p=.605), e-commerce site visits (t=-0.799, p=0.43, by Student's t-test), and e-shopping frequency (t=0.55, p=0.59)⁵.

⁵ For "major" and "profession", we did not measure the differences because the values are very discrete.

4.4. Evaluation criteria and hypotheses

Given that the objective of our experiments was to identify whether the sentiment-integrated critiquing interface (Senti-CBRS) could more effectively support users in formulating their preferences and improving their decision-making process, the measurement was conducted from both objective and subjective perspectives.

Objective measures include users' decision effort and actual behavior. Concretely, the decision effort was assessed by counting users' task completion time and critiquing cycles (the number of interaction cycles in which the user was involved in critiquing products). Their behavior covers the critiquing actions made on attributes' static specification values and sentiment scores per cycle, the number of times they viewed the full list of specifications, the original reviews, and the feature-specific reviews (the highlighted parts in raw review texts that comment on a specific feature) when they evaluated a product, and their final choice.

Users' perceptions were measured with six subjective constructs (see Table 4): product knowledge, preference certainty, decision confidence, system competence (perceived information usefulness, recommendation quality), system trustworthiness, and purchase intention. Most of the assessment questions are from the related literature, where they have been shown to have strong content validity (Pu et al., 2011).

Table 4
Questions to measure user perceptions of the critiquing-based recommender system.
• Product knowledge. Q1: How would you rate your knowledge about xxx?
• Preference certainty. Q2: I am now very certain about what I need in respect of each attribute. (value preference) Q3: I am now very certain about the relative importance of each attribute. (weight preference)
• Decision confidence. Q4: I am confident that the product I just chose is the best choice for me.
• System competence. Q5: This system helped me discover some useful information. (information usefulness) Q6: This system recommended some interesting products to me. (recommendation quality)
• System trustworthiness. Q7: The system can be trusted.
• Purchase intention. Q8: I would purchase the product I just chose if given the opportunity.

Note: The first question was answered on a five-point Likert scale from 1 "not familiar at all" to 5 "very familiar"; the other questions were answered from 1 "strongly disagree" to 5 "strongly agree".
Fig. 8. Users’ objective behavior in the before-after experiment.

products’ ranking considers users’ critiquing criteria for both static spec- 5.1. Before-after experiment
ification values and sentiment scores. In consequence, users would be
more inclined to trust Senti-CBRS and purchase the product that they 5.1.1. Objective measures
choose from it. As for their objective effort, we postulated that users We first compared participants’ product choices made in CBRS and
would behave actively in interacting with sentiments when they make critiques in Senti-CBRS, and even be motivated to more frequently read the original review texts to discover more relevant information by themselves. One more hypothesis, specific to the before-after experiment, was that some users would switch to another choice in Senti-CBRS that they think is better than the one they had made in CBRS. In total, we had seven hypotheses:

• Hypothesis 1: Users would be likely to perceive a higher level of product knowledge, preference certainty, and decision confidence in Senti-CBRS than in CBRS.
• Hypothesis 2: Users would perceive the product information provided in Senti-CBRS as more useful.
• Hypothesis 3: Users would perceive the recommendation quality as higher in Senti-CBRS.
• Hypothesis 4: Users would be more inclined to trust Senti-CBRS and purchase the product chosen from it.
• Hypothesis 5: Users would actively interact with feature sentiments when they make critiques in Senti-CBRS.
• Hypothesis 6: Users would be motivated to more frequently read the original review texts in Senti-CBRS.
• Hypothesis 7: In the before-after experiment, users would be likely to make better choices in Senti-CBRS.

5. Results

The software IBM SPSS Statistics V22.0 was used for data analysis. To identify whether the observed differences between the two systems (Senti-CBRS and CBRS) are statistically significant or not, we mainly ran one-way repeated measures ANOVA on the before-after experimental results, and two-way mixed ANOVA (Field, 2013) on the within-subjects results. In more detail, one-way repeated measures ANOVA enables us to take the system as the independent factor for comparing the same participants' differences between before and after they used Senti-CBRS. In the within-subjects experiment, because all participants were split into four groups according to the two factors system and product catalog, we chose two-way mixed ANOVA, which takes the system as the within-subjects factor and the product catalog as the between-subjects factor, allowing us to determine not only the systems' differences but also the interaction effect between system and product.

In Senti-CBRS, 31.8% of users (41 out of 129) preferred another choice after switching from CBRS, implying that the traditional critiquing interface that is simply based on static attribute values cannot allow them to make the best choice.

We further measured users' critiquing applications. The average number of critiquing cycles in Senti-CBRS is 1.44, vs. 2.18 in CBRS (F(1,128)=11.66, p=.001), among which the frequencies of making self-initiated critiques and choosing system-suggested critiques are respectively 0.87 and 0.57 in Senti-CBRS, and 1.06 and 1.12 in CBRS. In-depth analysis of users' critiquing actions in Senti-CBRS indicates that, in each critiquing cycle, there are 1.56 attribute critiques posted on static values, which is significantly lower than in CBRS (vs. 2.23, F(1,128)=14.86, p=0.0002), while the remaining 0.67 critiques were left on sentiment scores, accounting for 30% of their total attribute critiques in Senti-CBRS (0.67/2.24) (see Fig. 8(a)). This suggests that when attributes' sentiments were available for users to critique, some users did specify actual preferences for them, reflecting their inherent need for such information. Indeed, among all of the sentiment critiques specified on the digital camera's attributes, price was critiqued by 36.4% of users (out of the 88 participants who were involved in at least one critiquing cycle), followed by effective pixels (34.1%), optical zoom (27.3%), screen size (21.6%), and weight (19.3%), whereas the attributes that are associated with sentiments only were rarely critiqued (video quality by 0.045% and image quality by 0.01%). We additionally identified attributes whose sentiments were repeatedly critiqued by a user during her/his entire decision process. It shows that effective pixels and price appeared more often in users' repetitive critiques than the others (with appearance frequencies of 25.8% and 22.6% respectively), suggesting that these two attributes' sentiments are more likely to be of concern to users when they search for a digital camera to buy. Another interesting finding is that 70.6% of all the sentiment critiques (160) were for improving, with an average threshold of 2.988, implying that users would be inclined to improve an attribute's sentiment when its score is negative. Besides, a sentiment critique appeared together with a critique on the same attribute's static value in 64.3% of the cycles, which indicates that some users tended to critique an attribute's static value and its sentiment at the same time.
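
For readers who want to sanity-check or reproduce this kind of analysis outside SPSS, the short sketch below illustrates the two ANOVA designs described at the start of this section and the Chi-square test applied later to the preference distributions. The data file name, the long-format column names (participant, system, product, rating), and the use of the pingouin and scipy packages are illustrative assumptions on our part, not part of the original study materials.

```python
# A minimal sketch (not the authors' SPSS procedure) of the tests used in Section 5,
# assuming ratings have been exported to a long-format table with one row per
# participant x system observation.
import pandas as pd
import pingouin as pg                    # hypothetical choice of analysis package
from scipy.stats import chisquare

ratings = pd.read_csv("post_task_ratings.csv")   # hypothetical file with columns:
                                                 # participant, system, product, rating

# One-way repeated measures ANOVA (before-after experiment):
# 'system' is the within-subjects factor, since every participant rated both CBRS and Senti-CBRS.
print(pg.rm_anova(data=ratings, dv="rating", within="system", subject="participant"))

# Two-way mixed ANOVA (within-subjects experiment):
# 'system' is within-subjects, 'product' (digital camera vs. laptop) is between-subjects.
print(pg.mixed_anova(data=ratings, dv="rating", within="system",
                     between="product", subject="participant"))

# Chi-square test of the overall system preference against an equal split, e.g. the
# before-after experiment (95 of 129 users favored Senti-CBRS): chi2(1) is about 28.8,
# in line with the value reported in Section 5.1.2.
chi2, p = chisquare([95, 34])
print(round(chi2, 2), p)
```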


Fig. 9. User perceptions of CBRS and Senti-CBRS in the before-after experiment (with mean ratings and 95% confidence intervals).

Table 5
Comparison in respect of user perceptions in the before-after experiment.

Construct                                        Mean (st.d.)                          One-Way Repeated-Measures ANOVA
                                                 CBRS             Senti-CBRS           F       Hypothesis df   Error df   p

Product knowledge (Q1)
  CBRS vs. initial level                         3.41 (0.71)      vs. 3.29a (0.74)     4.23    1               128        .042
  Senti-CBRS vs. initial level                   3.51 (0.72)      vs. 3.29a (0.74)     7.76    1               128        .006
  CBRS vs. Senti-CBRS                            3.41 (0.71)      3.51 (0.72)          2.16    1               128        .144
Preference certainty (Q2: value preference)      3.66 (0.70)      3.98 (0.56)          20.76   1               128        .000
Preference certainty (Q3: weight preference)     3.73 (0.57)      3.94 (0.57)          11.07   1               128        .001
Decision confidence (Q4)                         3.94 (0.59)      4.11 (0.58)          5.84    1               128        .017
System competence (Q5: information usefulness)   3.67 (0.88)      4.12 (0.55)          23.02   1               128        .000
System competence (Q6: recommendation quality)   3.88 (0.63)      3.98 (0.60)          1.36    1               128        .245
System trustworthiness (Q7)                      3.78 (0.50)      3.91 (0.52)          4.57    1               128        .034
Purchase intention (Q8)                          3.77 (0.66)      3.98 (0.71)          8.68    1               128        .004

a Users' initial product knowledge level.

Regarding users' behavior of accessing a product's full specification list and original review texts in the two systems (see Figs. 5 and 6), we find that users were less motivated to consult full specifications in Senti-CBRS (0.22 vs. 0.47 in CBRS, F(1,128)=8.28, p=0.005), but read more raw reviews (0.58 vs. 0.45 in CBRS, F(1,128)=0.97, p=.327), for which the frequency of reading feature-specific reviews is higher than that of reading overall reviews (0.39 vs. 0.19) (see Fig. 8(b)). This observation suggests that the extraction of feature sentiments from reviews may stimulate users to look for more useful information relevant to their interests in attributes.

As for the other objective measures, users spent significantly less time in Senti-CBRS (mean=3.03 mins vs. 5.36 mins in CBRS, F(1,128)=58.73, p=.000), which might be because they had already gained a certain familiarity with the products when they used the first system, CBRS, so the time consumed in revising their preferences while using Senti-CBRS was shortened.

5.1.2. Subjective measures

We then analyzed users' answers to our post-task questions in order to know how they subjectively felt about the two systems. From Fig. 9 and Table 5, we can see users gave more positive ratings on all of the eight assessment statements w.r.t. Senti-CBRS, six of which reach a significant level. Concretely, users' product knowledge level was increased after using Senti-CBRS (Q1: 3.51), vs. 3.41 after CBRS (p=.144; see footnote 6), and both values are significantly higher than users' initial product knowledge level of 3.29 (p=.006 w.r.t. Senti-CBRS, and p=.042 w.r.t. CBRS). Their preference certainty was significantly improved in Senti-CBRS in terms of both attribute value preference (Q2: 3.98 vs. 3.66 in CBRS, p=.000) and attribute weight preference (Q3: 3.94 vs. 3.73 in CBRS, p=.001), which in turn enabled them to obtain significantly higher decision confidence (Q4: 4.11 vs. 3.94 in CBRS, p=.017). Furthermore, they indicated Senti-CBRS is significantly more competent in helping them discover useful product information (Q5: 4.12 vs. 3.67 in CBRS, p=.000), and perceived it as having higher recommendation quality (Q6: 3.98 vs. 3.88 in CBRS, p=.245). In consequence, their trust in Senti-CBRS was significantly higher than in CBRS (Q7: 3.91 vs. 3.78, p=.034), and they possessed significantly higher intention to purchase the chosen product (Q8: 3.98 vs. 3.77, p=.004).

Footnote 6: The F value of the one-way repeated measures ANOVA is given in Table 5.

Regarding their overall preference for the two compared systems, 73.6% of users (95 out of 129) favored Senti-CBRS, against 26.4% who liked CBRS (see Fig. 10), which is a significantly unequal distribution (relative to the assumption of an equal split) according to a Chi-square test (χ²(1)=28.85, p < .05).

5.1.3. User comments

From users' free comments on the two systems, we identify the reasons behind their reported preferences. In general, they thought the sentiment scores (opinion ratings) associated with attributes are helpful for them to formulate product preferences. For instance, "I am not deeply familiar with cameras, the opinion rating is easy to understand." "I can see each individual characteristic of the camera and how it was commented on by those who purchased it. This helped me to look at specific things that I considered important when buying the camera." "Ratings were set according to my own preferences." "I really liked that I could see the rating in regards to specific aspects of the item." "The features and ratings were more detailed so that I get a clearer picture of the product." "I liked the fact that I could clearly see what the rating for each feature was, versus just an overall average rating. I feel it was better at helping me to make my buying decision."


Fig. 10. Distribution of users’ self-reported preferences for the two compared systems.

Some users also explicitly mentioned the advantages of incorporating sentiments into the critiquing interface: "I think the search for better option was really nice to make the search for products." "I would prefer to pick some ratings and some specifications, e.g., for resolution in megapixels, I want to specify '> 15' megapixels. But on quality of image, I do want to select the ratings, e.g., '> 4.2'." "More search options to find the camera." "Make me confident enough to make a good decision." "It showed me some ideas for what I might be looking for when buying a digital camera." "It allows me to search other cameras of the related preference." "I can get suggestions of use feedback from other customers to choose the products." "It gave a better choice of products. It made my decision quick and the right one." "I felt like I was able to explore the options I wanted to examine more and get a better idea of their range." "The ratings are easier broken down into what people did and did not like about the camera." "It is easier to compare quality based on consumer reviews."
As to the reason why few users preferred the traditional system CBRS, we found it is mainly owing to its simplicity, as only product specification information is provided: "The description of products is simple and clear." "It was less cluttered." "It is more helpful for me to see the actual specifications of a product." "It contains enough information and needed attributes to select the best one."

5.2. Within-subjects experiment

In the within-subjects experiment, each participant went through the two systems (Senti-CBRS and CBRS) in the order defined in her/his assigned group (see Fig. 7), for searching two different types of product (digital camera and laptop) respectively. This experiment's results consolidate the before-after findings.

5.2.1. Objective measures

In this experiment, users spent significantly more time in Senti-CBRS than in CBRS (5.06 mins vs. 4.36 mins, F(1,98)=5.67, p=.019), but took fewer critiquing cycles (1.97 vs. 2.22 in CBRS, F(1,98)=1.23, p=.27), for which the interaction effect between system and product is significant (see footnote 7): F(1,98)=5.11, p=0.026.

Footnote 7: Due to space limits, we only report the significant interaction effect. As for the other objective measures in this experiment, the interaction effect between system and product is not significant.

In more depth, the average frequencies of applying self-initiated and system-suggested critiquing are respectively 0.84 and 1.13 in Senti-CBRS, and those in CBRS are respectively 1.12 and 1.1. Within each product critique in Senti-CBRS, there are on average 2.327 attributes involved, which is close to the number in CBRS (2.334). However, users' critiques on attributes' static specification values per cycle were significantly decreased in Senti-CBRS (1.60 vs. 2.33 in CBRS, F(1,98)=22.40, p=.000). In fact, the remaining 31.2% of their attribute critiques (0.73/2.327) were made on sentiment scores (see Fig. 11(a)).
We further investigated users' interaction patterns with attribute sentiments. Among the half of users (50) who used Senti-CBRS for searching for a digital camera, it shows effective pixels and price were still more frequently critiqued, respectively by 29.7% and 27% (out of the 37 participants who made at least one critique), followed by optical zoom and weight with equal proportions (18.9%). In terms of attributes that contain only sentiment scores, video quality and image quality were critiqued by some users (13.5% and 0.05%), but ease of use was not critiqued at all. This is consistent with the before-after experiment's observation. With regard to the laptop, 43 users made a critique at least once, of whom 46.5%, 27.9%, 25.6%, and 20.9% respectively critiqued the sentiments of price, weight, processor speed, and screen size. In comparison, the percentages of users who critiqued RAM and hard drive are relatively lower (both 11.6%), but those on battery life, quietness, and display, which are defined by sentiment only, are not low (25.6%, 23.3%, and 16.3%). Moreover, for the digital camera, price, weight and effective pixels were more frequently repeatedly critiqued by a user (33.3%, 20%, 20%), and for the laptop, price, weight and quietness appeared more often in users' repetitive critiques (25%, 18.9%, 18.9%). Regarding how many sentiment critiques were for the direction of improvement, all 72 sentiment critiques in the digital camera version of Senti-CBRS had this purpose, with an average threshold of 2.61; for the laptop, it is 84.8% (out of 112 total sentiment critiques), specified on an average score of 2.34. In addition, 64.5% and 78.1% of those sentiment critiques, respectively w.r.t. digital camera and laptop, were made in conjunction with critiques on the same attributes' static values, which verifies again our original postulation that the availability of feature sentiments might stimulate users to better learn how to state preferences for static values, and also to propose critiquing criteria for sentiment scores simultaneously.
We also found users less frequently viewed full specifications and overall reviews in Senti-CBRS (0.35 vs. 0.42 in CBRS, F(1,98)=0.6, p=0.44; and 0.27 vs. 0.53, F(1,98)=5.15, p=.025) (see Fig. 11(b)). 37.2% of their review-reads (0.16/0.43) were actually related to specific features. It hence strengthens the importance of extracting feature sentiments from reviews in supporting users' information seeking needs.

5.2.2. Subjective measures

Users' subjective assessments of the two systems are also consistent with the before-after experiment's findings. To be specific, Senti-CBRS obtains positively higher assessment ratings on all of the eight statements (see Table 4), seven of which are significant (see Fig. 12 and Table 6).


Fig. 11. Users’ objective behavior in the within-subjects experiment.

Fig. 12. User perceptions of CBRS and Senti-CBRS in the within-subjects experiment (with mean ratings and 95% confidence intervals).

Table 6
Comparison in respect of user perceptions in the within-subjects experiment.

Construct                                        Mean (st.d.)                          Two-Way Mixed ANOVA
                                                 CBRS             Senti-CBRS           F       Hypothesis df   Error df   p

Product knowledge (Q1)
  CBRS vs. initial level                         3.52 (0.88)      vs. 3.36a (1.00)     10.81   1               99         .001
  Senti-CBRS vs. initial level                   3.68 (0.64)      vs. 3.53a (0.88)     7.75    1               99         .006
  CBRS vs. Senti-CBRS                            3.52 (0.88)      3.68 (0.64)          4.07    1               98         .046
Preference certainty (Q2: value preference)      3.89 (0.64)      4.10 (0.49)          6.57    1               98         .012
Preference certainty (Q3: weight preference)     3.84 (0.70)      4.15 (0.55)          14.46   1               98         .000
Decision confidence (Q4)                         3.94 (0.70)      4.27 (0.60)          14.42   1               98         .000
System competence (Q5: information usefulness)   3.79 (0.81)      4.02 (0.73)          5.18    1               98         .025
System competence (Q6: recommendation quality)   4.06 (0.40)      4.22 (0.29)          5.78    1               98         .018
System trustworthiness (Q7)                      3.90 (0.60)      4.01 (0.47)          2.62    1               98         .109
Purchase intention (Q8)                          3.85 (0.73)      4.08 (0.68)          6.54    1               98         .012

a Users' initial product knowledge level.

As for their product knowledge, although it was significantly increased in both systems (Q1: CBRS: 3.52 vs. 3.36 initially, p=.001; Senti-CBRS: 3.68 vs. 3.53 initially, p=.006; see footnote 8), the level achieved by Senti-CBRS is significantly higher than that of CBRS (3.68 vs. 3.52, p=.046). In addition, users reached significantly higher preference certainty after using Senti-CBRS in terms of both attribute value and weight preferences (Q2: 4.10 vs. 3.89 in CBRS, p=.012; Q3: 4.15 vs. 3.84, p=.000). They were also significantly more confident that they made the best choice in Senti-CBRS (Q4: 4.27 vs. 3.94 in CBRS, p=.000). Regarding their perception of system competence, both "information usefulness" and "recommendation quality" were perceived significantly higher in Senti-CBRS than in CBRS (Q5: 4.02 vs. 3.79, p=.025; Q6: 4.22 vs. 4.06, p=.018). The construct about system trustworthiness is not significantly different between the two compared systems (Q7: 4.01 in Senti-CBRS vs. 3.90 in CBRS, p=.109), but users still had significantly higher purchase intention when using Senti-CBRS (Q8: 4.08 vs. 3.85, p=.012). Furthermore, the analysis of the interaction effect between the evaluated system and product type shows that, except for product knowledge, for which it is significant (F(1,98)=9.15, p=0.003), there is no interaction as to the other constructs.

Footnote 8: This p value and the previous one were calculated through one-way repeated measures ANOVA. All of the F values, including those for the following two-way mixed ANOVA tests, are given in Table 6.


Participants' responses to the final question about their overall system preference reveal that more users preferred Senti-CBRS to CBRS (61% against 39%; see Fig. 10), which is a significant distribution (χ²(1)=4.84, p < .05).

5.2.3. User comments

Examination of users' free comments identifies the strong point of Senti-CBRS: its disclosure of reviewers' opinions towards specific features allows participants to more effectively specify critiques and determine what they want. For example, "I can see the pros and cons of each product easily." "It provided opinion ratings for attributes, which made the critique on each attribute we prefer easier." "There were more preference options and it was easier to find what I wanted." "I got more preference types and options to choose. It's easy and more specific to find what I want." "Because of aspect reviews and ratings are posted in the site, in which we select appropriate product based on my mind." "Could search according to user reviews." "I felt it is more tailored to my needs." "Easily saw the pros and cons of each camera." "Liked being able to control which attribute was more important based on ratings and comments." "I liked the list format of customer reviews. I found the recommendations of comparable products much more helpful." "I liked the ratings. They helped me to know that I made a good choice. I also liked being able to specify my needs." "It has more filtering options and was more effective at finding what I needed."
In contrast, those who preferred CBRS still attributed their preference to its simple design: "It is easier to see all the available specifications at once." "It seems less overwhelming and congested than another interface." "It seems smoother, less cluttered and easier to manipulate."

5.3. Summary

As a short summary, both experiments validate the advantage of the sentiment-integrated critiquing interface (Senti-CBRS) over the traditional approach (CBRS). Users' critiques on static specification values were significantly decreased in Senti-CBRS, while their critiques on sentiment scores made up around 30% of their attribute critiques. As a matter of fact, they behaved actively in interacting with various attributes' sentiments (such as price, effective pixels and optical zoom in the digital camera catalog, and price, weight, processor speed and battery life in the laptop catalog). Most of their sentiment critiques were for improving the score, which frequently occurred when the score was negative. Moreover, they often critiqued the attribute's static value and its sentiment together, implying the role of sentiment in aiding new buyers to formulate their value preferences. The availability of sentiment also motivated users to read raw feature-specific review texts. Besides, in the before-after experiment, around 32% of users made a new choice in Senti-CBRS that they thought was better than the one found in CBRS. From the within-subjects experiment, we observe that users spent more time in reaching their final choice in Senti-CBRS, with a longer average time consumed in each critiquing cycle. These results suggest that Senti-CBRS is capable of not only increasing users' decision quality, but also reducing their critiquing cycles.
Participants' subjective reactions verify the impact of Senti-CBRS on enhancing their product knowledge, preference certainty, decision confidence, and purchase intention. In addition, they perceived Senti-CBRS as more competent in helping them discover useful product information. The inconsistencies between the before-after and within-subjects experiments occurred on users' perceived system trustworthiness and recommendation quality, but the two constructs still got more positive ratings in Senti-CBRS. In particular, in the within-subjects experiment, the recommendation quality of Senti-CBRS was perceived as significantly higher, suggesting that it has a higher chance of augmenting the system's recommendation power.
Finally, in both experiments, more users expressed favorable appraisal of Senti-CBRS, which, in combination with their free comments, confirms the merit of incorporating feature sentiments into the critiquing interface for users to make more informed and confident decisions, and implies that users will be more likely to accept such a system in practice.
Thus, all of our hypotheses claimed before (see Section 4.4) are well supported.

6. Practical implications

6.1. Critiquing interface design

Given that traditional critiquing-based recommender systems mainly exploit products' static attribute values to elicit users' critiquing feedback, in this paper, we propose a sentiment-integrated critiquing method, which particularly incorporates feature sentiments as extracted from product reviews into the process of assisting users in formulating and refining their preferences. We have actually extended our previous work on the hybrid critiquing system (Chen and Pu, 2007a) to integrate feature sentiments with static specification values into both user-initiated critiquing and system-suggested critiques.
The primary usages of feature sentiments in our interface can be summarized in the following points (a brief illustrative sketch follows the list):

1. It can facilitate users to view other customers' opinions towards a particular attribute, so that they can learn how to specify actual criteria for both its specification value and its sentiment.
2. In addition to associating sentiment scores with attributes that have static specification values (e.g., a digital camera's optical zoom, effective pixels), we extract some features from reviews that purely rely on other customers' usage experiences (e.g., image quality, video quality, ease of use). They can be helpful for new buyers to better understand a product and evaluate its quality.
3. The critiquing support allows users to not only improve an attribute's specification value and/or sentiment score, but also compromise that of less important attributes. Through making such value tradeoffs, users could be clearer about the inherent relationship between specification value and sentiment as well as that between attributes, and therefore be more certain about their preferences. Actually, in system-suggested critiques, we present critiques that highlight representative pros and cons (tradeoff properties) of the remaining products, letting users know what products exist and in what aspects they are advantageous. The user-initiated critiquing, on the other hand, aids users in specifying tradeoff criteria according to their own desires.
4. The extracted feature sentiments can act as a bridge motivating users to access the original review texts that comment on a specific feature, when they are interested in seeking more opinion information about it.
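
To make the interface mechanics listed above more concrete, the following small sketch shows one possible way to encode a critique that constrains both an attribute's static specification value and its review-derived sentiment score, and to filter the remaining products accordingly. The class names, fields, and catalog data are hypothetical illustrations, not the actual Senti-CBRS implementation.

```python
# Illustrative sketch only: one way to represent a sentiment-integrated critique
# (points 1 and 3 above), not the authors' code.
from dataclasses import dataclass
from typing import Dict, List, Optional

@dataclass
class Product:
    name: str
    specs: Dict[str, float]        # static attribute values, e.g. {"optical_zoom": 38}
    sentiments: Dict[str, float]   # feature sentiment scores, e.g. {"image_quality": 4.4}

@dataclass
class AttributeCritique:
    attribute: str
    min_spec: Optional[float] = None        # e.g. "optical zoom of at least 30x"
    min_sentiment: Optional[float] = None   # e.g. "image quality rated at least 4.2 by reviewers"

    def satisfied_by(self, p: Product) -> bool:
        spec_ok = self.min_spec is None or p.specs.get(self.attribute, float("-inf")) >= self.min_spec
        sent_ok = self.min_sentiment is None or p.sentiments.get(self.attribute, float("-inf")) >= self.min_sentiment
        return spec_ok and sent_ok

def apply_critiques(candidates: List[Product], critiques: List[AttributeCritique]) -> List[Product]:
    """Keep only the remaining products that satisfy every critique the user has posted."""
    return [p for p in candidates if all(c.satisfied_by(p) for c in critiques)]

# Usage with made-up catalog data:
catalog = [
    Product("Camera A", {"optical_zoom": 38, "price": 399}, {"image_quality": 4.4, "ease_of_use": 3.1}),
    Product("Camera B", {"optical_zoom": 20, "price": 249}, {"image_quality": 4.6, "ease_of_use": 4.2}),
]
posted = [AttributeCritique("optical_zoom", min_spec=30),
          AttributeCritique("image_quality", min_sentiment=4.2)]
print([p.name for p in apply_critiques(catalog, posted)])   # -> ['Camera A']
```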
Our user studies well demonstrate the above points. We hence believe that these design ideas could be constructive for developing more effective critiquing interfaces in recommender systems, in terms of enhancing users' perceived decision quality (product knowledge, preference certainty, and decision confidence) and satisfaction with the system's competence (perceived information usefulness and purchase intention).

6.2. Sentiment-based recommender algorithm

In this work, we utilize feature sentiments to support new users to formulate and refine their preferences. For this purpose, a multi-attribute preference model is established for each user, which accommodates her/his preferences for both static specification values and feature sentiments. In each recommendation cycle, the product that best matches the user's current preferences is recommended, along with a set of related products that are organized into system-suggested critiques. For the latter, we extend the Preference-based Organization technique to identify groups of products with representative tradeoff properties


relative to the current recommendation, and return those with maximal tradeoff benefits and diversity degrees.
The experiment results show that users perceived our system with higher recommendation quality than the traditional method that does not consider feature sentiments to model user preferences. The results can hence offer some new insights into improving existing sentiment-based recommender algorithms (that are introduced in Section 2.2); a small illustrative sketch follows the list:

1. Feature sentiments can be combined with static specification values to model users' multi-attribute preferences.
2. Users can incrementally refine their preferences via making critiques on the current recommendation. Our hybrid critiquing interface, which incorporates feature sentiments into both user-initiated critiquing support and system-suggested critiques, could well achieve this function.
3. Furthermore, system-suggested critiques can represent products with diverse tradeoff benefits for supporting users to effectively explore the product space.
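
As a rough illustration of the kind of preference model that points 1 and 2 describe, the sketch below scores each product by a weighted combination of how well its static specification values and its feature sentiment scores match the user's stated criteria. The scoring formula, weights, and normalization are our own illustrative assumptions; they are not the paper's actual algorithm, nor its extension of the Preference-based Organization technique.

```python
# Hypothetical sketch of a multi-attribute preference score that combines static
# specification values with feature sentiment scores. Not the authors' implementation.
from typing import Dict

def preference_score(specs: Dict[str, float],
                     sentiments: Dict[str, float],
                     value_prefs: Dict[str, float],      # preferred spec value per attribute
                     sentiment_prefs: Dict[str, float],  # preferred minimum sentiment per attribute
                     weights: Dict[str, float]) -> float:
    """Weighted match of one product against the user's current value and sentiment preferences."""
    score = 0.0
    for attr, weight in weights.items():
        match = 0.0
        if attr in value_prefs and attr in specs:
            target = value_prefs[attr]
            # closeness of the static value to the preferred value, clipped to [0, 1]
            match += 1.0 - min(abs(specs[attr] - target) / max(abs(target), 1e-9), 1.0)
        if attr in sentiment_prefs and attr in sentiments:
            # reward products whose aggregated review sentiment meets the preferred level
            match += min(sentiments[attr] / sentiment_prefs[attr], 1.0)
        score += weight * match
    return score

# The top-scoring product would be recommended in each cycle; the remaining products
# could then be grouped by their shared pros and cons relative to it (their tradeoff
# properties) and surfaced as system-suggested critiques.
camera = {"effective_pixels": 20.1, "price": 399}
camera_sentiments = {"image_quality": 4.4, "price": 3.2}
print(preference_score(camera, camera_sentiments,
                       value_prefs={"effective_pixels": 20, "price": 350},
                       sentiment_prefs={"image_quality": 4.0},
                       weights={"effective_pixels": 0.5, "price": 0.3, "image_quality": 0.2}))
```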
7. Conclusions

In conclusion, we have developed a sentiment-integrated critiquing interface for recommender systems (Senti-CBRS), which particularly utilizes the feature sentiments of product reviews to support users in making critiques. By means of before-after and within-subjects experiments, we compared Senti-CBRS with the traditional CBRS, which demonstrated the impact brought by feature sentiments on improving users' decision quality and perceptions of the system's competence.
As mentioned at the beginning, our system mainly serves users who are new to the product domain (such as high-investment products in the e-commerce environment). For this objective, we have exploited other customers' review data to help those new users to formulate and refine their product preferences via critiquing. In the future, we will attempt to extend the work to other domains where users are not only information seekers but also contributors (review writers), for whom their own reviews might be used by the system to infer their initial attribute preferences and hence more effectively guide their present critiquing process. We may also take into account other users' historical critiques on products' static attribute values and feature sentiments, to improve the recommendation and critique suggestion for the current user via collaborative intelligence (Xie et al., 2014). In addition, we will try other sentiment analysis algorithms, especially those that can extract other types of opinions, such as conditional opinions (Li et al., 2010) and comparative opinions (Jindal and Liu, 2006), from product reviews, so as to further enhance our critiquing system by considering users' context-dependent preferences.

Acknowledgements

This research work was supported by the Hong Kong Research Grants Council (RGC) under projects ECS/HKBU211912 and RGC/HKBU12200415, and the Fundamental Research Funds of Shandong University, China. We also thank all participants who took part in our experiments.

References

Aciar, S., Zhang, D., Simoff, S., Debenham, J., 2007. Informed recommender: basing recommendations on consumer product reviews. IEEE Intell. Syst. 22 (3), 39–47.
Agrawal, R., Imieliński, T., Swami, A., 1993. Mining association rules between sets of items in large databases. In: Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data. ACM, pp. 207–216.
Burke, R.D., 2000. Knowledge-based recommender systems. Encycloped. Lib. Inf. Sci. 69 (Supplement 32), 180.
Burke, R.D., Hammond, K.J., Young, B.C., 1996. Knowledge-based navigation of complex information spaces. In: Proceedings of the National Conference on Artificial Intelligence (AAAI'96), pp. 462–468.
Burke, R.D., Hammond, K.J., Young, B.C., 1997. The findme approach to assisted browsing. IEEE Exp. 12 (4), 32–40. doi:10.1109/64.608186.
Charness, G., Gneezy, U., Kuhn, M.A., 2012. Experimental methods: between-subject and within-subject design. J. Econ. Behav. Organ. 81 (1), 1–8.
Chen, L., Chen, G., Wang, F., 2015. Recommender systems based on user reviews: the state of the art. User Model. User-Adapt. Interact. 25 (2), 99–154. doi:10.1007/s11257-015-9155-5.
Chen, L., Pu, P., 2006. Evaluating critiquing-based recommender agents. In: Proceedings of the Twenty-first National Conference on Artificial Intelligence (AAAI'06), pp. 157–162.
Chen, L., Pu, P., 2007a. The evaluation of a hybrid critiquing system with preference-based recommendations organization. In: Proceedings of the 2007 ACM Conference on Recommender Systems. ACM, pp. 169–172.
Chen, L., Pu, P., 2007b. Hybrid critiquing-based recommender systems. In: Proceedings of the 12th International Conference on Intelligent User Interfaces (IUI'07). ACM, pp. 22–31.
Chen, L., Pu, P., 2007c. Preference-based organization interfaces: aiding user critiques in recommender systems. In: Proceedings of International Conference on User Modeling (UM'07). Springer, pp. 77–86.
Chen, L., Pu, P., 2010. Experiments on the preference-based organization interface in recommender systems. ACM Trans. Comput.-Human Interact. (TOCHI) 17 (1), 1–33.
Chen, L., Pu, P., 2012. Critiquing-based recommenders: survey and emerging trends. User Model. User-Adapt. Interact. 22 (1), 125–150.
Chen, L., Wang, F., 2013. Preference-based clustering reviews for augmenting e-commerce recommendation. Knowl.-Based Syst. 50, 44–59.
Chen, L., Wang, F., 2016. An eye-tracking study: implication to implicit critiquing feedback elicitation in recommender systems. In: Proceedings of the 2016 Conference on User Modeling Adaptation and Personalization (UMAP'16). ACM, pp. 163–167.
Dong, R., O'Mahony, M.P., Schaal, M., McCarthy, K., Smyth, B., 2013a. Sentimental product recommendation. In: Proceedings of the 7th ACM Conference on Recommender Systems. ACM, pp. 411–414.
Dong, R., Schaal, M., O'Mahony, M.P., McCarthy, K., Smyth, B., 2013b. Opinionated product recommendation. In: Proceedings of the 21st International Conference on Case-Based Reasoning. Springer, pp. 44–58.
Esuli, A., Sebastiani, F., 2006. Sentiwordnet: a publicly available lexical resource for opinion mining. In: Proceedings of the 5th Conference on Language Resources and Evaluation, 6, pp. 417–422.
Fellbaum, C., 1998. WordNet: An Electronic Lexical Database. MIT Press, Cambridge, MA.
Field, A., 2013. Discovering Statistics Using IBM SPSS Statistics. SAGE Press.
Grasch, P., Felfernig, A., Reinfrank, F., 2013. Recomment: towards critiquing-based recommendation with speech interaction. In: Proceedings of the 7th ACM Conference on Recommender Systems. ACM, pp. 157–164.
Hu, M., Liu, B., 2004a. Mining and summarizing customer reviews. In: Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, NY, USA, pp. 168–177. doi:10.1145/1014052.1014073.
Hu, M., Liu, B., 2004b. Mining opinion features in customer reviews. In: Proceedings of the 19th National Conference on Artificial Intelligence. AAAI Press, pp. 755–760. http://dl.acm.org/citation.cfm?id=1597148.1597269.
Jin, W., Ho, H.H., Srihari, R.K., 2009. Opinionminer: a novel machine learning system for web opinion mining and extraction. In: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, pp. 1195–1204.
Jindal, N., Liu, B., 2006. Mining comparative sentences and relations. In: Proceedings of the 21st National Conference on Artificial Intelligence - Volume 2. AAAI Press, pp. 1331–1336. http://dl.acm.org/citation.cfm?id=1597348.1597400.
Keeney, R.L., Raiffa, H., 1993. Decisions With Multiple Objectives: Preferences and Value Trade-offs. Cambridge University Press.
Kim, Y.A., Srivastava, J., 2007. Impact of social influence in e-commerce decision making. In: Proceedings of the Ninth International Conference on Electronic Commerce. ACM, NY, USA, pp. 293–302. doi:10.1145/1282100.1282157.
Li, Y., Nie, J., Zhang, Y., Wang, B., Yan, B., Weng, F., 2010. Contextual recommendation based on text mining. In: Proceedings of the 23rd International Conference on Computational Linguistics: Posters. Association for Computational Linguistics, pp. 692–700. http://dl.acm.org/citation.cfm?id=1944566.1944645.
Liu, B., 2010. Sentiment analysis and subjectivity. In: Indurkhya, N., Damerau, F.J. (Eds.), Handbook of Natural Language Processing, Second Edition. CRC Press, Taylor and Francis Group, Boca Raton, FL. ISBN 978-1420085921.
Liu, H., He, J., Wang, T., Song, W., Du, X., 2013. Combining user preferences and user opinions for accurate recommendation. Electron. Comm. Res. Appl. 12 (1), 14–23.
MacKenzie, I.S., 2013. Human-Computer Interaction: An Empirical Research Perspective, 1st ed. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA.
Martin, D.W., 2007. Doing Psychology Experiments, 7th ed. Thomson/Wadsworth.
McAuley, J., Leskovec, J., 2013. Hidden factors and hidden topics: understanding rating dimensions with review text. In: Proceedings of the 7th ACM Conference on Recommender Systems. ACM, NY, USA, pp. 165–172. doi:10.1145/2507157.2507163.
McCarthy, K., Reilly, J., McGinty, L., Smyth, B., 2005. Experiments in dynamic critiquing. In: Proceedings of the 10th International Conference on Intelligent User Interfaces. ACM, pp. 175–182.
McCarthy, K., Salem, Y., Smyth, B., 2010. Experience-based critiquing: reusing critiquing experiences to improve conversational recommendation. In: Proceedings of the International Conference on Case-Based Reasoning. Springer, pp. 480–494.
Miao, Q., Li, Q., Zeng, D., 2010. Mining fine grained opinions by using probabilistic models and domain knowledge. In: Proceedings of 2010 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology. IEEE, pp. 358–365.
Payne, J.W., Bettman, J.R., Johnson, E.J., 1993. The Adaptive Decision Maker. Cambridge University Press.
Payne, J.W., Bettman, J.R., Schkade, D.A., 1999. Measuring constructed preferences: towards a building code. J. Risk Uncertain. 19 (1–3), 243–275.


Pu, P., Chen, L., 2005. Integrating tradeoff support in product search tools for e-commerce sites. In: Proceedings of the 6th ACM Conference on Electronic Commerce. ACM, New York, NY, USA, pp. 269–278. doi:10.1145/1064009.1064038.
Pu, P., Chen, L., Hu, R., 2011. A user-centric evaluation framework for recommender systems. In: Proceedings of the Fifth ACM Conference on Recommender Systems. ACM, pp. 157–164.
Pu, P., Faltings, B., 2000. Enriching buyers' experiences: the smartclient approach. In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. ACM, pp. 289–296.
Pu, P., Faltings, B., 2002. Personalized navigation of heterogeneous product spaces using smartclient. In: Proceedings of the 7th International Conference on Intelligent User Interfaces (IUI'02). ACM, pp. 212–213.
Qi, L., Chen, L., 2010. A linear-chain crf-based learning approach for web opinion mining. In: Proceedings of the International Conference on Web Information Systems Engineering. Springer, pp. 128–141.
Reilly, J., McCarthy, K., McGinty, L., Smyth, B., 2004. Dynamic critiquing. In: Proceedings of the European Conference on Case-Based Reasoning. Springer, pp. 763–777.
Seroussi, Y., Bohnert, F., Zukerman, I., 2011. Personalised rating prediction for new users using latent factor models. In: Proceedings of the 22nd ACM Conference on Hypertext and Hypermedia. ACM, pp. 47–56.
Shimazu, H., 2002. Expertclerk: a conversational case-based reasoning tool for developing salesclerk agents in e-commerce webshops. Artif. Intell. Rev. 18 (3-4), 223–244.
Smyth, B., McGinty, L., Reilly, J., McCarthy, K., 2004. Compound critiques for conversational recommender systems. In: Proceedings of the 2004 IEEE/WIC/ACM International Conference on Web Intelligence. IEEE, pp. 145–151.
Tversky, A., Simonson, I., 1993. Context-dependent preferences. Manag. Sci. 39 (10), 1179–1189.
Wang, F., Pan, W., Chen, L., 2013. Recommendation for new users with partial preferences by integrating product reviews with static specifications. In: Proceedings of the International Conference on User Modeling, Adaptation, and Personalization (UMAP'13). Springer, pp. 281–288.
Wang, Y., Liu, Y., Yu, X., 2012. Collaborative filtering with aspect-based opinion mining: a tensor factorization approach. In: Proceedings of the IEEE International Conference on Data Mining. IEEE Computer Society, pp. 1152–1157.
Wu, J., Wu, Y., Sun, J., Yang, Z., 2013. User reviews and uncertainty assessment: a two stage model of consumers' willingness-to-pay in online markets. Decis. Supp. Syst. 55 (1), 175–185.
Xie, H., Chen, L., Wang, F., 2014. Collaborative compound critiquing. In: Proceedings of the International Conference on User Modeling, Adaptation, and Personalization (UMAP'14). Springer, pp. 254–265.
Zhang, W., Ding, G., Chen, L., Li, C., Zhang, C., 2013. Generating virtual ratings from chinese reviews to augment online recommendations. ACM Trans. Intell. Syst. Technol. (TIST) 4 (1), 9:1–9:17.
