FairJudge: Trustworthy User Prediction in Rating Platforms
Figure 1: (a) The proposed algorithm, FairJudge, consistently performs the best, by having the highest AUC, in predicting fair and unfair users with varying percentage of training labels. (b) FairJudge discovered a bot-net of 40 confirmed shills of one user, rating each other positively. [Panel (a) plots Average AUC (0.5-0.9) against the percentage of training data (10-90) for FairJudge and the baselines FraudEagle [2], SpEagle [24], BAD [21], Trustiness [29] and BIRDNEST [8]; panel (b) shows the network of colluding accounts.]

Table 1: Proposed algorithm FairJudge satisfies all desirable properties. [The table compares the closest existing algorithms against FairJudge on four properties: uses network information, uses behavior properties, prior free, and theoretical guarantees; FairJudge satisfies all four.]

Our experiments show that FairJudge outperforms several existing methods [2, 8, 21, 24, 29, 23, 22, 17] in predicting fraudulent users. Specifically, in an unsupervised setting, FairJudge has the best or second best average precision in nine out of ten cases. Across two supervised settings, FairJudge has the highest AUC >= 0.85 across all five datasets. It consistently performs the best as the percentage of training data is varied between 10 and 90, as shown for the Alpha network in Figure 1(a). Further, we experimentally show that both the cold start treatment and the behavior properties improve the performance of the FairJudge algorithm, and incorporating both together performs even better.

FairJudge is practically very useful. We reported the 100 most unfair users as predicted by FairJudge in the Flipkart online marketplace. Review fraud investigators at Flipkart studied our recommendations and confirmed 80 of them were unfair, presenting a validation of the utility of FairJudge in identifying real-world review fraud. In fact, FairJudge is already being deployed at Flipkart. On the Bitcoin Alpha network, using FairJudge, we discovered a botnet-like structure of 40 unfair users that rate each other positively (Figure 1(b)), which are confirmed shills of a single account.

Overall, the paper makes the following contributions:
• Algorithm: We propose three novel metrics called fairness, goodness and reliability to rank users, products and ratings, respectively. We propose Bayesian approaches to address cold start problems and incorporate behavioral properties. We propose the FairJudge algorithm to iteratively compute these metrics.
• Theoretical guarantees: We prove that the FairJudge algorithm is guaranteed to converge in a bounded number of iterations, and that it has linear time complexity.
• Effectiveness: We show that FairJudge outperforms nine existing algorithms in identifying fair and unfair users, via five experiments on five rating networks.

2. RELATED WORK
Existing work on rating fraud detection can be categorized into network-based and behavior-based algorithms.
Network-based fraud detection algorithms are based on iterative learning, belief propagation, and node ranking techniques. Similar to our proposed FairJudge algorithm, [29, 21, 16] develop iterative algorithms that jointly assign scores in the rating network based on consensus of ratings: [29] scores each user, review and product, and [21] scores each user and product. FraudEagle [2] is a belief propagation model to rank users, which assumes fraudsters rate good products poorly and bad products positively, and vice-versa for honest users. Random-walk based algorithms have been developed to detect trolls [32] and link farming from collusion on Twitter [7]. [10, 34, 16] identify groups of fraudsters based on the local neighborhood of the users. A survey of network-based fraud detection can be found in [3].
Behavioral fraud detection algorithms are often feature-based. Consensus-based features have been proposed in [23, 17]; our proposed goodness metric is also inspired by consensus, or the 'wisdom of crowds'. Commonly used features are derived from timestamps [33, 31, 20] and review text [26, 6, 25]. SpEagle [24] extends FraudEagle [2] to incorporate behavior features. BIRDNEST [8] creates a Bayesian model to estimate the belief of each user's deviation in rating behavior from globally expected behavior. [10, 4, 28] study coordinated spam behavior of multiple users. A survey of behavior-based algorithms can be found in [11].
Our proposed FairJudge algorithm combines both network-based and behavior-based approaches, along with a Bayesian solution to cold start problems. Since review text is not always available, e.g. in a subset of our datasets, we focus only on ratings and timestamps. FairJudge provides theoretical guarantees and does not require any user input. Table 1 compares FairJudge to the closest existing algorithms, which do not always satisfy all the desirable properties.

3. FAIRJUDGE FORMULATION
In this section, we present the FairJudge algorithm, which jointly models the rating network and behavioral properties. We first present our three novel metrics — fairness, reliability and goodness — which measure intrinsic properties of users, ratings and products, respectively, building on our prior work [13]. We show how to incorporate Bayesian priors to address user and product cold start problems, and how to incorporate behavioral features of the users and products. We then prove that our algorithm has several desirable theoretical guarantees.

Prerequisites. We model the rating network as a bipartite network where user u gives a rating (u, p) to product p. Let the rating score be represented as score(u,p). Let U, R and P represent the set of all users, ratings and products, respectively, in a given bipartite network. We assume that all rating scores are scaled to be between -1 and +1, i.e. score(u,p) ∈ [−1, 1] ∀(u, p) ∈ R. Let Out(u) be the set of ratings given by user u and In(p) be the set of ratings
Figure 2: While most ratings of fair users have high reliability, some ratings also have low reliability (green arrow). Conversely, unfair users also give some high reliability ratings (red arrow), but most of their ratings have low reliability.

Figure 3: Toy example showing products (P1, P2, P3), users (UA, UB, UC, UD, UE and UF), and rating scores provided by the users to the products. User UF always disagrees with the consensus, so UF is unfair. [UA through UE rate P1 with +1, P2 with +0.5 and P3 with −1, while the unfair user UF rates P1 and P2 with −1 and P3 with +1.]
received by product p. So, |Out(u)| and |In(p)| represent their respective counts.

3.1 Fairness, Goodness and Reliability
Users, ratings and products have the following characteristics:
• Users vary in fairness. Fair users rate products without bias, i.e. they give high scores to high quality products, and low scores to bad products. On the other hand, users who frequently deviate from this behavior are 'unfair'. For example, fraudulent users often create multiple accounts to boost ratings of unpopular products and bad-mouth the good products of their competitors [28, 9]. Hence, these fraudsters should have low fairness scores. The fairness F(u) of a user u lies in the [0, 1] interval ∀u ∈ U: 0 denotes a 100% untrustworthy user, while 1 denotes a 100% trustworthy user.
• Products vary in terms of their quality, which we measure by a metric called goodness. The quality of a product determines how it would be rated by a fair user. Intuitively, a good product would get several highly positive ratings from fair users, and a bad product would receive highly negative ratings from fair users. The goodness G(p) of a product p ranges from −1 (a very low quality product) to +1 (a very high quality product) ∀p ∈ P.
• Finally, ratings vary in terms of reliability. This measure reflects how trustworthy the specific rating is. The reliability R(u, p) of a rating (u, p) ranges from 0 (an untrustworthy rating) to 1 (a trustworthy rating) ∀(u, p) ∈ R.
The reader may wonder: isn't rating reliability identical to the user's fairness? The answer is 'no'. Consider Figure 2, where we show the rating reliability distribution of the top 1000 fair and top 1000 unfair users in the Flipkart network, as identified by our FairJudge algorithm (explained later). Notice that, while most ratings by fair users have high reliability, some of their ratings have low reliability, indicating personal opinions that disagree with the majority (see green arrow). Conversely, unfair users give some high reliability ratings (red arrow), probably to camouflage themselves as fair users. Thus, having reliability as a rating-specific metric allows us to characterize this distribution more accurately.
Given a bipartite user-product graph, we do not know the values of these three metrics for any user, product or rating. Clearly, these scores are mutually interdependent. Let us look at an example.

Example 1 (Running Example). Figure 3 shows a simple example in which there are 3 products, P1 to P3, and 6 users, UA to UF. Each review is denoted as an edge from a user to a product, with a rating score between −1 and +1. Note that this is a rescaled version of the traditional 5-star rating scale, where 1, 2, 3, 4 and 5 stars correspond to −1, −0.5, 0, +0.5 and +1, respectively.
One can immediately see that UF's ratings are inconsistent with those of UA, UB, UC, UD and UE. UF gives poor ratings to P1 and P2, which the others all agree are very good by consensus. UF also gives a high rating to P3, which the others all agree is very bad. We will use this example to motivate our formal definitions below.

Fairness of users. Intuitively, fair users are those who give reliable ratings, and unfair users mostly give unreliable ratings. So we simply define a user's fairness score to be the average reliability score of its ratings:

F(u) = ( Σ_{(u,p)∈Out(u)} R(u,p) ) / |Out(u)|    (4)

Other definitions, such as a temporally weighted average, can be developed, but we use the above for simplicity.

Goodness of products. When a product receives rating scores via ratings with different reliability, clearly more importance should be given to the ratings that have higher reliability. Hence, to estimate a product's goodness, we weight the ratings by their reliability scores, giving higher weight to reliable ratings and little importance to low reliability ratings:

G(p) = ( Σ_{(u,p)∈In(p)} R(u,p) · score(u,p) ) / |In(p)|    (5)

Returning to our running example, the ratings given by users UA, UB, UC, UD, UE and UF to product P1 are +1, +1, +1, +1, +1 and −1, respectively. So we have:

G(P1) = (1/6) ( R(UA, P1) + R(UB, P1) + R(UC, P1) + R(UD, P1) + R(UE, P1) − R(UF, P1) )
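Equations (4) and (5) can be checked directly on the running example. Below is a minimal Python sketch; the edge list is transcribed from Figure 3, and every reliability is fixed at 1, which is an assumption standing in for scores not yet computed:

```python
# Toy data from the running example: (user, product) -> rating score in [-1, 1].
ratings = {
    **{(u, "P1"): +1.0 for u in "ABCDE"}, ("F", "P1"): -1.0,
    **{(u, "P2"): +0.5 for u in "ABCDE"}, ("F", "P2"): -1.0,
    **{(u, "P3"): -1.0 for u in "ABCDE"}, ("F", "P3"): +1.0,
}
# Assume every rating is fully reliable for now (R = 1 everywhere).
R = {edge: 1.0 for edge in ratings}

def fairness(u):
    """Eq. (4): average reliability of the user's ratings."""
    edges = [e for e in ratings if e[0] == u]
    return sum(R[e] for e in edges) / len(edges)

def goodness(p):
    """Eq. (5): reliability-weighted mean rating of the product."""
    edges = [e for e in ratings if e[1] == p]
    return sum(R[e] * ratings[e] for e in edges) / len(edges)

print(round(goodness("P1"), 2))  # (5*1 - 1)/6 = 0.67
print(round(fairness("F"), 2))   # all reliabilities are 1, so 1.0
```

With all reliabilities at 1, UF looks perfectly fair; the mutual recursion introduced next is what lets the metrics correct each other.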
Fairness of user u:
F(u) = ( 0.5·α1 + α2·IBIRDNEST_IRTDU(u) + Σ_{(u,p)∈Out(u)} R(u,p) ) / ( α1 + α2 + |Out(u)| )    (1)

Reliability of rating (u, p):
R(u,p) = (1/2) ( F(u) + (1 − |score(u,p) − G(p)| / 2) )    (2)

Goodness of product p:
G(p) = ( β2·IBIRDNEST_IRTDP(p) + Σ_{(u,p)∈In(p)} R(u,p)·score(u,p) ) / ( β1 + β2 + |In(p)| )    (3)

Figure 4: This is the set of mutually recursive definitions of fairness, reliability and goodness for the proposed FairJudge algorithm. The yellow shaded part addresses the cold start problems and the gray shaded part incorporates the behavioral properties.
Reliability of ratings. A rating (u, p) should be considered reliable if (i) it is given by a generally fair user u, and (ii) its score is close to the goodness score of product p. The first condition makes sure that ratings by fair users are trusted more, in view of the user's established reputation. The second condition leverages the 'wisdom of crowds', by making sure that ratings that deviate from p's goodness have low reliability. This deviation is measured as the normalized absolute difference, |score(u,p) − G(p)|/2. Together:

R(u,p) = (1/2) ( F(u) + (1 − |score(u,p) − G(p)| / 2) )    (6)

In our running example, for the rating by user UF to P1:

R(UF, P1) = (1/2) ( F(UF) + (1 − |−1 − G(P1)| / 2) )

Similar equations can be associated with every edge (rating) in the graph of Figure 3.

3.2 Addressing Cold Start Problems
If a user u has given only a few ratings, we have very little information about his true behavior. Say all of u's ratings are very accurate – it is hard to tell whether it is a fraudster that is camouflaging and building reputation by giving genuine ratings [9], or it is actually a benign user. Conversely, if all of u's ratings are very deviant, it is hard to tell whether the user is a fraudulent shill account [23, 17], or simply a normal user whose rating behavior is unusual at first but stabilizes in the long run [8]. Due to the lack of sufficient information about the ratings given by the user, little can be said about his fairness. Similarly, for products that have only been rated a few times, it is hard to accurately determine their true quality, as they may be targets of fraud [17, 20]. This uncertainty due to insufficient information about less active users and products is the cold start problem.
We solve this issue by assigning Bayesian priors to each user's fairness score as follows:

F(u) = ( 0.5·α + Σ_{(u,p)∈Out(u)} R(u,p) ) / ( α + |Out(u)| )    (7)

Here, α is a non-negative integer constant, which is the relative importance of the prior compared to the rating reliability – the lower (higher) the value of α, the more (less, resp.) the fairness score depends on the reliability of the ratings. The 0.5 score is the default global prior belief of all users' fairness, which is the midpoint of the fairness range [0, 1]. If a user gives only a few ratings, then the fairness score of the user stays close to the default score 0.5. The more ratings the user gives, the more the fairness score moves towards the user's rating reliabilities. This way, shills with few ratings have little effect on product scores.
Similarly, the Bayesian prior in a product's goodness score is incorporated as:

G(p) = ( Σ_{(u,p)∈In(p)} R(u,p)·score(u,p) ) / ( β + |In(p)| )    (8)

Again, β is a non-negative integer constant. The prior belief of a product's goodness is set to 0, which is the midpoint of the goodness range [−1, 1], and so 0·β = 0 does not appear in the numerator. We will explain how the values of α and β are set in Section 4.

3.3 Incorporating Behavioral Properties
Rating scores alone are not sufficient to accurately estimate the fairness, goodness and reliability values. The behavior of the users and products is also an important aspect to consider. As an example, fraudsters have been known to give several ratings in a very short timespan, or regularly at fixed intervals, which indicates bot-like behavior [15, 8, 28]. Even though the ratings themselves may be very accurate, the unusually rapid behavior of the user is suspicious. Benign users, on the other hand, have a more spread out rating behavior, as they lack regularity [15]. In addition, products that receive an unusually high number of ratings in a very short time period may have bought fake ratings [31, 20]. Therefore, behavioral properties of the ratings that a user gives or a product receives are indicative of their true nature.
Here, we show a Bayesian technique to incorporate the rating behavioral properties of users and products. We focus on temporal rating behavior as the behavioral property for the rest of the paper, but our method can be used with any additional set of properties. A user u's temporal rating behavior is represented as the time differences between its consecutive ratings, the inter-rating time distribution IRTDU(u). To model this behavior, we use a recent algorithm called BIRDNEST [8], which calculates a Bayesian estimate of how much user u's IRTDU(u) deviates from the global population of all users' behavior, IRTDU(u′) ∀u′ ∈ U. This estimate is the BIRDNEST score of user u, BIRDNEST_IRTDU(u) ∈ [0, 1]. We use IBIRDNEST_IRTDU(u) = 1 − BIRDNEST_IRTDU(u) as user u's behavioral normality score. When IBIRDNEST_IRTDU(u) = 1, the user's behavior is absolutely normal, and 0 means totally abnormal. Similarly, for products, IBIRDNEST_IRTDP(p) is calculated from the
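The shrinkage effect of the prior in Equation 7 is easy to see numerically. The sketch below is illustrative only; the value α = 2 is an arbitrary choice for the demonstration, not one prescribed by the paper:

```python
def fairness_with_prior(reliabilities, alpha):
    """Eq. (7): Bayesian-smoothed fairness, prior belief 0.5 with weight alpha."""
    return (0.5 * alpha + sum(reliabilities)) / (alpha + len(reliabilities))

# One perfectly reliable rating: the score stays near the 0.5 prior.
print(fairness_with_prior([1.0], alpha=2))       # (1 + 1) / 3 ≈ 0.667
# Twenty perfectly reliable ratings: the evidence dominates the prior.
print(fairness_with_prior([1.0] * 20, alpha=2))  # 21 / 22 ≈ 0.955
```

This is exactly why a shill that has given only a handful of camouflage ratings cannot immediately earn a near-perfect fairness score.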
consecutive ratings' time differences IRTDP(p) that product p gets, with respect to the global IRTDP(p′) ∀p′ ∈ P.
As in the case of cold start, we adopt a Bayesian approach to incorporate the behavioral properties in the fairness and goodness equations 7 and 8. The IBIRDNEST_IRTDU(u) score is treated as user u's behavioral Bayesian prior. As opposed to the global priors (0.5 and 0) in the cold start case, this prior is user-dependent and may differ across users. The resulting equation is given in Equation 1. Note that there are now two non-negative integer constants, α1 and α2, indicating the relative importance of the cold start and behavioral components, respectively, to the network component. The Bayesian prior for products is incorporated similarly to give the final goodness equation in Equation 3.
Overall, the FairJudge formulation represented in Figure 4 is the set of three equations that define the fairness, goodness and reliability scores in terms of each other. It combines the network, cold start treatment and behavioral properties together.

Algorithm 1 FairJudge Algorithm
1: Input: Rating network (U, R, P), α1, α2, β1, β2
2: Output: Fairness, Reliability and Goodness scores, given α1, α2, β1 and β2
3: Initialize F^0(u) = 1, R^0(u, p) = 1 and G^0(p) = 1, ∀u ∈ U, (u, p) ∈ R, p ∈ P
4: Calculate IBIRDNEST_IRTDU(u) ∀u ∈ U and IBIRDNEST_IRTDP(p) ∀p ∈ P
5: t = 0
6: do
7:    t = t + 1
8:    Update goodness of products using Equation 3: ∀p ∈ P,
      G^t(p) = ( β2·IBIRDNEST_IRTDP(p) + Σ_{(u,p)∈In(p)} R^{t−1}(u,p)·score(u,p) ) / ( β1 + β2 + |In(p)| )
9:    Update reliability of ratings using Equation 2: ∀(u, p) ∈ R,
      R^t(u,p) = (1/2) ( F^{t−1}(u) + (1 − |score(u,p) − G^t(p)| / 2) )
10:   Update fairness of users using Equation 1: ∀u ∈ U,
      F^t(u) = ( 0.5·α1 + α2·IBIRDNEST_IRTDU(u) + Σ_{(u,p)∈Out(u)} R^t(u,p) ) / ( α1 + α2 + |Out(u)| )
11:   error = max( max_{u∈U} |F^t(u) − F^{t−1}(u)|, max_{(u,p)∈R} |R^t(u,p) − R^{t−1}(u,p)|, max_{p∈P} |G^t(p) − G^{t−1}(p)| )
12: while error > ε
13: Return F^t(u), R^t(u,p), G^t(p), ∀u ∈ U, (u, p) ∈ R, p ∈ P
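Algorithm 1 translates almost line-for-line into code. The sketch below is our own minimal Python rendering, not the authors' released implementation; it takes the behavioral normality scores as precomputed inputs (defaulting to a neutral 0.5), since BIRDNEST [8] is a separate algorithm:

```python
def fairjudge(ratings, a1=1, a2=1, b1=1, b2=1,
              prior_u=None, prior_p=None, eps=1e-6):
    """ratings: dict mapping (user, product) -> score in [-1, 1].
    prior_u / prior_p: optional IBIRDNEST normality scores per
    user / product; 0.5 is a neutral stand-in when none are given."""
    users = {u for u, _ in ratings}
    prods = {p for _, p in ratings}
    out_edges = {u: [e for e in ratings if e[0] == u] for u in users}
    in_edges = {p: [e for e in ratings if e[1] == p] for p in prods}
    prior_u = prior_u or {u: 0.5 for u in users}
    prior_p = prior_p or {p: 0.5 for p in prods}

    F = {u: 1.0 for u in users}      # line 3: optimistic initialization
    R = {e: 1.0 for e in ratings}
    G = {p: 1.0 for p in prods}

    while True:
        # Line 8: goodness update (Equation 3).
        G_new = {p: (b2 * prior_p[p] + sum(R[e] * ratings[e] for e in in_edges[p]))
                    / (b1 + b2 + len(in_edges[p])) for p in prods}
        # Line 9: reliability update (Equation 2).
        R_new = {e: 0.5 * (F[e[0]] + 1 - abs(ratings[e] - G_new[e[1]]) / 2)
                 for e in ratings}
        # Line 10: fairness update (Equation 1).
        F_new = {u: (0.5 * a1 + a2 * prior_u[u] + sum(R_new[e] for e in out_edges[u]))
                    / (a1 + a2 + len(out_edges[u])) for u in users}
        # Lines 11-12: stop once every score has stabilized.
        error = max(max(abs(F_new[u] - F[u]) for u in users),
                    max(abs(R_new[e] - R[e]) for e in ratings),
                    max(abs(G_new[p] - G[p]) for p in prods))
        F, R, G = F_new, R_new, G_new
        if error <= eps:
            return F, R, G
```

With α1 = α2 = β1 = β2 = 0 on the toy network of Figure 3, this sketch reproduces Example 2: F(UF) drops to roughly 0.22 while UA through UE stay near 0.86.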
4. THE FAIRJUDGE ALGORITHM
Having formulated the fairness, goodness and reliability metrics in Section 3, we present the FairJudge algorithm in Algorithm 1 to calculate their values for all users, products and ratings. The algorithm is iterative, so let F^t, G^t and R^t denote the fairness, goodness and reliability scores at the end of iteration t. Given the rating network and non-negative integers α1, α2, β1 and β2, we first initialize all scores to the highest value 1 (see line 3).¹ Then we iteratively update the scores using Equations (3), (2) and (1) until convergence (see lines 6-12). Convergence occurs when all scores change minimally (see line 12). ε is the acceptable error bound, which is set to a very small value, say 10^−6.
But how do we set the values of α1, α2, β1 and β2? In an unsupervised scenario, it is not possible to find the best combination of these values apriori. Therefore, the algorithm is run with several combinations of α1, α2, β1 and β2 as inputs, and the final scores of a user across all these runs are averaged to get the final FairJudge score of the user. Formally, let C be the set of all parameter combinations {α1, α2, β1, β2}, and F(u|α1, α2, β1, β2) be the fairness score of user u after running Algorithm 1 with α1, α2, β1 and β2 as input. So the final fairness score of user u is F(u) = Σ_{(α1,α2,β1,β2)∈C} F(u|α1,α2,β1,β2) / |C|. Similarly, G(p) = Σ_{(α1,α2,β1,β2)∈C} G(p|α1,α2,β1,β2) / |C| and R(u,p) = Σ_{(α1,α2,β1,β2)∈C} R((u,p)|α1,α2,β1,β2) / |C|. In our experiments, we varied all (integer) parameters from 0 to 5, i.e. 0 ≤ α1, α2, β1, β2 ≤ 5, giving 6×6×6×6 = 1296 combinations. So, altogether, scores from 1296 different runs were averaged to get the final unsupervised FairJudge scores. This final score is used for ranking the users.
In a supervised scenario, it is indeed possible to learn the relative importance of the parameters. We represent each user u as a feature vector of its fairness scores across the runs, i.e. F(u|α1, α2, β1, β2), ∀(α1, α2, β1, β2) ∈ C are the features for user u. Given a set of fraudulent and benign user labels, a random forest classifier is trained that learns the appropriate weight to be given to each score. The higher the weight, the more important the particular combination of parameter values is. The learned classifier's prediction labels are then used as the supervised FairJudge output.

¹Random initialization gives similar results.

Example 2 (Running Example). Let us revisit our running example. Let α1 = α2 = β1 = β2 = 0. We initialize all fairness, goodness and reliability scores to 1 (line 3 in Algorithm 1). Consider the first iteration of the loop (i.e. when t is set to 1 in line 7). In line 8, we must update the goodness of all products. Let us start with product P1. Its goodness G^0(P1) was 1, but this gets updated to

G^1(P1) = ( −1(1) + 1(1) + 1(1) + 1(1) + 1(1) + 1(1) ) / 6 = 0.67.

We see that the goodness of P1 has dropped because of UF's poor rating. The following table shows how the fairness and goodness values change over iterations (we omit reliability for brevity):

Node      Property         Iteration 0   1       2       5       9 (final)
P1        G(P1)            1             0.67    0.67    0.67    0.68
P2        G(P2)            1             0.25    0.28    0.31    0.32
P3        G(P3)            1             -0.67   -0.67   -0.67   -0.68
UA - UE   F(UA) - F(UE)    1             0.92    0.89    0.86    0.86
UF        F(UF)            1             0.62    0.43    0.24    0.22

By symmetry, nodes UA to UE have the same fairness values throughout. After convergence, UF has a low fairness score, while UA to UE have close to perfect scores. Confirming our intuition, the algorithm quickly learns that UF is an unfair user, as all of its ratings disagree with those of the rest of the users. Hence, the algorithm downweighs UF's ratings in its estimation of the products' goodness, raising the scores of P1 and P2 as they deserve.

4.1 Theoretical Guarantees of FairJudge
Here we present the theoretical properties of FairJudge. Let F^∞(u), G^∞(p) and R^∞(u,p) be the final scores after convergence, for some input α1, α2, β1, β2.

Lemma 1. The difference between a product p's final goodness score and its score after the first iteration is at most 1, i.e. |G^∞(p) − G^1(p)| ≤ 1. Similarly, |R^∞(u,p) − R^1(u,p)| ≤ 3/4 and |F^∞(u) − F^1(u)| ≤ 3/4.
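The unsupervised parameter sweep is a plain grid average over all 6^4 = 1296 combinations. In the sketch below, `fairness_scores` is a deliberately simplified stand-in for one full run of Algorithm 1 (a real run would iterate Equations 1-3 to convergence); only the averaging logic mirrors the paper:

```python
from itertools import product

def fairness_scores(ratings, a1, a2, b1, b2):
    """Stand-in for one run of Algorithm 1: Equation 1 with all
    reliabilities fixed at 1 and a neutral behavior prior of 0.5."""
    users = {u for u, _ in ratings}
    out = {u: sum(1 for v, _ in ratings if v == u) for u in users}
    return {u: (0.5 * a1 + 0.5 * a2 + out[u]) / (a1 + a2 + out[u])
            for u in users}

def unsupervised_fairjudge(ratings):
    """Average fairness over all integer combinations 0 <= a1,a2,b1,b2 <= 5."""
    grid = list(product(range(6), repeat=4))
    assert len(grid) == 1296
    users = {u for u, _ in ratings}
    total = {u: 0.0 for u in users}
    for a1, a2, b1, b2 in grid:
        F = fairness_scores(ratings, a1, a2, b1, b2)
        for u in users:
            total[u] += F[u]
    return {u: total[u] / len(grid) for u in users}
```

In the supervised setting, the same 1296 per-combination scores would instead become the feature vector of each user for the random forest classifier.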
The proof is shown in the appendix [1].

Theorem 1 (Convergence Theorem). The difference during iterations is bounded as |R^∞(u,p) − R^t(u,p)| ≤ (3/4)^t, ∀(u,p) ∈ R. As t increases, the difference decreases and R^t(u,p) converges to R^∞(u,p). Similarly, |F^∞(u) − F^t(u)| ≤ (3/4)^t, ∀u ∈ U, and |G^∞(p) − G^t(p)| ≤ (3/4)^(t−1), ∀p ∈ P.

Table 2: Five rating networks used for evaluation.
Network    # Users (% unfair, fair)    # Products   # Edges
OTC        4,814 (3.7%, 2.8%)          5,858        35,592
Alpha      3,286 (3.1%, 4.2%)          3,754        24,186
Amazon     256,059 (0.09%, 0.92%)      74,258       560,804
Flipkart   1,100,000 (-, -)            550,000      3,300,000
Epinions   120,486 (0.84%, 7.7%)       132,585      4,835,208
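Theorem 1 yields a concrete worst-case iteration count: the bound (3/4)^t falls below a tolerance ε after about log ε / log(3/4) iterations. A quick check for the ε = 10^−6 used in Section 4:

```python
import math

def iterations_needed(eps):
    """Smallest t with (3/4)**t <= eps, from the Theorem 1 bound."""
    return math.ceil(math.log(eps) / math.log(3 / 4))

t = iterations_needed(1e-6)
print(t)                      # 49 iterations suffice in the worst case
print((3 / 4) ** t <= 1e-6)   # True
```

So even under the worst-case bound, convergence to ε = 10^−6 needs only a few dozen linear-time iterations.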
Figure 5: Variation of AUC with percentage of training data available for supervision. FairJudge consistently performs the best across all settings, and its performance is robust to the training percentage. [Five panels, one per dataset, plot Average AUC against the percentage of training data (10-90) for FraudEagle, BAD, SpEagle, BIRDNEST, Trustiness, SpEagle+, SpamBehavior, Spamicity, ICWSM'13 and FairJudge.]
Figure 6: Change in performance of FairJudge on the Alpha network in (a) unsupervised and (b) supervised experiments when different components are used: network (N), cold start treatment (C) and behavioral (B). [Bar charts of Average Precision (a) and Average AUC (b) for the combinations N+C, N+B and N+C+B; performance improves as components are added.]

Figure 7: (a) FairJudge scales linearly - the running time increases linearly with the number of edges. (b) Unfair users give highly negative ratings. [Panel (a) plots run time (seconds) against the number of edges in the network on log-log axes with a slope-1 guide line; panel (b) plots average out-weight over the fairness range: −0.88 for unfair vs. +0.62 for fair users.]

Figure 8: Unfair users identified by FairJudge (a) rate faster, and (b) give extreme ratings. [Panel (a) plots the fraction of users against the user inter-rating time profile (second, minute, hour, day, month); panel (b) plots the fraction of ratings against the mean user rating, for unfair vs. fair users.]

Figure 6 shows that all three components are important for predicting fraudulent users.

5.7 Experiment 5: Linear scalability
We have theoretically proved in Section 4.1 that FairJudge's running time is linear in the number of edges. To show this experimentally as well, we create random networks with an increasing number of nodes and edges and compute the running time of the algorithm till convergence. Figure 7 shows that the running time increases linearly with the number of edges in the network, which shows that the FairJudge algorithm is indeed scalable to large networks for practical use.

5.8 Discoveries
Here we look at some insights and discoveries about malicious behavior found by FairJudge.
As seen previously in Figure 2, most ratings given by unfair users have low reliability, while some have high reliability, indicating camouflage to masquerade as fair users. At the same time, most ratings of fair users have high reliability, but some ratings have low reliability, indicating personal opinions about products that disagree with the 'consensus'.
We see in Figure 7(b) that in the Amazon network, unfair users detected by FairJudge (F(u) ≤ 0.5) give highly negative ratings on average (mean out-rating weight of −0.88), while fair users give positive ratings (mean out-weight 0.62). Unfair users also give almost half as many ratings as fair users (7.85 vs 15.92). This means that, on average, fraudsters aggressively bad-mouth target products.
We also observe in Figure 8 that unfair users in the OTC network: (a) give ratings in quick succession, less than a few minutes apart, and (b) exhibit a bimodal rating pattern – either they give all −1.0 ratings (possibly bad-mouthing a competitor) or all +1.0 ratings (possibly over-selling their products/friends). As an example, the most unfair user found by FairJudge had 3500 ratings, all with a +0.5 score, and almost all given 15 seconds apart (apparently, a script). On the other hand, fair users have a day to a month between consecutive ratings, and they give mildly positive ratings (between 0.1 and 0.5). These observations are coherent with existing research [15, 5].
Figure 1(b) shows a set of 40 unfair users, as identified by FairJudge on the Alpha network, that collude to positively rate each other and form sets of tightly knit clusters – they are confirmed to be shills of a single user.
In summary, unfair users detected by FairJudge exhibit strange behavioral characteristics:
• They have a bimodal rating pattern - they give too low (bad-mouthing) or too high (over-selling) ratings.
• They are less active, have no daily periodicity, and post quickly, often less than a few minutes apart.
• They tend to form near-cliques, colluding to boost their own or their products' ratings.
• They camouflage their behavior to masquerade as benign users.
6. CONCLUSIONS
We presented the FairJudge algorithm to address the problem of identifying fraudulent users in rating networks. This paper makes the following contributions:
• Algorithm: We proposed three mutually-recursive metrics - fairness of users, goodness of products and reliability of ratings. We extended the metrics to incorporate Bayesian solutions to the cold start problem and behavioral properties. We proposed the FairJudge algorithm to iteratively compute these metrics.
• Theoretical guarantees: We proved that the FairJudge algorithm has linear time complexity and is guaranteed to converge in a bounded number of iterations.
• Effectiveness: By conducting five experiments, we showed that FairJudge outperforms nine existing algorithms in predicting fair and unfair users. FairJudge is practically useful, and is already under deployment at Flipkart.

7. REFERENCES
[1] Additional material. http://cs.umd.edu/~srijan/fairjudge/, 2017.
[2] L. Akoglu, R. Chandy, and C. Faloutsos. Opinion fraud detection in online reviews by network effects. In ICWSM, 2013.
[3] L. Akoglu, H. Tong, and D. Koutra. Graph based anomaly detection and description: a survey. TKDD, 29(3):626–688, 2015.
[4] C. Chen, K. Wu, V. Srinivasan, and X. Zhang. Battling the internet water army: Detection of hidden paid posters. In ASONAM, 2013.
[5] J. Cheng, C. Danescu-Niculescu-Mizil, and J. Leskovec. Antisocial behavior in online discussion communities. In ICWSM, 2015.
[6] A. Fayazi, K. Lee, J. Caverlee, and A. Squicciarini. Uncovering crowdsourced manipulation of online reviews. In SIGIR, 2015.
[7] S. Ghosh, B. Viswanath, F. Kooti, N. K. Sharma, G. Korlam, F. Benevenuto, N. Ganguly, and K. P. Gummadi. Understanding and combating link farming in the twitter social network. In WWW, 2012.
[8] B. Hooi, N. Shah, A. Beutel, S. Gunneman, L. Akoglu, M. Kumar, D. Makhija, and C. Faloutsos. Birdnest: Bayesian inference for ratings-fraud detection. In SDM, 2016.
[9] B. Hooi, H. A. Song, A. Beutel, N. Shah, K. Shin, and C. Faloutsos. Fraudar: Bounding graph fraud in the face of camouflage. In KDD, 2016.
[10] M. Jiang, P. Cui, A. Beutel, C. Faloutsos, and S. Yang. Catchsync: catching synchronized behavior in large directed graphs. In KDD, 2014.
[11] M. Jiang, P. Cui, and C. Faloutsos. Suspicious behavior detection: current trends and future directions. IEEE Intelligent Systems, 2016.
[12] N. Jindal and B. Liu. Opinion spam and analysis. In WSDM, 2008.
[13] S. Kumar, F. Spezzano, V. Subrahmanian, and C. Faloutsos. Edge weight prediction in weighted signed networks. In ICDM, 2016.
[14] T. Lappas, G. Sabnis, and G. Valkanas. The impact of fake reviews on online visibility: A vulnerability assessment of the hotel industry. INFORMS, 27(4), 2016.
[15] H. Li, G. Fei, S. Wang, B. Liu, W. Shao, A. Mukherjee, and J. Shao. Bimodal distribution and co-bursting in review spam detection. In WWW, 2017.
[16] R.-H. Li, J. Xu Yu, X. Huang, and H. Cheng. Robust reputation-based ranking on bipartite rating networks. In SDM, 2012.
[17] E.-P. Lim, V.-A. Nguyen, N. Jindal, B. Liu, and H. W. Lauw. Detecting product review spammers using rating behaviors. In CIKM, 2010.
[18] P. Massa and P. Avesani. Trust-aware recommender systems. In RecSys, 2007.
[19] J. J. McAuley and J. Leskovec. From amateurs to connoisseurs: modeling the evolution of user expertise through online reviews. In WWW, 2013.
[20] A. J. Minnich, N. Chavoshi, A. Mueen, S. Luan, and M. Faloutsos. Trueview: Harnessing the power of multiple review sites. In WWW, 2015.
[21] A. Mishra and A. Bhattacharya. Finding the bias and prestige of nodes in networks based on trust scores. In WWW, 2011.
[22] A. Mukherjee, A. Kumar, B. Liu, J. Wang, M. Hsu, M. Castellanos, and R. Ghosh. Spotting opinion spammers using behavioral footprints. In KDD, 2013.
[23] A. Mukherjee, V. Venkataraman, B. Liu, and N. S. Glance. What yelp fake review filter might be doing? In ICWSM, 2013.
[24] S. Rayana and L. Akoglu. Collective opinion spam detection: Bridging review networks and metadata. In KDD, 2015.
[25] V. Sandulescu and M. Ester. Detecting singleton review spammers using semantic similarity. In WWW, 2015.
[26] H. Sun, A. Morales, and X. Yan. Synthetic review spamming and defense. In KDD, 2013.
[27] B. Viswanath, M. A. Bashir, M. Crovella, S. Guha, K. P. Gummadi, B. Krishnamurthy, and A. Mislove. Towards detecting anomalous user behavior in online social networks. In USENIX Security, 2014.
[28] B. Viswanath, M. A. Bashir, M. B. Zafar, S. Bouget, S. Guha, K. P. Gummadi, A. Kate, and A. Mislove. Strength in numbers: Robust tamper detection in crowd computations. In COSN, 2015.
[29] G. Wang, S. Xie, B. Liu, and S. Y. Philip. Review graph based online store review spammer detection. In ICDM, 2011.
[30] J. Wang, A. Ghose, and P. Ipeirotis. Bonus, disclosure, and choice: what motivates the creation of high-quality paid reviews? In ICIS, 2012.
[31] G. Wu, D. Greene, and P. Cunningham. Merging multiple criteria to identify suspicious reviews. In RecSys, 2010.
[32] Z. Wu, C. C. Aggarwal, and J. Sun. The troll-trust model for ranking in signed networks. In WSDM, 2016.
[33] S. Xie, G. Wang, S. Lin, and P. S. Yu. Review spam detection via temporal pattern discovery. In KDD, 2012.
[34] J. Ye and L. Akoglu. Discovering opinion spammer groups by network footprints. In ECML-PKDD, 2015.