Comments Mining With TF-IDF: The Inherent Bias and Its Removal
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TKDE.2018.2840127, IEEE Transactions on Knowledge and Data Engineering
JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 1
Abstract—Text mining has gained great momentum in recent years, with user-generated content becoming widely available. One key use is comment mining, with much attention being given to sentiment analysis and opinion mining. An essential step in the process of comment mining is text pre-processing: a step in which each linguistic term is assigned a weight that commonly increases with its appearance in the studied text, yet is offset by the frequency of the term in the domain of interest. A common practice is to use the well-known tf-idf formula to compute these weights.
This paper reveals the bias introduced by between-participants' discourse to the study of comments in social media, and proposes an adjustment. We find that content extracted from discourse is often highly correlated, resulting in dependency structures between observations in the study, thus introducing a statistical bias. Ignoring this bias can manifest in a non-robust analysis at best and can lead to an entirely wrong conclusion at worst. We propose an adjustment to tf-idf that accounts for this bias. We illustrate the effects of both the bias and the correction with data from seven Facebook fan pages, covering different domains, including news, finance, politics, sport, shopping, and entertainment.
1041-4347 (c) 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
corpus of 5546 descriptions and a vocabulary of 672 unique words; in a work on Hashtagger+, a system to recommend Twitter hashtags for news articles, tf-idf is used to create a feature set describing candidate hashtags, creating tweet-bags of terms from news articles [9]. The use of tf-idf to pre-process short texts in order to create a co-occurrence network to model semantic relatedness can also be seen in the work of [10].

This paper thus aims at exposing and adjusting the bias caused by this approach when it is applied by researchers and practitioners. Of course, there are many situations in which alternatives to tf-idf should be considered. For example, in their study of tree-structured book representation, Zhang et al. [11] find that their models based on principal component analysis (PCA) features outperform models using tf-idf features.

1.1 Background: The study and classification of comments

The potential impact of the bias that we will illustrate is amplified by the high level of research activity surrounding the study of comments. The field of text analysis focuses attention on the analysis of comments as a central part of understanding user-generated online content [12]. Researchers have studied comments on posted pictures and YouTube videos [13], [14], [15], comments on blogs [16], [17], comments on press releases [18], [19], comments on corporate communication [20], comments on public service announcements [21], [22], and even comments on comments [18], [23], [24].

The study of comments has various aims. Given a set of documents such as blog entries or Facebook posts, each followed by a list of comments that is either flat or threaded, researchers study what can be inferred from users' comments on the document [5], [16], the similarity between documents [13], [19], the topics of threaded discussions [25], and their dynamics [19], [26].

Comment classification, often denoted message classification in the computational linguistics (CL) and natural language processing (NLP) literature (e.g., [26]), is the analysis of comments within a commentsphere [12]. Classes of comments studied vary from sarcasm and nastiness [27], [28], to attitude toward the comments [29], misinformation such as in rumors [30], informativeness relative to the document posted, the thread topic, or past comments [31], [32], uniqueness of comments [33], and much more.

Perhaps the most commonly addressed question in the context of comment classification is sentiment analysis [34], also referred to as opinion mining [35]. Sentiment analysis and opinion mining represent a large problem space, often defined slightly differently, covering, for example, opinion extraction, subjectivity study, emotion analysis, and more. The ultimate goal of sentiment analysis is to find out what people think or feel toward entities such as products, services, individuals, events, news articles, and topics. Methodologically, similarly to comment classification, the goal is achieved using content extraction and classification techniques. In much IS research, the term "sentiment analysis" is used to represent this entire umbrella of comment classification problems. A complete survey of different aspects of sentiment analysis is given in [36], [37], [38].

In the machine learning field, there is wide agreement that data-driven analysis of user-generated content requires text pre-processing. This step commonly involves removal (or down-weighting) of stop words, word stemming [39] or word lemmatization [40], and controlling for word frequency in the domain of interest. For the latter task, a common practice is to use the well-known tf-idf formula (formally described later), which assigns each word in the dataset a weight that increases with its appearance in the training text, yet is offset by the frequency of the term in the domain of interest. However, the appropriateness of applying unadjusted tf-idf to comment sentiment analysis is questionable [6], [41].

Common to much of the literature on comment classification is the analysis of comments as independent observations (e.g., [14], [15], [19], [42], [43]). Surprisingly, although the structure of the comments and their proximity are widely recognized as having a great impact on users' opinion and sentiment (as we discuss in the next section), and are often used for comment classification, comments are still treated independently for the purpose of word frequency control (e.g., [26], [31]). In practice, as we will show, between-participants' discourse creates a statistical dependency between observations (comments) that inflates the term frequency of commonly non-frequent terms in a given domain.

Our next discussion concerns the use of comment structure in classification methods, commonly referred to as the analysis of threaded discussions and author-topic / author-recipient-topic analysis.

1.2 Enhancing classification with comments' structure

Research on threaded discussions makes use of comment structure. A threaded discussion is characterized by the inherent relationship between the threaded comments and the textual dependency between them [26]. The relationship between comments, both structural and lexical, is widely recognized as useful in classification, as seen in [31], who study the quality of comments in a threaded blog discussion. On top of the standard unigrams (words' tf-idf scores) and conversational features such as comment length, order, and number of replies, they studied comments with respect to their lexical similarity to the preceding and following comments. Interestingly, they show that, for the purpose of assessing comments' quality, lexical similarity features are dominated by conversational features, and that unigrams are the least informative feature. Wanas and colleagues [32] study the informativeness of comments in a discussion, examining conversational features, structural features such as referencing and replies, and features based on cosine similarity, such as the similarity between a comment and the post's topic, the thread's topic, and its parent's (replied-to) topic in the threaded discussion. Their research highlights the power of (non-)similarity features in predicting informativeness.

The literature on comment classification in threaded discussions generally addresses goals where dependency plays a major role: attitude between commenters [29]; quality, informativeness, or uniqueness of comments (e.g., [31], [32], [33]); interestingness of comments [44]; etc. The literature commonly offers spot solutions for these classification tasks that make use of the dependency structure. Gómez et al. [44]
propose a framework to make use of threads' structure when the aim is related to the differential between comments. As far as we know, research has yet to address the inherent bias between comments for flat discussions, or for cases where the dependency does not play a major role in the classification task (such as opinion mining or sentiment analysis).

The most commonly used dataset in the threaded-discussion literature is Slashdot (www.slashdot.com). Slashdot, a popular technology discussion forum, also integrates a moderation scheme where readers can rate comments based on informativeness (scale 1-5). Comments on Slashdot usually revolve around an initial post or contribution, are generally lengthy, and contain a significant amount of content [32]. By contrast, in this paper we analyze Facebook comments, which are fairly short (15-25 words on average) and thus plausibly less informative. Moreover, readers on Facebook see the first comment in each thread, sorted by popularity (at time of access). The length of messages is shown to have a great impact on the conversation and its analysis [45]. Given that the average depth of threads on Facebook is less than 1 (extracted from our datasets), compared with 10-15 on Slashdot, it can essentially be considered a "flat" discussion.

1.3 Contributions: Articulating the bias and adjusting tf-idf

A discussion on the limitations of text analysis when applied to computer-mediated discourse (CMD) was presented in [6], in which the authors claim that existing text analysis systems are concerned with topic modeling (assigning topics to documents). In CMD data, however, text features are often either overlooked or manually extracted. Arazy and Woo [41] highlight the importance of effective statistical natural language processing (SNLP) methods to the IS community of researchers and practitioners. Their focus, on collocation indexing for compound terms, points to tf-idf as "the de facto standard" for weighting schemes that take into account a local, document-specific factor and a global corpus-level factor. After showing how standard tf-idf is ineffective for collocations, they proceeded to develop an adaptation. We have identified a similar challenge to tf-idf, showing how its unadjusted use in the analysis of social media comments is inappropriate and may result in significant bias. In the recent IS literature we observe a semi-manual approach, where classification relies on user- (often expert-) generated dictionaries or user-defined textual features (e.g., [7], [46]). Customized automated methods thus need to be developed.

This paper bridges the gap between discourse analysis, specifically CMD, and (automated) text classification. Our main contribution is in revealing the comment discourse bias and discussing its implications for text preprocessing and model evaluation. Although the literature does offer alternatives to the tf-idf weighting approach, we found no discussion of tf-idf misuse in the important and growing context of comment study. We illustrate the extent of the bias problem on several datasets. Misuse is examined and illustrated using empirical evidence.

Our second contribution is a proposed statistical correction to the bias, delivered by modifying tf-idf. Our statistical correction is kept simple for practical reasons, and alternative corrections are discussed. In contrast to the literature on threaded discussion, where the focus is to leverage dependency to derive predictors, we "flatten" the threads and remove the dependency, to extract information from the remaining unigrams. We believe that our approach will make the unbiased analysis of comments more accessible to the IS community.

2 TEXT PRE-PROCESSING WITH TF-IDF

The tf-idf weighting, first introduced in [47], stands for term frequency (tf) × inverse document frequency (idf). Tf-idf weighting is commonly used in text mining and information retrieval to evaluate the importance of a linguistic term (commonly a unigram or bigram) in a studied corpus. Term importance (weight) increases with the term's frequency in the text, yet is offset by the frequency of the term in the domain of interest (e.g., frequent words like "the" or "for" will be scaled down).

Given a collection of terms t ∈ T that appear in a set of N documents d ∈ D, each of length n_d, tf-idf weighting is computed as follows [48]:

  tf_{t,d} = f_{t,d} / n_d

  idf_t = log(N / df_t)                                   (1)

  W_{t,d} = tf_{t,d} × idf_t,

where f_{t,d} is the frequency of term t in document d, and df_t is the document frequency of term t, that is, the number of documents in which term t appears. Several variations and adjustments have been offered, including normalizing tf_{t,d} and optional weighting schemes (such as BM25) by [48], [49], [50], [51].

For the task of comment classification, 'document' is replaced by 'class', e.g., sentiment class (negative/positive) in sentiment analysis, serving to collect a set of relevant comments. Term frequency (tf) is then computed per class. Inverse to document (idf) becomes "inverse to comments", meaning that N is the size of the set of comments, and the document frequency for a term (df_t) is computed on that set. Bermingham and Smeaton [52] define this method as sentiment tf-idf. We adopt this terminology in the remainder of the paper. Comments are then classified into classes using probabilistic (e.g., Naive Bayes) or discriminative models (e.g., SVMs) [14].

A common variant of the classic tf-idf is delta idf weighting, in which idf is calculated for each class separately, and then the difference between the values is used for sentiment classification [53]. This variant has proved efficient for supervised classification at the sentence level.
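As a concrete illustration, the weighting of Equation (1) can be sketched in a few lines of Python. This is a minimal sketch, not the authors' implementation; the function name and the data layout (each document as a list of tokens) are ours.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Compute W_{t,d} = tf_{t,d} * idf_t, per Equation (1).

    docs: list of documents, each a list of tokens.
    Returns a dict mapping (term, doc_index) -> weight.
    """
    N = len(docs)
    counts = [Counter(d) for d in docs]
    df = Counter()                       # df_t: number of documents containing t
    for c in counts:
        df.update(c.keys())
    weights = {}
    for d, c in enumerate(counts):
        n_d = sum(c.values())            # document length n_d
        for t, f in c.items():
            tf = f / n_d                 # tf_{t,d} = f_{t,d} / n_d
            idf = math.log(N / df[t])    # idf_t = log(N / df_t)
            weights[(t, d)] = tf * idf
    return weights
```

For the sentiment tf-idf variant described above, the same sketch applies with `docs` replaced by the per-class comment collections for the tf factor, while idf is computed over the set of individual comments. Note that a term appearing in every document receives idf = log(1) = 0 and is scaled away entirely.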
3 BETWEEN-PARTICIPANTS' DISCOURSE CORRELATION: DEFINING THE PROBLEM

An interesting property of comments to online posts is the discourse generated between the participating commenters. When part of an online discussion, discourse is shown to be content-dependent [14], [15]. A prominent phenomenon in this respect is the presence of terms that repeatedly occur
across a sequence of comments; such terms are related neither to the topic nor to the sentiment. An example of such textual dependency is presented in Figure 1: a partial comment thread on a news article from the news.discovery.com site. The title of the article is "App does math homework with phone camera". The first commenter compared this new application to other products: Google glass and PhotoMath. The second commenter cynically replies to the article, using the same comparison as the former user. The third participant, who does not "comprehend the joke", replies to the idea presented by the second commenter, and is then replied to by the fourth commenter, who repeats the exact same words in an ironic fashion.

It is important to note that neither the terms Google glass, nor PhotoMath, nor the repeated discourse of the commenters, appear anywhere in the text of the original article being commented upon. Thus, despite the high correlation of these terms within the comment space, they would have low correlations with the main text.

In the example presented in Figure 1, the comments are threaded; that is, users post comments on comments ("reply to"). In an interesting study, Agrawal et al. [54] observed that the relationship between two individuals in a threaded discussion is much more likely to be antagonistic (74%) than reinforcing (7%).

In a flat discussion, in which comments appear consecutively (no "reply to" option), textual dependency is also apparent, commonly following a "quoting" term - a reference to another post by quoting part of it or by tagging the commenter's user name or ID. In a study of quoting behavior among participants on the politics.com discussion site, Mullen and Malouf [55] found that 10% of the posts contain quoted material, and as much as 55.7% of the users quote comments or are quoted by others at least once. Here again, the majority of the participants quote users at the opposite end of the political spectrum.

Dependency between discourses and the repeated occurrence of terms introduce bias into tf-idf weighting when it is used for text preprocessing. The tf-idf technique assumes completeness of the domain of interest: idf is the term's inverse frequency in the entire domain, and tf is the term's frequency in a complete document, or class (in sentiment tf-idf). In practice, the "domain of interest" is replaced by the training set (see Equation 1). Therefore, correlation between discourses, characterized by increased term frequency in comments to a single post (beyond its frequency in the domain), erroneously decreases idf and increases tf. We prove this observation in Appendix A, and provide empirical evidence in the next section.

3.1 Discourse dependency structures

Dependency between comments can result from multiple causes, as listed below:

Comment-to-commenter: a given commenter may use similar vocabulary in several comments, resulting in dependency between comments posted by the same commenter.

Comment-to-Social Network (SN): commenters from the same social circles may have a similar discourse style, and may use the same terms and phrases.

Comment-to-document: comments often discuss the document's content. We therefore expect to observe the same set of terms used by several commenters commenting on the same article.

Comment-to-comment: comments are often dependent on previous comments, either because they are influenced by them or simply because they reply to or quote them. This dependency is expected to be time-dependent: once a term appears, it is more likely to reappear in succeeding comments. This dependency possibly decays along the thread of comments, and may have a smaller impact on comments that are further down the comment thread. One might also argue that threaded comments are expected to exhibit higher correlation compared to a flat discussion.

When the dataset is large enough, the comment-to-commenter dependency is mitigated, as the effect of individual commenters is small and diverse and thus can be averaged out. Large datasets typically mitigate the comment-to-SN dependency as well, as long as there are multiple social circles contributing to that dataset, and those circles are small compared to the size of the commenters' group. Diversity among multiple such circles allows averaging out of their effect. In cases where there are only a few social circles, or their sizes are relatively large, comment-to-SN dependency persists. However, given the nature of CMD, the two other dependency structures (comment-to-document and comment-to-comment) are likely to bias datasets of all sizes, in different contexts. This is because influence, replying, and quoting may apply to large sets of comments, e.g., the influence of a document on all of the comments posted to it, and thus generate dependency even in large datasets. More importantly, these two dependency structures exhibit fast-paced, time-dependent dynamics of change. Commenters are typically influenced by recent comments as well as by the document to which they post comments. This pattern of influence is non-stationary and has a significant effect on time-series performance. Therefore, influence patterns should be controlled for.

4 EMPIRICAL EVIDENCE OF THE BIAS

We examine the discourse correlation and its bias to tf-idf on two Facebook fan pages with large user activity. The first is the fan page of the TV show CommunityTV (www.facebook.com/communitytv). CommunityTV has over 1.7M fans and its posts are mostly promotional. The second fan page is the news page SPORT1News (www.facebook.com/SPORT1News), which maintains nearly 1M fans. SPORT1News, as the title conveys, discusses sport news. For each page we collected a recent set of 1000 documents (posts) and their comments. The average (median) number of comments per post is approximately 172 in CommunityTV, and 96 in SPORT1News.

4.1 Discourse dependency network

To illustrate dependency between comments posted on the same document, we construct a dependency network in which the set of nodes represents the set of comments, and
ties between two nodes (comments), and their weights, correspond to the number of shared words². We evaluate these networks against a baseline, in which for each document we generate a network of the same size (total number of comments) of randomly selected comments that were posted on different documents (and are thus lexically independent). Examples of dependency networks for comments posted on the same document are given in Figure 2 (for the CommunityTV fan page). The darker node in the figure represents the document itself. Interestingly, we see that comments do not share content with the document, yet discourse is generated between many of the commenters. Similar comment dependency networks are observed for all documents posted on both fan pages.

We then measure dependency metrics of the networks and compare the measurements with those of baseline networks. The metrics measured and compared are:

Number of components: measures the emergence of communities around topics. A network with fewer components is one in which commenters share similar terms (yet not necessarily similar ideas, opinions, or sentiments).

Node degree: indicates the popularity of a node (comment) in terms of the terms used. When a degree in a dependency network is significantly higher than that of a baseline network, it likely implies the existence of comments with multiple citations or replies (discourse).

Betweenness: high network betweenness corresponds to scenarios of threads within comments (commenter c replies to commenter b, and commenter b replies to commenter a; etc.). High values of betweenness might indicate the effect of temporality in a network.

TABLE 1: Difference between dependency and baseline networks

  Measurement    CommunityTV            Sport1News
  #Components    -7.3                   -3.8
  Degree         17.2                   12.6
  Betweenness    -0.8 (insignificant)   3.7

Table 1 presents the results of the comparison between the networks, controlling for network size³. As expected, when compared to the baseline, the dependency networks exhibit on average fewer components and a higher node degree. Betweenness was insignificant in the CommunityTV dataset, indicating a low impact of temporality.

In the following experiment, we illustrate the bias that discourse correlation introduces to tf-idf weights, and its implications for the classification task. As a preprocessing step we split the data into training (50%) and holdout (50%) sets, as is commonly done in classification tasks.

4.2 A discussion on training and holdout sets

According to [56], [57], the statistical performance of a model is strictly linked to the correct selection of the holdout sample. Stein [57] articulates that "with time varying processes . . . holdout testing can miss important model

2. Preprocessing steps taken were: removal of stop words, numbers, and punctuation, and word stemming.
3. The model used: Y = a + b1[Network Size] + b2[Indicator of Dependency Network]; Table 1 depicts the value of b2.
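The dependency network of Section 4.1 can be sketched as follows. This is a minimal illustration, not the authors' code; the function names and the union-find component count are ours. Nodes are comments, an edge links two comments sharing at least one word, and the edge weight is the number of shared words (after the preprocessing of footnote 2).

```python
from itertools import combinations

def dependency_network(comments):
    """Build the comment dependency network: `comments` is a list of
    token sets (stop words removed, stemmed). Returns a dict mapping
    node pairs (i, j) -> number of shared words (the edge weight)."""
    edges = {}
    for i, j in combinations(range(len(comments)), 2):
        shared = len(comments[i] & comments[j])
        if shared > 0:
            edges[(i, j)] = shared
    return edges

def n_components(n_nodes, edges):
    """Number of connected components, via union-find."""
    parent = list(range(n_nodes))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x
    for i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
    return len({find(x) for x in range(n_nodes)})
```

The baseline comparison would run the same construction on equally many randomly selected comments drawn from different documents, and compare component counts and degrees, as in Table 1.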
and comprehensibility of the weighting. Hence, while many adjustment approaches are possible, we opt for a simple yet well-justified approach. In a nutshell, we examine term frequency and, in case a term is overly frequent, we adjust its tf-idf weight as detailed below.

We keep the terminology from the previous section and adjust the sentiment tf-idf. Recall that in sentiment tf-idf, idf is computed on a set of comments, and tf_d is computed per sentiment class. We define a comment thread as the set of comments that are posted to a single document (regardless of the comments' actual structure). We denote the comment thread posted to document i by c_i. C = {c_1, ..., c_n} is the set of all comment threads associated with all n documents.

Discourse correlation inflates the frequency of terms in certain comment threads, but not in others. To determine whether a term t is over-frequent in a specific comment thread c_i, we compare the number of comment threads in which t appears (the comment thread frequency, denoted cf_t) to the number of comment threads we expect t to appear in, in the absence of correlation (the expected comment thread frequency, denoted ecf_t and computed in the next section). We then use the ratio r_t between cf_t and ecf_t to determine whether t is over-frequent in c_i. The term t is considered over-frequent in c_i if this ratio is smaller than 1. Had t appeared in as many comment threads as expected or more (i.e., r_t ≥ 1), its high frequency in c_i would have been statistically expected. However, a smaller-than-1 ratio indicates that t appears infrequently across comment threads, while it is frequent in c_i. This suggests that t is over-frequent in c_i. Put in other words, given its frequency in c_i, a ratio smaller than 1 indicates that t appears in fewer comment threads than expected without correlation.

5.1 Computing expected comment thread frequency

The expected comment thread frequency ecf_t is essentially the expected number of comment threads a term t should appear in, given its frequency (tf_t). The problem of computing ecf_t is equivalent to the "balls into bins" problem [58], with weighted bins: allocate tf_t independent occurrences of term t ("balls") into n comment threads ("bins"), with the comment thread probability p(c_i) being proportional to its length (the total length of all comments in that thread), and with Σ_{i=1}^{n} p(c_i) = 1. The objective is to find the expected number of "non-empty bins", which is the expected comment thread frequency.

We define the random variable z_{t,i}:

  z_{t,i} = 1, if comment thread i does not contain the term t;
            0, otherwise.                                          (4)

Note that such a random variable has a Bernoulli distribution. Computing the exact probabilities in the case of the Bernoulli distribution is hard; hence, as is common in the art, we use a Poisson approximation instead [59]. The probability that comment thread i does not contain term t is then:

  Pr(z_{t,i} = 1) = (1 − p(c_i))^{tf_t} ≈ e^{−tf_t·p(c_i)} ≡ e^{−λ_{t,i}}    (5)

Owing to the linearity of the expected value, the total expected number of comment threads that do not contain term t satisfies:

  E(z_t) = Σ_{i=1}^{n} E(z_{t,i}) = Σ_{i=1}^{n} Pr(z_{t,i} = 1)    (6)
6.1 Data

Our data were compiled from seven Facebook fan pages selected from different domains, including news, finance, politics, sport, shopping, and entertainment. This afforded diversity in topic and size, thus facilitating wider applicability of our results. The description, characteristics, and descriptive statistics of the pages are given in Table 2.

For each page, we collected the entire set of documents (posts) and threads (first-level comments⁴). We additionally computed the average number of threads per document, the average number of words per comment (referred to as comment length), the average number of replies in threads, and their average depth. Since Facebook incorporated a threaded commenting system in mid-2013 [63], the latter two measurements were computed for recent threads only, specifically those retrieved from 2016.

Following our discussion in Section 4.2, fan pages were split into training and holdout sets, such that 50% of each page's posts, along with their entire threads, were used as training, and the remaining 50% of posts were used as holdout.

[...] in a class. Practically, a low (high) RMSD value indicates the ability (inability) of the training set to capture the features of the data when applied to holdout sets. RMSD is computed for each method (e.g., adjusted tf-idf, non-adjusted tf-idf) separately, as follows:

    RMSD = sqrt( (1/N) · Σ_{t,d} ( Ŵ¹_{t,d} − Ŵ²_{t,d} )² )        (8)

where Ŵ¹_{t,d} and Ŵ²_{t,d} denote the weight of term t in class d computed on the training and holdout sets, respectively, and N is the number of (term, class) pairs.

Spearman's Rank Correlation Coefficient (Sρ): this non-parametric measure compares the tf-idf weights' rank of the training set to that of the holdout set, as follows:

    Sρ = ρ( R(Ŵ¹_{t,d}), R(Ŵ²_{t,d}) )        (9)

where R(·) denotes the rank of a weight within its class. Practically, Sρ examines whether the relative importance (denoted ≥) of a term in a class can be inferred from the training set. Mathematically, Sρ computes the degree to which the following relationship holds: Ŵ²_{t_A,d} ≥ Ŵ²_{t_B,d} ⇒ Ŵ¹_{t_A,d} ≥ Ŵ¹_{t_B,d}.
4. The vast majority of comments in Facebook pages are first-level comments (see average thread depth).
5. Greater number of folds showed no statistically significant difference in performance.
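The two evaluation measures above can be sketched in code. This is a minimal illustration, not the paper's implementation; the weight maps, keyed by (term, class) pairs, are hypothetical:

```python
import math

def rmsd(w_train, w_holdout):
    """Root-mean-square deviation between training and holdout tf-idf
    weights over their shared (term, class) keys (cf. Eq. 8)."""
    keys = w_train.keys() & w_holdout.keys()
    return math.sqrt(sum((w_train[k] - w_holdout[k]) ** 2 for k in keys) / len(keys))

def _ranks(values):
    """Rank values in ascending order (1 = smallest), averaging ties."""
    order = sorted(range(len(values)), key=values.__getitem__)
    r = [0.0] * len(values)
    i = 0
    while i < len(values):
        j = i
        while j + 1 < len(values) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            r[order[k]] = (i + j) / 2 + 1  # average rank for tied values
        i = j + 1
    return r

def spearman(w_train, w_holdout):
    """Spearman's rho: Pearson correlation of the weight ranks (cf. Eq. 9)."""
    keys = sorted(w_train.keys() & w_holdout.keys())
    rx = _ranks([w_train[k] for k in keys])
    ry = _ranks([w_holdout[k] for k in keys])
    n = len(keys)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)

# Hypothetical weights: identical rank order gives rho close to 1.
w_train = {("win", "PB"): 0.42, ("lottery", "PB"): 0.31, ("nice", "NPB"): 0.18}
w_hold = {("win", "PB"): 0.40, ("lottery", "PB"): 0.35, ("nice", "NPB"): 0.20}
print(rmsd(w_train, w_hold), spearman(w_train, w_hold))
```

A low RMSD together with a Spearman's rho near 1 would indicate that the holdout weights preserve both the magnitudes and the ordering learned on the training set.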
1041-4347 (c) 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TKDE.2018.2840127, IEEE
Transactions on Knowledge and Data Engineering
JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 9
TABLE 2: Descriptive statistics of the Facebook fan pages

Facebook Page    #Threads   #Documents   Threads per Document   Comment Length   #Replies per Thread   Thread Depth
Community           59572         2481                  24.01            19.50                  1.08           0.29
Amazon              95126         4670                  20.37            15.50                  0.81           0.29
CNN Politics       656357        28012                  23.43            37.00                  3.79           0.52
The Economist      257375        14111                  18.24            34.50                  1.65           0.32
BBC News           445741        18065                  24.67             5.50                  3.34           0.40
Sport1             779599        48592                  16.04            12.00                  2.12           0.36
CBC News           852303        39512                  21.57            71.50                  2.47           0.38
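The Threads per Document column is the ratio of the two count columns; for example, for the first three pages of Table 2:

```python
# Threads per Document = #Threads / #Documents, rounded to two decimals.
pages = {
    "Community": (59572, 2481),
    "Amazon": (95126, 4670),
    "CNN Politics": (656357, 28012),
}
ratios = {name: round(t / d, 2) for name, (t, d) in pages.items()}
print(ratios)  # {'Community': 24.01, 'Amazon': 20.37, 'CNN Politics': 23.43}
```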
TABLE 3: Paired t-tests for RMSD

Facebook Page    Adjusted tf-idf compared    Adjusted tf-idf compared
                 with Non-Adjusted tf-idf    with K-fold CV
Community        -0.18                       0.26∗∗
Amazon           -0.66∗∗                     -0.61∗∗
CNN Politics     -0.91∗∗                     1.49
The Economist    -1.95∗∗∗                    0.45
BBC News         -1.82∗∗∗                    0.51
Sport1           -0.78∗∗∗                    0.001
CBC News         -1.68∗∗∗                    -0.61

TABLE 4: Paired t-tests for Spearman's Rank Correlation

Facebook Page    Adjusted tf-idf compared    Adjusted tf-idf compared
                 with Non-Adjusted tf-idf    with K-fold CV
Community        0.007∗∗∗                    -0.04∗∗∗
Amazon           0.01∗∗∗                     -0.005∗∗∗
CNN Politics     0.008∗∗∗                    -0.02∗∗∗
The Economist    0.01∗∗∗                     -0.01∗∗∗
BBC News         0.01∗∗∗                     -0.01∗∗∗
Sport1           0.005∗∗∗                    -0.01∗∗∗
CBC News         0.01∗∗∗                     -0.01∗∗∗
Fig. 5: Comparison of RMSD of the three approaches

Fig. 6: Comparison of Spearman's Rank correlation of the three approaches
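Tables 3 and 4 report paired t-tests over matched evaluation samples. A minimal sketch of such a test, on hypothetical per-fold RMSD values (the numbers below are illustrative, not taken from the paper):

```python
import math
from statistics import mean, stdev

def paired_t(x, y):
    """Paired t-statistic for two matched samples: t = mean(d) / (sd(d) / sqrt(n))."""
    d = [a - b for a, b in zip(x, y)]
    return mean(d) / (stdev(d) / math.sqrt(len(d)))

# Illustrative fold-level RMSD values for two methods on the same folds.
adjusted = [0.41, 0.38, 0.44, 0.40, 0.39]
non_adjusted = [0.52, 0.50, 0.55, 0.49, 0.51]
t = paired_t(adjusted, non_adjusted)  # negative t: adjusted RMSD is lower
```

In the tables, a significantly negative statistic against non-adjusted tf-idf (lower RMSD) favors the adjusted method, while the Spearman comparison works in the opposite direction (higher correlation is better).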
Fig. 11: Comparison of Spearman's Quantile Rank correlations for varying data sizes (1M, 2M, 5M), illustrated on the CNN Politics page. (Panels plot mean correlation against quantile for the adjusted and non-adjusted tf-idf approaches.)

UMD news articles may be replaced by supposedly non-identifying initials. Comments published by private users may (unintentionally) reveal the undisclosed material data, [...] with him for years ...". Given the readily available identity of that commenter, it only takes a few clicks to identify the lottery winner. We therefore treat this comment and other [...] with an aggregate total of 3,538 annotated comments, out of which 149 (4.21%) were labeled as PB comments. The task of this study is to classify comments as PB comments or non-privacy-breaching (NPB) comments, using comment classification techniques. This task falls under the umbrella of sentiment analysis, as PB comments are a user's way to express that they know more than the article reveals. [...] comments on both existing articles and new (future) articles.

7.1 Applying tf-idf (without adjustment)

We construct a data-driven classifier that utilizes classic sentiment tf-idf. We define two sentiment classes, PB and NPB, each grouping its set of relevant comments. Term frequency (tf) is computed per class. Inverse document frequency (idf) is computed separately for each class on its set of comments. To avoid the impact of unique words, and consequently overfitting, we chose the 20 most frequent words (in the training set) to represent each class. Other thresholds were examined to assess the sensitivity of the prediction accuracy. We use logistic regression to predict class membership. Note that the classifier improves significantly when more predictors are added to the model; however, to illustrate the bias introduced by the discourse, a simple model suffices.

Out of 48 UMD articles, we randomly sample 30 articles to serve as training data; the remaining 18 serve as the holdout set. The tf-idf was trained on the training set. To examine the robustness of the method on out-of-sample observations, we compared the performance of the method on the training and holdout sets. The whole procedure was repeated several times to verify model consistency.

The comparative performance is summarized in Figure 12, which presents a typical ROC (receiver operating characteristic) curve, depicting the trade-off between the False Positive Rate (1 − Specificity) and the True Positive Rate (Sensitivity) for different internal thresholds of the logistic model. Positive in our case is the detection of a comment as a PB comment. The gray line in the chart represents the performance of the model on a training set. The black line represents performance on the holdout. Clearly, the performance of the model on out-[...]. Over multiple samples we learned that the performance is inconsistent and varies across samples (not shown in the figure). We attribute the offset between performances to the bias caused by between-participants discourse correlation.

Fig. 12: Method's robustness. (ROC curves comparing training and holdout performance; axes: False Positive Rate (1 − Specificity) vs. True Positive Rate (Sensitivity).)

7.2 Ignoring dependency structure

What if we entirely ignore the bias and treat all comments as independent observations? In this example, we randomly sample 60% of the comments (rather than news articles) to serve as training, leaving 40% of the comments to validate the model. Here again, we repeat the analysis several times. A typical performance comparison is depicted in Figure 13, showing statistically equal performance. This can be explained by the fact that, in this scenario, the discourse correlation that exists in the training set is replicated in the validation set. Therefore, the bias it causes does not affect performance on the holdout. Obviously, this model will not perform similarly on new, out-of-sample observations.

Fig. 13: Deceiving model. (ROC curves; same axes as Figure 12.)

7.3 Discourse correlation

The comments in our data are highly correlated for two reasons. First, we observe correlation between comments and the article itself. Second, we observe that new commenters' responses to existing comments generate correlation between comments. In Figure 14 (left-hand side) we attempt to visualize the amount of correlation in our dataset. In black ink, we scatter-plot the joint distribution of word frequency in comments (the number of comments
in which it appears) on the x-axis, and word frequency in news articles (the number of articles in which it appears) on the y-axis. Since some articles attract more comments, the concave trend (black line), which represents the smoothed relationship, is expected. It is observed (see Figure 14, right-hand side) that several words, such as 'hero', 'king', and 'health', are much more frequent in fewer cases compared with the others (the figure is trimmed at the 95% quantile for better visibility).

Fig. 14: Discourse correlation. (Two scatter panels: word frequency in comments on the x-axis vs. frequency in articles on the y-axis; highlighted words include 'man', 'you', 'hero', 'exciting', 'king', and 'health'.)

We next compare the observed correlation with the expected joint distribution of word frequency in comments/articles, assuming comment independence. We randomly assign words to articles, keeping term frequency and total comment length per article constant. That is: (1) the number of repetitions of term t in the experiment equals the number of repetitions in the raw data, and (2) article i is assigned w_i random words, some possibly repetitive, where w_i is the total length of the comments to this article in the raw data. The gray dots in Figure 14 present the resultant joint distribution. The gray line is the smoothed relationship between the variables. This experiment shows some correlation between nearly all words in the dataset.

7.4 Applying adjusted tf-idf to the data

We applied the correction in Equation 2 to the text-preprocessing step. We then re-trained the classifier on a training set consisting of 30 randomly sampled articles. We repeated the sampling-training procedure several times and compared the results on the training and holdout sets. The results are given in Figure 15, demonstrating the removal of the discourse bias: the adjusted model is robust to out-of-sample observations.

Fig. 15: Adjusted model. (ROC curves; axes: False Positive Rate (1 − Specificity) vs. True Positive Rate (Sensitivity).)

8 CONCLUSION

Many contemporary social media web sites, with Facebook being one, incorporate commenting systems that allow people to respond to posts on the web sites. Commenting systems encourage discourse exchanges between participants [...] be content-dependent. This dependency, as we have shown in this paper, introduces bias to tf-idf, a widely used term-weighting technique for text preprocessing. We have further shown that this bias, if ignored, can manifest in a non-robust method at best, and can lead to an entirely wrong conclusion at worst.

For decades, tf-idf has played a major role in information retrieval and text mining. Early on, tf-idf aimed to model the [...] short sentences (e.g., tweets and comments), a significant deviation from its intended usage. The short-sentence environment, and even more so the tweet and comment environments, introduce two major challenges to tf-idf. Firstly, a large refer[...] sentiment analysis. In this study we discuss this bias; we show that it mainly relates to discourse and topic, and that it is not part of the sentiment, in particular when the goal is prediction (out-of-sample observations with new topics and new discourse).

We proposed an adjustment that corrects for the discourse bias when applying tf-idf to comments on online posts. However, it should be noted that the main contribution of this paper is in exposing the potential misuse of tf-idf when applied to dependent text. Other approaches can also be considered to quantify and/or mitigate the bias. One example is that suggested by [66], which surveys different sources as potential textual baselines. Another example is k-fold cross-validation, which is commonly used in sentiment analysis, where each of the k samples contains a list of comments to different documents (unlike the common practice, which is random sampling). While k-fold is expected to give a fair solution to the bias, it may be prohibitively expensive when applied to large datasets. This limitation is crucial in the context of user-generated content and social media, where Big Data is being considered.

In our case studies, some comment threads exhibit high discourse correlation, whereas others do not. Other domains may contain higher correlations. However, the adjustment presented in this paper is applicable to any level of correlation (even to a no-correlation scenario), as the adjustment coefficient is proportional to the observed bias.

While we present prominent advantages of adjusting the tf-idf weights, there are some limitations as well. In particular, when the term population is small, tf-idf weighting may be biased, yet the adjustment we introduce may overcompensate for this bias, introducing another bias. For instance, there are cases in which some terms are used only by a small group of people and therefore their bias correction
could likely be flawed.

Despite the limitations of our approach, it is evident that bias correction of the sort we propose can significantly improve comment classification accuracy and processing time. As we show in three different examples, bias indeed exists; hence our approach is highly relevant. In fact, bias of the type we have identified in comment and tweet environments exists in other domains as well, as we have found in yet-unpublished research. This suggests that, prior to using tf-idf, researchers should check for bias in their target domain and apply corrective adjustments where applicable, to avoid undesirable analysis results.

REFERENCES

[1] N. B. Ellison et al., "Social network sites: Definition, history, and scholarship," Journal of Computer-Mediated Communication, vol. 13, no. 1, pp. 210–230, 2007.
[2] H. Chen, R. H. Chiang, and V. C. Storey, "Business intelligence and analytics: From big data to big impact," MIS Quarterly, vol. 36, no. 4, 2012.
[3] V. Dhar, "Data science and prediction," Communications of the ACM, vol. 56, no. 12, pp. 64–73, 2013.
[4] M. Saar-Tsechansky, "Editor's comments: The business of business data science in IS journals," MIS Quarterly, vol. 39, no. 4, pp. iii–vi, 2015.
[5] J. P. Gee, An Introduction to Discourse Analysis: Theory and Method, 2014.
[6] A. Abbasi and H. Chen, "CyberGate: A design framework and system for text analysis of computer-mediated communication," MIS Quarterly, pp. 811–837, 2008.
[7] E. Vaast, E. J. Davidson, and T. Mattson, "Talking about technology: The emergence of a new actor category through new media," MIS Quarterly, vol. 37, no. 4, 2013.
[8] D. Surian, S. Seneviratne, A. Seneviratne, and S. Chawla, "App miscategorization detection: A case study on Google Play," IEEE Transactions on Knowledge and Data Engineering, vol. 29, no. 8, pp. 1591–1604, 2017.
[9] B. Shi, G. Poghosyan, G. Ifrim, and N. Hurley, "Hashtagger+: Efficient high-coverage social tagging of streaming news," IEEE Transactions on Knowledge and Data Engineering, vol. 30, no. 1, pp. 43–58, 2018.
[10] W. Hua, Z. Wang, H. Wang, K. Zheng, and X. Zhou, "Understand short texts by harvesting and analyzing semantic knowledge," IEEE Transactions on Knowledge and Data Engineering, vol. 29, no. 3, pp. 499–512, 2017.
[11] H. Zhang, S. Wang, Z. Mingbo, X. Xu, and Y. Ye, "Locality reconstruction models for book representation," IEEE Transactions on Knowledge and Data Engineering, 2018.
[12] M. Potthast, B. Stein, F. Loose, and S. Becker, "Information retrieval in the commentsphere," ACM Transactions on Intelligent Systems and Technology (TIST), vol. 3, no. 4, p. 68, 2012.
[13] K. Filippova and K. B. Hall, "Improved video categorization from text metadata and user comments," in Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 2011, pp. 835–842.
[14] S. Siersdorfer, S. Chelaru, W. Nejdl, and J. San Pedro, "How useful are your comments?: Analyzing and predicting YouTube comments and comment ratings," in Proceedings of the 19th International Conference on World Wide Web. ACM, 2010, pp. 891–900.
[15] S. Siersdorfer, S. Chelaru, J. S. Pedro, I. S. Altingovde, and W. Nejdl, "Analyzing and mining comments and comment ratings on the social web," ACM Transactions on the Web, vol. 8, no. 3, p. 17, 2014.
[16] M. Hu, A. Sun, and E.-P. Lim, "Comments-oriented document summarization: Understanding documents with readers' feedback," in Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 2008, pp. 291–298.
[17] G. Mishne, N. Glance et al., "Leave a reply: An analysis of weblog comments," in Third Annual Workshop on the Weblogging Ecosystem, Edinburgh, Scotland, 2006.
[18] A. Schuth, M. Marx, and M. De Rijke, "Extracting the discussion structure in comments on news-articles," in Proceedings of the 9th Annual ACM International Workshop on Web Information and Data Management. ACM, 2007, pp. 97–104.
[19] X. Wang, J. Bian, Y. Chang, and B. Tseng, "Model news relatedness through user comments," in Proceedings of the 21st International Conference on World Wide Web. ACM, 2012, pp. 629–630.
[20] D. Eberle, G. Berens, and T. Li, "The impact of interactive corporate social responsibility communication on corporate reputation," Journal of Business Ethics, vol. 118, no. 4, pp. 731–746, 2013.
[21] R. Shi, P. Messaris, and J. N. Cappella, "Effects of online comments on smokers' perception of antismoking public service announcements," Journal of Computer-Mediated Communication, vol. 19, no. 4, pp. 975–990, 2014.
[22] J. B. Walther, D. DeAndrea, J. Kim, and J. C. Anthony, "The influence of online comments on perceptions of antimarijuana public service announcements on YouTube," Human Communication Research, vol. 36, no. 4, pp. 469–492, 2010.
[23] Y.-J. Chang, Y.-S. Chang, S.-Y. Hsu, and C.-H. Chen, "Social network analysis to blog-based online community," in International Conference on Convergence Information Technology. IEEE, 2007, pp. 2193–2198.
[24] M. Ziegele, T. Breiner, and O. Quiring, "What creates interactivity in online news discussions? An exploratory analysis of discussion factors in user comments on news items," Journal of Communication, vol. 64, no. 6, pp. 1111–1138, 2014.
[25] M. Zhu, W. Hu, and O. Wu, "Topic detection and tracking for threaded discussion communities," in IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology (WI-IAT '08), vol. 1. IEEE, 2008, pp. 77–83.
[26] D. Feng, J. Kim, E. Shaw, and E. Hovy, "Towards modeling threaded discussions using induced ontology knowledge," in Proceedings of the 21st National Conference on Artificial Intelligence, vol. 21, no. 2. AAAI Press, 2006, pp. 1289–1294.
[27] D. Davidov, O. Tsur, and A. Rappoport, "Semi-supervised recognition of sarcastic sentences in Twitter and Amazon," in Proceedings of the Fourteenth Conference on Computational Natural Language Learning. Association for Computational Linguistics, 2010, pp. 107–116.
[28] R. Justo, T. Corcoran, S. M. Lukin, M. Walker, and M. I. Torres, "Extracting relevant knowledge for the detection of sarcasm and nastiness in the social web," Knowledge-Based Systems, vol. 69, pp. 124–133, 2014.
[29] A. Hassan, V. Qazvinian, and D. Radev, "What's with the attitude?: Identifying sentences with attitude in online discussions," in Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2010, pp. 1245–1255.
[30] V. Qazvinian, E. Rosengren, D. R. Radev, and Q. Mei, "Rumor has it: Identifying misinformation in microblogs," in Proceedings of the Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2011, pp. 1589–1599.
[31] N. FitzGerald, G. Carenini, G. Murray, and S. Joty, "Exploiting conversational features to detect high-quality blog comments," Advances in Artificial Intelligence, pp. 122–127, 2011.
[32] N. Wanas, M. El-Saban, H. Ashour, and W. Ammar, "Automatic scoring of online discussion posts," in Proceedings of the 2nd ACM Workshop on Information Credibility on the Web. ACM, 2008, pp. 19–26.
[33] E. Momeni, K. Tao, B. Haslhofer, and G.-J. Houben, "Identification of useful user comments in social media: A case study on Flickr Commons," in Proceedings of the 13th ACM/IEEE-CS Joint Conference on Digital Libraries. ACM, 2013, pp. 1–10.
[34] T. Nasukawa and J. Yi, "Sentiment analysis: Capturing favorability using natural language processing," in Proceedings of the 2nd International Conference on Knowledge Capture. ACM, 2003, pp. 70–77.
[35] K. Dave, S. Lawrence, and D. M. Pennock, "Mining the peanut gallery: Opinion extraction and semantic classification of product reviews," in Proceedings of the 12th International Conference on World Wide Web. ACM, 2003, pp. 519–528.
[36] B. Liu, "Sentiment analysis and opinion mining," Synthesis Lectures on Human Language Technologies, vol. 5, no. 1, pp. 1–167, 2012.
[37] B. Liu and L. Zhang, "A survey of opinion mining and sentiment analysis," in Mining Text Data. Springer, 2012, pp. 415–463.
[38] B. Pang, L. Lee et al., "Opinion mining and sentiment analysis," Foundations and Trends in Information Retrieval, vol. 2, no. 1–2, pp. 1–135, 2008.
[39] M. Kantrowitz, B. Mohit, and V. Mittal, "Stemming and its effects on TFIDF ranking (poster session)," in Proceedings of the 23rd Annual
International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 2000, pp. 357–359.
[40] M. Toman, R. Tesar, and K. Jezek, "Influence of word normalization on text classification," Proceedings of InSciT, vol. 4, pp. 354–358, 2006.
[41] O. Arazy and C. Woo, "Enhancing information retrieval through statistical natural language processing: A study of collocation indexing," MIS Quarterly, pp. 525–546, 2007.
[42] M. Gamon, "Sentiment classification on customer feedback data: Noisy data, large feature vectors, and the role of linguistic analysis," in Proceedings of the 20th International Conference on Computational Linguistics. Association for Computational Linguistics, 2004, p. 841.
[43] A. Kennedy and D. Inkpen, "Sentiment classification of movie reviews using contextual valence shifters," Computational Intelligence, vol. 22, no. 2, pp. 110–125, 2006.
[44] V. Gómez, H. J. Kappen, N. Litvak, and A. Kaltenbrunner, "A likelihood-based framework for the analysis of discussion threads," World Wide Web, vol. 16, no. 5–6, pp. 645–675, 2013.
[45] S. Lee, J. Baker, J. Song, and J. C. Wetherbe, "An empirical comparison of four text mining methods," in 2010 43rd Hawaii International Conference on System Sciences (HICSS). IEEE, 2010, pp. 1–10.
[46] A. Ghose, P. G. Ipeirotis, and B. Li, "Designing ranking systems for hotels on travel search engines by mining user-generated and crowdsourced content," Marketing Science, vol. 31, no. 3, pp. 493–520, 2012.
[47] G. Salton and C.-S. Yang, "On the specification of term values in automatic indexing," Journal of Documentation, vol. 29, no. 4, pp. 351–372, 1973.
[48] C. Manning, P. Raghavan, and H. Schütze, "Language models for information retrieval," Introduction to Information Retrieval, pp. 237–252, 2008.
[49] G. Paltoglou and M. Thelwall, "A study of information retrieval weighting schemes for sentiment analysis," in Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 2010, pp. 1386–1395.
[50] S. Robertson, "Understanding inverse document frequency: On theoretical arguments for IDF," Journal of Documentation, vol. 60, no. 5, pp. 503–520, 2004.
[51] G. Salton and C. Buckley, "Term-weighting approaches in automatic text retrieval," Information Processing & Management, vol. 24, no. 5, pp. 513–523, 1988.
[52] A. Bermingham and A. F. Smeaton, "On using Twitter to monitor political sentiment and predict election results," Sentiment Analysis where AI meets Psychology (SAAIP), pp. 2–10, 2011.
[53] J. Martineau and T. Finin, "Delta TFIDF: An improved feature space for sentiment analysis," ICWSM, vol. 9, p. 106, 2009.
[54] R. Agrawal, S. Rajagopalan, R. Srikant, and Y. Xu, "Mining newsgroups using networks arising from social behavior," in Proceedings of the 12th International Conference on World Wide Web. ACM, 2003, pp. 529–535.
[55] T. Mullen and R. Malouf, "A preliminary investigation into sentiment analysis of informal political discourse," in AAAI Spring Symposium: Computational Approaches to Analyzing Weblogs, 2006, pp. 159–162.
[56] G. Shmueli et al., "To explain or to predict?" Statistical Science, vol. 25, no. 3, pp. 289–310, 2010.
[57] R. M. Stein, "Benchmarking default prediction models: Pitfalls and remedies in model validation," Moody's KMV, New York, vol. 20305, 2002.
[58] D. Sprott, "Urn models and their application: An approach to modern discrete probability theory," 1978.
[59] L. H. Chen, "Poisson approximation for dependent trials," The Annals of Probability, pp. 534–545, 1975.
[60] B. Pang, L. Lee, and S. Vaithyanathan, "Thumbs up?: Sentiment classification using machine learning techniques," in Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing, Volume 10. Association for Computational Linguistics, 2002, pp. 79–86.
[61] E. Riloff, S. Patwardhan, and J. Wiebe, "Feature subsumption for opinion analysis," in Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2006, pp. 440–448.
[62] P. Godfrey, "Balls and bins with structure: Balanced allocations on hypergraphs," in Proceedings of the Nineteenth Annual ACM-SIAM Symposium on Discrete Algorithms. Society for Industrial and Applied Mathematics, 2008, pp. 511–517.
[63] V. Lavrusik, "Improving conversations on Facebook with replies," Journalists on Facebook, vol. 25, 2013.
[64] G. Seni and J. F. Elder, "Ensemble methods in data mining: Improving accuracy through combining predictions," Synthesis Lectures on Data Mining and Knowledge Discovery, vol. 2, no. 1, pp. 1–126, 2010.
[65] D. G. Schwartz, I. Yahav, and G. Silverman, "News censorship in online social networks: A study of circumvention in the commentsphere," Journal of the Association for Information Science and Technology, vol. 68, no. 3, pp. 569–582, 2017.
[66] V. Yatsko, S. Dixit, A. Agrawal, and S. Myint, "TF*IDF revisited," International Journal of Computational Linguistics and Natural Language Processing, vol. 2, no. 6, pp. 385–387, 2013.

Inbal Yahav  Dr. Inbal Yahav, Social Scientist (PhD 2010), is an assistant professor and the head of the Information Systems specialization at the Graduate School of Business Administration, Bar-Ilan University, Israel. Her main research interest is developing and tuning statistical models for the information systems discipline. In her research work, Dr. Yahav combines techniques from data mining algorithms, social network analysis, and optimization models to achieve optimized and interpretable statistical models. She applies these methods mainly to health care applications and online social networks. Dr. Yahav has presented her work at multiple conferences and has published papers in books and journals, including MIS Quarterly, Production and Operations Management, and Annals of Operations Research. She received her B.A. in Computer Science and her M.Sc. in Industrial Engineering from the Israel Institute of Technology, and her PhD in Operations Research and Data Mining from the University of Maryland, College Park, in August 2010. Dr. Yahav is currently serving as an Associate Editor of the Decision Sciences Journal and the Big Data journal.

Onn Shehory  Onn Shehory received his MSc in physics and his PhD in computer science from Bar-Ilan University, Israel, in 1992 and 1996, respectively. He has recently joined Bar-Ilan University as an associate professor of information systems. Prior to taking up this position he was a researcher at IBM Research. His research interests include intelligent information systems, autonomous systems, data analytics, social network analysis, software engineering, and algorithmic game theory.

David Schwartz  David G. Schwartz is professor of information systems, and former vice-chairman, at the Business School of Bar-Ilan University, Israel. He has published over 120 research papers, books, book chapters, and editorials in the field of information systems and technologies. His research has appeared in publications such as Information Systems Research, IEEE Intelligent Systems, ACM Computing Surveys, and the Journal of the Association for Information Science and Technology (JASIST). His books include Cooperating Heterogeneous Systems; Internet-Based Knowledge Management and Organizational Memory; and the Encyclopedia of Knowledge Management, now in its second edition. From 1998 to 2011 he served as editor-in-chief of the journal Internet Research and is currently an Associate Editor of the European Journal of Information Systems. His main research interests are cybersecurity, mHealth, knowledge management, social network analysis, and computer-mediated communications. David received his Ph.D. in Computer Science from Case Western Reserve University, USA; his MBA from McMaster University, Canada; and his B.Sc. from the University of Toronto, Canada.