
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TKDE.2018.2840127, IEEE Transactions on Knowledge and Data Engineering.

Comments Mining With TF-IDF: The Inherent Bias and Its Removal

Inbal Yahav, Onn Shehory, and David Schwartz

Graduate School of Business Administration, Bar Ilan University, Israel, 52900. E-mail: {inbal.yahav, onn.shehory, david.schwartz}@biu.ac.il. Manuscript received July 24, 2017.

Abstract—Text mining has gained great momentum in recent years, with user-generated content becoming widely available. One key use is comment mining, with much attention being given to sentiment analysis and opinion mining. An essential step in the process of comment mining is text pre-processing: a step in which each linguistic term is assigned a weight that commonly increases with its appearance in the studied text, yet is offset by the frequency of the term in the domain of interest. A common practice is to use the well-known tf-idf formula to compute these weights.

This paper reveals the bias introduced by between-participants' discourse to the study of comments in social media, and proposes an adjustment. We find that content extracted from discourse is often highly correlated, resulting in dependency structures between observations in the study, thus introducing a statistical bias. Ignoring this bias can manifest in a non-robust analysis at best and can lead to an entirely wrong conclusion at worst. We propose an adjustment to tf-idf that accounts for this bias. We illustrate the effects of both the bias and the correction with data from seven Facebook fan pages, covering different domains, including news, finance, politics, sport, shopping, and entertainment.

Index Terms—Sentiment analysis, text mining, statistical bias, discourse.

1 INTRODUCTION

SOCIAL media, and in particular social networks (SNS), are today's major form of communication, used on a daily basis [1]. SNS platforms serve individuals and organizations that utilize these platforms to spread information. In essence, a great portion of interpersonal communication has shifted, and is still shifting, to online social platforms. The availability of such rich online communication data provides fertile ground for researchers across multiple disciplines [2].

The analysis of messages in social media falls squarely within the domain of data science, and effective processing and analysis of those messages will have a growing impact on business, business Information Technology (IT) management, and Information Systems (IS) [3]. As Saar-Tsechansky [4] articulates, "IS data science is uniquely positioned to focus on challenges at the nexus of data science, business, and society. Insightful observations on the shortcomings of state-of-the-art methods to address old and new business challenges can give rise to new data science problems and research that addresses them". Our work highlights one such shortcoming – the popular use of standard text analysis techniques (specifically, tf-idf, as we later describe) applied to comment classification, resulting in biased analysis – and proposes an adjustment.

In qualitative research, discourse is defined as 'anything beyond the sentence', referring to the way in which people integrate language with other forms of self-expression (denoted 'non-language'), such as acting, interacting, and body language [5]. In computer-mediated discourse (CMD; also called computer-mediated communication, CMC), such as on SNS, 'language' is replaced by textual messages, and 'non-language' is replaced by writing style, such as deleting subject pronouns, using abbreviations, and adding signs and symbols [6], [7]. An interesting aspect of comment analysis, with respect to the relationship between commenters, is the between-participants' discourse, and in particular the language and non-language expressions participants choose while taking part in an ongoing discussion. We discuss the between-participants' discourse in more detail later in the paper.

This paper is concerned with the bias that CMD introduces to the word weights assigned by tf-idf for the purpose of comment classification. Ignoring this dependency bias may lead to one of two possible outcomes. If the dependency is intentionally ignored (assumed to introduce no bias to the model), the analysis may not be generalizable to out-of-sample observations; this becomes evident when the method is applied to a holdout dataset. If, however, the dependency is unintentionally overlooked, and comments on the same document are randomly split between training and holdout sets, the same statistical bias is repeated in both sets. As a result, the researcher may reach incorrect conclusions. In this paper, we exemplify this observation via examples, and we further show how adjusting tf-idf to account for this dependency can significantly reduce the bias.

Tf-idf is widely used for text pre-processing and feature engineering, as can be seen in recent examples. An increasing emphasis on short texts and dynamic corpora is evident in the literature. Surian and colleagues [8], for example, use tf-idf to create a feature vector of words used to describe game apps in different App Store categories, on a corpus of 5546 descriptions and a vocabulary of 672 unique words.

In a work on Hashtagger+, a system to recommend Twitter hashtags for news articles, tf-idf is used to create a feature set describing candidate hashtags, creating tweet-bags of terms from news articles [9]. The use of tf-idf to pre-process short texts in order to create a co-occurrence network that models semantic relatedness can also be seen in the work of [10].

This paper thus aims at exposing and adjusting the bias caused by this approach when applied by researchers and practitioners. Of course, there are many situations in which alternatives to tf-idf should be considered. For example, in their study of tree-structured book representation, Zhang et al. [11] find that their models based on principal component analysis (PCA) features outperform models using tf-idf features.

In the machine learning field, there is wide agreement that data-driven analysis of user-generated content requires text pre-processing. This step commonly involves removal (or down-weighting) of stop words, word stemming [39] or word lemmatization [40], and controlling for word frequency in the domain of interest. For the latter task, a common practice is to use the well-known tf-idf formula (formally described later), which assigns a weight to each word in the dataset that increases with its appearance in the training text, yet is offset by the frequency of the term in the domain of interest. However, the appropriateness of applying unadjusted tf-idf to comment sentiment analysis is questionable [6], [41].

Common to much of the literature on comment classification is the analysis of comments as independent observations (e.g., [14], [15], [19], [42], [43]). Surprisingly, although the structure of the comments and their proximity are largely recognized as having a great impact on users' opinion and sentiment (as we discuss in the next section), and are often used for comment classification, comments are still treated independently for the purpose of word frequency control (e.g., [26], [31]). In practice, as we will show, between-participants' discourse creates a statistical dependency between observations (comments) that inflates the term frequency of commonly non-frequent terms in a given domain.

Our next discussion is on the use of comment structure in classification methods, commonly referred to as the analysis of threaded discussions and author-topic / author-recipient-topic analysis.

1.1 Background: The study and classification of comments

The potential impact of the bias that we will illustrate is amplified by the high level of research activity surrounding the study of comments. The field of text analysis focuses attention on the analysis of comments as a central part of understanding user-generated online content [12]. Researchers have studied comments on posted pictures and YouTube videos [13], [14], [15], comments on blogs [16], [17], comments on press releases [18], [19], comments on corporate communication [20], comments on public service announcements [21], [22], and even comments on comments [18], [23], [24].

The study of comments has various aims. Given a set of documents such as blog entries or Facebook posts, each followed by a list of comments that is either flat or threaded, researchers study what can be inferred from users' comments on the document [5], [16] or the similarity between documents [13], [19], the topics of threaded discussions [25], and their dynamics [19], [26].

Comment classification, often denoted message classification in the computational linguistics (CL) and natural language processing (NLP) literature (e.g., [26]), is the analysis of comments within a commentsphere [12]. Classes of comments studied vary from sarcasm and nastiness [27], [28], to attitude toward the comments [29], misinformation such as in rumors [30], informativeness relative to the document posted, thread topic, or past comments [31], [32], uniqueness of comments [33], and much more.

Perhaps the most commonly addressed question in the context of comment classification is sentiment analysis [34], also referred to as opinion mining [35]. Sentiment analysis and opinion mining represent a large problem space, often defined slightly differently, covering, for example, opinion extraction, subjectivity study, emotion analysis, and more. The ultimate goal of sentiment analysis is to find out what people think or feel toward entities such as products, services, individuals, events, news articles, and topics. Methodologically, similarly to comment classification, the goal is achieved using content extraction and classification techniques. In much IS research, the term "sentiment analysis" is used to represent this entire umbrella of comment classification problems. A complete survey on different aspects of sentiment analysis is given in [36], [37], [38].

1.2 Enhancing classification with comments' structure

Research on threaded discussions makes use of comment structure. A threaded discussion is characterized by the inherent relationship between the threaded comments and the textual dependency between them [26]. The relationship between comments, both structural and lexical, is widely recognized as useful in classification, as seen in [31], who study the quality of comments in a threaded blog discussion. On top of the standard unigrams (words' tf-idf scores) and conversational features such as comment length, order, and number of replies, they studied comments with respect to their lexical similarity to the preceding and following comments. Interestingly, they show that, for the purpose of assessing comment quality, lexical similarity features are dominated by conversational features, and that unigrams are the least informative feature. Wanas and colleagues [32] study the informativeness of comments in a discussion, examining conversational features, structural features such as referencing and replies, and features based on cosine similarity, such as the similarity between a comment and the post's topic, the thread's topic, and its parent's (replied-to) topic in the threaded discussion. Their research highlights the power of (non-)similarity features in predicting informativeness.

The literature on comment classification in threaded discussions generally addresses goals where dependency plays a major role: attitude between commenters [29]; quality, informativeness, or uniqueness of comments (e.g., [31], [32], [33]); interestingness of comments [44]; etc. The literature commonly offers spot solutions for these classification tasks that make use of the dependency structure. Gómez et al. [44] propose a framework to make use of threads' structure when the aim is related to the differential between comments.

As far as we know, research has yet to address the inherent bias between comments in flat discussions, or in cases where the dependency does not play a major role in the classification task (such as opinion mining and sentiment analysis).

The most commonly used dataset in the threaded-discussion literature is Slashdot (www.slashdot.com). Slashdot, a popular technology discussion forum, integrates a moderation scheme where readers can rate comments based on informativeness (scale 1-5). Comments on Slashdot usually revolve around an initial post or contribution, are generally lengthy, and contain a significant amount of content [32]. By contrast, in this paper we analyze Facebook comments, which are fairly short (15-25 words on average) and thus plausibly less informative. Moreover, readers on Facebook see the first comment in each thread, sorted by popularity (at time of access). The length of messages is shown to have a great impact on the conversation and its analysis [45]. Given that the average depth of threads on Facebook is less than 1 (extracted from our datasets), compared with 10-15 on Slashdot, Facebook commenting can essentially be considered a "flat" discussion.

1.3 Contributions: Articulating the bias and adjusting tf-idf

A discussion of the limitations of text analysis when applied to computer-mediated discourse (CMD) was presented in [6], claiming that existing text analysis systems are concerned with topic modeling (assigning topics to documents); in CMD data, however, text features are often either overlooked or manually extracted. Arazy and Woo [41] highlight the importance of effective statistical natural language processing (SNLP) methods to the IS community of researchers and practitioners. Their focus, on collocation indexing for compound terms, points to tf-idf as "the de facto standard" for weighting schemes that take into account a local, document-specific factor and a global, corpus-level factor. After showing how standard tf-idf is ineffective for collocations, they proceeded to develop an adaptation. We have identified a similar challenge to tf-idf, showing how its unadjusted use in the analysis of social media comments is inappropriate and may result in significant bias. In the recent IS literature we observe a semi-manual approach, where classification relies on user-generated (often expert-generated) dictionaries or user-defined textual features (e.g., [7], [46]). The development of customized automated methods is thus called for.

This paper bridges the gap between discourse analysis, specifically CMD, and (automated) text classification. Our main contribution is in revealing the comment discourse bias and discussing its implications for text preprocessing and model evaluation. Although the literature does offer alternatives to the tf-idf weighting approach, we found no discussion of tf-idf misuse in the important and growing context of comment study. We illustrate the extent of the bias problem on several datasets. Misuse is examined and illustrated using empirical evidence.

Our second contribution is a proposed statistical correction to the bias, delivered by modifying tf-idf. Our statistical correction is kept simple for practical reasons, and alternative corrections are discussed. In contrast to the literature on threaded discussion, where the focus is to leverage dependency to derive predictors, we "flatten" the threads and remove the dependency, to extract information from the remaining unigrams. We believe that our approach will make the unbiased analysis of comments more accessible to the IS community.

2 TEXT PRE-PROCESSING WITH TF-IDF

The tf-idf weighting, first introduced in [47], stands for term frequency (tf) × inverse document frequency (idf). Tf-idf weighting is commonly used in text mining and information retrieval to evaluate the importance of a linguistic term (commonly a unigram or bigram) in a studied corpus. Term importance (weight) increases with the term's frequency in the text, yet is offset by the frequency of the term in the domain of interest (e.g., frequent words like "the" or "for" will be scaled down).

Given a collection of terms t ∈ T that appear in a set of N documents d ∈ D, each of length n_d, tf-idf weighting is computed as follows [48]:

  tf_{t,d} = f_{t,d} / n_d,
  idf_t = log(N / df_t),
  W_{t,d} = tf_{t,d} × idf_t,    (1)

where f_{t,d} is the frequency of term t in document d, and df_t is the document frequency of term t, that is, the number of documents in which term t appears. Several variations and adjustments have been offered, including normalizing tf_{t,d} and optional weighting schemes (such as BM25) [48], [49], [50], [51].

For the task of comment classification, 'document' is replaced by 'class', e.g., a sentiment class (negative/positive) in sentiment analysis, serving to collect a set of relevant comments. Term frequency (tf) is then computed per class. Inverse document frequency (idf) becomes "inverse comment frequency", meaning that N is the size of the set of comments, and the document frequency of a term (df_t) is computed on that set. Bermingham and Smeaton [52] define this method as sentiment tf-idf. We adopt this terminology in the remainder of the paper. Comments are then classified into classes using probabilistic (e.g., Naive Bayes) or discriminative models (e.g., SVMs) [14].

A common variant of the classic tf-idf is delta idf weighting, in which idf is calculated for each class separately, and the difference between the values is then used for sentiment classification [53]. This variant has proved efficient for supervised classification at the sentence level.
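To make the weighting scheme concrete, the following minimal Python sketch computes sentiment tf-idf as defined in Equation 1, with 'document' replaced by 'class'. This is an illustration of the formula as described above, not the authors' original code; the tokenized toy comments and labels are hypothetical, and the natural logarithm is used.

```python
import math
from collections import Counter

def sentiment_tf_idf(comments, labels):
    """Sentiment tf-idf (Eq. 1 with 'document' replaced by 'class').

    comments: list of token lists; labels: parallel list of class labels.
    Returns {class: {term: weight}}.
    """
    n = len(comments)  # N: size of the set of comments
    # df_t: number of comments in which term t appears
    df = Counter(t for tokens in comments for t in set(tokens))
    # Per-class term frequencies and class lengths (tf is computed per class)
    counts, length = {}, Counter()
    for tokens, label in zip(comments, labels):
        counts.setdefault(label, Counter()).update(tokens)
        length[label] += len(tokens)
    return {label: {t: (f / length[label]) * math.log(n / df[t])
                    for t, f in c.items()}
            for label, c in counts.items()}

# Hypothetical toy usage:
comments = [["great", "app"], ["great", "fun"], ["bad", "app"]]
labels = ["positive", "positive", "negative"]
weights = sentiment_tf_idf(comments, labels)
print(round(weights["positive"]["great"], 3))  # 0.5 * log(3/2), about 0.203
```

The delta-idf variant mentioned above would instead compute idf separately on each class's comment set and use the difference between the two class-specific values.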

3 BETWEEN-PARTICIPANTS' DISCOURSE CORRELATION: DEFINING THE PROBLEM

An interesting property of comments on online posts is the discourse generated between the participating commenters. When part of an online discussion, discourse is shown to be content-dependent [14], [15]. A prominent phenomenon in this respect is the presence of terms that repeatedly occur across a sequence of comments, where such terms relate neither to the topic nor to the sentiment. An example of such textual dependency is presented in Figure 1: a partial comment thread on a news article from the news.discovery.com site. The title of the article is "App does math homework with phone camera". The first commenter compared this new application to other products: Google glass and PhotoMath. The second commenter cynically replies to the article, using the same comparison as the former user. The third participant, who does not "comprehend the joke", replies to the idea presented by the second commenter, and is then replied to by the fourth commenter, who repeats the exact same words in an ironic fashion.

Fig. 1: Example of between-commenters' discourse.

It is important to note that neither the term Google glass, nor PhotoMath, nor the repeated discourse of the commenters, appears anywhere in the text of the original article being commented upon. Thus, despite the high correlation of these terms within the comment space, they would have low correlations with the main text.

In the example presented in Figure 1, the comments are threaded, that is, users post comments on comments ("reply to"). In an interesting study, Agrawal et al. [54] observed that the relationship between two individuals in a threaded discussion is much more likely to be antagonistic (74%) than reinforcing (7%).

In a flat discussion, in which comments appear consecutively (no "reply to" option), textual dependency is also apparent, commonly following a "quoting" pattern: reference to another post by quoting part of it or by tagging the commenter's user name or ID. In a study on quoting behavior among participants on the politics.com discussion site, Mullen and Malouf [55] found that 10% of the posts contain quoted material, and as much as 55.7% of the users quote comments or are quoted by others at least once. Here again, the majority of the participants quote users at the opposite end of the political spectrum.

Dependency between discourses and the repeated occurrence of terms introduce bias into tf-idf weighting when used for text preprocessing. The tf-idf technique assumes completeness of the domain of interest: idf is the term's inverse frequency in the entire domain, and tf is the term's frequency in a complete document, or class (in sentiment tf-idf). In practice, the "domain of interest" is replaced by the training set (see Equation 1). Therefore, correlation between discourses, characterized by increased term frequency in comments to a single post (beyond its frequency in the domain), erroneously decreases idf and increases tf. We prove this observation in Appendix A, and provide empirical evidence in the next section.

3.1 Discourse dependency structures

Dependency between comments can result from multiple causes, as listed below:

Comment-to-commenter: a given commenter may use similar vocabulary in several comments, resulting in dependency between comments posted by the same commenter.

Comment-to-Social Network (SN): commenters from the same social circles may have a similar discourse style, and may use the same terms and phrases.

Comment-to-document: comments often discuss the document's content. We therefore expect to observe the same set of terms used by several commenters commenting on the same article.

Comment-to-comment: comments are often dependent on previous comments, either because they are influenced by them, or simply because they reply to or quote them. This dependency is expected to be time-dependent: once a term appears, it is more likely to reappear in succeeding comments. The dependency possibly decays along the thread of comments, and may have a smaller impact on comments that are further down the comment thread. One might also argue that threaded comments are expected to exhibit higher correlation compared to a flat discussion.

When the dataset is large enough, the comment-to-commenter dependency is mitigated, as the effect of individual commenters is small and diverse, and thus can be averaged out. Large datasets typically mitigate the comment-to-SN dependency as well, as long as there are multiple social circles contributing to the dataset, and those circles are small compared to the size of the commenters' group; diversity among multiple such circles allows averaging out their effect. In cases where there are only a few social circles, or their sizes are relatively large, comment-to-SN dependency persists. However, given the nature of CMD, the two other dependency structures (comment-to-document and comment-to-comment) are likely to bias datasets of all sizes, in different contexts. This is because influence, replying, and quoting may apply to large sets of comments (e.g., the influence of a document on all of the comments posted to it), and thus generate dependency even in large datasets. More importantly, these two dependency structures exhibit fast-paced, time-dependent dynamics of change. Commenters are typically influenced by recent comments as well as by the document to which they post comments. This pattern of influence is non-stationary and has a significant effect on time series performance. Therefore, influence patterns should be controlled for.

4 EMPIRICAL EVIDENCE OF THE BIAS

We examine the discourse correlation and the bias it introduces to tf-idf on two Facebook fan pages with large user activity. The first is the fan page of the TV show CommunityTV (www.facebook.com/communitytv). CommunityTV has over 1.7M fans and its posts are mostly promotional. The second fan page is the news page SPORT1News (www.facebook.com/SPORT1News), which maintains nearly 1M fans and, as the title conveys, discusses sport news. For each page we collected a recent set of 1000 documents (posts) and their comments. The average (median) number of comments per post is approximately 172 in CommunityTV and 96 in SPORT1News.

4.1 Discourse dependency network

To illustrate the dependency between comments posted on the same document, we construct a dependency network in which the set of nodes represents the set of comments, and ties between two nodes (comments) and their weights correspond to the number of shared words (after removal of stop words, numbers, and punctuation, and word stemming). We evaluate these networks against a baseline, in which for each document we generate a network of the same size (total number of comments) from randomly selected comments that were posted on different documents (and are thus lexically independent).
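As a sketch of the construction just described, the following Python code (using the networkx library) builds a dependency network for one document's thread, and a size-matched baseline from comments drawn from other documents' threads. The preprocessing mentioned above (stop-word removal, stemming) is assumed to have already been applied to the token lists; function names and the data layout are our own assumptions.

```python
import itertools
import random
import networkx as nx

def dependency_network(thread_tokens):
    """Nodes are comments; an edge's weight is the number of shared words
    between the two comments (tokens are assumed pre-processed)."""
    g = nx.Graph()
    g.add_nodes_from(range(len(thread_tokens)))
    for i, j in itertools.combinations(range(len(thread_tokens)), 2):
        shared = len(set(thread_tokens[i]) & set(thread_tokens[j]))
        if shared > 0:
            g.add_edge(i, j, weight=shared)
    return g

def baseline_network(all_threads, doc_id, seed=0):
    """Size-matched baseline: the same number of comments, selected at
    random from threads of *other* documents (lexically independent).
    all_threads: {doc_id: list of token lists}."""
    rng = random.Random(seed)
    size = len(all_threads[doc_id])
    pool = [c for d, thread in all_threads.items() if d != doc_id for c in thread]
    return dependency_network(rng.sample(pool, size))
```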

Examples of dependency networks for comments posted on the same document are given in Figure 2 (for the CommunityTV fan page). The darker node in each figure represents the document itself. Interestingly, we see that comments do not share content with the document, yet discourse is generated between many of the commenters. Similar comment dependency networks are observed for all documents posted on both fan pages.

Fig. 2: Examples of dependency networks of comments. A darker node represents the document itself.

We then measure dependency metrics of the networks and compare the measurements with those of the baseline networks. The metrics measured and compared are:

Number of components: measures the emergence of communities around topics. A network with fewer components is one in which commenters share similar terms (yet not necessarily similar ideas, opinions, or sentiments).

Node degree: indicates the popularity of a node (comment) in terms of the terms used. When a degree in a dependency network is significantly higher than that of a baseline network, it likely implies the existence of comments with multiple citations or replies (discourse).

Betweenness: high network betweenness corresponds to scenarios of threads within comments (commenter c replies to commenter b, and commenter b replies to commenter a, etc.). High values of betweenness might indicate the effect of temporality in a network.

Table 1 presents the results of the comparison between the networks, controlling for network size (the model used is Y = a + b1[Network Size] + b2[Indicator of Dependency Network]; Table 1 depicts the value of b2). As expected, when compared to the baseline, the dependency networks exhibit on average fewer components and higher node degrees. Betweenness was insignificant in the CommunityTV dataset, indicating a low impact of temporality.

TABLE 1: Difference between dependency and baseline networks

Measurement  | CommunityTV          | Sport1News
#Components  | -7.3                 | -3.8
Degree       | 17.2                 | 12.6
Betweenness  | -0.8 (insignificant) | 3.7
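The comparison behind Table 1 can be sketched as follows: compute the three metrics for each network, then regress each metric on network size and a dependency-network indicator, as in the model noted above (Y = a + b1[Network Size] + b2[Indicator of Dependency Network]). This is our reading of the procedure, implemented with networkx and an ordinary least-squares fit via numpy; it is an illustration, not the authors' code.

```python
import numpy as np
import networkx as nx

def network_metrics(g):
    """Number of components, mean node degree, and mean betweenness."""
    return {
        "components": nx.number_connected_components(g),
        "degree": float(np.mean([deg for _, deg in g.degree()])),
        "betweenness": float(np.mean(list(nx.betweenness_centrality(g).values()))),
    }

def dependency_effect(metric_values, sizes, is_dependency):
    """Estimate b2 in Y = a + b1*size + b2*1[dependency network]
    (the size-controlled difference reported in Table 1).
    metric_values, sizes, is_dependency: one entry per network."""
    X = np.column_stack([np.ones(len(sizes)), sizes, is_dependency])
    coefs, *_ = np.linalg.lstsq(X, np.asarray(metric_values), rcond=None)
    return coefs[2]  # b2
```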

In the following experiment, we illustrate the bias that discourse correlation introduces to tf-idf weights, and its implication for the classification task. As a preprocessing step we split the data into training (50%) and holdout (50%) sets, as is commonly done in classification tasks.

4.2 A discussion on training and holdout sets

According to [56], [57], the statistical performance of a model is strictly linked to the correct selection of the holdout sample. Stein [57] articulates that "with time varying processes . . . holdout testing can miss important model weaknesses not obvious when fitting the model across time because simple holdout tests do not provide information on performance through time". A similar argument can be made in the context of comment classification: comments are posted over time, to documents that are published over time. A simple holdout set selected randomly from the population of comments, irrespective of time and document, is not suitable for prediction under real-world conditions, in which comments on new posts are barely (or not at all) lexically dependent on comments on past posts.

To avoid inappropriate selection of holdout sets, Stein proposes a framework in which the training-holdout split is selected according to the prediction task in two dimensions: (a) time and (b) the population of obligors (denoted universe). Figure 3 illustrates this framework. In the context of comment classification, the universe becomes documents, and the task of focus is classifying comments posted to new documents ("split by document").

Fig. 3: Stein's [57] schematic holdout techniques, applied to comment classification. White/black comments are part of the training/holdout set.

4.3 Implication of dependency for tf-idf values

In the following experiment, we illustrate the bias that discourse correlation introduces to tf-idf weights. We examine two variations of training-holdout splits. Due to the symmetry between the sets, we name them 'set 1' and 'set 2'.

The first split follows Stein's [57] framework for out-of-universe tasks: split by document (bottom-left panel in Figure 3). In this case, documents are selected randomly into one of the sets, along with the entire thread of comments that follows them. Splitting by document is dictated in prediction analytics, when the goal is to project from a sample of documents to new documents. Due to the inherent bias, we expect tf-idf weights of some terms to be inflated in one set or the other, as visually confirmed in Figures 4(a, left) and 4(a, right). This scenario illustrates real-world conditions, where the dependency causes miscomputation of tf-idf weights for certain terms.

The second is a random split by comment (top-left panel in Figure 3). Here, comments on the same document may be placed in different sets. Splitting by comment assumes independence of comments, and hence cannot be used when the goal is prediction [56], [57]; it can only be used to infer patterns in comments posted on in-sample documents. Discourse correlation under this scenario is expected to introduce the same bias to both sets, resulting in statistically equal tf-idf weights. Figures 4(b, left) and 4(b, right) demonstrate the above discussion. This scenario illustrates a case where the dependency is ignored yet the holdout set "misses important model weaknesses".

Fig. 4: Tf-idf score comparisons on terms extracted from fan pages CommunityTV (left) and Sport1News (right).

We quantify the extent of the bias under different settings in our numerical experiments, described in a dedicated section later in this paper.
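The two splitting schemes compared above can be sketched in Python as follows; representing comments as (doc_id, text) pairs is our own simplification.

```python
import random

def split_by_document(comments, seed=1):
    """Stein-style out-of-universe split: each document, with its entire
    thread of comments, goes to exactly one set (suitable for prediction)."""
    rng = random.Random(seed)
    docs = sorted({doc_id for doc_id, _ in comments})
    rng.shuffle(docs)
    set1_docs = set(docs[: len(docs) // 2])
    set1 = [c for c in comments if c[0] in set1_docs]
    set2 = [c for c in comments if c[0] not in set1_docs]
    return set1, set2

def split_by_comment(comments, seed=1):
    """Random split that ignores thread structure: comments on the same
    document can land in both sets, so the discourse bias repeats in both."""
    rng = random.Random(seed)
    pool = list(comments)
    rng.shuffle(pool)
    half = len(pool) // 2
    return pool[:half], pool[half:]
```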

5 ADJUSTED TF-IDF: DESCRIBING THE SOLUTION

The discussion above demonstrates the bias inherent to sentiment tf-idf and the risks associated with such bias. Nevertheless, sentiment tf-idf is a central method in sentiment analysis: it is well understood, widely implemented, and easy to use. Therefore, our goal is not to replace it, but instead to minimize its inherent bias. Given that discourse correlation inflates the sentiment tf-idf weights of some terms, our quest is to adjust those weights to compensate for that inflation. We additionally aim to maintain the simplicity and comprehensibility of the weighting. Hence, while many adjustment approaches are possible, we opt for a simple yet well justified approach. In a nutshell, we examine term frequency and, in case of a term being overly frequent, we adjust its tf-idf weight as detailed below.

We keep the terminology from the previous section and adjust the sentiment tf-idf. Recall that in sentiment tf-idf, idf is computed on a set of comments, and tf is computed per sentiment class. We define a comment thread as the set of comments that are posted to a single document (regardless of the comments' actual structure). We denote the comment thread posted to document i by c_i; C = {c_1, ..., c_n} is the set of all comment threads associated with all n documents.

Discourse correlation inflates the frequency of terms in certain comment threads, but not in others. To determine whether a term t is over-frequent in a specific comment thread c_i, we compare the number of comment threads in which t appears (comment thread frequency, denoted cf_t) to the number of comment threads we would expect t to appear in, in the absence of correlation (expected comment thread frequency, denoted ecf_t and computed in the next section). We then use the ratio r_t between cf_t and ecf_t to determine whether t is over-frequent in c_i. The term t is considered over-frequent in c_i if this ratio is smaller than 1. Had t appeared in as many comment threads as expected or more (i.e., r_t ≥ 1), its high frequency in c_i would have been statistically expected. However, a ratio smaller than 1 indicates that t appears infrequently across comment threads, while it is frequent in c_i. This suggests that t is over-frequent in c_i. Put in other words, given its frequency in c_i, a ratio smaller than 1 indicates that t appears in fewer comment threads than expected without correlation.

We next correct the sentiment tf-idf weighting scheme in Equation 1 by defining an adjustment coefficient based on the comment thread frequency. The aim of the adjustment is to undo the effect of the bias on the tf-idf weights. A simple, straightforward approach is to find the ratio of weight inflation and multiply the weight by the inverse of that ratio. This is exactly the adjustment we apply here. The adjustment coefficient we apply is based on the ratio r_t between cf_t and ecf_t, which is inversely proportional to weight inflation. We further right-trim this number to avoid data overfitting. The adjustment coefficient is given in Equation 2 and introduced into the adjusted term weight in Equation 3:

  adj(df)_t = { r_t, if r_t ≤ 1; 1, if r_t > 1 },   where r_t = cf_t / ecf_t    (2)

  W_{t,d} = tf_{t,d} × idf_t × adj(df)_t    (3)

Inspecting Equations 2 and 3, one can observe that, in practice, the adjustment is performed only when r_t < 1. In such cases, a weighted term t is over-frequent in a comment thread c_i and consequently its tf-idf weight is inflated. Those inflated weights are deflated when multiplied by the adjustment coefficient (Equation 3), thus controlling the bias they introduce. Variations on this weighting scheme, such as a quadratic coefficient, adding a log function, or adding left trimming (entirely removing terms that occur in few comment threads), can also apply, and should be tested per case study.

5.1 Computing expected comment thread frequency

The expected comment thread frequency ecf_t is essentially the expected number of comment threads a term t should appear in, given its frequency (tf_t). The problem of computing ecf_t is equivalent to the "balls into bins" problem [58], with weighted bins: allocate tf_t independent occurrences of term t ("balls") into n comment threads ("bins"), with the comment thread probability p(c_i) being proportional to its length (the total length of all comments in that thread), such that Σ_{i=1}^{n} p(c_i) = 1. The objective is to find the expected number of "non-empty bins", which is the expected comment thread frequency.

We define a random variable z_{t,i}:

  z_{t,i} = { 1, if comment thread i does not contain the term t; 0, otherwise }    (4)

Note that such a random variable has a Bernoulli distribution. Computing the exact probabilities in the Bernoulli case is hard; hence, as is common in the art, we use a Poisson approximation instead [59]. The probability that comment thread c_i does not contain term t is then:

  Pr(z_{t,i} = 1) = (1 − p(c_i))^{tf_t} ≈ e^{−tf_t · p(c_i)} ≡ e^{−λ_{t,i}}    (5)

Owing to the linearity of expectation, the total expected number of comment threads that do not contain term t satisfies:

  E(z_t) = Σ_{i=1}^{n} E(z_{t,i}) = Σ_{i=1}^{n} Pr(z_{t,i} = 1) ≈ Σ_{i=1}^{n} e^{−λ_{t,i}},   λ_{t,i} = tf_t · p(c_i)    (6)

In the special case where all comment thread lengths are equal, p(c_i) = 1/n for all i and λ_t = tf_t / n, and we get E(z_t) = n · e^{−λ_t}.

The expected comment thread frequency is then given by:

  ecf_t = n − E(z_t)    (7)

Regarding correlation between words, we note that dependency between words, referred to as an ngram, is common in discourse. Accounting for ngrams in tf-idf weights has been addressed by several researchers, such as [60], who show that unigrams better predict class membership than bigrams; Dave et al. [35], who show the advantages of bigrams in predicting class polarity; and [61], who discuss the general ngram form. Our method corrects the tf-idf weight for the case of unigrams. However, it can easily be modified to handle ngrams by correcting ecf_t to account for the joint distribution of the terms [62].
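A direct Python sketch of Equations 2-7 follows. It takes the corpus-wide frequency tf_t of a term, the observed comment thread frequency cf_t, and the thread lengths, and returns the adjusted weight; the function and variable names are ours, and this is an illustration of the stated equations rather than the authors' implementation.

```python
import math

def expected_cf(tf_t, thread_lengths):
    """ecf_t (Eqs. 5-7): allocate tf_t occurrences into n threads with
    p(c_i) proportional to thread length; under the Poisson approximation,
    the expected number of empty 'bins' is sum_i exp(-tf_t * p(c_i))."""
    total = float(sum(thread_lengths))
    expected_empty = sum(math.exp(-tf_t * (l / total)) for l in thread_lengths)
    return len(thread_lengths) - expected_empty

def adjustment(cf_t, ecf_t):
    """adj(df)_t (Eq. 2): r_t = cf_t / ecf_t, right-trimmed at 1, so only
    over-frequent terms (r_t < 1) are deflated."""
    return min(cf_t / ecf_t, 1.0)

def adjusted_weight(tf_td, idf_t, cf_t, tf_t, thread_lengths):
    """Adjusted term weight (Eq. 3): W_{t,d} = tf_{t,d} * idf_t * adj(df)_t."""
    return tf_td * idf_t * adjustment(cf_t, expected_cf(tf_t, thread_lengths))
```

In the equal-length special case (p(c_i) = 1/n), expected_cf reduces to n − n·e^{−tf_t/n}, matching the closed form above.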

6 NUMERICAL EXPERIMENTS

In this section, we evaluate the statistical bias in comment classification. We examine multiple datasets, of different domains and sizes, and propose three measurements to quantify the bias and its implication for comment classification.

6.1 Data

Our data were compiled from seven Facebook fan pages selected from different domains, including news, finance, politics, sport, shopping, and entertainment. This afforded diversity in topic and size, thus facilitating wider applicability of our results. The description, characteristics, and descriptive statistics of the pages are given in Table 2.

For each page, we collected the entire set of documents (posts) and threads (first-level comments; the vast majority of comments on Facebook pages are first-level comments, see the average thread depth in Table 2). We additionally computed the average number of threads per document, the average number of words per comment (referred to as comment length), the average number of replies in threads, and the threads' average depth. Since Facebook incorporated a threaded commenting system in mid-2013 [63], the latter two measurements were computed for recent threads only, specifically those retrieved from the year 2016.

Following our discussion in Section 4.2, fan pages were split into training and holdout sets, such that 50% of a page's posts, along with their entire threads, were used for training, and the remaining 50% of the posts were used as holdout.

6.2 Measurements

For each term t in class d, we assume an unknown true weight W_{t,d} = tf_{t,d} × idf_t. We compute the observed weights for the training (set 1) and holdout (set 2) sets, denoted Ŵ^1_{t,d} and Ŵ^2_{t,d}. In the absence of bias, we expect Ŵ^1_{t,d} = Ŵ^2_{t,d} = W_{t,d}. In the presence of bias, the observed weights diverge from the true weight, thus Ŵ^1_{t,d} ≠ W_{t,d} (bias in set 1) or Ŵ^2_{t,d} ≠ W_{t,d} (bias in set 2). The relationship between the two observed weights, however, depends on whether the term is biased in either of the sets, or in both, and on the dependency magnitude.

When the dependency occurs in a single set only, for example set 1, the following relationships hold: Ŵ^1_{t,d} ≠ W_{t,d} and Ŵ^2_{t,d} = W_{t,d}, implying that Ŵ^1_{t,d} ≠ Ŵ^2_{t,d}. The same inequality is expected when the divergence from the true value occurs in both sets, yet at different magnitudes: Ŵ^1_{t,d} − W_{t,d} ≠ Ŵ^2_{t,d} − W_{t,d}, implying again that Ŵ^1_{t,d} ≠ Ŵ^2_{t,d}. This scenario is likely when the training and holdout sets are split by document, that is, discourse is noticed within sets but not across sets.

When the dependency occurs in both sets at the same magnitude, Ŵ^1_{t,d} − W_{t,d} = Ŵ^2_{t,d} − W_{t,d}, the observed weights equate: Ŵ^1_{t,d} = Ŵ^2_{t,d}. Note that this does not imply a no-bias scenario, as both Ŵ^1_{t,d} ≠ W_{t,d} and Ŵ^2_{t,d} ≠ W_{t,d}. We anticipate this scenario when the training and holdout sets are split by comment. Here, comments are dependent both within and across sets, thus the same discourse bias is repeated in both sets. An example of the above assertions was discussed earlier and depicted in Figure 4.

Since the true weight W_{t,d} is unknown, we evaluate the observed weights Ŵ^1_{t,d} and Ŵ^2_{t,d}, and the difference between them, on three measurements:

Root Mean Squared Difference (RMSD): RMSD compares the difference between the observed weights of all terms in a class. Practically, a low (high) RMSD value indicates the ability (inability) of the training set to capture the features of the data when applied to holdout sets. RMSD is computed for each method (e.g., adjusted tf-idf, non-adjusted tf-idf) separately, as follows:

  RMSD = sqrt( mean_{t,d} ( Ŵ^1_{t,d} − Ŵ^2_{t,d} )^2 )    (8)

Spearman's Rank Correlation Coefficient (Sρ): this non-parametric measure compares the tf-idf weights' ranks in the training set to those in the holdout set, as follows:

  Sρ = ρ( R(Ŵ^1_{t,d}), R(Ŵ^2_{t,d}) )    (9)

Practically, Sρ examines whether the relative importance (denoted ≻) of a term in a class can be inferred from the training set. Mathematically, Sρ computes the degree to which the following relationship holds: Ŵ^2_{t_A,d} ≻ Ŵ^2_{t_B,d} ⇒ Ŵ^1_{t_A,d} ≻ Ŵ^1_{t_B,d}.

Quantile Rank Correlation (qSρ): the quantile correlation examines the rank correlation of the right-tail weights, at different quantiles. In other words, qSρ evaluates the ranks of high-value (important) terms only, under the assumption that higher values are commonly used for classification [50]. The value of qSρ is computed at different quantile values, q ∈ {0.5, 0.7, 0.9}:

  qSρ = ρ( R(Ŵ^1_{t,d} | q), R(Ŵ^2_{t,d} | q) )    (10)

Using these three measurements allows us to evaluate the performance of comment classification when using different tf-idf schemes.
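The three measurements can be sketched in Python as follows (using scipy for the rank correlations). The weight vectors are assumed to be aligned over the same (term, class) pairs, and restricting the quantile correlation to terms whose set-1 weight exceeds the q-th quantile is our reading of Equation 10, not a detail stated in the text.

```python
import numpy as np
from scipy.stats import spearmanr

def rmsd(w1, w2):
    """Root mean squared difference between matched weight vectors (Eq. 8)."""
    w1, w2 = np.asarray(w1), np.asarray(w2)
    return float(np.sqrt(np.mean((w1 - w2) ** 2)))

def rank_corr(w1, w2):
    """Spearman's rank correlation of the two weight vectors (Eq. 9)."""
    return spearmanr(w1, w2).correlation

def quantile_rank_corr(w1, w2, q):
    """Quantile rank correlation (Eq. 10): rank correlation restricted
    to high-value terms, those above the q-th quantile of set 1."""
    w1, w2 = np.asarray(w1), np.asarray(w2)
    keep = w1 >= np.quantile(w1, q)
    return spearmanr(w1[keep], w2[keep]).correlation
```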

TABLE 2: Descriptive Statistics of the Data

Facebook Page | #Threads | #Documents | Threads per Document | Comment Length | #Replies on Threads | Thread Depth
Community     |  59572   |  2481      | 24.01 | 19.50 | 1.08 | 0.29
Amazon        |  95126   |  4670      | 20.37 | 15.50 | 0.81 | 0.29
CNN Politics  | 656357   | 28012      | 23.43 | 37.00 | 3.79 | 0.52
The Economist | 257375   | 14111      | 18.24 | 34.50 | 1.65 | 0.32
BBC News      | 445741   | 18065      | 24.67 |  5.50 | 3.34 | 0.40
Sport1        | 779599   | 48592      | 16.04 | 12.00 | 2.12 | 0.36
CBC News      | 852303   | 39512      | 21.57 | 71.50 | 2.47 | 0.38

6.3 Results

In the first experiment, we sample approximately 20K comments from each Facebook page, posted on 1000 documents. Each set of comments is then split into training (50%) and holdout (50%) sets, and tf-idf weights are computed for each set and compared between the sets. We repeat this test 20 times, on 20 different samples, to achieve a significance measure for our comparisons.

We examine three approaches to computing tf-idf. The first, presented in Equation 1, introduces no adjustment; it is commonly used for text classification. The second approach is the well-known k-fold cross-validation (CV), with 7 folds (a greater number of folds showed no statistically significant difference in performance). CV is considered a robust, state-of-the-art approach that avoids model overfitting [64]; hence, we use it as a baseline for our proposed approach. The third approach is our proposed adjusted tf-idf (Equations 2 and 3).

We apply these approaches to the different Facebook pages. We then measure the differences between the weights generated by each approach using RMSD, Sρ, and qSρ. Figure 5 depicts the RMSD values and standard errors computed for the different Facebook pages. Table 3 presents the difference in RMSD when comparing the proposed adjusted tf-idf approach with the other two alternatives (using paired t-tests). We observe that the RMSD values are consistently and significantly lower when the tf-idf weights are adjusted, compared with the RMSD values of the no-adjustment approach, and statistically equal to the RMSD values of the cross-validation approach. This implies that our proposed approach, as well as k-fold CV, better captures the features of out-of-sample data, compared with the no-adjustment approach.

Fig. 5: Comparison of RMSD of the three approaches.

TABLE 3: Paired t-tests for RMSD

Facebook Page | Adjusted tf-idf, compared with Non-Adjusted tf-idf | Adjusted tf-idf, compared with K-fold CV
Community     | -0.18     | 0.26∗∗
Amazon        | -0.66∗∗   | -0.61∗∗
CNN Politics  | -0.91∗∗   | 1.49
The Economist | -1.95∗∗∗  | 0.45
BBC News      | -1.82∗∗∗  | 0.51
Sport1        | -0.78∗∗∗  | 0.001
CBC News      | -1.68∗∗∗  | -0.61

Figures 6 and 7 compare the Spearman's Rank and Quantile Rank correlations of the different approaches across the same Facebook pages examined with RMSD. Table 4 summarizes the difference in Spearman's Rank correlation between the proposed adjusted tf-idf approach and the other two alternatives. Overall, the Rank correlation is fairly low (lower than 0.3), with a slight advantage to the k-fold CV approach compared with our adjusted approach, and to the adjusted tf-idf approach compared with the non-adjusted tf-idf approach. As the quantile increases, and higher values of tf-idf weights are considered, the proposed adjusted tf-idf approach outperforms the alternatives in terms of capturing the relative importance of terms in the data, and the k-fold CV approach becomes the least competitive (see Figure 7). This result is consistent across all Facebook pages and the different quantiles, and is significant at the 1% significance level.

Fig. 6: Comparison of Spearman's Rank correlation of the three approaches.

Fig. 7: Comparison of Spearman's Quantile Rank correlations of the three approaches.

TABLE 4: Paired t-tests for Spearman's Rank Correlation

Facebook Page | Adjusted tf-idf, compared with Non-Adjusted tf-idf | Adjusted tf-idf, compared with K-fold CV
Community     | 0.007∗∗∗ | -0.04∗∗∗
Amazon        | 0.01∗∗∗  | -0.005∗∗∗
CNN Politics  | 0.008∗∗∗ | -0.02∗∗∗
The Economist | 0.01∗∗∗  | -0.01∗∗∗
BBC News      | 0.01∗∗∗  | -0.01∗∗∗
Sport1        | 0.005∗∗∗ | -0.01∗∗∗
CBC News      | 0.01∗∗∗  | -0.01∗∗∗

The results discussed above indicate that, in capturing the relative importance of left-tail tf-idf weights (i.e., the lower values), the k-fold CV approach performs better than our proposed adjusted approach and the non-adjusted approach. In contrast, in capturing the right tail of the tf-idf weights (i.e., the higher values), our proposed adjusted tf-idf approach is best. This difference in performance is important, as it shows that the adjusted tf-idf approach performs best when it matters. That is, higher tf-idf values – which are the more important ones for classification – are best captured by our approach.
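The paired t-tests behind Tables 3 and 4 compare, per page, the per-sample values of a measurement under two approaches; a minimal sketch with scipy, assuming one value per repeated sample and method:

```python
from scipy.stats import ttest_rel

def paired_comparison(values_adjusted, values_alternative):
    """Mean paired difference and p-value across the repeated samples;
    for RMSD, a negative mean difference favors the adjusted approach."""
    diffs = [a - b for a, b in zip(values_adjusted, values_alternative)]
    result = ttest_rel(values_adjusted, values_alternative)
    return sum(diffs) / len(diffs), result.pvalue
```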

6.4 Robustness to data size

Tf-idf robustness to bias is expected to increase as the data size increases and more documents are examined as part of the training set (as proved in Appendix A). In the second experiment, we thus re-compute the RMSD measurement (Figure 8), Spearman's Rank correlation (Figure 9), and Spearman's Quantile Rank correlations (Figure 10) on varying sample sizes, ranging from 2K comments posted on 100 documents to 300K comments posted on 15K documents. In this experiment we evaluate pages with over 15K documents, namely CNN Politics, BBC News, Sport1, and CBC News.

Due to the extremely long running time and memory consumed by the k-fold CV approach, we exclude it from the big-data experiment: we found that the k-fold CV approach takes on average six times longer than the adjusted tf-idf (as expected), and adjusted tf-idf takes on average four times longer than non-adjusted tf-idf.

Fig. 8: Comparison of RMSD of the three approaches for varying data sizes.

Fig. 9: Comparison of Spearman's Rank correlation for varying data sizes.

Fig. 10: Comparison of Spearman's Quantile Rank correlations for varying data sizes. Illustrated on page CNN Politics.

Evidently, the advantage of the adjusted approach across all data sizes, pages, and measurements is statistically significant. As expected, the magnitude of the advantage decreases as the data size increases, yet remains significant. This implies that our proposed approach better captures the features of out-of-sample data, as well as the relative importance of terms (specifically, high-weight terms), compared to the no-adjustment approach.

6.5 On using a large reference corpus

In our last experiment we contrast our adjustment approach with an alternative tf-idf approach in which frequencies are adjusted based on a large reference corpus. For this experiment we randomly sample 2000 posts (along with their threads) published on the CNN Politics fan page. As a reference corpus we sample 1M, 2M, and 5M independent comments (i.e., from different posts) posted on other fan pages on Facebook. We re-compute our measurements on the two alternatives.

Our results show that, in terms of RMSD and Spearman's Rank correlation, the alternatives exemplify statistically similar performance, with no preference for either alternative. In contrast, the rank correlations differ significantly in favor of our adjusted tf-idf approach, as depicted in Figure 11. Notably, on top of the improvement in bias removal, adjusted tf-idf can be computed on a fairly small sample size, which eliminates the need for additional data collection, data storage, and high computational power.
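The contrasted alternative, as we read Section 6.5, replaces the in-sample idf with one estimated from a large external corpus of independent comments; a minimal sketch, with the function name and data layout being our assumptions:

```python
import math
from collections import Counter

def reference_idf(reference_comments):
    """idf estimated on a large reference corpus (lists of tokens) of
    independent comments, instead of on the studied comment set."""
    n = len(reference_comments)
    df = Counter(t for tokens in reference_comments for t in set(tokens))
    return {t: math.log(n / df_t) for t, df_t in df.items()}
```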

meanCorrelation 1M 2M 5M

1.0

1.0

1.0
0.76
Approach

True Positive Rate (Sensitivity)

True Positive Rate (Sensitivity)

True Positive Rate (Sensitivity)


0.8

0.8

0.8
0.74 Adjusted tf−idf

0.6

0.6

0.6
Non−adjusted tf−idf
0.72

0.4

0.4

0.4
0.5 0.7 0.9 0.5 0.7 0.9 0.5 0.7 0.9

0.2

0.2

0.2
Quantile

0.0

0.0

0.0
Fig. 11: Comparison of Spearman’s Quantile Rank correla- 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0

tions for varying data sizes. Illustrated on Page CNN Politics False Positive Rate (1−Specificity) False Positive Rate (1−Specificity) False Positive Rate (1−Specificity)

1.0

1.0

1.0
UMD news articles may be replaced by supposedly non- Fig. 12: Method’s robustness
True Positive Rate (Sensitivity)

True Positive Rate (Sensitivity)

True Positive Rate (Sensitivity)


0.8

0.8

0.8
identifying initials. Comments published by private users 0.6

0.6

0.6
may (unintentionally) reveal the undisclosed material data,
0.6 0.0 0.8 0.2 1.0 0.4

0.6 0.0 0.8 0.2 1.0 0.4

0.6 0.0 0.8 0.2 1.0 0.4


thus possibly breaching privacy of the subjects.
Examples of such data revelation are presented in Table
True Positive Rate (Sensitivity)

True Positive Rate (Sensitivity)

True Positive Rate (Sensitivity)


6 in [65]. For instance, the identity of a lottery winner is
0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0
intentionally not disclosed by a news article on Facebook. A
0.4

0.4

0.4
False Positive Rate (1−Specificity) False Positive Rate (1−Specificity) False Positive Rate (1−Specificity)

commenter on that article states that ”I know him ... worked


0.2

0.2

0.2
with him for years ...”. Given the readily available identity
0.0

0.0

0.0
of that commenter, it only takes a few clicks to identify the 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0

lottery winner. We therefore treat this comment and other False Positive Rate (1−Specificity) False Positive Rate (1−Specificity) False Positive Rate (1−Specificity)

similar ones as privacy breaching (or PB) comments.


The dataset for the case study contains 48 UMD articles, with an aggregate total of 3,538 annotated comments, out of which 149 (4.21%) were labeled as PB comments. The task of this study is to classify comments as PB comments or non-privacy-breaching (NPB) comments, using comment classification techniques. This task falls under the umbrella of sentiment analysis, as PB comments are users' way of expressing that they know more than the article reveals. Note that the classification method should be able to classify comments on both existing articles and new (future) articles.

7.1 Applying tf-idf (without adjustment)

We construct a data-driven classifier that utilizes classic sentiment tf-idf. We define two sentiment classes, PB and NPB, each grouping its set of relevant comments. Term frequency (tf) is computed per class. Inverse document frequency (idf) is computed separately for each class on its set of comments. To avoid the impact of unique words, and the consequent overfitting, we chose the 20 most frequent words (in the training set) to represent each class. Other thresholds were examined to assess the sensitivity of the prediction accuracy. We use logistic regression to predict class membership. Note that the classifier improves significantly when more predictors are added to the model. However, to illustrate the bias introduced by the discourse, a simple model should suffice. A minimal sketch of this construction is given below.
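The following Python sketch illustrates such a per-class tf-idf classifier. The toy comments, the helper names (class_tfidf, top_terms, featurize), the two-feature design (one aggregate weight per class), and the use of scikit-learn's LogisticRegression are our own illustrative assumptions, not the exact implementation used in the study.

from collections import Counter
import math

from sklearn.linear_model import LogisticRegression

# Toy stand-ins for the annotated training comments of each class.
train_pb = ["i know him personally", "worked with him for years"]
train_npb = ["great article", "thanks for sharing the news"]

def class_tfidf(comments):
    """Per-class weighting: tf is counted over all of the class's comments;
    idf treats each comment of the class as a 'document'."""
    docs = [c.lower().split() for c in comments]
    tf = Counter(w for doc in docs for w in doc)
    df = Counter(w for doc in docs for w in set(doc))
    n = len(docs)
    return {w: tf[w] * math.log(n / df[w]) for w in tf}, tf

def top_terms(tf, k=20):
    """Keep only the k most frequent training terms, avoiding the impact
    of unique words and the overfitting they cause."""
    return {w for w, _ in tf.most_common(k)}

weights_pb, tf_pb = class_tfidf(train_pb)
weights_npb, tf_npb = class_tfidf(train_npb)
vocab = top_terms(tf_pb) | top_terms(tf_npb)

def featurize(comment):
    """Two predictors: the comment's total PB-class and NPB-class weight."""
    words = [w for w in comment.lower().split() if w in vocab]
    return [sum(weights_pb.get(w, 0.0) for w in words),
            sum(weights_npb.get(w, 0.0) for w in words)]

X = [featurize(c) for c in train_pb + train_npb]
y = [1] * len(train_pb) + [0] * len(train_npb)  # 1 = PB, 0 = NPB
clf = LogisticRegression().fit(X, y)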
Out of the 48 UMD articles, we randomly sample 30 articles to serve as training data; the remaining 18 serve as the holdout set. The tf-idf weights were trained on the training set. To examine the robustness of the method on out-of-sample observations, we compared the performance of the method on the training and holdout sets. The whole procedure was repeated several times to verify model consistency (the article-level sampling is sketched below).

The comparative performance is summarized in Figure 12, which presents a typical ROC (Receiver Operating Characteristic) curve, depicting the trade-off between the False Positive Rate (1-Specificity) and the True Positive Rate (Sensitivity) for different internal thresholds of the logistic model. Positive, in our case, is the detection of a comment as a PB comment. The gray line in the chart represents the performance of the model on the training set. The black line represents performance on the holdout. Clearly, the performance of the model on out-of-sample observations is often poor. Moreover, by checking multiple samples, we learned that the performance is inconsistent and varies across samples (not shown in the figure). We attribute the offset between performances to the bias caused by between-participants' discourse correlation.

Fig. 12: Method's robustness (ROC on training vs. holdout sets).
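A sketch of the article-level sampling, assuming each comment carries the id of its parent article; the function and variable names (article_level_split, article_ids) are illustrative.

import random

def article_level_split(article_ids, n_train=30, seed=0):
    """Sample whole articles for training, so that the discourse within an
    article never straddles the train/holdout boundary."""
    rng = random.Random(seed)
    articles = sorted(set(article_ids))
    train_articles = set(rng.sample(articles, n_train))
    train_idx = [i for i, a in enumerate(article_ids) if a in train_articles]
    holdout_idx = [i for i, a in enumerate(article_ids) if a not in train_articles]
    return train_idx, holdout_idx

# Toy usage: 48 articles with a few comments each (stand-in for the real data).
article_ids = [a for a in range(48) for _ in range(3)]
train_idx, holdout_idx = article_level_split(article_ids, n_train=30, seed=1)
assert not {article_ids[i] for i in train_idx} & {article_ids[i] for i in holdout_idx}

The comment-level sampling of Section 7.2 below is obtained by replacing the article sampling with a plain random split of the comment indices, which is exactly what lets the discourse correlation leak from the training set into the validation set.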
7.2 Ignoring dependency structure

What if we entirely ignore the bias and treat all comments as independent observations? In this example, we randomly sample 60% of the comments (rather than news articles) to serve as training data, leaving 40% of the comments to validate the model. Here again, we repeat the analysis several times. A typical performance comparison is depicted in Figure 13, showing statistically equal performance. This can be explained by the fact that, in this scenario, the discourse correlation that exists in the training set is replicated in the validation set. Therefore, the bias it causes does not affect performance on the holdout. Obviously, this model will not perform similarly on new, out-of-sample observations.

Fig. 13: Deceiving model (ROC on training vs. validation sets).

7.3 Discourse correlation

The comments in our data are highly correlated for two reasons. First, we observe correlation between comments and the article itself. Second, we observe that new commenters' responses to existing comments generate correlation between comments. In Figure 14 (left-hand side) we attempt to visualize the amount of correlation in our dataset.
In black ink, we scatter plot the joint distribution of word frequency in comments (the number of comments in which it appears), on the x-axis, and word frequency in news articles (the number of articles in which it appears), on the y-axis. Since some articles attract more comments, the concave trend (black line), which represents the smoothed relationship, is expected. It is observed (see Figure 14, right-hand side) that several words, such as 'hero', 'king', and 'health', are much more frequent in comments relative to the number of articles in which they appear, compared with the other words (the figure is trimmed at the 95% quantile for better visibility).

Fig. 14: Discourse correlation (x-axis: word's frequency in comments; y-axis: frequency in articles).

We next compare the observed correlation with the joint distribution of word frequency in comments/articles that would be expected under comment independence. We randomly assign words to articles, keeping the term frequencies and the total comment length per article constant. That is: (1) the number of repetitions of term t in the experiment equals the number of repetitions in the raw data, and (2) article i is assigned w_i random words, possibly with repetition, where w_i is the total length of the comments on this article in the raw data. The gray dots in Figure 14 present the resultant joint distribution. The gray line is the smoothed relationship between the variables. This experiment shows some correlation between nearly all words in the dataset. A sketch of this randomization appears below.
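The randomization can be sketched as follows: pool every word occurrence across all comments, shuffle the pool, and deal w_i occurrences back to each article. This preserves both each term's total count and each article's total comment length; the function and variable names are illustrative.

import random
from collections import Counter

def randomize_words(article_words, seed=0):
    """Randomly reassign word occurrences to articles, keeping (1) each
    term's total number of repetitions and (2) each article's total
    comment length w_i unchanged."""
    rng = random.Random(seed)
    pool = [w for words in article_words.values() for w in words]
    rng.shuffle(pool)
    randomized, pos = {}, 0
    for article, words in article_words.items():
        randomized[article] = pool[pos:pos + len(words)]  # w_i words per article
        pos += len(words)
    return randomized

# Toy usage: per word, the number of articles whose comments contain it.
article_words = {1: ["hero", "hero", "win"], 2: ["win", "luck"], 3: ["hero"]}
n_articles = Counter(w for words in randomize_words(article_words).values() for w in set(words))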
7.4 Applying adjusted tf-idf to the data

We applied the correction in Equation 2 to the text-preprocessing step. We then re-trained the classifier on a training set consisting of 30 randomly sampled articles. We repeated the sampling-training procedure several times and compared the results on the training and holdout sets. The results are given in Figure 15, demonstrating the removal of the discourse bias: the adjusted model is robust to out-of-sample observations.

Fig. 15: Adjusted model (ROC on training vs. holdout sets).
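In pipeline form, the correction slots into the pre-processing step, before the classifier is re-trained. Equation 2 is not reproduced in this section, so the sketch below deliberately leaves the adjusted weight as a stub; everything else mirrors the unadjusted construction above.

from collections import Counter
import math

def adjusted_weight(term, tf, idf):
    """Stub for the correction of Equation 2: return the bias-adjusted
    weight of the term (the actual formula is defined earlier in the paper)."""
    raise NotImplementedError

def adjusted_class_tfidf(comments):
    """Identical to class_tfidf above, with the classical tf * log(n/df)
    product replaced by the adjusted weight."""
    docs = [c.lower().split() for c in comments]
    tf = Counter(w for doc in docs for w in doc)
    df = Counter(w for doc in docs for w in set(doc))
    n = len(docs)
    return {w: adjusted_weight(w, tf[w], math.log(n / df[w])) for w in tf}, tf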
8 CONCLUSION

Many contemporary social media web sites, Facebook being one, incorporate commenting systems that allow people to respond to posts on the web sites. Commenting systems encourage discourse exchanges between participants. Between-participant discourse has been proven to be content-dependent. This dependency, as we have shown in this paper, introduces bias to tf-idf, a widely used term-weighting technique for text preprocessing. We have further shown that this bias, if ignored, can manifest in a non-robust method at best, and can lead to an entirely wrong conclusion at worst.

For decades, tf-idf has played a major role in information retrieval and text mining. Early on, tf-idf aimed to model the relationship between a term and a document in reference to a corpus [50]. Over the years, the main practical usage of tf-idf (and variations thereof) was to model topics and match queries to topics [47]. In recent years, tf-idf has been harnessed for the classification of sentiments of short sentences (e.g., tweets and comments), a significant deviation from its intended usage. The short-sentence environment, and even more so the tweet and comment environments, introduce two major challenges to tf-idf. Firstly, a large reference document collection is typically unavailable. Secondly, there exists between-participants' discourse. As a result, classical tf-idf generates biased weights when applied to sentiment analysis. In this study we discuss this bias; we show that it mainly relates to discourse and topic, and that it is not part of the sentiment, in particular when the goal is prediction (out-of-sample observations with new topics and new discourse).

We proposed an adjustment that corrects for the discourse bias when applying tf-idf to comments on online posts. However, it should be noted that the main contribution of this paper is in exposing the potential misuse of tf-idf when applied to dependent text. Other approaches can also be considered to quantify and/or mitigate the bias. One example is the approach suggested by [66], which surveys different sources as potential textual baselines. Another example is k-fold cross-validation, commonly used in sentiment analysis, where each of the k samples contains a list of comments to different documents (unlike the common practice, which is random sampling); a sketch of such a document-grouped k-fold is given below. While k-fold is expected to give a fair solution to the bias, it may be prohibitively expensive when applied to large datasets. This limitation is crucial in the context of user-generated content and social media, where Big Data is being considered.
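Such a document-grouped k-fold can be sketched with scikit-learn's GroupKFold, which guarantees that comments on the same document never appear in both the training and validation folds; the toy data and variable names are illustrative.

from sklearn.model_selection import GroupKFold

# Each comment carries the id of the document (post) it responds to.
comments = ["I know him", "great news", "worked with him",
            "nice article", "lucky guy", "thanks for sharing"]
labels = [1, 0, 1, 0, 1, 0]
doc_ids = [101, 101, 102, 102, 103, 103]

# Three folds; all comments on a given document land in the same fold.
for train_idx, valid_idx in GroupKFold(n_splits=3).split(comments, labels, groups=doc_ids):
    assert not {doc_ids[i] for i in train_idx} & {doc_ids[i] for i in valid_idx}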


In our case studies, some comment threads exhibit high discourse correlation, whereas others do not. Other domains may contain higher correlations. However, the adjustment presented in this paper is applicable to any level of correlation (even to a no-correlation scenario), as the adjustment coefficient is proportional to the observed bias.

While we present prominent advantages of adjusting the tf-idf weights, there are some limitations as well. In particular, when the term population is small, tf-idf weighting may be biased, yet the adjustment we introduce may over-compensate for this bias, introducing another bias. For instance, there are cases in which some terms are used only by a small group of people, and therefore their bias correction could likely be flawed.

Despite the limitations of our approach, it is evident that bias correction of the sort we propose can significantly improve comment classification accuracy and processing time. As we show in three different examples, bias indeed exists, hence our approach is very relevant. In fact, bias of the type we have identified in comment and tweet environments exists in other domains as well, as we have found in yet unpublished research. This suggests that, prior to using tf-idf, researchers should check for bias in their target domain and apply corrective adjustments where applicable, to avoid undesirable analysis results.

REFERENCES
[1] N. B. Ellison et al., "Social network sites: Definition, history, and scholarship," Journal of Computer-Mediated Communication, vol. 13, no. 1, pp. 210–230, 2007.
[2] H. Chen, R. H. Chiang, and V. C. Storey, "Business intelligence and analytics: From big data to big impact," MIS Quarterly, vol. 36, no. 4, 2012.
[3] V. Dhar, "Data science and prediction," Communications of the ACM, vol. 56, no. 12, pp. 64–73, 2013.
[4] M. Saar-Tsechansky, "Editor's comments: The business of business data science in IS journals," MIS Quarterly, vol. 39, no. 4, pp. iii–vi, 2015.
[5] J. P. Gee, An Introduction to Discourse Analysis: Theory and Method, 2014.
[6] A. Abbasi and H. Chen, "CyberGate: A design framework and system for text analysis of computer-mediated communication," MIS Quarterly, pp. 811–837, 2008.
[7] E. Vaast, E. J. Davidson, and T. Mattson, "Talking about technology: The emergence of a new actor category through new media," MIS Quarterly, vol. 37, no. 4, 2013.
[8] D. Surian, S. Seneviratne, A. Seneviratne, and S. Chawla, "App miscategorization detection: A case study on Google Play," IEEE Transactions on Knowledge and Data Engineering, vol. 29, no. 8, pp. 1591–1604, 2017.
[9] B. Shi, G. Poghosyan, G. Ifrim, and N. Hurley, "Hashtagger+: Efficient high-coverage social tagging of streaming news," IEEE Transactions on Knowledge and Data Engineering, vol. 30, no. 1, pp. 43–58, 2018.
[10] W. Hua, Z. Wang, H. Wang, K. Zheng, and X. Zhou, "Understand short texts by harvesting and analyzing semantic knowledge," IEEE Transactions on Knowledge and Data Engineering, vol. 29, no. 3, pp. 499–512, 2017.
[11] H. Zhang, S. Wang, Z. Mingbo, X. Xu, and Y. Ye, "Locality reconstruction models for book representation," IEEE Transactions on Knowledge and Data Engineering, 2018.
[12] M. Potthast, B. Stein, F. Loose, and S. Becker, "Information retrieval in the commentsphere," ACM Transactions on Intelligent Systems and Technology (TIST), vol. 3, no. 4, p. 68, 2012.
[13] K. Filippova and K. B. Hall, "Improved video categorization from text metadata and user comments," in Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 2011, pp. 835–842.
[14] S. Siersdorfer, S. Chelaru, W. Nejdl, and J. San Pedro, "How useful are your comments?: Analyzing and predicting YouTube comments and comment ratings," in Proceedings of the 19th International Conference on World Wide Web. ACM, 2010, pp. 891–900.
[15] S. Siersdorfer, S. Chelaru, J. S. Pedro, I. S. Altingovde, and W. Nejdl, "Analyzing and mining comments and comment ratings on the social web," ACM Transactions on the Web, vol. 8, no. 3, p. 17, 2014.
[16] M. Hu, A. Sun, and E.-P. Lim, "Comments-oriented document summarization: Understanding documents with readers' feedback," in Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 2008, pp. 291–298.
[17] G. Mishne, N. Glance et al., "Leave a reply: An analysis of weblog comments," in Third Annual Workshop on the Weblogging Ecosystem, Edinburgh, Scotland, 2006.
[18] A. Schuth, M. Marx, and M. De Rijke, "Extracting the discussion structure in comments on news-articles," in Proceedings of the 9th Annual ACM International Workshop on Web Information and Data Management. ACM, 2007, pp. 97–104.
[19] X. Wang, J. Bian, Y. Chang, and B. Tseng, "Model news relatedness through user comments," in Proceedings of the 21st International Conference on World Wide Web. ACM, 2012, pp. 629–630.
[20] D. Eberle, G. Berens, and T. Li, "The impact of interactive corporate social responsibility communication on corporate reputation," Journal of Business Ethics, vol. 118, no. 4, pp. 731–746, 2013.
[21] R. Shi, P. Messaris, and J. N. Cappella, "Effects of online comments on smokers' perception of antismoking public service announcements," Journal of Computer-Mediated Communication, vol. 19, no. 4, pp. 975–990, 2014.
[22] J. B. Walther, D. DeAndrea, J. Kim, and J. C. Anthony, "The influence of online comments on perceptions of antimarijuana public service announcements on YouTube," Human Communication Research, vol. 36, no. 4, pp. 469–492, 2010.
[23] Y.-J. Chang, Y.-S. Chang, S.-Y. Hsu, and C.-H. Chen, "Social network analysis to blog-based online community," in International Conference on Convergence Information Technology. IEEE, 2007, pp. 2193–2198.
[24] M. Ziegele, T. Breiner, and O. Quiring, "What creates interactivity in online news discussions? An exploratory analysis of discussion factors in user comments on news items," Journal of Communication, vol. 64, no. 6, pp. 1111–1138, 2014.
[25] M. Zhu, W. Hu, and O. Wu, "Topic detection and tracking for threaded discussion communities," in IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology (WI-IAT'08), vol. 1. IEEE, 2008, pp. 77–83.
[26] D. Feng, J. Kim, E. Shaw, and E. Hovy, "Towards modeling threaded discussions using induced ontology knowledge," in Proceedings of the 21st National Conference on Artificial Intelligence, vol. 21, no. 2. AAAI Press, 2006, pp. 1289–1294.
[27] D. Davidov, O. Tsur, and A. Rappoport, "Semi-supervised recognition of sarcastic sentences in Twitter and Amazon," in Proceedings of the Fourteenth Conference on Computational Natural Language Learning. Association for Computational Linguistics, 2010, pp. 107–116.
[28] R. Justo, T. Corcoran, S. M. Lukin, M. Walker, and M. I. Torres, "Extracting relevant knowledge for the detection of sarcasm and nastiness in the social web," Knowledge-Based Systems, vol. 69, pp. 124–133, 2014.
[29] A. Hassan, V. Qazvinian, and D. Radev, "What's with the attitude?: Identifying sentences with attitude in online discussions," in Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2010, pp. 1245–1255.
[30] V. Qazvinian, E. Rosengren, D. R. Radev, and Q. Mei, "Rumor has it: Identifying misinformation in microblogs," in Proceedings of the Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2011, pp. 1589–1599.
[31] N. FitzGerald, G. Carenini, G. Murray, and S. Joty, "Exploiting conversational features to detect high-quality blog comments," Advances in Artificial Intelligence, pp. 122–127, 2011.
[32] N. Wanas, M. El-Saban, H. Ashour, and W. Ammar, "Automatic scoring of online discussion posts," in Proceedings of the 2nd ACM Workshop on Information Credibility on the Web. ACM, 2008, pp. 19–26.
[33] E. Momeni, K. Tao, B. Haslhofer, and G.-J. Houben, "Identification of useful user comments in social media: A case study on Flickr commons," in Proceedings of the 13th ACM/IEEE-CS Joint Conference on Digital Libraries. ACM, 2013, pp. 1–10.
[34] T. Nasukawa and J. Yi, "Sentiment analysis: Capturing favorability using natural language processing," in Proceedings of the 2nd International Conference on Knowledge Capture. ACM, 2003, pp. 70–77.
[35] K. Dave, S. Lawrence, and D. M. Pennock, "Mining the peanut gallery: Opinion extraction and semantic classification of product reviews," in Proceedings of the 12th International Conference on World Wide Web. ACM, 2003, pp. 519–528.
[36] B. Liu, "Sentiment analysis and opinion mining," Synthesis Lectures on Human Language Technologies, vol. 5, no. 1, pp. 1–167, 2012.
[37] B. Liu and L. Zhang, "A survey of opinion mining and sentiment analysis," in Mining Text Data. Springer, 2012, pp. 415–463.
[38] B. Pang, L. Lee et al., "Opinion mining and sentiment analysis," Foundations and Trends in Information Retrieval, vol. 2, no. 1–2, pp. 1–135, 2008.
[39] M. Kantrowitz, B. Mohit, and V. Mittal, "Stemming and its effects on TFIDF ranking (poster session)," in Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 2000, pp. 357–359.
[40] M. Toman, R. Tesar, and K. Jezek, "Influence of word normalization on text classification," Proc. InSciT, vol. 4, pp. 354–358, 2006.
[41] O. Arazy and C. Woo, "Enhancing information retrieval through statistical natural language processing: A study of collocation indexing," MIS Quarterly, pp. 525–546, 2007.
[42] M. Gamon, "Sentiment classification on customer feedback data: Noisy data, large feature vectors, and the role of linguistic analysis," in Proceedings of the 20th International Conference on Computational Linguistics. Association for Computational Linguistics, 2004, p. 841.
[43] A. Kennedy and D. Inkpen, "Sentiment classification of movie reviews using contextual valence shifters," Computational Intelligence, vol. 22, no. 2, pp. 110–125, 2006.
[44] V. Gómez, H. J. Kappen, N. Litvak, and A. Kaltenbrunner, "A likelihood-based framework for the analysis of discussion threads," World Wide Web, vol. 16, no. 5–6, pp. 645–675, 2013.
[45] S. Lee, J. Baker, J. Song, and J. C. Wetherbe, "An empirical comparison of four text mining methods," in 2010 43rd Hawaii International Conference on System Sciences (HICSS). IEEE, 2010, pp. 1–10.
[46] A. Ghose, P. G. Ipeirotis, and B. Li, "Designing ranking systems for hotels on travel search engines by mining user-generated and crowdsourced content," Marketing Science, vol. 31, no. 3, pp. 493–520, 2012.
[47] G. Salton and C.-S. Yang, "On the specification of term values in automatic indexing," Journal of Documentation, vol. 29, no. 4, pp. 351–372, 1973.
[48] C. Manning, P. Raghavan, and H. Schütze, "Language models for information retrieval," Introduction to Information Retrieval, pp. 237–252, 2008.
[49] G. Paltoglou and M. Thelwall, "A study of information retrieval weighting schemes for sentiment analysis," in Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 2010, pp. 1386–1395.
[50] S. Robertson, "Understanding inverse document frequency: On theoretical arguments for IDF," Journal of Documentation, vol. 60, no. 5, pp. 503–520, 2004.
[51] G. Salton and C. Buckley, "Term-weighting approaches in automatic text retrieval," Information Processing & Management, vol. 24, no. 5, pp. 513–523, 1988.
[52] A. Bermingham and A. F. Smeaton, "On using Twitter to monitor political sentiment and predict election results," Sentiment Analysis Where AI Meets Psychology (SAAIP), pp. 2–10, 2011.
[53] J. Martineau and T. Finin, "Delta TFIDF: An improved feature space for sentiment analysis," ICWSM, vol. 9, p. 106, 2009.
[54] R. Agrawal, S. Rajagopalan, R. Srikant, and Y. Xu, "Mining newsgroups using networks arising from social behavior," in Proceedings of the 12th International Conference on World Wide Web. ACM, 2003, pp. 529–535.
[55] T. Mullen and R. Malouf, "A preliminary investigation into sentiment analysis of informal political discourse," in AAAI Spring Symposium: Computational Approaches to Analyzing Weblogs, 2006, pp. 159–162.
[56] G. Shmueli et al., "To explain or to predict?" Statistical Science, vol. 25, no. 3, pp. 289–310, 2010.
[57] R. M. Stein, "Benchmarking default prediction models: Pitfalls and remedies in model validation," Moody's KMV, New York, vol. 20305, 2002.
[58] D. Sprott, "Urn models and their application: An approach to modern discrete probability theory," 1978.
[59] L. H. Chen, "Poisson approximation for dependent trials," The Annals of Probability, pp. 534–545, 1975.
[60] B. Pang, L. Lee, and S. Vaithyanathan, "Thumbs up?: Sentiment classification using machine learning techniques," in Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing, Volume 10. Association for Computational Linguistics, 2002, pp. 79–86.
[61] E. Riloff, S. Patwardhan, and J. Wiebe, "Feature subsumption for opinion analysis," in Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2006, pp. 440–448.
[62] P. Godfrey, "Balls and bins with structure: Balanced allocations on hypergraphs," in Proceedings of the Nineteenth Annual ACM-SIAM Symposium on Discrete Algorithms. Society for Industrial and Applied Mathematics, 2008, pp. 511–517.
[63] V. Lavrusik, "Improving conversations on Facebook with replies," Journalists on Facebook, vol. 25, 2013.
[64] G. Seni and J. F. Elder, "Ensemble methods in data mining: Improving accuracy through combining predictions," Synthesis Lectures on Data Mining and Knowledge Discovery, vol. 2, no. 1, pp. 1–126, 2010.
[65] D. G. Schwartz, I. Yahav, and G. Silverman, "News censorship in online social networks: A study of circumvention in the commentsphere," Journal of the Association for Information Science and Technology, vol. 68, no. 3, pp. 569–582, 2017.
[66] V. Yatsko, S. Dixit, A. Agrawal, and S. Myint, "TF*IDF revisited," International Journal of Computational Linguistics and Natural Language Processing, vol. 2, no. 6, pp. 385–387, 2013.

Inbal Yahav
Dr. Inbal Yahav, Social Scientist (PhD 2010), is an assistant professor and the head of the Information Systems specialization at the Graduate School of Business Administration, Bar-Ilan University, Israel. Her main research interest is developing and tuning statistical models for the information systems discipline. In her research work, Dr. Yahav combines techniques from data mining algorithms, social network analysis, and optimization models to achieve optimized and interpretable statistical models. She applies these methods mainly to health care applications and online social networks. Dr. Yahav has presented her work at multiple conferences and has published papers in books and journals, including MIS Quarterly, Production and Operations Management, and Annals of Operations Research. She received her B.A. in Computer Science and her M.Sc. in Industrial Engineering from the Israel Institute of Technology, and her PhD in Operations Research and Data Mining from the University of Maryland, College Park, in August 2010. Dr. Yahav is currently serving as an Associate Editor of the Decision Sciences Journal and the Big Data journal.

Onn Shehory
Onn Shehory received his MSc in physics and his PhD in computer science from Bar-Ilan University, Israel, in 1992 and 1996, respectively. He has recently joined Bar-Ilan University as an associate professor of information systems. Prior to taking up this position he was a researcher at IBM Research. His research interests include intelligent information systems, autonomous systems, data analytics, social network analysis, software engineering, and algorithmic game theory.

David Schwartz
David G. Schwartz is professor of information systems, and former vice-chairman, at the Business School of Bar-Ilan University, Israel. He has published over 120 research papers, books, book chapters, and editorials in the field of information systems and technologies. His research has appeared in publications such as Information Systems Research, IEEE Intelligent Systems, ACM Computing Surveys, and the Journal of the Association for Information Science and Technology (JASIST). His books include Cooperating Heterogeneous Systems; Internet-Based Knowledge Management and Organizational Memory; and the Encyclopedia of Knowledge Management, now in its second edition. From 1998 to 2011 he served as editor-in-chief of the journal Internet Research and is currently an Associate Editor of the European Journal of Information Systems. His main research interests are cybersecurity, mHealth, knowledge management, social network analysis, and computer-mediated communications. David received his Ph.D. in Computer Science from Case Western Reserve University, USA; his MBA from McMaster University, Canada; and his B.Sc. from the University of Toronto, Canada.