Guo ExtractingRepresentativeInformation 2017

Extracting Representative Information on Intra-Organizational Blogging Platforms
Author(s): Xunhua Guo, Qiang Wei, Guoqing Chen, Jin Zhang and Dandan Qiao
Source: MIS Quarterly , December 2017, Vol. 41, No. 4 (December 2017), pp. 1105-1128
Published by: Management Information Systems Research Center, University of Minnesota
Stable URL: https://www.jstor.org/stable/10.2307/26630287
JSTOR is a not-for-profit service that helps scholars, researchers, and students discover, use, and build upon a wide
range of content in a trusted digital archive. We use information technology and tools to increase productivity and
facilitate new forms of scholarship. For more information about JSTOR, please contact support@jstor.org.
Your use of the JSTOR archive indicates your acceptance of the Terms & Conditions of Use, available at
https://about.jstor.org/terms
Management Information Systems Research Center, University of Minnesota is collaborating with

JSTOR to digitize, preserve and extend access to MIS Quarterly
This content downloaded from

203.218.254.164 on Wed, 18 Oct 2023 08:47:49 +00:00
All use subject to https://about.jstor.org/terms
RESEARCH ARTICLE
EXTRACTING REPRESENTATIVE INFORMATION ON

INTRA-ORGANIZATIONAL BLOGGING PLATFORMS1
Xunhua Guo, Qiang Wei, and Guoqing Chen
Research Center for Contemporary Management, School of Economics and Management, Tsinghua University,
Beijing 100084 CHINA {guoxh@sem.tsinghua.edu.cn} {weiq@sem.tsinghua.edu.cn}
{chengq@sem.tsinghua.edu.cn}
Jin Zhang
School of Business, Renmin University of China,
Beijing 100872 CHINA {zhangjin@rbs.ruc.edu.cn}
Dandan Qiao
Beijing 100084 CHINA {qiaodd.12@sem.tsinghua.edu.cn}
Content generated on intra-organizational blogging platforms may help managers understand the emerging
ideas, issues, and opportunities of their companies, whereas the difficulty is how to go beyond the overload of
information and obtain an overall view. This paper proposes a system framework for extracting representative
information from intra-organizational blogging platforms, as well as the REPSET (REPresentative SET)
method, which serves as the core component of the extraction system. Drawing from a novel clustering
technique, REPSET is designed to identify a small set of items that largely represent the diversified content of
a huge information base. Building on REPSET, an extraction system enables managers to locate representative
articles that may serve as starting points for comprehensively understanding the hot topics, prevailing thoughts,
and emerging opinions among employees. Empirical evaluations are conducted based on the massive database
accumulated on an internal blogging platform at a large telecommunications company. The results from data
experiments and user evaluations demonstrate that REPSET and the extraction system upon which it is based
can provide outstanding performance, in comparison with benchmark methods.
Keywords: Intra-organizational blogging platforms, representative information extraction, enterprise social

media, business intelligence, design science
Introduction1 mation production on the Internet since the beginning of this

century (Nardi et al. 2004). Recently, blogs have also been
As one of the typical social media applications, blogs have introduced into corporate contexts along with other social
been prevailing and serving as a substantial source of infor- media technologies such as wikis and group-messaging,
creating a new form of collaborative environment (Bente and
1
Karla 2009; Lu et al. 2015; McAfee 2006). It is widely be-
Sumit Sarkar was the accepting senior editor for this paper. Manish lieved that enterprise blogging platforms would help build a
Agrawal served as the associate editor.
flexible intra-organizational networking stage that may effec-
The appendices for this paper are located in the “Online Supplements” tively facilitate knowledge sharing and support emergent
section of the MIS Quarterly’s website (http://www.misq.org). innovations (Li et al. 2015; McAfee 2006; Suh et al. 2011).
MIS Quarterly Vol. 41 No. 4, pp. 1105-1127/December 2017 1105

203.218.254.164 on Wed, 18 Oct 2023 08:47:49 +00:00
Guo et al./Extracting Representative Information on Blogging Platforms
Many companies are investing heavily in internal blogging vice, but few other people read and discuss the idea. In this
platforms and encouraging their employees to maintain blogs case, the article will not be recommended by a popularity-
(Chui et al. 2012; Liao et al. 2011; Luo et al. 2015). based system. It will also not be recommended to a manager
by a preference-based system if it does not fit with the man-
When employees have accepted an internal blogging platform ager’s pre-stored preferences. However, the idea might prove
and are regularly posting their thoughts and writings on it, to be innovative if it is extracted and exposed to the manager.
information accumulated on such a crowd-based platform is Another example is that an employee posts an article
potentially useful for managers to support their managerial expressing some negative feelings. Using popularity-based
work. Existing research has revealed that blog posts often mechanisms, the article will not be recommended to the
reflect real feelings of the authors (Nardi et al. 2004) and can manager unless it is widely discussed, but the manager may
be considered as a reliable source of information (Aggarwal find it useful to notice the existence of such articles at an early
et al. 2012). By reading internal blogs, decision makers will stage. In other words, suppose that there is a box containing
be able to leverage the shared knowledge originally distrib- 100 balls, among which 99 are red (common articles) and 1 is
uted across the organization (Li et al. 2015; Lu et al. 2015) green (the special article). Now we want to pick out 10 balls
and react to emerging business opportunities or changing from the box to represent the population in the box. A
organizational conditions in a more timely fashion. random sampling or a popularity-based extraction is likely to
result in 10 red balls. However, if a method can pick out 9
The difficulty, however, lies in the huge amount of data gen- red balls and 1 green ball, this result can be deemed as more
erated on such a platform. An internal blogging system “representative” because it informs us that there are green
typically produces thousands of posts every week. It is balls existing as minority in the box.
almost impossible for a decision maker to read or even
browse all of the articles. The explosion of content creates a Since existing methods are mostly designed for mining public
challenge that, while access to knowledge and information is blogs, they usually focus on popularity and personalization,
facilitated, causes quality signals to become so faint and dif- whereas representativeness is rarely considered. For intra-
fused that cognitive overload occurs (Bawden and Robinson organizational contexts, it is meaningful to develop a novel
2009). In other words, one critical obstacle to the gold mine approach that is centered on representativeness. Therefore,
of valuable blog content is how to look beyond the overload the objective of this study is to propose a system framework
of information and obtain a clear view of the diversified capable of extracting a small number of articles that can
content on the platform. Specifically, to exploit the potential largely represent the diversified content generated on an
value of internal blogging platforms, decision makers face a organizational blogging platform. To achieve this goal, we
fundamental question: Which articles will they read, given develop REPSET (REPresentative SET), a business intelli-
the large amount of articles generated every day, in order to gence (BI) method serving as the core component of the
better understand the prevailing thoughts and emerging extraction system. Featuring a novel clustering technique,
opinions among employees? REPSET is designed for effectively identifying a small set of
items, which can represent the diversified content of a large
The information needs of managers in organizational contexts information base. To validate the effectiveness of REPSET,
are notably different from those in public blogging contexts. empirical evaluations were conducted based on the massive
On public blogs, users are usually concerned about the database accumulated on an internal blogging platform at a
content that fits with their own interest, as well as widely large telecommunications company. An illustrative example
discussed topics (e.g., top rated or most commented). In is provided depicting how our proposed method can facilitate
contrast, in organizational contexts, to facilitate the mana- enhanced diversity in coverage, resulting in more representa-
gerial use of blog content, managers would prefer to be more tive results. Results from data experiments and user evalua-
aware of diverse voices as well as emerging ideas and tions demonstrate that REPSET and the extraction system
opinions that might potentially be meaningful to the organi- based on it can provide outstanding performance, in com-
zation, no matter how popular they are or whether they fit parison with benchmark methods.
with the managers’ personal interest. In this sense, the quality
of representativeness of the extracted information is critical. From the perspective of design science research (Goes 2014;
Gregor and Hevner 2013; Hevner et al. 2004), our work can
Literally, representativeness can be defined as the degree to be seen as positioned in the “exaptation” quadrant of the
which a small set of items reflect the diverse content of a design science knowledge contribution framework (Gregor
large information base. A simple example for the need of and Hevner 2013), as it focuses on the relatively new problem
representativeness is that an employee posts an article that of representative information extraction on internal blogs and
implicitly conveys a novel idea for improving customer ser- develops REPSET as an exaptation of related BI methods in
1106 MIS Quarterly Vol. 41 No. 4/December 2017

203.218.254.164 on Wed, 18 Oct 2023 08:47:49 +00:00
response to the particular challenges of that problem. There- partitional clustering methods, the number of hash tables
fore, this study aims to make contributions in the form of needs to be predetermined, which may restrict its application
design artifacts including the REPSET method as well as the in the research scenario of this study. Compared to partitional
formalization and instantiation of the representative informa- methods, the complexity of hierarchical clustering methods is
tion extraction framework. Based on this motivation, this higher. Meanwhile, in hierarchical methods, if a data object
work has been conducted in line with the design science is assigned to a wrong cluster at a stage, it will never be re-
methodological principles (Hevner et al. 2004). allocated in the following clustering process, missing the
possible opportunity to further refine the clustering results.
Furthermore, topic clustering in online content analysis

Literature Review usually focuses on finding the most common and dominant
opinions, which is different from our emphasis on represen-
In the BI literature, there are efforts to develop reliable tativeness. For instance, K-Means clustering, the most com-
methods for monitoring and analyzing online content. For monly used partitional method, suffers from at least three
instance, research on online opinion mining aims to determine well-known issues: (1) it tends to generate spatially equi-
the attitude of the public toward news, products, or services distant clusters; (2) it fails to effectively consider different
(Turney and Littman 2003; Wright 2009). Methods have data densities; and (3) it fails to consider data sets with non-
been developed for collecting and analyzing social network globular structures effectively (Han et al. 2011). While
content such as blogs (Chau and Xu 2012), extracting small certain methods fare better on some of these issues (for
sentiments from a relatively large message body (Turney and instance, DBScan is better for the density problem and nearest
Littman 2003), and discovering relationships between infor- neighbor-based methods have alleviated the non-globular
mation items (Ku et al. 2009). Meanwhile, models and problem), no existing method effectively addresses all three
systems have been built based on various techniques to of these issues. Considering that the issues of different sizes
achieve diverse managerial goals, such as classifying and and densities are even more crucial and common in the
summarizing product reviews (Hu and Liu 2004; Zhang et al. context of representative information extraction, the deficien-
2016), predicting sales performance (Liu et al. 2007), esti- cies of such methods may be amplified with the emergence of
mating the effectiveness of communication (Abbasi and Chen novel ideas and opinions that are still not popular and of small
2008), and providing decision support for business mergers sizes and densities.
and acquisitions (Lau et al. 2012). Underlying most of the
research on online content mining, the most widely addressed Another type of techniques related to topic clustering is topic
BI techniques are topic clustering (Pons-Porrata et al. 2007; modeling, such as LDA (Blei et al. 2003), hLDA (Blei et al.
Zhao et al. 2005), which is the unsupervised classification of 2004), and DTM (Blei and Lafferty 2006). These techniques
a given document set into clusters such that documents within typically decompose documents into topic terms, so that every
each cluster are more similar between each other than those document is associated to a topic with a certain probability.
in different clusters, and text summarization (Das and Martins The probability that a document belongs to a topic is higher
2007; Gupta and Lehal 2010), which aims to condense the when the document contains more discussion on that topic
source text into a shorter version while preserving its (Blei 2012). Therefore, for each topic, the document with the
information content and overall meaning. largest probability may be selected as the representative item.
These methods offer the advantage that the relationship
Previous research on topic clustering is usually based on between a document and the topic that it represents can be
partitional methods, such as K-Means (Jain et al. 1999) and effectively explained. However, topic modeling methods
graph-partition techniques (Han et al. 1998), density-based generally focus on the frequencies of terms. Therefore, they
methods (Ester et al. 1996), or hierarchical methods (Berkhin tend to obtain topics based on frequently used terms while the
2006). Among them, the partitional and density-based “long tail” terms are likely to be ignored.
methods have a linear time complexity, which is very effi-
cient. However, such methods may produce inferior clusters Additionally, text summarization methods may also be con-
in some cases and the number of clusters is required to be pre- sidered as a foundation for representative information
determined, which is usually difficult in practice, as well as in extraction to some extent. In the existing literature, two types
the context of this study. Locality sensitive hashing (LHS) is of text summarization techniques have been well developed,
another technique related to partitional methods (Gionis et al. namely extractive methods and abstractive methods (Aligu-
1999; Indyk and Motwani 1998). LHS maps similar data liyev 2009; Cai and Li 2011). Between them, extractive
objects to the same buckets (Leskovec et al. 2014), so it can methods are more well developed. The disadvantage of such
be applied to partition data objects. However, like other methods, however, is that the extracted text summaries often
MIS Quarterly Vol. 41 No. 4/December 2017 1107

203.218.254.164 on Wed, 18 Oct 2023 08:47:49 +00:00
lead to problems in overall coherence and are very difficult (CEO and engineer). In the interviews, the envisioned frame-
for users to understand in practice, as has been repeatedly work was conveyed to the interviewees. All of the inter-
pointed out by the existing literature (Gupta and Lehal 2010). viewees replied with a positive confirmation that such a
In contrast, abstractive summarization relies on advanced representative information extraction system was desirable.
language generation techniques to produce grammatical sum- We then asked them to describe what features they might
maries, but the quality of the summaries heavily depends on expect for such extraction systems. Based on their responses,
sophisticated language processing tools that are not always we developed the design requirements for the extraction
adaptable in various contexts, making abstractive methods method.
less usable and less studied than extractive methods (Cai and
Li 2011). Furthermore, with both types of summarization The overall design framework is for a system that can extract
methods, when the summaries are extracted, the integrity of representative information from corporate internal blogging
every single article cannot be maintained and authorship platforms using a business intelligence approach. As a highly
information usually cannot be included. Consequently, it is conceptualized illustration, the framework is shown in
difficult for readers to locate the sources of particular senti- Figure 1. It highlights the relationship between the corporate
ments or to find a typical individual article that contains the internal blogging platform and the representative information
sentiments. extraction system, of which the central component is the
extraction engine. In the following, we will discuss the blog-
Therefore, although text summarization can be used to obtain ging platform and the extraction engine respectively, and then
good information coverage in the extraction of representative elaborate on the technical requirements for the extraction
content, it is still far from perfect because it suffers from the method implemented in the engine.
above-mentioned disadvantageous attributes. In this sense,
we regard it as a worthwhile attempt to develop a novel
method to achieve representative information extraction The Enterprise Blogging Platform
through a different path. In this paper, we focus on the
method for extracting representative items. This new method The extraction system is designed as an add-on to the cor-
is not intended to replace sentiment analysis or text summari- porate internal blogging platform. Generally, blogs are fre-
zation. Instead, we believe that the efforts on text summari- quently modified web pages, in which dated entries (articles)
zation and representative set extraction can develop in are listed in reverse chronological sequences (Jackson et al.
parallel, from different perspectives. 2007). Readers are usually allowed to post comments to
specific articles in a blog. An article may contain both text
and graphical elements, and even multimedia information
Design Framework such as video and music. In this paper, we focus on the text
content of blog articles.
Our vision for the representative information extraction
framework was initially shaped through our on-site observa- The blogging platform usually provides some predefined
tion on the use of two internal blogging systems in practice. mechanisms for organizing the blogs and articles in a struc-
Subsequently, we conducted a series of field interviews to tured way. Typically, blogs are categorized according to the
refine the framework as well as to derive the specific design organizational structure. Additionally, a set of article cate-
requirements. Prior design science studies have used inter- gories (or tags) are defined. The author of a blog article can
views to understand artifact use cases, strengths and weak- choose to attach the article to particular categories. Such
nesses, and potential areas for improvement (Adomavicius et structures may help readers browse the contents on the plat-
al. 2008; McLaren et al. 2011). From 2010 to 2012, we form. Moreover, on some blogging platforms, readers are
investigated a large Chinese telecommunications company allowed to rate the articles that they have read. Based on
(X), a multinational software company (M), and a collabora- reader ratings, top rated articles will be recommended to a
tion software vendor (Y). Both X and M had deployed and new reader, in case that he/she cannot decide which articles
were using the internal blogging systems actively. Company to read first. Platform administrators may also rate the articles
Y was a vendor of internal blogging systems and had imple- and mark some of them as “recommended” based on their
mented their products in more than 10 customer companies. own judgment. All of these mechanisms are designed in order
We visited companies X (four times) and M (one time) and to help readers more easily seize the essential articles that
observed the use of the internal blogging systems on-site. deserve reading. Although there are auxiliary mechanisms
Our interviewees included five persons from X (CEO, one such as category structures, reader ratings, and administrator
department head, two team managers, and one engineer), one recommendations, it remains difficult for readers, especially
person from M (department head), and two persons from Y

203.218.254.164 on Wed, 18 Oct 2023 08:47:49 +00:00
Figure 1. Design Framework for the Extraction System
senior managers who do not have much time to spend on Requirements for the Extraction Method
browsing the blogging platform, to quickly and accurately
find out the most information-rich items. In this regard, the As described above, the principal requirements for the extrac-
extraction system should be of important help. tion engine were derived by studying two blogging systems in
practice and interviewing typical users. In the interviews, the
interviewees were asked to describe the features that they
The Representativeness Extraction Engine would expect for such extraction systems. In light of the
responses, the three most desirable features turned out to be
The core component of an extraction system is the extraction
engine, which fetches data from the blogging platform, (1) From the perspective of functionality, the extracted
analyzes the content of blog articles together with platform articles should be indeed representative, which means
usage records (including records about user browsing, that they should cover the content of the targeted article
reading, and commenting), and outputs representative articles population to a large extent.
as results. The representative articles will be presented to
users who are concerned with grasping the overall situation of (2) From the perspective of usability, users would like to
the organization. Herein, representative articles are defined freely specify the number of representative articles to
as a small set of blog articles, which, to a large extent, cover display and the results should be presented in real time.
the content of all the blog articles posted on the blogging For example, a user usually prefers to read 10 represen-
platform within a specific time period. tative articles every day, but occasionally he/she may
choose to read 20 or more articles when he/she has more
Representative articles extracted by the system are com-
time. Similarly, user A may prefer to read 10 articles
parable to the top-rated articles generated by the user rating
every time but user B may prefer to read 30. Therefore,
mechanism, considering the fact that they are both small sets
the system should allow the users to select the number of
of articles recommended to readers to help them select what
articles in real time according to their needs, rather than
to read. The difference, however, is that representative
fixing it to a predetermined value.
articles are selected with the aim of representing the whole
population of blog articles posted within a specific time
period, while top-rated articles are simply the posts most (3) From the perspective of information granularity, users
favored by general readers. From the perspective of a high would like to “drill down” on representative articles (i.e.,
level decision maker, apart from knowing what is most after reading a presentative article, they may want to read
favored in the organization, it is at least equivalently some more articles that are representationally related)
important to capture an overview on the whole platform. and such queries should also be responded to in real time.

203.218.254.164 on Wed, 18 Oct 2023 08:47:49 +00:00
Documents Document Document Clusters , kth Representative

Dendrogram Level of the Dendrogram Documents
Similarity
Matrix
Level 1
Construction
Level 2 S
E
Clustering L
… E
Level k C
… T
I
…
…
… … O
N
…
Level n-1 …
Level n …
Figure 2. REPSET Procedure
Based on these design requirements, we propose a BI method Similarity Matrix Construction

called REPSET (REPresentative SET), which incorporates a
novel document clustering technique. Evaluated with mul- The similarity between two documents can be calculated
tiple measures about representativeness, REPSET is proved using the cosine-similarity measure (Salton 1971), which is
to be an effective method supporting the extraction system. widely used in web/text mining. Using Lucene, a commonly
In the following sections, we will describe the REPSET used word-splitting tool, each document can be converted into
method and present the empirical evaluations. a set of words. With the standard treatment in the TF/IDF
model (Salton 1971), a feature word list can be derived, based
on which the corresponding keyword vector space can be
generated. Each document d then can be mapped as a
The REPSET Method keyword vector denoted as [ω1, ω2, …, ωp] with the TF/IDF
model. An alternative way is to use “shingles” (contiguous
According to the definition of representativeness, the target of
subsequences of tokens), instead of keywords, to denote a
representative article extraction is to find a small set of
document (Broder et al. 1997). In our study, keyword vectors
articles that, to a large extent, reflect the diverse content of the
are adopted to reduce computational complexity. Thereafter,
blogging system. Generally, this target can be achieved with
two major steps: (1) categorize all of articles based on their the cosine-similarity between every two documents can be
similarities of content between each other; and (2) select a calculated based on their corresponding keyword vectors.
good representative article for each category. If the cate- Specifically, the cosine-similarity between any two docu-
gorization of articles and the selection of representative items ments d1 and d2 (d1 = [ω11, ω21, …, ωp1] and d2 = [ω12, ω22, …,
are more accurate, the degree of representativeness of the ωp2]) in a p-dimensional vector space can be calculated with
resulting set will be higher. Accordingly, the overall proce- the following Sim measure:
dure of the REPSET method is shown in Figure 2. As illus-
trated, in the first step, the degrees of similarity between
documents are calculated. That is, given n documents, an Sim(d1 , d 2 ) = cos(θ between vectors d1 and d 2 )
n × n similarity matrix would be constructed. Next, a clus- d1 • d 2
tering algorithm is utilized to cluster the documents into a =
|| d1 || × || d 2 || (1)
dendrogram with different granularities of clusters. Finally,
for all given levels in the document dendrogram, a repre- ω11 × ω12 + ω21 × ω22 + ... + ω p1 × ω p 2
=
sentative document is selected from each cluster according to ω112 + ω212 + ... + ω p21 × ω122 + ω222 + ... + ω p2 2
the degrees of similarity. From the document dendrogram,
the manager could obtain various levels of clustering results
and representative articles according to the need, instead of
determining the cluster number k in advance. Given a set D with n documents, an n × n similarity matrix M
can be calculated with Equation (1). In M, the element at row
The following three subsections will introduce the three i and column j represents the similarity between the docu-
components of REPSET in detail, respectively. ments i and j.

203.218.254.164 on Wed, 18 Oct 2023 08:47:49 +00:00
Clustering In Equation (2), nkm and nkn represent the number of docu-
ments in clusters Ckm and Ckn, respectively. At stage k, two
Based on the three desirable features for the extraction engine, clusters with the maximum average similarity, Ckp and Ckq,
we derived three principal requirements for the clustering will be merged into a new cluster, Ckpq, which is demonstrated
algorithm: (1) the quality (accuracy) of clustering results as Equation (3).
should be high, because this is the basis for high quality of
representative extraction; (2) the number of clusters should
not be arbitrarily determined in advance (it should be able to
(
Sim C pk , Cqk = ) {
Cmk ,Cnk ∈
max
C1k ,C2k ,,Cnk−k +1 } (
 Sim Cmk , Cnk 
 ) (3)
be flexibly adjusted when wanted, without having to restart
the whole procedure of clustering), so that the system can
Furthermore, given a threshold λ (0 # λ # 1), documents with
respond immediately once the user has selected a different
average in-cluster similarities less than λ (near boundaries)
number of representative articles; and (3) there should be a
will be marked as boundary objects (to be considered to
multilevel structure in the resulting clusters, so that the
reallocate). Specifically, in the new cluster Ckpq, documents
navigation features such as drill down for representative
meeting the following condition will be marked as dboundary:
articles can be facilitated.
 Sim(d ) < λ , λ ∈[0,1]

1
Based on examining existing algorithms that are commonly boundary , d x (4)
used for clustering analysis, we found that they cannot suffi- n kpq k
d x ∈C pq
ciently satisfy these three requirements simultaneously. In the
partitional methods, the number of clusters needs to be arbi-
where nkpq represents the number of documents in the cluster
trarily set before the clustering algorithm is executed. In other
Ckpq. Notably, the value of the parameter λ may affect the
words, one execution of the partitional algorithm can only
accuracy of the clustering algorithms. Before the extraction
result in one fixed and pre-given number of clusters. Mean-
of representative information, the value of λ can be tuned in
while, partitional methods are not applicable for varying sizes,
light of the following measure:
densities, and structures. In the density-based methods, the
minimum numbers of points and radius need to be configured k
 1 
in advance, which indirectly fixes the number of clusters. In
addition, both partitional clustering and density-based clus-   |C | p di ∈C p
{Sim(d , O )}
i p
tering typically create one-level partitioning of the data p =1
CS = k (5)
{ ( )}
objects at once and cannot provide various levels of repre-  
sentative information. In contrast, the hierarchical clustering   max
q = 1,2,, k , p ≠ q
Sim Op , Oq 

can meet both requirements (2) and (3). However, the p =1
accuracy of hierarchical clustering is relatively low (Stein-
bach et al. 2000). Table 1 summarizes the key features of the In the equation, the numerator represents the aggregate in-
clustering methods. cluster similarity after clustering. Cp denotes a cluster and Op
is the centroid document of Cp. The denominator represents
Therefore, in REPSET, a novel method is proposed based on the aggregate between-clusters similarity after clustering,
hierarchical clustering in order to retain the two desirable where Op and Oq denote the centroid documents of two
features. Additionally, a heuristic strategy, namely the back- clusters, respectively. The clustering result is better when in-
ward strategy for documents near boundaries, is incorporated cluster similarities are high and between-clusters similarities
to improve the quality of clustering. The detailed procedure are low. In the clustering process, given a certain λ value,
of the clustering algorithm is as shown in Figure 3. boundary documents can be determined and reallocated when
necessary. After the reallocation procedure, the CS value is
Given a set D with n documents, in the proposed clustering calculated accordingly. In this sense, the value of λ will
algorithm, each document is initially considered to be a single influence the clustering result and therefore have impacts on
cluster. Therefore, at the first stage, there are n clusters, the final CS value. A higher CS value indicates a more appro-
denoted as C11, C12, …C1n. Let Ckj denote the jth cluster in the priate value of λ. In other words, the proper input parameter
kth stage in the clustering process of REPSET. At each stage, of λ in REPSET can be determined by seeking for the maxi-
the average similarity of each pair of clusters Ckm, Ckn is mized CS value. Since the expression of CS does not expli-
calculated as follows: citly include λ, a commonly used parameter-tuning technique
called stepwise self-tuning (Zelnik-Manor and Perona 2004)
( ) (
Sim Cmk , Cnk = 1 / nmknnk ) 

d x ∈Cmk   d y ∈Cnk
( ) 
Sim d x , d y 

(2) is adopted to achieve this goal. This technique has been
widely applied in parameter tuning for various studies, such

203.218.254.164 on Wed, 18 Oct 2023 08:47:49 +00:00
Table 1. Features of Clustering Methods

Number of Multilevel Applicable for Varying Sizes,
Methods Accuracy Clusters Clusters Densities, and Structures
Partitional clustering medium-high fixed no no
Density-based clustering medium-high fixed no yes
Hierarchical clustering low-medium flexible yes yes
No Output:
Document
dendrogram
th
Any Record the k
Input:
document to level of the
D, M
merge? dendrogram
Yes Backward
strategy
Identifying the
Merging two Reallocating
boundaries of
closest documents near
the new
clusters boundaries
cluster
Figure 3. Clustering Algorithm
as cluster numbers and other related parameters in clustering gain the highest average similarity, if necessary. This is
(Gu and Zhou 2009; Mandala et al. 2014; Seref et al. 2013), called a backward strategy. Specifically, at stage k, there are
topic numbers in LDA (Blei et al. 2003; Griffiths and Stey- (n – k +1) clusters, denoted as Ck1,Ck2, …, Ck(n-k+1). For the
vers 2004), confidence and support degrees in association-rule marked documents dboundary, REPSET first calculates the
classification (Chen et al. 2006; Coenen and Leng 2007), average similarities between dboundary and the (n – k +1)
normalization parameters in information retrieval (He and clusters, which is shown as Equation (6).
Ounis 2003), and regularization parameters in logistic regres-
sion (Simon et al. 2011).
(
Sim dboundary , C kj = ) 1
n kj  Sim(d boundary , d x ) (6)
With stepwise self-tuning, values of λ with equal intervals are d x ∈C kj
applied on a training dataset that is randomly sampled from
the data set D. Typically, with a step length of 0.1, the value If the average similarity between dboundary and Ckj is the largest
of λ can be set as {0.1, 0.2, 0.3, …, 0.9}, respectively. among all the clusters, then dboundary is assigned to the cluster
Notably, in the clustering process of REPSET, CS values can Ckj.
be calculated at every level of the dendrogram. However,
because the sample data set for tuning λ is usually much In each iteration of updating clustering, two clusters with the
smaller than the whole data set D, it is not feasible to tune λ largest similarity between each other is merged into one, so
for every level independently. Therefore, a unified value of the number of clusters decreases by 1. Therefore, in stage k,
λ will be used for all levels of the dendrogram. In the tuning the documents in D can be clustered into (n – k + 1) sets of
process, for each given value of λ, the clustering algorithm is similar documents. The (n – k + 1)th level of the document
run on the sample data set, during which CS values are dendrogram is constructed by these sets, in which the docu-
recorded at all levels in the dendrogram. Subsequently, the ments near boundaries have been appropriately processed.
average value of CS is calculated as the indicator for choosing After n iterations of the proposed algorithm, the clusters will
λ. The λ value that results in the highest average CS value is converge to 1 and the whole dendrogram will have been well
chosen as the final input parameter for REPSET to perform constructed. Every level of this dendrogram will consist of
the clustering in the dataset D. document clusters with good clustering quality.
The marked boundary documents will be reevaluated, and The backward strategy in REPSET is designed to overcome
reallocated to some new cluster, in which the document will the disadvantage of traditional hierarchical clustering methods

203.218.254.164 on Wed, 18 Oct 2023 08:47:49 +00:00
Figure 4. Backward Strategy Example
that the clusters cannot be dynamically adjusted during the hierarchical methods, as well as REPSET, this is not a big
construction process of the dendrogram. This heuristic issue in our context, since the clustering procedure only needs
strategy aims to improve the clustering quality by assigning to run one time on a daily basis and it can be scheduled as an
boundary objects to more appropriate clusters, based on the automatic task that runs at night. Particularly, as REPSET
rationale that the boundary objects in a newly merged cluster allows real time flexible query and there is no need to re-do
are likely to belong to another cluster. Specifically, in the the clustering when a new number of representative articles is
traditional hierarchical clustering process, in each iteration, set, the computation time issue is even more marginal for this
two items or clusters with the highest similarities will be method.
merged. Consider the example in Figure 4, where each circle
denotes an article. From the overall perspective, the desired Furthermore, the method is able to capture more diversified
clustering result should put all white circles into one cluster content and, consequently, innovative ideas and emerging
and all black ones to another. However, in the step-wise opinions are likely to be included in the extracted item set.
dendrogram, because the distance between di and dj is very Considering an extreme case of an innovative idea that is new
close (high similarity), these two articles will first be merged and rarely discussed, it will have very low similarity levels
into one cluster and will never be separated in the subsequent with other articles. Therefore, in the clustering procedure of
procedures. In contrast, the backward strategy in REPSET REPSET, this item is not likely to be merged with other items
can effectively solve this problem. Suppose that di and dj have or clusters, because of its low similarities with others. In
been merged in Clusterij, and this cluster is then merged with other words, this innovative idea will be maintained indepen-
the other black articles forming Clusterblack, while the other dent of other items or clusters all through the clustering
white articles are clustered as Clusterwhite. In this case, for process and eventually become an individual cluster. Subse-
each document in Clusterblack, REPSET will calculate the quently, when selecting representative items, since this
average similarity between it and the other documents, and innovative idea is the only one item in its cluster, it will be
then accordingly mark out the boundary articles in this certainly extracted.
cluster. In the example, article di will turn out to have a low
average similarity. Therefore, REPSET will recalculate the
average similarities between di and all existing clusters. Con- Representative Document Selection
sequently, REPSET will find that di has the largest average
similarity to the Clusterwhite, which means di is more likely to Given a set of documents, the document with the highest
be an article in Clusterwhite. Therefore, di is reallocated into average similarity to the other documents will offer the
Clusterwhite. Thus, the clustering accuracy will be improved. highest content coverage in the set (Liu and Croft 2008).
A more detailed illustrative example is provided in Appen- Therefore, such a document can be selected as the represen-
dix A. tative in a cluster.
As compared with traditional hierarchical clustering methods Particularly, the clustering process of REPSET can provide
such as agglomerative clustering, the novelty of REPSET the whole dendrogram after one time running and every level
clustering lies in the heuristic design of the backward strategy of this dendrogram contains document clusters with good
that significantly improves the clustering accuracy. There- clustering quality. At each level of the dendrogram, REPSET
fore, it will be capable of overcoming the major weakness of can extract the corresponding representative documents in the
hierarchical clustering algorithms and obtain higher accuracy. clusters, forming the representative subset of documents at
Meanwhile, although the computation time is a weakness of this level. Therefore, after running one time, REPSET can

203.218.254.164 on Wed, 18 Oct 2023 08:47:49 +00:00
Input:
Document Set D = {d1, d2, …, dn} // D is the set of n documents
Output:
Dendrogram C = {C1, C2, …, Cn}; // The document dendrogram C
// Ck represents the kth level of C
Set of representative documents // A series of representative documents
RepSet = {R1,..., Rki, …, Rn}; // Rki : the representative documents for Cki
Begin:
1. RepSet = ∅; // Initially, RepSet is set empty
2. Number_of_Clusters = n; // Initially, it could regarded as n clusters
3. Clusteri = {di}, i = 1, 2, …, n; // Each cluster containing one document
4. Cn = {Cluster1,…, Clustern}; // The nth level of the dendrogram
5. Level = n; // record the level of the dendrogram
6. M = Cosine_Similarity(D); // Calculate the similarity matrix M
7. While Number_of_Clusters > 0 Do // Clustering procedure
8. {
9. Level = Level – 1;
10. CLevel = CLevel+1;
11. While(Avg_Sim(Clusterp, Clusterq)
= Maxi, j = 1, 2, …, Number_of_Clusters, i ≠ j(Avg_Sim(Clusteri, Clusterj) Do
12. {
13. Clusterp = Merge(Clusterp, Clusterq); // Update sets
14. Reallocate_Boundaries(); // Backward strategy
15. Number_of_Clusters --;
16. }
17. CLevel = Remove(Clusterq); // Record the kth level of the dendrogram
18. }
19. For Level = 1 to n // Select representative documents
20. {
21. For Clusteri in CLevel
22. RLevel = RLevel + Select(Clusteri); // Update RLevel
23. RepSet = RepSet + RLevel;
24. }
25. Output(RepSet); // Output the representative information
Figure 5. REPSET Procedure Pseudo-Codes
provide representative subsets at all levels of the dendrogram, construction), the main computational consumption is the
guaranteeing real time flexible queries. calculation of the similarity matrix, which is of O(n2) time
complexity. At the second stage (clustering), suppose that the
average scale of boundary documents in newly generated
Algorithm Summary clusters is navg, under the best-case situation, where there is no
document to be reallocated throughout the process, the
Figure 5 presents the structural pseudo-codes for the proposed clustering procedure degenerates into a typical hierarchical
method. In the codes, lines 1–6 represent the components of clustering method that is of O(n2logn) time complexity (Liu
initialization and similarity matrix construction. Lines 7–18 et al. 2007). Under the worst-case situation, where every
correspond to the component of the clustering procedure. boundary document in the newly merged cluster satisfies the
Lines 19–25 are the component of representative document backward strategy condition in each step of the procedure, the
selection. time complexity of the clustering procedure will be
O(n3navglogn). At the third stage (representative document
The overall time complexity of REPSET is low-level poly- selection), the time complexity is O(n2). Therefore, the over-
nomial. At the first stage (initialization and similarity matrix all time complexity of REPSET is between O(n2logn) (best-

203.218.254.164 on Wed, 18 Oct 2023 08:47:49 +00:00
case) and O(n3navglogn) (worst-case). Since navg is less than n, pairs of data points n, with which two clustering results agree/
it can be concluded for simplicity that the average-case time disagree. The purity measure represents the degree that all
complexity of REPSET is between O(n2logn) and O(n4logn). clusters contain true documents, the larger the better.
Specifically, given a testing dataset D, the true clusters C+1,
C+2, …, C+k+, and the clustering results C1, C2, …, Ck, the
Benchmark Data Evaluation purity measure is defined as Equation (8).
In order to evaluate the performance of the clustering 1 k

algorithm in REPSET, we compare it with 10 widely used Purity = max p+ =1,2,...,k+ | Cp ∩Cp++ |
n p=1
(8)
clustering algorithms, including four K-Means type methods
(RBR, DIRECT, Simple K-Means, and X-Means) (Marmanis The metrics of entropy and purity evaluate the consistency
and Babenko 2009; Pelleg and Moore 2000; Zhao et al. 2005), between the clusters generated by an algorithm and the actual
four variances of the agglomerative method with different cluster labels assigned by experts, which are widely used in
intercluster similarity calculation methods (Agglo(group), the evaluation of clustering (Bezdek and Pal 1998; Liu 2007).
Agglo(max), Agglo(min), and Agglo(centroids)) (Zhao et al.
2005), a graph partitioning method (GRAPH) (Zhao et al. In order to conduct a reliable comparison, 30 data sets are
2005), and a density-based method (DBscan) (Ester et al. collected from multiple benchmark databases specifically for
1996), with respect to two major measures (i.e., entropy and clustering and text mining algorithms (Boley et al. 1999;
purity) (Bezdek and Pal 1998; Liu 2007), on some benchmark Chen et al. 2006; Han et al. 1998; Moore et al. 1997; Wu et
datasets. Moreover, to evaluate the effectiveness of the pro- al. 2010), including 15 data sets from the UCI Machine
posed backward strategy in clustering, REPSET with no Learning Repository and 15 from the test-bed of the CLUTO
backward strategy is also considered in the comparative Data Clustering Software. The data sets are randomly chosen
experiments. The 10 algorithms cover the main categories from the databases, covering various kinds of data types and
with respect to the mainstream clustering methods (Liu 2007). attributes. For RBR, DIRECT, AGGLO (all variances), and
Among them, RBR, DIRECT, Simple K-Means and Agglo GRAPH, the CLUTO software package (Zhao et al. 2005) is
are deemed effective document clustering algorithms (Stein- adopted. The programs for DBscan and K-Means are pro-
bach et al. 2000) and are adopted by the well-known CLUTO vided by a commonly used text book (Marmanis and Babenko
clustering platform (Zhao et al. 2005). Graph is a typical 2009). The program for X-Means is adapted from the original
graph clustering algorithm and DBscan is designed for dis- codes provided by the authors of the algorithm (Pelleg and
covery of clusters with arbitrary shape and good efficiency on Moore 2000). For agglomerative clustering, four types of
large databases (Ester et al. 1996). On this basis, these 10 intercluster similarity (group average, min, max, and distance
algorithms are suitable for the comparative experiment. between centroids) are used.
The entropy measure is based on information-theoretic con- For each given data set, the target number of clusters is set to
siderations, which represents the difference of information the ground truth cluster number in the data set. Formally, for
entropy between the clustering by an algorithm and the a data set Di, there are ki clusters indicated by the label field.
actually true clustering, the smaller the better. In particular, In the experiment, for the three K-Means type methods (RBR,
assume that the given testing dataset D is composed of DIRECT, and Simple K-Means) and the graph partitioning
clusters C+1, C+2, …, C+k+ (true clustering), and a clustering method (graph), the parameter k is directly set to ki. For X-
algorithm is applied to find clusters C1, C2, …, Ck in the Means, a range of (ki – 1, ki + 1) is set to guarantee that it
dataset D, then the entropy measure can be defined as outputs the results of ki clusters. For DBscan, its parameters
Equation (7). eps and minpts are tuned to make the cluster number equal ki.
For the four variances of the agglomerative methods and
k  k+ | C p ∩ C p+ + |  |C p ∩ C + + |   REPSET, only the particular level of the document dendro-
|C p | 
 ( )
p 
1
Entropy = − log
n  log k + Cp  Cp   (7) gram containing ki clusters is extracted as the clustering
p =1  p + =1   results.
In Equation (7), n represents the scale of the dataset D. The The generalized results of the benchmark testing are listed in
signal “| |” means the size of the corresponding set. For Table 2. As shown in the table, the REPSET clustering
example, |Cp| represents the size of Cp. method produces 12 minimum entropy values and 15 maxi-
mum purity values in the 30 instances, which significantly
The other measure for evaluating the performance of clus- outperforms the other 11 algorithms. The average rank values
tering algorithms is purity, which is based on counting the also support the fact.

203.218.254.164 on Wed, 18 Oct 2023 08:47:49 +00:00
Table 2. Entropy and Purity Measures

Entropy Purity
Number of Minimum Average Rank Number of Maximum Average
30 Datasets Entropy Value Purity Rank Value
RBR 6 3.30 5 3.97
DIRECT 5 3.62 5 4.52
AGGLO(group) 0 6.30 5 6.27
AGGLO(max) 0 9.03 3 8.45
AGGLO(min) 0 7.77 3 7.72
AGGLO(centroids) 0 8.82 3 8.17
GRAPH 5 4.07 6 4.48
DBscan 0 9.10 3 8.42
K-Means 6 5.03 8 4.98
X-Means 1 6.43 6 6.07
REPSET 12 2.53 15 2.97
REPSET(no Backward) 0 5.80 2 5.77
Note: The range of rank values for the algorithms is 1 to 12, where 1 represents the best (i.e., the minimum entropy or the maximum purity), and
12 represents the worst (i.e., the maximum entropy or the minimum purity).
Furthermore, statistical analysis can also be conducted to test Experiments were also conducted to evaluate the time and
whether the proposed REPSET method is reliably better. space complexity of the REPSET method. REPSET is mainly
Considering that the 30 datasets are from two different data composed of two phases. The first phase is to cluster the
sources and may not satisfy the identical distribution assump- original data objects to construct a dendrogram. The second
tion, non-parameter Friedman test is considered (Vapnik phase is to extract representative items at every level in the
1998), which will test whether the rankings of the algorithms dendrogram. The first phase can be executed offline as a
concerned are significantly different. Table 3 shows the batch processing task. After the dendrogram is generated and
stored, the time needed to read the dendrogram and return a
Friedman test results on both the entropy and purity measures.
specified level is very short at the millisecond level. There-
These results are consistent with those of Table 2. Briefly, on
fore, the second phase is immediately responsive to users’
the benchmark datasets, REPSET is superior to other tradi- online queries. In the following, the time and space com-
tional text clustering algorithms in ranks, indicating that plexity experiments only focus on the first phase.
REPSET is technically effective and advantageous over major
existing ones for text clustering. In addition, it is worth Different sizes of testing data collections were constructed by
noting that REPSET is significantly better than REPSET with duplicating a dataset randomly selected from the 30 bench-
no backward strategy, revealing that the proposed backward mark datasets described above. The testing platform was a
strategy plays an effective role in the clustering process and Windows 7 system on a PC with an Intel Core i3-2100 CPU
can significantly promote the accuracy level of clustering. (3.1 GHz) and 4G RAM. The programs were implemented in
java. The running time of REPSET on different scales of data
Notably, the backward strategy in REPSET will lead to data sets is represented in Figure 6(a), in comparison with typical
object movements during the clustering process. Because the partitional, density, and hierarchical methods. The curves
number of boundary objects is usually small, the amount of show that REPSET is slightly slower than existing methods,
such movements is also small. The proportions of data ob- indicating that it might not be efficient for processing real-
jects that have been reallocated during the clustering process time stream data. However, the overall trend reveals a low-
level polynomial complexity, which is acceptable for offline
of REPSET are provided in Appendix B. On average, the
batch processing tasks such as representative article extraction
reallocation proportion in the clustering of the 30 benchmark
in blogging systems. The memory consumption is presented
datasets is 1.29%. The appendix also includes a box plot in Figure 6(b), revealing a low-level polynomial complexity
showing the average reallocation proportion across the 30 curve as well, which indicates that REPSET has the scala-
benchmark data sets in every stage of iteration. bility to deal with large-scale datasets.

203.218.254.164 on Wed, 18 Oct 2023 08:47:49 +00:00
Table 3: Friedman Test Statistical Results on Entropy and Purity Measures

Entropy Chi-Square Sig. Purity Chi-Square Sig.
REPSET < RBR 4.172 * REPSET > RBR 4.167 *
REPSET < DIRECT 8.533 *** REPSET > DIRECT 11.560 ***
REPSET < AGGLO (group) 25.138 *** REPSET > AGGLO (group) 17.640 ***
REPSET < AGGLO (max) 26.133 *** REPSET > AGGLO (max) 22.154 ***
REPSET < AGGLO (min) 22.533 *** REPSET > AGGLO (min) 21.160 ***
REPSET < AGGLO (centroids) 30.000 *** REPSET > AGGLO (centroids) 22.154 ***
REPSET < GRAPH 3.846 * REPSET > GRAPH 3.846 *
REPSET < DBscan 26.133 *** REPSET > DBscan 22.154 ***
REPSET < K-Means 6.533 * REPSET > K-Means 9.846 **
REPSET < X-Means 10.800 *** REPSET > X-Means 6.259 *
REPSET < REPSET(no Backward) 22.533 *** REPSET > REPSET(no Backward) 18.615 ***
Note: *p < 0.05, **p < 0.01, ***p < 0.001
(a) (b)
Figure 6. Running Time of REPSET on Different Scales of Datasets
Empirical Evaluations In 2006, an internal blogging platform was deployed in Com-

pany X, aimed at enhancing the communication among work-
Empirical evaluations of REPSET were conducted based on groups and employees, promoting knowledge sharing, and
the massive database accumulated on an internal blogging providing a stage for self-expression. Each blog on the plat-
platform in practice. Multiple approaches including data form is maintained by a specific business workgroup in the
experiments and user evaluations were utilized. company. Since establishment, effective policies were made
to encourage the use of the blogging platform. As of October
2013, almost all workgroups in the company were main-
Research Site taining their blogs on the platform, among which about 3,000
blogs were active (with at least one article posted per week).
Empirical data for our search were provided by Company X, Around 5,000 articles plus about 10,000 comments were
a large telecommunications service provider in China, whose produced every day.
parent company is a Fortune 500 telecommunications firm.
As of 2013, Company X had over 40,000 employees and pro- We evaluated the function of REPSET to extract a small set
vided mobile services for more than 90 million customers. of representative articles from all the blog articles posted
The annual revenue of the company was about USD 13.5 within a specific time period. This was done by comparing
billion, holding 8% of market share of mobile telecommuni- the articles selected by REPSET with articles recommended
cations service in China. by popularity-based methods and those obtained by other

203.218.254.164 on Wed, 18 Oct 2023 08:47:49 +00:00
clustering methods. In light of the findings from research on In the context of blog article extraction, the coverage,
search engines that users usually only pay attention to the first redundancy, and F1 measures of the resulting articles can be
10 resulting articles of the first page (Granka et al. 2004), we evaluated from two perspectives. The first is the de facto
initially restricted the size of the representative set to 10 perspective. The de facto results of the measures are defined
articles. as the degrees of coverage and redundancy obtained by
calculating the textual similarities between blog articles. The
Using a subset of data from the blogging platform, an illus- other perspective is user perception. The perceived results of
trative example is provided in Appendix C, which compares the evaluation measures are defined as the degrees of cover-
the results of representative article extraction between age and redundancy reported by readers of blog articles. In
REPSET and X-Means, a typical traditional clustering the spirit of the measures generally used in research on
method. The comparison intuitively shows that REPSET can reading behavior (Allington 1984), perceived coverage may
facilitate enhanced diversity in coverage and therefore capture be measured by the number of meaningful words that a reader
more representative results. In the following, more consoli- remembers after reading the articles. For perceived redun-
dated experiments will be presented to evaluate the perfor- dancy, self-reported scales are usually utilized.
mance of REPSET.
Benchmark Methods and Evaluation Plan

Measures
We compared the 10 representative articles generated by
The REPSET method is composed of two major phases. The REPSET with 3 types of reader-based recommended articles.
first phase is to cluster the original blog articles and the The first is the top 10 rated articles based on reader ratings;
second is to select a representative article for each cluster. the second is the top 10 most read articles; and the third is the
Therefore, the evaluation of REPSET also includes two top 10 most commented articles. Furthermore, based on the
phases. As presented in the “Benchmark Data Evaluation” results of the benchmark data evaluation, we chose four clus-
subsection, the measures of entropy and purity are used for tering algorithms, namely Direct, Graph, RBR, and X-Means,
the first phase, namely to evaluate the accuracy of clustering. to analyze the company blogs and compared them with
These are the most commonly used measures for evaluating REPSET. Three typical topic modeling methods, namely
clustering accuracy. LDA, hLDA, and DTM, were also considered. Additionally,
randomly generated articles were also taken into account as
The second phase of evaluation is focused on the quality of a method for comparison. Altogether, 11 benchmark methods
representative information extraction, namely how well the were involved.
extracted items may represent the whole population. In this
regard, the measures of coverage and redundancy are used. Therefore, the evaluation plan was to compare REPSET with
These measures have been developed and formulated by a each of the benchmark methods in terms of three de facto
series of related literature on e-commerce and information measures (coverage, redundancy, and F1) and three perceived
retrieval (Ma and Wei 2012; Pan et al. 2005). Generally, measures (coverage, redundancy, and F1). Two types of
coverage is defined as the degree to which a small set of items experiments were involved. First, we conducted data experi-
cover the contents of the whole body of a large information ments to evaluate the de facto results of the measures.
base. The other measure for representativeness is redun- Second, we conducted user experiments to evaluate the per-
dancy, which is defined as the degree to which the contents of ceived results of the measures. The de facto and perceived
the items in an extracted set interleave with each other (Zhang results will verify each other and provide a solid base for the
et al. 2002). Since the target of the extraction method is to evaluation conclusions.
arrive at a small set of information items with good represen-
tativeness, it is also important that the resulting set should be
concise, so that its readers can capture as much information as Data Experiments
possible in a limited time. Therefore, a high degree of redun-
dancy would indicate a low level of extraction performance. The de facto performances of the methods were evaluated
To comprehensively evaluate the overall performance using data experiments. As defined, de facto degrees of
including both coverage and redundancy, the balanced F- coverage and redundancy were obtained by calculating the
measure (F1 measure) may also be considered (Arazy and textual similarities between blog articles. Given two blog
Woo 2007; Powers 2011). The F1 measure is commonly used article sets, D = {d1, d2, …, dn} and D = {d1, d2, …, dm}, and
in testing the accuracy of classification analysis, the higher suppose that Sim(d1, d2) indicates the similarity between
the better. In our research, the F1 measure can be calculated articles d and d', the de facto coverage of D' over D is calcu-
as the harmonic mean of coverage and redundancy. lated as follows (Ma and Wei 2012):

203.218.254.164 on Wed, 18 Oct 2023 08:47:49 +00:00
Table 4. De facto Performance (Overall)

Method De facto Coverage De facto Redundancy De facto F1
REPSET 0.3735 0.2274 0.5036
Random 0.284 0.3731 0.3909
Top Rated 0.1927 0.5015 0.278
Most Read 0.2874 0.4628 0.3745
Most Commented 0.1301 0.4982 0.2066
Graph 0.3258 0.6746 0.3256
X-Means 0.3646 0.4682 0.4326
RBR 0.4022 0.4367 0.4693
Direct 0.3999 0.48 0.4521
LDA 0.2843 0.5103 0.3597
hLDA 0.3115 0.4054 0.4088
DTM 0.2718 0.4228 0.3696
rC ( D′ , D) =   max (Sim(d , d ′))

d ∈D
d ′∈D ′
| D| (9)
are small. Meanwhile, the de facto redundancy of REPSET
articles is the lowest. As the comprehensive measure, the F1
of REPSET is clearly higher than all the benchmark methods,
Generally speaking, de facto coverage represents the percent- indicating that REPSET achieves the best performance. To
age of information conveyed in D that is reflected by the statistically test the performance differences, we split the
information conveyed in D'. For any blog article set D = {d1, dataset of 2,239 articles into 11 groups along the timeline,
d2, …, dn}, the de facto redundancy of D is calculated as (Ma with each group consisting of no less than 200 articles. Then
and Wei 2012): we used each method to generate 10 representative articles for
each group and compared them in the corresponding group.
 
We used the paired-samples t-test method to evaluate the
rR ( D) =  1 − 1  Sim(d , d ′)
d ∈D d ′∈D
| D| (10) differences. The testing results are shown in Table 5, which
illustrates that in 25 out of the 33 cases, the results of
REPSET are superior to those of the benchmark methods.
Following the definition of the F1 measure, the de facto F1 of
In terms of coverage and redundancy, REPSET significantly
article set D' is calculated as
outperforms 7 and 9 (out of 11) benchmark methods, respec-
tively. Overall, and more importantly, in terms of the com-
rF1 ( D′ , D) = 2 ⋅
( )
rC ( D′ , D) ⋅ 1 − rR ( D′ )
prehensive F1 measure, REPSET significantly outperforms
(11)
(
rC ( D′ , D) + 1 − rR ( D′ ) ) most of the benchmark methods except for RBR and X-
Means. RBR and X-Means achieve a comparable level of
For the experiments, we selected the blogging data of a typi- performance as REPSET, largely because of their high
cal region of Company X, during the time period of July clustering accuracy (Marmanis and Babenko 2009).
2010, with a total of 2,239 blog articles in 62 blogs (namely However, they are both partitional clustering methods and can
62 workgroups), in which 590 employees were involved as hardly satisfy the requirements for flexible representative
authors and/or readers. Using REPSET and the benchmark article numbers and multilevel clusters, as discussed above in
methods, respectively, we extracted 10 articles from the 2,239 the “Clustering” subsection. Meanwhile, topic modeling
blog posts. In line with the above formulae, the de facto methods do not perform well in these results. In topic
coverage, redundancy and F1 measures of the 12 methods modeling, because topic labels are determined by a number of
were calculated, as shown in Table 4. top-frequency terms (Blei et al. 2003), the resulting topics are
usually dominated by high-frequency words, while low-
Intuitively, we can see from the table that the de facto cover- frequency words are neglected. In contrast, although
age of REPSET articles is higher than 9 out of the 11 bench- REPSET is also dependent on term frequency for the cosine-
mark methods. The de facto coverage degrees of RBR and similarity measure, the impacts of low-frequency words are
Direct articles are higher than REPSET, but the differences still preserved all through the process of clustering.

203.218.254.164 on Wed, 18 Oct 2023 08:47:49 +00:00
Table 5. Significance Testing Results (de facto Measures)

Measure Benchmark Method t-value Significance
REPSET > Random 11.315 ***
REPSET > Top Rated 11.229 ***
REPSET > Most Read 9.424 ***
REPSET > Most Commented 10.082 ***
REPSET > Graph -1.935
de facto Coverage REPSET > RBR -6.047 ***
REPSET > X-Means -1.646
REPSET > Direct -3.518 **
REPSET > LDA 8.574 ***
REPSET > hLDA 6.000 ***
REPSET > DTM 5.803 ***
REPSET < Random -3.470 **
REPSET < Top Rated -8.037 ***
REPSET < Most Read -5.130 ***
REPSET < Most Commented -6.107 ***
REPSET < Graph -4.527 **
de facto Redundancy REPSET < X-Means -3.416 **
REPSET < RBR -2.401 *
REPSET < Direct -5.516 ***
REPSET < LDA -9.845 ***
REPSET < hLDA -1.664
REPSET < DTM -2.051
REPSET > Random 11.830 ***
REPSET > Graph 3.065 *
de facto F1-Measure REPSET > X-Means 1.977
REPSET > RBR -1.100
REPSET > Direct 4.754 **
REPSET > LDA 10.733 ***
REPSET > DTM 6.024 ***
Note: *p < 0.05; **p < 0.01; ***p < 0.001
In these experiments, the number of representative articles In addition, considering that a target of representativeness
was set to 10. It might be questioned whether a similar level extraction is to capture the special articles that have low
of performance can be achieved when different numbers of similarities with others, a comparative experiment was also
items are extracted. As a robustness check, we re-executed conducted to illustrate the capability of the similarity-based
the empirical data experiments 15 times with the number of clustering methods in this regard. In this experiment, all of
extracted articles varying from 10 to 150 (see Appendix D for the articles were ranked according to their average similarities
details). It turned out that REPSET performed well across all to other articles in the whole dataset. The articles ranked in
of the 15 cases of varying numbers of extracted articles. the last 10% were gathered to form a testing set of least-
Results from Friedman tests show that REPSET significantly common articles. For each set of 10 articles extracted by
outperforms all of the 11 benchmark methods in terms of the REPSET, Direct, RBR, Graph, and X-Means, the coverage
F1 measure. values on the least-common article set were calculated using

203.218.254.164 on Wed, 18 Oct 2023 08:47:49 +00:00
Equation (9), respectively. The results show that the coverage measured with the number of new words learned (Allington
value achieved by REPSET was 0.180, while those achieved 1984). Perceived redundancy was reported by the short
by Direct, RBR, Graph, and X-Means were 0.135, 0.138, questionnaire that the subjects answered. The four question-
0.044, and 0.140, respectively, indicating that REPSET naire items for measuring perceived redundancy, in the form
managed to capture distinguishably more least-common of seven-point Likert-type scales, were developed through
content than the other methods. two rounds of discussion and a pilot study to ensure its
validity.
User Experiments Two rounds of experiments were conducted. The first round
was conducted in December 2012, during which we recruited
Although the results of blog data analysis have demonstrated 212 qualified participants from the students of the university
the performance of the proposed method, aiming to secure and compared REPSET with top rated, most read, most com-
higher quality for the evaluations, we decided to go beyond mented articles as well as the Graph and X-Means methods.
that and incorporate user experiments so that the results of the The second round was conducted in January 2015, during
two types of evaluation can be cross validated. Using the which 166 qualified students were recruited and REPSET was
same data set, user experiments were conducted among volun- compared with the Direct, RBR, LDA, hLDA, DTM, and
tary participants recruited from a prestigious university in random methods. According to self-reporting, volunteers that
China. The experiment process was designed as follows. were not familiar with blogs were excluded. The experiments
First, participating subjects were randomly divided into were conducted in the Behavior Lab of a national research
groups, which were assigned to REPSET and the benchmark center at the university. When dividing the participants into
methods respectively. Thereafter, each subject was asked to the comparison groups, attributes including gender, age, and
review the 10 articles of his/her group independently in 20 grade were controlled, so that there are no significant differ-
minutes. The subjects were not aware of the group to which ences between the groups. Although the blog articles were
they belonged, and they did not know whether the articles collected from a company, students did not have much diffi-
they were reviewing came from REPSET extraction or other culty in understanding the contents. The business of the
methods. After a subject had read the 10 articles assigned, company is mobile phone services, with which most students
he/she was asked to fill in a short questionnaire consisting of were very familiar. In our experiments, we asked the subjects
four scale questions about perceived redundancy. After that, whether there was anything they did not understand after they
the subjects were given a word list consisting of 267 feature read the articles. Nobody answered that he/she felt diffi-
words collected from all the 2,239 blog articles. These blog culties, and very few people raised questions. In the rare
articles were first preprocessed with a series of processing cases when subjects felt confused, our research assistants
strategies including stopword removing, stemming, and case gave explanations. All of the participants finished the entire
folding, according to the common practice of document process of the experiment smoothly in the time allocated.
indexing (Witten et al. 1999). They were then segmented into Tests using the Cronbach Alpha indicator show that the ques-
keywords using a widely used Chinese language processing tionnaire data are reliable.
toolkit called NLPIR (http://ictclas.nlpir.org/). Moreover,
infrequent keywords were removed, because they usually Experiment results are shown in Table 6, where the numbers
have little semantic meaning and may introduce ambiguity for perceived coverage are the average percentages of feature
(Sedding and Kazakov 2004). Finally, all of the remaining words marked as mentioned in the corresponding group, the
keywords constitute the word list of 267 feature words. The higher the better. The numbers for perceived redundancy are
subjects were asked to mark out all of the feature words that the average scores of the seven-point Likert-type scales, the
were mentioned (explicitly or implicitly) in the 10 articles that lower the better. The perceived F1 measure is the harmonic
they had just read. They were not allowed to refer back to the mean of perceived coverage and the reversed value of per-
articles when they were marking out the words. In the end, ceived redundancy (normalized to the range of [0,1]).
the subjects were also asked to write down any of their Specifically,
feelings or thoughts about the 10 articles they had read. The
experiment script and screenshots for the experiment system
UI are provided in Appendix E and Appendix F, respectively. Perceived F1 =
Perceived coverage was then measured by calculating the ( Perceived Coverage) ×  7 − Perceived6 Redundancy  (12)
percentage of words in the list of feature words that were 2×
+ 
7 − Perceived Redundancy 
marked as mentioned. This is consistent with the research in ( Perceived Coverage) 6


reading behavior in which the degree of content coverage is

203.218.254.164 on Wed, 18 Oct 2023 08:47:49 +00:00
Table 6. Perceived Performance

Perceived Coverage Perceived Redundancy Perceived F1
REPSET (first round) 51.80% 2.50 0.61
REPSET (second round) 54.73% 3.03 0.56
Random 39.46% 3.95 0.42
Top Rated 36.15% 3.66 0.51
Most Read 41.41% 3.84 0.51
Most Commented 17.17% 4.54 0.45
Graph 36.27% 3.79 0.50
X-Means 42.44% 2.98 0.51
RBR 49.22% 3.13 0.52
Direct 46.93% 3.51 0.47
LDA 30.95% 2.90 0.39
hLDA 48.25% 4.26 0.42
DTM 47.64% 2.51 0.56
The table illustrates that REPSET results in high perceived In contrast, representative information extraction is focused
coverage, low perceived redundancy, and the highest F1 on extending the range of information covered by the
values in both rounds, which are desirable. We also used the extracted set. In practice, multiple methods can be used
independent-samples t-test method to statistically test the alongside each other to provide multiple perspectives.
differences. The testing results are shown in Table 7, which
illustrates that in 22 out of the 33 cases, the differences are The research presented in this paper draws upon the promi-
significant. Particularly, in terms of the comprehensive F1 nent needs for representative information in organizational
measure, the performance advantages of REPSET over the context. Nevertheless, the design artifacts proposed in this
benchmark methods are mostly significant, except for RBR, paper may be adapted and applied to other scenarios where
Direct, and DTM. These are basically consistent with the information representativeness is a concern. Apart from
data experiments and demonstrate the outstanding perfor- internal blogs, when managers and employees seek to lever-
mance of REPSET. age blogs on the external web for customer-oriented insights
(Chau and Xu 2012), the representativeness extraction
In summary, considering the comprehensive results that we problem will also be a compelling application need. Mean-
have obtained from the multiple commonly used design while, situations such as online news, discussion boards,
evaluation methods (Gregor and Hevner 2013), it is reason- online magazines, and newsletters can also benefit from
able to conclude that REPSET has demonstrated an
showing a more representative sample of contents. Most
outstanding performance.
news websites (such as cnn.com) carry a section on most-read
news stories, but readers may find it useful if the top represen-
tative articles are also provided. Furthermore, information
Concluding Remarks overload scenarios in offline contexts such as corporate
reports, news articles, and legal documents can also poten-
Satisfying user information needs on internal blog platforms tially benefit.
is a broad topic area that deserves much research efforts in
various directions. Our work on the information extraction Our research has been positioned as an exaptation study
framework and the REPSET method offers a new and mean- (Gregor and Hevner 2013) and executed in line with the prin-
ingful approach to utilizing blogging information in an ciples of design science (Hevner et al. 2004). We formulated
organizational contexts, which is different from other mech- the relatively new problem of representative information
anisms such as the popularity-based or preference-based extraction on internal blogs, which is meaningful for tackling
methods. Specifically, popularity-based methods such as information overload and utilizing accumulated data for
most-read and most-commented blogs are focused on pro- managerial effectiveness. Considering the rapid development
viding the information in which most people are interested. of social network applications (Kane et al. 2014), this topic

203.218.254.164 on Wed, 18 Oct 2023 08:47:49 +00:00
Table 7. Significance Testing Results (Perceived Measures)

Measure Benchmark Method t value Significance
REPSET > Random 2.698 **
REPSET > Most Read 2.462 *
REPSET > Graph 4.009 ***
Perceived Coverage REPSET > X-Means 2.174 *
REPSET > RBR 0.948
REPSET > Direct 1.271
REPSET > LDA 4.075 ***
REPSET > hLDA 0.965
REPSET > DTM 1.194
REPSET < Random -2.666 *
REPSET < Top Rated -4.123 ***
REPSET < Most Read -5.634 ***
REPSET < Most Commented -10.049 ***
REPSET < Graph -5.417 ***
Perceived Redundancy REPSET < X-Means -2.019 *
REPSET < RBR -0.279
REPSET < Direct -1.427
REPSET < LDA 0.438
REPSET < hLDA -3.740 ***
REPSET < DTM 1.832
REPSET > Random 3.2106 **
REPSET > Graph 4.7732 ***
Perceived F1–Measure REPSET > X-Means 2.1558 *
REPSET > RBR 0.9773
REPSET > Direct 1.907
REPSET > LDA 3.7258 ***
REPSET > DTM 0.0224
Note: *p < 0.05; **p < 0.01; ***p < 0.001
may inspire more extensive efforts in related areas. We drew design science literature with the formulated research question
upon existing BI techniques to design a solution. After ana- of representative content extraction on internal blogs, as well
lyzing the requirements for the solution, we found that they as the design framework and extended clustering technique to
may not be met by directly applying existing methods. There- address this problem; (2) we contribute to the business intelli-
fore, the novel technique of REPSET was proposed and gence literature with the novel clustering method developed
incorporated into the design framework. The postulated arti- through the course of solution design, which offers greater
fact was evaluated using multiple approaches involving both understanding of clustering research and may be generalized
objective data and subjective perceptions, with all experi- to other contexts.
ments carefully designed, organized, and executed, in order
to guarantee the validity and reliability of the results. In this One limitation of our current work is that we did not empiri-
way, our contributions are two-fold: (1) we contribute to the cally test whether and how the extracted representative

203.218.254.164 on Wed, 18 Oct 2023 08:47:49 +00:00
articles can eventually help with managerial decision making. ologies,” Journal of Information Science (35:2), pp. 180-191
Another limitation is that we only considered internal blog- (doi: 10.1177/0165551508095781).
ging applications, while other forms of enterprise social media Bente, S., and Karla, J. 2009. “Enterprise Social Network Plat-
platforms could also have been studied. Meanwhile, although forms as a Management Tool in Complex Technical Systems,” in
the novel clustering method proposed in REPSET performed Proceedings of the 15th Americas Conference on Information
well in our research context, the validation of its advantages Systems, San Francisco, August 6.
in general demands future work to test it in other contexts. Berkhin, P. 2006. “A Survey of Clustering Data Mining Tech-
niques,” in Grouping Multidimensional Data, J. Kogan,
Additionally, the time efficiency of the proposed method
C. Nicholas, and M. Teboulle (eds.), Berlin: Springer pp. 25-71
needs to be improved, if it is to be applied in real-time stream
(doi: 10.1007/3-540-28349-8_2).
data processing scenarios such as extracting representative
Bezdek, J. C., and Pal, N. R. 1998. “Some New Indexes of Cluster
information from microblogging platforms like Twitter.
Validity,” IEEE Transactions on Systems, Man, and Cybernetics,
Another direction for future research is to explore other BI Part B: Cybernetics (28:3), pp. 301-315 (doi: 10.1109/3477.
techniques that could also be helpful for further exploiting the 678624).
massive data accumulated on enterprise social media plat- Blei, D. M. 2012. “Probabilistic Topic Models,” Communications
forms, so as to support management and decision making on of the ACM (55:4), pp. 77-84 (doi: 10.1145/2133806.2133826).
various aspects. Blei, D. M., and Lafferty, J. D. 2006. “Dynamic Topic Models,” in
Proceedings of the 23rd International Conference on Machine
Learning, New York: ACM, pp. 113-120 (doi: 10.1145/
Acknowledgments 1143844.1143859).
Blei, D. M., Griffiths, D., Jordan, M. I., and Tenenbaum, J. B. 2004.
Xunhua Guo was supported by the National Natural Science Foun- “Hierarchical Topic Models and the Nested Chinese Restaurant
dation of China (grant numbers 71490721, 71572092). Qiang Wei, Process,” in Advances in Neural Information Processing Systems
Guoqing Chen, and Dandan Qiao were supported by the National 16, Cambridge, MA: MIT Press, pp. 17-24.
Natural Science Foundation of China (grant numbers 71110107027, Blei, D. M., Ng, A. Y., and Jordan, M. I. 2003. “Latent Dirichlet
71490724, 71372044) and the MOE Project of Key Research Insti- Allocation,” The Journal of Machine Learning Research (3), pp.
tute of Humanities and Social Sciences at Universities of China 993-1022.
(grant number 12JJD630001). Jin Zhang was supported by the Boley, D., Gini, M., Gross, R., Han, E.-H., Hastings, K.,
National Natural Science Foundation of China (grant number Karypis, G., Kumar, V., Mobasher, B., and Moore, J. 1999.
71402186). “Partitioning-Based Clustering for Web Document Categori-
zation,” Decision Support Systems (27:3), pp. 329-341 (doi:
References 10.1016/S0167-9236(99)00055-X).
Broder, A. Z., Glassman, S. C., Manasse, M. S., and Zweig, G.
Abbasi, A., and Chen, H. 2008. “Cybergate: A Design Framework 1997. “Syntactic Clustering of the Web,” Computer Networks
and System for Text Analysis of Computer-Mediated Communi- and ISDN Systems (29:8), pp. 1157-1166 (doi: 10.1016/S0169-
cation,” MIS Quarterly (32:4), pp. 811-837. 7552(97)00031-7).
Adomavicius, G., Bockstedt, J. C., Gupta, A., and Kauffman, R. J. Cai, X., and Li, W. 2011. “A Spectral Analysis Approach to Docu-
2008. “Making Sense of Technology Trends in the Information ment Summarization: Clustering and Ranking Sentences Simul-
Technology Landscape: A Design Science Approach,” MIS taneously,” Information Sciences (181:18), pp. 3816-3827 (doi:
Quarterly (32:4), pp. 779-809. 10.1016/j.ins.2011.04.052).
Aggarwal, R., Gopal, R., Sankaranarayanan, R., and Singh, P. V. Chau, M., and Xu, J. 2012. “Business Intelligence in Blogs:
2012. “Blog, Blogger, and the Firm: Can Negative Employee Understanding Consumer Interactions and Communities,” MIS
Posts Lead to Positive Outcomes?,” Information Systems Quarterly (36:4), pp. 1189-1216.
Research (23:2), pp. 306-322 (doi: 10.1287/isre.1110.0360). Chen, G., Liu, H., Yu, L., Wei, Q., and Zhang, X. 2006. “A New
Aliguliyev, R. M. 2009. “A New Sentence Similarity Measure and Approach to Classification Based on Association Rule Mining,”
Sentence Based Extractive Technique for Automatic Text Sum- Decision Support Systems (42:2), pp. 674-689 (doi: 10.1016/
marization,” Expert Systems with Applications (36:4), pp. j.dss.2005.03.005).
7764-7772 (doi: 10.1016/j.eswa.2008.11.022). Chui, M., Manyika, J., Bughin, J., Dobbs, R., Roxburgh, C.,
Allington, R. L. 1984. “Content Coverage and Contextual Reading Sarrazin, H., Sands, G., and Westergren, M. 2012. The Social
in Reading Groups,” Journal of Literacy Research (16:2), pp. Economy: Unlocking Value and Productivity through Social
85-96 (doi: 10.1080/10862968409547506). Technologies, McKinsey Global Institute (http://www.mckinsey.
Arazy, O., and Woo, C. 2007. “Enhancing Information Retrieval com/industries/high-tech/our-insights/the-social-economy).
through Statistical Natural Language Processing: A Study of Coenen, F., and Leng, P. 2007. “The Effect of Threshold Values on
Collocation Indexing,” MIS Quarterly (31:3), pp. 525-546. Association Rule Based Classification Accuracy,” Data &
Bawden, D., and Robinson, L. 2009. “The Dark Side of Infor- Knowledge Engineering (60:2), pp. 345-360 (doi: 10.1016/
mation: Overload, Anxiety and Other Paradoxes and Path- j.datak.2006.02.005).

203.218.254.164 on Wed, 18 Oct 2023 08:47:49 +00:00
Das, D., and Martins, A. F. T. 2007. “A Survey on Automatic Text Computing, New York: ACM, pp. 604-613 (doi: 10.1145/
Summarization,” Literature Survey for Language and Statistics 276698.276876).
II course, Carnegie Mellon University. Jackson, A., Yates, J., and Orlikowski, W. 2007. “Corporate
Ester, M., Kriegel, H., Sander, J., and Xu, X. 1996. “A Density- Blogging: Building Community through Persistent Digital Talk,”
Based Algorithm for Discovering Clusters in Large Spatial in Proceedings of the 40th Annual Hawaii International
Databases with Noise,” in Proceedings of the 2nd International Conference on System Sciences, Waikoloa, HI, pp. 80-80 (doi:
Conference on Knowledge Discovery and Data Mining, Portland, 10.1109/HICSS.2007.155).
OR, August 2-4, AAAI Press, pp. 226-231. Jain, A. K., Murty, M. N., and Flynn, P. J. 1999. “Data Clustering:
Gionis, A., Indyk, P., and Motwani, R. 1999. “Similarity Search in A Review,” ACM Computing Surveys (31), pp. 264-323 (doi:
High Dimensions via Hashing,” in Proceedings of the 25th 10.1145/331499.331504).
International Conference on Very Large Data Bases, Edinburgh, Kane, G. C., Alavi, M., Labianca, G., and Borgatti, S. P. 2014.
Scotland, pp. 518-529. “What’s Different about Social Media Networks? A Framework
Goes, P. B. 2014. “Editor’s Comments: Design Science Research and Research Agenda,” MIS Quarterly (38:1), pp. 275-304.
in Top Information Systems Journals,” MIS Quarterly (38:1), pp. Ku, L.-W., Ho, H.-W., and Chen, H.-H. 2009. “Opinion Mining
iii-viii. and Relationship Discovery Using CopeOpi Opinion Analysis
Granka, L. A., Joachims, T., and Gay, G. 2004. “Eye-Tracking System,” Journal of American Society for Information Science
Analysis of User Behavior in WWW Search,” in Proceedings of and Technology (60:7), pp. 1486-1503 (doi: http://dx.doi.org/
the 27th Annual International ACM SIGIR Conference on 10.1002/asi.v60:7).
Research and Development in Information Retrieval, New York: Lau, R. Y. K., Liao, S. S. Y., Wong, K. F., and Chiu, D. K. W.
ACM, pp. 478-479 (doi: 10.1145/1008992.1009079). 2012. “Web 2.0 Environmental Scanning and Adaptive Decision
Gregor, S., and Hevner, A. R. 2013. “Positioning and Presenting Support for Business Mergers and Acquisitions,” MIS Quarterly
Design Science Research for Maximum Impact,” MIS Quarterly (36:4), pp. 1239-1268.
(37:2), pp. 337-355. Leskovec, J., Rajaraman, A., and Ullman, J. D. 2014. Mining of
Griffiths, T. L., and Steyvers, M. 2004. “Finding Scientific Massive Datasets, Cambridge, UK: Cambridge University Press.
Topics,” Proceedings of the National Academy of Sciences Li, N., Guo, X., Chen, G., and Luo, N. 2015. “Reading Behavior
(101:Supplement 1), pp. 5228-5235 (doi: 10.1073/pnas. on Intra-Organizational Blogging Systems: A Group-Level
0307752101). Analysis through the Lens of Social Capital Theory,” Information
Gu, Q., and Zhou, J. 2009. “Co-clustering on Manifolds,” in & Management (52:7), pp. 870-881 (doi: 10.1016/j.im.2015.
Proceedings of the 15th ACM SIGKDD International Conference 03.004).
on Knowledge Discovery and Data Mining, New York: ACM, Liao, Q., Pan, S., Lai, J. C., and Yang, C. 2011. “Enterprise Blog-
pp. 359-368 (doi: 10.1145/1557019.1557063). ging in a Global Context: Comparing Chinese and American
Gupta, V., and Lehal, G. S. 2010. “A Survey of Text Summariza- Practices,” in Proceedings of the ACM 2011 Conference on
tion Extractive Techniques,” Journal of Emerging Technologies Computer Supported Cooperative Work, New York: ACM, pp.
in Web Intelligence (2:3), pp. 258-268. 35-44 (doi: 10.1145/1958824.1958831).
Han, E.-H., Karypis, G., Kumar, V., and Mobasher, B. 1998. Liu, B. 2007. Web Data Mining: Exploring Hyperlinks, Contents
“Hypergraph Based Clustering in High-Dimensional Data Sets: and Usage Data, Heidelberg: Springer.
A Summary of Results,” IEEE Bulletin of the Technical Com- Liu, X., and Croft, W. B. 2008. “Evaluating Text Representations
mittee on Data Engineering (21:1), pp. 15-22. for Retrieval of the Best Group of Documents,” in Advances in
Han, J., Pei, J., and Kamber, M. 2011. Data Mining: Concepts and Information Retrieval, C. Macdonald, I. Ounis, V. Plachouras,
Techniques, Amsterdam: Elsevier. I. Ruthven, and R. W. White (eds.), Berlin: Springer, pp.
He, B., and Ounis, I. 2003. “A Study of Parameter Tuning for 454-462 (doi: 10.1007/978-3-540-78646-7_43).
Term Frequency Normalization,” in Proceedings of the 12th Liu, Y., Huang, X., An, A., and Yu, X. 2007. “ARSA: A
International Conference on Information and Knowledge Sentiment-Aware Model for Predicting Sales Performance Using
Management, New York: ACM, pp. 10-16 (doi: 10.1145/ Blogs,” in Proceedings of the 30th Annual International ACM
956863.956867). SIGIR Conference on Research and Development in Information
Hevner, A. R., March, S. T., Park, J., and Ram, S. 2004. “Design Retrieval, New York: ACM, pp. 607-614 (doi: http://doi.acm.
Science in Information Systems Research,” MIS Quarterly org/10.1145/1277741.1277845).
(28:1), pp. 75-105. Lu, B., Guo, X., Luo, N., and Chen, G. 2015. “Corporate Blogging
Hu, M., and Liu, B. 2004. “Mining and Summarizing Customer and Job Performance: Effects of Work-Related and Nonwork-
Reviews,” in Proceedings of the 10th ACM SIGKDD Inter- Related Participation,” Journal of Management Information
national Conference on Knowledge Discovery and Data Mining, Systems (32:4), pp. 285-314 (doi: 10.1080/07421222.2015.
New York: ACM, pp. 168-177 (doi: 10.1145/1014052. 1138573).
1014073). Luo, N., Guo, X., Zhang, J., Chen, G., and Zhang, N. 2015.
Indyk, P., and Motwani, R. 1998. “Approximate Nearest Neigh- “Understanding the Continued Use of Intra-Organizational Blogs:
bors: Towards Removing the Curse of Dimensionality,” in An Adaptive Habituation Model,” Computers in Human
Proceedings of the 30th Annual ACM Symposium on Theory of Behavior (50:9), pp. 57-65 (doi: 10.1016/j.chb.2015.03.070).

203.218.254.164 on Wed, 18 Oct 2023 08:47:49 +00:00
Ma, B., and Wei, Q. 2012. “Measuring the Coverage and Redun- on Knowledge Discovery and Data Mining, New York: ACM,
dancy of Information Search Services on e-Commerce Plat- pp. 823-831 (doi: 10.1145/2020408.2020550).
forms,” Electronic Commerce Research and Applications (11:6), Steinbach, M., Karypis, G., and Kumar, V. 2000. “A Comparison
pp. 560-569 (doi: 10.1016/j.elerap.2012.09.001). of Document Clustering Techniques,” in Proceedings of the KDD
Mandala, S., Kumara, S., and Chatterjee, K. 2014. “A Game- Workshop on Text Mining, New York: ACM, pp. 525-526.
Theoretic Approach to Graph Clustering,” INFORMS Journal on Suh, A., Shin, K., Ahuja, M., and Kim, M. S. 2011. “The Influence
Computing (26:3), pp. 629-643 (doi: 10.1287/ijoc.2013.0588). of Virtuality on Social Networks Within and Across Work
Marmanis, H., and Babenko, D. 2009. Algorithms of the Intelligent Groups: A Multilevel Approach,” Journal of Management
Web (1st ed.), Greenwich, CT: Manning Publications Co. Information Systems (28:1), pp. 351-386.
McAfee, A. P. 2006. “Enterprise 2.0: The Dawn of Emergent Turney, P. D., and Littman, M. L. 2003. “Measuring Praise and
Collaboration,” MIT Sloan Management Review (47:3), pp. Criticism: Inference of Semantic Orientation from Association,”
21-28. ACM Transations on Information Systems (21:4), pp. 315-346
McLaren, T., Head, M., Yuan, Y., and Chan, Y. 2011. “A Multi- (doi: http://doi.acm.org/10.1145/944012.944013).
level Model for Measuring Fit Between a Firm’s Competitive Vapnik, V. N. 1998. Statistical Learning Theory, New York:
Strategies and Information Systems Capabilities,” MIS Quarterly Wiley.
(35:4), pp. 909-929. Witten, I. H., Moffat, A., and Bell, T. C. 1999. Managing Giga-
Moore, J., Han, E., Boley, D., Gini, M., Gross, R., Hastings, K., bytes: Compressing and Indexing Documents and Images, San
Karypis, G., Kumar, V., and Mobasher, B. 1997. “Web Page Francisco: Morgan Kaufmann.
Categorization and Feature Selection Using Association Rule and Wright, A. 2009. “Our Sentiments, Exactly,” Communications of
Principal Component Clustering,” (available at http://europepmc. the ACM (52:4), pp. 14-15 (doi: http://doi.acm.org/10.1145/
org/abstract/CIT/233736/reload=0). 1498765.1498772).
Nardi, B. A., Schiano, D. J., Gumbrecht, M., and Swartz, L. 2004. Wu, J., Yuan, H., Xiong, H., and Chen, G. 2010. “Validation of
“Why We Blog,” Communications of the ACM (47:12), pp. Overlapping Clustering: A Random Clustering Perspective,”
41-46. Information Sciences (180:22), pp. 4353-4369 (doi: 10.1016/
Pan, F., Wang, W., Tung, A. K. H., and Yang, J. 2005. “Finding j.ins.2010.07.028).
Representative Set from Massive Data,” in Fifth IEEE Zelnik-Manor, L., and Perona, P. 2004. “Self-Tuning Spectral
International Conference on Data Mining, p. 8 (doi: 10.1109/ Clustering,” in Advances in Neural Information Processing
ICDM.2005.69).
Systems, Cambridge, MA: MIT Press, pp. 1601-1608.
Pelleg, D., and Moore, A. 2000. “X-means: Extending K-means
Zhang, Y., Callan, J., and Minka, T. 2002. “Novelty and Redun-
with Efficient Estimation of the Number of Clusters,” in Pro-
dancy Detection in Adaptive Filtering,” in Proceedings of the 25th
ceedings of the 17th International Conference on Machine
Annual International ACM SIGIR Conference on Research and
Learning, San Francisco: Morgan Kaufmann, pp. 727-734.
Development in Information Retrieval, New York: ACM, pp.
Pons-Porrata, A., Berlanga-Llavori, R., and Ruiz-Shulcloper, J.
81-88 (doi: http://doi.acm.org/10.1145/564376.564393).
2007. “Topic Discovery Based on Text Mining Techniques,”
Zhang, Z., Chen, G., Zhang, J., Guo, X., and Wei, Q. 2016.
Information Processing & Management (43:3), pp. 752-768 (doi:
“Providing Consistent Opinions from Online Reviews: A
DOI: 10.1016/j.ipm.2006.06.001).
Heuristic Stepwise Optimization Approach,” INFORMS Journal
Powers, D. M. 2011. “Evaluation: from Precision, Recall and F-
on Computing (28:2), pp. 236-250 (doi: 10.1287/ijoc.2015.
measure to ROC, Informedness, Markedness and Correlation,”
0672).
Journal of Machine Learning Technologies (2:1), pp. 31-63.
Zhao, Y., Karypis, G., and Fayyad, U. 2005. “Hierarchical Clus-
Salton, G. 1971. The SMART Retrieval System: Experiments in
Automatic Document Processing, Englewood Cliffs, NJ.: tering Algorithms for Document Datasets,” Data Mining and
Prentice-Hall. Knowledge Discovery (10:3), pp. 141-168 (doi: doi:10.1007/
Sedding, J., and Kazakov, D. 2004. “WordNet-Based Text s10618-005-0361-3).
Document Clustering,” in Proceedings of the 3rd Workshop on
Robust Methods in Analysis of Natural Language Data,
Stroudsburg, PA: Association for Computational Linguistics, pp. About the Authors
104-113 (available at http://dl.acm.org/citation.cfm?id=
1621445.1621458). Xunhua Guo is an associate professor of Information Systems at the
Seref, O., Fan, Y.-J., and Chaovalitwongse, W. A. 2013. “Mathe- School of Economics and Management, Tsinghua University. He
matical Programming Formulations and Algorithms for Discrete received his Ph.D. degree from Tsinghua University in 2005. His
k-Median Clustering of Time-Series Data,” INFORMS Journal research takes behavioral and design science approaches to topics on
on Computing (26:1), pp. 160-172 (doi: 10.1287/ijoc. electronic commerce, social networks, and business intelligence. He
2013.0554). has published in journals including Communications of the ACM,
Simon, G. J., Kumar, V., and Li, P. W. 2011. “A Simple Statistical Decision Sciences, Decision Support Systems, Information & Man-
Model and Association Rule Filtering for Classification,” in agement, Information Systems Journal, INFORMS Journal on
Proceedings of the 17th ACM SIGKDD International Conference Computing, Journal of Information Technology, and Journal of MIS.

203.218.254.164 on Wed, 18 Oct 2023 08:47:49 +00:00
He serves as Executive Board Member and Vice General-Secretary 2005–2013). He has numerous publications in journals, books, and
of China Association for Information Systems (CNAIS). conference proceedings, including a number of Harvard Business
School cases on IT/IS in Chinese companies. His research interests
Qiang Wei is an associate professor of Information Systems at the include business intelligence and analytics, e-business and IT-
School of Economics and Management, Tsinghua University, enabled innovation, and fuzzy logic applications.
Beijing, China. His research interests cover big data analytics, busi-
ness intelligence, online advertising, web mining, online recommen- Jin Zhang is an assistant professor at the School of Business,
dation, etc. His work has been published in journals including ACM Renmin University of China. He received his Ph.D. from the School
Transactions on Knowledge Discovery from Data, Decision of Economics and Management, Tsinghua University. His current
Sciences, INFORMS Journal on Computing, Decision Support research interests include web data mining, competitive intelligence,
Systems, and Information Sciences. He serves as an associate editor and information retrieval. His work has been published in journals
for Decision Support Systems and Electronic Commerce Research, including INFORMS Journal on Computing, IEEE Transactions on
and secretary or committee member of several academic societies. Neural Networks and Learning Systems, ACM Transactions on
Qiang was the corresponding author for this paper. Knowledge Discovery from Data, Decision Support Systems, and
Information & Management.
Guoqing Chen is EMC Chair Professor of Information Systems at
the School of Economics and Management, Tsinghua University Dandan Qiao is currently pursuing her Ph.D. at the School of
(Beijing, China). He received his Ph.D. in managerial informatics Economics and Management, Tsinghua University, Beijing, China.
from the Catholic University of Leuven (KUL, Belgium) in 1992. Her research focuses on business intelligence, data mining, and
He was appointed China's national Changjiang Scholar Professor in human behavior analytics. Her work has been published in journals
2005, and awarded IFSA Fellow in 2009. He served as the founding including ACM Transactions on Knowledge Discovery from Data
president of China Association for Information Systems (CNAIS, and Information & Management.

203.218.254.164 on Wed, 18 Oct 2023 08:47:49 +00:00

203.218.254.164 on Wed, 18 Oct 2023 08:47:49 +00:00
RESEARCH ARTICLE
EXTRACTING REPRESENTATIVE INFORMATION ON

INTRA-ORGANIZATIONAL BLOGGING PLATFORMS
Xunhua Guo, Qiang Wei, and Guoqing Chen
Beijing 100084 CHINA {guoxh@sem.tsinghua.edu.cn} {weiq@sem.tsinghua.edu.cn}
{chengq@sem.tsinghua.edu.cn}
Jin Zhang
School of Business, Renmin University of China,
Beijing 100872 CHINA {zhangjin@rbs.ruc.edu.cn}
Dandan Qiao
Beijing 100084 CHINA {qiaodd.12@sem.tsinghua.edu.cn}
Appendix A
An Illustrative Example for the Clustering Process of REPSET
Figure A1 illustrates the clustering process of the REPSET method. There are nine documents represented as five round points and four square
points. The distances between data points correspond to the pairwise similarities between documents. Intuitively, these nine documents can
be divided into two clusters, i.e., the round point cluster and the square point cluster.
At the beginning, each point is treated as an individual cluster. In each step, two clusters with the largest similarity are merged into a new one.
The backward strategy is applied after the generation of a new cluster. In the example, in steps (b)–(f), no boundary document is found in the
generated cluster and reallocation does not occur. In step (g), two documents located in the middle are merged into one cluster, since they have
the largest similarity. If the clustering process continues as the traditional hierarchical clustering method, the nine documents will finally be
grouped into two clusters as shown in step (h), which turns out to be inaccurate, because one square document is assigned to a cluster in which
most documents are round. In contrast, in REPSET, the square document will be marked as a boundary object (outlined in red), since its
average similarity to the whole cluster is less than λ. The backward strategy will then reevaluate the red document and reallocate it if needed.
As shown in (i), because the red document’s similarity to the left cluster is than that to the right cluster, it is reallocated to the left cluster.
MIS Quarterly Vol. 41 No. 4–Appendices/December 2017 A1

203.218.254.164 on Wed, 18 Oct 2023 08:47:49 +00:00
Figure A1. Illustrative Example for REPSET Clustering
A2 MIS Quarterly Vol. 41 No. 4–Appendices/December 2017

203.218.254.164 on Wed, 18 Oct 2023 08:47:49 +00:00
Appendix B
Proportions of Data Objects Reallocated in REPSET Clustering
Table B1: Proportions of Data Objects Reallocated in REPSET Clustering

Dataset No. Reallocation Proportion Dataset No. Reallocation Proportion
1 0.91% 16 3.23%
2 0.15% 17 1.17%
3 0.00% 18 0.37%
4 1.35% 19 0.25%
5 0.01% 20 5.79%
6 4.42% 21 1.52%
7 2.16% 22 1.16%
8 0.00% 23 0.04%
9 3.47% 24 0.78%
10 0.35% 25 1.08%
11 0.38% 26 1.16%
12 0.51% 27 0.60%
13 0.47% 28 0.18%
14 4.94% 29 0.00%
15 0.30% 30 2.37%
Average 1.29%
Figure B1. Proportions of Data Objects Reallocated in Iteration Stages

203.218.254.164 on Wed, 18 Oct 2023 08:47:49 +00:00
Appendix C
Representative Article Extraction Example
As an example, we compare the results of representative article extraction between REPSET and X-Means, a typical traditional clustering
method. We used the blogging data of a typical region of Company X, during the time period of July 2010, with a total of 2,239 blog articles.
Using REPSET and X-Means, respectively, we extracted 10 articles from the data set. Figure C1 shows the comparison of the sizes of clusters
generated by the two methods. It can be seen that the sizes of clusters generated by X-Means is much more even than those of REPSET. These
results illustrate the fact that X-Means tends to select clusters that are spatially equidistant, which is deficient for representativeness extraction.
In contrast, the sizes of clusters generated by REPSET are more discrepant, indicating the capability of REPSET to capture more diverse
content.
Figure C1. Cluster Size Comparison
Table C1 lists the titles of the representative articles extracted by REPSET and X-Means, respectively. It can be seen that the articles extracted
by X-Means are mostly life-related and emotional essays. In the whole data set, about 70% of articles are life-related and 30% are work-related.
In the results of X-Means, work-related articles are drown in life-related content and fail to be revealed. Meanwhile, most of the extracted life-
related articles are similar to some extent. In contrast, in the results of REPSET, six extracted articles are work-related, covering diverse topics
including work balance, industry analysis, and customer alerts. The four life-related articles are also diverse, covering topics such as love,
dining, and life advices. Such results intuitively show that REPSET can facilitate enhanced diversity in coverage and therefore capture more
representative results.

203.218.254.164 on Wed, 18 Oct 2023 08:47:49 +00:00
Table C1. Representative Article Extraction Results

Articles Extracted by REPSET
No. Article Title Category
1 In the south, love is not only conveyed with poetry Life
2 Nice places for dining in the Shunde city Life
3 How to balance project work and post work Work
4 Fourteen advices for girls born in the 1980s Life
5 Industry Watch: China Mobile is facing five challenges in the field of information technology Work
6 The most comprehensive methods for recovering from drunk Life
7 Sharing the morning business meeting memo for July 6, 2010 Work
8 Training plan for new employees Work
9 Work plan, 2nd week, July, Ronggui customer services Work
10 VIP customer alert numbers Work
Articles Extracted by X-Means
No. Article Title Category
1 Happiness blind Life
2 There is a love that cannot be waited, there is a love that cannot be hurt. Life
3 Life is like donkeys, dogs, and monkeys. Life
4 Workplace is like a battlefield Work
5 To lose Life-work
6 Sadness is a kind of beauty Life
7 Love is only one more stroke than hate Life
8 Someone to allow me to be unreasonable Life
9 Losing affection Life
10 Spider nets Life

203.218.254.164 on Wed, 18 Oct 2023 08:47:49 +00:00
Appendix D
Empirical Data Experiments Results with Varying Numbers of
Representative Articles
CLUSTER TOP MOST MOST

NUM MEASURE REPSET RANDOM RATED READ COMMENTED GRAPH
10 de facto F1-measure 0.504 0.391 0.278 0.374 0.205 0.326
10 de facto Coverage 0.374 0.284 0.193 0.287 0.130 0.326
10 de facto Redundancy 0.227 0.373 0.502 0.463 0.521 0.675

203.218.254.164 on Wed, 18 Oct 2023 08:47:49 +00:00
CLUSTER TOP MOST MOST

NUM MEASURE REPSET RANDOM RATED READ COMMENTED GRAPH
CLUSTER
NUM MEASURE XMEANS RBR DIRECT LDA HLDA DTM

203.218.254.164 on Wed, 18 Oct 2023 08:47:49 +00:00
CLUSTER
NUM MEASURE XMEANS RBR DIRECT LDA HLDA DTM

203.218.254.164 on Wed, 18 Oct 2023 08:47:49 +00:00
Appendix E
Experiment Script (Translated from Chinese)
Experiments on Blog Article Reading

NO:
Dear students,
Welcome to participate in this experiment!
The following are ten articles selected from a company's internal blogging system. Please carefully read these blog articles for 20 minutes.
When the time is up, you will be asked to answer several questions mark a number of words according to what you have read.
Basic information:
1. Name , Gender , Age
2. Major , Grade
3. Your familiarity with blogging system
A. Completely unfamiliar B. Unfamiliar C. Neutral D. Familiar E. Very familiar
(After providing the basic information, the articles are displayed and the subjects are required to read the articles for 20 minutes. When the
time is up, the system closes the article and presents the following questions.)
For each of the following question, please select the number that corresponds with your reading experience. The number “1” represents
“strongly disagree,” while the number “7” represents “strongly agree.”
Please select
Strongly Slightly Slightly Strongly
Questions Disagree Disagree Disagree Neutral Agree Agree Agree
This group of blog articles has much repetition. 1 2 3 4 5 6 7
This group of blog articles has similar topics. 1 2 3 4 5 6 7
This group of blog articles has a wide range of contents. 1 2 3 4 5 6 7
This group of blog articles provides rich information. 1 2 3 4 5 6 7
If you have any feelings, opinions or comments on this group of blog articles please feel free to write them down in the following space:

203.218.254.164 on Wed, 18 Oct 2023 08:47:49 +00:00
Please label the following words based on your reading.
If you have browsed any information related with the word, please tick “%”, otherwise no label is needed.
100 139 2010 BOSS G3 GPRS

http arrangement case method transaction organization
warranty standard performance others department participate
take part in operation inquire product surpass growth
success grade member charge error final
unit cause zone place second third
first phone store adjustment effect dynamic
sms happen send discovery development program
method manner fee branch allocation analysis
share minute Foshan service responsible accessory
change thanks post tell the work company
function communication purchase key management Guangdong
specification rule process kids number cooperation
bill joy environment reply home conference
activity opportunity foundation group plan record
quarter home value reduce inspection simple
set up proposal health reward drop exchange
accept end solve explanation introduction progress
manager experience experience spirit competition account
opening happy check client control happiness
binding difficulty leliu leave understanding utilize
leadership process satisfaction monthly password free
goal internal content ability strive train
coordinate friend brand balance usually platform
business reception situation mood area channel
global pass life staff daily Ronggui social
application close identity body life life
time world market thing receive collect
cell phone familiar data Shunde attitude discussion
package put forward improve provide upgrade remind
experience condition call communicate communications notification
colleague unity complaints team spread recommendation
expand network website future hope habit
love system afternoon show on site relevant
enjoy project consumption sales effect assist
thanks mood mentality information week industry
form happiness demand propaganda choose student
learning pressure business opinion awareness factor
marketing operate affect own forever user
preferential outstanding pre-store staff reason operations
online responsibility increase gift about correct
policy support knowledge execution index guide
formulate quality intelligence china center terminal
emphasis initiative theme status consultation charges
data resources comprehensive

203.218.254.164 on Wed, 18 Oct 2023 08:47:49 +00:00
Appendix F
Screenshots for the Experiment System UI

203.218.254.164 on Wed, 18 Oct 2023 08:47:49 +00:00

203.218.254.164 on Wed, 18 Oct 2023 08:47:49 +00:00

Guo ExtractingRepresentativeInformation 2017

Uploaded by

Copyright:

Available Formats

You might also like

Guo ExtractingRepresentativeInformation 2017

Uploaded by

Document Information

Original Description:

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Guo ExtractingRepresentativeInformation 2017

Uploaded by

Copyright:

Available Formats

Extracting Representative Information on Intra-Organizational Blogging Platforms

Stable URL: https://www.jstor.org/stable/10.2307/26630287

Management Information Systems Research Center, University of Minnesota is collaborating with

This content downloaded from

EXTRACTING REPRESENTATIVE INFORMATION ON

Keywords: Intra-organizational blogging platforms, representative information extraction, enterprise social

Introduction1 mation production on the Internet since the beginning of this

MIS Quarterly Vol. 41 No. 4, pp. 1105-1127/December 2017 1105

This content downloaded from

1106 MIS Quarterly Vol. 41 No. 4/December 2017

This content downloaded from

Furthermore, topic clustering in online content analysis

MIS Quarterly Vol. 41 No. 4/December 2017 1107

This content downloaded from

1108 MIS Quarterly Vol. 41 No. 4/December 2017

This content downloaded from

Figure 1. Design Framework for the Extraction System

MIS Quarterly Vol. 41 No. 4/December 2017 1109

This content downloaded from

Documents Document Document Clusters , kth Representative

Figure 2. REPSET Procedure

Based on these design requirements, we propose a BI method Similarity Matrix Construction

1110 MIS Quarterly Vol. 41 No. 4/December 2017

This content downloaded from

 Sim(d ) < λ , λ ∈[0,1]

MIS Quarterly Vol. 41 No. 4/December 2017 1111

This content downloaded from

Table 1. Features of Clustering Methods

Figure 3. Clustering Algorithm

1112 MIS Quarterly Vol. 41 No. 4/December 2017

This content downloaded from

Figure 4. Backward Strategy Example

MIS Quarterly Vol. 41 No. 4/December 2017 1113

This content downloaded from

Figure 5. REPSET Procedure Pseudo-Codes

1114 MIS Quarterly Vol. 41 No. 4/December 2017

This content downloaded from

In order to evaluate the performance of the clustering 1 k

MIS Quarterly Vol. 41 No. 4/December 2017 1115

This content downloaded from

Table 2. Entropy and Purity Measures

1116 MIS Quarterly Vol. 41 No. 4/December 2017

This content downloaded from

Table 3: Friedman Test Statistical Results on Entropy and Purity Measures

Note: *p < 0.05, **p < 0.01, ***p < 0.001

Figure 6. Running Time of REPSET on Different Scales of Datasets

Empirical Evaluations In 2006, an internal blogging platform was deployed in Com-

MIS Quarterly Vol. 41 No. 4/December 2017 1117

This content downloaded from

Benchmark Methods and Evaluation Plan

1118 MIS Quarterly Vol. 41 No. 4/December 2017

This content downloaded from

Table 4. De facto Performance (Overall)

rC ( D′ , D) =   max (Sim(d , d ′))

MIS Quarterly Vol. 41 No. 4/December 2017 1119

This content downloaded from

Table 5. Significance Testing Results (de facto Measures)

Note: p < 0.05, p < 0.01, p < 0.001