
A Graph Based Approach on Extractive Summarization

Madhurima Dutta, Ajit Kumar Das, Chirantana Mallick, Apurba Sarkar and Asit K. Das

Indian Institute of Engineering Science and Technology, Shibpur, Shibpur, India
e-mail: madhurima.pg2016@cs.iiests.ac.in; writetoajit@yahoo.com; chirantana9@gmail.com; as.besu@gmail.com; akdas@cs.iiests.ac.in

Abstract With the advent of information technology and the Internet, the world is producing several terabytes of information every second. Several online news feeds that report incidents almost instantly have appeared in the past decade. This has led to a dire need to reduce content and present the user only with what is necessary, called the summary. In this paper an extractive summarization technique based on graph theory is proposed. The method creates a representative summary or abstract of the entire document by finding the most informative sentences through Infomap clustering applied to a graphical representation of the document.

Summarization systems reduce full-sized articles to fluent and concise summaries that convey the central idea of the passage along with relevant information. A summary is a text that is produced from one or more texts, contains a significant portion of the information present in the original text, and is much shorter than the original. The systems considered here produce paragraph-length summaries. Automatic document summarization dates back to Luhn's work [1]. Many methods [2–4] have been proposed to extract the important concepts from a source text and to build the intermediate representation. Early methods [5, 6] focused on the frequency of words present in the document to determine the concepts and ideas being highlighted. Linguistic approaches, on the other hand, attempt to achieve true "semantic understanding" of

the source document. Deep semantic analysis can produce a summary whose quality approaches that of a human-written one. Such approaches, however, require a detailed semantic representation of the document and a domain-specific knowledge base for the language concerned.
Though there are several kinds of summarization techniques in the literature, only extractive summarization techniques are discussed here. Extractive summarizers work in three relatively independent phases. First, the raw input is preprocessed into an intermediate representation of the text: stopwords, i.e., frequently occurring articles, conjunctions, and prepositions, are removed, and in some cases punctuation is removed as well. From the remaining text, the "term frequency", the "inverse document frequency", or both are computed and stored as key-value pairs; these metrics help capture the crux of the information in the document. Next, weights are assigned to the sentences based on the intermediate representation. In the final phase, the highest-scoring sentences are selected greedily. The stages are described in the subsequent sections.
A. Intermediate Representation
The main task of this step is to identify the information hidden in the original document. "Topic representation" approaches are widely used, in which the topic words pertaining to the particular document are identified. Some of the most popular summarization methods focus on topic representation so that the central idea of the document is not lost. These approaches include topic word or topic signature approaches [7], frequency, TF-IDF, etc. Other approaches like lexical chains [8] use widely available resources such as WordNet, which helps to establish similarity between semantically related words. Another method of extractive summarization uses Latent Semantic Analysis [9], which identifies patterns of word co-occurrence that roughly correspond to topics; suitable weights are assigned to each pattern subject to fulfillment of some conditions. In "indicator representation" [10] approaches, each sentence in the input is represented as a list of indicators of importance such as sentence length, location in the document, presence of certain phrases, etc. In graph models such as LexRank [11], the entire document is represented as a network of sentences by means of weighted graphs, where the edge weights correspond to the similarity between the sentences they connect.
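As a concrete illustration of such an intermediate representation, the short Python sketch below builds a TF-IDF table as key-value pairs. It is not the method of any cited system; the tokenizer and the two-document corpus are hypothetical.

import math
import re
from collections import Counter

def tokenize(text):
    # Naive tokenizer, used only for illustration.
    return re.findall(r"[a-z]+", text.lower())

def tf_idf(documents):
    """Return one {word: tf-idf weight} dict per document."""
    doc_tokens = [tokenize(d) for d in documents]
    n_docs = len(documents)
    # Document frequency: number of documents containing each word.
    df = Counter(w for tokens in doc_tokens for w in set(tokens))
    tables = []
    for tokens in doc_tokens:
        tf = Counter(tokens)
        # Words occurring in every document get idf = 0, i.e. no weight.
        tables.append({w: (count / len(tokens)) * math.log(n_docs / df[w])
                       for w, count in tf.items()})
    return tables

docs = ["the cat sat on the mat", "the dog chased the cat"]
print(tf_idf(docs)[0])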
B. Score Sentences
The task of scoring sentences differs depending on the approach used to create the intermediate representation: a sentence of greater relevance is given a higher weight than the others. This is especially followed in the topic representation approaches, where the score is assigned after examining how important a particular sentence is to the document. For indicator representation methods, the weight of each sentence is determined by taking the values of the indicators into account, most commonly by using machine learning techniques. In the graphical approaches, metrics such as the similarity between vertices are computed before further processing is carried out.
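For the topic representation case, a minimal scoring sketch might weight each sentence by the summed TF-IDF of its words, reusing the hypothetical tokenize and tf_idf helpers above:

def score_sentences(sentences, weights):
    """Score each sentence by the total TF-IDF weight of its words;
    `weights` is a {word: tf-idf} dict such as one produced by tf_idf."""
    return [(sum(weights.get(w, 0.0) for w in tokenize(s)), s)
            for s in sentences]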

C. Selecting Summary Sentences
The last step constructs the resultant summary. Generally it selects the combination of sentences, in the form of a paragraph, that gives the key information of the original text in a concise and fluent manner. Care is taken to minimize redundancy so that similar sentences in the summary are avoided, and the summary should be coherent with the original text. The genre of the document, which can be a webpage, a news article, an email, a chapter from a book, and so on, should influence which sentences go into the summary.
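A sketch of the greedy selection phase with a simple redundancy check follows; the similarity callback and the 0.7 cutoff are illustrative assumptions, not a prescription from the literature.

def greedy_select(scored, similarity, max_sentences, cutoff=0.7):
    """Take highest-scoring sentences first, skipping near-duplicates.
    `scored` holds (score, sentence) pairs; `similarity` maps two
    sentences to a value in [0, 1]."""
    summary = []
    for _, sentence in sorted(scored, reverse=True):
        if all(similarity(sentence, kept) < cutoff for kept in summary):
            summary.append(sentence)
        if len(summary) == max_sentences:
            break
    return summary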
The rest of the paper is organized as follows. Section 1 covers the background study, which includes some previous works in the field of extractive summarization. In Sect. 2 the proposed algorithm is explained. The results and analysis are presented in Sect. 3. Finally, the paper is concluded with possible future directions in Sect. 4.

1 Background Study

Interest in automatic text summarization arose as early as the 1950s. As mentioned before, extractive summarization chooses a subset of sentences from the original text, supposedly the most important ones, carrying the key information of the text. Over the years, a host of techniques have been applied to perform extractive summarization. Topic representation approaches like "topic word" and "topic signature" methods have been applied from the very beginning [12]. These approaches use a frequency threshold to classify certain words as topic words. Frequency-driven approaches measure the density of topic words; metrics like tf (term frequency) and idf (inverse document frequency) are often combined to help identify topic words as well. Another well-known approach is centroid summarization [13], which computes the salience of sentences using a given set of features.
Latent Semantic Analysis [14] is an unsupervised technique for deriving an implicit representation of text based on the observed co-occurrence of words. It works along the lines of dimensionality reduction, initially filling an n × m matrix where each row corresponds to a word from the input and each column corresponds to a sentence.
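As a rough sketch of this idea (the matrix below is an arbitrary toy example, not data from [14]):

import numpy as np

# Toy n x m term-sentence count matrix: rows are words, columns are sentences.
A = np.array([[2.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [1.0, 1.0, 0.0]])

# SVD exposes latent "topics"; rows of Vt weight each sentence per topic.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Rank sentences by their weight in the strongest latent topic.
print(np.argsort(-np.abs(Vt[0])))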
In graph-based summarization methods, the sentences are represented as vertices of a graph connected by weighted edges. In [15] a bipartite graph is created to represent the sentences and topics, and the PageRank algorithm is applied to rank the sentences. This topical graph is then used again to rank the sentences using HITS (Hyperlink Induced Topic Search) [16], another popular ranking algorithm originally used to rank websites. A coherence measure is used to find the relevance of a sentence to a certain topic and then, after optimisation is applied to the scores, to decide whether to accept or reject the sentence for the summary.

In [17] the TextRank algorithm [18] is used to assign scores to each of the sentences. A distortion measure is then used to capture the semantic difference between sentences, and a distortion graph is constructed from it. Based on this distortion graph, the sentences are ranked again, and the higher-ranked sentences are eligible to go into the summary.
In [19] a "tweet similarity graph" is built to establish similarity measures between pairs of tweets. The similarity metrics used are Levenshtein distance, cosine similarity, and semantic similarity; the URLs and hashtags used in the tweets are also compared when computing the similarity measure. The tweets are treated as vertices, and the vertices with the highest degree are selected to go into the final summary.

2 Proposed Algorithm

We have tested our method on various news articles obtained from the BBC News feed using the BeautifulSoup [20] library of Python [21]. BeautifulSoup is a web scraping tool for Python; it helps in scraping only the text portion of the XML documents. Once the data is read in, the articles are stripped of stopwords. Stopwords are the frequently occurring words of any language, like the, is, are, have, etc., and other articles, common verbs, and prepositions that do not convey any special information about the sentence. Next, proper nouns, which are generally the names of persons, animals, places, or specific organisations, or the principal object on which the article is based, are removed from the sentences after the document is part-of-speech tagged. Every pair of the remaining portions of the sentences is checked for cosine similarity according to Eq. (1) to construct the similarity matrix:
cos_similarity(x, y) = (x · y) / (|x| |y|)    (1)

where x and y are sparse vectors.
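A sketch of this preprocessing and of Eq. (1) is given below. It assumes the NLTK corpora (stopwords, punkt, and the POS tagger model) have already been downloaded, and it treats the NNP/NNPS tags as proper nouns; both are assumptions about details the paper does not spell out.

import math
from collections import Counter

import nltk  # assumes nltk.download(...) was run for 'punkt',
             # 'stopwords' and 'averaged_perceptron_tagger'

STOPWORDS = set(nltk.corpus.stopwords.words("english"))

def sentence_vector(sentence):
    """Bag-of-words vector with stopwords and proper nouns removed."""
    tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
    return Counter(word.lower() for word, tag in tagged
                   if word.lower() not in STOPWORDS
                   and tag not in ("NNP", "NNPS"))

def cos_similarity(x, y):
    """Eq. (1): cosine similarity of two sparse vectors."""
    dot = sum(x[w] * y[w] for w in x if w in y)
    norm = math.sqrt(sum(v * v for v in x.values())) \
         * math.sqrt(sum(v * v for v in y.values()))
    return dot / norm if norm else 0.0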


The similarity measure is based on the content overlap between sentences. The sentences are represented as bags-of-words, so each of them is a sparse vector, and the measure of overlap is defined as the angle between the vectors. Removing the proper nouns removes any chance of biasing the computation of the similarity matrix. Each sentence is treated as a vertex of a similarity graph SG = (V, E). Edges are established between vertices, and weights are assigned according to the similarity measure. The clustering coefficient c_u of a vertex u ∈ V is computed according to Eq. (2):
c_u = 2T(u) / (deg(u)(deg(u) − 1))    (2)

where T(u) is the number of triangles through node u and deg(u) is the degree of u.
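The graph construction and Eq. (2) can be sketched with networkx, whose unweighted nx.clustering computes exactly 2T(u)/(deg(u)(deg(u) − 1)); the 0.1 edge threshold here is an illustrative assumption, as the paper does not state its value.

import networkx as nx

def build_similarity_graph(vectors, threshold=0.1):
    """Similarity graph SG = (V, E) over sentence vectors."""
    G = nx.Graph()
    G.add_nodes_from(range(len(vectors)))
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            w = cos_similarity(vectors[i], vectors[j])
            if w >= threshold:      # drop weak edges (step 7 below)
                G.add_edge(i, j, weight=w)
    return G

# Eq. (2) for every vertex: {node: clustering coefficient}.
# coefficients = nx.clustering(G)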

Fig. 1 Flowchart of the proposed system

With the similarity matrix and the graph as input, the sentences are clustered by means of Infomap clustering. The average clustering coefficient (avg_c_u) is computed using the maximum c_u of each cluster. From each cluster, those sentences are selected whose c_u > avg_c_u. This step is repeated until the summary length is reached. The flowchart of the algorithm is shown in Fig. 1.
Algorithm: Extractive Summarization
The outline of the algorithm is as follows:
Input: Text document from a web page scraped with the help of BeautifulSoup
library.
Output: A summary of the input text.
Begin
1. Encode the text document in UTF-8 format.
2. Remove the stopwords from the encoded text.
3. The text is tokenized into individual sentences using the NLTK library.
4. for each sentence

a. The NLTK library is used to tag the Parts of Speech of each of the words in
the sentences.
b. The named entities, i.e. the Proper Nouns are identified and removed.
c. The remaining words in a sentence are treated as a text vector.
5. Cosine similarity is calculated between every pair of text vectors using Eq. (1)
and the similarity matrix S is obtained.
6. Construct a graph with sentences as nodes, where the weight of the edge between
every pair of sentences is the similarity between them.
7. Modify the graph removing edges with weight less than a predefined threshold.
8. Compute the clustering coefficient of each node of the graph using Eq. (2) and
compute the average clustering coefficient c_avg.
9. Apply the Infomap clustering algorithm to partition the graph into subgraphs.
10. Arrange subgraphs based on the maximum clustering coefficient computed.
11. For each subgraph of the graph:
        if maximum clustering coefficient < c_avg:
            remove the subgraph as redundant
        else:
            remove the node with the maximum clustering coefficient from the
            subgraph and store it in the summary file.
12. If the number of sentences in the summary < predefined size, then go to step 10.
13. Return summary.
End
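A runnable approximation of steps 8–13 is sketched below. Since Infomap ships as an external package, networkx's greedy_modularity_communities stands in for the Infomap partitioning here; swapping in a real Infomap implementation would follow the same shape.

import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

def summarize(G, sentences, size):
    """Approximate steps 8-13; modularity clustering stands in for Infomap."""
    coeff = nx.clustering(G)                 # Eq. (2) for every node
    c_avg = sum(coeff.values()) / len(coeff)
    clusters = greedy_modularity_communities(G)
    chosen = []
    while len(chosen) < size:
        progress = False
        for cluster in clusters:
            remaining = [u for u in cluster if u not in chosen]
            if not remaining:
                continue
            best = max(remaining, key=coeff.get)
            if coeff[best] >= c_avg:         # low-coefficient subgraphs are
                chosen.append(best)          # treated as redundant
                progress = True
        if not progress:                     # nothing left worth taking
            break
    return [sentences[i] for i in sorted(chosen[:size])]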

3 Results

The method has been thoroughly tested on news articles from the BBC News feed. We have used the NLTK library [22] and the BeautifulSoup library of Python 2.7 [21] for our processing.
Evaluation using Rouge:
Since the early 2000s, the ROUGE [23] metric has been widely used for automatic evaluation of summaries. Lin introduced a set of metrics called Recall-Oriented Understudy for Gisting Evaluation (ROUGE) to automatically determine the quality of a summary by comparing it to reference summaries developed by humans, which are generally considered as ground truth. There are several variations of the Rouge metric, among which the popularly used ones are explained below:

3.1 Rouge_N

The Rouge_N metric is a measure of overlapping words, typically N-grams, between a candidate summary and a set of reference summaries. It is computed using Eq. (3):

Rouge_N = [ Σ_{S ∈ refsum} Σ_{gram_n ∈ S} Count_match(gram_n) ] / [ Σ_{S ∈ refsum} Σ_{gram_n ∈ S} Count(gram_n) ]    (3)

where N stands for the length of the N-gram gram_n, and Count_match(gram_n) is the maximum number of N-grams co-occurring in the candidate summary and the set of reference summaries. Here the value of N is taken as 1 and 2.
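A small self-contained sketch of Eq. (3) for a single candidate follows; this illustrates the clipped n-gram count and is not the official ROUGE implementation.

from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(candidate, references, n):
    """Eq. (3): clipped n-gram overlap between candidate and references."""
    cand = ngrams(candidate.split(), n)
    match = total = 0
    for ref in references:
        ref_grams = ngrams(ref.split(), n)
        total += sum(ref_grams.values())
        # Clip each reference n-gram count by its count in the candidate.
        match += sum(min(count, cand[g]) for g, count in ref_grams.items())
    return match / total if total else 0.0

print(rouge_n("the cat sat on the mat", ["the cat lay on the mat"], 2))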

3.2 Rouge_L

ROUGE_L is a longest common subsequence (LCS) based metric that seeks the LCS between the reference summaries and the system-generated summary. This metric is more flexible than Rouge_N because it does not require the matched words to occupy consecutive positions, only to appear in the same order. Here, a sequence A = [a_1, a_2, ..., a_k] is considered a subsequence of another sequence B = [b_1, b_2, ..., b_n] if there exists a strictly increasing sequence of indices [i_1, i_2, ..., i_k] of B such that for all j = 1, 2, ..., k, b_{i_j} = a_j. The longest common subsequence of A and B is the longest sequence common to both; it captures sentence-structure-level similarity by identifying the longest in-sequence match. ROUGE_L is based on the idea that pairs of summaries with higher LCS scores are more similar than those with lower scores. LCS-based recall, precision, and F-measure can be calculated to estimate the similarity between a reference summary X (of length m) and a candidate summary Y (of length n) according to Eqs. (4)–(6):
R_lcs = LCS(X, Y) / m    (4)

P_lcs = LCS(X, Y) / n    (5)

F_lcs = ((1 + β²) R_lcs P_lcs) / (R_lcs + β² P_lcs)    (6)

where LCS(X, Y) is the length of the LCS of X and Y, and β = P_lcs / R_lcs.
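Eqs. (4)–(6) can be sketched as below; lcs_length is a textbook dynamic program, and the choice β = P_lcs/R_lcs follows the definition above.

def lcs_length(x, y):
    """Length of the longest common subsequence of token lists x and y."""
    table = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i, xi in enumerate(x, 1):
        for j, yj in enumerate(y, 1):
            table[i][j] = (table[i - 1][j - 1] + 1 if xi == yj
                           else max(table[i - 1][j], table[i][j - 1]))
    return table[len(x)][len(y)]

def rouge_l(reference, candidate):
    """Eqs. (4)-(6): LCS-based recall, precision and F-measure."""
    x, y = reference.split(), candidate.split()
    lcs = lcs_length(x, y)
    r_lcs, p_lcs = lcs / len(x), lcs / len(y)
    if r_lcs == 0.0:                 # no overlap at all
        return 0.0, 0.0, 0.0
    beta = p_lcs / r_lcs
    f_lcs = (1 + beta**2) * r_lcs * p_lcs / (r_lcs + beta**2 * p_lcs)
    return r_lcs, p_lcs, f_lcs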

Table 1 Rouge values of the proposed method with respect to ground truth

Length   Metric    Recall    Precision   f-score
10%      Rouge-1   0.69672   0.35417     0.55173
10%      Rouge-2   0.34711   0.17573     0.23333
10%      Rouge-L   0.55738   0.28332     0.37569
15%      Rouge-1   0.63030   0.43333     0.51358
15%      Rouge-2   0.29268   0.20084     0.23821
15%      Rouge-L   0.44848   0.30833     0.36543
20%      Rouge-1   0.59538   0.46250     0.51991
20%      Rouge-2   0.26656   0.18410     0.20706
20%      Rouge-L   0.45989   0.35833     0.40281
25%      Rouge-1   0.57198   0.61250     0.59155
25%      Rouge-2   0.26172   0.28033     0.27071
25%      Rouge-L   0.42023   0.45000     0.43461

The results of our evaluation are given in Table 1. It can be seen that summaries that are 25% of the size of the original document give better results than shorter ones. The unbiased ground truth, i.e., the reference summaries of the news articles, was obtained from our fellow peers and research scholars with expertise in different fields.

4 Conclusion and Future Work

In this paper we have presented a new graph-based approach to extractive summarization. By removing named entities from the sentences, the similarity measure for each pair of sentences becomes unbiased, and hence important words are emphasized. This process gives interesting results when evaluated on news articles. In the future we hope to refine the noun-pronoun resolution so that the results are further improved. Also, instead of the Infomap clustering algorithm, several other graph-based and general-purpose clustering algorithms may be applied, and a comparative study will be made as future work.

References

1. Nenkova, A., McKeown, K.: A Survey of Text Summarization Techniques. Springer Sci-
ence+Business Media (2012)
2. Meena, Y.K., Gopalani, D.: Evolutionary algorithms for extractive automatic text summarization. Procedia Comput. Sci. 48(Suppl. C), 244–249 (2015). (International Conference on Computer, Communication and Convergence (ICCC 2015))
3. Saggion, H., Lapalme, G.: Generating indicative-informative summaries with sumum. Comput.
Linguist. 28(4), 497–526 (2002)
4. Gong, Y., Liu, X.: Generic text summarization using relevance measure and latent semantic
analysis. In: Proceedings of the 24th Annual International ACM SIGIR Conference on Research
and Development in Information Retrieval, pp. 19–25. ACM (2001)

5. Dunning, T.: Accurate methods for the statistics of surprise and coincidence. Comput. Linguist.
19(1), 61–74 (1993)
6. Hovy, E., Lin, C.-Y.: Automated text summarization and the summarist system. In: Proceedings
of a Workshop on Held at Baltimore, Maryland: 13–15 Oct 1998, TIPSTER’98, pp. 197–214,
Stroudsburg, PA, USA, 1998. Association for Computational Linguistics
7. Lin, C.-Y., Hovy, E.: The automated acquisition of topic signatures for text summarization. In:
COLING’00 Proceedings of the 18th conference on Computational linguistics, pp. 495–501.
Association for Computational Linguistics Stroudsburg, PA, USA (2000)
8. Wei, T., Lu, Y., Chang, H., Zhou, Q., Bao, X.: A semantic approach for text clustering using
wordnet and lexical chains. Expert Syst. Appl. 42(4), 2264–2275 (2015)
9. Alpaslan, F.N., Cicekli, I.: Text summarization using latent semantic analysis. J. Inf. Sci. 37(4),
405–417 (2011)
10. Kan, M.-Y., McKeown, K.R., Klavans, J.L.: Applying natural language generation to indicative
summarization. In: Proceedings of the 8th European Workshop on Natural Language Genera-
tion, vol. 8, pp. 1–9. Association for Computational Linguistics (2001)
11. Erkan, G., Radev, D.R.: LexRank: graph-based lexical centrality as salience in text summarization. J. Artif. Intell. Res. 22, 457–479 (2004)
12. Harabagiu, S., Lacatusu, F.: Topic themes for multi-document summarization. In: Proceedings
of the 28th Annual International ACM SIGIR Conference on Research and Development in
Information Retrieval, pp. 202–209. ACM, New York, NY, USA (2005)
13. Radev, D.R., Jing, H., Stys, M., Tam, D.: Centroid-based summarization of multiple documents.
Inf. Process. Manag. 40, 919–938 (2003)
14. Landauer, T.K., Foltz, P.W., Laham, D.: An introduction to latent semantic analysis. Discourse
Process. 25(2–3), 259–284 (1998)
15. Parveen, D., Strube, M.: Integrating importance, non-redundancy and coherence in graph-based
extractive summarization. In: Proceedings of the 24th International Conference on Artificial
Intelligence, IJCAI’15, pp. 1298–1304. AAAI Press (2015)
16. Kleinberg, J.M.: Authoritative sources in a hyperlinked environment. J. ACM 46(5), 604–632
(1999)
17. Agrawal, N., Sharma, S., Sinha, P., Bagai, S.: A graph based ranking strategy for automated
text summarization. DU J. Undergrad. Res. Innov. 1(1) (2015)
18. Mihalcea, R., Tarau, P.: TextRank: bringing order into texts. In: Proceedings of EMNLP-04
and the 2004 Conference on Empirical Methods in Natural Language Processing, July 2004
19. Dutta, S., Ghatak, S., Roy, M., Ghosh, S., Das, A.K.: A graph based clustering technique for
tweet summarization. In: 2015 4th International Conference on Reliability, Infocom Technolo-
gies and Optimization (ICRITO) (Trends and Future Directions), pp. 1–6. IEEE (2015)
20. Beautifulsoup documentation. https://www.crummy.com/software/BeautifulSoup/bs4/doc/.
Accessed 29 Nov 2017
21. Python 2.7.14 documentation. https://docs.python.org/2/index.html. Accessed 29 Nov 2017
22. Bird, S., Klein, E., Loper, E.: Natural Language Processing with Python. O’Reilly (2009)
23. Lin, C.-Y.: ROUGE: a package for automatic evaluation of summaries. In: Proceedings of the ACL Workshop: Text Summarization Branches Out (2004)
