2016 International Conference on Next Generation Intelligent Systems (ICNGIS)

Malayalam Text Summarization: An Extractive Approach

Krishnaprasad P∗, Sooryanarayanan A† and Ajeesh Ramanujan‡
∗ Dept. of Computer Science, Govt. Engineering College Sreekrishnapuram. Email: krishnaprasadpguptha@gmail.com
† Dept. of Computer Science, Govt. Engineering College Sreekrishnapuram. Email: asooryanarayanan@gmail.com
‡ Asst. Professor, Govt. Engineering College Sreekrishnapuram. Email: ajeeshramanujan@gmail.com

Abstract—Automatic summarization of text is one of the areas of interest in the field of natural language processing. The proposed method uses sentence extraction on a single document and produces a generic summary for a given Malayalam document (extractive summarization). Sentences in the document are ranked based on the scores of the words they contain. The top N ranked sentences are extracted and arranged in their chronological order to generate the summary, where N is determined by the size of the summary as a percentage of the original document size (condensation rate). The standard metric ROUGE is used for performance evaluation. ROUGE calculates the n-gram overlap between a generated summary and reference summaries. Reference summaries were constructed manually. Experiments show that the results are promising.

Keywords − Text Summarization, Sentence Extraction, Content Word Extraction, Natural Language Processing, ROUGE.

I. INTRODUCTION

With the enormous growth of the Internet, a large collection of texts is available on the web, and some of them duplicate content present in others. Whenever users need some information, they have to retrieve these documents and read them completely to understand the content. This is a time-consuming and tedious process, and it is very difficult for human beings to summarize such large collections of text manually. This is where automatic summarization becomes important. Text summarization is the condensation of a source text: its size is reduced while its significant information content and overall meaning are preserved. Text summarization performed by a machine is termed automatic text summarization. Such summarization tools help people grasp the main information content in a short time, and a good summary gives a quick overview of a large document.

Text summarization techniques can be broadly classified into extractive summarization and abstractive summarization. In extractive summarization, important and meaningful sentences or phrases are extracted from the source documents and arranged chronologically to produce the summary. An extractive summary contains no word or phrase that is not present in the given document. The summary created from these sentences may not be coherent, but it gives an idea of the content of the input document. An abstractive summarization system, in contrast, understands the contents of the document and then creates a summary in its own words. Abstractive summarization aims to produce a generalized summary that conveys the information concisely, and it requires advanced natural language generation techniques [4]. In general, abstractive methods build a semantic representation of the given text and then use natural language generation techniques to create a summary that is very similar to a human-generated one. Such an abstract summary may contain words not explicitly present in the original document. The abstract of a research article is a good example of abstractive summarization.

Another classification distinguishes single-document summarization from multi-document summarization. Early work in summarization started with single-document summarization, which produces a summary of a single document supplied by the user. With the rapid progress of research and the huge amount of information on the Internet, multi-document summarization emerged: because information is spread over different documents, it has to be collected from all of them. Multi-document summarization produces a summary from many source documents on the same topic or event. It is obviously a more complex task than single-document summarization, for two major reasons. Firstly, there is a chance of information overlap between the documents, which can lead to redundancy in the summary. Secondly, extra effort is needed to collect and organize the information from several documents into a coherent summary. In the field of text summarization, significant results have been obtained using sentence extraction and statistical analysis.


Malayalam is a regional language of India, spoken predominantly in the state of Kerala. It is one of the 22 scheduled languages of India and one of its 6 Classical Languages, a designation Malayalam received in 2013. It has official language status in Kerala and in the union territories of Puducherry and Lakshadweep. Malayalam belongs to the Dravidian family of languages and is spoken by about 38 million people. It is also spoken in Tamil Nadu and Karnataka, particularly in the Kanyakumari, Coimbatore and Nilgiri districts of Tamil Nadu and the Kodagu and Dakshina Kannada districts of Karnataka [16]. Work on Malayalam text summarization would open up applications such as topic modelling and Malayalam search engines, and would help in summarizing official, legal and technical documents in Malayalam. Our extractive summarization method is sufficient for conveying the contents of such documents. To our knowledge, very few works have been carried out on Malayalam text summarization.

II. RELATED WORKS
In recent years, text summarization has become an area of interest in natural language processing, and the need for automatic text summarization has increased because of the huge number of documents available on the Internet. K. S. Jones [8] defines a summary as a reductive transformation of a given input document into a summary document, produced by selecting and generalizing what is important in the input. That paper suggests a basic three-stage process model: (i) interpretation of the source text into a source representation, (ii) transformation of the source representation into a summary representation, and (iii) generation of the summary text from the summary representation. D. Shen et al. [14] differentiate two approaches to text summarization, abstract-based and extract-based summarization. H. P. Luhn [11] presented the idea that frequently occurring words (except stop words) signify the overall content of a document.

Dhanya P. M. et al. [5] performed a comparative study of text summarization work available for Indian languages. It compares two summarization methods each for Tamil [13, 10] and Kannada [6, 7], and one each for Odia [2], Bengali [9], Punjabi [15] and Gujarati [1]. A three-sentence sample text was taken as an example, and the summary sentences were identified using all eight methods. From this comparison, they found that these methods use a set of sentence features for ranking the sentences: the Punjabi method uses ten such features (the maximum), while the Odia method uses a single feature (the minimum). The accuracy of summarization depends on the number of features used and their contribution to the summary. These summarization techniques give recall scores of 0.45, 0.48, 0.43, 0.66, 0.42, 0.412, 0.42 and 0.82 respectively. In almost all summarization techniques, testing is carried out by comparing the system-generated results with human-generated summaries.

Renjith S. R. et al. [12] propose a sentence extraction based single-document text summarization method for Malayalam. They calculate sentence scores based on sentence features and then compute ranks using Google's PageRank formula. The top k ranked sentences are picked from the original text for inclusion in the summary, where k depends on the compression ratio between the original and the summary.
III. PROPOSED SYSTEM

The summarization system uses simple heuristics-based extractive summarization. We assume that a summary can be generated by recombining important sentences extracted from the text. To identify the important sentences, we follow a content word extraction method: by a simple heuristic, the content words can be extracted from the frequency distribution of the words in the text, excluding stop words, and the words with higher frequency represent the document without significant information loss. The summarization system comprises two components, a text analysing component and a summary generation component. The text analysing component identifies the features associated with the sentences and, based on these features, assigns a score to each sentence in the text. The summary generation component then uses the sentence scores calculated by the text analysing component to generate the summary.
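The content-word step can be sketched as follows. This is a minimal illustration only, assuming plain-text input, naive whitespace tokenization and a caller-supplied Malayalam stop-word list; the paper does not publish the exact stop-word list or tokenizer it uses.

```python
from collections import Counter

def content_word_frequencies(text, stop_words):
    """Frequency distribution of the words in `text`, excluding stop words.

    `stop_words` is a caller-supplied set of Malayalam stop words; the
    exact list used by the paper is not published, so it is a placeholder.
    """
    tokens = (w.strip('.,?!:;"\'()') for w in text.split())
    return Counter(w for w in tokens if w and w not in stop_words)

# Example usage (the stop-word set is only illustrative):
# freqs = content_word_frequencies(document_text, malayalam_stop_words)
# content_words = [w for w, _ in freqs.most_common(20)]
```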
Firstly, a text normalization process is performed, which splits the text into sentences and the sentences into words. The normalized text is then used for feature extraction, which extracts the features associated with the sentences and with the words. For this purpose, the main features used are the frequency of each word, the number of characters in each word, and so on. After feature extraction, the system calculates a score for each sentence in the text, and this score is used for summary generation: sentences with high scores are selected and refined, and the refined sentences are used to generate the summary, keeping the same chronological order in which they appear in the original input document. Fig. 1 shows the various tasks involved in the summarization system.

Fig. 1. Architecture of the Summarization System.

The text analysing component involves the tasks of sentence marking, feature extraction and sentence ranking. The summary generation component comprises mainly two tasks, sentence selection and summary generation. These tasks are detailed below.
The sentence marking task divides the document into sentences. End-of-sentence punctuation marks, such as periods, question marks and exclamation points, are sufficient for marking sentence boundaries. It should be noted that the exclamation point and the question mark are less ambiguous than the period: periods also appear in non-standard words such as abbreviations, which creates problems during sentence marking. Sentence marking is important because the extracted sentences are used directly for generating the summary.
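A minimal sketch of such a punctuation-based splitter is shown below. The regular expression and whitespace handling are assumptions, since the paper does not give its exact splitting rules, and the sketch deliberately keeps the abbreviation weakness mentioned above.

```python
import re

# Split after end-of-sentence punctuation (period, question mark,
# exclamation point) followed by whitespace. This naive rule also breaks
# on periods inside abbreviations, the failure mode noted in the text.
_SENTENCE_BOUNDARY = re.compile(r'(?<=[.?!])\s+')

def mark_sentences(text):
    """Return the list of sentences found in `text`."""
    return [s for s in _SENTENCE_BOUNDARY.split(text.strip()) if s]
```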
The feature extraction task extracts both sentence-level and word-level features. To pick out the best sentences from a given document, we follow a sentence scoring technique based on word frequency and the average number of characters per word. The heuristic is that highly frequent words, excluding stop words, convey the general idea of the document. We also include the average number of characters per word as a feature, based on the world knowledge that lengthy words tend to be more important.

The sentence ranking task applies a simple and efficient ranking method based on the frequency of the words in the text and the average number of characters per word. The system ranks every sentence in the given text using this method. This sentence ranking method is the heart of the summarization system, and the quality of the generated summary depends greatly on it.
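A compact sketch of this scoring and ranking step is given below, reusing naive whitespace tokenization. The relative weighting of the two features (an unweighted sum here) is an assumption, since the paper does not state explicit weights.

```python
from collections import Counter

def rank_sentences(sentences, stop_words):
    """Rank sentences by summed content-word frequency plus average word
    length, highest score first.

    The unweighted combination of the two features is an assumption; the
    paper does not specify explicit weights.
    """
    words = [w for s in sentences for w in s.split()]
    freqs = Counter(w for w in words if w not in stop_words)

    ranked = []
    for idx, sentence in enumerate(sentences):
        tokens = sentence.split()
        if not tokens:
            continue
        freq_score = sum(freqs.get(w, 0) for w in tokens)
        avg_word_len = sum(len(w) for w in tokens) / len(tokens)
        ranked.append((freq_score + avg_word_len, idx, sentence))

    # Returns a list of (score, original position, sentence) tuples.
    return sorted(ranked, reverse=True)
```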
After the sentences are scored, the next task is sentence selection, which must select the sentences that make a good summary. The sentences with high scores are selected and given a sentence-level refinement. After sentence selection comes the summary generation task. One possible method is simply to pick the N top-scored sentences for the summary, but this creates a coherence problem. Instead, once the sentences are selected, they are recombined in the chronological order in which they appear in the original input text, which yields a more readable summary. The number of sentences selected is determined by the required size of the output summary as a percentage of the input document. This stage offers a number of research opportunities in text summarization, and the quality of the generated summary depends on the sentence ranking and sentence selection procedures.
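The selection and reassembly step can be sketched as follows, reusing the output of the ranking sketch above. The rounding rule for N is an assumption, since the paper only defines N as a percentage of the original document size.

```python
def generate_summary(sentences, ranked, condensation_rate=0.25):
    """Select the top-ranked sentences up to `condensation_rate` of the
    document and emit them in their original (chronological) order.

    `ranked` is the (score, position, sentence) list produced by
    rank_sentences(); the rounding of N is an assumption.
    """
    n = max(1, round(len(sentences) * condensation_rate))
    positions = sorted(pos for _, pos, _ in ranked[:n])  # restore document order
    return ' '.join(sentences[pos] for pos in positions)
```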
IV. EVALUATION AND RESULTS

Deciding whether a given summary is good or bad is a very difficult task. Summary evaluation can be performed in two ways: evaluation by humans and automatic evaluation. In human evaluation, human judges compare the system-generated summaries with model summaries and assign each summary a score on a predefined scale, using qualitative features such as fluency, information content and coherence. The problems associated with human evaluation are that it is tedious and that it suffers from a lack of consistency: two human judges may not agree with each other's judgement. Automatic evaluation, on the other hand, is always consistent in its judgement. Its drawback is that it does not take into account the linguistic skills and emotional perspective of a human. Even though automatic evaluation is not as good as human evaluation, it is generally preferred, because the evaluation process remains fast even for a huge number of summaries. A further advantage is that it applies the same logic and always gives the same result for a given summary, and because it is free from human bias it compares different summarization systems consistently.

To our knowledge, there is no Malayalam summarizer system with published performance figures, so we decided to determine the condensation percentage that gives the best result. For that, we first created a corpus of 50 news articles published in Mathrubhumi, a Malayalam daily, during November 2015. From these 50 articles we randomly selected 10 and created 2 reference summaries for each.

The randomly selected 10 news articles were supplied to our system for summary generation. For each news article, our system generates 4 summaries, at condensation rates of 10, 15, 20 and 25 percent respectively. These system-generated summaries are evaluated against the reference summaries using the standard metric ROUGE [3]. The ROUGE-N measure is based on the number of overlapping units, such as n-grams, present in both the system-generated summary and the reference summaries. The ROUGE-N score is a recall-related measure and is calculated as follows:

\[
\mathrm{ROUGE\text{-}N} = \frac{\sum_{S \in \{\mathrm{ReferenceSummaries}\}} \; \sum_{\mathit{gram}_n \in S} \mathrm{Count}_{\mathrm{match}}(\mathit{gram}_n)}{\sum_{S \in \{\mathrm{ReferenceSummaries}\}} \; \sum_{\mathit{gram}_n \in S} \mathrm{Count}(\mathit{gram}_n)}
\]

where n is the n-gram length, gram_n ranges over the n-grams of a reference summary S, Count(gram_n) is the number of n-grams in the reference summaries, and Count_match(gram_n) is the maximum number of n-grams co-occurring in the candidate summary and the set of reference summaries.
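A compact sketch of this ROUGE-N recall computation is shown below, for illustration only; the whitespace tokenization is an assumption, and published scores should be computed with the standard ROUGE package [3].

```python
from collections import Counter

def _ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def rouge_n(candidate, references, n=1):
    """ROUGE-N recall: clipped n-gram matches over the total number of
    n-grams in the reference summaries (naive whitespace tokenization)."""
    cand_counts = Counter(_ngrams(candidate.split(), n))
    matched, total = 0, 0
    for ref in references:
        ref_counts = Counter(_ngrams(ref.split(), n))
        total += sum(ref_counts.values())
        matched += sum(min(c, cand_counts[g]) for g, c in ref_counts.items())
    return matched / total if total else 0.0

# rouge_1 = rouge_n(system_summary, [ref_1, ref_2], n=1)
# rouge_2 = rouge_n(system_summary, [ref_1, ref_2], n=2)
```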
ROUGE-1 and ROUGE-2 scores were calculated for these 10 news articles at each condensation percentage and averaged; the obtained results are tabulated in Table I.

TABLE I
ROUGE EVALUATION SCORES

System (condensation rate)    ROUGE-1     ROUGE-2
10 percent                    0.256935    0.225331
15 percent                    0.371939    0.339387
20 percent                    0.443436    0.405756
25 percent                    0.570577    0.538460

From these results, it is clear that as the condensation rate of the summary increases, the ROUGE scores also increase. For the 10 percent condensation summary, the method yields poor results. This is because most of the news articles used contain only 10 to 20 sentences, so a 10 percent condensed summary contains only the one or two most important sentences, which cannot represent the content of the whole document.
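The averaging protocol behind Table I can be sketched as follows. The container and function names are hypothetical, and the summarizer and ROUGE implementations are assumed to be sketches such as those given earlier, so this illustrates the evaluation procedure rather than the authors' actual code.

```python
def evaluate_system(articles, references, summarize, rouge_n,
                    rates=(0.10, 0.15, 0.20, 0.25)):
    """Average ROUGE-1 and ROUGE-2 over all articles for each condensation
    rate, mirroring the protocol behind Table I.

    `articles` is a list of raw article texts, `references[i]` the list of
    manual summaries for article i, `summarize(text, rate)` a summarizer,
    and `rouge_n(candidate, refs, n)` a ROUGE implementation. All of these
    names are assumptions for illustration.
    """
    table = {}
    for rate in rates:
        r1 = r2 = 0.0
        for text, refs in zip(articles, references):
            summary = summarize(text, rate)
            r1 += rouge_n(summary, refs, n=1)
            r2 += rouge_n(summary, refs, n=2)
        table[rate] = (r1 / len(articles), r2 / len(articles))
    return table
```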

V. CONCLUSION AND FUTURE WORKS

This paper presents a novel method for single-document text summarization for Malayalam. Several summarization techniques are available for English, but very few attempts have been made at Malayalam text summarization. The quality of the summary can be improved by improving the sentence scoring stage. The performance of the proposed system may be further improved by adding a stemming step, improving the sentence splitting criteria and using more features. ROUGE evaluation can use any number of reference summaries, and increasing their number can improve the ROUGE score; due to practical constraints we created only two manual summaries for each news article, so there is scope for adding more manual reference summaries.

REFERENCES

[1] Alkesh Patel, Tanveer Siddiqui and U. S. Tiwary, "A language independent approach to multilingual text summarization," RIAO 2007, Pittsburgh, PA, USA, May 30-June 1, 2007.
[2] R. C. Balabantaray, B. Sahoo, D. K. Sahoo and M. Swain, "Odia text summarization using stemmer," International Journal of Applied Information Systems (IJAIS), ISSN: 2249-0868, vol. 1, no. 3, February 2012.
[3] Chin-Yew Lin, "ROUGE: A package for automatic evaluation of summaries," in Proceedings of the Workshop on Text Summarization Branches Out, Post-Conference Workshop of ACL, Barcelona, Spain, 2004.
[4] D. Das and A. F. Martins, "A survey on automatic text summarization," Literature Survey for the Language and Statistics II course at CMU, vol. 4, 2007, pp. 192-195.
[5] Dhanya P. M. and Jathavedan M., "Comparative study of text summarization in Indian languages," IJCA (0975-8887), vol. 75, no. 6, August 2013.
[6] Jagadish S. Kallimani and Srinivasa K. G., "Information retrieval by text summarization for an Indian regional language," IEEE, 2010.
[7] Jayashree R., Srikanta Murthy K. and Sunny K., "Document summarization in Kannada using keyword extraction," CS & IT-CSCP, 2011, pp. 121-127.
[8] K. S. Jones, "Automatic summarizing: factors and directions," in I. Mani and M. Maybury (eds.), Advances in Automatic Text Summarization, The MIT Press, 1999, pp. 1-12.
[9] Kamal Sarkar, "Bengali text summarization by sentence extraction," Proceedings of the International Conference on Business and Information Management (ICBIM-2012), NIT Durgapur, pp. 233-245.
[10] Krish Perumal and Bidyut Baran Chaudhuri, "Language independent sentence extraction based text summarization," in Proceedings of ICON 2011, 9th International Conference on Natural Language Processing.
[11] H. P. Luhn, "The automatic creation of literature abstracts," IBM Journal of Research and Development, 1958, pp. 159-165.
[12] Renjith S. R. and Sony P., "An automatic text summarization for Malayalam using sentence extraction," Proceedings of the 27th IRF International Conference, Chennai, India, ISBN: 978-93-85465-35-2, 14 June 2015.
[13] Sankar K., Vijay Sundar Ram R. and Sobha Lalitha Devi, "Text extraction for an agglutinative language," Problems of Parsing in Indian Languages, Special Volume, May 2011.
[14] D. Shen, J. T. Sun, H. Li, Q. Yang and Z. Chen, "Document summarization using conditional random fields," in IJCAI, 2007, pp. 2862-2867.
[15] Vishal Gupta and Gurpreet Singh Lehal, "Features selection and weight learning for Punjabi text summarization," International Journal of Engineering Trends and Technology, vol. 2, issue 2, 2011.
[16] "Malayalam - Wikipedia, the free encyclopedia." [Online]. Available: https://en.wikipedia.org/wiki/Malayalam. [Accessed: 18-Oct-2015].
