
See discussions, stats, and author profiles for this publication at: https://www.researchgate.net/publication/261244026

Named entity recognition and normalization in tweets towards text summarization

Conference Paper · September 2013


DOI: 10.1109/ICDIM.2013.6694007





Named Entity Recognition and Normalization in
Tweets Towards Text Summarization
Saima Jabeen, Politecnico di Torino, Torino, Italy. Email: saima.jabeen@polito.it
Sajid Shah, Politecnico di Torino, Torino, Italy. Email: sajid.shah@polito.it
Asma Latif, University of Peshawar, Peshawar, Pakistan. Email: cocokoil@gmail.com

Abstract—This paper presents an experimental study of Named Entity Recognition and Disambiguation (NERD) in the social media context. Recent approaches to NERD in short contexts, especially tweets, are discussed. Knowledge of named entities is used in our multi-document summarizer, which is still under development. We assess that the Aida service, which is based on the Yago knowledge base, is the better choice for NERD so far, for a number of reasons. Some preliminary results of our summarizer are also presented, which show the effectiveness of using named entities in text summarization.

I. INTRODUCTION

Social networks such as Twitter have seen enormous growth in use; Twitter has reached this point in only seven years. Millions of tweets are sent over the Internet all the time, in which users express their thoughts on anything from their daily life to an event or a news story. Trending topics and users' comments or tweets may contain valuable information such as Named Entities (NEs) that can significantly affect many computational applications; for example, they can help identify the important sections of a news story. There is a need to mine social media text to extract the knowledge it contains.

Semantic technologies are currently spreading across several application domains as a reliable and consistent means to address challenges related to the organization, manipulation, visualization and exchange of data and knowledge. These techniques play different roles depending on the application domain, the timing constraints, the distributed nature of applications, and so on. Incorporating semantic knowledge into computational approaches has proven effective in improving application performance. Information such as the named entities found in users' comments can be used in many computer-based applications. Document summarization, for example, can benefit from it: the named entities of tweets can be used, through their frequency, weight, etc. in the tweets, to rank the sentences of the associated document collection when generating its summary.

Before the NERD task, a preliminary normalization step is needed that brings the noisy and informal text of Twitter into a formal and correct English form by resolving the underlying noise, slang, misspelled words, etc.

This paper is organized as follows. Some related work is presented in Section 2. Section 3 discusses the preliminary step of NERD, i.e., normalization. Section 4 presents some recently proposed approaches for NERD in tweets. Our ongoing work on text summarization based on named entities is briefly discussed in Section 5. Section 6 presents some preliminary results obtained by the NER-based summarizer, whereas Section 7 draws conclusions and describes future development of this work.

II. RELATED WORK

Named entity recognition in text and normalization of short messages are mature research topics, but they are mostly performed separately. Several normalization methods have been proposed over the past decade that focus on short contexts such as the short messaging service [1], [2] and tweets [3], [4], [5]. Along with normalization, various approaches to NER have been adopted for short contexts as well, e.g., named entity recognition in queries [6], in Short Messaging Service (SMS) text [7], [8], in the social media context, i.e., tweets [9], and in news comments on the Web [10].

Substantial work on the interplay of NER and normalization for Twitter data does not exist in the literature. With a similar objective but a different methodology, [11] worked on joint inference of NER and normalization for tweets, using a novel graphical model to simultaneously conduct NER and Named Entity Normalization (NEN) on multiple tweets.

Some attempts have been made to jointly use named entity recognition and normalization in text summarization, but fewer approaches exist that relate news to the social media context. [12] describes social context summarization for Web documents using a dual wing factor graph model, which exploits the mutual reinforcement between Web documents and their associated social context. Both the significance of sentences and the interests of social users are considered when generating summaries of standard documents.

III. NORMALIZATION

Most text mining methods require some kind of normalization before diving into the actual analysis. Normalization includes things like removing punctuation, converting words to lowercase, stripping numbers out, and so on. This is essential for any kind of frequency-based analysis so that words such as dont, Dont, and don't are not considered unique
words. After all, when dealing with human-generated text, typos and differences in presentation are bound to occur. Oftentimes, normalization also includes stemming words; for instance, think, thinking, and thinks are all stemmed to think, as they all represent (basically) the same concept. For normal English text, this is fairly easy and comes pre-packaged in many natural language processing software packages (including the tm package for R), while in short-text scenarios such as SMS messages or tweets, the headache of normalization becomes more severe because the text is irregular and informal.

A. Tweet Normalization

Informal slang and short, user-specific abbreviations, e.g., b4 for before and tomorw for tomorrow, are commonly observed in Twitter data because of the 140-character limit on the text of a tweet. Normalization is performed as a pre-processing step in many applications, and many approaches have been adopted to eliminate noise and irregularity in tweets that rely on keyword matching or word frequency statistics. Many techniques for normalizing microblogs have also been introduced that use complex models and attempt to differentiate between correctly spelled unknown words and lexical variants of known words. A recent one is [5], which developed a normalization dictionary based on the authors' previous work [4]. It consists of lexical variants of known words and facilitates lexical normalization: context information is used to generate possible (variant, normalization) pairs, the pairs are ranked using string similarity, and the dictionary is populated with the highly ranked pairs. With the aim of converting any tweet into a proper English sentence, a machine translation approach based on syntactic normalization of Twitter messages is used in [3], where Moses was used for machine translation. A similar approach is used by [13], where machine translation is used for SMS normalization. In a heuristic-based approach, [4] worked on lexical normalization of short text messages, identifying ill-formed words using a classifier and generating the corrected version of a misspelled word using orthographic similarity.

IV. NAMED ENTITY RECOGNITION AND DISAMBIGUATION IN TWEETS

Named entity recognition has been a very active field of research for many years. It has been regarded as central to many applications involving concepts such as understanding, semantic search, etc. Named entities are usually typed using taxonomies that are more or less large and highly dependent on the scope or needs considered. NEs include proper nouns covering names of people, places or organizations, but also entities expressed through other nominal expressions, such as multi-word units, classified into types which may be coarse or fine-grained according to domain or user requirements. Despite years of research, NER still involves several challenges, such as correct classification, resolution of ambiguity, synonym detection, coreference and variability (e.g., acronyms, orthography). Several methods have been used to improve the prediction of correct classes, ranging from rule-based and dictionary-based approaches to semi-supervised and unsupervised machine learning techniques.

Despite the efforts made to recognize and disambiguate named entities in formal text, few approaches are devoted to short text, especially social media such as Twitter. Nowadays, NER for tweets is rapidly attracting the attention of researchers.

Publicly available ontological knowledge bases are becoming easy and efficient sources of semantic knowledge. NERD that uses a domain-independent knowledge base is more generic and efficient. A knowledge base of high quality and quantity is essential, and the Yago ontology is a good choice that meets these requirements, making the task more reliable and efficient. Many ontology-based efforts have been made for named entity recognition and disambiguation in normal text, e.g., document collections, news categories and biological domains. NERD approaches developed for normal text perform worse on Twitter data because of its noise, irregular syntax, misspelled words and short context. A few approaches devoted to NERD in short messages/contexts and in Twitter data, i.e., tweets, have been presented in recent years.

In a very recent study [14], Yago has been used for NERD on short messages, i.e., tweets, but there is room for improvement in that work: normalization of tweets was not performed. The study presents an unsupervised Semantic Web-driven approach that improves named entity extraction by using clues from the disambiguation process [14]. A simple knowledge-base matching technique combined with a clustering-based approach is adopted for disambiguation. This method does not deal with noisy messages; rather, it focuses only on informative messages containing information about one or more named entities.

The publicly available AIDA service is a well-established technique for NERD that consults the Yago knowledge base. The Yago ontology is of high quality and quantity, with almost 95% accuracy, and relies on Wikipedia and WordNet. For the disambiguation step, this approach considers a popularity prior in terms of the number of Wikipedia links, context and coherence factors. Besides being competent on regular text, it has been trained on Twitter data as well in its recent version. We make use of this service in our other ongoing work on tweet-based summarization, briefly discussed in Section 5.

[15] presents a distantly supervised approach based on LabeledLDA to classify named entities, where a CRF model is used to segment them. It rebuilds the NLP pipeline for tweets, beginning with POS tagging, through chunking, to NER.

An approach to recognizing named entities in tweets is presented in [16], where a classifier based on the k-nearest neighbors algorithm is combined with a CRF-based model to leverage cross-tweet information, and semi-supervised learning is adopted to leverage unlabeled tweets. Named entity recognition and disambiguation is not a simple task, especially in Twitter data, since tweets belong to a different genre of
text. Usually, normalization is embedded into the NERD system. A recent work describing the joint inference of named entity recognition (NER) and normalization (NEN) in tweets is presented in [11]. The authors proposed their own graphical model to jointly conduct NER and NEN for multiple tweets in order to deal with error propagation and the dearth of information in a single tweet; an NE normalization variable is introduced to indicate whether a word pair belongs to mentions of the same entity. The authors aimed to use advanced normalization techniques to resolve slang expressions and informal abbreviations and to incorporate Wikipedia knowledge into their proposed framework.

V. IMPLICATION OF NORMALIZATION AND NERD IN OUR TWITTER-BASED SUMMARIZATION SYSTEM UNDER DEVELOPMENT

Work is in progress on a multi-document summarizer that uses knowledge of the named entities recognized in tweets. The proposal for this work has been accepted as an abstract in [17]. The summarizer is in the experimental evaluation phase, is producing very encouraging preliminary results, and will be finalized and submitted soon. In this work, normalized tweets are provided to the publicly available Aida service [18] to obtain disambiguated named entities. Tweets were normalized using a recently proposed normalization dictionary [5], which offers a simple table-lookup strategy for resolving the abbreviated and short informal words and slang found in tweets. Being a fast, lightweight and easy-to-use solution, this dictionary reported state-of-the-art performance in both F-score and word error rate on a standard dataset. The Aida service works on the normalized tweets and returns recognized and disambiguated entities. The most frequent entities are used to generate a topic sentence; a similarity measure with the topic sentence is then used to rank the document sentences. Finally, highly ranked but minimally redundant sentences are selected as the document summary.

On the other hand, we may extend the vocabulary of the normalization dictionary, which lacks named entities, by simply adding the named entities recognized for the mentions found in tweets or documents. Recognized mention-entity pairs can be added to the normalization dictionary if not already present.

Regarding the aforementioned work [11], it is interesting that the two established approaches used in our summarization application already exhibit the qualities its authors aimed at for the future: NERD using the Aida service is based on the Yago ontology, which relies on Wikipedia and WordNet and thus makes it more reliable for NERD. We then make it more competent on tweets as well merely by adding a normalization step. Expansion of the normalization dictionary with the identified named entities is also achieved.

VI. EXPERIMENTAL RESULTS

The summarizer that uses the named entities identified by the Aida service is not a traditional summarizer. It is domain-independent and generates summaries that have both the expressive power of traditional summaries and the domain specificity of documents as well, due to the adopted approach of using named entities.

Summary comparison

We conducted a qualitative evaluation of the soundness and readability of the summaries generated by our system and by the other approaches. Table I reports the 100-word representative summary produced by our system and a traditional summary produced by a widely used open source summarizer, the Open Text Summarizer (OTS). Specifically, the reported dataset covers three different categories, i.e., Irene hurricane, Steve Jobs Apple and Strauss-Kahn. A collection of almost one thousand tweets was gathered for each category, using the category labels above as keywords.

Performance comparison

Some preliminary results are shown in Tables II-V in terms of the ROUGE-2 and ROUGE-3 measures. The bold-faced numbers represent a significant difference from the other scores. Similar results are obtained for the other ROUGE measures too. The results obtained by the summarizer based on named entities are highly encouraging compared with the traditional summarizers OTS and TexLexAn.

VII. CONCLUSION

Tweets are a different and complex genre of text, and information retrieval on them requires different schemes than formal text. This paper walks through the efforts made on named entity recognition in tweets. Normalization is discussed as an important preprocessing step of NERD in tweets, where both approaches can benefit from each other. Our ongoing work on tweet-based text summarization exploiting named entity knowledge led us to use a publicly available normalization dictionary for better performance. Providing normalized tweets to the Aida service produced very reliable disambiguated named entities for tweets. Also, the obtained named entities paired with their original mentions are valuable new entries for the normalization dictionary, which lacks them in the first place. In this way, the capabilities of both the publicly available normalization dictionary and the Aida NERD service can be enhanced for tweets. Work is underway on integrating the knowledge of disambiguated named entities of tweets with the corresponding category of a news story, in order to spot the important sections of the news documents for the summarization task. The entity rank returned by Aida and the entity frequency in the tweets play an important role in ranking the sentences of the associated news collection, and the preliminary results shown in the paper show the worth of an efficient NER system for the summarization task. The results suggest that implicit information in social networks discovered by researchers by means of analysis tools (e.g., data mining tools) may exploit this framework in various applications such as summarization, machine learning and information retrieval.
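
The summarization steps described in Section V (table-lookup normalization, entity frequency, topic sentence, similarity ranking and redundancy filtering) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The paper does not name the similarity measure or the redundancy criterion, so cosine similarity over word counts and a fixed redundancy threshold are assumptions made here. The Aida service is not called: the entities recognized in each tweet are assumed to be supplied as precomputed lists. All function and parameter names (normalize, summarize, top_k, max_sentences, redundancy) are hypothetical.

```python
import math
import re
from collections import Counter


def normalize(tweet, lookup):
    """Table-lookup normalization: replace each known variant by its standard form."""
    return " ".join(lookup.get(tok, tok) for tok in tweet.lower().split())


def bag_of_words(text):
    """Lowercased word counts, ignoring punctuation."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors (assumed measure)."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def summarize(sentences, tweet_entities, top_k=3, max_sentences=2, redundancy=0.5):
    """Rank document sentences against a topic sentence built from frequent tweet entities."""
    # 1. Count how often each (already disambiguated) entity occurs across the tweets.
    freq = Counter(e for ents in tweet_entities for e in ents)
    # 2. Build a pseudo topic sentence from the most frequent entities.
    topic = bag_of_words(" ".join(e for e, _ in freq.most_common(top_k)))
    # 3. Rank the document sentences by similarity to the topic sentence.
    ranked = sorted(sentences, key=lambda s: cosine(bag_of_words(s), topic), reverse=True)
    # 4. Greedily keep highly ranked sentences that are not too similar to those already kept.
    summary = []
    for s in ranked:
        if all(cosine(bag_of_words(s), bag_of_words(t)) < redundancy for t in summary):
            summary.append(s)
        if len(summary) == max_sentences:
            break
    return summary
```

With the abbreviations mentioned in Section III-A, normalize("landfall b4 tomorw", {"b4": "before", "tomorw": "tomorrow"}) returns "landfall before tomorrow". The greedy selection in step 4 is one simple way to obtain "highly ranked but minimally redundant" sentences; other redundancy criteria would fit the description equally well.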
TABLE I
SUMMARY COMPARISON

Summarizer based on Tweets-NER: It's one of several towns in states such as New Jersey, Connecticut, New York, Vermont and Massachusetts dealing with the damage of torrential rain and flooding spawned by Hurricane Irene (Cleveland.com). The storm may have spared New York City, but it caused the worst flooding in decades in inland areas of New York State, New Jersey and Vermont. Barack Obama, US president, has called Irene a "historic hurricane" and declared a state of emergency in New York, ordering federal aid to supplement state and local response efforts starting on Friday. The governors of New York, New Jersey and Connecticut sought...
OTS: With estimated damage from Hurricane Irene topping $7 billion, the White House and some in Congress are at odds over where to find money to replenish the disaster relief fund of the Federal Emergency Management Agency, which has dipped below the $1 billion level considered advisable. Then on Monday, House Majority Leader Eric Cantor told a Fox News audience that any new federal disaster monies would require offsetting cuts in other spending, igniting a round of budgetary who-goes-first. "Recovery from hurricane damage on the East Coast must not come at the expense of Missouri's rebuilding efforts. It's one of several...

Summarizer based on Tweets-NER: Apple is on course, and Cook has a record of being able to run the company. "I want you to be confident that Apple is not going to change," Cook said in a letter to Apple employees that was first reported by Ars Technica. Cook has thrice proven his mettle as Apple CEO. "After Jobs' resignation, Apple's board pledged support for Cook. "Cook also faces mounting competition, in part because of Apple's foray into new markets. Years after hiring Cook, Devine ran into Jobs in an elevator at Apple's headquarters. Cook and Jobs were introduced in 1998 in Palo Alto,...
OTS: Somehow it has escaped almost everyone's notice for decades that the father of the man who transformed Apple Inc. Apple's legacy now falls to newly appointed CEO Tim Cook, along with Apple's executive team that includes Jonathan Ive (industrial design), Scott Forstall (iOS software), Phil Schiller (marketing), Bob Mansfield (Mac hardware engineering) and others. It's not as if Apple is suddenly going to stuff Blu-ray drives into its Macs and add Adobe Flash to its iOS devices – two technologies Jobs disapproved of. But neither is it a given that Apple will maintain its fast-paced growth that began under Jobs.

Summarizer based on Tweets-NER: The high-profile case of the former head of the International Monetary Fund, Dominique Strauss-Kahn, a prominent member of the French society, has brought much needed attention to a tradition that is illegal in France. As his fellow Socialists in France argued sharply over his character, Dominique Strauss-Kahn returned this week to the International Monetary Fund for an emotional farewell that several people in attendance said was well received. I'M NOT SURE if Niasfatou Diallo lied about being raped by Dominique Strauss-Kahn. On May 16, two days after Nafissatou Diallo accused the managing director of the International Monetary Fund of sexual...
OTS: Even after the dismissal of all charges against the former International Monetary Fund chief, who was arrested May 14 and charged with sexually assaulting a hotel maid, Vance's office will admit to no errors of judgment or strategy in its handling of the case. But the world was transfixed for more than three months by every twist and turn in the case, and Vance's post has long been one of the most closely scrutinized in the country, primarily because of the many high-profile cases it brings. This decision at the outset of the case may have been the most fateful...

TABLE II
IRENE-HURRICANE: PERFORMANCE COMPARISON IN TERMS OF ROUGE-2

Summarizer        Recall    Precision  F-measure
NER-TweetsSumm    0.01425   0.14138    0.02578
OTS               0.01289   0.12523    0.02328
TexLexAn          0.01237   0.12272    0.02238

TABLE III
STRAUSS-KAHN: PERFORMANCE COMPARISON IN TERMS OF ROUGE-2

Summarizer        Recall    Precision  F-measure
NER-TweetsSumm    0.02343   0.21175    0.04183
OTS               0.01936   0.17534    0.03457
TexLexAn          0.01718   0.15776    0.03072

TABLE IV
IRENE-HURRICANE: PERFORMANCE COMPARISON IN TERMS OF ROUGE-3

Summarizer        Recall    Precision  F-measure
NER-TweetsSumm    0.00325   0.03253    0.00589
OTS               0.00278   0.0287     0.00505
TexLexAn          0.00271   0.02721    0.00491

TABLE V
STRAUSS-KAHN: PERFORMANCE COMPARISON IN TERMS OF ROUGE-3

Summarizer        Recall    Precision  F-measure
NER-TweetsSumm    0.00844   0.07578    0.01507
OTS               0.00581   0.05311    0.01038
TexLexAn          0.00652   0.05963    0.01165

REFERENCES

[1] A. Aw, M. Zhang, J. Xiao, and J. Su, “A phrase-based statistical model
for sms text normalization,” in Proceedings of the COLING/ACL on
Main conference poster sessions, ser. COLING-ACL ’06. Stroudsburg,
PA, USA: Association for Computational Linguistics, 2006, pp. 33–40.
[Online]. Available: http://dl.acm.org/citation.cfm?id=1273073.1273078
[2] D. Pennell and Y. Liu, “A character-level machine translation
approach for normalization of sms abbreviations,” in Proceedings of
5th International Joint Conference on Natural Language Processing.
Chiang Mai, Thailand: Asian Federation of Natural Language
Processing, November 2011, pp. 974–982. [Online]. Available:
http://www.aclweb.org/anthology/I11-1109
[3] M. Kaufmann, “Syntactic normalization of twitter messages,” 2010.
[4] B. Han and T. Baldwin, “Lexical normalisation of short text messages:
Makn sens a twitter,” pp. 368–378, 2011.
[5] B. Han, P. Cook, and T. Baldwin, “Automatically constructing a
normalisation dictionary for microblogs,” in Proceedings of the 2012
Joint Conference on Empirical Methods in Natural Language Processing
and Computational Natural Language Learning. Jeju Island, Korea:
Association for Computational Linguistics, July 2012, pp. 421–432.
[Online]. Available: http://www.aclweb.org/anthology/D12-1039
[6] J. Guo, G. Xu, X. Cheng, and H. Li, “Named entity recognition
in query,” in Proceedings of the 32nd international ACM SIGIR
conference on Research and development in information retrieval, ser.
SIGIR ’09. New York, NY, USA: ACM, 2009, pp. 267–274. [Online].
Available: http://doi.acm.org/10.1145/1571941.1571989
[7] T. Ek, C. Kirkegaard, H. Jonsson, and P. Nugues, “Named entity
recognition for short text messages,” Procedia - Social and Behavioral
Sciences, vol. 27, no. 0, pp. 178–187, 2011, Computational
Linguistics and Related Fields. [Online]. Available:
http://www.sciencedirect.com/science/article/pii/S1877042811024232
[8] C.-N. Seon, J. Yoo, H. Kim, J.-H. Kim, and J. Seo, “Lightweight named
entity extraction for korean short message service text,” TIIS, vol. 5,
no. 3, pp. 560–574, 2011.
[9] X. Liu, F. Wei, S. Zhang, and M. Zhou, “Named entity recognition for
tweets,” ACM TIST, vol. 4, no. 1, p. 3, 2013.
[10] L. Zong, X. Wan, L. Zhao, J. Yang, and Y. Wu, “Named entity
resolution in chinese news comments on the web,” in Proceedings of
the 2010 12th International Asia-Pacific Web Conference, ser. APWEB
’10. Washington, DC, USA: IEEE Computer Society, 2010, pp.
307–313. [Online]. Available: http://dx.doi.org/10.1109/APWeb.2010.20
[11] X. Liu, M. Zhou, F. Wei, Z. Fu, and X. Zhou, “Joint inference of
named entity recognition and normalization for tweets,” in Proceedings
of the 50th Annual Meeting of the Association for Computational
Linguistics: Long Papers - Volume 1, ser. ACL ’12. Stroudsburg, PA,
USA: Association for Computational Linguistics, 2012, pp. 526–535.
[Online]. Available: http://dl.acm.org/citation.cfm?id=2390524.2390598
[12] Z. Yang, K. Cai, J. Tang, L. Zhang, Z. Su, and J. Li, “Social context
summarization,” in Proceedings of the 34th International ACM SIGIR
Conference on Research and Development in Information Retrieval,
ser. SIGIR ’11. New York, NY, USA: ACM, 2011, pp. 255–264.
[Online]. Available: http://doi.acm.org/10.1145/2009916.2009954
[13] A. Aw, M. Zhang, J. Xiao, and J. Su, “A phrase-based statistical model
for sms text normalization,” in Proceedings of the COLING/ACL on
Main conference poster sessions, ser. COLING-ACL ’06. Stroudsburg,
PA, USA: Association for Computational Linguistics, 2006, pp. 33–40.
[Online]. Available: http://dl.acm.org/citation.cfm?id=1273073.1273078
[14] M. B. Habib and M. van Keulen, “Unsupervised improvement of named
entity extraction in short informal context using disambiguation clues,”
in Workshop on Semantic Web and Information Extraction, SWAIE 2012,
Galway, Ireland, ser. CEUR Workshop Proceedings, vol. 925. Germany:
CEUR-WS.org, October 2012, pp. 1–10.
[15] A. Ritter, S. Clark, Mausam, and O. Etzioni, “Named entity
recognition in tweets: an experimental study,” in Proceedings
of the Conference on Empirical Methods in Natural Language
Processing, ser. EMNLP ’11. Stroudsburg, PA, USA: Association for
Computational Linguistics, 2011, pp. 1524–1534. [Online]. Available:
http://dl.acm.org/citation.cfm?id=2145432.2145595
[16] X. Liu, S. Zhang, F. Wei, and M. Zhou, “Recognizing named entities
in tweets,” in ACL, 2011, pp. 359–367.
[17] S. Jabeen and S. Shah, “Abstract accepted: Text summarization
of news-articles using yago ontology,” Paris, France, November
2012. [Online]. Available: http://celta.paris-sorbonne.fr/CELTA-
colloques/MIC-Sorbonne-2012/MIC2012-Programme.html
[18] J. Hoffart, M. A. Yosef, I. Bordino, H. Fürstenau, M. Pinkal, M. Spaniol,
B. Taneva, S. Thater, and G. Weikum, “Robust disambiguation of named
entities in text,” in Proceedings of the Conference on Empirical Methods
in Natural Language Processing, ser. EMNLP ’11. Stroudsburg, PA,
USA: Association for Computational Linguistics, 2011, pp. 782–792.
[Online]. Available: http://dl.acm.org/citation.cfm?id=2145432.2145521
