CS 523 - Essentials of Natural Language Processing: Project Title: Report On Named Entity Recognition
UNDER GUIDANCE OF
CERTIFICATE
This is to certify that the “Report on Named Entity Recognition” has been carried
out by the following students under my guidance as a part of Indian Ethos &
Business Ethics Assignment work for MBA-IDS at NIIT University, Neemrana,
Rajasthan.
ACKNOWLEDGEMENT
We owe our thanks to all the people who helped and supported us while writing
this report.
We thank Dr. Prosenjit Gupta for guiding us and correcting our drafts with care. We
are highly obliged for his painstaking efforts and attention to detail.
We would also like to thank our batchmates at NIIT University, who went out of
their way to help us and gave their honest opinions on our “Report on Named Entity
Recognition”.
We also thank our institution, NIIT University, Neemrana, for supporting us with
the infrastructure without which this project would have been a distant reality.
Essentials of Natural Language Processing Course Code: CS 523
1.0 History
The term “Named Entity (NE)”, widely used in Information Extraction (IE), Question
Answering (QA) or other Natural Language Processing (NLP) applications, was born in the
Message Understanding Conferences (MUC), which influenced IE research in the U.S. in
the 1990s [Grishman and Sundheim 1996] (to be precise, it was introduced for MUC-6 in
1995).
At that time, MUC focused on IE tasks in which structured information about company
activities and defence-related activities was extracted from unstructured text, such as
newspaper articles. In the course of system development, people noticed that it is
important to recognize information units such as names, including person, organization and
location names, and numeric expressions, including time, date, money and percent
expressions.
Extracting these entities was recognized as one of the important sub-tasks of IE. As this
task is relatively independent, it has been evaluated separately in several different
languages, e.g. Japanese, Chinese and Spanish in the MET (Multilingual Entity Task)
project.
Outside the U.S., there have been several evaluation-based projects for NE: it was one of
the tasks of IREX (Information Retrieval and Extraction Exercise) in Japan [Sekine and
Isahara 2000] [IREX HP], and it was the shared task at CoNLL in 2002 and 2003, covering
four languages: English, German, Dutch and Spanish [CoNLL HP].
In the IREX project, a new category “artifact”, such as “Odyssey” as a book title or
“Windows” as a product name, was added to the original MUC categories. The NE task in
MUC was inherited by the ACE project in the U.S., where two new categories were added:
GPE (Geographical and Political Entities, such as “France” or “New York”) and facility,
such as “Empire State Building”.
Around this time, the number of categories was limited to 7 to 10, and the NE taggers,
automatic annotation systems for named entities in unstructured text, were based on 1)
dictionaries and rules made by hand or 2) some supervised learning technique.
The more recent and currently dominant technology is supervised learning, which
includes Decision Trees [Sekine 1998], Hidden Markov Models (HMM) [Bikel et al. 1997],
Maximum Entropy Models (ME) [Borthwick 1998], Support Vector Machines (SVM)
[Asahara 2003], boosting and the voted perceptron [Collins 2000] and Conditional Random
Fields (CRFs) [McCallum and Li 2003]. The NE extraction task has been the experimental
sandbox for various forms of supervised learning.
Information extraction tools make it possible to pull information from text documents, databases,
websites or multiple sources. IE may extract information from unstructured, semi-structured or
structured machine-readable text. Usually, however, IE is used in natural language processing
(NLP) to extract structured information from unstructured text.
Information extraction depends on named entity recognition (NER), a sub-tool used to find
targeted information to extract.
Information extraction can be entirely automated or performed with the help of human input.
Typically, the best information extraction solutions are a combination of automated methods and
human processing.
Business intelligence (for enabling analysts to gather structured information from multiple
sources);
Pharma research (for drug discovery, adverse effects discovery and automated analysis of
clinical trials).
NER can be used for several important tasks. It can be used as a stand-alone tool for full-text
searching and filtering. It can also be used as a pre-processing tool for other NLP tasks. These
tasks can take advantage of marked Named Entities (NE) and handle them separately, which often
results in better performance. Some of these tasks are Machine Translation, Question Answering,
Text Summarization, Language Modelling or Sentiment Analysis.
NER was first introduced at MUC-6 in 1995. Since then it has moved from rule-based systems to
statistical systems with a variety of advanced features. State-of-the-art performance is around
90% for English and 70% for Czech. Performance for other languages varies greatly
depending on the properties of the given language.
Other common tasks include the recognition of date/time expressions, measures (percent, money,
weight, etc.), email addresses, etc.
Other domain-specific entities include names of drugs, genes, medical conditions, names of ships,
bibliographic references, etc.
E.g. John who is a student of Stanford University, Stanford, scored 95% in his seminar on the 11th
of April.
Entity Types:
Numerical Expression
Time Expression
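As a rough illustration of how hand-made dictionaries and rules can pick out names, numerical expressions and time expressions, the sketch below tags the example sentence above using only Python's standard library. The dictionary entries, the label names and the regular expressions are all assumptions made for this example; they are not taken from any of the tools discussed in this report.

```python
import re

SENTENCE = ("John who is a student of Stanford University, Stanford, "
            "scored 95% in his seminar on the 11th of April.")

# Hypothetical hand-made dictionary; entries and labels are assumptions.
GAZETTEER = {
    "John": "PERSON",
    "Stanford University": "ORGANIZATION",
    "Stanford": "LOCATION",
}

MONTHS = ("January|February|March|April|May|June|July|August|"
          "September|October|November|December")

# Hypothetical hand-written rules for numerical and time expressions.
RULES = [
    ("PERCENT", re.compile(r"\b\d+(?:\.\d+)?%")),
    ("DATE", re.compile(r"\b\d{1,2}(?:st|nd|rd|th)? of (?:" + MONTHS + r")\b")),
]


def overlaps(start, end, spans):
    """Return True if [start, end) intersects any span already accepted."""
    return any(start < e and s < end for s, e, _ in spans)


def tag(text):
    """Return (entity text, label) pairs in order of appearance."""
    spans = []
    # Try longer dictionary entries first so that "Stanford University"
    # is matched before the shorter entry "Stanford".
    for name in sorted(GAZETTEER, key=len, reverse=True):
        for m in re.finditer(r"\b" + re.escape(name) + r"\b", text):
            if not overlaps(m.start(), m.end(), spans):
                spans.append((m.start(), m.end(), GAZETTEER[name]))
    # Then apply the pattern rules to whatever text is still untagged.
    for label, pattern in RULES:
        for m in pattern.finditer(text):
            if not overlaps(m.start(), m.end(), spans):
                spans.append((m.start(), m.end(), label))
    return [(text[s:e], label) for s, e, label in sorted(spans)]


for entity, label in tag(SENTENCE):
    print(f"{entity} -> {label}")
```

The longest-match ordering is what lets this toy tagger keep "Stanford University" and the later "Stanford" apart. A name missing from the dictionary would simply go untagged, which is one limitation of purely hand-made resources.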
Stanford NER
SpaCy
NLTK
Output:
The/DT crash/NN of/IN (GPE Ethiopian/NNP) (PERSON Airlines/NNPS
Flight/NNP) 302/CD on/IN March/NNP 10/CD followed/VBD the/DT
unrecoverable/JJ nose-dive/JJ almost/RB five/CD months/NNS
earlier/JJR of/IN another/DT jet/NN of/IN the/DT same/JJ
model/NN ,/, a/DT Boeing/NNP 737/CD Max/NNP 8/CD ,/, in/IN (GPE
Indonesia/NNP) ./. (GPE Indonesian/JJ) investigators/NNS
have/VBP implicated/VBN a/DT malfunctioning/NN automated/JJ
anti-stall/JJ program/NN in/IN that/DT disaster/NN ,/, in/IN
which/WDT the/DT plane/NN ’/NNP s/VBZ computer/NN system/NN
appeared/VBD to/TO override/VB pilot/NN directions/NNS
based/VBN on/IN faulty/NN data/NNS ./.)
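In the output above, the bracketed chunks such as (GPE Indonesia/NNP) are the recognized entities, and everything else is a word/POS-tag pair. As a small sketch, the helper below pulls those chunks out of such a flattened string using only the standard library. The name `extract_chunks` is hypothetical, and the sample string is abridged from the output shown above.

```python
import re

# Abridged from the chunker output shown above.
TAGGED = ("The/DT crash/NN of/IN (GPE Ethiopian/NNP) (PERSON Airlines/NNPS "
          "Flight/NNP) 302/CD on/IN March/NNP 10/CD in/IN (GPE "
          "Indonesia/NNP) ./. (GPE Indonesian/JJ) investigators/NNS")

# A chunk looks like "(LABEL token/TAG token/TAG ...)" with no nested brackets.
CHUNK = re.compile(r"\((\w+)\s+([^()]+)\)")


def extract_chunks(tagged):
    """Return (entity text, label) pairs for every bracketed chunk."""
    entities = []
    for label, body in CHUNK.findall(tagged):
        # Drop the POS tag after the last "/" of each token.
        words = [token.rsplit("/", 1)[0] for token in body.split()]
        entities.append((" ".join(words), label))
    return entities


for entity, label in extract_chunks(TAGGED):
    print(f"{entity} -> {label}")
```

Read this way, the chunker has split "Ethiopian Airlines Flight 302" into a GPE ("Ethiopian") and a PERSON ("Airlines Flight"), which differs from the entity list in the next output.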
Output:
[('Ethiopian Airlines', 'ORG'), ('Flight 302', 'PRODUCT'),
('March 10', 'DATE'), ('almost five months earlier', 'DATE'),
('Boeing', 'ORG'), ('737 Max 8', 'PRODUCT'), ('Indonesia',
'GPE'), ('Indonesian', 'NORP'), ('’s', 'ORG')]
3. Performing named entity recognition on the article extracted by web scraping.
Output:
Number of named entities in the article:
182
Output:
Counter({'PERSON': 31, 'ORG': 62, 'DATE': 22, 'GPE': 19,
'PRODUCT': 12, 'NORP': 10, 'CARDINAL': 14, 'ORDINAL': 2,
'WORK_OF_ART': 1, 'FAC': 1, 'LAW': 3, 'TIME': 2, 'LOC': 2,
'QUANTITY': 1})
Output:
[('Boeing', 15), ('Ethiopian Airlines', 11), ('Max', 11)]
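The three outputs above (a total, a count per label and the most frequent entities) can all be derived from a list of (text, label) pairs with `collections.Counter`. The sketch below assumes the entities are already available as such pairs. As stand-in data it reuses the nine entities listed earlier for the single sentence, so its numbers are much smaller than the article-level figures above.

```python
from collections import Counter

# Stand-in data: the nine (text, label) pairs listed earlier for one sentence.
entities = [
    ('Ethiopian Airlines', 'ORG'), ('Flight 302', 'PRODUCT'),
    ('March 10', 'DATE'), ('almost five months earlier', 'DATE'),
    ('Boeing', 'ORG'), ('737 Max 8', 'PRODUCT'), ('Indonesia', 'GPE'),
    ('Indonesian', 'NORP'), ('’s', 'ORG'),
]

# Total number of named entities.
total = len(entities)

# Number of entities per label.
label_counts = Counter(label for _, label in entities)

# Frequency of each entity text, and the three most frequent ones.
text_counts = Counter(text for text, _ in entities)
top_three = text_counts.most_common(3)

print("Number of named entities:", total)
print(label_counts)
print(top_three)
```

In this small sample every entity text occurs once, so the top-three list is only meaningful on a longer text such as the full article.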
6. Picking a sentence from the article extracted by web scraping, on which named
entity recognition is performed.
Output:
The crash of Ethiopian Airlines Flight 302 on March 10 followed
the unrecoverable nose-dive almost five months earlier of
another jet of the same model, a Boeing 737 Max 8, in Indonesia.
7. Display the named entities from the sentence picked from the article.
Output:
Output:
The annotation guidelines followed in the traditional evaluation forums show some degree of
confusion on how to annotate Named Entities. Annotation criteria differ across the MUC
(Message Understanding Conference), CoNLL (Computational Natural Language Learning) and
ACE (Automatic Content Extraction) evaluation forums. The most conflictive aspects in this
matter are the types of Named Entity to recognize, the criteria for their identification and
annotation, and their boundaries in the text.
The table below shows some annotations assigned by the different forums:
The ambiguity in the definition of Named Entity affects NER tools too. Five research NER
tools (Annie, Afner, TextPro, YooName and Supersense Tagger) were studied, noting that the types
of NE recognized are usually fixed in advance and differ widely across tools. They seem to
agree implicitly on recognizing person, organization and location as types of NE, but there
are many discrepancies with the rest.
10. Future Scope
Challenges and Opportunities in Named Entity Recognition
1. NER has been considered a solved problem once techniques achieved a minimum level of
performance on a handful of NE types and document genres, usually in the journalistic domain.
2. There are no commonly accepted resources to evaluate the new types of NE that tools recognize
nowadays, and the new evaluation forums, though they overcome some of the previous limitations,
are not enough to measure the evolution of NER because they evaluate systems with different
goals, not valid for most NER applications.
3. Therefore, the NER community is presented with the opportunity to further advance in the
recognition of any named entity type within any kind of collection.
4. The evaluation corpora need to be extended too, paying special attention to the application of
NER in heterogeneous scenarios such as the Web. These evaluation forums should be
maintained over time, with stable measures, agreed upon and shared by the whole community.
5. But getting to adequate NER evaluations without raising costs is a challenging problem. The
effectiveness measures widely used in other areas are not suitable for NER, and the evaluation
methodologies need to be reconsidered. In particular, it is necessary to measure recall, which can
be extremely costly with a large corpus.
6. It is necessary to reconsider the effort required to adapt a tool to a new type of entity or
collection, as it usually implies the annotation of a new document collection. The recurrent use of
supervised machine learning techniques during the last decade has helped make these tools
portable, but at the expense of significant annotation effort on the part of the end user.
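Point 5 stresses that recall has to be measured. As a minimal sketch of what that involves, the function below computes precision, recall and F1 by exact match between gold and predicted (text, label) pairs. The gold annotations here are hypothetical and written only for illustration, and treating the entities as a set of pairs (ignoring position and repeated mentions) is a simplifying assumption.

```python
def precision_recall_f1(gold, predicted):
    """Exact-match scores over sets of (text, label) pairs."""
    gold_set, pred_set = set(gold), set(predicted)
    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical gold annotations for one sentence (illustration only).
gold = [
    ("Ethiopian Airlines", "ORG"), ("March 10", "DATE"), ("Boeing", "ORG"),
    ("737 Max 8", "PRODUCT"), ("Indonesia", "GPE"),
]

# Hypothetical tagger output: one entity missed, one spurious entity.
predicted = [
    ("Ethiopian Airlines", "ORG"), ("March 10", "DATE"), ("Boeing", "ORG"),
    ("Indonesia", "GPE"), ("’s", "ORG"),
]

p, r, f = precision_recall_f1(gold, predicted)
print(f"precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

Precision needs only the tagger's own output to be checked, whereas recall needs every entity in the corpus to be annotated in the gold list, which is why it becomes costly on large collections.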
11. Conclusion
The definitions given for Named Entity have been very diverse, ambiguous and incongruent so
far. The evaluation of NER tools has been carried out in several forums, and it is generally
considered a solved problem with very high performance ratios. But these evaluations have used
a very limited set of NE types that has seldom changed over the years, and extremely small corpora
compared to other areas of Information Retrieval. Both factors seem to lead to overfitting of tools
to these corpora, limiting the evolution of the area and leading to wrong conclusions when
generalizing the results. It is necessary to take NER back to the research community and develop
adequate evaluation forums, with a clear definition of the task and user models, and the use of
appropriate measures and standard methodologies. Only by doing so may we really contemplate
the possibility of NER being a solved problem.
12. References
https://slideplayer.com/slide/6912263/
https://homes.cs.washington.edu/~mausam/papers/emnlp11.pdf
https://en.wikipedia.org/wiki/Named-entity_recognition
https://www.kiv.zcu.cz/site/documents/verejne/vyzkum/publikace/technicke-zpravy/2012/tr-2012-04.pdf
https://ltrc.iiit.ac.in/iasnlp2014/slides/lecture/sobha-ner.ppt