CS 523 - Essentials of Natural Language Processing: Project Title: Report On Named Entity Recognition

Essentials of Natural Language Processing Course Code: CS 523

CS 523– Essentials of Natural Language Processing

Project Title: Report on Named Entity Recognition

IN PARTIAL FULFILLMENT OF THE DEGREE OF

Masters of Business Administration-Intelligent Data Science (MBA)

UNDER GUIDANCE OF

Dr. Prosenjit Gupta


CERTIFICATE

This is to certify that the “Report on Named Entity Recognition” has been carried
out by the following students under my guidance as a part of Indian Ethos &
Business Ethics Assignment work for MBA-IDS at NIIT University, Neemrana,
Rajasthan.

1. Ashish Garg MB18GID274
2. Navaneeth S MB18GID262
3. Nikhil Goyal MB18GID258
4. Srishti Jain MB18GID245
5. Suprit Deepak MB18GID287
6. Tushar Sethi MB18GID271

Date: Signature & Seal


Place: Neemrana

ACKNOWLEDGEMENT

We owe our thanks to all the people who helped and supported us while writing
this report.

We thank Dr. Prosenjit Gupta for guiding us and correcting our drafts with care. We
are highly obliged for his painstaking efforts and attention to detail.

We would also like to acknowledge the effort and time spent by our batch mates at NIIT
University, who went out of their way to help us and gave their honest opinions
on our “Report on Named Entity Recognition”.

We would also like to thank our institution, NIIT University, Neemrana, for supporting us
with the infrastructure without which this project would have remained a distant reality.

Table of Contents

1.0 History
2.0 Information Extraction
3.0 Process for Information Extraction
4.0 Typical Information Extraction Applications
5.0 Named Entity Recognition
    5.1 Name Entity (NE) Types
    5.2 Approaches in NER
    5.3 Application of NER
6.0 Standard Libraries to use Named Entity Recognition
7.0 Named Entity Recognition Using NLTK Library
8.0 Named Entity Recognition Using SpaCy Library
9.0 Limitations of Named Entity Recognition
10 Future Scope
11. Conclusion
12. References



1.0 History
• The term “Named Entity (NE)”, widely used in Information Extraction (IE), Question
Answering (QA) and other Natural Language Processing (NLP) applications, originated in the
Message Understanding Conferences (MUC), which influenced IE research in the U.S. in
the 1990s [Grishman and Sundheim 1996] (to be precise, it was introduced for MUC-6 in
1995).
• At that time, MUC focused on IE tasks in which structured information about company
activities and defence-related activities is extracted from unstructured text, such as
newspaper articles. In the course of system development, people noticed that it is important
to recognize information units such as names (person, organization and location names)
and numeric expressions (time, date, money and percent expressions).
• Extracting these entities was recognized as one of the important sub-tasks of IE. As this
task is relatively independent, it has been evaluated separately in several different
languages, e.g. Japanese, Chinese and Spanish in the MET (Multilingual Entity Task)
project.
• Outside the U.S., there have been several evaluation-based projects for NE: as one of the
tasks of IREX (Information Retrieval and Extraction Exercise) in Japan [Sekine and
Isahara 2000] [IREX HP], and as the shared task in CoNLL in 2002 and 2003 for four
languages: English, German, Dutch and Spanish [CoNLL HP].
• In the IREX project, a new category “artifact”, such as “Odyssey” as a book title or
“Windows” as a product name, was added to the original MUC categories. The NE task in
MUC was inherited by the ACE project in the U.S., where two new categories were added:
GPE (Geographical and Political Entities, such as “France” or “New York”) and facility,
such as “Empire State Building”.
• Around this time, the number of categories was limited to 7 to 10, and the NE taggers
(automatic annotation systems for named entities in unstructured text) were based on either
1) hand-made dictionaries and rules or 2) some supervised learning technique.

• The more recent and currently dominant technology is supervised learning, which
includes Decision Trees [Sekine 1998], Hidden Markov Models (HMM) [Bikel et al.
1997], Maximum Entropy Models (ME) [Borthwick 1998], Support Vector Machines (SVM)
[Asahara 2003], Boosting and voted perceptron [Collins 2000] and Conditional Random
Fields (CRFs) [McCallum and Li 2003]. The NE extraction task has been the experimental
sandbox for various forms of supervised learning.

2.0 Information Extraction


Information extraction (IE) is the automated retrieval of specific information related to a selected
topic from a body or bodies of text.

Information extraction tools make it possible to pull information from text documents, databases,
websites or multiple sources. IE may extract information from unstructured, semi-structured or
structured, machine-readable text. Usually, however, IE is used in natural language processing
(NLP) to extract structured data from unstructured text.

The flow chart below depicts the process of information extraction.

Information extraction depends on named entity recognition (NER), a sub-tool used to find
targeted information to extract.

Gathering detailed structured data from texts, information extraction enables:

• The automation of tasks such as smart content classification, integrated search,
management and delivery
• Data-driven activities such as mining for patterns and trends, uncovering hidden
relationships, etc.

3.0 Process for Information Extraction


Typically, for structured information to be extracted from unstructured texts, the following
main subtasks are involved:

• Pre-processing of the text – this is where the text is prepared for processing with the help
of computational linguistics tools such as tokenization, sentence splitting, morphological
analysis, etc.
• Finding and classifying concepts – this is where mentions of people, things, locations,
events and other pre-specified types of concepts are detected and classified.
• Connecting the concepts – this is the task of identifying relationships between the extracted
concepts.
• Unifying – this subtask is about presenting the extracted data in a standard form.
• Getting rid of the noise – this subtask involves eliminating duplicate data.
• Enriching your knowledge base – this is where the extracted knowledge is ingested into your
database for further use.
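
The subtasks listed above can be illustrated with a small toy pipeline. This is only a sketch under stated assumptions: it uses the Python standard library alone, the gazetteer entries and all function names (preprocess, find_concepts, connect, unify, extract) are hypothetical, and "relationship" is reduced to simple co-occurrence within a sentence. A real system would use proper linguistic tools at every step.

```python
import re

# Hypothetical gazetteer: surface form -> concept type (illustrative only).
GAZETTEER = {
    "Boeing": "ORGANIZATION",
    "Ethiopian Airlines": "ORGANIZATION",
    "Indonesia": "LOCATION",
}


def preprocess(text):
    """Pre-processing: naive sentence splitting after '.', '!' or '?'."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def find_concepts(sentence):
    """Finding and classifying concepts: plain dictionary look-up."""
    return [(name, kind) for name, kind in GAZETTEER.items() if name in sentence]


def connect(concepts):
    """Connecting the concepts: pair up concepts that co-occur in one sentence."""
    return [(a, b) for i, a in enumerate(concepts) for b in concepts[i + 1:]]


def unify(concept):
    """Unifying: put every record into one standard (dict) form."""
    name, kind = concept
    return {"name": name.strip(), "type": kind.upper()}


def extract(text):
    """Run the subtasks in order; return de-duplicated records and relations."""
    records, relations, seen = [], [], set()
    for sentence in preprocess(text):
        concepts = find_concepts(sentence)
        relations.extend(connect(concepts))
        for concept in concepts:
            record = unify(concept)
            key = (record["name"], record["type"])
            if key not in seen:          # getting rid of the noise (duplicates)
                seen.add(key)
                records.append(record)   # stand-in for the knowledge base
    return records, relations


if __name__ == "__main__":
    sample = "Boeing built the jet. Ethiopian Airlines flew a Boeing jet to Indonesia."
    for row in extract(sample)[0]:
        print(row)
```

The second mention of "Boeing" is dropped by the duplicate check, which is the "getting rid of the noise" step in miniature.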

Information extraction can be entirely automated or performed with the help of human input.

Typically, the best information extraction solutions are a combination of automated methods and
human processing.

4.0 Typical Information Extraction Applications


Information extraction can be applied to a wide range of textual sources: from emails and Web
pages to reports, presentations, legal documents and scientific papers. The technology successfully
solves challenges related to content management and knowledge discovery in the areas of:

Business intelligence (for enabling analysts to gather structured information from multiple
sources);

Financial investigation (for analysis and discovery of hidden relationships);

Scientific research (for automated references discovery or relevant papers suggestion);

Media monitoring (for mentions of companies, brands, people);

Healthcare records management (for structuring and summarizing patient records);

Pharma research (for drug discovery, adverse effects discovery and automated analysis of
clinical trials).

5.0 Named Entity Recognition


Named Entity Recognition (NER) is one of the important parts of Natural Language Processing
(NLP). NER aims to find and classify expressions of special meaning in texts written in
natural language. These expressions range from proper names of persons or organizations to dates,
and they often hold the key information in a text.

NER can be used for several important tasks. It can be used as a stand-alone tool for full-text
searching and filtering. It can also be used as a pre-processing tool for other NLP tasks. These
tasks can take advantage of marked Named Entities (NE) and handle them separately, which often
results in better performance. Some of these tasks are Machine Translation, Question Answering,
Text Summarization, Language Modelling and Sentiment Analysis.

NER was first introduced at MUC-6 in 1995. Since then it has moved from rule-based systems to
statistical systems with a variety of advanced features. The state-of-the-art performance is around
90% for English and 70% for Czech. The performance for other languages varies greatly
depending on the properties of the given language.

5.1 Name Entity (NE) Types


The named entity hierarchy is divided into three major classes: Entity Name, Time and
Numerical expressions.

Other common tasks can be: recognition of date/time expressions, measures (percent, money,
weight etc.), email addresses etc.

Other domain-specific entities can be: names of Drugs, Genes, medical conditions, names of ships,
bibliographic references etc.

E.g. John who is a student of Stanford University, Stanford, scored 95% in his seminar on the 11th
of April.

Output: $ John^(ENAMEX, name) who is a student of $ Stanford University^(ENAMEX, org),
$ Stanford^(ENAMEX, location), scored $ 95%^(NUMEX, percent) in his seminar on the
$ 11th of April^(TIMEX, date).

Each class can be further broken down into sub-types, as follows:

• Entity Types:

• Numerical Expression

• Time Expression

5.2 Approaches in NER


• Dictionary Look-up

• Rule based (Using lexical, contextual and morphological information)

• Maximum entropy theory based

• Hidden Markov Model


• Conditional Random Fields

• Hybrid methods (Statistical+ Linguistics)
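
The first two approaches in the list, dictionary look-up and hand-written rules, can be combined in a few lines. The sketch below is illustrative only: the dictionary entries, the two regular-expression rules and the function name tag are assumptions made for this example, and it reuses the ENAMEX / NUMEX / TIMEX classes from the example sentence in section 5.1. Overlaps are resolved by preferring the earlier and then the longer match, so that "Stanford University" wins over "Stanford" at the same position.

```python
import re

# Hypothetical mini-dictionary (gazetteer); a real system would use far larger lists.
DICTIONARY = {
    "John": ("ENAMEX", "name"),
    "Stanford University": ("ENAMEX", "org"),
    "Stanford": ("ENAMEX", "location"),
}

MONTHS = ("January|February|March|April|May|June|July|August|"
          "September|October|November|December")

# Hand-written rules for numeric and time expressions (illustrative, not exhaustive).
RULES = [
    (re.compile(r"\b\d+(?:\.\d+)?%"), ("NUMEX", "percent")),
    (re.compile(r"\b\d{1,2}(?:st|nd|rd|th) of (?:" + MONTHS + r")\b"), ("TIMEX", "date")),
]


def tag(text):
    """Return non-overlapping (start, end, surface, class, subtype) tuples in text order."""
    candidates = []
    # Dictionary look-up: every whole-word occurrence of a known phrase.
    for phrase, label in DICTIONARY.items():
        for m in re.finditer(r"\b" + re.escape(phrase) + r"\b", text):
            candidates.append((m.start(), m.end(), label))
    # Rule-based matching: every occurrence of each pattern.
    for pattern, label in RULES:
        for m in pattern.finditer(text):
            candidates.append((m.start(), m.end(), label))

    # Prefer earlier, then longer matches; drop anything overlapping a kept span.
    candidates.sort(key=lambda c: (c[0], -(c[1] - c[0])))
    result, last_end = [], 0
    for start, end, (cls, subtype) in candidates:
        if start >= last_end:
            result.append((start, end, text[start:end], cls, subtype))
            last_end = end
    return result


if __name__ == "__main__":
    sentence = ("John who is a student of Stanford University, Stanford, "
                "scored 95% in his seminar on the 11th of April.")
    for _start, _end, surface, cls, subtype in tag(sentence):
        print(f"{surface} -> ({cls}, {subtype})")
```

The statistical approaches in the list (maximum entropy, HMM, CRF) replace the hand-written dictionary and rules with parameters learned from annotated text, which is why they depend on annotated corpora.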

5.3 Application of NER


• QUESTION ANSWERING: NER is extremely useful for systems that read text and
answer queries.
e.g. Tasks such as “Name all the colleges in Bombay listed in the document”
• INFORMATION EXTRACTION: To find out and tag the subject of a web page
e.g. To extract the names of all the companies in a particular document.
• PRE PROCESSING FOR MACHINE TRANSLATION
• WORD SENSE DISAMBIGUATION FOR PROPER NOUNS

6.0 Standard Libraries to use Named Entity Recognition


Three standard libraries used in Python to perform NER are:

• Stanford NER
• SpaCy
• NLTK

7.0 Named Entity Recognition Using NLTK Library


1. Import the required libraries for named entity recognition using NLTK.

2. The paragraph on which named entity recognition using NLTK is performed.


3. Named entities are generated using ne_chunk function.

Output:
The/DT crash/NN of/IN (GPE Ethiopian/NNP) (PERSON Airlines/NNPS
Flight/NNP) 302/CD on/IN March/NNP 10/CD followed/VBD the/DT
unrecoverable/JJ nose-dive/JJ almost/RB five/CD months/NNS
earlier/JJR of/IN another/DT jet/NN of/IN the/DT same/JJ
model/NN ,/, a/DT Boeing/NNP 737/CD Max/NNP 8/CD ,/, in/IN (GPE
Indonesia/NNP) ./. (GPE Indonesian/JJ) investigators/NNS
have/VBP implicated/VBN a/DT malfunctioning/NN automated/JJ
anti-stall/JJ program/NN in/IN that/DT disaster/NN ,/, in/IN
which/WDT the/DT plane/NN ’/NNP s/VBZ computer/NN system/NN
appeared/VBD to/TO override/VB pilot/NN directions/NNS
based/VBN on/IN faulty/NN data/NNS ./.)
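
The code for the three steps above is not reproduced in the text, so the following is a minimal sketch of how they are commonly written rather than the exact code used for this report. It relies on the NLTK functions word_tokenize, pos_tag and ne_chunk; the helper names chunk_paragraph and named_entities are hypothetical. It assumes NLTK is installed and that the tokenizer, tagger and chunker resources have already been fetched with nltk.download (the exact resource names differ between NLTK versions).

```python
def chunk_paragraph(paragraph):
    """Tokenize, POS-tag and NE-chunk a paragraph with NLTK."""
    import nltk  # imported lazily so named_entities() can be used without NLTK

    tokens = nltk.word_tokenize(paragraph)   # split the paragraph into tokens
    tagged = nltk.pos_tag(tokens)            # (token, POS tag) pairs
    return nltk.ne_chunk(tagged)             # tree whose subtrees are named entities


def named_entities(tree):
    """Collect (label, text) pairs from the NE subtrees of a chunked text."""
    entities = []
    for node in tree:
        # Plain tokens are (word, tag) tuples; named entities are labelled subtrees.
        if hasattr(node, "label"):
            words = [word for word, _tag in node.leaves()]
            entities.append((node.label(), " ".join(words)))
    return entities


# Usage (requires NLTK and its downloaded resources):
#   tree = chunk_paragraph(paragraph)
#   print(tree)
#   print(named_entities(tree))
```

In the output shown above, "Ethiopian" is labelled GPE and "Airlines Flight" is labelled PERSON, so the airline name is split and mislabelled; this is worth keeping in mind when comparing with the SpaCy output in the next section.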

8.0 Named Entity Recognition Using SpaCy Library


1. Import the required libraries for named entity recognition using SpaCy.

2. The paragraph on which named entity recognition using SpaCy is performed.


Output:
[('Ethiopian Airlines', 'ORG'), ('Flight 302', 'PRODUCT'),
('March 10', 'DATE'), ('almost five months earlier', 'DATE'),
('Boeing', 'ORG'), ('737 Max 8', 'PRODUCT'), ('Indonesia',
'GPE'), ('Indonesian', 'NORP'), ('’s', 'ORG')]
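
A list of (text, label) pairs like the one above is typically produced by iterating over the ents attribute of a processed document. The sketch below is an assumption about how steps 1 and 2 could be written, not the code used for this report: the model name en_core_web_sm and the helper names load_pipeline and entity_pairs are illustrative, and the model has to be installed separately.

```python
def load_pipeline(model_name="en_core_web_sm"):
    """Load a spaCy pipeline; the default model name is an assumption."""
    import spacy  # imported lazily so entity_pairs() can be used without spaCy

    return spacy.load(model_name)


def entity_pairs(doc):
    """Return (text, label) pairs for every entity found in a processed Doc."""
    return [(ent.text, ent.label_) for ent in doc.ents]


# Usage (requires spaCy and an installed English model):
#   nlp = load_pipeline()
#   doc = nlp(paragraph)
#   print(entity_pairs(doc))
```

The labels returned depend on the model that is loaded, so a different model or version may not reproduce the output above exactly.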

Note: Entity types which SpaCy supports:

3. Performing named entity recognition on the article which is extracted using web scraping.

Output:
Number of named entities in the article:
182

4. Count of named entities for each label of SpaCy.

Output:
Counter ({'PERSON': 31, 'ORG': 62, 'DATE': 22, 'GPE': 19,
'PRODUCT': 12, 'NORP': 10, 'CARDINAL': 14, 'ORDINAL': 2,
'WORK_OF_ART': 1, 'FAC': 1, 'LAW': 3, 'TIME': 2, 'LOC': 2,
'QUANTITY': 1})

5. Count of most common named entity.

Output:
[('Boeing', 15), ('Ethiopian Airlines', 11), ('Max', 11)]
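
The counts in steps 4 and 5 can be obtained with collections.Counter from the Python standard library. As a small, hedged illustration, the sketch below applies it to the nine (text, label) pairs from the single-paragraph output earlier in this section rather than to the full article, so its numbers are much smaller than the article-level figures above; the variable names are illustrative.

```python
from collections import Counter

# (text, label) pairs copied from the single-paragraph SpaCy output in this section.
entities = [
    ("Ethiopian Airlines", "ORG"), ("Flight 302", "PRODUCT"),
    ("March 10", "DATE"), ("almost five months earlier", "DATE"),
    ("Boeing", "ORG"), ("737 Max 8", "PRODUCT"), ("Indonesia", "GPE"),
    ("Indonesian", "NORP"), ("’s", "ORG"),
]

label_counts = Counter(label for _text, label in entities)   # step 4: count per label
text_counts = Counter(text for text, _label in entities)     # step 5: count per entity

print(len(entities))                # number of named entities
print(label_counts)
print(text_counts.most_common(3))   # three most frequent entity strings
```

With a processed SpaCy document, the same counters can be built from the entities in doc.ents instead of a hard-coded list.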

6. Picking a sentence from the article extracted through web scraping, on which named
entity recognition is performed.

Output:
The crash of Ethiopian Airlines Flight 302 on March 10 followed
the unrecoverable nose-dive almost five months earlier of
another jet of the same model, a Boeing 737 Max 8, in Indonesia.

7. Display the named entities from the sentence picked from the article.

Output:

8. Display the named entities from the entire article.

Output:

9.0 Limitations of Named Entity Recognition


1. Different Identification and Annotation Criteria.

The annotation guidelines followed in the traditional evaluation forums show some degree of
confusion about how to annotate Named Entities. Annotation criteria differ across the MUC
(Message Understanding Conference), CoNLL (Conference on Computational Natural Language
Learning) and ACE (Automatic Content Extraction) evaluation forums. The most contentious
aspects are the types of Named Entity to recognize, the criteria for their identification and
annotation, and their boundaries in the text.

The table below shows some annotations assigned by the different forums:

2. Consequences for NER Tools

The ambiguity in the definition of Named Entity affects NER tools too. Five research NER
tools (Annie, Afner, TextPro, YooName and Supersense Tagger) were studied, noting that the types
of NE recognized are usually fixed in advance and very different across tools. The tools seem to
implicitly agree on recognizing person, organization and location as types of NE, but there are
many discrepancies with the rest.

3. Is NER Really Solved?


In ACE 2008, the best scores were only marginally above 50%. However, the performance
measures used in ACE are so different from the traditional precision-recall measures that scores
from ACE are not comparable to scores from MUC and CoNLL. In any case, the fact that
the best scores in 2008 were only about 50% indicates that there is significant room for
improvement in NER tools. Despite this, NER was generally regarded in 2005 as a solved problem
with performance scores above 95%, and after ACE 2008 these tasks were dropped from the
international IE and IR evaluation forums. The claim that NER is a solved problem rests on results
from several evaluation experiments, which are themselves subject to validity analysis.

10 Future Scope
Challenges and Opportunities in Named Entity Recognition

1. NER has been considered a solved problem once techniques achieved a minimum level of
performance with a handful of NE types and document genres, usually in the journalistic domain.

2. There are no commonly accepted resources to evaluate the new types of NE that tools recognize
nowadays, and the new evaluation forums, though they overcome some of the previous limitations,
are not enough to measure the evolution of NER because they evaluate systems with different
goals, not valid for most NER applications.

3. Therefore, the NER community is presented with the opportunity to further advance in the
recognition of any named entity type within any kind of collection.

4. The evaluation corpora need to be extended too, paying special attention to the application of
NER in heterogeneous scenarios such as the Web. These evaluation forums should be
maintained over time, with stable measures, agreed upon and shared by the whole community.

5. But getting to adequate NER evaluations without raising costs is a challenging problem. The
effectiveness measures widely used in other areas are not suitable for NER, and the evaluation
methodologies need to be reconsidered. In particular, it is necessary to measure recall, which can
be extremely costly with a large corpus.

6. It is necessary to reconsider the effort required to adapt a tool to a new type of entity or
collection, as it usually implies the annotation of a new document collection. The recurrent use of
supervised machine learning techniques during the last decade contributed to making these tools
portable, but at the expense of significant annotation effort on the part of the end user.

11. Conclusion

The definitions given for Named Entity have been very diverse, ambiguous and incongruent so
far. The evaluation of NER tools has been carried out in several forums, and it is generally
considered a solved problem with very high performance ratios. But these evaluations have used
a very limited set of NE types that has seldom changed over the years, and extremely small corpora
compared to other areas of Information Retrieval. Both factors seem to lead to overfitting of tools
to these corpora, limiting the evolution of the area and leading to wrong conclusions when
generalizing the results. It is necessary to take NER back to the research community and develop
adequate evaluation forums, with a clear definition of the task and user models, and the use of
appropriate measures and standard methodologies. Only by doing so may we really contemplate
the possibility of NER being a solved problem.

12. References

https://slideplayer.com/slide/6912263/

https://homes.cs.washington.edu/~mausam/papers/emnlp11.pdf

https://en.wikipedia.org/wiki/Named-entity_recognition

https://www.kiv.zcu.cz/site/documents/verejne/vyzkum/publikace/technicke-zpravy/2012/tr-2012-04.pdf

https://ltrc.iiit.ac.in/iasnlp2014/slides/lecture/sobha-ner.ppt
