Questioning Answering System For Most Frequently Asked Questions
All content following this page was uploaded by Dr. Partha Pakray on 14 December 2016.
Abstract

Question Answering (QA) is an important aspect of Natural Language Processing. It comprises building a system that automatically answers questions posed in natural language. Frequently Asked Questions (FAQs) are a set of listed questions and answers concerning a specific topic, which are those most likely to be asked by a user. This paper deals with developing an open-domain QA system for retrieving a list of relevant FAQs related to the query issued by the user. Our approach combines orthodox AND/OR searching with a combinatorics-based searching technique, which is able to produce an exhaustive list of results for a given query.

1 Introduction

Question Answering (QA) is an emerging topic in today's world. It is an aggregate of Information Retrieval (IR) and Natural Language Processing (NLP), and is concerned with developing an automated engine which is able to respond to queries presented by users in natural language. Frequently Asked Questions (FAQs) represent an effective and efficient way to quickly resolve the queries posed by users. They are usually presented as an assembled list of questions and their answers.

Searching within FAQs can be a tedious task. This becomes even more drawn out when paraphrasing comes into play. As a result, the user is pushed into a maze of questions and answers, having to look manually for a particular one, as shown in Figure 1. It is here that a QA system is of utmost importance, retrieving the desired answer instantly.

Figure 1: FAQs of Microsoft Download Center
The rest of this paper is organised as follows: Section 2 describes the corpus and its pre-processing, Section 3 describes our system's architecture and the tools used, Section 4 describes the experiments, Section 5 describes the performance of the system, Section 6 analyses the results, and Section 7 describes the conclusion and future work.

An example of the data provided is given below:

193;Cosa significa AEEGSI?;l'Autorità per l'Energia Elettrica il Gas ed il Sistema Idrico.;acqua, acquedotto, distribuzione, AEEGSI

2.1 Parsing

For the purpose of pre-processing the training data, we developed a CSV parser which could extract the ID and the rest of the fields. The development dataset had 406 entries, each with an id, question, answer and tag(s). We extracted the question, answer and tags of each entry and saved them in a file named ID.txt.

2.2 Stopword Removal

In order to increase the efficiency of our input data, we decided to perform stopword removal. Stopwords are the words which occur in 80% of the documents in the collection. However, while searching for a list of Italian stopwords, we realised that the existing ones had only 133 to 399 stopwords each.[2][3][4] So, we merged them and developed our own exhaustive Italian stopword corpus from the existing ones. This corpus[5] had approximately 546 unique stopwords in total. This operation helped us get rid of the unwanted words which would hinder the system's performance.

3 System Architecture

Figure 2: Architecture of the implemented system

The architecture may be divided into two distinct parts, as shown in the figure. One part contains the architecture of Nutch[6], enclosed in the rectangle; it contains all the basic components essential in the implementation of a search engine. The other part represents the aggregation of the searching techniques to be adopted while searching the FAQs. This includes a module that processes the queries obtained for both AND/OR searching as well as combinatorics-based searching. The two major steps involved in developing the architecture were Crawling & Indexing and Searching (described in Section 4).

The steps involved in crawling and indexing are described below:

Footnotes:
[1] http://qa4faq.github.io
[2] http://www.ranks.nl/stopwords/italian
[3] http://members.unine.ch/jacques.savoy/clef/italianST.txt
[4] https://github.com/themnd/stopword-it/blob/master/stopwords.txt
[5] The corpus is openly shared on GitHub for further use: https://github.com/SRvSaha/QA4FAQ-EVALITA-16/blob/master/italian_stopwords.txt
[6] https://nutch.apache.org
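The parsing and stopword-removal steps from Sections 2.1 and 2.2 can be sketched as follows. This is a minimal illustration, not the actual parser used in the system: the STOPWORDS set here is a tiny stand-in for the merged 546-word corpus, and the field layout follows the sample record shown above.

```python
# Sketch of the Section 2 pre-processing: split a semicolon-separated
# FAQ record into its fields, then drop Italian stopwords.
# STOPWORDS is a small stand-in for the merged 546-word corpus.

STOPWORDS = {"di", "il", "la", "ed", "per", "cosa", "l"}

def parse_record(line):
    """Split an 'id;question;answer;tags' record into a dict (tags are comma-separated)."""
    faq_id, question, answer, tags = line.split(";")
    return {
        "id": faq_id,
        "question": question.strip(),
        "answer": answer.strip(),
        "tags": [t.strip() for t in tags.split(",")],
    }

def remove_stopwords(text):
    """Keep only the tokens that are not in the stopword list."""
    return [w for w in text.lower().split() if w not in STOPWORDS]

record = parse_record(
    "193;Cosa significa AEEGSI?;l'Autorità per l'Energia Elettrica "
    "il Gas ed il Sistema Idrico.;acqua, acquedotto, distribuzione, AEEGSI"
)
tokens = remove_stopwords(record["question"])  # ['significa', 'aeegsi?']
```

In the actual pipeline, each such record was written to its own ID.txt file before being fed to Nutch.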
1. Run a generic Java program taking the ids (from ID.txt) as the input to generate the URL seeds.

2. The Injector injects the list of seed URLs into the crawlDB.

3. The Generator takes the list of seed URLs from the crawlDB, forms a fetch list, and adds a crawl_generate folder to the segments.

4. These fetch lists are used by the fetchers to fetch the raw content of the documents, which is then stored in the segments.

5. The Parser is called to parse the content of each document, and the parsed content is stored back in the segments.

6. The links are inverted in the link graph and stored in the LinkDB.

7. The terms present in the segments are indexed, and the indices are updated in the segments.

8. Information on the newly fetched documents is updated in the crawlDB.

4 Experiments

The corpus obtained after pre-processing was experimented upon by means of various methodologies. A total of 1132 FAQs were available in the test data set. A prototype system was created by feeding the input data into Nutch. We performed two separate runs so as to carry out a comparative study between unprocessed and pre-processed data. We used Nutch's own configuration for the indexing, searching and ranking of the data in one of the runs, and implemented our own configuration for the other run. The ranking provided by Nutch may be explained using the following equation (Lucene's practical scoring function):

score(q,d) = coord(q,d) · queryNorm(q) · Σ_{t in q} [ tf(t in d) · idf(t)² · t.boost() · norm(t,d) ]

where:

1. queryNorm(): indicates the normalization factor for the query.

2. coord(): indicates how many of the query terms are present in the given document.

3. norm(): a score indicating the field-based normalization factor.

4. tf: term frequency.

5. idf: inverse document frequency.

6. t.boost(): a score indicating the importance of a term's occurrence in a particular field.

Apart from this, we developed our own configuration, which combined the traditional AND/OR search with the combinatorics approach. To implement the combinatorics approach, we split the query by the space separator and generated all possible combinations of the words in the query. This is the methodology adopted in generating the subsets of a given set: given n words in a query after removing stopwords, we would have 2^n − 1 possible combinations of the query. These were then used for searching by Nutch, and the results were ranked using the ranking algorithm we developed. The benefit of this approach is that the search is exhaustive, so the maximum number of relevant results can be retrieved with a proper ranking algorithm.

This approach can be explained using the following example. Consider the query:

numero verde aqp

For this query, all the possible combinations would be created in the following order:

numero verde aqp
numero verde
verde aqp
numero aqp
numero
verde
aqp
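The subset generation described above can be sketched as follows. This is a minimal sketch assuming combinations are emitted longest first with the original word order preserved inside each combination; the tie-ordering within a given length may differ from the listing above.

```python
from itertools import combinations

def query_combinations(query, stopwords=()):
    """Generate all 2^n - 1 non-empty word combinations of a query,
    longest first, preserving the original word order within each."""
    words = [w for w in query.split() if w not in stopwords]
    combos = []
    for size in range(len(words), 0, -1):        # n, n-1, ..., 1 words
        for combo in combinations(words, size):  # keeps word order
            combos.append(" ".join(combo))
    return combos

print(query_combinations("numero verde aqp"))
```

For the three-word query above this yields 2^3 − 1 = 7 sub-queries, each of which can then be submitted to Nutch and the merged results ranked.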