Chapter 1

Introduction

1.1 Overview
Information retrieval and extraction algorithms are used when information must be obtained from
documents of any type to support some application. They form an important field within natural
language processing (NLP) because they underpin several important NLP applications. Information
retrieval can be applied to text, speech, video, and images. Text categorization is an important
research area in NLP that uses information retrieval and extraction algorithms to design and
implement solutions. Text categorization, also called classification, includes many tasks such as
tagging and word sense disambiguation. Classification may also include language identification for a
document of unknown origin, which eventually helps to identify the region it came from
[CITATION pro99]. Text classification is the term used by computing scholars, whereas linguistics
scholars tend to call the task stylometry when it concerns classifying a document by its author's
style. For example, an NLP classification task that may interest linguists is deciding whether a
newly discovered poem was written by Shakespeare or by another author.

Moreover, text classification can use information retrieval and extraction to detect text that is
important, interesting, or suspicious to a user, based on features that the user defines. Detecting
suspicious text is the subject of research carried out at the University of Leeds as part of the
“Making Sense” project. That project can be useful in detecting "suspicious" texts in a large corpus
of surveillance and intercepted texts from terrorist suspects, by implementing an algorithm that
classifies different data sets into interesting and non-interesting. This will be accomplished by
designing a system that retrieves the information needed to implement the classifier, and by trying
a number of approaches to evaluate the classifier's accuracy.

In information retrieval, a subject must be defined in order to retrieve the required information,
and features that specify the subject must be worked out so that the classifier can use them when
classifying. According to ChengXiang Zhai [CITATION Zha09], one way of representing the data used in
text mining is topic model labelling. Labelling may be based on the high-probability words of a
topic model: the general steps are to generate a set of candidate phrases by parsing the text data
and to rank them using probabilistic methods that indicate how appropriate they are.
The top-rated phrases are then selected to label the topics in the model. In this project, however,
the data will be labelled in a different manner, which is explained later in the design section.

The data set selected at the beginning of the project was the Arabic version of the Holy Quran,
since it is an openly available document that can be accessed from many sources on the internet.
The English version of the Holy Quran was provided by Claire Brierley, a researcher at the
University of Leeds, whereas the Arabic version was downloaded from the internet and structured for
use in the implementation. The Arabic version of the subset was created manually using the English
subset that was provided and the full Arabic version of the Holy Quran that was downloaded. In
addition, the Hadith was added as a new data set on which to run the classifier and test whether it
behaves as expected. This data set was created manually, structured, and produced in a specific
format for the classification.
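
The topic-labelling steps mentioned in the overview (generate candidate phrases, rank them
probabilistically, keep the top-rated ones) can be illustrated with a minimal Java sketch. The class
name, the method names, and the scoring rule (the mean topic probability of a phrase's words) are
assumptions made for illustration only; they are not the ranking method of the cited work, and this
project labels its data differently.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// Hypothetical helper: ranks candidate phrases as labels for one topic.
class TopicLabelSketch {

    // Assumed scoring rule: the mean topic probability of the phrase's words.
    // Words that do not occur in the topic contribute a probability of 0.
    static double score(String phrase, Map<String, Double> topicWordProb) {
        String[] words = phrase.toLowerCase().trim().split("\\s+");
        double sum = 0.0;
        for (String w : words) {
            sum += topicWordProb.getOrDefault(w, 0.0);
        }
        return sum / words.length;
    }

    // Returns the k highest-scoring candidate phrases, best first.
    static List<String> topLabels(List<String> candidates,
                                  Map<String, Double> topicWordProb, int k) {
        List<String> ranked = new ArrayList<>(candidates);
        ranked.sort(Comparator.comparingDouble(
                (String p) -> score(p, topicWordProb)).reversed());
        return ranked.subList(0, Math.min(k, ranked.size()));
    }
}
```

Here the candidate phrases are passed in ready-made; in practice they would come from parsing the
text, as described above.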

1.2 Aim
The aim of the project is to classify the verses of the Arabic version of the Holy Quran into two
classes: interesting and non-interesting. This will be implemented using supervised learning
algorithms that work with training sets and the Quran corpus. The system is built on classification
based on predefined features that are used in open source software tools. These tools will help in
analysing the performance of the features selected for classifying the data sets, and the
performance of the classifiers that are defined. The result of the classification system is a table
that records the correct and incorrect classifications of both interesting and non-interesting
verses, together with the percentage accuracy of the system. The two classes follow from the main
idea of the project, detecting suspicious text, as explained in the overview; "suspicious" was
changed to "interesting" because the selected data set is the Holy Quran.
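
The result table described above (correct and incorrect classifications for each class, plus the
accuracy as a percentage) can be sketched as a simple tally. This is a minimal illustration; the
class and method names are hypothetical and it does not reproduce the output of any particular tool.

```java
// Hypothetical tally of a two-class result table.
class ResultTableSketch {

    // counts[actual][predicted]; index 0 = interesting, 1 = non-interesting.
    final int[][] counts = new int[2][2];

    // Records one classified verse.
    void record(boolean actualInteresting, boolean predictedInteresting) {
        counts[actualInteresting ? 0 : 1][predictedInteresting ? 0 : 1]++;
    }

    int total() {
        return counts[0][0] + counts[0][1] + counts[1][0] + counts[1][1];
    }

    // Accuracy = correctly classified verses / all verses, as a percentage.
    double accuracyPercent() {
        int total = total();
        if (total == 0) {
            return 0.0; // nothing recorded yet
        }
        return 100.0 * (counts[0][0] + counts[1][1]) / total;
    }

    // Plain-text rendering of the table and the accuracy.
    String asTable() {
        return String.format(
                "actual \\ predicted   interesting  non-interesting%n"
              + "interesting          %11d  %15d%n"
              + "non-interesting      %11d  %15d%n"
              + "accuracy: %.1f%%%n",
                counts[0][0], counts[0][1], counts[1][0], counts[1][1],
                accuracyPercent());
    }
}
```

The diagonal cells hold the correct classifications and the off-diagonal cells hold the wrong ones,
so the accuracy is the diagonal sum divided by the total.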

1.3 Objectives
The main objectives of the project are to:
 • Conduct a good background review of text classification and machine learning
algorithms.
 • Build a system that classifies data sets into interesting and non-interesting based on
features that the user defines to describe the interesting set. In this case, the
interesting set is the verses of the Holy Quran that relate to the concept of the hereafter.
 • Research how to define useful features in order to improve the text classification.
In this project, the defined features must help to identify the verses that
concern the hereafter. These may include signs of the hereafter, names of the hereafter, and
the rewards in the hereafter.
 • Generate class labels for all verses in the Holy Quran, ready for training and testing.
 • Use 10-fold cross-validation as the testing method.
 • Apply WEKA's classification options to the designed .arff file.
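
The idea behind 10-fold cross-validation, listed among the objectives above, can be sketched in
plain Java: the instances are shuffled and dealt into ten folds, and each fold is held out once for
testing while the other nine are used for training. The class and method names are hypothetical,
and the split is a simple unstratified one chosen for illustration; a tool such as WEKA performs
cross-validation itself, so this sketch only shows the principle.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical illustration of k-fold splitting over instance indices.
class CrossValidationSketch {

    // Shuffles the indices 0..n-1 and deals them round-robin into k folds.
    static List<List<Integer>> folds(int n, int k, long seed) {
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            indices.add(i);
        }
        Collections.shuffle(indices, new Random(seed));

        List<List<Integer>> folds = new ArrayList<>();
        for (int f = 0; f < k; f++) {
            folds.add(new ArrayList<>());
        }
        for (int i = 0; i < n; i++) {
            folds.get(i % k).add(indices.get(i));
        }
        return folds;
    }

    // Training indices for one round: every fold except the held-out one.
    static List<Integer> trainingIndices(List<List<Integer>> folds, int heldOut) {
        List<Integer> train = new ArrayList<>();
        for (int f = 0; f < folds.size(); f++) {
            if (f != heldOut) {
                train.addAll(folds.get(f));
            }
        }
        return train;
    }
}
```

With k = 10, each verse is used for testing exactly once and for training nine times, and the
reported accuracy is taken over all ten rounds.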

1.4 Minimum Requirements


The minimum requirements are:
 • Understanding how categorization should be done in this project and how to build an
accurate classifier.
 • Building a Java program that retrieves the required information from the Holy Quran and
generates an .arff file for use in classification tools.
 • Training and testing on the data set.
 • Implementing the classifier and evaluating the results.
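
The second requirement, a Java program that turns verses into an .arff file, could look roughly like
the sketch below. It is only an illustration under stated assumptions: the class name, the method
names, the attribute names (has_kw1, has_kw2, ...) and the keyword-presence feature are all
hypothetical, and the actual features are defined in the design section.

```java
import java.util.List;

// Hypothetical sketch: builds ARFF text from verses, keywords and class labels.
class ArffSketch {

    // Assumed feature: 1 if the verse contains the keyword, otherwise 0.
    static int containsFeature(String verse, String keyword) {
        return verse.contains(keyword) ? 1 : 0;
    }

    // Returns the full ARFF document as a string.
    static String toArff(String relation, List<String> keywords,
                         List<String> verses, List<Boolean> interesting) {
        StringBuilder sb = new StringBuilder();
        sb.append("@relation ").append(relation).append("\n\n");

        // One numeric attribute per keyword, named by position to avoid quoting issues.
        for (int i = 0; i < keywords.size(); i++) {
            sb.append("@attribute has_kw").append(i + 1).append(" numeric\n");
        }
        sb.append("@attribute class {interesting,non-interesting}\n\n");

        // One comma-separated row per verse, with the class label last.
        sb.append("@data\n");
        for (int v = 0; v < verses.size(); v++) {
            for (String kw : keywords) {
                sb.append(containsFeature(verses.get(v), kw)).append(',');
            }
            sb.append(interesting.get(v) ? "interesting" : "non-interesting").append('\n');
        }
        return sb.toString();
    }
}
```

The returned text can be saved with an .arff extension and loaded into a classification tool; the
class attribute is placed last, which is the position such tools conventionally expect.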
