


Building Standard Dataset for Quran Tafseer

Conference Paper · December 2013

DOI: 10.1109/NOORIC.2013.64





2013 Taibah University International Conference on Advances in Information Technology for the Holy Quran and Its Sciences

Building Standard Dataset for Quran Tafseer

Mohammed Bakri Bashir
Department of Computer Science, Faculty of Science and Technology, Shendi University, Shendi, Sudan

Muhammad Shafie A. Latiff
Department of Computer Science, Faculty of Computing, Universiti Teknologi Malaysia, Johor, Malaysia

Amin Mohammed Salih
Department of Information Technology, Sudanese Company for Electricity Distribution, Khartoum, Sudan

Abstract- The growing number of scholars and students of Quran Tafseer and its sciences has led to an increase in computer-based research. Evaluating this research requires computer-based experiments performed on real Quran Tafseer documents. The available Tafseer documents are text or image files in Arabic, whereas most computer programming is done in English. Furthermore, researchers have difficulty comparing their work with that of other researchers. A standard Tafseer dataset is therefore needed for current research. This paper proposes a Tafseer dataset for researchers in the Quran Tafseer and information processing fields. The dataset is organized in XML format to provide simple access and usage from computer applications. A computer program was developed to create the dataset in XML structure. Because the dataset is in XML, users of computer applications can access it and manipulate its content easily.

Keywords- Holy Quran; Quran Tafseer; information retrieval; Tafseer dataset; XML dataset.

I. INTRODUCTION
The last decades have witnessed the use of computers for research in Quran Tafseer and its sciences [1]. This corresponds with an increase in the amount of research and the number of applications in the Quran Tafseer area. Additionally, improvements in this research and the harnessing of several computing fields to enhance it have produced a number of advantages. First, a large number of computer-based and web-based applications have been developed to provide Quran Tafseer services for Muslims, such as [2], and to help in memorizing the Quran, such as [3]. Second, there are applications concerned with teaching and learning the Quran and its sciences, delivered as e-learning applications, such as [4, 5]. Third, some research and applications use recent technologies such as the semantic web, ontologies, and data mining to develop different types of information retrieval and extraction, such as [1, 6, 7].

The rapid increase in research in the Quran fields has been accompanied by the need to evaluate and test this research. Validation should take into account that Quran Tafseer datasets differ from other scientific datasets, because the structure of Quran Tafseer books differs from the structure of other scientific books. Moreover, evaluating Quran Tafseer research on real datasets produces realistic results and evaluations. Lastly, researchers in the Quran Tafseer area need a standard dataset. These issues highlight the importance of creating and developing a Quran Tafseer dataset.

This paper describes a dataset of a Quran Tafseer book that holds, for each verse of the Quran, its explanation and a list of the Hadiths and verses used to explain it. Including the verses used in the explanation as a list eliminates confusion for researchers and shows the relations among the verses. Additionally, these relations support deeper analysis in semantic research. The objective of this research is to collect data from Quran Tafseer books and organize it in dataset format to provide evaluation and test resources for Quran Tafseer researchers.

The paper starts with a discussion of current studies that have created Quranic and non-Quranic datasets. Section III describes the design and development phases that were applied to create the QurTafData dataset. These phases include selecting the data source and the steps followed to build the dataset. The features of the QurTafData dataset are also explained in this section. The paper concludes with a conclusion and future work section that mentions the unsolved issues and challenges to be addressed in future studies.

II. RELATED WORK
Computer science research in the Islamic field is not a new area, as many studies have been proposed in the literature [1]. Several datasets have been created for different fields.

QurSim [8] is a semantic-based dataset, created from the Quranic text, that contains the relations and similarities between Quran verses. QurSim can be used in computational linguistics for short-text relatedness and similarity. Additionally, the dataset is suitable for evaluating machine translation and paraphrase analysis. QurAna [9] is a large ontology-based dataset in which personal pronouns are connected with their antecedents. QurAna uses a list of ontological concepts to maintain the antecedents.

There are many non-Islamic XML datasets available, such as Data Bases Logic Programming (DBLP) [10, 11], which contains bibliographic information for articles published in computer science journals and proceedings. Additionally, the British Library Catalogue Dataset [12] contains around 2.7 million metadata records for books indexed in the British Library since 1950. The DB-Research dataset [13], on the other hand, is composed of research articles published in computer science, collected from 19 web sites such as DBLP and CiteSeer, which use different schemas and structures to present the data.

This research designed and created a Quran Tafseer dataset that scholars and researchers can use to evaluate their research and applications. The new dataset, called QurTafData, contains the Tafseer Ibn Kathir book, which includes the Tafseer (explanation) of all verses in the Holy Quran. The dataset was developed by creating XML schemas that represent the structure and contents of all verses in the Ibn Kathir book. The process of creating the dataset follows several steps, explained in the following sections.

A. Choosing the Data Source
The abundance of Quranic sources makes selecting a data source a difficult task. A large number of websites and books contain Tafseer documents and are available to end users. Ibn Kathir's Tafseer [14] was chosen as the data source for building the dataset. It is a lengthy book that explains the Quran in detail with the support of Quran verses and the Prophet's Hadiths. Moreover, it is one of the books most widely acknowledged by Muslims. Ibn Kathir's real name is Ismail bin Omar bin Kathir. He was born in Syria in 700H and died in 774H. He is the author of a famous Quran commentary, written in Arabic, that connects the Quran verses with the Prophet's Hadiths. Ibn Kathir described the methodology he applied in commenting on the Quran verses. He used other verses on the same subject to explain the current verse. He supported the commentary with the Prophet's sayings (Hadiths), and he used the comments of the Sahabah (the Prophet's Companions), such as Ibn Omar and Ibn Abas. This methodology means that the commentary on one verse includes one or several verses and Hadiths.

The next step was to determine the source of the data. The website of the King Fahd Complex for the Printing of the Holy Qur'an [15] was selected as the source for building the dataset. The selection was based on two factors. First, the website is trusted by Muslims. Second, the structure and organization of the website permit the user to download documents from it. Additionally, the structure of the web pages allows a developer to write a program that downloads the pages and extracts the information into XML format.

B. Dataset Build Processes
A software program was developed to build the dataset, as depicted in Figure 1. The dataset is created in several steps, illustrated in Figure 2 and discussed in the following section:
• Crawling the HTML file.
• Extracting information and creating the XML document.
• Refining the XML file.

Figure 1. Crawler program interface.

1) Crawling the HTML file: The document was downloaded from the Quran Complex website, where it was presented as HTML files. Each HTML file represents one or multiple pages of the Ibn Kathir Tafseer book. The web pages were crawled and downloaded page by page. First, the seed (the URL of the first page of Ibn Kathir on the web site) was obtained, and the HTML file was downloaded and saved to the hard disk. Next, the URL of the next web page was extracted from the downloaded seed page, and the second page was downloaded. These steps were repeated until all the web pages of the Ibn Kathir book had been downloaded.

2) Extracting information and creating the XML document: This step began after the book pages had been downloaded as HTML files. The metadata of the verse, such as the verse number, Sura number, and Sura name, were extracted
from the HTML file. Additionally, the metadata of the verses used in the explanation were extracted, namely the location of the verse in the Quran (Sura), the number of the verse, and the verse text itself. Furthermore, the metadata of the Hadiths used in the commentary was also extracted. The extracted metadata were compiled into an XML record, which was appended to the end of the XML file. The structure of the dataset record is shown in Figure 3.

Figure 2: QurTafData dataset building flowchart. (Start; get the seed of the first page; download the HTML file and save it to the hard disk; if it is not the last page, extract the next URL from the file and repeat; read each file from the hard disk, extract its metadata, and add the metadata to the XML file until the last file; read each verse from the XML file and remove it if it is a duplicate, until the last verse.)

3) Refine XML file: This is the last step, conducted after extracting the information and creating the XML dataset. This step has three functions. First, the downloaded HTML files are reviewed to check that their content is correct and undamaged; a damaged file needs to be downloaded again. Second, the syntax of the XML file is checked to ensure that the file is built correctly; any error found needs to be fixed. Third, the number of verses in the Quran covering one topic varies from one verse to several verses. A problem occurred when several verses covered the same topic: Ibn Kathir handled these verses as one verse, as in the following example from Sura As-Saaffaat (6-10):

"Indeed, We have adorned the nearest heaven with an adornment of stars (6) And as protection against every rebellious devil (7) [So] they may not listen to the exalted assembly [of angels] and are pelted from every side, (8) Repelled; and for them is a constant punishment, (9) Except one who snatches [some words] by theft, but they are pursued by a burning flame, piercing [in brightness]. (10)"

Such groups of verses need to be refined manually, because extracting the information from the text requires human judgment. Lastly, the XML file is refined by eliminating any repeated record or incorrect text.

Figure 3: QurTafData dataset structure.

C. The features of the QurTafData dataset:
QurTafData is a dataset that contains the Tafseer Ibn Kathir book, in which each verse is represented as an XML node. Each node holds the explanation of one verse: the verses used in the explanation are grouped in one sub node, the Hadiths in another sub node, and the explanation text is stored in a separate sub node.
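As an illustration of the build steps and the node layout described above, the following Python sketch follows "next page" links from a seed, builds one XML record per verse with separate sub nodes, removes duplicate records, and checks that the result is well-formed. It is a minimal sketch under stated assumptions: the element and attribute names (`qurtafdata`, `verse`, `verses`, `verse_ref`, `hadiths`), the `fetch` and `find_next_url` callables, and all sample texts are hypothetical. The paper's actual schema is the one shown in Figure 3, and the HTML parsing of the source website is not reproduced here.

```python
import xml.etree.ElementTree as ET


def crawl(seed_url, fetch, find_next_url):
    """Step 1: download pages by following 'next page' links from the seed.

    `fetch` and `find_next_url` are hypothetical callables supplied by the
    caller; the real site layout is not described in the paper.
    """
    pages, url, seen = [], seed_url, set()
    while url is not None and url not in seen:
        seen.add(url)                  # guard against link cycles
        page = fetch(url)
        pages.append(page)
        url = find_next_url(page)      # None marks the last page
    return pages


def make_record(sura, number, text, explanation, verses, hadiths):
    """Step 2: build one <verse> node with separate sub nodes (assumed names)."""
    node = ET.Element("verse", sura=str(sura), number=str(number))
    ET.SubElement(node, "text").text = text
    ET.SubElement(node, "explanation").text = explanation
    refs = ET.SubElement(node, "verses")
    for ref_sura, ref_number, ref_text in verses:
        ET.SubElement(refs, "verse_ref", sura=str(ref_sura),
                      number=str(ref_number)).text = ref_text
    hadith_list = ET.SubElement(node, "hadiths")
    for hadith in hadiths:
        ET.SubElement(hadith_list, "hadith").text = hadith
    return node


def remove_duplicates(root):
    """Step 3: keep only the first record for each (sura, number) pair."""
    seen = set()
    for node in list(root.findall("verse")):
        key = (node.get("sura"), node.get("number"))
        if key in seen:
            root.remove(node)
        else:
            seen.add(key)
    return root


def is_well_formed(xml_text):
    """Step 3: syntax check; parsing raises ParseError on malformed XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Placeholder "site": each URL maps to (page content, next URL or None).
site = {"p1": ("page one", "p2"), "p2": ("page two", None)}
pages = crawl("p1", fetch=lambda url: site[url],
              find_next_url=lambda page: page[1])

dataset = ET.Element("qurtafdata")
# Placeholder texts and references, not taken from the book.
dataset.append(make_record(37, 6, "verse text", "explanation text",
                           [(1, 1, "related verse text")], ["hadith text"]))
dataset.append(make_record(37, 6, "verse text", "explanation text", [], []))
remove_duplicates(dataset)             # drops the repeated record
xml_text = ET.tostring(dataset, encoding="unicode")
```

Passing `fetch` and `find_next_url` in as parameters keeps the crawling loop independent of any particular website, which also makes the loop testable without network access.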

This arrangement of the Ibn Kathir Tafseer gives the dataset several features:
• Each verse is saved in a node of the XML file, while the verses used in its explanation are stored in a sub node of the main verse node. This prevents duplication of verses in the main nodes. Additionally, users of the dataset obtain accurate results, because verse lookups are performed on the top-level verse nodes only, while the explanation is held in the sub nodes.
• The arrangement of the explanation's verses in a sub node shows the relation of the verse to other verses, as well as the relation of the Hadiths to the verse.
• The Holy Quran has 6236 verses distributed among 114 chapters. Approximately half of these verses are similar, and 98 verses are repeated 181 times [16].
• The information of each verse is stored in a sub node that contains the chapter of the verse. This prevents confusion in the case of similar verses.
• QurTafData is organized in XML format so that a large number of users and applications can use the dataset.

Figure 4: Verses retrieval based on one verse.
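To illustrate how these features support retrieval, the Python sketch below looks up one verse by chapter and verse number and returns the verses and Hadiths listed in its explanation. The XML fragment, the element and attribute names, and the placeholder texts are assumptions made for illustration only and do not reproduce the paper's schema or data.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment; tag and attribute names are illustrative only.
SAMPLE = """
<qurtafdata>
  <verse sura="2" number="1">
    <text>verse text A</text>
    <explanation>explanation of A</explanation>
    <verses>
      <verse_ref sura="3" number="1">related verse text</verse_ref>
    </verses>
    <hadiths><hadith>hadith text</hadith></hadiths>
  </verse>
  <verse sura="3" number="1">
    <text>verse text A</text>
    <explanation>explanation of the similar verse</explanation>
    <verses/>
    <hadiths/>
  </verse>
</qurtafdata>
"""


def find_verse(root, sura, number):
    """Search top-level verse nodes only, so verses quoted inside
    explanations (stored as verse_ref sub nodes) never match."""
    for node in root.findall("verse"):
        if node.get("sura") == str(sura) and node.get("number") == str(number):
            return node
    return None


def explanation_references(root, sura, number):
    """Return (verses, hadiths) used in the explanation of one verse."""
    node = find_verse(root, sura, number)
    if node is None:
        return [], []
    verses = [(int(ref.get("sura")), int(ref.get("number")), ref.text)
              for ref in node.findall("verses/verse_ref")]
    hadiths = [h.text for h in node.findall("hadiths/hadith")]
    return verses, hadiths


root = ET.fromstring(SAMPLE.strip())
verses, hadiths = explanation_references(root, 2, 1)
```

Because the chapter and verse number are stored with each node, the two sample entries with identical text remain distinguishable, which mirrors the handling of similar verses described in the list above.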


This paper focuses on designing and developing an Islamic dataset that researchers can use to evaluate their research. A prototype website was developed to test and evaluate the dataset. PHP and HTML were used to develop the website: PHP retrieves the information from the dataset, while HTML is used to build the user interface. A number of retrieval tasks were conducted online. By entering the Juzu number, the Surah, and the verse number, scholars can obtain all the verses mentioned in the explanation of the verse.

Figure 4 shows a screenshot of a retrieval task for the verses mentioned in the explanation of another verse. The screenshot shows the verses in Arabic, while other information, such as the Juzu number and the Surah name, is displayed in English. Additionally, scholars and knowledge seekers can explore a verse's explanation by selecting the verse from the result list. Furthermore, users can retrieve all the verses that appear in a whole Surah. Users can also retrieve information based on different options: verses, Hadiths, or explanation.

This paper presented the QurTafData dataset as a Quran Tafseer resource for Islamic researchers, scholars, and students in the computer science field, especially those concerned with information retrieval and data mining. The dataset will enable researchers to evaluate their research and thus help progress in this research field. QurTafData can contribute significantly to producing high-quality research.

Below are several suggestions for improving the quality of the dataset in future work:
1. The dataset should store each verse separately from the other verses. However, several sets of verses are explained together. Consequently, the dataset needs to be enhanced to separate these sets and store each verse on its own.
2. QurTafData contains the Tafseer Ibn Kathir book. Extending the dataset with other Tafseer books, such as Alqurtoby, will enhance its quality and increase its size.
3. Adding Tafseer books in other languages will provide a multi-language dataset, so that non-Arabic speakers will be able to use it.
4. The Prophet's Hadiths need to be classified to differentiate between weak Hadiths and correct Hadiths.

REFERENCES
[1] Q. ul Ain and A. Basharat, "Ontology driven Information Extraction from the Holy Qur’an related Documents," in 26th IEEEP Students Seminar, 2011.

[2] R. binti Hamzah, "Visualizing Surah Al Baqarah: The New Innovation Of Reciting Al Quran."

[3] A. Muhammad, W. M. M. Zia ul Qayyum, S. Tanveer, A. Martinez-Enriquez, and A. Z. Syed, "E-Hafiz: Intelligent System to Help Muslims in Recitation and Memorization of Quran," Life Science Journal, vol. 9, 2012.

[4] Y. O. M. Elhadj, "E-Halagat: An E-Learning System for Teaching the Holy Quran," TURKISH ONLINE, vol. 9, p. 53, 2010.

[5] K. Musbahtiti, M. R. Saady, and A. Muhammad, "Comprehensive e-Learning system based on Islamic principles," in Information and Communication Technology for the Muslim World (ICT4M), 2013 5th International Conference on, 2013, pp. 1-5.

[6] A. R. Yauri, R. A. Kadir, A. Azman, and M. A. A. Murad, "Quranic-based concepts: Verse relations extraction using Manchester OWL syntax," in Information Retrieval & Knowledge Management (CAMP), 2012 International Conference on, 2012, pp. 317-321.

[7] M. Shoaib, M. Nadeem Yasin, U. Hikmat, M. I. Saeed, and M. S. H. Khiyal, "Relational WordNet model for semantic search in Holy Quran," in Emerging Technologies, 2009. ICET 2009. International Conference on, 2009, pp. 29-34.

[8] A.-B. M. Sharaf and E. Atwell, "QurSim: A corpus for evaluation of relatedness in short texts," in The International Conference on Language Resources and Evaluation (LREC), Istanbul, Turkey, 2012, pp. 2295-2302.

[9] A.-B. M. Sharaf and E. Atwell, "QurAna: Corpus of the Quran annotated with Pronominal Anaphora," in The International Conference on Language Resources and Evaluation (LREC), Istanbul, Turkey, 2012, pp. 130-137.

[10] M. Ley, "DBLP-some lessons learned," Proceedings of the VLDB Endowment, vol. 2, pp. 1493-1500, 2009.

[11] M. Ley, "The DBLP Computer Science Bibliography: Evolution, Research Issues, Perspectives," in String Processing and Information Retrieval. vol. 2476, A. F. Laender and A. Oliveira, Eds., ed: Springer Berlin Heidelberg, 2002, pp. 1-10.

[12] B. N. B. (BNB). (2013, Jun 13).

[13] I. Tatarinov and A. Halevy, "Efficient query reformulation in peer data management systems," presented at the Proceedings of the 2004 ACM SIGMOD international conference on Management of data, Paris, France, 2004.

[14] I. Kathir and A. al-Fida’Isma‘il, "Tafsir al-Qur’an al-Azim," Beirut: Dar al-Qalam, nd [1983, vol. 2, pp. 403-430, 1990.

[15] T. K. F. C. f. t. P. o. T. H. Qur'an. (2013). asp?t=KATHEER&TabID=3&SubItemID=1&l=arb&SecOrder=3&SubSecOrder=3.

[16] M. Nassourou, "Towards a Knowledge-Based Learning System for The Quranic Text," ed. Würzburg: Universitätsbibliothek der Universität Würzburg, 2012.

