
KIGALI INSTITUTE OF SCIENCE AND TECHNOLOGY INSTITUT DES SCIENCES ET DE TECHNOLOGIE DE KIGALI

Avenue de l'Armée, B.P. 3900 Kigali, Rwanda

FACULTY OF ENGINEERING
DEPARTMENT OF COMPUTER ENGINEERING AND INFORMATION TECHNOLOGY

A PROJECT REPORT ON

WEB ENCYCLOPEDIA INFO-BOX QUICK EXTRACTION


Submitted by ISHIMWE MUHUMUZA Emma Marie

REG.NO: GS20090098
Under the Guidance of Prof. Santhi KUMARAN

Submitted in partial fulfilment of the requirements for the award of BACHELOR OF SCIENCE DEGREE IN COMPUTER ENGINEERING

February 2012

PROJECT ID: CEIT/FT/12/13



ABSTRACT
With the growing use of mobile devices, the cell phone has become one of the world's most widely used tools for making calls and sending messages, yet it is still rarely used for searching for information. This study discusses the extraction of web information through cell phones. Conventional web information extraction is mostly based on DOM tree and HTML tag analysis. Building on those techniques and rules, the study proposes an SMS application that enables quick information extraction from a web encyclopedia. The user sends an SMS request for specific information, and the data is extracted from a web page whose URL is assigned to the keyword. The extracted data is then transformed into a format readable by any cell phone, so that the user receives trustworthy and up-to-date information.


DECLARATION

I, ISHIMWE MUHUMUZA Emma Marie, hereby declare that the work presented in this research paper is original. It has never been presented at Kigali Institute of Science and Technology or elsewhere for any award. All consulted works are cited and included in the list of references. I therefore declare this work to be wholly mine.

Emma Marie ISHIMWE MUHUMUZA Signature:



CERTIFICATE
This is to certify that the project work entitled "Web Encyclopedia Info-Box Quick Extraction" is a record of the original bona fide work done by ISHIMWE MUHUMUZA Emma Marie (REG.No: GS20090098) in partial fulfilment of the requirements for the award of the Bachelor of Science Degree in Computer Engineering of Kigali Institute of Science and Technology, during the Academic Year 2012.

.......................................... Supervisor: Prof. Santhi KUMARAN

.......................................... Head of the Department: Jonathan MWAKIJELE

Submitted for the Project Examination held at KIST on ...


LIST OF FIGURES
Fig.1.1 Gantt chart

LIST OF ABBREVIATIONS
SMS: Short Message Service
ICT: Information Communication Technology
IT: Information Technology
MUC: Message Understanding Conference
HTML: HyperText Markup Language
DOM: Document Object Model
API: Application Programming Interface
NLP: Natural Language Processing


Table of Contents
ABSTRACT
DECLARATION
CERTIFICATE
LIST OF FIGURES
LIST OF ABBREVIATIONS
Table of Contents
CHAPTER ONE: GENERAL INTRODUCTION
    Introduction
    Background
    Problem Statement
    Objectives of the project
        General Objective
        Specific objectives
    1.5 Scope of the study
    Project interest
        Individual interests
        Academic interest
        Public interest
    Organization of the study
    Gantt chart
    Expected Results
    Conclusion
CHAPTER TWO: LITERATURE REVIEW
    2.1 Introduction
        2.2.1 Extraction rules
        2.2.2 Mechanisms of data extraction
        2.2.3 Verifying the Extracted Data
    2.3 Terms and technologies
        2.3.1 Encyclopedia
        2.3.2 Info-Box
        2.3.3 DOM tree
        2.3.4 Data Extraction
        2.3.5 Cell phone
        2.3.6 Web wrapper
        2.3.8 Django
    2.4 Proposed methodology
    2.5 System requirements
        2.5.1 Software requirements
        2.5.2 Hardware requirements
    2.6 Conclusion
References
    Books and Articles
    Internet Sources


CHAPTER ONE: GENERAL INTRODUCTION


Introduction
As humanity evolves, events multiply and mark the change from one generation to the next. Some events become known to a large number of people, while others are more or less ignored. Through those events, some people become known as celebrities because of what they have discovered, their life experiences or their talent for entertaining society. They include people known for political or entertainment reasons, such as heads of state, musicians and movie actors. Because those celebrities are well known, people are prompted to search for specific information about them. Everyone tries to find the easiest and cheapest way to get information, which sometimes costs money. When it comes to paying for information, people pay for local information content in some form: for their local print newspaper, for an application on their mobile device or for access to special information online. [1]

The usual source of trustworthy information is a continuously updated encyclopedia that holds information about high-profile personalities; the alternative is to wait for television or radio programs to talk about them. With those sources, people risk missing the information they need, because these services are sometimes time consuming. The rise of mobile phones has made this easier by providing real-time information to a large number of people, as cell phone owners are increasingly likely to use their phones to get all kinds of information. [2]

The easiest and quickest way of extracting information is therefore an SMS application that lets people search a web encyclopedia with a single SMS: the user sends a request from a cell phone and receives a response SMS containing the requested information.

In this research, the Python programming language and the Django framework will be used to develop a tool that collects and extracts data from the web encyclopedia and transforms it into a format readable by a cell phone as a text message. The resulting data can then be sent to the cell phone user as an SMS.

Background
For a long time, people have obtained information from different sources such as newspapers, television and radio, applications on their mobile devices or special information accessed online. Most of those sources, however, do not provide adequate information, and they are not accessible to everyone. To extract data from the internet, some researchers have created tools that extract data from web sites and transform it into a structured data format. The resulting data can then be used to build new applications without having to deal with unstructured data. [3]

Those ways of getting information lack some features because of how their algorithms are structured. The Web Encyclopedia Info-Box Quick Extraction application therefore proposes a new extraction approach.

Problem Statement
Most people own a cell phone, but they cannot get information quickly from their mobile devices without spending time and money to extract what they need. What would be the easiest way for all cell phone users, whether they own a sophisticated or a very simple phone, to retrieve trustworthy information in real time and at the lowest price?

Objectives of the project


General Objective
The general objective of this research is to develop an application that is suitable for everyone and that helps people quickly search for information about their preferred celebrities through their cell phones.

Specific objectives
The specific objectives of this research are:
1. To identify the need for SMS-based applications.
2. To design and implement quick information extraction using a cell phone.
3. To test and validate web information extraction on mobile devices.
4. To extend the use of the cell phone beyond making calls and sending or receiving messages.

1.5 Scope of the study


The scope of the study is defined both by content and by geography. Although a great deal of information can be accessed in different areas, this research is limited to brief information about various celebrities: names, date of birth, nationality, origin, political party, religion, spouse, and genres and occupations. Geographically, the application is intended for each and every cell phone user.

Project interest
Individual interests
The first interest in this research is to improve knowledge of the Python programming language and the Django framework and to fill some gaps faced in the ICT industry, thus contributing to ICT development. The second interest is to apply the practical skills and engineering approaches learned at school to solving problems.

Academic interest
1. The project helps the student to advance the necessary understanding of the Python programming language.
2. The project helps students to demonstrate what they have learned during their university studies by serving as exemplary people.

Public interest
1. The project enables cell phone users to make a quick search via their mobile devices and get trustworthy information.
2. The project helps the developer to become known in the ICT industry and to show their capabilities in solving problems in society, especially in the Information Technology domain.

Organization of the study


This research project report consists of the following chapters:

Chapter one: General introduction
The first chapter introduces the aspects of the study. It presents the problem statement, which describes the current situation that the study aims to improve and states the problem explicitly. It also describes the objectives of the study, the scope of the study, the project interest and the organization of the study.

Chapter two: Literature review
The second chapter is the literature review, which describes various theories relating to the project work.

Chapter three: Research methodology
The third chapter covers the research methodology, which describes the waterfall model as well as the principles of software engineering.

Chapter four: System analysis and design
The fourth chapter covers the system design, which describes the data modeling, use cases and sequence diagrams.

Chapter five: System implementation, testing and results
This chapter covers the implementation of the SMS application and presents the results of testing the implemented system.

Conclusion and recommendation
This is the last part of the research report. It presents the conclusions and recommendations drawn from this research project.

Gantt chart

Fig.1.1 Gantt chart

Expected Results
With the Web Encyclopedia Info-Box Quick Extraction application, cell phone users will be able to easily receive accurate and up-to-date information from a web encyclopedia based on their request. A user will send an SMS request for specific information, and the data will be extracted from a web page whose URL is assigned to the keyword. Once the data has been transformed into a format readable by any cell phone, the user will be able to read it as text.
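
The request flow described above can be sketched in Python as follows. This is only a minimal illustration under stated assumptions, not the implemented system: the base URL, the function names keyword_to_url and format_sms, and the choice to truncate the reply to a single 160-character message are all hypothetical choices made for the example.

```python
from urllib.parse import quote

# Assumption: every keyword maps to an article on one encyclopedia site.
BASE_URL = "http://en.wikipedia.org/wiki/"
# A single SMS in the GSM 7-bit alphabet carries at most 160 characters.
SMS_LIMIT = 160


def keyword_to_url(keyword: str) -> str:
    """Turn the keyword received by SMS into an article URL (hypothetical rule)."""
    title = "_".join(keyword.strip().split())
    return BASE_URL + quote(title)


def format_sms(fields: dict, limit: int = SMS_LIMIT) -> str:
    """Join the extracted fields into plain text that fits in one SMS."""
    text = "; ".join(f"{name}: {value}" for name, value in fields.items())
    if len(text) <= limit:
        return text
    # Too long for one message: cut the text and mark the cut with dots.
    return text[: limit - 3].rstrip() + "..."
```

For example, format_sms({"Name": "Jane Doe", "Born": "1 January 1970"}) yields a short line of text that any handset can display. Splitting long replies across several messages instead of truncating them would be an equally valid design choice.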

Conclusion
This chapter gave a general introduction to the project. It introduced the project and described the background of how information has been extracted so far, which led to the problem statement, the objectives of the project and the limitations of the study. The interest of the study was discussed, and the timeline for accomplishing every task was presented. The chapter closes with this brief summary.

CHAPTER TWO: LITERATURE REVIEW


2.1 Introduction
Nowadays, cell phone usage has deeply penetrated society. The cell phone has become a daily tool not only for making calls and sending messages; its rise has also altered the environment of local news and information. Mobile devices have thus become one of the most quickly adopted consumer goods. [4] The majority of those who own a cell phone can get some kind of local news and information on their mobile devices. [5] The growth in cell phone use has brought with it a growing use of new applications, although applications are not being adopted as rapidly as cell phones themselves. In the survey, only a small number of cell phone owners reported having applications that help them get information or news about their local community. [6] There is a tremendous amount of information available on the Web, but much of it is not in a form that can be easily used by other applications. [7]

2.2 Current State


During the past decade, information extraction has been extensively studied, yielding many research results as well as working systems. Since the late 1980s, through the Message Understanding Conference (MUC), many information extraction systems have been successfully developed and quantitatively evaluated. [8] Information sources can be classified into three main types: free text, structured text and semi-structured text. Originally, extraction systems focused on free text.

Natural Language Processing (NLP) techniques were developed to extract this type of unrestricted, unregulated information; they employ the syntactic and semantic characteristics of the language to generate the extraction rules. Structured information usually comes from databases, which provide rigid or well-defined formats, so it is easy to extract through a query language such as Structured Query Language (SQL). The third type is semi-structured information, which falls between free text and structured information. Web pages are a typical example of semi-structured information, and some papers focus on extracting text information from web pages. [9] We now review various theories relating to information extraction.

2.2.1 Extraction rules
A wrapper is software used to enable a semi-structured Web source to be queried as if it were a database. These are sources where there is no explicit structure or schema, but there is an implicit underlying structure. Even text sources, such as email messages, have some structure in the heading that can be exploited to extract the date, sender, addressee, title, and body of the messages. Other sources, such as online catalogs, have a very regular structure that can be exploited to extract the data automatically. [10]

2.2.2 Mechanisms of data extraction
Lixto offers two basic mechanisms of data extraction: tree extraction and string extraction.

1. Tree extraction
For tree extraction, elements are identified by their corresponding tree paths and possibly some properties of the elements themselves. This does not necessarily identify a single element. A plain tree path is a sequence of consecutive nodes in a sub-tree of an HTML tree. In an incompletely specified tree path, stars may be used instead of element names. For simplicity, incompletely specified tree paths are referred to as tree paths. The semantics of a tree path applied to a tree region of an HTML page is defined as the set of matched elements. [10]

2. String extraction

The second extraction method relies on strings. In the HTML parse tree, strings are represented by the text of content leaves. However, a string is also associated with every node of the parse tree, available as the value of the attribute element text. String extraction has to be used, for example, when extracting the access codes of the phone numbers in lixto.html. [11]

2.2.3 Verifying the Extracted Data
A problem that has been largely ignored in extracting data from web sites is that sites change, and they change often. Kushmerick [12] addressed the wrapper verification problem by monitoring a set of generic features, such as the density of numeric characters within a field, but this approach only detects certain types of changes. In contrast, other work addresses the problem by applying machine learning techniques to learn a set of patterns that describe the information being extracted from each of the relevant fields. Since the information for even a single field can vary considerably, the system learns the statistical distribution of the patterns for each field. Wrappers can be verified by comparing the patterns of the returned data to the learned statistical distribution. When a significant difference is found, an operator can be notified or the wrapper repair process can be launched automatically.

Based on all of these theories of web information extraction, we will improve on existing extraction tools by developing a new SMS application that makes web information extraction quick: the user sends a single SMS from a cell phone and receives a response SMS containing the requested information.
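
The verification idea in 2.2.3 can be illustrated with a deliberately simplified sketch. It monitors only one generic feature, the density of numeric characters in a field, and flags a value whose density is far from what was seen in known-good samples. The function names, the tolerance of three deviations and the minimum spread (floor) are assumptions made for this example; this is not Kushmerick's algorithm nor the pattern-learning approach described above.

```python
from statistics import mean, pstdev


def digit_density(value: str) -> float:
    """Fraction of the characters in a field value that are digits."""
    return sum(ch.isdigit() for ch in value) / len(value) if value else 0.0


def learn_profile(samples):
    """Summarise known-good values of one field as (mean, spread) of digit density."""
    densities = [digit_density(sample) for sample in samples]
    return mean(densities), pstdev(densities)


def looks_valid(value, profile, tolerance=3.0, floor=0.05):
    """Return True if the value's digit density is close to the learned mean.

    `floor` (an assumed constant) keeps the accepted band from collapsing
    when the training samples happen to be almost identical.
    """
    mu, sigma = profile
    return abs(digit_density(value) - mu) <= tolerance * max(sigma, floor)
```

A profile learned from a few dates of birth accepts another date but rejects a value such as "Edit this page", which is the kind of result a wrapper returns when the page layout has changed. A single feature like this detects only some kinds of change, as noted above.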

2.3 Terms and technologies


2.3.1 Encyclopedia
An encyclopedia (also spelled encyclopaedia or encyclopædia) is a type of reference work, a compendium holding a summary of information from either all branches of knowledge or a particular branch of knowledge. Encyclopedias are divided into articles or entries, which are usually accessed alphabetically by article name. Encyclopedia entries are longer and more detailed than those in most dictionaries. [13]


2.3.2 Info-Box
Generally, info-box templates are templates that provide standardized information across related articles. In this study, an info-box is a fixed-format box under the person's picture in the top right-hand corner of an article that consistently presents brief information about that person.

2.3.3 DOM tree
The DOM tree is a cross-platform and language-independent convention for representing and interacting with objects in HTML, XHTML and XML documents. The aspects of the DOM tree may be addressed and manipulated within the syntax of the programming language in use. The public interface is specified in its application programming interface (API). [14]

2.3.4 Data Extraction
Data extraction is the act or process of retrieving data out of data sources for further data processing or data storage. The import into the intermediate extracting system is thus usually followed by data transformation. [15] Typical unstructured data sources include web pages, emails, documents, PDFs, scanned text, mainframe reports, spool files, etc. Extracting data from these unstructured sources has grown into a considerable technical challenge, whereas historically data extraction has had to deal with changes in physical hardware formats. The majority of current data extraction deals with extracting data from unstructured data sources and from different software formats. This growing practice of extracting data from the web is referred to as web scraping.

2.3.5 Cell phone
A cell phone (also known as a cellular phone, mobile phone or hand phone) is a device that can make and receive telephone calls as well as send and receive text messages over a radio link whilst moving around a wide geographic area. It does so by connecting to a cellular network provided by a mobile phone operator, allowing access to the public telephone network. By contrast, a cordless telephone is used only within the short range of a single, private base station. [16]


2.3.6 Web wrapper
A web wrapper is a tool used to extract information from the web using only a set of general rules describing the data domain. It cleanly separates out site-independent and site-specific knowledge from execution implementation. Site-independent knowledge is expressed in user-supplied domain rules, while site-specific knowledge is expressed in automatically generated context-free grammars that describe site structures. [17] A wrapper can also be written manually to extract information in a particular format.

2.3.7 Python
Python is a programming language that lets you work more quickly and integrate your systems more effectively. Python runs on Windows, Linux/Unix and Mac OS X, and has been ported to the Java and .NET virtual machines. [18]

2.3.8 Django
Django is a high-level Python web framework that encourages rapid development and clean, pragmatic design. [19] It lets you build high-performing, elegant web applications quickly.
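
To make the notions of info-box (2.3.2), DOM tree (2.3.3) and wrapper (2.3.6) more concrete, the sketch below shows a very small hand-written wrapper in Python. It parses a page into a tree with the standard library module xml.etree.ElementTree and selects the info-box rows with a tree path. The sample page, the class name "infobox", the function name extract_infobox and the person shown are all hypothetical; the sketch also assumes well-formed markup, whereas real web pages usually need a more tolerant HTML parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page (an assumption for this example).
SAMPLE_PAGE = """
<html><body>
  <div><p>Article text.</p></div>
  <table class="infobox">
    <tr><th>Name</th><td>Jane Doe</td></tr>
    <tr><th>Born</th><td>1 January 1970</td></tr>
    <tr><th>Nationality</th><td>Rwandan</td></tr>
  </table>
</body></html>
"""

# Tree path: rows of any table, at any depth, whose class is "infobox".
INFOBOX_PATH = ".//table[@class='infobox']/tr"


def extract_infobox(page: str, path: str = INFOBOX_PATH) -> dict:
    """Return the label/value pairs found in the rows matched by the tree path."""
    root = ET.fromstring(page.strip())
    info = {}
    for row in root.findall(path):
        label = row.find("th")
        value = row.find("td")
        if label is not None and value is not None:
            # itertext() gathers the text of the cell and of any nested elements.
            info["".join(label.itertext()).strip()] = "".join(value.itertext()).strip()
    return info
```

Calling extract_infobox(SAMPLE_PAGE, "./body/*/tr") uses a path in which a star stands in for an element name, in the spirit of the incompletely specified tree paths of 2.2.2; on this sample page it matches the same three rows.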

2.4 Proposed methodology


In this study we will detail some of the possible methodologies to be used for developing the SMS application. The research process will be carried out in the following steps:
1. Collecting the data by means of questionnaires
2. Analysis of data
3. Generalisation and interpretation of data
4. Presentation of results and write-up of the conclusions reached

2.5 System requirements


Our application will enable users to make a request using a cell phone and to get up-to-date, trustworthy information in real time.

2.5.1 Software requirements
The application will be developed using:
1. The Python programming language
2. The Django framework
3. Web wrappers as web extraction systems
4. MySQL as the database
5. Ubuntu 11.10 as the operating system

The Python programming language and the Django framework will be used to develop a tool for collecting and extracting data from the web encyclopedia, using web wrappers as one of the web extraction systems. Semantic search will be used to drive the information extraction based on the meaning of the given keyword, and the requested information will be stored in the database, where it is easy to retrieve through a query language such as Structured Query Language (SQL).

2.5.2 Hardware requirements
For the application to be used efficiently, the user must have a cell phone capable of sending and receiving text messages.
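
The idea of storing extracted information in a database and retrieving it with SQL, mentioned in 2.5.1, can be sketched as follows. To keep the example self-contained it uses SQLite through Python's standard sqlite3 module as a stand-in for MySQL; the table name, column names and function names are assumptions made for illustration, and the "INSERT OR REPLACE" statement is SQLite syntax that would need to be adapted for MySQL.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open the store and create the (hypothetical) infobox table if missing."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS infobox ("
        " keyword TEXT NOT NULL,"
        " field TEXT NOT NULL,"
        " value TEXT NOT NULL,"
        " PRIMARY KEY (keyword, field))"
    )
    return conn


def save_info(conn, keyword, info):
    """Store one row per extracted field, replacing older values for the keyword."""
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT OR REPLACE INTO infobox (keyword, field, value) VALUES (?, ?, ?)",
            [(keyword.lower(), field, value) for field, value in info.items()],
        )


def load_info(conn, keyword):
    """Fetch the stored fields for a keyword with a plain SQL query."""
    rows = conn.execute(
        "SELECT field, value FROM infobox WHERE keyword = ? ORDER BY field",
        (keyword.lower(),),
    )
    return dict(rows.fetchall())
```

Keeping already extracted answers in such a table would let a repeated request be answered from the database without fetching the web page again; whether and how long to keep them is a design decision this sketch leaves open.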

2.6 Conclusion
In this chapter, we discussed various theories related to the study and defined some terms and technologies used in the project description. We also discussed the proposed methodology for collecting the data and the steps to be followed during the research process. Finally, we stated the system requirements, specifying the software as well as the hardware requirements.


References
Books and Articles
[1] A. Mitchell, Deputy Director, Project for Excellence in Journalism. How mobile devices are changing community information environments, March 14, 2011.
[2] K. Purcell, Associate Director-Research, Pew Internet Project. http://www.stateofthemedia.org/2011/Mobile-survey Accessed on January 9, 2012.
[3] C. Hsu and M. Dung. Generating finite-state transducers for semi-structured data extraction from the web. Article, 23(8):521-538, 1998.
[4] Source: Pew Research Center's Project for Excellence in Journalism and Internet & American Life Project in partnership with the Knight Foundation, January 12-25, 2011 Local Information Survey.
[5] A. Survey, Technical Report 945, Norwegian Computing Center, Oslo, Norway, July 1999.
[6] W. Cohen. Recognizing structure in web pages using similarity queries. In Proc. of the 16th National Conference on Artificial Intelligence (AAAI-1999), pages 59-66, 1999.
[7] A. Craig Knoblock, University of Southern California and Fetch Technologies, 2009.
[8] L. Eikvil. Information Extraction from World Wide Web: A Survey. Technical Report 945, Norwegian Computing Center, Oslo, Norway, July 1999.
[9] Y. Zhang et al. (Eds.): APWeb 2008, LNCS 4976, pp. 383-394, 2008.
[10] A. Craig Knoblock, University of Southern California and Fetch Technologies, December 6, 1998.
[11] T. Goan, N. Benson, and O. Etzioni. A grammar inference algorithm for the World Wide Web. In Proc. of the AAAI Spring Symposium on Machine Learning in Information Access, 1996.
[12] N. Kushmerick. Regression testing for wrapper maintenance. In Proc. of the 16th National Conference on Artificial Intelligence (AAAI-1999), pages 74-79, 1999.
[17] M. E. Califf and R. J. Mooney. Relational learning of pattern-match rules for information extraction, 1999.

Internet Sources
[13] http://en.wikipedia.org/wiki/Encyclopedia. Accessed on February 10, 2012.
[14] Document Object Model (DOM), http://www.w3.org/: W3C. Accessed on February 11, 2012.
[15] http://en.wikipedia.org/wiki/Data_extraction. Accessed on February 12, 2012.
[16] http://en.wikipedia.org/wiki/Mobile_phone. Accessed on February 12, 2012.
[19] http://www.boddie.org.uk/python/HTML.html. Accessed on February 13, 2012.
