Final Proposal
FACULTY OF ENGINEERING DEPARTMENT OF COMPUTER ENGINEERING AND INFORMATION TECHNOLOGY A PROJECT REPORT ON
REG.NO: GS20090098
Under the Guidance of Prof. Santhi KUMARAN
Submitted in partial fulfilment of the requirements for the award of BACHELOR OF SCIENCE DEGREE IN COMPUTER ENGINEERING
February 2012
ABSTRACT
With the growth in mobile device use, the cell phone is becoming the world's most widely used tool for making calls and sending messages, but its use for searching for information is not yet well developed. This study discusses the extraction of web information through cell phones. Conventional web information extraction is mostly based on DOM tree and HTML tag analysis. Building on those techniques and rules, the study proposes the development of an SMS application that enables quick information extraction from a web encyclopedia. The general idea is that the user sends an SMS request to search for specific information, and the data is extracted from the web page at the URL assigned to the key word. Once the data has been transformed into a format readable by any cell phone, the user easily obtains trustworthy and up-to-date information.
DECLARATION
I, ISHIMWE MUHUMUZA Emma Marie, hereby declare that the work presented in this research paper is original. It has never been presented at Kigali Institute of Science and Technology or elsewhere for any award. All consulted works are cited and included in the list of references. I therefore declare this work to be wholly mine.
KIGALI INSTITUTE OF SCIENCE AND TECHNOLOGY INSTITUT DES SCIENCES ET DE TECHNOLOGIE DE KIGALI
Avenue de l'Armée, B.P. 3900 Kigali, Rwanda
CERTIFICATE
This is to certify that the Project Work entitled "Web Encyclopedia Info-Box Quick Extraction" is a record of the original bona fide work done by ISHIMWE MUHUMUZA Emma Marie (REG.No: GS20090098) in partial fulfilment of the requirement for the award of Bachelor of Science Degree in Computer Engineering of Kigali Institute of Science and Technology, during the Academic Year 2012.
LIST OF FIGURES
Fig. 1.1 Gantt chart
LIST OF ABBREVIATIONS
SMS: Short Message Service
ICT: Information Communication Technology
IT: Information Technology
MUC: Message Understanding Conference
HTML: HyperText Markup Language
DOM: Document Object Model
API: Application Programming Interface
NLP: Natural Language Processing
Table of Contents
ABSTRACT
DECLARATION
CERTIFICATE
LIST OF FIGURES
LIST OF ABBREVIATIONS
Table of Contents
CHAPTER ONE: GENERAL INTRODUCTION
  Introduction
  Background
  Problem Statement
  Objectives of the project
    General Objective
    Specific objectives
  1.5 Scope of the study
  Project interest
    Individual interests
    Academic interest
    Public interest
  Organization of the study
  Gantt chart
  Expected Results
  Conclusion
CHAPTER TWO: LITERATURE REVIEW
  2.1 Introduction
    2.2.1 Extraction rules
    2.2.2 Mechanisms of data extraction
    2.2.3 Verifying the Extracted Data
  2.3 Terms and technologies
    2.3.1 Encyclopedia
    2.3.2 Info-Box
    2.3.3 DOM tree
    2.3.4 Data Extraction
    2.3.5 Cell phone
    2.3.6 Web wrapper
    2.3.8 Django
  2.4 Proposed methodology
  2.5 System requirements
    2.5.1 Software requirements
    2.5.2 Hardware requirements
  2.6 Conclusion
References
  Books and Articles
  Internet Sources
A common source of trustworthy information about high-profile personalities is a continually updated encyclopedia; the alternative is to wait for television or radio programmes to cover them. With these sources, people risk missing the information they need because consulting them is often time-consuming. The rise of mobile phones has made this easier by delivering real-time information to a large number of people, since cell phone owners increasingly use their phones to obtain all kinds of information. [2]
The easiest and quickest way of extracting information is therefore an SMS application that lets people search a web encyclopedia with a single SMS: the user sends a request from a cell phone and receives a response SMS containing the requested information.
In this research, the Python programming language and the Django framework will be used to develop a tool that collects and extracts data from the web encyclopedia and transforms it into a format readable by a cell phone as a text message. The resulting data can then be sent to the cell phone user as an SMS.
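As a minimal sketch of the "format readable by a cell phone" step, the extracted text could be flattened and cut at word boundaries into SMS-sized parts. The sketch assumes the classic limit of 160 characters for a single-part SMS in the GSM 7-bit alphabet; the helper name `to_sms_parts` is hypothetical and not part of the proposed application.

```python
import textwrap

SMS_LIMIT = 160  # assumption: one single-part SMS in the GSM 7-bit alphabet


def to_sms_parts(text, limit=SMS_LIMIT):
    """Collapse whitespace and split text into chunks of at most `limit` characters.

    Hypothetical helper for illustration only.
    """
    flat = " ".join(text.split())
    # textwrap.wrap breaks at word boundaries; words longer than `limit` are cut.
    return textwrap.wrap(flat, width=limit)


if __name__ == "__main__":
    sample = "Born: 1 January 1970\nOccupation:   Engineer\n" * 6
    for number, part in enumerate(to_sms_parts(sample), start=1):
        print(number, len(part), part)
```

Text containing characters outside the GSM 7-bit alphabet is usually sent in a different encoding with a smaller per-message limit, so a real deployment would probably need to choose `limit` per message.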
Background
For a long time, people have obtained information from sources such as newspapers, television and radio, applications on their mobile devices, and specialised online services. Most of these sources do not provide adequate information, and not everyone can access them. For extracting data from the internet, researchers have created tools that extract data from web sites and transform it into a structured format. The resulting data can then be used to build new applications without having to deal with unstructured data. [3]
Because of how their algorithms are structured, these approaches lack some features. The Web Encyclopedia Info-Box Quick Extraction application therefore proposes a new extraction approach.
Problem Statement
Most people own a cell phone, but they cannot get information quickly from their mobile devices without spending time and money to extract it. What is the easiest way for all cell phone users, whether they own a sophisticated or a basic handset, to retrieve trustworthy information in real time and at the lowest price?
Project interest
Individual interests
The first interest of this research is to improve knowledge of the Python programming language and the Django framework and to fill some gaps faced in the ICT industry, thereby contributing to ICT development. The second interest is to apply the practical domain skills and engineering approaches learned at school to solving problems.
Academic interest
1. The project helps the student to advance the necessary understanding of the Python programming language.
2. The project helps students to demonstrate what they have learned during their university studies by serving as examples to others.
Public interest
1. The project enables cell phone users to make a quick search via their mobile devices and get trustworthy information.
2. The project helps the developer to become known in the ICT industry and to show their capabilities in solving society's problems, especially in the Information Technology domain.
Chapter Two: Literature review
The second chapter is the literature review, which describes various theories relating to the project work.
Chapter Three: Research methodology
The third chapter covers the research methodology, which describes the waterfall model as well as the principles of software engineering.
Chapter Four: System analysis and design
The fourth chapter covers the system design, which describes the data modelling, use cases and sequence diagrams.
Chapter Five: System implementation, testing and results
This chapter covers the implementation of the SMS application and presents the results of testing the implemented system.
Conclusion and recommendation
This is the last portion of the research report. It presents the conclusions and recommendations drawn from this research project.
Gantt chart
Expected Results
With the Web Encyclopedia Info-Box Quick Extraction application, cell phone users will be able to easily receive accurate and up-to-date information from a web encyclopedia in response to their requests. A cell phone user will send an SMS request to search for specific information, and the data will then be extracted from the web page at the URL assigned to the key word. Once the data has been transformed into a format readable by any cell phone, the user will be able to read it as text.
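The request step described above, in which a key word is mapped to its assigned URL, can be sketched as a simple lookup. This is only an illustration under stated assumptions: the table `KEYWORD_URLS`, the functions `normalise` and `resolve_request`, and the `encyclopedia.example.org` URLs are all hypothetical placeholders, not the project's actual data.

```python
# Hypothetical key word table; the URLs are placeholders, not real endpoints.
KEYWORD_URLS = {
    "jane doe": "https://encyclopedia.example.org/wiki/Jane_Doe",
    "john doe": "https://encyclopedia.example.org/wiki/John_Doe",
}


def normalise(sms_text):
    """Lower-case the SMS body and collapse runs of whitespace."""
    return " ".join(sms_text.lower().split())


def resolve_request(sms_text, table=KEYWORD_URLS):
    """Return the URL assigned to the key word, or None when it is unknown."""
    return table.get(normalise(sms_text))


if __name__ == "__main__":
    print(resolve_request("  Jane   DOE "))
    print(resolve_request("unknown person"))
```

Returning `None` for an unknown key word leaves it to the caller to decide how to answer the user, for example with a short "not found" SMS.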
Conclusion
This chapter covered the general introduction of the project. It introduced the project and described the background of how information has been extracted before, which led to the problem statement, the objectives of the project and the limitations of the study. The interest of the study was discussed, and the project timeline for accomplishing every task was presented. The chapter closes with this brief summary.
Natural Language Processing (NLP) techniques are developed to extract this type of unrestricted, unregulated information; they employ the syntactic and semantic characteristics of the language to generate the extraction rules. Structured information usually comes from databases, which provide rigid or well-defined formats, so it is easy to extract through a query language such as Structured Query Language (SQL). The other type is semi-structured information, which falls between free text and structured information. Web pages are a typical example of semi-structured information, and some papers focus on extracting text information from web pages. [9] We are going to review various theories relating to information extraction.
2.2.1 Extraction rules
A wrapper is software that enables a semi-structured Web source to be queried as if it were a database. These are sources where there is no explicit structure or schema, but there is an implicit underlying structure. Even text sources, such as email messages, have some structure in the heading that can be exploited to extract the date, sender, addressee, title, and body of the messages. Other sources, such as online catalogs, have a very regular structure that can be exploited to extract the data automatically. [10]
2.2.2 Mechanisms of data extraction
Lixto offers two basic mechanisms of data extraction: tree extraction and string extraction.
1. Tree extraction
For tree extraction, elements are identified by their corresponding tree paths and possibly by some properties of the elements themselves. This does not necessarily identify a single element. A plain tree path is a sequence of consecutive nodes in a sub-tree of an HTML tree. In an incompletely specified tree path, stars may be used instead of element names. For simplicity, incompletely specified tree paths are referred to as tree paths. The semantics of a tree path applied to a tree region of an HTML page is defined as the set of matched elements. [10]
2. String extraction
The second extraction method relies on strings. In the HTML parse tree, strings are represented by the text of content leaves. However, a string is associated with every node of the parse tree, available as the value of the attribute element text. String extraction has to be used, for example, when extracting the access codes of the phone numbers in lixto.html. [11]
2.2.3 Verifying the Extracted Data
A problem that has been largely ignored in extracting data from web sites is that sites change, and they change often. Kushmerick [12] addressed the wrapper verification problem by monitoring a set of generic features, such as the density of numeric characters within a field, but this approach only detects certain types of changes. In contrast, a different approach applies machine learning techniques to learn a set of patterns that describe the information being extracted from each of the relevant fields. Since the information for even a single field can vary considerably, the system learns the statistical distribution of the patterns for each field. Wrappers can be verified by comparing the patterns of the returned data to the learned statistical distribution. When a significant difference is found, an operator can be notified, or the wrapper repair process can be launched automatically.
Based on all of these theories of web information extraction, we will improve on existing extraction tools by developing a new SMS application that makes web information extraction quick: the user sends a single SMS from a cell phone and receives a response SMS containing the requested information.
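To make the tree path idea from section 2.2.2 concrete, the following sketch builds a very small element tree with Python's standard `html.parser` module and matches a path in which `*` stands for any element name. It is a simplified reading of tree paths, not Lixto's actual implementation or syntax; the names `Node`, `TreeBuilder`, `parse` and `match_path` and the sample markup are assumptions made for illustration.

```python
from html.parser import HTMLParser

VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}  # elements with no end tag


class Node:
    """One element of the simplified tree (hypothetical helper class)."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = ""  # text found directly inside this element


class TreeBuilder(HTMLParser):
    """Build a small element tree; assumes reasonably well-formed HTML."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Climb to the nearest open element with this tag, then close it.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        self.current.text += data


def parse(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def match_path(region, path):
    """Return every element reached from `region` along `path`; '*' matches any tag."""
    matched = [region]
    for step in path:
        matched = [
            child
            for node in matched
            for child in node.children
            if step == "*" or child.tag == step
        ]
    return matched


# Illustrative markup only.
SAMPLE = (
    "<table>"
    "<tr><th>Born</th><td>1970</td></tr>"
    "<tr><th>Occupation</th><td>Engineer</td></tr>"
    "</table>"
)

if __name__ == "__main__":
    cells = match_path(parse(SAMPLE), ["table", "*", "td"])
    print([cell.text for cell in cells])
```

As in the description above, a path does not necessarily identify a single element: `["table", "*", "td"]` matches both value cells. The `text` attribute of each matched node plays the role of the string associated with a node in string extraction.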
2.3.2 Info-Box
Generally, info-box templates are templates that provide standardized information across related articles. In this study, an info-box is a fixed-format box under the person's picture in the top right-hand corner of an article that consistently presents brief information about that person.
2.3.3 DOM tree
The DOM tree is a cross-platform and language-independent convention for representing and interacting with objects in HTML, XHTML and XML documents. Aspects of the DOM tree may be addressed and manipulated within the syntax of the programming language in use. The public interface is specified in its application programming interface (API). [14]
2.3.4 Data Extraction
Data extraction is the act or process of retrieving data out of data sources for further data processing or data storage. The import into the intermediate extracting system is thus usually followed by data transformation. [15] Typical unstructured data sources include web pages, emails, documents, PDFs, scanned text, mainframe reports, spool files, etc. Extracting data from these unstructured sources has grown into a considerable technical challenge, whereas historically data extraction had to deal with changes in physical hardware formats. The majority of current data extraction deals with extracting data from unstructured data sources and from different software formats. This growing process of data extraction from the web is referred to as web scraping.
2.3.5 Cell phone
A cell phone (also known as a cellular phone, mobile phone or hand phone) is a device that can make and receive telephone calls as well as send and receive text messages over a radio link whilst moving around a wide geographic area.
It does so by connecting to a cellular network provided by a mobile phone operator, allowing access to the public telephone network. By contrast, a cordless telephone is used only within the short range of a single, private base station. [16]
2.3.6 Web wrapper
A web wrapper is a tool used to extract information from the web using only a set of general rules describing the data domain. It cleanly separates site-independent and site-specific knowledge from the execution implementation. Site-independent knowledge is expressed in user-supplied domain rules, while site-specific knowledge is expressed in automatically generated context-free grammars that describe site structures. [17] A wrapper may also be used to extract a particular format of information manually.
2.3.7 Python
Python is a programming language that lets you work more quickly and integrate your systems more effectively. Python runs on Windows, Linux/Unix and Mac OS X, and has been ported to the Java and .NET virtual machines. [18]
2.3.8 Django
Django is a high-level Python web framework that encourages rapid development and clean, pragmatic design. [19] It lets you build high-performing, elegant web applications quickly.
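As an illustration of what a small hand-written wrapper for an info-box might look like in Python, the sketch below collects label and value pairs from the rows of a table. It assumes that the info-box is a `<table>` whose `class` attribute contains `infobox` and that it holds no nested tables; both are assumptions about the target page, and the class `InfoBoxWrapper`, the function `extract_infobox` and the sample page are hypothetical.

```python
from html.parser import HTMLParser


class InfoBoxWrapper(HTMLParser):
    """Collect (label, value) pairs from the <th>/<td> rows of an info-box table.

    Assumptions: the table carries the class 'infobox' and has no nested tables.
    """

    def __init__(self):
        super().__init__()
        self.inside = False   # True while inside the info-box table
        self.cell = None      # 'th' or 'td' while inside such a cell
        self.label = []
        self.value = []
        self.rows = []

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            classes = (dict(attrs).get("class") or "").split()
            self.inside = self.inside or "infobox" in classes
        elif self.inside and tag == "tr":
            self.label, self.value = [], []
        elif self.inside and tag in ("th", "td"):
            self.cell = tag

    def handle_endtag(self, tag):
        if not self.inside:
            return
        if tag == "table":
            self.inside = False
        elif tag in ("th", "td"):
            self.cell = None
        elif tag == "tr":
            label = " ".join("".join(self.label).split())
            value = " ".join("".join(self.value).split())
            if label and value:  # skip rows without a label or a value
                self.rows.append((label, value))

    def handle_data(self, data):
        if self.inside and self.cell == "th":
            self.label.append(data)
        elif self.inside and self.cell == "td":
            self.value.append(data)


def extract_infobox(html):
    wrapper = InfoBoxWrapper()
    wrapper.feed(html)
    wrapper.close()
    return wrapper.rows


# Illustrative page fragment, not taken from a real encyclopedia article.
SAMPLE_PAGE = """
<p>Intro text outside the box.</p>
<table class="infobox vcard">
  <tr><th>Born</th><td>1 January 1970</td></tr>
  <tr><th>Residence</th><td><a href="/wiki/Kigali">Kigali</a>, Rwanda</td></tr>
  <tr><td>Row without a label</td></tr>
</table>
<table><tr><th>Other</th><td>ignored</td></tr></table>
"""

if __name__ == "__main__":
    for label, value in extract_infobox(SAMPLE_PAGE):
        print(f"{label}: {value}")
```

Because the rule is tied to one site's markup, such a wrapper would stop working if the site changes its layout, which is the maintenance problem discussed in section 2.2.3.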
2.5.1 Software requirements
The application will be developed using:
Python programming language
Django framework
Web wrappers as web extraction systems
MySQL as the database
Ubuntu 11.10 as the operating system
The Python programming language and the Django framework will be used to develop a tool for collecting and extracting data from the web encyclopedia, using web wrappers as one of the web extraction systems. Semantic search will be used to drive the extraction based on the meaning of the given key word, and the requested information will be stored in the database, so that it is easy to retrieve through a query language such as Structured Query Language (SQL).
2.5.2 Hardware requirements
For the application to be used efficiently, the user must have a cell phone capable of sending and receiving text messages.
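The storage step mentioned in section 2.5.1 could look roughly as follows. To keep the sketch self-contained it uses Python's built-in `sqlite3` module as a stand-in for the MySQL database named in the requirements; the table `infobox_field` and the functions `open_store`, `save_fields` and `load_fields` are hypothetical names chosen for illustration.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open the store and create the (hypothetical) infobox_field table if needed."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS infobox_field ("
        " keyword TEXT NOT NULL,"
        " label TEXT NOT NULL,"
        " value TEXT NOT NULL,"
        " PRIMARY KEY (keyword, label))"
    )
    return conn


def save_fields(conn, keyword, rows):
    """Insert or update the (label, value) pairs extracted for one key word."""
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT OR REPLACE INTO infobox_field (keyword, label, value)"
            " VALUES (?, ?, ?)",
            [(keyword, label, value) for label, value in rows],
        )


def load_fields(conn, keyword):
    """Return the stored (label, value) pairs for a key word, ordered by label."""
    cursor = conn.execute(
        "SELECT label, value FROM infobox_field WHERE keyword = ? ORDER BY label",
        (keyword,),
    )
    return cursor.fetchall()


if __name__ == "__main__":
    store = open_store()
    save_fields(store, "jane doe", [("Born", "1970"), ("Occupation", "Engineer")])
    print(load_fields(store, "jane doe"))
```

`INSERT OR REPLACE` and the `?` placeholders are SQLite conventions; with MySQL the equivalent statement and the parameter style depend on the driver, so this part would need adapting.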
2.6 Conclusion
In this chapter, we discussed various theories related to the study and defined some of the terms and technologies used in the project description. We also discussed the proposed methodology to be used to collect the data and the steps to be followed during the research process. Finally, we stated the system requirements by specifying the software requirements as well as the hardware requirements.
References
Books and Articles
[1] A. Mitchell, Deputy Director, Project for Excellence in Journalism. How mobile devices are changing community information environments, March 14, 2011.
[2] K. Purcell, Associate Director-Research, Pew Internet Project. http://www.stateofthemedia.org/2011/Mobile-survey. Accessed on January 9, 2012.
[3] C. Hsu and M. Dung. Generating finite-state transducers for semi-structured data extraction from the web, Article, 23(8):521-538, 1998.
[4] Source: Pew Research Center's Project for Excellence in Journalism and Internet & American Life Project in partnership with the Knight Foundation, January 12-25, 2011 Local Information Survey.
[5] A. Survey, Technical Report 945, Norwegian Computing Center, Oslo, Norway, July 1999.
[6] W. Cohen. Recognizing structure in web pages using similarity queries. In Proc. of the 16th National Conference on Artificial Intelligence AAAI-1999, pages 59-66, 1999.
[7] A. Craig Knoblock, University of Southern California and Fetch Technologies, 2009.
[8] L. Eikvil. Information Extraction from World Wide Web - A Survey, Technical Report 945, Norwegian Computing Center, Oslo, Norway, July 1999.
[9] Y. Zhang et al. (Eds.): APWeb 2008, LNCS 4976, pp. 383-394, 2008.
[10] A. Craig Knoblock, University of Southern California and Fetch Technologies, December 6, 1998.
[11] T. Goan, N. Benson, and O. Etzioni. A grammar inference algorithm for the World Wide Web. In Proc. of the AAAI Spring Symposium on Machine Learning in Information Access, 1996.
[12] N. Kushmerick. Regression testing for wrapper maintenance. In Proc. of the 16th National Conference on Artificial Intelligence AAAI-1999, pages 74-79, 1999.
[17] M. E. Califf and R. J. Mooney. Relational learning of pattern-match rules for information extraction, 1999.
Internet Sources
[13] http://en.wikipedia.org/wiki/Encyclopedia. Accessed on February 10, 2012.
[14] Document Object Model (DOM), W3C, http://www.w3.org/. Accessed on February 11, 2012.
[15] http://en.wikipedia.org/wiki/Data_extraction. Accessed on February 12, 2012.
[16] http://en.wikipedia.org/wiki/Mobile_phone. Accessed on February 12, 2012.
[19] http://www.boddie.org.uk/python/HTML.htmlA. Accessed on February 13, 2012.