
Proceedings of the Ninth International Conference on Machine Learning and Cybernetics, Qingdao, 11-14 July 2010

SEMANTIC ANNOTATION BASED FINANCIAL WEB INFORMATION REORGANIZATION


BO YUAN, QING-CAI CHEN, XIAO-LONG WANG, LI-BO LIU
Computer Science and Technology Department, Key Laboratory of Network Oriented Intelligent Computation, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, Guangdong, 518055, China
E-MAIL: {yuanbo.hitsz, qingcai.chen}@gmail.com, wangxl@insun.hit.edu.cn, liulibo.hit@gmail.com

Abstract:
The enormous amount of web information and the unfriendly retrieval results returned by search engines cannot satisfy people's demands. Reorganizing web information in a hierarchical structure is one solution. To achieve this aim, a knowledge base for a specific domain is built using ontology technology. An event, composed of key information and attributes, is defined to match the knowledge base. The process of semantic annotation is designed and implemented. In this way, web information in a specific domain is reorganized and can help return more detailed results to the user. The experimental results show that the methods of knowledge base building and semantic annotation are effective and feasible.

Keywords:
Web information; Hierarchical structure; Knowledge base; Semantic annotation

1. Introduction

After nearly thirty years of development, web information has reached an enormous volume and brings convenience to people's daily work and life. However, information retrieval services cannot keep up with the growth of information and with people's demands. People may issue a simple query, for example "interest rate policy", hoping to acquire detailed information such as the number of rate changes, the size of each change and the influence on the economy and the stock market. This is difficult for a traditional search engine. One simple solution is to reorganize web information and build relationships between information points. This idea comes from the Semantic Web [1]. If web information can be indexed in a hierarchical structure and the related attributes of the information can be matched, the retrieval results for a user's query would be richer and more useful. Obviously it is an enormous project to reorganize the entire internet, but for a specific domain it seems viable.

In this paper we attempt to reorganize financial information on the web according to a defined structure, using an ontology knowledge base and semantic annotation. The strategy of processing web information also draws on the experience of the Semantic Web, which defines a series of precise rules to achieve the goal of using data to describe data. To describe web information more clearly, an event is defined as the combination of key information and attributes. The annotation process can therefore be seen as event annotation, which contains two parts: key information annotation and attribute annotation. Web pages contain so much vague information that it is not easy to extract exact keywords to match the knowledge points and attributes. In this paper a basic knowledge base for the financial domain is built and rough annotation rules are defined. Natural language processing technology is also used to extract the information. We believe our research can serve as a reference for ontology building in other domains.

Since the research is based on the Semantic Web and ontology, the next section introduces related work on ontology and semantic annotation. Section 3 presents the system architecture, and Section 4 describes the construction of the financial knowledge base based on ontology. Section 5 covers the annotation model and indexing model. Section 6 gives the experimental results and discussion. The last section summarizes this work and speculates on what the future may bring.

978-1-4244-6527-9/10/$26.00 ©2010 IEEE

2. Related work

Our research draws on the envisioned function of the Semantic Web: it will bring structure to the content of web pages and create an environment in which software agents roam from page to page [1]. The Semantic Web was first put forward by Tim Berners-Lee, and research on it has been a hot topic for years. XML (Extensible Markup Language), RDF (Resource Description Framework) and ontology are the three key technologies for realizing the Semantic Web [2]. The standards for XML Schema and RDF are defined by the W3C. In [3] the authors discussed the key elements of XML and RDF, and proposed a general method for encoding ontology representation languages into RDF/RDF Schema.

The philosophical concept of ontology has been redefined in the AI field of computer science, and the definition has become more explicit over time. In [4] the author explained that the term is borrowed from philosophy and defined an ontology as an explicit specification of a conceptualization. In [5] the authors adopted the definition that an ontology is a formal, explicit specification of a shared conceptualization. In [6] ontologies are described as content theories about the sorts of objects, properties of objects, and relations between objects that are possible in a specified domain of knowledge. Many ontology building tools have appeared, such as Protégé [7], OilEd [8] and OntoEdit [9]; in this paper Protégé is selected to build the ontology base.

Although it will take a long time to realize the Semantic Web, related research has made great progress. In IR, a Semantic Web search engine called Swoogle has become well known to users [10]. In [11] an agent-based platform called QuestSemantics (QS) is developed, and fine-grained business knowledge is used to support semiautomatic discovery, annotation, filtering and retrieval of information resources. For Chinese, the complexity and ambiguity of the language make such research more difficult; the progress can be found in [12].

3. System architecture and data preprocessing

The process of web data reorganization, as shown in Figure 1, can be divided into three parts: data preprocessing, knowledge base building and semantic annotation. To describe information accurately, the definition of event is given. An event has two aspects: keywords and attributes. The keywords are matched against the knowledge base. Different events have different attributes, and it is difficult to define all attributes for each event. However, time is a common attribute of all events, and it is the attribute discussed in this paper.

Figure 1 Process of information reorganization

Data preprocessing contains two steps: data acquisition and data purification. Web pages are downloaded by a crawler to realize data acquisition. To narrow the research field to finance, information sources are limited to specific financial web sites. After being downloaded and stored, the web pages are purified. A web news page is mainly composed of five parts, as shown in Figure 2:
1) Title
2) Post date
3) Text
4) Navigation bar
5) Advertisement

Figure 2 Example of financial web page

The first three parts are the primary processing targets, because the information of a web page is mainly contained in them. The navigation bar and advertisements are removed, as the information they contain would confuse the annotation. The purified documents are stored in XML form. The core ideas of information reorganization are knowledge base construction and semantic annotation.

4. Knowledge base construction

Because of the enormous amount of information and the high speed at which new information appears, it is obvious that our knowledge base cannot cover all the knowledge points of a specific domain. In fact, the structure of an annotated web page follows the hierarchical structure of the knowledge base. If the main structure is built reasonably, new points can be added later. In this paper we focus on building the main structure. The concepts and tools of ontology building are used because the hierarchical structure and the relationships between ontology points are suitable for this project. Even so, building the structure for the financial area is still difficult. The solution we attempted is to build the knowledge base on the impact factors of the stock market, so as to cover financial information indirectly. The stock market can be seen as the barometer of a nation's economy. Nearly all financial information has an impact on stock price fluctuation, so it is easier to summarize the factors that influence the stock market and build the structure from them. This method can serve as a reference for knowledge base building in other domains.

The construction of the basic structure still needs manual work. The preliminary work was done manually, and automating it is left to future research. Following suggestions from experts in the finance domain, three factors that influence stock price fluctuation were identified, which constitute the basic structure of the ontology library: macro factor (MF), industry factor (IF) and single stock factor (SSF). The framework of the finance ontology contains these three parts, as shown in Figure 3. The trend of each factor also needs to be considered, and it is classified as either positive or negative. The complicated rules of inductive inference in ontology building are not applied in this paper. Based on the framework of the finance ontology, the detailed design of each sub-category is as follows:
• MF: the factors that have an impact on the whole stock market, such as government finance, investment, policy, price, productivity, interest rate, tax rate and trading.
• IF: stocks can be classified by industry, and IF contains the factors that have great influence on one or more industries. The classification follows that of the SSE (Shanghai Stock Exchange). Each industry has its keywords and a description of its impact factors.
• SSF: the factors that are related to a single stock; some of them reflect the financial situation of the company.
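The three-part hierarchy can be pictured as a nested mapping that is flattened into a keyword lookup table for matching. The following is only a minimal sketch: the subclass names and keywords are hypothetical placeholders, not the authors' ontology, and the paper builds its ontology in Protégé rather than in code.

```python
# Hypothetical fragment of the MF / IF / SSF hierarchy.
# Subclasses and keywords are illustrative placeholders only.
KNOWLEDGE_BASE = {
    "MF": {
        "interest rate": ["interest rate", "deposit rate"],
        "tax rate": ["stamp duty"],
    },
    "IF": {
        "banking": ["bank loan"],
    },
    "SSF": {
        "earnings": ["net profit"],
    },
}


def build_keyword_index(kb):
    """Flatten the hierarchy into keyword -> (concept class, subclass)."""
    index = {}
    for concept, subclasses in kb.items():
        for subclass, keywords in subclasses.items():
            for keyword in keywords:
                index[keyword] = (concept, subclass)
    return index


KEYWORD_INDEX = build_keyword_index(KNOWLEDGE_BASE)
print(KEYWORD_INDEX["deposit rate"])
```

Keeping the hierarchy and the flat index separate means that new knowledge points can be added under an existing subclass without touching the matching code.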

To describe the information clearly we define the trend class. A subclass such as rate is often followed by a verb, for example "raise" or "cut". These verbs are classified as trend. Although trend should properly be considered an attribute, it is difficult to attach it to every keyword that actually has this attribute. Instead we annotate trend as an ontology class.

The basic annotation rule is keyword matching when a sentence contains a single keyword. When a sentence contains more than one keyword, the distance between words is computed to limit which adjacent keywords belong to one event. The ontology knowledge base for one sentence is defined as the vector $V = (v_1, v_2, \ldots, v_n)$, where $v_i$ is a keyword that appears in the sentence. The vector $S = (w_1, w_2, \ldots, w_n)$ describes the sentence, where $w_j$ is a word of the sentence. The location of $w_j$ in the sentence is defined as $\mathrm{pos}(w_j)$. The distance between two words of $S$ that are adjacent keywords is defined in (1), and the annotation rule is defined in (2):

$$\mathrm{len}(w_m, w_n) = \mathrm{pos}(w_m = v_{i+1}) - \mathrm{pos}(w_n = v_i) \tag{1}$$

$$\mathrm{len}(w_m, w_n) < \theta \tag{2}$$

The distance between adjacent keywords in $S$ is used to limit the scope of the semantic keywords for one event. Based on manual statistics of this distance over web pages, the threshold $\theta$ is set to 5 bytes.

The temporal information of an event is extracted as an attribute in the second step. The occurrence time of the event, or the publication time of the web page, should be aligned with stock market time. In this way stock price fluctuation can be related to the financial event. It is difficult to extract temporal information from Chinese web pages; a solution is given in [13]. The use of event temporal information in impact computation will be discussed in the next section.
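The distance rule can be illustrated with a few lines of Python. This is a sketch under assumptions: keyword positions are given as plain integers in whatever unit pos() uses, the threshold defaults to the value of 5 reported in the text, and the function name `group_keywords` is introduced here for illustration.

```python
def group_keywords(positions, threshold=5):
    """Split keyword positions of one sentence into events.

    A keyword joins the current event when its distance to the previous
    keyword is below the threshold (rule (2)); otherwise it starts a
    new event.
    """
    events = []
    for pos in sorted(positions):
        if events and pos - events[-1][-1] < threshold:
            events[-1].append(pos)
        else:
            events.append([pos])
    return events


print(group_keywords([3, 6, 14, 17]))  # → [[3, 6], [14, 17]]
```

Here the keywords at positions 3 and 6 are 3 apart and share an event, while the gap of 8 between positions 6 and 14 starts a second event.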

Figure 3 Hierarchical structure of knowledge base

5. Semantic annotation

As defined above, the annotation process contains two parts: key information annotation and attribute annotation. The aim of annotation is to extract information from web news and mark it as events with labels. In this way the financial events on the web can be reorganized. In natural language, and in the content of a web page, the minimal unit that describes an event is the sentence. So in this paper the minimal unit for annotating an event is the sentence.

Figure 4 Structure of the annotated web page


The specification of each tag is as follows:
• <doc>: marks the beginning and end of an annotated page
• <testid>: id of the annotated web page
• <docid>: id of the web page in the purified web page database
• <time>: publish or update time of the financial web page
• <url>: address of the financial news web page
• <title>: financial news title
• <offset>: offset of the web page in the purified database
• <event>: financial event extracted from the web page
• <content>: financial web page content

The <event> tag is the most important one in the annotation process. It contains the concept classes MF, IF and SSF. For example, in <event class="rate: @74" industry="C7: @88" class="rate: @92" trend="down: @106">, the name to the left of = represents the concept class, the first word inside the quotation marks represents the subclass, the phrase after : represents the keyword that appears in the event sentence, and the number after @ represents the offset of the keyword in the text. Figure 4 shows the structure of a labeled web page. The annotation process can be expressed as below.
Input: purified web page database, which includes the web pages and an index
Output: database of annotated web pages

Initialize the hash table HMM_Hash, which contains the ontology
for each purified web page do
    perform word segmentation
    extract the docid, date, title, URL and offset of the web page
    store the docid, date, title, URL and offset in XML form
    cut the content of the web page into sentences
    initialize the vector of ontology V
    for each sentence Si do
        for each word wi do
            match the word in HMM_Hash
            if wi in HMM_Hash && pos(wi) - pos(vend) < 5
                push wi into V
            else
                traverse the vector of ontology
                generate <event> label
                use <event> to annotate the current sentence
                initialize the vector of ontology
            end if
        end for
        connect the annotated sentences
        use <content> to annotate the text body of the web page
    end for
    use <doc> to annotate the entire web page
    output the annotated web page to the database
end for
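The "generate <event> label" step can be sketched as simple string formatting. The helper names `format_event_label` and `annotate_sentence`, the tuple layout, the sample keywords and the closing </event> tag are assumptions made for illustration; the source shows only the opening label. The label is built as a string rather than through an XML library because the example label in the text repeats the attribute name class.

```python
def format_event_label(matches):
    """Render one <event> label from the keywords matched in a sentence.

    Each match is (attribute, subclass, keyword, offset), for example
    ("trend", "down", "cut", 106). The layout mirrors the example label
    in the text: attribute="subclass: keyword @offset".
    """
    parts = [
        f'{attr}="{subclass}: {keyword} @{offset}"'
        for attr, subclass, keyword, offset in matches
    ]
    return "<event " + " ".join(parts) + ">"


def annotate_sentence(sentence, matches):
    """Wrap a sentence in its event label; return it unchanged if no match.

    The closing </event> tag is an assumption of this sketch.
    """
    if not matches:
        return sentence
    return f"{format_event_label(matches)}{sentence}</event>"


matches = [("class", "rate", "interest rate", 74), ("trend", "down", "cut", 106)]
print(format_event_label(matches))
```

Sentences without any ontology keyword pass through untouched, so they still end up inside <content> without an event label.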

6. Experimental results

The evaluation of web information annotation also contains two parts. To evaluate the effectiveness of key information annotation, 64 ontology keywords were selected. To cover these keywords, 401 financial web pages were randomly selected from the database, which is composed of Sina financial web pages. The primary results, given in Table 1, show that most event sentences in the text can be annotated based on the ontology.
TABLE 1 RESULT OF KEY INFORMATION ANNOTATING
| Text number | Financial event number | Correctly annotated event number | Precision |
| --- | --- | --- | --- |
| 401 | 654 | 484 | 74.01% |
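The precision in Table 1 is consistent with the usual ratio of correctly annotated events to all annotated events. The source does not state the formula, so this reading is an assumption:

$$\text{precision} = \frac{\text{correctly annotated events}}{\text{financial events}} = \frac{484}{654} \approx 74.01\%$$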

Annotation errors have three main causes. The first is word segmentation error. After word segmentation the text is separated into words, which are then matched against the ontology keywords. If the segmentation result is not correct, word matching and classification can hardly perform well. The words in the ontology vocabulary have specific meanings, but some company names and proper nouns contain a keyword, and the sentences containing them will be annotated. Obviously such results do not carry the intended semantics. The solution is to build specific segmentation rules for ontology words and to shield proper nouns.

The second is the ambiguity of ontology words. Even when segmentation and annotation are correct, the word in the sentence may not carry the meaning it has in the ontology, so the classification result is wrong. Usually a long word has an unambiguous meaning, while a short word often causes these errors. To avoid this error the annotation process should judge, based on the context, whether the matched word reflects the ontology semantics; this will be our next work.

The third kind of error is caused by hyperlinks. It shows that the purification program cannot remove all content irrelevant to the event: a sentence in a hyperlink may contain a complete description of a financial event, which is then annotated. The annotation system should have strict rules for whole-sentence analysis, and the preprocessing of data needs to be improved.

In addition, sentences of expert advice and comment are also annotated as financial events. These sentences also contain complete descriptions of related events that include ontology keywords. Our annotation system lacks the ability to judge whether a sentence belongs to a news event or to expert advice and comment. A feasible way to solve this is to build templates for these styles of text. Verbs such as "consider", "believe" and "advise", adjectives such as "gloomy" and "optimistic", and adverbs such as "will" and "should" appearing in the related sentence can be the features of a template.

The database of web pages is huge and the annotation results cannot all be rechecked by humans. Although the result in Table 1 covers only a small part of the whole database and the precision may not exactly reflect the annotation process, analyzing the causes of errors is still useful for achieving better performance in future work.
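The cue-word template suggested above for separating expert comment from news events could start as a simple filter. This is a hypothetical sketch, not part of the reported system: the cue list uses the English glosses from the text (a system for Chinese pages would need the corresponding Chinese words after segmentation), and the function name and `min_cues` parameter are illustrative.

```python
import re

# Cue words from the text, as English glosses (assumption: exact-match tokens).
OPINION_CUES = {
    "consider", "believe", "advise",  # verbs
    "gloomy", "optimistic",           # adjectives
    "will", "should",                 # adverbs, as listed in the text
}


def looks_like_comment(sentence, min_cues=1):
    """Return True when the sentence contains at least min_cues cue words."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    hits = sum(1 for token in tokens if token in OPINION_CUES)
    return hits >= min_cues


print(looks_like_comment("Analysts believe the central bank should cut the rate."))
print(looks_like_comment("The central bank cut the interest rate on Monday."))
```

Raising `min_cues` trades recall for precision: a single "will" is weak evidence of comment, while two or more cues in one sentence are more telling.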
To evaluate the effectiveness of attribute annotation, the experiments are designed based on the previous research in [13]. In [13], ten interest rate rises were chosen as key information annotation targets. In this paper the test events for temporal attributes are extended with five interest rate cuts. The interest rate rise data is also retested, with more web pages added to the test data. As discussed in [13], the temporal attributes of Chinese web pages are vague and complex, so temporal expressions are divided into direct temporal expressions and indirect temporal expressions. It should also be noted that the temporal information for a single event extracted from web pages may deviate from the exact time, so we define standard precision and rough precision to evaluate the usefulness of temporal information annotation. The experimental results in Table 2 and Table 3 show that the temporal attribute of events is mostly extracted and annotated.

TABLE 2 RESULT OF TEMPORAL ATTRIBUTES ANNOTATING
| Name | Web page | Total temporal expression | Formulized temporal expression | Direct temporal expression | Indirect temporal expression |
| --- | --- | --- | --- | --- | --- |
| Rate rising | 1543 | 62371 | 2595 | 995 | 1600 |
| Rate cutting | 677 | 13990 | 758 | 199 | 559 |

TABLE 3 PRECISION OF TEMPORAL ATTRIBUTES EXTRACTING
| Name | Rate rising: standard precision | Rate rising: rough precision | Rate cutting: standard precision | Rate cutting: rough precision |
| --- | --- | --- | --- | --- |
| Direct time | 83.72% | 92.76% | 75.38% | 76.38% |
| Indirect time | 53.56% | 82.94% | 56.89% | 81.93% |
| Total time | 65.13% | 86.71% | 63.54% | 84.87% |

7. Conclusion and future work

In this paper the process of web information reorganization for a specific domain is basically realized. The process contains two steps: knowledge base building and semantic annotation. To cover the financial information points, the impact factors of the stock market are classified, based on experts' advice, to build the hierarchical structure of knowledge. The theory of ontology building is used to construct the knowledge base. Although the preliminary work is done manually and complicated inductive inference is not adopted, the hierarchical structure and the basic relationships between ontology keywords are helpful in the annotation process. The annotation of web pages achieves the aim of reorganizing the domain information in a hierarchical structure. The annotation rules for key information and the attribute extraction methods are discussed. The experimental results show that the solution is feasible.

Our research is still at the stage of exploration, and there is obviously much room for improvement. Knowledge base building is long-term work, and automating it will be our next research topic. The annotation rules based on keyword matching cannot reflect all the semantics, and many other attributes, such as the location where an event happens, remain to be studied. In the future we will focus on these subjects.

Acknowledgements

This investigation is supported in part by the National Natural Science Foundation of China (No. 60703015 and No. 60973076).

References

[1] Tim Berners-Lee, James Hendler and Ora Lassila. The Semantic Web. Scientific American, May 17, 2001.
[2] Tian Chunhu. Review of Research about Semantic Web. Journal of the China Society for Scientific and Technical Information, vol. 24, no. 2, pp. 243-249, April 2005.
[3] Stefan Decker, Sergey Melnik and Frank Van Harmelen. The Semantic Web: The Roles of XML and RDF. IEEE Internet Computing, vol. 15, pp. 63-74, October 2000.
[4] Gruber TR. A translation approach to portable ontology specifications. Technical Report, KSL 92-71, Knowledge System Laboratory, 1993.
[5] Rudi Studer, V. Richard Benjamins and Dieter Fensel. Knowledge Engineering: Principles and Methods. Data & Knowledge Engineering, vol. 25, pp. 161-198, March 1998.

[6] Chandrasekaran, John R. Josephson and Richard V. Benjamins. What are Ontologies and why do we need them? IEEE Intelligent Systems, pp. 20-26, January 1999.
[7] Noy NF, Fergerson RW, Musen MA. The knowledge model of Protégé-2000: Combining interoperability and flexibility. In: Dieng R, Corby O, eds. Proc. of the EKAW 2000. Heidelberg: Springer-Verlag, 2000. 17-32.
[8] Bechhofer S, Horrocks I, Goble C, Stevens R. OilEd: A reason-able ontology editor for the semantic Web. In: Baader F, Brewka G, Eiter T, eds. Proc. of the KI 2001, Joint German/Austrian Conf. on AI. Heidelberg: Springer-Verlag, 2001. 396-408.
[9] Sure Y, Angele J, Erdmann M, Staab S, Studer R, Wenke D. OntoEdit: Collaborative ontology engineering for the semantic Web. In: Horrocks I, Hendler JA, eds. Proc. of the ISWC 2002. Heidelberg: Springer-Verlag, 2002. 221-235.
[10] Li Ding, Tim Finin, Anupam Joshi, Rong Pan, R. Scott Cost, Yun Peng, Pavan Reddivari, Vishal C Doshi, and Joel Sachs. Swoogle: A Search and Metadata Engine for the Semantic Web. Proceedings of the Thirteenth ACM Conference on Information and Knowledge Management, November 2004.
[11] Tamma Valentina. Semantic Web Support for Intelligent Search and Retrieval of Business Knowledge. IEEE Intelligent Systems, vol. 25, issue 1, pp. 84-88, Jan.-Feb. 2010.
[12] Du Xiao-Yong, Li Man, and Wang Shan. A Survey on Ontology Learning Research. Journal of Software, vol. 17, no. 9, pp. 1837-1847, September 2006.
[13] Bo Yuan, Qingcai Chen, Xiaolong Wang, Liwei Han. Extracting Event Temporal Information based on Web. The 2nd International Symposium on Knowledge Acquisition and Modeling (KAM 2009), pp. 346-350, 2009.
