Summer Training Report
Aditya Dhoke
Roll No. 04005013
Department of Computer Science and Engineering
Indian Institute of Technology Bombay, India
Guide: Prof. Amit Sheth
August 5, 2007
Abstract
In this report, I describe the projects and implementations, namely Entity Spotter, Semantic Browser and Web Portal, on which I worked with
Kno.e.sis lab members during my summer internship in 2007. I also provide details of the presentation that I gave as part of my curriculum.
Finally, the conclusion states the usefulness of this internship to me and the future
endeavors that I plan.
0.1 Entity Spotter
The implementation reads a set of entities and relations with their corresponding IDs, builds a tree data structure similar to a trie, and
uses it to tag the entities in text.
In the initial stage, the program takes as input the concepts and relationships along with their alphanumeric IDs and creates a tree
in which each node holds a hashmap. Each key in this hashmap (a set of key-value pairs)
is a word within an entity, and the corresponding value is another node in the tree. As
words are read, the tree is traversed from top to bottom, using the current word as the
key and moving to the node stored as its value. If the sequence of words read
matches an entity, the ID of that entity is fetched from the current
node. The program also handles the case where one entity is a prefix of another: if
the longer entity does not match, it backtracks and tags the shorter
entity.
The method runs in O(log(m)*n) time, where m is the number of entities
and n is the number of words in the input file, as opposed to the brute-force
technique, which would take O(m*n) time. The data structure
is shown in Figure 1. The figure shows the storage of two entities, gene
mutation and gene abnormality, and their corresponding IDs, D130 and
D432. If gene mutation occurs in the text, the hash table of the uppermost
node is consulted first, followed by the hash table in the left node. This
finally leads to the node in which the ID D130 is stored.
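The tagging procedure described in this section can be sketched as follows. This is a minimal illustration written in Python for brevity and is not the code developed during the internship. The names Node, build_trie and spot are hypothetical, and splitting the text on whitespace is a simplifying assumption.

```python
class Node:
    """One trie node: children keyed by word, plus an optional entity ID."""
    def __init__(self):
        self.children = {}     # word -> Node (the per-node hashmap)
        self.entity_id = None  # set only if the path to this node spells an entity


def build_trie(entities):
    """Build the tree from a dict mapping entity names to their IDs."""
    root = Node()
    for name, entity_id in entities.items():
        node = root
        for word in name.lower().split():
            node = node.children.setdefault(word, Node())
        node.entity_id = entity_id
    return root


def spot(root, text):
    """Return (start, end_exclusive, entity_id) word spans, longest match first."""
    words = text.lower().split()  # assumption: whitespace tokenization
    matches = []
    i = 0
    while i < len(words):
        node = root
        last = None  # longest complete entity seen so far starting at word i
        j = i
        while j < len(words) and words[j] in node.children:
            node = node.children[words[j]]
            j += 1
            if node.entity_id is not None:
                last = (j, node.entity_id)
        if last is not None:
            # If a longer entity failed to match, this falls back to the shorter one.
            end, entity_id = last
            matches.append((i, end, entity_id))
            i = end
        else:
            i += 1
    return matches


entities = {"gene mutation": "D130", "gene abnormality": "D432"}
root = build_trie(entities)
print(spot(root, "A gene mutation differs from a gene abnormality"))
# [(1, 3, 'D130'), (6, 8, 'D432')]
```

The variable last plays the role of the backtracking step: the traversal always tries to extend the match, but it remembers the most recent node that carried an ID.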
0.2 Presentation
The aim of the presentation was to put forward the idea in [1]. Computer
scientists at Southeast University, China, introduced the novel idea of an RDF
Sentence Graph, which they used to summarize ontologies in RDF format. Given an ontology (an RDF graph) along with the desired summary length and a
preference, RDF sentences are detected, and a graph is built in which
each node is an RDF sentence. The summarization problem is thereby
reduced to finding salient nodes in the graph. The salient nodes are then
re-ranked to obtain a more appropriate ontology summary. Degree Centrality,
Shortest-Path-based Centrality, Eigenvector Centrality and Weighted HITS are
the methods that were used for finding salience. The workflow is shown
in Figure 2.
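To make the notion of salience concrete, the simplest of these measures, degree centrality, can be sketched on a toy graph. This is only an illustration in Python: the graph is treated as undirected and unweighted, which is a simplifying assumption rather than the formulation in [1], and the node labels s1 to s4 and the function names are hypothetical.

```python
def degree_centrality(edges):
    """Map each node to its number of distinct neighbours in an undirected graph."""
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    return {node: len(adj) for node, adj in neighbours.items()}


def top_k(scores, k):
    """Return the k highest-scoring nodes, ties broken by node name."""
    return sorted(scores, key=lambda n: (-scores[n], n))[:k]


# Toy graph: s1..s4 stand in for RDF sentences; an edge links two related sentences.
edges = [("s1", "s2"), ("s1", "s3"), ("s1", "s4"), ("s2", "s3")]
scores = degree_centrality(edges)
print(top_k(scores, 2))  # ['s1', 's2']
```

Here k stands in for the requested summary length: the summary would consist of the k most salient nodes before any re-ranking.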
0.3 Semantic Browser
0.3.1 Data Storage
The RDF statements are stored as a persistent object with a Trie
structure. The abstracts and their PMIDs are indexed using a Lucene index.
The persistent object and the indexes are created offline and stored on the
server.
0.3.2 Data Exchange
Data exchange is done using AJAX: parameters are passed from the
client side to a JSP, which in turn queries information on the server side.
The JSP converts the retrieved data into XML format. The XML data is
parsed using the DOM on the client side and is then presented to the user in
readable form using CSS.
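The round trip described above (serialize on the server, walk the DOM on the client) can be illustrated with a short sketch. It uses Python's standard library in place of the JSP and the browser script, so it shows only the shape of the exchange. The element names statements, statement, subject, relation and object, the pmid attribute and the sample values are all hypothetical placeholders, not the format used by the Semantic Browser.

```python
import xml.etree.ElementTree as ET
from xml.dom import minidom


def to_xml(statements):
    """Serialize (subject, relation, object, pmid) tuples into an XML string."""
    root = ET.Element("statements")
    for subject, relation, obj, pmid in statements:
        st = ET.SubElement(root, "statement", {"pmid": pmid})
        ET.SubElement(st, "subject").text = subject
        ET.SubElement(st, "relation").text = relation
        ET.SubElement(st, "object").text = obj
    return ET.tostring(root, encoding="unicode")


def parse_xml(xml_text):
    """Walk the DOM and recover the tuples, as a client-side script would."""
    dom = minidom.parseString(xml_text)
    result = []
    for st in dom.getElementsByTagName("statement"):
        fields = [st.getElementsByTagName(tag)[0].firstChild.data
                  for tag in ("subject", "relation", "object")]
        result.append((*fields, st.getAttribute("pmid")))
    return result


# Placeholder values only; they do not come from the real data set.
data = [("subjectA", "relatedTo", "objectB", "0000000")]
xml_text = to_xml(data)
print(xml_text)
print(parse_xml(xml_text))
```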
0.3.3 Functionality
The entities and relations in the abstract are highlighted. When the user
hovers over an entity (subject), the corresponding relation and object of
the RDF statement are listed. The PMIDs of the files in which this
statement occurs are displayed. Two search boxes are provided, one for PMID
and the other for keyword. As the user types, suggestions appear in a drop-down
menu.
0.4 Web Portal
I worked on the library web page of Kno.e.sis. The resources were displayed
on the web using a tool named Exhibit, which provides an interface to browse
through the resources. Earlier, it fetched data in JSON format, which was
created manually from the spreadsheets. Now, the data is read directly
from the spreadsheet. The data in the spreadsheet (Google Spreadsheet) was cleaned
up using a Java library so that every lab member's name appears only once,
irrespective of whether he or she uses initials or the canonical form.
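One possible way to merge such name variants is sketched below. It is a heuristic illustration in Python, not the Java clean-up code used for the portal: it assumes non-empty names, groups variants by surname and first initial, and treats the longest variant in each group as the canonical form. The function names are hypothetical, and two different people who share a surname and a first initial would be merged incorrectly.

```python
def name_key(name):
    """Key a name by lowercase surname and first initial, e.g. 'A. Sheth' -> ('sheth', 'a')."""
    parts = name.replace(".", " ").split()
    return (parts[-1].lower(), parts[0][0].lower())


def canonicalize(names):
    """Map every name variant to the longest variant sharing its key."""
    canonical = {}
    for name in names:
        key = name_key(name)
        if key not in canonical or len(name) > len(canonical[key]):
            canonical[key] = name
    return {name: canonical[name_key(name)] for name in names}


variants = ["Amit Sheth", "A. Sheth", "Amit P. Sheth",
            "Cartic Ramakrishnan", "C. Ramakrishnan"]
mapping = canonicalize(variants)
print(sorted(set(mapping.values())))
# ['Amit P. Sheth', 'Cartic Ramakrishnan']
```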
0.5 Acknowledgements
0.6 Conclusion
Bibliography
[1] Xiang Zhang, Gong Cheng, Yuzhong Qu, Ontology Summarization
Based on RDF Sentence Graph, World Wide Web Conference, 2007.
[2] Bush, V., As We May Think, The Atlantic Monthly, 1945. 176(1), p. 101-108.
[3] Cartic Ramakrishnan, Krys J. Kochut, Amit P. Sheth, A Framework for
Schema-Driven Relationship Discovery from Unstructured Text, ISWC,
2006. p. 583-596.
[4] Marti A. Hearst, Untangling Text Data Mining, Proceedings of ACL,
1999.
[5] Partha Pratim Talukdar, Thorsten Brants, Mark Liberman, Fernando
Pereira, A Context Pattern Induction Method for Named Entity Extraction, Proceedings of the 10th Conference on Computational Natural Language Learning, June 2006.
[6] Eugene Agichtein, Luis Gravano, Snowball: Extracting Relations from
Large Plain-Text Collections, ACM DL, 2000.
[7] lucene.apache.org