Collecting Web Data
In this module we examine DBpedia. In essence, DBpedia is a community effort to extract information from Wikipedia, to structure it, and to publish it on the web.
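As a concrete illustration of how such structured data is accessed over standard web technologies, the sketch below builds (but does not send) an HTTP request URL for DBpedia's public SPARQL endpoint. The particular query and resource are only examples chosen here, not ones taken from the project.

```python
from urllib.parse import urlencode

# DBpedia exposes a public SPARQL endpoint; a query travels as an
# ordinary HTTP GET request, which is what makes the data readable
# by machines rather than only by humans.
SPARQL_ENDPOINT = "https://dbpedia.org/sparql"

def build_query_url(resource_uri, limit=10):
    """Build a request URL asking for all properties of one resource."""
    query = (
        "SELECT ?property ?value WHERE { "
        f"<{resource_uri}> ?property ?value "
        f"}} LIMIT {limit}"
    )
    return SPARQL_ENDPOINT + "?" + urlencode({"query": query, "format": "json"})

url = build_query_url("http://dbpedia.org/resource/Tim_Berners-Lee")
```

Sending this URL with any HTTP client would return the resource's relations and properties as JSON; here we only show how the request is formed.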
DBpedia allows users to query relations and properties of Wikipedia resources. The project describes a method of publishing structured data in a form in which it can be interlinked and thereby become more useful. This method is built upon standard web technologies, such as URIs and HTTP, but instead of serving the data in a form readable only by humans, as ordinary web pages do, it uses these technologies to expose the information in a way that machines can read and interpret.

Another part of the Linked Data project is FOAF, a dataset that describes people, their properties, and their relationships. "Friend of a friend" (FOAF) is a phrase used to refer to someone one does not know well, literally a friend of a friend. The basic idea behind FOAF is that it provides information written in such a way that the web itself can follow the connections between the things that matter to people. Each person is searched in order to obtain, if it exists, his or her site id. This id allows forming a URL that takes the user to the person's page; for Portal, a number represents the user ID. The user's page is structured in a similar manner to the representation. The main object described is an article written by the requested person (there can be more than one article); each article is described by its authors, title, publishing year, and so on.

With DBLP there are two main pages describing a person. The first page lists all of the person's collaborators, with their full names and their unique ids. This information helps when monitoring a person's activity: a person is considered active when he or she has a large number of collaborators. The second page used to extract information gives the person's articles or, more precisely, URLs to the descriptions of those articles.
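The two ideas above can be sketched as follows. The report does not give the exact DBLP URL layout or a concrete activity threshold, so the URL template and the cut-off below are assumptions for illustration only.

```python
# Hypothetical template for a person's page; the real DBLP layout
# may differ, so treat this as an assumption, not the actual scheme.
PERSON_PAGE = "https://dblp.org/pid/{pid}"

# Assumed cut-off: the report only says "a large number of collaborators".
ACTIVE_THRESHOLD = 20

def person_url(pid):
    """Form the URL of a person's page from their unique id."""
    return PERSON_PAGE.format(pid=pid)

def is_active(collaborators):
    """A person counts as active when they have many collaborators."""
    return len(collaborators) >= ACTIVE_THRESHOLD
```

With a list of collaborator names extracted from the first page, `is_active` implements the activity heuristic directly, and `person_url` reproduces the id-to-URL step.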
Preprocessing Web Data
In this module the data collected from DBpedia or DBLP is preprocessed: all the text contents are gathered accurately and quickly and then organized by subject. This allows people to observe and filter information on the web. The preprocessing step can classify all the texts semantically; it can also analyze the keywords, phrases, and structure of documents and recognize a text's category, so it performs both statistical and semantic analysis. Classifying the texts into categories based on their subject and keywords makes every piece of text findable. We have also used this structure in the project to define the articles that represent the user's research activity.
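A minimal sketch of keyword-based subject classification is shown below. The category names and keyword lists are invented for illustration; the report does not specify which categories or keywords the system actually uses.

```python
# Illustrative keyword lists per category (assumed, not from the report).
CATEGORIES = {
    "semantic_web": {"rdf", "sparql", "ontology", "linked"},
    "databases": {"sql", "index", "query", "transaction"},
}

def classify(text):
    """Return the category whose keywords overlap the text the most."""
    words = set(text.lower().split())
    scores = {cat: len(words & kws) for cat, kws in CATEGORIES.items()}
    best = max(scores, key=scores.get)
    # No keyword matched at all: the text stays uncategorized.
    return best if scores[best] > 0 else None
```

Counting keyword overlaps is the statistical half of the analysis; a real system would refine this with the phrase and document-structure cues mentioned above.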
The collaborators graph shows the people who have a certain number (x) of articles in common with the main user. One can also see the strongest relation in the graph, which represents the user's most important collaborator. The collaborators graph is also an interesting statistic in itself, because it gives information about a user's activity. The graph of collaborators is created dynamically.
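One way to sketch this dynamic construction is to weight each edge by the number of co-authored articles and take the heaviest edge as the strongest relation. The input format (a list of author lists) is an assumption for illustration.

```python
from collections import Counter

def build_graph(articles, user):
    """Map each collaborator to the number of articles shared with user.

    `articles` is assumed to be a list of author lists, one per article.
    """
    weights = Counter()
    for authors in articles:
        if user in authors:
            for author in authors:
                if author != user:
                    weights[author] += 1
    return weights

def strongest_relation(weights):
    """Collaborator with the highest edge weight, i.e. the most
    important collaborator of the user."""
    return max(weights, key=weights.get) if weights else None
```

For example, with articles `[["A", "B"], ["A", "B"], ["A", "C"]]` and user `"A"`, the graph weights `B` twice and `C` once, so the strongest relation is `B`.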
Besides the graph of collaborators, other statistics and relations appear on the user's profile. Statistics are built from the number of publications in each year, which keeps track of the periods in which the user has been most active. A list of the common articles then appears on the screen. If one of the requested people does not have a FOAF profile on DBLP then, depending on the request, a specific message appears on the screen; in case of success, all the documents and articles in common are listed.
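The profile statistics above can be sketched in a few lines: publications counted per year, common articles computed as a set intersection, and a message returned when a person has no profile. The data shapes and the message text are assumptions, since the report does not specify them.

```python
from collections import Counter

def publications_per_year(articles):
    """Count publications per year; articles are assumed to be dicts
    with a "year" field."""
    return Counter(a["year"] for a in articles)

def common_articles(profile_a, profile_b):
    """List the articles two people share; a missing profile (None)
    yields an explanatory message instead."""
    if profile_a is None or profile_b is None:
        return "Requested person has no FOAF profile on DBLP."
    return sorted(set(profile_a) & set(profile_b))
```

The per-year counter makes the user's most active periods directly visible, and the intersection is what gets listed on the screen in the success case.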