
Collecting Web Data

In this module we examine DBpedia. DBpedia is a community effort to gather information
from Wikipedia, structure it, and publish it on the web.
DBpedia allows users to query the relations and properties of Wikipedia resources. It is
part of the Linked Data project, which describes a method of publishing structured data
so that it can be interlinked and thereby become more useful. This method is built upon
standard web technologies, such as URIs and HTTP, but instead of making the data readable
by humans, as conventional web pages do, it uses these technologies to share information
in a form that machines can read and interpret. Another part of the Linked Data project
is FOAF, a vocabulary for describing persons, their properties and their relationships.
Friend of a friend (FOAF) is a phrase used to refer to someone that one does not know
well: literally, a friend of a friend. The basic idea behind FOAF is to provide
information written in such a way that machines can, on their own, understand the
connections between the things that matter to people. Each person is looked up in order
to obtain his or her site ID, if one exists. This ID is used to form a URL that leads to
the person's page. For Portal, the user ID is a number. The user's page is structured in
a manner similar to the representation. The main object described is the article written
by the requested person; there can be more than one. Each article is described by its
authors, title, year of publication, and so on. With DBLP there are two main pages
describing a person. The first page lists all of the person's collaborators, with their
full names and unique IDs. This information helps when monitoring a person's activity: a
person is considered active when he or she has a large number of collaborators. The
second page used to extract information gives the person's articles or, more
specifically, URLs to the descriptions of the articles.
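
As an illustration of what querying the properties of a resource can look like, the
following minimal sketch builds a SPARQL request URL for DBpedia without sending it. The
endpoint address, the `format` parameter, the example resource name and the function
names are assumptions made for this sketch; none of them are taken from the project
described here.

```python
from urllib.parse import urlencode

# Assumption: the public DBpedia SPARQL endpoint; the text above does not name one.
DBPEDIA_ENDPOINT = "https://dbpedia.org/sparql"


def build_property_query(resource_name: str, limit: int = 10) -> str:
    """Return a SPARQL query listing properties and values of one resource."""
    resource_uri = f"http://dbpedia.org/resource/{resource_name}"
    return (
        "SELECT ?property ?value WHERE { "
        f"<{resource_uri}> ?property ?value "
        f"}} LIMIT {limit}"
    )


def build_request_url(query: str) -> str:
    """Encode the query as an HTTP GET URL (nothing is sent over the network)."""
    # Assumption: the endpoint accepts a 'format' parameter asking for JSON results.
    params = {"query": query, "format": "application/sparql-results+json"}
    return f"{DBPEDIA_ENDPOINT}?{urlencode(params)}"


if __name__ == "__main__":
    # Hypothetical example resource, used only to show the shape of the URL.
    print(build_request_url(build_property_query("Tim_Berners-Lee")))
```

The resulting URL could be fetched with any HTTP client; the sketch stops at building it
so that it stays runnable offline.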

Preprocessing Web Data

In this module the data collected from DBpedia or DBLP is preprocessed: all the text
contexts and contents are gathered accurately and quickly and are organized by subject.
This allows people to observe and filter the information on the Internet. The
preprocessing can classify all the texts semantically; it can also analyze the keywords,
phrases and structure of documents and recognize a text's category, so it performs both
statistical and semantic analysis. Because the texts are classified and placed into
categories based on their subject and keywords, every piece of text can be found. We have
also used this structure in the project to define the articles that represent the user's
research activity.
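
The keyword-based part of this classification can be sketched as follows. This is only a
minimal illustration of the statistical side (counting keyword matches); it does not
attempt semantic analysis. The categories, keyword sets and function names are
hypothetical and were chosen for the example.

```python
import re
from collections import Counter

# Hypothetical categories and keywords, chosen only for illustration.
CATEGORY_KEYWORDS = {
    "semantic web": {"rdf", "ontology", "sparql", "linked"},
    "databases": {"query", "index", "transaction", "sql"},
}


def tokenize(text):
    """Lower-case the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def classify(text, category_keywords=CATEGORY_KEYWORDS):
    """Return the category whose keywords occur most often, or None if none occur."""
    counts = Counter(tokenize(text))
    scores = {
        category: sum(counts[word] for word in keywords)
        for category, keywords in category_keywords.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


def group_by_category(texts):
    """Organize texts by the category assigned to each of them."""
    groups = {}
    for text in texts:
        groups.setdefault(classify(text), []).append(text)
    return groups
```

Texts that match no keyword are grouped under `None`, so they remain visible and can be
reviewed separately.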

Analyzing Web Data


In this module the collected and preprocessed data is analyzed: the information is
extracted and stored on disk in RDF files. Each person has his or her own RDF file, in
which the DBLP profile is structured in a more orderly way. The RDF structure chosen
(classes and attributes) is similar. The main class used is <foaf:Person>; it represents
the person together with his or her collaborators and scientific research. This class
encapsulates certain attributes of a person, such as the person's name, the people he or
she knows, and a list of documents that represent his or her research. The attribute
<foaf:knows> encapsulates further <foaf:Person> tags, which contain the name and, where
available, the URL of each collaborator. There are further classes of type
<foaf:Document>, which describe the articles. The Document type has its own attributes,
which have the form < >. They can stand for title, description, date, author, and so on.
A document can have several authors; therefore, there can be several < > tags within its
description. The information in the RDF files is obtained by parsing HTML pages from
Java classes. A mediator gathers all the information and uses it to create the RDF files.
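
The structure described above can be illustrated with a small sketch that writes one
person's profile as RDF/XML. It is written in Python rather than Java purely for brevity
and is not the project's mediator. The use of foaf:made to attach documents and of
dc:title for the document title are assumptions, since the original attribute names are
not shown; the function name and sample values are hypothetical as well.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
FOAF = "http://xmlns.com/foaf/0.1/"
DC = "http://purl.org/dc/elements/1.1/"  # assumption: used here for document titles

for prefix, uri in (("rdf", RDF), ("foaf", FOAF), ("dc", DC)):
    ET.register_namespace(prefix, uri)


def qname(namespace, tag):
    """Build the '{namespace}tag' form that ElementTree expects."""
    return "{%s}%s" % (namespace, tag)


def build_profile(name, collaborators, titles):
    """Return RDF/XML text for a person, the people known, and the documents.

    collaborators: list of (name, url_or_None) pairs.
    titles: list of article titles.
    """
    root = ET.Element(qname(RDF, "RDF"))
    person = ET.SubElement(root, qname(FOAF, "Person"))
    ET.SubElement(person, qname(FOAF, "name")).text = name

    for collaborator_name, url in collaborators:
        knows = ET.SubElement(person, qname(FOAF, "knows"))
        other = ET.SubElement(knows, qname(FOAF, "Person"))
        if url:  # the URL is included only when one is available
            other.set(qname(RDF, "about"), url)
        ET.SubElement(other, qname(FOAF, "name")).text = collaborator_name

    for title in titles:
        # Assumption: foaf:made links the person to each document.
        made = ET.SubElement(person, qname(FOAF, "made"))
        document = ET.SubElement(made, qname(FOAF, "Document"))
        ET.SubElement(document, qname(DC, "title")).text = title

    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(build_profile(
        "Alice Example",
        [("Bob Example", "http://example.org/bob"), ("Carol Example", None)],
        ["A Sample Paper"],
    ))
```

Writing the returned text to one file per person would mirror the one-RDF-per-person
layout described above.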

Tracking User Profiles


In this module a client enters a person's name in order to receive that person's research
work and collaborators. If the client enters a correct name and then chooses which
information from DBLP to show, he or she is redirected to the requested person's page if
the page exists; otherwise a message is displayed notifying the client that the requested
person does not have a page. The center of interest of the profile is the user and his or
her collaborators. Next to each collaborator's name is the number of articles in common
with the user, based on the information received from DBLP. The collaborators are shown
as a graph, which is created dynamically and changes with the number of articles in
common. This makes it possible to show only the people who have x articles in common with
the main user. One can also see the strongest relation in the graph, which represents the
user's most important collaborator. The collaborators graph is also an interesting
statistic in itself, for it gives information about a user's activity. Besides the graph
of collaborators, other statistics and relations appear on the user's profile. The number
of publications per year is one such statistic; it shows the periods in which the user
has been most active. A list of the common articles then appears on the screen. If one of
the requested people does not have a FOAF profile on DBLP, a specific message appears on
the screen, depending on the request. In case of success, all the documents and articles
in common are listed.
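
The statistics described above can be sketched as follows. The sample records, names and
function names are hypothetical; in the project the data would come from the stored RDF
files. The filter keeps collaborators with at least x common articles, which is one
possible reading of the description.

```python
from collections import Counter

# Hypothetical sample records, used only for illustration.
ARTICLES = [
    {"title": "Paper A", "year": 2009, "authors": ["Alice", "Bob"]},
    {"title": "Paper B", "year": 2010, "authors": ["Alice", "Bob", "Carol"]},
    {"title": "Paper C", "year": 2010, "authors": ["Alice", "Dan"]},
    {"title": "Paper D", "year": 2011, "authors": ["Bob", "Carol"]},
]


def common_article_counts(user, articles):
    """Count, for each collaborator, the articles written together with the user."""
    counts = Counter()
    for article in articles:
        if user in article["authors"]:
            counts.update(a for a in article["authors"] if a != user)
    return counts


def collaborators_with_at_least(counts, x):
    """Keep only collaborators with at least x common articles (assumed reading)."""
    return sorted(name for name, n in counts.items() if n >= x)


def strongest_collaborator(counts):
    """Return (name, common_articles) for the strongest relation, or None."""
    return counts.most_common(1)[0] if counts else None


def publications_per_year(user, articles):
    """Count the user's publications in each year."""
    return Counter(a["year"] for a in articles if user in a["authors"])


def common_articles(user, other, articles):
    """List the titles of the articles the two people wrote together."""
    return [
        a["title"] for a in articles
        if user in a["authors"] and other in a["authors"]
    ]
```

For example, `common_article_counts("Alice", ARTICLES)` yields the edge weights of the
collaborators graph for the hypothetical user Alice, and an empty result from
`common_articles` corresponds to the case where a message is shown instead of a list.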
