
Summary of Principles of Distributed Database Systems, Chapter 12: Web Data Management

12 Web Data Management


The World Wide Web (WWW) has become a significant repository of data and documents, and it grows and changes rapidly. It consists of two components: the publicly indexable web (PIW) and the deep web (hidden web). The PIW consists of static web pages stored on web servers, which can be easily searched and indexed. The hidden web is composed of a large number of databases that encapsulate their data, hiding it from the outside world; this data is usually retrieved through search interfaces. A portion of the deep web is known as the "dark web."
Research on web data management has followed different threads in two separate but overlapping
communities. The web search and information retrieval community focused on keyword search and search
engines, while the database community focused on declarative querying of web data. XML emerged as an important format for representing and integrating data on the web in the 2000s, but its use in web data management has waned due to its perceived complexity. RDF has since emerged as a common model for representing and integrating web data.
There is little unifying architecture or framework for discussing web data management, and different lines of
research have to be considered separately. Full coverage of all web-related topics requires deeper and more
extensive treatment than can be covered within a chapter.

12.1 Web Graph Management


The web graph is a structured representation of static HTML web pages connected by hyperlinks, which can
be modeled as a directed graph. This structure is crucial for studying data management issues, as it is used in
web search, categorization, and classification of web content. The web graph has several important characteristics: it is quite volatile, sparse, and self-organizing, it exhibits the "small-world" property, and its degree distribution follows a power law.
The web graph is characterized by its "bowtie" shape. At its center is a strongly connected component (SCC), the knot in the middle, within which every page can reach every other page along directed links; this component accounts for about 28% of web pages. The "IN" component (about 21% of pages) consists of pages that have paths to the SCC but cannot be reached from it, while the "OUT" component (about 21% of pages) consists of pages reachable from the SCC but with no paths back to it. Tendrils consist of pages that can neither reach the SCC nor be reached from it; they make up about 22% of web pages and represent parts of the web that have not yet been connected to its better known regions.
There are also disconnected components that have no links to/from anything except their own small
communities, making up about 8% of the web. This structure determines the results obtained from web
searches and querying the web. It is different from many other graphs that are normally studied, requiring
special algorithms and techniques for its management.
In summary, the web graph is a complex structure that plays a significant role in various tasks such as web
search, categorization, and classification. Its unique characteristics make it an interesting and complex
structure that requires special algorithms and techniques for effective management.

12.2 Web Search


Web search is a process that involves finding all relevant web pages related to a user's specified keywords.
However, it is not possible to find all the pages or know if they have been retrieved. Instead, the search is
performed on a database of collected and indexed web pages. These pages are presented to the user in
ranked order of relevance as determined by the search engine.
A generic search engine's abstract architecture includes a crawler, which scans the web on the engine's behalf and collects data about web pages. The crawler is given a starting set of pages, specified by their Uniform Resource Locators (URLs). It retrieves and parses the page corresponding to each URL, extracts any URLs contained in it, and adds them to a queue. In the next cycle, the crawler takes a URL from the queue and retrieves the corresponding page. This process is repeated until the crawler stops.
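A minimal sketch of this crawl cycle, using only the Python standard library, is given below; the seed URLs, page limit, and error handling are illustrative choices, not the book's algorithm.

from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    # collects the href targets of <a> tags while a page is parsed
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed_urls, max_pages=100):
    queue = deque(seed_urls)          # URLs discovered but not yet visited
    visited = set()
    while queue and len(visited) < max_pages:
        url = queue.popleft()         # breadth-first ordering of the frontier
        if url in visited:
            continue
        try:
            page = urlopen(url, timeout=5).read().decode("utf-8", "ignore")
        except Exception:
            continue                  # skip pages that cannot be retrieved
        visited.add(url)
        parser = LinkExtractor()
        parser.feed(page)
        for link in parser.links:
            queue.append(urljoin(url, link))   # add extracted URLs to the queue
    return visited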

The indexer module constructs indexes on the downloaded pages; two common ones are text indexes and link indexes. A text index provides, for each word, the URLs of the pages in which that word occurs, while a link index describes the link structure of the web, providing information about the in-links and out-links of each page.
The ranking module sorts a large number of results to present the most relevant ones to the user's search.
This problem has drawn increased interest to address the special characteristics of the web, such as small
queries executed over vast amounts of data.

12.2.1 Web Crawling


A crawler is a tool used by search engines to scan the web and extract information about the pages it visits. Because of the web's size and constantly changing nature, and the crawler's limited computing and storage capabilities, it is impossible to crawl the entire web. A crawler must therefore prioritize the "most important" pages and visit them in rank order.
To determine the importance of a page, measures can be static or dynamic. Static measures determine the
importance of a page based on the number of backlinks or the importance of backlink pages, such as the
PageRank metric used by Google. Dynamic measures calculate the importance of a page based on its textual
similarity to the query being evaluated using information retrieval similarity measures.
Recall that the PageRank of a page Pi, denoted PR(Pi), is simply the normalized sum of the PageRanks of all of Pi's backlink pages (denoted BPi), where the contribution of each Pj ∈ BPi is normalized over all of Pj's forward links FPj:

PR(Pi) = Σ_{Pj ∈ BPi} PR(Pj) / |FPj|

Recall also that this formula calculates the rank of a page based on its backlinks, but normalizes the contribution of each backlinking page Pj by the number of forward links that Pj has. The idea is that it is more important to be pointed at by pages that link to other pages conservatively than by pages that link to others indiscriminately, so the "contribution" of a link from such a page is normalized over all the pages that it points to.
A crucial issue is which page the crawler should visit next once it has crawled a page. The crawler maintains a queue of URLs, which can simply be kept in the order in which the URLs were discovered (a breadth-first approach), in random order, or ordered according to a metric such as backlink count or PageRank. For this purpose, a slight revision to the PageRank formula is needed, modeling a random surfer who, with probability d, follows one of the links on the current page (choosing among them with equal probability) and, with probability 1 − d, jumps to a random page. The PageRank formula is then revised as follows:

PR(Pi) = (1 − d) + d · Σ_{Pj ∈ BPi} PR(Pj) / |FPj|
The ordering of the URLs according to this formula allows the importance of a page to be incorporated into
the order in which the corresponding page is visited. In some formulations, the first term is normalized with
respect to the total number of pages in the web.
Example 12.1 Consider the web graph in Fig. 12.3, where each web page Pi is a vertex and there is a directed edge from Pi to Pj if Pi has a link to Pj. Assuming the commonly accepted value of d = 0.85, the PageRank of P2 is PR(P2) = 0.15 + 0.85(PR(P1)/2 + PR(P3)/3). This is a recursive formula that is evaluated by initially assigning equal PageRank values to every page (in this case 1/6, since there are 6 pages) and iterating the computation of each PR(Pi) until a fixpoint is reached (i.e., the values no longer change).
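The fixpoint iteration in the example can be sketched as follows; the three-page graph used here is invented for illustration (Fig. 12.3 is not reproduced), and dangling pages without forward links would need additional handling.

def pagerank(graph, d=0.85, tol=1e-6):
    # graph: page -> list of pages it links to (its forward links); every page
    # is assumed to have at least one forward link in this sketch
    pages = list(graph)
    pr = {p: 1.0 / len(pages) for p in pages}     # start with equal PageRank values
    while True:
        new_pr = {}
        for p in pages:
            # contribution of each backlink page Pj, normalized by |FPj|
            backlink_sum = sum(pr[q] / len(graph[q]) for q in pages if p in graph[q])
            new_pr[p] = (1 - d) + d * backlink_sum
        if max(abs(new_pr[p] - pr[p]) for p in pages) < tol:   # fixpoint reached
            return new_pr
        pr = new_pr

graph = {"P1": ["P2", "P3"], "P2": ["P3"], "P3": ["P1", "P2"]}
print(pagerank(graph))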
Crawling is a continuous activity that involves revisiting web pages to update the information. Incremental
crawlers are designed to ensure fresh information by selectively revisiting pages based on their change
frequency or by sampling a few pages. Change frequency-based approaches use an estimate of a page's
change frequency to determine its revisit frequency. Sampling-based approaches focus on websites rather
than individual pages, sampling a small number of pages from a site to estimate the change in the site.
Focused crawlers are used by search engines to search pages related to a specific topic, ranking pages based
on relevance. Learning techniques, such as naïve Bayes classifier and reinforcement learning, are used to
identify the topic of a given page. Parallel crawling can be achieved by running parallel crawlers, but
coordination schemes must minimize overhead. One method is to use a central coordinator to dynamically
assign each crawler a set of pages to download, while another is to logically partition the web, with each
crawler knowing its partition without central coordination.

12.2.2 Indexing
In order to search the crawled pages and the gathered information efficiently, a number of indexes are built, as shown in Fig. 12.2. The two most important ones are the structure (or link) index and the text (or content) index.

12.2.2.1 Structure Index


The structure index, based on the graph model discussed in Sect. 12.1, aids in efficient storage and retrieval of web
pages. It provides crucial information about the linkage of web pages, including the neighborhood and siblings of a
page, ensuring efficient web page management.

12.2.2.2 Text Index


The text index is the most commonly used index for text-based retrieval, and it can be implemented using various
access methods, such as suffix arrays, inverted files, and signature files. An inverted index is a collection of inverted lists, one per word; each list records the occurrences of that word and is used for proximity queries and for ranking query results. Search
algorithms often use additional information about the occurrence of terms in a web page, such as terms occurring in
bold face, section headings, or anchor text.
In addition to the inverted list, many text indexes also keep a lexicon, a list of all terms that occur in the index, and
term-level statistics that can be used by ranking algorithms. Conceptually, constructing an inverted index is simple: each page is processed, its words are read, and the location of each word is stored. Maintaining such an index over a collection as vast and nonstatic as the web, however, is a major challenge.
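As a concrete illustration of a text index, the following sketch builds a positional inverted index (plus a lexicon, which is simply the set of index terms) over a tiny in-memory page collection; the pages and the whitespace tokenization are simplified and illustrative.

from collections import defaultdict

def build_inverted_index(pages):
    # pages: URL -> page text
    # returns: term -> list of (URL, [positions of the term in that page])
    index = defaultdict(dict)
    for url, text in pages.items():
        for position, term in enumerate(text.lower().split()):
            index[term].setdefault(url, []).append(position)
    return {term: sorted(postings.items()) for term, postings in index.items()}

pages = {
    "http://example.org/a": "principles of distributed database systems",
    "http://example.org/b": "web data management in distributed systems",
}
index = build_inverted_index(pages)
lexicon = sorted(index)            # the list of all terms that occur in the index
print(index["distributed"])        # inverted list for the term "distributed"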
Maintaining the "freshness" of the index is also a challenge due to the rapid change of the web. Periodic index
rebuilding is necessary to ensure freshness, and storage formats must be carefully designed to balance performance
gains with overhead at query time.
To address these challenges and build a highly scalable text index, the index can be distributed, either as local inverted indexes stored at each machine or as a single global inverted index partitioned across the machines.

12.2.3 Ranking and Link Analysis


Search engines often return a large collection of web pages, but these pages are likely to be different in terms of
quality and relevance. To rank these pages, algorithms are needed to ensure higher quality ones appear in the top
results. Link-based algorithms can be used to rank a collection of pages, based on the intuition that if a page links to another page, the authors of the former probably consider the latter to be of good quality. This intuition underlies ranking algorithms such as PageRank and HITS. HITS is based on identifying "authorities" and "hubs": a good authority page is one that is linked to by many good hubs, and a good hub is a document that links to many good authorities. Given the web graph G = (V, E), each page Pi ∈ V is assigned an authority value aPi and a hub value hPi, and these values are updated iteratively. If a page Pi is pointed to by many good hubs, then aPi is increased to reflect all pages Pj that link to it (the notation Pj → Pi means that page Pj has a link to page Pi):

aPi = Σ_{Pj → Pi} hPj        hPi = Σ_{Pi → Pj} aPj

Thus, the authority value (hub value) of a page Pi is the sum of the hub values (authority values) of the pages that link to it (that it links to).
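A minimal sketch of the iterative HITS computation on an invented graph follows; the values are normalized after each step so that they converge, and a fixed number of iterations stands in for a convergence test.

import math

def hits(graph, iterations=50):
    # graph: page -> list of pages it links to; assumed to contain at least one link
    pages = list(graph)
    auth = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # authority of p: sum of the hub values of all pages that link to p
        auth = {p: sum(hub[q] for q in pages if p in graph[q]) for p in pages}
        norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        auth = {p: v / norm for p, v in auth.items()}
        # hub of p: sum of the authority values of all pages that p links to
        hub = {p: sum(auth[q] for q in graph[p]) for p in pages}
        norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        hub = {p: v / norm for p, v in hub.items()}
    return auth, hub

graph = {"P1": ["P2", "P3"], "P2": ["P3"], "P3": ["P1"]}
authorities, hubs = hits(graph)
print(authorities, hubs)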

12.2.4 Evaluation of Keyword Search


Keyword-based search engines are popular tools for searching information on the web, but they have limitations such
as not being powerful enough to express complex queries, not offering a global view of information, and not capturing
user intent. Category search, also known as web directory, catalogs, yellow pages, and subject directories, addresses
these issues by providing a global view of the web.
A web directory is a hierarchical taxonomy that classifies human knowledge and can be useful if the user can identify the category that is the target of the search. However, not every web page can be classified, and natural language processing cannot categorize web pages with complete accuracy. Additionally, keeping the directory up-to-date is time-consuming and involves significant overhead.
Metasearchers are web services that take a query from the user, send it to multiple search engines, collect the answers, and return a unified result. Examples include Dogpile, MetaCrawler, and Ixquick. Metasearchers differ in how they unify results and in how they translate the user query into the specific query language of each search engine. Users access metasearchers through client software or a web page. Since each individual search engine covers only a fraction of the web, the goal of a metasearcher is to achieve broader coverage than any single search engine by combining several of them.

12.3 Web Querying


Database technology provides declarative querying and efficient query execution, but applying database techniques to web data is challenging because they assume that the data conforms to a strict schema. Web data may be semistructured, but it is not as rigid, regular, or complete as database data, which makes schema-based querying difficult.
The web is more than semistructured data and documents; links between data entities, such as pages, are important
and need to be considered. Web queries may need to follow and exploit links as first-class objects, similar to search.
There is no commonly accepted language, analogous to SQL, for querying web data. Although some consensus on basic constructs has emerged, no standard language exists. Standardized languages have been defined for particular data models, such as XQuery for XML and SPARQL for RDF; these are discussed in Sect. 12.6, which focuses on web data integration.

12.3.1 Semistructured Data Approach


Web data can be modeled and queried using semistructured data models and languages, which were originally developed for managing data collections that lack a strict schema. Since web data shares this characteristic, these models and languages apply naturally to the web. The Object Exchange Model (OEM) is a self-describing semistructured data model in which each object describes its own schema. An OEM object is a four-tuple consisting of a label, a type, a value, and an oid (object identifier).
OEM data can be represented as a vertex-labeled graph, where the vertices correspond to OEM objects and the edges correspond to the subobject relationship; each vertex is labeled with the oid and the label of the corresponding object. It is more common in the literature, however, to model the data as an edge-labeled graph, in which the label of an object oj is attached to the edge connecting oi to oj and the oids are omitted from the vertices.
The semistructured approach fits well for modeling web data since it can be represented as a graph and accepts that
data may have some structure, but not as rigid, regular, or complete as traditional databases. Users do not need to be
aware of the complete structure when querying the data, so expressing a query should not require full knowledge of
the structure. These graph representations of data at each data source are generated by wrappers discussed in Sect. 7.2.
Lorel is a language used to query semistructured data, with its SELECT-FROM-WHERE structure and path
expressions in the SELECT, FROM, and WHERE clauses. The fundamental construct in forming Lorel queries is a
path expression, which is a sequence of labels starting with an object name or a variable denoting an object. Such simple paths are called data paths; more complex path expressions are regular expressions over labels, constructed using conjunction, disjunction, iteration, and wildcards. A simple example of a path expression is bib.doc.title.
The semistructured data approach to web data modeling and querying is simple and flexible, supporting the
containment structure of web objects and the link structure of web pages. However, it lacks record structure and ordering, and its treatment of links is limited: the model and languages do not differentiate between different types of links, so these cannot be separately modeled or easily queried. The graph structure can also become complicated, making it difficult to query. To
simplify the process, a construct called a DataGuide has been proposed. A DataGuide is a graph where each path in
the corresponding OEM graph occurs only once and is dynamic, updating as the OEM graph changes. This provides
concise and accurate structural summaries of semistructured databases and can be used as a lightweight schema for
browsing the database structure, formulating queries, storing statistical information, and enabling query optimization.

12.3.2 Web Query Language Approach


This category of approaches addresses web data characteristics, focusing on links and overcoming keyword search
shortcomings by providing abstractions for document content structure and external links. They combine content-
based queries like keyword expressions with structure-based queries like path expressions.

Web query languages fall into two generations. First generation languages, such as WebSQL, W3QL, and WebLog, model the web as an interconnected collection of atomic objects; their queries can exploit the link structure of the web but not the internal structure of documents. Second generation languages, such as WebOQL, model the web as a linked collection of structured objects, allowing queries to exploit document structure as well. WebSQL, one of the earliest web query languages, combines searching and browsing by directly addressing web data as captured by web documents and the links among them. As before, this structure can be represented as a graph, but WebSQL captures the information about web objects in two virtual relations:

The document relation stores one tuple for each web document, with attributes for its URL, title, text, type, length, and modification date; the URL is the key and the only attribute that cannot be null. The link relation stores information about links, with attributes for the base URL (the document containing the link), the referenced URL (the document pointed to), and the link's label.
WebSQL defines a query language that consists of SQL plus path expressions. These path expressions are more powerful than their counterparts in Lorel; in particular, they distinguish different types of links:
(a) interior link that exists within the same document (#>)
(b) local link that is between documents on the same server (->)
(c) global link that refers to a document on another server (=>)
(d) null path (=)
These link types form the alphabet of the path expressions. Using them, together with the usual constructors of regular expressions, different paths can be specified, as in Example 12.7.
Example 12.7 The following are examples of possible path expressions that can be specified in WebSQL.
(a) -> | =>: a path of length one, either local or global
(b) ->*: local path of any length
(c) =>->*: as above, but in other servers
(d) (-> |=>)*: the reachable portion of the web

In addition to path expressions that can appear in queries, WebSQL allows scoping within the FROM clause in the
following way:
FROM Relation SUCH THAT domain-condition
where domain-condition can be either a path expression, or can specify a text search using MENTIONS, or can
specify that an attribute (in the SELECT clause) is equal to a web object. Of course, each relation specification may be followed by a variable ranging over the relation, as in standard SQL. The following example queries (adapted with minor modifications) demonstrate the features of WebSQL.
Example 12.8 Following are some example WebSQL queries:
(a) The first example we consider simply searches for all documents about “hypertext” and demonstrates the use of
MENTIONS to scope the query.

(b) The second example demonstrates two scoping methods as well as a search for links. The query is to find all links
to applets from documents about “Java.”

(c) The third example demonstrates the use of different link types. It searches for documents that have the string
“database” in their title that are reachable from the ACM Digital Library home page through paths of length two or
less containing only local links.

(d) The final example demonstrates the combination of content and structure specifications in a query. It finds all
documents mentioning “Computer Science” and all documents that are linked to them through paths of length two or
less containing only local links.
WebSQL can query web data based on links and textual content, but it cannot query documents based on their
structure due to its data model that treats the web as a collection of atomic objects. Second-generation languages like
WebOQL address this limitation by modeling the web as a graph of structured objects, combining features of
semistructured data approaches with those of first-generation web query models. WebOQL's main data structure is a
hypertree, an ordered edge-labeled tree with two types of edges: internal and external. An internal edge represents the internal structure of a web document, while an external edge represents a reference (a hyperlink) among objects. Each edge is labeled with a record of attributes, and external edges cannot have descendants. In WebOQL, attributes are thus captured in the records associated with the edges, while internal edges represent the document structure.
An example WebOQL query uses a variable x to range over the simple trees of dbDocuments and, for a given value of x, iterates over the simple trees of a single subtree. If the author value matches "Ozsu" under the string matching operator ∼, the query constructs a tree from the title attribute of the record and the URL attribute value of the subtree. Web query languages adopt a more
powerful data model than semistructured approaches, allowing them to exploit different edge semantics and construct
new structures.
12.4 Question Answering Systems
Question answering (QA) systems are an unusual approach to accessing web data from a database perspective. These
systems accept natural language questions and analyze them to determine the specific query being posed. They are
typically used within IR systems, which aim to determine the answer to posed queries within a well-defined corpus of
documents. They allow users to specify complex queries in natural language and allow asking questions without full
knowledge of the data organization.
Sophisticated natural language processing (NLP) techniques are applied to these queries to understand the specific
query. They search the corpus of documents and return explicit answers, rather than links to relevant documents. This
does not mean they return exact answers as traditional DBMSs do, but they may return a ranked list of explicit
responses to the query.
QA systems that use the web as their corpus are known as open domain systems. Web data sources are accessed through wrappers developed for them in order to obtain answers to questions. There are various systems with different objectives and functionalities, such as Mulder, WebQA, Start, and Tritus.
Some of these systems rely on preprocessing, an offline process that extracts and refines the rules the system uses at runtime. Preprocessing analyzes documents extracted from the web, or returned as answers to user questions, to determine the most effective query structures; the resulting transformation rules are stored for later use at runtime. Tritus, for example, uses a learning-based approach that takes a collection of frequently asked questions and their correct answers as a training dataset. In a three-stage process, the system attempts to guess the structure of the answer by analyzing the question and searching for the answer: the first stage extracts the question phrase, and subsequent stages analyze the question-answer pairs and generate candidate transforms for each question phrase.

Question analysis is a process used to understand user-posed natural language questions. It involves predicting the
type of answer and categorizing the question for translation and answer extraction. Different systems use different
approaches depending on the sophistication of Natural Language Processing (NLP) techniques employed. For
example, Mulder uses three phases: question parsing, question classification, and query generation. In the query generation phase, Mulder uses four methods: verb conversion, query expansion, noun phrase formation, and transformation.

Once the question is analyzed and queries are generated, the next step is to generate candidate answers. The queries
generated at question analysis are used to perform keyword search for relevant documents. Some systems use general-
purpose search engines or consider additional data sources like the CIA's World Factbook or weather data sources like
the Weather Network or Weather Underground. The choice of appropriate search engine(s)/data source(s) is crucial
for better results.
Question answering systems are more flexible than other web querying approaches, since they let users pose questions without any knowledge of how the web data is organized. However, they are constrained by the idiosyncrasies of natural language and the difficulties of natural language processing.
The responses returned for the generated queries are normalized into "records," from which the candidate answers need to be extracted. Various text processing techniques can be used to match the query keywords against the returned records, and the results are then ranked using information retrieval techniques. Different systems employ different notions of what constitutes an appropriate answer, such as a ranked list of direct answers or a ranked list of the portions of the records that contain the query keywords.
In conclusion, question answering is a complex process that requires a combination of NLP techniques, data analysis, and a careful selection of appropriate search engines and data sources.

12.5 Searching and Querying the Hidden Web


Most search engines currently operate only on the publicly indexable web (PIW), leaving valuable data in hidden databases untouched. The trend in web search is to search the hidden web alongside the PIW because of its larger size and the higher quality of its data. However, searching the hidden web faces several challenges: its data is not exposed as static HTML pages and hyperlinks that ordinary crawlers can follow, it can usually be accessed only through a search or other special interface, and the underlying structure of each database is unknown. In addition, data providers are often reluctant to provide information about their data, which makes it difficult to improve the quality of answers. This section discusses these issues and solutions that address them.

12.5.1 Crawling the Hidden Web


To search the hidden web, a hidden web crawler can be used in a manner similar to crawling the PIW: it submits queries to the database's search interface and analyzes the returned result pages to extract the relevant information. This approach is crucial for dealing with hidden web databases.

12.5.1.1 Querying the Search Interface


One method is to analyze the database's search interface and create an internal representation of its search form that identifies the fields, their types and domains, and the labels associated with them. This is done by analyzing the HTML
structure of the page and extracting the labels. The representation is then matched with the system's task-specific
database, based on the labels of the fields. If a label is matched, the field is filled with the available values. This
process is repeated for all possible values in the search form, and the form is submitted with every combination of
values. Another approach is to use agent technology, which involves hidden web agents that interact with search
forms and retrieve the result pages. This involves finding the forms, learning to fill them, and identifying and fetching
the target pages.
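The exhaustive form-filling strategy just described can be sketched as follows. The field labels and candidate values are hypothetical entries from a task-specific database, and a real hidden web crawler would submit each combination through the site's search interface (e.g., an HTTP GET or POST) rather than printing it.

from itertools import product

# hypothetical internal representation of a search form after label matching:
# field label -> candidate values taken from the task-specific database
form_fields = {
    "make":  ["Toyota", "Honda"],
    "year":  ["2022", "2023"],
    "price": ["<10000", "10000-20000"],
}

def generate_submissions(fields):
    # enumerate every combination of candidate values for the matched fields
    labels = list(fields)
    for combination in product(*(fields[label] for label in labels)):
        yield dict(zip(labels, combination))

for filled_form in generate_submissions(form_fields):
    print(filled_form)      # placeholder for submitting the form with these values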

12.5.1.2 Analyzing the Result Page


Analyzing a page returned after a form submission involves matching its values against those in the agent's repository. Once a data page is found, it is traversed, along with all pages it links to, especially those containing additional results. However, returned pages often contain irrelevant data, since most result pages follow a template that includes a significant amount of text used purely for presentation. To identify web page templates, the textual contents and adjacent tag structures of a document are analyzed in order to extract the query-related data. A web page is represented as a sequence of text segments, each segment encapsulated between two tags. The template detection mechanism analyzes these text segments, generates an initial template, compares subsequently retrieved documents against it, extracts the nonmatching text segments as document contents, and uses the retrieved documents for future template generation.
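A rough sketch of the segment-based idea follows: each result page is reduced to its text segments (the text between consecutive tags), segments that appear on every page are treated as template text, and the remaining, nonmatching segments are kept as the query-related contents. The two pages below are invented, and real template detection also considers the adjacent tag structure, which is omitted here.

from html.parser import HTMLParser

class SegmentExtractor(HTMLParser):
    # splits a page into text segments, each encapsulated between two tags
    def __init__(self):
        super().__init__()
        self.segments = []
    def handle_data(self, data):
        text = data.strip()
        if text:
            self.segments.append(text)

def text_segments(page):
    parser = SegmentExtractor()
    parser.feed(page)
    return parser.segments

def extract_contents(pages):
    all_segments = [text_segments(p) for p in pages]
    template = set(all_segments[0])
    for segs in all_segments[1:]:
        template &= set(segs)          # segments shared by every page form the template
    # the nonmatching segments are the query-related contents of each page
    return [[s for s in segs if s not in template] for segs in all_segments]

pages = [
    "<html><h1>Search results</h1><p>Principles of Distributed Database Systems</p></html>",
    "<html><h1>Search results</h1><p>Web Data Management</p></html>",
]
print(extract_contents(pages))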

12.5.2 Metasearching
Metasearching is another approach for querying the hidden web. Given a user query, a metasearcher performs the
following tasks:
1. Database selection: selecting the database(s) that are most relevant to the user's query. This requires collecting some information about each database, known as a content summary: statistical information that usually includes the document frequencies of the words that appear in the database.
2. Query translation: translating the query to a suitable form for each database (e.g., by filling certain fields in the
database’s search interface).
3. Result merging: collecting the results from the various databases, merging them (and most probably, ordering
them), and returning them to the user.
We discuss the important phases of metasearching in more detail below.

12.5.2.1 Content Summary Extraction


Metasearching requires content summaries, which are often not made available by data providers. A possible approach is to extract a sample set of documents from a database and compute the frequency of each observed word w in the sample, SampleDF(w). The process starts with an empty content summary and a comprehensive word dictionary. A word is then selected and sent as a query to the database, and the top-k documents in the result are retrieved and added to the sample. This is repeated until the number of retrieved documents exceeds a prespecified threshold, at which point the process stops. There are two main versions of this algorithm: one picks each query word at random from the dictionary, while the other selects the next query from among the words discovered during sampling. Another alternative is a focused probing technique that classifies databases into a hierarchical categorization scheme. This involves preclassifying training documents into categories, extracting representative terms for each category, and using them as single-word probes to determine document frequencies.
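A simplified sketch of the sampling procedure follows, assuming a query_database(word, k) function that wraps the database's search interface and returns the text of up to k matching documents; that function, the dictionary, and the threshold are all placeholders.

import random

def build_content_summary(query_database, dictionary, k=4, doc_threshold=300):
    # estimate SampleDF(w): the number of sampled documents in which word w occurs
    sample_df = {}
    retrieved = 0
    used_probes = set()
    candidates = set(dictionary)                   # start from a comprehensive dictionary
    while retrieved < doc_threshold and candidates:
        word = random.choice(sorted(candidates))   # the learned variant picks from sampled words only
        candidates.discard(word)
        used_probes.add(word)
        for doc in query_database(word, k):        # top-k documents returned for this probe
            retrieved += 1
            for w in set(doc.lower().split()):
                sample_df[w] = sample_df.get(w, 0) + 1
                if w not in used_probes:
                    candidates.add(w)              # newly discovered words become future probes
    return sample_df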

12.5.2.2 Database Categorization


Database selection is a crucial task in metasearching, as it impacts the efficiency and effectiveness of query processing
over multiple databases. A database selection algorithm aims to find the best set of databases based on information
about the database contents, such as the document frequencies kept in its content summary. GlOSS is a simple algorithm that assumes query words are independently distributed over a database's documents and uses this assumption to estimate the number of documents in the database that match a given query.
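Under this independence assumption, the estimate can be computed from the content summary as sketched below; the document count and frequencies are invented, illustrative inputs.

def gloss_estimate(n_docs, df, query_words):
    # estimate how many of the n_docs documents contain every query word,
    # assuming the words are independently distributed over the documents
    estimate = float(n_docs)
    for w in query_words:
        estimate *= df.get(w, 0) / n_docs    # probability that a document contains w
    return estimate

# illustrative content summary: 1,000 documents with the given document frequencies
print(gloss_estimate(1000, {"distributed": 200, "database": 500},
                     ["distributed", "database"]))
# 1000 * (200/1000) * (500/1000) = 100 documents estimated to match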
The focused probing algorithm exploits database categorization and content summaries for database selection. It
consists of two basic steps: propagating the database content summaries to the categories of the hierarchical
classification scheme and using the content summaries to perform database selection hierarchically by zooming in on
the most relevant portions of the topic hierarchy. This results in more relevant answers to the user's query since they
only come from databases that belong to the same category as the query itself. Once relevant databases are selected,
each database is queried, and the returned results are merged and sent back to the user.

12.6 Web Data Integration


In Chapter 7, we discussed the integration of databases with well-defined schemas, which are suitable for enterprise
data. When integrating web data sources, however, the problem becomes more complex due to the characteristics of "big data": the sources are more varied, and both the amount of data and the number of sources are much larger. This makes manual curation difficult and increases the importance of data cleaning solutions. An appropriate approach is pay-as-you-go integration, in which the up-front investment is reduced so that data owners can easily integrate their datasets into a federation. One proposal is dataspaces, which advocates a lightweight integration platform that initially provides only basic access and can be improved incrementally by developing tools for more sophisticated use. Various approaches have been developed to address these challenges, including web tables and fusion tables, the semantic web, and Linked Open Data (LOD). Finally, this section discusses data cleaning and the use of machine learning techniques for data integration and cleaning at web scale.

12.6.1 Web Tables/Fusion Tables


Web portals and mashups are popular approaches to lightweight web data integration, aggregating web and other data
on specific topics like travel and hotel bookings. These "vertically integrated" systems target one domain and focus on
finding relevant web data. The Web tables project is an early attempt at finding data on the web with relational table
structure and providing access over these tables, focusing on the open web. It employs a classifier to group HTML
tables as relational and nonrelational, extracting schemas and maintaining statistics for search. Join opportunities
across tables are introduced for more sophisticated navigation. Web tables can be viewed as a method to retrieve and
query web data but also serve as a virtual integration framework for web data with global schema information.
The Fusion tables project at Google takes web tables a step further by allowing users to upload their own tables in
various formats. The Fusion Tables infrastructure can automatically discover join attributes across tables and
produce integration opportunities. An example is given in Fig. 12.11, where two datasets are combined over a
common attribute to provide integrated access. In other cases, one or both tables can be discovered from the web using
the techniques developed by the web tables project.

12.6.2 Semantic Web and Linked Open Data


The web is a repository of machine processable data, and the semantic web aims to convert this data into machine understandable form by integrating structured and unstructured data semantically.
The original semantic web vision includes three components:
• Markup web data so that metadata is captured as annotations;
• Use ontologies for making different data collections understandable; and
• Use logic-based technologies to access both the metadata and the ontologies.
The Linked Open Data (LOD) was introduced in 2006 to clarify the semantic web vision, emphasizing the linkages
among data within it and outlining guidelines for web data publication, thereby achieving the web data integration
vision. LOD requirements for publishing (and hence integrating) data on the web are based on four principles:

• All web resources (data) are identified by URIs that serve as their names;
• These names are accessible by HTTP;
• Information about web resources/entities are encoded as RDF (Resource Description Framework) triples. In other
words, RDF is the semantic web data model (and we discuss it below);
• Connections among datasets are established by data links and publishers of datasets should establish these links so
that more data is discoverable.
The LOD datasets form a graph whose vertices represent web resources and whose edges represent the relationships (links) among them. As of 2018, LOD consisted of 1,234 datasets connected by 16,136 links. The semantic web builds on several layered technologies: XML for structuring web documents, RDF and RDF Schema for the data model and schema, ontologies for capturing relationships among web data, and logic-based declarative rule languages for defining application rules. The technologies in the lower layers are the minimum requirement; higher layers, such as a schema over the data, provide richer primitives.

12.6.2.1 XML
HTML is the primary encoding for web documents, consisting of HTML elements encapsulated by tags. In the
semantic web context, XML, proposed by the World Wide Web Consortium (W3C), is the preferred representation for
encoding and exchanging web documents. This encoding allows for the discovery and integration of structured data.

XML tags, or markups, divide the data into elements in order to provide it with semantics. Elements can be nested but cannot overlap, so they represent hierarchical relationships. An XML document can therefore be represented as a tree: it has a root element with zero or more nested subelements, which may recursively contain further subelements. Each element has zero or more attributes with atomic values assigned to them, as well as an optional value. The textual representation of the tree defines a total order, called document order, on all elements, corresponding to the order in which the first character of each element occurs in the document.
For example, the root element in Fig. 12.4 is bib, which has three child elements: two book and one article. The first
book element has an attribute year with atomic value "1999", and also contains subelements. An element can contain a
value, such as "Principles of Distributed Database Systems" for the element title.
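This tree structure can be manipulated with a standard XML parser, as the sketch below shows. The small document is a hand-written approximation of the bibliography example; the second book and the article contents are invented placeholders, since Fig. 12.4 is not reproduced here.

import xml.etree.ElementTree as ET

doc = """
<bib>
  <book year="1999">
    <title>Principles of Distributed Database Systems</title>
  </book>
  <book year="2020">
    <title>Placeholder Second Book</title>
  </book>
  <article year="2011">
    <title>Placeholder Article</title>
  </article>
</bib>
"""

root = ET.fromstring(doc)                   # the root element bib
for child in root:                          # its child elements, in document order
    title = child.find("title")
    # element name, the value of its year attribute, and the title's atomic value
    print(child.tag, child.get("year"), ":", title.text)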
The standard XML document definition also allows ID-IDREF attributes, which define references between elements in the same document or in different documents. However, the simpler tree representation is more commonly used, and it is the one defined more precisely in the following.
An XML document tree is defined as an ordered tree whose nodes are XML elements or atomic values. A schema can be defined for XML documents while still allowing variations among individual documents. XML schemas can be defined using a Document Type Definition (DTD) or XMLSchema. A simpler schema definition exploits the graph structure of XML documents: an XML schema graph is defined as a 5-tuple that includes an alphabet of XML document node types, a set of edges between node types, and, for each node type σ, the domain of the text content of items of that type.
Using the XML data model and instances, query languages can be defined. Expressions in XML query languages take
an instance of XML data as input and produce an instance of XML data as output. Two query languages proposed by
the W3C are XPath and XQuery. Path expressions are present in both query languages and are the most natural way to
query hierarchical XML data. XQuery is complicated, hard to formulate by users, and difficult to optimize by systems.
JSON has replaced both XML and XQuery for many applications, although XML representation remains important
for the semantic web.

12.6.2.2 RDF
RDF is a data model on top of XML and forms a fundamental building block of the semantic web. It was originally
proposed by W3C as a component of the semantic web but its use has expanded. Examples include Yago and
DBPedia extracting facts from Wikipedia automatically and storing them in RDF format for structural queries, and
biologists encoding their experiments and results using RDF to communicate among themselves. RDF data collections
include Bio2RDF and Uniprot RDF.
RDF models each "fact" as a set of triples (subject, property (or predicate), object) denoted as's, p, o'. Entities are
denoted by a URI (Uniform Resource Identifier) that refers to a named resource in the environment being modeled,
while blank nodes refer to anonymous resources without a name.
RDF Schema (RDFS) is the next layer in the semantic web technology stack, which allows for the annotation of RDF
data with semantic metadata. This annotation primarily enables reasoning over the RDF data (entailment) and impacts
data organization in some cases. RDFS also allows the definition of classes and class hierarchies, with built-in class
definitions like rdfs:Class and rdfs:subClassOf. A special property, rdf:type, is used to specify that an individual
resource is an element of the class.
SPARQL queries can be classified by the shape of their query graphs into three types: linear, star-shaped, and snowflake-shaped. RDF data management systems can be categorized into five groups: direct relational mapping, relational schema with extensive indexing, denormalizing triples into clustered properties, column-store organization, and exploiting native graph pattern matching semantics.
Direct Relational Mapping
Direct relational mapping systems exploit the natural tabular structure of RDF triples to create a single table with three columns (Subject, Property, Object) over which SPARQL queries are executed. The aim is to exploit well-developed relational storage, query processing, and optimization techniques when executing SPARQL queries; systems such as Sesame (with its SQL92 SAIL) and Oracle follow this approach. Whether SPARQL 1.0 can be fully translated to SQL remains an open question.
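To make the single-table idea concrete, the following sketch keeps RDF triples in a three-column "table" (a Python list of tuples) and evaluates a basic graph pattern by successive self-joins on shared variables; the triples and the query are invented, and a real system would generate SQL over the triple table instead.

# a single "triple table" with columns (subject, property, object)
triples = [
    ("film1", "title", "The Shining"),
    ("film1", "directed_by", "dir1"),
    ("dir1",  "name", "Stanley Kubrick"),
    ("film2", "title", "2001: A Space Odyssey"),
    ("film2", "directed_by", "dir1"),
]

def match_pattern(pattern, binding):
    # return all extensions of 'binding' that satisfy one triple pattern;
    # variables are strings starting with '?'
    results = []
    for triple in triples:
        b = dict(binding)
        ok = True
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                if term in b and b[term] != value:
                    ok = False
                    break
                b[term] = value
            elif term != value:
                ok = False
                break
        if ok:
            results.append(b)
    return results

def evaluate(patterns):
    # evaluate a basic graph pattern as a series of self-joins over the triple table
    bindings = [{}]
    for pattern in patterns:
        bindings = [b2 for b in bindings for b2 in match_pattern(pattern, b)]
    return bindings

# "titles of films directed by Stanley Kubrick"
query = [("?f", "directed_by", "?d"),
         ("?d", "name", "Stanley Kubrick"),
         ("?f", "title", "?t")]
print([b["?t"] for b in evaluate(query)])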
Single Table Extensive Indexing
Native storage systems like Hexastore and RDF-3X offer an alternative to direct relational mapping by allowing
extensive indexing of the triple table. These systems maintain a single table but create indexes for all possible
permutations of subject, property, and object. These indexes are sorted lexicographically by the first, second, and third
columns, and stored in the leaf pages of a clustered B+-tree. This organization allows SPARQL queries to be
efficiently processed regardless of variable location, and it eliminates some self-joins by turning them into range
queries over the particular index. Fast merge-join can be used when joins are required. However, disadvantages
include space usage and the overhead of updating multiple indexes if data is dynamic.
Property Tables
The property tables approach exploits the regularity found in RDF datasets to store "related" properties in the same table. Jena and
IBM's DB2RDF follow this strategy, mapping the resulting tables to a relational system and converting queries to
SQL for execution. Jena defines two types of property tables: a clustered property table, which groups properties that
occur in the same subjects, and a property class table that clusters subjects with the same type of property into one
table. The primary key for a property is the subject, while the key for a multivalued property is the compound key
(subject, property). The mapping of the single triple table to property tables is a database design problem handled by a
database administrator.

Example 12.17 The example dataset in Example 12.14 may be organized to create one table that includes the
properties of subjects that are films, one table for properties of directors, one table for properties of actors, one table
for properties of books and so on.
IBM DB2RDF uses a dynamic table organization, called the direct primary hash (DPH), organized by subject. Each row corresponds to a subject and has k property columns, and a given column can hold different properties in different rows. If a subject has more than k properties, the extra properties are spilled onto a second row. For multivalued properties, a direct secondary hash (DSH)
table is maintained. This approach simplifies joins in star queries, resulting in fewer joins. However, it can lead to a
significant number of null values and requires special care for multivalued properties. Star queries can be efficiently
handled, but it may not be suitable for other query types. Clustering "similar" properties is nontrivial, and poor design
decisions can exacerbate the null value problem.
Binary Tables
The binary tables approach is a column-oriented database schema organization that defines a two-column table for
each property, containing the subject and object. This results in a set of tables, each ordered by subject, which reduces I/O and tuple length and improves compression. The approach avoids null values, does not require clustering algorithms for similar properties, and supports
multivalued properties. Subject-subject joins can be implemented using efficient merge-joins. However, queries
require more join operations, some of which may be subject-object joins. Additionally, insertions into tables have
higher overhead, and the proliferation of tables may negatively impact the scalability of the binary tables approach.
For example, the binary table representation of the example dataset would create one table for each unique property, resulting in 18 tables.
Graph-Based Processing
Graph-based RDF processing methods maintain the RDF data's graph structure using adjacency lists, convert
SPARQL queries to query graphs, and perform subgraph matching using homomorphism to evaluate queries against
the RDF graph, a technique used by systems like gStore and chameleon-db.

This approach maintains the original graph representation of the RDF data and enforces its intended semantics. Its disadvantage is the cost of subgraph matching, since graph homomorphism is NP-complete, which raises scalability issues for large RDF graphs. The gStore system uses an adjacency list representation of
graphs, encoding each entity and class vertex into a fixed-length bit string. This information is exploited during graph
matching, resulting in a data signature graph G∗, where each vertex corresponds to a class or entity vertex in the RDF
graph G. An incoming SPARQL query is also represented as a query graph Q, which is encoded into a query signature
graph Q∗.

The problem of matching Q∗ over G∗ is addressed with a filter-and-evaluate strategy to reduce the search space: a false-positive pruning step first produces a candidate list of subgraphs (CL), which are then validated using the adjacency lists to obtain the result set (RS). Two issues need to be addressed: the encoding must guarantee that RS ⊆ CL, and an efficient subgraph matching algorithm is required. gStore uses an index structure called the VS∗-tree, a summary graph of G∗, to process queries efficiently using this pruning strategy.

Distributed and Federated SPARQL Execution


As RDF collections grow, scale-out solutions have been developed involving parallel and distributed processing.
These solutions divide an RDF graph G into several fragments and place each at a different site in a
parallel/distributed system. Each site hosts a centralized RDF store of some kind. A SPARQL query Q is decomposed
into several subqueries, each of which can be answered locally at one site, and the results are then aggregated. Each system proposes its own data partitioning strategy, and different partitioning strategies result in different query processing
methods. Some approaches use MapReduce-based solutions, where RDF triples are stored in HDFS and each triple
pattern is evaluated by scanning the HDFS files followed by a MapReduce join implementation. Other approaches
follow distributed/parallel query processing methodologies, where the query is partitioned into subqueries and
evaluated across the sites.
An alternative has been proposed for executing distributed SPARQL queries using partial query evaluation. Partial
function evaluation is a well-known programming language strategy that generates a partial answer for a function f (s,
d), where s is the known input and d is the yet unavailable input. In this particular setting, each site treats fragment Fi
as the known input in the partial evaluation stage, while the unavailable input is the rest of the graph (G \ Fi).
In many RDF settings, concerns arise similar to database integration requiring a federated solution. Some sites that
host RDF data also have the capability to process SPARQL queries, called SPARQL endpoints. A common technique
in federated RDF environments is to precompute metadata for each individual SPARQL endpoint, specifying the
capabilities of the end point or a description of the triple patterns that can be answered at that endpoint.

12.6.2.3 Navigating and Querying the LOD


Linked Open Data (LOD) is a collection of web documents that contain embedded RDF triples encoding web resources. These RDF triples include data links to other documents, which interconnect the documents and give LOD its graph structure. Defining the semantics of SPARQL queries over the LOD is challenging. One option is to adopt a full web semantics, which defines the scope of evaluating a query expression to be all linked data. However, under this semantics there is no known query execution algorithm that both terminates and guarantees result completeness.
An alternative is a family of reachability-based semantics that define the scope of evaluating a SPARQL query in
terms of the documents that can be reached. This family is defined by different reachability conditions and has
computationally feasible algorithms. There are three approaches to SPARQL query execution over LOD: traversal-
based, index-based, and hybrid. Traversal approaches implement a reachability-based semantics by recursively discovering relevant URIs through traversal of specific data links at query execution time. Their main advantage is simplicity, but they suffer from high latency and offer only limited opportunities for parallelization. Index-based approaches use an index to determine the relevant URIs, reducing the number of linked documents that need to be accessed; their disadvantages include dependence on the index, freshness issues, and the difficulty of keeping the index up-to-date. Hybrid approaches perform a traversal-based execution that uses a prioritized listing of URIs for look-up.

12.6.3 Data Quality Issues in Web Data Integration


In Chapter 7, data quality and data cleaning issues in database integration systems were discussed. Web data quality problems are more severe due to the larger number of sources, uncontrolled data entry, and increased diversity. Data quality includes data
consistency and veracity, which are obtained through data cleaning. However, data cleaning in web contexts is
difficult due to the lack of schema information and limited integrity constraints. Checking for data veracity remains a
challenge, but efficient data fusion techniques may help detect correct values for the same data items. This section
highlights main data quality and cleaning issues in web data and discusses current solutions.

12.6.3.1 Cleaning Structured Web Data


Structured data is a crucial category of web-based data that faces numerous data quality issues. Cleaning structured
data involves a workflow consisting of discovery and profiling steps, error detection steps, and error repair steps.
Discovery and profiling steps are often used to automatically discover metadata, while error detection steps identify
parts of the data that do not conform to the metadata and declare them as errors. Errors can be in various forms, such
as outliers, violations of integrity constraints, and duplicates. The error repair step produces data updates to remove
detected errors. External sources like knowledge bases and human experts are consulted to ensure the accuracy of the
cleaning workflow. The process works well for structured tables with a rich set of metadata and automatic algorithms
for error detection. However, web tables are short and skinny, and there are far more of them than there are tables in a typical data warehouse, which makes manually cleaning all structured web tables infeasible.

12.6.3.2 Web Data Fusion


Web data integration often faces the challenge of data fusion: determining the correct value for a data item when multiple sources provide different, conflicting representations of it. Data conflicts come in two forms: uncertainty, a conflict between a null value and one or more nonnull values describing the same property of a real-world entity, and contradiction, a conflict between two or more different nonnull values describing the same property. Automatic cleaning of web tables is particularly challenging; data-driven, statistics-based techniques such as Auto-Detect can help detect such errors, but Auto-Detect does not suggest data fixes, and there are currently no proposals that automatically repair or suggest fixes for errors in web tables.
Figure 12.25 outlines different data fusion strategies, grouped into conflict ignorance, conflict avoidance, and conflict resolution. Conflict ignorance simply ignores conflicts, while conflict avoidance acknowledges them but applies a single uniform decision based on the data instances or on metadata. Conflict resolution actually resolves each conflict, either by deciding (choosing one of the present values) or by mediating (computing a new value, such as an average, from them).

12.6.3.3 Web Source Quality


Basic conflict resolution strategies rely only on the participating values, and they fall short in three respects: they do not recognize that web sources differ in quality, they do not take source quality into account when predicting the correct value, and they do not account for dependencies (such as copying) between web sources. Ignoring these aspects can lead to wrong resolution decisions, for instance when a simple majority vote is used. Additionally, the correct value for a data item may evolve over time, making it crucial to differentiate between incorrect and merely outdated values. Advanced data fusion strategies therefore evaluate the trustworthiness or quality of each source; the remainder of this section discusses modeling the accuracy of a data source, handling source dependencies, and source freshness.
Source Accuracy
The accuracy of a source S, denoted A(S), is the fraction of true values among the values that S provides. Let V(S) be the set of values provided by S, and let Pr(v) denote the probability that value v is the true value. Then A(S) is computed as the average of these probabilities:

A(S) = Σ_{v ∈ V(S)} Pr(v) / |V(S)|
Consider a data item D. Let Dom(D) be the domain of D, including one true value and n false values. Let SD be the
set of sources that provide a value for D, and let SD(v) ⊆ SD be the set of sources that provide the value v for D. Let
the observation of D denote which value each S ∈ SD provides for D. Given this observation, the probability Pr(v) can be computed using Bayesian analysis from the accuracies A(S) of the sources in SD.
The true value of a data item D is taken to be the value v ∈ Dom(D) with the highest probability Pr(v). Note that the computation of Pr(v) depends on the source accuracies A(S), and the computation of A(S) in turn depends on the probabilities Pr(v). An algorithm can therefore start with the same accuracy for every source and the same probability for every value, and iteratively recompute both until convergence.
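The iterative interplay between value probabilities and source accuracies can be sketched with a simplified accuracy-weighted vote (not the exact Bayesian vote-count model referred to above); the observations below are invented.

def fuse(observations, iterations=20):
    # observations: list of (source, data_item, value) claims
    # returns (accuracy per source, chosen value per data item)
    sources = {s for s, _, _ in observations}
    items = {d for _, d, _ in observations}
    accuracy = {s: 0.8 for s in sources}          # same initial accuracy for every source
    for _ in range(iterations):
        # probability of each value: normalized, accuracy-weighted votes
        prob = {}
        for d in items:
            votes = {}
            for s, d2, v in observations:
                if d2 == d:
                    votes[v] = votes.get(v, 0.0) + accuracy[s]
            total = sum(votes.values())
            prob[d] = {v: w / total for v, w in votes.items()}
        # accuracy of a source: average probability of the values it provides
        accuracy = {
            s: sum(prob[d][v] for s2, d, v in observations if s2 == s) /
               sum(1 for s2, _, _ in observations if s2 == s)
            for s in sources
        }
    truth = {d: max(prob[d], key=prob[d].get) for d in items}
    return accuracy, truth

observations = [
    ("S1", "capital_of_australia", "Canberra"),
    ("S2", "capital_of_australia", "Canberra"),
    ("S3", "capital_of_australia", "Sydney"),
]
print(fuse(observations))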
Source Dependency
Source dependency is a crucial aspect of data fusion, as sources often copy from each other, creating dependencies among them. Copy detection between sources rests on two intuitions. First, since a data item has only one true value but many possible false values, two sources sharing the same true value does not necessarily imply dependency, whereas sharing the same false value is a rare event and is therefore strong evidence of copying. Second, a random subset of the values provided by an independent source typically has an accuracy similar to that of its full set of values, whereas the values copied by a copier source may have a noticeably different accuracy. A Bayesian model can be developed to compute the probability of copying between two sources, and the computation of the vote count for a value can then be adjusted to account for these source dependencies.
Source Freshness
When true values evolve over time, data fusion aims to find all the correct values of a data item and the period during which each value was valid. In this dynamic setting, data errors occur because sources provide wrong values, fail to update their data, or do not update it in time. Source quality can then be evaluated using three metrics: the coverage, the exactness, and the freshness of a source. Bayesian analysis can be used to determine the time and value of each transition for a data item.

Machine learning and probabilistic models have been used in data fusion and modeling data source quality. SLiMFast
is a framework that expresses data fusion as a statistical learning problem over discriminative probabilistic models. It
provides quality guarantees for fused results and can incorporate available domain knowledge in the fusion process.
SLiMFast takes as input source observations, labeled ground truth data, and domain knowledge about the sources. It compiles this information into a probabilistic graphical model for holistic learning and inference. Depending on the
amount of ground truth data, SLiMFast decides which algorithm to use for learning the parameters of the graphical
models. The learned model is then used for inferring both object values and source accuracies.

12.7 Bibliographic Notes


There are a number of good sources on web topics, each with a slightly different focus. Abiteboul et al. [2011] focus
on the use of XML and RDF for web data modeling and also contain discussions of search, and big data technologies
such as MapReduce. A web data warehousing perspective is given in [Bhowmick et al. 2004]. Bonato [2008]
primarily focuses on the modeling of the web as a graph and how this graph can be exploited. Early work on the web
query languages and approaches are discussed in [Abiteboul et al. 1999]. A very good overview of web search issues
is [Arasu et al. 2001], which we also follow in Sect. 12.2. Additionally, Lawrence and Giles [1998] provides an earlier
discussion on the same topic focusing on the open web. Florescu et al. [1998] survey web search issues from a
database perspective. The deep (hidden) web is the topic of [Raghavan and Garcia-Molina 2001]. Lage et al. [2002] and
Hedley et al. [2004b] also discuss search over the deep web and the analysis of the results. Metasearch for accessing
the deep web is discussed in [Ipeirotis and Gravano 2002, Callan and Connell 2001, Callan et al. 1999, Hedley et al.
2004a]. The metasearch-related problem of database selection is discussed by Ipeirotis and Gravano [2002] and
Gravano et al. [1999] (GlOSS algorithm). Statistics about the open web are taken from [Bharat and Broder 1998,
Lawrence and Giles 1998, 1999, Gulli and Signorini 2005] and those related to the deep web are due to [Hirate et al.
2006] and [Bergman 2001].
