
Web Data Integration Summary

Data Integration Process 

Types of Structured Data on the Web


1. Data Portals
 collect and host datasets
 collect and generate metadata describing the datasets
 provide for data search and exploration
 provide free or payment-based access to data

Types of Shared Data: public sector, research, commercial

FAIR Data Principles

 Findable
o (Meta)data are assigned a globally unique identifier
o Data are described with rich metadata
o (Meta)data are registered or indexed in a searchable resource
 Accessible
o (Meta)data are retrievable by their identifier using a standardized communications protocol
o Metadata are accessible, even when the data are no longer available
 Interoperable
o (Meta)data use a formal, broadly applicable language for knowledge representation
o (Meta)data use vocabularies that follow FAIR principles
o (Meta)data include qualified references to other (meta)data
 Reusable
o (Meta)data are released with a clear data usage license
o (Meta)data are associated with detailed provenance
o (Meta)data meet domain-relevant community standards

2. Web APIs
Platforms that enable users to share information, e.g., Facebook, are made partly accessible from the Web via Web APIs.
They slice the Web into data silos:
--- 1 Not indexable by generic web crawlers
--- 2 No automatic discovery of additional data sources
--- 3 No single global data space

3. Linked Data
+++ Entities are identified with HTTP URIs (role of global primary keys); URIs can be looked up on the Web (discover new data sources, navigate the global data graph)

4. HTML-embedded Data
+++
1. Webpages traditionally contain structured data in the form of HTML tables as well as template-generated content
2. More and more websites semantically mark up the content of their HTML pages using standardized markup formats, like RDFa

Data Exchange Formats


 Data Exchange: Transfer of data from one system to another
 Data Exchange Format: Format used to represent (encode) the transferred data
 Web Data is heterogeneous with respect to the employed
o 1. Data Exchange Format (Technical Heterogeneity, e.g., XML, JSON, CSV)
o 2. Character Encoding (Syntactical Heterogeneity, e.g., ASCII, Unicode)

Character Encoding
is the mapping of "real" characters to bit sequences and a common problem in data integration

UTF-8: most common encoding, also covers Asian scripts; common characters are encoded using only one byte, less common ones using 2 to 4 bytes
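
A minimal Python sketch (my own illustration, not from the lecture) showing how UTF-8 assigns different byte lengths to different characters:

# UTF-8 uses a variable number of bytes per character.
for ch in ["A", "é", "€", "中"]:
    encoded = ch.encode("utf-8")      # encode the character as UTF-8 bytes
    print(ch, len(encoded), encoded)  # 'A' -> 1 byte, 'é' -> 2, '€' -> 3, '中' -> 3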

Comma Separated Values (CSV)


 Data model is a table
 Pro: Data representation with minimal overhead
 Cons
o restricted to tabular data
o hard to read for humans when tables get wider
o different variations, no support for data types
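
A small sketch (Python standard library csv module; the file name products.csv is a made-up example) for reading CSV data whose delimiter may vary:

import csv

# Hypothetical input file; delimiter and quoting conventions differ between CSV dialects.
with open("products.csv", newline="", encoding="utf-8") as f:
    dialect = csv.Sniffer().sniff(f.read(2048))  # guess the delimiter/quoting from a sample
    f.seek(0)
    for row in csv.reader(f, dialect):
        print(row)                               # every value arrives as a string (no data types)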

Extensible Markup Language (XML)


 Widely used format for data exchange in the Web and enterprise contexts
 Data model: Tree
 Is a meta language: defines a standard syntax, allows the definition of specific languages (XML applications)
 HTML versus XML
o HTML: Aimed at displaying information to humans; mixes structure, content, and presentation
o XML: Aimed at data exchange; separates structure, content, and presentation
 Well-formed XML Documents
o 1. Closing tag for each opening tag
o 2. Proper nesting of tags
o 3. Only one attribute with a specific name per element
 XML References
o Trees are limited when it comes to n:m relations
 Problem: data duplication (consistency, storage, transmission volume)
o Solution: IDs and references (id/idref)
Document Type Definition (DTD)
 Defines valid content structure of an XML document
o allowed elements, attributes, child elements, optional elements
o allowed order of elements
 DTDs can be used to validate an XML document from the Web before it is further processed
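
A short sketch of how such checks might look in Python; the XML document is a made-up example, and the DTD validation relies on the third-party lxml library (my assumption, not part of the lecture):

import io
import xml.etree.ElementTree as ET
from lxml import etree  # third-party; used here only for DTD validation

doc = "<movies><movie year='1999'><title>The Matrix</title></movie></movies>"

# Well-formedness check: the parser rejects missing closing tags or improper nesting.
root = ET.fromstring(doc)

# Validation: check the document against a (made-up) DTD content model.
dtd = etree.DTD(io.StringIO("""
<!ELEMENT movies (movie*)>
<!ELEMENT movie (title)>
<!ATTLIST movie year CDATA #IMPLIED>
<!ELEMENT title (#PCDATA)>
"""))
print(dtd.validate(etree.fromstring(doc)))  # True if the document conforms to the DTD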

XPath
See slides 33 to 44.

JavaScript Object Notation (JSON)


 is a lightweight data exchange format that uses JavaScript syntax
o less verbose alternative to XML
o widely adopted
– by Web APIs as data exchange format
– for embedding structured data in the HEAD section of HTML pages
 Basics:
o objects are noted as in JavaScript
o objects are enclosed in curly brackets {…}
o data is organized in key-value pairs separated by colons { key : value }
 Example: { "firstname" : "John", "lastname" : "Smith", "age" : 46 } (parsed in the sketch below)
 JSON is a lot like XML:
o data model: tree
o opening/closing tags/brackets
 Differences
o more compact notation compared to XML
o no id/idref – JSON data is strictly tree-shaped
o fewer data types (only string, number, Boolean, and null)
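
A tiny sketch (Python standard library json module) that parses the example object from above:

import json

record = json.loads('{ "firstname": "John", "lastname": "Smith", "age": 46 }')
print(record["firstname"], record["age"])  # values arrive typed: str for the names, int for age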

Resource Description Framework (RDF)


Graph data model designed for sharing data on the Web

 Applications:
o annotation of Web pages (RDFa, JSON-LD)
o publication of data on the Web (Linked Data)
o exchange of graph data between applications
 View 1: Sentences in the form Subject-Predicate-Object, called triples (e.g., "Chris works at Uni of Mannheim")
 View 2: Labeled directed graph
 Resources
o everything (a person, a place, a web page, …) is a resource
o are identified by URI references
o may have one or more types (e.g. foaf:Person)
 Literals
o are data values, e.g., strings or integers
o may only be objects, not subjects of triples
o may have a data type or a language tag
 Predicates (Properties)
o connect resources to other resources
o connect resources to literals
 1. RDF Data Model
 2. RDF Syntaxes
 3. RDF Schema
 4. SPARQL Query Language
 5. RDF in Java
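
The lecture uses Java for its RDF exercises; as a rough illustration only, here is a sketch with the Python rdflib library instead (my assumption, not the lecture's code), building and serializing the triple from View 1:

from rdflib import Graph, Literal, Namespace
from rdflib.namespace import FOAF, RDF

EX = Namespace("http://example.org/")  # made-up namespace for this illustration
g = Graph()

chris = EX.Chris
g.add((chris, RDF.type, FOAF.Person))                  # a resource may have one or more types
g.add((chris, EX.worksAt, EX.University_of_Mannheim))  # predicate connecting two resources
g.add((chris, FOAF.name, Literal("Chris")))            # predicate connecting a resource to a literal

print(g.serialize(format="turtle"))                    # Turtle, one of several RDF syntaxes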

Schema Mapping and Data Translation


1. Two Basic Integration Situations
Schema Mapping

 Goal: Translate data from a set of source schemata into a given target schema
 Top-down integration situation
 Triggered by a concrete information need

Schema Integration

 Goal: Create a new integrated schema that can represent all data from a given set of source schemata
 Bottom-up integration situation
 Triggered by the goal to fulfill different information needs based on data from all sources (no information is thrown away)

2. Types of Correspondences
 A correspondence relates a set of elements in a schema S to a
set of elements in schema T
 Mapping = Set of all correspondences that relate S and T
 Schema Matching: Automatically or semi-automatically discover correspondences between
schemata
 Types of correspondences
o One-to-One Correspondences – Movie.title → Item.name
o One-to-Many – Person.Name → split() → FirstName (Token 1), Surname (Token 2)
o Many-to-One – Product.basePrice * (1 + Location.taxRate) → Item.price

3. Schema Integration
 Completeness: All elements of the source schemata should be covered
 Correctness: All data should be represented semantically correctly
 Minimality: The integrated schema should be minimal with respect to the number of relations and attributes
 Understandability: The integrated schema should be easy to understand

4. Data Translation
Query Generation Goal: Derive suitable data translation queries (or programs) from the
correspondences.

5. Schema Matching
Automatically or semi-automatically discover correspondences between schemata

5.1 Challenges to Finding Correspondences


 Large schemata: >100 tables and >1,000 attributes
 Generic, automatically generated names, e.g., attribute1, attribute2, attribute3
 Missing documentation

5.2. Schema Matching Methods


 1. Label-based Methods: Rely on the names of schema elements (see the sketch after this list)
o 1. Generate the cross product of all attributes (classes) from A and B
o 2. For each pair, calculate the similarity of the attribute labels using some similarity metric: Levenshtein (insert/delete/replace as edit operations), Jaccard, Soundex, etc.
o 3. The most similar pairs are the matches
o Problems:
 Semantic heterogeneity is not recognized (e.g., synonyms/homonyms)
 Problems with different naming conventions (e.g., abbreviations)
o Solution
 Preprocessing: Normalize labels (stop word removal, stemming, …) to prepare them for matching
 Matching: Employ similarity metrics that fit the specifics of the schemata
 2. Instance-based Methods: Compare the actual data values
o determine correspondences between A and B by examining which attributes in A and B contain similar values → values often better capture the semantics of an attribute than its label
o 1. Attribute Recognizers
o 2. Value Overlap: compare the value sets of two attributes, e.g., using the Jaccard coefficient
o 3. Feature-based Methods
 compare, e.g., attribute data type, average string length
 Discussion
 require a decision which features to use
 require a decision how to compare and combine values
 similar attribute values do not always imply the same semantics
o 4. Duplicate-based Methods
 check which attribute values closely match in each duplicate (= similar entry in two DBs)
 ++ can correctly distinguish very similar attributes
 ++ work well if duplicates are known or easy to find
 -- do not work well if identity resolution is too noisy, e.g., for products with very similar names
 3. Structure-based Methods: Exploit the structure of the schema
o address cases in which labels and instances alone are ambiguous
o high similarity of neighboring attributes and/or of the name of the relation increases the similarity of an attribute pair
 4. Combined Approaches: Use combinations of the above methods
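
A minimal sketch of label-based matching as described in item 1 above (plain Python; the attribute lists are made up, and difflib's ratio stands in for an edit-based similarity metric such as Levenshtein):

from difflib import SequenceMatcher
from itertools import product

def label_similarity(a: str, b: str) -> float:
    """Normalize labels, then compute a string similarity in [0, 1]."""
    a, b = a.lower().replace("_", " ").strip(), b.lower().replace("_", " ").strip()
    return SequenceMatcher(None, a, b).ratio()  # stand-in for Levenshtein, Jaccard, etc.

schema_a = ["title", "basePrice", "releaseDate"]  # made-up source attributes
schema_b = ["name", "price", "release_date"]      # made-up target attributes

# 1. cross product of attributes, 2. similarity per pair, 3. most similar pairs are match candidates
matrix = {(x, y): label_similarity(x, y) for x, y in product(schema_a, schema_b)}
for (x, y), sim in sorted(matrix.items(), key=lambda kv: kv[1], reverse=True)[:3]:
    print(f"{x} <-> {y}: {sim:.2f}")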
5.3 Generating Correspondences from the Similarity Matrix
 Input: Matrix containing attribute similarities
 Output: Set of correspondences
 Local Single Attribute Strategies:
o Thresholding
 all attribute pairs with sim above a threshold are returned as
correspondences
 domain expert checks correspondences afterwards and selects the right ones
o TopK
 give domain expert TopK correspondences for each attribute
o Top1
 directly return the best match as correspondence
 very optimistic, errors might frustrate domain expert
 Alternative: Global Matching
o Looking at the complete mapping (all correct correspondences between A and B) gives us the additional restriction that one attribute in A should only be matched to one attribute in B
o Find an optimal set of disjoint correspondences
 Alternative: Stable Marriage
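
A small sketch of the local strategies above, thresholding and Top-1, applied to a similarity matrix (plain Python; the matrix values are made up):

# Made-up similarity matrix: (attribute in A, attribute in B) -> similarity
matrix = {
    ("title", "name"): 0.82, ("title", "price"): 0.10,
    ("basePrice", "price"): 0.71, ("basePrice", "name"): 0.15,
    ("releaseDate", "release_date"): 0.95, ("releaseDate", "name"): 0.20,
}

# Thresholding: return every pair above a threshold; a domain expert checks them afterwards
threshold = 0.7
candidates = [(a, b, s) for (a, b), s in matrix.items() if s >= threshold]

# Top-1: directly return the best match per attribute in A (optimistic, no expert in the loop)
best = {}
for (a, b), s in matrix.items():
    if s > best.get(a, (None, 0.0))[1]:
        best[a] = (b, s)

print(candidates)
print(best)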

5.4 Finding Many-to-One and One-to-Many Correspondences (see slide 65)


5.5 Table Annotation
Goal: Annotate the columns (type and property) of tables in a large table corpus with concepts from a
knowledge graph or shared vocabulary

6. Schema Heterogeneity on the Web

Identity Resolution
 Goal: Find all records that refer to the same real-world entity.
 Challenge 1: Representations of the same real-world entity are not identical → fuzzy duplicates
o Solution: Entity Matching: compare multiple attributes using attribute-specific similarity measures, after value normalization
 Challenge 2: Quadratic Runtime Complexity
o Comparing every pair of records is too expensive for larger datasets
o Solution: Blocking methods → avoid unnecessary comparisons

Entity Matching
2.1 Linearly Weighted Matching Rules
Compute the similarity score between records x and y as a linearly weighted combination of individual attribute similarity scores:

sim(x, y) = w1 * sim1(x, y) + w2 * sim2(x, y) + … + wn * simn(x, y)

We declare x and y matched if sim(x, y) >= b for a pre-specified threshold b, and not matched otherwise.
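
A minimal sketch of such a rule (plain Python; the attribute similarity functions, weights, and threshold are made-up placeholders):

from difflib import SequenceMatcher

def title_sim(x: dict, y: dict) -> float:
    return SequenceMatcher(None, x["title"].lower(), y["title"].lower()).ratio()

def year_sim(x: dict, y: dict) -> float:
    return 1.0 if x["year"] == y["year"] else 0.0

RULE = [(0.7, title_sim), (0.3, year_sim)]  # made-up weights, summing to 1

def matches(x: dict, y: dict, b: float = 0.8) -> bool:
    sim = sum(w * f(x, y) for w, f in RULE)  # linearly weighted combination
    return sim >= b                          # matched iff sim(x, y) >= threshold b

print(matches({"title": "The Matrix", "year": 1999},
              {"title": "The Matrx", "year": 1999}))  # fuzzy duplicate with a typo -> True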

2.2 Non-Linear Matching Rules


Often better than linear rules but require specific domain knowledge.

Non-linear rules can be learned using tree-based learners
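
A rough sketch of how a tree-based learner could produce such a rule, assuming scikit-learn and made-up training data (vectors of attribute-level similarity scores labeled as match/non-match):

from sklearn.tree import DecisionTreeClassifier, export_text

# Each row: [title_sim, year_sim]; label 1 = match, 0 = non-match (made-up examples)
X = [[0.95, 1.0], [0.90, 0.0], [0.40, 1.0], [0.20, 0.0], [0.85, 1.0], [0.30, 1.0]]
y = [1, 1, 0, 0, 1, 0]

tree = DecisionTreeClassifier(max_depth=2).fit(X, y)  # learn a non-linear matching rule
print(export_text(tree, feature_names=["title_sim", "year_sim"]))
print(tree.predict([[0.88, 1.0]]))                    # apply the learned rule to a new pair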

2.3 Data Gathering for Matching


Not only values of the records to be compared, but also values of related records are relevant for the
similarity computation, e.g., Movies look at Actors

2.4 Data Preprocessing for Matching


To enable similarity measures to compute reliable scores, the data needs to be normalized (spelling,
value formats, measurement units, abbreviations), then parsed (Extract attribute/value pairs from
title) and translated (e.g., into target language).

2.5 Local versus Global Matching


 Local Matching
o consider all pairs above threshold as matches
o implies that one record can be matched with several other records
o makes sense for duplicate detection within single data source
 Global Matching
o enforce the constraint that one record in data set A should only be matched to one record in data set B
o makes sense for data sources that do not contain duplicates
o Stable Marriage: always match the pair with the highest similarity first

2.6 Cluster Records using Pairwise Correspondences


Goal: Create groups of records describing the same real-world entity from pairwise correspondences.
Simple Approach: Connected Components; Smarter Approach: Correlation Clustering (…)

Blocking
Since similarity is reflexive and symmetric, the number of comparisons already drops from n² to (n² - n)/2; blocking methods avoid further unnecessary comparisons.

3.1 Standard Blocking


Idea: Reduce number of comparisons by partitioning the records into buckets and
compare only records within each bucket e.g., partition books by publisher.

 Pro: much faster; Con: true matches may be missed


 Reduction ratio depends on effectiveness of blocking key
o high: if records are equally distributed over buckets
o low: if majority of the records end up in one bucket
o possible workaround: build sub-buckets using a second blocking attribute
 Recall depends on the actually matching pairs being kept (not more than 3% should be lost)
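
A minimal sketch of standard blocking (plain Python; the records and the choice of blocking key are made up):

from collections import defaultdict
from itertools import combinations

records = [
    {"id": 1, "title": "The Matrix", "publisher": "Warner"},
    {"id": 2, "title": "The Matrx", "publisher": "Warner"},
    {"id": 3, "title": "Inception", "publisher": "Warner"},
    {"id": 4, "title": "Memento", "publisher": "Summit"},
]

def blocking_key(r: dict) -> str:
    return r["publisher"].lower()  # e.g., partition the records by publisher

# Partition the records into buckets and compare only records within the same bucket.
buckets = defaultdict(list)
for r in records:
    buckets[blocking_key(r)].append(r)

candidate_pairs = [(a["id"], b["id"]) for bucket in buckets.values()
                   for a, b in combinations(bucket, 2)]
print(candidate_pairs)  # 3 of the 6 possible pairs remain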

3.2 The Sorted Neighborhood Method (SNM)


Idea: Sort records so that similar records are close to each other. Only compare records within a small
neighborhood window.

 Challenges
o Choice of Blocking Key
o Choice of Window Size
o But: no problem with different bucket sizes
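
A small sketch of the Sorted Neighborhood Method (plain Python; the sorting key and the window size are made-up choices):

records = [
    {"id": 1, "title": "The Matrix"},
    {"id": 2, "title": "The Matrx"},
    {"id": 3, "title": "Inception"},
    {"id": 4, "title": "Memento"},
    {"id": 5, "title": "The Matrix Reloaded"},
]

def sorting_key(r: dict) -> str:
    return r["title"].lower().replace(" ", "")  # normalized title as blocking key

window = 3                                  # window size w
ordered = sorted(records, key=sorting_key)  # 1. sort so similar records end up close together

# 2. slide a window of size w over the sorted list and compare only records inside the window
candidate_pairs = set()
for i in range(len(ordered)):
    for j in range(i + 1, min(i + window, len(ordered))):
        candidate_pairs.add((ordered[i]["id"], ordered[j]["id"]))
print(sorted(candidate_pairs))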

3.3 Token Blocking for Textual Attributes (see slide 32)

Evaluation
A gold standard (GS) is necessary.

Accuracy is not a good metric, as matching is a strongly imbalanced task (the vast majority of record pairs are non-matches).

Goal: Have a high reduction ratio while keeping pairs completeness > 0.97.
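
A small helper (plain Python; the numbers are made up) using the usual definitions of the two blocking measures, which are my assumption here rather than a quote from the slides:

def reduction_ratio(num_candidate_pairs: int, num_all_pairs: int) -> float:
    """Fraction of comparisons saved by blocking."""
    return 1 - num_candidate_pairs / num_all_pairs

def pairs_completeness(true_matches_in_candidates: int, true_matches_total: int) -> float:
    """Fraction of gold-standard matches that survive blocking."""
    return true_matches_in_candidates / true_matches_total

n = 10_000                                  # made-up dataset size
all_pairs = (n * n - n) // 2                # (n^2 - n) / 2 comparisons without blocking
print(reduction_ratio(250_000, all_pairs))  # ~0.995
print(pairs_completeness(980, 1000))        # 0.98 > 0.97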

Similarity Measures – In Detail


5.1 Edit-based String Similarity Measures
 Levenshtein Distance (aka Edit Distance); see the sketch after this list
o Measures the minimum number of edits needed to transform one string into the other (insert, delete, replace)
o ++ can deal with typos
o -- does not work if parts of string (words) have different order
o -- quadratic runtime complexity
 Jaro Similarity
o specifically designed for matching names
 Winkler Similarity
o Intuition: similarity of the first few letters is more important
 fewer typos in the first letters
 deals better with abbreviations
 Jaro-Winkler Similarity: combines the Jaro similarity with the Winkler prefix boost
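
A compact sketch of the Levenshtein distance, as referenced in the first item of this list (plain Python dynamic programming, quadratic in the string lengths as noted above):

def levenshtein(a: str, b: str) -> int:
    """Minimum number of insert/delete/replace operations to turn a into b."""
    prev = list(range(len(b) + 1))                      # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        curr = [i]                                      # distance from a prefix of a to ""
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # replace (or keep) ca
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3 edits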

5.2 Token-based String Similarity Measures


 Token-based measures ignore the order of words which is often desirable for comparing
multi-word strings
 Using n-grams for Jaccard Coefficient (see the sketch after this list)
o Deal with typos and different order of words
o Reduce the time complexity compared to Levenshtein
 Cosine Similarity: for comparing weighted term vectors
 TF-IDF: gives less weight to common tokens
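
A sketch of the Jaccard coefficient over word tokens and over character n-grams (plain Python; n = 3 is a made-up choice):

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if a or b else 1.0

def ngrams(s: str, n: int = 3) -> set:
    s = s.lower()
    return {s[i:i + n] for i in range(len(s) - n + 1)}

s1, s2 = "Jones Michael", "Michael Jones"
print(jaccard(set(s1.lower().split()), set(s2.lower().split())))  # word order ignored -> 1.0
print(jaccard(ngrams(s1), ngrams(s2)))  # n-grams additionally tolerate small typos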

5.3 Hybrid String Similarity Measures


Hybrid similarity measures split strings into tokens and apply an internal similarity function to compare the tokens.

 Monge-Elkan Similarity
o ++ can deal with typos and different order of words
o -- runtime complexity: quadratic
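
A rough sketch of the Monge-Elkan idea (plain Python; difflib's ratio stands in for the internal similarity function):

from difflib import SequenceMatcher

def monge_elkan(s1: str, s2: str) -> float:
    """For each token of s1, take its best match in s2 and average the scores."""
    tokens1, tokens2 = s1.lower().split(), s2.lower().split()
    inner = lambda a, b: SequenceMatcher(None, a, b).ratio()  # internal similarity function
    return sum(max(inner(t1, t2) for t2 in tokens2) for t1 in tokens1) / len(tokens1)

print(monge_elkan("Michael Jones", "Jones Micheal"))  # tolerates a typo and different word order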

5.4 Datatype-specific Similarity Measures


Numerical Comparison, Dates, geographic coordinates

Learning Matching Rules

Data Fusion
Data profiling
= the activity of calculating statistics and creating summaries of a data source or data lake.

Data Fusion
= Given multiple records that describe the same real-world entity, create a single record while
resolving conflicting data values.
Goal: Create a single high-quality record.
Two basic fusion situations:

 Slot Filling
o Fill missing values (NULLs) in one dataset with corresponding values from other datasets → increased dataset density
 Conflict Resolution
o Resolve contradictions between records by applying a conflict resolution function (heuristic) → increased data quality
Conflict Resolution Functions
 Conflict resolution functions are attribute-specific
o 1. Content-based functions that rely only on the data values to be fused
 E.g., average, min/max, union,…
o 2. Metadata-based functions that rely on provenance data, ratings, or quality scores
 E.g., favourSources, mostRecent, …
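
A small sketch of attribute-specific conflict resolution (plain Python; the records, the source ratings, and the choice of function per attribute are all made up):

from statistics import mean

# Records describing the same real-world entity; source_rating serves as simple provenance metadata.
records = [
    {"source_rating": 0.9, "title": "The Matrix", "runtime": 136, "genres": {"Sci-Fi"}},
    {"source_rating": 0.6, "title": "Matrix, The", "runtime": 138, "genres": {"Action"}},
    {"source_rating": 0.8, "title": "The Matrix", "runtime": None, "genres": {"Sci-Fi", "Action"}},
]

def favour_sources(values_with_rating):  # metadata-based: trust the value from the best-rated source
    return max(values_with_rating, key=lambda vr: vr[1])[0]

fused = {
    # content-based functions
    "runtime": mean(r["runtime"] for r in records if r["runtime"] is not None),  # average
    "genres": set().union(*(r["genres"] for r in records)),                      # union
    # metadata-based function
    "title": favour_sources([(r["title"], r["source_rating"]) for r in records]),
}
print(fused)  # one fused, higher-quality record with conflicts resolved and NULLs filled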

Evaluation of Fusion Results


 1. Data-Centric Evaluation Measures
o Density: measures the fraction of non-NULL values
o Consistency: a data set is consistent if it is free of conflicting information
 2. Ground-Truth Based Evaluation Measures
o Accuracy: fraction of correct values selected by the conflict resolution function
o Manually determine correct values for a subset of the records

Learning Conflict Resolution Functions: Choose the function with the smallest mean absolute error with respect to the gold standard.
