Data & Information Quality - Reading
• Prior Knowledge
Information System Design and Implementation.
2. Theory
• Why Is Information Quality Relevant?
This section contains excerpts from Batini and Scannapieco (2016).
The consequences of poor quality of information can be experienced in
everyday life but, often, without making explicit connections to their causes.
Information quality seriously impacts on the efficiency and effectiveness of
organizations and businesses.
• Information Classifications
This section contains excerpts from Batini and Scannapieco (2016).
As to the linguistic character of information, we can distinguish several
types of information such as:
data items may be used (sometimes simultaneously) in the manufacturing
of several different products.
· Information products, which are the result of a manufacturing activity
performed on data.
Previous classifications did not take into account the time dimension of
information. According to its change frequency, we can classify source
information into three categories:
· Stable information is information that is unlikely to change. Examples
are scientific publications; although new publications can be added to the
source, older publications remain unchanged.
· Long-term-changing information is information that has a very low
change frequency. Examples are addresses, currencies, and hotel price
lists. The concept of “low frequency” is domain dependent; in an e-trade
application, if the value of a stock quote is tracked once an hour, it is
considered to be a low-frequency change, while a shop that changes its
goods weekly has a high-frequency change for clients.
· Frequently changing information is information that has intensive
change, such as real-time traffic information, temperature sensor
measures, and sales quantities. The changes can occur with a defined
frequency or they can be random.
• Data and Information Quality Clusters of Dimensions
This section contains excerpts from Batini and Scannapieco (2016).
Dimensions for data and information quality can be characterized by a
common classification framework that allows us to compare dimensions
across different information types. The framework is based on a
classification in clusters where dimensions are included in the same cluster
according to their similarity. Clusters are defined in the following list:
1. Accuracy, correctness, validity, and precision focus on the adherence
to a given reality of interest.
2. Completeness, pertinence, and relevance refer to the capability of
representing all and only the relevant aspects of the reality of
interest.
3. Redundancy, minimality, compactness, and conciseness refer to the
capability of representing the aspects of the reality of interest with
the minimal use of informative resources.
4. Readability, comprehensibility, clarity, and simplicity refer to ease
of understanding and fruition of information by users.
5. Accessibility and availability are related to the ability of the user to
access information from his or her culture, physical status/functions,
and technologies available.
6. Consistency, cohesion, and coherence refer to the capability of the
information to comply without contradictions to all properties of the
reality of interest, as specified in terms of integrity constraints, data
edits, business rules, and other formalisms.
7. Usefulness relates to the advantage the user gains from the use of
information.
8. Trust, including believability, reliability, and reputation, captures
the extent to which information derives from an authoritative source.
The trust cluster also encompasses issues related to security.
the data value v aims to represent. The world around us changes, and what
we have called in the above definition “the real-life phenomenon that the
data value v aims to represent” reflects such changes.
So, there is a particular yet relevant type of data accuracy that refers to the
rapidity with which the change in real-world phenomenon is reflected in the
update of the data value; we call temporal accuracy such type of accuracy,
in contrast to structural accuracy, that characterizes the accuracy of data as
observed in a specific time frame, where the data value can be considered as
stable and unchanged.
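As an illustration (not part of the excerpt), temporal accuracy can be quantified as the delay between a change in the real-world phenomenon and the update of the stored data value; the function name and the scenario below are invented for this sketch:

```python
from datetime import datetime

def temporal_accuracy_delay(real_world_change: datetime,
                            data_update: datetime) -> float:
    """Delay, in seconds, between a change in the real-world phenomenon
    and the update of the data value that represents it (smaller is better)."""
    return (data_update - real_world_change).total_seconds()

# A customer moved on January 1, but the address record was only
# corrected on January 4: a three-day temporal-accuracy lag.
delay = temporal_accuracy_delay(datetime(2023, 1, 1), datetime(2023, 1, 4))
```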
Volatility characterizes the frequency with which data vary in time. For
instance, stable data such as birth dates have volatility equal to 0, as they do
not vary at all. Conversely, stock quotes, a kind of frequently changing data,
have a high degree of volatility because they remain valid only for very
short time intervals.
Timeliness expresses how current the data are for the task at hand. The
timeliness dimension is motivated by the fact that it is possible to have
current data that are actually useless because they are late for a specific
usage. For instance, the timetable for university courses can be current by
containing the most recent data, but it is not timely if it is available only
after the start of the classes.
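These time-related dimensions can be combined into a single score. As a hedged sketch, one widely cited formulation (attributed to Ballou et al., not given in the excerpt above) derives timeliness from currency (the age of the data) and volatility (how long the data stay valid):

```python
def timeliness(currency: float, volatility: float, s: float = 1.0) -> float:
    """Timeliness score in [0, 1]: max(0, 1 - currency / volatility) ** s,
    where currency is the age of the data and volatility is the length of
    time the data remain valid.  The exponent s tunes how quickly the
    score decays for the task at hand; 1 means perfectly timely."""
    return max(0.0, 1.0 - currency / volatility) ** s

# A stock quote that stays valid for 60 seconds, refreshed 45 seconds ago:
timeliness(currency=45, volatility=60)   # 0.25
# Data older than its validity window is simply not timely:
timeliness(currency=90, volatility=60)   # 0.0
```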
Completeness
Completeness can be generically defined as “the extent to which data are of
sufficient breadth, depth, and scope for the task at hand”. Three types of
completeness are identified. Schema completeness is defined as the degree
to which concepts and their properties are not missing from the schema
(metadata). Column completeness is defined as a measure of the missing
values for a specific property or column in a table. Population completeness
evaluates missing values with respect to a reference population.
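The two instance-level notions of completeness reduce to simple ratios. The sketch below uses invented helper names and adopts the convention that None marks a missing value:

```python
def column_completeness(values):
    """Fraction of non-missing values in one column of a table."""
    if not values:
        return 1.0
    return sum(1 for v in values if v is not None) / len(values)

def population_completeness(observed, reference):
    """Fraction of a reference population actually represented in the data."""
    return len(set(observed) & set(reference)) / len(set(reference))

emails = ["ada@example.org", None, "carl@example.org", None]
column_completeness(emails)   # 0.5
# The customers table should cover four sales regions but covers only two:
population_completeness(["IT", "FR"], ["IT", "FR", "DE", "ES"])   # 0.5
```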
Accessibility
Accessibility measures the ability of the user to access the data from his or
her own culture, physical status/functions, and technologies available.
Several guidelines are provided by international and national bodies to
govern the production of data, applications, services, and Web sites in order
to guarantee accessibility. In the following, we describe some guidelines
related to data provided by the World Wide Web Consortium.
sensory and other disabilities. In order to be useful, the text must convey the
same function or purpose as the image.
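A guideline of this kind (a text equivalent for every image) can be checked mechanically. The sketch below, using only the Python standard library and an invented class name, flags img tags that lack an alt attribute:

```python
from html.parser import HTMLParser

class AltTextChecker(HTMLParser):
    """Collects the src of every <img> tag that has no alt attribute,
    i.e., no text equivalent for users who cannot see the image."""
    def __init__(self):
        super().__init__()
        self.missing = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "img" and "alt" not in attributes:
            self.missing.append(attributes.get("src", "<no src>"))

checker = AltTextChecker()
checker.feed('<img src="logo.png" alt="Company logo"><img src="chart.png">')
checker.missing   # ['chart.png']
```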
Consistency
The consistency dimension captures the violation of semantic rules defined
over (a set of) data items. With reference to relational theory, integrity
constraints are an instantiation of such semantic rules. Integrity constraints
are properties that must be satisfied by all instances of a database schema. It
is possible to distinguish two main categories of integrity constraints,
namely, intrarelation constraints and interrelation constraints.
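As a minimal sketch of the intrarelation case (rule names and data invented for illustration), semantic rules can be expressed as predicates over a single tuple and checked row by row:

```python
# Each rule is an intrarelation constraint over a single tuple (row),
# expressed as a (name, predicate) pair.
rules = [
    ("age is non-negative",       lambda r: r["age"] >= 0),
    ("retired implies age >= 60", lambda r: not r["retired"] or r["age"] >= 60),
]

def violations(rows, rules):
    """Return (row index, rule name) for every semantic-rule violation."""
    return [(i, name)
            for i, row in enumerate(rows)
            for name, check in rules if not check(row)]

rows = [{"age": 70, "retired": True},
        {"age": 35, "retired": True},    # violates the second rule
        {"age": -1, "retired": False}]   # violates the first rule
violations(rows, rules)
# [(1, 'retired implies age >= 60'), (2, 'age is non-negative')]
```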
Approaches to the Definition of Data Quality Dimensions
There are three main approaches adopted for proposing comprehensive sets
of the dimension definitions, namely, theoretical, empirical, and intuitive.
The theoretical approach adopts a formal model in order to define or justify
the dimensions. The empirical approach constructs the set of dimensions
starting from experiments, interviews, and questionnaires. The intuitive
approach simply defines dimensions according to common sense and
practical experience.
• Theoretical Approach
This approach considers an information system (IS) as a
representation of a real-world system (RW); RW is properly
represented in an IS if (1) there exists an exhaustive mapping RW →
IS and (2) no two states in RW are mapped into the same state in the
IS, i.e., the inverse mapping is a function.
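For finite, enumerable state sets, the two conditions can be checked directly; this is a sketch under that assumption, with invented names:

```python
def is_proper_representation(rw_states, mapping):
    """Check the two conditions of the theoretical approach:
    (1) every real-world state is mapped to some IS state (exhaustive), and
    (2) no two RW states share the same IS state (the inverse is a function)."""
    exhaustive = all(s in mapping for s in rw_states)
    images = [mapping[s] for s in rw_states if s in mapping]
    injective = len(images) == len(set(images))
    return exhaustive and injective

rw = {"order placed", "order shipped", "order delivered"}
ok_map  = {"order placed": "P", "order shipped": "S", "order delivered": "D"}
bad_map = {"order placed": "P", "order shipped": "P", "order delivered": "D"}
is_proper_representation(rw, ok_map)    # True
is_proper_representation(rw, bad_map)   # False: two RW states collapse onto "P"
```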
• Empirical Approach
Data quality dimensions have been selected by interviewing data
consumers. Starting from 179 data quality dimensions, the authors
selected 15 different dimensions. A two-level classification is
proposed, in which each of four categories is further specialized into
a number of dimensions. The four categories are:
· Intrinsic data quality, capturing the quality that data has on its own.
As an example, accuracy is a quality dimension that is intrinsic to
data.
· Contextual data quality considers the context where data are used.
As an example, the completeness dimension is strictly related to the
context of the task.
· Representational data quality captures aspects related to the quality
of data representation, e.g., interpretability.
· Accessibility data quality is related to the accessibility of data and
to a further nonfunctional property of data access, namely, the level
of security.
• Intuitive Approach
Redman classifies DQ dimensions according to three categories,
namely, conceptual schema, data values, and data format.
Conceptual schema dimensions correspond to what we called
schema dimensions. Data value dimensions refer specifically to
values, independently of the internal representation of data; this last
aspect is covered by data format dimensions.
• Data Management
This section contains excerpts from Sebastian-Coleman (2018).
Even before the rise of information technology, information and knowledge
have been key to competitive advantage. Organizations that have reliable,
high-quality information about their customers, products, services, and
operations can make better decisions than those without data (or with
unreliable data). But producing high-quality data and managing it in ways
that enable it to be used effectively is not a simple process.
Data is everywhere
Almost every business process uses data as input and produces data as
output. Technical changes have enabled organizations to use data in new
ways to create products, share information, create knowledge, and improve
organizational success. But the rapid growth of technology, and with it the
human capacity to produce, capture, and mine data for meaning, has
intensified the need to manage data effectively.
Data as an asset
An asset is an economic resource that can be owned or controlled and that
holds or produces value. Data is widely recognized as an enterprise asset,
although many organizations still struggle to manage data as an asset.
· Lifecycle Activities focus on planning and designing for data, enabling its
use, ensuring it is effectively maintained, and actually using it. Use of data
often results in enhancements and innovations, which have their own
lifecycle requirements. Lifecycle activities include:
· Data Architecture
· Data Modeling
· Building and managing data warehouses and marts
· Integrating data for use by business intelligence analysts and data
scientists
· Managing the lifecycle of highly critical shared data
3. References
Batini, C. and Scannapieco, M. (2016). Data and Information Quality:
Dimensions, Principles and Techniques. Springer International
Publishing.
Sebastian-Coleman, L. (2018). Navigating the Labyrinth: An Executive Guide
to Data Management. Technics Publications.
4. Extra Material
• What is Data Management? Infographic Video.
https://youtu.be/5xw_OjVx5gQ