
WEEK 12

Data and Information Quality


1. Introduction
• Unit 3 Learning Outcome
Students evaluate the quality of information systems in an organization,
including its effects on all stakeholders inside and outside the organization.

• Week 12 Learning Objectives


· Understand dimensions of data and information quality
· Understand importance of data management

• Prior Knowledge
Information System Design and Implementation.

2. Theory
• Why Is Information Quality Relevant?
This section contains excerpts from Batini and Scannapieco (2016).
The consequences of poor information quality can be experienced in everyday life but often without an explicit connection being made to their causes. Information quality seriously impacts the efficiency and effectiveness of organizations and businesses.

Information quality requires senior management to treat information as a corporate asset and to realize that the value of this asset depends on its quality.

Quality, in general, has been defined as the “totality of characteristics of a product that bear on its ability to satisfy stated or implied needs”, also called “fitness for (intended) use”, “conformance to requirements”, or “user satisfaction”.

• Information Classifications
This section contains excerpts from Batini and Scannapieco (2016).
As to the linguistic character of information, we can distinguish several
types of information such as:

· Structured information, i.e., information represented in terms of a set of instances and a schema, which are tightly coupled, as the schema binds the semantic interpretation and properties of instances with features such as types, domains, integrity constraints, etc. (e.g., relational tables in databases). Structured data represent real-world objects, with a format and a model that can be stored, retrieved, and elaborated by database management systems (DBMSs).
· Semistructured information, i.e., information that is either partially
structured or has a descriptive rather than prescriptive schema. An XML
record is an example of semistructured information, where a schema can
be defined that binds only a part of the contained information.
· Unstructured information, i.e., any sequence of symbols, coded in
natural language or any other symbolic language with no semantics
induced by an explicit schema.
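The short Python sketch below illustrates the three types with toy examples; the records, field names, and text are hypothetical and only meant to make the schema/no-schema distinction concrete.

```python
# Hypothetical examples of the three linguistic types of information.

# Structured: instances tightly coupled to a schema (types, domains, constraints),
# as in a relational table managed by a DBMS.
person_schema = {"ssn": str, "age": int, "sex": str}          # the schema binds interpretation
structured_row = {"ssn": "123-45-6789", "age": 34, "sex": "F"}

# Semistructured: a descriptive schema binds only part of the information,
# as in an XML (or JSON) record with optional, free-form elements.
semistructured_record = """
<person>
  <ssn>123-45-6789</ssn>
  <notes>Moved twice last year; prefers contact by email.</notes>
</person>
"""

# Unstructured: a sequence of symbols with no semantics induced by an explicit schema.
unstructured_text = "The customer called on Monday complaining about a late delivery."
```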

A second point of view sees information as a product. This point of view is adopted, for example, in the IP-MAP model. The IP-MAP model identifies a parallelism between the quality of information and the quality of products as managed by manufacturing companies. In this model, three different types of information are distinguished:
· Raw data items are the smallest data units. They are used to construct information products, together with component data items, which are semi-processed information.
· While the raw data items may be stored for long periods of time, the component data items are stored temporarily until the final product is manufactured. The component items are regenerated each time an information product is needed. The same set of raw data and component data items may be used (sometimes simultaneously) in the manufacturing of several different products.
· Information products, which are the result of a manufacturing activity
performed on data.

The third classification addresses a typical distinction made in information systems between elementary data and aggregated data. Elementary data are managed in organizations by operational processes and represent atomic phenomena of the real world (e.g., social security number, age, and sex). Aggregated data are obtained from a collection of elementary data by applying some aggregation function to them (e.g., the average income of taxpayers in a given city).
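As a minimal illustration of this distinction, the sketch below derives an aggregated value from elementary records; the data are invented for the example.

```python
# Elementary data: atomic real-world facts managed by operational processes (invented records).
taxpayers = [
    {"ssn": "111", "city": "Milan", "income": 32_000},
    {"ssn": "222", "city": "Milan", "income": 41_000},
    {"ssn": "333", "city": "Rome",  "income": 28_000},
]

# Aggregated data: obtained by applying an aggregation function to elementary data,
# e.g., the average income of taxpayers in a given city.
milan_incomes = [t["income"] for t in taxpayers if t["city"] == "Milan"]
average_income_milan = sum(milan_incomes) / len(milan_incomes)
print(average_income_milan)   # 36500.0
```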

Previous classifications did not take into account the time dimension of
information. According to its change frequency, we can classify source
information into three categories:
· Stable information is information that is unlikely to change. Examples
are scientific publications; although new publications can be added to the
source, older publications remain unchanged.
· Long-term-changing information is information that has a very low
change frequency. Examples are addresses, currencies, and hotel price
lists. The concept of “low frequency” is domain dependent; in an e-trade
application, if the value of a stock quote is tracked once an hour, it is
considered to be a low-frequency change, while a shop that changes its
goods weekly has a high-frequency change for clients.
· Frequently changing information is information that has intensive
change, such as real-time traffic information, temperature sensor
measures, and sales quantities. The changes can occur with a defined
frequency or they can be random.

• Data and Information Quality Clusters of Dimensions
This section contains excerpts from Batini and Scannapieco (2016).
Dimensions for data and information quality can be characterized by a
common classification framework that allows us to compare dimensions
across different information types. The framework is based on a
classification in clusters where dimensions are included in the same cluster
according to their similarity. Clusters are defined in the following list:
1. Accuracy, correctness, validity, and precision focus on the adherence
to a given reality of interest.
2. Completeness, pertinence, and relevance refer to the capability of
representing all and only the relevant aspects of the reality of
interest.
3. Redundancy, minimality, compactness, and conciseness refer to the
capability of representing the aspects of the reality of interest with
the minimal use of informative resources.
4. Readability, comprehensibility, clarity, and simplicity refer to ease
of understanding and use of information by users.
5. Accessibility and availability are related to the ability of the user to
access information from his or her culture, physical status/functions,
and technologies available.
6. Consistency, cohesion, and coherence refer to the capability of the
information to comply, without contradictions, with all properties of the
reality of interest, as specified in terms of integrity constraints, data
edits, business rules, and other formalisms.
7. Usefulness refers to the advantage the user gains from the use of
information.
8. Trust, including believability, reliability, and reputation, captures
how much information derives from an authoritative source. The
trust cluster also encompasses issues related to security.

Next, some of these clusters are described in more detail.


Accuracy
Accuracy is defined as the closeness between a data value v and a data value v′, considered as the correct representation of the real-life phenomenon that the data value v aims to represent. The world around us changes, and what we have called in the above definition “the real-life phenomenon that the data value v aims to represent” reflects such changes.

So, there is a particular yet relevant type of data accuracy that refers to the rapidity with which a change in the real-world phenomenon is reflected in the update of the data value; we call this temporal accuracy, in contrast to structural accuracy, which characterizes the accuracy of data as observed in a specific time frame, where the data value can be considered stable and unchanged.

Two kinds of (structural) accuracy can be identified, namely, syntactic accuracy and semantic accuracy. Syntactic accuracy is the closeness of a value v to the elements of the corresponding definition domain D. In syntactic accuracy, we are not interested in comparing v with the true value v′; rather, we are interested in checking whether v is any one of the values in D, whatever it is. Semantic accuracy is the closeness of the value v to the true value v′. Note that while it is reasonable to measure syntactic accuracy using a distance function, semantic accuracy is better measured with a ⟨yes, no⟩ or a ⟨correct, not correct⟩ domain. Consequently, semantic accuracy coincides with the concept of correctness.
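A minimal Python sketch of the distinction follows; the city domain, the true value, and the use of a string-similarity ratio as a stand-in for a distance function are all assumptions made for illustration.

```python
import difflib

D = {"Rome", "Milan", "Naples", "Turin"}   # assumed definition domain of a "city" attribute

def syntactic_accuracy(v: str, domain: set[str]) -> float:
    """Closeness of v to the elements of the definition domain, regardless of the true value."""
    if v in domain:
        return 1.0
    # Otherwise: similarity (0..1) to the closest domain element, as a stand-in distance measure.
    return max(difflib.SequenceMatcher(None, v, d).ratio() for d in domain)

def semantic_accuracy(v: str, true_value: str) -> bool:
    """Semantic accuracy is better measured on a <correct, not correct> domain."""
    return v == true_value

print(syntactic_accuracy("Milna", D))       # high (about 0.8): close to the valid value "Milan"
print(semantic_accuracy("Milan", "Rome"))   # False: syntactically valid but semantically wrong
```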

A relevant aspect of data is their change and update in time. A classification of types of data according to the temporal dimension is: stable, long-term-changing, and frequently changing data. The principal time-related dimensions proposed for characterizing the above three types of data are currency, volatility, and timeliness. Currency concerns how promptly data are updated with respect to changes occurring in the real world. As an example, if the residential address of a person is up to date, i.e., it corresponds to the address where the person currently lives, then currency is high.

Volatility characterizes the frequency with which data vary in time. For instance, stable data such as birth dates have volatility equal to 0, as they do not vary at all. Conversely, stock quotes, a kind of frequently changing data, have a high degree of volatility due to the fact that they remain valid for very short time intervals.

Timeliness expresses how current the data are for the task at hand. The
timeliness dimension is motivated by the fact that it is possible to have
current data that are actually useless because they are late for a specific
usage. For instance, the timetable for university courses can be current by
containing the most recent data, but it is not timely if it is available only
after the start of the classes.
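The sketch below shows one common way, found in the data quality literature but not given in the excerpt above, to turn these ideas into a score; it operationalizes volatility as the length of time a value remains valid, and all function names and numbers are illustrative assumptions.

```python
def currency(time_since_last_real_world_change: float, delivery_delay: float = 0.0) -> float:
    """How out of date the data may be at the moment they are used (same time unit throughout)."""
    return time_since_last_real_world_change + delivery_delay

def timeliness(currency_value: float, validity_interval: float) -> float:
    """1.0 = fully timely for the task; 0.0 = the data arrive too late to be useful."""
    if validity_interval == float("inf"):   # stable data (volatility 0) never expire
        return 1.0
    return max(0.0, 1.0 - currency_value / validity_interval)

# A stock quote that stays valid for about 5 minutes, read 4 minutes after the last change:
print(timeliness(currency(4.0), validity_interval=5.0))                  # ~0.2, barely timely
# A birth date (stable data) is always timely:
print(timeliness(currency(10_000.0), validity_interval=float("inf")))    # 1.0
```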

Completeness
Completeness can be generically defined as “the extent to which data are of
sufficient breadth, depth, and scope for the task at hand”. Three types of
completeness are identified. Schema completeness is defined as the degree
to which concepts and their properties are not missing from the schema
(metadata). Column completeness is defined as a measure of the missing
values for a specific property or column in a table. Population completeness
evaluates missing values with respect to a reference population.
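A minimal sketch of column and population completeness on an invented table follows; the customer records and the reference population are hypothetical.

```python
customers = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": None},             # missing value
    {"id": 3, "email": "c@example.com"},
]

def column_completeness(rows, column):
    """Share of non-missing values for a specific property (column) in a table."""
    return sum(r[column] is not None for r in rows) / len(rows)

def population_completeness(rows, reference_ids):
    """Share of a reference population that is actually represented in the table."""
    present = {r["id"] for r in rows}
    return len(present & reference_ids) / len(reference_ids)

print(column_completeness(customers, "email"))             # 0.666...: one email is missing
print(population_completeness(customers, {1, 2, 3, 4}))    # 0.75: customer 4 is absent
```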

Accessibility
Accessibility measures the ability of the user to access the data from his or
her own culture, physical status/functions, and technologies available.
Several guidelines are provided by international and national bodies to
govern the production of data, applications, services, and Web sites in order
to guarantee accessibility. In the following, we describe some guidelines
related to data provided by the World Wide Web Consortium.

The first, and perhaps most important, guideline indicates provision of equivalent alternatives to auditory and visual content, called text equivalent content. In order for a text equivalent to make an image accessible, the text content can be presented to the user as synthesized speech, braille, and visually displayed text. Each of these three mechanisms uses a different sense, making the information accessible to groups affected by a variety of sensory and other disabilities. In order to be useful, the text must convey the same function or purpose as the image.

Other guidelines suggest:


• Avoiding the use of color as the only means to express semantics,
helping color-blind people appreciate the meaning of data.
• Using clear natural language, e.g., by providing expansions of
acronyms, improving readability, and making frequent use of plain terms.
• Designing a Web site that ensures device independence using
features that enable activation of page elements via a variety of input
devices.
• Providing context and orientation information to help users
understand complex pages or elements.

Several countries have enacted specific laws to enforce accessibility in public and private Web sites and applications used by citizens and employees, in order to provide them effective access and reduce the digital divide.

Consistency
The consistency dimension captures the violation of semantic rules defined
over (a set of) data items. With reference to relational theory, integrity
constraints are an instantiation of such semantic rules. Integrity constraints
are properties that must be satisfied by all instances of a database schema. It
is possible to distinguish two main categories of integrity constraints,
namely, intrarelation constraints and interrelation constraints.

Intrarelation integrity constraints can regard single attributes (also called domain constraints) or multiple attributes of a relation. Interrelation integrity constraints involve attributes of more than one relation.
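The Python sketch below checks one constraint of each category on invented tables; the relations, attributes, and the 0–120 age domain are assumptions made for the example.

```python
employees = [
    {"emp_id": 1, "age": 34,  "dept_id": 10},
    {"emp_id": 2, "age": 172, "dept_id": 99},   # violates both constraints below
]
departments = [{"dept_id": 10, "name": "Sales"}]

# Intrarelation (domain) constraint: a single attribute must stay within its domain.
def age_domain_violations(rows, low=0, high=120):
    return [r for r in rows if not (low <= r["age"] <= high)]

# Interrelation constraint: attributes of more than one relation, e.g., every employee's
# dept_id must reference an existing department (referential integrity).
def dept_reference_violations(emps, depts):
    valid_ids = {d["dept_id"] for d in depts}
    return [r for r in emps if r["dept_id"] not in valid_ids]

print(age_domain_violations(employees))                   # employee 2: age outside 0-120
print(dept_reference_violations(employees, departments))  # employee 2: department 99 missing
```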

Approaches to the Definition of Data Quality Dimensions
There are three main approaches adopted for proposing comprehensive sets of dimension definitions, namely, theoretical, empirical, and intuitive.
The theoretical approach adopts a formal model in order to define or justify
the dimensions. The empirical approach constructs the set of dimensions
starting from experiments, interviews, and questionnaires. The intuitive
approach simply defines dimensions according to common sense and
practical experience.

• Theoretical Approach
This approach considers an information system (IS) as a
representation of a real-world system (RW); RW is properly
represented in an IS if (1) there exists an exhaustive mapping RW →
IS and (2) no two states in RW are mapped into the same state in the
IS, i.e., the inverse mapping is a function.
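A minimal sketch of these two conditions follows; the real-world and information-system states are invented labels used only to show how each condition can be checked.

```python
rw_states = {"rw1", "rw2", "rw3"}                      # states of the real-world system (RW)
mapping = {"rw1": "is1", "rw2": "is2", "rw3": "is2"}   # representation mapping RW -> IS

def is_exhaustive(rw, m):
    """Condition (1): every real-world state is mapped to some IS state."""
    return rw.issubset(m.keys())

def inverse_is_function(m):
    """Condition (2): no two RW states are mapped into the same IS state."""
    return len(set(m.values())) == len(m)

print(is_exhaustive(rw_states, mapping))    # True: the mapping covers all of RW
print(inverse_is_function(mapping))         # False: rw2 and rw3 collapse into is2
```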

All deviations from proper representations generate deficiencies. The authors distinguish between design deficiencies and operation deficiencies. Design deficiencies are of three types: incomplete representation, ambiguous representation, and meaningless states.

Only one type of operation deficiency is identified, in which a state in RW might be mapped to a wrong state in an IS; this is referred to as garbling. Garbling with a map to a meaningless state is dangerous, as it will preclude a map back to a real-world state. Garbling to a meaningful but wrong state will allow the user to map back to a real-world state.

A set of data quality dimensions is defined by making reference to the described deficiencies. More specifically, the identified dimensions are:
· Accuracy: “inaccuracy implies that the information system represents a real-world state different from the one that should have been represented.” Inaccuracy refers to a garbled mapping into a wrong state of the IS, where it is possible to infer a valid state of the real world though not the correct one.
· Reliability indicates “whether the data can be counted on to convey
the right information; it can be viewed as correctness of data.” No
interpretation in terms of data deficiencies is given.
· Timeliness refers to “the delay between a change of the real-world
state and the resulting modification of the information system state.”
Lack of timeliness may lead to an IS state that reflects a past RW
state.
· Completeness is “the ability of an information system to represent
every meaningful state of the represented real-world system.”
Completeness is of course tied to incomplete representations.
· Consistency of data values: inconsistency occurs if there is more than one state of the information system matching a state of the real-world system; therefore, “inconsistency would mean that the representation mapping is one-to-many.” This is ruled out by the representation mapping itself, so inconsistency is not considered a result of a deficiency.

• Empirical Approach
Data quality dimensions have been selected by interviewing data consumers. Starting from 179 data quality dimensions, the authors selected 15 different dimensions. A two-level classification is proposed, in which each of four categories is further specialized into a number of dimensions. The four categories are:
· Intrinsic data quality, capturing the quality that data has on its own.
As an example, accuracy is a quality dimension that is intrinsic to
data.
· Contextual data quality considers the context where data are used.
As an example, the completeness dimension is strictly related to the
context of the task.
· Representational data quality captures aspects related to the quality
of data representation, e.g., interpretability.

· Accessibility data quality is related to the accessibility of data and
to a further nonfunctional property of data access, namely, the level
of security.

• Intuitive Approach
Redman classifies DQ dimensions according to three categories,
namely, conceptual schema, data values, and data format.
Conceptual schema dimensions correspond to what we called
schema dimensions. Data value dimensions refer specifically to
values, independently of the internal representation of data; this last
aspect is covered by data format dimensions.

• Data Management
This section contains excerpts from Sebastian-Coleman (2018).
Even before the rise of information technology, information and knowledge have been key to competitive advantage. Organizations that have reliable,
high-quality information about their customers, products, services, and
operations can make better decisions than those without data (or with
unreliable data). But producing high-quality data and managing it in ways
that enable it to be used effectively is not a simple process.

Data is everywhere
Almost every business process uses data as input and produces data as
output. Technical changes have enabled organizations to use data in new
ways to create products, share information, create knowledge, and improve
organizational success. But the rapid growth of technology, and with it human capacity to produce, capture, and mine data for meaning, has intensified the need to manage data effectively.

Data as an asset
An asset is an economic resource that can be owned or controlled and that holds or produces value. Data is widely recognized as an enterprise asset, although many organizations still struggle to manage data as an asset.

The primary driver for data management is to enable organizations to get value from their data, just as effective management of financial and physical assets enables organizations to get value from those assets. Deriving value from data does not happen in a vacuum or by accident. It requires organizational commitment and leadership, as well as management.

Data Management vs Technology Management


Data management is the development, execution, and supervision of plans,
policies, programs, and practices that deliver, control, protect, and enhance
the value of data and information assets, throughout their lifecycle.

Though data management is highly dependent on technology and intersects with technology management, it involves separate activities that are independent of specific technical tools and processes.

Data management involves planning and coordinating resources and activities in order to meet organizational objectives. The activities themselves range from the highly technical, like ensuring that large databases are accessible, performant, and secure, to the highly strategic, like determining how to expand market share through innovative uses of data. These management activities must strive to make high-quality, reliable data available to the organization, while ensuring this data is accessible to authorized users and protected from misuse.

Data Management Activities


Data management activities can be understood in groups:
· Governance Activities help control data development and reduce risks
associated with data use, while at the same time, enabling an organization to
leverage data strategically. Governance activities include things like:
· Defining Data Strategy
· Setting Policy
· Stewarding Data
· Defining the value of data to the organization
· Preparing the organization to get more value from its data by maturing
its data management practices and evolving the organization’s mindset
around data through culture change

· Lifecycle Activities focus on planning and designing for data, enabling its
use, ensuring it is effectively maintained, and actually using it. Use of data
often results in enhancements and innovations, which have their own
lifecycle requirements. Lifecycle activities include:
· Data Architecture
· Data Modeling
· Building and managing data warehouses and marts
· Integrating data for use by business intelligence analysts and data
scientists
· Managing the lifecycle of highly critical shared data

· Foundational Activities are required for consistent management of data over time. Integral to the entire data lifecycle, these activities include:
· Ensuring data is protected
· Managing Metadata, the knowledge required to understand and use data
· Managing the quality of data

3. References
Batini, C. and Scannapieco, M. (2016). Data and Information Quality: Dimensions, Principles and Techniques. Milano, Italy: Springer International Publishing.

Sebastian-Coleman, L. (2018). Navigating the Labyrinth: An Executive Guide to Data Management. Basking Ridge, USA: Technics Publications.

4. Extra Material
• What is Data Management? Infographic Video.
https://youtu.be/5xw_OjVx5gQ
