
Electron Markets (2017) 27:387–398

DOI 10.1007/s12525-017-0245-6

RESEARCH PAPER

The data quality improvement plan: deciding on choice and sequence of data quality improvements
Dominikus Kleindienst 1

Received: 18 December 2015 / Accepted: 3 January 2017 / Published online: 14 January 2017
© Institute of Applied Informatics at University of Leipzig 2017

Abstract With the rapid growth in the amount of data generated worldwide, ensuring adequate data quality (DQ) is increasingly becoming a challenge for companies: data are, among others, required to be timely, complete, consistent, valid, and accessible. Given this multidimensionality, DQ improvements (DQIs) need to be purposefully chosen and, as there can be path dependencies, arranged in an optimal sequence. Thus, this research contributes to performing the complex multidimensional task of ensuring adequate DQ in an economically reasonable manner by providing a formal decision model for identifying an optimal data quality improvement plan (DQIP). This DQIP comprises both an economically reasonable selection and execution sequence of DQIs based on existing interrelationships between different DQ dimensions. Furthermore, a comprehensive Monte Carlo simulation provides insights in implications to put the decision model into operation. For practitioners, the decision model enables efficient allocation of resources to DQIs. The model also gives advice on how to sequence DQIs and attracts attention to the complex problem context of DQ in order to support valid managerial decisions.

Keywords Data quality · Dimensions · Interrelationships · Improvements

JEL Classification M29

Responsible Editor: Hans-Dieter Zimmermann

* Dominikus Kleindienst
dominikus.kleindienst@fim-rc.de

1 FIM Research Center, Universitätsstraße 12, 86159 Augsburg, Germany

Introduction

Today’s generation has never known a world without cell phones and the Internet and is now increasingly participating in business-to-customer interactions (Radant et al. 2014). This is one of many reasons for the inevitable and rapid growth in the amount of worldwide generated data (Provost and Fawcett 2013), which, as a result, makes data quality (DQ) a growing challenge for companies. Accordingly, information and communication technologies yield new data-driven business models that further intensify data-rich and DQ-reliant interactions between companies and customers. For companies, ensuring adequate DQ is a manifold task that requires the data to be up-to-date, complete, consistent, valid, and accessible in order to improve customer interactions. Given this global and recurrent challenge and against the background of scarce resources, data errors need to be addressed by an economics-driven method. That is, DQ improvements (DQIs) have to be chosen according to their effectiveness and then applied in an efficient sequence. Particularly, in large-scale datasets, on which Big Data technologies conduct more and more data-intensive processing (Vera-Baquero et al. 2015), an optimal choice and sequence of DQIs is, looking at all sorts of addressable DQ dimensions, not apparent initially. Although most data errors in a business context are not grave and Redman (2004) states, “that most data errors just cause an invoice to be incorrect” or “the wrong product to be sent,” they can in sum be very costly. Moreover, good-quality data do not necessarily guarantee good decisions (Shah et al. 2012); nonetheless, low-quality data are usually associated with poor decisions and high costs (Ballou and Tayi 1999; Even and Shankaranarayanan 2007; Fisher et al. 2003). This has been observed in several studies; for example, a study conducted by the Data Warehouse Institute revealed that in 67% of the involved companies, poor DQ resulted in high

costs (Russom 2006). Moreover, 75% of the interviewees in an international study on DQ admitted making wrong decisions due to incorrect data (Harris Interactive 2006). In line with these results, 85% of the interviewees in a study conducted by Forrester Research (2011) consider investments in DQ and solving of DQ problems as important. Hence, managing DQ is a significant problem for companies (Ballou et al. 1998; Jiang et al. 2007; Russom 2006). Furthermore, ensuring adequate DQ requires economics-oriented data quality management (DQM) (Even and Kaiser 2009).

DQ is described as the extent to which the data stored in an information system corresponds with the defined requirements or with the respective information in the real world (Orr 1998; Parssian et al. 2004). Moreover, DQ is considered to be multidimensional (Wang and Strong 1996) with consistency, completeness, and timeliness counted as the most cited among the DQ dimensions (Lee et al. 2002; Wand and Wang 1996).

Researchers often focus on developing metrics to measure DQ in terms of a single DQ dimension (Ballou et al. 1998; Even and Shankaranarayanan 2007; Heinrich et al. 2009; Heinrich and Klier 2011; Hüner et al. 2011), which is necessary to quantify as-is and to-be DQ levels and is a prerequisite to support economics-oriented DQM (Heinrich et al. 2007b; Pipino et al. 2002). Besides, a large number of trade-offs and interrelationships exist among the DQ dimensions (Ballou and Tayi 1999; Ballou and Pazer 2003; De Amicis et al. 2006; Helfert et al. 2009). Consequently, enhancing a particular DQ dimension (e.g., accuracy) using a specific DQI (e.g., data cleansing) will most likely impact other DQ dimensions (e.g., timeliness) (Ballou and Tayi 1999). Based on the results provided by the aforementioned metrics and considering these interrelationships, it is possible to select appropriate DQIs and determine an economically reasonable execution sequence for them. The resulting selection and its corresponding execution sequence constitute a DQ improvement plan (DQIP), which can be applied to improve DQ step-by-step in an economically reasonable manner.

However, an approach that considers interrelationships between DQ dimensions in order to define an optimal selection and execution sequence of DQIs does not exist. Hence, we investigate how an optimal DQIP can be determined, which consists of an optimal selection and execution sequence of DQIs.

The paper is organized as follows. First, we outline the problem context and discuss work related to determining an optimal DQIP. Based on this discussion, we propose the decision model and a procedure for answering the aforementioned research question in order to contribute to DQM. We illustrate these explanations using examples with concrete DQIs and DQ dimensions. Then, we evaluate the decision model with respect to its quality by conducting a Monte Carlo simulation. Next, we derive recommendations relevant for managing DQ improvement initiatives as well as for further research. The conclusion summarizes results, addresses limitations, and discusses areas of possible future research.

Theoretical background and related work

Ensuring adequate DQ requires economics-oriented DQM (Even and Kaiser 2009), so as to manage DQ with respect to cost and economic benefits. An economical orientation can be provided by considering performance indicators such as “utility” or “effectiveness,” and effort indicators such as “costs.” An economics-oriented DQM usually comprises four phases: defining, measuring, analyzing, and improving DQ (Wang 1998). That is, DQ has to be improved continuously using DQIs (Ballou and Tayi 1989; Wang 1998). DQIs can be defined as actions that are taken to improve the actual level of DQ (e.g., filtering data, buying trustworthy third-party data, implementing consistency checks, etc.), which usually results in costs and economic benefits (Heinrich et al. 2007a).

As DQ is a multidimensional concept, DQ dimensions and their interrelationships have been analyzed in several studies. For example, interrelationships between DQ dimensions such as accuracy and timeliness as well as completeness and consistency are modeled as trade-offs (Ballou and Pazer 1995; Ballou and Pazer 2003). In addition, logical connections between DQ dimensions have been discussed (Gackowski 2004) and approaches have been developed to quantify existing interrelationships by means of correlations (De Amicis et al. 2006; Lee et al. 2002). However, DQ dimensions are mostly analyzed independently, because most frameworks for DQIs are developed on an ad-hoc basis to solve specific problems (Pipino et al. 2002). Nonetheless, taking into account these interrelationships, which can be traced back to the differing effects of DQIs on particular DQ dimensions, is a prerequisite for enhancing overall DQ (i.e., an aggregate of several DQ dimensions). Against this background, it is also necessary to consider more than one DQI when enhancing overall DQ, as single DQIs mostly impact only a specific subset of DQ dimensions. As a result, interrelationships between DQ dimensions influence the optimal selection of several DQIs, as well as their optimal execution sequence (Helfert et al. 2009) (cf. section 3). Thus, an optimal DQIP comprises an economically optimal selection and execution sequence of DQIs, considering the respective interrelationships between relevant DQ dimensions.

Today, several concepts comparable to the idea of a DQIP exist. We examine the following concepts: allocating resources to different datasets with respect to costs and effectiveness of available maintenance options (Ballou and Tayi 1989), identifying the most valuable DQ enhancement projects (Ballou and Tayi 1999), identifying an economically optimal configuration of the dimensions completeness and accuracy (Even and Shankaranarayanan 2007), and setting priorities for DQIs based on an analysis of the association between input and output accuracy (Gelman 2010). Within the approach discussed by Ballou and Tayi (1989), resources are not directly allocated to DQIs, but to different datasets, depending on costs and the

effectiveness of available maintenance options. Although the authors admit that “in reality, of course, application of multiple procedures to the same dataset could be beneficial,” it is assumed that at the most one maintenance procedure can be applied to a dataset. Besides this, Ballou and Tayi (1999) propose an approach where the decision maker has to assess the change in utility in terms of the current quality, required quality, anticipated quality, priority of organizational activity, cost of DQ enhancement, and added value for each project, dataset, and relevant DQ dimension. Based on this information, the most valuable DQ enhancement projects can be identified. Apart from this, Even and Shankaranarayanan (2007) model the costs and utility of different configurations of DQ in data repositories in order to identify an economically optimal configuration of the dimensions completeness and accuracy. Based on this configuration, the economic performance of data repositories can be optimized. Additionally, Gelman (2010) identifies the impact of input data errors in order to rank the error effects. Input data with a higher negative effect in the case of an error earn a higher priority in terms of DQI. According to this prioritization, resources are allocated for quality improvement of the respective datasets.

In a nutshell, solely Even and Shankaranarayanan (2007) and Ballou and Tayi (1999) (indirectly) account for the multidimensionality of DQ and the interrelationships between the DQ dimensions; Even and Shankaranarayanan (2007) discuss the case of only two dimensions, namely, completeness and accuracy. Moreover, none of the existing approaches provides insights on an optimal execution sequence for the resulting DQIs although this is necessary against the background of possible interrelationships. Consequently, we develop a formal decision model that identifies the most effective DQIP by considering both selection and execution sequence of DQIs.

Determining an optimal DQIP

In this section, we describe the model’s main idea, introduce the model’s definitions and assumptions, and develop the formal decision model to determine an optimal DQIP. Finally, we present an approach to solve the model by exhaustive enumeration.

Model idea

The aim of the following decision model is to consider interrelationships between the DQ dimensions and, based on these interrelationships, identify the optimal selection and execution sequence of DQIs. With regard to the interrelationships between DQ dimensions, the decision model examines the impacts that the particular DQIs have on different DQ dimensions, rather than measuring these interrelationships directly.

Fig. 1 Impacts on DQ dimensions caused by DQIs

As illustrated in Fig. 1, the interrelationships between DQ dimensions usually result from different impacts that the DQIs have on several DQ dimensions. Thereby, a specific DQI can have a positive impact on one or more DQ dimensions, while having a negative impact on other DQ dimensions simultaneously. For example, in the case of the three DQ dimensions timeliness, completeness, and consistency, there can be trade-offs in a certain situation. For example, timeliness can be affected while attempting to achieve completeness and consistency; “having complete or consistent data may need checks and activities, that require time, and thus timeliness is negatively affected” (Batini and Scannapieco 2006). Further, a trade-off can exist between consistency and completeness; the more complete the data and the more data there are, the greater the probability of inconsistencies among these data. Thereby, DQIs designed to improve one DQ dimension would possibly cause a worsening of the other (Batini and Scannapieco 2006). When it is known which types

of conflicting (or analogously complementary) impacts DQIs can have on different DQ dimensions, we consider the interrelationships between the DQ dimensions where they arise. Thus, we consider the interrelationships that result from conducting DQIs that positively or negatively impact different DQ dimensions at the same time. Therefore, the interrelationships do not have to be determined explicitly, which is an advantage as complex estimations of interrelationships between DQ dimensions are now obsolete.

In Fig. 1, the impacts of the three exemplary DQIs filtering data (supposed to improve timeliness), buying additional external data (supposed to improve completeness), and consistency checks (supposed to improve consistency) are shown as tendencies using ↑ for positive, → for neutral, and ↓ for negative impact direction in accordance with Ballou and Pazer (2003) as well as Batini and Scannapieco (2006).

To identify the optimal execution sequence of the DQIs, we point to the next section, where an exemplary calculation based on the mathematical impact modalities shows that the DQI execution sequence matters.

As a result, DQIs have to be combined in an economically reasonable manner, for example, based on costs and the effectiveness of DQIPs as proposed in this paper. Therefore, an available budget is allocated to the most effective DQIP composed of a specific selection and execution sequence of DQIs. We describe effectiveness of a DQIP as the extent to which the gap between the recent and required DQ level is bridged by applying a DQIP. In our context, bridging this gap by improving DQ implies augmenting the proportion of quality-assured data in a dataset with respect to several DQ dimensions. In other words, by applying a DQIP the data that are not quality-assured (e.g., not timely, incomplete, inconsistent) are to be transformed into quality-assured data (e.g., timely, complete, consistent).

Based on these ideas, we now develop a formal decision model.

Development of the decision model

In the following, we put forth the most effective DQIP from the available DQIs within a given budget, while considering the interrelationships between the DQ dimensions. The following definitions and assumptions form the basis of the subsequent discussion.

[A.1] A DQ dimension $d_i$ (with $i = 1, \dots, n$) is impacted by the application of DQIs $v_j$ (with $j = 1, \dots, m$) with given impacts $I_{d_i,v_j}$ (with $I_{d_i,v_j} \in [-1, +1]$). Therefore, impact $I_{d_i,v_j} = -1$ represents a maximum negative impact and impact $I_{d_i,v_j} = +1$ represents a maximum positive impact on a DQ dimension $d_i$.
[A.2] A DQIP comprises a selection out of $m$ DQIs $v_j$ (with $j = 1, \dots, m$) and considers the execution sequence of the DQIs. Within a DQIP, all DQIs $v_j$ are applied sequentially and can each only be applied once within the examined period. Therefore, a DQI $v_j$ is either part of a DQIP or not.
[A.3] A DQIP is applied on one dataset with a fixed size within one period. That is, a DQIP starts at the beginning of a period and is completed at the end of the same period.
[A.4] The position of a DQI $v_j$ within an execution sequence is $p$ (with $p = 1, \dots, m$). Therefore, a DQI $v_j$ applied on position $p$ within a DQIP is written as $v_j^p$.
[A.5] Both initial and resulting DQ levels are expressed as a proportion of quality-assured data in a dataset. The DQ level for DQ dimension $d_i$, after applying a DQI $v_j$ on position $p$ within a DQIP, is $Q^{p}_{d_i,v_j}$ (with $Q^{p}_{d_i,v_j} \in [0, 1]$). The given initial DQ level of a DQ dimension $d_i$ is $Q^{p}_{d_i}$ (with $p = 0$ and $Q^{p}_{d_i} \in [0, 1]$).
[A.6] A DQI $v_j$ causes fixed costs $c_j$ (with $c_j \in \mathbb{R}^{+}$).
[A.7] In order to realize a DQIP, a budget $B$ (with $B \in \mathbb{R}^{+}$) is available. This budget is allocated to the most effective DQIP, considering the selection and execution sequence of DQIs.
[A.8] In the case of a negative impact (i.e., if $I_{d_i,v_j} \in [-1, 0)$), the DQ level on position $p$, $Q^{p}_{d_i}$, results from a relative reduction of the quality-assured part of the dataset on position $p - 1$, $Q^{p-1}_{d_i}$. In the case of a positive or no impact $I_{d_i,v_j}$ (i.e., if $I_{d_i,v_j} \in [0, 1]$), the DQ level $Q^{p}_{d_i}$ results from a relative increase of the part of the dataset that is not quality-assured, $1 - Q^{p-1}_{d_i}$. The quality reduction or increase caused by a DQI $v_j^p$ on a DQ dimension $d_i$ follows from multiplying impact $I_{d_i,v_j^p}$ with the respective (quality-assured or not quality-assured) DQ level ($Q^{p-1}_{d_i}$ or $1 - Q^{p-1}_{d_i}$). This assumption is exemplarily visualized in Fig. 2.

Fig. 2 Calculation of positive and negative changes in DQ

As a consequence, the execution sequence of DQIs $v_j$ matters, because when a DQI is first applied with a positive impact, it increases the quality-assured part of a dataset and therefore offers a bigger target to a following negative impact,

thereby worsening the quality-assured data. On the contrary, when a DQI is first applied with a negative impact, it reduces the quality-assured part of a dataset, and therefore forms a bigger basis for a subsequent positive impact to augment the data that is not quality-assured. Hence, depending on the sign of the impacts, different execution sequences lead to different results, which is observed in Figures 3 and 4, where three DQIs $v_1$ (filtering data), $v_2$ (additional external data), and $v_3$ (consistency checks) with their respective impacts on three DQ dimensions $d_1$ (timeliness), $d_2$ (completeness), and $d_3$ (consistency) are applied in two exemplary sequences ($v_1 \rightarrow v_2 \rightarrow v_3$ and $v_3 \rightarrow v_2 \rightarrow v_1$). As it can be seen, the results of the two sequences diverge. Thereby, even a small percentage difference can matter in the result in terms of monetary implications. In the given example, there is a difference of 20 percentage points in the results of the two sequences.

Fig. 3 Comparison of DQI execution sequences: sequence $v_1 \rightarrow v_2 \rightarrow v_3$

The position on which a DQI $v_j$ is placed within an execution sequence is expressed by the index $p$ (with $p = 1, \dots, m$), as defined in assumption A.4. Consequently, if considering an execution sequence of DQIs $v_j$, and thus for impacts $I_{d_i,v_j}$, the variables for the DQIs are written as $v_j^p$ and $I_{d_i,v_j^p}$. The optimization calculus determines which DQI $v_j$ finally corresponds to which position $p$. The absolute effect of a DQI $v_j^p$ on a DQ dimension $d_i$ follows from multiplying impact $I_{d_i,v_j^p}$ by the respective (quality-assured or not quality-assured) DQ level ($Q^{p-1}_{d_i}$ or $1 - Q^{p-1}_{d_i}$).

In order to allocate a given budget in an economically reasonable manner to an available set of DQIs, all possible DQIPs need to be compared to each other. As a comparison criterion, we calculate the respective effectiveness $E^{p=m}$ of a DQIP. That is, we calculate the extent to which the gap between the initial and the perfect DQ level is bridged by applying a DQIP, which is a vivid performance indicator in this context. In order to establish the effectiveness $E^{p=m}$ of a DQIP, the actual improvement has to be set in relation to the highest possible improvement. The actual improvement is the difference between the attained DQ level, $\sum_{i=1}^{n} w_{d_i} \cdot Q^{m}_{d_i}$, and the initial DQ level, $\sum_{i=1}^{n} w_{d_i} \cdot Q^{p=0}_{d_i}$. Similarly, the theoretically highest possible improvement is the difference between the theoretically maximum DQ level, which is 1, and the initial DQ level, $\sum_{i=1}^{n} w_{d_i} \cdot Q^{p=0}_{d_i}$. Hence, the objective function maximizes the effectiveness of a DQIP in the following manner:

$$\text{Maximize} \quad E^{p=m} = \frac{\sum_{i=1}^{n} w_{d_i} \cdot \left( Q^{m}_{d_i} - Q^{p=0}_{d_i} \right)}{1 - \sum_{i=1}^{n} w_{d_i} \cdot Q^{p=0}_{d_i}} \quad \text{subject to} \quad \sum_{j=1}^{m} c_{v_j} \cdot x_{v_j} \le B$$

A detailed derivation of the objective function is given in the appendix. To sum up, maximizing the effectiveness $E^{p=m}$ of a DQIP depends on the right selection of DQIs $v_j^p$ and right execution sequence, expressed in terms of which $v_j$ corresponds to which $v_j^p$. In doing so, a given budget is allocated on an available set of DQIs and the most effective DQI execution sequence is identified while taking into account the interrelationships between the DQ dimensions.

Model solution

The objective function states a non-linear, binary optimization problem with case distinction. An appropriate methodology is required to identify the optimal improvement plan comprising both choice and sequence. Consequently, in order to solve the decision model, we considered several relevant combinatorial methods such



as the traveling salesman problem or the knapsack problem. As none of these methods is capable of optimizing both choice and sequence, especially with respect to an unknown starting point, we argue that an enumeration is the most appropriate method to solve the decision model and apply it in a three-step approach. In the first step, all possible DQI selections are enumerated. However, both the selections of DQIs and all possible DQI execution sequences define a specific DQIP. Therefore, knowing all DQI selections, we identify a dominant execution sequence of the DQIs with respect to their effectiveness in the second step. In the third step, the DQIPs, including their dominant DQI execution sequences, are compared to each other. The resulting decision is the most effective DQIP that keeps within the budget.

Fig. 4 Comparison of DQI execution sequences: sequence $v_3 \rightarrow v_2 \rightarrow v_1$

The complexity of this enumeration rises exponentially with the number of available DQIs. While a small number, such as 4 DQIs, causes 65 calculations of the objective function, 10 DQIs cause approximately 10 million calculations, and 20 DQIs cause approximately $6 \times 10^{18}$ calculations of the objective function. However, in real-world settings, since a small number of possible DQIs is to be expected, enumeration is a feasible method to solve the decision model at hand. Moreover, the complexity can be reduced if the DQI selections that are not within the budget are excluded from the outset. DQI selections can also be excluded if they are, according to an expert, not technically feasible.

Evaluation on the basis of a Monte Carlo simulation

In this section, we analyze the decision model with respect to its quality (Hevner et al. 2004). Here, we test the decision model’s quality in terms of its robustness by conducting an experimental evaluation based on a Monte Carlo simulation. The Monte Carlo simulation can be used to evaluate a deterministic model by using pseudo-randomly generated numbers as parameter inputs (Fishman 1996). For simulating a repeated sampling, a large number of input parameter settings are created. Based on these inputs, a deterministic computation is performed and the results are aggregated to determine the properties of the model’s behavior (Sawilowsky and Fahoome 2002). This approach yields results within reasonable time based on the law of large numbers and has been used in a similar manner in previous research (Fridgen and Müller 2011; Gelman 2012). With the evaluation, we intend to show that the decision model responds plausibly to input parameter variations. In addition, we derive implications for the operationalization of the decision model. Therefore, we first describe the analysis procedure and provide a selection of visualizations appropriate for understanding the model’s behavior. Then, a short summary of the key findings of the analysis is given.

Simulation and analysis procedure

We use artificial data to simulate a repeated sampling and pseudo-randomly create 1000 input parameter settings, assuming specific distributions as presented in Table 1. Thereby, we choose an equal distribution for parameters defined within an interval, and a Gaussian distribution for parameters where a deviation around a mean value is more realistic. In our case, this is reasonable as impacts can in fact take values over their complete range of values and, in contrast, costs vary around a certain mean with a corresponding variance. For the simulation, we consider three DQ dimensions $d_i$ and four DQIs $v_j$. Hence, there are three initial DQ levels

$Q^{p=0}_{d_i}$, three weights $w_{d_i}$, four DQI costs $c_{v_j}$, and twelve impacts $I_{d_i,v_j}$, altogether constituting one input parameter setting. Based on each of the 1000 input parameter settings, we compute the above developed decision model. That is, we determine each optimal DQIP and calculate its effectiveness $E^{p=m}$ and its costs $\sum_{j=1}^{m} c_{v_j} \cdot x_{v_j}$.

Table 1 Input parameter settings

Parameter                            Range (a)                 Distribution
Impacts, $I_{d_i,v_j}$               [−1, +1]                  Equal
Initial DQ level, $Q^{p=0}_{d_i}$    [0, 1]                    Equal
DQI costs, $c_{v_j}$                 (μ, σ) = (1000, 20)3      Gaussian
Budget, $B$                          (μ, σ) = (3000, 60)3      Gaussian
Weights, $w_{d_i}$                   [0, 1]                    Equal

(a) When the costs $c_{v_j}$ and the budget $B$ had different orders of magnitude, the decision model derived identical results.

In order to derive findings regarding the decision model’s behavior based on the law of large numbers, we aggregate the 1000 results to a mean effectiveness ($\bar{E}$) and mean costs ($\bar{C}$). In addition, we also vary other input parameters¹ (i.e., impacts $I_{d_i,v_j}$, budget $B$, DQI costs $c_{v_j}$, initial DQ level $Q^{p=0}_{d_i}$, and a single impact $I_{d_i,v_j}$ (e.g., $I_{d_1,v_1}$)) by +/−1%, +/−5%, +/−10%, and +/−25%, ceteris paribus. In doing so, we study the decision model’s sensitivity and are able to compare the respective behaviors of the decision model. As some of these different parameter variations lead to similar implications and for reasons of clarity, we present an appropriate selection of these visualizations and derive feasible implications.

¹ Weights, $w_{d_i}$, are not varied, as this would not provide adequate informative value.

Visualization and analysis

In this section, we provide a discussion on chosen visualizations of the analysis results. As in sum the entire analysis is grounded on more than 1.5 million executions of the simulation, we identified specific visualizations to be appropriate for understanding the model’s behavior. These visualizations are coordinate systems showing the effectiveness $\bar{E}$ as a function of either the initial DQ level $Q^{p=0}_{d_i}$ or the exemplary single impact $I_{d_1,v_1}$.

In the illustrations in Figure 5, we show the behavior of the mean effectiveness $\bar{E}$ as a function of the initial DQ level $Q^{p=0}_{d_i}$. In this manner, we can assess that the value of $\bar{E}$ is descending with ascending values of $Q^{p=0}_{d_i}$. The reason is that the higher the initial DQ levels already are, the fewer DQIs come into consideration to be part of the optimal DQIP (also see Figure 6). Additionally in Figure 5, we vary different input parameters, ceteris paribus. At first, it is obvious that even a +/−25% variation in the budget $B$ and costs $c_{v_j}$ does not cause a proportional change in mean effectiveness $\bar{E}$. Numerically, a +/−25% variation in both $B$ and $c_{v_j}$ causes barely an average +/−1% deviation of $\bar{E}$. Thus, it seems reasonable to assume that an estimation error for budget and costs would virtually cause no damage.

As argued above, the higher the initial DQ levels already are, the fewer DQIs come into consideration to be part of the optimal DQIP. This becomes apparent in Figure 6. The reason is that negative impacts have a stronger influence on DQ for higher initial DQ levels because there is more potential to cause damage. For higher initial DQ levels, DQIs that have negative impacts on one or more DQ dimensions are therefore more likely to be sorted out by the decision model. As Figure 6 takes an average view on Monte Carlo simulation results, it shows non-integer values for the number of DQIs.

In contrast and as illustrated in Figure 7, a variation in all impacts $I_{d_i,v_j}$ causes an approximately proportional change in $\bar{E}$, for example, a +/−10% variation in all impacts proportionally causes an average +/−10% deviation² in $\bar{E}$. Similarly, a +/−25% variation in all impacts causes an approximately proportional deviation between +22% and −26%, respectively. Therefore, an estimation error for impacts, rather than for budget or costs, would lead to wrong results.

² This proportionality holds for a +/−1% or +/−5% variation too.

Additionally, a wrong estimation of impacts would be particularly problematic, as it would also implicitly be a wrong estimation of the interrelationships between DQ dimensions. Apart from that, in Figure 7 we see the behavior of the mean effectiveness $\bar{E}$ as a function of one single impact, ceteris paribus. Beside the similarly proportional deviation when varying all other impacts, it is apparent that the DQI containing this single impact is most likely not a part of the optimal DQIP as long as it is negative. The more positive this impact becomes, the more likely is the application of the respective DQI within the optimal DQIP and hence its benefit to the optimal DQIP.

Results of the analysis and practical implications

The Monte Carlo simulations reveal the decision model’s quality for DQM in terms of its robustness as well as implications for its operationalization. With regard to the decision model’s robustness, in worst cases, we show that the deviations in effectiveness and costs of the resulting DQIPs are proportional to the strength of the input parameter variation. In addition, we illustrate that the decision model’s behavior is always as expected, technically reasonable, and comprehensible. That is, the model can be said to be robust.
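The evaluation procedure described above (sampling input parameter settings, enumerating all DQIPs, and averaging the effectiveness of the optimal plans) can be sketched in Python. This is a minimal illustration under stated assumptions, not the author's implementation: all function names are hypothetical, the weights are normalised so that they sum to 1 (which the maximum DQ level of 1 suggests but the text does not state), the Table 1 entries are read as standard deviations of 20 and 60, and the empty plan is counted as a candidate, which reproduces the 65 objective evaluations quoted for four DQIs.

```python
import itertools
import random


def apply_impact(level, impact):
    """Update one DQ level as described in assumption A.8."""
    if impact < 0:
        # Negative impact: relative reduction of the quality-assured share.
        return level + impact * level
    # Positive or zero impact: relative gain on the share not yet quality-assured.
    return level + impact * (1.0 - level)


def effectiveness(q0, weights, impacts, plan):
    """Share of the gap to a perfect DQ level that `plan` closes.

    impacts[j][i] is the impact of DQI j on dimension i; `plan` is a tuple
    of DQI indices in execution order.
    """
    levels = list(q0)
    for j in plan:
        levels = [apply_impact(q, impacts[j][i]) for i, q in enumerate(levels)]
    start = sum(w * q for w, q in zip(weights, q0))
    end = sum(w * q for w, q in zip(weights, levels))
    return (end - start) / (1.0 - start)


def all_plans(m):
    """Yield every ordered selection of 0..m DQIs, including the empty plan."""
    for size in range(m + 1):
        yield from itertools.permutations(range(m), size)


def best_plan(q0, weights, impacts, costs, budget):
    """Exhaustive enumeration: most effective plan whose costs fit the budget."""
    best, best_value = (), 0.0  # the empty plan has effectiveness 0
    for plan in all_plans(len(costs)):
        if sum(costs[j] for j in plan) > budget:
            continue
        value = effectiveness(q0, weights, impacts, plan)
        if value > best_value:
            best, best_value = plan, value
    return best, best_value


def sample_setting(rng, n_dims=3, n_dqis=4):
    """Draw one input parameter setting (distributions as read from Table 1)."""
    raw = [rng.random() for _ in range(n_dims)]
    weights = [r / sum(raw) for r in raw]  # assumption: weights sum to 1
    q0 = [rng.random() for _ in range(n_dims)]
    impacts = [[rng.uniform(-1.0, 1.0) for _ in range(n_dims)]
               for _ in range(n_dqis)]
    costs = [rng.gauss(1000.0, 20.0) for _ in range(n_dqis)]  # assumed sigma
    budget = rng.gauss(3000.0, 60.0)                           # assumed sigma
    return q0, weights, impacts, costs, budget


def mean_effectiveness(runs=1000, seed=0):
    """Average effectiveness of the optimal plan over `runs` random settings."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(runs):
        _, value = best_plan(*sample_setting(rng))
        total += value
    return total / runs


if __name__ == "__main__":
    print(f"mean effectiveness: {mean_effectiveness(runs=200):.3f}")
```

A sensitivity study in the spirit of the text would rerun `mean_effectiveness` after scaling one group of sampled parameters (for example all impacts) by a fixed percentage and compare the resulting means. The brute-force loop is only practical for a handful of DQIs, in line with the growth in evaluations noted in the model solution.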

Fig. 5 Variation of the input parameters budget and costs

The scientific key finding of the paper and its contribution to DQ research is that the execution sequence of DQIs should be considered in the approaches that improve DQ. Although literature is aware of the necessity to consider the interrelationships between different DQ dimensions, it does not consider the execution sequence of DQIs yet. Based on the analysis presented here, we can offer advice on certain operationalization issues of the decision model:

(1) An estimation error for budget or costs is insubstantial in terms of the effectiveness of the optimal DQIP. Therefore, considering the chosen modeling technique as a basis, operationalization efforts should rather focus on a realistic estimation of the impacts, as they are the most sensitive toward the effectiveness of the optimal DQIP. The importance of this issue is aggravated by the fact that an estimation of the impacts is implicitly also an estimation of the interrelationships between the DQ dimensions, which are essential for realistic results. Moreover, an exact estimation of positive impacts is more important than negative impacts, as DQIs containing positive impacts are more likely to be part of the optimal DQIP.
(2) Higher initial DQ levels imply that more DQIs will be reasonably discarded by the decision model, as their application would not lead to a substantial increase in effectiveness. DQIs inherently contain positive and negative impacts on different DQ dimensions. Lower initial DQ levels imply that more DQIs containing negative impacts will be considered for the optimal DQIP, as in such cases their potential to cause damage is limited. However, as DQIs with negative impacts can only contribute to an optimal DQIP when applied together with DQIs having a positive impact, it is therefore particularly important to consider an economics-oriented DQI execution sequence.

Moreover, by providing the ability to identify an optimal DQIP, the decision model can be useful for practitioners in different ways:

(a) The DQIP reveals the most effective choice out of all
possible DQIs and thereby implicitly considers interde-
pendencies between different DQ dimensions. In this
manner, the DQIP helps practitioners in using a given
budget efficiently.
(b) Within the most effective choice of DQIs, the DQIP con-
siders path dependencies, that is, an optimal execution
sequence of the chosen DQIs can be identified. This
information is necessary for DQ project management.
(c) The example in this paper already enables practitioners to learn about possible DQIs and interdependencies in the context of the most common DQ dimensions, namely timeliness, completeness, and consistency. This forms the basis for additionally applying the DQIP to other context- or company-specific DQ dimensions.
(d) With the help of the model leading to an optimal DQIP, decision makers become aware of the complex problem context at hand, such as the multidimensionality of DQ or the interrelationships between different DQ dimensions, and learn which information is necessary for their decisions. In this manner, decision makers are better able to make valid managerial decisions in the context of DQ.

Fig. 6 Number of DQIs in a DQIP as a function of the initial DQ level

Fig. 7 Variation of impacts

Summary, limitations, and further research

In this paper, we developed a formal decision model that determines an economically optimal DQIP. Using this model, we determine the best DQI selection and DQI execution sequence with respect to the maximum effectiveness of the DQIP, keeping within a given budget. In doing so, the decision model considers interrelationships between the DQ dimensions in terms of complementary and concurring impacts that DQIs usually have on the different DQ dimensions. Thus, the model provides an integrated view of different DQ dimensions. In order to demonstrate the decision model's quality in terms of robustness, we conducted an analysis based on a Monte Carlo simulation. Apart from demonstrating the robustness of the decision model, we derived practical implications for DQM. For example, a primary priority is to accurately determine the impacts, and therefore implicitly determine the interrelationships between the DQ dimensions. For researchers, the paper depicts the so far unattended connection between existing interrelationships between DQ dimensions and the necessity to consider a DQI execution sequence in approaches to improve DQ.

Our findings are beset with some limitations, particularly in terms of simplifying assumptions. As applying a certain measure multiple times within an improvement plan could be possible in practice, confining the solution to scenarios where each DQI is applied only once can be seen as a limitation. However, we consider this assumption not to be too restrictive, as the effect of applying one measure twice or even multiple times can only have a marginal impact on the overall effectiveness of an improvement plan due to the diminishing marginal benefits. Hence, the model in its current version focuses on choosing measures in terms of either applying or not applying them. The process of adding an ability to choose the optimal number of applications of one certain measure within an improvement plan is lengthy, but would lead to slightly more realistic results. Further research could thus address the identification of data quality measures that provide a higher positive impact if applied multiple times within one time period. Another limitation of our model lies in the assumption that the relative impacts of a certain data quality measure are the same for different positions within an improvement plan (e.g., updating data before or after applying another improvement does not change the relative impact of the measure). Even if this assumption may not fully reflect reality, we consider it to be realistic to a high extent, as positive impacts only affect the share of the data that has not been subjected to quality assurance and negative impacts only affect the remaining part of the data with regard to a certain DQ dimension. Consequently, the shares that are affected by impacts change and therefore the absolute effect of a DQ measure does depend on its position within a certain DQIP. However, theoretically, there is no reason why the relative effect of a measure to improve completeness, for instance, should differ according to the number of data records to which it is applied, i.e., it should be the same regardless of whether it is applied to 10 incomplete data records or a million incomplete data records. Finally, using synthetic data allows tailoring properties
in order to meet various conditions that are not easily available in reality (Barse et al. 2003), such as the used impacts. The use of synthetic data in the Monte Carlo simulation is thus a limitation of the paper, and an evaluation in a real-world context should be addressed in further research. Thus, these impacts could be determined in detail by conducting before-after comparisons of the DQ dimensions when applying particular DQIs. Generally, several case studies should be conducted to further evaluate the decision model and its effects in real-world applications. However, we already derived recommendations that should be taken into account in DQM. Therefore, and despite these pending tasks, we believe that the developed decision model constitutes a sound contribution to DQM and an inspiration to researchers, as it particularly constitutes a first step toward considering the DQI execution sequence and interdependencies when determining an optimal DQI selection.

Acknowledgments Supportive inputs and helpful comments by Dr. Quirin Görz on an earlier version of this paper are gratefully acknowledged.

Appendix

Supplementing the chapter "Development of the Decision Model", a more detailed derivation of the objective function is given in the following.

The decision to consider a DQI $v_j^p$ in a DQIP is described formally using the decision variable $x_{v_j^p}$, whereas the binary variable $x_{v_j^p}$ is equal to 1 if DQI $v_j^p$ is part of the DQIP, and 0 if not. Thus, the resulting DQ level $Q_{d_i,v_j}^{p}$ of DQ dimension $d_i$, after applying DQI $v_j^p$, is calculated in the following manner:

$$Q_{d_i,v_j}^{p} = Q_{d_i,v_j}^{p-1} + \begin{cases} \left(1 - Q_{d_i,v_j}^{p-1}\right) \cdot I_{d_i,v_j^p} \cdot x_{v_j^p} & \text{if } I_{d_i,v_j^p} \geq 0 \\ Q_{d_i,v_j}^{p-1} \cdot I_{d_i,v_j^p} \cdot x_{v_j^p} & \text{if } I_{d_i,v_j^p} < 0 \end{cases}$$

After remodelling and, for mathematical reasons, introducing the substitute variables $f_{d_i,v_j^p}$ and $s_{d_i,v_j^p}$, the resulting DQ level $Q_{d_i,v_j}^{p}$ can also be written as

$$Q_{d_i,v_j}^{p} = Q_{d_i,v_j}^{p-1} \cdot f_{d_i,v_j^p} + s_{d_i,v_j^p}, \quad \text{with } f_{d_i,v_j^p} = \begin{cases} 1 - I_{d_i,v_j^p} \cdot x_{v_j^p} & \text{if } I_{d_i,v_j^p} \geq 0 \\ 1 + I_{d_i,v_j^p} \cdot x_{v_j^p} & \text{if } I_{d_i,v_j^p} < 0 \end{cases} \text{ and } s_{d_i,v_j^p} = \begin{cases} I_{d_i,v_j^p} \cdot x_{v_j^p} & \text{if } I_{d_i,v_j^p} \geq 0 \\ 0 & \text{if } I_{d_i,v_j^p} < 0 \end{cases}$$
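As a minimal sketch of the two formulations above, the following Python snippet implements the piecewise update and the variant with the substitute variables f and s, and checks that they agree. The function names and the sample values are illustrative assumptions, not part of the paper.

```python
def step_piecewise(q_prev: float, impact: float, x: int) -> float:
    """Piecewise form: a positive impact acts on the share (1 - Q) that is
    still deficient, a negative impact acts on the share Q that is intact."""
    if impact >= 0:
        return q_prev + (1.0 - q_prev) * impact * x
    return q_prev + q_prev * impact * x


def step_substituted(q_prev: float, impact: float, x: int) -> float:
    """Equivalent form Q^p = Q^(p-1) * f + s with substitute variables."""
    if impact >= 0:
        f, s = 1.0 - impact * x, impact * x
    else:
        f, s = 1.0 + impact * x, 0.0
    return q_prev * f + s


if __name__ == "__main__":
    # Hypothetical sample values: x = 0 means the DQI is not part of the plan.
    for q in (0.0, 0.4, 0.9):
        for impact in (-0.3, 0.0, 0.5):
            for x in (0, 1):
                a = step_piecewise(q, impact, x)
                b = step_substituted(q, impact, x)
                assert abs(a - b) < 1e-12
    print("both formulations agree on all sample values")
```

With x = 0 both functions return the previous DQ level unchanged, which is the neutralization effect of the binary decision variable.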

Based on this, the DQ level $Q_{d_i}^{m}$ for dimension $d_i$, after applying a complete DQIP for m DQIs $v_j^p$, is calculated in the following manner:

$$Q_{d_i}^{m} = \left( \ldots \left( \left( Q_{d_i}^{p=0} \cdot f_{d_i,v_j^{p=1}} + s_{d_i,v_j^{p=1}} \right) \cdot f_{d_i,v_j^{p=2}} + s_{d_i,v_j^{p=2}} \right) \cdot \ldots \cdot f_{d_i,v_j^{p=m-1}} + s_{d_i,v_j^{p=m-1}} \right) \cdot f_{d_i,v_j^{p=m}} + s_{d_i,v_j^{p=m}}$$
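The nested expression can be read as a simple loop over the positions p = 1, ..., m. The sketch below, which uses hypothetical impact values on a single DQ dimension and assumes all listed DQIs are selected (x = 1), illustrates why the execution sequence matters once positive and negative impacts are mixed.

```python
from typing import Sequence


def apply_plan(q0: float, impacts: Sequence[float]) -> float:
    """Apply the impacts on one DQ dimension in the given order (all x = 1)."""
    q = q0
    for impact in impacts:
        if impact >= 0:
            f, s = 1.0 - impact, impact
        else:
            f, s = 1.0 + impact, 0.0
        q = q * f + s
    return q


if __name__ == "__main__":
    q0 = 0.4                   # hypothetical initial DQ level
    improve, harm = 0.5, -0.2  # hypothetical impacts of two DQIs
    print("improve first:", apply_plan(q0, [improve, harm]))
    print("harm first:   ", apply_plan(q0, [harm, improve]))
```

For these values the resulting DQ level is about 0.56 when the improving DQI comes first and about 0.66 when it comes last, because the negative impact then acts on a smaller intact share. Two impacts with the same sign yield the same level in either order under this update rule, so the sequence only becomes relevant when signs differ.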

The binary variables $x_{v_j^p}$, which are part of $f_{d_i,v_j^p}$ and $s_{d_i,v_j^p}$, neutralize the effect of DQI $v_j^p$ by taking the value 0 if a specific DQI $v_j^p$ is not part of the DQIP. Although term (3) contains all m DQIs $v_j^p$, a DQI selection is possible through this neutralization. Knowing the DQ level $Q_{d_i}^{m}$ for each DQ dimension $d_i$, the overall DQ level can be calculated. As there can be context-dependent differences between the DQ dimensions, an allocation of weights to different DQ dimensions is reasonable. Therefore, we use a weight $w_{d_i}$ (with $\sum_{i=1}^{n} w_{d_i} = 1$) to weight the DQ dimensions in our decision model:

$$\sum_{i=1}^{n} w_{d_i} \cdot Q_{d_i}^{m}$$

In contrast to former approaches that have been criticized in literature for not considering interrelationships when aggregating DQ dimensions to an overall DQ level, in this decision model, interrelationships between the DQ dimensions $d_i$ are implicitly considered by the impacts $I_{d_i,v_j^p}$.

Since all DQIs are applied on the same dataset, which has a fixed size, and a DQIP is realized within one period (cf. A.3), the costs $c_{v_j}$ for applying DQI $v_j$ are fixed. As a result, the overall DQIP costs are

$$\sum_{j=1}^{m} c_{v_j} \cdot x_{v_j}.$$
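To make the interplay of sequential DQ levels, weights, and costs concrete, the following sketch enumerates every selection and execution sequence of a small set of DQIs and keeps the most effective plan whose costs stay within a budget. It is an exhaustive search chosen for readability, not the solution method used in the paper. Effectiveness is computed as the weighted DQ gain divided by the maximum possible weighted gain, and the sketch assumes the initial weighted DQ level is below 1. All DQI names, impacts, costs, weights, and the budget are hypothetical.

```python
from itertools import combinations, permutations
from typing import Dict, List, Sequence, Tuple

# Hypothetical example data for two DQ dimensions and three DQIs.
WEIGHTS: List[float] = [0.5, 0.5]   # weights of the DQ dimensions, sum to 1
Q0: List[float] = [0.4, 0.6]        # initial DQ levels per dimension
IMPACTS: Dict[str, List[float]] = {  # impact of each DQI per dimension
    "update": [0.5, -0.2],
    "dedupe": [-0.1, 0.4],
    "validate": [0.2, 0.2],
}
COSTS: Dict[str, float] = {"update": 40.0, "dedupe": 30.0, "validate": 50.0}


def apply_sequence(q0: Sequence[float], sequence: Sequence[str],
                   impacts: Dict[str, List[float]]) -> List[float]:
    """Return the DQ levels after applying the DQIs in the given order."""
    levels = list(q0)
    for dqi in sequence:
        for i, impact in enumerate(impacts[dqi]):
            if impact >= 0:
                levels[i] = levels[i] * (1.0 - impact) + impact
            else:
                levels[i] = levels[i] * (1.0 + impact)
    return levels


def effectiveness(q0: Sequence[float], qm: Sequence[float],
                  weights: Sequence[float]) -> float:
    """Weighted DQ gain relative to the maximum possible weighted gain."""
    gain = sum(w * (after - before) for w, before, after in zip(weights, q0, qm))
    headroom = 1.0 - sum(w * before for w, before in zip(weights, q0))
    return gain / headroom


def best_plan(q0: Sequence[float], weights: Sequence[float],
              impacts: Dict[str, List[float]], costs: Dict[str, float],
              budget: float) -> Tuple[Tuple[str, ...], float]:
    """Exhaustively search all affordable selections and sequences."""
    best_seq: Tuple[str, ...] = ()
    best_e = 0.0  # the empty plan leaves DQ unchanged
    names = list(impacts)
    for size in range(1, len(names) + 1):
        for subset in combinations(names, size):
            if sum(costs[name] for name in subset) > budget:
                continue  # selection violates the budget
            for seq in permutations(subset):
                e = effectiveness(q0, apply_sequence(q0, seq, impacts), weights)
                if e > best_e:
                    best_seq, best_e = seq, e
    return best_seq, best_e


if __name__ == "__main__":
    plan, e = best_plan(Q0, WEIGHTS, IMPACTS, COSTS, budget=80.0)
    print("plan:", plan, "effectiveness:", round(e, 4))
```

The search space grows factorially with the number of DQIs, so this kind of enumeration is only practical for small illustrative cases.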
According to assumption A.8, the costs of a DQIP must not exceed budget B; thus, the budget constraint for the decision model is

$$\sum_{j=1}^{m} c_{v_j} \cdot x_{v_j} \leq B$$

In order to allocate a given budget in an economically reasonable manner to an available set of DQIs, all possible DQIPs need to be compared to each other. As a comparison criterion, we calculate the respective effectiveness $E^{p=m}$ of a DQIP in the way described in the chapter "Development of the Decision Model". The objective function maximizes the effectiveness of a DQIP in the following manner:

$$\text{Maximize } E^{p=m} = \frac{\sum_{i=1}^{n} w_{d_i} \cdot \left( Q_{d_i}^{m} - Q_{d_i}^{p=0} \right)}{1 - \sum_{i=1}^{n} w_{d_i} \cdot Q_{d_i}^{p=0}} \quad \text{subject to } \sum_{j=1}^{m} c_{v_j} \cdot x_{v_j} \leq B$$

References

Ballou, D. P., & Pazer, H. L. (1995). Designing information systems to optimize the accuracy-timeliness tradeoff. Information Systems Research, 6(1), 51–72.
Ballou, D. P., & Pazer, H. L. (2003). Modeling completeness versus consistency tradeoffs in information decision contexts. IEEE Transactions on Knowledge and Data Engineering, 15(1), 240–243.
Ballou, D. P., & Tayi, G. K. (1989). Methodology for allocating resources for data quality enhancement. Communications of the ACM, 32(3), 320–329.
Ballou, D. P., & Tayi, G. K. (1999). Enhancing data quality in data warehouse environments. Communications of the ACM, 42(1), 73–78.
Ballou, D. P., Wang, R. Y., Pazer, H. L., & Tayi, G. K. (1998). Modeling information manufacturing systems to determine information product quality. Management Science, 44(4), 462–484.
Barse, E. L., Kvarnström, H., & Jonsson, E. (2003). Synthesizing test data for fraud detection systems. Proceedings of the 19th Annual Computer Security Applications Conference, Las Vegas, NV, (USA). 384–395.
Batini, C., & Scannapieco, M. (2006). Data quality. Concepts, methodologies and techniques (data-centric systems and applications) (1st ed.). Berlin: Springer.
De Amicis, F., Barone, D., & Batini, C. (2006). An analytical framework to analyze dependencies among data quality dimensions. Proceedings of the 11th International Conference on Information Quality, Cambridge, MA, (USA). 369–383.
Even, A., & Kaiser, M. (2009). A framework for economics-driven assessment of data quality decisions. Proceedings of the Fifteenth Americas Conference on Information Systems. San Francisco, California. Paper 436.
Even, A., & Shankaranarayanan, G. (2007). Utility-driven assessment of data quality. The DATA BASE for Advances in Information Systems, 38(2), 75–93.
Fisher, C. W., Chengalur-Smith, I. N., & Ballou, D. P. (2003). The impact of experience and time on the use of data quality information in decision making. Information Systems Research, 14(2), 170–188.
Fishman, G. S. (1996). Monte Carlo; concepts, algorithms, and applications. New York [u.a.]: Springer.
Forrester Research. (2011). Trends in data quality and business process alignment. Cambridge (USA).
Fridgen, G., & Müller, H. (2011). An approach for portfolio selection in multi-vendor IT outsourcing. Proceedings of the 32nd International Conference on Information Systems (ICIS), Shanghai, China.
Gackowski, Z. J. (2004). Logical interdependence of data/information quality dimensions—A purpose-focused view on IQ. Proceedings of the Ninth International Conference on Information Quality (ICIQ 2004), Cambridge, MA, (USA).
Gelman, I. A. (2010). Setting priorities for data accuracy improvements in satisficing decision-making scenarios: a guiding theory. Decision Support Systems, 48(4), 507–520.
Gelman, I. A. (2012). A model of error propagation in conjunctive decisions and its application to database quality management. Journal of Database Management, 23(1), 103–126.
Harris Interactive. (2006). Information workers beware: Your business data can't be trusted. Retrieved 10/13, 2008, from http://www.sap.com/about/newsroom/businessobjects/20060625_005028.epx
Heinrich, B., Kaiser, M., & Klier, M. (2007a). How to measure data quality? – a metric based approach. Proceedings of the 28th International Conference on Information Systems (ICIS), Montreal, (Canada).
Heinrich, B., Kaiser, M., & Klier, M. (2007b). Metrics for measuring data quality – foundations for an economic data quality management. 2nd International Conference on Software and Data Technologies (ICSOFT), Barcelona, (Spain).
Heinrich, B., Kaiser, M., & Klier, M. (2009). A procedure to develop metrics for currency and its application in CRM. ACM Journal of Data and Information Quality, 1(1), 5:1–5:28.
Heinrich, B., & Klier, M. (2011). Assessing data currency — a probabilistic approach. Journal of Information Science, 37(1), 86–100.
Helfert, M., Foley, O., Ge, M., & Cappiello, C. (2009). Limitations of weighted sum measures for information quality. San Francisco, CA, (USA).
Hevner, A. R., March, S. T., Park, J., & Ram, S. (2004). Design science in information systems research. Management Information Systems Quarterly, 28(1), 75–106.
Hüner, K. H., Schierning, A., Otto, B., & Österle, H. (2011). Product data quality in supply chains: the case of beiersdorf. Electronic Markets, 21, 141–154.
Jiang, Z., Sarkar, S., De, P., & Dey, D. (2007). A framework for reconciling attribute values from multiple data sources. Management Science, 53(12), 1946–1963.
Lee, Y. W., Strong, D. M., Kahn, B. K., & Wang, R. Y. (2002). AIMQ: a methodology for information quality assessment. Information & Management, 40(2), 133–146.
Orr, K. (1998). Data quality and systems theory. Communications of the ACM, 41(2), 66–71.
Parssian, A., Sarkar, S., & Jacob, V. S. (2004). Assessing data quality for information products: impact of selection, projection, and cartesian product. Management Science, 50(7), 967–982.
Pipino, L., Lee, Y. W., & Wang, R. Y. (2002). Data quality assessment. Communications of the ACM, 45(4), 211–218.
Provost, F., & Fawcett, T. (2013). Data science and its relationship to big data and data-driven decision making. Big Data, 1(1), 51–59.
Radant, O., Colomo-Palacios, R., & Stantchev, V. (2014). Analysis of reasons, implications and consequences of demographic change for IT departments in times of scarcity of talent: a systematic review. International Journal of Knowledge Management, 10(4), 1–15.
Redman, T. C. (2004). Data: An unfolding quality disaster. DM Review.
Russom, P. (2006). Taking data quality to the enterprise through data governance. Seattle: The Data Warehousing Institute.
Sawilowsky, S., & Fahoome, G. C. (2002). Statistics through Monte Carlo simulation with fortran. Rochester Hills: JMASM.
Shah, S., Horne, A., & Capellá, J. (2012). Good data won't guarantee good decisions. Harvard Business Review, 90(4), 23–25.
Vera-Baquero, A., Colomo-Palacios, R., Stantchev, V., & Molloy, O. (2015). Leveraging big-data for business process analytics. The Learning Organization, 22(4), 215–228.
Wand, Y., & Wang, R. Y. (1996). Anchoring data quality dimensions in ontological foundations. Communications of the ACM, 39(11), 86–95.
Wang, R. Y. (1998). A product perspective on total data quality management. Communications of the ACM, 41(2), 58–65.
Wang, R. Y., & Strong, D. M. (1996). Beyond accuracy: what data quality means to data consumers. Journal of Management Information Systems, 12(4), 5–33.
