Professional Documents
Culture Documents
Paper Icic Reza 2019
Paper Icic Reza 2019
Abstract— In the past few decades, the quality of data has become something important for managing data in an organization.
The primary objective of an organization to build data quality management includes improving the accuracy of the data, the
completeness of the data, the age of the data, and the reliability of the data. In developing quality management data, several strong
analysis steps are needed so that the tools can run well. Data taken from various applications in an organization is required to
maintain its quality. In general, data quality management has needs between tables. Star Schema as a model for representing
relationships between tables in the development of data monitoring, a star schema is needed to store the results of data that has
been tested from quality tools data. In this paper, we propose a star schema that is used as a basis for developing monitoring data
quality management tools that we have developed.
Keywords- monitoring, data quality, data quality management, dimensional modeling, star schema
I. INTRODUCTION
Organizations realize that the development of a quality information system has significant business benefits. Using data to
determine goals and decision making makes data as an essential component of information systems [ CITATION
Mac11 \l 1033 ]. Data governance as a part of corporate governance and information technology is defined as data
management techniques that produce data as organizational assets [ CITATION The09 \l 1033 ]. Data governance is a
set of processes, policies, standards, organizations, and technologies that ensure availability, accessibility, quality,
consistency, audit capability, and data security in an organization [ CITATION Pan10 \l 1033 ].
When an organization is seriously handling information system quality issues, they must manage the quality of data in their
organization [ CITATION Mac11 \l 1033 ]. Data quality plays a vital role in data governance, data quality as a way to build the
reputation of an organization, as a decision-making process, even operational process and transaction process. [ CITATION
Her07 \l 1033 ]. The primary purpose of the organization to handle data quality problems is to improve the accuracy of data,
data completeness, the age of data and data quality reliability. Managing data quality problems are one of the functions in data
governance, namely Data Quality Management (DQM)[ CITATION The09 \l 1033 ]. The DQM process occurs in several
stages starting from the scene of data production, data storage, data transfer, data sharing, and data usage [ CITATION Rou18 \l
1033 ]. DQM consists of several phases, namely data profiling, data cleansing, data quality assessment, and data quality
monitoring [ CITATION DAp15 \l 1033 ]. Data quality monitoring is also described as monitoring and ensuring data (static
and streaming) can adjust to business rules and can be used to determine the quality of data in an organization [ CITATION
Jud16 \l 1033 ].
In this study, a government agency in Indonesia. Some common mistakes in the data entry process, for example, are the
existence of empty data, the absence of standards in data formats, and the emergence of data redundancies that are still a
problem for the government agency. These errors also still cannot be monitored automatically by the application and if left
unchecked, these errors will become an ongoing problem because the data that is owned will continue to increase even more.
By looking at these conditions, a data quality monitoring application is needed to monitor data quality automatically. To
develop a data quality monitoring application, several steps are needed, including developing a star schema that can be used to
store data quality monitoring results. This study was made to design a star schema design for a data quality monitoring
application in the study case of the government agency.
Thus, this paper will implement a star schema to support data quality management tools to monitor the data quality result.
The systematics of writing this paper has five parts. Part 2 provides background on Data Quality Management, Database
Schema, and Dimensional Modeling. Part 3 introduces the method used for making monitoring on data quality management
tools. Part 4 monitoring process. Section 5 evaluation and discussion, and section 6 conclusions.
The case study used in this study is analyzing data quality monitoring for a single column. The dimension is to calculate
the number of empty columns in the data that have been tested with data profiling.
At Fig. 3 when the user wants to count the number of PASS_RECORD based on DQ_RULE_ID. The user must drag
DQ_RULE_ID to the left side of the page as shown in Fig. 3 and the user must change to SUM choice in the dropdown at the
left, and change to PASS_RECORD at the dropdown list at the bottom of SUM dropdown list.
Fig. 3. Mockup Show Null Sum of Pass Record by DQ_RULE_ID
Besides counting the total of PASS_RECORD by DQ_RULE_ID, users can also count the total PASS_RECORD by
DQ_TIME_ID as seen in Fig. 4. The user must drag DQ_TIME_ID to the left side of the page as shown in Fig. 4, and the user
must change to SUM choice in the dropdown at the left, and change to PASS_RECORD at the dropdown list at the bottom of
SUM dropdown list.
At Fig. 5 represents the results of calculating the number of PASS_RECORD based on DQ_FIELD_ID and
DQ_TIME_ID. The user must drag DQ_FIELD_ID and DQ_TIME_ID to the left side of the page as shown in Fig. 4, and the
user must change to SUM choice in the dropdown at the left, and change to PASS_RECORD at the dropdown list at the
In Table I are column names from the DQ_RESULT fact table which has seven columns, there are DQ_RESULT_ID,
DQ_TIME_ID, DQ_DIMENSION_ID, DQ_RULE_ID, DQ_FIELD_ID, PASS_RECORD, FAIL_RECORD. In the
DQ_TIME_ID column it has dimensional references in the DIM_DQ_TIME table, DQ_DIMENSION_ID column has a
reference to the DIM_DQ_DIMENSIONS dimensional table, DQ_RULE_ID towards the dimensional table reference in the
DIM_DQ_RULES table, and column DQ_FIELD_ID towards dimensional references in the DIM_DQ_FIELDS table, while
the pass_record column contains the number of columns that passed the test, otherwise the FAIL_RECORD column shows
the number of columns that did not pass the test.
Table II is a column representation of the DIM_DQ_TIME table, and there are seven columns namely DQ_TIME_ID as
this table's primary key, LOAD_TIMESTAMP, CALENDAR_DATE, DAY_NAME, MONTH_NAME, and YEAR_NUM.
In Table III is a representation of the columns owned by the DIM_DQ_DIMENSIONS table. These columns are
DQ_DIMENSION_ID as table primary key, DQ_DIMENSION_NAME, and SHORT_NAME.
DQ_DIMENSION_ID DQ DIMENSION ID
DQ_DIMENSION_NAM
The name of the data quality parameter
E
The short name of the data quality
SHORT_NAME
parameter
In Table IV is a set of column information in the table DIM_DQ_RULES. The columns consist of DQ_RULE_ID as the
primary key from the table, DQ_RULE_NAME, DQ_RULE_LOCATION, DQ_RULE_OWNER, and
DQ_RULE_DESCRIPTION.
Table V is a DIM_DQ_FIELDS table that has columns namely DQ_FIELD_ID as the primary key, DQ_ENTITY_ID
which has dimensional references with DQ_ENTITIES, and DQ FIELD_NAME.
Table VI represents the columns owned by the DIM_DQ_ENTITIES table consisting of DQ_ENTITY_ID as the primary
key in this table, STATUS, LOAD_TIMESTAMP, MODIFIED_DATE, and DQ_ENTITY_VALUE.
After the database creation phase is complete, then create the Show Null Monitoring page as illustrated in Fig. 8.
At Fig. 9 represents the Show Null Sum of Pass Record by DQ_RULE_ID page.
The process of calculating the number of PASS_RECORD based on time can be seen in the system as shown in Fig. 10.
At Fig. 11 processes for calculating PASS_RECORD data are based on two selection conditions DQ_FIELD_ID and
DQ_TIME_ID.
[1] K. Leonard, "The Role of Data in Business," Chron, 5 October 2018. [Online]. Available:
http://smallbusiness.chron.com/role-data-business-20405.html. [Accessed 13 February 2019].
[2] The Data Management Asociation, The DAMA Guide to The Data Management Body of Knowledge First Edition,
Bradley Beach: Technics Publications, LLC, 2009.
[3] Z. Panian, "Some Practical Experiences in Data Governance," World Academy of Science, Engineering and
Technology 62 2010, vol. 62, pp. 939-940, 2010.
[4] Herzog, Data Quality and Record Linkage Techniques, Springer, 2007.
[5] M. Rouse, "What is data quality?," Techtarget, May 2018. [Online]. Available:
http://searchdatamanagement.techtarget.com/definition/data-quality. [Accessed 20 February 2019].
[6] D. Apel, Datenqualität erfolgreich steuern: Praxislösungen für Business-Intelligence-Projekte, Heidelberg:
dpunkt.verlag, 2015.
[7] S. Judah, M. Y. Selvage and A. Jain, "Magic Quadrant for Data Quality Tools," GartnerResearch, 2016.
[8] L. Cai, "The Challenges of Data Quality and Data Quality Assessment in the Big Data Era," Data Science Journal,
pp. 1294-1297, 2015.
[9] R. Y. Wang, M. S. D., P. L. and L. L. Y., Manage Your Information as Product: The Keystone to Quality
Information, MIT Sloan School of Management, 1997.
[10] Z. Abedjan, L. Golab and F. Nauman, "Data Profiling," in 2016 IEEE 32nd International Conference on Data
Engineering (ICDE), 2016.
[11] Dataflux Corporation, "Data Profilling : The diagnosis for better enterprise information," pp. 2-10, 2016.
[12] Experian, "The 2014 Digital Marketer," Experian, California, 2014.
[13] J. G. Jennifer, W. O. Walter, E. K. Timothy, P. M. Jeffrey and P. A. William, "Data Quality Assurance, Monitoring,
and Reporting," Controlled Clinical Trials, vol. 16, pp. 105-135, 1995.
[14] T. L., M. A., S. I and D. R. G, "CMS data quality monitoring: systems and," Journal of Physics: Conference Series
219 (2010) 072020, vol. IX, pp. 1-9, 2010.
[15] A. Blinn, M. A. Cohen, M. Lorton and G. J. Stein, "Database Schema Independence". United States Patent
US5974418A, 1996.
[16] D. Kinshansing and M. Sutan Saghar, "Study Of The ANSI/SPARC Architecture," International Journal of Modern
Trends in Engineering and, pp. 22-30, 2014.
[17] E. Ramez and B. N. Shamkant, Fundamentals of Database Systems:Sixth Edition, Boston: Addison-Wesley, 2010.
[18] J. D. C., Relational Database: Selected Writings First Edition Edition, Boston: Addison-Wesley, 1998.
[19] E. F. Codd, "Derivability, Redundancy, and Consistency of Relations Stored in Large Data Banks," IBM, 1969.
[20] S. Frank, "A Gentle Introduction to Relational and Object Oriented Databases," ORL Technical Report, Cambridge,
1995.
[21] N. Dedić and C. Stanier, "An Evaluation of the Challenges of Multilingualism in Data Warehouse Development," in
18th International Conference on Enterprise Information Systems - ICEIS 2016, Rome, 2016.
[22] T. Connolly and C. Begg, Database Systems - A Practical Approach to Design, Implementation and Management,
Pearson , 2015.
[23] D. L. Moody and M. A. Kortink, "From Enterprise Models to Dimensional Models: A Methodology for Data
Warehouse and Data Mart Design," in International Workshop on Design and Management of Data Warehouses
(DMDW'2000), Stockholm, 2000.