Intermediate Data Format for the Elastic Data Conversion Framework

Tran Khanh Dang, Ta Manh Huy, Nguyen Le Hoang
Ho Chi Minh City University of Technology (HCMUT), VNU-HCM
268 Ly Thuong Kiet Street, District 10, Ho Chi Minh City, Vietnam
khanh@hcmut.edu.vn, 1870399@hcmut.edu.vn, nlhoang@hcmut.edu.vn

Abstract—With the development of technology, data is becoming an extremely valuable resource. Data is being created, analyzed and used on a massive scale in every modern system. As a result, data analysis and data mining are essential in every aspect of social applications. Data becomes even more valuable when it can be linked and merged with other data resources, especially for solving current social problems. To turn this big challenge into reality, data transformation is a crucial step that has to be overcome. Among data conversion methods, the intermediate method is preferable because it is easy to extend. For this type of model, however, finding the intermediate data type for the system is the crucial problem that must be solved first. In this paper, we compare several popular data formats in order to find an appropriate data type for this elastic data conversion system.

Index Terms—data format, data type, data conversion, data integration system, data transformation, open data

I. INTRODUCTION

With the development of technology, data is becoming an extremely valuable resource. Data is being created, analyzed and used on a massive scale in every modern system. As a result, data analysis and data mining are essential in every aspect of social applications. Data becomes even more valuable when it can be linked and merged with other data resources, especially for solving current social problems. To turn this big challenge into reality, data transformation is a crucial step that has to be overcome.

Data transformation can be described as the task of flexibly converting data among different models and formats, thereby supporting the combination of data from various resources into a unified dataset. This brings many benefits to data analysis and management applications and provides potential value, for example in making predictions and supporting decision making. Data can now be collected and integrated into data centers for storage and management for a variety of purposes.

However, this problem is not easy even when converting traditional data with few data sources and a simple structure. Usually, the process requires human participation to understand and correct the meaning of the data in each source and to resolve data ambiguity problems, including semantic and data representation ambiguity. Data conversion methods can be divided into two categories: direct conversion and intermediate conversion.

• Direct conversion: data is converted directly from the source format to the target format. This is the most popular conversion method due to its simplicity and ease of implementation. However, it is only effective when the number of source and target formats is small; as this number increases, the complexity of the system grows rapidly (Fig. 1).
• Intermediate conversion: data is first converted to an intermediate data format, and this intermediate data is then converted to the format the user wants. This method reduces the complexity of the system and makes the system expandable (Fig. 2).

Fig. 1. Direct Data Conversion
Fig. 2. Intermediate Data Conversion

Most studies and works mainly use the direct conversion model [5], [9]. The reason for the wide usage of this model lies in the nature of the goal or project: the need to convert data from one or several specific formats to one or several other specific formats. However, with the goal of designing an extensible system whose framework can work with a variety of input and output formats, the direct transformation model creates a lot of complexity whenever new data types are added. In the direct conversion model, if we add a new input format to a system that already has n output formats, n new conversion functions must be implemented manually. With the intermediate conversion model, only one additional function is needed. Therefore, the intermediate conversion model is the most suitable for a large and extensible data conversion system.
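To make this "one function per format" argument concrete, the following sketch shows one way an intermediate model can be organized as two registries. This is only an illustration, not the framework of [16]; the function names and the list-of-dicts intermediate representation are assumptions made for this sketch.

```python
import csv
import io
import json

# Intermediate representation (assumed for this sketch): a list of flat dicts.
READERS = {}  # format name -> function(text) -> list[dict]
WRITERS = {}  # format name -> function(list[dict]) -> text

def register(fmt, reader, writer):
    """Adding a new format costs exactly one reader and one writer,
    not a dedicated converter to and from every other supported format."""
    READERS[fmt] = reader
    WRITERS[fmt] = writer

register("json", json.loads, lambda rows: json.dumps(rows, indent=2))

def _csv_read(text):
    return list(csv.DictReader(io.StringIO(text)))

def _csv_write(rows):
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=sorted({k for r in rows for k in r}))
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()

register("csv", _csv_read, _csv_write)

def convert(text, source_fmt, target_fmt):
    """Source -> intermediate -> target; no pairwise converters are needed."""
    return WRITERS[target_fmt](READERS[source_fmt](text))

if __name__ == "__main__":
    print(convert('[{"name": "A", "count": 1}, {"name": "B", "count": 2}]',
                  "json", "csv"))
```

With n input and m output formats, this design needs n readers and m writers rather than n × m dedicated converters, which is exactly the extensibility advantage claimed for the intermediate model.
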

For this type of model, one of the most significant problems is finding the intermediate data type for the system. In this paper, we compare several popular data formats in order to find an appropriate data type for the elastic data conversion system. This research direction is one of the research trends in Information Technology for Ho Chi Minh City in the period 2018-2023. The rest of this paper is organized as follows: Section II compares relational and NoSQL databases to explain why a NoSQL model is chosen; Section III compares several NoSQL data formats to find a suitable one; Section IV gives the summary and conclusion.

II. RELATIONAL AND NOSQL DATABASES

Since the 1960s, database management systems have been one of the core technologies of information systems. During this development, many database models have been proposed based on different theoretical backgrounds; the two most common are relational databases (RDBMS, such as Oracle, MySQL, SQL Server, etc.) and NoSQL databases (typically MongoDB, Redis, etc.). According to DB-Engines [1], relational databases are the most popular, with 74.5% of the market, while NoSQL holds nearly 20% (September 2020). A comparison between RDBMS and NoSQL databases is shown in Table I.

TABLE I
RDBMS VS NOSQL

Criterion | Relational Databases | NoSQL Databases
Optimal workloads | Designed for online transaction processing (OLTP) applications with high stability, and also suitable for online analytical processing (OLAP) | Designed for analysis of incompletely structured data and for low-latency applications
Database model | The relational model normalizes data into tables of rows and columns; the schema clearly specifies the relationships between tables and other database elements | Provides diverse data models, including document, graph, key-value, etc.
ACID | Requires the Atomicity, Consistency, Isolation, Durability (ACID) properties | Often relaxes some ACID properties in exchange for a more flexible data model capable of horizontal scaling
Performance | Depends on the disk subsystem; optimization typically requires tuning queries, indexes and table structures | Depends on hardware cluster size and network latency
Expansion | Requires reconstructing tables, constraints and databases | Highly flexible and extensible
API | Storage and retrieval requests are expressed as Structured Query Language (SQL) queries, which are parsed and executed by the database | Object-based APIs allow developers to easily store and retrieve data structures; key lookups return key-value pairs, column sets, or semi-structured documents containing the serialized objects and their attributes

The intermediate data conversion system is designed to use an intermediate data type, or standard data format, in the transformation process. In fact, this standard format is also what is stored in the storage block. To serve elastic and extensible needs, the standard data format must meet the following criteria: flexible structure, efficient storage and good scalability. According to these criteria, a NoSQL database is the better option, since RDBMS models cannot meet the need for a flexible structure.
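As a small illustration of the flexible-structure criterion (a hypothetical example, not data from the framework), records coming from different sources rarely share the same fields; a document-style format can hold them side by side, whereas a relational table would need schema changes or NULL-padded columns:

```python
import json

# Two records from different sources with different fields can live in the
# same JSON document collection without any schema migration.
records = [
    {"name": "Station A", "city": "Ho Chi Minh City", "traffic_count": 1279},
    {"name": "Station B", "coordinates": [10.77, 106.66], "notes": "no count"},
]
print(json.dumps(records, indent=2))

# In an RDBMS, the second record would require either adding new columns
# (ALTER TABLE) or leaving those columns NULL for every other row.
```
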
III. COMPARISONS AMONG XML, JSON AND BSON

Among NoSQL databases, many standard data formats could be used here; however, we only consider the JSON, XML and BSON formats, which can satisfy the needs of our data conversion framework in [16]. (A small example of the same record in each format is given after the descriptions below.)

• XML (Extensible Markup Language). XML is one of the most commonly used data formats today. Designed with the aim of being easy to read and easy to understand for both humans and machines, XML has been widely used and has practically become a communication standard.
• JSON (JavaScript Object Notation). JSON was proposed in 2000 and was first standardized and released under the name ECMA-404 in 2013 [3]. First developed for data exchange, JSON is now used for many different purposes: exchanging data, storing data, storing system settings, etc. Moreover, JSON has a very strong and growing community, especially on GitHub.
• BSON (Binary JSON). BSON is a binary representation of JSON. Designed by MongoDB for practical database work, BSON has several good properties, such as being closer to the machine representation and making stored data easier to look up. However, because BSON was designed to be used with the MongoDB system, not many communities use and support this format.
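The snippet below shows the same record expressed in XML and JSON, and the size of its BSON encoding. The record is illustrative, and the use of the bson package shipped with PyMongo is our assumption, not the tooling of [16].

```python
import json
import bson  # ships with PyMongo; bson.encode() assumed available (PyMongo >= 3.9)

record = {"name": "Meteorite Landings", "records": 45716, "format": "open data"}

xml_text = (
    "<dataset>"
    "<name>Meteorite Landings</name>"
    "<records>45716</records>"
    "<format>open data</format>"
    "</dataset>"
)
json_text = json.dumps(record)
bson_bytes = bson.encode(record)  # binary encoding of the same JSON document

print(len(xml_text), "bytes as XML")
print(len(json_text), "bytes as JSON")
print(len(bson_bytes), "bytes as BSON")
```

On such a small record the differences are modest; Tables III and IV report the differences at dataset scale.
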

Choosing a suitable format is an indispensable part of any system. The choice of storage format affects not only system performance but also the structure (through the ecosystem of supporting libraries), the maintainability (such as error-detection capabilities) and the scalability (future system upgrades) of the system. To find a suitable format, we mainly focus on the following properties:

• Size of the datasets when saved.
• Speed of loading and reading the datasets.
• Supported libraries and communities.

TABLE II
EXPERIMENTAL DATASETS

Dataset | Description | Number of Records
Dataset 1 | Child and Adult Health Care Quality Measures | 1716
Dataset 2 | 911 Master PSAP Registry | 8557
Dataset 3 | Accidental Drug Related Deaths 2012-2018 | 5105
Dataset 4 | Average Daily Traffic Counts | 1279
Dataset 5 | Border Crossing Entry Data | 351603
Dataset 6 | Development Credit Authority (DCA) | 186545
Dataset 7 | Directory Of Homeless Population By Year | 32
Dataset 8 | Federal Funding Allocation | 23624
Dataset 9 | Hospital General Information | 5344
Dataset 10 | Meteorite Landings | 45716
Dataset 11 | Motor Vehicle Occupant Death Rate | 5
Dataset 12 | NASA Facilities | 485
Dataset 13 | Operating Expenses | 17844
Dataset 14 | Patient survey (HCAHPS) | 4032
Dataset 15 | Popular Baby Names | 19418
Dataset 16 | Railroad Operations Data | 301159
Dataset 17 | Restaurants | 1327
Dataset 18 | Sales Tax Rates | 2129
Dataset 19 | Service With Vehicle Data | 14057
Dataset 20 | Traffic Flow Map Volumes | 437
A. Comparison 1 - Size of datasets when saving

For the same dataset, the format with the smaller storage size is better, since this saves resources and speeds up writing.

• Datasets: the data for the comparisons are retrieved from www.data.gov, as described in Table II. The chosen datasets are available in both XML and JSON formats.
• Method: each dataset is converted to the XML, JSON and BSON formats. Since the original datasets are already available in XML and JSON, we only need to convert the JSON version to BSON, using the bsonjs library in Python. The JSON-to-BSON conversion consists of loading the JSON data, converting it to BSON with bsonjs, and saving the BSON data to files (a sketch of this step is given below). We then measure and compare the size of each dataset after conversion.
• Results (Table III): the XML format takes the most space, while JSON gives the smallest storage; the BSON files are only slightly larger than the JSON ones.
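A minimal sketch of this conversion and measurement step follows. It assumes the python-bsonjs API in which bsonjs.loads() turns a JSON string into BSON bytes, and it assumes each dataset is (or is wrapped as) a single top-level JSON object, since a BSON document must be an object; the file names are illustrative.

```python
import os
import bsonjs  # python-bsonjs: JSON string <-> BSON bytes

def json_file_to_bson(json_path, bson_path):
    """Load a JSON dataset, convert it to BSON with bsonjs, and save it."""
    with open(json_path, "r", encoding="utf-8") as f:
        json_text = f.read()
    # bsonjs.loads() expects a single JSON document (an object); datasets that
    # are top-level arrays are wrapped under one key first (an assumption here).
    if json_text.lstrip().startswith("["):
        json_text = '{"rows": ' + json_text + "}"
    bson_bytes = bsonjs.loads(json_text)
    with open(bson_path, "wb") as f:
        f.write(bson_bytes)

def report_sizes(paths):
    """Print the on-disk size of each converted file in kilobytes."""
    for path in paths:
        print(path, round(os.path.getsize(path) / 1024, 1), "KB")

# Example (hypothetical file names):
# json_file_to_bson("dataset10.json", "dataset10.bson")
# report_sizes(["dataset10.xml", "dataset10.json", "dataset10.bson"])
```
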
TABLE III
COMPARISON RESULT 1

Dataset | XML | JSON | BSON
Dataset 1 | 1.6 MB | 870 KB | 1.1 MB
Dataset 2 | 5.4 MB | 3.1 MB | 3.6 MB
Dataset 3 | 7.1 MB | 4.4 MB | 4.6 MB
Dataset 4 | 819 KB | 380 KB | 505 KB
Dataset 5 | 126 MB | 67 MB | 86 MB
Dataset 6 | 161 MB | 57 MB | 76 MB
Dataset 7 | 8.5 KB | 11 KB | 12 KB
Dataset 8 | 45 MB | 11 MB | 16 MB
Dataset 9 | 10 MB | 3.4 MB | 4.2 MB
Dataset 10 | 19 MB | 11 MB | 15 MB
Dataset 11 | 3.6 KB | 14 KB | 15 KB
Dataset 12 | 394 KB | 240 KB | 294 KB
Dataset 13 | 32 MB | 11 MB | 14 MB
Dataset 14 | 2.6 MB | 1.4 MB | 1.6 MB
Dataset 15 | 2.6 MB | 1.4 MB | 1.6 MB
Dataset 16 | 135 MB | 51 MB | 68 MB
Dataset 17 | 697 KB | 394 KB | 433 KB
Dataset 18 | 877 KB | 543 KB | 620 KB
Dataset 19 | 32 MB | 8.4 MB | 11 MB
Dataset 20 | 165 KB | 90 KB | 115 KB

B. Comparison 2 - Loading and Reading speed

For the same dataset, the format that can be loaded and read faster is better. This factor is not necessarily related to the previous property; in fact, some formats compress data to a very small size but take a long time to load and read.

• Datasets: the data for the comparisons are retrieved from www.data.gov, as described in Table II.
• Method: each dataset converted in Comparison 1 is stored in our system, then loaded and read; we measure and compare the loading and reading time (a sketch of this measurement is given below).
• Results (Table IV): XML takes the longest time to load and read, while JSON and BSON take nearly the same amount of time.
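A minimal sketch of the timing step follows. It uses the Python standard library plus bsonjs; the choice of parsers (xml.etree.ElementTree for XML, json for JSON, bsonjs for BSON) is our assumption, since the paper does not list the exact libraries used for reading.

```python
import json
import time
import xml.etree.ElementTree as ET

import bsonjs  # assumed here for decoding BSON back to JSON text

def timed(label, fn):
    """Run fn once and print the elapsed wall-clock time in seconds."""
    start = time.perf_counter()
    fn()
    print(f"{label}: {time.perf_counter() - start:.2f} s")

def read_xml(path):
    ET.parse(path)  # builds the full element tree in memory

def read_json(path):
    with open(path, "r", encoding="utf-8") as f:
        json.load(f)

def read_bson(path):
    with open(path, "rb") as f:
        bsonjs.dumps(f.read())  # BSON bytes -> JSON string

# Example (hypothetical file names):
# timed("XML", lambda: read_xml("dataset10.xml"))
# timed("JSON", lambda: read_json("dataset10.json"))
# timed("BSON", lambda: read_bson("dataset10.bson"))
```
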

TABLE IV
COMPARISON RESULT 2

Dataset | XML (s) | JSON (s) | BSON (s)
Dataset 1 | 2.40 | 1.54 | 1.17
Dataset 2 | 11.03 | 4.11 | 5.19
Dataset 3 | 16.44 | 6.34 | 6.31
Dataset 4 | 0.56 | 0.34 | 0.29
Dataset 5 | 161.21 | 95.52 | 93.51
Dataset 6 | 849.02 | 345.41 | 346.42
Dataset 7 | 0.10 | 0.04 | 0.03
Dataset 8 | 106.81 | 15.84 | 18.65
Dataset 9 | 7.08 | 3.61 | 2.13
Dataset 10 | 42.38 | 21.54 | 24.34
Dataset 11 | 0.08 | 0.06 | 0.06
Dataset 12 | 4.49 | 0.46 | 0.21
Dataset 13 | 58.65 | 20.71 | 21.02
Dataset 14 | 2.79 | 0.80 | 1.19
Dataset 15 | 3.78 | 2.40 | 1.90
Dataset 16 | 172.70 | 72.12 | 71.81
Dataset 17 | 0.78 | 0.58 | 0.61
Dataset 18 | 0.58 | 0.43 | 0.38
Dataset 19 | 51.74 | 8.71 | 15.56
Dataset 20 | 3.11 | 0.12 | 0.05
C. Comparison 3 - Supported Libraries and Communities

The built-in libraries and the support from communities are also very important aspects to consider when choosing a suitable data type. As a rule of thumb, the more popular a format is, the better an option it is. There are many ways to compare the popularity of the aforementioned formats; in this paper, we use the following criteria, since they give some insight into how these formats are used in the developer community (a sketch of how such counts can be gathered is given after this list):

• Number of related repositories on GitHub.
• Number of questions on Stack Overflow.
• Data streaming support.
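The GitHub count, for example, can be approximated with the public GitHub search API. The query string and the interpretation of "related repositories" are our assumptions, and the numbers change over time; Table V reports the values observed in October 2020.

```python
import requests  # third-party HTTP client

def github_repo_count(keyword):
    """Return the repository count GitHub's search API reports for a keyword."""
    resp = requests.get(
        "https://api.github.com/search/repositories",
        params={"q": keyword},
        headers={"Accept": "application/vnd.github+json"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["total_count"]

if __name__ == "__main__":
    for fmt in ("xml", "json", "bson"):
        # Unauthenticated requests are rate-limited, so this is only a rough probe.
        print(fmt, github_repo_count(fmt))
```
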
In addition to the above criteria, we also include some further ones that provide a deeper understanding of these formats:

• The format this format is based on.
• The formats based on this format.
• The standardization of the format.
• The standard APIs of the format.
• Notable libraries of the format.

It is always difficult to build anything from scratch; the more support we have from existing libraries and communities, the more effectively and efficiently we can operate our system. The comparison in Table V shows that:

• More people are using JSON than XML and BSON.
• JSON also has more supporting libraries than the others.
• The BSON format is not yet mature and has limited support from the community.
• Both JSON and XML are fully supported in data streaming.

TABLE V
XML VS JSON VS BSON

Criterion | XML | JSON | BSON
Number of related repositories on GitHub | About 95,000 (October 2020) | About 240,000 (October 2020) | About 900 (October 2020)
Number of questions on Stack Overflow | About 0.6% of questions on Stack Overflow in 2019 [2] | About 1.5% of questions on Stack Overflow in 2019 [2] | No data available
Data streaming support | Fully supported, e.g. XMPP, QuiXSchematron, XSLT 3.0 | Fully supported, e.g. Jackson (API), jq, logstash, ldjson-stream | Few
The format this format is based on | SGML | None (but based on JavaScript syntax) | JSON
The formats based on this format | YAML, Fast Infoset, SOAP, XML-RPC, Efficient XML Interchange (EXI) | YAML, BSON, Ion, Smile, CBOR, MessagePack, Extensible Data Notation (EDN) | None
The standardization of the format | Standardized | Standardized | Not standardized
The standard APIs of the format | DOM, SAX, XQuery, XPath | Clarinet, JSONQuery, JSONPath, JSON-LD | No standard APIs
Notable libraries of the format | DOM4J, StAX, JDOM, xml2js, libxmljs, sax, fast-xml-parser, xml-stream | JSON checker, YAJL, json-c, json-parser, JSONKit, JSONUtil, json2.js | libbson, Mongo-GLib, bson4jackson, CookJson, PyMongo

IV. CONCLUSIONS

From the comparisons, we can conclude that both XML and JSON are mature technologies with good characteristics such as high scalability, flexibility and convertibility. However, we think that the JSON format has a bigger advantage over the other two formats and is more suitable for our framework in [16] for these reasons:

• Smaller storage capacity.
• Faster loading and reading.
• Stronger communities of users and developers.

Recent studies have shown the important role of data conversion and its usefulness in data integration systems, especially the intermediate data conversion model, which supports a diversity of input data sources and data formats as well as the challenging context of big data. To make such a system perform effectively, finding the most appropriate intermediate data type is an essential problem that needs to be solved first. In this paper, we have compared some popular and widely used data formats and chosen a suitable data type for this data conversion system. Even though relational databases are the most traditional and widely used, NoSQL models give us flexibility and scalability. And among the NoSQL data types, JSON seems to be the best option for the system due to its smaller storage capacity, faster loading/reading and bigger supporting community.

ACKNOWLEDGMENT

This work is supported by a project with the Department of Science and Technology, Ho Chi Minh City, Vietnam (contract with HCMUT No. 42/2019/HD-QPTKHCN, dated 11/7/2019).

REFERENCES

[1] DB-Engines, https://db-engines.com/en/
[2] Stack Overflow Insights, https://insights.stackoverflow.com/
[3] ECMA International (2013). ECMA-404: The JSON Data Interchange Format.
[4] Dong, X.L., Srivastava, D. (2015). Big Data Integration. Morgan & Claypool Publishers, p. 198.
[5] Hyeonjeong Lee, Hoseok Jung, Miyoung Shin, Ohseok Kwon (2017). Developing a semi-automatic data conversion tool for Korean ecological data standardization. Journal of Ecology and Environment, 41(11).
[6] Information Builders (2016). Real World Strategies for Big Data - Tackling The Most Common Challenges With Big Data Integration. White paper.
[7] Ivan Ermilov, Claus Stadler, Michael Martin, Soeren Auer (2013). CSV2RDF: User-Driven CSV to RDF Mass Conversion Framework. In Proceedings of the 9th International Conference on Semantic Systems.
[8] Knoblock, C.A., Szekely, P. (2015). Exploiting semantics for big data integration. AI Magazine, 36(1), 25-38.
[9] Luis Paiva, Pedro Pereira, Bruno Almeida, Pedro Maló, Juha Hyvärinen, Krzysztof Klobut, Vanda Dimitriou, Tarek Hassan (2017). Interoperability: A Data Conversion Framework to Support Energy Simulation. Proceedings, 1(7), 695, ISSN: 2504-3900.
[10] Marek Obitko, Václav Jirkovský (2015). Big Data Semantics in Industry 4.0. In Industrial Applications of Holonic and Multi-Agent Systems, Lecture Notes in Computer Science, vol. 9266, pp. 217-229.
[11] Microsoft (2017). SQL Server Integration Services. https://docs.microsoft.com/en-us/sql/integration-services/sql-server-integration-services.
[12] Milan Vathoopan, Benjamin Brandenbourger, Amil George, Alois Zoitl (2017). Towards an Integrated Plant Engineering Process Using a Data Conversion Tool for AutomationML. In IEEE International Conference on Industrial Technology, pp. 1205-1210.
[13] Oracle (2015). The Five Most Common Big Data Integration Mistakes to Avoid. Oracle White Paper. https://www.oracle.com/assets/big-data-integration-mistakes-wp-2492054.pdf.
[14] Rocha, Leonardo, et al. (2015). A Framework for Migrating Relational Datasets to NoSQL. Procedia Computer Science, 51, 2593-2602.
[15] Talend (2017). Talend Data Integration. https://www.talend.com/
[16] Tran Khanh Dang, Ta Manh Huy, Nguyen Le Hoang (2020). An Elastic Data Conversion Framework for Data Integration System. Communications in Computer and Information Science, vol. 1306.

