Professional Documents
Culture Documents
Data Framework
Data Framework
Data Framework
Conversion Framework
2021 15th International Conference on Ubiquitous Information Management and Communication (IMCOM) | 978-1-6654-2318-2/21/$31.00 ©2021 IEEE | DOI: 10.1109/IMCOM51814.2021.9377366
Authorized licensed use limited to: Carleton University. Downloaded on June 05,2021 at 17:13:03 UTC from IEEE Xplore. Restrictions apply.
TABLE I
RDBMS VS N O SQL
Authorized licensed use limited to: Carleton University. Downloaded on June 05,2021 at 17:13:03 UTC from IEEE Xplore. Restrictions apply.
• BSON (Binary JSON). BSON is a binary representation
of JSON. Designed by MongoDB based on the actual
work for databases, BSON has many good properties such
TABLE II
as closer to computer language and easier to find saved E XPERIMENTAL DATASETS
data. However, because the design purpose of BSON
is to be applied to the MongoDB system, not many Dataset Description Number of
communities use and support this BSON format. Records
Dataset 1 Child and Adult Health Care Qual- 1716
Choosing the a suitable format is an indispensable part ity Measures
of any system. The use of a suitable storage format affects Dataset 2 911 Master PSAP Registry 8557
not only system performance but also the structure (through Dataset 3 Accidental Drug Related Deaths 5105
2012-2018
the ecosystem of format support libraries), the maintainability Dataset 4 Average Daily Traffic Counts 1279
(such as error detection capabilities) and the scalability (future Dataset 5 Border Crossing Entry Data 351603
system upgrades), etc. To find the suitable format, we mainly Dataset 6 Development Credit Authority 186545
(DCA)
focus on these following properties:
Dataset 7 Directory Of Homeless Population 32
• Size of the datasets when saving. By Year
Dataset 8 Federal Funding Allocation 23624
• The speed of loading and reading the datasets. Dataset 9 Hospital General Information 5344
• Supported Libraries and Communities. Dataset 10 Meteorite Landings 45716
Dataset 11 Motor Vehicle Occupant Death 5
Rate
A. Comparison 1 - Size of datasets when saving Dataset 12 NASA Facilities 485
Dataset 13 Operating Expenses 17844
For a same dataset, the format that support smaller storage Dataset 14 Patient survey (HCAHPS) 4032
will be better since this can save a lot of resources and fasten Dataset 15 Popular Baby Names 19418
the writing speed. Dataset 16 Railroad Operations Data 301159
Dataset 17 Restaurants 1327
• Datasets: the data for comparisons are retrieved from Dataset 18 Sales Tax Rates 2129
www.data.gov as depicted in Table II. The chosen Dataset 19 Service With Vehicle Data 14057
Dataset 20 Traffic Flow Map Volumes 437
datasets are both available in XML and JSON formats.
• Method: each dataset will be converted to XML, JSON
and BSON formats. Since the orginal datasets are already
in XML and JSON formats, we only need to convert
the datasets to BSON format from JSON format via
bsonjs library in Python. The JSON-to-BSON conversion
include loading the JSON-data, saving JSON-data to
BSON data using bsonjs and then saving BSON-data TABLE III
to files. We then measure and compare the size of each C OMPARISON R ESULT 1
dataset after converted.
• Results (Table III): XML data format has the biggest Dataset XML JSON BSON
capacity while JSON has the smallest data storage. BSON Dataset 1 1.6 MB 870 KB 1.1 MB
Dataset 2 5.4 MB 3.1 MB 3.6 MB
datasets require just a little bigger than JSON. Dataset 3 7.1 MB 4.4 MB 4.6 MB
Dataset 4 819 KB 380 KB 505 KB
Dataset 5 126 MB 67 MB 86 MB
B. Comparison 2 - Leading and Reading speed Dataset 6 161 MB 57 MB 76 MB
Dataset 7 8.5 KB 11 KB 12 KB
For a same dataset, the format able to load and read Dataset 8 45 MB 11 MB 16 MB
data faster will be better. This factor is not related to above Dataset 9 10 MB 3,4 MB 4,2 MB
Dataset 10 19 MB 11 MB 15 MB
property; in fact, there are some formats able to compress the Dataset 11 3,6 KB 14 KB 15 KB
data to very small size but it takes long time to load and read. Dataset 12 394 KB 240 KB 294 KB
Dataset 13 32 MB 11 MB 14 MB
• Datasets: the data for comparisons are retrieved from Dataset 14 2,6 MB 1,4 MB 1,6 MB
www.data.gov as depicted in Table II. Dataset 15 2,6 MB 1,4 MB 1,6 MB
• Method: each converted dataset from comparison 1 will Dataset 16 135 MB 51 MB 68 MB
be stored in our system. Then it will be loaded and read. Dataset 17 697 KB 394 KB 433 KB
Dataset 18 877 KB 543 KB 620 KB
We measure and compare the process of loading and Dataset 19 32 MB 8,4 MB 11 MB
reading. Dataset 20 165 KB 90 KB 115 KB
• Results (Table IV): XML takes the longest time to read
and load data while JSON and BSON take nearly the
same periods of time.
Authorized licensed use limited to: Carleton University. Downloaded on June 05,2021 at 17:13:03 UTC from IEEE Xplore. Restrictions apply.
TABLE IV teristics such as high scalability, flexibility and convertibility.
C OMPARISON R ESULT 2 However, we think that JSON format has a bigger advantage
over the other two format and more suitable for our framework
Dataset XML (s) JSON (s) BSON (s) in [16] due to these reasons:
Dataset 1 2.40 1.54 1.17
Dataset 2 11.03 4.11 5.19 • Smaller storage capacity.
Dataset 3 16.44 6.34 6.31 • Faster loading and reading process.
Dataset 4 0.56 0.34 0.29 • Stronger communities of users and developers.
Dataset 5 161.21 95.52 93.51
Dataset 6 849.02 345.41 346.42 Recent studies have shown the important role of data
Dataset 7 0.10 0.04 0.03 conversion and its usefulness in data integration systems, espe-
Dataset 8 106.81 15.84 18.65 cially the intermediate data conversion model which supported
Dataset 9 7.08 3.61 2.13
Dataset 10 42.38 21.54 24.34 the diversity of input data sources and data formats or the
Dataset 11 0.08 0.06 0.06 challenging context of big data. To make this system perform-
Dataset 12 4.49 0.46 0.21 ing effectively, finding the most appropriate intermediate data
Dataset 13 58.65 20.71 21.02
Dataset 14 2.79 0.80 1.19
type is an essential problem and needs to be solved firstly.
Dataset 15 3.78 2.40 1.90 In this paper, we make comparisons among some popular
Dataset 16 172.70 72.12 71.81 and mostly used data formats then choosing the suitable data
Dataset 17 0.78 0.58 0.61 type for this data conversion system. Even though relational
Dataset 18 0.58 0.43 0.38
Dataset 19 51.74 8.71 15.56 databases are the most traditional and widely-used, NoSQL
Dataset 20 3.11 0.12 0.05 models will help us the attributes of flexibility and scalability.
And among NoSQL data types, JSON seems to be the best
option for the system due to its smaller storage capacity, faster
C. Comparison 3 - Supported Libraries and Communities loading/reading and bigger supported communities.
The built-in libraries and the support from communities ACKNOWLEDGMENT
are also a very important aspect to consider when choosing This work is supported by a project with the Department of
a suitable data type. As a rule of thumb, the more popular Science and Technology, Ho Chi Minh City, Vietnam (contract
the format is, the better option it is. There are many ways with HCMUT No. 42/2019/HD-QPTKHCN, dated 11/7/2019).
to compare the popularity of the aforementioned formats;
however, in this paper, we will use the following criteria since R EFERENCES
they can give us some insights about the use of these formats [1] DB-Engines, https://db-engines.com/en/
in the developer community. [2] https://insights.stackoverflow.com/
[3] ECMA International, ECMA-404: The JSON Data Interchange Format,
• Number of related repositories on GitHub. 2013
• Number of questions on Stack Overflow. [4] Dong X.L., Srivastava D. (2015). Big Data Integration. Morgan &
Claypool Publishers, p. 198.
• Data streaming supported.
[5] Hyeonjeong Lee, Hoseok Jung, Miyoung Shin, Ohseok Kwon (2017).
Moreover, other than the above criteria, we also include Developing a semi-automatic data conversion tool for Korean ecological
some other ones which will provide a deeper understanding data standardization. In Journal of Ecology and Environment, 41(11).
[6] Information Builders (2016). Real World Strategies for Big Data -
of these formats. These include: Tackling The Most Common Challenges With Big Data Integration -
• The format this format is based on. A white paper.
• The formats based on this format.
[7] Ivan Ermilov, Claus Stadler, Michael Martin, Soeren Auer (2013).
CSV2RDF: User-Driven CSV to RDF Mass Conversion Framework. In
• The standardization of the format. Proceedings of the 9th International Conference on Semantic Systems.
• The standard APIs of the format. [8] Knoblock, C.A., Szekely, P. (2015). Exploiting semantics for big data
• Notable libraries of the format.
integration. In AI Magazine, 36(1), 25-38.
[9] Luis Paiva, Pedro Pereira, Bruno Almeida, Pedro Maló, Juha Hyvärinen,
It is always difficult to build anything from scratch; the more Krzysztof Klobut, Vanda Dimitriou, and Tarek Hassan (2017). Interop-
supported we have from existing libraries and communities, erability: A Data Conversion Framework to Support Energy Simulation.
In Proceedings, 1(7), 695, ISSN: 2504-3900.
the more effectively and efficiently we can operate our system. [10] Marek Obitko, Václav Jirkovský (2015). Big Data Semantics in Industry
The comparison in Table V has shown that: 4.0. In Industrial Applications of Holonic and Multi-Agent Systems.
• More people are using JSON than XML and BSON.
Lecture Notes in Computer Science, vol 9266, pp. 217-229.
[11] Microsoft (2017). SQL Server Integration Services.
• JSON also has more supported libraries than others. https://docs.microsoft.com/en-us/sql/integration-services/sql-server-
• BSON format is not yet matured and has limited support integration-services.
from the community. [12] Milan Vathoopan, Benjamin Brandenbourger, Amil George, Alois Zoitl
(2017). Towards an Integrated Plant Engineering Process Using a Data
• Both JSON and XML are full supported in data stream- Conversion Tool for AutomationML. In IEEE International Conference
ing. on Industrial Technology, pp. 1205-1210.
[13] Oracle (2015). The Five Most Common Big Data Integration Mistakes
IV. C ONCLUSIONS to Avoid. Oracle White Paper. https://www.oracle.com/assets/big-data-
integration-mistakes-wp-2492054.pdf.
From the comparisons, we can conclude that both XML [14] Rocha, Leonardo, et al. A Framework for Migrating Relational Datasets
and JSON have the maturity in technology with good charac- to NoSQL1. Procedia Computer Science 51 (2015): 2593-2602.
Authorized licensed use limited to: Carleton University. Downloaded on June 05,2021 at 17:13:03 UTC from IEEE Xplore. Restrictions apply.
TABLE V
XML VS JSON VS BSON
Authorized licensed use limited to: Carleton University. Downloaded on June 05,2021 at 17:13:03 UTC from IEEE Xplore. Restrictions apply.