
MySQL and MongoDB for GIS Applications
Contents
Introduction
Theoretical knowledge
1. Geographical Data
2. Geometric Data
3. Geospatial Data
Method
Previous Research
Hypothesis
Variables
Dataset
Code for Testing sqlite3
Performance analysis of sqlite3
Code for testing MongoDB
Performance analysis of pymongo
Discussion
Conclusion
References
Introduction
The integration of databases within Geographic Information Systems (GIS) has gained paramount
importance in recent years, driven by the exponential growth of spatial data acquisition and
analysis. As GIS applications continue to expand across various domains, the need for efficient and
reliable databases to store and manage spatial data has intensified. A database management system
is essential in managing data efficiently and enabling users to complete various tasks with ease, as
it "improves the effectiveness of business processes and decreases overall costs" (Sahatqija et al.,
2018a, p. 1).
This supports the importance of database management systems in improving business processes and
reducing costs, which is directly relevant to GIS applications: efficient and reliable database
systems are crucial for managing the growing volume and complexity of spatial data. It suggests that
the performance and functionality of these systems play a crucial role in driving operational
efficiency and cost savings. This aligns with the broader context of GIS applications, where the
choice of a suitable database system can significantly impact data management, analysis, and
decision-making processes. Therefore, understanding the implications of database selection on
business processes and costs is essential when evaluating the suitability of systems like MongoDB
and MySQL for GIS applications. Choosing the right database management system is a critical
decision that every firm and software developer grapples with regularly. This paper delves into two
of the leading database systems commonly used in GIS applications: MongoDB and MySQL. Each
system offers unique features and capabilities, making them suitable for different scenarios within
GIS. Our goal is to thoroughly explore the comparative strengths and weaknesses of MongoDB and
MySQL. In the paper, we are going to explore the following sections:

In the Theoretical section, an exploration into the foundational concepts necessary to
comprehend the context of the research is undertaken. This entails an examination of terms
such as GIS, its data types, SQL, and MongoDB, which establishes a robust groundwork for
subsequent exploration within the paper.
In the Method section, a detailed exposition of the experimental design and analysis techniques
employed for comparing MongoDB and MySQL in GIS applications is provided. Within the
Results section, the findings are presented objectively and impartially. Systematic reporting of
performance tests, scalability assessments, and usability evaluations of MongoDB and MySQL is
carried out. Each result is articulated clearly, thereby ensuring lucidity in data presentation. This
furnishes valuable insights into the performance of both databases and aids in determining the
preferable choice under various scenarios. In the Discussion section, a critical analysis of the
chosen methods and results pertaining to the comparison of MongoDB and MySQL for GIS
applications is conducted. Strengths and weaknesses of the approach are highlighted, facilitating a
comprehensive understanding of the research. Such an exhaustive analysis serves to underscore the
relevance and impact of the study within the broader realm of database technology in GIS. In the
Conclusions section, the key findings of the study comparing MongoDB and MySQL for GIS
applications are succinctly summarized. A concise overview of the results is provided, highlighting
the major discoveries and their alignment with the research objectives.
This concluding section accentuates the significance of the study and its contribution to the
advancement of knowledge in the field. As Rajesh &
Sreekumar aptly assert, the selection of a database is a critical decision for big data
analytics, as it can profoundly impact the performance, scalability, and flexibility of the
application. The choice of database is contingent upon the specific requirements of the project,
encompassing factors such as data volume, type, and required operations (Rajesh & Sreekumar,
2015, p. 8). Consequently, the decision to select the appropriate database for varying scenarios
holds paramount importance, enabling the implementation of cost-effective solutions capable of
addressing multiple challenges within the realm of GIS. As organizations strive towards cost
optimization, obtaining an accurate understanding of how each component will contribute to the
plan becomes imperative. The research paper aims to address the following question:

"Which database has faster retrieval time for data and how does it vary according to a specific
data type for GIS applications, and how do MySQL and MongoDB perform against that
data?"
This research will provide GIS professionals and organizations with educated insights into
each database's operational efficiencies, particularly focusing on ease of
implementation, performance during complex spatial queries, and overall suitability for
various GIS applications. This objective, therefore, not only addresses a significant academic
gap but also serves a practical need in the GIS community.
Theoretical knowledge
Overview of Geographic Information Systems (GIS)

Geographic Information Systems (GIS) are integral tools that provide the ability to capture, store,
manipulate, analyze, manage, and present all types of geographical data. These systems are pivotal
in supporting decision-making processes across a multitude of sectors. A GIS is defined as: “In
general, the definitions of GIS cover three main components. They reveal that GIS is a computer
system. This implies more than just a series of computer boxes sitting on a desk but includes
hardware (the physical parts of the computer itself and associated peripherals - plotters and
printers), software (the computer programs that run on the computer) and appropriate procedures
(or techniques and orders for task implementation).” (Heywood et al., 2010, 18). This definition
frames GIS as a computer system comprising hardware, software, and procedures essential for task
implementation, and it highlights the interdisciplinary nature of GIS, which integrates technology,
software, and methodology. By emphasizing the various elements these systems are composed of, it
also conveys their complexity: GIS does not function merely through the physical computers and
peripherals, but depends equally on software applications and data management techniques. The
successful use of GIS therefore requires not only appropriate hardware and software, but also
suitable procedures and organizational structure, to tap into the full potential of the system.
These features are what establish GIS as an integrated tool for data storage, management, and
analysis in support of decision-making processes. GIS can also be defined more generally as “A special
kind of information system, often located on a user's desk, dedicated to performing special kinds of
operations related to location.” (Longley et al., 2015, 3). This definition offers a more user-centric
perspective, portraying GIS as a specialized information system focused on location-related
operations. In short, GIS is a facility for handling geographical information and tasks through
spatial operations.
Introduction to Spatial Data
Another term commonly used in GIS is spatial data, also known as geospatial data, which
encompasses information about the geographic location and characteristics of natural or
constructed features on the earth, integrating location with descriptive information. This data type
is foundational to Geographic Information Systems (GIS), which are designed to collect, store,
process, and interpret datasets tied to spatial contexts.

Types of Spatial Data:


1. Geographical Data
This type of data represents the physical locations and boundaries of features on the earth's
surface, such as lakes, rivers, and cities. It includes both the natural environment and constructed
elements. According to Khalilizangelani and Ghaffarian (n.d.),
‘geographic data refers to data that describes the physical and cultural features of a location
mapped on a sphere such as the Earth’. This encapsulates the essence of geographic data by
highlighting its role in capturing the diverse attributes of a specific location within the context of a
spherical representation. By encompassing both the physical and cultural aspects of a location,
geographic data becomes a comprehensive source of information that aids in understanding and
analyzing the spatial characteristics of the Earth's surface. This definition underscores the
importance of capturing the physical and cultural attributes of locations on Earth. This
fundamental understanding of geographic facts enables the structured representation of
geographical features and cultural areas, and it lays the foundation for storing and managing
spatial data in GIS, as well as for further research into the usability and impact of spatial
analysis and decision-making processes.

2. Geometric Data
Geometric data focuses on the shapes and relative positions of objects in space. This includes
points (single locations), lines (roads, rivers), and polygons (areas like parks, building
footprints), which are used to represent spatial dimensions and boundaries.
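To make these primitives concrete, the following minimal sketch (illustrative only, and not part of the experiment in this paper) represents a point, a line, and a polygon in Python with the shapely library, on which the geopandas package used later is built. The coordinates are arbitrary values near the Bonn bounding box used later in the paper.

import shapely  # geometry library underlying geopandas
from shapely.geometry import Point, LineString, Polygon

# A point: a single location (longitude, latitude).
spot = Point(7.175, 50.745)

# A line: an ordered sequence of points, e.g. a road segment.
road = LineString([(7.17, 50.75), (7.18, 50.751), (7.19, 50.752)])

# A polygon: a closed ring of points, e.g. a park boundary.
park = Polygon([(7.17, 50.74), (7.18, 50.74), (7.18, 50.75), (7.17, 50.75)])

print(spot.wkt)             # POINT (7.175 50.745)
print(road.length)          # length in coordinate units (degrees here)
print(park.contains(spot))  # a simple spatial relationship test -> True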
3. Geospatial Data
Encompassing both geographical and geometric data, geospatial data adds additional layers of
information such as elevation, population density, or traffic patterns.
This type of data often involves a temporal component that tracks changes over time, providing
dynamic insights into the evolution of a particular area. Geospatial data refers to information about
objects, events, or other features with a position on Earth (Khalilizangelani & Ghaffarian, n.d.).
What distinguishes spatial data from other data is its coordinates, such as latitude and longitude,
which link each record directly to a specific location. This spatial reference is why it is critical
for Geographic Information Systems (GIS) and various other applications that depend on maps,
location analysis, and spatial relationships. It can reveal patterns, trends, and correlations in
fields such as urban planning, environmental monitoring, and public health. Spatial data is
therefore a key factor in spatially oriented decision-making, providing insights that aid not just
in understanding but in better managing an interconnected world.

Relational Database Management Systems


A relational database is a tabular database that was introduced by E.F. Codd at IBM in 1970 and
allows data to be reorganized and viewed in a variety of ways (Thakur & Gupta, n.d., pp. 2–3). The
tabular format means data can be rearranged and viewed through different methods, which increases
its utility. In a relational database, each table represents a particular entity, rows correspond
to individual records, and columns to the attributes of those records. Primary keys unambiguously
identify each row of a table, while foreign keys establish relationships between tables, enabling
complex queries and enforcing data integrity. The Structured Query Language (SQL) is the standard
language for operating on relational databases. Relational databases obey the ACID properties
(Atomicity, Consistency, Isolation, Durability) to secure transactions and guarantee data
integrity, and they use normalization methods to eliminate duplication and improve data quality.
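As an illustrative sketch of these concepts (the table and column names are hypothetical, not drawn from the experiment), the following Python snippet uses the built-in sqlite3 module to define two tables linked by a primary key / foreign key pair and join them:

import sqlite3

# In-memory database for illustration; primary and foreign keys
# are what let the two tables be joined reliably.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.execute("""
    CREATE TABLE city (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL
    )""")
conn.execute("""
    CREATE TABLE landmark (
        id      INTEGER PRIMARY KEY,
        name    TEXT NOT NULL,
        city_id INTEGER REFERENCES city(id)  -- foreign key
    )""")
conn.execute("INSERT INTO city VALUES (1, 'Bonn')")
conn.execute("INSERT INTO landmark VALUES (1, 'Poppelsdorf Palace', 1)")

# A join follows the foreign key back to the parent table.
for row in conn.execute("""
        SELECT landmark.name, city.name
        FROM landmark JOIN city ON landmark.city_id = city.id"""):
    print(row)  # ('Poppelsdorf Palace', 'Bonn')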
There are different types of relational database management systems (RDBMS), each with its own
features that suit different situations. MySQL is a well-known open-source RDBMS with a good
reputation for reliability and ease of use, making it a popular choice for web applications.
PostgreSQL, also open source, is famous for its standards compliance and full support for complex
queries and transactions. Oracle DB is a commercial RDBMS recognized for its advanced features and
support for large-scale applications, and is often preferred in enterprise settings. Microsoft SQL
Server, developed by Microsoft, integrates well with Microsoft products and delivers excellent
performance. SQLite is a thin, embedded database implementing the RDBMS model; it suits
applications with minimal resource overhead and is often used in mobile and small to medium-sized
projects. Db2 from IBM is a commercial RDBMS that supports large-scale enterprise workloads,
complex analysis, and AI technologies. Each RDBMS has its own advantages, and users choose among
them according to their particular data management needs.

Non-Relational Database Management Systems


Nonrelational databases, on the other hand, use a storage model that is specific to the database
form. The term “NoSQL” refers to data stores that don’t use SQL to query their data and instead
rely on a different construct (Thakur & Gupta, n.d., pp. 4–5). Instead of using the same language,
Structured Query Language (SQL) for querying and managing data, NoSQL utilizes different
methods and query languages, each being optimized for various data types and operations. This
fundamental characteristic allows NoSQL databases to be more flexible, scalable and responsive
to the needs of specific types of applications. Each type of NoSQL database, e.g. document stores,
key-value stores, column-family stores, or graph databases, is optimized for a particular type of
information so that it can be stored and managed with less time and fewer resources than in a
traditional relational database. In MongoDB, records are stored as dynamic documents, usually
consisting of JSON or BSON records. This enables a schema-free design, which makes it easy to
accommodate ever-changing data structures as well as nested information. Key-value stores like
Redis are highly efficient; their simple structure makes them excel at the fast operations
fundamental to caching and quick session management, where retrieving data fast is essential.
Column-family stores, for instance Cassandra, store data in tables similar to a relational
database, but group columns into families, which supports efficient access to large and sparse
data structures and makes them well suited to large-scale applications where read and write
performance at scale is the key factor. The most advantageous feature of graph databases, such as
Neo4j, is that they are built on a system of nodes, edges, and properties, which is highly
effective for applications with complex relationships and networks, such as social networks and
recommendation systems. All these types expose APIs tailored to their data model, which
make them efficient for unstructured or semi-structured data.
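A minimal sketch of this schema-free design (the database name, collection name, and documents are hypothetical, and a reachable MongoDB instance is assumed) shows how records with different structures can coexist in a single MongoDB collection:

from pymongo import MongoClient

# Connect to a local MongoDB instance (connection details are assumptions).
client = MongoClient("mongodb://localhost:27017")
collection = client["gis_demo"]["features"]

# Schema-free design: these documents have different fields and nesting,
# yet live in the same collection without any prior schema definition.
collection.insert_many([
    {"type": "node", "id": 507464742, "tags": {"VRS:gemeinde": "BONN"}},
    {"type": "way", "id": 891093714, "nodes": [99672363, 99668853]},
])

print(collection.find_one({"type": "node"}))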
MySQL
SQL, or Structured Query Language, serves as the cornerstone of database management systems
(DBMS) and is instrumental in managing, manipulating, and retrieving data stored in relational
databases. Originally developed in the 1970s by IBM, SQL has evolved into a standard language
widely adopted across various database platforms.
Within the context of GIS data, PostGIS seems a valid option. PostGIS is the most efficient open-
source solution for managing geospatial data. Similarly to open-source databases, it supports all
the standard geometry data types (e.g., Point, MultiPoint,
Polygon, MultiPolygon) and all the standard geospatial operators (e.g., Distance, Within,
Intersects, Closest). It also supports three types of spatial indexes: B-trees (binary trees), R-trees
(sub-rectangles trees) and GiST (Generalized Search Trees) to speed up the execution of spatial
queries. (Baralis et al., 2017, p. 3). PostGIS, widely regarded as the leading open-source
solution for geospatial data management, is highly efficient and supports the standard geometry
formats: point, multipoint, polygon, and multipolygon. It also provides a set of geospatial
operators, including Distance, Within, Intersects, and Closest, which enable precise spatial
operations such as complex georeferencing and proximity-based recommendations. PostGIS further
optimizes spatial queries using three forms of index, B-trees (binary trees), R-trees
(sub-rectangle trees), and GiST (Generalized Search Trees), which significantly speed up spatially
complex queries. The integration of these indexing constructs allows spatial data to be managed
easily and effectively even in complicated and varied spatial settings, making PostGIS a versatile
and effective tool for urban planning, environmental conservation, and logistics. In a nutshell,
SQL is a standard computer
language for maintaining and utilizing data in relational databases. Put simply, SQL is a language
that lets users interact with relational databases (Rockoff, 2021, 8). SQL databases use a
structured, table-based format where data is stored in rows and columns. Each table represents a
different type of entity with attributes, and relationships between tables are defined by foreign
keys. SQL databases adhere to ACID (Atomicity, Consistency, Isolation, Durability) properties,
ensuring reliable processing of transactions.
This makes them highly suitable for applications requiring strict data integrity. SQL requires a
predefined schema to determine the structure of data before any data is added to the database. This
schema defines the tables, columns, data types, and relationships. SQL statements fall into
several categories: SQL schema statements, which define these data structures; SQL data
statements, which are used to manipulate the data structures previously defined using schema
statements; and SQL transaction statements, which are used to begin, end, and roll back
transactions (Beaulieu, 2020, 9).
In the context of Geographic Information Systems (GIS), MySQL can manage spatial data
effectively through its spatial data types and functions that comply with the Open Geospatial
Consortium (OGC) standards. These spatial features in MySQL allow users to store geographical
data as points, lines, and polygons. For instance, urban planners might use MySQL to store and
query spatial data about land use, infrastructure projects, or demographic distributions.
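As a hedged sketch of these capabilities (it assumes a running MySQL 8.0 server reachable through the mysql-connector-python driver, and the table and column names are hypothetical), spatial values can be stored, indexed, and queried as follows:

# Hedged sketch: assumes a running MySQL 8.0 server and the
# mysql-connector-python driver; table/column names are hypothetical.
import mysql.connector

conn = mysql.connector.connect(
    host="localhost", user="root", password="...", database="gis_demo")
cur = conn.cursor()

# OGC-style geometry column; a spatial index requires NOT NULL.
cur.execute("""
    CREATE TABLE IF NOT EXISTS places (
        id INT PRIMARY KEY,
        name VARCHAR(100),
        location POINT NOT NULL SRID 4326,
        SPATIAL INDEX (location)
    )""")

# In MySQL 8.0, SRID 4326 geometries use latitude-longitude axis order.
cur.execute("""
    INSERT INTO places VALUES
        (1, 'Bonn node', ST_GeomFromText('POINT(50.748 7.178)', 4326))""")
conn.commit()

# Great-circle distance in metres between stored points and a query point.
cur.execute("""
    SELECT name,
           ST_Distance_Sphere(location,
                              ST_GeomFromText('POINT(50.75 7.17)', 4326))
    FROM places""")
print(cur.fetchall())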
SQL is less commonly used to handle geospatial data types due to its rigid schemas, but “it has
also been proven that SQL databases sometimes outperform NoSQL databases, and that the data
model has little correlation with the actual performance” (Bhogal & Choksi, 2015, 398).

MongoDB
MongoDB is a NoSQL database that provides high performance, high availability, and easy
scalability. MongoDB stores data in flexible, JSON-like documents, meaning fields can vary
from document to document and data structure can be changed over time. This model makes the
integration of data in certain types of applications easier and faster.
MongoDB falls under the category of NoSQL databases that allow flexibility in storing as well
as performing operations on data. An example of a NoSQL database is MongoDB, a document-oriented
database written in C++. An entry in the database is stored as an object which in turn is
serialized as JSON, XML, or BSON. The objects do not need to have the same structure, fields, or
types of fields, which makes the database flexible (Li & Manoharan, 2013). This stems from the
fact that the application is able to dynamically change its data model without maintaining a
strict schema that may be difficult to use or maintain when faced with changing needs or
real-time processing requirements. MongoDB can thus evolve heterogeneous data collections on the
fly without incurring the overhead of costly schema migrations. This fits well with agile
development, especially where data requirements are complex or uncertain, as in content
management systems, IoT, or big data analytics. Currently, MongoDB uses GeoJSON objects
to store spatial geometries. GeoJSON is an open-source specification for the JSON formatting
of shapes in a coordinate space. The GeoJSON spec is used in the geospatial community and
there is growing library support in most popular languages. Each GeoJSON document (or sub
document) is generally composed of two fields:
1. Type – the shape being represented, which informs a GeoJSON reader how to
interpret the “coordinates” field.
2. Coordinates – an array of points, the specific arrangement of which is determined
by the “type” field (Agarwal, n.d., p. 39). However, MongoDB does not support R-trees.

MongoDB's use of GeoJSON objects for storing spatial geometries shows that it adheres to the
open-source GeoJSON specification, the standard across the geospatial community. GeoJSON is a
format for encoding shapes over a coordinate system; among other benefits, this increases
interoperability, simplifying data sharing across systems and programming languages with growing
library support. Each GeoJSON document, consisting primarily of "type" and "coordinates" fields,
provides a clear structure: the type field identifies the geometry object (such as Point,
LineString, or Polygon), while the coordinates field defines the points that make up the shape.
This design supports effective storage and document processing, making MongoDB suitable for
spatial uses such as mapping, geolocation services, and spatial analytics. By supporting the
GeoJSON standard, MongoDB delivers efficient and adaptable location-based data management that
satisfies the demands of developers and data scientists performing complex spatial analysis.
MongoDB supports geospatial queries and indexing, which makes it suitable for handling GIS data
that is less structured or rapidly evolving. MongoDB's geospatial features include support for
GeoJSON and legacy coordinate pairs, allowing for efficient querying of location-based data.
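As a short sketch of these features (the collection and field names are hypothetical, and a reachable MongoDB instance is assumed), a GeoJSON point can be stored, indexed with a 2dsphere index, and queried by proximity:

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
coll = client["gis_demo"]["places"]

# A GeoJSON document: "type" tells the reader how to interpret "coordinates".
coll.insert_one({
    "name": "Bonn node",
    "location": {"type": "Point", "coordinates": [7.178, 50.748]},  # [lng, lat]
})

# A 2dsphere index enables geospatial operators on spherical geometry.
coll.create_index([("location", "2dsphere")])

# Find documents within 1 km of a query point.
nearby = coll.find({
    "location": {
        "$near": {
            "$geometry": {"type": "Point", "coordinates": [7.17, 50.75]},
            "$maxDistance": 1000,  # metres
        }
    }
})
for doc in nearby:
    print(doc["name"])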
Method

A comparison between MySQL and MongoDB databases will be conducted using a geospatial
dataset under identical conditions. The dataset will be loaded, and queries executed using Python
within a Jupyter notebook environment, facilitating precise measurement and comparison of
query performance between the two databases. This methodology ensures consistent evaluation of
execution times, identification of optimization opportunities, and thorough analysis of each
database's handling of geospatial data. We will answer the research question in this section
through a quantitative approach, as stated above, measuring response times for both databases and
analyzing why one performs better than the other.

Previous Research
A total of five papers served as primary information sources and as inspiration for this
experiment. These papers were found through different search terms, primarily in Google Scholar
and IEEE Xplore, including “Geospatial Data SQL”, “Geospatial Data NoSQL”, “maps MongoDB”,
“maps MySQL”, “geospatial data”, etc.
The two most influential were “SQL versus NoSQL databases for geospatial applications” (Baralis
et al., 2017) and “Performance Analysis of RDBMS and No SQL Databases: PostgreSQL, MongoDB and
Neo4j” (Sharma et al., 2018). The former discusses many databases, both NoSQL and SQL, and
includes descriptions of all supported geospatial data types within each database. Although that
paper did not test MongoDB or MySQL, it compared other NoSQL and SQL databases, offering valuable
insights into evaluating results and the impact of hardware on tests. The latter provides
practical examples of data insertion and handling in both SQL and NoSQL databases, including
MongoDB. While its implementation differed slightly from the intended approach for this thesis,
it still offered useful information on query usage for data insertion and retrieval. These
findings supported the development of database management and data structure strategies.

Hypothesis
The hypothesis formulated in this regard would be:
“MySQL, due to its structured query capabilities and indexing, will perform better in terms of
query execution times for GIS data compared to MongoDB.”
This hypothesis provides a basis for designing experiments and analyzing results. An
experimental approach is chosen because it allows for controlled and systematic comparisons
between MySQL and MongoDB. We will then perform an experiment to observe the effect of the
independent variable (type of database) on the dependent variable (query execution time).

Variables
The variables defined for this process are given below:

● Independent Variable: The type of database system (MySQL vs MongoDB).
● Dependent Variable: Query execution times, used to measure performance.
● Control Variables: Conditions that are kept constant to ensure a fair comparison, such as
hardware specifications, dataset size, and types of queries executed.
Dataset
The dataset in the provided code is being sourced from OpenStreetMap (OSM), a
collaborative mapping project that provides free geospatial data about locations
worldwide. Specifically, this code uses the Overpass API, a powerful tool for querying OSM
data using a custom query language. The query itself is written in the Overpass Query
Language (OQL) and is designed to fetch data within a specified bounding box,
which, in this case, covers coordinates from latitude 50.745 to 50.75 and longitude 7.17 to
7.18. The data is then converted from the JSON format returned by the Overpass API to
GeoJSON using the `osm2geojson` Python library, which makes it suitable for geospatial
analysis and further processing.
Finally, the dataset is loaded into a GeoDataFrame using the `geopandas` library for
geospatial analysis and exported as a CSV file. The code for this step is as follows:
import requests
import osm2geojson
import geopandas as gpd

# Overpass QL query: all nodes in the bounding box, plus their parent ways
# and relations ("<"), followed by the member nodes of those ways (">").
Q = """
[out:json][timeout:25];
(
  node(50.745,7.17,50.75,7.18);
  <;
);
out body;
>;
out skel qt;
"""

# Send an HTTP GET request to the Overpass API with query Q
# (passing the query via params handles URL encoding).
response = requests.get("http://overpass-api.de/api/interpreter", params={"data": Q})
if response.status_code != 200:
    raise RuntimeError(response.content)

# Convert the JSON response to GeoJSON.
geojson_data = osm2geojson.json2geojson(response.json())

# Create a GeoDataFrame from the GeoJSON features and export it to CSV.
G = gpd.GeoDataFrame.from_features(geojson_data['features'])
G.to_csv('datasetop.csv', index=False)
print(G)

Code for Testing sqlite3


After obtaining the dataset from OpenStreetMap using the Overpass API and converting it to a
suitable format, the next step involves loading the data into an SQLite3 database using Python.
SQLite is a lightweight, embedded relational database engine that is
integrated into Python's standard library via the sqlite3 module. By leveraging sqlite3 in
Python, we can efficiently manage relational data without needing to install or configure a separate
database server. The sqlite3 module allows us to execute SQL queries to create tables, insert data,
and perform complex analyses directly on the dataset. In this setup, we use pandas to handle and
manipulate the dataset due to its seamless integration with data frames, which simplifies
converting data formats. The time module is used to measure the execution time of the operations,
which is crucial for performance benchmarking.

import sqlite3
import time

import pandas as pd

# Read data from the CSV file into a DataFrame and load it into SQLite.
df = pd.read_csv('datasetop.csv')
conn = sqlite3.connect('database.db')
df.to_sql('data', conn, if_exists='replace', index=False)

def menu():
    while True:
        print("1. View all data")
        print("2. Search data records")
        print("3. Exit")
        choice = input("Enter your choice: ")
        if choice == '1':
            # Retrieve all rows from the 'data' table, timing the query.
            start = time.time()
            data = pd.read_sql("SELECT * FROM data", conn)
            elapsed_time = time.time() - start
            print(data)
            print(f"Time taken to execute query: {elapsed_time:.2f} seconds")
        elif choice == '2':
            # Retrieve rows whose 'id' contains the search term.
            query = input("Enter your search query for 'id': ")
            sql_query = f"SELECT * FROM data WHERE id LIKE '%{query}%'"
            start = time.time()
            data = pd.read_sql(sql_query, conn)
            elapsed_time = time.time() - start
            print(data)
            print(f"Time taken to execute query: {elapsed_time:.2f} seconds")
        elif choice == '3':
            break
        else:
            print("Invalid choice. Please try again.")

menu()
conn.close()  # Close the database connection.

This code presents a menu of query options to the user and returns different results based on the
user's selection. For example, searching for a record by 'id':
1. View all data
2. Search data records
3. Exit
Enter your choice: 2
Enter your search query for 'id': 507464742
geometry type id \
0 POINT (7.1788184 50.748386) node 507464742
1 POINT (7.1788184 50.748386) node 507464742

tags nodes
0 {'VRS:gemeinde': 'BONN', 'VRS:ortsteil': 'Holz... None
1 None None
Time taken to execute query: 0.00 seconds

1. View all data


2. Search data records
3. Exit
Enter your choice: 2
Enter your search query for 'id': 891093714
geometry type id tags
\
0 LINESTRING (7.2760259 50.7752137, 7.2756679 50... way 891093714 None

nodes
0 [99672363, 99668853, 1936339964, 8282081287, 2...
Time taken to execute query: 0.02 seconds

Next, we retrieve all the data:

1. View all data
2. Search data records
3. Exit
Enter your choice: 1

                                               geometry  type         id \
0                          POINT (7.1759171 50.7498233)  node  507464720
1                           POINT (7.1788184 50.748386)  node  507464742
2                          POINT (7.1737725 50.7454101)  node  507464745
3                            POINT (7.1727814 50.74522)  node  507464751
4                           POINT (7.1740973 50.745012)  node  507464813
...                                                 ...   ...        ...
1456  LINESTRING (7.2836635 50.7739565, 7.2834752 50...   way   27895260
1457  LINESTRING (7.2797154 50.7749734, 7.2795721 50...   way   11207933
1458  LINESTRING (7.2799039 50.7758648, 7.2799002 50...   way   27895259
1459  LINESTRING (7.2763874 50.775189, 7.2763612 50....   way  378660319
1460  LINESTRING (7.2760259 50.7752137, 7.2756679 50...   way  891093714

                                                   tags \
0     {'VRS:gemeinde': 'BONN', 'VRS:ortsteil': 'Holz...
1     {'VRS:gemeinde': 'BONN', 'VRS:ortsteil': 'Holz...
2     {'VRS:gemeinde': 'BONN', 'VRS:ortsteil': 'Holz...
3     {'VRS:gemeinde': 'BONN', 'VRS:ortsteil': 'Holz...
4     {'VRS:gemeinde': 'BONN', 'VRS:ortsteil': 'Holz...
...                                                 ...
1456                                               None
1457                                               None
1458                                               None
1459                                               None
1460                                               None

                                                  nodes
0                                                  None
1                                                  None
2                                                  None
3                                                  None
4                                                  None
...                                                 ...
1456  [300758079, 516363187, 9155787455, 7773004753,...
1457  [99668851, 99672560, 1476849747, 978635870, 11...
1458  [100697013, 7827934685, 99668849, 1585742366, ...
1459  [3723260306, 632714853, 99672363]
1460  [99672363, 99668853, 1936339964, 8282081287, 2...

[1461 rows x 5 columns]
Time taken to execute query: 0.05 seconds
Performance analysis of sqlite3
With the implemented solution, times of 0.02 seconds for retrieving a record by 'id' and 0.05
seconds for retrieving all records via sqlite3 indicate strong read performance, assured mainly
by the proper data structure and indexing. SQLite, an embedded lightweight database, is designed
for fast retrieval, which makes it a natural pick for applications with an average data workload.
Its split-second reaction to queries reflects effective use of indexing for fast lookups and
specific searches. This makes sqlite3 an efficient choice wherever fast data access and flexible
querying are required.

Code for testing MongoDB


The code below leverages multiple libraries to facilitate efficient data processing, database
operations, and performance measurement. The pandas library is utilized for manipulating data:
the original dataset is read from a CSV file and sampled every nth row to create a smaller, more
manageable dataset. This sampled data is then saved to another CSV file and converted into a list
of dictionaries for database insertion. The pymongo library acts as the Python driver for
MongoDB, enabling seamless communication with the MongoDB cluster. It is used to insert the
sampled data into a collection and execute queries to find and retrieve specific records.
Additionally, the time module is incorporated to measure query execution times, providing
insights into MongoDB's performance. Finally, bson.objectid.ObjectId is imported to manage the
ObjectId data type, which uniquely identifies documents in MongoDB. Together, these libraries
streamline the process of loading, sampling, and querying geospatial data efficiently.
import time

import pandas as pd
from pymongo import MongoClient
from bson.objectid import ObjectId  # ObjectId uniquely identifies MongoDB documents

# Set up the MongoDB client (the connection string is environment-specific).
client = MongoClient("mongodb://localhost:27017")
collection = client["gis"]["data"]

# Read the dataset and sample every nth row to create a smaller dataset.
# (The sampling interval n is an assumption; the original does not specify it.)
n = 2
df = pd.read_csv('datasetop.csv')
sampled_df = df.iloc[::n]
sampled_df.to_csv('sampled_datasetop.csv', index=False)

# Insert the sampled data into MongoDB as a list of dictionaries.
data = sampled_df.to_dict('records')
collection.insert_many(data)
print("Data inserted into MongoDB")

def menu():
    while True:
        print("1. View all data")
        print("2. Search data records")
        print("3. Exit")
        choice = input("Enter your choice: ")
        if choice == '1':
            # Retrieve and print every document, timing the query.
            start = time.time()
            for doc in collection.find():
                print(doc)
            elapsed_time = time.time() - start
            print(f"Time taken to execute query: {elapsed_time:.2f} seconds")
        elif choice == '2':
            # Retrieve documents matching the requested 'id'.
            query = input("Enter your search query for 'id': ")
            start = time.time()
            for doc in collection.find({"id": int(query)}):
                print(doc)
            elapsed_time = time.time() - start
            print(f"Time taken to execute query: {elapsed_time:.2f} seconds")
        elif choice == '3':
            break
        else:
            print("Invalid choice. Please try again.")

menu()
Performance analysis of pymongo
Fetching all records took 1.23 seconds, while fetching a record by id took only 0.14 seconds.
The difference in performance is explained by the nature of the queries themselves: retrieving
the full record set involves scanning and printing the entire collection, and consequently takes
longer to execute, whereas an id lookup matches only specific documents.

Discussion
Comparing both databases, it's clear that SQLite3 significantly outperforms MongoDB in terms
of query execution time, and the reasons quickly become apparent upon closer
inspection. SQLite3's speed advantage likely comes from its lightweight, embedded nature, which
allows it to work entirely in memory, reducing the usual overhead of disk I/O. Its highly efficient
indexing also minimizes the path to relevant data, ensuring fast lookups, especially in single-user
environments where contention and concurrency aren't issues.
On the other hand, MongoDB's slower performance seems to stem from its distributed design,
which, while enabling it to handle massive datasets, also introduces network
latency and computational overhead as it manages data partitions and coordinates results across
clusters. Furthermore, the support for multiple concurrent operations and
ACID-compliant transactions, though crucial for complex data management, increases
resource consumption and further slows execution. Thus, while MongoDB excels at
large-scale, distributed applications, SQLite3 stands out in scenarios requiring swift,
lightweight data retrieval. SQLite3 significantly outperformed MongoDB in terms of query
execution time, partly because the dataset chosen for the experiment had a table-like structure
rather than a document structure. This was established through multiple benchmark tests in which
SQLite3 consistently exhibited faster query responses. The lightweight, embedded nature of
SQLite3 lets it operate largely in memory, reducing disk I/O overhead, and because it ships as a
module in Python's standard library, no separate database server had to be installed or
configured on the device; the features of the library were simply leveraged. Its efficient
indexing further enhances data retrieval speed. In contrast, MongoDB's slower performance is
attributed to its distributed design, which introduces network latency and computational
overhead. The tests included 10 measurements for each database, demonstrating a statistically
significant difference in execution times under controlled conditions: across all 10 runs, the
code returned consistent timings while performing the desired user actions.
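A minimal sketch of such a benchmark loop (the repetition count of 10 follows the text; the helper name and query are illustrative, not the exact script used in the experiment) might look like this:

import sqlite3
import statistics
import time

import pandas as pd

conn = sqlite3.connect('database.db')

def time_query(run_query, repeats=10):
    # Run the query `repeats` times and report mean and standard deviation.
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query()
        timings.append(time.perf_counter() - start)
    return statistics.mean(timings), statistics.stdev(timings)

# Example: time the "view all data" query against the SQLite database.
mean_s, stdev_s = time_query(lambda: pd.read_sql("SELECT * FROM data", conn))
print(f"SQLite full scan: mean {mean_s:.4f} s, stdev {stdev_s:.4f} s over 10 runs")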
Conclusion
The comparison of the SQLite3 and MongoDB database systems shows that each has its own
outstanding strengths and is best suited to distinct classes of applications. SQLite3's in-memory
operation, efficient indexing, and lightweight architecture give it the speed advantage in fast
data retrieval, in contrast to large-scale applications where distributed reading and writing
predominate. MongoDB's distributed design and rich functionality, on the other hand, suit big
data management, high-performance computing across massive datasets, and parallel transactions.
Yet these very elements introduce additional network latency and complicate the path to answering
a query. The final decision then depends on the application being served, as the demands differ
from one application to another. SQLite3 is better for situations where speed and simplicity are
the prime objectives, whereas MongoDB is more appropriate for applications that require
distributed, scalable, high-performance data management. Understanding these differences is
important, as it ensures that each database is used according to its strengths and delivers the
best performance for its use case.
References
● Agarwal, S. (n.d.). Performance analysis of MongoDB vs. PostGIS/PostgreSQL databases for
line intersection and point containment spatial queries. ScholarWorks@UMass Amherst.
https://scholarworks.umass.edu/foss4g/vol15/iss1/50/
● Baralis, E., Dalla Valle, A., Garza, P., Rossi, C., & Scullino, F. (2017, December). SQL
versus NoSQL databases for geospatial applications. 2017 IEEE International Conference on
Big Data (Big Data). http://dx.doi.org/10.1109/bigdata.2017.8258324
● Beaulieu, A. (2020). Learning SQL: Generate, Manipulate, and Retrieve Data. O'Reilly Media.
Accessed 17 April 2024.
● Bhogal, J., & Choksi, I. (2015). Handling Big Data using NoSQL. pp. 393–398.
https://doi.org/10.1109/WAINA.2015.19
● Heywood, I., et al. (2010). An Introduction to Geographical Information Systems (3rd ed.).
Pearson Education.
● Khalilizangelani, Y., & Ghaffarian, S. (n.d.). A Study of Geospatial Data Processing Based
on Cloud Computing. https://doi.org/10.13140/2.1.4307.9680
● Li, Y., & Manoharan, S. (2013). A Performance Comparison of SQL and NoSQL Databases. 2013
IEEE Pacific Rim Conference on Communications, Computers and Signal Processing (PACRIM).
http://dx.doi.org/10.1109/pacrim.2013.6625441
● Longley, P., et al. (2015). Geographic Information Science and Systems. John Wiley & Sons.
● Rajesh, T., & Sreekumar, E. S. (2015). Database Selection for Big Data Analytics: A
Comparative Study of Hadoop, MongoDB, and Cassandra. International Journal of Advanced
Research in Computer Science and Software Engineering, 1–8.
● Rockoff, L. (2021). The Language of SQL. Addison-Wesley. Accessed 17 April 2024.
● Sahatqija, K., Ajdari, J., Zenuni, X., Raufi, B., & Ismaili, F. (2018, May). Comparison
between relational and NOSQL databases. 2018 41st International Convention on Information
and Communication Technology, Electronics and Microelectronics (MIPRO).
http://dx.doi.org/10.23919/mipro.2018.8400041
● Sharma, M., Sharma, V. D., & Bundele, M. M. (2018, November). Performance analysis of RDBMS
and NoSQL databases: PostgreSQL, MongoDB and Neo4j. 2018 3rd International Conference and
Workshops on Recent Advances and Innovations in Engineering (ICRAIE).
http://dx.doi.org/10.1109/icraie.2018.8710439
● Thakur, N., & Gupta, N. (n.d.). Relational and Non Relational Databases: A Review. Journal
of University of Shanghai for Science and Technology, 23(8), 2–3.
