sample

BIG DATA DATABASE DEVELOPMENT
ASSIGNMENT 2
Asim Khan: 100076269
Muhammad Salman: 100074336

Contents
PART 1
1. Executive Summary
2. Introduction
3. Methodology
3.1. Critical Analysis of Asim Khan Database (ERD)
3.2. Critical Analysis of Muhammad Salman Database (ERD)
3.3. Critical Analysis for Raphael Umoh Onofiok Database (ERD)
3.4. Final ERD Diagram
3.5. Source Code and Video Demonstration
3.6. Rationale for Final Database
3.6.1. Documentation and Metadata
3.6.2. Inconsistent Data Structure
3.6.3. Lack of Data Integration
3.6.4. Limited Data Coverage
3.6.5. Lack of Data Governance
3.6.6. Inefficient Data Storage
3.6.7. Limited Error Handling
3.6.8. Scalability
3.6.9. Performance
3.6.10. Security
3.6.11. Flexibility
3.7. Evaluation in terms of Usability and Utility
PART 2
1. Queries
2. Results
3. Work Division
4. Conclusion
References
PART 1
1. Executive Summary
This report presents the findings of a two-part big data and database management assignment. In
Part 1, a database structure was created by combining the group members' ERDs
(Entity-Relationship Diagrams). In Part 2, a set of query operations was completed successfully
on the tourPedia database. The exercise provided hands-on practice with datasets and a variety
of data manipulation and querying techniques.
2. Introduction
Organisations today must deal with enormous amounts of data, which has made big data databases
increasingly important. Thanks to technological advancements and the availability of high-speed
internet, data is being produced at an unprecedented rate. For organisations to stay
competitive, managing this data and gaining insights from it has become essential. This has led
to the emergence of big data technologies such as MongoDB, a NoSQL document-oriented database
that is commonly used for big data applications (Chong & Fan, 2019).
This report covers the implementation of a big data database in MongoDB, created by merging the
individual databases of each group member. It gives an overview of each member's database,
describes how these databases were integrated, discusses the decisions made during
implementation, and offers a critical assessment of the chosen database solution. It also
assesses the finished database's usability and utility from the viewpoint of the end user
(Diogo & Santos, 2020).
The report aims to give readers a thorough grasp of how to create a big data database using
MongoDB and the advantages it may offer businesses. It is intended to assist people and
organisations considering a big data database solution to meet their data management
requirements.
3. Methodology
The big data database was built collaboratively in MongoDB, with each group member contributing
their individual database design from Assessment 1. To find improvement opportunities and
ensure the separate databases were compatible with MongoDB, each database was evaluated and
reviewed in detail.
3.1. Critical Analysis of Asim Khan Database (ERD)

Most of the entities have been identified in this ERD. Its important entities include
Online_Bookings, Referrals, and Insurance Provider, which give this ERD its distinctive aspects.
3.2. Critical Analysis of Muhammad Salman Database (ERD)

This ERD contains some unique entities: Medical_Test, Therapy_Session, Therapy_Type, and
Patient Vital Signs. We can use these entities to expand the scope of our final database
further.
Adding the Medical_Test, Therapy_Session, Therapy_Type, and Patient Vital Signs entities
greatly increases the scope and functionality of the final database solution. These entities
offer extra channels for healthcare data collection and processing, which is crucial for
supporting clinical activities like disease diagnosis, chronic condition management, and
patient progress tracking.
Many-to-many relationships are resolved by including a junction table or alternate entity, an
elegant technique that represents the intricacy of the relationships between the entities. This
method supports correct data recording and reduces the chance of duplicate or inconsistent
data.
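As an illustration of the junction-entity idea, the sketch below uses plain JavaScript arrays. It assumes, purely for illustration, that Patient and Therapy_Type are linked many-to-many through Therapy_Session; the field names, sample values, and helper function names are hypothetical and are not taken from the ERD.

```javascript
// Hypothetical sample data: entity names follow the ERD, field names are assumptions.
const patients = [
  { _id: 1, name: "Patient A" },
  { _id: 2, name: "Patient B" },
];
const therapyTypes = [
  { _id: 10, label: "Physiotherapy" },
  { _id: 11, label: "Speech therapy" },
];
// Junction entity: one record per (patient, therapy type) pairing.
const therapySessions = [
  { patient_id: 1, therapy_type_id: 10 },
  { patient_id: 1, therapy_type_id: 11 },
  { patient_id: 2, therapy_type_id: 10 },
];

// Resolve the link in one direction: therapy type labels for one patient.
function therapyLabelsForPatient(patientId) {
  const typeIds = therapySessions
    .filter((s) => s.patient_id === patientId)
    .map((s) => s.therapy_type_id);
  return therapyTypes.filter((t) => typeIds.includes(t._id)).map((t) => t.label);
}

// And in the other direction: patient names for one therapy type.
function patientNamesForTherapy(typeId) {
  const patientIds = therapySessions
    .filter((s) => s.therapy_type_id === typeId)
    .map((s) => s.patient_id);
  return patients.filter((p) => patientIds.includes(p._id)).map((p) => p.name);
}
```

Because each pairing is stored once in the junction records, neither the patient data nor the therapy type data has to be repeated to express the relationship.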
The association names and cardinalities offer a precise and thorough representation of the
connections between the database's elements. This feature makes it easier to navigate between
related entities and ensures that data is correctly associated with the appropriate entity,
improving data usability and streamlining data processing.

3.3. Critical Analysis for Raphael Umoh Onofiok Database (ERD)
This ERD places two association names on a single association line connecting two entities, and
it has no direct relationship between Doctor and Patient. These flaws can be corrected. In
addition, it contains no unique entity that could extend the scope of our final database.
For improved data modelling, these flaws must be fixed. First, where a single association line
connecting two entities lists two association names, the connections between the entities
become unclear and confusing. Second, there is no clear connection between the entities Doctor
and Patient. Making the links between entities more explicit and avoiding multiple association
names on a single line would give a more accurate and understandable representation of the
data.
The absence of a distinctive entity that can broaden the database's scope is another problem
with this ERD. Entities such as electronic health records and laboratory tests can offer more
functionality and insights for hospital management and decision-making. Such entities would
increase the ERD's utility and efficiency if they were included.
Analysing the ERD critically in this way is crucial for finding flaws and improving the data
model for accurate and efficient data management.

3.4. Final ERD Diagram
3.5. Source Code and Video Demonstration
Source code for the final big data database implementation can be found at this link:
https://github.com/inqlo/asim-hospital/blob/main/asim-new_york_city_hospital.json
The video demonstration can be found here:
https://drive.google.com/file/d/1ws6wNGNCW1atHJIWtH4saRKtIjJqEx0c/view
3.6. Rationale for Final Database
Compared with the earlier ERDs, the final database design, which incorporates the additional
entities Patient Vital Signs, Therapy Session, Therapy Type, and Medical Test, appears to be an
improvement.
The new entities offer more functionality and insights into managing patient health and
treatment, which can be helpful for healthcare practitioners. For instance, the Patient Vital
Signs entity can help medical professionals track a patient's vital signs over time and
identify any changes that call for additional care or treatment. The Therapy Type entity
records the kinds of therapy a patient receives, while the Therapy Session entity enables
doctors to monitor therapy sessions and their results. Finally, the Medical Test entity keeps a
record of the medical tests carried out, which can help physicians make precise diagnoses and
decisions on a patient's course of treatment.
The ERD has improved in other respects as well. One potential problem was the absence of
unambiguous links between some of the entities. The relationship between the Hospital entity
and the new entities, for instance, is now clear.
Overall, adding the new entities has increased the database's capabilities and the accuracy and
level of detail with which patient health and treatment data are managed. The ERD can be
refined further if needed, so that it remains clear and well organised, with unambiguous entity
and attribute names and clear links between entities.
3.6.1. Documentation and Metadata
The solution contains thorough documentation and metadata that facilitate efficient database
maintenance and understanding, improve data usability, and support accurate interpretation
and use of recorded information.
3.6.2. Inconsistent Data Structure
The final database solution establishes a consistent naming convention and data representation,
enhancing clarity, usability, and manipulation of data, thereby reducing confusion and mistakes.
3.6.3. Lack of Data Integration
By acting as a central repository for diverse healthcare-related data, the database system
supports data exchange and integration with other systems such as electronic health records
(EHR), laboratory systems, or billing systems (Patel & Patel, 2018).
3.6.4. Limited Data Coverage
The final database solution aims to provide comprehensive data coverage by including essential
entities and attributes relevant to healthcare operations, laying the foundation for capturing and
managing a wide range of healthcare-related data (Hersh et al., 2013).
3.6.5. Lack of Data Governance
The database solution implements suitable methods for data quality, privacy, and compliance,
enforcing data governance standards, defining data stewardship responsibilities, and executing
best practices for data management.

3.6.6. Inefficient Data Storage
The final database solution optimizes data storage by making use of effective storage methods
like indexing, compression, and partitioning, maximizing storage efficiency, and reducing
storage costs (Hersh et al., 2013).
3.6.7. Limited Error Handling
The database solution includes techniques for handling errors, exceptions, and inconsistent
data. This error management maintains data integrity, system stability, and user confidence in
the accuracy of the stored information.
3.6.8. Scalability
The final database solution is designed to scale with the growth of the healthcare organization. It
can handle a significant amount of data and can accommodate the addition of new entities as the
organization expands. This scalability ensures that the database can meet the evolving needs of
the healthcare organization (Baig & Khan, 2017).
3.6.9. Performance
The final database solution is optimized for performance, ensuring that queries are executed
efficiently and quickly. The solution can process large amounts of data without slowing down,
ensuring that healthcare professionals can access the information they need in a timely manner.
3.6.10. Security
The final database solution includes robust security measures to protect sensitive healthcare data
from unauthorized access. The solution implements authentication and authorization mechanisms
to ensure that only authorized personnel can access the data. It also provides data encryption to
protect against data breaches and cyber-attacks (Diogo & Santos, 2020).
3.6.11. Flexibility
The final database solution is designed to be flexible, allowing for customization based on the
specific needs of the healthcare organization. The solution can be tailored to include additional
entities, attributes, and relationships as required, ensuring that it can adapt to the evolving needs
of the organization (Masic, Pandza, & Kulasin, 2014).
3.7. Evaluation in terms of Usability and Utility
The JSON structure made available to healthcare organisations is significant because it offers
a uniform format for storing and organising crucial medical data. Its use of simple key-value
pairs keeps data accessible and processable, making it simple for healthcare providers to
manage patient care properly.
The versatility of the JSON format, which enables healthcare providers to store diverse types
of data in one place, is a key benefit. This helps them manage patient information effectively
and make informed decisions about care. Additionally, JSON's hierarchical structure ensures
that related data, such as patient referrals and bookings, is grouped together, making it
simpler to track and manage patient care.
In terms of utility, the JSON structure offers healthcare organisations a complete patient care
management solution. They are able to accurately track patient data, including medical history,
treatment plans, prescriptions, and billing data. This helps healthcare professionals give
patients better treatment and ensure that all parts of their care are managed effectively.
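To make the key-value and hierarchical grouping concrete, here is a minimal sketch of what one patient document might look like. Every field name and value below is hypothetical and chosen only for illustration; the actual structure is the one in the linked source file.

```javascript
// Hypothetical patient document: all field names and values are illustrative assumptions.
const patientDoc = {
  _id: "p-001",
  name: "Example Patient",
  // Related records are nested under the patient they belong to.
  bookings: [
    { date: "2023-01-10", department: "Cardiology" },
    { date: "2023-02-03", department: "Radiology" },
  ],
  referrals: [{ referredBy: "Dr. X", referredTo: "Cardiology" }],
};

// A JSON round trip keeps the nested structure intact.
const restored = JSON.parse(JSON.stringify(patientDoc));

// Key-value access: list the departments of all bookings for this patient.
const departments = restored.bookings.map((b) => b.department);
```

Keeping bookings and referrals inside the patient document means a single read returns everything needed to follow that patient's care, at the cost of repeating shared details (such as department names) across documents.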
PART 2
1. Queries
1. Give the name of the places whose category is “accommodation”
db.paris.find({ category: "accommodation" }, { name: 1 })
2. Give the name and phone number of places with a phone number entered ($exists, $ne);
db.paris.find(
  { "contact.phone": { $exists: true, $ne: null } },
  { "name": 1, "contact.phone": 1 }
)
3. Name and contacts of places with “website” and “Foursquare” provided;
db.paris.find(
  {
    $and: [
      { "contact.website": { $exists: true, $ne: null } },
      { "contact.Foursquare": { $exists: true, $ne: null } }
    ]
  },
  { "name": 1, "contact": 1 }
)
4. Name of places whose name contains the word “hotel” (pay attention to case);
db.paris.find(
  { "name": { $regex: "hotel", $options: "i" } },
  { "name": 1 }
)
5. Name and services of places offering 5 services;

db.paris.find(
{ "services": { $size: 5 } },
{ "name": 1, "services": 1 })
6. Categories of places with at least a rating (reviews.rating) of 4 or more;

db.paris.distinct("category", { "reviews.rating": { $gte: 4 } })
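The filter in query 6 relies on how MongoDB matches a dotted path that runs through an array of embedded documents: a place matches as soon as at least one of its reviews satisfies the condition. The sketch below mimics that behaviour in plain JavaScript on made-up documents. It does not call MongoDB, and the sample data and the function name are assumptions.

```javascript
// Made-up sample documents, shaped like the fields named in the question.
const places = [
  { name: "A", category: "restaurant", reviews: [{ rating: 2 }, { rating: 5 }] },
  { name: "B", category: "poi", reviews: [{ rating: 3 }] },
  { name: "C", category: "accommodation", reviews: [] },
  { name: "D", category: "restaurant", reviews: [{ rating: 4 }] },
];

// Keep places with at least one review rated >= minRating, then list distinct categories.
function distinctCategoriesWithRatingAtLeast(docs, minRating) {
  const matching = docs.filter((d) =>
    (d.reviews || []).some((r) => r.rating >= minRating)
  );
  // Sorted here only to make the output deterministic for this sketch.
  return [...new Set(matching.map((d) => d.category))].sort();
}

const result = distinctCategoriesWithRatingAtLeast(places, 4);
```

In this sample, place "A" matches even though one of its reviews is rated 2, because another review is rated 5; a place with no reviews never matches.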
7. GPS coordinates of places whose address contains “rue de rome”;

db.paris.find(
  { "location.address": { $regex: "rue de rome", $options: "i" } },
  { "location.coord.coordinates": 1, _id: 0 }
)
8. Distinct list of category

db.paris.distinct("category")
9. Distinct list of services

db.paris.distinct("services")
10. For each “poi” category place name, give the number of reviews whose source
(reviews.source) is “Facebook”. Sort in descending order;
db.paris.aggregate([
  { $match: { category: "poi" } },
  // One document per review, so that only Facebook reviews are counted
  { $unwind: "$reviews" },
  { $match: { "reviews.source": "Facebook" } },
  { $group: { _id: "$name", reviewCount: { $sum: 1 } } },
  { $sort: { reviewCount: -1 } }
])
11. For each place name in the “restaurant” category, give the average rating and
the number of comments.
db.paris.aggregate([
  { $match: { category: "restaurant" } },
  // One document per review, so that $avg receives numeric ratings
  { $unwind: "$reviews" },
  {
    $group: {
      _id: "$name",
      averageRating: { $avg: "$reviews.rating" },
      commentCount: { $sum: 1 }
    }
  }
])
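The per-name statistics asked for in questions 10 and 11 can also be checked outside MongoDB. The sketch below is plain JavaScript on made-up documents; the place names, ratings, and helper function names are assumptions and are not part of the tourPedia data. It reads question 10 as counting only the reviews whose source is "Facebook", and it flattens the reviews first so that each review is counted or averaged individually.

```javascript
// Made-up sample documents, shaped like the fields named in the questions.
const places = [
  { name: "Place A", category: "poi",
    reviews: [{ source: "Facebook", rating: 5 }, { source: "Foursquare", rating: 4 }] },
  { name: "Place B", category: "poi",
    reviews: [{ source: "Facebook", rating: 4 }, { source: "Facebook", rating: 3 }] },
  { name: "Place C", category: "restaurant",
    reviews: [{ source: "Facebook", rating: 2 }, { source: "Foursquare", rating: 4 }] },
];

// Flatten to one row per (place, review), similar in spirit to an unwind step.
function unwindReviews(docs) {
  return docs.flatMap((d) =>
    (d.reviews || []).map((r) => ({ name: d.name, category: d.category, review: r }))
  );
}

// Question 10: number of Facebook reviews per "poi" place name, descending.
function facebookReviewCounts(docs) {
  const counts = new Map();
  for (const row of unwindReviews(docs)) {
    if (row.category === "poi" && row.review.source === "Facebook") {
      counts.set(row.name, (counts.get(row.name) || 0) + 1);
    }
  }
  return [...counts]
    .map(([name, reviewCount]) => ({ name, reviewCount }))
    .sort((a, b) => b.reviewCount - a.reviewCount);
}

// Question 11: average rating and number of comments per "restaurant" place name.
function restaurantStats(docs) {
  const acc = new Map();
  for (const row of unwindReviews(docs)) {
    if (row.category !== "restaurant") continue;
    const s = acc.get(row.name) || { sum: 0, count: 0 };
    s.sum += row.review.rating;
    s.count += 1;
    acc.set(row.name, s);
  }
  return [...acc].map(([name, s]) => ({
    name,
    averageRating: s.sum / s.count,
    commentCount: s.count,
  }));
}
```

On this sample, "Place B" comes first for question 10 with two Facebook reviews, and "Place C" has an average rating of 3 over two comments. Places without any reviews do not appear in either result, which is a consequence of flattening the reviews first.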
The $lookup stage in MongoDB's aggregation framework performs a left outer join between two
collections that are located in the same database. It can be used to combine data from
different collections based on a common field or expression (MongoDB, 2023).
The syntax of the $lookup stage is:
{
  $lookup: {
    from: <collection>,
    localField: <field>,
    foreignField: <field>,
    as: <outputArray>
  }
}
from specifies the collection to join with.
localField specifies the field from the input collection.
foreignField specifies the field from the joined collection.
as specifies the name of the output array field that will contain the joined documents.
For example:
db.paris.aggregate([
  {
    $lookup: {
      from: "accommodation",
      localField: "_id",
      foreignField: "place_id",
      as: "accommodation"
    }
  }
])
This query uses the $lookup aggregation stage to join the paris collection with the
accommodation collection, matching the _id field of paris against the place_id field of
accommodation. Each document in the final results gains an additional accommodation field
that contains an array of the matched documents from the accommodation collection.
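The left outer join behaviour can be sketched without a database. The plain JavaScript below emulates the four $lookup parameters on in-memory arrays; the sample documents, the rooms field, and the lookup helper itself are assumptions made for illustration and are not MongoDB's implementation.

```javascript
// Made-up documents: "place_id" follows the example query, other fields are assumptions.
const paris = [
  { _id: 1, name: "Place A" },
  { _id: 2, name: "Place B" },
];
const accommodation = [
  { place_id: 1, rooms: 12 },
  { place_id: 1, rooms: 30 },
];

// Emulates from / localField / foreignField / as on in-memory arrays.
function lookup(input, from, localField, foreignField, as) {
  return input.map((doc) => ({
    ...doc,
    // Every input document is kept; unmatched ones get an empty array.
    [as]: from.filter((other) => other[foreignField] === doc[localField]),
  }));
}

const joined = lookup(paris, accommodation, "_id", "place_id", "accommodation");
```

Here "Place A" ends up with two documents in its accommodation array, while "Place B" is still present in the output with an empty array, which is what makes the join a left outer join rather than an inner join.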
2. Results
The JSON data was downloaded from the URL provided and saved locally for later use. To make
the data easier to manage and retrieve, it was imported into MongoDB as the "paris" collection
of the "tourPedia" database. The "paris" collection records crucial information about each
site in fields including _id, contact, name, location, category, description, services, and
reviews.
Filtering, aggregation, lookup, and projection were used in the queries to extract and analyse
the necessary data from the "paris" collection and to perform calculations. The idea of
combining data from several collections was illustrated with a single example query using the
$lookup stage.
3. Work Division
Asim Khan and Muhammad Salman worked as a team to develop the final database solution.
Asim Khan was responsible for the entity relationship diagram (ERD) analysis and the final
database design. He refined the ERD by using association names and cardinalities, and he
resolved many-to-many links by including junction tables and alternative entities. Muhammad
Salman contributed significantly to obtaining the dataset and denormalizing the JSON. He
converted the data into a usable format that would work with the eventual database solution.
Additionally, he made sure the information was correct, comprehensive, and pertinent to
healthcare operations. Asim Khan was also responsible for building the database's framework
and managing queries. He created the database's tables, relationships, and schema, making sure
they were efficient, clear, and consistent, and he created queries that let users access and
change data in the database.
Name               Work Assigned
Asim Khan          Analysing ERD and creating the final Database solution
Muhammad Salman    Denormalization of the JSON and acquisition of the dataset
4. Conclusion
Database architecture and query manipulation were the two key components of big data
processing explored in this report. The first part covered combining entity-relationship
diagrams (ERDs) and creating a unique database schema. The second part, which used actual data
from the "tourPedia" database, concentrated on query manipulation. Together, the integration
of ERDs, the construction of our database schema, and the queries on the "tourPedia" database
illustrated the use of big data principles. This exercise enhanced our knowledge of database
design and query execution while highlighting the importance of data management and analysis
in real-world contexts. This knowledge and skill set will be extremely useful for upcoming
data-driven projects and analysis.
References
Amazon Web Services. (2023). Amazon DynamoDB. https://aws.amazon.com/dynamodb/
Apache Cassandra. (2023). Apache Cassandra. https://cassandra.apache.org/
Apple Inc. (2023). Core Data. https://developer.apple.com/documentation/coredata
Baig, S. A., & Khan, A. (2017). A comparative analysis of NoSQL databases. International
Journal of Computer Science and Information Security, 15(6), 74-79.
Chong, E., & Fan, J. (2019). NoSQL databases for big data applications: An overview. Big Data
Research, 15, 1-11. https://doi.org/10.1016/j.bdr.2018.07.006
Couchbase. (2023). Couchbase Server. https://www.couchbase.com/products/server

Diogo, C. F., & Santos, M. Y. (2020). The use of NoSQL databases in healthcare systems: A
systematic literature review. International Journal of Medical Informatics, 141, 104180.
https://doi.org/10.1016/j.ijmedinf.2020.104180
Hersh, W. R., Weiner, M. G., Embi, P. J., Logan, J. R., Payne, P. R., Bernstam, E. V., ... &
Lehmann, H. P. (2013). Caveats for the use of operational electronic health record data in
comparative effectiveness research. Medical Care, 51(8 Suppl 3), S30-37.
Masic, I., Pandza, H., & Kulasin, I. (2014). Importance of medical databases and registries in
contemporary medicine. Acta Informatica Medica, 22(5), 320-325.
Mehta, S., Sankaranarayanan, R., & Varadarajan, S. (2019). A comparative study of NoSQL
databases for big data applications. International Journal of Computer Science and Information
Technology Research, 7(1), 1-12.
MongoDB. (2023). MongoDB Documentation. https://docs.mongodb.com/
Patel, R., & Patel, D. (2018). A comparative study of relational and NoSQL databases.
International Journal of Engineering Research and Technology, 7(6), 396-400.
Red Hat. (2023). MongoDB. https://www.mongodb.com/
