
BIG DATA DATABASE DEVELOPMENT

ASSIGNMENT 2

Asim Khan: 100076269

Muhammad Salman: 100074336


Contents

PART 1
1. Executive Summary
2. Introduction
3. Methodology
3.1. Critical Analysis of Asim Khan Database (ERD)
3.2. Critical Analysis of Muhammad Salman Database (ERD)
3.3. Critical Analysis for Raphael Umoh Onofiok Database (ERD)
3.4. Final ERD Diagram
3.5. Source Code and Video Demonstration
3.6. Rationale for Final Database
3.6.1. Documentation and Metadata
3.6.2. Inconsistent Data Structure
3.6.3. Lack of Data Integration
3.6.4. Limited Data Coverage
3.6.5. Lack of Data Governance
3.6.6. Inefficient Data Storage
3.6.7. Limited Error Handling
3.6.8. Scalability
3.6.9. Performance
3.6.10. Security
3.6.11. Flexibility
3.7. Evaluation in terms of Usability and Utility
PART 2
1. Queries
2. Results
3. Work Division
4. Conclusion
References

PART 1

1. Executive Summary

This report presents the findings of a two-part big data and database management assignment. In
Part 1, a database structure was created by combining ERDs (Entity-Relationship Diagrams). In
Part 2, the required query operations were completed successfully on the tourPedia database. The
exercise provided hands-on practice with datasets and a variety of data manipulation and
querying techniques.

2. Introduction

Today's world, where organisations must deal with enormous amounts of data, has made the use

of big data databases more crucial. Data is being produced at a rate that has never been seen

before thanks to technological advancements and the availability of high-speed internet. For

organisations to stay competitive, managing this data and gaining insights from it has become

essential. Big data technologies like MongoDB, a NoSQL document-oriented database that is

commonly used for big data applications, have emerged as a result of this (Chong & Fan, 2019).

The implementation of a big data database using MongoDB, which was created by fusing the

unique databases of each group member, will be covered in this report. The report will give an

overview of each member's individual databases, describe how these databases were integrated,

discuss the decisions that were made during implementation, and offer a critical assessment of

the chosen database solution. We will also assess the finished database's usability and utility

from the viewpoint of the end user (Diogo & Santos, 2020).

The report aims to give readers a thorough grasp of how to create a big data database using
MongoDB and the advantages it may offer businesses. It is intended to assist people and
organisations considering a big data database solution for their data management
requirements.

3. Methodology

Each group member provided their unique database ideas from assessment 1 as part of the

collaborative methodology utilised to create the big data database using MongoDB. To find

improvement opportunities and ensure the separate databases were compatible with MongoDB,

the procedure required a comprehensive evaluation and review of each database.

3.1. Critical Analysis of Asim Khan Database (ERD)


Most of the entities have been identified in this ERD. Important entities such as
Online_Bookings, Referrals, and Insurance Provider are unique to this ERD.

3.2. Critical Analysis of Muhammad Salman Database (ERD)


In this ERD, we have some unique entities such as Medical_Test, Therapy_Session,

Therapy_Type, and Patient Vital Signs. We can use these entities to expand the scope of our

final database further.

The addition of the Medical_Test, Therapy_Session, Therapy_Type, and Patient Vital Signs
entities in the ERD greatly increases the scope and functionality of the final database solution.
These entities offer additional channels for healthcare data collection and processing, which is
crucial for supporting clinical activities such as disease diagnosis, chronic condition management,
and patient progress tracking.

Many-to-many relationships are resolved by introducing a junction table (an associative
entity), which represents the relationships between the entities cleanly. This approach supports
correct data recording and reduces the chance of duplicate or inconsistent data.
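
As a minimal sketch of the junction-entity approach described above, the JavaScript below models a many-to-many link between patients and therapy types through Therapy_Session records. All field names and sample values are assumptions made for illustration; they are not taken from the submitted ERD.

```javascript
// Hypothetical sample data: field names and values are illustrative only.
const patients = [
  { _id: "p1", name: "Patient A" },
  { _id: "p2", name: "Patient B" },
];
const therapyTypes = [
  { _id: "t1", label: "Physiotherapy" },
  { _id: "t2", label: "Speech therapy" },
];
// Junction records: each one links exactly one patient to one therapy type.
const therapySessions = [
  { patient_id: "p1", therapy_type_id: "t1", date: "2023-01-10" },
  { patient_id: "p1", therapy_type_id: "t2", date: "2023-01-12" },
  { patient_id: "p2", therapy_type_id: "t1", date: "2023-01-15" },
];

// Resolve the therapy labels for one patient by walking the junction records.
function therapiesForPatient(patientId) {
  const labelById = new Map(therapyTypes.map((t) => [t._id, t.label]));
  return therapySessions
    .filter((s) => s.patient_id === patientId)
    .map((s) => labelById.get(s.therapy_type_id));
}

for (const p of patients) {
  console.log(p.name, therapiesForPatient(p._id));
}
```

Because each link lives in its own record, a patient can have any number of therapy types and a therapy type any number of patients, without repeating patient or therapy details.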

The association names and cardinalities offer a precise and thorough representation of the

connections between the database's elements. This feature makes it easier to navigate between

related entities and ensures that data is correctly associated with the appropriate entity,

improving data usability and streamlining data processing.


3.3. Critical Analysis for Raphael Umoh Onofiok Database (ERD)

This ERD places two association names on a single association line connecting two entities, and
it has no direct relationship between Doctor and Patient. These flaws can be corrected. In
addition, the ERD contains no unique entity that could extend the scope of our final database.

For improved data modelling, a few flaws with this ERD must be fixed. First, there are occasions

where a single association line connecting two entities lists two association names. This may

make the connections between entities unclear and confusing. For instance, in this ERD, there is

no clear connection between the entities Doctor and Patient. A more accurate and understandable
representation of the data would arise from making the links between entities more explicit and

from avoiding having numerous association names on a single line.

The absence of a special entity that can broaden the database's scope is another problem with this

ERD. Electronic health records, laboratory testing, and other particular entities can offer more

functionality and insights for hospital management and decision-making. Such entities would

increase the ERD's utility and efficiency if they were included.

To find any flaws or faults and to improve the data model for accurate and efficient data

management, it is crucial to analyse the ERD critically.


3.4. Final ERD Diagram

3.5. Source Code and Video Demonstration

Source code for the final big data database implementation can be found at this link:

https://github.com/inqlo/asim-hospital/blob/main/asim-new_york_city_hospital.json
The video demonstration can be found here:

https://drive.google.com/file/d/1ws6wNGNCW1atHJIWtH4saRKtIjJqEx0c/view

3.6. Rationale for Final Database

In comparison to the earlier ERD, the final database design, which incorporates the additional

entities Patient Vital Signs, Therapy Session, Therapy Type, and Medical Test, looks to be an

improvement.

The newly established entities offer more functionality and insights into managing patient health

and treatment, which can be helpful for healthcare practitioners. For instance, the inclusion of

Patient Vital Signs can aid medical professionals in tracking a patient's vital signs over time and

identifying any changes that would call for additional care or treatment. The Therapy Type entity

offers details on the types of therapy a patient receives, while the Therapy Session entity enables

doctors to monitor therapy sessions and their results. Last but not least, the Medical Test entity

keeps a record of the specifics of the medical tests carried out, which can help physicians make

precise diagnoses and choices on a patient's course of treatment.

The final ERD also resolves an earlier weakness. Some entities previously lacked unambiguous
links, which was a potential problem. The relationship between the Hospital entity and the new
entities, for instance, is now clear.

Overall, adding the new entities has increased the database's capabilities and the accuracy and
level of detail with which patient health and treatment data are managed. The ERD can be refined
further if needed so that it stays clear and well organised, with unambiguous entity and attribute
names and clear links between entities.

3.6.1. Documentation and Metadata

The solution contains thorough documentation and metadata that facilitates efficient database

maintenance and understanding, improving data usability, and ensuring accurate interpretation

and use of recorded information.

3.6.2. Inconsistent Data Structure

The final database solution establishes a consistent naming convention and data representation,

enhancing clarity, usability, and manipulation of data, thereby reducing confusion and mistakes.

3.6.3. Lack of Data Integration

By acting as a central repository for diverse healthcare-related data, the database system
supports data exchange and integration with other systems such as electronic health records
(EHR), laboratory systems, or billing systems (Patel & Patel, 2018).

3.6.4. Limited Data Coverage

The final database solution aims to provide comprehensive data coverage by including essential

entities and attributes relevant to healthcare operations, laying the foundation for capturing and

managing a wide range of healthcare-related data (Hersh et al., 2013).

3.6.5. Lack of Data Governance

The database solution implements suitable methods for data quality, privacy, and compliance,

enforcing data governance standards, defining data stewardship responsibilities, and executing

best practices for data management.


3.6.6. Inefficient Data Storage

The final database solution optimizes data storage by making use of effective storage methods

like indexing, compression, and partitioning, maximizing storage efficiency, and reducing

storage costs (Hersh et al., 2013).
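
To make the indexing point concrete, the sketch below lists index definitions that might suit a database of this kind. The collection names and fields (`patients`, `therapy_sessions`, `last_name`, `patient_id`, `date`) are hypothetical, and the `createIndex` calls run only inside a MongoDB shell session where `db` is defined.

```javascript
// Hypothetical index definitions; collection and field names are assumptions.
const indexSpecs = [
  // Single-field index to speed up lookups by surname.
  { collection: "patients", keys: { last_name: 1 } },
  // Compound index: sessions of one patient, newest first.
  { collection: "therapy_sessions", keys: { patient_id: 1, date: -1 } },
];

// In mongosh, `db` is the current database; elsewhere this branch is skipped.
if (typeof db !== "undefined") {
  for (const spec of indexSpecs) {
    db.getCollection(spec.collection).createIndex(spec.keys);
  }
}
```

Which fields deserve an index depends on the queries that are actually run, so these definitions are a starting point rather than a recommendation.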

3.6.7. Limited Error Handling

The database solution includes techniques for handling mistakes, exceptions, and inconsistent

data, maintaining data integrity, system stability, and user confidence in the veracity of the stored

information by providing error management.
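
One way MongoDB can reject inconsistent data is schema validation at the collection level. The sketch below is an illustration under assumptions: the collection name `patient_vital_signs` and its fields are hypothetical, and the collection is created only when the code runs inside a MongoDB shell session where `db` is defined.

```javascript
// Hypothetical validator; the collection and field names are assumptions.
const vitalSignsValidator = {
  $jsonSchema: {
    bsonType: "object",
    required: ["patient_id", "recorded_at", "heart_rate"],
    properties: {
      patient_id: { bsonType: "string" },
      recorded_at: { bsonType: "date" },
      // Reject implausible readings instead of storing them silently.
      heart_rate: { bsonType: ["int", "long", "double"], minimum: 0, maximum: 300 },
    },
  },
};

// In mongosh, `db` is the current database; elsewhere this branch is skipped.
if (typeof db !== "undefined") {
  db.createCollection("patient_vital_signs", { validator: vitalSignsValidator });
}
```

With such a validator in place, an insert that omits a required field or supplies an out-of-range value fails with an error rather than entering the collection.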

3.6.8. Scalability

The final database solution is designed to scale with the growth of the healthcare organization. It

can handle a significant amount of data and can accommodate the addition of new entities as the

organization expands. This scalability ensures that the database can meet the evolving needs of

the healthcare organization (Baig & Khan, 2017).

3.6.9. Performance

The final database solution is optimized for performance, ensuring that queries are executed

efficiently and quickly. The solution can process large amounts of data without slowing down,

ensuring that healthcare professionals can access the information they need in a timely manner.

3.6.10. Security

The final database solution includes robust security measures to protect sensitive healthcare data

from unauthorized access. The solution implements authentication and authorization mechanisms

to ensure that only authorized personnel can access the data. It also provides data encryption to

protect against data breaches and cyber-attacks (Diogo & Santos, 2020).
3.6.11. Flexibility

The final database solution is designed to be flexible, allowing for customization based on the

specific needs of the healthcare organization. The solution can be tailored to include additional

entities, attributes, and relationships as required, ensuring that it can adapt to the evolving needs

of the organization (Masic, Pandza, & Kulasin, 2014).

3.7. Evaluation in terms of Usability and Utility

The JSON structure made available to healthcare organisations is significant because it offers a

uniform format for keeping and organising crucial medical data. Its use of simple key-value

combinations guarantees that data is accessible and processable, making it simple for healthcare

providers to properly manage patient care.

A key benefit of the JSON format is its versatility, which enables healthcare providers to store
diverse types of data in one place. This helps them manage patient information effectively and
make informed decisions about care. Additionally, JSON's hierarchical structure ensures that
related data, such as patient referrals and bookings, is grouped together, making patient care
simpler to track and manage.
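
The grouping of related data can be pictured with a small example. The document below is a hypothetical sketch: every field name and value is an assumption made for illustration and is not copied from the submitted JSON file.

```javascript
// Hypothetical patient document; every field name and value is illustrative.
const patientDoc = {
  _id: "p1",
  name: "Patient A",
  referrals: [{ referred_by: "Dr. X", reason: "Cardiology review" }],
  online_bookings: [
    { date: "2023-02-01", department: "Cardiology" },
    { date: "2023-03-05", department: "Radiology" },
  ],
};

// Related data is reached through the parent document, with no join needed.
const bookedDepartments = patientDoc.online_bookings.map((b) => b.department);

console.log(JSON.stringify(patientDoc, null, 2));
console.log(bookedDepartments);
```

Embedding referrals and bookings under the patient keeps one patient's records together, at the cost of larger documents when those arrays grow.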

The JSON structure offers healthcare organisations a broad patient care management solution.
They can accurately track patient data, including medical history, treatment plans,
prescriptions, and billing data. This helps healthcare professionals give patients better
treatment and ensure that all parts of their care are managed effectively.
PART 2

1. Queries

1. Give the name of the places whose category is “accommodation”

db.paris.find({ category: "accommodation" }, { name: 1 })

2. Give the name and phone number of places with a phone number entered
($exists, $ne);
db.paris.find(
  { "contact.phone": { $exists: true, $ne: null } },
  { "name": 1, "contact.phone": 1 }
)

3. Name and contacts of places with “website” and “Foursquare” provided;

db.paris.find(
  {
    $and: [
      { "contact.website": { $exists: true, $ne: null } },
      { "contact.Foursquare": { $exists: true, $ne: null } }
    ]
  },
  { "name": 1, "contact": 1 }
)

4. Name of places whose name contains the word “hotel” (pay attention to case);
db.paris.find(
  // "i" makes the match case-insensitive, so "Hotel" and "HOTEL" also match
  { "name": { $regex: "hotel", $options: "i" } },
  { "name": 1 }
)
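
Letter case changes which names a pattern matches. The sketch below uses plain JavaScript regular expressions on hypothetical place names to show the difference; in a MongoDB filter, the case-insensitive behaviour is requested with `$options: "i"`.

```javascript
// Hypothetical place names used only to illustrate case handling.
const names = ["Hotel du Nord", "Grand hotel", "HOTEL Ritz", "Café de Flore"];

// Without the "i" flag, only the exact lowercase spelling matches.
const caseSensitive = names.filter((n) => /hotel/.test(n));
// With the "i" flag, any letter case matches.
const caseInsensitive = names.filter((n) => /hotel/i.test(n));

console.log(caseSensitive);
console.log(caseInsensitive);
```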

5. Name and services of places offering 5 services;


db.paris.find(

{ "services": { $size: 5 } },

{ "name": 1, "services": 1 })

6. Categories of places with at least a rating (reviews.rating) of 4 or more;


db.paris.distinct("category", { "reviews.rating": { $gte: 4 } })

7. GPS coordinates of places whose address contains “rue de rome”;


db.paris.find(
  { "location.address": { $regex: "rue de rome", $options: "i" } },
  { "location.coord.coordinates": 1, _id: 0 }
)

8. Distinct list of category


db.paris.distinct("category")

9. Distinct list of services


db.paris.distinct("services")

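10. For each “poi” category place name, give the number of reviews whose source
(reviews.source) is “Facebook”. Sort in descending order;

db.paris.aggregate([
  { $match: { category: "poi" } },
  // One document per review, so that each review can be tested on its own
  { $unwind: "$reviews" },
  // Keep only the Facebook reviews; $size on the whole array would count every source
  { $match: { "reviews.source": "Facebook" } },
  { $group: { _id: "$name", reviewCount: { $sum: 1 } } },
  { $sort: { reviewCount: -1 } }
])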
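
When counting reviews from one source, note that `$size: "$reviews"` counts every review in a matching document, not only those from that source. The sketch below shows an in-memory JavaScript equivalent that counts after filtering by source. The sample documents are hypothetical, place names are assumed to be unique, and the `pipeline` array is one possible MongoDB formulation using `$filter`, not necessarily the one used in the demonstration.

```javascript
// Hypothetical sample documents shaped like the "paris" collection.
const places = [
  { name: "Tour A", category: "poi",
    reviews: [{ source: "Facebook" }, { source: "Foursquare" }] },
  { name: "Musée B", category: "poi",
    reviews: [{ source: "Facebook" }, { source: "Facebook" }] },
  { name: "Bistro C", category: "restaurant",
    reviews: [{ source: "Facebook" }] },
];

// Count only Facebook reviews per "poi" place, then sort descending.
const counts = places
  .filter((p) => p.category === "poi")
  .map((p) => ({
    _id: p.name,
    reviewCount: p.reviews.filter((r) => r.source === "Facebook").length,
  }))
  .filter((c) => c.reviewCount > 0)
  .sort((a, b) => b.reviewCount - a.reviewCount);

console.log(counts);

// One possible pipeline with the same intent (assumes unique place names).
const pipeline = [
  { $match: { category: "poi", "reviews.source": "Facebook" } },
  { $project: { name: 1, reviewCount: { $size: { $filter: {
      input: "$reviews", as: "r",
      cond: { $eq: ["$$r.source", "Facebook"] } } } } } },
  { $sort: { reviewCount: -1 } },
];

// In mongosh, `db` is the current database; elsewhere this branch is skipped.
if (typeof db !== "undefined") {
  db.paris.aggregate(pipeline).forEach((doc) => console.log(doc));
}
```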

11. For each place name in the “restaurant” category, give the average rating and
the number of comments.
db.paris.aggregate([
  { $match: { category: "restaurant" } },
  // One document per review; $avg in $group does not average the elements of an array
  { $unwind: "$reviews" },
  {
    $group: {
      _id: "$name",
      averageRating: { $avg: "$reviews.rating" },
      commentCount: { $sum: 1 }
    }
  }
])

The $lookup aggregation stage in MongoDB performs a left outer join between two collections
in the same database. It can combine data from different collections based on a common field
or expression (MongoDB, 2023).
The basic syntax of the $lookup stage is:

{
  $lookup: {
    from: <collection>,
    localField: <field>,
    foreignField: <field>,
    as: <outputArray>
  }
}

from specifies the collection to join with.
localField specifies the field from the input collection.
foreignField specifies the field from the joined collection.
as specifies the name of the output array field that will contain the joined documents.

For example:
db.paris.aggregate([
  {
    $lookup: {
      from: "accommodation",
      localField: "_id",
      foreignField: "place_id",
      as: "accommodation"
    }
  }
])

This query uses the $lookup aggregation stage to join the paris collection with the
accommodation collection, matching the _id field of paris against the place_id field of
accommodation. Each document in the result carries an additional accommodation field: an
array of the matching documents from the accommodation collection, which is empty when
nothing matches.
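
The left outer join behaviour can be illustrated without a database. The sketch below emulates a basic equality `$lookup` in plain JavaScript; the sample documents and the `lookup` helper are hypothetical and only mimic the shape of the result.

```javascript
// Hypothetical sample data; collection contents are illustrative only.
const paris = [
  { _id: 1, name: "Hotel A" },
  { _id: 2, name: "Museum B" },
];
const accommodation = [
  { place_id: 1, rooms: 20 },
  { place_id: 1, rooms: 5 },
];

// Emulate a basic $lookup: every input document is kept (left outer join),
// and matches from the other collection are collected into an array field.
function lookup(input, from, localField, foreignField, as) {
  return input.map((doc) => ({
    ...doc,
    [as]: from.filter((f) => f[foreignField] === doc[localField]),
  }));
}

const joined = lookup(paris, accommodation, "_id", "place_id", "accommodation");
console.log(JSON.stringify(joined, null, 2));
```

In this sample, "Hotel A" receives both matching records, while "Museum B" is still returned with an empty array, which is what distinguishes a left outer join from an inner join.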

2. Results

The JSON data was downloaded from the provided URL and saved locally for later use. To make
the data easier to manage and retrieve, it was imported into MongoDB as the "paris" collection
of the "tourPedia" database.

The "paris" collection records the key information about each place in a number of fields,
including _id, contact, name, location, category, description, services, and reviews. The
queries on this collection used filtering, projection, aggregation, and lookup to retrieve
specific data and perform calculations. A dedicated example query showed how the $lookup
stage combines data from separate collections.

3. Work Division

Asim Khan and Muhammad Salman worked as a team to develop the final database solution.

The entity relationship diagram (ERD) analysis and final database design were the responsibility

of Asim Khan. He refined the ERD further by using association names and cardinalities,

and he resolved many-to-many links by including junction tables and alternative entities. The

dataset was obtained, and the JSON was denormalized with significant assistance from

Muhammad Salman. He changed the data into a format that could be used and would work with

the eventual database solution. Additionally, he made sure the information was correct,

comprehensive, and pertinent to healthcare operations. It was Asim Khan’s responsibility to

build the database's framework and manage queries. He created the database's tables,

relationships, and schema, making sure they were efficient, clear, and consistent. Additionally,

he created queries that let users access and change data in the database.

Name               Work Assigned

Asim Khan          Analysing ERD and creating the final Database solution
Muhammad Salman    Denormalization of the JSON and acquisition of the dataset

4. Conclusion

Database design and query manipulation were the two key components of big data processing
explored in this report. The first part covered combining entity-relationship diagrams (ERDs)
into a single database schema. The second part concentrated on querying real data from the
"tourPedia" database. Together, the integration of the ERDs, the construction of our database
schema, and the queries on the "tourPedia" database illustrated the use of big data principles.
This exercise enhanced our knowledge of database design and query execution while
highlighting the need for data management and analysis in real-world contexts. This knowledge
and skill set will be useful for upcoming data-driven projects and analysis.

References

Amazon Web Services. (2023). Amazon DynamoDB. https://aws.amazon.com/dynamodb/

Apache Cassandra. (2023). Apache Cassandra. https://cassandra.apache.org/

Apple Inc. (2023). Core Data. https://developer.apple.com/documentation/coredata

Baig, S. A., & Khan, A. (2017). A comparative analysis of NoSQL databases. International

Journal of Computer Science and Information Security, 15(6), 74-79.

Chong, E., & Fan, J. (2019). NoSQL databases for big data applications: An overview. Big Data

Research, 15, 1-11. https://doi.org/10.1016/j.bdr.2018.07.006

Couchbase. (2023). Couchbase Server. https://www.couchbase.com/products/server


Diogo, C. F., & Santos, M. Y. (2020). The use of NoSQL databases in healthcare systems: A

systematic literature review. International Journal of Medical Informatics, 141, 104180.

https://doi.org/10.1016/j.ijmedinf.2020.104180

Hersh, W. R., Weiner, M. G., Embi, P. J., Logan, J. R., Payne, P. R., Bernstam, E. V., ... &

Lehmann, H. P. (2013). Caveats for the use of operational electronic health record data in

comparative effectiveness research. Medical Care, 51(8 Suppl 3), S30-37.

Masic, I., Pandza, H., & Kulasin, I. (2014). Importance of medical databases and registries in

contemporary medicine. Acta Informatica Medica, 22(5), 320-325.

Mehta, S., Sankaranarayanan, R., & Varadarajan, S. (2019). A comparative study of NoSQL

databases for big data applications. International Journal of Computer Science and Information

Technology Research, 7(1), 1-12.

MongoDB. (2023). MongoDB Documentation. Retrieved from https://docs.mongodb.com/

Patel, R., & Patel, D. (2018). A comparative study of relational and NoSQL databases.

International Journal of Engineering Research and Technology, 7(6), 396-400.

Red Hat. (2023). MongoDB. https://www.mongodb.com/
