BIG DATA SYSTEM REPORT GROUP 2-Heartdisease

COLLEGE OF COMPUTING, INFORMATICS AND MEDIA
BACHELOR’S OF INFORMATION TECHNOLOGY (HONS.)
DSC652: BIG DATA APPLICATIONS AND ISSUES
BIG DATA SYSTEM REPORT: HEART DISEASE
PREPARED BY:
NAME STUDENT ID
MUHAMMAD AZHARI BIN ANUAR 2021823794
AFIQAH MURSYIDAH BINTI MUZAINI 2020621312
DAYANG AISYAH SYAHIRA BINTI ABD MALI 2020604836
NUR DAYINI SYAHMINA BINTI MUHAMMAD HATTA 2020853876
NUR SABRINA BINTI MOHD SHAFAWI 2020872824
GROUP:
CS2406C
PREPARED FOR:
PN SAIDATUL RAHAH BINTI HAMIDI
SEMESTER:
FEBRUARY 23 / AUGUST 23
Table of Contents
CHAPTER 1: INTRODUCTION
1.1 Project Background
1.2 Problem Statement
1.3 Objectives and Scopes
CHAPTER 2: RESEARCH METHODOLOGY
2.1 Methodology
2.1.1 Data Collection
2.1.2 Data Storage
2.1.3 Data Processing
2.1.4 Analytics and Insight
2.1.5 Visualization and Reporting
CHAPTER 3: ANALYSIS AND FINDINGS
3.1 Project’s Dashboard
3.2 SOLUTION
CHAPTER 4: CONCLUSION
REFERENCES
CHAPTER 1: INTRODUCTION
1.1 Project Background
Heart disease, also known as cardiovascular disease (CVD), is a broad term for
any condition affecting the heart or blood vessels. It covers several types of
heart conditions, such as coronary artery disease (CAD), which reduces blood flow to
the heart and can cause a heart attack. Heart failure is another condition within this
broad range. According to research, approximately 6.2 million adults in the United
States have heart failure. While heart disease is a broad category, several aspects of
it have an impact on heart failure.
According to the Centers for Disease Control and Prevention (CDC), heart disease is
one of the leading causes of death in the United States for people of all races,
including African Americans, American Indians, Alaska Natives, and white people.
Approximately half of Americans (47%) have at least one of the three major risk
factors for heart disease: high blood pressure, high cholesterol, and smoking.
Other key indicators include diabetic status, obesity (high BMI), a lack of
physical activity, and excessive alcohol consumption. It is critical in healthcare to
identify and prevent the factors that have the greatest impact on heart disease.
Computational advances, in turn, enable the use of machine learning methods to detect
“patterns” in data that can predict a patient’s condition.
1.2 Problem Statement
The dataset is from the Centers for Disease Control and Prevention (CDC) and is a
component of the Behavioral Risk Factor Surveillance System (BRFSS), which
conducts annual telephone surveys to collect data on the health of U.S. residents.
Every year, BRFSS conducts over 400,000 adult interviews, making it the world’s
largest continuously conducted health survey system.
The data was gathered for an analysis of heart disease, and the dataset was
found on the Kaggle website. In Malaysia, disease analysis is extremely popular.
This project aims to learn more about heart disease based on age, BMI, daily
activities, and other factors, and to determine the number of people who have heart
disease and the cause of that disease.
The project includes hands-on work that uses real-world data as an example of
how to learn about and comprehend data. According to the real-world heart disease
data, patients who engage in unhealthy habitual behaviours such as smoking,
drinking alcohol, or sleeping late are more likely to develop heart disease. This
heart disease dataset can help us better understand which everyday activities and
BMI levels have the greatest impact on the development of heart disease. The
project is also beneficial because it prepares us for the actual datasets that we
may obtain and analyse in future.
1.3 Objectives and Scopes
• To identify, design and develop the system requirement dashboard for heart
disease
• To identify risk factors for heart disease
• To investigate the relationship between Demographic Factors and Health
Conditions
This project is guided by Dataquest, and the dataset is from the classified parts
of the Heart Disease Analysis (EDA) on Kaggle, with approximately 319,796 data
points taken from the original dataset, which had been updated 20 days earlier
(2022) at the time of download. Using this reduced dataset gives much faster
results when the data is loaded and used. The dataset has numerous flaws that
require data cleaning, but it contains many columns that can be used to analyse BMI,
physical health, mental health, general health, age, gender, and other factors.
This project intends to investigate the relationship between several factors and
the presence of heart disease by analysing a dataset containing demographic and health-
related information. This project aims to develop a comprehensive understanding of the
impact of lifestyle choices, health conditions, and demographic factors on heart disease
occurrence through data collection, storage, processing, analytics, and visualisation.
The project's findings may help to develop personalised risk assessment models as well
as inform targeted interventions to reduce the prevalence of heart disease and improve
overall cardiovascular health.
This project will create a dashboard that displays information about heart
disease analysis. The dashboard will show the user how heart disease relates to age
and gender, and whether a person's BMI can be a risk factor for heart disease.
Furthermore, the dashboard has its own distinct features, such as the ability to
slice the graphs and some infographics that can assist us in learning more about
heart disease.
CHAPTER 2: RESEARCH METHODOLOGY
2.1 Methodology
Big data methodology is the systematic technique and set of steps used to gather,
handle, examine, and extract knowledge from large and complex data. It includes the
specialised methods, instruments, and frameworks applied to address the unique
challenges brought on by big data, such as its volume, velocity, variety, and veracity.
The following elements are often included in a big data methodology:
2.1.1 Data Collection
The process of gathering and acquiring significant amounts of data from several
sources to be utilised for analysis and decision-making in a methodical and effective
way is known as data collection in big data methodology. Big data requires the
collection of structured, semi-structured, and unstructured data from a variety of
sources to create expansive databases that can be analysed.
The dataset, which was gathered from Kaggle, contains details on the key
indicators of heart disease. The information was gathered by the CDC (Centers for
Disease Control and Prevention) and is a significant component of the
Behavioral Risk Factor Surveillance System (BRFSS), which conducts annual
telephone surveys to collect information on Americans' health condition. Only roughly
18 variables remain from the original dataset's almost 300 variables, as shown in
Figure 2.1.
Figure 2.1: The Dataset

2.1.2 Data Storage
The process of storing and managing massive amounts of data in an organised,
scalable, and effective way to facilitate data analysis and decision-making is known as
data storage in big data methodology. Data storage is essential for big data because it
makes it possible for data to be accessed, processed, and retrieved in an environment
where the volume, variety, and velocity of data are substantial.
In this project, we use the Apache Hadoop Distributed File System (HDFS) because
HDFS is popular for storing and processing large data sets on a cluster of machines. It
offers fault tolerance, high throughput, and scalability. HDFS divides data into blocks
that are replicated across multiple nodes for redundancy and efficient data handling.
HDFS is commonly used for data storage for several reasons:
• Scalability: HDFS is highly scalable and capable of handling massive amounts of
data. It can scale horizontally by adding more commodity servers to a Hadoop cluster,
accommodating the ever-growing data storage needs of organizations. This scalability
makes it suitable for big data environments where data volumes can be immense.
• Fault Tolerance: Fault tolerance was a consideration in the architecture of HDFS. By
duplicating data blocks over several cluster nodes, it stores data in a distributed
fashion. This redundancy guarantees that data is still available from other replicated
copies even if a node or disk fails. Data availability and reliability are ensured by
the automated failure detection and recovery capabilities of HDFS.
• Data Processing Performance: By utilising its distributed design, HDFS enables
effective data processing performance. Data processing can occur simultaneously
because data is distributed across numerous nodes. High-performance data processing
frameworks like Apache Spark and Apache Hadoop can operate effectively on huge
datasets stored in HDFS because of their distributed and parallel processing capacity.
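The block splitting and replication described above can be illustrated with a little arithmetic. The following is a minimal sketch, not part of the project's notebook: it assumes a 128 MB block size and a replication factor of 3, which are common HDFS defaults that a cluster may override, and the function name `hdfs_footprint` is hypothetical.

```python
import math

# Assumed values: 128 MB is a common default HDFS block size and 3 is a
# common default replication factor; a real cluster may configure both.
BLOCK_SIZE_MB = 128
REPLICATION_FACTOR = 3


def hdfs_footprint(file_size_mb: float) -> dict:
    """Estimate how a file of the given size is split and replicated."""
    blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    return {
        "blocks": blocks,
        "block_replicas": blocks * REPLICATION_FACTOR,
        "raw_storage_mb": file_size_mb * REPLICATION_FACTOR,
    }


# Hypothetical 300 MB file: 3 blocks (128 + 128 + 44 MB), each stored 3 times.
footprint = hdfs_footprint(300)
print(footprint)
```

Because every block exists on several nodes, losing one node leaves the other copies readable, and different nodes can process different blocks at the same time.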
2.1.3 Data Processing
Data processing, as used in big data methodology, is the systematic and
structured process of transforming, manipulating, and analysing huge and complicated
datasets to provide insightful conclusions, observable patterns, and useful knowledge.
To extract useful knowledge from the data, a variety of computational approaches,
algorithms, and tools must be applied. The data processing steps are shown
below:
Figure 2.2 : Import Data
This figure shows a CSV file being imported into Jupyter Notebook. The CSV file
located at 'file_path' is read and its contents are loaded into the DataFrame 'hd'.
The printed output displays the DataFrame, showing the rows and columns of data
from the CSV file.
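Since the figure itself is not reproduced here, the import step can be sketched as follows. This is an illustration under assumptions rather than the notebook's actual code: a tiny inline sample stands in for the real CSV so the example runs on its own, and the column names and sample rows are assumed for illustration.

```python
import io

import pandas as pd

# A tiny inline sample stands in for the real CSV so the sketch is runnable.
# The column names and rows below are assumptions for illustration only.
sample_csv = io.StringIO(
    "HeartDisease,BMI,Smoking,Sex,AgeCategory\n"
    "No,16.6,Yes,Female,55-59\n"
    "Yes,28.87,Yes,Female,75-79\n"
    "No,34.3,No,Male,18-24\n"
)
file_path = sample_csv  # hypothetical; in the notebook this is the CSV path

# Read the CSV into the DataFrame 'hd' and display it.
hd = pd.read_csv(file_path)
print(hd)
print(hd.shape)  # (number of rows, number of columns)
```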
Figure 2.2 : Defining of conditions of BMI and the categories
This figure defines the conditions for each BMI category. BMI (Body Mass
Index) has four categories: underweight, healthy, overweight, and obese. For
example, the underweight category has a BMI of less than 18.
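One way to express such conditions in pandas is with `numpy.select`, sketched below. Only the underweight cut-off (BMI below 18) comes from the text; the remaining thresholds, the category labels, and the sample rows are assumptions based on the usual CDC ranges, so the notebook's actual conditions may differ.

```python
import numpy as np
import pandas as pd

hd = pd.DataFrame({"BMI": [16.6, 22.0, 27.5, 34.3]})  # hypothetical sample rows

# Only the underweight cut-off (BMI < 18) is stated in the report; the other
# thresholds are assumptions based on the usual CDC ranges.
conditions = [
    hd["BMI"] < 18,
    (hd["BMI"] >= 18) & (hd["BMI"] < 25),
    (hd["BMI"] >= 25) & (hd["BMI"] < 30),
    hd["BMI"] >= 30,
]
categories = ["Underweight", "Healthy", "Overweight", "Obese"]

# For each row, pick the label of the first condition that is true.
bmi_category = np.select(conditions, categories, default="Unknown")
print(list(bmi_category))
```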
Figure 2.3 : Adding new column ”BMIcat” next to BMI column
This figure adds a new column, BMIcat, which shows the category of every
BMI based on the conditions shown in the previous figure.
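Placing the new column directly beside BMI, rather than at the end of the table, can be done with `DataFrame.insert`. The sketch below assumes the category labels have already been computed; the sample rows and the `bmi_labels` list are hypothetical.

```python
import pandas as pd

# Hypothetical sample rows; the real DataFrame has many more columns.
hd = pd.DataFrame(
    {
        "HeartDisease": ["No", "Yes"],
        "BMI": [16.6, 28.87],
        "Smoking": ["Yes", "No"],
    }
)
bmi_labels = ["Underweight", "Overweight"]  # assumed to be computed earlier

# Find where BMI sits and insert BMIcat immediately after it.
bmi_position = hd.columns.get_loc("BMI")
hd.insert(bmi_position + 1, "BMIcat", bmi_labels)
print(list(hd.columns))
```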
Figure 2.4 : Mapping age category and rename the column
This figure shows the mapping of age categories and the renaming of the column
from AgeCategory to AgeRange. The dictionary, agecat_mapping, maps values of
'AgeCategory' to corresponding values of 'AgeCat'. This mapping provides a way to
categorise individuals into age groups based on their age ranges. Each key in
the dictionary represents an age range, while the corresponding value represents
the age category.
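A mapping of this kind might look like the sketch below. The names `agecat_mapping`, 'AgeCategory', 'AgeCat', and 'AgeRange' come from the text, but the age-range keys and the grouping of ranges into categories are assumptions, since the figure is not reproduced here.

```python
import pandas as pd

hd = pd.DataFrame({"AgeCategory": ["18-24", "35-39", "70-74", "80 or older"]})

# Assumed grouping of age ranges into the four categories named in the report.
agecat_mapping = {
    "18-24": "Young adult", "25-29": "Young adult", "30-34": "Young adult",
    "35-39": "Middle age", "40-44": "Middle age", "45-49": "Middle age",
    "50-54": "Middle age", "55-59": "Middle age", "60-64": "Middle age",
    "65-69": "Older adult", "70-74": "Older adult", "75-79": "Older adult",
    "80 or older": "Senior adult",
}

# Derive the category from the age range, then rename the original column.
hd["AgeCat"] = hd["AgeCategory"].map(agecat_mapping)
hd = hd.rename(columns={"AgeCategory": "AgeRange"})
print(hd)
```

Note that `Series.map` leaves a missing value for any range that is absent from the dictionary, which makes incomplete mappings easy to spot.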
Figure 2.5: Mapping value to the selected columns
The dictionary called 'value_mapping' maps certain values to their
corresponding replacements. In this case, it maps the value 'Yes' to 'True' and the value
'No' to 'False'. This mapping is used to replace the values in the specified columns.
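The replacement step could be sketched as follows. The dictionary name `value_mapping` is from the text; the list of selected columns and the sample rows are hypothetical, and Python booleans are assumed because the text does not show whether the notebook used booleans or the strings 'True' and 'False'.

```python
import pandas as pd

# Hypothetical sample rows with Yes/No answers.
hd = pd.DataFrame(
    {
        "HeartDisease": ["No", "Yes", "No"],
        "Smoking": ["Yes", "Yes", "No"],
        "Sex": ["Female", "Female", "Male"],
    }
)

# Assumption: booleans are used as the replacement values.
value_mapping = {"Yes": True, "No": False}
selected_columns = ["HeartDisease", "Smoking"]  # hypothetical column selection

# Apply the mapping only to columns that hold plain Yes/No answers; with
# Series.map, any other value in those columns would become missing (NaN).
for column in selected_columns:
    hd[column] = hd[column].map(value_mapping)

print(hd)
```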
Figure 2.6 : The original dataset and processed dataset
The original dataset has 18 variables, and the processed dataset has 2 new
variables: the BMIcat and AgeCategory columns. The BMIcat column shows the
category of each BMI, and the AgeCategory column shows the category of each
age range.
2.1.4 Analytics and Insight
The techniques and results of analysing vast amounts of data to extract valuable
information and develop a more comprehensive understanding or knowledge are
referred to as analytics and insights in the context of big data methodology. To find
patterns, trends, correlations, and other useful insights, analytics entails the systematic
investigation and evaluation of huge databases. The insightful findings, observations,
or discoveries that result from the analytics process are referred to as insights. They
stand for the comprehension and information discovered through big data analysis.
In this project, we use diagnostic analytics, which aims to identify the causes or
factors contributing to heart disease. In this context, diagnostic analytics entails
analysing data to determine the factors contributing to the condition and to
understand the correlations between different variables. It seeks to shed light on
the underlying causes that could contribute to the development or progression of heart
disease. By analysing huge datasets and using statistical approaches, it is possible
to identify which factors are more likely to be linked to the development or
progression of the disease. Understanding the risk profiles of individuals or
populations in this way can help develop focused preventative or intervention measures.
The visualization techniques that we used in Power BI Desktop are the card, pie
chart, column chart, and key influencers. The pie chart allows the user to identify
the age categories with a higher prevalence of heart disease. Larger slices indicate
higher proportions of people with heart disease in those age groups. This can be useful
in identifying age groups that may require targeted interventions or preventive measures.
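The proportions behind such a pie chart can also be checked in pandas before building the visual. The sketch below is illustrative only: the rows are made up, and the column names 'HeartDisease' and 'AgeCat' are assumed to hold a boolean flag and the age category respectively.

```python
import pandas as pd

# Made-up rows for illustration; not taken from the real dataset.
hd = pd.DataFrame(
    {
        "HeartDisease": [True, True, True, False, True, False],
        "AgeCat": [
            "Older adult", "Older adult", "Middle age",
            "Young adult", "Senior adult", "Middle age",
        ],
    }
)

# Keep only people with heart disease, then compute each age category's share.
cases = hd[hd["HeartDisease"]]
share = cases["AgeCat"].value_counts(normalize=True).mul(100).round(2)
print(share)
```

Each value in `share` corresponds to the size of one slice, expressed as a percentage of all heart disease cases.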
The card visual in Power BI Desktop allows the user to display a specific value
prominently. In this case, it can be used to display the total number of people in the
dataset. The card visual provides a clear and concise representation of this value.
Next is the column chart. We use two types of column chart in this dashboard:
the clustered column chart, and the line and clustered column chart. The
clustered column chart displays the number of people by age category and gender.
It shows that older adults, aged between 65 and 79, have the highest number, and
that males outnumber females in every age category. The line and clustered
column chart displays the number of people by BMI category, and it shows no
correlation between BMI and heart disease.
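The figures behind both column charts can be reproduced as simple summary tables. The following sketch uses made-up rows and assumed column names ('Sex', 'AgeCat', 'BMIcat', and a boolean 'HeartDisease'); it shows one way to compute the counts and rates, not the dashboard's actual queries.

```python
import pandas as pd

# Made-up rows for illustration; not taken from the real dataset.
hd = pd.DataFrame(
    {
        "HeartDisease": [True, False, True, False, True, False, False, True],
        "Sex": ["Male", "Female", "Male", "Male",
                "Female", "Female", "Male", "Male"],
        "AgeCat": ["Older adult", "Older adult", "Middle age", "Middle age",
                   "Older adult", "Young adult", "Young adult", "Older adult"],
        "BMIcat": ["Obese", "Healthy", "Healthy", "Obese",
                   "Overweight", "Overweight", "Healthy", "Healthy"],
    }
)

# Clustered column chart: number of people per age category and gender.
by_age_sex = pd.crosstab(hd["AgeCat"], hd["Sex"])
print(by_age_sex)

# BMI chart: percentage of people with heart disease in each BMI category.
rate_by_bmi = hd.groupby("BMIcat")["HeartDisease"].mean().mul(100)
print(rate_by_bmi)
```

When the rate is roughly the same in every BMI category, as in this made-up sample, the chart gives no sign of a relationship between BMI and heart disease.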
Lastly, there is the key influencers visual. It is used to show the relationships
between the risk factors and heart disease. People with risk factors such as
kidney disease and diabetes have a higher likelihood of developing heart disease.
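A "times more likely" figure of this kind can be approximated by dividing the heart disease rate among people with a risk factor by the rate among people without it. The sketch below uses made-up rows and a hypothetical helper, `likelihood_ratio`; Power BI computes its own multiplier inside the visual, so this simple ratio is an illustration and need not match the dashboard's numbers.

```python
import pandas as pd

# Made-up rows: 4 people with kidney disease, 8 without.
hd = pd.DataFrame(
    {
        "KidneyDisease": [True] * 4 + [False] * 8,
        "HeartDisease": [True, True, False, False]
        + [True, False, False, False, False, False, False, False],
    }
)


def likelihood_ratio(df: pd.DataFrame, factor: str,
                     outcome: str = "HeartDisease") -> float:
    """Rate of the outcome with the factor divided by the rate without it."""
    rate_with = df.loc[df[factor], outcome].mean()
    rate_without = df.loc[~df[factor], outcome].mean()
    return rate_with / rate_without


ratio = likelihood_ratio(hd, "KidneyDisease")
print(f"Heart disease is {ratio:.2f} times as frequent with kidney disease")
```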
2.1.5 Visualization and Reporting
The tool that we used for visualising our data is Power BI. Charts, graphs, maps, tables, and
custom visuals are just a few of the numerous built-in visualizations available in Power BI.
These visual representations enable a thorough analysis of the prevalence of heart disease
across various demographic groups. They give insight into how heart disease cases are
distributed by age, gender, and BMI category, which helps to identify potential
disparities, correlations, and trends. This in turn supports targeted interventions,
individualized healthcare plans, and well-informed decision-making for stakeholders.
• Wireframe
Interface design holds significance as it ensures the alignment of the dashboard layout with
user requirements. To convey the planned design of the project, we will employ a low-fidelity
interface design.
Figure 2.7: Wireframe of Homepage
The homepage displays the structure, assisting users in understanding the aim and scope of the
project, which is heart disease. It includes elements such as a logo, a title, and a visual design.
It defines the dashboard identity and ensures that users have a uniform experience across the
dashboard.
Figure 2.8: Wireframe of charts on various factors
This page depicts the association between several factors and the existence of cardiovascular
disease. We decided to group all charts on one page so that users could see the overall effect
rather than browsing through many pages, which could give the impression that the dashboard
is time-consuming.
Figure 2.9: Wireframe of key influencers of the three series.
This page assists users in better understanding the relationship between heart disease and other
diseases such as kidney disease, diabetes, and borderline diabetes. Since these three are
considered diseases rather than demographics, we decided to divide them into an individual
page.
CHAPTER 3: ANALYSIS AND FINDINGS
3.1 Project’s Dashboard
(Figure 3.1: Home Page)

Figure 3.1 above shows the welcome page of the data visualisation dashboard. When users
open the dashboard, the welcome page is the first page they see. Users can access the
following page by clicking the 'Click Here' button. In this page's footer, we have included
an About icon that takes users to the World Health Organization (WHO) website, where they
can learn more about heart disease.
(Figure 3.1: Overall Analysis)
To start with, the pie chart in Figure 3.1 shows that there are four age categories in
this heart disease analysis. Older adults, whose ages range from 70 to 74, account for
47.48% of the data. Next is middle age, with 32.14% of people between the ages of 35 and 39.
Following that, senior adults aged 80 years or older account for 19.91%. The last group
is young adults, aged 18 to 24, who account for only 0.47%.
Second, the clustered column chart on the left side compares each age category and gender for
this dataset. In the figure above, the red bar represents female, and the blue bar represents male.
This chart shows that males outnumber females across all age groups.
Last but not least, the clustered column chart on the right side shows the relationship
between BMI category and whether or not a person has heart disease. Comparing the four BMI
categories of underweight, healthy, overweight, and obese, we conclude that BMI has no
clear influence on whether a person gets this disease, as the comparison between the
true and false counts looks similar across the categories.
(Figure 3.2: Analysis on three series disease (Kidney))
Figure 3.2 shows the relationship between heart disease and kidney disease. According to the
analysis, a person with kidney disease is 3.19 times more likely to develop heart disease,
where the true percentage is 29.33% and the false percentage is 7.77%.
(Figure 3.3: Analysis on three series disease (Diabetic))

Figure 3.3 shows the relationship between heart disease and diabetes. According to the
analysis, a person with diabetes is 3.16 times more likely to develop heart disease, where
the true percentage is 21.65%, the false percentage is 6.50%, and the percentage is 4.22%
for those who were diabetic during pregnancy.
(Figure 3.4: Analysis on three series disease (Borderline diabetic))

Figure 3.4 shows the relationship between heart disease and borderline diabetes, where the
person is close to developing diabetes. According to the analysis, a person with borderline
diabetes is 1.42 times more likely to develop heart disease, where the true percentage is
21.95%, the false percentage is 6.50%, and the percentage is 4.22% for those who were
diabetic during pregnancy.
3.2 SOLUTION
Heart disease is a significant health concern today, particularly because it can affect all
age groups. There are numerous improvements that can help address it. The first is
prevention and awareness, which is important because it empowers individuals to make better
decisions and reduce their risk. Education and awareness campaigns help people learn more
about heart disease, including beneficial health information and the risks involved. Next
are diagnostic solutions such as wearable devices, for example smartwatches or fitness
trackers, that track vital signs including heart rate, blood pressure, and activity levels.
This information can aid in early detection and offer insights into heart health. In
addition, medication management is valuable for people who must take a lot of medication: a
system or application provides daily reminders to take medicine, along with educational
resources that assist patients in taking their prescription medication as directed and in
tracking the results. A third way to improve is proper treatment and management, such as
cardiac rehabilitation programmes that use technology-based solutions, including exercise
instruction, education, and progress monitoring, to enable remote cardiac rehabilitation.
Lastly, collaborative care can help establish a support network and community of heart
disease patients, where they can stay connected and share their experience of handling
heart disease.
CHAPTER 4: CONCLUSION
To conclude, we have developed a platform that enables people to make knowledgeable
health decisions and take proactive steps to treat or avoid cardiac problems. Our dashboard
provides insight into people with heart disease by age category, gender, and BMI category.
In addition, there are graphs that present the likelihood of heart disease among people
with kidney disease, diabetes, and borderline diabetes. The heart disease dashboard project
has contributed significantly to our knowledge of cardiovascular health. The knowledge
acquired from this initiative has the potential to help people understand their chances of
developing heart disease or other diseases, so they can take better care of their health.
By using the power of big data, we can create the conditions for a healthier future for
all people.
REFERENCES
Personal Key Indicators of Heart Disease. (n.d.). Kaggle.
https://www.kaggle.com/datasets/kamilpytlak/personal-key-indicators-of-heart-disease?resource=download
CDC. (2020, September 17). Assessing Your Weight. Centers for Disease Control and Prevention.
https://www.cdc.gov/healthyweight/assessing/index.html#:~:text=If%20your%20BMI%20is%20less
