BIG DATA SYSTEM REPORT GROUP 2-Heartdisease
PREPARED BY:
NAME STUDENT ID
GROUP:
CS2406C
PREPARED FOR:
PN SAIDATUL RAHAH BINTI HAMIDI
SEMESTER:
FEBRUARY 23 / AUGUST 23
Heart disease, also known as cardiovascular disease (CVD), is a broad term for
any condition affecting the heart or blood vessels. It covers several types of
heart conditions, such as coronary artery disease (CAD), which reduces blood flow to
the heart and can cause a heart attack. Heart failure is another condition that falls
under the term. According to research, approximately 6.2 million adults in the United
States have heart failure. While heart disease is a broad category, several aspects of
it have an impact on heart failure.
According to the Centers for Disease Control and Prevention, heart disease is one
of the leading causes of death in the United States for people of all races, including
African Americans, American Indians, Alaska Natives, and white people. Approximately
half of all Americans (47%) have at least one of the three major risk factors for heart
disease: high blood pressure, high cholesterol, and smoking.
Other key indicators include diabetic status, obesity (high BMI), a lack of
physical activity, and excessive alcohol consumption. It is critical in healthcare to
identify and prevent the factors that have the greatest impact on heart disease.
Computational advances, in turn, enable the use of machine learning methods to detect
“patterns” in data that can predict a patient’s condition.
1.2 Problem Statement
The dataset is from the Centers for Disease Control and Prevention (CDC) and is a
component of the Behavioural Risk Factor Surveillance System (BRFSS), which
conducts annual telephone surveys to collect data on the health of U.S. residents.
Every year, BRFSS conducts over 400,000 adult interviews, making it the world's
largest continuously conducted health survey system.
The data used here comes from an analysis of heart disease, and the dataset was
found on the Kaggle website. In Malaysia, disease analysis is extremely popular.
This project uses the dataset to learn more about heart disease based on age, BMI,
daily activities, and other factors, and to determine the number of people who have
heart disease and the causes of that disease.
The project includes hands-on work that uses real-world data as an example of
how to learn about and comprehend data. According to the real-world heart disease
data, patients who engage in unhealthy habitual behaviours such as smoking,
drinking alcohol, or sleeping late are more likely to develop heart disease. This
heart disease dataset can help us better understand which everyday activities and
BMI levels have the greatest impact on the development of heart disease. The project
is also beneficial because it prepares us for the actual datasets that we may obtain
and analyse in the future.
1.3 Objectives and Scopes
• To identify, design and develop the system requirement dashboard for heart
disease
• To identify risk factors for heart disease
• To investigate the relationship between Demographic Factors and Health
Conditions
This project is guided by Dataquest, and the data comes from the classified parts
of the Heart Disease Analysis (EDA), with approximately 319,796 data points taken
from the original dataset, which was last updated on Kaggle in 2022 (20 days before
it was retrieved). Using this subset gives much faster results when the data is loaded
and used. The dataset has numerous flaws that require data cleaning, but it contains
many columns that can be used to analyse BMI, physical health, mental health, general
health, age, gender, and other factors.
This project intends to investigate the relationship between several factors and
the presence of heart disease by analysing a dataset containing demographic and health-
related information. This project aims to develop a comprehensive understanding of the
impact of lifestyle choices, health conditions, and demographic factors on heart disease
occurrence through data collection, storage, processing, analytics, and visualisation.
The project's findings may help to develop personalised risk assessment models as well
as inform targeted interventions to reduce the prevalence of heart disease and improve
overall cardiovascular health.
This project will create a dashboard that displays information about heart
disease analysis. The dashboard will show the user how age and gender relate to
heart disease, and whether a person's BMI can be a risk factor for it. Furthermore,
the dashboard has its own distinct features, such as the ability to slice the graphs,
and it includes some infographics that can help users learn more about heart disease.
CHAPTER 2: RESEARCH METHODOLOGY
2.1 Methodology
Big data methodology is the systematic technique and set of steps used to gather,
handle, examine, and extract knowledge from large and complex data. It includes the
specialised methods, instruments, and frameworks applied to address the unique
challenges brought on by big data, such as its volume, velocity, variety, and veracity.
The following elements are often included in a big data methodology:
In big data methodology, data collection is the process of gathering and acquiring
significant amounts of data from several sources, in a methodical and effective way,
so that it can be used for analysis and decision-making. Big data requires the
collection of structured, semi-structured, and unstructured data from a variety of
sources to create large datasets that can be analysed.
The dataset, which was gathered from Kaggle, contains details on the key
indicators of heart disease. The information was collected by the CDC (Centers for
Disease Control and Prevention) and is a significant component of the
Behavioural Risk Factor Surveillance System (BRFSS), which conducts annual
telephone surveys to collect information on Americans' health conditions. Of the
almost 300 variables in the original dataset, only about 18 remain, as shown in the
figure below.
In this project, we use the Apache Hadoop Distributed File System (HDFS) because
it is popular for storing and processing large data sets on a cluster of machines. It
offers fault tolerance, high throughput, and scalability. HDFS divides data into blocks
that are replicated across multiple nodes for redundancy and efficient data handling.
HDFS is commonly used for data storage for several reasons:
This figure shows a CSV file being imported into Jupyter Notebook. The CSV file
located at 'file_path' is read and its contents are loaded into the DataFrame 'hd'.
The printed output displays the DataFrame, showing the rows and columns of data
from the CSV file.
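The import step described above can be sketched as follows. This is a minimal illustration, not the notebook's actual code: the in-memory sample stands in for the real CSV file, and its column names are assumed from the public Kaggle file.

```python
import io
import pandas as pd

# Hypothetical stand-in for the real CSV file; the columns shown are an
# assumption and only a few of the dataset's variables are included.
sample_csv = io.StringIO(
    "HeartDisease,BMI,Sex,AgeCategory,KidneyDisease\n"
    "No,16.6,Female,55-59,No\n"
    "Yes,28.9,Male,80 or older,Yes\n"
    "No,34.3,Male,65-69,No\n"
)

# In the notebook, file_path would be a string pointing at the CSV on disk.
file_path = sample_csv

hd = pd.read_csv(file_path)  # load the CSV contents into the DataFrame 'hd'
print(hd)                    # display the rows and columns
```

Calling hd.shape or hd.head() afterwards is a quick way to confirm that the expected number of rows and columns was loaded.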
This figure defines the conditions for each BMI category. BMI (Body Mass
Index) has four categories: underweight, healthy, overweight, and obese. For
example, the underweight category has a BMI of less than 18.
Figure 2.3: Adding new column "BMIcat" next to the BMI column
This figure adds the new BMIcat column, which shows the category of every BMI value
based on the conditions shown in the previous figure.
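A minimal sketch of the BMI categorisation is given below. The cut-offs are the standard CDC adult ranges and are an assumption here; the report's own example uses 18 for underweight, so the boundaries may need adjusting to match the notebook. The sample values are invented.

```python
import numpy as np
import pandas as pd

# Toy BMI values for illustration only
hd = pd.DataFrame({"BMI": [16.6, 22.0, 28.9, 34.3]})

# Assumed cut-offs: <18.5 underweight, 18.5 to <25 healthy,
# 25 to <30 overweight, >=30 obese
conditions = [
    hd["BMI"] < 18.5,
    (hd["BMI"] >= 18.5) & (hd["BMI"] < 25),
    (hd["BMI"] >= 25) & (hd["BMI"] < 30),
    hd["BMI"] >= 30,
]
labels = ["Underweight", "Healthy", "Overweight", "Obese"]

# Pick the label of the first condition that is true for each row
bmicat = np.select(conditions, labels, default="Unknown")

# Place the new column immediately to the right of the BMI column
hd.insert(hd.columns.get_loc("BMI") + 1, "BMIcat", bmicat)
print(hd)
```

Using hd.insert rather than plain assignment keeps BMIcat next to BMI, which matches the column placement described in the caption.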
This figure shows the mapping of age categories and the renaming of the column from
AgeCategory to AgeRange. The given dictionary, agecat_mapping, maps values of
'AgeCategory' to corresponding values of 'AgeCat'. This mapping provides a way to
categorize individuals into different age groups based on their age ranges. Each key in
the dictionary represents an age range, and its value is the corresponding age
category.
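The mapping and renaming can be sketched as below. Only the names agecat_mapping, AgeCategory, AgeCat, and AgeRange come from the text; the group labels and the few age ranges shown are hypothetical and chosen for illustration.

```python
import pandas as pd

# Toy rows for illustration only
hd = pd.DataFrame({"AgeCategory": ["18-24", "40-44", "65-69", "80 or older"]})

# Hypothetical mapping: each key is an age range, each value an age group.
# The real dictionary would cover every age range in the dataset.
agecat_mapping = {
    "18-24": "Young adult",
    "40-44": "Middle-aged adult",
    "65-69": "Older adult",
    "80 or older": "Elderly",
}

# Derive the age group from the age range, then rename the original column
hd["AgeCat"] = hd["AgeCategory"].map(agecat_mapping)
hd = hd.rename(columns={"AgeCategory": "AgeRange"})
print(hd)
```

Any age range missing from the dictionary would come out as NaN after map, so checking hd["AgeCat"].isna().sum() is a simple safeguard.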
Figure 2.5: Mapping value to the selected columns
The original dataset has 18 variables, and the processed dataset adds two new
variables: the BMIcat and AgeCategory columns. The BMIcat column shows the
category of each BMI value, and the AgeCategory column shows the category of each
age range.
2.1.4 Analytics and Insight
In the context of big data methodology, analytics and insights refer to the
techniques and results of analysing vast amounts of data to extract valuable
information and develop a more comprehensive understanding. Analytics is the
systematic investigation and evaluation of large datasets to find patterns, trends,
correlations, and other useful information. Insights are the findings, observations,
or discoveries that result from the analytics process. They represent the
understanding gained through big data analysis.
In this project, we use diagnostic analytics, which aims to identify the causes or
factors contributing to heart disease. In this context, diagnostic analytics entails
analysing data to determine the factors contributing to the condition and to
understand the correlations between different variables. It seeks to shed light on
the underlying causes that could contribute to the development or progression of heart
disease. By analysing large datasets and using statistical approaches, it is possible
to identify which factors are more likely to be linked to the development or
progression of the disease. Understanding the risk profiles of individuals or
populations in this way can help develop focused preventative or intervention measures.
The visualization techniques that we used in Power BI Desktop are the card, pie
chart, column chart, and key indicator. The pie chart allows the user to identify the
age categories with a higher prevalence of heart disease. Larger slices indicate higher
proportions of people with heart disease in those age groups. This can be useful in
identifying age groups that may require targeted interventions or preventive measures.
The card visual in Power BI Desktop allows the user to display a specific value
prominently. In this case, it can be used to display the total number of people in the
dataset. The card visual provides a clear and concise representation of this value.
Next is the column chart. We use two different column charts in this
dashboard: the clustered column chart and the line and clustered column chart.
The clustered column chart displays the number of people by age category and
gender. It shows that older adults, aged between 65 and 79, make up the largest
group, and that males outnumber females in every age category. The line and
clustered column chart displays the number of people by BMI category, and it
shows no correlation between BMI and heart disease.
Lastly, there is the key indicator, which is used to show relationships
between the risk factors and heart disease. People with risk factors such as
kidney disease and diabetes have a higher potential to develop heart disease.
2.1.5 Visualization and Reporting
The tool that we used for visualising our data is Power BI. Charts, graphs, maps, tables, and
custom visuals are just a few of the many built-in visualizations available in Power BI.
These visual representations enable a thorough analysis of the prevalence of heart disease
across various demographic groups. They give insight into how heart disease cases are
distributed by age, gender, and BMI category, which helps identify potential disparities,
correlations, and trends. This in turn supports targeted interventions, individualized
healthcare plans, and well-informed decision-making for stakeholders.
Wireframe
Interface design holds significance as it ensures the alignment of the dashboard layout with
user requirements. To convey the planned design of the project, we will employ a low-fidelity
interface design.
The homepage displays the structure, assisting users in understanding the aim and scope of the
project, which is heart disease. It includes elements such as a logo, a title, and a visual design.
It defines the dashboard identity and ensures that users have a uniform experience across the
dashboard.
Figure 2.8: Wireframe of charts on various factors
This page depicts the association between several factors and the existence of cardiovascular
disease. We decided to group all charts on one page so that users could see the overall effect
rather than browsing through many pages, which could give the impression that the dashboard
is time-consuming.
This page helps users better understand the relationship between heart disease and other
diseases such as kidney disease, diabetes, and borderline diabetes. Since these three are
considered diseases rather than demographics, we decided to place them on a separate
page.
CHAPTER 3: ANALYSIS AND FINDINGS
Second, the clustered column chart on the left side compares each age category and gender for
this dataset. In the figure above, the red bar represents female, and the blue bar represents male.
This chart shows that males outnumber females across all age groups.
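The counts behind a clustered column chart of this kind can be reproduced with a cross-tabulation. The sketch below uses invented rows, and the column names AgeCat and Sex are assumptions, so the numbers say nothing about the real dataset.

```python
import pandas as pd

# Toy rows for illustration only
hd = pd.DataFrame({
    "AgeCat": ["Adult", "Adult", "Older adult", "Older adult", "Older adult"],
    "Sex": ["Male", "Female", "Male", "Male", "Female"],
})

# One row per age category, one column per gender, cells hold head counts
counts = pd.crosstab(hd["AgeCat"], hd["Sex"])
print(counts)
```

Each row of the table corresponds to one cluster in the chart, and each column to one bar colour.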
Last but not least, the clustered column chart on the right side shows the relationship between
BMI category and whether or not a person has heart disease. By comparing the four BMI
categories of underweight, healthy, overweight, and obese, we conclude that BMI has no
influence on whether a person gets this disease or not. This is because the difference between
the true and false counts is enormous in every category.
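Because the BMI categories differ greatly in size, raw counts can be hard to compare, and the share of heart disease cases within each category is often easier to read. The sketch below shows one way to compute those shares; the rows are invented and the column names are assumptions, so the output does not reflect the real dataset.

```python
import pandas as pd

# Toy rows for illustration only
hd = pd.DataFrame({
    "BMIcat": ["Healthy"] * 4 + ["Obese"] * 4,
    "HeartDisease": ["No", "No", "No", "Yes", "No", "No", "Yes", "Yes"],
})

# normalize="index" turns each row into proportions that sum to 1,
# so categories of different sizes can be compared directly
rates = pd.crosstab(hd["BMIcat"], hd["HeartDisease"], normalize="index")
print(rates)
```

Similar values in the "Yes" column across all categories would support the conclusion of no relationship, while clearly different values would suggest one.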
(Figure 3.2: Analysis on three series disease (Kidney))
Figure 3.2 shows the relationship between heart disease and kidney disease. According to the
analysis, a person with kidney disease is 3.19 times more likely to develop heart disease,
where the true percentage is 29.33% and the false percentage is 7.77%.
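One plain way to compare the two percentages is to take their ratio. The sketch below assumes that the "true" and "false" percentages are the heart disease rates among people with and without kidney disease; that reading is an assumption, and the variable names are hypothetical.

```python
# Percentages as reported in the text
rate_with_kidney_disease = 29.33     # "true" percentage
rate_without_kidney_disease = 7.77   # "false" percentage

# Ratio of the two rates (a simple relative likelihood)
ratio = rate_with_kidney_disease / rate_without_kidney_disease
print(round(ratio, 2))
```

The direct ratio comes to about 3.77, which differs from the 3.19 shown by the visual. This suggests that the visual derives its figure from a different baseline or calculation; the report does not say which.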
CDC. (2020, September 17). Assessing Your Weight. Centers for Disease Control and
Prevention.
https://www.cdc.gov/healthyweight/assessing/index.html#:~:text=If%20your%20BMI
%20is%20less