
Kingdom of Saudi Arabia

Ministry of Education
King Faisal University
College of Computer Sciences & Information Technology
Department of Information System

Course: Data Mining and Data Warehousing


Instructor: Dr. Abdulaziz Saad Albarrak
Student ID
Faddel Ali Sahool 218011753
Hadi Al-naser 2180131879
Mohammed Ibrahim Alhussain 216016024
Hashem Zakaria Alsadah 218028932
Heart Disease Prediction Using RapidMiner
1. Introduction
In the modern technological world, heart disease has continued to increase and has become one of
the major causes of death internationally. More people around the world die annually from
cardiovascular diseases than from any other cause. In the United States, heart disease is
reported to kill one person every 35 seconds. Heart attacks can be said to be tragic since they
block the flow of blood to the brain or the heart; the people at risk may show elevated
blood pressure, glucose, and stress. Other heart-related diseases include coronary heart
disease and hypertension (Mohan et al., 2019). Data mining has been used in a variety of
applications, for instance marketing, customer relationship management, engineering and
medical analysis, mobile computing, and web mining. It can be used to find hidden patterns
and information that may contribute to making accurate decisions and providing quality
services to the public. Different machine-learning techniques can be employed in data mining to
solve heart-related issues in the health sector. The paper looks toward managing heart disease
effectively through a combination of lifestyle changes, medicine, and surgery. The objective of
this work is to accurately predict the presence of heart disease using a few tests and
attributes (Purushottam et al., 2016).
2. Project Objective
The main objective of this analysis is to develop a heart disease prediction system using
RapidMiner. The algorithms can discover and extract hidden information about the disease from
the historical data set.
The system aims to apply data mining methods to the available medical data to assist
in heart disease prediction.
2.1. Importance of the project
Clinical decisions are normally based on the doctor’s experience and insights rather than on the
knowledge-rich data hidden in medical datasets. This has led to unwanted biases, errors, and
excessive medical costs, which affect the quality of service given to patients. The
prediction models offer a new way to uncover concealed patterns in the available heart disease
medical data, and hence can play a big part in reducing the human biases that have
been present in this sector for a very long time. The project could help save lives through
education on a healthy lifestyle based on the available predictive models. The study could also
reduce the cost of medical tests, since the results will be obtained using this technique.

3. Data collection and selection


The data set was collected from the UCI Machine Learning Repository, which is freely available.
The data was found suitable for developing models since it contains few missing values and
outliers. The data is normally preprocessed and cleaned before training and testing. The
UCI repository of databases, data generators, and domain theories is commonly used
for descriptive data analysis. The data was obtained by downloading it, and it can be said to
contain adequate data for heart disease prediction.

4. Data Exploration:
The dataset contained 270 records with 14 variables: ‘age’, ‘sex’, ‘chest pain type’,
‘BP’, ‘cholesterol’, ‘FBS over 120’, ‘EKG results’, ‘max HR’, ‘exercise angina’, ‘ST
depression’, ‘slope of ST’, ‘Number of vessels fluro’, ‘Thallium’, and ‘Heart disease’.
Heart disease is the target variable, while the other variables are the independent variables
that influence the occurrence of heart disease.
Quartiles were used to identify outliers in the variables, as shown in the diagram below.

Figure 2. Variables Quartiles
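The quartile-based outlier check can be sketched as follows. This is a minimal illustration that assumes the common 1.5 × IQR rule; the report does not state which rule RapidMiner applied, and the function name and sample readings are made up for the example.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR].

    The multiplier k=1.5 is an assumption (the usual box-plot convention).
    """
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Made-up cholesterol-like readings, not taken from the dataset.
sample = [210, 225, 230, 240, 245, 250, 260, 564]
print(iqr_outliers(sample))
```

For the sample above, only the reading 564 falls outside the fences and is flagged.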

5. Data Preparation:
The data set was split into training and testing datasets in a 75% / 25% ratio. From the data
analysis in RapidMiner, the dataset had no missing values; hence no treatment was required
to fill in missing values. Normalization was also applied to some of the values.
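The preparation steps can be sketched in a few lines. The sketch assumes that the 75% portion is the training set and that normalization means min-max scaling to [0, 1]; the report specifies neither, and the helper names and the seed are hypothetical.

```python
import random

def train_test_split(rows, train_fraction=0.75, seed=42):
    """Shuffle a copy of the rows and split it into (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def min_max_normalize(column):
    """Rescale a numeric column to the range [0, 1] (assumed method)."""
    low, high = min(column), max(column)
    if high == low:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in column]
    return [(v - low) / (high - low) for v in column]

rows = list(range(270))  # stand-in for the 270 records
train, test = train_test_split(rows)
print(len(train), len(test))
print(min_max_normalize([0, 5, 10]))
```

With 270 records this split gives 202 training rows and 68 testing rows.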

6. Modeling and Evaluation:


i. Correlation Matrix
The correlation matrix is a table that shows the correlation coefficients between the variables.
Each cell in the table shows the correlation between two attributes. It was
useful for summarizing the data, as an input to more advanced analysis, and as a
diagnostic for that analysis.

Figure 3. Heatmap of Features

From the heat map above, or the correlation matrix, it is clear that no single attribute has a
high correlation with heart disease, which is the target variable.
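How each cell of such a matrix is obtained can be illustrated with a short sketch. It assumes the Pearson coefficient, which the report does not name explicitly; the column values are made up and the function names are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length columns.

    Assumes neither column is constant (otherwise the denominator is zero).
    """
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def correlation_matrix(columns):
    """Build a nested dict: matrix[a][b] is the correlation of a with b."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names}
            for a in names}

# Made-up values for three of the variables, for illustration only.
columns = {
    "age": [45, 52, 61, 38, 70],
    "max HR": [170, 160, 140, 180, 120],
    "heart disease": [0, 0, 1, 0, 1],
}
matrix = correlation_matrix(columns)
for name, row in matrix.items():
    print(name, {other: round(value, 2) for other, value in row.items()})
```

Each diagonal cell is 1, and the matrix is symmetric, so only one triangle needs to be inspected when looking for attributes that correlate with the target.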

ii. Confusion Matrix


This is also known as the error matrix, which is a visualization of the performance of an
algorithm, typically a supervised learning one. Each row represents the instances
in an actual class, and each column represents the instances in a predicted class. The typical
confusion matrix is made up of TN: true negative, TP: true positive, FN: false negative, and
FP: false positive.
                      Predicted Positive (1)   Predicted Negative (0)
Actual Positive (1)   TP                       FN
Actual Negative (0)   FP                       TN

Figure 4. Graphical representation of confusion matrix
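The four cells are enough to derive the performance measures used later in this report. The sketch below computes accuracy together with precision and recall averaged over the two classes; the function name is hypothetical, and the example call uses the KNN counts as they are labeled in this report.

```python
def confusion_metrics(tp, fn, fp, tn):
    """Accuracy plus class-averaged precision and recall from four counts."""
    total = tp + fn + fp + tn
    accuracy = (tp + tn) / total
    # Precision: of the records predicted as a class, how many were right.
    precision_pos = tp / (tp + fp)
    precision_neg = tn / (tn + fn)
    # Recall: of the records actually in a class, how many were found.
    recall_pos = tp / (tp + fn)
    recall_neg = tn / (tn + fp)
    return {
        "accuracy": accuracy,
        "precision": (precision_pos + precision_neg) / 2,
        "recall": (recall_pos + recall_neg) / 2,
    }

# KNN counts as labeled in this report: TP=111, FN=50, FP=39, TN=70.
metrics = confusion_metrics(tp=111, fn=50, fp=39, tn=70)
for name, value in metrics.items():
    print(f"{name}: {value:.2%}")
```

For these counts the sketch gives an accuracy of about 67.04%, an averaged precision of about 66.17%, and an averaged recall of about 66.58%. These are close to, but not identical with, the values RapidMiner reported, which may reflect per-fold averaging or a different row and column orientation in its output.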


iii. KNN (K Nearest Neighbor)
KNN is a commonly used machine learning technique. It is mostly used with continuous
parameters, and it is popular because of its simplicity in prediction. The algorithm classifies
a record according to the classes of its nearest neighbors in the training data, here to predict
whether the person has heart disease or not. KNN was used to predict the people with heart
disease from variables such as age and sex.

The confusion matrix shows 70 true negatives, 39 false positives, 50 false negatives, and
111 true positives. The model achieved an accuracy of 66.80%.
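The idea behind KNN can be sketched as below. The sketch assumes Euclidean distance and k = 3, neither of which is stated in the report, and it uses made-up, already normalized records; it is not the RapidMiner operator itself.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Predict the majority label among the k training points nearest to query."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

# Made-up normalized (age, max HR) pairs, for illustration only.
points = [(0.1, 0.9), (0.2, 0.8), (0.15, 0.85),
          (0.8, 0.2), (0.9, 0.1), (0.85, 0.3)]
labels = ["absence", "absence", "absence",
          "presence", "presence", "presence"]

print(knn_predict(points, labels, (0.82, 0.25)))
print(knn_predict(points, labels, (0.12, 0.88)))
```

Because the prediction depends on distances, attributes on very different scales are usually normalized first so that no single attribute dominates.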

iv. Decision Trees


The decision tree is a type of machine learning classifier. It works with both numerical
values and categorical data. It resembles a tree with nodes, branches, and
leaf nodes, in which the branches represent values of the attributes in the data set.

The confusion matrix shows 75 true negatives, 31 false positives, 45 false negatives, and
119 true positives. The decision tree model achieved an accuracy of 71.80%.
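How a tree chooses the test at a single node can be sketched as follows. The sketch assumes the Gini impurity as the splitting criterion, which may differ from the criterion configured in RapidMiner, and it searches thresholds on one numeric attribute only; the data and function names are made up.

```python
def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_threshold(values, labels):
    """Return (threshold, weighted_gini) for the best split 'value <= threshold'."""
    best = (None, float("inf"))
    n = len(values)
    # The largest value is skipped because it would leave the right side empty.
    for t in sorted(set(values))[:-1]:
        left = [lab for v, lab in zip(values, labels) if v <= t]
        right = [lab for v, lab in zip(values, labels) if v > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = (t, score)
    return best

# Made-up max HR readings with a 0/1 heart disease label.
max_hr = [110, 120, 125, 150, 160, 170]
disease = [1, 1, 1, 0, 0, 0]
print(best_threshold(max_hr, disease))
```

A full tree repeats this search over every attribute at every node and stops when the nodes are pure or a depth limit is reached.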

v. Random forest
Random forest is a machine-learning algorithm that is used for both classification and
regression. It builds multiple decision trees on samples of the data and combines their outputs
to obtain the prediction, in this case heart disease.
The confusion matrix shows 83 true negatives, 20 false positives, 37 false negatives, and
130 true positives. The random forest model achieved an accuracy of 78.73%.
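The two ingredients of a random forest, bootstrap sampling and majority voting, can be sketched as below. To keep the example short, each "tree" is replaced by a hypothetical one-feature stump that predicts the class with the nearest mean; a real forest would grow full decision trees. The data, seed, and tree count are assumptions.

```python
import random
from collections import Counter

def train_stump(rows, rng):
    """Fit a one-feature classifier: store the per-class mean of a randomly
    chosen feature (a simplified stand-in for a real decision tree)."""
    feature = rng.randrange(len(rows[0][0]))
    by_class = {}
    for features, label in rows:
        by_class.setdefault(label, []).append(features[feature])
    means = {label: sum(vals) / len(vals) for label, vals in by_class.items()}
    return feature, means

def stump_predict(stump, features):
    """Predict the class whose stored mean is closest on the stump's feature."""
    feature, means = stump
    return min(means, key=lambda label: abs(features[feature] - means[label]))

def train_forest(rows, n_trees=25, seed=0):
    """Train each stump on a bootstrap sample drawn with replacement."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        sample = [rng.choice(rows) for _ in rows]
        forest.append(train_stump(sample, rng))
    return forest

def forest_predict(forest, features):
    """Combine the individual predictions by majority vote."""
    votes = Counter(stump_predict(stump, features) for stump in forest)
    return votes.most_common(1)[0][0]

# Made-up normalized (age, ST depression) records; 1 = heart disease.
rows = [((0.1, 0.2), 0), ((0.2, 0.1), 0), ((0.15, 0.3), 0),
        ((0.8, 0.9), 1), ((0.9, 0.8), 1), ((0.85, 0.7), 1)]
forest = train_forest(rows)
print(forest_predict(forest, (0.85, 0.9)))
```

Averaging many learners trained on different samples is what usually makes the forest more stable than a single tree, which is consistent with the higher accuracy reported here.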

vi. Linear Regression

The confusion matrix shows 95 true negatives, 18 false positives, 25 false negatives, and
132 true positives. The linear regression model achieved an accuracy of 83.93%.

7. Result and Analysis


The models were implemented in RapidMiner to predict heart disease using the
different machine learning algorithms. The table below shows the classification accuracy,
precision, and recall of the models. The precision and recall values are the averages of the
per-class precisions and recalls.
Table 1. Algorithm performance comparison

Algorithm           Accuracy   Precision   Recall
KNN                 66.80%     66.58%      66.165%
Decision Tree       71.80%     71.655%     70.915%
Random Forest       78.73%     79.21%      77.92%
Linear Regression   83.93%     84.08%      83.59%

From the analysis above, linear regression had the highest accuracy with 83.93%, followed by
random forest with 78.73%, the decision tree with 71.80%, and KNN with 66.80%. The most precise
model was also linear regression with 84.08%, and it had the highest recall with 83.59%. Hence,
of the four classification algorithms we implemented to find the best one for predicting heart
disease, linear regression was the best model, with the highest accuracy, precision, and recall.
Heart disease is a sensitive and critical disease that has resulted in many deaths, so the
model can play a big role in prediction and thereby help reduce those deaths. Accuracy and
TP should therefore be kept high, while FP should be kept as low as possible.

Conclusion
Heart disease is a major concern in society due to poor living and eating habits in most
countries. It is hard to manually determine the odds of getting heart disease based on the risk
factors. Machine learning, which can be implemented in different settings, plays a vital role in
the prediction of this disease. It is clear from this implementation that machine learning
algorithms can be applied in the healthcare sector to predict and diagnose heart disease at an
early stage.
References

D’Agostino, R. B., Levy, D., Belanger, A. M., Silbershatz, H., & Kannel, W. B. (1998).
Prediction of Coronary Heart Disease Using Risk Factor Categories. Circulation, 97(18),
1837–1847. https://doi.org/10.1161/01.cir.97.18.1837
Mohan, S., Thirumalai, C., & Srivastava, G. (2019). Effective Heart Disease Prediction Using
Hybrid Machine Learning Techniques. IEEE Access, 7, 81542–81554.
https://doi.org/10.1109/access.2019.2923707
Purushottam, Saxena, K., & Sharma, R. (2016). Efficient Heart Disease Prediction System.
Procedia Computer Science, 85, 962–969. https://doi.org/10.1016/j.procs.2016.05.288
UCI Machine Learning Repository: Heart Disease Data Set. (2022).
https://archive.ics.uci.edu/ml/datasets/heart+disease
