A Scalable Solution For Heart Disease Prediction Using Classification Mining Technique


Abstract
The healthcare sector is among the most data-intensive industries, owing to advances that generate an ever-growing volume of digital data. Technology has progressed to the point where it can manage these massive data quantities, allowing more accurate prediction of relevant events. In this work we propose a scalable system that forecasts cardiac disease from a variety of characteristics, using data supplied by healthcare providers. We present a distinct approach to detecting heart disease that focuses on a specific set of symptoms. Our prediction technique uses random forests running on Apache Spark, which offers a range of options for deploying the solution in a dynamic, scalable big data environment and allows analysts in the medical insurance industry to make informed judgements. The results demonstrate the method's high accuracy, which can exceed 98%. Furthermore, we compare our technique with the naive Bayes classifier and show that the random forest approach performs considerably better.

1. Introduction
The increasing deployment of electronic health record (EHR) systems is rapidly expanding the digital clinical data repository in the healthcare sector. By 2020, digital computing was projected to generate 25 zettabytes (10^21 bytes) of data, calling for processing power on the order of 500 petaflops (10^15 floating-point operations per second). The term "big data analytics" refers to the study of massive data sets to extract valuable findings. Clinical analytics has likewise evolved significantly, making it feasible to analyse large data sets and draw crucial conclusions from them. This opens up opportunities to reduce healthcare costs and enhance illness diagnostics.
This article uses the coronary artery disease (CAD) condition to illustrate this healthcare environment. "Heart disease" is an umbrella term for several illnesses, each with its own set of symptoms, and some types of heart disease have symptoms that differ only slightly from others.
Patient data is frequently maintained in hospital databases handled by database management systems. These repositories hold vast amounts of knowledge, yet we seldom use them to make informed healthcare decisions. Discovering healthcare patterns, assisting in early illness detection, and preventing disease are just a few of the many potential uses for data mining and big data. The major focus of this study is the prediction and analysis of medical insurance expenses related to the diagnosis of various health conditions, including heart disease.
1.1. Key challenges
There are several significant obstacles to overcome when creating a scalable solution for heart disease prediction using classification mining techniques. First, a strong infrastructure and substantial computing power are needed to handle and analyse enormous amounts of healthcare data. Second, because of the sensitive nature of patient data, it is crucial to ensure data privacy and security. Third, the chosen classification mining method must be able to manage the complicated and varied data associated with heart disease. The dataset's class-imbalance problems must also be resolved, which is itself a difficult task. Finally, the solution's capacity to scale with a growing patient population and evolving medical data must be confirmed. To predict heart disease accurately and effectively, all of these issues must be addressed.
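One of the challenges above, class imbalance, can be handled by simple resampling before training. The sketch below is illustrative only (the paper does not specify its balancing method); it uses random oversampling of the minority class, and all names and the toy data are hypothetical:

```python
import random

def oversample_minority(records, label_key="class", seed=42):
    """Randomly duplicate minority-class records until classes are balanced."""
    rng = random.Random(seed)
    by_class = {}
    for r in records:
        by_class.setdefault(r[label_key], []).append(r)
    target = max(len(v) for v in by_class.values())
    balanced = []
    for label, rows in by_class.items():
        balanced.extend(rows)
        # Top up the smaller class by sampling with replacement.
        balanced.extend(rng.choice(rows) for _ in range(target - len(rows)))
    return balanced

# Toy data: 8 "no disease" records versus 2 "disease" records.
data = [{"class": 0}] * 8 + [{"class": 1}] * 2
balanced = oversample_minority(data)
counts = {c: sum(1 for r in balanced if r["class"] == c) for c in (0, 1)}
```

After oversampling, both classes contribute equally to training, at the cost of repeated minority records.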
1.2. Key contribution
The major contribution of a scalable system for heart disease prediction employing classification mining techniques is its capacity to handle enormous volumes of healthcare data efficiently and to predict heart disease cases reliably. Using classification mining algorithms, the solution can extract significant patterns and insights from varied and complicated heart disease datasets, resulting in improved diagnostic accuracy. Scalability guarantees that the solution can manage an expanding number of patients while adapting to changing medical data. This enables healthcare practitioners to make better-informed decisions, improve early identification, and put preventative measures in place, ultimately improving patient outcomes and lowering the burden of heart disease on healthcare systems.

2. Background
2.1. Methodology
1. Data Collection: Datasets containing relevant heart disease attributes will be obtained from reliable sources, such as the UCI machine learning repository's Cleveland, Switzerland, and Hungary collections.

2. Data Preprocessing: The collected datasets will be preprocessed to handle missing values, feature selection, and normalization, preparing them for classification mining techniques.

3. Implementation on Spark: Each method will be implemented on the Spark framework to analyze its performance in handling large-scale datasets.

4. Evaluation Metrics: Accuracy, precision, recall, F1-score, and computational time will be used as evaluation metrics to compare the effectiveness of each method.

5. Scalability Testing: The scalability of each method will be assessed by increasing the size of the training datasets and observing performance under varying data volumes.
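The evaluation metrics in step 4 all derive from the four confusion-matrix counts. The following self-contained sketch, using hypothetical label lists, shows one way to compute accuracy, precision, recall, and F1 for the binary labels used in this study:

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 for 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical true labels and predictions.
m = binary_metrics([1, 0, 1, 1, 0], [1, 0, 0, 1, 1])
```

Guarding the divisions avoids errors when a class is never predicted or never occurs.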

2.2. Below is a table comparing ten traditional methods for heart disease prediction using classification mining techniques in a scalable solution

Method | Description | Pros | Cons
Random Forest | Ensemble method of decision trees | High accuracy, handles large datasets effectively | Computationally expensive, may overfit on noisy data
Naive Bayes | Probability-based method | Simple and fast, handles high-dimensional data | Assumes features are independent, may not capture complex relationships
Decision Tree | Tree-based method for classification | Easy to interpret, handles both numerical and categorical data | Prone to overfitting, may create biased trees
Support Vector Machine | Linear and non-linear classifier | Effective in high-dimensional spaces, good generalization | Computationally intensive for large datasets
k-Nearest Neighbors | Instance-based method | Simple to implement, adapts well to data changes | Computationally expensive for large datasets
Logistic Regression | Linear model for binary classification | Easy to interpret, works well with small datasets | Limited to linear decision boundaries
Neural Networks | Deep learning technique | High flexibility, can capture complex patterns | Requires large training data and computational resources
Genetic Algorithms | Optimization-based method using natural selection | Finds global optima, handles non-linear relationships | Convergence may be slow, requires fine-tuning
Decision Forest | Ensemble of decision trees with feature bagging | Good generalization, reduces overfitting | More complex to implement than single decision trees
Gradient Boosting | Sequentially adds weak learners to boost performance | High accuracy, handles large datasets | Sensitive to noisy data and prone to overfitting

3. Objectives
This study employs a strategy based on random forests and Apache Spark to identify a smaller set of criteria for accurate coronary artery disease prediction. The outcomes of this study will be useful for healthcare insurance analysts working in a scalable big data environment, allowing informed decision-making.
The fundamental goal of this research is to address critical challenges, namely the following:
1. Complexity: even when the amount of data reviewed remains constant, some analytical approaches may need extra processing time.
2. Prediction accuracy: data mining techniques, including classification, segmentation, regression, and association, achieve varied degrees of prediction accuracy.
3. Quantity of data: even for relatively basic studies, analysing a large quantity of data can require significant time and resources.
4. Parallel computing: using the classifier model, computation is distributed over several processors, enabling the efficient execution of jobs that demand significant computational capacity.

The study's goal is to predict heart illness accurately by identifying the most important attributes, which will be accomplished by discovering critical patterns in medical data.

4. Related work
An business may improve its processing effectiveness and analytical capabilities by moving
away from traditional databases and towards big data platforms like Apache Hadoop.
However, in order to ensure effective usage, a number of preliminary steps are required for
the extraction of crucial information from vast databases. At every level, working with large
amounts of healthcare data involves a range of difficulties that must be solved. Among the
difficulties include data collecting, network upkeep, data processing, analysis, and signal
interpretation. [2] The intricacy of the investigation, the magnitude of the dataset, and the
parallelization of the computing paradigm have all received attention. According on health
indicators, the authors of [4] developed a method for quickly identifying the standards for
calculating participants' risk levels. Their system employed this strategy. The model now
includes these rules that were produced by the decision tree. Memory computing is a feature
of Spark, a distributed computing system based on MapReduce techniques from Hadoop.
This considerably boosts the effectiveness of data evaluation [5].
5. Proposed system
The incorporation of genetic search resulted in the identification of 13 essential characteristics that contribute to the prediction of cardiac disease, leading to a considerable improvement in the accuracy of classifier predictions. Classifiers and ensemble algorithms such as Random Forest analyse datasets with these 13 attributes as input in order to arrive at accurate diagnoses of cardiovascular disease. After feature subset selection, the Random Forest data mining technique displayed performance equivalent to that of the Naive Bayes and Decision Tree algorithms, although building models with Random Forest took more time. Remarkably, Random Forest maintained consistent performance even after the number of attributes was reduced, without changing the time it took to create the model.
The proposed system consists of the following components:
• Collecting healthcare information related to cardiovascular disease
• Storing and analysing data using Spark and the Hadoop Distributed File System
• Applying the Random Forest algorithm
• Assessing overall performance, taking into account factors such as accuracy and error rate

6. Methodology
Our organisation employs the Apache Spark and Hive platforms because of their inherent
flexibility for expanding with attribute data sets. The powerful Spark engine, which also
provides high-level APIs for the computer languages Java, Scala, Python, and R, can manage
large-scale execution graphs. There is also support for a number of higher-level tools,
including Flink, MLlib for deep learning, GraphX for graph processing, and Spark SQL for
SQL programming [5].
Spark SQL has a substantial impact in the realms of knowledge affinity, performance
development, and feature enhancement. These advantages are the result of better SQL
execution in Spark. Because of Apache Spark's data-flow computational design, Spark
Streaming may be leveraged to provide a strong API for batch, streaming, and interactive
querying. Apache Spark makes this possible.

Thanks to GraphX, Spark and graph management may now benefit from a concurrent
processing API. As a result, there has been a considerable improvement in speed as well as a
reduction in memory overhead. Spark MLlib is a library that provides scalable machine
learning by providing appropriate test generators and data providers. By a factor of 100, the
performance of the techniques in this package beats that of the Map and Reduce strategy.
Dimensionality reduction, clustering, and collaborative filtering are just a few of the machine
learning methods that are compatible with this system.

Fig 1 - System architecture

Fig 2 - Spark ecosystem

Figure 1 depicts the system architecture as a visual representation of the solution. Before Spark's machine learning libraries can be used, the Spark ecosystem, shown in Figure 2, must first be understood. Datasets on cardiovascular illness are collected, stored as CSV files, and then examined using the appropriate class labels.

Class 0 indicates the absence of cardiac disease, whereas class 1 indicates its presence. These classes form the basis for the quantitative prediction of the onset of heart disease. Storing the data in CSV format on HDFS guarantees fault tolerance. Accordingly, we account for any missing attribute values while gathering and processing data. The random forest technique can predict the classification of freshly obtained unlabeled datasets. Because accuracy is measured after the technique has been applied repeatedly over a growing number of training data sets, it can scale to massive data sets. The data set's accuracy and error rate are then evaluated against the time Spark takes to finish its calculations.
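As a sketch of the missing-value handling mentioned above, the snippet below parses a small CSV fragment and fills missing numeric attributes with the column mean. It is illustrative only: the paper does not specify its imputation strategy, and the column names and the "?" missing-value marker are assumptions.

```python
import csv
import io

def load_and_impute(csv_text, label_col="class"):
    """Parse a CSV and replace missing numeric values with the column mean."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    cols = [c for c in rows[0] if c != label_col]
    means = {}
    for c in cols:
        # Compute the mean over the values that are actually present.
        vals = [float(r[c]) for r in rows if r[c] not in ("", "?")]
        means[c] = sum(vals) / len(vals)
    for r in rows:
        for c in cols:
            r[c] = means[c] if r[c] in ("", "?") else float(r[c])
        r[label_col] = int(r[label_col])
    return rows

# Hypothetical fragment: one record is missing its maximum heart rate.
sample = "age,mhr,class\n63,150,1\n41,?,0\n56,120,1\n"
rows = load_and_impute(sample)
```

Mean imputation keeps every record usable; more careful schemes (per-class means, model-based imputation) are possible but not described in the paper.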

Fig 3 - Data storage on HDFS

The need to store data in several places arises when a dataset's size exceeds the capacity of a single machine. The Hadoop Distributed File System (HDFS) was created to support streaming access patterns for huge files on commodity clusters. For files of any size, from megabytes to terabytes, HDFS follows a "write once, read many times" model without any issues. A range of analyses designed especially for huge datasets is then applied to the data, which are frequently created from scratch or imported from other sources.
HDFS makes use of both master and slave nodes. The name node serves as the master node and is in charge of preserving the directory structure and file metadata, while the data nodes are the slave nodes and are in charge of storing the actual data. The name node holds details about the data nodes, which are where the file blocks are actually kept. Figure 3 shows the HDFS file system layout.
Users connect to the name node to access a file, and the name node acts on their behalf in all communications with the data nodes. Users interact with a client interface and do not need to be familiar with the underlying name node or data nodes. For read and write operations, blocks of data are stored in and retrieved from the data nodes as necessary. The data nodes report the blocks they hold to the name node at regular intervals.

7. Implementation
The heart disease datasets come from a variety of sources, which are listed in [12]. The UCI machine learning repository is the most widely used of these sources, since it supplies data for data-driven machine learning applications. Data from Cleveland, Switzerland, and Hungary is collected to aid in the prediction of heart disease. The presence or absence of cardiac disease is indicated in a CSV file by the "class" attribute, whose values range from 0 (no presence) to 4 (complete presence). Experiments on the Cleveland database are primarily concerned with distinguishing absence (class value 0) from presence (class values 1 to 4). In this experiment, two classes are used: class 0 represents "absence", while class 1 represents "presence". To safeguard patients' right to privacy, their true identification information is converted into anonymous data.
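One common way to convert identification information into anonymous data is a salted one-way hash. The paper does not describe its anonymisation method, so the following is purely an illustrative sketch; the salt value and function name are hypothetical:

```python
import hashlib

def pseudonymize(patient_id, salt="study-secret"):
    """Replace a real identifier with a salted one-way hash (illustrative only)."""
    digest = hashlib.sha256((salt + patient_id).encode("utf-8")).hexdigest()
    # A shortened token stands in for the identifier in the released dataset.
    return digest[:16]

token = pseudonymize("patient-0042")
```

The mapping is deterministic, so repeat visits by the same patient link to the same token, while the salt prevents trivial dictionary attacks on the identifiers.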
The dataset used to diagnose heart disease is stored in the repository directory. Only 14 of the 76 raw attributes from the Cleveland database are used. These vital signs include, among others, ECG results, serum cholesterol, chest pain type, fasting blood sugar, and MHR (maximum heart rate). Domain ranges and further information on the attributes considered here can be found in [4].

Table 1 - Attribute information


The random forest algorithm [6] is explained below:

Random Forest is a powerful ensemble of unpruned classification trees, offering exceptional performance in practical scenarios such as predicting healthcare outcomes, thanks to its resistance to overfitting and dataset distortions. The method combines predictions from independently trained trees, delivering fast and robust results and outperforming other tree-based algorithms such as single decision trees. When building a random tree, the constructor offers three key options [6]:

1. The leaf-splitting procedure to follow.
2. The predictor type to be used in each leaf.
3. The method for introducing randomness into the tree structure.
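Option 3, injecting randomness, is typically realised by bootstrap sampling of records plus random feature subsets. The sketch below is a simplified illustration, not the paper's implementation: real random forests re-draw the feature subset at every split, whereas here one subset is drawn per tree for brevity.

```python
import math
import random

def bootstrap_and_features(rows, n_features, seed):
    """Draw a bootstrap sample and a random feature subset for one tree."""
    rng = random.Random(seed)
    # Sample rows with replacement, keeping the original sample size.
    sample = [rng.choice(rows) for _ in rows]
    # A common default considers sqrt(n_features) features.
    k = max(1, int(math.sqrt(n_features)))
    features = rng.sample(range(n_features), k)
    return sample, features

rows = list(range(10))          # stand-ins for training records
sample, feats = bootstrap_and_features(rows, n_features=13, seed=7)
```

With the paper's 13 attributes, each tree would consider about 3 features at a time, which is what decorrelates the trees in the ensemble.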

For our experiment, we selected suitable values for the parameters of the random-forest-generated decision trees. Specifically, we used numClasses = 2, numTrees = 3, impurity = "gini", maxDepth = 4, and supplied all features in categoricalFeaturesInfo.
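The impurity = "gini" setting controls how candidate splits are scored. As a quick reference, Gini impurity can be computed as below; this is an illustrative pure-Python sketch, not the Spark MLlib implementation:

```python
def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

pure = gini([1, 1, 1, 1])    # a perfectly pure leaf
mixed = gini([0, 0, 1, 1])   # maximally impure for two classes
```

A split is chosen to minimise the weighted impurity of the resulting child nodes, so pure leaves (impurity 0.0) are the goal.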

The random forest algorithm has many advantages for predicting cardiac illness, including the following [13]:

• More accurate than many other techniques

• Effective and efficient operation on extremely large datasets

• The ability to process thousands of input variables simultaneously without deleting any values

• Estimation of the important variables that are incorporated into the classification process

• Reliable estimation techniques that enable precise forecasting with less data

• Techniques for handling data sets with an imbalanced distribution

• The ability to use many data sources to generate, store, and reuse trees

• Unsupervised clustering, which recognises groups of instances based on the traits they share

In order to diagnose heart illness and assess the level of accuracy:

• HDFS acts as the repository for the datasets.
• Preprocessing completes supervised datasets that have missing values.
• The datasets are split 70:30 into a training set and a testing set.
• The training data are used for model fitting, followed by the adjustment of various parameters.
• The model's test error is calculated and evaluated using examples from the test set.
• Validation data can be used to detect the existence of heart disease.
• To assess a model's correctness, the predicted values are compared with the test data's prior labels.
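The 70:30 split and accuracy evaluation in the steps above can be sketched as follows. The "model" here is a deliberately trivial majority-class predictor used only to show the flow; it stands in for the random forest, and all names and data are hypothetical:

```python
import random

def split_70_30(rows, seed=1):
    """Shuffle and split records 70:30 into training and testing sets."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = int(0.7 * len(shuffled))
    return shuffled[:cut], shuffled[cut:]

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the prior labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

rows = [(i, i % 2) for i in range(100)]   # (record-id, class-label) stand-ins
train, test = split_70_30(rows)
# Placeholder model: always predict the majority class seen in training.
majority = max({0, 1}, key=lambda c: sum(1 for _, y in train if y == c))
acc = accuracy([y for _, y in test], [majority] * len(test))
```

Holding out the 30% test set before any fitting is what makes the reported accuracy an estimate of performance on unseen patients.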

Figures 4 to 7 provide a complete roadmap of the modules involved in accuracy testing and prediction on the Spark platform. These diagrams show the implementation's data flow for heart disease prediction accuracy and how the data moves across the system.

Fig 4 - Prediction data-flow-diagram


Fig 5 - Accuracy data-flow-diagram

Fig 6 - Prediction sequence-diagram


Fig 7 - Accuracy sequence-diagram

8. Experiments and results

Below is a summary of the essential system requirements for conducting an experiment focused on predicting cardiovascular diseases using the Random Forest classification method on the Spark framework:

Table 2 - System requirement

The graph below depicts the results of using the random forest method on Spark alongside the Naive Bayes prediction model. The graph clearly shows that Naive Bayes predictions are not as accurate as those made by the random forest. As demonstrated in Figure 8, the accuracy of Random Forest and Naive Bayes changes as the number of training datasets stored in HDFS rises; the two are shown in relation to one another. Figure 9 shows how the model's accuracy grows as the number of training datasets increases.

When the number of records is increased from 200 to 600, the accuracy rises considerably, from 88 percent to 98 percent. Notably, as the number of records grows from 200 to 400, the accuracy increases by 8 percent, while the increase from 400 to 600 records yields only a further 2 percent improvement, demonstrating diminishing returns.
Fig 8 - Accuracy comparison chart

Fig 9 - Accuracy graph of random forest


Fig 10 - Computation time graph

Fig 11 - Accuracy error graph

9. Conclusion and Future Enhancement


Big data analytics enables rapid analysis of the growing healthcare data set for disease prediction, removing unnecessary intermediaries in the process. We developed a scalable method for accurately recognising heart disease symptoms. With only 600 records in our dataset, we achieved 98% accuracy using the Spark framework's random forest technique.
In the future, we plan to investigate other ways of forecasting diseases in the healthcare system, such as early cancer detection. Furthermore, we plan to study how running on highly efficient clusters with large, well-structured datasets affects the system's overall performance and accuracy. This study lays the groundwork for the use of big data analytics to transform healthcare forecasting and decision-making, paving the way for more precise and rapid medical diagnoses.

References

1. Srivastava, K., & Choubey, D. K. (2020). Heart disease prediction using machine learning
and data mining. International Journal of Recent Technology and Engineering, 9(1), 212–
219.

2. Ayon, S. I., Islam, M. M., & Hossain, M. R. (2020). Coronary artery heart disease
prediction: A comparative study of computational intelligence techniques. IETE Journal of
Research.

3. Chin, A. J., Mirzal, A., Haron, H., & Hamed, H. N. A. (2016). IEEE Transactions on
Computational Biology and Bioinformatics.

4. Aggrawal, R., & Pal, S. (2020). Sequential feature selection and machine learning
algorithm-based patient’s death events prediction and diagnosis in heart disease. SN
Computer Science, 1(6).

5. Gao, X.-Y., Ali, A. A., Hassan, H. S., & Anwar, E. M. (2021). Improving the accuracy for
analyzing heart diseases prediction based on the ensemble method. Complexity, 2021, Article
ID 6663455.

6. Takci, H. (2018). Improvement of heart attack prediction by the feature selection methods.
Turkish Journal of Electrical Engineering and Computer Sciences, 26, 1–10.

7. Latha, C. B. C., & Jeeva, S. C. (2019). Improving the accuracy of prediction of heart
disease risk based on ensemble classification techniques. Informatics in Medicine Unlocked,
16, Article ID 100203.
8. Garate-Escamila, A. K., Hassani, A. E., & Andrès, E. (2020). Classification models for heart disease prediction using feature selection and PCA. Informatics in Medicine Unlocked, 19, Article ID 100330.

9. Spencer, R., Thabtah, F., Abdelhamid, N., & Thompson, M. (2020). Exploring feature selection and classification methods for predicting heart disease. Digital Health, 6, Article ID 2055207620914777.

10. Senan, E. M., Al-Adhaileh, M. H., Alsaade, F. W., et al. (2021). Diagnosis of chronic
kidney disease using effective classification algorithms and recursive feature elimination
techniques. Journal of Healthcare Engineering, 2021, Article ID 1004767.

11. Almansour, N. A., Syed, H. F., Khayat, N. R., et al. (2019). Neural network and support
vector machine for the prediction of chronic kidney disease: A comparative study. Computers
in Biology and Medicine, 109, 101–111.

12. Li, J. P., Haq, A. U., Din, S. U., Khan, J., Khan, A., & Saboor, A. (2020). Heart disease
identification method using machine learning classification in e-healthcare. IEEE, 8.

13. Angayarkanni, G. (2020). Selection of features associated with coronary artery diseases (CAD) using feature selection techniques. Journal of Xi'an University of Architecture & Technology, pp. 686–689.

14. Hasan, N., & Bao, Y. (2020). Comparing different feature selection algorithms for
cardiovascular disease prediction. Health and Technology, 11, 49–62.

15. Hancer, E., Xue, B., & Zhang, M. (2018). Differential evolution for filter feature selection
based on information theory and feature ranking. Knowledge-Based Systems, 140, 103–119.

16. Solorio-Fernandez, S., Carrasco-Ochoa, J. A., & Martínez-Trinidad, J. F. (2019). A review of unsupervised feature selection methods. Artificial Intelligence Review.
17. Sulaiman, M. A., & Labadin, J. (2015). Feature selection based on mutual information. In
Proceedings of the International Conference on Information Technology in Asia (CITA),
Kuching, Malaysia.

18. Kaushik, S., Choudhury, A., Jatav, A. K., et al. (2019). Comparative analysis of features
selection techniques for classification in healthcare. Lecture Notes in Computer Science.

19. Moorthy, U., & Gandhi, U. D. (2020). A novel optimal feature selection technique for
medical data classification using ANOVA based whale optimization. Journal of Ambient
Intelligence and Humanized Computing, 12, 1–12.

20. Rosely, N. F. L. M., Salleh@Sallehuddin, R., & Zain, A. M. (2019). Overview feature
selection using fish swarm algorithm. In Proceedings of the 2nd International Conference on
Data and Information Science.

21. Elavarasan, D., Durai Raj Vincent, P. M., Srinivasan, K., & Chang, C.-Y. (2020). A
hybrid CFS filter and RF-RFE wrapper-based feature extraction for enhanced agricultural.
Agriculture, 10, 400.

22. Dutta, A., Batabyal, T., Basu, M., & Acton, S. T. (2020). An efficient convolutional
neural network for coronary heart disease prediction. Expert Systems With Applications, 159,
Article ID 113408.
