An AAT Project Report
ON
HEART DISEASE PREDICTION

In partial fulfilment of the requirements for the award of
BACHELOR OF ENGINEERING
IN
Department of Information Science & Engineering

BY
CHITRA C    1BM20IS403
SANJANA N   1BM19IS143

Under the Guidance of
Shobhana T S
Assistant Professor, Department of Information Science & Engineering
2021-22

B.M.S College of Engineering
P.O. Box No. 1908, Bull Temple Road
(Autonomous Institution under VTU)
BENGALURU-560019

DEPARTMENT OF INFORMATION SCIENCE & ENGINEERING

CERTIFICATE

Certified that the Project has been successfully presented at B.M.S College of Engineering by CHITRA C (1BM20IS403) and SANJANA N (1BM19IS143) in partial fulfillment of the requirements for the V Semester degree in Bachelor of Engineering in Information Science & Engineering of Visvesvaraya Technological University, Belgaum, as a part of the project for Machine Learning (20ISSPCMLG) during the academic year 2021-2022.

Signature of the Faculty (Name and Designation)          Signature of the HOD (Name and Designation)

DEPARTMENT OF INFORMATION SCIENCE & ENGINEERING

DECLARATION

We, CHITRA C (1BM20IS403) and SANJANA N (1BM19IS143), students of 5th Semester, B.E., Department of Information Science & Engineering, BMS College of Engineering, Bangalore, hereby declare that this Project entitled "Heart Disease Prediction" has been carried out by us under the guidance of Prof. Shobhana T S, Assistant Professor, Department of ISE, BMS College of Engineering, Bangalore, during the academic semester Dec-Jan 2022. We also declare that, to the best of our knowledge and belief, the development reported here does not form part of any other report by any other students.

Signature
CHITRA C (1BM20IS403)
SANJANA N (1BM19IS143)

HEART DISEASE PREDICTION

ABSTRACT

In recent times, heart disease prediction is one of the most complicated tasks in the medical field. In the modern era, approximately one person dies per minute due to heart disease. Data science plays a crucial role in processing the huge amounts of data generated in healthcare. As heart disease prediction is a complex task, there is a need to automate the prediction process to avoid the risks associated with it and to alert the patient well in advance. This work makes use of the heart disease dataset available on Kaggle. The proposed work predicts the chances of heart disease and classifies a patient's risk level by implementing different techniques such as Naive Bayes, Decision Tree, Logistic Regression and Random Forest. Thus, this report presents a comparative study by analysing the performance of different machine learning algorithms. The trial results verify that the Random Forest algorithm achieved the highest accuracy of 90.16% compared to the other ML algorithms implemented.

Keywords: Decision Tree, Naive Bayes, Logistic Regression, Random Forest, Heart Disease Prediction

Table of Contents

1. INTRODUCTION
   1.1 Introduction
   1.2 Statement of the Problem
   1.3 Objective of the Study
2. LITERATURE SURVEY
3. SOFTWARE REQUIREMENT SPECIFICATION
   3.0 Functional Requirements
   3.1 Non-functional Requirements
   3.2 Hardware and Software Requirements
       3.2.1 Python Installation
       3.2.2 Anaconda Installation
       3.2.3 Jupyter Installation
4. SYSTEM DESIGN AND ANALYSIS
   4.1 Methodologies
   4.2 Flowchart of the Proposed Model
5. IMPLEMENTATION
6. TESTING
7. SNAPSHOTS
8. CONCLUSION
9. REFERENCES

CHAPTER 1
INTRODUCTION

1.1 INTRODUCTION

The human heart is the principal part of the human body: it regulates blood flow throughout the body. Any irregularity of the heart can cause distress in other parts of the body, and any disturbance to the normal functioning of the heart can be classified as a heart disease. In today's contemporary world, heart disease is one of the primary causes of death. Heart disease may occur due to an unhealthy lifestyle, smoking, alcohol and a high intake of fat, which may cause hypertension. According to the World Health Organization, more than 10 million people die due to heart diseases every single year around the world. A healthy lifestyle and early detection are the only ways to prevent heart-related diseases.

The main challenge in today's healthcare is the provision of best-quality services and effective, accurate diagnosis. Even though heart diseases have been found to be the leading cause of death in the world in recent years, they are also among the diseases that can be controlled and managed effectively. The whole accuracy in the management of a disease lies in detecting that disease at the proper time. The proposed work attempts to detect these heart diseases at an early stage to avoid disastrous consequences.

Mostly, medical databases consist of discrete information, so decision making using discrete data becomes a complex and tough task. Machine Learning (ML), which is a subfield of data mining, handles large-scale, well-formatted datasets efficiently. In the medical field, machine learning can be used for the diagnosis, detection and prediction of various diseases. The main goal of this work is to provide a tool for doctors to detect heart disease at an early stage. This in turn will help to provide effective treatment to patients and avoid severe consequences. ML plays a very important role in detecting hidden discrete patterns and thereby analysing the given data. After analysis of the data, ML techniques help in heart disease prediction and early diagnosis. This report presents a performance analysis of various ML techniques such as Naive Bayes, Decision Tree, Logistic Regression and Random Forest for predicting heart disease at an early stage.

1.2 Problem Statement

The primary goal is to develop a prediction engine which will allow users to check whether they have heart disease while sitting at home. The user need not visit the doctor for further treatment unless he has heart disease. The prediction engine requires a large dataset and efficient machine learning algorithms to predict the presence of the disease. Pre-processing the dataset to train the machine learning model involves removing redundant, null or invalid data for optimal performance of the prediction engine.

Doctors rely on common knowledge for treatment. When common knowledge is lacking, studies are summarized after some number of cases has been studied. But this process takes time, whereas if machine learning is used, the patterns can be identified earlier. For using machine learning, a huge amount of data is required, and only a very limited amount of data is available depending on the disease.
Also, the number of samples having no disease is very high compared to the number of samples having the disease. This project performs two case studies to compare the performance of various machine learning algorithms for disease prediction, to help identify such patterns, and to create a platform for easier data sharing and collaboration.

1.3 Objective of the Study

The main objectives of developing this project are:
1. To develop machine learning models to predict the future possibility of heart disease by implementing Logistic Regression and other tree-based models.
2. To determine significant risk factors, based on the medical dataset, which may lead to heart disease.
3. To analyse feature selection methods and understand their working principle.

CHAPTER 2
LITERATURE SURVEY

2.1 Introduction

Data mining is the process of finding previously unknown patterns and trends in databases and using that information to build predictive models. Data mining combines statistical analysis, machine learning and database technology to extract hidden patterns and relationships from large databases. The World Health Statistics 2018 report highlights the fact that one in three adults worldwide has raised blood pressure, a condition that causes around half of all deaths from stroke and heart disease. Heart disease, also known as cardiovascular disease (CVD), encloses a number of conditions that influence the heart, not just heart attacks. Heart disease has been the major cause of casualties in different countries, including India, and kills one person every 34 seconds in the United States. Coronary heart disease, cardiomyopathy and cardiovascular disease are some categories of heart diseases. The term "cardiovascular disease" includes a wide range of conditions that affect the heart and the blood vessels and the manner in which blood is pumped and circulated through the body. Diagnosis is a complicated and important task that needs to be executed accurately and efficiently. The diagnosis is often made based on the doctor's experience and knowledge, which can lead to unwanted results and excessive medical costs of treatments provided to patients. Therefore, an automatic medical diagnosis system would be exceedingly beneficial.

2.2 Heart Disease Prediction using Machine Learning
Authors: Apurb Rajdhan, Avi Agarwal, Milan Sai, Dundigalla Ravi and Prof. Dr. Poonam Ghuli
Published in: 2020, International Journal of Engineering Research & Technology (IJERT)

Different levels of accuracy have been attained using various data mining techniques, which are explained as follows.

Avinash Golande et al. studied various ML algorithms that can be used for classification of heart disease. Research was carried out to study Decision Tree, KNN and K-Means algorithms for classification, and their accuracies were compared. This research concludes that the accuracy obtained by Decision Tree was the highest; further, it was inferred that it can be made more efficient by a combination of different techniques and parameter tuning [1].

T. Nagamani et al. proposed a system which deployed data mining techniques along with the MapReduce algorithm. The accuracy obtained according to this paper, for the 45 instances of the testing set, was greater than the accuracy obtained using a conventional fuzzy artificial neural network.
Here, the accuracy of the algorithm used was improved due to the use of a dynamic schema and linear scaling [2].

Fahd Saleh Alotaibi designed an ML model comparing five different algorithms. The Rapid Miner tool was used, which resulted in higher accuracy compared to the Matlab and Weka tools. In this research the accuracies of the Decision Tree, Logistic Regression, Random Forest, Naive Bayes and SVM classification algorithms were compared, and the Decision Tree algorithm had the highest accuracy [3].

Anjan Nikhil Repaka et al. proposed a system that uses NB (Naive Bayesian) techniques for classification of the dataset and the AES (Advanced Encryption Standard) algorithm for secure data transfer for prediction of disease [4].

Theresa Princy R et al. executed a survey covering different classification algorithms used for predicting heart disease. The classification techniques used were Naive Bayes, KNN (K-Nearest Neighbour), Decision Tree and Neural Network, and the accuracy of the classifiers was analysed for different numbers of attributes [5].

Nagaraj M Lutimath et al. performed heart disease prediction using Naive Bayes classification and SVM (Support Vector Machine). The performance measures used in the analysis are Mean Absolute Error, Sum of Squared Error and Root Mean Squared Error, and SVM emerged as the superior algorithm in terms of accuracy over Naive Bayes [6].

The main idea behind the proposed system, after reviewing the above papers, was to create a heart disease prediction system based on the inputs as shown in Table 1. We analysed the classification algorithms, namely Decision Tree, Random Forest, Logistic Regression and Naive Bayes, based on their accuracy, precision, recall and F-measure scores, and identified the best classification algorithm that can be used for heart disease prediction.

CHAPTER 3
SOFTWARE REQUIREMENT SPECIFICATION

3.0 Functional Requirements: Dataset preparation and preprocessing

• Data Collection

Data collection is defined as the procedure of collecting, measuring and analysing accurate insights for research using standard validated techniques. A researcher can evaluate their hypothesis based on the collected data. In most cases, data collection is the primary and most important step for research, irrespective of the field of research. The approach to data collection differs for different fields of study, depending on the required information. The most critical objective of data collection is ensuring that information-rich and reliable data is collected for statistical analysis so that data-driven decisions can be made for research.

• Data Visualization

Data visualization is the graphical representation of information and data. By using visual elements like charts, graphs and maps, data visualization tools provide an accessible way to see and understand trends, outliers and patterns in data. An example is shown in Fig 3.1 below.

Fig 3.1 – Data Visualization
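To make the visualization step concrete, the minimal sketch below plots the class balance and a correlation heatmap for the Kaggle heart disease data. The file name heart.csv and the label column name target are assumptions based on the commonly used version of that dataset, not details taken from this report.

```python
# Minimal visualization sketch (file name "heart.csv" and label column "target" are assumed).
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv("heart.csv")

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Bar chart of how many records have heart disease (1) versus not (0).
sns.countplot(x="target", data=df, ax=axes[0])
axes[0].set_title("Class balance")

# Heatmap of pairwise correlations to spot attributes related to the target.
sns.heatmap(df.corr(numeric_only=True), cmap="coolwarm", ax=axes[1])
axes[1].set_title("Feature correlations")

plt.tight_layout()
plt.show()
```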
• Data Labelling

Supervised machine learning, which we discuss below, entails training a predictive model on historical data with predefined target answers. An algorithm must be shown which target answers or attributes to look for; mapping these target attributes in a dataset is called labelling. Data labelling takes much time and effort, as datasets sufficient for machine learning may require thousands of records to be labelled. For instance, if your image recognition algorithm must classify types of bicycles, these types should be clearly defined and labelled in the dataset.

• Data Selection

Data selection is defined as the process of determining the appropriate data type and source, as well as suitable instruments to collect data. Data selection precedes the actual practice of data collection. This definition distinguishes data selection from selective data reporting (selectively excluding data that is not supportive of a research hypothesis) and interactive/active data selection (using collected data for monitoring activities/events, or conducting secondary data analyses). The process of selecting suitable data for a research project can impact data integrity. After having collected all the information, a data analyst chooses a subgroup of data to solve the defined problem. The selected data includes the attributes that need to be considered when building a predictive model.

• Data Pre-processing

Data pre-processing is a data mining technique that involves transforming raw data into an understandable format. Real-world data is often incomplete, inconsistent and/or lacking in certain behaviours or trends, and is likely to contain many errors. Data pre-processing is a proven method of resolving such issues. The purpose of pre-processing is to convert raw data into a form that fits machine learning. Structured and clean data allows a data scientist to get more precise results from an applied machine learning model. The technique includes data formatting, cleaning and sampling.

Data formatting. The importance of data formatting grows when data is acquired from various sources by different people. The first task for a data scientist is to standardize record formats. A specialist checks whether the variables representing each attribute are recorded in the same way. Titles of products and services, prices, date formats and addresses are examples of such variables. The principle of data consistency also applies to attributes represented by numeric ranges.

Data cleaning. This set of procedures allows for removing noise and fixing inconsistencies in data. A data scientist can fill in missing data using imputation techniques, e.g. substituting missing values with mean attribute values. A specialist also detects outliers, observations that deviate significantly from the rest of the distribution. If an outlier indicates erroneous data, a data scientist deletes or corrects it if possible. This stage also includes removing incomplete and useless data objects (a short sketch of this cleaning step follows after this list).

Data anonymization. Sometimes a data scientist must anonymize or exclude attributes representing sensitive information (for instance when working with healthcare and banking data).

Data sampling. Big datasets require more time and computational power for analysis. If a dataset is too large, applying data sampling is the way to go. A data scientist uses this technique to select a smaller but representative data sample to build and run models much faster, and at the same time to produce accurate outcomes.
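As a minimal sketch of the cleaning step described above, the snippet below removes duplicates, imputes missing numeric values with the column mean, and flags simple outliers. The file name heart.csv and the cholesterol column name chol are illustrative assumptions, not details taken from this report.

```python
# Minimal data-cleaning sketch (file name "heart.csv" and column "chol" are assumed).
import pandas as pd

df = pd.read_csv("heart.csv")

# Remove exact duplicate records.
df = df.drop_duplicates()

# Fill missing numeric values with the column mean (mean imputation).
df = df.fillna(df.mean(numeric_only=True))

# Flag outliers: observations more than 3 standard deviations from the column mean.
chol_mean, chol_std = df["chol"].mean(), df["chol"].std()
outliers = df[(df["chol"] - chol_mean).abs() > 3 * chol_std]
print(f"{len(outliers)} potential outliers found in 'chol'")
```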
• Data Transformation

Data transformation is the process of converting data from one format or structure into another. Data transformation is critical to activities such as data integration and data management. It can include a range of activities: you might convert data types, cleanse data by removing nulls or duplicate data, enrich the data, or perform aggregations, depending on the needs of your project.

Scaling. Data may have numeric attributes (features) that span different ranges, for example millimetres, metres and kilometres. Scaling is about converting these attributes so that they all have the same scale, such as between 0 and 1, or between 1 and 10 for the smallest and biggest value of an attribute.

Decomposition. Sometimes finding patterns in data with features representing complex concepts is more difficult. The decomposition technique can be applied in this case. During decomposition, a specialist converts higher-level features into lower-level ones; in other words, new features based on the existing ones are added. Decomposition is mostly used in time series analysis. For example, to estimate the demand for air conditioners per month, a market research analyst converts data representing demand per quarter.

Aggregation. Unlike decomposition, aggregation aims at combining several features into a feature that represents them all. For example, you may have collected basic information about your customers, particularly their age. To develop a demographic segmentation strategy, you need to distribute them into age categories, such as 16-20, 21-30, 31-40, etc. You use aggregation to create large-scale features based on small-scale ones. This technique allows you to reduce the size of a dataset without loss of information.

• Data Splitting

A dataset used for machine learning should be partitioned into three subsets: training, test and validation sets.

Training set. A data scientist uses a training set to train a model and define its optimal parameters, the parameters it must learn from data.

Test set. A test set is needed for an evaluation of the trained model and its capability for generalization. The latter means a model's ability to identify patterns in new, unseen data after having been trained on training data. It is crucial to use different subsets for training and testing to avoid model overfitting, which is the incapacity for generalization mentioned above.

Validation set. The purpose of a validation set is to tweak a model's hyperparameters, the higher-level structural settings that cannot be directly learned from data. These settings can express, for instance, how complex a model is and how fast it finds patterns in data.

The proportion of the training and test sets is usually 80 to 20 percent, respectively. The training set is then split again, and 20 percent of it is used to form a validation set. At the same time, machine learning practitioner Jason Brownlee suggests using 66 percent of the data for training and 33 percent for testing. The size of each subset depends on the total dataset size. The more training data a data scientist uses, the better the potential model will perform, and the more testing data is available, the better the estimate of model performance and generalization capability. A sketch of the scaling and splitting steps is given below.

Fig 3.2 – Data Splitting
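The sketch below applies the scaling and splitting ideas just described to the heart disease data, holding out 20 percent for testing and carving a validation set out of the training portion. The 20 percent validation share, the random seed, and the file/column names are illustrative assumptions.

```python
# Minimal scaling and splitting sketch (validation share, seed, and names are assumed).
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

df = pd.read_csv("heart.csv")                       # assumed file name
X, y = df.drop(columns=["target"]), df["target"]    # assumed label column

# Hold out 20% of the data as the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# Split the training portion again to obtain a validation set for tuning.
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.2, random_state=42, stratify=y_train)

# Scale every attribute to the [0, 1] range; fit the scaler on training data only.
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_val = scaler.transform(X_val)
X_test = scaler.transform(X_test)

print(X_train.shape, X_val.shape, X_test.shape)
```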
• Modelling

After pre-processing the collected data and splitting it into three subsets, we can proceed with model training. This process entails "feeding" the algorithm with training data. The algorithm processes the data and outputs a model that is able to find a target value (attribute) in new data, the answer you want to get with predictive analysis. The purpose of model training is to develop such a model.

Two model training styles are most common: supervised and unsupervised learning. The choice between them depends on whether you must forecast specific attributes or group data objects by similarities.

Model evaluation and testing. The goal of this step is to develop the simplest model able to formulate a target value quickly and well enough. A data scientist can achieve this goal through model tuning, that is, the optimization of model parameters to achieve the algorithm's best performance. One of the more efficient methods for model evaluation and tuning is cross-validation.

Cross-validation. Cross-validation is the most commonly used tuning method. It entails splitting a training dataset into ten equal parts (folds). A given model is trained on only nine folds and then tested on the tenth one (the one previously left out). Training continues until every fold has been left aside and used for testing. As a measure of model performance, a specialist calculates a cross-validated score for each set of hyperparameters. A data scientist trains models with different sets of hyperparameters to determine which model has the highest prediction accuracy. The cross-validated score indicates average model performance across the ten hold-out folds. A minimal illustration of this procedure is shown after the requirements summary below.

• Model Deployment

Deployment is the method by which you integrate a machine learning model into an existing production environment to make practical business decisions based on data. It is one of the last stages in the machine learning life cycle and can be one of the most cumbersome. Often, an organization's IT systems are incompatible with traditional model-building languages, forcing data scientists and programmers to spend valuable time and brainpower rewriting them.

KEY LEARNINGS of Functional Requirements
1) A functional requirement defines a system or its component.
2) A functional requirements document should contain the data handling logic and complete information about the workflows performed by the system.
3) Functional requirements, along with requirement analysis, help identify missing requirements.
4) Transaction corrections, adjustments and cancellations, business rules, certification requirements, reporting requirements, historical data management, and legal or regulatory requirements are various types of functional requirements.
5) As a good practice, do not combine two requirements into one; keep the requirements granular.

3.1 Non-functional Requirements

Non-functional requirements are quality attributes that describe the ways in which your product should behave. The list of basic non-functional requirements includes:

Usability. Usability is the degree of ease with which the user will interact with your product to achieve the required goals effectively and efficiently.

Reliability. This metric shows the possibility of your solution failing. To achieve high reliability, your team should eliminate all bugs that may influence code safety and issues with system components.

Performance. Performance describes how your solution behaves when users interact with it in various scenarios. Poor performance may lead to a negative user experience.

KEY LEARNINGS of Non-functional Requirements
1) Types of non-functional requirements are scalability, capacity, availability, reliability, recoverability, data integrity, etc.
2) An example of a non-functional requirement is: employees are never allowed to update their salary information, and any such attempt should be reported to the security administrator.
3) A functional requirement is expressed as a verb, while a non-functional requirement is an attribute.
4) The advantage of non-functional requirements is that they help you to ensure a good user experience and ease of operating the software.
5) The biggest disadvantage of non-functional requirements is that they may affect the various high-level software subsystems.
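Tying back to the cross-validation procedure described in the Modelling subsection above, here is a minimal 10-fold cross-validation sketch with scikit-learn. The choice of Random Forest with default parameters and the file/column names are assumptions for illustration only.

```python
# Minimal 10-fold cross-validation sketch (classifier choice, parameters, and names assumed).
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

df = pd.read_csv("heart.csv")                       # assumed file name
X, y = df.drop(columns=["target"]), df["target"]    # assumed label column

model = RandomForestClassifier(random_state=42)

# Train on 9 folds, test on the held-out fold, and repeat for all 10 folds.
scores = cross_val_score(model, X, y, cv=10, scoring="accuracy")
print("Per-fold accuracy:", scores.round(3))
print("Cross-validated score:", round(scores.mean(), 3))
```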
3.2 Hardware and Software Requirements

• Operating System: Windows / Mac OS / Linux
• x86 64-bit CPU (Intel / AMD architecture)
• Python 3.6
• Dataset

Python-based data science and machine learning libraries will be used for the development and experimentation of the project. Tools such as Anaconda Python, and libraries such as NumPy, SciPy, pandas and scikit-learn, will be utilized for this purpose.

Why Python?
• General-purpose programming language.
• Increasing popularity for use in data science.
• Easy to build end-to-end products like web applications.

3.2.1 PYTHON INSTALLATION

Python is a high-level, interpreted, object-oriented programming language with dynamic semantics as an added advantage. The combination of high-level built-in data structures with dynamic typing and dynamic binding makes it very attractive for Rapid Application Development, as well as for use as a scripting or glue language to connect existing components. Python emphasizes readability and reduces the cost of program maintenance through its simple, easy-to-learn syntax. Program modularity and code reuse can be achieved through Python modules and packages. The Python interpreter and the extensive standard library are available in source or binary form without charge for all major platforms, and can be freely distributed. Python is a great general-purpose programming language on its own, but it has become an even more popular environment for scientific computing with the help of a few popular libraries (NumPy, SciPy, Matplotlib).

Python version: install Python 3.6.8.

Scientific and numeric computing is one of the best applications of Python and is what this project uses. In this era of artificial intelligence, where machines can perform tasks much as humans do, Python is the most suitable language for artificial intelligence and machine learning: implementing machine learning algorithms requires complex mathematical calculation, and Python provides many scientific and mathematical libraries, such as NumPy, pandas, SciPy and scikit-learn, that make this easy.

Anaconda Python Distribution

Anaconda is an open-source package manager, environment manager, and distribution of the Python and R programming languages. It is commonly used for large-scale data processing, scientific computing, and predictive analytics, serving data scientists, developers, business analysts, and those working in DevOps. Anaconda offers a collection of over 720 open-source packages, and is available in both free and paid versions. The Anaconda distribution ships with the conda command-line utility.

3.2.2 Why Anaconda?

• User-level install of the version of Python you want.
• Able to install/update packages independently of system libraries or admin privileges.
• The conda tool installs binary packages, rather than requiring compile resources like pip does - again, handy if you have limited privileges for installing necessary libraries.
• More or less eliminates the headaches of trying to figure out which version/release of package X is compatible with which version/release of package Y, both of which are required for the install of package Z.
• Comes either in a full-meal-deal version, with NumPy, SciPy, PyQt, the Spyder IDE, etc., or in a minimal / à la carte version (Miniconda) where you can install what you want, when you need it.
• No risk of messing up required system libraries.

Installing Anaconda on Windows

1. Download the Anaconda installer (https://www.continuum.io/downloads).
2. Optional: verify data integrity with MD5 or SHA-256.
3. Double-click the installer to launch it.
4. Click Next.
5. Read the licensing terms and click I Agree.
6. Select an install for "Just Me" unless you are installing for all users.
7. Select a destination folder to install Anaconda and click Next.
8. Choose whether to add Anaconda to your PATH environment variable. We recommend not adding Anaconda to the PATH environment variable, since this can interfere with other software. Instead, use Anaconda software by opening Anaconda Navigator or the Anaconda Command Prompt from the Start Menu.
9. Choose whether to register Anaconda as your default Python 3.6. Unless you plan on installing and running multiple versions of Anaconda, or multiple versions of Python, you should accept the default and leave this box checked.
10. Click Install. You can click Show Details if you want to see all the packages Anaconda is installing.
11. Click Next.
12. After a successful installation you will see the "Thanks for installing Anaconda" message.

3.2.3 Jupyter Notebook

The Jupyter Notebook is an open-source web application that you can use to create and share documents that contain live code, equations, visualizations, and text. Jupyter Notebook is maintained by the people at Project Jupyter. Jupyter Notebooks are a spin-off project from the IPython project, which used to have an IPython Notebook project itself. The name Jupyter comes from the core programming languages that it supports: Julia, Python, and R. Jupyter ships with the IPython kernel, which allows you to write your programs in Python, but there are currently over 100 other kernels that you can also use.

Getting Up and Running With Jupyter Notebook

The Jupyter Notebook is not included with Python, so if you want to try it out, you will need to install Jupyter. There are many distributions of the Python language; this section focuses on just two of them for the purposes of installing Jupyter Notebook. It is also assumed that you are using Python 3.

Installation

If you are using the standard Python distribution, you can use a handy tool that comes with Python called pip to install Jupyter Notebook like this: $ pip install jupyter

The next most popular distribution of Python is Anaconda. Anaconda has its own installer tool called conda that you could use for installing a third-party package. However, Anaconda comes with many scientific libraries preinstalled, including the Jupyter Notebook, so you don't actually need to do anything other than install Anaconda itself.

Starting the Jupyter Notebook Server

Now that you have Jupyter installed, let's learn how to use it. To get started, all you need to do is open up your terminal application and go to a folder of your choice.
I recommend using something like your Documents folder to start out with, creating a subfolder there called Notebooks or something else that is easy to remember. Then just go to that location in your terminal and run the following command: $ jupyter notebook

Your browser should now show the Jupyter Notebook dashboard.

[Screenshot: Jupyter Notebook dashboard in the browser]

Note that right now you are not actually running a Notebook; you are just running the Notebook server. Let's actually create a Notebook now.

Creating a Notebook

Now that you know how to start a Notebook server, you should probably learn how to create an actual Notebook document. All you need to do is click on the New button (upper right), and it will open up a list of choices. On my machine, I happen to have Python 2 and Python 3 installed, so I can create a Notebook that uses either of these. For simplicity's sake, let's choose Python 3.

[Screenshot: a new, untitled Python 3 notebook]

CHAPTER 4
SYSTEM DESIGN AND ANALYSIS

4.1 METHODOLOGIES

Before writing any code, the initial design of the algorithm for the prediction engine was created using Microsoft Visio. The initial design can be seen in the figure below.

Fig 4.1 – Training Model Process

4.2 PROPOSED MODEL

The proposed flow of the system is shown in the flowchart below.

Fig 4.2 – Flowchart

CHAPTER 5
IMPLEMENTATION

The different ML algorithms implemented are the Random Forest, Decision Tree, Logistic Regression and Naive Bayes classification techniques. The input dataset is split into 80% training data and the remaining 20% test data. The training dataset is the dataset used to train a model; the test dataset is used to check the performance of the trained model. For each of the algorithms the performance is computed and analysed based on the different metrics used, namely accuracy, precision, recall and F-measure scores, as described further below.

Naive Bayes

The Naive Bayes algorithm is based on the Bayes rule. Independence between the attributes of the dataset is the main and most important assumption in making a classification. The algorithm is easy and fast to predict with, and performs best when the assumption of independence holds. Bayes' theorem calculates the posterior probability of an event A given event B, represented by P(A|B), as shown in equation 1:

P(A|B) = P(B|A) P(A) / P(B)        (1)

[Screenshot: Naive Bayes implementation and accuracy output]

Logistic Regression

Logistic Regression is a classification algorithm mostly used for binary classification problems. In logistic regression, instead of fitting a straight line or hyperplane, the algorithm uses the logistic function to squeeze the output of a linear equation between 0 and 1. There are 13 independent variables, which makes logistic regression well suited for this classification task.
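Since the report shows its Naive Bayes and Logistic Regression code only as screenshots, here is a minimal scikit-learn sketch of how these two models could be trained and scored on the 80/20 split described above. The file name, label column, random seed and solver settings are assumptions.

```python
# Minimal sketch of the Naive Bayes and Logistic Regression steps (names and seed assumed).
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

df = pd.read_csv("heart.csv")                       # assumed file name
X, y = df.drop(columns=["target"]), df["target"]    # assumed label column

# 80% training data, 20% test data, as described in this chapter.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

nb = GaussianNB().fit(X_train, y_train)
lr = LogisticRegression(max_iter=1000).fit(X_train, y_train)

print("Naive Bayes accuracy:        ", nb.score(X_test, y_test))
print("Logistic Regression accuracy:", lr.score(X_test, y_test))
```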
Decision Tree

The Decision Tree algorithm is in the form of a flowchart, where the inner nodes represent the dataset attributes and the outer branches are the outcomes. Decision Tree was chosen because it is fast, reliable, easy to interpret, and requires very little data preparation. In a Decision Tree, the prediction of a class label originates from the root of the tree: the value of the root attribute is compared to the record's attribute, and based on the result of the comparison the corresponding branch is followed and a jump is made to the next node.

[Screenshot: Decision Tree implementation and accuracy output]

Random Forest

Random Forest algorithms are used for classification as well as regression. The algorithm creates trees from the data and makes predictions based on them. Random Forest can be used on large datasets and can produce the same result even when large sets of record values are missing. The samples generated from the decision trees can be saved so that they can be used on other data. Random Forest works in two stages: first create the random forest, then make a prediction using the random forest classifier.

[Screenshot: Random Forest implementation and accuracy output]

CHAPTER 6
TEST RESULTS AND ANALYSIS

The results obtained by applying Random Forest, Decision Tree, Naive Bayes and Logistic Regression are shown in this section. The metrics used to carry out the performance analysis of the algorithms are accuracy score, precision (P), recall (R) and F-measure. Precision measures the proportion of positive predictions that are correct, recall measures the proportion of actual positives that are correctly identified, and F-measure combines the two into a single accuracy measure.

• Precision = TP / (TP + FP)
• Recall = TP / (TP + FN)
• F-Measure = (2 × Precision × Recall) / (Precision + Recall)

where:
• TP (true positive): the patient has the disease and the test is positive.
• FP (false positive): the patient does not have the disease but the test is positive.
• TN (true negative): the patient does not have the disease and the test is negative.
• FN (false negative): the patient has the disease but the test is negative.

In the experiment, the pre-processed dataset is used and the above-mentioned algorithms are explored and applied. The performance metrics are obtained from the confusion matrix, which describes the performance of the model. The confusion matrix values obtained by the proposed model for the different algorithms are shown in Table 2, and the accuracy scores obtained for the Random Forest, Decision Tree, Logistic Regression and Naive Bayes classification techniques are shown in Table 3.

Algorithm           | True Positive | False Positive | False Negative | True Negative
--------------------|---------------|----------------|----------------|--------------
Logistic Regression | 22            | 5              | 4              | 30
Naive Bayes         | 21            | 6              | 3              | 31
Random Forest       | 2             | 5              | 6              | 28
Decision Tree       | 25            | 2              | 4              | 30

Table 2. Values Obtained for the Confusion Matrix Using Different Algorithms

Algorithm           | Precision | Recall | F-measure | Accuracy
--------------------|-----------|--------|-----------|---------
Decision Tree       | 0.845     | 0.823  | 0.835     | 81.97%
Logistic Regression | 0.857     | 0.882  | 0.869     | 85.25%
Random Forest       | 0.937     | 0.882  | 0.909     | 90.16%
Naive Bayes         | 0.837     | 0.911  | 0.873     | 85.25%

Table 3. Analysis of Machine Learning Algorithms
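To show how figures of the kind reported in Tables 2 and 3 can be produced, the sketch below trains all four classifiers on the same split and prints the confusion matrix entries, accuracy, precision, recall and F-measure for each. Hyperparameters, the random seed and the file/column names are assumptions, so the numbers it prints will not exactly reproduce the tables above.

```python
# Minimal sketch reproducing the style of Tables 2 and 3 (hyperparameters and seed assumed;
# the printed results will not exactly match the report's figures).
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv("heart.csv")                       # assumed file name
X, y = df.drop(columns=["target"]), df["target"]    # assumed label column
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(random_state=0),
    "Naive Bayes": GaussianNB(),
}

for name, model in models.items():
    y_pred = model.fit(X_train, y_train).predict(X_test)
    tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
    print(f"{name}: TP={tp} FP={fp} FN={fn} TN={tn} "
          f"Accuracy={accuracy_score(y_test, y_pred):.4f} "
          f"Precision={precision_score(y_test, y_pred):.3f} "
          f"Recall={recall_score(y_test, y_pred):.3f} "
          f"F-measure={f1_score(y_test, y_pred):.3f}")
```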
CHAPTER 7
SNAPSHOTS

Naive Bayes: decision surface using the PCA transformed/projected features (Principal Component 1 vs Principal Component 2).

Logistic Regression: decision surface using the PCA transformed/projected features (Principal Component 1 vs Principal Component 2).

Decision Tree: decision surface using the PCA transformed/projected features (Principal Component 1 vs Principal Component 2).

Random Forest: decision surface using the PCA transformed/projected features (Principal Component 1 vs Principal Component 2).

CHAPTER 8
CONCLUSION

With the increasing number of deaths due to heart diseases, it has become mandatory to develop a system to predict heart diseases effectively and accurately. The motivation for the study was to find the most efficient ML algorithm for the detection of heart diseases. This study compares the accuracy scores of the Decision Tree, Logistic Regression, Random Forest and Naive Bayes algorithms for predicting heart disease. The result of this study indicates that the Random Forest algorithm is the most efficient, with an accuracy score of 90.16% for the prediction of heart disease. In future, the work can be enhanced by developing a web application based on the Random Forest algorithm and by using a larger dataset than the one used in this analysis, which will help to provide better results and help health professionals in predicting heart disease effectively and efficiently.

CHAPTER 9
REFERENCES

1. Avinash Golande, Pavan Kumar T, "Heart Disease Prediction Using Effective Machine Learning Techniques", International Journal of Recent Technology and Engineering, Vol. 8, pp. 944-950, 2019.
2. T. Nagamani, S. Logeswari, B. Gomathy, "Heart Disease Prediction using Data Mining with MapReduce Algorithm", International Journal of Innovative Technology and Exploring Engineering (IJITEE), ISSN: 2278-3075, Vol. 8, Issue 3, January 2019.
3. Fahd Saleh Alotaibi, "Implementation of Machine Learning Model to Predict Heart Failure Disease", International Journal of Advanced Computer Science and Applications (IJACSA), Vol. 10, No. 6, 2019.
4. Anjan Nikhil Repaka, Sai Deepak Ravikanti, Ramya G Franklin, "Design and Implementation of Heart Disease Prediction Using Naive Bayesian", International Conference on Trends in Electronics and Informatics (ICOEI 2019).
5. Nagaraj M Lutimath, Chethan C, Basavaraj S Pol, "Prediction of Heart Disease using Machine Learning", International Journal of Recent Technology and Engineering, Vol. 8, Issue 2S10, pp. 474-477, 2019.
6. Heart Disease Data Set [Online]. Available (accessed on 1 May 2020): https://www.kaggle.com/ronitf/heart-disease-uci
