Submitted by
S. PRIYADHARSHINI (18132001)
S. HARIPRIYA (18132009)
A. AARTHI (18132011)
APRIL 2022
BONAFIDE CERTIFICATE
Certified that this project work is the bonafide work of S. PRIYADHARSHINI (18132001), S. HARIPRIYA (18132009), and A. AARTHI (18132011), who carried out the project work under my supervision.
SIGNATURE SIGNATURE
DEPARTMENT OF IT DEPARTMENT OF IT
ACKNOWLEDGEMENT
First and foremost, I would like to thank the Lord Almighty for His presence and immense blessings throughout the project work. It is a matter of pride and privilege to have completed this work. I am thankful to everyone who stood by me when I was looking for encouragement. Last but not the least, special thanks to all my department faculty for their support and for promptly guiding this work.
ABSTRACT
Diabetic retinopathy is a chronic complication of diabetes mellitus that occurs due to a lack of insulin and can cause loss of vision if it is not identified at the very first level. Diabetic retinopathy is also considered one of the major causes of mortality, along with cardiovascular disease, stroke, etc. If preventive measures are not taken, diabetes can lead to further diseases such as nephropathy, diabetic foot and retinopathy. Data mining plays an important role in diabetic retinopathy prediction, which can be beneficial for the better health of society. This model helps to identify diabetic retinopathy based on
TABLE OF CONTENTS
2.5 SUMMARY 10
3 PROJECT DESCRIPTION 11
3.1 INTRODUCTION 11
3.2 EXISTING SYSTEM 11
3.2.1 Limitations of Existing System 12
3.3 PROPOSED SYSTEM 12
3.3.1 Advantages over Existing System 13
3.4 SUMMARY 13
4 PROJECT REQUIREMENT 14
4.1 INTRODUCTION
4.2 SOFTWARE REQUIREMENTS
4.2.1 Hardware and software specification
4.2.2 Python
4.2.3 Anaconda Software
4.2.4 OpenCV 14-20
4.2.5 Jupyter Notebook
4.3 TECHNOLOGY OVERVIEW
4.3.1 Multi Model ML
4.3.2 HyperParameter tuning
4.4 SUMMARY
5 SYSTEM DESIGN 21
5.1 INTRODUCTION 21
5.2 SYSTEM ARCHITECTURE
5.2.1 Input Design 21-24
5.2.2 User Interface design
5.2.3 Procedural Design
5.2.4 Output Design
5.3 SUMMARY
6 MODULE DESCRIPTION 25
6.1 INTRODUCTION 25
6.2 MODULES 25
6.2.1 Module 1: Data Collection and
preprocessing
6.2.2 Module 2: Exploratory data analysis
6.2.3 Module 3: Implementation of ML
algorithm 25-28
6.2.4 Module 4: Prediction of diabetic
retinopathy
6.2.5 Module 5: Performance analysis
6.3 SUMMARY
7 IMPLEMENTATION 28
7.1 INTRODUCTION 28
7.2 PROCEDURE 28
7.3 ALGORITHMS USED 29
7.3.1 Logistic Regression algorithm 29
7.3.2 KNN Algorithm 29
7.3.3 Decision Tree Algorithm 30
7.3.4 Random Forest Algorithm 30
7.3.5 SVM Algorithm 31
7.4 ALGORITHM WITH HYPERPARAMETER 31
7.4.1 Web Application 32
7.5 Evaluation Parameters 32
7.5.1 Confusion Matrix 32
7.5.2 F1-Score 33
7.5.3 Precision 33
7.5.4 Recall 33
7.6 Summary 34
8 CONCLUSION & FUTURE WORK 35
8.1 CONCLUSION 35
REFERENCES 36
APPENDIX 39
A. SCREENSHOTS 39
B. SAMPLE CODE 47
C. PUBLICATION
LIST OF FIGURES
1 System Architecture 21
2 Input Design 22
3 UI Design 23
4 Procedural Design 23
5 Confusion Matrix 33
LIST OF ABBREVIATIONS
DR Diabetic Retinopathy
ML Machine Learning
CHAPTER 1
INTRODUCTION
1.1 INTRODUCTION
Diabetic retinopathy is a severe retinal disorder that occurs because of uncontrolled diabetes mellitus and leads to loss of vision if the disease is not detected promptly. Diabetic retinopathy is a major disorder as per the World Health Organization report, and it is projected to affect 191 million people by 2030. Retinal disease caused by diabetes includes glaucoma, cataracts and retinopathy. There is a large chance that diabetes-affected persons go untreated for a very long time. The foremost aim of this work is to construct a working model to detect diabetic retinopathy by using multimodal machine learning with the hyperparameter tuning method, to obtain more accuracy than the existing system. Once the model is trained and tested with the datasets, it can be used to predict the diabetic retinopathy affecting a person at an earlier stage. Diabetic retinopathy mainly affects the human red blood cells, and it shrinks the vision of the human eye. Microvascular disease and macrovascular disease are the most common diabetic complications. Microvascular disease threatens the tiny blood vessels and arteries. Diabetic disease mainly attacks the eye (retinopathy), nerves (neuropathy) and kidney (nephropathy). The crucial macrovascular stumbling blocks include cardiovascular disease, manifested as strokes among other serious complications. There are different types of diabetes that can affect a human; however, the most common types are type 1 diabetes, type 2 diabetes and gestational diabetes mellitus (GDM). Type 1 diabetes commonly affects children; type 2 diabetes commonly affects middle-aged people and infrequently older persons; whereas gestational diabetes mellitus (GDM) affects women during their pregnancy period. Generally, diabetic retinopathy treatment is conducted by an ophthalmologist by gathering the retinal images of the patient. The data mining process can be exceedingly helpful for medical practitioners for extracting hidden medical knowledge. Diabetic retinopathy is a complex disorder that cannot be ignored, according to the Clinical Research World of Ophthalmology. DR is codified into two different stages: non-proliferative DR (NPDR) and proliferative DR (PDR). Diabetic retinopathy is further classified into mild and severe stages. Early medication for DR can prevent patients from further worsening. Today, diabetic retinopathy treatment is carried out by the most experienced doctors, who classify the critical stages of patients by identifying the diseases manually. The performance of the random forest and decision tree algorithms was better compared to the support vector machine, KNN and K-means algorithms, because the parameters used for the multimodal algorithms gave more accuracy and detected diabetic retinopathy at an earlier juncture.
● The proposed system classifies normal and abnormal input data by using all the necessary ML algorithms. Computational methods and algorithms have been developed to analyze as well as quantify the biomedical dataset.
● The proposed technique applies supervised machine learning to the sector of clinical analysis, in order to reduce the time and strain undergone by the ophthalmologist and other members of the team in the screening, analysis and treatment of diabetic retinopathy.
● As diabetes continues to increase in prevalence, more and more people are looking for ways to manage the disease. Some of the most popular methods used to treat diabetes include diet, exercise, and medication. However, there is still much debate surrounding how effective these treatments are.
● Multi-model machine learning is a machine learning method that can help predict which type of diabetes will develop in a person. This is important because it can help make better decisions about treatment and medications.
CHAPTER 2
LITERATURE REVIEW
2.1 INTRODUCTION
Research in a specific field should include a detailed study of the literature related to that subject, and a highly structured review is required to acquire a strong background in the domain. A complete study of the literature gives insights into what has already been done in the same field, which leads to a significant investigation. This chapter presents a detailed review of the problem of predicting diabetic retinopathy disease, which is induced by diabetes. Various methods of predicting the different diseases arising from diabetes have been analyzed and discussed in this chapter.
2.3 FEASIBILITY STUDY
The feasibility study is research into the software product's viability, or a quantification of how beneficial product development would be for the enterprise from a practical standpoint. A feasibility study is performed for numerous reasons: to decide whether a software product is viable in terms of development, implementation, and cost to the company. The feasibility study is an important stage of the software project management process, as it determines whether to continue with the proposed project because it is practically viable, or to halt the proposed project because it is not viable to develop and to consider the proposal again. Along with this, a feasibility study assists in identifying risk factors related to developing and deploying the system, in addition to planning for risk analysis. Additionally, by studying the numerous parameters related to the proposed project's development, a feasibility study helps narrow business options and improve success rates.
2.3.1 Technical feasibility
In this phase, existing resources, including hardware and software, as well as the necessary technologies, are assessed in order to build the project. Appropriate trials were conducted to determine the project's feasibility in terms of technical feasibility, convenience, and economic viability. The findings of the examination indicate that the project is technically viable. The performance study demonstrates that the project achieves a high level of accuracy and precision, implying that it is feasible in terms of performance criteria.
2.3.2 Economical Feasibility
The economic feasibility of a solution is decided by its cost. Economic feasibility research looks at the project's costs and benefits. During this feasibility study, an in-depth evaluation of the project's development costs is conducted, which covers all the costs needed for final development, consisting of hardware and software support requirements, design and development costs, and operations costs, among others. The proposal is easy to implement and economically viable.
Vaibhav V. Kamble et al. [2] proposed classifying a retinal image dataset by utilizing an RBF neural network. The trial observed an accuracy of 71.2, a sensitivity of 0.83 and a specificity of 0.043 on DIARETDB0.
Cut Fiarni et al. [4] recommended a prediction model for diabetes complication diseases based on data mining methods and clustering and classification algorithms. The model separates diabetes clinical data into four classes: nephropathy, retinopathy, neuropathy, and mixed effects. To develop the ideal rule-based model for prediction, they measured performance using clustering and classification algorithms.
diabetes dataset, and association rules are produced by both of these algorithms”.
Jiangxue Han et al. [7] developed a computer vision system for recognizing and automating this disease using a neural network, to provide findings for a large number of cases in a short amount of time.
G. Kalyani et al. [9] detected and classified diabetic retinopathy using a proposed capsule network. The convolution and primary capsule layers are used to extract features from fundus images, while the class capsule layer and softmax layer are used to estimate the probability that an image belongs to a particular class. Four performance metrics are used to check the proposed network's efficiency on the Messidor dataset.
Manisha Sharma et al. [10] proposed a computer vision system, automated using a neural network, that detects the disease and gives results for a large number of cases in a small amount of time.
Harry Pratt et al. [11] proposed a CNN approach for diagnosing diabetic retinopathy from digital fundus images and directly classifying its severity. He developed a network with a CNN architecture and data augmentation that could identify the intricate features involved in the classification task, such as microaneurysms, exudates and haemorrhages on the retina, and consequently provide a diagnosis automatically without user input. The network was trained using a high-end GPU for high-grade classification.
The procedure proposed by V. Deepa et al. [12] presents an ensemble of multi-stage deep CNN models for diabetic retinopathy grading based on image patches. The pre-processing stage allows the input images to provide much more useful information than the raw input images; normalization and resizing are used as the pre-processing methods in this work. The proposed multi-stage algorithm is executed with three main stages to carry out the classification steps of the decision network.
In the strategy proposed by Syna Sreng and Noppadol Maneerat [13], the image is preprocessed to remove small noise and improve its contrast. Thresholding is then used to detect the bright lesions, after which the red lesions are detected using top-hat morphological filtering. The bright and dark lesions are combined using a logical AND operator. To leave only pathological signs, the noise near vessels is further removed using blob analysis. Morphological features are then extracted and passed to an SVM classifier.
Suriyaharayananm et al. [14] proposed a recent method for blind-spot detection: first the vessel and exudate patches are detected, and each is extracted to obtain points. Features such as vessels, exudates and points are accurately detected using suitably applied morphological operations.
Carlos Santos et al. [15] proposed a strategy based on deep neural network models that perform one-stage object detection, using moderate data augmentation and transfer learning techniques to provide a model for the task of detecting fundus lesions. The model was trained and validated based on the YOLOv5 architecture and the PyTorch framework, achieving good values for mAP.
Kavakiotis et al. [16] surveyed, in the field of diabetes research, applications of AI, data mining techniques, and tools for prediction and diagnosis. "The utilization of ML and data mining techniques in advanced datasets that incorporate clinical and biological data is expected to prompt more in-depth investigation toward diagnosis and treatment of DM", owing to the advent of biotechnology and the tremendous amount of data generated.
2.5 SUMMARY
This chapter covers the initial investigation of the project, the feasibility study, and the literature survey. The major areas of the project, data mining techniques and machine learning algorithms with hyperparameter tuning, were discussed here.
CHAPTER 3
PROJECT DESCRIPTION
3.1 INTRODUCTION
Diabetic retinopathy is a disease caused by changes in the blood vessels of the retina. In most people with DR, the blood vessels in the retina may swell and leak fluid. The crucial macrovascular stumbling blocks include cardiovascular disease, manifested as strokes among other serious complications. There are different types of diabetes that can affect a human; however, the most common types are type 1 diabetes, type 2 diabetes and gestational diabetes mellitus (GDM). Type 1 diabetes commonly affects children; type 2 diabetes commonly affects middle-aged people and infrequently older persons; whereas gestational diabetes mellitus (GDM) affects women during their pregnancy period. Generally, diabetic retinopathy treatment is conducted by an ophthalmologist by gathering the retinal images of the patient. The data mining process can be exceedingly helpful for medical practitioners for extracting hidden medical knowledge.
3.2 EXISTING SYSTEM
The existing work deals with the issue of disease prediction using a biomedical diabetic dataset, which has been analyzed in detail. Accordingly, the authors present an impact-measure-based disease prediction algorithm. First, the method reads the biomedical dataset and performs noise removal. Second, the features are extracted, and for each data point a multi-attribute relational similarity measure (MARSM) is evaluated against the different clusters available. Based on the evaluated MARSM measure, a single class is identified for the given item. The method produces higher effectiveness in clustering as well as in disease prediction, and it reduces the false classification ratio and the time complexity. The study could further be improved by including environmental factors in diabetes-dependent disease prediction. Researchers might additionally add instances of diabetic retinopathy to the dataset for the prediction of diabetic retinopathy.
3.2.1 Limitations of Existing System
The existing system was designed to show the higher accuracy of the proposed algorithms (MIAM, MLDDM) by combining two algorithms using machine learning methods. The proposed method reported that the hybrid model achieved higher accuracy compared to the standard model, using the algorithms K-means, sparsity correlation, decision tree, MLDDM and MIAM to predict diabetes, diabetic nephropathy, diabetic neuropathy and diabetic cardiovascular disease. However, the existing method has not predicted the diabetic disease which affects human vision (retinopathy).
3.4 SUMMARY
This chapter covers the proposed and existing systems of the project; the major area of the project was data mining and machine learning. The advantages of the proposed system over the existing system were discussed here.
CHAPTER-4
PROJECT REQUIREMENT
4.1 INTRODUCTION
A software requirement may be understood as a property that the software must exhibit in order for it to correctly perform its function. This function may be to automate some part of a task of the people who will use the software, to support the business processes of the organisation that has commissioned the software, to control a device in which the software is to be embedded, and many more. The functioning of the users, the business processes, or the device will usually be complicated and, by extension, the requirements on the software will be a complicated aggregate of requirements from different people at different levels of an organisation and from the environment in which the software must execute.
4.2 SOFTWARE REQUIREMENTS
The requirements specification is a technical specification of requirements for the software product. It is the first step in the requirements analysis process; it lists the requirements of a particular software system, including functional, performance and security requirements. The requirements also provide usage scenarios from a user, an operational and an administrative perspective. The purpose of a software requirements specification is to provide a detailed overview of the software project, its parameters and goals. It describes the project's target audience and its user interface, hardware and software requirements. It defines how the client, team and audience see the project and its functionality.
4.2.1 Hardware and software specification
• Hardware specifications:
– Microsoft Server enabled computers, preferably workstations
– Higher RAM, of about 4GB or above
– Processor of frequency 1.5GHz or above
• Software specifications:
– Python 3.6 and higher
– Anaconda software
– Jupyter Notebook
4.2.2 PYTHON
Python is a programming language that supports the creation of a wide variety of applications. Developers regard it as a great choice for Artificial Intelligence (AI), Machine Learning, and Deep Learning projects.
● It has a large variety of libraries and frameworks: Python comes with many libraries and frameworks that make coding easy, which also saves a significant amount of time. The most popular libraries are NumPy, which is used for scientific calculations; SciPy for more advanced computations; and scikit-learn for data mining and data analysis. These libraries work alongside powerful frameworks such as TensorFlow, CNTK, and Apache Spark, and they are critical for machine and deep learning projects.
● Simplicity: Python code is concise and readable even to new developers, which is useful for machine and deep learning projects. Due to its simple syntax, the development of applications with Python is fast compared to many programming languages. Furthermore, it allows the developer to test algorithms without fully implementing them. Readable code is also crucial for collaborative coding: many people can work together on a complicated project, and one can easily find a Python developer for the team, as Python is a familiar platform. Therefore, a brand-new developer can quickly get acquainted with Python's principles and work on the project straight away.
● Fast code tests: Python provides plenty of code review and test tools, so developers can quickly check the correctness and quality of the code. AI projects tend to be time-consuming, so a well-established environment for testing and checking for bugs is needed. Python is a suitable language because it supports these features.
● Visualization tools: Python comes with a wide variety of libraries, some of which offer excellent visualization tools. In AI, machine learning, and deep learning, it is vital to present data in a human-readable format.
4.2.4 OpenCV
OpenCV is a massive open-source library for computer vision, machine learning, and image processing, and it now plays a major role in real-time operation, which is very important in today's systems. By using it, one can process images and videos to identify objects, faces, or even the handwriting of a human. When integrated with other libraries such as NumPy, Python is capable of processing the OpenCV array structure for analysis. To identify image patterns and their various features, we use vector space and perform mathematical operations on those features. The first OpenCV version was 1.0. OpenCV is released under a BSD license and hence is free for both academic and commercial use. It has C++, C, Python and Java interfaces and supports Windows, Linux, Mac OS, iOS and Android. When OpenCV was designed, the primary focus was real-time applications for computational efficiency. Everything is written in optimized C/C++ to take advantage of multi-core processing.
4.2.5 Jupyter Notebook
A simple overview of the Jupyter Notebook App covers its components, the history of Project Jupyter to show how it is linked to IPython, and the three most popular ways to run notebooks: with the help of a Python distribution, with pip, or in a Docker container. A practical introduction to these components, complete with examples of pandas DataFrames, an explanation of how to make notebook documents magical, and best practices and tips, helps make a notebook an added value to any data science project. The simplest way for a beginner to get started with Jupyter Notebooks is by installing Anaconda. Anaconda is the most widely used Python distribution for data science and comes pre-loaded with all of the most popular libraries and tools. As well as Jupyter, some of the biggest Python libraries wrapped up in Anaconda include NumPy, pandas and Matplotlib, though the full 1000+ list is exhaustive. This lets you hit the ground running in your own fully stocked data science workshop without the hassle of managing endless installations or worrying about dependencies and OS-specific (read: Windows-specific) installation issues.
4.4 SUMMARY
This chapter covers the software and hardware requirements of the project; the major area of the project was data mining and machine learning. The requirements and the technology overview were discussed here.
CHAPTER 5
SYSTEM DESIGN
5.1 INTRODUCTION
The model was designed, optimized and evaluated with high testing accuracy, which was achieved by adding the hyperparameter tuning method to the machine learning system. The system design shows the overall process of the model, which includes multiple modules. The architecture diagram depicts the entire flow of the model, starting from data preprocessing through to the final prediction, along with the evaluation parameters.
5.2 SYSTEM ARCHITECTURE
5.2.1 Input Design
5.2.2 User Interface design
Figure 3: UI design
In this project we have designed a web application using the Python language, where the user can enter details such as glucose level, blood pressure, high pressure, insulin, BMI, diabetes, and age. Based on the input from the user, the web application displays whether the user has been affected by diabetic retinopathy or not. If a person has diabetic retinopathy, it shows a suggestion to the person affected by diabetic retinopathy.
5.2.3 Procedural Design
The procedural diagram depicts the entire flow of the machine learning model. The dataset was collected from Kaggle and pre-processed, and we used data mining techniques to find any anomalies in the dataset. We then split the dataset into training and testing parts to show the accuracy, and finally, by using the evaluation parameters, we concluded the output and identified the machine learning algorithm with the higher accuracy for predicting diabetic retinopathy.
5.3 SUMMARY
This chapter covers the system design and the input and output design of the project; the major area of the project was data mining and machine learning. The system architecture and design overview were discussed here.
CHAPTER 6
MODULE DESCRIPTION
6.1 INTRODUCTION
The system is intended to demonstrate data mining techniques in disease prediction for diabetic retinopathy. The input data can be processed to detect diabetic retinopathy. The data mining technique specified in this work focuses on feature relevance and classification techniques to accurately categorize the disease associated with the retina, based on the features extracted from the input parameters using classification techniques. Checkpoints were created to stop the model when it reached a higher accuracy, and the best model that produces high accuracy is saved. Finally, retinopathy is predicted when input data is given.
6.2 Modules
6.2.1 Module 1: Data collection and preprocessing
1. The dataset is pre-processed, and the cleaned data is then used for training.
2. The cleaned data is explored, and all the necessary steps are carried out, such as removing noise and converting all the parameters to the same scale.
3. Data preprocessing comprises four steps. The first step is data cleaning, in which duplicate, incorrectly formatted, or corrupted data is fixed or removed.
4. The second step is data integration, which combines multiple sources of data into a single view.
5. The third step is data reduction, in which the data is encoded, scaled and sorted if needed.
6. The final step is data transformation, in which the data is transformed into the required format.
● Recall
● Support
6.3 SUMMARY
This chapter covers the module descriptions of the project; the major area of the project was data mining and machine learning. Five modules were discussed here.
CHAPTER 7
IMPLEMENTATION
7.1 INTRODUCTION
In the implementation process for building and testing the model, we have used five algorithms: logistic regression, random forest, decision tree, KNN and support vector machine. We have used the above algorithms to show the accuracy of each individual algorithm when using hyperparameter tuning. This work is intended to show a comparison of the five algorithms without hyperparameter tuning and with hyperparameter tuning for identifying patients with possible diabetic retinopathy. The result is designed in such a way that the accuracy of an algorithm using hyperparameter tuning is higher than that of the same algorithm without hyperparameter tuning.
7.2 PROCEDURE
There are four stages to data preparation.
1. The first phase is data cleaning, which involves fixing or removing duplicated,
poorly formatted, or damaged data.
2. The second phase is data integration, which involves combining data from
several sources into a single perspective.
3. The data reduction stage is the third step, in which the data is encoded, scaled,
and sorted if necessary.
4. The data transformation is the last phase in which the data is turned into the
desired format.
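The four stages above can be sketched with pandas on a small hypothetical table; the column names and values are illustrative, not the project's actual dataset.

```python
import pandas as pd

# Hypothetical clinical dataset; column names are illustrative only
df = pd.DataFrame({
    "glucose": [148, 85, 85, None, 137],
    "bmi": [33.6, 26.6, 26.6, 28.1, 43.1],
    "outcome": [1, 0, 0, 0, 1],
})

# 1. Data cleaning: drop duplicate rows and rows with missing values
# 2. Data integration would merge multiple sources here (single source used)
df = df.drop_duplicates().dropna()

# 3. Data reduction: scale numeric features into the [0, 1] range
features = ["glucose", "bmi"]
df[features] = (df[features] - df[features].min()) / (
    df[features].max() - df[features].min()
)

# 4. Data transformation: sort and reindex into the required format
df = df.sort_values("glucose").reset_index(drop=True)
print(len(df))  # 3 rows survive cleaning
```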
7.3 ALGORITHMS USED
In this research we have used supervised ML methods to detect the early stage of DR in a person. The algorithms used are logistic regression, random forest, decision tree, KNN and support vector machine.
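A minimal sketch of training these five classifiers with scikit-learn follows; the synthetic dataset stands in for the project's clinical data, and all settings are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the clinical dataset
X, y = make_classification(n_samples=300, n_features=7, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "KNN": KNeighborsClassifier(),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
    "SVM": SVC(),
}

# Train each model and report its accuracy on the held-out test split
for name, model in models.items():
    model.fit(X_train, y_train)
    print(f"{name}: {model.score(X_test, y_test):.3f}")
```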
7.3.4 Random Forest
It is a supervised ML algorithm that helps to solve both regression and classification problems. It builds decision trees on numerous samples and takes their majority vote for classification and their average for regression. The Random Forest algorithm creates the final result by combining the results of many decision trees. It relies on the bagging principle, which creates different training subsets from the sample training data with replacement, and the final result mostly depends on voting. Each tree depends on the values of a random vector sampled independently and with the same distribution across the forest. As the number of trees becomes large, the generalization error for the forest converges to a limit. The strength of the individual trees in the forest, as well as their correlation, is crucial to the generalization error of a forest of tree classifiers. As the name implies, the Random Forest algorithm relies on randomness; the main objective is to identify the prediction function for forecasting the value, and a loss function is a formula that measures how far the prediction is from the true value.
7.3.5 Support Vector Machine
SVM, also referred to as a support vector machine, is a learning algorithm that can be used for classification as well as regression problems. SVM is a linear model for classification and regression: it can solve linear and non-linear problems and works well for various practical problems. scikit-learn is a widely available library for implementing ML algorithms, and SVM is available in the scikit-learn library (with appropriate model selection and validation). SVM works by mapping data to a high-dimensional feature space so that data points can be categorized, even when the data is not otherwise linearly separable. A separator between the classes is found, and the data is transformed in such a way that the separator can be drawn as a hyperplane.
7.4 ALGORITHM WITH HYPERPARAMETER TUNING
In machine learning, hyperparameters are settings chosen before training that can be
tuned to improve the accuracy of an algorithm. When an algorithm has low accuracy
with its default settings, hyperparameter tuning helps us to improve it. Tuning is key
to getting the most out of machine learning algorithms; note the distinction between
a "model parameter" (learned from the data) and a "model hyperparameter" (set
before training).
There are two common hyperparameter tuning methods: GridSearchCV and
RandomizedSearchCV. We have used GridSearchCV for our model to improve its
accuracy. GridSearchCV has an advantage over RandomizedSearchCV:
RandomizedSearchCV is faster because it samples candidate settings at random, but
a random search may miss the best combination, so the accuracy of the algorithm
may decrease. GridSearchCV is slower, but it evaluates every point on the grid and
finds the best values to improve the accuracy of the model.
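The trade-off described above can be sketched as follows (on a synthetic toy dataset with an illustrative parameter grid, not the project's actual search space):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

X, y = make_classification(n_samples=200, random_state=0)

params = {"n_estimators": [10, 30, 50], "max_depth": [3, 4, 5]}

# GridSearchCV tries every combination in the grid: exhaustive but slower
grid = GridSearchCV(RandomForestClassifier(random_state=0), params, cv=3)
grid.fit(X, y)

# RandomizedSearchCV evaluates only n_iter sampled combinations: faster,
# but it may never visit the best point of the grid
rand = RandomizedSearchCV(RandomForestClassifier(random_state=0), params,
                          n_iter=3, cv=3, random_state=0)
rand.fit(X, y)

print(grid.best_params_, grid.best_score_)
print(rand.best_params_, rand.best_score_)
```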
7.4.1 Web Application
We built our model in Python and serve it through a Flask web application. The
trained model was saved as a pickle file. The front end was designed for user input:
users enter valid data, and the result is displayed in the web application. Flask is
extensible and doesn't force a particular directory structure or require complicated
boilerplate code before getting started. Flask's framework is more explicit than
Django's and is also easier to learn, because it needs less base code to implement a
simple web application.
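The pickle step described above can be sketched as a simple round trip (the dataset and variable names here are illustrative stand-ins, not the project's actual files): the model is serialized after training, and a Flask view function would later unpickle it and call `predict()` on the user's input.

```python
import pickle

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# toy data standing in for the diabetes features
X, y = make_classification(n_samples=100, n_features=8, random_state=0)
model = RandomForestClassifier(n_estimators=20, random_state=0).fit(X, y)

# serialize the trained model, as the web application stores it on disk
blob = pickle.dumps(model)

# inside the Flask route, the model is unpickled and used for prediction
loaded = pickle.loads(blob)
print(loaded.predict(X[:1]))
```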
7.5 EVALUATION PARAMETERS
Once the model has been built, its accuracy has to be evaluated using the
performance metrics of machine learning methods. We have used F1-score,
precision, recall, and the confusion matrix.
7.5.3 Precision
Ratio of true positives to all predicted positives. Important when: the cost of a false
positive is high.
Precision = TP/(TP+FP)
7.5.4 Recall
Ratio of true positives to actual positives in the data. Important when: identifying
the positives is crucial.
Sensitivity or Recall = TP/(TP+FN)
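These metrics can be computed directly with scikit-learn; a small worked example on hand-made labels (not the project's results):

```python
from sklearn.metrics import (confusion_matrix, f1_score,
                             precision_score, recall_score)

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

# confusion matrix cells: TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)                   # 3 1 1 3

print(precision_score(y_true, y_pred))  # TP/(TP+FP) = 3/4 = 0.75
print(recall_score(y_true, y_pred))     # TP/(TP+FN) = 3/4 = 0.75
print(f1_score(y_true, y_pred))         # harmonic mean of the two = 0.75
```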
7.6 SUMMARY
This chapter covered the algorithms and implementation of the project, whose major
areas are data mining and machine learning. The machine learning models with
hyperparameter tuning were discussed here.
CHAPTER 8
CONCLUSION & FUTURE WORK
8.1 CONCLUSION
Diabetic Retinopathy (DR) is the most prevalent eye disease resulting in blindness in
diabetic patients. Diabetic retinopathy (damage to the retina) is the most common
sight-threatening diabetic eye disease and causes major vision loss and blindness. A
patient with diabetes needs to undergo periodic eye screening.
The main objectives of this work were:
a) Development of a system that will be able to identify patients with BDR and PDR
from either colour or grey-level fundus images.
b) The different diabetic retinopathy lesions of interest include red spots and
bleeding, both of which fall between the BDR and PDR stages of the disease, while
SDR cases are expected to be referred to the ophthalmologist.
State-of-the-art ML techniques were adopted in this study. Traditional regression
analysis relies on hypothesis-driven assumptions, while the ML techniques used do
not require a predetermined assumption. This feature allows for data-driven
exploration of non-linear patterns that predict risk for a given individual, i.e.,
individual risk stratification. As found in this study, the importance ranking showed
that duration of diabetes, HbA1c, systolic blood pressure, TG, BMI, serum
creatinine, age, education level, duration of hypertension, and income level were the
ten most important factors for RDR. Furthermore, the given ML algorithm requires
only minimal input during the model development stage, which is especially
important since ML models can easily incorporate new data to update and optimize,
thereby continuously improving their discriminative performance over time.
Our models provided evidence for DR screening in high-risk populations and could
help to reduce the frequency of ocular examinations in low-risk populations. Limited
studies have been available on risk stratification of DR based on ML and non-ocular
parameters. By training on the data of 1,782 patients (without using cross-
validation), the logit model achieved an AUC of 0.760 based on backward
elimination as a feature selection strategy. We divided the clinical data of 536
patients in Taiwan into training and validation sets (at an 80:20 ratio), compared the
performance of four models (support vector machine, decision tree, ANN, and
logistic regression) for DR detection, and found that the support vector machine
performed best with an AUC of 0.839. Random Forest outperformed logistic
regression for DR detection, with AUCs of 0.84 and 0.77, respectively. The above-
cited studies have been based on hospital data, but population-based data are more
applicable to the reality of DR screening programmes. This study applied ML
techniques to population-based data and demonstrated their usefulness for RDR
detection, with AUCs comparable to those in hospital-based studies.
The importance ranking analysis showed that the amount and duration of smoking
and drinking were also important for RDR. Finally, the ranking of risk factors may
provide insight into the prevention of DR.
In this secondary analysis of a large-scale population-based survey, we first
extracted demographic variables, laboratory test results, and clinical and family
history, and then applied different ML algorithms to rank risk factors and to identify
RDR. The Random Forest algorithm achieved the best performance based on 10
simple variables. The use of ML algorithms to rank epidemiological risk factors
(apart from ophthalmic examinations) to identify referable patients will reduce cost
and has high utility value in resource-poor areas.
REFERENCES
1. Amol Prataprao Bhatkar and G. U. Kharat (2015), "Detection of Diabetic
Retinopathy in Retinal Images Using MLP Classifier", IEEE International
Symposium on Nanoelectronic and Information Systems (iNIS), Vol. 1, pp. 331-335.
4. Carlos Santos et al. (2021), "A New Method Based on Deep Learning to Detect
Lesions in Retinal Images using YOLOv5", IEEE International Conference on
Bioinformatics and Biomedicine (BIBM), pp. 3513-3520.
8. Farrukh Aslam Khan et al. (2021), "Detection and Prediction of Diabetes Using
Data Mining: A Comprehensive Review", IEEE Access, Vol. 9, pp. 43711-43735.
10. Jayant Yadav et al. (2017), "Diabetic Retinopathy Detection using Feedforward
Neural Network", Tenth International Conference on Contemporary Computing
(IC3), pp. 1-3.
APPENDIX
A. SCREENSHOTS
1. RANDOM FOREST
1.1 Random Forest Algorithm with hyperparameter
Figure 6: Confusion matrix for Random Forest with hyper parameter tuning
The above screenshots show the accuracy of the Random Forest algorithm with
hyper parameter tuning techniques.
1.2 Random Forest Algorithm without hyperparameter
Figure 7: Confusion matrix for Random Forest without hyper parameter tuning
The above screenshot shows the accuracy of the Random Forest algorithm without
hyperparameter tuning; the accuracy of this model is lower than with hyperparameter
tuning.
2. LOGISTIC REGRESSION
Figure 8: Confusion matrix for Logistic Regression with hyper parameter tuning
The above screenshots show the accuracy of the Logistic Regression algorithm with
hyper parameter tuning techniques.
Figure 9: Confusion matrix for Logistic Regression without hyper parameter tuning
The above screenshot shows the accuracy of the Logistic Regression algorithm
without hyperparameter tuning; the accuracy of this model is lower than with
hyperparameter tuning.
3. KNN ALGORITHM
Figure 10: Confusion matrix for KNN with hyper parameter tuning
The above screenshots show the accuracy of the KNN algorithm with hyper
parameter tuning techniques.
3.2 KNN without hyperparameter
Figure 11: Confusion matrix for KNN without hyper parameter tuning
The above screenshot shows the accuracy of the KNN algorithm without
hyperparameter tuning; the accuracy of this model is lower than with hyperparameter
tuning.
4. DECISION TREE
Figure 12: Confusion matrix for Decision Tree with hyper parameter tuning
The above screenshots show the accuracy of the Decision Tree algorithm with hyper
parameter tuning techniques.
Figure 13: Confusion matrix for Decision Tree without hyper parameter tuning
The above screenshot shows the accuracy of the Decision Tree algorithm without
hyperparameter tuning; the accuracy of this model is lower than with hyperparameter
tuning.
5. SVM ALGORITHM
Figure 14: Confusion matrix for SVM with hyper parameter tuning
The above screenshots show the accuracy of the Support Vector Machine algorithm
with hyper parameter tuning techniques.
Figure 15: Confusion matrix for SVM without hyper parameter tuning
The above screenshot shows the accuracy of the Support Vector Machine algorithm
without hyperparameter tuning; the accuracy of this model is lower than with
hyperparameter tuning.
6. ALGORITHM COMPARISON
The above comparison graph shows the overall accuracy of each algorithm using
hyper parameter tuning methods and we can conclude that the random forest
algorithm has a high accuracy compared to other algorithms.
B. SAMPLE CODE
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score, confusion_matrix

# lists collecting each model's name and accuracy for comparison
model_ = []
score_ = []

# correlation heatmap of the features; 'data' is the diabetes dataframe
# loaded earlier in the notebook
plt.figure(figsize=(10,10))
sns.heatmap(data.corr(), annot=True)

# split features and target, then hold out 20% for testing
X = data.iloc[:,:-1]
y = data['Outcome']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=10)
print("Train Set: ", X_train.shape, y_train.shape)
print("Test Set: ", X_test.shape, y_test.shape)

# Random Forest baseline
model = RandomForestClassifier(n_estimators=20)
model.fit(X_train, y_train)
acc_rand = accuracy_score(y_test, model.predict(X_test))*100
score_.append(acc_rand)
model_.append("Random Forest Classifier")
# search for optimum parameters using grid search
params = {'penalty': ['l1','l2'],
          'C': [0.01,0.1,1,10],
          'class_weight': ['balanced',None]}
# the liblinear solver supports both the l1 and l2 penalties
logistic_clf = GridSearchCV(LogisticRegression(solver='liblinear'),
                            param_grid=params, cv=10)
# train the model
logistic_clf.fit(X_train, y_train)
# make predictions
logistic_predict = logistic_clf.predict(X_test)
log_accuracy = accuracy_score(y_test, logistic_predict)
log_accuracy = round(log_accuracy*100, 2)
model_.append("Logistic Regression")
score_.append(log_accuracy)
cm = confusion_matrix(y_test, logistic_predict)
conf_matrix = pd.DataFrame(data=cm, columns=['Predicted:0','Predicted:1'],
                           index=['Actual:0','Actual:1'])
plt.figure(figsize=(8,5))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap="YlGnBu")
# search for optimum parameters using grid search
params = {'n_neighbors': np.arange(1, 5)}
knn_clf = GridSearchCV(estimator=KNeighborsClassifier(), param_grid=params,
                       scoring='accuracy', cv=10, n_jobs=-1)
# train the model
knn_clf.fit(X_train, y_train)
knn_clf.best_params_
# make predictions
knn_predict = knn_clf.predict(X_test)
# accuracy
knn_accuracy = accuracy_score(y_test, knn_predict)
knn_accuracy = round(knn_accuracy*100, 2)
model_.append("K Nearest Neighbour")
score_.append(knn_accuracy)
cm = confusion_matrix(y_test, knn_predict)
conf_matrix = pd.DataFrame(data=cm, columns=['Predicted:0','Predicted:1'],
                           index=['Actual:0','Actual:1'])
plt.figure(figsize=(8,5))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap="YlGnBu")
from sklearn.tree import DecisionTreeClassifier
dtree = DecisionTreeClassifier(random_state=7)
# grid search for optimum parameters
params = {'max_features': ['auto', 'sqrt', 'log2'],
          'min_samples_split': [2,3,4,5,6,7,8,9,10,11,12,13,14,15],
          'min_samples_leaf': [1,2,3,4,5,6,7,8,9,10,11]}
tree_clf = GridSearchCV(dtree, param_grid=params, n_jobs=-1)
# train the model
tree_clf.fit(X_train, y_train)
tree_clf.best_params_
# make predictions
tree_predict = tree_clf.predict(X_test)
# accuracy
tree_accuracy = accuracy_score(y_test, tree_predict)
tree_accuracy = round(tree_accuracy*100, 2)
model_.append("Decision Tree Classifier")
score_.append(tree_accuracy)
cm = confusion_matrix(y_test, tree_predict)
conf_matrix = pd.DataFrame(data=cm, columns=['Predicted:0','Predicted:1'],
                           index=['Actual:0','Actual:1'])
plt.figure(figsize=(8,5))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap="YlGnBu")
# grid search for optimum parameters
Cs = [0.001, 0.01, 0.1, 1, 10]
gammas = [0.001, 0.01, 0.1, 1]
param_grid = {'C': Cs, 'gamma': gammas}
svm_clf = GridSearchCV(SVC(kernel='rbf', probability=True), param_grid,
                       cv=10)
# train the model and make predictions
svm_clf.fit(X_train, y_train)
svm_predict = svm_clf.predict(X_test)
cm = confusion_matrix(y_test, svm_predict)
conf_matrix = pd.DataFrame(data=cm, columns=['Predicted:0','Predicted:1'],
                           index=['Actual:0','Actual:1'])
plt.figure(figsize=(8,5))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap="YlGnBu")
C. PUBLICATION
Dear Author,
Your manuscript was accepted and recommended for publication in the Springer
Series - Advances in Intelligent Systems and Computing.
Kindly ensure the following points before uploading the final paper.
1. Final manuscript must be as per Springer template, Refer Springer sample word document
Click Here
2. Minimum 15-20 references are expected and that must be in the article and all
references must be cited in the text. Like [1], [2],.....
3. The article has few typographical errors which may be carefully looked at.
4. Complete the Consent to publish form (Publishing agreement).
5. Ensure all the figures and tables are cited in the sequential order.
6. Mark * for the corresponding author name and email address in the first page of the paper.
Important dates:
Conference Date: 29-30, June 2022
Last Date for Registration: 8, May 2022
https://www.townscript.com/v2/e/4th-international-conference-on-intelligent-computing-information-and-control-systems-204131/booking/tickets
Registration Method: Send Final paper (in both .doc & .pdf), Response to Reviewer
Comments, Publishing agreement and screen snapshot of payment proof to
iccs.conf.org@gmail.com