
DETECTING CLINICAL SIGNS OF ANAEMIA

USING MACHINE LEARNING


A PROJECT REPORT
Submitted by

ALPHIUS VICTORIA ASHLEE P (312420104009)

HARINI S (312420104051)

in partial fulfillment for the award of the degree of

BACHELOR OF ENGINEERING

in

COMPUTER SCIENCE AND ENGINEERING

St. JOSEPH’S INSTITUTE OF TECHNOLOGY


(An Autonomous Institution)

ANNA UNIVERSITY : CHENNAI 600 025

MARCH 2024
ANNA UNIVERSITY : CHENNAI 600 025

BONAFIDE CERTIFICATE

Certified that this project report “DETECTING CLINICAL SIGNS

OF ANAEMIA USING MACHINE LEARNING” is the bonafide

work of

“ALPHIUS VICTORIA ASHLEE P (312420104009) and HARINI S

(312420104051)” who carried out the project work under my


supervision.

SIGNATURE                                      SIGNATURE

Dr. J. DAFNI ROSE M.E., Ph.D.,                 Ms. K. SHERIN M.E.,
PROFESSOR AND HEAD,                            SUPERVISOR,
                                               Assistant Professor,
Computer Science and Engineering               Computer Science and Engineering
St. Joseph's Institute of Technology,          St. Joseph's Institute of Technology,
Old Mamallapuram Road,                         Old Mamallapuram Road,
Chennai-600 119.                               Chennai-600 119.


ACKNOWLEDGEMENT

We also take this opportunity to thank our respected and honorable
Chairman Dr. B. Babu Manoharan M.A., M.B.A., Ph.D. for the
guidance he offered during our tenure in this institution.

We extend our heartfelt gratitude to our respected and honourable
Managing Director Mr. B. Sashi Sekar M.Sc. for providing us with the
required resources to carry out this project.

We express our deep gratitude to our honourable Executive


Director Mrs. S.Jessie Priya M.Com. for the constant guidance and
support for our project.

We are indebted to our Principal Dr. P. Ravichandran M.Tech., Ph.D.


for granting us permission to undertake this project.

We would like to express our earnest gratitude to our Head of the


Department Dr. J. Dafni Rose M.E., Ph.D. for her commendable support
and encouragement for the completion of the project with perfection.

We also take the opportunity to express our profound gratitude


to our guide Ms. K. Sherin M.E. for her guidance, constant
encouragement, immense help and valuable advice for the completion of
this project.

We wish to convey our sincere thanks to all the teaching and non-
teaching staff of the department of COMPUTER SCIENCE AND
ENGINEERING without whose cooperation this venture would not have
been a success.
CERTIFICATE OF EVALUATION

College Name : St. JOSEPH'S INSTITUTE OF TECHNOLOGY

Branch : COMPUTER SCIENCE AND ENGINEERING

Semester : VIII

Sl. No.   Name of the Students           Title of the Project       Name of the Supervisor with designation

1         ALPHIUS VICTORIA ASHLEE P      Detecting Clinical         Ms. K. Sherin, M.E.,
          (312420104009)                 Signs of Anaemia           Assistant Professor
2         HARINI S (312420104051)        Using Machine Learning

The report of the project work submitted by the above students in


partial fulfillment for the award of Bachelor of Engineering Degree in
Computer Science and Engineering of Anna University was evaluated
and confirmed to be a report of the work done by the above students.

Submitted for project review and viva voce exam held on___________

(INTERNAL EXAMINER) (EXTERNAL EXAMINER)

ABSTRACT

Anaemia represents a pressing global health issue,
disproportionately affecting children and pregnant women. It is a
widespread medical condition marked by a deficiency of red blood
corpuscles, posing significant health risks worldwide. Early detection is
crucial for effective intervention and management. According to a study
by WHO, approximately 42% of children under the age of 6 and 40% of
pregnant women globally suffer from anaemia. This condition impacts
approximately 33% of the world's total population, primarily due to iron
deficiency. Addressing anaemia is paramount to improving public health
outcomes and reducing the burden on healthcare systems worldwide.
Anaemia occurs once the level of red blood cells within the body
decreases or when the structure of the red blood cells is destroyed or
weakened. Early detection of anaemia helps to prevent irreversible organ
damage. Non-invasive techniques, such as those based on machine
learning algorithms, are increasingly used in the diagnosis and detection
of clinical diseases, and anaemia detection is no exception. In this study,
machine learning algorithms were used to detect iron-deficiency anaemia
with the application of Naïve Bayes, CNN, SVM, k-NN, and Decision
Tree. This enabled us to compare images of the conjunctiva of the eyes,
the palpable palm, and the color of the fingernails to determine which of
them yields higher accuracy for detecting anaemia in children. The
technique utilized in this study was organized into three stages: collection
of the datasets and preprocessing of the images, feature extraction, and
segmentation of the Region of Interest of the images. The models were
then developed for the detection of anaemia using the various algorithms.

TABLE OF CONTENTS

CHAPTER   TITLE

          ABSTRACT
          LIST OF FIGURES
          LIST OF TABLES

1.        INTRODUCTION
          1.1 OVERVIEW
          1.2 PROBLEM STATEMENT
          1.3 EXISTING SYSTEM
              1.3.1 Disadvantages of Existing System
          1.4 PROPOSED SYSTEM

2.        LITERATURE REVIEW

3.        SYSTEM DESIGN
          3.1 UNIFIED MODELLING LANGUAGE
              3.1.1 Use Case Diagram of Anaemia Detection
              3.1.2 Class Diagram of Anaemia Detection
              3.1.3 Sequence Diagram of Anaemia Detection
              3.1.4 Activity Diagram of Anaemia Detection
              3.1.5 Deployment Diagram of Anaemia Detection

4.        SYSTEM ARCHITECTURE

5.        SYSTEM IMPLEMENTATION
          5.1 MODULE DESCRIPTION
          5.2 MODULES
              5.2.1 Exploratory Data Analysis
              5.2.2 Statistical Test Module
              5.2.3 Feature Selection
              5.2.4 Data Preprocessing
              5.2.5 Class Imbalance and Data Leakage Handling
              5.2.6 Algorithm Implementation Module
              5.2.7 Hyper-Parameter Training and Cross-Validation

6.        RESULTS AND CODING
          6.1 TOOLS AND LANGUAGES
          6.2 SAMPLE CODE
          6.3 SAMPLE SCREENSHOTS
          6.4 RESULTS AND GRAPH

7.        CONCLUSION AND FUTURE WORKS

          REFERENCES

LIST OF FIGURES

FIGURE NO.   NAME OF THE FIGURE

3.1          Use case diagram of anemia detection
3.2          Class diagram of anemia detection
3.3          Sequence diagram of anemia detection
3.4          Activity diagram of anemia detection
3.5          Deployment diagram of anemia detection
4.1          System architecture diagram of anemia detection system
6.1          Distribution of anemia by gender
6.2          Initial screen of anemia detection system
6.3          Output screen of anemia detection system

LIST OF TABLES

TABLE NO.    TABLE NAME

5.1          Results of Hyperparameter Tuning Module of Anemia Detection
5.2          Results of Cross-validation Module of Anemia Detection

LIST OF ABBREVIATIONS

ACRONYM    ABBREVIATION

LR         Logistic Regression
NB         Naïve Bayes
SVM        Support Vector Machine
SVC        Support Vector Classifier
DT         Decision Tree
KNN        K-Nearest Neighbours
RF         Random Forest
SMOTE      Synthetic Minority Oversampling Technique
ADASYN     Adaptive Synthetic Sampling Method for Imbalanced Data
EDA        Exploratory Data Analysis
HB         Hemoglobin
MCH        Mean Corpuscular Hemoglobin
MCHC       Mean Corpuscular Hemoglobin Concentration
MCV        Mean Corpuscular Volume
UML        Unified Modelling Language
AUC-ROC    Area Under the Curve of the Receiver Operating Characteristic
CHAPTER 1

INTRODUCTION

Anaemia develops when the body's supply of red blood cells


decreases or when the structure of the cells is damaged or weakened.
Anaemia can also develop when the Hb level in the red blood cells falls
below the usual threshold as a result of increased red blood cell
destruction, blood loss, defective cells, or a reduced number of red blood
cells. Preventing irreparable organ damage through early detection of
anaemia and medication is one of the significant measures for its
treatment. Long-term illness can also increase a patient's risk of
developing anaemia. Conditions associated with the complex occurrence
of anaemia include diabetes, kidney syndrome, cancer, HIV/AIDS,
inflammatory bowel disease, and cardiovascular disease. Other
significant causes include hemoglobinopathies, bilharzia, and malaria.
Iron-deficiency anaemia, sickle cell disease, thalassaemia, aplastic
anaemia, and vitamin-deficiency anaemia are only a few of the different
types of anaemia that exist. Every form of anaemia has numerous causes,
ranges from mild to severe, and can be either transient or long-term. The
potential risk of anaemia can be detected and monitored using
non-invasive methods and smartphone-based devices, which are
promising tools for addressing this public health issue. Given this,
researchers have conducted numerous studies to develop robust,
non-invasive approaches to detect or diagnose anaemia, such as the use
of medical images and machine learning algorithms that detect anaemia
and make predictions for the future.

1.1 OVERVIEW

The laboratory procedure for the diagnosis and detection of anaemia in
response to clinical concerns faces many challenges in practice,
including insufficient funding for medical tests, a lack of technical
expertise and equipment in remote locations, quality requirements, and
client reluctance that results in abstinence. Healthcare workers who
perform invasive procedures are also at risk of contracting bloodborne
infections. As a result, non-invasive approaches, such as machine
learning algorithms, have drawn a lot of interest for their application in
the detection of anaemia as a means to overcome these diagnostic
challenges.

1.2 PROBLEM STATEMENT

Anaemia is one of the global public health problems that affect children
and pregnant women. A study by WHO stated that 42% of children
below the age of six and 40% of pregnant women worldwide are
anaemic; in total, about 33% of the world's population is affected,
largely as a result of iron deficiency. To address this problem, the
clinical signs of anaemia need to be detected. This can be done through
two main steps: training and testing. AI is employed to build systems
such as classification-based systems, neural network-based systems, and
support vector machine-based systems. Machine learning techniques
depend greatly on the quality of the training data. In this study, we aim
to use clinical haematological values to detect anaemia with machine
learning algorithms through a comparative study of Decision Tree,
Random Forest, SVM, Naïve Bayes, and k-NN.

1.3 EXISTING SYSTEM

The previous detection technique uses haematology analyzers, and each
clinical lab uses a different technique for calculation, so the results may
vary from one technique to another. The existing system of detecting
anaemia using a haematology analyzer is an invasive method that is
costly, time consuming, and painful to patients due to the extraction of
blood. Electrophoresis is a separation technique based on the mobility
of ions in an electric field. It is the classical method of identifying and
quantifying the haemoglobin proteins. It depends on characteristics such
as the strength of the electric field, the molecular mass, and the ionic
strength and temperature of the buffer. It is a long procedure with
several disadvantages: when a voltage beyond the electrical limit of the
potential field is applied, it destroys the sample, forcing a fresh sample
to be taken from the patient, or it may produce inconsistent results.

1.3.1 DISADVANTAGES OF EXISTING SYSTEM

• Clinicians are at risk of infection since they are the ones who draw
blood from patients.
• The electrophoresis method cannot be performed if the current or
voltage supply is unreliable.
• Test results vary from lab to lab because different techniques are
used, leaving patients unsure of which result to believe.
• Sahli's method of screening haemoglobin uses acid hematin as a
suspension, not a true solution. This method cannot measure all forms
of haemoglobin, and the chance of visual error is high.

1.4 PROPOSED SYSTEM

The proposed system aims to detect anemia using machine


learning algorithms by analyzing attributes extracted from
hematological data, including gender, hemoglobin, mean corpuscular
hemoglobin (MCH), mean corpuscular hemoglobin concentration
(MCHC), and mean corpuscular volume (MCV). Initially, the system
conducts exploratory data analysis and applies statistical tests to
understand the dataset's characteristics and associations. Feature
selection techniques such as correlation analysis, SelectKBest, and
Extra Tree Classifier are employed to identify relevant attributes.
Additionally, the system addresses class imbalance through methods
like random undersampling, random oversampling, SMOTE, and
ADASYN, while handling data leakage by removing pertinent features.

Multiple machine learning algorithms including Decision Tree,


Random Forest, Logistic Regression, K-Nearest Neighbors, Support
Vector Machine, and Gaussian Naive Bayes are trained and evaluated
using performance metrics such as accuracy, area under the curve,
precision, recall, F1 score, and kappa statistic. Hyperparameter tuning
with GridSearchCV and 5-fold cross-validation ensures optimal model
performance and generalizability.
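As a rough illustration of this training-and-tuning pipeline, the following sketch assumes a CSV file named anemia.csv with the columns Gender, Hemoglobin, MCH, MCHC, MCV, and a binary Result label; the file name, column names, and the Random Forest grid shown are illustrative assumptions rather than the project's exact code:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, cohen_kappa_score,
                             f1_score, roc_auc_score)
from sklearn.model_selection import GridSearchCV, train_test_split

df = pd.read_csv("anemia.csv")                         # hypothetical file name
X = df[["Gender", "Hemoglobin", "MCH", "MCHC", "MCV"]]
y = df["Result"]                                        # assumed 1 = anaemic, 0 = non-anaemic

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# 5-fold GridSearchCV over a small, illustrative Random Forest grid
grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid={"n_estimators": [100, 200], "max_depth": [None, 5, 10]},
    cv=5, scoring="accuracy")
grid.fit(X_train, y_train)

# Evaluate the tuned model with the metrics named above
pred = grid.predict(X_test)
print("best params:", grid.best_params_)
print("accuracy:", accuracy_score(y_test, pred))
print("AUC:", roc_auc_score(y_test, grid.predict_proba(X_test)[:, 1]))
print("F1:", f1_score(y_test, pred))
print("kappa:", cohen_kappa_score(y_test, pred))
```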

Furthermore, the proposed system enables clinicians to efficiently


identify individuals at risk of anemia by leveraging advanced machine
learning techniques. The system's ability to handle data intricacies, such
as class imbalance and potential data leakage, ensures the reliability and

effectiveness of the anemia detection process, ultimately contributing to
improved patient outcomes and healthcare management.

CHAPTER 2

LITERATURE REVIEW

Akmal Hafeel et al.,(2019) [1] devised a novel approach for


detecting anemia, a condition resulting from insufficient Fe3+ ions in
the body, which can potentially lead to severe complications like organ
failure, heart attacks, or even death. Their innovation centered on a
diagnostic device that leverages a key symptom of anemia: the intensity
of red in the blood. This device incorporates a built-in questionnaire
within a mobile application, allowing users to input additional data. The
collected information is transmitted to a central server, where
sophisticated algorithms analyze it to determine the presence of anemia.
Testing the device on individuals known to have anemia yielded an
impressive accuracy rate of nearly 83%. Subsequent validation against
conventional diagnostic methods confirmed the device's ability to
provide reliable results.

Aparna et al.,(2017) [2] aimed to compare anaemic and non-


anemic images for the detection of anemia with the application of k-
NN, SVM, and decision tree algorithms. Counting of the red blood cells
from the blood smear images can help in detecting anemia. A
simulation model for anemia detection is presented in this paper. Both
Circular Transform and Connected Component Labelling are
implemented for counting the number of RBCs and the results are
compared. The study utilized images of the conjunctiva of the eyes, and

blood samples were used to measure the Hb values of patients as an
auxiliary dataset to the images. The decision tree algorithm achieved an
accuracy of 95% which was higher than the SVM and the k-NN
performance of 80% and 83% respectively.
Azwad Tamir et al.,(2017) [3] proposed a pioneering method for
anemia detection, crucial given the condition's prevalence affecting a
quarter of the world's population. By leveraging smartphone-captured
images of the eye's conjunctival color, Tamir's approach offers a non-
invasive, automated alternative to traditional diagnosis methods
involving invasive blood tests. Analyzing the color spectrum of the
conjunctival tissue, the system determines anemia status by comparing
it against a predefined threshold. Testing on 19 subjects revealed a
promising 78.9% accuracy rate, aligning with patients' blood reports in
15 cases.
Chayashree Patgiri et al. (2019) [4] noted that modern image
processing techniques, particularly thresholding, are crucial in medical
abnormality detection. Adaptive thresholding is especially powerful in
analyzing medical images for diseases like Sickle Cell Disease (SCD),
where identifying distorted red blood cells is essential for diagnosis.
This study explores adaptive thresholding techniques such as Niblack,
Bernsen, Sauvola, and NICK for segmenting blood images to detect
SCD. By focusing on these methods, the paper aims to provide a
comparative analysis to enhance the accuracy and efficiency of SCD
diagnosis from microscopic blood images.
Enas Walid Abdulhay et al.,(2021) [5] proposed a novel method
for diagnosing Malaria and various types of Anemia using
Convolutional Neural Networks (CNN) on high-resolution blood

sample images. By training the CNN on diverse microscopic images, it
can classify samples into normal blood cells, Malaria, Sickle cell
anemia, Megaloblastic anemia, or Thalassemia without requiring the
standard Complete Blood Count (CBC) test. The proposed approach
achieves a test accuracy of 93.4%, offering rapid and cost-effective
diagnosis without laboratory analysis. This method streamlines the
diagnostic process, potentially revolutionizing blood sample analysis.
Furkan Kiraci et al.,(2018) [6] endeavored to streamline the
diagnosis of Sickle Cell Anemia by leveraging Image Processing
Algorithms. By meticulously isolating sickle cells from healthy cells
within blood tissue images, the study achieves commendable results,
boasting an accuracy of 91.11%, a precision rate of 92.9%, and a recall
score of 79.05%. Such precision not only accelerates the diagnostic
process but also substantially reduces the likelihood of misdiagnoses,
ensuring more accurate and efficient patient care.
Garima Vyas et al. (2016) [7] proposed a method that involved
acquisition of thin blood smear microscopic images, pre-processing
by applying median filter, segmentation of overlapping erythrocytes
using marker-controlled watershed segmentation, applying
morphological operations to enhance the image, extraction of features
such as metric value, aspect ratio, radial signature and its variance, and
finally training the K-nearest neighbor classifier to test the images. The
algorithm processes the infected cells increasing the speed,
effectiveness and efficiency of training and testing. The K-Nearest
Neighbour classifier is trained with 100 images to detect three different
types of distorted erythrocytes namely sickle cells, dacrocytes and

elliptocytes responsible for sickle cell anaemia and thalassemia with an
accuracy of 80.6% and sensitivity of 87.6%.
Jessie R. Balbin et al. (2019) [8] proposed a Raspberry Pi-based system
to aid in the identification of abnormal red blood cells (RBCs). By measuring
parameters such as area, perimeter, diameter, and shape geometric
factor (SGF), while also detecting central pallor and target flags, the
system offers a comprehensive approach to RBC analysis. Notably,
previous studies have explored different methodologies, including
Artificial Neural Networks (ANN) and radial basis function networks,
achieving accuracies of 90.54% and 83.3% respectively. Balbin et al.
opted for a Support Vector Machine (SVM) classifier, which achieved
an impressive accuracy of 93.33% in identifying seven distinct RBC
types, including normal cells and various abnormalities. This
classification capability is particularly valuable in diagnosing a range of
anemias, such as iron-deficiency anemia, thalassemia, and hereditary
spherocytosis, thereby aiding clinicians in early detection and treatment
planning. While the system serves as a valuable aid for initial
identification of abnormal RBCs, it's crucial to underscore that
conclusive diagnoses necessitate further confirmation through
laboratory examinations.
Joan et al. (2014) [9] proposed a portable device for point-of-care
anemia detection that uses impedance analysis with custom electronics,
software, and disposable sensors. Forty-eight whole blood samples
from hospitalized patients were collected, with 10 used for calibration
and 38 for validation. Calibration involved EIS to determine the
impedance spectrum for accurate hematocrit detection. A protocol for
instant impedance detection was developed, achieving less than 2%
accuracy error for impedance variations. An algorithm based on
impedance analysis was used for hematocrit detection, demonstrating
effectiveness and robustness with a 1.75% accuracy error and less than
5% coefficient of variation.

Kathirvelu M et al. (2023) [10] focused on Sickle Cell Anemia, a
hereditary disorder that arises from a mutated gene crucial for
hemoglobin production. This genetic anomaly causes red blood cells to
adopt a sickle shape, leading to blockages in blood arteries.
Consequently, Sickle Cell Disease (SCD) disrupts oxygen flow to
various parts of the body, resulting in severe anemia. While SCD
remains incurable, timely detection, coupled with appropriate
medication and treatment, can significantly enhance patients' life
expectancy and quality of life. This methodology aims to facilitate the
early detection of Sickle Cell Disease, thereby enabling prompt
intervention and management strategies to improve patient outcomes.

Kumar et al.,(2022) [11] proposed a groundbreaking architectural


design for an anaemia detection device, introducing a multi-wavelength
spectrophotometry sensing platform that revolutionizes conventional
diagnostic methods. This innovative platform integrates light-emitting
diodes and fiber optics in a unique configuration, elevating the accuracy
and sensitivity of anaemia detection. A notable advancement is the
incorporation of a mechanical lever-operated fiber optics-based sensor,
circumventing limitations of traditional non-invasive finger probes and
enabling the detection of haemoglobin levels as low as 1.6 g/dL. By
surpassing the sensitivity of existing methodologies, the device offers a
promising solution for reliable and efficient anaemia screening. This
advancement holds significant potential for transforming medical

diagnostics, providing a more accessible and non-invasive approach to
identifying this prevalent condition.

Maileth Rivero-Palacio et al.,(2021) [12] proposed the


development of a mobile application for anemia detection using YOLO
v5. The app aims to bring an easy diagnosis as support to health-care
professionals to know the presence of anemia in children in places
where mobile signals are missing (isolated). Here, we present the
implementation and inclusion of the YOLO v5 neural network into the
app. We used an image dataset obtained from Universidad Peruana
Cayetano Heredia, which contains several pictures of children under
five years old and their respective prognosis with the blood test.
The authors report that YOLO v5 achieves good results when run on a computer.

Megha Tyagi et al. (2016) [13] focused on addressing blood


disorders, particularly iron deficiency anemia, by employing an
Artificial Neural Network (ANN)-based approach. The study aimed to
classify normal red blood cells and various poikilocyte cells, such as
Degmacyte, Dacrocyte, Schistocyte, and Elliptocyte, which are
indicative of iron deficiency anemia. By utilizing digital images of
blood smears, the system underwent a series of preprocessing steps,
including segmentation and morphological operations, to extract
meaningful features from the images. These features were then used for
classification, enabling the identification of different cell types
efficiently and accurately.

Muhammad Hasan et al. (2019) [14] focused on hemoglobinopathies
like Sickle Cell Disease (SCD) and Thalassemia, which rank as the
third most prevalent causes of anemia, following iron-deficiency
anemia and hookworm disease. Diagnosis and monitoring of anemia
and SCD pose significant challenges in low and middle-income
countries due to limited laboratory infrastructure, skilled personnel, and
financial resources. To address these challenges, an extension of the
HemeChip system has been developed, termed HemeChip+, which
incorporates total hemoglobin quantification and anemia testing
capabilities. HemeChip+ boasts mass-producibility at low cost, offering
a pioneering single-test point-of-care (POC) platform.
Muljino et al., (2024) [15] proposed a non-invasive method using
conjunctival images to detect anemia early, aiming to overcome
limitations in current diagnostic methods. The SVM algorithm-
integrated MobileNetV2 method achieves 93% accuracy, 91%
sensitivity, and 94% specificity in categorizing anemic and healthy
patients. This approach offers promise for efficient and precise anemia
diagnosis in clinical settings, potentially improving healthcare by
identifying anemia earlier.
Pooja Tukaram Dalvi et al. (2016) [16] developed an efficient
machine learning classifier that can detect and classify anemia
accurately. In this paper five ensemble learning methods: Stacking,
Bagging, Voting, Adaboost and Bayesian Boosting are applied on four
classifiers: Decision Tree, Artificial Neural Network, Naïve Bayes and
K-Nearest Neighbor. The aim is to determine which individual
classifier or subset of classifier combinations achieves maximum
accuracy in red blood cell classification for anemia detection. From the
results it is evident that amongst the ensemble methods, the stacking
ensemble method achieves the highest accuracy. Amongst the
individual classifiers the Artificial Neural Network performs the best and
K-Nearest Neighbor performs the worst. However, the classifier
combination of Decision Tree and K-Nearest Neighbor, when applied
with the Stacking ensemble, achieves an accuracy much higher than the
Artificial Neural Network. This indicates that an ensemble of classifiers
achieves much higher accuracy than individual classifiers. Hence, to
achieve maximum accuracy in medical decision making, an ensemble of
classifiers should be used.
Pranati Rakshit et al., (2013) [17] focused on the identification of
morphological changes in red blood cells (RBCs) in haemolytic
anaemia, specifically caused by enzyme deficiencies like G-6-P-D
deficiency. The study employs image processing techniques on blood
smear images, including Wiener filtering for preprocessing and Sobel
edge detection for boundary identification. Additionally, a metric is
devised for abnormal RBC shape determination.

Rita Magdalena et al. (2020) [18] proposed a non-invasive
computer-aided diagnosis system for detecting anemia based on digital
image processing. The method analyzes the conjunctival image
of the eye. This study uses the first-order statistic feature extraction
method and K-Nearest Neighbor (K-NN) for classifying the
conjunctival image into two conditions, anemia and non-anemia
conditions. The feature extraction method is performed on RGB, Hue,
Saturation, and Value (HSV), and grayscale color space. The system
achieved 71.25% of accuracy by using the most optimal parameters on
the Green layer of RGB with K=5 and Euclidean distance equation.

Roszymah Hamzah et al. (2018) [19] developed an image processing
technique for the automated detection of Sickle Cell Anemia. A
Laplacian of Gaussian (LoG) edge detection algorithm was computed
to detect sickle cell disease at an early stage of diagnosis. MATLAB
software was able to demonstrate the abnormalities of human Red
Blood Cells (RBCs) in the individual shapes and quantities of sickle
cells present in each dataset. Data samples of sickle cells from the
government Ampang Hospital contributed to this study to validate the
results.

Sagnika Ghosal et al.,(2020) [20] proposed a novel, autonomous


smart anemia-care technique using smartphone-based spectroscopy for
hemoglobin level monitoring. The approach utilizes the smartphone
camera to quantify hemoglobin levels based on color spectroscopy of
the conjunctival pallor. Anemia is diagnosed if the predicted
hemoglobin level is below 11.5 g/dL. The model achieves an accuracy of
±0.32 g/dL and a sensitivity of 89% compared to actual blood
hemoglobin levels in a study with 65 participants. It remains robust to
varying illumination and device types, making it a reliable and
convenient alternative to blood-based laboratory tests for anemia
diagnosis.

Sasikala C et al. (2022) [21] introduced a pioneering technique


geared towards detecting anemia within clinical settings. This study
delved into the realm of supervised machine learning, utilizing CBC
(complete blood count) data sourced from pathology centers. Various
algorithms, including Naive Bayes, LR, LASSO, and ES, were
scrutinized to predict the occurrence of anemia and assess the
likelihood of patient recovery within a 90-day timeframe. Notably, the
research findings indicate that the Naive Bayes approach demonstrates
superior performance in terms of accuracy when compared to LR,
LASSO, and ES algorithms. This groundbreaking methodology

presents promising avenues for refining diagnostic processes and
prognostic assessments, ultimately fostering more effective patient care
strategies.

Sherif H. Elgohary et al. (2022) [22] have introduced a


revolutionary method that offers remote and non-invasive hemoglobin
level screening using smartphones and cutting-edge AI techniques. By
capturing images of the eye, this standardized approach automatically
extracts the conjunctiva as a Region of Interest (ROI). Subsequently,
the ROI undergoes meticulous processing to extract features for
training a machine-learning algorithm to discern the presence of
anemia. With extensive testing conducted on over 200 subjects, the
model achieves remarkable accuracy, boasting an 85% accuracy rate,
86% precision, and an 81% recall score. These compelling results
highlight the transformative potential of this approach in enhancing
diagnostic precision and facilitating timely interventions, thereby
significantly improving patient outcomes.

Tajkia Saima Chy et al., (2018) [23] proposed a novel method to


detect and classify sickle cells in red blood cells (RBC) using image
processing techniques. It involves image collection, preprocessing
(including grayscale conversion, enhancement, and median filtering),
threshold segmentation, and morphological operations to extract
features such as metric value, aspect ratio, entropy, mean, standard
deviation, and variance. These features are used to train a support
vector machine classifier for testing images. The system shows
improved accuracy and sensitivity compared to existing methods,
offering the potential to save lives through early treatment. In this
work, the detection of anemia is carried out non-invasively through the
conjunctiva of the eye using the Principal Component Analysis (PCA)
method and the K-Nearest Neighbor (K-NN) method. With the best
parameters, an image size of 256×128 pixels, a PCA percentage
parameter of 40%, cityblock distance, and K=9, the system achieved an
accuracy of 87.5% with a computing time of 1.317 seconds using 60
training samples and 40 test samples.

Tiago Bonini Borchartt et al. (2023) [24] unveiled an innovative


automated method tailored for detecting anemia in small ruminants
through non-invasive visual analysis of the ocular conjunctiva.
Leveraging advanced image processing techniques and machine
learning algorithms, the approach adeptly extracts relevant features and
categorizes animals into different infection levels. Utilizing a dataset
comprising authentic photographs of animals, each assessed based on
the FAMACHA scale and hemoglobin level measurements, a
comprehensive analysis was conducted on 114 images encompassing
both sheep and goats. Notably, the utilization of the BIC
(Border/Interior Classification) descriptor for conjunctiva information
extraction, along with the testing of two distinct classifiers, SVM and
KNN, yielded promising experimental results. This pioneering research
holds significant promise in enhancing the early detection and
management of anemia in small ruminants, ultimately benefiting
livestock health and welfare.

Vinit P. Kharkar et al. (2022) [25] observed that the diagnosis of
anaemia using a non-invasive approach is a cost-effective and reliable
method and is now an important research issue, since it spares patients
the pain of physical examinations and clinical laboratory testing. The
“Eyesdefy-anaemia dataset,” which includes

218 images of the conjunctiva of the eye from Italy and India was
utilized in the study. It gives a comprehensive review of the research
work in this field. Effect of various factors such as age, gender, etc. on
hemoglobin levels is also discussed in this research article. This paper
gives a brief overview of various data collection and preprocessing
methods. It also gives a comprehensive analysis of technologies used to
detect anemia with the help of hemoglobin estimation. The study of
performance measures used for the evaluation of results is covered
while reviewing existing research work. This paper provides the input
for the novel research.

CHAPTER 3

SYSTEM DESIGN

In this chapter, the various UML diagrams for detecting the


clinical signs of anaemia using Machine Learning are represented and
the various functionalities are explained.

3.1 UNIFIED MODELLING LANGUAGE

The Unified Modelling Language is a standardized modeling language
consisting of an integrated set of diagrams, developed to help
system and software developers with specifying, visualizing,
constructing, and documenting the artifacts of software systems, as well
as for business modeling and other non-software systems. The Unified
Modelling Language represents a collection of best engineering
practices that have proven successful in the modeling of large and

complex systems. The Unified Modelling Language is a very important
part of developing object-oriented software and the software
development process.
The Unified Modelling Language uses mostly graphical notations to
express the design of software projects. Using the UML helps project
teams communicate, explore potential designs, and validate the
architectural design of the software. When you’re writing code, there
are thousands of lines in an application, and it’s difficult to keep track
of the relationships and hierarchies within a software system. UML
diagrams divide that software system into components and
subcomponents. Using UML diagrams, developers can easily visualize
the structure of their software, including classes, objects, relationships,
and behaviors.

3.1.1 Use Case Diagram of Anaemia Detection

A use case diagram is used to represent the dynamic behaviour


of a system. It encapsulates the system's functionality by incorporating
use cases, actors, and their relationships. It models the tasks, services,
and functions required by a system/subsystem of an application. It
depicts the high-level functionality of a system and also tells how the
user handles a system. The main purpose of a use case diagram is to
portray the dynamic aspect of a system. It captures the system's
requirements, including both internal and external influences, and
identifies the persons, use cases, and other elements accountable for
realizing the functionality shown in the diagram. It represents how an
entity from the external environment can interact with a part of the
system.

Figure 3.1: Use case diagram for anemia detection

Figure 3.1 shows the use case of an anaemia detection system in


which there are two actors, namely the user and the machine learning
model. There are five use cases that represent the specific functionality
of detecting anemia through medical images.
3.1.2 Class Diagram for Anaemia Detection

The class diagram depicts a static view of an application. It


represents the types of objects residing in the system and the
relationships between them. A class consists of its objects, and also it
may inherit from other classes. A class diagram is used to visualize,
describe, and document various different aspects of the system, and
also construct executable software code.

Figure 3.2: Class diagram for anaemia detection

Figure 3.2 shows the attributes, classes, functions, and relationships of
the anaemia detection system. Since it is a collection of classes,
interfaces, associations, collaborations, and constraints, it is termed a
structural diagram. The class diagram serves as a blueprint for software
development, facilitating communication between stakeholders and
guiding the implementation.

3.1.3 Sequence Diagram for Anaemia Detection

The sequence diagram represents the flow of messages in the


system and is also termed as an event diagram. It helps in envisioning
several dynamic scenarios.
It portrays the communication between any two lifelines as a
time-ordered sequence of events in which those lifelines take part at run
time. In UML, a lifeline is represented by a vertical dashed line, whereas
the message flow is represented by horizontal arrows ordered from the
top of the page to the bottom.

Figure 3.3: Sequence diagram for anaemia detection

Figure 3.3 represents the sequence diagram of anaemia detection


system. Here, the lifeline is represented by a vertical bar, whereas the
actions performed or the messages are represented by a horizontal line.
3.1.4 Activity Diagram for Anaemia Detection

The activity diagram is used to demonstrate the flow of control


within the system rather than the implementation. It models the
concurrent and sequential activities.
The activity diagram helps in envisioning the workflow from one
activity to another. It puts emphasis on the condition of flow and the
order in which it occurs.

It is also termed as an object-oriented flowchart. It encompasses
activities composed of a set of actions or operations that are applied to
model the behavioural diagram.

Figure 3.4: Activity diagram for anaemia detection

Figure 3.4 depicts the activity diagram of the anaemia detection system.
The figure represents five activities that take place sequentially. Activity
diagrams illustrate the flow of control in a system and refer to the steps
involved in the execution of a use case, modelling both sequential and
concurrent activities; in essence, they depict workflows visually,
focusing on the conditions of flow and the sequence in which activities
happen.
3.1.5 Deployment Diagram for Anaemia Detection

The deployment diagram is used to show the physical deployment of
the system. Instead of showing the flow of messages, it depicts the
run-time architecture: the hardware nodes, the software artifacts
deployed on them, and the communication paths between them.

Figure 3.5: Deployment diagram for anaemia detection

Figure 3.5 depicts the deployment diagram of anaemia detection, which
was created by identifying the structural elements required to carry out
the functionality of the system. The deployment diagram portrays the
roles, functionality and behaviour of the individual nodes as well as the
overall operation of the system at run time. Because of this format,
deployment diagrams tend to be better suited for analysis activities.

CHAPTER 4

SYSTEM ARCHITECTURE
Anaemia detection using hematological data is a medical
application that aims to identify and diagnose anaemia based on the
analysis of Hemoglobin levels. Anaemia is a condition characterized by
a deficiency of red blood cells or haemoglobin in the blood, leading to
reduced oxygen-carrying capacity and potential health issues. In this

chapter, the System Architecture for detecting clinical signs of anaemia
is shown.

Figure 4.1: System architecture diagram of anaemia detection system

Figure 4.1 depicts the flow of the anaemia detection process. For
anaemia to be detected, the dataset goes through several phases or
steps.

CHAPTER 5

SYSTEM IMPLEMENTATION

In this chapter, the System Implementation of detecting clinical


signs of anaemia using machine learning is explained in detail.

5.1 Module Description
The objective is to develop a Python-based machine learning
system for classifying and detecting clinical signs of anemia using
hematological data. The system employs various supervised learning
models to classify instances as anaemic or non-anaemic based on their
attributes. A comparative study of model performance is conducted to
identify the most effective approach. Finally, the model's output is
displayed through a live stream web application, providing real-time
insights into the classification results to aid in clinical decision-making
and patient management.
5.2 Modules
In addition to providing a systematic and organized approach to
our problem-solving process, these modules facilitate efficient
collaboration among team members by delineating clear responsibilities
and workflows. By breaking down the development process into
modular components, we can easily troubleshoot and iterate on specific
aspects of the system without disrupting the overall workflow. This
structured approach also enhances reproducibility and scalability,
allowing for seamless integration of new features or improvements as
the project evolves. The web application seamlessly updates with the
latest data, offering users a dynamic and interactive experience.

5.2.1 Exploratory Data Analysis


The exploratory data analysis (EDA) conducted on the dataset
involved several steps to understand the data's structure, distribution,
and relationships. Initially, the data was examined using `df.shape`,
`df.head()`, and `df.info()` to understand its dimensions, column names,
data types, and the presence of any missing values. The dataset contains
1421 records with 6 columns, all of which are non-null, suggesting no
missing values. Summary statistics such as mean, standard deviation,
minimum, maximum, and quartile values were computed using
`df.describe()` to understand the central tendency, dispersion, and shape
of the dataset distribution. The data was re-checked for missing values
using `df.isnull().values.sum()`, and none were found, indicating a
clean dataset. Additionally, the data types of the columns were checked
using the `df.dtypes` attribute, and column renaming was performed for
visualization purposes. The distribution of the target variable ('Result')
was examined using pie charts and count plots to identify any class
imbalances.
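A minimal sketch of these inspection steps, assuming the dataset is stored in a CSV file named anemia.csv (the file name is an assumption; the column name 'Result' follows the description above):

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("anemia.csv")               # hypothetical file name

print(df.shape)                              # (1421, 6) per the text
df.info()                                    # column names, dtypes, non-null counts
print(df.describe())                         # central tendency and dispersion
print(df.isnull().values.sum())              # 0 -> no missing values
print(df.dtypes)                             # data type of each column

# Class balance of the target variable 'Result'
print(df["Result"].value_counts(normalize=True))
df["Result"].value_counts().plot.pie(autopct="%.1f%%")
plt.show()
```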

It was observed that the dataset is imbalanced, with non-anemic
cases comprising 56.37% and anemic cases comprising 43.63% of the
dataset. The distribution of anemia across genders was explored using
count plots and bar plots. It was observed that females had a higher
mean anemia rate (56%) compared to males (31%), despite the female
population being 4.2% larger than the male population. Various features
such as Hemoglobin, MCH, MCHC, and MCV were visualized using
histograms, violin plots, and distribution plots to understand their
distributions and relationships with anemia. For instance, Hemoglobin
levels were visualized by gender to identify differences between anemic
and non-anemic individuals. Statistical measures such as skewness and
kurtosis were computed to understand the distribution shapes of
features like Hemoglobin. Furthermore, tables summarizing key metrics
such as highest, lowest, and average values of Hemoglobin, MCH,
MCHC, and MCV were created for comprehensive analysis.

Furthermore, the EDA process allowed us to identify potential
outliers or anomalies in the dataset, which could influence model
performance if left unaddressed. Techniques such as box plots and
scatter plots were utilized to visually inspect the distribution of feature
values and detect any unusual patterns. Additionally, correlation
analysis was conducted to assess the strength and direction of
relationships between variables, providing insights into potential
multicollinearity issues that may affect model interpretability.
Moreover, the EDA revealed intriguing trends, such as variations in
anemia prevalence across different age groups or geographical regions,
which could be further explored in subsequent analyses. Additionally,
the examination of temporal trends in anemia rates over time may
provide valuable insights into potential risk factors or interventions.
The insights gained from the EDA process informed decisions regarding
feature selection and engineering strategies, ensuring that only the most
relevant and informative variables are included in the final predictive
model. By thoroughly understanding the dataset's characteristics, we
can develop a robust and reliable machine-learning system for
classifying and detecting the clinical signs of anemia.
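Continuing from the snippet above, the distribution-shape checks and outlier inspection described here can be sketched as follows (the seaborn library is assumed to be available):

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Distribution shape of the Hemoglobin feature
print("skewness:", df["Hemoglobin"].skew())
print("kurtosis:", df["Hemoglobin"].kurt())

# Box plot to flag outliers and compare anemic vs non-anemic distributions
sns.boxplot(x="Result", y="Hemoglobin", data=df)
plt.show()

# Correlation matrix for a first look at multicollinearity
print(df.corr(numeric_only=True))
```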

5.2.2 Statistical Test Module

The statistical tests conducted in the module provide valuable


insights into the relationship between gender and hemoglobin levels, as
well as gender and anemia status. Here's a summary of the findings.

T-test for hemoglobin levels by gender: the test reveals that, while
there is a slight difference favoring females, it is not statistically
significant enough to reject the null hypothesis of equal mean
hemoglobin levels between males and females. However, the
application of a logarithm
transformation to the hemoglobin data addresses skewness and ensures
the validity of the t-test assumptions. By transforming the data, we
mitigate the influence of extreme values and achieve a more symmetric
distribution, thereby enhancing the reliability of the statistical test
results. This approach allows us to accurately assess the gender-based
differences in hemoglobin levels, providing valuable insights into
potential disparities in health outcomes between male and female
populations. Despite the lack of statistical significance, this analysis
underscores the importance of considering gender-specific factors in
the assessment and management of anemia and related health
conditions.
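A sketch of this test, reusing `df` from the earlier snippets; the 0/1 gender coding shown is an assumption:

```python
import numpy as np
from scipy import stats

# Log-transform with a small constant, as described, before the t-test
log_hb = np.log(df["Hemoglobin"] + 0.01)
female = log_hb[df["Gender"] == 1]     # gender coding is an assumption
male = log_hb[df["Gender"] == 0]

t_stat, p_value = stats.ttest_ind(female, male, equal_var=False)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")   # p > 0.05 -> fail to reject H0
```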
Odds ratio for anemia by gender: the calculated odds ratio of 2.86
signifies a notable difference in the likelihood of being anemic
between genders. Specifically, females exhibit 2.86 times higher odds
of being anemic compared to males, indicating a significant association
between gender and the risk of anemia. This finding underscores the
importance of considering gender-specific factors in the assessment and
management of anemia-related health conditions. The observed
disparity in anemia prevalence between males and females highlights
potential underlying physiological or socio-economic factors that may
contribute to differential health outcomes. Understanding these gender-
based differences is crucial for developing targeted interventions and
healthcare policies aimed at reducing the burden of anemia, particularly

among populations at higher risk, such as females. Overall, the odds
ratio analysis provides valuable insights into the gender-related
disparities in anemia prevalence.
Chi-square test for gender and anemia status: the results demonstrate a clear
and statistically significant association between these two variables.
With a chi-square statistic of 90.06 and a p-value less than 0.001, there
is robust evidence to reject the null hypothesis of independence. This
implies that gender and anemia status are dependent variables,
indicating a relationship between being female and having anemia. The
findings suggest that gender plays a significant role in determining the
likelihood of an individual experiencing anemia. This association
underscores the importance of considering gender-specific factors in
the assessment, diagnosis, and management of anemia-related health
conditions. Furthermore, these results provide valuable insights into the
demographic patterns of anemia prevalence and highlight the need for
targeted interventions to address gender-based disparities in healthcare.
Understanding the relationship between gender and anemia status is
crucial for developing effective public health strategies aimed at
reducing the burden of anemia and improving overall health equity. In
conclusion, the analyses suggest that while there isn't a significant
difference in mean hemoglobin levels between genders, there is a
notable association between gender and the likelihood of being anemic.
Specifically, females are more likely to be anemic compared to males.
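Both association measures can be computed from a 2×2 contingency table, as in the sketch below; the cell indexing depends on how gender and the result are coded, so it is an assumption here:

```python
import pandas as pd
from scipy.stats import chi2_contingency

# 2x2 contingency table of gender vs anemia status
table = pd.crosstab(df["Gender"], df["Result"])
chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, p = {p:.4g}")     # text reports chi2 ~ 90.06, p < 0.001

# Odds ratio from the 2x2 table (cell order depends on the coding used)
a, b = table.iloc[0, 1], table.iloc[0, 0]    # anemic / non-anemic, first group
c, d = table.iloc[1, 1], table.iloc[1, 0]    # anemic / non-anemic, second group
print("odds ratio:", (a / b) / (c / d))      # text reports ~ 2.86
```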

5.2.3 Feature Selection

Feature selection is a critical step in machine learning model


development aimed at enhancing performance and interpretability by

identifying and retaining only the most relevant features while
discarding the rest. This process not only reduces computational
complexity but also mitigates overfitting and improves model
generalization. In the supervised learning context, feature selection
methods can be broadly categorized into three types: wrapper, filter,
and intrinsic. The study incorporated correlation analysis, SelectKBest,
and Extra Tree Classifier methods to select the most informative
features for predicting anemia status.
Correlation analysis, specifically Pearson correlation
coefficient, was utilized to explore the linear relationship between each
feature and the target variable, i.e., anemia status. The correlation
matrix revealed the strength and direction of association between
features and the target variable. For instance, a positive correlation
indicates that an increase in one variable corresponds to an increase in
the other, while a negative correlation suggests the opposite. In this
study, hemoglobin exhibited a strong negative correlation (Pearson
correlation coefficient of -0.8) with anemia status, indicating that lower
hemoglobin levels are associated with a higher likelihood of anemia.
Conversely, gender showed a weak positive correlation (Pearson
correlation coefficient of 0.25) with anemia status, suggesting a slight
gender-related difference in anemia prevalence.
The correlation matrix visualization, in the form of a heatmap,
provided a comprehensive overview of the relationships between all
features and the target variable. This visualization facilitated the
identification of features with the highest correlation coefficients,
thereby guiding the subsequent feature selection process.
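The heatmap itself can be produced in a few lines, as in this sketch (reusing `df` from the earlier snippets):

```python
import seaborn as sns
import matplotlib.pyplot as plt

corr = df.corr(numeric_only=True)
sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
plt.title("Pearson correlation between features and anemia status")
plt.show()
```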
To corroborate the findings from correlation analysis and further refine
feature selection, the study employed the SelectKBest method, a
statistical technique for univariate feature selection. SelectKBest
evaluates the significance of each feature individually by applying a
scoring function, in this case, the chi-squared test statistic. This test
measures the dependence between each feature and the target variable,
assessing whether the occurrences of a specific feature and a specific
class are independent based on their frequency distribution. The
SelectKBest method iteratively selects the top k features with the
highest test scores, where k is a predefined parameter. In this study,
different values of k were evaluated (2, 3, 4, and 5), and the optimal
value was determined based on the total score obtained from the
selected features. The feature scores were computed, and the top
features were identified based on their respective scores. For instance,
the best value of k was found to be 2, with a total score of 307.02,
indicating that the two selected features collectively contribute the most
predictive power for determining anemia status.
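A sketch of this scoring loop over the candidate values of k, using the assumed feature and target column names from the earlier snippets:

```python
from sklearn.feature_selection import SelectKBest, chi2

# chi2 requires non-negative feature values, which these attributes satisfy
X = df[["Gender", "Hemoglobin", "MCH", "MCHC", "MCV"]]
y = df["Result"]

for k in (2, 3, 4, 5):
    selector = SelectKBest(score_func=chi2, k=k).fit(X, y)
    chosen = X.columns[selector.get_support()]
    total = selector.scores_[selector.get_support()].sum()
    print(k, list(chosen), round(total, 2))   # text reports best k=2, total ~ 307.02
```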
The feature selection process culminated in the identification of
the most informative features for predicting anemia status, namely
hemoglobin and gender. These features were selected based on their
strong associations with the target variable, as evidenced by their high
correlation coefficients and significant chi-squared test scores. The
inclusion of these features in the predictive model is expected to
enhance its performance and interpretability by focusing on the most
relevant predictors while disregarding less influential ones.

5.2.4 Data Preprocessing


To ensure that the features do not disproportionately influence the
model's performance, three scaling techniques were applied: log scaling,
standardization, and normalization.

Log scaling, a commonly used data transformation technique,
plays a crucial role in preprocessing skewed data distributions, such as
those commonly encountered in real-world datasets. In our case, we
applied logarithmic transformation to the 'Haemoglobin' feature to
address its left-skewed distribution and mitigate the potential impact of
extreme values. The left-skewed distribution of the 'Haemoglobin'
feature indicates that the majority of data points are concentrated on the
higher end of the scale, with fewer observations at lower values. Such
distributions can lead to challenges in model training and interpretation,
as they may violate the assumptions of linear models and affect the
performance of certain machine learning algorithms. To address this
issue, we employed a logarithmic transformation, which involves
taking the logarithm of the feature values. However, taking the
logarithm of zero or negative values can result in undefined or complex
numbers.
Therefore, to avoid such errors, we added a small constant of 0.01 to
the feature values before applying the logarithmic transformation. The
logarithmic transformation has several beneficial effects on the data.
Compression of range: the logarithm compresses the range of the data,
particularly for large values, thereby reducing the variability and
making the distribution more symmetric. This can help stabilize the
variance of the feature, making it more amenable to analysis and
modeling. Down-weighting extreme values: extreme values in the
original feature distribution often exert a disproportionate influence on
the model's behavior, potentially leading to biased estimates or
overfitting. Normalization of skewed distributions: logarithmic
transformation can help normalize skewed distributions, making them
more symmetric and closer to a normal distribution. This is particularly
advantageous for certain statistical techniques and machine learning
algorithms that assume the data to be normally distributed or exhibit
homoscedasticity. Overall, log scaling is a powerful technique for
preprocessing skewed data distributions, such as those encountered in
the 'Haemoglobin' feature. By addressing the skewness and mitigating
the impact of extreme values, logarithmic transformation contributes to
improved model performance, interpretability, and robustness in
predictive modeling tasks.

Standardization, a fundamental preprocessing technique in machine
learning, plays a pivotal role in ensuring that features are appropriately
scaled and comparable across different variables. Achieved through the
StandardScaler from the sklearn.preprocessing module, standardization
involves centering the feature values around the mean and scaling them
to have a standard deviation of one. This process mitigates the risk of
any single feature disproportionately influencing the model's objective
function, thus promoting fair and balanced model training. While
standardization may not significantly
impact the performance of tree-based algorithms such as decision trees,
random forests, and boosted trees due to their inherent feature scaling
invariance, it still offers various benefits beyond mere model
optimization. By centering feature values to have a mean of zero,
standardization ensures that the standardized feature distribution aligns
with the origin of the coordinate system, facilitating clearer
interpretations of model coefficients in linear models. This alignment
enables coefficients to represent the change in the target variable per

standard deviation change in the feature, simplifying the understanding
of feature contributions to the model's predictions.
Furthermore, standardization aids in the identification of
influential features by enhancing the interpretability of model outputs.
Features with larger standardized coefficients are considered more
influential in determining the target variable, providing valuable
insights into the relative importance of different features in driving
model predictions. This feature interpretability is particularly beneficial
in scenarios where understanding the underlying factors contributing to
model decisions is crucial for informed decision-making and problem-
solving.
Despite its potential limited impact on certain algorithms,
standardization remains a cornerstone preprocessing technique in
machine learning, offering not only improved model performance but
also enhanced interpretability and insights into feature importance. By
standardizing feature values, machine learning models can effectively
learn from data and make accurate predictions, paving the way for more
robust and reliable solutions in various application domains.
Normalization, also known as Min-Max scaling, is a
preprocessing technique aimed at rescaling feature values to a range
between 0 and 1, effectively compressing the data into a standardized
interval. Leveraging the MinMaxScaler from the sklearn.preprocessing
module, normalization ensures that all features are uniformly
distributed within the specified range, regardless of their initial
distribution. This transformation proves particularly beneficial when
feature values exhibit varying scales or non-normal distributions, as it
helps maintain the relative differences in the range of values across
different features. While normalization shares similarities with
standardization, it serves a distinct purpose by preserving the inherent
structure and relationships within the data without altering their
distribution shape. Despite its effectiveness, normalization may not be
essential for tree-based algorithms, such as decision trees, random
forests, and boosted trees, as these models are inherently robust to
variations in feature scales.
However, it remains a valuable preprocessing step in scenarios
where maintaining consistent feature ranges is critical for model
convergence and interpretability.
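A corresponding sketch for Min-Max scaling (again assuming a
feature matrix X):

    from sklearn.preprocessing import MinMaxScaler

    scaler = MinMaxScaler()
    # Rescale every feature to the [0, 1] interval.
    X_normalized = scaler.fit_transform(X)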
Following normalization, feature engineering techniques were
employed to enhance the understanding of the dataset's characteristics
and relationships with the target variable, 'Result.' Visualization
played a crucial role in this process, with box
plots serving as a powerful tool for comparing the distribution of
feature values across different outcomes of the target variable.
Specifically, four different versions of the 'Hemoglobin' feature –
original, log-scaled, standardized, and normalized – were plotted
against the 'Result' variable. This comparative analysis provided
insights into how each scaling method influenced the distribution of
feature values and their respective relationships with the target variable.
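This comparison can be sketched as follows (the derived column
names are assumptions that follow the transformations above):

    import matplotlib.pyplot as plt
    import seaborn as sns

    # One box plot per version of the 'Hemoglobin' feature.
    versions = ['Hemoglobin', 'Hemoglobin_log',
                'Hemoglobin_std', 'Hemoglobin_norm']
    fig, axes = plt.subplots(1, 4, figsize=(16, 4))
    for ax, col in zip(axes, versions):
        sns.boxplot(x='Result', y=col, data=df, ax=ax)
        ax.set_title(col)
    plt.tight_layout()
    plt.show()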
Finally, the preprocessed data underwent partitioning into training
and testing datasets to facilitate model training and evaluation. Utilizing
the train_test_split function from the sklearn.model_selection module,
the data was randomly divided into training (70%) and testing (30%)
sets. The resulting 'X_train' and 'X_test' datasets contained the predictor
variables, while the 'y_train' and 'y_test' datasets contained the
corresponding target variable values. This division ensured that the
model was trained on a sufficiently large portion of the data while
retaining a separate portion for unbiased evaluation. By adhering to best
practices in data splitting, we mitigate the risk of overfitting and ensure
the generalizability of the trained model to unseen data.
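The split itself is a single call (random_state is an assumption
added here for reproducibility):

    from sklearn.model_selection import train_test_split

    # 70% of the rows for training, 30% held out for evaluation.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.30, random_state=42)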
5.2.5 Class Imbalance and Data Leakage Handling
The analysis begins with an exploration of the dataset,
identifying the target variable "Result" as the focal point. By examining
the distribution of classes within this variable, it's evident that one class
significantly outnumbers the other. This scenario is known as class
imbalance, which can adversely affect the performance of machine
learning models, particularly in classification tasks. Class imbalance
introduces challenges such as biased model predictions, where the
model tends to favor the majority class due to its prevalence in the
dataset. This can lead to misclassification of minority class instances,
which are often of greater interest in real-world applications (e.g.,
detecting rare diseases, fraud detection). To mitigate the effects of class
imbalance, several sampling techniques are employed:
Random oversampling involves duplicating examples from the minority
class to balance the class distribution. However, it may lead to
overfitting for some models due to the replication of minority class
instances.
SMOTE (Synthetic Minority OverSampling Technique) generates
synthetic data points for the minority class, addressing class imbalance
without replicating existing instances.
ADASYN (Adaptive Synthetic Sampling Method for Imbalanced Data) is
similar to SMOTE but focuses on generating more samples for
difficult-to-learn instances.
Implementation and Evaluation:
Each sampling technique is implemented using appropriate libraries
such as imblearn.over_sampling and imblearn.under_sampling. The
impact of each technique on model performance is evaluated using
metrics like accuracy, precision, recall, F1 score, and AUC-ROC curve.
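A minimal sketch of applying these samplers to the training split
(variable names carry over from the split above):

    from imblearn.over_sampling import RandomOverSampler, SMOTE, ADASYN

    # Each sampler rebalances the training data only.
    X_ros, y_ros = RandomOverSampler(random_state=42).fit_resample(X_train, y_train)
    X_sm, y_sm = SMOTE(random_state=42).fit_resample(X_train, y_train)
    X_ada, y_ada = ADASYN(random_state=42).fit_resample(X_train, y_train)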

Data Leakage
Data leakage is a critical issue in machine learning, wherein
information from outside the training dataset contaminates the training
process, resulting in overly optimistic performance evaluations. This
can transpire through inadvertent inclusion of test data during training
or by incorporating external data during model creation. Such leakage
compromises the model's generalization ability, as it learns patterns
based on information not representative of real-world scenarios.
Mitigating data leakage requires stringent separation of training and
testing datasets, as well as vigilant scrutiny of any external data sources
introduced during model development to ensure unbiased performance
evaluation and reliable predictive outcomes.
Techniques for Handling Data Leakage
Undersampling and oversampling techniques are indispensable
strategies for addressing class imbalance in machine learning datasets.
To prevent data leakage and ensure the model's integrity, these
sampling techniques are exclusively applied to the training data. By
doing so, the model is trained solely on data it has not previously
encountered, safeguarding the independence and integrity of the test
set. When applying various sampling techniques, such as random
undersampling or oversampling, logistic regression models are trained
and evaluated using the test set. This rigorous evaluation process
ensures that the model's performance is accurately assessed on unseen
data, reflecting its real-world generalization capability.
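One way to enforce this separation is an imblearn pipeline, which
confines resampling to the data passed to fit(); the sketch below
rests on the variable names assumed earlier:

    from imblearn.pipeline import make_pipeline as imbalanced_make_pipeline
    from imblearn.over_sampling import SMOTE
    from sklearn.linear_model import LogisticRegression

    # SMOTE runs only during fit(), i.e. on the training data;
    # the test set is scored without any resampling.
    model = imbalanced_make_pipeline(SMOTE(random_state=42),
                                     LogisticRegression(max_iter=1000))
    model.fit(X_train, y_train)
    print(model.score(X_test, y_test))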
In our experiments, all logistic regression models consistently
demonstrated high accuracy (>99%) and AUC (>99%), indicating
excellent overall performance in distinguishing between classes. These
metrics serve as reliable indicators of model quality and effectiveness
in classification tasks. Moreover, precision, recall, and F1 scores
consistently exhibit robust performance across different sampling
techniques. Precision measures the proportion of true positive
predictions among all positive predictions, recall assesses the
proportion of true positives correctly identified by the model, and F1
score provides a balanced assessment of precision and recall. The
consistent high values of these metrics signify the model's ability to
accurately classify instances from both classes, regardless of the
sampling technique employed. The kappa statistic, which measures the
agreement between predicted and actual class labels while accounting
for chance agreement, further reinforces the reliability of the logistic
regression models. The high kappa values obtained across different
sampling techniques affirm the model's robustness and consistency in
classification tasks.
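These metrics can be computed as sketched below (model, X_test and
y_test follow the assumptions above):

    from sklearn.metrics import (accuracy_score, precision_score,
                                 recall_score, f1_score,
                                 cohen_kappa_score)

    y_pred = model.predict(X_test)
    print('Accuracy :', accuracy_score(y_test, y_pred))
    print('Precision:', precision_score(y_test, y_pred))
    print('Recall   :', recall_score(y_test, y_pred))
    print('F1 score :', f1_score(y_test, y_pred))
    print('Kappa    :', cohen_kappa_score(y_test, y_pred))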
In summary, the application of various sampling techniques in
logistic regression modeling, coupled with rigorous evaluation using
performance metrics, provides valuable insights into the model's
performance and generalization capability. The consistently high
performance metrics underscore the effectiveness of logistic regression
in addressing class imbalance and achieving accurate classification
outcomes in diverse real-world scenarios. Visualization techniques such
as ROC curves, precision-recall curves, and confusion matrices provide
additional insights into model performance and class distribution. In
conclusion, the implemented sampling techniques effectively address
class imbalance and data leakage issues in the machine learning
pipeline. By evaluating model performance metrics and visualizations,
we can confidently select the most suitable sampling technique for our
dataset, ensuring accurate and reliable predictions in real-world
applications.

5.2.6 Algorithm Implementation Module


In the evaluation of different machine learning models across
various sampling techniques, we observed distinct performance
characteristics for each classifier.
Logistic Regression (LR)
On the original imbalanced dataset, LR achieved an accuracy of 94.1%
and an F1 score of 0.931 on the test set. With undersampling,
performance was similar to that on the original imbalanced dataset.
With oversampling, the model achieved an accuracy of 93.9% and an F1
score of 0.930 on the test set, and SMOTE matched the oversampling
results. With ADASYN, LR achieved an accuracy of 94.4% and an F1
score of 0.936 on the test set.
Decision Tree (DT)
DT models achieved perfect performance (accuracy, precision,
recall, F1 score, and AUC) across all sampling techniques. This
suggests potential overfitting or unrealistic results.
Random Forest (RF)
RF models also achieved perfect performance across all sampling
techniques, indicating potential overfitting.
K-Nearest Neighbors (KNN)
KNN models achieved high performance across all sampling
techniques, with F1 scores ranging from 0.965 to 0.975.
Support Vector Machines (SVM)
SVM models achieved high performance across all sampling
techniques, with F1 scores ranging from 0.967 to 0.978.
Gaussian Naive Bayes (NB)
NB models achieved good performance across all sampling
techniques, with F1 scores ranging from 0.939 to 0.957.

Summary of Performance Measures


The analysis of various classifiers across different sampling
techniques reveals insightful patterns regarding their performance and
behavior in handling imbalanced datasets. Notably, Decision Tree (DT)
and Random Forest (RF) models exhibit perfect performance across all
sampling techniques, which raises concerns about potential overfitting
or unrealistic results. This suggests that these models may have
memorized the training data, leading to overly optimistic performance
estimates. In contrast, K-Nearest Neighbors (KNN), Support Vector
Machines (SVM), and Gaussian Naive Bayes (NB) models demonstrate
high to good performance across diverse sampling strategies, indicating
their robustness in handling imbalanced datasets. These classifiers
exhibit consistent performance, with F1 scores ranging from 0.939 to
0.978, showcasing their effectiveness in various scenarios. Furthermore,
Logistic Regression (LR) models display consistent but slightly lower
performance compared to other classifiers, with F1 scores ranging from
0.930 to 0.936 across different sampling techniques. While LR's
performance varies depending on the sampling method, it still
demonstrates competitive results, albeit not as robust as KNN, SVM, or
NB. It's crucial to interpret these findings cautiously and consider the
trade-offs between performance metrics, model complexity, and
computational resources when selecting the most suitable classifier for
a specific application. While DT and RF may exhibit perfect
performance, indicating potential overfitting, KNN, SVM, and NB
consistently demonstrate strong performance, suggesting their
reliability in handling imbalanced datasets. Additionally, further
analysis, such as model interpretation and validation on independent
datasets, is necessary to ensure the chosen classifier's reliability and
generalization capability.
5.2.7 Hyper-parameter Tuning and Cross-validation
Hyperparameter tuning stands as a pivotal phase in the machine
learning model development process, aiming to enhance model
performance by meticulously selecting the optimal combination of
hyperparameters. Employing the GridSearchCV technique within our
module enabled us to systematically explore a predefined grid of
hyperparameters, discerning the configuration that maximizes the
specified scoring metric for each classifier. This exhaustive search
approach ensures comprehensive coverage of the hyperparameter
space, facilitating the identification of the most effective parameter
combination.
Furthermore, our implementation incorporated a robust 5-fold
cross-validation strategy during the search process. This technique
partitions the data into five subsets, iteratively utilizing four subsets for
training and one for validation. By repeating this process across
different subsets, we ensure thorough evaluation of each parameter
combination's performance, guarding against potential overfitting and
enhancing the model's generalization capability.
In our analysis, we evaluated a diverse range of classifiers,
including Decision Tree, Random Forest, Support Vector Machine
(SVM), Gaussian Naive Bayes, Logistic Regression, and K-Nearest
Neighbors (KNN). Each of these classifiers offers unique strengths and
characteristics, necessitating tailored parameter grids specific to their
individual traits and complexities.
By customizing parameter grids for each classifier, we ensure a targeted
exploration of hyperparameter combinations, maximizing the model's
performance potential. This meticulous approach allows us to navigate
the extensive hyperparameter space with precision, honing in on
configurations that yield superior predictive capabilities. Through
systematic fine-tuning, we optimize the model's performance,
enhancing its ability to generalize and make accurate predictions across
diverse datasets. By leveraging this methodical strategy, we extract the
full potential of each machine learning algorithm, unlocking optimal
performance and achieving superior results in real-world applications.
This tailored approach not only boosts model performance but also
fosters a deeper understanding of the intricate relationships between
hyperparameters and model outcomes. Consequently, our systematic
fine-tuning process empowers us to harness the full predictive power of
machine learning algorithms, ensuring robust and reliable performance
across a wide range of applications.
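A condensed sketch of the search for one classifier (the grid values
shown are illustrative assumptions; in practice each classifier
received its own tailored grid):

    from sklearn.model_selection import GridSearchCV
    from sklearn.ensemble import RandomForestClassifier

    param_grid = {'n_estimators': [100, 200],
                  'max_depth': [None, 5, 10]}
    # Exhaustive search over the grid with 5-fold cross-validation.
    search = GridSearchCV(RandomForestClassifier(random_state=42),
                          param_grid, cv=5, scoring='f1')
    search.fit(X_train, y_train)
    print(search.best_params_, search.best_score_)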
The results are tabulated as follows:

TABLE: 5.2.7.1 RESULTS OF HYPERPARAMETER TUNING MODULE

Algorithms             Best score    F1 score
Decision Tree          1.0           1.0
Random Forest          0.99          1.0
Support Vector         0.99          0.99
Naïve Bayes            0.93          0.95
Logistic Regression    0.99          1.0
K-NN Classifier        0.98          0.98

Cross-validation stands as a cornerstone technique for assessing
the generalization performance of machine learning models, ensuring
reliable evaluations across diverse datasets. Employing 5-fold cross-
validation in our module, we partitioned the dataset into five subsets of
equal size, each representing approximately 20% of the total data. This
methodological approach guarantees that every data point is utilized for
both training and validation, enhancing the robustness of performance
evaluations. By iteratively training the model on four subsets and
evaluating it on the remaining subset, repeated five times, we obtain a
comprehensive assessment of the model's predictive capabilities.
Moreover, 5-fold cross-validation serves as a potent tool for detecting
and mitigating overfitting, as it systematically evaluates the model's
performance on different data subsets. Through this iterative process,
we gain invaluable insights into the model's generalization capability.
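The evaluation can be sketched in a few lines (the classifier choice
is illustrative):

    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    # Five folds; each fold serves once as the validation set.
    scores = cross_val_score(DecisionTreeClassifier(), X_train, y_train,
                             cv=5, scoring='accuracy')
    print('Mean accuracy:', scores.mean(), 'Std:', scores.std())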
TABLE: 5.2.7.2 RESULTS OF THE CROSS-VALIDATION MODULE

ALGORITHMS             MEAN ACCURACY    STANDARD DEVIATION
Decision Tree          1.0              0.0
Random Forest          1.0              0.0
Support Vector         0.91             0.01
Naïve Bayes            0.93             0.01
Logistic Regression    0.98             0.01
KNN Classifier         0.97             0.007

Hyperparameter tuning and cross-validation are essential
components of the model selection process in machine learning. They
provide valuable insights into the performance of different classifiers
and help identify the most suitable model for a given task. In this
context, the Decision Tree Classifier and Random Forest Classifier
have emerged as strong candidates due to their high mean accuracy and
F1 scores. Decision trees are intuitive and easy-to-interpret models that
recursively partition the feature space based on the values of input
features, resulting in a hierarchical structure resembling a tree. Each
node in the tree represents a decision based on a feature, leading to
splits that separate data points into different classes or categories. While
decision trees are prone to overfitting, especially with complex datasets,
they offer transparency and interpretability, allowing stakeholders to
understand the decision-making process of the model.
On the other hand, random forests are ensemble learning methods
that combine multiple decision trees to improve performance and
generalization. Random forests train multiple decision trees on random
subsets of the data and aggregate their predictions through voting or
averaging. By reducing the variance of individual decision trees and
promoting diversity among the ensemble members, random forests
often achieve higher accuracy and robustness compared to single
decision trees. Moreover, they can handle high-dimensional data and
are less susceptible to overfitting, making them a popular choice for
various machine-learning tasks. Despite their promising performance,
selecting the best-performing classifier involves considering additional
factors beyond accuracy and F1 scores.
One crucial aspect is model complexity, which refers to the
number of parameters or features used by the model to represent the
underlying relationships in the data. While complex models may
achieve high accuracy on the training data, they are more prone to
overfitting and may struggle to generalize to unseen data. In contrast,
simpler models are more interpretable and may offer better
generalization performance, especially in scenarios with limited
training data or noisy features. Interpretability is another important
consideration, particularly in domains where model transparency and
explainability are paramount. Decision trees excel in this regard, as
they provide a clear and interpretable representation of the decision-
making process, allowing stakeholders to understand the factors driving
the model's predictions. In contrast, the ensemble nature of random
forests may complicate interpretability, as it involves aggregating
predictions from multiple decision trees.
Furthermore, computational efficiency and scalability should be
taken into account, especially when dealing with large datasets or real-
time applications. While decision trees are computationally efficient
and can handle large volumes of data, random forests may require more
resources due to the training of multiple trees in parallel. Therefore, the
choice of classifier should consider the computational constraints and
requirements of the specific application.
In conclusion, while the Decision Tree Classifier and Random
Forest Classifier showcase robust performance metrics such as
accuracy and F1 scores, identifying the most suitable classifier
demands a thorough examination. Beyond mere performance metrics,
factors such as model complexity, interpretability, and computational
efficiency wield significant influence in the selection process. Decision
trees and random forests, while exhibiting strong predictive power,
often come with higher complexity, potentially hindering
interpretability. Conversely, models like Logistic Regression or
Gaussian Naive Bayes may offer simpler interpretations but may
sacrifice some predictive performance.
CHAPTER 6
RESULTS AND CODING

The implementation of detecting clinical signs of anemia using
machine learning was meticulously crafted, harnessing a strategic blend
of powerful tools and languages to optimize efficiency and performance
across the development lifecycle. Leveraging cutting-edge machine
learning libraries and frameworks, such as TensorFlow or Scikit-learn,
alongside robust programming languages like Python, enabled seamless
integration of advanced algorithms and streamlined development
processes. Additionally, the utilization of cloud computing platforms,
like AWS or Google Cloud, provided scalable infrastructure and
accelerated model training, further enhancing the system's capabilities.
By embracing this comprehensive approach, the implementation
achieved remarkable efficiency and effectiveness in accurately
detecting clinical signs of anemia, paving the way for impactful
advancements in healthcare analytics.
6.1 TOOLS AND LANGUAGES

Python libraries such as NumPy, Pandas, and Scikit-learn are
foundational components of data analysis and machine learning.
NumPy serves as the cornerstone for numerical computing tasks,
providing essential functionality for efficient array manipulation and
mathematical operations. Its core data structure, the ndarray, offers a
versatile platform for representing multidimensional arrays, enabling
seamless integration with mathematical functions and operations.
NumPy's optimized implementations ensure high performance, making
it ideal for processing large datasets and performing complex
computations with ease. Moreover, NumPy's support for vectorized
operations allows for concise and efficient code execution, enhancing
productivity in data analysis and numerical computing tasks.
Pandas complements NumPy's numerical capabilities with its
intuitive data manipulation tools, primarily centered around the
DataFrame data structure. The DataFrame provides a tabular
representation of data, akin to a spreadsheet, allowing for seamless
organization, exploration, and manipulation of structured data. Pandas
simplifies common data preprocessing tasks such as data cleaning,
transformation, and aggregation, enabling analysts to efficiently handle
real-world datasets. Its rich set of functionalities, including data
alignment, indexing, and grouping, empowers users to perform
complex data manipulations with ease. Additionally, Pandas seamlessly
integrates with other Python libraries, facilitating interoperability
within the data science ecosystem.
Scikit-learn emerges as a comprehensive machine learning library,
encompassing a vast array of algorithms and utilities for various tasks
ranging from classification and regression to clustering and
dimensionality reduction. With its user-friendly API and extensive
documentation, Scikit-learn democratizes machine learning by making
complex algorithms accessible to both novice and experienced
practitioners. The library offers a modular architecture, allowing users
to seamlessly interchange algorithms and customize workflows to suit
specific requirements. Furthermore, Scikit-learn provides robust tools
for model evaluation, hyperparameter tuning, and cross-validation,
enabling rigorous experimentation and optimization of machine
learning models.

6.2 SAMPLE CODE

# Print system version
!jupyter --version
import sys
print("Python version:", sys.version)

import pandas as pd              # data manipulation and analysis
import collections               # OrderedDict, defaultdict, Counter, etc.
import numpy as np               # scientific computing with Python
import matplotlib.pyplot as plt  # data visualization
%matplotlib inline
import seaborn as sns            # advanced visualization

# Classifier libraries
from sklearn.linear_model import LogisticRegression  # logistic regression
from sklearn.tree import DecisionTreeClassifier      # decision tree
from sklearn.ensemble import RandomForestClassifier  # random forest
from sklearn.svm import SVC                          # Support Vector Machine (SVM)
from sklearn.naive_bayes import GaussianNB           # Gaussian Naive Bayes
from sklearn.neighbors import KNeighborsClassifier   # K-Nearest Neighbors (KNN)

# Statistical testing
from scipy.stats import ttest_ind         # t-test for two independent samples
import statsmodels.api as sm              # statistical models and tests
from scipy.stats import chi2_contingency  # chi-square test for a contingency table
import scipy.stats as stats               # skewness and other statistics

# Other libraries
from sklearn.model_selection import train_test_split  # train/test splitting
from sklearn.pipeline import make_pipeline            # pipeline of transforms with a final estimator
from imblearn.pipeline import make_pipeline as imbalanced_make_pipeline  # pipeline for imbalanced datasets
from imblearn.over_sampling import SMOTE      # Synthetic Minority Over-sampling Technique (SMOTE)
from imblearn.under_sampling import NearMiss  # NearMiss undersampling
from imblearn.metrics import classification_report_imbalanced  # report for imbalanced datasets
from sklearn.metrics import precision_score, recall_score, f1_score, \
    roc_auc_score, accuracy_score, classification_report  # performance metrics
from collections import Counter  # frequency of elements in a list
from sklearn.model_selection import KFold, StratifiedKFold  # k-fold cross-validation
from sklearn.model_selection import cross_val_score  # cross-validated model evaluation
from sklearn.metrics import cohen_kappa_score  # Cohen's kappa for inter-rater agreement

import warnings
warnings.filterwarnings("ignore")

pd.set_option('display.max_columns', 5000)

# df, df_copy and custom_colors are defined earlier in the notebook.
df.info()
df.columns

# Pie chart of the anaemia result distribution
result_counts = df_copy['Result'].value_counts()
plt.pie(result_counts, labels=result_counts.index, autopct='%1.1f%%',
        colors=custom_colors, shadow=True)
plt.title('Distribution of Anaemia Result')
plt.show()

# Count plot of the anaemia result
ax = sns.countplot('Result', data=df_copy, palette=custom_colors)
plt.title('Count of Anaemia Result')
# Add labels to the bars
for p in ax.patches:
    ax.text(p.get_x() + p.get_width() / 2, p.get_height(),
            '{:d}'.format(p.get_height()), ha='center')
# Remove spines
sns.despine(left=True, bottom=True)
plt.show()

male_data = df_copy[df_copy['Gender'] == 'Male']
female_data = df_copy[df_copy['Gender'] == 'Female']

# Plot horizontal violin plot using Seaborn
sns.violinplot(x='Haemoglobin', y='Gender', hue='Result', data=df_copy,
               palette=custom_colors, inner='quartile', scale='width', cut=0)

# Add mean and median lines
for i, group in enumerate([male_data, female_data]):
    median = group['Haemoglobin'].median()
    mean = group['Haemoglobin'].mean()
    plt.axhline(y=i, xmin=0.05, xmax=0.48, color='black', linewidth=2)
    plt.text(0.51, i + 0.1, f'Median: {median:.2f}', ha='left', va='center')
    plt.text(0.51, i - 0.1, f'Mean: {mean:.2f}', ha='left', va='center')

# Add IQR whiskers
q1_male, q3_male = male_data['Haemoglobin'].quantile([0.25, 0.75])
q1_female, q3_female = female_data['Haemoglobin'].quantile([0.25, 0.75])
plt.axhline(y=0, xmin=0.25, xmax=0.75, color='black', linewidth=2)
plt.axhline(y=1, xmin=0.25, xmax=0.75, color='black', linewidth=2)
plt.plot([q1_male, q1_male], [-0.2, 0.2], color='black', linewidth=2)
plt.plot([q3_male, q3_male], [-0.2, 0.2], color='black', linewidth=2)
plt.plot([q1_female, q1_female], [0.8, 1.2], color='black', linewidth=2)
plt.plot([q3_female, q3_female], [0.8, 1.2], color='black', linewidth=2)
plt.text((q1_male + q3_male) / 2, -0.3, f'IQR: {q3_male - q1_male:.2f}',
         ha='center', va='center')
plt.text((q1_female + q3_female) / 2, 1.3, f'IQR: {q3_female - q1_female:.2f}',
         ha='center', va='center')
plt.title('Distribution of Haemoglobin Levels by Gender')
plt.xlabel('Haemoglobin Level')
plt.ylabel('Gender')
plt.show()

6.3 SAMPLE SCREENSHOTS

The screenshots and the result, presented in a clear format,
provide valuable insights into specific diagnostic scenarios. This
concise walkthrough captures the essence of the application,
highlighting its user-friendly design and advanced machine learning
capabilities for the effective detection of anemia.

Figure 6.1: Distribution of Anemia by Gender
The countplot in Figure 6.1 illustrates the distribution of individuals
with and without anemia across genders. Each bar represents the count
of individuals categorized by their gender (male or female) and their
anemia status (with or without). The bars are annotated with the
respective counts, providing a visual representation of the prevalence of
anemia among different genders in the dataset.
The user-friendly design is highlighted as a key feature, ensuring
that even individuals with varying levels of technical expertise can
navigate the application effortlessly. The intuitive interface contributes
to a seamless user experience, making the diagnostic process more
accessible and efficient.

The careful arrangement of visual elements in the presentation allows
users to easily comprehend and navigate through the diagnostic
process. The advanced machine learning capabilities of the application
are prominently featured, emphasizing its efficiency in detecting
anaemia with a high level of accuracy. This technological prowess sets
the application apart, making it a powerful tool for healthcare
professionals and practitioners.
Figure 6.2: Initial screen of the anemia detection system

The output screen serves as a vital tool for anemia detection,
offering comprehensive insights based on input parameters critical for
assessment. By incorporating data such as gender, hemoglobin value,
and Mean Corpuscular Volume (MCV), this interface facilitates an
accurate evaluation of an individual's likelihood of anemia. Users can
seamlessly navigate through the system to glean essential information
pertinent to their health status or that of others. Users input relevant data
and trigger the analysis process with a simple button press, receiving
concise results detailing the probability of anemia based on the
provided information. Additionally, the system may offer further
guidance or recommendations for next steps in the evaluation process.
Through its intuitive design and informative output, the screen
empowers users to make informed decisions regarding anemia
diagnosis and management.

Figure 6.3: Output screen of anemia detection system

Users navigate through the interface, selecting the gender and
inputting the hemoglobin value and MCV. Upon initiating the analysis
by pressing the "Detect" button, the system meticulously processes the
provided data. The results section succinctly presents the outcome of the
assessment, furnishing users with valuable insights into the likelihood
of anemia based on the input parameters provided.

6.4 RESULTS AND GRAPH

The results and accompanying graph depict the performance
evaluation of various machine learning models on a given dataset. Each
model has been assessed based on key metrics such as accuracy, area
under the curve (AUC), precision, recall, F1-score, and kappa statistics.
The graph visually represents the comparative performance of these
models, offering insights into their effectiveness in classification tasks.
Overall, the analysis highlights the robustness of the models and
provides valuable information for understanding their predictive
capabilities in real-world scenarios.

Figure 6.4: Performance measures of various classifiers

Overall, the models exhibit exceptionally high performance across
all techniques and metrics, with most models achieving perfect scores
(1.000) across all evaluation metrics. This suggests that the models are
performing extremely well on the test dataset. Decision Tree and
Random Forest models consistently perform exceptionally well across
all imbalance handling techniques. Support Vector Machine models also
perform well, with slight variations in performance based on the
imbalance handling technique. K-Nearest Neighbors and Naive Bayes
models show slightly lower but still strong performance compared to
other models. The data imbalance handling techniques, including
undersampling, oversampling, SMOTE, and ADASYN, do not seem to
significantly impact the models' performance on this dataset, as all
techniques yield high scores. In summary, the models demonstrate
robust performance across various techniques, indicating their
effectiveness in classification tasks on the given dataset.

Figure 6.5: Comparison of Model Performance Grid Search

The bar chart illustrates the comparative performance of different
machine learning models based on their accuracy scores obtained
through grid search. Random Forest and Decision Tree models exhibit
the highest accuracy, both achieving a perfect score of 100%.
Following closely is the Support Vector Machine model with an
accuracy of 99.4%. K-Nearest Neighbors and Logistic Regression also
demonstrate strong performance with accuracy scores of 98.8% and
93.5%, respectively. Naive Bayes, while slightly lower, still shows
respectable accuracy at 91.4%.

CHAPTER 7

CONCLUSION AND FUTURE WORKS
Decision Trees (DT) and Random Forest (RF) performed well,
while Support Vector Machine (SVM) with ADASYN outperformed all
class imbalance methods, achieving an AUC of 0.984 and accuracy of
98%. k-Nearest Neighbors (KNN) without balancing also performed
strongly, with an accuracy of 97%. Important features for anaemia
classification include Haemoglobin, Gender, and MCV, with females
Specifically, females were found to be at a higher risk of anaemia
compared to males, with an Odds Ratio for gender of 2.86, having a
higher risk of anaemia. Decision Trees and Random Forests were
chosen as the final model due to their superior performance, achieving
a 100% F1 score on test datasets, indicating their robustness in anaemia
detection. This signifies the robustness and reliability of Decision Trees
and Random Forests in anaemia detection.

Researchers can explore anomaly detection and time series
analysis to address data challenges, identifying outliers and
understanding temporal trends. Integrating domain knowledge enhances
machine learning models, tailoring them to specific domains.
Addressing data quality, bias, and privacy concerns is crucial, mitigated
by techniques like preprocessing and bias detection. Fairness,
transparency, and accountability are prioritized to promote ethical AI
adoption. Additionally, advancements in hardware acceleration and
cloud-based solutions offer scalability for handling large datasets
efficiently, ensuring robust performance in diverse applications. These
advancements pave the way for more impactful and responsible use of
AI technologies across various sectors.

REFERENCES

[1] Akmal Hafeel, H.S.M.H. Fernanado, M. Pravienth, Shashika
    Lokuliyana, N. Kayanthan, and Anuradha Jayakody, “IoT Device
    to Detect Anemia,” 2019 International Conference on
    Advancements in Computing (ICAC), Malabe, Sri Lanka, 2019.
[2] Aparna V, T V Sarath, and K.I. Ramachandran, “Simulation model
    for anemia detection using RBC counting algorithms and
    Watershed transform,” 2017 International Conference on
    Intelligent Computing, Instrumentation and Control Technologies
    (ICICICT), Kerala, India, 2017.
[3] Azwad Tamir, Chowdry Jahan, Mohammed S. Saif, and
    U. Zaman, “Detection of anemia from image of the anterior
    conjunctiva of eye by image processing and thresholding,” 2017
    IEEE Region 10 Humanitarian Technology Conference (R10-
    HTC).
[4] Chayashree Patgiri and Amrita Ganguly, “Comparative Study on
    Different Local Thresholding Techniques for Detection of Sickle
    Cell Anaemia from Microscopic Blood Images,” 2019 IEEE 16th
    India Council International Conference (INDICON), Rajkot,
    India.
[5] Enas Walid Abdulhay, Ahmad Ghaith Allow, and Mohammad
    Eyad Al-Jalouly presented their research on "Detection of Sickle
    Cell, Megaloblastic Anemia, Thalassemia, and Malaria through
    Convolutional Neural Network" at the 2021 Global Congress on
    Electrical Engineering (GC-ElecEng) in Valencia, Spain.
[6] Furkan Kiraci, Batuhan Albayrak, Muazzez Buket Darici, Arif
Selçuk Öğrenci, Atilla Özmen, and Kerem Ertez, "Orak Hücreli
Anemi Tespiti: Sickle Cell Anemia Detection" at the 2018
Medical Technologies National Congress (TIPTEKNO).
[7] Garima Vyas, Vishwas Sharma, Adhiraj Rathore, shared insights
on "Detection of Sickle Cell Anemia and Thalassemia Causing
Abnormalities in Thin Smear of Human Blood Sample Using
Image Processing" at the 2016 International Conference on
Inventive Computation Technologies (ICICT) in Coimbatore,
India.
[8] Carlos C. Hortinela, Jessie R. Fausto, Paul Daniel C. Divina, and
    John Philip T. Felices presented research on "Identification of
    Abnormal Red Blood Cells and Diagnosing Specific Types of
    Anemia Using Image Processing and Support Vector Machine" at
    the 2019 IEEE 11th International Conference on Humanoid,
    Nanotechnology, Information Technology, Communication and
    Control, Environment, and Management (HNICEM) in Laoag,
    Philippines.
[9] Joan Cid, Jaime Punter-Villagrasa, Jordi Colomer-Farrarons,
Ivón Rodríguez-Villarreal, and Pere Ll. Miribel-Català discussed
progress "Toward an Anemia Early Detection Device Based on
50-μL Whole Blood Sample" in the IEEE Transactions on
Biomedical Engineering.
[10] M. Kathirvelu, S. Keerthana, V. Keerthana, S. Lakshitha, and S.
Manikandan discussed "Early Detection of Sickle Cell Anemia
Among Tribal Inhabitants" at the 2023 8th International
Conference on Communication and Electronics Systems
(ICCES) in Coimbatore, India.
[11] R. Kumar, S. Guruprasad, Krity Kansara, K. N. Raghavendra
Rao, Murali Mohan, Manjunath Ramakrishna Reddy, Uday
Haleangadi Prabhu, P. Prakash, Sushovan Chakraborty, Sreetama
Das, and K. N. Madhusoodanan introduced an innovative "A
Novel Noninvasive Hemoglobin Sensing Device for Anemia
Screening" in the IEEE Sensors Journal, Volume 21, Issue 13,
published in 2021.
[12] Maileth Rivero-Palacio, Wilfredo Alfonso-Morales, and Eduardo
Caicedo-Bravo introduced a "Mobile Application for Anemia
Detection through Ocular Conjunctiva Images" at the 2021 IEEE
Colombian Conference on Applications of Computational
Intelligence (ColCACI) in Cali, Colombia.
[13] Megha Tyagi, Lalit Mohan, and Nidhi Dahyia shared insights on
"Detection of Poikilocyte Cells in Iron Deficiency Anemia using
Artificial Neural Network" during the 2016 International
Conference on Computation of Power, Energy Information, and
Communication (ICCPEIC) in Melmaruvathur, India.
[14] Muhammad Noman Hasan, Ran An, Yuncheng Man, and Umut
A. Gurkan unveiled an "Integrated Point-of-Care Device for
Anemia Detection and Hemoglobin Variant Identification" at the
2019 IEEE Healthcare Innovations and Point of Care
Technologies (HI-POCT) in Bethesda, MD, USA.
[15] Muljono, Sari Ayu Wulandari, Harun Al Azies, Muhammad
Naufal, Wisnu Adi Prasetyanto, and Fatima Az Zahra presented
groundbreaking research titled "Non-Invasive Anemia Detection
Empowered by AI: Pushing the Boundaries in Diagnosis" in
IEEE Access, Volume 12, published in 2024.
[16] Pooja Tukaram Dalvi and Nagaraj Vernekar presented research
on "Anemia Detection Using Ensemble Learning Techniques and
Statistical Models" at the 2016 IEEE International Conference on
Recent Trends in Electronics, Information & Communication
Technology (RTEICT) in Bangalore, India.
[17] Pranati Rakshit and Kriti Bhowmik demonstrated "Detection of
Abnormal Findings in Human RBC for Diagnosing G-6-P-D
Deficiency Hemolytic Anemia using Image Processing" at the
2013 IEEE 1st International Conference on Condition
Assessment Techniques in Electrical Systems (CATCON) in
Kolkata, India.
[18] Rita Magdalena, Yunendah Nur Fuadah, Sofia Sa'idah, Inung
Wijayanto, and Raditiana Patmasari, presented research on
"Non-Invasive Anemia Detection in Pregnant Women Using
Digital Image Processing and K-Nearest Neighbor" at the 2020
3rd International Conference on Biomedical Engineering
(IBIOMED) held in Yogyakarta, Indonesia.
[19] Roszymah Hamzah, Ahmad Sabry Mohamad, Nur Syahirah
Abdul Halim, Muhammad Noor Nordin, and Jameela Sathar
presented findings on "Automated Detection of Human RBC in
Diagnosing Sickle Cell Anemia with Laplacian of Gaussian
Filter" at the 2018 IEEE Conference on Systems, Process and
Control (ICSPC) in Melala, Malaysia.
[20] Sagnik Ghosal, Debanjan Das, Venkanna Udutalapally, Asoke K.
Talukder, and Sudip Misra introduced the innovative concept of
"sHEMO: Smartphone Spectroscopy for Blood Hemoglobin
Level Monitoring in Smart Anemia-Care" in the IEEE Sensors
Journal, Volume 21, Issue 6, published in 2021.
[21] Sasikala C, Ashwin M R, Dharanessh M D, and Dhanabalan M
“Curability Prediction Model for Anemia Using Machine
Learning” at the 2022 8th International Conference on Smart
Structures and Systems (ICSSS) Chennai, India.
[22] Sherif H. Elgohary, Zeyad Ayman Mohamed, Omar Ayman
Mohamed, and Ahmed Osama Ismail participated in the 2022
10th International Japan-Africa Conference on Electronics,
Communications, and Computations (JAC-ECC) held in
Alexandria, Egypt.
[23] Tajkia Saima Chy and Mohammad Anisur Rahaman
demonstrated "Automatic Sickle Cell Anemia Detection Using
Image Processing Technique" at the 2018 International
Conference on Advancement in Electrical and Electronic
Engineering (ICAEEE) in Gazipur, Bangladesh.
[24] Tiago Bonini Borchartt and Willian França Ribeiro showcased
"Automated Detection of Anemia in Small Ruminants Using
Non-Invasive Visual Analysis Based on BIC Descriptor" at the
2023 36th SIBGRAPI Conference on Graphics, Patterns, and
Images (SIBGRAPI) in Rio Grande, Brazil.
[25] Vinit P. Kharkar and Ajay P. Thakare provided a
"Comprehensive Review of Emerging Technologies for Anemia
Detection" at the 2022 8th International Conference on Signal
Processing and Communication (ICSC) in Noida, India.