
MAKERERE UNIVERSITY

COLLEGE OF COMPUTING AND INFORMATION SCIENCES


SCHOOL OF COMPUTING AND INFORMATICS TECHNOLOGY
A REPORT ON
FIELD ATTACHMENT/ INTERNSHIP ABOUT
Diabetes Disease Prediction using a Web tool with the help of a machine learning
model.
June 24th - August 06th
BY
MUGUME IAN
18/U/25478/PS
Field Attachment Report submitted to the School of Computing and Informatics Technology
in partial fulfilment of the requirements for the degree of (state your Programme) of Makerere
University Kampala
MUGUME IAN e-signed
Academic Supervisor: Dr. Joyce-Nabende Nakatumba ................................
Field Supervisor: Mr. Tusubira Jeremy .......................................................
Declaration.
I, MUGUME IAN, hereby declare that this report is my own original and authentic work. It fulfils
the purposes and objectives of this field attachment, and its content has never previously been
submitted to any other university or institution for a higher degree or any other award, except
for citations, quotations and references to other people’s work, which are acknowledged.
Date 4th August 2021
e-signed.
Acknowledgement.
First and foremost, I would like to acknowledge the Almighty God for the successful completion of
the Field Attachment period.

I would like to say thanks to Makerere University Data Science and Artificial Intelligence Lab for
the opportunity given to me as an intern.

Special thanks go to my academic supervisor Dr. Joyce Nabende for giving me an opportunity to
work under the Data Science and Artificial Intelligence Lab, Makerere University.

Many thanks to my field supervisor Mr. Tusubira Jeremy for his personal efforts, practical skills,
professional guidance and direction towards a successful internship.

Finally, I would like to extend my heartfelt gratitude to my family members, especially my mother
RUTH KOMWAKA, and to my classmates and other friends for their invaluable support
throughout my training.
Abstract.
I carried out my internship virtually under the Data Science and Artificial Intelligence Lab, with
close supervision by both my field and academic supervisors.

One of the main aims of the Bachelor of Science in Computer Science at undergraduate level is to
solve real-world problems for the people around us. With this in mind, I worked on a Diabetes disease
prediction project using machine learning techniques, implemented as a Streamlit web application.
I used two datasets from the Kaggle website and carried out tasks such as EDA, data preprocessing,
model formulation, building and evaluation, improving model accuracy, saving the trained model
using the pickle library, and then developing the Streamlit web application for prediction based on
the two datasets.

Throughout my work, I attained problem-solving skills, and I was also able to troubleshoot tasks
which failed to output the required results by researching solutions online.

The challenges faced include: difficulty understanding new technical terms in machine learning,
learning the Streamlit framework, poor internet connection, and insufficient funds to buy mobile
data bundles to continue with the research and the project.

In conclusion, the internship under the Data Science and AI Lab was very productive, with practical
hands-on skills and guidance from experts such as my field supervisor.

I therefore recommend that we as students be taught more practical skills than theory and be given
more time for practice.
Contents
1 Introduction
1.1 Introduction.
1.2 Background.
1.3 Problem Statement.
1.4 Objectives
1.4.1 Main Objective
1.4.2 Specific Objectives
1.5 Scope
1.6 Significance

2 Literature Review
2.1 Introduction.
2.2 Related Works Using Machine Learning.
2.3 Related Works Using Deep Learning.

3 Methodology.
3.1 Case study.
3.2 Datasets/Data collection.
3.2.1 Pima Indians dataset (PID).
3.2.2 Early diabetes disease dataset.
3.3 Exploratory Data Analysis.
3.3.1 Pima Indians dataset (PID).
3.3.2 Early diabetes disease dataset.
3.4 Data pre-processing.
3.4.1 Pima Indians dataset (PID).
3.4.2 Early diabetes disease dataset.
3.5 Model building, Improving performance accuracies and model evaluation.
3.5.1 Pima Indians dataset (PID).
3.5.2 Early diabetes disease dataset.
3.6 Finalising the model.
3.6.1 Pima Indians dataset (PID).
3.6.2 Early diabetes disease dataset.

4 System Analysis and Design.
4.1 System Analysis.
4.1.1 Data analysis.
4.1.2 User requirements.
4.1.3 Equipment/tools used.
4.1.4 Functional requirements.
4.1.5 Non-functional requirements.
4.2 System Design.

5 System Implementation, testing and Validation.
5.1 System Implementation.
5.2 User Interface Design.
5.2.1 Merged web tool with Pima Indians dataset selection.
5.2.2 Merged web tool with Early Diabetes Disease dataset selection.
5.3 System testing and results.
5.3.1 Sample results of prediction on Pima Indians dataset regarding positive results.
5.3.2 Sample results of prediction on Pima Indians dataset regarding negative results.
5.3.3 Sample results of prediction on Early Diabetes Disease dataset regarding positive results.
5.3.4 Sample results of prediction on Early Diabetes Disease dataset regarding negative results.

6 Conclusions and Recommendations.
6.1 Conclusions.
6.2 Recommendations.
6.2.1 Recommendations to the University.
6.2.2 Recommendation to future interns.

7 References and Appendices.
7.1 References.
7.2 Appendices.
7.3 My Github repository.

A Student’s Weekly Progress Report

List of figures.
All figures are listed in the Appendix section.

List of acronyms/abbreviations.
EDA Exploratory Data Analysis.
PID Pima Indians Dataset.
AI Artificial Intelligence.
NLP Natural language processing.
CSC Computer Science.
CoCIS College of Computing and Information Sciences.
FAMS Field Attachment Management System.

1 Introduction
1.1 Introduction.
Diabetes is one of the most widespread chronic diseases and puts a lot of pressure on the public
health system. Diabetes occurs when the pancreas does not produce enough insulin or when the insulin
produced is not used effectively by the body. According to WHO and ADA, there are four types of
diabetes: Type-I, Type-II, Gestational Diabetes (GDM) and rare specific types of diabetes. Type-I
diabetes is responsible for 5% to 10% of all diabetes cases and occurs due to a lack of insulin
production. Damage to the pancreas is the source of the loss of insulin production in the human body,
which leads to insulin-dependent diabetes. Type-II diabetes, however, is more common (90% of the
diabetic population) and is caused by “insulin resistance” and metabolic disorders; the result is an
increase in blood sugar levels [7].
GDM is a type of diabetes related to the changes the body undergoes during pregnancy. Almost 4% of
pregnancies will develop GDM at some stage. To decrease the risk of diabetes in newborn babies, GDM
should be monitored and treated accordingly. The other causes of diabetes are related to genetic and
metabolic disorders.
Diabetes is a disease that occurs when blood glucose, also called blood sugar, is too high. Blood
glucose is the body’s main source of energy and comes from the food we eat. Insulin, made by the
pancreas, helps glucose get into our cells to be used for energy. However, when blood glucose is too
high, people get very sick. Early signs of diabetes include hunger and fatigue, blurred vision, dry
mouth and itchy skin, as well as urinating more often and being thirstier. If not detected early,
diabetes can lead to death. This is where machine learning in healthcare can help. Advances in
technology, better computing power and the availability of datasets in open source repositories have
further increased the use of machine learning. Machine learning is used in many areas of healthcare.
The healthcare sector produces large amounts of data, such as images and patient records, that help to
identify patterns and make predictions. Thus, building a machine learning model, training it on a
dataset and entering an individual patient’s details can help in prediction. The prediction result
depends on the data entered and is hence specific to that individual.

1.2 Background.
Machine Learning (ML) is a discipline of computer science that provides systems which learn from
data and improve their behaviour over time by automatically discovering emerging patterns in training
datasets [8]. The main objective in ML is to learn from previous experience; learning can be supervised
or unsupervised. In supervised learning the training data contains labelled data (positive and
negative examples); thus, the ML algorithm uses the training data to predict the labels of new
observations. ML is also categorized into four major groups: Classification, Regression, Clustering,
and Reinforcement Learning. In classification problems, the objective is to predict the associated
labels (categorical values) for the observations, whereas in regression modelling the prediction is a
continuous value rather than a nominal value. Clustering methods, on the other hand, are unsupervised
methods which give some insight into the distribution of the data and the similarity of observations.
Reinforcement learning is an area of ML where there is no notion of an immediate right or wrong
decision, but rather a strategy to maximize the reward over sequences of actions [9].
In this study, I used supervised learning models for classification to predict diabetes disease.
Some of the most common classification methods are:

• Logistic Regression builds a statistical model by fitting a linear model that describes the
relationship between the logit (log-odds) of the outcome and one or more independent variables.
Logistic Regression is a simple approach but popular with the Machine Learning community [10].

• Support Vector Machines (SVM) is another popular ML method that finds a hyperplane which
separates the positive and negative classes. SVM gives better insight into important features and
how they influence the decision boundary [11, 12].

• Random Forest (RF) is an ensemble model which builds multiple trees during training by
randomly selecting features and bootstrap-sampling the training data (bagging). Random Forest is
more robust to variance error [13, 14].

• Decision Tree (J48) is a simple representation for classifying examples by developing decision
rules through nodes and making the final decision in the leaves. J48 is one of the most common
decision tree implementations [15].

• Artificial Neural Network (ANN) is a general, parametrized classification method that mimics the
underlying computational model of the human brain through a vast network of connections between
neurons [16].
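
As a minimal sketch of how the methods listed above could be set up, the snippet below uses scikit-learn. It is an illustration under stated assumptions, not the exact code of this project: the data is synthetic, all parameter values are arbitrary, and `DecisionTreeClassifier` implements CART, which is related to but not the same as J48 (C4.5).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a binary (diabetic / non-diabetic) dataset with
# eight numeric features; it is NOT the data used in this report.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# scikit-learn counterparts of the methods listed above (assumed settings).
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "SVM": SVC(kernel="linear"),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "Decision Tree": DecisionTreeClassifier(random_state=0),  # CART, not J48
    "ANN (MLP)": MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0),
}

for name, model in models.items():
    model.fit(X, y)
    # Training accuracy only shows that the model fits; it is not a fair
    # estimate of performance on unseen patients.
    print(f"{name}: training accuracy = {model.score(X, y):.3f}")
```

All five estimators share the same `fit` / `predict` interface, which is what makes it easy to compare them side by side later.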

1.3 Problem Statement.


Diabetes mellitus is a common disease that affects a large number of people in many parts of the
world. Diabetes usually affects people after the age of 20 [25]. According to WHO statistics, the
global prevalence of diabetes among adults above 18 years of age had risen to 8.5% in 2014 [36].
Diabetes prevalence has been increasing more in middle- and low-income countries. It also becomes a
cause of other illnesses such as blindness, kidney failure, cholesterol and heart diseases. Deaths due
to diabetes and high blood glucose are on the rise. Predicting diabetes at an early stage would help
patients to keep their sugar level under control. As machine learning techniques prove to be good at
predictive analysis, they are used to predict the risk of diabetes in the proposed approach. The
performance of the algorithm is also measured and improved using feature selection and selection of
the training set.

1.4 Objectives
1.4.1 Main Objective
To develop a Web application tool to automatically predict and classify Diabetes disease in human
beings with the help of a machine learning model.

1.4.2 Specific Objectives


• To understand the existing methods used for Diabetes disease prediction.

• To study the current system for Diabetes disease prediction.

• To develop a machine learning model for predicting and classifying Diabetes disease.

• To test and evaluate the machine learning model used for Diabetes prediction.

1.5 Scope
In this research, I used secondary sources of data, namely two public datasets in the form of csv
files: the Pima Indians dataset and the Early diabetes disease dataset.

1.6 Significance
The significance of this project was:

• Less time was spent on the detection of Diabetes disease.

• The application could also be used as a data collection tool for other research purposes.

• Quick prediction and classification of Diabetes disease in human beings.

• Human intervention in classifying Diabetes disease was omitted.

2 Literature Review
2.1 Introduction.
To perform this study, I selected ten of the most recent studies on machine learning, which are
discussed in detail, and seven studies on Deep Learning techniques, which are also discussed.

2.2 Related Works Using Machine Learning.


Machine Learning algorithms are very well-known in the medical field for predicting diseases. Many
researchers have used ML techniques to predict diabetes in an effort to obtain the best and most
accurate results [17].
Kandhasamy and Balamurali [18] used multiple classifiers: SVM, J48, K-Nearest Neighbors (KNN),
and Random Forest. The classification was performed on a dataset taken from the UCI repository
(for more details see Table A4). The results of the classifiers were compared based on accuracy,
sensitivity, and specificity. The classification was done in two cases, with and without
pre-processing of the dataset, using 5-fold cross validation. The authors did not explain the
pre-processing step applied to the dataset; they only mentioned that noise was removed from the
data. They reported that the J48 decision tree classifier had the highest accuracy, 73.82%, without
pre-processing, while KNN (k = 1) and Random Forest showed the highest accuracy, 100%, after
pre-processing the data.

Moreover, Yuvaraj and Sripreethaa [19] presented an application for diabetes prediction using
three different ML algorithms: Random Forest, Decision Tree, and Naïve Bayes. The Pima Indian
Diabetes dataset (PID) was used after pre-processing. The authors did not mention how the data was
pre-processed; however, they discussed the Information Gain method used for feature selection to
extract the relevant features. They used only eight main attributes among 13. In addition, they
divided the dataset into 70% for training and 30% for testing. The results showed that the random
forest algorithm had the highest accuracy rate of 94%.

Furthermore, Tafa et al. [20] proposed a new integrated, improved model of SVM and Naïve Bayes
for predicting diabetes. The model was evaluated using a dataset collected from three different
locations in Kosovo. The dataset contains eight attributes and 402 patients, of whom 80 had
type 2 diabetes. Some attributes used in this study had not been investigated before, including
regular diet, physical activity, and family history of diabetes. The authors did not mention whether
the data was pre-processed or not. For the validation test, they split the dataset into 50% for each
of the training and testing sets. The proposed combined algorithms improved the accuracy of the
prediction to 97.6%. This value was compared with the performance of SVM and Naïve Bayes alone,
which achieved 95.52% and 94.52%, respectively.

In addition, Deepti and Dilip [21] used Decision Tree, SVM, and Naive Bayes classifiers to detect
diabetes. The aim was to identify the classifier with the highest accuracy. The Pima Indian dataset
was used for this study. The dataset was partitioned by means of 10-fold cross-validation.
The authors did not discuss the data pre-processing. The performance was evaluated using accuracy,
precision, recall, and the F-measure. The highest accuracy, 76.30%, was obtained by Naive Bayes.

Mercaldo et al. [22] used six different classifiers: J48, Multilayer Perceptron, HoeffdingTree,
JRip, BayesNet, and RandomForest. The Pima Indian dataset was also used for this study. The authors
did not mention a pre-processing step; however, they employed two algorithms, GreedyStepwise and
BestFirst, to determine the discriminatory attributes that help to increase the classification
performance. Four attributes were selected, namely body mass index, plasma glucose concentration,
diabetes pedigree function, and age. 10-fold cross validation was applied to the dataset. The
comparison between the classifiers was based on precision, recall, and the F-measure. The Hoeffding
Tree algorithm gave a precision of 0.757, a recall of 0.762, and an F-measure of 0.759, which was the
highest performance among the classifiers.
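
As a quick worked check of how these three figures relate, the F-measure is the harmonic mean of precision and recall. The short sketch below recomputes it from the reported precision and recall; the function name `f_measure` is my own and is not taken from the cited study.

```python
def f_measure(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (the F1 score)."""
    return 2 * precision * recall / (precision + recall)


# Precision and recall reported for the Hoeffding Tree classifier.
precision, recall = 0.757, 0.762
f1 = f_measure(precision, recall)
print(round(f1, 3))  # prints 0.759
```

The recomputed value agrees with the reported F-measure of 0.759 to three decimal places.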

In addition to the other studies, Negi and Jaiswal [23] aimed to apply SVM to predict diabetes.
The Pima Indians and Diabetes 130-US datasets were used as a combined dataset. The motivation of
this study was to validate the reliability of the results, as other researchers often used a single
dataset. The dataset contains 102,538 samples and 49 attributes, where 64,419 were positive samples
and 38,115 were negative samples. The authors did not discuss the attributes used in this study. The
dataset was pre-processed by replacing missing values and out-of-range data with zero, converting
non-numerical values to numerical values, and finally normalizing the data between 0 and 1. Different
feature selection methods were used prior to the application of the SVM model. The Fselect script
from the LIBSVM package selected four attributes, while the Wrapper and Ranker methods (from the
Weka tool) selected nine and 20 attributes, respectively. For the validation process, the authors
used the 10-fold cross validation technique. By using a combined dataset, the diabetes prediction
might be more reliable, with an accuracy of 72%.

Moreover, Olaniyi and Adnan [24] used a Multilayer Feed-Forward Neural Network trained with the
back-propagation algorithm. The aim was to improve the accuracy of diabetes prediction. The Pima
Indian Diabetes database was used. The authors normalized the dataset before proceeding to
classification in order to obtain numerical stability. This consisted of dividing each sample
attribute by its corresponding amplitude so that all the dataset values lie between 0 and 1. After
that, the dataset was divided into 500 samples for the training set and 268 for the testing set.
The accuracy obtained was 82%, which is considered a high accuracy rate.

Soltani and Jafarian [25] used the Probabilistic Neural Network (PNN) to predict diabetes. The
algorithm was applied to the Pima Indian dataset. The authors did not apply any pre-processing
technique. The dataset was divided into 90% for the training set and 10% for the testing set. The
proposed technique achieved accuracies of 89.56% and 81.49% for the training and testing data,
respectively.

Rakshit et al. [26] used a Two-Class Neural Network to predict diabetes using the Pima Indian
dataset. The authors pre-processed the dataset by normalizing all the sample attribute values using
the mean and the standard deviation of each attribute in order to obtain numerical stability. In
addition, they extracted the relevant features using correlation. However, the authors did not
mention these discriminatory features. The dataset was split into a training set containing 314
samples and a testing set comprising 78 samples. This model achieved an accuracy of 83.3%, the
highest when compared to the accuracies obtained in the previous studies.

Mamuda and Sathasivam [27] applied three supervised learning algorithms: Levenberg Marquardt
(LM), Bayesian Regularization (BR), and Scaled Conjugate Gradient (SCG). This study used the
Pima Indian dataset (with 768 samples and eight attributes) for evaluating the performance. For the
validation study, 10-fold cross validation was used to split the data into training and testing
sets. The authors reported that Levenberg Marquardt (LM) obtained the best performance on the
validation set, with a Mean Squared Error (MSE) of 0.00025091.

2.3 Related Works Using Deep Learning.


Researchers have started to realize the capabilities of DL techniques in processing large datasets.
Therefore, diabetes prediction has also been performed using DL techniques. Seven studies were
published during the last six years.
Ashiquzzaman et al. [28] used a Deep Neural Network (DNN). The architecture of the DNN was
composed of a Multilayer Perceptron (MLP), a General Regression Neural Network (GRNN), and a Radial
Basis Function (RBF) network. The evaluation of the approach was based on the Pima Indian dataset.
The authors intentionally did not pre-process the dataset, as a DNN can filter the data and acquire
the biases. The dataset was split into 192 samples for the testing set and the rest for training.
The accuracy rate reported by the authors was 88.41%.

Another study by Swapna et al. [29] used two DL techniques to improve the accuracy of diabetes
prediction. A private dataset called Electrocardiograms was used to assess the performance of the
CNN and CNN-LSTM. It consisted of 142,000 samples and eight attributes. Five-fold cross validation
was used to split the dataset into training and testing sets. The authors neither pre-processed the
data nor applied a feature selection method, because of the self-learning of DNNs. The accuracy
rates for the two models were 90.9% and 95.1%, respectively.

Mohebbi et al. [30] used logistic regression as a baseline for a multilayer perceptron neural
network and a convolutional neural network (CNN). The aim was to detect diabetic patients based on a
continuous glucose monitoring (CGM) signal dataset. The dataset is composed of nine patients, each
with 10,800 days of CGM data, resulting in a total of 97,200 simulated CGM days. The attributes
used in this study were not discussed. The dataset was split into training, validation, and
testing sets based on the leave-one-patient-out cross-validation technique. In fact, the authors
selected six patients for training and validation, and three patients for testing. The CNN achieved
the highest accuracy of 77.5%.

Moreover, Miotto et al. [31] proposed a framework of unsupervised Deep Neural Network called
Deep Patient. The framework used a patients’ electronic health records database composed of 704,857
patients. The authors did not specify the features used in this dataset; they mentioned that the
dataset can be used to predict different diseases. In the validation process, the authors split the
data into 5,000 patients for validation, 76,217 patients for testing and the rest for training. The
performance was measured using the Area Under the Curve (AUC), which reached 0.91. The authors
recommended pre-processing the dataset to enhance the prediction performance. They suggested
using PCA to extract relevant attributes before performing the DL.

Pham et al. [32] applied three different DL techniques to a manually collected dataset from a
regional Australian hospital. The dataset is composed of 12,000 samples (patients), 55.5% of whom
are male. Some pre-processing techniques (not mentioned in their article) were applied to clean and
reduce the samples to 7,191 patients. For validation, the dataset was split into 2/3 for the
training set, 1/6 for validation and 1/6 for testing. The methods were Long Short-Term Memory
(LSTM), Markov, and plain RNN. The precision value was used to compare the performance of the
techniques. The best precision value of 59.6% was achieved by the LSTM.

Furthermore, Ramesh et al. [33] used the Recurrent Neural Network (RNN) to predict the two
types of diabetes. The authors used the Pima Indian dataset with 768 samples and eight attributes.
The attributes are ordered by decreasing importance, as indicated in their study: “Glucose,
BMI, Age, Pregnancies, Diabetes Pedigree Function, Blood Pressure, Skin Thickness and Insulin”.
To validate the study, the dataset was split into 80% for training and 20% for testing. The
accuracy of predicting Diabetes type 1 was 78%, while it was 81% for type 2.

In addition to the other studies, Lekha and Suchetha [34] used a one-dimensional modified CNN to
predict diabetes based on breath signals. The authors collected a dataset of breath signals composed
of 11 healthy patients, nine diabetic patients of type 2, and five diabetic patients of type 1. No
pre-processing was performed on the dataset. For the validation process, the authors used
Leave-One-Out Cross Validation. The performance was evaluated based on the Receiver Operating
Characteristic (ROC) curve, which reached 0.96.

3 Methodology.
To address the problem identified, the project needed to meet the objectives stated in section 1.4.2.
This part describes how I achieved the objectives of the project.

3.1 Case study.


Two public datasets were used, namely the Pima Indians dataset and the Early diabetes disease
dataset, both being csv files.

3.2 Datasets/Data collection.


3.2.1 Pima Indians dataset (PID).
In this study, I used the Pima Indian Dataset (PID), which I got from the UCI Machine Learning
Repository. This dataset was originally from the National Institute of Diabetes and Digestive and
Kidney Diseases [35]. The PID dataset has eight attributes and one output class with a binary value
indicating whether the person has diabetes or not. It contains 768 instances: 500 are non-diabetic
while the remaining 268 are diabetic. I chose the PID because it is a well-known and common
benchmark dataset for comparing the performance of methods between studies.

3.2.2 Early diabetes disease dataset.


I got this dataset from the UCI repository, an open source website of standard test datasets. The
data set was obtained through direct questionnaires from 520 patients at the Sylhet Diabetes Hospital
in Sylhet, Bangladesh, and was approved by doctors. The data set has 17 attributes: age, gender,
polyuria, polydipsia, sudden weight loss, weakness, polyphagia, genital thrush, visual blurring,
itching, irritability, delayed healing, partial paresis, muscle stiffness, alopecia, obesity and the
class label.

3.3 Exploratory Data Analysis.


3.3.1 Pima Indians dataset (PID).
• There were 9 columns and 768 instances.

• There were no data points missing in the dataset.

• The histograms showed some unexpected zero values in columns such as blood pressure, age,
insulin, BMI and glucose level, which cannot be 0 in reality.

• The ’Outcome’ column was the target variable for prediction: 1 for a positive case and 0 for a
negative case.

• The ’Outcome’ column had an imbalanced class distribution, with 500 non-diabetic and 268
diabetic people. In the corresponding figure, red is for diabetic and blue for non-diabetic
people. For most of the attributes, the distribution for diabetic people (red) is shifted to the
right compared with the distribution for non-diabetic people (blue). This suggests that a
diabetic person is more likely to be an older person with a higher BMI, SkinThickness and
glucose level.

• I plotted a boxplot for each of these attributes to clearly see the difference in its
distribution between the two outcomes (diabetic and non-diabetic).
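
The checks above can be sketched with pandas as follows. This is a minimal illustration, not the notebook used in the project: the rows are made-up values in the style of the Pima data, and the list `suspect_cols` is my own choice of columns to inspect. On the real data the class check would report 500 and 268, that is roughly 34.9% diabetic cases.

```python
import pandas as pd

# Toy rows using Pima-style column names; the values are made up for
# illustration and are NOT taken from the real dataset.
df = pd.DataFrame({
    "Glucose":       [148, 85, 0, 89, 137, 116],
    "BloodPressure": [72, 66, 64, 0, 40, 74],
    "Insulin":       [0, 0, 0, 94, 168, 0],
    "BMI":           [33.6, 26.6, 23.3, 28.1, 43.1, 0.0],
    "Age":           [50, 31, 32, 21, 33, 30],
    "Outcome":       [1, 0, 1, 0, 1, 0],
})

# 1. Missing values as reported by pandas (none here).
missing = df.isnull().sum()

# 2. Zeros in columns where a zero is physiologically implausible.
suspect_cols = ["Glucose", "BloodPressure", "Insulin", "BMI"]
zero_counts = (df[suspect_cols] == 0).sum()

# 3. Class distribution of the target variable.
class_counts = df["Outcome"].value_counts()

# 4. Per-class medians: a numeric counterpart of the box plots.
medians_by_class = df.groupby("Outcome")[suspect_cols].median()

print(missing, zero_counts, class_counts, medians_by_class, sep="\n\n")
```

Note that `isnull()` reports nothing for the zero entries, which is why the zeros have to be counted separately before they can be treated as missing.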

3.3.2 Early diabetes disease dataset.
• There were 17 columns and 520 instances.

• There were no missing or null values.

• There were 320 positive cases for Diabetes and 200 negative cases.

• The classes were roughly, though not perfectly, balanced between positive and negative cases.

• I considered the independent variables that have a high correlation with the dependent variable
and low correlation with the other variables.

• The top 10 features were identified.
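
One way to rank features by their correlation with the target is sketched below. It is an assumption about how such a ranking could be done, not a record of the exact procedure used here: the data is randomly generated, only three features are shown, and absolute Pearson correlation is my own choice of score.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Synthetic binary symptom data; the column names follow the dataset
# description but the values are randomly generated, NOT the real records.
target = rng.integers(0, 2, size=n)
df = pd.DataFrame({
    "Polyuria": np.where(rng.random(n) < 0.8, target, 1 - target),  # strongly related
    "Weakness": np.where(rng.random(n) < 0.6, target, 1 - target),  # weakly related
    "Itching":  rng.integers(0, 2, size=n),                          # unrelated
    "class":    target,
})

# Absolute Pearson correlation of every feature with the target.
corr_with_target = df.corr()["class"].drop("class").abs()

# Rank the features; on the full dataset one would use .head(10).
top_features = corr_with_target.sort_values(ascending=False).head(2)
print(top_features)

# Correlation between the features themselves, to spot redundant pairs.
feature_corr = df.drop(columns="class").corr()
```

Features that score high in `corr_with_target` but low against each other in `feature_corr` are the ones the criterion above favours.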

3.4 Data pre-processing.


3.4.1 Pima Indians dataset (PID).
ˆ To handle the unexpected zero values, I first converted them to null values and then imputed
each null value with the median of that attribute, computed separately for each Outcome. If a null
value belonged to a diabetic person, I used the median of the diabetic records only; if it
belonged to a non-diabetic person, I used the median of the non-diabetic records.

ˆ All values in the dataset were then put on the same scale using a standardiser.
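
The two pre-processing steps above can be illustrated with a small plain-Python sketch. It is not
the project code: the glucose values are made up, the function names are hypothetical, and
standardisation is assumed to mean rescaling to zero mean and unit standard deviation.

```python
from statistics import mean, median, pstdev


def impute_zeros_by_class(values, outcomes):
    """Replace each zero with the median of the non-zero values that share its outcome."""
    medians = {}
    for label in set(outcomes):
        known = [v for v, o in zip(values, outcomes) if o == label and v != 0]
        medians[label] = median(known)
    return [medians[o] if v == 0 else v for v, o in zip(values, outcomes)]


def standardise(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


# Made-up glucose readings; 0 marks an implausible (missing) measurement.
glucose = [148, 0, 183, 89, 0, 116]
outcome = [1, 1, 1, 0, 0, 0]

filled = impute_zeros_by_class(glucose, outcome)
scaled = standardise(filled)
print(filled)
```

Because the imputation needs the Outcome label, it can only be applied to labelled training data,
not to a new record whose outcome is still unknown.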

3.4.2 Early diabetes disease dataset.


ˆ For the 'Class' column (target variable), I converted positive and negative instances to the
binary values 1 and 0 respectively.

ˆ For the other columns with 'Yes' and 'No' instances, I used a label encoder to convert the
object values to numeric ones.

ˆ For the 'Age' column, I transformed the values to binary using OneHotEncoder.

ˆ Finally, all the instances in the columns were binary values.
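
What the encoding steps above produce can be shown with a plain-Python stand-in for the label
encoder and OneHotEncoder. The sample values and the helper names encode_binary and one_hot are
illustrative assumptions, not the project's code.

```python
def encode_binary(values, positive):
    """Map the positive label to 1 and every other label to 0."""
    return [1 if v == positive else 0 for v in values]


def one_hot(values):
    """Return the sorted categories and one 0/1 row per value, one column per category."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


# Made-up sample values for a target column, a Yes/No column and the Age column.
target = encode_binary(["Positive", "Negative", "Positive"], positive="Positive")
symptom = encode_binary(["Yes", "No", "No"], positive="Yes")
age_categories, age_rows = one_hot([40, 58, 40])
print(target, symptom, age_categories, age_rows)
```

One-hot encoding raw ages creates one column per distinct age, which is why every column ends up
holding only 0 or 1.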

3.5 Model building, Improving performance accuracies and model evaluation.


3.5.1 Pima Indians dataset (PID).
ˆ I chose 5 models for this classification problem, including SVM, Random forest classifier,
Logistic regression and Naive Bayes.

ˆ I used 10-fold cross validation with the 'accuracy' metric to obtain the mean and standard
deviation of each model's accuracy.

ˆ I selected the 2 best models using the box and whisker plots, i.e. Logistic regression and
Random forests.

ˆ To avoid data leakage, I used Pipelines that standardised the data and built the model within
each fold of the cross validation test harness.
From the figure of the rescaled models using pipelines, 2 models were selected using the box and
whisker plots: Logistic regression and SVM.

ˆ Tuning Logistic regression and SVM gave similar best accuracies: 0.775120 using 'C': 100,
'penalty': 'l2', 'solver': 'lbfgs', and 0.780090 using 'C': 0.1, 'kernel': 'linear', respectively.

ˆ Of the ensemble methods I then tried, Gradient boosting and Random forests gave the better
results.

ˆ In the box and whisker plots for the ensemble methods, Random forests showed the best
performance.
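
The leakage-free evaluation and tuning steps above can be sketched with scikit-learn as follows.
This is a hedged illustration rather than the project code: synthetic data from
make_classification stands in for the PID features, and the stratified folds, random seeds and
grid of C values are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with the same shape as the PID features (768 rows, 8 columns).
X, y = make_classification(n_samples=768, n_features=8, n_informative=5, random_state=7)

# The scaler sits inside the pipeline, so it is re-fitted on the training part
# of every fold and never sees that fold's held-out rows.
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=7)
scores = cross_val_score(pipeline, X, y, cv=cv, scoring="accuracy")
print("LR: %f (%f)" % (scores.mean(), scores.std()))

# Tune the regularisation strength C of the model step with the same folds.
param_grid = {"model__C": [0.01, 0.1, 1.0, 10, 100]}
grid = GridSearchCV(pipeline, param_grid, cv=cv, scoring="accuracy")
grid.fit(X, y)
print("Best: %f using %s" % (grid.best_score_, grid.best_params_))
```

The per-fold values in scores are the numbers a box and whisker plot would summarise for each
model.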

3.5.2 Early diabetes disease dataset.
ˆ I chose 5 models for this classification problem, including SVM, Random forest classifier,
Logistic regression and Naive Bayes.

ˆ I used 10-fold cross validation with the 'accuracy' metric to obtain the mean and standard
deviation of each model's accuracy.

ˆ I selected the 2 best models using the box and whisker plots.

ˆ To avoid data leakage, I used Pipelines that standardised the data and built the model within
each fold of the cross validation test harness.
From the figure of the rescaled models using pipelines, 2 models were selected using the box and
whisker plots: Random forests and Decision trees.

ˆ Tuning Random forests and Decision trees gave the same accuracy of 98.0720 for both.

ˆ Of the ensemble methods I then tried, the Extra Trees classifier gave the better result.

ˆ In the box and whisker plots for the ensemble methods, the Extra Trees classifier again showed
the better results.
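
The comparison of tree-based models described above can be sketched like this. It is an
illustration under stated assumptions: the data are synthetic (520 rows and 16 feature columns to
mirror the dataset's shape), and the number of trees, seeds and short model labels are chosen for
the example, so the printed accuracies will not match the reported ones.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the 520-row dataset with 16 feature columns.
X, y = make_classification(n_samples=520, n_features=16, n_informative=8, random_state=7)

models = {
    "DT": DecisionTreeClassifier(random_state=7),
    "RF": RandomForestClassifier(n_estimators=50, random_state=7),
    "ET": ExtraTreesClassifier(n_estimators=50, random_state=7),
}

# The same 10 folds are reused for every model so the scores are comparable.
cv = KFold(n_splits=10, shuffle=True, random_state=7)
results = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    results[name] = (scores.mean(), scores.std())
    print("%s: %f (%f)" % (name, scores.mean(), scores.std()))

best = max(results, key=lambda name: results[name][0])
print("Best mean accuracy:", best)
```

Reusing one set of folds for all models keeps the comparison fair, because every model is scored
on exactly the same held-out rows.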

3.6 Finalising the model.
3.6.1 Pima Indians dataset (PID).
ˆ I selected my best model by comparing the box and whisker plots, confusion matrices, precision
and F1 scores of 3 models, i.e. SVM, Logistic regression and GBM.

SVM was the best.

ˆ I later saved the trained model using the pickle library.
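
The precision and F1 scores used for the final selection follow directly from the confusion
matrix. A minimal sketch with made-up labels and predictions (the function names are
illustrative, not from the project code):

```python
def confusion_counts(y_true, y_pred):
    """Count true positives, false positives, false negatives and true negatives."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1, returning 0.0 where a ratio is undefined."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Made-up labels (1 = diabetic) and predictions for eight records.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
print(tp, fp, fn, tn, precision, recall, f1)
```

On an imbalanced dataset such as PID these scores are more informative than accuracy alone,
because they focus on how well the positive (diabetic) class is identified.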

3.6.2 Early diabetes disease dataset.


ˆ I selected my best model by comparing the box and whisker plots, confusion matrices, precision
and F1 scores of 3 models, i.e. Random forests, Decision tree and the Extra Trees classifier.
The Extra Trees classifier was the best.

ˆ I later saved the trained model using the pickle library.
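
Saving and reloading a model with pickle can be sketched as below. A plain dictionary stands in
for the trained model here (an assumption made to keep the example self-contained); a fitted
scikit-learn estimator is saved and loaded the same way. The file name and helper names are
illustrative.

```python
import os
import pickle
import tempfile


def save_model(model, path):
    """Serialise the model object to a binary file."""
    with open(path, "wb") as handle:
        pickle.dump(model, handle)


def load_model(path):
    """Read a previously saved model object back from disk."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


# Stand-in for a trained model; any picklable object works the same way.
trained = {"name": "stand-in model", "weights": [0.4, -1.2, 0.7], "bias": 0.1}

path = os.path.join(tempfile.mkdtemp(), "model.pkl")
save_model(trained, path)
restored = load_model(path)
print(restored == trained)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary
code.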

4 System Analysis and Design.
4.1 System Analysis.
4.1.1 Data analysis.
From manual inspection of the CSV files of both datasets, I concluded that the current methods
used by medical officials are inefficient and ineffective.

4.1.2 User requirements.


Below are the user requirements for the Web tool for Diabetes disease prediction.

ˆ Allow the user to choose the dataset to use for prediction.

ˆ Allow the user to input numerical or text data.

ˆ Allow the user to make as many predictions as they wish.

ˆ Give the probability that the person is diabetic or not.

4.1.3 Equipment/tools used.


The following are some of the software tools I used in this project.

ˆ Python 3

ˆ Anaconda (with Jupyter Notebook, Spyder, Visual Studio Code)

ˆ Streamlit framework.

ˆ Ubuntu 20.04 LTS

4.1.4 Functional requirements.


The functional requirements specify what the system is expected to do. They include the following:

ˆ Provide users with two dataset options to choose from for prediction.

ˆ Provide users with text and numerical input fields for entering data.

ˆ Provide a quick prediction on submission of the data.

4.1.5 Non-functional requirements.


ˆ The web tool should provide solutions to overcome diabetes.

ˆ The web tool should output a congratulatory or failure message to the user.

4.2 System Design.
The Web tool was developed using Streamlit, a Python framework for machine learning and data
science applications, as shown in the screenshot below.

5 System Implementation, testing and Validation.
This chapter describes how the Diabetes disease prediction Web app/tool was implemented. The
implementation was driven by the desire to achieve the set objectives at the start of the project.

5.1 System Implementation.


This section contains an overview of the project implementation; it highlights the major components
and the operations of the system as well as the different user interfaces and activities that allow users
to interact with the system.

The code was written in the Python 3 programming language with the Streamlit framework, using the
Visual Studio Code IDE.

Implementation tools.
The Streamlit framework handled both the frontend and the backend tasks in a single Python file.
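
How a single file can serve as both frontend and backend is sketched below. This is not the
project's app: the widget labels, messages and 0.5 threshold are assumptions, and the Streamlit
module is passed in as the parameter st so that the page logic stays importable without launching
a server. Only standard Streamlit calls (title, selectbox, number_input, button, success, error)
are used.

```python
def format_prediction(probability, threshold=0.5):
    """Return (is_positive, message) for a predicted probability of diabetes."""
    is_positive = probability >= threshold
    label = "Positive" if is_positive else "Negative"
    return is_positive, "%s: estimated probability of diabetes %.2f" % (label, probability)


def render_app(st, models):
    """Draw the page.

    st     -- the streamlit module (passed in rather than imported here)
    models -- maps a dataset name to (feature names, function returning a probability)
    """
    st.title("Diabetes disease prediction")
    dataset = st.selectbox("Dataset", list(models))
    feature_names, predict_probability = models[dataset]

    # One numeric input per feature; text inputs are omitted to keep the sketch short.
    values = [st.number_input(name, value=0.0) for name in feature_names]

    if st.button("Predict"):
        is_positive, message = format_prediction(predict_probability(values))
        if is_positive:
            st.error(message)
        else:
            st.success(message)
```

In an app file (for example a hypothetical app.py), this would be wired up with
`import streamlit as st` followed by `render_app(st, models)`, where models holds the unpickled
classifiers, and started with `streamlit run app.py`.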

5.2 User Interface Design.


The web tool was developed on a local server on an Ubuntu system. Below are the prototype
interfaces that were designed using the Streamlit framework.

5.2.1 Merged web tool with Pima Indians dataset selection.

5.2.2 Merged web tool with Early Diabetes Disease dataset selection.

5.3 System testing and results.
5.3.1 Sample results of prediction on Pima Indians dataset regarding positive results.

5.3.2 Sample results of prediction on Pima Indians dataset regarding negative results.

5.3.3 Sample results of prediction on Early Diabetes Disease dataset regarding positive
results.

5.3.4 Sample results of prediction on Early Diabetes Disease dataset regarding negative
results.

6 Conclusions and Recommendations.
6.1 Conclusions.
Makerere University sends out students for field attachment with the main objective of enabling
students to get hands-on, real-life experience in the environments they are expected to work in
after graduation. Although I did my internship virtually under the Data Science and AI Lab, the lab
was well prepared to take on any student for field attachment, since it had experts in the field
to help and guide interns while they carried out their projects.

I was exposed to new machine learning technologies such as scikit-learn, and to working in the
Anaconda environment, which integrates many tools such as JupyterLab and Jupyter Notebook, Visual
Studio Code and Orange. I used some of them to work on my project.

Summarized below are some of the strengths and weaknesses noted during the internship:
Strengths.
The field attachment helped me to apply the knowledge taught at the university to the field of work
by working on a real-life project, namely Diabetes disease prediction using a web tool.
My field supervisor was very helpful and offered great guidance while I worked on my project. This
helped me gain a lot of new knowledge and skills, as indicated throughout this report.

Weaknesses.
On the side of the university, academic supervisors were allocated late, which in turn delayed the
supervision schedules.
The internship period clashed with the students' semester workload.

6.2 Recommendations.
On the basis of the findings and conclusions drawn from this field attachment, I recommend that:

6.2.1 Recommendations to the University.


ˆ The university should urgently restructure the curriculum offerings to meet the requirements of
the labour market.

ˆ The course CSC 2303: Field Attachment should be shifted to the third year of study so that it
can be given more time, at least six months. Many students have concluded that two-month
internships are too short.

ˆ Students' teaching and learning resources should be improved, especially the tools for
practicals, lecture room capacity, laboratories and workshops.

ˆ ICT should be introduced into both the teaching and learning activities of every university, so
that both staff and students can acquire the much needed ICT knowledge and skills.

ˆ The University should keep good records of its graduates for feedback purposes, while academic
departments should liaise with employers for information on the former students they employ.

6.2.2 Recommendation to future interns.
Future interns should build good relationships with their supervisors. Such relationships are
pivotal to the successful completion of our degrees, because supervisors provide expert guidance
in our research and our fields of study.

7 References and Appendices.
7.1 References.

[1] https://www.who.int/news-room/fact-sheets/detail/diabetes.

[2] Fayyad, Usama; Piatetsky-Shapiro, Gregory; Smyth, Padhraic (1996), ”From Data Mining to
Knowledge Discovery in Databases”.

[3] Kaelbling, Leslie P.; Littman, Michael L.; Moore, Andrew W. (1996). ”Reinforcement Learning:
A Survey”. Journal of Artificial Intelligence Research. 4: 237–285.

[4] Alan Agresti Department of Statistics University of Florida, Gainesville, Florida, An Introduc-
tion to Categorical Data Analysis 2 nd Edition, (2007).

[5] Cortes C, Vapnik VN. Support-vector networks. Mach Learn. 1995;20(3): 273–97

[6] Jegan, Chitra. (2013). Classification Of Diabetes Disease Using Support Vector Machine. Inter-
national Journal of Engineering Research and Applications. 3. 1797 - 1801.

[7] Ho, Tin Kam (1995). Random Decision Forests (PDF). Proceedings of the 3rd International
Conference on Document Analysis and Recognition, Montreal, QC, 14–16 August 1995. pp.
278–282.

[8] Somu N, Raman MR, Kirthivasan K, Sriram VS. Hypergraph Based Feature Selection Technique
for Medical Diagnosis. J Med Syst. 2016;40(11):239.

[9] A. Al Jarullah, ”Decision tree discovery for the diagnosis of type II diabetes,” 2011 International
Conference on Innovations in Information Technology, Abu Dhabi, 2011, pp. 303-307

[10] Bishop, C. M. 2006. Pattern Recognition and Machine Learning. Springer.

[11] D. Deng and N. Kasabov, “ On-line pattern analysis by evolving self- organizing maps”, In
Proceedings of the fifth biannual conference on artificial neural networks and expert systems
(ANNES), 2001, pp. 46-51.

[12] Yue, et al. “ An Intelligent Diagnosis to Type 2 Diabetes Based on QPSO Algorithm and
WLSSVM,” International Symposium on Intelligent Information Technology Application Work-
shops, IEEE Computer Society, 2008.

[13] Luukka, Pasi. (2011) ‘Feature selection using fuzzy entropy measures with similarity classifier’,
Expert Systems with Applications, Elsevier, Vol. 38, pp. 4600–4607.

[14] Seera, Manjeevan., Lim, Chee Peng. (2014) ‘A hybrid intelligent system for medical data classi-
fication’, Expert Systems with Applications, Elsevier, Vol. 41pp. 2239–2249.

[15] Choubey, Dilip Kumar., Paul, Sanchita. (2016) ‘GA MLP NN: A Hybrid Intelligent System
for Diabetes Disease Diagnosis’, International Journal of Intelligent Systems and Applications
(IJISA), MECS, ISSN: 2074–904X (Print), ISSN: 2074–9058. (Online), Vol. 8, No. 1, pp.49–59.

[16] Choubey, D.K., Paul, S., Kumar, S., & Kumar, S. (2016). Classification of Pima indian diabetes
dataset using naive bayes with genetic algorithm as an attribute selection.

[17] Deo, R.C. Machine Learning in Medicine. Circulation 2015, 132, 1920–1930.

[18] Kandhasamy, J.P.; Balamurali, S. Performance Analysis of Classifier Models to Predict Diabetes
Mellitus. Procedia Comput. Sci. 2015, 47, 45–51.

[19] Yuvaraj, N.; SriPreethaa, K.R. Diabetes prediction in healthcare systems using machine learning
algorithms on Hadoop cluster. Clust. Comput. 2017, 22, 1–9.

[20] Tafa, Z.; Pervetica, N.; Karahoda, B. An intelligent system for diabetes prediction. In Pro-
ceedings of the 2015 4th Mediterranean Conference on Embedded Computing (MECO), Budva,
Montenegro, 14–18 June 2015; pp. 378–382.

[21] Sisodia, D.; Sisodia, D.S. Prediction of Diabetes using Classification Algorithms. Procedia Com-
put. Sci. 2018, 132, 1578–1585.

[22] Mercaldo, F.; Nardone, V.; Santone, A. Diabetes Mellitus Affected Patients Classification and
Diagnosis through Machine Learning Techniques. Procedia Comput. Sci. 2017, 112, 2519–2528.

[23] Negi, A.; Jaiswal, V. A first attempt to develop a diabetes prediction method based on different
global datasets. In Proceedings of the 2016 Fourth International Conference on Parallel, Dis-
tributed and Grid Computing (PDGC), Waknaghat, India, 22–24 December 2016; pp. 237–241.

[24] Olaniyi, E.O.; Adnan, K. Onset diabetes diagnosis using artificial neural network. Int. J. Sci.
Eng. Res. 2014, 5, 754–759.

[25] Soltani, Z.; Jafarian, A. A New Artificial Neural Networks Approach for Diagnosing Diabetes
Disease Type II. Int. J. Adv. Comput. Sci. Appl. 2016, 7, 89–94.

[26] Somnath, R.; Suvojit, M.; Sanket, B.; Riyanka, K.; Priti, G.; Sayantan, M.; Subhas, B. Pre-
diction of Diabetes Type-II Using a Two-Class Neural Network. In Proceedings of the 2017
International Conference on Computational Intelligence, Communications, and Business Ana-
lytics, Kolkata, India, 24–25 March 2017; pp. 65–71.

[27] Mamuda, M.; Sathasivam, S. Predicting the survival of diabetes using neural network. In Pro-
ceedings of the AIP Conference Proceedings, Bydgoszcz, Poland, 9–11 May 2017; Volume 1870,
pp. 40–46.

[28] Ashiquzzaman, A.; Kawsar Tushar, A.; Rashedul Islam, M.D.; Shon, D.; Kichang, L.M.; Jeong-
Ho, P.; Dong- Sun, L.; Jongmyon, K. Reduction of overfitting in diabetes prediction using
deep learning neural network. In IT Convergence and Security; Lecture Notes in Electrical
Engineering; Springer: Singapore, 2017; Volume 449.

[29] Swapna, G.; Soman, K.P.; Vinayakumar, R. Automated detection of diabetes using CNN and
CNN-LSTM network and heart rate signals. Procedia Comput. Sci. 2018, 132, 1253–1262.

[30] Mohebbi, A.; Aradóttir, T.B.; Johansen, A.R.; Bengtsson, H.; Fraccaro, M.; Mørup, M. A
deep learning approach to adherence detection for type 2 diabetics. In Proceedings of the 2017
39th Annual International Conference of the IEEE Engineering in Medicine and Biology Society
(EMBC), Jeju, Korea, 11–15 July 2017; pp. 2896–2899.

[31] Miotto, R.; Li, L.; Kidd, B.A.; Dudley, J.T. Deep Patient: An Unsupervised Representation to
Predict the Future of Patients from the Electronic Health Records. Sci. Rep. 2016, 6, 26094.

[32] Pham, T.; Tran, T.; Phung, D.; Venkatesh, S. Predicting healthcare trajectories from medical
records: A deep learning approach. J. Biomed. Informatics 2017, 69, 218–229.

[33] Balaji, H.; Iyengar, N.; Caytiles, R.D. Optimal Predictive analytics of Pima Diabetics using
Deep Learning. Int. J. Database Theory Appl. 2017, 10, 47–62.

[34] Lekha, S.; Suchetha, M. Real-Time Non-Invasive Detection and Classification of Diabetes Using
Modified Convolution Neural Network. IEEE J. Biomed. Health Inform. 2018, 22, 1630–1636.

[35] Kar, A.K. Bio inspired computing—A review of algorithms and scope of applications. Expert
Syst. Appl. 2016, 59, 20–32.

7.2 Appendices.
Merged web tool with Pima Indians dataset selection.

Merged web tool with Early Diabetes Disease dataset selection.

Sample results of prediction on Pima Indians dataset regarding positive results.

Sample results of prediction on Pima Indians dataset regarding negative results.

Sample results of prediction on Early Diabetes Disease dataset regarding positive results.

Sample results of prediction on Early Diabetes Disease dataset regarding negative results.

7.3 My Github repository.
https://github.com/Group4Day2019/ML Internship2021

Student's Weekly Progress Report
Name: Ian Mugume Reg No: 18/U/25478/PS
Course: Bachelor of Science in Computer Science

Report Date: 02-07-2021
Date Submitted: 29-07-2021
Task Completed: Concept paper write up
Task in Progress: Exploratory data analysis (EDA)
Next Week's Task: Exploratory data analysis (EDA)
Problems / Challenges: Challenge of understanding some machine learning concepts. Confusing some
machine learning concepts for other terms. Time constraint.
Field Supervisor Comments: I was impressed by the depth of the concept note for the experiments.
Iann defined up to 5 different baseline methods that would be used for the task and this is
comprehensive in terms of modelling an optimum solution for the problem he is working on

Report Date: 09-07-2021
Date Submitted: 08-07-2021
Task Completed: Exploratory data analysis.
Task in Progress: Data processing. Model building selecting the best model
Next Week's Task: Data processing. Model building selecting the best model
Problems / Challenges: Hardships in understanding of some machine learning concepts and libraries.
coding in python language.
Field Supervisor Comments: EDA was completed and the work was pushed to GitHub. I am impressed by
the visualizations presented for the data and comments on possible improvements were made. Also a
good understanding of using Git has made it easy to keep track of the progress of the work done

Report Date: 16-07-2021
Date Submitted: 14-07-2021
Task Completed: Data preprocessing. Model building. Organisation of respective jupyter notebooks
in their folders on my GitHub repository using git. Classification reports for each model. A
document reporting accuracies of different models per dataset. AUC curves. Improving model
performance using tuning, pipelines, standardizing data, Ensemble methods
Task in Progress: XGB algorithm. Saving the trained model using pickle or joblib. Developing the
web application
Next Week's Task: XGB algorithm. Saving the trained model using pickle or joblib. Developing the
web application
Problems / Challenges: Confusing some machine learning concepts. Difficulty in coding some things
in python. Time management. Taking lots of time researching things I don't know about.
Field Supervisor Comments: Iann has been proactive, he has managed to accomplish the next weeks
tasks within this week. The changes that he was asked to make have been effected and this will
give him time to optimize the experiments.

Report Date: 23-07-2021
Date Submitted: 22-07-2021
Task Completed: Saving the trained models to files using pickle for both datasets. XGB algorithm
for both datasets. Web applications for both datasets.
Task in Progress: Final report document writing.
Next Week's Task: Final report document writing.
Problems / Challenges: I having faced challenges coming up with developing both web applications
ie in terms of learning Streamlit which is new to me and also writing proper python code to
interpret pickle file contents for correct predictions. I have spent lots of time researching on
the internet watching tutorials, reading books to come up with a better algorithm and a web
application for better results. Challenges of internet connection too during my research. Limited
time schedule.
Field Supervisor Comments: I was impressed with the selected deployment strategy for the models.
Iann went with existing simple framework as opposed to building new interfaces from scratch. This
is a very good for testing and making rapid changes as the interface and the models are lightly
coupled.

Report Date: 30-07-2021
Date Submitted: 28-07-2021
Task Completed: Presentation of the 2 web applications for Diabetes disease prediction via zoom
meeting.
Task in Progress: Writing the final report document.
Next Week's Task: Writing the final report document.
Problems / Challenges: Poor internet connection while presenting via zoom. Problem explaining some
machine learning concepts due to confusion and mix ups.
Field Supervisor Comments: Iann made a very good presentation of the entire project to his
academic supervisor showcasing the cancer prediction tool. We made a few comments and observations
for areas that could be changed but overall we were very impressed by the outcome of the project.

Report Date: 06-08-2021
Date Submitted: 03-08-2021
Task Completed: Report document writing.
Task in Progress: Report document review and making corrections were necessary.
Next Week's Task: Report document review and making corrections were necessary.
Problems / Challenges: Hardships in working with two datasets while writing the report document
since each has to be explained in great detail.
Field Supervisor Comments: The recommended changes have been made and the report has been
completed for the project. I am satisfied with the documentation of the EDA, experimentation and
results as reflected in the report. The GitHub repository has also been updated to reflect the
recommended changes in the experiment