A Mini Skill Based Project Report On: Machine Learning & Optimization (270404)
SUBMITTED BY
Ankit Sharma (0901AD211006)
Ayush Goyal (0901AD211007)
Ayushi Verma (0901AD211008)
Chandan Jat (0901AD211009)
Devanshi Rathore (0901AD211010)
4th SEMESTER
Artificial Intelligence And Data Science
SUBMITTED TO
Prof. Vibha Tiwari
I hereby declare that the mini skill based project for the course Machine Learning &
Optimization (270404) is being submitted in partial fulfilment of the requirements for the
award of Bachelor of Technology in Artificial Intelligence And Data Science.
All the information in this document has been obtained and presented in accordance with
academic rules and ethical conduct.
Date : 15-03-2023
Place: Gwalior
Ankit Sharma (0901AD211006)
Ayush Goyal (0901AD211007)
Ayushi Verma (0901AD211008)
Chandan Jat (0901AD211009)
Devanshi Rathore (0901AD211010)
ACKNOWLEDGEMENT
I would like to express my greatest appreciation to all the individuals who have helped and
supported me throughout this lab file. I am thankful to the whole Information Technology
department for their ongoing support during the experiments, from initial advice and provision
of contacts in the first stages through continued advice and encouragement, which led to the final
report of this lab file.
A special acknowledgement goes to my colleagues, who helped me complete the file by
exchanging interesting ideas for dealing with problems and by sharing their experience.
I also wish to thank our professor, Vibha Tiwari, for her undivided support and interest,
which inspired me and encouraged me to go my own way, and without whom I would have been
unable to complete my project.
Finally, I want to thank my friends, who showed appreciation for my work and motivated
me to continue it.
Ankit Sharma (0901AD211006)
Ayush Goyal (0901AD211007)
Ayushi Verma (0901AD211008)
Chandan Jat (0901AD211009)
Devanshi Rathore (0901AD211010)
CHAP 1 – INTRODUCTION
1.1 PROBLEM STATEMENT
Diabetes is a chronic illness that arises when the body cannot produce insulin or
cannot use the insulin that it produces [1]. The effects of diabetes mellitus include
long-term damage, dysfunction and failure of various organs (WHO). As a result, it
significantly increases mortality in patients. There are two main types of diabetes: Type
I (T1) and Type II (T2). T1 occurs when the body is no longer able to produce insulin; it
is common in childhood and is also known as juvenile diabetes. This form of
diabetes is less common; only about 5-10% of people with diabetes have T1 (American
Diabetes Association, 2010). T2 occurs when the body is unable to utilize the insulin
produced or when not enough insulin is produced [9, 10, 11]. In addition, there is another
type of diabetes, gestational diabetes, which develops during pregnancy. Too much
glucose in the blood can damage the eyes, kidneys, and nerves. It can also cause heart disease,
stroke, and insufficient blood flow to the legs. Being overweight, lack of exercise, family
history and stress increase the risk of diabetes [14, 15]. In Bangladesh, where public
awareness of health is low, there are 7.1 million cases of diabetes, and the number is
rising. Many people are unaware of the condition and do not get tested for it.
Regression is a supervised learning technique in machine learning that is used for
prediction: it learns a relationship between existing statistical data and a target
value. A common example is house-price prediction, where the target is the sale price and
factors such as location, neighbourhood and amenities like garage space are taken into
consideration. Such a model is usually trained on data with target values for a particular
geographical region, because areas differ in land price, housing style, materials
used and availability of public utilities.
The problem addressed here, building a binary classification model that predicts whether a
person is diabetic or not, falls under the umbrella of healthcare and medical informatics.
Diabetes is a chronic condition that affects the way the body processes blood sugar, and it can
lead to serious complications such as heart disease, stroke, and kidney failure if left untreated.
The goal of a binary classification model in this domain is to accurately predict whether a
person has diabetes based on their medical history, physical examination, and other relevant
factors. The model would be trained on a dataset of individuals with and without diabetes, and
it would use features such as age, body mass index (BMI), blood pressure, and blood glucose
levels to make its predictions.
The development of a binary classification model for diabetes diagnosis is important because
it can help healthcare providers to make more accurate and timely diagnoses, which in turn can
lead to better treatment outcomes and improved quality of life for patients. Additionally, such
models can help identify high-risk individuals who may benefit from preventive interventions
or lifestyle modifications to reduce their risk of developing diabetes.
The goal of this paper is to find a model that predicts diabetes with better accuracy. We
experimented with different classification and ensemble algorithms to predict diabetes. In the
following, we briefly discuss the phases.
2. RAM: 8 GB
Software utilized:
1. Anaconda Jupyter Notebook
2. Google Colab – for hyperparameter tuning
There are several machine learning algorithms that can be used to solve the binary classification
problem of predicting whether a person is diabetic or not based on the given features:
1. Logistic Regression: This algorithm is widely used for binary classification problems
and is particularly suitable when the number of features is small. It models the
probability of an instance belonging to a certain class using a logistic function.
2. Decision Trees: This algorithm splits the data on the most informative features to
build a hierarchy of decisions, which also makes it useful for feature selection.
Decision trees are easy to interpret and can handle both categorical and numerical data.
3. Random Forest: This algorithm is an ensemble method that combines multiple
decision trees to improve the model's performance and reduce overfitting. It is
particularly useful for high-dimensional data.
4. Support Vector Machines (SVM): This algorithm is a powerful method for binary
classification and can handle both linear and non-linear decision boundaries. It is
particularly useful when the number of features is large compared to the number of
instances.
5. Naive Bayes: This algorithm is based on Bayes' theorem and is particularly useful when
the number of features is large. It is a probabilistic algorithm that models the joint
probability distribution of the features and the target variable, under the assumption
that the features are conditionally independent given the class.
6. Neural Networks: This algorithm is a powerful method for binary classification and
can model complex non-linear relationships between features and the target variable. It
is particularly useful for high-dimensional data and can handle both numerical and
categorical data.
The problem statement specifies that the Logistic Regression algorithm must be used to build
the machine learning model that predicts whether a person is diabetic or not. We therefore
use the Logistic Regression algorithm in this project.
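To make the chosen algorithm concrete, the following is a minimal sketch of logistic regression written in plain Python. It is not the code used in this project: the function names, the learning rate, the number of epochs and the two-feature toy rows are all assumptions made for illustration, and the toy values are invented rather than taken from the diabetes dataset. The sketch shows the two ideas the algorithm rests on: a logistic (sigmoid) function that turns a weighted sum of features into a probability, and gradient descent on the log-loss to fit the weights.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number to a probability in (0, 1)."""
    # Numerically stable form that avoids overflow for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict_proba(weights, bias, row):
    """Estimated probability that one feature row belongs to the positive class."""
    return sigmoid(bias + sum(w * x for w, x in zip(weights, row)))

def fit_logistic_regression(X, y, learning_rate=0.1, epochs=2000):
    """Fit weights and bias by batch gradient descent on the mean log-loss."""
    n_samples, n_features = len(X), len(X[0])
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for row, label in zip(X, y):
            # Gradient of the log-loss with respect to the linear score.
            error = predict_proba(weights, bias, row) - label
            for j, x in enumerate(row):
                grad_w[j] += error * x
            grad_b += error
        weights = [w - learning_rate * g / n_samples
                   for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n_samples
    return weights, bias

# Invented, already-scaled toy rows (glucose, bmi); 1 = diabetic, 0 = not diabetic.
X_toy = [[-1.2, -0.8], [-0.9, -1.1], [-1.0, -0.5],
         [0.8, 0.6], [0.9, 1.0], [1.3, 0.7]]
y_toy = [0, 0, 0, 1, 1, 1]

weights, bias = fit_logistic_regression(X_toy, y_toy)
# Classify as diabetic when the estimated probability is at least 0.5.
predicted = [1 if predict_proba(weights, bias, row) >= 0.5 else 0 for row in X_toy]
print(predicted)
```

In practice a library implementation would normally be preferred over a hand-written loop; the sketch is only meant to show what such an implementation computes.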
The key metrics for success when solving the problem of binary classification using logistic
regression on the Pima Indians Historical Diabetes dataset are:
1. Accuracy: This metric measures the percentage of correctly classified instances, i.e.,
the ratio of the number of true positives and true negatives to the total number of
instances. High accuracy indicates that the model is able to predict the correct class for
most instances.
2. Precision: This metric measures the proportion of true positives among the instances
predicted as positive, i.e., the ratio of the number of true positives to the number of true
positives and false positives. High precision indicates that the model makes few false
positive predictions.
3. Recall: This metric measures the proportion of true positives among the instances that
are actually positive, i.e., the ratio of the number of true positives to the number of true
positives and false negatives. High recall indicates that the model can identify most
positive instances.
4. F1-score: This metric is the harmonic mean of precision and recall and provides a
single measure of the model's performance. It balances precision and recall and is
particularly useful when the classes are imbalanced.
5. AUC Score: This metric measures the area under the ROC curve and provides a single
measure of the model's performance in terms of true positive rate (recall) and false
positive rate. A high AUC score indicates that the model can distinguish between
positive and negative instances well.
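The five metrics above can be computed directly from true labels and model outputs. The sketch below is illustrative only: the helper names, the 0.5 decision threshold and the eight example labels and scores are assumptions, not results from this project. AUC is computed here through its pairwise interpretation, the fraction of (positive, negative) pairs in which the positive instance receives the higher score, with ties counted as one half.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels where 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

def safe_div(numerator, denominator):
    """Division that returns 0.0 instead of failing on an empty denominator."""
    return numerator / denominator if denominator else 0.0

def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1-score from predicted class labels."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    return {
        "accuracy": safe_div(tp + tn, tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "f1": safe_div(2 * precision * recall, precision + recall),
    }

def auc_score(y_true, scores):
    """AUC as the share of positive/negative pairs ranked correctly (ties = 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative instance")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Invented example: true labels and predicted probabilities for eight people.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.2, 0.1, 0.7]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

metrics = classification_metrics(y_true, y_pred)
auc = auc_score(y_true, scores)
print(metrics, auc)
```

Accuracy, precision, recall and F1-score depend on the chosen threshold, whereas AUC is computed from the scores themselves and therefore does not.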
In summary, success in solving the problem using logistic regression can be measured
by achieving high accuracy, precision, recall, F1-score, and AUC score on the testing
set, as well as avoiding overfitting and underfitting. Additionally, interpretability of the
model's predictions can provide insights into the factors that contribute most to the
target variable, which can be useful for domain experts.
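One common way to read a fitted logistic regression model is through its coefficients: the exponential of a coefficient is the odds ratio, the factor by which the odds of the positive class are multiplied when that feature increases by one unit and the others are held fixed. The sketch below illustrates this with hypothetical coefficients; the feature names and values are assumptions for illustration and are not results from this project. Ranking by absolute coefficient is only meaningful when the features have been put on a comparable scale, for example by standardization.

```python
import math

# Hypothetical coefficients for standardized features; NOT results from this report.
coefficients = {"glucose": 1.10, "bmi": 0.65, "age": 0.30, "blood_pressure": -0.05}

def odds_ratios(coefs):
    """exp(coefficient): multiplicative change in the odds per one-unit increase."""
    return {name: math.exp(beta) for name, beta in coefs.items()}

def rank_by_magnitude(coefs):
    """Feature names ordered from largest to smallest absolute coefficient."""
    return sorted(coefs, key=lambda name: abs(coefs[name]), reverse=True)

ratios = odds_ratios(coefficients)
ranking = rank_by_magnitude(coefficients)
for name in ranking:
    print(f"{name}: coefficient={coefficients[name]:+.2f}, odds ratio={ratios[name]:.2f}")
```

An odds ratio above 1 means the feature raises the predicted odds of diabetes, and a value below 1 means it lowers them.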