
A

Mini Skill Based Project Report


On
Machine Learning & Optimization (270404)
In fulfilment of the requirement for the award of the degree

SUBMITTED BY
Ankit Sharma(0901AD211006)
Ayush Goyal (0901AD211007)
Ayushi Verma (0901AD211008)
Chandan Jat (0901AD211009)
Devanshi Rathore(0901AD211010)

4th SEMESTER
Artificial Intelligence And Data Science
SUBMITTED TO
Prof. Vibha Tiwari

Department Of Information Technology

Madhav Institute of Technology and Science, Gwalior


(A Govt. Aided UGC Autonomous & NAAC Accredited Institute Affiliated to RGPV, Bhopal)
Session: 2023
DECLARATION

I hereby declare that the mini skill based project for the course Machine Learning &
Optimization (270404) is being submitted in the partial fulfilment of the requirement for the
award of Bachelor of Technology in Artificial Intelligence And Data Science.
All the information in this document has been obtained and presented in accordance with
academic rules and ethical conduct.
Date : 15-03-2023
Place: Gwalior

Ankit Sharma(0901AD211006)
Ayush Goyal (0901AD211007)
Ayushi Verma (0901AD211008)
Chandan Jat (0901AD211009)
Devanshi Rathore(0901AD211010)
ACKNOWLEDGEMENT

I would like to express my greatest appreciation to all the individuals who have helped and
supported me throughout this lab file. I am thankful to the whole Information Technology
department for their ongoing support during the experiments, from initial advice and provision
of contacts in the first stages through ongoing advice and encouragement, which led to the final
report of this lab file.
A special acknowledgement goes to my colleagues who helped me in completing the file by
exchanging interesting ideas to deal with problems and by sharing their experience.
I also wish to thank our professor Vibha Tiwari for her undivided support and interest,
which inspired and encouraged me to go my own way and without which I would have been unable
to complete my project.
Finally, I want to thank my friends who showed appreciation for my work and motivated
me to continue it.

Ankit Sharma(0901AD211006)
Ayush Goyal (0901AD211007)
Ayushi Verma (0901AD211008)
Chandan Jat (0901AD211009)
Devanshi Rathore(0901AD211010)
CHAP 1 – INTRODUCTION
1.1 PROBLEM STATEMENT
Diabetes is a chronic illness that arises when the body cannot produce enough insulin or
cannot effectively use the insulin that it produces [1]. The effects of diabetes mellitus include
long-term damage, dysfunction and failure of various organs (WHO). As a result, it has
significantly increased mortality in patients. There are mainly two types of diabetes: Type
I (T1) and Type II (T2). T1 occurs when the body is no longer able to produce insulin;
it is common in childhood and is also known as juvenile diabetes. This form of
diabetes is less common; only about 5-10% of people with diabetes have T1 (American
Diabetes Association, 2010). T2 occurs when the body is unable to utilize the insulin
produced or not enough insulin is produced [9, 10, 11]. In addition, there is another type
of diabetes named gestational diabetes, which develops during pregnancy. Too much
glucose in the blood can damage the eyes, kidneys, and nerves. It can also cause heart disease,
stroke, and insufficient blood flow to the legs. Being overweight, lack of exercise, family history
and stress increase the risk of diabetes [14, 15]. In Bangladesh, people are not
conscious about health. There are 7.1 million cases of diabetes in Bangladesh, and the
prevalence continues to rise. People do not know about the disease and they do not get
tested for it.
Regression is a supervised learning technique in machine learning that is used for
prediction by learning a relationship between existing statistical data and a target
value, i.e., Sale Price in this case. Different factors are taken into consideration while
predicting the worth of a house, such as location, neighbourhood and amenities like
garage space. Learning is applied to these parameters with target values for a certain
geographical region, as different areas differ in land price, housing style, material
used and availability of public utilities.

1.2 Conceptual Background of the Domain Problem

The domain problem of a machine learning binary classification model to predict whether a
person is diabetic or not falls under the umbrella of healthcare and medical informatics.
Diabetes is a chronic condition that affects the way the body processes blood sugar, and it can
lead to serious complications such as heart disease, stroke, and kidney failure if left untreated.

The goal of a binary classification model in this domain is to accurately predict whether a
person has diabetes based on their medical history, physical examination, and other relevant
factors. The model would be trained on a dataset of individuals with and without diabetes, and
it would use features such as age, body mass index (BMI), blood pressure, and blood glucose
levels to make its predictions.
The development of a binary classification model for diabetes diagnosis is important because
it can help healthcare providers to make more accurate and timely diagnoses, which in turn can
lead to better treatment outcomes and improved quality of life for patients. Additionally, such
models can help identify high-risk individuals who may benefit from preventive interventions
or lifestyle modifications to reduce their risk of developing diabetes.

1.3 Motivation for the Problem Undertaken


The project was assigned to our group by Prof. Vibha Tiwari as a part of the mini skill based
project. The exposure to real-world data and the opportunity to apply our skillset to
solving a real-world problem has been the primary motivation.
Our main objective of doing this project is to build a model to predict the house prices
with the help of other supporting features. In order to improve the selection of
customers, the client wants some predictions that could help them in further
investment and improvement in the selection of customers.
The motivation for developing a machine learning binary classification model to predict if a
person is diabetic or not is primarily driven by the need to improve healthcare outcomes for
individuals with diabetes. Diabetes is a chronic disease that affects millions of people
worldwide, and it can lead to serious complications such as heart disease, stroke, kidney failure,
and blindness if not managed properly. Early detection and treatment of diabetes are critical to
preventing these complications and improving the quality of life for people with diabetes.
Chap 2- Analytical Problem Formulation

2.1 Mathematical / Analytical Modelling of the Problem

The goal of this report is to find a model that predicts diabetes with better accuracy. We
experimented with different classification and ensemble algorithms to predict diabetes. In the
following, we briefly discuss each phase.

2.2 Data Sources and their formats


The data is gathered from the UCI repository and is named the Pima Indian Diabetes Dataset.
The dataset has several attributes for each of 768 patients. The 9th attribute is the class
variable of each data point. This class variable takes the value 0 or 1, indicating a negative
or positive outcome for diabetes respectively. The dataset shape is (768, 9).
2.3. Data Pre-processing
Data preprocessing is the most important step. Healthcare data often contains missing
values and other impurities that reduce the effectiveness of the data. Preprocessing is done to
improve the quality and effectiveness of the results obtained from the mining process, and it is
essential for accurate results and successful prediction when machine learning techniques are
applied to the dataset. For the Pima Indian diabetes
dataset we need to perform pre-processing in two steps:
1). Missing values removal - Remove all the instances that have zero (0) as a value where a
zero value is not possible. Such instances are therefore eliminated. By eliminating irrelevant
features/instances we form a feature subset; this process is called feature subset selection,
which reduces the dimensionality of the data and helps the model work faster.
2). Splitting of data - After cleaning, the data is split into training and testing sets and
normalized. Once the data is split, we train the algorithm on the training data set and keep the
test data set aside. The training process produces a trained model based on the logic of the
algorithm and the values of the features in the training data. The aim of normalization is to
bring all the attributes to the same scale.
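
The two steps above can be sketched in plain Python as follows. This is only an illustrative
sketch, not the code used in the project: the rows, the column choice, the helper names
(drop_zero_rows, train_test_split, fit_min_max, apply_min_max), the 25% test fraction and the
use of min-max scaling are all assumptions made for the example. The scaling ranges are
computed on the training rows only, so that no information from the test set leaks into
training.

```python
import random

# Hypothetical rows: (glucose, blood_pressure, bmi, age, outcome).
# The values are made up for illustration and are not taken from the dataset.
rows = [
    (148, 72, 33.6, 50, 1),
    (85, 66, 26.6, 31, 0),
    (183, 0, 23.3, 32, 1),   # a blood pressure of 0 is not possible
    (89, 66, 28.1, 21, 0),
    (0, 40, 43.1, 33, 1),    # a glucose level of 0 is not possible
    (116, 74, 25.6, 30, 0),
    (78, 50, 31.0, 26, 1),
    (115, 70, 35.3, 29, 0),
]
NUM_FEATURES = 4  # the last column is the class label


def drop_zero_rows(data):
    """Step 1: keep only rows in which every feature value is non-zero."""
    return [r for r in data if all(v != 0 for v in r[:NUM_FEATURES])]


def train_test_split(data, test_fraction=0.25, seed=0):
    """Step 2: shuffle a copy of the rows and split it into (train, test)."""
    shuffled = data[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def fit_min_max(train):
    """Compute the (min, max) of each feature column on the training rows."""
    columns = zip(*[r[:NUM_FEATURES] for r in train])
    return [(min(c), max(c)) for c in columns]


def apply_min_max(data, ranges):
    """Scale features with the training ranges; returns (features, label) pairs."""
    scaled = []
    for r in data:
        feats = [
            (v - lo) / (hi - lo) if hi > lo else 0.0
            for v, (lo, hi) in zip(r[:NUM_FEATURES], ranges)
        ]
        scaled.append((feats, r[NUM_FEATURES]))
    return scaled


clean = drop_zero_rows(rows)
train, test = train_test_split(clean)
ranges = fit_min_max(train)
train_scaled = apply_min_max(train, ranges)
test_scaled = apply_min_max(test, ranges)
print(len(rows), "rows before cleaning,", len(clean), "after")
```

Test rows scaled with the training ranges can fall slightly outside [0, 1]; that is expected
and is the price of keeping the test set untouched during fitting.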

2.4. Data Inputs- Logic- Output Relationships


A correlation heatmap is plotted to understand the relationship between the target
feature and the independent features. To gain insight into the relationship between input and
output, different types of visualization are plotted, which we will see in the EDA section of
this report.
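
The numbers behind such a heatmap are pairwise Pearson correlation coefficients. The sketch
below shows one way to compute them in plain Python; the column values and the helper names
(pearson, correlation_matrix) are hypothetical and chosen only for illustration, and the
plotting step itself is omitted.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # a constant column has no defined correlation; report 0
    return cov / (norm_x * norm_y)


def correlation_matrix(columns):
    """Nested dict holding the correlation of every pair of named columns."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Made-up values for illustration only.
columns = {
    "glucose": [148, 85, 183, 89, 137, 116],
    "bmi": [33.6, 26.6, 23.3, 28.1, 43.1, 25.6],
    "outcome": [1, 0, 1, 0, 1, 0],
}
matrix = correlation_matrix(columns)
for name, row in matrix.items():
    print(name, {other: round(value, 2) for other, value in row.items()})
```

Each cell of the heatmap corresponds to one entry of this matrix; the row for the target
column shows how strongly each independent feature moves with the outcome.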
2.5. Hardware & Software Requirements with Tools Used
Hardware Used -

1. Processor — Intel i5 processor

2. RAM—8GB

Software utilized:
1. Anaconda Jupyter Notebook
2. Google Colab – for hyperparameter tuning

Libraries Used – General libraries used for data wrangling


Chap. 3 Models Development & Evaluation
3.1 Identification of Possible Problem Solving Approach

There are several machine learning algorithms that can be used to solve the binary classification
problem of predicting whether a person is diabetic or not based on the given features:

1. Logistic Regression: This algorithm is widely used for binary classification problems
and is particularly suitable when the number of features is small. It models the
probability of an instance belonging to a certain class using a logistic function.
2. Decision Trees: This algorithm is useful for feature selection as it selects the most
informative features to make a hierarchical decision. Decision trees are easy to interpret
and can handle both categorical and numerical data.
3. Random Forest: This algorithm is an ensemble method that combines multiple
decision trees to improve the model's performance and reduce overfitting. It is
particularly useful for high-dimensional data.
4. Support Vector Machines (SVM): This algorithm is a powerful method for binary
classification and can handle both linear and non-linear decision boundaries. It is
particularly useful when the number of features is large compared to the number of
instances.
5. Naive Bayes: This algorithm is based on Bayes' theorem and is particularly useful when
the number of features is large. It is a probabilistic algorithm that models the joint
probability distribution of the features and the target variable.
6. Neural Networks: This algorithm is a powerful method for binary classification and
can model complex non-linear relationships between features and the target variable. It
is particularly useful for high-dimensional data and can handle both numerical and
categorical data.

3.2 Testing of Identified Approaches (Algorithms)

The problem statement specifies the Logistic Regression algorithm for building the machine
learning model that predicts whether a person is diabetic or not. We therefore use Logistic
Regression in this project.
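
To illustrate how logistic regression turns features into a class prediction, the following is a
minimal sketch trained with batch gradient descent in plain Python. It is not the
implementation used in this project: the toy data, the learning rate, the number of epochs and
the function names (sigmoid, train_logistic, predict_proba, predict) are assumptions made for
the example.

```python
import math


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_proba(x, w, b):
    """Estimated probability that x belongs to the positive class."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


def predict(x, w, b, threshold=0.5):
    """Class label: 1 if the estimated probability reaches the threshold."""
    return 1 if predict_proba(x, w, b) >= threshold else 0


def train_logistic(X, y, lr=0.5, epochs=3000):
    """Fit weights and bias by minimising the mean log-loss with gradient descent."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * d
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = predict_proba(xi, w, b) - yi  # derivative of log-loss w.r.t. z
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


# Toy, already-scaled data with two hypothetical features per person.
X = [[0.1, 0.2], [0.2, 0.1], [0.3, 0.3], [0.7, 0.8], [0.8, 0.6], [0.9, 0.9]]
y = [0, 0, 0, 1, 1, 1]

w, b = train_logistic(X, y)
preds = [predict(x, w, b) for x in X]
print("weights:", [round(v, 2) for v in w], "bias:", round(b, 2))
print("predictions:", preds)
```

The 0.5 threshold is the usual default; lowering it trades precision for recall, which can be
relevant when missing a diabetic patient is costlier than a false alarm.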

3.3 KEY METRICS FOR SUCCESS IN SOLVING PROBLEM UNDER CONSIDERATION

The key metrics for success when solving the problem of binary classification using logistic
regression on the Pima Indians Diabetes dataset are:

1. Accuracy: This metric measures the percentage of correctly classified instances, i.e.,
the ratio of the number of true positives and true negatives to the total number of
instances. High accuracy indicates that the model is able to predict the correct class for
most instances.
2. Precision: This metric measures the proportion of true positives among the instances
predicted as positive, i.e., the ratio of the number of true positives to the number of true
positives and false positives. High precision indicates that the model makes few false
positive predictions.
3. Recall: This metric measures the proportion of true positives among the instances that
are actually positive, i.e., the ratio of the number of true positives to the number of true
positives and false negatives. High recall indicates that the model can identify most
positive instances.
4. F1-score: This metric is the harmonic mean of precision and recall and provides a
single measure of the model's performance. It balances precision and recall and is
particularly useful when the classes are imbalanced.
5. AUC Score: This metric measures the area under the ROC curve and provides a single
measure of the model's performance in terms of true positive rate (recall) and false
positive rate. A high AUC score indicates that the model can distinguish between
positive and negative instances well.
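
The five metrics above can be computed directly from the true labels, the predicted labels and
the predicted scores. The sketch below uses made-up labels and scores purely for illustration;
the function names are hypothetical. The AUC is computed through its pairwise interpretation:
the fraction of (positive, negative) pairs in which the positive instance receives the higher
score, with ties counted as one half.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels in {0, 1}."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn


def safe_div(a, b):
    """Division that returns 0.0 when the denominator is zero."""
    return a / b if b else 0.0


def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1-score as a dict."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    return {
        "accuracy": safe_div(tp + tn, tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "f1": safe_div(2 * precision * recall, precision + recall),
    }


def auc_score(y_true, scores):
    """Area under the ROC curve via pairwise comparison of scores."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative instance")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up labels and predicted probabilities for eight instances.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
scores = [0.9, 0.4, 0.35, 0.8, 0.1, 0.6, 0.7, 0.2]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

metrics = classification_metrics(y_true, y_pred)
auc = auc_score(y_true, scores)
print(metrics)  # all four values are 0.75 for this toy data
print(auc)      # 0.875
```

Unlike the other four metrics, the AUC does not depend on the 0.5 threshold, because it is
computed from the scores rather than from the thresholded labels.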

In summary, success in solving the problem using logistic regression can be measured
by achieving high accuracy, precision, recall, F1-score, and AUC score on the testing
set, as well as avoiding overfitting and underfitting. Additionally, interpretability of the
model's predictions can provide insights into the factors that contribute most to the
target variable, which can be useful for domain experts.

3.4 RUN AND EVALUATE SELECTED MODEL


Logistic Regression
VISUALISATIONS
Conclusions
The machine learning model based on the Logistic Regression algorithm, used for predicting
whether a person is diabetic or not, achieves an accuracy greater than 77%, which is a good
accuracy rate. The model is therefore suitable for predicting whether a person is diabetic
or not.
