20IT113 Logistic Regression ML


20IT107 IT359: Machine Learning

PRACTICAL: 4

Logistic Regression
Q1: Why do you want to apply classification to the selected dataset? Discuss the full story behind
the dataset.

Classification is a type of machine learning algorithm used to predict categorical outputs from input variables. Unlike regression, which predicts a continuous numerical output, classification predicts a categorical one, as in binary or multi-class problems; for example, predicting whether an email is spam is classification, while predicting its word count is regression. Classification applies to a dataset when the dependent variable is categorical and we want to predict the class of new observations from a set of input features.

The full story behind a dataset depends on the specific dataset and the context in which the data was gathered. Some common factors that provide insight into that story are:

Data collection: The way in which data was collected can provide valuable information about the dataset. For
example, if a dataset was collected from a survey, it is important to know the sample size, sampling technique, and
potential biases that may exist in the sample.

Variables: The variables included in the dataset can provide insight into what the dataset is measuring. It is important
to understand what each variable represents, the measurement scale of each variable, and
whether any variables are correlated.

Context: The context of the dataset is important for understanding the broader implications of the data. For example,
if the dataset is related to healthcare, it is important to understand the potential impacts of the data on patients,
healthcare providers, and policymakers.

Data quality: Data quality is important for ensuring the accuracy and reliability of the dataset. It is important to
understand the extent to which the data has been cleaned, validated, and prepared for analysis.

Creation of Synthetic Data


In [105]:

import numpy as np
from sklearn.datasets import make_classification
import pandas as pd
from matplotlib import pyplot as plt
%matplotlib inline

In [29]:
X, y = make_classification()

In [31]:

X,y

Out[31]:
(array([[-0.20020088,  0.78460786, -1.30168442, ..., -0.06747625,
          1.62252584,  2.47225306],
        [ 1.67601082, -0.62152034, -0.10236129, ...,  1.14925094,
          0.16455779,  1.48995319],
        [-0.97790698, -0.19776114,  1.12052341, ...,  0.36427123,
          1.73479863,  1.91387022],
        ...,
        [ 1.28277902, -1.68569432,  1.5846109 , ..., -0.06213438,
         -0.2437965 , -1.93831695],
        [ 0.07581445,  0.56921426, -0.38206611, ...,  1.010115  ,
         -1.32707962, -0.156631  ],
        [-0.41622948, -0.06956544,  0.58448114, ...,  1.02636506,
          0.7429604 , -0.59588374]]),
 array([1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1,
        0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1,
        1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1,
        1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0,
        0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 1]))

In [33]:
X.shape

Out[33]:

(100, 20)

In [35]:
y.shape

Out[35]:

(100,)

In [66]:
# n_clusters_per_class=1 completes the line truncated in the source; it is
# the only valid value when n_informative=1 with two classes.
X, y = make_classification(n_samples=200, n_features=1, n_redundant=0, n_informative=1, n_clusters_per_class=1)

In [67]:

X.shape

Out[67]:
(200, 1)

In [68]:
y.shape

Out[68]:

(200,)

In [69]:
np.unique(y)

Out[69]:

array([0, 1])

In [70]:

X

Out[70]:
array([[ 2.33993732],
       [ 1.21747399],
       [-0.4195153 ],
       [-0.90063875],
       [-0.70310485],
       [-0.55687924],
       [-0.65681639],
       [ 1.11790368],
       [ 1.91283482],
       [ 1.80726238],
       [ 1.32051548],
       [ 0.84466249],
       [-1.09071016],
       [-1.12500005],
       [ 0.2526999 ],
       [-0.84868304],
       [-0.96270377],
       [-0.79931494],
       [-0.96845083],
       [ 1.37535748],
       ...])

In [71]:
y

Out[71]:
array([0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1,
       1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1,
       1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1,
       0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0,
       1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0,
       0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1,
       0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0,
       0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1,
       1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0,
       0, 0])

In [72]:
df = pd.DataFrame(X, y)  # note: y is passed as the index, hence the 0/1 row labels below

In [73]:
df

Out[73]:

          0
0 -0.857867
1  0.187659
0 -0.981590
1  0.552080
1 -0.663328
..      ...
0 -0.903706
0 -1.304785
1  0.551296
0 -0.844255
0 -0.889851

[200 rows x 1 columns]

Q2: How many total observations in data?

There are 200 observations in the data, as shown by X.shape, y.shape, and the 200-row DataFrame above.

Q3: How many independent variables?

There is one independent variable. Here, X is the independent variable: the single feature generated for each of the 200 samples.

Q4: Which is dependent variable?

y is the dependent variable; it takes the value 0 or 1.

In [80]:
df.columns = ['X']
df

Out[80]:

           X
0  -0.857867
1   0.187659
0  -0.981590
1   0.552080
1  -0.663328
..       ...
0  -0.903706
0  -1.304785
1   0.551296
0  -0.844255
0  -0.889851

[200 rows x 1 columns]

Q5: Which are the most valuable variables in classification? Prove using correlation.

The variables that correlate strongly with the dependent variable (the one we are trying to predict) are likely to be the most valuable for classification. Since this dataset has a single feature, its correlation with y can be checked directly, as sketched below.
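
A minimal sketch (added here, not part of the original run): with one feature, np.corrcoef gives the correlation between the feature values and the 0/1 labels, and a magnitude close to 1 would back the claim that the feature is valuable.

In [ ]:
# Added sketch: correlation between the single feature and the binary label.
np.corrcoef(X.ravel(), y)[0, 1]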

In [76]:

from matplotlib import pyplot as plt


%matplotlib inline
plt.scatter(X,y)
#plt.hist(X,bins= 10)
plt.show()

In [77]:
df.shape

Out[77]:

(200, 1)

In [78]:
df.head()

Out[78]:

          0
0 -0.857867
1  0.187659
0 -0.981590
1  0.552080
1 -0.663328

Logistic regression
In [92]:
def sigmoid(X, weight):
    z = np.dot(X, weight)
    h = 1 / (1 + np.exp(-z))
    return h

In [93]:
def gradient_descent(X, h, y):
    return np.dot(X.T, (h - y)) / y.shape[0]

In [94]:
def update_weight_loss(weight, learning_rate, gradient):
    return weight - learning_rate * gradient
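
Taken together, these three helpers perform one step of batch gradient descent on the logistic loss. In the notation of the code, with $m$ samples and learning rate $\alpha$ (0.1 in the training loop below):

$$h = \sigma(X\theta) = \frac{1}{1 + e^{-X\theta}}, \qquad \nabla J(\theta) = \frac{1}{m} X^\top (h - y), \qquad \theta \leftarrow \theta - \alpha\,\nabla J(\theta)$$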

In [95]:

intercept = np.ones((X.shape[0], 1))
len(intercept)

Out[95]:

200

In [96]:
import time

start_time = time.time()
num_iter = 1000
intercept = np.ones((X.shape[0], 1))
X_or = X
X = np.concatenate((intercept, X), axis=1)
theta = np.zeros(X.shape[1])
for i in range(num_iter):
    h = sigmoid(X, theta)
    gradient = gradient_descent(X, h, y)
    theta = update_weight_loss(theta, 0.1, gradient)

print("Training time (Log Reg using Gradient descent):" + str(time.time() - start_time) + " seconds")
print("Learning rate: 0.1")
print("Iteration: 1000")

Training time (Log Reg using Gradient descent):0.028480052947998047 seconds
Learning rate: 0.1
Iteration: 1000

In [97]:
result = sigmoid(X, theta)
f = pd.DataFrame(np.around(result, decimals=6))  # .join(y)
f['class'] = y
f['pred'] = f[0].apply(lambda x: 0 if x < 0.5 else 1)
print("Accuracy (Loss minimization):")
# The final line was truncated in the source; computing the percentage of
# matching predictions is consistent with the 97.0 shown below.
f.loc[f['pred'] == f['class']].shape[0] / f.shape[0] * 100

Accuracy (Loss minimization):

Out[97]:
97.0

In [98]:
import math

def sigmoid2(x):
    a = []
    for item in x:
        a.append(1 / (1 + math.exp(-item)))
    return a

import matplotlib.pyplot as plt
import numpy as np

x = np.arange(-10., 10., 0.2)
sig = sigmoid2(x)
plt.plot(x, sig)
plt.show()

In [1]:

import pandas as pd
from matplotlib import pyplot as plt
%matplotlib inline

In [9]:
df = pd.read_csv(r"D:\Sem-6\IT359 MACHINE LEARNING\Lab\Notes\LR_insurance_data.csv")
df.head()

Out[9]:

   age  bought_insurance
0   22                 0
1   25                 0
2   47                 1
3   52                 0
4   46                 1

In [10]:
plt.scatter(df.age,df.bought_insurance,marker='+',color='red')

Out[10]:

<matplotlib.collections.PathCollection at 0x14d62673b50>

In [11]:
df.shape

Out[11]:

(27, 2)

In [12]:
from sklearn.model_selection import train_test_split

In [13]:
# test_size=0.1 is a plausible completion of the truncated call: it yields
# the 3-row test split (of 27 rows) shown below.
X_train, X_test, y_train, y_test = train_test_split(df[['age']], df.bought_insurance, test_size=0.1)

In [14]:
X_test

Out[14]:

    age
15   55
13   29
2    47

In [15]:
X_train

Out[15]:

    age
4    46
21   26
6    55
0    22
16   25
10   18
25   54
5    56
19   18
24   50
23   45
7    60
11   28
12   27
8    62
1    25
3    52
18   19
9    61
20   21
17   58
22   40
26   23
14   49

In [16]:
from sklearn.linear_model import LogisticRegression

In [17]:
model=LogisticRegression()

In [18]:
model.fit(X_train, y_train)

Out[18]:

LogisticRegression()
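
A quick inspection sketch (added, not from the original notebook): the fitted coefficient and intercept define the decision rule, predicting 1 when sigmoid(coef_ * age + intercept_) >= 0.5.

In [ ]:
# Added sketch: the learned parameters of the fitted sklearn model.
print(model.coef_, model.intercept_)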

In [23]:

prediction = model.predict(X_test)
prediction

Out[23]:
array([1, 0, 1], dtype=int64)

Q6: Quantify the goodness of your model and discuss steps taken for improvement (Accuracy,
Confusion matrices, F-measure).

1. Accuracy is a common metric for the overall performance of a classification model, calculated as the number of correct predictions divided by the total number of predictions. While useful, it can be misleading when the class distribution in the dataset is imbalanced.
2. Confusion matrices provide a more detailed view of model performance. A confusion matrix is a table of the true positive, true negative, false positive, and false negative counts made by the model; from it, metrics such as precision, recall, and F-measure can be derived.
3. F-measure combines precision and recall into a single score: it is the harmonic mean of the two and balances the trade-off between them. A quick F-measure check for this model is sketched after the accuracy cell below.

In [20]:
model.score(X_test, y_test)

Out[20]:

1.0

In [21]:

model.predict_proba(X_test)

Out[21]:
array([[0.12292577, 0.87707423],
       [0.78112113, 0.21887887],
       [0.27509135, 0.72490865]])

In [29]:
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test, prediction)

Out[29]:

array([[1, 0],
[0, 2]], dtype=int64)

In [30]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test, prediction)

Out[30]:

1.0
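
As referenced in the answer to Q6, a minimal sketch (added, not part of the original notebook) of precision, recall, and F-measure on the same test split:

In [ ]:
# Added sketch: precision, recall, and F-measure for the test predictions.
# With the perfect 3-sample split above, all three come out to 1.0.
from sklearn.metrics import precision_score, recall_score, f1_score
print(precision_score(y_test, prediction))
print(recall_score(y_test, prediction))
print(f1_score(y_test, prediction))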

