20IT113 Logistic Regression ML


20IT107 IT359: Machine Learning

PRACTICAL: 4

Logistic Regression
Q1: Why do you want to apply classification to the selected dataset? Discuss the full story behind
the dataset.

Classification is a type of machine learning algorithm used to predict categorical outputs from input variables. Unlike regression, which predicts a continuous numerical output, classification predicts a categorical one, as in binary or multi-class problems; for example, predicting whether an email is spam is classification, while predicting its word count is regression. Classification applies to a dataset when the dependent variable is categorical and we want to predict the class of new observations from a set of input features.

The full story behind a dataset depends on the specific dataset and the context in which the data was gathered. Some common factors that provide insight into that story are:

Data collection: The way in which data was collected can provide valuable information about the dataset. For
example, if a dataset was collected from a survey, it is important to know the sample size, sampling technique, and
potential biases that may exist in the sample.

Variables: The variables included in the dataset can provide insight into what the dataset is measuring. It is important
to understand what each variable represents, the measurement scale of each variable, and
whether any variables are correlated.

Context: The context of the dataset is important for understanding the broader implications of the data. For example,
if the dataset is related to healthcare, it is important to understand the potential impacts of the data on patients,
healthcare providers, and policymakers.

Data quality: Data quality is important for ensuring the accuracy and reliability of the dataset. It is important to
understand the extent to which the data has been cleaned, validated, and prepared for analysis.

Creation of Synthetic Data


In [105]:

import numpy as np
from sklearn.datasets import make_classification
import pandas as pd
from matplotlib import pyplot as plt
%matplotlib inline

In [29]:
X, y = make_classification()

In [31]:

X,y

Out[31]:
(array([[-0.20020088,  0.78460786, -1.30168442, ..., -0.06747625,
          1.62252584,  2.47225306],
        [ 1.67601082, -0.62152034, -0.10236129, ...,  1.14925094,
          0.16455779,  1.48995319],
        [-0.97790698, -0.19776114,  1.12052341, ...,  0.36427123,
          1.73479863,  1.91387022],
        ...,
        [ 1.28277902, -1.68569432,  1.5846109 , ..., -0.06213438,
         -0.2437965 , -1.93831695],
        [ 0.07581445,  0.56921426, -0.38206611, ...,  1.010115  ,
         -1.32707962, -0.156631  ],
        [-0.41622948, -0.06956544,  0.58448114, ...,  1.02636506,
          0.7429604 , -0.59588374]]),
 array([1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1,
        0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1,
        1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1,
        1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0,
        0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 1]))

In [33]:
X.shape

Out[33]:

(100, 20)

In [35]:
y.shape

Out[35]:

(100,)

In [66]:
# n_clusters_per_class=1 completes the line truncated in the source; it is
# the only valid value when n_informative=1 with two classes.
X, y = make_classification(n_samples=200, n_features=1, n_redundant=0, n_informative=1, n_clusters_per_class=1)

In [67]:

X.shape

Out[67]:
(200, 1)

In [68]:
y.shape

Out[68]:

(200,)

In [69]:
np.unique(y)

Out[69]:

array([0, 1])

In [70]:

X

Out[70]:
array([[ 2.33993732],
       [ 1.21747399],
       [-0.4195153 ],
       [-0.90063875],
       [-0.70310485],
       [-0.55687924],
       [-0.65681639],
       [ 1.11790368],
       [ 1.91283482],
       [ 1.80726238],
       [ 1.32051548],
       [ 0.84466249],
       [-1.09071016],
       [-1.12500005],
       [ 0.2526999 ],
       [-0.84868304],
       [-0.96270377],
       [-0.79931494],
       [-0.96845083],
       [ 1.37535748],
       ...])

In [71]:
y

Out[71]:
array([0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1,
       1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1,
       1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1,
       0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0,
       1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0,
       0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1,
       0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0,
       0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1,
       1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0,
       0, 0])

In [72]:
df = pd.DataFrame(X, y)  # note: y is passed as the index, hence the 0/1 row labels below

In [73]:
df

Out[73]:

          0
0 -0.857867
1  0.187659
0 -0.981590
1  0.552080
1 -0.663328
..      ...
0 -0.903706
0 -1.304785
1  0.551296
0 -0.844255
0 -0.889851

[200 rows x 1 columns]

Q2: How many total observations in data?

There are 200 observations in the data, as shown by X.shape, y.shape, and the 200-row DataFrame above.

Q3: How many independent variables?

There is one independent variable. Here, X is the independent variable: the single feature generated for each of the 200 samples.

Q4: Which is dependent variable?

y is the dependent variable; it takes the value 0 or 1.

In [80]:
df.columns = ['X']
df

Out[80]:

           X
0  -0.857867
1   0.187659
0  -0.981590
1   0.552080
1  -0.663328
..       ...
0  -0.903706
0  -1.304785
1   0.551296
0  -0.844255
0  -0.889851

[200 rows x 1 columns]

Q5: Which are the most valuable variables in classification? Prove using correlation.

The variables that correlate strongly with the dependent variable (the one we are trying to predict) are likely to be the most valuable for classification. Since this dataset has a single feature, its correlation with y can be checked directly, as sketched below.
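
A minimal sketch (added here, not part of the original run): with one feature, np.corrcoef gives the correlation between the feature values and the 0/1 labels, and a magnitude close to 1 would back the claim that the feature is valuable.

In [ ]:
# Added sketch: correlation between the single feature and the binary label.
np.corrcoef(X.ravel(), y)[0, 1]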

In [76]:

from matplotlib import pyplot as plt


%matplotlib inline
plt.scatter(X,y)
#plt.hist(X,bins= 10)
plt.show()

In [77]:
df.shape

Out[77]:

(200, 1)

In [78]:
df.head()

Out[78]:

          0
0 -0.857867
1  0.187659
0 -0.981590
1  0.552080
1 -0.663328

Logistic regression
In [92]:
def sigmoid(X, weight):
    z = np.dot(X, weight)
    h = 1 / (1 + np.exp(-z))
    return h

In [93]:
def gradient_descent(X, h, y):
    return np.dot(X.T, (h - y)) / y.shape[0]

In [94]:
def update_weight_loss(weight, learning_rate, gradient):
    return weight - learning_rate * gradient
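
Taken together, these three helpers perform one step of batch gradient descent on the logistic loss. In the notation of the code, with $m$ samples and learning rate $\alpha$ (0.1 in the training loop below):

$$h = \sigma(X\theta) = \frac{1}{1 + e^{-X\theta}}, \qquad \nabla J(\theta) = \frac{1}{m} X^\top (h - y), \qquad \theta \leftarrow \theta - \alpha\,\nabla J(\theta)$$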

In [95]:

intercept = np.ones((X.shape[0], 1))
len(intercept)

Out[95]:

200

In [96]:
import time

start_time = time.time()
num_iter = 1000
intercept = np.ones((X.shape[0], 1))
X_or = X
X = np.concatenate((intercept, X), axis=1)
theta = np.zeros(X.shape[1])
for i in range(num_iter):
    h = sigmoid(X, theta)
    gradient = gradient_descent(X, h, y)
    theta = update_weight_loss(theta, 0.1, gradient)

print("Training time (Log Reg using Gradient descent):" + str(time.time() - start_time) + " seconds")
print("Learning rate: 0.1")
print("Iteration: 1000")

Training time (Log Reg using Gradient descent):0.028480052947998047 seconds
Learning rate: 0.1
Iteration: 1000

In [97]:
result = sigmoid(X, theta)
f = pd.DataFrame(np.around(result, decimals=6))  # .join(y)
f['class'] = y
f['pred'] = f[0].apply(lambda x: 0 if x < 0.5 else 1)
print("Accuracy (Loss minimization):")
# The final line was truncated in the source; computing the percentage of
# matching predictions is consistent with the 97.0 shown below.
f.loc[f['pred'] == f['class']].shape[0] / f.shape[0] * 100

Accuracy (Loss minimization):

Out[97]:
97.0

In [98]:
import math

def sigmoid2(x):
    a = []
    for item in x:
        a.append(1 / (1 + math.exp(-item)))
    return a

import matplotlib.pyplot as plt
import numpy as np

x = np.arange(-10., 10., 0.2)
sig = sigmoid2(x)
plt.plot(x, sig)
plt.show()

In [1]:

import pandas as pd
from matplotlib import pyplot as plt
%matplotlib inline

In [9]:
df = pd.read_csv(r"D:\Sem-6\IT359 MACHINE LEARNING\Lab\Notes\LR_insurance_data.csv")
df.head()

Out[9]:

   age  bought_insurance
0   22                 0
1   25                 0
2   47                 1
3   52                 0
4   46                 1

In [10]:
plt.scatter(df.age,df.bought_insurance,marker='+',color='red')

Out[10]:

<matplotlib.collections.PathCollection at 0x14d62673b50>

In [11]:
df.shape

Out[11]:

(27, 2)

In [12]:
from sklearn.model_selection import train_test_split

In [13]:
# test_size=0.1 is a plausible completion of the truncated call: it yields
# the 3-row test split (of 27 rows) shown below.
X_train, X_test, y_train, y_test = train_test_split(df[['age']], df.bought_insurance, test_size=0.1)

In [14]:
X_test

Out[14]:

    age
15   55
13   29
2    47

In [15]:
X_train

Out[15]:

    age
4    46
21   26
6    55
0    22
16   25
10   18
25   54
5    56
19   18
24   50
23   45
7    60
11   28
12   27
8    62
1    25
3    52
18   19
9    61
20   21
17   58
22   40
26   23
14   49

In [16]:
from sklearn.linear_model import LogisticRegression

In [17]:
model=LogisticRegression()

In [18]:
model.fit(X_train, y_train)

Out[18]:

LogisticRegression()
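
A quick inspection sketch (added, not from the original notebook): the fitted coefficient and intercept define the decision rule, predicting 1 when sigmoid(coef_ * age + intercept_) >= 0.5.

In [ ]:
# Added sketch: the learned parameters of the fitted sklearn model.
print(model.coef_, model.intercept_)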

In [23]:

prediction = model.predict(X_test)
prediction

Out[23]:
array([1, 0, 1], dtype=int64)

Q6: Quantify the goodness of your model and discuss steps taken for improvement (Accuracy,
Confusion matrices, F-measure).

1. Accuracy is a common metric for the overall performance of a classification model, calculated as the number of correct predictions divided by the total number of predictions. While useful, it can be misleading when the class distribution in the dataset is imbalanced.
2. Confusion matrices provide a more detailed view of model performance. A confusion matrix is a table of the true positive, true negative, false positive, and false negative counts made by the model; from it, metrics such as precision, recall, and F-measure can be derived.
3. F-measure combines precision and recall into a single score: it is the harmonic mean of the two and balances the trade-off between them. A quick F-measure check for this model is sketched after the accuracy cell below.

In [20]:
model.score(X_test, y_test)

Out[20]:

1.0

In [21]:

model.predict_proba(X_test)

Out[21]:
array([[0.12292577, 0.87707423],
       [0.78112113, 0.21887887],
       [0.27509135, 0.72490865]])

In [29]:
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test, prediction)

Out[29]:

array([[1, 0],
[0, 2]], dtype=int64)

In [30]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test, prediction)

Out[30]:

1.0
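
As referenced in the answer to Q6, a minimal sketch (added, not part of the original notebook) of precision, recall, and F-measure on the same test split:

In [ ]:
# Added sketch: precision, recall, and F-measure for the test predictions.
# With the perfect 3-sample split above, all three come out to 1.0.
from sklearn.metrics import precision_score, recall_score, f1_score
print(precision_score(y_test, prediction))
print(recall_score(y_test, prediction))
print(f1_score(y_test, prediction))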

