
International Islamic University Islamabad

Faculty of Engineering & Technology


Department of Electrical Engineering

Machine Learning Lab

Experiment No. 6: Logistic Regression

Name of Student: ……………………………………

Registration No.: …………………………………….

Date of Experiment: …………………………………

Logistic Regression in Python

Classification techniques are an essential part of machine learning and data mining applications, and a large share of data science problems are classification problems. Many classification methods are available, but logistic regression is a common and useful method for solving binary classification problems. Another category is multinomial classification, which handles problems where the target variable contains more than two classes; the IRIS dataset is a famous example of multi-class classification. Logistic regression can be used for various classification problems such as spam detection, diabetes prediction, predicting whether a given customer will purchase a particular product or churn to a competitor, predicting whether a user will click on a given advertisement link, and many more.

Logistic regression is one of the simplest and most commonly used machine learning algorithms for two-class classification. It is easy to implement and can serve as the baseline for any binary classification problem, and its fundamental concepts are also foundational to deep learning. Logistic regression describes and estimates the relationship between one dependent binary variable and one or more independent variables.

Logistic Regression
Logistic regression is a statistical method for predicting binary classes. The outcome or target variable is dichotomous in nature, meaning there are only two possible classes. For example, it can be used for cancer detection problems. It computes the probability of an event occurring.
It is closely related to linear regression, but the target variable is categorical in nature: it uses the log of odds as the dependent variable. In other words, logistic regression predicts the probability of occurrence of a binary event using a logit function.
Linear Regression Equation:

y = b0 + b1*x1 + b2*x2 + ... + bn*xn

where y is the dependent variable and x1, x2, ..., xn are the explanatory variables.

Sigmoid Function:

p = 1 / (1 + e^(-y))

Applying the sigmoid function to the linear regression output:

p = 1 / (1 + e^(-(b0 + b1*x1 + b2*x2 + ... + bn*xn)))

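As a quick numerical illustration of these equations, here is a minimal NumPy sketch; the coefficient and feature values are made up for demonstration only:

import numpy as np

def sigmoid(z):
    # Maps any real number into the open interval (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical coefficients and feature values (illustrative only)
b0, b1, b2 = -1.5, 0.8, 0.3
x1, x2 = 2.0, 1.0

z = b0 + b1 * x1 + b2 * x2   # linear combination
p = sigmoid(z)               # predicted probability of class 1
print(p)                     # ~0.60 for these values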
Properties of Logistic Regression:
The dependent variable in logistic regression follows the Bernoulli distribution.
Estimation is done through maximum likelihood.

There is no R-squared; model fitness is instead assessed through measures such as concordance and the KS statistic.
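For instance, the KS statistic measures how well the model's scores separate the two classes; a minimal sketch using SciPy, with made-up probability values:

from scipy.stats import ks_2samp

# Hypothetical predicted probabilities, split by the true class
scores_class1 = [0.80, 0.70, 0.90, 0.65]   # scores of actual positives
scores_class0 = [0.20, 0.40, 0.35, 0.50]   # scores of actual negatives

# KS statistic: maximum gap between the two score distributions
# (1.0 means perfect separation)
print(ks_2samp(scores_class1, scores_class0).statistic)   # 1.0 for these values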
Linear Regression Vs. Logistic Regression
Linear regression gives you a continuous output, but logistic regression provides a discrete output. Examples of continuous output are house prices and stock prices; examples of discrete output are predicting whether a patient has cancer or not, or whether a customer will churn. Linear regression is estimated using Ordinary Least Squares (OLS), while logistic regression is estimated using the Maximum Likelihood Estimation (MLE) approach.

Sigmoid Function
The sigmoid function, also called the logistic function, gives an 'S'-shaped curve that can take any real-valued number and map it into a value between 0 and 1. As its input goes to positive infinity, the predicted y approaches 1; as the input goes to negative infinity, the predicted y approaches 0. If the output of the sigmoid function is more than 0.5, we can classify the outcome as 1 or YES, and if it is less than 0.5, we can classify it as 0 or NO. The output never leaves the interval from 0 to 1, so it can be interpreted as a probability. For example, if the output is 0.75, we can say there is a 75 percent chance that the patient will suffer from cancer.
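A minimal sketch of this 0.5 decision rule in NumPy (the probability values are made up):

import numpy as np

probs = np.array([0.75, 0.32, 0.51])   # example sigmoid outputs
labels = (probs >= 0.5).astype(int)    # 1 = YES, 0 = NO
print(labels)                          # [1 0 1]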

Types of Logistic Regression:
Binary Logistic Regression: The target variable has only two possible outcomes, such as Spam or Not Spam, Cancer or No Cancer.
Multinomial Logistic Regression: The target variable has three or more nominal categories, such as predicting the type of wine (see the sketch after this list).
Ordinal Logistic Regression: The target variable has three or more ordinal categories, such as a restaurant or product rating from 1 to 5.
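A minimal sketch of the multinomial case, using scikit-learn's LogisticRegression (which supports more than two classes out of the box) on the IRIS dataset mentioned earlier:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)                 # 3 nominal flower classes
clf = LogisticRegression(max_iter=200).fit(X, y)  # multi-class handled automatically
print(clf.predict(X[:3]))                         # e.g. [0 0 0] (all setosa)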
Advantages
Logistic regression is efficient and straightforward: it does not require high computation power, it is easy to implement and easily interpretable, and it is widely used by data analysts and scientists. It also does not strictly require feature scaling, and it provides a probability score for each observation.
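The probability scores mentioned above are exposed in scikit-learn through predict_proba; a minimal sketch with made-up data (hours studied vs. pass/fail):

import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up data: hours studied (feature) vs. exam passed (target)
X = np.array([[1], [2], [3], [4], [5], [6]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression().fit(X, y)
print(clf.predict_proba([[3.5]]))   # [[P(fail), P(pass)]] for 3.5 hours of study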
Disadvantages
Logistic regression is not able to handle a large number of categorical features/variables, and it is vulnerable to overfitting. It also cannot solve non-linear problems directly, which is why non-linear features must first be transformed (see the sketch below). Finally, logistic regression will not perform well with independent variables that are uncorrelated with the target variable or that are highly correlated with each other (multicollinearity).
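A minimal sketch of that feature-transformation workaround, using a made-up one-dimensional dataset whose positive class sits in the middle of the range: no single linear cut can separate it, but adding an x^2 term makes it separable again:

import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Made-up data: class 1 occupies the middle of the range
X = np.array([[-3], [-2], [-1], [0], [1], [2], [3]], dtype=float)
y = np.array([0, 0, 1, 1, 1, 0, 0])

# PolynomialFeatures adds x^2, making the classes linearly separable again
model = make_pipeline(PolynomialFeatures(degree=2), LogisticRegression())
model.fit(X, y)
print(model.predict(X))   # expected: [0 0 1 1 1 0 0]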

User Database – This dataset contains information about users from a company's database. It includes the columns UserID, Gender, Age, EstimatedSalary, and Purchased. We are using this dataset to predict whether a user will purchase the company's newly launched product or not.

# Logistic Regression
#
# Let's build the Logistic Regression model, predicting whether a user
# will purchase the product or not.

# Importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Loading dataset – User_Data
dataset = pd.read_csv(r"E:\Lab Engineer (IIUI)\Courses\Machine Learning Lab (CS423L)\ML Lab Manaul Spring 2020\Lab 06\User_Data.csv")

# To predict whether a user will purchase the product or not, we need the
# relationship between Age and EstimatedSalary; UserID and Gender are not
# useful factors for this.

# Input features: Age and EstimatedSalary (columns 2 and 3)
x = dataset.iloc[:, [2, 3]].values
# Output / target: Purchased (column 4)
y = dataset.iloc[:, 4].values

# Splitting the dataset into train and test sets:
# 75% of the data is used for training the model,
# 25% is used to test the performance of the model.
from sklearn.model_selection import train_test_split
xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.25, random_state=0)

# Feature scaling is very important here because Age and EstimatedSalary
# lie in very different ranges; without scaling, the EstimatedSalary
# feature would dominate the Age feature during model fitting.
# After standardization both features lie on a comparable scale
# (roughly -1 to 1), so each contributes fairly to the decision,
# i.e. to finalizing the hypothesis.
from sklearn.preprocessing import StandardScaler
sc_x = StandardScaler()
xtrain = sc_x.fit_transform(xtrain)
xtest = sc_x.transform(xtest)

print(xtrain[0:10, :])

# Finally, we train our Logistic Regression model.
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state=0)
classifier.fit(xtrain, ytrain)

# After training the model, it's time to use it to predict on the test data.
y_pred = classifier.predict(xtest)

# Let's test the performance of our model – confusion matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(ytest, y_pred)
print("Confusion Matrix : \n", cm)

# Performance measure – accuracy
from sklearn.metrics import accuracy_score
print("Accuracy : ", accuracy_score(ytest, y_pred))

# Visualizing the decision regions of our model on the test set.
from matplotlib.colors import ListedColormap
X_set, y_set = xtest, ytest

# Build a fine grid covering the feature space
X1, X2 = np.meshgrid(
    np.arange(start=X_set[:, 0].min() - 1, stop=X_set[:, 0].max() + 1, step=0.01),
    np.arange(start=X_set[:, 1].min() - 1, stop=X_set[:, 1].max() + 1, step=0.01))

# Colour every grid point by the class the classifier predicts there
plt.contourf(X1, X2,
             classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha=0.75, cmap=ListedColormap(('red', 'green')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())

# Overlay the actual test points, coloured by their true class
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                c=ListedColormap(('red', 'green'))(i), label=j)

plt.title('Classifier (Test set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()

Output:

Scaled training data (first 10 rows):
[[ 0.58164944 -0.88670699]
[-0.60673761 1.46173768]
[-0.01254409 -0.5677824 ]
[-0.60673761 1.89663484]
[ 1.37390747 -1.40858358]
[ 1.47293972 0.99784738]
[ 0.08648817 -0.79972756]
[-0.01254409 -0.24885782]
[-0.21060859 -0.5677824 ]
[-0.21060859 -0.19087153]]
Confusion Matrix:
[[65 3]
[8 24]]
Out of 100 test samples:
TrueNegative + TruePositive = 65 + 24 = 89 correct predictions
FalsePositive + FalseNegative = 3 + 8 = 11 incorrect predictions
Performance Measure – Accuracy:
Accuracy: 0.89
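The accuracy checks out against the confusion matrix: (65 + 24) / 100 = 0.89. The same matrix also yields precision and recall; a short sketch, run after the code above since it reuses ytest and y_pred:

from sklearn.metrics import precision_score, recall_score

print("Precision : ", precision_score(ytest, y_pred))   # 24 / (24 + 3) ≈ 0.89
print("Recall    : ", recall_score(ytest, y_pred))      # 24 / (24 + 8) = 0.75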

Analysing the performance measures (the accuracy, the confusion matrix, and the graph), we can say that our model is performing well on this dataset.
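As a usage sketch, the trained classifier can also score a new, previously unseen user; note that the new point must be passed through the same scaler that was fitted on the training data. The age and salary values below are made up for illustration:

# Hypothetical new user: 30 years old with an estimated salary of 87000
new_user = sc_x.transform([[30, 87000]])
print(classifier.predict(new_user))        # 0 = will not purchase, 1 = will purchase
print(classifier.predict_proba(new_user))  # the associated class probabilities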

Lab Task:
Apply logistic regression to the 'weather.csv' dataset and report an analysis of its results.
