
Linear Regression - SLR and MLR

There are two types of supervised machine learning algorithms: regression and classification.

Regression predicts continuous-valued outputs, while classification predicts discrete outputs.

For instance, predicting the price of a house in dollars is a regression problem, whereas
predicting whether a tumor is malignant or benign is a classification problem.

Regression models the relationship between independent variables and a dependent variable.

Linear Regression models a linear relationship between independent variables and a dependent variable.

SLR (Simple Linear Regression) is a linear relationship between one independent variable and a dependent variable.

MLR (Multiple Linear Regression) is a linear relationship between more than one independent variable and a dependent variable.

Polynomial Regression is applied whenever the relationship between the independent variables and the dependent variable is non-linear.

Logistic Regression is applied when the dependent variable takes discrete values (0 or 1, yes or no).
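In practice, the difference between SLR and MLR is simply the number of columns in X. A minimal sketch on synthetic data (the coefficients and noise below are made up purely for illustration):

# A minimal sketch (synthetic data) contrasting SLR and MLR with scikit-learn.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)

# SLR: one independent variable (X has a single column).
X_slr = rng.uniform(0, 10, size=(100, 1))
y_slr = 3.0 * X_slr[:, 0] + 5.0 + rng.normal(0, 1, size=100)
slr = LinearRegression().fit(X_slr, y_slr)
print(slr.coef_, slr.intercept_)   # roughly [3.0] and 5.0

# MLR: more than one independent variable (X has several columns).
X_mlr = rng.uniform(0, 10, size=(100, 3))
y_mlr = X_mlr @ np.array([2.0, -1.0, 0.5]) + 4.0 + rng.normal(0, 1, size=100)
mlr = LinearRegression().fit(X_mlr, y_mlr)
print(mlr.coef_, mlr.intercept_)   # roughly [2.0, -1.0, 0.5] and 4.0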

Assumptions of Linear Regression

1. Linear relationship
2. Multivariate normality
3. No or little multicollinearity
4. No auto-correlation
5. Homoscedasticity
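Most of these assumptions can be checked with quick diagnostics. A rough sketch of two such checks, assuming the regressor, X_train, and y_train objects that are built later in this notebook:

# Rough sketch of assumption checks (assumes a fitted `regressor` and the
# `X_train`, `y_train` objects defined later in this notebook).
import matplotlib.pyplot as plt

# Assumption 3 (multicollinearity): pairwise correlations between predictors;
# values near +/-1 between two independent variables are a warning sign.
print(X_train.corr())

# Assumptions 4 and 5 (autocorrelation, homoscedasticity): plot residuals
# against fitted values; the cloud should look patternless with constant spread.
fitted = regressor.predict(X_train)
residuals = y_train - fitted
plt.scatter(fitted, residuals)
plt.axhline(0, color='red')
plt.xlabel('Fitted values')
plt.ylabel('Residuals')
plt.show()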

Linear Regression using python

Multiple Linear Regression


In the previous section we performed linear regression involving two variables.

Linear regression involving multiple variables is called "multiple linear regression".

The steps to perform multiple linear regression are almost the same as for simple linear
regression; the main difference lies in the evaluation.
We can use it to find out which factor has the highest impact on the predicted output and
how the different variables relate to each other.

STEP 1: Problem Statement


To predict the petrol consumption in 48 US states based upon the petrol tax (in cents),
average income (in dollars), paved highways (in miles), and the proportion of the population
with a driver's licence (in %).

STEP 2: Import necessary libraries and dataset


Importing Libraries

In [2]:
# Data manipulation
import numpy as np
import pandas as pd

# Data visualization
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# Scientific computation (for statistics)
from scipy import stats

# To split the data into train and test sets
from sklearn.model_selection import train_test_split

# Building a model
from sklearn.linear_model import LinearRegression

# Validating a model
from sklearn.metrics import mean_squared_error, mean_absolute_error

Load a Dataset

The dataset being used for this example has been made publicly available and can be
downloaded from this link: Dataset link

In [3]:
dataset = pd.read_csv("C:/Users/chetankumarbk/Desktop/CSV practice files/petrol_consumption.csv")

STEP 3: EDA (Data Exploration)


Understanding the data and gaining insights from it

In [4]:
dataset.shape

Out[4]: (48, 5)

This means that our dataset has 48 rows and 5 columns. Let's take a look at what our dataset
actually looks like. To do this, use the head() method:

In [6]:
dataset.head()

Out[6]:
   Petrol_tax  Average_income  Paved_Highways  Population_Driver_licence(%)  Petrol_Consumption
0         9.0            3571            1976                         0.525                 541
1         9.0            4092            1250                         0.572                 524
2         9.0            3865            1586                         0.580                 561
3         7.5            4870            2351                         0.529                 414
4         8.0            4399             431                         0.544                 410


In [7]:
len(dataset)

Out[7]: 48

In [8]:
dataset.tail()

Out[8]:
    Petrol_tax  Average_income  Paved_Highways  Population_Driver_licence(%)  Petrol_Consumption
43         7.0            3745            2611                         0.508                 591
44         6.0            5215            2302                         0.672                 782
45         9.0            4476            3942                         0.571                 510
46         7.0            4296            4083                         0.623                 610
47         7.0            5002            9794                         0.593                 524

In [9]:
#To see statistical details of the dataset
dataset.describe()

Out[9]:
        Petrol_tax  Average_income  Paved_Highways  Population_Driver_licence(%)  Petrol_Consumption
count    48.000000       48.000000       48.000000                     48.000000            48.00000
mean      7.668333     4241.833333     5565.416667                      0.570333           576.77083
std       0.950770      573.623768     3491.507166                      0.055470            111.8858
min       5.000000     3063.000000      431.000000                      0.451000           344.00000
25%       7.000000     3739.000000     3110.250000                      0.529750           509.50000
50%       7.500000     4298.000000     4735.500000                      0.564500           568.50000
75%       8.125000     4578.750000     7156.000000                      0.595250           632.75000
max      10.000000     5342.000000    17782.000000                      0.724000           968.00000

In [10]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48 entries, 0 to 47
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Petrol_tax 48 non-null float64
1 Average_income 48 non-null int64
2 Paved_Highways 48 non-null int64
3 Population_Driver_licence(%) 48 non-null float64
4 Petrol_Consumption 48 non-null int64
dtypes: float64(2), int64(3)
memory usage: 2.0 KB

In [11]:
dataset.dtypes

Out[11]:
Petrol_tax                      float64
Average_income                    int64
Paved_Highways                    int64
Population_Driver_licence(%)    float64
Petrol_Consumption                int64
dtype: object

In [12]:
from scipy import stats

pearson_value, p_value = stats.pearsonr(dataset["Average_income"], dataset["Petrol_Consumption"])

print(pearson_value, p_value)

-0.24486207498269905 0.09346842977474583

In [13]:
from scipy import stats

pearson_value, p_value = stats.pearsonr(dataset["Petrol_tax"], dataset["Petrol_Consumption"])

print(pearson_value, p_value)

-0.45128027518698666 0.0012848906734289317

In [15]:
dataset.plot(x='Average_income', y='Petrol_Consumption', style='o')
plt.title('Average_income vs Petrol_Consumption')
plt.xlabel('Average income')
plt.ylabel('Petrol Consumption')
plt.show()

Plotting the data points on a 2-D graph lets us eyeball the dataset and see whether we can
manually spot any relationship in the data.

From the graph above, we can see that there is only a weak relationship between
Average_income and Petrol_Consumption.

In [16]:
corr = dataset.corr()
corr

Out[16]:
                              Petrol_tax  Average_income  Paved_Highways  Population_Driver_licence(%)  Petrol_Consumption
Petrol_tax                      1.000000        0.012665       -0.522130                     -0.288037           -0.451280
Average_income                  0.012665        1.000000        0.050163                      0.157070           -0.244862
Paved_Highways                 -0.522130        0.050163        1.000000                     -0.064129            0.019042
Population_Driver_licence(%)   -0.288037        0.157070       -0.064129                      1.000000              0.6989
Petrol_Consumption             -0.451280       -0.244862        0.019042                        0.6989            1.000000

In [19]:
plt.subplots(figsize=(8,5))
sns.heatmap(corr, annot=True)

Out[19]: <AxesSubplot:>

STEP 4: Data preparation


Checking the Missing values

In [20]:
dataset.isnull().sum()

Out[20]:
Petrol_tax                      0
Average_income                  0
Paved_Highways                  0
Population_Driver_licence(%)    0
Petrol_Consumption              0
dtype: int64

In [21]:
dataset.columns[dataset.isnull().any()]
Out[21]: Index([], dtype='object')

In [49]:
plt.figure(figsize=(12, 6))
sns.heatmap(dataset.isnull())

plt.show()

Dividing the dataset into independent and dependent variables


The next step is to divide the data into "attributes" and "labels".

Attributes are the independent variables, while labels are the dependent variables whose
values are to be predicted.
We want to predict the petrol consumption.
Therefore our attribute set consists of every column other than "Petrol_Consumption",
and the label is the "Petrol_Consumption" column.

In [22]:
X = dataset[['Petrol_tax', 'Average_income', 'Paved_Highways',
'Population_Driver_licence(%)']]
y = dataset['Petrol_Consumption']

NOTE: the column indexes start with 0, with 1 being the second column

Next, we split this data into training and test sets using Scikit-Learn's built-in
train_test_split() method:

In [23]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

The above script assigns 80% of the data to the training set and 20% to the test set. The
test_size parameter is where we specify the proportion of the test set.

In [24]:
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(38, 4)
(10, 4)
(38,)
(10,)

In [26]:
X_train.head()

Out[26]:
    Petrol_tax  Average_income  Paved_Highways  Population_Driver_licence(%)
11         7.5            5126           14186                         0.525
31         7.0            3333            6594                         0.513
33         7.5            3357            4121                         0.547
27         7.5            3846            9061                         0.579
47         7.0            5002            9794                         0.593

In [27]:
y_train.head()

Out[27]:
11    471
31    554
33    628
27    631
47    524
Name: Petrol_Consumption, dtype: int64

STEP 5: Building a model using training data


In [28]:
from sklearn.linear_model import LinearRegression

regressor = LinearRegression()
regressor.fit(X_train, y_train)

Out[28]: LinearRegression()

We import the LinearRegression class, instantiate it, and call the fit() method with our
training data.

In the theory section we said that the linear regression model finds the best values for the
intercept and the slopes (coefficients), which results in a line that best fits the data. To see
the values of the intercept and coefficients calculated by the linear regression algorithm for
our dataset:

In [29]:
print(regressor.intercept_)

425.5993322032417

In [30]:
print(regressor.coef_)

[-4.00166602e+01 -6.54126674e-02 -4.74073380e-03 1.34186212e+03]


This means that for a unit increase in "Petrol_tax", there is a decrease of about 40.02 million
gallons in petrol consumption. Similarly, a unit increase in the proportion of the population
with a driver's licence results in an increase of about 1,342 million gallons of petrol
consumption. We can see that "Average_income" and "Paved_Highways" have only a very small
effect on the petrol consumption.

Step 6: Predictions
Now that we have trained our algorithm, it's time to make some predictions. To do so, we will
use our test data and see how accurately our algorithm predicts the petrol consumption.

In [31]:
y_pred = regressor.predict(X_test)

Now, to compare the actual output values for X_test with the predicted values

In [32]:
df = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})
df

Out[32]:
    Actual   Predicted
29     534  469.391989
4      410  545.645464
26     577  589.668394
30     571  569.730413
32     577  649.774809
37     704  646.631164
34     487  511.608148
40     587  672.475177
7      467  502.074782
10     580  501.270734

Step 7: Evaluating the Algorithm


The final step is to evaluate the performance of the algorithm. We'll do this by finding the
values for MAE, MSE, and RMSE.
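For reference, these metrics can also be computed directly from their definitions (a short NumPy sketch; y_test and y_pred are as defined above):

# Computing the three metrics from their definitions with NumPy.
errors = np.array(y_test) - y_pred
mae = np.mean(np.abs(errors))   # Mean Absolute Error: average magnitude of the errors
mse = np.mean(errors ** 2)      # Mean Squared Error: average squared error
rmse = np.sqrt(mse)             # Root Mean Squared Error: MSE back on the original scale
print(mae, mse, rmse)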

In [33]:
from sklearn import metrics
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

Mean Absolute Error: 56.822247478964684


Mean Squared Error: 4666.3447875883585
Root Mean Squared Error: 68.31064915215165
You can see that the value of the root mean squared error is 68.31, which is slightly more than
10% of the mean value of the petrol consumption across all states (576.77). This means that our
algorithm was not very accurate but can still make reasonably good predictions.
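That comparison can be checked directly (a small sketch using the metrics module imported above):

# RMSE relative to the mean petrol consumption: 68.31 / 576.77 is about 0.12.
rmse = np.sqrt(metrics.mean_squared_error(y_test, y_pred))
print(rmse / dataset['Petrol_Consumption'].mean())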

There are many factors that may have contributed to this inaccuracy, a few of which are listed
here:
1. Need more data: only one year's worth of data isn't much, whereas having multiple years'
worth could have helped us improve the accuracy quite a bit.

2. Poor features: The features we used may not have had a high enough correlation to the
values we were trying to predict.

3. Bad assumptions: We made the assumption that this data has a linear relationship, but that
might not be the case. Visualizing the data may help you determine that.

R-Square
Training Accuracy

In [34]:
regressor.score(X_train, y_train)

Out[34]: 0.72081542958177

Testing Accuracy

In [35]:
regressor.score(X_test, y_test)

Out[35]: 0.2036193241012182

The R-Square for the training data (0.72) is much higher than for the test data (0.20); this
large gap indicates that the model is overfitting.
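One way to get a less split-dependent estimate of performance (not part of the original notebook) is k-fold cross-validation; a short sketch with scikit-learn's cross_val_score:

from sklearn.model_selection import cross_val_score

# 5-fold cross-validated R-Square; with only 48 rows, a single 80/20 split
# can be misleading, so averaging the score over several folds is more robust.
scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring='r2')
print(scores, scores.mean())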

Conclusion
We studied one of the most fundamental machine learning algorithms, i.e. linear regression. We
implemented both simple linear regression and multiple linear regression with the help of the
Scikit-Learn machine learning library.
