
Linear Regression - SLR and MLR

There are two types of supervised machine learning algorithms: regression and classification.

Regression predicts continuous-valued outputs, while classification predicts discrete outputs.

For instance, predicting the price of a house in dollars is a regression problem, whereas
predicting whether a tumor is malignant or benign is a classification problem.

Regression models the relationship between independent variables and a dependent variable.

Linear Regression models a linear relationship between independent variables and a dependent variable.

SLR (Simple Linear Regression) is a linear relationship between one independent variable and a dependent variable.

MLR (Multiple Linear Regression) is a linear relationship between more than one independent variable and a dependent variable.

Polynomial Regression is applied whenever the relationship between the independent variables and the dependent variable is non-linear.

Logistic Regression is applied when the dependent variable takes discrete values (0 or 1, yes or no).
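In practice, the difference between SLR and MLR is simply the number of columns in X. A minimal sketch on synthetic data (the coefficients and noise below are made up purely for illustration):

# A minimal sketch (synthetic data) contrasting SLR and MLR with scikit-learn.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)

# SLR: one independent variable (X has a single column).
X_slr = rng.uniform(0, 10, size=(100, 1))
y_slr = 3.0 * X_slr[:, 0] + 5.0 + rng.normal(0, 1, size=100)
slr = LinearRegression().fit(X_slr, y_slr)
print(slr.coef_, slr.intercept_)   # roughly [3.0] and 5.0

# MLR: more than one independent variable (X has several columns).
X_mlr = rng.uniform(0, 10, size=(100, 3))
y_mlr = X_mlr @ np.array([2.0, -1.0, 0.5]) + 4.0 + rng.normal(0, 1, size=100)
mlr = LinearRegression().fit(X_mlr, y_mlr)
print(mlr.coef_, mlr.intercept_)   # roughly [2.0, -1.0, 0.5] and 4.0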

Assumptions of Linear Regression

1. Linear relationship
2. Multivariate normality
3. No or little multicollinearity
4. No auto-correlation
5. Homoscedasticity
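Most of these assumptions can be checked with quick diagnostics. A rough sketch of two such checks, assuming the regressor, X_train, and y_train objects that are built later in this notebook:

# Rough sketch of assumption checks (assumes a fitted `regressor` and the
# `X_train`, `y_train` objects defined later in this notebook).
import matplotlib.pyplot as plt

# Assumption 3 (multicollinearity): pairwise correlations between predictors;
# values near +/-1 between two independent variables are a warning sign.
print(X_train.corr())

# Assumptions 4 and 5 (autocorrelation, homoscedasticity): plot residuals
# against fitted values; the cloud should look patternless with constant spread.
fitted = regressor.predict(X_train)
residuals = y_train - fitted
plt.scatter(fitted, residuals)
plt.axhline(0, color='red')
plt.xlabel('Fitted values')
plt.ylabel('Residuals')
plt.show()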

Linear Regression using python

Multiple Linear Regression


In the previous section we performed linear regression involving two variables.

Linear regression involving multiple variables is called "multiple linear regression".

The steps to perform multiple linear regression are almost the same as for simple linear
regression; the main difference lies in the evaluation.
We can use it to find out which factor has the highest impact on the predicted output and
how the different variables relate to each other.

STEP 1: Problem Statement


To predict the petrol consumption in 48 US states based upon the petrol tax (in cents),
average income (in dollars), paved highways (in miles), and the proportion of the population
with a driver's licence (in %).

STEP 2: Import necessary libraries and dataset


Importing Libraries

In [2]:
# Data manipulation
import numpy as np
import pandas as pd

# Data visualization
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# Scientific computation (for statistics)
from scipy import stats

# To split the data into train and test sets
from sklearn.model_selection import train_test_split

# Building a model
from sklearn.linear_model import LinearRegression

# Validating a model
from sklearn.metrics import mean_squared_error, mean_absolute_error

Load a Dataset

The dataset being used for this example has been made publicly available and can be
downloaded from this link: Dataset link

In [3]:
dataset = pd.read_csv("C:/Users/chetankumarbk/Desktop/CSV practice files/petrol_consumption.csv")

STEP 3: EDA (Data Exploration)


Understanding the data and gaining insights from it

In [4]:
dataset.shape

Out[4]: (48, 5)

This means that our dataset has 48 rows and 5 columns. Let's take a look at what our dataset
actually looks like. To do this, use the head() method:

In [6]:
dataset.head()

Out[6]:
   Petrol_tax  Average_income  Paved_Highways  Population_Driver_licence(%)  Petrol_Consumption
0         9.0            3571            1976                         0.525                 541
1         9.0            4092            1250                         0.572                 524
2         9.0            3865            1586                         0.580                 561
3         7.5            4870            2351                         0.529                 414
4         8.0            4399             431                         0.544                 410


In [7]:
len(dataset)

Out[7]: 48

In [8]:
dataset.tail()

Out[8]:
    Petrol_tax  Average_income  Paved_Highways  Population_Driver_licence(%)  Petrol_Consumption
43         7.0            3745            2611                         0.508                 591
44         6.0            5215            2302                         0.672                 782
45         9.0            4476            3942                         0.571                 510
46         7.0            4296            4083                         0.623                 610
47         7.0            5002            9794                         0.593                 524

In [9]:
#To see statistical details of the dataset
dataset.describe()

Out[9]:
        Petrol_tax  Average_income  Paved_Highways  Population_Driver_licence(%)  Petrol_Consumption
count    48.000000       48.000000       48.000000                     48.000000            48.00000
mean      7.668333     4241.833333     5565.416667                      0.570333           576.77083
std       0.950770      573.623768     3491.507166                      0.055470            111.8858
min       5.000000     3063.000000      431.000000                      0.451000           344.00000
25%       7.000000     3739.000000     3110.250000                      0.529750           509.50000
50%       7.500000     4298.000000     4735.500000                      0.564500           568.50000
75%       8.125000     4578.750000     7156.000000                      0.595250           632.75000
max      10.000000     5342.000000    17782.000000                      0.724000           968.00000

In [10]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48 entries, 0 to 47
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Petrol_tax 48 non-null float64
1 Average_income 48 non-null int64
2 Paved_Highways 48 non-null int64
3 Population_Driver_licence(%) 48 non-null float64
4 Petrol_Consumption 48 non-null int64
dtypes: float64(2), int64(3)
memory usage: 2.0 KB

In [11]:
dataset.dtypes

Out[11]:
Petrol_tax                      float64
Average_income                    int64
Paved_Highways                    int64
Population_Driver_licence(%)    float64
Petrol_Consumption                int64
dtype: object

In [12]:
from scipy import stats

pearson_value, p_value = stats.pearsonr(dataset["Average_income"], dataset["Petrol_Consumption"])

print(pearson_value, p_value)

-0.24486207498269905 0.09346842977474583

In [13]:
from scipy import stats

pearson_value, p_value = stats.pearsonr(dataset["Petrol_tax"], dataset["Petrol_Consumption"])

print(pearson_value, p_value)

-0.45128027518698666 0.0012848906734289317

In [15]:
dataset.plot(x='Average_income', y='Petrol_Consumption', style='o')
plt.title('Average_income vs Petrol_Consumption')
plt.xlabel('Average income')
plt.ylabel('Petrol Consumption')
plt.show()

Plotting the data points on a 2-D graph lets us eyeball the dataset and see whether we can
manually spot any relationship in the data.

From the graph above, we can see that there is only a weak relationship between
Average_income and Petrol_Consumption.

In [16]:
corr = dataset.corr()
corr

Out[16]:
                              Petrol_tax  Average_income  Paved_Highways  Population_Driver_licence(%)  Petrol_Consumption
Petrol_tax                      1.000000        0.012665       -0.522130                     -0.288037           -0.451280
Average_income                  0.012665        1.000000        0.050163                      0.157070           -0.244862
Paved_Highways                 -0.522130        0.050163        1.000000                     -0.064129            0.019042
Population_Driver_licence(%)   -0.288037        0.157070       -0.064129                      1.000000              0.6989
Petrol_Consumption             -0.451280       -0.244862        0.019042                        0.6989            1.000000

In [19]:
plt.subplots(figsize=(8,5))
sns.heatmap(corr, annot=True)

Out[19]: <AxesSubplot:>

STEP 4: Data preparation


Checking the Missing values

In [20]:
dataset.isnull().sum()

Out[20]:
Petrol_tax                      0
Average_income                  0
Paved_Highways                  0
Population_Driver_licence(%)    0
Petrol_Consumption              0
dtype: int64

In [21]:
dataset.columns[dataset.isnull().any()]
Out[21]: Index([], dtype='object')

In [49]:
plt.figure(figsize=(12, 6))
sns.heatmap(dataset.isnull())

plt.show()

Dividing the dataset into independent and dependent variables


The next step is to divide the data into "attributes" and "labels".

Attributes are the independent variables, while labels are the dependent variables whose
values are to be predicted.
We want to predict the petrol consumption.
Therefore our attribute set consists of every column other than "Petrol_Consumption",
and the label is the "Petrol_Consumption" column.

In [22]:
X = dataset[['Petrol_tax', 'Average_income', 'Paved_Highways',
'Population_Driver_licence(%)']]
y = dataset['Petrol_Consumption']

NOTE: the column indexes start with 0, with 1 being the second column

Next, we split this data into training and test sets using Scikit-Learn's built-in
train_test_split() method:

In [23]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

The above script assigns 80% of the data to the training set and 20% to the test set. The
test_size parameter is where we specify the proportion of the test set.

In [24]:
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(38, 4)
(10, 4)
(38,)
(10,)

In [26]:
X_train.head()

Out[26]:
    Petrol_tax  Average_income  Paved_Highways  Population_Driver_licence(%)
11         7.5            5126           14186                         0.525
31         7.0            3333            6594                         0.513
33         7.5            3357            4121                         0.547
27         7.5            3846            9061                         0.579
47         7.0            5002            9794                         0.593

In [27]:
y_train.head()

Out[27]:
11    471
31    554
33    628
27    631
47    524
Name: Petrol_Consumption, dtype: int64

STEP 5: Building a model using training data


In [28]:
from sklearn.linear_model import LinearRegression

regressor = LinearRegression()
regressor.fit(X_train, y_train)

Out[28]: LinearRegression()

We import the LinearRegression class, instantiate it, and call the fit() method with our
training data.

In the theory section we said that the linear regression model finds the best values for the
intercept and the slopes (coefficients), which results in a line that best fits the data. To see
the values of the intercept and coefficients calculated by the linear regression algorithm for
our dataset:

In [29]:
print(regressor.intercept_)

425.5993322032417

In [30]:
print(regressor.coef_)

[-4.00166602e+01 -6.54126674e-02 -4.74073380e-03 1.34186212e+03]


This means that for a unit increase in "Petrol_tax", there is a decrease of about 40.02 million
gallons in petrol consumption. Similarly, a unit increase in the proportion of the population
with a driver's licence results in an increase of about 1,342 million gallons of petrol
consumption. We can see that "Average_income" and "Paved_Highways" have only a very small
effect on the petrol consumption.

Step 6: Predictions
Now that we have trained our algorithm, it's time to make some predictions. To do so, we will
use our test data and see how accurately our algorithm predicts the petrol consumption.

In [31]:
y_pred = regressor.predict(X_test)

Now, to compare the actual output values for X_test with the predicted values

In [32]:
df = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})
df

Out[32]:
    Actual   Predicted
29     534  469.391989
4      410  545.645464
26     577  589.668394
30     571  569.730413
32     577  649.774809
37     704  646.631164
34     487  511.608148
40     587  672.475177
7      467  502.074782
10     580  501.270734

Step 7: Evaluating the Algorithm


The final step is to evaluate the performance of the algorithm. We'll do this by finding the
values for MAE, MSE, and RMSE.
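For reference, these metrics can also be computed directly from their definitions (a short NumPy sketch; y_test and y_pred are as defined above):

# Computing the three metrics from their definitions with NumPy.
errors = np.array(y_test) - y_pred
mae = np.mean(np.abs(errors))   # Mean Absolute Error: average magnitude of the errors
mse = np.mean(errors ** 2)      # Mean Squared Error: average squared error
rmse = np.sqrt(mse)             # Root Mean Squared Error: MSE back on the original scale
print(mae, mse, rmse)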

In [33]:
from sklearn import metrics
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

Mean Absolute Error: 56.822247478964684


Mean Squared Error: 4666.3447875883585
Root Mean Squared Error: 68.31064915215165
You can see that the value of the root mean squared error is 68.31, which is slightly more than
10% of the mean value of the petrol consumption across all states (576.77). This means that our
algorithm was not very accurate but can still make reasonably good predictions.
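That comparison can be checked directly (a small sketch using the metrics module imported above):

# RMSE relative to the mean petrol consumption: 68.31 / 576.77 is about 0.12.
rmse = np.sqrt(metrics.mean_squared_error(y_test, y_pred))
print(rmse / dataset['Petrol_Consumption'].mean())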

There are many factors that may have contributed to this inaccuracy, a few of which are listed
here:
1. Need more data: only one year's worth of data isn't much, whereas having multiple years'
worth could have helped us improve the accuracy quite a bit.

2. Poor features: The features we used may not have had a high enough correlation to the
values we were trying to predict.

3. Bad assumptions: We made the assumption that this data has a linear relationship, but that
might not be the case. Visualizing the data may help you determine that.

R-Square
Training Accuracy

In [34]:
regressor.score(X_train, y_train)

Out[34]: 0.72081542958177

Testing Accuracy

In [35]:
regressor.score(X_test, y_test)

Out[35]: 0.2036193241012182

The R-Square for the training data (0.72) is much higher than for the test data (0.20); this
large gap indicates that the model is overfitting.
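One way to get a less split-dependent estimate of performance (not part of the original notebook) is k-fold cross-validation; a short sketch with scikit-learn's cross_val_score:

from sklearn.model_selection import cross_val_score

# 5-fold cross-validated R-Square; with only 48 rows, a single 80/20 split
# can be misleading, so averaging the score over several folds is more robust.
scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring='r2')
print(scores, scores.mean())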

Conclusion
We studied one of the most fundamental machine learning algorithms, i.e. linear regression. We
implemented both simple linear regression and multiple linear regression with the help of the
Scikit-Learn machine learning library.
