Regression Practice - MLR
There are two types of supervised machine learning algorithms: regression and classification.
Regression predicts continuous output values, while classification predicts discrete outputs.
For instance, predicting the price of a house in dollars is a regression problem, whereas
predicting whether a tumor is malignant or benign is a classification problem.
Logistic regression is applied when the dependent variable takes discrete values (0 or 1, yes
or no). Multiple linear regression, by contrast, rests on the following assumptions (a quick
check for assumption 3 is sketched after this list):
1. Linear relationship
2. Multivariate normality
3. No or little multicollinearity
4. No auto-correlation
5. Homoscedasticity
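As a quick illustration of how assumption 3 can be verified, here is a minimal sketch using
the variance_inflation_factor function from statsmodels (this check is an addition, not part
of the original notebook, and assumes the dataset DataFrame loaded below):
# Check multicollinearity among the predictors with variance inflation factors (VIF)
from statsmodels.stats.outliers_influence import variance_inflation_factor
predictors = dataset.drop(columns=['Petrol_Consumption'])
for i, col in enumerate(predictors.columns):
    # VIF values above roughly 5-10 usually signal problematic multicollinearity
    print(col, variance_inflation_factor(predictors.values, i))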
The steps to perform multiple linear regression are almost the same as those for simple linear
regression; the difference lies in the evaluation.
We can use multiple linear regression to find out which factor has the highest impact on the
predicted output and how the different variables relate to each other.
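In standard textbook notation (not taken from this notebook), the multiple linear regression
model is

y = b0 + b1*x1 + b2*x2 + ... + bn*xn

where b0 is the intercept and each coefficient bi estimates how much y changes for a one-unit
change in xi while the other variables are held fixed. Comparing the fitted coefficients is
what lets us judge which factor has the highest impact.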
In [2]:
# Data Manipulation
import numpy as np
import pandas as pd
#Data Visualization
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
#Building a model
from sklearn.linear_model import LinearRegression
#Validating a model
from sklearn.metrics import mean_squared_error, mean_absolute_error
Load a Dataset
The dataset being used for this example has been made publicly available and can be
downloaded from this link: Dataset link
In [3]:
dataset = pd.read_csv("C:/Users/chetankumarbk/Desktop/CSV practice files/petrol_consumption.csv")
In [4]:
dataset.shape
Out[4]:
(48, 5)
This means that our dataset has 48 rows and 5 columns. Let's take a look at what our dataset
actually looks like. To do this, use the head() method:
In [6]:
dataset.head()
In [8]:
dataset.tail()
In [9]:
#To see statistical details of the dataset
dataset.describe()
In [10]:
dataset.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48 entries, 0 to 47
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Petrol_tax 48 non-null float64
1 Average_income 48 non-null int64
2 Paved_Highways 48 non-null int64
3 Population_Driver_licence(%) 48 non-null float64
4 Petrol_Consumption 48 non-null int64
dtypes: float64(2), int64(3)
memory usage: 2.0 KB
In [11]:
dataset.dtypes
Out[11]:
Petrol_tax                      float64
Average_income                    int64
Paved_Highways                    int64
Population_Driver_licence(%)    float64
Petrol_Consumption                int64
dtype: object
In [12]:
# Pearson correlation of Average_income with Petrol_Consumption
from scipy import stats
pearson_value, p_value = stats.pearsonr(dataset['Average_income'], dataset['Petrol_Consumption'])
print(pearson_value, p_value)
-0.24486207498269905 0.09346842977474583
In [13]:
# Pearson correlation of Petrol_tax with Petrol_Consumption
from scipy import stats
pearson_value, p_value = stats.pearsonr(dataset['Petrol_tax'], dataset['Petrol_Consumption'])
print(pearson_value, p_value)
-0.45128027518698666 0.0012848906734289317
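The same test can be applied to the remaining predictors; a minimal sketch (an addition, not
an original cell of this notebook):
# Pearson correlation of the remaining predictors with Petrol_Consumption
for col in ['Paved_Highways', 'Population_Driver_licence(%)']:
    pearson_value, p_value = stats.pearsonr(dataset[col], dataset['Petrol_Consumption'])
    print(col, pearson_value, p_value)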
In [15]:
dataset.plot(x='Average_income', y='Petrol_Consumption', style='o')
plt.title('Average_income vs Petrol_Consumption')
plt.xlabel('Average income')
plt.ylabel('Petrol Consumption')
plt.show()
Plotting the data points on a 2-D graph lets us eyeball the dataset and see whether we can
spot any relationship in the data.
From the graph above, we can see that there is only a weak (negative) relationship between
Average income and Petrol Consumption.
In [16]:
corr = dataset.corr()
corr
In [19]:
plt.subplots(figsize=(8,5))
sns.heatmap(corr, annot=True)
Out[19]:
<AxesSubplot:>
In [20]:
dataset.isnull().sum()
Out[20]:
Petrol_tax                      0
Average_income                  0
Paved_Highways                  0
Population_Driver_licence(%)    0
Petrol_Consumption              0
dtype: int64
In [21]:
dataset.columns[dataset.isnull().any()]
Out[21]: Index([], dtype='object')
In [49]:
plt.figure(figsize=(12, 6))
sns.heatmap(dataset.isnull())
plt.show()
Attributes are the independent variables, while labels are the dependent variables whose
values are to be predicted.
Our dataset has five columns.
We want to predict the petrol consumption.
Therefore our attribute set consists of the four columns other than Petrol_Consumption,
and the label is the Petrol_Consumption column.
In [22]:
X = dataset[['Petrol_tax', 'Average_income', 'Paved_Highways',
'Population_Driver_licence(%)']]
y = dataset['Petrol_Consumption']
NOTE: the column indexes start with 0, with 1 being the second column
Next, we split this data into training and test sets using Scikit-Learn's built-in
train_test_split() method:
In [23]:
from sklearn.model_selection import train_test_split
# random_state=0 is an assumed value, fixed here for reproducibility
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
The above script assigns 80% of the data to the training set and 20% to the test set. The
test_size parameter is where we specify the proportion of the test set.
In [24]:
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)
(38, 4)
(10, 4)
(38,)
(10,)
In [26]:
X_train.head()
In [27]:
y_train.head()
Out[27]:
11    471
31    554
33    628
27    631
47    524
Name: Petrol_Consumption, dtype: int64
In [28]:
regressor = LinearRegression()
regressor.fit(X_train, y_train)
Out[28]:
LinearRegression()
We instantiate the LinearRegression class (imported earlier) and call its fit() method with
our training data.
In the theory section we said that a linear regression model finds the best values for the
intercept and the coefficients (one slope per feature), which result in a hyperplane that best
fits the data. To see the values of the intercept and coefficients calculated by the linear
regression algorithm for our dataset:
In [29]:
print(regressor.intercept_)
425.5993322032417
In [30]:
print(regressor.coef_)
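The raw coefficient array is hard to interpret on its own. A small sketch (an addition,
following a common convention) pairs each coefficient with its feature name:
# Pair each coefficient with its feature name for easier interpretation
coeff_df = pd.DataFrame(regressor.coef_, X.columns, columns=['Coefficient'])
print(coeff_df)
Each value estimates how much Petrol_Consumption changes for a one-unit increase in that
feature, holding the others fixed.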
Step 6: Predictions
Now that we have trained our algorithm, it's time to make some predictions. To do so, we will
use our test data and see how accurately our algorithm predicts the petrol consumption.
In [31]:
y_pred = regressor.predict(X_test)
Now, to compare the actual output values for X_test with the predicted values
In [32]:
df = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})
df
Out[32]:
    Actual   Predicted
29     534  469.391989
4      410  545.645464
26     577  589.668394
30     571  569.730413
32     577  649.774809
37     704  646.631164
34     487  511.608148
40     587  672.475177
7      467  502.074782
10     580  501.270734
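To make this comparison easier to scan, the DataFrame can be plotted as a bar chart (a minimal
sketch; this plot is not part of the original notebook):
# Visual comparison of actual vs. predicted petrol consumption
df.plot(kind='bar', figsize=(10, 6))
plt.title('Actual vs Predicted Petrol_Consumption')
plt.ylabel('Petrol Consumption')
plt.show()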
In [33]:
from sklearn import metrics
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
There are many factors that may have contributed to this inaccuracy, a few of which are listed
here:
1. Need more data: only one year's worth of data isn't much, whereas having multiple years'
worth could have helped us improve the accuracy quite a bit.
2. Poor features: The features we used may not have had a high enough correlation to the
values we were trying to predict.
3. Bad assumptions: We made the assumption that this data has a linear relationship, but that
might not be the case. Visualizing the data may help you determine that.
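One quick sanity check (a common rule of thumb, not part of the original notebook) is to
compare the RMSE against the mean of the target variable; an RMSE above roughly 10% of the
mean suggests the predictions are not very accurate:
# Compare RMSE to 10% of the mean target value as a rough accuracy yardstick
rmse = np.sqrt(metrics.mean_squared_error(y_test, y_pred))
print('RMSE:', rmse)
print('10% of mean Petrol_Consumption:', 0.1 * dataset['Petrol_Consumption'].mean())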
R-Square
Training Accuracy
In [34]:
regressor.score(X_train, y_train)
Out[34]:
0.72081542958177
Testing Accuracy
In [35]:
regressor.score(X_test, y_test)
Out[35]:
0.2036193241012182
The R-squared on the training data (0.72) is much higher than on the test data (0.20). This
large gap indicates that the model is overfitting.
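With only 48 rows, a single train/test split can also be misleading on its own. One way to get
a more stable performance estimate (a sketch added here, not a step from the original
notebook) is k-fold cross-validation:
# 5-fold cross-validated R-squared for a more robust performance estimate
from sklearn.model_selection import cross_val_score
scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring='r2')
print(scores)
print('Mean R-squared:', scores.mean())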
Conclusion
We studied one of the most fundamental machine learning algorithms, linear regression. We
implemented both simple linear regression and multiple linear regression with the help of the
Scikit-Learn machine learning library.