
Simple Linear Regression Using the Ordinary Least Squares (OLS) Method

Develop a regression model to predict Salary based on Percentage in Grade 10.

#This code is to upload the data set from local drive into Colab
from google.colab import files
uploaded = files.upload()


1. Import the MBA Salary dataset

#import the data from MBA Salary.csv
import pandas as pd
mba_salary_df = pd.read_csv('/content/MBA Salary.csv')
mba_salary_df.head(3)

S. No. Percentage in Grade 10 Salary

0 1 62.00 270000

1 2 76.33 200000

2 3 72.00 240000

#print information about the data set

mba_salary_df.info()

<class 'pandas.core.frame.DataFrame'>

RangeIndex: 50 entries, 0 to 49

Data columns (total 3 columns):

# Column Non-Null Count Dtype

--- ------ -------------- -----

0 S. No. 50 non-null int64

1 Percentage in Grade 10 50 non-null float64

2 Salary 50 non-null int64

dtypes: float64(1), int64(2)

memory usage: 1.3 KB

mba_salary_df.shape

(50, 3)
2. Creating feature set X and the outcome variable Y. The
statsmodels library is used for building statistical models.
The OLS API in statsmodels.api is used to estimate the
parameters of simple linear regression. It takes two
parameters, Y and X. In this data, Y is Salary and X is
Percentage in Grade 10. On its own, OLS estimates only the
coefficient of X (Beta 1, the slope). To estimate Beta 0, the
intercept term, a constant column of 1s needs to be added to X
as a separate column; sm.add_constant() does this.

import statsmodels.api as sm

X = sm.add_constant(mba_salary_df['Percentage in Grade 10'])

X.head(5)


const Percentage in Grade 10

0 1.0 62.00

1 1.0 76.33

2 1.0 72.00

3 1.0 60.00

4 1.0 61.00
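The role of the constant column from step 2 can be checked on a small example. The sketch below is an illustration only: it uses synthetic data (not the MBA Salary data) and NumPy's least-squares solver in place of statsmodels, and the variable names are made up for this example.

```python
import numpy as np

# Synthetic data (an assumption for illustration): y = 5 + 2x exactly
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 5.0 + 2.0 * x

# Design matrix with a leading column of ones, like the 'const' column above
X_with_const = np.column_stack([np.ones_like(x), x])
beta_with_const, *_ = np.linalg.lstsq(X_with_const, y, rcond=None)

# Without the constant column the fitted line is forced through the origin
X_no_const = x.reshape(-1, 1)
beta_no_const, *_ = np.linalg.lstsq(X_no_const, y, rcond=None)

print("with constant   :", beta_with_const)  # intercept and slope
print("without constant:", beta_no_const)    # slope only, biased upward here
```

With the column of ones, the fit recovers both the intercept (5) and the slope (2). Without it, the only free parameter is the slope, which has to absorb the missing intercept.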

3. Create outcome Variable Y

Y = mba_salary_df['Salary']

Y.head()

0 270000

1 200000

2 240000

3 250000

4 180000

Name: Salary, dtype: int64


4. Split dataset into training and validation sets. Use 80%
for training and 20% for validating

from sklearn.model_selection import train_test_split

train_X, test_X, train_y, test_y = train_test_split(X,Y,train_size = 0.8, random_state = 1
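What an 80/20 split does to the 50 rows can be sketched without scikit-learn. The example below is a simplified stand-in, not the algorithm train_test_split uses internally: it shuffles row positions with NumPy and cuts them at 80%. The seed and variable names are assumptions made for illustration, so the rows selected will differ from those in the notebook.

```python
import numpy as np

n_rows = 50            # number of rows reported by mba_salary_df.shape
train_fraction = 0.8   # 80% for training, 20% for validation

rng = np.random.default_rng(0)       # seed chosen for illustration only
shuffled = rng.permutation(n_rows)   # random ordering of row positions

n_train = int(n_rows * train_fraction)
train_idx = shuffled[:n_train]       # positions of training rows
test_idx = shuffled[n_train:]        # positions of validation rows

print(len(train_idx), len(test_idx))
```

The split leaves 40 rows for training, which is consistent with the "No. Observations: 40" entry in the model summary in step 7.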

5. Fit the model

mba_salary_lm = sm.OLS(train_y, train_X).fit()

<statsmodels.regression.linear_model.RegressionResultsWrapper at 0x7fada1eb6290>

The fit() method on OLS estimates the parameters and returns the model information, such as
model parameters (coefficients), accuracy measures and residual values, which is stored in
the variable mba_salary_lm

6. Print the estimated parameters

print(mba_salary_lm.params)

const 30587.285652

Percentage in Grade 10 3560.587383

dtype: float64

Hence Beta 0 = 30587.286 and Beta 1 = 3560.587. The estimated model is MBA Salary =
30587.285652 + 3560.587383 * (Percentage in Grade 10)
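As a worked example of using the estimated equation, the sketch below plugs a Grade 10 percentage into it. The helper name predict_salary and the input value of 70 are hypothetical, chosen only to illustrate the arithmetic; the two coefficients are the ones printed above.

```python
# Coefficients as printed by mba_salary_lm.params
beta_0 = 30587.285652   # intercept
beta_1 = 3560.587383    # slope for Percentage in Grade 10

def predict_salary(percentage_grade_10):
    """Hypothetical helper: evaluate the fitted line at one value."""
    return beta_0 + beta_1 * percentage_grade_10

print(predict_salary(70.0))
```

For a student with 70% in Grade 10, the model predicts a salary of roughly 279,828. Each additional percentage point raises the prediction by about 3,561.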

7. Model Diagnostics - Printing the coefficient of determination (R-squared)

print(mba_salary_lm.summary2())

Results: Ordinary least squares

===================================================================================

Model: OLS Adj. R-squared: 0.190

Dependent Variable: Salary AIC: 1008.8680

Date: 2022-10-12 09:27 BIC: 1012.2458

No. Observations: 40 Log-Likelihood: -502.43

Df Model: 1 F-statistic: 10.16

Df Residuals: 38 Prob (F-statistic): 0.00287

R-squared: 0.211 Scale: 5.0121e+09

-----------------------------------------------------------------------------------

Coef. Std.Err. t P>|t| [0.025 0.975]

-----------------------------------------------------------------------------------

const 30587.2857 71869.4497 0.4256 0.6728 -114904.8089 176079.3802

Percentage in Grade 10 3560.5874 1116.9258 3.1878 0.0029 1299.4892 5821.6855

-----------------------------------------------------------------------------------

Omnibus: 2.048 Durbin-Watson: 2.611

Prob(Omnibus): 0.359 Jarque-Bera (JB): 1.724

Skew: 0.369 Prob(JB): 0.422

Kurtosis: 2.300 Condition No.: 413

===================================================================================

Hence R Square of the model is 0.211. So the model explains 21.1% of the variation in salary.
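Several numbers in the summary follow from one another, which gives a quick way to sanity-check the table. The sketch below recomputes the adjusted R-squared, the t statistic of the slope and the F-statistic from other values printed in the summary, using the standard textbook formulas; the variable names are made up for this example.

```python
# Values copied from the summary table above
r_squared = 0.211
n_obs = 40          # No. Observations
n_predictors = 1    # Df Model
coef = 3560.5874    # slope estimate
std_err = 1116.9258 # standard error of the slope

# Adjusted R-squared penalises R-squared for the number of predictors
adj_r_squared = 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)

# t statistic = coefficient / standard error
t_stat = coef / std_err

# With a single predictor, the F-statistic equals the squared t statistic
f_stat = t_stat ** 2

print(round(adj_r_squared, 3), round(t_stat, 4), round(f_stat, 2))
```

The results agree with the summary up to rounding: adjusted R-squared 0.190, t of 3.1878 and F of 10.16.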

8. Model Diagnostics - Residual Analysis. The variance of the
residuals has to be constant across different values of the
predicted value (Y'), a property known as
homoscedasticity. A non-constant variance of the
residuals is known as heteroscedasticity and is not desired. If
there is heteroscedasticity, a residual plot of
standardised residuals against standardised predicted
values will be funnel shaped. To standardise, subtract
the mean and divide by the standard deviation.

import matplotlib.pyplot as plt

def get_std_values(vals):

  # Standardise: subtract the mean, then divide by the standard deviation
  return (vals - vals.mean()) / vals.std()

x_axis = get_std_values(mba_salary_lm.fittedvalues)

y_axis = get_std_values(mba_salary_lm.resid)

plt.scatter(x_axis, y_axis)

plt.xlabel("Standardised Predicted values")

plt.ylabel("Standardised Residual Values")

plt.title("Residual Plot")

plt.show()

The residual plot is not funnel shaped. Hence the residuals appear to have constant variance.
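Reading a funnel shape off a plot is a judgment call, so a rough numeric cross-check can help. The sketch below is an informal heuristic and not a formal test of heteroscedasticity: on synthetic data (an assumption, not the MBA Salary data) it fits a line with NumPy, standardises the residuals, and compares their spread in the lower and upper halves of the fitted values. The function names and the seed are made up for this example.

```python
import numpy as np

rng = np.random.default_rng(0)   # seed chosen for illustration only
x = np.linspace(1, 10, 200)

# Heteroscedastic case: the noise standard deviation grows with x
y_hetero = 3 + 2 * x + rng.normal(0, 0.5 * x)
# Homoscedastic case: the noise standard deviation is constant
y_homo = 3 + 2 * x + rng.normal(0, 2.0, size=x.size)

def standardise(vals):
    # ddof=1 matches the default of pandas Series.std()
    return (vals - vals.mean()) / vals.std(ddof=1)

def spread_ratio(x, y):
    """Spread of standardised residuals: upper half / lower half of fitted values."""
    slope, intercept = np.polyfit(x, y, 1)
    fitted = intercept + slope * x
    resid = standardise(y - fitted)
    order = np.argsort(fitted)
    half = len(order) // 2
    low = resid[order[:half]].std(ddof=1)
    high = resid[order[half:]].std(ddof=1)
    return high / low

hetero_ratio = spread_ratio(x, y_hetero)
homo_ratio = spread_ratio(x, y_homo)
print("heteroscedastic ratio:", hetero_ratio)
print("homoscedastic ratio  :", homo_ratio)
```

A ratio near 1 is what constant variance would produce, while a ratio well above 1 corresponds to a funnel that widens to the right.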

9. Model Diagnostics - Outlier Detection. Outliers are
observations whose values show a large deviation from
the mean value. Their presence can have a significant
influence on the values of the regression coefficients.
Hence we use the Z-score to identify their existence in the
data. Any observation with a Z-score above 3.0 or below -3.0 is
an outlier.

from scipy.stats import zscore

mba_salary_df['z_score_salary'] = zscore(mba_salary_df.Salary)

#mba_salary_df.head()

mba_salary_df[(mba_salary_df.z_score_salary > 3.0)| (mba_salary_df.z_score_salary< -3.0)]

S. No. Percentage in Grade 10 Salary z_score_salary

Hence there is no outlier
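Because the filter above returns an empty table, it may help to see the same rule flag a value. The sketch below applies the Z-score rule with NumPy to a made-up salary list containing one extreme entry. The data and variable names are assumptions for illustration, and the population standard deviation (ddof=0) is used because that is the default of scipy.stats.zscore.

```python
import numpy as np

# Made-up salaries: 20 values between 240000 and 259000, plus one extreme value
salaries = np.array(
    [250000 + 1000 * i for i in range(-10, 10)] + [900000], dtype=float
)

# Z-score: distance from the mean in units of standard deviation (ddof=0)
z = (salaries - salaries.mean()) / salaries.std()

# Flag observations more than 3 standard deviations from the mean
outlier_idx = np.where(np.abs(z) > 3.0)[0]
print(outlier_idx, salaries[outlier_idx])
```

Only the last entry (index 20, salary 900000) is flagged; the remaining values sit well within 3 standard deviations of the mean.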

10. Model Diagnostics - Finding highly influential
observations using Cook's distance. This distance
measures how much the predicted values of the dependent
variable change for all observations in the sample when
a particular observation is removed from the sample while
estimating the regression parameters. get_influence()
returns the influence measures of each observation, and its
cooks_distance attribute provides the Cook's distance
values. An observation with a Cook's distance of more
than 1 is highly influential.
import numpy as np

mba_influence = mba_salary_lm.get_influence()

(c,p) = mba_influence.cooks_distance

plt.stem(np.arange(len(train_X)), np.round(c,3))

plt.xlabel("Row Index")
plt.ylabel("Cooks Distance")

plt.show()


There is no observation with Cook's distance > 1. Hence none of them is highly influential.
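To make the definition in step 10 concrete, the sketch below computes Cook's distance by hand with NumPy, using the standard closed-form expression D_i = (e_i^2 / (p * MSE)) * h_ii / (1 - h_ii)^2, where e_i is the residual, h_ii the leverage and p the number of estimated parameters. It is an illustration on made-up data with one deliberately misplaced point, not a replacement for get_influence(); all names are assumptions.

```python
import numpy as np

# Made-up data: ten points on the line y = 2x, plus one point far off the line
x = np.append(np.arange(10, dtype=float), 30.0)
y = np.append(2.0 * np.arange(10, dtype=float), 0.0)

X = np.column_stack([np.ones_like(x), x])   # design matrix with constant
n, p = X.shape                              # p counts the intercept too

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

# Leverage h_ii: diagonal of the hat matrix X (X'X)^-1 X'
leverage = np.sum((X @ np.linalg.inv(X.T @ X)) * X, axis=1)

mse = resid @ resid / (n - p)               # residual variance estimate
cooks_d = (resid ** 2 / (p * mse)) * leverage / (1 - leverage) ** 2

influential = np.where(cooks_d > 1)[0]
print(np.round(cooks_d, 3))
print("highly influential rows:", influential)
```

Only the added point (row 10) exceeds the threshold of 1, because it combines high leverage with a large residual.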

11. Making predictions on the validation set and measuring
accuracy - R-Squared and RMSE
import numpy as np

from sklearn.metrics import r2_score, mean_squared_error

pred_y = mba_salary_lm.predict(test_X)

print('R2 Score =',np.abs(r2_score(test_y,pred_y)))

print('RMSE = ', np.sqrt(mean_squared_error(test_y,pred_y)))

R2 Score = 0.156645849742304

RMSE = 73458.04348346895
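One caveat about the R2 score printed above: on a validation set, R-squared can be negative when the predictions fit worse than simply predicting the mean, and wrapping the score in np.abs() hides that sign. The sketch below shows this with hand-written versions of the two metrics on made-up numbers. The function names and data are assumptions for illustration; the R-squared formula is the usual 1 - SS_res / SS_tot definition, which is also what r2_score computes.

```python
import numpy as np

def r2(y_true, y_pred):
    # 1 - (residual sum of squares) / (total sum of squares)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1 - ss_res / ss_tot

def rmse(y_true, y_pred):
    # Root mean squared error, in the same units as y
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

# Made-up example in which the predictions run opposite to the truth
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([4.0, 3.0, 2.0, 1.0])

score = r2(y_true, y_pred)
print("R2           =", score)           # negative: worse than predicting the mean
print("abs(R2)      =", np.abs(score))   # looks like a strong fit, but is not
print("RMSE         =", rmse(y_true, y_pred))
```

Here R-squared is -3.0, while its absolute value would read as 3.0. Printing the raw score keeps a poor validation fit visible.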
