1/16/23, 5:38 PM simple linear

1) Calories_consumed-> predict weight gained using calories consumed

dataset = calories_consumed.csv

y - continuous (dependent), x - single & continuous (independent) ==> y - Weight gained (grams), x - Calories Consumed

So, we use simple linear regression.
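Before fitting with statsmodels, it helps to see what the fit computes. A minimal, self-contained sketch (with made-up points, not the assignment data) of the closed-form least-squares formulas for the line y = b0 + b1*x:

```python
def fit_simple_ols(x, y):
    """Closed-form simple linear regression:
    slope = Sxy / Sxx, intercept = mean(y) - slope * mean(x)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Points that lie exactly on y = 1 + 2x, so the fit recovers those coefficients
intercept, slope = fit_simple_ols([1, 2, 3, 4], [3, 5, 7, 9])
print(intercept, slope)  # 1.0 2.0
```

`smf.ols` below performs the same fit (plus the inference shown in the summaries) through its formula interface.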

Importing necessary libraries

In [1]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

Reading csv file using pandas library

In [2]:

wg_cc=pd.read_csv("C:\\Users\\Raja\\Downloads\\assignments\\simple linear\\calories_consumed.csv")

EDA

In [3]:

wg_cc.head()

Out[3]:

wg cc

0 108 1500

1 200 2300

2 900 3400

3 200 2200

4 300 2500

https://onlinefreetutorial.com/wp-content/uploads/2020/03/simple-linear.html 1/29

In [4]:

wg_cc.corr()

Out[4]:

wg cc

wg 1.000000 0.946991

cc 0.946991 1.000000

In [5]:

plt.scatter(x=wg_cc.cc, y=wg_cc.wg, color='red')


plt.xlabel("Calories Consumed")
plt.ylabel("Weight gained (grams)")

Out[5]:

Text(0,0.5,'Weight gained (grams)')

In [6]:

wg_cc.describe()

Out[6]:

wg cc

count 14.000000 14.000000

mean 357.714286 2340.714286

std 333.692495 752.109488

min 62.000000 1400.000000

25% 114.500000 1727.500000

50% 200.000000 2250.000000

75% 537.500000 2775.000000

max 1100.000000 3900.000000


Importing statsmodels.formula.api for linear regression model

In [7]:

import statsmodels.formula.api as smf

In [8]:

model=smf.ols("wg~cc",data=wg_cc).fit()

In [9]:

model.params

Out[9]:

Intercept -625.752356
cc 0.420157
dtype: float64


In [10]:

model.summary()

C:\Users\Raja\Anaconda33\lib\site-packages\scipy\stats\stats.py:1394: User
Warning: kurtosistest only valid for n>=20 ... continuing anyway, n=14
"anyway, n=%i" % int(n))

Out[10]:

OLS Regression Results

Dep. Variable: wg R-squared: 0.897

Model: OLS Adj. R-squared: 0.888

Method: Least Squares F-statistic: 104.3

Date: Sat, 21 Mar 2020 Prob (F-statistic): 2.86e-07

Time: 18:14:56 Log-Likelihood: -84.792

No. Observations: 14 AIC: 173.6

Df Residuals: 12 BIC: 174.9

Df Model: 1

Covariance Type: nonrobust

coef std err t P>|t| [0.025 0.975]

Intercept -625.7524 100.823 -6.206 0.000 -845.427 -406.078

cc 0.4202 0.041 10.211 0.000 0.331 0.510

Omnibus: 3.394 Durbin-Watson: 2.537

Prob(Omnibus): 0.183 Jarque-Bera (JB): 1.227

Skew: -0.203 Prob(JB): 0.541

Kurtosis: 1.608 Cond. No. 8.28e+03

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 8.28e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

In [11]:

model.conf_int(0.05) # 95% confidence interval

Out[11]:

0 1

Intercept -845.426655 -406.078057

cc 0.330506 0.509807

In [12]:

pred = model.predict(wg_cc) # Predicted values of wg using the model


Visualization of the regression line over the scatter plot of wg & cc

In [13]:

plt.scatter(x=wg_cc.cc, y=wg_cc.wg, color='red')


plt.plot(wg_cc.cc, pred,color='black')
plt.xlabel("Calories Consumed")
plt.ylabel("Weight gained (grams)")

Out[13]:

Text(0,0.5,'Weight gained (grams)')

In [14]:

pred.corr(wg_cc.wg) # correlation between predicted and actual wg

Out[14]:

0.9469910088554457

Transforming variables for accuracy

In [15]:

model1 = smf.ols('wg~np.log(cc)',data=wg_cc).fit()

In [16]:

model1.params

Out[16]:

Intercept -6955.650125
np.log(cc) 948.371723
dtype: float64


In [17]:

model1.summary()

C:\Users\Raja\Anaconda33\lib\site-packages\scipy\stats\stats.py:1394: User
Warning: kurtosistest only valid for n>=20 ... continuing anyway, n=14
"anyway, n=%i" % int(n))

Out[17]:

OLS Regression Results

Dep. Variable: wg R-squared: 0.808

Model: OLS Adj. R-squared: 0.792

Method: Least Squares F-statistic: 50.40

Date: Sat, 21 Mar 2020 Prob (F-statistic): 1.25e-05

Time: 18:14:57 Log-Likelihood: -89.148

No. Observations: 14 AIC: 182.3

Df Residuals: 12 BIC: 183.6

Df Model: 1

Covariance Type: nonrobust

coef std err t P>|t| [0.025 0.975]

Intercept -6955.6501 1030.908 -6.747 0.000 -9201.806 -4709.494

np.log(cc) 948.3717 133.580 7.100 0.000 657.325 1239.418

Omnibus: 3.265 Durbin-Watson: 2.438

Prob(Omnibus): 0.195 Jarque-Bera (JB): 1.139

Skew: 0.046 Prob(JB): 0.566

Kurtosis: 1.606 Cond. No. 199.

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

In [18]:

model1.conf_int(0.01)

Out[18]:

0 1

Intercept -10104.600246 -3806.700004

np.log(cc) 540.345229 1356.398218

In [19]:

pred1 = model1.predict(wg_cc)


In [20]:

pred1.corr(wg_cc.wg)

Out[20]:

0.8987252805287711

The model with the higher R-squared is the better fit: the untransformed model (0.897) outperforms the log-transformed model1 (0.808). Since |r| > 0.85, the relationship is strong.
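For simple linear regression, R-squared is simply the square of the correlation coefficient, which gives a quick consistency check on the numbers above:

```python
r = 0.946991  # correlation between wg and cc reported by wg_cc.corr() above
r_squared = r ** 2
print(round(r_squared, 3))  # 0.897, matching the model's R-squared
```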

====================================================================================

2) Delivery_time -> Predict delivery time using sorting time

dataset = delivery_time.csv

y - continuous (dependent), x - single & continuous (independent) ==> y - delivery time, x - sorting time

So, we use simple linear regression.

Reading csv file using pandas library

In [21]:

dt_st=pd.read_csv("C:\\Users\\Raja\\Downloads\\assignments\\simple linear\\delivery_time.csv")

EDA

In [22]:

dt_st.head()

Out[22]:

dt st

0 21.00 10

1 13.50 4

2 19.75 6

3 24.00 9

4 29.00 10


In [23]:

dt_st.corr()

Out[23]:

dt st

dt 1.000000 0.825997

st 0.825997 1.000000

In [24]:

plt.scatter(x=dt_st.st, y=dt_st.dt, color='green')


plt.xlabel("Sorting time")
plt.ylabel("Delivery time")

Out[24]:

Text(0,0.5,'Delivery time')


In [25]:

plt.boxplot(dt_st.dt)

Out[25]:

{'whiskers': [<matplotlib.lines.Line2D at 0x1715aa31c88>,


<matplotlib.lines.Line2D at 0x1715aa31d68>],
'caps': [<matplotlib.lines.Line2D at 0x1715aa3c518>,
<matplotlib.lines.Line2D at 0x1715aa3c940>],
'boxes': [<matplotlib.lines.Line2D at 0x1715aa316a0>],
'medians': [<matplotlib.lines.Line2D at 0x1715aa3cd68>],
'fliers': [<matplotlib.lines.Line2D at 0x1715aa3ce80>],
'means': []}

In [26]:

plt.hist(dt_st.dt, bins=5)

Out[26]:

(array([5., 4., 8., 3., 1.]),


array([ 8. , 12.2, 16.4, 20.6, 24.8, 29. ]),
<a list of 5 Patch objects>)

In [27]:

model2=smf.ols("dt~st",data=dt_st).fit()


In [28]:

model2.params

Out[28]:

Intercept 6.582734
st 1.649020
dtype: float64

In [29]:

model2.summary()

Out[29]:

OLS Regression Results

Dep. Variable: dt R-squared: 0.682

Model: OLS Adj. R-squared: 0.666

Method: Least Squares F-statistic: 40.80

Date: Sat, 21 Mar 2020 Prob (F-statistic): 3.98e-06

Time: 18:14:59 Log-Likelihood: -51.357

No. Observations: 21 AIC: 106.7

Df Residuals: 19 BIC: 108.8

Df Model: 1

Covariance Type: nonrobust

coef std err t P>|t| [0.025 0.975]

Intercept 6.5827 1.722 3.823 0.001 2.979 10.186

st 1.6490 0.258 6.387 0.000 1.109 2.189

Omnibus: 3.649 Durbin-Watson: 1.248

Prob(Omnibus): 0.161 Jarque-Bera (JB): 2.086

Skew: 0.750 Prob(JB): 0.352

Kurtosis: 3.367 Cond. No. 18.3

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

In [30]:

model3=smf.ols("dt~np.log(st)",data=dt_st).fit()


In [31]:

model3.params

Out[31]:

Intercept 1.159684
np.log(st) 9.043413
dtype: float64

In [32]:

model3.summary()

Out[32]:

OLS Regression Results

Dep. Variable: dt R-squared: 0.695

Model: OLS Adj. R-squared: 0.679

Method: Least Squares F-statistic: 43.39

Date: Sat, 21 Mar 2020 Prob (F-statistic): 2.64e-06

Time: 18:14:59 Log-Likelihood: -50.912

No. Observations: 21 AIC: 105.8

Df Residuals: 19 BIC: 107.9

Df Model: 1

Covariance Type: nonrobust

coef std err t P>|t| [0.025 0.975]

Intercept 1.1597 2.455 0.472 0.642 -3.978 6.297

np.log(st) 9.0434 1.373 6.587 0.000 6.170 11.917

Omnibus: 5.552 Durbin-Watson: 1.427

Prob(Omnibus): 0.062 Jarque-Bera (JB): 3.481

Skew: 0.946 Prob(JB): 0.175

Kurtosis: 3.628 Cond. No. 9.08

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

In [33]:

model2.conf_int(0.05) # 95% confidence interval

Out[33]:

0 1

Intercept 2.979134 10.186334

st 1.108673 2.189367


In [34]:

model3.conf_int(0.05) # 95% confidence interval

Out[34]:

0 1

Intercept -3.97778 6.297147

np.log(st) 6.16977 11.917057

In [35]:

pred2 = model2.predict(dt_st) # Predicted values of dt using the model

In [36]:

pred3 = model3.predict(dt_st) # Predicted values of dt using the model

In [37]:

plt.scatter(x=dt_st.st, y=dt_st.dt, color='green')


plt.plot(dt_st.st, pred2,color='black')
plt.xlabel("Sorting time")
plt.ylabel("Delivery time")

Out[37]:

Text(0,0.5,'Delivery time')


In [38]:

plt.scatter(x=dt_st.st, y=dt_st.dt, color='red')


plt.plot(dt_st.st, pred3,color='green')
plt.xlabel("Sorting time")
plt.ylabel("Delivery time")

Out[38]:

Text(0,0.5,'Delivery time')

Model3 is slightly better than model2 (R-squared 0.695 vs 0.682), though the correlation here is only moderate.
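R-squared can be supplemented with RMSE when comparing models. A rough sketch using only the first five rows of the table above and the model2 coefficients; the exact value depends on the full dataset, so this is illustrative only:

```python
import math

def rmse(actual, fitted):
    """Root mean squared error between observed and fitted values."""
    return math.sqrt(sum((a - f) ** 2 for a, f in zip(actual, fitted)) / len(actual))

intercept, slope = 6.5827, 1.6490         # model2 parameters from the summary above
st = [10, 4, 6, 9, 10]                    # first five sorting times
dt = [21.00, 13.50, 19.75, 24.00, 29.00]  # corresponding delivery times
fitted = [intercept + slope * s for s in st]
print(round(rmse(dt, fitted), 2))         # roughly 3.37 on these five rows
```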

====================================================================================

3) Emp_data -> Build a prediction model for Churn_out_rate

dataset = emp_data.csv

y - continuous (dependent), x - single & continuous (independent) ==> y - Churn_out_rate, x - Salary_hike

Reading csv file using pandas library

In [39]:

sal_churn=pd.read_csv("C:\\Users\\Raja\\Downloads\\assignments\\simple linear\\emp_data.csv")


EDA

In [40]:

sal_churn.head()

Out[40]:

Salary_hike Churn_out_rate

0 1580 92

1 1600 85

2 1610 80

3 1640 75

4 1660 72

In [41]:

sal_churn.corr()

Out[41]:

Salary_hike Churn_out_rate

Salary_hike 1.000000 -0.911722

Churn_out_rate -0.911722 1.000000

In [42]:

plt.scatter(x=sal_churn.Salary_hike, y=sal_churn.Churn_out_rate, color='blue')


plt.xlabel("Salary_hike")
plt.ylabel("Churn_out_rate")

Out[42]:

Text(0,0.5,'Churn_out_rate')


In [43]:

plt.hist(sal_churn.Salary_hike)

Out[43]:

(array([2., 1., 2., 1., 1., 1., 0., 1., 0., 1.]),
array([1580., 1609., 1638., 1667., 1696., 1725., 1754., 1783., 1812.,
1841., 1870.]),
<a list of 10 Patch objects>)

In [44]:

plt.hist(sal_churn.Churn_out_rate)

Out[44]:

(array([2., 1., 1., 2., 1., 0., 1., 1., 0., 1.]),
array([60. , 63.2, 66.4, 69.6, 72.8, 76. , 79.2, 82.4, 85.6, 88.8, 92.
]),
<a list of 10 Patch objects>)

If |r| is greater than 0.85, the correlation is strong. Here the correlation coefficient is -0.911722: a strong negative correlation.
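The coefficient reported by `sal_churn.corr()` comes from Pearson's formula. A self-contained sketch on just the first five rows of the table above (so the value differs from the full-data -0.911722):

```python
def pearson_r(x, y):
    """Pearson correlation: Sxy / sqrt(Sxx * Syy)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

hike = [1580, 1600, 1610, 1640, 1660]    # first five Salary_hike values
churn = [92, 85, 80, 75, 72]             # first five Churn_out_rate values
print(round(pearson_r(hike, churn), 3))  # -0.973: strongly negative, as with the full data
```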


In [45]:

sal_churn.describe()

Out[45]:

Salary_hike Churn_out_rate

count 10.000000 10.000000

mean 1688.600000 72.900000

std 92.096809 10.257247

min 1580.000000 60.000000

25% 1617.500000 65.750000

50% 1675.000000 71.000000

75% 1724.000000 78.750000

max 1870.000000 92.000000

Simple model without using any transformation

In [46]:

model4=smf.ols("Churn_out_rate~Salary_hike",data=sal_churn).fit()


In [47]:

model4.summary()

C:\Users\Raja\Anaconda33\lib\site-packages\scipy\stats\stats.py:1394: User
Warning: kurtosistest only valid for n>=20 ... continuing anyway, n=10
"anyway, n=%i" % int(n))

Out[47]:

OLS Regression Results

Dep. Variable: Churn_out_rate R-squared: 0.831

Model: OLS Adj. R-squared: 0.810

Method: Least Squares F-statistic: 39.40

Date: Sat, 21 Mar 2020 Prob (F-statistic): 0.000239

Time: 18:15:01 Log-Likelihood: -28.046

No. Observations: 10 AIC: 60.09

Df Residuals: 8 BIC: 60.70

Df Model: 1

Covariance Type: nonrobust

coef std err t P>|t| [0.025 0.975]

Intercept 244.3649 27.352 8.934 0.000 181.291 307.439

Salary_hike -0.1015 0.016 -6.277 0.000 -0.139 -0.064

Omnibus: 2.201 Durbin-Watson: 0.562

Prob(Omnibus): 0.333 Jarque-Bera (JB): 1.408

Skew: 0.851 Prob(JB): 0.495

Kurtosis: 2.304 Cond. No. 3.27e+04

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.27e+04. This might indicate that there are
strong multicollinearity or other numerical problems.

In [48]:

model4.params

Out[48]:

Intercept 244.364911
Salary_hike -0.101543
dtype: float64


In [49]:

model4.conf_int(0.05) # 95% confidence interval

Out[49]:

0 1

Intercept 181.291232 307.438591

Salary_hike -0.138845 -0.064240

In [50]:

pred4 = model4.predict(sal_churn) # Predicted values of Churn_out_rate using the model

In [51]:

model5=smf.ols("Churn_out_rate~np.log(Salary_hike)",data=sal_churn).fit()


In [52]:

model5.summary()

C:\Users\Raja\Anaconda33\lib\site-packages\scipy\stats\stats.py:1394: User
Warning: kurtosistest only valid for n>=20 ... continuing anyway, n=10
"anyway, n=%i" % int(n))

Out[52]:

OLS Regression Results

Dep. Variable: Churn_out_rate R-squared: 0.849

Model: OLS Adj. R-squared: 0.830

Method: Least Squares F-statistic: 44.85

Date: Sat, 21 Mar 2020 Prob (F-statistic): 0.000153

Time: 18:15:02 Log-Likelihood: -27.502

No. Observations: 10 AIC: 59.00

Df Residuals: 8 BIC: 59.61

Df Model: 1

Covariance Type: nonrobust

coef std err t P>|t| [0.025 0.975]

Intercept 1381.4562 195.402 7.070 0.000 930.858 1832.054

np.log(Salary_hike) -176.1097 26.297 -6.697 0.000 -236.751 -115.468

Omnibus: 2.213 Durbin-Watson: 0.571

Prob(Omnibus): 0.331 Jarque-Bera (JB): 1.418

Skew: 0.853 Prob(JB): 0.492

Kurtosis: 2.298 Cond. No. 1.10e+03

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.1e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

In [53]:

model5.params

Out[53]:

Intercept 1381.456193
np.log(Salary_hike) -176.109735
dtype: float64


In [54]:

model5.conf_int(0.05) # 95% confidence interval

Out[54]:

0 1

Intercept 930.858413 1832.053972

np.log(Salary_hike) -236.751223 -115.468248

In [55]:

pred5 = model5.predict(sal_churn) # Predicted values of Churn_out_rate using the model

In [56]:

plt.scatter(x=sal_churn.Salary_hike, y=sal_churn.Churn_out_rate, color='blue')


plt.plot(sal_churn.Salary_hike, pred5,color='black')
plt.xlabel("Salary_hike")
plt.ylabel("Churn_out_rate")

Out[56]:

Text(0,0.5,'Churn_out_rate')

The log-transformed model5 has the highest R-squared (0.849 vs 0.831 for model4), so it is the better model.
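Because model5 regresses on np.log(Salary_hike), a prediction plugs the log of the hike into the fitted line. A sketch using the parameters printed above (1700 is an arbitrary hike value chosen for illustration):

```python
import math

intercept, coef = 1381.456193, -176.109735  # model5 parameters from above

def predict_churn(salary_hike):
    # Fitted line on the log scale: churn = intercept + coef * ln(salary_hike)
    return intercept + coef * math.log(salary_hike)

print(round(predict_churn(1700), 1))  # roughly 71.5
```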

====================================================================================

4) Salary_hike -> Build a prediction model for Salary_hike

dataset = Salary_Data.csv

y - continuous (dependent), x - single & continuous (independent) ==> y - Salary, x - YearsExperience


In [57]:

sal_hike=pd.read_csv("C:\\Users\\Raja\\Downloads\\assignments\\simple linear\\Salary_Data.csv")

In [58]:

sal_hike.head()

Out[58]:

YearsExperience Salary

0 1.1 39343.0

1 1.3 46205.0

2 1.5 37731.0

3 2.0 43525.0

4 2.2 39891.0

In [59]:

sal_hike.shape

Out[59]:

(30, 2)

In [60]:

sal_hike.describe()

Out[60]:

YearsExperience Salary

count 30.000000 30.000000

mean 5.313333 76003.000000

std 2.837888 27414.429785

min 1.100000 37731.000000

25% 3.200000 56720.750000

50% 4.700000 65237.000000

75% 7.700000 100544.750000

max 10.500000 122391.000000


In [61]:

plt.boxplot(sal_hike.YearsExperience)

Out[61]:

{'whiskers': [<matplotlib.lines.Line2D at 0x1715adebba8>,


<matplotlib.lines.Line2D at 0x1715adf84a8>],
'caps': [<matplotlib.lines.Line2D at 0x1715adf8908>,
<matplotlib.lines.Line2D at 0x1715adf8d68>],
'boxes': [<matplotlib.lines.Line2D at 0x1715adeba58>],
'medians': [<matplotlib.lines.Line2D at 0x1715adf8e80>],
'fliers': [<matplotlib.lines.Line2D at 0x1715ae01668>],
'means': []}

In [62]:

plt.boxplot(sal_hike.Salary)

Out[62]:

{'whiskers': [<matplotlib.lines.Line2D at 0x1715ae48ef0>,


<matplotlib.lines.Line2D at 0x1715ae51390>],
'caps': [<matplotlib.lines.Line2D at 0x1715ae517f0>,
<matplotlib.lines.Line2D at 0x1715ae51c50>],
'boxes': [<matplotlib.lines.Line2D at 0x1715ae48940>],
'medians': [<matplotlib.lines.Line2D at 0x1715ae51d68>],
'fliers': [<matplotlib.lines.Line2D at 0x1715ae5a550>],
'means': []}


In [63]:

sal_hike.corr()

Out[63]:

YearsExperience Salary

YearsExperience 1.000000 0.978242

Salary 0.978242 1.000000

In [64]:

plt.hist(sal_hike.Salary, bins=20)

Out[64]:

(array([3., 1., 1., 1., 5., 2., 3., 1., 0., 0., 2., 0., 1., 1., 1., 1.,
2.,
2., 1., 2.]),
array([ 37731., 41964., 46197., 50430., 54663., 58896., 63129.,
67362., 71595., 75828., 80061., 84294., 88527., 92760.,
96993., 101226., 105459., 109692., 113925., 118158., 122391.]),
<a list of 20 Patch objects>)


In [65]:

plt.scatter(x=sal_hike.YearsExperience, y=sal_hike.Salary, color='blue')


plt.xlabel("YearsExperience")
plt.ylabel("Salary")

Out[65]:

Text(0,0.5,'Salary')

In [66]:

model6=smf.ols("Salary~YearsExperience",data=sal_hike).fit()


In [67]:

model6.summary()

Out[67]:

OLS Regression Results

Dep. Variable: Salary R-squared: 0.957

Model: OLS Adj. R-squared: 0.955

Method: Least Squares F-statistic: 622.5

Date: Sat, 21 Mar 2020 Prob (F-statistic): 1.14e-20

Time: 18:15:04 Log-Likelihood: -301.44

No. Observations: 30 AIC: 606.9

Df Residuals: 28 BIC: 609.7

Df Model: 1

Covariance Type: nonrobust

coef std err t P>|t| [0.025 0.975]

Intercept 2.579e+04 2273.053 11.347 0.000 2.11e+04 3.04e+04

YearsExperience 9449.9623 378.755 24.950 0.000 8674.119 1.02e+04

Omnibus: 2.140 Durbin-Watson: 1.648

Prob(Omnibus): 0.343 Jarque-Bera (JB): 1.569

Skew: 0.363 Prob(JB): 0.456

Kurtosis: 2.147 Cond. No. 13.2

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

In [68]:

model7=smf.ols("Salary~np.log(YearsExperience)",data=sal_hike).fit()


In [69]:

model7.summary()

Out[69]:

OLS Regression Results

Dep. Variable: Salary R-squared: 0.854

Model: OLS Adj. R-squared: 0.849

Method: Least Squares F-statistic: 163.6

Date: Sat, 21 Mar 2020 Prob (F-statistic): 3.25e-13

Time: 18:15:04 Log-Likelihood: -319.77

No. Observations: 30 AIC: 643.5

Df Residuals: 28 BIC: 646.3

Df Model: 1

Covariance Type: nonrobust

coef std err t P>|t| [0.025 0.975]

Intercept 1.493e+04 5156.226 2.895 0.007 4365.921 2.55e+04

np.log(YearsExperience) 4.058e+04 3172.453 12.792 0.000 3.41e+04 4.71e+04

Omnibus: 1.094 Durbin-Watson: 0.512

Prob(Omnibus): 0.579 Jarque-Bera (JB): 0.908

Skew: 0.156 Prob(JB): 0.635

Kurtosis: 2.207 Cond. No. 5.76

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

In [70]:

model8=smf.ols("Salary~np.exp(YearsExperience)",data=sal_hike).fit()


In [71]:

model8.summary()

Out[71]:

OLS Regression Results

Dep. Variable: Salary R-squared: 0.472

Model: OLS Adj. R-squared: 0.454

Method: Least Squares F-statistic: 25.07

Date: Sat, 21 Mar 2020 Prob (F-statistic): 2.72e-05

Time: 18:15:04 Log-Likelihood: -339.03

No. Observations: 30 AIC: 682.1

Df Residuals: 28 BIC: 684.9

Df Model: 1

Covariance Type: nonrobust

coef std err t P>|t| [0.025 0.975]

Intercept 6.757e+04 4065.396 16.620 0.000 5.92e+04 7.59e+04

np.exp(YearsExperience) 2.1360 0.427 5.007 0.000 1.262 3.010

Omnibus: 4.567 Durbin-Watson: 0.202

Prob(Omnibus): 0.102 Jarque-Bera (JB): 1.966

Skew: 0.276 Prob(JB): 0.374

Kurtosis: 1.874 Cond. No. 1.05e+04

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.05e+04. This might indicate that there are
strong multicollinearity or other numerical problems.

In [72]:

model6.params

Out[72]:

Intercept 25792.200199
YearsExperience 9449.962321
dtype: float64

In [73]:

model7.params

Out[73]:

Intercept 14927.97177
np.log(YearsExperience) 40581.98796
dtype: float64


In [74]:

model6.conf_int(0.05) # 95% confidence interval

Out[74]:

0 1

Intercept 21136.061314 30448.339084

YearsExperience 8674.118747 10225.805896

In [75]:

pred6 = model6.predict(sal_hike) # Predicted values of Salary using the model

In [76]:

plt.scatter(x=sal_hike.YearsExperience, y=sal_hike.Salary, color='blue')


plt.plot(sal_hike.YearsExperience, pred6,color='black')
plt.xlabel("YearsExperience")
plt.ylabel("Salary")

Out[76]:

Text(0,0.5,'Salary')

In [77]:

pred7 = model7.predict(sal_hike) # Predicted values of Salary using the model


In [78]:

plt.scatter(x=sal_hike.YearsExperience, y=sal_hike.Salary, color='blue')


plt.plot(sal_hike.YearsExperience, pred7,color='black')
plt.xlabel("YearsExperience")
plt.ylabel("Salary")

Out[78]:

Text(0,0.5,'Salary')

Model6 (R-squared 0.957) is better than model7 (0.854) and the exponential model8 (0.472), so the untransformed model is preferred.
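With model6 chosen, predicting a salary is just the fitted line. A sketch using the model6 parameters printed above (5 years of experience is an arbitrary example value):

```python
intercept, slope = 25792.200199, 9449.962321  # model6 parameters from above

def predict_salary(years_experience):
    # Fitted line: salary = intercept + slope * years
    return intercept + slope * years_experience

print(round(predict_salary(5.0)))  # about 73042
```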
