
Linear Regression - Part 4

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

1. Read the CPU data.

df = pd.read_csv("https://raw.githubusercontent.com/grbruns/cst383/master/machine.csv")
df.index = df['vendor']+' '+df['model']
df.drop(['vendor', 'model'], axis=1, inplace=True)
df['cs'] = np.round(1e3/df['myct'], 2)  # clock speed in MHz

df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 209 entries, adviser 32/60 to wang vs-90
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 myct 209 non-null int64
1 mmin 209 non-null int64
2 mmax 209 non-null int64
3 cach 209 non-null int64
4 chmin 209 non-null int64
5 chmax 209 non-null int64
6 prp 209 non-null int64
7 erp 209 non-null int64
8 cs 209 non-null float64
dtypes: float64(1), int64(8)
memory usage: 16.3+ KB

2. Split the data randomly into a training set and a test set, using a 70/30 split (70% training
data).

X = df[['mmax', 'cach']]   # two predictor columns (any reasonable choice; see step 3)
y = df['prp']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=0)  # any fixed seed

y_test.values
 

array([ 274, 30, 22, 915, 16, 326, 72, 6, 1144, 208, 65,
130, 52, 45, 35, 36, 51, 31, 100, 132, 50, 60,
111, 18, 11, 50, 69, 27, 19, 41, 248, 32, 45,
26, 16, 26, 67, 465, 38, 17, 307, 34, 214, 465,
1150, 32, 510, 40, 24, 71, 23, 11, 24, 27, 120,
54, 40, 17, 259, 318, 93, 71, 277])

3. Use LinearRegression to create a linear model to predict performance (feature ‘prp’). Use a
couple of predictor variables of your own choice. Create the model using your training set.

# quick look: cache size vs. performance, with a least-squares line overlaid
poly = np.polyfit(df.prp, df.cach, 1)
sns.scatterplot(data=df, x="prp", y="cach")
plt.plot(df.prp, poly[0] * df.prp + poly[1], '--', color='black')

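The plot above only visualizes a single predictor; the model itself still has to be fit with LinearRegression on the training set. A minimal sketch, assuming 'mmax' and 'cach' as the two chosen predictors (the synthetic data below is a stand-in for df, since the real data requires the download in step 1):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

def fit_prp_model(X, y, random_state=0):
    """Split 70/30, fit a linear model on the training set, return (model, X_test, y_test)."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.30, random_state=random_state)
    reg = LinearRegression().fit(X_train, y_train)
    return reg, X_test, y_test

# stand-in data shaped like df[['mmax', 'cach']] and df['prp']
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 100, size=(209, 2))
y_demo = 2 * X_demo[:, 0] + 0.5 * X_demo[:, 1] + rng.normal(0, 1, 209)

reg, X_test, y_test = fit_prp_model(X_demo, y_demo)
```

With the real data this would be called as `fit_prp_model(df[['mmax', 'cach']], df['prp'])`; the fitted coefficients are then in `reg.coef_` and `reg.intercept_`.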

4. Compute the MSE of your model on the test data. Do this manually.

reg = LinearRegression()
reg.fit(X_train, y_train)
predict = reg.predict(X_test)
MSE = ((predict - y_test)**2).mean()   # mean squared error, computed by hand
RMSE = np.sqrt(MSE)                    # root mean squared error
RMSE

127.79049465873042
5. Repeat steps 2-4, but this time use newly-generated random training and test sets. How much
does the RMSE differ?
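One way to approach this is to wrap steps 2-4 in a helper and call it twice with no fixed seed, so each call draws a fresh random split. A sketch (again using synthetic stand-in data in place of df):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

def split_fit_rmse(X, y, test_size=0.30, random_state=None):
    """One random split, one fit, and the resulting test-set RMSE."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state)
    reg = LinearRegression().fit(X_train, y_train)
    pred = reg.predict(X_test)
    return float(np.sqrt(np.mean((pred - y_test) ** 2)))

# stand-in data shaped like the CPU data
rng = np.random.default_rng(1)
X_demo = rng.uniform(0, 100, size=(209, 2))
y_demo = 3 * X_demo[:, 0] - X_demo[:, 1] + rng.normal(0, 5, 209)

# two fresh random splits typically give two different RMSE values
rmse_a = split_fit_rmse(X_demo, y_demo)
rmse_b = split_fit_rmse(X_demo, y_demo)
```

The gap between `rmse_a` and `rmse_b` gives a first feel for how much the estimate depends on which rows land in the test set.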

6. If you have time, write code that will do steps 2-4 100 times, each time creating different
training/test sets. Collect the computed RMSE values, and plot them on a histogram.
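The 100-repetition experiment can be sketched as below (synthetic stand-in data again; with the real data, pass `df[['mmax', 'cach']]` and `df['prp']`):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

def rmse_distribution(X, y, n_runs=100, test_size=0.30):
    """Repeat split/fit/evaluate n_runs times; return all test-set RMSEs."""
    rmses = []
    for _ in range(n_runs):
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=test_size)
        reg = LinearRegression().fit(X_train, y_train)
        pred = reg.predict(X_test)
        rmses.append(float(np.sqrt(np.mean((pred - y_test) ** 2))))
    return rmses

rng = np.random.default_rng(2)
X_demo = rng.uniform(0, 100, size=(209, 2))
y_demo = 3 * X_demo[:, 0] - X_demo[:, 1] + rng.normal(0, 5, 209)

rmses = rmse_distribution(X_demo, y_demo)
plt.hist(rmses, bins=20)
plt.xlabel("test RMSE")
```

The spread of the histogram shows how sensitive a single 70/30 estimate is to the particular split.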

7. If you still have time, repeat problem 6, but this time use an 80/20 split.
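The 80/20 variant only changes the `test_size` argument; a compact standalone version (synthetic stand-in data as before):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(4)
X_demo = rng.uniform(0, 100, size=(209, 2))
y_demo = 3 * X_demo[:, 0] - X_demo[:, 1] + rng.normal(0, 5, 209)

# same experiment as step 6, but with an 80/20 split
rmses = []
for _ in range(100):
    X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.20)
    pred = LinearRegression().fit(X_tr, y_tr).predict(X_te)
    rmses.append(float(np.sqrt(np.mean((pred - y_te) ** 2))))
```

With 209 rows, an 80/20 split leaves only 42 test rows (vs. 63 at 70/30), so each individual RMSE is computed from fewer points and the histogram tends to be wider.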

8. If you still have time, compute the MSE using cross-validation on the entire data set. Do this many
times and plot all of the resulting values on a histogram.
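A sketch of repeated k-fold cross-validation, using scikit-learn's `cross_val_score` with the `neg_mean_squared_error` scorer (negated back to positive MSE), on synthetic stand-in data:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, KFold

def cv_mse_values(X, y, n_repeats=50, n_splits=5):
    """Repeated k-fold CV on the full data; return every fold's MSE."""
    mses = []
    for i in range(n_repeats):
        kf = KFold(n_splits=n_splits, shuffle=True, random_state=i)
        scores = cross_val_score(LinearRegression(), X, y,
                                 scoring="neg_mean_squared_error", cv=kf)
        mses.extend(-scores)   # scorer is negated, so flip the sign
    return mses

rng = np.random.default_rng(3)
X_demo = rng.uniform(0, 100, size=(209, 2))
y_demo = 3 * X_demo[:, 0] - X_demo[:, 1] + rng.normal(0, 5, 209)

mses = cv_mse_values(X_demo, y_demo)
plt.hist(mses, bins=20)
plt.xlabel("fold MSE")
```

Because every row is used for testing exactly once per repeat, the fold-level MSEs cluster more tightly around the true error than single random splits do.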

9. If you still have time, check out Section 5.1 of 'An Introduction to Statistical Learning'.
