
Team 5

Krishnan KM Spandana M Sathvika P Rishekesan S.V

Linear Regression Model Representation:


Linear Regression is a popular machine learning algorithm that is used to predict the value
of a continuous outcome variable based on one or more predictor variables. It assumes a
linear relationship between the predictor variables and the outcome variable. The general
representation of the linear regression model can be expressed as follows:
Y = β0 + β1X1 + β2X2 + … + βnXn + ε
Where,
Y is the outcome (dependent) variable.
X1, X2, …, Xn are the predictor (independent) variables.
β0 is the y-intercept or constant term.
β1, β2, …, βn are the regression coefficients or parameters.
ε is the error term or residual.
The aim of the linear regression model is to estimate the coefficients β0, β1, …, βn so that the model fits the data well and the error term ε is minimized. This is achieved by minimizing the cost function, typically with the gradient descent algorithm.
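As a small illustration (a sketch, not part of the assignment code), the prediction step of this model can be written directly in NumPy; the feature matrix X and the coefficients beta0 and beta below are hypothetical values chosen only for the example:

import numpy as np

# Hypothetical data: m = 3 examples, n = 2 predictor variables
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])
beta0 = 0.5                      # intercept β0 (assumed value)
beta = np.array([1.5, -0.8])     # coefficients β1, β2 (assumed values)

# Ŷ = β0 + β1*X1 + β2*X2 for every example
Y_hat = beta0 + X @ beta
print(Y_hat)                     # [0.4, 1.8, 3.2]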

Cost Function:
The cost function is a measure of how well the linear regression model fits the data. It is
defined as half the mean of the squared errors between the predicted values and the actual values.
The formula for the cost function is given by:
J(β0, β1, …, βn) = 1/(2m) * Σ(i=1 to m) (Yi - Ŷi)^2
Where,
m is the number of training examples.
Yi is the actual value of the dependent variable for the ith training example.
Ŷi is the predicted value of the dependent variable for the ith training example.
The aim of the linear regression model is to minimize this cost function by finding the optimal values of the coefficients β0, β1, …, βn.
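To make the formula concrete, here is a minimal sketch of the cost computation in NumPy; the arrays y and y_hat are hypothetical actual and predicted values:

import numpy as np

def cost(y, y_hat):
    # J = 1/(2m) * Σ (Yi - Ŷi)^2
    m = len(y)
    return np.sum((y - y_hat) ** 2) / (2 * m)

y = np.array([3.0, 5.0, 7.0])       # actual values (hypothetical)
y_hat = np.array([2.8, 5.3, 6.9])   # predicted values (hypothetical)
print(cost(y, y_hat))               # ≈ 0.0233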

Gradient Descent Algorithm:


Gradient descent is an optimization algorithm that is used to minimize the cost function of
the linear regression model. It works by iteratively adjusting the regression coefficients in
the direction of the steepest descent of the cost function.
The algorithm starts by initializing the regression coefficients β0, β1, …, βn with random
values (or zeros). Then, it computes the predicted values of the dependent variable for the training
examples using the current values of the regression coefficients. Next, it calculates the
gradient of the cost function with respect to the regression coefficients; moving against this
gradient is the direction of steepest descent of the cost function.
The algorithm then updates each regression coefficient using the following formula:
βj := βj - α/m * Σ(i=1 to m) (Ŷi - Yi) * Xij
Where,
α is the learning rate, which controls the step size of the algorithm.
m is the number of training examples.
Yi is the actual value of the dependent variable for the ith training example.
Ŷi is the predicted value of the dependent variable for the ith training example.
Xij is the value of the jth independent variable for the ith training example (with Xi0 = 1 for the intercept β0).
The algorithm repeats these steps until the cost function converges to a minimum, which indicates that
the model has converged and the regression coefficients have been optimized.
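The update rule can be sketched as a small batch gradient descent implementation in NumPy (for illustration only; the assignment code below uses scikit-learn instead, and the learning rate and toy data here are assumptions):

import numpy as np

def gradient_descent(X, y, alpha=0.05, n_iters=5000):
    # X: (m, n) feature matrix, y: (m,) target vector
    m, n = X.shape
    beta0, beta = 0.0, np.zeros(n)             # initialise coefficients
    for _ in range(n_iters):
        y_hat = beta0 + X @ beta               # current predictions Ŷ
        error = y_hat - y                      # (Ŷi - Yi)
        beta0 -= alpha / m * np.sum(error)     # update the intercept β0
        beta -= alpha / m * (X.T @ error)      # update each βj with Σ (Ŷi - Yi) * Xij
    return beta0, beta

# Toy data generated from y = 2 + 3x (hypothetical)
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([5.0, 8.0, 11.0, 14.0])
print(gradient_descent(X, y))                  # close to (2.0, [3.0])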
In summary, the linear regression model is represented by a linear equation that predicts
the value of a continuous outcome variable based on one or more predictor variables. The
cost function measures how well the model fits the data, and the gradient descent
algorithm is used to minimize the cost function and optimize the regression coefficients.
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import matplotlib.pyplot as plt

# Load the data


data = pd.read_csv("Salary_Data.csv")

# Split the data into training and testing sets using the built-in function
X_train, X_test, y_train, y_test = train_test_split(
    data['YearsExperience'], data['Salary'], test_size=0.25, random_state=42)

# Single variable linear regression
model_single = LinearRegression()
model_single.fit(X_train.values.reshape(-1, 1), y_train)

# Multi-variable linear regression (not applicable for this dataset)


# model_multi = LinearRegression()
# model_multi.fit(data.drop('Salary', axis=1), data['Salary'])

# Predict the test set using single variable model


y_pred = model_single.predict(X_test.values.reshape(-1,1))

# Compute performance measures


mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)

print(f"MSE: {mse}")
print(f"RMSE: {rmse}")
print(f"R2: {r2}")

# Plot % of training data vs RMSE


training_sizes = [0.1, 0.3, 0.5, 0.7, 0.9]
rmse_scores = []
for size in training_sizes:
    # Re-split the data with the current training fraction
    X_train, X_test, y_train, y_test = train_test_split(
        data['YearsExperience'], data['Salary'],
        test_size=1 - size, random_state=42)
    model_single.fit(X_train.values.reshape(-1, 1), y_train)
    y_pred = model_single.predict(X_test.values.reshape(-1, 1))
    rmse_scores.append(np.sqrt(mean_squared_error(y_test, y_pred)))

# Plot the training fraction (as a percentage) against RMSE
plt.plot([size * 100 for size in training_sizes], rmse_scores)
plt.xlabel('% of training data')
plt.ylabel('RMSE')
plt.show()

MSE: 38802588.99247065
RMSE: 6229.172416338358
R2: 0.9347210011126782
# Multivariable linear regression
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import warnings

warnings.filterwarnings('ignore')

# Load the data


from sklearn.datasets import fetch_california_housing
housing = fetch_california_housing()

X = housing.data
y = housing.target

# Split the data into training and test sets


from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Fit the linear regression model with Gradient Descent


from sklearn.linear_model import SGDRegressor
regressor = SGDRegressor(max_iter=1000, tol=1e-3)
regressor.fit(X_train, y_train)
# Make predictions on the test set
y_pred = regressor.predict(X_test)

# Evaluate the model


from sklearn.metrics import mean_squared_error, r2_score
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)
print('MSE:', mse)
print('RMSE:', rmse)
print('R^2:', r2)

MSE: 6.467064527275153e+30
RMSE: 2543042376224815.0
R^2: -4.89243206637211e+30
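The extremely large MSE and negative R² indicate that stochastic gradient descent diverged rather than converged: SGDRegressor is sensitive to feature scales, and the California housing features have very different ranges. A minimal sketch of one common remedy is to standardise the features in a pipeline before fitting (the hyperparameters here are illustrative, and this rerun is not part of the results above):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Reuse X_train, y_train, X_test, y_test from the split above
scaled_sgd = make_pipeline(StandardScaler(),
                           SGDRegressor(max_iter=1000, tol=1e-3, random_state=0))
scaled_sgd.fit(X_train, y_train)
print('R^2 with scaling:', scaled_sgd.score(X_test, y_test))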

train_sizes = np.linspace(0.1, 1.0, 10)


train_sizes_abs = np.round(train_sizes * X_train.shape[0]).astype(int)
test_scores = np.zeros(train_sizes_abs.shape[0])
for i, train_size in enumerate(train_sizes_abs):
    if train_size == X_train.shape[0]:
        # Use the full training set when the fraction is 1.0
        X_train_subset, y_train_subset = X_train, y_train
    else:
        X_train_subset, _, y_train_subset, _ = train_test_split(
            X_train, y_train, train_size=train_size, random_state=0)
    regressor.fit(X_train_subset, y_train_subset)
    y_pred_subset = regressor.predict(X_test)
    test_scores[i] = np.sqrt(mean_squared_error(y_test, y_pred_subset))

plt.plot(train_sizes * 100, test_scores, 'o-')


plt.title('Learning Curve')
plt.xlabel('% of training data')
plt.ylabel('RMSE')
plt.ylim((0, np.max(test_scores) * 1.1))
plt.show()
