
Assignment-based Subjective Questions:

Q1. From your analysis of the categorical variables from the dataset, what
could you infer about their effect on the dependent variable?
Answer:
I analysed the categorical variables in the dataset against the dependent variable (count)
using box plots and bar plots. Below are some points that can be inferred from the
visualizations (a plotting sketch follows the list):

1. Fall has the highest median, which is expected since the weather conditions are most
   favourable for riding a bike, followed by summer.
2. Median bike rentals are increasing year on year, as 2019 has a higher median than 2018;
   this might be because bike rentals are getting popular and people are becoming more aware.
3. Most of the bikes were rented during May, June, July, Aug, Sep, and Oct. Rentals increase
   from the start of the year until the middle of the year, and then start decreasing as we
   approach the end of the year.
4. People rent more on non-holidays than on holidays; the reason might be that on holidays
   they prefer to spend time with family at home or use a personal vehicle.
5. Rentals are almost equal across every day of the week.
6. Rentals are almost equal on working and non-working days.
7. Clear weather is most favourable for bike renting, as the temperature is optimal and
   humidity is low.
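A minimal sketch of these plots (assuming the day-level bike-sharing data is in a file named
day.csv with the column names used below):

    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt

    df = pd.read_csv("day.csv")  # assumed file name

    # Box plots of the rental count against each categorical variable
    cat_cols = ["season", "yr", "mnth", "holiday",
                "weekday", "workingday", "weathersit"]
    fig, axes = plt.subplots(3, 3, figsize=(16, 10))
    for ax, col in zip(axes.ravel(), cat_cols):
        sns.boxplot(data=df, x=col, y="cnt", ax=ax)
    plt.tight_layout()
    plt.show()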

Q2. Why is it important to use drop_first=True during dummy variable
creation?
Answer:
A categorical variable with n levels can be represented by n-1 dummy variables, so even if we
drop the first dummy column, the data can still be fully represented. We therefore use
drop_first=True to remove the redundant column created during dummy variable creation. This
also removes the perfect correlation (the dummy variable trap) among the dummy variables.

Example: let’s say we have 3 types of values (A, B, C) in a categorical column and we want to
create dummy variables for it. If a row is neither A nor B, then it is obviously C, so we do
not need a 3rd variable to identify C, as shown in the table below:

Value   A   B
B       0   1
A       1   0
C       0   0
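A minimal sketch of this in pandas (the column name "colour" and the data are made up for
illustration; note that drop_first=True drops the first level alphabetically, here A):

    import pandas as pd

    # Made-up example: a single categorical column with levels A, B, C
    df = pd.DataFrame({"colour": ["B", "A", "C"]})

    # drop_first=True keeps only n - 1 = 2 dummy columns; the dropped
    # level (A) is implied whenever both remaining columns are zero.
    dummies = pd.get_dummies(df["colour"], drop_first=True, dtype=int)
    print(dummies)
    #    B  C
    # 0  1  0
    # 1  0  0   <- neither B nor C, so it must be A
    # 2  0  1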

Q3. Looking at the pair-plot among the numerical variables, which one
has the highest correlation with the target variable?
Answer:

Looking at the pair plot, it is quite clear that the registered, casual, temp, and atemp
(feeling temperature) variables have the highest correlation with the target variable.

But after dropping the registered, casual, and atemp variables (registered and casual are
components of the target itself, and atemp is nearly identical to temp), the temp variable has
the highest correlation with the target variable.
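A minimal sketch of this check (again assuming the data is in day.csv with these column names):

    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt

    df = pd.read_csv("day.csv")
    num_cols = ["temp", "atemp", "hum", "windspeed",
                "casual", "registered", "cnt"]

    # Pair plot of the numerical variables, including the target 'cnt'
    sns.pairplot(df[num_cols])
    plt.show()

    # Numeric confirmation of the visual impression
    print(df[num_cols].corr()["cnt"].sort_values(ascending=False))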
Q4. How did you validate the assumptions of Linear Regression after
building the model on the training set?
Answer:
I validated the Linear Regression model against the below 5 assumptions (a code sketch follows
the list).

1. Normality of error terms:
   - Error terms should be normally distributed.
2. Multi-collinearity:
   - There should be no significant multi-collinearity among the independent variables.
3. Linear relationship validation:
   - Linearity should be visible between the predictors and the target.
4. Homoscedasticity:
   - There should be no visible pattern in the residual values.
5. Independence of residuals:
   - There should be no auto-correlation. The Durbin-Watson value for the final model is
     2.0884; since values close to 2 indicate no auto-correlation, there is almost no
     auto-correlation in the residuals.
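A minimal sketch of these checks (assuming a fitted statsmodels OLS result named lr_model and
training data X_train/y_train; the variable names are placeholders):

    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt
    from statsmodels.stats.outliers_influence import variance_inflation_factor
    from statsmodels.stats.stattools import durbin_watson

    fitted = lr_model.predict(X_train)
    residuals = y_train - fitted

    # 1. Normality of error terms: histogram should look roughly normal
    sns.histplot(residuals, kde=True)
    plt.title("Distribution of residuals")
    plt.show()

    # 2. Multi-collinearity: VIF below ~5 is commonly considered acceptable
    vif = pd.DataFrame({
        "feature": X_train.columns,
        "VIF": [variance_inflation_factor(X_train.values, i)
                for i in range(X_train.shape[1])],
    })
    print(vif.sort_values("VIF", ascending=False))

    # 3. Linearity and 4. homoscedasticity: residuals vs fitted values
    # should show no visible pattern or funnel shape
    plt.scatter(fitted, residuals, alpha=0.5)
    plt.axhline(0, color="red")
    plt.xlabel("Fitted values")
    plt.ylabel("Residuals")
    plt.show()

    # 5. Independence of residuals: Durbin-Watson close to 2 means
    # little or no auto-correlation
    print("Durbin-Watson:", durbin_watson(residuals))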

Q5. Based on the final model, which are the top 3 features contributing
significantly towards explaining the demand of the shared bikes?

Answer:
Based on the coefficients of the final model, the top 3 features that explain the demand for
shared bikes are:

1. temp
2. light snow (weathersit (3))
3. year (yr)
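A minimal sketch of how these can be read off (assuming a fitted statsmodels OLS result named
lr_model, whose params attribute is a pandas Series of coefficients):

    # Drop the intercept and sort the remaining coefficients by
    # absolute value; the top entries contribute most to demand.
    coefs = lr_model.params.drop("const", errors="ignore")
    top3 = coefs.reindex(coefs.abs().sort_values(ascending=False).index).head(3)
    print(top3)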
General Subjective Questions:
Q1. Explain the linear regression algorithm in detail.
Answer:
Linear Regression is a machine learning algorithm based on supervised learning; it performs a
regression task.

A simple linear regression model attempts to explain the relationship between a dependent
variable (y) and an independent variable (x) using a straight line; because the modelled
relationship between y and x is a straight line, the algorithm is called linear regression.

The independent variable is also known as the predictor variable, and the dependent variable
is also known as the output variable.

Mathematically the relationship can be represented with the help of the following equation:

Y = mX + c

 - Here, Y is the dependent variable we are trying to predict.
 - X is the independent variable we are using to make predictions.
 - m is the slope of the regression line, which represents the effect X has on Y.
 - c is a constant, known as the Y-intercept. If X = 0, Y would be equal to c.
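A minimal sketch of fitting such a line (using scikit-learn on made-up data that roughly
follows Y = 2X + 1):

    import numpy as np
    from sklearn.linear_model import LinearRegression

    # Made-up data: Y = 2X + 1 plus a little noise
    X = np.array([[1], [2], [3], [4], [5]])
    y = np.array([3.1, 4.9, 7.2, 9.0, 11.1])

    model = LinearRegression().fit(X, y)
    print("m (slope):    ", model.coef_[0])    # close to 2
    print("c (intercept):", model.intercept_)  # close to 1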

Furthermore, the linear relationship can be positive or negative, as explained below:

1. Positive linear relationship:
   - A linear relationship is called positive if the dependent variable increases as the
     independent variable increases.
2. Negative linear relationship:
   - A linear relationship is called negative if the dependent variable decreases as the
     independent variable increases, and vice versa.
Linear regression has two types:

1. Simple Linear Regression
2. Multiple Linear Regression

Assumptions:

The Linear Regression model makes some assumptions about the dataset:

1. Multi-collinearity:
   - The model assumes that there is very little or no multi-collinearity in the data.
     Multi-collinearity occurs when the independent variables (features) depend on one
     another.
2. Auto-correlation:
   - The model assumes that there is very little or no auto-correlation in the data.
     Auto-correlation occurs when there is a dependency between residual errors.
3. Relationship between variables:
   - The model assumes that the relationship between the response and feature variables is
     linear.
4. Normality of error terms:
   - Error terms should be normally distributed.
5. Homoscedasticity:
   - There should be no visible pattern in the residual values.

Q2. Explain Anscombe’s quartet in detail.
Answer:
We should never just run a regression without having a good look at our data, because simple
linear regression has quite a few shortcomings:

1. It is sensitive to outliers.
2. It models linear relationships only.
3. A few assumptions are required to make the inference.

These phenomena are best explained by Anscombe’s quartet: four datasets, each containing
eleven (x, y) pairs. The essential thing to note about these datasets is that all four share
(nearly) the same descriptive statistics: the same means, variances, and x-y correlation.

But when we plot the four datasets on an x/y coordinate plane, we can observe that they
produce the same regression line as well, yet each dataset is telling a different story:

As shown, all four fitted regression lines are exactly the same, but there are peculiarities
in the datasets that have fooled the regression line. While the first one seems to be doing a
decent job, the second one clearly shows that linear regression can only model linear
relationships and is incapable of handling any other kind of data. The third and fourth
datasets showcase the linear regression model’s sensitivity to outliers: had the outlier not
been present, we could have fitted a great line through the data points. So, we should never
run a regression without having a good look at our data.
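A minimal sketch reproducing this (the quartet is bundled with seaborn as a sample dataset):

    import seaborn as sns
    import matplotlib.pyplot as plt

    df = sns.load_dataset("anscombe")

    # All four datasets share nearly identical summary statistics...
    print(df.groupby("dataset")[["x", "y"]].agg(["mean", "std"]))
    print(df.groupby("dataset").apply(lambda g: g["x"].corr(g["y"])))

    # ...yet plotting reveals four very different shapes
    sns.lmplot(data=df, x="x", y="y", col="dataset", col_wrap=2, ci=None)
    plt.show()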

Q3. What is Pearson’s R?
Answer:

If two variables are correlated, it is quite possible that they have some other sort of
relationship, not just a linear one. The important point to note is that two correlation
coefficients are widely used in regression:

1. One is Pearson’s R, the standard correlation coefficient. It is designed for linear
relationships and might not be a good measure if the relationship between the variables is
non-linear.
2. The other is Spearman’s R, which is used to determine the correlation when the relationship
between the variables is not linear. Even though Pearson’s R might give a correlation
coefficient for a non-linear relationship, it might not be reliable.

Example: the correlation coefficients given by both techniques for the relationship y = X³,
over 100 equally spaced values between 1 and 100, were found to be:

Pearson′s R ≈ 0.91

Spearman′s R ≈ 1

And as we keep on increasing the power, the Pearson’s R value consistently drops whereas
Spearman’s R remains robust at 1.

Example: for the relationship y = X¹⁰ over the same data points, the coefficients were:

Pearson′s R ≈ 0.66

Spearman′s R ≈ 1

So, if we have some sense that the relationship is non-linear, we should look at Spearman’s R
instead of Pearson’s R. Even for a non-linear relationship, the Pearson’s R value might be
high, but it is simply not reliable.
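A minimal sketch reproducing these numbers with scipy:

    import numpy as np
    from scipy.stats import pearsonr, spearmanr

    x = np.linspace(1, 100, 100)

    for power in (3, 10):
        y = x ** power
        print(f"y = x^{power}: Pearson = {pearsonr(x, y)[0]:.2f}, "
              f"Spearman = {spearmanr(x, y)[0]:.2f}")
    # Pearson drops as the power grows (≈ 0.91-0.92 for x³, ≈ 0.66
    # for x¹⁰), while Spearman stays at 1.00 for both.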

Q4. What is scaling? Why is scaling performed? What is the difference
between normalized scaling and standardized scaling?
Answer:
Scaling is the process of bringing all the variables to the same scale. It is performed mostly
during model building so that variables measured on different scales are comparable and no
variable dominates the model simply because of its magnitude.

The two common types of scalers used are:

1. MinMax Scaler (Normalized Scaler):
   - Normalized scaling uses the maximum and the minimum values of a particular column to
     perform the scaling. For any data point X in a column C, this scaling is performed using
     the formula:

     X_scaled = (X − min(C)) / (max(C) − min(C))

2. Standard Scaler:
   - Standardized scaling, on the other hand, rescales all the data points to have mean zero
     and standard deviation one. It is performed using the formula:

     X_scaled = (X − mean(C)) / SD(C)

Both scalers are affected by outliers, since the parameters used to scale the values (min/max
for the former, mean/SD for the latter) are themselves affected by outliers.
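A minimal sketch of both scalers in scikit-learn (the column values are made up, with one
outlier to show its effect):

    import numpy as np
    from sklearn.preprocessing import MinMaxScaler, StandardScaler

    # A single column with one extreme value
    C = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

    # MinMax: values squashed into [0, 1]; the outlier pins the maximum
    print(MinMaxScaler().fit_transform(C).ravel())

    # Standard: values centred on mean 0 with unit standard deviation
    print(StandardScaler().fit_transform(C).ravel())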

Q5. You might have observed that sometimes the value of VIF is infinite.
Why does this happen?
Answer:
Variance inflation factor (VIF) is a measure of the amount of multi-collinearity in a set of
multiple regression variables.

The formula for the variance inflation factor of the i-th variable is:

VIF_i = 1 / (1 − R_i²)

where R_i² is obtained by regressing that independent variable on all the other independent
variables. If that R² comes out to be exactly 1, the denominator becomes zero and the VIF
becomes infinite. This happens when one of the independent variables can be expressed
perfectly as a linear combination of the other independent variables.
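A minimal sketch showing an infinite (or astronomically large) VIF, using statsmodels on
made-up data in which one column is an exact linear combination of the others:

    import numpy as np
    import pandas as pd
    from statsmodels.stats.outliers_influence import variance_inflation_factor

    rng = np.random.default_rng(0)
    X = pd.DataFrame({"a": rng.normal(size=50), "b": rng.normal(size=50)})
    X["c"] = X["a"] + X["b"]   # perfectly determined by the other columns

    # R² = 1 for each regression, so every VIF blows up
    for i, col in enumerate(X.columns):
        print(col, variance_inflation_factor(X.values, i))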

Q6. What is a Q-Q plot? Explain the use and importance of a Q-Q plot in
linear regression?
Answer:
The quantile-quantile (Q-Q) plot is a graphical technique for determining whether two data
sets come from populations with a common distribution.

Use of a Q-Q plot:

 - A Q-Q plot is a plot of the quantiles of the first data set against the quantiles of the
   second data set. By a quantile, I mean the fraction (or percentage) of points below the
   given value; that is, the 0.3 (or 30%) quantile is the point at which 30% of the data falls
   below and 70% falls above that value. A 45-degree reference line is also plotted. If the
   two sets come from populations with the same distribution, the points should fall
   approximately along this reference line; the greater the departure from it, the greater
   the evidence that the two data sets come from populations with different distributions.
   In linear regression, a Q-Q plot of the residuals against a theoretical normal distribution
   is used to check the normality-of-error-terms assumption.

Importance of a Q-Q plot:

 - When there are two data samples, it is often desirable to know whether the assumption of a
   common distribution is justified. If it is, then location and scale estimators can pool
   both data sets to obtain estimates of the common location and scale. If the two samples do
   differ, it is also useful to gain some understanding of the differences. The Q-Q plot can
   provide more insight into the nature of the difference than analytical methods such as the
   chi-square and Kolmogorov-Smirnov two-sample tests.
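A minimal sketch of the normality check used in linear regression (stand-in residuals are
generated here; in practice they would come from the fitted model):

    import numpy as np
    import statsmodels.api as sm
    import matplotlib.pyplot as plt

    # Stand-in residuals; in practice use y_train - model.predict(X_train)
    residuals = np.random.default_rng(0).normal(size=200)

    # Points close to the 45-degree line support normally distributed errors
    sm.qqplot(residuals, line="45", fit=True)
    plt.show()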
