Assignment-Based Subjective Questions
Q1. From your analysis of the categorical variables from the dataset, what
could you infer about their effect on the dependent variable?
Answer:
I analysed the effect of the categorical variables on the dependent variable (cnt)
using boxplots and bar plots. The following points can be inferred from the visualisations:
1. Fall has the highest median count, which is expected since the weather conditions are the most
favourable for riding a bike, followed by summer.
2. Median bike rentals are increasing year on year: 2019 has a higher median than 2018, possibly
because bike rentals are becoming more popular and people are more aware of them.
3. Most bikes were rented from May through October. Rentals increase from the start of the year
to mid-year and then decrease as the end of the year approaches.
4. People rent more on non-holidays than on holidays; on holidays they may prefer to spend time
with family at home or to use a personal vehicle.
5. Rentals are roughly equal across the days of the week.
6. Rentals are roughly equal on working and non-working days.
7. Clear weather is the most favourable for renting, as the temperature is optimal and the
humidity is low.
Example: suppose a categorical column has three values (A, B, C) and we want to create dummy
variables for it. If a row is neither A nor B, it is obviously C, so a third dummy variable is
not needed to identify C, as shown in the table below:
Value  A  B
A      1  0
B      0  1
C      0  0
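A minimal pandas sketch of this; the column values here are made up for illustration:

```python
import pandas as pd

# A toy categorical column with three levels: A, B, C.
s = pd.Series(["A", "B", "C", "B", "A"], name="cat")

# drop_first=True drops the first level (A): a row that is
# neither B nor C must be A, so a third column is redundant.
dummies = pd.get_dummies(s, drop_first=True)
print(list(dummies.columns))  # ['B', 'C']
```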
Q3. Looking at the pair-plot among the numerical variables, which one
has the highest correlation with the target variable?
Answer:
Looking at the pair plot, it is quite clear that the registered, casual, temp, and atemp
(feeling temperature) variables have the highest correlation with the target variable.
After dropping the registered, casual, and atemp variables, the temp variable has the highest
correlation with the target variable.
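A sketch of reading off correlations with the target numerically instead of from a pair plot. The column names mirror the assignment's dataset, but the data here are synthetic stand-ins:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-ins for the bike-sharing columns (illustration only).
temp = rng.uniform(0, 1, n)
casual = 50 * temp + rng.normal(0, 5, n)
registered = 150 * temp + rng.normal(0, 20, n)
df = pd.DataFrame({"temp": temp, "casual": casual, "registered": registered})
df["cnt"] = df["casual"] + df["registered"]

# Correlation of every numeric column with the target -- the
# same information a pair plot conveys visually.
print(df.corr()["cnt"].sort_values(ascending=False))
```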
Q4. How did you validate the assumptions of Linear Regression after
building the model on the training set?
Answer:
I validated the Linear Regression model against its five assumptions (linearity, insignificant
multi-collinearity, normality of error terms, homoscedasticity, and independence of residuals).
In particular:
- Multi-collinearity:
There should be insignificant multi-collinearity among the predictor variables.
- Independence of residuals:
There should be no auto-correlation. The Durbin-Watson value for the final
model is 2.0884, which means there is almost no auto-correlation.
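The independence check can be sketched with the Durbin-Watson statistic, DW = Σ(e_t − e_{t−1})² / Σ e_t²; the residuals below are simulated for illustration (the project's actual residuals would come from the fitted model):

```python
import numpy as np

def durbin_watson(resid):
    """DW = sum((e_t - e_{t-1})^2) / sum(e_t^2); a value near 2 means no auto-correlation."""
    resid = np.asarray(resid)
    return np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)

rng = np.random.default_rng(42)
e_independent = rng.normal(size=1000)          # uncorrelated residuals
e_correlated = np.cumsum(e_independent) / 10   # strongly auto-correlated (random walk)

print(round(durbin_watson(e_independent), 3))  # close to 2
print(round(durbin_watson(e_correlated), 3))   # close to 0
```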
Q5. Based on the final model, which are the top 3 features contributing
significantly towards explaining the demand of the shared bikes?
Answer:
Based on the coefficients of the final model, the top 3 features which explain the demand for the
shared bikes are:
1. temp
2. light snow (weathersit (3))
3. year (yr)
General Subjective Questions:
Q1. Explain the linear regression algorithm in detail.
Answer:
Linear Regression is a machine learning algorithm based on supervised learning. It performs a
regression task.
A simple linear regression model explains the relationship between a dependent variable (y) and
an independent variable (x) using a straight line; because the relationship between y and x is a
straight line, the method is called linear regression.
The independent variable is also known as the predictor variable, and the dependent variable is
also known as the output (response) variable.
Mathematically the relationship can be represented with the help of the following equation:
Y = mX + c
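A minimal sketch of fitting Y = mX + c by least squares; the data are synthetic, generated with true m = 3 and c = 2 plus noise:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 100)
y = 3.0 * x + 2.0 + rng.normal(0, 0.5, 100)  # true m = 3, c = 2, plus noise

# Least-squares fit of a degree-1 polynomial: returns (slope m, intercept c).
m, c = np.polyfit(x, y, 1)
print(f"m = {m:.2f}, c = {c:.2f}")  # recovers roughly m = 3, c = 2
```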
Assumptions:
There are some assumptions about the dataset that are made by the Linear Regression model:
1. Multi-collinearity:
The linear regression model assumes that there is very little or no multi-
collinearity in the data. Multi-collinearity occurs when the independent
variables (features) are highly correlated with one another.
2. Auto-correlation:
The model also assumes that there is very little or no auto-correlation in
the data. Auto-correlation occurs when the residual errors are dependent
on one another.
3. Relationship between variables:
Linear regression model assumes that the relationship between response
and feature variables must be linear.
4. Normality of error terms:
Error terms should be normally distributed
5. Homoscedasticity:
Residuals should have constant variance, i.e. there should be no visible pattern (such as a funnel shape) in the residual values.
Limitations of linear regression:
1. It is sensitive to outliers.
2. It models linear relationships only.
3. A few assumptions are required in order to make inferences.
These phenomena are best explained by Anscombe's Quartet:
Anscombe's quartet comprises four datasets, each containing eleven (x, y) pairs. The essential
thing to note about these datasets is that they share the same descriptive statistics. For all
four datasets, the mean of x is 9.0, the sample variance of x is 11.0, the mean of y is about
7.50, the sample variance of y is about 4.12, the correlation between x and y is about 0.816,
and the fitted regression line is y = 3.00 + 0.500x.
But when we plot the four datasets on an x/y coordinate plane, they produce the same regression
line, yet each dataset tells a different story:
As shown, all four fitted regression lines are exactly the same, but there are peculiarities in
the datasets that have fooled the regression. While the first dataset seems to be fitted well,
the second one clearly shows that linear regression can only model linear relationships and is
incapable of handling any other kind of data. The third and fourth images showcase the linear
regression model's sensitivity to outliers. Had the outlier not been present, we could have gotten a
great line fitted through the data points. So, we should never run a regression without having a good
look at our data.
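The shared statistics can be verified numerically; the values below are the published Anscombe (1973) data:

```python
import numpy as np

# Anscombe's quartet (Anscombe, 1973). Datasets 1-3 share the same x values.
x = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
y1 = np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68])
y2 = np.array([9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74])
y3 = np.array([7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73])
x4 = np.array([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8], dtype=float)
y4 = np.array([6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89])

for xs, ys in [(x, y1), (x, y2), (x, y3), (x4, y4)]:
    r = np.corrcoef(xs, ys)[0, 1]
    m, c = np.polyfit(xs, ys, 1)
    # Every dataset prints mean_y ~ 7.50, r ~ 0.816, line y = 0.50x + 3.00.
    print(f"mean_y={ys.mean():.2f}  r={r:.3f}  line: y={m:.2f}x+{c:.2f}")
```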
If two variables are correlated, it is very much possible that they have some other sort of
relationship and not just a linear one.
An important point to note is that two correlation coefficients are widely used in regression:
1. Pearson's R, the standard correlation coefficient. It is designed for linear
relationships and may not be a good measure when the relationship between the
variables is non-linear.
2. Spearman's R, which measures monotonic association and is used when the
relationship between the variables is not linear. Even though Pearson's R
produces a value for a non-linear relationship, that value may not be reliable.
Example: The correlation coefficients as given by both the techniques for the relationship y = X³ for
100 equally separated values between 1 and 100 were found to be:
Pearson′s R ≈ 0.91
Spearman′s R ≈ 1
As we keep increasing the power, Pearson's R consistently drops, whereas Spearman's R remains
robust at 1.
Example: For the relationship y = X¹⁰ over the same data points, the coefficients were:
Pearson′s R ≈ 0.66
Spearman′s R ≈ 1
So, if we have some sense of the relationship being non-linear, we should look at Spearman's R
instead of Pearson's R. Even for a non-linear relationship, the Pearson's R-value might be high,
but it is simply not reliable.
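A small sketch reproducing these numbers; Spearman's R is simply Pearson's R computed on the ranks of the data (no ties here, so plain rank ordering suffices):

```python
import numpy as np

def pearson(x, y):
    return np.corrcoef(x, y)[0, 1]

def spearman(x, y):
    # Spearman's R = Pearson's R on the ranks (valid here: no tied values).
    rank = lambda a: np.argsort(np.argsort(a))
    return pearson(rank(x), rank(y))

x = np.arange(1, 101, dtype=float)  # 100 equally separated values, 1..100

print(pearson(x, x ** 3))    # about 0.92 -- drops as the power grows
print(spearman(x, x ** 3))   # exactly 1.0 -- the relationship is monotonic
print(pearson(x, x ** 10))   # about 0.66
print(spearman(x, x ** 10))  # still 1.0
```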
1. Min-Max Scaler:
Normalised (min-max) scaling brings all the data points into the range
[0, 1]. It is performed using the formula:
x_scaled = (x - min(x)) / (max(x) - min(x))
2. Standard Scaler:
Standardised scaling, on the other hand, transforms the data to have mean
zero and standard deviation one. It is performed using the formula:
x_scaled = (x - mean(x)) / std(x)
Both scalers are affected by outliers, since the parameters used to scale the values (min/max,
mean/std) are themselves affected by outliers.
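A minimal numpy sketch of both scalers (the sample values are made up):

```python
import numpy as np

x = np.array([1.0, 4.0, 6.0, 10.0, 14.0])

# Min-Max (normalised) scaling: squeezes the values into [0, 1].
x_minmax = (x - x.min()) / (x.max() - x.min())

# Standard (standardised) scaling: mean 0, standard deviation 1.
x_std = (x - x.mean()) / x.std()

print(x_minmax)                     # first value 0.0, last value 1.0
print(x_std.mean(), x_std.std())    # ~0 and 1
```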
Q5. You might have observed that sometimes the value of VIF is infinite.
Why does this happen?
Answer:
The variance inflation factor (VIF) is a measure of the amount of multi-collinearity in a set of
multiple regression variables. For the i-th independent variable:
VIF_i = 1 / (1 - R_i²)
where R_i² is obtained by regressing the i-th independent variable on all the other independent
variables. If that R² comes out to be exactly 1, the VIF becomes infinite. This happens when one
independent variable can be perfectly explained by (i.e. is an exact linear combination of) the
other independent variables.
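A hand-rolled sketch of the VIF computation (numpy only, not the statsmodels implementation), with one column that is an exact linear combination of the others:

```python
import numpy as np

def vif(X, i):
    """VIF for column i: regress it on the other columns, return 1 / (1 - R^2)."""
    y = X[:, i]
    others = np.delete(X, i, axis=1)
    A = np.column_stack([others, np.ones(len(X))])   # add an intercept column
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    r2 = 1 - resid.var() / y.var()
    return np.inf if r2 >= 1 else 1.0 / (1.0 - r2)

rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = rng.normal(size=100)
x3 = 2 * x1 + 3 * x2   # exact linear combination of x1 and x2 -> R^2 is 1
X = np.column_stack([x1, x2, x3])

print(vif(X, 2))  # infinite (or astronomically large, up to float rounding)
```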
Q6. What is a Q-Q plot? Explain the use and importance of a Q-Q plot in
linear regression?
Answer:
The quantile-quantile (q-q) plot is a graphical technique for determining if two data sets come from
populations with a common distribution.
A q-q plot is a plot of the quantiles of the first data set against the quantiles of the second
data set. By a quantile, we mean the fraction (or percent) of points below the given value; that
is, the 0.3 (or 30%) quantile is the point below which 30% of the data falls and above which 70%
falls. A 45-degree reference line is also plotted. If the two sets come from populations with
the same distribution, the points should fall approximately along this reference line; the
greater the departure from the line, the greater the evidence that the two data sets come from
populations with different distributions.
In linear regression, a q-q plot of the residuals against a theoretical normal distribution is
used to validate the assumption that the error terms are normally distributed.
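A minimal sketch of the q-q computation against a standard normal, using the stdlib NormalDist for the theoretical quantiles; the sample is simulated:

```python
import numpy as np
from statistics import NormalDist

def qq_points(sample):
    """Ordered sample values vs. theoretical standard-normal quantiles."""
    s = np.sort(sample)
    n = len(s)
    probs = (np.arange(1, n + 1) - 0.5) / n               # plotting positions
    theo = np.array([NormalDist().inv_cdf(p) for p in probs])
    return theo, s

rng = np.random.default_rng(0)
sample = rng.normal(loc=5, scale=2, size=500)

theo, ordered = qq_points(sample)
# For a normal sample the q-q points lie on a straight line,
# so their correlation is close to 1.
print(np.corrcoef(theo, ordered)[0, 1])
```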