
Assignment-based Subjective Questions

1. From your analysis of the categorical variables from the dataset, what could you
infer about their effect on the dependent variable?

The categorical variables in the dataset were:


1. Season
2. Weather condition
3. Working day
4. Year
5. Month
6. Weekday

These were visualised using boxplots (a minimal plotting sketch is given after the list below). These variables had the following effect on our dependent variable:

1. Season - The boxplot showed that the spring season had the lowest total count, whereas fall had the highest. Summer and winter had intermediate counts.

2. Weather condition - There are no users when there is heavy rain/snow, indicating that this weather is extremely unfavourable. The highest count was seen when the weather condition was ‘Good’, with lower counts for ‘Neutral’ and ‘Bad’.

3. Working Day - Rentals increased during holidays.

4. Year - The number of rentals in 2019 was higher than in 2018.

5. Month - September saw the highest number of rentals in 2019, whereas June saw the highest in 2018.

6. Weekday – Monday had the highest rentals during weekdays.
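A minimal sketch of the boxplot visualisation described above, assuming a pandas DataFrame loaded from the bike-sharing data; the file name and column names (season, weathersit, workingday, yr, mnth, weekday, cnt) are assumptions and may need to be adjusted to your copy of the dataset.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv("day.csv")  # assumed file name for the bike-sharing dataset

# Assumed categorical column names; adjust to match your data.
cat_cols = ["season", "weathersit", "workingday", "yr", "mnth", "weekday"]

plt.figure(figsize=(18, 10))
for i, col in enumerate(cat_cols, start=1):
    plt.subplot(2, 3, i)
    sns.boxplot(x=col, y="cnt", data=df)  # rental count per category level
plt.tight_layout()
plt.show()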

2. Why is it important to use drop_first=True during dummy variable creation?


If you don't drop the first column, your dummy variables will be correlated (redundant). This can affect some models adversely, and the effect is stronger when the cardinality is smaller: iterative models may have trouble converging, and lists of variable importances may be distorted. Another reason is that keeping all the dummy variables introduces multicollinearity among them (the dummy-variable trap), because any one dummy column can be predicted exactly from the others. To keep this under control, we drop one column; the dropped level becomes the baseline that is absorbed into the intercept.
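A minimal sketch with pandas, using a made-up season column purely for illustration.

import pandas as pd

df = pd.DataFrame({"season": ["spring", "summer", "fall", "winter", "summer"]})

# Without drop_first: four dummy columns, each predictable from the other three.
all_dummies = pd.get_dummies(df["season"])

# With drop_first=True: three dummy columns; the dropped level ('fall', first
# in alphabetical order) becomes the baseline captured by the intercept.
reduced = pd.get_dummies(df["season"], drop_first=True)

print(all_dummies.columns.tolist())  # ['fall', 'spring', 'summer', 'winter']
print(reduced.columns.tolist())      # ['spring', 'summer', 'winter']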
3. Looking at the pair-plot among the numerical variables, which one has the
highest correlation with the target variable?

The “temp” and “atemp” variables are the most highly correlated with the target variable count.
4. How did you validate the assumptions of Linear Regression after building the
model on the training set?
The distribution of the residuals should be normal and centred around 0 (mean = 0).

We validated this assumption by plotting a distribution plot (distplot) of the residuals on the training set and checking whether they follow a normal distribution. The plot showed that the residuals are approximately normally distributed around mean = 0.
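A minimal sketch of that residual check, assuming a fitted model and training data named lm, X_train and y_train from a scikit-learn style workflow; those names are assumptions, not the exact variables used in the assignment.

import matplotlib.pyplot as plt
import seaborn as sns

# Assumes lm, X_train and y_train already exist from the model-building step.
y_train_pred = lm.predict(X_train)
residuals = y_train - y_train_pred  # error terms = actual - predicted

# Distribution of residuals: should look roughly normal and centred at 0.
sns.histplot(residuals, kde=True)
plt.axvline(0, linestyle="--")
plt.xlabel("Residuals")
plt.title("Error terms")
plt.show()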

5. Based on the final model, which are the top 3 features contributing significantly
towards explaining the demand of the shared bikes?
The top 3 variables affecting the bike rental count (by the magnitude of their coefficients in the final model) are as follows:
1. Spring season: -0.6842
2. Temperature: 0.4042
3. Mist: -0.3544

General Subjective Questions

1. Explain the linear regression algorithm in detail.

Linear Regression is a type of supervised Machine Learning algorithm that is used for
the prediction of numeric values. Linear Regression is the most basic form of
regression analysis. Regression is the most commonly used predictive analysis model.
Linear regression is based on the popular equation “y = mx + c”.
It assumes that there is a linear relationship between the dependent variable(y) and the
predictor(s)/independent variable(x). In regression, we calculate the best fit line which
describes the relationship between the independent and dependent variable.
Regression is performed when the dependent variable is of a continuous data type, while the predictors or independent variables can be of any data type, such as continuous or nominal/categorical. The regression method tries to find the best-fit line that shows the relationship between the dependent variable and the predictors with the least error.
In regression, the output/dependent variable is modelled as a function of the independent variables, their coefficients and an error term. Regression is broadly divided into simple linear regression and multiple linear regression.
1. Simple Linear Regression: SLR is used when the dependent variable is predicted using only one independent variable (y = b0 + b1*x1).
2. Multiple Linear Regression: MLR is used when the dependent variable is predicted using multiple independent variables. The equation for MLR is:

y = b0 + b1*x1 + b2*x2 + b3*x3 + ... + bn*xn

where b1 is the coefficient for the X1 variable, b2 is the coefficient for the X2 variable, b3 is the coefficient for the X3 variable, and so on; b0 is the intercept (constant term).
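A minimal sketch of fitting a simple linear regression with scikit-learn on made-up data; the numbers are purely illustrative.

import numpy as np
from sklearn.linear_model import LinearRegression

# Illustrative data: y is roughly 2*x + 3 plus noise.
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(100, 1))
y = 2 * X[:, 0] + 3 + rng.normal(0, 1, size=100)

model = LinearRegression().fit(X, y)
print("slope (m):", model.coef_[0])        # close to 2
print("intercept (c):", model.intercept_)  # close to 3
print("R^2 on the training data:", model.score(X, y))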
2. Explain the Anscombe’s quartet in detail.
Anscombe's quartet is a set of four datasets, each consisting of 11 points, created by
the statistician Francis Anscombe in 1973. The datasets were designed to demonstrate
the importance of visualizing data, as they have nearly identical descriptive statistics
but very different relationships between the variables. The quartet is often used to
illustrate the limitations of relying solely on summary statistics to understand a
dataset.

Each of the four datasets in Anscombe's quartet contains two continuous variables, labelled x and y, with eleven (x, y) pairs per dataset. The four datasets are as follows:

Dataset I: This dataset has a linear relationship between x and y, with a slope of
approximately 0.5 and an intercept of around 3.0.

Dataset II: This dataset has a non-linear relationship between x and y, with a
parabolic curve that is skewed towards the right.

Dataset III: This dataset has an almost perfectly linear relationship between x and y, but a single outlier pulls the fitted regression line away from the other points.

Dataset IV: In this dataset x is constant (x = 8) for all but one observation; that single high-leverage point is what produces the same fitted line and correlation as the other datasets.

Despite the fact that all four datasets have similar descriptive statistics (the mean,
variance, and correlation coefficient are nearly identical), they illustrate the
importance of visualizing data. By plotting the data, it is clear that each dataset has a
different relationship between the variables, with some exhibiting linear relationships,
while others have curved or more complex patterns.

Anscombe's quartet highlights the importance of visualizing data and the limitations
of relying solely on summary statistics. In practice, it is important to use both
descriptive statistics and visualizations to fully understand a dataset.
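A minimal sketch that reproduces the point, using the copy of Anscombe's quartet that ships with seaborn (sns.load_dataset downloads the data, so it assumes an internet connection or a cached copy).

import seaborn as sns

df = sns.load_dataset("anscombe")  # columns: dataset, x, y

# Nearly identical summary statistics for all four datasets...
for name, group in df.groupby("dataset"):
    print(name,
          round(group["x"].mean(), 2), round(group["x"].var(), 2),
          round(group["y"].mean(), 2), round(group["y"].var(), 2),
          round(group["x"].corr(group["y"]), 3))

# ...but plotting reveals four very different relationships.
sns.lmplot(data=df, x="x", y="y", col="dataset", col_wrap=2, ci=None, height=3)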

3. What is Pearson’s R?
Pearson's R, also known as Pearson's correlation coefficient, is a statistical measure
that quantifies the strength and direction of the linear relationship between two
continuous variables. It is denoted by the symbol "r" and takes values between -1 and
1, where:
• A value of -1 indicates a perfect negative linear relationship, where one variable increases as the other decreases.
• A value of 0 indicates no linear relationship between the variables.
• A value of 1 indicates a perfect positive linear relationship, where both variables increase or decrease together.
The formula for Pearson's R is:

r = (n * Σ(xy) - Σx * Σy) / sqrt((n * Σx^2 - (Σx)^2) * (n * Σy^2 - (Σy)^2))

where:

• n is the number of observations
• Σxy is the sum of the products of x and y
• Σx and Σy are the sums of x and y, respectively
• Σx^2 and Σy^2 are the sums of the squares of x and y, respectively
Pearson's R is widely used in various fields, such as psychology, biology, and
economics, to measure the relationship between two continuous variables. It is
important to note that Pearson's R only measures linear relationships and may not
capture other types of relationships, such as non-linear or categorical relationships.
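A minimal sketch of computing Pearson's r, both directly from the formula above and with scipy, on illustrative data.

import numpy as np
from scipy.stats import pearsonr

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])

# Direct implementation of the formula above.
n = len(x)
r_manual = (n * np.sum(x * y) - np.sum(x) * np.sum(y)) / np.sqrt(
    (n * np.sum(x**2) - np.sum(x)**2) * (n * np.sum(y**2) - np.sum(y)**2)
)

# Library implementation for comparison (also returns a p-value).
r_scipy, p_value = pearsonr(x, y)
print(r_manual, r_scipy)  # both close to 1: strong positive linear relationship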
4. What is scaling? Why is scaling performed? What is the difference between
normalized scaling and standardized scaling?
Scaling is a data pre-processing technique used to transform variables to a common
scale, usually between 0 and 1 or with a mean of 0 and a standard deviation of 1. The
purpose of scaling is to improve the performance of machine learning algorithms that
are sensitive to the scale of the input variables.

Scaling is performed because many machine learning algorithms use distance-based metrics, such as Euclidean distance, to measure the similarity between observations. If
the input variables have different scales, then the variables with larger scales may
dominate the distance calculations, leading to inaccurate results. Additionally, some
algorithms, such as support vector machines, require the input variables to be on a
common scale.
There are two common types of scaling: normalized scaling and standardized scaling.

Normalized scaling, also known as min-max scaling, transforms the input variables to
a range between 0 and 1. The formula for normalized scaling is:

x_scaled = (x - min(x)) / (max(x) - min(x))

where x is the original value of the variable, and x_scaled is the scaled value.

Standardized scaling transforms the input variables to have a mean of 0 and a standard
deviation of 1. The formula for standardized scaling is:

x_scaled = (x - mean(x)) / std(x)

where x is the original value of the variable, and x_scaled is the scaled value.

The main difference between normalized scaling and standardized scaling is the scale
that the variables are transformed to. Normalized scaling is useful when the
distribution of the variables is not normal, and when the input variables have a
defined minimum and maximum value. Standardized scaling is useful when the
distribution of the variables is normal or approximately normal, and when the input
variables have different units or scales of measurement.
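A minimal sketch of both kinds of scaling with scikit-learn, on an illustrative two-column array.

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Two features on very different scales (e.g. temperature and rental count).
X = np.array([[10.0, 1000.0],
              [20.0, 3000.0],
              [30.0, 6000.0],
              [40.0, 8000.0]])

# Normalized (min-max) scaling: each column mapped to the range [0, 1].
X_minmax = MinMaxScaler().fit_transform(X)

# Standardized scaling: each column rescaled to mean 0 and standard deviation 1.
X_standard = StandardScaler().fit_transform(X)

print(X_minmax)
print(X_standard.mean(axis=0), X_standard.std(axis=0))  # ~[0 0] and [1 1]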

5. You might have observed that sometimes the value of VIF is infinite. Why does
this happen?
VIF, the variance inflation factor, measures how much the variance of a coefficient estimate is inflated by collinearity with the other predictors. For the i-th predictor it is defined as VIF_i = 1 / (1 - R_i^2), where R_i^2 is the R-squared obtained by regressing that independent variable on all the other independent variables, i.e. how well it is explained by them. If an independent variable can be explained perfectly by the other independent variables, there is perfect correlation and its R-squared value equals 1.
So, VIF = 1/(1-1) = 1/0, which results in “infinity”.
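A minimal sketch of computing VIF with statsmodels, using an illustrative DataFrame in which one column is an exact linear combination of another, so its VIF blows up; the column names are made up.

import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
df = pd.DataFrame({"x1": rng.normal(size=50), "x2": rng.normal(size=50)})
df["x3"] = 2 * df["x1"] + 5  # perfectly correlated with x1 -> infinite VIF

X = sm.add_constant(df)  # VIF should be computed with an intercept column
vifs = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns,
)
print(vifs)  # x1 and x3 show inf (or an astronomically large number)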
6. What is a Q-Q plot? Explain the use and importance of a Q-Q plot in linear
regression.
A Q-Q plot, short for quantile-quantile plot, is a graphical tool used to assess the
normality of a dataset by comparing its distribution to a theoretical normal
distribution. The plot displays the quantiles of the dataset on the y-axis and the
corresponding quantiles of the normal distribution on the x-axis. If the data is
normally distributed, the Q-Q plot will form a straight line.

In linear regression, a Q-Q plot is used to check the normality assumption of the
residuals, which are the differences between the observed and predicted values of the
response variable. A linear regression model assumes that the residuals are normally
distributed with a mean of 0 and a constant variance. If the residuals are not normally
distributed, it may indicate that the linear regression model is not appropriate for the
data, and the results may be unreliable.

By examining the Q-Q plot of the residuals, we can assess whether the residuals
follow a normal distribution. If the Q-Q plot shows a straight line, then the residuals
are normally distributed. If the Q-Q plot shows a curve or a pattern, then the residuals
may not be normally distributed.

The importance of a Q-Q plot in linear regression is that it allows us to check the
normality assumption of the residuals, which is one of the key assumptions of linear
regression. If the normality assumption is not met, the results of the linear regression
model may not be reliable, and we may need to consider using a different type of
model or transforming the data to meet the assumptions of linear regression. In
addition, a Q-Q plot can also help identify outliers or other departures from normality
in the data, which may affect the results of the analysis.
The q-q plot is used to answer the following questions:
• Do two data sets come from populations with a common distribution?
• Do two data sets have common location and scale?
• Do two data sets have similar distributional shapes?
• Do two data sets have similar tail behaviour?
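A minimal sketch of a Q-Q plot of regression residuals with statsmodels; the residuals here are simulated for illustration, but in practice you would use the residuals from the fitted model (e.g. y_train - y_train_pred).

import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

# Illustrative residuals; replace with your model's residuals.
rng = np.random.default_rng(1)
residuals = rng.normal(loc=0.0, scale=1.0, size=200)

# Compare residual quantiles against the quantiles of a normal distribution;
# points close to the 45-degree line indicate approximately normal residuals.
sm.qqplot(residuals, line="45", fit=True)
plt.title("Q-Q plot of residuals")
plt.show()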
