
SEOUL RENTAL BIKE DATA

ANALYSIS AND MODELING

QUANTITATIVE TECHNIQUES – II

BM Section A – Group A1
Aarushi | BJ20001
Aayush Khandelwal | BJ20002
Abhishek Kapoor | BJ20003
Adyathma Ela | BJ20004
Akanksha Garima Mittal | BJ20005
Aman Khanna | BJ20006
Table of Contents
1. Introduction
2. Data Description
3. Exploratory Analysis
3.1 Descriptive Summary
3.2 Scatter Plots
3.3 Box Plots
3.4 Correlation Analysis between Continuous Variables
4. Regression Modelling
4.1 Model Building
4.2 Interaction Effect
4.3 Including Interaction Effect in the Regression Model
4.4 Multicollinearity
4.5 Tests for Regression Coefficients
4.6 Checking Linear Regression Assumptions
4.7 Linear Regression Model Diagnostics – Error/Residual Analysis
4.8 Model Performance/Validation
4.9 Results and Interpretation
5. Discussion/Conclusion and Future Work
6. References
7. Contribution
8. Appendix

Page | 1
1. Introduction
Seoul is one of the most visited tourist destinations, attracting tourists from within South Korea as well as
from abroad. Most tourists in Seoul rent two-wheelers and cars to get around, and tourist visits are
concentrated in a limited part of the year. Without information on the number of vehicles on the road, it is
difficult for the local administration of Seoul to manage traffic effectively. A demand estimate would also
help local rental vendors maintain their inventory of two-wheelers and plan maintenance accordingly. In peak
tourist season, the price of rental vehicles rises to two or three times the normal price. These problems can
be reduced if there is prior knowledge or an estimate of the number of tourists coming to Seoul.

Using this data set, we try to find whether there is any correlation between external environmental factors
and the number of vehicles rented in Seoul, South Korea. The data set seemed interesting because of the
variables used to determine the two-wheeler requirement. We use weather conditions such as temperature,
humidity, wind speed and visibility, and try to find a correlation between these factors and the number of
two-wheelers rented out. If a correlation is found, it can be refined further, and more variables can be added
to obtain a better correlation and a more precise model. Eliminating variables that show very low correlation
will help improve the accuracy of the model.

We plan to understand the effect of each predictor variable on the model output, i.e., the count of rented
bikes, using an exploratory analysis. The insights drawn from the scatter plots and correlation analysis are
used as motivation for model building. For model building, all predictor variables were initially included to
understand their contributing power. The interaction effects among various predictor variables were studied in
detail and some were included in the model. We then removed the predictor variables which were insignificant
as per the t-tests conducted after the ANOVA + F-test. Finally, multicollinearity checks, model assumption
checks and model diagnostics were carried out in detail.

This exercise will help reduce the time the public waits for two-wheelers and hence help provide a better
service to the people.

2. Data Description
Data is taken from the research paper: Sathishkumar V E, Jangwoo Park, and Yongyun Cho, 'Using data mining
techniques for bike sharing demand prediction in metropolitan city', Computer Communications, Vol. 153,
pp. 353-366, March 2020 [1]. The variables used in this research paper are as follows:

1. Date - year-month-day – (Categorical data)
2. Rented Bike Count - Count of bikes rented at each hour – (Numerical data)
3. Hour - Hour of the day – (Categorical data)
4. Temperature - Temperature in Celsius – (Numerical data)
5. Humidity - % – (Numerical data)
6. Wind speed - m/s – (Numerical data)
7. Visibility - 10 m – (Numerical data)
8. Dew point temperature - Celsius – (Numerical data)
9. Solar radiation - MJ/m2 – (Numerical data)
10. Rainfall - mm – (Numerical data)
11. Snowfall - cm – (Numerical data)
12. Seasons - Winter, Spring, Summer, Autumn – (Categorical data)
13. Holiday - Holiday / No holiday – (Categorical data)
14. Functional Day - NoFunc (non-functional hours), Fun (functional hours) – (Categorical data)

We believe that these variables play a crucial role in determining the number of people visiting Seoul who
then choose a two-wheeler, i.e., a rented bike.

3. Exploratory Analysis
3.1 Descriptive Summary
Table 1: Five-Number Summary

Variable                      Minimum   Q1      Median   Q3      Maximum
Rented Bike Count             0         191     504      1065    3556
Hour                          0         5       11       17      23
Temperature (°C)              -17.8     3.5     13.7     22.5    39.4
Humidity (%)                  0         42      57       74      98
Wind speed (m/s)              0         0.9     1.5      2.3     7.4
Visibility (10 m)             27        940     1698     2000    2000
Dew point temperature (°C)    -30.6     -4.7    5.1      14.8    27.2
Solar Radiation (MJ/m2)       0         0       0.01     0.93    3.52
Rainfall (mm)                 0         0       0        0       35
Snowfall (cm)                 0         0       0        0       8.8

As can be seen from the five-number summary:

• The data for Rented Bike Count, Hour, Temperature, Humidity, Wind speed, Visibility and Dew point
temperature cover a wide enough range to provide opportunity for analysis.
• The data for solar radiation show that it was close to 0 for almost half of the hourly observations
(median 0.01 MJ/m2), and the data for rainfall and snowfall show that it hardly rained or snowed during the
data recording period (Q3 is 0 for both).
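The summary in Table 1 was produced in R. As an illustration only, a five-number summary can be sketched in Python with the standard library. The function name `five_number_summary` and the sample values are assumptions made for this sketch, and the quartile interpolation rule may differ slightly from the one used for the report.

```python
from statistics import quantiles

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    data = sorted(values)
    # method="inclusive" interpolates between data points and treats the
    # sample minimum and maximum as the 0th and 100th percentiles.
    q1, median, q3 = quantiles(data, n=4, method="inclusive")
    return data[0], q1, median, q3, data[-1]

# Hypothetical hourly counts, not values from the Seoul data set.
sample_counts = [0, 120, 254, 504, 930, 1065, 3556]
print(five_number_summary(sample_counts))
```

A median at or near zero, as for solar radiation, rainfall and snowfall, signals that at least half of the observations carry little or no information about that variable.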

3.2 Scatter Plots


Comments on the scatter plots of Rented Bike Count against each predictor:

• Visibility: the rented bike count increased as visibility increased.
• Wind Speed: the rented bike count decreased as wind speed increased.
• Dew Point Temperature: the rented bike count increased as dew point temperature increased.
• Solar Radiation: the rented bike count decreased as solar radiation increased.
• Hour: hourly data shows rental bike counts peaking three times a day, at the 0th, 8th and 18th hour, with
the maximum recorded in the evening.
• Temperature: the rented bike count increased with temperature up to about 25 degrees Celsius and then
decreased slightly.
• Rainfall: the rented bike count decreased as rainfall increased.
• Snowfall: the rented bike count decreased as snowfall increased.
• Humidity: the rented bike count increased from a humidity of 20 up to an optimal humidity of 50, after
which it decreased.

3.3 Box Plots

• The Rented Bikes Count shows a large number of outliers which are outside the limit of 1.5 times the
interquartile range. This implies that our results could be affected by the presence of many outliers.
• Similar to Rented Bikes Count, Wind Speed also shows a large number of outliers. This implies that wind
speed may not be a good predictor, as a large number of outliers can affect the model performance adversely.
• The Rainfall and [...] for a region like Seoul.
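The 1.5 times interquartile range rule used by the box plots can be checked against the quartiles in Table 1. The sketch below is illustrative and the helper name `tukey_fences` is an assumption. With Q1 = 191 and Q3 = 1065, the upper limit for Rented Bike Count is 2376, well below the recorded maximum of 3556.

```python
def tukey_fences(q1, q3, k=1.5):
    """Return the (lower, upper) limits beyond which a point is an outlier."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Quartiles and maxima taken from Table 1.
bike_lower, bike_upper = tukey_fences(191, 1065)
wind_lower, wind_upper = tukey_fences(0.9, 2.3)

print(bike_upper, 3556 > bike_upper)            # is the maximum count an outlier?
print(round(wind_upper, 2), 7.4 > wind_upper)   # is the maximum wind speed an outlier?
```

Both maxima fall outside their upper limits, which agrees with the outliers seen in the box plots.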

3.4 Correlation Analysis between Continuous Variables
Pearson correlation is computed for all the continuous variables, and the correlation matrix is presented in
the form of a plot as shown below.

Figure 1: Correlation Analysis of Continuous Variables

Observations and Conclusions:

For our model analysis, we have assumed that one of a pair of predictor variables can be dropped from the
model if the correlation between them is more than 70%.

1. As seen from the correlation plot above, there exists 91% positive correlation between the predictors
“Dew_Point_Temperature” and “Temperature”. Thus, “Dew_Point_Temperature” can be dropped
from further model analysis.
2. Moderate correlation can also be seen between “Humidity” and “Visibility” and “Humidity” and
“Solar_Radiation”. However, since the correlation is less than 70%, we may continue with these
predictors for further analysis.
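The 70% screening rule above can be sketched as follows. This is a minimal Python illustration rather than the R code used for Figure 1. The function names and the toy readings are assumptions, and the toy numbers are not taken from the Seoul data set.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def highly_correlated_pairs(columns, threshold=0.70):
    """List predictor pairs whose absolute correlation exceeds the threshold."""
    names = list(columns)
    pairs = []
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            r = pearson(columns[first], columns[second])
            if abs(r) > threshold:
                pairs.append((first, second, round(r, 2)))
    return pairs

# Hypothetical readings, not values from the Seoul data set.
toy = {
    "Temperature": [-5.0, 2.0, 10.0, 18.0, 27.0],
    "Dew_Point_Temperature": [-12.0, -4.0, 3.0, 12.0, 20.0],
    "Wind_Speed": [2.1, 0.8, 3.0, 1.2, 1.9],
}
print(highly_correlated_pairs(toy))
```

For each flagged pair, only one of the two predictors needs to be dropped, as was done with "Dew_Point_Temperature".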

4. Regression Modelling
The regression modelling has been carried out in various steps as described below in detail.

4.1 Model Building


The entire model building exercise is undertaken at 5% level of significance, thus, alpha is equal to 5%.

1. We start with a basic model including all the predictor variables, except “Dew_Point_Temperature”
which is removed due to high correlation with “Temperature”, to get an idea of the significance levels
of different predictor variables.
2. An ANOVA + F-test is then conducted to establish goodness of fit. If the null hypothesis (all β are equal
to 0) is rejected, then we will proceed for t-tests to check the significance of all the predictor
variables.

Fitted Regression Model → ANOVA + F-test → Test for β (t-test)
(If H0 is rejected, perform separate t-tests.)
I. Model 1: All predictor variables excluding “Dew_Point_Temperature”

A linear model is fitted on all the predictor variables, and the goodness of fit is checked using an ANOVA table
as shown below.

(a) Goodness of fit using ANOVA and F-test

R Output:

Figure 2: ANOVA output for Model 1

Since the above ANOVA table from R output is incomplete, a complete ANOVA table along with F-test is shown
below.

H0: β1 = β2 = β3 = … = β13 = 0

H1: At least one βi is not equal to 0

Source       DF     Sum of Squares   Mean Square    F-Stat
Regression   13     1594980810       122690831.5    662.6263
Residuals    6994   1294997905       185158.408
Total        7007   2889978715

Test Statistic: Fobs = 662.6263

Cut off value: F (13, 6994, 0.05) = 1.721554

Conclusion: Fobs > F (13, 6994, 0.05) implies that H0 is rejected.
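The entries of the completed ANOVA table follow from the sums of squares and degrees of freedom alone. The short Python sketch below recomputes them as a check. The function name `anova_f` is an assumption, and the critical value is simply the cut-off quoted above rather than something the code derives.

```python
def anova_f(ss_regression, ss_residual, df_regression, df_residual):
    """Return mean squares, the F statistic and R-squared from an ANOVA table."""
    ms_regression = ss_regression / df_regression
    ms_residual = ss_residual / df_residual
    f_stat = ms_regression / ms_residual
    r_squared = ss_regression / (ss_regression + ss_residual)
    return ms_regression, ms_residual, f_stat, r_squared

# Sums of squares and degrees of freedom from the Model 1 ANOVA table.
msr, mse, f_obs, r2 = anova_f(1594980810, 1294997905, 13, 6994)
f_critical = 1.721554  # cut-off value quoted in the report

print(round(f_obs, 2), f_obs > f_critical, round(r2, 4))
```

The same function applies to the final model's table by passing its sums of squares with 12 and 6995 degrees of freedom.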


(b) Test of significance of Predictor Xi: t-tests of Regression Coefficients

R Output:

Figure 3: t-tests for Model 1


Observations (Significance Level = 5%):

1. “Visibility” is insignificant as its P-Value is 0.569 which is more than 0.05. Hence, this predictor can be
removed from the model.
2. Adjusted R-squared is 55.11%

(c) Using Step function to select the best model as per AIC value

As per the step function, the lowest AIC value is obtained by dropping the “Visibility” predictor from the
model. This is in sync with the results of the t-tests, where this variable was insignificant.

Figure 4: R Output of step function based on model1

II. Model 2: Model based on lowest AIC value by starting from Model 1
The adjusted R-squared value obtained for Model 2 is 55.11%, with no insignificant variables. The figure
below shows the summary of the Model 2 R output.
R Output:

Figure 5: R Output of summary for Model 2

4.2 Interaction Effect
Based on the data set scatter plots, the interaction effect is studied among various combinations of predictor
variables to check the presence of interaction.

A. Interaction between categorical and continuous variables


a. Interaction between “Temperature” and “Seasons”
The interaction between temperature and seasons is evident from the fact that Summer
season will have higher temperature and Winter season will have lower temperature.
Spring and Autumn season will exhibit temperatures between the above Summer and Winter
season temperatures.
The below graph explains the interaction effect between temperature and seasons.

Figure 6: Interaction Effect between Temperature and Seasons

B. Interaction between continuous variables


a. Interaction between “Hour” and “Temperature”
From the below graph it can be observed that for a particular season, say Summer, early
hours in a day (between 0 and 5) have lower temperature compared to late hours.
Thus, there is an interaction between the hour of the day and the temperature at that hour.
A similar interaction effect is also observed between “Hour” and “Humidity”.

Figure 7: Interaction Effect between Hour and Temperature for a particular season (Summer)

4.3 Including Interaction Effect in the Regression Model


The interaction effects discussed above are now added to Model 2.

III. Model 3: Including Interaction Effects in Model 2


The interaction effects are added one at a time based on their contributing power. The final model
summary is given below.

R Output:

Observations:

1. The model's adjusted R-squared has improved from 55.11% to 63.61%, which is a substantial improvement.
2. “Temperature”, “Humidity” and “Snowfall” have now become insignificant as their P-Values are greater
than 5%.
3. The interaction effects have certainly improved the model's adjusted R-squared but have made a few
predictors insignificant. Thus, the model can be further improved so that there are no insignificant predictors.

IV. Model 4: Final Model
In the final model, insignificant predictors are removed while ensuring that the R-squared value
remains high.

Goodness of fit using ANOVA and F-test

R Output:

Figure 9: ANOVA analysis of the Final Model

Since the above ANOVA table from R output is incomplete, a complete ANOVA table along with F-test is shown below.

H0: β1 = β2 = β3 = … = β12 = 0

H1: At least one βi is not equal to 0

Source       DF     Sum of Squares   Mean Square    F-Stat
Regression   12     1835936758       152994729.9    1015.328
Residuals    6995   1054041957       150685.0546
Total        7007   2889978715
Test Statistic: Fobs = 1015.328

Cut off value: F (12, 6995, 0.05) = 1.7535

Conclusion: Fobs > F (12, 6995, 0.05) implies that H0 is rejected.
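The same ANOVA table also yields the R-squared and adjusted R-squared of the final model. The worked calculation below uses the standard definitions. The symbols SS_Reg, SS_Res and SS_Total are shorthand introduced here for the regression, residual and total sums of squares.

$$F_{obs} = \frac{SS_{Reg}/12}{SS_{Res}/6995} = \frac{152994729.9}{150685.0546} \approx 1015.33$$

$$R^2 = \frac{SS_{Reg}}{SS_{Total}} = \frac{1835936758}{2889978715} \approx 0.6353$$

$$R_a^2 = 1 - \frac{SS_{Res}/6995}{SS_{Total}/7007} = 1 - \frac{150685.0546}{412441.66} \approx 0.6347$$

These values agree with the R-squared and adjusted R-squared reported for Model 4 in Section 4.8.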

Final Regression Equation and Interpretation:

Rented Bike Count = (-455.127) + (44.29 * Hour) - (56.716 * Solar_Radiation) - (62.94 * Rainfall)
- (336.79 * I_Spring) + (1004.89 * I_Summer) - (364.67 * I_Winter) + (934.58 * Functioning Day)
+ (2.24 * Hour * Temperature) + (15.48 * I_Spring * Temperature) - (43.29 * I_Summer * Temperature)
- (9.39 * I_Winter * Temperature) - (0.65 * Hour * Humidity)
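To make the role of the indicator variables and interaction terms concrete, the equation can be written as a small function. This is a sketch under stated assumptions, not the report's R code: Autumn is treated as the baseline season (all three indicators equal to 0), Humidity is entered in percent as listed in the data description, Functioning Day is coded 1 or 0, and each interaction term is the plain product of its raw inputs. The function and constant names are hypothetical.

```python
# Coefficients copied from the final regression equation.
INTERCEPT = -455.127
SEASON_SHIFT = {"Autumn": 0.0, "Spring": -336.79, "Summer": 1004.89, "Winter": -364.67}
SEASON_TEMP_SLOPE = {"Autumn": 0.0, "Spring": 15.48, "Summer": -43.29, "Winter": -9.39}

def predict_rented_bikes(hour, temperature, humidity, solar_radiation,
                         rainfall, season, functioning_day):
    """Evaluate the final regression equation for one hourly observation.

    Assumptions: Autumn is the baseline season, humidity is in percent,
    and functioning_day is 1 (functioning) or 0 (not functioning).
    """
    return (
        INTERCEPT
        + 44.29 * hour
        - 56.716 * solar_radiation
        - 62.94 * rainfall
        + SEASON_SHIFT[season]
        + 934.58 * functioning_day
        + 2.24 * hour * temperature
        + SEASON_TEMP_SLOPE[season] * temperature
        - 0.65 * hour * humidity
    )

# Hypothetical input, not a row from the data set.
estimate = predict_rented_bikes(hour=18, temperature=30.0, humidity=60.0,
                                solar_radiation=0.5, rainfall=0.0,
                                season="Summer", functioning_day=1)
print(round(estimate))
```

Keeping the season effects in lookup tables mirrors how a categorical variable with a baseline level enters a regression: the baseline contributes nothing, and every other level shifts both the intercept and the temperature slope.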

Figure 10: Levels of different categorical variables

Consider an example.

The number of rental bikes required in Autumn on a Functioning Day at 1 o'clock in the morning, with rainfall
on that day of 0 mm, humidity of 55%, a temperature of 24.1 °C and solar radiation recorded as 1.46 MJ/m2,
is calculated as

Rented Bike Count = (-455.127) + (44.29 * 1) – (56.716 * 1.46) – (62.94 * 0) + (934.58 * 1) + (2.24 * 216.9) –
(0.65 * 0.55) = 926

The actual value on that day was 925. Thus, the absolute residual equals 1.

4.4 Multicollinearity
Multicollinearity is assessed using the variance inflation factor (VIF), where VIF > 5 is taken to indicate
multicollinearity in the model.

R Output:

Figure 11: Multicollinearity Diagnostics

As observed in the above R output, the final model (model4) does not exhibit multicollinearity as the value of
VIF for all the predictor variables is less than 5.
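For reference, the VIF of predictor j is defined from the R-squared obtained when that predictor is regressed on all the other predictors. The symbol R_j^2 is notation introduced here for that auxiliary R-squared.

$$\mathrm{VIF}_j = \frac{1}{1 - R_j^2}$$

Under this definition, the cut-off VIF > 5 corresponds to R_j^2 > 0.8, i.e., more than 80% of the variation in a predictor being explained by the remaining predictors.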

4.5 Tests for Regression Coefficients


As shown above in the model 4 model analysis, the ANOVA and F-test indicates that null hypothesis (all β are
equal to 0) is rejected. Thus, there is at least one β which is statistically significant.

Test of significance of Predictor Xi: t-tests of Regression Coefficients

R Output:

As shown in the output below, all the predictor variables now have P-Values less than alpha (alpha = 5%). Thus,
all are significant predictors.


Figure 12: Final Model t-tests for significance of predictor variables


4.6 Checking Linear Regression Assumptions
1. Serial Correlation - Yi are independent over i: Since the data is collected over a year, Breusch-Godfrey
test will be used to check for serial correlation. There is no serial correlation present in the final model
selected.
2. Homoscedasticity – Y values have the same standard deviation: Breusch-Pagan test will be used to
check for the same.
R Output:

H0: the error variances σ² are equal

H1: the error variances σ² are not equal

Figure 13: Breusch Pagan Test for Homoscedasticity

Since the P-Value is less than alpha (5%), we can reject the null hypothesis. This means that the standard
deviations are not equal and there may be heteroskedasticity present in the Y variable.

3. Y values are normally distributed: Shapiro-Wilk’s test is the usual check for normality. However, it can
only be conducted for sample sizes between 3 and 5000, which this data set exceeds. Since the sample size is
far more than 30, we assume that the Y values follow a normal distribution.
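The idea behind the Breusch-Pagan check in item 2 can be sketched in a few lines. The version below is a simplified illustration with a single predictor, using the studentized form of the statistic (sample size times the R-squared of a regression of the squared residuals on the predictor). It is not the R call used for Figure 13, and the function names and data are assumptions.

```python
from math import erfc, sqrt

def simple_ols(x, y):
    """Fit y = a + b*x by least squares; return (a, b, residuals, r_squared)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    ss_res = sum(e ** 2 for e in residuals)
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return intercept, slope, residuals, 1 - ss_res / ss_tot

def breusch_pagan_one_predictor(x, y):
    """Studentized Breusch-Pagan statistic and p-value for one predictor."""
    _, _, residuals, _ = simple_ols(x, y)
    squared = [e ** 2 for e in residuals]
    # Auxiliary regression: do the squared residuals depend on x?
    _, _, _, aux_r2 = simple_ols(x, squared)
    lm = max(0.0, len(x) * aux_r2)
    # With one predictor, LM is compared to a chi-square distribution with
    # 1 degree of freedom, whose upper tail equals erfc(sqrt(LM / 2)).
    p_value = erfc(sqrt(lm / 2))
    return lm, p_value

# Hypothetical data whose error spread grows with x.
x = list(range(1, 51))
y = [2 + 3 * xi + (xi if i % 2 == 0 else -xi) for i, xi in enumerate(x)]
lm, p_value = breusch_pagan_one_predictor(x, y)
print(round(lm, 2), p_value < 0.05)
```

A small p-value, as in the report's output, means the squared residuals vary systematically with the predictor, so the constant-variance assumption is rejected.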

4.7 Linear Regression Model Diagnostics – Error/Residual Analysis


Residual analysis will be carried out for regression diagnostics exercise.

1. Standardised Residuals vs Fitted Value

Figure 14: Standard Residuals vs Fitted Values

The above plot is used for checking the randomness and the constancy of error terms. As can be
observed, the error terms have a non-constant variance. This is consistent with the Breusch-
Pagan test conducted above.

2. Detecting Outliers and Influence Points


The Residuals vs Leverage plot, which shows standardised residuals together with Cook’s distance, has been
used to test for outliers and influential points.
• A point is influential if its Cook’s distance > 1
• A point is flagged if its Cook’s distance > 0.5

Figure 15: Cook's Distance

As can be observed from the two plots above, none of the points has a Cook’s distance of more than 0.5. Thus,
none of the points is influential or needs to be flagged. As such, there are no points whose deletion would
change the model substantially.
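For reference, Cook’s distance combines the size of a point’s residual with its leverage. In the standard form below, r_i is the standardised residual of observation i, h_ii is its leverage and p is the number of estimated coefficients including the intercept. This notation is introduced here and does not appear in the R output.

$$D_i = \frac{r_i^2}{p} \cdot \frac{h_{ii}}{1 - h_{ii}}$$

A point therefore needs both a large residual and a high leverage to reach the thresholds of 0.5 or 1 used above.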

3. Non-Normality of Errors

Four types of plots are shown below to assess the normality of the errors: a plot of standardised
residuals, a histogram, a box plot and a Q-Q plot of the standardised residuals. As the sample size is
more than 30, the errors are taken to be approximately normal, which can be examined in each of these plots.

Figure 16: Non-Normality of Errors

4.8 Model Performance/Validation


1. R-Square Approach
The R-square and the adjusted R-square of every model discussed above are shown below.

Model         Model 1   Model 2   Model 3   Model 4
R2            0.5519    0.5519    0.637     0.6353
Adjusted R2   0.5511    0.5511    0.6361    0.6347

The final model, i.e., model 4 is finally selected as the best fit model. The model has an R-square of
63% which means that 63% of the variation can be explained using regression. The regression model
has a reasonable predictive ability.

2. RMSE Approach
• The final model is run on the training data set and the root mean square error (RMSE) is
calculated.
  ◦ The RMSE for the training data set is 388.
• The same model is then tested on the testing data set, to study the out-of-sample
performance of the final model.
  ◦ The RMSE for the testing data set is 837.
• There is a considerable difference between the two RMSE values. Since the RMSE value for
the testing data set is higher than that for the training data set, the model can be further improved to
avoid over-fitting.
  ◦ However, this may result in a drop in the R-squared value of the model. This suggests
that the model can be further improved by using certain transformations of the
predictor variables to obtain a better fit.
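The RMSE comparison above can be sketched as follows. The `rmse` helper is an illustrative Python function rather than the R code from the appendix, and the two RMSE figures are the ones reported above.

```python
from math import sqrt

def rmse(actual, predicted):
    """Root mean square error between observed and predicted values."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sqrt(sum(squared_errors) / len(squared_errors))

# RMSE values reported above for the final model.
train_rmse = 388
test_rmse = 837
ratio = test_rmse / train_rmse

print(round(ratio, 2))  # how many times larger the out-of-sample error is
```

The testing error is more than twice the training error, which is the gap the report interprets as a sign of over-fitting.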

4.9 Results and Interpretation


The final regression equation is given below:

Rented Bike Count = (-455.127) + (44.29 * Hour) - (56.716 * Solar_Radiation) - (62.94 * Rainfall)
- (336.79 * I_Spring) + (1004.89 * I_Summer) - (364.67 * I_Winter) + (934.58 * Functioning Day)
+ (2.24 * Hour * Temperature) + (15.48 * I_Spring * Temperature) - (43.29 * I_Summer * Temperature)
- (9.39 * I_Winter * Temperature) - (0.65 * Hour * Humidity)

The interpretation of the model can be considered using an example as shown below:

The number of rental bikes required in Autumn on a Functioning Day at 1 o'clock in the morning, with rainfall
on that day of 0 mm, humidity of 55%, a temperature of 24.1 °C and solar radiation recorded as 1.46 MJ/m2,
is calculated as

Rented Bike Count = (-455.127) + (44.29 * 1) – (56.716 * 1.46) – (62.94 * 0) + (934.58 * 1) + (2.24 * 216.9) –
(0.65 * 0.55) = 926

The actual value on that day was 925. Thus, the absolute residual equals 1.

5. Discussion/Conclusion and Future Work


The problem we intended to solve was to estimate the number of rental bikes required in a city like
Seoul. For this, we started with all the different types of predictor variables, such as weather information,
season and holiday/no holiday, and discussed the impact of each predictor variable on the model output. The
model was also checked for interaction effects which could impact the output, and the considerable effects
were included in our final model.

• The final model has an R-squared value equal to 63%, which means that 63% of the variation is
explained by our regression model.
• The final model selected gives a higher RMSE value on the testing data set as compared to the
training data set.
• The scope of future work includes the following:
  ◦ Transforming the predictor variables: Since the predictor variables are not strongly linearly
correlated with the output variable, some kind of transformation may be required.
  ◦ Heteroskedasticity: The Breusch-Pagan test indicates the presence of heteroskedasticity. This
can be addressed in the future development of the model.
  ◦ Non-Linear Model: The high RMSE values indicate that fitting a linear model may not be a
good idea on this data set. This can also be observed from the scatter plots in our
exploratory analysis.
  ◦ Interaction: The interaction effect can also be analysed for all combinations of the
predictor variables, going beyond the second order to see its effect.

6. References
[1] Dataset - https://archive.ics.uci.edu/ml/datasets/Seoul+Bike+Sharing+Demand

7. Contribution
Contribution                                                   Members
Data Description, Data Cleaning, Result Reporting, and         Abhishek Kapoor and Aayush Khandelwal
subsequent R Codes
Exploratory Analysis (Descriptive Summary, Scatter Plots,      Adyathma Ela and Akanksha Mittal
Histograms) and related R Codes
Model Building and R Codes                                     Aman Khanna and Aarushi
Report Making                                                  All Members

8. Appendix
I. R Codes for Exploratory Analysis

II. R Codes for Model Building, Model Assumptions, Model Diagnostics and RMSE
