Seoul Rental Bike Data Analysis and Modeling: Quantitative Techniques - II
QUANTITATIVE TECHNIQUES – II
BM Section A – Group A1
Aarushi | BJ20001
Aayush Khandelwal | BJ20002
Abhishek Kapoor | BJ20003
Adyathma Ela | BJ20004
Akanksha Garima Mittal | BJ20005
Aman Khanna | BJ20006
Table of Contents
1. Introduction
2. Data Description
3. Exploratory Analysis
3.1 Descriptive Summary
3.2 Scatter Plots
3.3 Box Plots
3.4 Correlation Analysis between Continuous Variables
4. Regression Modelling
4.1 Model Building
4.2 Interaction Effect
4.3 Including Interaction Effect in the Regression Model
4.4 Multicollinearity
4.5 Tests for Regression Coefficients
4.6 Checking Linear Regression Assumptions
4.7 Linear Regression Model Diagnostics – Error/Residual Analysis
4.8 Model Performance/Validation
4.9 Results and Interpretation
5. Discussion/Conclusion and Future Work
6. References
7. Contribution
8. Appendix
1. Introduction
Seoul is one of the most visited tourist destinations and attracts tourists from within South Korea as well as
from abroad. Most tourists in Seoul rent 2-wheelers and cars for travelling. The tourist season is also limited
to a certain part of the year. A lack of information on the number of vehicles currently on the road makes it
difficult for the local administration of Seoul to manage traffic effectively. This exercise will also help local
rental vendors maintain their inventory of 2-wheelers and plan the maintenance of that inventory accordingly.
In peak tourist season, the price of rental vehicles rises to two or three times the normal price. These problems
can be reduced if there is prior knowledge, or an estimate, of the number of tourists coming to Seoul.
Using this data set, we try to find whether there is any correlation between external environmental factors
and the number of vehicles rented in Seoul, South Korea. This data set seemed interesting because of the
variables used to determine the 2-wheeler requirement. We use weather conditions such as temperature,
humidity, wind speed and visibility, and try to find a correlation between these factors and the number of
2-wheelers rented out. If a correlation is found, it can be refined further, and more variables can be added
to obtain a more precise model. Eliminating variables that show very low correlation will help improve the
accuracy of the model.
We plan to understand the effect of each predictor variable on the model output, i.e., the count of rented
bikes, using an exploratory analysis. The insights drawn from the scatter plots and correlation analysis are
used as motivation for model building. For model building, all predictor variables were first included to
understand their contribution. Interaction effects among various predictor variables were studied in detail,
and some were included in the model. We then removed the predictor variables that were insignificant as per
the t-tests, which were conducted after the ANOVA F-test. Finally, multicollinearity checks, model assumption
checks and model diagnostics were carried out in detail.
This exercise will help reduce the waiting time for 2-wheelers and hence help provide a better service to the
public.
2. Data Description
Data is taken from the research paper by Sathishkumar V E, Jangwoo Park, and Yongyun Cho, 'Using data mining
techniques for bike sharing demand prediction in metropolitan city', Computer Communications, Vol. 153,
pp. 353-366, March 2020 [1]. The variables used in this research paper are as follows:
We believe that this data plays a crucial role in determining the number of people visiting Seoul and
then choosing a 2-wheeler, i.e., a rented bike.
3. Exploratory Analysis
3.1 Descriptive Summary
Table 1: Five-Number Summary
Variable                     Minimum   Q1      Median   Q3      Maximum
Rented Bike Count            0         191     504      1065    3556
Hour                         0         5       11       17      23
Temperature (°C)             -17.8     3.5     13.7     22.5    39.4
Humidity (%)                 0         42      57       74      98
Wind speed (m/s)             0         0.9     1.5      2.3     7.4
Visibility (10m)             27        940     1698     2000    2000
Dew point temperature (°C)   -30.6     -4.7    5.1      14.8    27.2
Solar Radiation (MJ/m2)      0         0       0.01     0.93    3.52
Rainfall (mm)                0         0       0        0       35
Snowfall (cm)                0         0       0        0       8.8
The data for Rented Bike Count, Hour, Temperature, Humidity, Wind speed, Visibility and Dew point
temperature cover a wide enough range to support analysis.
The data for solar radiation show that solar radiation was 0 for almost half of the observations (the median
is 0.01), and the data for rainfall and snowfall show that it hardly rained or snowed during the recording
period (the third quartile of both is 0).
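As a minimal sketch of how a five-number summary like Table 1 can be computed, the Python snippet below uses NumPy percentiles. The function name `five_number_summary` and the sample values are illustrative assumptions and are not taken from the report's R code or from the Seoul data set.

```python
import numpy as np

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a 1-D sequence."""
    arr = np.asarray(values, dtype=float)
    # Linear-interpolation percentiles (NumPy's default method).
    q1, median, q3 = np.percentile(arr, [25, 50, 75])
    return float(arr.min()), float(q1), float(median), float(q3), float(arr.max())

# Illustrative values only, not rows of the Seoul data set.
sample_counts = [0, 120, 250, 504, 800, 1100, 3556]
summary = five_number_summary(sample_counts)
print(summary)
```

A variable whose median is 0 or close to 0, as for rainfall and snowfall in Table 1, is zero in at least about half of the observations.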
The rented bike count decreased as Solar Radiation increased.
The rented bike count increased as humidity rose from 20% to an optimum of about 50%, after which it
decreased.
3.4 Correlation Analysis between Continuous Variables
Pearson correlation is computed for all the continuous variables, and the correlation matrix is presented in the
form of a plot as shown below.
For our model analysis, we have assumed that one predictor of a pair can be dropped from the model if the
correlation between the two predictors is more than 70%.
1. As seen from the correlation plot above, there exists 91% positive correlation between the predictors
“Dew_Point_Temperature” and “Temperature”. Thus, “Dew_Point_Temperature” can be dropped
from further model analysis.
2. Moderate correlation can also be seen between “Humidity” and “Visibility” and “Humidity” and
“Solar_Radiation”. However, since the correlation is less than 70%, we may continue with these
predictors for further analysis.
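The 70% screening rule described above can be sketched as follows. This is a Python illustration on synthetic data, not the report's R code; the helper `highly_correlated_pairs` is a hypothetical name, and the synthetic "dew point" column is deliberately constructed to track temperature.

```python
import numpy as np

def highly_correlated_pairs(data, names, threshold=0.70):
    """List predictor pairs whose absolute Pearson correlation exceeds threshold."""
    corr = np.corrcoef(data, rowvar=False)  # columns are variables
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(corr[i, j]) > threshold:
                pairs.append((names[i], names[j], float(corr[i, j])))
    return pairs

# Synthetic data for illustration only (not the Seoul data set).
rng = np.random.default_rng(0)
temperature = rng.normal(13.0, 12.0, size=500)
dew_point = temperature - 9.0 + rng.normal(0.0, 3.0, size=500)  # built to follow temperature
humidity = rng.uniform(0.0, 98.0, size=500)                     # independent of the others

data = np.column_stack([temperature, dew_point, humidity])
names = ["Temperature", "Dew_Point_Temperature", "Humidity"]
flagged = highly_correlated_pairs(data, names)
print(flagged)
```

Only one member of each flagged pair needs to be dropped; which one to keep is a modelling choice.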
4. Regression Modelling
The regression modelling has been carried out in various steps as described below in detail.
1. We start with a basic model including all the predictor variables, except “Dew_Point_Temperature”
which is removed due to high correlation with “Temperature”, to get an idea of the significance levels
of different predictor variables.
2. An ANOVA F-test is then conducted to establish goodness of fit. If the null hypothesis (all slope
coefficients β are equal to 0) is rejected, we proceed to t-tests to check the significance of each
predictor variable.
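The overall F-test in step 2 compares the variation explained by the regression with the unexplained variation. The sketch below computes the F statistic from the ANOVA sums of squares with NumPy; it is an illustration on synthetic data with a hypothetical helper `overall_f_test`, not the report's R output. The p-value would come from the F distribution with the returned degrees of freedom.

```python
import numpy as np

def overall_f_test(X, y):
    """Return (F statistic, model df, error df) for H0: all slope coefficients are 0."""
    n, k = X.shape
    design = np.column_stack([np.ones(n), X])          # add the intercept column
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)  # ordinary least squares fit
    fitted = design @ beta
    ssr = float(np.sum((fitted - y.mean()) ** 2))      # regression sum of squares
    sse = float(np.sum((y - fitted) ** 2))             # error (residual) sum of squares
    f_stat = (ssr / k) / (sse / (n - k - 1))
    return f_stat, k, n - k - 1

# Synthetic data for illustration only (not the Seoul data set).
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = 5.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(size=200)
f_stat, df_model, df_error = overall_f_test(X, y)
print(f_stat, df_model, df_error)
```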
I. Model 1: All predictor variables excluding “Dew_Point_Temperature”
A linear model is fitted on all the predictor variables, and the goodness of fit is checked using an ANOVA table
as shown below.
R Output:
Since the above ANOVA table from R output is incomplete, a complete ANOVA table along with F-test is shown
below.
R Output:
1. “Visibility” is insignificant as its P-Value is 0.569 which is more than 0.05. Hence, this predictor can be
removed from the model.
2. Adjusted R-squared is 55.11%
(c) Using Step function to select the best model as per AIC value
As per the step function, the lowest AIC value is obtained by dropping the “Visibility” predictor from the
model. This agrees with the results of the t-tests, where this variable was insignificant.
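The idea behind AIC-based backward selection can be sketched in Python as follows. The sketch assumes a common AIC form for Gaussian linear models, n * ln(RSS / n) + 2p, where p counts the fitted coefficients. The helpers `aic_linear` and `backward_step` are hypothetical names, the data are synthetic, and this is not a reimplementation of R's step function.

```python
import numpy as np

def aic_linear(X, y):
    """AIC (up to an additive constant) of an OLS fit with intercept."""
    n = len(y)
    design = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    rss = float(np.sum((y - design @ beta) ** 2))
    return float(n * np.log(rss / n) + 2 * design.shape[1])

def backward_step(X, y, names):
    """Drop one predictor at a time while doing so lowers the AIC."""
    keep = list(range(X.shape[1]))
    best = aic_linear(X[:, keep], y)
    while keep:
        # AIC of every model obtained by removing a single remaining predictor.
        candidates = [(aic_linear(X[:, [c for c in keep if c != j]], y), j) for j in keep]
        cand_aic, drop = min(candidates)
        if cand_aic >= best:
            break  # no single removal improves the AIC
        best = cand_aic
        keep.remove(drop)
    return [names[j] for j in keep], best

# Synthetic data for illustration only; the third column carries no signal.
rng = np.random.default_rng(2)
X = rng.normal(size=(300, 3))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(size=300)
names = ["Hour", "Temperature", "Visibility"]  # labels borrowed from the report, data are synthetic
selected, final_aic = backward_step(X, y, names)
print(selected, final_aic)
```

Predictors with a strong effect are retained because removing them raises the residual sum of squares sharply, while a predictor with little explanatory power is a candidate for removal.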
II. Model 2: Model based on lowest AIC value by starting from Model 1
The R-squared value for Model 2 obtained is 55.11% with no insignificant variables. The below figure
shows the summary of model 2 R output.
R Output:
4.2 Interaction Effect
Based on the data set scatter plots, the interaction effect is studied among various combinations of predictor
variables to check the presence of interaction.
Figure 7: Interaction Effect between Hour and Temperature for a particular season
(Summer)
R Output:
Observations:
1. The model R-squared has improved from 55.11% to 63.61%, which is a substantial improvement.
2. “Temperature”, “Humidity” and “Snowfall” have now become insignificant as their P-Values are more
than 5%.
3. The interaction effect has certainly improved the model R-squared but has made a few predictors
insignificant. Thus, the model can be further improved so that there are no insignificant predictors.
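To illustrate why adding an interaction term can raise R-squared, the Python sketch below fits a model with and without an Hour * Temperature product column. The data are synthetic and generated with a true interaction, so the figures do not correspond to the report's 55.11% and 63.61%; the helper `r_squared` is a hypothetical name.

```python
import numpy as np

def r_squared(design, y):
    """R-squared of an OLS fit for a design matrix that already has an intercept column."""
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    return float(1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2))

# Synthetic data for illustration only (not the Seoul data set).
rng = np.random.default_rng(3)
n = 400
hour = rng.integers(0, 24, size=n).astype(float)
temperature = rng.normal(13.0, 12.0, size=n)
# The response is generated with a genuine Hour * Temperature interaction.
y = 40.0 * hour + 2.0 * hour * temperature + rng.normal(0.0, 50.0, size=n)

main_effects = np.column_stack([np.ones(n), hour, temperature])
with_interaction = np.column_stack([main_effects, hour * temperature])

r2_main = r_squared(main_effects, y)
r2_interaction = r_squared(with_interaction, y)
print(r2_main, r2_interaction)
```

Because the product column is correlated with its component variables, the main effects can lose significance once it is added, which matches observation 2.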
IV. Model 4: Final Model
In the final model, insignificant predictors are removed while ensuring that the R-squared value
remains high.
R Output:
Rented Bike Count = (-455.127) + (44.29 * Hour) – (56.716 * Solar_Radiation) – (62.94 * Rainfall) – (336.79 *
ISpring) + (1004.89 * ISummer) – (364.67 * IWinter) + (934.58 * Functioning Day) + (2.24 * Hour * Temperature) + (15.48
* ISpring * Temperature) – (43.29 * ISummer * Temperature) – (9.39 * IWinter * Temperature) – (0.65 * Hour *
Humidity)
Consider an example.
The number of rental bikes required in Autumn on a Functioning Day at 1 o'clock in the morning, with rainfall
of 0 mm, humidity of 55%, temperature of 24.1 °C and Solar Radiation of 1.46 MJ/m2 on that day, is
calculated as
Rented Bike Count = (-455.127) + (44.29 * 1) – (56.716 * 1.46) – (62.94 * 0) + (934.58 * 1) + (2.24 * 216.9) –
(0.65 * 0.55) = 926
The actual value on that day was 925. Thus, the residual (actual minus predicted) is approximately -1.
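The arithmetic of the worked example can be checked with a few lines of Python. The coefficients are those of the final model; the two interaction products (216.9 and 0.55) are copied as printed in the example rather than recomputed from the stated inputs, and the residual is taken as actual minus predicted. Variable names are illustrative.

```python
# Coefficients of the final model as printed in the report.
INTERCEPT = -455.127
COEF_HOUR = 44.29
COEF_SOLAR_RADIATION = -56.716
COEF_RAINFALL = -62.94
COEF_FUNCTIONING_DAY = 934.58
COEF_HOUR_X_TEMPERATURE = 2.24
COEF_HOUR_X_HUMIDITY = -0.65

# Inputs of the worked example. The example is for Autumn, so the ISpring,
# ISummer and IWinter terms are zero and are omitted.
hour = 1
solar_radiation = 1.46
rainfall = 0.0
functioning_day = 1
hour_x_temperature = 216.9  # product as printed in the example
hour_x_humidity = 0.55      # product as printed in the example

predicted = (
    INTERCEPT
    + COEF_HOUR * hour
    + COEF_SOLAR_RADIATION * solar_radiation
    + COEF_RAINFALL * rainfall
    + COEF_FUNCTIONING_DAY * functioning_day
    + COEF_HOUR_X_TEMPERATURE * hour_x_temperature
    + COEF_HOUR_X_HUMIDITY * hour_x_humidity
)
predicted_rounded = round(predicted)

actual = 925
residual = actual - predicted_rounded  # residual = actual minus predicted
print(predicted, predicted_rounded, residual)
```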
4.4 Multicollinearity
Multicollinearity is assessed using the variance inflation factor (VIF), where a VIF greater than 5 is taken to indicate multicollinearity in the model.
R Output:
As observed in the above R output, the final model (model4) does not exhibit multicollinearity as the value of
VIF for all the predictor variables is less than 5.
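For reference, the VIF of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on all the other predictors. The Python sketch below computes it directly with NumPy on synthetic data; the function name `variance_inflation_factors` is a hypothetical helper, and the numbers are not the report's R output.

```python
import numpy as np

def variance_inflation_factors(X):
    """VIF of each column of X: 1 / (1 - R^2) from regressing it on the other columns."""
    n, k = X.shape
    vifs = []
    for j in range(k):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        design = np.column_stack([np.ones(n), others])
        beta, *_ = np.linalg.lstsq(design, target, rcond=None)
        resid = target - design @ beta
        r2 = 1.0 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
        vifs.append(float(1.0 / (1.0 - r2)))
    return vifs

# Synthetic data for illustration only: the second column is built to follow the first.
rng = np.random.default_rng(5)
temperature = rng.normal(13.0, 12.0, size=500)
dew_point = temperature - 9.0 + rng.normal(0.0, 3.0, size=500)
humidity = rng.uniform(0.0, 98.0, size=500)

X = np.column_stack([temperature, dew_point, humidity])
vifs = variance_inflation_factors(X)
print(vifs)
```

In this synthetic case the two temperature-like columns exceed the threshold of 5 while the independent column stays close to 1.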
R Output:
As shown in the output below, all the predictor variables now have P-Values less than alpha (alpha = 5%). Thus,
all are significant predictors.
Since the P-Value is less than alpha (5%), we can reject the null hypothesis of constant error variance. This
means that the variance of the errors is not constant, i.e., heteroskedasticity may be present in the model.
3. Y values are normally distributed: Shapiro-Wilk’s test can be used to check for normality. However,
since the sample size is far more than 30, we assume that the normality assumption holds approximately.
Also, Shapiro-Wilk’s test can only be conducted for sample sizes between 3 and 5000.
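The Breusch-Pagan check referred to above can be sketched as follows. This Python illustration uses the studentized (Koenker) form of the statistic, n times the R² of an auxiliary regression of the squared residuals on the predictors, which is an assumption about the variant and not a statement about the report's R call. With a single predictor the statistic is compared with 3.841, the 5% critical value of the chi-square distribution with 1 degree of freedom. The data are synthetic.

```python
import numpy as np

CHI2_CRIT_5PCT_DF1 = 3.841  # 5% critical value, chi-square with 1 degree of freedom

def ols_residuals(design, y):
    """Residuals of an OLS fit for a design matrix that includes the intercept."""
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return y - design @ beta

def breusch_pagan_lm(X, y):
    """Studentized Breusch-Pagan LM statistic: n * R^2 of squared residuals on X."""
    n = len(y)
    design = np.column_stack([np.ones(n), X])
    squared = ols_residuals(design, y) ** 2
    aux_resid = ols_residuals(design, squared)  # auxiliary regression of e^2 on X
    r2_aux = 1.0 - (aux_resid @ aux_resid) / np.sum((squared - squared.mean()) ** 2)
    return float(n * r2_aux)

# Synthetic data for illustration only: the error spread grows with x.
rng = np.random.default_rng(6)
x = rng.uniform(1.0, 10.0, size=500)
y = 2.0 * x + x * rng.normal(size=500)

lm_stat = breusch_pagan_lm(x, y)
reject_constant_variance = lm_stat > CHI2_CRIT_5PCT_DF1
print(lm_stat, reject_constant_variance)
```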
The above plot is used for checking the randomness and the constancy of the variance of the error terms.
As observed, the error terms have a non-constant variance. This is consistent with the Breusch-Pagan
test conducted above.
As can be observed from the two plots above, none of the points has a Cook’s distance of more than 0.5. Thus,
no point is influential or needs to be flagged; deleting any single point would not change the model
substantially.
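Cook's distance combines each point's residual with its leverage. A minimal Python sketch of the calculation and of the 0.5 cut-off used above is given below; the data are synthetic, with a deterministic sine term standing in for noise, and `cooks_distance` is a hypothetical helper rather than the report's R code.

```python
import numpy as np

def cooks_distance(X, y):
    """Cook's distance of every observation in an OLS fit with intercept."""
    n = len(y)
    design = np.column_stack([np.ones(n), X])
    p = design.shape[1]
    hat = design @ np.linalg.pinv(design.T @ design) @ design.T  # hat matrix
    leverage = np.diag(hat)
    resid = y - hat @ y
    mse = float(resid @ resid) / (n - p)
    return (resid ** 2 / (p * mse)) * leverage / (1.0 - leverage) ** 2

# Well-behaved synthetic data: a straight line plus a small deterministic wiggle.
x = np.linspace(0.0, 10.0, 50)
y = 2.0 * x + 0.5 * np.sin(np.arange(50))
d_clean = cooks_distance(x, y)

# The same data with one far-away point that does not follow the line.
x_out = np.append(x, 30.0)
y_out = np.append(y, 0.0)
d_out = cooks_distance(x_out, y_out)

print(d_clean.max(), d_out.max(), int(np.argmax(d_out)))
```

In the clean case every distance stays far below 0.5, while the added point is flagged because it has both high leverage and a large residual.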
3. Non-Normality of Errors
Four types of plots are shown below to illustrate the normality of the errors: a plot of standardised
residuals, a histogram, a box plot and a Q-Q plot of standardised residuals. As the sample size is more
than 30, approximate normality can be seen in each of these plots.
The final model, i.e., model 4, is selected as the best-fit model. The model has an R-squared of
63%, which means that 63% of the variation in the rented bike count is explained by the regression.
The regression model has a reasonable predictive ability.
2. RMSE Approach
The final model is run on the training data set and the root mean square error (RMSE) is
calculated.
o The RMSE for the training data set is 388.
The same model is then run on the testing data set to study its out-of-sample
performance.
o The RMSE for the testing data set is 837.
There is a considerable difference between the two RMSE values. Since the RMSE for
the testing data set is higher than that for the training data set, the model may be
over-fitting and can be improved further.
o However, it may result in a drop in the R-squared value of the model. This suggests
that the model can be further improved by using certain transformations of the
predictor variables, which may give a better fit.
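The RMSE comparison above follows a simple recipe: fit on the training set, then compute the root mean square error separately on the training and testing sets. A minimal Python sketch on synthetic data is given below; the `rmse` helper and the 150/50 split are illustrative assumptions, and the resulting numbers are unrelated to the reported 388 and 837.

```python
import numpy as np

def rmse(actual, predicted):
    """Root mean square error between observed and predicted values."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

# Synthetic data for illustration only (not the Seoul data set).
rng = np.random.default_rng(4)
x = rng.uniform(0.0, 10.0, size=200)
y = 3.0 * x + rng.normal(0.0, 2.0, size=200)

# Hold out the last 50 observations as the testing set.
x_train, y_train = x[:150], y[:150]
x_test, y_test = x[150:], y[150:]

# Fit a straight line on the training set only.
slope, intercept = np.polyfit(x_train, y_train, 1)

rmse_train = rmse(y_train, slope * x_train + intercept)
rmse_test = rmse(y_test, slope * x_test + intercept)
print(rmse_train, rmse_test)
```

A testing RMSE far above the training RMSE, as in the report, indicates that the fitted model does not generalise well to unseen data.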
Rented Bike Count = (-455.127) + (44.29 * Hour) – (56.716 * Solar_Radiation) – (62.94 * Rainfall) – (336.79 *
ISpring) + (1004.89 * ISummer) – (364.67 * IWinter) + (934.58 * Functioning Day) + (2.24 * Hour * Temperature) + (15.48
* ISpring * Temperature) – (43.29 * ISummer * Temperature) – (9.39 * IWinter * Temperature) – (0.65 * Hour * Humidity)
The interpretation of the model can be illustrated using the example below:
The number of rental bikes required in Autumn on a Functioning Day at 1 o'clock in the morning, with rainfall
of 0 mm, humidity of 55%, temperature of 24.1 °C and Solar Radiation of 1.46 MJ/m2 on that day, is
calculated as
Rented Bike Count = (-455.127) + (44.29 * 1) – (56.716 * 1.46) – (62.94 * 0) + (934.58 * 1) + (2.24 * 216.9) –
(0.65 * 0.55) = 926
The actual value on that day was 925. Thus, the residual (actual minus predicted) is approximately -1.
The final model has an R-squared value equal to 63% which means that 63% of variation is being
explained by our regression model.
The final model selected gives a higher RMSE value on the testing data set as compared to the
training data set.
The scope of future work includes the following:
o Transforming the predictor variables: Since the predictor variables are not strongly linearly
correlated with the output variable, some kind of transformation may be required.
o Heteroskedasticity: The Breusch-Pagan test indicates the presence of heteroskedasticity. This
can be addressed in the future development of the model.
o Non-Linear Model: The high RMSE values indicate that fitting a linear model may not be a
good idea for this data set. This can also be observed in the scatter plots in our
exploratory analysis.
o Interaction: The interaction effect can also be analysed for all combinations of the
predictor variables, going beyond the second order to see its effect.
6. References
[1] Dataset - https://archive.ics.uci.edu/ml/datasets/Seoul+Bike+Sharing+Demand
7. Contribution
Data Description, Data Cleaning, Result Reporting, and subsequent R Codes: Abhishek Kapoor and Aayush Khandelwal
Exploratory Analysis (Descriptive Summary, Scatter Plots, Histograms) and related R Codes: Adyathma Ela and Akanksha Mittal
Model Building and R Codes: Aman Khanna and Aarushi
Report Making: All Members
8. Appendix
I. R Codes for Exploratory Analysis
II. R Codes for Model Building, Model Assumptions, Model Diagnostics and RMSE