ETRI Journal - 2022 - Gangrade - Taxi Demand Forecasting Using Dynamic Spatiotemporal Analysis


Received: 7 April 2021 Revised: 12 August 2021 Accepted: 27 September 2021

DOI: 10.4218/etrij.2021-0123

ORIGINAL ARTICLE

Taxi-demand forecasting using dynamic spatiotemporal analysis

Akshata Gangrade | Pawel Pratyush | Gaurav Hajela

Department of Computer Science and Engineering, Maulana Azad National Institute of Technology, Bhopal, India

Correspondence
Gaurav Hajela, Department of Computer Science and Engineering, Maulana Azad National Institute of Technology, Bhopal, India.
Email: contactgauravhajela@gmail.com

Abstract
Taxi-demand forecasting and hotspot prediction can be critical in reducing response times and designing a cost-effective online taxi-booking model. Taxi demand in a region can be predicted by considering the past demand accumulated in that region over a span of time. However, other covariates, such as neighborhood influence, sociodemographic parameters, and point-of-interest data, may also influence the spatiotemporal variation of demand. To study the effects of these covariates, in this paper, we propose three models that consider different covariates in order to select a set of independent variables. These models predict taxi demand in spatial units for a given temporal resolution using linear and ensemble regression. We eventually combine the characteristics (covariates) of each of these models to propose a robust forecasting framework, which we call the combined covariates model (CCM). Experimental results show that the CCM performs better than the other models proposed in this paper.

KEYWORDS
combined covariates model, ensemble regression models, linear regression, spatiotemporal
analysis, taxi demand forecasting

1 | INTRODUCTION

In big cities like Chicago, public transportation plays a vital role as a means of conveyance for most of the population. However, the need for increased flexibility, reduced waiting times, and greater convenience has given rise to the popularity of other transportation modes, such as taxis. Being fast paced and easily available, taxi services provide customers with convenience and flexibility over time. With advances in technology, the ease of finding and booking a taxi over a smartphone has led to a drastic increase in demand for taxis. However, one of the most critical problems for taxi drivers is to identify hotspots from which they can collect an ample number of passengers. According to a survey conducted in Chicago, the demand for taxis varies substantially from one community area to another. For instance, in 2013–2016, the Near North Side community area had over 29 million taxi-pickup demands, whereas the Riverdale community area had only around 400 taxi-pickup demands.¹ With such a significant variation in taxi demand among different areas, there is an urgent need to utilize the massive amount of data generated by the global positioning system (GPS) to predict taxi demand in advance and to identify hotspots. From the considerable amount of data available via GPS, one can obtain the taxi demand

¹ https://data.cityofchicago.org/Transportation/Taxi-Trips/wrvz-psew

This is an Open Access article distributed under the term of Korea Open Government License (KOGL) Type 4: Source Indication + Commercial Use Prohibition +
Change Prohibition (http://www.kogl.or.kr/info/licenseTypeEn.do).
1225-6463/$ © 2022 ETRI

624 wileyonlinelibrary.com/journal/etrij ETRI Journal. 2022;44(4):624–640.


22337326, 2022, 4, Downloaded from https://onlinelibrary.wiley.com/doi/10.4218/etrij.2021-0123 by Sri Lanka National Access, Wiley Online Library on [14/03/2024]. See the Terms and Conditions (https://onlinelibrary.wiley.com/terms-and-conditions) on Wiley Online Library for rules of use; OA articles are governed by the applicable Creative Commons License
GANGRADE ET AL. 625

(number of taxi-booking requests) generated by passengers in each second, which can then be projected into a time-series model to predict the variation of demand as a function of time [1]. Thus, the demand for taxis varies with both space and time. However, the balance between the supply of passengers and the need for taxis not only depends on spatiotemporal features but also is influenced by other factors, such as sociodemographics (e.g., per-capita income, hardship index, unemployment, and age), the influence of neighborhood areas, and point-of-interest (POI) locations (e.g., restaurants, hospitals, colleges, pubs, and shopping malls). Various investigations [2–5] have been performed to explore the impact of these factors on taxi demand. Some have made use of sociodemographic data alone to predict the taxi flow between community areas. However, such data may not convey enough information about those areas, and they also remain static over a considerable period of time (e.g., census information is collected only once every decade). Other studies have employed only neighborhood influences to discern the pattern of taxi flow. This factor, however, may be inadequate, in the sense that nearby areas are likely to share similar sociodemographics, which limits the benefit of adding neighborhood influence in predicting taxi demand. In addition, the impact of factors that reflect the dynamics of the city, like crowd-generated POI data, may also play important roles in determining the demand. All these points lead to the conclusion that a robust framework can be devised only if we consider all these factors simultaneously in the spatiotemporal analysis of taxi demand [2].

The ability to predict taxi demand in advance can help significantly to alleviate the problem of inadequate taxi supply. By using an accurate time-series forecasting framework, demand for the next time interval can be predicted for a given area. If the predicted demand in that area decreases while the supply is high, we can safely conclude that many taxis will be running vacant because they have not been reallocated to appropriate areas (the supply can be predicted from pings generated by taxis during their journeys). These idle taxis should be redirected immediately to areas with high unmet demand, balancing the flow between demand and supply and reducing the overall idle driving time and avoidable fuel consumption. Conversely, oversupply can lead to traffic congestion. Hence, it is necessary to understand the variable needs of the population in both time and space, which can be achieved only through a robust demand-prediction model.

In this paper, we utilize the taxi-trip records for the city of Chicago to determine the spatiotemporal variations of taxi demand in three different stages, considering the variability of demands on weekdays and weekends separately in each stage. In the first stage, we perform a spatiotemporal analysis for each community area on the basis of influence propagated by demands in nearby areas. In the next stage, we include sociodemographic factors and analyze their impact on taxi demand. In the last stage, we utilize a framework that incorporates the previous two stages along with POI to study the combined influence of these factors on the spatiotemporal variations of taxi demand. In all three of these stages, we consider various state-of-the-art linear and ensemble models as base regressors. Ultimately, the outcome of this study can be utilized to identify hotspots across the city.

The main contributions of this paper include the following: (a) We have performed a systematic analysis of taxi-demand variation in different hours of the day for weekdays and weekends across all the months. (b) We have investigated the influence of covariates like sociodemographic factors and neighborhood influence. (c) We have proposed a combined model that considers the overall influence of all the factors together (neighborhood influence, sociodemographics, and POI). (d) We have proposed various data-transformation algorithms to make taxi-demand forecasting suitable for regression models.

The rest of this paper is organized as follows: Section 2 reviews previous work in the field of time-series forecasting and taxi-demand forecasting. Section 3 contains a description of the datasets used to train and test the models proposed in this work. We provide an overview of the models proposed for taxi-demand forecasting in Section 4. Section 5 contains the algorithm for transforming the raw taxi-booking dataset into an appropriate time-series taxi-demand dataset. Section 6 contains all the models proposed in this work to select a set of independent variables corresponding to each target. We compare the results from the proposed models and explain them in detail in Section 7. Conclusions drawn from this research are included in Section 8.

2 | LITERATURE REVIEW

With increasing travel demands, various approaches have been proposed to predict the transportation demand. Conventional forecasting approaches focus mainly on the temporal variation of the taxi demand. As these approaches depend on the time-series characteristics of taxi demand, they can be considered to be standard time-series problems that can be solved using traditional statistical and machine-learning algorithms. For instance, Yang and Gonzales [3] aggregated the raw data by census tract and hours of the day to extract valuable insights, and they employed count-regression models (a Poisson model, a quasi-Poisson model, and a negative binomial
model) to identify spatiotemporal differences between the demand and availability of taxi services. Faghih and others [6] analyzed taxi demand in New York City using the demand for other modes of transportation as well as weather conditions, and they presented a model for predicting taxi demand by combining a linear-regression model with a time-series model, which they termed linear regression with autoregressive moving-average errors. The combined model helped to reduce the number of variables, reducing the computational expense of the linear-regression model and achieving better R² values. Liu and others [7] identified hotspots and then predicted taxi demand in these hotspots using GPS data and environmental data based on three models: random forest (RF), ridge regression, and a combination forecasting model. Markaou and others [8] pooled time-series data for taxi records from New York City with textual data (comprising event information for the city) extracted from the web by screen scraping using application programming interfaces (APIs), and they predicted taxi demand from the combined data using linear-regression and Gaussian models. Antoniades and others [9] employed linear regression with model selection, the least absolute shrinkage and selection operator (LASSO), and RF to predict taxi fares and trip durations using New York City taxi-trip data. Safikhani and others [10] presented a generalized version of the space-time autoregressive moving-average model (which reduces the number of parameters compared with conventional time-series models), and they utilized the autoregressive part of the vector autoregressive (VAR) model to produce a generalized STAR model for forecasting the spatiotemporal variation of taxi demand in New York City. They also introduced a penalty function, which penalizes prediction parameters that are temporally or spatially distant. The proposed model, including the penalty function, outperformed conventional models such as STAR and VAR. However, these methods considered only the temporal features of the demand and did not focus on other potential factors, such as the dependence on neighborhood demand. In addition, these approaches fail to capture the nonlinear interdependence between spatial and temporal features.

To address the aforementioned drawbacks, various neural-network-based approaches have gained significant attention in recent years, as they consider both spatial and temporal features along with the nonlinear behavior of the demand. For instance, Xu and others [11] divided New York City into small areas and predicted the demand in each area by using a long short-term memory (LSTM) and a recurrent neural network with a mixture density network layered on top of it. Liu and others [12] proposed a convolutional recurrent network model for granulated taxi-demand prediction that combined a convolutional neural network (CNN) with a gated recurrent unit (GRU) to handle complex nonlinear spatiotemporal correlations. Shu and others [13] proposed a hybrid model that integrated a CNN and an LSTM to predict short-term taxi demand across different areas. Luo and others [4] proposed a multitask deep-learning (MTDL) model using LSTM as a neural unit to predict the need for taxis at the multisite level. The main goal of that paper was to improve the performance of the proposed model by using multiple hyperparameter-optimization methods, including a random search, a grid search, and Bayesian optimization. Ye and others [14] proposed a CoST-Net model to correlate spatial and temporal demand using a CNN and a heterogeneous LSTM, and they incorporated environmental features to predict multiple demands simultaneously. Vanichrujee and others [15] proposed an ensemble model based on the characteristics of LSTM, GRU, and XGBOOST to predict taxi demand in Bangkok City. They implemented the model in seven area functions, which denote prediction functions for each area, and they confirmed the prediction results by mapping the POIs with predicted demand in these areas. Liu and others [16] proposed several models, combining information from backpropagation from a neural network with extreme-gradient boosting to investigate the correlation between online taxi-hailing demand and overall taxi demand. They also introduced a data-driven forecasting approach to analyze the real-time prediction of online taxi-hailing demand. Chen and others [17] predicted taxi demand at a finer spatial level; that is, at the road-section level. To achieve this, they devised a prediction network that considered the local and global relationships between road sections. They established these two spatial relations using a graph CNN, whereas they mapped the temporal characteristics using an LSTM network. Quy and others [18] augmented an LSTM with demand knowledge from neighboring taxi stands, along with historical taxi-demand counts, to forecast the pickup demand for a given taxi stand. Guo [19] proposed a hybrid model, combining a CNN with a bidirectional LSTM and the attention mechanism in order to predict taxi demand. He termed this model a CNN–BiLSTM–Attention model.

Some approaches used a two-level machine-learning framework to forecast taxi demand. Kim and others [1] combined multivariate linear regression with an LSTM, enabling it to assess a quota system aimed at balancing the volumes of demand for regular taxis and other for-hire vehicles in New York City. Rodrigues and others [20] presented an analysis of spatiotemporal variations in short-term taxi demand in Lisbon city and studied how they are affected by weather conditions and POI. They selected a linear statistical model (an autoregressive integrated moving average [ARIMA] model) and a machine-learning model (an artificial neural network, or ANN) to forecast the taxi
TABLE 1 Dataset description

Dataset: Taxi-trip
Attributes: Trip ID, Taxi ID, Trip-start timestamp, Trip-end timestamp, Trip seconds, Trip miles, Pickup census tract, Dropoff census tract, Pickup community area, Dropoff community area, Fare, Tips, Tolls, Extras, Trip total, Payment type, Company, Pickup centroid latitude, Pickup centroid longitude, Pickup centroid location, Dropoff centroid latitude, Dropoff centroid longitude, and Dropoff centroid location.
Attributes considered in transformation phase/model: Trip-start timestamp, Trip-end timestamp, Pickup community area, Dropoff community area.

Dataset: Census
Attributes: Community area number, Community area name, Percentage of households in each community area below the poverty line (PI), Percentage of unemployed people (above 16 years of age) (PII), Percentage of crowded housing units (PIII), Percentage of people without a high-school diploma (above 25 years of age) (PIV), Dependency (people over 64 years and under 18 years of age) (PV), Per-capita income (PVI), and Hardship index (PVII).
Attributes considered in transformation phase/model: Community area number, Community area name, and Per-capita income are considered using a demographic similarity-based model.

Dataset: POI
Attributes: Venue ID, Venue Name, Contact, Location, Formatted address, Categories, Stats, hereNow, referralId (as a JSON object).
Attributes considered in transformation phase/model: Food and Restaurant, Arts and Entertainment, Outdoor and Recreation, College and education, Shops and Stores, Event Venues, Health and Hospital, Nightlife, Residence, and Office.

demand. Liu and others [21] utilized POI and GPS-trajectory information to model the spatial variation of taxi demand in Qingdao City using a geographically weighted regression model. They also studied how taxi demand is influenced by factors like socioeconomic, traffic, and land-use data. Zhou and others [22] proposed a method called ST-Vec in which they predicted taxi demand at vital destinations for a given region of New York City. The ST-Vec method maps regions with dense, low-dimensional vectors such that the vectors of more-likely destination regions will be nearer, and hence, the spatiotemporal relationships among zones can be obtained from the similarities between these vectors. Hu and others [5] initially studied the spatiotemporal distribution of job–housing–travel and the traveling characteristics of inhabitants. On this basis, they introduced a metric system for evaluating jobs–housing–taxi demand and a regional development-level index. Next, they constructed a coupling coordination degree model that makes use of the entropy-weight method to examine the coupling relationship between regional taxi demand and socioeconomic development. Faial and others [23] presented a data-stream mining framework to predict the taxi demand by adopting an approach to handle continuous data using batch and stream machine-learning algorithms. Moreover, Davis and others [24] approached the prediction of taxi demand as a clustering problem, and they proposed a multilevel clustering method to model taxi-demand density at various locations in Bengaluru city. Each location was addressed by six alphanumeric characters called a geohash, enclosing an area of 0.72 km². They first achieved clustering by analyzing the correlation between the nearest geohashes, and they expressed the demand for every geohash as a fraction of its total cluster demand to form percentage time-series data. They multiplied the prediction of the percentage time-series data by the predicted demand for the whole cluster to obtain the final prediction.

3 | DATASET DESCRIPTION

We utilized three different data sources for our study: Chicago taxi-trip records, demographic data, and POI data.

1. Chicago taxi-trip records: We collected this dataset from the official data portal managed by the Department of Business Affairs & Consumer Protection of the city of Chicago.² It contains information about taxi flow between different community areas, consisting of around 195 million rows of pickup and dropoff locations every 15 min for the past 4 years (2016–2019). The average number of records per year is around 49 million.
   ² https://data.cityofchicago.org/Transportation/Taxi-Trips/wrvz-psew#column-menu
2. Sociodemographic data: We collected this dataset from the U.S. Census Bureau, which collects data once every 10 years. We used the census data for the
year 2010 for our research. It has seven attributes, each corresponding to a specific sociodemographic index, and 77 rows, each corresponding to a community area of the city of Chicago.
3. POI data: We obtained this dataset using the FourSquare API³ to perform screen scraping. The data contains count information for major venues in Chicago city and comprises 10 attributes, each corresponding to a POI category, and 77 rows, each corresponding to a community area.

Table 1 describes the attributes corresponding to each dataset mentioned in this section.

4 | PROPOSED WORK

In Chicago City, there are 77 community areas (shown in Figure 1). The taxi demand in these areas is heterogeneous; that is, the demand is considerably high in some areas while it is quite low in others. This fluctuation in demand requires prompt predictions of the spatial characteristics of taxi demand to identify hotspots and thus redirect taxis into areas where the demand is high. Another problem is the dynamic nature of taxi demand in the temporal dimension. For example, demand may be high during working hours but comparatively low during the early morning and late at night (see Figure 2A). In addition, demand may vary between weekdays and weekends and across different months of the year (see Figure 2B,C). These figures illustrate the problem of predicting the temporal characteristics of taxi demand. To address this spatiotemporal heterogeneity, we therefore consider spatial resolution at the level of the 77 community areas. Furthermore, the prediction for each area is done on an hourly basis and then aggregated monthly, which defines the temporal resolution of the demand prediction (see Figure 2B). We consider monthly predictions, because a yearly aggregation excludes seasonality effects; that is, the variation of demand in different seasons. Also, to capture the seasonality effect due to differences in demand between weekdays and weekends, we analyzed those demand patterns separately. We expect this spatiotemporal resolution of the prediction results to aid in identifying hotspots.

FIGURE 1 Map of Chicago illustrating community areas

Because the demand varies in both space and time, it induces dynamism in the pattern of hotspots; that is, hotspots vary from month to month and year to year. In this paper, we propose various dynamic demand-prediction models based on applications of various statistical and ensemble regression models to forecast the taxi demand. These models can be used to study the demand pattern further, including both the temporal features and spatial characteristics, eventually identifying the dynamic pattern of the hotspots. A detailed analysis of each model is given in Section 6.

For a given community area, the taxi demand may also be affected by the distance of the dropoff destination from the pickup location. If the destination is distant, a passenger may prefer to travel by another convenient mode of transport, like the metro or buses. Thus, neighborhood proximity may play an essential role in determining the demand in a given community area. To incorporate the concept of neighborhood influence, we propose a proximity-based prediction model, which is discussed in Section 6.2.

It is also possible that the taxi-demand pattern may be influenced by socioeconomic factors such as income, household status, employment status, and so forth. For instance, residents of areas with high per-capita income may prefer taxis as a primary mode of public transportation more than residents of areas with lower per-capita income. Conversely, residents of areas where the average household status is poor may prefer cheaper modes, such as buses or the metro. To address such socioeconomic interdependence, we propose a sociodemographic-based prediction model, which is discussed in Section 6.3.

Another consideration is the impact of frequently visited venues on taxi demand. The flow of taxis may be greater toward areas where there are more POI locations. If a given community area has a relatively more extensive collection of educational institutes, medical facilities, or recreational centers, the demand may be very large in this particular area. Therefore, we have

³ https://developer.foursquare.com/
5 | DATA PREPROCESSING

The official data portal of Chicago City contains taxi-trip


records from the year 2013 to the current year, and we
extracted from it the most recent 4-year data (from 2016
to 2019) for this study. Out of the 23 attributes present in
this dataset (see Table 1), we considered the attributes
Trip-start timestamp and Pickup community area for the
transformation phase. Initially, we dropped all rows that
had missing values for Pickup community area. This raw
data consisted of taxi flow records between community
areas for every interval of 15 min, which we aggregated
on an hourly basis into the dataframe intervalDemand.
We converted the resulting data into a demand-count
matrix (demandCountMatrix), the columns of which
represented the 77 community areas, while the rows rep-
resented the taxi-demand count for each interval of the
day for the given year. To implement the targeted trans-
formation, we first stored each interval of the day into a
dictionary (intervalDictionary) as keys and initialized
the corresponding values to 0. For every community
area, we then iterated over the intervalDemand and
mapped the taxi demand into the matching interval of
dictionary values. In this way, we obtained the final
demand-count matrix, of size t × c, where t is the total
number of intervals in a day multiplied by total days in a
year and c is the total number of community areas
(in this case, 77). The intervals for which there was no
value for taxi demand remained 0 as initialized. These
steps are summarized in the demand-count data-
transformation Algorithm (Algorithm 1). The quantity
CA_list is the list of all community areas (CAs), and
CA_Count is the taxi demand in that CA during the
given interval.
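
The demand-count transformation described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reproduction of Algorithm 1: the function name `build_demand_count_matrix`, the record layout (pairs of a trip-start timestamp and a pickup community area, with `None` marking a missing area), and the use of a `Counter` in place of the interval dictionary are all choices made for this example.

```python
from collections import Counter
from datetime import datetime, timedelta


def build_demand_count_matrix(records, year, n_areas=77):
    """Build an hourly demand-count matrix of size t x c for one year.

    `records` is an iterable of (trip_start_timestamp, pickup_community_area)
    pairs. This layout is a hypothetical stand-in for the raw taxi-trip rows.
    """
    # Enumerate every hourly interval of the year; each one implicitly
    # starts at a demand of 0, like the initialized interval dictionary.
    intervals = []
    current = datetime(year, 1, 1)
    end = datetime(year + 1, 1, 1)
    while current < end:
        intervals.append(current)
        current += timedelta(hours=1)

    # Aggregate the raw records on an hourly basis, per community area.
    interval_demand = Counter()
    for timestamp, area in records:
        if area is None:  # drop rows with a missing pickup community area
            continue
        if timestamp.year != year:  # ignore trips outside the given year
            continue
        hour = timestamp.replace(minute=0, second=0, microsecond=0)
        interval_demand[(hour, int(area))] += 1

    # Rows are intervals and columns are community areas 1..n_areas.
    # A Counter returns 0 for keys it has never seen, so intervals
    # without any demand remain 0.
    matrix = [
        [interval_demand[(hour, area)] for area in range(1, n_areas + 1)]
        for hour in intervals
    ]
    return intervals, matrix
```

For a leap year such as 2016, the returned matrix has 366 × 24 = 8784 rows and 77 columns, which matches the t × c shape described in the text.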
FIGURE 2 Graphs illustrating the average taxi demand as functions of the indicated parameter: (A) hour of day, (B) month, and (C) days of week

mapped the POI effect on the spatial variation of taxi demand, which is dealt with in Section 6.4.
The final goal of this research is to consider all the above influencing factors simultaneously and to study whether the aggregation of these factors in forecasting demand performs better than considering the individual factors one at a time. To investigate this, we propose a combined model, which is described in Section 6.5.

For the POI data, we first extracted the data using the FourSquare API, where the URL that is to be requested has a predefined format and contains some parameter values. Some of these parameter values include the latitude and longitude of the community area, the radius covered by that CA, and the categories of the venues that need to be extracted. The response consists of JavaScript Object Notation (JSON) objects, each representing a venue of the queried category. These objects are then screen-scraped, and the total count corresponding to a specific venue for a given community area is stored in the dataframe POI_Count, of size |CA| × |Category|, where |CA| is the total number of community areas, and |Category| is the number of categories of POI venues in Chicago (Food and Restaurant, Arts and Entertainment, Outdoor and Recreation, College and education, Shops and Stores, Event Venues, Health and Hospital, Nightlife, Residence, and Office). These steps are summarized in the POI transformation algorithm (Algorithm 2). Because
the sociodemographic data obtained were already in the desired format, we performed no transformation operations on them.

6 | TAXI-DEMAND FORECASTING MODELS

Forecasting of taxi demand is done by utilizing taxi-booking data generated in the past, which consists of booking records for the various community areas, to predict the number of taxi demands that might happen in the future in those particular areas. This forecasting problem is in fact a multivariate regression problem, where the selection of independent variables is critical. The pivotal decision is whether to consider all the CAs as independent variables or to consider an intelligently selected subset that is highly correlated with a target variable. In this section, we propose several models based on the interdependence of a targeted community area with its surrounding areas and the areas with similar socioeconomic and POI indexes to pursue the latter option.

We term the first model the neighborhood-proximity-based model (NPBM). It is based on the concept that the taxi demand in a given community area can be forecasted precisely if we consider the areas located in its vicinity as influencing factors. We discuss this spatial-characteristics-based model in Section 6.2. We term the second model the sociodemographics-influence-based model (SIBM). It makes use of the concept that, if we consider the set of CAs that have the most similar sociodemographic parameters as independent variables for the target area, the demand can be predicted more effectively. We elucidate this model in Section 6.3. We term the third model a POI-based model. Its set of independent variables is determined by the number of POI venues in the CAs. This model also can be used to identify hotspots by mapping the POI locations in CAs together with the average daily demand in those areas. We discuss this model in Section 6.4. We term the fourth model the combined covariates model (CCM); in it, the independent variables are selected by taking into account the characteristics of all the previous models. We discuss this combined-framework model in Section 6.5. In a broader sense, the algorithm for a taxi-demand forecasting model works by taking each community area once as a target variable and applying the models mentioned above to determine the vector of independent variables for it. We ultimately predict the taxi-demand count in a target community area by employing various state-of-the-art statistical and ensemble regression models as base regressors.

6.1 | Base regression models used by the proposed models

In this paper, we utilize various linear and ensemble base regressors to predict the taxi-demand count in a given community area. In the regression equation, the taxi count for the target community area $y$ can be expressed as a linear combination of independent variables (which are obtained by applying the algorithms described in Sections 6.2, 6.3, 6.4, and 6.5) and a weight vector. The regression equation can thus be written as

$$y = w_0 + w_1 x_1 + w_2 x_2 + \dots + w_n x_n + \epsilon, \tag{1}$$

where $[x_1, x_2, \dots, x_n]$ is the vector of independent variables, $[w_0, w_1, w_2, \dots, w_n]$ is the vector of weights corresponding to each independent variable, $y$ is the value that needs to be predicted, and $\epsilon$ is the residual error term.

6.1.1 | Linear models

1. Ordinary least squares (OLS): The objective of OLS is to find the model parameters that minimize the sum of squared differences between the observed targets and the targets predicted by the linear approximation:

$$\min_{\theta} \sum_{n=1}^{N} \epsilon_n^2, \tag{2}$$
22337326, 2022, 4, Downloaded from https://onlinelibrary.wiley.com/doi/10.4218/etrij.2021-0123 by Sri Lanka National Access, Wiley Online Library on [14/03/2024]. See the Terms and Conditions (https://onlinelibrary.wiley.com/terms-and-conditions) on Wiley Online Library for rules of use; OA articles are governed by the applicable Creative Commons License
GANGRADE ET AL. 631

where θ is a vector of coefficients and ϵ represents the error term.
2. LASSO: The main objective of LASSO is to estimate sparse coefficients by performing variable selection and regularization:

$$\min_{\theta} \sum_{i=1}^{N} \epsilon_i^2 + \lambda \sum_{i=1}^{k} |\theta_i|, \tag{3}$$

where θ is a vector of coefficients, ϵ represents the error term, and λ represents the regularization parameter.
3. LARS: LARS adds a penalty to the loss function during the training phase itself; thus, it does not require any hyperparameters, and it is therefore an efficient way of fitting a LASSO model.
4. Bayesian ridge (BR): The aim of BR is to find the posterior distribution of model parameters. BR places a spherical Gaussian prior on the weights, which can be written as

$$p(\omega \mid \lambda) = \mathcal{N}(\omega \mid 0, \lambda^{-1} I_p), \tag{4}$$

where ω is the weight vector and λ is the precision of the weights, for which a Gamma prior is assumed.
5. Ridge: In ridge-regression algorithms, an additional term is added to the OLS equation to minimize the penalized residual sum of squares:

$$\min_{\theta} \sum_{i=1}^{N} \epsilon_i^2 + \lambda \sum_{i=1}^{k} \theta_i^2, \tag{5}$$

where θ is a vector of coefficients, ϵ represents the error term, and λ represents the regularization parameter.

6.1.2 | Ensemble models

1. Extra trees (ET): ET adds a layer of randomness by choosing random thresholds for every candidate feature instead of selecting those thresholds that are most discriminating. The best of these randomly generated thresholds is picked as the splitting rule.
2. RF: RF works by randomly choosing a bootstrap sample with replacement from the training dataset. This sample is then used to build each tree of the RF (the first layer of randomness). During the construction of the tree, each node is split, and the best split is decided either from all input features or from a random subset of input features (the second layer of randomness).
3. Bagging method: The bagging method chooses random subsets from the training set of data and builds instances of black-box estimators on these subsets. The individual predictions are then aggregated to form the final prediction.
4. Stacking: Stacking determines the best combination of predicted outputs from two or more base machine-learning algorithms by using a meta-learning algorithm. It stacks the outputs of the individual estimators, and the stacked output is then used as input to the final estimator.
5. Voting: A voting ensemble determines the final prediction on the basis of the average of individual predictions calculated from various regression models.

6.2 | NPBM

As discussed in Section 1, it is necessary to consider neighborhood influence in forecasting the taxi demand for a given community area. To achieve this goal, we propose a neighborhood-proximity-based regression model in which the independent variables corresponding to a target variable are selected directly from its immediate neighbors (see Algorithm 3). Let t_CA be the target variable representing the community area for which the demand is to be predicted temporally. Then, i_CA will be a vector of independent variables consisting of community areas that share boundaries with t_CA. We plotted the heatmap shown in Figure 3 to find the correlation of a given community area with other areas. This plot shows that the areas located in the immediate neighborhood of a given community area are more correlated with it than those far away, which supports the use of such a model for selecting the independent variables.
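The neighbor-based selection used by the NPBM, in which i_CA consists of the areas that share a boundary with t_CA, can be sketched as below. This is an illustrative reading, not Algorithm 3 itself: the area IDs and boundary pairs are made up, and `build_adjacency` and `npbm_features` are hypothetical helper names. In practice the boundary pairs would be derived from the community-area polygons.

```python
from collections import defaultdict


def build_adjacency(boundary_pairs):
    """Turn (area_a, area_b) shared-boundary pairs into a symmetric map."""
    adjacency = defaultdict(set)
    for a, b in boundary_pairs:
        adjacency[a].add(b)
        adjacency[b].add(a)
    return dict(adjacency)


def npbm_features(t_ca, adjacency):
    """Return i_CA: the areas sharing a boundary with the target t_CA."""
    return sorted(adjacency.get(t_ca, set()))


# Hypothetical boundary pairs for four made-up community-area IDs.
pairs = [(1, 2), (1, 3), (2, 3), (2, 4)]
adjacency = build_adjacency(pairs)
i_ca = npbm_features(2, adjacency)
```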

FIGURE 3 Figure representing the CAs (listed by ID) and the correlation matrix for taxi demands across all CAs in Chicago. Values of the correlation matrix are represented by the color bar at the right

We tested this model against the taxi demand for the year 2019, and the R² values for this model for the indicated base regressors are shown in Table 2. The taxi demands are predicted on an hourly basis to meet real-world requirements, but the tabulated results are monthly aggregates. The taxi demands are filtered separately for weekdays and weekends in order to capture the variations in the spatiotemporal patterns of taxi demands due to changes in people's routines on weekends. Note that we have separated the results for the linear models and the ensemble models in Table 2 in order to show that the ensemble models provide better predictions (see the values shown in bold in Table 2) than do the linear models. The ensemble models provide better predictions because each sample is processed by several diverse models whose outputs are combined. The average R² values for the linear and ensemble models tested against taxi demand for each month of the year 2018 are shown in Figure 4. The ensemble models clearly outperform the linear models.

6.3 | SIBM

Population-based factors may influence the taxi demand, as described in Section 1. Past investigations [2–5,7,8,20] have overlooked these socioeconomic-based factors, which can directly or indirectly affect taxi demand. In this model, we implemented [-l, +l] regressors, where a sliding window of length 2l + 1 (in our case, l = 3) is moved across every community area (see Algorithm 4). For each shift of the window, the median [the (l + 1)st area] is treated as the target community area, and the corresponding independent variables for that target area lie at a distance l above and l below the target community area based on demographic-index values. As a result, the set of areas lying above the target variable will have l closest values smaller than it, whereas areas lying below the target variable will have l closest values that are higher than it. The vector of independent variables corresponding to a given target community area is obtained in this way. Figure 5A–G shows polygonal density maps for the indicated sociodemographic parameters.

The correlation of average hourly taxi demand with these demographic factors is summarized in Table 3. The value of the correlation is positive for per-capita income whereas it is negative for the other factors. This confirms our hypothesis, mentioned in Section 1, that taxi demand can be expected to be greater in areas where per-capita income is higher and where other factors (such as percent households below poverty, percent aged 16+ unemployed, hardship index, and others) are lower, and vice versa.

The R² values obtained after testing the proposed model for predicting taxi demand for the year 2019 for various base regressors are shown in Table 4. Here, also, ensemble-based regressors have better R² values than do linear ones. Note, however, that a lower R² value for some months does not always mean poor generalization of the model; instead, it may be due to a fall in taxi demand in certain months. Figure 6 shows that, just like the NPBM, in the SIBM model, the R² values are higher

TABLE 2 R² values for the NPBM for the year 2019

Linear Ensemble

Months OLS LASSO LARS BR Ridge ET RF Bagging Stacking Voting


January Weekdays 0.4156 0.4106 0.2328 0.4160 0.4156 0.7192 0.5689 0.7158 0.7135 0.7158
Weekends 0.5496 0.5125 0.4339 0.5512 0.5496 0.6789 0.5445 0.6570 0.6542 0.6570
February Weekdays 0.5050 0.4833 0.3478 0.5051 0.5050 0.6834 0.6218 0.6643 0.6617 0.6642
Weekends 0.5556 0.4977 0.3531 0.5200 0.5556 0.6810 0.5786 0.6903 0.6877 0.6902
March Weekdays 0.2686 0.2835 0.1966 0.2703 0.2686 0.7198 0.4899 0.6477 0.6473 0.6474
Weekends 0.5536 0.6637 0.3319 0.5548 0.5536 0.7156 0.6060 0.7196 0.7177 0.7188
April Weekdays 0.3266 0.3509 0.0947 0.3278 0.3266 0.6958 0.5881 0.6743 0.6742 0.6748
Weekends 0.1948 0.2030 0.2084 0.1964 0.1948 0.7513 0.5819 0.6816 0.6774 0.6802
May Weekdays 0.4561 0.4388 0.2030 0.4561 0.4561 0.6335 0.5790 0.6311 0.6291 0.6311
Weekends 0.5514 0.5849 0.4175 0.4148 0.4143 0.6769 0.5622 0.6309 0.6288 0.6305
June Weekdays 0.2562 0.2562 0.1499 0.2567 0.2562 0.7077 0.6080 0.7005 0.6999 0.7011
Weekends 0.3322 0.3636 0.1452 0.3335 0.3322 0.6491 0.5607 0.6821 0.6797 0.6809
July Weekdays 0.4355 0.4221 0.2718 0.4356 0.4355 0.6148 0.5951 0.6043 0.6009 0.6040
Weekends 0.5085 0.4595 0.4974 0.5088 0.5085 0.6292 0.5167 0.6189 0.6161 0.6183
August Weekdays 0.4248 0.4082 0.3000 0.4249 0.4248 0.6683 0.6036 0.6468 0.6443 0.6470
Weekends 0.5279 0.4626 0.4712 0.5278 0.5279 0.6034 0.5146 0.6258 0.6232 0.6257
September Weekdays 0.3732 0.3560 0.1616 0.3731 0.3732 0.6386 0.6291 0.6274 0.6264 0.6275
Weekends 0.5668 0.4985 0.5585 0.5667 0.5668 0.6622 0.5448 0.6544 0.6527 0.6544
October Weekdays 0.3436 0.3475 0.1286 0.3440 0.3436 0.6715 0.5691 0.6669 0.6660 0.6674
Weekends 0.1716 0.1991 0.1442 0.1749 0.1716 0.6284 0.4240 0.6536 0.6498 0.6542
November Weekdays 0.4190 0.4114 0.2994 0.4194 0.4190 0.6128 0.5923 0.6289 0.6278 0.6285
Weekends 0.3857 0.3977 0.3988 0.3866 0.3857 0.6187 0.4731 0.5984 0.5963 0.5985
December Weekdays 0.5288 0.4354 0.3974 0.4322 0.4321 0.6756 0.5853 0.6895 0.6889 0.6899
Weekends 0.3345 0.4911 0.3783 0.4727 0.4768 0.6757 0.4223 0.5893 0.5833 0.5901

when ensemble base regressors are used than when linear regressors are used, when tested against taxi demand for the year 2018.

FIGURE 4 Graph comparing R² values for linear and ensemble models for the NPBM

6.4 | POI-based model

To test the hypothesis that the taxi flow may be inclined more toward areas with comparatively larger numbers of POI locations, we calculated the Pearson correlation coefficient for various POI categories with the hourly average taxi demand, as shown in Table 5. Note that the correlation of each POI category with the average hourly taxi demand is appreciable. Moreover, CAs with more POI locations have higher correlations than do areas with relatively fewer POI locations. Figure 7A–I represents polygonal density maps for the indicated POI categories. Note that areas having more POIs are represented by darker shading, as indicated in the legends. We conclude that this model produces a uniform set of independent variables for all target CAs and that the top K POI locations for any particular category remain uniform for every community area (see Algorithm 5).
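The correlation test described for the POI categories can be illustrated with a small, dependency-free sketch. The numbers below are invented for five hypothetical areas and are not taken from Table 5; `pearson_r`, `poi_counts`, and `avg_hourly_demand` are illustrative names. The sketch computes only the coefficient; a p value would need an additional significance test (for example, `scipy.stats.pearsonr` returns both).

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up values: average hourly demand and POI counts per area.
avg_hourly_demand = [12.0, 30.0, 45.0, 8.0, 60.0]
poi_counts = {
    "event_venues": [3, 9, 14, 2, 20],
    "offices": [40, 10, 35, 25, 30],
}

# Rank the POI categories by their correlation with demand, strongest first.
ranking = sorted(
    poi_counts,
    key=lambda category: pearson_r(poi_counts[category], avg_hourly_demand),
    reverse=True,
)
```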

FIGURE 5 Polygonal density maps representing hotspots for the indicated sociodemographic parameters: (A) Hardship index,
(B) Percentage of unemployed people (above 16 years age), (C) Dependency (people over 64 years and under 18 years age), (D) Percentage of
community areas below the poverty line, (E) Percentage of crowded housing units, (F) Percentage of people without high-school diploma
(above 25 years age), and (G) Per-capita income
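The sliding-window selection of the SIBM (Section 6.3), which ranks the areas by one of the sociodemographic indices mapped in Figure 5, can be sketched as follows. This is one possible reading rather than Algorithm 4 itself: the index values are invented, `sibm_features` is a hypothetical name, and truncating the window at the two ends of the ranking is an assumption, since the edge behavior is not described in this section.

```python
def sibm_features(index_by_area, target, l=3):
    """Pick up to l areas just below and l just above the target's index.

    index_by_area: mapping of area ID to a sociodemographic index value.
    target: the area treated as the median of a window of length 2l + 1.
    """
    ranked = sorted(index_by_area, key=index_by_area.get)
    pos = ranked.index(target)
    lower = ranked[max(0, pos - l):pos]       # l closest smaller values
    higher = ranked[pos + 1:pos + 1 + l]      # l closest larger values
    return lower + higher


# Invented hardship-index values for nine hypothetical areas.
hardship = {"A": 10, "B": 20, "C": 30, "D": 40, "E": 50,
            "F": 60, "G": 70, "H": 80, "I": 90}

window_for_e = sibm_features(hardship, "E", l=3)
window_for_a = sibm_features(hardship, "A", l=3)
```

With these values, area E sits in the middle of the ranking and receives three neighbors on each side, while area A, at the lowest end, receives only the three areas ranked above it.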

TABLE 3 Correlation coefficient between demographic parameters and taxi demand

Features   Pearson correlation   p value

PI         0.1602                0.1639
PII        0.2853                0.0118
PIII       0.1889                0.0998
PIV        0.3221                0.0042
PV         0.5190                1.32e-06
PVI        0.6574                8.30e-11
PVII       0.3762                0.0007

6.5 | CCM

Spatiotemporal modeling with either sociodemographic or neighborhood-proximity data alone may not convey enough information about the flow of and demand for taxis. This is due to the static nature of demographic data over a long span of time and the similar demographic indices shared among nearby CAs, which limits the benefit of adding neighborhood influence, as explained in Section 1. We address these issues by proposing a model that simultaneously combines the aforementioned covariates (sociodemographic and neighborhood influence) along with one more covariate, crowd-generated POI, which reflects the dynamics of the city. We term this model the CCM, and we define the set of independent variables corresponding to a target as the union of the independent variables corresponding to that target as suggested by the NPBM, SIBM, and POI-based model (see Algorithm 6).

As shown in Figure 8, this model also achieves a high R² value for the ensemble models, as compared with the linear ones, when tested against the taxi demand for the year 2018. The R² values obtained after testing the proposed model for predicting taxi demand for the year 2019

TABLE 4 R² values for the SIBM for the year 2019

Linear Ensemble

Months OLS LASSO LARS BR Ridge ET RF Bagging Stacking Voting


January Weekdays 0.5498 0.4083 0.2445 0.5498 0.5498 0.6786 0.6083 0.6545 0.6508 0.6545
Weekends 0.4131 0.2820 0.3479 0.4097 0.4131 0.6219 0.5341 0.5420 0.5376 0.5420
February Weekdays 0.5629 0.4635 0.3774 0.5632 0.5629 0.6882 0.6218 0.6324 0.6513 0.6321
Weekends 0.5024 0.4485 0.4882 0.5010 0.5024 0.6052 0.5582 0.5264 0.5221 0.5262
March Weekdays 0.4892 0.3561 0.1934 0.4895 0.4892 0.6565 0.5574 0.6357 0.7022 0.6351
Weekends 0.6147 0.6145 0.6146 0.6151 0.6147 0.7227 0.6226 0.7536 0.7534 0.7539
April Weekdays 0.4415 0.3822 0.1219 0.4418 0.4415 0.6520 0.6403 0.6505 0.6496 0.6508
Weekends 0.2043 0.2020 0.2031 0.2045 0.2043 0.6749 0.6282 0.6667 0.6642 0.6663
May Weekdays 0.6192 0.5451 0.3623 0.6192 0.6192 0.6327 0.5821 0.6209 0.6185 0.6209
Weekends 0.5755 0.5689 0.5724 0.5757 0.5755 0.6611 0.6187 0.6508 0.6495 0.6504
June Weekdays 0.3502 0.2912 0.1312 0.3502 0.3502 0.6452 0.6701 0.6589 0.6562 0.6592
Weekends 0.6965 0.5220 0.2331 0.6965 0.6965 0.6702 0.6702 0.7221 0.7216 0.7220
July Weekdays 0.6568 0.4972 0.2919 0.6568 0.6568 0.6485 0.6245 0.6567 0.6534 0.6564
Weekends 0.5742 0.5152 0.5675 0.5741 0.5742 0.6005 0.5886 0.6017 0.5988 0.6015
August Weekdays 0.4997 0.3908 0.2356 0.4996 0.4997 0.6433 0.6047 0.6267 0.6221 0.6264
Weekends 0.5751 0.4640 0.5038 0.5750 0.5751 0.6071 0.5759 0.5492 0.5469 0.5494
September Weekdays 0.6433 0.5062 0.3525 0.6432 0.6433 0.6738 0.6614 0.6681 0.6671 0.6682
Weekends 0.4330 0.4253 0.4278 0.4328 0.4330 0.5433 0.5934 0.5865 0.5855 0.5868
October Weekdays 0.4484 0.3188 0.1234 0.4485 0.4484 0.7189 0.6291 0.6803 0.6768 0.6803
Weekends 0.1102 0.1080 0.1105 0.1105 0.1102 0.5273 0.5918 0.5888 0.5862 0.5889
November Weekdays 0.5216 0.4458 0.3242 0.5214 0.5216 0.6007 0.6169 0.6332 0.6310 0.6330
Weekends 0.2191 0.2026 0.1566 0.2185 0.2191 0.5505 0.5745 0.5594 0.5589 0.5596
December Weekdays 0.4206 0.4465 0.3263 0.5509 0.5506 0.6598 0.5970 0.6487 0.7254 0.6487
Weekends 0.1999 0.2745 0.2700 0.2001 0.1999 0.5900 0.5125 0.5560 0.5559 0.5558

TABLE 5 Correlation coefficients between POIs and taxi demand

Features                 Pearson correlation   p value

Food and Restaurant      0.3430                0.0022
Arts and Entertainment   0.7018                1.438e-12
Outdoor and Recreation   0.6025                7.941e-09
College and Education    0.6455                2.327e-10
Shops and Stores         0.4707                1.637e-05
Event Venues             0.7751                2.952e-16
Health and Hospital      0.5548                1.813e-07
Nightlife                0.4983                4.183e-06
Office                   0.3376                0.0026

FIGURE 6 Graph comparing R² values for the linear and ensemble models for the SIBM

for various base regressors are shown in Table 6. Here, also, ensemble-based regressors have better R² values than do linear ones.

FIGURE 7 Polygonal density maps representing hotspots for the indicated points of interest: (A) Food and Restaurants, (B) Arts and Entertainment, (C) Outdoor and Recreation, (D) College and Education, (E) Event Venues, (F) Nightlife, (G) Health and Hospital, (H) Shops and Stores, and (I) Office

FIGURE 8 Graph comparing the R² values for the CCM

7 | RESULTS AND DISCUSSION

7.1 | Performance evaluation parameters

We evaluated all the proposed models based on state-of-the-art regressors using the following parameters:

1. Mean absolute error (MAE): The MAE is the mean of the absolute differences between the target and the predicted values. Thus, it measures the average magnitude of the errors in a set of predictions without considering their directions.
2. R² value: The R-squared (R²) value is a statistical measure that represents the proportion of the variance of a dependent variable that can be described in terms of an independent variable or variables in a regression model.

7.2 | Discussion

Tables 2 and 4 report the R² values for the NPBM and SIBM predictions, respectively, when they are tested against data for the year 2019. The bold values in each table correspond to the best-performing regression model among all the base regressors for each month, with weekdays and weekends treated separately. These tables show that the ensemble regressors outperform the linear regressors for both proposed models (NPBM and SIBM), and this observation is consistent throughout almost all the months. For the CCM also, the same type of tabulation (see Table 6) shows that the ensemble regressors give better results than the linear regressors. Comparison of the R² values for both the linear and ensemble regressors shows that the CCM outperforms both the NPBM and the SIBM (see Figure 9). This is because the CCM combines several characteristics (neighborhood influences and sociodemographic characteristics along with POI data) to perform the forecasting. However, note that a comparison of the NPBM with the SIBM depends on the type of demand, including taxi type and distance traveled, which is outside the scope of this investigation. Hence, we cannot determine with confidence which of these two models performed better.
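For concreteness, the two evaluation parameters of Section 7.1 can be computed as in the sketch below, with MAE taken as the mean of the absolute errors and R² as one minus the ratio of the residual to the total sum of squares. The function names and the four sample values are illustrative only and are not results from this study.

```python
def mean_absolute_error(y_true, y_pred):
    """Average magnitude of the errors, ignoring their direction."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score(y_true, y_pred):
    """Share of the variance of y_true explained by the predictions."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Invented hourly counts and predictions for a single area.
observed = [10, 20, 30, 40]
predicted = [12, 18, 33, 37]

mae = mean_absolute_error(observed, predicted)   # (2 + 2 + 3 + 3) / 4 = 2.5
r2 = r2_score(observed, predicted)               # 1 - 26 / 500 = 0.948
```

A lower MAE and a higher R² both indicate a better fit; R² can be negative when the predictions are worse than always predicting the mean.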

TABLE 6 R² values for the CCM for the year 2019

Linear Ensemble

Months OLS LASSO LARS BR Ridge ET RF Bagging Stacking Voting


January Weekdays 0.5628 0.5123 0.3152 0.5783 0.5782 0.7569 0.6999 0.6843 0.6808 0.6843
Weekends 0.4331 0.3950 0.4689 0.4587 0.4581 0.7924 0.6313 0.6440 0.6356 0.6440
February Weekdays 0.5633 0.6605 0.4421 0.5843 0.5840 0.7651 0.6218 0.6524 0.6523 0.6521
Weekends 0.5224 0.4998 0.4168 0.5535 0.5565 0.7612 0.5631 0.5734 0.5725 0.5732
March Weekdays 0.4884 0.4219 0.3316 0.5875 0.5872 0.7732 0.6407 0.6877 0.7132 0.6872
Weekends 0.6444 0.7012 0.7055 0.6456 0.6451 0.8251 0.7431 0.7844 0.7643 0.7842
April Weekdays 0.4536 0.4742 0.1996 0.4838 0.4835 0.7961 0.7093 0.7148 0.7146 0.7145
Weekends 0.2132 0.3422 0.3130 0.2555 0.2553 0.7514 0.7162 0.7255 0.7252 0.7256
May Weekdays 0.7120 0.6952 0.4285 0.6583 0.6580 0.7146 0.6051 0.6333 0.6267 0.6333
Weekends 0.5956 0.6863 0.6135 0.5997 0.5995 0.7096 0.6895 0.6923 0.6920 0.6922
June Weekdays 0.4513 0.3835 0.1516 0.3822 0.3819 0.7501 0.7000 0.7211 0.7210 0.7201
Weekends 0.7464 0.7024 0.2678 0.7065 0.7065 0.7870 0.7118 0.7555 0.7541 0.7554
July Weekdays 0.6888 0.5004 0.3085 0.6875 0.6878 0.7456 0.6919 0.7007 0.7005 0.7006
Weekends 0.5952 0.5931 0.6457 0.5991 0.5993 0.7541 0.6708 0.6837 0.6801 0.6837
August Weekdays 0.5232 0.4617 0.3529 0.5223 0.5224 0.7265 0.7146 0.7367 0.7297 0.7264
Weekends 0.5963 0.5813 0.6178 0.5844 0.5845 0.7570 0.5989 0.6012 0.6009 0.6011
September Weekdays 0.6646 0.5425 0.4935 0.6842 0.6853 0.7049 0.7014 0.7223 0.7107 0.7222
Weekends 0.4383 0.5104 0.4789 0.4538 0.4540 0.6311 0.6335 0.6265 0.6147 0.6264
October Weekdays 0.4799 0.3999 0.2734 0.4775 0.4774 0.8969 0.7151 0.7203 0.7202 0.7203
Weekends 0.2112 0.2103 0.2541 0.1515 0.1511 0.6374 0.6077 0.6134 0.6123 0.6134
November Weekdays 0.5536 0.5841 0.4289 0.5524 0.5527 0.6529 0.6942 0.7119 0.7109 0.7118
Weekends 0.2475 0.2624 0.3156 0.3075 0.3081 0.6425 0.6446 0.6554 0.6519 0.6555
December Weekdays 0.4446 0.5646 0.3889 0.5828 0.5826 0.7058 0.6096 0.6847 0.7260 0.6847
Weekends 0.2799 0.3548 0.3500 0.2124 0.2101 0.7001 0.6144 0.6101 0.6095 0.6108

FIGURE 9 Graph comparing the average R2 values for (A) all linear models and (B) all ensemble models
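The CCM evaluated above defines its independent variables as the union of the sets suggested by the NPBM, the SIBM, and the POI-based model (Section 6.5). A minimal sketch of that union, under the assumption that each model's selection is available as a mapping from target area to a list of area IDs, is shown below; `ccm_features` and the sample IDs are hypothetical.

```python
def ccm_features(target, npbm, sibm, poi):
    """Union of the independent variables proposed by the three models."""
    combined = set(npbm[target]) | set(sibm[target]) | set(poi[target])
    combined.discard(target)  # the target never predicts itself
    return sorted(combined)


# Made-up selections for a target area with ID 5.
npbm_sets = {5: [4, 6]}
sibm_sets = {5: [6, 9, 12]}
poi_sets = {5: [1, 9]}

ccm_set = ccm_features(5, npbm_sets, sibm_sets, poi_sets)
```

Taking a set union means that an area proposed by more than one model, such as 6 or 9 here, enters the regression only once.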

FIGURE 10 Box plot comparing MAE values for linear and ensemble regressors in (A) the NPBM for the year 2018, (B) the NPBM for the year 2019, (C) the SIBM for the year 2018, (D) the SIBM for the year 2019, (E) the CCM for the year 2018, and (F) the CCM for the year 2019

FIGURE 11 Graph comparing the average R² values for (A) the indicated linear models and (B) the indicated ensemble models

The predictions for the year 2019 are made on the basis of taxi-demand data for the preceding three consecutive years.

We also analyzed the predictions on the basis of MAE scores for the years 2018 and 2019. The results obtained are summarized in box plots (see Figure 10), which show that the upper bound and interquartile range corresponding to ensemble models are smaller than they are for the linear models. Also, there are significantly more outliers for the linear models, and they vary sporadically, which indicates that the predictions from the linear models vary more than they do for the ensemble models.

Moreover, comparing the NPBM and SIBM with the CCM shows that the latter (CCM) outperforms the others on the basis of the MAE score. Thus, the observation derived from the R² score is corroborated by the MAE score.

In addition, we tested the proposed models by performing sensitivity analyses of the R² score in three different scenarios. In the first scenario, the base regressors are compared using the temporal variations of each of the proposed models (see Figures 4, 6, and 8). Here, the ensemble regressors performed better than did the linear regressors. In the second scenario, we compared the proposed models for weekdays and weekends separately, keeping the temporal features constant for both the linear and ensemble models. This shows that the CCM performs better than either the NPBM or the SIBM (see Figure 9). In the third scenario, we compared the individual base regressors in the ensemble and linear models, keeping the other parameters constant (see Figure 11). This shows that ET performs best among the ensemble regressors, whereas ridge regression performs best among the linear regressors.

8 | CONCLUSIONS

Taxi-demand forecasting is a challenging task, as taxi demands have variable spatiotemporal patterns. In this work, we performed a meticulous spatiotemporal analysis of taxi demands using Chicago data. We found that approaches that consider only a single covariate at a time (i.e., the NPBM and SIBM) did not convey enough information about the flow of taxis. Consequently, they can lead to inadequate forecasting of demand. In contrast, we found that combining the individual covariates improves the performance drastically. We also incorporated dynamic POI data in this CCM to make it more robust. The robustness of the CCM relative to the NPBM and SIBM is demonstrated by the experimental results, in which we tested all the models against the 2018 and 2019 datasets using the R² and MAE scores as criteria to compare performances. We note that the performance of taxi-demand forecasting depends upon the availability of historical data for taxi demand. If ambient and residential populations are low in some areas, it will be challenging to predict taxi demand accurately there. Demographics and POI data are other factors that can influence taxi demand, but they are useful only when combined with historical taxi-demand data. In future, we plan to extend this work to predict hotspots and analyze the accuracy and coverage of the model.

CONFLICT OF INTEREST
The authors declare no potential conflict of interests.

ORCID
Gaurav Hajela https://orcid.org/0000-0002-9835-205X

REFERENCES
1. T. Kim, S. Sharda, X. Zhou, and R. M. Pendyala, A stepwise interpretable machine learning framework using linear regression (LR) and long short-term memory (LSTM): City-wide demand-side prediction of yellow taxi and for-hire vehicle (FHV) service, Transp. Res. C: Emerg. Technol. 120 (2020), 102786.
2. Q. Liu, C. Ding, and P. Chen, A panel analysis of the effect of the urban environment on the spatiotemporal pattern of taxi demand, Travel Behav. Soc. 18 (2020), 29–36.
3. C. Yang and E. J. Gonzales, Modeling taxi demand and supply in New York City using large-scale taxi GPS data, Seeing cities through big data: Research, methods and applications in urban informatics, Springer International Publishing, Cham, 2017, pp. 405–425.
4. H. Luo, J. Cai, K. Zhang, R. Xie, and L. Zheng, A multi-task deep learning model for short-term taxi demand forecasting considering spatiotemporal dependences, J. Transp. Eng. (English Edition) 8 (2021), no. 1, 83–94.
5. B. Hu, S. Zhang, Y. Ding, M. Zhang, X. Dong, and H. Sun, Research on the coupling degree of regional taxi demand and social development from the perspective of job-housing travels, Phys. A: Stat. Mech. Appl. 564 (2021), 125493.
6. S. Faghih, A. Shah, Z. Wang, A. Safikhani, and C. Kamga, Taxi and mobility: Modeling taxi demand using ARMA and linear regression, Procedia Comput. Sci. 177 (2020), 186–195.
7. Z. Liu, H. Chen, Y. Li, and Q. Zhang, Taxi demand prediction based on a combination forecasting model in hotspots, J. Adv. Transp. 2020 (2020), 13.
8. I. Markou, F. Rodrigues, and F. C. Pereira, Multi-step ahead prediction of taxi demand using time-series and textual data, Transportation Research Procedia 41 (2019), 540–544.
9. C. Antoniades, D. Fadavi, and A. F. Amon, Fare and duration prediction: A study of New York City taxi rides, 2016.
10. A. Safikhani, C. Kamga, S. Mudigonda, S. S. Faghih, and B. Moghimi, Spatio-temporal modeling of yellow taxi demands in New York City using generalized STAR models, Int. J. Forecasting 36 (2020), no. 3, 1138–1148.
11. J. Xu, R. Rahmatizadeh, L. Boloni, and D. Turgut, Real-time prediction of taxi demand using recurrent neural networks, IEEE Trans. Intell. Transp. Syst. 19 (2018), no. 8, 2572–2581.
12. T. Liu, W. Wu, Y. Zhu, and W. Tong, Predicting taxi demands via an attention-based convolutional recurrent neural network, Knowl. Based Syst. 206 (2020), 106294.
13. P. Shu, Y. Sun, Y. Zhao, and G. Xu, Spatial-temporal taxi demand prediction using LSTM-CNN, (IEEE 16th International Conference on Automation Science and Engineering, Hong Kong, China), Aug. 2020, pp. 1226–1230.
14. J. Ye, L. Sun, B. Du, Y. Fu, X. Tong, and H. Xiong, Co-prediction of multiple transportation demands based on deep spatio-temporal neural network, (Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Association for Computing Machinery, Anchorage, AK, USA), 2019, pp. 305–313.
15. U. Vanichrujee, T. Horanont, W. Pattara-atikom, T. Theeramunkong, and T. Shinozaki, Taxi demand prediction using ensemble model based on RNNs and xgboost, (International Conference on Embedded Systems and Intelligent Technology International Conference on Information and Communication Technology for Embedded Systems, Khon Kaen, Thailand), 2018, pp. 1–6.
16. Z. Liu, H. Chen, X. Sun, and H. Chen, Data-driven real-time online taxi-hailing demand forecasting based on machine learning method, Appl. Sci. 10 (2020), no. 19, 6681.
17. Z. Chen, B. Zhao, Y. Wang, Z. Duan, and X. Zhao, Multitask learning and GCN-based taxi demand prediction for a traffic road network, Sensors 20 (2020), no. 13, 3776.
18. T. L. Quy, W. Nejdl, M. Spiliopoulou, and E. Ntoutsi, A neighborhood-augmented LSTM model for taxi-passenger demand prediction, Multiple-aspect analysis of semantic trajectories, Cham, 2020, pp. 100–116.
19. X. Guo, Prediction of taxi demand based on CNN-BiLSTM-Attention neural network, Neural information processing, Cham, 2020, pp. 331–342.
20. P. Rodrigues, A. Martins, S. Kalakou, and F. Moura, Spatiotemporal variation of taxi demand, Transp. Res. Procedia 47 (2020), 664–671.
21. X. Liu, L. Sun, Q. Sun, and G. Gao, Spatial variation of taxi demand using GPS trajectories and POI data, J. Adv. Transp. 2020 (2020), 7621576.
22. Y. Zhou, Y. Wu, J. Wu, L. Chen, and J. Li, Refined taxi demand prediction with ST-Vec, (26th International Conference on Geoinformatics, Kunming, China), 2018, pp. 1–6.
23. D. Faial, F. Bernardini, E. M. Meza, L. Miranda, and J. Viterbo, A methodology for taxi demand prediction using stream learning, (International Conference on Systems, Signals and Image Processing, Niteroi, Brazil), 2020, pp. 417–422.
24. N. Davis, G. Raina, and K. Jagannathan, A multi-level clustering approach for forecasting taxi travel demand, (IEEE 19th International Conference on Intelligent Transportation Systems, Rio de Janeiro, Brazil), 2016, pp. 223–228.

AUTHOR BIOGRAPHIES

Akshata Gangrade received her Bachelor of Technology degree in Computer Science from Maulana Azad National Institute of Technology (MANIT), Bhopal (India), in 2021. She has a keen interest in research that motivated her to conduct research work in multiple domains. Her main research interests are machine learning, deep learning, and time-series prediction.

Pawel Pratyush received his Bachelor of Technology degree in Computer Science and Engineering from the Maulana Azad National Institute of Technology (MANIT), Bhopal (India), in 2021. His main research interests are machine learning, deep learning, and their application in physics, health informatics, and spatiotemporal modeling.

Gaurav Hajela received his Bachelor of Engineering degree in Information Technology from Rajiv Gandhi Proudyogiki Vishwavidyalaya, Bhopal, India, in 2012, and his M. Tech degree in Computer Science and Engineering from Maulana Azad National Institute of Technology (MANIT), Bhopal, India, in 2014. Since 2015, he has been with Department of Computer Science and Engineering, MANIT, Bhopal, India, where he is pursuing his PhD degree. His main research interests are Big Data analytics, machine learning, and time-series prediction.

How to cite this article: A. Gangrade, P. Pratyush, and G. Hajela, Taxi-demand forecasting using dynamic spatiotemporal analysis, ETRI Journal 44 (2022), 624–640. https://doi.org/10.4218/etrij.2021-0123
