J of Advced Transportation - 2014 - Zhu - A Reinforcement Learning Approach For Distance Based Dynamic Tolling in The
See the Terms and Conditions (https://onlinelibrary.wiley.com/terms-and-conditions) on Wiley Online Library for rules of use; OA articles are governed by the applicable Creative Commons License
JOURNAL OF ADVANCED TRANSPORTATION
J. Adv. Transp. 2015; 49:247–266
Published online 5 June 2014 in Wiley Online Library (wileyonlinelibrary.com). DOI: 10.1002/atr.1276
SUMMARY
This paper proposes a novel distance-based dynamic tolling model that accounts for uncertain traffic demand and supply conditions. The distance-based tolling controller is modeled as an intelligent agent that interacts dynamically with the stochastic network environment by taking actions, namely deciding the distance-based toll rates charged to vehicles. The distance-based tolls are determined according to various metrics, for example, total traffic flow throughput, delay time, and vehicular emissions, which are set as objectives in the modeling framework. The optimal toll rate is determined by a reinforcement learning algorithm based on the R-Markov Average Reward Technique. In the numerical case study, we test the proposed tolling scheme on a benchmark test network, the Sioux Falls network, in which specified links are candidate toll links. The results show that the total travel time on the toll links decreases by about 25% over the simulation runs. Copyright © 2014 John Wiley & Sons, Ltd.
KEY WORDS: dynamic tolling; stochastic network; connected vehicle; reinforcement learning
1. INTRODUCTION
Deteriorating traffic conditions in urban areas have long hindered cities' economic development and reduced their quality of life. Especially in large urban areas, uncertain and increasing traffic demand makes congestion a burden on economic activities, work and non-work travel, and air quality. On the demand side, many traffic management strategies have been proposed to relieve urban congestion. Among them, road pricing has been shown to be an effective congestion mitigation strategy, both theoretically and in real-world implementations [1, 2]. Road pricing has been implemented in large cities around the world. For instance, Singapore launched its Electronic Road Pricing scheme in 1998, which charges a congestion fee every time a user crosses the cordon. London introduced a zonal congestion pricing scheme in 2003, in which a daily fee is charged for every vehicle within the congestion charge zone.
In the practical implementation of road pricing, wireless communication technology has played a significant role [3, 4]. For example, electronic toll collection systems use dedicated short-range communications (DSRC) technology for automated vehicle identification to collect tolls. DSRC technology has great potential in intelligent transportation systems (ITS), as it enables the wireless exchange of information between vehicles, as well as between vehicles and roadside infrastructure. The development of wireless communication technology has received considerable attention because of the connected vehicle (CV) initiative. This technology was primarily developed to improve traffic safety (collision avoidance) at intersections; secondary goals are to alleviate congestion and reduce vehicular emissions. Acknowledging this potential, the ITS program of the U.S. Department of Transportation has emphasized CV research in the ITS Strategic Plan (2010–2014). The CV environment enables vehicles to communicate with each other (vehicle-to-vehicle) and with infrastructure components (vehicle-to-infrastructure, V2I), and enables infrastructure components to communicate with each other (infrastructure-to-infrastructure). CV has also received attention in Europe, where it is known as car-to-car and car-to-X technology. Although CV has not yet been realized in real-world transportation systems, many automakers are investing significant effort in producing vehicles with communication features. In addition, many test beds are under way in the USA, Europe, and Japan.

*Correspondence to: Satish V. Ukkusuri, Lyles School of Civil Engineering, Purdue University, CIVL G167D, 550 Stadium Mall Drive, West Lafayette, IN 47906, U.S.A. E-mail: sukkusur@purdue.edu
A key motivation of this study is to address the dynamic tolling problem in the CV environment. Recent advances in the CV environment offer useful technologies for detecting and acquiring high-fidelity data that can support more efficient traffic control strategies. In particular, in the CV environment, the tolling control unit may have access to the trip information of surrounding vehicles, for example, origins, destinations, paths taken, speed, distance traveled, and trip purpose. As shown later in the paper, distance-based tolling requires only the vehicle's entry point, which can be obtained through V2I technology. In current practice, a vehicle's trip information could also be obtained by installing wireless communication devices in vehicles and deploying roadside sensors along the toll lane. Thus, distance-based dynamic tolling has the potential to replace traditional road pricing (facility-based, cordon-based, or zone-based) in a future CV environment, as well as in specific settings in current practice.
In the tolling literature, the static case has been well studied [5, 6]. Recently, Li et al. [7] investigated network travel time reliability by proposing a bi-level optimal toll design model. Liu et al. [8] studied morning commuters' mode choice behavior under different rationing and pricing strategies. Liu et al. [9] further captured travelers' route choice behavior by applying the logit-based stochastic user equilibrium principle. Meanwhile, the dynamic tolling problem has also attracted considerable interest in the literature [6]. To name a few studies, Chung et al. [10] formulated the dynamic congestion pricing problem under demand uncertainty and proposed an optimization approach based on a particle swarm algorithm to obtain robust solutions. Stewart and Ge [11] extended the concept of minimal-revenue congestion pricing from the static case to the dynamic case, in which traffic flows are time-varying and follow the dynamic user equilibrium principle. From a different perspective, Zhang et al. [12] modeled traffic flow with the kinematic wave model and developed a self-adaptive tolling strategy for high-occupancy toll lane systems.
However, the literature on distance-based dynamic tolling is very limited, and another motivation of this study is to address the distance-based dynamic tolling problem. In a recent work, De Palma and Lindsey [4] summarized distance-based tolling in Europe. However, all of these schemes collect charges between fixed terminals and are designed for heavy goods vehicles; none is based on flexible distance or applies to passenger vehicles. Moreover, to the best of our knowledge, dynamic tolls based on flexible distance have not been studied before. One reason is the lack of technology to realize a distance-based tolling concept: it is difficult to track the distance traveled by vehicles dynamically because of technological and privacy restrictions. However, this technological barrier may disappear in the near future, given the extensive research and implementation efforts in CV. For instance, a California startup called True Mileage [13] provides a “black box” that measures a car's mileage without violating privacy. The concept of distance-based tolling is distinct from the vehicle miles traveled (VMT)-based fee in the USA. A VMT fee (or tax) charges road users based on miles traveled and is a potential replacement for, or supplement to, the current motor fuel tax. The main purpose of VMT fees is to raise revenue and finance infrastructure maintenance [14–16]. The two concepts could potentially be combined in the future, but currently there is a clear distinction between them.
In practice, one way to implement distance-based tolling is to make use of underused high-occupancy vehicle lanes. This concept is known as managed lanes, in which single-occupancy vehicles are allowed to use the high-occupancy vehicle lanes for a toll. Yin and Lou [17] introduced a reactive self-learning approach to determine time-varying tolls for operating managed toll lanes; the approach learns the parameters of motorists' willingness to pay sequentially. Following this line of work, Lou et al. [18] extended the idea to a slip-ramp configuration and a more efficient, proactive self-learning approach. However, both papers consider only a corridor with several fixed entry points at which tolls are collected, so the determined tolls are optimal only for these localized entry points, and coordination between entry points is not considered. Yang et al. [19] proposed distance-based dynamic tolling for the managed lane problem. However, in their model, vehicles can enter the toll lane only through specified toll entrances and leave the system only through specified toll exits.
Copyright © 2014 John Wiley & Sons, Ltd. J. Adv. Transp. 2015; 49:247–266
DOI: 10.1002/atr
20423195, 2015, 2, Downloaded from https://onlinelibrary.wiley.com/doi/10.1002/atr.1276 by Florida International University, Wiley Online Library on [06/11/2022]. See the Terms and Conditions (https://onlinelibrary.wiley.com/terms-and-conditions) on Wiley Online Library for rules of use; OA articles are governed by the applicable Creative Commons License
AN RL APPROACH FOR DISTANCE-BASED DYNAMIC TOLLING 249
This paper sets out to fill this research gap in road pricing by proposing a novel distance-based dynamic tolling model. The proposed model has two features. (i) Distance based. The notion of “distance” in this study refers to the actual distance the vehicle travels in the toll lane. In other words, vehicles are free to enter the toll lane at any point along the lane; we do not impose any specific entry points for tolls. This setting is technologically feasible in the CV environment, and it is also plausible in current practice by deploying roadside sensors along the toll lane. (ii) Dynamic toll rate. Based on the vehicle's entry location, the tolling control unit determines the best toll rate for the vehicle. Hence, the final toll equals the product of the dynamic toll rate and the distance traveled.
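The two features can be illustrated with a minimal Python sketch. It assumes the rate is fixed when the vehicle enters the toll lane and that distance is counted in whole cells, written (N_i − m)C′ in Section 2. The function names, the 12-cell lane, the 250 m cell length, and the rate value are hypothetical illustrations, not values from the paper.

```python
def remaining_distance(entry_cell: int, n_cells: int, cell_length_m: float) -> float:
    """Distance from entry cell m to the lane exit, (N_i - m) * C' (Section 2)."""
    if not 0 <= entry_cell <= n_cells:
        raise ValueError("entry cell must lie within the lane")
    return (n_cells - entry_cell) * cell_length_m


def distance_based_toll(rate_per_m: float, distance_m: float) -> float:
    """Final toll = toll rate fixed at entry x distance traveled in the toll lane."""
    if rate_per_m < 0 or distance_m < 0:
        raise ValueError("rate and distance must be non-negative")
    return rate_per_m * distance_m


# Hypothetical lane: 12 cells of 250 m each, toll rate 0.01 per metre.
# A vehicle entering at cell 2 travels 10 cells; one entering at cell 8 travels 4,
# so the later entrant pays less even though both face the same rate.
for entry in (2, 8):
    d = remaining_distance(entry, 12, 250.0)
    print(entry, d, distance_based_toll(0.01, d))
```

The toll is charged only for the distance actually traveled in the toll lane, which is what distinguishes this scheme from tolls collected between fixed terminals.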
The model is built on a stochastic network environment. The traffic demand input is generated randomly from a probability distribution to account for uncertainty on the demand side. Similarly, the saturation capacity of toll links is set randomly according to a probability distribution. For the underlying traffic flow model, we use the well-accepted cell transmission model (CTM) [20, 21]. Using the fundamental diagram in CTM, we obtain the speed profile at the cell and link levels and then estimate travel time. Note that distance in CTM is not strictly continuous, because it is measured in whole cells. However, in contrast to fixed distances (tolls collected between toll stations), we regard the distance obtained from CTM as flexible.
We then model the choice between the non-toll and toll lanes using a utility model based on travel time and tolls; a binomial logit model is applied to model this lane choice. Moreover, the dynamic tolling problem is modeled as a Markov decision process (MDP). Different metrics, for example, total network throughput, delay time, and vehicular emissions, can be set as optimization objectives in the modeling framework. The tolling control unit is modeled as an intelligent agent that interacts with the stochastic network environment by taking actions, namely determining the distance-based toll rates for vehicles. Thus, the distance-based dynamic tolling problem is equivalent to finding the optimal policy (a mapping from traffic states to toll rate activations) that maximizes the long-term reward, measured in terms of total travel time, number of stops, vehicular emissions, and so on. The optimal toll rate is obtained by applying a reinforcement learning algorithm based on the R-Markov Average Reward Technique (RMART). Note that the dynamic tolling problem with tolls varying over time (rather than distance) has been studied extensively in the literature [10, 22–24]. However, most, if not all, of these studies focus on solving the problem analytically and do not consider uncertainty in both traffic demand and capacity. Moreover, they are not readily applicable to the distance-based dynamic tolling problem. The advantages of applying the reinforcement learning (RL) algorithm are as follows. (i) The dynamic tolling problem is formulated as an MDP, which fits naturally with the RL approach. (ii) Uncertainties in both traffic demand and capacity are incorporated in the traffic flow environment. (iii) RL is an online learning algorithm; it provides a real-time control mechanism to maximize the expected reward over the long run.
In the case study, we construct a simplified version of the Sioux Falls network and impose tolls on specified links. The results show that the total travel time decreases over the simulation runs and finally stabilizes at around 75% of the total travel time in the first simulation run, in which tolling is not applied. Another interesting finding is that, for this specific example and the current experimental setup, arterial roads are more likely candidates than freeway links for distance-based tolls aimed at reducing total travel time.
The rest of the paper is structured as follows. Section 2 presents the methodology of the study, including the underlying traffic flow model (stochastic CTM), the toll lane choice model, and the reinforcement learning approach; the details of the RMART algorithm are also presented. Section 3 provides a case study applying the proposed model to the Sioux Falls test network. Section 4 concludes the paper and discusses directions for future research.
250 F. ZHU AND S. V. UKKUSURI
NOTATION:
Sets
Parameters
$W$ shockwave speed
$V$ free-flow speed
$S$ saturation flow rate
$C'$ length of a cell
$d_J$ jam density
$N_i(m,t)$ maximum number of vehicles (holding capacity) allowed in cell $m$ of lane $i$ at time $t$
$D_i(t)$ fixed mean demand input of lane $i$ at time $t$
Variables
2. METHODOLOGY
To begin, the assumptions of our modeling framework are as follows. (i) To model traffic propagation, we apply the CTM, an advanced spatial queuing model with the desirable property of capturing spatial queuing and shockwave propagation. (ii) In the toll lane choice model, the toll lane cost is assumed to be a function of the travel time and the distance-based toll. (iii) To determine the distance-based toll, we assume the toll controller has dynamic access to the location information of the vehicle.
Sink cells:
Ordinary/merging/diverging cells:
$$ x_i(m,t) = x_i(m,t-1) + \sum_{k \in \Gamma^{-1}(m)} f^i_{k,m}(t-1) - \sum_{n \in \Gamma(m)} f^i_{m,n}(t-1), \quad \forall m \in C_{O,M,D} \tag{3} $$
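Equation (3) is a conservation law: a cell's new occupancy is its old occupancy plus what flowed in minus what flowed out. The Python sketch below applies it to a single lane made only of ordinary cells. The flow rule f = min(x_{m−1}, Q_m, (W/V)(N_m − x_m)) is the standard Daganzo CTM sending/receiving rule and is an assumption here, because the paper's own flow equations are not reproduced in this section; merging, diverging, and sink cells are omitted, and demand that cannot enter the first cell is simply dropped.

```python
from typing import List


def ctm_step(x: List[float], inflow: float, n_max: List[float], q_max: List[float],
             q_exit: float, w_over_v: float) -> List[float]:
    """Advance one lane of ordinary cells by one time step using Eq. (3).

    x[m]      occupancy x_i(m, t-1) of cell m
    inflow    vehicles offered at the upstream boundary during the step
    n_max[m]  holding capacity of cell m
    q_max[m]  maximum inflow to cell m per step
    q_exit    maximum outflow from the last (ending) cell per step
    w_over_v  ratio of shockwave speed W to free-flow speed V
    """
    n = len(x)
    # f[m] is the flow entering cell m; f[n] is the flow leaving the last cell.
    # Flows follow the standard Daganzo CTM rule (an assumption of this sketch).
    f = [0.0] * (n + 1)
    f[0] = min(inflow, q_max[0], w_over_v * (n_max[0] - x[0]))
    for m in range(1, n):
        f[m] = min(x[m - 1], q_max[m], w_over_v * (n_max[m] - x[m]))
    f[n] = min(x[n - 1], q_exit)
    # Conservation, Eq. (3): new occupancy = old + inflow - outflow.
    return [x[m] + f[m] - f[m + 1] for m in range(n)]


# Hypothetical three-cell lane filling up from empty at 4 veh/step.
state = [0.0, 0.0, 0.0]
for _ in range(3):
    state = ctm_step(state, inflow=4.0, n_max=[10.0] * 3, q_max=[5.0] * 3,
                     q_exit=5.0, w_over_v=1.0)
print(state)
```

Lowering `q_exit` reproduces the effect of a reduced capacity at the ending cell: vehicles back up from the downstream end.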
$$ d_i(t) = p_i(t)\,D_i(t) \tag{7} $$
Note that $D_i(t)$ is a fixed value representing the predefined mean demand, and $p_i(t)$ is a random value drawn from a given probability distribution.
To capture uncertainty on the infrastructure supply side, we set
$$ Q_i(m,t) = \begin{cases} S\xi, & \forall m \notin \text{ending cell} \\ p_i(t)\,S\xi, & \forall m \in \text{ending cell} \end{cases} \tag{8} $$
Note that $S$ is a fixed value representing the saturation flow, and $p_i(t)$ is a random value in (0, 1) drawn from a given probability distribution. The choice of probability distribution depends on the type of uncertain event, for example, a highway crash, a lane closure, or a work zone. Based on empirical data, typical distributional assumptions for these events include the multivariate normal, lognormal, and multivariate lognormal distributions [36–39]. A similar use of CTM to describe a stochastic traffic network environment is discussed in [40]. The advantage of CTM is that it covers the whole range of traffic dynamics, including queue formation, queue dissipation, and kinematic waves. Once the density or cell occupancy is determined, the mean speed at the cell level or the link-segment level can be derived, which makes CTM suitable for travel time estimation.
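Equations (7) and (8) can be sketched in Python as follows. The paper leaves the distributions generic, so the choices here are assumptions: the demand factor p_i(t) is normal around 1 (so that D_i(t) stays the mean demand) and truncated at zero, and the capacity factor for the ending cell is uniform between a hypothetical lower bound and 1. The two equations reuse the symbol p_i(t); the sketch draws the two factors independently and passes the product Sξ in as a single per-step capacity.

```python
import random


def sample_demand(mean_demand: float, rng: random.Random, sd: float = 0.1) -> float:
    """Eq. (7): d_i(t) = p_i(t) * D_i(t).

    Assumption: p_i(t) ~ Normal(1, sd), truncated at 0, so that D_i(t) remains
    the mean demand and realized demand is never negative.
    """
    p = max(0.0, rng.gauss(1.0, sd))
    return p * mean_demand


def cell_capacity(s_xi: float, is_ending_cell: bool, rng: random.Random,
                  low: float = 0.5) -> float:
    """Eq. (8): S*xi for ordinary cells, p_i(t) * S*xi for the ending cell.

    Assumption: p_i(t) ~ Uniform(low, 1), a stand-in for an incident-specific
    distribution; s_xi is the capacity per time step.
    """
    if not is_ending_cell:
        return s_xi
    return rng.uniform(low, 1.0) * s_xi


rng = random.Random(7)
demand = [round(sample_demand(4.0, rng), 2) for _ in range(5)]
exit_capacity = cell_capacity(5.0, True, rng)
print(demand, exit_capacity)
```

Swapping in a lognormal or multivariate distribution, as suggested for crashes or work zones, only changes how `p` is drawn.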
Based on the fundamental equation of traffic flow, the traffic speed of each cell can be calculated as
$$ v_i(m,t) = \frac{q_i(m,t)}{k_i(m,t)} \tag{9} $$
$$ p_i(m,t) = \frac{e^{\theta U_i(m,t)}}{e^{\theta U_i(m,t)} + e^{\theta U_{i'}(m,t)}}, \quad i, i' \in I \tag{12} $$
where $I$ denotes the set of lanes within the same link and $\theta$ denotes a scale parameter. $U_i(m,t)$ and $U_{i'}(m,t)$ denote the costs of taking the toll lane and the non-toll lane, respectively, at cell $m$ at time $t$.
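Equation (12) can be evaluated as in the Python sketch below, with θ = 1 as in the case study. Because Equations (13) and (14), which define U, are not reproduced in this section, the sketch assumes U is oriented so that larger values are more attractive (for example, the negative of a generalized cost combining travel time and toll); if U is used as a cost, the sign of θ would have to be reversed for a higher cost to lower the choice probability.

```python
import math


def toll_lane_probability(u_toll: float, u_free: float, theta: float = 1.0) -> float:
    """Binary logit of Eq. (12) for switching to the toll lane.

    exp(theta*U_i) / (exp(theta*U_i) + exp(theta*U_i')) is evaluated in the
    equivalent form 1 / (1 + exp(-theta*(U_i - U_i'))), branching on the sign
    so that exp() never overflows for large utility differences.
    """
    z = theta * (u_toll - u_free)
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


# Hypothetical utilities, taken as negative generalized costs: the toll lane
# costs 5 units (travel time plus toll) and the non-toll lane costs 6 units.
print(toll_lane_probability(-5.0, -6.0, theta=1.0))
```

Only the utility difference matters, so adding the same constant to both lanes leaves the switching probability unchanged.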
Note that the lane choice model is distinct from a route choice model. The lane choice model is similar to a lane-changing model, in which traffic changes from the non-toll lane to the toll lane. Also note that lane utility is a complex perceptual concept; it can be based on different measures, for example, travel time on the lane, tolls, vehicular emissions, and so on. In this study, the toll lane cost $U_i(m,t)$ is assumed to be a function of the travel time from cell $m$ to the exit of the link and the distance-based toll, while the non-toll lane cost $U_{i'}(m,t)$ is assumed to be a function of the travel time from cell $m$ to the exit of the link.
Figure 3. Cell representation of a link with non-toll lane and toll lane.
Note that the distance from cell $m$ to the exit of lane $i$ can be written as $(N_i - m)\,C'$, where $N_i$ denotes the total number of cells in lane $i$. Moreover, because $a_i(m,t)$ denotes the toll rate per unit length for cell $m$ at time $t$, the distance-based toll is calculated as $a_i(m,t)\,(N_i - m)\,C'$.
Further, we have
$$ TT_i(m,t) = \min\left( \frac{(N_i - m)\,C'}{V},\ \frac{\sum_{j=m}^{N_i} x_i(j,t)}{S},\ \frac{\sum_{j=m}^{N_i} x_i(j,t)}{W\left(d_J - \dfrac{\sum_{j=m}^{N_i} x_i(j,t)}{(N_i - m)\,C'}\right)} \right) \tag{17} $$
By substituting Equations (15), (17), and (18) into Equations (13) and (14), we derive the complete utility expressions for both the toll lane and the non-toll lane, and finally obtain the probability of traffic switching to the toll lane. For brevity, these expressions are omitted here.
$Q(s_i^t, a_i(m,t))$ Q-value of state-action pair $(s_i^t, a_i(m,t))$ at lane $i$
$\alpha^{(k)}$ learning rate for the Q-values at the $k$th iteration
$\beta^{(k)}$ learning rate for the average reward at the $k$th iteration
$\gamma$ discount factor for reward values
$\varepsilon$ exploration probability of the ε-greedy rule
Reinforcement learning techniques have been applied effectively to practical problems involving optimal control and optimization in many disciplines of science and engineering. In general, any method that applies sampling-based techniques to solve optimal control problems or their variants can be regarded as RL. The agent (the tolling control unit) interacts with the environment (the system or a representative model of it) by taking an action, and the environment reacts to that action by changing its state. The environment also returns a reward that indicates how much the agent gains by performing that action; the reward measures the effectiveness of the agent's actions in reaching its optimization goals. In the dynamic tolling problem, the tolling controller is the agent and the traffic network (which is dynamic and random) is the environment.
An RL system has three specific components: state, action, and reward. The following sections define the state, action, and reward used in the proposed RL algorithm.
Note that we discretize the state into four categories, but this setting is flexible: depending on the size of the network, one can discretize the state into more than four categories.
where σ1, σ2, σ3, and σ4 denote the discretized (threshold) values of the toll rate. As with the state, the discretization of actions is flexible; one can discretize toll rates into more than four categories.
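The state and action discretization can be sketched as follows. The four toll-rate levels are the values σ1 to σ4 used in the case study; the traffic measure that defines the state and its three thresholds are hypothetical, since the state definition is not reproduced in this section.

```python
import bisect
from typing import Sequence

# sigma_1..sigma_4, the discretized toll rates preset in the case study.
TOLL_RATES = (0.0, 0.01, 0.02, 0.03)


def discretize_state(measure: float, thresholds: Sequence[float]) -> int:
    """Map a continuous traffic measure to a state index 0..len(thresholds).

    Three ascending thresholds give four state categories; adding thresholds
    gives more categories. Values equal to a threshold fall in the upper bin.
    """
    return bisect.bisect_right(list(thresholds), measure)


def toll_rate(action: int) -> float:
    """Return the toll rate sigma_(action + 1) for action index 0..3."""
    return TOLL_RATES[action]


# Hypothetical measure: toll-lane occupancy as a fraction of holding capacity.
occupancy_thresholds = (0.25, 0.5, 0.75)
print([discretize_state(x, occupancy_thresholds) for x in (0.1, 0.3, 0.6, 0.9)])
print(toll_rate(2))
```

Refining either discretization only means passing more thresholds or listing more toll-rate levels; the Q-table then grows accordingly.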
Reinforcement learning algorithms generally require a balance between exploitation and exploration in the strategy for selecting the optimal action. The simplest selection rule is to choose the action (or one of the tied actions) with the highest estimated state-action value (completely greedy behavior). In other words, the agent always tries to maximize the immediate reward using its current knowledge, without any attempt to explore other possible actions. To balance exploitation and exploration, we apply the ε-greedy method [41]. In this method, the agent chooses the action with the maximum state-action value in most cases, except in a few cases in which a random action is chosen to explore other possible actions. The probability of this random behavior is ε, and the probability of selecting the optimal action converges to a value greater than 1 − ε. Note that the advantage of ε-greedy methods over greedy methods depends strongly on the type of problem; for instance, ε-greedy methods may perform better when reward values have higher variance.
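The ε-greedy rule described above can be sketched in a few lines of Python. The Q-table layout (a dictionary keyed by (state, action)), the default value of 0 for unvisited pairs, and random tie-breaking are assumptions of this sketch.

```python
import random
from typing import Dict, Hashable, List, Sequence, Tuple


def epsilon_greedy(q: Dict[Tuple[Hashable, int], float], state: Hashable,
                   actions: Sequence[int], epsilon: float, rng: random.Random) -> int:
    """Choose an action for `state` with the epsilon-greedy rule.

    With probability epsilon a uniformly random action is taken (exploration);
    otherwise an action with the highest Q(state, action) is taken, ties broken
    at random (exploitation). Unvisited pairs default to a Q-value of 0.
    """
    if rng.random() < epsilon:
        return rng.choice(list(actions))
    values: List[float] = [q.get((state, a), 0.0) for a in actions]
    best = max(values)
    ties = [a for a, v in zip(actions, values) if v == best]
    return rng.choice(ties)


# Because the random branch can also pick the greedy action, the greedy action
# is chosen with probability 1 - epsilon + epsilon / len(actions) > 1 - epsilon.
rng = random.Random(0)
q_table = {(0, 0): 1.0, (0, 1): 2.0}
picks = [epsilon_greedy(q_table, 0, [0, 1], 0.1, rng) for _ in range(10000)]
print(picks.count(1) / len(picks))
```

Lowering `epsilon` toward zero gives the exploitation-heavy behavior used in the implementation phase.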
the learning phase. During the implementation period, the algorithm emphasizes exploitation with a small ε value. Because the only change from the learning phase to the implementation phase is the action selection strategy, only the learning-phase algorithm is described next.
RMART also assumes an ergodic process; that is, the long-run behavior does not depend on the initial state, and for any initial state the long-term average reward should have the same value. RMART handles the transient situation in which better-than-average rewards are received for a while in some states and worse-than-average rewards in others. One clear distinction from other temporal difference techniques is its use of relative value functions, whose values are relative to the average reward under the active policy [41, 42]. RMART uses the long-term average reward instead of the discounted reward used in the Q-learning and State-Action-Reward-State-Action (SARSA) algorithms. Tsitsiklis and Van Roy [44] compared discounted techniques (such as Q-learning) and average reward techniques (such as RMART) analytically and showed that as the discount factor approaches one, the value function of the discounted technique approaches the differential value function of the average reward technique.
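A standard way to make this relationship precise, for a finite MDP evaluated under a fixed stationary policy π, is the expansion of the discounted value function around γ = 1 found in textbook treatments of average-reward MDPs. It is included here as a supporting sketch, not as part of the RMART derivation:

$$ V_\gamma^{\pi}(s) = \frac{\rho^{\pi}}{1-\gamma} + h^{\pi}(s) + e_\gamma(s), \qquad \lim_{\gamma \to 1} e_\gamma(s) = 0 $$

Here ρ^π is the long-run average reward and h^π(s) is the differential (relative) value of state s. The offset ρ^π/(1 − γ) is the same for every state, so it does not change comparisons between states; once it is removed, the discounted value function approaches the differential value function as γ → 1.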
Because traffic demand and supply are uncertain, the traffic volume on a link is a stochastic process, and the state of the RL system depends heavily on this volume. Two distinct properties of traffic dynamics are the recurring similarity of traffic patterns (e.g., the traffic pattern on a particular link every Sunday from 11 AM to noon) and heterogeneity in the network congestion parameter. To account for these attributes, this research deploys an average reward technique. In addition, average reward methods offer computational advantages [45].
Initialization:
  Set initial values for $\rho(s_i^t, a_i(\cdot))$ and $Q(s_i^t, a_i(\cdot))$ for all state-action pairs $(s_i^t, a_i(m,t))$;
  Check = 0; k = 1.
While Check = 0 Do
  Update learning rates:
    $\alpha^{(k)} = \frac{\log(k+2)}{10\,(k+2)}$
    $\beta^{(k)} = \frac{A}{B+k}$; A and B are scalars
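  Learning Phase:
    The agent builds its mapping table $(s_i^t, a_i(m,t))$, which is used in later steps to decide which toll to impose in the implementation phase.
    Observe the reward $r_i^t(s_i^t, a_i(\cdot), \tilde{s}_i^t)$ for choosing action $a_i$, and the next state $\tilde{s}_i^t$.
    Update Q-values:
      $Q(s_i^t, a_i(\cdot)) \leftarrow Q(s_i^t, a_i(\cdot)) + \alpha^{(k)}\left[r_i^t(\cdot) - \rho(s_i^t, a_i(\cdot)) + \max_{\tilde{a}_i} Q(\tilde{s}_i^t, \tilde{a}_i(\cdot)) - Q(s_i^t, a_i(\cdot))\right]$
    Update average reward:
      If $Q(s_i^t, a_i(\cdot)) = \max_{a_i} Q(s_i^t, a_i(\cdot))$ Then
        $\rho(s_i^t, a_i(\cdot)) \leftarrow \rho(s_i^t, a_i(\cdot)) + \beta^{(k)}\left[r_i^t(\cdot) - \rho(s_i^t, a_i(\cdot)) + \max_{\tilde{a}_i} Q(\tilde{s}_i^t, \tilde{a}_i(\cdot)) - \max_{a_i} Q(s_i^t, a_i(\cdot))\right]$
    $s_i^t \leftarrow \tilde{s}_i^t$
    Update $k \leftarrow k + 1$.
    If termination criteria are met, Then Check = 1.
End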
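The learning-phase listing can be turned into a runnable Python sketch for a single lane's agent. It mirrors the listing in indexing ρ by state-action pair and in updating ρ only when the chosen action is greedy after the Q update. Several parts are assumptions: the α schedule uses one reading of the listing's step-size expression, the values A = 1 and B = 10 are hypothetical, and `toy_env_step` is a made-up two-state stand-in for the CTM network, not the paper's environment.

```python
import math
import random
from collections import defaultdict


def alpha_schedule(k: int) -> float:
    # One reading of the Q-value step size: log(k + 2) / (10 (k + 2)).
    return math.log(k + 2) / (10 * (k + 2))


def beta_schedule(k: int, a_const: float = 1.0, b_const: float = 10.0) -> float:
    # beta^(k) = A / (B + k); the values of A and B are hypothetical.
    return a_const / (b_const + k)


def rmart_update(q, rho, s, a, r, s_next, actions, k):
    """Apply one iteration of the RMART learning-phase listing.

    q and rho are defaultdict(float) keyed by (state, action); rho is indexed
    per state-action pair to mirror the listing.
    """
    max_next = max(q[(s_next, b)] for b in actions)
    # Q-value update with the average-adjusted temporal difference.
    q[(s, a)] += alpha_schedule(k) * (r - rho[(s, a)] + max_next - q[(s, a)])
    # Average-reward update only if the chosen action is greedy in s.
    max_here = max(q[(s, b)] for b in actions)
    if q[(s, a)] == max_here:
        rho[(s, a)] += beta_schedule(k) * (r - rho[(s, a)] + max_next - max_here)


def toy_env_step(state, action, rng):
    """Hypothetical two-state stand-in for the traffic network.

    State 1 means 'congested'; a higher toll index makes congestion less likely,
    and the reward is the negative of a made-up travel-time cost plus a toll penalty.
    """
    p_congested = 0.8 - 0.2 * action
    next_state = 1 if rng.random() < p_congested else 0
    reward = -(10.0 if next_state == 1 else 4.0) - 0.5 * action
    return reward, next_state


def train(iterations=5000, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    actions = [0, 1, 2, 3]  # indices of the four toll-rate levels
    q, rho = defaultdict(float), defaultdict(float)
    s = 0
    for k in range(1, iterations + 1):
        if rng.random() < epsilon:
            a = rng.choice(actions)  # explore
        else:
            a = max(actions, key=lambda b: q[(s, b)])  # exploit; ties -> lowest index
        r, s_next = toy_env_step(s, a, rng)
        rmart_update(q, rho, s, a, r, s_next, actions, k)
        s = s_next  # s_i^t <- next state
    return q, rho


if __name__ == "__main__":
    q, rho = train()
    for s in (0, 1):
        best = max(range(4), key=lambda b: q[(s, b)])
        print(f"state {s}: greedy toll-level index {best}")
```

In the paper's setting, `toy_env_step` would be replaced by a CTM simulation step, and the termination check by whatever stopping criterion is used.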
settings of the network are presented in Table I. For brevity, we show only the lengths of the toll links.
For the toll lane choice model, we set the scale parameter θ in Equation (12) to one, and the discretized toll rates are preset as σ1 = 0, σ2 = 0.01, σ3 = 0.02, and σ4 = 0.03. The simulation lasts 3600 s with a time step of 10 s. Hence, because a cell is the distance traveled at free-flow speed in one time step, the cell length of freeway (arterial) roads is around 244 m (133 m). For the input demand at the origin (node 1), we randomly generate values from a normal distribution with two peaks (with a mean of 4 veh/step), as shown in Figure 5. For the last 300 s (5 min), we set the demand input to zero so that the network is cleared of traffic when the simulation ends. The saturation flow is set to 1800 vph (5 veh/step); hence, the demand setting of 4 veh/step (1440 vph, or 80% of the saturation flow) is quite high. Demand values are regenerated every 60 s (six steps). Moreover, to model random traffic disturbances, the capacity of the exit cell in the non-toll lane of links 7, 14, and 27 is generated randomly from a normal distribution with a mean of 1800 vph and a standard deviation of 180 vph. Capacity values are regenerated every 300 s (30 steps).
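The unit conversions and the capacity regeneration in this setup can be checked with a short Python sketch. The capacity schedule follows the stated mean, standard deviation, and 300 s regeneration interval; the truncation at zero, the use of Python's `random.gauss`, and the helper names are assumptions of the sketch, and the two-peak demand profile of Figure 5 is not modeled.

```python
import random

STEP_S = 10.0   # simulation time step (s)
SIM_S = 3600.0  # simulation duration (s)


def vph_to_veh_per_step(vph: float, step_s: float = STEP_S) -> float:
    """Convert a flow in vehicles per hour to vehicles per time step."""
    return vph * step_s / 3600.0


def veh_per_step_to_vph(veh: float, step_s: float = STEP_S) -> float:
    """Convert a flow in vehicles per time step to vehicles per hour."""
    return veh * 3600.0 / step_s


def exit_capacity_schedule(rng: random.Random, mean_vph: float = 1800.0,
                           sd_vph: float = 180.0, interval_s: float = 300.0,
                           step_s: float = STEP_S, sim_s: float = SIM_S) -> list:
    """Per-step exit-cell capacity (veh/step), redrawn from a normal
    distribution every interval_s seconds and held constant in between."""
    steps = int(sim_s / step_s)
    per_draw = int(interval_s / step_s)
    schedule = []
    for start in range(0, steps, per_draw):
        cap_vph = max(0.0, rng.gauss(mean_vph, sd_vph))  # truncation at 0: assumption
        schedule.extend([vph_to_veh_per_step(cap_vph, step_s)] * min(per_draw, steps - start))
    return schedule


saturation = vph_to_veh_per_step(1800.0)    # saturation flow per step
peak_demand_vph = veh_per_step_to_vph(4.0)  # peak demand in vph
print(saturation, peak_demand_vph, peak_demand_vph / 1800.0)
print(len(exit_capacity_schedule(random.Random(3))))
```

With a 10 s step, 1800 vph is exactly 5 veh/step, and the 1 h simulation yields 360 steps grouped into 12 capacity draws of 30 steps each.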
Index
Freeway (55 mph) 1, 2, 7, 18, 19, 33
Managed link 7, 14, 27
Origin node 1, 26, 27
Destination node 7, 25
Link index Link length (m)
7 3000
14 1500
27 3000
Figure 7. Variation of total travel time on the toll links over simulation runs (the result of the first run is the baseline for comparison).
demonstrate the setting. The uncertainty from both the traffic demand and the infrastructure supply
defines the stochastic traffic network environment.
Note that in this study, we set the total travel time of all the toll links as the objective to be optimized. The
total travel time for different simulation runs is shown in Figure 7. In this study, we set the maximum number
of iterations to 120. The total travel time clearly decreases over the iterations: from the 90th iteration onward,
it fluctuates within a narrow range at about 75% of the value from the first simulation run. This trend
demonstrates the effectiveness of the RMART algorithm in minimizing the total travel time. On the other hand,
the result does not converge to a stable point, which is also expected for two reasons. First, the RMART
algorithm selects actions according to an ε-greedy method: the tolling control unit does not take the
recommended action at every time step, but also explores other actions with a small probability of 1 − ε.
Second, there is uncertainty on both the traffic demand side and the infrastructure supply side.
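The exploration behavior described above can be sketched as follows. The sketch follows this paper's convention, in which the recommended (greedy) action is taken with probability ε and another action is explored with the small probability 1 − ε. For the value update it uses one common average-reward variant in the spirit of R-learning, Q(s,a) ← Q(s,a) + α[r − ρ + max_b Q(s′,b) − Q(s,a)], with the average-reward estimate ρ updated only after greedy actions. This is an illustrative stand-in, not the exact RMART update; the class name `AverageRewardAgent`, the step sizes `alpha` and `beta`, the discrete state labels, and the use of negative travel time as the reward are all assumptions.

```python
import random
from collections import defaultdict
from typing import Hashable, Sequence


class AverageRewardAgent:
    """Hypothetical average-reward Q-learning agent for choosing toll rates.

    Illustrative sketch only, not the paper's exact RMART update.
    `greedy_prob` plays the role of epsilon in the text: the recommended
    (greedy) action is taken with probability epsilon, and another action
    is explored with probability 1 - epsilon.
    """

    def __init__(self, actions: Sequence[float], greedy_prob: float = 0.9,
                 alpha: float = 0.1, beta: float = 0.01, seed: int = 0):
        self.actions = list(actions)      # candidate toll rates (assumed)
        self.greedy_prob = greedy_prob
        self.alpha = alpha                # step size for Q-values
        self.beta = beta                  # step size for the average reward
        self.rho = 0.0                    # estimate of the average reward per step
        self.q = defaultdict(float)       # Q[(state, action)], defaults to 0.0
        self.rng = random.Random(seed)

    def best_action(self, state: Hashable) -> float:
        """Recommended action: the toll rate with the highest Q-value."""
        return max(self.actions, key=lambda a: self.q[(state, a)])

    def choose(self, state: Hashable) -> tuple[float, bool]:
        """Return (action, was_greedy) using the exploration rule above."""
        greedy = self.best_action(state)
        if len(self.actions) == 1 or self.rng.random() < self.greedy_prob:
            return greedy, True
        others = [a for a in self.actions if a != greedy]
        return self.rng.choice(others), False

    def update(self, state: Hashable, action: float, reward: float,
               next_state: Hashable, was_greedy: bool) -> None:
        """Relative-value update: each reward is measured against rho."""
        next_best = max(self.q[(next_state, a)] for a in self.actions)
        target = reward - self.rho + next_best
        self.q[(state, action)] += self.alpha * (target - self.q[(state, action)])
        if was_greedy:
            # Running average of rewards earned while following the policy.
            self.rho += self.beta * (reward - self.rho)


if __name__ == "__main__":
    agent = AverageRewardAgent(actions=[0.0, 0.5, 1.0])
    action, greedy = agent.choose("high_density")
    # Assumed reward: negative total travel time (s) on the toll links.
    agent.update("high_density", action, reward=-120.0,
                 next_state="medium_density", was_greedy=greedy)
    print(action, greedy, round(agent.rho, 3))
```

Because exploratory actions are still taken with probability 1 − ε even late in training, the Q-values and ρ keep moving slightly, which is consistent with the fluctuation seen after the 90th iteration.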
Figures 8 and 9 show the distribution of collected tolls on the toll lane of links 7 (freeway) and
14 (arterial road) at the last iteration. More toll is collected on link 14 than on link 7, and the traffic
switching from the non-toll lane to the toll lane is much denser on link 14 than on link 7. The result suggests that
traffic on arterial roads is more likely to use the toll lane than traffic on freeways. This finding seems
counterintuitive, because tolls are usually collected on freeways rather than on arterial roads.
However, it is worth noting that the result is obtained by applying the RL algorithm to maximize
the long-term Q-value. Thus, one insight from the comparison of road types is that, from the viewpoint
of reducing total travel time, the need for dynamic tolling may be higher for arterial roads than for freeways.
On the other hand, a real-world network contains many arterial roads, freeways, expressways, and other
types of road, which raises the question of which roads should be selected for dynamic distance-based
tolling so that the whole network benefits most. This is an interesting research topic, but beyond the focus
of this study.
Figure 8. Distribution of collected tolls on the toll lane in link 7 (freeway link).
Figure 9. Distribution of collected tolls on the toll lane in link 14 (arterial link).
Figures 10 and 11 show the traffic density distribution of the toll lane in links 7 and 14. One interesting
finding is that there are two distinct peaks during the simulation period. This pattern can be explained as follows:
because of the excess demand input and the deterioration of the outflow capacity during the peak periods,
heavy congestion can occur in the non-toll lane, forcing more traffic onto the toll lane. As shown in
Figures 10 and 11, traffic density on the toll lane is much higher during the two peak periods.
Figure 10. Traffic density distribution of the toll lane in link 7 (freeway link).
Figure 11. Traffic density distribution of the toll lane in link 14 (arterial link).
Figure 12. Travel time reduction of various combinations of states and toll rates.
4. CONCLUSIONS
In light of the need to address the road pricing problem in the CV environment, and motivated by
the research gap in distance-based dynamic tolling, this paper proposes a novel distance-based
dynamic tolling model. The notion of distance in this study refers to the actual distance that a vehicle
travels on the toll lane. This setting is technologically feasible in the CV environment, where vehicles are
able to communicate with the infrastructure, so the tolling control unit has access to each vehicle's entry
location. It is also feasible in current practice by deploying roadside sensors along the toll lane. In the
model, there are no specified entry points for tolls; in other words, vehicles are free to enter the toll lane
at any point along the lane. Based on the vehicle's entry location, the tolling control unit determines the
best toll for the vehicle. The model is built upon a stochastic network environment. The traffic demand
input and the saturation capacity of toll links are generated randomly according to a given probability
distribution (e.g., normal, binomial, or Poisson) to account for uncertainty on both the traffic demand
side and the supply side. For the underlying traffic flow model, we apply the stochastic CTM. Using the
fundamental diagram in CTM, we obtain the speed profile and then estimate the travel time. A binomial
logit model is applied to model the choice between the toll lane and the non-toll lane. Moreover, the
dynamic tolling problem is modeled as an MDP. Different metrics, for example, total network throughput,
delay time, and vehicular emissions, can readily be set as the optimization objectives in the modeling
framework. The tolling control unit is modeled as an intelligent agent that interacts with the stochastic
network environment by taking actions, namely determining distance-based toll rates for traffic. Thus,
the distance-based dynamic tolling problem is transformed into finding the optimal policy (a mapping
from traffic states to toll rate activations) that maximizes the long-term reward, measured in terms of
total travel time, number of stops, vehicular emissions, and so on. The optimal toll rate is obtained by
applying the RMART algorithm. In the case study, we reconstructed the Sioux Falls network and set
hypothetical toll links. Across simulation runs, the total travel time improves over the first 90 runs and
then fluctuates within a narrow range at about 75% of the value from the first simulation run. That the
result does not converge to a stable point is predictable; it is mainly due to the exploration setting of the
action-selection process and the uncertainty of the stochastic network. Furthermore, one insight from the
comparison of different road types in this specific example is that the need for dynamic tolling may be
higher for arterial roads than for freeways from the viewpoint of reducing total travel time.
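To make the lane-choice step in this summary concrete, the following minimal sketch (i) estimates each lane's travel time from cell densities using a triangular fundamental diagram, (ii) charges a distance-based toll proportional to the distance from the vehicle's entry point to the end of the toll lane, and (iii) computes the toll-lane choice probability with a binomial logit model. The triangular diagram parameters, the linear utility specification, the coefficients `beta_time` and `beta_cost`, the toll rate, and all function names are illustrative assumptions rather than the paper's calibrated model; only the 3000 m link length is taken from the link table.

```python
import math
from typing import Sequence


def cell_speed(density: float, free_flow_speed: float, wave_speed: float,
               capacity: float, jam_density: float) -> float:
    """Speed (m/s) of a cell from a triangular fundamental diagram, v = q / k.

    Units: density and jam_density in veh/m, capacity in veh/s,
    free_flow_speed and wave_speed in m/s.
    """
    if density <= 0.0:
        return free_flow_speed
    flow = min(free_flow_speed * density, capacity,
               wave_speed * (jam_density - density))
    flow = max(flow, 1e-9)            # keep speed positive at or beyond jam density
    return flow / density


def lane_travel_time(densities: Sequence[float], cell_length: float,
                     **fd_params: float) -> float:
    """Travel time (s) over consecutive cells: sum of cell_length / speed."""
    return sum(cell_length / cell_speed(k, **fd_params) for k in densities)


def distance_toll(rate_per_km: float, entry_position_m: float,
                  lane_end_m: float) -> float:
    """Toll proportional to the distance actually driven on the toll lane."""
    return rate_per_km * max(0.0, lane_end_m - entry_position_m) / 1000.0


def toll_lane_probability(tt_toll_s: float, tt_free_s: float, toll: float,
                          beta_time: float = 0.01,
                          beta_cost: float = 0.5) -> float:
    """Binomial logit: P(toll lane) = 1 / (1 + exp(-(V_toll - V_free)))."""
    v_toll = -beta_time * tt_toll_s - beta_cost * toll
    v_free = -beta_time * tt_free_s
    return 1.0 / (1.0 + math.exp(-(v_toll - v_free)))


if __name__ == "__main__":
    # Hypothetical triangular diagram: 25 m/s free-flow speed, 0.5 veh/s
    # (1800 vph) capacity, 0.1 veh/m jam density, 6.25 m/s backward wave.
    fd = dict(free_flow_speed=25.0, wave_speed=6.25,
              capacity=0.5, jam_density=0.1)
    # A vehicle enters the toll lane of a 3000 m link at 1000 m, so it
    # drives the remaining 2000 m (8 cells of 250 m) on one lane or the other.
    tt_toll = lane_travel_time([0.01] * 8, cell_length=250.0, **fd)
    tt_free = lane_travel_time([0.05] * 8, cell_length=250.0, **fd)
    toll = distance_toll(rate_per_km=2.0, entry_position_m=1000.0,
                         lane_end_m=3000.0)
    print(toll)  # 4.0
    print(round(toll_lane_probability(tt_toll, tt_free, toll), 3))
```

In this setup the tolling agent's action would correspond to `rate_per_km`; a vehicle entering later on the lane pays less, which is the behavior the distance-based scheme is meant to capture.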
This study is a starting point for research on distance-based dynamic tolling, and there are multiple
research directions along this stream. (i) As noted in Section 3.3, the performance of the algorithm
is influenced by the selection of states and toll rates. How to obtain the best combination of states and
toll rates remains an open problem. (ii) This study only models drivers' lane choice; route choice
behavior is not included. It may be interesting to revisit the problem in the context of dynamic traffic
assignment. (iii) The RL framework optimizes a single objective, namely total travel time. How to balance
the trade-offs among multiple objectives (e.g., total throughput, delay, and emissions) has not been
addressed. (iv) Given the size of real networks, it is infeasible to toll every road. How should the links
(arterial roads, freeways, or other types of road) be selected for dynamic distance-based tolling so that the
whole network benefits most? This may also be an interesting research topic. (v) The proposed RL
framework is a localized optimization framework; that is, it optimizes tolling performance on one link
only and does not consider coordination between links. Exploring a cooperative RL framework to
optimize the system-wide performance of tolling is also worthwhile for future research.
5. LIST OF ABBREVIATIONS
5.1. Abbreviations
ACKNOWLEDGEMENT
The authors are grateful for the constructive comments from three anonymous referees.
REFERENCES
1. Akiyama T, Okushima M. Implementation of cordon pricing on urban network with practical approach. Journal of
Advanced Transportation 2006; 40(2):221–248.
2. Lindsey R. Do economists reach a conclusion on highway pricing? The intellectual history of an idea. Econ Journal
Watch 2006; 3(2):292–379.
3. Ukkusuri SV, Karoonsoontawong A, Waller ST, Kockelman K. Congestion pricing technologies: a comparative
evaluation. In Transportation Research Trends, Chapter 4, Nova Publications: New York, 2007; 121–142.
4. De Palma A, Lindsey R. Traffic congestion pricing methodologies and technologies. Transportation Research Part
C: Emerging Technologies 2011; 19(6):1377–1399.
5. Lam WHK, Poon ACK, Ye RJ. Optimization of tunnel tolls in land use and transport planning. Journal of Advanced
Transportation 1996; 30(3):45–56.
6. Yang H, Huang H-J. Mathematical and Economic Theory of Road Pricing. Elsevier: Amsterdam and Boston, 2005.
7. Li H, Bliemer MCJ, Bovy PHL. Network reliability-based optimal toll design. Journal of Advanced Transportation
2008; 42:311–332.
8. Liu W, Yang H, Yin Y. Traffic rationing and pricing in a linear monocentric city. Journal of Advanced
Transportation 2012. DOI:10.1002/atr.1219.
9. Liu Z, Wang S, Meng Q. Toll pricing framework under logit-based stochastic user equilibrium constraints. Journal
of Advanced Transportation 2013. DOI:10.1002/atr.1255.
10. Chung BD, Yao T, Friesz TL, Liu H. Dynamic congestion pricing with demand uncertainty: a robust optimization
approach. Transportation Research Part B 2012; 46(10):1504–1518.
11. Stewart KJ, Ge YE. Optimising time-varying network flows by low-revenue tolling under dynamic user equilibrium.
European Journal of Transport and Infrastructure Research 2014; 14(1):30–45.
12. Zhang G, Ma X, Wang Y. Self-adaptive tolling strategy for enhanced high-occupancy toll lane operations. IEEE
Transactions on Intelligent Transportation Systems 2014; 15(1):306–317.
13. Halper E. A black box in your car? Some see a source of tax revenue. Los Angeles Times, 2013. Available
online: http://articles.latimes.com/2013/oct/26/nation/la-na-roads-black-boxes-20131027
14. Whitty JM. Oregon’s mileage fee concept and road user fee pilot program. Final report. Oregon Department of
Transportation, Salem, 2007.
15. Schultz M, Atkinson RD. Paying Our Way: A New Framework for Transportation Finance. Final Report, National
Surface Transportation Infrastructure Financing Commission, 2009.
16. Starr McMullen B, Zhang L, Nakahara K. Distributional impacts of changing from a gasoline tax to a vehicle-mile
tax for light vehicles: a case study of Oregon. Transport Policy 2010; 17(6):359–366.
17. Yin Y, Lou Y. Dynamic tolling strategies for managed lanes. Journal of Transportation Engineering 2009; 135(2):
45–52.
18. Lou Y, Yin Y, Laval JA. Optimal dynamic pricing strategies for high-occupancy/toll lanes. Transportation
Research Part C: Emerging Technologies 2011; 19(1):64–74.
19. Yang L, Saigal R, Zhou H. Distance-based dynamic pricing strategy for managed toll lanes. Transportation
Research Record: Journal of the Transportation Research Board 2012; 2283:90–99.
20. Daganzo CF. The cell transmission model: a dynamic representation of highway traffic consistent with the
hydrodynamic theory. Transportation Research Part B: Methodological 1994; 28:269–287.
21. Daganzo CF. The cell transmission model, part II: network traffic. Transportation Research Part B: Methodological
1995; 29:79–93.
22. Yang H, Huang HJ. Analysis of the time-varying pricing of a bottleneck with elastic demand using optimal control
theory. Transportation Research Part B 1997; 31(6):425–440.
23. De Palma A, Lindsey R. Private toll roads: competition under various ownership regimes. The Annals of Regional
Science 2000; 34(1):13–35.
24. Lin DY, Unnikrishnan A, Waller T. A dual variable approximation based heuristic for dynamic congestion pricing.
Networks and Spatial Economics 2011; 11(2):271–293.
25. Lighthill MJ, Whitham GB. On kinematic waves. II. A theory of traffic flow on long crowded roads. Proceedings of
the Royal Society of London. Series A, Mathematical and Physical Sciences 1955; 229:317.
26. Richards PI. Shock waves on the highway. Operations Research 1956; 4:42.
27. Lo HK, Szeto WY. A cell-based variational inequality formulation of the dynamic user optimal assignment problem.
Transportation Research Part B: Methodological 2002; 36:421–443.
28. Szeto WY, Lo HK. A cell-based simultaneous route and departure time choice model with elastic demand.
Transportation Research Part B: Methodological 2004; 38:593–612.
29. Han LS, Ukkusuri S, Doan K. Complementarity formulations for the cell transmission model based dynamic user
equilibrium with departure time choice, elastic demand and user heterogeneity. Transportation Research Part B:
Methodological 2011; 45:1749–1767.
30. Ukkusuri SV, Han L, Doan K. Dynamic user equilibrium with a path based cell transmission model for general
traffic networks. Transportation Research Part B: Methodological 2012; 46(10):1657–1684.
31. Gomes G, Horowitz R, Kurzhanskiy AA, Varaiya P, Kwon J. Behavior of the cell transmission model and
effectiveness of ramp metering. Transportation Research Part C: Emerging Technologies 2008; 16(4):485–513.
32. Lo HK. A novel traffic signal control formulation. Transportation Research Part A: Policy and Practice 1999;
33:433–448.
33. Lo HK, Chang E, Chan YC. Dynamic network traffic control. Transportation Research Part A: Policy and Practice
2001; 35:721–744.
34. Ukkusuri SV, Ramadurai G, Patil G. A robust transportation signal control problem accounting for traffic dynamics.
Computers and Operations Research 2010; 37(5):869–879.
35. Wong CK, Wong SC, Lo HK. A spatial queuing approach to optimize coordinated signal settings to obviate gridlock
in adjacent work zones. Journal of Advanced Transportation 2010; 44(4):231–244.
36. Zhao Y, Kockelman K. The propagation of uncertainty through travel demand models: an exploratory analysis.
Annals of Regional Science 2002; 36:145–163.
37. Krishnamurthy S, Kockelman K. Propagation of uncertainty in transportation land use models: investigation of
DRAM-EMPAL and UTTP predictions in Austin, Texas. In Transportation Research Record: Journal of the
Transportation Research Board, No. 1831, vol. 24, TRB, National Research Council: Washington, DC, 2003; 219–229.
38. Siu BWY, Lo HK. Doubly uncertain transportation network: degradable capacity and stochastic demand. European
Journal of Operational Research 2008; 191:166–181.
39. Ng MW, Kockelman K, Waller ST. Relaxing the multivariate normality assumption in the simulation of
transportation system dependencies. Transportation Letters: The International Journal of Transportation Research
2010; 2(2):63–74.
40. Sumalee A, Zhong RX, Pan TL, Szeto WY. Stochastic cell transmission model (SCTM): A stochastic dynamic
traffic model for traffic state surveillance and assignment. Transportation Research Part B: Methodological 2011;
45(3):507–533.
41. Sutton RS, Barto AG. Reinforcement Learning: An Introduction. A Bradford Book, MIT Press: Boston, 1998.
42. Gosavi A. Simulation-Based Optimization: Parametric Optimization Techniques and Reinforcement Learning.
Springer: Norwell, Massachusetts, 1997.
43. Aziz H, Zhu F, Ukkusuri SV. Reinforcement learning based signal control using R-Markov Average Reward
Technique (RMART) accounting for neighborhood congestion information sharing. Proceedings of the 92nd
Transportation Research Board Meeting, National Academies, Washington, DC, January 2013.
44. Tsitsiklis JN, Van Roy B. On average versus discounted reward temporal-difference learning. Machine Learning
2002; 49(2-3):179–191.
45. Tadepalli P, Ok D. Model-based average reward reinforcement learning. Artificial Intelligence 1998; 100(1-2):177–224.