20423195, 2015, 2, Downloaded from https://onlinelibrary.wiley.com/doi/10.1002/atr.1276 by Florida International University, Wiley Online Library on [06/11/2022].

See the Terms and Conditions (https://onlinelibrary.wiley.com/terms-and-conditions) on Wiley Online Library for rules of use; OA articles are governed by the applicable Creative Commons License
JOURNAL OF ADVANCED TRANSPORTATION
J. Adv. Transp. 2015; 49:247–266
Published online 5 June 2014 in Wiley Online Library (wileyonlinelibrary.com). DOI: 10.1002/atr.1276

A reinforcement learning approach for distance-based dynamic tolling in the stochastic network environment

Feng Zhu and Satish V. Ukkusuri*


Lyles School of Civil Engineering, Purdue University, West Lafayette, IN 47907-2051, U.S.A.

SUMMARY
This paper proposes a novel dynamic tolling model that is based on distance and accounts for uncertain traffic demand and supply conditions. The distance-based tolling controller is modeled as an intelligent agent that interacts dynamically with the stochastic network environment by taking actions, namely, choosing distance-based tolling rates for vehicles. The distance-based tolls are determined according to various metrics, for example, total traffic flow throughput, delay time, and vehicular emissions, which are set as objectives in the modeling framework. The optimal tolling rate is determined by an R-Markov Average Reward Technique based reinforcement learning algorithm. In the numerical case study, we test the proposed tolling scheme on a benchmark test network, the Sioux Falls network, where specified links are candidate toll links. The results show that the total travel time on tolling links is reduced by 25% over the simulation runs. Copyright © 2014 John Wiley & Sons, Ltd.

KEY WORDS: dynamic tolling; stochastic network; connected vehicle; reinforcement learning

1. INTRODUCTION

Deterioration of traffic conditions in urban areas has long hampered cities' economic development and quality of life. Especially in large urban areas, because of uncertain and growing traffic demand, congestion imposes a burden on economic activities, work and non-work travel, and air quality. On the demand side, many traffic control and management strategies have been proposed to relieve urban congestion. Among them, road pricing has been shown to be an effective congestion mitigation strategy, not only theoretically but also in real-world implementations [1, 2]. Practical implementations of road pricing can be found in big cities around the world. For instance, Singapore launched the Electronic Road Pricing scheme in 1998, which charges a congestion fee every time a user crosses the cordon area. London introduced a zonal congestion pricing scheme in 2003, where a daily fee is charged for every vehicle within the congestion charge zone.
In the practical implementation of road pricing, wireless communication technology has played a significant role [3, 4]. For example, in electronic toll collection systems, dedicated short range communications (DSRC) technology is used in automated vehicle identification systems to collect tolls. DSRC technology has great potential in the area of intelligent transportation systems (ITS), as it enables the wireless exchange of information between vehicles, as well as between vehicles and roadside infrastructure. The development of wireless communication technology has received much attention because of the connected vehicle (CV) initiative. This technology was primarily developed to improve traffic safety (collision avoidance) at intersections; a secondary aim is to alleviate congestion and reduce vehicular emissions. Acknowledging this potential, the ITS program of the U.S. Department of Transportation has emphasized CV research in the

*Correspondence to: Satish V. Ukkusuri, Lyles School of Civil Engineering, Purdue University, CIVL G167D, 550
Stadium Mall Drive West Lafayette, IN 47906, U.S.A. E-mail: sukkusur@purdue.edu




ITS Strategic Plan (2010–2014). The CV environment facilitates communication in which vehicles can talk to each other (vehicle-to-vehicle), to infrastructure components (vehicle-to-infrastructure, V2I), and infrastructure components can talk to each other (infrastructure-to-infrastructure). CV has also received attention in Europe, where it is known as car-to-car and car-to-X technology. Although CV has not yet been realized in real-world transportation systems, many auto companies are expending significant efforts to produce vehicles with communication features. In addition, many test beds are ongoing in the USA, Europe, and Japan.
A key motivation of this study is to address the dynamic tolling problem under the CV environment. Recent advances in the CV environment offer useful technologies for the detection and acquisition of high-fidelity data that can be used for more efficient traffic control strategies. In particular, under the CV environment, the tolling control unit may have access to the traversing information of surrounding vehicles, for example, origins, destinations, paths taken, speed, distance traveled, trip purpose, and so on. As shown later in the paper, distance-based tolling only requires the entry point information of the vehicle, which can be obtained through V2I technology. In current practice, it is also plausible to obtain a vehicle's traversing information by installing wireless communication devices in the vehicle and deploying roadside sensors along the toll lane. Thus, distance-based dynamic tolling has the potential to replace traditional road pricing (facility based, cordon based, or zone based) under the CV environment in the future, as well as under specified settings in current practice.
In the literature of tolling problems, the static case has been well studied in the past [5, 6]. Recently, Li et al. [7] investigated network travel time reliability by proposing a bi-level optimal toll design model. Liu et al. [8] studied morning commuters' modal choice behavior under different types of rationing and pricing strategies. Liu et al. [9] further captured travelers' route choice behavior by applying the logit-based stochastic user equilibrium principle. Meanwhile, the dynamic tolling problem has also attracted tremendous interest in the literature [6]. To name a few, Chung et al. [10] formulated the dynamic congestion pricing problem by considering demand uncertainty, where an optimization approach based on a particle swarm algorithm is proposed to obtain robust solutions. Stewart and Ge [11] extended the concept of minimal-revenue congestion pricing from the static case to the dynamic case, where traffic flows are time varying and follow the dynamic user equilibrium principle. From a different perspective, Zhang et al. [12] modeled traffic flow by applying the kinematic wave model and developed a self-adaptive tolling strategy for the high occupancy toll lane system.
However, it is worthwhile to note that there is very limited literature in the area of distance-based dynamic tolling, and another motivation of this study is to address this problem. In a recent work, De Palma and Lindsey [4] summarized distance-based tolling in Europe. However, all of those schemes collect charges between fixed terminals and are designated for heavy goods vehicles; they are not based on flexible distance and do not apply to passenger vehicles. Moreover, to the best of our knowledge, dynamic tolling based on flexible distance has never been studied before. One reason is the lack of technology to realize a distance-based tolling concept: it is difficult to track the dynamic distance traveled by vehicles because of technological and privacy restrictions. However, this technology barrier may not exist in the very near future, given the extensive research and implementation efforts in CV. For instance, a California startup called True Mileage [13] is providing a "black box" to measure the mileage of a car without raising privacy concerns. The concept of distance-based tolling is distinct from the vehicle miles traveled (VMT) based fee in the USA. A VMT fee (or tax) charges road users based on miles traveled; it is a potential replacement or supplement to the current motor fuel tax, and its main purpose is to raise revenue and finance infrastructure maintenance [14–16]. There is potential for combining distance-based tolling and VMT fees in the future, but currently there is a clear distinction between the two concepts.
In practice, one way to implement distance-based tolling is by making use of underused high occupancy vehicle (HOV) lanes. The concept is known as managed lanes, where single occupancy vehicles are allowed to use the HOV lanes for a toll. Yin and Lou [17] introduced a reactive self-learning approach to determine time-varying tolls for operating managed toll lanes. The approach learns the parameters of motorists' willingness to pay in a sequential fashion. Following this line of work, Lou et al. [18] extended the idea to a lane slip ramp configuration and a more efficient


self-learning proactive approach. However, both papers only characterize a corridor with several
entry points to collect tolls. The entry points are fixed, and the determined tolls are only optimal
for localized entry points. Coordination between entry points is not considered. Yang et al. [19]
proposed the distance-based dynamic tolling to address the managed lane problem. However, in
the model, vehicles can only enter the toll lane through specified toll entrances and leave the system
from specified toll exits.
This paper sets out to fill the research gap in the area of road pricing by proposing a novel distance-based dynamic tolling model. Two features of the proposed model are as follows: (i) distance based. The notion of "distance" in this study refers to the actual distance traveled by the vehicle in the toll lane. In other words, vehicles are free to enter the toll lane at any point along the lane; we do not impose specific entry points for tolls. This setting is technologically possible under the CV environment and is also plausible in current practice by deploying roadside sensors along the toll lane. (ii) Dynamic toll rate. Based on the vehicle's entry location, the tolling control unit determines the best toll rate for the vehicle. Hence, the final toll equals the product of the dynamic toll rate and the distance traveled.
The model is built upon a stochastic network environment. The traffic demand input is generated randomly according to a probability distribution to account for uncertainty on the traffic demand side. Similarly, the saturation capacity of toll links is also set randomly according to a certain probability distribution. For the underlying traffic flow model, we use the well-accepted cell transmission model (CTM) [20, 21]. By making use of the fundamental diagram in CTM, we are able to obtain the speed profile at the cell and link levels and then estimate the travel time. Notice that distance in CTM is not strictly continuous; however, in contrast to fixed distance (tolls collected between toll stations), we consider the distance obtained from CTM as flexible.
Then, we compute the choice between a non-toll and a toll road using a utility model based on travel time and tolls. A binomial logit model is applied to model the lane choice. Moreover, the dynamic tolling problem is modeled as a Markov decision process (MDP). Different metrics, for example, total network throughput, delay time, and vehicular emissions, can be set as the optimization objectives in the modeling framework. The control unit of tolling is modeled as an intelligent agent interacting with the stochastic network environment by taking actions, namely, determining distance-based tolling rates for vehicles. Thus, the distance-based dynamic tolling problem is equivalent to finding the optimal policy (a mapping between traffic states and toll rate activations) that results in the maximum long-term reward, measured in terms of total travel time, number of stops, vehicular emissions, and so on. The optimal tolling rate is obtained by applying an R-Markov Average Reward Technique (RMART) based reinforcement learning algorithm. Notice that the dynamic tolling problem (tolls varying over time instead of distance) has been studied extensively in the literature [10, 22–24]. However, most of these studies, if not all, focus on solving the problem analytically and do not consider uncertainties in both traffic demand and capacity; moreover, they are not readily applicable to the distance-based dynamic tolling problem. The advantages of applying the reinforcement learning (RL) algorithm are as follows: (i) the dynamic tolling problem is formulated as an MDP, which fits well with the RL approach; (ii) uncertainties in both traffic demand and capacity are incorporated in the traffic flow environment; and (iii) RL is an on-line learning algorithm, providing a real-time control mechanism to maximize the expected reward over the long run.
In the case study, we reconstructed a simplified version of the Sioux Falls network and imposed tolls on specified links. Results from the experiment show that the total travel time decreases over the simulation runs and finally stabilizes at around 75% of the total travel time of the first simulation run, where tolling is not applied. Moreover, another interesting finding from the experiment is that, for this specific example and under the current experimental setup, arterial roads are more likely candidates for distance-based tolls than freeway links in reducing total travel time.
The rest of the paper is structured as follows. Section 2 is devoted to the methodology of the study, including the underlying traffic flow model (stochastic CTM), the toll lane choice model, and the reinforcement learning approach. The details of the RMART algorithm are also presented. Section 3 provides a case study applying the proposed model on the Sioux Falls test network. Section 4 concludes the paper and discusses directions for future research.


NOTATION:

Sets

CR set of origin cells


CS set of destination cells
CO set of ordinary cells
CD set of diverging cells
CM set of merging cells
Γ⁻¹(i) set of predecessors of cell i
Γ(i) set of successors of cell i

Parameters

W shockwave speed
V free-flow speed
S saturation flow rate
C′ length of a cell
d_J jam density
Ni(m, t) maximum number of vehicles (or holding capacity) allowed in cell m of lane i at time t
Di(t) fixed mean demand input of lane i at time t

Variables

Qi(m, t) inflow or outflow capacity of cell m at lane i at time t


di(t) demand input of lane i at time t
pi(t) generated probability of lane i at time t (to account for uncertainty of demand or capacity), drawn from some probability distribution
pi(m, t) probability of traffic switching from the non-toll lane i to the toll road at cell m at time t
Ui(m, t) utility of cell m at lane i at time t
TTi(m, t) travel time from cell m to the exit of lane i
δi,m(t) toll of cell m of lane i at time t
xi(m, t) cell occupancy (number of vehicles) of cell m at lane i at time t
f^i_{m,n}(t) flow from cell m to cell n at lane i at time t
qi(m, t) flow of cell m at lane i at time t
ki(m, t) density of cell m at lane i at time t
vi(m, t) speed of cell m at lane i at time t
ai(m, t) action (toll rate) at cell m of lane i at time t

2. METHODOLOGY

To begin, the assumptions in our modeling framework are as follows. (i) In modeling traffic propagation, we apply the CTM, an advanced spatial queuing model with the desirable properties of capturing spatial queuing and shockwave propagation. (ii) In the toll lane choice model, the toll lane cost is assumed to be a function of the travel time and the distance-based toll. (iii) To determine the distance-based toll, we assume the toll controller has access to the location information of the vehicle dynamically.


2.1. Traffic flow modeling


The CTM [20, 21] is one of the most widely used network loading models. It provides a convergent approximation to a simplified version of the LWR (Lighthill and Whitham, 1955; Richards, 1956) hydrodynamic model [25, 26], whereby the fundamental diagram of traffic flow and density is assumed to be a piecewise linear function. The model is capable of capturing traffic propagation phenomena such as spillback, kinematic waves, and physical queues. CTM has been used for various dynamic problems in the last decade. Among the wealth of literature, Lo and Szeto [27] embedded CTM into the user equilibrium dynamic traffic assignment problem using the variational inequality approach. Szeto and Lo [28] extended the variational inequality formulation to capture the simultaneous route and departure time choice problem with elastic demands. Han et al. [29] and Ukkusuri et al. [30] formulated the cell-based dynamic user equilibrium problem using complementarity theory. Besides network analysis, CTM has also been applied in the area of traffic control management [31–35].
In CTM, a series of homogeneous cells is used to represent the road network, and time is discretized into steps. Figure 1 shows the cell representation of a simple road segment. The length of each cell C′ is set to be the distance traveled at the free-flow speed V in one time step ξ, that is, C′ = Vξ. Therefore, for the same time step, a faster free-flow speed leads to a longer cell length, and vice versa. When there is no congestion, vehicles move from one cell to the next in one time step.
Remark: Note that CTM is a discretized version of the LWR model. If the cell size is made infinitely small, CTM reduces to a continuous LWR-type model. Nevertheless, distance is not strictly continuous in the context of CTM. In contrast to fixed distance (tolls collected between toll stations), we consider the distance obtained from CTM as flexible, because the distance moved by flow depends on the cell length and the time discretization of the simulation.
The fundamental diagram (flow-density relationship) is approximated by a piecewise linear model as
shown in Figure 2, while the LWR [25, 26] process is approximated by this set of recursive equations:
Source cells:

x_i(m, t) = d_i(t - 1) + x_i(m, t - 1) - f^i_{m,n}(t - 1),  ∀m ∈ C_R, n ∈ Γ(m)   (1)

Figure 1. Cell representation of a road segment.

Figure 2. Flow-density fundamental diagram for cell transmission model simulation.


Sink cells:

x_i(m, t) = x_i(m, t - 1) + f^i_{k,m}(t - 1),  ∀m ∈ C_S, k ∈ Γ⁻¹(m)   (2)

Ordinary/merging/diverging cells:
x_i(m, t) = x_i(m, t - 1) + Σ_{k ∈ Γ⁻¹(m)} f^i_{k,m}(t - 1) - Σ_{n ∈ Γ(m)} f^i_{m,n}(t - 1),  ∀m ∈ C_{O,M,D}   (3)

Ordinary cell connectors:


 
f^i_{m,n}(t) = min{ x_i(m, t), Q_i(m, t), Q_i(n, t), (W/V)[N_i(n, t) - x_i(n, t)] },  ∀m ∈ C_O, n ∈ Γ(m)   (4)

Diverging cell connectors:


 
f^i_{m,n}(t) = min{ ρ_i(m, t) x_i(m, t), Q_i(m, t), Q_i(n, t), (W/V)[N_i(n, t) - x_i(n, t)] },  ∀m ∈ C_D, n ∈ Γ(m)   (5)

Merging cell connectors:


 
f^i_{m,n}(t) = min{ x_i(m, t), Q_i(m, t), Q_i(n, t), ρ_i(m, t)(W/V)[N_i(n, t) - x_i(n, t)] },  ∀m ∈ C_M, n ∈ Γ(m)   (6)
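For illustration, one CTM time step over a chain of ordinary cells, following Equations (3) and (4), can be sketched as follows. The function name, the list-based data layout, and the simplified boundary handling (a flowless first and last cell) are our assumptions, not the paper's implementation:

```python
# Sketch of one CTM time step for a single lane of ordinary cells,
# mirroring Equations (3) and (4). x[m] is the occupancy of cell m,
# Q[m] its flow capacity, N[m] its holding capacity; W and V are the
# shockwave and free-flow speeds.
def ctm_step(x, Q, N, W, V):
    M = len(x)
    # Flow over each ordinary connector m -> m+1, Equation (4)
    f = [min(x[m], Q[m], Q[m + 1], (W / V) * (N[m + 1] - x[m + 1]))
         for m in range(M - 1)]
    new_x = x[:]
    for m in range(M):
        inflow = f[m - 1] if m > 0 else 0.0   # no upstream connector at cell 0
        outflow = f[m] if m < M - 1 else 0.0  # last cell has no downstream connector here
        new_x[m] = x[m] + inflow - outflow    # conservation, Equation (3)
    return new_x
```

With Q = 2 vehicles per step, a platoon of five vehicles in the first cell discharges two vehicles per step, as Equation (4) prescribes.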

In order to capture the uncertainty from traffic demand, we have

d_i(t) = p_i(t) D_i(t)   (7)

Note that D_i(t) is a fixed value, representing the predefined demand, and p_i(t) denotes a random value generated from a certain probability distribution.
In order to capture the uncertainty from the infrastructure supply side, we have

Q_i(m, t) = Sξ, ∀m ∉ ending cells;  Q_i(m, t) = p_i(t)Sξ, ∀m ∈ ending cells   (8)

Note that S is a fixed value, representing the saturation flow, and p_i(t) denotes a random value within (0, 1), generated from a certain probability distribution. The choice of probability distribution is based on the type of uncertain event, for example, a highway crash, lane closure, or work zone. Based on empirical data, the typical distribution assumptions for these events include the multivariate normal distribution, the lognormal distribution, and the multivariate lognormal distribution [36–39]. A similar
idea to describe the stochastic traffic network environment with CTM is also discussed in [40]. The
advantage of CTM lies in that it covers the whole range of traffic dynamics including queue formation,
dissipation, and kinematic wave. With the density or cell occupancy determined, the mean speed at the
cell level or the link segment level can be derived, making it applicable for travel time estimation.
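The stochastic inputs of Equations (7) and (8) can be sampled as in the sketch below. The paper leaves the distribution of p_i(t) open (lognormal is among those cited [36–39]); the lognormal parameters and the truncation to (0, 1] used here are purely illustrative assumptions:

```python
import random

def stochastic_demand(D_mean, rng):
    """d_i(t) = p_i(t) * D_i(t), Equation (7); p_i(t) capped at 1."""
    p = min(rng.lognormvariate(-0.1, 0.2), 1.0)  # illustrative parameters
    return p * D_mean

def cell_capacity(S, xi, is_ending_cell, rng):
    """Q_i(m, t) per Equation (8): only ending cells see capacity uncertainty."""
    if not is_ending_cell:
        return S * xi
    p = min(rng.lognormvariate(-0.1, 0.2), 1.0)  # random p in (0, 1]
    return p * S * xi
```

The capacity of a non-ending cell stays fixed at Sξ, while an ending cell's capacity is scaled by the random draw, matching the two cases of Equation (8).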


Based on the fundamental equation of traffic flow, the traffic speed of each cell can be calculated as

v_i(m, t) = q_i(m, t) / k_i(m, t)   (9)

Moreover, from the fundamental diagram of CTM, we obtain

q_i(m, t) = min{ V k_i(m, t), S, W[d_J - k_i(m, t)] }   (10)

Substituting Equation (10) into Equation (9), we obtain


 
v_i(m, t) = q_i(m, t) / k_i(m, t) = min{ V, S / k_i(m, t), W[d_J - k_i(m, t)] / k_i(m, t) }   (11)

2.2. Toll lane choice model


The toll lane choice model determines the traffic demand input of toll roads, that is, how many travelers would choose to take the toll road. Normally, toll roads are link based, meaning that all lanes of the link are toll lanes. Here, we are interested in the more general case in which some lanes of a link are tolled while others are not. This concept has been implemented as managed lanes in the real world [17–19]. Without loss of generality, consider a link with two lanes, one a non-toll lane and the other a toll lane. Because we use the CTM as the underlying traffic flow model, we can further discretize the non-toll lane and toll lane into homogeneous cells, as shown in Figure 3.
Each vehicle on the non-toll lane has an opportunity to enter the toll lane at various locations. Previous studies usually specify entry points for the toll lane, so vehicles can enter the toll lane only via those points. In this study, we impose no such restriction: vehicles on the non-toll lane can enter the toll lane at any point along the lane. As discussed earlier, this is possible thanks to the V2I technology that allows vehicles to communicate with nearby roadside infrastructure. To model the behavior of traffic switching to the toll lane, we apply the binomial logit model presented below.

p_i(m, t) = e^{θU_i(m, t)} / [ e^{θU_i(m, t)} + e^{θU_{i′}(m, t)} ],  i, i′ ∈ I   (12)

where I denotes the set of lanes within the same link and θ denotes a scale parameter. U_i(m, t) and U_{i′}(m, t) denote the cost of taking the toll lane and the non-toll lane from cell m at time t.
Note that the lane choice model is distinct from a route choice model; it is similar to a lane-changing model, where traffic changes from the non-toll lane to the toll lane. Also note that the lane utility is a complex perceptual concept. It can be based on different measurements, for example, travel time of the lane, tolls, vehicular emissions, and so on. In this study, the toll lane cost U_i(m, t) is assumed to be a function of the travel time from cell m to the exit of the link and the distance-based toll, while the non-toll lane cost U_{i′}(m, t) is assumed to be a function of the travel time from cell m to the exit of
Figure 3. Cell representation of a link with non-toll lane and toll lane.


the link only. Thus we have

U_i(m, t) = TT_i(m, t) + δ_{i,m}(t)   (13)

U_{i′}(m, t) = TT_{i′}(m, t)   (14)

Note that the distance from cell m to the exit of lane i can be written as (N_i - m)C′, where N_i denotes the total number of cells in lane i. Moreover, as a_i(m, t) denotes the tolling rate per unit length for cell m at time t, the distance-based toll is calculated as

δ_{i,m}(t) = μ(N_i - m)C′ a_i(m, t)   (15)

where μ denotes a scale parameter.
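Equations (12) through (15) can be combined into a single switching-probability sketch. The negative sign in the exponent (equivalently, taking θ negative in Equation (12)) is our assumption so that a higher cost lowers the switching probability; the function and argument names mirror the paper's notation but are otherwise ours:

```python
import math

def switch_probability(tt_toll, tt_free, toll_rate, m, N_i, C_len, theta, mu):
    """Probability of switching to the toll lane at cell m."""
    toll = mu * (N_i - m) * C_len * toll_rate  # distance-based toll, Equation (15)
    U_toll = tt_toll + toll                    # toll lane cost, Equation (13)
    U_free = tt_free                           # non-toll lane cost, Equation (14)
    # Binomial logit, Equation (12); negative sign assumed since U is a cost
    e_toll = math.exp(-theta * U_toll)
    e_free = math.exp(-theta * U_free)
    return e_toll / (e_toll + e_free)
```

With equal travel times and a zero toll rate the two lanes are equally attractive and the probability is 0.5; any positive toll pushes it below 0.5.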


In order to obtain the travel time of toll lane i, we need to calculate the speed traveling from cell m to the exit of lane i. According to (11), we first need to compute the traffic density from cell m to the exit, which is k_i(m, t) = Σ_{m′=m}^{N_i} x_i(m′, t) / ((N_i - m)C′). Thus, we have

v_i(m, t) = min{ V, S / k_i(m, t), W[d_J - k_i(m, t)] / k_i(m, t) }   (16)

Further, we have

TT_i(m, t) = min{ (N_i - m)C′ / V,  Σ_{m′=m}^{N_i} x_i(m′, t) / S,  Σ_{m′=m}^{N_i} x_i(m′, t) / ( W[d_J - Σ_{m′=m}^{N_i} x_i(m′, t) / ((N_i - m)C′)] ) }   (17)

Similarly, for the non-toll lane i′, we also obtain

TT_{i′}(m, t) = min{ (N_{i′} - m)C′ / V,  Σ_{m′=m}^{N_{i′}} x_{i′}(m′, t) / S,  Σ_{m′=m}^{N_{i′}} x_{i′}(m′, t) / ( W[d_J - Σ_{m′=m}^{N_{i′}} x_{i′}(m′, t) / ((N_{i′} - m)C′)] ) }   (18)

By substituting Equations (15), (17), and (18) into Equations (13) and (14), we derive the complete expression of the utility for both the toll lane and the non-toll lane, and finally obtain the probability of traffic flow switching to the toll lane. For the sake of brevity, the full expressions are omitted here.
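A sketch of the remaining-travel-time computation, Equation (17) as printed, from the aggregate occupancy of the cells still ahead of cell m; the argument list and the list-based occupancy sum are our assumptions:

```python
def remaining_travel_time(occupancies, m, N_i, C_len, V, S, W, d_jam):
    """Travel time from cell m to the lane exit, Equation (17).

    occupancies -- occupancies x_i(m', t) of the cells from m to the exit.
    """
    dist = (N_i - m) * C_len            # remaining distance (N_i - m)C'
    total_x = sum(occupancies)          # aggregate occupancy ahead of cell m
    k = total_x / dist                  # aggregate density, as defined above (16)
    return min(dist / V, total_x / S, total_x / (W * (d_jam - k)))
```

The three candidate terms are the remaining distance divided by each speed branch of the fundamental diagram in Equation (16).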

2.3. Reinforcement learning for optimal dynamic tolling


Notation

d_i^t density of lane i at time t
s_i^t current state of lane i at time t
s_i^t′ next state of lane i at time t
a_i(m, t) action (toll rate) at cell m of lane i at time t
r_i^t(s_i^t, a_i(m, t), s_i^t′) observed reward when the agent takes action a_i(m, t) in state s_i^t and moves to state s_i^t′ at lane i
ρ(s_i^t, a_i(m, t)) average reward of state-action pair (s_i^t, a_i(m, t)) at lane i
Q(s_i^t, a_i(m, t)) Q-value of state-action pair (s_i^t, a_i(m, t)) at lane i
α(k) learning rate for the Q-values at the kth iteration
β(k) learning rate for the average reward at the kth iteration
γ discount factor for reward value
ε greedy value
Reinforcement learning techniques have been effectively applied to solve practical problems involving optimal control and optimization in different disciplines of science and engineering. In general, any method applying sampling-based techniques to solve optimal control problems or their variants can be defined as RL. The agent (the control unit of tolling) interacts with the environment (the system or any representative model) by taking a certain action, and the environment reacts to that action by changing its state. In addition, the environment also signals to the agent how much reward it gains by performing that action. The reward gives a measure of the effectiveness of the actions taken by the agent in reaching its optimization goals. In the context of the dynamic tolling problem, the tolling controller is the agent, and the traffic network (which is dynamic and random) is the environment.
A reinforcement learning system has three specific components: state, action, and reward. The following sections define the state, action, and reward for the proposed RL algorithm.

2.3.1. State of the tolling controller


The state of the tolling controller is a measure of the real-time traffic environment, or the evolution of traffic flow. The evolution of traffic flow is a continuous process; in other words, the state space of the traffic environment has infinite dimensions. Because of the "curse of dimensionality", a high-dimensional state increases the computational complexity of solving the problem. Here, we characterize the state of the tolling controller by four discrete congestion levels: free flow if the current density is less than one fourth of the jam density, fairly congested if it is less than half, moderately congested if it is less than three fourths, and heavily congested otherwise. Specifically, the discretization of states for a toll lane or non-toll lane is as follows.
s_i^t = 1, if d_i^t ≤ 0.25 d_J
        2, else if d_i^t ≤ 0.50 d_J
        3, else if d_i^t ≤ 0.75 d_J
        4, else if d_i^t ≤ d_J   (19)

Note that we have chosen to discretize state into four categories, but this setting is flexible. Depend-
ing on the size of the network, one can choose to discretize state into more than four categories.
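As an illustration, the four-level discretization of Equation (19) can be sketched in a few lines of code. The function name and the jam density value below are assumptions for illustration, not part of the paper's implementation.

```python
def discretize_state(density: float, jam_density: float, n_levels: int = 4) -> int:
    """Map a lane density to one of n_levels congestion states, per Eq. (19)."""
    for level in range(1, n_levels + 1):
        # each state covers an equal fraction (1/n_levels) of the jam density
        if density <= level * jam_density / n_levels:
            return level
    return n_levels  # densities at or (numerically) above the jam density

# With an assumed jam density of 200 veh/km:
# 40 veh/km -> state 1 (free flow), 120 veh/km -> state 3 (moderately congested)
```

Increasing `n_levels` gives the finer discretizations explored in the sensitivity analysis of Section 3.3.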

2.3.2. Actions of the tolling controller


The tolling controller’s action is to dynamically assign toll rates to traffic switching to the toll
lane. In principle, the toll rate ai(m, t) would be a continuous function of the dynamic environment,
so the action space would be an infinite set. To avoid the "curse of dimensionality", we characterize
the toll rates in four categories as follows.
$$
a_i(m,t) =
\begin{cases}
\sigma_1, & \text{if } a_i(m,t) = 1 \\
\sigma_2, & \text{if } a_i(m,t) = 2 \\
\sigma_3, & \text{if } a_i(m,t) = 3 \\
\sigma_4, & \text{if } a_i(m,t) = 4
\end{cases}
\qquad (20)
$$

where σ_1, σ_2, σ_3, and σ_4 denote the preset toll rate levels. Similar to the case of states, the
discretization of actions is also flexible: one can choose to discretize toll rates into more than four categories.
Reinforcement learning algorithms in general require a balance between exploitation and
exploration when selecting actions. The simplest selection rule is to always choose the action (or one
of the tied actions) with the highest estimated state-action value (completely greedy behavior).
In other words, the agent always maximizes the immediate reward using its current knowledge,
without any attempt to explore other possible actions. To balance exploitation and exploration,
we apply the ε-greedy method [41]. In this method, the agent usually chooses the action with the
maximum state-action value, except in a small fraction of cases where a random action is chosen to
explore other possibilities. The probability of this random behavior is ε, so the probability of
selecting the greedy action converges to greater than 1 − ε. Note that the advantage of ε-greedy
methods over purely greedy methods is highly dependent on the type of problem. For instance, with
higher variance in the reward values, ε-greedy methods might perform better.
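A minimal sketch of the ε-greedy selection described above; the dictionary Q-table layout and the example values are illustrative assumptions.

```python
import random

def epsilon_greedy(q, state, actions, epsilon):
    """Pick a random action with probability epsilon; otherwise the greedy one."""
    if random.random() < epsilon:
        return random.choice(actions)            # exploration
    # exploitation: highest estimated state-action value (ties broken by order)
    return max(actions, key=lambda a: q.get((state, a), 0.0))

# Example Q-table over the four toll-rate actions in congestion state 2:
q = {(2, 1): 0.1, (2, 2): 0.5, (2, 3): 0.3, (2, 4): 0.0}
choice = epsilon_greedy(q, state=2, actions=[1, 2, 3, 4], epsilon=0.1)
```

With ε = 0 this reduces to the completely greedy rule; with ε = 1 it is pure random exploration.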

2.3.3. Reward function


The definition of the reward function is closely tied to the objective optimized by the whole model.
Depending on the metric of interest, the reward can be defined as total travel time, queue length,
waiting time, delay time, vehicular emissions, and so on. In this study, we adopt the total travel time
of the link as the objective. The reward function is defined as follows.
$$
r_i^t\left(s_i^t, a_i(m,t), \tilde{s}_i^t\right) = -\sum_{i' \in I}\,\sum_{m} x_{i'}(m,t)
\qquad (21)
$$
where $I$ is the set of lanes within the link.
Note that although we specify reward as total travel time in this study, other types of reward can also
be readily accommodated into the modeling framework.
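A sketch of the reward in Equation (21): the negative sum of the cell occupancies x_{i'}(m, t) over all lanes of the link, a proxy for total travel time. The nested-list layout of `occupancies` is an assumption for illustration.

```python
def link_reward(occupancies):
    """occupancies[lane][cell] = vehicles in that cell at time t; the reward is
    the negative total occupancy of the link, per Eq. (21)."""
    return -sum(sum(lane_cells) for lane_cells in occupancies)

# Two lanes (toll and non-toll) with three cells each:
r = link_reward([[4, 6, 2], [8, 10, 5]])  # -> -35
```

Swapping in queue length or emissions per cell changes the objective without changing the RL machinery.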

2.3.4. Dynamic tolling problem as Markov decision process


Optimizing traffic pricing control (different objectives can be used here, e.g. total travel time,
number of stops, and vehicular emissions) requires determining the optimal tolls. To attain the optimal
objective, the tolling controller has to allocate different toll rates to vehicles traveling on the toll lane
depending on the distance traveled and the period of the day. The controller makes decisions at specific
intervals, which are determined beforehand by the tolling signal-timing planner. The traffic network is the
environment, and the traffic controllers are the agents in this context. Because of the uncertain traffic
demand and supply, the network environment is stochastic, which increases the difficulty of assigning tolls.
The action of an agent is defined as activating one of the toll rates at each decision interval. Note that the
transition time from one state to another after activating any of the toll rates is unity (i.e., the same in
all cases). Thus, the dynamic tolling problem has all the elements of an infinite time horizon MDP. Within
finite time steps, each time the agent takes an action that impacts the current environment, the state of
the environment changes accordingly. The transition probabilities from one state to another are determined
by the combined dynamic effects of traffic flow propagation, lane choice, and toll rates. The transition
probability matrix, if written out, is time dependent; in this study, however, we do not need its explicit
expression because we are not seeking a closed-form solution. The objective is to find the optimal policy
(a mapping between traffic states and toll rate activations) that yields the maximum long-run reward
measured in terms of total travel time, number of stops, vehicular emissions, and so on. Let T, S_t, A_t,
P_t, R_t denote the set of time steps, states s_i^t, actions a_i(·), transition probabilities, and rewards
r_i^t(·), respectively; then the dynamic tolling problem is characterized by {T, S_t, A_t, P_t, R_t}
and can be formulated as the infinite time horizon MDP
   
Q sti ; ai ðÞ ¼ max Q sti ; ai ðÞ ; ∀i (22)
ai ðÞ∈At

2.4. R-Markov Average Reward Technique algorithm description


We implement an off-policy temporal difference control algorithm, which is also known as advanced
off-policy RMART [41, 42]. Like most RL-based schemes, the proposed algorithm has two phases:
learning phase and implementation phase. The learning takes place before the implementation phase.
The key difference in the techniques stated earlier is the process of updating the state-value function
[43]. During the learning phase, the agents update the state-action values by interacting with the
environment. Balancing exploration and exploitation is important in this phase. Initially, the algorithm
starts with ε-greedy action selection using a high ε value; then, the ε value is gradually decreased
toward the end of the learning phase. During the implementation period, the algorithm emphasizes
exploitation with a small ε value. Because the only change from the learning phase to the implementation
phase is the action selection strategy, only the learning phase algorithm is described next.
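The decreasing-ε schedule described above might be sketched as a simple linear decay. The start/end values and the linear form are assumptions; the paper does not specify the schedule.

```python
def epsilon_schedule(k, k_max, eps_start=0.5, eps_end=0.05):
    """Linearly decay the exploration rate from eps_start (early learning)
    toward eps_end (implementation phase) over k_max iterations."""
    frac = min(k / k_max, 1.0)
    return eps_start + frac * (eps_end - eps_start)

# At k = 0 the rate equals eps_start; by k = k_max it has decayed to eps_end,
# leaving mostly exploitation during implementation.
```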

2.4.1. R-Markov Average Reward Technique description


R-Markov Average Reward Technique is an advanced off-policy RL method. This method does not
divide the experience into separate episodes with finite returns. The value functions are defined with
respect to the average expected reward per time step:
$$
\rho\left(s_i^t, a_i(\cdot)\right) = \lim_{n \to \infty} \frac{1}{n} \sum_{t=1}^{n} E\left[r_i^t(\cdot)\right]
\qquad (23)
$$

R-Markov Average Reward Technique also assumes an ergodic process; that is, the long-term average
does not depend on the initial state and should yield the same value for any initialization. RMART handles
the transient behavior in which better-than-average rewards may be received for a while in some states and
worse-than-average rewards in others. One clear distinction from other temporal difference techniques is
the use of relative value functions: the values are relative to the average reward under the active policy
[41, 42]. RMART uses the concept of long-term average reward instead of the discounted reward used in
Q-learning and State-Action-Reward-State-Action (SARSA) algorithms. Tsitsiklis and Van Roy [44] made an
analytical comparison between the discounted (Q-learning) and average reward (RMART) techniques
and showed that, as the discount factor approaches one, the value function from the discounted technique
approaches the differential value function from the average reward technique.
Because of the uncertain traffic demand and supply, the traffic volume of a link is a stochastic process,
and the state in the RL system is highly dependent on the traffic volume. Two distinct properties of traffic
dynamics are the similarity of traffic patterns (e.g. the traffic pattern at a particular link on each Sunday
from 11 AM to noon) and the heterogeneity of the congestion parameters across the network. To account for
these attributes, this research deploys an average reward technique. In addition, average reward methods
offer computational advantages [45].

Pseudo code for the RMART-based algorithm:

Initialization:
    Set initial values for ρ(s_i^t, a_i(·)) and Q(s_i^t, a_i(·)) for all state-action pairs (s_i^t, a_i(m, t));
    Check = 0; k = 1.

While Check = 0 Do
    Update learning rate and discount rate:
        α(k) = log₁₀(k + 2) / (k + 2)
        β(k) = A / (B + k); A and B are scalars
    Learning phase:
        The agent builds its mapping table (s_i^t, a_i(m, t)), which is used in later steps to decide
        which toll to impose in the implementation phase.
        Observe reward r_i^t(s_i^t, a_i(·), s̃_i^t) for choosing action a_i and next state s̃_i^t.
        Update Q-values:
            Q(s_i^t, a_i(·)) ← Q(s_i^t, a_i(·)) + α(k) [ r_i^t(·) − ρ(s_i^t, a_i(·)) + max_{ã_i} Q(s̃_i^t, ã_i(·)) − Q(s_i^t, a_i(·)) ]
        Update average reward:
            If Q(s_i^t, a_i(·)) = max_{a_i} Q(s_i^t, a_i(·)) Then
                ρ(s_i^t, a_i(·)) ← ρ(s_i^t, a_i(·)) + β(k) [ r_i^t(·) − ρ(s_i^t, a_i(·)) + max_{ã_i} Q(s̃_i^t, ã_i(·)) − max_{a_i} Q(s_i^t, a_i(·)) ]
        s_i^t ← s̃_i^t
    Update k ← k + 1.
    If the termination criterion is met, Then Check = 1.
End
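The update step of the pseudocode above can be sketched as follows. The in-memory Q-table, the action set, and the scalars A and B are illustrative assumptions; the environment interaction is omitted.

```python
import math
from collections import defaultdict

def rmart_step(Q, rho, s, a, r, s_next, k, actions=(1, 2, 3, 4), A=0.5, B=100.0):
    """One RMART learning update for the observed transition (s, a, r, s_next)."""
    alpha = math.log10(k + 2) / (k + 2)   # decaying learning rate
    beta = A / (B + k)                    # decaying average-reward step size
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    # Relative temporal-difference update (reward measured against the average):
    Q[(s, a)] += alpha * (r - rho[(s, a)] + best_next - Q[(s, a)])
    # Update the average reward only if the updated action is greedy in s:
    if Q[(s, a)] == max(Q[(s, a2)] for a2 in actions):
        rho[(s, a)] += beta * (r - rho[(s, a)] + best_next
                               - max(Q[(s, a2)] for a2 in actions))

Q, rho = defaultdict(float), defaultdict(float)
rmart_step(Q, rho, s=1, a=2, r=-35.0, s_next=3, k=1)  # Q[(1, 2)] becomes negative
```

Looping this step over simulated transitions, with ε-greedy action selection and an increasing k, reproduces the learning phase.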


3. TEST CASE STUDY

3.1. Experiment design


To demonstrate the performance of the RMART algorithm in solving the distance-based dynamic toll-
ing problem, we constructed a test case study based on the simplified Sioux Falls network as shown in
Figure 4. Detailed settings for the test case study are as follows.
In the simplified Sioux Falls network, there are 40 links and 27 nodes. The origin nodes are 1, 26,
and 27; the destination nodes are 7 and 25. Considering that real-world networks typically consist of
mixed road types, we differentiate two types of roads in the network: freeway (with a speed limit of
55 mph) and arterial (with a speed limit of 30 mph). This setting also helps provide insights into
tolling on various types of roads. Links 1, 2, 7, 18, 19, and 33 are considered freeways, and the other
links are considered arterial roads. We model route choice behavior in a simple manner, splitting the
traffic flow of a diverging link evenly among the downstream links. We do not consider dynamic
traffic assignment because it is not the focus of this study. Moreover, we define all links in the network
as single-lane roads except links 7, 14, and 27. These three links (highlighted in bold in Figure 4) are
set as the managed links; in other words, links 7, 14, and 27 consist of two lanes, one toll lane and
one non-toll lane. Part of the

Figure 4. Sioux Falls network representation.


settings of the network is presented in Table I. For the sake of brevity, we only show the link lengths of
the toll links.
For the toll lane choice model, we set the scale parameter (θ) in Equation (12) to one, and the
discretized toll rates are preset as σ1 = 0, σ2 = 0.01, σ3 = 0.02, σ4 = 0.03. The duration of the simulation is
3600 s with a time step of 10 s; hence, the cell length of freeway (arterial) roads is around 244 m (133 m).
For the input demand of the origin (node 1), we randomly generate values according to a normal probability
distribution, with two peaks (with means at 4 veh/step) as shown in Figure 5. For the last 300 s (5 min), we
set zero demand input so as to clear traffic from the network before the simulation ends. The saturation flow
is set to 1800 vph (or 5 veh/step); hence, it is worth noting that the demand setting of 4 veh/step is quite
high. The regeneration interval of the demand is set to 60 s (or six steps). Moreover, to model random
disturbances of traffic, the capacity of the exit cell in the non-toll lane of links 7, 14, and 27 is generated
randomly according to a normal probability distribution with a mean value of 1800 vph and a standard
deviation of 180 vph. The regeneration interval of the capacity is set to 300 s (or 30 steps).

3.2. Result analysis


The incoming demand input of link 1 is shown in Figure 5. It is easy to see that the traffic demand
pattern is random and at a high level (as the value of 5 denotes the saturation flow rate). Moreover,
the capacity of the exit cell of the non-toll lane of link 7 is shown in Figure 6. Because of the
experiment design, the random demand input has two peaks, and the exit capacity changes randomly
every 30 time steps during the 1-h period. Figures 5 and 6 are representative figures to

Table I. Setting of the test network.

Category              Index
Freeway (55 mph)      1, 2, 7, 18, 19, 33
Managed link          7, 14, 27
Origin node           1, 26, 27
Destination node      7, 25

Link index            Link length (m)
7                     3000
14                    1500
27                    3000

Figure 5. Demand input variation of link 1.


Figure 6. Exit capacity variation of link 7.

Figure 7. Total travel time variation of different simulation runs (result of first one is set as base for comparison)
for toll links.

demonstrate the setting. The uncertainty from both the traffic demand and the infrastructure supply
defines the stochastic traffic network environment.
Note that in this study, we set the total travel time of all toll links as the optimized objective. The
total travel time for different simulation runs is shown in Figure 7. In this study, we set the maximum
number of iterations to 120. It is clear that the total travel time decreases over the iterations. From the
90th iteration onward, the total travel time varies within a small range, at about 75% of the value from
the first simulation run. The trend of the total travel time confirms the RMART algorithm’s performance
in minimizing total travel time. On the other hand, the result does not converge to a stable point. This is
also expected. One reason is that the RMART algorithm selects actions according to the ε-greedy method:
the control unit of tolling does not take the recommended action at every time step but explores other
actions with a small probability ε. Another reason is the uncertainty from the traffic demand and
infrastructure supply sides.
Figures 8 and 9 show the distributions of collected tolls on the toll lane in links 7 (freeway) and
14 (arterial) at the last iteration. We see that more toll is collected on link 14 than on link 7: the traffic
switching from the non-toll lane to the toll lane is much denser on link 14 than on link 7. The result suggests that


Figure 8. Distribution of collected tolls on the toll lane in link 7 (freeway link).

Figure 9. Distribution of collected tolls on the toll lane in link 14 (arterial link).

traffic from the arterial road is more likely to use the toll lane than traffic from the freeway. This finding
seems counter-intuitive, because tolls are usually collected on freeways rather than arterial roads.
However, it is worthwhile to note that the result is obtained by applying the RL algorithm to maximize
the long-term Q-value. Thus, one insight from the comparison of road types is that the need for dynamic
tolling may be higher for arterial roads than for freeways from the viewpoint of reducing total travel time.
On the other hand, because real-world networks contain many arterial roads, freeways, expressways, and so
on, which roads should be selected for dynamic distance-based tolling so that the whole network benefits
the most? This is an interesting research topic, but beyond the focus of this study.
Figures 10 and 11 show the traffic density distribution for the toll lane in links 7 and 14. One interesting
finding is that there are two obvious peaks during the simulation period. This is explicable: because of
the excess demand input and the deterioration of the outflow capacity during the peak periods, heavy
congestion can occur in the non-toll lane, forcing more traffic onto the toll lane. As shown in the density
distributions for the toll lane (Figures 10 and 11), the toll lane is much denser during the two peak periods.


Figure 10. Traffic density distribution of the toll lane in link 7 (freeway link).

Figure 11. Traffic density distribution of the toll lane in link 14 (arterial link).

3.3. Sensitivity analysis of the number of states and tolling rates


Note that there are only four possible states and toll rates in the aforementioned result analysis. Four
toll rates are chosen to limit the size of the Q-value table in the RL algorithm. With a larger number of
states (toll rates), the space of state-action pairs grows; accordingly, it takes more time to visit all
state-action pairs and converge to a stable action-selection scheme. However, if the space of states is
too small, it may not capture changes in the environment accurately, and the action schemes are
affected negatively. In this section, we increase the number of states (toll rates) from 4 to 10. More
precisely, the state determination thresholds divide the jam density evenly by the number of states,
similar to Equation (19), and the toll rate equals 0.01 × (action − 1). The other settings (e.g. network
configuration, demands, and capacities) are consistent with Section 3.1. The travel time reductions for
the various combinations of states and toll rates are presented in Figure 12.
Notice that there are two peaks (reductions over 25%) in Figure 12: one around state 4, toll rate 6, and
the other around state 10, toll rate 6. Furthermore, there is a valley (reduction less than 5%) around
state 8, toll rate 9. There is also a trend that the performance of the algorithm worsens as the number of
states/toll rates grows. These observations confirm that the performance of the algorithm is highly
affected by the selection of states and toll rates. How to obtain the best combination of states and toll
rates remains an open problem, which we leave for future study.
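The construction of the enlarged state and toll-rate spaces described above might be sketched as follows; the function names and the jam density value are illustrative assumptions.

```python
def state_thresholds(jam_density, n_states):
    """Equally spaced density thresholds: level k covers up to k * d_J / n."""
    return [level * jam_density / n_states for level in range(1, n_states + 1)]

def toll_levels(n_rates):
    """Toll rate for action a is 0.01 * (a - 1), as in Section 3.3."""
    return [0.01 * (a - 1) for a in range(1, n_rates + 1)]

# e.g. 10 states with an assumed jam density of 200 veh/km, and 10 toll rates:
thresholds = state_thresholds(200.0, 10)   # [20.0, 40.0, ..., 200.0]
tolls = toll_levels(10)                    # [0.0, 0.01, ..., 0.09]
```

Sweeping `n_states` and `n_rates` over 4 to 10 generates the grid of combinations behind Figure 12.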


Figure 12. Travel time reduction of various combinations of states and toll rates.

4. CONCLUSIONS

In light of the need to address the road pricing problem in the CV environment, and motivated by
the research gap in distance-based dynamic tolling, this paper proposes a novel distance-based
dynamic tolling model. The notion of distance in this study refers to the actual distance the vehicle
travels on the toll lane. This setting is technologically feasible in the CV environment, where vehicles
can communicate with infrastructure, so the control unit of tolling has access to the entry location of
each vehicle; it is also plausible in current practice by deploying roadside sensors along the toll lane.
In the model, there are no specified entry points for tolls; in other words, vehicles are free to enter the
toll lane at any point along the lane. Based on the vehicle’s entry location, the tolling control unit
determines the best toll for the vehicle. The model is built upon a stochastic network environment: the
traffic demand input and the saturation capacity of the toll links are generated randomly according to
given probability distributions (e.g. normal, binomial, or Poisson) to account for uncertainty on the
traffic demand and supply sides. For the underlying traffic flow modeling, we apply the stochastic CTM.
By making use of the fundamental diagram in the CTM, we obtain the speed profile and then estimate
the travel time. A binomial logit model is applied to model the choice between the toll and non-toll
lanes. Moreover, the dynamic tolling problem is modeled as an MDP. Different metrics, for example,
total network throughput, delay time, and vehicular emissions, can easily be set as the optimized
objectives in the modeling framework. The control unit of tolling is modeled as an intelligent agent
interacting with the stochastic network environment by taking actions, namely, determining different
distance-based toll rates for traffic. Thus, the distance-based dynamic tolling problem is transformed
into finding the optimal policy (a mapping between traffic states and toll rate activations) that yields
the maximum long-term reward measured in terms of total travel time, number of stops, vehicular
emissions, and so on. The optimal toll rate is obtained by applying the RMART algorithm. In the test
case study, we reconstructed the Sioux Falls network and set hypothetical toll links. Over multiple
simulation runs, the total travel time improves during the first 90 runs and then fluctuates within a
narrow range at about 75% of the first run’s value. The reason the result does not converge to a stable
point is predictable: it is mainly due to the exploration in the action-selection process and the uncertainty
of the stochastic network. Furthermore, one insight from the comparison of road types in this specific
example is that the need for dynamic tolling may be higher for arterial roads than for freeways from
the viewpoint of reducing total travel time.
This study is a starting point for the area of distance-based dynamic tolling. There are multiple
research directions along this stream. (i) As noted in Section 3.3, the performance of the algorithm


is influenced by the selection of states and toll rates; how to obtain the best combination remains an
unsolved problem in this study. (ii) This study only models drivers’ lane selection behavior; route
choice behavior is not involved. It may be interesting to revisit the problem in the context of dynamic
traffic assignment. (iii) The objective in the RL framework is a single objective, that is, total travel
time. How to balance the tradeoff between multiple objectives (e.g. total throughput, delay, and
emissions) has not been addressed. (iv) Considering the size of the network, it is impossible to toll all
the roads in the network. How should the links (arterial roads, freeways, or other road types) be
selected for dynamic distance-based tolling so that the whole network benefits the most? This may
also be an interesting research topic. (v) The proposed RL framework is a localized optimization
framework; that is, it only optimizes the performance of tolling on one link and does not consider
coordination between links. Exploring a cooperative RL framework to optimize the system-wide
performance of tolling is also worthwhile for future research.

5. LIST OF ABBREVIATIONS

5.1. Abbreviations

AVI Automated Vehicle Identification


C2C Car To Car
C2X Car To X
CV Connected Vehicle
CTM Cell Transmission Model
DOT Department of Transportation
DSRC Dedicated Short Range Communications
DUE Dynamic User Equilibrium
ERP Electronic Road Pricing
ETC Electronic Toll Collection
HGV Heavy Goods Vehicles
HOV High Occupancy Vehicle
ITS Intelligent Transportation System
I2I Infrastructure-To-Infrastructure
LWR Lighthill-Whitham-Richards (model)
MDP Markov Decision Process
RL Reinforcement Learning
RMART R-Markov Average Reward Technique
SOV Single Occupancy Vehicles
SARSA State-Action-Reward-State-Action
TD Temporal Difference
V2V Vehicle-To-Vehicle
V2I Vehicle-To-Infrastructure
VMT Vehicle Miles Traveled

ACKNOWLEDGEMENT

The authors are grateful for the constructive comments from three anonymous referees.

REFERENCES

1. Akiyama T, Okushima M. Implementation of cordon pricing on urban network with practical approach. Journal of
Advanced Transportation 2006; 40(2):221–248.
2. Lindsey R. Do economists reach a conclusion on highway pricing? The intellectual history of an idea. Econ Journal
Watch 2006; 3(2):292–379.
3. Ukkusuri SV, Karoonsoontawong A, Waller ST, Kockelman K. Congestion pricing technologies: a comparative
evaluation. In Transportation Research Trends, Chapter 4, Nova Publications: New York, 2007; 121–142.


4. De Palma A, Lindsey R. Traffic congestion pricing methodologies and technologies. Transportation Research Part
C: Emerging Technologies 2011; 19(6):1377–1399.
5. Lam WHK, Poon ACK, Ye RJ. Optimization of tunnel tolls in land use and transport planning. Journal of Advanced
Transportation 1996; 30(3):45–56.
6. Yang H, Huang H-J. Mathematical and Economic Theory of Road Pricing. Elsevier: Amsterdam and Boston, 2005.
7. Li H, Bliemer MCJ, Bovy PHL. Network reliability-based optimal toll design. Journal of Advanced Transportation
2008; 42:311–332.
8. Liu W, Yang H, Yin Y. Traffic rationing and pricing in a linear monocentric city. Journal of Advanced
Transportation 2012. DOI:10.1002/atr.1219.
9. Liu Z, Wang S, Meng Q. Toll pricing framework under logit-based stochastic user equilibrium constraints. Journal
of Advanced Transportation 2013. DOI:10.1002/atr.1255.
10. Chung BD, Yao T, Friesz TL, Liu H. Dynamic congestion pricing with demand uncertainty: a robust optimization
approach. Transportation Research Part B 2012; 46(10):1504–1518.
11. Stewart KJ, Ge YE. Optimising time-varying network flows by low-revenue tolling under dynamic user equilibrium.
European Journal of Transport and Infrastructure Research 2014; 14(1):30–45.
12. Zhang G, Ma X, Wang Y. Self-adaptive tolling strategy for enhanced high-occupancy toll lane operations. IEEE
Transactions on Intelligent Transportation Systems 2014; 15(1):306–317.
13. Halper E. A black box in your car? Some see a source of tax revenue. Article of Los Angeles Times, 2013. Available
online: http://articles.latimes.com/2013/oct/26/nation/la-na-roads-black-boxes-20131027
14. Whitty JM. Oregon’s mileage fee concept and road user fee pilot program. Final report. Oregon Department of
Transportation, Salem, 2007.
15. Schultz M, Atkinson RD. Paying Our Way: A New Framework for Transportation Finance. Final Report, National
Surface Transportation Infrastructure Financing Commission, 2009.
16. Starr McMullen B, Zhang L, Nakahara K. Distributional impacts of changing from a gasoline tax to a vehicle-mile
tax for light vehicles: a case study of Oregon. Transport Policy 2010; 17(6):359–366.
17. Yin Y, Lou Y. Dynamic tolling strategies for managed lanes. Journal of Transportation Engineering 2009; 135(2):
45–52.
18. Lou Y, Yin Y, Laval JA. Optimal dynamic pricing strategies for high-occupancy/toll lanes. Transportation
Research Part C: Emerging Technologies 2011; 19(1):64–74.
19. Yang L, Saigal R, Zhou H. Distance-based dynamic pricing strategy for managed toll lanes. Transportation
Research Record: Journal of the Transportation Research Board 2012; 2283:90–99.
20. Daganzo CF. The cell transmission model: a dynamic representation of highway traffic consistent with the
hydrodynamic theory. Transportation Research Part B: Methodological 1994; 28:269–287.
21. Daganzo CF. The cell transmission model, part II: network traffic. Transportation Research Part B: Methodological
1995; 29:79–93.
22. Yang H, Huang HJ. Analysis of the time-varying pricing of a bottleneck with elastic demand using optimal control
theory. Transportation Research Part B 1997; 31(6):425–440.
23. De Palma A, Lindsey R. Private toll roads: competition under various ownership regimes. The Annals of Regional
Science 2000; 34(1):13–35.
24. Lin DY, Unnikrishnan A, Waller T. A dual variable approximation based heuristic for dynamic congestion pricing.
Networks and Spatial Economics 2011; 11(2):271–293.
25. Lighthill MJ, Whitham GB. On kinematic waves. II. A theory of traffic flow on long crowded roads. Proceedings of
the Royal Society of London. Series A, Mathematical and Physical Sciences 1955; 229:317.
26. Richards PI. Shock waves on the highway. Operations Research 1956; 4:42.
27. Lo HK, Szeto WY. A cell-based variational inequality formulation of the dynamic user optimal assignment problem.
Transportation Research Part B-Methodological 2002; 36:421–443.
28. Szeto WY, Lo HK. A cell-based simultaneous route and departure time choice model with elastic demand. Trans-
portation Research Part B-Methodological 2004; 38:593–612.
29. Han LS, Ukkusuri S, Doan K. Complementarity formulations for the cell transmission model based dynamic user
equilibrium with departure time choice, elastic demand and user heterogeneity. Transportation Research Part B-
Methodological 2011; 45:1749–1767.
30. Ukkusuri SV, Han L, Doan K. Dynamic user equilibrium with a path based cell transmission model for general traf-
fic networks. Transportation Research Part B: Methodological 2012; 46(10):1657–1684.
31. Gomes G, Horowitz R, Kurzhanskiy AA, Varaiya P, Kwon J. Behavior of the cell transmission model and effective-
ness of ramp metering. Transportation Research Part C: Emerging Technologies 2008; 16(4):485–513.
32. Lo HK. A novel traffic signal control formulation. Transportation Research Part A: Policy and Practice 1999;
33:433–448.
33. Lo HK, Chang E, Chan YC. Dynamic network traffic control. Transportation Research Part A: Policy and Practice
2001; 35:721–744.
34. Ukkusuri SV, Ramadurai G, Patil G. A robust transportation signal control problem accounting for traffic dynamics.
Computers and Operations Research 2010; 37(5):869–879.
35. Wong CK, Wong SC, Lo HK. A spatial queuing approach to optimize coordinated signal settings to obviate gridlock
in adjacent work zones. Journal of Advanced Transportation 2010; 44(4):231–244.
36. Zhao Y, Kockelman K. The propagation of uncertainty through travel demand models: an exploratory analysis.
Annals of Regional Science 2002; 36:145–163.