Multi-Agent Reinforcement Learning for Spectrum Sharing in Vehicular Networks
Abstract—This paper investigates the spectrum sharing problem in vehicular networks, where multiple vehicle-to-vehicle (V2V) links reuse the frequency spectrum preoccupied by vehicle-to-infrastructure (V2I) links. We model the resource sharing as a multi-agent reinforcement learning (RL) problem, which is then solved using a fingerprint-based deep Q-network method. The V2V links, each acting as an agent, collectively interact with the vehicular environment, receive distinctive observations yet a common reward, and then improve policy design through updating their Q-networks with gained experiences. Simulation results demonstrate desirable performance of the proposed resource allocation scheme based on multi-agent RL in terms of both V2I capacity and V2V payload delivery probability.

Index Terms—V2X, spectrum access, resource allocation, multi-agent reinforcement learning

I. INTRODUCTION

Vehicular communication, commonly referred to as vehicle-to-everything (V2X) communication, is envisioned to transform connected and intelligent transportation services in many aspects [1], [2]. For vehicular networks, judicious resource allocation is particularly challenging due to strong underlying dynamics and diverse quality-of-service requirements. In response, a heuristic spatial spectrum reuse scheme has been developed in [3] for device-to-device (D2D) based vehicular networks, relieving requirements on full channel state information (CSI). In [4], V2X resource allocation adapts to slowly varying large-scale channel fading and hence reduces network signaling overhead. Further, in [5], similar strategies have been adopted while spectrum sharing is allowed between vehicle-to-infrastructure (V2I) and vehicle-to-vehicle (V2V) links and also among peer V2V links. Sum ergodic capacity of V2I links has been maximized with a V2V reliability guarantee using large-scale fading channel information in [6] or CSI from periodic feedback in [7]. A novel graph-based approach has been further developed in [8] to deal with a generic V2X resource allocation problem. However, the majority of existing methods rely on channel information, large- or small-scale, often in an isolated manner. They ignore the dynamics underlying channel evolution and thus find difficulty in providing direct answers to problems of a sequential nature, such as the requirement of "successfully transmitting B bytes within time T", commonly seen in vehicular networks.

Reinforcement learning (RL) has been shown effective in addressing sequential decision-making problems [9]. In particular, the recent success of deep RL in human-level video game play and AlphaGo has sparked a flurry of interest in the topic. RL is also well suited to resource allocation in vehicular networks in that it can train for objectives that are hard to model or optimize in an exact or analytical manner. Another advantage of using RL for resource allocation is that distributed resource management is made possible by treating each V2V link as an agent that learns to refine its strategy via interacting with the environment [10]–[12]. In this paper, we model the spectrum sharing design in vehicular networks as a multi-agent problem and show that recent progress in multi-agent RL [13], [14] can be exploited to enable each V2V link to learn from experience and work cooperatively to optimize system-level performance in a distributed way.

This work was supported in part by a research gift from Intel Corporation and in part by the National Science Foundation under Grants 1731017 and 1815637.

II. SYSTEM MODEL

Consider a vehicular communication network with M V2I and K V2V links. The V2I links connect M vehicles to the base station (BS) to support bandwidth-intensive applications, such as social networking and media streaming. The K V2V links are formed among vehicles and designed to reliably share safety-critical information among neighboring vehicles, in the form of localized D2D communications. We assume all transceivers use a single antenna. The sets of V2I and V2V links are denoted by M = {1, ..., M} and K = {1, ..., K}, respectively. In this paper, we assume the M V2I links (uplink) have been preassigned M orthogonal spectrum bands and the mth V2I transmitter uses the mth band. To improve spectral efficiency, these bands are reused by the K V2V links.

The channel power gain of the kth V2V link over the mth band follows

g_k[m] = \alpha_k h_k[m],  (1)

where h_k[m] is the frequency-dependent fast (small-scale) fading power component, assumed to be exponentially distributed with unit mean, and \alpha_k captures the large-scale fading effect, including path loss and shadowing, assumed to be frequency independent. The interfering channel from the k'th V2V transmitter to the kth V2V receiver over the mth band, g_{k',k}[m], the interfering channel from the kth V2V transmitter to the BS over the mth band, g_{k,B}[m], the channel from the mth V2I transmitter to the BS, \hat{g}_{m,B}, and
the interfering channel from the mth V2I transmitter to the kth V2V receiver, \hat{g}_{m,k}, are similarly defined.

The received signal-to-interference-plus-noise ratios (SINRs) of the mth V2I link and the kth V2V link over the mth band are expressed as

\gamma_m^c[m] = \frac{P_m^c \hat{g}_{m,B}[m]}{\sigma^2 + \sum_{k} \rho_k[m] P_k^d[m] g_{k,B}[m]},  (2)

and

\gamma_k^d[m] = \frac{P_k^d[m] g_k[m]}{\sigma^2 + I_k[m]},  (3)

respectively, where P_m^c and P_k^d[m] denote the transmit powers of the mth V2I transmitter and the kth V2V transmitter over the mth band, respectively, \sigma^2 is the noise power, and

I_k[m] = P_m^c \hat{g}_{m,k}[m] + \sum_{k' \neq k} \rho_{k'}[m] P_{k'}^d[m] g_{k',k}[m],  (4)

denotes the interference power. \rho_k[m] is the binary spectrum allocation indicator, with \rho_k[m] = 1 implying the kth V2V link uses the mth band and \rho_k[m] = 0 otherwise. We assume each V2V link only accesses one band, i.e., \sum_m \rho_k[m] \le 1.

Capacities of the mth V2I link and the kth V2V link over the mth band are then obtained as

C_m^c[m] = W \log(1 + \gamma_m^c[m]),  (5)

and

C_k^d[m] = W \log(1 + \gamma_k^d[m]),  (6)

where W is the bandwidth of each spectrum band.

Per the requirements of different vehicular links, the objective is to design power control and spectrum allocation schemes that simultaneously maximize the sum V2I capacity, \sum_m C_m^c[m], and the V2V payload delivery probability,

\Pr\left\{ \sum_{t=1}^{T} \sum_{m=1}^{M} \rho_k[m] C_k^d[m, t] \ge B/\Delta T \right\}, \quad k \in K,  (7)

where B is the payload size, \Delta T is the channel coherence time, T is the payload generation period, and the index t is added in C_k^d[m, t] to indicate V2V capacity at different time slots.

III. MULTI-AGENT RL BASED RESOURCE ALLOCATION

A. Multi-Agent Reinforcement Learning

RL deals with sequential decision making, where an agent learns to map situations to actions so as to maximize certain rewards through interacting with the environment [9]. Mathematically, the RL problem can be modeled as a Markov decision process (MDP). At each discrete time step t, the agent observes some representation of the environment state S_t from the state space S and then selects an action A_t from the action set A. Following the action, the agent receives a numerical reward R_{t+1} and the environment transitions to a new state S_{t+1}, with probability \Pr\{S_{t+1} = s', R_{t+1} = r \mid S_t = s, A_t = a\} \triangleq p(s', r | s, a). In RL, decision making manifests itself in a policy \pi(a|s), which is a mapping from states in S to probabilities of selecting each action in A. Q-learning [15] is a popular model-free method (meaning the MDP dynamics p(s', r | s, a) need not be known) for solving RL problems. It is based on the action-value function q_\pi(s, a), defined as the expected cumulative discounted reward starting from the state s, taking the action a, and thereafter following the policy \pi. Q-learning solves for the optimal action value q_*(s, a) through iteration, from which the optimal policy \pi_* can be derived. Please refer to [9] for details.

In many problems of practical interest, the state and action spaces can be too large to store all action-value functions in tabular form. Hence, a deep neural network parameterized by \theta, called a deep Q-network (DQN), is used to approximate the action-value function in deep Q-learning [16]. The state-action space is explored with some soft policy, e.g., \epsilon-greedy, and the transition tuple (S_t, A_t, R_{t+1}, S_{t+1}) is stored in a replay memory. At each step, a mini-batch of experiences D is uniformly sampled from the memory for updating \theta, hence the name experience replay, to minimize

\sum_{D} \left[ R_{t+1} + \gamma \max_{a'} Q(S_{t+1}, a'; \theta^-) - Q(S_t, A_t; \theta) \right]^2,  (8)

where \gamma is a discount factor and \theta^- are the parameters of a target Q-network, which are duplicated from the training Q-network parameters \theta periodically and fixed for a few updates.

The multi-agent RL problem setup consists of multiple agents, denoted by k \in K, concurrently exploring an unknown environment [13], [14]. At each time step t, given the underlying environment state S_t of the system with all agents, each agent k receives a local observation Z_t^{(k)}, determined by the observation function O as Z_t^{(k)} = O(S_t, k), and then takes an action A_t^{(k)}, forming a joint action A_t. Thereafter, each agent receives a reward R_{t+1} and the environment evolves to the next state S_{t+1} with probability p(s', r | s, a).

B. Resource Sharing with Multi-Agent RL

We treat each V2V link as an agent, and multiple agents collectively learn from interactions with the environment. While the problem appears to be a competitive game, we turn it into a cooperative one by using the same reward for all agents in the interest of system-level performance. We focus on settings with centralized training and distributed implementation.

State and Observation Space. The true environment state, S_t, which could include global channel conditions and all agents' behaviors, is unknown to each individual V2V agent. Each agent can only acquire knowledge of the underlying environment through the lens of an observation function. The observation space of an individual V2V agent k contains local channel information, including its own signal channel, g_k[m], for all m \in M, interference channels from other V2V transmitters, g_{k',k}[m], for all k' \neq k and m \in M, the interference channel from its own transmitter to the BS, g_{k,B}[m], for all m \in M, and the interference channels from the V2I transmitters, \hat{g}_{m,k}[m], for all m \in M.
2019 IEEE 20th International Workshop on Signal Processing Advances in Wireless Communications (SPAWC)
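As a concrete illustration of the deep Q-learning update that minimizes the loss in (8) — a minimal sketch, not the authors' implementation — the following NumPy code maintains training parameters theta and periodically copied target parameters theta^-, samples mini-batches uniformly from a replay memory, and applies gradient steps on the squared TD error. The state dimension, action count, learning rate, and dummy environment are all assumptions for illustration, and a linear Q-function stands in for the deep network to keep the sketch short.

```python
import numpy as np

rng = np.random.default_rng(0)

STATE_DIM, N_ACTIONS = 8, 4   # assumed sizes, not from the paper
GAMMA, LR = 0.95, 1e-2        # discount factor and learning rate (assumptions)

# Training parameters theta and target parameters theta^- of Eq. (8);
# a linear Q-function Q(s, a) = theta[a] . s stands in for the deep network.
theta = rng.normal(scale=0.1, size=(N_ACTIONS, STATE_DIM))
theta_minus = theta.copy()

replay = []  # stores transition tuples (S_t, A_t, R_{t+1}, S_{t+1})

def q_values(params, s):
    """Q(s, .) for all actions under the given parameters."""
    return params @ s

def td_update(batch):
    """One mini-batch gradient step on the squared TD error of Eq. (8)."""
    global theta
    for s, a, r, s_next in batch:
        target = r + GAMMA * np.max(q_values(theta_minus, s_next))  # bootstrap from theta^-
        td_error = target - q_values(theta, s)[a]
        theta[a] += LR * td_error * s  # gradient of the squared TD error w.r.t. theta[a]

# Interact with a dummy environment and train.
for step in range(500):
    s = rng.normal(size=STATE_DIM)
    a = int(rng.integers(N_ACTIONS))   # exploratory action
    r = float(s[a])                    # dummy reward, for illustration only
    s_next = rng.normal(size=STATE_DIM)
    replay.append((s, a, r, s_next))

    if len(replay) >= 32:
        idx = rng.integers(len(replay), size=32)  # uniform sampling: experience replay
        td_update([replay[i] for i in idx])

    if step % 100 == 0:                # periodic copy theta -> theta^-
        theta_minus = theta.copy()

print(theta.shape)
```

The essential structure mirrors the text: exploration fills the replay memory, mini-batches are drawn uniformly, and the max in the target uses the frozen parameters theta^- rather than theta, which stabilizes training.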
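The distributed execution described in Section III-B — each V2V agent k mapping its local observation Z_t^{(k)} through its own Q-network and selecting an action epsilon-greedily — can be sketched as follows. The observation size and the action space (here a hypothetical 3 bands x 2 power levels) are illustrative assumptions, and linear per-agent Q-networks again stand in for DQNs.

```python
import numpy as np

rng = np.random.default_rng(1)

N_AGENTS, OBS_DIM = 4, 6   # assumed: 4 V2V agents, toy observation size
N_ACTIONS = 3 * 2          # assumed: 3 spectrum bands x 2 power levels (illustrative)
EPSILON = 0.1              # exploration rate of the epsilon-greedy policy

# One independent Q-network per V2V agent (linear here, for brevity).
q_nets = [rng.normal(scale=0.1, size=(N_ACTIONS, OBS_DIM)) for _ in range(N_AGENTS)]

def observe(state, k):
    """Observation function O(S_t, k): agent k sees only its local slice."""
    return state[k]

def select_actions(state, epsilon=EPSILON):
    """Distributed epsilon-greedy selection: each agent acts on its own observation."""
    joint_action = []
    for k, net in enumerate(q_nets):
        z_k = observe(state, k)
        if rng.random() < epsilon:
            a_k = int(rng.integers(N_ACTIONS))  # explore
        else:
            a_k = int(np.argmax(net @ z_k))     # exploit the local Q-network
        joint_action.append(a_k)
    return joint_action

# Dummy global state: one local observation vector per agent.
state = rng.normal(size=(N_AGENTS, OBS_DIM))
actions = select_actions(state)
print(actions)  # one (band, power) index per V2V agent
```

Note that each agent needs only Z_t^{(k)}, never the global state, which is what makes the implementation distributed even though the agents share a common reward during centralized training.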
TABLE I
Bandwidth                         4 MHz
Vehicle drop and mobility model   Urban case of A.1.2 in [17]*
V2I transmit power P_m^c          23 dBm
V2V payload generation period T   100 ms
V2V payload size B                [1, 2, ...] x 1060 bytes
Large-scale fading update         A.1.4 in [17], every 100 ms
Small-scale fading update         Rayleigh fading, every 1 ms
* We shrink the height and width of the simulation area by a factor of 2.

[Fig. 2. V2V payload delivery probability with varying payload sizes. Curves: Random baseline, MARL, SARL.]
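The channel model of (1), driven at the two update timescales in Table I (large-scale fading every 100 ms, small-scale Rayleigh fading every 1 ms), can be sketched as below. The path-loss and shadowing ranges are placeholder assumptions, not the values of [17]; only the structure — exponential unit-mean small-scale power on top of a slowly refreshed large-scale gain — follows the paper.

```python
import numpy as np

rng = np.random.default_rng(2)

N_V2V, N_BANDS = 4, 4
SIM_MS = 300                   # simulate 300 ms
LARGE_SCALE_PERIOD_MS = 100    # large-scale fading update period (Table I)

def draw_large_scale():
    """Placeholder large-scale gain alpha_k: path loss plus lognormal shadowing."""
    path_loss_db = -rng.uniform(60, 100, size=N_V2V)  # assumed range, illustrative
    shadowing_db = rng.normal(0, 8, size=N_V2V)       # 8 dB shadowing std, assumed
    return 10 ** ((path_loss_db + shadowing_db) / 10)

alpha = draw_large_scale()
gains = np.empty((SIM_MS, N_V2V, N_BANDS))

for t in range(SIM_MS):
    if t % LARGE_SCALE_PERIOD_MS == 0:
        alpha = draw_large_scale()                    # refresh every 100 ms
    # Rayleigh fading power |h_k[m]|^2 ~ Exp(1), redrawn every 1 ms per band.
    h = rng.exponential(1.0, size=(N_V2V, N_BANDS))
    gains[t] = alpha[:, None] * h                     # g_k[m] = alpha_k * h_k[m], Eq. (1)

print(gains.shape)
```

Since the small-scale power averages to one, the time-averaged gain of each link tracks its large-scale component alpha_k, which is why large-scale information alone already supports the slow-timescale schemes cited in the introduction.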
agent to adjust its action. Each V2V agent k has a Q-network that takes as input the current observation, Z_t^{(k)}, and outputs the value functions for all actions. We train the Q-networks through running multiple episodes. At each training step,

[Fig. 3. Remaining V2V payload (bytes) over time for V2V Links 1-4. (a) The remaining payload of MARL. (b) The remaining payload of the random baseline.]

To understand why the proposed method achieves better performance, we select an episode in which the proposed method enables all V2V links to successfully deliver the payload of 2,120 bytes while the random baseline fails. We plot in Fig. 3 the change of the remaining V2V payload within the time constraint, i.e., T = 100 ms, for all V2V links. From Fig. 3(a), V2V Link 4 finishes payload delivery early in the episode while the other three links end transmission roughly at the same time for the proposed method. For the random baseline, Fig. 3(b) shows that V2V Links 1 and 4 successfully deliver all payload early in the episode. V2V Link 3 also finishes payload transmission, albeit much later in the episode, while V2V Link 2 fails to deliver the required payload.

In Fig. 4, we show the rates of all V2V links under the two schemes at each step in the same episode as Fig. 3. Comparing Fig. 4(a) and (b), we find that with the proposed method, V2V Link 4 gets high rates at the beginning to finish transmission early, leveraging its good channel condition and incurring no interference to other links at later stages of the episode. V2V Link 1 keeps low rates at first such that the vulnerable V2V Links 2 and 3 can get relatively good transmission rates to deliver payload, and then jumps to high data rates to deliver its own data when both Links 2 and 3 finish transmission. Moreover, a closer examination of Links 2 and 3 reveals their clever strategy of taking turns to transmit to avoid mutual interference. To summarize, the proposed multi-agent RL based method learns to leverage good channels of some V2V links and meanwhile provides protection for those with bad channel conditions.

[Fig. 4. V2V transmission rates (Mbps) vs. time step (ms) of different resource allocation schemes within the same episode as Fig. 3. Only the results of the beginning 30 ms are plotted for better presentation. The initial payload size B = 2,120 bytes. (a) V2V transmission rates of MARL. (b) V2V transmission rates of the random baseline.]

V. CONCLUSION

In this paper, we have presented a distributed resource sharing scheme for vehicular networks based on multi-agent RL. A fingerprint-based method has been exploited to address the nonstationarity of independent Q-learning for multi-agent RL problems when using DQN with experience replay. The proposed method is shown to be effective in encouraging cooperation among V2V links to improve system-level performance, albeit with distributed implementation.

REFERENCES

[1] L. Liang, H. Peng, G. Y. Li, and X. Shen, "Vehicular communications: A physical layer perspective," IEEE Trans. Veh. Technol., vol. 66, no. 12, pp. 10647-10659, Dec. 2017.
[2] H. Peng, L. Liang, X. Shen, and G. Y. Li, "Vehicular communications: A network layer perspective," IEEE Trans. Veh. Technol., vol. 68, no. 2, pp. 1064-1078, Feb. 2019.
[3] M. Botsov, M. Klügel, W. Kellerer, and P. Fertl, "Location dependent resource allocation for mobile device-to-device communications," in Proc. IEEE WCNC, Apr. 2014, pp. 1679-1684.
[4] W. Sun, E. G. Ström, F. Brännström, K. Sou, and Y. Sui, "Radio resource management for D2D-based V2V communication," IEEE Trans. Veh. Technol., vol. 65, no. 8, pp. 6636-6650, Aug. 2016.
[5] W. Sun, D. Yuan, E. G. Ström, and F. Brännström, "Cluster-based radio resource management for D2D-supported safety-critical V2X communications," IEEE Trans. Wireless Commun., vol. 15, no. 4, pp. 2756-2769, Apr. 2016.
[6] L. Liang, G. Y. Li, and W. Xu, "Resource allocation for D2D-enabled vehicular communications," IEEE Trans. Commun., vol. 65, no. 7, pp. 3186-3197, Jul. 2017.
[7] L. Liang, J. Kim, S. C. Jha, K. Sivanesan, and G. Y. Li, "Spectrum and power allocation for vehicular communications with delayed CSI feedback," IEEE Wireless Commun. Lett., vol. 6, no. 4, pp. 458-461, Aug. 2017.
[8] L. Liang, S. Xie, G. Y. Li, Z. Ding, and X. Yu, "Graph-based resource sharing in vehicular communication," IEEE Trans. Wireless Commun., vol. 17, no. 7, pp. 4579-4592, Jul. 2018.
[9] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. MIT Press, 1998.
[10] H. Ye, G. Y. Li, and B.-H. Juang, "Deep reinforcement learning for resource allocation in V2V communications," IEEE Trans. Veh. Technol., vol. 68, no. 4, pp. 3163-3173, Apr. 2019.
[11] H. Ye, L. Liang, G. Y. Li, J. Kim, L. Lu, and M. Wu, "Machine learning for vehicular networks: Recent advances and application examples," IEEE Veh. Technol. Mag., vol. 13, no. 2, pp. 94-101, Jun. 2018.
[12] L. Liang, H. Ye, and G. Y. Li, "Toward intelligent vehicular networks: A machine learning framework," IEEE Internet Things J., vol. 6, no. 1, pp. 124-135, Feb. 2019.
[13] S. Omidshafiei, J. Pazis, C. Amato, J. P. How, and J. Vian, "Deep decentralized multi-task multi-agent reinforcement learning under partial observability," in Proc. Int. Conf. Mach. Learning (ICML), 2017, pp. 2681-2690.
[14] J. Foerster, N. Nardelli, G. Farquhar, T. Afouras, P. H. S. Torr, P. Kohli, and S. Whiteson, "Stabilising experience replay for deep multi-agent reinforcement learning," in Proc. Int. Conf. Mach. Learning (ICML), 2017, pp. 1146-1155.
[15] C. J. C. H. Watkins and P. Dayan, "Q-learning," Machine Learning, vol. 8, no. 3-4, pp. 279-292, May 1992.
[16] V. Mnih, K. Kavukcuoglu, D. Silver, et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, pp. 529-533, Feb. 2015.
[17] 3rd Generation Partnership Project; Technical Specification Group Radio Access Network; Study on LTE-based V2X Services (Release 14), 3GPP TR 36.885 V14.0.0, Jun. 2016.
[18] S. Ruder, "An overview of gradient descent optimization algorithms," arXiv preprint arXiv:1609.04747, 2016.