Multi-Agent Reinforcement Learning for Spectrum Sharing in Vehicular Networks
Abstract—This paper investigates the spectrum sharing problem in vehicular networks, where multiple vehicle-to-vehicle (V2V) links reuse the frequency spectrum preoccupied by vehicle-to-infrastructure (V2I) links. We model the resource sharing as a multi-agent reinforcement learning (RL) problem, which is then solved using a fingerprint-based deep Q-network method. The V2V links, each acting as an agent, collectively interact with the vehicular environment, receive distinctive observations yet a common reward, and then improve policy design through updating their Q-networks with gained experiences. Simulation results demonstrate desirable performance of the proposed resource allocation scheme based on multi-agent RL in terms of both V2I capacity and V2V payload delivery probability.

Index Terms—V2X, spectrum access, resource allocation, multi-agent reinforcement learning

I. INTRODUCTION

Vehicular communication, commonly referred to as vehicle-to-everything (V2X) communication, is envisioned to transform connected and intelligent transportation services in many aspects [1], [2]. For vehicular networks, judicious resource allocation is particularly challenging due to strong underlying dynamics and diverse quality-of-service requirements. In response, a heuristic spatial spectrum reuse scheme has been developed in [3] for device-to-device (D2D) based vehicular networks, relieving requirements on full channel state information (CSI). In [4], V2X resource allocation adapts to slowly varying large-scale channel fading and hence reduces network signaling overhead. Further, in [5], similar strategies have been adopted while spectrum sharing is allowed between vehicle-to-infrastructure (V2I) and vehicle-to-vehicle (V2V) links and also among peer V2V links. Sum ergodic capacity of V2I links has been maximized with a V2V reliability guarantee using large-scale fading channel information in [6] or CSI from periodic feedback in [7]. A novel graph-based approach has been further developed in [8] to deal with a generic V2X resource allocation problem. However, the majority of existing methods rely on channel information, large- or small-scale, often in an isolated manner. They ignore the dynamics underlying channel evolution and thus find difficulty in providing direct answers to problems of a sequential nature, such as the requirement of "successfully transmitting B bytes within time T", commonly seen in vehicular networks.

Reinforcement learning (RL) has been shown effective in addressing sequential decision-making problems [9]. In particular, the recent success of deep RL in human-level video game play and AlphaGo has sparked a flurry of interest in the topic. RL is also well suited to resource allocation in vehicular networks in that it can train for objectives that are hard to model or optimize in an exact or analytical manner. Another advantage of using RL for resource allocation is that distributed resource management is made possible by treating each V2V link as an agent that learns to refine its strategy via interacting with the environment [10]–[12]. In this paper, we model the spectrum sharing design in vehicular networks as a multi-agent problem and show that recent progress in multi-agent RL [13], [14] can be exploited to enable each V2V link to learn from experience and work cooperatively to optimize system-level performance in a distributed way.

This work was supported in part by a research gift from Intel Corporation and in part by the National Science Foundation under Grants 1731017 and 1815637.

II. SYSTEM MODEL

Consider a vehicular communication network with M V2I and K V2V links. The V2I links connect M vehicles to the base station (BS) to support bandwidth-intensive applications, such as social networking and media streaming. The K V2V links are formed among vehicles and designed to reliably share safety-critical information among neighboring vehicles, in the form of localized D2D communications. We assume all transceivers use a single antenna. The sets of V2I and V2V links are denoted by M = {1, ..., M} and K = {1, ..., K}, respectively. In this paper, we assume the M V2I links (uplink) have been preassigned M orthogonal spectrum bands and the mth V2I transmitter uses the mth band. To improve spectral efficiency, these bands are reused by the K V2V links.

The channel power gain of the kth V2V link over the mth band follows

g_k[m] = \alpha_k h_k[m],  (1)

where h_k[m] is the frequency-dependent fast (small-scale) fading power component, assumed to be exponentially distributed with unit mean, and \alpha_k captures the large-scale fading effect, including path loss and shadowing, assumed to be frequency independent. The interfering channel from the k'th V2V transmitter to the kth V2V receiver over the mth band, g_{k',k}[m], the interfering channel from the kth V2V transmitter to the BS over the mth band, g_{k,B}[m], the channel from the mth V2I transmitter to the BS, \hat{g}_{m,B}, and
the interfering channel from the mth V2I transmitter to the kth V2V receiver, \hat{g}_{m,k}, are similarly defined.

The received signal-to-interference-plus-noise ratios (SINRs) of the mth V2I link and the kth V2V link over the mth band are expressed as

\gamma_m^c[m] = \frac{P_m^c \hat{g}_{m,B}[m]}{\sigma^2 + \sum_{k} \rho_k[m] P_k^d[m] g_{k,B}[m]},  (2)

and

\gamma_k^d[m] = \frac{P_k^d[m] g_k[m]}{\sigma^2 + I_k[m]},  (3)

respectively, where P_m^c and P_k^d[m] denote the transmit powers of the mth V2I transmitter and the kth V2V transmitter over the mth band, respectively, \sigma^2 is the noise power, and

I_k[m] = P_m^c \hat{g}_{m,k}[m] + \sum_{k' \neq k} \rho_{k'}[m] P_{k'}^d[m] g_{k',k}[m],  (4)

denotes the interference power. \rho_k[m] is the binary spectrum allocation indicator, with \rho_k[m] = 1 implying the kth V2V link uses the mth band and \rho_k[m] = 0 otherwise. We assume each V2V link only accesses one band, i.e., \sum_m \rho_k[m] \le 1.

Capacities of the mth V2I link and the kth V2V link over the mth band are then obtained as

C_m^c[m] = W \log(1 + \gamma_m^c[m]),  (5)

and

C_k^d[m] = W \log(1 + \gamma_k^d[m]),  (6)

where W is the bandwidth of each spectrum band.

Per the requirements of different vehicular links, the objective is to design power control and spectrum allocation schemes that simultaneously maximize the sum V2I capacity, \sum_m C_m^c[m], and the V2V payload delivery probability,

\Pr\left\{ \sum_{t=1}^{T} \sum_{m=1}^{M} \rho_k[m] C_k^d[m, t] \ge B/\Delta T \right\}, \quad k \in K,  (7)

where B is the payload size, \Delta T is the channel coherence time, T is the payload generation period, and the index t is added in C_k^d[m, t] to indicate V2V capacity at different time slots.

III. MULTI-AGENT RL BASED RESOURCE ALLOCATION

A. Multi-Agent Reinforcement Learning

RL deals with sequential decision making, where an agent learns to map situations to actions so as to maximize certain rewards through interacting with the environment [9]. Mathematically, the RL problem can be modeled as a Markov decision process (MDP). At each discrete time step t, the agent observes some representation of the environment state S_t from the state space S and then selects an action A_t from the action set A. Following the action, the agent receives a numerical reward R_{t+1} and the environment transitions to a new state S_{t+1}, with probability \Pr\{S_{t+1} = s', R_{t+1} = r \mid S_t = s, A_t = a\} \triangleq p(s', r | s, a). In RL, decision making manifests itself in a policy \pi(a|s), which is a mapping from states in S to probabilities of selecting each action in A. Q-learning [15] is a popular model-free method (meaning the MDP dynamics p(s', r | s, a) need not be known) for solving RL problems. It is based on the action-value function q_\pi(s, a), defined as the expected cumulative discounted reward starting from the state s, taking the action a, and thereafter following the policy \pi. Q-learning solves for the optimal action value q_*(s, a) through iteration, from which the optimal policy \pi_* can be derived. Please refer to [9] for details.

In many problems of practical interest, the state and action spaces can be too large to store all action-value functions in tabular form. Hence, a deep neural network parameterized by \theta, called a deep Q-network (DQN), is used to approximate the action-value function in deep Q-learning [16]. The state-action space is explored with some soft policy, e.g., \epsilon-greedy, and the transition tuple (S_t, A_t, R_{t+1}, S_{t+1}) is stored in a replay memory. At each step, a mini-batch of experiences D is uniformly sampled from the memory for updating \theta, hence the name experience replay, to minimize

\sum_{D} \left[ R_{t+1} + \gamma \max_{a'} Q(S_{t+1}, a'; \theta^-) - Q(S_t, A_t; \theta) \right]^2,  (8)

where \gamma is a discount factor and \theta^- are the parameters of a target Q-network, which are duplicated from the training Q-network parameters \theta periodically and fixed for a few updates.

The multi-agent RL problem setup consists of multiple agents, denoted by k \in K, concurrently exploring an unknown environment [13], [14]. At each time step t, given the underlying environment state S_t of the system with all agents, each agent k receives a local observation Z_t^{(k)}, determined by the observation function O as Z_t^{(k)} = O(S_t, k), and then takes an action A_t^{(k)}, forming a joint action A_t. Thereafter, each agent receives a reward R_{t+1} and the environment evolves to the next state S_{t+1} with probability p(s', r | s, a).

B. Resource Sharing with Multi-Agent RL

We treat each V2V link as an agent, and multiple agents collectively learn from interactions with the environment. While the problem appears to be a competitive game, we turn it into a cooperative one by using the same reward for all agents in the interest of system-level performance. We focus on settings with centralized training and distributed implementation.

State and Observation Space. The true environment state, S_t, which could include global channel conditions and all agents' behaviors, is unknown to each individual V2V agent. Each agent can only acquire knowledge of the underlying environment through the lens of an observation function. The observation space of an individual V2V agent k contains local channel information, including its own signal channel, g_k[m], for all m \in M, interference channels from other V2V transmitters, g_{k',k}[m], for all k' \neq k and m \in M, the interference channel from its own transmitter to the BS, g_{k,B}[m], for all m \in M, and the interference channels from the V2I transmitters, \hat{g}_{m,k}[m], for all m \in M.
2019 IEEE 20th International Workshop on Signal Processing Advances in Wireless Communications (SPAWC)
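As a concrete illustration of the deep Q-learning update that minimizes the loss in (8) — a minimal sketch, not the authors' implementation — the following NumPy code maintains training parameters theta and periodically copied target parameters theta^-, samples mini-batches uniformly from a replay memory, and applies gradient steps on the squared TD error. The state dimension, action count, learning rate, and dummy environment are all assumptions for illustration, and a linear Q-function stands in for the deep network to keep the sketch short.

```python
import numpy as np

rng = np.random.default_rng(0)

STATE_DIM, N_ACTIONS = 8, 4   # assumed sizes, not from the paper
GAMMA, LR = 0.95, 1e-2        # discount factor and learning rate (assumptions)

# Training parameters theta and target parameters theta^- of Eq. (8);
# a linear Q-function Q(s, a) = theta[a] . s stands in for the deep network.
theta = rng.normal(scale=0.1, size=(N_ACTIONS, STATE_DIM))
theta_minus = theta.copy()

replay = []  # stores transition tuples (S_t, A_t, R_{t+1}, S_{t+1})

def q_values(params, s):
    """Q(s, .) for all actions under the given parameters."""
    return params @ s

def td_update(batch):
    """One mini-batch gradient step on the squared TD error of Eq. (8)."""
    global theta
    for s, a, r, s_next in batch:
        target = r + GAMMA * np.max(q_values(theta_minus, s_next))  # bootstrap from theta^-
        td_error = target - q_values(theta, s)[a]
        theta[a] += LR * td_error * s  # gradient of the squared TD error w.r.t. theta[a]

# Interact with a dummy environment and train.
for step in range(500):
    s = rng.normal(size=STATE_DIM)
    a = int(rng.integers(N_ACTIONS))   # exploratory action
    r = float(s[a])                    # dummy reward, for illustration only
    s_next = rng.normal(size=STATE_DIM)
    replay.append((s, a, r, s_next))

    if len(replay) >= 32:
        idx = rng.integers(len(replay), size=32)  # uniform sampling: experience replay
        td_update([replay[i] for i in idx])

    if step % 100 == 0:                # periodic copy theta -> theta^-
        theta_minus = theta.copy()

print(theta.shape)
```

The essential structure mirrors the text: exploration fills the replay memory, mini-batches are drawn uniformly, and the max in the target uses the frozen parameters theta^- rather than theta, which stabilizes training.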
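The distributed execution described in Section III-B — each V2V agent k mapping its local observation Z_t^{(k)} through its own Q-network and selecting an action epsilon-greedily — can be sketched as follows. The observation size and the action space (here a hypothetical 3 bands x 2 power levels) are illustrative assumptions, and linear per-agent Q-networks again stand in for DQNs.

```python
import numpy as np

rng = np.random.default_rng(1)

N_AGENTS, OBS_DIM = 4, 6   # assumed: 4 V2V agents, toy observation size
N_ACTIONS = 3 * 2          # assumed: 3 spectrum bands x 2 power levels (illustrative)
EPSILON = 0.1              # exploration rate of the epsilon-greedy policy

# One independent Q-network per V2V agent (linear here, for brevity).
q_nets = [rng.normal(scale=0.1, size=(N_ACTIONS, OBS_DIM)) for _ in range(N_AGENTS)]

def observe(state, k):
    """Observation function O(S_t, k): agent k sees only its local slice."""
    return state[k]

def select_actions(state, epsilon=EPSILON):
    """Distributed epsilon-greedy selection: each agent acts on its own observation."""
    joint_action = []
    for k, net in enumerate(q_nets):
        z_k = observe(state, k)
        if rng.random() < epsilon:
            a_k = int(rng.integers(N_ACTIONS))  # explore
        else:
            a_k = int(np.argmax(net @ z_k))     # exploit the local Q-network
        joint_action.append(a_k)
    return joint_action

# Dummy global state: one local observation vector per agent.
state = rng.normal(size=(N_AGENTS, OBS_DIM))
actions = select_actions(state)
print(actions)  # one (band, power) index per V2V agent
```

Note that each agent needs only Z_t^{(k)}, never the global state, which is what makes the implementation distributed even though the agents share a common reward during centralized training.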
TABLE I
Bandwidth                         4 MHz
Vehicle drop and mobility model   Urban case of A.1.2 in [17]*
V2I transmit power P_m^c          23 dBm
V2V payload generation period T   100 ms
V2V payload size B                [1, 2, ...] x 1060 bytes
Large-scale fading update         A.1.4 in [17], every 100 ms
Small-scale fading update         Rayleigh fading, every 1 ms
* We shrink the height and width of the simulation area by a factor of 2.

[Fig. 2. V2V payload delivery probability with varying payload sizes. Curves: Random baseline, MARL, SARL.]
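The channel model of (1), driven at the two update timescales in Table I (large-scale fading every 100 ms, small-scale Rayleigh fading every 1 ms), can be sketched as below. The path-loss and shadowing ranges are placeholder assumptions, not the values of [17]; only the structure — exponential unit-mean small-scale power on top of a slowly refreshed large-scale gain — follows the paper.

```python
import numpy as np

rng = np.random.default_rng(2)

N_V2V, N_BANDS = 4, 4
SIM_MS = 300                   # simulate 300 ms
LARGE_SCALE_PERIOD_MS = 100    # large-scale fading update period (Table I)

def draw_large_scale():
    """Placeholder large-scale gain alpha_k: path loss plus lognormal shadowing."""
    path_loss_db = -rng.uniform(60, 100, size=N_V2V)  # assumed range, illustrative
    shadowing_db = rng.normal(0, 8, size=N_V2V)       # 8 dB shadowing std, assumed
    return 10 ** ((path_loss_db + shadowing_db) / 10)

alpha = draw_large_scale()
gains = np.empty((SIM_MS, N_V2V, N_BANDS))

for t in range(SIM_MS):
    if t % LARGE_SCALE_PERIOD_MS == 0:
        alpha = draw_large_scale()                    # refresh every 100 ms
    # Rayleigh fading power |h_k[m]|^2 ~ Exp(1), redrawn every 1 ms per band.
    h = rng.exponential(1.0, size=(N_V2V, N_BANDS))
    gains[t] = alpha[:, None] * h                     # g_k[m] = alpha_k * h_k[m], Eq. (1)

print(gains.shape)
```

Since the small-scale power averages to one, the time-averaged gain of each link tracks its large-scale component alpha_k, which is why large-scale information alone already supports the slow-timescale schemes cited in the introduction.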
agent to adjust its action. Each V2V agent k has a Q-network that takes as input the current observation, Z_t^{(k)}, and outputs the value functions for all actions. We train the Q-networks through running multiple episodes. At each training step,

[Fig. 3. Remaining V2V payload (bytes) over time for V2V Links 1-4. (a) The remaining payload of MARL. (b) The remaining payload of the random baseline.]

To understand why the proposed method achieves better performance, we select an episode in which the proposed method enables all V2V links to successfully deliver the payload of 2,120 bytes while the random baseline fails. We plot in Fig. 3 the change of the remaining V2V payload within the time constraint, i.e., T = 100 ms, for all V2V links. From Fig. 3(a), V2V Link 4 finishes payload delivery early in the episode while the other three links end transmission roughly at the same time for the proposed method. For the random baseline, Fig. 3(b) shows that V2V Links 1 and 4 successfully deliver all payload early in the episode. V2V Link 3 also finishes payload transmission, albeit much later in the episode, while V2V Link 2 fails to deliver the required payload.

In Fig. 4, we show the rates of all V2V links under the two schemes at each step in the same episode as Fig. 3. Comparing Fig. 4(a) and (b), we find that with the proposed method, V2V Link 4 gets high rates at the beginning to finish transmission early, leveraging its good channel condition and incurring no interference to other links at later stages of the episode. V2V Link 1 keeps low rates at first such that the vulnerable V2V Links 2 and 3 can get relatively good transmission rates to deliver payload, and then jumps to high data rates to deliver its own data when both Links 2 and 3 finish transmission. Moreover, a closer examination of Links 2 and 3 reveals their clever strategy of taking turns to transmit to avoid mutual interference. To summarize, the proposed multi-agent RL based method learns to leverage good channels of some V2V links and meanwhile provides protection for those with bad channel conditions.

[Fig. 4. V2V transmission rates (Mbps) vs. time step (ms) of different resource allocation schemes within the same episode as Fig. 3. Only the results of the beginning 30 ms are plotted for better presentation. The initial payload size B = 2,120 bytes. (a) V2V transmission rates of MARL. (b) V2V transmission rates of the random baseline.]

V. CONCLUSION

In this paper, we have presented a distributed resource sharing scheme for vehicular networks based on multi-agent RL. A fingerprint-based method has been exploited to address the nonstationarity of independent Q-learning for multi-agent RL problems when using DQN with experience replay. The proposed method is shown to be effective in encouraging cooperation among V2V links to improve system-level performance, albeit with distributed implementation.

REFERENCES

[1] L. Liang, H. Peng, G. Y. Li, and X. Shen, "Vehicular communications: A physical layer perspective," IEEE Trans. Veh. Technol., vol. 66, no. 12, pp. 10647-10659, Dec. 2017.
[2] H. Peng, L. Liang, X. Shen, and G. Y. Li, "Vehicular communications: A network layer perspective," IEEE Trans. Veh. Technol., vol. 68, no. 2, pp. 1064-1078, Feb. 2019.
[3] M. Botsov, M. Klügel, W. Kellerer, and P. Fertl, "Location dependent resource allocation for mobile device-to-device communications," in Proc. IEEE WCNC, Apr. 2014, pp. 1679-1684.
[4] W. Sun, E. G. Ström, F. Brännström, K. Sou, and Y. Sui, "Radio resource management for D2D-based V2V communication," IEEE Trans. Veh. Technol., vol. 65, no. 8, pp. 6636-6650, Aug. 2016.
[5] W. Sun, D. Yuan, E. G. Ström, and F. Brännström, "Cluster-based radio resource management for D2D-supported safety-critical V2X communications," IEEE Trans. Wireless Commun., vol. 15, no. 4, pp. 2756-2769, Apr. 2016.
[6] L. Liang, G. Y. Li, and W. Xu, "Resource allocation for D2D-enabled vehicular communications," IEEE Trans. Commun., vol. 65, no. 7, pp. 3186-3197, Jul. 2017.
[7] L. Liang, J. Kim, S. C. Jha, K. Sivanesan, and G. Y. Li, "Spectrum and power allocation for vehicular communications with delayed CSI feedback," IEEE Wireless Commun. Lett., vol. 6, no. 4, pp. 458-461, Aug. 2017.
[8] L. Liang, S. Xie, G. Y. Li, Z. Ding, and X. Yu, "Graph-based resource sharing in vehicular communication," IEEE Trans. Wireless Commun., vol. 17, no. 7, pp. 4579-4592, Jul. 2018.
[9] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. MIT Press, 1998.
[10] H. Ye, G. Y. Li, and B.-H. Juang, "Deep reinforcement learning for resource allocation in V2V communications," IEEE Trans. Veh. Technol., vol. 68, no. 4, pp. 3163-3173, Apr. 2019.
[11] H. Ye, L. Liang, G. Y. Li, J. Kim, L. Lu, and M. Wu, "Machine learning for vehicular networks: Recent advances and application examples," IEEE Veh. Technol. Mag., vol. 13, no. 2, pp. 94-101, Jun. 2018.
[12] L. Liang, H. Ye, and G. Y. Li, "Toward intelligent vehicular networks: A machine learning framework," IEEE Internet Things J., vol. 6, no. 1, pp. 124-135, Feb. 2019.
[13] S. Omidshafiei, J. Pazis, C. Amato, J. P. How, and J. Vian, "Deep decentralized multi-task multi-agent reinforcement learning under partial observability," in Proc. Int. Conf. Mach. Learning (ICML), 2017, pp. 2681-2690.
[14] J. Foerster, N. Nardelli, G. Farquhar, T. Afouras, P. H. S. Torr, P. Kohli, and S. Whiteson, "Stabilising experience replay for deep multi-agent reinforcement learning," in Proc. Int. Conf. Mach. Learning (ICML), 2017, pp. 1146-1155.
[15] C. J. C. H. Watkins and P. Dayan, "Q-learning," Machine Learning, vol. 8, no. 3-4, pp. 279-292, May 1992.
[16] V. Mnih, K. Kavukcuoglu, D. Silver, et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, pp. 529-533, Feb. 2015.
[17] 3rd Generation Partnership Project; Technical Specification Group Radio Access Network; Study on LTE-based V2X Services (Release 14), 3GPP TR 36.885 V14.0.0, Jun. 2016.
[18] S. Ruder, "An overview of gradient descent optimization algorithms," arXiv preprint arXiv:1609.04747, 2016.