This article has been accepted for inclusion in a future issue of this journal.

Content is final as presented, with the exception of pagination.

Deep-Learning-Based Wireless Resource Allocation With Application to Vehicular Networks

By LE LIANG, Member IEEE, HAO YE, Student Member IEEE, GUANDING YU, Senior Member IEEE, AND GEOFFREY YE LI, Fellow IEEE

ABSTRACT | It has been a long-held belief that judicious resource allocation is critical to mitigating interference, improving network efficiency, and ultimately optimizing wireless communication performance. The traditional wisdom is to explicitly formulate resource allocation as an optimization problem and then exploit mathematical programming to solve the problem to a certain level of optimality. Nonetheless, as wireless networks become increasingly diverse and complex, for example, in the high-mobility vehicular networks, the current design methodologies face significant challenges and thus call for rethinking of the traditional design philosophy. Meanwhile, deep learning, with many success stories in various disciplines, represents a promising alternative due to its remarkable power to leverage data for problem solving. In this article, we discuss the key motivations and roadblocks of using deep learning for wireless resource allocation with application to vehicular networks. We review major recent studies that mobilize the deep-learning philosophy in wireless resource allocation and achieve impressive results. We first discuss deep-learning-assisted optimization for resource allocation. We then highlight the deep reinforcement learning approach to address resource allocation problems that are difficult to handle in the traditional optimization framework. We also identify some research directions that deserve further investigation.

KEYWORDS | Deep learning; reinforcement learning (RL); resource allocation; vehicular networks; wireless communications.

Manuscript received July 6, 2019; revised October 1, 2019; accepted November 30, 2019. This work was supported in part by a research gift from Intel Corporation and in part by the National Science Foundation under Grant 1731017 and Grant 1815637. (Corresponding author: Le Liang.)
L. Liang is with Intel Labs, Hillsboro, OR 97124 USA (e-mail: lliang@gatech.edu).
H. Ye and G. Y. Li are with the School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, GA 30332 USA (e-mail: yehao@gatech.edu; liye@ece.gatech.edu).
G. Yu is with the Department of Information Science and Electronic Engineering, Zhejiang University, Hangzhou 310027, China (e-mail: yuguanding@zju.edu.cn).
Digital Object Identifier 10.1109/JPROC.2019.2957798

I. INTRODUCTION

In the last few decades, wireless communications have been relentlessly pursuing higher throughput, lower latency, higher reliability, and better coverage. In addition to designing more efficient coding, modulation, channel estimation, equalization, and detection/decoding schemes, optimizing the allocation of limited communication resources is another effective approach [1].

From Shannon's information capacity theorem [2], power and bandwidth are the two primitive resources in a communication system. They determine the capacity of a wireless channel, up to which the information can be transmitted with an arbitrarily small error rate. For modern wireless communication systems, the definition of communication resources has been substantially enriched. Beams in a multiple-input multiple-output (MIMO) system, time slots in a time-division multiple-access system (TDMA), frequency subbands in a frequency-division multiple-access system, spreading codes in a code-division
multiple-access system, and even base stations or backhaul links in virtualized wireless networks all count. Judicious allocation of these communication resources in response to channel conditions and user's quality-of-service (QoS) requirements is critical in wireless system optimization. For example, water-filling power allocation needs to be performed over different subcarriers in an orthogonal frequency division multiplexing (OFDM) system or different channel eigen directions in a MIMO system for capacity maximization. Power and spectrum allocation is important in a cellular network, especially with device-to-device (D2D) underlay, to manage interference and optimize network throughput. In addition to optimizing for traditional physical layer communication metrics, such as capacity maximization [3] and power minimization [4], cross-layer resource allocation takes account of the requirements of upper layers, for example, delay and fairness, through optimizing properly defined utility functions [5]–[7].

Historically, the dominant approach to resource allocation is through mathematical programming, where we optimize one of the design criteria of interest, for example, sum rate maximization or interference minimization, while imposing constraints on the remaining. Despite the remarkable success in this domain, it turns out that many of the formulated optimization problems are difficult to solve [8]. Moreover, with a myriad of new applications to support, conventional methods find it increasingly difficult to balance and model the diverse service requirements in a mathematically exact way. As an example, for ultrareliable and low-latency communications (URLLC), one of the three major 5G usage scenarios, the definitions of latency and reliability are still subject to debate [9], not to mention a principled approach to provide performance guarantees. A more flexible framework for wireless resource allocation is thus needed, motivating a departure from the traditional wireless design philosophy.

Recently, adaptation of machine-learning tools to address difficult problems in wireless communications and networking has gained momentum, ranging from physical layer design [10]–[15], resource allocation [16], networking [17], caching [18], to edge computing [19]. In fact, the latest cycle of enthusiasm for machine learning is largely triggered by the exceptional performance of deep learning in a broad array of application scenarios. Deep learning provides multilayer computation models that learn efficient representations of data with multiple levels of abstraction from unstructured sources. It enables a powerful data-driven approach to many problems that are traditionally deemed hard due to, e.g., lack of accurate models or prohibitively high computational complexity. In the wireless resource allocation context, deep learning has been demonstrated to achieve significant performance improvement over conventional methods in several recent works [20]–[22]. In particular, it has been demonstrated that deep reinforcement learning (RL) is capable of providing a nice treatment of service requirements that are hard to model exactly and thus also not subject to any effective optimization approaches. Furthermore, we can exploit the rich expert knowledge in wireless communications developed over the past few decades to complement the data-driven deep-learning methods to improve data efficiency via, e.g., deep transfer learning [23], [24]. The goal of this article is to review some of the most promising results and discuss the principles, benefits, and potential challenges of leveraging deep learning to address wireless resource allocation problems in general with application to vehicular networks as a special example. Since this is a fast evolving field, we do not attempt to exhaustively cover all research achievements in this area but only highlight those that closely align with our theme. We refer interested readers to other excellent survey and tutorial articles on various aspects of leveraging learning concepts in the wireless context [17], [18], [25]–[29] to get a complete picture. Compared to them, this article differentiates itself in that we exclusively focus on wireless resource allocation using deep learning. We emphasize the fundamental properties of this category of problems that make the deep-learning approach appealing and demonstrate through extensive examples how to unleash the power of this promising method to its fullest extent.

The article is organized as follows. In Section II, we discuss the limitations of traditional optimization methods for wireless resource allocation and motivate deep learning in addressing the issue. In Section III, we present examples on how to leverage deep learning to solve resource optimization problems in a more efficient way. In Section IV, deep RL-based methods are discussed in detail that warrant a fundamental shift in treating resource allocation problems in a more flexible and effective framework. In Section V, we recognize and highlight several open issues that are worth further investigation. Concluding remarks are finally made in Section VI.

Fig. 1. Classification of approaches to wireless resource allocation.

II. MACHINE LEARNING FOR RESOURCE ALLOCATION

As shown in Fig. 1, the mainstream approach to wireless resource allocation has long been to formulate the
design objective and constraints as an optimization problem. It is then solved to certain levels of optimality, depending on problem complexity and allowable computation time, by leveraging tools from various disciplines, including mathematical programming, graph theory, game theory, etc. In this section, we discuss the limitations of these optimization approaches and highlight the potential of rising data-driven methods enabled by machine learning, in particular deep learning. Methods that combine the theoretical models derived from domain knowledge and the data-driven capabilities of learning are also briefly discussed.

A. Limitation of Traditional Optimization Approach

Except in a few simple cases, where we are fortunate enough to end up with convex optimization that admits a systematic procedure to find the global optimum, most optimization problems formulated for wireless resource allocation are strongly nonconvex (continuous power control), combinatorial (discrete channel assignment), or mixed-integer nonlinear programming (MINLP) (combined power control and spectrum assignment). For instance, it has been shown in [8] that the spectrum management problem in a frequency selective channel, where multiple users share the same spectrum, is nonconvex and NP-hard. There is no known algorithm that can solve the problem to optimality with polynomial time complexity. To deal with problems of this kind, we are often satisfied with a locally optimal solution or some good heuristics without any performance guarantee. More often than not, even these suboptimal methods are computationally complex and hard to execute in real time. Another limitation of existing optimization approaches points to the requirement of exact models, which tend to abstract away many imperfections in reality for mathematical tractability, and the solution is highly dependent on the accuracy of the models. However, the wireless communication environment is constantly changing by nature and the resultant uncertainty in model parameters, e.g., channel state information (CSI) accuracy, undermines the performance of the optimized solution.

Finally, as wireless networks grow more complex and versatile, a lot of new service requirements do not directly translate to the performance metrics that the communication community is used to, such as the sum rate or proportional fairness. For example, in high-mobility vehicular networks, the simultaneous requirements of capacity maximization for vehicle-to-infrastructure (V2I) links and reliability enhancement for vehicle-to-vehicle (V2V) links [27], [28] do not admit an obvious formulation. In particular, if we define the reliability of V2V links as the successful delivery of packets of size B within the time constraint T [30], [31], the problem becomes a sequential decision problem spanning the whole T time steps and is difficult to solve in a mathematically exact manner. To avoid such difficulties, traditional optimization-based methods break down the problem into isolated resource allocation decisions at each time step without considering the long-term effect. For instance, methods in [32]–[35] reformulate the requirement as a signal-to-interference-plus-noise ratio (SINR) constraint at each decision step and then use various optimization techniques to solve for the resource allocation solution. Such a practice loses the flexibility to balance V2I and V2V performance across the whole time T and leads to inevitable performance loss.

B. Deep-Learning-Assisted Optimization

Deep learning allows multilayer computation models that learn data representations with multiple levels of abstraction [36], [37]. Each layer computes a linear combination of outputs from the previous layer and then introduces nonlinearity through an activation function to improve its expressive power. Deep learning has seen a recent surge in a wide variety of research areas due to its exceptional performance in many tasks, such as speech recognition and object detection. Coupled with the availability of more computing power and advanced training techniques, deep learning enables a powerful data-driven approach to many problems that are deemed difficult traditionally. In the context of resource allocation, this sheds light on solving hard optimization problems (at least partially).
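As a minimal illustration of the layer computation just described, the following sketch (our own Python/NumPy example, not code from the cited works; weights are random placeholders) applies a learned linear map followed by a nonlinearity at each layer:

```python
import numpy as np

def dense_layer(x, W, b, activation=np.tanh):
    """One DNN layer: a linear combination of the previous layer's
    outputs followed by a nonlinear activation."""
    return activation(W @ x + b)

# Toy forward pass through a two-layer network with random weights.
rng = np.random.default_rng(0)
x = rng.standard_normal(8)                     # input features, e.g., channel gains
h = dense_layer(x, rng.standard_normal((16, 8)), np.zeros(16))
y = dense_layer(h, rng.standard_normal((4, 16)), np.zeros(4), activation=lambda z: z)
print(y.shape)                                 # (4,), e.g., four resource allocation outputs
```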
In a simplistic form, deep learning can be leveraged to learn the correspondence of the parameters and solutions of an optimization problem. The computationally complex procedure to find optimal or suboptimal solutions can be taken offline. With the universal approximation capability of deep neural networks (DNNs), the relationship between the parameter input and the optimization solution obtained from any existing algorithm can be approximated. For implementation in real time, the new parameter is input into the trained DNN and a good solution can be given almost instantly, thus improving its potential for adoption in practice.

In learning tasks, the DNN is usually trained to minimize the discrepancy between the output and the ground truth given an input. With this idea in mind, we can leverage deep learning to directly minimize or maximize the optimization objective, i.e., treating the objective as the loss function in supervised learning. Then various training algorithms, such as stochastic gradient descent, can be employed to find the optimization solution. Compared with the direct input–output relationship learning, this approach lifts the performance limit imposed by the traditional optimization algorithms that are used to generate the training data. Alternatively, deep learning can be embedded as a component to accelerate some steps of a well-behaved optimization algorithm, such as the pruning stage of the branch-and-bound (B&B) in [38] and [39]. This method leverages the theoretical models developed with expert knowledge and achieves near-optimal performance with significantly reduced execution time.


Fig. 2. Agent–environment interaction in RL.

C. Deep RL-Based Resource Allocation

RL addresses sequential decision making via maximizing a numerical reward signal while interacting with the unknown environment, as illustrated in Fig. 2. Mathematically, the RL problem can be modeled as a Markov decision process (MDP). At each discrete time step t, the agent observes some representation of the environment state St from state space S, and then selects an action At from the action set A. Following the action, the agent receives a numerical reward Rt+1 and the environment transitions to a new state St+1, with transition probability p(s′, r|s, a). In RL, decision-making manifests itself in a policy π(a|s), which is a mapping from states in S to probabilities of selecting each action in A. The goal of learning is to find an optimal policy π∗ that maximizes the expected accumulative rewards from any initial state s. The RL framework provides native support for addressing sequential decision making under uncertainty that we encounter in, e.g., the resource allocation problem in vehicular networks. The problem can be tackled by designing a reward signal that correlates with the ultimate objective and the learning algorithm can figure out a decent solution to the problem automatically. Indeed, it is the flexibility of reward design in RL that makes it appealing for solving problems that are deemed difficult traditionally due to the inability to model the design objective exactly.

Here, we briefly review two basic RL algorithms, i.e., (deep) Q-learning and REINFORCE, as representatives of the value-based and policy-based RL methods, respectively. We also provide a concise introduction to the actor–critic algorithm that blends the two RL categories. As for more advanced algorithms, such as natural policy gradients [40], trust region policy optimization [41], and asynchronous advantage actor–critic [42], we refer interested readers to the given reference articles and the survey on deep RL in [43] and [44].

1) Q-Learning: Q-Learning is a popular model-free method to solve RL problems [45]. It is based on the concept of action-value function, qπ(s, a), for policy π, which is defined as the expected accumulative rewards starting from state s, taking action a, and thereafter following policy π. The action-value function of the optimal policy, q∗(s, a), satisfies a recursive relationship, known as the Bellman optimality equation. In principle, one can solve this system of nonlinear equations for q∗(s, a) if the dynamics p(s′, r|s, a) are known. Once q∗ is obtained, it is easy to determine the optimal policy π∗(a|s) following a greedy procedure. Q-learning avoids the difficulty of acquiring the dynamics by taking an iterative update approach, given by

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma \max_{a'} Q(S_{t+1}, a') - Q(S_t, A_t) \right] \tag{1}$$

where α is the step-size parameter, γ ∈ (0, 1] is the MDP discount factor, and the choice of At in state St follows some soft policies, e.g., the ε-greedy, meaning that the action with maximal estimated value is chosen with probability 1 − ε while a random action is selected with probability ε.
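The update in (1) and the ε-greedy exploration rule can be written compactly as follows. This is a generic tabular sketch of ours; the environment interface mentioned in the comments is hypothetical, not an API from any cited reference:

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """One tabular Q-learning update following (1)."""
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])

def epsilon_greedy(Q, s, epsilon=0.1):
    """Pick a random action with probability epsilon, otherwise the greedy action."""
    if np.random.rand() < epsilon:
        return np.random.randint(Q.shape[1])
    return int(np.argmax(Q[s]))

# Usage with a hypothetical environment exposing reset()/step() (not defined here):
# Q = np.zeros((n_states, n_actions))
# s = env.reset()
# a = epsilon_greedy(Q, s)
# s_next, r = env.step(a)
# q_learning_update(Q, s, a, r, s_next)
```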
2) Deep Q-Network (DQN) With Experience Replay: In many problems of practical interest, the state and action space can be too large to store all action-value functions in a tabular form. As a result, it is common to use function approximation to estimate these value functions. In deep Q-learning [46], a DNN parameterized by θ, called DQN, is used to represent the action-value function. The state-action space is explored with some soft policies, e.g., ε-greedy, and the transition tuple (St, At, Rt+1, St+1) is stored in a replay memory at each time step. The replay memory accumulates experiences over many episodes of the MDP. At each step, a mini-batch of experiences D is uniformly sampled from the memory for updating θ with variants of stochastic gradient descent methods, hence the name experience replay, to minimize the sum-squared error

$$\sum_{D} \left[ R_{t+1} + \gamma \max_{a'} Q(S_{t+1}, a'; \theta^{-}) - Q(S_t, A_t; \theta) \right]^2 \tag{2}$$

where θ− is the parameter set of a target Q-network, which is duplicated from the training Q-network parameter set θ periodically and fixed for a couple of updates. Experience replay improves sample efficiency through repeatedly sampling stored experiences and breaks correlation in successive updates, thus also stabilizing learning.
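A minimal deep Q-learning sketch with a replay memory and a target network, mirroring (2), is given below. The network size, the state dimension of 8, and the 4 actions are placeholders of our choosing, not values from any cited work:

```python
import random
from collections import deque

import torch
import torch.nn as nn

class DQN(nn.Module):
    """Small fully connected Q-network: state -> action values."""
    def __init__(self, state_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64), nn.ReLU(),
            nn.Linear(64, n_actions))

    def forward(self, s):
        return self.net(s)

replay = deque(maxlen=10_000)   # stores (s, a, r, s_next) tensors, with a as torch.long
q_net, target_net = DQN(8, 4), DQN(8, 4)
target_net.load_state_dict(q_net.state_dict())
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
gamma = 0.9

def train_step(batch_size=32):
    """Sample a mini-batch from the replay memory and minimize (2)."""
    if len(replay) < batch_size:
        return
    s, a, r, s_next = map(torch.stack, zip(*random.sample(replay, batch_size)))
    q_sa = q_net(s).gather(1, a.view(-1, 1)).squeeze(1)
    with torch.no_grad():                       # target network held fixed between updates
        target = r + gamma * target_net(s_next).max(dim=1).values
    loss = nn.functional.mse_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```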
3) Policy Gradient and Actor Critic: Different from value-based RL methods that estimate a value function and
then use that to compute a deterministic policy, the policy gradient methods directly search the policy space for an optimal one. The policy is usually represented by a function approximator, such as a DNN, parameterized by θ, i.e., πθ(a|s). The parameter set θ is updated in the direction of improvement of a certain performance measure, J(θ), by gradient ascent, θ ← θ + α∇θJ(θ), with proper step size α [47]. In episodic learning tasks, we define J(θ) as the expected return from the start state

$$J(\theta) = \mathbb{E}\left[ \sum_{t=0}^{T} \gamma^{t} R_{t+1} \right] \tag{3}$$

where T is the final time step. From the policy gradient theorem [47], [48], the gradient ∇θJ(θ) follows:

$$\nabla_{\theta} J(\theta) = \mathbb{E}_{\pi}\left\{ \nabla_{\theta} \log \pi_{\theta}(A_t \mid S_t)\, q_{\pi_{\theta}}(S_t, A_t) \right\}. \tag{4}$$

In the REINFORCE algorithm, we use the accumulative rewards obtained from time step t onward as an unbiased estimate of qπθ(St, At) and then run multiple episodes of the task following policy πθ(a|s) to update the parameter set θ in each time step t during training. In fact, the REINFORCE algorithm suffers from high variance and we can simultaneously learn an approximation of the value function for variance reduction, that is

$$Q_w(s, a) \approx q_{\pi_{\theta}}(s, a) \tag{5}$$

with parameter set w to accelerate learning. Various policy evaluation methods, such as temporal difference (TD) learning [48], can be leveraged to address the value function approximation. In the actor–critic terminology, the value function approximation Qw(s, a) is called the critic while the approximate policy πθ(a|s) is called the actor. Then the actor–critic algorithm follows an approximate policy gradient determined by (4) and (5). Finally, we remark that compared with value-based RL methods, the algorithms involving policy gradients can learn stochastic policies and tend to be more effective in high-dimensional or continuous action spaces. Nevertheless, they are more likely to converge to a local rather than global optimum.
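The REINFORCE update based on (3) and (4) can be sketched as follows (PyTorch, with illustrative dimensions of our choosing). Subtracting a learned critic Qw, as in (5), from the returns turns this into an actor–critic variant:

```python
import torch
import torch.nn as nn

policy = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 4))  # logits over 4 actions
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
gamma = 0.99

def reinforce_update(states, actions, rewards):
    """One REINFORCE update from a single episode.

    states: tensor of shape (T, 8); actions: long tensor of shape (T,);
    rewards: sequence of length T with R_1, ..., R_T.
    """
    # Discounted return from each step onward, used as the estimate of q_pi(S_t, A_t).
    returns, g = [], 0.0
    for r in reversed(list(rewards)):
        g = r + gamma * g
        returns.append(g)
    returns = torch.tensor(list(reversed(returns)))

    log_probs = torch.distributions.Categorical(logits=policy(states)).log_prob(actions)
    loss = -(log_probs * returns).sum()     # ascent on J(theta) = descent on -J(theta)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```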
In fact, there are a good number of existing studies leveraging the concept of MDP for resource allocation, e.g., the delay-optimal OFDMA power control and spectrum allocation in [49] and [50] and virtualized radio resource scheduling for software-defined vehicular networks in [51]. However, we do not treat them as (deep) RL approaches as they assume the availability of MDP transition dynamics more or less and are not learning from interactions with an unknown environment. Such trial-and-error learning behavior is a key ingredient in making RL as successful as it is today, from our perspective.

III. DEEP-LEARNING-ASSISTED OPTIMIZATION FOR RESOURCE ALLOCATION

This section deals with the employment of deep learning to find (near-)optimal solutions of the optimization problems for wireless resource allocation in an efficient manner. There are three ways to incorporate deep learning in solving the optimization problems.

1) Supervised Learning Paradigm: The DNNs are applied to learn the mapping from the parameter input to the solution of a given optimization algorithm.
2) Objective-Oriented Unsupervised Learning Paradigm: The optimization objective is employed as the loss function, which is optimized directly during training.
3) Learning Accelerated Optimization Paradigm: The deep-learning technique is embedded as a component to accelerate some steps of a given optimization algorithm.

A. Supervised Learning Approach for Optimization

The most straightforward way to leverage deep learning for resource allocation is treating a given optimization problem as a black box and using various deep-learning techniques to learn its input–output relation. In this case, a traditional optimization method will act as a supervisor, whose output will serve as the ground truth for training the DNNs. With the universal approximation property of DNNs, the mapping from the parameter input to the solution of the given optimization algorithm can be approximated.

Fig. 3. Supervised paradigm of using deep learning for optimization. (a) Training stage. (b) Testing stage.

The training and testing stages are shown in Fig. 3. During the training stage in Fig. 3(a), labeled samples are generated by running some algorithms of the involved optimization problem using simulated data. We then leverage the obtained training data set to minimize the discrepancy between DNN outputs and the optimized solutions by updating DNN weights. In the testing stage in Fig. 3(b), we also generate the labeled testing samples using the same mathematical algorithm. We pass the parameters of a new problem instance as input to the trained network and collect the inferred solution before comparing it with its corresponding true label.
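The following sketch illustrates the supervised paradigm of Fig. 3 end to end. The "supervisor" here is a deliberately trivial placeholder of ours standing in for an algorithm such as WMMSE in [16], and all sizes are illustrative:

```python
import torch
import torch.nn as nn

def solve_power_allocation(h):
    """Stand-in for the supervising optimizer (e.g., WMMSE).
    A trivial placeholder so the sketch runs end to end."""
    p = torch.zeros_like(h)
    p[torch.argmax(h)] = 1.0          # full (normalized) power to the strongest link
    return p

# Offline: generate labeled samples by running the supervisor on simulated channels.
channels = torch.rand(5000, 4)
labels = torch.stack([solve_power_allocation(h) for h in channels])

net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 4), nn.Sigmoid())
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

for epoch in range(10):                               # training stage, Fig. 3(a)
    for i in range(0, len(channels), 128):
        x, y = channels[i:i + 128], labels[i:i + 128]
        loss = nn.functional.mse_loss(net(x), y)      # discrepancy from the supervisor's output
        opt.zero_grad(); loss.backward(); opt.step()

# Online: a new problem instance passes through the trained DNN almost instantly, Fig. 3(b).
p_hat = net(torch.rand(1, 4))
```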
DNNs have shown their power in mimicking solutions of state-of-the-art optimal or heuristic algorithms while reducing computational complexity. In [16], the NP-hard power allocation problem in an interference channel has been solved by a deep-learning-enabled method that uses a DNN to approximate the weighted minimum mean-squared error (WMMSE) algorithm [52]. Simulation results suggest that the performance of the DNN is very close to that of the WMMSE algorithm while significantly reducing computational complexity. The challenging energy-efficient power control problem in a multicell network has been solved by deep learning in [53], which learns from the optimally performing B&B algorithm. It is worth
noting that in [53] a specially designed bound has been proposed for the B&B algorithm to reduce the computational complexity of offline training data generation. The learned DNN is demonstrated to achieve virtually optimal performance in terms of weighted sum energy efficiency while requiring an extremely small online complexity.

Table 1 Performance Comparison of Different Methods for LSAP Problems With n = 4

In [54], a similar paradigm is used to deal with a classical combinatorial optimization problem, i.e., the linear sum assignment programming (LSAP) problem, which is frequently encountered in wireless resource allocation. The LSAP problem is about how to assign n jobs to n people in the best way so that some utility (cost) function can be maximized (minimized). The optimal solution to the LSAP problem can be obtained by the Hungarian algorithm [55] with a computational complexity of O(n³), which is impractical for real-time implementation in many applications. To reduce complexity, the LSAP problem has been first decomposed into several subassignment problems, which are essentially classification problems. Then, DNNs are utilized to solve each subassignment problem by approximating the solutions obtained from running the Hungarian algorithm offline. Finally, a low-complexity greedy collision-avoidance rule has been used to obtain the inferred output for the LSAP problem. As shown in Table 1, DNNs, including the feed-forward network (FNN) and convolutional neural network (CNN), can be used to obtain a real-time solution to the LSAP problem with a slight loss of accuracy.
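A compressed sketch of this idea is shown below: SciPy's linear_sum_assignment plays the role of the offline Hungarian solver, and a small classifier learns one subassignment problem (the column chosen for the first row). The network size and sample count are ours, not those of [54]:

```python
import numpy as np
import torch
import torch.nn as nn
from scipy.optimize import linear_sum_assignment

n = 4
costs = np.random.rand(5000, n, n).astype(np.float32)

# Offline supervisor: the Hungarian algorithm gives the optimal column for each row.
labels = np.stack([linear_sum_assignment(c)[1] for c in costs])      # shape (5000, n)

# One classifier per subassignment problem: cost matrix -> column index of row 0.
clf = nn.Sequential(nn.Linear(n * n, 128), nn.ReLU(), nn.Linear(128, n))
opt = torch.optim.Adam(clf.parameters(), lr=1e-3)
x = torch.from_numpy(costs.reshape(len(costs), -1))
y = torch.from_numpy(labels[:, 0])                 # row 0's assignment as the class label

for i in range(0, len(x), 256):
    loss = nn.functional.cross_entropy(clf(x[i:i + 256]), y[i:i + 256])
    opt.zero_grad(); loss.backward(); opt.step()

# At run time, the n per-row classifiers are evaluated and a greedy
# collision-avoidance rule resolves any duplicated column picks, as in [54].
```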
In addition, the features of communication networks can be exploited to improve sample efficiency. For instance, a link scheduling method without CSI has been proposed in [56]–[58] for D2D networks with the help of the feature-embedded paradigm. The link scheduling problem focuses on a densely deployed D2D network with a large number of mutually interfering links. The goal of the link scheduling problem is to maximize the network utility by activating a subset of links at any given time. It can be formulated as a nonconvex combinatorial optimization problem. Traditional methods are based on mathematical optimization techniques with the help of accurate CSI. For a network with N D2D links, N² channel coefficients need to be estimated, which is time- and resource-consuming. In order to bypass the channel estimation stage, the feature-embedded paradigm is used for link scheduling. In [56] and [57], transmitter and receiver density grids are first constructed for a given D2D network, and then two designed convolutional filters are used to learn the interference pattern among different D2D links. The convolutional stage corresponds to the feature extraction process in Fig. 4. Outputs from the convolutional stage are then input into a DNN that learns from the scheduling results of the state-of-the-art FPLinQ algorithm [59] as in [56] or in an unsupervised manner as in [57] to be discussed later. The simulation results suggest that DNNs can effectively learn the network interference topology and perform scheduling to near optimum without the help of accurate CSI. The proposed method needs 800 000 training samples and has good scalability and generalizability to different topologies.

Fig. 4. Training stage for the feature-embedded paradigm.

To further reduce the number of required training samples, a graph embedding method has been developed in [58]. For a given D2D network, a graphical model is constructed, where each node corresponds to a D2D link. Then graph embedding is used to learn the vector
representation of each node based on topology information. The output feature vector is then input to a DNN for link scheduling. Compared with the kernel-based feature extraction process in [56] and [57], the graph embedding process only involves the nonlinear function mapping and can be learned with fewer training samples. The simulation results are summarized in Table 2. Note that there are two metrics: classifier accuracy and average sum rate. The former reflects the similarity between the scheduling output of the proposed method and the state-of-the-art FPLinQ algorithm for link scheduling. The latter is the normalized sum rate achieved by the proposed method with respect to that by the FPLinQ algorithm. From Table 2, the proposed method in [58] only needs hundreds of training samples to achieve good performance without accurate CSI.

Table 2 Performance of Graph Embedding-Based Link Scheduling

B. Unsupervised Learning Approach for Optimization

The supervised learning paradigm sometimes suffers from several disadvantages. First, since the ground truth must be provided by conventional optimization algorithms, performance of the deep-learning approach will be bounded by that of the conventional algorithms. Second, a large number of labeled samples are usually required to obtain a good model as demonstrated in [16] and [54] that use 1 000 000 and 50 000 labeled samples for training, respectively. However, high-quality labeled data are difficult to generate in wireless resource allocation due to a few factors, e.g., inherent problem hardness and computational resource constraints. It further limits scalability of the supervised learning methods.

In order to improve performance of the deep-learning-enabled optimization, unsupervised learning approaches have been proposed to train the DNN according to the optimization objective directly, instead of learning from the conventional optimization approach. In general, deep learning is trained to optimize a loss function via stochastic gradient descent. The loss functions are designed from case to case. For instance, in classification problems, the cross-entropy loss is often used while in regression problems, l1 and l2 losses are preferred. Therefore, it is natural to use the objective function from the optimization problem as the loss function so that the objective function can be optimized based on the stochastic gradient descent during training.

Various loss functions have been utilized to train the DNNs for wireless resource management and shown to outperform state-of-the-art heuristics. For example, the sum rate of multiple users is treated as the loss function to train a fully connected DNN in [60] that learns to solve the nonconvex sum rate maximization problem in a multiuser interference channel. With the channel power, hi,j, as the input, the DNN training loss is set as the sum rate, given by

$$\eta_{\mathrm{SE}} = \sum_{i} \log_2\left( 1 + \frac{h_{i,i} P_i}{\sigma^2 + \sum_{k \neq i} h_{k,i} P_k} \right) \tag{6}$$

where Pi, the ith output of the DNN, denotes the transmit power of the ith transmitter, σ² is the noise power, and hi,j denotes the channel power from transmitter i to receiver j. Furthermore, in [20], the CNN is used for power control strategy learning in a similar scenario. It can be trained to improve the spectral efficiency as defined in (6) or the energy efficiency, which is expressed as

$$\eta_{\mathrm{EE}} = \sum_{i} \frac{\eta_{\mathrm{SE}}^{(i)}}{P_i + P_c} \tag{7}$$

where η(i)SE represents the ith summation term in (6) and Pc is the fixed circuit power. The performance of both unsupervised learning approaches is shown to be better than the state-of-the-art WMMSE heuristic [52].
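The objective-as-loss idea behind (6) can be sketched as follows: the DNN maps channel realizations to (normalized) transmit powers and is trained by minimizing the negative sum rate, so no labels are needed. The network size and noise power are placeholders of ours:

```python
import torch
import torch.nn as nn

noise_power = 1.0
net = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 4), nn.Sigmoid())

def sum_rate(h, p):
    """Spectral efficiency (6): h is a (batch, N, N) channel-power tensor with
    h[:, i, j] the gain from transmitter i to receiver j; p is (batch, N)."""
    signal = torch.diagonal(h, dim1=1, dim2=2) * p               # P_i * h_{i,i}
    interference = torch.einsum('bij,bi->bj', h, p) - signal     # sum_{k != i} P_k * h_{k,i}
    return torch.log2(1.0 + signal / (noise_power + interference)).sum(dim=1)

def unsupervised_loss(h):
    p = net(h.flatten(start_dim=1))      # transmit powers in [0, 1], normalized by P_max
    return -sum_rate(h, p).mean()        # maximizing the objective = minimizing its negative

h = torch.rand(128, 4, 4)                # simulated channel realizations, no labels required
loss = unsupervised_loss(h)
loss.backward()
```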
Along this line of thought, resource allocation is further treated as a functional optimization problem in [61] that optimizes the mapping from the channel states to power allocation results such that the long-term average performance of a wireless system is maximized. To enforce constraints in the learning procedure, such as the power budget and QoS requirements, training of DNNs can be undertaken in the dual domain, where the constraints are linearly combined to create a weighted objective. Primal–dual gradient descent is then proposed as a model-free learning approach, where the gradients are estimated by sampling the model functions and wireless channels with the policy gradient method.

Conventionally, the loss functions for deep-learning training are designed for regression and classification tasks. These loss functions are usually of simple forms and have theoretical guarantees for convergence. For instance, the deep linear networks can obtain the global minima with convex loss functions [62] and overparameterized networks with nonlinear activation functions can achieve zero training loss with the quadratic loss function [63]. In the resource allocation problems, however, the loss functions deviate from the commonly used ones in the regression and classification tasks, and hence issues arise from the theoretical and practical aspects in obtaining good performance. As such, in the majority of existing works, additional techniques have been applied to improve training performance for the optimization objectives in wireless resource allocation. For example, in [57], the DNNs are designed with special structures, including the use of convolutional filters and feedback links. In [60], multiple deep networks are ensembled for better performance while in [20], the DNN is pretrained with the WMMSE solution as the ground truth.


C. Deep-Learning Accelerated Optimization

In the two aforementioned learning paradigms, the optimization procedure is usually viewed as a black box and then completely replaced by a deep-learning module. These approaches do not require any prior information about the optimization problems, but need a large amount of training data, labeled or unlabeled, to obtain good performance, albeit with some efforts to improve sample efficiency [56]–[58]. To address the issue, it has been proposed to leverage the domain knowledge by embedding deep learning as a component to accelerate certain parts of a well-behaved optimization algorithm.

The MINLP problem is frequently encountered in wireless resource allocation, which is generally NP-hard and difficult to solve optimally. The generic formulation is given by [38]

$$\begin{aligned} \underset{\boldsymbol{\alpha}, \boldsymbol{\omega}}{\text{maximize}} \quad & f(\boldsymbol{\alpha}, \boldsymbol{\omega}) \\ \text{s.t.} \quad & Q(\boldsymbol{\alpha}, \boldsymbol{\omega}) \leq 0 \\ & \alpha_i \in N, \; \omega_i \in C \quad \forall i \end{aligned} \tag{8}$$

where f(·, ·) is the objective function, αi and ωi are the elements of α and ω, and Q(·, ·) represents certain constraints, such as power or QoS constraints. N and C denote some discrete and continuous set constraints, respectively. Traditional approaches to the MINLP problem are often based on mathematical programming, such as the globally optimal B&B algorithm, which nonetheless has high complexity for real-time implementation.

Table 3 Performance of Accelerated B&B Algorithm With Different Numbers of Training Samples for the Scenario With Five Cellular Users and Two D2D Pairs

The resource allocation in cloud radio access networks (RANs) and D2D systems has been studied in [38] and [39], respectively, which can be formulated as an MINLP problem and subject to the B&B algorithm. By observation, the branching procedure is the most time-consuming part in the B&B algorithm and a good pruning policy can potentially reduce the computational complexity. The more nodes that would not lead to the optimal solution are pruned, the less time is consumed. Therefore, algorithm acceleration can be formulated into a pruning policy learning task. With invariant problem-independent features and appropriate problem-dependent feature selection, the pruning policy learning task can be further converted into a binary classification problem to be solved by deep learning. Simulation results of the learning accelerated optimization method for D2D networks are summarized in Table 3. In this table, ogap, or optimality gap, means the performance gap between the optimal solution and the one achieved by the accelerated algorithm, while speed refers to the speedup with respect to the original B&B algorithm. From this table, the accelerated B&B algorithm can achieve near-optimal performance and meanwhile substantially reduce computational complexity using only tens to hundreds of training samples.
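The pruning-policy-as-classification idea can be sketched as below. The feature layout and training loop are our own placeholders and do not reproduce the feature design of [38] and [39]:

```python
import torch
import torch.nn as nn

# Binary classifier deciding whether a B&B node can be pruned. The feature layout
# is a hypothetical placeholder (e.g., node depth, local bound, gap to incumbent);
# the cited works select their own problem-independent and problem-dependent features.
pruner = nn.Sequential(nn.Linear(6, 32), nn.ReLU(), nn.Linear(32, 1))

def train_pruner(node_features, keep_labels):
    """node_features: (num_nodes, 6); keep_labels: 1.0 if the node lies on the path
    to the optimal solution (must be kept), 0.0 if it can be pruned."""
    opt = torch.optim.Adam(pruner.parameters(), lr=1e-3)
    for _ in range(200):
        logits = pruner(node_features).squeeze(1)
        loss = nn.functional.binary_cross_entropy_with_logits(logits, keep_labels)
        opt.zero_grad(); loss.backward(); opt.step()

def should_prune(features, threshold=0.5):
    # Called inside the branching loop: skip expanding nodes predicted prunable.
    with torch.no_grad():
        return torch.sigmoid(pruner(features)).item() < threshold
```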
IV. DEEP RL-BASED RESOURCE ALLOCATION

Deep RL has been found effective in network slicing [64], integrated design of caching, computing, and communication for software-defined and virtualized vehicular networks [65], multitenant cross-slice resource orchestration in cellular RANs [66], proactive channel selection for long-term evolution (LTE) networks in unlicensed spectrum [67], and beam selection in millimeter wave MIMO systems in [68]. In this section, we highlight some exemplary cases where deep RL shows impressive promise in wireless resource allocation, in particular, for dynamic spectrum access, power allocation, and joint spectrum and power allocation in vehicular networks.

A. Dynamic Spectrum Access

In its simplest form, the dynamic spectrum access problem considers a single user that chooses one of N channels for data transmission, as shown in Fig. 5. In [21], it is assumed that each channel has either "good" or "bad" conditions in each time slot and the channel conditions vary as time evolves. The transmission is successful if the chosen channel is in good condition and unsuccessful otherwise. The user keeps transmitting in successive time slots with the objective of achieving as many successful transmissions as possible.

Fig. 5. Illustration of dynamic spectrum access of a single transmitter with N channels. "G" and "B" represent good and bad channel conditions, respectively.

The problem is inherently partially observable since the user can only sense one of the N channels in each time slot while the conditions of other channels remain unknown. To deal with such partial observability and improve transmission success rates, deep RL has been leveraged in [21] that takes the history of previous actions and observations (success or failure) up to M time slots
as the input to their DQN. The output is a vector of length N with the ith element representing the action value of the input state if channel i is selected. The reward design naturally follows: a reward of +1 is obtained for successful transmission and −1 for failed transmission. The learning agent employs the ε-greedy policy to explore the unknown environment and the accumulated experience is used to improve the policy with the experience replay technique to break data correlation and stabilize training. When the tested N = 16 channels are strongly correlated, the proposed learning-based dynamic spectrum access method outperforms the implementation-friendly model-based Whittle Index heuristic introduced in [69]. It closely approaches the genie-aided Myopic policy (with channel evolution dynamics known), which is known to be optimal or near-optimal in the considered scenario [70], [71].
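One straightforward way to realize the history-based DQN input and the ±1 reward described above is sketched below; the exact encoding used in [21] may differ, so treat the layout as an assumption of ours:

```python
from collections import deque

import numpy as np

N, M = 16, 8                       # number of channels; history length fed to the DQN
history = deque(maxlen=M)          # each entry: (chosen channel index, +/-1 observation)

def build_state():
    """Stack the last M (action, observation) pairs into the DQN input vector."""
    state = np.zeros(M * (N + 1), dtype=np.float32)
    for i, (action, obs) in enumerate(history):
        state[i * (N + 1): i * (N + 1) + N] = np.eye(N, dtype=np.float32)[action]
        state[i * (N + 1) + N] = obs
    return state

def step(action, channel_is_good):
    reward = 1.0 if channel_is_good else -1.0   # +1 on success, -1 on failure
    history.append((action, reward))
    return build_state(), reward
```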
However, as pointed out in [72], the DQN-enabled approach developed in [21] encounters difficulty when the number of channels, N, scales large. To tackle the challenge, a model-free deep RL framework using actor–critic has been further proposed in [72]. The states, actions, and rewards are similarly defined. But two DNNs are now constructed at the agent, one for the actor that maps the current state (history of actions and observations up to M time slots) to the distribution over all actions and is updated using policy gradient, and the other for the critic that evaluates the value of each action (minus a baseline) under the current policy. The agent explores the unknown environment following the policy defined in the actor network and the accumulated experience is then used to update both the actor (with information from the critic network) and critic networks. Finally, the trained actor network is available to guide the selection of channels at the beginning of each time slot. It has been shown that the actor–critic method has comparable performance to the DQN-based approach in [21] when N = 16 and significantly outperforms it when N grows to 32 or 64. Similar to the original DQN-based algorithm, the actor–critic approach also demonstrates strong potential to quickly adapt to sudden nonstationary change in the environment while the actor–critic algorithm is more computationally efficient.

The DQN-based method is also applied to address dynamic spectrum access in a time-slotted heterogeneous network in [73], where a wireless device needs to decide whether to transmit in a time slot when it coexists with other devices that follow TDMA and ALOHA protocols. The transmission fails if more than one device attempts to transmit and the goal is also to maximize the number of successful transmissions. Similar to [21], the wireless device constructs a DQN that takes the history of previous actions (TRANSMIT or WAIT) and observations (SUCCESS, COLLISION, or IDLENESS) up to M time slots, as the input, and the output is the value of each action given the current state. The reward is set to 1 if the observation is SUCCESS and 0 otherwise. The learning agent interacts with the unknown wireless environment to gain experience for Q-network training without prior knowledge about the protocols that other devices follow. With such a model-free deep RL approach that is purely data-driven, the performance is measurably close to the theoretical upper bound that has full model knowledge.

So far, we have demonstrated the power of deep RL in a single-user spectrum access problem. In fact, similar observations translate to the multiuser setting as well, albeit with some variations. As investigated in [74], a set of K = {1, . . . , K} users attempt transmission over N = {1, . . . , N} orthogonal channels. In each time slot, every user selects a channel to send its data and the transmission is successful (with an ACK signal received) if no others use the same channel. Different from a single-user setting, various design objectives can be defined for multiple users, such as sum rate maximization, sum log-rate maximization (known as proportional fairness), etc., depending on the network utility of interest. Apart from its strong combinatorial nature, the problem is difficult in that the environment is only partially observable to each user and nonstationary from the user's perspective due to the interaction among multiple users when they are actively exploring and learning.

The deep RL-based framework developed in [74] assumes a centralized training and distributed implementation architecture, where a central trainer collects the experience from each user, trains a DQN, and sends the trained parameters to all users to update their local Q-networks in the training phase. In the implementation phase, each user inputs local observations into its DQN and then acts according to the network output without any online coordination or information exchange among them. To address partial observability, a long short-term memory (LSTM) layer is added to the DQN that maintains an internal state and accumulates observations over time. The local observation, Si(t), of user i in time slot t includes its action (selected channel), the selected channel capacity, and received ACK signal in time slot t − 1. The observation is then implicitly aggregated in the LSTM layer embedded in the DQN to form a history of the agent. The action, ai(t), of user i in time slot t is drawn according to the following distribution to balance exploitation and exploration:

$$\Pr(a_i(t) = a) = (1 - \alpha)\, \frac{e^{\beta Q(a, S_i(t))}}{\sum_{\tilde{a} \in N} e^{\beta Q(\tilde{a}, S_i(t))}} + \frac{\alpha}{N + 1} \tag{9}$$

where α ∈ (0, 1), β is the temperature to be tuned in training, and Q(a, Si(t)) is the value of selecting channel a for a given observation Si(t) according to the DQN output.
actions (TRANSMIT or WAIT) and observations (SUC- To address the issue of environment nonstationarity,
CESS, COLLISION, or IDLENESS) up to M time slots, experience replay, which has been popular in most deep
as the input, and the output is the value of each action RL training but could continuously confuse the agent
given the current state. The reward is set to 1 if the with outdated experiences in a nonstationary environment,
observation is SUCCESS and 0 otherwise. The learning is disabled during training. More advanced techniques,
agent interacts with the unknown wireless environment such as the dueling network architecture [75] and dou-

P ROCEEDINGS OF THE IEEE 9


This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.

Liang et al.: Deep-Learning-Based Wireless Resource Allocation With Application to Vehicular Networks

double Q-learning [76], are leveraged to improve training convergence. When compared with the classical slotted ALOHA protocol, the opportunistic channel-aware algorithm [77], [78], and the distributed protocol developed in [79], the proposed deep RL method consistently achieves better performance in terms of average channel utilization, average throughput, and proportional fairness with a properly designed reward according to the utility of interest.

B. Power Allocation in Wireless Networks

Power allocation in wireless networks concerns the adaptation of transmit power in response to varying channel and user conditions such that system metrics of interest are optimized. Consider an interference channel, where N communication links, denoted by the set N = {1, . . . , N}, share a single spectrum subband that is assumed frequency flat for simplicity. The transmitter of link i sends information toward its intended receiver with the power, Pi, and the channel gain from the transmitter of link j to the receiver of link i in time slot t is denoted by g_{j,i}^{(t)}, for any i, j ∈ N, including both large- and small-scale fading components. Then the received SINR of link i in time slot t is

$$\gamma_i^{(t)}(\mathbf{P}) = \frac{P_i\, g_{i,i}^{(t)}}{\sum_{j \neq i} P_j\, g_{j,i}^{(t)} + \sigma^2} \tag{10}$$

where P = [P1, . . . , PN]^T is the transmit power vector for the N links and σ² represents the noise power. Then, a dynamic power allocation problem to optimize a generic weighted sum rate is formulated as

$$\begin{aligned} \underset{\mathbf{P}}{\text{maximize}} \quad & \sum_{i=1}^{N} w_i^{(t)} \cdot \log\left( 1 + \gamma_i^{(t)}(\mathbf{P}) \right) \\ \text{s.t.} \quad & 0 \leq P_i \leq P_{\max}, \quad i \in N \end{aligned} \tag{11}$$

where Pmax is the maximum transmit power for all links and w_i^{(t)} is the nonnegative weight of link i in time slot t that can be adjusted to consider sum rate maximization or proportionally fair scheduling.
maximization or proportionally fair scheduling. intercell interference but no intracell interference. Each
Owing to the coupling of transmit power across all links, base station is controlled by a deep RL agent that takes
the optimization problem in (11) is in general nonconvex actions (i.e., performs power adaptation) based on its
and NP-hard [8], not to mention the heavy signaling local observation of the environment, including cell power,
overhead to acquire global CSI. To address the challenge, average reference signal received power, average interfer-
a model-free deep RL-based power allocation scheme has ence, and a local cell reward. The power control action
been developed in [22] that can track the channel evo- is discretized to incremental changes of {0, ±1, ±3} dB
lution and execute in a distributed manner with limited and the reward is designed to reflect system-wide util-
information exchange. ity to avoid selfish decisions. The agents take turns to
The proposed deep RL method in [22] assumes a cen- explore the environment to minimize the impact on each
tralized training architecture, where each transmitter acts other, which stabilizes the learning process. Each deep
as a learning agent that explores the unknown environ- RL agent trains a local Q-network using the experience
ment with an -greedy policy and then sends its explo- accumulated from its interaction with the communication
ration experience to a central controller through backhaul environment. The deep RL-based approach learns to con-
links with some delay. A DQN is trained at the controller trol transmit power for each base station that achieves
using the experience collected from all agents with expe- significant energy savings and fairness among users in the
rience replay, which stores an approximation of action system.


Fig. 6. Illustration of spectrum sharing in vehicular networks, where V2V and V2I links are indexed by k and m, respectively, and each V2I link is preassigned an orthogonal RB.

Power allocation in a more complicated multicell network has been considered in [84], where each cell has one base station serving multiple users and all base stations transmit over the same spectrum, generating both intra- and intercell interference. Each base station is controlled by an RL agent that decides its transmit power to maximize the network sum throughput. Similar to [22], a single RL agent is trained in a centralized manner using experience collected from all transmitters. The trained policy is then shared among all agents and executed in a distributed way. Deep Q-learning, REINFORCE, and actor–critic deep deterministic policy gradient (DDPG) algorithms have been leveraged to train the RL agent and all of them outperform the benchmark WMMSE [52] and FP algorithms [80]. Among the three algorithms, the actor–critic DDPG is the best in terms of both sum rate performance and robustness.

C. Joint Spectrum and Power Allocation: Application Example in Vehicular Networks

In a vehicular network as illustrated in Fig. 6, K V2V links share the spectrum of M V2I links to improve its utilization efficiency. The V2I links are designed for high data-rate entertainment services while the V2V links need to support reliable dissemination of safety-critical messages, formally stated as the successful delivery of packets of size B within the time constraint T. Such a reliability requirement of V2V links is hard to handle with traditional optimization approaches due to its exponential complexity in the length of T. However, it has been shown in [85] and [86] that we can clearly address the issue in the deep RL framework through properly designing a reward that correlates with the objective. We assume that each V2I link has been assigned an orthogonal resource block (RB) and uses a fixed transmit power. Then, each V2V transmitter needs to carefully select the V2I RB to share and adjust its transmit power to avoid strong interference and ensure both V2I and V2V links achieve their respective goals.

1) Single-Agent RL: In view of the difficulty to collect global CSI at a central controller in real time, a distributed resource allocation algorithm has been developed in [85] that leverages deep RL. In particular, each V2V transmitter serves as a learning agent that occupies a local copy of a DQN and follows the ε-greedy policy to explore the unknown environment. The observation of each V2V agent represents its own perception of the unknown environment state, given by

$$S_t = \{G_t, H_t, I_{t-1}, N_{t-1}, L_t, U_t\} \tag{12}$$

where Gt and Ht represent the current V2V signal channel strength and the interference channel strength from the V2V transmitter to the base station over all RBs, respectively, It−1 is the received interference power, Nt−1 is the selected RBs of neighbors in the previous time slot over all RBs, and Lt and Ut denote the remaining load and time to meet the latency constraint from the current time slot, respectively. The action of each V2V agent amounts to a selection of RB as well as discrete transmit power levels. The reward balances V2I and V2V requirements, given by

$$r_t = \lambda_c \sum_{m} C^{c}[m] + \lambda_v \sum_{k} C^{v}[k] - \lambda_p (T - U_t) \tag{13}$$

where C^c[m] and C^v[k] represent the capacity of V2I link m and V2V link k, respectively. λc, λv, and λp are nonnegative weights to balance different design objectives. In particular, the inclusion of T − Ut in the reward constantly reminds agents of the upcoming deadline for V2V payload transmission and effectively helps improve payload delivery rates for V2V links.
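The reward (13) is simple to compute once per time step; the weights and deadline below are arbitrary placeholders of ours, not values from [85]:

```python
import numpy as np

def v2x_reward(v2i_rates, v2v_rates, time_left,
               lambda_c=0.1, lambda_v=0.9, lambda_p=1.0, T=100):
    """Reward in (13): weighted sum of V2I and V2V capacities minus a penalty
    that grows as the V2V payload-delivery deadline T approaches."""
    return (lambda_c * np.sum(v2i_rates)
            + lambda_v * np.sum(v2v_rates)
            - lambda_p * (T - time_left))

# Example: 4 V2I links, 2 V2V links, 30 time slots remaining out of T = 100.
r = v2x_reward(v2i_rates=np.array([5.1, 4.2, 6.3, 3.8]),
               v2v_rates=np.array([2.0, 1.4]),
               time_left=30)
```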

Fig. 6. Illustration of spectrum sharing in vehicular networks, where Gt and Ht represent the current V2V signal channel
where V2V and V2I links are indexed by k and m, respectively, and strength and the interference channel strength from the
each V2I link is preassigned an orthogonal RB.
V2V transmitter to the base station over all RBs, respec-
tively, It−1 is the received interference power, Nt−1 is the
selected RBs of neighbors in the previous time slot over all
RBs, Lt and Ut denote the remaining load and time to meet
Power allocation in a more complicated multicell net-
latency constraint from the current time slot, respectively.
work has been considered in [84], where each cell has one
The action of each V2V agent amounts to a selection of
base station serving multiple users and all base stations
RB as well as discrete transmit power levels. The reward
transmit over the same spectrum, generating both intra-
balances V2I and V2V requirements, given by
and intercell interference. Each base station is controlled
by an RL agent that decides its transmit power to maximize  
the network sum throughput. Similar to [22], a single RL r t = λc C c [m] + λv C v [k] − λp (T − Ut ) (13)
agent is trained in a centralized manner using experience m k

collected from all transmitters. The trained policy is then


shared among all agents and executed in a distributed way. where C c [m] and C v [k] represent the capacity of V2I link m
Deep Q-learning, REINFORCE, actor–critic deep determin- and V2V link k, respectively. λc , λv , and λp are nonnegative
istic policy gradient (DDPG) algorithms have been lever- weights to balance different design objectives. In particu-
aged to train the RL agent and all of them outperform the lar, the inclusion of T −Ut in the reward constantly reminds
benchmark WMMSE [52] and FP algorithms [80]. Among agents of the upcoming deadline for V2V payload transmis-
the three algorithms, the actor–critic DDPG is the best in sion and effectively helps improve payload delivery rates
terms of both sum rate performance and robustness. for V2V links.
The reward design in (13) facilitates system-wide perfor-
mance improvement. But it also dictates the use of a cen-
C. Joint Spectrum and Power Allocation: tralized training architecture, where a central controller
Application Example in Vehicular Networks collects experiences from all V2V agents and compiles the
In a vehicular network as illustrated in Fig. 6, K V2V reward for the DQN training. The system architecture is
links share the spectrum of M V2I links to improve its illustrated in Fig. 7. In the implementation, each V2V agent
utilization efficiency. The V2I links are designed for high performs a local observation of the environment and then
data-rate entertainment services while the V2V links need uses its local copy of the trained DQN to guide its RB
to support reliable dissemination of safety-critical mes- selection and power control in a distributed manner. To
sages, formally stated as the successful delivery of packets alleviate the impact of environment nonstationarity due
of size B within the time constraint T . Such a reliability to mutual interaction among multiple V2V links, the V2V
requirement of V2V links is hard to handle with traditional agent takes turns to change its action to stabilize training,
optimization approaches due to its exponential complexity rather than acting simultaneously. Experimental results
in the length of T . However, it has been shown in [85] demonstrate that the proposed deep RL-based approach
and [86] that we can clearly address the issue in the deep can learn from scratch to perform intelligent spectrum
RL framework through properly designing a reward that and power allocation that outperforms the benchmarks,
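To make the single-agent design concrete, a minimal sketch of the per-agent decision step is given below: stacking the local measurements of (12) into an observation, selecting a joint (RB, power) action with an ε-greedy policy, and computing the reward of (13). The network sizes, the number of RBs, and the discrete power levels are illustrative assumptions and not the exact configuration of [85].

import numpy as np
import torch
import torch.nn as nn

NUM_RB = 4                      # number of V2I resource blocks (assumed)
POWER_LEVELS_DBM = [23, 10, 5]  # discrete V2V transmit powers (assumed)
NUM_ACTIONS = NUM_RB * len(POWER_LEVELS_DBM)

# Local DQN copy held by each V2V agent: observation of (12) -> Q-value per action.
q_net = nn.Sequential(
    nn.Linear(4 * NUM_RB + 2, 256), nn.ReLU(),
    nn.Linear(256, 128), nn.ReLU(),
    nn.Linear(128, NUM_ACTIONS),
)

def build_observation(G, H, I_prev, N_prev, load, time_left):
    """Stack the local measurements in (12) into one observation vector."""
    return np.concatenate([G, H, I_prev, N_prev, [load, time_left]]).astype(np.float32)

def select_action(obs, epsilon):
    """Epsilon-greedy policy over the joint (RB, power level) actions."""
    if np.random.rand() < epsilon:
        return np.random.randint(NUM_ACTIONS)
    with torch.no_grad():
        q = q_net(torch.from_numpy(obs))
    return int(q.argmax())

def reward(v2i_rates, v2v_rates, time_left, T, lc=0.1, lv=0.9, lp=1.0):
    """Reward of (13): weighted V2I and V2V sum rates minus a deadline penalty."""
    return lc * sum(v2i_rates) + lv * sum(v2v_rates) - lp * (T - time_left)

The weights lc, lv, and lp are placeholders; in practice they would be tuned to trade off V2I throughput against V2V delivery reliability.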

Fig. 7. Deep RL training and implementation architecture for resource allocation in vehicular networks.

2) Multiagent RL: To further improve the network performance and better handle the dynamics in a vehicular network, we investigate in [86] a multiagent RL-based approach to enable all V2V agents to perform resource allocation simultaneously. We have iterated throughout the article that such simultaneous actions of all learning agents tend to make the environment observed by each agent highly nonstationary and compromise the stability of DQN training. To address the issue, either the experience replay technique that is central to the success of deep RL is disabled as in [74], or the agents take turns to update their actions as in [22], [82], and [85]. We believe that such a turn-taking action update constraint leads to inevitable suboptimality, as it restricts the agents to a subset of the simultaneous action space, and that disabling the highly efficient experience replay technique is undesirable. In response, we leverage the fingerprint-based method proposed in [88] that identifies and addresses the source of nonstationarity: the policy change of other agents due to learning. As such, the environment observed by each agent can be made stationary by conditioning on the other agents' policy change, i.e., we augment the observation of each V2V agent with an estimate of the policy change of all other agents, the idea of hyper Q-learning [89]. Further analysis reveals that the agents' policy varies along the learning process, whose trajectory can be tracked by a low-dimensional fingerprint, including the training iteration number e and the probability of selecting a random action, ε, in the ε-greedy policy. Then we revise the observation of each V2V agent as

Z_t^{(k)} = \{S_t^{(k)}, \epsilon, e\}   (14)

where S_t^{(k)} contains local observations (measurements) of the kth V2V agent similar to those in (12). In addition, we revise the reward design to approach the V2V payload delivery reliability more closely by setting the V2V-related reward component at each time step to the sum V2V rate when payload delivery is not finished and to a constant number β, chosen larger than the largest possible sum V2V rate, when it is finished. The design of β reflects the tradeoff between designing purely toward the ultimate goal and learning efficiency. For a purely goal-directed design, we would set the V2V-related reward to 0 at each step until the payload is delivered, at which point the reward becomes 1. However, receiving such delayed rewards decreases training efficiency, and hence we impart our domain knowledge by incorporating the sum V2V rate as an aggregated reward. Again we employ the centralized training and distributed implementation architecture in Fig. 7 to train multiple DQNs for the V2V agents with experience replay and then deploy them after training is done.

Remarkably, such a simple approach proves very effective in stabilizing DQN training and, combined with the revised reward design, it significantly outperforms both the single-agent RL-based approach in [85] and a random baseline. In fact, the proposed multiagent RL-based method encourages V2V cooperation despite the distributed implementation without any online coordination. To illustrate, we break down in Fig. 8 the change of V2V transmission rates of the proposed method and the random baseline over the time constraint T = 100 ms for one episode in which all V2V links successfully deliver their payloads using the proposed method, whereas for the random baseline, Link 3 fails the task and all others succeed. Comparing Fig. 8(a) and (b), one can tell that the proposed method enables Link 4 to transmit at a high rate to finish delivery early, while Link 1 is restricted to a low rate at the beginning to protect the more vulnerable Links 2 and 3, which achieve decent rates to deliver their payloads. Upon successful delivery of Links 2 and 3, Link 1 leverages its good channel quality to quickly deliver its data. Moreover, Links 2 and 3 figure out a clever strategy of alternating transmission to avoid strong mutual interference. In contrast, the random baseline does not enjoy such intelligence and the more vulnerable Link 3 finally fails the task.

Fig. 8. V2V transmission rates of different resource allocation schemes within one episode. Only the results of the beginning 30 ms are plotted for better presentation. The initial payload size B = 2120 bytes. (a) V2V transmission rates of MARL. (b) V2V transmission rates of the random baseline.

We carry the multiagent RL idea further in [90], where we restructure the network architecture substantially to enable centralized decision making yet with very low signaling overhead. In particular, each V2V transmitter constructs a DNN that learns to compress its local observation (measurement), which is then fed back to the central base station to serve as the input of the stored DQN. The output of the DQN at the base station is the value of each action vector that contains the actions of all V2V transmitters given the current compressed observation. Then the resource allocation decision made at the central base station is broadcast to all V2V transmitters during implementation, incurring small overhead. The proposed architecture aligns with the single meta-agent idea of dealing with dynamics in multiagent RL and has been shown to achieve about 95% of the optimal performance obtained by time-consuming brute-force search when each local observation is compressed to 36 bits per time step.
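Returning to the multiagent scheme of [86], the fingerprint augmentation in (14) and the revised V2V reward shaping can be sketched as follows. The helper names, the weights, and the particular value of β are illustrative assumptions rather than the exact settings used in [86].

import numpy as np

def augment_with_fingerprint(local_obs, epsilon, train_iter, max_iter):
    """Build the observation in (14): append the low-dimensional fingerprint
    (exploration rate and normalized training iteration) that tracks the
    policy change of the other learning agents."""
    fingerprint = np.array([epsilon, train_iter / max_iter], dtype=np.float32)
    return np.concatenate([np.asarray(local_obs, dtype=np.float32), fingerprint])

def v2v_reward_component(v2v_rates, delivery_done, beta=10.0):
    """Revised V2V-related reward: the sum V2V rate while the payload is still
    being delivered, and a constant beta (chosen larger than any achievable
    sum V2V rate) once delivery has finished."""
    return beta if delivery_done else float(np.sum(v2v_rates))

def shaped_reward(v2i_rates, v2v_rates, delivery_done, lc=1.0, lv=1.0, beta=10.0):
    """Per-step training reward combining the V2I sum rate with the revised
    V2V component above; lc, lv, and beta are placeholder values."""
    return lc * float(np.sum(v2i_rates)) + lv * v2v_reward_component(v2v_rates, delivery_done, beta)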

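In the same hedged spirit, the feedback-based architecture of [90] can be sketched as a set of per-agent encoders feeding a single DQN at the base station. The layer sizes, the number of feedback bits, and the soft (unquantized) compression below are our own assumptions for illustration, not the exact design in [90]; note also that the joint action space enumerated here grows exponentially with the number of V2V links, so the sketch only makes sense for a small number of agents.

import torch
import torch.nn as nn

NUM_AGENTS = 4          # number of V2V transmitters (assumed)
OBS_DIM = 18            # per-agent local observation size (assumed)
FEEDBACK_BITS = 36      # compressed observation fed back per agent
ACTIONS_PER_AGENT = 12  # (RB, power) choices per agent (assumed)
NUM_JOINT_ACTIONS = ACTIONS_PER_AGENT ** NUM_AGENTS

class Encoder(nn.Module):
    """Per-agent DNN that learns to compress the local observation."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(OBS_DIM, 64), nn.ReLU(),
            nn.Linear(64, FEEDBACK_BITS), nn.Sigmoid(),  # soft bits in [0, 1]
        )
    def forward(self, obs):
        return self.net(obs)

# Central DQN stored at the base station: concatenated feedback -> value of each joint action vector.
central_dqn = nn.Sequential(
    nn.Linear(NUM_AGENTS * FEEDBACK_BITS, 512), nn.ReLU(),
    nn.Linear(512, 512), nn.ReLU(),
    nn.Linear(512, NUM_JOINT_ACTIONS),
)

encoders = [Encoder() for _ in range(NUM_AGENTS)]

def central_decision(local_observations):
    """Compress each local observation, concatenate the feedback at the base
    station, and pick the joint action with the highest estimated value, which
    is then broadcast back to all V2V transmitters."""
    with torch.no_grad():
        feedback = torch.cat([enc(obs) for enc, obs in zip(encoders, local_observations)])
        joint_q = central_dqn(feedback)
    return int(joint_q.argmax())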


V. OPEN ISSUES AND FUTURE RESEARCH

Despite many benefits of using deep learning for wireless resource allocation, blindly applying the technique in the wireless environment is insufficient and more deliberation is needed. In this section, we identify major roadblocks and highlight some research opportunities.

A. Tailoring Deep-Learning Architecture for Wireless

In deep learning, it is well established that different neural network architectures have varying strengths and weaknesses in different application domains. No single architecture can be the best player in all tasks. For instance, CNNs and deep residual networks (ResNets) have excellent performance in image recognition while recurrent neural networks (RNNs) prevail in sequential data processing, such as speech and language. Likewise, wireless resource allocation has its own unique characteristics that are worth considering when we adapt or redesign an appropriate DNN. For example, in power allocation, the adjustment of one transmitter's power affects not only its intended receiver but also other closely located wireless links, making the problem nonconvex. However, if we convert it to the dual domain, solving the problem becomes much easier and the duality gap is shown to be small in most practical cases [3]. Such domain knowledge is expected to be very helpful for guiding DNN design in the wireless context. That said, we notice there have been some recent studies [91], [92] showing that graph neural networks, or graph convolutional networks [93], can be particularly efficient in modeling wireless interference, which might represent a good start in this direction.

In the unsupervised approaches of using deep learning for optimization mentioned earlier, we treat the resource allocation objective as the loss function for learning. More often than not, the objective is different from what we normally use in DNN training, like mean-squared error or cross-entropy. It is unclear whether the existing network architectures best suit the needs of resource allocation objective minimization or maximization. Furthermore, theoretical understanding of the convergence properties of training DNNs with the new objectives and efficient training techniques largely remain unknown.

B. Bridging the Gap Between Training and Implementation

Most, if not all, proposed learning algorithms in the existing literature are trained and tested on an offline simulator. We understand that this makes algorithm design and testing quick and easy, and the policy learned from simulated experiences can be used as a jump-starter for real-time deployment, following the idea of transfer learning [23]. Nonetheless, it is extremely difficult to build a simulator with high enough fidelity to guarantee that the learned policy can work as expected when interacting with the real-world environment. It seems that the only way we can provide an ultimate answer to the puzzle is to perform policy learning in the real world. The issue is then how to avoid catastrophic actions while the agents are actively exploring the environment and no concrete policies have been obtained yet. A seemingly good answer might be to use expert human knowledge to help confine the exploration space [83] and guide the learning agent's search within that space. But exactly how to implement the concept in algorithm design with performance guarantees is unclear and worth further investigation.
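As a purely illustrative example of confining exploration with expert knowledge, and not a method with any claimed performance guarantee, one could restrict ε-greedy action selection to an expert-defined safe action set. The helper safe_action_fn below is hypothetical; it stands in for whatever domain rules (regulatory power limits, protected RBs, and so on) an operator might encode.

import numpy as np
import torch

def safe_epsilon_greedy(q_net, obs, epsilon, safe_action_fn):
    """Epsilon-greedy action selection restricted to an expert-defined safe set.

    safe_action_fn maps the current observation to the indices of actions that
    expert domain knowledge deems acceptable. The agent explores and exploits
    only within this set, which is one simple way to confine the exploration
    space while learning in the real world."""
    safe_actions = np.asarray(safe_action_fn(obs))
    if np.random.rand() < epsilon:
        return int(np.random.choice(safe_actions))        # explore, but only within the safe set
    with torch.no_grad():
        q = q_net(torch.as_tensor(obs, dtype=torch.float32)).numpy()
    return int(safe_actions[np.argmax(q[safe_actions])])  # exploit within the safe set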

C. Multiagent Consideration in Deep RL

In the wireless domain, most scenarios that we are interested in are of a multiuser nature, whether it is power control for multiple users within a cell or joint spectrum and power allocation for multiple V2V links in a vehicular network. In these cases, the actions of one user impact the performance of not only itself but also others nearby. From the perspective of each user, the environment that it observes then exhibits nonstationarity when other users are actively exploring the state and action space for policy learning. Meanwhile, each user can only obtain local observations or measurements of the true underlying environment state, which, in the deep RL terminology, makes the problem a partially observable MDP. The learning agent then needs to construct its belief state from the previous actions and local observations to estimate the true environment state, which is a challenging issue. A partial solution to the problem might be to enable interagent communications to encourage multiuser coordination and increase local awareness of the global environment state. That said, such a combination of environment nonstationarity and partial observability makes learning extremely difficult, which is made even worse when the number of users grows large, as in the upcoming Internet of Things era. Although there have been some recent works [94] that attempt to solve the issue and establish the convergence guarantee of mean-field multiagent RL, more investigation in this direction is still desired.

VI. CONCLUSION

In this article, we have provided an overview of applying the burgeoning deep-learning technology to wireless resource allocation with application to vehicular networks. We have discussed the limitations of traditional optimization-based approaches and the potential of deep-learning paradigms in wireless resource allocation. In particular, we have described in detail how to leverage deep learning to solve difficult optimization problems for resource allocation and how to use deep RL to directly address many resource allocation problems that cannot be handled, or even modeled, in the traditional optimization framework. We have further identified some open issues and research directions that warrant future investigation.

REFERENCES
[1] Z. Han and K. J. R. Liu, Resource Allocation for Wireless Networks: Basics, Techniques, and Applications. Cambridge, U.K.: Cambridge Univ. Press, 2008.
[2] C. E. Shannon, "A mathematical theory of communication," Bell Syst. Tech. J., vol. 27, no. 3, pp. 379–423, 1948.
[3] W. Yu and R. Lui, "Dual methods for nonconvex spectrum optimization of multicarrier systems," IEEE Trans. Commun., vol. 54, no. 7, pp. 1310–1322, Jul. 2006.
[4] C. Y. Wong, R. S. Cheng, K. B. Lataief, and R. D. Murch, "Multiuser OFDM with adaptive subcarrier, bit, and power allocation," IEEE J. Sel. Areas Commun., vol. 17, no. 10, pp. 1747–1758, Oct. 1999.
[5] G. Song and Y. Li, "Utility-based resource allocation and scheduling in OFDM-based wireless broadband networks," IEEE Commun. Mag., vol. 43, no. 12, pp. 127–134, Dec. 2005.
[6] G. Song and Y. Li, "Cross-layer optimization for OFDM wireless networks—Part I: Theoretical framework," IEEE Trans. Wireless Commun., vol. 4, no. 2, pp. 614–624, Mar. 2005.
[7] G. Song and Y. Li, "Cross-layer optimization for OFDM wireless networks—Part II: Algorithm development," IEEE Trans. Wireless Commun., vol. 4, no. 2, pp. 625–634, Mar. 2005.
[8] Z.-Q. Luo and S. Zhang, "Dynamic spectrum management: Complexity and duality," IEEE J. Sel. Topics Signal Process., vol. 2, no. 1, pp. 57–73, Feb. 2008.
[9] M. Bennis, M. Debbah, and H. V. Poor, "Ultra reliable and low-latency wireless communication: Tail, risk, and scale," Proc. IEEE, vol. 106, no. 10, pp. 1834–1853, Oct. 2018.
[10] T. O'Shea and J. Hoydis, "An introduction to deep learning for the physical layer," IEEE Trans. Cogn. Commun. Netw., vol. 3, no. 4, pp. 563–575, Dec. 2017.
[11] S. Dörner, S. Cammerer, J. Hoydis, and S. ten Brink, "Deep learning based communication over the air," IEEE J. Sel. Areas Commun., vol. 12, no. 1, pp. 132–143, Feb. 2018.
[12] H. Ye, G. Y. Li, and B.-H. Juang, "Power of deep learning for channel estimation and signal detection in OFDM systems," IEEE Wireless Commun. Lett., vol. 7, no. 1, pp. 114–117, Feb. 2018.
[13] H. Ye, L. Liang, G. Y. Li, and B.-H. F. Juang, "Deep learning based end-to-end wireless communication systems with conditional GAN as unknown channel," 2019, arXiv:1903.02551. [Online]. Available: https://arxiv.org/abs/1903.02551
[14] N. Samuel, T. Diskin, and A. Wiesel, "Learning to detect," IEEE Trans. Signal Process., vol. 67, no. 10, pp. 2554–2564, May 2019.
[15] N. Farsad and A. Goldsmith, "Neural network detection of data sequences in communication systems," IEEE Trans. Signal Process., vol. 66, no. 21, pp. 5663–5678, Nov. 2018.
[16] H. Sun, X. Chen, Q. Shi, M. Hong, X. Fu, and N. D. Sidiropoulos, "Learning to optimize: Training deep neural networks for interference management," IEEE Trans. Signal Process., vol. 66, no. 20, pp. 5438–5453, Oct. 2018.
[17] C. Zhang, P. Patras, and H. Haddadi, "Deep learning in mobile and wireless networking: A survey," IEEE Commun. Surveys Tuts., vol. 21, no. 3, pp. 2224–2287, 3rd Quart., 2019.
[18] Z. Chang, L. Lei, Z. Zhou, S. Mao, and T. Ristaniemi, "Learn to cache: Machine learning for network edge caching in the big data era," IEEE Wireless Commun., vol. 25, no. 3, pp. 28–35, Jun. 2018.
[19] J. Park, S. Samarakoon, M. Bennis, and M. Debbah, "Wireless network intelligence at the edge," Proc. IEEE, vol. 107, no. 11, pp. 2204–2239, Nov. 2019.
[20] W. Lee, M. Kim, and D.-H. Cho, "Deep power control: Transmit power control scheme based on convolutional neural network," IEEE Commun. Lett., vol. 22, no. 6, pp. 1276–1279, Jun. 2018.
[21] S. Wang, H. Liu, P. H. Gomes, and B. Krishnamachari, "Deep reinforcement learning for dynamic multichannel access in wireless networks," IEEE Trans. Cogn. Commun. Netw., vol. 4, no. 2, pp. 257–265, Jun. 2018.
[22] Y. S. Nasir and D. Guo, "Multi-agent deep reinforcement learning for dynamic power allocation in wireless networks," IEEE J. Sel. Areas Commun., vol. 37, no. 10, pp. 2239–2250, Oct. 2019.
[23] A. Zappone, M. Di Renzo, and M. Debbah, "Wireless networks design in the era of deep learning: Model-based, AI-based, or both?" IEEE Trans. Commun., vol. 67, no. 10, pp. 7331–7376, Oct. 2019.
[24] A. Zappone, M. Di Renzo, M. Debbah, T. T. Lam, and X. Qian, "Model-aided wireless artificial intelligence: Embedding expert knowledge in deep neural networks for wireless system optimization," IEEE Veh. Technol. Mag., vol. 14, no. 3, pp. 60–69, Jul. 2019.
[25] Z. Qin, H. Ye, G. Y. Li, and B.-H. F. Juang, "Deep learning in physical layer communications," IEEE Wireless Commun., vol. 26, no. 2, pp. 93–99, Apr. 2019.
[26] D. Gunduz, P. de Kerret, N. Sidiroupoulos, D. Gesbert, C. Murthy, and M. van der Schaar, "Machine learning in the air," IEEE J. Sel. Areas Commun., vol. 37, no. 10, pp. 2184–2199, Oct. 2019.
[27] H. Ye, L. Liang, G. Y. Li, J. Kim, L. Lu, and M. Wu, "Machine learning for vehicular networks: Recent advances and application examples," IEEE Veh. Technol. Mag., vol. 13, no. 2, pp. 94–101, Jun. 2018.
[28] L. Liang, H. Ye, and G. Y. Li, "Toward intelligent vehicular networks: A machine learning framework," IEEE Internet Things J., vol. 6, no. 1, pp. 124–135, Feb. 2019.
[29] M. Chen, U. Challita, W. Saad, C. Yin, and M. Debbah, "Machine learning for wireless networks with artificial intelligence: A tutorial on neural networks," IEEE Commun. Surveys Tuts., vol. 21, no. 4, pp. 3039–3071, 4th Quart., 2019.
[30] 3rd Generation Partnership Project; Technical Specification Group Radio Access Network; Study on LTE-based V2X Services; (Release 14), 3GPP TR 36.885 V14.0.0, Jun. 2016.
[31] R. Molina-Masegosa and J. Gozalvez, "LTE-V for sidelink 5G V2X vehicular communications: A new 5G technology for short-range vehicle-to-everything communications," IEEE Veh. Technol. Mag., vol. 12, no. 4, pp. 30–39, Dec. 2017.
[32] L. Liang, G. Y. Li, and W. Xu, "Resource allocation for D2D-enabled vehicular communications," IEEE Trans. Commun., vol. 65, no. 7, pp. 3186–3197, Jul. 2017.
[33] L. Liang, S. Xie, G. Y. Li, Z. Ding, and X. Yu, "Graph-based resource sharing in vehicular communication," IEEE Trans. Wireless Commun., vol. 17, no. 7, pp. 4579–4592, Jul. 2018.
[34] W. Sun, E. G. Ström, F. Brännström, K. Sou, and Y. Sui, "Radio resource management for D2D-based V2V communication," IEEE Trans. Veh. Technol., vol. 65, no. 8, pp. 6636–6650, Aug. 2016.
[35] W. Sun, D. Yuan, E. G. Ström, and F. Brännström, "Cluster-based radio resource management for D2D-supported safety-critical V2X communications," IEEE Trans. Wireless Commun., vol. 15, no. 4, pp. 2756–2769, Apr. 2016.

[36] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, p. 436, 2015.
[37] B.-H. Juang, "Deep neural networks—A developmental perspective," APSIPA Trans. Signal Inf. Process., vol. 5, Apr. 2016.
[38] Y. Shen, Y. Shi, J. Zhang, and K. B. Letaief, "LORM: Learning to optimize for resource management in wireless networks with few training samples," IEEE Trans. Wireless Commun., to be published.
[39] M. Lee, G. Yu, and G. Y. Li, "Learning to branch: Accelerating resource allocation in wireless networks," IEEE Trans. Veh. Technol., to be published.
[40] S. M. Kakade, "A natural policy gradient," in Proc. Adv. Neural Inf. Process. Syst. (NeurIPS), 2002, pp. 1531–1538.
[41] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz, "Trust region policy optimization," in Proc. Int. Conf. Mach. Learn. (ICML), 2015, pp. 1889–1897.
[42] V. Mnih et al., "Asynchronous methods for deep reinforcement learning," in Proc. Int. Conf. Mach. Learn. (ICML), 2016, pp. 1928–1937.
[43] V. François-Lavet, P. Henderson, R. Islam, M. G. Bellemare, and J. Pineau, "An introduction to deep reinforcement learning," Found. Trends Mach. Learn., vol. 11, nos. 3–4, pp. 219–354, 2018.
[44] K. Arulkumaran, M. P. Deisenroth, M. Brundage, and A. A. Bharath, "Deep reinforcement learning: A brief survey," IEEE Signal Process. Mag., vol. 34, no. 6, pp. 26–38, Nov. 2017.
[45] C. J. C. H. Watkins and P. Dayan, "Q-learning," Mach. Learn., vol. 8, nos. 3–4, pp. 279–292, 1992.
[46] V. Mnih et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, pp. 529–533, 2015.
[47] R. S. Sutton, D. A. McAllester, S. P. Singh, and Y. Mansour, "Policy gradient methods for reinforcement learning with function approximation," in Proc. Adv. Neural Inf. Process. Syst. (NeurIPS), 2000, pp. 1057–1063.
[48] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. Cambridge, MA, USA: MIT Press, 2018.
[49] Y. Cui and V. K. N. Lau, "Distributive stochastic learning for delay-optimal OFDMA power and subband allocation," IEEE Trans. Signal Process., vol. 58, no. 9, pp. 4848–4858, Sep. 2010.
[50] V. K. N. Lau and Y. Cui, "Delay-optimal power and subcarrier allocation for OFDMA systems via stochastic approximation," IEEE Trans. Wireless Commun., vol. 9, no. 1, pp. 227–233, Jan. 2010.
[51] Q. Zheng, K. Zheng, H. Zhang, and V. C. M. Leung, "Delay-optimal virtualized radio resource scheduling in software-defined vehicular networks via stochastic learning," IEEE Trans. Veh. Technol., vol. 65, no. 10, pp. 7857–7867, Oct. 2016.
[52] Q. Shi, M. Razaviyayn, Z.-Q. Luo, and C. He, "An iteratively weighted MMSE approach to distributed sum-utility maximization for a MIMO interfering broadcast channel," IEEE Trans. Signal Process., vol. 59, no. 9, pp. 4331–4340, Sep. 2011.
[53] B. Matthiesen, A. Zappone, K.-L. Besser, E. A. Jorswieck, and M. Debbah, "A globally optimal energy-efficient power control framework and its efficient implementation in wireless interference networks," 2018, arXiv:1812.06920. [Online]. Available: https://arxiv.org/abs/1812.06920
[54] M. Lee, Y. Xiong, G. Yu, and G. Y. Li, "Deep neural networks for linear sum assignment problems," IEEE Wireless Commun. Lett., vol. 7, no. 6, pp. 962–965, Dec. 2018.
[55] H. W. Kuhn, "The Hungarian method for the assignment problem," Naval Res. Logist., vol. 2, nos. 1–2, pp. 83–97, Mar. 1955.
[56] W. Cui, K. Shen, and W. Yu, "Spatial deep learning for wireless scheduling," in Proc. IEEE GLOBECOM, Dec. 2018, pp. 1–6.
[57] W. Cui, K. Shen, and W. Yu, "Spatial deep learning for wireless scheduling," IEEE J. Sel. Areas Commun., vol. 37, no. 6, pp. 1248–1261, Jun. 2019.
[58] M. Lee, G. Yu, and G. Y. Li, "Graph embedding based wireless link scheduling with few training samples," 2019, arXiv:1906.02871. [Online]. Available: https://arxiv.org/abs/1808.01486
[59] K. Shen and W. Yu, "FPLinQ: A cooperative spectrum sharing strategy for device-to-device communications," in Proc. IEEE Int. Symp. Inf. Theory (ISIT), Jun. 2017, pp. 2323–2327.
[60] F. Liang, C. Shen, W. Yu, and F. Wu, "Towards optimal power control via ensembling deep neural networks," 2018, arXiv:1807.10025. [Online]. Available: https://arxiv.org/abs/1807.10025
[61] M. Eisen, C. Zhang, L. F. O. Chamon, D. D. Lee, and A. Ribeiro, "Learning optimal resource allocations in wireless systems," IEEE Trans. Signal Process., vol. 67, no. 10, pp. 2775–2790, Apr. 2019.
[62] T. Laurent and J. Brecht, "Deep linear networks with arbitrary loss: All local minima are global," in Proc. Int. Conf. Mach. Learn. (ICML), 2018, pp. 2908–2913.
[63] S. S. Du, J. D. Lee, H. Li, L. Wang, and X. Zhai, "Gradient descent finds global minima of deep neural networks," 2018, arXiv:1811.03804. [Online]. Available: https://arxiv.org/abs/1811.03804
[64] R. Li et al., "Deep reinforcement learning for resource management in network slicing," IEEE Access, vol. 6, pp. 74429–74441, 2018.
[65] T. He, N. Zhao, and H. Yin, "Integrated networking, caching, and computing for connected vehicles: A deep reinforcement learning approach," IEEE Trans. Veh. Technol., vol. 67, no. 1, pp. 44–55, Jan. 2018.
[66] X. Chen et al., "Multi-tenant cross-slice resource orchestration: A deep reinforcement learning approach," IEEE J. Sel. Areas Commun., vol. 37, no. 10, pp. 2377–2392, Oct. 2019.
[67] U. Challita, L. Dong, and W. Saad, "Proactive resource management for LTE in unlicensed spectrum: A deep learning perspective," IEEE Trans. Wireless Commun., vol. 17, no. 7, pp. 4674–4689, Jul. 2018.
[68] V. Va, T. Shimizu, G. Bansal, and R. W. Heath, Jr., "Online learning for position-aided millimeter wave beam training," IEEE Access, vol. 7, pp. 30507–30526, 2019.
[69] K. Liu and Q. Zhao, "Indexability of restless bandit problems and optimality of Whittle index for dynamic multichannel access," IEEE Trans. Inf. Theory, vol. 56, no. 11, pp. 5547–5567, Nov. 2010.
[70] Q. Zhao, B. Krishnamachari, and K. Liu, "On myopic sensing for multi-channel opportunistic access: Structure, optimality, and performance," IEEE Trans. Wireless Commun., vol. 7, no. 12, pp. 5431–5440, Dec. 2008.
[71] S. H. A. Ahmad, M. Liu, T. Javidi, Q. Zhao, and B. Krishnamachari, "Optimality of myopic sensing in multichannel opportunistic access," IEEE Trans. Inf. Theory, vol. 55, no. 9, pp. 4040–4050, Sep. 2009.
[72] C. Zhong, Z. Lu, M. C. Gursoy, and S. Velipasalar, "Actor-critic deep reinforcement learning for dynamic multichannel access," in Proc. IEEE GlobalSIP, 2018, pp. 599–603.
[73] Y. Yu, T. Wang, and S. C. Liew, "Deep-reinforcement learning multiple access for heterogeneous wireless networks," in Proc. IEEE ICC, May 2018, pp. 1–7.
[74] O. Naparstek and K. Cohen, "Deep multi-user reinforcement learning for distributed dynamic spectrum access," IEEE Trans. Wireless Commun., vol. 18, no. 1, pp. 310–323, Jan. 2019.
[75] Z. Wang, T. Schaul, M. Hessel, H. van Hasselt, M. Lanctot, and N. De Freitas, "Dueling network architectures for deep reinforcement learning," 2015, arXiv:1511.06581. [Online]. Available: https://arxiv.org/abs/1511.06581
[76] H. Van Hasselt, A. Guez, and D. Silver, "Deep reinforcement learning with double q-learning," in Proc. AAAI, 2016, pp. 2094–2100.
[77] K. Cohen and A. Leshem, "Distributed game-theoretic optimization and management of multichannel ALOHA networks," IEEE/ACM Trans. Netw., vol. 24, no. 3, pp. 1718–1731, Jun. 2016.
[78] T. To and J. Choi, "On exploiting idle channels in opportunistic multichannel ALOHA," IEEE Commun. Lett., vol. 14, no. 1, pp. 51–53, Jan. 2010.
[79] I.-H. Hou and P. Gupta, "Proportionally fair distributed resource allocation in multiband wireless systems," IEEE/ACM Trans. Netw., vol. 22, no. 6, pp. 1819–1830, Dec. 2014.
[80] K. Shen and W. Yu, "Fractional programming for communication systems—Part I: Power control and beamforming," IEEE Trans. Signal Process., vol. 66, no. 10, pp. 2616–2630, May 2018.
[81] F. Meng, P. Chen, and L. Wu, "Power allocation in multi-user cellular networks with deep Q learning approach," in Proc. IEEE ICC, May 2019, pp. 1–7.
[82] E. Ghadimi, F. D. Calabrese, G. Peters, and P. Soldati, "A reinforcement learning approach to power control and rate adaptation in cellular networks," in Proc. IEEE ICC, May 2017, pp. 1–7.
[83] F. D. Calabrese, L. Wang, E. Ghadimi, G. Peters, L. Hanzo, and P. Soldati, "Learning radio resource management in RANs: Framework, opportunities, and challenges," IEEE Commun. Mag., vol. 56, no. 9, pp. 138–145, Sep. 2018.
[84] F. Meng, P. Chen, L. Wu, and J. Cheng, "Power allocation in multi-user cellular networks: Deep reinforcement learning approaches," 2019, arXiv:1901.07159. [Online]. Available: https://arxiv.org/abs/1901.07159
[85] H. Ye, G. Y. Li, and B.-H. F. Juang, "Deep reinforcement learning based resource allocation for V2V communications," IEEE Trans. Veh. Technol., vol. 68, no. 4, pp. 3163–3173, Apr. 2019.
[86] L. Liang, H. Ye, and G. Y. Li, "Spectrum sharing in vehicular networks based on multi-agent reinforcement learning," IEEE J. Sel. Areas Commun., vol. 37, no. 10, pp. 2282–2292, Oct. 2019.
[87] M. I. Ashraf, M. Bennis, C. Perfecto, and W. Saad, "Dynamic proximity-aware resource allocation in vehicle-to-vehicle (V2V) communications," in Proc. IEEE GLOBECOM Workshops, Dec. 2016, pp. 1–6.
[88] J. Foerster et al., "Stabilising experience replay for deep multi-agent reinforcement learning," in Proc. Int. Conf. Mach. Learn. (ICML), 2017, pp. 1146–1155.
[89] G. Tesauro, "Extending Q-learning to general adaptive multi-agent systems," in Proc. Adv. Neural Inf. Process. Syst. (NeurIPS), 2004, pp. 871–878.
[90] L. Wang, H. Ye, L. Liang, and G. Y. Li, "Learn to compress CSI and allocate resources in vehicular networks," 2019, arXiv:1908.04685. [Online]. Available: https://arxiv.org/abs/1908.04685
[91] M. Eisen and A. Ribeiro, "Large scale wireless power allocation with graph neural networks," in Proc. IEEE SPAWC, Jul. 2019, pp. 1–5.
[92] M. Eisen and A. Ribeiro, "Optimal wireless resource allocation with random edge graph neural networks," 2019, arXiv:1909.01865. [Online]. Available: https://arxiv.org/abs/1909.01865
[93] T. N. Kipf and M. Welling, "Semi-supervised classification with graph convolutional networks," in Proc. Int. Conf. Learn. Represent. (ICLR), 2017, pp. 1–14.
[94] M. K. Sharma, A. Zappone, M. Assaad, M. Debbah, and S. Vassilaras, "Distributed power control for large energy harvesting networks: A multi-agent deep reinforcement learning approach," IEEE Trans. Cogn. Commun. Netw., to be published.

ABOUT THE AUTHORS

Le Liang (Member, IEEE) received the B.E. degree in information engineering from Southeast University, Nanjing, China, in 2012, the M.A.Sc. degree in electrical engineering from the University of Victoria, Victoria, BC, Canada, in 2015, and the Ph.D. degree in electrical and computer engineering from Georgia Institute of Technology, Atlanta, GA, USA, in 2018.
Since 2019, he has been a Research Scientist with Intel Labs, Hillsboro, OR, USA. His general research interests include wireless communications, signal processing, and machine learning.
Dr. Liang received the Best Paper Award at the IEEE/CIC ICCC in 2014. He is an Editor of IEEE COMMUNICATIONS LETTERS.

Hao Ye (Student Member, IEEE) received the B.E. degree in information engineering from Southeast University, Nanjing, China, in 2012, and the M.E. degree in electrical engineering from Tsinghua University, Beijing, China, in 2015. He is currently working toward the Ph.D. degree in electrical engineering at Georgia Institute of Technology, Atlanta, GA, USA.
His research interests include machine learning for physical and MAC layer communications.

Guanding Yu (Senior Member, IEEE) received the B.E. and Ph.D. degrees in communication engineering from Zhejiang University, Hangzhou, China, in 2001 and 2006, respectively.
From 2013 to 2015, he was a Visiting Professor with the School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, GA, USA. He joined Zhejiang University in 2006, where he is currently a Full Professor with the College of Information Science and Electronic Engineering. His research interests include 5G communications and networks, mobile edge computing, and machine learning for wireless networks.
Dr. Yu received the 2016 IEEE ComSoc Asia-Pacific Outstanding Young Researcher Award. He regularly sits on the Technical Program Committee (TPC) Boards of prominent IEEE conferences such as ICC, GLOBECOM, and VTC. He also serves as the Symposium Co-Chair for the IEEE Globecom 2019 and the Track Chair for the IEEE VTC 2019'Fall. He has served as a Guest Editor of the IEEE Communications Magazine—Special Issue on Full Duplex Communications, an Editor of the IEEE JOURNAL ON SELECTED AREAS IN COMMUNICATIONS—Series on Green Communications and Networking, a Lead Guest Editor of the IEEE Wireless Communications Magazine—Special Issue on long-time evolution (LTE) networks in unlicensed spectrum, and an Editor of IEEE ACCESS. He is currently serving as an Editor of the IEEE TRANSACTIONS ON GREEN COMMUNICATIONS AND NETWORKING and IEEE WIRELESS COMMUNICATIONS LETTERS.

Geoffrey Ye Li (Fellow, IEEE) was with AT&T Laboratories Research, Red Bank, NY, USA, as a Senior and then a Principal Technical Staff Member, from 1996 to 2000, and a Postdoctoral Research Associate with the University of Maryland, College Park, MD, USA, from 1994 to 1996. He is currently a Professor with Georgia Institute of Technology, Atlanta, GA, USA. His general research interests include statistical signal processing and machine learning for wireless communications. In these areas, he has published over 500 journal and conference articles in addition to over 40 granted patents. His publications have been cited over 38 000 times.
Dr. Li was made a Fellow of the IEEE for his contributions to signal processing for wireless communications in 2005. He received the Donald G. Fink Overview Paper Award in 2017 from the IEEE Signal Processing Society; the James Evans Avant Garde Award in 2013 and the Jack Neubauer Memorial Award in 2014 from the IEEE Vehicular Technology Society; the Stephen O. Rice Prize Paper Award in 2013, the Award for Advances in Communication in 2017, and the Edwin Howard Armstrong Achievement Award in 2019 from the IEEE Communications Society; and the 2015 Distinguished Faculty Achievement Award from the School of Electrical and Computer Engineering, Georgia Institute of Technology. He has organized and chaired many international conferences, including serving as the Technical Program Vice-Chair of the IEEE ICC 2003, the Technical Program Co-Chair of the IEEE SPAWC 2011, the General Chair of the IEEE GlobalSIP 2014, the Technical Program Co-Chair of the IEEE VTC 2016 (Spring), and the General Co-Chair of the IEEE VTC 2019 (Fall). He has been involved in editorial activities for over 20 technical journals, including serving as the founding Editor-in-Chief of the IEEE 5G Tech Focus. He has been recognized as the World's Most Influential Scientific Mind, also known as a Highly Cited Researcher, by Thomson Reuters almost every year.
