Cooperative Spectrum Sensing Princeton

858 IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING, VOL. 7, NO.
5, OCTOBER 2013
Multiagent Reinforcement Learning Based Spectrum

Sensing Policies for Cognitive Radio Networks
Jarmo Lundén, Member, IEEE, Sanjeev R. Kulkarni, Fellow, IEEE, Visa Koivunen, Fellow, IEEE, and
H. Vincent Poor, Fellow, IEEE
Abstract—This paper proposes distributed multiuser multiband change in a dynamic fashion due to mobility and nodes joining
spectrum sensing policies for cognitive radio networks based on and leaving the network. In this work we propose distributed re-
multiagent reinforcement learning. The spectrum sensing problem inforcement learning based spectrum sensing policies for cog-
is formulated as a partially observable stochastic game and mul-
tiagent reinforcement learning is employed to find a solution. In nitive radio networks.
the proposed reinforcement learning based sensing policies the sec- Cognitive radios sense the radio frequency spectrum in order
ondary users (SUs) collaborate to improve the sensing reliability to obtain awareness about the spectrum state and primary user
and to distribute the sensing tasks among the network nodes. The (PU) occupancy. Spectrum sensing is essential for identifying
SU collaboration is carried out through local interactions in which available spectrum opportunities and for controlling the level
the SUs share their local test statistics or decisions as well as in-
formation on the frequency bands sensed with their neighbors. As of interference experienced by the PUs. Situational awareness
a result, a map of spectrum occupancy in a local neighborhood is combined with learning allows more efficient and effective use
created. The goal of the proposed sensing policies is to maximize of the existing network resources and time-frequency-location
the amount of free spectrum found given a constraint on the prob- varying spectrum resources. That is, learning through past ex-
ability of missed detection. This is addressed by obtaining a bal- perience facilitates more intelligent decision making and per-
ance between sensing more spectrum and the reliability of sensing
results. Simulation results show that the proposed sensing policies formance optimization. A particularly suitable form of learning
provide an efficient way to find available spectrum in multiuser for cognitive radio networks is reinforcement learning [4], [25].
multiband cognitive radio scenarios. Reinforcement learning is a form of learning in which an agent
Index Terms—Cognitive radio networks, collaborative spectrum or agents learn through experiment and interaction with the en-
sensing, multiagent reinforcement learning, multiuser multiband vironment and each other. The goal is to learn how to choose
spectrum sensing policy, partially observable stochastic game. actions in order to maximize a given form of cumulative future
reward. Reinforcement learning has found great success in va-
riety of applications [4], [25]. In this paper we employ multia-
I. INTRODUCTION gent reinforcement learning to optimize the sensing policy in a
cognitive radio network.
We propose a collaborative distributed sensing framework in
D ISTRIBUTED spectrum sensing algorithms are particu-

larly suited for cognitive radio ad hoc networks [1] since
cognitive radio ad hoc networks have no pre-established infra-
which the sensing time is constant regardless of the number
of SUs sensing, i.e., the diversity order, at each particular fre-
quency band. In order to satisfy a constraint on the probability
structure to centrally manage and coordinate the network tasks, of missed detection the SUs choose the detection threshold ac-
such as identifying idle spectrum and accessing it. Moreover, cording to the employed diversity order. The higher the diver-
the network topology of a cognitive radio ad hoc network may sity order, the lower the threshold and the resulting false alarm
rate needed to satisfy the probability of missed detection con-
Manuscript received September 28, 2012; revised February 20, 2013; ac- straint. On the other hand, since false alarms lead to overlooked
cepted April 08, 2013. Date of publication April 24, 2013; date of current ver-
spectrum opportunities, it is important to minimize their proba-
sion September 11, 2013. The work of J. Lundén and H. V. Poor was supported
in part by the Qatar National Research Fund under Grant NPRP 08-522-2-211. bility to make the system sufficiently opportunistic. In the pro-
The work of J. Lundén was also supported by the Finnish Cultural Foundation. posed framework this can be achieved by increasing the diver-
The work of S. R. Kulkarni was supported in part by the National Science
sity order for each frequency band. However, increasing the di-
Foundation Science and Technology Center under Grant CCF-0939370, the
U.S. Army Research Office under Grant W911 NF-07-1-0185, and by Deutsche versity order reduces the number of frequency bands that can be
Telekom AG under Grant RES AGMT. J. Lundén was with the Department of simultaneously sensed. Hence, we trade off between these two
Electrical Engineering, Princeton University, when this work was performed.
goals in order to identify as much idle spectrum as possible.
The guest editor coordinating the review of this manuscript and approving it
for publication was Prof. Amir Leshem. The other option to meet the required probability of missed de-
J. Lundén and V. Koivunen are with the Department of Signal Processing and tection level is to increase the sensing time while keeping the
Acoustics, SMARAD CoE, Aalto University, 02150 Espoo, Finland (e-mail:
false alarm rate fixed. However, that complicates the system de-
jarmo.lunden@aalto.fi; visa.koivunen@aalto.fi).
S. R. Kulkarni and H. V. Poor are with the Department of Electrical sign because different diversity orders require different sensing
Engineering, Princeton University, Princeton, NJ 08544 USA (e-mail: times for the same performance. Hence, we keep the sensing
kulkarni@princeton.edu; poor@princeton.edu).
time fixed and focus on adjusting the detection thresholds and
Color versions of one or more of the figures in this paper are available online
at http://ieeexplore.ieee.org. false alarm rates to meet the probability of missed detection
Digital Object Identifier 10.1109/JSTSP.2013.2259797 constraint.
1932-4553 © 2013 IEEE

LUNDÉN et al.: MULTIAGENT REINFORCEMENT LEARNING BASED SPECTRUM SENSING POLICIES 859
The main contributions of this paper are as follows: avoidance among the SUs. Thus, the SUs have to learn to avoid
• We formulate the multiuser multiband spectrum sensing the other SUs, in addition to learning the PU occupancy sta-
problem in cognitive radio networks as a partially observ- tistics. In [14] multiagent machine learning policies based on
able stochastic game. collaborative filtering are proposed. The goal is to learn the PU
• We consider scenarios with both known and unknown state occupancy probabilities and maximize the data rate of spectrum
transition probabilities and propose algorithms for esti- access. The SUs collaborate with their neighbors by sharing
mating the state transition probabilities as well as for ap- their local estimates of the PU occupancy probabilities to find
proximating the state transitions with a more heuristic ap- more free spectrum or to increase the accuracy of their own esti-
proach based on the sensing results. mates. The SUs aim to collaborate with other SUs whose spec-
• We propose collaborative distributed multiagent reinforce- trum occupancy statistics are highly correlated with their own
ment learning based spectrum sensing policies for solving statistics. In [28], a distributed multiagent learning based spec-
the stochastic spectrum sensing game. The proposed poli- trum access policy is proposed. The SUs optimize their joint
cies aim to maximize the amount of idle spectrum found action by employing payoff propagation [12]. Each SU broad-
given a constraint on the probability of missed detection casts its payoff message to the neighboring SUs.
by addressing the tradeoff between finding more free spec- In addition, many other spectrum sensing and access policies
trum and the sensing reliability. not necessarily based on reinforcement learning have been pro-
• We propose a Sarsa [25] reinforcement learning-based ap- posed, such as myopic (greedy) policies [17], [29], [30], sev-
proach with linear function approximation to reduce the eral methods based on multiarmed bandit formulations [2], [13],
dimensionality and computational complexity of the state- [16] and game theory [23], [27]. For example, in [13] a multi-
action space of the cognitive radio network. band spectrum sensing and access problem is formulated as
• We demonstrate through simulation results that the pro- a restless multiarmed bandit problem. Asymptotically optimal
posed sensing policies provide an efficient way of finding multiuser sensing and access strategies are proposed for the case
idle spectrum in multiuser multiband cognitive radio sce- of unknown frequency band availability probabilities. In [30]
narios, and in particular in scenarios without a centralized the multiuser multiband spectrum sensing and access problem
controller, such as a base station. is formulated as a partially observable Markov decision process
The proposed sensing policies can be used in time-synchronized (POMDP) and policies operating individually without any col-
multiuser multiband cognitive radio networks with a common laboration by the SUs are proposed. The partially observable
control channel, and especially in networks without any cen- stochastic game formulation proposed in this paper is a gener-
tralized controller or base station. alization of the POMDP formulation in which the SUs collabo-
This paper is organized as follows. In Section II we dis- rate with each other and the rewards depend on the joint action
cuss the related literature. Section III describes the employed of the collaborating SUs. In [23] the multiuser spectrum sensing
system and network model. In Section IV the considered spec- and access problem is modeled as a coalitional game in partition
trum sensing problem is formulated as a partially observable form and a coalition formation algorithm is proposed. The pro-
stochastic game and in Section V multiagent reinforcement posed algorithm allows the SUs to distributedly join and leave
learning schemes are proposed for solving the formulated coalitions and optimize their sensing and access policies in a
game. The properties of the proposed algorithm are discussed distributed manner. A recent survey of sensing and access poli-
in Section VI. Simulation results are presented in Section VII cies for cognitive radio systems can be found in [20].
and the paper is concluded in Section VIII. None of the above sensing policies addresses the coupling of
the diversity gains achieved by multiple SUs sensing the same
II. RELATED WORK band simultaneously at different locations and false alarm rate
Reinforcement learning based spectrum sensing and access and the resulting influence of spectrum sensing errors on maxi-
policies for cognitive radios have been proposed in [3], [6], mizing the idle spectrum found in a cognitive radio ad hoc net-
[14], [15], [22] and [28]. In [3] and [6] sensing and access poli- work as we do in this paper.
cies that aim at balancing sensing, transmission, and switching The considered multiuser spectrum sensing problem is
the frequency band employed by the secondary users (SUs) are closely related to sensor management problems in sensor
proposed. In [22] a single state -learning based centrally-con- networks and in particular to sensor scheduling for target
trolled collaborative multiband sensing policy with -greedy ac- detection. This problem can also be formulated as a POMDP
tion selection is proposed for cognitive radio networks. The pro- [5], [10]. However, the underlying objectives are different. In
posed sensing policy is comprised of two stages both coordi- sensor scheduling the objective is to detect, identify and track
nated by a fusion center. In the first stage the fusion center aims targets as accurately as possible whereas in cognitive radio
to find the best frequency bands to be sensed by the SUs in applications the objective is to find as much available spectrum
order to maximize the throughput of the cognitive radio net- as possible and exploit that in the most efficient manner.
work. In the second stage each SU is assigned to sense one Some preliminary results of this work were presented in part
of the frequency bands selected in the first stage such that the in [18] at the Fifth IEEE International Symposium on New
probabilities of missed detection are minimized. In [15] a mul- Frontiers in Dynamic Spectrum Access Networks, Aachen,
tiagent multiband Aloha-like spectrum access policy based on Germany, May 2011 and in [19] at the Fourth IEEE Interna-
-learning is proposed. The proposed policy is independently tional Workshop on Computational Advances in Multi-Sensor
employed by each SU without any communication or collision Adaptive Processing, San Juan, Puerto Rico, Dec. 2011. In
860 IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING, VOL. 7, NO. 5, OCTOBER 2013
Fig. 1. A wireless cognitive radio ad hoc network in which neighboring users

collaborate with each other by sharing their local sensing information through Fig. 2. Time is divided into time slots. Each time slot is further divided into
local interaction. The large dashed circles indicate the local interaction radii of three different kind of minislots: spectrum sensing, SU collaboration, and spec-
the SUs indicated by the small solid circles. trum access. During the SU collaboration slots the SUs share their local sensing
information with the neighboring users to enable collaborative sensing and SU
action coordination.
this paper, we extend these results by providing a theoretical

so that the spectrum sensing and access tasks are performed si-
partially observable stochastic game formulation of the mul-
multaneously by all SUs. Thus, time may be divided into time
tiuser multiband sensing problem. We consider and propose
slots comprised of shorter minislots in which spectrum sensing,
new algorithms for solving the game in scenarios with both
spectrum access, and SU information sharing are performed.2
known and unknown state transition probabilities. Finally, we
Fig. 2 illustrates the slotted nature of the network operation.
illustrate the improved performance with extensive simulations
The SU collaboration minislots are employed for transmitting
in various propagation environments and network topologies.
local test statistics (or decisions) and current actions required
for performing collaborative detection as well as future actions
for coordinating the sensing action selection. That is, during the
III. SYSTEM MODEL SU collaboration minislot in time slot , each SU broadcasts
a data packet to its neighbors in group . The contents
In this work we are considering a cognitive radio ad hoc net- of these data packets will be described in more detail later. The
work consisting of a group of spatially dispersed SUs. Fig. 1 il- data packets are assumed to be transmitted through a common
lustrates an example network topology. The SUs in the network control channel in a sequential, possibly random, order by the
share information and communicate with each other through SUs that may be different from one time slot to another.
local interactions. Each SU is assumed to be able to commu- The spectrum of interest may be very wide and possibly non-
nicate with the users within its communication distance (inter- contiguous. Hence, it may be more energy efficient if each user
action radius) indicated by the large dashed circles in Fig. 1.1 senses only a part of the spectrum at a time. Moreover, coopera-
The users within the communication distance of a given SU are tion among the neighboring users allows distributing the sensing
called the neighbors or the neighboring users of this particular tasks among the users, thus, potentially reducing the workload
SU. Let denote the number of SUs in the network and let of each user while maintaining or improving the sensing perfor-
denote the group of neighbors of SU mance. Consequently, it is assumed that the spectrum of interest
where is the index of the SU in question. For simplicity, we is divided into frequency bands, indexed by ,
assume a fixed network topology, i.e., and do not vary and each SU senses only one frequency band in each time slot.
as functions of time. Moreover, we assume that the communi- We are most interested in a situation in which the spectrum occu-
cation distances are fixed within the network. In practice, the pancy may be only partially observed in each time slot. That is,
local interaction radii should be selected such that the true PU in a situation in which the number of SUs within a local neigh-
occupancy statistics at the locations of the interacting SUs are borhood and the number of different frequency bands are such
highly correlated. Thus, the choice of communication distance that all frequency bands cannot be reliably sensed in each time
depends, for example, on the local propagation environment and slot. Thus, it is important to design a sensing policy that results
the PU characteristics. However, the selection of the best local in the SUs sensing more frequently frequency bands that pro-
interaction radii is beyond the scope of this paper. We merely vide persistent spectrum opportunities and less frequently fre-
assume that the density of SUs is high enough to allow us to as- quency bands with high PU occupancy. To this end, we formu-
sume a relatively short communication distance that guarantees late the multiuser multiband sensing problem as a partially ob-
highly correlated true PU occupancy statistics at the locations servable stochastic game. Moreover, we propose a multiagent
of the interacting SUs. reinforcement learning based sensing policy for maximizing the
In addition, we assume that the SUs in the cognitive radio ad objective function of the game.
hoc network are synchronized in time with sufficient accuracy,
2Note that if the cognitive radios are equipped with multiple radio transmitters
1Note that in practice the actual communication distances depend on many and receivers the SU information sharing may be performed simultaneously to
factors, such as the transmit powers, carrier frequency, local interference levels, the sensing and access tasks on a different control channel. However, for the
antenna beampatterns, and propagation conditions between the SUs, among sake of simplicity of the presentation and to highlight the general relative order
others. Thus, in reality, the area constituting the neighborhood varies locally of the different tasks we employ time structures with explicit SU collaboration
and is not in general a perfect circle as depicted in Fig. 1. slots.
IV. PARTIALLY OBSERVABLE STOCHASTIC is the reward of SU in time slot and , is

GAME FORMULATION the discount rate.
Let 0 and 1 denote the vacant and occupied PU states, respec-
It is assumed that the SUs cannot directly observe the PU state tively. The probabilities of the sensing decisions depend on the
since the number of SUs may not be sufficiently high to allow false alarm and detection probabilities. The conditional proba-
sensing all frequency bands simultaneously. Moreover, the bilities of the sensing decisions conditioned on the PU state and
sensing results (observations) may be subject to errors due to SU actions are given by
imperfect physical layer sensing. Hence, the multiuser multi-
band spectrum sensing problem can be formulated as a partially (2)
observable stochastic game [9] that consists of (3)
• A sequence of discrete time steps
(4)
• A group of SUs .
• A set of possible PU states . Here (5)
where is a vector of the PU
states at each frequency band at SU , i.e., where denotes the actions of the SU group and
where . and denote the probabilities of false alarm and de-
• A set of possible sensing actions for each SU in each state tection obtained, respectively, when SUs are sensing. The
, where . The parameters of the local detectors are chosen such that the re-
combined action space is . sulting probability of missed detection satisfies the probability
• A set of possible sensing decisions (observations) for each of missed detection constraint for any .
SU where the elements of are in the set We model the PU states using belief states. That is, let
. denote SU ’s belief that the frequency band is
• A state transition function vacant. Moreover, let us assume for simplicity that the
that defines the state transition probability state transition probabilities do not depend on the SU
. actions, i.e., .3
• Reward functions that Now, let the state transition probabilities be denoted as
give the SUs’ rewards for the joint . Assuming that the PU
sensing action in state resulting occupancy statistics are independent for different frequency
in new state . bands the belief states can be updated using Bayes’ formula
• Sensing decision (observation) probability functions given the local sensing decisions as follows:
that define the probabilities
of the sensing decisions .
The multiuser multiband spectrum sensing game proceeds as
follows. In each state each SU chooses an action , i.e., a sensed vacant
frequency band to sense, makes local decisions based on the
received local test statistics, and receives reward , sensed occupied
for the joint action. This results in a new state . The local not sensed
binary decisions are obtained after fusing the test statistics
received from the neighboring SUs for each frequency band.
There are no specific restrictions on the applicable local test sensed vacant
statistics or fusion rules except that it is assumed that the local
fused decisions can be obtained individually at each SU after
receiving the local test statistics without any iterative processing sensed occupied
with the neighboring SUs. Hence, most common fusion rules, not sensed
such as likelihood ratio tests and Boolean decision rules can
be employed. The reward SU receives is defined as
the number of frequency bands identified as vacant in time slot
sensed vacant
by SU . The goal of each SU is to maximize the expected
sum of its discounted future rewards given a constraint on the (6)
probability of missed detection on each frequency band, i.e.,
sensed occupied
not sensed
3Strictly speaking this is true for any PU system since spectrum sensing is
(1)
a passive act that does not cause interference. However, if we assume that the
SUs will also transmit on the vacant frequency bands this is true, for example, if
the PU is a digital TV signal but not, in general, for PU systems using collision
where is the probability of missed detection constraint for avoidance mechanisms. However, modeling the influence of SU behavior on
frequency band inside a given protected region for the PUs, PU behavior in such a case is a difficult problem.
where and denote the probabilities of false alarm V. MULTIAGENT REINFORCEMENT LEARNING WITH
and detection obtained, respectively, when SUs are sensing. LINEAR FUNCTION APPROXIMATION
In this section, we propose multiagent reinforcement learning
A. Unknown PU State Transition Probabilities
schemes for solving the partially observable stochastic game,
In practice, the state transition probabilities may be com- i.e., for maximizing the objective functions in (1). One conve-
pletely unknown or can be only approximately estimated in ad- nient way of optimizing (1) is to learn an optimal action-value
vance. However, the state transition probabilities may be esti- function. The action-value function gives the expected
mated online sequentially using the following update rule after return of taking an action in a starting state and then fol-
observing two consecutive local decisions and lowing policy . In our case, we aim to find a policy that max-
for the same frequency band: imizes . Except for the heuristic belief state update al-
gorithm, the belief state is continuous and can attain any value in
the interval . Consequently, the typical approach of keeping
(7) a lookup-table of -values for all possible state-action pairs is
not feasible. Hence, we employ linear function approximation to
where is the index of the decision pairs, is a
approximate the action-value function and thus also reduce the
step size parameter and is an indicator function having
dimensionality of the problem. That is, the action-value function
value 1 if and 0 otherwise. Note that is approximated by a linear function as follows:
and by definition.
However, false alarms and missed detections introduce errors
in the estimates. Hence, we propose that the estimates of the
state transition probabilities are updated only if the reliability of (9)
both successive sensing decisions and is above some
predetermined threshold. That is, if and only if when where is a parameter vector and is a feature vector
and when , where is depending on the belief state and actions of the SU and its
the maximum allowed error probability and and are neighbors. Thus, the learning problem is transformed to the
the probabilities of false alarm and detection obtained with problem of learning the parameter vector . The feature vector
SUs sensing. Here is the number of SUs sensing frequency is constructed as follows. The number of features is equal to
band and thus contributing to the decision . Note that the the number of frequency bands with one feature for each
condition must hold separately for both and . frequency band. Each feature describes the SU’s belief that the
Assuming that there are no sensing errors, it follows from sto- corresponding frequency band is available for secondary use if
chastic approximation theory (see e.g., ([11], Theorem 1) that sensed by users so that the probability of missed detection
the update in (7) converges with probability 1 to the true state constraint is satisfied. The feature value depends on both
transition probabilities if and the SU’s belief that the spectrum is vacant and the probability
and each frequency band is sensed infinitely many times. of false alarm, and is given by
For example, by choosing the sample average is
obtained and the constraints are satisfied. However, although
(10)
the state transition probability estimates are guaranteed to con-
verge if the above conditions are satisfied, the convergence can
be slow. This is the case especially if the number of frequency (11)
bands is large compared to the number of SUs and the sensing where is an indicator function having value 1 if
reliability of individual sensors is not especially high. Hence, we and 0 otherwise, and is the false alarm probability ob-
propose also a heuristic update rule for updating the belief states tained with SUs sensing. The argument of the function is
based directly on the sensing results (observations) without es- the number of SUs in group sensing the frequency band
timating the state transition probabilities: in the time slot .
The -values are updated using Sarsa, a temporal-difference
(TD) learning algorithm [25]. The update of the parameter
vector of Sarsa with gradient-descent based linear function
approximation is given by [25]
(8) (12)
where is a constant in , e.g., . The update where is the reward in time slot and
rule moves gradually toward 1/2 from the previous sensed is a step size (learning rate) parameter. In
state as time advances. The idea is to favor sensing frequency (12), is the gradient of the action-value function
bands that have not been sensed occupied in a while. The value with respect to .
represents the highest uncertainty in the true PU In this section, we have proposed a reinforcement learning
occupancy in that frequency band. method employed by the individual SUs in the cognitive radio
Finally, in order to balance between exploiting the best ac-

tion and exploring other actions in the hope of finding an even
better one, we employ the -greedy action selection algorithm
[25]. The -greedy action selection is a simple, yet effective
method that balances between exploitation and exploration by
selecting the action that maximizes the action-value function,
i.e., , with a probability of , or a
random action, uniformly, with a probability of regardless of
the action-value estimates [25].
VI. DISCUSSION
Single-user Sarsa with finite state spaces has been proven to
converge with probability 1 to the optimal policy if the agent
employs a lookup-table to store the -values, visits every
state-action pair infinitely many times, the step size satisfies
and , and the learning policy
becomes greedy in the limit [24]. Moreover, single-user Sarsa
with linear function approximation has been shown in [21] to
also converge with probability 1 although under some more
restrictive conditions than lookup-table Sarsa. However, in
a multiagent setting it is not sufficient that the single-user
learning algorithm converges. In a multiagent setting the
agents’ policies should also be able to adapt to the other agents’
behavior. However, the proposed coordination scheme ensures
that the SUs are aware of the actions the SUs preceding them in
the sequential order have chosen. This allows the SUs to break
ties among equally good joint actions.
A notable feature and benefit of the proposed reinforcement
learning based sensing policies is that they are all extremely
computation- and memory-efficient. Due to the use of the linear
function approximation approach the memory requirements
ad hoc network. However, since we are considering a coop-
of the learning algorithms are substantially reduced. Instead
erative multiagent learning problem, the cooperation among
of maintaining a separate -value for each (discretized) be-
the SUs has to be considered as well. The cooperation among
lief state-action pair, each SU has to maintain only the local
the SUs involves information sharing and action coordination.
-vector of size to calculate the -value for any belief
Moreover, the algorithms for selecting the actions given the
state-action pair. Moreover, since each SU selects only its own
learned -values have to be also considered. These issues will
action given the actions of its neighbors, at most -values
be addressed in the following subsection.
have to be evaluated by each SU in each belief state to find the
best action.
A. Collaboration and Action Coordination
Finally, we note that the future action selections have to be
Algorithm 1 summarizes the proposed collaboration and ac- based on the belief state of the current time slot. This is due to the
tion coordination scheme. During the SU collaboration slots the fact that in Algorithm 1 both the local test statistic and the future
SUs transmit their data packets in a sequential order (for-loop action are transmitted in the same data packet. Thus, the future
in Algorithm 1) that may be random and vary between different action selections are effectively lagging one time slot behind. Of
time slots. The ordering does not affect the end result of the co- course, if an SU is able to make a sufficiently reliable decision
operative sensing since the fused decisions are made only after to some of the frequency bands already after receiving only a
all data packets containing test statistics have been received. subset of the local test statistics from its neighbors it can update
However, the ordering does affect the action selection for the its belief state for these frequency bands before choosing its own
next time slot in Algorithm 1. That is, the SUs that are later in action. Moreover, the whole problem can be circumvented if
the sequential order have the possibility of using the informa- the future actions are transmitted separately after all the local
tion obtained from the neighboring SUs preceding them in the decisions have been obtained for the frequency bands and the
sequential order to make more informed decisions about which belief states have been updated. However, this will also increase
frequency band to sense in the next time slot. Thus, the deci- the delay due to the SU collaboration and thus will reduce the
sions about which frequency bands to sense are better coordi- time available for secondary spectrum access.
nated within the cognitive network. Sharing this information al-
lows action changes to be made more frequently without jeop- VII. SIMULATION RESULTS
ardizing the stability of the cognitive network and the learning In this section, we demonstrate the performance of the pro-
processes of the individual SUs. posed reinforcement learning based sensing policies using sim-
Fig. 3. Multiagent spectrum sensing scenario with 10 SUs. The small square
at the origin denotes the location of the 7 PUs and the large circle denotes the
protected radius of the PUs. Small dots indicate the locations of the SUs. Each
shaded circle denotes the corresponding SU’s (i.e., the one in the middle of
the circle) communication range. The received PU signal power attenuates ac-
cording to where is the transmit power and is the
distance.
ulations. Both additive white Gaussian noise (AWGN) chan-

nels and independent Rayleigh fading channels are employed. In
each scenario, the local detector employed by the SUs is an en-
ergy detector [26]. Moreover, the SUs employ equal gain com-
Fig. 4. Average sensing performance of the proposed algorithms relative to
bining (EGC) [7] to fuse the local energy detection test statistics an optimal genie in AWGN in the topology of Fig. 3. (a) Percentage of free
obtained from multiple users to perform collaborative sensing. spectrum found and its (b) 50-sample moving average compared to an optimal
The probability of missed detection constraint is genie. The heuristic policy learns faster than the policy with known state tran-
sition probabilities. However, after convergence the performance of the policy
for all frequency bands . The following parameter values of with known state transition probabilities is the best as expected, although the
the proposed reinforcement learning algorithm have been used differences in performance are very minor among the different policies except
in the simulations: , for the myopic policy that has very poor performance.
and . Moreover, the exploration rate was set to 0.1
for all the other algorithms except the one with estimated PU
state transition probabilities. That is, for that algorithm started The PU occupancy follows two-state Markov models
from 1 and was linearly decreased to 0.1 in 750 and 1500 time with state transition probabilities given by
slots in AWGN and Rayleigh fading channels, respectively. This and
was done to speed up the learning process. The initial values , re-
for the PU state transition probability estimates were chosen as spectively, for the 7 PU frequency bands.
. Figs. 4 and 5 depict the performance of the proposed mul-
First, we consider a small scenario with 10 SUs and 7 PUs. tiagent reinforcement learning scheme with known and esti-
Fig. 3 depicts the spatial locations of the SUs and PUs. The SU mated PU state transition probabilities and with the proposed
locations have been randomly selected from a uniform distri- heuristic belief state update in AWGN and in Rayleigh fading
bution inside a square of size 0.1 0.1 centered at the origin. channels, respectively. The performance curves are averages
The received median PU signal power attenuates according to over 1000 Monte Carlo iterations. The performance is com-
where is the transmit power, is the dis- pared to a myopic policy [29] which is the optimal policy for
tance, and 0.05 is the reference distance. The path loss expo- single-user sensing with independent and identically distributed
nent 3.3 was chosen to model a typical urban microcell envi- frequency bands. Here, we employ the myopic policy in a mul-
ronment [8]. All SUs are inside the protected region of the PUs. tiuser setting; the SUs exchange their local test statistics sim-
At the edge of the protected region the median signal-to-noise ilarly to the other algorithms but decide independently which
ratio (SNR) is 0 dB. In AWGN channels the employed number frequency band to sense. Moreover, the performance of the al-
of energy detection samples is 50 and the resulting probabil- gorithms is plotted relative to an optimal centralized genie that
ities of false alarm are and knows in each time slot which frequency bands are vacant. This
respectively, for 1 and 2 or more SUs collaboratively policy has been obtained through an exhaustive search of all
sensing. In Rayleigh fading channels the number of energy de- possible joint actions.
tection samples is 100 and the resulting false alarm rates are It can be seen that the proposed heuristic belief state update
, and based algorithm learns faster than the algorithms using known
. These values have been obtained using and estimated PU state transition probabilities. However, the
analytical expressions [7] and numerical simulations. eventual performance after the initial learning phase is worse
Fig. 6. Mean absolute PU state transition probability estimation error over all
frequency bands and SUs as a function of the time slot. The PU state transition
probability estimates converge faster in the AWGN scenario since the sensing
reliability is much higher and the SUs are sensing different frequency bands
more frequently. Moreover, the PU state transition probability estimate for
(from vacant to vacant) converges faster due to the reinforcement learning al-
gorithm that tries to exploit vacant frequency bands which, consequently, are
sensed more often.
Fig. 5. Average sensing performance of the proposed algorithms relative to an

optimal genie in Rayleigh fading channels in the topology of Fig. 3. (a) Per-
centage of free spectrum found and (b) its 50-sample moving average com-
pared to an optimal genie. In the Rayleigh fading scenario the learning of the
algorithms takes considerably longer than in AWGN. This is due to the fact
that in Rayleigh fading false alarms and SU collaboration play a much bigger
role. Moreover, in the Rayleigh fading scenario the algorithms using known and
estimated state transition probabilities outperform the heuristic algorithm once
the initial learning phase has passed. The myopic policy performs better in the
Rayleigh fading case than in AWGN since the lack of coordination among the
SUs results in the SUs sensing the same frequency bands. Hence, the sensing
reliability is very high.
Fig. 7. Multiagent spectrum sensing scenario in which 100 SUs (small dots)
for the heuristic algorithm. In AWGN, the differences in perfor- are located close to the edge of the protected region (large circle) of the 15 PUs
(small square) where the probability of detection is smaller. The received PU
mance among the proposed algorithms are not significant but median signal power attenuates according to where is
in the Rayleigh fading scenario the algorithms based on known the transmit power and is the distance.
and estimated PU state transition probabilities outperform the
heuristic belief state update based algorithm. Moreover, in the much higher than in the Rayleigh fading channels. Hence, the
Rayleigh fading scenario all algorithms perform better com- SUs can sense different frequency bands simultaneously and
pared to the optimal genie. For example, at the end of the simu- thus the PU state transition probability estimates get updated
lation the algorithm with estimated PU state transition probabil- more frequently. In addition, due to the reinforcement learning
ities finds on average roughly 68.5% of the available spectrum algorithm vacant frequency bands are sensed more often and
the optimal genie finds whereas in the AWGN scenario the per- hence the estimate for converges faster than the estimate for
centage is around 64.5%. All the proposed algorithms clearly . For the same reasons the estimation accuracy varies quite
outperform the myopic policy. The myopic policy suffers from heavily also from one frequency band to another.
the lack of coordination in the action selection. After the initial Next, we consider a larger cognitive radio ad hoc network
phase all the SUs start sensing the same frequency bands. This consisting of 100 spatially dispersed SUs uniformly distributed
degrades the performance significantly in AWGN. On the other inside an annulus with an inner radius of 0.15 and an outer radius
hand, in the Rayleigh fading case this actually ensures a high of 0.20. Fig. 7 illustrates the considered network topology. Note
diversity order and thus a reasonable performance as well. that due to computational reasons we have limited the number
Fig. 6 shows the mean absolute PU state transition proba- of SUs to 100. In order to make the scenario representative of
bility estimation error over all frequency bands and SUs in both a difficult fading scenario the SUs have been placed close to
AWGN and Rayleigh fading scenarios. It can be seen that the the edge of the protected region of the PUs where the path loss
estimation error decreases faster in AWGN channels. This is un- is the largest and thus detecting the PUs is difficult. The re-
derstandable since in AWGN channels the sensing reliability is ceived PU median signal power attenuates again according to
Fig. 9. Histogram of the number of SU test statistics (i.e., diversity order) used
for making each decision in (a) AWGN and (b) Rayleigh fading for the policy
with known state transition probabilities. In the Rayleigh fading scenario the
proposed learning algorithm exploits SU collaboration to reduce the false alarm
rate and thus increase the sensing reliability while in the AWGN scenario SU
collaboration is employed to sense more spectrum.
Fig. 8. 50-sample moving average of the number of frequency bands found
correctly as vacant as a function of time in (a) AWGN and (b) Rayleigh fading
in the topology of Fig. 7. In the AWGN scenario the amount of free spectrum probabilities has slightly worse performance than the other two
found is much higher than in the Rayleigh fading scenario due to the much lower algorithms. In the Rayleigh fading scenario, we observe that
sensing reliability in fading channels. Moreover, due to the high number of dif-
ferent frequency bands learning the PU state transition probabilities is slower, the algorithm with known state transition probabilities has the
in particular, in the Rayleigh fading scenario and hence the performance of the best performance. Moreover, the algorithm with estimated PU
algorithm based on estimated state transition probabilities is the worst. state transition probabilities performs again the worst. The per-
formance is roughly 7% worse than the performance of the al-
where is the transmit power and is the gorithm with known PU state transition probabilities while for
distance. Moreover, the local detectors and their performances the heuristic algorithm it is only 3.5% worse. In the Rayleigh
(false alarm rates) are identical to the previous scenario with 10 fading scenario the learning of the PU state transition probabil-
SUs. The number of frequency bands, and thus also the number ities is slower due to the fact that the neighboring SUs are col-
of PUs, is 15. The PU state transition probabilities and laboratively sensing simultaneously the same frequency bands
of the two-state Markov models have been randomly selected and hence exploration is slower. Moreover, due to the employed
from the interval [0.85,0.98] independently for each frequency sensing reliability constraint only collaborative deci-
band. The reinforcement learning algorithm parameters are the sions with diversity order 3 or higher are used to update the PU
same as above. state transition probability estimates.
Fig. 8 depicts the performance of the proposed reinforcement It can also be seen that in the AWGN scenario the average
learning based sensing policies in AWGN and Rayleigh fading number of vacant frequency bands found is much larger than
scenarios. In Fig. 8 we have plotted the 50-sample moving av- in the Rayleigh fading scenario. This is due to the high false
erage of the number of frequency bands found correctly vacant alarm rates for low diversity orders in Rayleigh fading and
by the sensing policies as functions of time. The curves are aver- thus the need for collaborative sensing with a higher diversity
ages over 50 Monte Carlo iterations. Note that in this scenario order. Fig. 9 shows the histogram of the number of SU test
obtaining the optimal centralized genie is computationally in- statistics used for making the decision during the simulation for
feasible due to the very large number of possible joint actions. both AWGN and Rayleigh fading scenarios for the algorithm
That is, the number of different joint actions is where using known state transition probabilities. The histograms for
is the number of available frequency bands and is the other algorithms are very similar. It can be seen that the
the number of SUs in the network. distribution of the sensing actions are significantly different
In the AWGN scenario the performance of the algorithms is in the AWGN and Rayleigh fading scenarios. In the AWGN
very similar. The algorithm with estimated PU state transition scenario the neighboring SUs aim to sense different frequency
bands since the false alarm rate 0.0015 with single-user sensing [7] F. F. Digham, M.-S. Alouini, and M. K. Simon, “On the energy detec-
provides sufficiently reliable sensing. On the other hand, in tion of unknown signals over fading channels,” IEEE Trans. Commun.,
vol. 55, no. 1, pp. 21–24, Jan. 2007.
the Rayleigh fading scenario single-user sensing suffers from [8] A. Goldsmith, Wireless Communications. New York, NY, USA:
a high false alarm rate and thus from a large number of over- Cambridge Univ. Press, 2005.
[9] E. A. Hansen, D. S. Bernstein, and S. Zilberstein, “Dynamic program-
looked spectrum opportunities. Hence, the neighboring SUs ming for partially observable stochastic games,” in Proc. 19th National
aim to increase the diversity order of each decision by sensing Conf. Artif. Intell., San Jose, CA, USA, Jul. 25–29, 2004.
[10] Foundations and Applications of Sensor Management, A. O. Hero, D.
simultaneously the same frequency bands. It can be seen that Castañón, D. Cochran, and K. Kastella, Eds. New York, NY, USA:
diversity order 3 is the most commonly employed. This is log- Springer, 2008.
ical since it provides false alarm rate 0.03 and thus increasing [11] T. Jaakkola, M. I. Jordan, and S. P. Singh, “On the convergence of sto-
chastic iterative dynamic programming algorithms,” Neural Comput.,
the diversity order above 3 does not offer significant gain. vol. 6, pp. 1185–1201, 1994.
[12] J. R. Kok and N. Vlassis, “Collaborative multiagent reinforcement
learning by payoff propagation,” J. Mach. Learn. Res., vol. 7, pp.
VIII. CONCLUSION 1789–1828, Dec. 2006.
In this paper, we modeled the multiuser multiband spec- [13] L. Lai, H. El Gamal, H. Jiang, and H. V. Poor, “Cognitive medium ac-
cess: Exploration, exploitation, and competition,” IEEE Trans. Mobile
trum sensing problem in cognitive radio ad hoc networks as a Comput., vol. 10, no. 2, pp. 239–253, Feb. 2011.
partially observable stochastic game and proposed multiagent [14] H. Li, “Learning the spectrum via collaborative filtering in cognitive
radio networks,” in Proc. IEEE Symp. New Frontiers in Dynam. Spec-
reinforcement learning based algorithms for solving it. The trum Access Netw. (DySPAN 2010), Singapore, Apr. 6–9, 2010, pp.
proposed sensing policies employ SU collaboration through 1–12.
[15] H. Li, “Multiagent Q-learning for Aloha-like spectrum access in cog-
local interaction to obtain awareness of the local spectrum nitive radio systems,” EURASIP J. Wireless Commun. Netw., 2010,
occupancy. We considered and proposed algorithms with both Volume 2010, Article ID 876216.
known and unknown PU state transition probabilities. For the [16] K. Liu and Q. Zhao, “Distributed learning in multi-armed bandit with
multiple players,” IEEE Trans. Signal Process., vol. 58, no. 11, pp.
case of known PU state transition probabilities, we derived be- 5667–5681, Nov. 2010.
lief state update rules given the local sensing decisions. For the [17] K. Liu, Q. Zhao, and B. Krishnamachari, “Dynamic multichannel
access with imperfect channel state detection,” IEEE Trans. Signal
case of unknown PU state transition probabilities, we proposed Process., vol. 58, no. 5, pp. 2795–2808, May 2010.
an algorithm for estimating the PU state transition probabilities [18] J. Lundén, V. Koivunen, S. R. Kulkarni, and H. V. Poor, “Reinforce-
ment learning based distributed multiagent sensing policy for cognitive
as well as a heuristic belief state update scheme. All of these radio networks,” in Proc. 5th IEEE Int. Symp. New Frontiers in Dynam.
algorithms employ Sarsa based reinforcement learning with Spectrum Access Netw.(DySPAN 2011), Aachen, Germany, May 3–6,
linear function approximation that trades off between sensing 2011, pp. 642–646.
[19] J. Lundén, V. Koivunen, S. R. Kulkarni, and H. V. Poor, “Exploiting
more spectrum and the sensing reliability to maximize the spatial diversity in multiagent reinforcement learning based spectrum
amount of vacant spectrum found. Our simulation results show sensing,” in Proc. IEEE Workshop Comput. Adv. Multi-Sensor Adap-
tive Process. (CAMSAP ’11), San Juan, Puerto Rico, Dec. 13–16, 2011,
that the proposed learning and sensing algorithms provide pp. 325–328.
a computation- and memory-efficient way of finding vacant [20] J. Lundén, V. Koivunen, and H. V. Poor, , E. Biglieri, A. Goldsmith,
L. Greenstein, N. Mandayam, and H. V. Poor, Eds., “Spectrum ex-
spectrum in multiuser multiband cognitive radio scenarios. ploration and exploitation,” in Principles of Cognitive Radio, Chapter
Moreover, our results show that exploiting spatial diversity is 5. New York, NY, USA: Cambridge Univ. Press, 2013, pp. 184–258.
important for finding available spectrum more effectively in [21] F. S. Melo, S. P. Meyn, and M. I. Ribeiro, “An analysis of reinforce-
ment learning with function approximation,” in Proc. 25th Int. Conf.
fading channels. The algorithms with known and estimated Mach. Learn., Helsinki, Finland, Jul. 5–9, 2008, pp. 664–671.
PU state transition probabilities provide the best performance [22] J. Oksanen, J. Lundén, and V. Koivunen, “Reinforcement learning
based sensing policy optimization for energy efficient cognitive
eventually, in particular, in fading scenarios. However, the radio networks,” Neurocomput.—Spec. Iss. Mach. Learn. for Signal
proposed heuristic belief update based algorithm provides the Process., vol. 80, pp. 102–110, Mar. 2012.
[23] W. Saad, Z. Han, R. Zheng, A. Hjørungnes, T. Başar, and H. V. Poor,
fastest learning, thus indicating its suitability for nonstationary “Coalitional games in partition form for joint spectrum sensing and ac-
dynamic scenarios. cess in cognitive radio networks,” IEEE J. Sel. Topics Signal Process.,
vol. 6, no. 2, pp. 195–209, Apr. 2012.
[24] S. P. Singh, T. Jaakkola, M. L. Littman, and C. Szepesvári, “Con-
REFERENCES vergence results for single-step on-policy reinforcement-learning al-
[1] I. F. Akyildiz, W.-Y. Lee, and K. R. Chowdhury, “CRAHNs: Cognitive gorithms,” Mach. Learn., vol. 39, pp. 287–308, 2000.
radio ad hoc networks,” Ad Hoc Networks, vol. 7, no. 5, pp. 810–836, [25] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduc-
Jul. 2009. tion. Cambridge, MA, USA: MIT Press, 1998.
[2] A. Anandkumar, N. Michael, and A. K. Tang, “Opportunistic spectrum [26] H. Urkowitz, “Energy detection of unknown deterministic signals,”
access with multiple users: Learning under competition,” in Proc. IEEE Proc. IEEE, vol. 55, no. 4, pp. 523–531, Apr. 1967.
Conf. Comput. Commun. (INFOCOM 2010), San Diego, CA, USA, [27] Y. Xu, J. Wang, Q. Wu, A. Anpalagan, and Y.-D. Yao, “Opportunistic
Mar. 15–19, 2010, pp. 1–9. spectrum access in cognitive radio networks: Global optimization using
[3] U. Berthold, F. Fu, M. van der Schaar, and F. K. Jondral, “Detection of local interaction games,” IEEE J. Sel. Topics Signal Process., vol. 6, no.
spectral resources in cognitive radios using reinforcement learning,” in 2, pp. 180–194, Apr. 2012.
Proc. 3rd IEEE Symp. New Frontiers Dynamic Spectrum Access Netw. [28] K.-L. A. Yau, P. Komisarczuk, and P. D. Teal, “Achieving efficient
(DySPAN ’08), Chicago, IL, USA, Oct. 14–17, 2008, pp. 1–5. and optimal joint action in distributed cognitive radio networks using
[4] L. Busoniu, R. Babuška, and B. De Schutter, “A comprehensive survey payoff propagation,” in Proc. IEEE Int. Conf. Commun. (ICC ’10),
of multiagent reinforcement learning,” IEEE Trans. Systems, Man, Cy- Capetown, South Africa, May 23–27, 2010, pp. 1–6.
bern. C: Applicat. Rev., vol. 38, no. 2, pp. 156–172, Mar. 2008. [29] Q. Zhao, B. Krishnamachari, and K. Liu, “On myopic sensing for multi-
[5] E. K. P. Chong, C. M. Kreucher, and A. O. Hero, III, “Partially observ- channel opportunistic access: Structure, optimality, and performance,”
able Markov decision process approximations for adaptive sensing,” IEEE Trans. Wireless Commun., vol. 7, no. 12, pp. 5431–5440, Dec.
Discrete Event Dynam. Syst., vol. 19, no. 3, pp. 377–422, Sep. 2009. 2008.
[6] M. Di Felice, K. R. Chowdhury, W. Meleis, and L. Bononi, “To sense or [30] Q. Zhao, A. Swami, and Y. Chen, “Decentralized cognitive MAC for
to transmit: A learning-based spectrum management scheme for cogni- opportunistic spectrum access in ad hoc networks: A POMDP frame-
tive radio mesh networks,” in Proc. 5th IEEE Workshop Wireless Mesh work,” IEEE J. Sel. Areas Commun., vol. 25, no. 3, pp. 589–600, Apr.
Netw. (WIMESH 2010), Boston, MA, USA, Jun. 21, 2010. 2007.
Jarmo Lundén (S’05–M’10) received the M.Sc. Visa Koivunen (S’87–M’93–SM’98–F’11) received
(Tech.) degree with distinction in communications the D.Sc. (EE) degree (with honors) from the Depart-
engineering and the D.Sc. (Tech.) degree with ment of Electrical Engineering, University of Oulu,
distinction in signal processing for communications Finland. From 1992 to 1995, he was a visiting re-
from the Helsinki University of Technology, Espoo, searcher at the University of Pennsylvania, Philadel-
Finland, in 2005 and 2009, respectively. Since 2009 phia. Since 1999, he has been a Professor of Signal
he has been a Postdoctoral Researcher at Aalto Processing at Aalto University (formerly known as
University, Finland. Moreover, between August Helsinki University of Technology), Finland. Since
2010 and August 2011 he was a Visiting Postdoc- 2009 he has been Academy professor at Aalto Uni-
toral Research Associate at Princeton University, versity. He is one of the Principal Investigators in
Princeton, NJ, USA. His research interests include the SMARAD (Smart Radios and Wireless Systems)
statistical signal processing and machine learning, and their applications in Center of Excellence nominated by the Academy of Finland. He was also Ad-
cognitive radio systems and smart grids. junct Full Professor at the University of Pennsylvania, Philadelphia. During his
sabbatical leave in 2006–2007, he was a Visiting Fellow at Nokia Research
Center, as well as Princeton University. He makes frequent research visits to
Princeton University.
Sanjeev R. Kulkarni (M’91–SM’96–F’04) received His research interests include statistical, communications, radar and sensor
the B.S. in mathematics, B.S. in E.E., and M.S. in array signal processing. He has published more than 350 papers in international
mathematics from Clarkson University in 1983, scientific conferences and journals. Dr. Koivunen received the Primus Doctor
1984, and 1985, respectively; the M.S. degree in (best graduate) Award among the doctoral graduates in the years 1989 to 1994.
E.E. from Stanford University in 1985; and the He is a member of Eta Kappa Nu. He co-authored the papers receiving the Best
Ph.D. in E.E. from M.I.T. in 1991. From 1985 to Paper Award in IEEE PIMRC 2005, EUSIPCO 2006, and EuCAP 2006. He has
1991 he was a Member of the Technical Staff at been awarded the IEEE Signal Processing Society Best Paper Award for 2007
M.I.T. Lincoln Laboratory working on the modeling (co-authored with J. Eriksson). He is a member of the editorial board for the
and processing of laser radar measurements. In the IEEE Signal Processing Magazine. He is also a member of the IEEE Sensor
spring of 1986, he was a part-time faculty member Array Multichannel Signal Processing Technical Committee. He is also serving
at the University of Massachusetts, Boston. Since at the industrial liaison board of the IEEE SP society. He was the general chair
1991 he has been with Princeton University, where he is currently Professor of of the IEEE SPAWC (Signal Processing Advances in Wireless Communication)
Electrical Engineering, Director of the Keller Center, and an affiliated faculty 2007 conference in Helsinki, June 2007.
member in the Department of Operations Research and Financial Engineering
and the Department of Philosophy. He spent January 1996 as a research fellow
at the Australian National University, 1998 with Susquehanna International
Group, and summer 2001 with Flarion Technologies. Prof. Kulkarni received H. Vincent Poor (S’72–M’77–SM’82–F’87) re-
an ARO Young Investigator Award in 1992, an NSF Young Investigator ceived the Ph.D. degree in EECS from Princeton
Award in 1994, and several teaching awards at Princeton. He has served as University in 1977. From 1977 until 1990, he
an Associate Editor for the IEEE TRANSACTIONS ON INFORMATION THEORY. was on the faculty of the University of Illinois at
Prof. Kulkarni’s research interests include statistical pattern recognition, Urbana-Champaign. Since 1990 he has been on
nonparametric estimation, learning and adaptive systems, information theory, the faculty at Princeton, where he is the Michael
wireless networks, and image/video processing. Henry Strater University Professor of Electrical
Engineering and Dean of the School of Engineering
and Applied Science. Dr. Poor’s research interests
are in the areas of stochastic analysis, statistical
signal processing, and information theory, and their
applications in wireless networks and related fields such as social networks
and smart grid. Among his publications in these areas the recent books Smart
Grid Communications and Networking (Cambridge, 2012) and Principles of
Cognitive Radio (Cambridge, 2013).
Dr. Poor is a member of the National Academy of Engineering and the Na-
tional Academy of Sciences, a Fellow of the American Academy of Arts and
Sciences, an International Fellow of the Royal Academy of Engineering (U.K.),
and a Corresponding Fellow of the Royal Society of Edinburgh. He is also a
Fellow of the Institute of Mathematical Statistics, the Acoustical Society of
America, and other organizations. He received the Technical Achievement and
Society Awards of the SPS in 2007 and 2011, respectively. Recent recognition
of his work includes the 2010 IET Ambrose Fleming Medal, the 2011 IEEE
Eric E. Sumner Award, honorary doctorates from Aalborg University, the Hong
Kong University of Science and Technology, and the University of Edinburgh.

Cooperative Spectrum Sensing Princeton

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Cooperative Spectrum Sensing Princeton

Uploaded by

Copyright:

Available Formats

858 IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING, VOL. 7, NO.

Multiagent Reinforcement Learning Based Spectrum

D ISTRIBUTED spectrum sensing algorithms are particu-

1932-4553 © 2013 IEEE

Fig. 1. A wireless cognitive radio ad hoc network in which neighboring users

this paper, we extend these results by providing a theoretical

IV. PARTIALLY OBSERVABLE STOCHASTIC is the reward of SU in time slot and , is

Finally, in order to balance between exploiting the best ac-

ulations. Both additive white Gaussian noise (AWGN) chan-

Fig. 5. Average sensing performance of the proposed algorithms relative to an

You might also like