Abstract—This paper proposes distributed multiuser multiband spectrum sensing policies for cognitive radio networks based on multiagent reinforcement learning. The spectrum sensing problem is formulated as a partially observable stochastic game and multiagent reinforcement learning is employed to find a solution. In the proposed reinforcement learning based sensing policies the secondary users (SUs) collaborate to improve the sensing reliability and to distribute the sensing tasks among the network nodes. The SU collaboration is carried out through local interactions in which the SUs share their local test statistics or decisions as well as information on the frequency bands sensed with their neighbors. As a result, a map of spectrum occupancy in a local neighborhood is created. The goal of the proposed sensing policies is to maximize the amount of free spectrum found given a constraint on the probability of missed detection. This is addressed by obtaining a balance between sensing more spectrum and the reliability of the sensing results. Simulation results show that the proposed sensing policies provide an efficient way to find available spectrum in multiuser multiband cognitive radio scenarios.

Index Terms—Cognitive radio networks, collaborative spectrum sensing, multiagent reinforcement learning, multiuser multiband spectrum sensing policy, partially observable stochastic game.

I. INTRODUCTION

… change in a dynamic fashion due to mobility and nodes joining and leaving the network. In this work we propose distributed reinforcement learning based spectrum sensing policies for cognitive radio networks.

Cognitive radios sense the radio frequency spectrum in order to obtain awareness about the spectrum state and primary user (PU) occupancy. Spectrum sensing is essential for identifying available spectrum opportunities and for controlling the level of interference experienced by the PUs. Situational awareness combined with learning allows more efficient and effective use of the existing network resources and the time-frequency-location varying spectrum resources. That is, learning through past experience facilitates more intelligent decision making and performance optimization. A particularly suitable form of learning for cognitive radio networks is reinforcement learning [4], [25]. Reinforcement learning is a form of learning in which an agent or agents learn through experimentation and interaction with the environment and each other. The goal is to learn how to choose actions in order to maximize a given form of cumulative future reward. Reinforcement learning has found great success in a variety of applications [4], [25]. In this paper we employ multiagent reinforcement learning to optimize the sensing policy in a cognitive radio network. We propose a collaborative distributed sensing framework in …

The main contributions of this paper are as follows:
• We formulate the multiuser multiband spectrum sensing problem in cognitive radio networks as a partially observable stochastic game.
• We consider scenarios with both known and unknown state transition probabilities and propose algorithms for estimating the state transition probabilities as well as for approximating the state transitions with a more heuristic approach based on the sensing results.
• We propose collaborative distributed multiagent reinforcement learning based spectrum sensing policies for solving the stochastic spectrum sensing game. The proposed policies aim to maximize the amount of idle spectrum found given a constraint on the probability of missed detection by addressing the tradeoff between finding more free spectrum and the sensing reliability.
• We propose a Sarsa [25] reinforcement learning based approach with linear function approximation to reduce the dimensionality and computational complexity of the state-action space of the cognitive radio network.
• We demonstrate through simulation results that the proposed sensing policies provide an efficient way of finding idle spectrum in multiuser multiband cognitive radio scenarios, and in particular in scenarios without a centralized controller, such as a base station.

The proposed sensing policies can be used in time-synchronized multiuser multiband cognitive radio networks with a common control channel, and especially in networks without any centralized controller or base station.

This paper is organized as follows. In Section II we discuss the related literature. Section III describes the employed system and network model. In Section IV the considered spectrum sensing problem is formulated as a partially observable stochastic game and in Section V multiagent reinforcement learning schemes are proposed for solving the formulated game. The properties of the proposed algorithms are discussed in Section VI. Simulation results are presented in Section VII and the paper is concluded in Section VIII.

II. RELATED WORK

Reinforcement learning based spectrum sensing and access policies for cognitive radios have been proposed in [3], [6], [14], [15], [22] and [28]. In [3] and [6] sensing and access policies are proposed that aim at balancing sensing, transmission, and switching the frequency band employed by the secondary users (SUs). In [22] a single-state Q-learning based centrally controlled collaborative multiband sensing policy with ε-greedy action selection is proposed for cognitive radio networks. The proposed sensing policy is comprised of two stages, both coordinated by a fusion center. In the first stage the fusion center aims to find the best frequency bands to be sensed by the SUs in order to maximize the throughput of the cognitive radio network. In the second stage each SU is assigned to sense one of the frequency bands selected in the first stage such that the probabilities of missed detection are minimized. In [15] a multiagent multiband Aloha-like spectrum access policy based on Q-learning is proposed. The proposed policy is independently employed by each SU without any communication or collision avoidance among the SUs. Thus, the SUs have to learn to avoid the other SUs, in addition to learning the PU occupancy statistics. In [14] multiagent machine learning policies based on collaborative filtering are proposed. The goal is to learn the PU occupancy probabilities and maximize the data rate of spectrum access. The SUs collaborate with their neighbors by sharing their local estimates of the PU occupancy probabilities to find more free spectrum or to increase the accuracy of their own estimates. The SUs aim to collaborate with other SUs whose spectrum occupancy statistics are highly correlated with their own statistics. In [28], a distributed multiagent learning based spectrum access policy is proposed. The SUs optimize their joint action by employing payoff propagation [12]. Each SU broadcasts its payoff message to the neighboring SUs.

In addition, many other spectrum sensing and access policies not necessarily based on reinforcement learning have been proposed, such as myopic (greedy) policies [17], [29], [30], several methods based on multiarmed bandit formulations [2], [13], [16] and game theory [23], [27]. For example, in [13] a multiband spectrum sensing and access problem is formulated as a restless multiarmed bandit problem. Asymptotically optimal multiuser sensing and access strategies are proposed for the case of unknown frequency band availability probabilities. In [30] the multiuser multiband spectrum sensing and access problem is formulated as a partially observable Markov decision process (POMDP) and policies operating individually without any collaboration by the SUs are proposed. The partially observable stochastic game formulation proposed in this paper is a generalization of the POMDP formulation in which the SUs collaborate with each other and the rewards depend on the joint action of the collaborating SUs. In [23] the multiuser spectrum sensing and access problem is modeled as a coalitional game in partition form and a coalition formation algorithm is proposed. The proposed algorithm allows the SUs to distributedly join and leave coalitions and to optimize their sensing and access policies in a distributed manner. A recent survey of sensing and access policies for cognitive radio systems can be found in [20].

None of the above sensing policies addresses the coupling between the diversity gains achieved by multiple SUs sensing the same band simultaneously at different locations and the false alarm rate, and the resulting influence of spectrum sensing errors on maximizing the idle spectrum found in a cognitive radio ad hoc network, as we do in this paper.

The considered multiuser spectrum sensing problem is closely related to sensor management problems in sensor networks, and in particular to sensor scheduling for target detection. This problem can also be formulated as a POMDP [5], [10]. However, the underlying objectives are different. In sensor scheduling the objective is to detect, identify and track targets as accurately as possible, whereas in cognitive radio applications the objective is to find as much available spectrum as possible and exploit it in the most efficient manner.

Some preliminary results of this work were presented in part in [18] at the Fifth IEEE International Symposium on New Frontiers in Dynamic Spectrum Access Networks, Aachen, Germany, May 2011 and in [19] at the Fourth IEEE International Workshop on Computational Advances in Multi-Sensor Adaptive Processing, San Juan, Puerto Rico, Dec. 2011. …
860 IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING, VOL. 7, NO. 5, OCTOBER 2013
… not sensed    (1)

where $\zeta_n$ is the probability of missed detection constraint for frequency band $n$ inside a given protected region for the PUs,³

³Strictly speaking this is true for any PU system since spectrum sensing is a passive act that does not cause interference. However, if we assume that the SUs will also transmit on the vacant frequency bands this is true, for example, if the PU is a digital TV signal but not, in general, for PU systems using collision avoidance mechanisms. However, modeling the influence of SU behavior on PU behavior in such a case is a difficult problem.
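The full expression of the objective in (1) is not legible here. As a sketch of its general form only (the symbols $\pi$, $r_{n,t}$, $P_{\mathrm{md}}^{(n)}$ and $\zeta_n$ are assumed notation for this sketch, not necessarily the paper's), a missed-detection-constrained sensing objective of the kind described in the surrounding text can be written as:

```latex
\max_{\pi} \; \mathbb{E}\!\left[ \sum_{t} \sum_{n} r_{n,t} \right]
\quad \text{subject to} \quad P_{\mathrm{md}}^{(n)} \le \zeta_n \quad \text{for all bands } n,
```

i.e., maximize the expected amount of correctly identified idle spectrum while capping each band's probability of missed detection, which is the tradeoff the proposed policies balance by choosing how many SUs sense each band.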
862 IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING, VOL. 7, NO. 5, OCTOBER 2013
where $P_{fa}(k)$ and $P_d(k)$ denote the probabilities of false alarm and detection obtained, respectively, when $k$ SUs are sensing.

A. Unknown PU State Transition Probabilities

In practice, the state transition probabilities may be completely unknown or can be only approximately estimated in advance. However, the state transition probabilities may be estimated online sequentially using the following update rule after observing two consecutive local decisions $D_{t-1}$ and $D_t$ for the same frequency band:

$\hat{P}_{ij}(m) = \hat{P}_{ij}(m-1) + \beta_m \left[ I(D_t = j) - \hat{P}_{ij}(m-1) \right]$, for $D_{t-1} = i$,   (7)

where $m$ is the index of the decision pairs, $\beta_m$ is a step size parameter and $I(D_t = j)$ is an indicator function having value 1 if $D_t = j$ and 0 otherwise. Note that $\hat{P}_{01}(m) = 1 - \hat{P}_{00}(m)$ and $\hat{P}_{10}(m) = 1 - \hat{P}_{11}(m)$ by definition.

However, false alarms and missed detections introduce errors in the estimates. Hence, we propose that the estimates of the state transition probabilities are updated only if the reliability of both successive sensing decisions $D_{t-1}$ and $D_t$ is above some predetermined threshold. That is, if and only if $P_{fa}(K) \le \epsilon$ when $D = 1$ and $1 - P_d(K) \le \epsilon$ when $D = 0$, where $\epsilon$ is the maximum allowed error probability and $P_{fa}(K)$ and $P_d(K)$ are the probabilities of false alarm and detection obtained with $K$ SUs sensing. Here $K$ is the number of SUs sensing the frequency band and thus contributing to the decision $D$. Note that the condition must hold separately for both $D_{t-1}$ and $D_t$.

Assuming that there are no sensing errors, it follows from stochastic approximation theory (see, e.g., [11], Theorem 1) that the update in (7) converges with probability 1 to the true state transition probabilities if $\sum_m \beta_m = \infty$ and $\sum_m \beta_m^2 < \infty$ and each frequency band is sensed infinitely many times. For example, by choosing $\beta_m = 1/m$ the sample average is obtained and the constraints are satisfied. However, although the state transition probability estimates are guaranteed to converge if the above conditions are satisfied, the convergence can be slow. This is the case especially if the number of frequency bands is large compared to the number of SUs and the sensing reliability of individual sensors is not especially high. Hence, we also propose a heuristic update rule for updating the belief states based directly on the sensing results (observations) without estimating the state transition probabilities:

$b_{t+1} = \eta \cdot \tfrac{1}{2} + (1 - \eta)\, b_t$,   (8)

where $\eta$ is a constant in $(0, 1)$. The update rule moves the belief gradually toward 1/2 from the previous sensed state as time advances. The idea is to favor sensing frequency bands that have not been sensed occupied in a while. The value 1/2 represents the highest uncertainty in the true PU occupancy in that frequency band.

V. MULTIAGENT REINFORCEMENT LEARNING WITH LINEAR FUNCTION APPROXIMATION

In this section, we propose multiagent reinforcement learning schemes for solving the partially observable stochastic game, i.e., for maximizing the objective functions in (1). One convenient way of optimizing (1) is to learn an optimal action-value function. The action-value function $Q^{\pi}(b, a)$ gives the expected return of taking an action $a$ in a starting state $b$ and then following policy $\pi$. In our case, we aim to find a policy that maximizes $Q^{\pi}(b, a)$. Except for the heuristic belief state update algorithm, the belief state is continuous and can attain any value in the interval $[0, 1]$. Consequently, the typical approach of keeping a lookup-table of $Q$-values for all possible state-action pairs is not feasible. Hence, we employ linear function approximation to approximate the action-value function and thus also reduce the dimensionality of the problem. That is, the action-value function is approximated by a linear function as follows:

$Q(b, a) = \theta^T \phi(b, a)$,   (9)

where $\theta$ is a parameter vector and $\phi(b, a)$ is a feature vector depending on the belief state and actions of the SU and its neighbors. Thus, the learning problem is transformed to the problem of learning the parameter vector $\theta$. The feature vector is constructed as follows. The number of features is equal to the number of frequency bands $N$, with one feature for each frequency band. Each feature describes the SU's belief that the corresponding frequency band is available for secondary use if sensed by $K_n$ users so that the probability of missed detection constraint is satisfied. The feature value depends on both the SU's belief that the spectrum is vacant and the probability of false alarm, and is given by

$\phi_n(b, a) = b_n \left( 1 - P_{fa}(K_n) \right)$,   (10)

$K_n = \sum_i I(a_i = n)$,   (11)

where $I(a_i = n)$ is an indicator function having value 1 if $a_i = n$ and 0 otherwise, and $P_{fa}(K_n)$ is the false alarm probability obtained with $K_n$ SUs sensing. The argument $K_n$ of the function is the number of SUs in the group sensing the frequency band $n$ in the time slot $t$.

The $Q$-values are updated using Sarsa, a temporal-difference (TD) learning algorithm [25]. The update of the parameter vector of Sarsa with gradient-descent based linear function approximation is given by [25]

$\theta_{t+1} = \theta_t + \alpha_t \left[ r_{t+1} + \gamma Q(b_{t+1}, a_{t+1}) - Q(b_t, a_t) \right] \nabla_{\theta} Q(b_t, a_t)$,   (12)

where $r_{t+1}$ is the reward in time slot $t+1$ and $\alpha_t$ is a step size (learning rate) parameter. In (12), $\nabla_{\theta} Q(b_t, a_t)$ is the gradient of the action-value function with respect to $\theta$, which for the linear approximation in (9) is simply $\phi(b_t, a_t)$.

In this section, we have proposed a reinforcement learning method employed by the individual SUs in the cognitive radio
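As an illustration only, the learning machinery described above can be sketched in Python: the stochastic-approximation update in the spirit of (7), the heuristic belief drift of (8), and gradient-descent Sarsa in the spirit of (12) over per-band features in the spirit of (10) and (11). All names, the toy false alarm model `p_fa`, and the parameter values are assumptions of this sketch, not the paper's.

```python
import random

class TransitionEstimator:
    """Online estimation of two-state PU transition probabilities in the
    spirit of (7): a stochastic-approximation update driven by pairs of
    consecutive fused decisions (0 = vacant, 1 = occupied)."""

    def __init__(self):
        self.P_hat = [[0.5, 0.5], [0.5, 0.5]]   # P_hat[i][j] ~ Pr(next = j | current = i)
        self.counts = [0, 0]                    # decision pairs starting in state i

    def update(self, d_prev, d_curr, reliable=True):
        # Reliability gating: skip decision pairs whose error probability
        # exceeds the allowed threshold (checked by the caller).
        if not reliable:
            return
        i = d_prev
        self.counts[i] += 1
        beta = 1.0 / self.counts[i]             # beta_m = 1/m gives the sample average
        for j in (0, 1):
            indicator = 1.0 if d_curr == j else 0.0
            self.P_hat[i][j] += beta * (indicator - self.P_hat[i][j])

def heuristic_belief_update(b, eta=0.1):
    """Heuristic drift of a band's vacancy belief toward 1/2, as in (8)."""
    return eta * 0.5 + (1.0 - eta) * b

def features(belief, joint_action, p_fa, n_bands):
    """One feature per band in the spirit of (10)-(11):
    phi_n = b_n * (1 - P_fa(K_n)), K_n = number of SUs sensing band n."""
    phi = [0.0] * n_bands
    for n in range(n_bands):
        k_n = sum(1 for a in joint_action if a == n)
        if k_n > 0:
            phi[n] = belief[n] * (1.0 - p_fa(k_n))
    return phi

def q_value(theta, phi):
    """Linear action-value approximation (9): Q = theta^T phi."""
    return sum(t * f for t, f in zip(theta, phi))

def sarsa_update(theta, phi, reward, phi_next, alpha=0.1, gamma=0.9):
    """Gradient-descent Sarsa update in the spirit of (12); for a linear
    approximator the gradient of Q w.r.t. theta is the feature vector."""
    td_error = reward + gamma * q_value(theta, phi_next) - q_value(theta, phi)
    return [t + alpha * td_error * f for t, f in zip(theta, phi)]

# Self-check: estimate the transitions of a simulated two-state PU chain.
random.seed(1)
P_true = [[0.9, 0.1], [0.2, 0.8]]
est, state = TransitionEstimator(), 0
for _ in range(20000):
    nxt = 0 if random.random() < P_true[state][0] else 1
    est.update(state, nxt)
    state = nxt
```

For instance, with belief [0.8, 0.3, 0.5], two SUs both sensing band 0, and the toy model p_fa(k) = 0.1**k, the feature vector is [0.792, 0, 0], and a single Sarsa step from an all-zero theta with reward 1 moves theta[0] to alpha * 0.792 = 0.0792.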
LUNDÉN et al.: MULTIAGENT REINFORCEMENT LEARNING BASED SPECTRUM SENSING POLICIES 863
ad hoc network. However, since we are considering a cooperative multiagent learning problem, the cooperation among the SUs has to be considered as well. The cooperation among the SUs involves information sharing and action coordination. Moreover, the algorithms for selecting the actions given the learned $Q$-values also have to be considered. These issues will be addressed in the following subsection.

A. Collaboration and Action Coordination

Algorithm 1 summarizes the proposed collaboration and action coordination scheme. During the SU collaboration slots the SUs transmit their data packets in a sequential order (for-loop in Algorithm 1) that may be random and vary between different time slots. The ordering does not affect the end result of the cooperative sensing since the fused decisions are made only after all data packets containing test statistics have been received. However, the ordering does affect the action selection for the next time slot in Algorithm 1. That is, the SUs that are later in the sequential order have the possibility of using the information obtained from the neighboring SUs preceding them in the sequential order to make more informed decisions about which frequency band to sense in the next time slot. Thus, the decisions about which frequency bands to sense are better coordinated within the cognitive network. Sharing this information allows action changes to be made more frequently without jeopardizing the stability of the cognitive network and the learning processes of the individual SUs.

Finally, we note that the future action selections have to be based on the belief state of the current time slot. This is due to the fact that in Algorithm 1 both the local test statistic and the future action are transmitted in the same data packet. Thus, the future action selections are effectively lagging one time slot behind. Of course, if an SU is able to make a sufficiently reliable decision for some of the frequency bands already after receiving only a subset of the local test statistics from its neighbors, it can update its belief state for these frequency bands before choosing its own action. Moreover, the whole problem can be circumvented if the future actions are transmitted separately after all the local decisions have been obtained for the frequency bands and the belief states have been updated. However, this will also increase the delay due to the SU collaboration and thus will reduce the time available for secondary spectrum access.

VI. DISCUSSION

Single-user Sarsa with finite state spaces has been proven to converge with probability 1 to the optimal policy if the agent employs a lookup-table to store the $Q$-values, visits every state-action pair infinitely many times, the step size satisfies $\sum_t \alpha_t = \infty$ and $\sum_t \alpha_t^2 < \infty$, and the learning policy becomes greedy in the limit [24]. Moreover, single-user Sarsa with linear function approximation has been shown in [21] to also converge with probability 1, although under some more restrictive conditions than lookup-table Sarsa. However, in a multiagent setting it is not sufficient that the single-user learning algorithm converges. In a multiagent setting the agents' policies should also be able to adapt to the other agents' behavior. However, the proposed coordination scheme ensures that the SUs are aware of the actions the SUs preceding them in the sequential order have chosen. This allows the SUs to break ties among equally good joint actions.

A notable feature and benefit of the proposed reinforcement learning based sensing policies is that they are all extremely computation- and memory-efficient. Due to the use of the linear function approximation approach the memory requirements of the learning algorithms are substantially reduced. Instead of maintaining a separate $Q$-value for each (discretized) belief state-action pair, each SU has to maintain only the local $\theta$-vector of size $N$ to calculate the $Q$-value for any belief state-action pair. Moreover, since each SU selects only its own action given the actions of its neighbors, at most $N$ $Q$-values have to be evaluated by each SU in each belief state to find the best action.

VII. SIMULATION RESULTS

In this section, we demonstrate the performance of the proposed reinforcement learning based sensing policies using simulations.
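The tie-breaking benefit of the sequential ordering in Algorithm 1 can be illustrated with a small sketch. The function and data layout here are hypothetical, and the exchange of test statistics in Algorithm 1 is omitted; the sketch only shows how an SU later in the order can use its predecessors' announced choices.

```python
def coordinate_actions(su_order, band_scores):
    """Sequential action coordination sketch: each SU announces its
    next-slot band in turn and prefers, among equally valued bands,
    one that no predecessor has already claimed."""
    chosen = {}
    announced = []   # bands claimed by SUs earlier in the sequential order
    for su in su_order:
        scores = band_scores[su]
        # Primary key: the SU's own value for the band; secondary key:
        # whether the band is still unclaimed (True sorts above False).
        best = max(range(len(scores)),
                   key=lambda n: (scores[n], n not in announced))
        chosen[su] = best
        announced.append(best)
    return chosen

# Two SUs valuing bands 0 and 1 equally: the second SU sees the first
# SU's announcement and breaks the tie toward the unclaimed band.
result = coordinate_actions([0, 1], {0: [1.0, 1.0, 0.2], 1: [1.0, 1.0, 0.2]})
```

Without the announcements, both SUs would greedily pick the same band; with them, the equally good joint actions are disambiguated locally.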
Fig. 3. Multiagent spectrum sensing scenario with 10 SUs. The small square at the origin denotes the location of the 7 PUs and the large circle denotes the protected radius of the PUs. Small dots indicate the locations of the SUs. Each shaded circle denotes the corresponding SU's (i.e., the one in the middle of the circle) communication range. The received PU signal power attenuates as a power of the distance $d$ from the transmitter with transmit power $P_T$.
Fig. 6. Mean absolute PU state transition probability estimation error over all frequency bands and SUs as a function of the time slot. The PU state transition probability estimates converge faster in the AWGN scenario since the sensing reliability is much higher and the SUs are sensing different frequency bands more frequently. Moreover, the PU state transition probability estimate for $P_{00}$ (from vacant to vacant) converges faster due to the reinforcement learning algorithm that tries to exploit vacant frequency bands which, consequently, are sensed more often.
Fig. 7. Multiagent spectrum sensing scenario in which 100 SUs (small dots) are located close to the edge of the protected region (large circle) of the 15 PUs (small square) where the probability of detection is smaller. The received PU median signal power attenuates as a power of the distance $d$ from the transmitter with transmit power $P_T$.

for the heuristic algorithm. In AWGN, the differences in performance among the proposed algorithms are not significant, but in the Rayleigh fading scenario the algorithms based on known and estimated PU state transition probabilities outperform the heuristic belief state update based algorithm. Moreover, in the Rayleigh fading scenario all algorithms perform better relative to the optimal genie. For example, at the end of the simulation the algorithm with estimated PU state transition probabilities finds on average roughly 68.5% of the available spectrum the optimal genie finds, whereas in the AWGN scenario the percentage is around 64.5%. All the proposed algorithms clearly outperform the myopic policy. The myopic policy suffers from the lack of coordination in the action selection: after the initial phase all the SUs start sensing the same frequency bands. This degrades the performance significantly in AWGN. On the other hand, in the Rayleigh fading case this actually ensures a high diversity order and thus a reasonable performance as well.

Fig. 6 shows the mean absolute PU state transition probability estimation error over all frequency bands and SUs in both AWGN and Rayleigh fading scenarios. It can be seen that the estimation error decreases faster in AWGN channels. This is understandable since in AWGN channels the sensing reliability is much higher than in the Rayleigh fading channels. Hence, the SUs can sense different frequency bands simultaneously and thus the PU state transition probability estimates get updated more frequently. In addition, due to the reinforcement learning algorithm vacant frequency bands are sensed more often and hence the estimate for $P_{00}$ converges faster than the estimate for $P_{11}$. For the same reasons the estimation accuracy also varies quite heavily from one frequency band to another.

Next, we consider a larger cognitive radio ad hoc network consisting of 100 spatially dispersed SUs uniformly distributed inside an annulus with an inner radius of 0.15 and an outer radius of 0.20. Fig. 7 illustrates the considered network topology. Note that for computational reasons we have limited the number of SUs to 100. In order to make the scenario representative of a difficult fading scenario the SUs have been placed close to the edge of the protected region of the PUs, where the path loss is the largest and thus detecting the PUs is difficult. The received PU median signal power attenuates again as a power of the distance,
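Placing points uniformly at random inside an annulus, as in the SU topology above, requires a square-root transform of the radius; drawing the radius uniformly would oversample small radii. A small sketch (function name assumed):

```python
import math
import random

def sample_annulus(r_in, r_out, rng=random):
    """Draw a point uniformly (in area) from an annulus. Uniformity
    requires r = sqrt(u * (r_out^2 - r_in^2) + r_in^2), u ~ Uniform(0, 1),
    rather than a uniform draw of the radius itself."""
    u = rng.random()
    r = math.sqrt(u * (r_out ** 2 - r_in ** 2) + r_in ** 2)
    angle = 2.0 * math.pi * rng.random()
    return r * math.cos(angle), r * math.sin(angle)

# 100 SUs inside the annulus with inner radius 0.15 and outer radius 0.20.
random.seed(0)
sus = [sample_annulus(0.15, 0.20) for _ in range(100)]
```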
Fig. 9. Histogram of the number of SU test statistics (i.e., diversity order) used
for making each decision in (a) AWGN and (b) Rayleigh fading for the policy
with known state transition probabilities. In the Rayleigh fading scenario the
proposed learning algorithm exploits SU collaboration to reduce the false alarm
rate and thus increase the sensing reliability while in the AWGN scenario SU
collaboration is employed to sense more spectrum.
Fig. 8. 50-sample moving average of the number of frequency bands found correctly as vacant as a function of time in (a) AWGN and (b) Rayleigh fading in the topology of Fig. 7. In the AWGN scenario the amount of free spectrum found is much higher than in the Rayleigh fading scenario due to the much lower sensing reliability in fading channels. Moreover, due to the high number of different frequency bands, learning the PU state transition probabilities is slower, in particular in the Rayleigh fading scenario, and hence the performance of the algorithm based on estimated state transition probabilities is the worst.

where $P_T$ is the transmit power and $d$ is the distance. Moreover, the local detectors and their performances (false alarm rates) are identical to the previous scenario with 10 SUs. The number of frequency bands, and thus also the number of PUs, is 15. The PU state transition probabilities $P_{00}$ and $P_{11}$ of the two-state Markov models have been randomly selected from the interval [0.85, 0.98] independently for each frequency band. The reinforcement learning algorithm parameters are the same as above.

Fig. 8 depicts the performance of the proposed reinforcement learning based sensing policies in AWGN and Rayleigh fading scenarios. In Fig. 8 we have plotted the 50-sample moving average of the number of frequency bands found correctly vacant by the sensing policies as functions of time. The curves are averages over 50 Monte Carlo iterations. Note that in this scenario obtaining the optimal centralized genie is computationally infeasible due to the very large number of possible joint actions. That is, the number of different joint actions is $N^K$, where $N$ is the number of available frequency bands and $K$ is the number of SUs in the network.

In the AWGN scenario the performance of the algorithms is very similar. The algorithm with estimated PU state transition probabilities has slightly worse performance than the other two algorithms. In the Rayleigh fading scenario, we observe that the algorithm with known state transition probabilities has the best performance. Moreover, the algorithm with estimated PU state transition probabilities again performs the worst. The performance is roughly 7% worse than the performance of the algorithm with known PU state transition probabilities, while for the heuristic algorithm it is only 3.5% worse. In the Rayleigh fading scenario the learning of the PU state transition probabilities is slower due to the fact that the neighboring SUs are collaboratively sensing the same frequency bands simultaneously and hence exploration is slower. Moreover, due to the employed sensing reliability constraint, only collaborative decisions with diversity order 3 or higher are used to update the PU state transition probability estimates.

It can also be seen that in the AWGN scenario the average number of vacant frequency bands found is much larger than in the Rayleigh fading scenario. This is due to the high false alarm rates for low diversity orders in Rayleigh fading and thus the need for collaborative sensing with a higher diversity order. Fig. 9 shows the histogram of the number of SU test statistics used for making the decision during the simulation for both AWGN and Rayleigh fading scenarios for the algorithm using known state transition probabilities. The histograms for the other algorithms are very similar. It can be seen that the distributions of the sensing actions are significantly different in the AWGN and Rayleigh fading scenarios. In the AWGN scenario the neighboring SUs aim to sense different frequency
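The joint-action count mentioned above grows exponentially with the number of SUs, which is why the centralized genie is infeasible in this scenario. A quick check (hypothetical helper):

```python
# With N frequency bands and K SUs, each SU choosing one band to sense,
# the centralized genie would have to search over N**K joint actions.
def num_joint_actions(n_bands, n_sus):
    return n_bands ** n_sus

print(num_joint_actions(15, 3))               # a 3-SU network: 3375 joint actions
print(len(str(num_joint_actions(15, 100))))   # the 100-SU scenario: a 118-digit count
```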
bands since the false alarm rate 0.0015 with single-user sensing [7] F. F. Digham, M.-S. Alouini, and M. K. Simon, “On the energy detec-
provides sufficiently reliable sensing. On the other hand, in tion of unknown signals over fading channels,” IEEE Trans. Commun.,
vol. 55, no. 1, pp. 21–24, Jan. 2007.
the Rayleigh fading scenario single-user sensing suffers from [8] A. Goldsmith, Wireless Communications. New York, NY, USA:
a high false alarm rate and thus from a large number of over- Cambridge Univ. Press, 2005.
[9] E. A. Hansen, D. S. Bernstein, and S. Zilberstein, “Dynamic program-
looked spectrum opportunities. Hence, the neighboring SUs ming for partially observable stochastic games,” in Proc. 19th National
aim to increase the diversity order of each decision by sensing Conf. Artif. Intell., San Jose, CA, USA, Jul. 25–29, 2004.
[10] Foundations and Applications of Sensor Management, A. O. Hero, D.
simultaneously the same frequency bands. It can be seen that Castañón, D. Cochran, and K. Kastella, Eds. New York, NY, USA:
diversity order 3 is the most commonly employed. This is log- Springer, 2008.
ical since it provides false alarm rate 0.03 and thus increasing [11] T. Jaakkola, M. I. Jordan, and S. P. Singh, “On the convergence of sto-
chastic iterative dynamic programming algorithms,” Neural Comput.,
the diversity order above 3 does not offer significant gain. vol. 6, pp. 1185–1201, 1994.
[12] J. R. Kok and N. Vlassis, “Collaborative multiagent reinforcement
learning by payoff propagation,” J. Mach. Learn. Res., vol. 7, pp.
VIII. CONCLUSION 1789–1828, Dec. 2006.
In this paper, we modeled the multiuser multiband spec- [13] L. Lai, H. El Gamal, H. Jiang, and H. V. Poor, “Cognitive medium ac-
cess: Exploration, exploitation, and competition,” IEEE Trans. Mobile
trum sensing problem in cognitive radio ad hoc networks as a Comput., vol. 10, no. 2, pp. 239–253, Feb. 2011.
partially observable stochastic game and proposed multiagent [14] H. Li, “Learning the spectrum via collaborative filtering in cognitive
radio networks,” in Proc. IEEE Symp. New Frontiers in Dynam. Spec-
reinforcement learning based algorithms for solving it. The proposed sensing policies employ SU collaboration through local interaction to obtain awareness of the local spectrum occupancy. We considered and proposed algorithms for both known and unknown PU state transition probabilities. For the case of known PU state transition probabilities, we derived belief state update rules given the local sensing decisions. For the case of unknown PU state transition probabilities, we proposed an algorithm for estimating the PU state transition probabilities as well as a heuristic belief state update scheme. All of these algorithms employ Sarsa based reinforcement learning with linear function approximation, trading off sensing more spectrum against the reliability of the sensing results in order to maximize the amount of vacant spectrum found. Our simulation results show that the proposed learning and sensing algorithms provide a computation- and memory-efficient way of finding vacant spectrum in multiuser multiband cognitive radio scenarios. Moreover, our results show that exploiting spatial diversity is important for finding available spectrum more effectively in fading channels. The algorithms with known and estimated PU state transition probabilities eventually provide the best performance, in particular in fading scenarios. However, the proposed heuristic belief update based algorithm provides the fastest learning, indicating its suitability for nonstationary, dynamic scenarios.
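The Sarsa update with linear function approximation summarized above can be illustrated with a minimal single-SU sketch. The band count, feature choice, reward values, toy occupancy and detector models, and the heuristic belief update below are all illustrative assumptions for exposition, not the paper's exact formulation; the reward simply pairs a bonus for vacant spectrum found with a penalty for missed detections, mirroring the stated trade-off.

```python
import numpy as np

rng = np.random.default_rng(0)

N_BANDS = 4           # frequency bands an SU may choose to sense (assumed)
ALPHA, GAMMA, EPS = 0.1, 0.9, 0.1   # step size, discount, exploration rate

# Toy belief state: probability that each band is vacant.
belief = np.full(N_BANDS, 0.5)

# Linear action-value function: Q(belief, a) = w[a] . phi(belief).
w = np.zeros((N_BANDS, N_BANDS))

def phi(b):
    return b.copy()   # features = current belief vector (illustrative choice)

def q(b, a):
    return w[a] @ phi(b)

def epsilon_greedy(b):
    if rng.random() < EPS:
        return int(rng.integers(N_BANDS))
    return int(np.argmax([q(b, a) for a in range(N_BANDS)]))

def sense(a):
    """Simulate sensing band a: (truly vacant?, detector says vacant?)."""
    vacant = rng.random() < 0.6     # stand-in PU occupancy model
    correct = rng.random() < 0.9    # stand-in detector reliability
    return vacant, (vacant if correct else not vacant)

a = epsilon_greedy(belief)
for t in range(5000):
    vacant, sensed_vacant = sense(a)
    # +1 for free spectrum found, -2 for a missed detection (occupied band
    # declared vacant), 0 otherwise -- an assumed reward shaping.
    if vacant and sensed_vacant:
        r = 1.0
    elif (not vacant) and sensed_vacant:
        r = -2.0
    else:
        r = 0.0
    # Heuristic belief update from the local decision (illustrative only).
    belief_next = belief.copy()
    belief_next[a] = 0.8 * belief_next[a] + 0.2 * (1.0 if sensed_vacant else 0.0)
    a_next = epsilon_greedy(belief_next)
    # Sarsa temporal-difference update with linear function approximation.
    td = r + GAMMA * q(belief_next, a_next) - q(belief, a)
    w[a] += ALPHA * td * phi(belief)
    belief, a = belief_next, a_next
```

In a multiagent setting, each SU would run such an update locally while exchanging decisions and sensed-band information with its neighbors, as described in the paper.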
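For the case of unknown PU dynamics, the transition probabilities can be estimated online from observed state decisions. The following sketch shows one simple empirical (Laplace-smoothed) estimator together with a one-step belief prediction using the estimated matrix; the two-state model, smoothing prior, and simulated chain are illustrative assumptions rather than the paper's exact estimation algorithm.

```python
import numpy as np

# Markov PU occupancy model: state 0 = vacant, 1 = occupied.
# Laplace-smoothed transition counts (the unit prior is an assumption).
counts = np.ones((2, 2))

def observe_transition(prev_state, cur_state):
    counts[prev_state, cur_state] += 1

def transition_matrix():
    # Row-normalize counts into estimated transition probabilities.
    return counts / counts.sum(axis=1, keepdims=True)

def predict_belief(p_occupied):
    # Propagate the occupancy belief one step ahead between sensing instants.
    P = transition_matrix()
    return (1 - p_occupied) * P[0, 1] + p_occupied * P[1, 1]

# Feed a simulated two-state Markov chain to the estimator.
rng = np.random.default_rng(1)
true_P = np.array([[0.9, 0.1],
                   [0.3, 0.7]])
s = 0
for _ in range(10000):
    s_next = int(rng.choice(2, p=true_P[s]))
    observe_transition(s, s_next)
    s = s_next
```

With enough observed transitions the estimate concentrates around the true matrix, after which the belief prediction behaves as in the known-probabilities case.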
Jarmo Lundén (S’05–M’10) received the M.Sc. (Tech.) degree with distinction in communications engineering and the D.Sc. (Tech.) degree with distinction in signal processing for communications from the Helsinki University of Technology, Espoo, Finland, in 2005 and 2009, respectively. Since 2009 he has been a Postdoctoral Researcher at Aalto University, Finland. Between August 2010 and August 2011 he was also a Visiting Postdoctoral Research Associate at Princeton University, Princeton, NJ, USA. His research interests include statistical signal processing and machine learning, and their applications in cognitive radio systems and smart grids.

Visa Koivunen (S’87–M’93–SM’98–F’11) received the D.Sc. (EE) degree (with honors) from the Department of Electrical Engineering, University of Oulu, Finland. From 1992 to 1995, he was a visiting researcher at the University of Pennsylvania, Philadelphia. Since 1999, he has been a Professor of Signal Processing at Aalto University (formerly known as the Helsinki University of Technology), Finland, and since 2009 he has been an Academy Professor at Aalto University. He is one of the Principal Investigators in the SMARAD (Smart Radios and Wireless Systems) Center of Excellence nominated by the Academy of Finland. He was also an Adjunct Full Professor at the University of Pennsylvania, Philadelphia. During his sabbatical leave in 2006–2007, he was a Visiting Fellow at Nokia Research Center as well as at Princeton University, and he makes frequent research visits to Princeton University.

His research interests include statistical, communications, radar, and sensor array signal processing. He has published more than 350 papers in international scientific conferences and journals. Dr. Koivunen received the Primus Doctor (best graduate) Award among the doctoral graduates in the years 1989 to 1994. He is a member of Eta Kappa Nu. He co-authored the papers receiving the Best Paper Award at IEEE PIMRC 2005, EUSIPCO 2006, and EuCAP 2006, and he received the IEEE Signal Processing Society Best Paper Award for 2007 (co-authored with J. Eriksson). He is a member of the editorial board of the IEEE Signal Processing Magazine and of the IEEE Sensor Array and Multichannel Signal Processing Technical Committee, and he serves on the industrial liaison board of the IEEE Signal Processing Society. He was the general chair of the IEEE SPAWC (Signal Processing Advances in Wireless Communications) 2007 conference in Helsinki, June 2007.

Sanjeev R. Kulkarni (M’91–SM’96–F’04) received the B.S. in mathematics, the B.S. in E.E., and the M.S. in mathematics from Clarkson University in 1983, 1984, and 1985, respectively; the M.S. degree in E.E. from Stanford University in 1985; and the Ph.D. in E.E. from M.I.T. in 1991. From 1985 to 1991 he was a Member of the Technical Staff at M.I.T. Lincoln Laboratory, working on the modeling and processing of laser radar measurements. In the spring of 1986, he was a part-time faculty member at the University of Massachusetts, Boston. Since 1991 he has been with Princeton University, where he is currently Professor of Electrical Engineering, Director of the Keller Center, and an affiliated faculty member in the Department of Operations Research and Financial Engineering and the Department of Philosophy. He spent January 1996 as a research fellow at the Australian National University, 1998 with the Susquehanna International Group, and summer 2001 with Flarion Technologies. Prof. Kulkarni received an ARO Young Investigator Award in 1992, an NSF Young Investigator Award in 1994, and several teaching awards at Princeton. He has served as an Associate Editor for the IEEE TRANSACTIONS ON INFORMATION THEORY.

Prof. Kulkarni’s research interests include statistical pattern recognition, nonparametric estimation, learning and adaptive systems, information theory, wireless networks, and image/video processing.

H. Vincent Poor (S’72–M’77–SM’82–F’87) received the Ph.D. degree in EECS from Princeton University in 1977. From 1977 until 1990, he was on the faculty of the University of Illinois at Urbana-Champaign. Since 1990 he has been on the faculty at Princeton, where he is the Michael Henry Strater University Professor of Electrical Engineering and Dean of the School of Engineering and Applied Science. Dr. Poor’s research interests are in the areas of stochastic analysis, statistical signal processing, and information theory, and their applications in wireless networks and related fields such as social networks and smart grid. His publications in these areas include the recent books Smart Grid Communications and Networking (Cambridge, 2012) and Principles of Cognitive Radio (Cambridge, 2013).

Dr. Poor is a member of the National Academy of Engineering and the National Academy of Sciences, a Fellow of the American Academy of Arts and Sciences, an International Fellow of the Royal Academy of Engineering (U.K.), and a Corresponding Fellow of the Royal Society of Edinburgh. He is also a Fellow of the Institute of Mathematical Statistics, the Acoustical Society of America, and other organizations. He received the Technical Achievement and Society Awards of the IEEE Signal Processing Society in 2007 and 2011, respectively. Recent recognition of his work includes the 2010 IET Ambrose Fleming Medal, the 2011 IEEE Eric E. Sumner Award, and honorary doctorates from Aalborg University, the Hong Kong University of Science and Technology, and the University of Edinburgh.