Primary User Emulation Attacks
Abstract—The defense against the Primary User Emulation Attack (PUE) is studied in the scenario of unknown channel statistics (coined blind dogfight in spectrum). The algorithm of the adversarial bandit problem is adapted to the context of blind dogfight. Both cases of complete and partial information about the rewards of different channels are analyzed. Performance bounds are obtained subject to arbitrary channel statistics and attack policy. Several attack strategies, namely the uniformly random, selectively random and maximal interception attacks, are discussed. The validity of the defense strategy is then demonstrated by numerical simulation results.

Index Terms—Cognitive radio, primary user emulation attack, adversarial bandit algorithm.

Fig. 1. An illustration of the blind dogfight in spectrum. (The figure shows a secondary user running adversarial bandit learning over unknown channels 1–3 against a PUE attacker employing the uniformly random, selectively random and maximal interception attacks.)
I. INTRODUCTION

channel statistics, e.g., the channel idle (or busy) probabilities. It does not assume a time-invariant structure of the channel reward, thus being suitable for the scenario of arbitrary channel statistics and attack policy. Compared with the original studies on adversarial bandit algorithms [2] [3], this paper also considers the multiple player case, in which collisions could occur. Compared with other studies on PUE [4] [6] [8], the algorithm in this paper adopts a passive approach and does not require collaboration among secondary users.

The remainder of this paper is organized as follows. The system model is introduced in Section II. The cases of full and partial information on the reward are discussed in Sections III and IV, respectively. Numerical results are provided in Section V. Conclusions are drawn in Section VI.

II. SYSTEM MODEL

We consider a cognitive radio system with N secondary users and M licensed channels. At the beginning of each time slot, each secondary user can sense only one channel due to its limited sampling capability¹. For simplicity, we do not consider spectrum sensing errors. When a channel is found not to be occupied by primary users, it can be used by the secondary user for data transmission in the remainder of the time slot. For the availability of spectrum information, we consider the following two cases:

∙ Full information: Although a secondary user can sense only one channel during its spectrum sensing, we consider the full information case, in which the secondary user knows the states of all channels at the end of each time slot, since the secondary user can continue to sense during the data communication period².

∙ Partial information: In this case, a secondary user knows only the state of one channel at the end of each time slot. The states of all other channels need to be predicted.

Each licensed channel is modeled as a random process, which equals 1 when the channel is not occupied by primary users and equals 0 when primary users are present. We do not specify the detailed distribution of the spectrum occupancies over different time slots and different channels. The spectrum occupancies could be correlated in time or in spectrum; the performance analysis does not depend on the distribution. Spectrum occupancy data obtained from real measurements will be used in the simulation. We denote by μ_m the probability that channel m is not occupied by primary users. These probabilities are unknown to the secondary users. For each spectrum sensing, a secondary user receives reward 1 if the sensed channel is idle and there are no other secondary users competing for this channel, and 0 otherwise. We denote by r_ij(t) the channel availability (1: available; 0: unavailable)³ for secondary user i over channel j at time slot t. In the full information case, for any channel j, r_ij(t) is known to secondary user i at the beginning of time slot t+1, due to the assumption that a secondary user knows all channel occupancies by the end of each time slot. In the partial information case, secondary user i knows r_ij(t) only when it sensed channel j at time slot t.

We assume that there exists an attacker. In each time slot, it chooses one channel and sends out the PUE attack signal during the spectrum sensing period. It does not attack during the data transmission period, since that would require much higher power to suppress the secondary user's signal. A secondary user is unable to distinguish the signals of the attacker and a primary user. Therefore, the secondary user identifies the attacker's signal as a normal primary user signal and cannot access the channel if it happens to choose the channel that the attacker is jamming, even if the channel is actually not occupied by primary users. Note that the secondary user does not cease transmitting during the data transmission period even if a primary user emerges or the attacker begins to jam during this period. We do not specify the attacker's strategy, which is also unknown to the secondary users.

¹It is easy to extend to the case in which a secondary user can sense multiple channels. We only need to change the action space to the set of channels that the secondary user senses.
²We assume that secondary users can well distinguish the signals of primary users and secondary users.
³Note that r_ij(t) represents the potential reward at channel j for secondary user i at time slot t. It is independent of the spectrum sensing and is not the actual reward of the secondary user.

III. BLIND DOGFIGHT WITH FULL INFORMATION

In this section, we consider the case of full information, i.e., a secondary user knows the rewards of all channels at the end of each time slot. This is reasonable if the data transmission period is much longer than the spectrum sensing period. During the data transmission period, the secondary user can switch to different channels for sensing, provided that it can distinguish the signals of primary users and secondary users. We will consider both single defender and multiple defender cases.

A. Single Defender Case

When there is only one defender, we adopt a channel accessing scheme motivated by the Hedge algorithm for the adversarial bandit proposed in [7]. Since there is only one secondary user defender, we omit the indices for secondary users. In contrast to the Hedge algorithm, we adopt a forgetting factor for the estimation of the merits of different channels, while the Hedge algorithm uses the sum of rewards, which could incur an overflow problem in practical systems. As we will see, the proposed algorithm is also similar to the Q-learning algorithm in reinforcement learning [15]. After proposing the algorithm, we derive lower bounds for the performance of spectrum sensing subject to several typical PUE attack strategies.

1) Algorithm: For the defender, we set a value for each channel, denoted by Q_i for channel i, which represents the goodness of the channel and can be initialized to 0. In the t-th time slot, the values of {Q_i}_{i=1,...,M} are updated by

Q_i(t) = (1 − α(t)) Q_i(t−1) + α(t) r_i(t), ∀i,   (1)

where r_i(t) is the availability of channel i (since there is only one defender, we ignore the index of the defender) and 0 < α(t) < 1 is a forgetting factor, which could be time-varying or constant.
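As an illustrative sketch (our own, with hypothetical variable names and rewards, not the authors' code), the update in (1) is an exponential moving average whose unrolled form is the discounted reward sum that reappears in the performance bound:

```python
# Illustrative sketch (ours) of the Q-value update (1) with a constant
# forgetting factor alpha.

def update_q(q, reward, alpha):
    """One step of (1): Q(t) = (1 - alpha) * Q(t-1) + alpha * r(t)."""
    return (1.0 - alpha) * q + alpha * reward

alpha = 0.1
rewards = [1, 0, 1, 1, 0, 1]   # hypothetical availabilities r(t) of one channel

q = 0.0                        # zero initialization, as in Procedure 1
for r in rewards:
    q = update_q(q, r, alpha)

# Unrolling the recursion gives Q(t) = sum_tau alpha * (1-alpha)^(t-tau) * r(tau),
# the discounted reward sum appearing in the bound of Proposition 1.
t = len(rewards)
unrolled = sum(alpha * (1 - alpha) ** (t - 1 - i) * r
               for i, r in enumerate(rewards))
assert abs(q - unrolled) < 1e-12

# Since each r(t) is 0 or 1, Q(t) stays in [0, 1]: no overflow, unlike the
# unbounded reward sum used by the original Hedge algorithm.
assert 0.0 <= q <= 1.0
```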
276 IEEE TRANSACTIONS ON WIRELESS COMMUNICATIONS, VOL. 10, NO. 1, JANUARY 2011
Fig. 2. An illustration of the three PUE attacks for the single user case. It is assumed that there are two channels and the Q values are labeled in the corresponding squares. The probability of jamming the corresponding channel is represented by the thickness of the arrow. (Panels: uniformly random attack, selectively random attack, maximal interception attack; Q_1 = 3, Q_2 = 1.)

Then, the probability of accessing channel i is given by the Boltzmann distribution⁴ [15]:

p_i(t) = e^{Q_i(t−1)/T} / ∑_{j=1}^{M} e^{Q_j(t−1)/T},   (2)

where T is a constant called the temperature, which is used to control the balance between exploration and exploitation. The intuition of the algorithm is to choose channels having good reward histories with higher probabilities. The procedure of spectrum access is summarized in Procedure 1. Note that, in the Hedge algorithm, the sum of rewards, ∑_{τ=1}^{t} r_i(τ), is used in lieu of the Q_i(t) used in this paper. Obviously, the sum of rewards diverges to ∞ as t → ∞, while Q_i(t) is upper bounded by 1, which is more numerically stable. It is easy to verify that the computational complexity is low. The main computational cost of the algorithm is the computation of the Q-values in Line 7 and the update of the spectrum access probabilities in Line 8, which are both linear with respect to the number of channels, as well as the random number generation in Line 4, which is also linear in the number of channels and can be efficiently implemented with many random number generation algorithms.

Procedure 1 Procedure of Channel Selection with Full Information
1: Initialize all Q-values to 0.
2: Randomly choose the spectrum access probabilities.
3: for each time slot t do
4: Randomly choose one channel to carry out spectrum sensing according to the spectrum access probabilities, which can be realized by a software or hardware random number generator.
5: Data communication if the channel is idle.
6: Collect the states of all channels, {r_i(t)}_{i=1,...,M}.
7: Update the Q-values using (1).
8: Update the spectrum access probabilities using (2).
9: end for

2) Performance Analysis: We first provide a performance lower bound for arbitrary strategies of PUE attacks. Then, we analyze several typical attack strategies.

∙ Arbitrary Attack Strategy: The following proposition provides a lower bound for the sum of rewards in 𝔱 time slots when a sufficiently small constant forgetting factor is used, by comparing with the performance of the strategy insisting on only one channel in all time slots. The corresponding proof is given in Appendix A. Note that the lower bound holds for any arbitrary strategy of the PUE attack.

Proposition 1: When α_i(t) = α < T, ∀i, t, for the spectrum access algorithm in Procedure 1, we have

∑_{t=1}^{𝔱} ∑_{i=1}^{M} p_i(t) r_i(t) ≥ [ ∑_{t=1}^{𝔱} α(1 − α)^{𝔱−t} r_j(t) − ln M ] / [ α(e − 1) ],   (3)

for all j = 1, 2, ..., M.

∙ Uniformly Random Attack: When the attacker jams different channels uniformly, the availability probability of channel i is equal to (M−1)μ_i/M. When α_i(t) decreases with time and satisfies

∑_{t=1}^{∞} α_i(t) = ∞,   (4)

and

∑_{t=1}^{∞} α_i²(t) < ∞,   (5)

the procedure of updating Q_i(t) converges to the expectation of the reward of channel i in each time slot, which is equal to (M−1)μ_i/M [15]. Then, the expected reward in each time slot is given by

r̄ = ∑_{i=1}^{M} [(M−1)μ_i/M] e^{(M−1)μ_i/(MT)} / ∑_{k=1}^{M} e^{(M−1)μ_k/(MT)}.   (6)

∙ Selectively Random Attack: Since a larger Q_i(t) implies a higher probability of sensing channel i, it also means that the attacker may have more opportunity to intercept the defender over channel i. Therefore, the attacker can adopt a selectively random attack strategy by considering Q_i(t) as the metric of channel i and using the Boltzmann distribution for the channel selection probability. Then, the jamming probability over channel i is given by (the same as (2))

q_i(t) = e^{Q_i(t−1)/T} / ∑_{j=1}^{M} e^{Q_j(t−1)/T}.   (7)

Then, the expected reward of the defender in channel i is equal to (1 − q_i(t))μ_i. The stationary point of the dynamics of {Q_i(t)}_{i=1,...,M} is given by

Q_i = μ_i ( 1 − e^{Q_i/T} / ∑_{j=1}^{M} e^{Q_j/T} ).   (8)

For simplicity, we consider the special case of equal channel availabilities, i.e., μ_i = μ, ∀i. Then, Q_i = μ(M−1)/M, ∀i, is a solution to (8). Moreover, we have the following simple lemma.

Lemma 1: Q_i = μ(M−1)/M, ∀i, is the only stationary point of (8).

⁴There are usually two types of distributions for selecting the actions in machine learning, i.e., ε-greedy and the Boltzmann distribution [15]. The reason for choosing the Boltzmann distribution is its continuity, which facilitates the mathematical analysis.
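Procedure 1 and the access distribution (2) can be sketched as follows; this is an illustrative simulation of our own (the variable names, parameter values and the uniformly random opponent are assumptions, not the authors' code):

```python
import math
import random

def boltzmann(qs, temperature):
    """Access probabilities of (2): p_i proportional to exp(Q_i / T)."""
    weights = [math.exp(q / temperature) for q in qs]
    total = sum(weights)
    return [w / total for w in weights]

def run_defender(mu, steps=20000, alpha=0.02, temperature=0.05, seed=1):
    """Single defender vs. a uniformly random PUE attacker (illustrative)."""
    rng = random.Random(seed)
    m = len(mu)
    qs = [0.0] * m                       # line 1: zero-initialized Q-values
    total_reward = 0
    for _ in range(steps):
        probs = boltzmann(qs, temperature)                 # line 8, eq. (2)
        choice = rng.choices(range(m), weights=probs)[0]   # line 4
        jammed = rng.randrange(m)        # attacker jams one channel uniformly
        # full-information rewards: channel i is idle w.p. mu[i], 0 if jammed
        rewards = [0 if i == jammed else (1 if rng.random() < mu[i] else 0)
                   for i in range(m)]
        total_reward += rewards[choice]  # line 5: transmit if idle
        qs = [(1 - alpha) * q + alpha * r
              for q, r in zip(qs, rewards)]                # line 7, eq. (1)
    return total_reward / steps

avg = run_defender(mu=[0.9, 0.5, 0.2])
# The learner should concentrate on the best channel (mu = 0.9), whose
# effective availability under uniform jamming is (M-1)*mu/M = 0.6.
assert 0.45 < avg < 0.7
```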
LI and HAN: DOGFIGHT IN SPECTRUM: COMBATING PRIMARY USER EMULATION ATTACKS IN COGNITIVE RADIO SYSTEMS–PART II . . . 277
Proof: Suppose that there is another stationary point, in which there exist i ≠ j such that Q_i > Q_j. Then, we have

e^{Q_i/T} / ∑_{k=1}^{M} e^{Q_k/T} > e^{Q_j/T} / ∑_{k=1}^{M} e^{Q_k/T},   (9)

(Figure: the three PUE attacks — uniformly random, selectively random and maximal interception — for the multiple user case, with the Q values and jamming probabilities labeled in the squares.)

The analysis of the different attacking strategies is similar to that of the single defender case.
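For illustration, the three attack strategies can be sketched as follows (the function names and interface are our own assumptions, not from the original text):

```python
import math
import random

def uniform_attack(qs, temperature, rng):
    """Uniformly random attack: jam any of the M channels equally often."""
    return rng.randrange(len(qs))

def selective_attack(qs, temperature, rng):
    """Selectively random attack: jam channel i with the Boltzmann
    probability of (2), mirroring the defender's access distribution."""
    weights = [math.exp(q / temperature) for q in qs]
    return rng.choices(range(len(qs)), weights=weights)[0]

def max_interception_attack(qs, temperature, rng):
    """Maximal interception attack: always jam the channel with the largest
    Q-value, i.e. the channel the defender is most likely to sense."""
    return max(range(len(qs)), key=lambda i: qs[i])

rng = random.Random(0)
qs = [0.8, 0.3, 0.1]   # hypothetical Q-values
assert max_interception_attack(qs, 0.05, rng) == 0
# With a low temperature the selective attacker also concentrates on channel 0.
hits = sum(selective_attack(qs, 0.05, rng) == 0 for _ in range(1000))
assert hits > 900
```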
Fig. 4. Average rewards versus different temperatures with complete observation for the indoor measurements and a single secondary user. (Curves: uniformly random attack, selectively random attack, max interception attack, lower bound.)

Fig. 6. Average rewards versus different temperatures with complete observation for the indoor measurements, a single secondary user and multiple attackers. (Curves: 1, 2 and 3 attackers, uniform and max interception.)

Fig. 7. Average rewards versus different temperatures with complete observation for the indoor measurements under different primary user traffic levels. (Curves: PU traffic levels 1–3.)
attackers make decisions independently. We tested only the uniform attack case and the maximal interception attack case for the indoor situation. We observe that, as the number of attackers increases, the effect of the uniform attack is improved. An interesting observation is that the effect of the attack is almost unchanged for the maximal interception attack when the number of attackers increases from 1 to 3. The reason is that all the attackers will jam the same channel with a large probability, due to the lack of collaboration. Therefore, much of the attack effort is wasted. Hence, from the viewpoint of the attackers, collaboration is of key importance, which is beyond the scope of this paper.

In Fig. 7, we tested the impact of the primary user traffic for the indoor and complete observation case. We set three traffic levels for the primary users. In level 1, we use the original measurement data. In levels 2 and 3, we add random data traffic such that the primary user data traffic is doubled and tripled, respectively, compared with traffic level 1. We observe that the average reward is significantly decreased when the primary user data traffic is increased. This also explains why the outdoor data incurs worse performance than the indoor data, since the data traffic in the outdoor measurement set is higher.

2) Multiple User Case: Figures 8 and 9 show the average rewards in each time slot versus different temperatures for the indoor and outdoor environments, respectively, when there are five secondary users. A significant difference between the multiple user and single user cases is that the performance decreases almost monotonically with respect to the temperature, even for the maximal interception attack case. A reasonable explanation is that, when the actions of the secondary users are more distributed due to the higher temperature, the probability of collisions is also increased, thus decreasing the average reward of the secondary users. Another interesting observation is that the performance under the maximal interception attack is significantly improved in the low temperature regime, compared with the single secondary user case. A reasonable explanation is that the multiple secondary users diversify the focus of the attacker, thus averaging the damage over different secondary users. Therefore, in a multiple user cognitive radio system, we need to consider using low temperatures in the adversarial bandit algorithm. Note that this does not mean that the temperature should be as low as possible, since the simulation does not consider the speed of learning and the possible change of the environment. If the temperature is too low, the learning procedure will be very slow and cannot track the environment
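The earlier observation that non-collaborating maximal interception attackers waste much of their effort can be illustrated with a minimal sketch (the setup is ours, not the paper's):

```python
# Illustrative check (ours): attackers that independently run the maximal
# interception strategy all pick the arg-max channel, so adding attackers
# adds no newly jammed channels.

def max_interception(qs):
    """Jam the channel with the largest Q-value."""
    return max(range(len(qs)), key=lambda i: qs[i])

qs = [0.7, 0.4, 0.2]   # hypothetical Q-values observed by every attacker
jammed = {max_interception(qs) for _ in range(3)}   # three independent attackers
assert jammed == {0}   # all three jam channel 0: the extra effort is wasted
```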
Fig. 8. Average rewards versus different temperatures with complete observation for the indoor measurements and multiple secondary users. (Curves: uniformly random, selectively random and max interception attacks.)

Fig. 9. Average rewards versus different temperatures with complete observation for the outdoor measurements and multiple secondary users. (Curves: uniformly random, selectively random and max interception attacks.)

Fig. 10. Average rewards versus different temperatures with complete observation for the indoor measurements and various numbers of secondary users.

Fig. 11. Average rewards versus different temperatures with partial observation for the indoor measurements and a single secondary user. (Curves: uniformly random, selectively random and max interception attacks.)
Fig. 12. Average rewards versus different temperatures with partial observation for the outdoor measurements and a single secondary user. (Curves: uniformly random, selectively random and max interception attacks.)

Fig. 14. Average rewards versus different temperatures with partial observation for the outdoor measurements and multiple secondary users.
Fig. 13. Average rewards versus different temperatures with partial observation for the indoor measurements and multiple secondary users. (Curves: uniformly random, selectively random and max interception attacks.)

Then, we have

S_{t+1}/S_t = ∑_{m=1}^{M} e^{(1−α)Q_m(t)/T} e^{α r_m(t)/T} / S_t
 ≤ ∑_{m=1}^{M} e^{(1−α)Q_m(t)/T} ( 1 + (e−1)α r_m(t)/T ) / S_t
 = ∑_{m=1}^{M} e^{(1−α)Q_m(t)/T} / ∑_{m=1}^{M} e^{Q_m(t)/T} + (α(e−1)/T) ∑_{m=1}^{M} r_m(t) e^{Q_m(t)/T} / S_t
 ≤ 1 + (α(e−1)/T) ∑_{m=1}^{M} r_m(t) e^{Q_m(t)/T} / S_t,   (16)
where the first equality is due to the definition of S_t, the first inequality is due to the assumption α/T < 1 and the inequality (1 + x)^a ≤ 1 + ax when a < 1 (applied with x = e − 1 and a = α r_m(t)/T), the second equality is obtained by decomposing the result into two terms, and the last inequality is because

e^{(1−α)Q_m(t)/T} / e^{Q_m(t)/T} = e^{−α Q_m(t)/T} ≤ 1,   (17)

since 0 ≤ Q_m(t) ≤ 1. Then, we have

ln(S_{𝔱+1}/S_1) = ∑_{t=1}^{𝔱} ln(S_{t+1}/S_t)
 ≤ ∑_{t=1}^{𝔱} ln( 1 + (α(e−1)/T) ∑_{m=1}^{M} r_m(t) e^{Q_m(t)/T} / S_t )
 ≤ (α(e−1)/T) ∑_{t=1}^{𝔱} ∑_{m=1}^{M} r_m(t) e^{Q_m(t)/T} / S_t,

where the last inequality uses ln(1 + x) ≤ x.

VI. CONCLUSIONS

To combat the arbitrary strategies of the attacker and the unknown channel statistics, the defender(s) considers each channel as a bandit arm and uses the technique of the adversarial multi-armed bandit as its strategy. We have considered several typical types of attacks. For the single defender case, we have derived the corresponding lower bound of performance for arbitrary attack strategies. The discussion has been extended to the multiple user case. We have also discussed both cases of full and partial information. Numerical simulations have been used to demonstrate the performance of the proposed algorithm for both single defender and multiple defender cases. We have observed that the proposed maximal interception attack incurs the worst performance degradation. The numerical results also show that the performance degradation decreases with increasing temperature, except for the maximal interception attack.
APPENDIX B
PROOF OF PROP. 2

Proof: It is easy to verify that the updating rule in (1) is equivalent to solving equation (8) using the Robbins–Monro algorithm [10], i.e.,

q(t+1) = (1 − α(t))q(t) + α(t)r(t) = q(t) + α(t)Y(t),   (19)

where q = (Q_1, ..., Q_M), α(t) is the vector of all step factors, r(t) is the vector of rewards obtained at spectrum access period t and Y(t) is a random observation contaminated by noise, i.e.,

Y(t) = r(t) − q(t) = r̄(t) − q(t) + r(t) − r̄(t) = g(q(t)) + δm(t),   (20)

where g(q(t)) = r̄(t) − q(t), δm(t) = r(t) − r̄(t) is the noise and r̄(t) is the expected reward given by

(r̄(t))_i = μ_i ( 1 − e^{Q_i/T} / ∑_{j=1}^{M} e^{Q_j/T} ).   (21)

APPENDIX C
PROOF OF PROP. 3

Proof: For the lower bound (10), we have the following key observation: for m_1(t) = arg max_m Q_m(t), i.e., the channel having the highest sensing probability at time t, r_{m_1}(t) = 0, since the attacker uses the maximal interception attack strategy and always jams this channel. However, Q_{m_1}(t) ≤ Q_{m_2}(t) + α, where m_2 is the index of the channel having the second largest Q value. We can prove this by carrying out induction over t. When t = 1, Q_i(t) = 0 for all i, due to the zero initialization. Now, we assume that Q_{m_1}(t) ≤ Q_{m_2}(t) + α at time slot t. At time slot t + 1, we have

Q_{m_1}(t+1) = (1 − α)Q_{m_1}(t)
 ≤ (1 − α)(Q_{m_2}(t) + α)
 = (1 − α)Q_{m_2}(t) + α(1 − α)
 ≤ Q_{m_2}(t+1) + α,   (25)

where the first equality is due to the fact that channel m_1 will be jammed by the attacker at time slot t + 1 and the first inequality is because of the assumption Q_{m_1}(t) ≤ Q_{m_2}(t) + α. Then, we have

u_{m_1}(t) ≤ u_{m_2}(t) e^{α/T}.   (26)
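The induction invariant Q_{m_1}(t) ≤ Q_{m_2}(t) + α can also be checked by a small simulation (our own construction, assuming the 0/1 reward model of Section II):

```python
import random

# Simulation check (ours) of the invariant used in the Appendix C proof:
# under the maximal interception attack, the channel with the highest Q-value
# is always jammed (reward 0), so its Q-value never exceeds the runner-up's
# by more than alpha.

alpha = 0.1
rng = random.Random(42)
qs = [0.0, 0.0, 0.0]           # zero initialization, as in the base case
for _ in range(5000):
    jammed = max(range(3), key=lambda i: qs[i])   # m1(t): highest Q-value
    rewards = [0 if i == jammed else rng.randint(0, 1) for i in range(3)]
    qs = [(1 - alpha) * q + alpha * r for q, r in zip(qs, rewards)]  # eq. (1)
    top, runner_up = sorted(qs, reverse=True)[:2]
    assert top <= runner_up + alpha + 1e-12
```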