
Cognitive Radios and the Prediction of Unoccupied Spectrum

Regina Kelly 08310556 Supervisor: Prof. Linda Doyle

Abstract
The aim of this project is to investigate Dynamic Spectrum Access and cognitive radios. This paper investigates whether there are times when dynamic spectrum access is applicable to radio spectrum and whether there is cause to apply learning techniques to aid the recognition of spectrum greyspaces. Reinforcement learning is then discussed and C++ code is used to model these techniques. The results show that the reward received from the reinforcement learning technique increases as the complexity of the system decreases, and that it also increases with the number of channels studied.

Contents

1 Introduction
  1.1 Motivation
  1.2 Cognitive Radios
2 Determining whether to learn
3 Learning
  3.1 Reinforcement Learning
    3.1.1 Q-learning
  3.2 Method
4 Results
5 Conclusions
6 Explanation of code
  6.1 Q-learning code
  6.2 L-Z complexity code
Acknowledgements

List of Figures

1 FCC Spectrum Allocation
2 Shared Spectrum Company measurements
3 The cognitive radio cycle
4 Graphs showing difference in threshold
5 Q-learning pseudocode
6 Entropy vs Expected Reward

1 Introduction

1.1 Motivation

When asked to think of a resource, the first thing to come to one's mind would probably be energy, water or food. Rarely would the average person consider the electromagnetic spectrum to be a valuable resource, but in the current technological climate there is growing demand for this highly valuable commodity. Spectrum is often referred to as the lifeblood of wireless systems: it is a limited resource, yet it is essential for the growth and spread of new technologies. Quite evidently, any great wireless idea, no matter how innovative, could not be realised without access to radio spectrum. With recent advancements in technological devices such as smartphones and tablets, which make use of new innovations such as 3G (third generation) and 4G, more spectrum is needed to support such technologies. These have to compete with pre-existing spectrum users that consume the majority of spectrum due to a lack of foresight by regulators and outdated methods of optimizing their spectrum use.

There are two sides to this story. The first side tells how apparently overfilled the radio spectrum is, especially the bands below 1GHz, with the bands between 2GHz and 3GHz also being highly occupied. More and more companies, new and existing, want access to more of the radio spectrum, and there is a surplus of demand for space because there is only a finite supply. The most prevalent way to deal with this at present is by auction.[7] In most countries spectrum is regulated by the government, which allocates amounts of spectrum to different technologies when a band becomes free. A spectrum auction is a process whereby a government uses an auction system to sell the rights for signal transmission over specific bands of the electromagnetic spectrum, and to assign scarce spectrum resources. Depending on what type of auction is employed, the duration of a spectrum auction can vary from days to months from the opening to the final bid. Ideally, in a spectrum auction resources are allocated to the parties that value them the most. There are goals other than being the highest bidder that hopeful spectrum buyers need to achieve. The US set goals for its governing body, the FCC, when it was first launched: "In designing auctions for spectrum licenses, the FCC is required by law to meet multiple goals and not focus simply on maximizing receipts. Those goals include ensuring efficient use of the spectrum, promoting economic opportunity and competition, avoiding excessive concentration of licenses, preventing the unjust enrichment of any party, and fostering the rapid deployment of new services, as well as recovering for the public a portion of the value of the spectrum."[6]

Spectrum auctions are a step toward market-based spectrum management, and are a way for governments to allocate scarce resources. This clearly has its downfalls, as smaller companies that may have new, innovative ideas may not have the funds to acquire enough spectrum.[2] The only other real alternatives to auctions currently being implemented are administrative licensing, which may take the form of comparative hearings (otherwise referred to as beauty contests), or lotteries. The former puts an onus on the governing body to choose which new technologies will succeed and which won't. The latter, though equally fair to all, may mean substandard technologies get allocated spectrum and that the spectrum may be wasted due to under-use. As one may assume, the electromagnetic spectrum, being such a hotly contested commodity, is completely used up. This can be seen in the chart below showing the division of spectrum by the FCC between 30kHz and 300GHz.

Figure 1: FCC Spectrum Allocation

But is that really true? Though the spectrum has been divided up between the many contenders, it is not in constant use. Even most of the fiercely contested spectrum has times at which it is being used little or not at all. Measurements taken by the Shared Spectrum Company show the spectrum occupancy in Dublin in 2007. The most interesting graph in fig. 2 is the central one, which is a waterfall graph. It shows how much of the band between 216MHz and 225MHz is being used over a forty-hour period, with the colours showing how occupied the band is. It is apparent that, for a lot of the time, a good proportion of the spectrum is barely used, if it is used at all.

Figure 2: Shared Spectrum Company measurements

Considering the huge demand for spectrum, the idea that there is so much time when it is lying idle could be seen as a waste of resources. Measurements like these inspired a whole new area of research into how spectrum use could be optimized in a way that gives more access to free spectrum. The idea that was put forward was Dynamic Spectrum Access (DSA). Dynamic Spectrum Access is a technique whereby, if a licensed user or primary user is not occupying their spectrum, a secondary user can take advantage of it while it is idle and leave immediately when the primary user returns. It would do well to note that concepts like this are appearing everywhere. The book What's Mine is Yours: The Rise of Collaborative Consumption[1] describes how people are becoming more accustomed to swapping, bartering and shared-use systems. People are more inclined to collaborate in a way where interactions are mutually beneficial. Take for example the Dublin Bikes system. This is a system whereby people pay a small fee to use a bike and return it to one of the many depots around the city. In this way people can have access to a bicycle whenever they choose, they do not have to go to the expense of buying one themselves, and no one is left bereft of transport (in theory). Similarly, proper application of DSA would allow all users access to spectrum without displacing the primary user's occupation of the spectrum. DSA is a very simple concept in theory, but the practical aspect of its implementation makes it more technically difficult. As a system like this would have to be automated, the difficulty lies in making the system recognize when a channel is free or not. A system whereby the primary user would communicate with secondary users when their spectrum is not in use, so that they may take advantage of the free spectrum, would be ideal. However, the onus is not on the licensed spectrum users to make this knowledge available, and thus a method would have to be developed to circumvent the need for primary user input. For this we need a system that can learn and adapt. The method that has been devised to try to deal with this problem is the use of cognitive radios.

1.2 Cognitive Radios

The idea of cognitive radios was first officially introduced to the world in 2000 by Joseph Mitola III at a seminar in KTH Royal Institute of Technology in Stockholm, Sweden. Mitola described them as: "The point at which wireless personal digital assistants (PDAs), and the related networks are sufficiently computationally intelligent about radio resources and related computer-to-computer communications to: (a) detect user communications needs as a function of use context, and (b) to provide radio resources and wireless services most appropriate to those needs."[6] [One thing that is important to note here is that, though the idea is to try and make the move from individual ownership to sharing or dynamic models, it would not be possible to do this all at once, or perhaps ever.] The cognitive radio process can be simplified into a cycle whereby the radio observes a given channel and learns how the system behaves. After this it plans and decides when to transmit on the channel from what it has learned, and then acts on what it has decided. Though this is considered the correct method of developing a cognitive radio, it is often seen that there is not enough emphasis on the learning aspect. It is apparent that the most important approach to developing a cognitive radio is to learn the user's patterns and thus have a better understanding of the behaviour of the primary user. When this is known, one can use this knowledge to try and predict when there will be gaps or white-spaces in the spectrum and hopefully take advantage of them effectively. Cognitive radios are specifically designed for the application of dynamic spectrum access. Cognitive radios can be considered radios that think and that can adapt to different situations.

Figure 3: The cognitive radio cycle

In this way they can be tailored to exploit various opportunities in different radio spectrum environments. However, there is not one version of a cognitive radio that can deal with every situation. Take, for example, a skilled professional such as a neurosurgeon. They can use their extensive knowledge and learning to perform complicated operations on the human brain, but it is unlikely they could use this knowledge to fix the broken carburettor in their car. Similarly, a cognitive radio that takes full advantage of one spectrum opportunity may fail to reap any rewards in others. Because of this the learning aspect of cognitive radios is a valuable and vital part of the cognitive radio process. Spectrum usage varies from band to band as different spectrum owners provide different technology services. Some of these may be more popular depending on the service and the provider, and thus some bands may exhibit fewer opportunities than others. If the cognitive radio can determine patterns in the user's behaviour it can determine not only how to act but also whether to act. This is a very important point to make

as, though learning about the environment is paramount to being able to take advantage of any spectrum opportunities that may occur, it would be a waste of time and resources if it transpires that studying a particular channel gives the cognitive radio little or no reward. This could be because the channel is almost entirely active, meaning that making use of cognitive radios has given little or no information that was not already known. The same would apply to bands which are entirely inactive. Another possibility would be a channel that is almost equivalent to a random channel; this type of environment also cannot be learned or predicted effectively. From this it is clear that it is important to first establish when to learn before we start to implement learning techniques.

2 Determining whether to learn

When we talk about learning in cognitive radios we are talking about trying to understand our environment and ultimately predict the pattern of the primary user. This understanding should come from observing the primary user (PU) and building up a knowledge base of the PU's spectrum usage. In this way we can model their behaviour. To do this it is necessary to gather measurements of spectrum occupancy and develop a way of understanding these measurements. From there, methods for determining whether there are exploitable patterns in the data can be employed. This is not just limited to broad patterns, e.g. less spectrum being occupied at night than during the day; we strive to develop a method of determining micro-patterns in the spectrum, thus filling in the gaps that are available.

Though there has been talk about observing the spectrum use of a particular band, it has not been apparent so far how this could be done. The most common way spectrum use is measured is by means of an energy detector. This measures the power spectral density to see whether a given band has sufficient energy to be considered occupied. It is important to note that the energy threshold that determines whether or not the channel is occupied has a resounding effect on how the channel is to be interpreted. As there is always background noise in spectrum measurements, it would be detrimental to our cause to take a very low threshold: this would rule out possible spectrum usage opportunities that may actually be greatly fruitful. Conversely, if the energy threshold is estimated to be too high, then our signal causes interference with the signal already being transmitted by the PU, which is the opposite of what we are trying to achieve. The figure below shows how taking different threshold values frees up different amounts of spectrum. The areas in black show when the spectrum is already in use.

Figure 4: Graphs showing difference in threshold
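To make the thresholding step concrete, the sketch below shows how power measurements might be converted into the binary occupancy sequence used in the rest of the paper. This is a minimal illustration only; the function name, the sequence representation and the units of the measurements are assumptions, not taken from the project code.

```cpp
#include <vector>

// Hypothetical helper: convert power spectral density measurements into a
// binary spectrum occupancy sequence. A measurement above 'threshold' is taken
// as occupied (1); otherwise the channel is considered free (0).
std::vector<int> toOccupancySequence(const std::vector<double>& psd, double threshold)
{
    std::vector<int> occupancy;
    occupancy.reserve(psd.size());
    for (double power : psd)
        occupancy.push_back(power > threshold ? 1 : 0);
    return occupancy;
}
```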

These measurements are then converted into a spectrum occupancy sequence, which is a sequence of ones and zeros representing occupied and free channels respectively. As this paper deals entirely with simulated systems, techniques that are used in characterizing real spectrum usage are not implemented in the workings of this paper. However, for completeness, we will investigate ways in which the primary user's behaviour can be described, as it is both relevant and important to the purposes of this project. One way to estimate the PU's behaviour would be to consider the duty cycle of the band. The duty cycle (DC) is the time that an entity spends in an active state as a fraction of the total time under consideration. This gives an estimate of the probability that the channel is occupied over a certain time period. From this we can determine how often a channel is likely to be free, and thus it gives an idea of how likely it is that we will be able to find an empty channel to take advantage of. Another way of characterizing the user behaviour is by measuring the complexity of the channel. There are many different measures of complexity available to us, each having different characteristics that make them more or less useful for different problems. For the purposes of our study, the Lempel-Ziv (LZ) complexity gives us a measure that describes the structure of our spectrum occupancy sequence. This complexity measure was chosen because it can be shown that in a Markov model of a system the LZ complexity converges to the source entropy.[10] This will be important later in the paper when we talk about our learning model. Another very important reason for choosing the LZ complexity is that it is not necessary to have any information about how the data was generated.

The Lempel-Ziv complexity of a sequence was defined by Lempel and Ziv in 1976.[4] It is a technique used in many applications and scientific fields to analyse sequences and data. It has been used in spike train analysis and also in cryptography, where it was used to test the randomness of the output of a symmetric cipher. The most common use of the LZ complexity is in the LZ77 data compression algorithm. The LZ complexity is computed by parsing a string of digits (or letters, etc.) into the different patterns that occur in the string as we move from left to right through it. For example, the LZ complexity of the string K = 101101011010110 is 7, as it can be divided into seven distinct patterns by scanning left to right. These patterns are 1, 0, 11, 01, 011, 010 and 110. (Note: the LZ complexity is not restricted to binary or even numerical digits. Binary is used for simplicity and relevance.)

Though they express different attributes of a channel, with the DC showing how often a channel will be free and the LZ complexity showing how randomly that freeness occurs, it can be shown that there is a direct relationship between the duty cycle of a channel and the complexity of a channel. This is most obvious at extreme values of the duty cycle. When the DC is very high, i.e. approaching 100 percent, the channel is expressible as almost entirely ones and thus the complexity is lower, and similarly for a DC of almost 0 percent. However, between these extremes the complexity is not so easily categorised and may take on a range of values for the same DC. The application of these techniques to real data can be seen in the paper "Recognition and Informed Exploitation of Grey Spectrum Opportunities".[5]
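As a minimal sketch of how the duty cycle of an observed occupancy sequence could be computed (the function name and the sequence representation are illustrative assumptions, not taken from the project code):

```cpp
#include <vector>

// Duty cycle of a channel: the fraction of observed time slots in which the
// channel was occupied (entries equal to 1 in its occupancy sequence).
double dutyCycle(const std::vector<int>& occupancy)
{
    if (occupancy.empty())
        return 0.0;
    int occupied = 0;
    for (int slot : occupancy)
        if (slot == 1)
            ++occupied;
    return static_cast<double>(occupied) / static_cast<double>(occupancy.size());
}
```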


3 Learning

After implementing our techniques for determining which bands would be the best to try and exploit, it then becomes necessary to find a learning technique appropriate to the task of predicting spectrum occupancy. This paper proposes to study how machine learning performs in the channel selection process.

3.1 Reinforcement Learning

Reinforcement learning (RL) is a learning paradigm in which, when given a task, an agent receives a reward if it performs the task in a way that is deemed right, and receives no reward, or a punishment, if it performs the task incorrectly. In this way the agent performs the task so as to maximise its cumulative reward. (Note: it makes little difference whether the machine is punished or merely receives no reward, as long as the choice is consistent; the two are equivalent relative to each of the other channels as the number of iterations becomes large.) There are a wide variety of RL techniques, as different circumstances call for different types of learning. The RL method chosen for this paper was selected because it exhibits traits that are valuable for the purposes of exploration. The Temporal Difference (TD) learning method is a branch of RL. It is similar to a Monte Carlo method because it learns by sampling the environment according to some policy. There are two types of TD methods: on-policy and off-policy. The essential difference between them is that off-policy methods can in theory separate control from exploration, whereas on-policy methods cannot. Consequently, an agent employing an off-policy method may develop tactics that had not been apparent in the learning stages. The RL technique that will be used for the purposes of this paper is an off-policy Temporal Difference control algorithm called Q-learning.

3.1.1 Q-learning

The reasons for the use of Q-learning in this paper are varied. The main reason is that Q-learning does not require a model of the environment it is being used in.[8] It has also been proven that the learned action-value function Q directly approximates the optimal action-value function Q*, and convergence has been proven under some specific constraints.[3] Q-learning consists of an agent, a system state S and a set of actions A for each system state. Each action the system performs causes a transition to a new state, after which the agent receives a reward or punishment. In the learning process, the agent's task is to devise a policy over state-action pairs Q(S, A) whereby the reward is maximized. The update rule used for this is:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_a Q(s_{t+1}, a) - Q(s_t, a_t) \right]$$

Here $\alpha$ is the learning rate of the algorithm, which determines how important or relevant the new information gained at this step is, and $\gamma$ is the discount factor, which determines how important future rewards will be. Some pseudocode for the Q-learning algorithm (which will be implemented as C++ code in a later section) can be seen in the figure below.

Figure 5: Q-learning pseudocode

There are three common action selection policies:

ε-greedy: most of the time the action with the highest estimated reward is chosen, called the greediest action. Now and then, with very small probability, a random action is chosen that has no relation to the value estimates. This is done to make sure that every action is tried and, if iterated enough times, there will be sufficient data on each of the channels to ensure the optimal action is discovered.

ε-soft: very similar to ε-greedy. The best action is selected with probability 1 - ε and the rest of the time a random action is chosen uniformly.

softmax: one drawback of ε-greedy and ε-soft is that they select random actions uniformly, so the least favourable action is as likely to be selected as the second best. Softmax instead assigns each action a weight according to its action-value estimate, and a random action is selected according to these weights. Because of this the worst actions will probably not be selected. This is a good approach where the worst actions are very unfavourable, but it can only be used if the least favourable actions are known.

The action policy that will be used for the purposes of this paper is the ε-soft policy, as it fits our needs the most accurately.[8]
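The following is an illustrative C++ sketch of the ε-soft selection and the Q-learning update given above. The table layout, names and parameters are assumptions made for illustration, not the project's actual code (which is described in Section 6.1).

```cpp
#include <random>
#include <vector>

// Minimal Q-learning sketch: a table Q[state][action], an epsilon-soft action
// selection, and the standard update rule from the text.
struct QLearner {
    std::vector<std::vector<double>> Q; // Q[state][action]
    double alpha;    // learning rate
    double gamma;    // discount factor (set to zero later in this paper)
    double epsilon;  // exploration probability
    std::mt19937 rng{std::random_device{}()};

    QLearner(int nStates, int nActions, double a, double g, double e)
        : Q(nStates, std::vector<double>(nActions, 0.0)), alpha(a), gamma(g), epsilon(e) {}

    // Epsilon-soft: pick the greedy action with probability 1 - epsilon,
    // otherwise pick an action uniformly at random.
    int selectAction(int state) {
        std::uniform_real_distribution<double> u(0.0, 1.0);
        if (u(rng) < epsilon) {
            std::uniform_int_distribution<int> pick(0, static_cast<int>(Q[state].size()) - 1);
            return pick(rng);
        }
        int best = 0;
        for (int a = 1; a < static_cast<int>(Q[state].size()); ++a)
            if (Q[state][a] > Q[state][best]) best = a;
        return best;
    }

    // One update: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a)).
    void update(int s, int a, double reward, int sNext) {
        double maxNext = Q[sNext][0];
        for (double q : Q[sNext])
            if (q > maxNext) maxNext = q;
        Q[s][a] += alpha * (reward + gamma * maxNext - Q[s][a]);
    }
};
```

In the channel-selection setting of this paper, the state would be an index of the observed joint channel state and each action the choice of a channel to transmit on.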

3.2 Method

As this project was not conducted with access to real data, we may only simulate a system that tries to emulate how a real system with certain attributes would act. This is done by means of a Markov transition matrix, which the system uses to choose whether to stay in a given state or not. For example, with the transition matrix

$$P = \begin{pmatrix} 0.7 & 0.3 \\ 0.3 & 0.7 \end{pmatrix},$$

where rows correspond to the current state (0 then 1) and columns to the next state, a channel in state 0 has a probability of 0.3 of being 1 in the next state and a probability of 0.7 of staying 0. Similarly, a channel in state 1 has a probability of 0.7 of staying 1 and a probability of 0.3 of changing to 0. In a real situation the normalized LZ complexity of the data would be computed to analyse the performance of the Q-learning algorithm, and a programme was developed to compute the LZ complexity of the system. However, as previously stated, it has been proved [10] that the LZ complexity converges to the entropy rate, so in our case we can calculate the entropy rate instead, which makes the process more efficient. The entropy rate H(X) is given by

$$H(X) = -\sum_{i,j} \mu_i P_{ij} \log P_{ij},$$

where $\mu$ is the stationary distribution.

What we will try to measure in this paper is how much the complexity or entropy contributes to the net reward we receive after running our Q-learning algorithm. For the purposes of this test we will set the discount factor in our algorithm to zero. This is done to create a policy whereby the agent will go to the channel most likely to be unoccupied. In our model we assumed there was no cost of switching from one channel to another. This differs from an agent who needs to consider the cost of switching, as the higher likelihood of another channel being free may not offer a high enough return over the current channel to entice the agent to switch. By not including this step we also hope to improve computation speed. In a real system the cost of switching would most definitely need to be considered: only staying in a given channel for a small amount of time would not be an efficient way of transmitting a payload, and thus staying in a channel as long as possible would be the optimum goal. Another aspect of this study is to measure the impact of the number of channels we are learning from. Intuitively, one would think that increasing the number of channels would increase the reward; the purpose of this exercise is to measure to what extent the reward is increased and how this changes with varying complexity. This will be done over 1000 learning stages and 1000 trial stages.
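As a worked check (a derivation consistent with, though not stated in, the original text), note that for the symmetric two-state chain with change probability p the stationary distribution is uniform, so the entropy rate reduces to the binary entropy of p:

$$H(X) = -\bigl(p \log_2 p + (1-p)\log_2(1-p)\bigr)$$

For the change values 0.05, 0.1, 0.3 and 0.5 used in the Results section this gives approximately 0.286, 0.469, 0.881 and 1 bits per symbol, matching (up to rounding) the entropy values listed in the results table.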


4 Results

Before any results could be attained it was paramount to check that the programme actually worked as expected. This was tested by running the programme with values of the probability of change for which we should hypothetically know the return, i.e. 1 and 0. An important aspect to note for a system with a low number of channels is how it is first initialised. If the system is initialised with all channels in the same state, then for a change value of 1 the yield in reward would be 50%, and for a change value of 0 the reward would be either 0% or 100%. This is less likely to happen for higher numbers of channels but it should be taken into account. For non-uniform initialization, the yield for both should be 100%. After it was established that this was indeed the case, further testing could take place. The first case tested was a three-channel system. The programme was trialled twenty times for each change value to obtain a reasonable average. This was repeated for four separate change values: 0.05, 0.1, 0.3 and 0.5. It is clear from the results, which can be seen in the graph below, that the expected gain does in fact increase with a decrease in complexity. Though this would have seemed reasonably evident, it is interesting to note that, for the three-channel case, there is an initial increase which then seems to plateau after a certain point. The second result we wanted to check was whether the number of channels had an effect on the expected reward and, if so, by how much. The following table is the result of the learning process.


Entropy   M = 3          M = 5          M = 10
0.286     0.6863333333   0.7346666667   0.7613333333
0.468     0.6864285714   0.7121428571   0.7255714286
0.8812    0.6228333333   0.6236666667   0.623
1         0.504          0.5055         0.502

From this and the below graph we can see clearly that the number of channels used in the learning process does indeed have an impact on the expected reward. As the number of channels increases we can see that the expected reward does not taper off as quickly.

Figure 6: Entropy vs Expected Reward

We can show this makes sense by recalling what we learned previously in the paper, i.e. that the complexity of a channel and the DC of a channel are related. Since in the symmetric model used here each channel is free with probability 1/2 at any given time, the probability that at least one of m channels is free is

$$P_{free} = 1 - \left(\tfrac{1}{2}\right)^m.$$

For m = 3, $P_{free}$ = 0.875; for m = 5, $P_{free}$ = 0.96875; and for m = 10, $P_{free}$ = 0.9990. So as m goes to infinity, the probability that there will be at least one free channel goes to 1. The above graph shows that, though we do receive better rewards when we include more channels in our learning, the increment in reward diminishes. This is because, with a higher number of channels, the statistical impact of a new channel diminishes. For example, the probability that at least one channel will be free with 10 channels is, as above, 0.9990, but $P_{free}$ for m = 11 is 0.9995, an increase of only 0.0005, whereas there is a much larger increase with fewer channels. This affects the expected reward accordingly. The expected reward is lower than $P_{free}$ because the Q-learning algorithm also depends on the complexity of the channel, and, though they are related, the DC and the LZ complexity are not directly proportional, especially away from the end points. Thus, even though the probability of a free channel may be very high, the reward depends on how well the Q-learning algorithm can identify where the free channels will be, and this is determined by the complexity of the system.


5 Conclusions

As we have said before, not all spectrum opportunities are created equal, and cognitive radios will need to take this into account when choosing what methods they should apply to different opportunities. As such, the methods employed in this paper, while applicable to some situations, would not work as well in other situations where another method might be highly suited to that type of environment. Learning algorithms are based on exploration and on trying to interpret how a certain channel behaves so that this behaviour can be exploited. As a result, when the algorithm is applied to a new band the benefits of the learning will not be seen immediately. From this we can assert that the learning techniques in this paper will not be suitable for all the different applications of cognitive radio. The best employment of these techniques would be in applications that do not need immediate action. An application that could allow the cognitive radio to plan and transmit its payload whenever it knows a given band will be free would reap the most rewards from this DSA approach. For example, it would be very unlikely that one could stream a video from the internet using this approach, but downloading a file that is not needed urgently would be a good application of this style of CR.

Cognitive radio is a relatively new technology and as such it will be some time before techniques such as RL become tailored to meet all the demands of a cognitive radio. However, if or when this does happen, it seems likely that cognitive radio will become an even more important field of research. Spectrum sensing is a complicated process, and it may be that the technology needed to sense radio signals accurately will not be commonly available for quite some time. This may mean that, though the field of CR grows and improves, there still may not be the technology available for proper implementation. There have been studies of dynamic spectrum access that indicate there are very few patterns in spectrum use and that cognitive radio may not be able to predict spectrum accurately enough to employ DSA techniques.[9] However, these measurements were done under very fixed conditions and did not take into consideration what would happen when these conditions were changed; thus there may have been times where DSA would have been possible. Also, the bands chosen for that study were bands like GSM bands that have a very high complexity, so pattern prediction would have been quite difficult.

In regard to this paper, there are probably some things that could have been done differently or experimented with. It may have been advantageous to consider the Q-learning model with a non-zero discount factor to compare with the results obtained from the simplified version, but for the purposes of the paper it was not necessary. Another possible change that could have been made to the Q-learning algorithm, which was not considered until too late, is the fact that the algorithm was being made to choose a channel every time. This raises the question: if it were given the option of not having to make a choice, how much higher would the reward have been? It may also have been interesting to try different reinforcement learning techniques and see how they compared with each other. It has been shown in other papers [9] that other ways of modelling spectrum bands can prove more effective than Markov modelling; for example, geometric and lognormal distributions can model the primary user's activity accurately. However, these are not as easy to implement, nor do they have properties as useful as the Markov model, and thus they would not have been as useful for the purposes of this paper.

Overall, the aims that were set out at the beginning of the paper were met. A greater understanding of the concepts of cognitive radio was garnered in the process of completing this paper. We have seen how the LZ complexity can characterize the PU's behaviour, and also how the cognitive radio can benefit from this knowledge using learning-enhanced dynamic spectrum access. We have also seen, by the use of our Q-learning algorithm, that the number of channels monitored has an effect on how well the algorithm can perform.


6 Explanation of code

It should be noted that a downloadable version of the actual code can be found at www.maths.tcd.ie/~regimoto/fyp.cpp. The following section is an overview of the code for those who do not wish to read through the actual code but would like to get a general idea of how the programme was written.

6.1 Q-learning code

A lot of time was spent on this part of the project. This was in part due to the difficulty of trying to convert the idea of learning into code, as the concept of the project is itself very memory intensive. The first thing to be considered in writing this particular programme was the general layout of the code: if done incorrectly it could make the programme more difficult to change later. Most aspects of the programme are separated into different functions, both to keep them distinct and so that the layout of the programme can be changed more easily if the need arises. The system that models the network is initialized first. As the algorithm will not need to look too far into future iterations, only two states of the network are stored at a given time: the current state and the next state. The first state is initialised randomly, with each channel equally likely to be free or occupied. From this the system is put through a Markov process in which a random number is generated and, if the number is below a given threshold, the channel changes state, i.e. goes from free to occupied or vice versa. All of the system states are output to a file so that they can be checked if need be. Another thing that we have to initialise is Q(s,a). This is done by setting the first column to an index which can be referenced from other parts of the programme. All other entries of the Q matrix are initialized to zero.

The Q-learning part of the programme references other functions which should be explained. The first of these is the binary function. This is a function which reads in the system state, which is essentially a binary number, and converts it to a decimal number. This is done from left to right, as opposed to the way binary numbers are usually read; however, the only reason for using this function is to have a compact set of distinct numbers, which this provides, and there is no need for the numbers to be in order so to speak. The second function is the action function. This function chooses what action the system is to perform based on an action selection policy. The action selection policy used was the ε-soft policy with an iteratively decreasing epsilon. This is done by, again, randomly generating a number: if that number is greater than ε the action with the maximum value in the Q-table is taken, otherwise an action is chosen at random with equal probability. After this the value of the action at the next step is checked, and if the action selection was correct, i.e. $s_{i+1} = 0$, then the function is rewarded in accordance with the Q-learning algorithm given in the previous section, remembering that the discount factor γ is taken to be zero. If the action selection is not correct, the function receives a punishment. After this the system is advanced by the Markov process, and this is iterated until ε reaches zero. Another function, standardize, is then called, which standardizes the Q-table so that it only includes the action with the highest Q-value; this is then stored in a similar table F. That completes a learning step, and for the purposes of this programme 1000 of these steps are performed.
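As a minimal sketch of the two helpers just described (names and signatures are illustrative assumptions, not the project's actual identifiers), the state indexing and the Markov step could look like this:

```cpp
#include <random>
#include <vector>

// Indexing helper: the channel states (0 = free, 1 = occupied) are read left
// to right and packed into a single integer, so that each joint system state
// maps to its own row of the Q-table. "Left to right" here means the leftmost
// channel carries the lowest weight, as described in the text.
int stateIndex(const std::vector<int>& channels)
{
    int index = 0;
    for (std::size_t i = 0; i < channels.size(); ++i)
        index += channels[i] << i;
    return index;
}

// Markov step: each channel changes state (free <-> occupied) when a uniform
// random draw falls below the change probability p.
void markovStep(std::vector<int>& channels, double p, std::mt19937& rng)
{
    std::uniform_real_distribution<double> dist(0.0, 1.0);
    for (int& c : channels)
        if (dist(rng) < p)
            c = 1 - c;
}
```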


and the system is run using the F-table to predict what action to take for the next state. This is done k times and then a percentage reward is output based on the Ftables performance.

6.2 L-Z complexity code

Each channel is read into this programme individually as a string so that we can later take an average over the channels. This string is then passed to a function. The function pushes back the first entry of the string into a vector and then enters a loop over the string. Next there is a while loop inside which a sub-string of length j is initialized. This enters a for loop over the vector that checks whether this sub-string (initially of length 1) is equal to any of the entries already in the vector. If it is, then j is incremented and the for loop is broken. The new, longer sub-string is then tried in the loop, and this continues until the sub-string is no longer equal to any of the entries in the vector. The loop over the initial string is then incremented by j, and this repeats until the end of the string is reached. The function then returns the vector size, which is normalized in the main function. This is done for each of the channels and an average is taken. It is clear that as the string length gets very large this code becomes dramatically slower, which is the reason the entropy rate was used instead of the experimental code.
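A condensed sketch of the parsing routine just described is given below. Variable names are illustrative, and this is a simplification of the downloadable code rather than a copy of it.

```cpp
#include <string>
#include <vector>

// The occupancy sequence is scanned left to right and split into the shortest
// sub-strings not already stored in the phrase list; the number of phrases is
// the (unnormalised) LZ complexity described in the text.
std::size_t lzComplexity(const std::string& sequence)
{
    if (sequence.empty())
        return 0;

    std::vector<std::string> phrases;
    std::size_t pos = 0;
    while (pos < sequence.size()) {
        std::size_t len = 1;
        std::string candidate = sequence.substr(pos, len);
        // Grow the candidate until it differs from every stored phrase
        // (or the end of the sequence is reached).
        while (pos + len < sequence.size()) {
            bool seen = false;
            for (const std::string& phrase : phrases)
                if (phrase == candidate) { seen = true; break; }
            if (!seen)
                break;
            ++len;
            candidate = sequence.substr(pos, len);
        }
        phrases.push_back(candidate);
        pos += len;
    }
    return phrases.size();
}
```

Applied to the example string K = 101101011010110 from Section 2, this parsing yields the seven phrases 1, 0, 11, 01, 011, 010 and 110, and so returns 7.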


Acknowledgements

I would like to thank my supervisor Prof. Linda Doyle for her willingness to take on a project with a project, for introducing me to the fascinating subject of CR and for broadening my mind to more than maths. Thank you. I would also like to thank Dr. Irene Macaluso for her invaluable help and guidance in this project. Finally I would like to thank Cian Booth and Ruairi Short, classmates of mine whose error-checking abilities far exceed my own and who have saved me from several panic attacks. Thanks to them I still have a full head of hair.

References
[1] Rachel Botsman and Roo Rogers. What's Mine is Yours: The Rise of Collaborative Consumption, 2010.

[2] L. E. Doyle. The Essentials of Cognitive Radio, April 2009.

[3] Tommi Jaakkola, Michael I. Jordan, and Satinder P. Singh. On the convergence of stochastic iterative dynamic programming algorithms, 1994.

[4] A. Lempel and J. Ziv. On the complexity of finite sequences, 1976.

[5] Irene Macaluso. Recognition and informed exploitation of grey spectrum opportunities, 2011.

[6] J. Mitola III and G. Q. Maguire Jr. Cognitive radio: making software radios more personal, 1999.

[7] Gregory F. Rose and Mark Lloyd. The failure of FCC spectrum auctions, 2005-06.

[8] R. Sutton and A. Barto. Reinforcement Learning: An Introduction, 1998.

[9] M. Wellens, J. Riihijärvi, and P. Mähönen. Empirical time and frequency domain models of spectrum use, 2009.

[10] J. Ziv. Coding theorems for individual sequences, 1978.

