Fast Polar Decoders: Algorithm and Implementation


Gabi Sarkis∗ , Pascal Giard∗† , Alexander Vardy‡ , Claude Thibeault† , and Warren J. Gross∗
∗ Department of Electrical and Computer Engineering, McGill University, Montréal, Québec, Canada
Email: gabi.sarkis@mail.mcgill.ca, pascal.giard@mail.mcgill.ca, and warren.gross@mcgill.ca
† Department of Electrical Engineering, École de technologie supérieure, Montréal, Québec, Canada
Email: claude.thibeault@etsmtl.ca
‡ University of California San Diego, La Jolla, CA, USA
Email: avardy@ucsd.edu

Abstract—Polar codes provably achieve the symmetric capacity of a memoryless channel while having an explicit construction. This work aims to increase the throughput of polar decoder hardware by an order of magnitude relative to the state-of-the-art successive-cancellation decoder. We present an algorithm, architecture, and FPGA implementation of a gigabit-per-second polar decoder.

arXiv:1307.7154v1 [cs.AR] 26 Jul 2013

I. Introduction

Polar codes [1] are the first error-correcting codes to provably achieve the symmetric capacity of memoryless channels while having an explicit construction. They have two properties that are of interest to data storage systems: a very low error floor due to their large stopping distance [2], and low-complexity implementations [3]. However, polar codes have two drawbacks: their performance at short to moderate lengths is inferior to that of other codes, such as low-density parity-check (LDPC) codes; and their low-complexity decoding algorithm, successive cancellation (SC), is serial in nature, leading to low decoding throughput [3].

Multiple methods exist to improve the error-correction performance of polar codes of moderate lengths. Using list and list-CRC decoding [4] improves performance significantly. Another method is to increase the polar code length. Using a code length that corresponds to the current hard-drive block-length standard [5], as we show in this work, results in a polar decoder with lower complexity than an LDPC decoder with similar error-correction performance and the same rate. Specifically, a (32768, 27568) polar code has slightly worse error-correction performance than the (2048, 1723) LDPC code of the 10GBASE-T (802.3an) standard in the low signal-to-noise-ratio (SNR) region, but better performance at frame error rates (FER) lower than 7 × 10−7, with the high-SNR region being the more important one for storage systems. In addition, polar codes can be made to perform better than the LDPC code starting at an FER of 2 × 10−4, as shown in Section II.

Among the many throughput-improving methods proposed in the literature, simplified successive cancellation (SSC) [6] and simplified successive cancellation with maximum-likelihood nodes (ML-SSC) [7] offer the largest improvement over SC decoding. This throughput increase is achieved by utilizing the recursive nature of polar codes, where every polar code of length N is formed from two polar codes of length N/2, and decoding the constituent codes directly, without recursion, when possible. SSC decodes constituent codes of rates 0 and 1 directly, and ML-SSC additionally enables the direct decoding of smaller constituent codes.

In this work, we focus on improving the throughput of polar decoders. By building on the ideas used for SSC and ML-SSC decoding, namely decoding constituent codes without recursion, and by recognizing further classes of constituent codes that can be decoded directly, we present a polar decoder that, for a (32768, 29492) code, is 40 times faster than the state of the art in the literature [3] when implemented on the same field-programmable gate array (FPGA). Additionally, the proposed decoder is flexible and can decode any polar code up to a maximum length.

We start this paper by reviewing polar codes, their construction, and the successive-cancellation decoding algorithm in Section II. The SSC and ML-SSC decoding algorithms are reviewed in Section III. We present our improved decoding algorithm in Section IV, including new constituent code decoders. The decoder architecture is discussed in detail in sections V, VI, and VII. Implementation results, showing that the proposed decoder has lower complexity on an FPGA than the 10GBASE-T LDPC decoder with the same rate and comparable error-correction performance, are presented in Section VIII.

We focus on two codes: a (32768, 29492) code, whose rate of 0.9 makes it suitable for storage systems; and a (32768, 27568) code that is comparable in error-correction performance to the popular 10GBASE-T LDPC code and has the same rate, which enables the implementation complexity comparison in Section VIII.

II. Polar Codes

A. Construction of Polar Codes

By exploiting channel polarization, polar codes approach the symmetric capacity of a channel as the code length, N, increases. Channel polarization is the phenomenon that occurs when constructing, from a set of N independent identical channels, a set of N channels whose probability of error-free transmission approaches either 1 or 0.5 as N → ∞ [1]. The polarizing construction for N = 2 is shown in Fig. 1a, where the probability of correctly estimating bit u0 decreases, while that of bit u1 increases, compared to when the bits are transmitted without any transformation over the channel W. Channels can be combined recursively to create longer codes; Fig. 1b shows the case of N = 4. As N → ∞, the probability
of successfully estimating each bit approaches 1 (perfectly reliable) or 0.5 (completely unreliable), and the proportion of reliable bits approaches the symmetric capacity of W.

Fig. 1: Construction of polar codes of lengths 2 and 4. (a) N = 2; (b) N = 4.

To create an (N, k) polar code, N copies of the channel W are transformed using the polarizing transform, and the k most reliable bits, called the information bits, are used to send information bits, while the N − k least reliable bits, called the frozen bits, are set to 0. Determining the locations of the information and frozen bits depends on the type and conditions of W and is investigated in detail in [8]. A polar code of length N can be represented using a generator matrix, GN = FN = F2^(⊗ log2 N), where F2 = [1 0; 1 1] and ⊗ is the Kronecker power. For example, for N = 4, the generator matrix is

G4 = F2^(⊗2) =
[1 0 0 0]
[1 1 0 0]
[1 0 1 0]
[1 1 1 1].

The frozen bits are indicated either by removing the rows of the generator matrix with the corresponding indices, together with the frozen bits in u, or by setting their values to 0 in the source vector u. We chose the latter method.

Polar codes can be encoded systematically to improve bit-error rate (BER) [9]. Furthermore, systematic polar codes are a natural fit for the SSC and ML-SSC algorithms [7]. Assume a vector x′′ in which elements with indices corresponding to the frozen bits in u are at first ignored, while the others are set to the desired information-bit values. To force some elements of x′′ to zero, a matrix If is used, where If differs from an identity matrix in that the columns corresponding to frozen bits are set to 0. The vector u is calculated using u′′ = x′′ If G^(−1), and the frozen-bit locations in u′′ are then set to 0 by reapplying If to obtain u; this can be expressed as u = u′′ If. Finally, the systematic codeword is calculated as x = uG. This can be expressed as

x = x′′ If G^(−1) If G.   (1)

In [1], bit-reversed indexing is used, which changes the generator matrix by multiplying it with a bit-reversal operator B, so that

G4 = BF4 =
[1 0 0 0]
[1 0 1 0]
[1 1 0 0]
[1 1 1 1].

In this work, we use natural indexing to review and introduce algorithms for reasons of clarity. However, it was shown in [3] that bit-reversed indexing significantly reduces data-routing complexity in a hardware implementation; therefore, we used it to implement our decoder architecture.

B. Successive-Cancellation Decoding

Polar codes achieve the channel capacity asymptotically in code length when decoded using the successive-cancellation (SC) decoding algorithm, which sequentially estimates the bits ûi, where 0 ≤ i < N, using the channel output y and the previously estimated bits, û0 to ûi−1, denoted û0^(i−1), according to:

ûi = 0, if Pr[y, û0^(i−1) | ûi = 0] / Pr[y, û0^(i−1) | ûi = 1] ≥ 1;
ûi = 1, otherwise.   (2)

The recursive calculation of Pr[y, û0^(i−1) | ûi = 0] and Pr[y, û0^(i−1) | ûi = 1] was presented in [1]. Using the values v0^(N−1), shown in Fig. 1b, the likelihood values for û0 are calculated as:

Pr[y | û0 = 0] = Pr[y | v̂0 = 0] · Pr[y | v̂1 = 0] + Pr[y | v̂0 = 1] · Pr[y | v̂1 = 1]   (3)
Pr[y | û0 = 1] = Pr[y | v̂0 = 0] · Pr[y | v̂1 = 1] + Pr[y | v̂0 = 1] · Pr[y | v̂1 = 0].   (4)

The likelihoods of û1 depend on the value of û0 and are calculated as

Pr[y, û0 = 0 | û1 = 0] = Pr[y | v̂0 = 0] · Pr[y | v̂1 = 0]   (5)
Pr[y, û0 = 0 | û1 = 1] = Pr[y | v̂0 = 1] · Pr[y | v̂1 = 1]   (6)

when û0 = 0, and

Pr[y, û0 = 1 | û1 = 0] = Pr[y | v̂0 = 1] · Pr[y | v̂1 = 0]   (7)
Pr[y, û0 = 1 | û1 = 1] = Pr[y | v̂0 = 0] · Pr[y | v̂1 = 1]   (8)

when û0 = 1.

These operations are applied recursively until the likelihood values calculated from the channel output are reached.

When log-likelihood ratios (LLRs) are used instead of likelihood values, the min-sum (MS) approximation can be utilized to reduce complexity without compromising error-correction performance [3]. Under MS decoding, and using λ to denote an LLR, (3) and (4) are replaced with the f function [3]

λu0 = f(λv0, λv1) = sign(λv0) sign(λv1) min(|λv0|, |λv1|);   (9)

and (5), (6), (7), and (8) with the g function

λu1 = g(λv0, λv1, û0) = λv0 + λv1 when û0 = 0; −λv0 + λv1 when û0 = 1.   (10)

The decision rule (2) becomes

ûi = 0, if λui ≥ 0;
ûi = 1, otherwise.   (11)
Fig. 2: Error-correction performance of polar codes compared with that of an LDPC code with the same rate, in addition to the performance of a rate-0.9 polar code. BER and FER versus SNR; curves: PC(2048, 1723), List-CRC(2048, 1723), LDPC(2048, 1723), PC(32768, 27568), PC*(32768, 27568), and PC(32768, 29492).

Fig. 3: Decoder trees corresponding to the SC, SSC, and ML-SSC decoding algorithms. (a) SC; (b) SSC; (c) ML-SSC.

C. Performance of SC Decoding

Fig. 2 shows the error-correction performance of the (2048, 1723) 10GBASE-T LDPC code compared to that of polar codes of the same rate. These results were obtained for the binary-input additive white Gaussian noise (AWGN) channel with random codewords and binary phase-shift keying (BPSK) modulation. The first observation to be made is that the performance of the (2048, 1723) polar code is significantly worse than that of the LDPC code. The polar code of length 32768, labeled PC(32768, 27568), was constructed to be optimal for an SNR of 4.5 dB and performs worse than the LDPC code until the SNR reaches 4.25 dB, past which it outperforms the LDPC code with a growing gap. The last polar error-rate curve, labeled PC*(32768, 27568), combines the results of two (32768, 27568) polar codes: one is optimized for 4.25 dB and used up to that point, and the other is optimized for 4.5 dB. Due to the regular structure of polar codes, it is simple to build a decoder that can decode any polar code of a given length. Therefore, it is simpler to change polar codes in a system than it is to change LDPC codes.

From these results, it can be concluded that a (32768, 27568) polar code optimized for 4.5 dB or a higher SNR is required to outperform the (2048, 1723) LDPC code in the low error-rate region, and a combination of different polar codes can be used to outperform the LDPC code even in high error-rate regions. Even though the polar code has a longer length, its decoder still has a lower implementation complexity than the LDPC decoder, as will be shown in Section VIII.

Using list-CRC decoding [4], with a list size of 32 and a 32-bit CRC, reduces the gap with the LDPC code to the point where the two codes have similar performance, as shown in Fig. 2. However, in spite of this improvement, we do not discuss list-CRC decoding in this work as it cannot accommodate the proposed throughput-improving techniques, which are designed to provide a single estimate instead of a list of potential candidates. Further research is required to adapt some of these techniques to list decoding.

The throughput of SC decoding is limited by its serial nature: the fastest implementation currently is an ASIC decoder for a (1024, 512) polar code with an information throughput of 48.75 Mbps when running at 150 MHz [10], while the fastest decoder for a code of length 32768 is FPGA-based and has a throughput of 26 Mbps for the (32768, 27568) code [3]. This low throughput renders SC decoders impractical for most systems; however, it can be improved significantly by using the SSC or the ML-SSC decoding algorithms.

III. SSC and ML-SSC Decoding

A. Tree Structure of an SC Decoder

A polar code of length N is the concatenation of two polar codes of length N/2. Since this construction is recursive, as mentioned in Section II-A, a binary tree is a natural representation for a polar code, where each node corresponds to a constituent code. The tree representation is presented in detail in [6] and [7]. Fig. 3a shows the tree representation for an (8, 3) polar code, where the white and black leaves correspond to frozen and information bits, respectively.

A node v, corresponding to a constituent code of length Nv, receives a real-valued message vector, αv, containing the soft-valued input to the constituent polar decoder, from its parent node. It calculates the soft-valued input to its left child, αl, using (9). Once the constituent codeword estimate, βl, from the left child is ready, it is used to calculate the input to the right child, αr, according to (10). Finally, βv is calculated from βl and βr as

βv[i] = βl[i] ⊕ βr[i], when i < Nv/2;
βv[i] = βr[i − Nv/2], otherwise.   (12)

For leaf nodes, βv is 0 if the node is frozen. Otherwise, it is calculated using threshold detection, defined for an LLR-based decoder as:

βv = 0, when αv ≥ 0;
βv = 1, otherwise.   (13)

The input to the root node is the LLR values calculated from the channel output, and its output is the estimated systematic codeword.
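The traversal just described can be sketched as a short recursion over (9), (10), (12), and (13). In the following Python sketch, the half-split index convention (natural indexing) and the `frozen` set of leaf indices are our assumptions for illustration; an actual SC decoder in hardware does not recurse:

```python
def f(a, b):
    # min-sum f update (9)
    s = 1.0 if (a >= 0) == (b >= 0) else -1.0
    return s * min(abs(a), abs(b))

def g(a, b, bit):
    # g update (10), given the left-child codeword bit estimate `bit`
    return b + a if bit == 0 else b - a

def decode(alpha_v, first, frozen):
    """Return beta_v for the subtree with soft input alpha_v,
    whose leftmost leaf has index `first`."""
    n = len(alpha_v)
    if n == 1:
        # leaf node: frozen -> 0; otherwise threshold detection (13)
        return [0] if first in frozen else [0 if alpha_v[0] >= 0 else 1]
    half = n // 2
    alpha_l = [f(alpha_v[i], alpha_v[i + half]) for i in range(half)]
    beta_l = decode(alpha_l, first, frozen)
    alpha_r = [g(alpha_v[i], alpha_v[i + half], beta_l[i]) for i in range(half)]
    beta_r = decode(alpha_r, first + half, frozen)
    # combine rule (12)
    return [beta_l[i] ^ beta_r[i] for i in range(half)] + beta_r
```

For example, for a (4, 2) code with frozen = {0, 1} and channel LLRs (−4, 4, −4, 4), the recursion returns the codeword estimate [1, 0, 1, 0]. Pruning this recursion at rate-0 and rate-1 subtrees yields the SSC decoder of Fig. 3b.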
B. SSC and ML-SSC Decoder Trees

In [6], it was observed that a tree with only frozen leaf nodes, rooted in a node N0, does not need to be traversed, as its output will always be the zero vector. Similarly, it was shown that the output of a tree with only information leaf nodes, rooted in N1, can be obtained directly by performing threshold detection (13) on the soft-information vector α, without any additional calculations. Therefore, the decoder tree can be pruned, reducing the number of node visitations and the latency. The remaining nodes, denoted NR as they correspond to codes of rate 0 < R < 1, perform their calculations as in the SC decoder. The pruned tree for an SSC decoder is shown in Fig. 3b and requires nine time steps, for a 36% reduction in latency compared with the 14 steps required to traverse the SC tree in Fig. 3a.

ML-SSC further prunes the decoder tree by using exhaustive-search maximum-likelihood (ML) decoding to decode any constituent code, C, while meeting resource constraints. Therefore, an ML node, denoted NML, is used when Nv ≤ Nvmax, kv ≤ kvmax, and (2^kv + 1)(Nv − 1) ≤ P, where Nv is the length of the constituent code, kv its dimension, and P is a constraint on the number of computational elements available [7]. The ML output is calculated according to

βv = argmax_{x ∈ C} Σ_{i=0}^{Nv−1} (1 − 2x[i]) αv[i].   (14)

The (8, 3) polar decoder utilizing NML nodes, whose tree is shown in Fig. 3c, where NML is indicated with a striped pattern, requires 7 time steps to estimate a codeword.

C. Performance

In [7], it was shown that, under resource constraints, the information throughput of SSC and ML-SSC decoding increases faster than linearly as the code rate increases, and approximately logarithmically as the code length increases. For example, for a rate-0.9 polar code of length 32768, which is optimized for SNR = 3.47 dB, the information throughput of SC decoding is ∼0.45 info. bit/s/Hz, while that of ML-SSC decoding is 9.1 info. bit/s/Hz, an improvement by a factor of 20. The throughput of SSC and ML-SSC is affected by the code structure. For example, optimizing the rate-0.9, length-32768 polar code for an SNR of 5.0 dB instead of 3.47 dB reduces the throughput to 5.2 info. bit/s/Hz. While this is a significant reduction, the decoder remains 11 times faster than an SC decoder.

The error-correction performance of polar codes is not tangibly altered by the use of the SSC or ML-SSC decoding algorithms [7].

D. Systematic Encoding and Bit-Reversal

In [9], it was stated that systematic encoding and bit-reversed indexing can be combined. In this section, we prove that the information bits can be presented at the output of the decoder in the order in which they were presented by the source, without the use of interleavers. This is of importance to the SSC decoding algorithm, as it presents its output in parallel and would otherwise require an N-bit parallel interleaver of significant complexity. The problem is compounded in a resource-constrained SSC decoder that presents its output one word at a time, which must be stored in memory: since two consecutive information bits might not be in the same memory word, each memory word would be visited multiple times, significantly increasing decoding latency.

A vector x′′ is prepared as described earlier, i.e. the source data a is placed sequentially, using natural indexing, in the locations of x′′ that do not correspond to frozen bits. Then If is applied so that x′′_i = 0 when ui is frozen. Finally, x′′ is encoded similarly to (1) using

x = (B(x′′ If)^T)^T G^(−1) (B If) G,

where B is the bit-reversal operator. Substituting G = BF yields

x = (B(x′′ If)^T)^T (BF)^(−1) (B If)(BF)
  = (x′′ If) B^T (BF)^(−1) B If B F.   (15)

As B is a permutation matrix, it is orthogonal and thus

B B^T = I ⇒ B^T = B^(−1).

Moreover, B is a symmetric matrix, so it follows that

B = B^T ⇒ B = B^(−1).

Using the above in (15), and considering that F^(−1) = F [9], reveals

x = (x′′ If) B^T (BF)^(−1) B If B F
  = x′′ If B (BF)^(−1) B If B F
  = x′′ If B F^(−1) B^(−1) B If B F
  = x′′ If B F^(−1) (B B) If B F
  = x′′ If B F I If B F
  = x′′ (If B) F (If B) F.   (16)

As shown by (16), an encoder can support both natural and bit-reversed indexing; it is the position of the frozen bits that must be modified to accommodate bit-reversal. Furthermore, (16) implies that the information bits are placed in natural order and received in that order, eliminating the need for an interleaver.

To illustrate this encoding method, Fig. 4 shows the encoder graph for an (8, 5) polar code with bit-reversal. (x′′0, x′′2, x′′4) are set to 0 since (u0, u2, u1) are frozen, while (x′′1, x′′3, x′′5, x′′6, x′′7) are set to the source bits (a0, a1, a2, a3, a4). x′′ is encoded to yield the codeword x, which is transmitted over the channel sequentially, i.e. x0 then x1 and so on. An SSC decoder with P = 2 outputs (x̂0, x̂1, x̂2, x̂3) and then (x̂4, x̂5, x̂6, x̂7), i.e. the output of the decoder is (x̂0, â0, x̂2, â1) and then (x̂4, â2, â3, â4), where the source-data estimate appears in the correct order.

IV. Proposed Algorithm

In this section, we explore more constituent codes that can be decoded directly and present the associated specialized decoding algorithms. We present three new corresponding node types: a single-parity-check-code node, a repetition-code node, and a special node whose left child corresponds to a
repetition code and its right to a single-parity-check code. We also present node mergers that reduce decoder latency and summarize all the functions the new decoder must be able to perform. Finally, we study the effect of quantization on the error-correction performance of the proposed algorithm.

Fig. 4: Systematic encoding with bit-reversal in SSC decoding.

It should be noted that all the transformations and mergers presented in this work preserve the polar code, i.e. they do not alter the locations of frozen and information bits. While some throughput improvement is possible via some code modifications, the resulting polar code diverges from the optimal one constructed according to [8].

To keep the results in this section practical, we use P as a resource-constraint parameter, similar to [3]. However, since new node types are introduced, the notion of a processing element (PE) might not apply in certain cases. Therefore, we redefine P so that 2P is the maximum number of memory elements that can be accessed simultaneously. Since each PE has two inputs, P PEs require 2P input values, and the two definitions of P are compatible.

A. Single-Parity-Check Nodes NSPC

In any polar code of rate (N − 1)/N, the frozen bit is always u0, rendering the code a single-parity-check (SPC) code, which can be observed in Fig. 1b. While the dimension of an SPC code is N − 1, making exhaustive-search ML decoding impractical, ML decoding can still be performed with very low complexity: the hard-decision estimate and the parity of the input are calculated; then the estimate of the least reliable bit is flipped if the parity constraint is not satisfied. The hard-decision estimate of the soft-input values is calculated using

HD[i] = 0, when αv[i] ≥ 0;
HD[i] = 1, otherwise.

The parity of the input is calculated as

parity = ⊕_{i=0}^{Nv−1} HD[i].   (17)

The index of the least reliable input is found using

j = argmin_i |αv[i]|.

Finally, the output of the node is

βv[i] = HD[i] ⊕ parity, when i = j;
βv[i] = HD[i], otherwise.   (18)

TABLE I: Number of all nodes and of SPC nodes of different sizes in three polar codes of length 32768 and rates 0.9, 0.8413, and 0.5.

Code             All     SPC, Nv ∈ (0, 8]   (8, 64]   (64, 256]   (256, 32768]
(32768, 29492)   2065    383                91        17          13
(32768, 27568)   3421    759                190       43          10
(32768, 16384)   9593    2240               274       19          1

TABLE II: Number of all nodes and of repetition nodes of different sizes in three polar codes of length 32768 and rates 0.9, 0.8413, and 0.5.

Code             All     Repetition, Nv ∈ (0, 8]   (8, 16]   (16, 32768]
(32768, 29492)   3111    474                       30        0
(32768, 27568)   5501    949                       53        0
(32768, 16384)   10381   2290                      244       0

The resulting node can decode an SPC code of length Nv > 2P in Nv/(2P) + c steps, where c ≥ 1, since at least one step is required to correct the least reliable estimate and others might be used for pipelining; whereas an SSC decoder requires 2 Σ_{i=1}^{log2 Nv} ⌈2^i/(2P)⌉ steps. For example, for an SPC constituent code of length 4096, P = 256, and c = 4, the specialized SPC decoder requires 12 steps, whereas the SSC decoder requires 46 steps. For constituent codes of length ≤ 2P, the decoder can provide an output immediately, or after a constant number of time steps if pipelining is used.

Since large SPC constituent codes are prevalent in high-rate polar codes, a significant reduction in latency can be achieved if they can be decoded quickly. Table I lists the number of SPC nodes, binned by size, in three different polar codes: (32768, 29492), (32768, 27568), and a lower-rate (32768, 16384), all optimized for σ = 0.44. Comparing the results for the three codes, we observe that the total number of nodes decreases as the rate increases. The distribution of SPC nodes by length is also affected by the code rate: the proportion of large SPC nodes decreases as the rate decreases.

It was observed in [11] that decoding chains of SPC codes, called caterpillar nodes, can be accelerated, yielding a latency reduction of 11-14% compared with SSC when decoding polar codes transmitted over the binary erasure channel (BEC) without resource constraints.

B. Repetition Nodes NREP

Another type of constituent code that can be decoded more efficiently than through tree traversal is the repetition code, in which only the last bit is not frozen. The decoding algorithm starts by summing all input values. Threshold detection is performed via sign detection, and the result is replicated and used as the constituent decoder's final output:

βv[i] = 0, when Σ_j αv[j] ≥ 0;
βv[i] = 1, otherwise.   (19)

The decoding method (19) requires Nv/(2P) steps to calculate the sum and Nv/(2P) steps to set the output, in addition
to any extra steps required for pipelining. Two other methods employing prediction can be used to decrease latency. The first sets all output bits to 0 while accumulating the inputs, and writes the output again only if the sign of the sum is negative. The average latency of this method is 75% that of (19). The second method sets half the output words to all 0 and the other half to all 1, and corrects the appropriate words once the sum is known. The resulting latency is 75% that of (19). However, since the high-rate codes of interest do not have any large repetition constituent codes, we chose to use (19) directly.

Unlike SPC constituent codes, repetition codes are more prevalent in lower-rate polar codes, as shown in Table II. Moreover, for high-rate codes, SPC nodes have a more pronounced impact on latency reduction. This can be observed in tables I and II, which show that the total number of nodes in the decoder tree is significantly smaller when only SPC nodes are introduced than when only repetition nodes are introduced, indicating a smaller tree and lower latency. Yet, the impact of repetition nodes on latency is measurable; therefore, we use them in the decoder.

C. Repetition-SPC Nodes NREP-SPC

When enumerating constituent codes with Nv ≤ 8 and 0 < kv < 8 for the (32768, 27568) and (32768, 29492) codes, three codes dominated the listing: the SPC code, the repetition code, and a special code whose left constituent code is a repetition code and whose right is an SPC code, denoted NREP-SPC. The other constituent codes accounted for 6% and 12% in the two polar codes, respectively. Since NREP-SPC codes account for 28% and 25% of the total NR nodes of length 8 in the two aforementioned codes, efficiently decoding them would have a significant impact on latency. This can be achieved by using two SPC decoders of length 4, SPC0 and SPC1, whose inputs are calculated assuming the output of the repetition code is 0 and 1, respectively. Simultaneously, the repetition code is decoded, and its output is used to generate the NREP-SPC output using either the output of SPC0 or that of SPC1, as appropriate. While this code can be decoded using an exhaustive-search ML decoder, the proposed decoder has a significantly lower complexity.

D. Node Mergers

The NREP-SPC node merges an NREP and an NSPC node to reduce latency. Similarly, it was mentioned in [7] that NR nodes need not calculate the input to a child node if it is an N0 node; instead, the input to the right child is calculated directly. Another opportunity for a node merger arises when a node's right child directly provides βr without tree traversal: the calculation of αr, βr, and βv can all be performed in one step, halving the latency. This is also applicable to nodes where Nv > 2P: P values of αr are calculated and used to calculate P values of βr, which are then used to calculate 2P values of βv, until all values have been calculated.

This can be expanded further when the left node is N0. Since βl is known a priori to be the zero vector, αr can be calculated immediately once αv is available, and βr is combined with the zero vector to obtain βv.

In all the codes that were studied, NR, N1, and NSPC were the only nodes observed as right children; and N1 and NSPC are the only two that can be merged with their parent.

TABLE III: A listing of the different functions performed by the proposed decoder.

Name         Description
F            calculate αl (9).
G            calculate αr (10).
COMBINE      combine βl and βr (12).
COMBINE-0R   same as COMBINE, but with βl = 0.
G-0R         same as G, but assuming βl = 0.
P-R1         calculate βv using (10), (13), then (12).
P-RSPC       calculate βv using (10), (18), then (12).
P-01         same as P-R1, but assuming βl = 0.
P-0SPC       same as P-RSPC, but assuming βl = 0.
ML           calculate βv using (14).
REP          calculate βv using (19).
REP-SPC      calculate βv as in Section IV-C.

E. Required Decoder Functions

As a result of the many types of nodes and the different mergers, the decoder must perform many functions. Table III lists these 12 functions. For notation, 0, 1, and R are used to denote children with constituent-code rates of 0, 1, and R, respectively. Having a left child of rate 0 allows the calculation of αr directly from αv, as explained earlier. It is important to make this distinction since the all-zero βl of a rate-0 code is not stored in the decoder memory. In addition, having a right child of rate 1 allows the calculation of βv directly once βl is known. A P- prefix indicates that the message to the parent, βv, is calculated without explicitly visiting the right child node.

We note the absence of N0 and N1 node functions: the former is due to directly calculating αr, and the latter to directly calculating βv from αr.

F. Performance with Quantization

Fig. 5 shows the effect of quantization on the (32768, 27568) polar code that was optimized for SNR = 4.5 dB. The quantization numbers are presented in (W, Wc, F) format, where W is the total number of quantization bits for internal LLRs, Wc for channel LLRs, and F is the number of fractional bits. Since the proposed algorithm does not perform any operations that increase the number of fractional bits, only the integer ones, we use the same number of fractional bits for both internal and channel LLRs.

From the figure, it can be observed that using the (7, 5, 1) quantization scheme yields performance extremely close to that of the floating-point decoder. Decreasing the range of the channel values to three bits by using the (7, 4, 1) scheme significantly degrades performance, while completely removing the fractional bits, (6, 4, 0), yields performance that remains within 0.1 dB of the floating-point decoder throughout the entire SNR range. This indicates that the decoder needs four bits of range for the channel LLRs. Keeping the channel LLR quantization
the same, but reducing the range of the internal LLRs by one bit and using the (6, 5, 1) quantization, does not affect the error-correction performance for SNR < 4.25 dB. After that point, however, the performance starts to diverge from that of the floating-point decoder. Therefore, the range of the internal LLR values increases in importance as the SNR increases.

Fig. 5: Effect of quantization on the error-correction performance of the (32768, 27568) code. BER and FER versus SNR; curves: (6, 5, 1), (6, 4, 0), (7, 5, 1), (7, 4, 1), and floating-point.

From these results, we conclude that the minimum number of integer quantization bits required is six for the internal LLRs and four for the channel ones, and that fractional bits have a small effect on the performance of the studied polar codes. The (6, 4, 0) scheme offers lower memory use for a small reduction in performance and would be the recommended scheme for a practical decoder for high-rate codes. For the rest of this work, we use both the (6, 4, 0) and (7, 5, 1) schemes to illustrate the performance-complexity trade-off between them.

V. Architecture: Top-Level

As mentioned earlier, Table III lists the 12 functions performed by the decoder. Deducing which function to perform online would require complicated controller logic. Therefore, we decided to provide the decoder with an offline-calculated list of functions to perform. This does not reduce the decoder's flexibility with respect to decoding different codes, as the structure of the code is loaded into the decoder in both cases: in the first case, as a list of frozen-bit locations or a mask that is converted into a list of functions; and in the second, as a list of functions directly. To further simplify implementation, we present the decoder with a list of instructions, with each

Fig. 6: Top-level architecture of the decoder. Blocks: Controller, Processing Unit, Channel Loader, α-Router, β-Router, α-RAM, β-RAM, Channel RAM, Instruction RAM, and Codeword RAM.

LLRs into memory, and a data processing unit (ALU) to perform the correct function. The processing unit accesses data in the α- and β-RAMs (data memory). The estimated codeword is buffered into the codeword RAM, which is accessible from outside the decoder.

By using a pre-compiled list of instructions, the controller is reduced to fetching and decoding instructions, tracking which stage is currently being decoded, initiating channel LLR loading, and triggering the processing unit.

Before discussing the details of the decoder architecture, it should be noted that this work presents a complete decoder, including all the input and output buffers needed to be flexible. While it is possible to reduce the size of the buffers, this is accompanied by a reduction in flexibility and limits the range of codes which can be decoded at full speed, especially at high code rates. This trade-off is explored in more detail in sections VI-A and VIII.

VI. Architecture: Data Loading and Routing

When designing the decoder, we elected to include the required input and output buffers in addition to the buffers required to store internal results. To enable data loading while decoding and to achieve the maximum throughput supported by the algorithm, the α values were divided between two memories: one for the channel α values and the other for the internal ones, as described in sections VI-A and VI-B, respectively. Similarly, the β values were divided between two memories, as discussed in sections VI-C and VI-D. Finally, the routing of data to and from the processing unit is examined in Section VI-E.

Since high throughput is the target of this design, we choose to improve timing and reduce routing complexity at the expense of logic and memory use.
instruction composed of the function to be executed, and a
value indicating whether the function is associated with a right
or a left child in the decoder tree. Such a list can be viewed as A. Channel α Values
a program executed by a specialized micro-processor, in this Due to the lengths of polar codes with good error-correction
case, the decoder. performance, it is not practical to present all the channel output
With such a view, we present the overall architecture of our values to the decoder simultaneously. For the proposed design,
decoder, shown in Fig. 6. At the beginning, the instructions we have elected to provide the channel output in groups of 32
(program) are loaded into the instruction RAM (instruction LLRs; so that for a code of length 32768, 1024 clock cycles
memory) and fetched by the controller (instruction decoder). are required to load one frame in the channel RAM. Since the
The controller then signals the channel loader to load channel codes of rates 0.8413 and 0.9 require 3791 and 2948 clock
8

cycles to decode, respectively, stalling the decoder while a new frame is loaded will reduce throughput by more than 25%. Therefore, loading a new frame while currently decoding another is required to prevent throughput loss.

The method employed in this work for loading a new frame while decoding is to use a dual-port RAM that provides enough memory to store two frames. The write port of the memory is used by the channel loader to write the new frame, while the read port is used by the α-router to read the current frame. Writing to the channel RAM is stalled until decoding the current frame is finished; at which point, the reading and writing locations in the channel RAM are swapped and loading the new frame begins. This method was selected as it allowed full-throughput decoding of both the rate-0.8413 and rate-0.9 codes without the need for a faster second write clock, while maintaining a reasonable decoder input bus width of 32 × 5 = 160 bits, where five quantization bits are used for the channel values, or 128 bits when using (6, 4, 0) quantization. Additionally, channel data can be written to the decoder at a constant rate by utilizing handshaking signals.

The decoder operates on 2P channel α-values simultaneously, requiring access to a 2 × 256 × 5 = 2560-bit read bus. In order for the channel RAM to accommodate such a requirement while keeping the input bus width within practical limits, it must provide differently sized read and write buses. One approach is to use a very wide RAM and utilize a write mask; however, such wide memories are discouraged from an implementation perspective. Instead, multiple RAM banks, each with the same width as the input bus, are used. Data is written to one bank at a time, but read from all simultaneously. The proposed decoder utilizes 2 × 256/32 = 16 banks, each with a depth of 64 and a width of 32 × 5 = 160 bits.

This memory cannot be merged with the one for the internal α values without stalling the decoder to load the new frame, as the latter's two ports can be used by the decoder simultaneously and will not be available for another write operation.

Another method for loading-while-decoding is to replace the channel values once they are no longer required. This occurs after 2627 and 2134 clock cycles, permitting the decoder 1164 and 814 clock cycles in which to load the new frame for the R = 0.8413 and R = 0.9 codes, respectively. Given these timing constraints, the decoder is provided sufficient time to decode the rate-0.8413 code, but not the rate-0.9 one, at full throughput. To decode the latter, either the input bus width must be increased, which might not be possible given design constraints, or a second clock, operating faster than the decoder's, must be utilized for the loading operation. This approach sacrifices the flexibility of decoding very high-rate codes for a reduction in the channel RAM size. The impact of this compromise on implementation complexity is discussed in Section VIII.

B. Internal α Values

The f (9) and g (10) functions are the only two components of the decoder that generate α values as output: each function accepts two α values as inputs and produces one. Since up to P such functions are employed simultaneously, the decoder must be capable of providing 2P α values and of writing P values. To support such a requirement, the internal α value RAM, denoted α-RAM, is composed of two P-LLR-wide banks. A read operation provides data from both banks, while a write operation only updates one. Smaller decoder stages, which require fewer than 2P α values, are still assigned a complete memory word in each bank. This is performed to reduce routing and multiplexing complexity, as demonstrated in [3].

Since read from and write to α-RAM operations can be performed simultaneously, it is possible to request a read operation from the same location that is being written. In this case, the memory must provide the most recent data. To provide this functionality for synchronous RAM, a register is used to buffer newly written data and to provide it when the read and write addresses are the same [3].

C. Internal β Values

The memory used to store internal β values needs to offer greater flexibility than α-RAM, as some functions, such as COMBINE, generate 2P bits of β values while others, such as ML and REP, generate P or fewer bits.

The β-RAM is organized as two banks that are 2P bits wide each. One bank stores the output of left children while the other stores that of right ones. When a read operation is requested, data from both banks is read and either the lower or the upper half from each bank is selected according to whether the read address is even or odd.

Since β-RAM is read from and written to simultaneously, using the second port of a narrower dual-port RAM and writing to two consecutive addresses to improve memory utilization is not possible, as it would interfere with the read operation and reduce throughput.

D. Estimated Codeword

The estimated codeword is generated 2P = 512 bits at a time. These estimated bits are stored in the codeword RAM in order to enable the decoder to use a bus narrower than 512 bits to convey its estimate and to start decoding the following frame immediately after finishing the current one. In addition, buffering the output allows the estimate to be read at a constant rate. The codeword RAM is a simple dual-port RAM with a 2P = 512-bit write bus and a 256-bit read bus, and is organized as N/(2P) = 64 words of 512 bits.

Similar to the case of α value storage, this memory must remain separate from the internal β memory in order to support decoding at full speed; otherwise, decoding must be stalled while the estimated codeword is read, due to the lack of available ports in the RAM.

E. Routing

Since both α and β values are divided between two memories, some logic is required to determine which memory to access; this is provided by the α- and β-routers.
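The split memory organization of Sections VI-B and VI-C can be summarized with a small behavioural model. The following Python sketch is illustrative only: the value of P, the bank depths, and the address-to-word mapping (word index addr // 2, with the parity of the address selecting a half-word) are assumptions for illustration, not details taken from the RTL.

```python
# Toy behavioural model of the banked α/β memories described in Section VI.
P = 4  # number of processing elements (the hardware uses 64 or 256)

# α-RAM: two P-LLR-wide banks. A read returns a 2P-LLR word built from
# both banks; a write updates a single bank, as in Section VI-B.
alpha_banks = [[[0] * P for _ in range(8)] for _ in range(2)]

def alpha_read(addr):
    """Return 2P LLRs: the same word from both banks, concatenated."""
    return alpha_banks[0][addr] + alpha_banks[1][addr]

def alpha_write(bank, addr, word):
    """Write P LLRs into one bank only."""
    alpha_banks[bank][addr] = list(word)

# β-RAM: two 2P-bit banks, one for left-child outputs and one for
# right-child outputs. A read fetches a word from both banks and keeps the
# lower or upper half of each according to the address parity (Section VI-C).
beta_banks = [[[0] * (2 * P) for _ in range(8)] for _ in range(2)]

def beta_read(addr):
    half = slice(0, P) if addr % 2 == 0 else slice(P, 2 * P)
    return (beta_banks[0][addr // 2][half],   # left-child bits
            beta_banks[1][addr // 2][half])   # right-child bits

# Example: a full 2P-wide read assembled from two single-bank writes.
alpha_write(0, 3, [1, -2, 3, -4])
alpha_write(1, 3, [5, -6, 7, -8])
assert alpha_read(3) == [1, -2, 3, -4, 5, -6, 7, -8]
```

The point of the model is the asymmetry: α reads are twice as wide as α writes, and β reads discard half of each fetched word, which is what lets both memories be built from plain dual-port banks.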

[Figure: data processing unit with f, g, REP, SPC, ML ([0101]), sign, and COMBINE blocks interconnected through multiplexers m0 to m3; inputs α, β0, β1 and outputs α′, β0′, β1′.]
Fig. 7: Architecture of the data processing unit.

The α-router receives stage and word indices, determines whether to fetch data from the channel RAM or the α-RAM, and calculates the read address. Only α-RAM is accessible for write operations. Similarly, the β-router calculates addresses and determines which memory is written to; read operations are only performed on the β-RAM.

VII. Architecture: Data Processing

As mentioned in Section IV, our proposed algorithm requires many decoder functions, which translate into instructions that in turn are implemented by specialized hardware blocks.

In Fig. 7, which illustrates the architecture of the data processing unit, α, β0, and β1 are the data inputs, while α′, β0′, and β1′ are the corresponding outputs. The first multiplexer (m0) selects either the β0 value loaded from memory or the all-zero vector, depending on which opcode is being executed. Another multiplexer (m1) selects the result of f or g as the α′ output of the current stage. Similarly, one multiplexer (m2) chooses which function provides the β0′ output. Finally, the last multiplexer (m3) selects the input to the COMBINE function.

The critical path of the design passes through g, SPC, and COMBINE; therefore, these three blocks must be made fast. As a result, the merged processing element (PE) of [3] cannot be used, since it has a greater propagation delay than one implementing only g. Similarly, using two's complement arithmetic results in a faster implementation of the g function, as it performs signed addition and subtraction.

In this section, we describe the architecture of the different blocks in detail and justify the design decisions. We omit the sign block from the detailed description, since it simply selects the most significant bit of its input to implement (11).

A. The f and g Blocks

As mentioned earlier, due to timing constraints, f and g are implemented separately and use the two's complement representation. The f block contains P f elements which calculate their output by directly implementing (9). To simplify the comparison logic, we limit the most negative number to −2^(Q−1) + 1 instead of −2^(Q−1), so that the magnitude of an LLR contains only Q − 1 bits. The g element also directly implements (10), with saturation to 2^(Q−1) − 1 and −2^(Q−1) + 1. The combined resource utilization of an f element and a g element is slightly more than that of the merged PE [3]; however, the g element is approximately 50% faster.

Using two's complement arithmetic negatively affected the speed of the f element. This, however, does not impact the overall clock frequency of the decoder, since the path in which f is located is short.

Since bit-reversal is used, f and g operate on adjacent values in the input α word, and the outputs are correctly located in the output word α′ for all constituent code lengths. Special multiplexing rules would need to be added to support a non-bit-reversed implementation, increasing complexity without any positive effects [3].

B. Repetition Block

The repetition block, described in Section IV-B and denoted REP in Fig. 7, also benefits from using two's complement, as its main component is an adder tree that accumulates the input, the sign of whose output is repeated to yield the β value. As can be seen in Table II, the largest constituent repetition code in the polar codes of interest is of length 16. Therefore, the adder tree is arranged into four levels. Since only the sign of the sum is used, the width of the adders was allowed to grow up the tree to avoid saturation and the associated error-correction performance degradation. This tree is implemented using combinational logic.

When decoding a constituent code whose length Nv is smaller than 16, the last 16 − Nv inputs are replaced with zeros and do not affect the result.

An attempt at simplifying the logic by using a majority count of the signs of the input values caused a significant reduction in error-correction performance that was not accompanied by a perceptible reduction in the resource utilization of the decoder.

C. Repetition-SPC Block

This block decodes the very common constituent code whose frozen-bit pattern is [00010111], i.e. its left child is a repetition code and its right an SPC code. We implement this block using two SPC nodes and one repetition node. First, four f processing elements in parallel calculate the αREP vector to be fed to a small repetition decoder block. At the same time, both possible vectors of LLR values, αSPC0 and αSPC1 (one assuming the output of the repetition code is all zeros and the other all ones), are calculated using eight g processing elements. These vectors are fed to the two SPC nodes, SPC0 and SPC1.

The outputs of these SPC nodes are connected to a multiplexer. The decision βREP from the repetition node is used to select between the outputs of SPC0 and SPC1. Finally, the results are combined to form the vector of decoded bits βv out of βREP and either βSPC0 or βSPC1. This node is also purely combinational.
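To make the data flow through the f, g, REP, and SPC blocks concrete, the following Python sketch models the constituent decoders of this section at the algorithmic level. It is a minimal illustration, not the hardware: it uses natural (non-bit-reversed) indexing, omits the fixed-point saturation discussed above, and the helper names are ours.

```python
# Algorithmic sketches of the constituent decoders of Section VII,
# using the min-sum f/g functions of (9) and (10). Floating/integer
# arithmetic stands in for the saturating fixed-point hardware.
def f(a, b):
    """f element: sign(a) * sign(b) * min(|a|, |b|)."""
    s = 1 if (a >= 0) == (b >= 0) else -1
    return s * min(abs(a), abs(b))

def g(a, b, beta):
    """g element: b + (1 - 2*beta) * a, i.e. signed add or subtract."""
    return b - a if beta else b + a

def decode_rep(alpha):
    """Repetition node: repeat the sign of the accumulated input."""
    bit = 1 if sum(alpha) < 0 else 0
    return [bit] * len(alpha)

def decode_spc(alpha):
    """SPC node: hard decisions, then flip the least reliable bit
    if the single parity check fails."""
    bits = [1 if a < 0 else 0 for a in alpha]
    if sum(bits) % 2:
        bits[min(range(len(alpha)), key=lambda i: abs(alpha[i]))] ^= 1
    return bits

def decode_rep_spc(alpha):
    """Length-8 node with frozen pattern [00010111]: repetition left
    child, SPC right child, both SPC candidates computed speculatively."""
    a_rep = [f(alpha[i], alpha[i + 4]) for i in range(4)]
    beta_rep = decode_rep(a_rep)
    spc0 = decode_spc([g(alpha[i], alpha[i + 4], 0) for i in range(4)])
    spc1 = decode_spc([g(alpha[i], alpha[i + 4], 1) for i in range(4)])
    beta_spc = spc1 if beta_rep[0] else spc0        # select with βREP
    # COMBINE: [βl xor βr, βr]
    return [beta_rep[i] ^ beta_spc[i] for i in range(4)] + beta_spc
```

Note how decode_rep_spc mirrors the hardware structure: one repetition decode over the four f outputs, two speculative SPC decodes over the eight g outputs, and a final selection by βREP, so no step waits on a serial left-then-right schedule.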

D. Single-Parity-Check Block

Due to the large range of constituent code lengths that it must decode ([4, 8192]), the SPC block is the most complex in the decoder. At its core is a compare-select (CS) tree to find the index of the least reliable input bit, as described in Section IV-A. While some small constituent codes can be decoded within a clock cycle, obtaining the input of larger codes requires multiple clock cycles. Therefore, a pipelined design with the ability to select an output from different pipeline stages is required. The depth of this pipeline is selected to optimize the overall decoding throughput by balancing the length of the critical path and the latency of the pipeline.

Table I was used as the guideline for the pipeline design. As codes with Nv ∈ (0, 8] are the most common, their output is provided within the same clock cycle. Using this method, pipeline registers were inserted in the CS tree so that there was a one-clock-cycle delay for Nv ∈ (8, 64] and two for Nv ∈ (64, 256]. Since SPC nodes only exist in a P-RSPC or a P-0SPC and they receive their input from the g elements, their maximum input size is P, not 2P; so any constituent SPC code with Nv > P receives its input over multiple clock cycles. The final stage of the pipeline handles this case by comparing the results from the current input word with those of the previous one, and updating a register as required. Therefore, for such cases, the SPC output is ready in Nv/P + 4 clock cycles. The extra clock cycle improved the decoder operating frequency and the overall throughput. The pipeline for the parity values utilizes the same structure.

E. Maximum-Likelihood Block

When implementing a length-16 ML decoder as suggested in [7], we noted that it formed the critical path and was significantly slower than the other blocks. In addition, once the repetition, SPC, and repetition-SPC decoders were introduced, the number of ML nodes of length greater than four became minor. Therefore, the ML node was limited to constituent codes of length four. When enumerating these codes in the targeted polar codes, we noticed that [0101] was the only such code to be decoded with an ML node. The other two length-four constituent codes were the repetition and the SPC codes, and other patterns never appeared. Thus, instead of implementing a generic ML node that supports all possible frozen-bit patterns, only [0101] is realized. This significantly reduces the implementation complexity of this node.

The ML decoder finds the most likely codeword among the 2^kv = 4 possibilities. As only one constituent code is supported, the possible codewords are known in advance. Four adder trees of depth two calculate the reliability of each potential codeword, feeding their results into a comparator tree, also of depth two. The comparison result determines which of [0000], [0001], [0101] or [0100] is the most likely codeword. This block is combinational.

VIII. Implementation Results

A. Methodology

Each component of our decoder has been validated against a bit-accurate software implementation, using random test vectors. The bit-accurate software implementation was also used to estimate the error-correction performance of the decoder and to experimentally determine acceptable quantization levels.

Logic synthesis, technology mapping, and place and route were performed to target two different FPGAs. The former is the Altera Stratix IV EP4SGX530KH40C2 and the latter is the Xilinx Virtex VI XC6VLX550TL-1LFF1759. They were chosen to provide a fair comparison with state-of-the-art decoders. In both cases, we used the tools provided by the vendors, Altera Quartus II 13.0 and Xilinx ISE 13.4. Moreover, we use worst-case timing estimates, e.g. the maximum frequency reported for the FPGA from Altera Quartus is taken from the results of the "slow 900mV 85°C" timing model.

B. Comparison with the state-of-the-art SC-based polar decoder

The fastest polar decoder in the literature was implemented as an application-specific integrated circuit (ASIC) [10] for a (1024, 512) polar code. Since we are interested in better-performing longer codes, we compare the proposed decoder with the FPGA-based, length-32768 implementation of [3]. Results for the same FPGA are shown in Tables IV and V. For a (32768, 27568) code, our decoder is 15 to 29 times faster than the SP-SC decoder. For the code with a rate of 0.9, it has 19 to 40 times the throughput of SP-SC, depending on P and the quantization scheme used, and achieves an information throughput of 1 Gbps for both quantization schemes. It can also be noted that the proposed decoder uses significantly fewer LUTs and registers but requires more RAM, and can be clocked faster. If the decoder followed the buffering scheme of [3], namely, one input frame and no output buffering, its RAM usage would decrease to 507,248 bits for the P = 256, (7, 5, 1) case and to 410,960 bits when P = 64 and the (6, 4, 0) quantization scheme is used.

Although implementation results for P = 256 are not provided in [3], the throughput of the SP-SC algorithm asymptotically approaches 0.5 · fclk · R, where fclk is the clock frequency. Therefore, even when running at its maximum possible throughput, SP-SC remains 16 to 34 times slower than the proposed decoder for the (32768, 29492) code. The results for the rate-0.9 code with P = 256 and the (7, 5, 1) quantization scheme were obtained using Synopsys Synplify Premier F-2011.09-SP1-1 and Altera Quartus 11.1.

C. Comparison with an LDPC code of similar error-correcting performance

A fully-parallel (2048, 1723) LDPC decoder on FPGA is presented in [12]. At 30.7 MHz on a Xilinx Virtex VI XC6VLX550TL, an information throughput of 1.1 Gbps is reached. Early termination could be used to achieve 8.8 Gbps at 5 dB; however, that would require support for early termination and extra buffering that were not implemented in [12].
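As a rough sanity check of the comparison in Section VIII-B, the asymptotic SP-SC bound 0.5 · fclk · R can be evaluated against the measured throughputs. The numbers below are taken from Tables IV and V; the quoted 16 to 34 times range presumably rests on slightly different clock and rounding assumptions, so the factors printed here are only approximate.

```python
# Back-of-the-envelope check of the Section VIII-B comparison.
# fclk for SP-SC (66 MHz) is from Table IV; the proposed decoder's
# rate-0.9 throughputs are from Table V. Treat the resulting speed-up
# factors as approximate, not as a reproduction of the paper's exact figures.
R = 29492 / 32768                  # rate of the (32768, 29492) code, ~0.9
f_clk = 66e6                       # SP-SC clock frequency in Hz (Table IV)
sp_sc_bound = 0.5 * f_clk * R      # asymptotic SP-SC information throughput

for label, tput in [("P=64, (6,4,0)", 510e6), ("P=256, (6,4,0)", 1044e6)]:
    print(f"{label}: {tput / sp_sc_bound:.1f}x the SP-SC bound")
```

The bound works out to roughly 30 Mbps, so the measured 510 Mbps and 1044 Mbps correspond to speed-ups in the mid-teens to mid-thirties, consistent with the range stated in the text.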

TABLE IV: Post-fitting results for a code of length 32768 on the Altera Stratix IV EP4SGX530KH40C2.

Algorithm    P     Q           LUTs     Registers   RAM (bits)   fmax (MHz)
SP-SC [3]    64    5           58,480   33,451      364,288      66
This work    64    (6, 4, 0)   6,759    1,380       574,800      104
             64    (7, 5, 1)   8,613    998         678,864      103
             256   (6, 4, 0)   26,058   7,224       539,136      104
             256   (7, 5, 1)   30,403   3,880       703,892      102

TABLE V: Throughput comparison for codes of length 32768 on the Altera Stratix IV EP4SGX530KH40C2.

Algorithm    Code rate   P     Q           T/P (Mbps)
SP-SC [3]    0.84        64    5           26
             0.9         64    5           28
This work    0.84        64    (6, 4, 0)   393
                               (7, 5, 1)   389
                         256   (6, 4, 0)   758
                               (7, 5, 1)   743
             0.9         64    (6, 4, 0)   510
                               (7, 5, 1)   505
                         256   (6, 4, 0)   1,044
                               (7, 5, 1)   1,022

Results for our decoder with P = 256, (6, 4, 0) quantization, and a (32768, 27568) polar code on the same FPGA are shown in Table VI. Our decoder requires 5 times fewer LUTs, but only achieves 46% of the throughput.

IX. Conclusion

In this work we presented a new algorithm for decoding polar codes that results in a high-throughput, flexible decoder. An FPGA implementation of the proposed algorithm was able to achieve an information throughput of 1 Gbps when decoding a (32768, 29492) polar code with a clock frequency of 104 MHz. We expect derivative works implementing this decoder as an ASIC to reach a throughput of 3 Gbps when operating at 300 MHz, with a complexity lower than that required by LDPC decoders of similar error-correction performance. Thus, our results indicate that polar codes are promising candidates for data storage systems.

Acknowledgement

The authors wish to thank CMC Microsystems for providing access to the Altera, Xilinx, Mentor Graphics, and Synopsys tools. The authors would also like to thank Prof. Roni Khazaka and Alexandre Raymond of McGill University for helpful discussions. Claude Thibeault is a member of ReSMiQ.

TABLE VI: Comparison with an LDPC code of similar error-correcting performance, on the Xilinx Virtex VI XC6VLX550TL.

Code         Q           LUTs     fmax (MHz)   T/P (Gbps)
LDPC [12]    4           99,468   30.7         1.102
This work    (6, 4, 0)   18,986   70.2         0.510
             (7, 5, 1)   22,217   70.2         0.510

References

[1] E. Arikan, "Channel polarization: A method for constructing capacity-achieving codes for symmetric binary-input memoryless channels," IEEE Trans. Inf. Theory, vol. 55, no. 7, pp. 3051–3073, 2009.
[2] A. Eslami and H. Pishro-Nik, "On bit error rate performance of polar codes in finite regime," in Proc. 48th Annual Allerton Conf. Communication, Control, and Computing (Allerton), 2010, pp. 188–194.
[3] C. Leroux, A. Raymond, G. Sarkis, and W. Gross, "A semi-parallel successive-cancellation decoder for polar codes," IEEE Trans. Signal Process., vol. 61, no. 2, pp. 289–299, 2013.
[4] I. Tal and A. Vardy, "List decoding of polar codes," CoRR, vol. abs/1206.0050, 2012. [Online]. Available: http://arxiv.org/abs/1206.0050v1
[5] P. Chicoine, M. Hassner, M. Noblitt, G. Silvus, B. Weber, and E. Grochowski, "Hard disk drive long data sector white paper," The International Disk Drive Equipment and Materials Association (IDEMA), Tech. Rep., 2007.
[6] A. Alamdar-Yazdi and F. R. Kschischang, "A simplified successive-cancellation decoder for polar codes," IEEE Commun. Lett., vol. 15, no. 12, pp. 1378–1380, 2011.
[7] G. Sarkis and W. J. Gross, "Increasing the throughput of polar decoders," IEEE Commun. Lett., vol. 17, no. 4, pp. 725–728, 2013.
[8] I. Tal and A. Vardy, "How to construct polar codes," CoRR, vol. abs/1105.6164, 2011. [Online]. Available: http://arxiv.org/abs/1105.6164
[9] E. Arikan, "Systematic polar coding," IEEE Commun. Lett., vol. 15, no. 8, pp. 860–862, 2011.
[10] A. Mishra, A. J. Raymond, L. G. Amaru, G. Sarkis, C. Leroux, P. Meinerzhagen, A. Burg, and W. J. Gross, "A successive cancellation decoder ASIC for a 1024-bit polar code in 180nm CMOS," in Proc. IEEE Asian Solid-State Circuits Conf. (A-SSCC), 2012, to appear.
[11] A. Alamdar-Yazdi and F. R. Kschischang, "Locally-reduced polar codes," personal communication, 2012.
[12] V. Torres, A. Perez-Pascual, T. Sansaloni, and J. Valls, "Fully-parallel LUT-based (2048, 1723) LDPC code decoder for FPGA," in Proc. 19th IEEE Int. Conf. Electronics, Circuits and Systems (ICECS), 2012, pp. 408–411.
