Basic-Set Trellis Min-Max Decoder Architecture For Nonbinary LDPC Codes With High-Order Galois Fields

IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 26, NO. 3, MARCH 2018
Huyen Pham Thi, Member, IEEE, and Hanho Lee, Senior Member, IEEE
Abstract: Nonbinary low-density parity-check (NB-LDPC) codes outperform their binary counterparts in terms of error-correction performance. However, the drawback of NB-LDPC decoders is high complexity, especially for the check node unit (CNU), and the complexity increases considerably when increasing the Galois-field (GF) order. In this paper, a novel basic-set trellis min–max algorithm is proposed to greatly reduce not only the CNU complexity but also the number of messages exchanged between the check node and the variable node compared with previous studies, which is highly efficient for higher order GFs. In addition, the proposed CNU is designed to compute the messages in a parallel way. Layered decoder architectures based on the proposed algorithm were implemented for the (837, 726) NB-LDPC code over GF(32) and the (1512, 1323) code over GF(64) using 90-nm CMOS technology, and obtained a reduction in the complexity by 30% and 37% for the CNU, and 40% and 37.4% for the whole decoder, respectively. Moreover, the proposed decoder achieves a higher throughput at 1.67 Gbit/s and 1.4 Gbit/s compared with the other state-of-the-art high-rate NB-LDPC decoders with high-order GFs.

Index Terms: Basic set (BS), check node processing, high order, layered decoding, nonbinary low-density parity-check (NB-LDPC), trellis min–max (TMM), VLSI design.

Manuscript received June 26, 2017; revised September 17, 2017; accepted November 4, 2017. Date of publication December 8, 2017; date of current version February 22, 2018. This work was supported by the Basic Science Research Program through the NRF funded by the Ministry of Science, ICT and Future Planning under Grant 2016R1A2B4015421. (Corresponding author: Hanho Lee.)
The authors are with the Department of Information and Communication Engineering, Inha University, Incheon 22212, South Korea (e-mail: phamhuyenmta87@gmail.com; hhlee@inha.ac.kr).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TVLSI.2017.2775646
1063-8210 © 2017 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

I. INTRODUCTION

NONBINARY low-density parity-check (NB-LDPC) codes defined over Galois fields (GFs) GF(q) with q > 2 outperform their binary counterparts in terms of error-correcting performance and performance improvement in the error-floor region when code length is moderate [1]. In addition, these codes have good ability of burst error correction, especially for high-order GFs.

Research results in [2] and [3] demonstrate that NB-LDPC codes provide superior performance compared with the best optimized binary LDPC code over fading channels, and the combination of NB-LDPC code with high-order modulations improves both the bandwidth efficiency and the error-correction capability. Moreover, the elimination of the error floor is critical for flash memory applications, and the NB-LDPC codes show much promise for multilevel flash memory applications [4]. However, the main disadvantage of NB-LDPC codes is their highly complex decoding algorithms; it is difficult to achieve maximum throughput and minimum area for their architectures. In practical implementations, the NB-LDPC decoders have several drawbacks, such as a highly complex check node unit (CNU), a large area spent on storage elements, and routing congestion.

First, the belief propagation (BP) algorithm used for binary LDPC decoding was introduced for the NB-LDPC decoding [1]. Then, a fast Fourier transform-BP (FFT-BP) algorithm [5] in the probability domain was proposed to reduce the computational complexity in check node processing by replacing the convolutional operations with multiplications in the frequency domain. Although the probability domain algorithm provides optimal error-correcting performance, the large number of additions and multiplications causes an exponential increase in hardware complexity. In [6], the FFT-BP algorithm based on the logarithm domain used log-likelihood ratio (LLR) values to decode the channel messages instead of probability values, in which the multiplications are replaced with additions.

For practical NB-LDPC decoder implementations, suboptimal algorithms such as extended min-sum (EMS) [7] and the min–max [8] algorithm have been proposed to reduce the complexity of the CNU as the main bottleneck of the NB-LDPC decoder. The min–max algorithm [8] is interesting because it uses comparisons instead of additions [7] in the check node processing, which not only reduces the hardware complexity but also prevents the numerical growth of the decoder. In addition, in [8], a forward–backward scheme was utilized to derive the check node output messages. This scheme includes sequential computations, which cause a throughput problem for the decoder architectures. Moreover, additional storage memories are required to store the intermediate messages such as forward and backward messages.

Recently, the path construction algorithms [9], [10] and the relaxed min–max (RMM) algorithm [11] introduced the trellis representation for check node processing to eliminate computing the forward–backward messages, and thus reduces the memory requirement for the intermediate messages. The RMM algorithm [11] using the minimum basis to generate the check node output messages was proposed for NB-LDPC decoders, which further reduces the check node complexity.
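The difference between the EMS rule (additions) and the min–max rule (comparisons only) in an elementary check-node step can be illustrated with a minimal sketch. It is not taken from the cited decoders: the function names, the integer indexing of GF(2^p) symbols by their binary coordinates, and the GF(4) example vectors are assumptions made for illustration, with smaller values denoting more reliable symbols.

```python
def combine(llr_a, llr_b, merge):
    """Elementary check-node step for the constraint x = a + b over GF(2^p).

    llr_a, llr_b: lists of length q = 2**p, indexed by the integer whose bits
    are the binary coordinates of the symbol (an assumed indexing).
    Smaller values mean more reliable symbols; 0 is the most reliable one.
    """
    q = len(llr_a)
    out = [float("inf")] * q
    for a in range(q):
        for b in range(q):
            x = a ^ b  # addition in GF(2^p) is a bitwise XOR of the coordinates
            out[x] = min(out[x], merge(llr_a[a], llr_b[b]))
    return out


def min_sum_step(llr_a, llr_b):
    # EMS-style rule: the reliabilities of the two branches are added.
    return combine(llr_a, llr_b, lambda u, v: u + v)


def min_max_step(llr_a, llr_b):
    # Min-max rule: comparisons only, so every output equals one of the inputs.
    return combine(llr_a, llr_b, max)


if __name__ == "__main__":
    a = [0, 3, 5, 9]  # hypothetical GF(4) message
    b = [0, 2, 7, 4]  # hypothetical GF(4) message
    print(min_sum_step(a, b))  # [0, 2, 5, 4]
    print(min_max_step(a, b))  # [0, 2, 4, 4]
```

Because the min–max outputs are always values already present in the inputs, the message word length does not grow from one step to the next, which is the property the text above refers to as preventing numerical growth.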
However, the sequential check node processing requires a large number of clock cycles, which limits the maximum throughput of the decoder. In [12] and [13], the trellis EMS algorithm was proposed to improve the throughput of the NB-LDPC decoders, where the check node output messages are generated in parallel by means of an extra column inserted into the original trellis. A disadvantage of the decoders in [12] and [13] is their large area, which reduces the overall decoder efficiency. To take advantage of the idea in [12], the simplified trellis min–max (STMM) algorithm [14] was proposed to improve the throughput of the min–max decoders with less complexity. In [15], the one-minimum-only TMM algorithm was introduced on the basis of the STMM algorithm to reduce the CNU complexity by obtaining only one minimum and estimating the second one. In [12]–[15], q × d_c check node output messages are exchanged between the check node and the variable nodes. For high-order GFs or high-rate NB-LDPC codes, there are two main drawbacks in [12]–[15]. First, the amount of exchanged messages increases, which causes wiring congestion, and thus limits the maximum throughput of the decoders. Second, the check node output messages are stored in the memory for the next decoding iteration in the layered decoders. Therefore, the memory requirement becomes large, which leads to a significant growth in the decoder area for NB-LDPC codes.

To overcome the drawbacks of [12]–[15], Lacruz et al. [16] originally introduced a compression technique to reduce the exchanged messages between one check node and the variable nodes to four sets, including the intrinsic and extrinsic information, the path coordinates, and the hard-decision symbols with a size of 5 × (q − 1) + d_c messages without any error-correcting performance loss. For further improvement, the research in [17] and [18] proposed to simplify the CNU architecture and reduce the exchanged messages to 4 × (q − 1) + d_c messages with an error-correcting performance similar to that in [16]. The approximated TMM algorithms in [19] and [20] were introduced to reduce the amount of intrinsic information from (q − 1) elements [16] to only two elements and L elements (L < q), respectively, at the cost of some error-correcting performance loss. The remaining elements are calculated from the approximation functions.

In this paper, a novel basic-set TMM (BS-TMM) algorithm is proposed for NB-LDPC codes based on the theory of the GF GF(q = 2^p), where each field element is uniquely represented by a linear combination of p independent field elements. In the proposed BS-TMM algorithm, the basic set including the intrinsic information of only p = log2 q independent field elements in the extra column is stored, and the other elements are constructed on the basis of this basic set. Moreover, a novel algorithm is introduced for finding p independent field elements with the most reliable messages of the basic set in parallel. The BS-TMM algorithm allows the reduction of exchanged messages between one check node and variable nodes from 4 × (q − 1) + d_c [16] to (q − 1) + 3 × p + d_c messages with a negligible performance loss of 0.1 dB. The proposed method provides a great area reduction and throughput improvement for the NB-LDPC decoders with good error-correction performance. Therefore, it is extremely efficient for the design of high-rate and high-order NB-LDPC decoders. Two NB-LDPC decoders, including (837, 726) over GF(32) and (1512, 1323) over GF(64), were implemented on the basis of the BS-TMM algorithm.

The rest of this paper is organized as follows. Section II reviews the decoding algorithms for the NB-LDPC codes. Section III presents the proposed BS-TMM decoding algorithm for the NB-LDPC codes. In Section IV, the CNU architecture and the overall decoder architecture based on the BS-TMM algorithm are proposed. The implementation results and comparison with previous works are discussed in Section V. Finally, conclusions are drawn in Section VI.

II. TRELLIS MIN–MAX DECODING ALGORITHM

A. Review of the Layered Min–Max Algorithm

A sparse parity-check matrix H with M rows and N columns defines an NB-LDPC linear block code, where each nonzero element h_mn belongs to the GF GF(q). Moreover, a Tanner graph corresponding to H is used to represent the NB-LDPC codes in a graphical way, where variable nodes represent the N columns of H and check nodes represent the M rows of H. Let d_c and d_v be the check node degree (row weight) and the variable node degree (column weight) of H, respectively. Therefore, N(m) denotes the set of d_c variable nodes connected to check node m, and M(n) denotes the set of d_v check nodes connected to variable node n. Let Q_mn(a) and R_mn(a) be the exchanged messages from variable node n to check node m (V2C) and from check node m to variable node n (C2V) for each symbol a ∈ GF(q), respectively. A regular NB-LDPC code with fixed values of d_c and d_v is considered in this paper.

Algorithm 1 Layered Min–Max Decoding Algorithm [17]

A horizontal layered decoding algorithm is applied in this paper because of its faster convergence with similar performance, compared with the flooding decoding algorithm. The layered decoding algorithm for the NB-LDPC codes is presented in Algorithm 1. Let c_n be the nth reference symbol of a received codeword and z_n be the nth hard-decision symbol with the highest reliability. The decoding process is initialized by obtaining the LLR vectors with a size of q of the channel information by means of L_n(a) = ln(Pr(c_n = z_n | channel)/Pr(c_n = a | channel)). At the first layer of the first iteration, Q_n(a) as the a posteriori information
for the variable node n is equal to L_n(a), and R_mn(a) is equal to zero. Let k and l be the loop indexes for the kth iteration and lth layer, respectively. The Q_n(a) messages are permuted following the nonzero element h_mn of matrix H to obtain Q_n(h_mn a). Then, the V2C messages Q̃_mn(a) are derived in step 3, and the normalization of these messages is implemented by steps 4 and 5 to ensure that the LLR value for the most reliable symbol in each vector is equal to zero. Step 6 computes the check node output messages R_mn(a) using a function that depends on the algorithm applied for the check node processing. The updated messages Q_n(a) in step 7 must undergo the reverse permutation before starting a new layer. The decoding process is repeated until the maximum number of iterations I_max is reached. Finally, the output codeword c̃_n is the most reliable symbol corresponding to the Q_n(a) messages.

B. Trellis Min–Max Algorithm With Compressed Messages

In [14], the STMM algorithm was proposed for check node processing to generate the check node output messages in a parallel way. The STMM algorithm, which is presented in Algorithm 2, provides a good tradeoff between the error-correcting performance and the decoding complexity, compared with the previous works [11], [21]. The first step involves the transformation of the input messages from the normal domain Q_mn(a) to the delta domain ΔQ_mn(a). This transformation ensures that the most reliable symbols are always in the first index corresponding to the GF symbol 0, and the rest of the indexes are in order of {α^0, α^1, ..., α^(q−2)}. Step 2 relates to the computation of the syndrome β using the most reliable symbols z_n from V2C messages. In step 3, the first minimum value m1(a) and its column index m1col(a), as well as the second minimum value m2(a) for each trellis row are calculated using the two-minimum function. Step 4 constructs an extra column ΔQ(a) based on the most reliable path of the configuration set conf(n_r, n_c) for each symbol a.

Algorithm 2 TMM Algorithm [14]

Let path 0 be the optimal path for symbol 0 including all nodes in the first row of the delta trellis. The configuration set conf(n_r, n_c) [13] including the possible paths is constructed by the n_r most reliable messages and a maximum of n_c deviations from path 0. The configuration set conf(1, 2) is considered in [14], where n_r = 1 and n_c = 2. Thus, only the most reliable message m1(a) and a maximum of two deviations [η1(a), η2(a)] are considered for each symbol a. In the case of one deviation, η1(a) is equal to η2(a). Otherwise, they are different.

The check node output messages in the delta domain ΔR_mn(a) are simultaneously generated from step 5 to step 13 depending on the deviation information η1(a) and η2(a). For each trellis row, if column j is not a deviation of the most reliable path, then the output message ΔR_mn_j(a) is assigned the extra column value ΔQ(a). If the most reliable path has one deviation at column j, then the second most reliable message m2(a) is assigned to the output message. In the case of the most reliable path formed by two deviations, m1(a) is assigned to the output message. Finally, step 14 transforms the output messages from the delta domain to the normal domain as the C2V messages R_mn(a). A suitable scaling factor λ is used to improve the performance of the decoder, which does not affect the hardware complexity.

A disadvantage of the STMM algorithm [14] is the large memory requirement to store q × d_c output messages of R_mn(a) for each check node, which causes a high decoder area. In [18], a compression technique is applied to reduce the output check node messages from q × d_c values to four elementary sets, I(a), E(a), P(a), and z_n*, including 4 × (q − 1) + d_c values without any error-correcting performance loss, as follows. Thus, the memory requirement is significantly reduced

$$\text{Output: } \left\{ \begin{array}{l} I(a) \\ E(a) \\ P(a) \\ z_n^* = z_n + \beta. \end{array} \right. \tag{1}$$

The set I(a) is generated in a similar way to the extra column ΔQ(a). The set E(a) includes complement values, whose values are either m1(a) or m2(a) depending on the deviation information as shown in (2). The set P(a) contains the path information for updating the output messages as shown in (3)

$$E(a) = \begin{cases} m2(a) & \text{if } \eta_1(a) = \eta_2(a) \\ m1(a) & \text{otherwise.} \end{cases} \tag{2}$$

$$P(a) = \left\{ m1_{col}(\eta_1(a)),\; m1_{col}(\eta_2(a)) \right\}. \tag{3}$$

Finally, updating the C2V messages is implemented by decompression of the check node output messages in the variable node processing as follows:

$$R_{mn_j}\left(a + z_{n_j}^*\right) = \begin{cases} I(a) & \text{if } P(a) \neq j \\ E(a) & \text{otherwise.} \end{cases} \tag{4}$$

III. BASIC-SET TRELLIS MIN–MAX DECODING ALGORITHM

A. Basic-Set Trellis Min–Max Algorithm

In this section, the novel BS-TMM algorithm is proposed to greatly reduce the complexity and the memory requirement
for check node processing as well as the exchanged messages between check nodes and variable nodes with a negligible error-correcting performance loss. The BS-TMM algorithm is highly efficient for designing the decoders with high-order GFs. Without loss of generality, the GF GF(q) with q = 2^p including q elements such as {0, α^0, α^1, ..., α^(q−2)} is considered in our work. For each GF GF(2^p), any field element is uniquely represented by the linear addition of p independent field elements. To take advantage of this, in our work, a set of only p = log2 q independent field elements with the smallest LLRs, called the basic set B*, are generated in the check node processing instead of (q − 1) nonzero field elements in the extra column ΔQ(a) [16], [19]. Then, construction of the ΔQ(a) is implemented in the variable node processing based on the basic set B*.

The BS-TMM algorithm is represented in Algorithm 3. Steps 1–3 are similar to steps 1–3 in Algorithm 2. Step 4 computes the basic set B* = {m1_l*, I_l*, a_l*}, 1 ≤ l ≤ p, including 3 × p values (p LLR values, p column indexes, and p field elements), based on the minimum values m1(a) and their column indexes I_col(a) (1 ≤ a < q). Finding the basic set B* is given by the function in Algorithm 4. Step 5 relates to calculating the complement values in set E(a). The complement values for p field elements, which belong to the basic set B*, are assigned to the second minimum values m2(a). For the remaining field elements, the complement values are assigned to the minimum values m1(a). Finally, the output of the check node processing includes three sets B*, E(a), and z_n* with a size of 3 × p + (q − 1) + d_c values, which are used for generating the C2V messages in the variable node processing.

Algorithm 3 BS-TMM Algorithm

Table I shows the number of bits exchanged between check node and variable node in the proposed algorithm and previous works for the general GF(q = 2^p) and w quantization bits for the LLR values. In addition, the number of exchanged bits for high-order GFs such as GF(32), GF(64), and GF(128) is also computed with d_c = 27 and w = 6 quantization bits, and illustrated in Fig. 1. It is clear that the proposed algorithm greatly reduces the exchanged bits, compared with previous works. In [14], all C2V messages generated in the check node processing are exchanged, which causes an extremely high number of check node output bits. It can be seen that the exchanged bits are reduced by factors of almost 13, 16, and 22.46 for GF(32), GF(64), and GF(128), respectively. In [16]–[19], a small number of fixed sets, in which the size of each set is proportional to either q or d_c, are exchanged. Compared with the original compression technique [16], the proposed work reduces the number of exchanged bits by factors of almost 2.5 and 3.48 for GF(32) and GF(128), respectively. In comparison with the latest work [19], the reduction of the exchanged bits is 38.59% and 52.07% for GF(32) and GF(128), respectively. The BS-TMM algorithm achieves a large reduction of the exchanged bits for two reasons. First, the BS-TMM algorithm reduces the number of fixed sets, where the basic set B*, including 3 × p values, is exchanged instead of 3 × (q − 1) values of two sets I(a) and P(a), as shown in (1). Second, the size of the basic set B* is proportional to p = log2 q, whereas the size of sets I(a) and P(a) is proportional to q. Thus, the BS-TMM algorithm is extremely efficient for high-order GFs.

Fig. 1. Number of exchanged bits between the check node and variable node for different GFs.

TABLE I. COMPARISON OF EXCHANGED MESSAGES BETWEEN CHECK NODE AND VARIABLE NODE WITH d_c = 27 AND w = 6

The function in Algorithm 4 relates to finding the basic set B* based on the minimum values m1(a) and their column indexes I_col(a). Let M be a set including the minimum values m1(a) and their column indexes I_col(a). In step 1, set M is rearranged in ascending order of the m1(a) values to generate a new set M′. The first two field elements from set M′ are selected for the basic set because they are independent field elements with the smallest LLRs, as shown in steps 2–4. The remaining elements of the basic set are found by the loop from steps 5 to 11. The goal of the loop is to find
the next independent field elements with the smallest LLRs except for both the selected elements and the elements that are generated by the possible combinations of the selected elements in steps 7 and 8. Finally, the basic set B* generated includes p independent field elements with the most reliable LLR values.

Algorithm 4 Function: Finding Basic Set B*

In the variable node processing, the extra column ΔQ(a) is recovered and the C2V messages R_mn(a) are generated on the basis of the output sets of the check node processing, including B*, E(a), and z_n*, as shown in Algorithm 5. First, the extra column ΔQ(a) and the path information d(a) are calculated in steps 1–7. For p field elements, which belong to the basic set B*, the ΔQ(a) value is the most reliable LLR m1_l*, and the path information d(a) has one deviation at the column index I_l* with 1 ≤ l ≤ p. The remaining field elements are computed on the basis of all possible combinations of the field elements in the basic set B*. Their ΔQ(a) values are the maximum LLR value from the LLR values corresponding to the combined field elements, and their path information d(a) has more than one deviation and a maximum of p deviations. Updating the C2V messages is implemented in steps 8–14. For each row, if the column index j does not belong to the path information d(a), the C2V message R_mj(a) is assigned the extra column value ΔQ(a). Otherwise, the C2V message R_mj(a) is assigned the complement value E(a). Finally, the C2V messages in the delta domain are converted to the normal domain in step 15.

Algorithm 5 Construct Extra Column ΔQ(a) and R_mn(a)

Fig. 2. Example of the trellis based on GF(8) with d_c = 4.

In Fig. 2, an example of the delta trellis for GF(8) with d_c = 4 is presented, where the minimum values in each row are marked with a dashed square. The extra column ΔQ(a) in the rightmost column is constructed on the basis of basic set B*, as shown in Algorithm 5. This example demonstrates the method of building the basic set B* and the extra column ΔQ(a). From the delta trellis, set M, including the minimum values and their column indexes as M = {(2, 1, α^0), (10, 2, α^1), (26, 3, α^2), (1, 1, α^3), (3, 4, α^4), (30, 1, α^5), (4, 1, α^6)}, is generated. After rearranging set M with ascending order of the minimum values, set M′ = {(1, 1, α^3), (2, 1, α^0), (3, 4, α^4), (4, 1, α^6), (10, 2, α^1), (26, 3, α^2), (30, 1, α^5)} is achieved. The first two field elements from M′ are selected for the basic set as B* = {(1, 1, α^3), (2, 1, α^0)}. The third field element selected is the field element with the smallest LLR value from the remaining field elements of set M′ except for field element α^1 = α^3 + α^0 or (10, 2, α^1), which is a combination of two field elements in B*. Hence, (3, 4, α^4) is selected, and the basic set B* = {(1, 1, α^3), (2, 1, α^0), (3, 4, α^4)} includes p = 3 independent field elements with the most reliable messages. Then, the extra column ΔQ(a) is constructed. For p field elements in the extra column, which belong to the basic set B* such as {α^3, α^0, α^4}, their LLR values ΔQ(a) and the path information d(a) are the same as the LLR values and column indexes in the basic set B*. For other field elements, all combinations of the field elements in B* are considered as follows: ΔQ(α^3 + α^0 = α^1) = max(1, 2) = 2
and d(α^3 + α^0 = α^1) = {1, 1}; ΔQ(α^0 + α^4 = α^5) = max(2, 3) = 3 and d(α^0 + α^4 = α^5) = {1, 4}; ΔQ(α^3 + α^4 = α^6) = max(1, 3) = 3 and d(α^3 + α^4 = α^6) = {1, 4}; and ΔQ(α^3 + α^0 + α^4 = α^2) = max(1, 2, 3) = 3 and d(α^3 + α^0 + α^4 = α^2) = {1, 1, 4}.

B. Performance Analysis

To demonstrate the error-correcting performance of the proposed BS-TMM decoding algorithm, we performed simulations for two GFs: GF(32) and GF(64). Fig. 3 illustrates the frame error rate (FER) performance for the (837, 726) NB-LDPC code over GF(32) with d_v = 4 and d_c = 27 under the additive white Gaussian noise (AWGN) channel and binary phase shift keying modulation. As shown in Fig. 3, the floating-point simulation result of the BS-TMM algorithm with 15 iterations shows a minor performance loss of almost 0.1 dB, compared with the STMM algorithm [14] and the two-extra-column TMM algorithm [17]. However, the proposed BS-TMM algorithm provides low computation complexity, a large area reduction, and a significant improvement in throughput. This is explained by the fact that (q − 1) messages in the extra column ΔQ(a) in [14] and [17] are constructed directly from all reliable messages of the configuration set conf(1, 2) using (q − 1) processors, whereas these are constructed on the basis of only p reliable messages in the basic set in our work. Compared with the R-TMM algorithm [11], in which C2V messages are generated on the basis of minimum basic sets, the FER performance of the BS-TMM algorithm is almost the same as that of the R-TMM algorithm. It is noted that d_c minimum basic sets are required in the R-TMM algorithm [11] to generate the C2V messages, whereas the proposed BS-TMM algorithm requires only one basic set to construct the extra column. Moreover, the sequential design implemented in [11] causes a throughput problem, whereas the proposed BS-TMM algorithm-based design performs all calculations in one clock cycle. For the purpose of hardware implementation, various fixed-point simulations were performed using different quantization schemes. A scheme with 5-bit quantization and eight iterations was chosen, which shows a performance loss of almost 0.1 dB, compared with the floating-point result at 15 iterations.

Fig. 3. FERs of the (837, 726) NB-LDPC code over GF(32) under the AWGN channel.

Fig. 4 represents the FER performance of the (1512, 1323) NB-LDPC code over GF(64). As can be seen, the BS-TMM algorithm has a minor performance loss of 0.07 dB and 0.14 dB in comparison with the modified trellis min–max (mT-MM) algorithm [19] and the STMM algorithm [14], respectively. This result demonstrates that the proposed BS-TMM algorithm provides good FER performance and significantly reduced computation complexity for the high-order GF.

Fig. 4. FERs of the (1512, 1323) NB-LDPC code over GF(64) under the AWGN channel.

IV. BS-TMM DECODER ARCHITECTURE

In this section, the proposed quasi-cyclic NB-LDPC decoder architectures and design technologies for the BS-TMM algorithm are described. The quasi-cyclic NB-LDPC codes over GF(q) are constructed by the algebraic construction method based on array dispersions of matrices in [22], where a (q − 1) × (q − 1) submatrix is generated first. Then, a submatrix with size (d_v, d_c) is selected from the (q − 1) × (q − 1) submatrix. Each field element from the (d_v, d_c) submatrix is dispersed in either a zero matrix or a circulant permutation matrix (CPM) of size (q − 1) × (q − 1). As a result, the H matrix generated from the (d_v, d_c) submatrix has M = (q − 1) × d_v rows and N = (q − 1) × d_c columns.

A. CNU Architecture

The top-level CNU architecture for the BS-TMM algorithm is shown in Fig. 5, where each module corresponds to a step in Algorithm 3. The transformation module converts V2C messages from normal to delta domain using the control signals z_j. This module is constructed by means of d_c reordering networks, as shown in [23], where each reordering network requires q × log2 q w-bit multiplexers. The
check node syndrome β is generated by a tree adder structure. The delta-to-normal domain transformation is derived later using d_c reordering networks with the control signals z_j* = z_j ⊕ β. The two-minimum function finds the first two minimum values and the index of the first minimum from d_c inputs using the 2-min finder. The 2-min finder is adopted by applying the technique in [24], which provides a good tradeoff between the area and latency. Because the (q − 1) rows in the delta trellis except the first row must evaluate this function, a total of (q − 1) 2-min finders are required. Fig. 6 shows an example of the 2-min finder architecture with eight inputs.

Fig. 5. Top-level CNU architecture for BS-TMM algorithm.

Fig. 6. Two-min finder architecture with eight inputs [24].

In [14]–[19], the values in the extra column ΔQ(a) and the path information are generated by means of the first minimum values m1(a) and their column indexes I_col(a). (q − 1) processors are required to generate LLR values and the path information for (q − 1) nonzero elements in the ΔQ(a). Each processor is responsible for constructing q/2 possible paths to find the LLR value and the path information of one nonzero field element, in the case of using the configuration set conf(1, 2). For higher order GFs (increasing the value q), constructing the extra column ΔQ(a) becomes more complex and costly in terms of area. In our work, a basic set B* including the LLR values and the column indexes of only p = log2 q independent field elements needs to be constructed, which provides a large reduction of not only the area but also the messages exchanged between check node and variable node. It is noted that, in our work, multiple nodes can come from the same column stage. This causes negligible performance loss as shown in [11]. For example, the trellis in Fig. 2 shows that two independent field elements in the basic set, (1, 1, α^3) and (2, 1, α^0), come from the same column stage.

Fig. 7. (a) Third element in the basic set. (b) Fourth element in the basic set.

The architecture of the basic set constructor corresponds to the steps in Algorithm 4. A parallel sorting approach in [20] is applied in this paper to simultaneously generate the rearranged minimum values m1′(a) and their indexes I′_col(a) in ascending order of the m1(a) values in one clock cycle. Then, the first two field elements in the basic set are selected from the first two field elements in the rearranged values, such as (m1_1*, I_1*, a_1*) = (m1′(a′_1), I′_col(a′_1), a′_1) and (m1_2*, I_2*, a_2*) = (m1′(a′_2), I′_col(a′_2), a′_2). In this paper, we propose an architecture to obtain the (p − 2) remaining independent field elements in a parallel way. Thus, p independent field elements in the basic set B* are calculated in one clock cycle. Fig. 7 shows the proposed architectures for the next two elements in the basic set. In Fig. 7(a), the architecture is designed to generate the third element, where the combination of the first two elements a_1* + a_2* is removed from the remaining rearranged field elements {a′_3, a′_4, ..., a′_(q−1)} by assigning the maximum
TABLE II. SYNTHESIS RESULTS FOR THE PROPOSED CNU ARCHITECTURE

Fig. 8. E(a) complement generator for GF(8).

quantization bits instead of the LLR value $m1(a_j)$. The min1-finder architecture is responsible for finding the smallest value and its index from (q − 3) inputs. Then, the smallest value is the LLR value of the third element $m1^*_3$, and its index is used to obtain the field element $a^*_3$ and the column index $I^*_3$. The signals $c^1_3, c^1_4, \ldots, c^1_{q-1}$ are used to eliminate the combination of the previous field elements $a^*_1 + a^*_2$ in finding the current element and the next element. Immediately, the fourth element is generated, as shown in Fig. 7(b), which is independent of the previous elements. To ensure the independence, two eliminations are made. First, the input field elements in this stage are either the rearranged field elements $\{a_3, a_4, \ldots, a_{q-1}\}$ or the zero element with p bits, depending on the control signals $c^1_3, c^1_4, \ldots, c^1_{q-1}$. Second, the combinations of the previous elements with the third element, such as $\{a^*_1 + a^*_3, a^*_2 + a^*_3, a^*_1 + a^*_2 + a^*_3\}$, and the third element $a^*_3$ itself are eliminated. The control signals $c^2_3, c^2_4, \ldots, c^2_{q-1}$ are responsible for both eliminations in finding the fourth element, and further in the next field element. Finding the LLR value $m1^*_4$, field element $a^*_4$, and column index $I^*_4$ of the fourth element is similar to that of the third element. This procedure is the same for the remaining field elements in the basic set. Finally, p independent field elements in the basic set $B^*$ are generated simultaneously.

The architecture of the E(a) complement generator for GF(8) is designed to generate (q − 1) LLR values in parallel, as shown in Fig. 8. The one-hot function generates a group of q bits, where only one bit at location $a^*_j$ is equal to “1,” and all the other bits are equal to “0.” Therefore, the control signal e[0:7] has p high bits. The high bit locations correspond to the field elements generated from one deviation path, and the complement values E(a) are assigned to m2(a). Otherwise, the field elements are generated from more than two deviations, and the complement values E(a) are assigned to m1(a).

The outputs of the proposed CNU architecture, including $z^*_n$, E(a), and $B^* = \{m1^*_l, I^*_l, a^*_l\}_{1 \le l \le p}$, are used to generate the C2V messages $R_{mn}(a)$ corresponding to Algorithm 5 in the variable node processing. Thus, the total number of bits exchanged from C2V is $d_c \times p + (q-1) \times w + p \times (w + \lceil \log_2 d_c \rceil + p)$ bits.

The synthesis results of the proposed CNU architecture for the (837, 726) NB-LDPC code over GF(32) and the (1512, 1323) NB-LDPC code over GF(64) are presented in Table II using the Synopsys design tools and a TSMC 90-nm CMOS standard cell library. Compared with the works in [14] and [15], the proposed CNU greatly reduces the gate count by 51.22% and 34.64% for GF(32), respectively, because of the removal of (q − 1) processors for finding the extra column [14], [15] and applying the compression technique [16]. Compared with the original TMM algorithm with the compressed messages in [16], it can be seen that the area saving is almost 30% for GF(32) and 37% for GF(64). This is due to the complexity reduction for finding the basic set with a size of $p = \log_2 q$ instead of finding the sets corresponding to the extra column with a size of q [16]. In [20], L = 4 is chosen for designing the CNU architecture in both GF(32) and GF(64), while $\log_2(32) = 5$ and $\log_2(64) = 6$ values are kept in the proposed CNU architecture for GF(32) and GF(64), respectively. The number of bits exchanged between the check node and the variable node in [20] is similar to that in the proposed work. However, the proposed CNU architecture reduces the area by 18.7% for GF(32) and 14.2% for GF(64). This area reduction is achieved because the work in [20] requires (q − 1) processors to calculate (q − 1) elements of the complement set E(a), while the proposed work uses only one module to calculate (q − 1) elements of the set E(a). In addition, this improvement will be significantly increased if L values [20] are chosen similarly to the proposed CNU design. Compared with [17], since the proposed CNU needs to find only $p = \log_2 q$ elements instead of q elements of two extra columns, the CNU complexity is reduced by 11.6%.

B. Decoder Architecture

In this section, a complete decoder architecture based on the BS-TMM algorithm is designed for NB-LDPC codes. The proposed decoder architecture achieves a great reduction in the area because of the large area reduction in the CNU architecture. In addition, an improvement in the throughput is obtained since reducing the wires between the check node and variable node processors mitigates the routing congestion. The layered min–max algorithm for the proposed decoder is presented in Algorithm 6, where the BS-TMM in Algorithm 3 is implemented in the check node processor. In addition, the decompression network (DN) corresponding to Algorithm 5 is implemented in the variable node processor to generate the C2V messages $R_{mn}(a)$ from outputs of the CNU architecture. The DN has three parts: 1) generating the LLR
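As a rough numerical check of the C2V bit-count expression quoted in this section, the sketch below evaluates it for the two codes and compares it with the q × dc × w bits per check node that the paper attributes to the conventional uncompressed approach [14]. The function names are hypothetical. The quantization width w = 6 is an assumed value that the text does not state, and dc = 27 for the GF(32) code is inferred from 837/(q − 1); only dc = 24 for the GF(64) code is given in the paper.

```python
from math import ceil, log2


def c2v_bits_compressed(q: int, dc: int, w: int) -> int:
    """dc*p + (q-1)*w + p*(w + ceil(log2(dc)) + p), with p = log2(q)."""
    p = q.bit_length() - 1            # exact log2 for a power of two
    return dc * p + (q - 1) * w + p * (w + ceil(log2(dc)) + p)


def c2v_bits_uncompressed(q: int, dc: int, w: int) -> int:
    """q*dc*w: one w-bit LLR per field element and per edge."""
    return q * dc * w


W = 6  # assumed quantization width, not taken from the paper
for q, dc in ((32, 27), (64, 24)):
    comp = c2v_bits_compressed(q, dc, W)
    full = c2v_bits_uncompressed(q, dc, W)
    print(f"GF({q}): {comp} bits compressed vs {full} bits uncompressed")
```

Under these assumptions the compressed format needs 401 bits per check node for GF(32) and 624 bits for GF(64), against 5184 and 9216 bits without compression. Multiplying either figure by the number of check nodes M gives the corresponding check node memory size.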
Fig. 9. Proposed extra column and path information generator for GF(8). (a) Extra column generator for the jth element. (b) Control signal generator. (c) Path information generator for the jth element.

Algorithm 6 Proposed Layered Decoding Algorithm

values of the extra column Q(a) and the path information d(a) with a maximum of p deviations on the basis of the basic set $B^* = \{m1^*_l, I^*_l, a^*_l\}_{1 \le l \le p}$; 2) generating the C2V messages in the delta domain as $R_{mn}(a)$ on the basis of Q(a), E(a), and d(a); 3) and converting the C2V messages from delta to normal domain. It is noted that two DNs are required in the variable node processor. However, the proposed decoder area is much lower than that of the conventional decoders [14], [15].

First, Fig. 9 shows the proposed extra column and path information generator for GF(8). The LLR value of each element in the extra column $Q(a_j)$ is selected from one of the p LLR values $m1^*_l$ $(1 \le l \le p)$ in the basic set $B^*$ depending on the p control signals $s_l[j]$ $(1 \le l \le p)$, as shown in Fig. 9(a). (q − 1) architectures as in Fig. 9(a) are required to compute (q − 1) messages in the Q(a) simultaneously. p control signals $s_l[j]$ $(1 \le l \le p)$ are generated using the architecture in Fig. 9(b). To compute $Q(a_j)$, only one of the control signals $s_l[j]$ is equal to “1,” and others are equal to “0.” Thus, only one of p LLR values is selected for the output $Q(a_j)$.

In order to calculate the p control signals $s_l[j]$, $2^p - 1 = q - 1$ combinations of p field elements in the basic set $B^*$ excluding the zero element are divided into p groups, as shown in Fig. 9(b). p control signals $s_l[j]$ correspond to p outputs of the groups. The lth group contains the field element $a^*_l$ and its combinations with all possible combinations of the previous field elements $a^*_k$ $(0 < k < l)$. Therefore, the lth group $(l > 0)$ includes $2^{l-1}$ combinations of the field elements. In addition, (q − 1) path information corresponding to (q − 1) field elements is also constructed, where each path information d[j] has p column indexes $d_l[j]$ $(1 \le l \le p)$. (q − 1) architectures as in Fig. 9(c) are required to compute (q − 1) path information. For p field elements in the basic set $a^*_l$ $(1 \le l \le p)$, their paths are one deviation, and thus p values of the path information $\{d_l[j]\}_{1 \le l \le p}$ are the same as column index $I^*_l$. The path of the field element generated by the combination of all field elements in the basic set has a maximum of p deviations; thus, p values of the path information $d_l[j]$ $(1 \le l \le p)$ correspond to p column indexes $I^*_l$ $(1 \le l \le p)$ in the basic set. For other field elements generated by the remaining combinations, the number of their deviations is k $(1 < k < p)$, and then the $d_l[j]$ is assigned to the column index $I^*_l$ with $1 \le l \le k$. Otherwise, the $d_l[j]$ with $k < l \le p$ is assigned to the column index $I^*_k$.

Fig. 10 presents the proposed C2V message generator for GF(8) with $d_c = 4$. The C2V messages $R_{mj}(a)$ $(1 \le j \le d_c)$ are simultaneously introduced by either Q(a) or E(a), which are the outputs of the multiplexers. The control signals for the multiplexers depend on the column indexes and p deviations of the path information. If the column index j $(1 \le j \le d_c)$ is equal to at least one of p deviations $d_l(a)$ $(1 \le l \le p)$, then the output of the multiplexer is assigned to compensation value E(a). Otherwise, the output of the multiplexer is assigned to Q(a).

Fig. 11 shows the top-level decoder architecture for the proposed layered decoding algorithm, where one row of H
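The decompression steps described around Figs. 9 and 10 can be illustrated with the following sketch. It is one possible software reading of the text, not the authors' hardware. It assumes the basic set is sorted by ascending LLR, so that the maximum LLR of a combination is that of its highest-indexed member (the group rule of Fig. 9(b)). It pads the path information by repeating the last involved column index, which is an interpretation of the padding rule, and it uses 0-based column indexes. Field elements are integers with XOR as addition, and the function names are hypothetical.

```python
from typing import Dict, List, Tuple


def expand_basic_set(basis: List[Tuple[float, int, int]], p: int):
    """Rebuild Q(a) and d(a) for all q-1 nonzero elements from the basic set.

    basis[l] = (m1*, I*, a*) for l = 0..p-1, sorted by ascending m1*.
    """
    Q: Dict[int, float] = {}
    d: Dict[int, List[int]] = {}
    for mask in range(1, 1 << p):                 # each nonempty subset of the basis
        members = [l for l in range(p) if (mask >> l) & 1]
        a = 0
        for l in members:
            a ^= basis[l][2]                      # GF(2^p) sum of the chosen elements
        top = members[-1]                         # group index, i.e. the active s_l
        Q[a] = basis[top][0]                      # min-max: largest LLR on the path
        cols = [basis[l][1] for l in members]     # one deviation per member
        d[a] = cols + [cols[-1]] * (p - len(cols))
    return Q, d


def c2v_messages(Q, E, d, dc: int) -> Dict[int, List[float]]:
    """R[a][j] = E(a) if column j is one of the deviations of a, else Q(a)."""
    return {a: [E[a] if j in d[a] else Q[a] for j in range(dc)] for a in Q}


basis = [(1, 0, 2), (2, 2, 3), (4, 3, 4)]         # GF(8) basic set (made-up values)
Q, d = expand_basic_set(basis, p=3)
E = {a: 9 for a in range(1, 8)}                   # made-up complement values
R = c2v_messages(Q, E, d, dc=4)
print(Q[1], d[1], R[1])
```

For element 1 = 2 XOR 3 the sketch selects the LLR of the second basis element, records deviations in columns 0 and 2, and therefore outputs E(a) in those two columns and Q(a) in the others. Conversion from the delta domain to the normal domain is not covered here.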
Fig. 10. Proposed C2V message generator for GF(8) with $d_c = 4$.

Fig. 11. Top-level decoder architecture based on the BS-TMM algorithm.

corresponding to one layer is processed in one clock cycle. It can be seen that the decoder architecture is divided into a variable node processor and check node processor.

To start the decoding process, the LLR messages from channel information $L_n(a)$ are loaded in variable node memory (VNMEM). From the next layer and next iteration, the output messages of the variable node processor $Q^{k,l}_n(a)$ are stored in the VNMEM. The VNMEM includes $d_c$ memories with a depth of (q − 1) as the size of the CPM [22] and a width of q × w bits. For each decoding time, one address is read and one address is written from each memory.

The permutation and depermutation of the variable messages in steps 4 and 9 in Algorithm 6 are implemented by modules P and $P^{-1}$, respectively. Each module requires $d_c \times (q-1) \times \log_2 q$ multiplexers of w bits to permute or depermute $d_c$ vectors of (q − 1) messages, and the control signals are based on the $h_{mn}$ nonzero values of H.

The normalization module N is responsible for finding the most reliable messages and their locations $z_n$, and generating the $Q^{k,l}_{mn}(a)$ messages for the inputs of the check node processor. In addition, normalization ensures that the smallest value in each LLR vector $Q^{k,l}_{mn}(a)$ is always equal to zero. At the last decoding iteration, the $z_n$ values are the hard-decision symbols $\tilde{c}_n$ stored in the output memory, and the P module and subtractor are inactive during this process.

Since a layered decoding scheme is used, the outputs of the check node processor in one iteration must be stored in the check node memory (CNMEM) for the next iteration process. Thus, the CNMEM in the proposed decoder has a depth of M and a width of $p \times (w + \lceil \log_2 d_c \rceil + p) + (q-1) \times w + d_c \times p$ bits corresponding to the output bits of the check node processor. A total of $M \times [p \times (w + \lceil \log_2 d_c \rceil + p) + (q-1) \times w + d_c \times p]$ bits are stored in one iteration. Compared with the $M \times q \times d_c \times w$ bits stored in CNMEM in the conventional approach [14], the memory requirement for CNMEM in the proposed decoder is greatly reduced, which leads to a large reduction in decoder area.

V. IMPLEMENTATION RESULTS AND COMPARISON

To illustrate the efficiency of our proposal for NB-LDPC codes, especially for high-order GFs, the complete decoder architectures were implemented for two codes: the (837, 726) NB-LDPC code over GF(32) and the (1512, 1323) NB-LDPC code over GF(64). Verilog HDL was used to model the architectures, and Synopsys design tools with the TSMC 90-nm CMOS standard cell library were used to implement the proposed decoder architectures. The throughput $T_p$ of the decoders is achieved as shown in (5), where seg is the number of pipeline stages used in the decoder architecture to improve the timing. In the proposed decoder architectures, seg = 9 was chosen to obtain a balance between throughput and area

$$T_p = \frac{f_{clk}\,[\text{MHz}] \times (q-1) \times d_c \times p}{I_{max} \times (M + d_v \times seg) + (q-1)}\ [\text{Mbps}]. \qquad (5)$$

Table III shows the implementation results of the proposed decoder in comparison with the other state-of-the-art works for the (837, 726) NB-LDPC code over GF(32). It can be seen that the proposed decoder outperforms the other approaches in both area and throughput. Compared with the STMM algorithm with uncompressed messages [14], our work has almost 8.3 times higher efficiency, and reduces gate count by a factor of 4.3. This significant improvement is achieved by the great reduction in both the storage bits in the CNMEM and the CNU complexity, as explained previously. Compared with [11], our proposal not only reduces the gate count but also increases the throughput because of its reduced complexity and parallel processing in the CNU. Thus, the proposed decoder achieves almost 9.4 times higher efficiency. In [20], a reduced-complexity NB-LDPC decoder was proposed on the basis of reducing the size of the intrinsic information and the path coordinates to $L < q$ values, and the decoder performance depends on the selected L value, whereas our approach reduces the size of these sets to $p = \log_2 q$ values for any GF. Because the complexity of the proposed CNU is reduced, the efficiency of the proposed decoder with p = 5 is almost 1.7 times higher than that in [20] implemented with L = 4. Compared with the decoders in [16], [19], and [17], the proposed decoder reduces the gate count by 40%, 35.4%, and 5.5%, and achieves 53%, 44.7%, and 4.5% higher efficiency, respectively. Moreover, the proposed decoder is almost 14.4 times more efficient than that of [25].

In our work, the (1512, 1323) NB-LDPC code over GF(64) is constructed by the submatrix $(d_v, d_c) = (3, 24)$ and a CPM of size (q − 1) × (q − 1) [22], which is the same code rate as the (1536, 1344) NB-LDPC code in previous works, as shown in Table IV. It is noted that the size of the CPM in previous works is q × q instead of (q − 1) × (q − 1). The synthesis results of the proposed decoder for the (1512, 1323) NB-LDPC
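Throughput expression (5) from this section can be evaluated directly, as in the short sketch below. The code parameters for the GF(64) code follow the text: (dv, dc) = (3, 24), seg = 9, and M = 189 rows, which is inferred here as dv × (q − 1). The clock frequency of 200 MHz and the iteration count of 10 are placeholder values chosen only for illustration, since the paper's actual figures are in its tables and not in this text. The function name is hypothetical.

```python
def throughput_mbps(f_clk_mhz: float, q: int, dc: int, dv: int,
                    m_rows: int, i_max: int, seg: int) -> float:
    """Tp = f_clk * (q-1) * dc * p / (Imax * (M + dv*seg) + (q-1)), in Mbps."""
    p = q.bit_length() - 1                      # bits per GF(q) symbol
    coded_bits = (q - 1) * dc * p               # codeword length in bits
    cycles = i_max * (m_rows + dv * seg) + (q - 1)
    return f_clk_mhz * coded_bits / cycles


# (1512, 1323) code over GF(64); f_clk and Imax are placeholders.
tp = throughput_mbps(f_clk_mhz=200.0, q=64, dc=24, dv=3,
                     m_rows=189, i_max=10, seg=9)
print(f"{tp:.1f} Mbps")
```

The denominator shows why the number of pipeline stages is a trade-off: each iteration costs M + dv × seg cycles, so extra stages raise the achievable clock frequency but also add dv × seg cycles of latency per iteration.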
TABLE III. COMPARISON OF THE PROPOSED DECODER WITH OTHER WORKS FOR THE (837, 726) NB-LDPC CODE OVER GF(32)

code and the comparison with previous works are presented in Table IV. For fair comparison with previous works in terms of throughput, the clock frequency, after placing and routing the design, was reduced following the work in [20]. It can be seen that the proposed decoder reduces the gate count by 57% and achieves almost 3.8 times higher efficiency, compared with the work from [14]. Compared with the works with compressed messages [16], [19], the proposed decoder improves not only the gate count but also the throughput because of a large reduction of the complexity in the CNU and the messages exchanged between check node and variable node, which contributes to mitigating the routing congestion. Therefore, the proposed decoder reduces the gate count by 37.4% and 48.4%, and obtains a higher efficiency at 53.2% and 61.4%, compared with [16] and [19], respectively. Moreover, the proposed decoder exhibits almost 38.6% higher efficiency, compared with the work in [20] with L = 5 for codes in GF(64).

TABLE IV. IMPLEMENTATION RESULTS OF THE PROPOSED DECODER FOR THE (1512, 1323) NB-LDPC CODE OVER GF(64) IN A 90-nm CMOS PROCESS

VI. CONCLUSION

In this paper, we propose a novel basic-set trellis min–max algorithm for decoding NB-LDPC codes to reduce the complexity of the CNU architecture, the messages exchanged between the check node and the variable node, and the storage bits in the CNMEM, compared with previous works. The implementation results show that the decoder architecture based on the proposed algorithm provides a great area reduction and throughput improvement, compared with the other state-of-the-art works. In addition, the results for the NB-LDPC code over GF(64) demonstrate that the proposed algorithm is especially efficient for the high-rate NB-LDPC codes with high-order GFs.

REFERENCES

[1] M. C. Davey and D. MacKay, “Low-density parity check codes over GF(q),” IEEE Commun. Lett., vol. 2, no. 6, pp. 165–167, Jun. 1998.
[2] R. Peng and R.-R. Chen, “WLC45-2: Application of nonbinary LDPC codes for communication over fading channels using higher order modulations,” in Proc. IEEE Global Telecommun. Conf. (GLOBECOM), Nov./Dec. 2006, pp. 1–5.
[3] M. Arabaci, I. B. Djordjevic, L. Xu, and T. Wang, “Nonbinary LDPC-coded modulation for high-speed optical fiber communication without bandwidth expansion,” IEEE Photon. J., vol. 4, no. 3, pp. 728–734, Jun. 2012.
[4] C. A. Aslam, Y. L. Guan, and K. Cai, “Non-binary LDPC code with multiple memory reads for multi-level-cell (MLC) flash,” in Proc. Asia–Pacific Signal Inf. Process. Assoc., Annu. Summit Conf. (APSIPA), 2014, pp. 1–9.
[5] L. Barnault and D. Declercq, “Fast decoding algorithm for LDPC over GF(2^q),” in Proc. IEEE Inf. Theory Workshop, Mar./Apr. 2003, pp. 70–73.
[6] H. Wymeersch, H. Steendam, and M. Moeneclaey, “Log-domain decoding of LDPC codes over GF(q),” in Proc. IEEE Int. Conf. Commun., vol. 2. Jun. 2004, pp. 772–776.
[7] D. Declercq and M. Fossorier, “Decoding algorithms for nonbinary LDPC codes over GF(q),” IEEE Trans. Commun., vol. 55, no. 4, pp. 633–643, Apr. 2007.
[8] V. Savin, “Min-max decoding for non binary LDPC codes,” in Proc. IEEE Int. Symp. Inf. Theory, Jul. 2008, pp. 960–964.
[9] X. Zhang and F. Cai, “Reduced-complexity decoder architecture for non-binary LDPC codes,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 19, no. 7, pp. 1229–1238, Jul. 2011.
[10] K. He, J. Sha, and Z. Wang, “Nonbinary LDPC code decoder architecture with efficient check node processing,” IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 59, no. 6, pp. 381–385, Jun. 2012.
[11] F. Cai and X. Zhang, “Relaxed min-max decoder architectures for nonbinary low-density parity-check codes,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 21, no. 11, pp. 2010–2023, Nov. 2013.
[12] E. Li, D. Declercq, and K. Gunnam, “Trellis-based extended min-sum algorithm for non-binary LDPC codes and its hardware structure,” IEEE Trans. Commun., vol. 61, no. 7, pp. 2600–2611, Jul. 2013.
[13] E. Li, F. García-Herrero, D. Declercq, K. Gunnam, J. O. Lacruz, and J. Valls, “Low latency T-EMS decoder for non-binary LDPC codes,” in Conf. Rec. 47th Asilomar Conf. Signals, Syst. Comput. (ASILOMAR), Nov. 2013, pp. 831–835.
[14] J. O. Lacruz, F. García-Herrero, D. Declercq, and J. Valls, “Simplified trellis min–max decoder architecture for nonbinary low-density parity-check codes,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 23, no. 9, pp. 1783–1792, Sep. 2015.
[15] J. O. Lacruz, F. García-Herrero, J. Valls, and D. Declercq, “One minimum only trellis decoder for non-binary low-density parity-check codes,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 62, no. 1, pp. 177–184, Jan. 2015.
[16] J. O. Lacruz, F. García-Herrero, and J. Valls, “Reduction of complexity for nonbinary LDPC decoders with compressed messages,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 23, no. 11, pp. 2676–2679, Nov. 2015.
[17] H. P. Thi and H. Lee, “Two-extra-column trellis min–max decoder architecture for nonbinary LDPC codes,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 25, no. 5, pp. 1787–1791, May 2017.
[18] J. O. Lacruz, F. García-Herrero, M. J. Canet, J. Valls, and A. Pérez-Pascual, “A 630 Mbps non-binary LDPC decoder for FPGA,” in Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), May 2015, pp. 1989–1992.
[19] J. O. Lacruz, F. García-Herrero, M. J. Canet, and J. Valls, “High-performance NB-LDPC decoder with reduction of message exchange,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 24, no. 5, pp. 1950–1961, May 2016.
[20] J. O. Lacruz, F. García-Herrero, M. J. Canet, and J. Valls, “Reduced-complexity nonbinary LDPC decoder for high-order Galois fields based on trellis min–max algorithm,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 24, no. 8, pp. 2643–2653, Aug. 2016.
[21] Y.-L. Ueng, K.-H. Liao, H.-C. Chou, and C.-J. Yang, “A high-throughput trellis-based layered decoding architecture for non-binary LDPC codes using max-log-QSPA,” IEEE Trans. Signal Process., vol. 61, no. 11, pp. 2940–2951, Jun. 2013.
[22] B. Zhou, J. Kang, S. Song, S. Lin, K. Abdel-Ghaffar, and M. Xu, “Construction of non-binary quasi-cyclic LDPC codes by arrays and array dispersions,” IEEE Trans. Commun., vol. 57, no. 6, pp. 1652–1662, Jun. 2009.
[23] J. Lin, J. Sha, Z. Wang, and L. Li, “Efficient decoder design for nonbinary quasicyclic LDPC codes,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 57, no. 5, pp. 1071–1082, May 2010.
[24] C.-L. Wey, M.-D. Shieh, and S.-Y. Lin, “Algorithms of finding the first two minimum values and their hardware implementation,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 55, no. 11, pp. 3430–3437, Dec. 2008.
[25] X. Chen and C.-L. Wang, “High-throughput efficient non-binary LDPC decoder based on the simplified min-sum algorithm,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 59, no. 11, pp. 2784–2794, Nov. 2012.

Huyen Pham Thi (M’14) received the B.S. degree from the Department of Information and Communication Engineering, Military Technical Academy, Ha Noi, Vietnam, in 2011. She is currently working toward the M.S. and Ph.D. integrated degree with the Department of Information and Communication Engineering from Inha University, Incheon, South Korea.
Her current research interests include algorithms and VLSI architecture design for digital signal processing, forward error correction architectures, and communication systems.

Hanho Lee (M’98–SM’13) received the Ph.D. and M.S. degrees in electrical and computer engineering from the University of Minnesota, Minneapolis, MN, USA, in 2000 and 1996, respectively.
From 2000 to 2002, he was a Member of Technical Staff with Lucent Technologies (Bell Labs Innovations), Allentown, PA, USA. From 2002 to 2004, he was an Assistant Professor with the Department of Electrical and Computer Engineering, University of Connecticut, Storrs, CT, USA. Since 2004, he has been with the Department of Information and Communication Engineering, Inha University, Incheon, Korea, where he is currently a Professor. From 2010 to 2011, he was a Visiting Scholar with Bell Labs, Alcatel-Lucent, Murray Hill, NJ, USA. His current research interests include VLSI architecture design for forward error correction coding, cryptographic, VLSI signal processing, and digital communications.
