Fast Decoding ECC For Future Memories

IEEE JOURNAL ON SELECTED AREAS IN COMMUNICATIONS, VOL. 34, NO. 9, SEPTEMBER 2016

Paolo Amato, Sandro Bellini, Marco Ferrari, Member, IEEE, Christophe Laurent,
Marco Sforzin, and Alessandro Tomasoni, Member, IEEE
Abstract: High-performance storage class memories could benefit from a fast decoding error correcting code (ECC), able to correct a few errors in just a few nanoseconds. The class of BCH codes provides excellent candidates to play this role. The low latency requirement prevents adopting iterative or sequential processes in the encoding and decoding phases, as is traditionally done for storage applications based on Flash NAND technology. Therefore, we propose an architecture for fast decoding of double and triple ECCs. In our architecture, any time-consuming iterative computation is eliminated, and the most complex evaluations are isolated and carried out in parallel with the other terms, to avoid bottlenecks in the decoder. In particular, the error locator polynomial is computed by combinatorial logic, and its roots are searched by testing all the bits simultaneously. Here, we describe a gate-level design of these architectures. We also give an in-depth analysis of hardware-oriented implementations of finite field operations, and of bases for element representation.
Index Terms: DRAM, nonvolatile memory, phase change memory, error correction codes, Galois fields, block codes.

Fig. 1. The gap between Volatile and Non-Volatile memories may be filled by Emerging Memories.
I. INTRODUCTION

Half a century ago Dennard invented DRAM (1966), and Kahng and Sze the Floating Gate MOS (1967). In these fifty years the memory technologies have evolved, improved and consolidated in the two current mainstreams: DRAM and Flash-NAND (NAND for short). In DRAM technology electrons are stored in a capacitor, while in NAND technology electrons are stored in the floating gate (FG-flash) or directly in the oxide layer (CT-flash) of a MOS transistor. In both cases the amount of charge accumulated in the cell is linked to the information stored. DRAM is fast and volatile (and much closer to the CPU in the memory hierarchy system) while NAND is high density and non-volatile (and much closer to the disk in the hierarchy).

Both technologies are encountering scaling problems due to the continuous reduction of their electron containers: it is increasingly difficult to accommodate electrical charges of the same sign in ever smaller spaces. Two different strategies are currently adopted to overcome scaling limits. On one side, DRAM and NAND life is prolonged by innovations in material engineering and in array architectures such as the Three Dimensional (3D) cell arrangement and the Cross Point (XP) cell positioning. On the other, new memory concepts [1] are emerging, sometimes combined with 3D and XP architectures, in order to replace DRAM and NAND in their traditional applications. A recent example in this direction is the 3D XPoint memory of Intel-Micron [2]. In the immediate future, Emerging Memories (EMs) do not seem to be able to replace DRAM and NAND. Actually they are going to support mainstream memories by filling the gap in the memory hierarchy.

The storage principle in EM is not charge based as for DRAM and NAND. In Phase Change Memory (PCM, [3]) the information is stored in the structure of the material (amorphous or crystalline). In metal oxide resistive RAM (Ox-RAM, [4]) the state is associated with the oxygen location, while in copper resistive RAM (Cu-RAM, [5]) it is associated with the copper location. In Spin Transfer Torque Magnetic RAM (STTMRAM, [6]) the state is the electron spin. In Ferroelectric RAM (Fe-RAM, [7]) the information is stored in the form of ion displacement and in correlated electron RAM (Ce-RAM or Mott memories [8]) in the resistive state of Mott insulators.

Fig. 1 shows the landscape of memory systems. A large empty area in the memory hierarchy separates DRAM and NAND technologies, Volatile and Non-Volatile memories. This region, called Storage Class Memories (SCM), is a big opportunity for emerging technologies.

Manuscript received May 2, 2016; revised July 29, 2016; accepted August 2, 2016. Date of publication August 26, 2016; date of current version October 11, 2016. (Corresponding author: Sandro Bellini.)
P. Amato, C. Laurent, and M. Sforzin are with Micron Semiconductor Italia S.r.l., 20871 Vimercate, Italy (e-mail: pamato@micron.com; claurent@micron.com; msforzin@micron.com).
S. Bellini is with the Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano, 20133 Milan, Italy (e-mail: sandro.bellini@polimi.it).
M. Ferrari and A. Tomasoni are with the Consiglio Nazionale delle Ricerche, Istituto di Elettronica e di Ingegneria dell'Informazione e delle Telecomunicazioni, 20133 Milan, Italy (e-mail: marco.ferrari@ieiit.cnr.it; alessandro.tomasoni@ieiit.cnr.it).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/JSAC.2016.2603698
0733-8716 © 2016 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
High performance SCM devices (closer to DRAM than to NAND) need to be fast and reliable and could benefit from embedded algebraic Error Correcting Codes (ECCs) able to correct a few errors in just a few nanoseconds. DRAM is already adopting binary Hamming codes to correct one error per page [9]. BCH codes, able to correct two or three errors, are their natural extension for high performance SCM devices. For instance, in [10] a Bit-Error-Rate (BER) of $10^{-8}$ is reported for HfO$_X$-based resistive memory, so the DRAM reliability target is achievable with a triple ECC. Another example is in [11], where the authors describe STTMRAM with improved reliability such that a triple ECC decoder would meet the target block failure rate of $10^{-9}$.

In this paper we focus on binary BCH codes for random independent errors, leaving to future investigations non-binary ECCs for burst errors. BCH codes are already adopted in NAND flash memories, where they are applied to correct tens of errors in large pages of thousands of data bits. Since in NAND the main focus is on throughput optimization, standard BCH decoding algorithms are usually applied. The iterative Berlekamp-Massey (BM) algorithm computes the coefficients of the Error Locator Polynomial (ELP), and the Chien-search algorithm finds its roots (i.e. the error positions) by sequentially testing all the possibilities.

This classical approach is not compatible with low-latency high-speed SCM devices: one single iteration of the BM algorithm would require the same latency as the whole decoding process we propose in our design (see [12]). Even partially parallel solutions such as [13] require several clock cycles to complete the decoding process. On the other hand, fully parallel solutions for low latency decoding proposed so far, such as [14] and [15], cannot guarantee full correction of all 3-bit error patterns. In this paper we propose low-latency architectures for double and triple ECCs, dubbed fast decoders. To meet the tight latency constraint, all the iterative and sequential processes of the decoding algorithm are replaced by fully parallel implementations of the same functions. The iterative BM algorithm is replaced by the direct evaluation of the ELP symbolic expressions. Any time-consuming computation is eliminated, and the most complex evaluations are isolated and carried out in parallel with the other terms, to avoid bottlenecks in the decoder; in particular we compute linear terms and nonlinear ones separately. Finally, the Chien search is replaced by a set of dedicated circuits that check for each bit position if it is a root of the ELP, and possibly apply the correction. Fig. 2 shows a high-level conceptual scheme of the proposed architecture.

Fig. 2. Fast decoder high-level conceptual scheme: $r_i$ are the read data and $e_i$ the estimated errors.

The decoding algorithms for 2-error and 3-error ECCs recently proposed in [16] and [17] are based on the direct computation of the roots of the ELP, and avoid the exhaustive search. This approach saves time compared to solutions with sequential root search. However, it requires time-consuming operations in the Galois Field (GF), and thus trades decoding latency for smaller area occupancy compared to fully parallel solutions like ours. In this paper we aim to complete the decoding process as fast as possible even at the cost of some increase in area. After all, the area for ECC circuitry is still negligible compared to the additional memory needed to store ECC parity bits [15].

The main contributions of this paper are the following:
• A gate level design of the proposed fast decoding architecture for double (Sec. V) and triple (Sec. VI) ECCs.
• An in-depth analysis of hardware-oriented implementations of finite field operations (Sec. III), and of bases for element representation (Sec. IV). In particular, alternative representations of the elements of a GF are relevant for the triple ECC.

II. ECC FOR MEMORY APPLICATIONS

Channel coding transforms each binary data vector $u$ of $k$ bits into a binary codeword $v = [p, u]$ of $n > k$ bits. If the $n - k$ additional parity-check bits of the vector $p$ are defined by a linear map $p = u \cdot P$, where $P$ is a $k \times (n - k)$ matrix, the set of codewords $C$ is a linear $(n, k)$ code of length $n$ and dimension $k$. The $k \times n$ matrix $G = (P, I_k)$ is the generator matrix of $C$ while the $(n - k) \times n$ matrix $H = (I_{n-k}, P^T)$ is its parity-check matrix. Each codeword in $C$ has the form $v = u \cdot G$ and satisfies the null-condition $v \cdot H^T = 0$. The Hamming weight $w(x)$ of a vector $x$ is the number of its nonzero components and the Hamming distance $d_H$ between two vectors $x$ and $y$ is $d_H(x, y) = w(x - y)$.

Let $v$ be the codeword written in memory. Due to errors in the reading process or in the retention ability of the cells, the $n$-tuple $r$ (called senseword) read from memory can be different from $v$. The aim of any decoding scheme is to determine the error vector $e = r + v$. When $r$ is read, the decoder computes the syndrome $s$ of $r$ as the $(n - k)$-tuple $s = r \cdot H^T$. The syndrome is null if and only if $r$ is a codeword of $C$.

A. Error Correction Performance of Linear Block Codes

The error correction capability $t$ directly impacts the device reliability. The fraction $p$ of incorrect bits in the read word before ECC decoding is the raw bit error rate (RBER). Assuming independent errors,
$$P(E) = \sum_{i=t+1}^{n} \binom{n}{i} p^i (1 - p)^{n-i}$$
is the probability of having more than $t$ errors in the word. The uncorrectable bit error rate (UBER) is the ratio $P(E)/k$ and it is a lower bound to the bit error probability after decoding.

Fig. 3 plots UBER vs RBER for various ECCs with minimum Hamming distance $d = 2t + 1 = 3, 5, 7$, and $k = 64, 128, 256, 512$. A reasonable UBER target for storage
application is $10^{-15}$. For such a target an RBER equal to $6 \cdot 10^{-6}$ can be tolerated with $t = 3$. An RBER one decade lower is needed to achieve the same UBER target with $t = 2$, and three decades lower with $t = 1$. Vice versa, at an RBER of $10^{-8}$ each unit of $t$ reduces the UBER by six decades. Also note the weak effect of $k$ on UBER.

Fig. 3. Log-log plot of UBER vs. RBER for different $t$ and $k$. Each code is indicated with the triplet $(n, k, d)$.

B. Technology-Independent Estimates

To compare at an abstract level the HW implementation of different ECC solutions, we express their decoder area $A$ and latency $L$ in terms of technology-independent units. To this purpose we make use of elementary 2-input gates as basic building elements, and we assume the area and delay relationships between these gates as given in Table I. Let $R$ be the P/N ratio of the CMOS technology, i.e. the speed of an $R \cdot W/L_{\min}$ PMOS is equivalent to that of a $W/L_{\min}$ NMOS. The NOT gate (inverter) has area $R + 1$, that is the sum of the areas of PMOS and NMOS, by assuming that the area of NMOS is 1. Besides, it has delay $R + 1$, that is the capacitive load of the output node assuming that each MOS transistor contributes with its own capacitive load proportional to its width. Considering a standard architecture for the other elementary gates (XOR, NAND, AND, ...), it is possible to assign a size and a delay ($T_X$, $T_{NAND}$, $T_A$, ...) to each of them. Many different internal architectures can be envisioned for each gate type and a specific sizing may be set to each of them by taking many parameters into account. Nonetheless, this approach allows us to give complexity estimates in terms of one single reference gate, i.e. the XOR gate. Hereinafter we also assume $R = 2$.

TABLE I
TECHNOLOGY-INDEPENDENT SIZE-DELAY RELATIONSHIPS

III. ELEMENTARY OPERATIONS IN $GF(2^m)$

Unlike Hamming codes, BCH decoding requires non trivial operations in the GF. In this section we analyze the complexity of the fully parallel implementation of elementary operations in $GF(2^m)$. Some of these are useful for both double and triple ECC, whereas some are used in triple ECC decoders only. Only the binary representation of the field elements guarantees fast operations. The exponential representation is a very useful tool in software, but is almost useless for fast decoders.

1) Multiplication by a Constant: Let $a$ be a variable in $GF(2^m)$, $\alpha$ a primitive element in the field and $\alpha^j$ a constant. The product $b = a\alpha^j$ can be written as
$$b = a\alpha^j = \left(\sum_{i=0}^{m-1} a_i \alpha^i\right) \alpha^j = \sum_{i=0}^{m-1} a_i \alpha^{i+j}$$
where $a_i$ is the coefficient of $\alpha^i$ in the polynomial representation of $a$. The multiplication can be expressed in matrix form, using the binary representation (row vector of the coefficients) of each GF element $\alpha^{i+j}$, as
$$b = (a_0, a_1, \ldots, a_{m-1}) \cdot \begin{bmatrix} \alpha^{j} \\ \alpha^{j+1} \\ \vdots \\ \alpha^{j+m-1} \end{bmatrix}.$$
Since on average each column is half filled with ones, the implementation of this operation requires $m$ XOR trees, each using $(m/2 - 1)$ XORs and with maximum depth $\log_2(m)$.

2) Multiplication of Two Variables: Different structures have been proposed to compute $c = ab$, with $a, b \in GF(2^m)$. Here we adopt the Mastrovito multiplier [18], for several reasons. The complexity of this multiplier, i.e., the number of AND and XOR gates needed, is quite small (almost optimal, if we do not consider exotic multipliers). Besides, this multiplier allows for some delay of one of the terms to be multiplied. Other multipliers require that both factors be available at the same time. Finally, the structure of the Mastrovito multiplier is easy to understand, and also easy to design. We can write $c = ab$ as
$$c = ab = a \sum_{j=0}^{m-1} b_j \alpha^j = \sum_{j=0}^{m-1} b_j (a\alpha^j).$$
The products
$$a\alpha^j = \sum_{i=0}^{m-1} a_i^{(j)} \alpha^i$$
where $j = 0, \ldots, m - 1$, can be prepared in advance as soon as $a$ is available. The $m$ products and $m - 1$ additions in
$$c_i = \sum_{j=0}^{m-1} b_j a_i^{(j)}$$
start when also $b$ is available. The total latency depends on the binary representation of $1, \alpha, \alpha^2, \ldots, \alpha^{2(m-1)}$. For instance in $GF(2^9)$ the maximum latency for the terms $a_i^{(j)}$ is $2T_X$. However some terms require only $T_X$ and can be multiplied in advance by the corresponding $b_j$ (if available). With this precaution, the computation of each bit $c_i$ can be completed within $3T_X$, despite nine addends, and the total latency of the operation is $T_A + 5T_X$ (with a maximum allowed extra delay $T_X$ for $b$). Even an additional sum, i.e., $ab + d$ or $ab + d^2$ or even $ab + d^4$, can be completed in parallel during the computation of $ab$, without latency penalty. As to the number of gates, note that the computation of the bits $a_i^{(j)}$ requires a very sparse matrix and many terms can be reused: a good estimate [19] for the overall number of gates in the multiplier is $m^2$ AND and $(m^2 - 1)$ XOR.

TABLE II
SUMMARY OF THE COMPUTATION LATENCY (AND ALLOWED DELAYS OF THE VARIOUS TERMS) FOR THE OPERATIONS IMPLEMENTED IN $GF(2^9)$

An equivalent implementation can be done also for $c = a^{2^k} b$. Powers $a^{2^k}$ (see next subsection) are linear combinations of the coefficients $a_i$ whose evaluation can be embedded in the terms $a^{2^k} \alpha^j$ with no additional delay.
3) Powers: The computation of powers as $c = a^{2^k}$ requires only linear combinations of the bits $a_i$ for each bit $c_j$. For $c = a^2$, we have
$$c = a^2 = \left(\sum_{i=0}^{m-1} a_i \alpha^i\right)^2 = \sum_{i=0}^{m-1} a_i \alpha^{2i}.$$
The latency depends on the actual field. For instance, in $GF(2^9)$, $a^2$ requires one single level of XOR, whereas $a^4$ requires $2T_X$.

The computation of powers $c = a^n$ with $n \neq 2^k$ requires nonlinear terms. To minimize the latency we separate the linear part (in the bits $a_i$) of the computation from the nonlinear one. For instance, with $n = 3$
$$c = a^3 = \left(\sum_{i=0}^{m-1} a_i \alpha^i\right)^3 = \underbrace{\sum_{i=0}^{m-1} a_i \alpha^{3i}}_{\text{linear}} + \underbrace{\sum_{i=0}^{m-2} \sum_{j=i+1}^{m-1} a_i a_j \left(\alpha^{2i+j} + \alpha^{i+2j}\right)}_{\text{nonlinear}}. \tag{1}$$
The most time-consuming term is the nonlinear one. The products $a_i a_j$ require $m(m - 1)/2$ ANDs, and a single AND level. The vector of all these products is then multiplied by an $m(m - 1)/2 \times m$ matrix, i.e.
$$(a_0 a_1, a_0 a_2, \ldots, a_{m-2} a_{m-1}) \cdot \begin{bmatrix} \alpha + \alpha^2 \\ \alpha^2 + \alpha^4 \\ \vdots \\ \alpha^{3m-5} + \alpha^{3m-4} \end{bmatrix}.$$
The columns of the right-hand matrix are approximately half filled with ones, thus the product requires $m$ XOR trees (one for each column) with $m(m - 1)/4$ inputs, on average. The total number of XORs is $m(m(m - 1)/4 - 1)$ and, as worst case, the maximum depth of the tree is $\log_2(m(m - 1)/2)$. Some additions can be anticipated before the AND level to reduce the total number of terms in the final sum, thereby saving some XOR levels, but such a design must be tailored to the particular GF in use. For instance, in $GF(2^9)$, the computation of $a^3$ can be completed within $T_A + 6T_X$ according to the worst case upper bound. By carefully selecting the terms to be added before the AND level, we get $T_A + 4T_X$.

Finally, the most complex term in the triple ECC decoder is the computation of $a^3 b$. The minimum latency is achieved computing $a^3$ as described above, and the terms $b\alpha^j$ ($j = 0, \ldots, m - 1$) in the meantime. The products require a single level of $m^2$ AND, and the final sum of products can be optimized for a total latency of $2T_A + 8T_X$. In a similar manner, the evaluation of terms like $a^3 b + c$ can be realized without additional latency. Table II summarizes the results obtained in GF(512).

IV. ALTERNATIVE BASES FOR GF ELEMENTS REPRESENTATION

It is worth investigating alternative representations of the elements of a GF, for possible advantages in terms of latency. This is useful in particular for triple ECC decoding, which requires more complex operations than a double ECC decoder.

The literature on finite fields is abundant. Many papers deal with the representation of field elements and with fast multipliers. See, for instance, [20] for a recent survey of these matters.

Good alternative candidates for the binary representation of GF elements are the normal basis (NB), considered for a long time, and the shifted polynomial basis (SPB) that has been proposed more recently [21]. Other bases have also been considered. Among all, the SPB seems the best choice, in general. In particular cases, also the NB deserves some attention. Other bases did not show real merits.

A. Shifted Polynomial Basis

The elements of the polynomial basis (PB), for standard binary representation, are $\alpha^0, \ldots, \alpha^{m-1}$, which are independent elements in every GF.

The SPB is $\alpha^{-v}, \alpha^{-v+1}, \ldots, \alpha^{m-v-1}$, where $v$ is an integer. Typically, $v$ is a small positive integer. If the generator polynomial of the field is a trinomial $x^m + x^l + 1$ it can be shown that the best values are $v = l$ and $v = l - 1$. With both choices the product of two elements belonging to the basis is another basis element, or is equal to the sum of just two basis elements. This does not happen, in general, for the standard PB, which corresponds to $v = 0$.

In shifted polynomial basis (SPB) an element $a$ is represented as
$$a = \sum_{i=-v}^{m-v-1} a_{i+v} \alpha^i = \alpha^{-v} \sum_{i=0}^{m-1} a_i \alpha^i. \tag{2}$$
In other words, the SPB representation of $a$ is equal to the PB representation of $\alpha^v a$.

Usually, operations in SPB require a little less time, and a little less hardware. The most common field operations are discussed below. In all these cases, the field operations in PB correspond to $v = 0$.
TABLE III
DELAY OF THE XOR LOGIC NETWORK FOR SOME COMMON NONLINEAR OPERATIONS WITH SPB REPRESENTATION

1) Multiplication by a Constant: Let $a$ be a generic field element and $\alpha^j$ be a constant. The product $b = a\alpha^j$ is given by
$$b = a\alpha^j = \alpha^{-v} \left(\sum_{i=0}^{m-1} a_i \alpha^i\right) \alpha^j = \alpha^{-v} \left(\sum_{i=0}^{m-1} a_i \alpha^{i+j}\right). \tag{3}$$
It is easily seen that the last summation (in brackets) has exactly the same structure as the multiplier of the $m$-tuple $a_0, \ldots, a_{m-1}$ by $\alpha^j$ in the standard polynomial basis.

2) Squarer: The square of a generic field element $a$ is a linear function of $a_0, \ldots, a_{m-1}$, and is given by
$$a^2 = \left(\alpha^{-v} \sum_{i=0}^{m-1} a_i \alpha^i\right)^2 = \alpha^{-v} \sum_{i=0}^{m-1} a_i \alpha^{2i-v}. \tag{4}$$
This is similar to the squarer in standard PB, but for the substitution of $\alpha^{2i}$ by $\alpha^{2i-v}$. The structure of the squarer in SPB does not change, but now the coefficients are different.
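As a rough illustration of (4), the SPB squarer can be modelled as a binary matrix whose column $i$ holds the coordinates of $\alpha^{2i-v}$. The Python sketch below is an assumption-laden model, not the paper's circuit: it again takes $GF(2^9)$ with the trinomial $x^9 + x^4 + 1$ ($l = 4$), and the names `spb_squarer_columns`, `spb_square` and `max_fan_in` are hypothetical. The fan-in of an output bit is the number of input bits XORed to produce it, so a maximum fan-in of 2 corresponds to a single XOR level.

```python
M, L = 9, 4                          # assumed trinomial x^9 + x^4 + 1
POLY = (1 << M) | (1 << L) | 1
ORDER = (1 << M) - 1


def gf_mul(a, b):
    """Multiply two elements given in the standard polynomial basis (PB)."""
    c = 0
    for _ in range(M):
        if b & 1:
            c ^= a
        b >>= 1
        a <<= 1
        if a >> M:
            a ^= POLY
    return c


def alpha_pow(e):
    """PB coordinates of alpha^e (exponent reduced modulo 511, so e may be negative)."""
    e %= ORDER
    result, x = 1, 0b10
    while e:
        if e & 1:
            result = gf_mul(result, x)
        x = gf_mul(x, x)
        e >>= 1
    return result


def spb_squarer_columns(v):
    """Column i is the image of input bit a_i: the coordinates of alpha^(2i - v)."""
    return [alpha_pow(2 * i - v) for i in range(M)]


def spb_square(coords, v):
    """Square an element given by its SPB coordinates; the result is in SPB coordinates."""
    out = 0
    for i, col in enumerate(spb_squarer_columns(v)):
        if (coords >> i) & 1:
            out ^= col
    return out


def max_fan_in(v):
    """Largest number of input bits feeding any single output bit of the squarer."""
    cols = spb_squarer_columns(v)
    return max(sum((col >> k) & 1 for col in cols) for k in range(M))
```

With the assumed trinomial, `max_fan_in(4)` and `max_fan_in(3)` both return 2, which is in line with a delay of one XOR level for $v = l$ and $v = l - 1$.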
An analysis of the SPB squarer is presented in [22] and [23]. It is shown that for fields generated by a trinomial $x^m + x^l + 1$ the best values of $v$ are $l$ and $l - 1$.

If the field generator polynomial is a trinomial, it can be shown that the delay is always $T_X$ if $v = l$ or $v = l - 1$ [23]. Unfortunately, in GF(256) there are no irreducible trinomials. Using the pentanomial $x^8 + x^4 + x^3 + x^2 + 1$ the delay is $2T_X$ even in SPB.

3) Multiplier: As already said, among several multiplier structures that have been proposed so far, we analyze only the Mastrovito multiplier [18]. A thorough analysis of the structure of the Mastrovito multiplier for fields generated by trinomials is given in [19]. A similar analysis for fields represented in SPB can be found in [21] and [24]. In the latter paper also some particular pentanomials are considered. For instance, for the pentanomial $x^8 + x^4 + x^3 + x^2 + 1$ that generates GF(256) it is suggested that $v = 3$ should be the best choice.

The Mastrovito multiplier in SPB is
$$c = ab = a\alpha^{-v} \sum_{j=0}^{m-1} b_j \alpha^j = \sum_{j=0}^{m-1} b_j \left(a\alpha^{j-v}\right). \tag{5}$$
As already observed, the products $a\alpha^{j-v}$ in SPB do not differ from the same products in PB. Therefore, the Mastrovito multiplier in SPB is very similar to the multiplier in PB, but for the substitution of $a\alpha^j$ by $a\alpha^{j-v}$. Since the coefficients $\alpha^{j-v}$ depend on $v$, the complexity and delay depend on $v$ as well. Of course, we can swap $a$ and $b$.

In [21] and [24], it is shown that if the generator polynomial of the GF is a trinomial, the best performance is obtained with $v = l$ or $v = l - 1$. In [24] it is also shown that if $v = l$ or $v = l - 1$ the Mastrovito multiplier requires $m^2$ AND gates and $m^2 - 1$ XOR gates (unless $l = m/2$). Besides, the time delay is $T_A + \lceil \log_2(2m - l) \rceil T_X$.

In the same paper also some particular pentanomials are considered. For the pentanomial generating GF(256), namely $x^8 + x^4 + x^3 + x^2 + 1$, $m^2 = 64$ AND gates and $m^2 + 3m - 7 = 81$ XOR gates are required.

The time delay in GF(256) is $T_A + 5T_X$. In GF(512) and GF(1024) the delay is $T_A + 4T_X$. The number of XOR gates required by the PB or SPB multiplier is about the same. Optimizing the SPB multiplier by searching for common subexpressions, 80, 80 and 99 XOR gates are needed in GF(256), GF(512) and GF(1024), respectively. Only in the first field does the SPB show an advantage over the PB multiplier, which requires 82 XOR gates.

4) Third Power: The third power of a generic element $a$ in SPB is given by
$$a^3 = \left(\alpha^{-v} \sum_{i=0}^{m-1} a_i \alpha^i\right)^3 = \alpha^{-v} \sum_{i=0}^{m-1} a_i \alpha^{3i-2v} + \alpha^{-v} \sum_{i=0}^{m-2} \sum_{j=i+1}^{m-1} a_i a_j \left(\alpha^{2i+j-2v} + \alpha^{i+2j-2v}\right). \tag{6}$$
This is similar to the cube in standard basis, but for the substitution of $\alpha^{3i}$ by $\alpha^{3i-2v}$, and of $(\alpha^{2i+j} + \alpha^{i+2j})$ by $(\alpha^{2i+j-2v} + \alpha^{i+2j-2v})$.

In GF(256), GF(512) and GF(1024) the time delay turns out to be $T_A + 4T_X$. The number of XOR gates is slightly lower with the SPB. These numbers depend on the particular way of gathering terms like $a_i a_j + a_i a_n + a_k a_j + a_k a_n$. Optimizing the collection of terms with SPB, a total of 73, 78 and 107 XOR gates are needed in GF(256), GF(512) and GF(1024), respectively, against 77, 89 and 115 XOR gates needed with PB.

Table III gives the latency $L$ of the XOR logic network, for some common nonlinear operations using the SPB representation for $m = 8$, 9, and 10. $D_{b,c}$ (when applicable) are the delays that can be tolerated for these variables, without delaying the final result. All delays are expressed in units of $T_X$. To get the total latency, one $T_A$ has to be added in each operation.

B. Generalized Polynomial Basis

A recent paper [25] generalizes the PB and SPB representations, introducing the generalized polynomial basis (GPB). Basically, instead of inserting a term like $\alpha^{-v}$ in the definition of the binary representation, as in SPB, the field elements are multiplied by a polynomial, say $R(x) = 1 + x^5 + x^6$.

Then, it is shown that in some cases particular field generator polynomials, along with suitable polynomials $R(x)$, may give a performance almost equivalent to SPB, or even better. As an example, for $m = 8$ one could use the field generator polynomial $x^8 + x^7 + x^2 + x + 1$, along with the GPB polynomial $R(x) = 1 + x^5 + x^6$, which is equivalent to choosing $v = 158$. With this choice the time delay is $T_A + 5T_X$ and only 74 XOR gates are needed. However, other operations (e.g., the third power, which is not considered in [25]) could be more complex than in SPB.

TABLE IV
DELAY OF THE XOR LOGIC NETWORK FOR SOME COMMON NONLINEAR OPERATIONS WITH NB REPRESENTATION
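The split of the cube into a linear and a nonlinear part, as in (1) for PB and (6) for SPB, can be checked with a small software model. This Python sketch rests on the same assumption used for illustration elsewhere, namely $GF(2^9)$ with the trinomial $x^9 + x^4 + 1$, which the text does not confirm; `cube_split` and the other helper names are hypothetical. The two parts are computed independently, mirroring the idea of evaluating them in parallel, and their XOR gives the SPB coordinates of $a^3$. Setting `v = 0` recovers the PB expression (1).

```python
M = 9
POLY = (1 << 9) | (1 << 4) | 1       # x^9 + x^4 + 1, assumed generator polynomial
ORDER = (1 << M) - 1


def gf_mul(a, b):
    """Multiply two elements given in the standard polynomial basis (PB)."""
    c = 0
    for _ in range(M):
        if b & 1:
            c ^= a
        b >>= 1
        a <<= 1
        if a >> M:
            a ^= POLY
    return c


def alpha_pow(e):
    """PB coordinates of alpha^e (exponent reduced modulo 511, so e may be negative)."""
    e %= ORDER
    result, x = 1, 0b10
    while e:
        if e & 1:
            result = gf_mul(result, x)
        x = gf_mul(x, x)
        e >>= 1
    return result


def cube_split(coords, v):
    """Return (linear, nonlinear) parts of the cube of an element given in SPB coordinates."""
    bits = [(coords >> i) & 1 for i in range(M)]
    linear = 0
    for i in range(M):
        if bits[i]:
            linear ^= alpha_pow(3 * i - 2 * v)
    nonlinear = 0
    for i in range(M - 1):
        for j in range(i + 1, M):
            if bits[i] & bits[j]:    # AND level: the product a_i a_j
                nonlinear ^= alpha_pow(2 * i + j - 2 * v) ^ alpha_pow(i + 2 * j - 2 * v)
    return linear, nonlinear
```

For an element with SPB coordinates `c`, the XOR of the two returned parts should match $\alpha^{-2v} c^3$ computed with `gf_mul`, since the SPB coordinates of $a^3$ are the PB coordinates of $\alpha^v a^3$. The double loop visits $m(m-1)/2 = 36$ index pairs for $m = 9$, matching the number of AND gates quoted for the products $a_i a_j$.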
C. Normal Bases

Each GF has at least one normal basis (NB) $\beta_0 = \gamma$, $\beta_1 = \gamma^2$, $\beta_2 = \gamma^4, \ldots$. In NB an element $a$ is represented as follows:
$$a = \sum_{i=0}^{m-1} a_i \beta_i = \sum_{i=0}^{m-1} a_i \gamma^{2^i}. \tag{7}$$
Operations in normal basis are obtained substituting terms like $\alpha^i \alpha^j$ by $\beta_i \beta_j$ in all the above equations.

Note that in PB or SPB $\alpha^i \alpha^j$ may be an element of the basis. Thus, in many cases the representation of $\alpha^i \alpha^j$ is very simple (no additions are required).

On the contrary, in NB $\beta_i \beta_j = \gamma^{2^i} \gamma^{2^j}$ is never equal to some other basis element $\beta_k$. Therefore, the NB representation of $\beta_i \beta_j$ requires at least two terms. If two terms are enough in all cases, we say that the normal basis is optimal. However, optimal normal bases do not exist for all Galois fields. A well known paper on normal bases is [26]. When an optimal normal basis does not exist, we can look for the best basis, anyway. Up to $m = 15$ we found that the best basis is unique.

For the above reasons, usually field operations in NB are more complex than the corresponding operations in polynomial basis. Good multipliers in NB are described in [27] and [28]. One possible advantage of multipliers in NB is that they consist of a basic structure that is replicated $m$ times. This eases the design of the multiplier, and produces a regular structure. Instead, the multipliers in PB or SPB have different structures for the $m$ output bits.

Also Mastrovito-like multipliers in NB are possible. The only difference with respect to their SPB counterparts is that the binary representation of the field elements is different. Their structure, which can be derived easily, is a little less regular than the classic NB multiplier. Still, these multipliers have many repeated cells.

The time required by Mastrovito multiplication in NB is $T_A + 5T_X$ with $m = 8, 9, 10$. The number of XOR gates required by the NB multiplier, although optimized by searching for common subexpressions, is larger than in SPB.

Also the third power in NB has a very regular structure. The time required by the third power is $T_A + 4T_X$ in all cases of interest. The number of XOR gates required by the third power in NB depends, once again, on the particular way of gathering terms like $a_i a_j + a_i a_n + a_k a_j + a_k a_n$.

Finally, squaring in NB is simply a rotation of coordinates. No XOR gates are needed, and there is no delay.

Table IV summarizes the latency $L$ of the XOR logic network, for the nonlinear operations, using the NB representation. Also here, to get the total latency, one $T_A$ has to be added in each operation.

D. Other Bases

For small fields, say up to GF(64), an exhaustive search for all possible bases is possible. Surprisingly, it turns out that there are a great many possible bases. For instance in GF(16) there are 840 bases, out of 1365 4-tuples of distinct elements. A thorough investigation of bases leading to the simplest hardware structures has always returned an SPB as the best solution.

For larger fields an exhaustive search is impractical. To simplify the analysis, one can consider only equispaced bases, i.e., $\beta_i = \alpha^{K(i+i_0)}$. With this choice, in many cases $\beta_i \beta_j$ belongs to the basis.

A long search for good equispaced bases produced again the SPB solutions, and a few other unexpected solutions (not better than the SPB ones, yet). No useful solutions for GF(256) were found.

E. Redundant Bases

In some cases, redundant bases, i.e., bases that represent $GF(2^m)$ with more than $m$ elements, have been proposed. Sometimes, the cardinality of the basis is just a little more than $m$. Operations in the GF are a little simpler because the redundant representation corresponds to a polynomial ring. This may offset the increased complexity of the redundant basis, provided its cardinality is not too large.

The only redundant basis of interest for the fields under examination is an 11 bit basis in GF(1024). No good redundant bases exist for GF(256) and GF(512).

V. ULTRA-FAST DOUBLE ERROR CORRECTING BCH CODE

The minimum Hamming distance of a BCH code able to correct $t = 2$ errors is $d = 5$. The generator polynomial has four consecutive roots in the exponential representation, plus the conjugate roots [29], i.e.,
$$g(x) = \prod_{i=0}^{m-1} \left(x - \alpha^{2^i}\right) \cdot \left(x - \alpha^{3 \cdot 2^i}\right). \tag{8}$$
Usually in a memory device the information data size $k$ is a power of 2. Assuming $k = 2^{m-1}$, to encode a message $u$ of $k$ bits we need a BCH code designed in a larger field, for example in $GF(2^m)$. The final code $C(n, k)$ can be obtained by shortening a primitive code with length $N = 2^m - 1$, $mt$ parity bits, $N - mt$ information bits and generator polynomial $g(x)$ given by (8).¹
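The product in (8) can be expanded numerically to see that $g(x)$ has binary coefficients and degree $mt$. The following Python sketch is illustrative only: it assumes $m = 9$ and the field generator polynomial $x^9 + x^4 + 1$ (an assumption, since the polynomial is not stated in the text), and the helper names `poly_mul_linear`, `bch2_generator` and `poly_eval` are hypothetical. Polynomials in $x$ are stored as lists of $GF(2^9)$ coefficients, lowest degree first; subtraction coincides with addition (XOR) in characteristic 2.

```python
M = 9
POLY = (1 << 9) | (1 << 4) | 1       # x^9 + x^4 + 1, assumed field generator polynomial
ORDER = (1 << M) - 1


def gf_mul(a, b):
    """Multiply two field elements given in the standard polynomial basis."""
    c = 0
    for _ in range(M):
        if b & 1:
            c ^= a
        b >>= 1
        a <<= 1
        if a >> M:
            a ^= POLY
    return c


def alpha_pow(e):
    """Coordinates of alpha^e, with the exponent reduced modulo 511."""
    e %= ORDER
    result, x = 1, 0b10
    while e:
        if e & 1:
            result = gf_mul(result, x)
        x = gf_mul(x, x)
        e >>= 1
    return result


def poly_mul_linear(p, root):
    """Multiply the polynomial p(x) by the linear factor (x + root)."""
    out = [0] * (len(p) + 1)
    for d, coef in enumerate(p):
        out[d + 1] ^= coef                 # coef * x^(d+1)
        out[d] ^= gf_mul(coef, root)       # coef * root * x^d
    return out


def bch2_generator():
    """Expand g(x) of (8): roots alpha^(2^i) and alpha^(3*2^i), i = 0..m-1."""
    g = [1]
    for i in range(M):
        g = poly_mul_linear(g, alpha_pow(2 ** i))
        g = poly_mul_linear(g, alpha_pow(3 * 2 ** i))
    return g


def poly_eval(p, x):
    """Evaluate p at the field element x with Horner's rule."""
    acc = 0
    for coef in reversed(p):
        acc = gf_mul(acc, x) ^ coef
    return acc
```

Under these assumptions `bch2_generator()` returns 19 coefficients, all equal to 0 or 1, so $g(x)$ has degree $mt = 18$ and vanishes at $\alpha, \alpha^2, \alpha^3, \alpha^4$. With $k = 2^{m-1} = 256$ information bits the shortened code then has $n = 256 + 18 = 274$ bits.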
A. BCH2 Decoding Algorithm

The decoding architecture we propose aims at minimizing
the latency. It separates and computes in parallel different
terms of the ELP expression, and then checks all the bits
simultaneously to find the errors. However, the decoding algo-
rithm can still be separated in the classical stages: Syndrome
Evaluation, ELP computation, root search and correction.
1) Syndrome Evaluation: Starting from the read sequence $r = (r_0, \ldots, r_{n-1})$, the syndromes $S_1, S_3 \in GF(2^m)$ can be computed by

$$\begin{bmatrix} S_1 \\ S_3 \end{bmatrix} = r W^T = r \begin{bmatrix} W_1 \\ W_3 \end{bmatrix}^T \tag{9}$$

where $W_1$ and $W_3$ are $m \times n$ binary matrices whose columns are given by the polynomial representations of $\alpha^{i}$ and $\alpha^{3i}$ respectively, with $i = 0, 1, \ldots, n-1$.

2) Error Locator Polynomial: When two errors occur ($\nu = 2$), say in positions $i_1, i_2$, the ELP with roots $\alpha^{-i_j}$ reads

$$\Lambda(x) = (1 - x\alpha^{i_1})(1 - x\alpha^{i_2}) = 1 + \Lambda_1 x + \Lambda_2 x^2.$$

The coefficients of $\Lambda(x)$ can be evaluated running the BM algorithm in symbolic form as in [30], which returns

$$\Lambda(x) = 1 + S_1 x + \frac{S_1^3 + S_3}{S_1} x^2. \tag{10}$$

The values of the coefficients are computed once $S_1$ and $S_3$ are available. The computation of $\Lambda_2$ in this form is too time-consuming, since among the elementary operations the division is the most demanding one and must be avoided. To this aim $\Lambda(x)$ can be multiplied by $S_1$, obtaining the equivalent (i.e. with the same roots) polynomial

$$\Lambda'(x) = S_1 + S_1^2 x + (S_1^3 + S_3) x^2. \tag{11}$$

When a single error occurs, $S_1 \neq 0$ and $S_1^3 + S_3 = 0$, and the ELP is

$$\Lambda(x) = 1 + S_1 x.$$

Finally, when also $S_1 = 0$, there is no error (or more than two). The ELP (11) is null for any input, thus the correction must be disabled.

3) Root Search and Correction: For each $i = 0, \ldots, n-1$, if $\Lambda(\alpha^{-i}) = 0$, $r_i$ is wrong and must be corrected. This check can be done running any of the equivalent tests $\alpha^{ji} \Lambda'(\alpha^{-i}) = 0$. The best one is

$$S_1 \alpha^{2i} + S_1^2 \alpha^{i} + S_1^3 + S_3 = 0 \tag{12}$$

because no further processing of the most complex term $S_1^3 + S_3$ is required.

B. BCH2 Decoder Architecture

In this Subsection we analyze the architecture of the proposed decoder shown in Fig. 4. The scheme follows the algorithm steps described in the previous subsection.

¹ To keep notation simple, we assume to shorten the $N - n$ last bits of the original code. With this shortening map, the $i$th bit in the shortened code is in position $i$ even in the original code.

Fig. 4. Logical scheme of the decoder of the 2-error-correcting BCH code.

1) Syndrome Evaluation: The blocks labelled $S_b$ gen with $b = 1, 3$ in Fig. 4 compute the linear combinations of the read word $r$ producing the two $m$-bit syndromes $S_1$ and $S_3$ as in (9). In a completely bit-parallel implementation, each bit of $S_b$ can be obtained by a separate XOR-tree [31]. On average, the binary rows of $W_b$ ($m \times n$) are half filled with ones. Hence the syndrome $S_b$ calculation circuit requires $m$ XOR-trees with $(n/2 - 1)$ XOR each, on average. In practice, many terms can be reused. The row of $W_b$ with maximum Hamming weight $w$ generally has $w \approx n/2$. Thus the depth of the logic is $\lceil \log_2(w) \rceil = m - 1$ and the total latency of this stage is $(m - 1) T_X$. For instance, in [32] with $m = 9$, $k = 256$, $n = 274$, the latency of the Syndrome Evaluation stage is $8 T_X$.

2) Error Locator Polynomial: The ELP evaluation is organized as the sum of two terms: one linear and one nonlinear in the bits of the syndrome $S_1$. The linear component $A_i$, computed by the block labelled Linear comb in Fig. 4, includes the linear part of (1) and is position dependent, i.e.

$$A_i = S_1 \alpha^{2i} + S_1^2 \alpha^{i} + \sum_{j=0}^{m-1} S_{1,j}\, \alpha^{3j}. \tag{13}$$

The $m$ bits of $A_i$, $\forall i$, can be computed in parallel as linear combinations of the $m$ bits of $S_1$, by

$$A_i = S_1 \cdot \begin{bmatrix} \alpha^{2i} + \alpha^{i} + 1 \\ \alpha^{2i+1} + \alpha^{i+2} + \alpha^{3} \\ \vdots \\ \alpha^{2i+m-1} + \alpha^{i+2(m-1)} + \alpha^{3(m-1)} \end{bmatrix}. \tag{14}$$

We have $k$ different positions and $m$ (possibly) different linear combinations for each of these. With $k = 2^{m-1}$, $mk$ is much larger than $2^m$, thus it is worthwhile to evaluate all the $2^m$ distinct linear combinations of $m$ bits, and select the appropriate one for each term and position. To this end, we need $2^m - 1$ XOR trees and a total number of XOR

$$\sum_{i=1}^{m} \binom{m}{i} (i - 1) = m 2^{m-1} - (2^m - 1).$$

The maximum depth of the trees is $\lceil \log_2 m \rceil$ XOR levels. For instance, in [32] with $m = 9$ we have a total latency of $4 T_X$ for the block Linear comb.

The nonlinear component of the third power of the syndrome $S_1$ (the block $\mathrm{nl}(\cdot)^3$ in Fig. 4) is the most
time consuming. The nonlinear part of $S_1^3$, given in (1), is

$$\mathrm{nl}(S_1^3) \triangleq \sum_{i=0}^{m-2} \sum_{j=i+1}^{m-1} S_{1,i} S_{1,j} \left( \alpha^{2i+j} + \alpha^{i+2j} \right).$$

Its computation requires a single level of $m(m-1)/2$ AND gates ($T_A$) and, on average, $m(m-1)/4$ terms to add for each bit, thus a total of $m^2(m-1)/4$ XOR gates and a worst case latency of $\lceil \log_2(m(m-1)/2) \rceil T_X$. The nonlinear part $\mathrm{nl}(S_1^3)$ is then added to the syndrome $S_3$. This requires $m$ XOR, and an additional XOR level. By carefully selecting the order of the operations (see Section III-.3) in [32], despite $m = 9$, we obtained a total latency of $T_A + 4 T_X$, including the addition of $S_3$.

3) Root Search and Correction: The last stage checks if

$$A_i + \mathrm{nl}(S_1^3) + S_3 = 0$$

for $i = 1, \ldots, n$. In this case, $e_i = 1$ and $r_i$ must be corrected. The check is obtained by NORing $m$ bits, and the correction by XORing the result with the bit $r_i$

$$\hat{v}_i = r_i + \mathrm{NOR}_m\!\left( A_i + \mathrm{nl}(S_1^3) + S_3 \right)$$

where the $\mathrm{NOR}_m$ gate is a NOR gate with $m$ inputs.²

² Such a gate can be implemented by using a tree of elementary NOR and NAND gates. The tree consists of $n_{NOR} = \sum_{1 \le i < m,\ i\ \mathrm{odd}} m/2^i$ NOR gates, $m - 1 - n_{NOR}$ NAND gates, and its latency is $\lceil \log_2(m) \rceil\, T_{NOR}$ (assuming $T_{NOR} = T_{NAND}$ as in Table I). An INV gate has to be added to both area and latency if $\lceil \log_2(m) \rceil$ is even.

C. BCH2 Decoder Implementation

In Table V we summarize a first-order approximation of the area occupancy and of the latency as a function of $m$. In general these expressions overestimate the actual costs, because they do not take into account any optimization for the specific GF.

TABLE V. THEORETICAL ESTIMATES OF AREA AND LATENCY FOR THE BCH2 DECODER BLOCKS IMPLEMENTED IN $GF(2^m)$

For instance, with $m = 9$, $k = 256$, $n = 274$, from Tables V and I, we deduce a BCH2 decoder area of 7.6 kXOR and a latency of 18.4 $T_X$. An efficient implementation of the standard BCH decoder is outlined by Strukov in [12]. By using Strukov's formulae and the values in Table I, for the same ECC we get a decoder area of 31.24 kXOR, and a latency of 67.2 $T_X$ levels, approximately four times larger and slower than our solution. The latency of this architecture has been further reduced to 16.2 $T_X$ in [32] with a careful optimization. This BCH2, implemented in a 45 nm 1 Gbit PCM device [3], went into mass production in 2010 and was sold in millions of units.

A similar, fully parallel solution is proposed in [15]. The expressions independent of the bit position are evaluated separately from the ones depending on $\alpha^{i}$. Our proposal introduces the further optimizations of separating linear and nonlinear operations, and of efficiently computing all the needed linear combinations. In fact, just to give an example, by taking $k = 256$ and the NAND delay (∼0.03 ns) used in [15], we have that the delay of the BCH2 in [15] is ∼2.5 ns while the estimated delay of our solution is ∼1.25 ns.

VI. ULTRA-FAST TRIPLE-ERROR-CORRECTING BCH CODE

A triple-error correcting BCH code $\mathcal{C}(n, k)$ for a data size $k = 2^{m-1}$ can be obtained by shortening a primitive BCH code with length $N = 2^m - 1$, $d = 7$, $3m$ parity bits, and generator polynomial given by

$$g(x) = \prod_{i=0}^{m-1} \left( x - \alpha^{2^i} \right) \left( x - \alpha^{3 \cdot 2^i} \right) \left( x - \alpha^{5 \cdot 2^i} \right). \tag{15}$$

A. BCH3 Decoding Algorithm

BCH3 decoding is more complex than BCH2. The BM algorithm is replaced again by the parallel evaluation of the ELP symbolic expressions, that are now selected through a more complicated decision tree [30]. It is important to deal with a limited number of ELP expressions. Given the above precautions, the decoding algorithm is still composed of the following stages, as the BCH2.

1) Syndrome Evaluation: A triple ECC BCH code needs a third syndrome evaluation. Eq. (9) can be trivially extended to

$$\begin{bmatrix} S_1 \\ S_3 \\ S_5 \end{bmatrix} = r W^T = r \begin{bmatrix} W_1 \\ W_3 \\ W_5 \end{bmatrix}^T \tag{16}$$

where $r$ is the word read from memory, and each $W_b$ is an $m \times n$ binary matrix whose columns are given by the polynomial representations of $\alpha^{bi}$ with $i = 0, 1, \ldots, n$ and $b = 1, 3, 5$.

2) Error Locator Polynomial: When three errors occur ($\nu = 3$), say in positions $i_1, i_2, i_3$, the ELP with roots in $\alpha^{-i_n}$ reads

$$\Lambda(x) = (1 - x\alpha^{i_1})(1 - x\alpha^{i_2})(1 - x\alpha^{i_3}) = 1 + \Lambda_1 x + \Lambda_2 x^2 + \Lambda_3 x^3.$$

The coefficients of $\Lambda(x)$, evaluated running the BM algorithm in symbolic form [30], are

$$\Lambda_1 = S_1, \qquad \Lambda_2 = \frac{S_5 + S_1^2 S_3}{S_3 + S_1^3}, \qquad \Lambda_3 = \frac{S_5 S_1 + S_1^6 + S_3^2 + S_1^3 S_3}{S_3 + S_1^3}.$$

If $S_3 + S_1^3 \neq 0$, an inversionless ELP with the same roots is

$$\Lambda'(x) = A + Bx + Cx^2 + Dx^3 \tag{17}$$
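where

$$\begin{aligned} A &\triangleq S_3 + S_1^3 \\ B &\triangleq S_1 A = S_1 S_3 + S_1^4 \\ C &\triangleq S_5 + S_1^2 S_3 \\ D &\triangleq S_5 S_1 + S_1^6 + S_3^2 + S_1^3 S_3. \end{aligned} \tag{18}$$

Note that $D = \underbrace{S_1^6 + S_3^2}_{A^2} + \underbrace{S_5 S_1 + S_1^3 S_3}_{D_2} = A^2 + D_2$.

If $A \neq 0$ but $D = 0$ the last discrepancy of the BM algorithm is zero, and we obtain the ELP of the BCH2 decoder.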
If $D = 0$ and also $A = 0$, the ELP computed by the BM algorithm has degree one and reads

$$\Lambda(x) = A' + B' x \tag{19}$$

with $A' = 1$ and $B' = S_1$. Finally, if also $S_1 = 0$ we are in the error free case ($S_1 = S_3 = S_5 = 0$). Note that the ELP (11) reduces to (19) when $S_3 + S_1^3 = 0$, whereas the error free case requires a separate test because if also $S_1 = 0$, the ELP (11) is null for any $x$.

Fig. 5. Schematic representation of the fast BCH3 decoder.

TABLE VI. THEORETICAL ESTIMATES OF AREA AND LATENCY FOR THE BCH3 DECODER IMPLEMENTED IN $GF(2^m)$

Following [30] and the BM algorithm we would need three different ELPs, whose choice is driven by the value of the most time-consuming term $D$.
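The inversionless coefficients in (18) can be checked numerically. The sketch below is a software model under stated assumptions: the primitive polynomial $x^9 + x^4 + 1$, the helper names, and the use of error positions instead of a full read word (the codeword contributes nothing to the syndromes) are all illustrative choices, not the paper's implementation. It computes $A$, $B$, $C$, $D$ and $D_2$ from the syndromes and locates the errors by testing $\alpha^{3i} \Lambda'(\alpha^{-i}) = A\alpha^{3i} + B\alpha^{2i} + C\alpha^{i} + D = 0$ at every position.

```python
# Numerical check of the inversionless BCH3 ELP coefficients (illustrative only).
M = 9
PRIM_POLY = 0b1000010001    # x^9 + x^4 + 1: assumed primitive polynomial for GF(2^9)
ORDER = (1 << M) - 1        # 511

EXP, LOG = [], {}           # EXP[e] = alpha^e in polynomial-basis bits; LOG is the inverse
_x = 1
for _e in range(ORDER):
    EXP.append(_x)
    LOG[_x] = _e
    _x <<= 1
    if _x >> M:
        _x ^= PRIM_POLY


def mul(a, b):
    """Field multiplication through the log/antilog tables."""
    if a == 0 or b == 0:
        return 0
    return EXP[(LOG[a] + LOG[b]) % ORDER]


def alpha(e):
    """alpha^e with the exponent reduced modulo the field order."""
    return EXP[e % ORDER]


def syndromes(error_positions):
    """S1, S3, S5 of an error pattern given by its list of bit positions."""
    result = []
    for b in (1, 3, 5):
        acc = 0
        for i in error_positions:
            acc ^= alpha(b * i)
        result.append(acc)
    return tuple(result)


def coefficients(s1, s3, s5):
    """A, B, C, D as defined in (18), plus D2 = S5*S1 + S1^3*S3."""
    s1_2 = mul(s1, s1)
    s1_3 = mul(s1_2, s1)
    A = s3 ^ s1_3
    B = mul(s1, A)
    C = s5 ^ mul(s1_2, s3)
    D2 = mul(s5, s1) ^ mul(s1_3, s3)
    D = mul(A, A) ^ D2
    return A, B, C, D, D2


def root_positions(A, B, C, D, n):
    """Positions i in [0, n) where A*alpha^(3i) + B*alpha^(2i) + C*alpha^i + D = 0."""
    found = []
    for i in range(n):      # all positions are tested in parallel in hardware
        value = mul(A, alpha(3 * i)) ^ mul(B, alpha(2 * i)) ^ mul(C, alpha(i)) ^ D
        if value == 0:
            found.append(i)
    return found
```

With errors at positions 3, 100 and 250 the search is expected to return exactly those positions; with only two errors $D$ vanishes while $A$ stays nonzero, and the same test still locates both errors. The test is meaningful only when $A \neq 0$.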
An alternative is the symbolic solution of the Key Equation. If $\nu = 2$ we can pick two equations among

$$S_5 \Lambda_1 + S_4 \Lambda_2 = S_6 \tag{20}$$
$$S_4 \Lambda_1 + S_3 \Lambda_2 = S_5 \tag{21}$$
$$S_3 \Lambda_1 + S_2 \Lambda_2 = S_4 \tag{22}$$
$$S_2 \Lambda_1 + S_1 \Lambda_2 = S_3. \tag{23}$$

From (22) and (23) we get $\Lambda_1 = S_1$. From (21) and (23) we obtain

$$\Lambda_2 = \frac{S_5 + S_1^2 S_3}{S_3 + S_1^3}$$

that is the same $\Lambda_2$ found for $\nu = 3$. Therefore (17)-(18) hold even in case of two errors. This can be verified also by the Peterson decoding algorithm (see, e.g., [17]).

From (21) and (22) we can infer the condition for $\nu = 2$ that reads

$$S_4 S_2 + S_3^2 \neq 0 \;\Leftrightarrow\; S_1^6 + S_3^2 \neq 0 \;\Leftrightarrow\; S_3 + S_1^3 \neq 0$$

and thus the condition for $\nu = 2$ or 3 (and for the use of the ELP (17)) is simply $A \neq 0$. This is a great advantage because the computation of $A$ is much faster than the computation of $D$, as we will show in Section VI-B.

When $A = 0$ and $S_1 \neq 0$, we have $\nu \le 1$ and five equations for $\Lambda_1$. They all lead to

$$\Lambda_1 = S_1.$$

Finally, if also $S_1 = 0$ the sequence is error free, and the correction must be disabled.

In conclusion, the decision tree has just two ELPs to choose from:
• when $A \neq 0$, $\Lambda'(x) = A + Bx + Cx^2 + Dx^3$, with $A$, $B$, $C$, $D$ given in (18)
• when $A = 0$, $\Lambda(x) = 1 + S_1 x$, that includes as a special case the ELP with no roots $\Lambda(x) = 1$ when $S_1 = 0$.

3) Root Search and Correction: The fastest search for the ELP roots is the parallel test $\Lambda(\alpha^{-i}) = 0$, $\forall i = 0, 1, \ldots, n$. When $\nu \ge 2$ ($A \neq 0$), the test can be run checking the expression

$$A\alpha^{3i} + A^2 + B\alpha^{2i} + C\alpha^{i} + D_2 = 0 \tag{24}$$

where we highlight that the constant term $A^2$ can be taken into account while computing the linear combinations $A\alpha^{3i}$. Since the evaluation of $D_2 = S_5 S_1 + S_1^3 S_3$ is still the most time-consuming, expression (24) enables the fastest test. The computation of $A\alpha^{3i} + A^2$, $B\alpha^{2i}$ and $C\alpha^{i}$ can be carried in parallel with $D_2$ as soon as $A$, $B$ and $C$ get available.

In the simplified ELP case of $\nu \le 1$ ($A = 0$), the search can be run by the same hardware, simply replacing the coefficient $C$ with 1 and $D_2$ with $S_1$. In fact, in this case, $A = 0$, and $B = S_1 A = 0$.

B. BCH3 Decoder Architecture

Fig. 5 shows the architecture of the proposed decoder. The scheme follows the algorithm steps described in the previous subsection. In Table VI we give the general expressions for Area and Latency of the three stages as a function of $m$. In the discussion, to focus on a practical implementation we assume $m = 9$, $k = 256$.

1) Syndrome Evaluation: The blocks labelled $S_b$ gen with $b = 1, 3, 5$ compute the three $m$-bit syndromes $S_1$, $S_3$ and $S_5$ given by (16), as for the BCH2, in a completely bit-parallel implementation. Each bit of $S_b$ is obtained by a separate XOR-tree with inputs taken from $r$. The complexity and
latency for each syndrome is the same as in the BCH2: $m$ XOR-trees with $(n/2 - 1)$ XOR each, on average. The depth of the logic is $\lceil \log_2(w) \rceil = m - 1$ and the total latency of this stage is again $(m - 1) T_X$. For instance, with $m = 9$, the latency is $8 T_X$.

2) Error Locator Polynomial: The second step is run in parallel for the terms of the four coefficients of the ELP. The fastest to compute is $A$, that requires a cube and one addition that can actually be inserted without latency penalty: it takes 4 XOR levels and one AND level (see Table III). The result of the test $A = 0$ is conveyed at the output of the blocks computing $C$ and $D_2$, because their values must be replaced if the test is true.

With SPB representation of the GF(512) elements with $v = 3$ or 4 (see Table III) even the value of $B$, that is of type $ab + c^4$, can be computed with the same latency as $A$. Conversely, whatever the SPB choice, the value of $C$ is available $T_X$ seconds later as it requires a product operation with a squared term. The computed coefficients $A$, $B$ and $C$ are conveyed to the Root Search stage, to compute their linear combinations.

The computation of $D_2$ is the most time consuming. An upper bound to its latency is given by $2T_A + \left( \lceil \log_2(m(m-1)/2) \rceil + \lceil \log_2(m) \rceil \right) T_X + T_O$, including the change $D_2 \to S_1$ if $A = 0$. In GF(512), by using the optimization described in Sec. III-.3, the latency is $2T_A + 8T_X + T_O$. In fact, with SPB the term $S_1 S_5$ is available $T_A + 4T_X$ after syndrome evaluation, and the sum of this term can be inserted during the computation of $S_1^3 S_3$ without latency penalty. The computation of $S_1^3 S_3$ can be completed within $2T_A + 8T_X$, for any GF and basis, thanks to the Mastrovito multiplier that allows the precomputation of the terms based on $S_3$. The value of $D_2$ is thus ready $2T_A + 16T_X$ after data reading and no linear combinations of $D_2$ are needed.

Note that the replacements $D_2 \to S_1$ and $C \to 1$ do not need standard switches. In fact, the replaced terms $C$ and $D$ are null, thus a simple OR with the replacing values (provided they are nulled when not needed) will produce the same result.

The ELP stage requires four multipliers, four adders, one squarer, one third and one fourth power. An upper bound to the total logic area of this stage is $\left( m^3/4 + 257m^2/24 + 6m \right)$ XOR $+ \left( 9m^2/2 - m/2 \right)$ AND gates.

3) Root Search and Correction: For the root search, the blocks labeled Linear comb compute $A\alpha^{3i} + A^2$, $B\alpha^{2i}$ and $C\alpha^{i}$ for each bit position $i$. This computation requires $m$ linear combinations of the bits of $A$, $B$ and $C$, for each $i$. Once again $2^m - 1 \ll mk$ and all possible linear combinations are computed with a depth of $\lceil \log_2 m \rceil$ levels of XOR. The linear combinations of $A$ and $B$ are the first available, at the same time in GF(512) with SPB. For each position $i$, the term $A\alpha^{3i} + A^2 + B\alpha^{2i}$ is ready at the same time as $C\alpha^{i}$, $T_X$ seconds later. This is shown for a single position $i$ in Fig. 5. When the addition $A\alpha^{3i} + A^2 + B\alpha^{2i} + C\alpha^{i}$ is ready, $T_X$ seconds later, the value of the coefficient $D_2$ is also ready, and the overall ELP computation is completed $T_A + 19T_X$ seconds after data reading. Thus only one $T_X$ is not hidden by the latency required by the other computations. Finally, the ELP root test requires an $m$-input NOR gate to produce $e_i$ and the correction $\hat{v}_i = r_i + e_i$ is completed $T_A + 20T_X + T_{m\mathrm{NOR}}$ seconds after data reading.

As to logic area $\mathcal{A}$, the three linear combination blocks require $\left( 2^{m-1}(m - 2) + 1 \right)$ XOR gates each, as in the BCH2 decoder. For each of the $k$ bit positions, $(3m + 1)$ XOR gates and an $m$-input NOR are needed. In Table VI we summarize the theoretical estimates of Area and Latency of the main BCH3 blocks as a function of $m$.

C. BCH3 Decoder Implementation

The solution proposed, with $m = 9$, $k = 256$ [33], has been implemented following the Synopsys topographical synthesis methodology using a 54 nm logic gate length CMOS technology. In such implementation the decoding latency is smaller than 3 ns, and the area occupancy of the decoder is about $250 \cdot 10^3\ \mu\mathrm{m}^2$. We remark that the area overhead associated to ECC is mainly due to the additional memory required to store the parity check bits. In particular, for the mentioned implementation the overhead due to ECC circuitry is below 5% of the additional array area for the redundancy. Unlike the estimates of number of elementary gates, these results take into account the buffering of the longer interconnections as well as the routing of signals, which may increase the total decoder area occupancy. A trade-off between latency and area occupancy is possible, depending on the specific application.

VII. CONCLUSIONS

ECC solutions are being used nowadays in an increasing number of memory and storage devices. Since the latency of NAND flash memories is in the order of µs, the major concern for their ECC solutions is the optimization of the throughput. Conversely, DRAMs and storage class memory devices have access times in the order of ns. For instance, a typical access time of current DRAM devices is around 50 ns. Consequently, also the latency of the ECC decoder plays a critical role for such applications.

In this paper we have described double- and triple-error-correcting codes suitable for DRAM-like devices. We have shown that by parallelizing the decoding algorithm, we are able to achieve low-latency decoding, in the order of tens of XOR levels, that translate to around 1 ns in state-of-the-art CMOS technology. Moreover we have also shown that a high level of parallelism is possible without an exponential increase of the ECC logic area.

REFERENCES

[1] G. Atwood, “Current and emerging memory technology landscape,” in Proc. Flash Memory Summit, Santa Clara, CA, USA, Aug. 2011, pp. 1–24.
[2] (Apr. 2016). Breakthrough Nonvolatile Memory Technology. [Online]. Available: https://www.micron.com/about/emerging-technologies/3d-xpoint-technology
[3] C. Villa, D. Mills, G. Barkley, H. Giduturi, S. Schippers, and D. Vimercati, “A 45nm 1Gb 1.8V phase-change memory,” in IEEE Int. Solid-State Circuits Conf. Dig. Tech. Papers (ISSCC), San Francisco, CA, USA, Feb. 2010, pp. 270–271.
[4] T.-Y. Liu et al., “A 130.7mm² 2-layer 32Gb ReRAM memory device in 24nm technology,” in IEEE Int. Solid-State Circuits Conf. Dig. Tech. Papers (ISSCC), San Francisco, CA, USA, Feb. 2013, pp. 210–211.
[5] W. Otsuka et al., “A 4Mb conductive-bridge resistive memory with 2.3GB/s read-throughput and 216MB/s program-throughput,” in IEEE Int. Solid-State Circuits Conf. Dig. Tech. Papers (ISSCC), San Francisco, CA, USA, Feb. 2011, pp. 210–211.
[6] K. Tsuchida et al., “A 64Mb MRAM with clamped-reference and adequate-reference schemes,” in Proc. IEEE Int. Solid-State Circuits Conf. Dig. Tech. Papers (ISSCC), San Francisco, CA, USA, Feb. 2010, pp. 258–259.
[7] K. R. Udayakumar et al., “Low-power ferroelectric random access memory embedded in 180nm analog friendly CMOS technology,” in Proc. 5th IEEE Int. Memory Workshop (IMW), Monterey, CA, USA, May 2013, pp. 128–131.
[8] N. F. Mott, Metal-Insulator Transitions, 2nd ed. London, U.K.: Taylor & Francis, 1990.
[9] Low Power Double Data Rate 4 (LPDDR4), JEDEC, document JESD209-4, Aug. 2014. [Online]. Available: https://www.jedec.org/
[10] B. L. Ji et al., “In-line-test of variability and bit-error-rate of HfOx-based resistive memory,” in Proc. IEEE Int. Memory Workshop (IMW), May 2015, pp. 1–4.
[11] Y. Emre, C. Yang, K. Sutaria, Y. Cao, and C. Chakrabarti, “Enhancing the reliability of STT-RAM through circuit and system level techniques,” in Proc. IEEE Workshop Signal Process. Syst. (SiPS), Oct. 2012, pp. 125–130.
[12] D. Strukov, “The area and latency tradeoffs of binary bit-parallel BCH decoders for prospective nanoelectronic memories,” in Proc. 40th Asilomar Conf. Signals, Syst. Comput. (ACSSC), Pacific Grove, CA, USA, Oct./Nov. 2006, pp. 1183–1187.
[13] J. Yeon, S.-J. Yang, C. Kim, and H. Lee, “Low-complexity triple-error-correcting parallel BCH decoder,” J. Semicond. Technol. Sci., vol. 13, no. 5, pp. 465–472, Oct. 2013.
[14] X. Wang, D. Wu, L. Pan, R. Zhou, and C. Hu, “An on-chip high-speed 4-bit BCH decoder in MLC NOR flash memories,” in Proc. IEEE Asian Solid-State Circuits Conf., Taipei, Taiwan, Nov. 2009, pp. 229–232.
[15] C. Badack, T. Kern, and M. Gössel, “Modified DEC BCH codes for parallel correction of 3-bit errors comprising a pair of adjacent errors,” in Proc. 20th IEEE IOLTS, Girona, Spain, Jul. 2014, pp. 116–121.
[16] I. Yoo and I.-C. Park, “A search-less DEC BCH decoder for low-complexity fault-tolerant systems,” in Proc. IEEE Workshop Signal Process. Syst. (SiPS), Belfast, U.K., 2014, pp. 1–6.
[17] X. Zhang, VLSI Architectures for Modern Error-Correcting Codes. New York, NY, USA: Taylor & Francis, 2015.
[18] E. D. Mastrovito, “VLSI designs for multiplication over finite fields $GF(2^m)$,” in Proc. 6th Int. Conf. Appl. Algebra, Algebraic Algorithms Error-Correcting Codes (AAECC), Jul. 1989, pp. 297–309, doi: 10.1007/3-540-51083-4_67.
[19] B. Sunar and Ç. K. Koç, “Mastrovito multiplier for all trinomials,” IEEE Trans. Comput., vol. 48, no. 5, pp. 522–527, May 1999.
[20] H. Fan and M. A. Hasan, “A survey of some recent bit-parallel $GF(2^n)$ multipliers,” Finite Fields Appl., vol. 32, pp. 5–43, Mar. 2015.
[21] H. Fan and Y. Dai, “Fast bit-parallel $GF(2^n)$ multiplier for all trinomials,” IEEE Trans. Comput., vol. 54, no. 4, pp. 485–490, Apr. 2005.
[22] S.-M. Park and K.-Y. Chang, “Low complexity bit-parallel squarer for $GF(2^n)$ defined by irreducible trinomials,” IEICE Trans. Fundam. Electron. Commun. Comput. Sci., vol. E89-A, no. 9, pp. 2451–2452, Sep. 2006.
[23] X. Xiong and H. Fan, “Bit-parallel $GF(2^n)$ squarer using shifted polynomial basis,” Cryptol. ePrint Arch., Tech. Rep. 2012/626, 2012. [Online]. Available: http://eprint.iacr.org/2012/626
[24] H. Fan and M. A. Hasan, “Fast bit parallel-shifted polynomial basis multipliers in $GF(2^n)$,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 53, no. 12, pp. 2606–2615, Dec. 2006.
[25] A. Cilardo, “Fast parallel $GF(2^m)$ polynomial multiplication for all degrees,” IEEE Trans. Comput., vol. 62, no. 5, pp. 929–943, May 2013.
[26] R. C. Mullin, I. M. Onyszchuk, S. A. Vanstone, and R. M. Wilson, “Optimal normal bases in $GF(p^n)$,” Discrete Appl. Math., vol. 22, no. 2, pp. 149–161, 1989.
[27] B. Sunar and C. K. Koc, “An efficient optimal normal basis type II multiplier,” IEEE Trans. Comput., vol. 50, no. 1, pp. 83–87, Jan. 2001.
[28] A. Reyhani-Masoleh and M. A. Hasan, “A new construction of Massey–Omura parallel multiplier over $GF(2^m)$,” IEEE Trans. Comput., vol. 51, no. 5, pp. 511–520, May 2002.
[29] S. Lin and D. J. Costello, Error Control Coding, 2nd ed. Upper Saddle River, NJ, USA: Prentice-Hall, 2004.
[30] C. Kraft, “Closed solution of Berlekamp’s algorithm for fast decoding of BCH codes,” IEEE Trans. Commun., vol. 39, no. 12, pp. 1721–1725, Dec. 1991.
[31] H. Lee, “High-speed VLSI architecture for parallel Reed–Solomon decoder,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 11, no. 2, pp. 288–294, Apr. 2003.
[32] P. Amato, C. Laurent, M. Sforzin, S. Bellini, M. Ferrari, and A. Tomasoni, “Ultra fast, two-bit ECC for emerging memories,” in Proc. 6th IEEE Int. Memory Workshop (IMW), Taipei, Taiwan, May 2014, pp. 79–82.
[33] M. Ferrari, A. Tomasoni, S. Bellini, P. Amato, M. Sforzin, and C. Laurent, “Embedded ECC solutions for emerging memories (PCMs),” in Proc. DATE, 2016, p. 6. [Online]. Available: https://kluedo.ub.uni-kl.de/frontdoor/index/index/year/2016/docId/4320

Paolo Amato received the Laurea (cum laude) degree in computer science from the University of Milano, Italy, in 1997, and the Ph.D. degree in computer science from the University of Milano-Bicocca in 2013.
Dr. Amato joined Micron in 2010, where he investigates storage and memory architectures based on mainstream and emerging technologies for next generations of mobile systems. From 1998 to 2008, he was with STMicroelectronics, Agrate, Italy. From 2000 to 2005, he was the Leader of the Methodology for Complexity Team (a global research and development team of about 30 people in Milan, Napoli, Catania, Moscow, and Singapore), which is aimed at developing methods, algorithms, and software tools for complex system management. From 2005 to 2010, he was the Manager of Statistical Methods for the Non Volatile Memory-Technology Development Group, ST, and Numonyx, and he started the development of ECC solutions for PCM devices. He is currently a Distinguished Member of the Technical Staff with Micron Semiconductor Italia S.r.l. He is also an expert on statistical methods, error correcting codes and security, and of their application to mobile systems. He has authored over 40 papers published in peer-reviewed international journals and international conferences, and filed over 20 patents.

Sandro Bellini received the Dr.Ing. (cum laude) degree in electronic engineering from the Politecnico di Milano, Italy, in 1971. He began his research activity with the Italian National Research Council. He became an Associate Professor in 1982, and has been a Full Professor of telecommunications since 1990.
During this time, his main research themes have been analysis and design of pulse modulation systems, digital frequency modulation with continuous phase, efficient simulation techniques based on importance sampling, computer emission tomography, multicarrier demodulation, channel equalization, blind equalization, symbol and carrier synchronization, digital data storage on optical support, and advanced coding techniques. In recent years, his research activity has mainly been devoted to channel coding, with an emphasis on turbo codes, LDPC codes, and turbo product codes.

Marco Ferrari (M’06) was born in Milano, Italy, in 1971. He received the Laurea (cum laude) degree in telecommunications engineering and the Ph.D. degree in electronics and communication engineering from the Politecnico di Milano, Italy, in 1996 and 2000, respectively.
Since 2001, he has been a Researcher with the Department of Electronics, Information and Bioengineering, Institute IEIIT of the Italian National Research Council (CNR), Politecnico di Milano. In 2002, he was an EPRSC Research Fellow with the University of Plymouth, U.K. He has co-authored tens of scientific publications in leading international journals and conference proceedings, and a few patents. His main research interests are in digital transmission, information theory and channel coding. He is a member of the IEEE Communications Society.

Christophe Laurent received the Laurea degree in microelectronics and robotics from Polytech Montpellier, France, and Politecnico di Torino, Italy, in 1998. From 1998 to 2002, he was a Consultant with Alcatel, STMicroelectronics, Stepmind, and Philips Semiconductors, in France and Italy, designing GSM baseband circuits, Bluetooth baseband circuits, and cryptographic-related peripherals. From 2002 to 2008, he designed novel architectures for flash NOR, including ECC circuits for STMicroelectronics and Numonyx. He is currently SMTS with Micron Semiconductor Italia S.r.l., Vimercate, Italy, where he is involved in defining, implementing and synthesizing novel emerging memories architectures. He is serving as an Expert in ECC code development and implementation, and serving as a Consultant for all the design teams.
Mr. Laurent received three company recognition awards, filed eight international patents and published one article in TLP Technical Journal. He also received the Best Technical Paper Award for an article in the SNUG France 2012.

Marco Sforzin received the Laurea (cum laude) degree in electronic engineering from the Politecnico di Milano, Italy, in 1997. From 1997 to 1998, he made many tutorials in support of the dynamical system theory course and the automatic control systems course with Politecnico di Milano. From 1998 to 1999, he completed the military service. In 1999, he joined the Flash Memory Design Team of the MPG with STM, Agrate Brianza, Italy, where he was involved in development of SLC NOR flash memory devices. In 2001, he was appointed Project Leader of MLC NOR flash memory prototype for wireless applications and later Design Manager with the task of bringing MLC devices into production. In 2007, he joined the Advanced Architectures Design Team during the transition STM/Numonyx/Micron, where he was involved in development of phase change memory devices. In 2013, he was with the Mobile Business Unit Research and Development with Micron. He is currently Senior Member of Technical Staff with Micron Semiconductor Italia S.r.l. His expertise domains include analog design, high-speed design, full-chip mixed signals design and validation, ECC for memory and storage applications, emerging memory technologies, analytical, and statistical modeling. His interests also include information theory, system theory, and neural networks.
Mr. Sforzin has authored four international conference papers and over 20 patents.

Alessandro Tomasoni (M’09) was born in Milano, Italy, in 1980. He received the M.S. (cum laude) degree in telecommunications engineering in 2005, and the Ph.D. degree in information engineering from the Politecnico di Milano, Italy, in 2009.
Since 2012, he has been a Researcher with the Dipartimento di Elettronica, Informazione e Bioingegneria, Italian National Research Council, Institute of Electronics, Computer and Telecommunication Engineering, Politecnico di Milano, Italy. From 2009 to 2011, he was a Temporary Research Assistant with the Dipartimento di Elettronica, Informazione e Bioingegneria. In 2008, he was a Visiting Researcher with the Viterbi School of Engineering, University of Southern California, Los Angeles, CA, USA. He was a consultant for leading industries, such as STMicroelectronics, Alcatel-Lucent and Micron. He has co-authored several publications on journal papers and conference proceedings, and is co-inventor of many patents. His main research interests are in wireless communications, optical communications and solid state data storage, with emphasis on information theory, advanced channel coding and channel estimation.
