Sphere Decoder for Massive MIMO Systems

Dimitris Vordonis and Vassilis Paliouras


Electrical and Computer Engineering Dept.
University of Patras, Greece
Email: ece8292@upnet.gr, paliuras@ece.upatras.gr

978-1-7281-2769-9/19/$31.00 © 2019 IEEE

Authorized licensed use limited to: University of Surrey. Downloaded on June 06,2022 at 14:17:26 UTC from IEEE Xplore. Restrictions apply.

Abstract—The increasing demand for higher data rates and for more connected devices has led to Massive MIMO (MMIMO) technology. The large number of antennas makes the Maximum-Likelihood (ML) detector infeasible to implement because of its high complexity, despite its optimal performance. The Sphere Decoder (SD) has a bit error rate (BER) performance similar to that of the ML detector, which makes it more effective (0.5–1.25 dB gain) than the Linear Detectors (LD) proposed in the literature. However, the low complexity of LD and the non-deterministic behavior of SD are the main reasons that prohibit the use of sphere decoding methods in MMIMO systems. The results of this paper challenge conventional thinking and show that there may be a future for SD in certain MMIMO systems. The number of visited nodes during detection and the Initial Radius (IR) method are crucial for the computational complexity of SD. In this paper, an effective IR method that significantly decreases the complexity and the number of visited nodes is proposed. Furthermore, an optimization of the tree search further reduces the number of visited nodes and, in combination with a one-node-per-cycle architecture, minimizes the latency and makes SD attainable for large-scale systems for $E_b/N_0 \ge 4$ dB. Hardware aspects are investigated for both a Virtex-7 FPGA and a 28 nm ASIC technology.

Index Terms—massive MIMO, sphere decoding, initial radius, MIMO detection, eigenvalue problem, norm distance

I. INTRODUCTION

The ML detector yields the optimal solution to the MIMO detection problem, but its complexity is sensitive to high-order modulation schemes and to large-scale systems. SD provides a BER performance similar to that of the ML detector [1], [2], [3], with the advantage that the complexity of SD is polynomial, in contrast with the complexity of the ML detector, which rises exponentially with the number of transmit and receive antennas. SD detects the transmitted signal vector by searching only the candidate vectors that lie inside a hypersphere of radius $R_0$ around the received signal $\mathbf{r}$, in contrast with the ML detector, where all candidate vectors are examined [1], [4].

The method employed to calculate IR has a critical impact on the complexity of SD [2], [3]. A large initial value of the sphere radius can lead to an exhaustive search among numerous possible transmitted symbols, while a very small radius may leave no lattice point inside the sphere, so that the search has to be restarted with a new estimate of IR. In MIMO systems IR selection is not mandatory, since the tree search contains a small number of nodes; in MMIMO, however, the absence of an IR method leads to an exhaustive search, which is not feasible for hardware implementations.

There is a wide range of methods that can be employed for IR selection. Some of them simply set IR to a constant value or to infinity, without the need for extra hardware [5], [6], ignoring a possible optimization of the search time. A different approach is to take the noise distribution into account before selecting the IR value [4], [5], [6], [7]. These methods are effective only for MIMO systems. In MMIMO technology, the suboptimal-solution-based approach is used, where either a low-complexity detector or a pre-processing block initializes the sphere radius [5], [6]. This method requires additional hardware to compute the suboptimal solution, but achieves a good estimate of IR, hence the search time decreases significantly.

The proposed method computes the IR value by combining the estimate of a known linear detector with an approximate calculation of the 2-norm of a matrix. The choice of the detector is optimal in terms of complexity, since it exploits the SD algorithm. For the computation of the 2-norm, two approaches are examined and evaluated, with complexity reduction as the main purpose. The first approach relies on the estimation of the largest eigenvalue of a matrix, while the second one computes the 1-norm or the infinity norm of a matrix instead of the 2-norm. To approximate the largest eigenvalue of a matrix, a simplified form of the power iteration algorithm is used. Apart from complexity reduction, these approaches feature low latency, since no recursive process is required either for the approximation of the largest eigenvalue, in contrast with the classic power method, or for the computation of the 1-norm or infinity norm of a matrix.

LD are used in MMIMO technology, despite the BER degradation. For large-scale systems and large constellation sizes, SD is considered prohibitive because of its high computational complexity and its non-deterministic behavior [8]. In this paper, apart from an efficient IR method, an optimization of the SD algorithm regarding node pruning is proposed. Various optimizations of sphere decoding methods are described in [9], [10]. In addition, a one-node-per-cycle architecture is described, which significantly decreases the latency of the detection. In [8], a one-node-per-cycle architecture is also presented; the differences are explained in this paper. These optimizations make SD feasible for larger schemes, when $E_b/N_0 \ge 4$ dB, compared to the limits determined by the literature.

The remainder of the paper is organised as follows: Section II reviews the system model and the detection problem of SD in MIMO systems. Section III describes the algorithm
of SD and a proposed optimization for more efficient tree searching, and presents the BER performance compared to a linear detector. Section IV analyzes various approaches considered for computing IR and proposes a method adapted for hardware implementation. Section V details the one-node-per-cycle architecture and reports the results of an FPGA implementation compared to an ASIC implementation. Section VI evaluates IR methods effective for large-scale systems and demonstrates that SD is feasible in MMIMO systems. Finally, Section VII discusses conclusions.

II. SYSTEM MODEL

In this section, the organization of the simulation system model is presented. An MMIMO system is assumed with $N_T$ transmit and $N_R$ receive antennas ($N_R \gg N_T$). For the uplink communication, each user transmits a sequence of bits, mapped to constellation symbols using Quadrature Amplitude Modulation (QAM) of $M$ points. Let $\mathcal{C}$ denote the set of all possible constellation symbols. The frequency-domain symbols are then transformed into time-domain symbols through an $N$-point IFFT before transmission over the wireless channel. During transmission, a Rayleigh fading channel and additive white Gaussian noise (AWGN) are assumed.

The received signal $\mathbf{r}$ is derived as $\mathbf{r} = \mathbf{H}\mathbf{s} + \mathbf{n}$, where $\mathbf{s} \in \mathcal{C}^{N_T} = [s_1, s_2, \ldots, s_{N_T}]^T$ is the vector of transmitted symbols, $\mathbf{r} \in \mathbb{C}^{N_R} = [r_1, r_2, \ldots, r_{N_R}]^T$ is the vector of received symbols and $\mathbf{n} = [n_1, n_2, \ldots, n_{N_R}]^T$ is the vector of independent and identically distributed complex AWGN with zero mean and variance $N_0$. The complex transfer function from transmitter $j$ to receiver $i$ is represented by a matrix $\mathbf{H}$ with dimensions $N_R \times N_T$, where all the $h_{ij}$ are Rayleigh fading channel coefficients and represent the channel gain. As channel correlation is low when $N_R \gg N_T$ [11], an uncorrelated channel is assumed for all simulations. Provided that the channel matrix is known at the receiver, the detection problem of SD [1], [2] is

$$\hat{\mathbf{s}}_{ml} = \arg\min_{\mathbf{s}} \|\mathbf{r} - \mathbf{H}\mathbf{s}\|^2 \le R_0^2. \tag{1}$$

The whole process is presented in Fig. 1.

Fig. 1. Block diagram of the massive MIMO system used for simulation. [Blocks: Source, $M$-QAM Mapper, $N$-point IFFT, channel $\mathbf{H}$ with AWGN $\mathbf{n}$, $N$-point FFT, Sphere Decoder, $M$-QAM Demapper, Sink, BER; signals $\mathbf{s}$ and $\mathbf{r}$.]

III. SPHERE DECODER

A. Sphere Decoder Algorithm

As described in [1], the sphere constraint (1) can be written as

$$\hat{\mathbf{s}}_{ml} = \arg\min_{\mathbf{s}} \|\mathbf{U}(\mathbf{s} - \hat{\mathbf{s}})\|^2 \le R_0^2, \tag{2}$$

where $\mathbf{U}$ is the upper-triangular Cholesky factor of the Gram matrix $\mathbf{H}^H\mathbf{H}$ and $\hat{\mathbf{s}}$ is the Least Squares estimate of $\mathbf{s}$,

$$\hat{\mathbf{s}} = (\mathbf{H}^H\mathbf{H})^{-1}\mathbf{H}^H\mathbf{r}. \tag{3}$$

The algorithm can be considered as a tree search through $N_T$ levels, where every node has $M$ branches. SD handles a limited area of a lattice field depending on IR; therefore the detection has a metric constraint equal to $R_0^2$. Let $D_i$ denote the accumulated Euclidean metric down to antenna level $i$,

$$D_i = \underbrace{u_{ii}^2\,|s_i - z_i|^2}_{d_i} + \underbrace{\sum_{j=i+1}^{N_T} u_{jj}^2\,|s_j - z_j|^2}_{D_{i+1}} \le R_0^2, \tag{4}$$

where $d_i$ is the partial Euclidean metric from antenna level $i$, $D_{i+1}$ is the accumulated Euclidean metric down to antenna level $i+1$ and $z_i$ is derived from

$$z_i = \hat{s}_i - \sum_{j=i+1}^{N_T} \frac{u_{ij}}{u_{ii}}\,(s_j - \hat{s}_j). \tag{5}$$

B. Improved Sphere Decoder (ISD)

As mentioned, the number of visited nodes is crucial for the computational complexity and the latency of SD. Apart from the sphere constraint of (4), the ISD also checks whether $d_i$ is smaller than 1–2% of the estimated IR. Unless $d_i$ satisfies that condition, (4) does not need to be checked. In (4) the value of $R_0$ is updated when a solution path is found, in contrast with the constraint on $d_i$, where the reference value stays fixed. This additional restriction contributes significantly to node pruning, without an impact on BER performance, as shown later. A similar concept, in which each node has its own constraint, is mentioned in [1].

C. BER performance

SD achieves ML performance, which is its main advantage over LD. Based on the system described in Fig. 1, the BER performance of the ISD is examined through MATLAB simulations and compared to the Zero-Forcing (ZF) detector. The FFT length $N$ is 256 for all cases and the total number of processed $M$-QAM symbols equals $N_T \times N$. The tested MMIMO schemes and the simulation results are presented in Fig. 2 and in Table V. It must be noted that the simulation results are in agreement with [12]. Other possible MMIMO schemes are not meaningful in this context because of their high computational complexity and latency.

Fig. 2. BER performance of ISD compared to ZF detector. [Plot: BER ($10^{-4}$ to $10^{-1}$) versus $E_b/N_0$ (0 to 8 dB); curves: ISD and ZF for 128x16 with 16QAM, 32QAM and 64QAM, and for 128x32 with 16QAM.]

IV. INITIAL RADIUS SELECTION

Many techniques have been proposed in the literature for selecting IR. However, most of them focus on systems with a limited number of antennas, in contrast with this paper, which targets MMIMO technology. In that case the estimation of a proper IR is unavoidable in order to decrease the number of visited nodes. However, IR selection is considered a complex procedure for large-scale systems, hence a simplified and efficient method has to be established.
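To make the role of the radius concrete, the tree search of (4) and (5) can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the 3-level QPSK example, the matrix `U`, the estimate `S_HAT` and all function names are hypothetical, and the additional per-node test of the ISD is not included. The radius is taken here from the metric of the rounded Least Squares estimate, one simple way of guaranteeing a non-empty sphere.

```python
import math

def metric(U, s, s_hat):
    """||U (s - s_hat)||^2 for an upper-triangular U, as in (2)."""
    n = len(s)
    total = 0.0
    for i in range(n):
        row = sum(U[i][j] * (s[j] - s_hat[j]) for j in range(i, n))
        total += abs(row) ** 2
    return total

def sphere_decode(U, s_hat, constellation, r0_sq=math.inf):
    """Depth-first search of the tree in (4)-(5).

    Returns (best vector or None, its metric or r0_sq, visited nodes).
    """
    n = len(s_hat)
    s = [0j] * n
    state = {"best": None, "radius": r0_sq, "visited": 0}

    def search(i, d_next):
        # z_i from (5): centre of level i given the symbols fixed at levels j > i
        z = s_hat[i] - sum(U[i][j] / U[i][i] * (s[j] - s_hat[j])
                           for j in range(i + 1, n))
        for c in constellation:
            state["visited"] += 1
            d_i = abs(U[i][i]) ** 2 * abs(c - z) ** 2  # partial metric d_i
            big_d = d_i + d_next                        # accumulated metric D_i
            if big_d > state["radius"]:
                continue                                # outside the sphere: prune
            s[i] = c
            if i == 0:
                # a full path was found: keep it and shrink the radius
                state["best"], state["radius"] = list(s), big_d
            else:
                search(i - 1, big_d)

    search(n - 1, 0.0)
    return state["best"], state["radius"], state["visited"]

# Hypothetical toy data (assumption, not taken from the paper)
QPSK = [1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j]
U = [[2.0, 0.3 - 0.2j, -0.1 + 0.4j],
     [0.0, 1.5, 0.2 + 0.1j],
     [0.0, 0.0, 1.2]]
S_HAT = [0.8 + 1.1j, -1.2 + 0.7j, 0.9 - 0.6j]

# Radius from the rounded LS estimate, with a tiny slack against rounding error
rounded = [min(QPSK, key=lambda c: abs(c - x)) for x in S_HAT]
r0_sq = metric(U, rounded, S_HAT) * (1 + 1e-9)

best_inf, m_inf, visited_inf = sphere_decode(U, S_HAT, QPSK)
best_ir, m_ir, visited_ir = sphere_decode(U, S_HAT, QPSK, r0_sq)
print("visited with infinite radius:", visited_inf)
print("visited with finite radius:  ", visited_ir)
```

Both runs return the same minimum-metric vector; the finite radius can only remove nodes from the search, which is the effect that becomes decisive for trees with 16 or 32 levels.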

A. IR using ZF estimate

The ZF detector suppresses the intersymbol interference while disregarding the noise. The solution of the ZF detector can be used for IR selection [5] as follows

$$R_0^2 = \|\mathbf{r} - \mathbf{H}\hat{\mathbf{x}}_{ZF}\|^2, \tag{6}$$

where $\hat{\mathbf{x}}_{ZF}$ is the transmitted symbol estimate of the ZF detector [13] and can be written as

$$\hat{\mathbf{x}}_{ZF} = (\mathbf{H}^H\mathbf{H})^{-1}\mathbf{H}^H\mathbf{r}. \tag{7}$$

The optimal case, where the sphere contains only one lattice point, occurs when the estimate of the ML detector is used. Nevertheless, because of its high complexity, it is common to use other detectors with reduced complexity.

B. IR using MMSE estimate

The Minimum Mean Square Error (MMSE) detector is a compromise between suppressing intersymbol interference and suppressing noise. If $\hat{\mathbf{x}}_{ZF}$ in (6) is replaced with $\hat{\mathbf{x}}_{MMSE}$, the transmitted symbol estimate of the MMSE detector, a new approach for IR selection is obtained [5]. The MMSE detector requires the statistical information of the noise, and after equalization $\hat{\mathbf{x}}_{MMSE}$ is obtained from [13]

$$\hat{\mathbf{x}}_{MMSE} = \left(\mathbf{H}^H\mathbf{H} + \frac{1}{SNR}\mathbf{I}\right)^{-1}\mathbf{H}^H\mathbf{r}, \tag{8}$$

where $\mathbf{I}$ indicates the identity matrix.

C. Proposed Method

A new method for choosing IR is proposed, which combines the computation of IR from the ZF estimate with an approximate calculation of the 2-norm of a matrix. Among the analyzed methods, the ZF detector is chosen since (3) and (7) are equivalent. The SD algorithm cannot avoid the computation of (3), therefore the result can be reused in order to reduce the computational complexity. Let $\mathbf{A}$ be defined as an $N \times N_R$ matrix, where

$$\mathbf{A} = \mathbf{r} - \mathbf{H}\hat{\mathbf{x}}_{ZF} = \mathbf{r} - \mathbf{H}\hat{\mathbf{s}}. \tag{9}$$

Then, due to (6), the IR can be written as

$$R_0^2 = \|\mathbf{A}\|^2. \tag{10}$$

The 2-norm of a matrix $\mathbf{A}$ is the largest singular value of $\mathbf{A}$, which equals the square root of the maximum eigenvalue of the positive-semidefinite matrix $\mathbf{A}^H\mathbf{A}$ [14]. Consequently (10) is rewritten as

$$R_0^2 = \|\mathbf{A}\|^2 = \left(\sqrt{\lambda_{max}(\mathbf{A}^H\mathbf{A})}\right)^2 = \|\mathbf{B}\| = \lambda_{max}(\mathbf{B}), \tag{11}$$

where $\mathbf{B}$ is an $N_R \times N_R$ matrix equal to $\mathbf{A}^H\mathbf{A}$. Therefore what remains is to compute the largest eigenvalue of $\mathbf{B}$. Another approach, which is also examined below, is to approximate the 2-norm of $\mathbf{A}$ or the 2-norm of $\mathbf{B}$ with other norms, such as the 1-norm and the infinity norm.

1) Based on the largest eigenvalue (M1)

For the computation of the largest eigenvalue a known method, namely power iteration, is used [14], [15]. The whole method is described in Alg. 1.

Algorithm 1: Power iteration algorithm
Input: The $N_R \times N_R$ matrix $\mathbf{B}$
Output: The greatest eigenvalue $\lambda_{max}(\mathbf{B})$
1 Set $\mathbf{b}_0$ to a random vector with dimension $N_R \times 1$. It should have a nonzero component in the direction of the dominant eigenvector;
2 for $k = 1$; $k \le Z$; $k$++ do
    $\mathbf{b}_k = \dfrac{\mathbf{B}\mathbf{b}_{k-1}}{\|\mathbf{B}\mathbf{b}_{k-1}\|}$
3 $\lambda_{max}(\mathbf{B}) = \mathbf{b}_k^H\mathbf{B}\mathbf{b}_k$;

Power iteration estimates the eigenvalue of greatest absolute value of a matrix without computing a matrix decomposition. It can therefore be used for matrices with large dimensions, with the drawback that it converges slowly; hence a large number of iterations may be required. Let $Z$ denote the number of iterations the power method needs to converge. In case $\mathbf{B}$ is diagonalizable, the method converges with ratio $|\lambda_2/\lambda_1|$, where $\lambda_1 = \lambda_{max}$ and $\lambda_2$ is the eigenvalue with the second largest absolute value [14]. Therefore the power method converges slowly if there is an eigenvalue close in magnitude to the greatest eigenvalue.

Given that $\mathbf{B}$ is diagonalizable, line 2 of Alg. 1 can be transformed, according to [14], as

$$\mathbf{b}_k = \frac{\mathbf{B}^k\mathbf{b}_0}{\|\mathbf{B}^k\mathbf{b}_0\|}. \tag{12}$$

As a result, the method computes the dominant eigenvector, which corresponds to the largest eigenvalue, without the requirement for an iterative process. The approximation of $\lambda_{max}$ depends on the number of iterations $k$, where a high value of $k$ provides a good estimate of the real value. If $k = Z$, the largest eigenvalue is calculated in full precision. However, as follows from (12), $k$ denotes the number of times $\mathbf{B}$ is multiplied by itself, therefore a large value can lead to numerical overflows in a hardware implementation.

This approach takes advantage of (12) and, in combination with some variations described below, yields a good estimate of $\lambda_{max}$ with reduced complexity. Instead of line 3 of Alg. 1, the largest eigenvalue is estimated as

$$\lambda_{max}(\mathbf{B}) = \frac{\mathbf{b}_k^H\mathbf{B}\mathbf{b}_k}{\mathbf{b}_k^H\mathbf{b}_k}, \tag{13}$$

which is known as the Rayleigh quotient [15]. It must be noted that the product $\mathbf{b}_k^H\mathbf{b}_k$ equals 1 when the method converges, therefore line 3 of Alg. 1 and (13) are equivalent for $k = Z$. Exploiting (13), it can be shown that the denominator of (12) is not needed. Assuming that

$$\mathbf{c}_k = \mathbf{B}^k\mathbf{b}_0, \tag{14}$$

it can be derived from (12) and (14) that $\mathbf{b}_k = \mathbf{c}_k/\mu$, where $\mu$ is a constant equal to $\|\mathbf{B}^k\mathbf{b}_0\|$. Hence it follows that

$$\frac{\mathbf{b}_k^H\mathbf{B}\mathbf{b}_k}{\mathbf{b}_k^H\mathbf{b}_k} = \frac{(\mathbf{c}_k/\mu)^H\,\mathbf{B}\,(\mathbf{c}_k/\mu)}{(\mathbf{c}_k/\mu)^H(\mathbf{c}_k/\mu)} = \frac{\mathbf{c}_k^H\mathbf{B}\mathbf{c}_k}{\mathbf{c}_k^H\mathbf{c}_k}. \tag{15}$$

The denominator of the ratio in line 2 of Alg. 1 normalizes $\mathbf{b}_k$ at each iteration, and thus plays a significant role in the classic power iteration algorithm by preventing numerical overflows. The method proposed in this paper estimates the greatest eigenvalue without this iterative step. Instead of lines 2 and 3 of Alg. 1, the proposed method relies on (13)–(15). The overflow issues caused by dropping the normalization factor are addressed by an appropriate initialization of $\mathbf{b}_0$. The vector $\mathbf{b}_0$ acts as a scaling factor with all its elements equal to $2^{-\phi}$, where $\phi$ is the number of fractional bits used in a hardware implementation. Therefore the multiplication in (14) is reduced to shift-and-add operations.
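The equivalence in (15) can be checked numerically. The sketch below is illustrative only: the 2 × 2 Hermitian matrix `B`, the value `PHI = 8` and the function names are assumptions chosen for the example, not values from the paper. It compares the normalized iteration of Alg. 1 with the unnormalized vector $\mathbf{c}_k$ of (14), both followed by the Rayleigh quotient (13).

```python
def matvec(B, x):
    """Matrix-vector product B x for lists of (complex) numbers."""
    return [sum(B[i][j] * x[j] for j in range(len(x))) for i in range(len(B))]

def vdot(x, y):
    """Inner product x^H y."""
    return sum(a.conjugate() * b for a, b in zip(x, y))

def rayleigh(B, x):
    """Rayleigh quotient x^H B x / x^H x, as in (13); real for Hermitian B."""
    return (vdot(x, matvec(B, x)) / vdot(x, x)).real

def power_iteration(B, b0, iters):
    """Alg. 1 style: normalize after every multiplication by B."""
    b = list(b0)
    for _ in range(iters):
        c = matvec(B, b)
        norm = abs(vdot(c, c)) ** 0.5
        b = [v / norm for v in c]
    return rayleigh(B, b)

def unnormalized_estimate(B, b0, k):
    """(14)-(15): c_k = B^k b0 without normalization, then the Rayleigh quotient."""
    c = list(b0)
    for _ in range(k):
        c = matvec(B, c)
    return rayleigh(B, c)

# Hypothetical Hermitian example with eigenvalues 3 + sqrt(3) and 3 - sqrt(3)
B = [[4, 1 + 1j],
     [1 - 1j, 2]]
PHI = 8                        # assumed number of fractional bits
b0 = [2.0 ** -PHI] * 2         # all elements equal to 2^-phi

one_step_normalized = power_iteration(B, b0, 1)
one_step_unnormalized = unnormalized_estimate(B, b0, 1)
converged = power_iteration(B, b0, 60)
print(one_step_normalized, one_step_unnormalized, converged)
```

For the same $k$ the two estimates coincide, and the scale $2^{-\phi}$ of $\mathbf{b}_0$ cancels in the quotient, so only the direction of $\mathbf{b}_0$ affects the estimate. The one-step value lies below the converged eigenvalue, as expected for a Rayleigh quotient of a Hermitian matrix.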

Specifically, the proposed approach chooses $k = 1$ for (14), which means that the estimate is the result of one iteration only, hence there is an impact on accuracy. A poor approximation of the largest eigenvalue influences the performance of SD, as an IR that is too large can lead to an exhaustive search, while a very small IR produces an empty sphere. In this approach, the risk is limited to the case of an empty sphere only, since the Rayleigh quotient of a Hermitian matrix never exceeds its largest eigenvalue. To address this and ensure the existence of a lattice point inside the sphere, a constant $\delta$ is added to the estimate of the largest eigenvalue as

$$\lambda_{max}(\mathbf{B}) = \frac{\mathbf{c}_1^H\mathbf{B}\mathbf{c}_1}{\mathbf{c}_1^H\mathbf{c}_1} + \delta = \xi + \delta, \tag{16}$$

where $\xi = \dfrac{\mathbf{c}_1^H\mathbf{B}\mathbf{c}_1}{\mathbf{c}_1^H\mathbf{c}_1}$ indicates the estimate of $\lambda_{max}$ for $k = 1$.

Based on the system in Fig. 1, for various $M$ and $N$, a possible value of $\delta$ is estimated through MATLAB simulations. Simulations have been performed for different levels of noise. For all examined cases, the convergence ratio $|\lambda_2/\lambda_1|$ is close to 1, therefore a large number of iterations is required. The estimated value of the greatest eigenvalue after $k$ iterations is derived as $\lambda_{max} = \xi + \theta_k\xi$, where $0 < \theta_k < 1$. This creates an upper bound equal to $2\xi$ for the value of $\lambda_{max}$. Consequently the proposed value of $\delta$, which ensures the existence of at least one lattice point inside the sphere, is $\delta = \xi$.

Since $\mathbf{B}$ is symmetric, the numerator of $\xi$ can be written as

$$\sum_{j=1}^{N_R}\Big(\sum_{i=1}^{N_R} c_{1i}^{*}B_{ij}\Big)c_{1j} = \sum_{i=1}^{N_R}|c_{1i}|^2 B_{ii} + 2\sum_{j=1}^{N_R}\sum_{i<j}\mathrm{Re}\big(c_{1i}^{*}\,c_{1j}\,B_{ij}\big), \tag{17}$$

where $c_{1i}^{*}$ indicates the conjugate of $c_{1i}$. Therefore the number of required multiplications decreases significantly. The whole modified algorithm is presented in Alg. 2.

Algorithm 2: Modified Power iteration algorithm
Input: The $N_R \times N_R$ matrix $\mathbf{B}$
Output: The greatest eigenvalue $\lambda_{max}(\mathbf{B})$
1 Set all elements of the $N_R \times 1$ vector $\mathbf{b}_0$ equal to $2^{-\phi}$;
2 $\mathbf{c}_1 = \mathbf{B}\mathbf{b}_0$;
3 $\lambda_{max}(\mathbf{B}) = \dfrac{\mathbf{c}_1^H\mathbf{B}\mathbf{c}_1}{\mathbf{c}_1^H\mathbf{c}_1} + \delta$;

2) Based on 1-norm and infinity norm (M2)

As mentioned before, the 2-norm of a matrix can be estimated using the 1-norm or the infinity norm. A similar approach is taken in [8], but in this paper the approximation refers to a matrix. Let $\mathbf{D}$ be defined as a $v \times m$ complex matrix; then the 1-norm of $\mathbf{D}$ is

$$\|\mathbf{D}\|_1 = \max_{1 \le j \le m}\sum_{i=1}^{v}|d_{ij}|, \tag{18}$$

which is the maximum absolute column sum of the matrix [15]. The infinity norm of $\mathbf{D}$ is defined as

$$\|\mathbf{D}\|_\infty = \max_{1 \le i \le v}\sum_{j=1}^{m}|d_{ij}|, \tag{19}$$

which is the maximum absolute row sum of the matrix [15]. The matrix does not have to be square, therefore the IR can be estimated either as $\|\mathbf{A}\|^2$ or as $\|\mathbf{B}\|$ according to (11). Depending on the dimensions of the array, either fewer sums of more terms or more sums of fewer terms must be calculated. Having many terms in each sum may cause numerical overflow issues, while more sums need more parallel arithmetic units. It must be noted that $\|\mathbf{B}\|_1 = \|\mathbf{B}\|_\infty$ because $\mathbf{B}$ is symmetric.

The magnitude of a complex number requires the computation of a square root. It can be avoided by using the approximations [8]

$$|d_{ij}| \approx |\mathrm{Re}\,d_{ij}| + |\mathrm{Im}\,d_{ij}| \tag{20}$$

or

$$|d_{ij}| \approx \max\big(|\mathrm{Re}\,d_{ij}|,\,|\mathrm{Im}\,d_{ij}|\big). \tag{21}$$

The above approximations do not differ greatly in complexity, but (20) produces higher values than (21). When the IR is estimated as $\|\mathbf{B}\|$, it is not necessary to compute the magnitude of all elements of the matrix, because $\mathbf{B}$ is symmetric. The decrease in computations is especially noticeable when $N \gg N_R$. However, to achieve this, a complex matrix multiplication is required to compute $\mathbf{B}$.

In terms of accuracy, the approximation of the 2-norm according to (18) and (19), whether the IR is calculated as $\|\mathbf{A}\|^2$ or as $\|\mathbf{B}\|$, leads to a large estimate of IR. Nevertheless it has the advantage that, for a specific MMIMO scheme, the divergence is stable regardless of the level of noise. This means that the estimate of IR can be improved with an appropriate scale factor, exploiting in this way the low complexity of this method. The divergence of the estimate is influenced neither by $M$ nor by $N_T$, provided that $N_R \gg N_T$. The estimate is only affected by the dimensions of the matrix whose 2-norm is approximated, namely the FFT length $N$ and the number of receive antennas $N_R$. In particular, when the IR is estimated as $\|\mathbf{B}\|$, there is no dependence on $N$. Scaling is suggested to take place before the 2-norm approximation, by appropriately prescaling the matrix entries, to prevent wordlength increase.

V. ONE-NODE-PER-CYCLE ARCHITECTURE

Apart from complexity, latency is also a major problem for sphere decoding methods in MMIMO systems. The number of visited nodes during the decoding process is the fundamental parameter of that problem. With a one-node-per-cycle architecture, which is the optimal case, the clock periods required for detection are minimized. In order to achieve the minimum latency at each node, all memory resources are fully partitioned. Furthermore, the loop that implements the sum in (5) is fully unrolled. The quotients in (5) require only the data produced by the Cholesky decomposition, which are available before the tree search. Therefore these quotients are computed before executing the SD algorithm.

In [8], a one-node-per-cycle architecture is also described. There are differences in the tree search. In [8] two units work in parallel: one is used for forward recursion and the other is responsible for the next node in case of pruning or when a leaf is found. In this paper, one unit estimates the next node whether or not the constraints are satisfied. In addition, in [8] the IR is set to infinity, in contrast with the approach proposed here, where IR has a great effect on node pruning. This is reasonable, as in [8] the tree had four levels, while in this paper 16 or 32 levels are used depending on the examined case.
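The precomputation of the quotients in (5), described at the start of this section, can be sketched as follows. This is a software illustration under assumed toy values (the matrix `U`, the vectors `s` and `s_hat`, and the function names are hypothetical); it only shows that, once the quotients $u_{ij}/u_{ii}$ are stored, evaluating $z_i$ during the tree search needs multiplications and additions but no divisions.

```python
def precompute_quotients(U):
    """q[i][j] = u_ij / u_ii for j > i; depends only on the Cholesky factor U."""
    n = len(U)
    return [[U[i][j] / U[i][i] if j > i else 0.0 for j in range(n)]
            for i in range(n)]

def centre(i, q, s, s_hat):
    """z_i from (5), using the stored quotients (no division in the search)."""
    return s_hat[i] - sum(q[i][j] * (s[j] - s_hat[j])
                          for j in range(i + 1, len(s)))

def centre_direct(i, U, s, s_hat):
    """z_i from (5) evaluated literally, dividing at every node (reference)."""
    return s_hat[i] - sum(U[i][j] / U[i][i] * (s[j] - s_hat[j])
                          for j in range(i + 1, len(s)))

# Hypothetical toy data (assumption, not taken from the paper)
U = [[2.0, 0.3 - 0.2j, -0.1 + 0.4j],
     [0.0, 1.5, 0.2 + 0.1j],
     [0.0, 0.0, 1.2]]
s_hat = [0.8 + 1.1j, -1.2 + 0.7j, 0.9 - 0.6j]
s = [1 + 1j, -1 + 1j, 1 - 1j]   # symbols already fixed at the deeper levels

q = precompute_quotients(U)     # done once, before the tree search starts
z = [centre(i, q, s, s_hat) for i in range(len(s))]
print(z)
```

In hardware the sum over $j$ would be unrolled into parallel multipliers and an adder tree; the Python `sum` here stands in for that structure only conceptually.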

The computational complexity of the one-node-per-cycle component is presented in Table I. The number of computations depends on the level of the tree each time, hence the mean and the worst case are shown. The implementation results of the one-node-per-cycle component are in Table II. Hardware implementations have been mapped to a VT690T FPGA device. Implementation results based on an ASIC 28 nm standard-cell library are also presented in Table III. The ASIC results are obtained by synthesis using a Cadence design flow. The tree search procedure takes place $N$ times, and all searches are independent of each other. Regarding the FPGA implementation, the percentage of area utilization determines the number of times this component can be replicated. On ASIC devices there is no area restriction, but the cost increases as the design grows. This increase in hardware resources decreases the latency of SD and increases the achieved throughput rate.

TABLE I
COMPUTATIONAL COMPLEXITY OF ONE-NODE-PER-CYCLE COMPONENT
Arithmetic Unit | $N_T = 16$, Mean Case | $N_T = 16$, Worst Case | $N_T = 32$, Mean Case | $N_T = 32$, Worst Case
Complex Multipliers | 9 | 16 | 17 | 32
Multipliers | 2 | 2 | 2 | 2
Adders | 18 | 31 | 34 | 63

TABLE II
HARDWARE IMPLEMENTATION RESULTS OF ONE-NODE-PER-CYCLE COMPONENT - FPGA DEVICE
Implementation results | 128 × 16, M = 16 | 128 × 16, M = 32 | 128 × 16, M = 64 | 128 × 32, M = 16
Maximum Frequency (MHz) | 56.12 | 52.06 | 52.06 | 43.96
DSP48 | 86 (2%) | 86 (2%) | 86 (2%) | 364 (10%)
FF | 225 (0%) | 362 (0%) | 367 (0%) | 298 (0%)
LUT | 6238 (1%) | 9879 (2%) | 16222 (3%) | 13241 (3%)
*The percentages denote the utilization score according to the selected board

TABLE III
HARDWARE IMPLEMENTATION RESULTS OF ONE-NODE-PER-CYCLE COMPONENT - ASIC DEVICE FOR 28 NM TECHNOLOGY
Implementation results | 128 × 16, M = 16 | 128 × 16, M = 32 | 128 × 16, M = 64 | 128 × 32, M = 16
Maximum Frequency (MHz) | 3460.2 | 3984.1 | 3649.6 | 3472.2
Area* | 35.0 k | 38.5 k | 47.4 k | 92.0 k
*referring to instances of cells of a 28 nm standard-cell library

VI. EVALUATION

The one-node-per-cycle architecture minimizes the latency at each node. However, in order to be feasible for implementation, sphere decoding methods must have a complexity comparable to that of LD, and the number of visited nodes during detection must be as low as possible in order to achieve low latency and a high throughput rate. Otherwise the gain in BER performance may be too expensive.

A. Proposed IR evaluation methods

The use of IR has a critical impact on node pruning in MMIMO, in contrast with MIMO systems, where it is not so important. In MMIMO technology, the convergence ratio $|\lambda_2/\lambda_1|$ of the power method is usually close to 1, therefore a large number of iterations is required. Both proposed approaches are non-iterative, and for that reason they do not impose high latency. Therefore the selection of the most suitable IR method is determined by the computational complexity. As shown in Table IV, given that $N = 256$, M1 and M2 using the $\mathbf{B}$ matrix have almost equal complexity. M2 using the $\mathbf{A}$ matrix has the lowest complexity, hence it is the preferred method. The complexity of the IR methods is independent of the modulation scheme. It is noted that all these IR techniques are equally effective at node pruning.

TABLE IV
COMPUTATIONAL COMPLEXITY OF IR METHODS
IR method | Complex Multipliers, $N_T = 16$ | Complex Multipliers, $N_T = 32$ | Multipliers, $N_T = 16$ | Multipliers, $N_T = 32$ | Adders, $N_T = 16$ | Adders, $N_T = 32$
M1 | 4.73e+06 | 5.25e+06 | - | - | 4.71e+06 | 5.23e+06
M2 using B matrix | 4.72e+06 | 5.24e+06 | - | - | 4.72e+06 | 5.24e+06
M2 using A matrix | 5.24e+05 | 1.05e+06 | 1 | 1 | 5.57e+05 | 1.08e+06

B. Number of visited nodes

As mentioned, the number of visited nodes is a crucial factor for the latency of detection. Table V shows the significant decrease in the number of visited nodes. The results correspond to one tree search, and the displayed number of visited nodes is the average value obtained through simulations, given that $E_b/N_0 = 5$ dB. The examined cases with $M = 64$ or $N_T = 32$ are effective for $E_b/N_0 \ge 5$ dB, in contrast with the others, which are effective for $E_b/N_0 \ge 4$ dB. Table V reveals the contribution of the IR method to node pruning, especially in combination with ISD. Table V also contains the total number of nodes, the visited nodes without any optimization of IR and the decoded bits after one tree search for all examined cases. This large reduction in visited nodes, together with the one-node-per-cycle architecture, solves the latency problem and is a main contribution.

C. Comparison with ZF detector

Apart from latency, the computational complexity of ISD compared to ZF has to be examined. The complexity of the required matrix inversion is calculated according to [16]. The computational complexity of the ZF detector is in Table VI, given that $N = 256$, while Table V presents a comparison with the ISD. For the computational complexity of ISD in Table V, no duplicates of the one-node-per-cycle component are assumed. ISD requires the ZF estimate, a Cholesky decomposition of the symmetric Gram matrix and the IR estimation before the execution of the tree search. The IR is computed by M2 using the $\mathbf{A}$ matrix. The IR component and the Cholesky decomposition have a combined complexity almost equal to that of the ZF detector. Therefore, even without the computations of the tree search for all the transmitted symbols, ISD has almost double the complexity of ZF. This explains, with the help of Table V, why the number of visited nodes is important not only for latency but also for the computational complexity.

According to Table V there is a significant complexity increase for $M = 64$, which restricts the modulation order to at most 32 when $N_T = 16$. For the examined case with the largest gain in BER performance, the complexity is almost 7 times higher. As mentioned before, the one-node-per-cycle component can be replicated. So despite the increase in computations, the latency of detection can be constrained and the throughput can be enhanced.

TABLE V
SUMMARY FOR THE EXAMINED MMIMO SCHEMES
Feature | 128 × 16, M = 16 | 128 × 16, M = 32 | 128 × 16, M = 64 | 128 × 32, M = 16
Total nodes | 1.97e+19 | 1.25e+24 | 8.05e+28 | 3.63e+38
Visited nodes** ($R_0 = \infty$) | 5.68e+05 | 3.64e+06 | 6.45e+07 | 7.60e+08
Visited nodes** (IR) | 5.24e+04 | 1.08e+06 | 1.13e+07 | 3.85e+08
Visited nodes** (IR + ISD) | 256 | 568 | 5261 | 1927
Decoded bits | 64 | 80 | 96 | 128
BER gain of ISD* (dB) | 0.5 | 0.6 | 0.7 | 1.25
Computational Complexity of ISD* | ×2.81 | ×3.98 | ×21.5 | ×7.25
* compared to ZF detector ** when $E_b/N_0 = 5$ dB

TABLE VI
COMPUTATIONAL COMPLEXITY OF ZF DETECTOR
Arithmetic Unit | $N_T = 16$ | $N_T = 32$
Complex Multipliers | 6.15e+05 | 1.51e+06
Multipliers | - | -
Adders | 6.09e+05 | 1.50e+06

TABLE VII
REQUIRED BITS FOR ALL EXAMINED CASES
Decoding method | 128 × 16, M = 16 | 128 × 16, M = 32 | 128 × 16, M = 64 | 128 × 32, M = 16
ZF | 24 | 24 | 24 | 26
ISD | 32 | 33 | 34 | 36

TABLE VIII
THROUGHPUT RATE FOR FPGA AND ASIC IMPLEMENTATION
Device | 128 × 16, M = 16 | 128 × 16, M = 32 | 128 × 16, M = 64 | 128 × 32, M = 16
FPGA | 575.2 Mbps | 300.6 Mbps | 24.7 Mbps | 26.3 Mbps
ASIC (c1) | 35467.1 Mbps | 23006.6 Mbps | 1731.5 Mbps | 2075.8 Mbps
ASIC (c2) | 24221.5 Mbps | 14028.4 Mbps | 1398.5 Mbps | 2306.4 Mbps
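Several rows of Tables V and VIII can be cross-checked with simple arithmetic. The sketch below rests on assumptions that are not stated explicitly in the paper: that "Total nodes" counts every node of a full $M$-ary tree with $N_T$ levels below the root, that each tree search yields $N_T \log_2 M$ bits, that one search occupies a unit for as many clock cycles as visited nodes, and that case (c2) fits as many whole components as one million cell instances allow. The function names are hypothetical.

```python
def total_nodes(m, n_t):
    """Nodes of a full M-ary tree with N_T levels below the root (assumed model)."""
    return sum(m ** level for level in range(1, n_t + 1))

def decoded_bits(m, n_t):
    """Bits per tree search: N_T symbols of log2(M) bits (M a power of two)."""
    return n_t * (m.bit_length() - 1)

def unit_throughput_mbps(f_max_mhz, bits, visited_nodes):
    """One node per cycle: a search takes `visited_nodes` cycles and yields `bits`."""
    return f_max_mhz * bits / visited_nodes

# Values copied from Tables II, III and V for the 128 x 16, M = 16 scheme
bits = decoded_bits(16, 16)
visited = 256                   # visited nodes with IR + ISD (Table V)
fpga_unit = unit_throughput_mbps(56.12, bits, visited)
asic_unit = unit_throughput_mbps(3460.2, bits, visited)

# Case (c2): assumed to be the number of whole components within 1e6 cell instances
c2_copies = 1_000_000 // 35_000
c2_throughput = c2_copies * asic_unit

print(f"total nodes        : {total_nodes(16, 16):.2e}")
print(f"decoded bits       : {bits}")
print(f"FPGA, single unit  : {fpga_unit:.2f} Mbps")
print(f"ASIC (c2), {c2_copies} units: {c2_throughput:.1f} Mbps")
```

Under these assumptions the node counts and decoded bits reproduce the corresponding rows of Table V, and the (c2) estimate for the first scheme agrees with Table VIII to within rounding. The single-unit FPGA rate is well below the tabulated value, which is consistent with the component being replicated.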
Minimizing the number of bits required for data representation is crucial for hardware complexity. The wordlength for the examined cases is in Table VII. Notably, it is not necessary to have a constant wordlength across the complete architecture. Table VII shows the maximum wordlength, which is determined by the IR and the possible values of the accumulated distance $D_i$. These variables require the most integer bits. Further optimization can lead to shorter wordlengths in certain parts of the architecture.

D. Detection throughput rate

Another examined metric is the throughput rate. In order to have multiple tree searches in parallel, duplicates of the one-node-per-cycle component are required. For the FPGA device, the computed number of duplicates is based on the limits of our board regarding the area utilization of Table II. For the ASIC device with 28 nm technology, several cases are examined. The first case (c1) assumes the same number of duplicates as on the FPGA device, while the second case (c2) has a constraint of one million instances of cells of a 28 nm standard-cell library. The number of decoded bits and the mean number of visited nodes per tree search for all the examined cases are in Table V. The attained throughput rate for both FPGA and ASIC implementations is in Table VIII.

VII. CONCLUSION

SD has a BER performance similar to that of the ML detector, and is therefore more effective than LD. However, sphere decoding methods are known for their high latency and computational complexity. In this paper, a low-complexity IR method is proposed, which in combination with ISD significantly decreases the number of visited nodes. In addition, a one-node-per-cycle architecture reduces the latency, making ISD feasible for hardware implementation in MMIMO when $E_b/N_0 \ge 4$ dB. The hardware area complexity of the one-node-per-cycle component, reported here for an FPGA implementation, is crucial for managing multiple tree searches in parallel. This parallelism increases the throughput rate for both FPGA and ASIC implementations. As expected, ASIC implementations may support high throughput rates.

REFERENCES
[1] L. G. Barbero and J. S. Thompson, “Fixing the complexity of the sphere decoder for MIMO detection,” IEEE Transactions on Wireless Communications, vol. 7, no. 6, pp. 2131–2142, 2008.
[2] G. Kapfunde, Y. Sun, and N. Alinier, “An improved sphere decoder for MIMO systems,” in 2012 IEEE 8th International Conference on Wireless and Mobile Computing, Networking and Communications (WiMob). IEEE, 2012, pp. 533–537.
[3] S. Mrinalee, H. P. Garg, G. Mathur, and R. Yadav, “Improved radius selection in sphere decoder for MIMO system,” in 2014 International Conference on Computing for Sustainable Global Development (INDIACom). IEEE, 2014, pp. 161–165.
[4] S. Yang, L. Jianping, and C. Chaoshi, “An novel initial radius scheme of sphere decoding in MIMO system,” in 2010 International Conference on Multimedia Communications. IEEE, 2010, pp. 164–167.
[5] B. Cheng, W. Liu, Z. Yang, and Y. Li, “A new method for initial radius selection of sphere decoding,” in 2007 12th IEEE Symposium on Computers and Communications. IEEE, 2007, pp. 19–24.
[6] F. Eshagh Hosseini and S. Shirvani Moghaddam, “Controlling initial and final radii to achieve a low-complexity sphere decoding technique in MIMO channels,” International Journal of Antennas and Propagation, vol. 2012, 2012.
[7] S. Yang, L. Jianping, and C. Chaoshi, “A novel initial radius selection with simplify sphere decoding algorithm in MIMO system,” in International Conference on Information and Management Engineering. Springer, 2011, pp. 395–402.
[8] A. Burg, M. Borgmann, M. Wenk, M. Zellweger, W. Fichtner, and H. Bolcskei, “VLSI implementation of MIMO detection using the sphere decoding algorithm,” IEEE Journal of Solid-State Circuits, vol. 40, no. 7, pp. 1566–1577, 2005.
[9] M. M. Mansour, S. P. Alex, and L. M. Jalloul, “Reduced complexity soft-output MIMO sphere detectors Part I: Algorithmic optimizations,” IEEE Transactions on Signal Processing, vol. 62, no. 21, pp. 5505–5520, 2014.
[10] M. M. Mansour, S. P. Alex, and L. M. Jalloul, “Reduced complexity soft-output MIMO sphere detectors Part II: Architectural optimizations,” IEEE Transactions on Signal Processing, vol. 62, no. 21, pp. 5521–5535, 2014.
[11] J. Hoydis, C. Hoek, T. Wild, and S. ten Brink, “Channel measurements for large antenna arrays,” in 2012 International Symposium on Wireless Communication Systems (ISWCS). IEEE, 2012, pp. 811–815.
[12] J. Minango, C. Altamirano, and C. de Almeida, “Performance difference between zero-forcing and maximum likelihood detectors in massive MIMO systems,” Electronics Letters, vol. 54, no. 25, pp. 1464–1466, 2018.
[13] C.-Y. Hung and W.-H. Chung, “An improved MMSE-based MIMO detection using low-complexity constellation search,” in 2010 IEEE Globecom Workshops. IEEE, 2010, pp. 746–750.
[14] T. S. Shores, Applied linear algebra and matrix analysis. Springer, 2007, vol. 2541.
[15] F. Chatelin, Eigenvalues of Matrices: Revised Edition. SIAM, 2012.
[16] A. Thanos and V. Paliouras, “Hardware trade-offs for massive MIMO uplink detection based on Newton iteration method,” in 2017 6th International Conference on Modern Circuits and Systems Technologies (MOCAST). IEEE, 2017, pp. 1–4.
