Sphere Decoder For Massive MIMO Systems
Abstract—The increasing demand for higher data rates and for more connected devices has led to Massive MIMO (MMIMO) technology. The large number of antennas makes the Maximum-Likelihood (ML) detector infeasible to implement due to its high complexity, despite its optimal performance. The Sphere Decoder (SD) has a bit error rate (BER) performance similar to the ML detector, therefore making it more efficient (0.5–1.25 dB gain) than the Linear Detectors (LD) proposed in the literature. However, the low complexity of LD and the non-deterministic behavior of SD are the main reasons that prohibit the use of sphere decoding methods in MMIMO systems. The results of this paper disrupt conventional thinking and show that there may be a future for SD in certain MMIMO systems. The number of visited nodes during detection and the Initial Radius (IR) method are crucial for the computational complexity of SD. In this paper, an effective IR method, which significantly decreases the complexity and the number of visited nodes, is proposed. Furthermore, an optimization of the tree search further reduces the number of visited nodes; in combination with an implementation featuring a one-node-per-cycle architecture, it minimizes the latency and makes the SD attainable for large-scale systems for \(E_b/N_0 \geq 4\) dB. Hardware aspects are investigated for both a Virtex-7 FPGA and a 28 nm ASIC technology.

Index Terms—massive MIMO, sphere decoding, initial radius, MIMO detection, eigenvalue problem, norm distance

I. INTRODUCTION

The ML detector yields the optimal solution to the MIMO detection problem, but its complexity is susceptible to high-order modulation schemes and to large-scale systems. SD provides a BER performance similar to the ML detector [1], [2], [3], with the advantage that the complexity of SD is polynomial, in contrast with the ML detector's complexity, which rises exponentially with the number of transmit and receive antennas. SD intends to detect the transmitted signal vector by searching only the candidate vectors which lie inside a hypersphere with radius \(R_0\) around the received signal \(\mathbf{r}\), in contrast with the ML detector, where all candidate vectors are examined [1], [4].

The method employed to calculate IR has a critical impact on the complexity of SD [2], [3]. A large initial value of the sphere radius can lead to an exhaustive search among numerous possible transmitted symbols, while in contrast a very small radius may contain no lattice point inside the sphere, and the search will have to be restarted with a new estimation of IR. In MIMO systems IR selection is not mandatory, since the tree search contains a small number of nodes; however, in MMIMO the absence of an IR method leads to an exhaustive search, which is not feasible for hardware implementations.

There is a wide range of methods which can be employed for the IR selection. Some of them simply set IR to a constant value or infinity without the need of extra hardware [5], [6], ignoring a possible optimization of the searching time. A different approach is to take into account the noise distribution before selecting the IR value [4], [5], [6], [7]. These methods are effective only for MIMO systems. In MMIMO technology, the suboptimal-solution-based approach is used, where either a low-complexity detector or a pre-processing block initializes the sphere radius [5], [6]. This method requires additional hardware in order to compute the suboptimal solution, but achieves a good estimation of IR; hence the searching time decreases significantly.

The proposed method computes the IR value by combining an estimate of a known linear detector and an approximate calculation of the 2-norm of a matrix. The choice of the detector is optimal in terms of complexity, exploiting the SD algorithm. For the computation of the 2-norm, two approaches are examined and evaluated, with complexity reduction as the main purpose. The first approach relies on the estimation of the largest eigenvalue of a matrix, while the second one computes the 1-norm or the infinity norm of a matrix instead of the 2-norm. In order to approximate the largest eigenvalue of a matrix, the power iteration algorithm is used in a simplified way. Apart from complexity reduction, these approaches feature low latency, since no recursive process is required either for the approximation of the largest eigenvalue, in contrast with the classic power method, or for the computation of the 1-norm or infinity norm of a matrix.

LD are used in MMIMO technology, despite the BER degradation. For large-scale systems and large constellation sizes, SD is considered prohibitive due to its high computational complexity and its non-deterministic behavior [8]. In this paper, apart from an efficient IR method, an optimization of the SD algorithm regarding node pruning is proposed. Various optimizations of sphere decoding methods are described in [9], [10]. In addition, a one-node-per-cycle architecture is described, which significantly decreases the latency of the detection. In [8], a one-node-per-cycle architecture is also presented. The differences are explained in this paper. These optimizations make SD feasible for larger schemes, when \(E_b/N_0 \geq 4\) dB, compared to the limits determined by the literature.

The remainder of the paper is organised as follows: Section II reviews the system model and the detection problem of SD in MIMO systems. Section III describes the algorithm

978-1-7281-2769-9/19/$31.00 © 2019 IEEE
Authorized licensed use limited to: University of Surrey. Downloaded on June 06,2022 at 14:17:26 UTC from IEEE Xplore. Restrictions apply.
adapted for implementation on hardware. Section V details the one-node-per-cycle architecture and reveals the results of an FPGA implementation compared to an ASIC implementation. Section VI evaluates IR methods effective for large-scale systems and proves that SD is feasible in MMIMO systems. Finally, Section VII discusses conclusions.

II. SYSTEM MODEL

In this section, the organization of the simulation system model is presented. An MMIMO system is assumed with \(N_T\) transmit and \(N_R\) receive antennas (\(N_R \gg N_T\)). For the uplink communication, each user transmits a sequence of bits, mapped to constellation symbols by using Quadrature Amplitude Modulation (QAM) of \(M\) points. Let \(\mathcal{C}\) denote the set of all possible constellation symbols. Then the frequency-domain symbols are transformed into time-domain symbols through an \(N\)-point IFFT before the transmission over the wireless channel. During transmission a Rayleigh fading channel and additive white Gaussian noise (AWGN) are assumed.

The received signal \(\mathbf{r}\) is derived as \(\mathbf{r} = \mathbf{H}\mathbf{s} + \mathbf{n}\), where \(\mathbf{s} \in \mathcal{C}^{N_T} = [s_1, s_2, \ldots, s_{N_T}]^T\) is the vector of transmitted symbols, \(\mathbf{r} \in \mathbb{C}^{N_R} = [r_1, r_2, \ldots, r_{N_R}]^T\) is the vector of received symbols and \(\mathbf{n} = [n_1, n_2, \ldots, n_{N_R}]^T\) is the vector of independent and identically distributed complex AWGN with zero mean and variance \(N_0\). The complex transfer function from transmitter \(j\) to receiver \(i\) is represented by a matrix \(\mathbf{H}\) with dimensions \(N_R \times N_T\), where all the \(h_{ij}\) are Rayleigh fading channel coefficients and depict the channel gain. As channel correlation is low when \(N_R \gg N_T\) [11], an uncorrelated channel is assumed for all simulations. Provided that the channel matrix is known at the receiver, the detection problem of SD [1], [2] is

\[ \hat{\mathbf{s}}_{ml} = \arg\min_{\mathbf{s}} \|\mathbf{r} - \mathbf{H}\mathbf{s}\|^2 \leq R_0^2. \tag{1} \]

The whole process is presented in Fig. 1.

Fig. 1. Block diagram of the massive MIMO system used for simulation. [Blocks shown: Source, M-QAM Mapper, N-point IFFT, channel H with AWGN n, N-point FFT, Sphere Decoder, M-QAM Demapper, Sink, BER.]

III. SPHERE DECODER

A. Sphere Decoder Algorithm

As described in [1], the sphere constraint (1) can be written as

\[ \hat{\mathbf{s}}_{ml} = \arg\min_{\mathbf{s}} \|\mathbf{U}(\mathbf{s} - \hat{\mathbf{s}})\|^2 \leq R_0^2, \tag{2} \]

where \(\mathbf{U}\) is the Cholesky factor of the Gram matrix \(\mathbf{H}^H\mathbf{H}\) and \(\hat{\mathbf{s}}\) is the Least Squares estimate of \(\mathbf{s}\),

\[ \hat{\mathbf{s}} = (\mathbf{H}^H\mathbf{H})^{-1}\mathbf{H}^H\mathbf{r}. \tag{3} \]

The algorithm can be considered as a tree search through \(N_T\) levels, where every node has \(M\) branches. SD manages a limited area of a lattice field depending on IR; therefore the detection has a metric constraint equal to \(R_0^2\). Let \(D_i\) denote the accumulated Euclidean metric down to antenna level \(i\),

\[ D_i = \underbrace{u_{ii}^2\,|s_i - z_i|^2}_{d_i} + \underbrace{\sum_{j=i+1}^{N_T} u_{jj}^2\,|s_j - z_j|^2}_{D_{i+1}} \leq R_0^2, \tag{4} \]

where \(d_i\) is the partial Euclidean metric from antenna level \(i\), \(D_{i+1}\) is the accumulated Euclidean metric down to antenna level \(i + 1\) and \(z_i\) is derived from

\[ z_i = \hat{s}_i - \sum_{j=i+1}^{N_T} \frac{u_{ij}}{u_{ii}}\,(s_j - \hat{s}_j). \tag{5} \]

B. Improved Sphere Decoder (ISD)

As mentioned, the number of visited nodes is crucial for the computational complexity and the latency of SD. Apart from the sphere constraint of (4), the ISD also checks whether \(d_i\) is smaller than 1–2% of the estimated IR. If \(d_i\) does not satisfy this, (4) does not need to be checked. In (4) the value of \(R_0\) is updated when a solution path is found, in contrast with the constraint on \(d_i\), where the comparison value remains fixed. This additional restriction contributes significantly to node pruning, without having an impact on BER performance, as shown later. A similar concept, in which each node has its own constraint, is mentioned in [1].

C. BER performance

SD achieves ML performance, and this is its main advantage over LD. Based on the system described in Fig. 1, the BER performance of the ISD is examined through MATLAB simulations and compared to the Zero-Forcing (ZF) detector. The length of the FFT \(N\) is 256 for all cases and the total number of processed \(M\)-QAM symbols equals \(N_T \times N\). The tested MMIMO schemes and the simulation results are presented in Fig. 2 and in Table V. It must be noted that the simulation results are in agreement with [12]. Other possible MMIMO schemes are meaningless in this context due to high computational complexity and latency.

Fig. 2. BER performance of ISD compared to ZF detector. [Curves shown: ISD and ZF for 128x16 with 16QAM, 32QAM and 64QAM, and for 128x32 with 16QAM; vertical axis: BER.]

IV. INITIAL RADIUS SELECTION

There are many techniques proposed in the literature in order to select IR. However, most of them focus on systems with a limited number of antennas, in contrast with this paper, which targets MMIMO technology. In that case the estimation of a proper IR is unavoidable in order to decrease the number of visited nodes. However IR selection
is considered a complex procedure for large-scale systems, hence a simplified and efficient method has to be established.

A. IR using ZF estimate

The ZF detector brings down the intersymbol interference, disregarding the noise. The solution of the ZF detector can be used for IR selection [5] as follows

\[ R_0^2 = \|\mathbf{r} - \mathbf{H}\hat{\mathbf{x}}_{ZF}\|^2, \tag{6} \]

where \(\hat{\mathbf{x}}_{ZF}\) is the transmitted symbol estimate of the ZF detector [13] and can be written as

\[ \hat{\mathbf{x}}_{ZF} = (\mathbf{H}^H\mathbf{H})^{-1}\mathbf{H}^H\mathbf{r}. \tag{7} \]

The optimal case, where the sphere contains only one lattice point, is when the estimate of the ML detector is used. Nevertheless, due to its high complexity, it is common to use other detectors with reduced complexity.

B. IR using MMSE estimate

The Minimum Mean Square Error (MMSE) detector is an intermediate solution between intersymbol interference and noise. If \(\hat{\mathbf{x}}_{ZF}\) in (6) is replaced with \(\hat{\mathbf{x}}_{MMSE}\), where \(\hat{\mathbf{x}}_{MMSE}\) is the transmitted symbol estimate of the MMSE detector, a new approach for IR selection is created [5]. The MMSE detector requires the statistical information of the noise, and after equalization \(\hat{\mathbf{x}}_{MMSE}\) is obtained from [13]

\[ \hat{\mathbf{x}}_{MMSE} = \left(\mathbf{H}^H\mathbf{H} + \frac{1}{SNR}\,\mathbf{I}\right)^{-1}\mathbf{H}^H\mathbf{r}, \tag{8} \]

where \(\mathbf{I}\) indicates the identity matrix.

C. Proposed Method

Combining the approach of computing the IR using the ZF estimate and an approximation of calculating the 2-norm distance of a matrix, a new method for choosing IR is proposed. Among the analyzed methods, the ZF detector is chosen since (3) and (7) are equivalent. The SD algorithm cannot avoid the computation of (3), therefore its result can be reused in order to reduce the computational complexity. Let \(\mathbf{A}\) be defined as an \(N \times N_R\) matrix, where

\[ \mathbf{A} = \mathbf{r} - \mathbf{H}\hat{\mathbf{x}}_{ZF} = \mathbf{r} - \mathbf{H}\hat{\mathbf{s}}. \tag{9} \]

Then due to (6), the IR can be written as

\[ R_0^2 = \|\mathbf{A}\|^2. \tag{10} \]

The norm of a matrix \(\mathbf{A}\) for the special case of the 2-norm is the largest singular value of \(\mathbf{A}\), which equals the square root of the maximum eigenvalue of the positive-semidefinite matrix \(\mathbf{A}^H\mathbf{A}\) [14]. Consequently (10) is rewritten as

\[ R_0^2 = \|\mathbf{A}\|^2 = \left(\sqrt{\lambda_{max}(\mathbf{A}^H\mathbf{A})}\right)^2 = \|\mathbf{B}\| = \lambda_{max}(\mathbf{B}), \tag{11} \]

where \(\mathbf{B}\) is an \(N_R \times N_R\) matrix and equals \(\mathbf{A}^H\mathbf{A}\). Therefore what remains is to compute the largest eigenvalue of \(\mathbf{B}\). Another approach, which is also examined below, is to approximate the 2-norm of \(\mathbf{A}\) or the 2-norm of \(\mathbf{B}\) with other norms such as the 1-norm and the infinity norm.

1) Based on the largest eigenvalue (M1)

For the computation of the largest eigenvalue a known method, namely power iteration, is used [14], [15]. The whole method is described in Alg. 1. Power iteration estimates the eigenvalue of a matrix that is greatest in absolute value without computing a matrix decomposition. So it can be used for matrices with large dimensions, with the drawback that it converges slowly; hence a large number of iterations may be demanded. Let \(Z\) denote the required number of iterations the power method needs to converge. In case \(\mathbf{B}\) is diagonalizable, the method converges with ratio \(\lambda_2/\lambda_1\), where \(\lambda_1 = \lambda_{max}\) and \(\lambda_2\) is the second largest eigenvalue in absolute value [14]. Therefore the power method converges slowly if an eigenvalue close in magnitude to the greatest eigenvalue exists.

Algorithm 1: Power iteration algorithm
Input: The \(N_R \times N_R\) matrix \(\mathbf{B}\)
Output: The greatest eigenvalue \(\lambda_{max}(\mathbf{B})\)
1 Set \(\mathbf{b}_0\) to a random vector with dimension \(N_R \times 1\). It should have a nonzero component in the direction of the dominant eigenvector;
2 for \(k = 1\); \(k \leq Z\); \(k\)++ do
      \(\mathbf{b}_k = \mathbf{B}\mathbf{b}_{k-1} / \|\mathbf{B}\mathbf{b}_{k-1}\|\)
3 \(\lambda_{max}(\mathbf{B}) = \mathbf{b}_k^H\mathbf{B}\mathbf{b}_k\);

Given that \(\mathbf{B}\) is diagonalizable, line 2 of Alg. 1 can be transformed, according to [14], as

\[ \mathbf{b}_k = \frac{\mathbf{B}^k\mathbf{b}_0}{\|\mathbf{B}^k\mathbf{b}_0\|}. \tag{12} \]

As a result, the method computes the dominant eigenvector, which corresponds to the largest eigenvalue, without the requirement for an iterative process. The approximation of \(\lambda_{max}\) depends on the number of iterations \(k\), where a high value of \(k\) provides a good estimate of the real value. If \(k = Z\), the largest eigenvalue is calculated in full precision. However, as follows from (12), \(k\) denotes the number of times \(\mathbf{B}\) is multiplied by itself, therefore a large value can lead to numerical overflows in a hardware implementation.

This approach takes advantage of (12) and, in combination with some variations which are described below, leads to a good estimation of \(\lambda_{max}\) with reduced complexity. Instead of line 3 of Alg. 1, the largest eigenvalue is estimated as

\[ \lambda_{max}(\mathbf{B}) = \frac{\mathbf{b}_k^H\mathbf{B}\mathbf{b}_k}{\mathbf{b}_k^H\mathbf{b}_k}, \tag{13} \]

which is known as the Rayleigh quotient [15]. It must be noted that the product \(\mathbf{b}_k^H\mathbf{b}_k\) equals 1 when the method converges, therefore line 3 of Alg. 1 and (13) are equivalent for \(k = Z\). Exploiting (13), it is shown that the denominator of (12) is not needed. Assuming that

\[ \mathbf{c}_k = \mathbf{B}^k\mathbf{b}_0, \tag{14} \]

it can be derived from (12) and (14) that \(\mathbf{b}_k = \mathbf{c}_k/\mu\), where \(\mu\) is a scalar equal to \(\|\mathbf{B}^k\mathbf{b}_0\|\). Hence it follows that

\[ \frac{\mathbf{b}_k^H\mathbf{B}\mathbf{b}_k}{\mathbf{b}_k^H\mathbf{b}_k} = \frac{(\mathbf{c}_k/\mu)^H\,\mathbf{B}\,(\mathbf{c}_k/\mu)}{(\mathbf{c}_k/\mu)^H(\mathbf{c}_k/\mu)} = \frac{\mathbf{c}_k^H\mathbf{B}\mathbf{c}_k}{\mathbf{c}_k^H\mathbf{c}_k}. \tag{15} \]

The denominator of the ratio in line 2 of Alg. 1 normalizes \(\mathbf{b}_k\) at each iteration, and thus has a significant role in the classic power iteration algorithm, preventing numerical overflows. The method proposed in this paper estimates the greatest eigenvalue without this iterative step. Instead of line 2 and line 3 of Alg. 1, the proposed method relies on (13)–(15). The overflow issues, due to the removal of the normalization factor, are addressed by an appropriate initialization of \(\mathbf{b}_0\). The vector \(\mathbf{b}_0\) acts as a scaling factor with all its elements
equal to \(2^{-\phi}\), where \(\phi\) is the number of fractional bits used in a hardware implementation. Therefore the multiplication in (14) is reduced to shift-and-add operations.

Specifically, the proposed approach chooses \(k = 1\) for (14), which means that the estimation is the result of one iteration only, hence there is an impact on accuracy. It is noted that a poor approximation of the largest eigenvalue influences the performance of SD, as a too large IR can lead to an exhaustive search; on the other hand, a very small IR produces an empty sphere. In this approach, the risk is limited to the case of an empty sphere only. To address that, and to ensure the existence of a lattice point inside the sphere, a constant \(\delta\) is added to the estimation of the largest eigenvalue as

\[ \lambda_{max}(\mathbf{B}) = \frac{\mathbf{c}_1^H\mathbf{B}\mathbf{c}_1}{\mathbf{c}_1^H\mathbf{c}_1} + \delta = \xi + \delta, \tag{16} \]

where \(\xi = \mathbf{c}_1^H\mathbf{B}\mathbf{c}_1 / (\mathbf{c}_1^H\mathbf{c}_1)\) indicates an estimation of \(\lambda_{max}\) for \(k = 1\).

Based on the system in Fig. 1, for various \(M\) and \(N\), a possible value of \(\delta\) is estimated through MATLAB simulations. Simulations have been performed for different levels of noise. For all examined cases, the convergence ratio \(\lambda_2/\lambda_1\) is close to 1, therefore a large number of iterations is required. The estimated value of the greatest eigenvalue after \(k\) iterations is derived as \(\lambda_{max} = \xi + \theta_k\xi\), where \(0 < \theta_k < 1\). This creates an upper bound equal to \(2\xi\) for the value of \(\lambda_{max}\). Consequently the proposed value of \(\delta\), which ensures the existence of at least one lattice point inside the sphere, is \(\delta = \xi\).

Since \(\mathbf{B}\) is symmetric, the numerator of \(\xi\) can be written as

\[ \sum_{j=1}^{N_R}\Big(\sum_{i=1}^{N_R} c_{1i}^{*}B_{ij}\Big)c_{1j} = \sum_{i=1}^{N_R} |c_{1i}|^2 B_{ii} + 2\sum_{i=1}^{N_R}\sum_{\substack{j=1\\ i<j}}^{N_R} \mathrm{Re}\big(c_{1i}^{*}\,c_{1j}\,B_{ij}\big), \tag{17} \]

where \(c_{1i}^{*}\) indicates the conjugate of \(c_{1i}\). Therefore the number of required multiplications decreases significantly. The whole modified algorithm is presented in Alg. 2.

Algorithm 2: Modified Power iteration algorithm
Input: The \(N_R \times N_R\) matrix \(\mathbf{B}\)
Output: The greatest eigenvalue \(\lambda_{max}(\mathbf{B})\)
1 Set all elements of the \(N_R \times 1\) vector \(\mathbf{b}_0\) equal to \(2^{-\phi}\);
2 \(\mathbf{c}_1 = \mathbf{B}\mathbf{b}_0\);
3 \(\lambda_{max}(\mathbf{B}) = \mathbf{c}_1^H\mathbf{B}\mathbf{c}_1 / (\mathbf{c}_1^H\mathbf{c}_1) + \delta\);

2) Based on 1-norm and infinity norm (M2)

As mentioned before, the 2-norm value of a matrix can be estimated using the 1-norm or the infinity norm. A similar approach takes place in [8], but in this paper the approximation refers to a matrix. Let \(\mathbf{D}\) be defined as a \(v \times m\) complex matrix; then the 1-norm of \(\mathbf{D}\) is estimated as

\[ \|\mathbf{D}\|_1 = \max_{1 \leq j \leq m} \sum_{i=1}^{v} |d_{ij}|, \tag{18} \]

which is the maximum absolute column sum of the matrix [15]. The infinity norm of \(\mathbf{D}\) is defined as

\[ \|\mathbf{D}\|_\infty = \max_{1 \leq i \leq v} \sum_{j=1}^{m} |d_{ij}|, \tag{19} \]

which is the maximum absolute row sum of the matrix [15]. The matrix does not have to be square, therefore the IR can be estimated either as \(\|\mathbf{A}\|^2\) or \(\|\mathbf{B}\|\) according to (11). Depending on the dimensions of the array, either fewer sums of more terms or more sums of fewer terms must be calculated. Having many terms in each sum may cause numerical overflow issues, while in contrast more sums need more parallel arithmetic units. It must be noted that \(\|\mathbf{B}\|_1 = \|\mathbf{B}\|_\infty\) due to the fact that \(\mathbf{B}\) is symmetric.

The magnitude of a complex number demands the computation of a square root. It can be omitted, using the approximations [8],

\[ |d_{ij}| \approx |\mathrm{Re}\,d_{ij}| + |\mathrm{Im}\,d_{ij}| \tag{20} \]

or

\[ |d_{ij}| \approx \max\big(|\mathrm{Re}\,d_{ij}|,\ |\mathrm{Im}\,d_{ij}|\big). \tag{21} \]

The above approximations do not differ to a great extent in complexity, but (20) produces higher values compared to (21). When the IR is estimated as \(\|\mathbf{B}\|\), it is not necessary to compute the magnitude for all elements of the matrix because \(\mathbf{B}\) is symmetric. Especially when \(N \gg N_R\) the decrease in computations is noticeable. However, to achieve this, a complex matrix multiplication to compute \(\mathbf{B}\) is required.

In terms of accuracy, the approximation of the 2-norm according to (18) and (19), whether the IR is calculated as \(\|\mathbf{A}\|^2\) or \(\|\mathbf{B}\|\), leads to a large estimate of IR. Nevertheless it has the advantage that, for a specific MMIMO scheme, the divergence is stable regardless of the level of noise. This means that the estimate of IR can be improved with an appropriate scale factor, exploiting in this way the low complexity of this method. The divergence of the estimation is influenced neither by \(M\) nor by \(N_T\), provided that \(N_R \gg N_T\). The estimate is only affected by the dimensions of the matrix whose 2-norm is being approximated. These factors are the length of the FFT \(N\) and the number of receive antennas \(N_R\). In particular, when the IR is estimated as \(\|\mathbf{B}\|\), the dependence on \(N\) does not exist. Scaling is suggested to take place before the 2-norm approximation, by appropriately prescaling the matrix entries, to prevent wordlength increase.

V. ONE-NODE-PER-CYCLE ARCHITECTURE

Apart from complexity, latency is also a major problem for sphere decoding methods in MMIMO systems. The number of visited nodes during the decoding process is the fundamental parameter for that problem. With a one-node-per-cycle architecture, which is the optimal case, the required clock periods for detection are minimized. In order to achieve the minimum latency at each node, all memory resources are fully partitioned. Furthermore the loop which implements (5) is fully unrolled. The quotients in (5) require only the data after the Cholesky decomposition, which are available before the tree search. Therefore these quotients are computed before executing the SD algorithm.

In [8], a one-node-per-cycle architecture is also described. There are differences in the tree search. In [8] two units work in parallel. One is used for forward recursion and the other is responsible for the next node in case of pruning or in case of leaf finding. In this paper, one unit estimates the next node
whether or not the constraints are applied. In addition, in [8] the IR is set to infinity, in contrast with the approach proposed here, where IR has a great effect on node pruning. Of course this is reasonable, as in [8] the tree had four levels, while in this paper we use 16 or 32 levels depending on the

TABLE III
HARDWARE IMPLEMENTATION RESULTS OF ONE-NODE-PER-CYCLE COMPONENT - ASIC DEVICE FOR 28 NM TECHNOLOGY

Implementation results      128 × 16   128 × 16   128 × 16   128 × 32
                            M = 16     M = 32     M = 64     M = 16
Maximum Frequency (MHz)     3460.2     3984.1     3649.6     3472.2
Area*                       35.0 k     38.5 k     47.4 k     92.0 k
TABLE V
SUMMARY FOR THE EXAMINED MMIMO SCHEMES

Feature                             128 × 16   128 × 16   128 × 16   128 × 32
                                    M = 16     M = 32     M = 64     M = 16
Total nodes                         1.97e+19   1.25e+24   8.05e+28   3.63e+38
Visited nodes** (R0 = ∞)            5.68e+05   3.64e+06   6.45e+07   7.60e+08
Visited nodes** (IR)                5.24e+04   1.08e+06   1.13e+07   3.85e+08
Visited nodes** (IR + ISD)          256        568        5261       1927
Decoded bits                        64         80         96         128
BER gain of ISD* (dB)               0.5        0.6        0.7        1.25
Computational Complexity of ISD*    ×2.81      ×3.98      ×21.5      ×7.25
* compared to ZF detector   ** when Eb/N0 = 5 dB

TABLE VI
COMPUTATIONAL COMPLEXITY OF ZF DETECTOR

TABLE VII
REQUIRED BITS FOR ALL EXAMINED CASES

Decoding method   128 × 16   128 × 16   128 × 16   128 × 32
                  M = 16     M = 32     M = 64     M = 16
ZF                24         24         24         26
ISD               32         33         34         36

TABLE VIII
THROUGHPUT RATE FOR FPGA AND ASIC IMPLEMENTATION

Device      128 × 16       128 × 16       128 × 16      128 × 32
            M = 16         M = 32         M = 64        M = 16
FPGA        575.2 Mbps     300.6 Mbps     24.7 Mbps     26.3 Mbps
ASIC (c1)   35467.1 Mbps   23006.6 Mbps   1731.5 Mbps   2075.8 Mbps
ASIC (c2)   24221.5 Mbps   14028.4 Mbps   1398.5 Mbps   2306.4 Mbps
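As an illustration of the one-iteration eigenvalue estimate of Alg. 2 and (16), the following is a minimal floating-point sketch in Python with NumPy. It is not the authors' fixed-point hardware implementation. The function name `modified_power_iteration`, the default `phi=8` and the optional `delta` argument are assumptions introduced here; when `delta` is omitted, the sketch uses the choice \(\delta = \xi\) stated in Section IV-C.

```python
import numpy as np


def modified_power_iteration(B, phi=8, delta=None):
    """One-step estimate of lambda_max(B) in the spirit of Alg. 2.

    B     : Hermitian positive-semidefinite matrix (B = A^H A in the text).
    phi   : hypothetical number of fractional bits; b0 is filled with 2**-phi.
    delta : safety margin added to the Rayleigh quotient; defaults to xi.
    """
    n = B.shape[0]
    b0 = np.full(n, 2.0 ** -phi)      # line 1: every entry of b0 equals 2^-phi
    c1 = B @ b0                       # line 2: a single un-normalised step, (14) with k = 1
    Bc1 = B @ c1
    # Rayleigh quotient xi = c1^H B c1 / (c1^H c1); np.vdot conjugates its first argument.
    xi = np.real(np.vdot(c1, Bc1)) / np.real(np.vdot(c1, c1))
    if delta is None:
        delta = xi                    # delta = xi, as proposed in the text
    return xi + delta                 # line 3 / equation (16)


def exact_lambda_max(B):
    """Reference value from a full Hermitian eigen-decomposition."""
    return np.linalg.eigvalsh(B)[-1]  # eigvalsh returns eigenvalues in ascending order
```

Because the Rayleigh quotient is invariant to the scale of `b0`, the value of `phi` only matters for fixed-point word lengths, not for the floating-point result. For a Hermitian matrix the quotient \(\xi\) never exceeds the true largest eigenvalue, which is why a positive margin \(\delta\) is added; comparing the result with `exact_lambda_max(B)` shows how much margin a given scenario needs.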
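The norm-based estimate (M2) of (18)–(21) can be sketched in the same way. The helper names below (`approx_abs_sum`, `approx_abs_max`, `one_norm`, `inf_norm`) are hypothetical and introduced only for this illustration; the sketch works in floating point and ignores the prescaling and word-length considerations discussed in the text.

```python
import numpy as np


def approx_abs_sum(d):
    """Square-root-free magnitude, equation (20): |Re d| + |Im d|."""
    return np.abs(d.real) + np.abs(d.imag)


def approx_abs_max(d):
    """Square-root-free magnitude, equation (21): max(|Re d|, |Im d|)."""
    return np.maximum(np.abs(d.real), np.abs(d.imag))


def one_norm(D, mag=np.abs):
    """Equation (18): maximum absolute column sum of D."""
    return mag(D).sum(axis=0).max()


def inf_norm(D, mag=np.abs):
    """Equation (19): maximum absolute row sum of D."""
    return mag(D).sum(axis=1).max()


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.standard_normal((16, 4)) + 1j * rng.standard_normal((16, 4))
    B = A.conj().T @ A
    print("2-norm of B        :", np.linalg.norm(B, 2))
    print("1-norm of B, exact :", one_norm(B))
    print("1-norm of B, (20)  :", one_norm(B, approx_abs_sum))
    print("1-norm of B, (21)  :", one_norm(B, approx_abs_max))
```

Since (21) never exceeds the true magnitude and (20) never falls below it, the two variants bracket the exact 1-norm or infinity norm, which matches the remark that (20) produces higher values than (21). For a Hermitian matrix such as \(\mathbf{B}\), `one_norm` and `inf_norm` coincide, and both are upper bounds on the 2-norm.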