Learning to Detect
Neev Samuel, Member, IEEE, and Tzvi Diskin, Member, IEEE and Ami Wiesel, Member, IEEE
arXiv:1805.07631v1 [cs.IT] 19 May 2018

Abstract—In this paper we consider Multiple-Input-Multiple-Output (MIMO) detection using deep neural networks. We introduce two different deep architectures: a standard fully connected multi-layer network, and a Detection Network (DetNet) which is specifically designed for the task. The structure of DetNet is obtained by unfolding the iterations of a projected gradient descent algorithm into a network. We compare the accuracy and runtime complexity of the proposed approaches and achieve state-of-the-art performance while maintaining low computational requirements. Furthermore, we manage to train a single network to detect over an entire distribution of channels. Finally, we consider detection with soft outputs and show that the networks can easily be modified to produce soft decisions.

Index Terms—MIMO Detection, Deep Learning, Neural Networks.

I. INTRODUCTION

Sphere decoding algorithms [3], [4], based on lattice search, were proposed, offering better computational complexity with only a modest accuracy degradation relative to the full search. In the other regime, the most common suboptimal detectors are the linear receivers, i.e., the matched filter (MF), the decorrelator or zero forcing (ZF) detector, and the minimum mean squared error (MMSE) detector. More advanced detectors are based on decision feedback equalization (DFE), approximate message passing (AMP) [5] and semidefinite relaxation (SDR) [6], [7]. Currently, both AMP and SDR provide near optimal accuracy under many practical scenarios. AMP is simple and cheap to implement in practice, but is an iterative method that may diverge in challenging settings. SDR is more robust and has polynomial complexity, but is limited in the settings it addresses and is much slower in practice.
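As a minimal illustration of a linear receiver (our own sketch, not from the paper), a zero forcing detector inverts a real linear channel y = Hx + w with the pseudo-inverse and rounds each entry to the nearest constellation point:

```python
import numpy as np

def zero_forcing_detector(H, y, constellation):
    """Zero forcing (ZF) detection for a real linear model y = Hx + w.

    Illustrative sketch: invert the channel with the pseudo-inverse,
    then round each entry to the nearest constellation point.
    """
    x_zf = np.linalg.pinv(H) @ y  # unconstrained least-squares estimate
    # project every entry onto the finite constellation
    idx = np.argmin(np.abs(x_zf[:, None] - constellation[None, :]), axis=1)
    return constellation[idx]

# toy example: BPSK symbols over a random 4x2 real channel
rng = np.random.default_rng(0)
S = np.array([-1.0, 1.0])
H = rng.standard_normal((4, 2))
x_true = rng.choice(S, size=2)
y = H @ x_true + 0.01 * rng.standard_normal(4)
x_hat = zero_forcing_detector(H, y, S)
```

The decorrelation step is exactly why ZF is cheap but fragile: the pseudo-inverse amplifies noise on ill-conditioned channels, which the more advanced detectors above try to avoid.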
Exciting contributions in the context of error correcting codes include [17]–[21]. In [22] a machine learning approach is considered in order to decode over molecular communication systems where chemical signals are used for the transfer of information. In these systems an accurate model of the channel is impossible to find. This approach of decoding without channel state information (CSI) is further developed in [23]. Machine learning for channel estimation is considered in [24], [25]. End-to-end detection over continuous signals is addressed in [26], and in [27] deep neural networks are used for the task of MIMO detection in an end-to-end approach, where learning is deployed both in the transmitter, in order to encode the transmitted signal, and in the receiver, where unsupervised deep learning is deployed using an autoencoder. Parts of our work on MIMO detection using deep learning have already appeared in [28]; see also [29]. Similar ideas were discussed in [30] in the context of robust regression.

C. Main contributions

The main contribution of this paper is the introduction of two deep learning networks for MIMO detection. We show that, under a wide range of scenarios including different channel models and various digital constellations, our networks achieve near optimal detection performance with low computational complexity.

Another important result is the networks' ability to easily provide soft outputs as required by modern communication systems. We show that for different constellations the soft outputs of our networks achieve accuracy comparable to that of the M-Best sphere decoder with low computational complexity.

From a more general learning perspective, an important contribution is DetNet's ability to perform on multiple models with a single training. Recently, there have been works on learning to invert linear channels and reconstruct signals [15], [16], [31]. To the best of our knowledge, these were developed and trained to address a single fixed channel. In contrast, DetNet is designed to handle multiple channels simultaneously with a single training phase.

The paper is organized as follows. In Section II we present the MIMO detection problem and how it is formulated as a learning problem, including the use of one-hot representations. In Section III we present two types of neural network based detectors, FullyCon and DetNet. In Section IV we consider soft decisions. In Section V we compare the accuracy and the runtime of the proposed learning based detectors against traditional detection methods, both in the hard decision and the soft decision cases. Finally, Section VI provides concluding remarks.

D. Notation

In this paper, we define the normal distribution with mean µ and variance σ² as N(µ, σ²). The uniform distribution with minimum value a and maximum value b will be U(a, b). Boldface uppercase letters denote matrices. Boldface lowercase letters denote vectors. The superscript (·)ᵀ denotes the transpose. The i'th element of the vector x will be denoted as x_i. Unless stated otherwise, the term independent and identically distributed (i.i.d.) Gaussian matrix refers to a matrix where each of its elements is i.i.d. sampled from the normal distribution N(0, 1). The rectified linear unit is defined as ρ(x) = max{0, x}. When considering a complex matrix or vector, its real and imaginary parts are denoted ℜ(·) and ℑ(·), respectively. An α-Toeplitz matrix M is defined as a matrix such that MᵀM is a square matrix where the value of each element on the i'th diagonal is α^(i−1).

II. PROBLEM FORMULATION

A. MIMO detection

We consider the standard linear MIMO model:

    ȳ = H̄x̄ + w̄,    (1)

where ȳ ∈ C^N is the received vector, H̄ ∈ C^(N×K) is the channel matrix, x̄ ∈ S̄^K is an unknown vector of independent and equal probability symbols from some finite constellation S̄ (e.g., PSK or QAM), and w̄ is a noise vector of size N with independent, zero mean Gaussian variables of variance σ².

Our detectors do not assume knowledge of the noise variance σ². Hypothesis testing theory guarantees that it is unnecessary for optimal detection. Indeed, the ML rule does not depend on it. This is in contrast to the MMSE and AMP decoders, which exploit this parameter and are therefore less robust in cases where the noise variance is not known exactly.

B. Reparameterization

A main challenge in MIMO detection is the use of complex valued signals and various digital constellations S̄ which are less common in machine learning. In order to use standard tools and provide a unified framework, we re-parameterize the problem using real valued vectors and one-hot mappings as described below.

First, throughout this work, we avoid handling complex valued variables, and use the following convention:

    y = Hx + w,    (2)

where

    y = [ℜ(ȳ); ℑ(ȳ)],   w = [ℜ(w̄); ℑ(w̄)],   x = [ℜ(x̄); ℑ(x̄)],

    H = [ ℜ(H̄)  −ℑ(H̄) ]
        [ ℑ(H̄)   ℜ(H̄) ]    (3)

and y ∈ R^(2N) is the received vector, H ∈ R^(2N×2K) is the channel matrix, and x ∈ S^(2K), where S = ℜ{S̄} (which is also equal to ℑ{S̄} in the complex valued constellations we tested).

A second convention concerns the re-parameterization of the discrete constellation S = {s_1, ..., s_|S|} using a one-hot mapping. With each possible s_i we associate a unit vector u_i ∈ R^|S|. For example, the 4 dimensional one-hot mapping of the real part of the 16-QAM constellation is defined as

    s_1 = −3 ↔ u_1 = [1, 0, 0, 0]
    s_2 = −1 ↔ u_2 = [0, 1, 0, 0]
    s_3 =  1 ↔ u_3 = [0, 0, 1, 0]
    s_4 =  3 ↔ u_4 = [0, 0, 0, 1]    (4)
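Both conventions are mechanical transformations; as an illustrative numpy sketch (helper names are ours, not the paper's), the complex-to-real stacking (2)-(3) and the one-hot mapping (4) can be written as:

```python
import numpy as np

def complex_to_real(y_c, H_c):
    """Re-parameterize the complex model (1) into the real model (2)-(3)
    by stacking real and imaginary parts."""
    y = np.concatenate([y_c.real, y_c.imag])
    H = np.block([[H_c.real, -H_c.imag],
                  [H_c.imag,  H_c.real]])
    return y, H

def one_hot(x, S):
    """Map each real symbol in x to its unit vector u_i in R^{|S|}, as in (4).

    S must be sorted; each row of the result is one u_i."""
    return np.eye(len(S))[np.searchsorted(S, x)]

# example: the real part of 16-QAM, S = {-3, -1, 1, 3}
S = np.array([-3.0, -1.0, 1.0, 3.0])
u = one_hot(np.array([-3.0, 1.0]), S)
# rows of u are u_1 = [1,0,0,0] and u_3 = [0,0,1,0]
```

The block structure of H is what makes the real model equivalent: ℜ(H̄x̄) = ℜ(H̄)ℜ(x̄) − ℑ(H̄)ℑ(x̄), and similarly for the imaginary part.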
C. Learning to detect

Fig. 1. A flowchart representing a single layer of the fully connected network.

...not work with y directly, but use the compressed sufficient statistic:

    Hᵀy = HᵀHx + Hᵀw.    (10)

This hints that two main ingredients in the architecture should be Hᵀy and HᵀHx. Second, our construction is based on mimicking a projected gradient descent like solution for the maximum likelihood optimization. Such an algorithm would lead to iterations of the form

    x̂_{k+1} = Π[ x̂_k − δ_k ∂‖y − Hx‖²/∂x |_{x=x̂_k} ]
            = Π[ x̂_k − δ_k Hᵀy + δ_k HᵀH x̂_k ],    (11)

where x̂_k is the estimate in the k'th iteration, Π[·] is a nonlinear projection operator, and δ_k is a step size. Intuitively, each iteration is a linear combination of x̂_k, Hᵀy, and HᵀH x̂_k followed by a non-linear projection. We enrich these iterations by lifting the input to a higher dimension in each iteration and applying standard non-linearities which are common in deep neural networks. In order to further improve the performance we treat the gradient step sizes δ_k at each step as learned parameters and optimize them during the training phase. This yields the following architecture:

    q_k = x̂_{k−1} − δ_{1k} Hᵀy + δ_{2k} HᵀH x̂_{k−1}
    z_k = ρ( W_{1k} [q_k; v_{k−1}] + b_{1k} )
    x̂_{oh,k} = W_{2k} z_k + b_{2k}
    x̂_k = f_oh(x̂_{oh,k})
    v̂_k = W_{3k} z_k + b_{3k}
    x̂_0 = 0
    v̂_0 = 0,    (12)

with the trainable parameters

    θ = {W_{1k}, b_{1k}, W_{2k}, b_{2k}, W_{3k}, b_{3k}, δ_{1k}, δ_{2k}}_{k=1}^{L}.    (13)

To enjoy the lifting and non-linearities, the parameters W_{1k} are defined as tall and skinny matrices. The final estimate is defined as x̂_L. For convenience, the structure of each DetNet layer is illustrated in Fig. 2.

Training deep networks is a difficult task due to vanishing gradients, saturation of the activation functions, sensitivity to initialization and more [32]. To address these challenges, and following the notion of the auxiliary classifiers featured in GoogLeNet [12], we adopted a loss function that takes into account the outputs of all of the layers:

    l(x_oh; x̂_oh(H, y; θ)) = Σ_{l=1}^{L} log(l) ‖x_oh − x̂_{oh,l}‖².    (14)

In our final implementation, in order to further enhance the performance of DetNet, we added a residual feature from ResNet [11] where the output of each layer is a weighted average with the output of the previous layer.

IV. SOFT DECISION OUTPUT

In this section, we consider a more general setting in which the MIMO detector needs to provide soft outputs. High end communication systems typically resort to iterative decoding where the MIMO detector and the error correcting decoder iteratively exchange information on the unknowns until convergence. For this purpose, the MIMO detector must replace its hard estimates with soft posterior distributions Prob(x_j = s_i | y) for each unknown j = 1, ..., 2K and each possible symbol i = 1, ..., |S|. More precisely, it also needs to allow additional soft inputs, but we leave this for future work.

Computation of the posteriors is straightforward based on Bayes law, but its complexity is exponential in the size of the signal and constellation. Similarly to the maximum likelihood algorithm in the hard decision case, this computation yields optimal accuracy yet is intractable. Thus, the goal in this section is to design networks whose outputs approximate the posteriors. At first glance, this seems difficult to learn, as we have no training set of posteriors and cannot define a loss function. Remarkably, this is not a problem, and the probabilities of arbitrary constellations can be easily recovered using the standard l2 loss function with respect to the one-hot representation x_oh. Indeed, consider a scalar x and a single s ∈ S associated with its one-hot bit x_{oh,s}; then it is well known that, for each s ∈ S,

    [arg min_{x̂_oh} E[‖x_oh − x̂_oh‖² | y]]_s = E[x_{oh,s} | y]
                                              = Prob(x_{oh,s} = 1 | y)
                                              = Prob(x = s | y).    (15)

Thus, assuming that our network is sufficiently expressive and globally optimized, the one-hot output x̂_oh will provide the exact posterior probabilities.

V. NUMERICAL RESULTS

In this section, we provide numerical results on the accuracy and complexity of the proposed networks in comparison to competing methods. In the FC case, the results are over the 0.55-Toeplitz channel. In the VC case, and when testing the soft output performance, the results presented are over random channels, where each element is sampled i.i.d. from the normal distribution N(0, 1).

A. Implementation details

We train both networks using a variant of the stochastic gradient descent method [33], [34] for optimizing deep networks, named the Adam optimizer [35]. All networks were implemented using the Python based TensorFlow library [36]. To give a rough idea of the computation needed during the learning phase, optimizing the detectors in our numerical results took around 3 days for each of the two architectures on a standard Intel i7-6700 processor. Each sample was independently generated from (2) according to the statistics of x, H (either in the
Fig. 2. A flowchart representing a single layer of DetNet. The network is composed of L such layers, where each layer's output is the next layer's input.
Fig. 4. Comparison of the detection algorithms' BER performance in the varying channel case over a BPSK modulated signal. All algorithms were tested on channels of size 30x60.

Fig. 6. Comparison of the detection algorithms' SER performance in the varying channel case over a 16-QAM modulated signal. All algorithms were tested on channels of size 15x25.
Fig. 5. Comparison of the detection algorithms' BER performance in the varying channel case over a QPSK modulated signal. All algorithms were tested on channels of size 20x30.

Fig. 7. Comparison of the detection algorithms' SER performance in the varying channel case over an 8-PSK modulated signal. All algorithms were tested on channels of size 15x25.
3) Soft Outputs: We also experimented with soft decoding. Implementing a full iterative decoding scheme is outside the scope of this paper, and we only provide initial results on the accuracy of our posterior estimates. For this purpose, we examined smaller models where the exact posteriors can be computed exactly, and measured their statistical distance to our estimates.

We define the following statistical distance function: given two probability distributions P and Q over the symbol set S (that is, the probability of each symbol being the true symbol), the distance δ(P, Q) is

    δ(P, Q) = Σ_{s∈S} |P(s) − Q(s)|.    (16)

As a reference, we compare our results to the M-Best detectors [3]. In Fig. 8 we present the accuracy in the case of a BPSK signal over a 10x20 real channel. In this setting we reach accuracy levels better than those achieved by the M-Best algorithm. As seen in Fig. 8, adding additional layers improves the accuracy of the soft output. In Fig. 9 we present the results over a 4x8 complex channel with a 16-QAM constellation. We can see that the performance of DetNet is comparable to the M-Best sphere decoding algorithm. For completeness, in Fig. 10 we add the 8-PSK constellation soft output, where DetNet is comparable to the M-Best algorithms only in the high SNR region.

D. Computational Resources

1) FullyCon and DetNet run time: In order to estimate the computational complexity of the different detectors, we compared their run times. Comparing complexity is non-trivial due to many complicating factors such as implementation details and platforms. To ensure fairness, all the algorithms were tested on the same machine via a Python 2.7 environment using the Numpy package. The networks were converted from TensorFlow objects to Numpy objects. We note that the run-time of SD depends on the SNR, and we therefore report a range of times.

An important factor when considering the run time of the neural networks is the effect of the batch size. Unlike classical detectors such as SDR and SD, neural networks can detect over entire batches of data, which speeds up the detection process. This is true also for the AMP algorithm, where computation can be made on an entire batch of signals at once. However, the
Fig. 8. Comparison of the accuracy of the soft output relative to the posterior probability in the case of a BPSK signal over a 10 × 20 real valued channel. We present the results for two types of DetNet, one with 30 layers and the second with 50 layers.

Fig. 10. Comparison of the accuracy of the soft output relative to the posterior probability for an 8-PSK signal over a 4 × 8 complex valued channel.
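The exact-posterior reference used for the soft-output comparisons can be sketched as follows (our own hypothetical helpers, feasible only for the small models discussed above): brute-force posteriors by Bayes law, and the statistical distance (16).

```python
import itertools
import numpy as np

def exact_posteriors(H, y, S, sigma2):
    """Brute-force posteriors Prob(x_j = s | y) via Bayes law with a uniform
    prior; exponential in K, so usable only for small models."""
    K = H.shape[1]
    post = np.zeros((K, len(S)))
    for x_tuple in itertools.product(S, repeat=K):
        x = np.array(x_tuple)
        p = np.exp(-np.sum((y - H @ x) ** 2) / (2 * sigma2))  # Gaussian likelihood
        for j in range(K):
            post[j, np.searchsorted(S, x[j])] += p
    return post / post.sum(axis=1, keepdims=True)

def stat_distance(P, Q):
    """The statistical distance (16): sum over symbols of |P(s) - Q(s)|."""
    return np.sum(np.abs(P - Q))
```

Note that, unlike the detectors themselves, computing this exact reference does require the noise variance sigma2; the distance of a network's soft output to these posteriors is what Figs. 8-10 measure.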