3D Freehand Ultrasound Without External Tracking Using Deep Learning
PII: S1361-8415(18)30371-2
DOI: 10.1016/j.media.2018.06.003
Reference: MEDIMA 1380
Please cite this article as: Raphael Prevost, Mehrdad Salehi, Simon Jagoda, Navneet Kumar, Julian Sprung, Alexander Ladikos, Robert Bauer, Oliver Zettinig, Wolfgang Wein, 3D Freehand Ultrasound Without External Tracking Using Deep Learning, Medical Image Analysis (2018), doi: 10.1016/j.media.2018.06.003
This is a PDF file of an unedited manuscript that has been accepted for publication. As a service
to our customers we are providing this early version of the manuscript. The manuscript will undergo
copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please
note that during the production process errors may be discovered which could affect the content, and
all legal disclaimers that apply to the journal pertain.
[Graphical abstract: our method uses a neural network for frame-to-frame motion estimation between successive ultrasound images I1, I2, I3, ...]
Highlights

• A neural network to estimate the motion of the probe between two successive frames.
Abstract
Figure 1: Overview of our paper. (Top) The problem that we are trying to solve is the 3D reconstruction of any 2D ultrasound clip acquired without any external tracking. (Bottom) Our approach is based on a frame-to-frame estimation. We train a neural network to estimate both the translation and the rotation of the probe between successive frames, optionally with the help of a small inertial measurement unit mounted on the probe.
1. Introduction
… scanners that are directly connected to and recorded on a smartphone, some of them even being wireless.

1 Philips Lumify®, Koninklijke Philips B.V., Netherlands, www.lumify.philips.com (accessed …)
Far from trivial, the physical and mathematical modeling of such a relationship has been the topic of numerous research papers. Tuthill et al. (1998) generalized the original formulation of Chen et al. (1997) to include the dependency on the depth from which the image patches have been extracted. They also pointed out that the correlation hypothesis is only valid in so-called fully developed speckle areas and therefore introduced a method to mask out all other areas from their analysis. Detecting such areas is a challenging problem in itself, and various methods relying on local image statistics have been proposed (Hassenpflug et al., 2005). Rivaz et al. (2007) reported better results by using a beam-steering approach, which however requires full control over the ultrasound acquisition system and is therefore not always usable.

Prager et al. (2003) as well as Gee et al. (2006) further refined the model by adapting the correlation curves to the local scattering of the scanned tissues. A similar idea was exploited by Laporte and Arbel (2011), where the decorrelation curves are tailored to the local intensity statistics of the image, enabling a better generalization across transducers and medium changes. Afsham et al. (2014) introduced a new statistical model based on a Rician inverse Gaussian distribution, which alleviates the need to discard parts of the image. This method was further extended in Afsham et al. (2015), in which the authors develop a denoising approach with a dedicated non-local means filtering that extracts the relevant speckle patterns. This makes the computations more practical and also less tissue-dependent.
While all those methods seem to produce satisfactory results on synthetic data and phantom acquisitions, most of the assumed mathematical hypotheses fail on real clinical images, in which the image appearance is the superposition of various phenomena (moving structures, signal post-processing, etc.). Furthermore, it has also been shown that the displacement of speckle patterns does not always correspond to the motion of the underlying tissue (Morrison et al., 1983; Kallel et al., 1994). This often results in errors in in vivo acquisitions that accumulate, even if marginal, producing a significant drift, which in turn prevents those methods from being used in a clinical setting. Such a drift can be mitigated by using an external source of information, as proposed by Lang et al. (2009), who fused the speckle decorrelation approach with EM tracking. While this approach indeed produces an overall better estimate of the probe trajectory by removing the jittering of the EM tracking, it still does not solve the problem of designing a tracking-less portable 3D ultrasound system as in Gee et al. (2006).

Acknowledging the difficulty of modeling the whole US image acquisition pipeline, several studies started to incorporate more and more machine learning components, either to (i) refine the model (Laporte and Arbel, 2011), (ii) detect uncertainties in the estimates and try to correct them (Conrath and Laporte, 2012), or (iii) skip the estimates that are not reliable (Tetrel et al., 2016). However, no work so far has aimed at replacing the whole speckle decorrelation approach with a method fully based on machine learning. Apart from the known difficulty of regression tasks in comparison to classification problems, one reason for this research gap might lie in the problem of feature engineering. It is indeed not straightforward to define image features that would be meaningful for motion estimation. Toews and Wells (2018) used SIFT features and matched them across ultrasound frames in order to calibrate an ultrasound system. However,
In this paper, we will therefore investigate the potential of deep learning for image-based ultrasound trajectory estimation. Our contributions with respect to the previous works are threefold:
This study is based on our previous work (Prevost et al., 2017), but includes the following extensions and further contributions:

1. While the tracking estimation presented in our conference paper was originally limited to US frames, we integrated an IMU sensor into our system. Note that in doing so, none of the advantages of sensorless 3D reconstruction are sacrificed, as IMU chips mounted on or integrated within US probes are completely independent from external hardware. In this work, we describe how to use such orientation data within our neural network and show how this can greatly improve the accuracy of both the orientation and the position estimation of the probe.
3. Methods
This section aims at defining the problem that we are addressing, as well as
the main notations used throughout the paper.
Given such a function, one can then iterate over all pairs of successive images and compute any matrix $T_k$ by chain-multiplying the previous estimates of the relative transformations $T_{1\to 2}, T_{2\to 3}, T_{3\to 4}, \ldots, T_{k-1\to k}$, as depicted in Figure 1. Designing the function f (or an algorithm producing its output) is therefore the crux of the problem. A very high accuracy is required, since even small errors will be propagated and induce a drift in the trajectory.
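As a concrete illustration of this chaining (and of why errors compound), consider the following minimal numpy sketch; the function name and the multiplication convention are our own assumptions, not taken from the paper:

```python
import numpy as np

def chain_poses(relative_transforms):
    """Chain relative 4x4 transforms T_{1->2}, T_{2->3}, ... into
    absolute poses T_1, ..., T_N (with T_1 set to the identity).

    Any error in a single relative estimate is multiplied into every
    subsequent pose, which is exactly the drift discussed above.
    """
    poses = [np.eye(4)]
    for t_rel in relative_transforms:
        poses.append(poses[-1] @ t_rel)  # T_k = T_{k-1} * T_{k-1->k}
    return poses
```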
our approach is not prone to the usual problems of Euler angles, such as ambiguity or gimbal lock. The alternative would have been to use quaternions (Hamilton, 1853), but they are not as easily interpretable, and their normalization requirement could have raised other issues during their estimation.

Working with small transformations also allowed us to manipulate them directly via their angle representation. Indeed, since rigid transformations do not form a linear subspace, the mathematically rigorous way of performing operations on transformations would have been to use the exponential and logarithmic maps of SO(3) or SE(3), see Govindu (2004) for instance. Our early experiments showed that there was no benefit in using the logarithm representation of rotations, so we did not use that parametrization.
In this section, we describe our method and show how it is related to the standard approach of speckle decorrelation. To that end, we will first recall the different high-level steps of most speckle decorrelation algorithms. First, the two images are divided into local patches. Optionally, patches that do not satisfy the fully-developed speckle condition, i.e. the presence of a Rayleigh distribution, are ignored. Then, for every patch of the first image, the normalized cross-correlation against a set of patches from the second image within its neighborhood is computed. The maximum correlation, as well as the displacement that has produced it, are stored, yielding a 2D displacement map that represents the in-plane motion. The third, elevational component of the displacement field can afterwards be found via the decorrelation model, which is a function mapping the patch correlation (and other statistics) to the distance between the patches. Ultimately, a rigid transformation is estimated using a robust algorithm, e.g. RANSAC (Fischler and Bolles, 1981), based on the 3D displacement field.
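To make these steps concrete, the following sketch outlines one patch comparison in this pipeline. The patch size, search range and decorrelation curve are placeholders of our own, not the calibrated models of the cited works:

```python
import numpy as np

def ncc(p1, p2):
    """Normalized cross-correlation between two equally-sized patches."""
    a = (p1 - p1.mean()) / (p1.std() + 1e-8)
    b = (p2 - p2.mean()) / (p2.std() + 1e-8)
    return float((a * b).mean())

def placeholder_decorrelation_model(rho, sigma=0.7):
    """Placeholder monotone mapping: the lower the residual correlation,
    the larger the estimated elevational (out-of-plane) distance."""
    return sigma * np.sqrt(-2.0 * np.log(max(rho, 1e-6))) if rho < 1.0 else 0.0

def patch_displacement(image1, image2, y, x, size=15, search=5):
    """In-plane search by maximal NCC, then out-of-plane estimate from the
    decorrelation model. Assumes the search window stays inside image2."""
    patch = image1[y:y + size, x:x + size]
    best_rho, best_dx, best_dy = -1.0, 0, 0
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            cand = image2[y + dy:y + dy + size, x + dx:x + dx + size]
            rho = ncc(patch, cand)
            if rho > best_rho:
                best_rho, best_dx, best_dy = rho, dx, dy
    dz = placeholder_decorrelation_model(best_rho)
    return np.array([best_dx, best_dy, dz])  # fed to RANSAC over all patches
```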
As already mentioned though, and despite the successive refinements proposed in the literature, the decorrelation model is based on physical and mathematical assumptions which do not encompass the whole complexity of the ultrasound image formation. Moreover, errors accumulate along the multiple steps of the pipeline. Since this process is repeated for each frame, the inaccuracies will accumulate and the estimated trajectory will drift noticeably.

Figure 2: Workflow comparison of speckle decorrelation (top) and convolutional neural network (bottom) for the estimation of the transformation parameters between two successive images. Related steps in the two approaches have the same color.
Conversely, we propose to use an end-to-end approach where a convolutional neural network (CNN) takes the pair of images as input and directly outputs the parameters of the transformation. During training, the prediction errors can thus be back-propagated through the whole processing of the pair of images and therefore help the adjustment of the very first layers.
• The selection of reliable speckle features and areas in the image could be achieved via the activation layers.

However, the more complex steps of the pipeline (the decorrelation model, the robust transformation fitting, etc.) are now replaced with a combination of non-linear operations whose modeling capabilities exceed those of all physical models (Hornik, 1991).
More practically, our convolutional neural network has a standard architecture composed of convolutional, rectified linear unit (ReLU) and pooling layers. Its output is a vector of six entries representing the parameters of the relative transformation between the two input frames. Training is performed by adjusting the network's parameters in order to minimize the squared L2 difference between the network output and the ground truth. Such a loss function penalizes large deviations from the ground truth, which would otherwise produce significant errors once the estimated transformations are chained. Figure 3 depicts the detailed architecture of our network. Similarly to FlowNet (Dosovitskiy et al., 2015), the pair of two successive frames is fed into the network as a 2-channel image, so that the information coming from the two images can be coupled from the very first convolutional layer.
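As a rough sketch of such an architecture in PyTorch (a framework of our choosing; layer sizes follow Figure 3 where legible, the head dimensions are our assumption):

```python
import torch
import torch.nn as nn

class MotionEstimationCNN(nn.Module):
    """FlowNet-like regressor: a pair of frames stacked as channels in,
    six rigid-transformation parameters out."""

    def __init__(self, in_channels=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 64, kernel_size=5, stride=2), nn.ReLU(),
            nn.MaxPool2d(2, stride=2),
            nn.Conv2d(64, 64, kernel_size=3, stride=2), nn.ReLU(),
            nn.MaxPool2d(2, stride=2),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.LazyLinear(512), nn.ReLU(),   # penultimate 512-dim vector
            nn.Dropout(0.25),
            nn.Linear(512, 6),               # tx, ty, tz, theta_x, theta_y, theta_z
        )

    def forward(self, frame_pair):           # (batch, 2, H, W)
        return self.head(self.features(frame_pair))

loss_fn = nn.MSELoss()  # squared difference to the ground-truth parameters
```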
Since we initially experienced some trouble with the convergence of the training process, we also report the parameters of the solver and the weight initialization strategy that eventually worked consistently. Among the different standard solvers, we achieved good results with the AdaGrad optimizer with a learning rate of 1. The weights of the network were all initialized with a Gaussian distribution (mean 0.0, std 0.01). A very important trick was to use a large batch size (500 in our case) that samples pairs of images from as many different datasets as possible. We noticed that while using smaller batch sizes could reduce training time, the local minimum found was not as accurate.
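Transcribed into the same PyTorch sketch (the paper predates this framework; the dummy dataset is a stand-in for the real training pairs):

```python
import torch

model = MotionEstimationCNN()                # from the architecture sketch above
model(torch.zeros(1, 2, 128, 128))           # dummy forward to materialize lazy layers

def init_weights(m):
    """Gaussian initialization (mean 0.0, std 0.01), as reported."""
    if isinstance(m, (torch.nn.Conv2d, torch.nn.Linear)):
        torch.nn.init.normal_(m.weight, mean=0.0, std=0.01)
        torch.nn.init.zeros_(m.bias)

model.apply(init_weights)

# AdaGrad with learning rate 1; large batches of 500 image pairs sampled
# from as many different sweeps as possible.
optimizer = torch.optim.Adagrad(model.parameters(), lr=1.0)
train_pairs = torch.utils.data.TensorDataset(
    torch.randn(1000, 2, 128, 128), torch.randn(1000, 6))  # dummy stand-in
loader = torch.utils.data.DataLoader(train_pairs, batch_size=500, shuffle=True)
```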
Naturally, overfitting has to be mitigated. No weight decay was applied during training, but three dropout layers were used with a rate of 0.25. Artificial data augmentation is also often used to prevent overfitting. Defining meaningful perturbations is however not as straightforward as in classification, segmentation or registration problems, but we did use the two following strategies, sketched in the example below. First, images were mirrored horizontally (left becomes right and vice versa), and the corresponding transformations flipped around the x-axis. Second, we generated additional pairs by considering images that are non-consecutive. This strategy helps the network, to some extent, be more robust to speed variations, but it has to be used with caution. Indeed, adding pairs with too many skipped frames can actually hurt the performance, since it biases the network toward unrealistic probe velocities. Besides, distant frames may actually be completely unrelated and perturb the training of the network. The number of frames that can be skipped depends on the framerate but also on the average speed of the probe.

Another way to alleviate overfitting is to include additional information, which will be presented in the next subsections.
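A minimal sketch of these two augmentations, under our assumption about the sign conventions (the paper does not spell out which parameters change sign under mirroring):

```python
import numpy as np

def mirror_pair(frame1, frame2, params):
    """Horizontal mirroring of both frames with the corresponding flip of
    the ground-truth parameters (tx, ty, tz, rx, ry, rz) around the x-axis;
    the exact sign pattern depends on the axis conventions (assumed here)."""
    tx, ty, tz, rx, ry, rz = params
    flipped = (-tx, ty, tz, rx, -ry, -rz)
    return np.fliplr(frame1), np.fliplr(frame2), flipped

def skipped_pairs(frames, rel_transforms, max_skip=3):
    """Additional pairs (i, i+s) from non-consecutive frames, obtained by
    composing the intermediate relative 4x4 transforms. Large skips bias
    the network toward unrealistic probe velocities, hence the small cap."""
    for s in range(2, max_skip + 1):
        for i in range(len(frames) - s):
            t = np.eye(4)
            for j in range(i, i + s):
                t = t @ rel_transforms[j]   # compose T_{i -> i+s}
            yield frames[i], frames[i + s], t
```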
$\vec{u}(x, y) = \left( u_x(x, y),\ u_y(x, y) \right)^T$ (2)
[Figure 3 diagram: Convolution 5×5×64 (stride 2) + ReLU and Convolution 3×3×64 (stride 2) + ReLU blocks, each followed by Max Pooling 2×2 (stride 2); optional inputs: the optical flow and the IMU angles θx^IMU, θy^IMU, θz^IMU, concatenated before the output.]

Figure 3: Architecture of our convolutional neural networks. The main input (blue) is the pair of frames encoded as a multi-channel image that is passed through a series of convolutional, rectified linear unit and pooling layers, and is finally fully connected to a 6-dimensional vector representing the parameters of the transformation. The two other optional inputs are the optical flow vector field and the measures of an IMU.
that can be directly encoded as additional input channels. Without any change in the network architecture, we therefore feed a 4-channel image to the network that includes the two B-mode images and the two components of the vector field, as shown in Figure 3. Our implementation is based on the optical flow method of Farnebäck (2003), but any similar method could have been used instead.
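For instance, with OpenCV's implementation of Farnebäck's method, building the 4-channel input could look as follows (the Farnebäck parameters below are common defaults, not values reported in the paper):

```python
import cv2
import numpy as np

def build_network_input(frame1, frame2):
    """Stack both single-channel (uint8) B-mode frames and the two
    optical-flow components into one (H, W, 4) array."""
    flow = cv2.calcOpticalFlowFarneback(
        frame1, frame2, None,
        pyr_scale=0.5, levels=3, winsize=15,
        iterations=3, poly_n=5, poly_sigma=1.2, flags=0)
    return np.dstack([frame1, frame2, flow[..., 0], flow[..., 1]])
```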
from the low speed (and even lower acceleration) of typical probe trajectories during sweep acquisition. We therefore only integrate the orientation information within our approach. As depicted in Figure 3, we simply concatenate the three Euler rotation angles θx^IMU, θy^IMU, θz^IMU of the IMU to the 512-valued vector of the penultimate layer of the network.

This information about the orientation of the probe is naturally a strong cue for the motion estimation, and it significantly boosts the performance of the network, as shown by our experiments in the next sections.
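In code, this concatenation amounts to widening the last fully-connected layer by three inputs; a sketch (the `backbone` module stands for the convolutional part plus the first fully-connected layers of Figure 3):

```python
import torch
import torch.nn as nn

class MotionCNNWithIMU(nn.Module):
    """Concatenate the three IMU Euler angles to the 512-dimensional
    penultimate feature vector before the final regression layer."""

    def __init__(self, backbone, feature_dim=512):
        super().__init__()
        self.backbone = backbone                 # maps frames -> (batch, 512)
        self.out = nn.Linear(feature_dim + 3, 6)

    def forward(self, frames, imu_angles):       # imu_angles: (batch, 3)
        feat = self.backbone(frames)
        return self.out(torch.cat([feat, imu_angles], dim=1))
```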
4. Experimental Setup

Since the data acquisition was a key part of our work, we devote this section to a thorough description of our setup.
… ultrasound machine (Cephasonics, Inc., Santa Clara, CA, USA). A linear probe with 128 elements at a frequency of 5 MHz was used to generate the images, originally composed of 256 scanlines. We recorded B-mode images at a frame rate of approximately 35 images per second, without any filtering or scanline conversion. We believe this is a good trade-off between the displayed images, which are deprived of many details due to the usual post-processing and noise filtering, and the raw radio-frequency (RF) data, which would contain the most information but are not accessible on all ultrasound systems. In order to ease the processing of the images, all of them are also resampled to a fixed isotropic resolution of typically 0.3 mm, which seems to match the scale of the speckle pattern of the images. We will show evidence in Section 5.2.1 that this was indeed a suitable value.
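The resampling step itself is straightforward; a minimal sketch, assuming the original pixel spacing of the frames is known:

```python
import cv2

def resample_isotropic(image, spacing_xy, target_mm=0.3):
    """Resample a B-mode frame to a fixed isotropic resolution.
    spacing_xy: original (x, y) pixel spacing in mm per pixel."""
    h, w = image.shape
    new_size = (int(round(w * spacing_xy[0] / target_mm)),   # width
                int(round(h * spacing_xy[1] / target_mm)))   # height
    return cv2.resize(image, new_size, interpolation=cv2.INTER_LINEAR)
```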
where the ground truth angular velocity $\omega^{GT}$, i.e. the first derivative of the orientation, is computed by approximation with finite differences.
Finally, in order to make the reported orientations compatible between tracking sources, we needed to align the orientation of the two tracking coordinate systems. Unlike Housden et al. (2008b), who calibrated the IMU directly to the ultrasound images, we use the simpler approach of calibrating it to our optical ground-truth tracking (which is already calibrated to the US frames) by modeling their relation as

$T_k^{GT} = R \cdot T_k^{IMU} \cdot C,$ (4)

where, for every frame k, the matrix $T_k^{GT}$ contains the rotational part of the ground truth tracking, and C and R denote constant calibration and registration matrices, respectively. The former describes the local transformation between the IMU sensor and the optical tracking target, the latter the global transformation between the IMU world coordinate system and the optical tracking camera. Note that the registration will change if the tracking camera is moved, but because only relative transformations between successive frames are considered, it cancels out unless orientations are directly compared against the ground truth, e.g. for accuracy estimation. Both matrices are found using numerical global optimization with the MLSL algorithm (Kan and Timmer, 1987) by minimizing the absolute Euler angle residual error between $T_k^{GT}$ and the right-hand side of Equation 4.

Table 1: List of the different datasets used in this paper. See text for details.

  #  Anatomy   Sweeps  Frames   Motions     Avg. length  IMU
  1  Phantom   20      7,168    basic       131 mm       no
  2  Forearms  88      41,869   basic       190 mm       no
  3  Calves    12      6,647    basic       175 mm       no
  4  Forearms  600     307,200  all         202 mm       yes
  5  Carotid   100     21,945   basic+tilt  75 mm        yes
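A compact sketch of this fitting with SciPy (we substitute a plain local optimizer for MLSL and parametrize R and C as rotation vectors; an illustration under these assumptions, not the paper's exact procedure):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.spatial.transform import Rotation

def calibrate(imu_rotations, gt_rotations):
    """Fit the registration R and calibration C of Equation 4 by
    minimizing the Euler-angle residual between R @ T_imu @ C and
    T_gt over all frames (inputs: lists of 3x3 rotation matrices)."""
    def residual(x):
        R = Rotation.from_rotvec(x[:3]).as_matrix()
        C = Rotation.from_rotvec(x[3:]).as_matrix()
        err = 0.0
        for t_imu, t_gt in zip(imu_rotations, gt_rotations):
            delta = (R @ t_imu @ C).T @ t_gt   # residual rotation
            err += np.abs(Rotation.from_matrix(delta).as_euler("xyz")).sum()
        return err
    res = minimize(residual, np.zeros(6), method="Nelder-Mead")
    return (Rotation.from_rotvec(res.x[:3]).as_matrix(),
            Rotation.from_rotvec(res.x[3:]).as_matrix())
```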
Additionally, we investigated whether a second IMU of the same type, mounted orthogonally to the first one, could increase the overall system accuracy. We could not find any significant orientation preference, which is why only a single IMU sensor was used throughout the remainder of this work.
4.2. Datasets

Our experiments are based on five different datasets that are summarized in Table 1. The first three datasets are the same that were used in our preliminary study (Prevost et al., 2017):

… phantom (CAE Healthcare, Inc., Sarasota, FL, USA). The images contain mostly speckle but also a variety of masses that are either hyperechoic or hypoechoic.
Figure 4: Visualization of the four types of trajectories acquired on forearms. (a) Basic sweeps
are typical sweeps that follow a straight vessel. (b) Shift sweeps have been acquired to simulate
the scanning of a vessel that would deviate out of the ultrasound image so that the operator
would have to stop and shift the probe. (c) Wave sweeps are meant to simulate the following
of a tortuous vessel. (d) Tilt sweeps simulate rotations along the x-axis of the frames.
3. Another 12 in vivo tracked sweeps acquired on the lower legs of a subset of the volunteers. This third set was used to assess how the network generalizes to other anatomies.

However, even though the second dataset (forearms) was already larger than the ones used in the existing literature, the trajectories were mostly translational and thus not diverse enough to represent the variability observed in clinical practice. Therefore, we acquired a fourth, extensive set of 600 in vivo sweeps on the forearms of another set of 15 volunteers, this time with an IMU mounted on the ultrasound probe (dataset #4). We asked the operators to deliberately execute strong but realistic translations and rotations during the recording. Such motions were classified into four types that are represented in Figure 4. For each arm of each volunteer, the distribution of sweep types was as follows: 6× basic, 4× shift, 8× wave and 2× tilt. Most datasets have thus been acquired on limbs (arms and legs). Such acquisitions are used in clinical practice to visualize the vascular topology, e.g. for an AV-fistula mapping or a peripheral vein mapping preceding bypass surgery. From a technical point of view, those very elongated sweeps (partially exceeding 20 cm) are also particularly well-suited to study the drift of the different methods.
Finally, in order to perform a final analysis on the generalization capabilities of our system with the IMU, we acquired a last set of 100 sweeps on both carotids from 10 volunteers (dataset #5). Their trajectories were mostly translational since the operator followed the vessel, but also contained some tilt (rotation along the left-right axis of the US images). The content of the images is also significantly different from the limbs. Furthermore, the presence of a large artery that is pulsating during the acquisition represents an additional challenge for the trajectory reconstruction.
First, we apply to each frame an anisotropic filter (He et al., 2010) which is able to roughly separate the large structures (which do not represent reliable information) from the speckle noise. By subtracting the result of this filter from the original image, we obtain an image in which the speckle pattern is enhanced. We then divide those two images into patches of 15×15 pixels. For each patch of the first image, we find in the second image the patch that maximizes the normalized cross-correlation. The offset between the patch centers is considered as an in-plane motion estimation, while the maximum correlation is used to estimate the out-of-plane displacement with a decorrelation model similar to Prager et al. (2003). This results in a 3D vector field subsequently fed into a RANSAC algorithm (Fischler and Bolles, 1987) that fits the six transformation parameters.
The decorrelation model was trained by fitting its parameters on the set
of pairs of patches extracted from the considered training set.
Unless otherwise stated, our methods are evaluated using a 2-fold cross-validation. For statistical performance comparisons, a Wilcoxon signed-rank test with a target p-value of at most $10^{-4}$ is employed.
The network is trained to minimize the difference between its output and the parameters of each relative transformation. However, such values convey very little insight into the overall accuracy or the drift of the method, let alone clinical relevance. We therefore report alternative metrics for all our comparisons and analyses, which capture and quantify the different kinds of errors in a better way.

The average absolute parameter-wise error is computed by averaging, for each frame k of each sweep, the absolute difference of the transformation parameters between estimation and ground truth,
$\text{avg. abs. error} = \frac{1}{N} \sum_{k=1}^{N} \left| \vec{p}\left( T_k^{-1} \cdot T_k^{GT} \right) \right|,$ (5)

where $|\vec{p}(A)|$ extracts the translation and rotation parameters from matrix $A$ and computes the element-wise absolute value.
Denoting by $\vec{t}_k$ the center position of the US frame, i.e. the translation part of matrix $T_k$, we can furthermore compute the final drift of a sweep as the distance between the positions of the last frame (N) center of the estimated trajectory and the ground truth trajectory:

$\text{final drift} = \left\| \vec{t}_N - \vec{t}_N^{GT} \right\|.$ (6)
In other words, this number is a target registration error on the center of the final frame and captures the accumulated error over the whole sweep. The maximum center error is the maximum distance within each sweep between the estimated center of a frame k and its true frame center:

$\text{max. center error} = \max_{k} \left\| \vec{t}_k - \vec{t}_k^{GT} \right\|.$ (7)
Finally, the length error is the difference between the first-to-last frame distance of the estimated trajectory and that of the ground truth trajectory:

$\text{length error} = \left\| \vec{t}_N - \vec{t}_1 \right\| - \left\| \vec{t}_N^{GT} - \vec{t}_1^{GT} \right\|.$ (8)

This metric is motivated by our clinical application, since the goal of the exam is to map the vascular tree and measure the length of each blood vessel. Note that it is slightly different from the drift, since there could be an overall rotation in the sweep that does not degrade the accuracy of the vessel length.
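Under our reading of these definitions, the three sweep-level metrics can be computed from the per-frame center positions as follows:

```python
import numpy as np

def sweep_metrics(est_centers, gt_centers):
    """Final drift, maximum center error and length error (Equations 6-8)
    from two (N, 3) arrays of estimated and ground-truth frame centers."""
    dist = np.linalg.norm(est_centers - gt_centers, axis=1)
    final_drift = dist[-1]             # error at the last frame center
    max_center_error = dist.max()      # worst frame-center error in the sweep
    length_error = (np.linalg.norm(est_centers[-1] - est_centers[0])
                    - np.linalg.norm(gt_centers[-1] - gt_centers[0]))
    return final_drift, max_center_error, length_error
```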
Our experiments are divided into several subsections that use different datasets, since they either focus on (i) comparing our approach to the baseline methods, (ii) studying the effect of parameters, (iii) evaluating the effects of including the IMU information, or (iv) investigating the generalization properties of the network.
In Table 2, parameter-wise errors and drifts are reported for each method as evaluated on the first three datasets. Unsurprisingly, the assumption of a linear motion performed worst in all cases. Because it is quite difficult for a human operator to maintain a constant speed, the out-of-plane translation tz exhibited the highest variability and was thus the main reason for the poor performance of this method. While the speckle decorrelation approach managed to infer the in-plane translation and the elevation information from the US frames to some extent, the error on tz remained high for the majority of sweeps across all datasets, mainly because the uncertainty of the motion estimation in this direction deteriorates with increasing distance between frames.
In contrast, both CNN methods demonstrated a clear improvement compared to the other two techniques, except the one without optical flow on the calves dataset, presumably because of the very low training sample size. The results indicate that it is necessary to add the optical flow as input channels to maintain an acceptable performance within the image plane, i.e. for the parameters tx and ty. This way, the network can better focus on predicting the out-of-plane motion, overall leading to the lowest median drift errors across datasets.

On the first two clinical datasets, #2 (forearms) and #3 (calves), the final drift was on average 1.45 cm for sweeps exceeding a length of 20 cm, thus at least twice as accurate as the baseline methods. The outcomes of all methods were significantly different in a pairwise fashion. A qualitative comparison of the best, median, and worst case of reconstructed trajectories is depicted in Figure 5.
In all fairness, the results that we obtained with the speckle decorrelation method do not seem to always match the accuracy reported in previous papers. On dataset #2, our implementation yields an average final drift of 19% (error on the center of the final frame divided by the length of the sweep in mm). As a comparison, we recall the reported results in recent papers:
drift.
These findings were further validated using a separately recorded sweep with a deliberately high variation in out-of-plane velocity, between 0.3 mm/frame (beginning and end) and 0.9 mm/frame (middle part). Figure 6 illustrates the elevational translation as estimated by all methods. By design, the linear motion method cannot capture this variation at all, resulting in a severely distorted trajectory. As already mentioned above, the speckle decorrelation approach has limited power once the frame-to-frame distance exceeds the scale of the speckle patterns and thus systematically underestimates the translation in the middle part of the recording. Only the CNN with optical flow was able to cope with the changing probe velocity appropriately. For the remainder of this study, unless otherwise stated, CNN experiments thus include the additional two optical flow channels.
The last experiment of this section was designed to study the robustness of the various methods when the scanned anatomy is different from the training data. We trained a neural network on our forearms dataset #2 and tested it on the two other ones (phantom and lower legs). The results are summarized in Figure 7, where we compare it with speckle decorrelation on the one hand and with a dedicated network that has been trained on the appropriate dataset on the other hand. As expected, the box plots indicate that the highest accuracy is reached with a network trained on the same kind of sweeps. However, the neural network trained on another dataset still yielded significantly lower center errors and final drifts than the speckle decorrelation algorithm, which means that it generalized much better. Unsurprisingly, the performance gap between the different methods is almost entirely due to the elevational displacement tz, whereas the quality of estimation of both in-plane translation and rotation is rather similar.

In short, this preliminary experiment shows that the network seems to use both general features and patterns that are specific to an anatomy. Note that the datasets differ not only in the content of the images but also in the ultrasound acquisition parameters and, naturally, the shape of the trajectories.

The consequence is that a dedicated network will likely have to be trained for each target anatomy. We believe that this is not a significant drawback since
Figure 5: Comparison of the trajectories reconstructed with different methods. The three selected forearm sweeps illustrate the best, median, and worst case in terms of drift, respectively.
[Figure 6: relative tz (mm) over frame index, comparing ground truth, linear motion, speckle decorrelation, and CNN with optical flow.]
[Figure 7: box plots of the maximum center error (mm) on the phantom and lower-legs datasets.]
… value for the resampling resolution. For illustration purposes, we also present in Figure 8 a sample image resampled at the different considered resolutions.
Figure 8: Comparison of the various isotropic resamplings (0.2 to 0.5 mm) tested on dataset #1, and the corresponding network performance in terms of the maximum frame center error within each sweep (mm). Boxplots represent minimum, lower quartile, median, upper quartile and maximum.
Figure 9: Comparison of the performance of our method when trained on the original images or after the speckle filter from the ultrasound system (dataset #3). Boxplots represent minimum, lower quartile, median, upper quartile and maximum.
Table 3: Results of various experiments on dataset #4 regarding the inclusion of IMU data into the estimation. Flow, IMU: whether optical flow and IMU data were included as input. θ: whether the rotation part of the CNN prediction or the IMU orientations were used to reconstruct the final trajectory; in all cases, the translation was predicted by the CNN. Letter indices correspond to Figure 10.

      Methods               Avg. absolute error [mm/°]              Final drift [mm]
  #   Flow  IMU  θ from   tx    ty    tz     θx    θy    θz     min.   med.   max.
  A   no    yes  CNN      6.56  7.23  16.70  0.94  2.65  2.80   3.12   29.22  186.83
  B   yes   no   CNN      8.89  6.61  5.73   5.21  7.38  4.01   3.22   27.34  139.02
  C   yes   yes  CNN      5.16  2.67  4.43   0.96  3.54  2.85   2.54   15.07  55.20
  D   yes   no   IMU      2.98  2.57  4.79   0.19  0.21  0.13   1.33   11.43  42.94
  E   yes   yes  IMU      2.75  2.41  4.36   0.19  0.21  0.13   0.76   10.42  35.22
One possible explanation is that even though the trajectory seems globally smooth, the time difference between two successive frames is so small that the relative transformation parameters are actually quite noisy (this can be observed, for instance, in the ground truth curve of Figure 6). At this scale, the hand trembling of the operator but also the potential jittering noise of the tracking system can become significant.

We still think that temporal information could be relevant for this problem, but it should probably be included at a larger scale and would therefore require changes in our approach that are more significant than adding a recurrent layer to our network.
Figure 10: Results of various experiments on dataset #4 regarding the inclusion of IMU data into the estimation (maximum center error and final drift, in mm). Letter indices correspond to Table 3. Boxplots represent minimum, lower quartile, median, upper quartile and maximum.
Figure 11: Comparison of the trajectories reconstructed with and without incorporation of the IMU in the neural network. The four selected sweeps were average cases for the four types of sweeps of dataset #4 (respectively basic, shift, wave, tilt).
this turned out to slightly outperform the aforementioned methods, presumably because of the improved angular resolution of the employed IMU sensors and the lack of image-based cues for vanishing relative rotations. Combining these two strategies, i.e. using the CNN with all available input to predict the translation and resorting to the IMU orientations for trajectory estimation (E), led to a further improvement, yielding a median final drift of merely 10.4 mm across all types of sweep motions. This is equivalent to a median normalized drift of 5.2%, meaning that for every 10 cm, the reconstruction might be off by around 5 mm.

The length of the recorded sweep, and thus the elevational translation, is an essential factor for a broad variety of clinical applications. Figure 12.a compares the predicted sweep lengths with their actual lengths, showing a strong correlation of approximately 0.9. In Figure 12.b, we group all length errors based on the sweep type, indicating broad consistency across the different kinds of motion. The overall median length error was 6.84 mm (3.4%). The shift category contains more outliers due to the very sparse but large and abrupt motions happening during such acquisitions.
Figure 13 shows the drift and the length error obtained on this dataset using five different networks:
[Figure 12 data: correlation ρ = 0.90, R² = 0.79 between predicted and actual sweep lengths (mm).]

Figure 12: (a) Comparison of the estimated sweep lengths with respect to the ground truth lengths. (b) Distribution of the length errors split across the different sweep types.
(C) a network trained on the forearms dataset, with its two last layers subsequently fine-tuned on the carotid dataset;

(D) a network trained on the forearms dataset, then fully fine-tuned on the carotid dataset;

(E) a network trained from scratch on both forearms and carotid datasets, but with a stronger weight on the carotid sweeps so that both datasets have an overall similar impact.

Figure 13: Results of various experiments on dataset #5 regarding the generalization capabilities of the networks from forearms to carotid sweeps: final drift (mm) and length error (mm) for (A) trained on forearms only, (B) trained on carotid only, (C) fine-tuned to carotid (last layers), (D) fine-tuned to carotid (whole network), and (E) trained on forearms+carotid.

The results are plotted in Figure 14. Unsurprisingly, there is a significant difference between a network only trained on forearms and the networks that have been refined on the carotid dataset. However, a perhaps more unexpected observation is that using data from a single subject is already sufficient to capture most of the difference between the two domains. Adding more subjects tends to slightly decrease the errors, but the added value seems more subtle, with a much higher p-value around $10^{-2}$.
In summary, our experiments seem to indicate that although our method is sensitive to the kind of sweeps used during training, the data and time required
Figure 14: Errors on the carotid dataset #5 as a function of the number of sweeps used for fine-tuning a network originally trained on the forearms dataset. Dots represent the average and error bars represent the standard deviation across the 100 sweeps.
Figure 15: Reconstruction of a very long ultrasound sweep (more than 60 cm) across the full leg, showing the measurement of the great saphenous vein. Many ultrasound frames are skipped and the vessel was segmented for the sake of visualization.
Table 4: Error metrics for the three different approaches used as interpolation methods between the first and the last frames of the sweeps of dataset #2.
6. Discussion
Since our approach does have some constraints, we list here all the shortcomings that we could identify and how to possibly overcome them.
First, we assumed that the sweeps were acquired in a fixed direction, for instance proximal to distal. Applying our algorithm on a sweep with the opposite direction would therefore yield a mirrored result. This constraint was paramount in order to train our networks, as including both directions in the training set made the estimation of the out-of-plane translation significantly ambiguous. However, this limitation is not specific to our method; it is rather due to the symmetry of the trajectory estimation problem, which makes it ill-posed. Besides, we deem that enforcing the direction of the probe during acquisition would be a reasonable constraint for the clinician (for instance by drawing a visual cue on the probe).
More importantly though, this also means that no back-and-forth motion can be estimated by the network. This is a problem when the organ of interest does not fit into a single US frame and would need to be swept over several times. Despite our efforts, we were not able to find a way to detect frames where the main direction of the probe reverses, so we did not include such motions in our datasets. Even the IMU acceleration signal was too noisy to reliably detect changes of direction. A workaround would be to acquire multiple sweeps with sufficient overlap so that they could be registered together.
We also expect the accuracy of our approach to depend, at least to some extent, on the ultrasound acquisition parameters. This is true for the brightness, the contrast and the framerate, but also for the image depth, which changes the image geometry and could hinder a network trained with a fixed input size. A potential way to help the network cope with these variations would be to store all such acquisition parameters of the ultrasound system. We could then incorporate them within the network architecture so that they can be directly used by the network, just like we did with the IMU information.
The system will also be dependent on the type of probe used, i.e. linear vs. convex. Since the image acquisition happens in a polar coordinate system instead of a Cartesian grid, the speckle patterns of the images will look different and therefore have to be treated adequately. This would have been a serious problem if we had directly used the optical flow as the in-plane motion, similarly to the standard speckle decorrelation approach. Nevertheless, we believe that our statistical model can figure out, during its training, how to compensate for such artifacts. This does mean, however, that we would need to train a dedicated network for each type of probe.
Finally, while our approach does not require access to any raw data like RF or in-phase and quadrature (IQ) signals, our experiments did show that the performance is optimal when we disable the speckle filter of the B-mode images. Systems relying on a frame-grabber without the possibility to significantly reduce the amount of filtering are therefore feasible, but likely to produce less accurate trajectory reconstructions.
6.2. Conclusion

This paper introduced a novel method for the challenging task of 3D reconstruction of freehand ultrasound sweeps based on deep learning. We showed that convolutional neural networks constitute a suitable replacement for the standard approach of speckle decorrelation, since they are composed of similar basic operations but can be trained to solve the problem in an end-to-end manner. This new way of addressing the problem alleviates the need for accurately modeling the influence of speckle on the image intensities, and instead leverages a large quantity of tracked ultrasound sweeps. Another benefit of our approach is that it does not require any raw data that is difficult to extract from an ultrasound system. However, it is able to use additional information, if available, to improve the accuracy of its prediction. Such information can be the result of pre-computations, such as an optical flow vector field, or the output of external sensors, like the orientation of an IMU chip.
The thorough experiments and evaluations that we provide also constitute a contribution of our work. To the best of our knowledge, no study on freehand 3D ultrasound reconstruction has been tested on such a large database. We indeed worked on 800 in vivo freehand ultrasound sweeps with very diverse trajectories that cover the potential motions that can occur during an actual ultrasound sweep. The main findings of our experiments are the following:

• The proposed approach based on deep learning generates much more accurate trajectories than the existing baseline methods, reaching normalized median length errors of 3.4% on our largest dataset.
This does not mean that speckle decorrelation is obsolete; it can potentially be improved using some of our results. Opening the black box of our neural network, for instance by studying the learnt convolution kernels or visualizing the relevant features, could provide interesting insights that could help
Acknowledgments

The authors would like to thank ACMIT (Vienna, Austria) and piur Imaging (Vienna, Austria) for their help with the IMU integration and the mount on the ultrasound probe. We also thank Steven Rogers and Richard Pole (IVS, Manchester, UK) for their advice on the ultrasound sweep acquisitions. The authors have benefited from an H2020-FTI grant (number 760380) delivered by the European Union.
References
Afsham, N., Najafi, M., Abolmaesumi, P., Rohling, R., 2014. A generalized correlation-based model for out-of-plane motion estimation in freehand ultrasound. IEEE Transactions on Medical Imaging 33, 186–199.

Afsham, N., Rasoulian, A., Najafi, M., Abolmaesumi, P., Rohling, R., 2015. Nonlocal means filter-based speckle tracking. IEEE Transactions on Ultrasonics, Ferroelectrics, and Frequency Control 62, 1501–1515.

Chang, R.F., Wu, W.J., Chen, D.R., Chen, W.M., Shu, W., Lee, J.H., Jeng, L.B., 2003. 3-D US frame positioning using speckle decorrelation and image registration. Ultrasound in Medicine and Biology 29, 801–812.

Chen, J.F., Fowlkes, J.B., Carson, P.L., Rubin, J.M., 1997. Determination of scan-plane motion using speckle decorrelation: Theoretical considerations and initial test. International Journal of Imaging Systems and Technology 8, 38–44.

Chiu, J.P., Nichols, E., 2015. Named entity recognition with bidirectional LSTM-CNNs. arXiv preprint arXiv:1511.08308.

Conrath, J., Laporte, C., 2012. Towards improving the accuracy of sensorless freehand 3D ultrasound by learning, in: International Workshop on Machine Learning in Medical Imaging, Springer. pp. 78–85.
Dosovitskiy, A., Fischer, P., Ilg, E., Hausser, P., Hazirbas, C., Golkov, V., v.d. Smagt, P., Cremers, D., Brox, T., 2015. FlowNet: Learning optical flow with convolutional networks, in: IEEE International Conference on Computer Vision (ICCV).

Fischler, M.A., Bolles, R.C., 1981. Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM 24, 381–395.

Fischler, M.A., Bolles, R.C., 1987. Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography, in: Readings in Computer Vision. Elsevier, pp. 726–740.

Franz, A.M., Haidegger, T., Birkfellner, W., Cleary, K., Peters, T.M., Maier-Hein, L., 2014. Electromagnetic tracking in medicine: a review of technology, validation, and applications. IEEE Transactions on Medical Imaging 33, 1702–1725.

Gao, H., Huang, Q., Xu, X., Li, X., 2016. Wireless and sensorless 3D ultrasound imaging. Neurocomputing 195, 159–171.

Gee, A.H., Housden, R.J., Hassenpflug, P., Treece, G.M., Prager, R.W., 2006. Sensorless freehand 3D ultrasound in real tissue: speckle decorrelation without fully developed speckle. Medical Image Analysis 10, 137–149.

Ghanbari, M., 1990. The cross-search algorithm for motion estimation (image coding). IEEE Transactions on Communications 38, 950–953.
Hassenpflug, P., Prager, R.W., Treece, G.M., Gee, A.H., 2005. Speckle classification for sensorless freehand 3-D ultrasound. Ultrasound in Medicine and Biology 31, 1499–1508.

He, K., Sun, J., Tang, X., 2010. Guided image filtering, in: European Conference on Computer Vision, Springer. pp. 1–14.

Hennersperger, C., Karamalis, A., Navab, N., 2014. Vascular 3D+t freehand ultrasound using correlation of Doppler and pulse-oximetry data, in: International Conference on Information Processing in Computer-Assisted Interventions, Springer. pp. 68–77.

Hochreiter, S., Schmidhuber, J., 1997. Long short-term memory. Neural Computation 9, 1735–1780.

Hossack, J.A., Sumanaweera, T.S., Napel, S., Ha, J.S., 2002. Quantitative 3-D diagnostic ultrasound imaging using a modified transducer array and an automated image tracking technique. IEEE Transactions on Ultrasonics, Ferroelectrics, and Frequency Control 49, 1029–1038.

Housden, R., Gee, A.H., Prager, R.W., Treece, G.M., 2008a. Rotational motion in sensorless freehand three-dimensional ultrasound. Ultrasonics 48, 412–422.

Housden, R.J., Gee, A.H., Treece, G.M., Prager, R.W., 2006. Sensorless reconstruction of freehand 3D ultrasound data, in: Larsen, R., Nielsen, M., Sporring, J. (Eds.), Medical Image Computing and Computer-Assisted Intervention – MICCAI 2006: 9th International Conference, Copenhagen, Denmark, October 1-6, 2006. Proceedings, Part II, Springer Berlin Heidelberg, Berlin, Heidelberg. pp. 356–363.
Housden, R.J., Treece, G.M., Gee, A.H., Prager, R.W., 2008b. Calibration of an orientation sensor for freehand 3D ultrasound and its use in a hybrid acquisition system. BioMedical Engineering OnLine 7, 5.

Kallel, F., Bertrand, M., Meunier, J., 1994. Speckle motion artifact under tissue rotation. IEEE Transactions on Ultrasonics, Ferroelectrics, and Frequency Control 41, 105–122.

Kan, A.R., Timmer, G.T., 1987. Stochastic global optimization methods part II: Multi level methods. Mathematical Programming 39, 57–78.

Lang, A., Mousavi, P., Fichtinger, G., Abolmaesumi, P., 2009. Fusion of electromagnetic tracking with speckle-tracked 3D freehand ultrasound using an unscented Kalman filter, in: Progress in Biomedical Optics and Imaging - Proceedings of SPIE.

Lang, A., Mousavi, P., Gill, S., Fichtinger, G., Abolmaesumi, P., 2012. Multi-modal registration of speckle-tracked freehand 3D ultrasound to CT in the lumbar spine. Medical Image Analysis 16, 675–686.

Laporte, C., Arbel, T., 2011. Learning to estimate out-of-plane motion in ultrasound imagery of real tissue. Medical Image Analysis 15, 202–213.

LeCun, Y., Bengio, Y., Hinton, G., 2015. Deep learning. Nature 521, 436–444.

Morrison, D., McDicken, W., Smith, D., 1983. A motion artefact in real-time ultrasound scanners. Ultrasound in Medicine and Biology 9, 201–203.

Mozaffari, M.H., Lee, W.S., 2017. Freehand 3-D ultrasound imaging: A systematic review. Ultrasound in Medicine and Biology 43, 2099–2124.

Nagaraj, Y., Benedicks, C., Matthies, P., Friebe, M., 2016. Advanced inside-out tracking approach for real-time combination of MRI and US images in the radio-frequency shielded room using combination markers, in: Engineering in Medicine and Biology Society (EMBC), 2016 IEEE 38th Annual International Conference of the, IEEE. pp. 2558–2561.
Prager, R.W., Gee, A.H., Treece, G.M., Cash, C.J., Berman, L.H., 2003. Sensorless freehand 3-D ultrasound using regression of the echo intensity. Ultrasound in Medicine & Biology 29, 437–446.

Prevost, R., Salehi, M., Sprung, J., Ladikos, A., Bauer, R., Wein, W., 2017. Deep learning for sensorless 3D freehand ultrasound imaging, in: Descoteaux, M., Maier-Hein, L., Franz, A., Jannin, P., Collins, D.L., Duchesne, S. (Eds.), Medical Image Computing and Computer-Assisted Intervention – MICCAI 2017, Springer International Publishing, Cham. pp. 628–636.

Rivaz, H., Zellars, R., Hager, G., Fichtinger, G., Boctor, E., 2007. 9C-1 Beam steering approach for speckle characterization and out-of-plane motion estimation in real tissue, in: Ultrasonics Symposium, 2007. IEEE, IEEE. pp. 781–784.

Sainath, T.N., Vinyals, O., Senior, A., Sak, H., 2015. Convolutional, long short-term memory, fully connected deep neural networks, in: Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on, IEEE. pp. 4580–4584.

Salehi, M., Prevost, R., Moctezuma, J.L., Navab, N., Wein, W., 2017. Precise ultrasound bone registration with learning-based segmentation and speed of sound calibration, in: Medical Image Computing and Computer-Assisted Intervention – MICCAI 2017, Springer International Publishing, Cham. pp. 682–690.

Simonyan, K., Zisserman, A., 2015. Very deep convolutional networks for large-scale image recognition. ICLR 2015.

Tetrel, L., Chebrek, H., Laporte, C., 2016. Learning for graph-based sensorless freehand 3D ultrasound, in: Machine Learning in Medical Imaging: 7th International Workshop, MLMI 2016, Held in Conjunction with MICCAI 2016, …
Toews, M., Wells, W.M., 2018. Phantomless auto-calibration and online calibration assessment for a tracked freehand 2-D ultrasound probe. IEEE Transactions on Medical Imaging 37, 262–272.

Tuthill, T.A., Krücker, J., Fowlkes, J.B., Carson, P.L., 1998. Automated three-dimensional US frame positioning computed from elevational speckle decorrelation. Radiology 209, 575–582.

Wein, W., Khamene, A., 2008. Image-based method for in-vivo freehand ultrasound calibration, in: SPIE Medical Imaging 2008, San Diego.

Xingjian, S., Chen, Z., Wang, H., Yeung, D.Y., Wong, W.K., Woo, W.c., 2015. Convolutional LSTM network: A machine learning approach for precipitation nowcasting, in: Advances in Neural Information Processing Systems, pp. 802–810.

Zbontar, J., LeCun, Y., 2015. Computing the stereo matching cost with a convolutional neural network, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1592–1599.