2018 3D Freehand Ultrasound Without External Tracking Using Deep

Accepted Manuscript
3D Freehand Ultrasound Without External Tracking Using Deep

Learning
Raphael Prevost, Mehrdad Salehi, Simon Jagoda, Navneet Kumar,

Julian Sprung, Alexander Ladikos, Robert Bauer, Oliver Zettinig,
Wolfgang Wein
PII: S1361-8415(18)30371-2
DOI: 10.1016/j.media.2018.06.003
Reference: MEDIMA 1380
To appear in: Medical Image Analysis
Received date: 8 February 2018

Revised date: 5 June 2018
Accepted date: 6 June 2018
Please cite this article as: Raphael Prevost, Mehrdad Salehi, Simon Jagoda, Navneet Kumar,
Julian Sprung, Alexander Ladikos, Robert Bauer, Oliver Zettinig, Wolfgang Wein, 3D Freehand Ul-
trasound Without External Tracking Using Deep Learning, Medical Image Analysis (2018), doi:
10.1016/j.media.2018.06.003
This is a PDF file of an unedited manuscript that has been accepted for publication. As a service
to our customers we are providing this early version of the manuscript. The manuscript will undergo
copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please
note that during the production process errors may be discovered which could affect the content, and
all legal disclaimers that apply to the journal pertain.
ACCEPTED MANUSCRIPT
2D Ultrasound Clip Frame-to-Frame

Motion Estimation
3D Ultrasound
Reconstruction
I1 I2 IMU
(Optional)
Motion Estimation
I3
Neural Network
Our Method I2
I1
1
ACCEPTED MANUSCRIPT
Highlights
• A system for 3D freehand ultrasound reconstruction without external

tracking.
• A neural network to estimate the motion of the probe between two suc-
cessive frames.
• Integration of an IMU to improve the accuracy even further.
• Extensive validation on a large clinically relevant dataset.
• Unprecedented reconstruction accuracy, especially for elongated sweeps.
2
ACCEPTED MANUSCRIPT
3D Freehand Ultrasound Without External Tracking

Using Deep Learning
Raphael Prevosta,∗, Mehrdad Salehia,b , Simon Jagodaa , Navneet Kumara ,

Julian Sprungc , Alexander Ladikosa , Robert Bauerc , Oliver Zettiniga ,
Wolfgang Weina
a ImFusion GmbH, Munich, Germany
b Computer Aided Medical Procedures (CAMP), TU Munich, Germany
c Piur Imaging GmbH, Vienna, Austria
Abstract
This work aims at creating 3D freehand ultrasound reconstructions from 2D

probes with image-based tracking, therefore not requiring expensive or cum-
bersome external tracking hardware. Existing model-based approaches such as
speckle decorrelation only partially capture the underlying complexity of ultra-
sound image formation, thus producing reconstruction accuracies incompatible
with current clinical requirements. Here, we introduce an alternative approach
that relies on a statistical analysis rather than physical models, and use a con-
volutional neural network (CNN) to directly estimate the motion of successive
ultrasound frames in an end-to-end fashion. We demonstrate how this technique
is related to prior approaches, and derive how to further improve its predictive
capabilities by incorporating additional information such as data from inertial
measurement units (IMU). This novel method is thoroughly evaluated and an-
alyzed on a dataset of 800 in vivo ultrasound sweeps, yielding unprecedentedly
accurate reconstructions with a median normalized drift of 5.2%. Even on long
sweeps exceeding 20 cm with complex trajectories, this allows to obtain length
measurements with median errors of 3.4%, hence paving the way toward trans-
lation into clinical routine.
Keywords: 3D freehand ultrasound; deep learning; motion estimation; inertial
∗ Corresponding author. Address: ImFusion GmbH, Agnes-Pockels-Bogen 1, 80992

München, Germany. E-mail: prevost@imfusion.de
Preprint submitted to Medical Image Analysis June 7, 2018

ACCEPTED MANUSCRIPT
measurement unit
2D Ultrasound clip 3D Ultrasound

Reconstruction
Proposed Method
(Optional)
IMU
I1 I2
Motion Estimation
I4
Neural Network
I3
I2
I1
Figure 1: Overview of our paper. (Top) The problem that we are trying to solve is the 3D re-
construction of any 2D ultrasound clip acquired without any external tracking. (Bottom) Our
approach is based on a frame-to-frame estimation. We train a neural network to estimate
both the translation and the rotation of the probe between successive frames, optionally with
the help of a small inertial measurement unit mounted on the probe.
1. Introduction
Ultrasound imaging (US) combines a number of advantages as a medical

modality: it is affordable, safe for both the patient and the clinician, and is
convenient to set up and use. This unique combination of properties makes it one
5 of the most popular imaging modalities for both diagnostic and interventional
applications. For a long time though, the range of its applications was limited
4
ACCEPTED MANUSCRIPT
due to its inability to produce three-dimensional data, which is required for

many clinical scenarios, for instance to reliably measure the extent of structures
of interest, or to perform registration with pre-operative data.
10 In the last decades, a significant effort has been dedicated to the development
of 3D ultrasound systems, and various solutions have been successfully proposed
(see review by Mozaffari and Lee (2017)). These approaches can generally be
classified into two distinct categories, each following a fundamentally different
approach to augment the 2D image plane to the third dimension. On the one
15 hand, dedicated probes with a 2D arrangement of transducer elements instead
of simple 1D arrays were designed, either forming a full 2D matrix or combining
perpendicular rows as in Hossack et al. (2002). Such systems are able to directly
acquire three-dimensional data as a stack of images but regularly come with
a number of disadvantages such as higher cost, larger footprint making the
20 scan more cumbersome, and decreased image quality due to the rectangular
arrangement of transducer elements.
On the other hand, it is possible to reconstruct a volume from a collection
of independent 2D frames if the position and orientation in space of the indi-
vidual images are known. Such a resampling of a set of arbitrary slices within
25 a volume of interest is commonly referred to as US compounding or reconstruc-
tion (Housden et al., 2006). While integrated hardware solutions in the form
of mechanical probes (wobbler ) combine the transducer array with a linear or
rotary stage, which comes with its own set of disadvantages, external track-
ing systems are frequently used to record pose information for the acquired US
30 frames. Hereby, either a mechanical, optical, or electro-magnetic (EM) tracking
system continuously measures positional and orientational data of a tracking
target rigidly mounted to the ultrasound probe. Such systems allow to record
very long ultrasound sweeps with arbitrary trajectories, which is useful for some
applications, for instance blood vessel mapping (Hennersperger et al., 2014) or
35 bone reconstruction (Lang et al., 2012; Salehi et al., 2017). However, such sys-
tems inherently require an often expensive tracking system and pose additional
challenges such as maintaining the line-of-sight to the camera or the avoidance of
5
ACCEPTED MANUSCRIPT
magnetic influences in proximity of the scanned anatomy. Also inside-out track-

ing solutions, for instance the one presented by Nagaraj et al. (2016), where the
40 camera is directly attached to transducer, do not alleviate these constraints as
line-of-sight issues remain.
Ideally, 3D US acquisitions would be performed with conventional linear or
convex transducers without an external tracking system, i.e. without additional
hardware such as tracking cameras or EM field generators. A compromise in
45 terms of hardware can be found in modern inertial measurement units (IMUs),
which combine a tri-directional magnetometer, a rate gyroscope and an ac-
celerometer in small chips of less than 1 cm in size. Often found in smartphones,
IMUs continue to be integrated in other hand-held devices as their small foot-
print would not degrade the user experience or the ergonomics. In fact, some
50 commercial US probes already contain such chips, but reliable information on
their usage in clinical products remains scarce. As shown by Housden et al.
(2008a), who used an IMU sensor to estimate the orientation of an US probe,
such devices are adequate to obtain rotatory measurements. Yet, acceleration
data is insufficient for the computation of sweep trajectories due to the very low
55 signal-to-noise ratio and the required double numerical integration.
Ultimately, even with IMU orientation data available, the translatory por-
tion of a transducer’s motion path needs to be estimated by relying solely on the
ultrasound images themselves. As we will discuss more thoroughly below, algo-
rithms based on statistics and computer vision have been developed to track the
60 content of the image from one frame to the next in order to estimate the relative
motion of the ultrasound probe. In this paper, we introduce a novel approach
to the problem of the trajectory estimation of a hand-held ultrasound probe.
Such a system would be extremely beneficial for many applications (e.g. Lang
et al. (2012); Salehi et al. (2017)), and in particular point-of-care ultrasound.
65 This is especially true as there seems to be a trend toward portable ultrasound
6
ACCEPTED MANUSCRIPT
12
scanners that are directly connected to and recorded on a smartphone , some
3
of them being even wireless .
The remainder of the paper is organized as follows. In Section 2, we outline

70 the relevant literature and describe the contributions of this article. Thereafter,
Section 3 presents the proposed methodology for the estimation of 3D ultrasound
trajectories. Section 4 is dedicated to outline our experimental setup and the
various datasets, which form the basis for the results reported in Section 5.
Finally, Section 6 discusses the obtained results including remaining challenges
75 and concludes the paper.
2. Related Work and Contributions
The image-based estimation of the trajectory of an ultrasound probe is a

challenging problem that has been studied for more than two decades. Most of
the existing literature on the topic addresses this task using an approach named
80 speckle decorrelation, which dates back to the seminal work by Chen et al.
(1997). Its basic idea is that the motion between two successive US frames can
be decomposed into two parts that can be decoupled:
• The in-plane motion, which is a translation along the US plane, is sup-

posedly the easiest to recover because it is not really ambiguous. If the
85 second image were an exact copy of the first one with all pixels shifted
by one scanline, one could easily infer that the motion of the probe was
lateral with a given amplitude. Of course, real motions are not that sim-
ple but well-established methods like block-matching Ghanbari (1990) or
optical flow (Farnebäck, 2003) are able to estimate a non-homogeneous
1 Philips Lumify ,
R Koninklijke Philips B.V., Netherlands www.lumify.philips.com (ac-
cessed January 2018)

2 Butterfly Network iQ ,
R Butterfly Network, Inc., NY, USA www.butterflynetwork.com
(accessed January 2018)

3 Clarius ,
R Clarius Mobile Health, BC, Canada www.clarius.me (accessed January 2018)
7
ACCEPTED MANUSCRIPT
90 displacement field from two images.
• The out-of-plane motion (also called elevational displacement), on the

other hand, is much more challenging because the actual content of the
two images is different. A change in the visible structures could therefore
be related to either a motion of the probe or a change in the structure
95 from one frame to the next. While large-scale structures cannot be fully
trusted, the very particular speckle noise present in the images turns out
to be useful. Because ultrasound image intensities undergo a point-spread
function not only in the image plane but also in the perpendicular di-
rection, the speckle patterns of two successive frames are correlated: the
100 higher the correlation, the lower the elevational distance. This relationship
can be exploited to get an estimate of the elevational distance from the
correlation of the speckle patterns, hence the name speckle decorrelation.
Far from trivial, the physical and mathematical modeling of such a rela-
tionship has been the topic of numerous research papers. Tuthill et al. (1998)
105 generalized the original formulation of Chen et al. (1997) to include the depen-
dency on the depth that the image patches have been extracted from. They also
pointed out that the correlation hypothesis is only valid in so-called fully devel-
oped speckle areas and therefore introduced a method to mask out all the other
areas from their analysis. Detecting such areas is a challenging problem itself
110 and various methods relying on local image statistics have been proposed (Has-
senpflug et al., 2005). Rivaz et al. (2007) reported better results by using a
beam-steering approach, which however requires full control on the ultrasound
acquisition system and is therefore not always usable.
Prager et al. (2003) as well as Gee et al. (2006) further refined the model by
115 adapting the correlation curves to the local scattering of the scanned tissues. A
similar idea was exploited by Laporte and Arbel (2011) where the decorrelation
curves are tailored to the local intensity statistics of the image, enabling a bet-
ter generalization across transducers and medium changes. Afsham et al. (2014)
introduced a new statistical model based on a Rician inverse Gaussian distri-
8
ACCEPTED MANUSCRIPT
120 bution, which alleviates the need to discard parts of the image. This method
was further extended in Afsham et al. (2015), in which the authors develop a
denoising approach with a dedicated non-local means filtering that extracts the
relevant speckle patterns. This makes the computations more practical and also
less tissue-dependent.
125 While all those methods seem to produce satisfactory results on synthetic
data and phantom acquisitions, most of the assumed mathematical hypotheses
fail on real clinical images in which the image appearance is the superposition of
various phenomena (moving structures, signal post-processing, etc.). Further-
more, it has also been shown that the displacement of speckle patterns does not
130 always correspond to the motion of the underlying tissue (Morrison et al., 1983;
Kallel et al., 1994). This often results in errors in in vivo acquisitions that ac-
cumulate even if marginal, producing a significant drift, which in turn prevents
those methods to be used in a clinical setting. Such a drift can be mitigated
by using an external source of information, as proposed by Lang et al. (2009),
135 who fused the speckle decorrelation approach with EM tracking. While this
approach indeed produces an overall better estimate of the probe trajectory by
removing the jittering of the EM tracking, it still does not solve the problem of
designing a tracking-less portable 3D ultrasound system as in Gee et al. (2006).
Acknowledging the difficulty of modeling the whole US image acquisition
140 pipeline, several studies started to incorporate more and more machine learning
components, either to (i) refine the model (Laporte and Arbel, 2011), (ii) detect
uncertainties in the estimates and try to correct them (Conrath and Laporte,
2012), (iii) skip the estimates that are not reliable (Tetrel et al., 2016). However,
no work so far has aimed at replacing the whole speckle decorrelation approach
145 with a method fully based on machine learning. Apart from the known difficulty
of regression tasks in comparison to classification problems, one reason for this
research gap might lie in the problem of feature engineering. It is indeed not
straightforward to define image features that would be meaningful for motion
estimation. Toews and Wells (2018) used SIFT features and matched them
150 across ultrasound frames in order to calibrate an ultrasound system. However,
9
ACCEPTED MANUSCRIPT
this method is designed for ultrasound datasets acquired by sweeping multiple

times over a structure of interest. This is fundamentally different to what we are
trying to achieve, since typical sweeps do not have any overlap between frames
(see example in Figure 1).
155 Recently though, deep learning approaches have enabled significant break-
throughs at even the most challenging image analysis tasks (LeCun et al., 2015).
Moreover, since convolutional neural networks do not require the definition of
explicit image features, they appear as a suitable regressor for our problem.
160 In this paper, we will therefore investigate the potential of deep learning for
image-based ultrasound trajectory estimation. Our contributions with respect
to the previous works are threefold:
1. By using a deep learning-based approach, we explore a path that is sig-

nificantly different from the usual pipeline. We show how speckle decor-
165 relation and convolutional neural networks methods can be related, and
describe a network architecture that is able to learn the complete three-
dimensional motion of the ultrasound probe between two successive frames.
2. We conduct different studies on a very large dataset composed of 780

in vivo ultrasound sweeps acquired on volunteers. This is, to the best
170 of our knowledge, the largest and most clinically representative database
used in a study on freehand 3D ultrasound reconstruction without exter-
nal tracking. Our experiments show that our method achieves significant
improvements in terms of accuracy over baseline approaches and yields
reconstructed trajectories with very limited drift. We provide extensive
175 analysis on the performance of our methods, and its dependencies on the
network input and the scanned anatomies.
3. We also provide a number of additional strategies to make the estimates

more robust and accurate, including the pre-processing of the images with
the optical flow and the fusion with IMU sensor data.
10
ACCEPTED MANUSCRIPT
180 This study is based on our previous work (Prevost et al., 2017), but includes
the following extensions and further contributions:
1. While the tracking estimation presented in our conference paper was orig-
inally limited to US frames, we integrated an IMU sensor to our system.
Note that in doing so, none of the advantages of sensorless 3D recon-
185 struction are sacrificed as IMU chips mounted on or integrated within US
probes are completely independent from external hardware. In this work,
we describe how to use such orientation data within our neural network
and show how this can greatly improve the accuracy of both the orienta-
tion but also the position estimation of the probe.
190 2. We have recorded an additional dataset of 700 US sweeps in addition to our

original database of 100 recordings. The new sweeps have been acquired
with much more realistic motions including significant rotations, both in-
plane and out-of-plane, following guidance from our clinical partners.
3. Based on this comprehensive dataset, we extend our parameter analysis

195 and study the impact of various acquisition parameters (such as the image
resolution, the image pre-processing, the network architecture, etc.) on
the performance of our system in a detailed quantitative evaluation.
4. Finally, we conducted additional studies on the generalization capabilities

of our approach, and discuss strategies to adapt our networks to new
200 datasets.
3. Methods
3.1. Problem statement
This section aims at defining the problem that we are addressing, as well as
the main notations used throughout the paper.
11
ACCEPTED MANUSCRIPT
205 3.1.1. Main notations

Given an ordered sequence of 2D ultrasound images (In )n , the problem
that we are trying to solve is the reconstruction of the trajectory of the ul-
trasound probe at each time point tn corresponding at the acquisition of the
n-th frame In . This trajectory can be represented by a sequence of matrices
210 (T n )n encoding arbitrary rigid transformations (3 degrees of freedom for the
translation, 3 for the rotation), which can be encoded for instance as homoge-
~ ∈ R6 . Since there
neous matrices and parametrized by a vector of parameters p
is no absolute coordinate system, we can set without loss of generality T 0 = Id
(the identity matrix) and only consider relative transformations from now on.
215 More particularly, we are interested in estimating the relative transformation
T n−1→n = T −1
n−1 .T n between each frame In and the previous one. T n−1→n
is also a rigid transformation and can therefore be parametrized by a 6-valued

vector p
~n−1→n .
We are then looking for a function f that estimates such parameters, given
220 a pair of images:
p
~ n−1→n ≈ f (In−1 , In ) . (1)
Given such a function, one can then iterate over all pairs of successive images
and compute any matrix T k by chain-multiplying the previous estimates of
the relative transformations T 1→2 , T 2→3 , T 3→4 , ..., T k−1→k , as depicted in
Figure 1. Designing the function f (or an algorithm producing its output) is
225 therefore the crux of the problem. A very high accuracy is required, since even
small errors will be propagated and induce a drift in the trajectory.
3.1.2. Transformation parametrization

Rigid motions can be represented with 3 parameters for the translational
component and 3 parameters for the rotational component: p
~n−1→n = (tx , ty , tz , θx , θy , θz ).
230 While the tx , ty and tz are straightforward to define, several options exist for
θx , θy and θz . In this paper, we choose to always represent rotation with Euler
angles for the sake of simplicity. Since we are here considering relative (frame-to-
frame) transformations that only contain minor rotations of less than 1 degree,
12
ACCEPTED MANUSCRIPT
our approach is not prone to the usual problems of Euler angles such as ambigu-
235 ity or gimbal lock. The alternative would have been to use quaternions Hamilton
(1853), but they are not as easily interpretable and their normalization require-
ment could have raised other issues during their estimation.
Working with small transformations also allowed us to manipulate them di-
rectly via their angle representation. Indeed, since rigid transformations are not
240 a linear sub-space, the mathematically rigorous way of performing operations
on transformations would have been to use the exponential and logarithmic
maps of SO(3) or SE(3), see Govindu (2004) for instance. Our early experi-
ments showed that there was no benefit in using the logarithm representation
of rotations, so we did not use that parametrization.
245 3.2. From speckle decorrelation to convolutional neural networks
In this section, we describe our method and show how it is related to the
standard approach of speckle decorrelation. To that end, we will first recall
the different high-level steps of most speckle decorrelation algorithms. First,
the two images are divided into local patches. Optionally, patches that do not
250 satisfy the fully-developed speckle condition, i.e. presence of Rayleigh distri-
bution, are ignored. Then, for every patch of the first image, the normalized
cross-correlation against a set of patches from the second image within its neigh-
borhood is computed. The maximum correlation, as well as the displacement
that has produced it are stored, yielding a 2D displacement map that represents
255 the in-plane motion. The third, elevational component of the displacement field
can afterwards be found via the decorrelation model, which is a function map-
ping the patch correlation (and other statistics) to the distance between the
patches. Ultimately, a rigid transformation is estimated using a robust algo-
rithm, e.g. RANSAC (Fischler and Bolles, 1981), based on the 3D displacement
260 field.
As already mentioned though, and despite the successive refinements pro-
posed in the literature, the decorrelation model is based on physical and mathe-
matical assumptions which do not encompass the whole complexity of the ultra-
13
ACCEPTED MANUSCRIPT
Patch Division + Cross-Correlation
Speckle tx
Masking
ty
output
+ Rigid
Decorrelation
tz
Transform
Model dis Fitting ϴx
co
in-
p
rre ρ pla
c ϴy
m lan lat
ion alo em a a
I1 I2 ot e s ng ent long lon ϴz
ion tz x s y gz
Speckle Decorrelation
Convolutional Neural Network
tx
ty
output
Additional Fully tz
hidden Connected
ϴx
fe layers Layers
at
ur
h
fe igh- ϴy
es at le
I1 I2
m
ap
ur ve
em l ϴz
s ap
s
Convolutions + Pooling + Activation layers
Figure 2: Workflow comparison of speckle decorrelation (top) and convolutional neural net-
work (bottom) for the estimation of the transformation parameters between two successive
images. Related steps in the two approaches have the same color.
sound image formation. Moreover, errors accumulate along the multiple steps
265 of the pipeline. Since this process is repeated for each frame, the inaccuracies
will accumulate and the estimated trajectory will drift noticeably.
Conversely, we propose to use an end-to-end approach where a convolutional
neural network (CNN) will take the pair of images as inputs and directly out-
put the parameters of the transformation. During the training, the prediction
270 errors can thus be back-propagated throughout the whole processing of the pair
of images and therefore help the adjustment of the very first layers.
At first glance, aiming at mimicking the elaborate speckle decorrelation al-

gorithm with a single CNN might seem overly ambitious or appear as resorting
275 to a black-box model. However, as we show in Figure 2, it turns out that the
operations used within the two approaches can coarsely be related. The analogy
is far from perfect, but we believe that it gives some insight on why it makes
sense to use a CNN for this problem:
14
ACCEPTED MANUSCRIPT
• The local cross-correlation operation may be approximated by a set of

280 convolution filters.
• The patch-based approach that aggregates local information corresponds

to the pooling layers of the network.
• The selection of reliable speckle features and areas in the image could be
achieved via the activation layers.
285 However, the more complex steps of the pipeline (the decorrelation model,
the robust transformation fitting, etc.) are now replaced with a combination
of non-linear operations whose modeling capabilities exceed all physical mod-
els (Hornik, 1991).
More practically, our convolutional neural network has a standard architec-
290 ture composed of convolutional, rectified linear unit (ReLU) and pooling layers.
Its output is a vector of six entries representing the parameters of the relative
transformation between the two input frames. Training is performed by ad-
justing the network’s parameters in order to minimize the L2 norm (squared
difference) between the network output and the ground truth. Such a loss func-
295 tion penalizes large deviations to the ground truth which can produce significant
errors since the estimated transformations are chained. Figure 3 depicts the de-
tailed architecture of our network. Similarly to FlowNet (Dosovitskiy et al.),
the pair of two successive frames is fed into the network as a 2-channel image,
so that the information coming from the two images can be coupled from the
300 very first convolutional layer.
Since we initially experienced some trouble with the convergence of the train-
ing process, we also report the parameters of a solver and the weights initializa-
tion strategy that eventually worked consistently. Among the different standard
solvers, we achieved good results with the AdaGrad optimizer with a learning
305 rate of 1. The weights of the network were all initialized with a Gaussian distri-
bution (mean 0.0, std 0.01). A very important trick was to use a large batch size
(500 in our case), that samples pairs of images from as many different datasets as
15
ACCEPTED MANUSCRIPT
possible. We noticed that while using smaller batch sizes could reduce training
time, the local minimum found was not as accurate.
310 Naturally, overfitting has to be mitigated. No weight decay has been ap-
plied during the training, but three dropout layers have been used with a rate
of 0.25. Artificial data augmentation is also often used to prevent overfitting.
Defining meaningful perturbations is however not as straightforward as in clas-
sification, segmentation or registration problems, but we did use the two fol-
315 lowing strategies. First, images were mirrored horizontally (left becomes right
and vice versa), and the corresponding transformations flipped around the x-
axis. Second, we generated additional pairs by considering images that are
non-consecutive. This strategy helps the network to some extent be more ro-
bust to speed variations, but it has to be used with caution. Indeed, adding
320 pairs with too many skipped frames can actually hurt the performance, since it
biases the network toward unrealistic probe velocities. Besides, distant frames
may actually be completely unrelated and perturb the training of the network.
The number of frames that can be skipped depends on the framerate but also
on the average speed of the probe.
325 Another way to alleviate overfitting is to include additional information,
which will be presented in the next subsections.
3.3. Augmenting the network with the optical flow
Thanks to their representative power, convolutional neural networks are the-

oretically able to discover and learn any relevant features for the task they are
330 being trained for (Hornik, 1991). However in practice, it is often useful to
provide any extra information available as additional input.
For our application, we can easily compute the component of the motion
that happens in the plane of the ultrasound image, using for instance a block
matching algorithm or by computing the optical flow between the two frames.
335 Such algorithms produce a two-dimensional dense vector field
T
u
~ (x, y) = (ux (x, y), uy (x, y)) (2)
16
ACCEPTED MANUSCRIPT
Convolution 5x5x64 Convolution 3x3x64

(stride 2) + ReLU (stride 2) + ReLU
tx
Fully Connected (512)

Concatenate as channels
Fully Connected (6)

ty
Output
Convolution 5x5x64 Convolution 3x3x64 tz
(stride 2) + ReLU (stride 2) + ReLU
Images (I1,I2) θx
θy
Max Pooling 2x2 Max Pooling 2x2 θz
(stride 2) (stride 2)
Concatenate
Inertial θxIMU
Optical Flow Measurement Unit θyIMU
(IMU)
θzIMU
Figure 3: Architecture of our convolutional neural networks. The main input (blue) is the pair
of frames encoded as a multi-channel image that is passed through a series of convolutional,
rectified linear unit and pooling layers, and is finally fully connected to a 6-dimensional vector
representing the parameters of the transformation. The two other optional inputs are the
optical flow vector field and the measures of an IMU.
that can be directly encoded as additional input channels. Without any change
in the network architecture, we therefore feed a 4-channel image to the network
that includes the two B-mode images and the two components of the vector
field, as shown in Figure 3. Our implementation is based on the optical flow
340 implementation from Farnebäck (2003) but any similar method could have been
used instead.
3.4. Augmenting the network with an IMU
Non-image information can also easily be leveraged within our approach

by incorporating them into the architecture of the network. Housden et al.
345 (2008a) mounted an inertial measurement unit (IMU) on their ultrasound probe
in order to get an estimation of its orientation. IMUs provide both orientation
and acceleration measurements. As already mentioned in Section 1, the lat-
ter are significantly too noisy to be used for this application, which was noted
in Housden et al. (2008a) but also confirmed by our experiments. This stems
17
ACCEPTED MANUSCRIPT
350 from the low speed (and even lower acceleration) of typical probe trajectories
during sweep acquisition. We therefore only integrate the orientation informa-
tion within our approach. As depicted in Figure 3, we simply concatenate the
three Euler rotation angles θxIM U , θyIM U , θzIM U of the IMU to the 512-valued
vector of the penultimate layer of the network.
355 This information about the orientation of the probe naturally represents a
huge clue for the motion estimation that will significantly boost the performance
of the network, as shown by our experiments in the next sections.
4. Experimental Setup
Since the data acquisition was a key part in our work, we devote this section
360 to a thorough description of our setup.
4.1. Data acquisition
4.1.1. Ultrasound images

Our experiments are based on multiple datasets of 3D optically-tracked ul-
trasound sweeps. All sweeps have been acquired on a Cicada
R research ultra-
365 sound machine (Cephasonics, Inc., Santa Clara, CA, USA). A linear probe with
128 elements at a frequency of 5 MHz was used to generate the images orig-
inally composed from 256 scanlines. We recorded B-mode images at a frame
rate of approximately 35 images per second, without any filtering or scanline
conversion. We believe this is a good trade-off between the displayed images,
370 which are deprived of many details due to the usual post-processing and noise
filtering, and the raw radio-frequency (RF) data that would contain the most
information but are not accessible on all ultrasound systems. In order to ease
the processing of the images, all of them are also resampled to a fixed isotropic
resolution of typically 0.3 mm, which seems to match the scale of the speckle
375 pattern of the images. We will show evidence in Section 5.2.1 that this was
indeed a suitable value.
18
ACCEPTED MANUSCRIPT
4.1.2. Ground truth tracking

Most of the previous work used precise motors or robots to generate a set
380 of training data. However, for the proposed purely machine learning-based
approach, we deemed it important to ensure that the training data matches
the acquisition conditions of clinical practice. Therefore, we decided to acquire
sweeps using a hand-held ultrasound probe, which requires an external tracking
system with sufficient accuracy. Since electromagnetic tracking exhibits some
385 drift but more importantly significant jitter (Franz et al., 2014), we rather used a
Stryker NAV3TM Camera (Stryker Co., Kalamazoo, MI, USA) optical tracking
system originally intended for surgical navigation.
After thorough spatial image-to-sensor calibration as proposed by Wein and
Khamene (2008), we were able to record ground truth transformations with
390 absolute positioning accuracy of around 0.2 mm in terms of translation. As the
ground truth has to be extremely precise on a frame-to-frame basis, we also
assured that the temporal calibration (Salehi et al., 2017) neither induces jitter
nor drift, thanks to the digital interface of the research US system and proper
clock synchronization. To compensate for the different acquisition rates of the
395 ultrasound and the tracking system, transformations are interpolated linearly
and spherically for translations and rotations, respectively.
4.1.3. IMU integration

The choice of a suitable IMU model was made based on a thorough evaluation
of five different sensors in a realistic environment (BNO055, Bosch Sensortec
400 GmbH, Reutlingen, Germany; MPU-9250, InvenSense Inc., San Jose, CA, USA;
PhidgetSpatial Precision 3/3/3, Phidgets Inc., Calgary, AB, Canada; Xsens
MTi-3, Xsens Technologies B.V., Enschede, Netherlands; and Yost 3-Space,
Yost Labs Inc., Portsmouth, OH, USA). A full report on these investigations
would exceed the scope of this article, hence we briefly summarize our efforts in
405 this regard.
We recorded various freehand movements with the mentioned sensors and
the optical ground truth tracking target strapped together. Following the cal-
19
ACCEPTED MANUSCRIPT
ibration procedure outlined below, we picked the Xsens MTi-3-8A7G6 because

the average absolute Euler angle error with respect to ground truth was lowest
410 with around 0.14◦ . A development board, which provides a USB communica-
tion interface to the integrated IMU chip, was screwed to the ultrasound probe
attachment to guarantee the required rigidity between the IMU and the optical
tracking target.
IMUs of this type are capable of directly returning usable Euler orientations
415 (θxIM U , θyIM U , θzIM U ), which can be represented in 3×3 rotation matrices T IM U .
Therefore, no sensor fusion or additional post-processing was done on our part.
In addition, the gyroscope sensors are able to natively measure the angular
velocity ω IM U , which we used for temporal calibration by maximizing the cross-
correlation between ground truth and IMU:

tIM U →GT = arg max ω IM U ∗ ω GT (t), (3)
t
420 where the ground truth angular velocity ω GT , i.e. the first derivative of the
orientation, is computed by approximation with finite differences.
Finally, in order to make the reported orientations compatible between track-
ing sources, we needed to align the orientation of the two tracking coordinate
systems. Unlike Housden et al. (2008b), who calibrated the IMU directly to
425 the ultrasound images, we use the simpler approach of calibrating it to our op-
tical ground truth tracking (which is already calibrated to the US frames) by
modeling their relation as:
T GT
k = R · T IM
k
U
· C, (4)
where, for every frame k, the matrix T GT contains the rotational part of the
ground truth tracking, and C and R denote constant calibration and regis-
430 tration matrices, respectively. The former describes the local transformation
between IMU sensor and optical tracking target, the latter the global trans-
formation between the IMU world coordinate system and the optical tracking
camera. Note that the registration will change if the tracking camera is moved,
but because only relative transformation between successive frames are consid-
20
ACCEPTED MANUSCRIPT
Table 1: List of the different datasets used in this paper. See text for details.
# Anatomy Sweeps Frames Motions Avg. length IMU
1 Phantom 20 7,168 basic 131 mm no
2 Forearms 88 41,869 basic 190 mm no
3 Calves 12 6,647 basic 175 mm no
4 Forearms 600 307,200 all 202 mm yes
5 Carotid 100 21,945 basic+tilt 75 mm yes
435 ered, it cancels out anyway unless orientations are directly compared against
ground truth, e.g. for accuracy estimation. Both matrices are found using nu-
merical global optimization with the MLSL (Kan and Timmer, 1987) algorithm
by minimizing the absolute Euler angle residual error between T GT
i and the
right-hand side of Equation 4.
440 Additionally, we investigated whether a second IMU of the same type mounted
orthogonally to the first one could increase the overall system accuracy. We
could not find any significant orientation preference, which is why only a single
IMU sensor was used throughout the remainder of this work.
4.2. Datasets
445 Our experiments are based on five different datasets that are summarized in
Table 1. The first three datasets are the same that were used in our preliminary
study (Prevost et al., 2017):
1. A set of 20 US sweeps acquired on a BluePhantom

R ultrasound biopsy
phantom (CAE Healthcare, Inc., Sarasota, FL, USA). The images contain
450 mostly speckle but also a variety of masses that are either hyperechoic or
hypoechoic.
2. A set of 88 in vivo tracked US sweeps acquired on the forearms of 12

volunteers. Two different operators acquired at least three sweeps on
both forearms of each participant.
21
ACCEPTED MANUSCRIPT
(a) basic (b) shift (c) wave (d) tilt
Figure 4: Visualization of the four types of trajectories acquired on forearms. (a) Basic sweeps
are typical sweeps that follow a straight vessel. (b) Shift sweeps have been acquired to simulate
the scanning of a vessel that would deviate out of the ultrasound image so that the operator
would have to stop and shift the probe. (c) Wave sweeps are meant to simulate the following
of a tortuous vessel. (d) Tilt sweeps simulate rotations along the x-axis of the frames.
455 3. Another 12 in vivo tracked sweeps acquired on the lower legs on a subset
of the volunteers. This third set was used to assess how the network
generalizes to other anatomies.
However, even though the second dataset (forearms) was already larger than the
ones used in the existing literature, the trajectories were mostly translational
460 and thus not diverse enough to represent the variability observed in clinical
practice. Therefore, we acquired a forth extensive set of 600 in vivo sweeps on
the forearms on another set of 15 volunteers, this time with an IMU mounted
on the ultrasound probe (dataset #4). We asked the operators to deliberately
execute strong but realistic translations and rotations during the recording.
465 Such motions were classified into four types that are represented in Figure 4. For
each arm of each volunteer, the distribution of sweep types was as follows: 6×
basic, 4× shift, 8× wave and 2× tilt. Most datasets have thus been acquired on
limbs (arms and leg). Such acquisitions are used in clinical practice to visualize
the vascular topology, e.g. for a AV-fistula mapping or a peripheral vein mapping
470 preceding bypass surgery. From a technical point of view, those very elongated
22
ACCEPTED MANUSCRIPT
sweeps (partially exceeding 20 cm) are also particularly well-suited to study the
drift of the different methods.
Finally, in order to perform a final analysis on the generalization capabilities
of our system with the IMU, we acquired a last set of 100 sweeps on both carotids
475 from 10 volunteers (dataset #5). Their trajectories were mostly translational
since the operator followed the vessel, but also contained some tilt (rotation
along the left-right axis of the US images). The content of the images is also
significantly different from the limbs. Furthermore, the presence of a large artery
that is pulsating during the acquisition represents an additional challenge for
480 the trajectory reconstruction.
4.3. Baseline methods and evaluation metrics
We compare several variants of the proposed algorithm against two baseline

methods:
• The trajectory model with fewest assumptions forms a linear motion,

485 where the translation and rotation parameters are constant and set to the
average parameters across all frames of the dataset under consideration.
As most sweeps are of predominantly translational nature, these average
parameters exhibit almost no rotations or in-plane translations. Only the
elevational translation tz remains with a speed of around 1.5 cm/s.
490 • Our implementation of the speckle decorrelation method is based on var-

ious ideas inspired from prior works. Since there does not seem to be a
consensus on the best speckle decorrelation pipeline in the existing litera-
ture, we tried several approaches and describe below the one yielding the
best results on our dataset.
495 First we apply to each frame an anisotropic filter (He et al., 2010) which
is able to roughly separate the large structures (which do not represent
reliable information) and the speckle noise. By subtracting the result
of this filter to the original image, we obtain an image where the speckle
pattern is enhanced. We then divide those two images into patches of 15×
23
ACCEPTED MANUSCRIPT
500 15 pixels. For each patch of the first image, we find in the second image the
patch that maximizes the normalized cross-correlation. The offset between
the patch centers is considered as an in-plane motion estimation, while the
maximum correlation is used to estimate the out-of-plane displacement
with a decorrelation model similar to Prager et al. (2003). This results
505 in a 3D vector field subsequently fed into a RANSAC algorithm (Fischler
and Bolles, 1987) that fits the six transformation parameters.
The decorrelation model was trained by fitting its parameters on the set
of pairs of patches extracted from the considered training set.
Unless otherwise stated, our methods are evaluated using a 2-fold cross val-
510 idation. For statistical performance comparisons, a Wilcoxon signed-rank sta-
tistical test with a target p-value of at most 10−4 is employed.
The network is trained to minimize the difference between its output and the
parameters of each relative transformation. However, such values convey very
little insight on the overall accuracy or the drift of the method, let alone clinical
515 relevance. We therefore report alternative metrics for all our comparisons and
analyses, which capture and quantify the different kinds of errors in a better
way.
The average absolute parameter-wise error is computed by averaging, for
each frame k of each sweep, the absolute difference of the transformation pa-
520 rameters between estimation and ground truth,
1 X −1
N

avg. abs. error = p T k · T GT
~ k , (5)
N
k=1
where |~
p(A)| extracts the translation and rotation parameters from matrix A
and computes the element-wise absolute value.
Denoting ~
tk as the center position of the US frame, i.e. the translation part
of matrix T k , we can furthermore compute the final drift of a sweep as the
525 distance between the positions of the last frame (N ) center of the estimated
trajectory and the ground truth trajectory:
drift = k~tN − ~tGT

N k. (6)
24
ACCEPTED MANUSCRIPT
In other words, this number is a target registration error on the center of the
final frame and captures the accumulated error over the whole sweep. The
maximum center error is the maximum distance within each sweep between the
530 estimated center of a frame k and its true frame center:
max. center error = max k~tk − ~tGT

k k. (7)
k
Eventually, the length error is the difference between the distance of the first-
to-last frame for the estimated trajectory and the ground truth trajectory:

~ ~ ~GT ~GT
length error = ktN − t1 k − ktN − t1 k. (8)
This metric is motivated by our clinical application, since the goal of the exam
is to map the vascular tree and measure the length of each blood vessel. Note
535 that it is slightly different from the drift since there could be an overall rotation
in the sweep that does not degrade the accuracy of the vessel length.
5. Experiments and Results
Our experiments are divided in several subsections that use different datasets
since they either focus on (i) comparing our approach to the baseline methods,
540 (ii) studying the effect of parameters, (iii) evaluating the effects of including
the IMU information or (iv) investigating the generalization properties of the
network.
5.1. Comparison to baseline methods and inclusion of optical flow
In Table 2, parameter-wise errors and drifts are reported for each method as
545 evaluated on the first three datasets. Unsurprisingly, the assumption of a linear
motion performed worst in all cases. Because it is quite difficult for a human
operator to maintain a constant speed, the out-of-plane translation tz exhibited
the highest variability and was thus the main reason for the poor performance
of this method. While the speckle decorrelation approach managed to infer
550 the in-plane translation and the elevation information from the US frames to
25
ACCEPTED MANUSCRIPT
some extent, the error on tz remained high for the majority of sweeps across
all datasets, mainly because the uncertainty of the motion estimation in this
direction deteriorates with increasing distance between frames.
In contrast, both CNN methods demonstrated a clear improvement com-
555 pared to the other two techniques, except the one without optical flow on the
calves datasets, presumably because of the very low training sample size. The
results indicate that it is necessary to add the optical flow as input channels
to maintain an acceptable performance within the image plane, i.e. for pa-
rameters tx and ty . This way, the network can better focus on predicting the
560 out-of-plane motion, overall leading to the lowest median drift errors across
datasets.
On the first two clinical datasets # 2 (forearms) and # 3 (calves), the final
drift was on average 1.45 cm for sweeps exceeding a length of 20 cm, thus at
least twice as accurate as the baseline methods. The outcomes of all methods
565 were significantly different in pairwise fashion. A qualitative comparison of the
best, median, and worst case of reconstructed trajectories is depicted in Figure 5.
In all fairness, the results that we obtained with the speckle decorrelation
method do not seem to always match the accuracy reported in previous papers.
570 On dataset #2, our implementation yields an average final drift of 19% (error
on the center of the final frame divided by the length of the sweep in mm). As
a comparison, we recall the reported results in recent papers:
• Tetrel et al. (2016) reported an average TRE (target registration error)

on final frames of approximately 5 mm on sweeps spanning 35 mm over a
575 speckle phantom, which amounts to a 14% drift.
• In Lang et al. (2012), the original speckle decorrelation method (without

the multi-modal registration described of the paper) yields a final TRE
between 6.4 and 10.9 mm after scanning 300 frames of a phantom.
• Gao et al. (2016) performed experiments on pork phantoms and report

580 a 9.4% error on the length measurement, which is a lower bound on the
26
ACCEPTED MANUSCRIPT
Table 2: Comparison of different methods with respect to average absolute parameter-wise

error and final drift.
Dataset # 1 Avg. absolute error [mm/◦ ] Final drift [mm]

- phantom tx ty tz θx θy θz min. med. max.
Linear motion 2.27 8.71 38.72 2.37 2.71 0.97 2.29 70.30 149.19
Speckle decorrelation 4.96 2.21 29.89 2.10 4.46 1.93 12.67 47.27 134.93
Standard CNN 2.25 5.67 14.37 2.13 1.86 0.98 14.31 26.17 65.10
CNN with optical flow 1.32 2.13 7.79 2.32 1.21 0.90 1.70 18.30 36.90

- forearms tx ty tz θx θy θz min. med. max.
Linear motion 4.46 6.11 24.84 3.51 2.59 2.37 10.11 46.23 129.93
Standard CNN 6.30 5.97 6.15 2.82 2.78 2.40 3.72 25.16 63.26

- calves tx ty tz θx θy θz min. med. max.
Linear motion 4.49 4.84 39.81 4.39 2.18 2.46 37.35 73.40 143.42
Standard CNN 4.91 8.95 25.89 2.01 2.54 2.90 27.11 54.72 116.64
drift.
• In Housden et al. (2008a), the standard speckle decorrelation method

(without the IMU correction) generates a 0.83 mm average length error
on sweeps measuring 2.8 mm, which a represents 29% length error.
585 Those differences could be explained by (i) a suboptimal implementation of ours,

(ii) an imperfect calibration of the decorrelation model because we are directly
fitting it on the freehand datasets (instead of using a robot for instance) or
(iii) by the challenging nature of our dataset, which has actually been acquired
in a clinical setting. Nevertheless, we point out that our deep-learning approach
590 yields results that are anyway superior to what has been reported so far.
These findings were further validated using a separately recorded sweep with
a deliberately high variation in out-of-plane velocity between 0.3 mm/frame (be-
27
ACCEPTED MANUSCRIPT
ginning and end) and 0.9 mm/frame (middle part). Figure 6 illustrates the ele-
595 vational translation as estimated by all methods. By design, the linear motion
method cannot capture this variation at all, resulting in a severely distorted
trajectory. As already mentioned above, the speckle decorrelation approach has
limited power once the frame-to-frame distance exceeds the scale of the speckle
patterns and thus systematically underestimates the translation in the middle
600 part of the recording. Only the CNN with optical flow was able to cope with the
changing probe velocity appropriately. For the remainder of this study, unless
otherwise stated, CNN experiments thus include the additional two optical flow
channels.
605 The last experiment of this section was then designed to study the robustness
of the various methods when the scanned anatomy is different than the training
data. We trained a neural network on our forearms dataset #2 and tested on
the two other ones (phantom and lower legs). The results are summarized in
Figure 7, where we compare it with speckle decorrelation on the one hand and
610 with a dedicated network that has been trained on the appropriate dataset.
As expected, the box plots indicate that the highest accuracy is reached with
a network trained on the same kind of sweeps. However, the neural network
trained on another dataset still yielded significantly lower center errors and final
drifts than the speckle decorrelation algorithm, which means that it generalized
615 much better. Unsurprisingly, the gap of performance between the different
methods is almost entirely due to the elevational displacement tz , whereas the
quality of estimation of both in-plane translation and rotation is rather similar.
In short, this preliminary experiment shows that the network seems to use
both general features but also patterns that are specific to an anatomy. Note
620 that the datasets are not solely different by the content of the images but also
on the ultrasound acquisition parameters and naturally the shape of the trajec-
tories.
The consequence is that a dedicated network will likely have to be trained for
each target anatomy. We believe that this is not a significant drawback since
28
ACCEPTED MANUSCRIPT
ground truth trajectory linear motion

speckle decorrelation CNN with optical flow
best
case
median
case
worst
case
Figure 5: Comparison of the trajectories reconstructed with different methods. The three
selected forearm sweeps illustrate the best, median, and worst case in terms of drift, respec-
tively.
29
ACCEPTED MANUSCRIPT
ground truth
1.1
relative tz (mm)
linear motion
0.9 speckle decorrelation
CNN with optical flow
0.7
0.5
0.3
0.1
0 50 100 150 200 250 300 350 400
frame index
Figure 6: Elevational translations tz predicted values with different methods on an ultrasound

sweep deliberately acquired with a strongly varying speed. The target curve is the black one,
representing the ground truth (best viewed in color).
625 ultrasound systems already have application-specific presets. More thorough

experiments on the dependency of our method to the scanned anatomy will be
presented in Section 5.4.
5.2. Parameter analysis
We present in this section a series of experiments performed to better un-

630 derstand the behaviour of our approach and its sensitivity to the parameters.
5.2.1. Influence of the working resolution

As we have already mentioned before, we resample the input images before
feeding it to the network. This is done mainly to make the resolution of each
pixel isotropic, but also to reduce the computation times and the required net-
635 work complexity. In order to find the optimal resolution, we repeated the same
experiment four times on the phantom dataset with input images resampled ei-
ther at 0.2, 0.3, 0.4 or 0.5 mm, and reported the results in Figure 8. We expect
the errors to be minimal at an intermediate resolution: on the one hand, larger
pixel sizes would discard relevant speckle patterns; on the other hand, a too
640 small pixel size increases the effect of electronic noise and would also require a
larger network. As expected, the various resolutions produce results with a sta-
tistically significant difference. This experiment indicates 0.3 mm to be the best
30
ACCEPTED MANUSCRIPT
100 100
maximum center error (mm)
80 80
final drift (mm)

60 60
40 40
20 20
0 0
phantom lower legs phantom lower legs
Figure 7: Comparison of methods trained on different anatomies. Boxplots represent mini-

mum, lower quartile, median, upper quartile and maximum.
value for the resampling resolution. For illustration purposes, we also present
in Figure 8 a sample image resampled at the different considered resolutions.
645 5.2.2. Influence of the noise filtering

We then ran experiments to assess the importance of the speckle noise. Since
the main hypothesis of the standard speckle decorrelation is that speckle pat-
terns are important for the motion estimation, we expect a degradation in the
accuracy of the network if we use filtered B-mode images. We therefore recorded
650 the sweeps of our forearms dataset #2 before and after the speckle filter (such
data was available via the interface of our research ultrasound system). As il-
lustrated in Figure 9, this filter has a strong effect on the appearance on the
image. While it makes the image visually more appealing, it also smooths out
most of the speckle pattern.
655 As expected, the results summarized in the boxplots of Figure 9 indicate
that training on original images is a much better strategy than training the
network on the filtered images. This is a particularly interesting result since
it confirms the intuition of the research community that speckle indeed carries
31
ACCEPTED MANUSCRIPT
0.2 mm 0.3 mm 0.4 mm 0.5 mm

resolution (mm)
0.5
0.4
0.3
0.2
5 10 15 20 25
maximum frame center error within each sweep (mm)
Figure 8: Comparison of the various isotropic resampling tested on dataset # 1, and the
corresponding network performance. Boxplots represent minimum, lower quartile, median,
upper quartile and maximum.
significant information. It is also worth mentioning that speckle is useful but

660 not necessary, our method does not completely break down when applied on
filtered images and actually still performs better than speckle decorrelation.
5.2.3. Influence of the network architecture

While the network architecture that we described in Section 3 is very stan-
dard and simple, it turned out to provide the best results according to our
665 experiments. For the sake of brevity, we only summarize our efforts in opti-
mizing the network architecture for this application, including: changing the
number of channels for the convolution filters, changing the length of the flat-
tened vector, adding or removing more convolution/activation/pooling blocks
of layers, varying the dropout rate, using a Siamese architecture (Zbontar and
670 LeCun, 2015) instead of encoding the images as a multi-channel input, and in-
corporating the IMU rotation angles at different stages of the network. All those
modifications had very little impact on the overall performance of the network
compared to the pre-processing of the input data for instance. More significant
32
ACCEPTED MANUSCRIPT
maximum frame center error (mm)

40 40
final drift (mm)

30
30
original
20
20
10
10
0
filtered original filtered original filtered
Figure 9: Comparison of the performance of our method when trained on the original images
or after the speckle filter from the ultrasound system (dataset # 3). Boxplots represent
minimum, lower quartile, median, upper quartile and maximum.
changes like switching to a VGG-like architecture (Simonyan and Zisserman,

675 2015) even showed a performance decrease.
Perhaps the most promising architectural change was to turn our conven-
tional neural network into a recurrent one, like the popular long short-term
memory (Hochreiter and Schmidhuber, 1997) (LSTM). Since we estimate a
temporal process, it makes sense to let the network know about the tempo-
680 ral consistency of the trajectory. This is done by feeding the whole sequence of
frames (actually, pairs of frames in our case) to the network that produces the
whole sequence of outputs. Typically, one of the layer of the networks will retain
some state computed from the previous frames, which acts like a memory and
embeds a notion of time dependency. We therefore tested a number of different
685 LSTM-based networks (Xingjian et al., 2015; Sainath et al., 2015; Chiu and
Nichols, 2015). Here again, various architectures have been evaluated. Most of
the time, the LSTM layer was placed at the end of the network processing, be-
fore the first fully connected layer. Despite our efforts though, we only observed
slight changes in the distribution of errors that were not statistically significant
690 (p-value ≈ 0.1).
33
ACCEPTED MANUSCRIPT
Table 3: Results on various experiments on dataset #4 regarding the inclusion of IMU data
into the estimation. Flow, IMU : Whether optical flow and IMU data were included as input.
θ: Whether rotation part of CNN prediction or IMU orientations were used to reconstruct final
trajectory – in all cases, the translation was predicted by the CNN. Letter indices correspond
to Figure 10.
Methods Avg. absolute error [mm/◦ ] Final drift [mm]
# Flow IMU θ from tx ty tz θx θy θz min. med. max.
A × X CNN 6.56 7.23 16.70 0.94 2.65 2.80 3.12 29.22 186.83
B X × CNN 8.89 6.61 5.73 5.21 7.38 4.01 3.22 27.34 139.02
C X X CNN 5.16 2.67 4.43 0.96 3.54 2.85 2.54 15.07 55.20
D X × IMU 2.98 2.57 4.79 0.19 0.21 0.13 1.33 11.43 42.94
E X X IMU 2.75 2.41 4.36 0.19 0.21 0.13 0.76 10.42 35.22
One of the possible explanations is that even though the trajectory seems
globally smooth, the time difference between two successive frames is so small
that the relative transformation parameters are actually quite noisy (this can
be observed for instance in the ground truth curve of Figure 6). At this scale,
695 the hand trembling of the operator but also the potential jittering noise of the
tracking system can become significant.
We still think that temporal information could be relevant for this problem,
but it should probably be included at a larger scale and therefore would require
changes in our approach that are more significant than adding a recurrent layer
700 in our network.
5.3. Experiments on the larger dataset with IMU
On the largest dataset #4 consisting of 600 sweeps, we evaluated various

options to train the network with different inputs. While the parameter-wise
errors as well as the final drifts are reported in Table 3, an overview of the
705 distribution on maximum center error and drift across sweeps are illustrated in
Figure 10. The pairwise difference between all experiments (A-E ) was statisti-
cally significant.
Following the results presented in Section 5.1, the CNN without the optical
flow presented comparatively poor performance, even if IMU data was included
710 (A vs. C ). Likewise, it was beneficial to augment the input of the network
34
ACCEPTED MANUSCRIPT
(A) Flow ×, IMU X,

θ from CNN
(B) Flow X, IMU ×,
θ from CNN
(C) Flow X, IMU X,
θ from CNN
(D) Flow X, IMU ×,
θ from IMU
(E) Flow X, IMU X,
θ from IMU
0 10 20 30 40 50 60
maximum center error (mm)
(A) Flow ×, IMU X,

θ from CNN
(B) Flow X, IMU ×,
θ from CNN
(C) Flow X, IMU X,
θ from CNN
(D) Flow X, IMU ×,
θ from IMU
(E) Flow X, IMU X,
θ from IMU
0 10 20 30 40 50 60
final drift (mm)
Figure 10: Results on various experiments on dataset #4 regarding the inclusion of IMU data
into the estimation. Letter indices correspond to Table 3. Boxplots represent minimum, lower
quartile, median, upper quartile and maximum.
with IMU data (B vs. C ). Apparently, both complementary kinds of input

are necessary to capture the complex trajectories present in this dataset, which
is also why method B, as originally proposed in Prevost et al. (2017), shows
inferior performance here compared to the other datasets.
715 Under close examination, there are in fact several ways to integrate IMU
data into the tracking estimation. As an alternative to feeding it as additional
input as described in Section 3.4 (C ), it is also possible to directly use the
calibrated orientations on top of the CNN trained without IMU (D) similarly to
what Housden et al. (2008a) suggested with speckle decorrelation. Interestingly,
35
ACCEPTED MANUSCRIPT
ground truth trajectory

CNN without IMU
CNN with IMU
Figure 11: Comparison of the trajectories reconstructed with and without incorporation of
the IMU in the neural network. The four selected sweeps were average cases for the four types
of sweeps of dataset # 4 (respectively basic, shift, wave, tilt).
36
ACCEPTED MANUSCRIPT
720 this turned out to slightly outperform the aforementioned methods, presumably
because of the improved angular resolution of the employed IMU sensors and the
lack of image-based cues of vanishing relative rotations. Combining these two
strategies, i.e. using the CNN with all available input to predict the translation,
and resorting to the IMU orientations for trajectory estimation (E ) led to a
725 further improvement, yielding a median final drift of merely 10.4 mm across all
types of sweep motions. This is equivalent to a median normalized drift of 5.2%,
meaning that for each 10 cm, the reconstruction might be off by around 5 mm.
The length of the recorded sweep, and thus the elevational translation, is
an essential factor for a broad variety of clinical applications. Figure 12.a com-
730 pares the predicted sweep lengths with their actual lengths, following a strong
correlation of approximately 0.9. In Figure 12.b, we group all lengths errors
based on the sweep type, indicating broad consistency across the different kinds
of motion. The overall median length error was 6.84 mm (3.4%). The shift
category contains more outliers due to the very sparse but large and abrupt
735 motions happening during such acquisitions.
5.4. Generalization to another anatomy
As already mentioned in Section 5.1, our method does generalize to some

extent from one kind of sweep to another, but still shows a drop of accuracy
compared to a dedicated network. In this section, we perform a more detailed
740 analysis of the generalization capabilities and properties of our networks. To
that end, we use dataset # 5, which contains sweeps acquired on carotids. This
dataset is more complete and different from the forearms than the lower legs
dataset # 3 originally used in Prevost et al. (2017).
745 Figure 13 shows the drift and the length error obtained on this dataset using
five different networks:
(A) a network trained from scratch on the forearms dataset (# 4) only;
(B) a network trained from scratch on the carotid dataset (# 5) only;
37
ACCEPTED MANUSCRIPT
260
ρ = 0.90 30
240
R2 = 0.79
predicted length (mm)
25
length error (mm)

220
20
200
15
180 10
5
160
0
140
150 200 250 basic shift wave tilt
actual sweep length (mm) sweep type
(a) (b)
Figure 12: (a) Comparison of the estimated sweep lengths with respect to the ground truth
lengths. (b) Distribution of the length errors split across the different sweep types.
(C) a network trained on the forearms dataset, with its two last layers subse-
750 quently fine-tuned on the carotid dataset;
(D) a network trained on the forearms dataset, then fully fine-tuned on the
carotid dataset;
(E) a network trained from scratch on both forearms and carotid datasets,
which enables us to draw some interesting conclusions.

755 First, we notice that the network (A) trained only on forearms yields the
worst results. This is expected and in agreement with our previous experiments:
the network needs to see some carotid sweeps during training to learn the frames’
appearance and motion. The most straightforward approach to fix this problem
is then to re-train a network on the carotid dataset (B), which unsurprisingly
760 improves the results. However, training a network from scratch requires both
time and a significant amount of data. This might be unnecessary if one already
38
ACCEPTED MANUSCRIPT
has trained a network on a different application but with a large dataset. In

order to test this hypothesis, we fine-tuned the network (C and D). Using the
very same architecture, we initialized the network’s parameters with pre-trained
765 weights, and resumed the training on the carotid dataset either by freezing all
but the last two layers (C) or by optimizing the whole network (D). As it
turned out, the difference between (B) and (C) is not statistically significant (p-
value ≈ 0.3), which indicates that, although it takes much longer, training from
scratch on a smaller dataset does not provide any advantage than fine-tuning a
770 pre-existing network. Interestingly, (D) provides better results than (C), which
means that the first layers of the network are actually important and also need
to be fine-tuned in order to get a high accuracy. In particular, this suggests
that the low-level features learnt by the network are, to some extent, depending
on the anatomy. For instance, the network probably has to learn that a change
775 in the size of the carotid from one frame to the next might be due to pulsation
and is not necessarily a geometric cue. Finally, the last model (E) requires
the largest amount of data and training time (since it is trained on the both
datasets). Its results are significantly better than a network trained on the
carotid only (B), which can be explained by the fact that we allow it to leverage
780 the information that is common to the two datasets. However, it does not
improve on the model (D), with a p-value ≈ 0.3. One could attribute such a
result to the fact that the networks ”wastes” resources by trying to learn the
more complex motions of the forearms dataset which are not as extreme in the
carotid dataset. This finding is particularly interesting since it suggests that we
785 do not need to re-train a network from scratch for every application.
We then wanted to assess how many data of the new application was nec-
essary to fine-tune a pre-existing network. Our dataset # 5 is composed of
100 sweeps from 10 volunteers (10 per subject), which means that we have 50
sweeps for training and 50 for validation during a 2-fold cross-validation. In-
790 stead of using the 50 training sweeps, we fine-tuned a pre-trained network with
only 10 of them (i.e. 1 subject), then another one with 20 of them, etc. In order
to make the fine-tuning more stable, we also add the whole forearms dataset
39
ACCEPTED MANUSCRIPT
(A) trained on
forearms only
(B) trained on
carotid only
(C) fine-tuned to carotid
(last layers)
(D) fine-tuned to carotid
(whole network)
(E) trained on
forearms+carotid
0 5 10 15 20 25
final drift (mm)
(A) trained on
forearms only
(B) trained on
carotid only
(C) fine-tuned to carotid
(last layers)
(D) fine-tuned to carotid
(whole network)
(E) trained on
forearms+carotid
0 2 4 6 8 10 12 14 16 18
length error (mm)
Figure 13: Results on various experiments on dataset #5 regarding the generalization capa-
bilities of the networks from forearms to carotid sweeps.
but put a stronger weight on the carotid sweeps so that both datasets have an
overall similar impact. The results are plotted in Figure 14. Unsurprisingly,
795 there is a significant difference between a network only trained on forearms and
the networks that have been refined on the carotid dataset. However, a perhaps
more unexpected observation is that using data from a single subject is already
sufficient to capture most of the difference between the two domains. Adding
more subjects tends to slightly decrease the errors but the added value seem
800 more subtle, with a much higher p-value around 10−2 .
In summary, our experiments seems to indicate that although our method is
sensitive to the kind of sweeps used during training, the data and time required
40
ACCEPTED MANUSCRIPT
final drift (mm) length error (mm)

14
significant not significant
12
error on carotid dataset
10
0 10 20 30 40 50
only forearm 1 subject 2 subjects 3 subjects 4 subjects 5 subjects
number of carotid sweeps used for fine-tuning
Figure 14: Errors on the carotid dataset #5 as a function of the number of sweeps used for
fine-tuning a network originally trained on the forearms dataset. Dots represent the average
and error bars represent the standard deviation across the 100 sweeps.
to adapt it to other applications might not be overwhelming.
805 Finally, as a last illustration of the accuracy of our system, we acquired a

very long sweep (more than 60 cm) on the leg of a volunteer (see Figure 15) and
reconstructed its trajectory with our best network trained on forearms and IMU
input (E ). The sweep followed the great saphenous vein over the full leg. The
length of this vein was measured twice: first using the ground truth trajectory,
810 and second using our estimation. As the measured lengths were respectively
61.4 cm and 58.5 cm, the measurement error was lower than 5% which is quite
satisfactory given the extreme length of the sweep and the target anatomy that
had never been seen by the network.
41
ACCEPTED MANUSCRIPT
Figure 15: Reconstruction of a very long ultrasound sweep (more than 60 cm) across the full
leg showing the measurement of the great saphenous vein. A lot of ultrasound frames are
skipped, and the vessel was segmented for the sake of visualization.
5.5. Alternative application: Interpolation of missing tracking data
815 We finally discuss an alternative application of our motion estimation ap-

proach. Even when an external tracking system is used during a sweep acquisi-
tion, it often happens that the probe position becomes unavailable for a number
of frames, for instance due to occlusions. To avoid tedious and time-consuming
repeat acquisitions while not having to rely on inaccurate polynomial interpo-
820 lations, the presented method could be used for a more faithful gap filling. In
order to test this application, we used the forearms dataset #2 and assumed
that the entire sweep is actually a tracking-less gap of a longer acquisition, i.e.
ground truth transformation was available at first and last frame. The network
was first used to predict the position of all frames in between. Then, the inverse
825 transformation of the drift was evenly distributed across all frames so that the
last one ended up at the known ground truth position.
Table 4 shows a comparison of different interpolation strategies. Since the
final drift is now zero by definition, we use as a metric the maximum center
error. Similarly to the previous applications, our approach based on a CNN -
830 even without the IMU - produces a significant improvement over the baseline
techniques for interpolation. In particular, the median error on each sweep is
almost halved. In all likelihood, it would even be further reduced when using
the IMU orientation.
42
ACCEPTED MANUSCRIPT
Dataset #2 Avg. absolute error [mm/◦ ] Max. center error [mm]

- forearms without IMU tx ty tz θx θy θz min. med. max.
Linear interpolation 2.00 6.69 5.81 2.55 0.88 1.29 9.34 16.46 29.28
Table 4: Error metrics for the three different approaches used as interpolation methods be-
tween the first and the last frames of the sweeps of dataset #2.
6. Discussion
835 6.1. Remaining pitfalls
Since our approach does have some constraints, we list here all the short-
comings that we could identify and how to possibly overcome them.
First, we assumed that the sweeps were acquired in a fixed direction, for
instance proximal to distal. Applying our algorithm on sweep with an opposite
840 direction would therefore yield a mirrored result. This constraint was paramount
in order to train our networks as including both directions in the training set
made the estimation of the out-of-plane translation significantly ambiguous.
However this limitation is not specific to our method, but is rather due to the
symmetry of the trajectory estimation problem that makes it ill-posed. Besides,
845 we deem that enforcing the direction of the probe during acquisition would be
a reasonable constraint for the clinician (for instance by drawing a visual cue
on the probe).
More importantly though, this also means that no back-and-forth motion
can be estimated by the network. This is a problem when the organ of interest
850 does not fit into a single US frame and would need to be swept over several times.
Despite our efforts, we were not able to find a way to detect frames where the
main direction of the probe reverses, so we did not include such motion in our
datasets. Even the IMU acceleration signal was too noisy to be used at all to
detect changes of direction. A workaround would be to acquire multiple sweeps
855 with sufficient overlap so that they could be registered together.
We also expect the accuracy of our approach to depend, at least to some
extent, on the ultrasound acquisition parameters. This is true for the bright-
43
ACCEPTED MANUSCRIPT
ness, the contrast, the framerate but also the image depth, which changes the
image geometry and could hinder the network trained with a fixed size. A po-
860 tential idea to help the network coping with these variations is to store all such
acquisition parameters of the ultrasound system. We could then incorporate
them within the network architecture so that they can be directly used by the
network, just like we did with the IMU information.
The system will also be dependent on the type of probe used, i.e. linear
865 vs convex. Since the image acquisition happens on a polar coordinate system
instead of a Cartesian grid, the speckle patterns of the images will look different
and therefore have to be treated adequately. This would have been a serious
problem if we directly used the optical flow as the in-plane motion, similarly to
the standard speckle decorrelation approach. Nevertheless, we believe that our
870 statistical model can figure out, during its training, how to compensate for such
artifacts. This however means that we would need to train a dedicated network
for each type of probe.
Eventually, while our approach does not require access to any raw data
like RF or in-phase and quadrature (IQ) signals, our experiments did show
875 that the performance is optimal when we disable the speckle filter of the B-
mode images. Systems relying on a frame-grabber without the possibility to
significantly reduce the amount of filtering are therefore feasible but likely to
produce less accurate trajectory reconstructions.
6.2. Conclusion
880 This paper introduced a novel method for the challenging task of 3D recon-
struction of freehand ultrasound sweeps based on deep learning. We showed
that convolutional neural networks constitute a suitable replacement for the
standard approach of speckle decorrelation, since they are composed of similar
basic operations but are able to be trained to solve the problem in an end-to-end
885 manner. This new way of addressing the problem alleviates the need for accu-
rately modeling the influence of speckle on the image intensities, and instead
leverages a large quantity of tracked ultrasound sweeps. Another benefit of our
44
ACCEPTED MANUSCRIPT
approach is that it does not require any raw data that is difficult to extract
from an ultrasound system. However, it is able to use - if available - additional
890 information to improve the accuracy of its prediction. Such information can be
results of pre-computations such as an optical flow vector field or the output of
external sensors like the orientation of an IMU chip.
The thorough experiments and evaluations that we provide also constitute
a contribution of our work. To the best of our knowledge, no study on freehand
895 3D ultrasound reconstruction was tested on such a large database. We indeed
worked on 800 in vivo freehand ultrasound sweeps with very diverse trajectories
that cover the potential motions that can occur during an actual ultrasound
sweep. The main findings of our experiments are the following:
• The proposed approach based on deep learning generates much more accu-
900 rate trajectories than the existing baseline methods, reaching normalized
median length errors of 3.4% on our largest dataset.
• As the research community presumed, speckle noise indeed carries infor-

mation that is relevant to our problem, but it is not necessary for a 3D
reconstruction. Therefore our method could also be applied, possibly at
905 the cost of a decreased accuracy, on systems without access to completely
unfiltered B-mode images.
• Incorporating additional information such as the pre-computed optical

flow displacement field or the orientation of an IMU sensor significantly
helps the network, much more than a fine-tuning of its architecture.
910 • Although training and applying a network on a specific anatomy naturally

yields better results, a thoroughly pre-trained network can be easily fine-
tuned to another application with limited data and effort.
This does not mean that the speckle decorrelation is obsolete, but can po-
tentially be improved using some of our results. Opening the black-box of our
915 neural network, for instance by studying the learnt convolution kernels or visu-
alizing the relevant features, could provide interesting insight that could help
45
ACCEPTED MANUSCRIPT
the community better understand the relationship between image intensities

and probe motion.
Furthermore, unlike standard image analysis tasks like classification or seg-
920 mentation, there is no guarantee that a perfect reconstruction is possible even in
theory. In other words, it was unclear if the errors produced by speckle decorre-
lation algorithms are due to imperfections of the model or if the images simply
do not contain enough information to solve that problem. We believe that the
results obtained by a purely statistical approach like ours also help quantifying
925 how much information about the probe motion the images really contain.
In addition to addressing the shortcomings described in the previous subsec-
tion, our future work will aim at exploring clinical applications that could benefit
from our system, including for instance aneurysm monitoring or thyroid volume
estimation. For some of those applications, it is feasible to develop improved ac-
930 quisition protocols that further reduce the drift by exploiting redundancy from
several orientations or more generally consistency from multiple overlapping ac-
quisitions. Perpendicular views of the ultrasound image plane, whether taken
alone (as panoramic in-plane stitching) or from a second sweep reconstructed
with our method, could be used to reduce errors through an image-based opti-
935 mization step. To that end, an image-based calibration method such as Wein
and Khamene (2008) could be extended by properly parameterizing the un-
known motion. If the IMU data is used, perpendicular views can be properly
pre-aligned at least in terms of their orientation, which makes such an approach
promising for mostly angle-swept data, such as overlapping longitudinal and
940 transversal sweeps of the liver. Since a similar idea was reported in Chang et al.
(2003) with promising results, we suppose that this would be a very relevant
extension to our approach.
Altogether, we believe that our work paves the way toward a tracking-free
3D freehand ultrasound product. Apart from technical aspects, an important
945 next step of our work will therefore include clinical evaluations on a variety of
applications to confirm its impact on medical practice.
46
ACCEPTED MANUSCRIPT
Acknowledgments
The authors would like to thank ACMIT (Vienna, Austria) and piur Imag-
ing (Vienna, Austria) for the help on the IMU integration and the mount on
950 the ultrasound probe. We also thank Steven Rogers and Richard Pole (IVS,
Manchester, UK) for their advice on the ultrasound sweeps acquisition. The
authors have benefited from a H2020-FTI grant (number 760380) delivered by
the European Union.
References
955 References
Afsham, N., Najafi, M., Abolmaesumi, P., Rohling, R., 2014. A generalized
correlation-based model for out-of-plane motion estimation in freehand ultra-
sound. IEEE Transactions on Medical Imaging 33, 186–199.
Afsham, N., Rasoulian, A., Najafi, M., Abolmaesumi, P., Rohling, R., 2015.
960 Nonlocal means filter-based speckle tracking. IEEE transactions on ultrason-
ics, ferroelectrics, and frequency control 62, 1501–1515.
Chang, R.F., Wu, W.J., Chen, D.R., Chen, W.M., Shu, W., Lee, J.H., Jeng,
L.B., 2003. 3-d us frame positioning using speckle decorrelation and image
registration. Ultrasound in Medicine and Biology 29, 801 – 812.
965 Chen, J.F., Fowlkes, J.B., Carson, P.L., Rubin, J.M., 1997. Determination
of scan-plane motion using speckle decorrelation: Theoretical considerations
and initial test. International Journal of Imaging Systems and Technology 8,
38–44.
Chiu, J.P., Nichols, E., 2015. Named entity recognition with bidirectional lstm-
970 cnns. arXiv preprint arXiv:1511.08308 .
Conrath, J., Laporte, C., 2012. Towards improving the accuracy of sensorless
freehand 3d ultrasound by learning, in: International Workshop on Machine
Learning in Medical Imaging, Springer. pp. 78–85.
47
ACCEPTED MANUSCRIPT
Dosovitskiy, A., Fischer, P., Ilg, E., Hausser, P., Hazirbas, C., Golkov, V.,
975 v.d. Smagt, P., Cremers, D., Brox, T., . Flownet: Learning optical flow
with convolutional networks, in: IEEE International Conference on Computer
Vision (ICCV).
Farnebäck, G., 2003. Two-frame motion estimation based on polynomial expan-

sion, in: Scandinavian conference on Image analysis, Springer. pp. 363–370.
980 Fischler, M.A., Bolles, R.C., 1981. Random sample consensus: a paradigm for
model fitting with applications to image analysis and automated cartography.
Communications of the ACM 24, 381–395.
Fischler, M.A., Bolles, R.C., 1987. Random sample consensus: a paradigm for
model fitting with applications to image analysis and automated cartography,
985 in: Readings in computer vision. Elsevier, pp. 726–740.
Franz, A.M., Haidegger, T., Birkfellner, W., Cleary, K., Peters, T.M., Maier-
Hein, L., 2014. Electromagnetic tracking in medicinea review of technology,
validation, and applications. IEEE transactions on medical imaging 33, 1702–
1725.
990 Gao, H., Huang, Q., Xu, X., Li, X., 2016. Wireless and sensorless 3D ultrasound
imaging. Neurocomput. 195, 159–171.
Gee, A.H., Housden, R.J., Hassenpflug, P., Treece, G.M., Prager, R.W., 2006.
Sensorless freehand 3d ultrasound in real tissue: speckle decorrelation without
fully developed speckle. Medical image analysis 10, 137–149.
995 Ghanbari, M., 1990. The cross-search algorithm for motion estimation (image
coding). IEEE Transactions on Communications 38, 950–953.
Govindu, V.M., 2004. Lie-algebraic averaging for globally consistent motion

estimation, in: Computer Vision and Pattern Recognition, 2004. CVPR 2004.
Proceedings of the 2004 IEEE Computer Society Conference on, IEEE. pp.
1000 I–684.
48
ACCEPTED MANUSCRIPT
Hamilton, W.R., 1853. Lectures on quaternions. Hodges and Smith.
Hassenpflug, P., Prager, R.W., Treece, G.M., Gee, A.H., 2005. Speckle classi-
fication for sensorless freehand 3-d ultrasound. Ultrasound in Medicine and
Biology 31, 1499–1508.
1005 He, K., Sun, J., Tang, X., 2010. Guided image filtering, in: European conference
on computer vision, Springer. pp. 1–14.
Hennersperger, C., Karamalis, A., Navab, N., 2014. Vascular 3d+ t freehand
ultrasound using correlation of doppler and pulse-oximetry data, in: Interna-
tional Conference on Information Processing in Computer-Assisted Interven-
1010 tions, Springer. pp. 68–77.
Hochreiter, S., Schmidhuber, J., 1997. Long short-term memory. Neural com-
putation 9, 1735–1780.
Hornik, K., 1991. Approximation capabilities of multilayer feedforward net-

works. Neural networks 4, 251–257.
1015 Hossack, J.A., Sumanaweera, T.S., Napel, S., Ha, J.S., 2002. Quantitative 3-
d diagnostic ultrasound imaging using a modified transducer array and an
automated image tracking technique. IEEE Transactions on Ultrasonics, Fer-
roelectrics, and Frequency Control 49, 1029–1038.
Housden, R., Gee, A.H., Prager, R.W., Treece, G.M., 2008a. Rotational motion
1020 in sensorless freehand three-dimensional ultrasound. Ultrasonics 48, 412 –
422.
Housden, R.J., Gee, A.H., Treece, G.M., Prager, R.W., 2006. Sensorless re-
construction of freehand 3d ultrasound data, in: Larsen, R., Nielsen, M.,
Sporring, J. (Eds.), Medical Image Computing and Computer-Assisted Inter-
1025 vention – MICCAI 2006: 9th International Conference, Copenhagen, Den-
mark, October 1-6, 2006. Proceedings, Part II, Springer Berlin Heidelberg,
Berlin, Heidelberg. pp. 356–363.
49
ACCEPTED MANUSCRIPT
Housden, R.J., Treece, G.M., Gee, A.H., Prager, R.W., 2008b. Calibration
of an orientation sensor for freehand 3d ultrasound and its use in a hybrid
1030 acquisition system. BioMedical Engineering OnLine 7, 5.
Kallel, F., Bertrand, M., Meunier, J., 1994. Speckle motion artifact under tissue
rotation. IEEE Transactions on Ultrasonics, Ferroelectrics, and Frequency
Control 41, 105–122.
Kan, A.R., Timmer, G.T., 1987. Stochastic global optimization methods part
1035 ii: Multi level methods. Mathematical Programming 39, 57–78.
Lang, A., Mousavi, P., Fichtinger, G., Abolmaesumi, P., 2009. Fusion of elec-
tromagnetic tracking with speckle-tracked 3d freehand ultrasound using an
unscented kalman filter, in: Progress in Biomedical Optics and Imaging -
Proceedings of SPIE.
1040 Lang, A., Mousavi, P., Gill, S., Fichtinger, G., Abolmaesumi, P., 2012. Multi-
modal registration of speckle-tracked freehand 3d ultrasound to ct in the
lumbar spine. Medical Image Analysis 16, 675 – 686. Computer Assisted
Interventions.
Laporte, C., Arbel, T., 2011. Learning to estimate out-of-plane motion in ul-
1045 trasound imagery of real tissue. Medical image analysis 15, 202–213.
LeCun, Y., Bengio, Y., Hinton, G., 2015. Deep learning. Nature 521, 436–444.
Morrison, D., McDicken, W., Smith, D., 1983. A motion artefact in real-time
ultrasound scanners. Ultrasound in Medicine and Biology 9, 201 – 203.
Mozaffari, M.H., Lee, W.S., 2017. Freehand 3-d ultrasound imaging: A system-
1050 atic review. Ultrasound in Medicine and Biology 43, 2099–2124.
Nagaraj, Y., Benedicks, C., Matthies, P., Friebe, M., 2016. Advanced inside-
out tracking approach for real-time combination of mri and us images in the
radio-frequency shielded room using combination markers, in: Engineering in
50
ACCEPTED MANUSCRIPT
Medicine and Biology Society (EMBC), 2016 IEEE 38th Annual International
1055 Conference of the, IEEE. pp. 2558–2561.
Prager, R.W., Gee, A.H., Treece, G.M., Cash, C.J., Berman, L.H., 2003. Sensor-
less freehand 3-d ultrasound using regression of the echo intensity. Ultrasound
in medicine & biology 29, 437–446.
Prevost, R., Salehi, M., Sprung, J., Ladikos, A., Bauer, R., Wein, W., 2017.
1060 Deep learning for sensorless 3d freehand ultrasound imaging, in: Descoteaux,
M., Maier-Hein, L., Franz, A., Jannin, P., Collins, D.L., Duchesne, S. (Eds.),
Medical Image Computing and Computer-Assisted Intervention – MICCAI
2017, Springer International Publishing, Cham. pp. 628–636.
Rivaz, H., Zellars, R., Hager, G., Fichtinger, G., Boctor, E., 2007. 9c-1 beam
1065 steering approach for speckle characterization and out-of-plane motion esti-
mation in real tissue, in: Ultrasonics Symposium, 2007. IEEE, IEEE. pp.
781–784.
Sainath, T.N., Vinyals, O., Senior, A., Sak, H., 2015. Convolutional, long short-
term memory, fully connected deep neural networks, in: Acoustics, Speech
1070 and Signal Processing (ICASSP), 2015 IEEE International Conference on,
IEEE. pp. 4580–4584.
Salehi, M., Prevost, R., Moctezuma, J.L., Navab, N., Wein, W., 2017. Pre-
cise ultrasound bone registration with learning-based segmentation and speed
of sound calibration, in: Medical Image Computing and Computer-Assisted
1075 Intervention - MICCAI 2017, Springer International Publishing, Cham. pp.
682–690.
Simonyan, K., Zisserman, A., 2015. Very deep convolutional networks for large-
scale image recognition. ICLR 2015 .
Tetrel, L., Chebrek, H., Laporte, C., 2016. Learning for graph-based sensorless
1080 freehand 3d ultrasound, in: Machine Learning in Medical Imaging: 7th In-
ternational Workshop, MLMI 2016, Held in Conjunction with MICCAI 2016,
51
ACCEPTED MANUSCRIPT
Athens, Greece, October 17, 2016, Proceedings, Springer International Pub-

lishing. pp. 205–212.
Toews, M., Wells, W.M., 2018. Phantomless auto-calibration and online cali-
1085 bration assessment for a tracked freehand 2-d ultrasound probe. IEEE Trans-
actions on Medical Imaging 37, 262–272.
Tuthill, T.A., Krücker, J., Fowlkes, J.B., Carson, P.L., 1998. Automated three-
dimensional us frame positioning computed from elevational speckle decorre-
lation. Radiology 209, 575–582.
1090 Wein, W., Khamene, A., 2008. Image-based method for in-vivo freehand ultra-
sound calibration, in: SPIE Medical Imaging 2008, San Diego.
Xingjian, S., Chen, Z., Wang, H., Yeung, D.Y., Wong, W.K., Woo, W.c., 2015.
Convolutional lstm network: A machine learning approach for precipitation
nowcasting, in: Advances in neural information processing systems, pp. 802–
1095 810.
Zbontar, J., LeCun, Y., 2015. Computing the stereo matching cost with a
convolutional neural network, in: Proceedings of the IEEE conference on
computer vision and pattern recognition, pp. 1592–1599.
52

2018 3D Freehand Ultrasound Without External Tracking Using Deep

Uploaded by

Copyright:

Available Formats

You might also like

2018 3D Freehand Ultrasound Without External Tracking Using Deep

Uploaded by

Document Information

Original Description:

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

2018 3D Freehand Ultrasound Without External Tracking Using Deep

Uploaded by

Copyright:

Available Formats

Accepted Manuscript

3D Freehand Ultrasound Without External Tracking Using Deep

Raphael Prevost, Mehrdad Salehi, Simon Jagoda, Navneet Kumar,

To appear in: Medical Image Analysis

Received date: 8 February 2018

2D Ultrasound Clip Frame-to-Frame

• A system for 3D freehand ultrasound reconstruction without external

• Integration of an IMU to improve the accuracy even further.

• Extensive validation on a large clinically relevant dataset.

• Unprecedented reconstruction accuracy, especially for elongated sweeps.

3D Freehand Ultrasound Without External Tracking

Raphael Prevosta,∗, Mehrdad Salehia,b , Simon Jagodaa , Navneet Kumara ,

This work aims at creating 3D freehand ultrasound reconstructions from 2D

∗ Corresponding author. Address: ImFusion GmbH, Agnes-Pockels-Bogen 1, 80992

Preprint submitted to Medical Image Analysis June 7, 2018

2D Ultrasound clip 3D Ultrasound

Ultrasound imaging (US) combines a number of advantages as a medical

due to its inability to produce three-dimensional data, which is required for

magnetic influences in proximity of the scanned anatomy. Also inside-out track-

The remainder of the paper is organized as follows. In Section 2, we outline

2. Related Work and Contributions

The image-based estimation of the trajectory of an ultrasound probe is a

• The in-plane motion, which is a translation along the US plane, is sup-

cessed January 2018)

(accessed January 2018)

90 displacement field from two images.

• The out-of-plane motion (also called elevational displacement), on the

this method is designed for ultrasound datasets acquired by sweeping multiple

1. By using a deep learning-based approach, we explore a path that is sig-

2. We conduct different studies on a very large dataset composed of 780

3. We also provide a number of additional strategies to make the estimates

190 2. We have recorded an additional dataset of 700 US sweeps in addition to our

3. Based on this comprehensive dataset, we extend our parameter analysis

4. Finally, we conducted additional studies on the generalization capabilities

3.1. Problem statement

205 3.1.1. Main notations

is also a rigid transformation and can therefore be parametrized by a 6-valued

3.1.2. Transformation parametrization

245 3.2. From speckle decorrelation to convolutional neural networks

Patch Division + Cross-Correlation

Convolutions + Pooling + Activation layers

At first glance, aiming at mimicking the elaborate speckle decorrelation al-

• The local cross-correlation operation may be approximated by a set of

• The patch-based approach that aggregates local information corresponds

3.3. Augmenting the network with the optical flow

Thanks to their representative power, convolutional neural networks are the-

Convolution 5x5x64 Convolution 3x3x64

Fully Connected (512)

Fully Connected (6)

3.4. Augmenting the network with an IMU

Non-image information can also easily be leveraged within our approach

4.1. Data acquisition

4.1.1. Ultrasound images

4.1.2. Ground truth tracking

4.1.3. IMU integration

ibration procedure outlined below, we picked the Xsens MTi-3-8A7G6 because

1. A set of 20 US sweeps acquired on a BluePhantom

2. A set of 88 in vivo tracked US sweeps acquired on the forearms of 12