
H06W8a: Medical image analysis
Deep learning for medical image computing

Class 10: Deep learning


Part 1: Principles
Prof. Frederik Maes
frederik.maes@esat.kuleuven.be

Medical Image Computing: Fundamental problems

• Image segmentation:
– Detection and delineation of objects of interest in the images
– Prerequisite for morphometry and regional quantification

• Image registration:
– Spatial alignment of different images (different modalities / time points / subjects…)
– Prerequisite for fusion and joint analysis of complementary image information

• Image visualization:
– Presentation of the relevant information that was extracted from the images
– Prerequisite for clinical interpretation

[Figures: clinical examples, including fMRI, normal vs stroke perfusion, and Huntington cases]

Challenges

• Complex data
– multi-dimensional, multi-temporal, multi-modal, multi-subject
– limited image quality: resolution, contrast, noise, artifacts → ambiguity
• Complex objects
– 3D anatomical shapes
– normal biological variability
– abnormalities (pathology…)
• Complex applications
– continuous technological advances in medical imaging
– increasing clinical requirements
• Complex validation
– lack of objective ground truth in clinical images
– observer variability

Model-based image analysis

• Incorporate prior knowledge about the appearance of the objects of interest in the images
– photometric properties (intensity, contrast, texture…)
– geometric properties (position, shape, deformations…)
– context (other objects, clinical information…)
• Model = mathematical representation of prior knowledge
– must be flexible (= parameterized) to account for variability
– can itself be represented as an image (e.g. an atlas)
• Image analysis problem formulated as an optimization problem
– Find the model parameters that “best fit” the image data
– Suitable criterion to measure “goodness of fit” (= objective function)
– Suitable computational strategy to find the optimal solution (= optimization method)

Maximum A Posteriori Probability (MAP) formulation

M(Θ) = model with parameters Θ

Bayes’ rule: P(a,b) = P(a)·P(b|a) = P(b)·P(a|b), hence

Prob(M(Θ) | I) = Prob(I | M(Θ)) · Prob(M(Θ)) / Prob(I)

Prob(M(Θ)) = prior probability of the model with parameters Θ
Prob(I | M(Θ)) = data likelihood for the specified model parameters
Prob(I) = prior probability of observing image I (independent of Θ)
Prob(M(Θ) | I) = posterior probability of the model with parameters Θ for the observed image I

If Prob(M(Θ)) = constant → maximum likelihood

Approach 1: energy minimization

Gibbs distribution: Prob(M(Θ) | I) = (1/Z)·exp(−E(Θ))
E = energy function (to be defined…)
Z = normalization constant

Taking the negative logarithm of the posterior turns MAP estimation into minimization of the energy

E(Θ) = Eext(I | Θ) + γ·Eint(Θ)

Eint = internal energy → measures fidelity to the prior
Eext = external energy → measures conformity to the data
γ = user-specified weight (hyper-parameter) → tunable behavior…

Energy minimization problem
→ Flexibility to define the energy terms, based on heuristics, physics, statistics…

Example 1: statistical shape models

Model building: M = contour with N landmarks; Θ = landmark coordinates
Model fitting: Eint(Θ) = shape model; Eext(I | Θ) = intensity model

Example 2: atlas-based segmentation

Model building: M = atlas; Θ = deformation field
Model fitting: Eint(Θ) = regularization; Eext(I | Θ) = similarity measure

Approach 2: Classification

MAP formulation: estimate Prob(M(Θ) | I) directly from training data (supervised classification)

F = classifier / regressor
Maps a given input image I onto the most likely model instance Θ* based on previous training data

Instead of the full data I: feature vector f (= dimensionality reduction)

Conventional machine learning: handcrafted features

Data → handcrafted feature vector (based on domain knowledge) → classifier → output
Typical classifiers: SVM, random forest, neural network

Deep learning: automated optimal features

Data → features + classifier → output
The handcrafted feature vector and the domain knowledge behind it are replaced by a deep neural network with automated discovery of optimal features.

A single neural node

a = f(z), with z = Σi wi·xi (by convention x0 = 1, so w0 acts as a bias)
= linear weighted sum, followed by a non-linear activation function f
Example: the sigmoid, which rises from 0 to 1 with f(0) = 0.5, so z = 0 → a = 0.5

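As an illustration, a minimal NumPy sketch of this single node; the input and weight values are made up for the example:

import numpy as np

def sigmoid(z):
    # Sigmoid activation: maps any real z to (0, 1), with f(0) = 0.5
    return 1.0 / (1.0 + np.exp(-z))

def neuron(x, w):
    # Single node: linear weighted sum z = sum_i w_i * x_i,
    # followed by the non-linear activation a = f(z).
    # By convention x[0] = 1, so w[0] acts as a bias term.
    z = np.dot(w, x)
    return sigmoid(z)

x = np.array([1.0, 0.8, -0.3])   # x0 = 1 plus two input features
w = np.array([0.1, 1.5, 2.0])    # illustrative weights
print(neuron(x, w))              # a > 0.5 exactly when z > 0
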
Rationale

z = Σi wi·xi defines a linear decision boundary in the feature space (x1, x2):
z = 0 → a = 0.5 (on the boundary), z > 0 → a → 1, z < 0 → a → 0

With non-optimal values for w, the boundary separates Class 1 and Class 2 poorly.
With optimal values for w, the decision boundary z = 0 separates the two classes, and a = Prob(Class = 1).

Rationale

For a new sample *, with optimal values for w: a(*) > 0.5 → Class 1

A simple neural network

Output of layer l = input of layer (l+1)
Weight w associated with each connection
For classification: output = binary class labels (1 if Class 1, 0 otherwise; 1 if Class 2, 0 otherwise)
For regression: output = real value
Stacking layers yields curved decision boundaries in feature space (x1, x2); a sketch of the forward pass follows below.

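A minimal sketch of such a network as a sequence of layers, each computing a weighted sum followed by a sigmoid activation; the layer sizes and random weights are illustrative (in practice the weights are learned from data):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, layers):
    # Output of layer l = input of layer (l+1):
    # each layer applies its weight matrix and bias, then the activation.
    a = x
    for W, b in layers:
        a = sigmoid(W @ a + b)
    return a

rng = np.random.default_rng(0)
layers = [(rng.normal(size=(3, 2)), np.zeros(3)),   # 2 inputs -> 3 hidden nodes
          (rng.normal(size=(1, 3)), np.zeros(1))]   # 3 hidden nodes -> 1 output
print(forward(np.array([0.5, -1.0]), layers))       # a in (0, 1): Prob(Class = 1)
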
Supervised training

Classifier based on features f, with internal parameters w
For each training sample i, a prediction is made with the current parameters w
Find the classifier parameters w that yield optimal classification performance over the training set w.r.t. some loss function L
= complex non-linear optimization problem

Popular loss functions:
- Cross-entropy (for classification)
- Mean squared error (for regression)
- Dice similarity (for segmentation)

Gradient descent

= iterative search procedure: walk downhill on the loss L by taking small steps along the direction of steepest descent

w(k+1) = w(k) − α·(dL/dw)(k)

α = step size = learning rate
Too large: no convergence
Too small: local optimum, slow progress towards the optimum wopt

Updates determined for each weight layer-by-layer (from output to input) = back-propagation

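A minimal sketch of the update rule on a toy one-dimensional loss L(w) = (w − 3)², with gradient dL/dw = 2(w − 3); the learning rate value is illustrative:

def gradient_descent(grad_L, w0, alpha=0.1, n_steps=100):
    # Iterative search: w(k+1) = w(k) - alpha * dL/dw, evaluated at w(k)
    w = w0
    for _ in range(n_steps):
        w = w - alpha * grad_L(w)
    return w

w_opt = gradient_descent(lambda w: 2 * (w - 3), w0=0.0)
print(w_opt)  # close to the optimum w = 3; too large a step diverges, too small converges slowly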

Deep neural networks

Input: 2000x2000 pixels = vector with 4 million values
Output: N classes, e.g.
1 if Normal, 0 otherwise
1 if Abnormal type A, 0 otherwise
1 if Abnormal type B, 0 otherwise

Huge amount of parameters if fully connected
Limited amount of training samples
Complex training
Poor generalization for classification of new test images

Convolutional neural networks

Share weights between different regions → fewer parameters
Convolution = linear filtering operator → feature maps
Pooling → position and scale invariance
Cascaded → increasing level of abstraction
Learn optimal features during training (back-propagation)
Fully connected layers for final classification using the learned features

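A rough back-of-the-envelope comparison for the 2000x2000 input above; the hidden-layer size is an assumed, illustrative number:

# Fully connected: every input pixel connected to every node of the next layer.
n_in = 2000 * 2000                 # 4 million input values
n_hidden = 1000                    # assumed size of a single hidden layer
print(n_in * n_hidden + n_hidden)  # ~4 billion parameters for one layer alone

# Convolutional: one 3x3 kernel shared across all positions in the image.
print(3 * 3 + 1)                   # 10 parameters per feature map, independent of image size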

Convolution operator

Example: 3x3 filter W applied to image I yields filtered image J (= feature map):

j(r, c) = w0·i0 + w1·i1 + … + w8·i8

where i0 = i(r, c) and i1…i8 are its 8 neighbors in the 3x3 window centered on (r, c), each multiplied by the corresponding filter weight wk.

Convolutional filters in image processing

Smoothing (weighted average), e.g. a 5x5 Gaussian kernel:

0.0232 0.0338 0.0383 0.0338 0.0232
0.0338 0.0492 0.0558 0.0492 0.0338
0.0383 0.0558 0.0632 0.0558 0.0383
0.0338 0.0492 0.0558 0.0492 0.0338
0.0232 0.0338 0.0383 0.0338 0.0232

Edge detection, e.g. vertical difference and horizontal difference kernels:

-1 -2 -1        -1  0  1
 0  0  0        -2  0  2
 1  2  1        -1  0  1

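A minimal sketch of this operator for a 3x3 filter, using the horizontal-difference kernel above. (As in most CNN frameworks, the code computes cross-correlation; true convolution would additionally flip the kernel.)

import numpy as np

def conv2d(image, w):
    # "Valid" 2D filtering: j(r, c) = weighted sum of the 3x3 neighborhood of each pixel
    H, W = image.shape
    out = np.zeros((H - 2, W - 2))
    for r in range(1, H - 1):
        for c in range(1, W - 1):
            out[r - 1, c - 1] = np.sum(image[r - 1:r + 2, c - 1:c + 2] * w)
    return out

w = np.array([[-1, 0, 1],
              [-2, 0, 2],
              [-1, 0, 1]])            # horizontal-difference (edge detection) kernel
image = np.random.rand(8, 8)
print(conv2d(image, w).shape)         # (6, 6): without padding the feature map shrinks
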
Stride and padding

Stride = step with which the filter window is shifted over the image
Padding = extending the image border (e.g. with zeros) so the filtered image keeps the same size

Convolution layer

Shorthand notation: 3x3 kernels, 3 input channels, 2 output feature maps
Number of parameters: 3x3x3x2 + 2 = 56
If fully connected: 108x32 + 32 = 3488

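The parameter counts on the slide can be checked directly; interpreting the 108 fully connected inputs as a 6x6x3 patch is an assumption:

# Convolution layer: 3x3 kernels over 3 input channels, 2 output feature maps
k, c_in, c_out = 3, 3, 2
print(k * k * c_in * c_out + c_out)   # 56: weights plus one bias per feature map

# Fully connected layer: 108 inputs (assumed 6x6x3) to 32 output units
n_in, n_out = 6 * 6 * 3, 32
print(n_in * n_out + n_out)           # 3488: one weight per connection plus biases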

Alternative activation functions

sigmoid: a = 1 / (1 + e^(−z)), ranging from 0 to 1
tanh: a = tanh(z), ranging from −1 to 1
Rectified Linear Unit (RELU): a = max(0, z) ← MOST POPULAR
Leaky RELU (LRELU): a = z for z ≥ 0, a small non-zero slope times z for z < 0

Pooling layer

[Figure: pooling operation reducing each block of a feature map to a single value]

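Minimal sketches of the two rectified activations plus a 2x2 max-pooling operation; the leaky slope value is illustrative:

import numpy as np

def relu(z):
    # Rectified Linear Unit: a = max(0, z)
    return np.maximum(0.0, z)

def leaky_relu(z, slope=0.01):
    # Leaky RELU: identity for z >= 0, small non-zero slope for z < 0,
    # so the gradient never vanishes completely
    return np.where(z >= 0.0, z, slope * z)

def max_pool_2x2(x):
    # Keep the strongest response in each 2x2 block:
    # some invariance to small positional shifts, at half the resolution
    H, W = x.shape
    return x[:H - H % 2, :W - W % 2].reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

z = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(z))                                       # [0. 0. 0. 1.5]
print(leaky_relu(z))                                 # [-0.02 -0.005 0. 1.5]
print(max_pool_2x2(np.arange(16.0).reshape(4, 4)))   # [[ 5. 7.] [13. 15.]]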

Feature maps

Consecutive layers = consecutive levels of abstraction

Training of deep CNNs: plenty of heuristics

• Large amount of parameters, limited training samples
→ Data augmentation (image translation, rotation, mirroring, deformations)
→ Normalization
→ Initialization (pre-training, transfer learning)
• Local minima
→ Vanishing gradient problem: gradients get smaller from end to front (→ RELU)
→ Batch gradient descent: less erratic updates
→ Momentum: keep track of the previous downhill direction (e.g. Adam optimizer)
→ Variable learning rate
→ Drop out
• Overfitting
→ Training set & validation set
→ Early stopping
→ Regularization

Drop out

= randomly deactivate a fraction of the nodes at each training iteration, so the network cannot rely on any single node (a sketch follows below)

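A minimal sketch of one common variant, inverted dropout, in which the surviving activations are rescaled so their expected value is unchanged; the drop probability is illustrative:

import numpy as np

def dropout(a, p_drop=0.5, training=True):
    # During training, zero each activation with probability p_drop and
    # rescale the survivors by 1 / (1 - p_drop); at test time, do nothing.
    if not training:
        return a
    mask = (np.random.rand(*a.shape) >= p_drop) / (1.0 - p_drop)
    return a * mask

a = np.ones((2, 4))
print(dropout(a))  # roughly half the entries zeroed, the rest scaled to 2.0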

Training & validation set

N training samples, split into:
TRAINING SET (60%): used to train the network parameters
VALIDATION SET (20%): used during training to assess overfitting
TEST SET (20%): new data examples, not used for training

Batch gradient descent

Stochastic gradient descent: parameters adjusted after every training sample
Batch gradient descent: parameters adjusted after every pass over all training samples (once after each epoch)
Mini-batch gradient descent: parameters adjusted after every m training samples (several times per epoch)

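A minimal sketch covering all three variants with one loop: m = 1 gives stochastic gradient descent, m = N gives (full) batch gradient descent, anything in between gives mini-batch gradient descent. The update function is a hypothetical placeholder for one gradient-descent step:

import numpy as np

def minibatch_epochs(X, y, update_fn, m=32, n_epochs=10, seed=0):
    # One parameter update per mini-batch of m samples, several updates per epoch
    rng = np.random.default_rng(seed)
    N = len(X)
    for _ in range(n_epochs):
        order = rng.permutation(N)        # reshuffle the training set each epoch
        for start in range(0, N, m):
            idx = order[start:start + m]
            update_fn(X[idx], y[idx])     # e.g. one gradient-descent step on this batch

X = np.random.rand(100, 5)
y = np.random.randint(0, 2, size=100)
minibatch_epochs(X, y, update_fn=lambda xb, yb: None, m=32, n_epochs=2)  # dummy update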

Early stopping

= stop training when the loss on the validation set stops improving, even though the training loss may still be decreasing (a sketch follows below)

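A minimal sketch of early stopping with a patience parameter (one common formulation; train_step and val_loss are hypothetical placeholders for one training pass and the validation-loss evaluation):

def train_with_early_stopping(train_step, val_loss, max_epochs=200, patience=10):
    # Stop when the validation loss has not improved for `patience` epochs:
    # the network is then most likely starting to overfit the training set.
    best, best_epoch = float("inf"), 0
    for epoch in range(max_epochs):
        train_step()                         # one pass over the training set
        loss = val_loss()                    # loss on the held-out validation set
        if loss < best:
            best, best_epoch = loss, epoch   # in practice, also save these weights
        elif epoch - best_epoch >= patience:
            break                            # no improvement for `patience` epochs
    return best

losses = iter([1.0, 0.8, 0.7, 0.75, 0.9, 1.1, 1.3, 1.4])
print(train_with_early_stopping(lambda: None, lambda: next(losses), patience=3))  # 0.7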

Example: LeNet-5 for written digit recognition (1998)

Input dimensions: 32 x 32 = 1024
Output dimensions: 10
Number of parameters:
– Convolution layers: ~2500
– Fully connected layers: ~59100
– Total: ~62000

Example: AlexNet for object recognition (2012)

Input dimensions: ~150 K
Output dimensions: 1000
Number of parameters:
– Convolution layers: ~2.3 M
– Fully connected layers: ~58.6 M
– Total: ~61 M

CNN variants

Residual Neural Network (skip connections)

GAN (Generative Adversarial Networks)

GAN: examples
– AI-generated portrait auctioned at Christie’s, 2018
– StarGAN, 2018
– Synthetic CT, 2018

The DL zoo

RNN, LSTM, GAN, U-Net, SqueezeNet, DeepMedic, …
(U-net, Ronneberger et al.; DeepMedic, Kamnitsas et al.)

Deep learning for medical image computing

Part 2: Applications

General principles of deep learning

Step 1: Gather data which is relevant for your learning problem
Step 2: Develop your neural network with basic parameters
Step 3: Train your neural network using an optimization method and a cost function
Step 4: Tune the hyper-parameters of your neural network to secure generalization
Step 5: Use the predicted parameters from your model to predict outcomes on new data

Deep Learning for Computer Vision

• Pictures are everywhere
• Everyone is an expert
• Performance is (usually) not critical
• Underlying hypotheses are irrelevant
→ Strive for fully automated solutions

Deep Learning for Medical Imaging

Applications: reconstruction & QC, quantification, treatment planning, diagnosis, outcome prediction
(© ADNI; Lao et al., Scientific Reports, 2017)

• Image access is limited
• True expertise is scarce
• Performance is (usually) critical
• Image findings require clinically relevant interpretation
→ Essential to keep the expert in the loop

Deep Learning for Image Segmentation

Ubiquitous in medical image computing
Requires domain-specific knowledge
Until recently: variety of model-based strategies (cfr. energy minimization)
Now: unified approach using Deep Learning (classification)

CNNs for image segmentation: U-Net
(U-net, Ronneberger et al.)

CNNs for image segmentation: DeepMedic
(DeepMedic, Kamnitsas et al.)

CNNs for image segmentation: our own DeepVoxNet

Input: patch-based approach, convolutional neural network
Architecture based on DeepMedic
Loss function (cross-entropy, Dice)
Image transformations
Class-weighted sample generator
Sample transformations

Dice Similarity Coefficient (DSC)

DSC = V(1&2) / ((V(1) + V(2)) / 2)

where V(1) and V(2) are the volumes of the two segmentations and V(1&2) the volume of their overlap.

50% over-segmentation: DSC = 80%
20% over-segmentation: DSC = 90%
The same x mm of over-segmentation can yield DSC = 82% or DSC = 90%, depending on the size of the segmented object.

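The DSC is straightforward to compute for binary masks; the worked example reproduces the slide’s 50% over-segmentation case:

import numpy as np

def dice(seg, gt):
    # DSC = V(1&2) / ((V(1) + V(2)) / 2) = 2 |A n B| / (|A| + |B|)
    intersection = np.logical_and(seg, gt).sum()
    return 2.0 * intersection / (seg.sum() + gt.sum())

gt = np.zeros(100, dtype=bool); gt[:40] = True     # ground truth: 40 voxels
seg = np.zeros(100, dtype=bool); seg[:60] = True   # 50% over-segmentation covering GT
print(dice(seg, gt))                               # 2*40 / (60 + 40) = 0.80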

Example 1: Delineation of left ventricle in cardiac MRI

• Unet-based architecture
– Input = stack of 11 T1-weighted images (mid-cavity SA, 320 x 320, pre-aligned)
– Output = segmentation of myocardium
• Training: 146 patients x 5 slices
– Loss function = Dice coefficient
– No data augmentation
– Trained from scratch (~30’)
• Test: 50 patients x 5 slices

Results (expert vs AI tool):
– Mean DSC [%]: 91.8
– MAD_Endo [mm]: 0.94
– MAD_Epi [mm]: 1.36
– HD_Endo [mm]: 2.54
– HD_Epi [mm]: 3.40
– Best DSC = .97, Median DSC = .93, Worst DSC = .40

Example 2: Delineation of left ventricle in cardiac US

• Unet-based architecture
– Input = 2CH US, ED or ES (672 x 672, point of cone aligned)
– Output = delineation of LV epi/endo + LA
• Training: 450 patients x 2 phases
– Loss function = mean DSC
– No data augmentation
– Trained from scratch
• Test: 50 patients x 2 phases
• From concept to state-of-the-art result: < 2 weeks…

Results, DSC (%) mean +/− std:
– LV: 91.1 +/− 4.9
– Myocard: 82.2 +/− 7.7
– LA: 84.7 +/− 15.4
[Figure: max, median and min cases, e.g. DSC = .80]

Example 3: Brain tissue segmentation (2D CNN)
Moeskops et al., IEEE Trans Medical Imaging, 2016

Prenatal, young adults, aging subjects
Modalities: T1, T2
Very limited training set (5-10 images)
2D patches at multiple scales
No post-processing
(ChildBrain 2019)

Example 4: Brain tissue segmentation (3D CNN)
Wachinger et al., DeepNAT, NeuroImage 2018

25 structures
T1-weighted, 1 mm3
13x13x13 patches
2.7 million parameters
Use coordinates for local context
Preprocessing: BFC, brain mask
30 images (20 training, 10 test)
Performance: DSC 80-90%

Example 5: Detection and segmentation of brain pathology

Tumor, MS, Stroke: unified approach using DL! (© BRATS, ISLES)

Input: multimodal MRI (FLAIR, T1, T1 CE, T2); output compared to ground truth (GT)
Binary classification: whole tumor
Multiclass classification: whole tumor, tumor core, enhancing tumor
[Figures: series of cases showing GT vs prediction]

Example 6: Assessment of tissue viability in acute stroke

Input: initial CT perfusion scan (frames at 0 s, 5 s, 10 s, 15 s, 20 s, 25 s)
Conventional approach: perfusion model → Tmax, ischemic core (excluding penumbra)
Alternative approach: data-driven prediction based on MRI → predicted Tmax vs ground truth

Deep Learning for RT planning

Workflow: CT acquisition → contouring → RT plan optimization → final dose calculation
– Contouring: manual or semi-automatic (target volumes, organs at risk); 45 min - 2 hours; inter- and intra-observer variability
– Plan optimization: manual interventions; hours; inter-institutional variability
– Result: one patient-specific treatment plan

Delineation of Organs at Risk (OAR)

Input: planning CT
H&N: 16 OAR, clinical guidelines (Brouwer et al, 2016):
Brainstem, Left Cochlea, Right Cochlea, Upper Esophagus, Glottic Larynx, Mandible, Oral Cavity, Supraglottic Larynx, Left Parotid gland, Right Parotid gland, Inferior PCM, Mid PCM, Superior PCM, Left Submandibular gland, Right Submandibular gland, Spinal Cord

Semi-automated delineation approaches have been developed (e.g. atlas-based), but are complicated to use
In clinical practice: manual delineation…

DeepVoxNet 3D CNN for H&N OAR segmentation

DeepVoxNet with 16 outputs = probability maps → post-processing
Loss function: mean Dice per structure
~70 training images: planning CT + delineations exported from clinical TPS
Trained from scratch (randomly initialized weights)
Adam optimizer with drop out

Initial results
[Figure: expert vs AI tool contours]

Integration in the clinical RT workflow in UZ Leuven
(Radiation Oncology & Medical Imaging Research Center)

Planning CT (DICOM) → auto-delineation by DeepVoxNet on a GPU server hosted in the UZ Leuven datacenter → DICOM RT-Struct → TPS → validation → approved contours
Clinical feedback feeds back into retraining.

Retrospective validation: AI tool mimicking the expert

Expert (45’) vs AI tool (5’)
Retrospective = ‘never perfect…’: observer variability… ‘ground truth’?

DSC (%) on 10 test cases:
Brainstem: 84.7
Left Cochlea: 70.1
Right Cochlea: 80.1
Upper Esophagus: 66.7
Glottic Larynx: 53.3
Mandible: 84.9
Oral Cavity: 67.5
Supraglottic Larynx: 54.3
Left Parotid gland: 80.8
Right Parotid gland: 79.8
Inferior PCM: 48.0
Mid PCM: 55.5
Superior PCM: 36.5
Left Submandibular gland: 69.6
Right Submandibular gland: 71.9
Spinal Cord: 81.9

Prospective validation: expert correcting the AI tool

AI tool (5’) + corrections (15’); 15 consecutive H&N RT patients, clinical planning CT
Prospective = ‘good enough!’: the observer focuses on clear errors, irrelevant variability is ignored
Direct clinical feedback: mouth open/closed, missing organs

DSC (%), retrospective (10 cases) vs prospective (20 cases):
Brainstem: 84.7 / 91.5
Left Cochlea: 70.1 / 75.4
Right Cochlea: 80.1 / 73.1
Upper Esophagus: 66.7 / 34.8
Glottic Larynx: 53.3 / 39.4
Mandible: 84.9 / 95.9
Oral Cavity: 67.5 / 83.5
Supraglottic Larynx: 54.3 / 71.2
Left Parotid gland: 80.8 / 86.3
Right Parotid gland: 79.8 / 89.7
Inferior PCM: 48.0 / 57.9
Mid PCM: 55.5 / 60.9
Superior PCM: 36.5 / 46.1
Left Submandibular gland: 69.6 / 78.8
Right Submandibular gland: 71.9 / 87.7
Spinal Cord: 81.9 / 95.9

Clinical benefits of auto-delineation using DL?

Experimental automated workflow: automated delineations (16 OAR) → corrected delineations by Expert 1 and Expert 2
Conventional manual workflow: manual delineations by Expert 1 and Expert 2

Auto-delineation performance: efficiency

Delineation time = time spent by the expert to delineate or to correct all 16 OAR structures in one patient
Average manual delineation time = 36 min
Average corrected delineation time = 22 min
→ Reduction of 38%

Auto-delineation performance: observer variability

[Figure: manual vs corrected delineations for Expert 1, Expert 2 and the automated tool]

Auto-delineation performance: robustness

[Figure: manual vs corrected delineations for Expert 1, Expert 2 and the automated tool]

Beyond delineation in Radiology: computer-aided diagnosis

Cardiac MRI (Dept. Cardiology, KU Leuven): healthy or infarcted?
Pipeline: delineation (epicardium & endocardium, at end-diastole and end-systole) → imaging biomarkers (features) → statistical modeling → feature selection → classification
Deep learning today: the delineation step
Conventional machine learning: feature selection and classification
Deep learning in the future: the whole pipeline?

Beyond delineation in RT: treatment planning

Workflow: CT acquisition → contouring → RT plan optimization → final dose calculation
– Contouring: manual or semi-automatic; 45 min - 2 hours; inter- and intra-observer variability
– Plan optimization: manual interventions; hours; inter-institutional variability
– Result: one patient-specific treatment plan

Deep learning for optimal dose prediction

1) Contours derived from CT using deep learning (= voxel-wise classification)
2) Optimal dose derived from the contours (& CT) using deep learning (= voxel-wise regression)

Pipeline: images → (DL 1) → contours → (DL 2) → dose → dose mimicking in the TPS (plan, DVH constraints, treatment setup) → actual dose

Sequential → sub-optimal use of the correlation between CT / contours / dose

Multi-task learning: contouring + dose prediction

Contours and optimal dose estimated jointly from CT using deep learning
Simultaneous, largely based on the same image-derived features → exploits the correlations

Deep learning for treatment adaptation?

Pre-treatment planning CT vs on-board CBCT of the day
Option 1: Adapt
Option 2: Replan

Conclusion

• Deep learning is based on deep convolutional neural networks
• A deep neural network is a NN with many hidden layers
• A deep NN aims at achieving a complex mapping of a high-dimensional input (an image) onto a user-specified output (classification, regression)
• Training involves optimizing the (very many) internal parameters of the NN (using back-propagation) by presenting samples of input/output pairs until convergence (assessed by a suitable loss function)
• CNNs are conceptually simple, but due to the huge number of parameters, many heuristics are needed to get them to work
• CNNs are computationally complex and best implemented on GPUs
• CNNs were invented in the 1990s, but broke through 20 years later in computer vision due to cheaper GPUs, large image databases on the internet and more clever training schemes
• DL using CNNs is a revolution in computer vision, as it works with rich image patterns directly instead of poor feature-based representations derived from them
• Medical imaging applications are more complex, as data is scarce and specific expertise is needed
• Properly structured image annotations are key for deep learning
• The current AI hype is all about DL. The intelligence is in the combination of {data, annotations, domain knowledge}, not in the algorithm itself.
