
Captioning Chest X-Rays with Deep Learning


Krishna Poddar (2K21/CO/243)
Kumar Su Prashant (2K21/CO/247)
Kunal Singhal (2K21/CO/251)
Table of contents

01 Introduction
02 Proposed Methodology
03 Experimental Details
04 Code
05 Output
06 Bibliography
01
Introduction
Introduction
1. Transformation in Deep Learning: Deep learning has significantly advanced, especially in understanding and
interpreting visual information.

2. Synergy of Computer Vision and NLP: The combination of computer vision and natural language processing
(NLP) is groundbreaking, enabling machines to recognize and describe objects and scenes in natural language.

3. Potential in Medical Radiology: This synergy holds immense potential in medical radiology, where
thousands of images are generated daily, aiding in disease diagnosis, treatment monitoring, and understanding
patients' health conditions.

4. Challenges in Image Interpretation: Despite the importance of radiological images, interpreting them is
challenging due to the complexity and vast amount of visual information they contain.

5. Objective of the Project: The project aims to address this challenge by creating an automated image
captioning system specifically tailored for medical radiology reports.

6. Utilizing Deep Learning: State-of-the-art neural network architectures will be employed to generate
coherent and contextually accurate natural language descriptions of radiological images.

7. Benefits: This innovative approach is expected to save time for healthcare professionals while enhancing
accessibility and interpretability of medical images for various stakeholders, including physicians, radiologists,
and patients.
Challenges in Medical Radiology Reports
1. Complexity of Medical Radiology Reports: These reports contain intricate and multifaceted images,
requiring a deep understanding of anatomy, pathology, and disease-specific patterns.

2. Use of Medical Jargon: Radiology reports often contain complex medical terminology, making them
challenging for non-specialists to comprehend.

3. Challenge for Automated Interpretation: The combination of visual complexity and linguistic specificity
poses a significant challenge for automated interpretation of radiology reports.

4. Conventional Methods: Traditional methods involve manual interpretation and report writing by radiologists,
which are time-consuming, prone to human error, and may result in reporting backlogs in busy healthcare
settings.

5. Advantages of Deep Learning-Based Image Captioning: Deep learning-based systems can automatically
generate detailed and coherent descriptions of radiological images, reducing the burden on healthcare
professionals and providing rapid, consistent, and understandable reports.
02
Proposed
Methodology
Introduction to Image Understanding
1) Image Understanding:
a) Essential for generating coherent captions.
b) Encompasses object recognition, scene
recognition, and understanding
interrelationships.

2) Key Aspects:
a) Object Recognition: Identifying anatomical
structures or abnormalities.
b) Scene Recognition: Understanding the
broader context, crucial in medical images.
c) Interrelationships: Understanding how
objects and scenes relate.
Language Used: Python

a) Key Language: Python is the main language for AI/ML development due to its simplicity.
b) Rich Libraries: Python's ecosystem (TensorFlow, Keras, NumPy, and others) covers the full AI/ML workflow.
c) Flexibility: Python's adaptability streamlines the AI/ML workflow.
d) Community Support: Python's vast community offers abundant resources for AI/ML practitioners.
Libraries Used:
a) TensorFlow: For building, training, and deploying
ML models.
b) Keras: Simplifies building and training deep
learning models.
c) NumPy: For numerical computations and data
manipulation.
d) OpenCV: For image processing tasks.
e) NLTK: For text-related tasks in natural language
processing.
f) Scikit-Learn: For machine learning tasks like
dataset splitting and evaluation.
g) Matplotlib: For data visualization.
h) Pandas: For data manipulation and analysis.
i) Pre-trained CNN Models:
Leveraged for feature extraction from images.
Feature Extraction and Dataset Preparation

Feature Extraction:
a) Utilizes CNN architectures like ResNet50, EfficientNet, etc.
b) Transforms raw pixel data into numerical feature vectors.
c) Final feature-vector dimension is typically 8x8x2048.

Dataset Preparation:
a) Importance of high-quality datasets.
b) Specific datasets: the NIH Chest X-Ray Dataset and the Chest X-ray dataset from Indiana University.
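The feature-extraction step above can be sketched as follows. This is a minimal illustration rather than the project's exact code: InceptionV3 without its classification head maps a 299x299 image to the 8x8x2048 feature map mentioned on the slide, and `weights=None` stands in for the pre-trained ImageNet weights only so the sketch runs without downloading anything.

```python
# Sketch of CNN feature extraction with InceptionV3 (classification head removed).
# In the real pipeline the model would be created with weights="imagenet".
import tensorflow as tf

extractor = tf.keras.applications.InceptionV3(include_top=False, weights=None)

image = tf.random.uniform((1, 299, 299, 3))  # stand-in for a preprocessed X-ray
features = extractor(image)
print(features.shape)  # (1, 8, 8, 2048)
```

The 8x8 grid gives 64 spatial locations, each with a 2048-dimensional descriptor, which is the representation the attention mechanism later weighs.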
Preprocessing

Data Preprocessing:
a) Organizing datasets into TensorFlow's image_dataset_from_directory format.
b) Ensures data is in the required format for model training.

Benefits:
a) Efficient utilization of data during training.
b) Improves the effectiveness of the training process.
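A minimal sketch of this preprocessing step, assuming TensorFlow is available. The directory layout, class names, and synthetic images here are hypothetical stand-ins for the real chest X-ray folders, which `image_dataset_from_directory` expects to be organized one sub-directory per class.

```python
# Build a tf.data dataset from a class-per-folder directory layout.
# The throwaway directory and "normal"/"abnormal" labels are illustrative only.
import os
import tempfile
import numpy as np
import tensorflow as tf

root = tempfile.mkdtemp()
for label in ("normal", "abnormal"):
    os.makedirs(os.path.join(root, label))
    for i in range(2):
        img = (np.random.rand(64, 64, 3) * 255).astype("uint8")
        tf.keras.utils.save_img(os.path.join(root, label, f"{i}.png"), img)

ds = tf.keras.utils.image_dataset_from_directory(
    root,
    image_size=(299, 299),   # resized to InceptionV3's expected input size
    color_mode="rgb",
    label_mode="int",
    batch_size=2,
)
images, labels = next(iter(ds))
print(images.shape, labels.shape)  # (2, 299, 299, 3) (2,)
```

The `image_size`, `label_mode`, `color_mode`, and `batch_size` parameters shown are the ones the slides name for dataset creation.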
03
Experimental
Details
Workflow Overview

1. Dataset preprocessing and visualization
2. Feature extraction
3. Text vectorization and data splitting
4. Network for captioning (CNN Encoder, RNN Decoder, Bahdanau Attention Mechanism)
5. Training
Data Preprocessing and Visualization
● Indiana University and NIH Chest X-Ray datasets
provide valuable resources for medical image
analysis.
● Merged image path and caption files to facilitate
easier data handling.
● Created training and testing datasets using
TensorFlow's image_dataset_from_directory
method.
● Utilized parameters such as image size, label mode,
color mode, and batch size for dataset creation.
● Visualizing the dataset is crucial for understanding its
contents.
● Demonstrated dataset reading and display using the
Pandas library.
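The "merged image path and caption files" step can be sketched with a pandas join. The column names (`uid`, `filename`, `findings`) are assumptions loosely modeled on the Indiana University report CSVs, not the project's actual schema.

```python
# Hypothetical merge of an image-path table with a report/caption table on a
# shared study identifier; column names are illustrative, not the real schema.
import pandas as pd

projections = pd.DataFrame({
    "uid": [1, 2],
    "filename": ["1_IM-0001.png", "2_IM-0002.png"],
})
reports = pd.DataFrame({
    "uid": [1, 2],
    "findings": ["The heart is normal in size.", "No focal consolidation."],
})

merged = projections.merge(reports, on="uid", how="inner")
print(merged[["filename", "findings"]])
```

An inner join keeps only images that have a matching report, which simplifies downstream pairing of images with captions.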
Feature Extraction

NIH Dataset:
The code loads the dataset into a pandas DataFrame and preprocesses it to create training and testing datasets using TensorFlow's image_dataset_from_directory method. Then, it uses the InceptionV3 model with pre-trained weights to train a classification model on the 12 classes with the highest value counts, aiming to encode features of chest X-rays into tensors.

Indiana University Dataset:
The last convolutional layer of the InceptionV3 model is used to extract features from chest X-rays and encode them into a feature vector. Additionally, a TensorFlow TextVectorization layer is set up to numerically encode the caption data for creating an embedding. Finally, datasets are split into training and testing sets for further processing.
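The caption-encoding step can be sketched with Keras's TextVectorization layer. The vocabulary size and sequence length below are illustrative, and the startseq/endseq markers mirror the tokens visible in the sample outputs later in the deck.

```python
# Sketch of numerically encoding captions with a TextVectorization layer.
# max_tokens and output_sequence_length are illustrative choices.
import tensorflow as tf

captions = [
    "startseq no acute cardiopulmonary abnormality endseq",
    "startseq the heart is normal in size endseq",
]
vectorizer = tf.keras.layers.TextVectorization(
    max_tokens=5000, output_sequence_length=10)
vectorizer.adapt(captions)  # builds the vocabulary from the caption corpus

encoded = vectorizer(captions)
print(encoded.shape)  # (2, 10) -- padded/truncated integer word ids
```

Each caption becomes a fixed-length row of integer word ids, which is the form the decoder's embedding layer consumes.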
Network for Captioning
● Preparation of caption data is essential for image
captioning, ensuring that the model can effectively learn
from textual information associated with the images. The
process involves two components: a CNN encoder and an
RNN decoder.
● The CNN encoder extracts features from input images,
encoding them into a suitable format for further
processing.
● The RNN decoder interprets these encoded features and
generates captions sequentially, leveraging sequential
processing capabilities to capture context and nuances.
● The integration of the CNN encoder and RNN decoder
bridges the semantic gap between visual input and
textual output, facilitating the generation of meaningful
captions for chest X-ray images.
CNN Encoder and RNN Decoder
● The model takes in a single raw image and generates a caption y, encoded as a sequence of words drawn from a vocabulary of size K; C denotes the length of the caption.
● The model also uses a Convolutional Neural Network (InceptionV3 in our case) to extract a feature vector, which the authors call annotation vectors. The CNN outputs L vectors, each of D dimensions. In our case, the output of the InceptionV3 feature extractor is a tensor of shape 8x8x2048.
● The RNN Decoder produces captions step by step using recurrent cells; the original formulation uses LSTM (Long Short-Term Memory) cells.
● Context vectors, obtained from the attention mechanism, are used to influence the caption generation.
● In this implementation, GRU (Gated Recurrent Unit) cells are used instead of LSTM cells for sequential processing.
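The encoder/decoder pair described above can be sketched as below, assuming TensorFlow. The class names, embedding size (256), GRU units (512), and vocabulary size (5000) are illustrative choices, and a mean over the 64 locations stands in here for the attention-derived context vector.

```python
# Sketch of the CNN encoder (Dense projection of InceptionV3 features) and a
# GRU-based decoder that emits one word's logits per step. Sizes are illustrative.
import tensorflow as tf

class CNNEncoder(tf.keras.Model):
    def __init__(self, embedding_dim):
        super().__init__()
        self.fc = tf.keras.layers.Dense(embedding_dim, activation="relu")

    def call(self, features):           # (batch, 64, 2048) from 8x8x2048, reshaped
        return self.fc(features)        # (batch, 64, embedding_dim)

class RNNDecoder(tf.keras.Model):
    def __init__(self, embedding_dim, units, vocab_size):
        super().__init__()
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.gru = tf.keras.layers.GRU(units, return_state=True)
        self.fc = tf.keras.layers.Dense(vocab_size)

    def call(self, word_ids, context, state=None):
        x = self.embedding(word_ids)                          # (batch, 1, embed)
        x = tf.concat([tf.expand_dims(context, 1), x], axis=-1)
        out, state = self.gru(x, initial_state=state)
        return self.fc(out), state                            # logits over vocab

encoder = CNNEncoder(embedding_dim=256)
decoder = RNNDecoder(embedding_dim=256, units=512, vocab_size=5000)

feats = encoder(tf.random.uniform((1, 64, 2048)))
context = tf.reduce_mean(feats, axis=1)   # stand-in for the attention context
logits, state = decoder(tf.constant([[3]]), context)
print(logits.shape)  # (1, 5000)
```

At inference time the decoder would be called once per word, feeding back the predicted word id and the GRU state.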
Architecture of Cells Implemented in RNN Decoder

(Diagrams of the LSTM cell and the GRU cell.)
Bahdanau Attention Mechanism

• The Bahdanau Attention mechanism is a key component of the RNN Decoder.
• It computes attention weights that determine the importance of different image locations when generating words in the captions.
• The attention mechanism enhances the model's ability to focus on relevant parts of the image.
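A minimal sketch of additive (Bahdanau-style) attention over the 64 image locations, following the formulation in the cited paper; the layer sizes are illustrative assumptions.

```python
# Sketch of Bahdanau (additive) attention: score each image location against
# the decoder state, softmax into weights, and form a weighted context vector.
import tensorflow as tf

class BahdanauAttention(tf.keras.layers.Layer):
    def __init__(self, units):
        super().__init__()
        self.W1 = tf.keras.layers.Dense(units)  # projects image features
        self.W2 = tf.keras.layers.Dense(units)  # projects decoder hidden state
        self.V = tf.keras.layers.Dense(1)       # scalar score per location

    def call(self, features, hidden):
        # features: (batch, 64, embed); hidden: (batch, units)
        hidden = tf.expand_dims(hidden, 1)
        scores = self.V(tf.nn.tanh(self.W1(features) + self.W2(hidden)))
        weights = tf.nn.softmax(scores, axis=1)           # (batch, 64, 1)
        context = tf.reduce_sum(weights * features, axis=1)
        return context, weights

attn = BahdanauAttention(units=512)
context, weights = attn(tf.random.uniform((1, 64, 256)),
                        tf.random.uniform((1, 512)))
print(context.shape)  # (1, 256)
```

The weights sum to 1 across the 64 locations, so the context vector is a convex combination of location features, which is what lets the decoder "look at" different image regions per word.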
Training

• Model training is a crucial phase where the CNN Encoder and RNN Decoder are trained to work together.
• The Adam optimizer is used, and the loss is calculated using Sparse Categorical Cross-Entropy.
• The training is performed for 40 epochs on a Google Colab Pro Tesla P100 GPU to allow the model to learn and improve.
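The loss setup named on this slide can be sketched as below; the padding-mask detail is a common convention for caption models and an assumption here rather than something the slides state.

```python
# Sketch of the training loss: Adam + SparseCategoricalCrossentropy on logits,
# with padded caption positions (word id 0) masked out of the average.
import tensorflow as tf

optimizer = tf.keras.optimizers.Adam()
loss_obj = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=True, reduction="none")

def masked_loss(real, logits):
    # Per-token loss, zeroed wherever the target word id is 0 (padding).
    per_token = loss_obj(real, logits)
    mask = tf.cast(real != 0, per_token.dtype)
    return tf.reduce_sum(per_token * mask) / tf.reduce_sum(mask)

real = tf.constant([[4, 7, 0]])              # one caption, last slot padded
logits = tf.random.uniform((1, 3, 5000))     # decoder logits per time step
loss = masked_loss(real, logits)
print(float(loss) > 0.0)  # True
```

In a full training step this loss would be computed inside a `tf.GradientTape` and the gradients applied to the encoder and decoder variables with the Adam optimizer.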
04
Code
Link to Code of the program

https://colab.research.google.com/drive/1yT-WhVclXBw80-pN_Igg8wWYgrnGrzMi?usp=sharing
05
Output
Sample test 1
Actual Caption:
• Indications: xxxx with xxxx followup endseq startseq
• Findings: stable cardiomediastinal silhouette no focal airspace consolidation suspicious pulmonary opacity pneumothorax or pleural effusion changes of right mastectomy sequelae of prior granulomatous disease mild thoracic spine degenerative change.
• Impressions: no acute cardiopulmonary abnormality

Generated caption:
• Indications: xxxxyearold female endseq startseq
• Findings: normal heart size no focal consolidation is identified there is minimal xxxx airspace disease in the left ventricle no focal alveolar consolidation no definite pleural effusion or pneumothoraces cardiomediastinal silhouette is normal for size and contour degenerative changes in the inferior xxxx cardiomegaly and small to previouschronic pulmonary arthritis
• Impressions: 1 pulmonary clinical correlation xxxx no xxxx old fractures the previously seen left upper quadrant seen no xxxx soft tissue since comparison examination there is some left base airspace disease the visualized bony structures are intact endseq startseq impressions no
Sample test 2
Actual Caption:
• Indications: start startseq hypertension indications dyspnea endseq startseq
• Findings: stable the heart is top normal in size the mediastinum is stable the aorta is atherosclerotic xxxx opacities are noted in the lung bases compatible with scarring or atelectasis there is no acute infiltrate or pleural effusion
• Impressions: chronic changes without acute disease

Generated caption:
• Indications: shortness of breath
• Findings: impressions ltthe heart size within normal limits no focal consolidation pneumothorax or large pleural effusion visualized bony structures are otherwise unremarkable in appearance of focal airspace disease no pleural effusion or pneumothorax the bony elements from elsewhere are no displaced rib fractures the lungs are clear no pleural effusion
• Impressions: chest three total images to be grossly unremarkable no suspicious pulmonary opacities mild degenerative changes of right apex otherwise unremarkable exam negative for acute pulmonary infiltrate endseq end
06
Bibliography
a) Link to NIH X-ray dataset: https://www.nih.gov/news-events/news-releases/nih-clinical-center-provides-one-largest-publicly-available-chest-x-ray-datasets-scientific-community
b) Link to Indiana University dataset: https://www.kaggle.com/datasets/raddar/chestxrays-indiana-university?select=indiana_reports.csv
c) The Bahdanau attention paper: https://arxiv.org/abs/1409.0473
Thanks!!
