CAPSTONE PROJECT REPORT

Pneumonia detection based on the detection of lung opacity using CNN model

Team Members (Nov 2022 Batch Group A)

Avik Kumar Sarkar
Gajendra Kaladgi
Indirakumar H J
Moumita Dhar
Praveen Kumar Nayak
Sanjeeva K N

Date : 29th Nov 2022


Mentored By : Dr. Aamit Ajit Lakhani
Pneumonia Detection challenge – Group A, 2022

Contents
1. Abstract
2. INTRODUCTION
   2.1 About Pneumonia
   2.2 Pneumonia Diagnosis and detection
   2.3 Chest Radiographs Basics
3. Summary of Problem Statement, Data and Findings
   3.1 Problem Statement
   3.2 Given Data
   3.3 Findings
4. Overview of the final process-data pre-processing steps and the algorithms used
   4.1 Visualizations – Exploratory Data Analysis
      4.1.1 Distribution of lung opacity in patients
      4.1.2 Age-wise value counts for all records
      4.1.3 Bar chart view of the ratio of Sex and the different lung opacities
      4.1.4 Lung opacity scatter plot
      4.1.5 DICOM image Metadata
      4.1.6 Visualization of Sample X-Ray images for each category
      4.1.7 Visualization of sample X-Ray images with bounding boxes depicting lung opacity
5. Deciding Models and Model Building
   5.1 Suitable algorithms for the given problem
   5.2 Different algorithms and their performances
6. Improve model performance
   6.1 Approaches to improve model performance

1. Abstract
This report documents the capstone project, carried out as part of the academic
requirements for completing the PGP programme in AIML from Great Learning, Great
Lakes university.

The objective of the project is to build a CNN model to detect the presence of pneumonia
given a set of DICOM images.

2. INTRODUCTION
2.1 About Pneumonia
Pneumonia is an infection of one or both lungs caused by bacteria, viruses, or fungi.
It is a serious infection in which the air sacs fill with pus and other liquid
(hopkinsmedicine.org). The infection causes inflammation in the air sacs of the
lungs, which are called alveoli.

It is a life-threatening disease, particularly for infants, children and people above 65.

Symptoms include cough with phlegm or pus, fever, chills and difficulty breathing.

Going by the statistics, Pneumonia accounts for over 15% of all deaths of children
under 5 years old globally. In 2015 alone, 920,000 children under the age of 5 died of
pneumonia.

2.2 Pneumonia Diagnosis and detection


Pneumonia can be difficult to diagnose because the symptoms are variable and
very similar to those seen in a cold or influenza. To diagnose pneumonia, doctors
study the medical history, conduct a physical exam and run tests. If the doctor
suspects pneumonia, further tests may follow, including a blood test, a chest
X-ray or even a CT scan.

On a chest X-ray, pneumonia usually manifests as an area of increased opacity.
Comparison of chest X-rays of the patient taken at different time points, and
correlation with clinical symptoms and history, are helpful in making the
diagnosis.

To detect Pneumonia, we need to detect inflammation of the lungs. In
this project, we are challenged to build an algorithm that detects a visual signal for
pneumonia in medical images. Specifically, our algorithm needs to automatically
locate lung opacities on chest radiographs.
Fig 2.2.1 depicts an image with normal lungs. It is observed that there is a mass
of tissue surrounding the lungs and between the lungs. These areas contain skin,
muscles, fat, bones, and the heart and big blood vessels. That translates into a
lot of information on the chest radiograph that is not useful for detecting the
lung opacity.

Fig 2.2.1

2.3 Chest Radiographs Basics

In the process of taking the image, X-rays pass through the body and reach
a detector on the other side. Tissues with sparse material, such as the lungs, which
are full of air, absorb very little of the X-rays and appear black in the image. Dense
tissues such as bones absorb the X-rays and appear white in the image.

In short –
o Black = Air
o White = Bone
o Grey = Tissue or fluid

By convention, the left side of the subject is on the right side of the screen. It can
also be observed that there is a small L in the top right corner. In a normal
image we see the lungs as black, but they have different projections on them,
mainly the rib cage bones, main airways, blood vessels and the heart.

3. Summary of Problem Statement, Data and Findings

3.1 Problem Statement

The aim is to automate Pneumonia screening in chest radiographs, provide details
of the affected area through bounding boxes, and assist physicians in making better
clinical decisions, or even replace human judgement in certain functional areas of
healthcare (e.g., radiology).
With the help of AI techniques and relevant clinical support, we can unlock
clinically relevant information hidden in the massive amount of data, which in
turn can assist clinical decision making.

The objectives of the activity are:

• To build a model to identify whether CXR images have lung opacity or
not.
• To build an algorithm to automatically locate lung opacities on chest
radiographs, providing affected area details through bounding boxes.

3.2 Given Data

The data is provided as original DICOM (Digital Imaging and Communications in
Medicine) images. Medical images are stored in a special format called DICOM
files (*.dcm). They contain a combination of header metadata and the
underlying raw image arrays for pixel data.

While we are theoretically detecting “lung opacities”, there are lung opacities
that are not pneumonia related. In the data, some of these are labelled
“No Lung Opacity / Not Normal”. This extra third class indicates that while
pneumonia was determined not to be present, there was nonetheless some type of
abnormality on the image, and oftentimes this finding may mimic the appearance
of true pneumonia. The Lung Opacity class refers to Pneumonia cases.

There are 26684 2D single-channel chest X-ray training images in the
pneumonia dataset, provided in DICOM format.
The data is downloaded from https://www.kaggle.com/c/rsna-pneumonia-detection-challenge/data

The dataset contains:

• stage_2_detailed_class_info.csv – contains the attributes patientId and class
information
• stage_2_train_labels.csv – contains the attributes patientId, x, y, width,
height and Target
• stage_2_train_images – 26684 images in .dcm format, to be used for
training the model
• stage_2_test_images – 3000 images in .dcm format

File Name                         Total No of Rows   No of Attributes   Attributes
stage_2_detailed_class_info.csv   30227              2                  patientId, class
stage_2_train_labels.csv          30227              6                  patientId, x, y, width, height, Target
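
Each labelled row stores a bounding box as x, y, width and height. The following is a minimal sketch of how such a row could be converted to corner coordinates and rescaled for a resized image. It assumes that x and y are the upper-left corner in pixels of the original 1024 x 1024 image; the function names, the sample box and the target size of 128 are illustrative and not taken from the project code.

```python
def to_corners(x, y, width, height):
    """Convert (x, y, width, height) to (x_min, y_min, x_max, y_max)."""
    return (x, y, x + width, y + height)


def rescale_box(box, original_size=1024, target_size=128):
    """Scale corner coordinates from the original image size to the resized one."""
    scale = target_size / original_size
    return tuple(coord * scale for coord in box)


# Hypothetical box, not taken from the dataset.
corners = to_corners(264.0, 152.0, 213.0, 379.0)
print(corners)               # (264.0, 152.0, 477.0, 531.0)
print(rescale_box(corners))  # (33.0, 19.0, 59.625, 66.375)
```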

Details of the given images

• Training Images: 26,684 images
• Supported by two csv files
• Testing Images: 3,000 images
• Observation: All the images are named with the patient id

The testing images do not contain any class or bounding box information. So,
the training images should be split into training data and validation data in the ratio
of 70% to 30%.
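
A 70/30 split of this kind can be sketched as follows. The report does not state how the split was performed; this sketch assumes it is done on unique patient ids, so that all rows belonging to one patient stay on the same side. The function name, the seed and the sample ids are hypothetical.

```python
import random


def split_patients(patient_ids, validation_fraction=0.3, seed=42):
    """Shuffle unique patient ids and split them into train and validation lists."""
    unique_ids = sorted(set(patient_ids))  # one entry per patient, stable order
    rng = random.Random(seed)              # local generator for reproducibility
    rng.shuffle(unique_ids)
    n_validation = round(len(unique_ids) * validation_fraction)
    return unique_ids[n_validation:], unique_ids[:n_validation]


# Hypothetical ids standing in for the patientId column (one id repeated).
ids = [f"patient-{i:03d}" for i in range(10)] + ["patient-000"]
train_ids, validation_ids = split_patients(ids)
print(len(train_ids), len(validation_ids))  # 7 3
```

Rows of the labels file can then be assigned to a split by looking up their patientId in one of the two lists.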

Target value ‘1’ indicates patients with pneumonia, and
Target value ‘0’ indicates normal patients or patients having an abnormality
that is not pneumonia.

3.3 Findings

A difference is observed between the number of training images and the number of
rows given in the .csv files:

No of training images                        : 26684
No of rows given in csv file (per patientId) : 30227
Difference observed                          : 3543

There are multiple rows for the same patient with different bounding boxes, which
explains the difference.

* Number of patients with 4 bounding boxes : 13
* Number of patients with 3 bounding boxes : 119
* Number of patients with 2 bounding boxes : 3266
* Number of patients with a single row (one bounding box, or none for Target 0) : 23286
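
The counts above can be checked against the totals with a few lines of arithmetic. This is only a consistency check on the numbers reported in this section; the variable names are illustrative.

```python
# Patients grouped by how many rows they have in the labels file (counts listed above).
patients_by_row_count = {4: 13, 3: 119, 2: 3266, 1: 23286}

total_patients = sum(patients_by_row_count.values())
total_rows = sum(rows * patients for rows, patients in patients_by_row_count.items())

print(total_patients)               # 26684
print(total_rows)                   # 30227
print(total_rows - total_patients)  # 3543
```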

4. Overview of the final process-data pre-processing steps and the algorithms used

4.1 Visualizations – Exploratory Data Analysis


From our EDA, we learned that there are 26684 unique patients. Overall, the
distribution of data is imbalanced, with the positive Target class (pneumonia) being
only 31.6% of the records. This tends to bias the model towards the majority class.
We have addressed such data issues using augmentation techniques or sampling
methods so that the data classes are equally represented.
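
Random oversampling is one common sampling method for imbalanced classes. The sketch below illustrates the idea on toy records; it is not necessarily the procedure used in this project, and the function name, seed and toy data are assumptions made for the example.

```python
import random
from collections import Counter


def oversample(records, label_of, seed=0):
    """Randomly duplicate minority-class records until all classes are equal in size."""
    rng = random.Random(seed)
    by_label = {}
    for record in records:
        by_label.setdefault(label_of(record), []).append(record)
    target = max(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(group)
        # Draw extra samples with replacement to reach the majority-class size.
        balanced.extend(rng.choices(group, k=target - len(group)))
    rng.shuffle(balanced)
    return balanced


# Toy (id, Target) records with roughly the 30/70 imbalance described above.
toy = [(f"p{i}", 1) for i in range(3)] + [(f"n{i}", 0) for i in range(7)]
balanced = oversample(toy, label_of=lambda record: record[1])
counts = Counter(label for _, label in balanced)
print(counts[0], counts[1])  # 7 7
```

Oversampling is normally applied to the training split only, so that duplicated records do not appear in the validation data.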
Here are some of our findings from the DICOM image dataset.
A few patients have multiple bounding boxes:
a. 3266 patients have 2 bounding boxes defined
b. 119 patients have 3 bounding boxes defined
c. 13 patients have 4 bounding boxes defined
We will run a binary classification to predict patients with pneumonia.

4.1.1 Distribution of lung opacity in patients


9,555 (31.6%) records are with Lung Opacity
8,851 (29.3%) records have a healthy/Normal lung image
11,821 (39.1%) records have No Lung Opacity but may have another lung abnormality

• Total records with Pneumonia: 9555
• Total records with No Pneumonia: 20672

• Total records with Lung Opacity: 9555
• Total records with No Lung Opacity / Not Normal: 11821
• Total Normal records: 8851

4.1.2 Age-wise value counts for all records


From the graph below, we can see that the largest number of patients are 58
years old.

The patient distribution shows that most patients, including most pneumonia patients,
are between ages 40 and 60. More records for males are observed in the data.

There are some outliers with Age above 140.

Minimum Age: 1
Maximum Age: 155
These outliers may be data entry mistakes, e.g. 155 entered instead of 55.
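
A simple way to surface such entries is to flag ages outside a plausible range. The sketch below is illustrative only: the cut-off of 120 years is an assumption, and the report does not say whether the outliers were corrected or removed.

```python
def flag_implausible_ages(ages, max_plausible_age=120):
    """Return (index, age) pairs for ages outside the range 0..max_plausible_age."""
    return [(i, age) for i, age in enumerate(ages) if not 0 <= age <= max_plausible_age]


# Hypothetical ages; 155 mirrors the maximum observed in the metadata.
sample_ages = [1, 58, 155, 47, 62]
print(flag_implausible_ages(sample_ages))  # [(2, 155)]
```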

Patients aged 50-60 have the largest number of Pneumonia cases.

Patients aged around 50 have the largest number of Normal cases.

Patients aged 50-60 have the most cases with Lung Opacity.

4.1.3 Bar chart view of the ratio of Sex and the different lung opacities

The number of male patients with Pneumonia is greater than the number of female patients.

Distribution of the three classes based on patient sex.

We can clearly see that ‘MALE’ has a higher number of observations than
‘FEMALE’ in all the classes.

The “No Lung Opacity / Not Normal” and “Normal” classes fall under Target 0, which represents
no Pneumonia, and the “Lung Opacity” class represents Pneumonia patients.
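
The relationship between the three classes and the binary Target can be written as a small lookup. This is a sketch for illustration: the dictionary and function names are hypothetical, and the record counts are the ones reported in section 4.1.1.

```python
CLASS_TO_TARGET = {
    "Lung Opacity": 1,                  # pneumonia
    "No Lung Opacity / Not Normal": 0,  # abnormal, but not pneumonia
    "Normal": 0,                        # healthy
}


def target_for(class_name):
    """Map a class label from the detailed class info file to the binary Target."""
    return CLASS_TO_TARGET[class_name]


# Record counts per class, as reported in section 4.1.1.
record_counts = {"Lung Opacity": 9555, "No Lung Opacity / Not Normal": 11821, "Normal": 8851}
per_target = {0: 0, 1: 0}
for class_name, count in record_counts.items():
    per_target[target_for(class_name)] += count
print(per_target)  # {0: 20672, 1: 9555}
```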

4.1.4 Lung opacity scatter plot

From the scatter plots below, it can be observed that lung opacity is most frequent
in the 50-65 age group, followed by the 20-50 age group.

We also observe that patients in the lower age group (less than 20) have fewer Pneumonia
cases than older patients. Above 65, there are fewer Pneumonia patients because the
overall number of patients decreases.

4.1.5 DICOM image Metadata

DICOM (Digital Imaging and Communications in Medicine) is a standard for storing,
processing, and transmitting medical images and related information. DICOM images
contain the metadata listed below. We parsed the DICOM images using the dcmread
method of the pydicom library.

• Pixel Data (7FE0,0010) (last entry): this is where the raw pixel data is stored. The
order of pixels encoded for each image plane is left to right, top to bottom, i.e., the
upper left pixel (labeled 1,1) is encoded first.
• Photometric Interpretation (0028,0004): also known as the color space. In this case it is
MONOCHROME2, where pixel data is represented as a single monochrome image
plane and the minimum sample value is intended to be displayed as black.
• Bits Stored (0028,0101): number of bits stored for each pixel sample. Here it is 8.
• Pixel Representation (0028,0103): can be either unsigned (0) or signed (1). Here it is
0, which means unsigned.
• Lossy Image Compression (0028,2110): 00 means the image has not been subjected to lossy
compression; 01 means it has. Here it is 01.
• The Rows and Columns attributes give the image size. Here Rows is 1024 and
Columns is 1024, so the image size is 1024 x 1024.
• Lossy Image Compression Method (0028,2114): states the type of lossy
compression used (in this case JPEG Lossy Compression, as denoted
by CS:ISO_10918_1)
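
The header attributes listed above can be collected with a few lines of Python. The sketch below is illustrative rather than the project's own code: the helper names and the selection of fields are assumptions, while `pydicom.dcmread` with `stop_before_pixels=True` is the library call for reading a header without loading the pixel data. pydicom is imported lazily so that the summary helper can be tried on a stand-in object without the library installed.

```python
from types import SimpleNamespace

HEADER_FIELDS = (
    "PatientAge", "PatientSex", "Rows", "Columns", "PhotometricInterpretation",
    "BitsStored", "PixelRepresentation", "LossyImageCompression",
)


def summarize_header(dataset):
    """Collect selected header attributes; missing ones are reported as None."""
    return {name: getattr(dataset, name, None) for name in HEADER_FIELDS}


def read_header(path):
    """Read a .dcm file's header with pydicom, skipping the pixel data."""
    import pydicom  # third-party; imported lazily so summarize_header works without it
    return summarize_header(pydicom.dcmread(path, stop_before_pixels=True))


# Stand-in object using values reported above; a real call would be
# read_header("<path to a .dcm file>").
example = SimpleNamespace(Rows=1024, Columns=1024, BitsStored=8,
                          PhotometricInterpretation="MONOCHROME2")
summary = summarize_header(example)
print(summary["Rows"], summary["BitsStored"])  # 1024 8
```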

Image pixel data is distributed in the range 0-255.

The pixel data shows that the images are already stored as 8-bit greyscale
values (this is consistent with the “Bits Stored” value of 8).
Signed 16-bit DICOM images have values ranging from -32768 to 32767, while 8-bit
greyscale images store values from 0 to 255. Wider value ranges are useful in CT
images, where they correlate with the Hounsfield scale, a quantitative scale for
describing radio-density (a way of viewing different tissue densities).
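
The value ranges for a given bit depth follow directly from the number of bits, and a wider range can be mapped to 8 bits by min-max rescaling. The sketch below is a generic illustration with hypothetical function names and sample values; the images in this dataset are already 8-bit, so no such rescaling was needed here.

```python
def value_range(bits, signed):
    """Return the (minimum, maximum) integer representable with the given bit depth."""
    if signed:
        return (-(2 ** (bits - 1)), 2 ** (bits - 1) - 1)
    return (0, 2 ** bits - 1)


def rescale_to_8bit(values):
    """Min-max rescale a flat list of pixel values to integers in 0..255."""
    low, high = min(values), max(values)
    if high == low:
        return [0 for _ in values]
    return [round((v - low) * 255 / (high - low)) for v in values]


print(value_range(8, signed=False))       # (0, 255)
print(value_range(16, signed=True))       # (-32768, 32767)
print(rescale_to_8bit([-1000, 0, 3000]))  # [0, 64, 255]
```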

4.1.6 Visualization of Sample X-Ray images for each category



4.1.7 Visualization of sample X-Ray images with bounding boxes depicting lung
opacity

5. Deciding Models and Model Building


5.1 Suitable algorithms for the given problem
In the given problem, the data sample contains images as input, together with the
information on which patients are affected by pneumonia. We need to build a
model that learns from the given data and, given a new sample image,
accurately classifies whether the image is of a pneumonia-affected
person or not.
This is a deep learning problem: since the input consists of image features and
metadata and the output is a classification of the image,
Convolutional Neural Network (CNN) models are the right models to adopt.

Building the model involves defining the input layer and the feature extraction
layers, choosing activation functions, and applying appropriate weights and classifiers.
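
To give a feel for how feature extraction layers shrink an image, the sketch below computes the spatial size after each convolution and pooling step. The layer stack is hypothetical (three blocks of 3x3 convolution with padding 1 followed by 2x2 max pooling) and is not the architecture used in this project; only the 128 x 128 input size is taken from this report.

```python
def conv_output_size(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution or pooling layer along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


# Hypothetical stack: three blocks of 3x3 convolution (padding 1) + 2x2 max pooling.
size = 128  # input image side
for block in range(1, 4):
    size = conv_output_size(size, kernel=3, padding=1)  # convolution keeps the size
    size = conv_output_size(size, kernel=2, stride=2)   # pooling halves it
    print(f"after block {block}: {size} x {size}")
```

Under these assumptions the feature map shrinks from 128 to 64, 32 and finally 16 pixels per side before it would be flattened and passed to the classifier layers.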

Fig 5.1.1 shows the architecture of the basic model that we have used

Fig 5.1.1 : Architecture of the CNN model

Fig 5.1.2 Train loss and Train accuracy graph



Fig 5.1.3 Confusion Matrix and classification report
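
For readers unfamiliar with how a confusion matrix relates to the figures in a classification report, the sketch below derives precision, recall and F1 for the positive class from binary labels. The labels are toy values for illustration only and are not the project's results; the function names are hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 = pneumonia)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 for the positive class; 0.0 when undefined."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy labels for illustration only; these are not the project's results.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
tp, tn, fp, fn = confusion_counts(y_true, y_pred)
print(tp, tn, fp, fn)                   # 3 3 1 1
print(precision_recall_f1(tp, fp, fn))  # (0.75, 0.75, 0.75)
```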


5.2 Different algorithms and their performances

6. Improve model performance

6.1 Approaches to improve model performance


The following approaches may improve the model performance; they are still to be
tried and tested in the second half of the project:

• Try changing the number of layers in the model
• Adopt image augmentation techniques
o We trained the model with image size 128; the model could also be tried with a
larger image size, e.g. 256
• Adopt transfer learning techniques
• Use feature extractors like AlexNet, VGG-16, VGG-19, ResNet, etc.
• Tune the hyperparameters of the model
• Reduce combined features with a feature selection method (mRMR)
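
As an illustration of the image augmentation idea listed above, the sketch below mirrors an image horizontally and moves its bounding boxes accordingly. It is a minimal example using plain Python lists; the function name and the sample data are hypothetical, and whether mirroring is appropriate for chest radiographs is a separate modelling decision.

```python
def flip_horizontal(image, boxes):
    """Mirror an image (list of pixel rows) and its (x, y, width, height) boxes."""
    image_width = len(image[0])
    flipped_image = [list(reversed(row)) for row in image]
    # The top-left x of a mirrored box is measured from the opposite edge.
    flipped_boxes = [(image_width - x - w, y, w, h) for x, y, w, h in boxes]
    return flipped_image, flipped_boxes


# Tiny 2 x 4 "image" and one hypothetical box covering the leftmost column.
image = [[0, 1, 2, 3],
         [4, 5, 6, 7]]
flipped_image, flipped_boxes = flip_horizontal(image, [(0, 0, 1, 2)])
print(flipped_image)  # [[3, 2, 1, 0], [7, 6, 5, 4]]
print(flipped_boxes)  # [(3, 0, 1, 2)]
```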