
Multimedia Tools and Applications (2022) 81:37541–37567

https://doi.org/10.1007/s11042-022-13545-0

1218: ENGINEERING TOOLS AND APPLICATIONS IN MEDICAL IMAGING

Design, analysis and implementation of efficient deep learning frameworks for brain tumor classification

Aman Verma 1 · Vibhav Prakash Singh 2

Received: 10 March 2021 / Revised: 27 January 2022 / Accepted: 6 July 2022 / Published online: 30 July 2022
© The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2022

Abstract
Computer-aided diagnosis (CAD) systems may be utilized as assistants for doctors and radiologists in the detection of disease. CAD systems using deep learning approaches are promising for diagnosing brain tumors, but owing to their computationally intensive nature they are difficult to deploy in real-time scenarios where speed, as well as accuracy, is required. Further, it is necessary for deep-learning models to capture multi-scale information, as the task of brain-tumor classification requires modelling pixel-to-pixel relationships and spatial contexts in tumor-affected regions. To this end, this paper introduces the representational feature-learning powers of deep, efficient and lighter deep learning architectures based on novel weight initialization and layer freezing for brain tumor classification. We use five different weight initialization and freezing configurations, four from the domain of transfer learning and the remaining one being random initialization. These configurations are applied over different parameter- and memory-efficient architectures. Results suggest that when an architecture is initialized adequately, with the correct weight initialization configuration chosen according to the number of trainable parameters and the architectural depth, the performance obtained is optimal. Experimentation over eight different CNN architectures and five different weight initialization configurations was conducted, and therefore the training and evaluation of 40 deep learning frameworks was carried out. From the comprehensive experimental analyses of classification performance over three classes of brain tumor, it is evident that a DenseNet201-based transfer learning model with the initial 5 convolutional layers frozen attains a state-of-the-art accuracy of 98.22%, while the lightweight MobileNet models outperform many other models, attaining the highest accuracy of 97.87% for the transfer learning configuration with the initial 3 convolutional layers frozen while occupying only 42.6 MB. The DenseNet201 model utilizes densely-flowing skip connections, which in turn allow the model to utilize features learned at different spatial contexts to formulate an understanding of features in accordance with the current receptive field. With the 5-convolutional-layers-frozen transfer-learning scheme, the same architecture achieves a performance gain of 0.88% over the state-of-the-art methods. Further, the efficacy of the random initialization paradigm for brain tumor classification is investigated; results suggest that the random initialization framework can be promising if the number of trainable parameters is kept in accordance with training data quantities.

* Vibhav Prakash Singh
vibhav@mnnit.ac.in
Aman Verma
aman.verma.nitrr@gmail.com

1 Department of Electronics and Communication Engineering, National Institute of Technology Raipur, Raipur, India
2 Department of Computer Science and Engineering, Motilal Nehru National Institute of Technology Allahabad, Prayagraj 211004, India

Keywords CAD system · Brain tumour · Deep learning · Transfer learning · Weight freezing

1 Introduction

Brain tumors are abnormal developments of mass lesions in the brain. They are broadly found in 3 major categories: Meningioma, which is generally benign; Glioma, which is malignant about 80% of the time; and Pituitary tumor. Statistics from the Central Brain Tumor Registry of the United States (CBTRUS) suggest that nearly 0.7 million people in the United States are affected with brain tumor, and that in the year 2020 more than 87,240 new cases would emerge [36]. Further, the statistics from the same report revealed that of the total reported cases, 61,430 would be benign and the remaining 25,800 malignant, with Glioblastoma constituting 48.3% of all malignant tumors; a patient affected by it attains only a 6.8% five-year relative survival rate. WHO classifies the intensity of brain tumors into 4 grades, namely Grade I, II, III and IV, with Grade IV being the deadliest. Grade-I tumors can be cured via surgery and have a very slow growth rate, accredited to their least-malignant or benign nature. Grade-II tumors are identified via their anomalous look under a microscope; cells affected with Grade II often invade neighbouring cells. Grade-III tumors are of a malignant nature; they are recurring and return as a higher-grade tumor. The most malignant type of brain tumor is Grade IV: such tumors carry high chances of mortality and develop rapidly; Glioblastoma Multiforme is a Grade-IV tumor [3]. Figure 1 illustrates the 3 above-mentioned types of brain tumors.
Diagnosis of brain tumors in the early stages can increase the chances of a patient's survival because of the multiple available treatment options. The procedural flow of brain tumor diagnosis follows three steps: first, the patient's neurological examination on symptomatic grounds; then, in a non-invasive manner, the patient's brain is analysed using medical images collected via MRI scans, CT scans and CAT scans; and finally, for confirmation, a biopsy, i.e., tissue-sample analysis of the affected region, is performed via an invasive methodology. Automated methods that are faster and can be implemented easily, without the need for experienced medical expertise, are required to make medical diagnosis systems scalable; hence there is a resonant call for methods that may bypass the invasive methodologies, which are expensive, slow and do not reveal the ailment in its early stages. With the upsurge in computing technologies, the world has witnessed the emergence of Computer-Aided Diagnosis (CADx) systems [11], which are useful in the detection of suspected lesions from images; for example, MRI scans for diagnosing brain tumors, or CT scans for diagnosing disorders of the lungs. Figure 2 reflects the working of a general CAD system. Such a system gives a preliminary diagnosis of the disease through the analysis of medical images, and then, based on the view that it provides, doctors can infer the status of the disease being diagnosed.

Fig. 1 Brain tumor MRI and tumor types. a Meningioma b Glioma c Pituitary tumor

CADx finds one of its contributions in brain tumor diagnosis. Via MRI scans, multi-sequence slices of the brain are imaged, and these images are then analysed by medical experts. Although such scans avail a faster and accurate way of diagnosis, which makes them quite scalable, the issue regarding the need for trained medical expertise remains intact. Further, the complex nature of medical images makes human inference challenging and time-consuming. Machine-learning based solutions have proven prominent for CAD systems in providing solutions to the aforementioned challenges [13, 45]. These approaches capitalize on data to train computer experts capable of delivering high-end performance.

Fig. 2 CAD System (Images in Detection block adapted from [46])



In the beginning, there were several machine-learning based approaches for brain tumor classification. Chen et al. [6] extracted BoW features to attain 91.02% performance on the Figshare dataset [5]. In [23], the authors computed statistical measures over the MRIs and their corresponding DWT-transformed images, then utilized those measures as a feature vector for a multi-layer perceptron network; the average testing performance achieved in this case was 91.9%. Similarly, in [16], the authors utilized the PCA-NGIST feature descriptor with an ELM classifier to achieve 94.23% testing performance; the same work suggested fair evaluation using 5-fold cross validation. Although these methods achieved good performance, they suffer from some shortfalls: first, they depend upon manual hand-crafting of features, which are intricate to design; second, the extracted features do not correlate locally extracted features at global scales and vice versa. For these reasons, there remained scope for improvement.
Deep learning [27], a relatively newer paradigm which utilizes very deep neural networks with the input applied directly, without manual feature extraction, has found success and multitudes of applications in medical image analysis [12, 28, 41]. It has also achieved quite significant results in brain tumor classification. In [1], the authors proposed a CNN-based deep learning model for segmented as well as non-segmented MRI images. This CNN-based methodology offered two-fold benefits: (i) it does not require tedious feature hand-crafting; (ii) each of the brain tumor types has a specific morphology within an MRI slice, the key distinguishing pattern being the overall structure and pixel-to-pixel relations in tumor neighbourhoods, and a CNN extracts features at both local and global scales, i.e., each layer extracts features locally within a receptive field, and with a stack of layers the overall receptive field enlarges, leading to robust classification. The CNN proposed in [1] was not very deep, and hence a very significant performance gain was not seen, but other novel attempts have been made using the same CNN-based concept. In [42], the authors proposed BrainMRNet, a novel CNN involving attention mechanisms [44] at the spatial and channel levels to make the model focus on more important features; this was further supplemented by the hypercolumn technique, and the model achieved an overall testing accuracy of 97.69%. Another novel attempt [2] involved the use of capsule networks to decouple image variance, but the gain remained limited. In [14], the dataset was enlarged using Generative Adversarial Networks (GAN) [22] and the enlarged dataset was used for more robust training; a shortfall of the approach was that the synthetically generated images lacked semantic correspondence with the original images.
To enhance the performance of brain tumor classification, inductive knowledge transfer using transfer learning was performed in [9]. Transfer learning has benefitted diverse fields of image analysis, such as medical image analysis in [32] and image malware detection in [43]. It lifts tuned parameters from extensive training on similar tasks so as to give the model a boosted initialization; this enhances low-level feature understanding and hence supplements overall recognition. Owing to knowledge transfer on the GoogleNet [38] architecture and pixel normalization, the approach proposed in [9] obtained state-of-the-art performance. Inspired by the same, there have been multiple other works [24, 29, 37] that employ transfer learning on different architectures to further enhance performance, but these approaches do not answer how to perform optimal knowledge transfer. The full potential of transfer learning methods remains unexplored for brain tumor classification. Moreover, the optimal architecture with the optimal transfer learning framework is yet to be decided. There has also been a gap in the use of efficient and lightweight CNN architectures for the task. Hence, through this research, we attempt to fill these gaps.

In this paper, we exploit the representational powers of very deep, efficient and lightweight CNNs for brain tumor classification and find the optimal transfer learning scheme. We have also explored brain tumor classification in the random initialization paradigm and compared it with the transfer learning frameworks. Classification of T1-weighted brain MRI slice images has been done into 3 classes, namely Meningioma, Glioma and Pituitary tumor. To the best of our knowledge, this work presents the most comprehensive and exhaustive analysis of brain tumor classification. Some of the best-performing deep learning models can be deployed as real-time assistants for doctors and radiologists to make their diagnoses faster and more accurate. The major contributions of this paper are as follows:

- An efficient deep learning-based framework for brain tumor classification, which uses five different weight initialization and freezing configurations, four from the domain of transfer learning and the remaining one being random initialization, applied over different parameter- and memory-efficient architectures.
- Training and validation of deep learning models for brain tumor classification on 8 different architectures, ranging from very deep, like InceptionResNetV2 [39], to efficient and effective, like DenseNet121 [21], to lightweight, like MobileNet [20]. The 8 deep CNN architectures on which experimentation has been done are as follows:
  - VGG19 [35]
  - InceptionV3 [40]
  - DenseNet121
  - DenseNet201
  - ResNet152V2 [18]
  - Xception [8]
  - InceptionResNetV2
  - MobileNet
- Exploring the potential of training the models from scratch, i.e., random initialization. Hence, deep learning models were designed for 8 different architectures and 5 different weight initialization configurations, and thus in total 40 models were trained for classifying brain tumors from MRI images. All the models were trained and validated on the Figshare dataset.
- An exhaustive comparison between the 40 trained models to determine the best-performing architectures and configurations, i.e., which among all the transfer learning and random initialization frameworks gives the best performance.
- In-depth validation of the optimal framework using challenging training strategies.
- Utilization of a lightweight architecture, MobileNet, for brain tumor classification so as to explore the performance of lightweight deep CNN models.

The remainder of the paper is organized as follows. This section gave a brief introduction and review of the literature, followed by our contributions. Section 2, Methods and Models, explains this work's methodology; it is broadly divided into two parts, with the first providing a brief review of the CNN architectures used in the study and the second discussing the proposed system. Section 3, Results and Discussions, presents an exhaustive analysis of the performance of the 40 trained models on various measures. We conclude the paper in Section 4, Conclusions and Future Scope, and provide directions for future research there.

2 Methods and models

2.1 CNN architectures

As postulated above, for brain tumor classification it is essential to model both global and local patterns in an MRI slice. To this end, we have used 8 architectures with large receptive fields and multi-scale operation. They are briefly explained hereunder:

1) VGG19: A 19-layered architecture that uses a high number of 3×3 convolutional filters to achieve a bigger receptive field; the larger the receptive field, the richer the extraction at a global scale. This comes at a cost in computational complexity, which is tackled by stacking smaller kernels to formulate a bigger filter kernel.
2) InceptionV3: Multi-scale filtering through the use of 1×1, 3×3 and 5×5 kernels at each convolutional layer is done to make the model learn more discriminative features while remaining parameter-efficient. Its overall top-1 accuracy on the ImageNet dataset was 77.3%. Parameter scalability was achieved by factorizing convolution operations into smaller sub-parts; for example, a 3×3 convolution was broken into 3×1 and 1×3 operations.
3) DenseNet: This architecture utilizes dense connections to achieve a better flow of gradients and feature maps. This leads to the creation of enriched representations while at the same time tackling the vanishing-gradients problem [19]. All the outputs from previous layers are concatenated along the channel dimension with those of the later layer. This flow of feature maps allows the model to extract features at the current scale in accordance with the features understood in previous layers, giving the architecture the advantage of multi-scale learning essential for brain tumor classification. Furthermore, these benefits come through a parameter- and memory-efficient design using only a limited number of filter kernels. This work uses the DenseNet121 and DenseNet201 variants from the DenseNet family, which are 121- and 201-layered architectures respectively.
4) ResNet152V2: ResNetV2 uses residual connections to counter the vanishing-gradients problem while using ReLU [30] as a pre-activation on the input. This architecture prevents information loss and further stabilizes the gradients, contributing to faster optimization. By making identity mappings much easier to learn, this class of models achieves the above-mentioned goals. The 152-layered variant of ResNetV2 was selected for this research.
5) Xception: This architecture uses depthwise separable convolutions instead of regular convolutions, employing separate spatial filtering followed by cross-channel filtering to achieve better representations at a faster speed. Depthwise convolutions let Xception be faster and more parameter-efficient than InceptionV3 while achieving a higher top-1 accuracy.
6) InceptionResNetV2: This architecture adopts both the identity-mapping learning and the parameter-efficiency abilities of the ResNet and Inception families. It adds ReLU pre-activation before every residual block and uses multi-scale filtering inspired by Inception-style architectures for even richer representations. This is the deepest as well as the most memory-exhaustive network utilized in this work.
7) MobileNet: MobileNet is a very lightweight architecture which uses only about 4 million parameters by virtue of depthwise separable convolutions, consisting of depthwise convolutions followed by pointwise convolutions, together with resolution and width multipliers that further allow the number of parameters to be efficiently incremented or decremented depending upon the application. The total number of parameters used in the model is drastically reduced in comparison with regular CNN blocks. A sketch of a depthwise separable block is given below.

2.2 Proposed system

In this section, we present the methodology of this work, wherein the prime objective was to develop and validate an efficient framework for brain tumor classification using deep learning-based approaches. Figure 3 presents the overall workflow of the proposed system, which has 3 phases. The first phase involves the selection of a deep learning model: we selected an architecture from 8 different state-of-the-art convolutional neural networks and chose among 5 possible weight initialization configurations, of which 4 were from transfer learning (pre-trained weight transfer) and 1 was random initialization. The second phase is the training phase, where the selected models were trained; in the last phase, the trained models were evaluated on the basis of classification performance on the test dataset. An extensive qualitative-quantitative analysis of all the trained models was performed so as to find the most optimal model for the task. A major objective of this work was to obtain efficient models that can be easily deployed in real-time scenarios. The analysis involved inter-architecture and inter-configuration comparisons on several performance metrics.

2.2.1 Model selection

In this part of the system, we select the architecture and then initialize its weights before forwarding the model for training. All the models used the same architectural design, i.e., a CNN backbone followed by a classification head, and the classification head was kept the same for all the models. Our focus was on selecting not only very deep CNNs but also parameter- and memory-efficient and even lighter models, because such models can be easily deployed in real-time scenarios accredited to their small size; we have also experimented with heavy CNNs. The CNNs play the role of feature extractors. The output of the CNN architecture is passed through a Global Average Pooling layer and the pooled output is flattened. The next layer in the classification head is a Batch Normalization [22] layer, which is succeeded by a dense layer. This dense layer is followed by another Batch Normalization layer, again succeeded by a dense layer, then another Batch Normalization, and finally a SoftMax classification layer with 3 outputs. An L2-regularization with weight 0.01 was applied to every dense layer. Figure 4 illustrates the generalized architecture used in this work; a minimal code sketch of the classification head follows the figure. All 8 of the discussed CNN architectures are used as choices for the system. We chose VGG19, a de-facto standard in medical imaging pipelines. The MobileNet, DenseNet and Xception architectures were employed as the emphasis was laid on efficiency: the DenseNet models are parameter-efficient, the MobileNet model is very lightweight, and Xception networks train quicker. The other three are the most efficient architectures of their respective families: ResNets are very promising in eradicating vanishing-gradient problems, Inception networks exploit rich features extracted from multi-scale filtering, and InceptionResNetV2 uses both of these benefits to learn discriminative decision boundaries.

Fig. 3 Proposed brain tumor classification system

Fig. 4 Generalized CNN architecture (Input image adapted from Figshare dataset)
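A minimal sketch of this generalized architecture is given below, assuming TensorFlow/Keras. The ordering (GAP, Batch Normalization, dense layers, final 3-way SoftMax) and the L2 weight of 0.01 follow the text, while the dense-layer widths of 256 and 128 are illustrative assumptions, as the paper does not state them.

```python
# Minimal sketch of the generalized model in Fig. 4, assuming TensorFlow/Keras.
# Dense widths (256, 128) are assumptions; the GAP -> BN -> Dense -> BN ->
# Dense -> BN -> softmax(3) ordering and the L2 weight of 0.01 follow the text.
import tensorflow as tf
from tensorflow.keras import layers, regularizers

def build_classifier(backbone):
    """backbone: any Keras CNN built with include_top=False, e.g. DenseNet201."""
    x = layers.GlobalAveragePooling2D()(backbone.output)
    x = layers.BatchNormalization()(x)
    x = layers.Dense(256, activation="relu",
                     kernel_regularizer=regularizers.l2(0.01))(x)
    x = layers.BatchNormalization()(x)
    x = layers.Dense(128, activation="relu",
                     kernel_regularizer=regularizers.l2(0.01))(x)
    x = layers.BatchNormalization()(x)
    outputs = layers.Dense(3, activation="softmax")(x)  # Meningioma/Glioma/Pituitary
    return tf.keras.Model(backbone.input, outputs)
```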
When data is scarce, transfer learning approaches have proven to be very beneficial. In transfer learning, pre-trained weights from previous heavy-duty training on large datasets are used to initialize the model. From here the model can operate as a feature extractor or can be fine-tuned according to the task [4]. When used as a feature extractor, the model's weights are fixed and, for an input, the corresponding representations are generated, but the model remains un-generalized for the specific application. Fine-tuning, on the other hand, lets the model start with good low-level feature extraction and then adapt to the data by training on task-specific data. This allows the model to generate enhanced representations with domain-specific higher-order features extracted.

The optimal weight initialization configuration depends upon the task, the data quantity available, the number of trainable parameters, and the weights chosen to be transferred. It is shown in [7, 32, 34] that ImageNet [10] based pre-training brings significant improvement in medical imaging tasks, so we also chose ImageNet-based weight transfer for our transfer learning configurations. Transferring pre-trained weights is not alone sufficient to bring improvements; based on the number of trainable parameters in the model and the data available for the primary task, it is also essential to decide which layers are to be kept trainable and which are to be frozen. In this research, we refer to the frameworks used in transferring weights and freezing layers as weight initialization configurations. When data is limited and the model is very deep, making only a few final layers trainable after freezing the pre-trained weights brings positive gains. We hypothesized that when a model has a limited number of trainable parameters, while data is not very limited or can be augmented via data augmentation strategies, it is most promising to keep only a few initial convolutional layers frozen so that the weights of the remaining network are adapted according to the task; here, the pre-trained weights in the initial layers shall contribute to extracting superior lower-level features such as edges, textures etc. We also hypothesized that, depending on the model's architecture (i.e., number of trainable parameters), even a random initialization configuration, wherein weights are initialized using Glorot initialization [15] to combat vanishing gradients, can outperform transfer learning methods. Both of our hypotheses are verified by the experimental results presented in the next section. Hence, from these hypotheses, we selected the following weight initialization configurations (a minimal code sketch of the freezing scheme follows the list):

- Random Initialization (RI): In this configuration, the entire model was kept trainable after being initialized with Glorot initialization.
- 3 Layers Frozen (3LF): In this configuration, we froze the initial 3 convolutional layers after the weight transfer from ImageNet pre-training, while the remaining model was kept trainable.
- 5 Layers Frozen (5LF): In this configuration, we froze the initial 5 convolutional layers after the weight transfer from ImageNet pre-training, while the remaining model was kept trainable.
- 3 Layers Trainable (3LT): In this configuration, we kept only the last 3 convolutional layers and the classification head trainable, while the remaining model was kept frozen.
- 5 Layers Trainable (5LT): In this configuration, we kept only the last 5 convolutional layers and the classification head trainable, while the remaining model was kept frozen.

These selected configurations are present as choices for the system. Hence, for all 8 architectures we applied the 5 weight initialization configurations, thereby building 40 deep learning models in total. A visualization of the weight initialization configurations is present in Fig. 5. The number of trainable parameters for each architecture per weight initialization configuration is given in Table 1; the least number of trainable parameters is used by the MobileNet 3LF model, i.e., approximately 3 million parameters, while the RI configuration of the ResNet152V2 model uses approximately 59 million.

Fig. 5 Weight initialization configuration in random initialization and transfer learning paradigms (Input image adapted from Figshare dataset)

2.2.2 Training and evaluation procedure

After model selection, the selected model was trained. The brain-MRI images were first pre-processed and then subjected to augmentation. The only pre-processing involved was image rescaling: images were resized to (224, 224), and 3 channels of the same image were stacked to prepare images of (224, 224, 3) dimensions. The prepared images then underwent extensive data augmentation. Data augmentation strategies help increase the effective dataset size when data is scarce, and they help reduce overfitting by providing slight regularization. In our augmentation strategy, images were randomly rotated in the range 0° to 359°; they were shifted along the height and width by up to 40% of the original; the brightness of the images was randomly varied between 50% and 150% of the original; the images were zoomed with the range being 0.0 to 0.3 of the original; and the images were put through a random shearing of 10° in the counter-clockwise direction. Augmentation was added so as to make the model more generalized for unconstrained environments.

Table 1 Number of trainable parameters per model

Model              RI          3LF         5LF         3LT        5LT
VGG19              20,190,789  20,078,211  19,635,459  7,245,827  11,965,443
InceptionV3        22,331,043  22,302,531  22,159,171  2,053,635  2,938,371
DenseNet121        7,252,355   7,197,891   7,148,739   499,203    658,947
DenseNet201        18,622,595  18,568,131  18,518,979  845,059    1,119,491
ResNet152V2        58,750,595  58,700,163  58,666,883  5,021,187  8,431,107
Xception           21,369,643  21,341,579  21,315,851  6,049,795  7,338,355
InceptionResnetV2  54,706,787  54,678,275  54,534,915  4,731,427  5,259,811
MobileNet          3,505,475   3,502,275   3,493,507   1,880,579  298,499

The augmentation strategy followed the standard methodology, wherein during each training iteration all the images are subjected to random transformations and shuffling; this does not enlarge the dataset on disk but rather supplies new transformed versions of the images. The addition of perturbations during training helps reduce overfitting, as at every epoch the model operates over a new set of images. The testing phase involved transforming the test set in a similar manner; this was done to make the testing challenging as well as to check the generalizability of the model in 'wild' environments. A model is tested multiple times, and the best performance achieved is reported throughout. After this, the augmented images were input to the CNN for training. Images were shuffled before augmentation; this shuffling enhances the generalization abilities of the model. A minimal sketch of the augmentation pipeline is given below.
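The sketch below assumes the Keras ImageDataGenerator API; the exact parameter encodings (e.g., how the zoom range maps onto zoom_range) are assumptions mapping the ranges described in the text onto that API.

```python
# Minimal sketch of the described augmentation, assuming Keras'
# ImageDataGenerator; parameter mappings onto this API are assumptions.
from tensorflow.keras.preprocessing.image import ImageDataGenerator

train_datagen = ImageDataGenerator(
    rotation_range=359,           # random rotation in [0, 359] degrees
    width_shift_range=0.4,        # horizontal shift up to 40% of width
    height_shift_range=0.4,       # vertical shift up to 40% of height
    brightness_range=(0.5, 1.5),  # brightness between 50% and 150% of original
    zoom_range=0.3,               # random zoom around the original scale
    shear_range=10.0,             # shear angle of 10 degrees (counter-clockwise)
)
# Each epoch draws freshly transformed, shuffled batches; the dataset itself
# is not enlarged on disk.
# train_gen = train_datagen.flow(x_train, y_train, batch_size=64, shuffle=True)
```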
Each of the models was trained for 275 epochs with a batch size of 64, using Adam [25] as the optimizer with β1 = 0.99, β2 = 0.999 and ε = 1e-8, while the learning rate schedule was kept as follows:

lr = \begin{cases} 10^{-4} & \text{if } epoch < 125 \\ 10^{-5} & \text{if } 125 \le epoch < 175 \\ 10^{-6} & \text{otherwise} \end{cases}

Here, lr is the learning rate. A benefit of using this learning rate schedule is that it slows down the optimization process as the global optimum comes nearby, hence landing at the global optimum or in very close proximity to it. We used categorical cross-entropy as the classification loss, and we kept saving the model with the best validation accuracy. All the models were trained with the TensorFlow framework on K80 GPUs provided by Google Colab and Kaggle; a sketch of the training setup is given below. After training all forty models, each was put through a comprehensive evaluation using accuracy, loss, precision, recall, F1-score and AUC. The test-set images were augmented similarly to the training images and were given as input in batches of 64 to the model; using the results, we plot the ROC curves and confusion matrices. We present an experimental analysis and discussion of the results in the next section.
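The following is a minimal sketch of this training setup, assuming TensorFlow/Keras; `model` is a network built as in the earlier head sketch, and the checkpoint filename is illustrative. The step-decay schedule and Adam hyper-parameters follow the text.

```python
# Minimal sketch of the training setup, assuming TensorFlow/Keras; the
# step-decay schedule and Adam hyper-parameters follow the text.
import tensorflow as tf

def lr_schedule(epoch, lr):
    if epoch < 125:
        return 1e-4
    if epoch < 175:
        return 1e-5
    return 1e-6

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4, beta_1=0.99,
                                       beta_2=0.999, epsilon=1e-8),
    loss="categorical_crossentropy",
    metrics=["accuracy"],
)
callbacks = [
    tf.keras.callbacks.LearningRateScheduler(lr_schedule),
    # Keep the weights with the best validation accuracy, as described above.
    tf.keras.callbacks.ModelCheckpoint("best_model.h5", monitor="val_accuracy",
                                       save_best_only=True),
]
# model.fit(train_gen, validation_data=val_gen, epochs=275, callbacks=callbacks)
```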

3 Results and discussions

We present a quantitative as well as qualitative analysis of the performance of all the models. Through this analysis, we tried to find answers to the following:

- What should be the scheme for transferring knowledge from pre-trained models, and hence which layers must be frozen?
- Which of the transfer learning and random initialization paradigms performs better?
- Which is the most suitable CNN architecture (among the used architectures) for brain tumor classification?
- Can a lighter model like MobileNet learn good enough decision boundaries for the task?
- Which is the best overall architecture-plus-scheme combination for classifying brain tumors?

3.1 Dataset description and evaluation strategies

This work uses the publicly available Figshare dataset, which contains 3064 T1-weighted, contrast-enhanced images of brain MRI slices, with 708, 1426 and 930 images belonging respectively to the Meningioma, Glioma and Pituitary tumor classes. The data was collected from 233 patients in multiple views: sagittal, axial and coronal. For the analysis of the optimal framework amongst the 8 architectures using 5 different weight initialization schemes, we split the dataset in a hold-out manner, using 2500 randomly selected images for training and the remaining 564 images for testing (validation). In the training set, we had 582, 1167 and 751 images from the meningioma, glioma and pituitary tumor classes respectively, while in the validation set the corresponding quantities were 126, 259 and 179; the testing set and the validation set are the same (Fig. 6). After empirical identification of the optimal CNN model, it is required to validate the model under challenging protocols [33]. Hence, for evaluation of the model, 5-fold cross validation and challenging-training strategies have been used; a sketch of these protocols is given below. Under the challenging-training setting there are two experiments: first, the model is trained on 70% of the total data while testing is done on the remaining 30%; in the second experiment, the model is trained on 30% of the total data (the data used for testing in the first challenging-training strategy) and tested on the remaining 70%.
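The sketch below illustrates these protocols, assuming scikit-learn; the stratification and random seeds are assumptions, as the paper only states that images were randomly selected.

```python
# Minimal sketch of the evaluation protocols, assuming scikit-learn;
# stratification and seeds are assumptions. X: image array, y: integer labels.
from sklearn.model_selection import StratifiedKFold, train_test_split

# Hold-out protocol: 2500 images for training, the remaining 564 for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=2500, stratify=y, random_state=0)

# 5-fold cross validation of the optimal model.
for train_idx, test_idx in StratifiedKFold(n_splits=5, shuffle=True,
                                           random_state=0).split(X, y):
    X_tr, X_te = X[train_idx], X[test_idx]  # train and evaluate one fold here

# Challenging training: 70-30 split, then the partitions' roles are swapped
# (train on the 30% part, test on the 70% part).
X_70, X_30, y_70, y_30 = train_test_split(X, y, train_size=0.7, stratify=y,
                                          random_state=0)
```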

3.2 Performance metrics

After training, the models were extensively validated: ROC curves and confusion matrices were plotted for all the trained models, and performance metrics were then computed. These metrics provide insights into how good the classifying abilities of the employed classifier are. The performance metrics are as follows (a code sketch for computing them closes this subsection):

- Validation Accuracy: The fraction of correct predictions made by a classifier on the validation set. Mathematically,

\( \mathrm{Accuracy\ (ACC)} = \dfrac{TP + TN}{TP + TN + FP + FN} \)   (1)

Fig. 6 Dataset quantities per class in hold-out evaluation. a Training dataset b Validation dataset

- Precision: The fraction of relevant instances among all retrieved instances.

\( \mathrm{Precision} = \dfrac{TP}{TP + FP} \)   (2)

- Recall: The fraction of the total relevant instances that were retrieved.

\( \mathrm{Recall} = \dfrac{TP}{TP + FN} \)   (3)

- F1 Score: The harmonic mean of precision and recall.

\( F1\ \mathrm{Score} = \dfrac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \)   (4)

- AUC Score: The area under the ROC curve of a classifier; it is a measure of the effectiveness of the classifier. The higher the AUC score, the higher the true-positive rate and the lower the false-positive rate, resulting in better classification performance. For multiclass classification, the AUC score is computed for each class using the one-vs-all approach, i.e., for a given class, all the examples belonging to that class are considered positive while the others are negative.
- Matthews Correlation Coefficient (MCC): MCC considers the complete confusion matrix and hence is a fair evaluation metric for imbalanced data.

\( \mathrm{MCC} = \dfrac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \)   (5)

- Cohen's Kappa Coefficient (κ): Cohen's Kappa is an evaluation metric for imbalanced datasets; it takes the average accuracy under random guessing into account.

\( \kappa = \dfrac{p_0 - p_e}{1 - p_e} \)   (6)

Here, p_0 is the overall average accuracy, while p_e is the average accuracy of a random guess, computed with a priori probability-distribution information.
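A minimal sketch for computing these metrics is given below, assuming scikit-learn; macro averaging over the three classes is an assumption, as the paper does not state the averaging mode.

```python
# Minimal sketch of the metric computation, assuming scikit-learn; macro
# averaging is an assumption. y_true holds integer labels, y_prob the
# softmax outputs of shape (n_samples, 3).
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, matthews_corrcoef,
                             cohen_kappa_score)

def evaluate(y_true, y_prob):
    y_pred = np.argmax(y_prob, axis=1)
    return {
        "accuracy":  accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, average="macro"),
        "recall":    recall_score(y_true, y_pred, average="macro"),
        "f1":        f1_score(y_true, y_pred, average="macro"),
        "auc":       roc_auc_score(y_true, y_prob, multi_class="ovr"),  # one-vs-all
        "mcc":       matthews_corrcoef(y_true, y_pred),
        "kappa":     cohen_kappa_score(y_true, y_pred),
    }
```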

3.3 Results and discussions

This section provides experimental analysis over all the trained models and the findings derived from the analysis. As mentioned earlier, 40 different models were trained over 8 different architectures and 5 different weight initialization configurations, out of which the DenseNet201-based model with the 5 initial convolutional layers frozen and transfer learning applied (i.e., 5LF) obtained a state-of-the-art 98.22% validation accuracy, a 0.9986 AUC and an F1 score of 0.9784, with precision and recall being 0.9769 and 0.9801 respectively. Table 2 presents a comparative analysis of all 40 models in terms of maximum validation accuracy. Tables 3, 4 and 5 show the comparison between all 40 models in terms of training accuracy, validation loss and training loss correspondingly. We computed the average over all eight architectures for each of the five configurations, and likewise the average over all configurations for each architecture; these averages suggest that when transfer learning is applied and only the initial few layers are kept frozen, the accuracy achieved is the best and loss optimization is optimal. Thus, it can be inferred that the knowledge of low-level feature filtering generally captured by the shallower layers of deep models is very useful for brain tumor classification. The average validation accuracy achieved by the 3LF configuration-based models is 97.30%, and for the similar setting but with 5 instead of 3 initial convolutional layers frozen (i.e., 5LF) it is 97.31%. This metric degraded drastically, to 79.64% and 80.39% respectively, for the transfer learning cases in which only the final 3 or 5 convolutional layers were kept trainable (3LT and 5LT). Hence, it is evident from the results that when the number of parameters learning generic features is adequately high, a performance increment is seen; however, this quantity also depends upon the architectural depth and the merits of the convolutional model being followed. To gain more insights, Fig. 7 illustrates the accuracy of the best-performing model per weight initialization configuration. It can be noticed from these curves that both training and testing accuracies have converged, and hence good generalization is present. To visualize the training and validation loss optimization per epoch during the training process, we have plotted loss curves in Fig. 8 for the best model of each of the 8 architectures. It can be inferred from the curves that there is only a trace chance of overfitting or underfitting in very few models.
Confusion matrices for the best-performing model per architecture are illustrated in Fig. 9; these matrices show the strengths and weaknesses of the models on a class-to-class basis.
Table 2 Validation accuracy

Architecture       RI      3LF     5LF     3LT     5LT     Average
VGG19              86.17%  94.68%  94.33%  95.92%  96.69%  93.56%
InceptionV3        92.02%  96.99%  97.52%  71.28%  72.34%  86.03%
DenseNet121        93.79%  98.05%  97.87%  90.25%  90.60%  94.11%
DenseNet201        93.97%  98.05%  98.22%  90.60%  91.31%  94.36%
ResNet152V2        90.25%  97.34%  97.16%  71.63%  70.39%  85.35%
Xception           95.39%  97.34%  97.70%  81.21%  84.93%  91.31%
InceptionResnetV2  95.04%  98.05%  97.87%  55.00%  53.37%  79.87%
MobileNet          78.90%  97.87%  97.87%  81.21%  83.51%  87.87%
Average            90.69%  97.30%  97.31%  79.64%  80.39%

Table 3 Training accuracy

Architecture       RI      3LF     5LF     3LT     5LT
VGG19              86.12%  96.32%  94.40%  97.52%  98.24%
InceptionV3        90.92%  98.52%  98.48%  70.08%  70.84%
DenseNet121        93.72%  99.16%  99.04%  90.04%  90.60%
DenseNet201        94.04%  99.28%  99.32%  90.64%  91.04%
ResNet152V2        89.72%  98.48%  98.44%  70.88%  70.48%
Xception           95.12%  99.12%  99.20%  80.92%  84.04%
InceptionResnetV2  95.52%  99.16%  99.32%  52.36%  52.16%
MobileNet          77.56%  98.64%  98.08%  80.28%  83.40%

Table 4 Validation loss

Architecture       RI      3LF     5LF     3LT     5LT
VGG19              0.6358  0.4468  0.5104  0.6150  0.3857
InceptionV3        0.3533  0.2405  0.2586  1.0541  1.0258
DenseNet121        0.4787  0.3558  0.4198  0.9527  0.8982
DenseNet201        0.4140  0.2689  0.3335  0.8678  0.7589
ResNet152V2        0.4544  0.2960  0.2994  0.9239  0.9223
Xception           0.2891  0.2478  0.2223  0.7112  0.6553
InceptionResnetV2  0.3511  0.1894  0.2745  1.1419  1.1977
MobileNet          0.6840  0.4527  0.4843  0.8872  1.3077

The DenseNet201 5LF model proves to be the most efficient, as it makes the fewest wrong and the most correct predictions when compared with the other top-performing models; this signifies that the proposed model does not exhibit bias. From Table 2, the supremacy of the configurations with initial layers frozen is clear, and it is not surprising that the other performance metrics (precision, recall, F1 score and AUC) were comparatively lower for the 3LT and 5LT configurations than for the 3LF and 5LF configurations; Table 6 may be referenced for these metrics. For all the models trained under the 3LF and 5LF configurations, the average AUC obtained was greater than or equal to 0.99. Validation and training losses were also lowest for the 3LF and 5LF schemes but comparatively higher for the 3LT and 5LT approaches; the InceptionResNetV2 architecture under 3LF achieved the minimum validation loss of 0.1894. The maximum training accuracies were also very high for the 3LF and 5LF schemes.

Table 5 Training loss

Architecture       RI      3LF     5LF     3LT     5LT
VGG19              0.6413  0.4043  0.4886  0.5598  0.3301
InceptionV3        0.3470  0.2052  0.2280  1.0636  1.0287
DenseNet121        0.4691  0.3346  0.3832  0.9432  0.8749
DenseNet201        0.4087  0.2321  0.2961  0.8583  0.7493
ResNet152V2        0.4400  0.2643  0.2562  0.9023  0.8861
Xception           0.3068  0.1990  0.1691  0.7213  0.6747
InceptionResnetV2  0.3304  0.1535  0.2125  1.1646  1.2050
MobileNet          0.7167  0.4263  0.4690  0.9069  1.3049
Fig. 7 Accuracy curves of the best-performing architecture per scheme. a RI - Xception b 3LF - DenseNet121 c 5LF - DenseNet201 d 3LT - VGG19 e 5LT - VGG19

Fig. 8 Loss curves of the best model per architecture. a VGG19 5LT b InceptionV3 5LF c DenseNet121 3LF d DenseNet201 5LF e ResNet152V2 3LF f Xception 5LF g InceptionResNetV2 3LF h MobileNet 3LF

Fig. 9 Confusion matrix of the best model per architecture. a VGG19 5LT b InceptionV3 5LF c DenseNet121 3LF d DenseNet201 5LF e ResNet152V2 3LF f Xception 5LF g InceptionResNetV2 3LF h MobileNet 3LF

This held especially for the DenseNet201 and DenseNet121 based architectures, with DenseNet201 achieving a maximum training accuracy of 99.32% under 5LF. The VGG19-based 3LT and 5LT architectures are an exception to the otherwise poor performance of the later-layers-only-trainable cases: in contrast to the above, these two models performed even better than their 3LF and 5LF counterparts, achieving maximum validation accuracies of 95.92% and 96.69% while maintaining good F1 scores of 0.9295 and 0.9520 and AUCs of 0.9938 and 0.9956 respectively. The VGG19-based 3LT and 5LT models achieved high performance because of the dense concentration of parameters at the end of the CNN architecture and the comparatively small architectural depth.

Random initialization (RI) has shown good performance in the cases of InceptionResNetV2, both variants of DenseNet and, especially, the Xception architecture. The average validation and training accuracies of the randomly initialized architectures, at 90.69% and 90.34% respectively, were higher than those of the 3LT and 5LT configurations. The Xception-based model outperformed every other randomly initialized model by achieving a 95.39% validation accuracy, with an AUC of 0.9896 and an F1 score of 0.9229; comparable performance was also shown by InceptionResNetV2.
Table 6 Comparison between performance metrics of 40 models

Scheme             RI                                    3LF                                   5LF
Architecture       Precision  Recall  F1 Score  AUC      Precision  Recall  F1 Score  AUC      Precision  Recall
VGG19              0.8360     0.8480  0.8405    0.9560   0.9263     0.9277  0.9269    0.9910   0.9143     0.9195
InceptionV3        0.8693     0.8803  0.8735    0.9753   0.9544     0.9605  0.9573    0.9967   0.9532     0.9514
DenseNet121        0.9137     0.9231  0.9179    0.9880   0.9683     0.9752  0.9712    0.9977   0.9701     0.9699
DenseNet201        0.9097     0.9233  0.9144    0.9914   0.9581     0.9610  0.9595    0.9973   0.9769     0.9801
ResNet152V2        0.9108     0.8803  0.8924    0.9823   0.9596     0.9601  0.9596    0.9975   0.9491     0.9550
Xception           0.9210     0.9258  0.9229    0.9896   0.9549     0.9594  0.9569    0.9957   0.9600     0.9658
InceptionResNetV2  0.9027     0.9133  0.9074    0.9874   0.9539     0.9605  0.9570    0.9965   0.9646     0.9705
MobileNet          0.7374     0.7392  0.7368    0.9157   0.9534     0.9585  0.9558    0.9963   0.9525     0.9581

Scheme             5LF (cont.)        3LT                                   5LT
Architecture       F1 Score  AUC      Precision  Recall  F1 Score  AUC      Precision  Recall  F1 Score  AUC
VGG19              0.9161    0.9900   0.9313     0.9279  0.9295    0.9938   0.9503     0.9546  0.9520    0.9956
InceptionV3        0.9522    0.9962   0.6395     0.6083  0.6118    0.8253   0.6684     0.6402  0.6451    0.8426
DenseNet121        0.9698    0.9983   0.8640     0.8666  0.8653    0.9712   0.8908     0.8985  0.8943    0.9788
DenseNet201        0.9784    0.9986   0.8674     0.8733  0.8684    0.9773   0.8669     0.8733  0.8695    0.9758
ResNet152V2        0.9516    0.9951   0.6723     0.6244  0.6344    0.8611   0.6375     0.6380  0.6368    0.8316
Xception           0.9627    0.9954   0.7780     0.7910  0.7787    0.9341   0.7881     0.7844  0.7847    0.9394
InceptionResNetV2  0.9673    0.9967   0.5430     0.4252  0.4770    0.6576   0.4398     0.4018  0.3557    0.6096
MobileNet          0.9551    0.9951   0.7688     0.7740  0.7712    0.9216   0.7811     0.7760  0.7778    0.9367

The models were randomly initialized with Glorot initialization, contributing to a stabilized training process. The superior performance of the Xception model in the RI scheme can be accredited to its limited number of trainable parameters and its hierarchical convolutional architecture, suited to low data quantities. MobileNet also follows an efficient design, but its depth is not enough to model global correspondence with respect to the features extracted in local neighbourhoods. However, the RI schemes were less promising than the 3LF and 5LF configurations due to the small amount of data, even though data augmentation strategies had been employed. We have computed the average validation accuracies for each architecture used in this research, and the findings suggest that DenseNet201-based models achieve the highest average validation accuracy of 94.36%, while DenseNet121-based models obtain 94.11%. These high accuracies can be attributed to the parameter efficiency of the architecture, which still ensures enough functional complexity to capture the data's highly nonlinear trends. Even for the 3LT and 5LT schemes, the validation accuracies of the models based on both DenseNet variants were comparatively much higher than those based on the other architectures except VGG19; the AUC and other performance measures were the highest, and the loss the least, for the DenseNet models. Xception, another efficient architecture, managed to achieve an overall average validation accuracy of 91.31%, while VGG19-based models reached 93.56%.

The enhanced performance of InceptionResNetV2 under RI can be attributed to the fact that the model has enough parameters to capitalize on robust decision boundaries. Conversely, the MobileNet architecture has a limited number of parameters and hence was not able to perform significantly in the RI case. Further, the stark performance drop of the InceptionResNetV2 model under the 3LT and 5LT transfer learning settings can be attributed to the limited number of parameters fine-tuned according to the task; another cause is the generation of non-domain-specific feature maps by the deeply frozen network. This applies to ResNet152V2 and InceptionV3 as well.
Overall, MobileNet-based models performed better than ResNet152V2, InceptionResNetV2 and InceptionV3, reaching an 87.87% average accuracy while being faster at training time and very light for full-scale deployment. Both the 3LF and 5LF MobileNet models achieved a validation accuracy of 97.87%, good average AUCs of 0.9963 and 0.9951, and F1 scores of 0.9558 and 0.9551 respectively. It is evident from the results (Tables 2 and 6) that when the MobileNet architecture leverages pre-trained weights and has an adequately high number of trainable parameters (i.e., the 5LF and 3LF transfer-learning schemes), it obtains good performance at low computational cost. Further, MobileNet-based models are very lightweight, fitting in just 42.6 MB of space, approximately 6 times less than that required by the best-performing DenseNet201 model and 16 times less than the 658 MB required by the top-performing InceptionResNetV2 model. This space can be compressed further using model-compression techniques, but even as-is it is comparatively low while the model performs nearly on par with the other top-performing models. This shows potential for scaling up the use of deep learning approaches in real-time medical diagnosis, as such architectures can easily fit into edge devices, and it is a positive sign for deploying robust AI-based systems in remote places of the world.
For the VGG19 architecture, the 5LT configuration suited best. For InceptionV3, the 5LF configuration gave the best results, achieving a 97.52% validation accuracy and an AUC of 0.9962. For the DenseNet121 network, the 3LF scheme obtained the best results, with an AUC of 0.9977 and a validation accuracy of 98.05%; this was also the best model for the 3LF scheme overall. The 5LF scheme used on the DenseNet201 architecture has been discussed earlier. For the ResNet152V2 architecture, the 3LF model performed well, achieving a 97.34% validation accuracy, but the 3LT and 5LT models performed poorly, achieving only 71.63% and 70.39% validation accuracies respectively. On an overall basis, Xception networks gave good results: the RI variant was the best among all RI models, while among Xception's own schemes the 5LF model was the best, achieving 97.70% accuracy and an AUC of 0.9954. InceptionResNetV2-based models reached the lowest average validation accuracy, but the 3LF variant managed to obtain a high validation accuracy of 98.05%; the observed inefficiency is due to a high number of parameters combined with a very deep architecture. Amongst the MobileNet models, the 3LF configuration achieved the best results. From the analysis, the optimality of the 3LF and 5LF schemes can be inferred, but which of the two frameworks will perform better depends entirely upon the convolutional architecture's properties, specifically its depth.
We have plotted ROC curves in Fig. 10 for each class, namely Meningioma, Glioma and Pituitary tumor, for the different weight initialization configurations per architecture. These curves also show the efficacy of the 3LF and 5LF configurations for most of the architectures. We computed the precision, recall and F1 score per class, and they suggest that Meningiomas were the most challenging to classify, while Pituitary tumors and Gliomas were comparatively easier; this can also be inferred from the plotted ROC curves. The DenseNet201-based 5LF model was the best at classifying meningiomas, producing the fewest false predictions for this class.
There were also benefits from the learning rate schedule: it can be seen from the loss curves in Fig. 8 that at the intervals after which the learning rate decayed, the optimization process moved further towards convergence.

Fig. 10 ROC plots per weight initialization configuration. a RI weight initialization configuration b 3LF weight initialization configuration c 5LF weight initialization configuration d 3LT weight initialization configuration e 5LT weight initialization configuration
This analysis highlighted the optimality of the DenseNet201-based 5LF model; to further validate its performance, a 5-fold cross validation was performed, the results of which are tabulated in Table 7. To articulate the fair performance of the model, Cohen's Kappa and the MCC score were computed along with the validation accuracy and F1 score. The average validation accuracy attained by the model is 98.68%, with the model performing significantly well for all three classes. The high Cohen's Kappa and MCC scores of 0.9796 each suggest that the model is robust against data imbalance, and the high average F1 score is further evidence of the same. In all the folds, classification of meningioma was the most challenging, although the model performed well in classifying both true positives and true negatives.
To investigate the DenseNet201-based 5LF model's robustness against hard training strategies, the model was trained under 70–30% and 30–70% data settings; the split was done on a held-out basis. The results are tabulated in Table 8. In both settings, the model sustained high-end performance.

Table 7 5-fold cross validation of the DenseNet201 based 5LF model

Fold     Accuracy (in %)                                       F1-Score  Cohen's Kappa (κ)  MCC score
         Meningioma  Glioma  Pituitary tumor  Average
1        97.77       99.64   98.44            98.85            0.9861    0.9820             0.9820
2        98.59       98.92   99.47            99.02            0.9887    0.9847             0.9847
3        97.56       99.28   98.33            98.69            0.9863    0.9797             0.9797
4        97.14       98.63   98.82            98.36            0.9818    0.9742             0.9742
5        96.85       99.31   98.97            98.69            0.9852    0.9792             0.9792
Average  97.55       99.12   98.77            98.68            0.9852    0.9796             0.9796
Table 8 Hard-training performance of the DenseNet201 based 5LF model

Training data  Testing data  Accuracy (in %)                                       F1-Score  Cohen's Kappa (κ)  MCC score
size (in %)    size (in %)   Meningioma  Glioma  Pituitary tumor  Average
70             30            95.41       98.61   100.00           98.26            0.9801    0.9277             0.9277
30             70            93.06       93.06   97.12            95.70            0.9520    0.9330             0.9329

In the case of the 70–30% split, the model achieved excellent performance, with an overall 98.26% validation accuracy; particularly, for the case of Pituitary tumors, 100% classification results were obtained. Similar to the results of the previous training strategies, classification of meningiomas was challenging, and the misclassification for this class was comparatively higher; this can be attributed to the limited data quantities for the particular class. The model is nevertheless robust enough against data imbalance, as it attained high Cohen's Kappa and MCC scores of 0.9277 each. In the data-scarce scenario of a 30–70% training-testing split, the model also performed significantly well, sustaining a high average validation accuracy of 95.70%; the high F1 score, Cohen's Kappa and MCC score obtained under this setting give evidence of the model's efficacy. Nevertheless, misclassification into the Meningioma class was higher in this scenario, a possible cause being the limited training examples for the class.
It is of utmost importance to analyse the image regions which activate the neurons of the model for tumor-type classification; this analysis verifies whether the model emphasizes the area in the vicinity of the tumor-affected region or instead bases its predictions on some morphological bias corresponding to the view and structure of the MRI slice. To achieve this, we extracted activations from the last convolutional layer of the proposed DenseNet201 (5LF with transfer learning); these maps were then rescaled and superimposed over the original image so as to find the regions upon which the model bases its predictions. The obtained results are illustrated in Fig. 11. It is clear from the figure that the model learns features with spatial contexts that correspond to the tumor-affected region. Further, the model does not emphasize the structural details of the MRI, focusing rather on generic features. These two observations indicate that the model does not encompass intrinsic bias and is highly generalized. Another merit of the proposed model is that it produces rich representations irrespective of the view of the MRI slice (i.e., coronal, sagittal or axial). Lastly, it has been observed that the model localizes features in the vicinity of the tumor-affected region, and this holds for all three classes; this further reinforces that the model generalizes across different challenging scenarios.

Fig. 11 Feature-activation visualization for DenseNet 5LF over all the three tumor types
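The following is a minimal sketch of this activation-overlay procedure, assuming TensorFlow/Keras and Matplotlib; the layer name "conv5_block32_concat" (the final concatenation block in Keras' DenseNet201) is an assumption for illustration, as the paper only says "the last convolutional layer".

```python
# Minimal sketch of the feature-activation overlay, assuming TensorFlow/Keras;
# the DenseNet201 layer name used below is an assumption for illustration.
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt

def activation_overlay(model, image, layer_name="conv5_block32_concat"):
    """image: float array of shape (224, 224, 3) with values in [0, 255]."""
    # Sub-model exposing the chosen convolutional feature maps.
    feat_model = tf.keras.Model(model.input, model.get_layer(layer_name).output)
    fmap = feat_model(image[np.newaxis, ...]).numpy()[0]      # (h, w, channels)
    heat = fmap.mean(axis=-1)                                 # average over channels
    heat = (heat - heat.min()) / (heat.max() - heat.min() + 1e-8)
    # Rescale the map to the input resolution and superimpose it.
    heat = tf.image.resize(heat[..., np.newaxis], image.shape[:2]).numpy()[..., 0]
    plt.imshow(image.astype("uint8"))
    plt.imshow(heat, cmap="jet", alpha=0.4)
    plt.axis("off")
    plt.show()
```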
Further, a comparative study between existing state-of-the-art frameworks and the proposed model is presented in Table 9. The most optimal framework from the analysis, i.e., the DenseNet201-based 5LF model, is considered for comparison, with results reported for all the aforementioned training strategies. Since most studies report results in terms of average accuracy, the comparative study adopts it as the performance metric. The results establish the superiority of the proposed approach. Under 5-fold cross-validation training, the proposed model shows a performance gain of 0.84% over the state-of-the-art and a 5.45% gain over the best-performing machine learning-based attempt. The proposed model remains highly generalized when trained under the other strategies, and the results evidence the same. The model trained under the data-scarce 30–70% split performed encouragingly well when compared with other frameworks. The results also prove the optimality of the proposed transfer-learning scheme: the proposed model shows a 0.88% performance increment over the previous state-of-the-art transfer learning method. Thus, the proposed DenseNet201 model exhibits two properties. Firstly, it efficiently extracts information at multiple spatial contexts; secondly, the adopted 5LF transfer learning preserves the model's capacity to extract rich low-level but non-generic features, upon which discriminative generic features are learnt. These two properties help the proposed model attain state-of-the-art performance.
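To make the 5LF scheme concrete, a minimal Keras sketch is given below. It assumes the tf.keras DenseNet201 pre-trained on ImageNet; the input size and optimizer settings are illustrative placeholders rather than the paper's exact hyperparameters.

```python
import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.applications import DenseNet201

base = DenseNet201(weights="imagenet", include_top=False,
                   input_shape=(224, 224, 3))

# Freeze the first five convolutional layers so the low-level filters
# learned on ImageNet are preserved; all remaining layers fine-tune.
frozen = 0
for layer in base.layers:
    if isinstance(layer, layers.Conv2D):
        layer.trainable = False
        frozen += 1
        if frozen == 5:
            break

model = models.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(3, activation="softmax"),  # glioma / meningioma / pituitary
])
model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
              loss="categorical_crossentropy", metrics=["accuracy"])
```

Freezing only the earliest convolutions keeps the generic low-level filters intact while leaving the densely connected blocks free to adapt to MRI statistics.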
Table 9 Comparison with state-of-the-art

Work                          Framework                                                                        Training-testing strategy                       Average accuracy (in %)
Cheng et al. [6]              BoW-SVM                                                                          80–20% Data Split                               91.28
Ismael and Abdel-Qader [23]   DWT + Statistical Features + Neural Network                                      70–30% Data Split                               91.90
Gumaei et al. [16]            PCA-NGIST + RELM                                                                 5-Fold Cross Validation                         92.61
                                                                                                               70–30% Data Split                               93.23
Abiwinanda et al. [1]         CNN                                                                              –                                               84.19
Afshar et al. [2]             Capsule Networks                                                                 –                                               90.89
Ghassemi et al. [14]          DCGAN based data augmentation + CNN                                              5-Fold Cross Validation                         95.60
Toğaçar et al. [42]           BrainMRNet (Attention Mechanism and Residual Blocks) + Hypercolumn Technique     80–20% Data Split                               97.84
                                                                                                               70–30% Data Split                               97.25
Swati et al. [37]             Fine-Tuning of Pre-Trained VGG19                                                 5-Fold Cross Validation                         94.82
Kaur and Gandhi [24]          Pre-Trained AlexNet [26] with Transfer Learning                                  70–30% Data Split                               96.85
                                                                                                               5-Fold Cross Validation                         96.10
Deepak and Ameer [9]          Pixel Normalization and Transfer Learning                                        5-Fold Cross Validation                         97.80
Proposed                      DenseNet201 and 5LF based Transfer Learning                                      Hold-Out (2500 Images Training, 564 Testing)    98.22
                                                                                                               30–70% Data Split                               95.70
                                                                                                               70–30% Data Split                               98.26
                                                                                                               5-Fold Cross Validation                         98.68

To further explore the generalization of our proposed model, DenseNet201 5LF (initial 5 layers frozen), across different MR modalities, we considered T2-weighted MR images from the Harvard Database [17, 31] for the brain MR image abnormality detection task. This dataset
has been utilized by Kaur and Gandhi [24], wherein the classification of brain images was likewise a major research objective. The aforementioned database holds 160 MR images, 140 abnormal and the remaining 20 normal. The images in the 'Abnormal' class are taken from patients with Alzheimer's disease, sarcoma, Pick's disease, Alzheimer's disease plus visual agnosia, and Huntington's disease. For training and testing, 60% of the entire data was used for training and the remaining 40% for testing. Table 10 tabulates the results obtained, while Fig. 12 visualizes the image regions that activate the neurons of the model. The model's ability to generalize across MR modalities is strongly highlighted by the 100% results obtained on the Harvard Database. Furthermore, from Fig. 12 it can be inferred that the model is highly activated over the tumor-affected region; this emphasizes that the model performs well irrespective of the modality and does not carry intrinsic biases due to the morphological structure of the MRI. It is worth mentioning that for feature-activation visualization we utilize the model trained on the Figshare dataset. Accurate localization even in this challenging cross-database, cross-modality scenario suggests the potential of the approach and further evidences the generalizability of the model.

Table 10 Comparison of the proposed model on the Harvard database (T2-weighted MR images)

Model                                                Average accuracy (%)   F1-Score   Cohen's Kappa (κ)   MCC-Score
Pre-Trained AlexNet [24]                             100                    1.00       1.00                1.00
DenseNet201 5LF based Transfer Learning (Proposed)   100                    1.00       1.00                1.00

Fig. 12 Feature-activation visualization for DenseNet 5LF (Figshare-trained model) over a glioma T2-weighted MR slice from the Harvard database [31]
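Under the protocol just described, the 60–40 Harvard evaluation might look roughly as follows. This is a minimal sketch, not the authors' exact pipeline: `build_model` is a hypothetical helper standing in for the DenseNet201 5LF constructor (re-headed for the two-class normal/abnormal task and compiled with a sparse categorical cross-entropy loss), while X and y denote the 160 preprocessed slices and their binary labels.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, cohen_kappa_score, matthews_corrcoef

# X: (160, 224, 224, 3) preprocessed T2-weighted slices; y: 0 = normal, 1 = abnormal
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.6, stratify=y, random_state=0)

model = build_model(num_classes=2)  # hypothetical helper (see sketch above)
model.fit(X_train, y_train, epochs=20, batch_size=16, verbose=0)

y_pred = np.argmax(model.predict(X_test), axis=1)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Kappa:   ", cohen_kappa_score(y_test, y_pred))
print("MCC:     ", matthews_corrcoef(y_test, y_pred))
```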

4 Conclusion and future scope

This paper presented an efficient deep learning-based framework for brain tumor classification. The proposed framework evaluated four different weight initialization configurations from the transfer learning domain and one from the random initialization framework, applied respectively over eight different CNN architectures, thereby exploring brain tumor classification across 40 different deep learning models. The experimental results reflect encouraging classification performances for the initial-layer freezing configurations of the transfer learning paradigm, i.e., 3LF and 5LF, which keep the starting three and five convolutional layers non-trainable, respectively, after weight transfer. Architectural features and the number of trainable parameters also influenced model performance. The deep architectures of the DenseNet family gave promising results, especially for the 3LF and 5LF configurations. VGG19-based models, owing to their comparatively shallower but parameter-extensive nature, achieved comparable results for the transfer learning configurations that kept only the last three and five convolutional layers trainable. Results showed that transfer learning-based models with the correct architectural choice outperformed their random initialization counterparts. The very lightweight MobileNet architecture was also applied to the task with all five configurations and obtained promising results, opening the door to more scalable deep learning-based medical image analysis. All the trained models were compared quantitatively with various performance metrics and graphically using Loss-Accuracy curves, Confusion Matrices and ROC Curves. The best performing proposed models are robust and well generalized; they can be utilized as assistants for doctors/radiologists in brain tumor classification. In future research, entire MRI sequences shall be utilized instead of single MRI slices; attention-based neural networks and deep-metric losses will also be explored for brain tumor classification.

Declarations

Conflict of interest/Competing interest The authors declare that they have no known competing financial
interests or personal relationships that could have appeared to influence the work reported in this paper.

Ethical approval This article does not contain any studies with human participants or animals by any authors.

References

1. Abiwinanda N, Hanif M, Hesaputra ST, Handayani A, Mengko TR (2019) Brain tumor classification using
convolutional neural network. In: World congress on medical physics and biomedical engineering 2018.
Springer, Singapore, pp 183–189
2. Afshar P, Plataniotis KN, Mohammadi A (2019) Capsule networks for brain tumor classification based on
MRI images and coarse tumor boundaries. In: ICASSP 2019-2019 IEEE international conference on
acoustics, speech and signal processing (ICASSP). IEEE, pp 1368–1372
3. Buetow PC, Smirniotopoulos JG, Done S (1990) Congenital brain tumors: a review of 45 cases. AJR Am J
Roentgenol 155(3):587–593
4. Cascio D, Taormina V, Raso G (2019) Deep CNN for IIF images classification in autoimmune diagnostics.
Appl Sci 9(8):1618
5. Cheng J (2017) Brain tumor dataset (version 5). Figshare. Retrieved 16 November 2020 from https://doi.org/10.6084/m9.figshare.1512427.v5
6. Cheng J, Huang W, Cao S, Yang R, Yang W, Yun Z, Wang Z, Feng Q (2015) Enhanced performance of
brain tumor classification via tumor region augmentation and partition. PLoS ONE 10(10):e0140381
7. Cheplygina V, de Bruijne M, Pluim JP (2019) Not-so-supervised: a survey of semi-supervised, multi-
instance, and transfer learning in medical image analysis. Med Image Anal 54:280–296
8. Chollet F (2017) Xception: deep learning with depthwise separable convolutions. In: Proceedings of the
IEEE conference on computer vision and pattern recognition, pp 1251–1258
9. Deepak S, Ameer PM (2019) Brain tumor classification using deep CNN features via transfer learning. Comput Biol Med 111:103345
10. Deng J, Dong W, Socher R, Li LJ, Li K, Fei-Fei L (2009) Imagenet: a large-scale hierarchical image
database. In: 2009 IEEE conference on computer vision and pattern recognition. IEEE, pp 248–255
11. Doi K (2007) Computer-aided diagnosis in medical imaging: historical review, current status and future potential. Comput Med Imaging Graph 31(4–5):198–211. https://doi.org/10.1016/j.compmedimag.2007.02.002
12. Dong Y, Jiang Z, Shen H, Pan WD, Williams LA, Reddy VV, Benjamin W, Bryan AW (2017) Evaluations
of deep convolutional neural networks for automatic identification of malaria infected cells. In: 2017 IEEE
EMBS international conference on biomedical & health informatics (BHI). IEEE, pp 101–104
13. Erickson BJ, Korfiatis P, Akkus Z, Kline TL (2017) Machine learning for medical imaging. Radiographics
37(2):505–515
14. Ghassemi N, Shoeibi A, Rouhani M (2020) Deep neural network with generative adversarial networks pre-
training for brain tumor classification based on MR images. Biomed Signal Process Control 57:101678

15. Glorot X, Bengio Y (2010) Understanding the difficulty of training deep feedforward neural networks. In:
Proceedings of the thirteenth international conference on artificial intelligence and statistics, pp 249–256
16. Gumaei A, Hassan MM, Hassan MR, Alelaiwi A, Fortino G (2019) A hybrid feature extraction method with
regularized extreme learning machine for brain tumor classification. IEEE Access 7:36266–36273
17. Harvard Medical School, http://med.harvard.edu/AANLIB/
18. He K, Zhang X, Ren S, Sun J (2016) Identity mappings in deep residual networks. In: European conference
on computer vision. Springer, Cham, pp 630–645
19. Hochreiter S (1998) The vanishing gradient problem during learning recurrent neural nets and problem
solutions. Int J Uncertain Fuzziness Knowledge-Based Syst 6(02):107–116
20. Howard AG, Zhu M, Chen B, Kalenichenko D, Wang W, Weyand T, Adam H (2017) Mobilenets: Efficient
convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861
21. Huang G, Liu Z, Van Der Maaten L, Weinberger KQ (2017) Densely connected convolutional networks. In:
Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4700–4708
22. Ioffe S, Szegedy C (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In: International conference on machine learning. PMLR, pp 448–456
23. Ismael MR, Abdel-Qader I (2018) Brain tumor classification via statistical features and back-propagation
neural network. In: 2018 IEEE international conference on electro/information technology (EIT). IEEE, pp
0252–0257
24. Kaur T, Gandhi TK (2020) Deep convolutional neural networks with transfer learning for automated brain
image classification. Mach Vis Appl 31(3):1–16
25. Kingma DP, Ba J (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980
26. Krizhevsky A, Sutskever I, Hinton GE (2012) Imagenet classification with deep convolutional neural networks. Adv Neural Inf Process Syst 25:1097–1105
27. LeCun Y, Bengio Y, Hinton G (2015) Deep learning. Nature 521(7553):436–444
28. Litjens G, Kooi T, Bejnordi BE, Setio AAA, Ciompi F, Ghafoorian M, van der Laak JAWM, van Ginneken
B, Sánchez CI (2017) A survey on deep learning in medical image analysis. Med Image Anal 42:60–88
29. Mehrotra R, Ansari MA, Agrawal R, Anand RS (2020) A transfer learning approach for AI-based
classification of brain tumors. Mach Learn Appl 2:100003
30. Nair V, Hinton GE (2010) Rectified linear units improve restricted boltzmann machines. In: ICML
31. Nayak DR, Dash R, Majhi B (2016) Brain MR image classification using two-dimensional discrete wavelet
transform and AdaBoost with random forests. Neurocomputing 177:188–197
32. Raghu M, Zhang C, Kleinberg J, Bengio S (2019) Transfusion: understanding transfer learning for medical
imaging. In: Advances in neural information processing systems, pp 3347–3357
33. Ranjan A, Singh VP, Mishra RB, Thakur AK, Singh AK (2021) Sentence polarity detection using stepwise
greedy correlation based feature selection and random forests: an fMRI study. Journal of Neurolinguistics
59:100985
34. Shin HC, Roth HR, Gao M, Lu L, Xu Z, Nogues I, Yao J, Mollura D, Summers RM (2016) Deep
convolutional neural networks for computer-aided detection: CNN architectures, dataset characteristics and
transfer learning. IEEE Trans Med Imaging 35(5):1285–1298
35. Simonyan K, Zisserman A (2014) Very deep convolutional networks for large-scale image recognition.
arXiv preprint arXiv:1409.1556
36. Surawicz TS, McCarthy BJ, Kupelian V, Jukich PJ, Bruner JM, Davis FG (1999) Descriptive epidemiology
of primary brain and CNS tumors: results from the central brain tumor registry of the United States, 1990-
1994. Neuro-oncology 1(1):14–25
37. Swati ZNK, Zhao Q, Kabir M, Ali F, Ali Z, Ahmed S, Lu J (2019) Brain tumor classification for MR
images using transfer learning and fine-tuning. Comput Med Imaging Graph 75:34–46
38. Szegedy C, Liu W, Jia Y, Sermanet P, Reed S, Anguelov D, Erhan D, Vanhoucke V, Rabinovich A (2015)
Going deeper with convolutions. In: Proceedings of the IEEE conference on computer vision and pattern
recognition, pp 1–9
39. Szegedy C, Ioffe S, Vanhoucke V, Alemi A (2016) Inception-v4, inception-resnet and the impact of residual
connections on learning. arXiv preprint arXiv:1602.07261
40. Szegedy C, Vanhoucke V, Ioffe S, Shlens J, Wojna Z (2016) Rethinking the inception architecture for
computer vision. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp
2818–2826
41. Ting DSW, Cheung CYL, Lim G, Tan GSW, Quang ND, Gan A, Hamzah H, Garcia-Franco R, San Yeo
IY, Lee SY, Wong EYM, Sabanayagam C, Baskaran M, Ibrahim F, Tan NC, Finkelstein EA, Lamoureux
EL, Wong IY, Bressler NM, … Wong TY (2017) Development and validation of a deep learning system for
diabetic retinopathy and related eye diseases using retinal images from multiethnic populations with
diabetes. Jama 318(22):2211–2223

42. Toğaçar M, Ergen B, Cömert Z (2020) BrainMRNet: brain tumor detection using magnetic resonance
images with a novel convolutional neural network model. Med Hypotheses 134:109531
43. Vasan D, Alazab M, Wassan S, Naeem H, Safaei B, Zheng Q (2020) IMCFN: image-based malware
classification using fine-tuned convolutional neural network architecture. Comput Netw 171:107138
44. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser L, Polosukhin I (2017)
Attention is all you need. In: Advances in neural information processing systems, pp 5998–6008
45. Wernick MN, Yang Y, Brankov JG, Yourganov G, Strother SC (2010) Machine learning in medical
imaging. IEEE Signal Process Mag 27(4):25–38

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps
and institutional affiliations.

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the
author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is
solely governed by the terms of such publishing agreement and applicable law.
