
ECE/CS5582 Computer Vision, 2023 Project: Transfer Learning

Student Name: Chandrasekhar Bhavirisetti Student ID: 16344254

Project: Transfer Learning for Aerial Image Recognition

Data set: the NWPU aerial data set contains 45 categories with 700 images per category, in 256x256 RGB format.
The data set can be downloaded from:

https://umkc.box.com/s/fxvzh5qq2tiob6eklfxfwn89kg3e1io1

Figure 1. Examples of the images and labels

More details of the dataset can be found at:

http://www.escience.cn/people/JunweiHan/NWPU-RESISC45.html

For the Project, the data set should be partitioned into training (500 images), validation (100 images), and test (100
images) for each category.
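
A minimal sketch of one way to produce this per-category split is shown below; the directory layout, the .jpg extension, and the random seed are assumptions.

import random
from pathlib import Path

def split_category(cat_dir, n_train=500, n_val=100, n_test=100, seed=0):
    # Shuffle one category's 700 images and split them into train/val/test file lists.
    files = sorted(Path(cat_dir).glob('*.jpg'))
    random.Random(seed).shuffle(files)
    train = files[:n_train]
    val = files[n_train:n_train + n_val]
    test = files[n_train + n_val:n_train + n_val + n_test]
    return train, val, test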

(1) Fisher Vector aggregation of conv features [40pts]: use the pre-trained VGG16 network
[https://colab.research.google.com/github/bentrevett/pytorch-image-classification/blob/master/4_vgg.ipynb]
to compute pool5 features, scaling the input from 256x256 to 224x224; this feature is 512 x (7x7) in dimension.
Compute the PCA and GMM of this 49-dimensional conv feature by randomly sampling 50 images from all 45
classes. For kd = [16, 24] and nc = [64, 128], compute the FV aggregations (4 sizes in total), benchmark the
accuracy, and show the 45-class confusion map using a leave-one-out (L1O) SVM classifier.

Show your implementation here also:

function [training_pool_features]=getVGG16Pool5Features(im_dir)
....
function [A, gmm]=trainGMM(training_features, kds, ncs)
....

The first step is to import all the required libraries, such as torch, sklearn, copy, random, and time.
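
The two requested functions were written in Python rather than MATLAB; a minimal sketch is shown below. It assumes the pool5 grid is treated as 49 local descriptors per image (one per spatial position), that the images in im_dir are .jpg files, and that a diagonal-covariance GMM is fitted on the PCA-reduced descriptors; the sampling of 50 images per class is not shown.

import numpy as np
import torch
import torchvision.models as models
import torchvision.transforms as transforms
from PIL import Image
from pathlib import Path
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

def getVGG16Pool5Features(im_dir):
    # Pre-trained VGG16 conv stack up to pool5; a 224x224 input gives a 512x7x7 output.
    vgg16 = models.vgg16(pretrained=True).features.to(device).eval()
    preprocess = transforms.Compose([
        transforms.Resize((224, 224)),
        transforms.ToTensor(),
        transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
    ])
    feats = []
    with torch.no_grad():
        for path in sorted(Path(im_dir).glob('*.jpg')):
            x = preprocess(Image.open(path).convert('RGB')).unsqueeze(0).to(device)
            f = vgg16(x).squeeze(0)                               # 512 x 7 x 7
            feats.append(f.reshape(512, -1).T.cpu().numpy())      # 49 descriptors of dim 512
    return np.stack(feats)                                        # N x 49 x 512

def trainGMM(training_features, kds, ncs):
    # Fit a PCA + diagonal GMM for every (kd, nc) combination and return them in a dict.
    X = training_features.reshape(-1, training_features.shape[-1])
    out = {}
    for kd in kds:
        pca = PCA(n_components=kd).fit(X)
        Xp = pca.transform(X)
        for nc in ncs:
            gmm = GaussianMixture(n_components=nc, covariance_type='diag').fit(Xp)
            out[(kd, nc)] = (pca, gmm)
    return out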


Model Definition

The next step is to load our model; with a pre-trained network this looks a little different. We first examine the
structure of VGG, and then show how to load a pre-trained VGG model. A summary of the VGG model is shown in the
diagram below. A single VGG class covers all of the VGG variants; the only part that depends on the chosen
configuration is the 'features' module that we pass in when building it. The one new component is the
'AdaptiveAvgPool2d' layer. Alongside the standard 'AvgPool' and 'MaxPool' layers, PyTorch provides 'adaptive'
variants of both: in an adaptive pooling layer, instead of specifying the size of the pooling filter, we specify the
desired output size. For this application we need the output size to be 7x7. Because the 'features' output is then
always 7x7, and every VGG configuration ends with a 512-filter convolutional layer, we do not need to modify the
'classifier' for each configuration. The adaptive layer also lets the model accept images over a wide range of sizes,
from as small as 32x32 pixels upwards (VGG models require inputs of at least 32x32). The filter size can be found as
(input size + desired size - 1) / desired size, where '/' means the result is rounded up to the next integer; so to
turn a 32x32 feature map into a 7x7 one we need a 6x6 filter. The 32x32 input is split into 'desired size' evenly
spaced positions, found with np.linspace(0, input_size - desired_size, desired_size), which determines where each
filter is placed; neighbouring filters overlap, so some pixels are covered more than once.
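
A small illustration of the adaptive pooling layer (the layer name and the 7x7 output size follow the description above; the input sizes are arbitrary examples):

import torch
import torch.nn as nn

pool = nn.AdaptiveAvgPool2d((7, 7))        # we specify the desired output size, not the filter size

# 512-channel feature maps of different spatial sizes all pool down to 7x7.
for hw in (7, 8, 14):
    x = torch.randn(1, 512, hw, hw)
    print(pool(x).shape)                   # torch.Size([1, 512, 7, 7]) in every case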


With the base VGG architecture in place, the next step is to define the 'features' used for each individual VGG
configuration. VGG configurations are defined as lists, where each entry is either 'M', denoting a max pooling layer,
or a number, denoting the number of filters in a convolutional layer. The configurations for VGG11, VGG13, VGG16, and
VGG19 (also referred to as 'A', 'B', 'D', and 'E', respectively) are given this way. We can then write a function that
takes a configuration list as input and returns an 'nn.Sequential' containing all of the required layers. This
function, get_vgg_layers, iterates over the configuration list, appending each layer to a 'layers' list before
returning the nn.Sequential. All of the max pooling layers use the same kernel size (2x2) and stride (2); as we have
seen, max pooling defaults to a stride equal to the kernel size. All of the convolutional layers use the same filter
size, 3x3, with a padding of 1. To briefly recap, padding adds zero-valued ('padding') pixels to all four sides of the
image in each channel before the filter is applied. A ReLU non-linearity follows every convolutional layer, and the
'in_channels' of each convolutional layer is set to the number of filters in the layer before it. Batch normalisation
(BN) has also been incorporated. Although preventing overfitting is not BN's primary purpose, training in mini-batches
does have a regularising effect. The paper that introduced BN recommended applying it before the activation function
to address covariate shift, and it can be used sparingly or liberally. As an example, consider the 'features' of the
VGG11 architecture with batch normalisation enabled.


Batch normalisation was not used in the original VGG, but it is now routinely used in VGG models. Even though
convolutional layers are translation invariant and VGG models can handle images as small as 32x32, our images still
need to be resized: the pre-trained 'classifier' expects particular features to appear at particular locations within
the flattened 512x7x7 output of the 'features' module that follows the adaptive pooling layer. If the pre-trained
model is fed images prepared differently from those it was trained on, the 'classifier' receives features in
unexpected positions, its predictions become incorrect, and accuracy drops.


Data Preprocessing:

Several data-processing requirements apply when using pre-trained models. All of the pre-trained models expect
mini-batches of three-channel RGB images of shape 3 x H x W, where H and W are at least 224. The images must be scaled
to the range 0 to 1 and then normalized with mean [0.485, 0.456, 0.406] and standard deviation [0.229, 0.224, 0.225].
So, rather than computing means and standard deviations from our own data, we resize our images (256x256 in this
dataset) to 224x224 and normalize them with these pre-defined statistics. We must use exactly the same statistics that
were used to train the pre-trained model, so that the images we feed it have the same colour distribution as its
original training images. To see why, suppose the original training set contained many dark green images, with a
green-channel mean of perhaps 0.2, while our dataset contains many light green images with a green-channel mean of
roughly 0.8. The pre-trained model learned to treat pixels normalized to zero (after subtracting the mean) as dark
green; if we normalized with our own statistics instead, bright green pixels would be mapped to zero, and the model
could wrongly conclude that a light green image is actually a dark green one, producing inaccurate predictions. The
'Resize' transform takes care of resizing; we only need to specify the target size. Because the images are larger,
there is more room to work with, so we can apply a little more rotation and cropping in the 'RandomRotation' and
'RandomCrop' transforms. The pre-trained mean and standard deviation are passed to the normalization transform exactly
as our own computed statistics would be. Finally, we plot a few images, de-normalizing them first so that they are
displayed in the correct colours.
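
A sketch of the corresponding transform pipeline; the rotation angle, crop padding, and crop size are assumptions, but the pre-trained means and standard deviations are the values quoted above:

from torchvision import transforms

pretrained_means = [0.485, 0.456, 0.406]
pretrained_stds = [0.229, 0.224, 0.225]

train_transforms = transforms.Compose([
    transforms.Resize(224),
    transforms.RandomRotation(5),
    transforms.RandomCrop(224, padding=10),
    transforms.ToTensor(),                                   # also scales pixel values to [0, 1]
    transforms.Normalize(mean=pretrained_means, std=pretrained_stds),
])

test_transforms = transforms.Compose([
    transforms.Resize(224),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=pretrained_means, std=pretrained_stds),
])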

PCA:

Model Development

Classification with SVM:


Confusion Matrix

(2) Subspace Transfer Learning [60pts]: for the FV features from step (1), compute their PCA and plot the eigenvalues,
choose an appropriate low-dimensional PCA embedding, then apply LDA and LPP learning, and plot the 45-class confusion
map with L1O SVM classification:
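
A minimal sketch of this subspace pipeline on the FV features from step (1); LPP is omitted because scikit-learn has no built-in implementation, and the PCA dimension and SVM settings are assumptions:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.svm import LinearSVC

def subspace_eval(fv_features, labels, pca_dim=128):
    # fv_features: n_samples x fv_dim array of Fisher Vectors; labels: n_samples class ids.
    pca = PCA(n_components=pca_dim).fit(fv_features)
    plt.plot(pca.explained_variance_)                        # eigenvalue plot used to pick the embedding size
    plt.xlabel('component'); plt.ylabel('eigenvalue'); plt.show()

    x_pca = pca.transform(fv_features)
    lda = LinearDiscriminantAnalysis(n_components=44)        # at most C - 1 = 44 dimensions for 45 classes
    x_lda = lda.fit_transform(x_pca, labels)

    # Leave-one-out is expensive on the full set; a subset or k-fold split may be substituted in practice.
    acc = cross_val_score(LinearSVC(C=1.0), x_lda, labels, cv=LeaveOneOut()).mean()
    return acc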

Fisher Vector:
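
A minimal sketch of the FV encoding applied on top of the PCA and GMM from part (1); the power and L2 normalisation steps of the improved Fisher Vector are assumed:

import numpy as np

def fisher_vector(descriptors, pca, gmm):
    # descriptors: one image's local pool5 descriptors (e.g. 49 x 512); pca/gmm as fitted in trainGMM.
    x = pca.transform(descriptors)                           # 49 x kd
    N = x.shape[0]
    q = gmm.predict_proba(x)                                 # soft assignments, N x nc
    mu, w = gmm.means_, gmm.weights_
    sigma = np.sqrt(gmm.covariances_)                        # diagonal covariances

    parts = []
    for k in range(gmm.n_components):
        diff = (x - mu[k]) / sigma[k]
        u_k = (q[:, k, None] * diff).sum(0) / (N * np.sqrt(w[k]))                 # first-order statistics
        v_k = (q[:, k, None] * (diff ** 2 - 1)).sum(0) / (N * np.sqrt(2 * w[k]))  # second-order statistics
        parts.extend([u_k, v_k])
    fv = np.concatenate(parts)                               # 2 * kd * nc dimensions

    fv = np.sign(fv) * np.sqrt(np.abs(fv))                   # power normalisation
    return fv / (np.linalg.norm(fv) + 1e-12)                 # L2 normalisation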


Standard Deviation Vs Fisher Vector

PCA:


PCA Plot


Eigenvalue Plot

LDA

LPP


SVM


PCA + SVM

LDA + SVM


Confusion Matrix

(3) Transfer learning with MLP [50pts]: for all 4 FVs with the different kd x nc combinations, design an MLP network
(dense linear projection) to aggregate them with a softmax loss, and show the final accuracy and confusion map. Hint: a
suggested MLP architecture:

Creating the MLP architecture
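
The suggested architecture figure is not reproduced here; the sketch below is one plausible reading of a dense-projection MLP trained with a softmax (cross-entropy) loss, and the hidden size and dropout rate are assumptions:

import torch.nn as nn

class FVMLP(nn.Module):
    # Dense linear projection over a Fisher Vector, followed by a 45-way classification layer.
    def __init__(self, fv_dim, hidden_dim=1024, n_classes=45):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(fv_dim, hidden_dim),
            nn.ReLU(inplace=True),
            nn.Dropout(0.5),
            nn.Linear(hidden_dim, n_classes),   # logits; nn.CrossEntropyLoss applies the softmax
        )

    def forward(self, x):
        return self.net(x)

# One model per kd x nc combination, e.g. kd=16, nc=64 gives a 2 * 16 * 64 = 2048-dimensional FV.
model = FVMLP(fv_dim=2 * 16 * 64)
criterion = nn.CrossEntropyLoss()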


We reuse the learning rate finder from one of the earlier notebooks. When fine-tuning a model that has already been
trained, the learning rate is usually much lower, so the optimizer is first created with an initial learning rate well
below the one we intend to use later. We then specify the 'device' the model will run on (a GPU, if one is available),
define the 'criterion' (loss function), and move both the model and the criterion onto the device. Next we define the
class used for the learning rate finder; after running it, we plot the loss obtained for each batch against the
learning rate values that were tried. The next step is to create an optimizer that uses discriminative fine-tuning
with the learning rate we found. Discriminative fine-tuning is built on the idea that different layers of the model
should have different learning rates: the earlier layers of a neural network learn more general features, while the
later layers learn more task-specific ones.


Consequently, the pre-trained weights of the early layers should be adjusted only slightly, if at all, since the
generic features they extract should suit almost any task. Note that discriminative fine-tuning only makes sense when
transferring knowledge from a trained model; it is not needed when training a model from randomly initialised weights.
PyTorch lets us set a different learning rate for each parameter in the model: the optimizer is given a list of
dictionaries, where each dictionary contains a list of parameters ('params') together with any arguments that override
the ones passed directly to the optimizer. Rather than using a different learning rate for every layer, we split the
parameters into two groups: 'features', containing all of the convolutional layers, and 'classifier', containing all
of the linear layers. The 'classifier' group uses the FOUND_LR passed directly to the optimizer, while the 'features'
group uses FOUND_LR / 10. As a result, the convolutional layers learn considerably more slowly than the linear layers.
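
A sketch of the two parameter groups described above; the specific FOUND_LR value and the choice of Adam are assumptions:

import torch.optim as optim
import torchvision.models as models

model = models.vgg16(pretrained=True)    # .features holds the conv layers, .classifier the linear layers

FOUND_LR = 5e-4                          # placeholder: the real value comes from the learning rate finder

params = [
    {'params': model.features.parameters(), 'lr': FOUND_LR / 10},   # conv layers learn ten times more slowly
    {'params': model.classifier.parameters()},                      # linear layers use the lr passed below
]
optimizer = optim.Adam(params, lr=FOUND_LR)
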
We then train the model. Training takes considerably longer than before, because the images have been scaled up and
the model has many more parameters. However, transfer learning lets us train for far fewer epochs while reaching a
much higher accuracy: within just five epochs we achieve a validation accuracy of 94%, and the test accuracy is
slightly lower at 92%. The model is analysed in the same way as before. We first obtain the predictions for every
example in the test set, then sort the incorrect predictions by how confident the model was in them and examine which
predictions were wrong. Next, the model's representations are plotted with principal component analysis and then with
t-SNE. Only the output representations are plotted here, not the intermediate ones: the output has only 45 dimensions
(one per class), whereas the intermediate representations have more than 25,000 dimensions and would require a large
amount of RAM to hold in memory. The t-SNE embeddings are computed on only a small sample of the representations,
because t-SNE takes far longer to compute than PCA.
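
A sketch of the representation plots; the function name and sample size are assumptions, and outputs/labels are the arrays of model outputs and true labels gathered on the test set:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

def plot_representations(outputs, labels, n_tsne=5000):
    # outputs: n_samples x n_classes array of model outputs; labels: n_samples true class ids.
    pca_2d = PCA(n_components=2).fit_transform(outputs)
    plt.scatter(pca_2d[:, 0], pca_2d[:, 1], c=labels, s=3, cmap='tab20')
    plt.title('PCA of output representations'); plt.show()

    idx = np.random.choice(len(outputs), size=min(n_tsne, len(outputs)), replace=False)
    tsne_2d = TSNE(n_components=2).fit_transform(outputs[idx])       # t-SNE only on a subsample
    plt.scatter(tsne_2d[:, 0], tsne_2d[:, 1], c=labels[idx], s=3, cmap='tab20')
    plt.title('t-SNE of output representations'); plt.show()
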
The trained filters of the model clearly perform a variety of edge-detection operations, as well as blurring and
colour inversion. Although they are harder to interpret than the pre-trained AlexNet filters shown in the previous
notebook, they are more interesting than the AlexNet filters trained from scratch.


Interestingly, several filters are completely black, which means their filter weights are almost exactly zero.

Code and Outputs:
