
Evolutionary Computation-based deep learning approach to monocular depth estimation

Abstract- Estimation of a high-quality depth map of the surrounding environment using only monocular RGB
images is an ill-posed problem in the field of Computer Vision. With the advent of deep learning, remarkable
results have been obtained for monocular depth estimation; however, research is still ongoing to further reduce the
depth regression error and to maximize the percentage of accurately predicted pixels in the depth map. Deep neural
network-based depth estimation approaches usually suffer from a generalization problem: a depth estimation model
trained on one depth dataset might not perform well on a different set of captured RGB images or on real-time
datasets. In this paper, an attempt is made to deal with this generalization problem by proposing an adaptive
evolutionary computation-based ensemble learning approach for dense depth prediction. The depth estimation
models in the ensemble are trained on the NYU V2 depth dataset and are based on four different pre-trained
benchmark feature extractors: ResNet-50, DenseNet-161, DenseNet-169, and DenseNet-201. The depth value
predicted by each model for a given RGB image is assigned a best-suited weight in the range [-0.1, 0.1], and these
weights are computed using an evolutionary Genetic algorithm. The weighted predictions are then combined, and an
optimized depth map is obtained as the output of the ensemble. Comparison of the obtained results with
state-of-the-art depth estimation techniques verifies the effectiveness of the proposed ensemble learning approach
for monocular depth estimation.

1. Introduction

Dense depth maps of the surrounding environment play an influential role in many trending applications of
Computer Vision including localization of mobile robots [21,29], visual odometry [18], densification of sparse maps
generated by SLAM algorithms [17,28], autonomous navigation [22], augmented reality [23], virtual reality [24],
etc. The traditional methods generally used for estimating the depth maps of the environment are based on
RGB-Depth cameras, lidar, laser sensors, stereo cameras, and geometrical processing techniques. However, the
sensors used for estimating the depth of the RGB scene are expensive and heavy-weight, and thus cannot be used in
all kinds of commercial applications. In order to overcome these limitations and to make the depth estimation task
more feasible and computationally faster, several deep learning-based approaches have been proposed. Deep
learning-based depth estimation techniques have demonstrated a significant improvement in this area, although
collecting labeled depth datasets for real-world applications is an exhausting and time-consuming task [19].
Supervised deep neural networks trained on benchmark depth datasets do not perform well on a completely unseen
real-world dataset; hence, an adaptive refinement mechanism is required to achieve better performance in real-time
computer vision applications [20]. To solve this generalization problem of deep learning-based depth estimation
models, a novel ensemble learning approach is proposed in which the predictions from different depth estimation
models are combined, and each model is assigned a weight in the range [-0.1, 0.1] according to its feature extraction
performance. The optimized weights are computed using an evolutionary
algorithm [25]. The deep learning models used to determine the depth of the scene have encoder-decoder network
architecture. The encoder network is used to extract the dense depth features from the input RGB image and these
encoded features are stacked together and fed to the decoder part of the network for the generation of the depth
image. The pre-trained benchmark deep convolutional neural networks are generally used as an encoder in the depth
estimation network and the decoder network is made up of upsampling blocks and deconvolution layers. In this
paper, the depth estimation models used in the ensemble are based on four different feature-extracting backbone
networks [1,2]: ResNet-50, DenseNet-161, DenseNet-169, and DenseNet-201. One more advantage of using
ensemble models is that by combining different depth estimation models based on different encoder networks one
can leverage the feature extraction characteristics of all the encoder networks and the combined prediction is
regarded as the output of the ensemble model. The feature identification properties of the encoder networks are not
constant and vary from scene to scene. For one RGB image, residual networks might perform better, while for
another, a higher-quality dense depth map may be obtained using densely connected networks. In ensemble
models, the feature detection characteristics of different deep neural networks are utilized and the generalized
performance of the model is achieved for all types of datasets. The predictions from all four pre-trained depth
estimation models are assigned different weights in the range [-0.1, 0.1] and the optimized value for the set of these
weights is obtained using a Genetic algorithm [26]. The adaptive weight learning approach for depth estimation
models promises generalized performance on real-time and synthetic datasets. For the RGB image sequence, the
adaptive weight learning approach will find out the set of optimized coefficients for predictions from different depth
estimation models in the ensemble for each RGB scene. The model whose depth prediction is close to the ground
truth will be assigned more weight as compared to the other models in the ensemble. The scale of the generated
depth image is relative to the camera, which differs from the real-world scale. Therefore, each estimated depth map
needs to be converted to the real-world scale by multiplying it by the ratio of the medians of the ground truth depth
and the predicted depth. The weight-learning evolutionary algorithm is executed over a number of iterations until
the change in the value of the objective function falls below a threshold. As illustrated in the paper, the
evaluation results of the proposed and developed ensemble model are comparable to the existing state-of-the-art
depth estimation techniques. To the best of our knowledge, this is the first paper that incorporates the use of
evolutionary computation for learning weights assigned to the depth estimation models present in the ensemble
model developed for the task of depth estimation. The main contributions of the presented research can be
summarized as follows:

● The use of transfer learning in leveraging four different pre-trained benchmark CNNs as backbone
networks of the depth estimation models combined to form an ensemble.
● For the first time, an evolutionary computation-based adaptive scheme for determining the weights assigned to
different CNN models in the ensemble is introduced in the field of depth estimation for improving the
quality of predicted depth maps and minimizing the depth regression error.
● The demonstrated ensemble learning approach for monocular depth estimation leverages the feature
extraction properties of four benchmark pre-trained encoder networks (ResNet-50, DenseNet-161,
DenseNet-169, and DenseNet-201), thus dealing with the generalization problem of supervised deep
learning models.
The rest of the paper is organized as follows. The existing research in the field of depth estimation is reviewed in
Section 2. In Section 3, an overview of the evolutionary algorithm used is presented. Section 4 illustrates the
proposed methodology for monocular depth estimation. In Section 5, all the experimental details and results are
presented. Finally, Section 6 concludes the paper.

2. Related Work

In this section, a review of existing works related to monocular depth estimation, 3D scene reconstruction,
sparse map densification, and encoder-decoder network architectures is presented.
Eigen et al., 2015 [4] proposed a multi-scale convolutional architecture in which coarse and
unsmooth depth predictions are first made by one network, and then another network is used to refine those
estimations progressively. In this technique, the image representations are learned directly from raw pixels without
the need for hand-crafted features like over-segmentation, super-pixels, or contours. This deep learning approach of
learning feature representations directly from raw pixels is leveraged in many works including estimation of the
surface normal[5], utilizing conditional stochastic fields for refining accuracy [6-8], or changing the machine
learning problem from regression to classification [27]. Nowadays, the supervised learning techniques that are used
widely for monocular depth estimation are based on the encoder-decoder approach, where the encoder acts as a
dense feature extractor and reduces the resolution of the input image and the decoder is made up of bilinear
upsampling layers to increase the resolution of the encoded image and to generate the depth image with size equal to
that of the ground truth depth image. Following this strategy, Sharma et al., 2017 [12] proposed a deep CNN-based
method for establishing a mapping between RGB image and their corresponding depth map. In this method,
DenseNet-161 [2] is used as an encoder for the extraction of dense features from input monocular RGB image and
then a decoder made up of four deconvolution layers is used to obtain the depth map from encoded features. The
results presented by the authors are better than those obtained using traditional depth estimation methods. In order to
make the depth estimation model more robust and resistant to noise in images, Ma and Karaman (2018) [9] proposed
a ResNet-50-based encoder-decoder network trained with two kinds of inputs: an RGB image and a sparse depth
image obtained using a low-resolution depth camera. Furthermore, this idea can also be used for the densification of
sparse depth maps obtained using vision-based SLAM (Simultaneous Localization and Mapping) algorithms, by
fusing them with the corresponding RGB image of the scene to get a high-resolution depth map with less RMSE
when compared to the depth images obtained with the single RGB image. In the proposed network, the fully
connected layer of ResNet is replaced by up-projection blocks proposed by Laina et al., 2016 [15] for upscaling of
the depth map. On replacing the backbone feature extraction network with DenseNet-169 [2], Alhashim and
Wonka (2019) [10] obtained high-quality depth results that are better than those of existing depth estimation techniques.
Recently, Swaraja et al., 2021 [14] leveraged the same transfer learning-based approach for monocular depth
estimation. However, they discussed the results obtained using two different encoder networks for learning feature
representations from a single RGB image. Those two pretrained convolutional neural networks are ResNet-50 [1]
and Google EfficientNet-B0 [13]. RMSE obtained in the case of EfficientNet-B0 is 7% less than the one obtained
with ResNet-50. The decoder used in this depth estimation work is very simple and consists of only a single bilinear
upsampling layer for generating depth maps with the same resolution as the ground truth image.
Due to the unavailability of large amounts of labeled depth data, several efforts have been
made in the field of semi-supervised learning, where the network is trained on a limited amount of labeled data
while the rest is unlabeled or synthetically generated. In many cases of semi-supervised learning-based
depth estimation, the labeled depth data is combined with the surface normal, a feature representation of the RGB
image that carries information closely related to depth. The surface normal can be computed from the tangent to the
local plane of a 3-D object in the RGB image, and this local tangent plane can in turn be determined from the depth;
hence, there is a strong inter-dependency between the surface normal and depth. Leveraging this inter-dependency,
Huang et al. (2020) [16] proposed and developed a hybrid approach based on transfer learning and surface normal
guidance for high-quality monocular depth estimation. In the proposed scheme, first, the coarse depths are predicted
by encoder-decoder architecture, where the encoder is pretrained Densenet-121 [2] model for dense feature
extraction and the decoder is made up of upsampling blocks for the upscaling of the depth map, then the generated
coarse depth map is fed to another network known as red-green-blue-depth (RGBD) surface normal (RSN) for the
refinement and densification of the initial depth map obtained using transfer learning. The RSN network also has an
encoder-decoder architecture that generates the surface normal maps by combining the initial coarse depth map
produced by the DenseNet-121-based depth estimation network with RGB images. Then finally, the refinement
network leverages the surface normal maps for enhancing the quality of low-quality depth maps produced at the first
stage. From this discussion on existing literature for high-quality monocular depth estimation, it can be seen that the
pretrained residual networks (ResNet) and densely connected networks (DenseNet) are used in most of the works for
dense feature extraction, and we observe that highly plausible depth predictions are synthesized by deploying these
pretrained networks as the backbone networks of the depth estimation models. This behavior of the existing deep
learning-based depth estimation models motivates us to present an ensemble learning approach for solving the
ill-posed problem of monocular depth estimation. The idea behind the proposed approach is to leverage the feature
extraction and learning characteristics of the benchmark pretrained deep convolutional neural networks (ResNet-50,
DenseNet-161, DenseNet-169, DenseNet-201) by creating an ensemble of four depth estimation models having
these networks as their respective encoders, and then using a Genetic algorithm to determine the weights assigned
to these models so that minimum regression error is obtained in the depth predicted by the ensemble.

3. Evolutionary Optimization Algorithm

Evolutionary algorithms are an important class of Evolutionary Computation (EC) methods, generally known as
nature-inspired metaheuristics. Two well-known characteristics of these optimization algorithms are the
representation of the solution space in terms of a population and the iterative repetition of a search mechanism with
random exploration of the solution space until convergence is achieved within the population. The most important
consideration in population-based search optimization algorithms is maintaining a balance between the algorithm's
capabilities of exploration and exploitation. Exploration is the ability of the algorithm to search broadly over the
randomly generated population in the solution space; the generated population should maintain diversity, that is, it
should cover many possible sets of solution values. This property of metaheuristic algorithms brings robustness to
the search: if some individuals (particles or genes) get trapped in a local minimum of a non-convex objective
function, others might still be able to find a solution significantly close to the global minimum. In the exploitation
strategy, on the other hand, the entire population focuses on promising solutions close to the global optimum after
receiving valuable information about the solution and search space. This type of informed search leads to faster
convergence of the algorithm; the more reliable and valid the shared information is, the more quickly a converged
population is obtained. Thus, metaheuristics-based evolutionary algorithms can successfully balance these two
abilities by following nature-inspired operations that maintain an equilibrium between exploration and exploitation.
These algorithms begin from a state in which the population of individuals is stochastically generated, and
processing steps are then designed for fast convergence toward a solution close to the global optimum, at which the
value of the objective function is minimized, while balancing exploration and exploitation. The most widely used
evolutionary algorithms are genetic algorithms and genetic programming.
In this work, a Genetic algorithm is used for assigning optimized weights to different depth estimation
models present in the ensemble. The detailed procedure of the genetic algorithm is discussed in the following subsection.

3.1 Genetic Algorithm

Genetic algorithm [26] is one of the classical and most commonly used evolutionary algorithms. The population
generated in the initial stage is a set of individuals, known as chromosomes, and each chromosome is a set of genes,
which are usually represented by a binary-coded string of 1's and 0's. For each chromosome or individual, the value
of a fitness function is computed, in which the genes are the variables upon which the fitness function depends. In
the parent selection operation, two individuals with minimum fitness values are selected for crossover and mutation.
These two operations are responsible for gene manipulation and perform exploration within the evolution space, or
sometimes outside the distributed population. The simulation of the working principle of the genetic algorithm, that
is, the evolution of individuals, is formed by the collaborative execution of these two operations. During crossover,
the genes of the selected parents are swapped up to a crossover point. The newly generated individuals contain
partial blocks of genes from the selected parents and hence inherit a certain degree of parental character. Next,
mutation is performed to introduce new characteristics and features into the individuals that are not inherited from
the selected parents; this is done by flipping a random number of bits of the individual with a small probability. The
mutation operation helps individuals escape from local minima. Survivor selection, that is, the replacement of the
old generation with the new offspring, is performed for the alternation of generations. One iteration of the
evolutionary Genetic algorithm is represented in Fig. 1 for a visual understanding of its evolution methodology.
Fig. 1 Example of one generation cycle of Genetic algorithm.
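
As an illustration, the two operators described above can be sketched in Python as follows. This is a minimal,
generic sketch; the function names and the example chromosomes are our own choices, not taken from the paper's
implementation:

```python
import random

def crossover(parent1, parent2):
    """One-point crossover: children swap gene segments beyond a random point."""
    point = random.randint(1, len(parent1) - 1)      # crossover point
    child1 = parent1[:point] + parent2[point:]
    child2 = parent2[:point] + parent1[point:]
    return child1, child2

def mutate(chromosome, prob):
    """Bit-flip mutation: each bit is flipped with a small probability."""
    return [bit ^ (random.random() < prob) for bit in chromosome]

# Example: one application of the two operators on 8-bit chromosomes
p1, p2 = [0] * 8, [1] * 8
c1, c2 = crossover(p1, p2)
c1 = mutate(c1, prob=0.05)
```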

4. Proposed Methodology

In this section, the proposed novel scheme for monocular depth estimation is presented. First, the network
architecture of the deep learning-based depth estimation model is discussed. Then, a brief overview of different
state-of-the-art CNN models whose pretrained weights have been used in the training of the encoder networks in the
depth estimation models is presented. Next, the approach of applying ensemble learning to the depth estimation
problem for obtaining better-quality depth predictions is discussed. Then finally,
a genetic algorithm-based optimization strategy for determining optimized values of weights assigned to four
different depth estimation models present in the ensemble is demonstrated.

4.1 Depth Estimation Network

The CNN architecture discussed in this paper for monocular depth estimation follows an encoding-decoding strategy
in which the encoder is made up of multiple convolutional and pooling layers and acts as a dense feature extractor
and reduces the resolution of the original RGB image, whereas, the decoder is made up of upsampling blocks in
which skip connections are created between upsampling layers and encoded features to generate the high-resolution
depth map as an output of the network. Fig. 2 represents the proposed encoder-decoder architecture. In this work, we
analyze the performance of four different pretrained deep CNN models which have been used for the extraction and
stacking of depth attributes of the RGB image in the depth estimation network. These benchmark networks are
ResNet-50, DenseNet-161, DenseNet-169, and DenseNet-201. In the following, some characteristics of these
backbone networks used in encoding the feature maps of the monocular RGB images are discussed.
Fig. 2 Encoder-Decoder-based network architecture for monocular depth estimation.
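
For concreteness, a minimal sketch of such an encoder-decoder depth network is given below, assuming a
torchvision pretrained DenseNet-169 backbone. The decoder here is simplified (the skip connections between the
upsampling layers and the encoded features are omitted for brevity), so it illustrates the structure rather than the
exact architecture used in this work:

```python
import torch
import torch.nn as nn
import torchvision.models as models

class DepthEstimator(nn.Module):
    """Encoder-decoder depth network: a pretrained backbone encodes dense
    features; upsampling blocks decode them into a one-channel depth map."""
    def __init__(self):
        super().__init__()
        backbone = models.densenet169(weights="DEFAULT")
        self.encoder = backbone.features          # dense feature extractor (1664 channels out)
        self.decoder = nn.Sequential(             # simplified upsampling decoder
            nn.Conv2d(1664, 832, kernel_size=3, padding=1), nn.ReLU(),
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            nn.Conv2d(832, 416, kernel_size=3, padding=1), nn.ReLU(),
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            nn.Conv2d(416, 208, kernel_size=3, padding=1), nn.ReLU(),
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            nn.Conv2d(208, 1, kernel_size=3, padding=1),   # single-channel depth map
        )

    def forward(self, rgb: torch.Tensor) -> torch.Tensor:
        return self.decoder(self.encoder(rgb))
```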

4.1.1 ResNet
Residual Network was originally proposed by He et al., 2015 [1]. The objective of the implementation of the
ResNet is to train deep neural networks efficiently. Generally, on increasing the number of layers in deep
convolutional neural networks, the performance of the model gets saturated and can also degrade slowly after a
point due to the problem of vanishing gradients. The residual network solves this problem by introducing skip
connections as an alternate bypass for the gradient between the first and last layers of the residual block. Skip
connections add the output of the previous convolution layer to the output of the stacked layers in the residual block.
In this analysis, among the four different depth estimation models in the ensemble, ResNet-50 is used for encoding
depth feature representations; it is a 50-layer network whose residual (bottleneck) blocks are each made up of 3
convolution layers.
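
A minimal sketch of such a bottleneck residual block (our own simplified version, not the exact torchvision
implementation) is:

```python
import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    """ResNet bottleneck block: three conv layers plus a skip connection
    that adds the block input to the stacked-layer output."""
    def __init__(self, channels: int, mid: int):
        super().__init__()
        self.stack = nn.Sequential(
            nn.Conv2d(channels, mid, 1), nn.BatchNorm2d(mid), nn.ReLU(),
            nn.Conv2d(mid, mid, 3, padding=1), nn.BatchNorm2d(mid), nn.ReLU(),
            nn.Conv2d(mid, channels, 1), nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.relu(self.stack(x) + x)  # skip connection bypasses the stack
```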

4.1.2 DenseNet
DenseNet stands for densely connected convolutional neural network, proposed by Huang et al., 2016 [2] to deal
with the vanishing gradient issue efficiently. In DenseNet, each layer receives the feature maps of all preceding
layers as input (by concatenation) and, similarly, forwards its own feature maps to all subsequent layers in the
network. Through this mechanism, every layer of the network obtains collective knowledge from all its preceding
layers. This idea makes the network thinner and more compact, with fewer channels compared to traditional
convolutional neural networks used for image classification problems. Depending upon the number of layers in the network, there
are several versions of DenseNet, such as DenseNet-121, DenseNet-161, DenseNet-169, and DenseNet-201. In this
paper, we present the results obtained with DenseNet-161, DenseNet-169, and DenseNet-201.
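
A minimal sketch of this dense connectivity pattern (simplified relative to the actual DenseNet implementation,
which also uses bottleneck and transition layers) is:

```python
import torch
import torch.nn as nn

class DenseLayer(nn.Module):
    """DenseNet layer: concatenates the feature maps of all preceding
    layers with its own output (concatenation, not addition)."""
    def __init__(self, in_channels: int, growth_rate: int):
        super().__init__()
        self.conv = nn.Sequential(
            nn.BatchNorm2d(in_channels), nn.ReLU(),
            nn.Conv2d(in_channels, growth_rate, 3, padding=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.cat([x, self.conv(x)], dim=1)  # pass collective knowledge forward
```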

4.2 Ensemble learning for monocular depth estimation

Ensembling is the mechanism of combining multiple learning techniques to get the advantage of their collective
performance, that is, to combine the predictions from existing trained models in some way to obtain better results
than the individual models. In this work, we create the ensemble model using four depth estimation models based on
different feature encoders (ResNet-50, DenseNet-161, DenseNet-169, DenseNet-201). The predictions from
different models present in the ensemble are assigned different weights as computed by a metaheuristic-based
Genetic algorithm. The depth estimated by the ensemble model is calculated as the sum of the predictions of
individual depth estimation models present in the ensemble where each prediction value is multiplied by the
corresponding optimal weight as learned by the optimization algorithm. The evaluation results presented in the next
section confirm that the ensemble optimized by the genetic algorithm performs better than the individual depth estimation models. Equations
(1) and (2) represent the final estimated output of the ensemble and the objective function that needs to be
minimized by the optimization algorithm, respectively. Fig. 3 represents the variation of this fitness
function over 100 iterations for the Genetic algorithm. The optimal weights assigned to each depth estimation model
in the ensemble are computed separately for each image in the depth dataset for achieving the best depth prediction
with minimum regression error corresponding to the particular RGB image. Fig. 4 represents the design framework
of the proposed ensemble learning-based depth estimation approach.
In equations (1) and (2),
Ensemblepred is the depth estimated by the ensemble model;
λ𝑖 is the weight assigned to the i-th model of the ensemble;
𝑝𝑟𝑒𝑑𝑖 is the prediction of the i-th model of the ensemble;
target is the ground truth depth value;
f is the objective function which has to be minimized by the genetic algorithm for the computation of the λ𝑖 values.
Ensemble_pred = Σᵢ₌₁⁴ (λᵢ * predᵢ)    (1)

f(λ₁, λ₂, λ₃, λ₄) = | Σᵢ₌₁⁴ (λᵢ * predᵢ) − target |    (2)
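
In code, equations (1) and (2) can be sketched as follows. Here we assume the per-pixel error of equation (2) is
reduced to a scalar by taking the mean absolute difference over the depth map, which is one plausible reading of the
objective; the function names are our own:

```python
import numpy as np

def ensemble_pred(weights: np.ndarray, preds: np.ndarray) -> np.ndarray:
    """Equation (1): weighted sum of the four models' depth maps.
    preds has shape (4, H, W); weights has shape (4,)."""
    return np.tensordot(weights, preds, axes=1)

def fitness(weights: np.ndarray, preds: np.ndarray, target: np.ndarray) -> float:
    """Equation (2): regression error of the weighted ensemble against the
    ground truth, reduced to a scalar here as a mean absolute error."""
    return float(np.mean(np.abs(ensemble_pred(weights, preds) - target)))
```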

Fig. 3 Optimization of the fitness function by Genetic algorithm over 100 iterations.

The use of metaheuristic computation in optimizing deep learning models is an emerging field of research. In this
work, the evolutionary genetic algorithm is leveraged for the attainment of better depth predictions from monocular
RGB images. The predictions from four different benchmark depth estimation models are combined together and
each model is allotted a different weightage in the ensemble prediction according to its feature extraction
characteristics. In the following subsection, the procedure for the genetic algorithm is presented in tabular form.
Fig. 4 Design framework of the proposed ensemble learning-based monocular depth estimation method.

4.2.1 Genetic Algorithm


Algorithm 1: Learning weights for four different depth estimation models present in the ensemble using
Genetic algorithm

Step 1: [Initialization]
A population of 100 chromosomes (solutions) is randomly generated, where each chromosome consists of 4 genes
and the value of each gene lies in the interval [-0.1, 0.1].
Step 2: [Parents selection]
The value of the objective function (represented in equation (2)) is calculated for each chromosome and then two
parents with minimum fitness value are selected.
Step 3: [Crossover]
The genes between the selected parents are exchanged up to a crossover point which is a randomly generated
integer between 1 and the length of the chromosome, that is the crossover point lies in the range [1,4].
Step 4: [Mutation]
The bits of selected parents (chromosomes) are flipped with a small probability, ‘mut’ which is computed as
𝑚𝑢𝑡 = 1/(𝑛𝑏𝑖𝑡𝑠 * 𝑙) (3)
where nbits is the number of bits used to represent each gene,
l is the number of gene bounds, that is, the number of genes in a chromosome.
In this implementation-
nbits = 16,
bounds = [[-0.1, 0.1], [-0.1, 0.1], [-0.1, 0.1], [-0.1, 0.1]],
l = 4,
Therefore, after substituting these values in equation (3), the value of the mutation probability, mut comes out to
be 0.015625.
Step 5: [Termination]
On the successful completion of 100 iterations, the algorithm terminates and returns the optimized set of weights
used by the ensemble model in depth computation; otherwise, go to Step 2 of the algorithm.
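
A minimal, self-contained sketch of Algorithm 1 is given below. The gene-level crossover point, the bit-decoding
scheme, and the helper names are our own interpretation of the steps above, not the authors' code:

```python
import random
import numpy as np

NBITS, NGENES, POP, ITERS = 16, 4, 100, 100
BOUNDS = [(-0.1, 0.1)] * NGENES
MUT = 1.0 / (NBITS * len(BOUNDS))   # equation (3): 1/(16*4) = 0.015625

def decode(bits):
    """Map each 16-bit gene of a chromosome to a real weight in [-0.1, 0.1]."""
    weights = []
    for i, (lo, hi) in enumerate(BOUNDS):
        value = int("".join(map(str, bits[i * NBITS:(i + 1) * NBITS])), 2)
        weights.append(lo + (hi - lo) * value / (2 ** NBITS - 1))
    return np.array(weights)

def run_ga(fitness):
    """fitness: callable mapping a weight vector to the equation-(2) error."""
    # Step 1: random population of 100 binary-coded chromosomes (4 genes x 16 bits)
    pop = [[random.randint(0, 1) for _ in range(NBITS * NGENES)] for _ in range(POP)]
    best, best_f = None, float("inf")
    for _ in range(ITERS):                                        # Step 5: 100 iterations
        scores = [fitness(decode(c)) for c in pop]
        i1, i2 = sorted(range(POP), key=scores.__getitem__)[:2]   # Step 2: two best parents
        if scores[i1] < best_f:
            best, best_f = decode(pop[i1]), scores[i1]
        children = []
        while len(children) < POP:
            cut = random.randint(1, NGENES) * NBITS               # Step 3: crossover point in [1, 4] genes
            c1 = pop[i1][:cut] + pop[i2][cut:]
            c2 = pop[i2][:cut] + pop[i1][cut:]
            for c in (c1, c2):
                c = [b ^ (random.random() < MUT) for b in c]      # Step 4: bit-flip mutation
                children.append(c)
        pop = children                                            # survivor selection
    return best, best_f
```

Combined with the fitness sketch after equations (1) and (2), run_ga(lambda w: fitness(w, preds, target)) would
return a weight vector usable in equation (1).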

5. Experiments and Results

In this section, we first discuss the details of the dataset and the loss function used for training the monocular depth
estimation networks. Next, we present a brief description of the network training parameters and the evaluation
metrics to validate our experiments, and then finally we compare the performance of the proposed ensemble model
based on a bio-inspired weight learning technique (Genetic algorithm) with the existing state-of-the-art depth
estimation methods.

5.1 Dataset used

The NYU V2 depth dataset [3] consists of color (RGB) and ground truth depth images of 464 different indoor scenes
captured with a Microsoft Kinect RGB-D camera at a resolution of 640x480. Of these 464 scenes, we use 249 for
training the model and 215 for validation, following the official split of the dataset. The training dataset
consists of approximately 48,000 synchronized depth-RGB image pairs evenly sampled from the raw video
sequence of the official training dataset. For evaluating the performance of the model, the small labeled dataset of
654 images is used. The resolution of the depth maps generated by the depth estimating encoder-decoder network is
304x228. The RGB images used for training are fed to the model at their original resolution, that is, 640x480,
whereas the ground truth depth images are downsampled and center-cropped to the resolution of 304x228.
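
As a hedged illustration of this preprocessing (the exact resize-and-crop pipeline is not specified in the paper, so
the intermediate size of 240x320 below is an assumption), the transforms could be written as:

```python
import torchvision.transforms as T

# RGB inputs stay at 640x480; ground-truth depth maps are downsampled
# and center-cropped to the 304x228 output resolution of the network.
rgb_transform = T.Compose([
    T.Resize((480, 640)),        # (height, width)
    T.ToTensor(),
])
depth_transform = T.Compose([
    T.Resize(240),               # downsample: short side to 240 (assumed)
    T.CenterCrop((228, 304)),    # crop to the network's output resolution
    T.ToTensor(),
])
```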

5.2 Training loss

The depth estimation networks are trained with the Mean Absolute Error (MAE) loss function, which is commonly
used for optimization in supervised regression problems. The MAE loss minimizes the L1 norm of the difference
between the estimated and ground-truth values:

L(ŷ, y) = || ŷ − y ||₁,

where ŷ represents the estimated depth and y represents the ground truth depth. For our problem, MAE performs
very well, and the depth maps generated by the models trained with this loss function have sharp boundaries rather
than a smooth, low-quality visual representation of edges. An advantage of the MAE loss function is that it does not
penalize heavily for outliers present in the training dataset, as all the errors (differences between predicted and
ground-truth values) are weighted on one linear scale.
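
In PyTorch, this training loss corresponds to the built-in L1 loss; a minimal usage sketch, with random tensors
standing in for real batches, is:

```python
import torch
import torch.nn as nn

criterion = nn.L1Loss()                    # mean absolute error over all pixels

pred = torch.rand(8, 1, 228, 304)          # stand-in batch of predicted depth maps
gt = torch.rand(8, 1, 228, 304)            # stand-in batch of ground-truth depth maps
loss = criterion(pred, gt)                 # scalar MAE, weighted on one linear scale
```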

5.3 Implementation details

The presented work is implemented using PyTorch [11] and trained on the benchmark NYU V2 depth dataset [3]
using a GeForce RTX 2070 GPU with 16 GB memory. The encoders used in the depth estimation models are
pretrained, as they are the benchmark deep convolutional neural networks developed for the task of image
classification. However, random weights are assigned to the decoder. During the training of the network, the ADAM
optimizer is used with a learning rate of 0.001, and the values used for the learning parameters β1 and β2 are 0.9 and
0.999, respectively. A small batch size of 8 is used, and the model is trained for 20 epochs, where in each epoch the
network sees the entire dataset of about 48,000 images, fed to the model in batches of 8.
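
The stated training configuration can be sketched as follows; DepthEstimator, train_loader, and criterion are
placeholder names for the model, data loader, and MAE loss discussed above, not identifiers from the authors' code:

```python
import torch
from torch.optim import Adam

# Assumed components: DepthEstimator from the Section 4.1 sketch and a
# train_loader yielding batches of (rgb, ground_truth_depth) pairs of size 8.
model = DepthEstimator()
criterion = torch.nn.L1Loss()                       # MAE training loss (Section 5.2)
optimizer = Adam(model.parameters(), lr=0.001, betas=(0.9, 0.999))

for epoch in range(20):                             # 20 epochs over ~48,000 images
    for rgb, gt in train_loader:                    # mini-batches of size 8
        optimizer.zero_grad()
        loss = criterion(model(rgb), gt)            # compare predicted and GT depth
        loss.backward()
        optimizer.step()
```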
5.4 Evaluation metrics

The performance of the proposed depth estimation model is evaluated using RMSE (Root-Mean-Squared-Error),
AbsREL (Absolute Relative Error), and Inlier metrics (δi). These evaluation measures are discussed in detail in the
following subsections and their values are given by equations (4), (5), and (6). In equations (4), (5), and (6), yi and ŷi
represent the ground truth and predicted depths respectively.

5.4.1 RMSE
The root mean squared error is calculated as the square root of the mean of the square of the difference between the
predicted value and ground truth value.
rmse = √( mean( (ŷᵢ − yᵢ)² ) )    (4)

5.4.2 AbsREL
The absolute relative error is the mean of the absolute value of the relative difference between the predicted and
ground truth value of depth.
rel = mean( |ŷᵢ − yᵢ| / yᵢ )    (5)

5.4.3 Inlier metrics (δi)


The value of this metric is the percentage of pixels in the predicted depth image whose error relative to the
ground truth depth image is within a certain threshold. The higher the δi, the better the depth prediction.
δᵢ = card({ ŷ : max(ŷ/y, y/ŷ) < 1.25ⁱ }) / card({ y })    (6)

In equation (6), card(·) is the function for computing the cardinality of a set.
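
A minimal NumPy sketch of these three metrics, including the median scaling applied before evaluation (described
in Section 5.5), could look like the following; the function and variable names are our own:

```python
import numpy as np

def evaluate(pred: np.ndarray, gt: np.ndarray):
    """RMSE, AbsREL, and inlier metrics (equations (4)-(6)).
    Assumes gt contains valid (non-zero) depth pixels."""
    pred = pred * (np.median(gt) / np.median(pred))   # median scaling (Section 5.5)
    rmse = np.sqrt(np.mean((pred - gt) ** 2))         # equation (4)
    absrel = np.mean(np.abs(pred - gt) / gt)          # equation (5)
    ratio = np.maximum(pred / gt, gt / pred)
    deltas = [np.mean(ratio < 1.25 ** i) for i in (1, 2, 3)]  # equation (6)
    return rmse, absrel, deltas
```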

5.5 Performance Comparison

In this section we present the evaluation results of the monocular depth estimation models with encoder-decoder
architecture obtained on the NYU V2 test dataset. In Table 1 the test results obtained with individual depth
estimation models based on four different encoder networks (ResNet-50, DenseNet-161, DenseNet-169,
DenseNet-201) are enumerated. Further, these four depth estimation models are selected for forming an ensemble
model. In an ensemble model, different weights are assigned to these depth estimation models and the optimal
values for these weights are obtained using an evolutionary Genetic algorithm. Further, the evaluation results
obtained with the ensemble model are compared with the existing state-of-the-art depth estimation methods in Table
2 and we conclude that our ensemble learning-based depth estimation approach performs better than all the
mentioned monocular depth estimation methods. For the proposed model, all the error metrics are calculated with
the scaled predicted depths, that is, first the estimated depth is multiplied by the ratio of the medians of ground truth
depth and predicted depth, then the obtained scaled depth is used for computing evaluation metrics. Scaling is
required, as the values of the depth pixels present in predicted depth maps are relative to the camera. For a visual
representation of the results, Fig. 5 shows a gallery of depth images estimated by the models based on four different
pretrained backbone networks. And, the depth maps generated by the ensemble of these four depth estimation
models are presented in Fig. 6. From this representation, it can be deduced that the proposed ensemble learning
approach has produced high-quality depth maps with sharp depth edges that, in some cases, appear even slightly
better than the ground truth depth images, with relatively less blurriness.
Table 1: Comparison of the evaluation results obtained with individual depth estimation models based on pretrained
encoder networks.
Encoder Network   RMSE    AbsREL   δ1 < 1.25   δ2 < 1.25²   δ3 < 1.25³

ResNet-50         0.481   0.124    0.847       0.965        0.990

DenseNet-161      0.452   0.116    0.863       0.970        0.993

DenseNet-169      0.464   0.119    0.858       0.969        0.992

DenseNet-201      0.462   0.119    0.857       0.969        0.993

Ensemble          0.422   0.116    0.869       0.974        0.994

Table 2: Comparison of the existing state-of-the-art monocular depth estimation networks with the proposed
ensemble models.
Method                                        RMSE    AbsREL   δ1 < 1.25   δ2 < 1.25²   δ3 < 1.25³

Eigen et al., 2015 [4]                        0.641   0.158    0.769       0.950        0.988

Laina et al., 2016 [15]                       0.573   0.127    0.811       0.953        0.988

Sharma et al., 2017 [12]                      0.549   0.153    0.799       0.950        0.985

Ma and Karaman, 2018 [9]                      0.514   0.143    0.810       0.959        0.989

Alhashim and Wonka, 2019 [10]                 0.465   0.123    0.846       0.974        0.994

Huang et al., 2020 [16]                       0.459   0.122    0.859       0.972        0.993

Swaraja et al., 2021 [14] (ResNet)            0.638   0.168    -           -            -

Swaraja et al., 2021 [14] (EfficientNet-B0)   0.625   0.156    -           -            -

Ensemble (proposed)                           0.422   0.116    0.869       0.974        0.994

According to the results presented in Table 2, the proposed ensemble model performs best on all the assessment
metrics. Hence, we can say that for the discussed ensemble learning-based approach for the monocular depth
estimation problem, the genetic algorithm-based weight learning method provides a good approximation of the
optimal weights that minimize the objective function.

Fig. 5 From top to bottom: RGB images (first row); ground truth depth images (second row); depth images
estimated by ResNet-50 (third row); depth images estimated by DenseNet-161 (fourth row); depth images estimated
by DenseNet-169 (fifth row); depth images estimated by DenseNet-201 (sixth row).
Fig. 6 From top to bottom: RGB images (first row); ground truth depth images (second row); depth images
estimated by Ensemble (third row)

In Fig. 5, the visual representation of depth maps implies that the most effective results are obtained with
DenseNet-161. The depth images obtained with the depth estimation model that uses DenseNet-161 as its dense
feature extractor represent the objects in the surrounding 3-D environment with sharper boundaries and fewer
artifacts compared to the models with other pretrained deep convolutional neural networks. Next, Fig. 6 presents a
gallery of images including RGB images, ground truth depth images, and the depth maps obtained using the
ensemble of the four depth estimation models trained with four different encoders (ResNet-50, DenseNet-161,
DenseNet-169, and DenseNet-201), where a different weightage is assigned to each of the four models and the
assigned weights are obtained using the Genetic algorithm. The depth images generated by the proposed ensemble
model are highly similar to the ground truth images, and some depth scenes obtained with this ensemble are even better than the
ground truth with fewer voids, less blurriness, and more sharpness. Furthermore, the depth images obtained with the
ensemble are better than all the depth maps generated by the individual models (in Fig. 5) in terms of clarity and the
similarity index with the ground truth depth images captured using an RGB-D camera. The proposed approach,
which leverages an ensemble of different depth-estimating encoder-decoder networks by learning optimized weights
for these models through evolutionary computation-based optimization, produces reliable depth estimates; therefore,
the proposed depth estimation models can be integrated into any real-world computer vision application such as
autonomous driving, UAV navigation, augmented reality, etc. Moreover, the proposed models are relatively simple
and less complex than the existing state-of-the-art, and hence can be used onboard for calculating the depth of a 3-D
scene in vision-based applications. Generally, existing lidar- and laser-sensor-based depth estimation approaches
increase the weight and cost of the application; the proposed method overcomes these limitations by replacing the
costly sensors with a simple deep learning model that can estimate depth within the time limit set by the application
and with minimum regression error, thereby reducing the risk of collision in autonomous navigation tasks. However,
the application is required to be equipped with a good-quality monocular camera so that it can capture
high-definition RGB scenes with minimum noise and without being affected by the illumination effects present in
the surroundings.

6. Conclusion

In this paper, the ensemble learning-based approach for monocular depth estimation is presented. In the
demonstrated work, first, the four supervised depth estimation models based on an encoder-decoder architecture
with four different pretrained encoder networks- ResNet-50, DenseNet-161, Densenet-169, and DenseNet-201 are
trained and evaluated on NYU V2 depth dataset. Then, with these pretrained depth estimation models, an ensemble
model is developed, and different weights are assigned to the depth predictions from the four individual models.
These optimal weights are computed using the evolutionary Genetic algorithm. The evaluation results obtained with
the ensemble are compared with benchmark monocular depth estimation methods, and we find that the proposed
approach performs better than the existing depth estimation techniques. This behavior of the ensemble model
illustrates that the solution provided by the genetic algorithm for minimizing the objective function (represented by
equation (2)) is close to the global optimum and that the weights it computes are well optimized and highly precise.
In future work, to improve the quality of depth predictions, we will add other pretrained CNN models, such as
EfficientNet, AlexNet, and VGG-16, as well as a transformer layer, to the ensemble model. Further, in order to prove
the efficacy of the proposed scheme on other benchmark datasets, we will explore depth and odometry datasets such
as KITTI, Make3D, and TUM-RGBD for training and evaluating the proposed supervised depth estimation models.

Declaration of Interests
The authors declare that they have no known competing financial interests or personal relationships that could have
appeared to influence the work reported in this paper.

References

1. K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In: Proceedings of the
IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
2. G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger. Densely connected convolutional networks.
In: CVPR, 2017, vol. 1, p. 3.
3. N. Silberman, D. Hoiem, P. Kohli, R. Fergus. Indoor Segmentation and Support Inference from RGBD
Images. In: European Conference on Computer Vision (ECCV), 2012, pp 746-760.
4. D. Eigen and R. Fergus. Predicting depth, surface normals, and semantic labels with a common multi-scale
convolutional architecture. In: Proceedings of the IEEE International Conference on Computer Vision,
2015, pp. 2650–2658.
5. X. Wang, D. Fouhey, and A. Gupta. Designing deep networks for surface normal estimation. In:
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp 539–547.
6. B. Li, C. Shen, Y. Dai, A. Van Den Hengel, and M. He. Depth and surface normal estimation from
monocular images using regression on deep features and hierarchical crfs. In Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition, 2015, pp. 1119–1127.
7. P. Knöbelreiter, C. Reinbacher, A. Shekhovtsov, and T. Pock. End-to-end training of hybrid cnn-crf models
for stereo. In: Conference on Computer Vision and Pattern Recognition,(CVPR), 2017.
8. A. G. Schwing and R. Urtasun. Fully connected deep structured networks. arXiv preprint
arXiv:1503.02351, 2015.
9. F. Ma and S. Karaman. Sparse-to-dense: Depth prediction from sparse depth samples and a single image.
In: 2018 IEEE International Conference on Robotics and Automation (ICRA), 2018.
10. I. Alhashim, P. Wonka. High-quality monocular depth estimation via transfer learning. arXiv preprint
arXiv:1812.11941, 2018.
11. R. Collobert, K. Kavukcuoglu, C. Farabet. Torch7: A Matlab-like Environment for Machine Learning. In:
BigLearn NIPS Workshop, 2011.
12. S. Sharma, R.P. Padhy, S.K. Choudhary, N. Goswami, P. Kumar. DenseNet with pre-activated
deconvolution for estimating depth map from a single image. In: Conference on Activity Monitoring by
Multiple Distributed Sensing (AMMDS) under BMVC, 2017.
13. Z. Yin and J. Shi. Geonet: Unsupervised learning of dense depth, optical flow, and camera pose. In:
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 1983–1992.
14. K. Swaraja, K. Naga Siva Pavan, S. Suryakanth Reddy, K.Ajay, P. Uday Kiran Reddy, P. Kora, K.
Meenakshi, D. Chaitanya, H.Valiveti. CNN-Based Monocular Depth Estimation. In: International
Conference on Sustainable Energy and Design Aspects in Manufacturing Technology (ICMED), 2021.
15. I. Laina, C. Rupprecht, V. Belagiannis, F. Tombari, N. Navab. Deeper depth prediction with fully
convolutional residual networks. In: 2016 Fourth International Conference on 3D Vision (3DV), IEEE,
2016, pp. 239–248.
16. K. Huang, X. Qu, S. Chen, Z. Chen, W. Zhang, H. Qi, and F. Zhao. Superb monocular depth estimation
based on transfer learning and surface normal guidance. Sensors, 2020, vol. 20, pp. 4856.
17. K. Yousif, Y. Taguchi, S. Ramalingam. MonoRGBD-SLAM: Simultaneous localization and mapping using
both monocular and RGBD cameras. In: Proceedings of the 2017 IEEE International Conference on
Robotics and Automation (ICRA), Singapore, 29 May–3 June 2017, pp. 4495–4502.
18. R. Li, S. Wang, Z. Long, D. Gu. Undeepvo: Monocular visual odometry through unsupervised deep
learning. In: Proceedings of the 2018 IEEE international conference on robotics and automation (ICRA),
Brisbane, Australia, 21–25 May 2018, pp. 7286–7291.
19. Y. Kuznietsov, J. Stuckler, B. Leibe. Semi-supervised deep learning for monocular depth map prediction.
In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA,
21–26 July 2017, pp. 6647–6655.
20. W.D. Jang, C.S. Kim. Interactive Image Segmentation via Backpropagating Refinement Scheme. In
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long
Beach, CA, USA, 15–20 June 2019.
21. A. S. Silva Oliveira, M. C. Reis, F. A. Mota, M. E. Martinez, and A. R. Alexandria. New Trends on
Computer Vision applied to mobile robot localization. Internet of Things and Cyber-Physical Systems,
Elsevier, 2022, vol. 2, pp. 63–69.
22. J. Biswas and M. Veloso. Depth camera-based indoor mobile robot localization and navigation. In: 2012
IEEE International Conference on Robotics and Automation, 2012.
23. R. Du, E. Turner, M. Dzitsiuk, L. Prasso, I. Duarte, J. Dourgarian, J. Afonso, J. Pascoal, J. Gladstone, N.
Cruces, S. Izadi, A. Kowdle, K. Tsotsos, and D. Kim. DepthLab: Real-time 3D interaction with depth maps
for mobile augmented reality. In: Proceedings of the 33rd Annual ACM Symposium on User Interface
Software and Technology, 2020.
24. Z. Li, Y. Cui, T. Zhou, Y. Jiang, Y. Wang, Y. Yan, M. Nebeling, and Y. Shi. Color-to-depth mappings as
depth cues in virtual reality. The 35th Annual ACM Symposium on User Interface Software and
Technology, 2022.
25. K.-C. Wong. Evolutionary algorithms. Advances in Computational Intelligence and Robotics, 2016, pp.
190–215.
26. Holland J. Adaptation in natural and artificial systems. Ann Arbor, MI, USA: University of Michigan Press;
1975.
27. Y. Cao, Z. Wu, and C. Shen. Estimating depth from monocular images as classification using deep fully
convolutional residual networks. IEEE Transactions on Circuits and Systems for Video Technology, 2017.
28. R. Mur-Artal, J. M. Montiel, and J. D. Tardos. ORB-SLAM: A versatile and accurate monocular slam
system. IEEE Transactions on Robotics, 2015, vol. 31, no. 5, pp. 1147–1163.
29. L. Yao and F. Li. Mobile robot localization based on vision and Multisensor. Journal of Robotics, 2020, vol.
2020, pp. 1–11.
