Auto CNN

Autonomous CNN (AutoCNN): A Data-Driven Architecture Learning Approach

Abhay M S Aradhya∗, Andri Ashfahani∗, Fienny Angelina, Mahardhika Pratama, Senior Member, IEEE, RF De Mello and Suresh Sundaram, Senior Member, IEEE

∗ A M S Aradhya and A Ashfahani share equal contributions.

Abstract: Designing an optimum Convolutional Neural Network (CNN) is a complex task due to the large array of possible architectures, and it requires experience and in-depth knowledge of deep learning. This paper proposes Autonomous Convolutional Neural Networks (AutoCNN¹), a novel non-heuristic data-driven method to determine the CNN architecture for various classification problems. Novel convolution stage growing and filter pruning strategies are proposed in this paper to sequentially evolve and optimize the network architecture based on the input data distribution. Further, an early stopping criterion is introduced in AutoCNN to prevent over-training and to minimize performance loss. AutoCNN was evaluated using the MNIST, MNIST-rot-back-image, Fashion-MNIST and ADHD200 datasets. The results indicate that AutoCNN not only achieves an improved classification performance over the existing evolutionary learning-based architecture determination methodologies, but also provides a state-of-the-art classification accuracy on the MNIST-rot-back-image dataset (2.07% improvement) and on the ADHD200 dataset (4.36% improvement). The ablation studies show that AutoCNN performance is reliable and highly robust to noise, that the results generalize across various datasets, and that the proposed data-driven strategies can also be used to improve existing CNN architectures.

Index Terms: Evolving Intelligent Systems, Convolutional Neural Networks, deep learning, evolutionary computing

¹ The executable code and original numerical results can be downloaded from https://tinyurl.com/AutoCNN.

I. INTRODUCTION

In recent years, significant improvement has been achieved in the fields of handwriting recognition, computer vision, image restoration and medical image analysis using Convolutional Neural Network (CNN) based supervised learning algorithms. The number of layers (depth) and the organization of the architecture play a vital part in the network performance [1], [2]. Although handcrafted CNN architecture based classifiers have been successful in accurate classification, determination of a suitable CNN architecture is challenging and requires expert knowledge. Errors in the determination of the CNN architecture lead to loss of performance; a shallow network is prone to underfitting, while networks with an excessive number of layers lead to overfitting and suffer from the vanishing gradient problem [3]. These issues often result in a stochastic approach to designing a multi-layer CNN architecture.

Automated determination of network architecture has been explored widely in the literature. Earlier, genetic algorithms were used to determine the network architecture and weights simultaneously [4], [5]. The preliminary studies on automated architecture determination were inspired by biological processes in nature. Another approach inspired by molecular biology and genetics [6] explored a random sampling based approach with high throughput. The biologically inspired approaches showed that automated determination of network architecture is feasible. Bayesian optimization strategies have also been used for automatic determination of network architectures and their hyper-parameters [7], [8]; however, their performance was unable to match that of the hand-crafted networks.

Recently, deep network architecture determination studies have primarily focused on meta-heuristic based approaches as they closely mimic the popular stochastic selection criteria. The meta-heuristic architecture determination methodologies have outperformed the hand-designed architectures on multiple classification tasks. Genetic algorithm-based meta-heuristic approaches to identify CNN architectures have achieved improvement in automatic determination of network architecture [9], [10], [11], but have failed to beat the hand-crafted network based classifiers.

With the advances in computational infrastructure and the size of the databases, reinforcement learning-based approaches have gained popularity for their ability to utilize the action-reward mechanism to better explore the search space. Neural Architecture Search (NAS) [12] is one of the popular reinforcement learning methods; it proposed a common learning strategy for determining both the network architecture and its parameters. NAS utilizes the evaluation performance as the reward function to optimize the network architecture, while the classification error is backpropagated to update the network parameters. Although NAS improves the classification performance over the previous meta-heuristic methods in the literature, the reinforcement learning based NAS methodology is computationally expensive and requires training and evaluating 12,000 network architectures to determine an optimal solution [12]. Motivated by the results of NAS, researchers have looked to explore automatic architecture determination algorithms. A graph based search approach was proposed in Efficient Neural Architecture Search (ENAS) [13] to reduce the computation cost of the NAS architecture. The reduced search space using sub-graph based architecture exploration provided smaller networks (decreased number of parameters) with higher classification performance. Further, Progressive Neural Architecture Search (PNAS) strategies have been presented in [14], which progressively explore network architectures with an increasing degree of complexity. Similar progressive exploration strategies have been proposed using tree structure based meta-controllers [15] and the deep learning based Differentiable Architecture Search (DARTS) approach
[16]. Although such search based reinforcement learning algorithms have been relatively successful in classification problems with large datasets, automatic architecture determination has been a challenging task and needs to be validated over a wide range of datasets. Specifically, for classification problems with small datasets, the search based architecture learning strategies are prone to overfitting due to the large search space and limited sampling points. Therefore, there is a need to develop automatic architecture determination methodologies that are reliable and scalable across various classification problems with different feature distributions, numbers of classes and small sample sizes.

Hence, data-driven approaches have gained prominence for the determination of the optimal network architecture. They have been widely used to optimize Multi-Layer Perceptron (MLP) [17], [18] and Recurrent Neural Network (RNN) [19] architectures for data stream problems. They propose demand-based network evolution strategies to iteratively evolve deep neural networks. Unlike the search based methodologies, the data-driven approaches are scalable and can be adopted to address problems with both small and large datasets. Also, because network architecture evolution is need-based in data-driven approaches, they are computationally inexpensive when compared to the NAS based reinforcement learning methodologies. Motivated by the findings of the above studies, we propose an Autonomous Convolutional Neural Network (AutoCNN) to generate high-performance architectures using a data-driven non-heuristic approach.

Direct application of data-driven architecture learning approaches to CNNs is challenging because a CNN is composed of a complex combination of different layers with varied functionality, wherein the sequential order of the layers is critical and necessary for network performance. Therefore, in this paper, a novel data-driven architecture learning method, hereby called the Autonomous CNN (AutoCNN), is proposed to obtain high-performance CNN architectures. AutoCNN is an end-to-end image classification approach whose network architecture is determined using a data-driven evolving strategy. The evolving strategy of AutoCNN is governed by a novel feature separability score, which is reliant on the distribution and complexity of the training data. Further, a filter pruning strategy has been introduced in AutoCNN to eliminate redundant information and to reduce the computational complexity of the CNN architecture. Finally, an early stopping strategy is proposed in AutoCNN to prevent overtraining and achieve the highest classification performance.

The significant contributions of this paper are:
• AutoCNN is a novel data-driven architecture generation methodology. The convolution layer growing, filter pruning and early stopping strategies of AutoCNN are developed to automatically evolve the network and determine the best CNN architecture.
• A novel Feature Separability Score (FSS) is proposed to measure the intra-network feature separability.
• The newly developed method has been tested on four publicly available datasets: MNIST, MNIST-rot-back-image, Fashion-MNIST and ADHD200. AutoCNN achieves state-of-the-art classification performance on the MNIST-rot-back-image and ADHD200 datasets, demonstrating the reliability and scalability of AutoCNN architectures.
• Compared to existing methods in the literature, the classification performance of AutoCNN is least affected by noise in the input data.
• AutoCNN adopted as an optimization tool for existing CNN architectures provides a 10.96% accuracy gain over the baseline model.

The rest of this paper is structured as follows: Section II outlines the problem formulation; Section III discusses the learning policy of AutoCNN; Section IV elaborates our experiments; concluding remarks are drawn in the last section.

II. PROBLEM FORMULATION

CNNs have shown remarkable dominance in machine learning tasks, such as image classification and handwriting recognition. However, the design of CNN architectures for particular tasks is extremely complex, as is evident from existing efforts such as AlexNet [20], ResNet [21], LeNet [22], VGGNet [23] and GoogLeNet [24]. Furthermore, as the architecture varies greatly depending on the problem, the CNN architecture needs to be fine-tuned for each problem. Redesign of the CNN architecture is a time-intensive task and requires expert knowledge. Automatic architecture determination reduces the design time and proposes high-performance CNN architectures [12]. However, the existing autonomous architecture determination methodologies are highly reliant on search based strategies and are prone to overfitting, due to the large search space and limited sampling points, in datasets with a large feature space and a small number of samples. To address these problems, in this paper, we propose a data-driven learning strategy called Autonomous Convolution Neural Network (AutoCNN) to determine the optimal CNN architecture based on the distribution and complexity of the data. Mathematically, the architecture determination problem can be defined as follows: given a batch of image data $B$, the AutoCNN learning policy should be able to automatically determine a CNN model $F(\cdot)$ capable of associating the input sample $X$ with its corresponding class label $Y$.

III. AUTOCNN: AUTONOMOUS CONVOLUTION NEURAL NETWORK

Autonomous Convolutional Neural Network (AutoCNN) is a deep convolutional neural network with a data-driven architecture evolving strategy. The structural evolution of AutoCNN is illustrated in Figure 1, whereas the learning policy is presented in Algorithm 1. From the algorithm, it can be observed that the learning process consists of three stages. Training stage I, a spatial feature separability optimization stage, consists of the structural evolution strategies (convolution layer growing and filter pruning methodologies). Training stage II aims to optimize the classifier/decision layers (fully connected layers), whereas training stage III optimizes AutoCNN in an end-to-end manner. Optimizing the CNN layers and the fully connected layer parameters separately is designed to reduce the risk of overfitting.
Algorithm 1 AutoCNN algorithm
Input: Training data
Require: 0 < α < 1 and 0 < β < 1
Initialize: add a convolutional stage
Ensure: STATE ∈ {Training, StopTraining}
========= Training Stage I =========
while STATE is Training do
    if Equation 3 is satisfied then
        Execute: Filter pruning via Algorithm 2
        if similarFilterList is empty then
            [...]
                Create: a new convolutional stage
            else
                STATE ← StopTraining
            end if
        end if
    end if
    Update: CNN layers parameters
end while
========= Training Stage II =========
STATE ← Training
while STATE is Training do
    [...]
    end if
    Update: fully connected layer parameters
end while
========= Training Stage III =========
STATE ← Training
while STATE is Training do
    if Equation 7 is satisfied then
        [...]
    end if
    Update: End-to-end network parameters
end while
Return: Network structure, Predicted labels

Fig. 1. Schematic diagram of AutoCNN illustrating its evolution strategies. (a) State t0: AutoCNN initialization; State t1: convolutional layer growing; State t2: filter pruning. (b) A convolutional stage (Conv. Layer, Batch Norm., ReLU, Max Pooling).

Figure 1 illustrates the AutoCNN network evolution, where a network structure can be automatically constructed for a given problem. It illustrates the convolutional layer growing and filter pruning mechanisms that optimize the spatial features of the AutoCNN network. AutoCNN is initiated with a primitive CNN consisting of one convolution stage and one fully connected layer. AutoCNN adopts an iterative growing approach, wherein the architecture complexity is iteratively modulated over the training epochs based on the network learning performance. Novel data-driven convolutional layer growing and filter pruning criteria are proposed to automatically evolve the primitive AutoCNN network and achieve the best classification performance. Further, to prevent overfitting, a self-stopping criterion is proposed to achieve early stopping. A detailed explanation of the evolving strategies is provided below.

A. AutoCNN Initialization

AutoCNN is initialized with a single convolution stage (1 convolution layer, 1 batch normalization layer, a ReLU non-linear activation function and 1 pooling layer), illustrated in Figure 1 (b), with $K$ filters, followed by a fully connected layer with $C$ nodes, where $C$ represents the number of classes in the input dataset. Each convolution stage converts the low level features into high level features, while the fully connected layers estimate the class conditional probabilities from the extracted features to classify the input image [25]. Further, to protect AutoCNN from overfitting, a dropout mechanism is adopted in the learning process.

B. Convolutional Layer Growing

AutoCNN optimizes the number of convolutional stages of the CNN using a demand-based sequential strategy over the training epochs. The evolution of the convolutional stages in AutoCNN is governed by the Feature Separability Score (FSS). It measures the similarity of the average spatial feature $F$ across the different classes and is calculated as:

$$ FSS = \frac{2}{C^{2} - C} \sum_{i=1}^{C} \sum_{j=1}^{C} F_i \cdot F_j \quad \forall j < i \tag{1} $$

where $C$ represents the total number of classes and $F_i \in \mathbb{R}^{1 \times D}$ denotes the average spatial features from the last convolutional stage calculated over all the samples of the $i$-th class. The term $2/(C^{2} - C)$ is included to obtain the average value of all similarity coefficients ($\sum \sum F_i \cdot F_j$). $FSS \in [0, 1]$ represents the correlation between the high order spatial features extracted by the convolutional stages from the samples of the $i$-th and $j$-th classes. A higher FSS value denotes that the spatial features extracted by the convolutional stages are strongly correlated and poorly separable, whereas a low FSS denotes that the spatial features are loosely correlated and highly discriminative.
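As a rough illustration of Equation 1, the sketch below computes the FSS from per-sample feature vectors in plain Python. It is not the authors' implementation: the function names (`class_means`, `unit`, `feature_separability_score`) and the `normalize` flag are assumptions made for this example. In particular, scaling each class mean to unit length is an assumed step, chosen because it keeps the score in [0, 1] for non-negative (post-ReLU) features; Equation 1 itself only specifies the dot product.

```python
from itertools import combinations
from math import sqrt


def class_means(features, labels):
    """Average the feature vectors of each class (F_i in Equation 1)."""
    sums, counts = {}, {}
    for x, y in zip(features, labels):
        if y not in sums:
            sums[y] = [0.0] * len(x)
            counts[y] = 0
        sums[y] = [s + v for s, v in zip(sums[y], x)]
        counts[y] += 1
    return {y: [s / counts[y] for s in sums[y]] for y in sums}


def unit(vector):
    """Scale a vector to unit L2 norm (assumed step, not stated in the paper)."""
    norm = sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector] if norm > 0 else list(vector)


def feature_separability_score(features, labels, normalize=True):
    """Mean pairwise dot product of class-average features over all pairs j < i."""
    means = class_means(features, labels)
    if normalize:
        means = {y: unit(m) for y, m in means.items()}
    c = len(means)
    if c < 2:
        raise ValueError("FSS needs at least two classes")
    # combinations() visits each unordered class pair once, i.e. j < i.
    total = sum(
        sum(a * b for a, b in zip(means[i], means[j]))
        for i, j in combinations(sorted(means), 2)
    )
    return 2.0 * total / (c * c - c)
```

Called on the flattened outputs of the last convolutional stage, a score near 0 corresponds to class averages that point in nearly orthogonal directions, and a score near 1 to class averages that are almost collinear.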
Algorithm 2 Filter pruning algorithm
Input: Current network structure
Set: similarFilterList as an empty array
for all pairs of vectorized feature maps do
    Calculate: Pearson correlation coefficient ρ(Zp, Zq)
end for
Calculate: µK and σK
Identify: similar filters via Equation 5
Store: the index of similar filter to similarFilterList
if similarFilterList is not empty then
    for all filters in similarFilterList do
        Prune: filters having low contribution via Equation 6
    end for
end if
Return: Network structure, similarFilterList

Although the FSS provides an accurate measure of the separability of the extracted features, it is susceptible to local oscillations due to the large parameter size and iterative mini-batch training in CNNs. Moving average scores have been successfully used in time series based studies to effectively dampen the variations due to local oscillations [26] while preserving the information in the data. Hence, the moving average of the FSS is calculated as:

$$ \mu^{n}_{FSS} = \mu^{n-1}_{FSS} + \frac{FSS - \mu^{n-1}_{FSS}}{n} \tag{2} $$

where $FSS$ is the spatial feature similarity score of the $n$-th epoch. The above equation gives the moving average FSS ($\mu^{n}_{FSS}$) for the $n$-th epoch, which is used to estimate the performance of the network.
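Equation 2 is the standard incremental form of a cumulative mean, so it can be maintained with two numbers per network. The snippet below is a minimal sketch in plain Python; the class name `RunningMean` is hypothetical, and whether the epoch counter n restarts when a new convolutional stage is added is not stated in the text, so no reset is modeled here.

```python
class RunningMean:
    """Cumulative mean updated one value at a time (Equation 2)."""

    def __init__(self):
        self.n = 0       # number of epochs folded in so far
        self.mean = 0.0  # current moving average, mu^n_FSS

    def update(self, value):
        """Fold the score of the next epoch into the mean and return it."""
        self.n += 1
        self.mean += (value - self.mean) / self.n
        return self.mean


# Usage: feed one FSS value per epoch and keep the smoothed trace.
tracker = RunningMean()
history = [tracker.update(fss) for fss in (1.0, 0.5, 0.0)]
```

Keeping the full `history` list is a convenience for this example; any criterion that compares the current average with the one from W epochs earlier needs at least the last W + 1 entries.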
For a randomly initialized network, $\mu^{n}_{FSS}$ reduces with the number of epochs and achieves a steady state. Further improvement in performance can be achieved by deriving higher-order features using additional convolutional stages. Therefore, the steady-state response of $\mu^{n}_{FSS}$ over $W$ epochs indicates the saturation in the learning of the current network and is utilized as the AutoCNN convolutional stage growing criterion, given as:

$$ 0 < \left[ \frac{\mu^{n}_{FSS} - \mu^{(n-W)}_{FSS}}{\mu^{\max}_{FSS} - \mu^{\min}_{FSS}} \right] < \beta \tag{3} $$

where $\mu^{\max}_{FSS}$ and $\mu^{\min}_{FSS}$ represent the maximum and minimum observed $\mu^{n}_{FSS}$ values for the given architecture complexity and $0 < \beta < 1$ represents the growth sensitivity factor. A larger β enforces an aggressive growing strategy, while a β value closer to zero promotes a conservative growing strategy.

Although the iterative addition of convolutional stages improves the separability of the features, multiple pooling layers in deep networks lead to diminishing feature size and loss of information [3]. To prevent the loss of discriminatory information, AutoCNN uses predictive analysis to determine the resultant reduction in the number of features. It arbitrates the feasibility of the addition of the pooling layer. For an input of $D$ dimensions, the number of features $M$ of an AutoCNN with $V$ convolutional stages can be calculated as:

$$ M = \prod_{n=1}^{D} \left( \frac{d_n}{P_n S_n} \right)^{V+1} \tag{4} $$

where $d_n$, $P_n$ and $S_n$ represent the size of the input, the number of padding pixels and the stride in the $n$-th dimension, respectively. From the observation of prominent CNN architectures, such as AlexNet [20], ResNet [21], VGGNet [23] and GoogLeNet [24], it is found that the number of extracted CNN features ranges from 1000 to 4000. In this study, we consider 4000 as the minimum number of features that allows AutoCNN to include a max pooling layer whenever a convolutional stage is added. Equation 4 is used to estimate the number of spatial features of AutoCNN after the addition of the new convolution stage. If the addition of the new convolution stage reduces the number of spatial features to $M \leq 4000$, AutoCNN uses a convolutional stage devoid of the pooling layer to prevent the deterioration of information due to the diminishing spatial feature dimension.

C. Filter Pruning

Although the addition of convolution stages using the AutoCNN convolution layer growing strategy improves the learning potential of the network, it also increases the redundancy in the network, leading to loss of performance. Pruning of redundant filters not only improves the classification performance but also reduces the space and computational complexity of the network and results in faster execution times [27]. Moreover, a priori determination of the number of filters for the new convolution stage of the CNN is not feasible due to the complex feature space relationship. Therefore, the addition of $K$ randomly initialized filters in the newly added $V$-th convolution stage of AutoCNN does not lead to significant improvement in the classification performance. To address these limitations, a data-driven filter pruning strategy is proposed in AutoCNN to increase the network learning potential without loss in classification performance.

Firstly, a significantly large $K$ is chosen to build redundancy into the learnable parameters. Further, the filter parameter distribution is monitored to identify filters with similar/identical parameter distributions. Identical filters applied to common inputs are responsive to similar spatial patterns and produce similar information for the next layer. Therefore, it is imperative that redundant information is eliminated and only relevant discriminative information is retained in deep learning models [28].

In this paper, we propose a strategy based on Pearson's product-moment correlation to identify the convolution filters with redundant information. Highly similar convolution filters $(K_p, K_q)$ with redundant information are determined using the linear correlation between their vectorized filter maps $Z$. Two filters $(K_p, K_q)$ are considered similar if their vectorized feature maps $(Z_p, Z_q)$ possess a strong positive linear relationship, and are said to be dissimilar if they have a negative or no linear relationship. Therefore, the pairs of filters $(K_p, K_q)$ with high similarity are identified using:

$$ 0 < \rho(Z_p, Z_q) \geq \mu_K + 4\sigma_K, \quad p \neq q \tag{5} $$

where $\rho(Z_p, Z_q)$ represents the Pearson's correlation of the vectorized feature maps of the $p$-th and $q$-th filters of a convolutional stage, and $\mu_K$ and $\sigma_K$ are the mean and standard deviation of all the positively correlated filters of a given convolutional stage, respectively. Equation 5 is inspired by the k-sigma rule
in statistics, where $k = 4$ governs the confidence of the solution and implies a confidence level of 99.99%. Every pair of highly similar features that satisfies Equation 5 represents identical information and hence leads to redundancy in the network. The L1 norm $\lVert K \rVert_1$ of each of the paired spatial filters is used to identify the filter with the lower information content, which is pruned as

$$ \min(\lVert K_p \rVert_1, \lVert K_q \rVert_1) \in K \quad \forall p \neq q \tag{6} $$

A smaller L1 norm indicates that a filter is characterized by weaker activations and has a lower significance in the final classification performance of the CNN. The overall filter pruning mechanism is presented in Algorithm 2.

D. Stopping Criteria

The convolution layer growing and filter pruning strategies of AutoCNN are highly beneficial in increasing the network capacity with low network redundancy. However, beyond a saturation depth, the performance of deep networks degrades with increased network depth [21]. The network performance is also adversely affected by overtraining, i.e., optimization of network parameters over an excessively large number of epochs. The spatial feature separability measure (FSS) in AutoCNN given by Equation 1 provides a reliable measure of the convolution stage performance. A reduction in FSS does not necessarily produce a significant decrease in the error $E_{Tr}$, which increases the likelihood of overtraining.

Therefore, to address the above problems it is essential to identify the performance saturation of the network and stop the training process. The stopping criterion of AutoCNN suspends the parameter update automatically and is given by

$$ \left[ \frac{\mu^{n}_{E_{Tr}} - \mu^{(n-W)}_{E_{Tr}}}{\mu^{\max}_{E_{Tr}} - \mu^{\min}_{E_{Tr}}} \right] < \alpha \tag{7} $$

where $\mu^{n}_{E_{Tr}}$ denotes the moving average of $E_{Tr}$ calculated in the same way as $\mu^{n}_{FSS}$. $\mu^{\max}_{E_{Tr}}$ and $\mu^{\min}_{E_{Tr}}$ represent the maximum and minimum values of $\mu^{n}_{E_{Tr}}$ for the given architecture complexity. $\alpha \in [0, 1]$ represents the early stopping coefficient. A lower value of α has a higher tolerance to performance saturation, whereas a higher value of α supports a conservative stopping criterion. Further, the iterative revision of $\mu^{\max}_{E_{Tr}}$ and $\mu^{\min}_{E_{Tr}}$ ensures that the stopping criterion is independent of the problem dynamics and adaptive to the network learning characteristics.

IV. EXPERIMENTS AND RESULTS

In this section, the classification performance of AutoCNN on various image datasets is presented. As scalability and reliability are key characteristics of architecture determination models, the performance of AutoCNN is tested using datasets with varied data distributions, complexity and sample sizes to validate the performance of the data-driven AutoCNN architecture. The publicly available MNIST, MNIST-rot-back-image, Fashion-MNIST and ADHD200 datasets were used to evaluate the performance of AutoCNN and validate its reliability and scalability.

A. Datasets

This section gives a detailed description of the datasets utilized in AutoCNN performance validation, the data distribution of the datasets and the significance of each dataset in the performance evaluation of AutoCNN.

MNIST: MNIST is a benchmark dataset to evaluate CNN based classification algorithms and consists of 70,000 binary handwritten digits between 0 and 9 [29]. The dataset is divided into 60,000 training samples and 10,000 distinct testing samples. AutoCNN performance on the MNIST dataset is used to validate and compare the classification performance with the state-of-the-art methodologies in the literature.

MNIST-rot-back-image²: The MNIST-rot-back-image dataset is a variation of the MNIST dataset created by adding different noises (random pixel value mutation, rotation and background image addition) to increase data complexity. The dataset contains 12,000 training images and 50,000 testing images. The skewed sample distribution along with the background noise makes the MNIST-rot-back-image dataset challenging. In this paper, the performance of AutoCNN on the MNIST-rot-back-image dataset is used to evaluate the robustness of AutoCNN against random noise variations in the input features of the MNIST dataset.

Fashion-MNIST³: It is a dataset of Zalando's article images used to evaluate the performance of object recognition classifiers. It consists of 50,000 training images and 10,000 testing images divided into 10 classes [30]. Each image in the Fashion-MNIST dataset has a small feature size of 28 × 28. The small feature size coupled with the large sample size is challenging for autonomous architecture determination models like AutoCNN. Fashion-MNIST serves as a direct drop-in replacement for the original MNIST dataset, as it has the same image size, data format and structure of training and testing splits.

ADHD200⁴: The ADHD200 dataset presents a binary classification problem with a small sample size, high intra-class variance, low inter-class variance and a large feature dimension [31], and has been a challenging problem for CNN based classifiers [32]. The dataset consists of 904 grayscale brain rs-fMRI images aggregated from 8 different institutions by the INDI initiative [33]. In this paper, the AAL template-based connectivity maps from 723 randomly selected samples are used for training, while the remaining 181 samples are used as testing samples. The classification performance of AutoCNN is used to validate the robustness of the data-driven architecture determination in AutoCNN.

² The dataset is available at https://sites.google.com/a/lisa.iro.umontreal.ca/public_static_twiki/variations-on-the-mnist-digits
³ The dataset is freely available at https://github.com/zalandoresearch/fashion-mnist
⁴ This dataset can be downloaded from http://fcon_1000.projects.nitrc.org/indi/adhd200/
B. Experimental Setup

AutoCNN is implemented in Python and the executable code is available at https://tinyurl.com/AutoCNN. Evaluation over different datasets and distributions shows that the ideal range of the growth sensitivity factor β is [0.15, 0.25]. A large β value results in the addition of redundant parameters and a smaller β value limits the network capacity. Similarly, the ideal range of the early stopping coefficient α was found to be [0.03, 0.05]. A larger α hinders the network training, while a smaller α results in over-training and reduced network performance. Through empirical evaluation over different datasets and distributions, the best classification accuracy was achieved using an evaluation window ($W$) of 20 epochs. Further, AutoCNN is initialized with [8, 32] 3 × 3 filters in the first convolution stage, followed by a non-linear ReLU activation function and a 2 × 2 pooling layer. After every convolution stage, the outputs are padded (1 × 1) and a suitable striding (1 × 1) is used to eliminate information loss between the convolutional stages. A dropout rate of 0.1, a momentum of 0.9 and a weight decay of 5e−4 are used along with gradient-based backpropagation to prevent overfitting and ensure a suitable parameter update.

C. AutoCNN Classification Performance

It is imperative that autonomous evolving architecture determination methods are validated using different datasets from various domains, with varied applications and data distributions. Therefore, the classification performance of AutoCNN is validated using the publicly available MNIST, MNIST-rot-back-image, Fashion-MNIST and ADHD200 datasets. The performance of AutoCNN is reported as the mean accuracy over three independent runs with a holdout test dataset. The state-of-the-art methodologies for each of the four datasets vary, and validation of results for each methodology on all the datasets is not feasible due to the unavailability of source code and the high computation cost. Therefore, the classification performance of AutoCNN has been compared individually with the state-of-the-art hand-designed and autonomous architecture determination methods in the literature.

Figure 2 shows the performance metrics and network evolution characteristics of AutoCNN, where each sub-figure represents the characteristics of AutoCNN on a specific dataset. Each sub-figure consists of four independent and related metrics to measure network performance and evolution characteristics. First, the training and testing accuracy of the AutoCNN network over the training cycle is presented. It can be observed from Figure 2 that the early stopping criterion proposed in AutoCNN prevents overfitting and loss of testing performance. Deep learning methods using small datasets with high variability like ADHD200 are prone to overfitting [32], and it is imperative for evolving architecture determination methods like AutoCNN to prevent over-training and the resultant loss in network performance.

Second, the characteristics of the FSS over the training cycle are presented. A strong inverse correlation is observed between the FSS and the accuracy on all four datasets, which establishes that the FSS is an important measure of network performance in CNNs. Also, the FSS successfully drives the convolutional stage growing mechanism. For reference, the FSS plot (the second image from the left in each figure) indicates a high value at the beginning of the training process, to which AutoCNN responds in a timely manner by introducing a new convolutional stage.

Third, the evolving characteristics of AutoCNN on the various datasets are presented. From Figure 2 it can be observed that on the MNIST dataset AutoCNN follows a more conservative growing strategy than on the Fashion-MNIST dataset, due to the difference in data complexity. Addition of a new convolution stage resulted in an increase in network capacity and classification accuracy, with an inversely proportional decrease in the FSS score. The improvement in network performance with consecutive addition of convolution stages and the successive reduction in FSS demonstrate the learning convergence in AutoCNN architectures.

The number of AutoCNN filters in each convolutional stage at each epoch of the training cycle is presented in Figure 2. The graphs representing the performance of AutoCNN on the MNIST and MNIST-rot-back-image datasets show that pruning of redundant filters reduced the network parameters while conserving the classification performance. Filter pruning not only reduces the computation overhead by reducing the redundant parameters in the network, it also prevents the CNN from overfitting [28], [34].

TABLE I
CLASSIFICATION PERFORMANCE ON MNIST DATASET

MODEL        DEPTH           PARAMS. (M)    ACC. (%)
AutoCNN      11.67 ± 3.78    0.34 ± 0.44    99.59 ± 0.015
IPPSO        5               N/A            98.79
EvoCNN       N/A             N/A            98.82
ResNet18*    18              11             97.90
VGG16*       16              26             99.68
FCCNN*       3               N/A            97.57
*Hand designed CNN architecture.

1) MNIST: The performance of AutoCNN against state-of-the-art methodologies on the MNIST dataset is presented in Table I. Hand-crafted CNN architectures have primarily focused on tailoring the network design to improve the classification accuracy on the specific problem. Among such hand-designed architectures, FCCNN [35] proposed a non-iterative PCA based feature learning approach and achieved a testing accuracy of 97.57%. In addition, the popular ResNet18 and VGG16 hand-designed CNN architectures were found to have testing accuracies of 97.90% and 99.68%, respectively, on the MNIST dataset [36]. Recently, researchers have proposed autonomous architecture determination methodologies for the MNIST dataset. IPPSO [37] proposed a particle swarm optimization based CNN architecture detection methodology and reported a testing accuracy of 98.79%. Among the autonomous architecture determination methods in the literature, EvoCNN [38] adopted a meta-heuristic genetic algorithm based evolving CNN and achieved the highest testing accuracy of 98.82%. However, AutoCNN achieves an improved classification accu-
7
100 0.7 10 70
0.6 60
80 8
no of Filter in Each Layer

50
no of Conv. Layer
0.5
Accuracy (%)
60 6
40
Conv layer 1
FSS
0.4 Conv layer 2
30 Conv layer 3
40 4
Conv layer 4
0.3
20 Conv layer 5
Conv layer 6
20 2
0.2 Conv layer 7
Testing 10
Training Conv layer 8
Conv layer 9
0 0.1 0 0
200 400 600 800 200 400 600 800 200 400 600 800 200 400 600 800
no of epoch no of epoch no of epoch no of epoch
(a) Performance of AutoCNN on MNIST dataset

100 1 12 140
0.9 120
10
80

0.8
100
no of Conv. Layer
8 Conv layer 1
Accuracy (%)
60 0.7
80 Conv layer 2
FSS Conv layer 3
0.6 6
Conv layer 4
60 Conv layer 5
40
0.5
4 Conv layer 6
40 Conv layer 7
0.4 Conv layer 8
20
Testing 2 20 Conv layer 9
Training 0.3 Conv layer 10
Conv layer 11
0 0.2 0 0
200 400 600 800 200 400 600 800 200 400 600 800 200 400 600 800
(b) Performance of AutoCNN on MNIST-rot-back-image dataset

100 1 20 140
Conv layer 1
Conv layer 2
120 Conv layer 3
80 0.8 Conv layer 4

15 Conv layer 5
no of Conv. Layer 100
Conv layer 6
Accuracy (%)
60 0.6 Conv layer 7

80 Conv layer 8
FSS
10 Conv layer 9
60 Conv layer 10
40 0.4 Conv layer 11
Conv layer 12
40 Conv layer 13
5
Conv layer 14
20 0.2
Testing Conv layer 15
20
Training Conv layer 16
Conv layer 17
0 0 0 0
200 400 600 800 200 400 600 800 200 400 600 800 200 400 600 800
(c) Performance of AutoCNN on Fashion-MNIST dataset

100 0.85 9 35
8
0.8 30
80 7
0.75 25
no of Conv. Layer
6
Accuracy (%)
60
0.7 5 20
FSS
0.65 4 15 Conv layer 1

40 Conv layer 2
3 Conv layer 3
0.6 10 Conv layer 4
20 2 Conv layer 5
Testing 0.55 5 Conv layer 6
Training 1 Conv layer 7
Conv layer 8
0 0.5 0 0
100 200 300 400 500 100 200 300 400 500 100 200 300 400 500 100 200 300 400 500
(d) Performance of AutoCNN on ADHD200 dataset
Fig. 2. Performance metrics and the network evolution of AutoCNN. The AutoCNN evolving mechanism is demonstrated here where it successfully introduces
several CNN layers and removes redundant filters. The effectiveness of F SS in measuring a model performance is also demonstrated here. It decreases as
the increase in accuracy.
racy of 99.59%, which is a 0.77% improvement over previous TABLE II

autonomous architecture determination methodologies in liter- C LASSIFICATION PERFORMANCE ON MNIST- ROT- BACK - IMAGE DATASET
ature. Further, the data-driven architecture evolution approach
proposed in AutoCNN also betters the classification perfor- M ODEL D EPTH PARAMS . (M) ACC . (%)
mance of the popular ResNet18 hand designed architecture, AUTO CNN 17.33 ± 4.72 1.5 ± 1.2 86.78 ± 0.66
while its performance was comparable to the VGG16 network IPPSO 7 N/A 65.50
E VO CNN N/A N/A 62.62
despite having fewer network parameters. Thus, the AutoCNN R ES N ET 18* 18 11 76.32
developed is able to identify smaller networks (which are less VGG16* 16 26 84.71
memory and computationally expensive) capable of achieving FCCNN* 3 N/A 66.40
classification performances similar to that of the state-of-the- *Hand designed CNN architecture.
art hand-designed CNN architectures.
2) MNIST-rot-back-image: Table II presents the perfor-
mance of AutoCNN against state-of-the-art methodologies looked at the MNIST-rot-back-image dataset to validate the
on MNIST-rot-back-image dataset. Recently, researchers have robustness of the classification methodologies against noise.
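As a worked illustration of this robustness comparison, the drop in accuracy between the two datasets can be computed per model. The following is a minimal sketch that uses only the mean accuracies listed in Table I and Table II; the dictionary `ACCURACY` and the helper `accuracy_drop` are illustrative names and are not part of AutoCNN.

```python
# Mean test accuracies (%) copied from Table I (MNIST) and
# Table II (MNIST-rot-back-image). Names below are illustrative only.
ACCURACY = {
    "AutoCNN": (99.59, 86.78),
    "IPPSO": (98.79, 65.50),
    "EvoCNN": (98.82, 62.62),
    "ResNet18": (97.90, 76.32),
    "VGG16": (99.68, 84.71),
    "FCCNN": (97.57, 66.40),
}


def accuracy_drop(table):
    """Return {model: accuracy drop in percentage points}, rounded to 2 decimals."""
    return {model: round(clean - noisy, 2) for model, (clean, noisy) in table.items()}


drops = accuracy_drop(ACCURACY)

# List the models from the smallest to the largest accuracy drop.
for model, drop in sorted(drops.items(), key=lambda item: item[1]):
    print(f"{model:9s} {drop:6.2f}")
```

A smaller drop indicates a model whose accuracy is less affected by the rotation and background noise; sorting by the drop makes that ranking explicit.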
The CNN networks designed for MNIST classification show reduced classification performance when retrained on the MNIST-rot-back-image dataset. The reduction in classification performance is attributed to large intra-class variations and greater noise in the data [35]. The classification accuracy of FCCNN [35] reduces to 66.40%, a decrease in accuracy of 31.17% in comparison to the MNIST dataset, while the classification accuracies of the ResNet18 and VGG16 networks reduce to 76.32% and 84.71% respectively. The autonomous evolution based methodologies produced similar results, with IPPSO [37] reporting a classification accuracy of 65.50% (decrease of 33.29%) and EvoCNN [38] reporting 62.62% accuracy (decrease of 36.2%). On the MNIST-rot-back-image dataset AutoCNN achieves the state-of-the-art classification accuracy of 86.78%, an improvement of 20.38% over previous methodologies in literature (relative to the 66.40% of FCCNN). Moreover, compared with the previous methodologies in literature, the classification performance of AutoCNN was least affected by noise and intra-class variations in the dataset (lowest change in accuracy between the MNIST dataset and the MNIST-rot-back-image dataset).

TABLE III
CLASSIFICATION PERFORMANCE ON FASHION-MNIST DATASET

Model       Depth        Params (M)     Acc. (%)
AutoCNN     19 ± 1.73    1.44 ± 0.26    94.42 ± 0.07
EvoCNN      N/A          6.68           94.53
ResNet18*   18           11             94.90
VGG16*      16           26             93.50
*Hand-designed CNN architecture.

3) Fashion-MNIST: The performance of AutoCNN on the Fashion-MNIST dataset in comparison to other methodologies in the literature, viz. EvoCNN, ResNet18 and VGG16, is presented in Table III. A meta-heuristic genetic-based CNN, EvoCNN [38], achieved a classification accuracy of 94.53%, while the ResNet18 and VGG16 hand-designed architectures reported classification accuracies of 94.90% and 93.50% respectively. The classification performance of the ResNet18 CNN classifier represents the state-of-the-art accuracy on the Fashion-MNIST dataset [36]. The residual connections of the ResNet architecture are vital to prevent loss of information during the learning process in datasets like Fashion-MNIST, which are characterized by low resolution images and a large range of feature distribution [28]. AutoCNN achieved a classification performance of 94.42% on the Fashion-MNIST dataset, which is comparable to the state-of-the-art methodologies in literature. However, the data-driven evolution strategy ensures that the AutoCNN network achieves comparable classification performance with the smallest number of network parameters (about 21% of the size of EvoCNN, the next smallest network in Table III), reducing the memory and space complexity.

4) ADHD200: The ADHD200 dataset is characterized by small sample size, large feature dimension, low inter-class separability and high intra-class variability, which presents a challenging problem for CNN based classifiers [31]. Among the CNN based approaches to classify ADHD using rs-fMRI data, [39] proposed a 6-layer 3D-CNN based approach and achieved an accuracy of 69.15%, and [32] achieved the maximum classification accuracy of 70.36% using a learnable latent space transformation based CNN model. However, the AutoCNN based data-driven architecture achieves the state-of-the-art classification performance of 74.72%. The 4.36% improvement in performance on the ADHD200 dataset demonstrates the scalability and adaptability of AutoCNN to datasets with high variance and small sample size.

TABLE IV
CLASSIFICATION PERFORMANCE ON ADHD200 DATASET

Model       Depth        Params (M)       Acc. (%)
AutoCNN     11 ± 2.64    0.083 ± 0.088    74.72 ± 2.72
3D CNN*     6            1                69.15
DTM*        3            N/A              70.36
*Hand-designed CNN architecture.

D. Auxiliary studies

TABLE V
AUXILIARY STUDY RESULTS ON ADHD200 DATASET

Model       Depth        Acc. (%)
AutoCNN     11 ± 2.64    74.72 ± 2.72
Model I     9            75.45 ± 4.26
Model II    2 + 4*       81.32 ± 5.66
*Evolved by AutoCNN. Model I: AutoCNN generated network, re-trained without network evolution. Model II: AutoCNN initialized with DTM, re-trained with network evolution.

TABLE VI
AUXILIARY STUDY RESULTS ON MNIST-ROT-BACK-IMAGE DATASET

Model       Depth           Acc. (%)
AutoCNN     17.33 ± 4.72    86.78 ± 0.66
Model I     18              87.76 ± 0.38
Model III   9 + 11*         88.75 ± 0.23
*Evolved by AutoCNN. Model I: AutoCNN generated network, re-trained without network evolution. Model III: AutoCNN initialized with AutoCNN generated network on MNIST dataset, re-trained with network evolution.

An ablation study simulating real-world use case scenarios of the proposed AutoCNN was conducted. First, the CNN architectures for the ADHD200 and MNIST-rot-back-image datasets were generated using AutoCNN. The generated networks are re-initialized using random weights. The re-initialized AutoCNN networks are further retrained (without network evolution) using the individual datasets to verify the network performance. These re-trained AutoCNN models are hereby referred to as Model I. From the first and second rows of Table V and Table VI, it can be seen that the classification performance of Model I is comparable with the baseline AutoCNN network on both datasets. Further, statistical analysis reveals that the difference in classification accuracy is statistically not significant (p = 0.74) and the transfer of
network architecture into a new environment after re-training does not lead to any performance loss. These results indicate that the network architecture derived using AutoCNN is robust and has a high degree of result reproducibility.

Second, the AutoCNN was initialized with a hand-designed baseline CNN classifier and the network evolution strategies were used to optimize the network for better classification accuracy. The networks thus obtained are hereby referred to as Model II. In this paper, the optimization performance of AutoCNN was validated using the ADHD200 dataset, wherein the DTM [32] (with 2 convolutional stages, a single fully connected layer and without the transformation layer) was used to initialize the AutoCNN. The convolutional layer growing and filter pruning strategies explained in Section III were used to optimize the baseline model autonomously. When used as a network optimizer, AutoCNN appended four additional convolutional stages without pooling layers to achieve a 10.96% increase in classification accuracy over the baseline classifier. From the above ablation study it can be inferred that AutoCNN can be effectively used as an optimization tool to improve the classification performance of hand-designed CNN architectures.

Next, the AutoCNN was used for adaptive noise compensation using network evolution. In this paper, the MNIST-rot-back-image dataset, representing the MNIST dataset randomly rotated and induced with background noise, was used to validate the ability of AutoCNN to adapt to the addition of noise in the input data. The AutoCNN network pre-trained on the MNIST dataset was used for initialization, and the network evolution strategies explained in Section III were used to further re-train and optimize the architecture using the MNIST-rot-back-image dataset. The MNIST AutoCNN architectures thus re-trained are hereby referred to as Model III. The results of the Model III CNN architecture re-trained using the MNIST-rot-back-image dataset are presented in Table VI. Eleven additional convolutional layers were added to the baseline AutoCNN network while re-training to adapt to the noise in the input data. The re-trained Model III network further improved the baseline classification performance by 2% and obtained an accuracy of 88.75% on the MNIST-rot-back-image dataset. This auxiliary study shows that AutoCNN can be used to effectively adapt deep networks to dynamic changes in the input noise and provide consistent classification performance.

Finally, a study to demonstrate that the AutoCNN generated network structure has better generalization power compared to handcrafted architectures was also conducted. This analysis is based on the Generalization Bound derived from Statistical Learning Theory (SLT) [40]. The generalization power is formulated as the expected risk R(f) bounded by the empirical risk R_emp(f) plus its variance, written as follows [41]:

\[ R(f) \leq R_{emp}(f) + \text{variance} \tag{8} \]

\[ R_{emp}(f) = \frac{1}{n} \sum_{i=1}^{n} L(f, T) \tag{9} \]

where f is a set of classification functions which performs input to output mapping. The empirical risk R_emp(f) is defined as the loss measured on given training samples as in Equation 9. The cross-entropy loss function L(f, T) is employed to evaluate the performance of a model f trained using a set of training data T.

In a real world situation, it is impossible to calculate the expected risk R(f) since there is no access to the unseen data. To estimate the generalization power of a model, one should obtain the upper-bound of Equation 8. This is achievable by training and testing a model on the given training samples under the reversed k-fold cross-validation protocol illustrated in Figure 3. The model is trained using a 1/k-th part of the data and tested using the rest of it, resulting in a large strain on the network performance. The classification performance under such challenging conditions is recorded as R_emp(f). By iteratively testing over the k folds, we compute the variance in the network performance and calculate the upper-bound performance of a model (R_emp(f) + variance). The upper-bound performance represents the generalization power of the model on the testing data. The reversed cross-validation test is conducted to simulate the most difficult situation, where the amount of training data is at most comparable to the amount of testing data, and thus the evaluation result truly reflects the performance of a network.

[Figure 3: diagram of folds 1 to 10, marking for each fold which part of the data is used as training data and which as testing data.]

Fig. 3. An example illustrating the reversed k-fold cross-validation where k = 10. In addition to obtaining a true representation of the results under the most challenging environment, it also aims to get the variance of a network for calculating the upper-bound of Equation 8.

The generalization power of AutoCNN is validated using the MNIST-rot-back-image dataset, since AutoCNN outperformed previous methods in literature and attained a classification accuracy of 86.78% despite the noisy characteristics of the data. It is to be noted that the MNIST-rot-back-image dataset contains 12K samples, resulting in 1.2K training samples and 10.8K testing samples per fold. The reversed cross-validation test results of AutoCNN were compared against the popular ResNet18 and VGG11 networks. For this study, the smallest VGG network was considered to minimize the risk of overfitting (as the number of training samples is less than 2K). The network parameters were re-initialized and all the networks were trained using a uniform training environment. The results are presented in Table VII. From Table VII it can be observed that the AutoCNN generated network attained the lowest upper-bound value (max(R_emp(f)) + variance = 2.04), which signifies that the performance of the AutoCNN generated CNN network is superior on unseen samples possessing the same data distribution. The generalization performance test validates the AutoCNN result on the MNIST-rot-back-image problem reported
in Table II and at the same time confirms the effectiveness of the AutoCNN learning policy in constructing a CNN structure for any given problem.

TABLE VII
NETWORK STRUCTURE PERFORMANCE ON MNIST-ROT-BACK-IMAGE

                             R_emp(f)
Fold                         AutoCNN    ResNet18*    VGG11*
1                            1.79       3.77         3.19
2                            1.76       3.36         4.88
3                            2.01       3.28         3.69
4                            1.52       3.46         4.39
5                            1.86       3.52         4.21
6                            1.96       3.47         4.44
7                            1.52       3.55         4.35
8                            1.79       3.29         3.80
9                            1.58       3.26         4.50
10                           1.85       3.51         3.58
variance                     0.027      0.022        0.240
max(R_emp(f))                2.01       3.77         4.88
max(R_emp(f)) + variance     2.04       3.79         5.12
*Hand-designed CNN architecture.

E. Related Works

Automated CNN architecture determination has been widely explored in literature. Previous methodologies like FCAE [11], IPPSO [37] and EvoCNN [38] have proposed Evolutionary Computation (EC) for the automatic design of the CNN architecture. These EC methodologies for architecture determination are inspired by biological evolution cycles, wherein the network parameters are updated using biological evolution inspired approaches and the intermediate solutions are evaluated using quantitative measures. The above methods are heuristic in nature and determination of the optimal solution requires iterative exploration of the solution space over a large number of trials. For deep networks with multiple layers, such heuristic approaches are not feasible. Bayesian optimization strategies have also been used for automatic determination of network architectures and their hyper-parameters [7], [8]; however, their performance was unable to match that of the hand-crafted networks. With the advent of deep learning, various reinforcement learning based approaches have been proposed [12], [14]. The reinforcement learning based approaches are modelled as a closed-loop controller and a child network, wherein the primary task of the controller is to determine the best child network for a specific problem. As reported in [12], such approaches are computationally expensive, requiring 800 GPUs and thousands of GPU days of training to determine a single network architecture. To address these shortcomings in the previous studies, a novel data-driven approach to CNN architecture determination is proposed in AutoCNN. Further, the proposed AutoCNN is a computationally inexpensive approach (requiring two GeForce GTX 1080 GPUs and 8 to 40 hours of training) and enables scaling from small datasets to large datasets. AutoCNN proposes a novel Feature Separability Score (FSS) that measures the intra-network feature separability, to measure the network capacity and determine appropriate evolution strategies. The FSS and the training error (E_Tr) are utilized to determine the convolution layer growing, filter pruning and automatic stopping strategies. The results reported in Tables I and II and in [11] show that the implementation of convolutional stage growing strategies helps AutoCNN achieve better classification performance on the MNIST dataset in comparison to the heuristic network determination methods in literature.

Moreover, the previous methods in literature use a pre-determined number of epochs for training and hence are prone to overfitting. AutoCNN proposes a novel automatic early stopping criterion to prevent overfitting and minimize the computation cost. It is worth mentioning that the results of the previous methodologies report only the classification accuracy on the dataset, which does not reflect the effect of noise on the network performance and the generalization power of a network. Therefore, in this paper, AutoCNN is evaluated for performance reproducibility (Model I), its ability to optimize existing CNN networks (Model II) and its ability to adapt to the introduction of noise in the dataset (Model III). The results of the above auxiliary studies indicate that the AutoCNN performance is scalable, adaptable and robust to noise, and that the results are highly generalizable. Therefore, AutoCNN is a reliable tool for automatic CNN network architecture determination for any given dataset.

V. CONCLUSIONS

The proposed AutoCNN methodology enables the derivation of the CNN network architecture in a data-driven approach. It utilizes the convolution layer growing, filter pruning and early stopping criteria to modulate the architecture using an iterative strategy. A novel Feature Separability Score (FSS) is proposed to measure the spatial feature separability of the network. The performance of AutoCNN is tested using the MNIST, MNIST-rot-back-image, Fashion-MNIST and ADHD200 datasets to ensure reliability and scalability. The results indicate that AutoCNN outperforms all the evolutionary learning-based architecture determination methodologies in two of the four datasets. It also achieves the state-of-the-art classification performance of 86.78% and 74.72% on the MNIST-rot-back-image and ADHD200 datasets, respectively. Even though the proposed AutoCNN methodology has been shown in this paper to optimize convolutional neural network architectures, the data-driven architecture determination strategy proposed here can be adopted for the optimization and/or identification of any deep neural network architecture.

REFERENCES

[1] P. Sermanet and Y. LeCun, “Traffic sign recognition with multi-scale convolutional networks,” in IJCNN, 2011, pp. 2809–2813.
[2] C. Pelletier, G. I. Webb, and F. Petitjean, “Temporal convolutional neural network for the classification of satellite image time series,” Remote Sensing, vol. 11, no. 5, p. 523, 2019.
[3] G. Huang, Y. Sun, Z. Liu, D. Sedra, and K. Q. Weinberger, “Deep networks with stochastic depth,” in European conference on computer vision. Springer, 2016, pp. 646–661.
[4] J. D. Schaffer, D. Whitley, and L. J. Eshelman, “Combinations of genetic algorithms and neural networks: A survey of the state of the art,” in [Proceedings] COGANN-92: International Workshop on Combinations of Genetic Algorithms and Neural Networks. IEEE, 1992, pp. 1–37.
[5] K. O. Stanley and R. Miikkulainen, “Evolving neural networks through augmenting topologies,” Evolutionary computation, vol. 10, no. 2, pp. 99–127, 2002.
[6] N. Pinto, D. Doukhan, J. J. DiCarlo, and D. D. Cox, “A high-throughput screening approach to discovering good forms of biologically inspired visual representation,” PLoS Comput Biol, vol. 5, no. 11, p. e1000579, 2009.
[7] B. Shahriari, K. Swersky, Z. Wang, R. P. Adams, and N. De Freitas, “Taking the human out of the loop: A review of bayesian optimization,” Proceedings of the IEEE, vol. 104, no. 1, pp. 148–175, 2015.
[8] J. Snoek, H. Larochelle, and R. P. Adams, “Practical bayesian optimization of machine learning algorithms,” Advances in neural information processing systems, vol. 25, pp. 2951–2959, 2012.
[9] B. Wang, Y. Sun, B. Xue, and M. Zhang, “Evolving deep convolutional neural networks by variable-length particle swarm optimization for image classification,” in 2018 IEEE Congress on Evolutionary Computation (CEC). IEEE, 2018, pp. 1–8.
[10] J. Liang, E. Meyerson, B. Hodjat, D. Fink, K. Mutch, and R. Miikkulainen, “Evolutionary neural automl for deep learning,” in Proceedings of the Genetic and Evolutionary Computation Conference, 2019, pp. 401–409.
[11] Y. Sun, B. Xue, M. Zhang, and G. G. Yen, “A particle swarm optimization-based flexible convolutional autoencoder for image classification,” IEEE Transactions on Neural Networks and Learning Systems, vol. 30, no. 8, pp. 2295–2309, 2019.
[12] B. Zoph and Q. V. Le, “Neural architecture search with reinforcement learning,” arXiv preprint arXiv:1611.01578, 2016.
[13] H. Pham, M. Y. Guan, B. Zoph, Q. V. Le, and J. Dean, “Efficient neural architecture search via parameter sharing,” arXiv preprint arXiv:1802.03268, 2018.
[14] C. Liu, B. Zoph, M. Neumann, J. Shlens, W. Hua, L.-J. Li, L. Fei-Fei, A. Yuille, J. Huang, and K. Murphy, “Progressive neural architecture search,” in Proceedings of the European conference on computer vision (ECCV), 2018, pp. 19–34.
[15] H. Cai, J. Yang, W. Zhang, S. Han, and Y. Yu, “Path-level network transformation for efficient architecture search,” arXiv preprint arXiv:1806.02639, 2018.
[16] H. Liu, K. Simonyan, and Y. Yang, “Darts: Differentiable architecture search,” arXiv preprint arXiv:1806.09055, 2018.
[17] M. Pratama, C. Za’in, A. Ashfahani, Y. S. Ong, and W. Ding, “Automatic construction of multi-layer perceptron network from streaming examples,” in Proceedings of the 28th ACM International CIKM, 2019.
[18] G. bin Huang, P. Saratchandran, and N. Sundararajan, “A generalized growing and pruning rbf (ggap-rbf) neural network for function approximation,” IEEE Transactions on Neural Networks, vol. 16, pp. 57–67, 2005.
[19] M. Das, M. Pratama, S. Savitri, and Z. Jie, “Muse-rnn: A multilayer self-evolving recurrent neural network for data stream classification,” in 19th IEEE International Conference on Data Mining, 08 2019.
[20] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” Communications of the ACM, vol. 60, no. 6, pp. 84–90, 2017.
[21] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, 2016, pp. 770–778. [Online]. Available: https://doi.org/10.1109/CVPR.2016.90
[22] Y. LeCun et al., “Lenet-5, convolutional neural networks,” URL: http://yann.lecun.com/exdb/lenet, vol. 20, no. 5, p. 14, 2015.
[23] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
[24] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 1–9.
[25] Y. LeCun, K. Kavukcuoglu, and C. Farabet, “Convolutional networks and applications in vision,” in Proceedings of 2010 IEEE International Symposium on Circuits and Systems. IEEE, 2010, pp. 253–256.
[26] G. Glendrange and S. Tveiten, “Testing the performance of simple moving average with the extension of short selling,” Master’s thesis, Universitetet i Agder; University of Agder, 2016.
[27] H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf, “Pruning filters for efficient convnets,” arXiv preprint arXiv:1608.08710, 2016.
[28] J. Zhang, T. Liu, and D. Tao, “An information-theoretic view for deep learning,” arXiv preprint arXiv:1804.09060, 2018.
[29] Y. LeCun and C. Cortes, “MNIST handwritten digit database,” 2010. [Online]. Available: http://yann.lecun.com/exdb/mnist/
[30] H. Xiao, K. Rasul, and R. Vollgraf, “Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms,” arXiv preprint arXiv:1708.07747, 2017.
[31] A. M. S. Aradhya, V. Subbaraju, S. Sundaram, and N. Sundararajan, “Regularized spatial filtering method (r-sfm) for detection of attention deficit hyperactivity disorder (ADHD) from resting-state functional magnetic resonance imaging (rs-fmri),” in 2018 40th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), July 2018, pp. 5541–5544.
[32] A. M. S. Aradhya, A. Joglekar, S. Suresh, and M. Pratama, “Deep transformation method for discriminant analysis of multi-channel resting state fmri,” in Thirty-Third AAAI Conference on Artificial Intelligence, 2019.
[33] P. Bellec, C. Chu, F. Chouinard-Decorte, Y. Benhajali, D. S. Margulies, and R. C. Craddock, “The neuro bureau adhd-200 preprocessed repository,” NeuroImage, vol. 144, pp. 275–286, 2017, data Sharing Part II. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S105381191630283X
[34] F. E. Fernandes and G. G. Yen, “Automatic searching and pruning of deep neural networks for medical imaging diagnostic,” IEEE Transactions on Neural Networks and Learning Systems, pp. 1–11, 2020.
[35] G. Qian and L. Zhang, “A simple feedforward convolutional conceptor neural network for classification,” Applied Soft Computing, vol. 70, pp. 1034–1041, 2018.
[36] F. Assunçao, N. Lourenço, P. Machado, and B. Ribeiro, “Denser: deep evolutionary network structured representation,” Genetic Programming and Evolvable Machines, vol. 20, no. 1, pp. 5–35, 2019.
[37] B. Wang, Y. Sun, B. Xue, and M. Zhang, “Evolving deep convolutional neural networks by variable-length particle swarm optimization for image classification,” CoRR, vol. abs/1803.06492, 2018. [Online]. Available: http://arxiv.org/abs/1803.06492
[38] Y. Sun, B. Xue, and M. Zhang, “Evolving deep convolutional neural networks for image classification,” CoRR, vol. abs/1710.10741, 2017. [Online]. Available: http://arxiv.org/abs/1710.10741
[39] L. Zou, J. Zheng, C. Miao, M. J. Mckeown, and Z. J. Wang, “3d cnn based automatic diagnosis of attention deficit hyperactivity disorder using functional and structural mri,” IEEE Access, vol. 5, pp. 23626–23636, 2017.
[40] V. Vapnik, The nature of statistical learning theory. Springer science & business media, 2013.
[41] R. F. de Mello, “On the shattering coefficient of supervised learning algorithms,” arXiv preprint arXiv:1911.05461, 2019.
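
As a supplementary illustration of the reversed k-fold protocol and the upper bound of Equation 8 discussed in the auxiliary studies, the following minimal sketch generates reversed folds and recomputes the last rows of Table VII from its per-fold values. The function names are hypothetical, the contiguous fold layout is an assumption, and the use of the population variance is also an assumption: the text does not state the estimator, but this choice reproduces the tabulated variances.

```python
from statistics import pvariance


def reversed_kfold_indices(n_samples, k):
    """Yield (train_idx, test_idx): one fold trains, the other k-1 folds test."""
    fold_size = n_samples // k
    indices = list(range(n_samples))
    for fold in range(k):
        start = fold * fold_size
        # The last fold absorbs any remainder when n_samples is not divisible by k.
        stop = n_samples if fold == k - 1 else start + fold_size
        yield indices[start:stop], indices[:start] + indices[stop:]


def upper_bound(fold_risks):
    """max(R_emp) + variance, with the population variance taken over the folds."""
    return max(fold_risks) + pvariance(fold_risks)


# Per-fold empirical risks copied from Table VII.
TABLE_VII = {
    "AutoCNN": [1.79, 1.76, 2.01, 1.52, 1.86, 1.96, 1.52, 1.79, 1.58, 1.85],
    "ResNet18": [3.77, 3.36, 3.28, 3.46, 3.52, 3.47, 3.55, 3.29, 3.26, 3.51],
    "VGG11": [3.19, 4.88, 3.69, 4.39, 4.21, 4.44, 4.35, 3.80, 4.50, 3.58],
}

# With 12K samples and k = 10, each fold trains on 1.2K and tests on 10.8K samples.
train_idx, test_idx = next(reversed_kfold_indices(12_000, 10))
print(len(train_idx), len(test_idx))

for name, risks in TABLE_VII.items():
    print(name, round(pvariance(risks), 3), round(upper_bound(risks), 2))
```

In a full experiment the per-fold risks would come from training each network on `train_idx` and evaluating the cross-entropy loss on `test_idx`; here they are taken from Table VII so that the arithmetic of the bound can be checked in isolation.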
