
Investigating aesthetic image classification

with convolutional neural networks


Bachelor thesis

David van der Linde


Matr. no. 797985
Berlin, April 9th, 2018

Beuth University of Applied Sciences


Media Informatics Online
1st examiner: Prof. Dr. Agathe Merceron
2nd examiner: Prof. Dr. Simone Strippgen
Abstract
With the growing number of photos taken and shared by many people, it becomes harder to
filter out good photos from among the masses of mediocre ones. Since machine learning and
convolutional neural networks (CNNs) have proven to be excellent at object and pattern
recognition in images, they may also be useful for the classification of images by aesthetic
criteria. In this work, a CNN was trained on generated images as well as on real photos to
classify them by their composition and sharpness. The CNN was successful at learning and
classifying images by both criteria, reaching good to high scores in the evaluation. When
approaching these two criteria as individual subproblems, the classification of composition
turned out to be easier than the classification of sharpness. When the two subproblems were
combined into a single dataset of images, the CNN was better at distinguishing the images by
their sharpness than by their composition. In general, it became clear that two of the most
important factors for a good performance are the size and quality of the dataset, so that the
CNN is able to recognize and learn the relevant features. Once a sensible “basic” network
configuration is established, the further fine-tuning of the network configuration has less
influence on the network’s performance than the size and quality of the dataset.

Table of contents
Abstract ..................................................................................................................................... I
Table of contents ..................................................................................................................... II
1 Introduction ....................................................................................................................... 1
1.1 Motivation .................................................................................................................. 1
1.2 The aim of this work................................................................................................... 2
1.3 Disambiguation .......................................................................................................... 3
2 Machine learning fundamentals......................................................................................... 5
2.1 History of artificial intelligence .................................................................................. 5
2.2 Machine learning ........................................................................................................ 6
2.3 Artificial neural networks ........................................................................................... 8
2.4 How artificial neural networks learn ........................................................................ 10
2.5 Hyperparameters ...................................................................................................... 14
2.6 Convolutional neural networks................................................................................. 17
2.7 Popular CNN architectures ....................................................................................... 20
3 Applying machine learning to image aesthetics ............................................................... 23
3.1 Determining the subproblems................................................................................... 23
3.2 Related work ............................................................................................................ 23
4 Implementation ............................................................................................................... 25
4.1 Collection or generation of datasets ......................................................................... 25
4.2 Implementation of the CNN training and classification application .......................... 27
4.3 The configuration of the initial CNN ........................................................................ 34
4.4 Subproblem 1: composition ...................................................................................... 36
4.5 Subproblem 2: sharpness.......................................................................................... 47
4.6 Combining composition and sharpness..................................................................... 56
5 Summary and conclusions ............................................................................................... 59
Literature ................................................................................................................................ IV
Table of figures ....................................................................................................................... VI
Appendices............................................................................................................................ VIII

1 Introduction

1.1 Motivation
With the ever-increasing presence of digital cameras, more and more photos are taken, shared
and stored. While the number of photos taken is continuously growing, the quality of the
average photo may not, just because it is very easy to “snap away” without thinking or caring
about the final result. In many cases, these quick snapshots do not need to be of high quality or
aesthetic value, because their only purpose is to quickly share something with someone else. In
other cases, however, the purpose is to keep a memory or indeed take a nice photo. Since most
of these photos end up in the same location, stored in a folder or an album on a phone or a
computer, the result is often a mix of many mediocre photos which could be discarded and only
a few high-value photos which are worth keeping, sharing and showing, or even printing.

From my own experience and from people around me, I know that cleaning up the accumulated masses of photos and separating the good photos from the bad ones is often considered a difficult and time-consuming burden. This often ends in resignation: the masses of photos are left untouched, and the high-value photos that do exist somewhere among them are never enjoyed.

Even though many people do not consider the composition, lighting, color or sharpness of a scene while taking a photo, these characteristics influence whether a photo is judged good or bad when a selection is made later on. Experienced photographers may consciously judge a photo based on these criteria, but even a layman's judgment will be influenced by them unconsciously.

There are at least two ways in which computers could be of help in getting the best out of our
photos:

1. A digital camera could provide live feedback for the user while framing a shot. It could,
for example, inform the user that the current composition is not optimal. The user could
then move the camera or zoom in or out to improve the composition. The camera’s
ability to auto-correct color, sharpness, and contrast is continuously improving but still
limited. Therefore, the camera’s software could advise a user to consider adding color
to the frame if the resulting photo would otherwise become dull, or turn to a different
viewing angle for better lighting or composition.
2. The photo management software on a user’s smartphone or computer could recommend
and highlight photos that are likely good, or it could save much work in filtering out
obviously bad photos. Also, it could give a user feedback as to which specific features
(composition, sharpness, contrast) of a photo are likely to be good or bad.

While some of a photo’s characteristics such as color saturation and contrast can be calculated
and visualized through histograms, that information may still not be sufficient to evaluate the
aesthetic value of a photo. Also, a user may not know how to read and understand the
information available in a histogram. Lastly, there is more to aesthetics than only numbers, for
example when trying to evaluate the composition. So, the question to be answered in this work is: how can a computer help in evaluating photos and pointing out the qualities and weaknesses of individual photos?

1.2 The aim of this work


The objective of this investigation is to get an understanding of the possibilities of using machine
learning, and more specifically convolutional neural networks (CNNs), in the evaluation of
image aesthetics. Convolutional neural networks have proven to be excellent at learning and
recognizing patterns and objects in images. This technology is already used for many different
tasks in a wide range of areas [1]. Since convolutional neural networks learn “by themselves”
which features are relevant to pattern and object recognition, they might also be successful at
recognizing and learning the aesthetic-related features of photos.

While convolutional neural networks are usually trained to recognize and classify specific patterns
or objects such as birds, faces or written
characters, the goal of this investigation is to train
a network to classify images by “how” or “where”
the subject is, not “what” the subject is. As
illustrated by the example in figure 1, the goal is
not to classify these images by their subjects
“bird” and “car”, but by their aesthetic features
“sharp” and “blurred”.

Figure 1. Example of desired classification.

The aesthetic value of a photo depends on many different characteristics. Not all of these
characteristics are suitable for the classification by a neural network. Besides, the limited time
available for the investigation makes it necessary to focus on only a few selected characteristics,
which will be image composition and sharpness, followed by a combination of these two.

Obviously, the aesthetic value of a photo also depends on the actual content and the way a
viewer relates to the subject of the photo. A photo of people may convey strong emotions, but
it may as well be dull and expressionless. A photo of a certain memorable event may be of high
value to one viewer while being meaningless to another. These topics are not part of the
investigation, as they require a much wider understanding of the relations between elements
within a photo, as well as emotion recognition or even contextual knowledge regarding the
photo and the viewer. The investigation will, therefore, be focused on the more “technical” and
objective aspects of photography.

1.3 Disambiguation
1.3.1 “Image” vs. “photo”
While the motivation behind this investigation comes from an interest in photography and the
final objective is focused on the classification of photos, in much of this work the broader term
“image” is used instead. This is because part of the investigation is done using training and test
data consisting of programmatically generated (drawn) images, not actual photos. Also, many
of the aspects of image aesthetics apply not only to photography but to drawings and paintings as
well.

1.3.2 “Good” vs. “bad”


The terms “good” and “bad” are used in this work to simplify the description of characteristics
which are commonly desired or undesired in photography. For example, when discussing
composition, “good” would refer to a composition which is generally considered attractive or
interesting, whereas “bad” would mean the composition is unattractive. This does not, however,
mean that the photo as a whole is “ugly”, especially not to the individual viewer. As explained
before, the aesthetic value of a photo consists of many factors, as well as personal taste and
context.

2 Machine learning fundamentals

2.1 History of artificial intelligence


The basic idea of artificial intelligence (AI) has been around for centuries. “Mechanical men”
and humanoid automatons were built by many civilizations throughout history. People in
ancient Greece and Egypt believed sacred statues had real minds and were capable of wisdom
and emotion [2].

Modern AI has its beginnings in the studies of “formal reasoning” by Chinese, Indian and Greek philosophers and scholars such as Aristotle and al-Khwārizmī, who developed methods and models for mechanizing human thought and reasoning [3].

In the 20th century, continued progress in the field of mathematical logic led to the Church-
Turing thesis, which states that any effective procedure can be executed by a (generalized)
computer. This theory was supported by the Turing Machine, a model of an abstract machine
which is capable of executing any algorithm, no matter how complex it is. This invention ignited
the discussion about the possibility of thinking machines [2], [3].

So, while the term “machine learning” has greatly gained popularity in recent years, the roots of today's machine learning date back over 70 years. At a lecture to the London Mathematical
Society on 20 February 1947, Alan M. Turing, the computer science pioneer who played a vital
role in cracking the encryption of the Enigma machine used by the German army in World War
II, stated: “What we want is a machine that can learn from experience.” [4]. This is exactly what
today’s machine learning is all about. Turing’s work on the concepts of AI remains highly influential today, an example being the Turing Test, a test by which a machine’s
ability to behave intelligently can be evaluated. In his article “Computing Machinery and
Intelligence” he poses the question: “Can machines think?”, which he then discusses using “The
Imitation Game”, a theoretical test by which an observer has the task of determining which of
two participants is a machine, and which is a human. The observer can only base his judgment
on the participants’ written answers to his written questions. In this test, the machine’s goal is
to “trick” the observer into believing the machine is a human. Turing predicts that machines’
capabilities will improve sufficiently to finally achieve this goal: “I believe that at the end of the
century the use of words and general educated opinion will have altered so much that one will be
able to speak of machines thinking without expecting to be contradicted.” [5]. Even though his
prediction may have been a little too optimistic, today’s AI indeed matches or even exceeds
human performance in many specific tasks.

In the same period research in neurology showed that the brain consisted of a network of
neurons that fire impulses. Due to the similarity to electrical networks, it was suggested that it
might be possible to create an “electronic brain” [2].

At a workshop at Dartmouth College in 1956, Marvin Minsky, John McCarthy, Arthur Samuel
and other scientists laid the foundations of AI research. McCarthy first introduced the term
“artificial intelligence” on this occasion. During and following the workshop the scientists and
their students created programs that solved various logic problems and others that played
checkers and chess, soon to play better than the average human [2], [6].

In the 1960’s, AI research was booming: programs were able to handle increasingly complex problems, and scientists were optimistic about creating a “fully intelligent machine” in less than 20 years. Much of the research in this period was financed by government agencies such as DARPA [7].

In the early 1970’s, scientists came to a point where the difficulty of solving some vital problems slowed down development, one of these problems being the lack of computing power. In 1973, James Lighthill, a British applied mathematician, published a report critically reviewing the progress in the field of AI. This report was a trigger for governments to cut funding, leading to the “AI winter” that lasted for almost a decade [7].

In the early 1980’s AI research regained momentum because of expert systems, systems that
were programmed to solve particular tasks. These systems were relatively easy to build and
useful, creating a market of several billion dollars. However, the success of AI development created a bubble of exaggerated expectations, which burst by 1987, partly because by this time “simple” desktop computers had become more powerful than some of the AI systems. This was when
the second “AI winter” started [2], [7].

By the mid-1990’s, while AI still didn’t have a good reputation, it became more successful than
it had ever been. Due to the skepticism that still existed, enthusiasm and expectations were at
a reasonable level. At the same time, improved computing power and focusing on specific
problems brought clear results [2].

Successful solutions such as data mining, speech recognition, and Google’s search engine were
products of AI research and development but weren’t often labeled as such, because of the bad
reputation the term “artificial intelligence” still had.

Since the beginning of the 21st century, AI technology continued to improve and impress due
to the availability of large amounts of data, advanced machine learning techniques, and the
ever-increasing computing power. AI is now used in many fields such as economics, ecology, medicine, and image, video and text processing. Various machine learning libraries make it
possible for the average programmer or statistician to apply machine learning to their problems
and explore the promising field of artificial intelligence.

2.2 Machine learning


As mentioned in the previous chapter, artificial intelligence is a broad field, ranging from the ancient sacred statues with mythical “minds” to the most advanced algorithms used in modern computer applications. Machine learning is part of the latter and started to take shape around
the 1950’s. According to a popular paraphrase of a journal article from 1959 by Arthur Samuel,
machine learning is “a field of computer science that gives computers the ability to learn without
being explicitly programmed” [6]. Machine learning algorithms are used to create models from
existing data, to make predictions or decisions based on these models.

Machine learning can be divided into two categories: supervised learning and unsupervised
learning.

2.2.1 Supervised learning
During supervised learning, there is a training phase in which the learning algorithm is
presented with labeled example data. The algorithm’s goal is to create a generalized formula or
model that best fits the data. After training on a preferably large number of examples, the model
is evaluated by feeding it new, unseen examples. The label predicted by the model is then
compared to the actual label of the example. In the ideal case, the model’s prediction matches
the actual label or “ground truth” [8].

Examples of supervised learning:

• Email spam filters. An email spam filter is trained by feeding it email messages labeled
as “spam” or “not spam”. The filter (model) generates rules that best describe the
characteristics of these two groups of messages. The filter is then used to label new
incoming email messages. Whenever a message is incorrectly labeled, and the user
manually marks the email as being “spam” or “not spam”, the filter (the model) is
improved.
• Optical character recognition (OCR). Optical character recognition is widely used by
banks, mail delivery services and other companies that have to process large amounts
of written text, either hand-written or printed. The machine learning model is trained
by feeding it labeled examples of written characters, such as images of a “1” labeled as
“1”, images of a “2” labeled as “2”, et cetera. The model learns the specific features of
each of these characters to recognize future examples. OCR can be incredibly accurate
in recognizing text, which is, of course, of great importance due to the consequences
that false labeling could have. It is also used to convert printed documents into digital
archives and for automatic license plate recognition.

The AI effect
The email spam filter and OCR are just a few of the examples of technical solutions that fell
prey to the so-called “AI effect”. Pamela McCorduck describes this effect as follows:
“It's part of the history of the field of artificial intelligence that every time somebody figured out how to make a computer do something—play good checkers, solve simple but relatively informal problems—there was a chorus of critics to say, 'that's not thinking'” [2].
Alternatively, as Rodney Brooks, the director of MIT's Artificial Intelligence Laboratory said in
an interview for WIRED Magazine: “Every time we figure out a piece of it, it stops being magical;
we say, 'Oh, that's just a computation'” [9].

2.2.2 Unsupervised learning


In unsupervised learning, the machine learning algorithm receives unlabeled data. Here, the
algorithm’s task is to identify the differences between samples and highlight the key features by
grouping, or “clustering”, the examples [10]. This learning method is useful when masses of
seemingly random or chaotic data are available, and a researcher would have difficulty
recognizing patterns using other methods.

In the field of IT security, unsupervised learning can be used to detect “abnormal” user behavior
which may be related to attacks that use previously unknown methods [11]. Another example of unsupervised learning is the analysis of students’ learning behavior in Massive Open Online
Courses (MOOCs) to optimize the structure and content of the course [12].

2.3 Artificial neural networks


Artificial neural networks (ANNs), or simply neural networks (NNs), are a machine learning method that is primarily used for supervised learning tasks such as text and speech recognition, medical analysis and diagnosis [13], and object and pattern recognition. NNs are inspired by the biological brain.

2.3.1 The biological neural network


Even though ANNs gained enormous popularity during recent years, the first description of how an ANN could work was written by the neuroscientist Warren S. McCulloch and the logician Walter Pitts in 1943. During the 1930’s and 1940’s, neurology research had shown that the brain is a network of neurons that process information by firing impulses, as shown in figure 2. McCulloch and Pitts realized that an artificial electric network could work in a similar way [14].

Figure 2. Biological neuron.

Neurons in the brain pass on information in the form of electrochemical impulses. A neuron
receives incoming impulses from other neurons through its dendrites and sends out impulses
through the axon towards the synapses at the axon terminals. The synapse is a small gap
between one neuron’s axon terminal and the next neuron’s dendrite. Only when an impulse is strong enough will it cause the release of chemicals (neurotransmitters) that transfer the impulse across the synapse to the next neuron. Learning in the brain happens by changing
the effectiveness of the synapses, which affects the influence of one neuron on another [15]. Of
course, this description is only a very basic summary, and much about the functioning of the
brain is still unknown.

2.3.2 The artificial neural network


The ANN is built similarly. In figure 3, the dots represent neurons, and the arrows represent
the impulses passed on between neurons. The ANN consists of several layers.

The first layer is the input layer, which receives the data to be learned. The number of neurons in the input layer depends on the number of input features. For example, in the case of weather data, the input could be temperatures in degrees Celsius, precipitation in millimeters and air pressure in millibars. These three features would be the input for three neurons in the input layer. For comparison, in the case of image classification, the number of neurons in the input layer would be equal to the number of pixels multiplied by three, because each pixel contains three numeric values for the red, green and blue color channels. This can result in thousands of neurons in the first layer alone.

Figure 3. Artificial neural network.

An ANN contains one or more hidden layers, the number of neurons per hidden layer can vary.
Since the actual learning happens in the hidden layers, the number and size of hidden layers
are usually increased for more complex problems.

The last layer is the output layer, which provides the predictions or classification. The size of
the output layer depends on the type of output that is desired from the network. In case of
diagnosing diseases, there could be only two neurons in the output layer, giving a “positive” or
“negative” diagnosis. In the case of image classification, the number of neurons in the output
layer would be equal to the number of labels used in the data.

In most ANNs, the layers are fully connected, meaning each neuron in a layer has a connection
to each neuron in the adjacent layers [16].

2.3.3 Perceptron and multilayer perceptron


The perceptron is one of the first artificial neural networks and dates back to the 1950’s. Figure
4 [10] shows the “Mark I Perceptron”, which was a machine designed for image recognition for
use by the US Navy. It had 400 photocells connected to artificial neurons. The weights were
updated physically by electric motors. This machine was a binary classifier, meaning its output
was either 0 or 1, depending on the input.

Figure 4. Mark I Perceptron.

The perceptron, which was very limited in its capabilities due to its binary output, was a
precursor to the multilayer perceptron. The multilayer perceptron (MLP), often called the
“vanilla” neural network, consists of at least three layers: the input layer, one or more hidden
layers, and the output layer.

2.3.4 Deep learning
While machine learning is a subfield of AI, deep learning is a specific class of machine learning algorithms, as visualized in figure 5 [15]. Deep learning algorithms are ANNs that contain multiple layers, where the output from one layer is used as the input for the next layer [1]. Each of the layers learns to recognize features at a different level of abstraction. In the case of face recognition, the first layer may recognize curves and edges; the second layer uses these low-level features to recognize noses, eyes, and mouths; the third layer uses these higher-level features to recognize faces. Often, the number of layers and levels of abstraction is increased for solving more complex problems. There is no uniform number that defines what is to be considered a “deep” network.

Figure 5. Relationship between AI and deep learning.

2.4 How artificial neural networks learn


A short but clear explanation of the learning process is provided on the website of OpenCV, a
computer vision library [17]: “Let’s say that the desired output of the network is y. The network
produces an output y'. The difference between the predicted output and the desired output (y - y) is
converted to a metric known as the loss function ( J ). The loss is high when the neural network
makes a lot of mistakes, and it is low when it makes fewer mistakes. The goal of the training process
is to find the weights and bias that minimise the loss function over the training set.” [sic]. The
terms weight, bias, and loss function will be explained shortly. When using an NN for supervised
learning, the network is trained using labeled training examples. The network’s output is then
evaluated by feeding it similar test examples. If the network learned well from the training data,
the network output for the (unknown) test examples should be accurate.

2.4.1 Forward propagation


Similar to the biological neural network, the neurons in an ANN pass on signals between
neurons. The transmission of signals through the network layers is called forward propagation.

Figure 6. Single neuron in an artificial neural network.

Figure 6 [17] shows a “close-up” of a single neuron. The neuron receives input signals x1, x2
…xn from the neurons in the previous layer. These are multiplied by weights w1, w2 …wn, which represent the relevance of the individual signals. The neuron also has a bias parameter b. As
described in [15], “Biases are scalar values added to the input to ensure that at least a few nodes
per layer are activated regardless of signal strength. Biases allow learning to happen by giving the
network action in the event of low signal. They allow the network to try new interpretations or
behaviors. Biases are generally notated b, and, like weights, biases are modified throughout the
learning process.”. The bias is added to the weighted sum of input signals. An activation function
f is then applied to the sum of the weighted inputs and bias. The activation function “decides”
whether or not a signal is passed on to the neurons in the following layers. There are various
activation functions, some of which will be described here.
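As an illustration (not taken from the implementation described in chapter 4), the following minimal Python/NumPy sketch shows the forward pass of a single neuron: the weighted sum of the inputs plus the bias, passed through an activation function (here ReLU, which is described in the following chapter). All names and values are chosen for demonstration only.

import numpy as np

def neuron_forward(x, w, b, activation):
    # weighted sum of the input signals plus bias, passed through the activation function
    z = np.dot(w, x) + b
    return activation(z)

# example: three input features (e.g. temperature, precipitation, air pressure)
x = np.array([21.0, 0.5, 1013.0])
w = np.array([0.02, -0.5, 0.001])   # weights: the relevance of the individual signals
b = 0.1                             # bias
relu = lambda z: np.maximum(0.0, z)
print(neuron_forward(x, w, b, relu))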

2.4.2 Activation functions


All figures in this chapter are taken from the book “Deep Learning - A Practitioner's Approach”
by J. Patterson and A. Gibson [15].

2.4.2.1 Linear
In a linear activation function (figure 7), the
output is proportional to the input. Linear
activation functions can be applied to simple
problems, but most problems that NNs have to
deal with are of a non-linear nature. This function
cannot sufficiently describe the optimal solution
to these complex problems.

Figure 7. Linear activation function.

2.4.2.2 Sigmoid
Sigmoid activation functions (figure 8) transform (“compress”) signals from the vast range between negative and positive infinity into a small, normalized range of 0-1. This reduces extreme input signals, but it has the drawback that, for most inputs, the output signal becomes very close to zero or one, leading to very small gradients close to these boundaries, from which the network can hardly learn. The sigmoid function can, however, be useful in the output layer, where it provides individual probabilities in a range of 0-1 for each class.

Figure 8. Sigmoid activation function.

2.4.2.3 Rectified linear
Rectified linear activation functions (figure 9) only provide an output when the signal crosses a threshold. When the input signal is at or below zero, the output is zero. As the input signal crosses the threshold, the output has a linear relationship to it. This function is written as:
f(x) = max(0, x)

Rectified linear activation functions are often used in convolutional neural networks (CNNs), as will be described in the chapter about CNNs. This activation function is also called ReLU, short for Rectified Linear Unit.

Figure 9. Rectified linear activation function.

According to A. Karpathy in the Stanford class “CS231n Convolutional Neural Networks for Visual
Recognition” [16], there is a risk of “dying ReLUs”: “For example, a large gradient flowing through
a ReLU neuron could cause the weights to update in such a way that the neuron will never activate
on any datapoint again. If this happens, then the gradient flowing through the unit will forever be
zero from that point on.”. This can either be avoided by choosing appropriate training
parameters or by using a leaky ReLU activation function instead. Instead of passing on zero if the signal did not cross the threshold, a small fraction of the signal (for example 0.01 times the input) is passed on, preventing the neuron from becoming irreversibly inactive.

2.4.2.4 Softmax
The softmax activation function is similar to the sigmoid activation function in that it creates
output values between 0 and 1. While the probabilities from the sigmoid activation function are
independent, the probabilities from the softmax activation all add up to 1. This makes more
sense in classification problems where the classes are mutually exclusive. For example, in an
image classification network, the softmax activation function in the output layer may give a
probability of 0.28 that the image shows a dog, while the probability for a bear is 0.72. This
makes sense because the animal cannot be “probably a dog” and “probably a bear” at the same time.
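To summarize, the activation functions discussed above can be written in a few lines of Python/NumPy. This is an illustrative sketch of the standard textbook formulas, not code from the implementation in chapter 4.

import numpy as np

def linear(z):
    return z                                # output proportional to the input

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))         # compresses any input into the range 0-1

def relu(z):
    return np.maximum(0.0, z)               # f(x) = max(0, x)

def leaky_relu(z, alpha=0.01):
    return np.where(z > 0, z, alpha * z)    # small slope instead of zero for negative inputs

def softmax(z):
    e = np.exp(z - np.max(z))               # subtract the maximum for numerical stability
    return e / e.sum()                      # probabilities that add up to 1

print(softmax(np.array([2.0, 1.0])))        # approximately [0.73, 0.27]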

2.4.3 Loss functions


The loss function, also called cost function, is used to measure the error of the network output
compared to the desired output or ground truth. If a network is learning successfully, the loss
decreases. The choice of loss function depends on the type of problem and the output that is
desired from the NN. Without going into the mathematical details, according to [15], a few
commonly used loss functions for classification problems are:

• Hinge loss: “Hinge loss is the most commonly used loss function when the network must
be optimized for a hard classification. For example, 0 = no fraud and 1 = fraud, […]”.
In this example, the classes “fraud” and “no fraud” refer to a fraud detection algorithm
that analyzes user’s behavior in software or a network. The hinge loss function is also
called a 0-1 classifier or maximum margin classification. The output (loss) from this
function is 0 when the network’s prediction is accurate and 1 when the prediction is
inaccurate.
• Logistic loss: “Logistic loss functions are used when probabilities are of greater interest
than hard classifications.”. An example is the calculation of the probability of a visitor on a website clicking on an ad, which can then be linked to the price charged for the
placement of that ad. The probability is a number between 0 and 1.
• Negative log likelihood: “For the sake of mathematical convenience, when dealing with
the product of probabilities, it is customary to convert them to the log of the probabilities;
hence, the product of the probabilities transforms to the sum of the log of the probabilities.
[…] We also negate the expression so that the equation now corresponds to a “loss”.”

As briefly mentioned at the beginning of chapter 2.4, the goal is to reduce the loss and to
increase the probability or likelihood of the classification.
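The three loss functions named above can be sketched as follows. These are the common textbook formulations (with assumed label conventions noted in the comments); they are not the exact implementations used later in this work.

import numpy as np

def hinge_loss(y_true, y_pred):
    # hard classification; assumes labels of -1 and +1
    return np.mean(np.maximum(0.0, 1.0 - y_true * y_pred))

def logistic_loss(y_true, p_pred):
    # assumes labels of 0 and 1 and predicted probabilities between 0 and 1
    p_pred = np.clip(p_pred, 1e-12, 1.0 - 1e-12)
    return -np.mean(y_true * np.log(p_pred) + (1.0 - y_true) * np.log(1.0 - p_pred))

def negative_log_likelihood(p_true_class):
    # probabilities that the network assigned to the correct classes
    return -np.mean(np.log(np.clip(p_true_class, 1e-12, 1.0)))

print(negative_log_likelihood(np.array([0.9, 0.8, 0.6])))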

2.4.4 Gradient descent and stochastic gradient descent


Gradient descent is a method used for finding the optimal weights and biases for a network.
Intuitive explanations often take the example of a mountain landscape. The goal is to descend
from the mountain and arrive at the lowest point in the valley in as few steps as possible. The
gradient descent algorithm calculates the gradient, or the “slope of the mountain” by taking the
derivative of the loss function. It measures the change in error as a result of a change in the
network parameters and it points in the direction in which the parameters have to be adjusted
to minimize the error, or “descend into the valley”.

In gradient descent, the gradient is calculated and the parameters are updated only after the overall loss over all of the training examples has been computed. Experience has shown that faster results are achieved when the gradient is calculated after each training example, which is done with stochastic gradient descent (SGD). Often a variation of SGD is used by calculating the loss
over a small number of examples called mini-batch. Using mini-batches has the advantage of
getting relatively quick results while getting smoother adjustments than when the parameters
are adjusted after each example. A single “extraordinary” training example could otherwise
cause “wild” changes in the parameters, leading the algorithm in the wrong direction. By using
mini-batches this effect is reduced [15].
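The following self-contained sketch illustrates the mini-batch update mechanism on a toy problem (fitting a single weight and bias to noisy data with a squared loss). It is purely illustrative and is not the training code used in chapter 4; all values are arbitrary examples.

import numpy as np

# toy data: y = 2x + 1 plus noise; the "network" parameters are one weight and one bias
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=200)
y = 2.0 * x + 1.0 + rng.normal(0, 0.1, size=200)

w, b = 0.0, 0.0
learning_rate, batch_size = 0.1, 20

for epoch in range(50):                       # one epoch = one pass over the dataset
    for start in range(0, len(x), batch_size):
        xb, yb = x[start:start + batch_size], y[start:start + batch_size]
        error = (w * xb + b) - yb             # prediction minus desired output
        grad_w = np.mean(error * xb)          # average gradient over the mini-batch
        grad_b = np.mean(error)
        w -= learning_rate * grad_w           # adjust the parameters against the gradient
        b -= learning_rate * grad_b

print(w, b)                                   # approaches 2.0 and 1.0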

2.4.5 Backpropagation
After a training example (or mini-batch) has passed through the network and the loss has been calculated, the network parameters have to be adjusted to correct for the error if the network output did not match the desired output. This is the point at which the network learns from its mistakes. The backward transmission of the error and the updating of the parameters is called backpropagation.
The goal is to correctly “divide the blame” over all of the parameters involved. As the network
learns which features are relevant for the task at hand, signals from relevant neurons are
strengthened and passed on by increasing their weight, while the less relevant signals are
weakened or even stopped by decreasing their weight.

2.5 Hyperparameters
Besides the network parameters w (weight) and b (bias), which are changed as a result of the
training process, there are several other parameters that are set before the training begins.
These parameters are called hyperparameters and they are used to tune and optimize the
training speed and learning performance. In general, there are no strict rules as to which
settings and values should be chosen. Instead, many of the hyperparameters are optimized
through trial-and-error.

2.5.1 Number of hidden layers and layer size


While a neural network has only one input and one output layer, the number of hidden layers
can be chosen freely. The number of layers and the layer size (the number of neurons within a
layer) define the number of network parameters (weights and biases). More parameters can
help to learn more complex problems, but too many parameters can cause overfitting.
Overfitting is what happens when a network learns the training data “too well”: it memorizes every little detail of the training data without learning to generalize and “understand” the relations between the input features and the output. This can be recognized by a high accuracy on the training examples combined with bad results on the evaluation examples.

According to [15], the number of hidden layers is related to the size of the dataset: “An example
of this would be that MNIST [a large dataset of images of hand-written digits] needs only around
three to four hidden layers (with accuracy decreasing beyond that depth), but Facebook’s DeepFace
uses nine hidden layers on what we can only guess to be a very large dataset.”.

The size of the input layer is equal to the number of features in the input data. The size of the
output layer is equal to the number of classes (labels) contained in the training data. Every
neuron in the output layer provides the probability for a single class.

2.5.2 Learning rate


In the book “Deep learning” by Goodfellow et al. [18], the relevance of the learning rate is
introduced as follows: “The learning rate is perhaps the most important hyperparameter. If you
have time to tune only one hyperparameter, tune the learning rate.”.

The learning rate defines how fast the network parameters are changed as a consequence of
the network’s error. If the learning rate is set high, large adjustments are made to the
parameters. Initially, this can lead to quick improvement of the results, but at the risk of
overshooting the minimum error. Returning to the analogy of the mountain landscape, a small learning rate means that many small steps are taken to reach the bottom of the valley. A higher learning rate means larger steps and a quicker descent. A learning rate which is set too high would mean the lowest point is passed and the algorithm climbs up the mountain on the opposite side of the valley. When the training progress is visualized as a graph, a learning rate that is too high can be recognized by a zig-zagging line, bouncing back and forth around the minimum error.

The basic learning rate hyperparameter is applied equally to all network parameters (weights
and biases). The network learns more efficiently when an adaptive learning rate method is used.
These methods adaptively tune the learning rate per parameter and automatically find the
optimal learning rate. One of these methods is called AdaGrad (“adaptive gradient”), because it adaptively changes the learning rate based on the history of gradients. AdaGrad speeds up the training in the beginning and slows it down as the training progresses and the changes to the network
parameters need to become smaller. A downside of AdaGrad is that the learning rate often decreases too aggressively, so that the network stops learning too early.

A variation of AdaGrad is AdaDelta. AdaDelta bases the adaptation of the learning rate only on
the most recent history rather than the entire history of gradients like AdaGrad does. This makes
for a less aggressive adjustment to the learning rate. As a positive side effect, the network
becomes more efficient by reducing the volume of calculations for the updates while using the
most relevant (because recent) information.
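As a sketch of the idea behind AdaGrad (using the commonly published update rule, not the exact implementation referenced in this work): each parameter accumulates its own history of squared gradients, and the effective step size shrinks as this history grows.

import numpy as np

def adagrad_update(w, grad, cache, learning_rate=0.01, eps=1e-8):
    cache += grad ** 2                                   # accumulated squared gradients (entire history)
    w -= learning_rate * grad / (np.sqrt(cache) + eps)   # parameters with a large history take smaller steps
    return w, cache

AdaDelta replaces the ever-growing cache with a decaying average of only the most recent squared gradients, which is why its adjustment of the learning rate is less aggressive.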

2.5.3 Epochs
The number of epochs defines how many times the network “sees” the complete dataset. In one epoch, the entire dataset is passed through the network once. Usually, many epochs must be completed before satisfying results are achieved.

2.5.4 Mini-batch size


The mini-batch size is the number of examples that are passed through the network before the
parameters are updated. A smaller mini-batch size is beneficial for memory use, and parameter
updates are done sooner, but as mentioned in the chapter about SGD, larger mini-batch sizes
lead to smoother results.

A remark regarding hardware performance from “Deep Learning - A Practitioner's Approach” by J. Patterson and A. Gibson [15]:
“For performance (this is most important in the case of GPUs), we should use a multiple of 32 for
the batch size, or multiples of 16, 8, 4, or 2 if multiples of 32 can’t be used (or result in too large
a change given other requirements). In short, the reason for this is simple: memory access and
hardware design is better optimized for operating on arrays with dimensions that are powers of
two, compared to other sizes.
We should also consider powers of two when setting our layer sizes. For example, we should use a
layer size of 128 over size 125, or size 256 over size 250, and so on.”

2.5.5 Iterations
An iteration is one update of the network parameters (weights and biases). In one iteration the
number of examples defined as the mini-batch size are passed through the network. So, if the
dataset consists of 1,000 examples and the mini-batch size is set to 20, then it takes 50 iterations
to complete 1 epoch.

2.5.6 Choice of the activation function


Several types of activation functions were introduced in a previous chapter. Different activation
functions can be chosen for the individual layers. In most of the literature, “ReLU” or “leaky
ReLU” is recommended for the hidden layers. The activation function for the output layer
depends on the purpose and the desired output of the network. For classification problems, the
“softmax” activation function is usually most suitable.

2.5.7 Weight initialization
Weight initialization is used to assign useful starting values to the weights. It may seem logical
to initialize all weights with 0 because before the training it is unknown which signals are more
relevant than others. However, if all weights were set to 0, all parameters would be equally
responsible for the resulting loss after the examples are passed through the network. As a result
of this, all parameters would be updated by an equal amount, meaning they would be equally
responsible for the next loss as well. The network would not be able to learn this way. That is
why a weight initialization strategy is applied.

A common strategy is called Xavier initialization. Xavier initialization makes sure that the
weights are not too small or too large to be useful and they are randomized to avoid the
problems described above.
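A minimal sketch of the commonly used uniform variant of Xavier (Glorot) initialization, assuming NumPy; the exact scaling formula can differ slightly between libraries and is not necessarily the one used in the implementation of chapter 4.

import numpy as np

def xavier_init(n_in, n_out, rng=np.random.default_rng()):
    # random weights, scaled by the fan-in and fan-out of the layer
    limit = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-limit, limit, size=(n_in, n_out))

weights = xavier_init(256, 128)   # e.g. a 256-neuron layer connected to a 128-neuron layer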

2.5.8 Regularization
Regularization can help prevent overfitting. There are several ways to apply regularization to
the network:

• Dropout: Dropout is used to mute parts of the input to a layer. This causes the network
to learn other portions of the input and generalize better, rather than to focus on certain
parts of the input. When dropout is applied, random neurons are “dropped”, so that
their signal is not propagated. This also has the positive side effect of speeding up the
training by reducing the number of computations per update.
• L1 and L2 penalty: L1 and L2 regularization are used to penalize too large weights,
which can otherwise lead to overfitting. Of these two regularization methods, L2 is the
most commonly used. In A. Karpathy’s Stanford class “CS231n Convolutional Neural
Networks for Visual Recognition” [16], the L2 penalty is described as follows: “The L2
regularization has the intuitive interpretation of heavily penalizing peaky weight vectors
and preferring diffuse weight vectors. […] this has the appealing property of encouraging
the network to use all of its inputs a little rather than some of its inputs a lot. […] this has the appealing property of encouraging
In comparison, L1 regularization leads to using only a sparse subset of the most
important inputs (features) and “ignoring” the rest: “In practice, if you are not concerned
with explicit feature selection, L2 regularization can be expected to give superior
performance over L1.”
If the L1 or L2 penalties are set too high, they can over-penalize the network and prevent
it from learning.
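Both regularization methods can be sketched in a few lines (standard textbook forms, assuming NumPy; the drop probability and lambda value are arbitrary examples, not settings from the implementation in chapter 4):

import numpy as np

def dropout(activations, drop_prob=0.5, rng=np.random.default_rng()):
    # randomly mute neurons during training; scaling keeps the expected signal strength constant
    mask = rng.random(activations.shape) >= drop_prob
    return activations * mask / (1.0 - drop_prob)

def l2_penalty(weights, lam=1e-4):
    # penalty term added to the loss; punishes large ("peaky") weights
    return 0.5 * lam * np.sum(weights ** 2)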

2.5.9 Momentum
With gradient descent, the goal is to minimize the loss and reach the global minimum. However, many problems are not quite that simple, and the “landscape” (the training examples) contains local minima. Once a local minimum is reached, the algorithm must pass a point at which the error increases slightly before it decreases further. This is visualized by the simplified illustrations in figure 10 [19].

Figure 10. Schematic example of global and local minima.

When a large learning rate is used, the algorithm can overshoot a local minimum, but when a small learning rate is used the algorithm can get
stuck and stop learning. To prevent this, a momentum hyperparameter can be applied. The
momentum adds a fraction of the previous weight update to the current one. When the direction
of the previous and the current updates are the same, this will increase the size of the parameter
adjustment and speed up learning. If the direction of the updates is different, the momentum
smoothens the variation. A high learning rate should not be combined with a large momentum
because this would again lead to overshooting the global minimum.
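A sketch of the classical momentum update (the standard formulation, not necessarily the exact variant used in this work): a velocity term carries a fraction of the previous update into the current one.

def momentum_update(w, grad, velocity, learning_rate=0.01, momentum=0.9):
    # a fraction of the previous update (the velocity) is added to the current one
    velocity = momentum * velocity - learning_rate * grad
    w += velocity        # consistent directions build up speed; changing directions are smoothed
    return w, velocity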

2.6 Convolutional neural networks


2.6.1 Introduction to convolutional neural networks
Convolutional neural networks (CNNs) are one of the reasons why deep learning is as successful and popular as it is. CNNs are excellent at image classification and object recognition, which is used to identify faces, street signs, and license plates. While CNNs have been used for many years for tasks like the automated processing of written forms and paper mail, more recently they have also gained importance due to the development of self-driving cars and autonomous robots. In image classification benchmarks, the best CNNs achieve higher accuracy at classifying images than humans [20].

Like neural networks in general, CNNs are inspired by the biological brain:
“The cells in the visual cortex are sensitive to small subregions of the input. We call this the visual
field (or receptive field). These smaller subregions are tiled together to cover the entire visual field.
The cells are well suited to exploit the strong spatially local correlation found in the types of images
our brains process, and act as local filters over the input space. There are two classes of cells in this
region of the brain. The simple cells activate when they detect edge-like patterns, and the more
complex cells activate when they have a larger receptive field and are invariant to the position of
the pattern.” [15].

“Regular” neural networks are not suitable for processing images due to the volume of parameters that are involved. The image classification benchmark dataset CIFAR-10 [21] consists of 60,000 images divided over 10 classes. The images have a size of only 32 x 32 pixels. Even these tiny images cause a significant amount of input data: 32 pixels x 32 pixels x 3 (R-G-B color channels) = 3,072 input values, and thus 3,072 weights for every single neuron in the first fully connected layer. Since the majority of image classification tasks deal with higher-resolution images, the number of parameters would explode. This is where CNNs come in.

2.6.2 CNN architecture overview

Figure 11. CNN architecture.

The high-level architecture of CNNs is shown in figure 11 [15]. Besides the input layer and the
output or classification layer, there are multiple hidden layers. The hidden layers consist of
alternating convolution and pooling or subsampling layers. The activation function mostly used
on the convolution layers is the ReLU activation function. In some literature and machine
learning libraries, the ReLU activation function is visualized or implemented as a separate layer.
As explained previously, the total number of hidden layers can vary, depending on the
complexity of the task and the size of the dataset. The sequence of convolution, ReLU, and
pooling, however, is consistent in most CNNs.

2.6.3 Input layer


The input for a CNN is a three-dimensional array of numeric values representing the pixel data. The height and width are defined by the size of the image in pixels, the depth is defined by the color channels (usually three), as visualized in figure 12 [15].

Figure 12. CNN input.

2.6.4 Convolution layer

The convolution layers are the essential part of the CNN. They derive their
name from the convolution operation, a rule for merging two sets of information. In the
convolution operation, a filter or kernel is slid over the input data to detect features. As such,
convolution layers are also called the “feature detectors” of a CNN. As described previously, the
image data is represented by a matrix of pixel values. The filter is a smaller matrix, usually 3x3
or 5x5 pixels in size, which also contains numeric values. The values in the filter are the weights
that are learned throughout the training process.

Figure 13. Convolution operation.

Figure 13 [15] shows a part of the convolution operation. The matrix on the left represents the input data (pixels). The filter is slid over the input data in steps (the step size is called the stride), and at each position the values of the input data and the values in the filter are multiplied (dot product). The individual multiplications are shown on the right. These values are then accumulated into a single value for the feature map or activation map. The number of filters and the number of activation maps
that are created as a product of these filters are defined by a hyperparameter. This number can
be chosen freely and, as with many other hyperparameters, there is no universal rule-of-thumb
as to what works best. The feature maps are stacked and form a new 3D-input for the next
layer.

The fact that the filters are slid across the entire input matrix makes the feature detection of
CNNs location invariant. Figure 14 [22] shows the visualization of the filters from three
convolutional layers in a CNN. The first filter detects low-level features such as edges and
curves. The next layer uses these features to recognize eyes, noses, and mouths; the third layer
recognizes entire faces.

Figure 14. Feature detection in convolutional layers of a CNN.
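The convolution operation itself is simple enough to be written as a naive sketch (single channel, no padding; purely illustrative and far less efficient than the optimized implementations in machine learning libraries). The hand-made filter below is an assumption for demonstration; in a CNN the filter values are learned.

import numpy as np

def convolve2d(image, kernel, stride=1):
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i * stride:i * stride + kh, j * stride:j * stride + kw]
            feature_map[i, j] = np.sum(patch * kernel)   # element-wise products, accumulated into one value
    return feature_map

image = np.arange(25, dtype=float).reshape(5, 5)          # toy 5x5 single-channel "image"
vertical_edge_filter = np.array([[1.0, 0.0, -1.0]] * 3)   # responds to vertical edges
print(convolve2d(image, vertical_edge_filter))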

2.6.5 ReLU
In most CNNs, each convolution layer is followed by a ReLU activation function or layer. ReLU replaces all negative values in the feature map with zero. ReLU introduces nonlinearity into the network, which makes it better at learning the nonlinear problems it has to solve. Until the ReLU activation function is applied, the CNN is a linear system, because only weighted sums (dot products) are computed. By removing the negative values, the ReLU activation function also makes the network more efficient.

While the size of the input layer is directly connected to the size of the image, the size of the
hidden layers depends on the size of the filters and the stride hyperparameter, which defines
the size of the steps by which the filter slides over the input matrix. Smaller filters and strides
lead to more convolution operations and larger feature maps, whose size is the input size for
the next layer.

2.6.6 Pooling or subsampling layer

Pooling or subsampling layers are usually placed between the convolution layers. The pooling layer decreases the size of the feature maps but keeps the essential information.

The pooling layer applies a filter with a max() function to small blocks of the input matrix. If the filter size is set to 2x2, it will take the largest of the four values in that area. This keeps the largest value while reducing the volume by 75%, as illustrated in figure 15 [23]. Instead of max pooling, average pooling is sometimes used.

Figure 15. Max pooling operation.

The use of pooling layers also decreases the risk of overfitting by removing irrelevant details and keeping the model simple.
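A minimal NumPy sketch of 2x2 max pooling on a single feature map (illustrative only, not taken from the implementation in chapter 4):

import numpy as np

def max_pool(feature_map, size=2):
    h, w = feature_map.shape
    h, w = h - h % size, w - w % size                 # crop to a multiple of the pooling size
    blocks = feature_map[:h, :w].reshape(h // size, size, w // size, size)
    return blocks.max(axis=(1, 3))                    # keep only the largest value of each block

example = np.array([[1., 3., 2., 1.],
                    [4., 2., 0., 1.],
                    [1., 1., 5., 6.],
                    [0., 2., 3., 4.]])
print(max_pool(example))                              # [[4. 2.] [2. 6.]]: 75% of the values are discarded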

2.6.7 Output layer


The output layer of a CNN is a fully connected layer, meaning each of its neurons is connected
to each neuron of the previous layer. The number of neurons in the output layer is equal to the
number of classes in the dataset. The output layer uses a softmax activation function. As
described in a previous chapter, this activation function gives the probability for each of the
classes, while the sum of these probabilities adds up to 1.

2.7 Popular CNN architectures


Over the years, several CNN architectures for image classification have been developed that
became well-known because of their excellent performance. These CNNs are often used as a
starting point or inspiration for the development of other CNNs. A few of these architectures
are mentioned in the course notes of Andrej Karpathy’s Stanford Computer Vision class [24]:

• LeNet: Yann LeCun developed one of the first successful CNNs in the 1990’s. This CNN was designed to recognize hand-written digits.
• AlexNet: the AlexNet was developed by Alex Krizhevsky, Ilya Sutskever, and Geoff
Hinton. It won the 2012 ImageNet ILSVRC (Large Scale Visual
Recognition Challenge). The network architecture is similar to the LeNet, but it has
more and larger layers, and it contains several stacked convolution layers before a
pooling layer, which was uncommon until then. The AlexNet architecture contains
around 60 million parameters.
• GoogLeNet: the GoogLeNet was developed by Szegedy et al. from Google and won the
2014 ImageNet ILSVRC challenge. It contains “only” 4 million parameters.

• VGGNet: The runner-up in the 2014 ImageNet ILSVRC challenge was the VGGNet by Karen Simonyan and Andrew Zisserman. With this network, they showed that the depth of the network plays a significant role in its performance. The network contains 16 convolutional and fully connected layers. It has a rather simple design, featuring only 3x3 sized convolution filters and 2x2 sized pooling filters. For comparison, some of the filters in the AlexNet are much larger at 11x11. A downside of the VGGNet is its size: it has around 140 million parameters, which makes it very computation-heavy and requires much memory to run.
• ResNet: The ResNet, developed by Kaiming He et al., won the 2015 ImageNet ILSVRC challenge. It introduces “skip” (residual) connections between layers and makes heavy use of batch normalization, which normalizes the activations of each layer over the current mini-batch. This evens out extremes in the training data, stabilizes training and helps the network generalize better.

3 Applying machine learning to image aesthetics
Now that the neural network fundamentals are explained, the next question is how the goal of
this work, the classification of image aesthetics, can be achieved using the CNN as a tool. To
answer this question, it must first be determined which specific problems the CNN may be able
to solve.

3.1 Determining the subproblems


As discussed in the introduction, image aesthetics are a complex field. Personal taste and
context play an important role in what is perceived as a good image by individual people.
However, some general rules can be applied to the majority of images, be it photos, drawings
or paintings. For this work, two aesthetic criteria will be taken as the subproblems to be solved
by the CNN: composition and sharpness. These criteria can be evaluated based on the image
data (pixels). Other criteria such as storytelling, creativity or emotional impact require much
more knowledge about the content and context of the image. Perhaps advanced neural
networks may, at some point, even be able to analyze some of these aspects, but that is beyond
the scope of this work.

3.1.1 Composition
One of the most commonly used rules for composition is the rule-of-thirds. In photography
books, websites and blogs it is often listed as one of the first recommendations to improve one’s
photography. The rule-of-thirds involves drawing an imaginary grid of two vertical and two
horizontal lines in the frame, dividing the frame into 9 equal-sized blocks. When the main
subject is placed roughly on one of the intersections of the grid lines, it generally leads to a
pleasant and interesting composition. Conversely, a composition is generally less attractive when the
subject is placed right in the center of the image or when it touches the border of the image.
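
To make the rule concrete: for an image of width w and height h, the four rule-of-thirds intersections can be computed as in the following short Java sketch. It is purely illustrative; the class and method names are chosen for this example and are not part of the image generation application described later.

import java.awt.Point;

public class RuleOfThirds {
    // Returns the four rule-of-thirds intersection points of a w x h image.
    public static Point[] intersections(int w, int h) {
        return new Point[] {
            new Point(w / 3, h / 3),         // top left
            new Point(2 * w / 3, h / 3),     // top right
            new Point(w / 3, 2 * h / 3),     // bottom left
            new Point(2 * w / 3, 2 * h / 3)  // bottom right
        };
    }
}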

3.1.2 Sharpness
In most cases, images are meant to be sharp: the subject needs to be in focus. Exceptions include
deliberate blur, either for artistic purposes or to emphasize the speed or movement of the
subject (motion blur). Different rules also apply to different types of images. Portraits usually
benefit from a slightly blurred background, which isolates the subject and captures the viewer’s
attention. A cluttered background with many details could distract the viewer. In landscape
photography, however, a wide depth of field is usually chosen to have both the foreground and
background in focus. This means that regarding sharpness, the classification is not so much
about good vs. bad, but more about distinguishing the various types of images.

3.2 Related work


Using machine learning to classify image aesthetics is a difficult, and therefore exciting, topic,
which has been covered by several previous works [25]–[27]. In these works, the researchers
trained the models with real photos. These photos are part of the widely used ImageNet dataset,
initially created by researchers from the Computer Science department at Princeton University
[28], and of the AVA dataset by Murray et al. [29]. The AVA dataset contains ratings for each
image, given by the users of the DPchallenge.com photo sharing website.
While this data is interesting and useful for many investigations, it is unknown why a photo
received a high or low rating. The approach in this work differs in that selected features
(composition, sharpness) are examined individually. This may provide an understanding of
which features are most suitable to be learned and classified by a CNN. Also, having models
which evaluate photos based on specific, individual features can be useful in several use cases
such as photo-editing software or in-camera “improvement suggestions”. Instead of having a
single model classifying a photo as being good or bad as a whole, individual models could each
provide independent feedback regarding a specific feature.

Other works that focus on individual features used carefully hand-crafted features calculated
from the image’s pixel data [30]–[33]. The approach in the current work differs: the features are
not calculated using hand-crafted formulas, but are learned by the CNN from images that are
labeled according to these specific features.

4 Implementation

4.1 Collection or generation of datasets


To train the CNN for the individual subproblems, suitable datasets need to be collected or
generated. As mentioned in chapter 3.2, many related works make use of datasets created for
object recognition benchmarks and competitions. In most of these datasets, the images are
classified by their content (car, bird, …). The AVA dataset mentioned earlier contains over
250.000 images as well as ratings regarding their aesthetics. However, while these images may
be classified as good or bad based on these ratings, this still does not provide information on
which characteristics of an image make it good or bad. There are two possible ways to gather
sufficiently large datasets suitable for this investigation.

4.1.1 Using existing datasets


The first possibility is to use images from existing datasets such as the ImageNet or AVA dataset.
A manual selection will be made to group the images by good or bad composition or sharpness.
The resulting “good” and “bad” classes then contain images of various subjects. This should
hopefully force the CNN to learn and classify the images by composition or sharpness rather
than by content. Another possibility is to batch-edit part of the images from the existing datasets
to imitate a certain bad characteristic such as blur. In any case, a substantial amount of work
will be involved in the collection and preparation of useful datasets for this investigation.

4.1.2 Generating simulated images


For the initial trials, which are mostly intended to get a feeling for the workings of CNNs, real
photos may be too complicated to get satisfying results. Also, creating a large enough dataset
of real photos may prove to be too time-consuming. Existing datasets like the ones mentioned
before were created by many people over an extended period. Therefore, instead of using real
photos it may be useful to use simulated data, at least for a part of the investigation. Using
Java’s Graphics2D class from the Java 2D API, it is relatively easy to draw a large number of
simple images with certain characteristics. These images will represent a “simplified world”
where, for example, instead of having many different and complex subjects like in photos, the
subjects are only rectangles, ellipses, and triangles in random sizes and colors. Half of these
images will be “good”, meaning the shapes will be positioned according to the rule-of-thirds
and/or they will be sharp. The other half of the images will be “bad”, characterized by
unattractive composition or blurred content. Once a CNN is successfully trained to recognize
these simulated features, it may be used or further trained to recognize similar features in
photos. The training on simulated data could be seen as pre-training, which is a common step
in the development and training of NNs. It prepares an NN to focus on the features relevant to
a specific task.

4.1.3 Implementation of the image generation application
Figure 16 shows the class diagram of the image generation application.

Figure 16. Class diagram of the image generation application.

In the main() method of the Main class, various settings are defined for the images to be drawn,
such as:

• the number of images


• the image size in pixels
• the background color or image
• the image content (shapes, …)
• the composition quality (good/bad)
• the foreground and background blur/sharpness

The ImageGenerator class contains the actual image drawing methods. The generateImages()
method iterates over the defined number of images. For each image, it creates a Java 2D
BufferedImage object and passes it to the drawBackground() and drawContent() methods.

In the drawBackground() method, a background color or background image is drawn in the
BufferedImage object, depending on the settings defined in the Main class. Optionally, the
background is then blurred using a filter from the external JH Labs’ image filter library [34].

In the drawContent() method, the image content is drawn over the image background.
Currently, the only “content” implemented in the application are random shapes (rectangle,
triangle and ellipse). These shapes represent the image’s subject in a simplified world.
Depending on the settings in the Main class, multiple shapes can be drawn. The position of each
shape depends on the compositionQuality setting. For good composition, the shape is placed
on the intersection of the rule-of-thirds grid lines. For bad composition, the shape is placed
closer to the border or in the center of the image. The shape’s color can be black, white, or
random. For random colors, a new Color object is created using three random integers in the
range of 0-255 for the R-G-B values. Optionally, the foreground is blurred before it is placed on
top of the background image.

After the background and foreground are drawn in the BufferedImage object, it is saved as an
image file using the saveImage() method.
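
The following fragment sketches the core of such a generator using only standard Java 2D classes. It is a simplified illustration of the approach described above, not the actual application code: the real implementation additionally supports multiple shapes, different shape types, background images and the JH Labs blur filters, and the names used here (such as drawExample()) are chosen for this example only.

import java.awt.Color;
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;
import java.awt.image.ConvolveOp;
import java.awt.image.Kernel;
import java.io.File;
import java.io.IOException;
import java.util.Arrays;
import java.util.Random;
import javax.imageio.ImageIO;

public class SimpleImageGenerator {

    private static final Random RND = new Random();

    // Draws one example image: a randomly colored rectangle on a randomly colored
    // background, placed either on a rule-of-thirds intersection ("good" composition)
    // or in the image center ("bad" composition), and optionally blurred.
    static void drawExample(int size, boolean goodComposition, boolean blur, File outFile) throws IOException {
        BufferedImage img = new BufferedImage(size, size, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = img.createGraphics();

        // Background: a random color (the real application also supports pattern images).
        g.setColor(randomColor());
        g.fillRect(0, 0, size, size);

        // Subject: shape center on a thirds intersection (good) or in the image center (bad).
        int third = size / 3;
        int cx = goodComposition ? (RND.nextBoolean() ? third : 2 * third) : size / 2;
        int cy = goodComposition ? (RND.nextBoolean() ? third : 2 * third) : size / 2;
        int shapeSize = size / 5;
        g.setColor(randomColor());
        g.fillRect(cx - shapeSize / 2, cy - shapeSize / 2, shapeSize, shapeSize);
        g.dispose();

        // Optional blur via a simple box filter (the real application uses the JH Labs filters).
        if (blur) {
            float[] kernelData = new float[9];
            Arrays.fill(kernelData, 1f / 9f);
            img = new ConvolveOp(new Kernel(3, 3, kernelData), ConvolveOp.EDGE_NO_OP, null).filter(img, null);
        }

        ImageIO.write(img, "png", outFile);
    }

    private static Color randomColor() {
        return new Color(RND.nextInt(256), RND.nextInt(256), RND.nextInt(256));
    }
}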

4.2 Implementation of the CNN training and classification application


4.2.1 The DL4J library
Since neural networks require the implementation of many complex mathematical formulas,
manual implementation “from scratch” would not make sense for this work. There are many
popular neural network libraries available, many of which are written in the Python or C++
languages, such as Caffe by the Berkeley Vision and Learning Center (BVLC), TensorFlow by
Google’s Machine Intelligence research organization, Theano, and Torch. However, since the
author is more experienced in Java, the Java-based DL4J (Deep Learning for Java) [35] library will
be used. DL4J is an open source deep learning library covering most important NN types and
functions. It is suitable for commercial use and backed by the company Skymind [36].

The book “Deep Learning - A Practitioner's Approach” by J. Patterson and A. Gibson [15], which
is used as a primary source for this work, is written by two of the engineers behind DL4J. The
book provides a theoretical background on deep learning and neural networks and accompanies
the library with many concrete implementation examples.

Part of the DL4J library is the “DL4J Model Zoo”. This is a repository of many frequently used
NN architectures such as LeNet, AlexNet, and GoogLeNet, which were described previously.
Instead of configuring a custom NN and possibly reinventing the wheel, one of these popular
architectures can be easily loaded and used without the need for copy-pasting the network
configuration code. Some of the models in the DL4J Model Zoo are available as pre-trained
networks, which means they are already trained on datasets such as ImageNet. While it is, of
course, more educative to configure a custom NN from scratch, it is sometimes useful to
compare the performance of a custom network with that of modern NN architectures which
have proven to be successful at specific tasks.

4.2.2 The architecture of the training and classification application


The application used for the training and evaluation of the CNNs contains the following
functionality:

• an image pipeline which loads images from a directory and creates the training and
evaluation dataset
• configuration of the CNN with various layers and hyperparameters
• training of the model using the training dataset
• the evaluation of the trained model using the evaluation dataset
• saving of the evaluation results as well as the used CNN configuration and the trained
model. Also, for the sake of documentation and reproducibility, the dataset used for
each training is saved together with the evaluation results.

• visualization of the training progress through a UI server, which generates graphs to be
viewed in a web browser. These graphs are continuously updated during training.

Figure 17 shows the class diagram of the training and classification application.

Figure 17. Class diagram of the training and classification application.

Similar to the image generation application, several settings are defined in the main() method
of the Main class. The actual training and evaluation logic is implemented in the
ImageClassifier class.

4.2.3 Creation of training and test datasets


The images used for training and evaluation of the network are located in a local directory. This
directory contains subdirectories named after the individual classes or labels, for example,
“trainingdata/good composition” and “trainingdata/bad composition”. The Dataset class
contains a loadExamples() method. In this method, the images are taken from the
subdirectories and randomized before they are split into two groups forming the training and
the test dataset. The ratio of training to test examples can be set through a variable
trainTestRatio, where a value of 0.8 means that 80% of all images will be used for training
and 20% for evaluation.
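
As an illustration of what such an image pipeline can look like with DL4J/DataVec, the following sketch loads the images from the class subdirectories, derives the labels from the directory names and splits them into a training and a test set. It is a simplified variant of the loadExamples() logic described above, not the actual thesis code, and the exact class and method signatures may differ slightly between DL4J versions.

import java.io.File;
import java.util.Random;
import org.datavec.api.io.filters.BalancedPathFilter;
import org.datavec.api.io.labels.ParentPathLabelGenerator;
import org.datavec.api.split.FileSplit;
import org.datavec.api.split.InputSplit;
import org.datavec.image.loader.NativeImageLoader;
import org.datavec.image.recordreader.ImageRecordReader;
import org.deeplearning4j.datasets.datavec.RecordReaderDataSetIterator;
import org.nd4j.linalg.dataset.api.iterator.DataSetIterator;

public class ImagePipelineSketch {
    // Expects the layout described above: trainingdata/<label>/<image files>.
    public static DataSetIterator[] load(File rootDir) throws Exception {
        Random rng = new Random(123);
        ParentPathLabelGenerator labelMaker = new ParentPathLabelGenerator(); // label = parent directory name
        FileSplit allFiles = new FileSplit(rootDir, NativeImageLoader.ALLOWED_FORMATS, rng);
        BalancedPathFilter filter = new BalancedPathFilter(rng, NativeImageLoader.ALLOWED_FORMATS, labelMaker);

        InputSplit[] split = allFiles.sample(filter, 80, 20); // 80% training, 20% test

        ImageRecordReader trainReader = new ImageRecordReader(64, 64, 3, labelMaker);
        trainReader.initialize(split[0]);
        ImageRecordReader testReader = new ImageRecordReader(64, 64, 3, labelMaker);
        testReader.initialize(split[1]);

        // batch size 30, label index 1, 2 classes ("good" and "bad")
        return new DataSetIterator[] {
            new RecordReaderDataSetIterator(trainReader, 30, 1, 2),
            new RecordReaderDataSetIterator(testReader, 30, 1, 2)
        };
    }
}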

4.2.4 Configuration of the CNN


One of the essential parts of the application is the CustomNetwork class, where the CNN is
configured. First, the individual layers are defined and hyperparameters such as layer size,
activation functions, filter size (for the convolutional and pooling layers) are set. Then, a
MultiLayerConfiguration object is created in which the individual layers are combined into a
network configuration. Global parameters such as the network input size (pixels), the number
of iterations, the learning rate and regularization are set in the configuration as well. The
getNetwork() function returns a MultiLayerNetwork object created from the configuration.

Finding the optimal CNN configuration for a particular dataset is a process of trial and error in
which many different variations are possible. Therefore, the fine-tuning of the network
configuration will be done by adjusting the configuration directly in this class, instead of
controlling it through variables and methods, as these would have to be constantly changed to
follow the “evolution” of the configuration.

4.2.5 Training
Training is done in the ImageClassifier class, which contains most of the business logic of the
application. Two training methods can be used: training with optional image transformations
and training with “early stopping”.

4.2.5.1 Training with optional image transformations


In the trainWithTransforms() method, the network is trained for a fixed number of epochs,
irrespective of the learning progress. The training duration depends on the number of examples,
the number of epochs and the number of iterations. Image transformations can be applied to
the input data to achieve a more robust model that generalizes better. DL4J offers several
“transforms” such as FlipImageTransform, which randomly mirrors the image on the x- or y-
axis, and WarpImageTransform, which randomly warps/skews the image by a small amount,
deforming the content slightly. For each transform that is applied, the defined number of epochs
is repeated, allowing the network to train with a variation of the images.
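
A minimal sketch of this idea is shown below. It assumes the split, trainReader and a configured model from the pipeline sketched in chapter 4.2.3, plus the transform classes from org.datavec.image.transform; the concrete transform parameters are placeholders.

int numEpochs = 10; // placeholder value
java.util.List<ImageTransform> transforms = java.util.Arrays.asList(
        null,                                          // first pass: the original, untransformed images
        new FlipImageTransform(new Random(123)),       // random mirroring on the x- or y-axis
        new WarpImageTransform(new Random(123), 42));  // slight random warping/skewing

for (ImageTransform transform : transforms) {
    trainReader.initialize(split[0], transform);       // re-initialize the reader with the transform
    DataSetIterator trainIter = new RecordReaderDataSetIterator(trainReader, 30, 1, 2);
    for (int epoch = 0; epoch < numEpochs; epoch++) {  // repeat the defined number of epochs per transform
        model.fit(trainIter);
        trainIter.reset();
    }
}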

4.2.5.2 Training with early stopping


Figure 18. Training with early stopping.

Finding the optimal number of epochs can be difficult, as it is unknown how fast a network
learns from the training data. If the network is trained for too few epochs, it may not learn as
much from the training data as it could. Using too many epochs, on the contrary, may lead to
overfitting. The network then learns all the little details of the training data and is not able to
generalize well. This is illustrated in figure 18 [37]. To prevent this, the
trainWithEarlyStopping() method can be used.
After training for one epoch, the model is evaluated using the test data. The current score and
a copy of the current model are temporarily saved to disk. After the next epoch, the model is
re-evaluated. If the new score is higher, the previously saved model is replaced by the current
model. If the new score is lower, the previous model is kept. How the score is calculated will be
explained shortly. Training is automatically stopped when one of the following termination
conditions is met:

• The score does not improve enough for too long: if the evaluation score is not improved
by a value x for a duration of y epochs, the training is stopped. This can occur in several
cases:
o Training is successful, a high score is reached, and it is not improved significantly
anymore.

o Training is successful, but the network now starts overfitting, causing the
evaluation score to decrease.
o Training is unsuccessful, the score is not improving or even decreasing.
• The maximum number of epochs is reached. This condition is similar to “regular”
training. It stops the training after a defined number of epochs. The difference, however,
is that the optimal model is saved instead of the last model. If the network was trained
for 20 epochs, but the optimal model was achieved in the 14th epoch, that model is
saved. In regular training, the final model is saved after 20 epochs, which does not
necessarily achieve the best performance.
• The maximum training time is reached. This time limit can be chosen freely. This
termination condition is not related to the evaluation score. It is more of a “fallback”
termination condition to prevent too long training, for example, if the evaluation score
improves only very slowly over a long time. It can also be used during test runs to see if
the network configuration is plausible and free of errors. A short training time of only a
few minutes is often enough to spot obvious problems.
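
DL4J ships an early-stopping API that covers the termination conditions listed above. The following sketch shows roughly how the trainWithEarlyStopping() logic can be expressed with it; it is not the actual thesis code. Note that it uses DL4J's built-in loss-based score calculator, whereas the application described here tracks the evaluation score, and the concrete thresholds (50 epochs, 2 hours, the output directory) are placeholders.

// Classes from org.deeplearning4j.earlystopping.* and java.util.concurrent.TimeUnit.
// Assumes an existing MultiLayerConfiguration conf and DataSetIterators trainIter / testIter.
EarlyStoppingConfiguration<MultiLayerNetwork> esConf =
        new EarlyStoppingConfiguration.Builder<MultiLayerNetwork>()
                .epochTerminationConditions(
                        new MaxEpochsTerminationCondition(50),                        // hard epoch limit
                        new ScoreImprovementEpochTerminationCondition(10, 0.001))     // stop if no improvement for 10 epochs
                .iterationTerminationConditions(
                        new MaxTimeIterationTerminationCondition(2, TimeUnit.HOURS))  // fallback time limit
                .scoreCalculator(new DataSetLossCalculator(testIter, true))           // evaluate on the test set after each epoch
                .evaluateEveryNEpochs(1)
                .modelSaver(new LocalFileModelSaver("models/"))                       // keep the best model on disk
                .build();

EarlyStoppingTrainer trainer = new EarlyStoppingTrainer(esConf, conf, trainIter);
EarlyStoppingResult<MultiLayerNetwork> result = trainer.fit();
MultiLayerNetwork bestModel = result.getBestModel();  // the optimal model, not the last one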

4.2.6 Evaluation
At the end of the training, a final evaluation is done using the evaluate() method. The
evaluation dataset is passed through the trained model. The evaluation method prints a
summary of the evaluation as well as four scores to the console, using DL4J’s eval.stats()
method. This is useful for getting a quick impression of whether the training was successful:

Examples labeled as bad classified by model as bad: 86 times


Examples labeled as bad classified by model as good: 14 times
Examples labeled as good classified by model as bad: 5 times
Examples labeled as good classified by model as good: 95 times
==========================Scores========================================
# of classes: 2
Accuracy: 0.9050
Precision: 0.9083
Recall: 0.9050
F1 Score: 0.9091
========================================================================

In the test run for this example, the entire dataset contained 1.000 examples, of which 80%
was used for training and 20% (200 examples) for the evaluation. Of the 100 “good” and 100
“bad” examples in the evaluation dataset, the majority was correctly classified. Besides a
summary of the network’s predictions, four scores are calculated: Accuracy, Precision, Recall
and F1. These scores are expressed as positive real numbers in the range of 0-1 and are
calculated based on the values from the confusion matrix. The confusion matrix is a tabular
overview of the actual and predicted classes.

For a binary classification problem, the confusion matrix contains four values:

• True Negative: the actual class is negative and the network predicted negative.
• True Positive: the actual class is positive and the network predicted positive.
• False Negative: the actual class is positive but the network predicted negative.
• False Positive: the actual class is negative but the network predicted positive.

Actual class ↓ \ Predicted class → | Negative (“bad”) | Positive (“good”)
Negative (“bad”) | True Negative (TN) | False Positive (FP)
Positive (“good”) | False Negative (FN) | True Positive (TP)

In the current classification problem, the class labels “good” and “bad” match the terms
“positive” and “negative”, but this is only a coincidence, because DL4J indexes the labels in
alphabetical order, so that “bad” becomes class 0 and “good” becomes class 1. If the class
labels were “A” and “B”, these would also be named “negative” and “positive” or 0 and 1 in the
confusion matrix. The confusion matrix can be generated automatically using DL4J’s
eval.confusionToString() method, which results in the following output:

Predicted: 0 1
Actual:
0 bad | 86 14
1 good | 5 95
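
For reference, output of this kind can be produced with a few lines of DL4J code. The following is a sketch, assuming a trained model and a testIter DataSetIterator as created earlier; exact names such as getFeatures() may differ slightly between DL4J versions.

// Imports omitted: org.deeplearning4j.eval.Evaluation, org.nd4j.linalg.dataset.DataSet,
// org.nd4j.linalg.api.ndarray.INDArray.
Evaluation eval = new Evaluation(2);                   // 2 classes: "bad" (0) and "good" (1)
while (testIter.hasNext()) {
    DataSet batch = testIter.next();
    INDArray predictions = model.output(batch.getFeatures(), false);  // inference mode
    eval.eval(batch.getLabels(), predictions);
}
System.out.println(eval.stats());                      // accuracy, precision, recall, F1
System.out.println(eval.confusionToString());          // the confusion matrix shown above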

Using the numbers from the confusion matrix, the four scores for the evaluation are calculated
as follows:

• Accuracy: This is simply the ratio of correct predictions to the total number of
predictions.
(TP + TN) / (TP + FP + TN + FN)
(86 + 95) / (86 + 14 + 95 + 5) = 0.905
• Precision: Precision is the ratio of the true positives to the total number of positive
predictions. In the current image classification problem, this translates to “How many of
the images that the network classified as good are indeed good?” but also “How many of
the images that the network classified as bad are indeed bad?”. DL4J calculates the
precision for each of the classes and returns the average:
½ * (TP / (TP+FP) + TN / (TN+FN))
½ * (86 / (86+5) + 95 / (95+14)) ≈ 0.9083
• Recall: Recall, also called sensitivity, answers the questions: “Of all good images, how
many did the network classify as good?” and “Of all bad images, how many did the network
classify as bad?”. In other words, it answers how many examples of a class were correctly
identified or “recalled”. Again, DL4J calculates the recall for each of the classes and
returns the average:
½ * (TP / (TP+FN) + TN / (TN+FP))
½ * (86 / (86+14) + 95 / (95+5)) = 0.905

• F1 Score: The F1 score is the harmonic mean of precision and recall. It is commonly
considered the most representative measure of the network’s performance and is calculated as
follows:
2 * ((precision * recall) / (precision + recall))

Important: while DL4J returns an average for the precision and recall scores, it returns
only a single, non-averaged F1 score in case of binary (two-class) classifications
problems. It calculates the F1 score for the positive class, which is class 1 in the
confusion matrix, labeled as “good” in the example on the previous page. This means
that, for the F1 score in a binary classification problem, the negative class (0) is not
considered at all. This behavior is slightly inconsistent and confusing.
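
This behavior can be verified with the numbers from the confusion matrix above. For the positive (“good”) class alone:

precision(good) = TP / (TP + FP) = 95 / (95 + 14) ≈ 0.8716
recall(good) = TP / (TP + FN) = 95 / (95 + 5) = 0.95
F1(good) = 2 * ((0.8716 * 0.95) / (0.8716 + 0.95)) ≈ 0.9091

which is exactly the F1 value reported in the evaluation summary. The analogous calculation for the negative (“bad”) class gives an F1 of about 0.90, so the average of the two per-class F1 scores would be roughly 0.905 instead.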

After reporting this to the DL4J development team by opening an issue on GitHub
(https://github.com/deeplearning4j/deeplearning4j/issues/4759) and discussing the
issue with one of the core developers, the implementation was improved to be more
consistent as well as more clearly documented. In the improved implementation, the
precision, recall and F1 functions all return a non-averaged value for the positive class
in binary classification problems. In non-binary classification problems, the functions
will continue to return averaged values as they did before. This is the same logic as
implemented by other machine learning libraries and is considered best-practice.

This inconsistency was not noticed until the end of the investigation of the first
subproblem (composition). This means that the performance of the CNN was judged by
an F1 score which was in fact only the F1 score for the positive class, even though the
score for each of the classes is equally relevant in this problem. Luckily, after reviewing
the evaluation results in retrospect, it turned out that in most cases, the F1 score for the
negative class was similar. Therefore, the average of the two F1 scores was not much
different than the previously used F1 score and no relevant misjudgment was caused by
the inconsistency. The scores mentioned throughout the chapter were manually
recalculated and are now the average F1 scores of both classes. The deviation was not
more than 0.05 in most cases.

4.2.7 Visualization of the learning progress in the UI
DL4J offers a UI server that visualizes the training progress with several graphs that are
displayed in the browser and regularly updated during the training.

Figure 19. DL4J Training UI – Overview page.

The “Overview” page in figure 19 contains the following information:

• Top left: Score vs. Iteration graph. This shows the value of the loss function as a result
of the current mini-batch. If the network is learning successfully, this value should
decrease.
• Top right: Summary of network and training information.
• Bottom left: Ratio of updates to parameters (logarithmic), displayed per layer. As the
network learns and the loss gets smaller, the updates to the parameters also get smaller.
• Bottom right: Standard deviation of activations, gradients, and updates in a tabbed
view.
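
Attaching this UI to a training run takes only a few lines. The following is a sketch; the classes come from DL4J's UI module, and their package names differ slightly between DL4J versions.

// Imports omitted: org.deeplearning4j.ui.api.UIServer, org.deeplearning4j.ui.stats.StatsListener,
// org.deeplearning4j.ui.storage.InMemoryStatsStorage (StatsStorage interface from the DL4J API).
UIServer uiServer = UIServer.getInstance();              // starts the UI web server
StatsStorage statsStorage = new InMemoryStatsStorage();  // keeps the training statistics in memory
uiServer.attach(statsStorage);                           // the UI reads from this storage
model.setListeners(new StatsListener(statsStorage));     // the network writes into it during fit()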

The “Model” page in figure 20 shows a representation of the network on the left side. When a
layer is clicked, a summary of the layer’s configuration is shown on the right. At the right bottom
(partially outside of the screenshot) several more graphs are available, showing the activations
and parameter updates for the selected layer.

Figure 20. DL4J Training UI – Model page.

4.2.8 Save results


If enabled, the saveResults() method creates a new directory and saves the trained model and
the network configuration in JSON format to file. Also, the directory containing the images is
zipped and saved to the newly created directory. These files are kept for documentation,
verification, and reproducibility of the training sessions. The trained model can be reused as
pre-trained model for further training. The JSON file with the network configuration can be
deserialized into a new, “untrained” MultiLayerNetwork object, for example for training on a
different dataset.
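
A sketch of how this saving and re-loading can look with DL4J's utilities is shown below; the file names are placeholders and the actual saveResults() method additionally zips the image directory.

// Imports omitted: org.deeplearning4j.util.ModelSerializer, org.deeplearning4j.nn.conf.MultiLayerConfiguration,
// org.deeplearning4j.nn.multilayer.MultiLayerNetwork, java.nio.file.*, java.nio.charset.StandardCharsets, java.io.File.
// Save the trained model (including updater state) and the configuration as JSON.
ModelSerializer.writeModel(model, new File("results/trained-model.zip"), true);
Files.write(Paths.get("results/network-config.json"),
        model.getLayerWiseConfigurations().toJson().getBytes(StandardCharsets.UTF_8));

// Later: restore a fresh, untrained network from the saved configuration ...
String json = new String(Files.readAllBytes(Paths.get("results/network-config.json")), StandardCharsets.UTF_8);
MultiLayerNetwork untrained = new MultiLayerNetwork(MultiLayerConfiguration.fromJson(json));
untrained.init();

// ... or reload the trained model for further (pre-trained) use.
MultiLayerNetwork restored = ModelSerializer.restoreMultiLayerNetwork(new File("results/trained-model.zip"));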

4.3 The configuration of the initial CNN


The ideal CNN configuration depends on the problem that it has to solve. It will therefore
evolve over time. The starting point is a simple CNN with settings and hyperparameters
that are mentioned as best practice throughout the literature. The main guideline is the book
“Deep Learning - A Practitioner's Approach” by J. Patterson and A. Gibson [15], which
accompanies the DL4J library. Some of the values are initial guesses, based on examples and
tutorials found on the internet and the examples provided with the DL4J library. Following is a
short overview of some of the settings and hyperparameters, with a code sketch after the list;
the entire initial configuration is added as appendix I.

• Input layer
o Input size: 64 x 64 pixels, 3 color channels
• Convolution layer 1
o Output size: 20 (20 filters which result in 20 activation maps, as explained in
chapter 2)
o Filter size: 3x3
o Activation function: ReLU
o Weight initialization: RELU (as recommended by [15], page 252)
• Pooling layer 1
o Kernel size: 2x2
o Pooling type: max-pooling
• Convolution layer 2
o Identical to Convolution layer 1
• Pooling layer 2
o Identical to Pooling layer 1
• Output layer
o Activation function: Softmax
o Loss function: negative log likelihood
o Weight initialization: Xavier (as recommended by [15], page 252)
o Output size: 2 (binary classification “good” and “bad”)
• Global settings and hyperparameters
o Optimization algorithm: Stochastic Gradient Descent
o Momentum: 0.9
o L2 regularization: 0.005
o Iterations: 1
o Learning rate: 0.005
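
The following sketch shows roughly how a configuration with these values looks in DL4J code. It is a condensed, illustrative version of the configuration in appendix I, not a copy of it; some builder methods (for example how momentum, regularization and the number of iterations are set) differ between DL4J versions.

// Imports omitted: org.deeplearning4j.nn.conf.*, org.deeplearning4j.nn.conf.layers.*,
// org.deeplearning4j.nn.conf.inputs.InputType, org.deeplearning4j.nn.multilayer.MultiLayerNetwork,
// org.deeplearning4j.nn.weights.WeightInit, org.deeplearning4j.nn.api.OptimizationAlgorithm,
// org.nd4j.linalg.activations.Activation, org.nd4j.linalg.learning.config.Nesterovs,
// org.nd4j.linalg.lossfunctions.LossFunctions.
MultiLayerConfiguration conf = new NeuralNetConfiguration.Builder()
        .seed(123)
        .optimizationAlgo(OptimizationAlgorithm.STOCHASTIC_GRADIENT_DESCENT)
        .updater(new Nesterovs(0.005, 0.9))               // learning rate 0.005, momentum 0.9
        .l2(0.005)                                        // L2 regularization
        .list()
        .layer(0, new ConvolutionLayer.Builder(3, 3)      // convolution layer 1: 3x3 filters
                .nIn(3).nOut(20)
                .activation(Activation.RELU)
                .weightInit(WeightInit.RELU)
                .build())
        .layer(1, new SubsamplingLayer.Builder(SubsamplingLayer.PoolingType.MAX)
                .kernelSize(2, 2).stride(2, 2)            // pooling layer 1: 2x2 max-pooling
                .build())
        .layer(2, new ConvolutionLayer.Builder(3, 3)      // convolution layer 2
                .nOut(20)
                .activation(Activation.RELU)
                .weightInit(WeightInit.RELU)
                .build())
        .layer(3, new SubsamplingLayer.Builder(SubsamplingLayer.PoolingType.MAX)
                .kernelSize(2, 2).stride(2, 2)            // pooling layer 2
                .build())
        .layer(4, new OutputLayer.Builder(LossFunctions.LossFunction.NEGATIVELOGLIKELIHOOD)
                .nOut(2)                                  // binary classification: "good" / "bad"
                .activation(Activation.SOFTMAX)
                .weightInit(WeightInit.XAVIER)
                .build())
        .setInputType(InputType.convolutional(64, 64, 3)) // 64 x 64 pixels, 3 color channels
        .build();

MultiLayerNetwork network = new MultiLayerNetwork(conf);
network.init();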

To get a feeling for the configuration of the CNN, some training trials were done. While the
findings from these trials are interesting and educative, they are not directly relevant to the aim
of this work. Therefore, a summary of the findings is added in appendix II.

4.4 Subproblem 1: composition
With the results from the training trials in mind, it is now time to tackle the first subproblem,
the classification of images by their composition. A summary of the “milestones” or most
interesting results is provided in the form of a table at the end of this chapter.

4.4.1 Learning to classify generated images


4.4.1.1 Black shapes
The first attempt at classifying images by their composition was made using generated images
containing random black shapes on a white background. In half of the images, the shape was
positioned according to the rule of thirds, in the other half the shape was positioned elsewhere,
as illustrated in figure 21. Since the images in the two classes contain the same type of
“subjects”, the idea was that the CNN should learn that the location of the subject is relevant,
not the shape of the subject itself. After the network achieved a high score on these images, the
difficulty was increased by using more complex images.

Figure 21. Black shapes - bad (left) vs. good (right) composition.

500 “good” and 500 “bad” examples were generated. In a first test run, the dataset contained
only 100 examples (80 training, 20 test). After training for 10 epochs, the network achieved an
F1 score (from here on simply called score) of 0.798, which was not a bad start. After more
than 80 training runs with various adjustments to the configuration, the score improved to
0.995. Some of the adjustments that had a positive effect during the initial trials also had a
positive effect here. However, none of the adjustments had as much effect on the score as simply
increasing the number of training examples. As long as the number of examples remained low,
the network kept overfitting, despite attempts to regularize it. Training with early stopping was
used to “catch” the optimal model, which sometimes occurred after 8 epochs, but sometimes
after more than 50 epochs.

4.4.1.2 Colored shapes


In the “next level” the images were a little more complex. The shapes, their size, and position
remained the same, but now the shapes had a random color, as illustrated in figure 22. The
intention was to find out if the network could keep its attention on the position of the subject,
not on the different features of the subject itself.

Figure 22. Colored shapes – bad (left) vs. good (right) composition.

Using the final network configuration from the previous training, a score of 0.990 was achieved,
which was almost identical to the score achieved on the black shapes. While the network
achieved its optimal score with the black shapes after 10 epochs, the best score on the colored
shapes was achieved after 15 epochs. Also, the graphs showing the learning progress looked
quite “rough”, as shown in figure 23. This matches the increased difficulty.

Over 30 adjustments were made to the network configuration to make the learning smoother
and faster while maintaining the high score, but without success. The adjustments consisted of
adding/removing layers, adding/removing neurons per layer, changing regularization, changing
the learning rate and changing the batch size. Most of the changes were made somewhat
“randomly”, trying out various smaller and larger values.

Figure 23. Training progress on colored shapes.

4.4.1.3 Colored shapes on a colored background


Until now, the images contained only a single shape while the rest of the image was white. This
means that most of the input was constant between the examples. To eliminate this constant part of the input, a
random background color was added to the images, so that the entire input varied from
example to example, as illustrated by figure 24.

Figure 24. Colored shape and background - bad (left) vs. good (right) composition.

Starting off with the best network configuration from the previous “level”, a score of 0.905 was
achieved after 21 epochs using 1.000 examples. Increasing the learning rate from 0.01 to 0.02
increased the score to 0.955. While the increased learning rate made the learning faster initially,
it took 53 epochs to achieve the highest score. Presumably, the lower learning rate in the
previous configuration let the algorithm get stuck in a local minimum. The higher learning rate
made it pass the local minimum and continue learning. Of course, other explanations are also
possible.

By doubling the number of examples to 2.000 (1.600 training, 400 test) in combination with
several adjustments to the configuration, the score was improved to 0.988. For comparison,
lowering the number of examples to 500 while using that same network configuration led to a
score of 0.798. This is a clear indication that the size of the training dataset is crucial for the
network’s performance.

While looking at the graphs from the last training, it is noticeable how the mean magnitude of
the bias parameters grew while the weight parameters stayed relatively constant (figure 25, left
graph). It turned out that the “l2” regularization variable, which was set in the network
configuration, only regularized the weight parameters. A separate variable called “l2Bias” had
to be set to regularize the bias parameters as well.

Since the weight regularization was set at 0.005, the same value was tried for the bias. The
effect was obvious. A network configuration that achieved a score of 0.761 after training on
1.000 examples achieved a score of 0.910 after adding the bias regularization, which was a
significant improvement. On the other hand, adding the bias regularization made the learning
slower, finding the optimal model after 28 epochs compared to 11 epochs previously, but one
could say that it was worth the wait.

Figure 25. Mean magnitude of parameters - without and with bias regularization.

Figure 25 shows how the bias (red) passed the 0.125 line at around 500 iterations (left graph).
After adding bias regularization, the bias stayed below that value and stabilized, even though
the training lasted much longer (right graph).

With this knowledge, the configuration that previously led to a score of 0.998 after 85 epochs
was retried and now reached the same score after only 40 epochs, solely because the bias
regularization was added.

4.4.1.4 Multiple colored shapes, varied positioning


Since the final goal is to classify photos by their composition, the simulated images had to
become more complex and closer to reality. The next step was therefore to add more shapes,
as well as adding some variation to the positioning of the shapes. Until now, the center of a
shape was either placed precisely on the intersection of the “third” lines (good composition) or
the intersection of “quarter” lines (bad composition). In the next set of images, each image
contained up to 5 shapes (a random number between 1 and 5). The positioning varied slightly,
moving the center of the shapes away from the “third” or “quarter” lines by a small (random)
amount. This made the difference between the good and bad examples smaller and therefore
harder to classify.

Figure 26. Multiple shapes - bad (left) vs. good (right) composition.

As visible in figure 26, the images were now more complex and the difference between good
and bad images less obvious to the viewer. Still, the CNN learned the difference quite well.
Using the same configuration as in the previous run, it achieved a score of 0.960. The number
of examples in the dataset was slightly higher than before: previously, 2.000 examples were
divided into 1.600 training and 400 test examples, now the total number was 2.500
(2.000 training, 500 test).

The “score vs. iteration” graph showed that initially the network learned almost as fast as in
previous runs, but at some point, it had difficulty improving further. It may be that the layers
were no longer large enough to learn the increased variation in features. Increasing the layer
size by 50% per layer led to an improved score of 0.976.

Another possibility is that the difference between the “good” and “bad” classes was getting
smaller and the learning rate of 0.02 was too large, causing the algorithm to keep overshooting
the minimum. To examine this, the learning rate was lowered to 0.01, which led to a score of
0.922 and thus no improvement. Increasing the learning rate to 0.05 was not beneficial either;
on the contrary, this adjustment caused the loss to explode and the network to fail entirely.
Increasing and decreasing the L2 regularization brought no improvement either.

4.4.1.5 Using well-known network architectures


As mentioned at the end of chapter 2, several popular CNN architectures are often used as a
starting point for the development of new CNNs. As a matter of comparison, a few training runs
were performed using models from the DL4J Model Zoo. These are DL4J’s implementation of
the CNN architectures such as LeNet or AlexNet. The configurations (source code) of these two
networks are added in appendix III and IV.

The LeNet model achieved a score of 0.934, which means that the custom configuration, which
is indeed similar to the LeNet configuration, was slightly better suited for this specific task.

The AlexNet network learned almost nothing while training for over 40 epochs. Since this
network is much larger than the custom networks used previously, training went much slower.
The score resulting from this training was precisely 0.

Until now, training was done using images with a size of 64 x 64 pixels. These images were
large enough for classification of the composition, as the images were still relatively simple and
did not contain small details which could be relevant to the classification.

The default input size of the networks from the Model Zoo is 224 x 224 pixels. It may therefore
be that they are optimized for images of this size. To find that out, the training
with the LeNet model was repeated with the same number of images, but this time the images
were 224 x 224 pixels large. Due to the increased size of the images (and therefore network
parameters), the training went much slower. After running for 10 hours, only 5 epochs were
completed, and the early stopping mechanism stopped the training. At that point, the score was
0.764.

Since the training was interrupted after 5 epochs because the 10-hour time limit was reached,
it is unknown which score the model would have achieved if it had trained longer. To make at
least some kind of comparison between the LeNet model’s performance on large versus small
images, the training was then repeated on the 64 x 64 sized images, also for 5 epochs. This led
to a score of 0.853. So, at least during the first 5 epochs, the LeNet model learned more from
the smaller images than from larger ones.

Using smaller images has the benefit of speeding up the training process (there are fewer
calculations to perform), but on the other hand, important details may get lost due to the limited
number of pixels. Another run with the LeNet model was done using 32 x 32 pixel images. It
turned out that the images still contained enough relevant details because the LeNet model
achieved a score of 0.920 after 23 epochs.

To make a final comparison between the LeNet model and the custom configuration (which last
achieved 0.976 on the 64x64 pixel images), now also the custom CNN was trained using the 32
x 32 pixel images. The custom network achieved a score of 0.924. Comparing this to the LeNet
model (0.920), the custom network performed slightly better as well as faster, reaching its best
score after 17 epochs as opposed to 23 for the LeNet model.

Making one last attempt at benefiting from existing CNN architectures, the VGG16 model from
the Model Zoo was trained on the 32 x 32 pixel images. The training was rather unsuccessful,
leading to a maximum score of 0.490 after 3 epochs, which was roughly random. After that, no
further improvement was made for 10 epochs, so the early stopping mechanism terminated the
training.

4.4.1.6 Adding more complex backgrounds


After having compared the custom configuration to some well-known models and concluding
that the custom configuration was quite suitable for the task at hand, the next step was to
continue making the images more complex, until eventually actual photos could be classified
by the CNN. Until now, the images only contained an evenly colored background. Clearly, in
photos, this is not the case. Therefore, some background pattern images were collected and
randomly used as background for the newly generated images. To prevent the background from
becoming too obtrusive and distracting from the “subject” (the shapes), a semi-transparent
color overlay was placed over the pattern to reduce the effect. Examples of these images are
shown in figure 27.

Figure 27. Multiple shapes with background - bad (left) vs. good (right) composition.

The network achieved a score of 0.952 after 14 epochs. While this is
not a bad score, the “Score vs.
Iteration” graph in figure 28 shows
several high peaks of increased
error, indicating that the algorithm
experienced difficulties learning
from the more complex images.
Figure 28. "Score vs. Iteration" graph showing some learning difficulties.

After making several adjustments to the network configuration, it achieved a score of 0.964
after 24 epochs. However, this training lasted 6 times longer due to a slightly larger network
configuration.

Next, another set of images was created in which the background was a lot stronger, as
illustrated by the examples in figure 29. The semi-transparent colored overlay was removed
from the background.

Figure 29. Shapes with stronger background - bad (left) vs. good (right) composition.

The network (surprisingly) still achieved a score of 0.930, but the “Score vs. Iteration”
graph in figure 30 shows even more clearly
that the task was getting more difficult.

Figure 30. Score vs. Iteration on images with background.

4.4.2 Classifying photos by their composition


By now it is clear that a CNN can learn that the position of the subject within the image is the
relevant feature for the classification of the composition, and not the appearance of the subject
itself. In all of the generated images used in the previous training runs, the subjects in both
classes (good and bad) were the same; the only difference was their position. Now it is time to
put the CNN to the test and see whether it can learn from and classify real photos.

For this, a selection of photos was taken from the AVA dataset. As described previously, the
AVA dataset is created and used for similar classification problems regarding image aesthetics.
The dataset contains over 250.000 images and ratings given by the users of the
DPchallenge.com photo sharing website. These ratings concern the image in general and not
specific aspects (such as composition, lighting, contrast, …).

1.200 images were manually selected from the dataset, disregarding their ratings. 600 of these
images were composed more or less according to the rule of thirds; the other 600 images did
not feature this composition rule. In most of these images, no distinct composition was
recognizable, or the subject was simply placed in the center of the frame.

In the first training run, the network configuration last used on the generated images scored
0.670 after only 4 epochs. After that, it continued to improve on the training data but not on
the test data. It was overfitting and memorizing the training examples. While the achieved score
was not high, it was certainly better than random. This indicates that the network learned
something, but it raises the question of what exactly the network learned.

Looking at the photos in the selected dataset (figure 31), it seems that the images in which the
rule of thirds is applied are also more appealing in general. On average, they are “cleaner” (less
cluttered), brighter and they have more vibrant colors. This is not very surprising, because a
photographer who pays attention to the composition likely also pays attention to lighting and
color, either during shooting or in post-processing.

Figure 31. Photos – bad (left) vs. good (right) composition.

It is quite possible that the CNN learned and classified images based on the amount of
clutter in the image, and not based on the position of the subject within the frame.

However, a more significant problem is the brightness and vibrancy of the well-composed
images. It is possible that the CNN was (also) learning and classifying the images based on these
features. To reduce that risk, the images from the dataset were desaturated (turned into black-
and-white) using Adobe Photoshop. No further adjustments to the images were made. The
resulting images are shown in figure 32.

Figure 32. Photos converted to black-and-white - bad (left) vs. good (right) composition.

Repeating the training with the same network configuration but with black-and-white images
led to a score of 0.644 (compared to 0.670 with color images), which indicates that the color
information indeed played a role in learning. Therefore, it makes sense to continue using the
black-and-white images to keep the results as representative as possible.

To prevent overfitting and hopefully achieve a higher score, the L2 bias regularization was
increased. While it slowed down the growth of the bias parameters and the learning process as
expected, it unfortunately did not bring the hoped-for improvement, resulting in a score of 0.654.

Using the AdaDelta learning rate updater instead of a fixed learning rate and momentum
increased the score to 0.673, although it must be mentioned that it took 43 epochs to get there.

Several pre-trained networks, which learned from training on generated images, were tested
on the dataset containing photos. First, the pre-trained networks were used to classify photos
without training on them. The networks performed roughly at random. This shows that the
generated images were too different from the photos to be useful as a single source for learning.
For composition, the network needed to learn from photos to be able to classify photos.

Then, the pre-trained model that achieved the best results on the generated images with a
pattern background (score of 0.930) was further trained on the photos. This did not help to get
a better score on the photos. During the first epochs of the training with the pre-trained model,
the score on the training data remained constant. Looking at the graphs in figure 33, it seems
that the model first simply “forgot” much of what it had learned from the generated images
and then started learning from the beginning.

The graph on the left shows the mean magnitude of the (pre-trained) bias parameter (red line)
dropping off soon after the training started. The lowered bias led to smaller layer activations,
visible in the graph on the right. After a while, the network started learning, and the bias
parameters grew, leading to larger activations. Unfortunately, the network was only overfitting
on the training data again.

Figure 33. Development of parameters when using a pre-trained model.

To fight the overfitting, a dropout hyperparameter was added to the network configuration.
Setting the dropout to 0.5 means that half of the neurons/signals are randomly dropped during
each pass of the training examples through the network. This should have the effect that the
network focuses less on specific neurons and generalizes better. Unfortunately, this did not
have the desired result. Using a dropout of 0.5 led to a score of 0.378, a dropout of 0.1 led to a
score of 0.494.

Increasing the number of neurons per layer worsened the score and led to faster overfitting,
which makes sense because it gave the network a higher capacity to memorize the training
examples. Removing a convolution layer and a pooling layer indeed reduced the overfitting:
the network did not memorize the training data as well as before. However, that did not make
the score on the test examples better.

Presumably the size and quality of the dataset were not sufficient, and because of this, the CNN
could not extract and learn the relevant features. In an attempt to get a more apparent
difference between the “good” and “bad” classes, 200 out of the 600 images per class were
removed. The remaining 400 images per class were the ones that were more clearly composed
according to the rule of thirds or not at all. The clearer distinction between the classes should
make the learning task easier. Even though the optimized dataset was 33% smaller than the
previous dataset, the network achieved an improved score of 0.712. Since all previous training
runs on various datasets showed that bigger datasets lead to better results, the improved score
must have been the consequence of the improved dataset quality, which even compensated for
the reduced dataset size. To verify this, the previous dataset of 600 images per class was again
reduced to 400 images per class, but this time no specific selection was made. Instead, the first
200 images in each directory were simply deleted. As expected, the score after training on this
non-optimized dataset was lower at 0.491. What is especially interesting in this case is the
difference between the precision and recall scores for the individual classes. While in most
training runs these scores were very similar for both classes, in this case, the precision was
better for the “bad” class (0.750 vs. 0.536) while recall was much better for the “good” class
(0.938 vs. 0.188). In other words, the network did not identify many “bad” images (low recall),
but of all the images that it did classify as “bad”, many were indeed “bad” (high precision).

A few more adjustments to the network configuration were made to further optimize its
performance. Increasing the number of neurons in the fully connected layer from 30 to 50 led
to an improved score of 0.743.

Removing a pair of convolution/pooling layers or changing the number of neurons in the other
hidden layers all led to worse scores.

Lowering the batch size from 30 to 20 further improved the score to 0.755.

Based on the experience gained in this chapter, it is certain that further improvements to the
score could be achieved by using a larger dataset and further fine-tuning of the network
configuration. Since time is limited and the aim of this work is to also investigate other aspects
of image aesthetics, the investigation of the composition classification is stopped at this point.

The final network configuration which was used for the classification of photos is added in
appendix V.

4.4.3 Summary of the training and classification of image composition


Table 1 lists some of the “milestones” from this chapter. It contains a short description of the
images used for training, remarks regarding the relevant changes to the network configuration
and the score achieved using that network configuration.

Table 1. Summary of the training and classification of image composition.

Dataset description (default size: 64 x 64) | No. of images (training / test) | Description of used CNN configuration / adjustments | Best F1 score | Best epoch (of total) / training duration
Black shape on white background | 100 (80 / 20) | Custom CNN that resulted from the initial trials (Appendix I and II) | 0.798 | 10 / 14 seconds
Black shape on white background | 1.000 (800 / 200) | Increased learning rate 0.001 → 0.01; added a convolution & pooling layer | 0.995 | 11 (from 14) / 3 minutes
Colored shape on white background | 1.000 (800 / 200) | Same configuration as in previous training ↑ | 0.990 | 16 (from 20) / 6 minutes
Colored shape on colored background | 1.000 (800 / 200) | Same configuration as in previous training ↑ | 0.905 | 21 (from 31) / 10 minutes
Colored shape on colored background | 1.000 (800 / 200) | Increased learning rate 0.01 → 0.02 | 0.955 | 53 (from 63) / 28 minutes
Colored shape on colored background | 2.000 (1.600 / 400) | Lowered L2 regularization 0.01 → 0.005; changed activation function in the 3rd convolution layer from ReLU to Linear; increased neurons in the fully connected layer 20 → 30 | 0.998 | 86 (from 96) / 1 hour 16 min
Colored shape on colored background | 2.000 (1.600 / 400) | Added L2 bias regularization of 0.005 | 0.998 | 41 (from 51) / 1 hour 9 min
1-5 shapes on colored background, varied positioning | 2.500 (2.000 / 500) | Same configuration as in previous training ↑ | 0.960 | 24 (from 34) / 24 minutes
Same image content as in the previous training ↑ | 2.500 (2.000 / 500) | Increased number of filters in the convolution layers 20 → 30; increased number of neurons in the fully connected layer 30 → 50 | 0.976 | 37 (from 47) / 47 minutes
Same image content as in the previous training ↑ | 2.500 (2.000 / 500) | LeNet model from the DL4J Model Zoo | 0.934 | 13 (from 23) / 30 minutes
Same image content as in the previous training ↑ | 2.500 (2.000 / 500) | AlexNet model from the DL4J Model Zoo | 0 | 39 (from 49) / 1 hour 42 min
Same image content, larger size: 224 x 224 | 2.500 (2.000 / 500) | LeNet model from the DL4J Model Zoo; training interrupted after 10 hours | 0.764 | 5 (from 5) / 10 hours
Same image content, smaller size: 64 x 64 | 2.500 (2.000 / 500) | LeNet model from the DL4J Model Zoo | 0.853 | 5 (from 5) / 6 minutes
Same image content, smaller size: 32 x 32 | 2.500 (2.000 / 500) | LeNet model from the DL4J Model Zoo | 0.920 | 24 (from 34) / 13 minutes
Same image content, smaller size: 32 x 32 | 2.500 (2.000 / 500) | Custom configuration, same as last used before the Zoo models ↑ | 0.924 | 18 (from 28) / 8 minutes
Same image content, smaller size: 32 x 32 | 2.000 (1.600 / 400) | VGG16 model from the DL4J Model Zoo | 0.490 | 4 (from 14) / 32 minutes
1-5 colored shapes with moderate background pattern | 2.500 (2.000 / 500) | Custom configuration, same as last used before the Zoo models ↑ | 0.952 | 14 (from 24) / 19 minutes
1-5 colored shapes with strong background pattern | 2.500 (2.000 / 500) | Same configuration as in previous training ↑ | 0.930 | 19 (from 29) / 27 minutes
Photos | 1.200 (960 / 240) | Same configuration as in previous training ↑ | 0.670 | 5 (from 15) / 17 minutes
Photos B&W | 1.200 (960 / 240) | Same configuration as in previous training ↑ | 0.644 | 6 (from 16) / 10 minutes
Photos B&W | 1.200 (960 / 240) | Increased L2 for bias 0.005 → 0.05 | 0.654 | 9 (from 19) / 8 minutes
Photos B&W | 1.200 (960 / 240) | Used AdaDelta instead of fixed learning rate and momentum | 0.673 | 44 (from 51) / 1 hour 24 min
Photos B&W | 1.200 (960 / 240) | Added dropout parameter: 0.5 | 0.378 | 5 (from 15) / 11 minutes
Photos B&W | 1.200 (960 / 240) | Changed dropout parameter to 0.1 | 0.494 | 55 (from 55) / 31 minutes
Photos B&W - optimized for clearer class distinction | 800 (640 / 160) | Same configuration as in previous training ↑ | 0.712 | 50 (from 58) / 19 minutes
Photos B&W - not optimized | 800 (640 / 160) | Same configuration as in previous training ↑ | 0.491 | 6 (from 15) / 5 minutes
Photos B&W | 800 (640 / 160) | Increased number of neurons in fully connected layer 30 → 50 | 0.743 | 53 (from 63) / 19 minutes
Photos B&W | 800 (640 / 160) | Lowered the batch size 30 → 20 | 0.755 | 30 (from 40) / 13 minutes

4.5 Subproblem 2: sharpness
4.5.1 Definition of the classes for sharpness classification
The next aspect of image aesthetics that will be focused on is image sharpness.

In the classification of composition, only two classes were considered, “good” and “bad”. Of
course, as mentioned earlier in this work, image aesthetics are not as simple as that, but for the
sake of the investigation, the scope has to be narrowed down. For the classification based on
sharpness, a different distinction will be made. An image is not automatically bad because it is
partly blurred, quite the opposite. A portrait photo is usually more attractive when it features a
shallow depth of field, i.e. a narrow distance range in which details appear sharp. It separates the subject
from the background and keeps the viewer’s focus where it belongs, with the subject. An image
of a sports car is more impressive when the background is blurred because the photographer
used a technique called “panning”, moving the camera in the direction of the movement while
keeping the subject in focus. This captures the motion in the frame. Therefore, the following
classes will be considered in this chapter.

• Sharp background, sharp foreground: Many compact (and smartphone) cameras are
only able to produce images in which the background as well as the foreground are both
sharp. This wide depth of field is due to the construction of the small lenses. Also, in
landscape photography, it is common to purposely use a wide depth of field to maintain
detail throughout the entire frame.
• Blurred background, sharp foreground: As explained above, a shallow depth of field
is often desirable, because it helps to focus the attention on the subject.
• Sharp background, blurred foreground: This is generally not a good combination.
Possibly the camera (or the photographer) focused on the wrong element in the frame,
the subject may have moved, or the shutter was pressed too soon, preventing the camera
from focusing on the subject before the shot was taken.
• Blurred background, blurred foreground: Unless the photographer’s aim is to create
something artistic, this can generally be classified as a bad photo. Either the camera did
not focus (shutter pressed too soon, subject moved, camera moved) or the light
conditions were not optimal. Depending on the camera settings, low light may force the
camera or the photographer to choose a slower shutter speed. Without the use of a
tripod, this will soon lead to blurred images.

Similar to the method used in the previous chapter, the first trainings and evaluations will be
done using generated images. This allows for a more controlled environment with a large
dataset, in which various types of images and network configurations can be tested. This time,
the position of the subject (the shapes) within the image will be random, as the composition is
not relevant to the current part of the investigation. The images will differ in the amount of
blur applied to the fore- and background.

Once it is shown that a CNN can learn to distinguish the classes defined above, the network will
be trained on photos.

4.5.2 Four classes of generated images
As explained above, four classes of images were created. Figure 34 shows one
example of each class. From left to right, the images represent the following classes:

• Blurred background, blurred foreground


• Blurred background, sharp foreground
• Sharp background, blurred foreground
• Sharp background, sharp foreground

Figure 34. Four sharpness classes – distinction in foreground and background sharpness.

Figure 34 purposely only shows examples with the same background image. This is for the sake
of a clearer explanation and comparison. The actual examples contain many different
background patterns.

Sharpness or blur is only recognizable when the images are large enough. In the composition
problem, it was possible to scale down the images to 64 x 64 or even 32 x 32 pixels and still
achieve good results with a CNN, because the position of the subject within the image was still
clearly recognizable. However, when the images for the sharpness classification are scaled down
too much, the details get lost. An otherwise blurred line would become thinner and look sharp.
This means that for the current problem the images will need to be larger. The first training run
is done using 128 x 128 pixel images. Unfortunately, this also leads to many more network
parameters and calculations, which significantly slows down the training.
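
To put this into rough numbers (assuming two 2 x 2 max-pooling stages and 20 filters per
convolution layer, as in the configuration used for the composition problem): a 128 x 128 input
is reduced to 32 x 32 x 20 = 20.480 activations before the fully connected layers, while a
64 x 64 input yields only 16 x 16 x 20 = 5.120. The number of weights in the first fully
connected layer therefore roughly quadruples, and each convolution also has to process four
times as many pixels per image.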

The first training run was done using a network configuration similar to that used for the
composition problem. The network consisted of two pairs of convolution/pooling layers,
followed by a fully connected layer and the output layer.

3.000 images were used, 750 for each class. Again, 80% of the images were used for training,
20% for evaluation. The early-stopping method was used to prevent overfitting or endless
training. While the network did learn, the training progress was very slow. After training
overnight for 8 hours, the training was automatically stopped due to the time limit. By then, it
had completed only 8 epochs. For comparison, in the composition problem, the network
managed several dozen epochs in 2 hours or less with almost as many images (2.500) because
smaller images were used. In almost every epoch the score increased slightly, but the score after
7 epochs was only 0.484, while the accuracy was 0.5217. Considering that the problem
consisted of four classes, this accuracy means that the network performed roughly twice as well
as random guessing.
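
For reference, the time- and improvement-based early stopping used in these trainings can be
configured in DL4J roughly as follows. This is only a sketch: the classes come from DL4J’s
early-stopping API (org.deeplearning4j.earlystopping), java.util.concurrent.TimeUnit is used
for the time limit, and trainIterator, testIterator, the save directory and the concrete limit
values are placeholders for illustration.

EarlyStoppingConfiguration<MultiLayerNetwork> esConfig =
        new EarlyStoppingConfiguration.Builder<MultiLayerNetwork>()
                .epochTerminationConditions(
                        new MaxEpochsTerminationCondition(100),
                        // stop when the score has not improved for 10 epochs
                        new ScoreImprovementEpochTerminationCondition(10))
                .iterationTerminationConditions(
                        // overall time limit, e.g. 8 hours for an overnight training
                        new MaxTimeIterationTerminationCondition(8, TimeUnit.HOURS))
                // score the model on the 20% test split after every epoch
                .scoreCalculator(new DataSetLossCalculator(testIterator, true))
                .evaluateEveryNEpochs(1)
                // keep the best model seen so far on disk
                .modelSaver(new LocalFileModelSaver("models/sharpness"))
                .build();

EarlyStoppingTrainer trainer = new EarlyStoppingTrainer(esConfig, configuration, trainIterator);
EarlyStoppingResult<MultiLayerNetwork> result = trainer.fit();
MultiLayerNetwork bestModel = result.getBestModel();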

4.5.3 Using only two sharpness classes
Since the training and classification of 4 classes were very time-consuming and the performance
rather low, the training was repeated several times with only two of the four classes. The goal
was to find out which classes the network could easily learn and which it had difficulty with.
The first test was to train with entirely sharp and entirely blurred images:

• Class 1: blurred background, blurred foreground
• Class 2: sharp background, sharp foreground

Reducing the number of images by 50% significantly increased the training speed. A score of
0.983 was achieved after 25 epochs, at which point the training was terminated due to the time
limitation of 2 hours. The network clearly had no trouble classifying entirely blurred and
entirely sharp images.

The next training was done with the same number of examples and network configuration, but
with the following two classes:

• Class 1: blurred background, blurred foreground
• Class 2: blurred background, sharp foreground

The only difference between these two classes was the sharpness of the shapes in the
foreground. This training also lasted for 2 hours, in which 32 epochs were completed. The best
score of 0.816 was achieved after the 31st epoch, which means that the task was more difficult.
While this score is lower than the one in the previous run, a higher score could probably have
been achieved if the training had been allowed to continue longer.

Another training was done using the following two classes:

• Class 1: sharp background, sharp foreground
• Class 2: sharp background, blurred foreground

Again, the only difference between the two classes was the sharpness of the shapes in the
foreground. However, these images were overall much more detailed because the background,
which made up the majority of the surface, was sharp. This may have made the problem more
difficult because there were more distinguishable features in the images, and only few of these
features were relevant: the ones making up the blurred subject in the foreground. After 19
epochs in 55 minutes, the training was terminated because the best score was achieved in epoch
9 and no further improvements were made during the following 10 epochs. The best F1 score
was 0.566, accuracy was 0.577, which was only slightly better than random. This confirms the
assumption that the problem was more difficult because of the increased amount of detail in
the images.

In an attempt to improve the performance on the current type of images, the network
configuration was adjusted in various ways. First, a pair of extra convolution/pooling layers
was added (making a total of three pairs), which led to a score of only 0.417 after 6 epochs. The network
did not benefit from the additional layers.

Other adjustments that were tried, but which did not lead to performance improvements either,
were the addition of L1 regularization, an increase of the filter size in the convolutional layers
(from 3x3 pixels to 5x5 and 7x7 pixels), and both an increase and a decrease of the L2
regularization (from 0.005 to 0.05 and to 0.0005, respectively). Even though the two classes were easily
distinguishable to a human viewer, the network was still unable to learn and classify them.
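
Expressed in the configuration builder, these adjustments amount to small changes of the
following kind (a sketch based on the initial configuration in appendix I; the L1 value is only
an illustrative choice):

// Added L1 regularization and increased (or decreased) L2 regularization:
NeuralNetConfiguration.Builder builder = new NeuralNetConfiguration.Builder()
        .regularization(true)
        .l1(1e-4)   // added L1 term (illustrative value)
        .l2(0.05);  // increased from 0.005; 0.0005 was tried as the decreased variant

// Larger filters in a convolution layer (3x3 -> 5x5; 7x7 was tried as well):
ConvolutionLayer largerFilterLayer = new ConvolutionLayer.Builder(5, 5)
        .nIn(3)
        .nOut(20)
        .stride(1, 1)
        .padding(1, 1)
        .activation(Activation.RELU)
        .weightInit(WeightInit.RELU)
        .build();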

Figure 35. Two sharpness classes – blurred foreground (left) vs sharp foreground (right).

Figure 35 shows two examples of each class. The background images are randomly used in both
classes. The subjects are 1-5 shapes in random sizes, colors and positions. The only consistent
difference between the two classes is the sharpness of the shapes in the foreground.

Next, the number of filters in the convolution layers was increased. This was done based on the
idea that the network might not have enough capacity to learn the necessary number of features,
because the images were now bigger than in the composition problem and therefore contained
more details. Also, the L2 regularization (which was used in most previous trainings) was
removed, since the network was already having difficulty learning and the regularization may
have been making this worse.

After training overnight for 8 hours, a score of 0.763 was achieved. This was indeed a noticeable
improvement. However, the best score was achieved after only 6 epochs (at roughly 4 hours)
and the network then continued training for another 6 epochs without making further
improvements.

During the investigation of the composition problem it soon became evident that the number
of examples had a great influence on the network’s performance. Therefore, the number of
images per class was now increased. In previous training runs 750 images per class were used.
This was raised to 2.000 images per class. The network size was again reduced by removing the
last pair of convolution/pooling layers, leaving it with two pairs of convolution/pooling layers,
each with only 20 filters per layer. Previous trainings had shown no great benefit from the third
pair of layers and reducing the network size would speed up the training.

Even though the larger dataset size should increase the time to complete a single epoch, the
reduced number of network parameters compensated for this. The time limit for this training
was set to 2 hours. In that time, the network completed 6 epochs and reached a score of 0.791.
The score improved in each epoch, indicating it would probably improve even further with more
time. Comparing this training to the previous one, it achieved a higher score in only half the
time (2 hours vs. 4 hours at the best epoch in the previous training), which is a clear indication
that the network configuration did not have to be very large or complicated, at least for the
current problem. The number of examples had a much stronger effect.

To further confirm the relevance of the dataset size, the next training was done with the exact
same network configuration but with twice as many examples, 4.000 per class. In line with the
doubled dataset size, the time limit was also doubled to 4 hours, giving the network an
equal chance to learn from the images. The training finished with a score of 0.855 in the 7th
of 8 epochs. Again, the score was still improving and had probably not yet reached its optimum.

So, even though the current problem seemed very difficult at first (the first score on these
images was only 0.566), eventually a very acceptable score was reached, which could have been
further improved. Most of the fine-tuning of the network configuration did not have any positive
effect on the score. Simply increasing the number of examples was sufficient.

4.5.4 Back to four classes of generated images


Now that relatively good performance was achieved on training and classification of 2 classes,
the investigation of the initial problem with 4 classes was continued.

The size of the dataset was increased even further, to 4.000 examples per class, making up a
total of 16.000 examples. The network configuration remained unchanged, the time limit was
set to 10 hours.

After 10 hours only 3 epochs were completed. The score achieved in this time was 0.765. After
each epoch, the performance improved by a rather large step. Based on the experience from all
previous trainings in this investigation, it is highly likely that the performance would increase
further, ending up somewhere between 0.8 and 0.9.

The following confusion matrix contains the results of the last training. It shows the actual
classes and predictions of 3.200 test examples, which is 20% of the 16.000 examples in the
dataset.

Predicted →       blur – blur   blur – sharp   sharp – blur   sharp – sharp
↓ Actual
blur – blur            789             11              0               0
blur – sharp           205            595              0               0
sharp – blur             7              0            569             224
sharp – sharp            0              3            297             500

The values on the diagonal represent the true positives, where the network correctly
predicted the actual class. The entirely blurred (“blur-blur”) images were predicted with almost
no mistakes. The network’s weakness clearly lies with the images that are entirely or
partially sharp.

The values 297 and 224 in the matrix show how the network had difficulty distinguishing the
“sharp-sharp” and “sharp-blur” images: in 297 and 224 cases, respectively, it predicted the “other” of
these two classes. This matches the findings from chapter 4.5.3, in which the problem with these
two classes was found to be more difficult than the classification of the other classes.

One last test was done with the generated images. Since the computing power was limited to
one laptop (MacBook Pro), the number of examples was slightly decreased to 3.000 examples
per class, making a total of 12.000 examples. At the same time, the image size was reduced
from 128 x 128 pixels to 100 x 100 pixels. The training time limit was set to 15 hours. This
combination should make it possible to complete more epochs while still having a sufficiently
large dataset.

After 15 hours of training, 12 epochs were completed. The reduced dataset and image size
increased the training speed as expected. A score of 0.839 was achieved in the 11th epoch. Over
the last few epochs the network’s performance increased more and more slowly, and in the 12th
epoch the score did not improve at all, indicating that it would not have learned much more if
it had trained longer. The confusion matrix looks rather similar to the one from the previous training. The
true positives are clearly recognizable by their high values.

Predicted →       blur – blur   blur – sharp   sharp – blur   sharp – sharp
↓ Actual
blur – blur            554             46              0               0
blur – sharp            44            550              0               6
sharp – blur             0              1            367             232
sharp – sharp            0              0             52             548

The following table shows the precision, recall and F1 scores for the four classes.

                Precision   Recall    F1
blur – blur       0.926      0.923    0.925
blur – sharp      0.921      0.917    0.919
sharp – blur      0.876      0.612    0.720
sharp – sharp     0.697      0.913    0.791
Average           0.855      0.841    0.839

The network’s two most noticeable weaknesses (the two lowest values in the table) are the
precision on the “sharp-sharp” class and the recall on the “sharp-blur” class: the network
wrongly classified images as “sharp-sharp” even though they were not, and it did not
identify “sharp-blur” images as reliably as it identified images of the other classes.
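
Values of this kind can be obtained with DL4J’s Evaluation class. A minimal sketch, assuming
network is the trained model and testIterator iterates over the 20% test split (the class index
used below depends on the label order of the dataset):

// Evaluate the trained network on the test set.
Evaluation evaluation = network.evaluate(testIterator);

// Overall statistics: accuracy, precision, recall and F1.
System.out.println(evaluation.stats());

// Per-class values, e.g. for one of the four classes:
System.out.println(evaluation.precision(2));
System.out.println(evaluation.recall(2));
System.out.println(evaluation.f1(2));

// The confusion matrix as a formatted string.
System.out.println(evaluation.confusionToString());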

4.5.5 Two classes of photos


For the training and classification of composition, real photos were also used. Those photos were
manually selected from the AVA dataset; half of them were composed more or less
according to the rule-of-thirds, the other half was not. Creating a sufficiently large dataset
of photos for the current subproblem was more difficult. The AVA dataset contains mostly
sharp images, which makes sense because the images are taken from the DPchallenge.com
website, where people share their photos to give and receive feedback. Most of the photos
uploaded to the website will therefore be partially or entirely sharp, but hardly ever entirely
blurred. Therefore, it is difficult to collect equally large groups of photos for each of the four
classes used in this subproblem.

To get an idea of the feasibility of sharpness classification by a CNN with photos, 2.000 random
images were taken from the AVA dataset. 1.000 of these images were then blurred using Adobe
Photoshop. The other 1.000 images remained unchanged. While this is not exactly a real-life
use-case, it is certainly closer to reality than the generated images used previously, as the two
classes consist of photos rather than drawn shapes and patterns. Two examples of each class
are shown in figure 36.

Figure 36. Photos - sharp (left) vs. blurred (right).

As a first test, the trained and saved network that scored 0.983 on the generated images (“sharp-
sharp” vs. “blur-blur”) was used to classify the photos without training on them. The network
achieved a score of 0.868. This was a surprisingly positive result, which means that the training
on generated images prepared the network quite well for the classification of photos. The
precision, recall and F1 scores were quite similar for both classes, differing by less than 0.05.

For comparison, pre-trained networks were also used in the investigation of the image
composition in chapter 4.4.2. There, the network was also trained on generated images and
then used to classify photos. However, the resulting scores were roughly random, which means
that the training on generated images was not useful for the classification of photos. The
composition-related features possibly differ more between generated images and
photos than the sharpness-related features do.

To further investigate the effect of transfer learning (applying the “knowledge” learned from a
previous problem to a new and similar problem by using it as a starting point), the pre-trained
network was trained for 2 hours on the photos. The network achieved a score of 0.972.

Next, the same network configuration was trained for 2 hours on the photos, but this time
without loading the pre-trained parameters (weights and biases). The network now learned
from the photos without any previous knowledge about the features. This resulted in an almost
identical score of 0.980.
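
The two setups compared here could look roughly as follows, assuming the network trained on
the generated images was saved with DL4J’s ModelSerializer and photoTrainIterator provides
the photo training data (the file name and variable names are placeholders):

// Setup 1: continue training from the network pre-trained on generated images.
MultiLayerNetwork pretrainedNetwork =
        ModelSerializer.restoreMultiLayerNetwork(new File("models/sharpness/bestModel.bin"));
pretrainedNetwork.fit(photoTrainIterator);   // fine-tuning on the photos

// Setup 2: train the same configuration from scratch, without pre-trained parameters.
MultiLayerNetwork freshNetwork = new MultiLayerNetwork(configuration);
freshNetwork.init();
freshNetwork.fit(photoTrainIterator);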

So, when the network was allowed to train on the photos, it did not really benefit from the pre-
training. In fact, the training on the photos could be omitted entirely because the network that
was trained on generated images already scored similarly well at classifying photos. The most
noticeable difference between the two trainings was the progress made in the first epochs.
When using the pre-trained model, the network started off at an already quite good level and made only
small improvements throughout the training. When using a “blank” network, performance
started off much worse (high loss / error), but it improved with relatively large steps in the first
epochs. After roughly one third of the training time, the two trainings progressed in a similar
way.

4.5.6 Summary of the training and classification of image sharpness
Table 2 lists some of the “milestones” from this chapter. It contains a short description of the
images used for training, remarks regarding the relevant changes to the network configuration
and the score achieved using that network configuration.

Table 2. Summary of the training and classification of image sharpness

Columns: dataset description | no. of images (training / test) | description of used CNN
configuration/adjustments | best F1 score | no. of best epoch / training duration.
Default image size: 128 x 128. (“sharp-blur” means sharp background, blurred foreground.)

Shapes on background, 4 classes (blur-blur, blur-sharp, sharp-blur, sharp-sharp) | 3.000 (2.400 / 600) |
Custom configuration last used on the composition problem (two pairs of convolution/pooling layers) |
0.484 | 8 (from 8) / 8 hours

Shapes on pattern background, blur-blur vs. sharp-sharp | 1.500 (1.200 / 300) |
Same configuration as in previous training ↑ | 0.983 | 25 (from 28) / 2 hours

Shapes on pattern background, blur-blur vs. blur-sharp | 1.500 (1.200 / 300) |
Same configuration as in previous training ↑ | 0.816 | 31 (from 32) / 2 hours

Shapes on pattern background, sharp-sharp vs. sharp-blur | 1.500 (1.200 / 300) |
Same configuration as in previous training ↑ | 0.566 | 9 (from 19) / 55 minutes

Same images as used in previous training ↑ | 1.500 (1.200 / 300) |
Added a pair of convolution/pooling layers | 0.417 | 6 (from 16) / 1 hour

Same images as used in previous training ↑ | 1.500 (1.200 / 300) |
Increased number of filters in the convolution layers (1st convolution layer: 20 → 50,
2nd convolution layer: 20 → 30) | 0.763 | 6 (from 12) / 8 hours

Same images as used in previous training ↑ | 4.000 (3.200 / 800) |
Removed 1 pair of convolution/pooling layers, decreased number of filters per convolution layer to 20 |
0.791 | 6 (from 6) / 2 hours

Same images as used in previous training ↑ | 8.000 (6.400 / 1.600) |
Same configuration as in previous training ↑ | 0.855 | 7 (from 8) / 4 hours

Same images as used in previous training ↑ | 16.000 (12.800 / 3.200) |
Same configuration as in previous training ↑ | 0.766 | 3 (from 3) / 10 hours

Same images as used in previous training ↑, slightly smaller size: 100 x 100 | 12.000 (9.600 / 2.400) |
Same configuration as in previous training ↑ | 0.839 | 11 (from 12) / 15 hours

Random photos from the AVA dataset, half of the photos blurred, the other half unchanged | 2.000 (0 / 2.000) |
No training, only evaluation using the pre-trained network that scored 0.983 on generated images ↑ |
0.868 | -

Same images as used in previous training ↑ | 2.000 (1.600 / 400) |
Continued training on the pre-trained network that scored 0.983 on generated images |
0.972 | 28 (from 34) / 2 hours

Same images as used in previous training ↑ | 2.000 (1.600 / 400) |
Same configuration as in previous training ↑, this time without pre-trained weights / biases |
0.980 | 28 (from 29) / 2 hours

4.6 Combining composition and sharpness
In the previous chapters, it was shown that a CNN can successfully classify image composition
and sharpness, provided the dataset is large enough and of good quality. The next challenge is
to see how well the CNN can learn and classify various aesthetic criteria at once.

In this work, only two aesthetic criteria are investigated. For the combined classification, the
following four classes were defined:

• Sharp, bad composition
• Sharp, good composition
• Blurred, bad composition
• Blurred, good composition

If a single CNN had to cover more criteria, it would be useful to implement the training
and classification application in such a way that a multi-class, multi-label dataset could be
handled. In the current implementation, each of the four class labels actually represents two
labels at once: sharpness and composition. This is acceptable for two criteria and four classes,
but for more criteria, it would be better to use a multi-labeled system, in which each image
could be assigned various suitable labels. Changing the training application and creating a
multi-labeled dataset of photos is, however, too time-consuming to be done in this investigation.
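
As a sketch of the idea (not implemented in this work): in DL4J, a multi-label output could be
realized by replacing the softmax output layer with a sigmoid activation and a binary
cross-entropy loss, so that each criterion gets its own independent output neuron;
numberOfCriteria is a placeholder for the number of criteria.

// One output neuron per criterion (e.g. sharpness and composition), each
// activated independently by a sigmoid instead of a shared softmax.
OutputLayer multiLabelOutput = new OutputLayer.Builder(LossFunctions.LossFunction.XENT)
        .name("Multi-label output")
        .activation(Activation.SIGMOID)
        .weightInit(WeightInit.XAVIER)
        .nOut(numberOfCriteria)
        .build();

The labels would then be binary vectors (e.g. [1, 0] for a sharp image with bad composition)
instead of a single class index.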

4.6.1 Four classes of generated images


Figure 37 shows one example of each of the four classes of generated images. From left to
right, these are: blurred – bad composition, blurred – good composition, sharp – bad
composition, sharp – good composition.

Figure 37. Generated images with distinction in sharpness and composition.

2.000 images per class were created, making a total of 8.000 images. The network configuration
was the same as the one used in most previous trainings, consisting of two pairs of
convolution/pooling layers, followed by one fully-connected layer.

After training for 2 hours, the network had completed 5 epochs. The score achieved in the 4th
epoch was 0.869. The score did not improve in the last epoch, which indicates that the
performance would likely not have improved much with a longer training time. The precision,
recall and F1 scores were relatively similar for all four classes, with no extreme outliers.

The confusion matrix that resulted from this training is rather interesting:

Predicted →          blur – bad comp.   blur – good comp.   sharp – bad comp.   sharp – good comp.
↓ Actual
blur – bad comp.            347                 53                   0                   0
blur – good comp.            51                349                   0                   0
sharp – bad comp.             0                  0                 308                  92
sharp – good comp.            0                  0                  13                 387

The zeros in the matrix show that the network clearly distinguished the blurred and
sharp classes. Of the 1.600 evaluation examples, the network did not even once mistake a sharp
image for a blurred one or vice versa.

The network did, however, make rather many mistakes in the classification of the composition,
even though this was an “easy task” when the composition classification was investigated as a
single problem, achieving F1 scores close to perfection.

4.6.2 Four classes of photos


A similar training was repeated with photos. During the investigation of the composition, 800
images were manually selected from the AVA dataset. Half of these images were composed
more or less according to the rule-of-thirds, the other half was not. As mentioned several times
before, the composition is simply called “good” and “bad” for the sake of simplicity in this work.
These two groups of images were now copied and blurred, creating the four classes for the
combined composition/sharpness classification problem. This is, of course, a rather simplified
problem and not yet a real-life scenario. The examples in figure 38 represent the following
classes (from left to right): blurred – bad composition, blurred – good composition,
sharp – bad composition, sharp – good composition.

Figure 38. Photos with distinction in composition and sharpness.

The network was allowed to train for 2 hours; however, training was terminated after 1 hour
and 35 minutes because no improvement was made for 10 epochs. The highest score, 0.693,
was reached after 16 epochs. Again, the network made almost no mistakes between the blurred
and sharp images, but rather many mistakes regarding the composition, as can be seen in the resulting
confusion matrix.

Predicted →          blur – bad comp.   blur – good comp.   sharp – bad comp.   sharp – good comp.
↓ Actual
blur – bad comp.             62                 18                   0                   0
blur – good comp.            27                 53                   0                   0
sharp – bad comp.             4                  0                  43                  33
sharp – good comp.            1                  3                  11                  65

Even though the performance after this last training is not very high, it is clearly not random
and it shows that the network learned at least some of the features relevant to the classification
of composition as well as sharpness. Based on the experience gained throughout this work, it is
likely that a larger dataset would lead to a better performance. Further improvement could also
be achieved by fine-tuning the CNN configuration, although in most cases, the fine-tuning did
not have as much of an effect as the dataset size and quality.

5 Summary and conclusions
The aim of this work was to investigate the possibility of classifying images by various aesthetic
criteria with the use of convolutional neural networks. The two criteria that were selected to be
the subproblems for this investigation were image composition and sharpness. Each of the
subproblems was approached in the same way.

Two Java applications were implemented. The first application was used to draw images by
using the Java 2D Graphics API. The second and most important application was used to train
and evaluate the CNNs and save the results to file. This application made use of the open-source
DL4J machine learning library.

After the implementation of the two applications, a large number of images was generated to
have a large and consistent dataset with specific features. The images contained simple shapes
as “subjects” and a plain or pattern background. The composition and sharpness of the subject
and background were adjusted for the desired type of images. Many different datasets were
created with increasingly complex images, in order to make the learning and classification task
gradually more difficult. A CNN was then configured and trained to classify these images. The
initial network configuration was based on an implementation example provided by the DL4J
library. Depending on the evaluation results after each training, either the CNN configuration
was adjusted to achieve better performance, or the dataset size or content was changed to
investigate the reasons for certain learning difficulties. After the training and classification of
generated images, the CNN was trained on photos.

Throughout the investigation, it was repeatedly shown that the CNN needs a large dataset to
learn the features relevant to a certain classification. For the rather simple generated images,
several hundred examples were sufficient. For more complex images, the network needed
several thousand examples to achieve a good score. The classification of photos should ideally
be done using many thousands of photos, which unfortunately was not possible in the time and
scope of this investigation.

Much time and effort were invested in learning and understanding the workings of CNNs and
the configuration of the network parameters. Even though the adjustment and fine-tuning of
the network configuration had some influence on the network’s performance, none of the fine-
tuning influenced the final results as much as the size of the dataset did. A relatively “basic”
network configuration with only a few layers and hyperparameters was sufficient to achieve
good results, as long as the dataset was large enough.

The “custom” network configuration that was developed over time based on the evaluation
results was compared to popular CNN architectures such as LeNet and AlexNet. For the
problems investigated in this work, the custom network configuration, which is similar to the
LeNet architecture, proved to be the most successful.

Regarding the classification of the image composition, the investigation showed that a CNN can
indeed learn that the position of a subject within the image is the relevant feature, rather than
the appearance of the subject itself. The CNN was successful at distinguishing images that made
use of the rule-of-thirds from images that did not. During the training and classification of
generated images, the F1 scores exceeded 0.9 (in a range of 0-1).

With the increasing complexity of the generated images, the classification problem became
more difficult and the CNN needed a larger dataset and more time to achieve good results.

The classification of photos proved to be more difficult; the network achieved a maximum score
of 0.755. However, it must be mentioned that the dataset containing photos was rather small
at a total of 800 images, due to the fact that the photos had to be manually selected. The dataset
of photos was at some point reduced in size in order to contain only the most distinct examples
for each class. Despite the reduction in size, this optimization of the dataset led to an increased
score because the network could better recognize the relevant features.

A pre-trained network that scored well on the classification of composition in generated images
was not successful at the classification of photos. The generated images were likely too different
from “real” photos to be a suitable preparation for this classification task.

The classification of image sharpness proved to be more difficult than the classification of
composition. Four classes of images were defined with a sharp or blurred background and
foreground. Most of this part of the investigation was done using generated images due to the
lack of a large dataset of photos. Not enough blurred photos were available to create a dataset
that would contain a large number of examples for each class. Therefore, a number of random
images was blurred using Adobe Photoshop to imitate “real” blurred photos. This means that
the sharpness classification of photos was done with only two classes, entirely sharp and entirely
blurred. While this is not a real-life use-case, it was the best option available for this work.

The CNN was successful at classifying entirely sharp and entirely blurred images but it had
difficulty with the generated images that were partly blurred and partly sharp. These images
were often mixed up in the evaluation. A CNN is clearly able to distinguish sharp and blurred
features, but it does not “know” which part of the image is background and which is the subject.
A possible solution for this problem would be to not only use the images’ pixel data as training
input, but also provide the bounding box (coordinates) of the subject, which tells the CNN where
in the image the subject is located. The CNN could then learn whether the subject and the
background are sharp or blurred.

While the classification of image composition could be learned from images with a size of only
32 x 32 pixels, the images for the sharpness classification needed to be larger (128 x 128 pixels).
Using larger images meant that the network configuration had to be adjusted accordingly. The
increased number of network parameters made the training much more time-consuming.

A very positive surprise in the investigation of the image sharpness was that a network which
was pre-trained on generated images was relatively good at the classification of photos. The
features which the CNN learned from the generated images were obviously similar to the
equivalent features in photos.

After the individual investigation of image composition and sharpness, a combination of these
two aesthetic criteria was investigated by creating image classes that contained a combination
of both features within each image: good/bad composition combined with sharpness/blur.

With generated images as well as with photos, the CNN achieved relatively good results in the
combined problem. During the individual investigation of the two subproblems, the CNN had
performed better at classifying the composition. In the combined problem, however, the CNN
was unexpectedly better at distinguishing the sharpness than the composition. It did successfully
separate the blurred from the sharp images, but within these two groups, it made many mistakes
in the classification of the composition. Here, further fine-tuning of the network configuration
or increasing its size may be useful to give the network more “capacity” to accommodate the
additional amount of features to be learned.

This investigation may have covered only a small part of the wide range of image aesthetic
criteria, but it has shown that CNNs can indeed learn some of the features relevant to the
classification of generated images and even real photos. As long as a sufficiently large dataset
is available, as well as a few hours or days for training, a suitably configured CNN can
certainly learn much more.

During the implementation of the application with the DL4J library, two contributions to the
open source library were made. One of these contributions was the improvement of a code
example, the other was pointing out and discussing the inconsistent calculation of the
evaluation scores. This led to an adjustment of the calculations and improved documentation.

Writing this work, learning about convolutional neural networks and implementing and
training them has been very interesting. It was exciting to observe the application learning
something for which it was never explicitly instructed. Even though this work has now come to
an end, the interest in machine learning, especially in combination with image analysis, will
surely continue.

Table of figures
Figure 1. Example of desired classification. .............................................................................. 2
Figure 2. Biological neuron. ...................................................................................................... 8
Figure 3. Artificial neural network. ........................................................................................... 8
Figure 4. Mark I Perceptron. ..................................................................................................... 9
Figure 5. Relationship between AI and deep learning. ............................................................ 10
Figure 6. Single neuron in an artificial neural network........................................................... 10
Figure 7. Linear activation function. ....................................................................................... 11
Figure 8. Sigmoid activation function. .................................................................................... 11
Figure 9. Rectified linear activation function. ......................................................................... 12
Figure 10. Schematic example of global and local minima. .................................................... 16
Figure 11. CNN architecture. .................................................................................................. 18
Figure 12. CNN input. ............................................................................................................. 18
Figure 13. Convolution operation. .......................................................................................... 19
Figure 14. Feature detection in convolutional layers of a CNN. .............................................. 19
Figure 15. Max pooling operation. .......................................................................................... 20
Figure 16. Class diagram of the image generation application. ............................................... 26
Figure 17. Class diagram of the training and classification application. .................................. 28
Figure 18. Training with early stopping. ................................................................................. 29
Figure 19. DL4J Training UI – Overview page. ....................................................................... 33
Figure 20. DL4J Training UI – Model page. ............................................................................ 34
Figure 21. Black shapes - bad (left) vs. good (right) composition. .......................................... 36
Figure 22. Colored shapes – bad (left) vs. good (right) composition. ..................................... 36
Figure 23. Training progress on colored shapes. ..................................................................... 37
Figure 24. Colored shape and background - bad (left) vs. good (right) composition. ............. 37
Figure 25. Mean magnitude of parameters - without and with bias regularization. ................ 38
Figure 26. Multiple shapes - bad (left) vs. good (right) composition. ..................................... 38
Figure 27. Multiple shapes with background - bad (left) vs. good (right) composition........... 40
Figure 28. "Score vs. Iteration" graph showing some learning difficulties. .............................. 40
Figure 29. Shapes with stronger background - bad (left) vs. good (right) composition. ......... 41
Figure 30. Score vs. Iteration on images with background...................................................... 41
Figure 31. Photos – bad (left) vs. good (right) composition.................................................... 42
Figure 32. Photos converted to black-and-white - bad (left) vs. good (right) composition. .... 42

Figure 33. Development of parameters when using a pre-trained model. ............................... 43
Figure 34. Four sharpness classes – distinction in foreground and background sharpness...... 48
Figure 35. Two sharpness classes – blurred foreground (left) vs sharp foreground (right). .... 50
Figure 36. Photos - sharp (left) vs. blurred (right). ................................................................. 53
Figure 37. Generated images with distinction in sharpness and composition. ........................ 56
Figure 38. Photos with distinction in composition and sharpness. .......................................... 57
Figure 39. Test images for the classification of simple shapes................................................. IX

Appendices
I. Initial CNN configuration
MultiLayerConfiguration configuration = new NeuralNetConfiguration.Builder()
.seed(12345)
.iterations(iterations)
.learningRate(learningRate)
.optimizationAlgo(OptimizationAlgorithm.STOCHASTIC_GRADIENT_DESCENT)
.regularization(true)
.l2(0.0005)
.updater(new Nesterovs(0.9)) // Momentum
.list()
.layer(0, new ConvolutionLayer.Builder(3,3)
.name("Convolution layer 1")
.nIn(3)
.nOut(20) // Number of filters
.stride(1,1)
.padding(1,1)
.activation(Activation.RELU)
.weightInit(WeightInit.RELU)
.build())
.layer(1, new SubsamplingLayer.Builder(SubsamplingLayer.PoolingType.MAX)
.name("Pooling layer 1")
.kernelSize(2,2)
.stride(2,2)
.build())
.layer(2, new ConvolutionLayer.Builder(3,3)
.name("Convolution layer 2")
.nOut(20) // Number of filters
.stride(1,1)
.padding(1,1)
.activation(Activation.RELU)
.weightInit(WeightInit.RELU)
.build())
.layer(3, new SubsamplingLayer.Builder(SubsamplingLayer.PoolingType.MAX)
.name("Pooling layer 2")
.kernelSize(2,2)
.stride(2,2)
.build())
.layer(4, new OutputLayer.Builder(LossFunctions.LossFunction.NEGATIVELOGLIKELIHOOD)
.name("Output layer")
.activation(Activation.SOFTMAX)
.weightInit(WeightInit.XAVIER)
.nOut(numLabels)
.build())
.setInputType(InputType.convolutional(imgHeight,imgWidth,depth))
.backprop(true).pretrain(false).build();

II. Learning trials with the initial CNN configuration
The dataset used for the trial trainings consisted of simple generated images. The images were
divided into two classes, “blue rectangle” and “black ellipse”. Each image contained a single
shape in random size and location, as illustrated in figure 39.

Figure 39. Test images for the classification of simple shapes.

Since this was a rather simple task for the CNN, the trial started with:

• 100 images (80 for training, 20 for test)
• A learning rate of 0.005
• 10 epochs
• 1 iteration

The network achieved an F1 score of 0.89, which is a good starting point. Changing some of
the hyperparameters led to some expected, but also to some surprising effects on the score.

• Learning rate:
o A decrease from 0.005 to 0.001 worsened the F1 score from 0.89 to 0.82. This
means that 0.005 was an appropriate learning rate for the current dataset.
o An increase from 0.005 to 0.05 made the loss “explode”, the network did not
learn at all.
• Number of epochs:
o A decrease from 10 to 5 improved the F1 score from 0.89 to 0.95. This seems to
indicate that 10 epochs were too many and caused overfitting.
o An increase from 10 to 20 improved the F1 score from 0.89 to 1.0.
This is surprising because previously it seemed that 10 epochs were too many.
Instead, increasing the number of epochs even further improved the score again.
This may mean that at 10 epochs, the algorithm got stuck in a local minimum,
which it got out of after more epochs.
• Number of examples:
o Increasing the number of examples from 100 to 200 improved the F1 score from
0.89 to 0.97. This was an expected result.
• Mini-batch size:
o Reducing mini-batch size from 20 to 10 improved the F1 score from 0.89 to 0.97,
but the graphs looked rather “wild”, which was expected, as the smaller mini-
batches meant noisier gradient estimates and thus less smoothing.
o Increasing the mini-batch size from 20 to 40 led to an F1 score of 0.96 while the
graphs were much smoother.

• Iterations:
o Increasing the iterations from 1 to 2 improved the F1 score from 0.89 to 1.0.
This increase was expected because the network saw each mini-batch twice per
epoch and updated the network parameters twice as often.
• Number of epochs:
o Doubling the number of epochs from 10 to 20 also improved the score from 0.89
to 1.0. This had a similar effect as doubling the number of iterations. Again, the
dataset was passed through the network twice as often, enabling the network to
learn more. Also, the training duration was identical with the two adjustments
(doubling the epochs or iterations).
• Momentum:
o Decreasing the momentum from 0.9 to 0.5 worsened the score from 0.89 to 0.57.
The larger momentum clearly had a positive influence on the learning.
o Increasing the momentum from 0.9 to 0.99 worsened the score from 0.89 to
0.82. Too large a momentum may have caused the algorithm to overshoot the
minimum. The suggested value of 0.9 seems to be appropriate.
• Number of filters in the convolution layer:
o Decreasing the number of filters in the first convolution layer from 20 to 10
improved the F1 score from 0.89 to 0.95. Fewer filters caused the layer to extract
fewer features, making the model “simpler” and helping it generalize better.
o Making the same adjustment in the second convolution layer worsened the score
from 0.89 to 0.74. While this is slightly surprising at first, the difference between
the effect of the adjustment to the two convolution layers may be explained by
the fact that the first layer extracts low-level features such as curves and edges,
while the second layer extracts higher-level features such as entire shapes. In
this test case, with rectangles and ellipses, there are only a few low-level features
(sharp and round corners, straight and round edges), while these are combined
into many different shapes (random sizes and aspect ratio). As with most
hyperparameters, the optimal value strongly depends on the type of images in
the dataset.
• Pooling type in the convolution layers:
o Changing the pooling type from max-pooling to average-pooling slightly
worsened the score. The common choice of max-pooling seems to be
sensible.
• Loss function in the output layer:
o Changing the loss function from “negative log likelihood” to “hinge” did not lead
to any changes in the score.
• Activation function in the convolution layers:
o Changing the activation function from ReLU to “leaky ReLU” improved the score
from 0.89 to 1.0. While this matches the explanations in the literature, the effect
was still impressive. Using “Leaky ReLU” prevents neurons with negative inputs from
“dying” by always passing on a small signal instead of zero. In the test case, the now “revived”
neurons were obviously indeed relevant to the learning progress.
o Changing the activation function to “sigmoid” worsened the score slightly.
o Changing the activation function to “identity” (linear) surprisingly improved the
score to 1.0. This did not quite match the recommendations from the literature.
• Filter size in the convolution layers:

o Changing the filter size in the convolution layers from 3x3 to 5x5 (as used in
some CNN implementation examples) slightly worsened the score.
• Kernel size in the pooling layers:
o Changing the kernel size in only one of the pooling layers from 2x2 to 3x3 slightly
worsened the score. Interestingly, changing the kernel size in both pooling layers
improved the score.
• L2 regularization:
o Increasing the L2 regularization from 0.0005 to 0.005 did not influence the
score.
o Increasing it further to 0.05 improved the score from 0.89 to 0.95.
• More layers:
o Adding more convolution and pooling layers slightly improved the score.

Most of the changes made to the CNN configuration had the expected effect on the score.
However, some results were surprising and made it clear that one should not focus on only a
few parameters. Instead, especially when the results are not satisfactory, it seems advisable to
sometimes make random or even seemingly illogical changes to the configuration and see what
happens.

III. LeNet configuration from the DL4J Model Zoo
MultiLayerConfiguration configuration = new NeuralNetConfiguration.Builder()
.seed(seed)
.activation(Activation.IDENTITY)
.weightInit(WeightInit.XAVIER)
.optimizationAlgo(OptimizationAlgorithm.STOCHASTIC_GRADIENT_DESCENT)
.updater(new AdaDelta())
.convolutionMode(ConvolutionMode.Same)
.list()
.layer(0, new ConvolutionLayer.Builder(new int[] {5, 5}, new int[] {1, 1})
.name("cnn1")
.nIn(colorChannels)
.nOut(20)
.activation(Activation.RELU)
.build())
.layer(1, new SubsamplingLayer.Builder(SubsamplingLayer.PoolingType.MAX,
new int[] {2, 2}, new int[] {2, 2})
.name("maxpool1")
.build())
.layer(2, new ConvolutionLayer.Builder(new int[] {5, 5}, new int[] {1, 1})
.name("cnn2")
.nOut(50)
.activation(Activation.RELU)
.build())
.layer(3, new SubsamplingLayer.Builder(SubsamplingLayer.PoolingType.MAX,
new int[] {2, 2},new int[] {2, 2})
.name("maxpool2")
.build())
.layer(4, new DenseLayer.Builder()
.name("ffn1")
.activation(Activation.RELU)
.nOut(500)
.build())
.layer(5, new OutputLayer.Builder(LossFunctions.LossFunction.MCXENT)
.name("output")
.nOut(numLabels)
.activation(Activation.SOFTMAX)
.build())
.setInputType(InputType.convolutionalFlat(imgHeight, imgWidth, colorChannels))
.backprop(true).pretrain(false).build();

IV. AlexNet configuration from the DL4J Model Zoo
MultiLayerConfiguration configuration = new NeuralNetConfiguration.Builder()
.seed(seed)
.weightInit(WeightInit.DISTRIBUTION)
.dist(new NormalDistribution(0.0, 0.01))
.activation(Activation.RELU)
.optimizationAlgo(OptimizationAlgorithm.STOCHASTIC_GRADIENT_DESCENT)
.updater(new Nesterovs(1e-2, 0.9))
.biasUpdater(new Nesterovs(2e-2, 0.9))
.convolutionMode(ConvolutionMode.Same)
.gradientNormalization(GradientNormalization.RenormalizeL2PerLayer)
.inferenceWorkspaceMode(WorkspaceMode.SINGLE)
.dropOut(0.5)
.l2(5 * 1e-4)
.miniBatch(false)
.list()
.layer(0,new ConvolutionLayer.Builder(new int[] {11, 11}, new int[] {4, 4},
new int[] {2, 2})
.name("cnn1")
.cudnnAlgoMode(ConvolutionLayer.AlgoMode.PREFER_FASTEST)
.convolutionMode(ConvolutionMode.Truncate)
.nIn(colorChannels)
.nOut(64).build())
.layer(1, new SubsamplingLayer.Builder(SubsamplingLayer.PoolingType.MAX,
new int[] {3, 3}, new int[] {2, 2}, new int[] {1, 1}).name("maxpool1")
.convolutionMode(ConvolutionMode.Truncate)
.build())
.layer(2, new ConvolutionLayer.Builder(new int[] {5, 5}, new int[] {2, 2},
new int[] {2, 2}).name("cnn2")
.convolutionMode(ConvolutionMode.Truncate)
.cudnnAlgoMode(ConvolutionLayer.AlgoMode.PREFER_FASTEST)
.nOut(192)
.biasInit(nonZeroBias).build())
.layer(3, new SubsamplingLayer.Builder(SubsamplingLayer.PoolingType.MAX,
new int[] {3, 3}, new int[] {2, 2})
.name("maxpool2").build())
.layer(4, new ConvolutionLayer.Builder(new int[] {3, 3}, new int[] {1, 1},
new int[] {1, 1}).name("cnn3")
.cudnnAlgoMode(ConvolutionLayer.AlgoMode.PREFER_FASTEST)
.nOut(384).build())
.layer(5, new ConvolutionLayer.Builder(new int[] {3, 3}, new int[] {1, 1},
new int[] {1, 1}).name("cnn4")
.cudnnAlgoMode(ConvolutionLayer.AlgoMode.PREFER_FASTEST)
.nOut(256)
.biasInit(nonZeroBias).build())
.layer(6, new ConvolutionLayer.Builder(new int[] {3, 3}, new int[] {1, 1},
new int[] {1, 1}).name("cnn5")
.cudnnAlgoMode(ConvolutionLayer.AlgoMode.PREFER_FASTEST)
.nOut(256)
.biasInit(nonZeroBias).build())
.layer(7, new SubsamplingLayer.Builder(SubsamplingLayer.PoolingType.MAX,
new int[] {3, 3}, new int[] {7, 7})
.name("maxpool3").build())
.layer(8, new DenseLayer.Builder().name("ffn1")
.nIn(256).nOut(4096)
.dist(new GaussianDistribution(0, 0.005))
.biasInit(nonZeroBias)
.dropOut(dropOut).build())
.layer(9, new DenseLayer.Builder().name("ffn2")
.nOut(4096)
.dist(new GaussianDistribution(0, 0.005))
.biasInit(nonZeroBias)
.dropOut(dropOut).build())
.layer(10, new OutputLayer.Builder(LossFunctions.LossFunction.NEGATIVELOGLIKELIHOOD)
.name("output")
.nOut(numLabels)
.activation(Activation.SOFTMAX).build())
.setInputType(InputType.convolutionalFlat(imgHeight, imgWidth, colorChannels))
.backprop(true).pretrain(false).build();

V. The final configuration of custom CNN for composition
classification
The following source code is the final CNN configuration used for the classification of photos
based on their composition. When comparing this configuration to the initial configuration used
for the first training on the generated images, there are several differences. These differences
are commented in the source code.

MultiLayerConfiguration configuration = new NeuralNetConfiguration.Builder()


.seed(12345)
.iterations(1)
.optimizationAlgo(OptimizationAlgorithm.STOCHASTIC_GRADIENT_DESCENT)
.regularization(true)
.l2(0.005) // Increased L2 regularization
.l2Bias(0.005) // Added L2 bias regularization
.updater(Updater.ADADELTA) // Replaced fixed learning rate and momentum
.list()
.layer(layerIndex++, new ConvolutionLayer.Builder(5,5) // Increased filter size
.name("Convolution 1")
.nIn(3)
.nOut(20)
.stride(1,1)
.padding(1,1)
.activation(Activation.LEAKYRELU) // Replaced ReLU with Leaky ReLU
.weightInit(WeightInit.RELU)
.build())
.layer(layerIndex++, new SubsamplingLayer.Builder(SubsamplingLayer.PoolingType.MAX)
.name("Pooling 1")
.kernelSize(2,2)
.stride(2,2)
.build())
.layer(layerIndex++, new ConvolutionLayer.Builder(3,3)
.name("Convolution 2")
.nOut(20)
.stride(1,1)
.padding(1,1)
.activation(Activation.LEAKYRELU) // Replaced ReLU with Leaky ReLU
.weightInit(WeightInit.RELU)
.build())
.layer(layerIndex++, new SubsamplingLayer.Builder(SubsamplingLayer.PoolingType.MAX)
.name("Pooling 2")
.kernelSize(2,2)
.stride(2,2)
.build())
.layer(layerIndex++, new ConvolutionLayer.Builder(3,3)
.name("Convolution 3") // Added a third convolution layer
.nOut(20)
.stride(1,1)
.padding(1,1)
.activation(Activation.LEAKYRELU)
.weightInit(WeightInit.RELU)
.build())
.layer(layerIndex++, new SubsamplingLayer.Builder(SubsamplingLayer.PoolingType.MAX)
.name("Pooling 3") // Added a third pooling layer
.kernelSize(2,2)
.stride(2,2)
.build())
.layer(layerIndex++, new DenseLayer.Builder()
.name("Fully connected 1") // Added a fully connected layer
.nOut(50)
.activation(Activation.RELU)
.weightInit(WeightInit.XAVIER)
.build())
.layer(layerIndex, new OutputLayer.Builder(LossFunctions
.LossFunction.NEGATIVELOGLIKELIHOOD)
.name("Output")
.activation(Activation.SOFTMAX)
.weightInit(WeightInit.XAVIER)
.nOut(numLabels)
.build())
.setInputType(InputType.convolutional(imgHeight,imgWidth,depth))
.backprop(true).pretrain(false).build();

