Icdipc 2015 7323027

Multimedia Data Mining using Deep Learning
Peter Wlodarczak, Jeffrey Soar, Mustafa Ally
Faculty of Business, Education, Law and Arts
University of Southern Queensland
Toowoomba, Australia
wlodarczak@gmail.com
Abstract—Due to the large amounts of multimedia data on the Internet, multimedia mining has become a very active area of research. Multimedia mining is a form of data mining. Data mining uses algorithms to segment data, to identify useful patterns and to make predictions. Despite the successes in many areas, data mining remains a challenging task. In the past, multimedia mining was one of the fields where the results were often not satisfactory. Multimedia data mining extracts relevant data from multimedia files such as audio, video and still images to perform similarity searches, to identify associations, and for entity resolution and classification. As the mining techniques have matured, new techniques were developed. A lot of progress has been made in areas such as visual data mining and natural language processing using deep learning techniques. Deep learning is a branch of machine learning and has been used, among other things, on smartphones for face recognition and voice commands. Deep learners are a type of artificial neural network with multiple data processing layers that learn representations by increasing the level of abstraction from one layer to the next. These methods have improved the state of the art in multimedia mining, in speech recognition, visual object recognition, natural language processing and other areas such as genome mining and predicting the efficacy of drug molecules. This paper describes some of the deep learning techniques that have been used in recent research for multimedia data mining.

Keywords—data mining; multimedia data mining; deep learning; artificial neural networks; natural language processing; visual data mining

I. INTRODUCTION

With the unprecedented amount of data being collected and stored today, and its availability on the Internet, Data Mining (DM) has gained greater significance for public and private organizations. Not surprisingly, a lot of research has been conducted in this area. Only the development of new technologies called Big Data made it possible to process and analyze these large volumes of data efficiently. DM and Big Data are often used synonymously. DM is the analytical process of exploring data to detect patterns and relationships between variables in data sets. DM is used, among other things, to analyze crime patterns [1], to predict consumer behavior [2], for fraud detection [3], for personalized medical treatments [4], to analyze seismic activity to detect new oil sources [5] and to analyze sensor data to predict material failures [6], to name a few. Data mining is a multidisciplinary field in the areas of statistics, databases, machine learning, artificial intelligence, information retrieval and visualization [10].

DM has been adopted in virtually every domain, and new trends have been emerging. The Internet gave access to many different data sources, and data is stored in various formats. The different sources often have interfaces that are not compatible with each other. Distributed Data Mining (DDM) uses highly sophisticated algorithms to analyze data from these heterogeneous data sources. Spatial and geographic DM analyzes geographical, environmental and astronomical data to measure distances and topology for navigation and Geographic Information Systems (GIS). Ubiquitous DM taps into mobile devices to study human behavior and human-machine interaction. Time series and sequential DM analyzes seasonal and cyclical trends to assess customer behavior and buying patterns. In recent years, significant advances have been made in Multimedia DM (MDM). As the name suggests, MDM extracts useful information from multimedia data such as video, audio and image files. MDM has various applications. It can be used for face recognition, audio-visual speech recognition, entity resolution and record disambiguation. MDM has also been used for image tagging [7], [8]. Tagging means assigning a keyword or term to a piece of information such as a hyperlink or image file, with the goal of describing the item. For instance, an image caption is a tag. Tagging has become a standard mechanism on the Internet for annotating multimedia data, and search engines rely on tags to retrieve multimedia data [28]. Image caption generation is the process of generating a descriptive sentence for an image, a task that humans can do with striking ease, but that poses a major challenge for a machine. Not only must caption generation models be able to solve the computer vision challenges of determining what objects are in an image, but they must also be powerful enough to capture and express their relationships in
ISBN: 978-1-4673-6832-2©2015 IEEE 190

natural language [7]. Automatically describing the content of an image is a fundamental problem in artificial intelligence that connects computer vision and natural language processing [8]. Recent work has significantly improved the quality of caption generation. It is based on training deep neural networks, also called deep learning. Deep learning (DL) is a branch of machine learning (ML) and goes back to the 1980s. ML is a subfield of Artificial Intelligence (AI). DL is nothing new and has sometimes been criticized as a rebranding of neural networks. However, recent research in using DL for MDM has been promising, especially in the area of visual data mining [7], [8], [12], [18] and natural language processing (NLP) [11], [14]. To describe an image, first a vectorial representation of the image is created using convolutional neural networks (ConvNets) [7]. The representation is then decoded into natural language sentences using recurrent neural networks. The next section describes the techniques that have been used for multimedia data mining and image caption generation.

II. METHODOLOGY

Building intelligent systems that are capable of extracting high-level representations from high-dimensional data lies at the core of solving many AI-related tasks, including visual object or pattern recognition, speech perception, and language understanding [9]. Many learning schemes use shallow architectures such as neural networks with only one hidden layer, support vector machines (SVM), kernel logistic regression and many others. They are often incapable of extracting meaningful patterns from high-dimensional input such as images or video files. Previous work in automatic image caption generation has distilled an image down to its most salient objects. For instance, a common technique for human face recognition is to extract Eigenfaces from images of a person [29]. An Eigenface is a low-dimensional representation of a human face. The Eigenfaces are stored in a database and used for face recognition on new, unseen pictures.

Fig. 1. Eigenface generation

However, this method has the drawback that information such as the background of the image, which could be used for richer, more descriptive captions, is lost.

In the past few years, researchers have developed deep learning architectures that are capable of extracting high-level representations for a wide range of domains including visual object recognition, natural language processing, and speech perception. Examples of deep learning schemes are deep belief networks, deep Boltzmann machines, deep auto-encoders, and sparse coding-based methods. This paper focuses on the subfield of Deep Learning (DL) in artificial Neural Networks (aNN).

A. Shallow artificial Neural Network models

aNNs are inspired by nature. The fact that humans can solve many classification problems with astounding ease must lie in the fact that neurons in the brain are massively interconnected, allowing a problem to be decomposed into subproblems that can be solved at the neuron level [16]. This observation inspired the creation of aNNs. A typical aNN consists of perceptrons, the neurons, and their weighted connections, the axons. aNNs are well documented in the literature. They are briefly described here to contrast shallow aNNs with deep aNNs. The basic idea of a perceptron is to find a linear function f:

f(x) = w^T x + b    (1)

such that f(x) > 0 for one class and f(x) < 0 for the other class. The weights w = (w_1, w_2, …, w_m) and the bias b are adjusted during training until a loss function converges. The perceptron computes a weighted sum of the feature vector components. If it is above a certain threshold, the perceptron “fires”, i.e. it classifies the input as belonging to a certain category. This activation function is called the step function and outputs either 0 or 1. In each training iteration, a loss function, also called error function, is minimized until it converges, i.e. until the minimum is reached.

A perceptron can only linearly separate a space into two half-spaces separated by a hyperplane. This is called binary classification.

Perceptrons can be organized into a hierarchical structure to form multilayer perceptrons, a type of aNN, to represent nonlinear decision boundaries. They can be used for multiclass classification problems. Instead of the step function of the perceptron that just outputs 0 or 1, an aNN uses a differentiable activation function such as a sigmoid (for example the hyperbolic tangent function) or a radial basis function (RBF). But there are other activation functions that can be used too. Multilayer perceptrons are a type of feed-forward network since they do not contain any cycles. The signal is passed from the input layer to the output layer without any recurrence.

Gradient descent and stochastic backpropagation are optimization methods for learning the weights in a neural network. Backpropagation is a relatively simple mechanism to adjust the weights of aNNs during training. Backpropagation changes the weights based on the contribution of each unit to the final result. It calculates the gradient of the loss function with respect to all the weights. Since the overall error does not necessarily decrease after each iteration, this is called stochastic backpropagation. Like the activation function, the loss function has to be differentiable. Often the squared-error loss function is used, but others such as negative log-likelihood can be used too. A mathematical optimization algorithm to minimize the loss function is called gradient descent. It is an iterative optimization procedure that uses the gradient of the loss function to adjust a function's parameters until a minimum is reached [16]. It takes the value of the derivative of the loss function and multiplies it by a constant called the learning rate. The learning rate determines the size of each step down the slope of the loss function. If it is too small,
learning might be slow, or it might get caught in a local minimum. If it is too big, it might overshoot the minimum and oscillate wildly. Fig. 2 shows gradient descent caught in a local minimum.

Fig. 2. Gradient descent

Gradient descent can only find a local minimum, which is a serious drawback if there are several minima. This is why SVMs have often been favoured over aNNs in the past, since they do not have this problem. Gradient descent is, in fact, a general-purpose optimization technique that can be applied whenever the objective function is differentiable [16]. Fig. 3 shows a simple multilayer perceptron with one hidden layer and sigmoid activation functions.

Fig. 3. Multilayer perceptron with one hidden layer

One of the most common training approaches is supervised learning, where the aNN is trained using labelled input data, for instance, images of a person's face at different angles and ages. The goal is to find a function that maps each input to its correct output, i.e. a picture of a face is correctly classified as belonging to a specific person. A separate data set, a test set, is then used to verify whether the trained network correctly recognises a person in new, unseen images. If the image is wrongly classified, the error is calculated, and the weights are adjusted using backpropagation. The training goes through as many iterations as it takes until the aNN correctly recognises a person in an image.

As ML schemes take feature vectors as input, images have to be transformed into such feature vectors, a process called feature extraction. Since images come in the form of arrays of pixels, feature extraction means identifying relevant regions in the training images, for instance, relevant areas of a face. A feature vector would then contain the pixel arrays of these regions. If the aNN has to recognize only one person, it is a binary classification problem, and the output is for instance 1 if a certain person has been positively identified by the network, otherwise 0. If several persons have to be recognized, it is a multiclass classification problem. Most ML schemes output the probability that an input belongs to a certain class. They can provide good approximations for very complex problems. Multilayer perceptrons tend to be slow and are prone to overfitting. Overfitting can happen if the aNN becomes excessively complicated and starts to capture noise instead of the underlying relationships. An overfitted model can exaggerate minor fluctuations, which results in poor predictive performance. There are many other types of aNNs such as recurrent networks, self-organizing maps, and radial basis function networks, to name a few.

All ML approaches require a way to measure the performance of the model. To get statistically relevant results, a large amount of test data is required. This is often not practical since, for instance for face recognition, there might not be enough pictures of a person from various angles or at different ages. Shallow learners work well for approximating first- or second-order functions, but not for modelling complex higher-order features. As a result, they are not well suited for hierarchical feature extraction as required by object recognition using pixel intensities of an image. The generated models would become very complex, and a large body of training data would be required to avoid overfitting. If there is not enough data, the models are unlikely to generalize well.

aNNs are not the only non-linear learning schemes. Kernel-based methods such as SVM have been used to solve many non-linear problems. A kernel is essentially a similarity function with certain mathematical properties, and it is possible to define kernel functions over all sorts of structures - for example, sets, strings, trees, and probability distributions [16]. However, kernel methods do not allow the learner to generalize far from the training data samples. Also, there is the danger that during training gradient descent remains in a local minimum of the loss function. Automatic feature extraction is often not easily achievable. A common approach is to extract features manually [17]. A key advantage of DL is the capability of automatically extracting good features using general-purpose learning procedures. The main benefit of ConvNets for many such tasks is that the entire system is trained end to end, from raw pixels to ultimate categories, thereby alleviating the requirement to manually design a suitable feature extractor [23].

B. Deep artificial Neural Network models

DL is characterized by multiple layers of nonlinear processing units. A deep-learning architecture is a multilayer stack of simple modules, all (or most) of which are subject to learning, and many of which compute non-linear input–output mappings [15]. Each layer represents features at a higher level, thus forming a hierarchical representation. As with shallow aNNs, the layers are not programmed by engineers but are trained from data using a learning scheme. They can have
hundreds of millions of weighted connections. Other ML models can be extended to perform deep learning; however, the vast majority are aNNs.

An important property of these models is that they can extract complex statistical dependencies from data and efficiently learn high-level representations by re-using and combining intermediate concepts, allowing these models to generalize well across a wide variety of tasks [12]. For instance, when a DL scheme is fed an image, the first layer extracts low-level features such as edges, corners, and gradients. The second layer uses this information to identify shapes. Finally, based on the location of the shapes, the learner classifies the objects in the image. DLs are at the same time sensitive to minute details such as face traits in pictures of the same person and insensitive to large variations such as the picture's background, lighting depending on the time of the day or other objects in the picture.

Deep learning is a form of the fundamental credit assignment problem [20]. Learning, or credit assignment, means adjusting the weights that make the neural network exhibit the desired behavior. To contrast deep with shallow algorithms, the number of parameterized transformations a signal encounters between the input and the output layer can be used. There is no agreed-upon minimum number of credit assignments for an aNN to be called a deep aNN. Also, in a recurrent aNN a signal might traverse a layer more than once. Generally speaking, deep learning means assigning credit across many stages.

1) Convolutional neural networks

As mentioned before, there are many different types of deep learners. A type of deep learner called convolutional neural network (ConvNet) is much easier to train and generalizes very well compared to networks that are fully connected between adjacent layers [15]. Their capacity can be controlled by varying their depth and breadth, and they also make strong and mostly correct assumptions about the nature of images (namely, stationarity of statistics and locality of pixel dependencies) [22]. A ConvNet is a backpropagation network and typically goes through several stages. The convolutional stage detects local conjunctions of features; a pooling stage merges semantically similar features into one. The pooling stage is usually followed by fully-connected layers [15], [18], [24]. Instead of the more conventional sigmoid function, often the rectified linear unit (ReLU) is used. ReLU is simply the half-wave rectifier:

f(z) = max(z, 0)    (2)

where z is the input to the neuron [15]. ReLU is biologically more plausible than the sigmoid curve. Also, ReLU learns faster in multilayer networks than, for example, the hyperbolic tangent function. In the backpropagation step, the error at each hidden layer is computed as a weighted sum of the error derivatives of the layer above.

ConvNets exploit spatially-local correlations by enforcing a local connectivity pattern between adjacent layers. Each unit in the first hidden layer is connected to only a subset of the input layer. ConvNets replicate these units across the entire visual field. The replicated units share the same weight and bias and form a feature map. A feature map is a low-dimensional, discretized representation of the input image. This allows features to be detected regardless of where they are found in the visual field. As shown in Fig. 4, the units in the hidden layer only receive signals from three input units, and the connections with the same color share the same weights. Units in the next layer also receive signals only from three units in the first hidden layer.

Fig. 4. ConvNet with a feature map

The feature map is created by convolution of the input image across sub-regions and then applying a non-linear function such as the ReLU or hyperbolic tangent. Each hidden layer consists of several feature maps to create a rich representation of the input data. This architecture ensures that a strong signal is produced for a specific spatially local input pattern. The shared weights increase computing efficiency since fewer free parameters need to be learned, and the constraints enable ConvNets to produce better generalizations on visual mining problems.

Many natural signals are hierarchies composed of lower-level features. For instance, images are composed of edges that form motifs, motifs assemble into parts, and parts create objects. Similarly, phonemes form meaningful morphemes, words comprise one or more morphemes and create sentences in natural language. Deep neural networks take advantage of this hierarchical structure. In each layer, they produce higher-level representations based on the input from the layer before. The output of each layer is a feature map representing one learned feature, detected at each of the picture's positions. Deep-learning methods are representation-learning methods with multiple levels of representation, obtained by composing simple but non-linear modules that each transform the representation at one level (starting with the raw input) into a representation at a higher, slightly more abstract level [15]. Over the last few years it has been convincingly shown that CNNs can produce a rich representation of the input image by embedding it into a fixed-length vector, such that this representation can be used for a variety of vision tasks [8].

2) Recurrent neural networks

Recurrent neural networks (RNNs) are a class of aNN that contain directed cycles between units. RNNs stand out from other machine learning methods for their ability to learn and carry out complicated transformations of data over extended periods of time [21]. For tasks that involve sequential inputs, such as speech and language, it is often better to use RNNs [15]. RNNs maintain an inner state exhibiting dynamic temporal behavior. RNNs process an input sequence one element at a time, maintaining
in their hidden units a ‘state vector’ that implicitly contains information about the history of all the past elements of the sequence [15]. This is useful, for instance, for NLP, where the RNN reads a sentence one word at a time and at the end encodes, in the final states of its hidden layers, the thought expressed in the sentence. This can be used for automatic translation between languages or for interpreting images in natural language [7], [8]. For instance, the RNN can take the high-level representation of an image extracted by a ConvNet as input. The RNN is given focus on different locations in the image to generate a caption describing the image. Fig. 5 shows an RNN with one recurrent hidden layer.

Fig. 5. Recurrent neural network with one hidden layer

A hidden layer with units grouped under node S gets inputs from other neurons at previous time steps. The input sequence a with elements a_t is mapped to an output sequence o, with each o_t depending on all the previous a_t. The same matrices U, V, W are used at each time step.

Like ConvNets, RNNs are backpropagation networks. Backpropagation can be directly applied to the graph with respect to all the states s_t of node S and all the parameters. There are many other possible architectures. To store information for a long time, memory extensions to RNNs have been proposed [21], [22]. The long-term memory can be used to make predictions based on past events. It acts as a dynamic knowledge base that the network is trained to operate, for instance, to find associations [21] or to answer questions [22].

ConvNets as well as many other aNNs accept only fixed-size vectors as inputs, such as images (pixel arrays), and produce fixed-size outputs such as probabilities of classes. In contrast, RNNs operate on input sequences of vectors and produce sequences of outputs, which makes them especially suited for synthesizing natural language. Also, contrary to ConvNets, RNNs do not perform the mapping using a fixed number of computation steps. This makes RNNs more flexible for mining problems that require a variable number of computing steps, as required for processing sequential input data.

ConvNets and RNNs have been combined for multimedia data mining in recent research. Model combination nearly always improves the performance of machine learning methods [19]. Combining ConvNets and recurrent neural networks for automatic image caption generation has shown stunning results [8]. It exploits the strength of ConvNets for multimedia mining and of RNNs for NLP. First the ConvNet analyses images using object [7], [8] or face recognition [24], then the vectorial representation of the images is passed to an RNN for NLP, e.g. for image caption generation [7], [8]. Fig. 6 shows the combination of a ConvNet with an RNN for automatic caption generation.

Fig. 6. Vision ConvNet combined with an RNN for caption generation

The combination of ConvNets and RNNs has greatly expanded the applications of DL for multimedia mining. It can be used for automated image descriptions and person identification, and it is to be expected that similar architectures will be used not only for still image analysis but also for movies, e.g. for movie classification.

III. CHALLENGES

The main disadvantage of ConvNets is their ravenous appetite for labeled training samples [23]. For instance, if the ConvNet is trained for object recognition, it has to be fed with images of all the objects it has to recognize, in all their variants and from different angles. A workaround is to automatically create synthetic training images. If there is not enough training data, a solution is to generate more training examples by deforming the existing ones [15].

Many images display not only still objects but also actions, for instance, two women playing tennis. A DL scheme needs to recognize not only “two women on a tennis court” but also the action of playing tennis. Without describing the activities in images, automatic caption generation lacks descriptiveness since it does not describe what is going on in an image.

Recognizing the activities and describing them using verbs remains challenging for computers. The brains of living organisms still vastly outperform the best computer-based neural networks in pattern recognition and perception.

As with shallow learners, overfitting is a serious problem, especially if the learner gets large and complicated, and there is not enough training data. Also, large networks are slow to use, which makes it impractical to counter overfitting by combining the predictions of many separately trained networks. A technique called dropout has been proposed to address this problem. The key idea is to randomly drop units (along with their connections) from the neural network during training [19]. This significantly alleviated overfitting.

A fundamental credit assignment problem is to determine which components contribute to the performance of a DL
scheme. Often this is not easily determined, and understanding and optimizing DL schemes can be a challenge.

Privacy-preserving data mining remains a challenge, especially with the advances in multimedia data mining. Persons can be easily identified in images, which is particularly problematic since the large-scale surveillance of the Internet by governments and secret services has been made public. Also, record linkage, which can identify individuals across databases, has exacerbated the problem.

IV. DISCUSSION

DL has been successfully applied to many multimedia mining problems, and new architectures and techniques keep appearing. Attention-based models using ConvNets have been proposed. Humans have the capability to pay attention to the salient regions in an image. This is called attention control and refers to the capability of humans and animals to choose what they pay attention to and what they ignore. Rather than compress an entire image into a static representation, attention allows for salient features to dynamically come to the forefront as needed [7]. DL has also been used for classification problems such as sentiment classification [25], automatic image quality assessment [26] and speech recognition [13]. It is to be expected that DL will be applied to other areas such as semantic analysis and automatic translation.

Several advances contributed to the recent successes of deep learners. Deep learners are less affected by certain problems found in shallow learners. For instance, a shallow aNN might get caught in a local minimum if the loss function has several minima, or it might overshoot and oscillate wildly [16]. In practice, poor local minima are rarely a problem with large networks [15]. The minima reached are usually of similar quality.

Also, the efficient use of GPUs and parallelization algorithms has contributed to the big advances in deep learning [27]. High-end consumer personal computers equipped with general-purpose graphics processing units (GPGPUs) have allowed a wider range of users to conduct brute-force numerical computation with large datasets [13].

Data democratization, the Internet of things (IoT), and the confluence of Big Data and data streams greatly increase the amounts of data available, and techniques to efficiently process them are needed. DL methods are well suited for many of these data analysis problems, and they provide interesting research problems. Hence, more studies on DL are expected to appear shortly.

V. CONCLUSIONS

The big advances in multimedia data mining in the past years have multiplied the applications of DL. DL has proven to be suitable for problems where shallow learners did not provide satisfactory results. Deep learners were particularly successful in problems that in the past proved to be very difficult in Artificial Intelligence research, such as object recognition and descriptive language generation. New trends in data mining such as data mining for social good have been emerging. Areas such as smart healthcare, social sensing, smart cities and open government data mining will further widen the application field of DL for multimedia mining. They have the potential to make human lives healthier and more efficient and thus attract a lot of attention from academia, governments, and industry.

The combination of ConvNets and RNNs has yielded very promising results in many domains. However, one drawback is that these methods mostly use supervised approaches, where large corpora of labeled training data are needed. Human and animal learning, by contrast, is largely unsupervised: we learn mostly from observation, not from labeled objects. While some studies have used unsupervised DL methods, we expect to see more research in this area in the near future.

Ultimately, it is expected that artificial intelligence will go beyond simple descriptions and will be able to understand whole documents and even to reason.

VI. REFERENCES

[1] C. Xinyu, C. Youngwoon, and J. Suk young, “Crime prediction using Twitter sentiment and weather,” Systems and Information Engineering Design Symposium (SIEDS), pp. 63-68, 2015.
[2] T. Liu, X. Ding, Y. Chen, H. Chen, and M. Guo, “Predicting movie Box-office revenues by exploiting large-scale social media content,” Multimedia Tools and Applications, pp. 1-20, 2014.
[3] H. V. Jagadish, J. Gehrke, A. Labrinidis, Y. Papakonstantinou, J. M. Patel, R. Ramakrishnan, and C. Shahabi, “Big data and its technical challenges,” Commun. ACM, vol. 57, no. 7, pp. 86-94, 2014.
[4] P. Wlodarczak, J. Soar, and M. Ally, “Genome Mining Using Machine Learning Techniques,” Inclusive Smart Cities and e-Health, Lecture Notes in Computer Science, A. Geissbühler, J. Demongeot, M. Mokhtari, B. Abdulrazak and H. Aloulou, eds., pp. 379-384, Springer International Publishing, 2015.
[5] J. O. Chan, “An Architecture for Big Data Analytics,” Communications of the IIMA, vol. 13, no. 2, pp. 1-13, 2013.
[6] A. Twinkle and S. Paul, “Addressing Big Data with Hadoop,” International Journal of Computer Science and Mobile Computing, vol. 3, no. 2, pp. 459-462, 2014.
[7] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhutdinov, R. Zemel, and Y. Bengio, “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention,” Proceedings of the 32nd International Conference on Machine Learning, vol. 37, 2015.
[8] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, “Show and Tell: A Neural Image Caption Generator,” Google, 2015.
[9] R. Salakhutdinov, “Deep learning,” in Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York, New York, USA, 2014, p. 1973.
[10] P. Wlodarczak, M. Ally, and J. Soar, “Data Process and Analysis Technologies of Big Data,” Networking for Big Data, Chapman & Hall/CRC Big Data Series, pp. 103-119, Chapman and Hall/CRC, 2015.
[11] S. Zhou, Q. Chen, and X. Wang, “Active deep learning method for semi-supervised sentiment classification,” Neurocomputing, vol. 120, pp. 536-546, 2013.
[12] R. Salakhutdinov, “Deep learning,” in Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York, New York, USA, 2014, p. 1973.
[13] K. Noda, Y. Yamaguchi, K. Nakadai, H. Okuno, and T. Ogata, “Audio-visual speech recognition using deep learning,” Applied Intelligence, vol. 42, no. 4, pp. 722-737, 2015.
[14] L. Deng, G. Hinton, and B. Kingsbury, “New types of deep neural network learning for speech recognition and related applications: an overview,” IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4, 2013.
[15] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436-444, 2015.

[16] I. H. Witten, E. Frank, and M. A. Hall, Data Mining, 3rd ed., Burlington,
MA, USA: Elsevier, 2011.
[17] A. Bordes, J. Weston, and S. Chopra, “Question Answering with
Subgraph Embeddings,” Proc. Empirical Methods in Natural Language
Processing, 2014.
[18] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet Classification
with Deep Convolutional Neural Networks,” Advances in Neural Information
Processing Systems 25, pp. 1097-1105, 2012.
[19] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R.
Salakhutdinov, “Dropout: a simple way to prevent neural networks from
overfitting,” J. Mach. Learn. Res., vol. 15, no. 1, pp. 1929-1958, 2014.
[20] M. Minsky, “Steps toward Artificial Intelligence,” Proceedings of the
IRE, vol. 49, no. 1, pp. 8-30, 1961.
[21] A. Graves, G. Wayne, and I. Danihelka, “Neural Turing Machines,”
2014.
[22] J. Weston, S. Chopra, and A. Bordes, “Memory Networks,” Proceedings
of the 3rd International Conference on Learning Representations, San
Diego, 2015.
[23] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun,
“OverFeat: Integrated Recognition, Localization and Detection using
Convolutional Networks,” arXiv preprint arXiv:1312.6229, 2013.
[24] Y. Sun, X. Wang, and X. Tang, “Deep Learning Face Representation
from Predicting 10,000 Classes,” pp. 1891-1898.
[25] S. Zhou, Q. Chen, and X. Wang, “Active deep learning method for
semi-supervised sentiment classification,” Neurocomputing, vol. 120,
pp. 536-546, 2013.
[26] W. Hou, X. Gao, D. Tao, and X. Li, “Blind Image Quality
Assessment via Deep Learning,” IEEE Transactions on Neural Networks
and Learning Systems, vol. 26, no. 6, pp. 1275-1286, 2015.
[27] N. Lopes, and B. Ribeiro, “Towards adaptive learning with improved
convergence of deep belief networks on graphics processing units,”
Pattern Recognition, vol. 47, no. 1, pp. 114-127, 2014.
[28] V. Mayer-Schönberger, and K. Cukier, Big Data: A Revolution That
Will Transform How We Live, Work, and Think, New York, USA:
Houghton Mifflin Harcourt Publishing Company, 2013.
[29] M. Turk, and A. Pentland, “Eigenfaces for recognition,” J. Cognitive
Neuroscience, vol. 3, no. 1, pp. 71-86, 1991.