Icdipc 2015 7323027
Mustafa Ally
Faculty of Business, Education, Law and Arts
University of Southern Queensland
Toowoomba, Australia
Abstract—Due to the large amounts of multimedia data on the Internet, multimedia mining has become a very active area of research. Multimedia mining is a form of data mining. Data mining uses algorithms to segment data, to identify useful patterns and to make predictions. Despite its successes in many areas, data mining remains a challenging task. In the past, multimedia mining was one of the fields where the results were often not satisfactory. Multimedia data mining extracts relevant data from multimedia files such as audio, video and still images to perform similarity searches, identify associations, resolve entities and classify data. As the mining techniques have matured, new techniques were developed. A lot of progress has been made in areas such as visual data mining and natural language processing using deep learning techniques. Deep learning is a branch of machine learning and has been used, among others, on smartphones for face recognition and voice commands. Deep learners are a type of artificial neural network with multiple data processing layers that learn representations by increasing the level of abstraction from one layer to the next. These methods have improved the state of the art in multimedia mining, in speech recognition, visual object recognition, natural language processing and other areas such as genome mining and predicting the efficacy of drug molecules. This paper describes some of the deep learning techniques that have been used in recent research for multimedia data mining.

Keywords—data mining; multimedia data mining; deep learning; artificial neural networks; natural language processing; visual data mining

I. INTRODUCTION

With the unprecedented amount of data being collected and stored today, and its availability on the Internet, Data Mining (DM) has gained greater significance for public and private organizations. Not surprisingly, a lot of research has been conducted in this area. Only the development of new technologies, collectively called Big Data, has made it possible to process and analyze these large volumes of data efficiently. DM and Big Data are often used synonymously. DM is the analytical process of exploring data to detect patterns and relationships between variables in data sets. DM is used, among others, to analyze crime patterns [1], to predict consumer behavior [2], for fraud detection [3], for personalized medical treatments [4], to analyze seismic activity to detect new oil sources [5] and to analyze sensor data to predict material failures [6], to name a few. Data mining is a multidisciplinary field that draws on statistics, databases, machine learning, artificial intelligence, information retrieval and visualization [10].

DM has been adopted in virtually every domain, and new trends have been emerging. The Internet gave access to many different data sources, and data is stored in various formats. The different sources often have interfaces that are not compatible with each other. Distributed Data Mining (DDM) uses highly sophisticated algorithms to analyze data from these heterogeneous data sources. Spatial and geographic DM analyzes geographical, environmental and astronomical data to measure distances and topology for navigation and Geographic Information Systems (GIS). Ubiquitous DM taps into mobile devices to study human behavior and human-machine interaction. Time series and sequential DM analyzes seasonal and cyclical trends to assess customer behavior and buying patterns. In recent years, significant advances have been made in Multimedia DM (MDM). As the name suggests, MDM extracts useful information from multimedia data such as video, audio and image files. MDM has various applications. It can be used for face recognition, audio-visual speech recognition, entity resolution and record disambiguation. MDM has also been used for image tagging [7], [8]. Tagging means assigning a keyword or term to a piece of information such as a hyperlink or image file, with the goal of describing the item. For instance, an image caption is a tag. Tagging has become a standard mechanism on the Internet for annotating multimedia data, and search engines rely on tags to retrieve multimedia data [28]. Image caption generation is the process of generating a descriptive sentence for an image, a task that humans can do with striking ease, but that poses a major challenge for a machine. Not only must caption generation models be able to solve the computer vision challenges of determining what objects are in an image, but they must also be powerful enough to capture and express their relationships in
Fig. 5. Recurrent neural network with one hidden layer

A hidden layer with units grouped under node S gets inputs from other neurons at previous time steps. The input sequence a with elements a_t is mapped to an output sequence o, with each o_t depending on all the previous a_t. The same matrices U, V, W are used at each time step.

Like ConvNets, RNNs are backpropagation networks. Backpropagation can be directly applied to the graph with respect to all the states s_t of node S and all the parameters.

There are many other possible architectures. To store information for a long time, memory extensions to RNNs have been proposed [21], [22]. The long-term memory can be used to make predictions based on past events. It acts as a dynamic knowledge base that the network is trained to operate, for instance, to find associations [21] or to answer questions [22].

ConvNets as well as many other aNNs accept only fixed-sized vectors as inputs, such as images (pixel arrays), and produce fixed-size outputs, such as class probabilities. In contrast, RNNs operate on input sequences of vectors and produce sequences of outputs, which makes them especially suited for synthesizing natural language. Also, contrary to ConvNets, RNNs do not perform the mapping using a fixed number of operating steps. This makes RNNs more flexible for mining problems that require a variable number of computing steps, as required for processing sequential input data.

ConvNets and RNNs have been combined for multimedia data mining in recent research. Model combination nearly always improves the performance of machine learning methods [19]. Combining ConvNets and recurrent neural networks for automatic image caption generation has shown stunning results [8]. It exploits the strength of ConvNets for multimedia mining

III. CHALLENGES

The main disadvantage of ConvNets is their ravenous appetite for labeled training samples [23]. For instance, if a ConvNet is trained for object recognition, it has to be fed with images of all the objects it has to recognize, in all their variants and from different angles. A workaround is to automatically create synthetic training images. If there is not enough training data, one solution is to generate more training examples by deforming the existing ones [15].

Many images display not only still objects but also actions, for instance, two women playing tennis. A DL scheme needs to recognize not only “two women on a tennis court” but also the action of playing tennis. Without describing the activities in images, automatic caption generation lacks descriptiveness, since it does not convey what is going on in an image.

Recognizing activities and describing them using verbs remains challenging for computers. The brains of living organisms still vastly outperform the best computer-based neural networks in pattern recognition and perception.

As with shallow learners, overfitting is a serious problem, especially if the learner gets large and complicated and there is not enough training data. Also, combining the predictions of many non-linear layers can make learning slow. A technique called dropout has been proposed to address this problem. The key idea is to randomly drop units (along with their connections) from the neural network during training [19]. This significantly reduced training time and alleviated overfitting.

A fundamental credit assignment problem is to determine which components contribute to the performance of a DL
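The dropout technique discussed under Challenges (randomly dropping units, along with their connections, during training [19]) can be illustrated with a minimal sketch. The function name dropout, the parameter drop_prob and the toy activations are hypothetical and chosen only for illustration. Rescaling the surviving units by 1 / (1 - drop_prob), often called inverted dropout, is a common implementation convention assumed here; the text itself does not specify it.

```python
import random


def dropout(activations, drop_prob, rng, training=True):
    """Apply dropout to a list of unit activations (illustrative sketch).

    Assumption: surviving units are rescaled by 1 / (1 - drop_prob)
    ("inverted dropout") so that no rescaling is needed at test time.
    """
    if not 0.0 <= drop_prob < 1.0:
        raise ValueError("drop_prob must be in [0, 1)")
    if not training or drop_prob == 0.0:
        # At test time every unit is kept and passed through unchanged.
        return list(activations)
    keep_prob = 1.0 - drop_prob
    # Each unit is kept independently with probability keep_prob. A dropped
    # unit outputs 0.0, which also silences its outgoing connections.
    return [a / keep_prob if rng.random() < keep_prob else 0.0
            for a in activations]


rng = random.Random(0)  # fixed seed so repeated runs give the same mask
hidden = [0.5, -1.2, 0.8, 2.0, -0.3, 1.1]  # toy hidden-layer activations
print(dropout(hidden, drop_prob=0.5, rng=rng))
print(dropout(hidden, drop_prob=0.5, rng=rng, training=False))
```

Because a fresh random mask is drawn for each training case, every case effectively trains a different thinned network, which is one way to read the connection between dropout and model combination.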
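The recurrent network of Fig. 5 can likewise be sketched in a few lines. The text states only that the same matrices U, V, W are reused at each time step. The roles assigned to them here (U: input to hidden, W: previous hidden state to hidden, V: hidden to output), the tanh activation and all numeric values are assumptions made for illustration, and the helper names matvec and rnn_forward are hypothetical.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def rnn_forward(inputs, U, W, V, s0):
    """Run a one-hidden-layer RNN over an input sequence.

    Assumed roles (not spelled out in the text): U maps input to hidden,
    W maps the previous hidden state to hidden, V maps hidden to output.
    """
    state = list(s0)
    states, outputs = [], []
    for a_t in inputs:
        # s_t = tanh(U a_t + W s_{t-1}); the same U and W are used each step.
        pre = [u + w for u, w in zip(matvec(U, a_t), matvec(W, state))]
        state = [math.tanh(p) for p in pre]
        states.append(state)
        # o_t = V s_t; it depends on all earlier inputs through the state.
        outputs.append(matvec(V, state))
    return states, outputs


# Toy sizes: 2-dimensional inputs, 3 hidden units, 1 output per time step.
U = [[0.5, -0.2], [0.1, 0.4], [-0.3, 0.2]]
W = [[0.1, 0.0, 0.2], [0.0, 0.3, -0.1], [0.2, -0.2, 0.1]]
V = [[1.0, -1.0, 0.5]]
s0 = [0.0, 0.0, 0.0]

short_seq = [[1.0, 0.0], [0.0, 1.0]]
long_seq = short_seq + [[1.0, 1.0], [0.5, -0.5]]
for seq in (short_seq, long_seq):
    _, outs = rnn_forward(seq, U, W, V, s0)
    print(len(seq), "inputs ->", len(outs), "outputs")
```

The same parameters handle both sequence lengths, which mirrors the contrast drawn in the text between RNNs and networks that accept only fixed-size inputs.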