Cloud-based Blur and Illumination Invariant Object Classification

Muhammad Usman Yaseen


Department of Computing and Mathematics
University of Derby
Derby, UK
m.yaseen@derby.ac.uk

Abstract-- The recent rise in multimedia technology has made it easier to perform various monitoring tasks. These monitoring tasks are supported by various cheap cameras producing large amounts of video data. This video data is then processed for object classification to extract useful information. However, the video data obtained by these cheap cameras is often of low quality, resulting in blurred video content. Moreover, various illumination effects caused by lighting conditions also degrade the video quality. These effects present severe challenges for the object classification process. We present a cloud-based blur and illumination invariant approach for object classification from image and video data. The bi-dimensional empirical mode decomposition (BEMD) has been adopted to decompose a video frame into intrinsic mode functions (IMFs). These IMFs then undergo a first-order Riesz transform to generate monogenic video frames. The analysis of each IMF is carried out by observing its local properties (amplitude, phase and orientation) generated from each monogenic video frame. We propose a stack based hierarchy of local pattern features generated from the amplitudes of each IMF, which results in blur and illumination invariant object classification. Extensive experimentation on video streams as well as publicly available image datasets reveals that our system achieves high accuracy and outperforms state of the art techniques under uncontrolled conditions. The system also proved to be scalable with high throughput when tested on a number of video streams using cloud infrastructure.

Keywords: Empirical Mode Decomposition; Local Ternary Patterns; Riesz Transform; Amplitude Spectrum
I. INTRODUCTION
With the advancements in multimedia technology, it is now becoming easier to monitor critical events. State of the art monitoring cameras placed at various locations of a locality are capable of recording and storing various events. The video data generated by these monitoring cameras is then processed, manually and automatically, to extract useful information such as person identification and tracking. However, most of the cameras used for monitoring purposes are cheap and of low quality because of budget constraints. The video streams generated by these cameras thus often contain blurred frames due to motion, defocus or atmospheric turbulence. Moreover, since these cameras work under uncontrolled lighting conditions, they are also prone to illumination effects. The rotation angle of the objects being monitored is another challenge posed by these cameras.
The accuracy of any object classification system depends heavily on these challenges, and a good countermeasure to them can lead to highly accurate results. However, video de-blurring and de-illumination are resource and time consuming tasks and often introduce new artifacts [1]. It is therefore desirable to perform classification with a procedure that is invariant to blur and illumination. Various approaches have been proposed in the past to tackle these problems. The most prominent among them are built on top of blur-insensitive moments [2], color constancy [3] and Fourier phase. But these approaches are designed to perform classification globally and do not take into account the local properties of objects. Various methods based on magnitude, phase and spectral information [4] have also been designed, but these methods focused on texture analysis, and blur and illumination invariance were not considered as design criteria.
One of the most successful approaches to address these challenges is to analyze image or video frames by shifting them from the spatial domain to the spatial-frequency domain. In the spatial domain, processing of video frames is performed directly on the gray values of pixels. The spatial-frequency domain allows the processing of video frames by projecting them on a set of basis functions which are defined by the method itself. This expands the video frame into frequency components with both high and low magnitudes. A variety of methods exist for the conversion of a spatial domain signal to the spatial-frequency domain, including the Fourier Transform [5], the Wavelet Transform [6], the Wigner distribution [7] and many others.
Huang et al. [8] proposed a method known as Empirical Mode Decomposition (EMD), which nowadays receives much attention for image analysis. EMD expands a signal into its frequency components adaptively. These frequency components are termed Intrinsic Mode Functions (IMFs). EMD extracts the highest frequency components from the original input signal in each mode: it separates the locally highest frequencies and stores them in an IMF. The remaining IMFs contain the lower frequencies in decreasing order, ending in a residual part. In order to apply empirical mode decomposition to images, two dimensional empirical mode decomposition (2DEMD), or bi-dimensional empirical mode decomposition (BEMD), was introduced [9].
In this paper we use bi-dimensional empirical mode decomposition to decompose a video frame into Intrinsic Mode Functions. The video frame is first extracted from the video stream and then BEMD is applied to decompose it into IMFs. BEMD provides several advantages over spatial domain analysis: features can be extracted easily according to the distribution of local phase or energy. Each IMF is analyzed independently and in parallel using the Riesz transform to extract the local properties (amplitude, orientation and phase) of the video frame. These local properties are further examined to perform the classification task using local pattern features.
The main contributions of this paper are as follows. Firstly, we pioneer the use of EMD on video streams in a cloud based distributed environment. We show that only the first three IMFs are sufficient to perform classification under challenging conditions with a high accuracy rate. This is advantageous in two ways: (i) reduced feature extraction time as compared to other methods, and (ii) illumination and blur invariance, since only the lower IMFs are sensitive to such effects. Secondly, we utilize the amplitude property derived from each IMF using the first order Riesz transform and show that it provides better results than the phase or orientation properties. Thirdly, we propose a stack based hierarchy of local pattern features generated from the amplitude of each IMF for highly accurate classification.
The rest of the paper is organized as follows: Section II reviews the state of the art techniques that have been used recently for object classification. Section III explains the approach of our object classification system. The implementation details and experimental setup are described in Sections IV and V respectively. The experimental results and their analysis are detailed in Section VI. Section VII concludes the paper with a glimpse of the future work.

II. RELATED WORK
Object classification has been the focus of many researchers for the last several decades. However, classifying objects from video streams under uncontrolled conditions poses more challenges and is now attracting much attention from the research community. We provide a brief review of the recent approaches proposed in this domain and identify the research gap.
A number of authors have used color information to develop blur and illumination robust descriptors. A robust descriptor has been proposed by Van de Weijer and Schmid [3]: they used ratios of image derivatives to develop a descriptor of color constant ratios that is invariant to blur and illuminant color. A similar kind of approach based on color information was exploited by Swain and Ballard [10], who made use of color histograms to perform recognition of objects. Another approach based on color histograms was proposed by Funt and Finlayson [11], who utilized color constant derivatives to represent an object for recognition. However, these approaches do not perform well against variations in the illuminant's color, especially with a change in camera viewpoint or object orientation, and are also dependent on the lighting geometry.
Moment invariants have also been used in the past as blur invariant features. Flusser et al. [12] pioneered the use of moment invariants developed on top of geometric moments; they also utilized central moments to provide invariance to translation. Moment invariants have also been part of various applications such as template matching [13], recognition of defocused objects and X-ray imaging. Complex moments were also proposed by [14] for blur, rotation and scale invariance. Despite their wide usage in various applications, moment invariants remain sensitive to noise and background clutter.
Invariants based on the phase of the frequency spectrum obtained by the Fourier transform are used by [15]; these invariants are also insensitive to shifts of the image. Ojansivu et al. [16] proposed a centrally symmetric blur invariant descriptor based on the phase-only spectrum of an image; the phase-only spectrum was normalized so that it became insensitive to linear brightness changes as well. A similar kind of approach was adopted in [17], in which phase information was calculated within a local window for every image position. The quantization of the phase of the discrete Fourier transform and the de-correlation of low frequency components were performed in an eight dimensional subspace, and a histogram of the resulting features was used for the classification of blurred texture images. However, these invariants are limited to image shifts, and invariance to translation has not been a focus of phase based frequency spectrum invariants. We propose a blur and illumination invariant feature descriptor which provides invariance for higher blur radii and high PSNR values. Interestingly, our feature descriptor also provides good results for sharp video frames that are not blurred.
Figure 1: Video Analysis Approach
III. VIDEO ANALYSIS APPROACH
We present the approach and architecture of our object classification system in this section. The video streams are first acquired by the video capturing sources and are then stored in the cloud storage. The cloud manager fetches these video streams from the cloud storage and distributes them among the cloud nodes; it is solely in charge of allocating video streams to each cloud node and manages the workload distribution among the nodes. The video streams are decoded to extract individual video frames. These video frames are artificially blurred with varying radius, and noise has also been added to the objects with different PSNR values. These objects are then classified by the blur and illumination invariant feature descriptor. Figure 1 shows the approach of our blur and illumination invariant object classification system.

A. Decomposition of Video Frame
Each decoded video frame is decomposed into its frequency components through Two Dimensional Empirical Mode Decomposition [9], which is a fully unsupervised approach. It defines its basis functions directly from the data and is not dependent on other methods. 2DEMD expands a video frame into its frequency components adaptively. These frequency components are termed Intrinsic Mode Functions (IMFs) and are defined by the video frame itself. The sifting process [18] is used to extract these IMFs from the video frame. This algorithm extracts the highest frequency components from the original input video frame in each mode: it separates the locally highest frequencies and stores them in an IMF. The rest of the IMFs contain the remaining frequencies in decreasing order, ending in a residue which contains the lowest frequencies. By combining all the IMFs and the residue, the original signal can be obtained. EMD thus allows visualizing the spatial-frequency characteristics of a signal by expanding the parent signal into IMFs.

B. Amplitude, Phase and Orientation Spectrum
The Riesz Transform [19] is then applied on the IMFs to obtain monogenic data, which helps to study the local properties of video frames. Monogenic data is a local quantitative and qualitative measure of the video frame. Local amplitude, phase and direction can be calculated from the monogenic signal of each IMF.

C. Local Pattern Features
Local Ternary Pattern (LTP) features [20] are an extension of LBP. In LTP, the difference between the central pixel and its neighboring pixels is encoded into a ternary code. This ternary code is further split into a positive LBP and a negative LBP to reduce the dimensionality. Encoding the pixel difference into a separate state makes it more robust to noise. A pattern histogram, as in LBP, is then created using the ternary values of the neighboring pixels. LTP features are more robust to noise than LBP, but a loss of information occurs when splitting into positive and negative LBPs, and redundant information resides in the histograms of both LBPs as they are strongly correlated. We have calculated the local ternary pattern features from the amplitudes of each IMF and then arranged them in a hierarchical fashion to generate the feature vector.

D. Stack based Hierarchy of Features
We propose a stack based hierarchical approach of the local ternary patterns to represent the input video frames as a feature descriptor. The local ternary patterns generated from the amplitude spectrum of each IMF are stacked together in a hierarchical fashion such that the pixels of each IMF are adjacent to each other. All the adjacent pixels are then summed together to form an integrated local ternary pattern of the whole video frame. This serves as the feature vector of the whole video frame.

E. Similarity Measure
The procedure of LTP histogram generation is performed for all the video frames and for the image which is to be matched. Matching is performed by comparing the LTP histogram of the marked object frame with all the frames of the video stream. The histogram intersection is used as a distance measure to calculate the similarity between two frames; a sketch of this matching step is given below. After the person's face is authenticated correctly, the matching score associated with it is stored in a database as depicted in Figure 2.

Figure 2: Workflow of the proposed system
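For illustration, a minimal Python sketch of this matching step follows; the function names are ours, and we compute the intersection on L1-normalized histograms so that a larger score means more similar frames:

```python
import numpy as np

def histogram_intersection(h1, h2):
    """Histogram intersection similarity: sum of element-wise minima.

    Both histograms are L1-normalized first, so the score lies in
    [0, 1], with 1 meaning identical distributions.
    """
    h1 = h1 / (h1.sum() + 1e-12)
    h2 = h2 / (h2.sum() + 1e-12)
    return np.minimum(h1, h2).sum()

def best_match(query_hist, frame_hists):
    """Return the index and score of the video frame whose LTP
    histogram is closest to that of the marked object frame."""
    scores = [histogram_intersection(query_hist, h) for h in frame_hists]
    return int(np.argmax(scores)), max(scores)
```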

IV. SYSTEM IMPLEMENTATION

A. Empirical Mode Decomposition
The spatial frequency domain allows the processing of video frames by projecting them on a set of basis functions. For most transforms these basis functions are defined by the method itself; since they are predefined, such methods are unsuitable for non-linear and non-stationary processes. Empirical Mode Decomposition (EMD) does not decompose the signal on an a-priori basis; instead, the basis functions are derived from the data. Hence, this approach is data driven and much more effective on non-stationary signals. EMD considers the data as a superposition of fast oscillations on slow ones. These oscillations are recognized by EMD and a decomposition is made by utilizing these modes as the expansion basis. The resulting frequency components are termed Intrinsic Mode Functions (IMFs) and are defined by the signal itself. The sifting process is the algorithm used to extract these IMFs from the signal. It extracts the highest frequency components from the original input signal in each mode, separating the locally highest frequencies and storing them in an IMF. The rest of the IMFs contain the remaining frequencies in decreasing order, ending in a residue which contains the lowest frequencies. By combining all the IMFs and the residue, the original signal can be obtained. EMD thus allows visualizing the spatial-frequency characteristics of a signal by expanding the parent signal into IMFs, as shown in figure 3.

Figure 3: Averaged extrema surfaces along with their visual representation

The sifting algorithm is defined as follows:

• Extrema Identification
Determine the extrema points (maxima and minima) of the input video frame I(x, y), where I is the 2D video frame with x = 1, ..., M and y = 1, ..., N.

• Envelope Calculation
Connect the extrema points (all the maxima and all the minima respectively) using a radial basis function to form the lower and upper 2D envelopes, denoted by emin(x, y) and emax(x, y).

• Mean Envelope Calculation
Average the two envelopes, i.e. the maxima envelope and the minima envelope, to generate the local mean envelope m1:

m1(x, y) = (emax(x, y) + emin(x, y)) / 2    (1)

• Proto-IMF Generation
Subtract the local mean from the image to generate h1k:

h1k(x, y) = I(x, y) - m1(x, y)    (2)

Repeat the entire process until h1k qualifies as a 2D IMF. The process terminates when the mean envelope is very close to zero. The stopping criterion arbitrates whether the mean envelope is close enough to zero; if it is not, the whole procedure is reiterated. The reiteration is repeated until the stopping criterion is satisfied, and its fulfilment yields the final IMF:

Cl(x, y) = hlk(x, y)    (3)

The residual is defined by subtracting Cl(x, y) from the original input video frame:

Rl(x, y) = I(x, y) - Cl(x, y)    (4)

The next IMF is obtained by repeating the entire procedure on the residual, considering it as the input image:

I(x, y) = Rl(x, y)    (5)

For all the subsequent residuals, the process is repeated to obtain the IMFs in descending order of their frequencies. The procedure normally stops when there are no more extrema points in the residual frame. A video frame can then be expressed as the sum of all IMFs plus a residual:

I(x, y) = RL(x, y) + Σ_{i=1}^{L} Ci(x, y)    (6)
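A compact, unoptimized sketch of this sifting loop is given below, assuming SciPy's RBFInterpolator for the radial basis envelope step; the window size, iteration cap and tolerance are illustrative choices, not the exact values of our implementation:

```python
import numpy as np
from scipy.ndimage import maximum_filter, minimum_filter
from scipy.interpolate import RBFInterpolator

def envelope(frame, find_max=True, win=3):
    """Envelope Calculation step: interpolate a smooth 2D surface
    through the local extrema of `frame` with a radial basis function."""
    filt = maximum_filter if find_max else minimum_filter
    mask = frame == filt(frame, size=win)      # Extrema Identification
    pts = np.argwhere(mask).astype(float)      # extrema coordinates
    vals = frame[mask]
    rows, cols = frame.shape
    grid = np.mgrid[0:rows, 0:cols].reshape(2, -1).T.astype(float)
    # Thin-plate-spline RBF; cubic cost in the number of extrema, so
    # this is strictly an illustration, not a production implementation.
    return RBFInterpolator(pts, vals)(grid).reshape(frame.shape)

def bemd(frame, n_imfs=3, max_sift=10, tol=1e-2):
    """Simplified BEMD: returns (imfs, residual) such that
    frame == sum(imfs) + residual, cf. eq. (6)."""
    residual = frame.astype(float)
    imfs = []
    for _ in range(n_imfs):
        h = residual.copy()
        for _ in range(max_sift):
            m = 0.5 * (envelope(h, True) + envelope(h, False))  # eq. (1)
            h = h - m                                           # eq. (2)
            if np.abs(m).mean() < tol:   # stopping: mean envelope ~ 0
                break
        imfs.append(h)                   # eq. (3): the extracted IMF
        residual = residual - h          # eqs. (4)-(5): sift the residual
    return imfs, residual
```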
B. First Order Riesz Transform
The Hilbert transform is usually applied on IMFs to obtain the instantaneous frequency, which helps to study the local properties. The signals obtained after applying the Hilbert transform are called analytic signals, as they have no negative frequency components; they give access to the local amplitude and phase of a 1D signal. In the case of BEMD, the Riesz transform, the 2D generalization of the Hilbert transform, is used instead. When combined with the image, it produces the monogenic signal, which is a local quantitative and qualitative measure of the image. Local amplitude, phase and direction can be calculated from the monogenic signal of each IMF, as shown in figure 4.

Figure 4: Orientation, Phase and Amplitude of first three IMFs

The Riesz transformed signal in the frequency domain can be expressed as:

FR(v) = (i v / |v|) F(v) = H2(v) F(v)    (7)

where H2 is the transfer function generalizing the Hilbert transform. The corresponding spatial domain representation is given by:

fR(x) = -(x / (2π |x|^3)) * f(x) = h2(x) * f(x)    (8)

The 2D analytic, or monogenic, signal constituted by the original signal and its Riesz transform is then:

fM(x) = f(x) - (i, j) · fR(x)    (9)
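The transform is convenient to realize in the frequency domain as in eq. (7). A minimal NumPy sketch follows; the frequency grid is one standard choice, and amplitude, phase and orientation are derived with the usual monogenic signal formulas:

```python
import numpy as np

def riesz_transform(imf):
    """First-order Riesz transform of a 2D IMF via the frequency domain
    (eq. 7: multiplication by i*v/|v|), returning the local amplitude,
    phase and orientation of the monogenic signal."""
    rows, cols = imf.shape
    u = np.fft.fftfreq(rows)[:, None]       # vertical frequencies
    v = np.fft.fftfreq(cols)[None, :]       # horizontal frequencies
    mag = np.sqrt(u**2 + v**2)
    mag[0, 0] = 1.0                         # avoid division by zero at DC
    F = np.fft.fft2(imf)
    r1 = np.real(np.fft.ifft2(1j * u / mag * F))  # Riesz component (vertical)
    r2 = np.real(np.fft.ifft2(1j * v / mag * F))  # Riesz component (horizontal)
    amplitude = np.sqrt(imf**2 + r1**2 + r2**2)       # local amplitude
    phase = np.arctan2(np.sqrt(r1**2 + r2**2), imf)   # local phase
    orientation = np.arctan2(r2, r1)                  # local orientation
    return amplitude, phase, orientation
```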
C. Local Ternary Pattern Histogram
We propose the use of local ternary patterns to analyze and classify objects from the video streams. LTP is used as a descriptor to extract local features from the intrinsic mode functions. It is an extension of the local binary pattern histogram but is more robust to noise, as it codes the input video frames into ternary patterns instead of binary patterns as in LBP. Local ternary patterns use a threshold value to quantize the pixels into a ternary code: pixels within ±t of the center value are set to zero, pixels above this range are set to positive one (+1) and values below the range are set to negative one (-1).
If k is the threshold constant, c the value of the center pixel and p a neighboring pixel, then the local ternary pattern is given by:

LTP(p) = +1, if p > c + k
          0, if c - k ≤ p ≤ c + k
         -1, if p < c - k

Each thresholded pixel therefore takes one of three values. The thresholded neighboring pixels are grouped into a three-valued pattern called the ternary pattern, and the histogram of these ternary patterns is then computed. Since this histogram has a large range, the ternary pattern is split into two binary patterns. The resulting feature descriptor is the concatenation of the histograms of the two binary patterns, and is hence double the size of LBP.

Figure 5: Generation of LTP and its split-up into positive and negative LBP codes (a 3x3 neighborhood with center pixel 54 thresholded by a range of width ±5, i.e. [49, 59])

The local ternary patterns of the amplitudes of the first three IMFs are then calculated. A visual representation of these LTPs is shown in figure 6.

Figure 6: Local Ternary Patterns of the Amplitudes of first three IMFs
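A vectorized sketch of this encoding over a 3x3 neighborhood, with the illustrative threshold k = 5 of figure 5 (the neighbor ordering is one arbitrary but fixed choice):

```python
import numpy as np

def ltp_code_maps(img, k=5):
    """Encode each pixel's 3x3 neighbourhood as a local ternary pattern
    and split it into positive and negative LBP code maps (cf. figure 5)."""
    img = np.asarray(img, dtype=np.int64)
    H, W = img.shape
    c = img[1:-1, 1:-1]                       # center pixels
    # 8 neighbours in a fixed clockwise order
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    pos = np.zeros_like(c)
    neg = np.zeros_like(c)
    for bit, (dy, dx) in enumerate(offsets):
        p = img[1 + dy:H - 1 + dy, 1 + dx:W - 1 + dx]
        pos |= (p > c + k).astype(np.int64) << bit   # ternary +1 bits
        neg |= (p < c - k).astype(np.int64) << bit   # ternary -1 bits
    return pos, neg

def ltp_histogram(img, k=5):
    """Concatenated 256-bin histograms of the positive and negative
    codes: double the size of an LBP descriptor, as noted above."""
    pos, neg = ltp_code_maps(img, k)
    return np.concatenate([np.bincount(pos.ravel(), minlength=256),
                           np.bincount(neg.ravel(), minlength=256)])
```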
D. Stack based hierarchy of LTPs and Similarity Measure
The local ternary patterns generated from the amplitude spectrum of each IMF are stacked together in a hierarchical fashion such that the pixels of each IMF are adjacent to each other. All the adjacent pixels of each LTP are then summed together to formulate an integrated local ternary pattern of the whole video frame, which serves as its feature vector. The process of integrated LTP histogram generation is executed for all the video frames and for the image which is to be matched. Matching is performed by comparing the integrated LTP histogram of the marked object frame with all the frames of the video stream. The histogram intersection is used as a distance measure to calculate the similarity between two frames. After the person's face is authenticated correctly, the matching score associated with it is stored in a database.
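Putting the pieces together, the sketch below shows one plausible reading of the integrated descriptor, reusing the bemd, riesz_transform and ltp_code_maps sketches above; the exact stacking and summation order of our implementation may differ:

```python
import numpy as np

def integrated_ltp_feature(frame, k=5, n_imfs=3):
    """Stack-based hierarchy sketch: the LTP code maps of the IMF
    amplitudes are stacked so corresponding pixels are adjacent, summed
    pixel-wise into an integrated pattern, and the histogram of that
    pattern is the frame's feature vector."""
    imfs, _ = bemd(frame, n_imfs=n_imfs)
    amps = [riesz_transform(imf)[0] for imf in imfs]   # IMF amplitudes
    pos_stack, neg_stack = zip(*(ltp_code_maps(a, k) for a in amps))
    pos = np.sum(pos_stack, axis=0)   # pixel-wise sum across the stack
    neg = np.sum(neg_stack, axis=0)
    n_bins = 255 * n_imfs + 1         # summed codes range 0..255*n_imfs
    return np.concatenate([np.bincount(pos.ravel(), minlength=n_bins),
                           np.bincount(neg.ravel(), minlength=n_bins)])
```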
V. EXPERIMENTAL SETUP
This section details the experimental setup and the parameters used to evaluate the proposed system. The reported results focus mainly on accuracy (i) with blurred content and (ii) under various illumination conditions; on scalability in terms of (i) video stream decoding time, (ii) video data transfer time to the cloud and (iii) video data analysis time on the cloud nodes; and on the performance of the proposed system.
The proposed system is evaluated on an OpenStack based cloud resource. The cloud consists of six server machines. Each machine has 12 cores, with two 6-core Intel Xeon processors running at 2.4 GHz, 32 GB RAM and 2 TB of storage capacity; altogether the cloud provides 72 processing cores, 192 GB of RAM and 12 TB of storage. The experimental results reported in this paper are obtained from a 4 node cloud deployment. Each cloud node has 4 vCPUs running at 2.4 GHz, 8 GB RAM and 100 GB of data storage.
In order to evaluate the proposed system in the cloud, the Hadoop MapReduce framework [21] is utilized. The empirical mode decomposition is implemented in Java; OpenCV [22] with JNI wrappers for the native C++ library is used as the image/video processing library. Hadoop comes with YARN, which is responsible for resource management and job scheduling. The Hadoop MapReduce framework has a NameNode responsible for load balancing among the nodes, while the Data/Compute Nodes store and process the data. The JobTracker in each compute node tracks the tasks and, in case of failure, reschedules them. The JobTracker is also aware of data locality and can therefore distribute jobs to the data nodes where the data resides. This improves performance and ensures data locality during the empirical mode decomposition.
The decoded frames are first bundled using the HIPI Image Bundle [23] and are then processed by the map-reduce job. The map task performs the EMD and object classification while the reduce task collects the results. The map-reduce job uses the configured chunk size to define the number of input splits from the HIPI ImageBundle (HIB); each chunk of data is then analyzed by the map-reduce job. Figure 7 depicts the process of video stream analysis on the cloud.

Figure 7: Video Stream Analysis on the Cloud

The video dataset on which empirical mode decomposition is applied is self-generated at the University of Derby and consists of videos of different subjects. Each video stream has a duration of around 120 seconds. The video streams are H.264 encoded with a resolution of 704x528 and a frame rate of 25 fps, giving a total of 3000 video frames per stream. The data rate and bit rate of each video stream are 421 kbps and 461 kbps respectively.
The BioID Face Database [24] and the Yale Face Database [25] are used to measure the efficiency of the proposed method. During the recording of the BioID Face Database, special importance was given to real world situations; the database therefore contains a diversity of face sizes, background conditions and illumination effects. It comprises 1521 gray level images captured at a resolution of 384x286 pixels. For testing purposes, the images are artificially blurred using a Gaussian blur mask with various sigma values (0, 0.25 . . .). Figure 8 shows some example images from the BioID Face Database.

Figure 8: Example images from BioID Face Database with increasing artificial blur
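Such a blurred test set can be generated, for instance, with OpenCV; the sigma list below is illustrative:

```python
import cv2

def blurred_variants(img, sigmas=(0, 0.25, 0.5, 1, 2, 3, 4, 5)):
    """Generate artificially blurred copies of a test image by
    convolving with Gaussian masks of increasing sigma (0 = no blur)."""
    out = []
    for s in sigmas:
        if s == 0:
            out.append(img.copy())
        else:
            # ksize (0, 0) lets OpenCV derive the kernel size from sigma
            out.append(cv2.GaussianBlur(img, (0, 0), s))
    return out
```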
The Yale Face Database contains images from various subjects with different poses and illumination conditions. All the images are manually aligned and cropped to a resolution of 168x192 pixels. Every subject demonstrates variations in illumination conditions (left-right, center-right, right-right) and facial expressions (normal, sad, happy, sleepy). Figure 9 shows some example images from the Yale Face Database.

Figure 9: Example images from Yale Face Database with various illumination conditions
VI. RESULTS AND DISCUSSION
The results obtained from the configurations described in Section V are presented in this section. The main focus of these results is to evaluate the system for classification accuracy with blurred content and under various illumination conditions. The execution of the system on the cloud evaluates the scalability and robustness of the proposed system by analyzing various components of the system such as image bundle creation time, video data transfer time to the cloud, and video data analysis time on the cloud nodes.

A) Accuracy with blurred content
The performance of the proposed system was evaluated with artificially blurred images in the first experiment. The widely used BioID Face Database was used for this purpose. The BioID Face Database contains images which have variation in pose and expressions. These images were further artificially blurred by performing a convolution with the Gaussian blur mask. The sigma values of the mask range from 0 to 5, with zero being no blur and 5 the maximum blur. It was therefore possible to observe the joint effect of pose, expressions and blur. The mean recognition rates of the proposed system and the widely used state of the art techniques are plotted in figure 10. It can be observed that the proposed system performs better than the existing techniques from minimum to maximum blur. The existing approaches tolerate slight blur, but as the value of sigma increases, their accuracy falls rapidly. The proposed system, on the other hand, handles increasing blur well.

Figure 10: Mean recognition rates for increasing Gaussian blur (Proposed Approach, LBP, LTP and LPQ)
B) Accuracy under various illumination conditions
The performance of the proposed system was evaluated with varying illumination conditions in the second experiment. The widely used Yale Face Database was used for this purpose. The Yale Face Database contains images which have variation in pose, expressions and illumination conditions. It contains images with lighting effects from different angles such as left-right, center-right and right-right. Furthermore, the facial expressions vary between normal, happy, sad and sleepy. The mean recognition rates of the proposed system and the existing techniques are plotted in figure 11. It can be seen that the proposed system tackles various illumination conditions better than the existing techniques and maintains a smooth accuracy plot.

Figure 11: Mean recognition rates for various illumination conditions (Proposed Approach, LBP, LTP and LPQ)


C) Video data analysis on the cloud nodes
In order to process the video frames in parallel, we have utilized the Hadoop MapReduce framework. The input data to be processed is first transferred to the Hadoop file storage (HDFS). This data is then processed by the MapReduce framework for object classification, and the results generated by the framework are stored in the database.
The cloud nodes are responsible for executing the analysis task, which is comprised of map and reduce tasks. The map task in our system is responsible for executing the empirical mode decomposition and the stack based local ternary patterns approach. The reducer is responsible for collecting the data and writing it to the output file. The MapReduce job splits the input HIPI image bundle into various data chunks, which then become input to the map tasks. The reducer gathers the processed data from the map tasks, which is then stored in the database.
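The division of labor can be summarized by the following illustrative Python pairing; the actual implementation is written in Java on top of HIPI and the Hadoop APIs, so this merely mirrors the roles of the two tasks, reusing the descriptor and matching sketches from Section IV:

```python
def map_task(frame_id, frame, query_hist):
    """Map: per-frame analysis on a cloud node -- BEMD, Riesz amplitudes
    and the stacked LTP descriptor, then matching against the histogram
    of the marked object (query_hist, assumed precomputed)."""
    feature = integrated_ltp_feature(frame)
    return frame_id, histogram_intersection(feature, query_hist)

def reduce_task(results):
    """Reduce: collect the (frame_id, score) pairs emitted by the map
    tasks; the real reducer writes these results to the output file."""
    return max(results, key=lambda r: r[1])   # best-matching frame
```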
Creating HIPI Image Bundle from Recorded Video Streams:
The recorded video streams are H.264 encoded for faster transmission and efficient use of storage space. The video streams are first fetched from the video storage and are then decoded to extract the video frames. The video decoding is performed using the FFmpeg library, and the decoded video frames are stored as PNGs. Each video stream is recorded at 25 frames per second, so there are 3000 (= 120 x 25) video frames for a video stream of 120 seconds length. The number of decoded video frames depends upon the length of the video streams being analyzed.
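The decoding step amounts to dumping every frame to disk (e.g., ffmpeg -i input.mp4 out/frame_%05d.png); an equivalent OpenCV sketch, with illustrative paths, is:

```python
import cv2

def decode_to_pngs(video_path, out_dir):
    """Decode an H.264 stream and store every frame as a PNG; a 120 s
    stream at 25 fps yields 3000 (= 120 * 25) frames. Assumes out_dir
    already exists."""
    cap = cv2.VideoCapture(video_path)
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        cv2.imwrite(f"{out_dir}/frame_{idx:05d}.png", frame)
        idx += 1
    cap.release()
    return idx  # number of decoded frames
```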
The individual frames are not suitable for further processing on the compute nodes. This is because the MapReduce framework is designed to process large scale data, and processing many small files decreases the overall performance; small files also necessitate a lot of disk seeks and hopping from node to node. Therefore, the decoded frames are first bundled using the HIPI Image Bundle [23] and are then processed by the map-reduce application.

HIPI Image Bundle Creation Time with Varying Data Sets:
In order to measure the HIPI Image Bundle creation time, we varied the data sets from 1 GB to 7 GB. These data sets of varying sizes helped to evaluate different features of the system. Each data set is converted into one HIPI image bundle before being passed to the map-reduce task. The image bundle creation time varied from 13.18 seconds to 147.26 seconds for the 1 GB and 7 GB data sets respectively. Figure 12 shows the time required to transform the various input datasets into an image bundle. It can be noted that the time needed to create an image bundle increases with the increasing size of the data set.

Figure 12: HIPI Image Bundle Creation Time (seconds) for dataset sizes of 1-7 GB

We have also measured the time required to transfer the decoded video streams to the cloud nodes; after the transfer, the HIPI image bundle is created. The transfer time of a dataset to the cloud nodes depends on the network bandwidth and the cloud data storage block size. For data sets varying between 1 GB and 7 GB, the data transfer time varied from 4.8 minutes to 36.5 minutes. Figure 13 shows the data transfer time for varying dataset sizes.

Figure 13: Data Transfer Time to Cloud Nodes (minutes) for dataset sizes of 1-7 GB
Analysing Video Streams on Cloud Nodes:
The scalability of the proposed system is evaluated by executing it on the cloud nodes. We have analyzed various datasets ranging from 1 GB to 7 GB on the cloud nodes, and the time required to analyze them is measured to evaluate the performance of the system. It is observed that the execution time increases with the dataset size (Figure 14). The number of spawned map tasks has a heavy impact on the performance of the overall system. The number of map tasks depends on the amount of data and the corresponding input split size, while the maximum number of map tasks on a compute node depends on the input data set, the cloud data storage block size and the available hardware specifications of the node. The results show that the performance of the system increases by distributing and parallelizing the work across multiple compute nodes. In this particular setup the default input split size of 128 MB is used.

Figure 14: Analysis Time (hours) on Cloud Nodes with Varying Datasets (3 GB, 5 GB and 7 GB)
As the number of nodes decreases, the number of analysis tasks on each node increases, which ultimately reduces the performance of the overall system. The degradation occurs because each task waits for a longer period of time to get scheduled on the compute nodes. The analysis task has a minimum execution time, and it is not possible to reduce the time beyond a certain limit because of inter-process communication and the reads and writes to the cloud storage. The proposed system executing on a single node takes 163 hours to analyze a dataset of 7 GB, while the same dataset is analyzed in 41 hours with 4 nodes. A decreasing trend is thus observed in the execution time as the number of nodes grows.
VII. CONCLUSION & FUTURE WORK
A cloud based blur and illumination invariant object classification system is presented and evaluated in this paper. The proposed system overcomes the challenges of blur and illumination by employing a stack based hierarchy of local ternary patterns generated from the amplitude spectrum of the intrinsic mode functions of each input frame. The accuracy of the proposed system is demonstrated by experimentation on video data as well as publicly available image datasets. The system proved to outperform state of the art approaches under uncontrolled conditions. In order to demonstrate the scalability of the system, it is deployed on a cloud based infrastructure, where it proved to cope with increasing volumes of data and varying numbers of nodes. Larger volumes of data require more analysis time; however, the analysis time decreases with the addition of more nodes to the cloud.
In future, we would like to empower the system to cope with other challenges such as rotation and translation. We would also experiment with the system on a much bigger dataset with a larger number of cloud nodes. The integration of the proposed system with approaches from the deep learning domain will also be part of our future work.

REFERENCES:
[1] Ojansivu, Ville, and Janne Heikkilä. "Blur insensitive texture classification using local phase quantization." International Conference on Image and Signal Processing. Springer Berlin Heidelberg, 2008.
[2] Samad, Saleha, and Anam Haq. "Orientation Invariant Object Recognitions Using Geometric Moments Invariants and Color Histograms." International Journal of Computer and Electrical Engineering, 2015.
[3] Van de Weijer, Joost, and Cordelia Schmid. "Blur robust and color constant image description." International Conference on Image Processing (ICIP '06), 2006.
[4] Wang, Jing-Wein, Ngoc Tuyen Le, Jiann-Shu Lee, and Chou-Chen Wang. "Color face image enhancement using adaptive singular value decomposition in Fourier domain for face recognition." Pattern Recognition, 2016.
[5] Zhang, Dehai, Da Ding, Jin Li, and Qing Liu. "PCA based extracting feature using fast Fourier transform for facial expression recognition." In Transactions on Engineering Technologies. Springer, 2015.
[6] Sagar, G.V., S.Y. Barker, K.B. Raja, K.S. Babu, and K.R. Venugopal. "Convolution based face recognition using DWT and feature vector compression." Third International Conference on Image Information Processing (ICIIP), 2015.
[7] Saini, Nirmala, and Aloka Sinha. "Face and palmprint multimodal biometric systems using Gabor–Wigner transform as feature extraction." Pattern Analysis and Applications, 2015.
[8] Mandic, Danilo P., Naveed ur Rehman, Zhaohua Wu, and Norden E. Huang. "Empirical mode decomposition-based time-frequency analysis of multivariate signals: the power of adaptive data analysis." IEEE Signal Processing Magazine, 2013.
[9] Chen, W.K., J.C. Lee, W.Y. Han, C.K. Shih, and K.C. Chang. "Iris recognition based on bidimensional empirical mode decomposition and fractal dimension." Information Sciences, 2013.
[10] Swain, M.J., and D.H. Ballard. "Color indexing." International Journal of Computer Vision, 1991.
[11] Funt, B.V., and G.D. Finlayson. "Color constant color indexing." IEEE Transactions on Pattern Analysis and Machine Intelligence, 1995.
[12] Flusser, J., and T. Suk. "Degraded image analysis: An invariant approach." IEEE Transactions on Pattern Analysis and Machine Intelligence, 1998.
[13] Crosswhite, Nate, Jeffrey Byrne, Omkar M. Parkhi, Chris Stauffer, Qiong Cao, and Andrew Zisserman. "Template adaptation for face verification and identification." 2016.
[14] Aggarwal, Ashutosh, and Chandan Singh. "Zernike Moments-Based Gurumukhi Character Recognition." Applied Artificial Intelligence, 2016.
[15] Ojansivu, V., and J. Heikkilä. "Object recognition using frequency domain blur invariant features." In Proc. Scandinavian Conference on Image Analysis (SCIA '07), 2007.
[16] Ojansivu, Ville, and Janne Heikkilä. "A Method for Blur and Similarity Transform Invariant Object Recognition." In Proceedings of the 14th International Conference on Image Analysis and Processing (ICIAP '07). IEEE Computer Society, 2007.
[17] Ojansivu, Ville, and Janne Heikkilä. "Blur Insensitive Texture Classification Using Local Phase Quantization." Proceedings of the 3rd International Conference on Image and Signal Processing, 2008.
[18] Grasso, M., S. Chatterton, P. Pennacchi, and B.M. Colosimo. "A data-driven method to enhance vibration signal decomposition for rolling bearing fault analysis." Mechanical Systems and Signal Processing, 2016.
[19] Wadhwa, N., M. Rubinstein, F. Durand, and W.T. Freeman. "Riesz pyramids for fast phase-based video magnification." IEEE International Conference on Computational Photography (ICCP), 2014.
[20] Freitas, Pedro Garcia, Welington Y.L. Akamine, and Mylene C.Q. Farias. "No-reference image quality assessment based on statistics of Local Ternary Pattern." Eighth International Conference on Quality of Multimedia Experience (QoMEX), 2016.
[21] http://hadoop.apache.org/
[22] http://opencv.org/
[23] http://hipi.cs.virginia.edu/
[24] https://www.bioid.com/About/BioID-Face-Database
[25] http://vision.ucsd.edu/content/yale-face-database