Cloud-Based Blur and Illumination Invariant Object Classification
Abstract-- The recent rise in multimedia technology has made it easier to perform various monitoring tasks. These monitoring tasks are supported by various cheap cameras producing large amounts of video data. This video data is then processed for object classification to extract useful information. However, the video data obtained by these cheap cameras is often of low quality and results in blurred video content. Moreover, various illumination effects caused by lighting conditions also degrade the video quality. These effects present severe challenges for the object classification process. We present a cloud-based blur and illumination invariant approach for object classification from image and video data. The bi-dimensional empirical mode decomposition (BEMD) has been adopted to decompose a video frame into intrinsic mode functions (IMFs). These IMFs then undergo a first-order Riesz transform to generate monogenic video frames. The analysis of each IMF has been carried out by observing its local properties (amplitude, phase and orientation) generated from each monogenic video frame. We propose a stack-based hierarchy of local pattern features generated from the amplitudes of each IMF, which results in blur and illumination invariant object classification. Extensive experimentation on video streams as well as publicly available image datasets reveals that our system achieves high accuracy and outperforms state-of-the-art techniques under uncontrolled conditions. The system also proved to be scalable with high throughput when tested on a number of video streams using cloud infrastructure.

Keywords: Empirical Mode Decomposition; Local Ternary Patterns; Riesz Transform; Amplitude Spectrum

I. INTRODUCTION

With the advancements in multimedia technology, it is now becoming easier to monitor critical events. State-of-the-art monitoring cameras placed at various locations of a locality are capable of recording and storing various events. The video data generated by these monitoring cameras is then processed, manually and automatically, to extract useful information such as person identification and tracking. However, most of the cameras used for monitoring purposes are cheap and of low quality because of budget constraints. The video streams generated by these cameras thus often contain blurred frames due to motion, defocus, or atmospheric turbulence. Moreover, since these cameras work under uncontrolled lighting conditions, they are also prone to illumination effects. The rotation angle of the objects being monitored is another challenge posed by these cameras.

The accuracy of any object classification system is highly dependent on these challenges. A good countermeasure to these challenges can lead to highly accurate results. However, video de-blurring and de-illumination are resource- and time-consuming tasks and often introduce new artifacts [1]. It is therefore desirable to perform classification with a procedure that is invariant to blur and illumination. Various approaches have been proposed in the past to tackle these problems. The most prominent among them are built on top of insensitive moments [2], color constancy [3] and Fourier phase. But these approaches are designed to perform classification globally and do not take into account the local properties of objects. Various methods based on magnitude, phase and spectral information [4] have also been designed, but these methods focused on texture analysis, and blur and illumination invariance was not considered as a design criterion.

One of the most successful ways to address these challenges is to analyze image or video frames by shifting them from the spatial domain to the spatial-frequency domain. In the spatial domain, processing of video frames is performed directly on the gray values of pixels. The spatial-frequency domain allows the processing of video frames by projecting them onto a set of basis functions which are defined by the method itself. This expands the video frame into frequency components with both high and low magnitudes. A variety of methods exist for the conversion of a spatial domain signal to the spatial-frequency domain. These methods include the Fourier transform [5], the wavelet transform [6], the Wigner distribution [7] and many others.

Huang et al. [8] proposed a method known as Empirical Mode Decomposition (EMD), which is now widely used for image analysis. EMD expands a signal into its frequency components adaptively. These frequency components are termed Intrinsic Mode Functions (IMFs). In each mode, EMD extracts the highest frequency components from the original input signal. It separates the locally highest frequencies and stores them in an IMF. The rest of the IMFs contain the remaining frequencies in decreasing order, which ends in a residual part. To apply empirical mode decomposition to images, two-dimensional empirical mode decomposition (2DEMD), or bi-dimensional empirical mode decomposition (BEMD), was introduced [9].

In this paper, we use bi-dimensional empirical mode decomposition to decompose a video frame into Intrinsic Mode Functions. The video frame is first extracted from the video stream, and then BEMD is applied to decompose it into IMFs. BEMD provides several advantages over spatial domain analysis: features can be extracted easily according to the distribution of local phase or energy. Each IMF is analyzed independently and in parallel using the Riesz transform to extract the local properties (amplitude, orientation, and phase) of the video frame. These local properties are further examined to perform the classification task using local pattern features.

The main contributions of this paper are as follows. Firstly, we pioneer the use of EMD on video streams in a cloud-based distributed environment. We show that only the first three IMFs are sufficient to perform classification under challenging conditions with a high accuracy rate. This is advantageous in two ways: I) reduced feature extraction time as compared to other methods; II) illumination and blur invariance, since only the lower IMFs are sensitive to a variety of effects. Secondly, we utilize the amplitude property derived from each IMF using the first-order Riesz transform and show that it provides better results than the phase or orientation properties. Thirdly, we propose a stack-based hierarchy of local pattern features generated from the amplitude of each IMF for highly accurate classification.

The rest of the paper is organized as follows: Section II reviews the state-of-the-art techniques that have been used recently for object classification. Section III explains the approach of our object classification system. The implementation details and experimental setup are described in Sections IV and V respectively. The experimental results and their analysis are detailed in Section VI. Section VII concludes the paper with a glimpse of future work.

II. RELATED WORK

Object classification has been the focus of many researchers over the last several decades. However, classifying objects from video streams under uncontrolled conditions poses more challenges and is now attracting much attention from the research community. We provide a brief review of the recent approaches proposed in this domain and identify the research gap.

A number of authors have used color information to develop blur and illumination robust descriptors. A robust descriptor has been proposed by Joost et al. [3]. They used ratios of image derivatives to develop a descriptor of color constant ratios that is invariant to blur and illuminant color. A similar approach based on color information is exploited by Ballard et al. [10], who made use of color histograms to perform recognition of objects. Another approach based on color histograms is proposed by Funt et al. [11], who utilized color constant derivatives to represent an object for recognition. However, these approaches do not perform well against variations in the illuminant's color, especially with a change in camera viewpoint or object orientation, and are also dependent on lighting geometry.

Moment invariants have also been used in the past as blur invariant features. Flusser et al. [12] pioneered the use of moment invariants built on top of geometric moments. They also utilized central moments to provide invariance to translation. Moment invariants have also been part of various applications such as template matching [13], recognition of defocused objects and X-ray imaging. Complex moments were also proposed by [14] for blur, rotation and scale invariance. Despite their wide usage in various applications, moment invariants remain sensitive to noise and background clutter.

Invariants based on the phase of the frequency spectrum obtained by the Fourier transform are used by [15]. These invariants are also insensitive to shifts of the image. Ville et al. [16] proposed a centrally symmetric blur invariant descriptor based on the phase-only spectrum of an image. The phase-only spectrum was normalized so that it became insensitive to linear brightness changes as well. A similar approach is adopted by Ville et al. [17], in which phase information was calculated within a local window for every image position. The quantization of the phase of the discrete Fourier transform and the de-correlation of low frequency components were performed in an eight-dimensional subspace. A histogram of the resulting features was used for classification of blurred texture images. However, these invariants are limited to image shifts. Also, invariance to translation has not been addressed in phase-based frequency spectrum invariants. We propose a blur and illumination invariant feature descriptor which provides invariance to larger blur radii and high PSNR values. Interestingly, our feature descriptor also provides good results for sharp video frames that are not blurred.
Figure 1: Video Analysis Approach
III. VIDEO ANALYSIS APPROACH

We present the approach and architecture of our object classification system in this section. The video streams are first acquired by the video capturing sources and are then stored in the cloud storage. The cloud manager fetches these video streams from the cloud storage and distributes them among various cloud nodes. The cloud manager is solely in charge of the allocation of video streams to each cloud node and manages workload distribution among cloud nodes. The video streams are decoded to extract individual video frames. These video frames are artificially blurred with varying radius. Noise has also been added to the objects with different PSNR values. These objects are then classified by the blur and illumination invariant feature descriptor. Figure 1 shows the approach of our blur and illumination invariant object classification system.

A. Decomposition of Video Frame
Each decoded video frame is decomposed into its frequency components through Two-Dimensional Empirical Mode Decomposition [9], which is a fully unsupervised approach. It defines its basis functions directly from the data and is not dependent on other methods. 2DEMD expands a video frame into its frequency components adaptively. These frequency components are termed Intrinsic Mode Functions (IMFs) and are defined by the video frame itself. The sifting process [18] is used to extract these IMFs from the video frame. In each mode, this algorithm extracts the highest frequency components from the original input video frame. It separates the locally highest frequencies and stores them in an IMF. The rest of the IMFs contain the remaining frequencies in decreasing order. This ends in the residue, which contains the remaining lowest frequencies. By combining all the IMFs and the residue, the original signal can be obtained. EMD allows visualizing the spatial-frequency characteristics of a signal by expanding the parent signal into IMFs.

B. Amplitude, Phase and Orientation Spectrum
The Riesz transform [19] is then applied to the IMFs to obtain monogenic data, which helps to study the local properties of video frames. Monogenic data is a local quantitative and qualitative measure of the video frame. Local amplitude, phase and direction can be calculated from the monogenic signal of each IMF.

C. Local Pattern Features
Local Ternary Pattern (LTP) features [20] are an extension of LBP. In LTP, the difference between the central pixel and its neighboring pixels is encoded into a ternary code. This ternary code is further split into a positive LBP and a negative LBP to reduce the dimensionality. This encoding of the pixel difference into a separate state makes it more robust to noise. A pattern histogram, as in LBP, is then created using the ternary values of the neighboring pixels. LTP features are more robust to noise than LBP, but loss of information occurs when splitting into positive and negative LBP. Also, redundant information resides in the histograms of both LBPs, as they are strongly correlated. We calculate the local ternary pattern features from the amplitudes of each IMF and then arrange them in a hierarchical fashion to generate the feature vector.
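The LTP encoding described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes a 3x3 neighborhood and a fixed threshold `t` (both the function name and the default threshold are our choices).

```python
import numpy as np

def ltp_histograms(img, t=5):
    """Encode each interior pixel's 8-neighborhood as a Local Ternary
    Pattern, split it into positive/negative LBP codes, and return the
    two 256-bin pattern histograms."""
    img = img.astype(np.int32)
    # Offsets of the 8 neighbors, enumerated clockwise from the top-left.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    h, w = img.shape
    pos_hist = np.zeros(256, dtype=np.int64)
    neg_hist = np.zeros(256, dtype=np.int64)
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            c = img[y, x]
            pos_code = neg_code = 0
            for bit, (dy, dx) in enumerate(offsets):
                d = int(img[y + dy, x + dx]) - int(c)
                if d >= t:          # ternary state +1 -> bit of positive LBP
                    pos_code |= 1 << bit
                elif d <= -t:       # ternary state -1 -> bit of negative LBP
                    neg_code |= 1 << bit
            pos_hist[pos_code] += 1
            neg_hist[neg_code] += 1
    return pos_hist, neg_hist
```

Concatenating the positive and negative histograms computed on the amplitude map of each IMF would give one per-IMF descriptor of the kind this subsection describes.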
D. Stack based Hierarchy of Features
We propose a stack-based hierarchical arrangement of the local ternary patterns to represent the input video frames as a feature descriptor. The local ternary patterns generated from the amplitude spectrum of each IMF are stacked together in a hierarchical fashion such that the pixels of each IMF are adjacent to each other. All the adjacent pixels are then summed together to form an integrated local ternary pattern of the whole video frame. This serves as the feature vector of the whole video frame.

E. Similarity Measure
This procedure of LTP histogram generation is performed for all the video frames and for the image which is to be matched. Matching is performed by comparing the LTP histogram of the marked object frame with all the frames of the video stream. Histogram intersection is used as a distance measure to calculate the similarity between two frames. After the person's face is authenticated correctly, the matching score associated with it is stored in a database as depicted in Figure 2.

IV. SYSTEM IMPLEMENTATION

A. Empirical Mode Decomposition
The spatial-frequency domain allows the processing of video frames by projecting them onto a set of basis functions. These basis functions are defined by the method itself. Since the basis functions are predefined by the method itself, such methods are unsuitable for non-linear and non-stationary processes. Empirical Mode Decomposition (EMD) does not decompose the signal on an a-priori basis; instead, the basis functions are derived from the data. Hence, this approach is data driven and much more effective on non-stationary signals. EMD considers data as a superposition of fast oscillations on slow ones. These oscillations are recognized by EMD, and a decomposition is made by utilizing these modes as the expansion basis. The resulting frequency components are termed Intrinsic Mode Functions (IMFs) and are defined by the signal itself. The sifting process is the algorithm used to extract these IMFs from the signal. In each mode, this algorithm extracts the highest frequency components from the original input signal. It separates the locally highest frequencies and stores them in an IMF. The rest of the IMFs contain the remaining frequencies in decreasing order. This ends in the residue, which contains the remaining lowest frequencies. By combining all the IMFs and the residue, the original signal can be obtained. EMD allows visualizing the spatial-frequency characteristics of a signal by expanding the parent signal into IMFs, as shown in figure 3. The sifting algorithm is defined as follows:

Extrema Identification
Determine the extrema points (maxima points and minima points) of the input video frame I(x, y), where I is the 2D video frame with x = 1,…, M and y = 1,…, N.

Envelope Calculation
Connect the extrema points (all the maxima points and all the minima points, respectively) using radial basis functions to form the lower and upper 2D envelopes, denoted by emax(x, y) and emin(x, y).

Mean Envelope Calculation
Average the two envelopes, i.e. the maxima envelope and the minima envelope, to generate the local mean envelope m1:

m1(x, y) = (emax(x, y) + emin(x, y)) / 2    (1)

ProtoIMF Generation
Subtract the local mean from the video frame to generate h1.

… Local amplitude, phase and direction can be calculated from the monogenic signal of each IMF, as shown in figure 4.
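One pass of the sifting steps above can be sketched in one dimension. This is an illustration under simplifying assumptions, not the paper's code: linear interpolation stands in for the radial basis functions used on 2D frames, and `sift_once` is our name.

```python
import numpy as np

def sift_once(signal):
    """One sifting iteration of 1D EMD: identify extrema, interpolate the
    upper and lower envelopes, average them per eq. (1), and subtract the
    mean envelope to obtain the proto-IMF h1."""
    x = np.arange(len(signal))
    # Extrema identification: interior points not smaller/larger than both neighbors.
    maxima = [i for i in range(1, len(signal) - 1)
              if signal[i] >= signal[i - 1] and signal[i] >= signal[i + 1]]
    minima = [i for i in range(1, len(signal) - 1)
              if signal[i] <= signal[i - 1] and signal[i] <= signal[i + 1]]
    # Include the endpoints so the envelopes span the whole signal.
    maxima = [0] + maxima + [len(signal) - 1]
    minima = [0] + minima + [len(signal) - 1]
    e_max = np.interp(x, maxima, signal[maxima])   # upper envelope
    e_min = np.interp(x, minima, signal[minima])   # lower envelope
    m1 = (e_max + e_min) / 2.0                     # mean envelope, eq. (1)
    h1 = signal - m1                               # proto-IMF
    return h1, m1
```

Repeating this step on h1 until it satisfies the IMF conditions, then subtracting the accepted IMF and sifting the remainder, yields the sequence of IMFs plus the final residue.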
[Figure: recognition rate versus sigma of the Gaussian blur]
B) Accuracy under various illumination conditions
The performance of the proposed system was evaluated under varying illumination conditions in the second experiment. The widely used Yale Face Database was used for this purpose. The Yale Face Database contains images which have variation in pose, expression and illumination conditions. It contains images with lighting effects from different angles, such as left-right, center-right and right-right. Furthermore, the facial expressions vary among normal, happy, sad and sleepy. The mean recognition rates of the proposed system and the existing techniques are plotted in figure 11. It can be seen that the proposed system handles various illumination conditions better than the existing techniques and maintains a smooth accuracy plot.
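The matching behind these recognition rates uses histogram intersection as the distance measure (Section III-E). A minimal sketch, in which the normalization to [0, 1] and both function names are our choices:

```python
import numpy as np

def histogram_intersection(h1, h2):
    """Histogram-intersection similarity between two LTP histograms,
    normalized so that 1.0 means identical distributions."""
    h1 = h1 / h1.sum()
    h2 = h2 / h2.sum()
    return float(np.minimum(h1, h2).sum())

def best_match(query_hist, frame_hists):
    """Index and score of the video frame whose histogram is most
    similar to the query object's histogram."""
    scores = [histogram_intersection(query_hist, h) for h in frame_hists]
    return int(np.argmax(scores)), max(scores)
```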
[Figure 11: recognition rate versus illumination variations]
… object classification. The results generated by the framework are then stored in the database.

The cloud nodes are responsible for executing the analysis task, which is comprised of map and reduce tasks. The map task in our system is responsible for executing the empirical mode decomposition and the stack-based local ternary patterns approach. The reducer is responsible for collecting the data and writing it to the output file. The MapReduce job splits the input HIPI image bundle into various data chunks. These data chunks then become input to the map tasks. The reducer gathers the processed data from the map tasks, which is then stored in the database.

Figure 12: HIPI Image Bundle Creation Time [image bundle creation time (seconds) versus dataset size (GB)]

Creating HIPI Image Bundle from Recorded Video Streams:
The recorded video streams are H.264 encoded for faster transmission and efficient use of storage space. The video streams are first fetched from the video storage and are then decoded to extract video frames from the input video. The video decoding is performed using the FFMPEG library. These decoded …

We have also measured the time required to transfer the decoded video streams to the cloud nodes. After transferring the video streams to the cloud nodes, the HIPI image bundle is created. The transfer time of the dataset to the cloud nodes is dependent on the network bandwidth and the cloud data storage block size. For datasets varying between 1 GB and 7 GB, the data transfer time varied from 4.8 minutes to 36.5 minutes. Figure 13 shows the data transfer time for varying dataset sizes.

[Figure 13: data transfer time (minutes) versus dataset size (GB)]
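The map/reduce flow described above (map = feature extraction per chunk, reduce = collect the results) can be mimicked locally with plain functions. This is a toy sketch, not the Hadoop/HIPI API: `extract_features` is a placeholder for the EMD plus stacked-LTP pipeline, and every name here is illustrative.

```python
def extract_features(frame):
    # Placeholder feature: the frame's mean intensity. The real system
    # would emit the stacked LTP histogram of the IMF amplitudes.
    return sum(frame) / len(frame)

def map_task(chunk):
    """Stand-in map task: emit (frame_id, feature) pairs for every
    frame in one data chunk of the image bundle."""
    return [(frame_id, extract_features(frame)) for frame_id, frame in chunk]

def reduce_task(mapped_outputs):
    """Stand-in reducer: gather every map task's pairs into one result
    table, mirroring the reducer that writes to the output file."""
    results = {}
    for pairs in mapped_outputs:
        for frame_id, feature in pairs:
            results[frame_id] = feature
    return results

# Split a toy "image bundle" into chunks, map each chunk, reduce the outputs.
bundle = [(i, [i, i + 1, i + 2]) for i in range(6)]
chunks = [bundle[0:3], bundle[3:6]]
final = reduce_task(map_task(c) for c in chunks)
```

In the deployed system the chunks would be distributed across cloud nodes by the MapReduce runtime instead of iterated sequentially.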