Image Feature Extraction Algorithm Based On CUDA Architecture: Case Study GFD and GCFD

IET Computers & Digital Techniques
Research Article
ISSN 1751-8601
Received on 13th September 2016
Revised 14th November 2016
Accepted on 13th January 2017
doi: 10.1049/iet-cdt.2016.0135
www.ietdl.org
Haythem Bahri1, Fatma Sayadi1, Randa Khemiri1, Marwa Chouchene1, Mohamed Atri1
1Laboratory of Electronics and Micro-electronics, Faculty of Sciences, Monastir University, 5000 Monastir, Tunisia
E-mail: bahri.haythem@hotmail.com
Abstract: Optimising the computing time of applications is an increasingly important task in many areas, including scientific and industrial applications. The graphics processing unit (GPU) is considered one of the most powerful engines for computationally demanding applications since it offers a highly parallel architecture. In this context, the authors introduce an algorithm to optimise the computing time of feature extraction methods for colour images. They choose the generalised Fourier descriptor (GFD) and generalised colour Fourier descriptor (GCFD) models as methods to extract image features for applications such as real-time colour object recognition or image retrieval. They compare the experimental computing times on the central processing unit and the GPU. They also present a case study of these descriptors on two platforms: an NVIDIA GeForce GT525M and an NVIDIA GeForce GTX480. Their experimental results demonstrate that the execution time can be reduced considerably, by up to 34× for GFD and 56× for GCFD.
1 Introduction

Feature extraction and object recognition are subjects of extensive research in the field of image processing. Colour object recognition is widely used in the machine vision industry in real-time applications. The main issue is to recognise objects rapidly in order to simplify the classification process. Unfortunately, extracting invariant descriptors from objects of different sizes remains a crucial challenge. In fact, this treatment often consumes the largest part of the computation time of the recognition process. We therefore focused on accelerating this computation block by using graphics processing units (GPUs).

Many-core devices such as GPUs have gained popularity in the past years. Manufacturers such as NVIDIA and AMD have invested widely in this category. The important distinction between these processors and traditional multi-core processors is that these devices provide a large number of low-overhead hardware threads with low-overhead context switching between them [1]. GPU technology may be advantageous for running experiments that demand high computing time, such as embarrassingly parallel computing problems in different fields. In fact, the GPU is used as a co-processor to accelerate applications running on the central processing unit (CPU) by offloading some of the compute-intensive and time-consuming portions of the code. The GPU part uses the massively parallel processing power of this device to boost performance. This is called the heterogeneous or hybrid computing architecture. We can note two different ways in which GPUs usually contribute to the overall computation. The first is to carry out some specific tasks through user-designed kernels and the second is to execute some data-parallel primitives provided by off-the-shelf GPU-accelerated libraries (e.g. CUDPP, CUDNN, CUFFT, CUBLAS etc.) [2].

Many researchers have attempted to accelerate the computation of the Fourier descriptor (FD), which is used for object recognition, classification and image retrieval. Smach et al. [3] were the first to accelerate the FD computation, based on hardware/software co-design using field programmable gate array (FPGA) technology. The image acquisition and the final classification using SVM are executed in software while the feature vector computation is implemented in hardware.

In [4], Batard et al. have proposed a definition of the Fourier transform for L² functions. They have demonstrated that the previous generalisations for a colour image, where three is the number of colour channels, are particular cases of their proposal. Owing to the domain of our application (colour image processing), only the Clifford–Fourier transform defined from morphisms is used.

Mennesson et al. [5] have defined new generalised colour FDs (GCFD) from the Clifford–Fourier transform and the generalised FD (GFD). They obtained better recognition rates with GCFD than with the marginal approach for many choices of bi-vector B, while the computational costs remain low. This method was implemented purely in software using MATLAB and C.

Mohamed et al. [6] have implemented this GCFD based on co-design using an FPGA. They used the same partitioning of the system as Smach et al. [3], but for two different architectures of the hardware part. They obtained better execution times than those in [3].

In our previous work [7], we implemented the two-dimensional (2D) fast Fourier transform (FFT) [8] using the many-core NVIDIA GPU architecture CUDA. The parallel version ran more than 24 times faster than a sequential version on a CPU. We extended this work in [9] to analyse the GFD and GCFD by implementing the FFT and shifting kernels on the GPU, whereas the FD algorithm was computed on the CPU. The GPU version reached a speed-up of around 9.7 times for GFD and up to 12 times for GCFD.

In this paper, we try to optimise the computing time of the FD algorithm on the GPU. We present a method to compute the different FD blocks (2D FFT, shifting of the FFT image and calculation of the FD) both on the GPU and on the CPU. These computations can be performed faster on the GPU than on the CPU, mainly because of the high parallelism allowed by the internal structure of the GPU. We took into account some CUDA programming optimisations such as varying the number of threads per block and parallelising the ‘for’ loops as far as possible so that they run simultaneously on the GPU cores [10].

Here, we compare the computing time of two FD models on two different platforms. We implemented these models first purely in CPU software, then on the GPU.

This paper is organised as follows. Section 2 presents an overview of the FD models. In Section 3, we describe the design of a parallel FD algorithm. Section 4 presents case studies and experimental results for the FD models. Finally, Section 5 concludes this paper.
IET Comput. Digit. Tech. 1

© The Institution of Engineering and Technology 2017
2 FD: GFD and GCFD

The development of storage, transmission and acquisition systems has led to the creation of large image databases. The diversity of these databases (medical, object and face images) makes image representation a dynamic domain of research. Representing an image by its contents is not the only way to characterise it. For all images, we can extract different attributes using mathematical tools. The FD is one of the tools that represent all the information of an image in a vector of coefficients. We consider two FD models that differ in the way the image is processed: GFD and GCFD. These models use the FFT to extract different features of an image. Those features are represented in a coefficient vector that is used as an input to a learning machine (SVM, neural network etc.) to identify the class of the image.

2.1 Fast FT

The Fourier transform is the analogue of the Fourier series for non-periodic functions. It is an operator that transforms an integrable function into another function giving its frequency spectrum. In the Fourier domain, each point represents a particular frequency contained in the spatial domain. The Fourier transform is used in a wide range of applications such as image analysis, image filtering, image reconstruction and image compression. The Fourier transform of a spatial function f(x) results in a frequency function as given in (1). Its inverse transform is shown in (2), where ‘v’ is a frequency variable, ‘x’ is a spatial variable and ‘i’ is the imaginary unit

$$F(v) = \int_{-\infty}^{+\infty} f(x)\, e^{-ivx}\, dx \tag{1}$$

$$f(x) = \frac{1}{2\pi} \int_{-\infty}^{+\infty} F(v)\, e^{ivx}\, dv \tag{2}$$

Under suitable conditions, the original function can be recovered from its transform without losing information. For images, the transform is written in two variables u and v as in (3); its inverse transform is shown in (4)

$$F(u, v) = \iint_{-\infty}^{+\infty} f(x, y)\, e^{-2i\pi(ux + vy)}\, dx\, dy \tag{3}$$

$$f(x, y) = \iint_{-\infty}^{+\infty} F(u, v)\, e^{2i\pi(ux + vy)}\, du\, dv \tag{4}$$

This transform is an important image processing tool which is used to decompose an image into its sine and cosine components. The output of this transform represents the image in the frequency domain (also called the Fourier domain), whereas the input image is its spatial-domain equivalent. For digital images, the FD uses the discrete Fourier transform (DFT), which is the basis of most digital image processing. The DFT and inverse DFT can be written as follows:

$$F(u) = \sum_{x=1}^{M} f(x)\, e^{-2i\pi ux/M} \tag{5}$$

$$f(x) = \sum_{u=1}^{M} F(u)\, e^{2i\pi ux/M} \tag{6}$$

The intensity of a digital image f(x, y) is a spatial function defined on R², which is transformed into a frequency function on R² using the DFT as in (7); its inverse transform is given in (8)

$$F(u, v) = \sum_{x=1}^{M} \sum_{y=1}^{N} f(x, y)\, e^{-2i\pi\left[(ux/M) + (vy/N)\right]} \tag{7}$$

$$f(x, y) = \sum_{u=1}^{M} \sum_{v=1}^{N} F(u, v)\, e^{2i\pi\left[(ux/M) + (vy/N)\right]} \tag{8}$$

In (7), the transform is written with a summation operator for the discrete function, whereas in (3) it is written with an integral operator for the continuous function. The DFT is a sampled Fourier transform; therefore, it does not contain all the frequencies forming an image, but only a set of samples that is large enough to describe the spatial domain image fully. There are various forms of FFT, but most of them restrict the size of the input image that may be transformed. It is used to access the geometric characteristics of a spatial domain image. In most implementations, the Fourier image is shifted in such a way that the output F(0, 0) is displayed in the centre. The FFT is based on the complex DFT, an extension of the real DFT. The complexity of the FFT is O[n log (n)], whereas that of direct DFT algorithms is O(n²), with n the number of points to transform. For n = 1024, the FFT can be about 100 times faster than direct DFT algorithms.

2.2 Fourier descriptor

The descriptor includes all the information of the object in the image and remains invariant under geometric, noise and lighting transformations. It can be a global model that represents all the pixels in the image, or a local model that represents a region of interest or points of interest. Examples of descriptor models are colour descriptors, shape descriptors and texture descriptors. We are interested in shape descriptors as an application domain. In this context, we studied the FD for the recognition and classification of images. It uses the FFT to extract all the information of a colour image. The data are represented in a vector which is used as input to the classifier to identify its class. FDs can be used in various applications such as object and shape recognition, tracking of objects or persons and real-time image retrieval. The FD steps can be described in a few lines. Initially, divide the colour image into sub-planes. Then transform these planes using the FFT. Afterwards, group the blocks of the FFT images and calculate the squared modulus of the resulting images. Finally, apply the suitable algorithm to determine the FD.

2.2.1 Generalised FD: Historically, GFD is the first FD model used as feature vector components in colour object recognition or image retrieval. This model computes the descriptor on each colour channel of an image separately. It uses the FFT to extract the features of each colour channel of the image.

The image intensity relations were defined in Cartesian coordinates, but Zhang and Lu [11] have used them in 2D polar coordinates as in (9)

$$F(\rho, \varphi) = \sum_{r} \sum_{i} f(r, \theta_i)\, e^{2i\pi\left[(r/R)\rho + (2\pi/T)\varphi\right]} \tag{9}$$

To use the full information of the colour object, one of the most conventional GFD methods is to calculate the FD for the three planes red, green and blue (RGB) separately. These descriptors are then concatenated to build a single feature descriptor. This approach has been successfully applied to the generalised Fourier descriptor by Smach [1], who defined FD invariants generalised to all digital colour images. The following relation gives these invariants:

$$D_f(r) = \int_{0}^{2\pi} \left|f(r, \theta)\right|^{2} d\theta \tag{10}$$

The integral is replaced by a finite sum in the discrete domain to produce the FD. In fact, the computation of the descriptors can be summarised in four main steps. First, decompose the colour image into three separate channel images. Then, apply the FFT algorithm to each channel and compute the characterising descriptor of each channel. Finally, concatenate the resulting vectors into a single descriptor of the colour image (see Fig. 1a), to be used as an input to the classifier.

2.2.2 Generalised colour FD: In this section, we present the second FD model: GCFD. It can minimise the loss of data
compared with GFD. This descriptor is based on the CFT, the Fourier transform defined in Clifford algebra. For GCFD, we work with two channels instead of three (see Fig. 1b). It consists of decomposing a colour image into a part parallel to the bi-vector B and a part orthogonal to it, according to I4B, using the CFT [5]. As in the previous model, the FFT algorithm is applied to both channels and the characterising descriptor is computed for each plane. Finally, the GCFD invariants are defined for the two parts, GCFD∥B and GCFD⊥B, as follows:

$$\mathrm{GCFD}_{\parallel B}(r) = \int_{0}^{2\pi} \left|F_{\parallel B}(r, \theta)\right|^{2} d\theta \tag{11}$$

$$\mathrm{GCFD}_{\perp B}(r) = \int_{0}^{2\pi} \left|F_{\perp B}(r, \theta)\right|^{2} d\theta \tag{12}$$

Fig. 1 Descriptors extraction from the colour image
(a) GFD, (b) GCFD

The resulting descriptor is the concatenation of the parallel and orthogonal parts of the descriptor. The size of the FD vector is reduced to 2 × m, compared with 3 × m for the marginal method GFD (m: the number of coefficients of one channel descriptor). For example, an image of 128 × 128 resolution can be represented by a vector of 64 × 3 coefficients for GFD and 64 × 2 for GCFD instead of 128² colour pixels.

The algorithmic complexity of the CFT is O[n log (n)]; indeed, it requires the computation of eight projections [in O(n)] and two FFTs [in O(n log (n))]. The GFD also has a time complexity of O[n log (n)] since it mainly uses the FFT instead of the direct DFT. GCFD has the same complexity, but it provides a gain over GFD because the Clifford–Fourier transform requires only two Fourier transforms, instead of three for the marginal method.

3 Methods and algorithms

Our proposed algorithm accelerates the FD computation of the previous models. Fig. 2 shows the designed parallel FD algorithm, which explains all transfers from CPU to GPU and back. The proposed algorithm is composed of three major stages. In the first stage, a colour image is loaded on the CPU and split according to the FD model. In the next stage, the FFT images are computed and shifted separately, and each FFT image is passed to the FD algorithm to extract the feature vector. Finally, the resulting vectors are normalised and concatenated for display.

Fig. 2 Designing parallel algorithm of FD

3.1 Image acquisition and preprocessing

Currently, we use RGB colour images located on the CPU for the image acquisition and preprocessing stage. We read the colour image from the database to prepare it for preprocessing. Then, we split the image depending on the chosen FD model. In the first case, we work with the GFD model by decomposing the image into RGB planes. In the second case, we work with the GCFD model and use the CFT to split the image into two planes: one parallel and one orthogonal to the bi-vector. For example, Fig. 1 shows the GCFD model using a B bi-vector for the parallel component and a combination of R and G, according to the CFT, for the orthogonal component. The image planes are thus prepared for feature extraction on the GPU.

Algorithm 1 (see Fig. 3) shows the pseudo-code of the CPU portion of the designed parallel FD algorithm. It starts with the declaration of the algorithm variables, one to receive the image and another to store the resulting FD. Before splitting the image, one of the FD models (see Fig. 1) must be chosen to perform the
adequate preprocessing of the received colour image. After splitting, each image plane is stored in its own variable. Thereafter, the remaining steps are the same for each image plane in both FD models. One of the most important tasks in hybrid computing is the transfer of data between CPU and GPU, because the variable type and the allocated memory size must match on both sides to obtain accurate results. The next step is to transfer the image plane to the GPU; the FD variable is also transferred and holds the computed result at the end of the algorithm. At this point, the acquisition, the preprocessing and the data transfer are complete. Next, the first computing step is to launch the CUFFT kernel, which computes the FFT of the image plane on the GPU. The result of this transform is not organised, which is why we call the FFT shift kernel on the GPU to reorganise the FFT image and compute its squared modulus. The last kernel is called on the GPU to compute the FD vector using its specific algorithm, which is clarified in Section 3.2. At the end of the algorithm, the resulting vector of the single image plane is transferred to the CPU. Before concatenating all vectors to display the FD, each image plane vector is normalised.

Fig. 3 Algorithm 1: FD for a colour image

The proposed algorithm can be used for both GFD and GCFD; the only difference is the number of image planes of the FD model. In addition, the kernels called on the GPU, which are explained in the next section, are chained: the output data of the CUFFT kernel is the input of the FFT shift kernel, whose output is in turn the input of the FD kernel.

3.2 Feature extraction

At this stage, we detail the process that extracts the image features to build the FD of a colour image. It is composed of three steps, as shown in Fig. 2, all of which are executed on the GPU. CUFFT is the function directly responsible for the 2D FFT on the GPU [12]. Once the image plane is ready on the device, CUFFT applies the FFT to provide the frequency image. Owing to the unorganised form of the resulting image, the shifting step reorganises the order of the image blocks; the squared modulus of the shifted image plane is then computed, and the stage is completed by the FD kernel.

Algorithm 2 (see Fig. 4) and Algorithm 3 (see Fig. 5) are the GPU portions of the designed parallel FD algorithm. They describe the second and third steps of the feature extraction stage. Algorithm 2 shows the pseudo-code of the FFT shift and squared modulus kernel, which performs the exchange of the image blocks. At first, it declares the input variable that receives the FFT image, an auxiliary variable required to exchange the image blocks in the shifting step and an output variable that returns the resulting image plane of this kernel. Then, the input image plane is divided into four equal blocks. Phase 1 of Algorithm 2 explains how the first and third blocks are exchanged. This process is repeated for the second and fourth blocks of the image plane. The shifting step thus reorganises the image blocks such that the most significant data of the FFT image appear in the centre. Phase 2 computes the squared modulus of the shifted image in order to return an organised image plane from which an accurate FD can be extracted, which is the subject of Algorithm 3.

Fig. 4 Algorithm 2: FFT shift and square module kernel

Algorithm 3 describes the method of computing the FD of an image plane after the previous treatment. It is the last step of the feature extraction stage, the FD kernel. As explained in [3], this algorithm extracts the feature vector of the image from its pixels. The first coefficient of this vector is the intensity of the pixel in the centre. The second is the sum of the pixels surrounding the centre, and so on up to the outermost set of pixels of the image. The algorithm then returns the FD vector that represents the whole data of the image. At the end of this stage, the FDs of all image planes are sent to the CPU for the last stage of our parallel algorithm.

3.3 Display FD

This stage combines the different results received from the GPU to build the FD of the original image. Before that, we normalise each image plane vector by its first coefficient; we then concatenate all vectors to present and display the FD. To conclude this section, we realised an algorithm that transforms a colour image into its FD by computing the main stage on the GPU. Hence, our contribution consists of implementing the feature extraction stage on the GPU in order to reduce the execution time of the FD models using CUDA. In the next sections, we evaluate the performance of our contribution in terms of execution time and speed-up factor between CPU and GPU.

4 Case studies and experimental results

In this section, we evaluate the parallel FD algorithm on the GPU using CUDA and compare it with the traditional algorithms on the CPU. First, we describe the hardware used; then we present and discuss our results.
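As a CPU-side reference for the three GPU steps of Section 3.2 and the normalisation of Section 3.3, the chain can be sketched in plain Python. This is an illustrative sketch, not the CUDA kernels of Algorithms 2 and 3: all function names are hypothetical, the input is assumed to be an already computed 2D spectrum with even dimensions, and the sets of pixels around the centre are assumed to be rings of integer-truncated Euclidean radius, which is only one possible reading of Algorithm 3.

```python
import math

def fft_shift(spectrum):
    """Swap quadrants 1<->3 and 2<->4 so that F(0, 0) lands in the centre.

    Assumes even height and width (true for the 64..2048 sizes tested).
    """
    rows, cols = len(spectrum), len(spectrum[0])
    half_r, half_c = rows // 2, cols // 2
    return [[spectrum[(y + half_r) % rows][(x + half_c) % cols]
             for x in range(cols)]
            for y in range(rows)]

def squared_modulus(spectrum):
    """Replace every complex coefficient by |F|^2."""
    return [[abs(value) ** 2 for value in row] for row in spectrum]

def ring_sums(power):
    """Sum the power over rings around the centre (assumed ring definition)."""
    rows, cols = len(power), len(power[0])
    centre_y, centre_x = rows // 2, cols // 2
    sums = [0.0] * (min(rows, cols) // 2)
    for y in range(rows):
        for x in range(cols):
            radius = int(math.hypot(y - centre_y, x - centre_x))
            if radius < len(sums):      # pixels beyond the last ring are ignored
                sums[radius] += power[y][x]
    return sums

def normalise(vector):
    """Divide by the first coefficient, as in Section 3.3."""
    return [value / vector[0] for value in vector]

def descriptor(spectra):
    """Concatenate the normalised ring sums of every image plane."""
    result = []
    for spectrum in spectra:
        result += normalise(ring_sums(squared_modulus(fft_shift(spectrum))))
    return result

# Toy 4 x 4 spectrum: DC term 2 at (0, 0) and one coefficient 1j at (0, 1).
toy = [[0j] * 4 for _ in range(4)]
toy[0][0] = 2 + 0j
toy[0][1] = 1j
print(descriptor([toy, toy]))   # two planes, as for GCFD: [1.0, 0.25, 1.0, 0.25]
```

With this ring definition, a 128 × 128 plane yields 64 coefficients, so three planes give the 64 × 3 coefficients quoted for GFD and two planes the 64 × 2 quoted for GCFD.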

Fig. 5 Algorithm 3: FD kernel

4.1 Hardware selection

The experiments were carried out on two different personal computers: the first with an NVIDIA GeForce GT525M graphics card and an Intel Core I3-2350M 2.30 GHz CPU with 4 GB of memory, and the second with an NVIDIA GeForce GTX480 and an Intel Core I7-3770 3.40 GHz CPU with 8 GB of memory. For a reasonable comparison, the GPU software was coded in the same programming language with CUDA Toolkit version 6.5. The CPU software was created in Visual C++ 2010. Table 1 summarises the hardware used in this paper.

Table 1 Summary of hardware features for the CPUs and the GPUs used

| Processor | CPU1 | CPU2 | GPU1 | GPU2 |
|---|---|---|---|---|
| Commercial model | Core I3-2350M | Core I7-3770 | GeForce GT525M | GeForce GTX480 |
| Number of cores at speed | 2 at 2.30 GHz | 4 at 3.40 GHz | 96 at 1.2 GHz | 480 at 1.4 GHz |
| Memory speed, MHz | 665 | 665 | 900 | 924 |
| Memory bus width, bits | 64 | 64 | 128 | 384 |
| Memory bandwidth, GB/s | 21.3 | 25.6 | 28.8 | 177 |
| Memory size (type) | 4 GB (DDR2) | 8 GB (DDR3) | 1024 MB (GDDR3) | 1536 MB (GDDR5) |
| Bus from/to CPU | does not apply | does not apply | PCI-e 2.0 × 16 | PCI-e 2.0 × 16 |

4.2 Impact of block size variation

We chose to implement the FD algorithm for a grey-scale image only, using the GT525M graphics card, to see the influence of the block size on the execution time. Fig. 6 depicts the performance comparison for different image sizes. We tested images from 256 × 256 to 2048 × 2048 resolution and from 1 to 1024 threads per block. The X-axis represents the number of threads per block and the Y-axis the execution time in milliseconds. For the 2048 × 2048 image size, we obtained execution times only from 32 threads per block upwards, and for 1024 × 1024 only from four threads per block upwards. These exceptions are due to the configuration limits of our equipment.

Fig. 6 GPU performance for various block sizes for different image sizes

The results show that the time performance improves as the number of threads per block increases. The execution times obtained for the 256 × 256 resolution do not give a clear picture since all values are very low, of the order of a few milliseconds. Therefore, we cannot identify the best block size from them. However, for the other curves, the execution time for each image size decreases as the block size increases up to 256 threads per block. After that, for 512 and 1024 threads per block, the time increases again for almost all image resolutions. It is clear that the minimum
execution time is reached at 256 threads per block. Consequently, we decided to fix this block size for all kernel implementations on the GPU. The number of blocks used in this paper depends on the number of threads per block and the image size. The number of blocks is therefore: num_block = image_size/(4*threads_block).

Also very helpful is the CUDA Occupancy Calculator [13], a spreadsheet that determines how many thread blocks may run in parallel on a specific GPU, depending on the resources (registers and shared memory) that the threads use. With this tool, one can estimate how well a specific implementation makes use of the capacity of the GPU and how many threads will be able to run at the same time. We used this tool to tune the kernel launch parameters (threads and blocks) and subsequently to confirm our findings.

We empirically determined the optimal configuration using the CUDA Occupancy Calculator. After entering the number of threads per thread block, the registers used per thread and the total shared memory used per thread block in bytes, we found a configuration of 256 threads per block. This configuration requires 29 registers per thread and 68 B of shared memory per block.

4.3 GFD implementation

In this section, we evaluate the GFD on two different platforms. The first is a GT525M GPU and the second a GTX480 GPU. For each platform, we compared the execution time of both FD models on the CPU and on the GPU. For all experimental results, the software times are measured under C++ with the predefined timing function QueryPerformanceCounter, called as QueryPerformanceCounter(&depart) and QueryPerformanceCounter(&fin) around the measured section. The GPU time is measured using the NVIDIA Compute Visual Profiler of CUDA Toolkit 6.5 [14].

In Table 2, we can observe the GFD execution times for different image sizes on the GT525M platform, including computational and communication times. It is interesting to note the significant difference between GPU and CPU execution times.

Table 2 Summary of GFD execution times [milliseconds (ms)] under GeForce GT525M GPU and Intel Core I3-2350M CPU

| Image size | CPU | GPU | Speed-up |
|---|---|---|---|
| 64 × 64 | 8.36 | 0.25 | 33.44 |
| 128 × 128 | 34.05 | 0.86 | 39.6 |
| 256 × 256 | 141.47 | 4.42 | 32 |
| 512 × 512 | 614.36 | 30.98 | 19.84 |
| 1024 × 1024 | 2517.57 | 253.48 | 9.94 |
| 2048 × 2048 | 11227.32 | 1820.24 | 6.16 |

Table 3 shows the execution times for the second platform, the GTX480 GPU. At first sight, the times are much lower than those obtained on the first platform. The higher number of cores and streaming multiprocessors (SMs) and the higher memory bandwidth of the GTX480 explain its performance compared with the GT525M. Similar findings hold for the difference between CPU and GPU execution times: the GPU reduces the times by a factor of at least nine compared with the CPU. By contrast, the speed-up factor in Table 3 behaves differently from that in the previous table. It increases with the image size up to 256 × 256 and then decreases for larger images.

Table 3 Summary of GFD execution times (ms) under GeForce GTX480 GPU and Intel Core I7-3770 CPU

| Image size | CPU | GPU | Speed-up |
|---|---|---|---|
| 64 × 64 | 2.15 | 0.21 | 10.24 |
| 128 × 128 | 9.52 | 0.52 | 18.31 |
| 256 × 256 | 35.45 | 1.36 | 26.07 |
| 512 × 512 | 152.59 | 6.4 | 23.85 |
| 1024 × 1024 | 650.11 | 55.85 | 11.64 |
| 2048 × 2048 | 2847.28 | 311.22 | 9.15 |

For the GT525M, the speed-up first rises and then falls; this is due to the communication time, especially for large images, since the memory bandwidth is only 28.8 GB/s. With the GTX480, the behaviour of the speed-up changes: it reaches a peak for a medium image size (256 × 256) because the bandwidth on this platform is larger, reaching 177 GB/s.

Fig. 7 Speed-up factor of GFD between CPU and GPU
(a) Using GT525M GPU and I3-2350M CPU, (b) Using GTX480 GPU and I7-3770 CPU

Fig. 7a focuses on the speed-up obtained with the GT525M graphics card. The speed-up factor is significant for most image sizes and reaches 39 times between the CPU and GPU algorithms.

In the same way, Fig. 7b shows the speed-up factor of the GTX480 GPU relative to its host CPU. Here too, the GPU improves the execution times of the application enormously; the speed-up compared with the CPU reaches 26 times.

4.4 GCFD implementation

In this section, we study the performance of the GCFD and compare the experimental results on the different platforms. As for the tables above, the GPU algorithm boosts the performance compared with the CPU. Table 4 summarises the full experimental results of the GCFD implementation on the GT525M. We note an enormous gain in execution time between the GPU and CPU
algorithms. For the speed-up factor, the results shown in Table 4 behave like those in Table 2: the factor is very high for the smallest image sizes and then decreases until it reaches around seven times.

Table 4 Summary of GCFD execution times (ms) under GeForce GT525M GPU and Intel Core I3-2350M CPU

| Image size | CPU | GPU | Speed-up |
|---|---|---|---|
| 64 × 64 | 8.51 | 0.16 | 53.18 |
| 128 × 128 | 33.69 | 0.61 | 55.23 |
| 256 × 256 | 127.11 | 3.18 | 39.98 |
| 512 × 512 | 528.07 | 22.33 | 23.65 |
| 1024 × 1024 | 2394.42 | 191.13 | 12.53 |
| 2048 × 2048 | 9462.55 | 1338.56 | 7.07 |

Table 5 shows the importance of the hardware in reducing the execution time of an application. As in Table 3, the speed-up factor increases for image sizes below 512 × 512 and then decreases to a factor of 12 for the largest image size. Here, combining the more powerful hardware, the GTX480 GPU, with GCFD achieves a speed-up of more than 37 times.

Table 5 Summary of GCFD execution times (ms) under GeForce GTX480 GPU and Intel Core I7-3770 CPU

| Image size | CPU | GPU | Speed-up |
|---|---|---|---|
| 64 × 64 | 2.43 | 0.14 | 17.36 |
| 128 × 128 | 8.69 | 0.34 | 25.56 |
| 256 × 256 | 34.39 | 0.92 | 37.4 |
| 512 × 512 | 138.33 | 4.24 | 32.63 |
| 1024 × 1024 | 591.94 | 36.09 | 16.41 |
| 2048 × 2048 | 2485.96 | 206.97 | 12.02 |

The decrease of the speed-up shown in Tables 2–5 is likely due to the configuration limits of both graphics cards. For GFD, we ran the parallel FD algorithm three times and for GCFD twice, according to the number of colour image planes. Despite the decline in speed-up for large image sizes, we still find a significant acceleration factor.

Last but not least, Fig. 8 presents the bar charts of the GCFD speed-up, indicating the GPU performance compared with the CPU. For the GT525M (see Fig. 8a), the speed-up factor between the optimal GPU and CPU algorithms reached 55. For the GTX480 (see Fig. 8b), the speed-up between GPU and CPU is about 37 times.

Fig. 8 Speed-up factor of GCFD between CPU and GPU
(a) Using GT525M GPU and I3-2350M CPU, (b) Using GTX480 GPU and I7-3770 CPU

4.5 Related work

There have been some published works that deal with GFD or GCFD optimisation. All of these efforts, however, are targeted toward DSP or FPGA platforms. GFD-related GPGPU applications, on the other hand, are rare. Smach [15] has implemented the GFD using an FPGA Virtex-II 3000K and obtained an execution time of 2.24 ms for a 128 × 128 image size, against 0.86 and 0.52 ms obtained, respectively, with our two graphics cards, the GT525M and the GTX480, for the same image size.

Another comparison can be made between the implementation of Mohamed et al. [6] and ours concerning the GCFD. Mohamed et al. have implemented this descriptor using an FPGA Spartan 3A DSP-XC3SD3400A with two different architectures. They obtained an execution time of 0.718 ms for a 128 × 128 image size using the first architecture and 0.704 ms using the second. With the GT525M and GTX480 GPUs, we obtained 0.61 and 0.34 ms, respectively. Hence, we can conclude that our GPU implementations are more efficient than these FPGA implementations.

5 Conclusion

In this paper, we presented an extensive case study of FD programming. This work applied both FD models, GFD and GCFD, on CPU and GPU systems. We tested our algorithms on two platforms: the first containing an Intel I3-2350M CPU and a GT525M GPU, and the second an Intel I7-3770 CPU and a GTX480 GPU. In each case, we varied the image resolution between 64 × 64 and 2048 × 2048. We also studied the impact of varying the block size to find the best execution time, and chose to work with the optimal number of threads per block for all applications. We achieved a significant performance gain in terms of execution time on the GPU. Our results indicate that the GPU algorithm may improve the overall application performance in many cases compared with a purely CPU algorithm. Our implementation attains much faster execution times, with speed-up factors of more than six in all cases, and more than seven for GCFD, compared with the CPU.

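As a rough illustration of the FPGA comparison in Section 4.5, the ratio between the quoted GFD execution times can be worked out directly. The sketch below is not part of the original implementation: the helper name `speedup` and the variable names are assumptions introduced here, and only the three timings (2.24, 0.86 and 0.52 ms for a 128 × 128 image) come from the text.

```python
def speedup(baseline_ms: float, candidate_ms: float) -> float:
    """Return how many times faster the candidate is than the baseline."""
    if candidate_ms <= 0:
        raise ValueError("candidate time must be positive")
    return baseline_ms / candidate_ms


# GFD execution times for a 128 x 128 image, as quoted in Section 4.5.
FPGA_MS = 2.24  # Virtex-II FPGA implementation of Smach [15]
GPU_MS = {"GT525M": 0.86, "GTX480": 0.52}  # our two graphics cards

# Hypothetical helper usage: ratio of FPGA time to each GPU time.
ratios = {gpu: speedup(FPGA_MS, t) for gpu, t in GPU_MS.items()}
for gpu, ratio in ratios.items():
    print(f"{gpu}: {ratio:.2f}x faster than the FPGA")
```

Under these figures, the GT525M comes out roughly 2.6 times and the GTX480 roughly 4.3 times faster than the FPGA implementation for this image size. The same ratio cannot be formed for the GCFD comparison from the text available here, since only the GPU timings are given.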
6 References

[1] Pereira, A.D., Ramos, L., Góes, L.: ‘PSkel: a stencil programming framework for CPU-GPU systems’, Concurrency Comput. Pract. Exp., 2015, 27, (17), pp. 4938–4953
[2] NVIDIA Corporation: ‘NVIDIA CUDA_6.5_Performance_Report Version 6.5’ (NVIDIA, 2014), pp. 1–30
[3] Smach, F., Miteran, J., Atri, M., et al.: ‘An FPGA-based accelerator for Fourier descriptors computing for color object recognition using SVM’, J. Real-Time Image Process., 2007, 2, (4), pp. 249–258
[4] Batard, T., Berthier, M., Saint-Jean, C.: ‘Clifford–Fourier transform for color image processing’, in (Eds.): ‘Geometric algebra computing’ (Springer-Verlag London Press, 2010, 1st edn.), pp. 135–162
[5] Mennesson, J., Saint-Jean, C., Mascarilla, L.: ‘Color object recognition based on a Clifford–Fourier transform’, in (Eds.): ‘Guide to geometric algebra in practice’ (Springer-Verlag London Press, 2011, 1st edn.), pp. 175–191
[6] Mohamed, H., Mohamed, A., Smach, F.: ‘Hardware implementation of GCFD for color images recognition’. 2014 World Congress Computer Applications and Information Systems, January 2014, pp. 1–5
[7] Haythem, B., Mohamed, H., Marwa, C., et al.: ‘Fast generalized Fourier descriptor for object recognition of image using CUDA’. 2014 World Symp. Computer Applications and Research, January 2014, pp. 1–5
[8] NVIDIA Corporation: ‘CUFFT LIBRARY USER'S GUIDE version 6.5’ (NVIDIA, 2014), pp. 1–76
[9] Haythem, B., Fatma, S., Marwa, C., et al.: ‘Accelerating Fourier descriptor for image recognition using GPU’, Appl. Math. Inf. Sci., 2016, 10, (1), pp. 297–306
[10] Chouchene, M., Sayadi, F.E., Bahri, H., et al.: ‘Optimized parallel implementation of face detection based on GPU component’, Microprocess. Microsyst., 2015, 39, (6), pp. 393–404
[11] Zhang, D., Lu, G.: ‘Shape based image retrieval using generic Fourier descriptor’, Signal Process., Image Commun., 2002, 17, (10), pp. 825–848
[12] NVIDIA Corporation: ‘CUDA C PROGRAMMING GUIDE version 6.5’ (NVIDIA, 2014), pp. 1–241
[13] CUDA Occupancy Calculator. Available at http://www.developer.download.nvidia.com/compute/cuda/CUDA_Occupancy_calculator.xls, accessed 14 November 2016
[14] NVIDIA Corporation: ‘Profiler user's guide’ (NVIDIA, 2014), pp. 1–87
[15] Smach, F.: ‘Etude et implantation FPGA des descripteurs de Fourier généralisés combinés aux SVM’. PhD thesis, Burgundy University, 2007

