Image Feature Extraction Algorithm Based On CUDA Architecture: Case Study GFD and GCFD
Research Article
Haythem Bahri1, Fatma Sayadi1, Randa Khemiri1, Marwa Chouchene1, Mohamed Atri1
1Laboratory of Electronics and Micro-electronics, Faculty of Sciences, Monastir University, 5000 Monastir, Tunisia
E-mail: bahri.haythem@hotmail.com
Abstract: Optimising the computing time of applications is an increasingly important task in many areas, including scientific
and industrial applications. The graphics processing unit (GPU) is considered one of the most powerful engines for computationally
demanding applications since it offers a highly parallel architecture. In this context, the authors introduce an algorithm to
optimise the computing time of feature extraction methods for colour images. They choose the generalised Fourier descriptor
(GFD) and generalised colour Fourier descriptor (GCFD) models as methods to extract image features for various
applications such as real-time colour object recognition or image retrieval. They compare the experimental computing times
on the central processing unit (CPU) and on the GPU. They also present a case study of these descriptors on two
platforms: an NVIDIA GeForce GT525M and an NVIDIA GeForce GTX480. Their experimental results demonstrate that the
execution time can be reduced considerably, by up to 34× for GFD and 56× for GCFD.
$$f(x) = \frac{1}{2\pi} \int_{-\infty}^{+\infty} F(v)\, e^{ivx}\, dv \tag{2}$$
In certain cases, the original function can be recovered from these equations without losing information. They can be written in two variables u and v as in (3), with the inverse transform shown in (4):
$$F(u, v) = \iint_{-\infty}^{+\infty} f(x, y)\, e^{-2i\pi(ux + vy)}\, dx\, dy \tag{3}$$
$$f(x, y) = \iint_{-\infty}^{+\infty} F(u, v)\, e^{2i\pi(ux + vy)}\, du\, dv \tag{4}$$
This transform is an important image processing tool, used to decompose an image into its sine and cosine components. The output of the transform represents the image in the frequency domain (also called the Fourier domain), whereas the input image is its spatial-domain equivalent. For digital image processing, FDs use the discrete Fourier transform (DFT), which is the basis of most digital image processing. The DFT and inverse DFT can be written as follows:
$$F(u) = \sum_{x=1}^{M} f(x)\, e^{-2i\pi ux/M} \tag{5}$$
$$f(x) = \sum_{u=1}^{M} F(u)\, e^{2i\pi ux/M} \tag{6}$$
The intensity of a digital image $f(x, y)$ is a spatial function defined on $\mathbb{R}^2$. It is transformed into a frequency function on $\mathbb{R}^2$ using the DFT, as in (7); the inverse transform is given in (8):
$$F(u, v) = \sum_{x=1}^{M} \sum_{y=1}^{N} f(x, y)\, e^{-2i\pi\left[(ux/M) + (vy/N)\right]} \tag{7}$$
$$f(x, y) = \sum_{u=1}^{M} \sum_{v=1}^{N} F(u, v)\, e^{2i\pi\left[(ux/M) + (vy/N)\right]} \tag{8}$$
Initially, divide the colour image into colour planes. Then transform these planes using the FFT. Afterwards, group the FFT image blocks and calculate the squared moduli of the resulting images. Finally, apply the suitable algorithm to determine the FD.
2.2.1 Generalised FD: Historically, GFD is the first FD model used to build feature vector components in colour object recognition or image retrieval. This model computes the descriptor on each colour channel of an image separately. It uses the FFT to extract the features of each colour channel of the image.
The image intensity relations were defined in Cartesian coordinates, but Zhang and Lu [11] have used them in two-dimensional polar coordinates as in (9):
$$F(\rho, \phi) = \sum_{r} \sum_{i} f(r, \theta_i)\, e^{2i\pi\left[(r/R)\rho + (2\pi/T)\phi\right]} \tag{9}$$
To use the full information of the colour object, one of the most conventional GFD methods is to calculate the FD for the three planes red, green and blue (RGB) separately. These descriptors are then concatenated to build a single feature descriptor. This approach was applied successfully to generalised Fourier descriptors by Smach [1], who defined generalised FD invariants for all digital colour images. The following relation gives these invariants:
$$D_f(r) = \int_{0}^{2\pi} \left|\hat{f}(r, \theta)\right|^2 d\theta \tag{10}$$
where $\hat{f}$ is the Fourier transform of $f$ expressed in polar coordinates. The integral is replaced by a finite sum in the discrete domain to produce the FD. In fact, we summarise the computation of the descriptors in four main steps. First, decompose the colour image into three separate channel images. Then, apply the FFT algorithm to each channel and compute the characterising descriptor for each channel. Finally, concatenate the resulting vectors into a single descriptor of the colour image (see Fig. 1a), to be used as an input to the classifier.
2.2.2 Generalised colour FD: In this section, we present the second model of FD: GCFD. It can minimise the loss of data
$$\mathrm{GCFD}_{\parallel B} = \int_{0}^{2\pi} \left|F_{\parallel B}(r, \theta)\right|^2 d\theta \tag{11}$$
$$\mathrm{GCFD}_{\perp B} = \int_{0}^{2\pi} \left|F_{\perp B}(r, \theta)\right|^2 d\theta \tag{12}$$
provide the feature extraction on GPU.
Algorithm 1 (see Fig. 3) shows the pseudo-code of the designed parallel FD algorithm for the CPU portion. It starts with the declaration of the algorithm variables: one to receive the image and another to store the FD result. Before splitting the image, it should choose one of the FD models (see Fig. 1) to perform the
Fig. 6 GPU performance for various block sizes for different image sizes
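The block-count rule used in this block-size study, num_block = image_size/(4*threads_block), can be checked with a small host-side sketch. It is an illustration under stated assumptions rather than the authors' code: image_size is taken to be the pixel count (width × height), the function name numBlocks is hypothetical, and the division is rounded up, a safeguard not stated in the text, so that sizes that are not a multiple of 4*threads_block are still covered.

```cpp
#include <cassert>
#include <cstddef>

// Block count following the rule quoted in the text:
//   num_block = image_size / (4 * threads_block)
// Assumption: image_size is the pixel count (width * height).
// Assumption: the division is rounded up so that image sizes that are not
// a multiple of 4 * threads_block are still fully covered.
std::size_t numBlocks(std::size_t width, std::size_t height,
                      std::size_t threadsPerBlock = 256) {
    assert(threadsPerBlock > 0);
    const std::size_t imageSize = width * height;
    const std::size_t perBlock = 4 * threadsPerBlock;
    return (imageSize + perBlock - 1) / perBlock;
}
```

With 256 threads per block, numBlocks(64, 64) gives 4 blocks and numBlocks(2048, 2048) gives 4096 blocks, which covers the range of image sizes used in the experiments.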
peak of execution time at 256 threads per block. Consequently, we decided to fix this block size for all kernel implementations on the GPU. The number of blocks used in this paper depends on the number of threads per block and the image size. Therefore, the number of blocks is given by: num_block = image_size/(4*threads_block).
Also very helpful is the CUDA Occupancy Calculator [13], a spreadsheet that determines how many thread blocks may run in parallel on a specific GPU, depending on the resources (registers and shared memory) the threads use. With this tool, one can estimate how well a specific implementation uses the capacity of the GPU, and in particular how many threads will be able to run at the same time. We used this tool to tune the kernel launch parameters (threads and blocks) and subsequently to confirm our findings.
We determined the optimal configuration empirically using the CUDA Occupancy Calculator. After entering the number of threads per block, the registers used per thread and the total shared memory used per block in bytes, we found a configuration of 256 threads per block. This configuration requires 29 registers per thread and 68 B of shared memory per block.
4.3 GFD implementation
In this section, we evaluate the GFD on two different GPU platforms. The first one is a GT525M GPU and the second one is a GTX480 GPU. For each platform, we compared the execution time of both FD models on the CPU and on the GPU. For all experimental results, the software times are measured in C++ with the Windows high-resolution timer function, called as QueryPerformanceCounter(&depart) and QueryPerformanceCounter(&fin). The GPU time is measured using the NVIDIA Compute Visual Profiler of CUDA Toolkit 6.5 [14].
In Table 2, we can observe the GFD execution times for different image sizes on the GT525M platform, including computational and communication times. It is interesting to note the significant difference between GPU and CPU execution times.
Table 3 shows the execution times for the second platform, the GTX480 GPU. At first sight, we note that the times are much lower than those obtained using the first platform. The higher number of cores and of SMs, and the higher memory bandwidth of the GTX480, explain its performance compared with the GT525M. Similar findings hold for the difference between CPU and GPU execution times. Indeed, the GPU reduces execution times by a factor of at least nine compared with the CPU. By contrast, the speed-up factor in Table 3 behaves differently from that of the previous table. It increases with image size up to 256 × 256, then begins to decrease for large images.
For the GT525M, we note that the speed-up first rises and then falls; this is due to the communication time, especially for large images, since the memory bandwidth is only 28.8 GB/s. With the GTX480, the behaviour of the speed-up changes: it reaches a peak for the medium image size (256 × 256) because the bandwidth of this platform is larger, reaching 177 GB/s.
In Fig. 7a, we focus on the speed-up obtained with the GT525M graphics card. We can note that the speed-up factor is significant for most of the image sizes. It can reach 39 times between the CPU and GPU algorithms.
In the same way, Fig. 7b shows the speed-up factor comparing the GTX480 GPU and its associated CPU. Here too, we found that the GPU greatly improved the execution times of the application. The speed-up compared with the CPU can reach up to 26 times.
4.4 GCFD implementation
In this section, we study the performance of the GCFD and compare the experimental implementation results on the different platforms. According to the tables above, the GPU algorithm boosts the performance compared with the CPU. In Table 4, we summarise the full experimental results of the GCFD implementation on the GT525M. We note an enormous gain in execution time between GPU and CPU
algorithms. For the speed-up factor, the results shown in Table 4 behave in the same way as those of Table 2: the factor is very high for the smallest image size, then decreases until it reaches around seven times.
Table 5 shows the importance of the hardware in reducing the execution time of an application.
As in Table 3, the speed-up factor increases for image sizes below 512 × 512 and then decreases to reach a factor of 12 for the biggest image size. Here, we combine the efficient hardware, the GTX480 GPU, with GCFD to achieve a speed-up of more than 37 times.
The decrease of the speed-up shown in Tables 2–5 was likely due to the configuration limits of both graphics cards. For GFD, we ran the parallel FD algorithm three times and, for GCFD, twice, according to the number of colour image planes. Despite the decline in speed-up for large image sizes, we still find a significant acceleration factor.
Last but not least, we present the bar charts of the GCFD speed-up, indicating the GPU performance compared with the CPU. For the GT525M (see Fig. 8a), the speed-up factor between the optimal GPU and CPU algorithms reached 55. For the GTX480 (see Fig. 8b), the speed-up between GPU and CPU is about 37 times.
4.5 Related work
There have been some published works that deal with GFD or GCFD optimisation. All of these efforts, however, are targeted toward DSP or FPGA platforms. GFD-related GPGPU applications, on the other hand, are rare. Smach [15] implemented the GFD using an FPGA Virtex-II 3000K and obtained an execution time of 2.24 ms for a 128 × 128 image size, against 0.86 and 0.52 ms obtained, respectively, with our two graphics cards, the GT525M GPU and the GTX480 GPU, for the same image size.
Another comparison can be made between the implementation of Mohamed et al. [6] and ours concerning the GCFD. Mohamed et al. implemented this descriptor using an FPGA Spartan 3A DSP-XC3SD3400A with two different architectures. They obtained an execution time of 0.718 ms for a 128 × 128 image size using the first architecture and 0.704 ms using the second one. With the GT525M GPU and the GTX480 GPU, however, we obtained 0.61 and 0.34 ms, respectively. Hence, we can conclude that our GPU implementations are more efficient than these FPGA implementations.
5 Conclusion
In this paper, we presented an extensive case study of FD programming. This work applied both FD models, GFD and GCFD, on CPU and GPU systems. We also tested our algorithms on two platforms: the first contains an Intel I3-2350M CPU and a GT525M GPU, and the second an Intel I7-3770 CPU and a GTX480 GPU. In each case, we varied the image resolution between 64 × 64 and 2048 × 2048. We also studied the impact of varying the block size to find the best execution time, and chose to work with the optimal number of threads per block for all applications. We achieved a significant performance gain in terms of execution time on the GPU. Our results indicate that the GPU algorithm may improve the overall application performance in many cases compared with a purely CPU algorithm. Our implementation attains much faster execution times, with speed-up factors of at least six to seven times compared with the CPU.
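For reference, the GFD steps summarised in Section 2.2.1 (split the image into channels, transform each channel, take the squared modulus, sum over the angle for each radius, concatenate) can be sketched on the CPU as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: a naive DFT stands in for the FFT, indices are 0-based rather than the 1-based indices of (7), and the angular integral of (10) is approximated by binning frequency samples by rounded radius, a discretisation the text does not specify. The names Channel, squaredSpectrum, ringDescriptor, gfd and nRadii are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using Channel = std::vector<double>;  // n * n samples, row-major

// Naive 2D DFT of one n x n channel, returning |F(u, v)|^2.
// Assumption: stands in for the FFT used in the paper; O(n^4), small n only.
std::vector<double> squaredSpectrum(const Channel& f, std::size_t n) {
    assert(f.size() == n * n);
    const double twoPi = 2.0 * std::acos(-1.0);
    std::vector<double> power(n * n);
    for (std::size_t u = 0; u < n; ++u)
        for (std::size_t v = 0; v < n; ++v) {
            std::complex<double> sum(0.0, 0.0);
            for (std::size_t x = 0; x < n; ++x)
                for (std::size_t y = 0; y < n; ++y) {
                    const double phase =
                        -twoPi * (double(u * x) + double(v * y)) / double(n);
                    sum += f[x * n + y] * std::polar(1.0, phase);
                }
            power[u * n + v] = std::norm(sum);  // squared modulus
        }
    return power;
}

// Discrete analogue of (10): for each radius r, sum |F|^2 over the
// frequency samples whose rounded distance from the origin equals r.
// Assumption: this ring binning is one possible discretisation.
std::vector<double> ringDescriptor(const std::vector<double>& power,
                                   std::size_t n, std::size_t nRadii) {
    // Map a DFT index to a signed frequency in (-n/2, n/2].
    auto signedFreq = [n](std::size_t k) {
        const long s = static_cast<long>(k);
        return s > static_cast<long>(n / 2) ? s - static_cast<long>(n) : s;
    };
    std::vector<double> d(nRadii, 0.0);
    for (std::size_t u = 0; u < n; ++u)
        for (std::size_t v = 0; v < n; ++v) {
            const double radius =
                std::hypot(double(signedFreq(u)), double(signedFreq(v)));
            const std::size_t r = static_cast<std::size_t>(std::lround(radius));
            if (r < nRadii) d[r] += power[u * n + v];
        }
    return d;
}

// GFD of an RGB image: one ring descriptor per channel, concatenated.
std::vector<double> gfd(const Channel& r, const Channel& g, const Channel& b,
                        std::size_t n, std::size_t nRadii) {
    std::vector<double> out;
    for (const Channel* c : {&r, &g, &b}) {
        const std::vector<double> d =
            ringDescriptor(squaredSpectrum(*c, n), n, nRadii);
        out.insert(out.end(), d.begin(), d.end());
    }
    return out;
}
```

Calling gfd(red, green, blue, n, nRadii) returns a vector of 3 × nRadii values. Because only the squared modulus of the spectrum is kept, a circular shift of the image leaves the descriptor unchanged, which is the kind of invariance the descriptor is designed to provide.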