CUDA Application For Canny Edge Detection

Parallel Computing
Project 1 - Canny Edge Detector with CUDA and OpenMP
Authors:
Madalena Blanc, Nº 93125
João Vieitas, Nº 97632
Professor:
Nuno Lau
May 13, 2024
Contents
1 Introduction
2 Objectives
3 Implementation
3.1
3.2
3.3
3.4
3.5
4 Results and Discussion
4.1 OpenMP
4.2 CUDA
5 Conclusion
1 Introduction
This project explores the capabilities of parallelization, specifically NVIDIA CUDA and
OpenMP, for tasks involving many vector calculations. CPUs and GPUs differ in capability
because they were designed with different goals in mind. While the CPU is optimized to
execute a sequence of operations (a thread) as fast as possible, and can run some of these
threads in parallel, the GPU is designed to excel at executing thousands of them in
parallel.[1]
Edge detection is an important task in systems such as computer vision. This process
serves to simplify the analysis of images by reducing the amount of data to be processed
while preserving useful structural information about object boundaries. [2]
In this project, gray-scale images will be processed to detect the edges of the image. The
image will be modeled as a matrix of integers whose values range from 0 to 255. The values
in the image specify the pixel color, i.e., a value of 0 indicates a black pixel and a value of
255 indicates a white pixel. The image will be processed for edge detection using the Canny
Edge Detector[2]. In summary, this algorithm finds the edges in the image in these stages:
• A Gaussian filter is applied to minimize the effects of noise;
• The gradient of the resulting image is obtained;
• A non-maximum suppression approach determines the best candidates for edges among
several neighbors;
• The edges are traced using hysteresis.
How these steps were implemented for parallel execution is described below.
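As an illustration of the first stage in the list above, the following minimal sketch builds a normalized Gaussian kernel that can then be convolved with the image. The function name, the signature, and the choice of kernel size and sigma are assumptions made for this example and are not taken from the project source.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Build a normalized n x n Gaussian kernel (n odd) with standard deviation sigma.
// Hypothetical helper for illustration; not the project's own code.
std::vector<float> gaussian_kernel(int n, float sigma) {
    std::vector<float> k(static_cast<std::size_t>(n) * n);
    const int half = n / 2;
    float sum = 0.0f;
    for (int y = 0; y < n; ++y) {
        for (int x = 0; x < n; ++x) {
            const float dx = static_cast<float>(x - half);
            const float dy = static_cast<float>(y - half);
            const float v = std::exp(-(dx * dx + dy * dy) / (2.0f * sigma * sigma));
            k[y * n + x] = v;
            sum += v;
        }
    }
    // Normalize so the weights sum to 1 and overall brightness is preserved.
    for (float& v : k) v /= sum;
    return k;
}
```

A call such as gaussian_kernel(5, 1.0f) returns 25 weights, largest at the center and symmetric around it.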
2 Objectives
The objective of this work is to start from the source code cp_canny, which includes a C
implementation of the Canny Edge Detector, and develop improved versions of it using the
CUDA and OpenMP platforms. These versions parallelize the code in order to make it more
efficient.
3 Implementation
Topics 1, 2, 3, 4, and 5 were selected for delivery; each is presented below.
3.1
In this first topic, the aim is to develop a CUDA kernel that replaces the convolution()
function, and to alter the cannyDevice() function so that it computes the Canny edges of
an image using the new kernel to determine the vertical and horizontal gradients of the
image.
Figure 1: Convolution Kernel Code
This kernel is applied after the Gaussian filter, which smooths the image, and applies the
Sobel filters Gx and Gy used to calculate the gradient magnitude. Its implementation looks
very similar to the original function, but running the operation on the GPU lets the
gradient calculation exploit the device's parallel processing power, whatever the size of
the image.
3.2
In this second topic, the aim is to develop a CUDA kernel that replaces the
non_maximum_supression() function, and to change the cannyDevice() function (from task 1)
so that it computes the Canny edges of an image using the new kernel. The primary purpose
of non-maximum suppression is to thin the edges by suppressing non-maximum gradient values
in the direction of the gradient.
This step examines the gradient magnitude and the gradient direction and, for each pixel,
compares the gradient magnitude with that of its neighboring pixels along the gradient
direction. If its magnitude is not greater than that of its neighbors, the pixel is
suppressed. This retains only the local maxima, which are the most likely candidates for
edges.
This function benefits from parallelization because the direction and magnitude
calculations for each pixel can be distributed across several threads, making it easier to
scale across pictures of different sizes.
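The comparison described above can be sketched as a sequential reference in which each loop iteration corresponds to the work of one GPU thread. This is an illustrative version under assumptions, not the project's kernel: the function name, the quantization of the angle into four directions, and the convention that Gy is the derivative along increasing row index are all choices made for this example.

```cpp
#include <cmath>
#include <vector>

// Keep G at (x, y) only if it is a strict local maximum along the gradient
// direction. Gx, Gy and G are row-major nx x ny arrays; borders are set to 0.
std::vector<float> non_max_suppression(const std::vector<float>& Gx,
                                       const std::vector<float>& Gy,
                                       const std::vector<float>& G,
                                       int nx, int ny) {
    const float kPi = 3.14159265358979f;
    std::vector<float> out(G.size(), 0.0f);
    for (int y = 1; y < ny - 1; ++y) {
        for (int x = 1; x < nx - 1; ++x) {
            const int c = y * nx + x;
            // Quantize the gradient angle to 0, 45, 90 or 135 degrees.
            float angle = std::atan2(Gy[c], Gx[c]) * 180.0f / kPi;
            if (angle < 0.0f) angle += 180.0f;
            int dx, dy;
            if (angle < 22.5f || angle >= 157.5f) { dx = 1;  dy = 0; }
            else if (angle < 67.5f)               { dx = 1;  dy = 1; }
            else if (angle < 112.5f)              { dx = 0;  dy = 1; }
            else                                  { dx = -1; dy = 1; }
            // The two neighbors on either side along the gradient direction.
            const float a = G[(y + dy) * nx + (x + dx)];
            const float b = G[(y - dy) * nx + (x - dx)];
            if (G[c] > a && G[c] > b) out[c] = G[c];
        }
    }
    return out;
}
```

For a horizontal gradient, a ridge of magnitude 3 flanked by values of 1 survives while its flanks are set to 0, which is the thinning effect the step is meant to produce.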
Figure 2: Non-maximum Suppression Code
Figure 3: Non-maximum Suppression illustration, from [3]
3.3
The third topic is independent of the others: its goal is to develop an OpenMP
implementation of the Canny Edge Detector rather than further CUDA kernels. For this
purpose, some changes were made to the C++ source file cannyOpenMP.cpp.
The new function cannyOpenMP() is very similar to the given function cannyHost(), but
instead of the sequential approach it uses a #pragma omp directive (from OpenMP) with
several clauses to parallelize the code.
• #pragma omp: Prefix that introduces an OpenMP directive to the compiler.
• parallel for: Directive that creates a parallel region in which the iterations of a
loop are executed in parallel by multiple threads.
• collapse(2): Clause that collapses nested loops into a single iteration space. Here it
collapses two loops into one, giving the runtime more iterations to distribute among
threads.
• shared(G, after_Gx, after_Gy): Clause specifying that the variables G, after_Gx, and
after_Gy are shared among all threads in the parallel region, so every thread can read
and modify them.
• num_threads(8): Clause that sets the number of threads used in the parallel region, in
this case 8.
Figure 4 shows the loops to which the OpenMP directive applies.
Figure 4
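How the directive and clauses listed above combine on a nested loop can be sketched as follows. The loop body is an assumption: the variable names G, after_Gx and after_Gy suggest a gradient-magnitude computation, but the function name, signature and use of std::vector are hypothetical and not taken from cannyOpenMP.cpp.

```cpp
#include <cmath>
#include <vector>

// Gradient magnitude G = sqrt(Gx^2 + Gy^2), computed independently per pixel.
// When compiled without -fopenmp the pragma is ignored and the loops run
// sequentially, producing the same result.
void gradient_magnitude(const std::vector<float>& after_Gx,
                        const std::vector<float>& after_Gy,
                        std::vector<float>& G, int nx, int ny) {
    #pragma omp parallel for collapse(2) shared(G, after_Gx, after_Gy) num_threads(8)
    for (int y = 0; y < ny; ++y) {
        for (int x = 0; x < nx; ++x) {
            const int c = y * nx + x;  // each iteration writes a distinct element
            G[c] = std::hypot(after_Gx[c], after_Gy[c]);
        }
    }
}
```

Because every iteration writes to a different element of G, sharing the arrays among threads does not require any synchronization in this loop.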
A similar approach was applied to the functions called from cannyOpenMP() to parallelize
the code further and make it more efficient.
To properly compile the C++ source code file cannyOpenMP.cpp using the GNU Compiler
Collection with OpenMP support enabled and link the necessary libraries, the following
command line needs to be used:
gcc -o canny_edge_detector -fopenmp cannyOpenMP.cpp -I/path/to/Common -lm -lstdc++
This command line can be broken down as follows:
• gcc: GNU Compiler Collection.
• -o canny_edge_detector: Specifies the output file name.
• -fopenmp: Enables support for OpenMP.
• cannyOpenMP.cpp: Name of the source code file to be compiled.
• -I/path/to/Common: Specifies the directory where header files are located.
• -lm: Links the math library.
• -lstdc++: Links the standard C++ library.
3.4
This kernel is implemented so that each pixel of the image is handled by its own thread.
Each thread checks whether the value of its pixel in the non-maximum suppression result is
greater than a threshold tmax. If it is, the corresponding output pixel is set to maximum
brightness; otherwise, it is set to 0. This produces a black image in which only the edges
are white.
Figure 5: First Edges Kernel Code
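The thresholding rule described above can be sketched as a sequential reference, where each loop iteration stands for the work of one thread. The function name, the float input type and the 8-bit output type are assumptions made for this example.

```cpp
#include <cstddef>
#include <vector>

// Mark every pixel whose non-maximum suppression value is greater than tmax
// as a definite edge (255); all other pixels become 0.
std::vector<unsigned char> first_edges(const std::vector<float>& nms, float tmax) {
    std::vector<unsigned char> edges(nms.size(), 0);
    for (std::size_t i = 0; i < nms.size(); ++i) {
        if (nms[i] > tmax) edges[i] = 255;
    }
    return edges;
}
```

Since no pixel depends on any other, this step maps directly onto one thread per pixel without synchronization.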
3.5
In the Canny edge detection algorithm, hysteresis is the final step in producing the edges.
It relies on two thresholds, a high one and a low one. All pixels with a gradient magnitude
higher than the high threshold are first marked as "discovered" definite edges. Pixels with
a gradient magnitude lower than the low threshold are discarded as edges. The threshold
values can be tuned depending on the application.
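The text above does not say what happens to pixels that fall between the two thresholds. In the usual formulation of Canny hysteresis they are kept only if they are connected to a definite edge, and the sketch below assumes that behavior. It is one possible sequential version, not the project's kernel; the function name, the 8-neighborhood and the repeated-sweep strategy are assumptions made for this example.

```cpp
#include <vector>

// edges: output of the first-edges step (255 = definite edge, 0 = not an edge).
// Pixels with nms >= tmin that touch an edge pixel in their 8-neighborhood are
// promoted to edges; the sweep repeats until no pixel changes.
void hysteresis_edges(const std::vector<float>& nms,
                      std::vector<unsigned char>& edges,
                      int nx, int ny, float tmin) {
    bool changed = true;
    while (changed) {
        changed = false;
        for (int y = 1; y < ny - 1; ++y) {
            for (int x = 1; x < nx - 1; ++x) {
                const int c = y * nx + x;
                // Skip pixels that are already edges or below the low threshold.
                if (edges[c] != 0 || nms[c] < tmin) continue;
                bool touches_edge = false;
                for (int dy = -1; dy <= 1; ++dy)
                    for (int dx = -1; dx <= 1; ++dx)
                        if (edges[(y + dy) * nx + (x + dx)] != 0) touches_edge = true;
                if (touches_edge) {
                    edges[c] = 255;
                    changed = true;
                }
            }
        }
    }
}
```

With this rule a chain of weak pixels attached to a strong one is traced in full, while an isolated weak pixel is never promoted.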
4 Results and Discussion

4.1 OpenMP
The processing times of the OpenMP implementation of the Canny Edge Detector are shown in
Figure 7. As expected given the parallelization of the code, the device processing time is
lower than the host processing time.
The image that results from the execution of the code is shown in Figure 8b. Comparing it
with the reference image (Figure 8a) shows that the OpenMP implementation of the Canny Edge
Detector was successful.
Figure 6: Hysteresis Kernel Code
4.2 CUDA
The functions explained above were implemented similarly to what is available in the
cannyHost code and compiled with the command:
nvcc -o cannyCuda cannyCuda.cu -I/home/cp/cp0110/exnumber/Common
Kernel                            Host processing time (ms)   Device processing time (ms)   Improvement
Convolution kernel                80.02                       68.50                         15 %
Non-maximum suppression kernel    78.84                       54.13                         32 %
First edges kernel                79.04                       54.54                         31 %
Hysteresis edges kernel           80.43                       45.24                         44 %
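As the table above shows, the device version was at least 10 ms faster than the host
function in every row. Compared with the preceding row, each kernel lowered the device time
by roughly another 10 ms, except for the first edges kernel, whose device time (54.54 ms)
is essentially the same as that of the non-maximum suppression row (54.13 ms).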
The resulting images are not exactly equal to the reference images obtained from the
cannyHost function or from the OpenMP implementation, but the comparison pictures below
show that our implementation is still capable of detecting edges in images of several sizes
and with different objects.
Figure 7: Output of cannyOpenMP.cpp
(a) Reference image (b) Result image from cannyOpenMP.cpp
Figure 8: Reference and Result images for OpenMP implementation
(a) Reference image (b) Result image from cannyCuda.cu
Figure 9: Original picture and Results from CUDA implementation
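(a) Reference image (b) Result image from cannyCuda.cu
Figure 10: Original picture and Results from CUDA implementation
(a) Reference image (b) Result image from cannyCuda.cu
Figure 11: Original picture and Results from CUDA implementation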
5 Conclusion
Overall, the main objectives of the project were achieved: the Canny Edge Detector was
implemented and improved over the previous versions of the code using the CUDA and OpenMP
platforms. The results indicate that, in both the CUDA and OpenMP implementations,
execution times were reduced thanks to code parallelization. However, the resulting image
for CUDA was not identical to the reference from CPU execution. It was largely similar,
which suggests some minor issues in our implementation of the Canny edge detection
algorithm.
References
[1] Yuancheng Luo and Ramani Duraiswami, Canny edge detection on NVIDIA CUDA,
2008 IEEE Computer Society Conference on Computer Vision and Pattern Recognition
Workshops, IEEE, June 2008, http://dx.doi.org/10.1109/CVPRW.2008.4563088.
[2] CUDA C++ Programming Guide,
https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html, Apr 2024.
[3] John Canny, A computational approach to edge detection, Readings in Computer
Vision, 1987, pp. 184–203, https://doi.org/10.1016/b978-0-08-051581-6.50024-6.
