IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 27, NO. 5, MAY 2018 2217

Efficient Scalable Median Filtering Using Histogram-Based Operations

Oded Green

Abstract— Median filtering is a smoothing technique for noise removal in images. While there are various implementations of median filtering for a single-core CPU, there are few implementations for accelerators and multi-core systems. Many parallel implementations of median filtering use a sorting algorithm for rearranging the values within a filtering window and taking the median of the sorted values. While using sorting algorithms allows for simple parallel implementations, the cost of the sorting becomes prohibitive as the filtering windows grow. This makes such algorithms, sequential and parallel alike, inefficient. In this work, we introduce the first software parallel median filtering that is non-sorting-based. The new algorithm uses efficient histogram-based operations. These reduce the computational requirements of the new algorithm while also accessing the image fewer times. We show an implementation of our algorithm for both the CPU and NVIDIA's CUDA-supported graphics processing unit (GPU). The new algorithm is compared with several other leading CPU and GPU implementations. The CPU implementation has near perfect linear scaling, with a 3.7× speedup on a quad-core system. The GPU implementation is several orders of magnitude faster than the other GPU implementations for mid-size median filters. For small kernels, 3 × 3 and 5 × 5, comparison-based approaches are preferable as fewer operations are required. Lastly, the new algorithm is open-source and can be found in the OpenCV library.

Index Terms— Median filtering, parallel algorithms.

I. INTRODUCTION

MEDIAN filtering is a key building block used by numerous image processing applications for denoising an image, smoothing, or removing unwanted artifacts [1]. In addition to image processing, median filtering can be found in signal processing [2], computer vision based applications [3], [4], and video processing [5], [6].

Median filtering requires a single parameter, k, which represents the size of the filtering window (sometimes referred to as the kernel). Median filtering works as follows: for each pixel in the image, a k × k window is placed around that pixel and the median value of these pixels is taken as the output value - hence the name median filtering. How that median value is found is algorithm dependent. One popular method for finding the median value is to sort all k² values and then simply take the value at the middle of the output array.

The complexity of the sort depends on the sorting function used and can range from O(n) for a radix sort or bucket sort, through O(n · log(n)) for merge sort or quicksort, to O(n²) for bubble sort, where n is the number of elements to be sorted. For median filtering, n = k², which means that the cost of sorting increases quadratically with the window size. For small kernels, the performance of the sorting algorithm is dictated by several factors: the data itself, the storage requirements, the amount of data movement required by the algorithm, and the parallel scalability of the algorithm.

Huang et al. [7] showed a fast implementation of median filtering that was based on histogram operations rather than sorting. They showed how it is possible to "add" and "subtract" whole histograms. Specifically, given a single master histogram and using a sliding window technique, the histogram of the left-most column in the kernel is removed and the histogram of the column to the right is added. In this approach, two computationally expensive histogram operations are required for each pixel. Further, each pixel is accessed O(k²) times.

Perreault and Hébert [8] recently showed an algorithm that cuts the number of times that each pixel is accessed to O(1). Thus, the operation is no longer dependent on the kernel size and the number of pixel accesses is greatly reduced. Their algorithm extends that of Huang et al. [7]. Unlike [7], they show that the output of the median filter for two consecutive rows in an image contains a relatively large number of overlapping pixels. They introduce a new auxiliary data structure that stores a histogram for every column in the image, and through additional histogram-manipulating instructions they derive an efficient algorithm for median filtering. A similar window movement pattern is used in [9] to efficiently compute the estimated covariance matrix - reducing the computational complexity by several orders of magnitude.

Our new algorithm extends the work of Perreault and Hébert [8] and presents, to the best of our knowledge, the first software parallel algorithm for median filtering that uses histogram-based operations and accesses each pixel O(1) times. The main contribution of this paper is to show how to partition the work required by median filtering across a many-core system. This algorithm is appropriate both for a standard multi-core CPU system and for modern graphics processing unit (GPU) systems.

In the experiment section of this paper, Section V, the parallel algorithm is evaluated on both the CPU and the GPU. The CPU implementation is based on the algorithm of Perreault and Hébert [8] found in OpenCV, and the GPU implementation was built afresh especially for NVIDIA's CUDA

Manuscript received December 6, 2016; revised August 16, 2017 and November 8, 2017; accepted December 1, 2017. Date of publication December 8, 2017; date of current version February 9, 2018. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Christos Bouganis.

The author is with the College of Computing, Georgia Institute of Technology, Atlanta, GA 30332 USA (e-mail: ogreen@gatech.edu).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TIP.2017.2781375
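The sorting-based baseline described in the introduction can be sketched in a few lines. The following Python/NumPy illustration (the function name is ours, not the paper's C++/CUDA code; borders are handled here by simple edge replication rather than the mirroring discussed later) makes the per-pixel O(k² log k²) sorting cost explicit:

```python
import numpy as np

def median_filter_sort(img, k):
    """Naive sorting-based median filter (illustrative sketch).

    Every k x k window is sorted independently, so each output
    pixel costs O(k^2 log k^2) work -- the baseline that the
    histogram-based algorithms in this paper are designed to beat.
    """
    r = k // 2
    padded = np.pad(img, r, mode="edge")  # replicate border pixels
    out = np.empty_like(img)
    for y in range(img.shape[0]):
        for x in range(img.shape[1]):
            window = padded[y:y + k, x:x + k].ravel()
            out[y, x] = np.sort(window)[k * k // 2]  # middle element
    return out
```

Because adjacent windows share most of their pixels, almost all of this sorting work is redundant - the observation that motivates the histogram-based approaches below.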
1057-7149 © 2017 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

Authorized licensed use limited to: NATIONAL INSTITUTE OF TECHNOLOGY CALICUT. Downloaded on March 09,2020 at 04:52:02 UTC from IEEE Xplore. Restrictions apply.
platform. Of these implementations, the GPU is the more challenging platform for algorithm parallelization, as it requires more optimizations to achieve good performance. The new algorithm is benchmarked against several leading libraries and implementations, both CPU and GPU. All of these implementations use sorting operations except for the CPU implementation by Perreault and Hébert [8] (which is also sequential). The new GPU algorithm outperforms all the sorting-based implementations for mid-size filters (such as an 11 × 11 window) and large filters. In some cases, the new algorithm is over 100× faster than other implementations. For smaller windows, such as 3 × 3 and 5 × 5, the new algorithm is not the fastest and is outperformed by several finely tuned implementations. Lastly, our implementation is open-source, has been tested and integrated in the OpenCV project [10], and can be found at https://github.com/opencv/opencv.

II. RELATED WORK

Median filtering is a widely used technique for cleaning images. Other widely used filters include bilateral filtering [11], neighborhood filtering [12], mode filtering [13], and Gaussian filtering [14]. These filtering methods smooth images, making them visually more appealing while removing unwanted artifacts.

Median filtering is popular because of its conceptual simplicity, which also leads to straightforward implementations, both in software and hardware. Median filtering can be found in a large number of software libraries, including OpenCV [10], NVIDIA's NPP [15], ArrayFire [16], SciPy [17], and the Halide [18] image processing compiler. These algorithms vary in their implementation based on the system that they target (CPU, GPU, or some specific accelerator) and on the method used for finding the median value (histogram-based operations or sorting of the kernel window).

The simplicity and determinism of sorting operations, specifically sorting networks, allow for implementing power-efficient hardware median filters. This includes numerous FPGA implementations of median filtering [19]–[21]. Sorting networks implemented in hardware typically focus on small and mid-size kernels due to the cost of the hardware logic. Lastly, sorting networks can be based on different sorting algorithms, including Batcher's sorting algorithm [22] (which was originally designed for hardware).

A. GPU Algorithms

GPU implementations for median filtering include the BVM algorithm [23], [24], the PCMF algorithm [25], [26], and PRMF [27]. While these algorithms are very different from each other, they share a common trait - they use comparisons and sorting operations for finding the median value. BVM [23], [24] avoids using branch instructions, which were shown to be extremely costly for merging and sorting based operations [28]. The BVM algorithm is implemented for both the CPU and the GPU. On the CPU, BVM utilizes SIMD instructions to further increase parallelism. PCMF [25], [26] uses a sorting algorithm based on the Complementary Cumulative Distribution (CCD) function [29]. CCD benefits from low computational costs and a highly parallelizable data structure. PCMF allows for near constant processing time for each pixel. While the time complexity of PCMF is similar to the one presented in this paper, the approaches are very different. The time complexity of PCMF is dependent on the value distributions, giving it an amortized execution time. In contrast, the new algorithm presented in this paper has a deterministic execution time that is independent of pixel values. Lastly, PRMF [27] shows how to implement a median filter through a comparison network designed for a specific kernel size. PRMF includes a tool for creating this network. In each step of the PRMF algorithm, elements that are no longer needed are pruned out of the comparison network (and are not used). From a performance standpoint, BVM is outperformed both by PRMF and by PCMF. For smaller kernel sizes, PRMF outperforms PCMF. However, for larger kernels PCMF outperforms PRMF due to a reduced number of operations per pixel. PRMF does not scale to large kernel sizes, as the amount of computation and memory grows quadratically with the kernel size.¹ Nonetheless, for extremely small kernel sizes, the PRMF algorithms can outperform the algorithm presented in this paper.

B. Histogram Based Algorithms

Sorting based median filtering implementations are not always the most efficient, especially for larger kernels, due to a quadratic increase in the computational complexity. Recently, several papers have shown efficient median filters that use histogram-based operations rather than sorting. These operations, if sequenced in the correct order and with the right auxiliary data structures, avoid many redundant computations. Consider two adjacent pixels in an image: the filtering windows for both these pixels contain significantly overlapping values, yet due to the nature of the sorting algorithms the elements in both windows are sorted from scratch. The works by Perreault and Hébert [8] and Weiss [30] avoid these wasteful computations by using the overlapping sections and by accessing the pixels in a desired order. The algorithm by Weiss uses a hierarchy of histograms for finding the median value.

III. MEDIAN FILTERING IN CONSTANT TIME [8]

Perreault and Hébert [8] present a median filtering algorithm with a reduced time complexity that uses histogram-based operations and accesses each pixel a constant number of times, O(1). This approach is quite different from other sorting based median filtering approaches [18], [31], [32] and also tends to reduce the number of memory accesses. Perreault and Hébert [8] use an array of histograms for counting pixel value appearances. This is different from the approach taken in the classic paper by Huang et al. [7], which uses a single histogram for storing the state of the kernel. In fact, the column histograms help to greatly reduce the number of accesses to the individual pixels.

Perreault and Hébert [8] point out that each pixel is needed for O(k²) unique kernels (except for pixels on the image bor-

¹ From a practical standpoint, the generated code requires a very large number of registers, which limits its scalability on the GPU.
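The add/subtract histogram idea of Huang et al. [7] can be illustrated with a short sketch (a Python illustration with hypothetical names, not any paper's actual code): a single 256-bin histogram tracks the current window, a one-column move costs only 2k bin updates, and the median is re-found by scanning cumulative counts.

```python
import numpy as np

def rank_select(hist, target):
    """Smallest intensity whose cumulative count reaches rank `target`."""
    csum = 0
    for value, count in enumerate(hist):
        csum += count
        if csum >= target:
            return value

def medians_along_row(img, y, k):
    """Medians of all k x k windows centered on row y (interior columns
    only), in the style of Huang et al. [7]: build the first window's
    histogram once, then for each one-column move subtract the departing
    column's k pixels and add the arriving column's k pixels."""
    r = k // 2
    hist = np.zeros(256, dtype=np.int64)
    for v in img[y - r:y + r + 1, 0:k].ravel():
        hist[v] += 1
    target = k * k // 2 + 1  # 1-based rank of the median
    meds = [rank_select(hist, target)]
    for x in range(r + 1, img.shape[1] - r):
        for v in img[y - r:y + r + 1, x - r - 1]:   # departing column
            hist[v] -= 1
        for v in img[y - r:y + r + 1, x + r]:       # arriving column
            hist[v] += 1
        meds.append(rank_select(hist, target))
    return meds
```

Each window move touches only 2k pixels of the image, but the median must still be re-extracted from the histogram at every position - the cost that the column-histogram scheme of [8] further amortizes.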


Algorithm 1 Pseudo Code of the Serial Algorithm by Perreault and Hébert [8]

TABLE I
SYMBOLS AND TERMINOLOGY USED IN THIS WORK

ders that are also used by fewer kernels). By maintaining a histogram for each column of the image, the number of times that a pixel is accessed is reduced to O(k), similar to the bounds given by Huang et al. [7]. Further, they show that due to the overlap of adjacent kernels, the number of times that a pixel needs to be accessed can be reduced to O(1). This is especially important for mid-size and large kernels where k > 7.

From a theoretical standpoint, the pixels of an image can be filtered in a random order. However, by using a predefined access pattern, optimizations such as the ones given by Huang et al. [7] and by Perreault and Hébert [8] become viable. Additional benefits of this access pattern are memory pre-fetching, cache reuse, and deterministic execution. The access pattern is as follows. The algorithm starts at the top-left corner of the image, where the first filtering kernel (window) is placed. This kernel is then moved one column at a time to the right until it reaches the last column. After the last column is processed, the kernel is moved back to the left-most column and down one row. This is repeated until all the rows have been accessed in a contiguous fashion. Simplified pseudo-code of Perreault and Hébert's median filtering algorithm can be found in Alg. 1. Table I depicts the symbols used in the pseudo-code and in the remainder of the paper.

A. Data Structure

Perreault and Hébert [8] suggest the following data structures for their fast implementation. For each column in the input image, there is a histogram for the pixels of that given column. This is maintained only for one row at a time in the image. In addition to the column histograms, there is one master histogram, denoted as H, that is a summation of all the column histograms in the kernel and is used for finding the median value of a given pixel.

1) Storage Complexity: Given an input image with c columns and histograms of depth h, a total of (c + 1) · h memory bytes are required to store both the column histograms and the master histogram. The histogram depth, h, is dictated by the number of bits needed to represent a pixel value. Typically, h = 2^b, where b = ⌈log₂ k²⌉ is the log of the maximal number of times a value can appear in a bin. In comparison to Huang et al. [7], the memory requirement increases by a factor of c. However, this storage increase is affordable on most systems.

B. Algorithmic Details

With the above data structures, Perreault and Hébert [8] show how to reduce the number of accesses to each pixel from O(k) to O(1) using histogram-based operations. In this paper, we only offer the intuition on why this is correct and refer the reader to [8] for formal proofs and additional details.

When a kernel moves one position to the right of its current position, it needs to remove the left-most column from the master histogram, as it is no longer needed. In addition, one column histogram needs to be added to the master histogram. Note that this column histogram already contains k values from the previous row. Of these, k − 1 values are relevant for the current kernel. Thus, by removing the value of the top-most pixel in the column from the column histogram and adding one pixel, which will be the bottom-most pixel in the column, the column histogram is modified such that it correctly represents the right-most column of the kernel.

C. Histogram-Based Operations and Additional Optimizations

Perreault and Hébert [8] show that their algorithm benefits from vector instructions. They use Intel's SSE [33] and AVX [34] vector instruction sets. These vector operations are ideal for speeding up the histogram-based operations. The specific instructions that can be used from the SSE or AVX instruction set depend on the depth of the image (typically 8 bits per pixel) and the size of the kernel. Given a kernel of size k, the same value can be counted up to k² times; the SIMD instructions need to take this into account. Perreault and Hébert [8] use 16 bits to count instances. As such, the SIMD instructions that they use are 8 elements wide.

Further, Perreault and Hébert [8] show an additional optimization in their OpenCV implementation - a two level


histogram hierarchy for efficient median value lookup. This simplifies the search for the median value in the master histogram, making it more efficient and time effective. The two-level histogram also benefits from vectorized operations. We too have adopted this optimization for both the CPU and GPU implementations.

Algorithm 2 Pseudo Code of the Partitioning and Initialization of the New Parallel Algorithm
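The two-level lookup can be sketched as follows (a hypothetical Python illustration assuming 256 intensity bins grouped into 16 coarse buckets of 16; OpenCV's actual layout and vectorization differ): the search first skips whole buckets using their aggregate counts, and only then scans at most 16 fine bins.

```python
import numpy as np

def build_two_level(values):
    """Fine 256-bin histogram plus a 16-bucket coarse summary."""
    fine = np.bincount(values, minlength=256)
    coarse = fine.reshape(16, 16).sum(axis=1)
    return coarse, fine

def median_two_level(coarse, fine, target):
    """Value of rank `target` (1-based): skip whole coarse buckets,
    then scan only the fine bins of the bucket holding the median."""
    csum = 0
    for b in range(16):
        if csum + coarse[b] >= target:
            # the rank-target value lies inside bucket b
            for v in range(16 * b, 16 * b + 16):
                csum += fine[v]
                if csum >= target:
                    return v
        csum += coarse[b]
    raise ValueError("target rank exceeds histogram population")
```

Compared with a flat scan of up to 256 bins, at most 16 + 16 entries are inspected, and both levels can be maintained with the same add/subtract column updates.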

IV. PARALLEL A LGORITHM


In this section, the new parallel algorithm for efficient
median filtering is given. This algorithm consists of two
main phases: 1) the work-load partitioning phase and 2) the
parallel median filtering phase. Any discussion of threads,
cores, or processors in this section is entirely independent
of the executing platform. Rather, these terms are used to
highlight the parallel granularity of the algorithm. These two
parallel phases are important for both the CPU and GPU
implementations.

A. Workload Partitioning

Our parallel median filtering algorithm divides the image into blocks of contiguous rows. This allows each processor to work on a different section of the image. This is a simple partitioning scheme - divide the input image into non-overlapping row segments. The segments are all roughly the same size. To be precise, the number of rows in the input image is divided by the P processors: each processor receives either ⌊rows/P⌋ or ⌊rows/P⌋ + 1 rows. This partitioning ensures that each processor will do a near equal amount of work, also ensuring good load-balancing. There are two extreme cases worth noting: 1) P = 1, which leads to a sequential execution, and 2) P = rows, which means that each processor is responsible for processing one row. The latter might seem ideal, as each processor will be able to finish quickly due to the little work assigned to it. In practice, due to overheads, this leads to low utilization.
Alg. 2 depicts the pseudo-code for the partitioning and data-structure initialization of our new algorithm. To ensure good workload balancing, each processor is given at least ⌊rows/P⌋ rows and at most ⌊rows/P⌋ + 1 rows. Each of these row-segments is represented by two numbers, startRow_i and stopRow_i. startRow_i represents the first row to be processed, stopRow_i represents the last row to be processed, and i represents the processor/core (i ∈ {0, .., P − 1}). This partitioning ensures that all the rows are accounted for.

Algorithm 3 Parallel Median Filtering
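The row partitioning just described can be sketched as follows (an illustrative Python helper with names of our choosing; Alg. 2 remains the authoritative pseudo-code): the first rows mod P processors take one extra row, so segment sizes differ by at most one.

```python
def partition_rows(rows, P):
    """Split `rows` image rows into P contiguous, non-overlapping
    segments of size floor(rows/P) or floor(rows/P)+1, returned as
    inclusive (startRow_i, stopRow_i) pairs."""
    base, extra = divmod(rows, P)
    segments = []
    start = 0
    for i in range(P):
        stop = start + base + (1 if i < extra else 0) - 1
        segments.append((start, stop))
        start = stop + 1
    return segments
```

For example, 10 rows on 3 processors yields segments of 4, 3, and 3 rows; every row belongs to exactly one segment, which is what guarantees the load balance claimed above.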

B. Initialization

Unlike the sequential algorithm of Perreault and Hébert [8], where there is only one column histogram for the single processor, our algorithm requires P data structures - one for each processor. In the sequential algorithm, this column histogram is initialized for the first row. In the parallel version, each processor is responsible for initializing its own set of column histograms. The real challenge with the parallel initialization is due to the pixels at the border of the image, namely the first k/2 rows of the image. For these rows, the kernel exceeds the boundary of the image. To overcome this problem, the pixels in the first row of the input image are mirrored. The exact number of mirrored values is dependent on the kernel size and the starting row of the row-segment given to the processor. For pixels in the first row, each pixel is mirrored k/2 + 2 times, as stated in [8]. For pixels in the second row, the number of mirrored values from the first row is k/2 + 1. For the third row, this goes down to k/2. And so forth, all the way to the k-th row. In the sequential algorithm, the mirroring is only done for the first row. Beyond


the k-th row, mirroring is no longer necessary and the mirrored results have been entirely removed from both the column histograms and the master histogram.

In the initialization phase of the parallel algorithm, the startRow value is used to decide how to initialize the column histograms. startRow falls into one of the following zones (Fig. 1):

• startRow = 0: similar to the sequential implementation. The values of the pixels in the first row are mirrored k + 2 times for each of the column histograms.² This is followed by scanning k − 1 rows and updating the column histograms. Denoted by the blue box at the top of Fig. 1.

• 1 ≤ startRow ≤ k + 1: similar to the previous scenario, except that pixels are replicated fewer times and additional rows need to be scanned. See Alg. 2 for the exact number of replicates and scanned rows. Denoted by the green box in Fig. 1.

• k + 2 ≤ startRow: the entire kernel is within the bounds of the image. No pixel mirroring is necessary; 2 · k + 1 rows are scanned for the initialization. startRow is below the green box in Fig. 1.

Fig. 1. The algorithm initialization depends on three partitioning regions, which are determined by the first row of the row-segments. The first row, denoted by the blue box, is also the first region. The following region, denoted by the green box, is from the second row and up to the k + 1 row. The number of processors starting within this region depends on the number of processors and the kernel size. The last region starts at the end of the green box and goes to the end of the image.

C. Parallel Filtering

The filtering phase of the parallel algorithm is nearly identical to that of the sequential implementation of Perreault and Hébert [8]. The key difference between the parallel and serial algorithms is that in the parallel algorithm the processors are responsible for a smaller number of rows; see Alg. 3. In practice, this can be problematic. Using a large number of processors can lead to the initialization phase being more demanding than the filtering. Using a small number of processors, to avoid the overhead, can lead to under-utilization of the system. This is discussed in additional detail in Section VII. Lastly, we note that there are additional schemes for parallelizing the filtering such that histogram operations for a single partition are also parallelized. For example, on the GPU, after the image is partitioned to the multi-processors, the histogram operation is also parallelized using 32 threads per operation.

D. Complexity Analysis

1) Storage Complexity: Similar to the sequential algorithm, we use h to refer to the depth of the histogram and c to refer to the number of columns in the input. The storage complexity of the parallel algorithm has one additional parameter, P, the number of processors. Each processor requires the same amount of memory as the sequential version, (c + 1) · h bytes. As there are P processors, the total amount of storage required by the parallel algorithm is P · (c + 1) · h bytes. Thus, our approach requires P× more memory than the sequential algorithm. Given the relatively large memory available in most modern systems, the increase in the storage requirements is more than reasonable. However, there are some cases where this increase might be too much: 1) the number of processors is extremely high, 2) the number of columns is extremely large, or 3) the system has limited memory (typically ultra low-end processors).

2) Time Complexity: The following notation denotes the time complexity of the two main work-intensive phases, initialization and filtering: T_init and T_filt. For the work complexity we denote these as W_init and W_filt. The time required by the partitioning phase is minuscule and can be ignored. We note the following:

• The time spent on the initialization can increase by up to a factor of two. Whereas the sequential algorithm only needs k/2 − 1 rows for the initialization process, the parallel algorithm requires k rows for the initialization. As such, T_init^Seq ≤ T_init^Par ≤ 2 · T_init^Seq. As each processor is responsible for initializing its own column histograms, the amount of work required by the initialization increases by the factor of P and the twofold overhead of the initialization: W_init^Par = 2 · P · W_init^Seq.

• The work executed in the filtering phase is identical to that of the sequential algorithm. This means that the work complexity of this phase is identical to that of the sequential algorithm, W_filt^Par = W_filt^Seq. As each processor is responsible for a smaller set of rows, the time per processor is also reduced, to T_filt^Par = T_filt^Seq / P.

Given the above, we offer the following theoretical speedup of the parallel algorithm over the sequential:

Speedup = (T_init^Seq + T_filt^Seq) / (T_init^Par + T_filt^Par) = (T_init^Seq + T_filt^Seq) / (2 · T_init^Seq + T_filt^Seq / P)   (1)

Note that for the parallel algorithm's initialization time, we use the worst case scenario, where the initialization takes twice as long as it does for the sequential algorithm.

3) Theoretical Performance Analysis: Using Eq. (1) we offer the following insights on the scalability of our algorithm, assuming large enough images³:

² See Perreault and Hébert [8] for additional details.
³ For simplicity, consider images of size 1024 × 1024.
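Eq. (1) is easy to evaluate numerically. The following helper (an illustrative sketch of ours, not part of the paper's implementation) computes the worst-case bound:

```python
def theoretical_speedup(t_init_seq, t_filt_seq, P):
    """Worst-case speedup bound of Eq. (1): the parallel
    initialization may cost up to 2x the sequential initialization,
    while the filtering time divides evenly across P processors."""
    return (t_init_seq + t_filt_seq) / (2.0 * t_init_seq + t_filt_seq / P)
```

For filtering-dominated workloads the bound approaches P, while for large P the fixed 2 · T_init^Seq term in the denominator caps the attainable speedup - exactly the regimes discussed in the insights that follow.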


TABLE II
GPU AND CPU S YSTEMS U SED IN E XPERIMENTS

a) For a small number of processors P ≈ O(1) and a small C. Library Comparison


number kernel k, the speedup of our algorithm is almost linear. We compare the performance of our new algorithm with
Given that P is small, rowsp  k, means that most of the time several of the most widely used implementations available:
is spent filtering and a relatively small amount of time is spent NVIDIA’s Performance Primitives (NPP) [15], ArrayFire [16],
in the initialization. SciPy [17], PRMF [27], and Perreault and Hébert [8] algorithm
b) For a small number of processors P ≈ O(1) and a mid- (part of the OpenCV [10] project). NPP is a closed library and
size to large kernel k (such as k = 81), the speedup of our as such its source code is not available. The ArrayFire library
algorithm is somewhat reduced. This leads to a situation where is one of the comprehensive open-source image processing
p ≈ k. In all likelihood more time is spent in the filtering
rows
libraries available for the GPU. The CPU implementations
phase, however, the initialization phase becomes larger and is include the implementation found in SciPy [17] and that of
no longer negligible. Perreault and Hébert [8] taken from the OpenCV [10] library.
c) For a large enough number of processors, P, the initial- Of the aforementioned implementations, only that by Perreault
ization phase becomes a larger overhead. This also means that and Hébert [8] uses histogram-based operations while the
for the large kernels, more time is spent in the initialization others use sorting and comparisons to find the median. PRMF
phase than the time spent in the filtering phase. [27] is a highly tuned median filtering algorithm for the
Most modern day CPU processors fall into the first two sce- GPU that uses a comparison network. In PRMF, unnecessary
narios where the number of processors is small. Accelerators comparisons are pruned in each iteration of the comparison
such as the GPU and Intel Xeon PHI, which have a have a network. While highly efficient for small networks, for mid-
large processor count, fall into the third scenario. size networks of 11×11 and above this network does not scale
due to a mix of computational requirements, excessive use of
registers, and increase size of its code base. The OpenCV
V. E XPERIMENTAL S ETUP library contains two sequential implementations of Perreault
and Hébert [8] algorithm: a sequential version and a vectorized
A. Systems
version that uses Intel’s SSE instruction set. We use the vector-
The new algorithm is implemented for both the CPU and for ized algorithm as it is better performing. Table III summarizes
NVIDIA’s CUDA platform. The analysis of the new GPU algo- the libraries on which system that libraries implementation was
rithm and the benchmarking of all the remaining GPU algo- executed.
algorithms are tested on an NVIDIA GTX 750 GPU, a low-end GPU found in many desktops. The CPU implementation and the other CPU-based algorithms are tested on an Intel system. Table II has additional details of both the CPU and GPU systems used in the experiments. The following sections discuss in greater detail the parameter selection needed for good median filtering performance. This is especially important for the GPU architecture, which has a large amount of parallelism available.

B. CUDA Supported GPUs

A single NVIDIA CUDA GPU consists of multiple streaming multiprocessors, known as SMs. Each SM is made up of a large number of streaming processors (SPs), which are responsible for executing a large number of threads concurrently. Threads are executed in groups known as thread warps; warps are the smallest execution unit on the GPU. At the time of writing, the warp size in current GPU technology is 32 threads. While the programmer can allocate fewer than 32 threads to a thread block, the hardware will still allocate 32 threads and simply leave the remainder unused. As such, it is usually preferable to find a way to use all 32 threads.

Our experiments used varying image and kernel sizes. This includes the images found in Fig. 2: Barbara, Lena, Cameraman, and Peppers. We used square images from 128 × 128 up to 4096 × 4096 for the performance analysis.

VI. CPU Performance Analysis

The CPU implementation in OpenCV uses two different median filtering algorithms. For small windows (3 × 3 and 5 × 5), OpenCV uses a sorting-based implementation. For mid-size and larger filter kernels, OpenCV uses the histogram-based algorithm of Perreault and Hébert [8]. These histogram-based operations are implemented using Intel's SSE instruction set. In Fig. 3, the OpenCV implementation is denoted by the CPU-Sequential-OpenCV curve. The remaining curves denote the implementation of the new algorithm for the CPU. The main differences between the new parallel implementation and the sequential implementation are two new code sections: 1) the workload partitioning and 2) the initialization phase, which initializes the master histogram for each thread.

Fig. 3 depicts the performance of the new algorithm for multiple image sizes. In each row of the figure, a different performance metric is presented.
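To make the histogram-based approach concrete, the following is a minimal pure-Python sketch in the spirit of the sliding-histogram median of Huang et al. [7] (the function names and row-at-a-time structure here are illustrative, not the OpenCV or paper source): the window histogram is updated incrementally by adding the entering column and removing the leaving column, so the per-pixel cost does not grow with the full k × k window as a sort would.

```python
def median_from_hist(hist, count):
    """Return the median value given a 256-bin histogram of `count` samples."""
    target = (count + 1) // 2
    seen = 0
    for value in range(256):
        seen += hist[value]
        if seen >= target:
            return value
    raise ValueError("empty histogram")

def median_filter_row(image, row, k):
    """Median-filter one interior row of a grayscale image (list of lists)."""
    r = k // 2
    width = len(image[0])
    hist = [0] * 256
    # Build the initial window histogram around column r.
    for y in range(row - r, row + r + 1):
        for x in range(k):
            hist[image[y][x]] += 1
    out = [median_from_hist(hist, k * k)]
    # Slide right: O(k) histogram updates per pixel instead of an O(k^2) sort.
    for x in range(r + 1, width - r):
        for y in range(row - r, row + r + 1):
            hist[image[y][x - r - 1]] -= 1   # column leaving the window
            hist[image[y][x + r]] += 1       # column entering the window
        out.append(median_from_hist(hist, k * k))
    return out
```

For odd window sizes the result matches the textbook definition of the median (the middle element of the sorted window), which makes the sketch easy to verify against a sort-based reference.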

Authorized licensed use limited to: NATIONAL INSTITUTE OF TECHNOLOGY CALICUT. Downloaded on March 09,2020 at 04:52:02 UTC from IEEE Xplore. Restrictions apply.
The top row depicts the execution time, the middle row depicts the throughput of the algorithm (in pixels per second), and the bottom row depicts the speedup of the algorithm in comparison to a sequential execution. On the CPU, the execution time for a given thread count does not increase significantly with the growth of the kernel size. This is true for all the thread counts tested and matches the results of Perreault and Hébert [8]. For these thread counts, the majority of the execution time is spent in the filtering phase and not in the initialization phase (as will be seen in the next section).

As the number of threads is increased on the CPU, the performance increases in a nearly perfect linear fashion. For two threads, the average speedup is 1.95×, for three threads it is 2.8×, and for four threads it is 3.7×. This small loss of efficiency is in part due to the overhead of the initialization stage; in comparison to the initialization overhead on the GPU, however, it is not as high.

Fig. 2. Images used in experiments: (a) Barbara, (b) Lena, (c) Cameraman, and (d) Peppers.

Fig. 3. Execution time (top row), throughput (middle row), and speedup (bottom row) of the histogram-based implementations on the CPU (within the OpenCV library). Each subplot depicts a different image size, from 1024 × 1024 (on the left) up to 4096 × 4096 (on the right). Note that the speedup is nearly linear with the number of threads. (a) Execution time - lower is better. (b) Throughput - higher is better. (c) Parallel scalability - higher is better.

TABLE III
Libraries and Implementations Used in Our Experiments

VII. GPU Performance Analysis

Our new GPU algorithm has been integrated into OpenCV. As such, the implementation adheres to the OpenCV standards and data types. Specifically, at the time of writing, OpenCV supports two value types of arrays for the GPU: 8-bit and 32-bit. An 8-bit value is limited to a maximal value of 255. Our goal is to support larger kernel sizes, including kernels that are greater in size than 17 × 17 (for such sizes, a pixel value can appear up to 17² = 289 times in a window, which cannot be represented by an 8-bit integer); thus the larger 32-bit data type is used. This increase from 8 bits to 32 bits quadruples the memory requirements and data movement, and also results in a slower execution time. On a positive note, this implementation can be extended to support 16-bit grayscale images (with some additional optimizations).

A. Performance Analysis of Different Implementations

Fig. 4 depicts the execution time (ordinate) of the various implementations for different sizes of the Barbara image. The abscissa represents the kernel size; lower values imply better performance. Fig. 7 compares the performance of the GPU algorithm with the CPU algorithm. Further, this figure also depicts the performance of the PRMF algorithm [27], one of the best-known GPU median filtering algorithms.

These implementations can be classified into two main groups: 1) sorting- and comparison-based implementations and 2) histogram-based implementations. The sorting-based implementations are ArrayFire, NVIDIA's NPP, SciPy, and PRMF. The non-sorting-based implementations are those by Perreault and Hébert [8] and our new one. As the kernel size increases, the execution time increases in a polynomial fashion for the sorting algorithms, whilst the execution time of the histogram-based implementations is near constant (i.e., independent of the kernel size). As such, the histogram-based implementations can support larger kernels than the sorting-based implementations.

PRMF had some scalability issues and could not be tested for all the kernel sizes. The PRMF algorithm uses an explicitly laid-out software comparison network. This network requires a quadratic number of registers and does not scale beyond a relatively small kernel size. For an 11 × 11 kernel size, compiling PRMF took over 5 hours (at which point the compilation was stopped). For the smaller kernel sizes PRMF performs well. To capture the performance of PRMF, we have placed its execution time in Fig. 7. Note that we include two curves for PRMF: 1) the execution time of only the median filtering and 2) the execution time that includes overheads such as initialization, memory allocation, and data transfer. The throughput reported for PRMF [27] measures only the median filtering time. Our analysis will include both of these times. The motivation for separating these two curves is discussed below. We also note the following:

• As the kernel size grows, the sorting-based implementations do not perform well. The histogram-based implementations can be hundreds of times faster than the sorting-based implementations. In addition, the memory requirements of the sorting-based implementations also hinder scalability, as those requirements increase in a polynomial fashion.

• For really small kernel sizes, such as 3 × 3 and 5 × 5, the sorting-based GPU implementations outperform both the algorithm by Perreault and Hébert [8] and our new algorithm. These tend to have improved cache behavior, lower memory requirements, and lower computational requirements.

• Table IV breaks down the execution time of the new algorithm and of PRMF. As the kernel size increases for PRMF, the execution time increases quadratically (as expected) and more time is spent in executing the filtering versus the other phases.
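The quadratic-versus-near-constant contrast above can be made tangible with a back-of-the-envelope per-pixel work model. The costs below are illustrative assumptions, not measured numbers: a sorting-based filter is modeled as sorting the k² window values (roughly k² · log₂ k² comparisons), while a sliding-histogram filter is modeled as about 2k bin updates plus a scan of the 256 bins.

```python
import math

def sort_cost(k):
    """Modeled comparisons to sort a k*k window (comparison sort)."""
    n = k * k
    return n * math.log2(n)

def histogram_cost(k):
    """Modeled work per pixel for a sliding histogram: 2k bin
    updates (one column out, one column in) plus a 256-bin scan."""
    return 2 * k + 256

# The histogram cost grows only linearly in k, so the advantage of the
# histogram approach widens as the kernel grows.
for k in (3, 7, 11, 21, 41):
    print(k, round(sort_cost(k)), histogram_cost(k))
```

Under this model the ratio sort_cost/histogram_cost grows from well under 1 at k = 3 to tens at k = 41, which is consistent with the trend (though not the exact magnitudes) seen in the measurements.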

Also note that the kernel-only execution-time curve slowly converges to the total execution-time curve as the kernel grows. For an 11 × 11 filter, PRMF would perform on the same order of magnitude as the new algorithm. Lastly, note that the allocation and initialization time for PRMF is higher than that of the new algorithm because PRMF needs to set up borders around the input image to stay within bounds. This time is included in the overall execution time as it is a necessary step of the algorithm.

Fig. 4. Execution time of the various implementations for the Barbara image. The abscissa for each subplot is the kernel size and the ordinate is the execution time. Each subplot shows the execution time for a different sized image, starting at 1024 × 1024 on the left and up to 4096 × 4096 in the bottom right corner. Lower execution time is better. PRMF is not included due to its limited scalability.

TABLE IV
Execution Time (in Milliseconds) Breakdown of the New GPU Algorithm and of the PRMF Algorithm. Note That the Total Execution Time for PRMF Is Dominated by the Allocation and Data-Transfer Time. Also Note That the Time Spent Filtering for PRMF Increases Quadratically With the Kernel Size; It Is Near Constant for the New GPU Algorithm. For Each Algorithm, We Show the Total Execution Time as Well as the Time for Each Phase. Note That PRMF [27] Has an Additional Initialization Phase for Adding Borders.

B. GPU Performance Analysis

Lastly, we discuss the impact of GPU parameter selection on the overall performance of the algorithm. The programming model for NVIDIA's CUDA-supported GPUs [35] typically requires having a large number of threads available for execution. Usually, the number of software threads is about six to ten times the number of hardware threads (SPs). In hardware, threads are grouped into units of 32 threads known as a warp. The GPU used in our experiment has 4 SMs and 512 SPs, which means that it can execute 16 warps concurrently. Our algorithm uses thread blocks of a warp size. Each of these thread blocks can be thought of as a processor in the CPU execution. Instead of using an exact number of processors, as we did in the CPU implementation, we overstate the number of processors and instead use virtual processors, P̂, such that P̂ = ĉ · P, where P is the maximal number of thread blocks executed concurrently. Using the aforementioned rule of thumb, the best selection of ĉ will be in {6, 8, 10}. Perhaps not surprisingly, we see that the best performance is for P̂ = 8 · 16 = 128.

Fig. 5 depicts the execution time of our algorithm for different values of P̂ for different sizes of the Barbara image. Each curve depicts a different number of partitions that the image is divided into. Recall from Eq. 1 that there are two main components that dictate the execution time: the initialization time and the filtering time. On the one hand, a small number of partitions risks leaving the GPU underutilized. On the other hand, a large number of partitions ensures that the GPU has enough work units to dispatch to the SMs; however, this can cause the initialization to become a dominant factor, leading to an increased execution time. Both of these partitioning side effects can be seen in Fig. 5. Note that for the smallest number of partitions, P̂ = 32, the execution is always the slowest as the GPU is underutilized. When the number of partitions is increased to P̂ = 224, the overhead dominates the total execution time. For most of the images, image sizes, and kernel sizes, we saw that dividing the image into 128 partitions offered the best performance for the NVIDIA GTX 750. This offers the best balance between the risk of low system utilization and the cost of a long initialization.
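The virtual-processor arithmetic above can be reproduced in a few lines. This only restates the numbers given in the text (512 SPs, warp size 32, ĉ ∈ {6, 8, 10}); it is not code from the paper's implementation.

```python
WARP_SIZE = 32   # warp size of current NVIDIA GPUs, per the text
NUM_SPS = 512    # SP count of the GTX 750 used in the experiments

# P: maximal number of warp-sized thread blocks executing concurrently.
P = NUM_SPS // WARP_SIZE

# Candidate virtual-processor counts P_hat = c_hat * P for the
# six-to-ten-times rule of thumb.
candidates = [c_hat * P for c_hat in (6, 8, 10)]
print(P, candidates)  # 16 [96, 128, 160]
```

The reported best-performing setting, P̂ = 128, corresponds to ĉ = 8, sitting in the middle of the rule-of-thumb range.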

Fig. 5. Execution time for the new algorithm as a function of the number of partitions used for the parallelization scheme. The abscissa for each subplot is the kernel size while the ordinate is the execution time. Each subplot depicts a different image size, from 1024 × 1024 up to 4096 × 4096. Each curve represents a different number of partitions (or processors) used by our new algorithm.

Fig. 6. Execution time of the histogram-based implementations for different images. The abscissa for each subplot is the kernel size while the ordinate is the execution time. Each subplot depicts a different image size, from 1024 × 1024 up to 4096 × 4096.

Fig. 7. Execution time (top row) and throughput (bottom row) of the histogram-based implementations for the Barbara image. The abscissa for each subplot is the kernel size while the ordinate is the execution time. Each subplot depicts a different image size, from 1024 × 1024 up to 4096 × 4096.

C. Additional Images

Fig. 6 depicts the execution time of our new algorithm for the various images used in our benchmark (see Fig. 2). There is a roughly 10% difference in the execution time across the various images. This can be attributed to the different memory access patterns, which can also change due to the pixel values. For example, differences in the pixel values of the images can cause memory hot-spots as multiple threads try to update the same column histogram. Overall, this is a small performance deviation and is expected.
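The column histograms mentioned above (in the style of Perreault and Hébert [8]) can be sketched as follows. This is an illustrative reconstruction, not the paper's code: each image column keeps its own k-tall histogram, and moving the window down one row touches every column histogram once, which is where concurrent threads working on nearby tiles may collide on the same column.

```python
def build_column_hists(image, k, top_row):
    """One 256-bin histogram per column, covering rows top_row..top_row+k-1."""
    width = len(image[0])
    hists = [[0] * 256 for _ in range(width)]
    for x in range(width):
        for y in range(top_row, top_row + k):
            hists[x][image[y][x]] += 1
    return hists

def slide_down(hists, image, k, new_top):
    """Advance every column histogram by one row: remove the pixel from the
    old top row and add the pixel from the new bottom row."""
    for x, hist in enumerate(hists):
        hist[image[new_top - 1][x]] -= 1        # row leaving the window
        hist[image[new_top + k - 1][x]] += 1    # row entering the window
```

A kernel histogram for any window can then be assembled by summing k adjacent column histograms, which is the two-level hierarchy the histogram-based algorithms rely on.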

Lastly, note that the execution time as a function of the kernel size grows at a relatively moderate rate; the difference between the fastest execution time (for the smallest kernel) and the slowest execution time (for the largest kernel) is roughly 25%. This increase is due to the overhead of the initialization phase, which requires more work as the kernel size grows.

D. Comparison of CPU and GPU Histogram-Based Implementations

Fig. 7 depicts the performance of the histogram-based implementations: the sequential algorithm of Perreault and Hébert [8] for the CPU and our new GPU algorithm. Also in this plot are two curves for the PRMF algorithm: 1) the execution time for the entire algorithm, including the memory allocation, initialization, and data transfer, and 2) the execution time for only the filtering. For the sake of brevity, we do not repeat the analysis of PRMF and refer the reader back to Section VII-A; though we do remind the reader that the current implementation of PRMF has some overhead in the initialization phase due to padding of the image. The top row shows the execution time for these algorithms as a function of the kernel size, and the bottom row shows the throughput of these algorithms. Note that for the histogram-operation-based implementations (CPU and GPU), the execution time increases with the kernel size. The increase is slight for the CPU algorithm, while it is more apparent for the new parallel GPU implementation, where a large kernel size can become a significant part of the execution, especially for large thread counts. This is due to additional overhead introduced by the parallel algorithm. Note that the sequential CPU implementation is quite efficient and in fact outperforms the GPU algorithm. It is also worth noting that the CPU has a significantly larger cache. This allows the CPU to store larger parts of the image as well as most of the auxiliary data structures. The GPU's smaller L1 and L2 caches are not able to hold the entire set of auxiliary data structures, especially when the number of virtual processors is set to 128. While the GPU offers a lot of parallel scalability, it is clear that the algorithm's overheads are limiting the performance. Nonetheless, the new GPU implementation is significantly faster than several leading implementations and is the first histogram-based algorithm for the GPU.

VIII. Conclusions

This paper showed the first software parallel and scalable algorithm for median filtering that is based on histogram operations. Unlike sorting-based and comparison-based algorithms (whose computation tends to grow at a quadratic rate with the window size), the new algorithm is almost entirely independent of the window size. This results in an efficient algorithm that performs well for mid-size windows and upwards. The new algorithm is implemented both for the CPU and the GPU. Both implementations use a multi-level histogram hierarchy for data management, enabling efficient implementations. We show that these histogram-based operations can also be implemented efficiently on the GPU using NVIDIA's CUDA programming model. Our GPU implementation is open-source and can be found in OpenCV (https://github.com/opencv/opencv). For mid-size to large filtering windows, the new algorithm is well over an order of magnitude faster than several leading implementations. For example, for an 11 × 11 filtering window, the new algorithm is over 60× faster than the fastest GPU implementation (found in NVIDIA's NPP library) and hundreds of times faster than other sorting-based implementations (CPU and GPU). Lastly, a parallel CPU implementation of the new algorithm was presented. This implementation extends one of the fastest known sequential algorithms for the CPU, namely the algorithm by Perreault and Hébert [8]. The new parallel implementation for the CPU shows good scalability, achieving a 3.7× speedup on a quad-core system.

References

[1] T. Chen, K.-K. Ma, and L.-H. Chen, "Tri-state median filter for image denoising," IEEE Trans. Image Process., vol. 8, no. 12, pp. 1834-1838, Dec. 1999.
[2] J. Fitch, E. Coyle, and N. Gallagher, "Median filtering by threshold decomposition," IEEE Trans. Acoust., Speech, Signal Process., vol. ASSP-32, no. 6, pp. 1183-1188, Dec. 1984.
[3] A. C. Bovik and D. C. Munson, Jr., "Edge detection using median comparisons," Comput. Vis., Graph., Image Process., vol. 33, no. 3, pp. 377-389, 1986.
[4] A. C. Bovik, T. S. Huang, and D. C. Munson, Jr., "The effect of median filtering on edge estimation and detection," IEEE Trans. Pattern Anal. Mach. Intell., vol. PAMI-9, no. 2, pp. 181-194, Mar. 1987.
[5] S.-C. S. Cheung and C. Kamath, "Robust techniques for background subtraction in urban traffic video," Proc. SPIE, vol. 5308, pp. 881-892, Jan. 2004. [Online]. Available: https://www.spiedigitallibrary.org/conference-proceedings-of-spie/5308/1/Robust-techniques-for-background-subtraction-in-urban-traffic-video/10.1117/12.526886.pdf?SSO=1
[6] H. Rantanen, M. Karlsson, P. Pohjala, and S. Kalli, "Color video signal processing with median filters," IEEE Trans. Consum. Electron., vol. 38, no. 3, pp. 157-161, Aug. 1992.
[7] T. Huang, G. Yang, and G. Tang, "A fast two-dimensional median filtering algorithm," IEEE Trans. Acoust., Speech, Signal Process., vol. ASSP-27, no. 1, pp. 13-18, Feb. 1979.
[8] S. Perreault and P. Hébert, "Median filtering in constant time," IEEE Trans. Image Process., vol. 16, no. 9, pp. 2389-2394, Sep. 2007.
[9] O. Green and Y. Birk, "A computationally efficient algorithm for the 2D covariance method," in Proc. Int. Conf. High Perform. Comput., Netw., Storage Anal. (SC), 2013, Art. no. 89.
[10] G. Bradski, "The OpenCV library," Dr. Dobb's J., vol. 25, no. 11, pp. 120-126, 2000.
[11] C. Tomasi and R. Manduchi, "Bilateral filtering for gray and color images," in Proc. 6th Int. Conf. Comput. Vis., 1998, pp. 839-846.
[12] M. Mahmoudi and G. Sapiro, "Fast image and video denoising via nonlocal means of similar neighborhoods," IEEE Signal Process. Lett., vol. 12, no. 12, pp. 839-842, Dec. 2005.
[13] J. van de Weijer and R. van den Boomgaard, "Local mode filtering," in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. (CVPR), vol. 2, Dec. 2001, pp. II-428-II-433.
[14] K. Ito and K. Xiong, "Gaussian filters for nonlinear filtering problems," IEEE Trans. Autom. Control, vol. 45, no. 5, pp. 910-927, May 2000.
[15] NVIDIA Performance Primitives (NPP), NVIDIA, Santa Clara, CA, USA, Feb. 2011.
[16] J. Malcolm, P. Yalamanchili, C. McClanahan, V. Venugopalakrishnan, K. Patel, and J. Melonakos, "ArrayFire: A GPU acceleration platform," Proc. SPIE, vol. 8403, pp. 8403-1-8403-8, 2012, doi: 10.1117/12.921122.
[17] E. Jones, T. Oliphant, and P. Peterson, "SciPy: Open source scientific tools for Python," Tech. Rep., 2014.
[18] J. Ragan-Kelley, C. Barnes, A. Adams, S. Paris, F. Durand, and S. Amarasinghe, "Halide: A language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines," ACM SIGPLAN Notices, vol. 48, no. 6, pp. 519-530, 2013.
[19] K. Benkrid, D. Crookes, and A. Benkrid, "Design and implementation of a novel algorithm for general purpose median filtering on FPGAs," in Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), vol. 4, May 2002, pp. IV-425-IV-428.

[20] Y. Hu and H. Ji, "Research on image median filtering algorithm and its FPGA implementation," in Proc. WRI Global Congr. Intell. Syst. (GCIS), vol. 3, May 2009, pp. 226-230.
[21] S. A. Fahmy, P. Y. K. Cheung, and W. Luk, "Novel FPGA-based implementation of median and weighted median filters for image processing," in Proc. Int. Conf. Field Programm. Logic Appl., Aug. 2005, pp. 142-147.
[22] K. Batcher, "Sorting networks and their applications," in Proc. Spring Joint Comput. Conf., 1968, pp. 307-314.
[23] W. Chen, M. Beister, Y. Kyriakou, and M. Kachelrieß, "High performance median filtering using commodity graphics hardware," in Proc. IEEE Nucl. Sci. Symp. Conf. Rec. (NSS/MIC), Oct. 2009, pp. 4142-4147.
[24] M. Kachelrieß, "Branchless vectorized median filtering," in Proc. IEEE Nucl. Sci. Symp. Conf. Rec. (NSS/MIC), Oct. 2009, pp. 4099-4105.
[25] R. M. Sánchez and P. A. Rodríguez, "Bidimensional median filter for parallel computing architectures," in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), Mar. 2012, pp. 1549-1552.
[26] R. M. Sánchez and P. A. Rodríguez, "Highly parallelable bidimensional median filter for modern parallel programming models," J. Signal Process. Syst., vol. 71, no. 3, pp. 221-235, 2013.
[27] G. Perrot, S. Domas, and R. Couturier, "Fine-tuned high-speed implementation of a GPU-based median filter," J. Signal Process. Syst., vol. 75, no. 3, pp. 185-190, 2014.
[28] O. Green, "When merging and branch predictors collide," in Proc. IEEE 4th Workshop Irregular Appl., Archit. Algorithms, Nov. 2014, pp. 33-40.
[29] T. P. Ryan, Modern Engineering Statistics. Hoboken, NJ, USA: Wiley, 2007.
[30] B. Weiss, "Fast median and bilateral filtering," ACM Trans. Graph., vol. 25, no. 3, pp. 519-526, 2006.
[31] S. van der Walt, S. C. Colbert, and G. Varoquaux, "The NumPy array: A structure for efficient numerical computation," Comput. Sci. Eng., vol. 13, no. 2, pp. 22-30, 2011.
[32] J. Gil and M. Werman, "Computing 2-D min, median, and max filters," IEEE Trans. Pattern Anal. Mach. Intell., vol. 15, no. 5, pp. 504-507, May 1993.
[33] A. Klimovitski, "Using SSE and SSE2: Misconceptions and reality," Intel Develop. Update Mag., pp. 3-8, 2001.
[34] N. Firasta, M. Buxton, P. Jinbo, K. Nasri, and S. Kuo, "Intel AVX: New frontiers in performance improvements and energy efficiency," Intel, Santa Clara, CA, USA, White Paper 19, 2008, p. 20.
[35] NVIDIA CUDA Programming Guide, NVIDIA, Santa Clara, CA, USA, 2011.

Oded Green received the B.Sc. degree in computer engineering and the M.Sc. degree in electrical engineering from the Technion-Israel Institute of Technology, and the Ph.D. degree from the School of Computational Science and Engineering, Georgia Institute of Technology. He is currently a Research Scientist with the School of Computational Science and Engineering, Georgia Institute of Technology. His research focuses on improving performance and increasing scalability for large-scale data analytics using a wide range of high-performance computing platforms.
