
Gaussian Filter for the Kalray MPPA Processor

1 Introduction

The Gaussian blur (also known as Gaussian smoothing) filter is an image smoothing
filter that reduces noise and smooths the image overall. It consists of applying a
specially computed mask to the image using convolution.
Gaussian blur filters, in addition to being used on their own to reduce noise in images,
are also applied as part of preprocessing steps in edge detection algorithms, since these
algorithms are particularly sensitive to noise. Figure 1 presents an example of a
Gaussian blur filter applied to a noisy image. The left image is the original, noisy
image. The right image is obtained after applying a Gaussian smoothing filter to the
original image, achieving the desired noise reduction.


Figure 1 Before and after applying a Gaussian smoothing filter

1.1 Mask computation

To compute an N x N Gaussian mask (where N is ideally an odd number), the following
equation must be used:

$$
G(x, y) = \frac{1}{2\pi\sigma^2}\, e^{-\frac{x^2 + y^2}{2\sigma^2}}
$$

where x and y are the row and column indexes, respectively, for the mask, and range
from -(N-1)/2 to (N-1)/2; σ is the standard deviation of the Gaussian function. After computing the mask
elements, the mask must be normalized. Normalization consists of dividing each
element in the mask by the sum of all mask elements, such that after normalization the
sum of mask elements will be 1. The resulting mask will be symmetrical, which means
it can be applied to the image without being flipped.
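
As an illustration, a minimal C sketch of this mask computation follows; the function name and signature are ours, not taken from the implementation discussed later, and M_PI is defined manually in case the compiler does not provide it.

```c
#include <math.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

/* Fill an n x n buffer (n odd) with a normalized Gaussian mask.
   mask must point to n*n doubles; sigma is the standard deviation. */
void gaussian_mask(double *mask, int n, double sigma)
{
    int half = n / 2;
    double sum = 0.0;

    for (int y = -half; y <= half; y++) {
        for (int x = -half; x <= half; x++) {
            double value = exp(-(x * x + y * y) / (2.0 * sigma * sigma))
                           / (2.0 * M_PI * sigma * sigma);
            mask[(y + half) * n + (x + half)] = value;
            sum += value;
        }
    }

    /* Normalization: divide every element by the sum so the mask sums to 1. */
    for (int i = 0; i < n * n; i++)
        mask[i] /= sum;
}
```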

1.2 Filtering operation

The mask is applied to the image by aligning the center value of the mask with the
image pixel currently being filtered. The mask is essentially a matrix of weights, and its
center holds the largest weight. The filtered pixel's new value is a weighted average of
the pixel and its neighbors. By aligning the center of the mask with the pixel being
filtered, we ensure that the original pixel receives the heaviest weight and its neighbors
receive weights that decrease as their distance from the original pixel increases.
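
In code, the convolution at a single pixel might look like the following sketch. It assumes a single-channel 8-bit image stored row-major; border pixels are handled by clamping neighbor coordinates, which is one common choice and not necessarily the one used in the implementation described below.

```c
/* Apply an n x n normalized mask to pixel (row, col) of a width x height,
   single-channel, row-major image and return the filtered value. */
static unsigned char filter_pixel(const unsigned char *img, int width, int height,
                                  const double *mask, int n, int row, int col)
{
    int half = n / 2;
    double acc = 0.0;

    for (int dy = -half; dy <= half; dy++) {
        for (int dx = -half; dx <= half; dx++) {
            /* Clamp neighbor coordinates at the image borders. */
            int r = row + dy, c = col + dx;
            if (r < 0) r = 0; else if (r >= height) r = height - 1;
            if (c < 0) c = 0; else if (c >= width)  c = width - 1;

            /* Weighted sum: mask weight times neighbor intensity. */
            acc += mask[(dy + half) * n + (dx + half)] * img[r * width + c];
        }
    }
    return (unsigned char)(acc + 0.5); /* round to the nearest integer */
}
```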

1.3 Parallelism potential

There is a lot of potential parallelism to be found in Gaussian filters. Since each pixel is
filtered individually using a common mask, there are no concurrent write operations.
This means that if there were as many threads or processes as there are pixels, each
pixel could be filtered simultaneously and the whole operation would be done in a
single iteration. Data shared by threads or processes would include the Gaussian mask
and neighboring pixel values, since both are used in the convolution. Spatial and
temporal locality can also be exploited: because pixels are filtered sequentially,
contiguous memory locations are referenced close together in time. If an image row
were brought into the cache, for example, every thread would benefit, except when a
thread finishes filtering the last pixel of a row and requests another row of pixels; in
that case the current cache contents would be evicted, hindering the performance of
threads still using the previous values.
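
On a conventional shared-memory machine, this per-pixel independence maps directly onto a parallel loop. The sketch below uses OpenMP in C and reuses the hypothetical filter_pixel() helper from the previous section; it only illustrates the parallelism argument and is not the MPPA implementation described next.

```c
/* Filter the whole image into a separate output buffer. Each output pixel
   is written by exactly one thread, so no synchronization is required. */
void gaussian_filter(const unsigned char *in, unsigned char *out,
                     int width, int height, const double *mask, int n)
{
    #pragma omp parallel for collapse(2) schedule(static)
    for (int row = 0; row < height; row++)
        for (int col = 0; col < width; col++)
            out[row * width + col] =
                filter_pixel(in, width, height, mask, n, row, col);
}
```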



2 Gaussian filter for the MPPA processor

When designing algorithms for the MPPA processor, there are limitations that would
not exist if the same algorithm were being developed for a traditional processor
architecture. Input size is an issue: each compute cluster is limited to 2 MB of memory
shared among its 16 cores, whereas a traditional processor is limited only by the
available RAM. This means that large input sets have to be split and sent to the clusters
over the network-on-chip. Failure to properly balance how the input is distributed
among the clusters may hurt performance. When designing an MPPA application, the
developer must think in terms of a compute cluster rather than a typical SMP processor.

Data synchronization must also be done carefully. Since input data cannot be sent to the
compute clusters all at once, it is likely that manual synchronization will have to take
place, where compute clusters (slaves) receive a chunk of data, process it and then send
it back to the IO cluster (master). If synchronization is not done properly, the output of
the application may be incorrect.
A Gaussian blur filter could be implemented for the MPPA in many ways, given the
restrictions. A few options are listed below:

- The master generates the Gaussian mask and sends it to the slaves. Each slave
  generates a single image with random pixels and distributes it among its cores to be
  filtered. Since image sizes are known at compile time, static scheduling is possible,
  that is, each thread's workload can be calculated at compile time;
- The master generates the Gaussian mask and sends it to the slaves. Each slave
  generates multiple images with random pixels and distributes them among its cores
  (ideally one per core, to maximize the number of output images) to be filtered. Static
  scheduling is also possible if the number of images generated per cluster is
  predefined;
- Each slave generates the Gaussian mask itself (all slaves must use the same standard
  deviation for consistent results) and generates one or more images to be filtered. This
  can pay off if the cost of generating the mask on each slave is lower than the cost of
  sending it over the network;
- The master generates a very large image, partitions it into smaller chunks that fit in
  the slaves' memories, and sends each chunk to a slave, which filters it and sends it
  back to the master for synchronization.

Other approaches are also possible; these are just a few options. For the tests described
below, we opted for the last approach. Since the idea is to exercise the MPPA's
resources, making heavier use of the network-on-chip (NoC) helps in understanding its
behavior, performance and limitations.

2.1 Implementation

As previously mentioned, in our implementation the master program generates the input
and then sends chunks of appropriate size for the slaves to process. We opted to
generate a 32768x32768 unsigned char matrix, for a total of 1 GB of data. This size was
chosen because it is large enough to force many partitions to be made and allows for
homogeneous partitioning, so each slave is sent the same amount of data to process.
The chunk size was set to 1 MB. Since each slave can only store 2 MB of data at a time,
1 MB chunks leave plenty of free memory for any other data that needs to be stored
there. Each slave therefore receives 64 chunks of data to process (16 slaves x 64 chunks
per slave x 1 MB per chunk = 1 GB). The 7x7 Gaussian mask is also generated by the
master program and sent to each slave prior to partitioning and sending the data chunks.
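
The master-side logic can be summarized by the sketch below. The send_to_cluster() and recv_from_cluster() helpers are hypothetical placeholders standing in for whatever NoC communication primitives the MPPA toolchain offers, and the strictly alternating send/receive pattern is a simplification; only the partitioning arithmetic reflects the description above.

```c
#include <stddef.h>

#define MATRIX_DIM          32768          /* 32768 x 32768 unsigned chars   */
#define CHUNK_SIZE          (1024 * 1024)  /* 1 MB per chunk                 */
#define NUM_CLUSTERS        16
#define CHUNKS_PER_CLUSTER  64             /* 16 * 64 * 1 MB = 1 GB in total */
#define MASK_DIM            7

/* Hypothetical NoC transfer helpers (not an actual MPPA API). */
void send_to_cluster(int cluster, const void *buf, size_t size);
void recv_from_cluster(int cluster, void *buf, size_t size);

void master(unsigned char *image /* MATRIX_DIM * MATRIX_DIM bytes    */,
            const double *mask   /* MASK_DIM x MASK_DIM, normalized */)
{
    /* 1. Send the 7x7 Gaussian mask to every compute cluster. */
    for (int c = 0; c < NUM_CLUSTERS; c++)
        send_to_cluster(c, mask, MASK_DIM * MASK_DIM * sizeof(double));

    /* 2. Each cluster handles a contiguous 64 MB slice of the image,
          received as 64 chunks of 1 MB and sent back once filtered. */
    for (int i = 0; i < CHUNKS_PER_CLUSTER; i++) {
        for (int c = 0; c < NUM_CLUSTERS; c++) {
            size_t offset = ((size_t)c * CHUNKS_PER_CLUSTER + i) * CHUNK_SIZE;
            send_to_cluster(c, image + offset, CHUNK_SIZE);
            recv_from_cluster(c, image + offset, CHUNK_SIZE); /* filtered data */
        }
    }
}
```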

Concerning NoC use, since there is a total of 1024 chunks to be sent (64 per cluster),
1 GB of data will be sent from the master to the slaves. Each slave will also send 64
chunks of 1 MB back to the master, totaling another 1 GB of data. Since the MPPA has
a high-bandwidth data NoC, this should not be a problem performance-wise.

3 Tests and Results

3.1 Tests for the described input size (1 GB)

In our tests, the following metrics were collected: input generation time, MPPA power
consumption, average master-to-slave data transfer time, average master-to-slave data
transfer bandwidth, average slave-to-master transfer time, average slave-to-master data
transfer bandwidth and average processing time in slaves.

For the 1 GB input matrix, generated by the master process using the standard C rand()
function, total generation time was 1737.96 seconds. 1,073,741,824 elements were
generated in total, each being 1 byte in size, which translates into 603.33 KB/s.

Average power consumption for the execution of tests with 1 GB input size was 6.44
W. This number is significantly lower than what is found in most commercial desktop
processors, which have TDPs ranging from 65 to over 120 W. It is even lower than
power-saving ultrabook processors, such as Intel's Haswell-U series, with TDPs of 17
W. This is a very good figure for a processor with 256 cores (albeit low-frequency
ones) and a NoC.

Average processing time per slave per chunk was 9.355 seconds. Total average
processing time per cluster was 598.767 seconds. Since each cluster was tasked with
processing 64 MB of data, this corresponds to an average processing throughput of 112.08 KB/s.
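
For reference, this figure follows directly from the chunk volume and the measured time, taking 1 KB = 1000 bytes:

$$
\frac{64 \times 1024 \times 1024 \text{ B}}{598.767 \text{ s}} \approx 112{,}078 \text{ B/s} \approx 112.08 \text{ KB/s}
$$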

Figures 2 and 3 show the master-to-slave NoC use results. Figure 2 shows the average
time it took to send a 1 MB chunk of data to each of the 16 compute clusters, while
Figure 3 shows the average bandwidth obtained in these data transfers. We can observe
that transfer times increase linearly, and bandwidth decreases in an almost exponential
fashion. The MPPA processor uses a 2D torus NoC topology, and IO clusters are
connected to compute clusters in a peculiar manner. Our implementation of the
Gaussian filter was unable to optimally exploit the NoC topology to obtain similar
times and bandwidth for every compute cluster, which caused these discrepancies.





Figure 2 Average master-to-slave chunk transfer times

Figure 3 Average master-to-slave chunk transfer bandwidth

This behavior is also observed in slave-to-master transfers. Figures 4 and 5 show,
respectively, average slave-to-master chunk transfer times and average slave-to-master
chunk transfer bandwidth. The behavior is almost identical to the one observed in
master-to-slave transfers, except for cluster 11, which presented a slight anomaly in
transfer time and bandwidth.


Figure 4 Average slave-to-master chunk transfer times

Figure 5 Average slave-to-master chunk transfer bandwidth


4 Conclusions and Future Work

In this paper we describe our implementation of the Gaussian smoothing filter for the
Kalray MPPA processor. This processor has many architectural differences when
compared to a traditional multi-core processor, which forces developers to rethink how
algorithms are implemented and how they behave when executed.

The small 2 MB dedicated memory available in each compute cluster poses a challenge
regarding problem partitioning and data distribution among the clusters, which is done
via the MPPA's network-on-chip. Algorithms must be carefully designed with these
constraints in mind, or they will not perform well.

Results showed that although the implementation of the algorithm was successful, some
challenges were not totally overcome. One such challenge is the efficient use of the
NoC. Our results showed that delay and bandwidth numbers varied from cluster to
cluster, which we attribute to our limited understanding of the network topology and of
how best to exploit it. Power consumption results, however, appear to be very good.

As future work, a more thorough study of the embedded NoC could be conducted, in
order to try to eliminate delay and bandwidth discrepancies between clusters and
achieve better performance. The Gaussian filter implementation could be altered to
work with actual images instead of randomly generated input, which would require an
understanding of the processor's PCI-Express interface.
