Radial Basis Function Neural Networks with parameter selection using K-means:
Performance evaluation, from Matlab to C
High-Performance Embedded Computing
Doctoral Program in Informatics Engineering
Luis Carlos de Sousa Moreira Neto - lcneto@fe.up.pt

SUPERVISOR

Professor Dr. João Cardoso - jmpc@acm.org

FEUP, 2016
Technical Report, August 2016. DOI: 10.13140/RG.2.2.18905.11362

1 SUMMARY
In this work, a Radial Basis Function Neural Network, implemented in Matlab, is optimized to run on an embedded system. The work reported consists of the translation of the Matlab implementation to C, performance analysis, and a comparison between a desktop and an embedded machine. Important translation and optimization steps are summarized and the results obtained are discussed.

2 INTRODUCTION
A RBFNN (Orr, 1996) is an artificial neural network that uses radial basis functions as activation functions. The RBFNN is a three-layered feed-forward neural network. The first layer is linear and only distributes the input signal, while the second layer is nonlinear and uses Gaussian functions. The third layer linearly combines the Gaussian outputs. Only the tap weights between the hidden layer and the output layer are modified during training.
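In compact form (notation ours, consistent with the kernel code in Code Excerpt 5 later in this report), the network output for an input x is

    y(x) = \sum_{j=1}^{K} w_j \, h_j(x), \qquad
    h_j(x) = \exp\!\Big(-\tfrac{1}{2}\,(x-\mu_j)^{\top}\,\Sigma_j^{-1}\,(x-\mu_j)\Big)

where \mu_j and \Sigma_j are the center and spread of the j-th Gaussian and the w_j are the output weights adjusted during training.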
The selected RBFNN implementation (see Figure 1 - RBFNN form and parameters) has five parameters available for optimization:

1- The weights between the hidden layer and the output layer.
2- The activation function.
3- The center of activation functions.
4- The distribution of center of activation functions.
5- The number of hidden neurons.

Figure 1- RBFNN form and parameters.

The weights between the hidden layer and the output layer are calculated using the Moore-Penrose generalized pseudo-inverse. This approach avoids many issues of traditional gradient algorithms, such as choosing a stopping criterion, a learning rate, and a number of epochs, and getting trapped in local minima. Due to its shorter training time and good generalization ability, it is suitable for real-time applications.
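Concretely, if H is the matrix of hidden-layer responses over the training points and T the corresponding targets, the output weights follow in a single least-squares step (symbols ours):

    W = H^{+} T = (H^{\top} H)^{-1} H^{\top} T

where H^{+} denotes the Moore-Penrose pseudo-inverse of H.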
The radial basis function selected is usually a Gaussian kernel for pattern recognition applications. In general, the centers and spread of the activation functions should have characteristics similar to the data. Here, the centers and widths of the Gaussians are selected using a K-means clustering algorithm.
According to universal approximation theory, the centers and spread of the activation functions need not be chosen deterministically: provided the number of hidden neurons is sufficient, a single-hidden-layer feed-forward network can approximate any function to any arbitrary level of accuracy.
In this case, there was an interest in testing the implementation and performance of a neural network on embedded hardware. Regardless of the neural network type and target application, and due to time limitations, the priority was to find a well-grounded and fully testable implementation of a neural network. RBFNNs can be used, for example, for real-time image tracking (Asvadi et al., 2011a; Asvadi et al., 2011b); but this particular network and this particular implementation were chosen because of how easily performance differences between the Matlab and C implementations can be obtained, and because the implementation is complete1 and reliable, in the sense that it provides a complete suite to test and measure the quality of the results. Credit is due to the author of the Matlab implementation, Alireza Asvadi2, who made it public on his Mathworks Matlab Central site.

3 OBJECTIVE
The objective of this work is summarized in the following bullet points:

• Translate the Matlab application to C.
• Profile the code to detect inefficient zones (hot spots).
• Apply optimizations in the hotspot zones.
• Deploy the program on a ZedBoard™ using the Xilinx Zynq®-7000 All Programmable SoC.3
• Apply optimizations to improve the performance of the system on the ZedBoard.
• Compare performance results between a desktop PC and the ZedBoard.

The rest of the document presents and discusses the performance results obtained on the different hardware platforms, with different configurations, and details the optimizations implemented.

4 CODE CONVERSION AND OPTIMIZATIONS


The Matlab implementation was composed of 5 files: a main file, a file to randomly generate the points for evaluation and testing, a file containing the training code, and another for evaluating the activation functions in the hidden layer. The first step was to discard the point-generation code, since for the C implementation the input must be given. After that, the Matlab workspace was used intensively to iteratively validate the translation steps, and all the files were translated to C.

The RBFNN Matlab operation consists of three steps: point generation, the train phase and the test phase.

• In the point generation, two vectors of random points, with 2x200 dimensions each, are generated. The first vector (used in the train phase) has a point distribution between 0 and 5 and the second vector (used in the test phase) has a distribution of points between 8 and 15.
• During the train phase, the K-means algorithm is used, iterating KMI times, to select the centers of the Gaussian functions for the K hidden nodes (a sketch of the iteration follows Figure 2 below). The train function returns: 1) the weights for the activation between the hidden and the output layer, 2) the centers of the Gaussian functions, 3) the distances (widths) of the Gaussian functions.
• During the test phase, the second vector of points is evaluated by the RBFNN kernel functions, using the centers, distances and weights calculated in the train phase. The points are classified into classes, as illustrated in Figure 2.

1 https://www.mathworks.com/matlabcentral/fileexchange/52580-radial-basis-function-neural-networks--with-parameter-selection-using-k-means-
2 http://a-asvadi.ir/
3 http://microzed.org/product/zedboard

Figure 2 - Point classification during test phase.
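
As a minimal sketch of the K-means iteration used in the train phase (the array names F, CENTS, K and KMI follow the excerpts in section 4.3; the label array and the assignment/update split are our simplification of the actual kmeans_cluster code):

#include <math.h>
#include <float.h>

/* One simplified K-means pass over the 2 x F_size training points:
   assign each point to its nearest center, then move each center to
   the mean of its assigned points; repeat KMI times. */
void kmeans_sketch(int F_size, double F[2][F_size], int K, int KMI,
                   double CENTS[][2], int label[])
{
    for (int it = 0; it < KMI; it++) {
        /* Assignment step: nearest center per point. */
        for (int i = 0; i < F_size; i++) {
            double best = DBL_MAX;
            for (int j = 0; j < K; j++) {
                double dx = F[0][i] - CENTS[j][0];
                double dy = F[1][i] - CENTS[j][1];
                double d = sqrt(dx * dx + dy * dy);
                if (d < best) { best = d; label[i] = j; }
            }
        }
        /* Update step: each center becomes the mean of its cluster. */
        for (int j = 0; j < K; j++) {
            double sx = 0.0, sy = 0.0;
            int n = 0;
            for (int i = 0; i < F_size; i++)
                if (label[i] == j) { sx += F[0][i]; sy += F[1][i]; n++; }
            if (n > 0) { CENTS[j][0] = sx / n; CENTS[j][1] = sy / n; }
        }
    }
}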

For all the experiments, the values for the number of centers K, the number of K-means iterations KMI, and the input vector size were not changed. The default values of the Matlab implementation were used, respectively K=8, KMI=10 and vector size 200.

Figure 3 - Matlab implementation profiling.


For comparison purposes, a profiling result of the Matlab implementation is shown in Figure 3. There it can be observed that the function which requires the most time to run is kmeans_cluster, with 0.038 s elapsed per call. This function is used in the train phase to select the parameters of the Gaussian functions, which means that the test phase is faster because it does not need to calculate parameters for the kernel functions.

Before presenting the performance measures, let us detail the hardware and software environment of the desktop PC running the tests:

• OS - Windows 10 Home 64 bits.
• Hardware - Intel® Core™ i7-4710HQ CPU @2.5 GHz and 16384 MB RAM.
• C environment and compiler - MinGW32 version 2-25.1-1 and GCC 4.9.3-1.
• C builder - Eclipse CDT version 8.3.0.201402142303.

4.1 C IMPLEMENTATION
For the C implementation, the function which generates the points for test and train was discarded. The points can instead be given as an argument of the main function; during the tests they were declared as stack variables.

The data type used in the implementation was double, because of concerns with precision on the embedded hardware. In the tests section, a comparison in terms of performance between the double and float types is presented.

In the implementation of the Gaussian function, the inverse of a matrix of distances is computed. For this operation Matlab performs an LU decomposition4 to form a linear system whose solution is the inverse matrix. The C implementation5 of the LU decomposition technique used is explained in (Press, 2007). Another challenge to overcome was to carry the simplicity of Matlab syntax over to C, namely for matrix multiplication. The Matlab implementation relies heavily on matrix representation; in our C approach, all matrices maintain the same column x line format, and all the loops were adapted to produce the desired result, as sketched below.
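
For example, a Matlab expression C = A*B becomes an explicit triple loop over the row-major arrays (a generic sketch; the function name and dimension parameters are ours):

/* Plain triple-loop product of an n x m and an m x p matrix:
   C[i][j] accumulates the dot product of row i of A and column j of B. */
void mat_mul(int n, int m, int p,
             double A[n][m], double B[m][p], double C[n][p])
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < p; j++) {
            double acc = 0.0;
            for (int k = 0; k < m; k++)
                acc += A[i][k] * B[k][j];
            C[i][j] = acc;
        }
}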
The implementation resulted in 6 C files: each main function (test, train, kernel, kmeans, LU matrix decomposition) was implemented in a different file, and an additional file was created for the main function.

4 http://www.mathworks.com/help/matlab/ref/inv.html
5 http://www.it.uom.gr/teaching/linearalgebra/NumericalRecipiesInC/c2-3.pdf
4.2 C CODE PROFILING

Using gprof6, a first test to determine the hotspots revealed that the function with the most negative impact on speed was kmeans_cluster, together with its descendant functions. An important remark is that the clock function used by gprof did not have enough precision to measure a single run: the reported clock precision is about 1 ms, and gprof ignores time accounting for functions that take less time to run than the stated precision. To measure these results, a loop of 200 calls to the main function was made, and the results were divided by the number of iterations, as in the sketch below.
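
The measurement loop looked essentially like the following sketch (run_once is a placeholder for one full train-plus-test run, not the actual function name in the repository):

#include <stdio.h>
#include <time.h>

void run_once(void); /* placeholder: one train + test execution */

int main(void)
{
    enum { RUNS = 200 };
    clock_t t0 = clock();
    for (int r = 0; r < RUNS; r++)
        run_once();
    /* Divide the accumulated time by the iteration count to get
       below the ~1 ms resolution of clock(). */
    double ms = 1000.0 * (double)(clock() - t0) / CLOCKS_PER_SEC / RUNS;
    printf("%.3f ms per run\n", ms);
    return 0;
}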

Figure 4 - Code profiling for train and test phases combined

As can be seen in the profiling report in Figure 4, the kmeans_cluster function is the most time consuming: 1.25 ms is spent in each call. When analyzing the code7, three possible opportunities for optimization were spotted; they are explained next. Two extensive loops and a small loop were modified; the applied optimizations are named and summarized in the following subsections. Flags were used to select the optimizations, using #if defined macros to modify the execution flow.

To better perceive the impact of the train and test phases, their profiling reports are shown in Figure 5 and Figure 6, respectively. When the test phase is profiled independently, it shows the worst behavior compared with both the train phase alone and the train and test phases combined. This is an interesting behavior, because during the test phase only the kernel functions require computation, whereas the train phase includes the additional computation step of the K-means algorithm.

6 ftp://ftp.gnu.org/old-gnu/Manuals/gprof-2.9.1/html_node/gprof_toc.html
7 https://github.com/the-grandson/Radial-Basis-Function-Neural-Networks-with-parameter-selection-using-K-means-Performance-evaluati
Figure 5 - Code profiling for train phase

Figure 6 - Code profiling for test phase.


4.3 C CODE OPTIMIZATIONS
4.3.1 Loop Unrolling and Constant Folding
In Code Excerpt 1, a short loop unrolled by a factor of 2 (highlighted yellow in the original listing) can be seen, as well as constant folding applied to the index advance resulting from the loop unrolling (highlighted blue). These two techniques were also applied to the other main loops of the kmeans_cluster function, controlled by the same flag.

#if defined(LOOP_UNROLING_CONST_FOLDING)
    int i_roll = 0;
    for (i = 0; i < K; i += 2) {
        i_roll = i + 1;
        int v1 = (int)values[i];
        int v2 = (int)values[i_roll];

        CENTS[i][0] = F[0][v1];
        CENTS[i][1] = F[1][v1];
        CENTS[i_roll][0] = F[0][v2];
        CENTS[i_roll][1] = F[1][v2];
    }
#else
    for (i = 0; i < K; i++) {
        CENTS[i][0] = F[0][(int)values[i]];
        CENTS[i][1] = F[1][(int)values[i]];
    }
#endif

Code Excerpt 1 - Loop unrolling and constant folding optimization
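
Each variant is selected at build time through its preprocessor flag; for instance, a MinGW/GCC invocation of the form gcc -DLOOP_UNROLING_CONST_FOLDING -O0 *.c -lm (the exact file list is ours) builds the unrolled version, while omitting the flag builds the original loop.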

4.3.2 Loop Unrolling and JAM


When inner loops exist and the outer loop has been unrolled, the inner loops must be duplicated to cover the two indexes (i and i+1) that the outer loop unrolling gives origin to. The duplicated inner loops can then be fused into a single loop; this process, combined with the outer loop unrolling, is known as loop unrolling and jam. The technique is applied in kmeans_cluster and is shown in Code Excerpt 2.
#elif defined(LOOP_UNROLING_AND_JAM)
    int i_plus_1 = 0;
    for (i = 0; i < F_size; i += 2) {
        i_plus_1 = i + 1;
        double F_0_i1 = F[0][i];
        double F_1_i1 = F[1][i];
        double F_0_i2 = F[0][i_plus_1];
        double F_1_i2 = F[1][i_plus_1];

        for (j = 0; j < K; j++) {
            double norm;
            double aux[2] = {0};
            aux[0] = F_0_i1 - CENTS[j][0];
            aux[1] = F_1_i1 - CENTS[j][1];
            vector_norm(aux, 2, &norm);
            DAL[i][j] = norm;

            aux[0] = F_0_i2 - CENTS[j][0];
            aux[1] = F_1_i2 - CENTS[j][1];
            vector_norm(aux, 2, &norm);
            DAL[i_plus_1][j] = norm;
        }
    }
Code Excerpt 2 - Loop unrolling and JAM of inner loops
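
The helper vector_norm is called in the excerpts but not listed; a minimal version consistent with its use there (Euclidean norm of an n-element vector, written through the out pointer) would be:

#include <math.h>

/* Euclidean norm of the n-element vector v, stored in *out. */
void vector_norm(const double *v, int n, double *out)
{
    double acc = 0.0;
    for (int i = 0; i < n; i++)
        acc += v[i] * v[i];
    *out = sqrt(acc);
}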
4.3.3 Loop Unrolling, Jam and Data Layout
For this technique, if the blue and green highlighted parts of Code Excerpt 3 are compared with the same parts of Code Excerpt 2, the duplication of operations per iteration of the inner loop is obvious. In this case, the inner loop is also unrolled.

#if defined(LOOP_UNROL_AND_JAM_AND_DATA_LAYOUT)
    int K_plus_1 = K + 1;
    int i_plus_1 = 0;
    for (i = 0; i < F_size; i += 2) {
        i_plus_1 = i + 1;
        double F_0_i1 = F[0][i];
        double F_1_i1 = F[1][i];
        double F_0_i2 = F[0][i_plus_1];
        double F_1_i2 = F[1][i_plus_1];

        int j_plus_1 = 0;
        for (j = 0; j < K; j += 2) {
            j_plus_1 = j + 1;
            double norm;
            double aux[2] = {0};

            aux[0] = F_0_i1 - CENTS[j][0];
            aux[1] = F_1_i1 - CENTS[j][1];
            vector_norm(aux, 2, &norm);
            DAL[i][j] = norm;

            aux[0] = F_0_i1 - CENTS[j_plus_1][0];
            aux[1] = F_1_i1 - CENTS[j_plus_1][1];
            vector_norm(aux, 2, &norm);
            DAL[i][j_plus_1] = norm;

            aux[0] = F_0_i2 - CENTS[j][0];
            aux[1] = F_1_i2 - CENTS[j][1];
            vector_norm(aux, 2, &norm);
            DAL[i_plus_1][j] = norm;

            aux[0] = F_0_i2 - CENTS[j_plus_1][0];
            aux[1] = F_1_i2 - CENTS[j_plus_1][1];
            vector_norm(aux, 2, &norm);
            DAL[i_plus_1][j_plus_1] = norm;
        }
    }

Code Excerpt 3 - Loop unrolling, Jam and Data Layout


4.3.4 Loop Unrolling, Jam, Merge, Auxiliary Elimination and Function Inline
In the optimization of section 4.3.3, there is a function vector_norm and an auxiliary vector used to pass values to it. In this optimization, the norm of the vector is instead calculated inline within the loop; this change eliminates both the auxiliary vector and the function call. After the code of Code Excerpt 3, there is a loop that selects the smallest distance to a cluster center and records the index holding that smallest distance. These two loops were merged, so the smallest distance and its index are tracked in the same loop body.

#if defined(LOOP_UNROL_AND_JAM_INLINE_MERGE)
    int K_plus_1 = K + 1;
    int i_plus_1 = 0;
    for (i = 0; i < F_size; i += 2) {
        i_plus_1 = i + 1;
        double F_0_i1 = F[0][i];
        double F_1_i1 = F[1][i];
        double F_0_i2 = F[0][i_plus_1];
        double F_1_i2 = F[1][i_plus_1];

        double distance1 = DBL_MAX;
        double distance2 = DBL_MAX;
        int index1;
        int index2;

        int j_plus_1 = 0;
        for (j = 0; j < K; j += 2) {
            j_plus_1 = j + 1;
            double aux_0;
            double aux_1;
            double result;

            aux_0 = F_0_i1 - CENTS[j][0];
            aux_1 = F_1_i1 - CENTS[j][1];
            result = sqrt((aux_0*aux_0) + (aux_1*aux_1));
            DAL[i][j] = result;
            if (result < distance1) { index1 = j; distance1 = result; }

            aux_0 = F_0_i1 - CENTS[j_plus_1][0];
            aux_1 = F_1_i1 - CENTS[j_plus_1][1];
            result = sqrt((aux_0*aux_0) + (aux_1*aux_1));
            DAL[i][j_plus_1] = result;
            if (result < distance1) { index1 = j_plus_1; distance1 = result; }

            aux_0 = F_0_i2 - CENTS[j][0];
            aux_1 = F_1_i2 - CENTS[j][1];
            result = sqrt((aux_0*aux_0) + (aux_1*aux_1));
            DAL[i_plus_1][j] = result;
            if (result < distance2) { index2 = j; distance2 = result; }

            aux_0 = F_0_i2 - CENTS[j_plus_1][0];
            aux_1 = F_1_i2 - CENTS[j_plus_1][1];
            result = sqrt((aux_0*aux_0) + (aux_1*aux_1));
            DAL[i_plus_1][j_plus_1] = result;
            if (result < distance2) { index2 = j_plus_1; distance2 = result; }
        }

        DAL[i][K] = index1;
        DAL[i][K_plus_1] = distance1;
        DAL[i_plus_1][K] = index2;
        DAL[i_plus_1][K_plus_1] = distance2;
    }
Code Excerpt 4 - Loop Unrolling, Jam, Merge, Auxiliary Elimination and Function Inline
4.3.5 Kernel Function Optimization
Regarding Code Excerpt 5, which represents the Gaussian kernel function: the part highlighted green is the only part of the code which uses the loop index. The rest of the code, highlighted yellow, is static and can be moved out of the loop body (the original Matlab implementation also has this static part inside the loop body). In the optimized version, highlighted blue, the static part is hoisted out of the loop body and, in addition, the loop is unrolled.

double *rbf_kernel(int F_size, double F[2][F_size], double **MU, double ***SIGMA, int k)
{
    /* The loop index and the output vector are not shown in the original
       excerpt; they are restored here for completeness (h is assumed to be
       heap-allocated so it can be returned; requires <stdlib.h>). */
    int i;
    double *h = malloc(F_size * sizeof(double));

#if defined(KERNEL_OPTIMIZATION)
    /* Static part hoisted out of the loop: the 2x2 inverse of SIGMA(:,:,k)
       is computed once instead of once per point. */
    double factor = 1/((SIGMA[k][0][0]*SIGMA[k][1][1])-(SIGMA[k][1][0]*SIGMA[k][0][1]));
    double S_K_1_1_factor = (SIGMA[k][1][1]*factor);
    double minus_S_K_1_0_factor = -(SIGMA[k][1][0]*factor);
    double minus_S_K_0_1_factor = -(SIGMA[k][0][1]*factor);
    double S_K_0_0_factor = (SIGMA[k][0][0]*factor);

    for (i = 0; i < F_size; i += 2) {
        double f_minus_mu_1 = F[0][i]-MU[k][0];    //(F(1,:)'-MU(:,1))'
        double f_minus_mu_2 = F[1][i]-MU[k][1];
        double f_minus_mu_3 = F[0][i+1]-MU[k][0];
        double f_minus_mu_4 = F[1][i+1]-MU[k][1];

        h[i] = exp((-0.5)*(((f_minus_mu_1*S_K_1_1_factor)+(f_minus_mu_2*minus_S_K_1_0_factor))*f_minus_mu_1+
                           ((f_minus_mu_1*minus_S_K_0_1_factor)+(f_minus_mu_2*S_K_0_0_factor))*f_minus_mu_2));

        h[i+1] = exp((-0.5)*(((f_minus_mu_3*S_K_1_1_factor)+(f_minus_mu_4*minus_S_K_1_0_factor))*f_minus_mu_3+
                             ((f_minus_mu_3*minus_S_K_0_1_factor)+(f_minus_mu_4*S_K_0_0_factor))*f_minus_mu_4));
    }
#else
    for (i = 0; i < F_size; i++) {
        double f_minus_mu[2] = {F[0][i]-MU[k][0], F[1][i]-MU[k][1]};   //(F(i,:)'-MU(:,k))'
        double sigma[2][2] = {{SIGMA[k][0][0], SIGMA[k][0][1]},
                              {SIGMA[k][1][0], SIGMA[k][1][1]}};       //SIGMA(:,:,k)
        double factor = 1/((sigma[0][0]*sigma[1][1])-(sigma[1][0]*sigma[0][1]));
        double inv_sigma[2][2] = {{ (sigma[1][1]*factor), -(sigma[1][0]*factor)},
                                  {-(sigma[0][1]*factor),  (sigma[0][0]*factor)}}; //inv(SIGMA(:,:,k))

        double mul_sigma[2] = {f_minus_mu[0]*inv_sigma[0][0]+f_minus_mu[1]*inv_sigma[0][1],
                               f_minus_mu[0]*inv_sigma[1][0]+f_minus_mu[1]*inv_sigma[1][1]};
                               //(F(i,:)'-MU(:,k))'*(inv(SIGMA(:,:,k)))

        double result = mul_sigma[0]*f_minus_mu[0]+mul_sigma[1]*f_minus_mu[1];
        result = exp((-0.5)*result);
        h[i] = result;
    }
#endif
    return h;
}

Code Excerpt 5 - Kernel Function Optimization


5 PERFORMANCE RESULTS

5.1 PERFORMANCE RESULTS FOR OPTIMIZATIONS 4.3.1 TO 4.3.3


The following graphs (Figure 7 to Figure 10) and Table 1 show the results obtained for the optimizations implemented in subsections 4.3.1 to 4.3.3.

To better understand the impact of the implemented optimizations, the speedup is calculated. To determine the fraction of execution time attributable to the kmeans_cluster function, the profile results shown in section 4 (Code Conversion and Optimizations) were used: summing the relevant lines of the % time column gives the total fraction of kmeans_cluster, 62.5% of total execution time (the function vector_norm is a child of kmeans_cluster). The relation between original/improved code and the available compiler optimization levels, in terms of execution time, is shown in Figure 7, Figure 8, Figure 9 and Figure 10; for a more detailed description of the results see Table 1. All the speedup calculations were done relative to the total execution time of the original, non-optimized code, on the respective desktop or embedded hardware.
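
Since kmeans_cluster accounts for a fraction f = 0.625 of the total execution time, Amdahl's law (standard formula, applied here by us) bounds the overall speedup attainable by optimizing that function alone:

    S_{\text{overall}} = \frac{1}{(1-f) + f/s}, \qquad f = 0.625

where s is the speedup of kmeans_cluster itself; even with s \to \infty, the overall speedup cannot exceed 1/(1-0.625) \approx 2.67.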

(Graph data; execution time in ms and speedup per compiler optimization level.)

                        O0        O1        O2        O3
Exec Time - Embedded    28.0922   4.2777    4.0575    3.7912
Exec Time - Desktop     1.2500    1.3000    0.0520    0.6000
Speedup - Embedded      1.0000    2.2718    2.2734    2.3147
Speedup - Desktop       1.0000    0.9756    2.4938    1.4815

Figure 7 - Original Code Execution Time and Speedup Graph


(Graph data; execution time in ms and speedup per compiler optimization level.)

                        O0        O1        O2        O3
Exec Time - Embedded    27.7583   4.2465    4.1340    3.7623
Exec Time - Desktop     3.2500    1.6500    0.4000    0.7500
Speedup - Embedded      0.9912    2.2770    2.2621    2.3196
Speedup - Desktop       0.9524    1.1765    1.8182    1.4286

Figure 8 - Unrolling and Constant Folding Execution Time and Speedup Graph

(Graph data; execution time in ms and speedup per compiler optimization level.)

                        O0        O1        O2        O3
Exec Time - Embedded    27.3621   4.0397    3.9007    3.5680
Exec Time - Desktop     1.9500    1.0000    0.5500    0.9500
Speedup - Embedded      1.0039    2.3074    2.2956    2.3484
Speedup - Desktop       0.8889    1.7391    1.9048    1.2500

Figure 9 - Unrolling and Jamming Execution Time and Speedup Graph


(Graph data; execution time in ms and speedup per compiler optimization level.)

                        O0        O1        O2        O3
Exec Time - Embedded    26.0837   4.0672    3.8152    4.0652
Exec Time - Desktop     1.7500    0.7500    0.9500    0.7000
Speedup - Embedded      1.0475    2.3013    2.3081    2.3654
Speedup - Desktop       0.9302    1.6667    1.2500    1.4815

Figure 10 - Unrolling, Jamming and Data Layout Execution Time and Speedup Graph

Optimization                        Platform   Metric            O0        O1        O2        O3
Original                            Embedded   Ticks             234219    35666     33830     31609
                                               Total Run (ms)    28.0922   4.2777    4.0575    3.7912
                                               K Means Run (ms)  23.1217   2.4253    2.4143    2.1221
                                               Speedup           1.0000    2.2718    2.2734    2.3147
                                    Desktop    Total Run (ms)    2.0000    1.5500    1.5500    0.7500
                                               K Means Run (ms)  1.2500    1.3000    0.0520    0.6000
                                               Speedup           1.0000    0.9756    2.4938    1.4815
Unrolling and Constant Folding      Embedded   Ticks             231435    35404     34467     31368
                                               Total Run (ms)    27.7583   4.2465    4.1340    3.7623
                                               K Means Run (ms)  23.5914   2.3882    2.4958    2.0880
                                               Speedup           0.9912    2.2770    2.2621    2.3196
                                    Desktop    Total Run (ms)    3.2500    1.6500    0.4000    0.7500
                                               K Means Run (ms)  1.3500    0.9500    0.3500    0.6500
                                               Speedup           0.9524    1.1765    1.8182    1.4286
Unrolling and Jamming               Embedded   Ticks             228132    33681     32523     29749
                                               Total Run (ms)    27.3621   4.0397    3.9007    3.5680
                                               K Means Run (ms)  23.1151   2.1733    2.2557    1.8915
                                               Speedup           1.0039    2.3074    2.2956    2.3484
                                    Desktop    Total Run (ms)    1.9500    1.0000    0.5500    0.9500
                                               K Means Run (ms)  1.5000    0.4000    0.3000    0.8500
                                               Speedup           0.8889    1.7391    1.9048    1.2500
Unrolling, Jamming and Data Layout  Embedded   Ticks             217473    33911     31810     33894
                                               Total Run (ms)    26.0837   4.0672    3.8152    4.0652
                                               K Means Run (ms)  21.5718   2.2160    2.1685    1.7773
                                               Speedup           1.0475    2.3013    2.3081    2.3654
                                    Desktop    Total Run (ms)    1.7500    0.7500    0.9500    0.7000
                                               K Means Run (ms)  1.4000    0.4500    0.8500    0.6000
                                               Speedup           0.9302    1.6667    1.2500    1.4815

Table 1 - Detailed execution times for Desktop and Embedded, considering the code optimizations applied
5.2 PERFORMANCE RESULTS FOR ALL OPTIMIZATIONS
This second analysis (Figure 11) provides results for all the optimizations. The graph shows the execution times split into: train phase, test phase, and train and test phases combined. In the horizontal axis: Inline refers to the optimization of section 4.3.4 alone; Kernel to the optimization of section 4.3.5 alone; Float to defining the double type as float, without other optimizations; Others to the optimizations reported in section 5.1; and All to all the optimizations combined. All the results were obtained without any of the compiler optimization levels (O0 was used). As can be observed, the Inline optimization is the one with the best results. In the Train and Test combination, applying All optimizations even worsens the result of the Inline optimization alone. The tendency of the Test Phase alone to show worse results is maintained. As expected, using the float type instead of double does not yield better performance on a powerful architecture like the desktop; for more impact, this variant needs to be measured on the embedded system.

(Graph data; execution time in seconds per configuration.)

                  Original   Inline    Kernel    Float     Others    All
Test Phase        0.0014     0.0011    0.0013    0.00145   0.0013    0.0011
Train Phase       0.00035    0.0001    0.00025   0.0002    0.0004    0.0001
Train and Test    0.00045    0.00025   0.00025   0.0003    0.00035   0.00035

Figure 11 - Execution Times for All Optimizations in Desktop
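
The Float configuration can be produced without touching the algorithm code by centralizing the numeric type behind a build flag; a minimal sketch (flag and typedef names are ours, not from the repository):

/* Compile with -DUSE_FLOAT to switch every real-valued variable
   from double to single precision. */
#if defined(USE_FLOAT)
typedef float real_t;
#else
typedef double real_t;
#endif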

6 CONCLUSIONS
This application, when converted to C and optimized, seems well suited to real-time image tracking needs. It is difficult for the ZedBoard to compete with the desktop architecture used in the tests. There was a lack of time to test the last two optimizations on the ZedBoard, so that needs to be done in future work. The MinGW toolchain does not provide an accurate clock function to measure the results with the precision available on the embedded system, and the use of gprof can lead to biased results because of the lack of precision noted in section 4.2.

The desktop machine shows great performance in all the graphs; the desktop compiler optimizes so well that the hand-made implementations worsen the speedup results in all cases (the Desktop Speedup values below 1 in Table 1, highlighted blue in the original). Even on the embedded hardware, the speedup improvements are not significant (the corresponding Embedded Speedup values in Table 1, highlighted orange in the original). The two previous statements consider the fairest case: the comparison between manually implemented optimizations and the original code. Comparing a combination of compiler and manual optimizations against the original code does not seem a fair way to evaluate the speedup evolution, because the compiler does most of the optimization.

Two interesting observations to report: (1) when performing loop unrolling optimizations without combining them with constant folding, the execution times worsen significantly on both hardware platforms; only after folding the additional index resulting from the loop unrolling do the improvements appear. (2) The Matlab implementation takes 407 ms to run, excluding the point generation and plot functions. Relative to that, the non-optimized C version on the desktop machine is 203.5 times faster, and the non-optimized version running on the embedded system is 14.88 times faster than the Matlab.

Although the optimizations in section 5.1 do not improve the efficiency of the code, the optimizations evaluated in section 5.2, namely the Inline optimization, make the code about 0.55 times faster than the original version. Better results were expected from the optimization of the kernel function; a possible explanation is that the compiler recognizes the static part and applies the same strategy itself. An intriguing question is the time gap between the test and train phases; even more intriguing is the fact that the test phase, the lightest in terms of operations executed, shows the worst results when run alone. It is important to address this question in future work and research.

7 REFERENCES
Asvadi, A., et al. (2011a). Efficient object tracking using optimized K-means segmentation and radial basis function neural networks. Int. J. Inf. Commun. Technol., 29-39.

Asvadi, A., et al. (2011b). Improved object tracking using radial basis function neural networks. 2011 7th Iranian Conference on Machine Vision and Image Processing (pp. 1-5). IEEE.

Orr, M. J. (1996). Introduction to radial basis function networks. Technical Report, Center for Cognitive Science, University of Edinburgh.

Press, W. H. (2007). Numerical Recipes 3rd Edition: The Art of Scientific Computing. Cambridge University Press.
