
Project 3: Optimizing the Alpha Blending Code for Neon SIMD Processing

Rishi Hora 001080823 rhora@ncsu.edu ECE785

Description of Base Case


CPU clock frequency: The clock frequency was set to 300 MHz by default, but because the board was powered over the USB cable the actual frequency was approximately 297 MHz.
Performance: With these settings, the execution time of the unmodified code was:

beaglebone:/home/rishi/QC_SIMD_Project/base# ./a.out fore.rgba back.rgba out.rgba
Routine took 188066 microseconds


The cpufreq-set -f 1000MHz command was used to set the frequency to 1000 MHz, and the governor was set to performance with the cpufreq-set -g performance command. At 1000 MHz the execution time was:

beaglebone:/home/rishi/QC_SIMD_Project/base# ./a.out fore.rgba back.rgba out.rgba
Routine took 59596 microseconds


At the nominal clock speed of 1000 MHz and this execution time, the unoptimized code ran for 59040566 cycles. The actual speed of the processor was read with the cat /proc/cpuinfo command:

root@beaglebone:~# cat /proc/cpuinfo
processor       : 0
model name      : ARMv7 Processor rev 2 (v7l)
BogoMIPS        : 990.68
Features        : swp half thumb fastmult vfp edsp thumbee neon vfpv3 tls
CPU implementer : 0x41
CPU architecture: 7
CPU variant     : 0x3
CPU part        : 0xc08
CPU revision    : 2
Hardware        : Generic AM33XX (Flattened Device Tree)
Revision        : 0000
Serial          : 0000000000000000

Optimizations
-ftree-vectorize (Optimization #1)
This flag enables auto-vectorization: the compiler vectorizes every loop that it can prove is safe to vectorize. In this code the memory accesses are predictable and the loop body contains no conditional branches, so the compiler recognizes the pattern and auto-vectorizes wherever possible.
Performance Impact
There was a considerable speedup due to this optimization.

beaglebone:/home/rishi/QC_SIMD_Project# ./project1 fore.rgba back.rgba out.rgba
Routine took 10285 microseconds


Total Cycles: 10189144 cycles
Speedup: (59596 - 10285) / 59596 * 100 = 82.74%
Analysis
The inner loop dominated the execution time. Vectorizing it with the NEON unit produced this speedup.

Use of Intrinsic Instructions & arm_neon.h (Optimization #2)


Intrinsics are used where the NEON unit needs to be accessed directly from C. The arm_neon.h header file was also included in the code; it provides the NEON vector data types that the intrinsics operate on.
Performance Impact
There was negligible impact on performance.
Analysis
Since vectorization was already taking place automatically, the intrinsics did not add to the performance of the code, and this optimization was abandoned.

Run Fast Mode (Optimization #3)


RunFast mode is enabled by rewriting the floating-point status and control register (FPSCR); in this mode floating-point instructions can be executed directly by the NEON unit. The code sets the register with inline assembly (asm), which is emitted into the program as written.
Performance Impact
The effect on performance was not large, but the inserted instructions can be seen directly in the generated assembly code, which was interesting to observe.
Analysis
Since the blend loop uses only integer arithmetic and was already being vectorized automatically, this mode did not add to the performance of the code, and this optimization was abandoned.

Restrict keyword (Optimization #4)


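The restrict keyword promises the compiler that the output array is accessed only through its own pointer, so its address space does not overlap that of the other arrays. The compiler can then reorder and combine loads and stores without guarding against overlap.
Performance Impact
There was a small speedup due to this optimization.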

beaglebone:/home/rishi/QC_SIMD_Project# ./project1 fore.rgba back.rgba out.rgba
Routine took 9895 microseconds


Total Cycles: 9802779 cycles
Speedup: (10285 - 9895) / 10285 * 100 = 3.8%

Reducing Redundant Multiplications and Divisions (Optimization #5)


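There were multiple redundant instructions in the code, such as multiplications and divisions by constants, most of which the compiler recognized and removed. But when the dstImage array was reassembled, values that had earlier been divided by 256 (shifted right by 8 bits) were then shifted left again to reach their channel position, i.e. multiplied back. For the red and green channels the right shift was therefore dropped and the left shift reduced to match: red is shifted left by 8 bits, and green is not shifted at all.
Performance Impact
There was some speedup due to this optimization.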

beaglebone:/home/rishi/QC_SIMD_Project# ./project1 fore.rgba back.rgba out.rgba
Routine took 9692 microseconds


Total Cycles: 9601671 cycles
Speedup: (9895 - 9692) / 9895 * 100 = 2.1%
Analysis
Some redundant multiplications were removed, but since they were few in number the speedup was small.

Summary
Overall performance improvement:
Initial Cycles: 59040566 cycles
Final Cycles: 9601671 cycles
Speedup: (59596 - 9692) / 59596 * 100 = 83.74%
Which single optimization gave the largest improvement?
The optimization that gave the largest improvement was the inclusion of the vectorization flags. It enabled the loop to be vectorized and allowed the NEON unit to process 4 words simultaneously. Using intrinsics was not that useful, as the code was already being vectorized automatically, so that optimization was abandoned. Overall, the execution time of the code was reduced by 83.74%.

Makefile

PROJ_NAME = project1
CC = gcc
VECTFLAGS = -ftree-vectorize -ffast-math -fsingle-precision-constant -ftree-vectorizer-verbose=2 -mvectorize-with-neon-quad -mfloat-abi=softfp -mfpu=neon
CFLAGS = -Wall -O3 -march=armv7-a -mtune=cortex-a8 $(VECTFLAGS) -funroll-loops
LIBS = -lm -lrt
OBJFILES := $(patsubst %.c,%.o,$(wildcard *.c))

$(PROJ_NAME): $(OBJFILES)
	# echo $(OBJFILES)
	$(CC) -o $(PROJ_NAME) $(OBJFILES) $(LIBS)

%.o: %.c
	$(CC) $(CFLAGS) -c -o $@ $<

%.lst: %.c
	$(CC) $(CFLAGS) -Wa,-adhln $(LIBS) $< > $@

clean:
	rm -f *.o *.lst

alpha_time.c

#include <stdio.h>
#include <sys/time.h>
#include <arm_neon.h>

void alphaBlend_c(int *fgImage, int *bgImage, int *dstImage);

int backImage[512 * 512];
int foreImage[512 * 512];
int newImage[512 * 512];

void enable_runfast()
{
    static const unsigned int x = 0x04086060;
    static const unsigned int y = 0x03000000;
    int r;
    asm volatile (
        "fmrx %0, fpscr   \n\t"  //r0 = FPSCR
        "and  %0, %0, %1  \n\t"  //r0 = r0 & 0x04086060
        "orr  %0, %0, %2  \n\t"  //r0 = r0 | 0x03000000
        "fmxr fpscr, %0   \n\t"  //FPSCR = r0
        : "=r"(r)
        : "r"(x), "r"(y)
    );
}

int main(int argc, char **argv)
{
    FILE *fgFile, *bgFile, *outFile;
    int result;
    struct timeval oldTv, newTv;

    //enable_runfast();

    if (argc != 4) {
        fprintf(stderr, "Usage:%s foreground background outFile\n", argv[0]);
        return 1;
    }
    fgFile = fopen(argv[1], "rb");
    bgFile = fopen(argv[2], "rb");
    outFile = fopen(argv[3], "wb");

    if (fgFile && bgFile && outFile) {
        result = fread(backImage, 512*sizeof(int), 512, bgFile);
        if (result != 512) {
            fprintf(stderr, "Error with backImage\n");
            return 3;
        }
        result = fread(foreImage, 512*sizeof(int), 512, fgFile);
        if (result != 512) {
            fprintf(stderr, "Error with foreImage\n");
            return 4;
        }
        gettimeofday(&oldTv, NULL);
        alphaBlend_c(&foreImage[0], &backImage[0], &newImage[0]);
        gettimeofday(&newTv, NULL);
        /* Include tv_sec so the result stays correct if the second rolls over
           between the two readings. */
        fprintf(stdout, "Routine took %d microseconds\n",
                (int)((newTv.tv_sec - oldTv.tv_sec) * 1000000 +
                      (newTv.tv_usec - oldTv.tv_usec)));

        fwrite(newImage, 512*sizeof(int), 512, outFile);
        fclose(fgFile);
        fclose(bgFile);
        fclose(outFile);
        return 0;
    }
    fprintf(stderr, "Problem opening a file\n");
    return 2;
}

#define A(x) (((x) & 0xff000000) >> 24)
#define R(x) (((x) & 0x00ff0000) >> 16)
#define G(x) (((x) & 0x0000ff00) >> 8)
#define B(x) ((x) & 0x000000ff)

void alphaBlend_c(int *fgImage, int *bgImage, int* __restrict dstImage)
{
    int x, pos, y;
    for (y = 0; y < 512; y++) {
        for (x = 0; x < 512; x++) {
            /*for(xx = 0; xx< 4; xx++) { pos[xx]= y*512 }*/
            //pos = (y*512)+x;
            pos = (y*512)+x;

            int a_fg = A(fgImage[pos]);

            /* dst_r and dst_g are left scaled by 256; the packing below
               moves them into place without a separate divide. */
            int dst_r = ((R(fgImage[pos]) * a_fg) + (R(bgImage[pos]) * (255-a_fg)));
            int dst_g = ((G(fgImage[pos]) * a_fg) + (G(bgImage[pos]) * (255-a_fg)));
            int dst_b = ((B(fgImage[pos]) * a_fg) + (B(bgImage[pos]) * (255-a_fg)))>>8;
            dstImage[pos] = 0xff000000 |
                            (0x00ff0000 & (dst_r << 8)) |
                            (0x0000ff00 & (dst_g)) |
                            (0x000000ff & (dst_b));
        }
    }
}
