Parallel Image Processing Using MPI

By: Zhaoyang Dong


Given a data file that holds the output of a very simple edge-detection algorithm, the problem is to reconstruct the original image by iterating the reverse calculation. The calculation is computationally demanding and, in the parallel code, requires many boundary swaps. The inverse operation also closely resembles many real scientific HPC calculations that solve partial differential equations with iterative algorithms such as Jacobi or Gauss-Seidel. These properties make the image-processing problem well suited to the study of message-passing programming with MPI.

The primary aim of the coursework is to write an MPI code for the problem that uses a two-dimensional domain decomposition and non-blocking communications for halo-swapping. I have written the program and, based on extensive tests, found that it works correctly. Moreover, as the number of processors increases, both the total time and the average time per iteration decrease.

The outline of this report is as follows. The next section briefly describes the design and implementation of the program. Section 3 evaluates the program, and I conclude in Section 4.


Design and Implementation

The Outline of the Algorithm

The input is a data file that represents the output of an edge-detection algorithm applied to a greyscale image of size M × N, and the output is the original image as a PGM file. The outline of the algorithm is as follows:

1. Given the image size M × N and the number of processors P, find an appropriate two-dimensional decomposition Px, Py such that P == Px*Py, where Px and Py denote the number of processors in the first and second dimension respectively.
2. Read the edge data into memory. Each processor is responsible for (M/Px) × (N/Py) elements of edge data.
3. Reconstruct the image iteratively, using non-blocking communications.
4. Write the result to the PGM file.

The following sections describe these steps in more detail.



Decompose Image

Given the image size M × N and the number of processors P, the aim of this routine is to decompose the image so as to minimize the communication overhead. The output is Px and Py, the number of processors in the first and second dimension respectively. The communication overhead is proportional to the number of elements involved in swapping. In the first dimension this number is 2*(Px-1)*N, and in the second dimension it is 2*(Py-1)*M, so the total is 2*((Px-1)*N + (Py-1)*M). This equals 2*(Px*N + Py*M) minus the constant 2*(M+N), so minimizing it is equivalent to minimizing Px*N + Py*M. Thus, the task is to find Px and Py such that P == Px*Py and Px*N + Py*M is minimal, which is easily done by trying every factor pair of P. For simplicity, it is assumed that M is exactly divisible by Px and N is exactly divisible by Py. For example, if M is 192, N is 360 and P is 8, the candidate pairs (Px, Py) = (1, 8), (2, 4), (4, 2) and (8, 1) give Px*N + Py*M = 1896, 1488, 1824 and 3072, so this criterion selects Px = 2 and Py = 4.
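The search over factor pairs can be sketched in a few lines of C. This is a minimal illustration under the assumptions stated above, not the routine used in the program: the function name choose_decomposition and its return convention are hypothetical.

```c
#include <assert.h>

/* Hypothetical helper, not the report's code: try every factor pair
 * Px*Py == P and keep the pair that minimises Px*N + Py*M, skipping pairs
 * that do not divide the image evenly. Returns 0 on success, or -1 if no
 * pair satisfies the divisibility assumption. */
int choose_decomposition(int M, int N, int P, int *px_out, int *py_out)
{
    assert(M > 0 && N > 0 && P > 0);
    long best_cost = -1;                 /* -1 means "no candidate yet" */
    for (int px = 1; px <= P; px++) {
        if (P % px != 0)
            continue;                    /* px is not a factor of P */
        int py = P / px;
        if (M % px != 0 || N % py != 0)
            continue;                    /* blocks would be uneven */
        long cost = (long)px * N + (long)py * M;
        if (best_cost < 0 || cost < best_cost) {
            best_cost = cost;
            *px_out = px;
            *py_out = py;
        }
    }
    return best_cost < 0 ? -1 : 0;
}
```

Because the loop runs over at most P candidates, the cost of this step is negligible next to the reconstruction itself.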


Read Data

Each processor is responsible for a block of the edge data, and the aim of this routine is to read that block into memory. For a 1D decomposition, a simple approach is to read the entire file into an array on the master process and then distribute it using MPI_Scatter; however, this does not work directly for a 2D decomposition, because each block is not contiguous in the master array. The best method may be parallel IO, i.e., all the processes read the edge data file in parallel. However, it is not easy to implement, and since the time for IO is small compared with the computation, I did not implement it.

The method I chose is similar to that for the 1D decomposition: the master process reads the entire edge data file into an array and distributes it among all the processes. The difference is the distribution method. The master process sends each process its data using the non-blocking synchronous send MPI_Issend, and each process receives its data using the standard receive MPI_Recv.

It is easy for the master process to read the entire edge data file into a master array of size M × N using datread. However, since the data belonging to a process is not contiguous in the master array, the problem is how to transfer the data to the other processes. To handle this, a derived datatype blockType is defined as follows:

    MPI_Datatype blockType;
    /* M/Px rows of N/Py floats each; consecutive rows start N floats apart */
    MPI_Type_vector(M/Px, N/Py, N, MPI_FLOAT, &blockType);
    MPI_Type_commit(&blockType);
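To make the layout described by blockType concrete, the following plain C sketch copies one block out of the master array by hand, with no MPI involved. The function name copy_block_out and the parameters row0 and col0 are hypothetical. The sketch also assumes that the process at Cartesian coordinates (cx, cy) owns the block whose top-left corner is at row cx*MP and column cy*NP, which the report does not state explicitly.

```c
#include <assert.h>
#include <stddef.h>

/* Copy the MP x NP block whose top-left corner is (row0, col0) out of a
 * row-major master array with N columns. It visits the same elements, in
 * the same order, as a vector type with count MP, block length NP and
 * stride N applied at &master[row0*N + col0]. */
void copy_block_out(const float *master, int N, int MP, int NP,
                    int row0, int col0, float *block)
{
    assert(master && block && NP <= N);
    const float *start = master + (size_t)row0 * N + col0;
    for (int i = 0; i < MP; i++)          /* one run per block row */
        for (int j = 0; j < NP; j++)      /* NP contiguous floats */
            block[i * NP + j] = start[(size_t)i * N + j];
}
```

With the derived datatype, the master can send one element of blockType starting at the same address instead of packing the block into a temporary buffer.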


Reconstruct Image

Reconstructing the image involves halo-swapping. To do this, each process must know its neighbouring processes. This is best achieved by defining a 2D Cartesian topology with non-periodic boundary conditions. The code to create the topology is as follows:

    int dims[2], periods[2];
    MPI_Comm cart_comm;
    dims[0] = Px; dims[1] = Py;
    periods[0] = periods[1] = 0;   /* non-periodic in both dimensions */
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 1, &cart_comm);

Then the neighbouring processes can be found with the following code:

    int down_rank, up_rank, right_rank, left_rank;
    MPI_Cart_shift(cart_comm, 0, 1, &up_rank, &down_rank);
    MPI_Cart_shift(cart_comm, 1, 1, &left_rank, &right_rank);

Suppose the edge data of a process is stored in a dynamically allocated array float *buf of size MP*NP, where MP equals M/Px and NP equals N/Py. Two other dynamically allocated arrays float *old, *new of size (MP+2)*(NP+2) are needed for reconstructing the image. The old and new arrays are larger than buf because they contain halo values. The old array is initialized with the following pseudocode, in which the loop must cover the halo cells as well as the interior:

    loop over i = 0, MP+1; j = 0, NP+1
        if (0==i || MP+1==i || 0==j || NP+1==j)
            old[i*(NP+2)+j] = 0;                      /* halo */
        else
            old[i*(NP+2)+j] = buf[(i-1)*NP+(j-1)];

Halo-swapping involves sending/receiving entire rows of the old array to/from the up_rank and down_rank processes, and entire columns to/from the left_rank and right_rank processes. The rows are easy to handle since the elements in the same row of the old array are adjacent in memory, while the elements of a column are not. To ease the sending and receiving of columns, a new derived datatype haloType is defined as follows:

    MPI_Datatype haloType;
    /* MP single floats, one per row, NP+2 floats apart */
    MPI_Type_vector(MP, 1, NP+2, MPI_FLOAT, &haloType);
    MPI_Type_commit(&haloType);

The iterative process for reconstructing the image can now be presented. It tries to overlap communication with computation. Each iteration is divided into the following seven steps:

1. Start swapping boundary elements. Because this step is non-blocking, its completion does not mean that the data have been swapped.
2. Calculate the non-boundary elements, which do not need neighbours' elements.
3. Finish swapping boundary elements. After this step, the neighbours' elements are available.
4. Calculate the boundary elements.
5. Copy new (which stores the result of the calculation) to old; halos are not copied.
6. At a user-defined interval, calculate the average value of the pixels in the reconstructed image and print it.
7. At a user-defined interval, check whether the image is sufficiently accurate, and if so, terminate the loop.
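The neighbour lookup performed by the two MPI_Cart_shift calls can be imitated in plain C, which may help when checking which process ends up next to which. The sketch relies on the row-major numbering that MPI uses for Cartesian topologies. The names grid_rank, grid_neighbours and NO_NEIGHBOUR are hypothetical, and NO_NEIGHBOUR is only a stand-in for MPI_PROC_NULL, whose actual value depends on the MPI implementation.

```c
#include <assert.h>

#define NO_NEIGHBOUR (-1)  /* stand-in for MPI_PROC_NULL in this sketch */

/* Rank of the process at (cx, cy) in a Px x Py grid numbered in row-major
 * order, or NO_NEIGHBOUR when the coordinates fall outside the grid
 * (non-periodic boundaries). */
int grid_rank(int cx, int cy, int Px, int Py)
{
    if (cx < 0 || cx >= Px || cy < 0 || cy >= Py)
        return NO_NEIGHBOUR;
    return cx * Py + cy;
}

/* Fill in the four neighbour ranks of `rank`, mirroring the two shifts:
 * dimension 0 gives up/down, dimension 1 gives left/right. */
void grid_neighbours(int rank, int Px, int Py,
                     int *up, int *down, int *left, int *right)
{
    assert(Px > 0 && Py > 0 && rank >= 0 && rank < Px * Py);
    int cx = rank / Py;    /* coordinate in the first dimension */
    int cy = rank % Py;    /* coordinate in the second dimension */
    *up    = grid_rank(cx - 1, cy, Px, Py);
    *down  = grid_rank(cx + 1, cy, Px, Py);
    *left  = grid_rank(cx, cy - 1, Px, Py);
    *right = grid_rank(cx, cy + 1, Px, Py);
}
```

In the MPI code, a send to or receive from MPI_PROC_NULL completes immediately without transferring data, so processes on the edge of the grid need no special case and their outer halos keep the initial value 0.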

The code of step 1 is as follows:

    MPI_Request req[8];
    /* rows: exchange with the up and down neighbours */
    MPI_Irecv(old+(MP+1)*(NP+2)+1, NP, MPI_FLOAT, down_rank, 0, cart_comm, &req[0]);
    MPI_Isend(old+(NP+2)+1, NP, MPI_FLOAT, up_rank, 0, cart_comm, &req[4]);
    MPI_Irecv(old+1, NP, MPI_FLOAT, up_rank, 1, cart_comm, &req[1]);
    MPI_Isend(old+MP*(NP+2)+1, NP, MPI_FLOAT, down_rank, 1, cart_comm, &req[5]);
    /* columns: exchange with the left and right neighbours */
    MPI_Irecv(old+1*(NP+2)+(NP+1), 1, haloType, right_rank, 2, cart_comm, &req[2]);
    MPI_Isend(old+1*(NP+2)+1, 1, haloType, left_rank, 2, cart_comm, &req[6]);
    MPI_Irecv(old+1*(NP+2), 1, haloType, left_rank, 3, cart_comm, &req[3]);
    MPI_Isend(old+1*(NP+2)+NP, 1, haloType, right_rank, 3, cart_comm, &req[7]);

The code of step 2 is as follows:

    float diff, delta = 0;
    loop over i = 2, MP-1; j = 2, NP-1
        new[i*(NP+2)+j] = (old[(i-1)*(NP+2)+j] + old[(i+1)*(NP+2)+j]
                         + old[i*(NP+2)+(j-1)] + old[i*(NP+2)+(j+1)]
                         - buf[(i-1)*NP+(j-1)]) * 0.25;
        if (0 == loop%CHECKFREQ) { /* accumulate the local delta every CHECKFREQ iterations */
            diff = new[i*(NP+2)+j] - old[i*(NP+2)+j];
            delta += diff*diff;
        }

Step 3 is straightforward: it simply waits for all the non-blocking sends and receives to finish:

    MPI_Status status[8];
    MPI_Waitall(8, req, status);

Once the neighbouring elements have been obtained, the boundary elements can be calculated. The code is as follows (the loop increments MP-1 and NP-1 assume that MP and NP are at least 2):

    for (i = 1; i <= MP; i += MP-1) {        /* first and last rows */
        for (j = 1; j <= NP; j++) {
            new[i*(NP+2)+j] = (old[(i-1)*(NP+2)+j] + old[(i+1)*(NP+2)+j]
                             + old[i*(NP+2)+(j-1)] + old[i*(NP+2)+(j+1)]
                             - buf[(i-1)*NP+(j-1)]) * 0.25;
            if (0 == loop%CHECKFREQ) {
                diff = new[i*(NP+2)+j] - old[i*(NP+2)+j];
                delta += diff*diff;
            }
        }
    }
    for (i = 2; i < MP; i++) {               /* first and last columns, corners excluded */
        for (j = 1; j <= NP; j += NP-1) {
            new[i*(NP+2)+j] = (old[(i-1)*(NP+2)+j] + old[(i+1)*(NP+2)+j]
                             + old[i*(NP+2)+(j-1)] + old[i*(NP+2)+(j+1)]
                             - buf[(i-1)*NP+(j-1)]) * 0.25;
            if (0 == loop%CHECKFREQ) {
                diff = new[i*(NP+2)+j] - old[i*(NP+2)+j];
                delta += diff*diff;
            }
        }
    }

Step 5 updates the old array for the next iteration. The code is as follows:

    loop over i = 1, MP; j = 1, NP
        old[i*(NP+2)+j] = new[i*(NP+2)+j];

The code of step 6 is as follows:

    float pixelValue, avgPixelValue;
    if (0 == loop%PRINTFREQ) { /* calculate every PRINTFREQ iterations */
        pixelValue = 0;
        loop over i = 1, MP; j = 1, NP
            pixelValue += new[i*(NP+2)+j];
        pixelValue /= (MP*NP);
        MPI_Reduce(&pixelValue, &avgPixelValue, 1, MPI_FLOAT, MPI_SUM, 0, cart_comm);
        if (0 == my_rank) { /* the root process prints the value */
            avgPixelValue /= size;
            printf("iteration %d , average value of pixels: %f\n", loop, avgPixelValue);
        }
    }

The final step calculates the global value of delta and, if it is less than a user-defined threshold, terminates the loop. The check is made only at intervals in order to reduce the overhead of calculating and checking delta. Because delta is accumulated only on every CHECKFREQ-th iteration, the reduction and the test must be guarded by the same condition; otherwise delta would be zero on the other iterations and the loop would terminate at once. The code is as follows:

    float gdelta;
    if (0 == loop%CHECKFREQ) {
        MPI_Allreduce(&delta, &gdelta, 1, MPI_FLOAT, MPI_SUM, cart_comm);
        gdelta = sqrt(gdelta/(float)(M*N));
        if (gdelta < THRESHOLD) break;
    }

After the iteration has terminated, the old array is copied to buf, excluding halos. The code is as follows:

    loop over i = 1, MP; j = 1, NP
        buf[(i-1)*NP+(j-1)] = old[i*(NP+2)+j];
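The arithmetic of one process can be checked without MPI by running the same update on a single block whose halo stays at zero. The sketch below is a serial illustration, not the program's code: the function name reconstruct_serial and its parameters are hypothetical, the array called new in the report is named next here, and the convergence test compares the mean squared change with threshold squared, which is equivalent to comparing its square root with the threshold.

```c
#include <assert.h>
#include <stdlib.h>

/* Serial sketch of the iteration on one MP x NP block with a fixed zero
 * halo (no neighbours, so nothing is exchanged). `buf` holds the edge data
 * on entry and the reconstructed image on exit. Convergence is tested
 * every check_freq iterations. Returns the number of iterations run, or
 * -1 if allocation fails. */
int reconstruct_serial(float *buf, int MP, int NP,
                       int max_iter, int check_freq, float threshold)
{
    assert(buf && MP > 0 && NP > 0 && check_freq > 0);
    int W = NP + 2;                            /* row length with halo */
    size_t cells = (size_t)(MP + 2) * W;
    float *old = calloc(cells, sizeof *old);   /* calloc zeroes the halo */
    float *next = calloc(cells, sizeof *next);
    if (!old || !next) { free(old); free(next); return -1; }

    for (int i = 1; i <= MP; i++)              /* interior starts as edge data */
        for (int j = 1; j <= NP; j++)
            old[i * W + j] = buf[(i - 1) * NP + (j - 1)];

    int iters = 0;
    while (iters < max_iter) {
        iters++;
        int check = (iters % check_freq == 0);
        float delta = 0.0f;
        for (int i = 1; i <= MP; i++) {
            for (int j = 1; j <= NP; j++) {
                next[i * W + j] = 0.25f * (old[(i - 1) * W + j] + old[(i + 1) * W + j]
                                         + old[i * W + j - 1] + old[i * W + j + 1]
                                         - buf[(i - 1) * NP + (j - 1)]);
                if (check) {                   /* accumulate squared change */
                    float diff = next[i * W + j] - old[i * W + j];
                    delta += diff * diff;
                }
            }
        }
        for (int i = 1; i <= MP; i++)          /* copy interior only */
            for (int j = 1; j <= NP; j++)
                old[i * W + j] = next[i * W + j];
        if (check && delta / (float)(MP * NP) < threshold * threshold)
            break;
    }

    for (int i = 1; i <= MP; i++)              /* result back into buf */
        for (int j = 1; j <= NP; j++)
            buf[(i - 1) * NP + (j - 1)] = old[i * W + j];
    free(old);
    free(next);
    return iters;
}
```

The fixed point of the update satisfies edge = (sum of the four neighbours) - 4 * pixel, so feeding the function edge data generated from a known image with that formula gives a simple way to check that the image is recovered.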


Write Data

In this step, each process transfers its final buf array to masterbuf on the master process with a call to MPI_Ssend, and the master process writes out the final image by passing masterbuf to pgmwrite.
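On the master side, each received block has to land in the correct rows and columns of masterbuf. A plain C sketch of that placement follows. The function name copy_block_in and the parameters row0 and col0 are hypothetical, and the sketch again assumes that the process at Cartesian coordinates (cx, cy) owns the block starting at row cx*MP and column cy*NP.

```c
#include <assert.h>
#include <stddef.h>

/* Copy a contiguous MP x NP block into rows row0..row0+MP-1 and columns
 * col0..col0+NP-1 of a row-major master array with N columns. */
void copy_block_in(float *masterbuf, int N, int MP, int NP,
                   int row0, int col0, const float *block)
{
    assert(masterbuf && block && NP <= N);
    float *start = masterbuf + (size_t)row0 * N + col0;
    for (int i = 0; i < MP; i++)          /* one destination row at a time */
        for (int j = 0; j < NP; j++)
            start[(size_t)i * N + j] = block[i * NP + j];
}
```

If the master instead receives one element of blockType directly at the block's starting address in masterbuf, MPI performs the same strided placement and no temporary buffer is needed.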


Evaluation

In this section, I evaluate the program. To ease comparison among the tests, PRINTFREQ is fixed at 400, CHECKFREQ at 100, and THRESHOLD at 0.1. Extensive tests were run on several different edge data files with similar results, so I present the results for one file only, edge192x360.dat. The numbers of processors tested are 4, 6, 8, 10, 12, 15, 16, 20, 24, 27, 30, and 32.

To verify the correctness of my parallel code, I also revised the provided 1D decomposition code so that it terminates the calculation under the same condition and prints the average pixel value of the reconstructed image at the same interval. Two kinds of evidence support the correctness of my parallel code:

1. For the same edge data file, in all the tests, the diff command reports that the output image of my code is exactly the same as that of the provided 1D code.
2. For the same edge data file, in all the tests, the printed average pixel values of the reconstructed image are exactly the same.

Having shown the correctness, I now present the performance results. Table 1 summarizes the time (ms) consumed by the tests for edge192x360.dat on different numbers of processors. The total time is divided into four parts: decompose, read, reconstruct, and write. The final column shows the average time per iteration, which I consider the best quantity for measuring parallel performance.

Table 1. Time consumed (ms) by the parallel code on different numbers of processors

P    decompose   read     reconstruct   write    total      avg/iter
4    0.718       88.948   3421.262      82.565   3593.519   0.633
6    0.663       91.103   2511.957      83.843   2687.594   0.465
8    0.909       90.864   1896.337      83.29    2071.428   0.351
10   3.061       91.445   1562.791      85.08    1742.409   0.289
12   1.868       92.619   1357.217      86.578   1538.311   0.251
15   1.292       94.086   1148.648      87.605   1331.663   0.213
16   1.613       94.728   1070.5        88.021   1254.896   0.198
20   7.207       92.908   925.685       89.292   1115.128   0.171
24   3.32        94.272   820.574       92.733   1010.935   0.152
27   23.801      93.618   750.597       93.099   961.149    0.139
30   4.424       91.92    699.382       90.715   886.474    0.129
32   4.816       93.889   662.993       90.596   852.325    0.123

Figure 1 illustrates the breakdown of the total time. It is apparent from the figure that in all cases the reconstruct time dominates the total time (by Table 1, from about 95% of the total at 4 processors to about 78% at 32). This suggests that the simple IO strategy is adequate and that more sophisticated parallel IO would bring only limited further improvement.

Figure 1. Breakdown of total time. (Vertical axis: time (ms); horizontal axis: the number of processors, 4 to 32; series: Write, Reconstruct, Read, Decompose.)

Figure 2. The total time. (Vertical axis: total time (ms); horizontal axis: the number of processors, 4 to 32.)

Figure 3. The average time per iteration. (Vertical axis: the average time per iteration (ms); horizontal axis: the number of processors, 4 to 32.)

Figure 2 illustrates the total time as the number of processors increases, and Figure 3 illustrates the average time per iteration as the number of processors increases. Both figures show the same trend: as the number of processors increases, the performance of the code improves. For example, as the number of processors increases from 4 to 8, the average time per iteration falls from 0.633 ms to 0.351 ms (by a factor of 1.803); from 4 to 24, it falls by a factor of 4.164; and from 4 to 32, by a factor of 5.146. However, the benefit of adding processors diminishes: with ideal scaling these factors would be 2, 6 and 8 respectively. The reason may be that, as the number of processors increases, each processor has less work while the communication overheads increase.
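The factors quoted above are speedups relative to the 4-processor run. Dividing each by the increase in processor count gives a relative parallel efficiency, which makes the diminishing return easier to see. The two helper functions below are a hypothetical illustration and are not part of the program.

```c
#include <assert.h>

/* Speedup of a run relative to a baseline run: t_base / t. */
double speedup(double t_base, double t)
{
    assert(t > 0.0);
    return t_base / t;
}

/* Parallel efficiency relative to the baseline: the speedup divided by
 * the increase in process count (1.0 would be ideal scaling). */
double relative_efficiency(double t_base, int p_base, double t, int p)
{
    assert(p_base > 0 && p > 0);
    return speedup(t_base, t) * (double)p_base / (double)p;
}
```

Applied to the avg/iter column of Table 1 with the 4-processor run as the baseline, these give relative efficiencies of about 0.90 at 8 processors, 0.69 at 24 and 0.64 at 32.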


Conclusion

This report has briefly described the design and implementation of my program for an image-processing problem, which employs a two-dimensional domain decomposition and non-blocking communications. Extensive tests have shown that the parallel code works correctly and that its performance improves as the number of processors increases.

