Solving a linear system with an N × N coefficient matrix is a fundamental problem
in many scientific disciplines, ranging from computer science and data analysis to computational
economics. There are many methods for solving a linear system, including direct and iterative
methods. In this paper we study parallelization based on the direct method for solving a
linear system. We focus on the Gauss-Jordan method as a kernel that can be used to solve the
above scientific problems. Many parallelization schemes of the Gauss-Jordan method have been
studied in the literature on modern high-performance computer systems, such as multicomputers
(i.e., clusters of workstations) and multicore systems (i.e., dual/quad cores), using standard
interfaces such as MPI or parallel extensions of C/C++ such as OpenMP. Serge
presents a version adapted to MIMD machines; N. Melab et al. give a version suited to MARS,
which is a kind of network of workstations; and Aouad et al. and Shang et al. present intra-step
parallel versions on grid platforms using the YML framework. Further, experimental results of
Gauss-Jordan on cluster and grid computing environments using the MPI library have also been
reported. On the other hand, there are multicore implementations of other, similar
problems of linear algebra, such as factorization algorithms for solving systems of equations
(LU, QR and Cholesky). We recently developed an implementation of the pipeline technique
for the LU factorization algorithm using OpenMP. However, there have been no research efforts
on implementing the Gauss-Jordan method on multicore systems using the OpenMP interface.
The goal of this paper is to implement the Gauss-Jordan method using the pipelining technique in
OpenMP and to compare the performance of the proposed implementation with two strategies
based on naive parallelizations of the original algorithm (row block and row
cyclic data distribution) on a multicore platform. We must note that implementations of the
pipelining technique in OpenMP for different application domains have not been reported in
the past research literature. We also propose a mathematical model of performance for the
pipelined implementation of the Gauss-Jordan method, and we verify this model with
extensive experimental measurements on a multicore platform.
The rest of the paper is organized as follows. In Section II, the Gauss-Jordan method for
solving a linear system is outlined. In Section III, the pipelined implementation and the
corresponding performance model are discussed. In Section IV, experimental results are
presented. Finally, some conclusions are drawn in the last section.
We consider the linear system Ax = b, where A is the n × n coefficient matrix, b is the right-hand
side and x = (x0, x1, ..., xn−1)T is the vector of the unknowns.
In the Gauss-Jordan algorithm, a working matrix is first constructed by augmenting the matrix A
with b, obtaining the matrix (A|b) with n rows and n + 1 columns. The algorithm is then
executed in two phases. In the first phase, the augmented matrix is
transformed into a diagonal form, in which the elements both above and below the diagonal
element of each column are zero. In the second phase, each solution xi (0 ≤ i ≤ n − 1) is
computed by dividing the element in row i and column n of the augmented matrix (ai,n)
by the element in row i of the principal diagonal (ai,i). The serial version of the general
Gauss-Jordan algorithm for solving a linear system, shown in Algorithm 1, consists of three
nested loops, which we adopt for the parallel implementations in the remainder of this paper.
The transformation of the augmented matrix to diagonal form requires 4n³/3 scalar arithmetic
operations. Computing the solution from the diagonal form of the system requires approximately n
scalar arithmetic operations, so the total operation count of the sequential Gauss-Jordan algorithm
is 4n³/3 + n.
A. Implementation
We assume that the rows of matrix A are distributed among the p threads or cores such that
each core is assigned ⌈n/p⌉ contiguous rows of the matrix. The general idea of the pipelined
algorithm is that each thread executes the n successive steps of the Gauss-Jordan algorithm on
the rows that it holds. To do so, it must receive the index of the pivot row, forward it
immediately to the next thread, and then proceed with the steps of the Gauss-Jordan method.
Generally, the parallel algorithm is as follows:
B. Performance Analysis
The performance model of the pipelined algorithm depends on two main aspects: the
computational cost and the communication cost. On a multicore system, communications are
performed through direct Put()/Get() operations by two or more threads. An analytical model
based on the number of operations and their cost in CPU times can be used to determine the
computational cost. Similarly, to predict the communication cost we calculate the number of
get/put operations between threads and measure the number of cycles for put and get
operations.
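To make this concrete, one plausible form of such a model (the symbols here are our own illustration, not the paper's fitted constants) splits the parallel time into the two costs just described:

```latex
T_{par} \;\approx\; T_{comp} + T_{comm},
\qquad
T_{comp} \;\approx\; \frac{4n^{3}}{3p}\, t_c,
\qquad
T_{comm} \;\approx\; n\,(p-1)\, t_g,
```

where $t_c$ is the measured cost of one scalar arithmetic operation, $t_g$ is the measured cost of one get/put operation, the 4n³/3 operations of the elimination are divided evenly over the p threads, and at each of the n steps the pivot-row index is forwarded through the remaining p − 1 threads.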
EXPERIMENTAL RESULTS
A. System Platform and Experimental Process
For our experimental evaluation we used an Intel Core 2 Quad CPU with four processor
cores, a 2.40 GHz clock speed and 8 GB of memory. The system ran GNU/Linux, kernel
version 2.6, for the x86-64 ISA. All programs were implemented in C using the OpenMP
interface and were compiled using gcc, version 4.4.3, with the "-O2" optimization flag.
To evaluate the performance of the parallel algorithms, we used a set of randomly generated
input matrices with sizes ranging from 32 × 32 to 4096 × 4096.
To compare the parallel algorithms, the practical execution time was used as a measure.
Practical execution time is the total time in seconds an algorithm needs to complete the
computation, and it was measured using the omp_get_wtime() function of OpenMP. To reduce
random variation, the execution time was measured as an average of 50 runs.
The two naive parallelizations distribute the rows of the coefficient matrix among the
available threads using row block and row cyclic data distributions. In the row block data
distribution, the n × n coefficient matrix A is block-striped among the p threads or cores such
that each core is assigned ⌈n/p⌉ contiguous rows of the matrix. We call this algorithm
RowBlock. For this distribution we used the static schedule without a specified chunk size,
which means that OpenMP divides the iterations into p blocks of equal size ⌈n/p⌉ and
statically assigns them to the threads in a blockwise distribution. The allocation of iterations
is done at the beginning of the loop, and each thread executes only those iterations assigned
to it. In the row cyclic data distribution, the rows of matrix A are distributed among the p
threads or cores in a round-robin fashion.
We call this algorithm RowCyclic. For this distribution we used the static
schedule with a specified chunk size, which assigns blocks of bs iterations to the available
threads in a round-robin fashion, where bs takes the values 1, 2, 4, 8, 16, 32 and 64 rows.
Firstly, we evaluate the RowCyclic algorithm in order to examine the relation between its
execution time and the block size. For this reason, we ran the RowCyclic algorithm with
different block sizes. Tables I, II and III present the average execution time of the RowCyclic
algorithm for different block sizes on one, two and four cores, respectively. From these tables
we conclude that the execution time of the RowCyclic algorithm is not affected by changes in
the block size. This is due to the fact that the RowCyclic method has poor locality of
reference. For this reason we used the RowCyclic algorithm with a block size of 1 in the
comparison with the other two parallel algorithms.
Figure 1 shows the average execution time of all parallel algorithms for varying matrix sizes
on one, two and four cores. As can be seen from Figure 1, the execution time of all
algorithms increases as the matrix size increases. We observe that the parallel
implementations execute quickly on matrices of small sizes (from 32 to 2048) because these
matrices fit entirely in the L1 and L2 caches. For larger matrix sizes there is a slowdown in
the execution time because the matrices no longer fit in the cache, so accesses are served
primarily by the slower main memory.
Figure 2 presents how the performance of all parallel algorithms is affected when they are
run in parallel using OpenMP with 1 to 4 threads, for small, medium and large matrix
sizes. As can be seen, the performance of the algorithms improves with each additional
thread. We observe that the performance of the RowBlock and RowCyclic algorithms
increases at a decreasing rate. This is due to the fact that the implicit synchronization cost of
parallel 'for' loops (i.e., starting and stopping the parallel execution of the threads) dominates
the execution time; in this case, the cost of synchronization is on the order of n². On the other
hand, it is clear that the Pipe algorithm on two and four cores achieves an approximate
doubling and quadrupling of its performance. With the Pipe algorithm, the falloff in scaling
is much slower than with the other algorithms. Therefore, we conclude that the
parallel execution time of the Pipe algorithm scales with the number of threads, since the
total communication and overhead time is much lower than the processing time on each
thread. Finally, we expect that the performance of the Pipe algorithm will be better than that
of the other two algorithms for larger numbers of cores, such as 8 and 16.
As can be seen in the results, the ranking of the parallel algorithms is clear: the parallel
algorithm with the best performance is Pipe, the second best is RowBlock, and the third best
is RowCyclic.