
The solution of dense linear systems is an important scientific problem that appears as a kernel in many disciplines, ranging from computer science and data analysis to computational economics. There are many methods for solving a linear system, broadly classified as direct and iterative methods. In this paper we study parallelization based on a direct method for solving a linear system. We focus on the Gauss-Jordan method as a kernel that can be used to solve the above problem. Many parallelization schemes of the Gauss-Jordan method have been studied in the literature on modern high performance computer systems, such as multicomputers (i.e. clusters of workstations) and multicore systems (i.e. dual/quad cores), using standard interfaces such as MPI or parallel extensions of C/C++ such as OpenMP. Serge presents a version adapted to MIMD machines, N. Melab et al. give a version suited to MARS, a kind of network of workstations, and Aouad et al. and Shang et al. present an intra-step parallel version on grid platforms using the YML framework. Further, experimental results of Gauss-Jordan on cluster and grid computing environments using the MPI library are presented in and , respectively. On the other hand, there are multicore implementations of other, similar linear algebra problems, such as the factorization algorithms for solving systems of equations (LU, QR and Cholesky). We recently developed an implementation of the pipeline technique for the LU factorization algorithm using OpenMP. However, there have been no research efforts on implementing the Gauss-Jordan method on multicore systems using the OpenMP interface.
The goal of this paper is to implement the Gauss-Jordan method using the pipelining technique in OpenMP and to compare the performance of the proposed implementation with two strategies based on naive parallelizations of the original algorithm (row block and row cyclic data distribution) on a multicore platform. We must note that implementations of the pipelining technique in OpenMP for different application domains have not been reported in the past research literature. We also propose a mathematical performance model for the pipelined implementation of the Gauss-Jordan method, and we verify this model with extensive experimental measurements on a multicore platform.
The rest of the paper is organized as follows. In Section II, the Gauss-Jordan method for solving a linear system is outlined. In Section III, the pipelined implementation and the corresponding performance model are discussed. In Section IV, experimental results are presented. Finally, some conclusions are drawn in the last section.

II. GAUSS-JORDAN METHOD


Firstly, we describe the well-known Gauss-Jordan method for solving a system of linear equations. Consider the real linear algebraic system Ax = b, where A = (aij)n×n is a known nonsingular n×n matrix with nonzero diagonal entries, b = (b0, b1, ..., bn−1)T is the right-hand side and x = (x0, x1, ..., xn−1)T is the vector of the unknowns.
In the Gauss-Jordan algorithm, a working matrix is first constructed by augmenting the matrix A with b, obtaining the matrix (A|b) with n rows and n + 1 columns. This matrix is then transformed into diagonal form using elimination (elementary row) operations. The Gauss-Jordan algorithm is executed in two phases. In the first phase, the augmented matrix is transformed into a diagonal form in which the elements both above and below the diagonal element of a given column are zero. In the second phase, each solution xi (0 ≤ i ≤ n − 1) is computed by dividing the element in row i and column n of the augmented matrix (ai,n) by the element of the principal diagonal in row i (ai,i). The serial version of the general Gauss-Jordan algorithm for solving a linear system, shown in Algorithm 1, consists of three nested loops, which we adopt for the parallel implementations in the remainder of this paper. The transformation of the augmented matrix to diagonal form requires about 4n³/3 scalar arithmetic operations. Computing the solution from the diagonal form of the system requires approximately n scalar arithmetic operations, so the sequential run time of the Gauss-Jordan algorithm is about 4n³/3 + n.
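Algorithm 1 itself is not reproduced in this excerpt. The following C sketch is our own illustration of the two phases and the three nested loops just described; the function name gauss_jordan and the row-major n × (n+1) augmented array are presentation choices, and pivoting is omitted since nonzero diagonal entries are assumed.

#include <stdio.h>

/* Sketch of the serial Gauss-Jordan kernel described above.
 * a is the n x (n+1) augmented matrix (A|b), x is the solution vector. */
void gauss_jordan(int n, double a[n][n + 1], double x[n])
{
    /* Phase 1: eliminate above and below each pivot (three nested loops). */
    for (int k = 0; k < n; k++) {            /* pivot step              */
        for (int i = 0; i < n; i++) {        /* every other row         */
            if (i == k) continue;
            double m = a[i][k] / a[k][k];    /* elimination multiplier  */
            for (int j = k; j <= n; j++)     /* update remaining columns*/
                a[i][j] -= m * a[k][j];
        }
    }
    /* Phase 2: divide the right-hand side by the diagonal entries. */
    for (int i = 0; i < n; i++)
        x[i] = a[i][n] / a[i][i];
}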

III. PIPELINED IMPLEMENTATION AND PERFORMANCE MODEL


In this section, we present an OpenMP pipelined implementation of the Gauss-Jordan method using a queue data structure, and we also present a performance model.

A. Implementation
We assume that the rows of matrix A are distributed among the p threads or cores such that each core is assigned ⌈n/p⌉ contiguous rows of the matrix. The general idea of the pipelined algorithm is that each thread executes the n successive steps of the Gauss-Jordan algorithm on the rows that it holds. To do so, it must receive the index of the pivot row, forward it immediately to the next thread and then proceed with the elimination steps of the Gauss-Jordan method on its own rows. Generally, the parallel algorithm proceeds as follows:
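The paper's own listing of the parallel algorithm is not included in this excerpt. The sketch below is a hedged illustration of the idea just described, under our own assumptions: rows are block-distributed, the pivot row is passed around a ring of threads through single-producer/single-consumer queues, and each thread forwards the pivot to its successor before eliminating in its own rows. The type pivot_queue_t, the queue_put()/queue_get() helpers and all other names are illustrative, not the authors' code; the busy-wait in queue_get() stands in for whatever synchronization the actual Put()/Get() operations use.

#include <omp.h>
#include <stdatomic.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    double    *slots;   /* up to n messages of (n+1) doubles each */
    atomic_int tail;    /* messages produced so far               */
    int        head;    /* messages consumed (owner-local)        */
    int        width;   /* message length: n + 1                  */
} pivot_queue_t;

static void queue_put(pivot_queue_t *q, const double *row) {
    int t = atomic_load_explicit(&q->tail, memory_order_relaxed);
    memcpy(q->slots + (size_t)t * q->width, row, q->width * sizeof(double));
    atomic_store_explicit(&q->tail, t + 1, memory_order_release);
}

static const double *queue_get(pivot_queue_t *q) {
    while (atomic_load_explicit(&q->tail, memory_order_acquire) <= q->head)
        ;               /* busy-wait for the predecessor thread   */
    return q->slots + (size_t)q->head++ * q->width;
}

void gauss_jordan_pipe(int n, double a[n][n + 1], double x[n])
{
    int p = omp_get_max_threads();
    pivot_queue_t *q = malloc(p * sizeof *q);
    for (int t = 0; t < p; t++) {
        q[t].slots = malloc((size_t)n * (n + 1) * sizeof(double));
        atomic_init(&q[t].tail, 0);
        q[t].head  = 0;
        q[t].width = n + 1;
    }

    #pragma omp parallel num_threads(p)
    {
        int t    = omp_get_thread_num();
        int rows = (n + p - 1) / p;                    /* ceil(n/p) rows each */
        int lo   = t * rows;
        int hi   = (lo + rows < n) ? lo + rows : n;

        for (int k = 0; k < n; k++) {
            int owner = k / rows;
            const double *pivot;
            if (t == owner) {
                pivot = a[k];                          /* my own pivot row    */
                if (p > 1)
                    queue_put(&q[(t + 1) % p], pivot); /* inject into ring    */
            } else {
                pivot = queue_get(&q[t]);              /* receive the pivot   */
                if ((t + 1) % p != owner)
                    queue_put(&q[(t + 1) % p], pivot); /* forward immediately */
            }
            for (int i = lo; i < hi; i++) {            /* eliminate own rows  */
                if (i == k) continue;
                double m = a[i][k] / pivot[k];
                for (int j = k; j <= n; j++)
                    a[i][j] -= m * pivot[j];
            }
        }
        for (int i = lo; i < hi; i++)                  /* second phase        */
            x[i] = a[i][n] / a[i][i];
    }

    for (int t = 0; t < p; t++) free(q[t].slots);
    free(q);
}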

B. Performance Analysis
The performance model of the pipeline algorithm depends on two main aspects: the computational cost and the communication cost. In the case of a multicore system, communications are performed through direct Put()/Get() operations between two or more threads. An analytical model based on the number of operations and their cost in CPU time can be used to determine the computational cost. Similarly, to predict the communication cost we count the number of get/put operations between threads and measure the number of cycles required for a put and a get operation.
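The actual model (Equation 3) is not reproduced in this excerpt. Purely as an illustration of how such a model can be assembled from the quantities above, one plausible, hedged form that reuses the paper's serial operation count is

    T_pipe(n, p) ≈ ((4n³/3 + n) / p) · telim + n (p − 1) · tcomm,

where telim is the time per scalar elimination operation, tcomm is the time to pass one pivot message between two threads, and p is the number of threads; the first term is the computational cost per thread and the second is the pivot-forwarding (communication) cost. This is an illustrative reconstruction, not necessarily the exact form of Equation 3.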

IV. EXPERIMENTAL RESULTS
A. System Platform and Experimental Process
For our experimental evaluation we used an Intel Core 2 Quad CPU with four processor cores, a 2.40 GHz clock speed and 8 GB of memory. The system ran GNU/Linux, kernel version 2.6, for the x86_64 ISA. All programs were implemented in C using the OpenMP interface and were compiled using gcc, version 4.4.3, with the "-O2" optimization flag. A set of randomly generated input matrices with sizes ranging from 32 × 32 to 4096 × 4096 was used to evaluate the performance of the parallel algorithms.
To compare the parallel algorithms, the practical execution time was used as the performance measure. Practical execution time is the total time in seconds an algorithm needs to complete the computation, and it was measured using the omp_get_wtime() function of OpenMP. To reduce random variation, the execution time was reported as the average of 50 runs.
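As a brief illustration of this measurement methodology, the following sketch (our own; run_algorithm() is a placeholder, not a function from the paper) times a kernel with omp_get_wtime() and averages over 50 runs.

#include <omp.h>
#include <stdio.h>

/* Dummy stand-in for one of the parallel kernels (RowBlock, RowCyclic
 * or Pipe); replace with the real call when measuring. */
static void run_algorithm(int n) { (void)n; }

/* Wall-clock time via omp_get_wtime(), averaged over `runs` repetitions. */
static double average_time(int n, int runs)
{
    double total = 0.0;
    for (int r = 0; r < runs; r++) {
        double start = omp_get_wtime();
        run_algorithm(n);                  /* kernel under test */
        total += omp_get_wtime() - start;
    }
    return total / runs;                   /* seconds per run   */
}

int main(void)
{
    for (int n = 32; n <= 4096; n *= 2)    /* matrix sizes used above */
        printf("n = %4d: %.6f s\n", n, average_time(n, 50));
    return 0;
}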

B. Analysis Based on Experimental Results


We must note that we compare the performance of the pipelined implementation to two naive parallel algorithms, which are based on two different schemes for distributing the data among the available threads: row block and row cyclic. In the row block data distribution, the n × n coefficient matrix A is block-striped among the p threads or cores such that each core is assigned ⌈n/p⌉ contiguous rows of the matrix. We call this algorithm RowBlock. For the row block data distribution we used the static schedule without a specified chunk size, which implies that OpenMP divides the iterations into p blocks of (roughly) equal size ⌈n/p⌉ and assigns them statically to the threads in a blockwise distribution. The allocation of iterations is done at the beginning of the loop, and each thread executes only those iterations assigned to it. In the row cyclic data distribution, the rows of matrix A are distributed among the p threads or cores in a round-robin fashion. We call this algorithm RowCyclic. For the row cyclic data distribution we used the static schedule with a specified chunk size, which assigns blocks of bs iterations to the available threads in a round-robin fashion, where bs takes the values 1, 2, 4, 8, 16, 32 and 64 rows. The corresponding scheduling clauses are sketched below.
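A minimal sketch (our own illustration, not the paper's listing) of how the two naive schemes map onto OpenMP scheduling clauses: the outer step loop stays serial and only the row-elimination loop of each step k is parallelized; the schedule clause is the only difference between RowBlock and RowCyclic.

#include <omp.h>

/* RowCyclic variant: schedule(static, bs) deals blocks of bs rows to the
 * threads in a round-robin fashion.  RowBlock is the same code with
 * schedule(static) and no chunk size (one contiguous block of ~n/p rows
 * per thread). */
void gauss_jordan_rowcyclic(int n, double a[n][n + 1], double x[n], int bs)
{
    for (int k = 0; k < n; k++) {
        #pragma omp parallel for schedule(static, bs)
        for (int i = 0; i < n; i++) {
            if (i == k) continue;
            double m = a[i][k] / a[k][k];
            for (int j = k; j <= n; j++)
                a[i][j] -= m * a[k][j];
        }
    }
    #pragma omp parallel for schedule(static, bs)
    for (int i = 0; i < n; i++)
        x[i] = a[i][n] / a[i][i];
}

The implicit barrier at the end of each parallel for loop in this sketch corresponds to the per-step synchronization cost discussed below.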
Firstly, we evaluate the RowCyclic algorithm in order to examine the relation between its execution time and the block size. For this reason, we ran the RowCyclic algorithm for different block sizes. Tables I, II and III present the average execution time of the RowCyclic algorithm for different block sizes on one, two and four cores, respectively. From these tables we conclude that the execution time of the RowCyclic algorithm is not affected by changes in the block size. This is due to the fact that the RowCyclic method has poor locality of reference regardless of the block size. For this reason we used the RowCyclic algorithm with a block size of 1 for the comparison with the other two parallel algorithms.
Figure 1 shows the average execution time of all parallel algorithms for varying matrix sizes on one, two and four cores. As can be seen from Figure 1, the execution time of all algorithms increases as the matrix size increases. We observe that the parallel implementations execute quickly for small matrix sizes (from 32 to 2048) because these matrices fit entirely in the L1 and L2 caches. For large matrix sizes there is a slowdown in execution time because the matrices are larger than the cache (which is the more likely case in practice), so accesses are served primarily by the slower main memory.
Figure 2 presents how the performance of all parallel algorithms is affected when they are executed in parallel using OpenMP with 1 to 4 threads, for small, medium and large matrix sizes. As can be seen, the performance of the algorithms improves with each additional thread. We observe that the performance of the RowBlock and RowCyclic algorithms increases at a decreasing rate. This is due to the fact that the implicit synchronization cost of the parallel 'for' loops (i.e. starting and stopping the parallel execution of the threads) dominates the execution time; in this case, the cost of synchronization is about n². On the other hand, it is clear that the performance of the Pipe algorithm on two and four cores approximately doubles and quadruples, respectively. With the Pipe algorithm, the drop-off in scaling is much slower than with the others. Therefore, we conclude that the parallel execution time of the Pipe algorithm scales with the number of threads, since the total communication and overhead time is much lower than the processing time on each thread. Finally, we expect that the performance of the Pipe algorithm will be better than that of the other two algorithms for larger numbers of cores, such as 8 and 16.
As can be seen from the results, the ranking of the parallel algorithms is clear: the parallel algorithm with the best performance is Pipe, the second best is RowBlock and the third best is RowCyclic.

C. Evaluation of the Performance Model


The final experiment was conducted to verify the correctness of the proposed performance model for the pipeline algorithm. The performance model of Equation 3 is plotted in Figure 3 using measured time parameters. In order to obtain these predicted results, we determined the time parameters for the elimination and communication operations (i.e. telim and tcomm) for different matrix sizes. These parameters of the target machine were determined experimentally, with the results shown in Figure 4. More specifically, the telim and tcomm parameters were measured by executing the serial Gauss-Jordan algorithm 50 times for each matrix size and taking the average of these 50 runs.
As can be seen from Figure 3, the predicted execution times are quite close to the measured ones. This verifies that the proposed performance model for the pipeline algorithm is fairly accurate and hence provides a means to assess the viability of the pipeline implementation on any multicore system (i.e. dual core, quad core) without the burden of actual testing. Further, the proposed performance model is able to predict the parallel performance and the general behavior of the implementation. However, there remain minor differences between the measured and predicted results for the pipeline implementation.
