
Parallel Computing

Alcántara Díaz Diego Alejandro, Peñaloza Plata Gonzalo
dialaldi@hotmail.com, gpenalozap@hotmail.com
Universidad Autónoma Metropolitana, www.azc.uam.mx
March 15, 2013

Abstract
Nowadays, with the necessity of solving computing problems at an ever faster rate, and with problems eventually becoming larger, the Parallel Computing scheme is a very good and powerful solution to this kind of problem. This article presents the theoretical framework, examples, and conclusions of this quarterly project, a small demonstration of the Parallel Computing scheme. The project referred to by this article is an implementation of an algorithm for calculating the pixels inside a circle using matrices.

1 Introduction

The tools used include the C programming language in a UNIX oriented environment: sequential programming in C using files to save the data obtained by the algorithm; the understanding and later use of threads in a UNIX oriented environment for the same algorithm, using a different number of threads in each execution, reducing its execution time while obtaining the same arithmetical results, and at the same time monitoring these results; and finally the implementation of a Parallel Virtual Machine (PVM) to enhance the algorithm's execution time, emulating its performance when executed on several computers at the same time, each one with a part of the algorithm.

2 State of the Art

Threads.- Threads are sequential processes that share memory. They represent a key concurrency model supported by modern computers, programming languages, and operating systems. Many general-purpose parallel architectures in use today (such as symmetric multiprocessors, SMPs) are direct hardware realizations of the thread abstraction. Some applications can very effectively use threads. So-called embarrassingly parallel applications essentially spawn multiple independent processes, such as build tools (PVM gmake, for example) or web servers. Because of the independence of these applications, programming is relatively easy, and the abstraction being used is more like processes than threads (where memory is not shared). Where such applications do share data, they do so through database abstractions, which manage concurrency through such mechanisms as transactions.[3]

Thread or Process Synchronization.- Process synchronization is required when one process must wait for another to complete some operation before proceeding. For example, one process (called a writer) may be writing data to a certain main memory area, while another process (a reader) may be reading data from that area and sending it to the printer. The reader and writer must be synchronized so that the writer does not overwrite data the reader has not yet consumed. Both writer and reader may be threads. The presence of multiple threads in an application opens up potential issues regarding safe access to resources from multiple threads of execution. Two threads modifying the same resource might interfere with each other in unintended ways. One thread might overwrite another thread's changes or put the application into an unknown and potentially invalid state. Sometimes the corrupted resource might cause obvious performance problems or crashes that are relatively easy to track down and fix. However, the corruption may cause subtle errors that do not manifest themselves until much later, or the errors might require a significant overhaul of your underlying coding assumptions. In cases where several threads must interact, the use of process synchronization helps to ensure that they do so safely.[4][5]
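To make the writer/reader interaction above concrete, the following is a minimal sketch (not part of the project's code) of two threads synchronized with a POSIX mutex and condition variable; the names buffer, ready, lock and cond are illustrative only.

#include <pthread.h>
#include <stdio.h>

int buffer;            /* shared memory area written by the writer */
int ready = 0;         /* flag: data is available for the reader   */
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;

void *writer(void *arg) {
    pthread_mutex_lock(&lock);
    buffer = 42;                  /* write the data                     */
    ready = 1;
    pthread_cond_signal(&cond);   /* tell the reader the data is ready */
    pthread_mutex_unlock(&lock);
    return NULL;
}

void *reader(void *arg) {
    pthread_mutex_lock(&lock);
    while (!ready)                /* wait until the writer has finished */
        pthread_cond_wait(&cond, &lock);
    printf("read %d\n", buffer);
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void) {
    pthread_t r, w;
    pthread_create(&r, NULL, reader, NULL);
    pthread_create(&w, NULL, writer, NULL);
    pthread_join(r, NULL);
    pthread_join(w, NULL);
    return 0;
}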

Shared Memory.- Shared memory is an efficient means of passing data between programs. One program will create a memory portion which other processes (if permitted) can access.[6] Shared memory is the fastest form of IPC available. Once the memory is mapped into the address space of the processes that are sharing the memory region, no kernel involvement occurs in passing data between the processes. What is normally required, however, is some form of synchronization between the processes that are storing and fetching information to and from the shared memory region.[7]

Multi-Core and Multi-Threaded Architectures.- In a multi-core architecture, multiple processors are placed on the same chip. These processors have two levels of cache memory. Each processor on that chip typically has its own level 1 cache, but they share a common level 2 cache. Processors can communicate efficiently through the shared level 2 cache, avoiding the need to go through memory and to invoke the cumbersome cache coherence protocol. In a multi-threaded architecture, a single processor may execute two or more threads at once. Many modern processors have substantial internal parallelism. They can execute instructions out of order, or in parallel (e.g., keeping both fixed- and floating-point units busy), or even execute instructions speculatively before branches or data have been computed. To keep hardware units busy, multithreaded processors can mix instructions from multiple streams. Modern processor architectures combine multi-core with multi-threading, where multiple individually multi-threaded cores may reside on the same chip.[2]
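As a minimal illustration of the shared-memory IPC described above (not taken from the project itself), the sketch below uses the System V shmget/shmat calls, in the spirit of [6][7], to share an integer between a parent and a child process; synchronization is reduced to a simple wait() for brevity.

#include <stdio.h>
#include <stdlib.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    /* create an anonymous shared segment large enough for one int */
    int shmid = shmget(IPC_PRIVATE, sizeof(int), IPC_CREAT | 0600);
    if (shmid < 0) { perror("shmget"); return 1; }

    int *shared = (int *) shmat(shmid, NULL, 0); /* map it into our address space */

    if (fork() == 0) {      /* child: the "writer" */
        *shared = 123;      /* no kernel call is needed to pass the data */
        shmdt(shared);
        exit(0);
    }
    wait(NULL);             /* crude synchronization: wait for the child */
    printf("parent read %d from shared memory\n", *shared);

    shmdt(shared);
    shmctl(shmid, IPC_RMID, NULL);  /* remove the segment */
    return 0;
}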

PVM.- PVM (Parallel Virtual Machine) is a software package that permits a heterogeneous collection of Unix and/or Windows computers hooked together by a network to be used as a single large parallel computer. Thus large computational problems can be solved more cost effectively by using the aggregate power and memory of many computers. The software is very portable. The source, which is available free through netlib, has been compiled on everything from laptops to CRAYs. PVM enables users to exploit their existing computer hardware to solve much larger problems at minimal additional cost. Hundreds of sites around the world are using PVM to solve important scientific, industrial, and medical problems, in addition to PVM's use as an educational tool to teach parallel programming. With tens of thousands of users, PVM has become the de facto standard for distributed computing world-wide.[1]

3 Development

The tools used for the testing of the program are as follows:

- UNIX based Operating System (Debian 2.6.32 x86).
- PVM Library version 3.
- GCC Compiler version 4.5.

3.1 Technical considerations

For the development of the program it was assumed that:

- The matrices used in the tests of the program were square matrices.
- The size of the matrices was different in the various tests.
- There were two nested loops to scroll through all the elements of the matrices. This algorithm has a complexity of O(N²).

3.2 Sequential Implementation

In this project phase the algorithm was implemented in the C programming language, so that each instruction is executed sequentially, respecting the execution order of the algorithm. The algorithm for calculating pixels within a circle is:

c = N/2
r2 = radius * radius
for (i = 0; i < N; i++) {
    y  = i - c
    y2 = y * y
    for (j = 0; j < N; j++) {
        x  = j - c
        x2 = x * x
        if (x2 + y2 <= r2) {
            /* the pixel (i,j) is within the circle */
        }
    }
}

The program starts by reading the data concerning the circle radius and the size of the matrix from the entrada.txt file. The next step is to assign memory to the matrix according to its size. Once we have the matrix, the main procedure of the algorithm takes place. This procedure uses two nested loops and goes through each one of the matrix elements. Using the distance from the coordinates to the center of the matrix and the radius value of the circle, it determines whether the current position is inside the circle or not. If the current position is not within the area of the circle, the zero value is assigned to the corresponding matrix element; otherwise the element value will be 1. As part of the study, the system time is obtained before starting the procedure and again at the end of the procedure. The difference between these two measurements is considered the execution time. This part of the code is delimited to ensure that the measurement of time is unaffected by the processes of reading and writing the result data. Finally the resulting matrix is written to the output.txt file, and the log.txt file contains the time history resulting from executing the program with different inputs. The resulting times with different matrix sizes can be seen in Table 1. For the execution time graph see Figure 1.

    N      Execution Time (s)
    500    0
    1500   0.01
    3000   0.08
    5000   0.23
    7500   0.52
    10000  0.95
    12000  1.37
    15000  2.11

Table 1: Sequential Implementation Results
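The following is a compact, self-contained sketch of how this sequential phase could look in C, including the clock()-based timing described above. It is a reconstruction, not the authors' exact source: the file names entrada.txt, output.txt and log.txt come from the text, but the I/O details are assumptions.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void) {
    int N, r, i, j;
    FILE *in = fopen("entrada.txt", "r");      /* circle radius and matrix size */
    if (!in || fscanf(in, "%d %d", &r, &N) != 2) return 1;
    fclose(in);

    int **m = malloc(N * sizeof(int *));
    for (i = 0; i < N; i++)
        m[i] = malloc(N * sizeof(int));

    int c  = N / 2;
    int r2 = r * r;

    clock_t inicio = clock();                  /* timing delimits only the core */
    for (i = 0; i < N; i++) {
        int y = i - c, y2 = y * y;
        for (j = 0; j < N; j++) {
            int x = j - c, x2 = x * x;
            m[i][j] = (x2 + y2 <= r2) ? 1 : 0; /* 1 inside the circle, 0 outside */
        }
    }
    clock_t fin = clock();
    float tiempo = (float)(fin - inicio) / CLOCKS_PER_SEC;

    FILE *out = fopen("output.txt", "w");      /* resulting matrix */
    for (i = 0; i < N; i++) {
        for (j = 0; j < N; j++) fprintf(out, "%d ", m[i][j]);
        fprintf(out, "\n");
    }
    fclose(out);

    FILE *log = fopen("log.txt", "a");         /* time history per run */
    fprintf(log, "N=%d t=%.2f s\n", N, tiempo);
    fclose(log);

    for (i = 0; i < N; i++) free(m[i]);
    free(m);
    return 0;
}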

3.3 Implementation by Threads

In this case the algorithm is implemented using threads (pthreads). In essence the program structure is very similar to the previous one: it starts by reading the data, after that the matrix is processed, and later the results are written to their respective files. The main difference in this implementation is at its core, which is now processed by a varying number of threads, so a reduction of the execution time is expected. Each thread works with a similar portion of the matrix; the size of said portion depends on the number of threads to execute. Once again the core of the algorithm consists of two nested loops that travel across all of the matrix elements; in the case of threads, at the time of creation each one receives as parameters the values of the variables it needs to run properly. The execution time is taken from the beginning of the execution of the threads until each one of them has finished. Only when every thread has finished its execution does the writing of the results begin. It is necessary to wait for each thread to finish its task to avoid creating data dependencies and erroneous results.

Analyzing the algorithm used, it is seen to be highly parallelizable, primarily because our working dataset is a matrix whose values are independent of each other; therefore, there are no data dependencies. The creation of the threads for our program is as described below:

#define numhilos 4

int **m;
int N, r, i, j;
clock_t inicio, fin;
float tiempo;
pthread_t hilos[numhilos];

N++;
m = (int **) malloc(sizeof(int *) * N);
for (i = 0; i < N; i++){
    m[i] = (int *) malloc(sizeof(int) * N);
}
for (i = 0; i < numhilos; i++) {       /* each thread gets the row range [alfa, beta] */
    vector[i].id   = i;
    vector[i].r2   = r * r;
    vector[i].N    = N;
    vector[i].c    = N / 2;
    vector[i].alfa = (i * N) / numhilos + 1;
    vector[i].beta = ((i + 1) * N) / numhilos;
}
inicio = clock();
for (i = 0; i < numhilos; i++){
    pthread_create(&hilos[i], NULL, (void *) recorrido, (void *)&vector[i]);
}
for (i = 0; i < numhilos; i++){
    pthread_join(hilos[i], NULL);
}

The resulting times with different matrix sizes and different numbers of threads can be seen in Table 2 and Table 3. For the execution time graph see Figure 1.

    N      Execution Time (s)
           2 Threads   3 Threads
    500    0           0
    1500   0.02        0.03
    3000   0.09        0.09
    5000   0.28        0.27
    7500   0.63        0.63
    10000  1.12        1.12
    12000  1.62        1.64
    15000  2.56        2.55

Table 2: Results of the Implementation by Threads (Part 1)

    N      Execution Time (s)
           4 Threads
    500    0
    1500   0.03
    3000   0.09
    5000   0.28
    7500   0.63
    10000  1.13
    12000  1.63
    15000  2.56

Table 3: Results of the Implementation by Threads (Part 2)

[Figure 1: Sequential and Threaded Execution Time Graph]
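The struct behind vector[] and the thread routine recorrido are not shown in the article; a plausible reconstruction, consistent with the fields initialized in the listing above, might look as follows. The struct name datos_hilo is invented for illustration, m is the global matrix, and alfa/beta are interpreted as 1-based inclusive row bounds so that the threads cover rows 0..N-1 exactly.

typedef struct {
    int id;            /* thread index                       */
    int r2;            /* squared radius                     */
    int N;             /* matrix dimension                   */
    int c;             /* center coordinate (N/2)            */
    int alfa, beta;    /* first and last row for this thread */
} datos_hilo;

datos_hilo vector[numhilos];
extern int **m;        /* matrix shared by all threads */

void *recorrido(void *arg) {
    datos_hilo *d = (datos_hilo *) arg;
    int i, j;
    for (i = d->alfa - 1; i < d->beta; i++) {  /* only this thread's rows */
        int y = i - d->c, y2 = y * y;
        for (j = 0; j < d->N; j++) {
            int x = j - d->c, x2 = x * x;
            m[i][j] = (x2 + y2 <= d->r2) ? 1 : 0;
        }
    }
    return NULL;
}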

3.4 Implementation with PVM

Continuing with the algorithm parallelization options, at this stage of the project we turned to PVM. The coding involves dividing the code into two separate programs. The first program is known as the master, which is described below:
#define ESCLAVOS 4

int mytid;
int slave_tid[ESCLAVOS];
int resultado;
int no_esclavo;
int p[5];
int filas;
int N, r, i;
int *m;

m = (int *) malloc(N * N * sizeof(int));
filas = N / ESCLAVOS;
mytid = pvm_mytid();
inicio = clock();
for (no_esclavo = 0; no_esclavo < ESCLAVOS; no_esclavo++){
    resultado = pvm_spawn("gslave", NULL, PvmTaskDefault, "", 1,
                          &slave_tid[no_esclavo]);
    if (resultado != 1){
        printf("Error: No se pudo crear el esclavo\n");
        pvm_exit();
        return 0;
    }
    p[0] = no_esclavo * N * filas;           /* first element of the slave's block */
    p[1] = (no_esclavo + 1) * N * filas - 1; /* last element of the slave's block  */
    p[2] = N;
    p[3] = r;
    p[4] = no_esclavo + 1;
    pvm_initsend(PvmDataDefault);
    pvm_pkint(p, 5, 1);
    pvm_pkint(&m[p[0]], p[1] - p[0] + 1, 1);
    pvm_send(slave_tid[no_esclavo], 0);
}
for (no_esclavo = 0; no_esclavo < ESCLAVOS; no_esclavo++) {
    /* wait for and receive the slave's result */
    pvm_recv(slave_tid[no_esclavo], 0);
    p[0] = no_esclavo * N * filas;
    p[1] = (no_esclavo + 1) * N * filas - 1;
    /* unpack the slave's result */
    pvm_upkint(&m[p[0]], p[1] - p[0] + 1, 1);
}
pvm_exit();
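The excerpt starts the timer with inicio = clock(), but the closing measurement is not shown. Following the procedure described for the sequential version, the elapsed time would presumably be computed after the last result is unpacked, along these lines (a hedged completion, not the authors' code):

clock_t fin;
float tiempo;

fin = clock();                                   /* after all slaves have replied */
tiempo = (float)(fin - inicio) / CLOCKS_PER_SEC;
printf("N = %d, tiempo = %.2f s\n", N, tiempo);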

The master is responsible for reading the input data for the program and for sending to a subset of programs, known as the slaves, the information necessary for them to do their work. The second program, known as the slave, is described below:

#include <stdio.h>
#include <stdlib.h>
#include <pvm3.h>

int main(){
    int N, r;
    int *m;
    int mytid;
    int parent_tid;
    int p[5];
    int i;

    mytid = pvm_mytid();
    parent_tid = pvm_parent();
    pvm_recv(parent_tid, 0);
    pvm_upkint(p, 5, 1);
    r = p[3];
    N = p[2];
    m = (int *) malloc(N * N * sizeof(int));
    pvm_upkint(&m[p[0]], p[1] - p[0] + 1, 1);

    int c;
    int x, y, x2, y2, r2;
    c = N / 2;
    for (i = p[0]; i <= p[1]; i++) {
        y = i / N - c;       /* row and column relative to the center;  */
        x = i % N - c;       /* the original listing omits the "- c"    */
        x2 = x * x;
        y2 = y * y;
        r2 = r * r;
        if ((x2 + y2) <= r2) { m[i] = 9; }
        else                 { m[i] = 7; }
    }
    pvm_initsend(PvmDataDefault);
    pvm_pkint(&m[p[0]], p[1] - p[0] + 1, 1);
    pvm_send(parent_tid, 0);
    pvm_exit();
    return 0;
}

The number of slaves varies and depends on the hardware capabilities and the way they are declared in the master program. Once each slave program receives from the master the data to be processed, the slaves begin to execute the algorithm and ultimately send a message to the master program along with the processed information. The master receives the results of each of the slaves, unites them, processes them, and displays them to the user. The resulting times with different matrix sizes and four slaves can be seen in Table 4. The relationship between the execution time and the number of slaves can be seen in Figure 2.

    N      Execution Time (s)
    500    0
    1500   0.08
    3000   0.32
    5000   0.93
    7500   1.99
    10000  3.46
    12000  4.99
    15000  7.93

Table 4: Results of the Implementation with PVM

[Figure 2: PVM Implementation Graph]

4 Conclusions

Upon completion of the three phases of the project we can see that, despite the use of methods to parallelize the algorithm, no significant reduction is obtained in the execution time, and it is even possible to see an increase in time in some cases. These apparently negative results are the consequence of a straightforward implementation on a single workstation with certain characteristics. Processes with threads would ideally be implemented on a computer with more than one processor. It is also necessary to take into account that the result is expressed as a matrix, which is processed in the form of segments by each thread, but it is not possible to present a result without waiting for each one of the threads to finish its execution. In the case of the implementation using PVM, it would be best to have several computers connected in a network, so that each slave is executed on one of the nodes in the network. In this case it is important to take into account that execution time is spent on sending information from the master to each one of the slaves, and as soon as these finish their execution, more time is used to send the result obtained by each of them back to the master. The results do not indicate that parallelizing algorithms is inconvenient. It should be remembered that the problem must be studied beforehand, regarding size and resources, to determine under what conditions the optimizations will be obtained, besides using the proper equipment for the problem we want to solve.

References
[1] Oak Ridge National Laboratory, Computer Science and Mathematics Division, [Online], http://www.csm.ornl.gov/pvm/, [Consultation: March 12, 2013].

[2] Maurice Herlihy, Nir Shavit, The Art of Multiprocessor Programming, First Edition, Elsevier, Burlington, Massachusetts, United States of America, 2008.

[3] Edward A. Lee, The Problem with Threads, [Online], Electrical Engineering and Computer Sciences, January 10, 2006, http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-1.html, [Consultation: March 14, 2013].

[4] Encyclopedia Britannica, [Online], http://global.britannica.com/EBchecked/topic/477740/process-synchronization, [Consultation: March 14, 2013].

[5] Mac Developer Library, [Online], https://developer.apple.com/library/mac/documentation/Cocoa/Conceptual/Multithreading/ThreadSafety/ThreadSafety.html, [Consultation: March 14, 2013].

[6] IPC: Shared Memory, [Online], http://www.cs.cf.ac.uk/Dave/C/node27.html, [Consultation: March 14, 2013].

[7] W. Richard Stevens, UNIX Network Programming, Volume 2: Interprocess Communications, Second Edition, Prentice Hall, Upper Saddle River, New Jersey, United States of America, 1999.