
Module 1

Introduction to Parallel and


Distributed Computing
Module 1
Subtopic 1
Introduction to Parallel and
Distributed Computing
• To discuss the evolution of the software crisis
• To recognize different hardware architectures based on
Flynn’s taxonomy
• To determine the theoretical speedup when using
multiple processors, using Amdahl’s Law
“To put it quite bluntly: as long as there were no
machines, programming was no problem at all; when we
had a few weak computers, programming became a mild
problem, and now we have gigantic computers,
programming has become an equally gigantic problem”

-- E. Dijkstra, 1972 Turing Award Lecture


• Time frame: ’60s and ’70s
• Problem: Assembly Language Programming
• Computers could handle larger, more complex programs
• Needed to get Abstraction and Portability without
losing performance
• High Level Languages for von-Neumann Machines
• Fortran and C
• Provide “common machine language” for
uniprocessors
• Time Frame: ’80s and ’90s

• Problem: Inability to build and maintain complex and
robust applications requiring millions of lines of
code developed by hundreds of programmers
• Computers could handle larger, more complex programs
• Needed to get Composability, Malleability and
Maintainability
• High performance was not an issue → left to Moore’s Law
• Object Oriented Programming
• C++, C# and Java

• Better tools
• Component Libraries

• Better software engineering methodology


• Design patterns, specification, testing, code reviews
• Solid boundary between Hardware and Software
• Programmers don’t have to know anything about the processor
• High level languages abstract away the processors
• Java bytecode is machine independent
• Moore’s law does not require programmers to know anything about the
processors to get good speedups
• Programs are oblivious of the processor → they work on all
processors
• A program written in the ’70s in C still works and is much faster today
• This abstraction provides a lot of freedom for the
programmers
• Time Frame: 2005 to 20??
• Problem: Sequential performance is left behind by Moore’s law

• Needed continuous and reasonable performance improvements
• to support new features
• to support larger datasets
• While sustaining portability, malleability and maintainability without
unduly increasing complexity faced by the programmer
• → critical to keep up with the current rate of evolution in software


Image: https://upload.wikimedia.org/wikipedia/commons/thumb/0/00/Moore%27s_Law_Transistor_Count_1970-2020.png/1280px-Moore%27s_Law_Transistor_Count_1970-2020.png
12th Gen Intel® Core™ Processors Product Brief
AMD Ryzen™ 5000 Series Processors | Fastest in the Game | AMD
• Flynn's taxonomy is a classification of computer architectures, proposed by Michael J.
Flynn in 1966 and extended in 1972.
• The classification system has stuck, and it has been used as a tool in the design of modern
processors and their functionalities.
• An SISD computing system is a uniprocessor machine which is capable
of executing a single instruction, operating on a single data stream.
• In SISD, machine instructions are processed in a sequential manner and computers adopting this
model are popularly called sequential computers. Most conventional computers have SISD
architecture. All the instructions and data to be processed have to be stored in primary memory.
• An SIMD system is a multiprocessor machine capable of
executing the same instruction on all the CPUs but operating
on different data streams.
• Machines based on an SIMD model are well suited to scientific computing since they
involve lots of vector and matrix operations.
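As an illustration (not from the original slides), the SISD/SIMD distinction can be sketched in C++. The scalar loop below processes one element per instruction (SISD-style), while the second loop uses x86 SSE intrinsics so that a single instruction operates on four data elements at once; it assumes an x86 CPU with SSE and a compiler that provides <immintrin.h>.

#include <immintrin.h>  // x86 SSE intrinsics (assumption: x86 CPU with SSE support)

// SISD-style: one addition per instruction, element by element.
void add_scalar(const float* a, const float* b, float* out, int n) {
    for (int i = 0; i < n; ++i)
        out[i] = a[i] + b[i];
}

// SIMD-style: _mm_add_ps adds four floats with a single instruction.
void add_simd(const float* a, const float* b, float* out, int n) {
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);              // load 4 floats from a
        __m128 vb = _mm_loadu_ps(b + i);              // load 4 floats from b
        _mm_storeu_ps(out + i, _mm_add_ps(va, vb));   // add and store 4 results
    }
    for (; i < n; ++i)                                // handle leftover elements
        out[i] = a[i] + b[i];
}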
• An MISD computing system is a multiprocessor machine
capable of executing different instructions on different PEs but
all of them operating on the same data set.
• An MIMD system is a multiprocessor machine which is
capable of executing multiple instructions on multiple data
sets.
• Each PE in the MIMD model has separate instruction and data streams; therefore machines
built using this model are capable of running any kind of application.
• PEs in MIMD machines work asynchronously.
• MIMD machines are broadly categorized into shared-
memory MIMD and distributed-memory MIMD based on
the way PEs are coupled to the main memory.

• In the shared-memory MIMD model (tightly coupled multiprocessor
systems), all the PEs are connected to a single global memory and they all
have access to it.
• In distributed-memory MIMD machines (loosely coupled multiprocessor
systems), all PEs have a local memory.
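A minimal sketch of the shared-memory MIMD case, assuming C++11 std::thread (this example is an illustration, not part of the original slides): two threads act as PEs running different instruction streams on different parts of one globally accessible data structure. A distributed-memory MIMD program would instead use message passing (e.g., MPI) between processes that each own their local memory.

#include <algorithm>
#include <numeric>
#include <thread>
#include <vector>
#include <iostream>

int main() {
    // Global (shared) memory that both "PEs" can access.
    std::vector<int> data(1000000, 1);
    long long sum = 0;
    int maximum = 0;

    // Each thread executes a different instruction stream on different data (MIMD).
    std::thread pe1([&] { sum = std::accumulate(data.begin(), data.begin() + 500000, 0LL); });
    std::thread pe2([&] { maximum = *std::max_element(data.begin() + 500000, data.end()); });

    pe1.join();
    pe2.join();
    std::cout << "sum = " << sum << ", max = " << maximum << "\n";
}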
The serial execution of tasks is a sort of chain, where the first task is
followed by the second one, the second is followed by the third, and so on.
The important point here is that tasks are physically executed without
overlapping time periods.
• Simplicity
• It is a straightforward approach, with a clear set of step-by-step instructions about what to do and
when to do it.

• Scalability
• A system is considered scalable if its performance improves after adding more processing
resources. In the case of sequential computing, the only way to scale the system is to increase the
performance of system resources used – CPU, memory, etc.
• Overhead
• In sequential computing, no communication or synchronization is required between different
steps of the program execution. But there is an indirect overhead from the underutilization of
available processing resources.
• Concurrency is when multiple tasks can run in overlapping
periods. It's an illusion of multiple tasks running in parallel
because of very fast switching by the CPU. Two tasks
can't run at the same time on a single-core CPU.

Image: https://golangbot.com/concurrency/
• Parallelism is when tasks actually run in parallel on multiple
CPUs (or cores)

Image: https://golangbot.com/concurrency/
Image: https://devopedia.org/concurrency-vs-parallelism
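A small C++ sketch (an illustration added here, not from the slides) of the same idea: the two threads below always run concurrently, and whether they also run in parallel depends on how many cores the machine has.

#include <iostream>
#include <thread>

void task(const char* name) {
    for (int i = 0; i < 3; ++i)
        std::cout << name << " step " << i << "\n";  // output from the two threads may interleave
}

int main() {
    // Number of hardware threads available; on a single-core CPU the tasks are
    // only interleaved (concurrency), on a multi-core CPU they can truly run
    // at the same time (parallelism).
    std::cout << "hardware threads: " << std::thread::hardware_concurrency() << "\n";

    std::thread a(task, "A");
    std::thread b(task, "B");
    a.join();
    b.join();
}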
• Complexity of designing parallel algorithms
• Removing task dependencies
• Can add large overheads

• Limited by memory access speeds

• Execution speed is sensitive to data

• Real-world problems are most naturally described with
mathematical recurrences
• Saman Amarasinghe and Rodric Rabbah, 6.189 Multicore Programming Primer, January (IAP) 2007.
(Massachusetts Institute of Technology: MIT OpenCourseWare). http://ocw.mit.edu (accessed July 27, 2022).
License: Creative Commons Attribution-Noncommercial-Share Alike.
• https://en.wikipedia.org/wiki/Moore%27s_law
• https://www.intel.com/content/www/us/en/products/docs/processors/core/12th-gen-core-mobile-
processors-brief.html
• https://en.wikipedia.org/wiki/Flynn%27s_taxonomy
• https://www.geeksforgeeks.org/computer-architecture-flynns-taxonomy/
• https://livebook.manning.com/book/grokking-concurrency/chapter-2/v-4/108
• https://stackoverflow.com/questions/1050222/what-is-the-difference-between-concurrency-and-parallelism
• https://www.geeksforgeeks.org/memory-layout-of-c-program/
Module 1
Subtopic 2
Amdahl’s Law
Processes and Threads
• It seems that we can infinitely increase the number of
processors and thus make the system run as fast as
possible. But unfortunately, this is not the case.

• A famous observation of Gene Amdahl, known as
Amdahl's law, demonstrates this clearly.
• Imagine you have a huge pile of index cards with definitions written on
them. You want to find cards with information about concurrency and add
them to a separate stack, but the cards are all mixed up.
• Fortunately, you have two friends with you, so you could divide up the cards, give each
person a pile and tell them what to look for. Then each friend could search their own pile
of cards. Once someone finds the right card, they can announce it and put it into a
separate stack.
• This algorithm might look like:
1. Divide the pile up into stacks and hand out one stack to each person (Serial)
2. Everyone looks for the “concurrency” card (Parallel)
3. Collect the found cards into a separate stack (Serial)
Let’s say that the first and last parts of the above algorithm each take one second, and the second part takes 3
seconds. Thus, it takes 5 seconds to execute from beginning to end if you do it yourself.
The first and third parts are algorithmically sequential, and there is no way to separate them into
independent tasks and use parallel execution there.
But you found that we can easily use parallel execution in the second part by dividing cards
into any number of stacks, as long as we have a friend to execute this step independently.
Now imagine that by having two friends help with the second part of the program, you reduced the execution
time of that part to 1 second. The whole program now takes only 3 seconds, which is about a 1.67× speedup
(the execution time drops by 40%) for the whole program. Speedup here is calculated as the ratio of the time
it takes to execute the program in the optimal sequential manner with a single processing resource, over the
time it takes to execute it in a parallel manner with a certain number of processing resources.
What happens if we keep increasing the number of friends? For example, suppose we
added three more friends, six in total, and now the second part of the program takes only
half a second to execute. Then the whole algorithm would only take 2.5 seconds to
complete, which is a 2× speedup (the execution time drops by 50%) for the whole program.
• Adding more friends would still leave at least 2 seconds of latency (the serial parts)
• Possible communication overhead
• This is the key to understanding Amdahl's law – the
potential for speeding up a program using parallel
computing is limited by the sequential parts of the
program.
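The card-sorting numbers can be written out as a small worked example. This is a sketch, not from the original slides; it assumes the 2 seconds of serial work (parts 1 and 3) and the 3 seconds of parallelizable work (part 2) given above, and applies Amdahl's law T(N) = serial + parallel/N, speedup(N) = T(1)/T(N).

#include <initializer_list>
#include <iostream>

int main() {
    const double serial   = 2.0;                 // parts 1 and 3 of the card example (seconds)
    const double parallel = 3.0;                 // part 2, divisible among N workers (seconds)
    const double total    = serial + parallel;   // 5 seconds when done alone

    for (int n : {1, 3, 6, 100, 1000000}) {
        double t = serial + parallel / n;        // Amdahl's law: T(N) = serial + parallel / N
        std::cout << n << " worker(s): " << t << " s, speedup " << total / t << "x\n";
    }
    // As N grows, the speedup approaches total / serial = 5 / 2 = 2.5x:
    // it is bounded by the serial portion no matter how many workers are added.
}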
Module 1
Subtopic 2

Processes and Threads


https://www.geeksforgeeks.org/memory-layout-of-c-program/
There can be multiple instances of a single program, and each instance of
that running program is a process. Each process has a separate memory
address space, which means that a process runs independently and is
isolated from other processes.

https://www.backblaze.com/blog/whats-the-diff-programs-processes-and-threads/
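A tiny standard C++ sketch (added as an illustration, not from the slides) that makes this isolation visible: build it, then launch two instances. Each instance is its own process with its own copy of counter, so both print 1, and even if they happen to print the same virtual address, the two variables live in separate address spaces.

#include <iostream>

int counter = 0;  // lives in this process's private address space

int main() {
    ++counter;
    // Each running instance of this program is a separate process, so each one
    // has its own counter. Sharing data between processes requires explicit
    // mechanisms such as files, pipes, sockets, or shared memory.
    std::cout << "counter = " << counter << " at address " << &counter << "\n";
}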
A thread is the unit of execution within a process. A process
can have anywhere from just one thread to many threads.

https://www.backblaze.com/blog/whats-the-diff-programs-processes-and-threads/
In multithreaded processes, the process contains more than
one thread, and the process is accomplishing a number of
things at the same time.

https://www.backblaze.com/blog/whats-the-diff-programs-processes-and-threads/
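In contrast to the processes above, the threads of one process share its address space. Below is a minimal C++11 sketch (an illustration, not from the slides); std::atomic is used so the concurrent increments are safe.

#include <atomic>
#include <iostream>
#include <thread>
#include <vector>

std::atomic<int> counter{0};  // shared by every thread of this process

int main() {
    std::vector<std::thread> workers;
    for (int i = 0; i < 4; ++i)
        workers.emplace_back([] { ++counter; });  // all threads update the same variable

    for (auto& t : workers)
        t.join();

    // All four increments are visible because the threads share one address
    // space -- unlike separate processes, which each get their own copy.
    std::cout << "counter = " << counter << "\n";
}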
• Tutorials
• https://cplusplus.com/doc/tutorial/
• https://www.learncpp.com/
• https://stackoverflow.com/questions/388242/the-definitive-c-book-guide-
and-list

• Language Reference
• https://en.cppreference.com/w/
• https://code.visualstudio.com/
https://code.visualstudio.com/docs/languages/cpp
MSYS2 is a collection of tools and libraries providing you with
an easy-to-use environment for building, installing and
running native Windows software. https://www.msys2.org/
• Dev-C++ (IDE Only)
• https://sourceforge.net/projects/orwelldevcpp/
• Download tdm-gcc (jmeubank.github.io)
• https://jmeubank.github.io/tdm-gcc/
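Once one of these toolchains is installed, a typical compile-and-run from a terminal looks like the following (a sketch assuming g++ from MSYS2 or TDM-GCC is on your PATH; the file and program names are placeholders):

g++ -std=c++17 -Wall -O2 hello.cpp -o hello
./hello        (on the Windows command prompt: hello.exe)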
• https://www.geeksforgeeks.org/memory-layout-of-c-program/
• https://www.backblaze.com/blog/whats-the-diff-programs-processes-and-threads/
• https://cplusplus.com/doc/tutorial/
• https://www.learncpp.com/
• https://stackoverflow.com/questions/388242/the-definitive-c-book-guide-and-list
