
Welcome to COMP4300/8300 Parallel Systems

Course organization and contact

● Course Convenor (Dr. Giuseppe Barca) and 2nd Lecturer (Dr. Alberto F. Martin)
● Course E-mail address (Piazza preferred, though): comp4300 (@anu.edu.au)
■ please make your subject line meaningful! (e.g. ass1: trouble meeting deadline)
● Communication with teaching staff mostly handled via Piazza
● Course webpage at https://comp.anu.edu.au/courses/comp4300 (we will use Wattle only for lecture recordings and marking!)
● course schedule
■ Labs start in week 2! Register before the end of week 1 via MyTimeTable
■ Account at NCI/Gadi needed! (apply for user accounts instructions)
● course assessment
■ 2x assignments 25% each, MSE 10% (redeemable), Final Exam 40%
● Assignment and exam submission via GitLab
■ We assume familiarity with Git/GitLab
■ See the following FAQ for details

Minimum IT requirements

● requirements (I): satisfactory (latency and bandwidth) http(s) access to:
■ course website (lecture slides etc), GitLab (assignments and exams submission)
■ Wattle (marks + lecture recordings)
■ NCI documentation and apply for user accounts instructions
■ piazza (forum)
● requirements (II): you have a good SSH client (e.g. Cygwin or WSL) and outgoing SSH is not being blocked
■ most crucially to gadi.nci.org.au (if it gets as far as asking for a username/password, should be OK)
■ later, will need access to a GPU server (e.g. stugpu2.anu.edu.au, either directly or indirectly, e.g. via partch.anu.edu.au)
● access to all of the above from external countries is believed possible – if not, contact the course convenor ASAP!

Health Warning!

● it’s a 4000/8000-level course, it’s supposed to:
■ be more challenging than a 3000-level course!
■ you may (will) be exposed to ‘bleeding edge’ technologies
■ be less well structured
■ have a greater expectation for your self-directed understanding/independence
■ have more student participation
■ be fun!
● it assumes you have done some background in concurrency (e.g. COMP2310); also in computer organization
● it will require strong programming skills – in C!
● attendance at lectures/labs strongly recommended (even though not assessed)

Today’s lecture main take-away message

To extract all the computational power from today’s (even commodity) microprocessors there is no alternative but to EXPLICITLY manage parallelism in our programs!

[Figure: UltraSPARC T2 (2007) (Niagara-2) multicore chip layout (courtesy of T. Okazaki, Flickr)]

Lecture Overview

● parallel computing concepts and scales
● sample application areas
● parallel programming’s rise, decline and rebirth:
■ the role of Moore’s Law and Dennard Scaling
■ the multicore revolution and a new design era
● the Top 500 supercomputers and challenges for ‘exascale computing’
● why parallel programming is hard


Parallel Computing: Concept and Rationale

The idea:
● split your computation into tasks that can be executed simultaneously
● parallel programming: the art of writing the parallel code
● parallel computer: the hardware on which we run our parallel code
This course will discuss both!

Motivation:
● Speed, Speed, Speed · · · at a cost-effective price
■ if we didn’t want it to go faster we wouldn’t bother with the hassles of parallel programming!
● reduce the time to solution to acceptable levels (e.g., under real-time constraints)
■ no point in taking 1 week to predict tomorrow’s weather!
■ simulations that take months are NOT useful in a design environment
● tackle larger-scale (i.e., with higher computational demands) problems
● keep power consumption and heat dissipation under control

Parallelization

Split the program up and run parts simultaneously on different processors
● on p computers, the time to solution should (ideally!) be reduced to 1/p of the serial time (see the code sketch below)

Beyond raw compute power, other motivations may include:
● enabling more accurate simulations in the same time (finer grids)
● providing access to huge aggregate memories (e.g. an atmosphere model requiring ≥ 8 Intel nodes to run on)
● providing more and/or better input/output capacity
● hiding latency
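The idea of splitting a computation into simultaneous tasks can be made concrete with a few lines of code. The sketch below is not from the slides: it uses OpenMP, a common shared-memory programming model, to divide a loop's iterations across threads, and assumes a compiler with OpenMP support (e.g. gcc -fopenmp); the loop size and the summed series are arbitrary illustrative choices.

```c
/* Minimal sketch: splitting a loop's iterations across threads with OpenMP. */
#include <stdio.h>
#include <omp.h>

int main(void) {
    const long n = 100000000;   /* arbitrary amount of work */
    double sum = 0.0;

    double t0 = omp_get_wtime();
    /* iterations are divided among the available threads;
       the reduction clause combines each thread's partial sum */
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < n; i++)
        sum += 1.0 / (double)(i + 1);
    double t1 = omp_get_wtime();

    printf("sum = %f computed in %.3f s using up to %d threads\n",
           sum, t1 - t0, omp_get_max_threads());
    return 0;
}
```

Ideally, running with p threads (e.g. OMP_NUM_THREADS=p) would cut the loop time to roughly 1/p of the serial time; later lectures look at why reality usually falls short of that.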

Scales of Parallelism

● within a CPU/core: pipelined instruction execution, multiple instruction issue (superscalar), other forms of instruction level parallelism, SIMD units*
● within a chip: multiple cores*, hardware multithreading*, accelerator units* (with multiple cores), transactional memory*
● within a node: multiple sockets* (CPU chips), interleaved memory access (multiple DRAM chips), disk block striping / RAID (multiple disks)
● within a SAN (system area network): multiple nodes* (clusters, typical supercomputers), parallel filesystems
● within the internet: grid/cloud computing*

*requires significant parallel programming effort
What programming paradigms are typically applied to each feature?

Sample Application Areas: Grand Challenge Problems

● fluid flow problems
■ weather prediction and climate change, ocean flow
■ aerodynamic modeling for cars, planes, rockets etc
● structural mechanics
■ building, bridge, car etc strength analysis
■ car crash simulation
● speech and character recognition, image processing
● visualization, virtual reality
● semiconductor design, simulation of new chips
● structural biology, design of drugs
● human genome mapping
● financial market analysis and simulation
● data mining, machine learning, video games


Example: World Climate Modeling

● atmosphere divided into 3D regions or cells
● complex mathematical equations describe conditions in each cell, e.g. pressure, temperature, wind speed, etc
■ conditions in each cell change according to conditions in neighbouring cells
■ updates repeated many times to model the passage of time
■ cells are affected by more distant cells the longer the forecast
● assume
■ cells are 1 × 1 × 1 mile to a height of 10 miles ⇒ 5 × 10⁸ cells
■ 200 floating point operations (flops) to update each cell ⇒ 10¹¹ flops per timestep
■ a timestep represents 10 minutes and we want 10 days ⇒ total of 10¹⁵ flops
● on a 100 mflops/sec. computer this would require 100 days
● on a 1 tflops/sec. computer this would take just 10 minutes

The (Rocky) Rise of Parallel Computing

● early ideas: 1946: Cellular Automata, John von Neumann; 1958: SOLOMON (1024 1-bit processors), Daniel Slotnick; 1962: Burroughs D825 4-CPU SMP
● 1967: Gene Amdahl proposes Amdahl’s Law, debates with Slotnick at AFIPS Conf.
● 1970’s: vector processors become the mainstream supercomputers (e.g. Cray-1); a few ‘true’ parallel computers are built
● 1980’s: small-scale parallel vector processors dominant (e.g. Cray X-MP)

“When a farmer needs more ox power to plow his field, he doesn’t get a bigger ox, he gets another one. Then he puts the oxen in parallel.” Grace Hopper, 1985-7
“If you were plowing a field, which would you rather use? Two strong oxen or 1024 chickens?” Seymour Cray, 1986

● 1988: Reevaluating Amdahl’s Law, John Gustafson
● late 80s+: large-scale parallel computers begin to emerge: Computing Surface (Meiko), QCD machine (Columbia Uni), iPSC/860 hypercube (Intel)
● 90’s–00’s: shared and distributed memory parallel computers used for servers (small-scale) and supercomputers (large-scale)

Moore’s Law & Dennard Scaling

Two “laws” underpin exponential performance increase of microprocessors from the 70’s

Moore’s Law and Dennard Scaling Undermine Parallel Computing

● parallel computing looked promising in the 90’s, but many companies failed due to the ‘free lunch’ from the combination of Moore’s Law and Dennard Scaling
■ Why parallelize my codes? Just wait 2 years, and processors will be 2x faster!

“On several recent occasions, I have been asked whether parallel computing will soon be relegated to the trash heap...” Ken Kennedy, CRPC Director, 1994

● demography of Parallel Computing (mid 90’s, origin unknown)
[Figure: quadrant diagram of Degree of Parallelism (P) vs Processor speed (R): Heretics (P−dP,R+dR), Agnostics (P,R+dR), Luke-warm Believers (P+dP,R+dR), True Believers (P+dP,R), Luddites (P−dP,R−dR), Fanatics (P+dP,R−dR)]


Chip Size and Clock Speed

The ‘free lunch’ continued into the early 2000’s, allowing faster (higher clock rates) and more complex (serial) processors: systems so fast that they could not communicate across the chip in a single clock cycle! (full paper available here)

End of Dennard Scaling and Other Free Lunch Ingredients

● even with Dennard Scaling, we saw an exponential increase in the power density of chips 1985–2000 (superlinear scaling of clock rate)
● then Dennard Scaling ceased around 2005! (power becomes a huge concern)
● instruction level parallelism (ILP) also reached its limits

The Multicore Revolution

● vendors began putting multiple CPUs (cores) on a chip, and stopped increasing clock frequency
■ 2004: Sun releases dual-core Sparc IV, heralding the start of the multicore era
● (dynamic) power of a chip is given by: P = QCV²f, where V is the voltage, Q is the number of transistors, C is a transistor’s capacitance and f is the clock frequency
■ but, on a given chip, f ∝ V, so P ∝ Qf³!
● (ideal) parallel performance for p cores is given by R = pf, but Q ∝ p
■ double p ⇒ double R, but also P
■ double p, halve f ⇒ maintain R, but quarter P! (see the arithmetic sketch after the next slide)
● doubling the number of cores is better than doubling clock speed!
● Moore’s Law (increase Q at constant QC) is expected to continue (till ???), so we can gain in R at constant P
● cores can be much simpler (e.g., exploit less ILP), but there can be many
● chip design and testing costs are significantly reduced
● BUT ... parallelism now must be exposed by software: The Free Lunch Is Over!

Where are we now?

We find ourselves in a chip development era characterized by:
● Transistors galore (this cannot be sustained forever, though, due to physical limits)
● Severe power limitations (maximizing transistor utility is no longer possible)
● Customization versus generalization

We are now seeing:
● (customized) accelerators, generally manycore with low clock frequency
■ e.g. Graphics Processing Units (GPUs), customized for fast numerical calculations
● ‘dark silicon’: need to turn off parts of chip to reduce power
● hardware-software codesign: speed via specialization
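The double-p-halve-f arithmetic from the Multicore Revolution slide can be checked numerically. The sketch below is not from the slides: it simply plugs numbers into the slide's model P ∝ Q·f³ (with Q ∝ p cores) and R = p·f, normalised to a baseline of one core at relative clock 1.

```c
/* Quick numerical check of the slide's model:
   dynamic power P is proportional to Q*f^3 with Q proportional to p (cores),
   and ideal performance R = p*f.  Values are relative to (p=1, f=1).        */
#include <stdio.h>

static double rel_power(double p, double f) { return p * f * f * f; }
static double rel_perf(double p, double f)  { return p * f; }

int main(void) {
    printf("baseline  p=1, f=1.0 : P=%.2f R=%.2f\n", rel_power(1, 1.0), rel_perf(1, 1.0));
    /* double the cores at the same clock: double R, but also double P */
    printf("2x cores  p=2, f=1.0 : P=%.2f R=%.2f\n", rel_power(2, 1.0), rel_perf(2, 1.0));
    /* double the cores and halve the clock: same R at a quarter of P  */
    printf("2x, f/2   p=2, f=0.5 : P=%.2f R=%.2f\n", rel_power(2, 0.5), rel_perf(2, 0.5));
    return 0;
}
```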

The Top 500 Most Powerful Computers: November 2022

The Top 500 provides an interesting view of these trends (click here for full list)

Top500: Performance Trends

(http://www.top500.org/resources/presentations/ (51st TOP500))

The Top500: Multicore Emergence

(http://www.top500.org/blog/slides-highlights-of-the-45th-top500-list/)

Exascale and Beyond: Challenges and Opportunities

● among nodes
■ characteristic: sheer #nodes, although different balance among #nodes and performance/node in top-ranked supercomputers (e.g., Frontier has “only” 9.4K nodes but Fugaku has 159K nodes)
■ challenges/opportunities: programming languages/environment; fault tolerance
● within a node
■ characteristic: high heterogeneity (e.g., Frontier, LUMI and many more use CPUs and GPUs)
■ challenges/opportunities: what to use when; co-location of data with the unit processing it
● within a chip
■ characteristic: energy minimization (e.g. processor cores already have dynamic frequency and voltage scaling)
■ challenges/opportunities: minimize data size and movement (e.g., use just enough precision); specialized cores


Software: Why Parallel Programming is Hard

● writing (correct and efficient) parallel programs is hard!
■ hard to expose enough parallelism; hard to debug!
● getting (close to ideal) speedup is hard! Overheads include:
■ idling (e.g., caused by load unbalance, synchronization, serial sections, etc.)
■ redundant/extra operations when splitting a computation into tasks
■ communication time among processes
● Amdahl’s Law: Let f be the fraction of a computation that cannot be split into parallel tasks. Then, the max speedup achievable for arbitrarily large p (processors) is 1/f !!!
■ Let t_s and t_p denote serial and parallel exec times
■ t_p = f·t_s + (1−f)·t_s/p
■ S_p = t_s/t_p = p/(p·f + (1−f)) (Speedup)
■ lim p→∞ S_p = 1/f
■ e.g., if f = 0.05, then 1/f = 20
● counterargument (Gustafson’s Law): 1 − f is not fixed, but increases with the data/problem size N (see the numerical sketch after the reading list below)

Recommended further reading

The following two articles provide a crystal clear view of past, present, and future:
● Computing Performance: Game Over or Next Level? Computer, 44(1), 2011. Publisher: IEEE.
● Time Moore: Exploiting Moore’s Law From The Perspective of Time. IEEE Solid-State Circuits Magazine, 11(1), 2019. Publisher: IEEE.

Only for intrepid readers:
● Computing Performance: Game Over or Next Level?, Full report from US National Research Council, National Academies Press, 2011. Bonus: reprints of Moore’s and Dennard’s seminal papers.
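The Amdahl's Law bullets above reduce to one formula that is easy to tabulate. Below is a minimal sketch, not from the slides, that evaluates S_p = p / (p·f + (1 − f)) for the slide's example f = 0.05 and a few processor counts; the chosen counts are arbitrary.

```c
/* Amdahl's Law: S_p = p / (p*f + (1 - f)), which tends to 1/f as p grows. */
#include <stdio.h>

static double speedup(double f, int p) {
    return (double)p / (p * f + (1.0 - f));
}

int main(void) {
    const double f = 0.05;                        /* serial (non-parallelisable) fraction */
    const int procs[] = {1, 10, 100, 1000, 100000};
    const int n = (int)(sizeof procs / sizeof procs[0]);

    printf("f = %.2f, asymptotic limit 1/f = %.1f\n", f, 1.0 / f);
    for (int i = 0; i < n; i++)
        printf("p = %6d  S_p = %6.2f\n", procs[i], speedup(f, procs[i]));
    return 0;
}
```

Running it shows how quickly the curve flattens: with f = 0.05, a hundred processors already give a speedup of about 16.8, and 100,000 processors still leave you just under the limit of 20.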

Recommended Texts + extra materials

● Some reading will be essential for the course. Recommended texts:


■ Introduction to Parallel Computing, 2nd Ed., Grama et al.
Available online (free for ANU students!)
■ Principles of Parallel Programming, Lin & Snyder
■ Parallel Programming: techniques and applications using networked
workstations and parallel computers, Wilkinson & Allen
Other references: see the resources section of web page

