Digital Logic and Computer Organization, Lecture 23: Multicore (Handout)


DIGITAL LOGIC AND COMPUTER ORGANIZATION
Lecture 23: Multicore
ELEC3010
ACKNOWLEDGEMENT

I would like to express my special thanks to Professor Zhiru Zhang,
School of Electrical and Computer Engineering, Cornell University, and
Professor Rudy Lauwereins, KU Leuven, for sharing their teaching materials.

2
COVERED IN THIS COURSE
Digital logic:
❑ Binary numbers and logic gates
❑ Boolean algebra and combinational logic
❑ Sequential logic and state machines
❑ Binary arithmetic
❑ Memories

Computer organization:
❑ Instruction set architecture
❑ Processor organization
❑ Caches and virtual memory
❑ Input/output
❑ Advanced topics
3
MOTIVATION EXAMPLE 1

4
MOTIVATION EXAMPLE 2

[Figure: smartphone teardown, annotated. Apple-designed SoC with 6 ARM-based
64-bit microprocessor cores, 4 GPU cores, a 16-core Neural Engine, and AMX
blocks, fabricated by TSMC; Qualcomm Snapdragon X55 5G modem; Broadcom
Wi-Fi/GPS/Bluetooth modules; RF front end; audio codec; LG Super Retina XDR
OLED display; sensor modules; 12MP cameras (LG Innotek); 4GB/6GB LPDDR4X RAM
(Micron); 64GB/128GB/256GB NAND flash (Samsung); battery and power module.]
5
MOTIVATION EXAMPLE 3

6
INCREASING CLOCK FREQUENCIES

The darling of performance improvement for decades:
• Technology scaling

Why is this no longer the strategy? We are hitting frequency limits:
• Heat
• Power
7
IMPROVING IPC VIA ILP

You’ve seen:
❑ Exploiting intra-instruction parallelism:
  • Pipelining (decode A while fetching B)

You haven’t seen:
❑ Exploiting instruction-level parallelism (ILP):
  • Multiple issue (2-wide, 4-wide, etc.)
    • Statically detected by the compiler (VLIW)
    • Dynamically detected by HW
      ➢ Dynamic scheduling (out-of-order, OoO)
8
STATIC MULTIPLE ISSUE
a.k.a. Very Long Instruction Word (VLIW)

The compiler groups instructions to be issued together:
▪ packages them into “issue slots”

How does HW detect and resolve hazards?
It doesn’t. ☺ The compiler must avoid hazards.

Example: static dual-issue 32-bit MIPS
▪ Instructions come in pairs (64-bit aligned)
  • One ALU/branch instruction (or nop)
  • One load/store instruction (or nop)
9
STATIC DUAL ISSUE
Two-issue packets:
• One ALU/branch instruction
• One load/store instruction
• 64-bit aligned
• ALU/branch first, then load/store
• Pad an unused slot with nop
Address   Instruction type   Pipeline stages (each packet starts one cycle later)
n         ALU/branch         IF  ID  EX  MEM  WB
n + 4     Load/store         IF  ID  EX  MEM  WB
n + 8     ALU/branch             IF  ID  EX  MEM  WB
n + 12    Load/store             IF  ID  EX  MEM  WB
n + 16    ALU/branch                 IF  ID  EX  MEM  WB
n + 20    Load/store                 IF  ID  EX  MEM  WB
10
STATIC DUAL ISSUE

11
SCHEDULING EXAMPLE
Schedule this for dual-issue:

Loop: lw   t0, 0(s1)       # t0 = array element
      add  t0, t0, s2      # add with s2
      sw   t0, 0(s1)       # store result
      addi s1, s1, -4      # decrement pointer
      bne  s1, zero, Loop  # branch if s1 != 0

      ALU/branch           Load/store      cycle
Loop: nop                  lw t0, 0(s1)    1
      addi s1, s1, -4      nop             2
      add  t0, t0, s2      nop             3
      bne  s1, zero, Loop  sw t0, 4(s1)    4    # offset is now 4 because the
                                                # addi already moved s1 down by 4

What is the IPC of this machine?
(A) 0.8 (B) 1.0 (C) 1.25 (D) 1.5 (E) I don’t know
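(A worked check for the handout reader: the schedule completes the 5 loop
instructions every 4 cycles, so IPC = 5/4 = 1.25, versus a peak of 2 for a
dual-issue machine.)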
12
DYNAMIC MULTIPLE ISSUE
aka SuperScalar Processor (c.f. Intel)
• CPU chooses multiple instructions to issue each cycle
• Compiler can help, by reordering instructions….
• … but CPU resolves hazards

13
DYNAMIC SCHEDULING

❑ Scheduling is done at execution time

❑ Out-of-order execution:
  • Execute instructions as early as possible
  • Guess results of branches, loads, etc.
  • Roll back if the guesses were wrong
  • Don’t commit results until all previous instructions have committed

14
IMPROVING IPC VIA TLP
Exploiting thread-level parallelism (TLP)
Hardware multithreading to improve utilization:
• Multiplexing multiple threads on a single CPU
• Three types:
  • Coarse-grain (has a preferred thread)
  • Fine-grain (round-robin between threads)
  • Simultaneous (hyperthreading)

15
WHAT IS A THREAD?

❑ Process: multiple threads plus code, data, and OS state

❑ Threads: concurrent computations that share the same address space
  • Share: code, data, files
  • Do not share: registers or stack
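
A minimal sketch of this sharing in C with POSIX threads (the library choice
and all names here are mine, not from the slides): two threads update one
global variable, which lives in the shared data segment, while each thread's
argument copy lives in a local variable on its own private stack.

#include <pthread.h>
#include <stdio.h>

int shared_counter = 0;              /* data segment: shared by all threads */
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

void *worker(void *arg) {
    int local = *(int *)arg;         /* local: lives on this thread's own stack */
    pthread_mutex_lock(&lock);       /* shared data needs synchronization */
    shared_counter += local;
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    int a = 1, b = 2;
    pthread_create(&t1, NULL, worker, &a);
    pthread_create(&t2, NULL, worker, &b);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("shared_counter = %d\n", shared_counter);  /* prints 3 */
    return 0;
}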

16
THREAD MEMORY LAYOUT

[Figure: one shared address space holding three threads. Each thread has its
own PC and SP: thread 1's SP points into Stack 1, thread 2's into Stack 2,
thread 3's into Stack 3. All three PCs point into the shared instruction
segment (Insns), and the Data segment is shared by all threads.]
17
THREAD EXAMPLES
int e;

main () {
int x[10], j, k, m; j = f(x, k); m = g(x, k);
}

int f(int *x, int k) Thread 0


{
int a; a = e * x[k] * x[k]; return a;
}

int g(int *x, int k) Thread 1


{
int a; k = k-1; a = e / x[k]; return a;
}
18
STANDARD MULTITHREADING PICTURE
[Figure: issue slots over time; color = thread, white = no instruction.
Four pipelines compared: a 4-wide superscalar running one thread; CGMT,
which switches to thread B on thread A's L2 miss; FGMT, which switches
threads every cycle; and SMT, where instructions from multiple threads
coexist in the same cycle.]
19
MULTITHREADING PERFORMANCE
[Figure: the same four pipelines over time, numbered 1-4:
(1) 4-wide superscalar, (2) CGMT, (3) FGMT, (4) SMT.]

Which one of these has the best single-thread performance?
Which one of these has the best instruction throughput?
(A) 1 (B) 2 (C) 3 (D) 4 (E) I don't know
20
POWER EFFICIENCY

CPU             Year  Clock rate  Pipeline stages  Issue width  Out-of-order/Speculation  Cores  Power
i486            1989  25 MHz      5                1            No                        1      5 W
Pentium         1993  66 MHz      5                2            No                        1      10 W
Pentium Pro     1997  200 MHz     10               3            Yes                       1      29 W
P4 Willamette   2001  2000 MHz    22               3            Yes                       1      75 W
UltraSparc III  2003  1950 MHz    14               4            No                        1      90 W
P4 Prescott     2004  3600 MHz    31               3            Yes                       1      103 W

21
WHY MULTICORE?

                                  Performance   Power
Single-core (baseline)            1.0x          1.0x
Single-core, overclocked +20%     1.2x          1.7x
Single-core, underclocked -20%    0.8x          0.51x
Dual-core, underclocked -20%      1.6x          1.02x
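
A plausible reading of these numbers, assuming dynamic power P ∝ C·V²·f and
that frequency scales roughly linearly with supply voltage: overclocking by
20% scales both V and f by 1.2, so power grows by about 1.2³ ≈ 1.7x, while
underclocking by 20% scales them by 0.8, so power drops to about 0.8³ ≈ 0.51x.
Two underclocked cores then give roughly 2 × 0.8 = 1.6x the performance (if
the workload parallelizes perfectly) at 2 × 0.51 ≈ 1.02x the power.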

22
POWER EFFICIENCY
CPU Year Clock Pipeline Issue Out-of-order/ Cores Power
Rate Stages width Speculation
i486 1989 25MHz 5 1 No 1 5W
Pentium 1993 66MHz 5 2 No 1 10W
Pentium Pro 1997 200MHz 10 3 Yes 1 29W
P4 Willamette 2001 2000MHz 22 3 Yes 1 75W
UltraSparc III 2003 1950MHz 14 4 No 1 90W
P4 Prescott 2004 3600MHz 31 3 Yes 1 103W
Core 2006 2930MHz 14 4 Yes 2 75W
Core i5 Nehal 2010 3300MHz 14 4 Yes 1 87W
Core i5 Ivy Br 2012 3400MHz 14 4 Yes 8 77W

23
PARALLEL PROGRAMMING

So let’s just use multicore from now on!
… but software must be written as a parallel program.

Multicore difficulties:
• Partitioning work
• Coordination & synchronization
• Communication overhead
• How do you write parallel programs…
  … without knowing the exact underlying architecture?


24
WORK PARTITIONING
Partition work so all cores have something to do
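
One minimal way to partition work in C with POSIX threads (the even-split
strategy and all names are my own illustration, not from the slides): the
index range is cut into one contiguous chunk per core, and each thread
processes only its own chunk.

#include <pthread.h>

#define N 1000000
#define NUM_CORES 4

static double a[N];

struct range { int lo, hi; };           /* one chunk of the index space */

void *do_chunk(void *arg) {
    struct range *r = (struct range *)arg;
    for (int i = r->lo; i < r->hi; i++)
        a[i] = a[i] * 2.0;              /* iterations are independent */
    return NULL;
}

int main(void) {
    pthread_t tid[NUM_CORES];
    struct range chunk[NUM_CORES];
    int per = N / NUM_CORES;
    for (int t = 0; t < NUM_CORES; t++) {
        chunk[t].lo = t * per;
        chunk[t].hi = (t == NUM_CORES - 1) ? N : (t + 1) * per;  /* last chunk takes any remainder */
        pthread_create(&tid[t], NULL, do_chunk, &chunk[t]);
    }
    for (int t = 0; t < NUM_CORES; t++)
        pthread_join(tid[t], NULL);
    return 0;
}

An even split like this only keeps all cores busy if every iteration costs
about the same, which is the point of the next slide.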

25
LOAD BALANCING

Need to partition so all cores are actually working

26
AMDAHL’S LAW
❑ Amdahl’s Law was named after Gene Amdahl, who presented it in
1967.
❑ Amdahl’s Law states that in parallelization, if P is the proportion
of a system or program that can be made parallel, and 1-P is the
proportion that remains serial, then the maximum speedup S(N)
that can be achieved using N processors is:
S(N)=1/((1-P)+(P/N))
❑ As number of cores increases …
▪ time to execute parallel part? goes to zero
▪ time to execute serial part? Remains the same
▪ Serial part eventually dominates
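
A small C sketch that simply evaluates this formula (the value P = 0.95 is
illustrative, not from the slides): even with 95% of the program parallel,
the speedup saturates near 1/(1-P) = 20 no matter how many cores are added.

#include <stdio.h>

/* Amdahl's Law: S(N) = 1 / ((1 - P) + P/N) */
double speedup(double p, int n) {
    return 1.0 / ((1.0 - p) + p / n);
}

int main(void) {
    double p = 0.95;                    /* illustrative: 95% parallel */
    int cores[] = {1, 2, 4, 8, 16, 64, 1024};
    for (int i = 0; i < 7; i++)
        printf("N = %4d  ->  S(N) = %5.2f\n", cores[i], speedup(p, cores[i]));
    /* As N grows, S(N) -> 1/(1-P) = 20: the serial part dominates. */
    return 0;
}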

27
AMDAHL’S LAW

28
CAN YOU DO IT?

A. for (i = 0; i < N; i++)
       a[i] = b[i] / 2.0;

B. for (i = 1; i < N; i++)
       a[i] = a[i-1] * b[i];

C. int i;
   float *a, *b, *c, tmp;
   ...
   for (i = 0; i < N; i++) {
       tmp = a[i] / b[i];
       c[i] = tmp * tmp;
   }

Which code is parallelizable?
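
A hedged sketch of one way loop C could be parallelized (OpenMP is my choice
of tool, not from the slides): each thread must get a private copy of tmp,
otherwise the threads race on it. Loop B cannot be handled this way, because
iteration i reads the result of iteration i-1.

/* compile with: cc -fopenmp ... */
void square_ratio(const float *a, const float *b, float *c, int N) {
    float tmp;
    #pragma omp parallel for private(tmp)   /* each thread gets its own tmp */
    for (int i = 0; i < N; i++) {
        tmp = a[i] / b[i];
        c[i] = tmp * tmp;
    }
}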
29
CAN YOU DO IT?

A. for (j = 1; j < n; j++)
       for (i = 0; i < m; i++)
           a[i][j] = 2 * a[i][j-1];

B. for (i = 0; i < m; i++)
       for (j = 1; j < n; j++)
           a[i][j] = 2 * a[i][j-1];

Which code is better?
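
A note for the handout reader (my analysis, not spelled out on the slide): in
both versions the dependence runs along j, since a[i][j] reads a[i][j-1], so
only the i loop is safe to parallelize. Version B keeps that independent i
loop outermost, so each thread can take whole rows and walk them in
cache-friendly row-major order; version A would have to fork and join on its
inner loop for every j and strides across rows on each access.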


30
HAVE YOU EVER SEEN THIS?

31
BEFORE NEXT CLASS

• Textbook: Section 8.4
• Next time: Virtual Memory

32
