Digital Logic and Computer Organization, Lecture 23: Multicore (Handout)


DIGITAL LOGIC AND COMPUTER ORGANIZATION
Lecture 23: Multicore
ELEC3010
ACKNOWLEDGEMENT

I would like to express my special thanks to Professor Zhiru Zhang,
School of Electrical and Computer Engineering, Cornell University, and
Professor Rudy Lauwereins, KU Leuven, for sharing their teaching materials.

2
COVERED IN THIS COURSE
Digital logic:
❑ Binary numbers and logic gates
❑ Boolean algebra and combinational logic
❑ Sequential logic and state machines
❑ Binary arithmetic
❑ Memories

Computer organization:
❑ Instruction set architecture
❑ Processor organization
❑ Caches and virtual memory
❑ Input/output
❑ Advanced topics
3
MOTIVATION EXAMPLE 1

4
MOTIVATION EXAMPLE 2

[Figure: smartphone teardown, annotated. Apple-designed SoC with 6 ARM-based
64-bit microprocessor cores, 4 GPU cores, a 16-core Neural Engine, and AMX
blocks, fabricated by TSMC; Qualcomm Snapdragon X55 5G modem; Broadcom
Wi-Fi/GPS/Bluetooth modules; RF front end; audio codec; LG Super Retina XDR
OLED display; sensor modules; 12MP cameras (LG Innotek); 4GB/6GB LPDDR4X RAM
(Micron); 64GB/128GB/256GB NAND flash (Samsung); battery and power module.]
5
MOTIVATION EXAMPLE 3

6
INCREASING CLOCK FREQUENCIES

The darling of performance improvement for decades:
• Technology scaling

Why is this no longer the strategy? We are hitting frequency limits:
• Heat
• Power
7
IMPROVING IPC VIA ILP

You’ve seen:
❑ Exploiting intra-instruction parallelism:
  • Pipelining (decode A while fetching B)

You haven’t seen:
❑ Exploiting instruction-level parallelism (ILP):
  • Multiple issue (2-wide, 4-wide, etc.)
    • Statically detected by the compiler (VLIW)
    • Dynamically detected by HW
      ➢ Dynamic scheduling (out-of-order, OoO)
8
STATIC MULTIPLE ISSUE
a.k.a. Very Long Instruction Word (VLIW)

The compiler groups instructions to be issued together:
▪ packages them into “issue slots”

How does HW detect and resolve hazards?
It doesn’t. ☺ The compiler must avoid hazards.

Example: static dual-issue 32-bit MIPS
▪ Instructions come in pairs (64-bit aligned)
  • One ALU/branch instruction (or nop)
  • One load/store instruction (or nop)
9
STATIC DUAL ISSUE
Two-issue packets:
• One ALU/branch instruction
• One load/store instruction
• 64-bit aligned
• ALU/branch first, then load/store
• Pad an unused slot with nop
Address   Instruction type   Pipeline stages (each packet starts one cycle later)
n         ALU/branch         IF  ID  EX  MEM  WB
n + 4     Load/store         IF  ID  EX  MEM  WB
n + 8     ALU/branch             IF  ID  EX  MEM  WB
n + 12    Load/store             IF  ID  EX  MEM  WB
n + 16    ALU/branch                 IF  ID  EX  MEM  WB
n + 20    Load/store                 IF  ID  EX  MEM  WB
10
STATIC DUAL ISSUE

11
SCHEDULING EXAMPLE
Schedule this for dual-issue:

Loop: lw   t0, 0(s1)       # t0 = array element
      add  t0, t0, s2      # add with s2
      sw   t0, 0(s1)       # store result
      addi s1, s1, -4      # decrement pointer
      bne  s1, zero, Loop  # branch if s1 != 0

      ALU/branch           Load/store      cycle
Loop: nop                  lw t0, 0(s1)    1
      addi s1, s1, -4      nop             2
      add  t0, t0, s2      nop             3
      bne  s1, zero, Loop  sw t0, 4(s1)    4    # offset is now 4 because the
                                                # addi already moved s1 down by 4

What is the IPC of this machine?
(A) 0.8 (B) 1.0 (C) 1.25 (D) 1.5 (E) I don’t know
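(A worked check for the handout reader: the schedule completes the 5 loop
instructions every 4 cycles, so IPC = 5/4 = 1.25, versus a peak of 2 for a
dual-issue machine.)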
12
DYNAMIC MULTIPLE ISSUE
aka SuperScalar Processor (c.f. Intel)
• CPU chooses multiple instructions to issue each cycle
• Compiler can help, by reordering instructions….
• … but CPU resolves hazards

13
DYNAMIC SCHEDULING

❑ Scheduling is done at execution time

❑ Out-of-order execution:
  • Execute instructions as early as possible
  • Guess results of branches, loads, etc.
  • Roll back if the guesses were wrong
  • Don’t commit results until all previous instructions have committed

14
IMPROVING IPC VIA TLP
Exploiting thread-level parallelism (TLP)
Hardware multithreading to improve utilization:
• Multiplexing multiple threads on a single CPU
• Three types:
  • Coarse-grain (has a preferred thread)
  • Fine-grain (round-robin between threads)
  • Simultaneous (hyperthreading)

15
WHAT IS A THREAD?

❑ Process: multiple threads plus code, data, and OS state

❑ Threads: concurrent computations that share the same address space
  • Share: code, data, files
  • Do not share: registers or stack
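
A minimal sketch of this sharing in C with POSIX threads (the library choice
and all names here are mine, not from the slides): two threads update one
global variable, which lives in the shared data segment, while each thread's
argument copy lives in a local variable on its own private stack.

#include <pthread.h>
#include <stdio.h>

int shared_counter = 0;              /* data segment: shared by all threads */
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

void *worker(void *arg) {
    int local = *(int *)arg;         /* local: lives on this thread's own stack */
    pthread_mutex_lock(&lock);       /* shared data needs synchronization */
    shared_counter += local;
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    int a = 1, b = 2;
    pthread_create(&t1, NULL, worker, &a);
    pthread_create(&t2, NULL, worker, &b);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("shared_counter = %d\n", shared_counter);  /* prints 3 */
    return 0;
}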

16
THREAD MEMORY LAYOUT

[Figure: one shared address space holding three threads. Each thread has its
own PC and SP: thread 1's SP points into Stack 1, thread 2's into Stack 2,
thread 3's into Stack 3. All three PCs point into the shared instruction
segment (Insns), and the Data segment is shared by all threads.]
17
THREAD EXAMPLES
int e;

main () {
int x[10], j, k, m; j = f(x, k); m = g(x, k);
}

int f(int *x, int k) Thread 0


{
int a; a = e * x[k] * x[k]; return a;
}

int g(int *x, int k) Thread 1


{
int a; k = k-1; a = e / x[k]; return a;
}
18
STANDARD MULTITHREADING PICTURE
[Figure: issue slots over time; color = thread, white = no instruction.
Four pipelines compared: a 4-wide superscalar running one thread; CGMT,
which switches to thread B on thread A's L2 miss; FGMT, which switches
threads every cycle; and SMT, where instructions from multiple threads
coexist in the same cycle.]
19
MULTITHREADING PERFORMANCE
[Figure: the same four pipelines over time, numbered 1-4:
(1) 4-wide superscalar, (2) CGMT, (3) FGMT, (4) SMT.]

Which one of these has the best single-thread performance?
Which one of these has the best instruction throughput?
(A) 1 (B) 2 (C) 3 (D) 4 (E) I don't know
20
POWER EFFICIENCY

CPU             Year  Clock rate  Pipeline stages  Issue width  Out-of-order/Speculation  Cores  Power
i486            1989  25 MHz      5                1            No                        1      5 W
Pentium         1993  66 MHz      5                2            No                        1      10 W
Pentium Pro     1997  200 MHz     10               3            Yes                       1      29 W
P4 Willamette   2001  2000 MHz    22               3            Yes                       1      75 W
UltraSparc III  2003  1950 MHz    14               4            No                        1      90 W
P4 Prescott     2004  3600 MHz    31               3            Yes                       1      103 W

21
WHY MULTICORE?

                                  Performance   Power
Single-core (baseline)            1.0x          1.0x
Single-core, overclocked +20%     1.2x          1.7x
Single-core, underclocked -20%    0.8x          0.51x
Dual-core, underclocked -20%      1.6x          1.02x
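
A plausible reading of these numbers, assuming dynamic power P ∝ C·V²·f and
that frequency scales roughly linearly with supply voltage: overclocking by
20% scales both V and f by 1.2, so power grows by about 1.2³ ≈ 1.7x, while
underclocking by 20% scales them by 0.8, so power drops to about 0.8³ ≈ 0.51x.
Two underclocked cores then give roughly 2 × 0.8 = 1.6x the performance (if
the workload parallelizes perfectly) at 2 × 0.51 ≈ 1.02x the power.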

22
POWER EFFICIENCY
CPU Year Clock Pipeline Issue Out-of-order/ Cores Power
Rate Stages width Speculation
i486 1989 25MHz 5 1 No 1 5W
Pentium 1993 66MHz 5 2 No 1 10W
Pentium Pro 1997 200MHz 10 3 Yes 1 29W
P4 Willamette 2001 2000MHz 22 3 Yes 1 75W
UltraSparc III 2003 1950MHz 14 4 No 1 90W
P4 Prescott 2004 3600MHz 31 3 Yes 1 103W
Core 2006 2930MHz 14 4 Yes 2 75W
Core i5 Nehal 2010 3300MHz 14 4 Yes 1 87W
Core i5 Ivy Br 2012 3400MHz 14 4 Yes 8 77W

23
PARALLEL PROGRAMMING

So let’s just use multicore from now on!
… but software must be written as a parallel program.

Multicore difficulties:
• Partitioning work
• Coordination & synchronization
• Communication overhead
• How do you write parallel programs…
  … without knowing the exact underlying architecture?


24
WORK PARTITIONING
Partition work so all cores have something to do
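
One minimal way to partition work in C with POSIX threads (the even-split
strategy and all names are my own illustration, not from the slides): the
index range is cut into one contiguous chunk per core, and each thread
processes only its own chunk.

#include <pthread.h>

#define N 1000000
#define NUM_CORES 4

static double a[N];

struct range { int lo, hi; };           /* one chunk of the index space */

void *do_chunk(void *arg) {
    struct range *r = (struct range *)arg;
    for (int i = r->lo; i < r->hi; i++)
        a[i] = a[i] * 2.0;              /* iterations are independent */
    return NULL;
}

int main(void) {
    pthread_t tid[NUM_CORES];
    struct range chunk[NUM_CORES];
    int per = N / NUM_CORES;
    for (int t = 0; t < NUM_CORES; t++) {
        chunk[t].lo = t * per;
        chunk[t].hi = (t == NUM_CORES - 1) ? N : (t + 1) * per;  /* last chunk takes any remainder */
        pthread_create(&tid[t], NULL, do_chunk, &chunk[t]);
    }
    for (int t = 0; t < NUM_CORES; t++)
        pthread_join(tid[t], NULL);
    return 0;
}

An even split like this only keeps all cores busy if every iteration costs
about the same, which is the point of the next slide.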

25
LOAD BALANCING

Need to partition so all cores are actually working

26
AMDAHL’S LAW
❑ Amdahl’s Law was named after Gene Amdahl, who presented it in
1967.
❑ Amdahl’s Law states that in parallelization, if P is the proportion
of a system or program that can be made parallel, and 1-P is the
proportion that remains serial, then the maximum speedup S(N)
that can be achieved using N processors is:
S(N)=1/((1-P)+(P/N))
❑ As number of cores increases …
▪ time to execute parallel part? goes to zero
▪ time to execute serial part? Remains the same
▪ Serial part eventually dominates
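
A small C sketch that simply evaluates this formula (the value P = 0.95 is
illustrative, not from the slides): even with 95% of the program parallel,
the speedup saturates near 1/(1-P) = 20 no matter how many cores are added.

#include <stdio.h>

/* Amdahl's Law: S(N) = 1 / ((1 - P) + P/N) */
double speedup(double p, int n) {
    return 1.0 / ((1.0 - p) + p / n);
}

int main(void) {
    double p = 0.95;                    /* illustrative: 95% parallel */
    int cores[] = {1, 2, 4, 8, 16, 64, 1024};
    for (int i = 0; i < 7; i++)
        printf("N = %4d  ->  S(N) = %5.2f\n", cores[i], speedup(p, cores[i]));
    /* As N grows, S(N) -> 1/(1-P) = 20: the serial part dominates. */
    return 0;
}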

27
AMDAHL’S LAW

28
CAN YOU DO IT?

A. for (i = 0; i < N; i++)
       a[i] = b[i] / 2.0;

B. for (i = 1; i < N; i++)
       a[i] = a[i-1] * b[i];

C. int i;
   float *a, *b, *c, tmp;
   ...
   for (i = 0; i < N; i++) {
       tmp = a[i] / b[i];
       c[i] = tmp * tmp;
   }

Which code is parallelizable?
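
A hedged sketch of one way loop C could be parallelized (OpenMP is my choice
of tool, not from the slides): each thread must get a private copy of tmp,
otherwise the threads race on it. Loop B cannot be handled this way, because
iteration i reads the result of iteration i-1.

/* compile with: cc -fopenmp ... */
void square_ratio(const float *a, const float *b, float *c, int N) {
    float tmp;
    #pragma omp parallel for private(tmp)   /* each thread gets its own tmp */
    for (int i = 0; i < N; i++) {
        tmp = a[i] / b[i];
        c[i] = tmp * tmp;
    }
}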
29
CAN YOU DO IT?

A. for (j = 1; j < n; j++)
       for (i = 0; i < m; i++)
           a[i][j] = 2 * a[i][j-1];

B. for (i = 0; i < m; i++)
       for (j = 1; j < n; j++)
           a[i][j] = 2 * a[i][j-1];

Which code is better?
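
A note for the handout reader (my analysis, not spelled out on the slide): in
both versions the dependence runs along j, since a[i][j] reads a[i][j-1], so
only the i loop is safe to parallelize. Version B keeps that independent i
loop outermost, so each thread can take whole rows and walk them in
cache-friendly row-major order; version A would have to fork and join on its
inner loop for every j and strides across rows on each access.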


30
HAVE YOU EVER SEEN THIS?

31
BEFORE NEXT CLASS

• Textbook: Section 8.4
• Next time: Virtual Memory

32
