Neumann Machine:
A Crash Course in
Modern Hardware
Dr. Cliff Click
Azul Systems
Brian Goetz
Sun Microsystems
Agenda
> Introduction
CPU Performance
CISC era
Frequency scaling era
Multicore era
CISC systems
> Any instruction could be used with any data type and any combination of addressing modes
Power Wall
ILP Wall
Memory Wall
Speed of light
Multiple-issue
Pipelining
Branch Prediction
Speculative execution
Out-Of-Order (O-O-O) execution
Hit-Under-Miss cache, no-lockup cache
Prefetching
Pipelining
add rbx,16
cmp rax,0
(Figure: pipeline diagram of these two instructions; horizontal axis: Time)
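A minimal sketch of why overlapping instructions along the time axis pays off, assuming an idealized pipeline with no stalls. The 5-stage depth and the names `PipelineCycles`, `unpipelined`, and `pipelined` are illustrative assumptions, not taken from the slides.

```java
class PipelineCycles {
    // Each instruction runs through every stage before the next one starts.
    static long unpipelined(long instructions, int stages) {
        return instructions * stages;
    }

    // Ideal pipeline: the first instruction takes `stages` cycles to drain,
    // then one more instruction completes on every following cycle.
    static long pipelined(long instructions, int stages) {
        return instructions == 0 ? 0 : stages + instructions - 1;
    }

    public static void main(String[] args) {
        int stages = 5;   // assumed pipeline depth
        long n = 2;       // the add and the cmp from the slide
        System.out.println("unpipelined: " + unpipelined(n, stages) + " cycles"); // 10
        System.out.println("pipelined:   " + pipelined(n, stages) + " cycles");   // 6
    }
}
```

The gap widens with longer instruction streams: throughput approaches one instruction per cycle, provided nothing stalls the pipeline.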
Pipelining hazards
ld rax[rbx+16]
cmp rax,0
> Fairly common
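The load followed by a compare on the same register is a read-after-write dependency: the cmp cannot proceed until the ld delivers rax. A rough sketch of detecting that dependency is below. The `Insn` model (one destination register, a list of source registers) and the method name `dependsOn` are hypothetical simplifications, not how any particular CPU tracks hazards.

```java
class LoadUseHazard {
    // Hypothetical instruction model: at most one register written, any number read.
    static final class Insn {
        final String op;
        final String dest;    // register written, or null if none
        final String[] srcs;  // registers read

        Insn(String op, String dest, String... srcs) {
            this.op = op;
            this.dest = dest;
            this.srcs = srcs;
        }
    }

    // True when `consumer` reads the register that `producer` writes.
    static boolean dependsOn(Insn consumer, Insn producer) {
        if (producer.dest == null) {
            return false;
        }
        for (String src : consumer.srcs) {
            if (src.equals(producer.dest)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Insn ld = new Insn("ld", "rax", "rbx");   // ld rax[rbx+16]
        Insn cmp = new Insn("cmp", null, "rax");  // cmp rax,0
        System.out.println("cmp must wait for ld: " + dependsOn(cmp, ld)); // true
    }
}
```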
Branch Prediction
ld  rax[rbx+16]
...
cmp rax,0          No RAX yet, so no flags
jeq null_chk       Branch not resolved
st  [rbx-16]rcx    ...speculative execution
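The slide does not say how the hardware guesses the branch. One textbook scheme is a 2-bit saturating counter, sketched here purely as an illustration. The class name, the initial counter value, and the idea that a null check is almost never taken are assumptions for the example.

```java
class TwoBitPredictor {
    // 0 and 1 predict "not taken"; 2 and 3 predict "taken".
    private int counter = 1; // assumed start: weakly not taken

    boolean predictTaken() {
        return counter >= 2;
    }

    // Move one step toward the observed outcome, saturating at 0 and 3.
    void update(boolean taken) {
        counter = taken ? Math.min(3, counter + 1) : Math.max(0, counter - 1);
    }

    // Counts wrong guesses over a recorded sequence of branch outcomes.
    static int mispredictions(boolean[] outcomes) {
        TwoBitPredictor predictor = new TwoBitPredictor();
        int misses = 0;
        for (boolean taken : outcomes) {
            if (predictor.predictTaken() != taken) {
                misses++;
            }
            predictor.update(taken);
        }
        return misses;
    }

    public static void main(String[] args) {
        boolean[] nullChecks = new boolean[1000]; // all false: the jeq never jumps
        nullChecks[500] = true;                   // one rare null
        System.out.println("mispredictions: " + mispredictions(nullChecks)); // 1
    }
}
```

With a predictor like this, a rarely taken branch such as a null check costs almost nothing, which is what makes speculating past it worthwhile.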
Multiple issue
Dual-Issue or Wide-Issue
add rbx,16
cmp rax,0
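A rough sketch of in-order dual issue: two adjacent instructions go out in the same cycle only when the second neither reads nor writes the register the first one writes. The encoding (`{dest, src1, src2}` with `-1` for "none"), the register numbering, and ignoring the flags register are all simplifying assumptions made for this example.

```java
class DualIssue {
    static final int NONE = -1, RAX = 0, RBX = 1;

    // Each instruction is {dest, src1, src2}; NONE marks an unused slot.
    static boolean independent(int[] first, int[] second) {
        int dest = first[0];
        if (dest == NONE) {
            return true;
        }
        return second[0] != dest && second[1] != dest && second[2] != dest;
    }

    // Greedy in-order pairing: issue two instructions per cycle when they are independent.
    static int cycles(int[][] program) {
        int cycles = 0;
        int i = 0;
        while (i < program.length) {
            boolean pair = i + 1 < program.length && independent(program[i], program[i + 1]);
            i += pair ? 2 : 1;
            cycles++;
        }
        return cycles;
    }

    public static void main(String[] args) {
        int[][] addThenCmp = { {RBX, RBX, NONE}, {NONE, RAX, NONE} }; // add rbx,16 ; cmp rax,0
        int[][] ldThenCmp  = { {RAX, RBX, NONE}, {NONE, RAX, NONE} }; // ld rax[rbx+16] ; cmp rax,0
        System.out.println(cycles(addThenCmp)); // 1: the pair issues together
        System.out.println(cycles(ldThenCmp));  // 2: cmp needs rax from the ld
    }
}
```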
ld  rax[rbx+16]
add rbx,16
cmp rax,0
jeq null_chk
st  [rbx-16]rcx
ld  rcx[rdx+0]
ld  rax[rax+8]
Clock 0 instruction 0: ld rax[rbx+16]
Load RAX from memory
Assume cache miss
300 cycles to load
Instruction starts and dispatch continues...
Clock 0 instruction 1: add rbx,16
Clock 0 instruction 2: cmp rax,0
Clock 0 instruction 3: jeq null_chk
Clock 1 instruction 4: st [rbx-16]rcx
Store is speculative
Result kept in store buffer
Also RBX might be null
L/S used, no more mem ops this cycle
Clock 2 instruction 5: ld rcx[rdx+0]
Contrast to GPUs
Types of memory
Caching
Stale reads
Order of execution is relative to the observing CPU (thread)
(Figure: CPU #0 and CPU #1, each with Registers, Compute, and an LD/ST Unit; below them the Caches, the Memory Controller, and RAM. Each cache line carries a MESI state.)
MESI: Modified, Exclusive, Shared, Invalid
Data is replicated
No single home
Complex protocol
Initial values
RAM: data: 0, flag: 0
Caches: data: 0 and flag: 0 are each held in state S; the remaining lines are in state I
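A deliberately simplified sketch of the MESI state changes for a single cache line shared by several CPUs. It tracks states only (no data, no bus messages, no write-back), and the class and method names are hypothetical. Real protocols have more transitions than shown here.

```java
class MesiSketch {
    enum State { MODIFIED, EXCLUSIVE, SHARED, INVALID }

    // state[i] is the state of the one modeled line in CPU i's cache.
    final State[] state;

    MesiSketch(int cpus) {
        state = new State[cpus];
        java.util.Arrays.fill(state, State.INVALID);
    }

    // A load: a hit leaves the state alone; a miss fetches the line and
    // downgrades any other holder to SHARED.
    void read(int cpu) {
        if (state[cpu] != State.INVALID) {
            return;
        }
        boolean othersHold = false;
        for (int i = 0; i < state.length; i++) {
            if (i != cpu && state[i] != State.INVALID) {
                othersHold = true;
                state[i] = State.SHARED;
            }
        }
        state[cpu] = othersHold ? State.SHARED : State.EXCLUSIVE;
    }

    // A store: every other copy is invalidated and the writer holds the line MODIFIED.
    void write(int cpu) {
        for (int i = 0; i < state.length; i++) {
            if (i != cpu) {
                state[i] = State.INVALID;
            }
        }
        state[cpu] = State.MODIFIED;
    }

    public static void main(String[] args) {
        MesiSketch line = new MesiSketch(2);
        line.read(0);
        line.read(1);
        System.out.println(line.state[0] + " " + line.state[1]); // SHARED SHARED
        line.write(0);
        System.out.println(line.state[0] + " " + line.state[1]); // MODIFIED INVALID
    }
}
```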
CPU #1:
ld rax,[&flag]
jeq rax,...
(Figure: animation frames. ?rax marks a register value that has not arrived yet; ?flag marks the outstanding request for flag as it moves from the LD/ST Unit through the cache to the Memory Controller. The jeq is dispatched while the load is still pending.)
CPU #0:
data = ...;
mov rax,123
st [&data],rax
CPU #1:
if( !flag )
return data;
ld rax,[&flag]
jeq rax,...
ld rbx,[&data]
True data race
data in 2 places
Value is relative to observer
(Figure: animation frames. CPU #0 holds rax:123, and its cache holds data: 123 in state M and flag: 0 in state S. CPU #1 has speculatively loaded rbx:0 while rax is still pending (?rax); both of its cache lines are in state I. RAM still holds data: 0, flag: 0.)
CPU #0:
mov rax,123
st [&data],rax
st [&flag],1
CPU #1:
ld rax,[&flag]
jeq rax,...
ld rbx,[&data]
ret rbx
(Figure: animation frames. CPU #0's cache holds data: 123 in state M and flag: 1, which moves from M to S as flag:1 is delivered to CPU #1's cache in state S. CPU #1 ends with rax:1, rbx:0.)
Returning null
Crash-n-burn time!
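The walk-through above ends with CPU #1 seeing the new flag but the old data. The slides stay at the hardware level; at the Java level, one common way to rule out that outcome is to make the flag `volatile`, sketched below. The class and method names are hypothetical, and unlike the slide's reader, this reader spins until the flag is set so the example always terminates with a result.

```java
class PublishWithVolatile {
    static int data;           // plain field, published through the flag
    static volatile int flag;  // volatile: the write to data is visible to any thread that reads flag == 1

    static void writer() {
        data = 123;  // st [&data],rax
        flag = 1;    // st [&flag],1 (volatile store, not reordered before the data store)
    }

    static int reader() {
        while (flag == 0) {        // ld rax,[&flag] (volatile load)
            Thread.onSpinWait();
        }
        return data;               // ld rbx,[&data] is guaranteed to see 123
    }

    // Runs one writer and one reader thread and returns what the reader saw.
    static int runOnce() {
        data = 0;
        flag = 0;
        int[] seen = new int[1];
        Thread r = new Thread(() -> seen[0] = reader());
        Thread w = new Thread(PublishWithVolatile::writer);
        r.start();
        w.start();
        try {
            r.join();
            w.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return seen[0];
    }

    public static void main(String[] args) {
        System.out.println(runOnce()); // 123
    }
}
```

Without `volatile` on `flag`, the Java memory model would allow the reader to return 0, the same stale read the slides illustrate.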
Niagara supports 4 threads per core
Niagara II supports 8
Managing performance
> Dominant operations
> Locality is critical
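A small sketch of what locality means in code: both methods below compute the same sum over a 2D array, but the first walks each row's elements in order while the second jumps to a different row on every access. On typical hardware the row-major walk is usually the faster one because it uses each cache line fully; no timings are claimed here. The class and method names are illustrative.

```java
class Locality {
    // Visits each row's elements in order: consecutive accesses are adjacent in memory.
    static long sumRowMajor(int[][] a) {
        long sum = 0;
        for (int r = 0; r < a.length; r++) {
            for (int c = 0; c < a[r].length; c++) {
                sum += a[r][c];
            }
        }
        return sum;
    }

    // Visits one element per row before moving on: each access lands in a different row array.
    // Assumes a rectangular array.
    static long sumColumnMajor(int[][] a) {
        long sum = 0;
        int cols = a.length == 0 ? 0 : a[0].length;
        for (int c = 0; c < cols; c++) {
            for (int r = 0; r < a.length; r++) {
                sum += a[r][c];
            }
        }
        return sum;
    }

    // Builds a rows x cols array with a[r][c] = r + c.
    static int[][] filled(int rows, int cols) {
        int[][] a = new int[rows][cols];
        for (int r = 0; r < rows; r++) {
            for (int c = 0; c < cols; c++) {
                a[r][c] = r + c;
            }
        }
        return a;
    }

    public static void main(String[] args) {
        int[][] a = filled(2048, 2048);
        System.out.println(sumRowMajor(a) == sumColumnMajor(a)); // true: same work, different access order
    }
}
```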
> Shared data == OK
> Mutable data == OK
> Requires synchronization
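Reading the lines above as "data that is both shared and mutable requires synchronization" (an interpretation, since the slide text is incomplete), here is a minimal sketch of a counter shared by several threads. The class name and the choice of `synchronized` methods are assumptions for illustration; other mechanisms would also work.

```java
class SharedCounter {
    private long value; // shared between threads and mutable

    synchronized void increment() {
        value++;
    }

    synchronized long get() {
        return value;
    }

    // Starts `threads` threads that each increment the same counter `perThread` times.
    static long countWith(int threads, int perThread) {
        SharedCounter counter = new SharedCounter();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < perThread; j++) {
                    counter.increment();
                }
            });
            workers[i].start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return counter.get();
    }

    public static void main(String[] args) {
        System.out.println(countWith(4, 10_000)); // 40000
    }
}
```

Dropping `synchronized` from `increment` would let updates from different threads overwrite each other, so the total could come out lower than expected.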
Summary
Ulrich Drepper