Chap. 9 Pipeline and Vector Processing
z 1) SISD (Single Instruction - Single Data stream)
» for practical purposes: only one processor is useful
» Example systems : Amdahl 470V/6, IBM 360/91
[Figure: SISD organization: CU, PU, and MM linked by an instruction stream (IS) and a data stream (DS)]
[Figure: processor registers connected to memory and to functional units: incrementer, floating-point add-subtract, floating-point multiply, floating-point divide]
z 2) SIMD (Single Instruction - Multiple Data stream)
» Suited to vector or array operations: one vector operation includes many operations on a data stream
» Example systems : CRAY-1, ILLIAC-IV
[Figure: SIMD organization: one CU sends IS to PU 1 ... PU n; each PU i is linked by DS i to MM i in the shared memory]
z 3) MISD (Multiple Instruction - Single Data stream)
» Not used in practice because of the bottleneck in the data stream
[Figure: MISD organization: CU1 ... CUn send IS1 ... ISn to PU 1 ... PU n, which share one DS through the shared memory MM1 ... MMn]
z 4) MIMD (Multiple Instruction - Multiple Data stream)
» Used in most multiprocessor systems
» Shared memory or message passing : Chap. 13
[Figure: MIMD organization: CU1 ... CUn send IS1 ... ISn to PU 1 ... PU n; each PU i is linked by its own DS to MM i in the shared memory]
9-2 Pipelining
Principle of pipelining
z Decomposing a sequential process into suboperations
z Each subprocess is executed in a special dedicated segment concurrently
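[Figure: space-time diagram showing tasks T1 to T6 moving through the pipeline segments, one segment per clock cycle]
That is, if the time of the nonpipelined unit ( t_n ) is assumed equal to the total time through the pipeline ( k • t_p ), the speedup is
S = t_n / t_p = k • t_p / t_p = k
So in theory the processing speed improves by a factor of k (the number of segments).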
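The claim that the speedup approaches k can be checked numerically. The sketch below is not from the slides: it assumes the usual timing model in which a k-segment pipeline with clock period t_p completes n tasks in (k + n - 1) clock cycles, while a nonpipelined unit needs t_n per task. The function names and the 20 ns clock are illustrative assumptions.

```python
def pipeline_time(k, n, tp):
    """Time for n tasks on a k-segment pipeline with clock period tp."""
    # The first task needs k cycles; every later task finishes one cycle after the previous one.
    return (k + n - 1) * tp


def speedup(k, n, tp, tn=None):
    """Speedup of the pipeline over a nonpipelined unit that takes tn per task."""
    if tn is None:
        tn = k * tp  # assumption used on the slide: t_n = k * t_p
    return (n * tn) / pipeline_time(k, n, tp)


k, tp = 4, 20  # 4 segments, 20 ns clock (illustrative values)
for n in (6, 100, 10_000):
    print(n, round(speedup(k, n, tp), 3))
```

For a small number of tasks the speedup stays well below k because of the cycles needed to fill the pipeline; it approaches k only as n grows.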
There are two kinds of pipeline: the arithmetic pipeline (Sec. 9-3) and the instruction pipeline (Sec. 9-4).
Sec. 9-3 Arithmetic Pipeline
Floating-point Adder Pipeline Example : Fig. 9-6
z Add / Subtract two normalized floating-point binary numbers
» X = A x 2^a = 0.9504 x 10^3
» Y = B x 2^b = 0.8200 x 10^2
© Korea Univ. of Tech. & Edu.
Computer System Architecture Chap. 9 Pipeline and Vector Processing Dept. of Info. & Comm.
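z 4 segment suboperations
» 1) Compare exponents by subtraction : 3 - 2 = 1
X = 0.9504 x 10^3
Y = 0.8200 x 10^2
» 2) Align mantissas by shifting the mantissa with the smaller exponent right by the exponent difference
X = 0.9504 x 10^3
Y = 0.08200 x 10^3
» 3) Add mantissas
Z = 1.0324 x 10^3
» 4) Normalize the result and adjust the exponent
Z = 0.10324 x 10^4
[Fig. 9-6: floating-point adder pipeline; exponents a, b and mantissas A, B enter through registers R; segment 1 compares the exponents and takes their difference; segment 4 adjusts the exponent and normalizes the result]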
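The four suboperations can be traced in a few lines of code. This is a minimal sketch, not the hardware of Fig. 9-6: it uses decimal mantissas as in the worked example, represents a number as a hypothetical (mantissa, exponent) pair, handles addition only, and uses exact fractions so that no rounding obscures the steps.

```python
from fractions import Fraction


def fp_add(x, y, base=10):
    """Add two normalized numbers given as (mantissa, exponent) pairs."""
    (ma, ea), (mb, eb) = x, y
    # Segment 1: compare the exponents by subtraction.
    diff = ea - eb
    # Segment 2: align the mantissas by shifting the one with the smaller exponent right.
    if diff >= 0:
        mb, exp = mb / base**diff, ea
    else:
        ma, exp = ma / base**(-diff), eb
    # Segment 3: add the mantissas.
    m = ma + mb
    # Segment 4: normalize the result and adjust the exponent.
    while abs(m) >= 1:
        m, exp = m / base, exp + 1
    while m != 0 and abs(m) < Fraction(1, base):
        m, exp = m * base, exp - 1
    return m, exp


X = (Fraction("0.9504"), 3)
Y = (Fraction("0.8200"), 2)
mantissa, exponent = fp_add(X, Y)
print(float(mantissa), exponent)
```

In a real pipeline each segment works on a different pair of operands during the same clock cycle; the function above only shows the order of the suboperations for a single pair.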
Sec. 9-4 Instruction Pipeline
» 1) FI : Instruction Fetch (fetch instruction from memory)
» 2) DA : Decode Instruction & calculate EA (effective address)
» 3) FO : Operand Fetch
» 4) EX : Execution
» Instruction 3 executes a branch instruction
[Fig. 9-8: timing of the instruction pipeline over steps 1 to 13 for instructions 1 to 7, each passing through FI, DA, FO, EX; instruction 3 is marked (Branch)]
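The effect of the branch on the timing can be sketched with a small scheduler. This is an illustrative model, not the slide's own method: it assumes that after a branch no useful instruction is fetched until the branch has finished EX, which is consistent with the 13 steps shown for 7 instructions. It does not model an instruction that is fetched and then discarded during the stall. All names are hypothetical.

```python
STAGES = ("FI", "DA", "FO", "EX")


def schedule(n_instructions, branches=()):
    """Return {instruction number: step of its FI stage}."""
    start = {}
    step = 1
    for i in range(1, n_instructions + 1):
        start[i] = step
        # After a branch, the next useful fetch waits until the branch has left EX.
        step += len(STAGES) if i in branches else 1
    return start


def total_steps(start):
    """Step in which the last instruction completes EX."""
    return max(start.values()) + len(STAGES) - 1


def timing_rows(start):
    """One text row per instruction, indented by its starting step."""
    return [f"{i}: " + "   " * (s - 1) + " ".join(STAGES) for i, s in start.items()]


start = schedule(7, branches={3})
for row in timing_rows(start):
    print(row)
print("total steps:", total_steps(start))
```

Calling `schedule(7)` with no branches gives the ideal case, in which one instruction completes in every step once the pipeline is full.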
does not access memory.
z Branch Prediction
» Uses additional hardware logic to predict the branch
» Correct guess eliminates the branch difficulty
z As in Fig. 9-8, a branch instruction
[Figure: delayed branch in a three-segment pipeline (I, A, E) over clock cycles 1 to 8. Panel (a), using no-operation instructions, shows 6. No-operation, 7. No-operation, 8. Instruction in X; a second panel shows 2. Increment, 3. Branch to X.]
z Vector processor
» Single vector instruction
C(1:100) = A(1:100) + B(1:100)
ADD A B C 100
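What the single vector instruction does can be emulated with a short function. This is a sketch under stated assumptions: the slide shows only `ADD A B C 100`, so reading the fields as two source base addresses, a destination base address, and a vector length is an assumption, and the flat `memory` list and the addresses 0, 100, 200 are made up for illustration.

```python
def vector_add(memory, a_base, b_base, c_base, length):
    """Emulate one vector instruction: C(1:length) = A(1:length) + B(1:length)."""
    for i in range(length):
        memory[c_base + i] = memory[a_base + i] + memory[b_base + i]


memory = [0] * 300
memory[0:100] = range(1, 101)   # A(1:100) = 1, 2, ..., 100
memory[100:200] = [10] * 100    # B(1:100) = 10 in every element

vector_add(memory, 0, 100, 200, 100)  # plays the role of ADD A B C 100
print(memory[200:205])
```

The Python loop runs the element additions one after another; on a vector processor the same 100 additions are issued by one instruction, with no per-element instruction fetch and decode.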
Matrix Multiplication
z 3 x 3 matrix multiplication : n^2 = 9 inner products
[a11 a12 a13]   [b11 b12 b13]   [c11 c12 c13]
[a21 a22 a23] x [b21 b22 b23] = [c21 c22 c23]
[a31 a32 a33]   [b31 b32 b33]   [c31 c32 c33]
[Figure: operand pairs A1B1 through A8B8 entering the inner-product pipeline]
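The count of n^2 inner products can be made concrete with a short sketch. It is illustrative only: the function names and the sample matrices are assumptions, and each inner product is computed sequentially rather than in a pipeline.

```python
def inner_product(row, col):
    """Sum of element-wise products of one row of A and one column of B."""
    return sum(a * b for a, b in zip(row, col))


def matmul(A, B):
    """Multiply two n x n matrices using n * n inner products."""
    n = len(A)
    cols = list(zip(*B))  # columns of B
    return [[inner_product(A[i], cols[j]) for j in range(n)] for i in range(n)]


A = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
B = [[9, 8, 7], [6, 5, 4], [3, 2, 1]]
C = matmul(A, B)
for row in C:
    print(row)
```

Each element c_ij needs one inner product of length 3, so the 3 x 3 case uses 9 inner products and 27 multiply-add operations in total.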
z The low-order 2 bits of AR select one of the 4 memory modules
[Figure: four memory modules, each with its own data register DR]
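The module selection can be written as two bit operations. This is a minimal sketch: the slide states only that the low-order 2 bits of AR pick the module, so using the remaining high-order bits as the word address inside the module is an assumption, as are the function and constant names.

```python
NUM_MODULES = 4  # 2 selection bits give 4 modules


def select_module(address):
    """Split an address into (module number, word offset inside that module)."""
    module = address & (NUM_MODULES - 1)  # low-order 2 bits of AR
    offset = address >> 2                 # assumed: remaining bits address a word in the module
    return module, offset


for address in range(8):
    print(address, select_module(address))
```

Consecutive addresses fall in different modules, so accesses to a vector stored at consecutive addresses can overlap across the four modules.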
Supercomputer
z Supercomputer = Vector Instruction + Pipelined floating-point arithmetic
z Performance Evaluation Index
» MIPS : Million Instructions Per Second
» FLOPS : Floating-point Operations Per Second
megaflops : 10^6, gigaflops : 10^9, teraflops : 10^12
z Cray supercomputer : Cray Research
» Cray-1 : 80 megaflops, 4 million 64-bit words of memory
» Cray-2 : 12 times more powerful than the Cray-1
z VP supercomputer : Fujitsu
» VP-200 : 300 megaflops, 32 million memory, 83 vector instructions, 195 scalar instructions
» VP-2600 : 5 gigaflops
Array Processing
z Attached array processor : Fig. 9-14
» Auxiliary processor attached to a general purpose computer
z SIMD array processor : Fig. 9-15
» Computer with multiple processing units operating in parallel
» For the vector computation C = A + B, each PE_i executes c_i = a_i + b_i simultaneously
[Figs. 9-14 and 9-15: general-purpose computer, input-output interface, and attached array processor; master control unit, main memory, and PE 1 to PE n with local memories M1 to Mn]
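The SIMD organization can be imitated in software to show how the work is divided. This is a sketch under stated assumptions: the class and function names are hypothetical, each PE keeps its operands in a small dictionary standing in for its local memory M_i, and a sequential loop stands in for what the hardware does in all PEs at the same time.

```python
class ProcessingElement:
    """One PE with its own local memory M_i."""

    def __init__(self, a, b):
        self.local = {"a": a, "b": b, "c": None}

    def execute(self, op):
        # Every PE receives the same instruction and applies it to its own data.
        if op == "ADD":
            self.local["c"] = self.local["a"] + self.local["b"]
        else:
            raise ValueError(f"unsupported operation: {op}")


def master_control(pes, op):
    """Broadcast one instruction to all PEs (sequential stand-in for parallel execution)."""
    for pe in pes:
        pe.execute(op)


A = [1, 2, 3, 4]
B = [10, 20, 30, 40]
pes = [ProcessingElement(a, b) for a, b in zip(A, B)]
master_control(pes, "ADD")
C = [pe.local["c"] for pe in pes]
print(C)
```

With n PEs the vector addition takes one instruction time regardless of n, provided that element i of each vector is already stored in M_i.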