Chap. 9 Pipeline and Vector Processing

9-1 Parallel Processing
Simultaneous data-processing tasks for the purpose of increasing computational speed
• Perform concurrent data processing to achieve faster execution time
Multiple Functional Units : Fig. 9-1 Parallel Processing Example
• Separate the execution unit into eight functional units operating in parallel
» Adder-subtractor, Integer multiply, Logic unit, Shift unit, Incrementer, Floating-point add-subtract, Floating-point multiply, Floating-point divide
» [Fig. 9-1: the eight units operate on operands from the processor registers, which also connect to memory]
Computer Architectural Classification
• Data-Instruction Stream : Flynn
• Serial versus Parallel Processing : Feng
• Parallelism and Pipelining : Händler
Flynn’s Classification
(CU : control unit, PU : processor unit, MM : memory module, IS : instruction stream, DS : data stream)
• 1) SISD (Single Instruction - Single Data stream)
» For practical purposes, only one processor is useful
» Example systems : Amdahl 470V/6, IBM 360/91
» [Figure: the CU sends one IS to the PU; the PU exchanges one DS with the MM, which supplies the IS to the CU]
© Korea Univ. of Tech. & Edu.

Computer System Architecture Chap. 9 Pipeline and Vector Processing Dept. of Info. & Comm.
• 2) SIMD (Single Instruction - Multiple Data stream)
» Well suited to vector or array operations: one vector operation includes many operations on a data stream
» Example systems : CRAY-1, ILLIAC-IV
» [Figure: one CU broadcasts a single IS to PU 1 … PU n; each PU i exchanges its own DS i with module MM i of the shared memory]
• 3) MISD (Multiple Instruction - Single Data stream)
» Not used in practice, because the single data stream becomes a bottleneck
» [Figure: CU 1 … CU n each send their own IS i to PU i; all PUs work on one DS from the shared memory MM 1 … MM n]

• 4) MIMD (Multiple Instruction - Multiple Data stream)
» Used in most multiprocessor systems
» Shared memory or message passing : Chap. 13
» [Figure: CU 1 … CU n each send their own IS i to PU i; each PU i exchanges its own DS with module MM i of the shared memory]
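Flynn's four categories depend only on whether a machine has one or several instruction streams and one or several data streams. The sketch below makes that rule explicit; the function name `flynn_class` and its integer arguments are illustrative assumptions, not part of the original slides.

```python
def flynn_class(instruction_streams: int, data_streams: int) -> str:
    """Return the Flynn category (SISD, SIMD, MISD, MIMD) for the stream counts."""
    if instruction_streams < 1 or data_streams < 1:
        raise ValueError("stream counts must be at least 1")
    # "S" for a single stream, "M" for multiple streams.
    i = "S" if instruction_streams == 1 else "M"
    d = "S" if data_streams == 1 else "M"
    return f"{i}I{d}D"


if __name__ == "__main__":
    for streams in [(1, 1), (1, 64), (4, 1), (4, 4)]:
        print(streams, flynn_class(*streams))
```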
Main topics in this Chapter
• Pipeline processing : Sec. 9-2
» Arithmetic pipeline : Sec. 9-3
» Instruction pipeline : Sec. 9-4 (Sec. 9-5 : RISC Instruction Pipeline)
• Vector processing : uses adder/multiplier pipelines, Sec. 9-6
• Array processing : uses a separate array processor, Sec. 9-7
» Attached array processor : Fig. 9-14
» SIMD array processor : Fig. 9-15
• Both vector and array processing target computations on large vectors, matrices, and array data

9-2 Pipelining
Principle of pipelining
• Decompose a sequential process into suboperations
• Each suboperation is executed in its own dedicated segment, and all segments operate concurrently
Pipelining example : Fig. 9-2
• Multiply and Add Operation : Ai * Bi + Ci ( for i = 1, 2, …, 7 )
• Split into 3 suboperation segments
» 1) R1 ← Ai , R2 ← Bi : Input Ai and Bi
» 2) R3 ← R1 * R2 , R4 ← Ci : Multiply and input Ci
» 3) R5 ← R3 + R4 : Add Ci to the product
• Content of registers in pipeline example : Tab. 9-1
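The register transfers above can be traced with a small simulation that reproduces the kind of register-content table referred to as Tab. 9-1. This is a minimal sketch: the function name `multiply_add_pipeline` and the use of `None` for a register that holds no valid data are assumptions made for illustration.

```python
def multiply_add_pipeline(a, b, c):
    """Simulate the three-segment pipeline one clock pulse at a time.

    Returns one (R1, R2, R3, R4, R5) tuple per clock pulse. None marks a
    register that holds no valid data on that pulse.
    """
    n = len(a)
    r1 = r2 = r3 = r4 = r5 = None
    history = []
    for clock in range(n + 2):  # n tasks need n + 2 pulses to drain 3 segments
        # All registers load on the same clock edge, so the last segment is
        # updated first and consumes the values latched on the previous pulse.
        r5 = r3 + r4 if r3 is not None else None
        if r1 is not None:
            r3, r4 = r1 * r2, c[clock - 1]
        else:
            r3 = r4 = None
        if clock < n:
            r1, r2 = a[clock], b[clock]
        else:
            r1 = r2 = None
        history.append((r1, r2, r3, r4, r5))
    return history


if __name__ == "__main__":
    for pulse, regs in enumerate(multiply_add_pipeline([1, 2, 3], [4, 5, 6], [7, 8, 9]), 1):
        print(pulse, regs)
```

The first result appears in R5 on the third pulse, and one new result follows on every pulse after that.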
General considerations
• 4 segment pipeline : Fig. 9-3
» S : Combinational circuit that performs the suboperation
» R : Register (holds intermediate results between the segments)
• Space-time diagram : Fig. 9-4
» Shows segment utilization as a function of time (segment versus clock cycle)
• Task : T1, T2, T3, …, T6
» The total operation performed in going through all the segments
Speedup S : Nonpipeline / Pipeline
• S = n • tn / ( ( k + n - 1 ) • tp ) = 6 • 4 tp / ( ( 4 + 6 - 1 ) • tp ) = 24 tp / 9 tp = 2.67
» n : number of tasks ( 6 )
» tn : time to complete each task in the nonpipelined unit ( 4 tp in this example )
» tp : clock cycle time ( 1 clock cycle )
» k : number of segments ( 4 )
» Processing time in the pipeline = k + n - 1 = 9 clock cycles
• As n → ∞, k + n - 1 ≈ n, so S = tn / tp
• If the time to process one task is the same in both cases, i.e. nonpipeline ( tn ) = pipeline ( k • tp ), then
S = tn / tp = k • tp / tp = k
» So in theory the processing speed improves by a factor of k (the number of segments)
Space-time diagram for k = 4 segments and n = 6 tasks : Fig. 9-4
Clock cycles : 1  2  3  4  5  6  7  8  9
Segment 1    : T1 T2 T3 T4 T5 T6
Segment 2    :    T1 T2 T3 T4 T5 T6
Segment 3    :       T1 T2 T3 T4 T5 T6
Segment 4    :          T1 T2 T3 T4 T5 T6
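The speedup formula above can be checked numerically. The sketch below evaluates S = n • tn / ( ( k + n - 1 ) • tp ) for the slide's values and for a very large n; the function name `pipeline_speedup` is an illustrative assumption.

```python
def pipeline_speedup(n: int, k: int, tn: float, tp: float) -> float:
    """Speedup of a k-segment pipeline over a nonpipelined unit for n tasks.

    tn is the nonpipelined time per task, tp is the pipeline clock cycle time.
    """
    if n < 1 or k < 1 or tn <= 0 or tp <= 0:
        raise ValueError("n and k must be >= 1, tn and tp must be positive")
    return (n * tn) / ((k + n - 1) * tp)


if __name__ == "__main__":
    # Slide example: 6 tasks, 4 segments, tn = 4 tp.
    print(round(pipeline_speedup(n=6, k=4, tn=4, tp=1), 2))
    # For a very long task stream the speedup approaches tn / tp = k.
    print(pipeline_speedup(n=10**9, k=4, tn=4, tp=1))
```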
Pipelines come in two kinds: arithmetic pipelines (Sec. 9-3) and instruction pipelines (Sec. 9-4)
Sec. 9-3 Arithmetic Pipeline
Floating-point Adder Pipeline Example : Fig. 9-6
• Add / subtract two normalized floating-point binary numbers
» X = A × 2^a , Y = B × 2^b
» Decimal example used below : X = 0.9504 × 10^3 , Y = 0.8200 × 10^2
• 4 segment suboperations
» 1) Compare exponents by subtraction : 3 - 2 = 1
X = 0.9504 × 10^3
Y = 0.8200 × 10^2
» 2) Align mantissas by shifting the mantissa with the smaller exponent right by the exponent difference
X = 0.9504 × 10^3
Y = 0.0820 × 10^3
» 3) Add mantissas
Z = 1.0324 × 10^3
» 4) Normalize result
Z = 0.10324 × 10^4
• Fig. 9-6 pipeline (R : register between segments; inputs are exponents a, b and mantissas A, B)
» Segment 1 : Compare exponents by subtraction, produce the difference
» Segment 2 : Choose exponent, align mantissas
» Segment 3 : Add or subtract mantissas
» Segment 4 : Adjust exponent, normalize result
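The four suboperations above can be followed step by step in code. This is a minimal sketch for decimal operands held as (mantissa, exponent) pairs, as in the slide's example; it is not a model of binary hardware, and the function name `fp_add` is an assumption made for illustration.

```python
from decimal import Decimal


def fp_add(x, y):
    """Add two decimal floating-point numbers given as (mantissa, exponent)."""
    (ma, ea), (mb, eb) = x, y
    # Segment 1: compare the exponents by subtraction.
    diff = ea - eb
    # Segment 2: choose the larger exponent and shift the other mantissa right.
    if diff >= 0:
        exp, mb = ea, mb.scaleb(-diff)
    else:
        exp, ma = eb, ma.scaleb(diff)
    # Segment 3: add the aligned mantissas.
    m = ma + mb
    # Segment 4: normalize so that 0.1 <= |m| < 1 and adjust the exponent.
    while abs(m) >= 1:
        m, exp = m.scaleb(-1), exp + 1
    while m != 0 and abs(m) < Decimal("0.1"):
        m, exp = m.scaleb(1), exp - 1
    return m, exp


if __name__ == "__main__":
    print(fp_add((Decimal("0.9504"), 3), (Decimal("0.8200"), 2)))
```

For the operands of the example, the aligned sum 1.0324 × 10^3 normalizes to a mantissa equal to 0.10324 with exponent 4.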

9-4 Instruction Pipeline
Instruction Cycle
1) Fetch the instruction from memory
2) Decode the instruction
3) Calculate the effective address
4) Fetch the operands from memory
5) Execute the instruction
6) Store the result in the proper place
Example : Four-segment Instruction Pipeline
• Four-segment CPU pipeline : Fig. 9-7
» 1) FI : Instruction fetch
» 2) DA : Decode instruction and calculate the effective address (EA)
» 3) FO : Operand fetch
» 4) EX : Execution
» [Fig. 9-7 flowchart: Segment 1 fetches the instruction from memory; Segment 2 decodes it and calculates the effective address, then tests "Branch?"; Segment 3 fetches the operand from memory; Segment 4 executes the instruction, then tests "Interrupt?" (interrupt handling). Other boxes: update PC, empty pipe]
• Timing of Instruction Pipeline : Fig. 9-8
» Instruction 3 is a branch instruction
Step            : 1  2  3  4  5  6  7  8  9  10 11 12 13
Instruction 1   : FI DA FO EX
            2   :    FI DA FO EX
(Branch)    3   :       FI DA FO EX
            4   :          FI -  -  FI DA FO EX
            5   :             -  -  -  FI DA FO EX
            6   :                         FI DA FO EX
            7   :                            FI DA FO EX
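The timing of the four-segment pipeline with a branch at instruction 3 can be generated programmatically. The sketch below assumes a simplified model in which a branch is resolved in its EX segment, the instruction fetched right after it is discarded, and fetching resumes on the following step; the function name `pipeline_timing` and the dictionary layout are illustrative assumptions.

```python
STAGES = ("FI", "DA", "FO", "EX")


def pipeline_timing(n_instructions, branch_at=None):
    """Return {instruction number: {step: segment}} for a four-segment pipeline.

    branch_at is the 1-based number of a branch instruction, or None.
    """
    timing = {}
    start = 1  # step at which the next instruction enters FI
    for i in range(1, n_instructions + 1):
        row = {}
        if branch_at is not None and i == branch_at + 1:
            row[start] = "FI"  # fetched in sequence, then discarded
            start = branch_at + len(STAGES)  # refetch after the branch leaves EX
        for offset, stage in enumerate(STAGES):
            row[start + offset] = stage
        timing[i] = row
        start += 1
    return timing


if __name__ == "__main__":
    for instr, row in pipeline_timing(7, branch_at=3).items():
        print(instr, row)
```

With seven instructions and a branch at instruction 3, the last instruction finishes at step 13; without a branch it would finish at step 10.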

Pipeline Conflicts : 3 major difficulties
• 1) Resource conflicts
» Two segments access memory at the same time
• 2) Data dependency
» An instruction depends on the result of a previous instruction, but this result is not yet available
• 3) Branch difficulties
» Branches and other instructions (interrupt, return, ...) that change the value of the PC
Resolving data dependency
• Hardware methods
» Hardware interlock : the hardware forces a delay until the result of the previous instruction is available
» Operand forwarding : the result of the previous instruction is passed directly to the ALU (normally it would go through a register)
• Software method
» Delayed load : Fig. 9-9, Sec. 9-5
No-operation instructions are inserted until the result of the previous instruction is available
Handling of Branch Instructions
• Prefetch target instruction
» For a conditional branch, fetch both the branch target instruction (condition satisfied) and the next sequential instruction (condition not satisfied)

• Branch Target Buffer : BTB
» 1) An associative memory is used to store, in advance, a few instructions that follow the branch target address in the BTB
» 2) For a branch instruction, the BTB is checked first; if the target is in the BTB, it is taken directly from there (the cache concept)
• Loop Buffer
» 1) A small, very high-speed register file (RAM) is used to detect loops in the program
» 2) When a loop is found, the whole loop is loaded into the loop buffer and executed from there, so external memory is not accessed
• Branch Prediction
» Uses additional hardware logic to predict the outcome of a branch
» A correct guess eliminates the branch difficulty
Delayed Branch
• Applies when a branch instruction delays the pipeline operation, as in Fig. 9-8
• Example : Fig. 9-10, p. 318, Sec. 9-5
» 1) Insert no-operation instructions
» 2) Rearrange the instructions (compiler support)
Fig. 9-10 Example of delayed branch
(a) Using no-operation instructions
Clock cycles        : 1 2 3 4 5 6 7 8 9 10
1. Load             : I A E
2. Increment        :   I A E
3. Add              :     I A E
4. Subtract         :       I A E
5. Branch to X      :         I A E
6. No-operation     :           I A E
7. No-operation     :             I A E
8. Instruction in X :               I A E
(b) Rearranging the instructions
Clock cycles        : 1 2 3 4 5 6 7 8
1. Load             : I A E
2. Increment        :   I A E
3. Branch to X      :     I A E
4. Add              :       I A E
5. Subtract         :         I A E
6. Instruction in X :           I A E
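The two delayed-branch options of Fig. 9-10 can be compared with a short sketch. It assumes a three-segment pipeline with two delay slots after the branch, and it assumes the instructions moved behind the branch do not affect the branch itself; the helper names `fill_delay_slots` and `total_cycles` are hypothetical.

```python
def fill_delay_slots(program, slots=2, rearrange=False):
    """Return the instruction order around a delayed branch.

    program is a list of instruction names whose last entry is the branch.
    With rearrange=True the last `slots` instructions before the branch are
    moved behind it; otherwise no-operation instructions fill the slots.
    """
    *body, branch = program
    if rearrange and len(body) >= slots:
        split = len(body) - slots
        return body[:split] + [branch] + body[split:]
    return body + [branch] + ["No-operation"] * slots


def total_cycles(instructions, segments=3):
    """Clock cycles needed to push the instructions through the pipeline."""
    return len(instructions) + segments - 1


if __name__ == "__main__":
    prog = ["Load", "Increment", "Add", "Subtract", "Branch to X"]
    for rearrange in (False, True):
        order = fill_delay_slots(prog, rearrange=rearrange) + ["Instruction in X"]
        print(total_cycles(order), order)
```

Rearranging keeps the same useful work but removes the two no-operation instructions, which shortens the sequence from 10 to 8 clock cycles.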

9-5 RISC Pipeline
Characteristics of a RISC CPU
• Uses an instruction pipeline
• Single-cycle instruction execution
• Compiler support
Example : Three-segment Instruction Pipeline
• The instruction cycle is divided into 3 suboperations
» 1) I : Instruction fetch
» 2) A : Instruction decode and ALU operation
» 3) E : Transfer the output of the ALU to a register, memory, or the PC (program control instructions such as JMP/CALL)
• Delayed Load : Fig. 9-9(a)
» A conflict occurs at the 3rd instruction (ADD R1 + R2): in the 4th clock cycle the 2nd instruction (LOAD R2) is still executing while the 3rd instruction already operates on R2
» Delayed-load solution : Fig. 9-9(b), insert a no-operation instruction
• Delayed Branch : already covered in Sec. 9-4
(a) Pipeline timing with data conflict
Clock cycles    : 1 2 3 4 5 6
1. Load R1      : I A E
2. Load R2      :   I A E
3. Add R1+R2    :     I A E
4. Store R3     :       I A E
(b) Pipeline timing with delayed load
Clock cycles    : 1 2 3 4 5 6 7
1. Load R1      : I A E
2. Load R2      :   I A E
3. No-operation :     I A E
4. Add R1+R2    :       I A E
5. Store R3     :         I A E
Fig. 9-9 Three-segment pipeline timing
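The delayed-load fix of Fig. 9-9 amounts to inserting a no-operation instruction after a load whose destination is used by the very next instruction. The sketch below shows that rule on a toy instruction list; the tuple format (operation, destination, sources) and the function name `insert_delayed_load_nops` are assumptions made for illustration, not a real compiler interface.

```python
NOP = ("NOP", None, ())


def insert_delayed_load_nops(program):
    """Insert a NOP after each LOAD whose destination the next instruction reads.

    program is a list of (operation, destination, sources) tuples.
    """
    out = []
    for idx, instr in enumerate(program):
        out.append(instr)
        op, dest, _ = instr
        nxt = program[idx + 1] if idx + 1 < len(program) else None
        # In the three-segment pipeline the loaded value is not yet in the
        # register when the next instruction reaches its A segment.
        if op == "LOAD" and nxt is not None and dest in nxt[2]:
            out.append(NOP)
    return out


if __name__ == "__main__":
    program = [
        ("LOAD", "R1", ("M[A]",)),
        ("LOAD", "R2", ("M[B]",)),
        ("ADD", "R3", ("R1", "R2")),
        ("STORE", "M[C]", ("R3",)),
    ]
    for instr in insert_delayed_load_nops(program):
        print(instr)
```

Only the second load is followed by an instruction that reads its destination, so a single no-operation is inserted and the sequence grows from 6 to 7 clock cycles.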

9-6 Vector Processing
Science and Engineering Applications
• Long-range weather forecasting, petroleum exploration, seismic data analysis, medical diagnosis, aerodynamics and space flight simulations, artificial intelligence and expert systems, mapping the human genome, image processing
Vector Operations
• Arithmetic operations on large arrays of numbers
• Conventional scalar processor
» Fortran language
      DO 20 I = 1, 100
   20 C(I) = A(I) + B(I)
» Machine language
      Initialize I = 0
   20 Read A(I)
      Read B(I)
      Store C(I) = A(I) + B(I)
      Increment I = I + 1
      If I ≤ 100 go to 20
      Continue
• Vector processor
» Single vector instruction
C(1:100) = A(1:100) + B(1:100)
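The contrast between the scalar loop and the single vector statement can be mirrored in Python. This is only an analogy: both functions run on an ordinary scalar interpreter, and the names `scalar_add` and `vector_add` are illustrative assumptions. The second function simply expresses the whole-array operation as one statement, the way C(1:100) = A(1:100) + B(1:100) does.

```python
def scalar_add(a, b):
    """Element-by-element loop with explicit index handling (scalar style)."""
    c = [0] * len(a)
    i = 0
    while i < len(a):
        c[i] = a[i] + b[i]  # read A(I), read B(I), store C(I)
        i += 1              # increment I, then test and branch
    return c


def vector_add(a, b):
    """Whole-array addition written as a single expression (vector style)."""
    return [x + y for x, y in zip(a, b)]


if __name__ == "__main__":
    a = list(range(1, 101))
    b = list(range(100, 0, -1))
    print(scalar_add(a, b) == vector_add(a, b))
```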

Vector Instruction Format : Fig. 9-11
• Fields : operation code | base address source 1 | base address source 2 | base address destination | vector length
• Example : ADD | A | B | C | 100
Matrix Multiplication
• 3 x 3 matrix multiplication : n^2 = 9 inner products
| a11 a12 a13 |   | b11 b12 b13 |   | c11 c12 c13 |
| a21 a22 a23 | × | b21 b22 b23 | = | c21 c22 c23 |
| a31 a32 a33 |   | b31 b32 b33 |   | c31 c32 c33 |
» c11 = a11 b11 + a12 b21 + a13 b31 : there are 9 inner products of this form
• Cumulative multiply-add operation : n^3 = 27 multiply-adds
» c = c + a × b
» c11 = c11 + a11 b11 + a12 b21 + a13 b31 : each inner product takes 3 multiply-adds (initial value of c11 = 0)
» Therefore 9 × 3 = 27 multiply-adds
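The count of n^3 multiply-add operations can be confirmed by writing the product as repeated c = c + a × b steps and counting them. A minimal sketch for square matrices stored as lists of lists; the function name `matmul_count` is an assumption made for illustration.

```python
def matmul_count(a, b):
    """Multiply two n x n matrices and count the multiply-add operations."""
    n = len(a)
    c = [[0] * n for _ in range(n)]  # every c[i][j] starts at 0
    ops = 0
    for i in range(n):
        for j in range(n):
            for k in range(n):
                # One cumulative multiply-add: c = c + a * b.
                c[i][j] = c[i][j] + a[i][k] * b[k][j]
                ops += 1
    return c, ops


if __name__ == "__main__":
    identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
    m = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    product, ops = matmul_count(identity, m)
    print(product, ops)
```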

Pipeline for calculating an inner product : Fig. 9-12
• Floating-point multiplier pipeline : 4 segments
• Floating-point adder pipeline : 4 segments
• Sources A and B feed the multiplier pipeline, whose output feeds the adder pipeline
• Example : C = A1 B1 + A2 B2 + A3 B3 + … + Ak Bk
» After the 1st clock input : A1B1 is in the first multiplier segment
» After the 4th clock input : the multiplier pipeline holds A4B4, A3B3, A2B2, A1B1
» After the 8th clock input : the multiplier pipeline holds A8B8, A7B7, A6B6, A5B5 and the adder pipeline holds A4B4, A3B3, A2B2, A1B1
» After the 9th, 10th, 11th, … clock inputs : the adder output is added to the product leaving the multiplier, which starts A1B1 + A5B5, then A2B2 + A6B6, and so on
» Four-section summation
C = A1 B1 + A5 B5 + A9 B9 + A13 B13 + …
+ A2 B2 + A6 B6 + A10 B10 + A14 B14 + …
+ A3 B3 + A7 B7 + A11 B11 + A15 B15 + …
+ A4 B4 + A8 B8 + A12 B12 + A16 B16 + …
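The four-section summation can be written out directly: products whose indices differ by four accumulate in the same partial sum, and the four partial sums are added at the end. The sketch below ignores pipeline timing and only shows the grouping; the function name `inner_product_sections` is an illustrative assumption.

```python
def inner_product_sections(a, b, sections=4):
    """Compute an inner product as `sections` interleaved partial sums.

    Returns (total, partial sums). Product number i (0-based) goes to
    partial sum i % sections, which matches a 4-segment adder pipeline
    when sections == 4.
    """
    partial = [0] * sections
    for i, (x, y) in enumerate(zip(a, b)):
        partial[i % sections] += x * y
    return sum(partial), partial


if __name__ == "__main__":
    a = [1, 2, 3, 4, 5, 6, 7, 8]
    b = [1, 1, 1, 1, 1, 1, 1, 1]
    print(inner_product_sections(a, b))
```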

Memory Interleaving : Fig. 9-13
• Pipeline and vector processors often require simultaneous access to memory from two or more sources using one memory bus system
• The 2 low-order bits of the address select one of the 4 memory modules
» [Fig. 9-13: four memory modules, each with its own address register (AR), memory array, and data register (DR), connected to a common address bus and a common data bus]
• Example : even / odd address memory access
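The module-selection rule for interleaved memory is a simple bit split of the address. A minimal sketch, assuming the number of modules is a power of two; the function name `interleave` is hypothetical.

```python
def interleave(address: int, modules: int = 4):
    """Split an address into (module number, word address inside the module).

    The low-order bits select the module, so consecutive addresses fall in
    consecutive modules and can be accessed in an overlapped fashion.
    """
    bits = modules.bit_length() - 1
    if modules < 1 or modules != 1 << bits:
        raise ValueError("number of modules must be a power of two")
    return address & (modules - 1), address >> bits


if __name__ == "__main__":
    for addr in range(8):
        print(addr, interleave(addr))
    # Two modules give the even / odd split: module 0 holds even addresses.
    print(interleave(6, modules=2), interleave(7, modules=2))
```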
Supercomputer
• Supercomputer = Vector instructions + Pipelined floating-point arithmetic
• Performance Evaluation Index
» MIPS : Million Instructions Per Second
» FLOPS : Floating-point Operations Per Second
megaflops : 10^6, gigaflops : 10^9, teraflops : 10^12
• Cray supercomputer : Cray Research
» Cray-1 : 80 megaflops, 4 million 64-bit words of memory
» Cray-2 : 12 times more powerful than the Cray-1
• VP supercomputer : Fujitsu
» VP-200 : 300 megaflops, 32 million memory, 83 vector instructions, 195 scalar instructions
» VP-2600 : 5 gigaflops

9-7 Array Processors
Performs computations on large arrays of data
• Vector processing : uses adder/multiplier pipelines
• Array processing : uses a separate array processor
Array Processing
• Attached array processor : Fig. 9-14
» Auxiliary processor attached to a general-purpose computer
» [Fig. 9-14: the general-purpose computer connects through an input-output interface to the attached array processor; the main memory and the array processor's local memory are linked by a high-speed memory-to-memory bus]
• SIMD array processor : Fig. 9-15
» Computer with multiple processing units operating in parallel
» For the vector computation C = A + B, each PEi executes ci = ai + bi at the same time
» [Fig. 9-15: a master control unit with main memory drives processing elements PE 1 … PE n, each with its own local memory M1 … Mn]

