Chap. 9 Pipeline and Vector Processing

„ 9-1 Parallel Processing
‹ Simultaneous data processing tasks for the purpose of increasing the computational speed
‹ Perform concurrent data processing to achieve faster execution time
‹ Multiple Functional Units : Fig. 9-1 Parallel Processing Example
z Separate the execution unit into eight functional units operating in parallel
[Fig. 9-1 Processor with multiple functional units: the processor registers feed parallel units (integer multiply, logic unit, shift unit, incrementer, and others) and connect to memory]
‹ Computer Architectural Classification
z Data-Instruction Stream : Flynn
z Serial versus Parallel Processing : Feng
z Parallelism and Pipelining : Händler
‹ Flynn’s Classification
z 1) SISD (Single Instruction - Single Data stream)
» for practical purposes, only one processor is useful
» Example systems : Amdahl 470V/6, IBM 360/91
z 2) SIMD (Single Instruction - Multiple Data stream)
» well suited to vector or array operations
„ one vector operation includes many operations on a data stream
» Example systems : CRAY-1, ILLIAC-IV
[Figure: SIMD organization; processing units PU 1 … PU n exchange data streams DS 1 … DS n with shared-memory modules MM1 … MMn]


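z 3) MISD (Multiple Instruction - Single Data stream)
» not used in practice, because the single data stream becomes a bottleneck
[Figure: MISD organization; control units CU1 … CUn drive processing units PU 1 … PU n, which share one data stream (DS) from shared-memory modules MM1 … MMn]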


z 4) MIMD (Multiple Instruction - Multiple Data stream)
» used in most multiprocessor systems
» Shared memory or message passing : Chap. 13
[Figure: MIMD organization; each control unit CUi drives its own processing unit PU i, which accesses shared-memory module MMi, for i = 1 … n]

‹ Main topics in this Chapter

z Pipeline processing : Sec. 9-2
» Arithmetic pipeline : Sec. 9-3
» Instruction pipeline : Sec. 9-4 (Sec. 9-5 : RISC Instruction Pipeline)
z Vector processing : uses adder/multiplier pipelines, Sec. 9-6
z Array processing : uses a separate array processor, Sec. 9-7
» (Both target computation on large vectors, matrices, and array data)
» Attached array processor : Fig. 9-14
» SIMD array processor : Fig. 9-15

„ 9-2 Pipelining
‹ Principle of Pipelining
z Decomposing a sequential process into suboperations
z Each subprocess is executed in a special dedicated segment concurrently

‹ Pipelining Example : Fig. 9-2
z Multiply and Add Operation : Ai * Bi + Ci ( for i = 1, 2, …, 7 )
z Separated into 3 suboperation segments
» 1) R1 ← Ai , R2 ← Bi : Input Ai and Bi
» 2) R3 ← R1 * R2, R4 ← Ci : Multiply and input Ci
» 3) R5 ← R3 + R4 : Add Ci to the product
z Content of registers in pipeline example : Tab. 9-1
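The register contents of this three-segment example can be traced with a short simulation. The sketch below is in Python, which the slides themselves do not use; the function name `simulate_pipeline` and the sample operand values are illustrative assumptions. On every clock pulse all registers are loaded at once from the values held during the previous pulse, so R5 is always computed from the old R3 and R4.

```python
def simulate_pipeline(a, b, c):
    """Trace R1..R5 of the multiply-add pipeline, one row per clock pulse.

    Segment 1: R1 <- Ai, R2 <- Bi
    Segment 2: R3 <- R1 * R2, R4 <- Ci
    Segment 3: R5 <- R3 + R4
    """
    n = len(a)
    r1 = r2 = r3 = r4 = r5 = None  # None marks an empty register
    rows = []
    for clock in range(1, n + 3):  # n tasks need n + 2 pulses with 3 segments
        # Compute every new value from the OLD register contents,
        # mimicking a simultaneous clocked transfer.
        new_r5 = r3 + r4 if r3 is not None else None
        new_r3 = r1 * r2 if r1 is not None else None
        new_r4 = c[clock - 2] if 2 <= clock <= n + 1 else None
        new_r1 = a[clock - 1] if clock <= n else None
        new_r2 = b[clock - 1] if clock <= n else None
        r1, r2, r3, r4, r5 = new_r1, new_r2, new_r3, new_r4, new_r5
        rows.append((clock, r1, r2, r3, r4, r5))
    return rows


if __name__ == "__main__":
    # Hypothetical operands for i = 1..7
    a = [1, 2, 3, 4, 5, 6, 7]
    b = [2, 2, 2, 2, 2, 2, 2]
    c = [10, 20, 30, 40, 50, 60, 70]
    print("clock R1 R2 R3 R4 R5")
    for row in simulate_pipeline(a, b, c):
        print(row)
```

With seven operand triples the trace has nine rows, and the first result appears in R5 on the third clock pulse.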
‹ General considerations
z 4 segment pipeline : Fig. 9-3
» S : Combinational circuit for the suboperation
» R : Register (holds intermediate results between the segments)
z Space-time diagram : Fig. 9-4
» Shows segment utilization as a function of time
z Task : T1, T2, T3, …, T6
» A task is the total operation performed going through all the segments
Fig. 9-4 Space-time diagram ( k = 4 segments, n = 6 tasks )
Clock cycles :  1   2   3   4   5   6   7   8   9
Segment 1 :     T1  T2  T3  T4  T5  T6
Segment 2 :         T1  T2  T3  T4  T5  T6
Segment 3 :             T1  T2  T3  T4  T5  T6
Segment 4 :                 T1  T2  T3  T4  T5  T6
‹ Speedup S : Nonpipeline / Pipeline
z S = ( n • tn ) / ( ( k + n - 1 ) • tp ) = ( 6 • 4 tp ) / ( ( 4 + 6 - 1 ) • tp ) = 24 tp / 9 tp = 2.67
» n : number of tasks ( 6 )
» tn : time to complete each task in the nonpipeline unit ( 4 tp in this example )
» tp : clock cycle time ( 1 clock cycle )
» k : number of segments ( 4 )
» Processing time in the pipeline = k + n - 1 = 9 clock cycles
z As n → ∞, k + n - 1 ≈ n, so S = tn / tp
z If one task takes the same time in both cases, i.e. nonpipeline ( tn ) = pipeline ( k • tp ), then
S = tn / tp = k • tp / tp = k
» In theory, therefore, the speedup is at most k (the number of segments)
‹ There are two kinds of pipeline : Arithmetic Pipeline (Sec. 9-3) and Instruction Pipeline (Sec. 9-4)
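The speedup formula can be checked numerically. The following Python sketch is an illustration only; the function name `pipeline_speedup` and its default of tn = k • tp (the equal-task-time assumption stated above) are choices made here, not part of the slides.

```python
def pipeline_speedup(n, k, tp, tn=None):
    """Speedup S = (n * tn) / ((k + n - 1) * tp) for n tasks on k segments."""
    if tn is None:
        # Assumption taken from the slides: without pipelining,
        # one task takes as long as k clock cycles.
        tn = k * tp
    nonpipeline_time = n * tn
    pipeline_time = (k + n - 1) * tp
    return nonpipeline_time / pipeline_time


if __name__ == "__main__":
    print(round(pipeline_speedup(n=6, k=4, tp=1), 2))  # the 6-task example
    for n in (10, 100, 10_000):
        # The speedup approaches k = 4 as n grows.
        print(n, round(pipeline_speedup(n, k=4, tp=1), 3))
```

For the six-task example this reproduces 24 / 9 ≈ 2.67, and larger n moves the result toward the limit k.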
„ 9-3 Arithmetic Pipeline
‹ Floating-point Adder Pipeline Example : Fig. 9-6
z Add / Subtract two normalized floating-point binary numbers (the example uses decimal numbers)
» X = A x 2^a = 0.9504 x 10^3
» Y = B x 2^b = 0.8200 x 10^2
z 4 segment suboperations
» 1) Compare exponents by subtraction
„ X = 0.9504 x 10^3
„ Y = 0.8200 x 10^2
„ Exponent difference : 3 - 2 = 1
» 2) Align mantissas by shifting the mantissa of the smaller number to the right
„ X = 0.9504 x 10^3
„ Y = 0.0820 x 10^3
» 3) Add mantissas
„ Z = 1.0324 x 10^3
» 4) Normalize result
„ Z = 0.10324 x 10^4
[Fig. 9-6 Pipeline for floating-point addition and subtraction, with registers (R) between segments: Segment 1 compare exponents a and b by subtraction (difference); Segment 2 choose exponent, align mantissas; Segment 3 add or subtract mantissas; Segment 4 adjust exponent, normalize result]
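The four suboperations can be followed step by step in code. This is a minimal Python sketch, assuming decimal mantissas as in the worked example; the function name `fp_add` and the (mantissa, exponent) tuple representation are illustrative choices. `Decimal.scaleb` shifts the decimal point, standing in for the mantissa shifter.

```python
from decimal import Decimal


def fp_add(x, y):
    """Add two normalized decimal floating-point numbers given as (mantissa, exponent)."""
    (ma, ea), (mb, eb) = x, y

    # Segment 1: compare the exponents by subtraction.
    diff = ea - eb

    # Segment 2: choose the larger exponent and align the other mantissa
    # by shifting it right by |diff| digits.
    if diff >= 0:
        exponent = ea
        mb = mb.scaleb(-diff)
    else:
        exponent = eb
        ma = ma.scaleb(diff)

    # Segment 3: add the mantissas (a negative mantissa gives subtraction).
    mantissa = ma + mb

    # Segment 4: normalize so that 0.1 <= |mantissa| < 1, adjusting the exponent.
    while abs(mantissa) >= 1:
        mantissa = mantissa.scaleb(-1)
        exponent += 1
    while mantissa != 0 and abs(mantissa) < Decimal("0.1"):
        mantissa = mantissa.scaleb(1)
        exponent -= 1
    return mantissa, exponent


if __name__ == "__main__":
    x = (Decimal("0.9504"), 3)
    y = (Decimal("0.8200"), 2)
    print(fp_add(x, y))
```

For X = 0.9504 x 10^3 and Y = 0.8200 x 10^2 the mantissa sum 1.0324 overflows, so segment 4 shifts it right once and increments the exponent, giving 0.10324 x 10^4.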


„ 9-4 Instruction Pipeline
‹ Instruction Cycle
1) Fetch the instruction from memory
2) Decode the instruction
3) Calculate the effective address
4) Fetch the operands from memory
5) Execute the instruction
6) Store the result in the proper place
‹ Example : Four-segment Instruction Pipeline
z Four-segment CPU pipeline : Fig. 9-7
» 1) FI : Instruction Fetch
» 2) DA : Decode Instruction & calculate EA
» 3) FO : Operand Fetch
» 4) EX : Execution
[Fig. 9-7 Four-segment CPU pipeline: Segment 1 fetch instruction from memory; Segment 2 decode instruction and calculate effective address, then test for a branch; Segment 3 fetch operand from memory; Segment 4 execute instruction, then test for an interrupt; a branch or interrupt leads to interrupt handling, updating the PC, and emptying the pipe]
z Timing of Instruction Pipeline : Fig. 9-8
» Instruction 3 is a branch instruction
Fig. 9-8 Timing of instruction pipeline ( instructions 1 to 4 shown )
Step :          1   2   3   4   5   6   7   8   9   10
Instruction 1 : FI  DA  FO  EX
Instruction 2 :     FI  DA  FO  EX
Instruction 3 :         FI  DA  FO  EX
Instruction 4 :             FI  -   -   FI  DA  FO  EX
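The effect of a branch on the four-segment timing can be sketched in Python. The model below is an assumption-laden illustration, not the textbook's algorithm: the function name `pipeline_timing` is hypothetical, and it assumes that the instruction following a branch is fetched once, discarded, and fetched again only after the branch has completed its EX segment.

```python
SEGMENTS = ["FI", "DA", "FO", "EX"]


def pipeline_timing(num_instructions, branch_at=None):
    """Return {instruction: {step: segment}} for a four-segment pipeline.

    branch_at is the 1-based number of a branch instruction, or None.
    """
    timing = {}
    delay = 0
    for i in range(1, num_instructions + 1):
        row = {}
        if branch_at is not None and i == branch_at + 1:
            # Fetched before the branch is resolved, then discarded.
            row[i] = "FI"
            # Assumed penalty: wait until the branch has left EX.
            delay = len(SEGMENTS) - 1
        start = i + delay
        for offset, segment in enumerate(SEGMENTS):
            row[start + offset] = segment
        timing[i] = row
    return timing


if __name__ == "__main__":
    for number, row in pipeline_timing(7, branch_at=3).items():
        print(number, sorted(row.items()))
```

Under these assumptions, seven instructions with a branch at instruction 3 finish in step 13 instead of step 10, which matches the 13 steps on the time axis of Fig. 9-8.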

‹ Pipeline Conflicts : 3 major difficulties

z 1) Resource conflicts
» memory access by two segments at the same time
z 2) Data dependency
» when an instruction depends on the result of a previous instruction, but this result is not yet available
z 3) Branch difficulties
» branch and other instructions (interrupt, return, ...) that change the value of PC
‹ Solutions to Data Dependency
z Hardware methods
» Hardware Interlock
„ hardware forcibly inserts delays until the result of the previous instruction is available
» Operand Forwarding
„ the result of the previous instruction is passed directly to the ALU (normally it would go through a register)
z Software method
» Delayed Load : Fig. 9-9, Sec. 9-5
„ no-operation instructions are inserted until the result of the previous instruction is available
‹ Handling of Branch Instructions
z Prefetch target instruction
» For a conditional branch, fetch both the branch target instruction (condition true) and the next sequential instruction (condition false)
z Branch Target Buffer : BTB
» 1) An associative memory (the BTB) stores in advance the few instructions that follow the branch target address
» 2) When a branch instruction is encountered, the BTB is checked first; on a hit, the instructions are taken directly from it (the cache concept applied to branches)
z Loop Buffer
» 1) A small, very high-speed register file (RAM) is used to detect loops in the program
» 2) When a loop is detected, the whole loop is loaded into the loop buffer and executed from there, so external memory is not accessed
z Branch Prediction
» Uses additional hardware logic to predict the outcome of the branch
» A correct guess eliminates the branch penalty
‹ Delayed Branch
z Used when a branch instruction delays pipeline operation, as in Fig. 9-8
z Example : Fig. 9-10, p. 318, Sec. 9-5
» 1) Insert no-operation instructions
» 2) Instruction rearranging : supported by the compiler
Fig. 9-10 Example of delayed branch
(a) Using no-operation instructions
Clock cycles :       1  2  3  4  5  6  7  8  9  10
1. Load              I  A  E
2. Increment            I  A  E
3. Add                     I  A  E
4. Subtract                   I  A  E
5. Branch to X                   I  A  E
6. No-operation                     I  A  E
7. No-operation                        I  A  E
8. Instruction in X                       I  A  E
(b) Rearranging the instructions
Clock cycles :       1  2  3  4  5  6  7  8
1. Load              I  A  E
2. Increment            I  A  E
3. Branch to X             I  A  E
4. Add                        I  A  E
5. Subtract                      I  A  E
6. Instruction in X                 I  A  E
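The two variants of Fig. 9-10 can be compared with a small timing-chart generator. This Python sketch assumes an ideal three-segment pipeline (I, A, E) with no stalls; the function name `timing_chart` and the list names are illustrative, and the instruction sequences are the ones listed in the figure.

```python
def timing_chart(instructions, segments=("I", "A", "E")):
    """Build a text timing chart for an ideal pipeline with no stalls.

    Instruction number i (1-based) occupies segment number j (1-based)
    during clock cycle i + j - 1.
    """
    total_cycles = len(instructions) + len(segments) - 1
    labels = [f"{i}. {name}" for i, name in enumerate(instructions, start=1)]
    width = max(len(text) for text in labels + ["Clock cycles :"]) + 2
    header = "Clock cycles :".ljust(width) + "".join(
        f"{cycle:<3}" for cycle in range(1, total_cycles + 1)
    )
    lines = [header.rstrip()]
    for i, label in enumerate(labels):
        cells = ["   "] * total_cycles
        for j, segment in enumerate(segments):
            cells[i + j] = f"{segment:<3}"
        lines.append((label.ljust(width) + "".join(cells)).rstrip())
    return lines, total_cycles


# (a) two no-operation instructions fill the slots after the branch
WITH_NOPS = ["Load", "Increment", "Add", "Subtract", "Branch to X",
             "No-operation", "No-operation", "Instruction in X"]
# (b) the compiler moves Add and Subtract into the slots after the branch
REARRANGED = ["Load", "Increment", "Branch to X", "Add", "Subtract",
              "Instruction in X"]

if __name__ == "__main__":
    for program in (WITH_NOPS, REARRANGED):
        lines, cycles = timing_chart(program)
        print("\n".join(lines))
        print("total clock cycles:", cycles)
```

Rearranging the instructions does useful work in the two slots that follow the branch, so the same program needs 8 clock cycles instead of 10.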

„ 9-5 RISC Pipeline
‹ Characteristics of a RISC CPU
z Uses an instruction pipeline
z Single-cycle instruction execution
z Compiler support
‹ Example : Three-segment Instruction Pipeline
z 3 suboperations of the instruction cycle
» 1) I : Instruction fetch
» 2) A : Instruction decode and ALU operation
» 3) E : Transfer the output of the ALU to a register, memory, or PC (program control instructions = JMP/CALL)
z Delayed Load : Fig. 9-9(a)
» A conflict occurs at the 3rd instruction (ADD R1 + R2)
„ In the 4th clock cycle, the 2nd instruction (LOAD R2) is still in its E segment while the 3rd instruction already operates on R2 in its A segment
» Delayed load solution : Fig. 9-9(b)
„ Insert a no-operation instruction
z Delayed Branch : already explained in Sec. 9-4
Fig. 9-9 Three-segment pipeline timing
(a) Pipeline timing with data conflict ( conflict in clock cycle 4 )
Clock cycles :    1  2  3  4  5  6
1. Load R1        I  A  E
2. Load R2           I  A  E
3. Add R1+R2            I  A  E
4. Store R3                I  A  E
(b) Pipeline timing with delayed load
Clock cycles :    1  2  3  4  5  6  7
1. Load R1        I  A  E
2. Load R2           I  A  E
3. No-operation         I  A  E
4. Add R1+R2               I  A  E
5. Store R3                   I  A  E
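The delayed-load fix of Fig. 9-9(b) amounts to a simple pass over the instruction list. The Python sketch below is a simplified illustration of what a compiler might do; the tuple format (opcode, destination, sources), the function name `insert_delayed_load_nops`, and the memory operand names are assumptions made here. Only the load-use conflict shown in the figure is modelled.

```python
def insert_delayed_load_nops(program):
    """Insert a NOP after every LOAD whose destination is used by the next instruction.

    program is a list of (opcode, destination, sources) tuples.
    """
    result = []
    for instruction in program:
        if result:
            previous_opcode, previous_destination, _ = result[-1]
            _, _, sources = instruction
            # In the three-segment pipeline a loaded value arrives in the
            # E segment, one cycle too late for the next instruction's A segment.
            if previous_opcode == "LOAD" and previous_destination in sources:
                result.append(("NOP", None, ()))
        result.append(instruction)
    return result


if __name__ == "__main__":
    program = [
        ("LOAD", "R1", ("M[addr1]",)),   # hypothetical memory operands
        ("LOAD", "R2", ("M[addr2]",)),
        ("ADD", "R3", ("R1", "R2")),
        ("STORE", "M[addr3]", ("R3",)),
    ]
    for instruction in insert_delayed_load_nops(program):
        print(instruction)
```

For the four-instruction example this yields the five-instruction sequence of Fig. 9-9(b), with the no-operation placed between LOAD R2 and ADD.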

„ 9-6 Vector Processing

‹ Science and Engineering Applications
z Long-range weather forecasting, Petroleum explorations, Seismic data analysis,
Medical diagnosis, Aerodynamics and space flight simulations, Artificial
intelligence and expert systems, Mapping the human genome, Image processing
‹ Vector Operations
z Arithmetic operations on large arrays of numbers
z Conventional scalar processor
» Fortran language
      DO 20 I = 1, 100
   20 C(I) = A(I) + B(I)
» Machine language
      Initialize I = 1
   20 Read A(I)
      Read B(I)
      Store C(I) = A(I) + B(I)
      Increment I = I + 1
      If I ≤ 100 go to 20

z Vector processor
» Single vector instruction
C(1:100) = A(1:100) + B(1:100)

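‹ Vector Instruction Format : Fig. 9-11
z Fields : Operation code | Base address source 1 | Base address source 2 | Base address destination | Vector length
z Example : ADD | A | B | C | 100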
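The difference between the scalar loop and the single vector instruction can be illustrated in Python. This is only a software model: `scalar_add`, `vector_add`, and the flat-list memory layout are assumptions made for the sketch, and a real vector processor would stream the operands through an adder pipeline rather than iterate in software.

```python
def scalar_add(a, b):
    """Element-by-element loop, as a conventional scalar processor runs it."""
    c = [0] * len(a)
    i = 0
    while i < len(a):          # test and branch once per element
        c[i] = a[i] + b[i]     # read A(I), read B(I), store C(I)
        i += 1                 # increment I
    return c


def vector_add(memory, source1, source2, destination, length):
    """Model of one vector instruction: ADD source1 source2 destination length.

    memory is a flat list; the three addresses are base addresses into it.
    """
    memory[destination:destination + length] = [
        memory[source1 + i] + memory[source2 + i] for i in range(length)
    ]


if __name__ == "__main__":
    # Hypothetical layout: A at 0..99, B at 100..199, C at 200..299
    memory = list(range(100)) + list(range(100, 200)) + [0] * 100
    vector_add(memory, 0, 100, 200, 100)   # ADD A B C 100
    print(memory[200:205])
```

The vector form names the operands once, by base address and length, so the fetch and decode of loop-control instructions is not repeated for each element.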
‹ Matrix Multiplication
z 3 x 3 matrix multiplication : n^2 = 9 inner products
[ a11 a12 a13 ]   [ b11 b12 b13 ]   [ c11 c12 c13 ]
[ a21 a22 a23 ] x [ b21 b22 b23 ] = [ c21 c22 c23 ]
[ a31 a32 a33 ]   [ b31 b32 b33 ]   [ c31 c32 c33 ]
» c11 = a11 b11 + a12 b21 + a13 b31 : there are 9 such inner products
z Cumulative multiply-add operation : n^3 = 27 multiply-adds
» c = c + a x b
» c11 is accumulated in 3 multiply-add steps : c11 = c11 + a11 b11, then c11 = c11 + a12 b21, then c11 = c11 + a13 b31 (initial value of c11 = 0)
» Therefore 9 x 3 = 27 multiply-adds
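The operation counts can be confirmed by counting inside a straightforward triple loop. This Python sketch is illustrative; the function name `matmul_count` and the sample matrices are assumptions.

```python
def matmul_count(a, b):
    """Multiply two n x n matrices and count inner products and multiply-adds."""
    n = len(a)
    c = [[0] * n for _ in range(n)]   # every c[i][j] starts at 0
    inner_products = 0
    multiply_adds = 0
    for i in range(n):
        for j in range(n):
            for k in range(n):
                # cumulative multiply-add: c = c + a x b
                c[i][j] = c[i][j] + a[i][k] * b[k][j]
                multiply_adds += 1
            inner_products += 1       # one inner product per element of C
    return c, inner_products, multiply_adds


if __name__ == "__main__":
    a = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
    c, inner_products, multiply_adds = matmul_count(a, identity)
    print(c, inner_products, multiply_adds)
```

For n = 3 the counters end at 9 inner products and 27 multiply-adds, matching n^2 and n^3.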

‹ Pipeline for calculating an inner product : Fig. 9-12
z Floating-point multiplier pipeline : 4 segments
z Floating-point adder pipeline : 4 segments
z Example : C = A1 B1 + A2 B2 + A3 B3 + … + Ak Bk
» after 1st clock input : A1B1 is in the first segment of the multiplier pipeline
» after 4th clock input : the multiplier pipeline holds A4B4 A3B3 A2B2 A1B1; the adder pipeline is empty
» after 8th clock input : the multiplier pipeline holds A8B8 A7B7 A6B6 A5B5; the adder pipeline holds A4B4 A3B3 A2B2 A1B1
» after 9th, 10th, 11th, … clock inputs : each product leaving the multiplier is added to the value leaving the adder, starting with A1B1 + A5B5, then A2B2 + A6B6
» Four section summation
C = A1B1 + A5B5 + A9B9 + A13B13 + …
  + A2B2 + A6B6 + A10B10 + A14B14 + …
  + A3B3 + A7B7 + A11B11 + A15B15 + …
  + A4B4 + A8B8 + A12B12 + A16B16 + …
[Fig. 9-12 Pipeline for calculating an inner product: Source A and Source B feed the multiplier pipeline, whose output feeds the adder pipeline]
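Because the adder pipeline has four segments, each product is added to the partial sum that entered the adder four clock inputs earlier, which produces four interleaved partial sums. The Python sketch below models only that arithmetic grouping, not the clock-by-clock hardware; the function name `four_section_inner_product` is an assumption.

```python
def four_section_inner_product(a, b, sections=4):
    """Accumulate products into `sections` interleaved partial sums, then add them.

    Product number i (0-based) goes to partial sum i % sections, mirroring
    the way an adder pipeline with `sections` segments combines products.
    """
    partial_sums = [0] * sections
    for i, (x, y) in enumerate(zip(a, b)):
        partial_sums[i % sections] += x * y
    return partial_sums, sum(partial_sums)


if __name__ == "__main__":
    a = [1, 2, 3, 4, 5, 6, 7, 8]
    b = [1, 1, 1, 1, 1, 1, 1, 1]
    sections, total = four_section_inner_product(a, b)
    print(sections, total)
```

The final step, adding the four section sums together, is what remains after the last product has left the multiplier pipeline.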

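‹ Memory Interleaving : Fig. 9-13
z Pipeline and vector processors often require simultaneous access to memory from two or more sources using one memory bus system
z The low-order 2 bits of the address select 1 of the 4 memory modules
z Example : even / odd address memory access
[Fig. 9-13 Multiple module memory organization: four memory modules, each with its own address register (AR) and data register (DR), connected to a common address bus and data bus]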
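The module selection can be written out as bit operations. This Python sketch is illustrative: the function name `interleaved_location` and the `module_bits` parameter are assumptions, with `module_bits=2` giving the four-module case and `module_bits=1` the even / odd case.

```python
def interleaved_location(address, module_bits=2):
    """Split an address into (module number, word offset inside the module).

    The low-order `module_bits` bits select the module, so consecutive
    addresses fall in different modules.
    """
    module = address & ((1 << module_bits) - 1)
    offset = address >> module_bits
    return module, offset


if __name__ == "__main__":
    for address in range(8):
        print(address, interleaved_location(address))          # four modules
    print(interleaved_location(6, module_bits=1))               # even address
    print(interleaved_location(7, module_bits=1))               # odd address
```

Since consecutive addresses map to different modules, several accesses of a sequential vector can be in progress at the same time.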

‹ Supercomputer
z Supercomputer = Vector Instruction + Pipelined floating-point arithmetic
z Performance Evaluation Index
» MIPS : Million Instructions Per Second
» FLOPS : Floating-point Operations Per Second
„ megaflops : 10^6, gigaflops : 10^9, teraflops : 10^12
z Cray supercomputer : Cray Research
» Cray-1 : 80 megaflops, 4 million 64-bit words of memory
» Cray-2 : 12 times more powerful than the Cray-1
z VP supercomputer : Fujitsu
» VP-200 : 300 megaflops, 32 million memory, 83 vector instructions, 195 scalar instructions
» VP-2600 : 5 gigaflops

„ 9-7 Array Processors

‹ Performs computations on large arrays of data
z Vector processing : uses the adder/multiplier pipeline
z Array processing : uses a separate array processor
‹ Array Processing
z Attached array processor : Fig. 9-14
» Auxiliary processor attached to a general purpose computer
z SIMD array processor : Fig. 9-15
» Computer with multiple processing units operating in parallel
„ For the vector computation C = A + B, each ci = ai + bi is executed simultaneously in its own PEi
[Fig. 9-14 Attached array processor with host computer: general-purpose computer with main memory, input-output interface, attached array processor with local memory, and a high-speed memory-to-memory bus between main memory and local memory]
[Fig. 9-15 SIMD array processor organization: master control unit and main memory, with processing elements PE 1 … PE n, each paired with a local memory M1 … Mn]

