
Chap. 9 Pipeline and Vector Processing


„ 9-1 Parallel Processing
‹ Simultaneous data-processing tasks for the purpose of increasing the computational speed
‹ Perform concurrent data processing to achieve faster execution time
‹ Multiple Functional Units : Fig. 9-1 Parallel Processing Example
z Separate the execution unit into eight functional units operating in parallel
» Adder-subtractor, Integer multiply, Logic unit, Shift unit, Incrementer, Floating-point add-subtract, Floating-point multiply, Floating-point divide
» [Fig. 9-1: the eight functional units sit between the processor registers and the path to memory]
‹ Computer Architectural Classification
z Data-Instruction Stream : Flynn
z Serial versus Parallel Processing : Feng
z Parallelism and Pipelining : Händler
‹ Flynn's Classification
z 1) SISD (Single Instruction - Single Data stream)
» In practical terms : a computer with only one processor
» Example systems : Amdahl 470V/6, IBM 360/91
» [Figure: SISD organization with blocks CU, PU, MM and streams IS, DS]

© Korea Univ. of Tech. & Edu.


Computer System Architecture Chap. 9 Pipeline and Vector Processing Dept. of Info. & Comm.

z 2) SIMD (Single Instruction - Multiple Data stream)
» Well suited to vector or array operations
„ one vector operation includes many operations on a data stream
» Example systems : CRAY-1, ILLIAC-IV
» [Figure: SIMD organization : one CU issues IS to PU 1 ... PU n; each PU i has its own data stream DS i to module MM i of the shared memory]

z 3) MISD (Multiple Instruction - Single Data stream)
» Not used in practice, because the single data stream becomes a bottleneck
» [Figure: MISD organization : CU 1 ... CU n each issue IS i to PU i; a single DS links PU 1 ... PU n and the shared memory modules MM 1 ... MM n]


z 4) MIMD (Multiple Instruction - Multiple Data stream)
» Used in most multiprocessor systems
» Shared memory or message passing : Chap. 13
» [Figure: MIMD organization : CU 1 ... CU n each issue IS i to PU i; each PU i has a data stream DS to module MM i of the shared memory]

‹ Main topics in this Chapter
z Pipeline processing : Sec. 9-2
» Arithmetic pipeline : Sec. 9-3
» Instruction pipeline : Sec. 9-4 (Sec. 9-5 : RISC Instruction Pipeline)
z Vector processing : uses adder/multiplier pipelines, Sec. 9-6
z Array processing : uses a separate array processor, Sec. 9-7
» Attached array processor : Fig. 9-14
» SIMD array processor : Fig. 9-15
z Vector and array processing both target computations on large vectors, matrices, and array data


„ 9-2 Pipelining
‹ Principle of pipelining
z Decomposing a sequential process into suboperations
z Each suboperation is executed concurrently in its own dedicated segment

‹ Pipelining example : Fig. 9-2
z Multiply and Add Operation : Ai * Bi + Ci ( for i = 1, 2, …, 7 )
z Separated into 3 suboperation segments
» 1) R1 ← Ai, R2 ← Bi : Input Ai and Bi
» 2) R3 ← R1 * R2, R4 ← Ci : Multiply and input Ci
» 3) R5 ← R3 + R4 : Add Ci to the product
z Content of registers in pipeline example : Tab. 9-1
‹ General considerations
z 4-segment pipeline : Fig. 9-3
» S : Combinational circuit for the suboperation
» R : Register (holds intermediate results between the segments)
z Space-time diagram : Fig. 9-4
» Shows segment utilization as a function of time (segment versus clock cycle)
z Task : T1, T2, T3, …, T6
» The total operation performed by going through all the segments
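
The register transfers of the multiply-and-add example can be traced with a short simulation. The following is only a sketch under stated assumptions: the function name `simulate_multiply_add` and the use of `None` for an empty register are illustrative choices, not part of the slides. It clocks R1..R5 together, the way Tab. 9-1 lists them per clock pulse.

```python
def simulate_multiply_add(a, b, c):
    """Trace R1..R5 of the three-segment pipeline computing Ai * Bi + Ci.

    Returns one tuple (clock_pulse, R1, R2, R3, R4, R5) per clock pulse.
    None marks a register that holds no valid data yet (or any more).
    """
    n = len(a)
    r1 = r2 = r3 = r4 = r5 = None
    history = []
    # n inputs need n + 2 clock pulses to pass through three segments.
    for clock in range(n + 2):
        # All registers load on the same clock edge, so every new value is
        # computed from the old contents before anything is overwritten.
        new_r5 = r3 + r4 if r3 is not None else None        # segment 3: add
        new_r3 = r1 * r2 if r1 is not None else None        # segment 2: multiply
        new_r4 = c[clock - 1] if 1 <= clock <= n else None  # segment 2: input Ci
        new_r1 = a[clock] if clock < n else None            # segment 1: input Ai
        new_r2 = b[clock] if clock < n else None            # segment 1: input Bi
        r1, r2, r3, r4, r5 = new_r1, new_r2, new_r3, new_r4, new_r5
        history.append((clock + 1, r1, r2, r3, r4, r5))
    return history


if __name__ == "__main__":
    for row in simulate_multiply_add([1, 2, 3], [4, 5, 6], [7, 8, 9]):
        print(row)
```

With seven input triples, as in the example, the trace has 7 + 2 = 9 rows and the first result appears in R5 on the third clock pulse.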

‹ Speedup S : nonpipeline time / pipeline time
z S = ( n • tn ) / ( ( k + n - 1 ) • tp ) = ( 6 • 4 tp ) / ( ( 4 + 6 - 1 ) • tp ) = 24 tp / 9 tp = 2.67
» n : number of tasks ( 6 )
» tn : time to complete each task in the nonpipelined unit
» tp : clock cycle time ( 1 clock cycle )
» k : number of segments ( 4 )
» Processing time in the pipeline = k + n - 1 = 9 clock cycles
z As n → ∞, k + n - 1 ≈ n, so S = tn / tp
z If one task takes the same time in both units, i.e. nonpipeline ( tn ) = pipeline ( k • tp ), then
S = tn / tp = k • tp / tp = k
Therefore the theoretical maximum speedup is k, the number of segments.
z Space-time diagram for k = 4, n = 6 : Fig. 9-4
Clock cycle : 1   2   3   4   5   6   7   8   9
Segment 1   : T1  T2  T3  T4  T5  T6
Segment 2   :     T1  T2  T3  T4  T5  T6
Segment 3   :         T1  T2  T3  T4  T5  T6
Segment 4   :             T1  T2  T3  T4  T5  T6
‹ Pipelines are of two kinds : Arithmetic Pipeline (Sec. 9-3) and Instruction Pipeline (Sec. 9-4)
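
The speedup formula is easy to check numerically. This is a minimal sketch; the function name `pipeline_speedup` and its argument order are assumptions made for illustration. It evaluates S = (n • tn) / ((k + n - 1) • tp) and shows how S approaches k as n grows when tn = k • tp.

```python
def pipeline_speedup(k, n, tn, tp):
    """Speedup of a k-segment pipeline over a nonpipelined unit for n tasks.

    tn: time per task without pipelining; tp: pipeline clock cycle time.
    """
    nonpipeline_time = n * tn
    pipeline_time = (k + n - 1) * tp
    return nonpipeline_time / pipeline_time


if __name__ == "__main__":
    # The slide's example: k = 4 segments, n = 6 tasks, tn = 4 * tp.
    print(round(pipeline_speedup(4, 6, 4, 1), 2))
    # With tn = k * tp the speedup tends toward k as n grows.
    for n in (10, 100, 10_000):
        print(n, round(pipeline_speedup(4, n, 4, 1), 3))
```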
„ 9-3 Arithmetic Pipeline
‹ Floating-point Adder Pipeline Example : Fig. 9-6
z Add / Subtract two normalized floating-point binary numbers
» X = A x 2^a (decimal example used below : X = 0.9504 x 10^3)
» Y = B x 2^b (decimal example used below : Y = 0.8200 x 10^2)

z Suboperations of the 4 segments
» 1) Compare exponents by subtraction : 3 - 2 = 1
„ X = 0.9504 x 10^3
„ Y = 0.8200 x 10^2
» 2) Align mantissas by shifting the mantissa with the smaller exponent to the right
„ X = 0.9504 x 10^3
„ Y = 0.08200 x 10^3
» 3) Add mantissas
„ Z = 1.0324 x 10^3
» 4) Normalize result
„ Z = 0.10324 x 10^4
z Fig. 9-6 : inputs are the exponents a, b and the mantissas A, B; registers R separate the segments
» Segment 1 : Compare exponents by subtraction (produces the difference)
» Segment 2 : Choose exponent, Align mantissas
» Segment 3 : Add or subtract mantissas
» Segment 4 : Adjust exponent, Normalize result
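
The four suboperations can be followed step by step on the decimal example. The sketch below is illustrative only: it works on decimal (mantissa, exponent) pairs with Python's `decimal` module rather than on binary hardware registers, and the names `fp_add` and `shift` are assumptions, not taken from the slides.

```python
from decimal import Decimal


def shift(mantissa, places):
    """Multiply a mantissa by 10**places (negative places shifts right)."""
    return mantissa * Decimal(10) ** places


def fp_add(x, y):
    """Add two numbers given as (mantissa, exponent), meaning mantissa * 10**exponent."""
    (ma, ea), (mb, eb) = x, y
    # Segment 1: compare the exponents by subtraction.
    diff = ea - eb
    # Segment 2: choose the larger exponent and shift the other mantissa right.
    if diff >= 0:
        exp, mb = ea, shift(mb, -diff)
    else:
        exp, ma = eb, shift(ma, diff)
    # Segment 3: add the aligned mantissas.
    m = ma + mb
    # Segment 4: normalize so that 0.1 <= |m| < 1, adjusting the exponent.
    while abs(m) >= 1:
        m, exp = shift(m, -1), exp + 1
    while m != 0 and abs(m) < Decimal("0.1"):
        m, exp = shift(m, 1), exp - 1
    return m, exp


if __name__ == "__main__":
    x = (Decimal("0.9504"), 3)
    y = (Decimal("0.8200"), 2)
    print(fp_add(x, y))
```

For the slide's operands the sum 1.0324 x 10^3 overflows the mantissa range, so the last segment shifts once to the right and increments the exponent.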


„ 9-4 Instruction Pipeline
‹ Instruction Cycle
1) Fetch the instruction from memory
2) Decode the instruction
3) Calculate the effective address
4) Fetch the operands from memory
5) Execute the instruction
6) Store the result in the proper place
‹ Example : Four-segment Instruction Pipeline
z Four-segment CPU pipeline : Fig. 9-7
» 1) FI : Instruction Fetch
» 2) DA : Decode Instruction & calculate EA
» 3) FO : Operand Fetch
» 4) EX : Execution
» [Fig. 9-7 flowchart : Segment 1 : Fetch instruction from memory; Segment 2 : Decode instruction and calculate effective address, then "Branch ?"; Segment 3 : Fetch operand from memory; Segment 4 : Execute instruction, then "Interrupt ?"; remaining boxes : Interrupt handling, Update PC, Empty pipe]
z Timing of Instruction Pipeline : Fig. 9-8
» Instruction 3 is a branch instruction
» Instruction 4 is fetched at step 4, discarded, and fetched again at step 7 after the branch has executed
Step       : 1   2   3   4   5   6   7   8   9   10  11  12  13
1          : FI  DA  FO  EX
2          :     FI  DA  FO  EX
3 (Branch) :         FI  DA  FO  EX
4          :             FI  -   -   FI  DA  FO  EX
5          :                 -   -   -   FI  DA  FO  EX
6          :                                 FI  DA  FO  EX
7          :                                     FI  DA  FO  EX
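
The timing of Fig. 9-8 can be regenerated from one simple rule. The sketch below assumes a simplified model that is not spelled out in the slides: the instruction after a branch is fetched once and discarded, and fetching resumes in the step after the branch leaves EX. The names `branch_timing` and `render` are hypothetical.

```python
SEGMENTS = ["FI", "DA", "FO", "EX"]


def branch_timing(num_instructions, branch_at):
    """Return {instruction: {step: segment}} for a four-segment pipeline.

    branch_at is the 1-based number of the single branch instruction.
    """
    k = len(SEGMENTS)
    table = {}
    for i in range(1, num_instructions + 1):
        row = {}
        if i <= branch_at:
            start = i                 # one new instruction enters per step
        else:
            start = i + (k - 1)       # delayed until the branch has left EX
            if i == branch_at + 1:
                row[i] = "FI"         # fetched once already, then discarded
        for offset, segment in enumerate(SEGMENTS):
            row[start + offset] = segment
        table[i] = row
    return table


def render(table):
    """Format the timing table as fixed-width text, one row per instruction."""
    last = max(max(row) for row in table.values())
    lines = ["Step " + " ".join(f"{s:>2}" for s in range(1, last + 1))]
    for i, row in table.items():
        cells = " ".join(f"{row.get(s, ''):>2}" for s in range(1, last + 1))
        lines.append(f"{i:>4} " + cells)
    return "\n".join(lines)


if __name__ == "__main__":
    print(render(branch_timing(7, 3)))
```

Calling `branch_timing(7, 3)` gives seven instructions finishing at step 13, three steps later than the 10 steps an undisturbed four-segment pipeline would need.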


‹ Pipeline Conflicts : 3 major difficulties
z 1) Resource conflicts
» memory access by two segments at the same time
z 2) Data dependency
» when an instruction depends on the result of a previous instruction, but this result is not yet available
z 3) Branch difficulties
» branch and other instructions (interrupt, return, ...) that change the value of PC
‹ Resolving Data Dependency
z Hardware methods
» Hardware Interlock
„ the hardware forces a delay until the result of the previous instruction is available
» Operand Forwarding
„ the result of the previous instruction is passed directly to the ALU (normally it would go through a register)
z Software method
» Delayed Load : Fig. 9-9, Sec. 9-5
„ no-operation instructions are inserted until the result of the previous instruction is available
‹ Handling of Branch Instructions
z Prefetch target instruction
» For a conditional branch, fetch both the branch target instruction (condition true) and the next sequential instruction (condition false)


z Branch Target Buffer : BTB
» 1) An associative memory stores, in advance, the few instructions that follow the branch target address
» 2) When a branch instruction is encountered, the BTB is checked first; on a hit, the instructions are taken directly from the BTB (the cache concept)
z Loop Buffer
» 1) A small, very high-speed register file (RAM) is used to detect loops in the program
» 2) When a loop is detected, the whole loop is loaded into the Loop Buffer and executed from there, without accessing external memory
z Branch Prediction
» Uses additional hardware logic to predict the outcome of a branch
» A correct guess eliminates the branch difficulty
‹ Delayed Branch
z Addresses the case where a branch instruction delays pipeline operation, as in Fig. 9-8
z Example : Fig. 9-10, p. 318, Sec. 9-5
» 1) Insert no-operation instructions
» 2) Instruction rearranging : done by the compiler

(a) Using no-operation instructions
Clock cycles :        1  2  3  4  5  6  7  8  9  10
1. Load               I  A  E
2. Increment             I  A  E
3. Add                      I  A  E
4. Subtract                    I  A  E
5. Branch to X                    I  A  E
6. No-operation                      I  A  E
7. No-operation                         I  A  E
8. Instruction in X                        I  A  E

(b) Rearranging the instructions
Clock cycles :        1  2  3  4  5  6  7  8
1. Load               I  A  E
2. Increment             I  A  E
3. Branch to X              I  A  E
4. Add                         I  A  E
5. Subtract                       I  A  E
6. Instruction in X                  I  A  E

Fig. 9-10 Example of delayed branch
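
Both variants of Fig. 9-10 amount to simple list transformations on the program. The sketch below is illustrative: `pad_with_nops` and `rearrange` are hypothetical helper names, the two delay slots are taken from the two no-operation instructions in Fig. 9-10(a), and `rearrange` assumes the caller has already checked that the moved instructions do not affect the branch, a check that a real compiler must perform.

```python
DELAY_SLOTS = 2  # assumption: two slots, as with the two NOPs in Fig. 9-10(a)


def pad_with_nops(program, branch_index, slots=DELAY_SLOTS):
    """Variant (a): place no-operation instructions right after the branch."""
    before = program[:branch_index + 1]
    after = program[branch_index + 1:]
    return before + ["No-operation"] * slots + after


def rearrange(program, branch_index, slots=DELAY_SLOTS):
    """Variant (b): move the instructions just before the branch behind it.

    Assumes the moved instructions are independent of the branch.
    """
    if branch_index < slots:
        raise ValueError("not enough instructions before the branch to move")
    head = program[:branch_index - slots]
    moved = program[branch_index - slots:branch_index]
    tail = program[branch_index + 1:]
    return head + [program[branch_index]] + moved + tail


if __name__ == "__main__":
    program = ["Load", "Increment", "Add", "Subtract", "Branch to X"]
    print(pad_with_nops(program, 4))
    print(rearrange(program, 4))
```

The rearranged program has the same length as the original, which is why variant (b) finishes in 8 clock cycles instead of 10.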



„ 9-5 RISC Pipeline
‹ Characteristics of a RISC CPU
z Uses an instruction pipeline
z Single-cycle instruction execution
z Compiler support
‹ Example : Three-segment Instruction Pipeline
z 3 suboperations of the instruction cycle
» 1) I : Instruction fetch
» 2) A : Instruction decode and ALU operation
» 3) E : Transfer the output of the ALU to a register, memory, or PC (program control instructions = JMP/CALL)
z Delayed Load : Fig. 9-9(a)
» A conflict occurs at the 3rd instruction (ADD R1 + R2)
„ In clock cycle 4, the 2nd instruction (LOAD R2) is still in its E segment while the 3rd instruction already operates on R2 in its A segment
» Delayed Load solution : Fig. 9-9(b)
„ Insert a no-operation instruction
z Delayed Branch : already explained in Sec. 9-4

(a) Pipeline timing with data conflict (conflict in clock cycle 4)
Clock cycles :    1  2  3  4  5  6
1. Load R1        I  A  E
2. Load R2           I  A  E
3. Add R1+R2            I  A  E
4. Store R3                I  A  E

(b) Pipeline timing with delayed load
Clock cycles :    1  2  3  4  5  6  7
1. Load R1        I  A  E
2. Load R2           I  A  E
3. No-operation         I  A  E
4. Add R1+R2               I  A  E
5. Store R3                   I  A  E

Fig. 9-9 Three-segment pipeline timing
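
The delayed-load fix of Fig. 9-9(b) can be expressed as a small pass over the instruction list. This is only a sketch of the idea under simplifying assumptions: the `Instr` record, the opcode spellings, and the rule "insert one no-operation when an instruction reads the register loaded by the instruction directly before it" are illustrative and model only the three-segment pipeline of this example.

```python
from typing import NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    op: str                  # opcode, e.g. "LOAD", "ADD", "STORE", "NOP"
    dest: Optional[str]      # register written, if any
    srcs: Tuple[str, ...]    # registers read


NOP = Instr("NOP", None, ())


def insert_load_delays(program):
    """Insert a NOP after a LOAD whose register is read by the next instruction."""
    out = []
    for instr in program:
        prev = out[-1] if out else None
        if prev is not None and prev.op == "LOAD" and prev.dest in instr.srcs:
            out.append(NOP)  # gives the load one extra cycle to finish its E segment
        out.append(instr)
    return out


if __name__ == "__main__":
    program = [
        Instr("LOAD", "R1", ()),
        Instr("LOAD", "R2", ()),
        Instr("ADD", "R3", ("R1", "R2")),
        Instr("STORE", None, ("R3",)),
    ]
    print([i.op for i in insert_load_delays(program)])
```

Only the LOAD R2 / ADD pair triggers a no-operation here, since LOAD R1 is already separated from the ADD by one instruction.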


„ 9-6 Vector Processing


‹ Science and Engineering Applications
z Long-range weather forecasting, Petroleum explorations, Seismic data analysis,
Medical diagnosis, Aerodynamics and space flight simulations, Artificial
intelligence and expert systems, Mapping the human genome, Image processing
‹ Vector Operations
z Arithmetic operations on large arrays of numbers
z Conventional scalar processor
» Fortran language
   DO 20 I = 1, 100
20 C(I) = A(I) + B(I)
» Machine language
   Initialize I = 0
20 Read A(I)
   Read B(I)
   Store C(I) = A(I) + B(I)
   Increment I = I + 1
   If I ≤ 100 go to 20
   Continue

z Vector processor
» Single vector instruction
C(1:100) = A(1:100) + B(1:100)


‹ Vector Instruction Format : Fig. 9-11
z Fields : Operation code | Base address source 1 | Base address source 2 | Base address destination | Vector length
z Example : ADD | A | B | C | 100
‹ Matrix Multiplication
z 3 x 3 matrix multiplication : n^2 = 9 inner products
| a11 a12 a13 |   | b11 b12 b13 |   | c11 c12 c13 |
| a21 a22 a23 | x | b21 b22 b23 | = | c21 c22 c23 |
| a31 a32 a33 |   | b31 b32 b33 |   | c31 c32 c33 |
» c11 = a11 b11 + a12 b21 + a13 b31 : there are 9 inner products of this form

z Cumulative multiply-add operation : n^3 = 27 multiply-adds
» c = c + a x b
» c11 = c11 + a11 b11 + a12 b21 + a13 b31 : each inner product takes 3 multiply-adds of this form, starting from the initial value c11 = 0
» Therefore 9 x 3 = 27 multiply-adds
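
The count of n^3 multiply-adds can be confirmed by instrumenting the usual triple loop. This is a minimal sketch; the function name `matmul_count` and the counter are illustrative assumptions, and the loop is ordinary scalar code rather than a model of a vector pipeline.

```python
def matmul_count(a, b):
    """Multiply two n x n matrices and count the multiply-add operations used."""
    n = len(a)
    c = [[0] * n for _ in range(n)]   # every c[i][j] starts at 0
    multiply_adds = 0
    for i in range(n):
        for j in range(n):
            # One inner product: n cumulative multiply-adds, c = c + a x b.
            for k in range(n):
                c[i][j] = c[i][j] + a[i][k] * b[k][j]
                multiply_adds += 1
    return c, multiply_adds


if __name__ == "__main__":
    a = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
    product, count = matmul_count(a, identity)
    print(product, count)
```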


‹ Pipeline for calculating an inner product : Fig. 9-12
z Floating-point multiplier pipeline : 4 segments
z Floating-point adder pipeline : 4 segments
z Example : C = A1 B1 + A2 B2 + A3 B3 + … + Ak Bk
» [Fig. 9-12 : Source A and Source B feed the multiplier pipeline, whose output feeds the adder pipeline; the adder output is fed back to the adder input]
» after 1st clock input : A1B1 enters the multiplier pipeline
» after 4th clock input : the multiplier pipeline holds A4B4, A3B3, A2B2, A1B1
» after 8th clock input : the multiplier pipeline holds A8B8, A7B7, A6B6, A5B5 and the adder pipeline holds A4B4, A3B3, A2B2, A1B1
» after 9th, 10th, 11th, ... clock input : the adder starts A1B1 + A5B5, then A2B2 + A6B6, and so on
» Four-section summation
C = A1B1 + A5B5 + A9B9 + A13B13 + …
  + A2B2 + A6B6 + A10B10 + A14B14 + …
  + A3B3 + A7B7 + A11B11 + A15B15 + …
  + A4B4 + A8B8 + A12B12 + A16B16 + …
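
The four-section summation can be mimicked in software by keeping four running partial sums, one per adder segment. The sketch below is only an arithmetic illustration, not a cycle-accurate model of Fig. 9-12; the function name `inner_product_four_sections` is an assumption.

```python
def inner_product_four_sections(a, b, sections=4):
    """Accumulate the products Ai * Bi into `sections` interleaved partial sums.

    Product i (0-based) goes to partial sum i % sections, which matches the
    grouping A1B1 + A5B5 + ..., A2B2 + A6B6 + ..., and so on.
    """
    partial = [0] * sections
    for i, (x, y) in enumerate(zip(a, b)):
        partial[i % sections] += x * y
    # The partial sums are added together at the end to form C.
    return sum(partial), partial


if __name__ == "__main__":
    a = [1, 2, 3, 4, 5, 6, 7, 8]
    b = [1, 1, 1, 1, 1, 1, 1, 1]
    print(inner_product_four_sections(a, b))
```

The four sections exist because the adder pipeline has four segments: a product arriving now can only be added to a sum that entered the adder four clock cycles earlier.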


‹ Memory Interleaving : Fig. 9-13
z Pipeline and vector processors often require simultaneous access to memory from two or more sources using one memory bus system
z The lower 2 bits of the address (AR) select one of the 4 memory modules
z Example : even / odd address memory access
» [Fig. 9-13 : four memory modules, each with its own address register AR, memory array, and data register DR, connected to a common address bus and data bus]
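
Module selection by the low-order address bits can be shown in a few lines. This is a sketch assuming four modules, as in Fig. 9-13; the name `select_module` and the convention that the remaining high-order bits address a word inside the module are illustrative assumptions.

```python
NUM_MODULES = 4                               # four modules, as in Fig. 9-13
MODULE_BITS = NUM_MODULES.bit_length() - 1    # 2 low-order bits for 4 modules


def select_module(address):
    """Split an address into (module number, word address inside the module)."""
    module = address & (NUM_MODULES - 1)      # lower 2 bits pick the module
    word = address >> MODULE_BITS             # remaining bits pick the word
    return module, word


if __name__ == "__main__":
    for address in range(8):
        print(address, select_module(address))
```

Consecutive addresses fall in different modules, so a stream of sequential accesses can keep all four modules busy at once.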

‹ Supercomputer
z Supercomputer = Vector Instruction + Pipelined floating-point arithmetic
z Performance Evaluation Index
» MIPS : Million Instructions Per Second
» FLOPS : Floating-point Operations Per Second
„ megaflops : 10^6, gigaflops : 10^9, teraflops : 10^12
z Cray supercomputer : Cray Research
» Cray-1 : 80 megaflops, 4 million 64-bit words of memory
» Cray-2 : 12 times more powerful than the Cray-1
z VP supercomputer : Fujitsu
» VP-200 : 300 megaflops, 32 million memory, 83 vector instructions, 195 scalar instructions
» VP-2600 : 5 gigaflops


„ 9-7 Array Processors
‹ Performs computations on large arrays of data
z Vector processing : uses adder/multiplier pipelines
z Array processing : uses a separate array processor

‹ Array Processing
z Attached array processor : Fig. 9-14
» Auxiliary processor attached to a general-purpose computer
» [Fig. 9-14 : General-purpose computer, Input-output interface, Attached array processor; Main memory and Local memory joined by a high-speed memory-to-memory bus]
z SIMD array processor : Fig. 9-15
» Computer with multiple processing units operating in parallel
„ For the vector computation C = A + B, each PEi executes ci = ai + bi simultaneously
» [Fig. 9-15 : Master control unit and Main memory; processing elements PE 1 ... PE n, each with its local memory M1 ... Mn]
