Chapter 09 Notes
Chapter 9 Accelerators
Portions of this work are from the book, Digital Design: An Embedded Systems Approach Using VHDL, by Peter J. Ashenden, published by Morgan Kaufmann Publishers, Copyright 2007 Elsevier Inc. All rights reserved.
VHDL
Accelerating performance
Perform steps in parallel
Takes less time overall to complete an operation
Accelerators
Custom hardware for parallel operations
Digital Design Chapter 9 Accelerators 2
Achievable Parallelism
How many steps can be performed at once?
Regularly structured data
Independent processing steps
Examples
Video and image pixel processing
Audio or sensor signal processing
Algorithm Kernels
Algorithm: specification of the required processing steps
Often expressed in a programming language
Kernel: the part that involves the most intensive, repetitive processing
10% of operations take 90% of the time
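Amdahl's Law
Time for an algorithm is $t$
Fraction $f$ is spent on a kernel
$t = ft + (1 - f)t$
Accelerating the kernel by a factor $s$ gives the new time
$t' = \dfrac{ft}{s} + (1 - f)t$
Overall speedup
$s' = \dfrac{t}{t'} = \dfrac{1}{\dfrac{f}{s} + (1 - f)}$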
For kernel 1:
$s' = \dfrac{1}{\dfrac{0.8}{10} + (1 - 0.8)} \approx 3.57$
For kernel 2:
$s' = \dfrac{1}{\ldots}$
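To see why the overall speedup is bounded no matter how fast the accelerator is, it can help to push the kernel speedup factor to its limit. The derivation below uses only the symbols already defined ($f$ for the kernel's fraction of the time, $s$ for the kernel speedup, $s'$ for the overall speedup). The numeric case $f = 0.9$ is an illustrative assumption borrowed from the 10%/90% rule of thumb, not a measured result.

```latex
\[
  s' = \frac{1}{\dfrac{f}{s} + (1 - f)}
  \quad\Longrightarrow\quad
  \lim_{s \to \infty} s' = \frac{1}{1 - f}
\]
% Illustrative assumption: f = 0.9
\[
  s = 10:\quad s' = \frac{1}{0.09 + 0.1} \approx 5.26
  \qquad\qquad
  s \to \infty:\quad s' \to \frac{1}{0.1} = 10
\]
```

Under this assumption, even an infinitely fast kernel yields at most a tenfold overall gain, because the unaccelerated remainder $(1 - f)t$ still has to run.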
Parallel Architectures
An architecture for an accelerator specifies
Processing blocks
Data flow between them
Parallelism through pipelining
Break a computation into steps, perform them in assembly-line fashion
Latency (time to complete a single operation) is not increased
Throughput (rate of completion of operations) is increased
Ideally by a factor equal to the number of pipeline stages
[Figure: three-stage pipeline. data in, step 1, step 2, step 3, data out]
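A minimal sketch of the three-step pipeline structure, assuming each step is a simple arithmetic operation separated from the next by a register. The entity name `pipeline3`, the port names, and the particular operations (add 1, double, add 3) are hypothetical and chosen only to make each stage visible; they are not taken from the source.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical example: computes ((data_in + 1) * 2) + 3 in three
-- registered steps. 10-bit intermediates cannot overflow for 8-bit input.
entity pipeline3 is
  port ( clk      : in  std_logic;
         data_in  : in  unsigned(7 downto 0);
         data_out : out unsigned(9 downto 0) );
end entity pipeline3;

architecture rtl of pipeline3 is
  signal s1, s2 : unsigned(9 downto 0);  -- registers between stages
begin

  stages : process (clk) is
  begin
    if rising_edge(clk) then
      s1       <= resize(data_in, 10) + 1;  -- step 1: add 1
      s2       <= shift_left(s1, 1);        -- step 2: multiply by 2
      data_out <= s2 + 3;                   -- step 3: add 3
    end if;
  end process stages;

end architecture rtl;
```

A new input can be accepted on every clock edge, while each individual result appears three clock edges after its input was sampled. That is the distinction between throughput and latency drawn above.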
Bus Arbitration
Bus masters take turns to use bus to access slaves
Controlled by a bus arbiter
Arbitration policies
Priority, round-robin, ...
[Figure: bus arbitration block diagram. Labels: processor, accelerator, controller, arbiter, request/grant signals, memory bus, memory]
Block-Processing Accelerator
Data arranged in regular groups of contiguous memory locations
Accelerator works block by block
E.g., images in blocks of 8 × 8 16-bit pixels
Datapath comprises
Memory access: address generation, counters
Computation section
Control section: finite-state machine(s)
Stream-Processing Accelerator
Streams of data from an input source
E.g., high-speed sensors
Processor/Accelerator Interface
Embedded software controls an accelerator
Providing control parameters
Synchronizing operations
Application areas
Video surveillance, computer vision, ...
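$G_x = \begin{bmatrix} -1 & 0 & +1 \\ -2 & 0 & +2 \\ -1 & 0 & +1 \end{bmatrix}$
$G_y = \begin{bmatrix} +1 & +2 & +1 \\ 0 & 0 & 0 \\ -1 & -2 & -1 \end{bmatrix}$
$D_x(i, j) = O(i, j) * G_x$
$D_y(i, j) = O(i, j) * G_y$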
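Written out for a single pixel, the two convolutions reduce to weighted sums over the 3 × 3 neighborhood of the original image $O$. The expansion below follows the signs used in the Pixel Datapath listing in these notes. The magnitude approximation and the divide-by-8 scaling are read off that listing (the `srl 3`), so this is a sketch of what the hardware computes rather than a definition.

```latex
\begin{align*}
D_x(i,j) &= -\,O(i-1,j-1) + O(i-1,j+1) \\
         &\quad - 2\,O(i,j-1) + 2\,O(i,j+1) \\
         &\quad - O(i+1,j-1) + O(i+1,j+1) \\[4pt]
D_y(i,j) &= O(i-1,j-1) + 2\,O(i-1,j) + O(i-1,j+1) \\
         &\quad - O(i+1,j-1) - 2\,O(i+1,j) - O(i+1,j+1) \\[4pt]
|D(i,j)| &\approx \left\lfloor \frac{|D_x(i,j)| + |D_y(i,j)|}{8} \right\rfloor
\end{align*}
```

With 8-bit pixels, each of $|D_x|$ and $|D_y|$ is at most $4 \times 255 = 1020$, so the sum fits in 11 bits and the scaled result fits in 8 bits.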
Data Dependencies
Pixels can be computed independently
For each pixel:
System Architecture
Data dependencies suggest a pipeline
Coefficient multiplies are simple shift/negate, so merge with adder stage
Memory Bandwidth
Assume memory read/write takes 20ns (2 cycles of 100MHz clock)
Memory is 32 bits wide, byte addressable
Bandwidth = 50M operations/sec
Read 4 pixels at once from each of previous, current, and next rows
Store in accelerator to compute multiple derivative image pixels
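A rough count of memory operations shows why fetching whole words matters. The figures below use only the numbers above (20 ns per access, 32-bit words, each holding four 8-bit pixels) together with the 3 × 3 neighborhood whose center coefficient is zero. They are upper bounds that ignore computation time and bus contention.

```latex
\[
  \text{Bandwidth} = \frac{1}{20\ \text{ns}} = 50\text{M operations/sec}
\]
% One pixel per access: 8 neighbor reads + 1 write per result pixel
\[
  \frac{50\text{M operations/sec}}{(8 + 1)\ \text{operations/pixel}}
  \approx 5.6\text{M pixels/sec}
\]
% Four pixels per access: 3 row reads + 1 write per 4 result pixels
\[
  \frac{50\text{M operations/sec}}{\tfrac{3 + 1}{4}\ \text{operations/pixel}}
  = 50\text{M pixels/sec}
\]
```

Reading a word from each of the three rows and keeping it in the accelerator therefore cuts memory traffic from nine operations per pixel to one on average.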
Accelerator Sequence
Steady state
Write 4 result pixels
Read 4 pixels for previous, current, next rows
Compute for 4 cycles
Repeat
Omit writes until pipeline full
Omit reads to drain pipeline
Pixel Datapath
-- Computation datapath signals
type pixel_array is array (-1 to +1, -1 to +1) of unsigned(7 downto 0);
signal prev_row, curr_row, next_row : unsigned(31 downto 0);
signal O : pixel_array;
signal Dx, Dy : signed(10 downto 0);
signal abs_D : unsigned(7 downto 0);
signal result_row : unsigned(31 downto 0);
...
-- Computational datapath
prev_row_reg : process (clk_i) is
begin
  if rising_edge(clk_i) then
    if prev_row_load = '1' then
      -- load a word of 4 pixels from memory
      prev_row <= unsigned(dat_i);
    elsif shift_en = '1' then
      -- shift the next pixel into the most-significant byte
      prev_row(31 downto 8) <= prev_row(23 downto 0);
    end if;
  end if;
end process prev_row_reg;
curr_row_reg : process (clk_i) is ...
next_row_reg : process (clk_i) is ...
pipeline : process (clk_i) is
begin
  if rising_edge(clk_i) then
    if shift_en = '1' then
      -- final stage: |D| approximated as (|Dx| + |Dy|) / 8
      abs_D <= resize( (unsigned(abs Dx) + unsigned(abs Dy)) srl 3, 8 );
      -- middle stage: weighted sums; multiply by 2 is a left shift
      Dx <= - signed(resize(O(-1, -1), 11))
            + signed(resize(O(-1, +1), 11))
            - (signed(resize(O( 0, -1), 11)) sll 1)
            + (signed(resize(O( 0, +1), 11)) sll 1)
            - signed(resize(O(+1, -1), 11))
            + signed(resize(O(+1, +1), 11));
      Dy <=   signed(resize(O(-1, -1), 11))
            + (signed(resize(O(-1, 0), 11)) sll 1)
            + signed(resize(O(-1, +1), 11))
            - signed(resize(O(+1, -1), 11))
            - (signed(resize(O(+1, 0), 11)) sll 1)
            - signed(resize(O(+1, +1), 11));
      -- first stage: shift the 3 x 3 window one column along the rows
      O(-1, -1) <= O(-1, 0);
      O(-1, 0) <= O(-1, +1);
      O(-1, +1) <= prev_row(31 downto 24);
      O( 0, -1) <= O(0, 0);
      O( 0, 0) <= O(0, +1);
      O( 0, +1) <= curr_row(31 downto 24);
      O(+1, -1) <= O(+1, 0);
      O(+1, 0) <= O(+1, +1);
      O(+1, +1) <= next_row(31 downto 24);
    end if;
  end if;
end process pipeline;

result_row_reg : process (clk_i) is
begin
  if rising_edge(clk_i) then
    if shift_en = '1' then
      result_row <= result_row(23 downto 0) & abs_D;
    end if;
  end if;
end process result_row_reg;
Address Generation
Given an image in memory at base address B
Address for pixel in row r, column c is $B + r \times 640 + c$
Base address (B) is fixed
Offset ($r \times 640 + c$) increments by 4 for each group of 4 pixels read/written
Use word-aligned addresses
Two least-significant bits always 00
Increment word address by 1
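As a worked instance of this address arithmetic, take an arbitrary pixel position. The row and column values below are illustrative and do not come from the source.

```latex
% Illustrative assumption: r = 2, c = 8
\begin{align*}
\text{byte offset} &= r \times 640 + c = 2 \times 640 + 8 = 1288 \\
\text{word offset} &= 1288 / 4 = 322 \\
\text{next group of 4 pixels} &:\ \text{byte offset } 1292,\ \text{word offset } 323 \\
\text{same column, next row} &:\ \text{byte offset } 1288 + 640 = 1928,\ \text{word offset } 322 + 160 = 482
\end{align*}
```

Because every access is word-aligned, the hardware only needs to hold and increment the word offset. Moving down one row adds $640/4 = 160$ to the word address.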
O_base_reg : process (clk_i) is
begin
  if rising_edge(clk_i) then
    if O_base_ce = '1' then
      O_base <= unsigned(dat_i(21 downto 2));
    end if;
  end if;
end process O_base_reg;

O_offset_counter : process (clk_i) is
begin
  if rising_edge(clk_i) then
    if offset_reset = '1' then
      O_offset <= (others => '0');
    elsif O_offset_cnt_en = '1' then
      O_offset <= O_offset + 1;
    end if;
  end if;
end process O_offset_counter;
...
D_base_reg : process (clk_i) is
begin
  if rising_edge(clk_i) then
    if D_base_ce = '1' then
      D_base <= unsigned(dat_i(21 downto 2));
    end if;
  end if;
end process D_base_reg;

D_offset_counter : process (clk_i) is
begin
  if rising_edge(clk_i) then
    if offset_reset = '1' then
      D_offset <= (others => '0');
    elsif D_offset_cnt_en = '1' then
      D_offset <= D_offset + 1;
    end if;
  end if;
end process D_offset_counter;
...
O_prev_addr <= O_base + O_offset;
O_curr_addr <= O_prev_addr + 640/4;
O_next_addr <= O_prev_addr + 1280/4;
D_addr <= D_base + D_offset;
adr_o(21 downto 2) <= O_prev_addr when prev_row_load = '1' else
                      O_curr_addr when curr_row_load = '1' else
                      O_next_addr when next_row_load = '1' else
                      D_addr;
adr_o(1 downto 0) <= "00";
Control/Status Registers
Register  Offset  Read/Write  Purpose
Int_en    0       Write-only  Interrupt enable (bit 0).
Start     4       Write-only  Write causes image processing to start (value ignored).
O_base    8       Write-only  Original image base address.
D_base    12      Write-only  Derivative image base address + 640.
Status    0       Read-only   Processing done (bit 0). Reading clears interrupt.
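The register map above could be decoded along the following lines. This is a sketch under stated assumptions, not the design from the source: the entity name `sobel_reg_decode`, the slave-side inputs `stb_i`, `we_i` and `adr_i`, and the outputs `int_en_ce`, `start` and `status_rd` are hypothetical names. Only `O_base_ce` and `D_base_ce` appear in the listings in these notes.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical decoder for the control/status register offsets.
-- Assumes a strobe (stb_i), a write enable (we_i) and a 4-bit byte
-- offset (adr_i) on the slave side of the bus.
entity sobel_reg_decode is
  port ( stb_i, we_i : in  std_logic;
         adr_i       : in  std_logic_vector(3 downto 0);
         int_en_ce   : out std_logic;   -- write to Int_en (offset 0)
         start       : out std_logic;   -- write to Start  (offset 4)
         O_base_ce   : out std_logic;   -- write to O_base (offset 8)
         D_base_ce   : out std_logic;   -- write to D_base (offset 12)
         status_rd   : out std_logic ); -- read of Status  (offset 0)
end entity sobel_reg_decode;

architecture rtl of sobel_reg_decode is
  signal wr, rd : std_logic;
begin

  wr <= stb_i and we_i;        -- slave write access
  rd <= stb_i and not we_i;    -- slave read access

  int_en_ce <= '1' when wr = '1' and adr_i = "0000" else '0';
  start     <= '1' when wr = '1' and adr_i = "0100" else '0';
  O_base_ce <= '1' when wr = '1' and adr_i = "1000" else '0';
  D_base_ce <= '1' when wr = '1' and adr_i = "1100" else '0';
  status_rd <= '1' when rd = '1' and adr_i = "0000" else '0';

end architecture rtl;
```

Int_en and Status can share offset 0 because one is write-only and the other read-only, so the direction of the access selects between them.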
Control Sequencing
Use a finite-state machine
Counters keep track of rows (0 to 477) and columns (0 to 159)
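The counter ranges give a rough estimate of the work per image. The calculation below assumes the steady-state sequence described earlier (one write and three reads of 2 cycles each, then 4 compute cycles), assumes memory access and computation are not overlapped, and ignores pipeline fill and drain as well as bus arbitration delays. It is an order-of-magnitude check, not a figure from the source.

```latex
\begin{align*}
\text{groups of 4 pixels} &= 478\ \text{rows} \times 160\ \text{columns} = 76{,}480 \\
\text{cycles per group}   &= (1 + 3) \times 2 + 4 = 12 \\
\text{cycles per image}   &\approx 76{,}480 \times 12 = 917{,}760 \\
\text{time per image}     &\approx 917{,}760 \times 10\ \text{ns} \approx 9.2\ \text{ms}
\end{align*}
```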
Accelerator Verification
Simulation-based verification of each section of the accelerator
Slave bus operations
Computation sequencing
Master bus operations
Address generation
Pixel computation
[Figure: verification testbench. Labels: arbiter, memory model]
Bus Arbiter
Uses sobel_cyc_o and cpu_cyc_o as request inputs
If both request at the same time, give accelerator priority
Mealy FSM
arbiter_fsm_reg : process (clk) is
...
arbiter_logic : process (arbiter_current_state, sobel_cyc_o, cpu_cyc_o) is
begin
  case arbiter_current_state is
    when sobel =>
      if sobel_cyc_o = '1' then
        sobel_gnt <= '1'; cpu_gnt <= '0'; arbiter_next_state <= ...;
      elsif sobel_cyc_o = '0' and cpu_cyc_o = '1' then
        sobel_gnt <= '0'; cpu_gnt <= '1'; arbiter_next_state <= ...;
      else
        sobel_gnt <= '0'; cpu_gnt <= '0'; arbiter_next_state <= ...;
      end if;
    when cpu =>
      if cpu_cyc_o = '1' then
        sobel_gnt <= '0'; cpu_gnt <= '1'; arbiter_next_state <= ...;
      elsif sobel_cyc_o = '1' and cpu_cyc_o = '0' then
        sobel_gnt <= '1'; cpu_gnt <= '0'; arbiter_next_state <= ...;
      else
        sobel_gnt <= '0'; cpu_gnt <= '0'; arbiter_next_state <= ...;
      end if;
  end case;
end process arbiter_logic;
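For reference, a self-contained two-master Mealy arbiter along these lines might look like the sketch below. The signal and state names follow the listing above, but the next-state choices, the synchronous reset input `rst`, the entity name `bus_arbiter`, and the decision to return to the `sobel` state when the bus is idle are all assumptions made for this illustration, not details confirmed by the source.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical two-master arbiter. The state records which master was
-- granted last; a master that still asserts its request keeps the bus.
entity bus_arbiter is
  port ( clk, rst               : in  std_logic;
         sobel_cyc_o, cpu_cyc_o : in  std_logic;   -- request inputs
         sobel_gnt, cpu_gnt     : out std_logic ); -- grant outputs
end entity bus_arbiter;

architecture rtl of bus_arbiter is
  type arbiter_state is (sobel, cpu);
  signal arbiter_current_state, arbiter_next_state : arbiter_state;
begin

  arbiter_fsm_reg : process (clk) is
  begin
    if rising_edge(clk) then
      if rst = '1' then
        arbiter_current_state <= sobel;
      else
        arbiter_current_state <= arbiter_next_state;
      end if;
    end if;
  end process arbiter_fsm_reg;

  -- Mealy outputs: grants depend on the state and the current requests
  arbiter_logic : process (arbiter_current_state, sobel_cyc_o, cpu_cyc_o) is
  begin
    -- defaults: no grant, return to the accelerator-priority state (assumed)
    sobel_gnt <= '0';
    cpu_gnt   <= '0';
    arbiter_next_state <= sobel;
    case arbiter_current_state is
      when sobel =>
        if sobel_cyc_o = '1' then
          sobel_gnt <= '1';               -- accelerator wins any tie
        elsif cpu_cyc_o = '1' then
          cpu_gnt <= '1';
          arbiter_next_state <= cpu;
        end if;
      when cpu =>
        if cpu_cyc_o = '1' then
          cpu_gnt <= '1';                 -- CPU keeps the bus until done
          arbiter_next_state <= cpu;
        elsif sobel_cyc_o = '1' then
          sobel_gnt <= '1';
        end if;
    end case;
  end process arbiter_logic;

end architecture rtl;
```

Assigning default values at the top of the combinational process keeps every output driven on every path, which avoids inferring latches.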
Simulation Results
See waveforms in textbook
Demonstrates sequencing and address generation
Summary
Accelerators boost performance using parallel hardware
Replication, pipelining, ...
Amdahl's Law
Best payback from accelerating a kernel