
Digital Design: An Embedded Systems Approach Using VHDL

Chapter 9 Accelerators

Portions of this work are from the book, Digital Design: An Embedded Systems Approach Using VHDL, by Peter J. Ashenden, published by Morgan Kaufmann Publishers, Copyright 2007 Elsevier Inc. All rights reserved.

VHDL

Performance and Parallelism


A processor core performs steps in sequence
Performance limited by the instruction rate

Accelerating performance
Perform steps in parallel
Takes less time overall to complete an operation

Instruction-level parallelism
Within a processor core
Pipelining, multiple-issue

Accelerators
Custom hardware for parallel operations
Digital Design Chapter 9 Accelerators 2


Achievable Parallelism
How many steps can be performed at once?
Regularly structured data
Independent processing steps
Examples
Video and image pixel processing
Audio or sensor signal processing

Constrained by data dependencies
Operations that depend on results of previous steps

Algorithm Kernels
Algorithm: specification of the required processing steps
Often expressed in a programming language

Kernel: the part that involves the most intensive, repetitive processing
10% of operations take 90% of the time

Accelerating a kernel with parallel hardware gives the best payback



Amdahl's Law
Time for an algorithm is t
Fraction f is spent on a kernel

$t = ft + (1 - f)t$

Accelerator speeds up the kernel by a factor s

$t' = \frac{ft}{s} + (1 - f)t$

Overall speedup factor s'

$s' = \frac{t}{t'} = \frac{1}{\frac{f}{s} + (1 - f)}$

For large f, s' ≈ s
For small f, s' ≈ 1
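The ceiling on the overall speedup follows from the same formula. As an added worked step (not part of the original slides), let the kernel speedup s grow without bound:

```latex
\lim_{s \to \infty} s'
  = \lim_{s \to \infty} \frac{1}{\frac{f}{s} + (1 - f)}
  = \frac{1}{1 - f}
```

For instance, if a kernel accounts for f = 0.9 of the time, no accelerator can improve the whole algorithm by more than a factor of 10.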

Amdahl's Law Example

An algorithm with two kernels
Kernel 1: 80% of time, can be sped up 10 times
Kernel 2: 15% of time, can be sped up 100 times
Which speedup gives best overall improvement?

For kernel 1:

$s' = \frac{1}{\frac{0.8}{10} + (1 - 0.8)} = \frac{1}{0.08 + 0.2} = 3.57$

For kernel 2:

$s' = \frac{1}{\frac{0.15}{100} + (1 - 0.15)} = \frac{1}{0.0015 + 0.85} = 1.17$
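As an added check on this example (not from the original slides), the formula extends to several kernels, where kernel i takes fraction f_i of the time and is sped up by s_i. Accelerating both kernels together would give:

```latex
s' = \frac{1}{\sum_i \frac{f_i}{s_i} + \left(1 - \sum_i f_i\right)}
   = \frac{1}{\frac{0.8}{10} + \frac{0.15}{100} + 0.05}
   = \frac{1}{0.1315}
   \approx 7.60
```

Kernel 2 helps little on its own because its speedup can never exceed 1/(1 - 0.15) ≈ 1.18, whereas kernel 1 alone is bounded by 1/(1 - 0.8) = 5.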

Parallel Architectures
An architecture for an accelerator specifies
Processing blocks
Data flow between them

Parallelism through replication
Multiple identical blocks operating on different data elements
Works well when elements can be processed independently
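Replication maps naturally onto a VHDL generate statement. The following is a minimal sketch, not taken from the slides: the entity name replicated_avg, the generic lanes, and the choice of operation (an 8-bit average per lane) are all assumptions made for illustration.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical example: 'lanes' identical blocks, each averaging one
-- pair of 8-bit elements, all working in the same clock cycle.
entity replicated_avg is
  generic ( lanes : positive := 4 );
  port ( clk  : in  std_logic;
         a, b : in  std_logic_vector(8*lanes - 1 downto 0);
         y    : out std_logic_vector(8*lanes - 1 downto 0) );
end entity replicated_avg;

architecture rtl of replicated_avg is
begin

  lane_gen : for i in 0 to lanes - 1 generate
    -- One copy of this process is created per lane; the copies are
    -- independent because each touches only its own 8-bit slice.
    lane : process (clk) is
      variable sum : unsigned(8 downto 0);
    begin
      if rising_edge(clk) then
        sum := resize(unsigned(a(8*i + 7 downto 8*i)), 9)
             + resize(unsigned(b(8*i + 7 downto 8*i)), 9);
        -- Dropping the least-significant bit halves the 9-bit sum
        y(8*i + 7 downto 8*i) <= std_logic_vector(sum(8 downto 1));
      end if;
    end process lane;
  end generate lane_gen;

end architecture rtl;
```

With lanes set to 4, one 32-bit word of packed 8-bit elements is processed per clock cycle instead of one element.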

Parallel Architectures
Parallelism through pipelining
Break a computation into steps, perform them in assembly-line fashion
Latency (time to complete a single operation) is not increased
Throughput (rate of completion of operations) is increased
Ideally by a factor equal to the number of pipeline stages

[Figure: data in → step 1 → step 2 → step 3 → data out]
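A three-step pipeline can be sketched in VHDL as three registered stages in one clocked process. This is an illustrative assumption, not code from the slides: the entity pipe3, its ports, and the computation (a scaled sum of three operands) are hypothetical.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical example: y = (a + b + c) / 4, split into three steps.
entity pipe3 is
  port ( clk     : in  std_logic;
         a, b, c : in  unsigned(7 downto 0);
         y       : out unsigned(7 downto 0) );
end entity pipe3;

architecture rtl of pipe3 is
  signal sum_s1   : unsigned(8 downto 0);  -- step 1 result: a + b
  signal c_s1     : unsigned(7 downto 0);  -- c delayed to stay aligned with sum_s1
  signal total_s2 : unsigned(9 downto 0);  -- step 2 result: a + b + c
begin

  stages : process (clk) is
  begin
    if rising_edge(clk) then
      -- Step 1: first addition, and carry c along to the next stage
      sum_s1 <= resize(a, 9) + resize(b, 9);
      c_s1   <= c;
      -- Step 2: uses the values step 1 registered on the previous edge
      total_s2 <= resize(sum_s1, 10) + resize(c_s1, 10);
      -- Step 3: scale the 10-bit total back to 8 bits
      y <= total_s2(9 downto 2);
    end if;
  end process stages;

end architecture rtl;
```

Because signal assignments read the values from before the clock edge, each step works on a different operand set in the same cycle. Once the pipeline is full, a new result appears on every clock edge.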

Direct Memory Access (DMA)


Input/Output data for accelerators must be transferred at high speed
Using the processor would be too slow

Direct memory access
I/O controller and accelerator transfer data to and from memory autonomously
Program supplies starting address and length
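The "starting address and length" supplied by the program typically end up in an address counter and a remaining-length counter. The sketch below is a hypothetical illustration: the entity dma_addr_gen and all its port names are assumptions, and the 20-bit word address merely mirrors the width used later in the case study.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical DMA address generator: counts through word_count
-- consecutive word addresses starting at start_addr.
entity dma_addr_gen is
  port ( clk, rst   : in  std_logic;
         load       : in  std_logic;               -- program writes parameters
         start_addr : in  unsigned(21 downto 2);   -- word-aligned start address
         word_count : in  unsigned(15 downto 0);   -- transfer length in words
         step       : in  std_logic;               -- one word has been transferred
         addr       : out unsigned(21 downto 2);
         done       : out std_logic );
end entity dma_addr_gen;

architecture rtl of dma_addr_gen is
  signal addr_reg  : unsigned(21 downto 2);
  signal remaining : unsigned(15 downto 0);
begin

  counters : process (clk) is
  begin
    if rising_edge(clk) then
      if rst = '1' then
        addr_reg  <= (others => '0');
        remaining <= (others => '0');
      elsif load = '1' then
        addr_reg  <= start_addr;
        remaining <= word_count;
      elsif step = '1' and remaining /= 0 then
        addr_reg  <= addr_reg + 1;
        remaining <= remaining - 1;
      end if;
    end if;
  end process counters;

  addr <= addr_reg;
  done <= '1' when remaining = 0 else '0';

end architecture rtl;
```

After the load, the processor takes no further part in the transfer; it only needs to observe done, for example through a status register or interrupt.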

Bus Arbitration
Bus masters take turns to use bus to access slaves
Controlled by a bus arbiter

Arbitration policies
Priority, round-robin, ...

[Figure: arbiter exchanging request and grant signals with the processor, accelerator and controller, which share the memory bus to memory]

Block-Processing Accelerator
Data arranged in regular groups of contiguous memory locations
Accelerator works block by block
E.g., images in blocks of 8 × 8 16-bit pixels

Datapath comprises
Memory access: address generation, counters
Computation section
Control section: finite-state machine(s)

Stream-Processing Accelerator
Streams of data from an input source
E.g., high-speed sensors

Digital signal processing (DSP)
Analog sensor signal converted to stream of digital sample values
Filtering, gain/attenuation, frequency-domain conversion (Fourier transform)

Processor/Accelerator Interface
Embedded software controls an accelerator
Providing control parameters
Synchronizing operations

Input/output registers and interrupts
Interact with the control sequencer

Case Study: Edge Detection


Illustration of accelerator design
Edge detection in video processing
Identify where image intensity changes abruptly
Typically at the boundary of objects
First step in identifying objects in a scene

Application areas
Video surveillance, computer vision, ...

For this case study
Monochrome images of 640 × 480 8-bit pixels
Stored row-by-row in memory
Pixel values: 0 (black) to 255 (white)

Sobel Edge Detection


Compute derivatives of intensity in x and y directions
Look for minima and maxima (where intensity changes most rapidly)

The Sobel Algorithm


Use convolution to approximate partial derivatives Dx and Dy at each position
Weighted sum of the value of a pixel and its eight nearest neighbors
Coefficients represented using a 3 × 3 convolution mask

Sobel masks for x and y derivatives

$$G_x = \begin{bmatrix} -1 & 0 & +1 \\ -2 & 0 & +2 \\ -1 & 0 & +1 \end{bmatrix} \qquad G_y = \begin{bmatrix} +1 & +2 & +1 \\ 0 & 0 & 0 \\ -1 & -2 & -1 \end{bmatrix}$$

$$D_x(i, j) = O(i, j) \ast G_x \qquad D_y(i, j) = O(i, j) \ast G_y$$

The Sobel Algorithm


Combine partial derivatives
$|D| = \sqrt{D_x^2 + D_y^2}$

Since we just want maxima and minima in magnitude, approximate as:

$|D| \approx |D_x| + |D_y|$

Edge pixels don't have eight neighbors
Skip computation of |D| for edges
Just set them to 0 using software
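A short bound (added here as standard algebra, not stated in the slides) indicates why the sum of absolute values is a reasonable stand-in for the true magnitude when only the locations of the extremes matter:

```latex
\sqrt{D_x^2 + D_y^2} \;\le\; |D_x| + |D_y| \;\le\; \sqrt{2}\,\sqrt{D_x^2 + D_y^2}
```

The approximation therefore never underestimates the magnitude and overestimates it by at most a factor of √2 ≈ 1.41. It is exact when either derivative is zero, and the worst case occurs when |D_x| = |D_y|.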

The Algorithm in Pseudocode


for row in 1 to 478 loop
  for col in 1 to 638 loop
    sumx := 0; sumy := 0;
    for i in -1 to +1 loop
      for j in -1 to +1 loop
        sumx := sumx + O(row+i, col+j) * Gx(i, j);
        sumy := sumy + O(row+i, col+j) * Gy(i, j);
      end loop;
    end loop;
    D(row, col) := abs(sumx) + abs(sumy);
  end loop;
end loop;

Data Formats and Rates


Pixel values: 0 to 255 (8 bits)
Coefficients are 0, ±1 and ±2
Partial products: −510 to +510 (10 bits)
Dx and Dy: −1020 to +1020 (11 bits)
|D|: 0 to 2040 (11 bits)
Final pixel value: scale back to 8 bits

Video rate: 30 frames/sec
640 × 480 = 307,200 pixels
307,200 × 30 ≈ 10 million pixels/sec
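The bit widths can be checked with a few lines of arithmetic. This is an added worked check; the divide-by-8 scaling is taken from the right shift by 3 in the pixel datapath code later in the chapter.

```latex
\begin{aligned}
\text{partial product:}\quad & 255 \times 2 = 510 \le 2^{9} - 1 = 511
  && \Rightarrow\ \text{10-bit signed} \\
D_x,\ D_y:\quad & 255 \times (1 + 2 + 1) = 1020 \le 2^{10} - 1 = 1023
  && \Rightarrow\ \text{11-bit signed} \\
|D|:\quad & 1020 + 1020 = 2040 \le 2^{11} - 1 = 2047
  && \Rightarrow\ \text{11-bit unsigned} \\
\text{scaled:}\quad & \lfloor 2040 / 2^{3} \rfloor = 255
  && \Rightarrow\ \text{8-bit unsigned}
\end{aligned}
```

The extreme value of a derivative occurs when all pixels under the positive coefficients are 255 and all pixels under the negative coefficients are 0, or vice versa.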

Data Dependencies
Pixels can be computed independently
For each pixel:

System Architecture
Data dependencies suggest a pipeline
Coefficient multiplies are simple shift/negate, so merge with adder stage

Memory Bandwidth
Assume memory read/write takes 20ns (2 cycles of 100MHz clock)
Memory is 32 bits wide, byte addressable
Bandwidth = 50M operations/sec

Camera produces 10Mpixels/sec
Accelerator needs to process at this rate
(8 reads + 1 write) × 10Mpixel/sec = 90M operations/sec
Greater than memory bandwidth

Memory Bandwidth
Read 4 pixels at once from each of previous, current and next rows
Store in accelerator to compute multiple derivative image pixels

Produce derivative pixels row-by-row, left-to-right
Read 3 32-bit words for every 4th derivative pixel computed
Write 4 pixels at a time
(3 reads + 1 write) / 4 × 10Mpixel/sec = 10M operations/sec
= 20% of available memory bandwidth
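The slides round the pixel rate up to 10 million pixels/sec. As an added check, repeating the arithmetic with the unrounded rate leads to the same conclusion:

```latex
\begin{aligned}
\text{pixel rate} &= 640 \times 480 \times 30 = 9{,}216{,}000\ \text{pixels/sec} \\
\text{memory bandwidth} &= \frac{1}{20\ \text{ns}} = 50\text{M operations/sec} \\
\text{byte-wide access} &= (8 + 1) \times 9.216\text{M} \approx 82.9\text{M operations/sec} \;>\; 50\text{M} \\
\text{word-wide access} &= \frac{3 + 1}{4} \times 9.216\text{M} \approx 9.2\text{M operations/sec} \approx 18\%\ \text{of}\ 50\text{M}
\end{aligned}
```

Pixel-at-a-time access still exceeds what the memory can deliver, while word-wide access leaves roughly four fifths of the bandwidth for the processor and other bus masters.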

Sobel Accelerator Architecture


Accelerator Sequence
Steady state
Write 4 result pixels
Read 4 pixels for previous, current, next rows
Compute for 4 cycles
Repeat

Start of row
Omit writes until pipeline full

End of row
Omit reads to drain pipeline

Memory Operation Timing


Steady state


Pixel Datapath
-- Computation datapath signals
type pixel_array is array(-1 to +1, -1 to +1) of unsigned(7 downto 0);

signal prev_row, curr_row, next_row : unsigned(31 downto 0);
signal O : pixel_array;
signal Dx, Dy : signed(10 downto 0);
signal abs_D : unsigned(7 downto 0);
signal result_row : unsigned(31 downto 0);
...

Pixel Datapath
-- Computational datapath
prev_row_reg : process (clk_i) is
begin
  if rising_edge(clk_i) then
    if prev_row_load = '1' then
      prev_row <= unsigned(dat_i);
    elsif shift_en = '1' then
      prev_row(31 downto 8) <= prev_row(23 downto 0);
    end if;
  end if;
end process prev_row_reg;

curr_row_reg : process (clk_i) is ...

next_row_reg : process (clk_i) is ...

Pixel Datapath
pipeline : process (clk_i) is
begin
  if rising_edge(clk_i) then
    if shift_en = '1' then
      abs_D <= resize( (unsigned(abs Dx) + unsigned(abs Dy)) srl 3, 8 );
      Dx <= - signed(resize(O(-1, -1), 11))
            + signed(resize(O(-1, +1), 11))
            - (signed(resize(O( 0, -1), 11)) sll 1)
            + (signed(resize(O( 0, +1), 11)) sll 1)
            - signed(resize(O(+1, -1), 11))
            + signed(resize(O(+1, +1), 11));
      Dy <=   signed(resize(O(-1, -1), 11))
            + (signed(resize(O(-1,  0), 11)) sll 1)
            + signed(resize(O(-1, +1), 11))
            - signed(resize(O(+1, -1), 11))
            - (signed(resize(O(+1,  0), 11)) sll 1)
            - signed(resize(O(+1, +1), 11));
      ...

Pixel Datapath
      O(-1, -1) <= O(-1, 0);
      O(-1,  0) <= O(-1, +1);
      O(-1, +1) <= prev_row(31 downto 24);
      O( 0, -1) <= O(0, 0);
      O( 0,  0) <= O(0, +1);
      O( 0, +1) <= curr_row(31 downto 24);
      O(+1, -1) <= O(+1, 0);
      O(+1,  0) <= O(+1, +1);
      O(+1, +1) <= next_row(31 downto 24);
    end if;
  end if;
end process pipeline;

result_row_reg : process (clk_i) is
begin
  if rising_edge(clk_i) then
    if shift_en = '1' then
      result_row <= result_row(23 downto 0) & abs_D;
    end if;
  end if;
end process result_row_reg;

Address Generation
Given an image in memory at base address B
Address for pixel in row r, column c is B + r × 640 + c
Base address (B) is fixed
Offset (r × 640 + c) increments by 4 for each group of 4 pixels read/written
Use word-aligned addresses
Two least-significant bits always 00
Increment word address by 1

Address Generation


Address Generation
O_base_reg : process (clk_i) is
begin
  if rising_edge(clk_i) then
    if O_base_ce = '1' then
      O_base <= unsigned(dat_i(21 downto 2));
    end if;
  end if;
end process O_base_reg;

O_offset_counter : process (clk_i) is
begin
  if rising_edge(clk_i) then
    if offset_reset = '1' then
      O_offset <= (others => '0');
    elsif O_offset_cnt_en = '1' then
      O_offset <= O_offset + 1;
    end if;
  end if;
end process O_offset_counter;
...

Address Generation
D_base_reg : process (clk_i) is
begin
  if rising_edge(clk_i) then
    if D_base_ce = '1' then
      D_base <= unsigned(dat_i(21 downto 2));
    end if;
  end if;
end process D_base_reg;

D_offset_counter : process (clk_i) is
begin
  if rising_edge(clk_i) then
    if offset_reset = '1' then
      D_offset <= (others => '0');
    elsif D_offset_cnt_en = '1' then
      D_offset <= D_offset + 1;
    end if;
  end if;
end process D_offset_counter;
...

Address Generation
O_prev_addr <= O_base + O_offset;
O_curr_addr <= O_prev_addr + 640/4;
O_next_addr <= O_prev_addr + 1280/4;
D_addr <= D_base + D_offset;

adr_o(21 downto 2) <= O_prev_addr when prev_row_load = '1' else
                      O_curr_addr when curr_row_load = '1' else
                      O_next_addr when next_row_load = '1' else
                      D_addr;
adr_o(1 downto 0) <= "00";
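The constants 640/4 and 1280/4 come from converting byte addresses to word addresses. As an added worked step, assuming the base address B is word-aligned (as the register slice dat_i(21 downto 2) implies):

```latex
\begin{aligned}
\text{byte address} &= B + 640\,r + c \\
\text{word address} &= \left\lfloor \frac{B + 640\,r + c}{4} \right\rfloor
                     = \frac{B}{4} + 160\,r + \left\lfloor \frac{c}{4} \right\rfloor \\
\text{one row} &= 640 / 4 = 160\ \text{words}, \qquad
\text{two rows} = 1280 / 4 = 320\ \text{words}
\end{aligned}
```

So the current-row word sits 160 words after the previous-row word and the next-row word 320 words after it, which matches the 160 word columns (0 to 159) per row counted by the control sequencer.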

Control/Status Registers
Register  Offset  Read/Write  Purpose
Int_en    0       Write-only  Interrupt enable (bit 0).
Start     4       Write-only  Write causes image processing to start (value ignored).
O_base    8       Write-only  Original image base address.
D_base    12      Write-only  Derivative image base address + 640.
Status    0       Read-only   Processing done (bit 0). Reading clears interrupt.

Slave Bus Interface


start <= '1' when cyc_i = '1' and stb_i = '1' and we_i = '1'
                  and adr_i = "01" else '0';
O_base_ce <= '1' when cyc_i = '1' and stb_i = '1' and we_i = '1'
                      and adr_i = "10" else '0';
D_base_ce <= '1' when cyc_i = '1' and stb_i = '1' and we_i = '1'
                      and adr_i = "11" else '0';

int_reg : process (clk_i) is
begin
  if rising_edge(clk_i) then
    if rst_i = '1' then
      int_en <= '0';
    elsif cyc_i = '1' and stb_i = '1' and we_i = '1'
          and adr_i = "00" then
      int_en <= dat_i(0);
    end if;
  end if;
end process int_reg;
...

Slave Bus Interface


status_reg : process (clk_i) is
begin
  if rising_edge(clk_i) then
    if rst_i = '1' then
      done <= '0';
    elsif done_set = '1' then
      -- This occurs when last write is acknowledged,
      -- and so cannot coincide with a read of the
      -- status register.
      done <= '1';
    elsif cyc_i = '1' and stb_i = '1' and we_i = '0'
          and adr_i = "00" and ack_o_tmp = '1' then
      done <= '0';
    end if;
  end if;
end process status_reg;

int_req <= int_en and done;
...

Slave Bus Interface


ack_gen : process (clk_i) is
begin
  if rising_edge(clk_i) then
    ack_o_tmp <= cyc_i and stb_i and not ack_o_tmp;
  end if;
end process ack_gen;

ack_o <= ack_o_tmp;

-- Wishbone data output multiplexer
dat_o <= (31 downto 1 => '0') & done    -- status register read
           when cyc_i = '1' and stb_i = '1' and we_i = '0'
                and adr_i = "00" else
         (others => '0')                -- other registers read as 0
           when cyc_i = '1' and stb_i = '1' and we_i = '0'
                and adr_i /= "00" else
         std_logic_vector(result_row);  -- for master write

Control Sequencing
Use a finite-state machine
Counters keep track of rows (0 to 477) and columns (0 to 159)

See textbook for details of FSM output functions


State Transition Diagram


Accelerator Verification
Simulation-based verification of each section of the accelerator
Slave bus operations
Computation sequencing
Master bus operations
Address generation
Pixel computation

Testbench including the accelerator
Bus functional processor model
Simplified memory and bus arbiter models

Sobel Verification Testbench


[Figure: testbench block diagram with Processor BFM, Sobel Accelerator, Arbiter, Multiplexed Bus (Muxes and Connections) and Memory Model]

Processor Bus Functional Model


processor_bfm : process is

  procedure bus_write ( adr : in unsigned(22 downto 0);
                        dat : in std_logic_vector(31 downto 0) ) is
  begin
    cpu_adr_o <= adr;
    cpu_sel_o <= "1111";
    cpu_dat_o <= dat;
    cpu_cyc_o <= '1'; cpu_stb_o <= '1'; cpu_we_o <= '1';
    wait until rising_edge(clk) and cpu_ack_i = '1';
  end procedure bus_write;

begin
  cpu_adr_o <= (others => '0');
  cpu_sel_o <= "0000";
  cpu_dat_o <= (others => '0');
  cpu_cyc_o <= '0'; cpu_stb_o <= '0'; cpu_we_o <= '0';
  wait until rising_edge(clk) and rst = '0';

  -- Write 008000 (hex) to O_base_addr register
  bus_write(sobel_reg_base + sobel_O_base_reg_offset, X"00008000");
  -- Write 053000 + 280 (hex) to D_base_addr register
  bus_write(sobel_reg_base + sobel_D_base_reg_offset, X"00053280");
  -- Write 1 to interrupt control register (enable interrupt)
  bus_write(sobel_reg_base + sobel_int_reg_offset, X"00000001");
  ...

Processor Bus Functional Model


  -- Write to start register (data value ignored)
  bus_write(sobel_reg_base + sobel_start_reg_offset, X"00000000");
  -- End of write operations
  cpu_cyc_o <= '0'; cpu_stb_o <= '0'; cpu_we_o <= '0';

  loop
    wait for 10 us;
    wait until rising_edge(clk);
    -- Read status register
    cpu_adr_o <= sobel_reg_base + sobel_status_reg_offset;
    cpu_sel_o <= "1111";
    cpu_cyc_o <= '1'; cpu_stb_o <= '1'; cpu_we_o <= '0';
    wait until rising_edge(clk) and cpu_ack_i = '1';
    cpu_cyc_o <= '0'; cpu_stb_o <= '0'; cpu_we_o <= '0';
    exit when cpu_dat_i(0) = '1';
  end loop;

  wait;
end process processor_bfm;

Memory Bus Functional Model


mem : process is
begin
  mem_ack_o <= '0';
  mem_dat_o <= X"00000000";
  wait until rising_edge(clk) and bus_cyc = '1' and mem_stb_i = '1';
  if bus_we = '0' then
    mem_dat_o <= X"00000000";  -- in place of read data
  end if;
  mem_ack_o <= '1';
  wait until rising_edge(clk);
end process mem;

Bus Arbiter
Uses sobel_cyc_o and cpu_cyc_o as request inputs
If both request at the same time, give accelerator priority

Mealy FSM

Bus Arbiter
arbiter_fsm_reg : process (clk) is ...

arbiter_logic : process (arbiter_current_state, sobel_cyc_o, cpu_cyc_o) is
begin
  case arbiter_current_state is
    when sobel =>
      if sobel_cyc_o = '1' then
        sobel_gnt <= '1'; cpu_gnt <= '0'; arbiter_next_state <= sobel;
      elsif sobel_cyc_o = '0' and cpu_cyc_o = '1' then
        sobel_gnt <= '0'; cpu_gnt <= '1'; arbiter_next_state <= cpu;
      else
        sobel_gnt <= '0'; cpu_gnt <= '0'; arbiter_next_state <= sobel;
      end if;
    when cpu =>
      if cpu_cyc_o = '1' then
        sobel_gnt <= '0'; cpu_gnt <= '1'; arbiter_next_state <= cpu;
      elsif sobel_cyc_o = '1' and cpu_cyc_o = '0' then
        sobel_gnt <= '1'; cpu_gnt <= '0'; arbiter_next_state <= sobel;
      else
        sobel_gnt <= '0'; cpu_gnt <= '0'; arbiter_next_state <= sobel;
      end if;
  end case;
end process arbiter_logic;

Simulation Results
See waveforms in textbook
Demonstrates sequencing and address generation

But what about
Data values computed correctly
Interactions between processor and accelerator

Need to use more sophisticated verification techniques
Due to complexity of the design
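One possible way to check computed data values is to compare the accelerator's output against a behavioral reference written directly from the pseudocode. The package below is a sketch under stated assumptions, not part of the original design: the names sobel_ref, window_t, mask_t and sobel_pixel are hypothetical, and the division by 8 mirrors the right shift by 3 used in the pixel datapath.

```vhdl
-- Hypothetical reference model for testbench use (not synthesizable intent).
package sobel_ref is
  type window_t is array (-1 to 1, -1 to 1) of natural range 0 to 255;
  type mask_t   is array (-1 to 1, -1 to 1) of integer range -2 to 2;

  -- Sobel masks, indexed (row offset, column offset)
  constant Gx : mask_t := ( (-1, 0, 1), (-2, 0, 2), (-1, 0, 1) );
  constant Gy : mask_t := ( ( 1, 2, 1), ( 0, 0, 0), (-1, -2, -1) );

  -- Returns the scaled |D| value (0 to 255) for one 3 x 3 pixel window
  function sobel_pixel (w : window_t) return natural;
end package sobel_ref;

package body sobel_ref is

  function sobel_pixel (w : window_t) return natural is
    variable sumx, sumy : integer := 0;
  begin
    for i in -1 to 1 loop
      for j in -1 to 1 loop
        sumx := sumx + w(i, j) * Gx(i, j);
        sumy := sumy + w(i, j) * Gy(i, j);
      end loop;
    end loop;
    -- |Dx| + |Dy| is at most 2040, so the scaled result fits in 8 bits
    return (abs(sumx) + abs(sumy)) / 8;
  end function sobel_pixel;

end package body sobel_ref;
```

A testbench could fill a window_t from the memory model's image data, call sobel_pixel, and use an assert statement to compare the result with each byte the accelerator writes back.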

Summary
Accelerators boost performance using parallel hardware
Replication, pipelining, ...

Amdahl's Law
Best payback from accelerating a kernel

DMA avoids processor overhead
Verification requires advanced techniques