Chapter 09 Notes

Digital Design: An Embedded Systems Approach Using VHDL
Chapter 9 Accelerators
Portions of this work are from the book, Digital Design: An Embedded Systems Approach Using VHDL, by Peter J. Ashenden, published by Morgan Kaufmann Publishers, Copyright 2007 Elsevier Inc. All rights reserved.
VHDL
Performance and Parallelism

A processor core performs steps in sequence
    Performance limited by the instruction rate
Accelerating performance
    Perform steps in parallel
    Takes less time overall to complete an operation
Instruction-level parallelism
    Within a processor core
    Pipelining, multiple-issue
Accelerators
    Custom hardware for parallel operations

Achievable Parallelism
How many steps can be performed at once?
Regularly structured data
    Independent processing steps
    Examples: video and image pixel processing, audio or sensor signal processing
Constrained by data dependencies
    Operations that depend on results of previous steps

Algorithm Kernels
Algorithm: specification of the required processing steps
    Often expressed in a programming language
Kernel: the part that involves the most intensive, repetitive processing
    10% of operations take 90% of the time
Accelerating a kernel with parallel hardware gives the best payback


Amdahl's Law
Time for an algorithm is t
    Fraction f is spent on a kernel
    $t = ft + (1 - f)t$
Accelerator speeds up kernel by a factor s
    $t' = \frac{ft}{s} + (1 - f)t$
Overall speedup factor s'
    $s' = \frac{t}{t'} = \frac{1}{\frac{f}{s} + (1 - f)}$
For large f, $s' \approx s$
For small f, $s' \approx 1$

Amdahl's Law Example
An algorithm with two kernels
    Kernel 1: 80% of time, can be sped up 10 times
    Kernel 2: 15% of time, can be sped up 100 times
Which speedup gives best overall improvement?
For kernel 1:
    $s' = \frac{1}{\frac{0.8}{10} + (1 - 0.8)} = \frac{1}{0.08 + 0.2} = 3.57$
For kernel 2:
    $s' = \frac{1}{\frac{0.15}{100} + (1 - 0.15)} = \frac{1}{0.0015 + 0.85} = 1.17$
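The two results can be checked numerically. The following is a minimal sketch, not part of the original notes; the function names `overall_speedup` and `speedup_limit` are chosen here for illustration. It also evaluates the upper bound 1 / (1 - f) that the formula approaches as s grows without limit.

```python
def overall_speedup(f: float, s: float) -> float:
    """Amdahl's Law: overall speedup s' when a kernel taking fraction f
    of the run time is accelerated by a factor s."""
    if not 0.0 <= f <= 1.0 or s <= 0.0:
        raise ValueError("need 0 <= f <= 1 and s > 0")
    return 1.0 / (f / s + (1.0 - f))


def speedup_limit(f: float) -> float:
    """Upper bound on s' as s grows without limit: 1 / (1 - f)."""
    return float("inf") if f == 1.0 else 1.0 / (1.0 - f)


kernel1 = overall_speedup(0.80, 10)    # 80% of time, 10 times faster
kernel2 = overall_speedup(0.15, 100)   # 15% of time, 100 times faster
print(f"kernel 1: {kernel1:.2f}, limit {speedup_limit(0.80):.2f}")
print(f"kernel 2: {kernel2:.2f}, limit {speedup_limit(0.15):.2f}")
```

Even an infinitely fast accelerator for kernel 2 could not beat the modest 10-times accelerator for kernel 1, because kernel 2 accounts for so little of the total time.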

Parallel Architectures
An architecture for an accelerator specifies
    Processing blocks
    Data flow between them
Parallelism through replication
    Multiple identical blocks operating on different data elements
    Works well when elements can be processed independently

Parallel Architectures
Parallelism through pipelining
    Break a computation into steps, perform them in assembly-line fashion
    Latency (time to complete a single operation) is ideally unchanged
    Throughput (rate of completion of operations) is increased
        Ideally by a factor equal to the number of pipeline stages
[Figure: pipeline with data in, step 1, step 2, step 3, data out]
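The throughput claim can be illustrated with a small timing model. This is a sketch under idealized assumptions that are not stated in the notes: every stage takes the same time and pipeline registers add no delay. The function names are hypothetical.

```python
def unpipelined_time(n_items: int, stages: int, stage_time_ns: float) -> float:
    """Each item passes through every step before the next item starts."""
    return n_items * stages * stage_time_ns


def pipelined_time(n_items: int, stages: int, stage_time_ns: float) -> float:
    """The first result appears after `stages` steps, then one result per step."""
    if n_items == 0:
        return 0.0
    return (stages + n_items - 1) * stage_time_ns


# Three stages of 10 ns each, as in the three-step pipeline above.
latency = pipelined_time(1, 3, 10.0)            # time for a single operation
speedup = unpipelined_time(1000, 3, 10.0) / pipelined_time(1000, 3, 10.0)
print(f"latency {latency} ns, speedup for 1000 items {speedup:.3f}")
```

For a long run of items the speedup approaches the number of stages, while the time for any single item stays at three stage delays.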

Direct Memory Access (DMA)
Input/output data for accelerators must be transferred at high speed
    Using the processor would be too slow
Direct memory access
    I/O controller and accelerator transfer data to and from memory autonomously
    Program supplies starting address and length

Bus Arbitration
Bus masters take turns to use bus to access slaves
    Controlled by a bus arbiter
Arbitration policies
    Priority, round-robin, ...
[Figure: processor, accelerator, and controller each have request and grant signals to the arbiter; all share the memory bus to memory]

Block-Processing Accelerator
Data arranged in regular groups of contiguous memory locations
    Accelerator works block by block
    E.g., images in blocks of 8 × 8 16-bit pixels
Datapath comprises
    Memory access: address generation, counters
    Computation section
    Control section: finite-state machine(s)

Stream-Processing Accelerator
Streams of data from an input source
    E.g., high-speed sensors
Digital signal processing (DSP)
    Analog sensor signal converted to stream of digital sample values
    Filtering, gain/attenuation, frequency-domain conversion (Fourier transform)

Processor/Accelerator Interface
Embedded software controls an accelerator
    Providing control parameters
    Synchronizing operations
Input/output registers and interrupts
    Interact with the control sequencer

Case Study: Edge Detection
Illustration of accelerator design
Edge detection in video processing
    Identify where image intensity changes abruptly
    Typically at the boundary of objects
    First step in identifying objects in a scene
Application areas
    Video surveillance, computer vision, ...
For this case study
    Monochrome images of 640 × 480 8-bit pixels
    Stored row-by-row in memory
    Pixel values: 0 (black) to 255 (white)

Sobel Edge Detection
Compute derivatives of intensity in x and y directions
    Look for minima and maxima (where intensity changes most rapidly)

The Sobel Algorithm
Use convolution to approximate partial derivatives Dx and Dy at each position
    Weighted sum of value of a pixel and its eight nearest neighbors
    Coefficients represented using a 3 × 3 convolution mask
Sobel masks for x and y derivatives
    $G_x = \begin{bmatrix} -1 & 0 & +1 \\ -2 & 0 & +2 \\ -1 & 0 & +1 \end{bmatrix}$
    $G_y = \begin{bmatrix} +1 & +2 & +1 \\ 0 & 0 & 0 \\ -1 & -2 & -1 \end{bmatrix}$
    $D_x(i, j) = O(i, j) \ast G_x$
    $D_y(i, j) = O(i, j) \ast G_y$

The Sobel Algorithm
Combine partial derivatives
    $|D| = \sqrt{D_x^2 + D_y^2}$
Since we just want maxima and minima in magnitude, approximate as:
    $|D| \approx |D_x| + |D_y|$
Pixels on the image boundary don't have eight neighbors
    Skip computation of |D| for boundary pixels
    Just set them to 0 using software

The Algorithm in Pseudocode
for row in 1 to 478 loop
  for col in 1 to 638 loop
    sumx := 0; sumy := 0;
    for i in -1 to +1 loop
      for j in -1 to +1 loop
        sumx := sumx + O(row+i, col+j) * Gx(i, j);
        sumy := sumy + O(row+i, col+j) * Gy(i, j);
      end loop
    end loop
    D(row, col) := abs(sumx) + abs(sumy);
  end loop
end loop
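The pseudocode can be turned into a small software reference model, which is useful for checking data values produced by the hardware. The following is a sketch rather than the textbook's code: the function name `sobel` and the list-of-lists image representation are assumptions made here, and the mask coefficients are those used by the Dx and Dy expressions in the pixel datapath code in these notes.

```python
# Mask coefficients indexed [i + 1][j + 1] for i, j in -1..+1.
GX = ((-1, 0, +1), (-2, 0, +2), (-1, 0, +1))
GY = ((+1, +2, +1), (0, 0, 0), (-1, -2, -1))


def sobel(image):
    """Return |Dx| + |Dy| for each interior pixel; boundary pixels stay 0."""
    rows, cols = len(image), len(image[0])
    d = [[0] * cols for _ in range(rows)]
    for row in range(1, rows - 1):
        for col in range(1, cols - 1):
            sumx = sumy = 0
            for i in (-1, 0, 1):
                for j in (-1, 0, 1):
                    pixel = image[row + i][col + j]
                    sumx += pixel * GX[i + 1][j + 1]
                    sumy += pixel * GY[i + 1][j + 1]
            d[row][col] = abs(sumx) + abs(sumy)
    return d


# A vertical black-to-white step edge in a tiny 4 x 4 test image.
step = [[0, 0, 255, 255] for _ in range(4)]
flat = [[100] * 4 for _ in range(4)]
print(sobel(step)[1])
print(sobel(flat)[1])
```

For a 640 × 480 image the loops cover rows 1 to 478 and columns 1 to 638, matching the pseudocode.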

Data Formats and Rates
Pixel values: 0 to 255 (8 bits)
Coefficients are 0, ±1 and ±2
    Partial products: -510 to +510 (10 bits)
    Dx and Dy: -1020 to +1020 (11 bits)
    |D|: 0 to 2040 (11 bits)
    Final pixel value: scale back to 8 bits
Video rate: 30 frames/sec
    640 × 480 = 307,200 pixels
    307,200 × 30 = 9,216,000 ≈ 10 million pixels/sec
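The ranges and bit widths above can be rechecked with a few lines of arithmetic. This is an illustrative sketch with hypothetical helper names. The shift by 3 used for the final scaling is taken from the `srl 3` in the pixel datapath code in these notes, not from this slide.

```python
def signed_bits(lo: int, hi: int) -> int:
    """Smallest two's-complement width that holds every value in [lo, hi]."""
    bits = 1
    while not (-(1 << (bits - 1)) <= lo and hi <= (1 << (bits - 1)) - 1):
        bits += 1
    return bits


def unsigned_bits(hi: int) -> int:
    """Smallest unsigned width that holds every value in [0, hi]."""
    return max(1, hi.bit_length())


PIXEL_MAX = 255
partial_max = 2 * PIXEL_MAX   # largest coefficient magnitude is 2
deriv_max = 4 * PIXEL_MAX     # positive coefficients of a mask sum to 4
mag_max = 2 * deriv_max       # |Dx| + |Dy|

print(signed_bits(-partial_max, partial_max))   # partial products
print(signed_bits(-deriv_max, deriv_max))       # Dx and Dy
print(unsigned_bits(mag_max))                   # |D|
print(mag_max >> 3)                             # scaled final pixel value
print(640 * 480 * 30)                           # pixels per second
```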

Data Dependencies
Pixels can be computed independently
For each pixel:

System Architecture
Data dependencies suggest a pipeline
    Coefficient multiplies are simple shift/negate, so merge with adder stage

Memory Bandwidth
Assume memory read/write takes 20ns (2 cycles of 100MHz clock)
    Memory is 32 bits wide, byte addressable
    Bandwidth = 50M operations/sec
Camera produces 10 Mpixels/sec
    Accelerator needs to process at this rate
    (8 reads + 1 write) × 10 Mpixel/sec = 90M operations/sec
    Greater than memory bandwidth

Memory Bandwidth
Read 4 pixels at once from each of previous, current, and next rows
    Store in accelerator to compute multiple derivative image pixels
Produce derivative pixels row-by-row, left-to-right
    Read 3 32-bit words for every 4th derivative pixel computed
    Write 4 pixels at a time
    (3 reads + 1 write) / 4 × 10 Mpixel/sec = 10M operations/sec = 20% of available memory bandwidth
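The two bandwidth estimates can be reproduced directly from the figures in the notes. This is a short sketch; the constant names are chosen here for illustration, and the pixel rate is the rounded 10 Mpixel/sec figure.

```python
MEM_OP_NS = 20            # one memory read or write takes 20 ns
PIXEL_RATE = 10_000_000   # rounded camera rate in pixels per second

mem_ops_per_sec = 1_000_000_000 // MEM_OP_NS   # available bandwidth

# Pixel at a time: 8 neighbor reads and 1 result write per pixel.
naive_ops = (8 + 1) * PIXEL_RATE

# Word at a time: 3 word reads and 1 word write per group of 4 pixels.
block_ops = (3 + 1) * PIXEL_RATE // 4

print(naive_ops / mem_ops_per_sec)   # fraction of bandwidth needed, above 1
print(block_ops / mem_ops_per_sec)   # fraction of bandwidth needed
```

Packing four 8-bit pixels into each 32-bit access is what brings the demand from well above the available bandwidth down to a fifth of it.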

Sobel Accelerator Architecture

Accelerator Sequence
Steady state
    Write 4 result pixels
    Read 4 pixels for previous, current, next rows
    Compute for 4 cycles
    Repeat
Start of row
    Omit writes until pipeline full
End of row
    Omit reads to drain pipeline

Memory Operation Timing
Steady state

Pixel Datapath
-- Computation datapath signals
type pixel_array is array (-1 to +1, -1 to +1) of unsigned(7 downto 0);
signal prev_row, curr_row, next_row : unsigned(31 downto 0);
signal O : pixel_array;
signal Dx, Dy : signed(10 downto 0);
signal abs_D : unsigned(7 downto 0);
signal result_row : unsigned(31 downto 0);
...

Pixel Datapath
-- Computational datapath
prev_row_reg : process (clk_i) is
begin
  if rising_edge(clk_i) then
    if prev_row_load = '1' then
      prev_row <= unsigned(dat_i);
    elsif shift_en = '1' then
      prev_row(31 downto 8) <= prev_row(23 downto 0);
    end if;
  end if;
end process prev_row_reg;

curr_row_reg : process (clk_i) is ...
next_row_reg : process (clk_i) is ...

Pixel Datapath
pipeline : process (clk_i) is
begin
  if rising_edge(clk_i) then
    if shift_en = '1' then
      -- Second stage: |Dx| + |Dy| from the previous cycle, divided by 8 to fit 8 bits
      abs_D <= resize( (unsigned(abs Dx) + unsigned(abs Dy)) srl 3, 8 );
      -- First stage: weighted sums using the Gx and Gy masks (sll 1 multiplies by 2)
      Dx <= - signed(resize(O(-1, -1), 11))
            + signed(resize(O(-1, +1), 11))
            - (signed(resize(O( 0, -1), 11)) sll 1)
            + (signed(resize(O( 0, +1), 11)) sll 1)
            - signed(resize(O(+1, -1), 11))
            + signed(resize(O(+1, +1), 11));
      Dy <= signed(resize(O(-1, -1), 11))
            + (signed(resize(O(-1, 0), 11)) sll 1)
            + signed(resize(O(-1, +1), 11))
            - signed(resize(O(+1, -1), 11))
            - (signed(resize(O(+1, 0), 11)) sll 1)
            - signed(resize(O(+1, +1), 11));
      ...

Pixel Datapath
      O(-1, -1) <= O(-1, 0);
      O(-1, 0) <= O(-1, +1);
      O(-1, +1) <= prev_row(31 downto 24);
      O( 0, -1) <= O(0, 0);
      O( 0, 0) <= O(0, +1);
      O( 0, +1) <= curr_row(31 downto 24);
      O(+1, -1) <= O(+1, 0);
      O(+1, 0) <= O(+1, +1);
      O(+1, +1) <= next_row(31 downto 24);
    end if;
  end if;
end process pipeline;

result_row_reg : process (clk_i) is
begin
  if rising_edge(clk_i) then
    if shift_en = '1' then
      result_row <= result_row(23 downto 0) & abs_D;
    end if;
  end if;
end process result_row_reg;
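The way pixels move from the row registers into the 3 × 3 window can be modeled in software. The following is a behavioral sketch, not the textbook's code: the function name `shift_step` is hypothetical, and it assumes that the elided `curr_row_reg` and `next_row_reg` processes shift in the same way as `prev_row_reg`.

```python
MASK32 = 0xFFFFFFFF


def shift_step(window, row_regs):
    """Model one clock edge with shift_en = '1'.

    window:   three lists [left, center, right], one per image row
              (previous, current, next).
    row_regs: three 32-bit integers, each holding four packed pixels.
    Returns the new (window, row_regs).
    """
    new_window = []
    new_regs = []
    for taps, reg in zip(window, row_regs):
        incoming = (reg >> 24) & 0xFF              # reg(31 downto 24)
        new_window.append([taps[1], taps[2], incoming])
        # Bits 31..8 take bits 23..0; the VHDL leaves bits 7..0 unchanged.
        new_regs.append(((reg << 8) & MASK32) | (reg & 0xFF))
    return new_window, new_regs


window = [[0, 0, 0] for _ in range(3)]
regs = [0x0A0B0C0D, 0x1A1B1C1D, 0x2A2B2C2D]
for _ in range(3):
    window, regs = shift_step(window, regs)
print([[hex(p) for p in row] for row in window])
```

After three shifts the window holds the first three pixels of each word, with the most significant byte of each register consumed first.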

Address Generation
Given an image in memory at base address B
    Address for pixel in row r, column c is B + r × 640 + c
    Base address (B) is fixed
    Offset (r × 640 + c) increments by 4 for each group of 4 pixels read/written
Use word-aligned addresses
    Two least-significant bits always 00
    Increment word address by 1

Address Generation

Address Generation
O_base_reg : process (clk_i) is
begin
  if rising_edge(clk_i) then
    if O_base_ce = '1' then
      O_base <= unsigned(dat_i(21 downto 2));
    end if;
  end if;
end process O_base_reg;

O_offset_counter : process (clk_i) is
begin
  if rising_edge(clk_i) then
    if offset_reset = '1' then
      O_offset <= (others => '0');
    elsif O_offset_cnt_en = '1' then
      O_offset <= O_offset + 1;
    end if;
  end if;
end process O_offset_counter;
...

Address Generation
D_base_reg : process (clk_i) is
begin
  if rising_edge(clk_i) then
    if D_base_ce = '1' then
      D_base <= unsigned(dat_i(21 downto 2));
    end if;
  end if;
end process D_base_reg;

D_offset_counter : process (clk_i) is
begin
  if rising_edge(clk_i) then
    if offset_reset = '1' then
      D_offset <= (others => '0');
    elsif D_offset_cnt_en = '1' then
      D_offset <= D_offset + 1;
    end if;
  end if;
end process D_offset_counter;
...

Address Generation
O_prev_addr <= O_base + O_offset;
O_curr_addr <= O_prev_addr + 640/4;
O_next_addr <= O_prev_addr + 1280/4;
D_addr <= D_base + D_offset;

adr_o(21 downto 2) <= O_prev_addr when prev_row_load = '1' else
                      O_curr_addr when curr_row_load = '1' else
                      O_next_addr when next_row_load = '1' else
                      D_addr;
adr_o(1 downto 0) <= "00";
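The relationship between the word addresses formed by these assignments and the byte address formula B + r × 640 + c can be checked with a small model. This is a sketch with hypothetical function names; it ignores the wrap-around of the 20-bit VHDL address signals, and the base address 0x8000 is the value the testbench in these notes writes to the O_base register.

```python
ROW_BYTES = 640   # one 8-bit pixel per byte, 640 pixels per row


def pixel_address(base: int, r: int, c: int) -> int:
    """Byte address of the pixel in row r, column c."""
    return base + r * ROW_BYTES + c


def row_word_addresses(o_base_word: int, o_offset: int):
    """Word addresses for the previous, current and next rows."""
    o_prev = o_base_word + o_offset
    o_curr = o_prev + ROW_BYTES // 4
    o_next = o_prev + 2 * ROW_BYTES // 4
    return o_prev, o_curr, o_next


def byte_address(word_address: int) -> int:
    """Append the two always-zero least-significant bits."""
    return word_address << 2


base = 0x8000
o_base_word = base >> 2   # the register stores bits 21 downto 2
prev_w, curr_w, next_w = row_word_addresses(o_base_word, 1)
print(hex(byte_address(prev_w)), hex(byte_address(curr_w)), hex(byte_address(next_w)))
```

An offset of 1 word corresponds to column 4, so the three addresses point at the second group of four pixels in rows 0, 1 and 2.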

Control/Status Registers
Register  Offset  Read/Write  Purpose
Int_en    0       Write-only  Interrupt enable (bit 0).
Start     4       Write-only  Write causes image processing to start (value ignored).
O_base    8       Write-only  Original image base address.
D_base    12      Write-only  Derivative image base address + 640.
Status    0       Read-only   Processing done (bit 0). Reading clears interrupt.

Slave Bus Interface
start <= '1' when cyc_i = '1' and stb_i = '1' and we_i = '1' and adr_i = "01" else '0';
O_base_ce <= '1' when cyc_i = '1' and stb_i = '1' and we_i = '1' and adr_i = "10" else '0';
D_base_ce <= '1' when cyc_i = '1' and stb_i = '1' and we_i = '1' and adr_i = "11" else '0';

int_reg : process (clk_i) is
begin
  if rising_edge(clk_i) then
    if rst_i = '1' then
      int_en <= '0';
    elsif cyc_i = '1' and stb_i = '1' and we_i = '1' and adr_i = "00" then
      int_en <= dat_i(0);
    end if;
  end if;
end process int_reg;
...

Slave Bus Interface
status_reg : process (clk_i) is
begin
  if rising_edge(clk_i) then
    if rst_i = '1' then
      done <= '0';
    elsif done_set = '1' then
      -- This occurs when last write is acknowledged,
      -- and so cannot coincide with a read of the
      -- status register.
      done <= '1';
    elsif cyc_i = '1' and stb_i = '1' and we_i = '0' and adr_i = "00" and ack_o_tmp = '1' then
      done <= '0';
    end if;
  end if;
end process status_reg;

int_req <= int_en and done;
...

Slave Bus Interface
ack_gen : process (clk_i) is
begin
  if rising_edge(clk_i) then
    ack_o_tmp <= cyc_i and stb_i and not ack_o_tmp;
  end if;
end process ack_gen;

ack_o <= ack_o_tmp;

-- Wishbone data output multiplexer
dat_o <= (31 downto 1 => '0') & done   -- status register read
           when cyc_i = '1' and stb_i = '1' and we_i = '0' and adr_i = "00" else
         (others => '0')               -- other registers read as 0
           when cyc_i = '1' and stb_i = '1' and we_i = '0' and adr_i /= "00" else
         std_logic_vector(result_row); -- for master write
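The register behavior seen by software can be summarized in a small behavioral model. This is a sketch under simplifying assumptions: it ignores clock and handshake timing, the class name `SobelRegisters` and its method names are hypothetical, and `finish` stands in for the hardware asserting `done_set`.

```python
class SobelRegisters:
    """Software view of the control/status registers (byte offsets 0 to 12)."""

    def __init__(self):
        self.int_en = 0
        self.done = 0
        self.o_base = 0      # word address, bits 21 downto 2 of the data
        self.d_base = 0
        self.started = False

    def write(self, offset: int, value: int) -> None:
        if offset == 0:
            self.int_en = value & 1
        elif offset == 4:
            self.started = True          # data value ignored
        elif offset == 8:
            self.o_base = (value >> 2) & 0xFFFFF
        elif offset == 12:
            self.d_base = (value >> 2) & 0xFFFFF
        else:
            raise ValueError("no register at this offset")

    def read(self, offset: int) -> int:
        if offset == 0:
            value = self.done
            self.done = 0                # reading status clears done
            return value
        return 0                         # other registers read as 0

    def finish(self) -> None:
        self.done = 1                    # models done_set

    @property
    def int_req(self) -> int:
        return self.int_en & self.done


regs = SobelRegisters()
regs.write(8, 0x00008000)
regs.write(12, 0x00053280)
regs.write(0, 1)
regs.write(4, 0)
regs.finish()
print(regs.int_req, regs.read(0), regs.int_req)
```

The usage lines follow the same order of writes as the processor bus functional model in these notes: base addresses, interrupt enable, then start.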

Control Sequencing
Use a finite-state machine
Counters keep track of rows (0 to 477) and columns (0 to 159)
See textbook for details of FSM output functions

State Transition Diagram

Accelerator Verification
Simulation-based verification of each section of the accelerator
    Slave bus operations
    Computation sequencing
    Master bus operations
    Address generation
    Pixel computation
Testbench including the accelerator
    Bus functional processor model
    Simplified memory and bus arbiter models

Sobel Verification Testbench
[Figure: testbench block diagram with Processor BFM, Sobel Accelerator, Arbiter, and Memory Model connected by a multiplexed bus (muxes and connections)]

Processor Bus Functional Model
processor_bfm : process is

  procedure bus_write ( adr : in unsigned(22 downto 0);
                        dat : in std_logic_vector(31 downto 0) ) is
  begin
    cpu_adr_o <= adr;
    cpu_sel_o <= "1111";
    cpu_dat_o <= dat;
    cpu_cyc_o <= '1'; cpu_stb_o <= '1'; cpu_we_o <= '1';
    wait until rising_edge(clk) and cpu_ack_i = '1';
  end procedure bus_write;

begin
  cpu_adr_o <= (others => '0');
  cpu_sel_o <= "0000";
  cpu_dat_o <= (others => '0');
  cpu_cyc_o <= '0'; cpu_stb_o <= '0'; cpu_we_o <= '0';
  wait until rising_edge(clk) and rst = '0';
  -- Write 008000 (hex) to O_base_addr register
  bus_write(sobel_reg_base + sobel_O_base_reg_offset, X"00008000");
  -- Write 053000 + 280 (hex) to D_base_addr register
  bus_write(sobel_reg_base + sobel_D_base_reg_offset, X"00053280");
  -- Write 1 to interrupt control register (enable interrupt)
  bus_write(sobel_reg_base + sobel_int_reg_offset, X"00000001");
  ...

Processor Bus Functional Model
  -- Write to start register (data value ignored)
  bus_write(sobel_reg_base + sobel_start_reg_offset, X"00000000");
  -- End of write operations
  cpu_cyc_o <= '0'; cpu_stb_o <= '0'; cpu_we_o <= '0';
  loop
    wait for 10 us;
    wait until rising_edge(clk);
    -- Read status register
    cpu_adr_o <= sobel_reg_base + sobel_status_reg_offset;
    cpu_sel_o <= "1111";
    cpu_cyc_o <= '1'; cpu_stb_o <= '1'; cpu_we_o <= '0';
    wait until rising_edge(clk) and cpu_ack_i = '1';
    cpu_cyc_o <= '0'; cpu_stb_o <= '0'; cpu_we_o <= '0';
    exit when cpu_dat_i(0) = '1';
  end loop;
  wait;
end process processor_bfm;

Memory Bus Functional Model
mem : process is
begin
  mem_ack_o <= '0';
  mem_dat_o <= X"00000000";
  wait until rising_edge(clk) and bus_cyc = '1' and mem_stb_i = '1';
  if bus_we = '0' then
    mem_dat_o <= X"00000000"; -- in place of read data
  end if;
  mem_ack_o <= '1';
  wait until rising_edge(clk);
end process mem;

Bus Arbiter
Uses sobel_cyc_o and cpu_cyc_o as request inputs
    If both request at the same time, give accelerator priority
Mealy FSM

Bus Arbiter
arbiter_fsm_reg : process (clk) is ...

arbiter_logic : process (arbiter_current_state, sobel_cyc_o, cpu_cyc_o) is
begin
  case arbiter_current_state is
    when sobel =>
      if sobel_cyc_o = '1' then
        sobel_gnt <= '1'; cpu_gnt <= '0'; arbiter_next_state <= sobel;
      elsif sobel_cyc_o = '0' and cpu_cyc_o = '1' then
        sobel_gnt <= '0'; cpu_gnt <= '1'; arbiter_next_state <= cpu;
      else
        sobel_gnt <= '0'; cpu_gnt <= '0'; arbiter_next_state <= sobel;
      end if;
    when cpu =>
      if cpu_cyc_o = '1' then
        sobel_gnt <= '0'; cpu_gnt <= '1'; arbiter_next_state <= cpu;
      elsif sobel_cyc_o = '1' and cpu_cyc_o = '0' then
        sobel_gnt <= '1'; cpu_gnt <= '0'; arbiter_next_state <= sobel;
      else
        sobel_gnt <= '0'; cpu_gnt <= '0'; arbiter_next_state <= sobel;
      end if;
  end case;
end process arbiter_logic;
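The arbiter's combinational logic can be exercised in software to see how the priority rule plays out. The following sketch mirrors the case statement above; the function name `arbiter` and the use of strings for states are choices made here for illustration.

```python
def arbiter(state: str, sobel_req: int, cpu_req: int):
    """Return (sobel_gnt, cpu_gnt, next_state) for the current state and requests."""
    if state == "sobel":
        if sobel_req:
            return 1, 0, "sobel"
        if cpu_req:
            return 0, 1, "cpu"
        return 0, 0, "sobel"
    if state == "cpu":
        if cpu_req:
            return 0, 1, "cpu"       # processor keeps the bus until it releases it
        if sobel_req:
            return 1, 0, "sobel"
        return 0, 0, "sobel"
    raise ValueError("unknown state")


# Both masters request while the arbiter is idle in the sobel state.
print(arbiter("sobel", 1, 1))
# Both request while the processor already holds the bus.
print(arbiter("cpu", 1, 1))
```

The accelerator wins a simultaneous request from the idle state, but a transfer already granted to the processor is not preempted. Because the grants depend on the current request inputs as well as the state, this is Mealy-style output logic.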

Simulation Results
See waveforms in textbook
    Demonstrates sequencing and address generation
But what about:
    Data values computed correctly
    Interactions between processor and accelerator
Need to use more sophisticated verification techniques
    Due to complexity of the design

Summary
Accelerators boost performance using parallel hardware
    Replication, pipelining, ...
    Amdahl's Law
    Best payback from accelerating a kernel
DMA avoids processor overhead
Verification requires advanced techniques

