09 Accelerators


Digital Design: An Embedded Systems Approach Using Verilog
Chapter 9: Accelerators

Portions of this work are from the book, Digital Design: An Embedded
Systems Approach Using Verilog, by Peter J. Ashenden, published by Morgan
Kaufmann Publishers, Copyright 2007 Elsevier Inc. All rights reserved.

Verilog

Performance and Parallelism

A processor core performs steps in sequence
  Performance limited by the instruction rate
Accelerating performance
  Perform steps in parallel
  Takes less time overall to complete an operation
Instruction-level parallelism
  Within a processor core
  Pipelining, multiple-issue
Accelerators
  Custom hardware for parallel operations

Digital Design Chapter 9 Accelerators


Achievable Parallelism

How many steps can be performed at once?
  Regularly structured data
  Independent processing steps
Examples
  Video and image pixel processing
  Audio or sensor signal processing
Constrained by data dependencies
  Operations that depend on results of previous steps

Algorithm Kernels

Algorithm: specification of the required processing steps
  Often expressed in a programming language
Kernel: the part that involves the most intensive, repetitive processing
  Rule of thumb: 10% of operations take 90% of the time
Accelerating a kernel with parallel hardware gives the best payback

Amdahl's Law

Time for an algorithm is t
Fraction f is spent on a kernel

\[ t = f t + (1 - f) t \]

Accelerator speeds up kernel by a factor s

\[ t' = \frac{f t}{s} + (1 - f) t \]

Overall speedup factor s'

\[ s' = \frac{t}{t'} = \frac{1}{\dfrac{f}{s} + (1 - f)} \]

For large f, s' ≈ s
For small f, s' ≈ 1

Amdahl's Law Example

An algorithm with two kernels
  Kernel 1: 80% of time, can be sped up 10 times
  Kernel 2: 15% of time, can be sped up 100 times
  Which speedup gives best overall improvement?

For kernel 1:

\[ s' = \frac{1}{\dfrac{0.8}{10} + (1 - 0.8)} = \frac{1}{0.08 + 0.2} = 3.57 \]

For kernel 2:

\[ s' = \frac{1}{\dfrac{0.15}{100} + (1 - 0.15)} = \frac{1}{0.0015 + 0.85} = 1.17 \]

Accelerating kernel 1 gives the better overall improvement, even though kernel 2 can be sped up ten times more.
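
The two-kernel comparison above can be checked with a few lines of Python. This is only an illustrative sketch: the function name `amdahl_speedup` and its argument checks are assumptions made here and are not part of the slides.

```python
def amdahl_speedup(f: float, s: float) -> float:
    """Overall speedup s' when a fraction f of the run time is sped up by a factor s."""
    if not 0.0 <= f <= 1.0:
        raise ValueError("f must be between 0 and 1")
    if s <= 0.0:
        raise ValueError("s must be positive")
    # Accelerated kernel time (f / s) plus the untouched remainder (1 - f).
    return 1.0 / (f / s + (1.0 - f))


kernel1 = amdahl_speedup(0.80, 10.0)    # 80% of the time, 10x faster
kernel2 = amdahl_speedup(0.15, 100.0)   # 15% of the time, 100x faster
print(f"kernel 1: {kernel1:.2f}x, kernel 2: {kernel2:.2f}x")
# prints "kernel 1: 3.57x, kernel 2: 1.17x"
```

The fraction of time spent in the kernel dominates the result: the smaller speedup factor applied to the larger kernel wins.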

Parallel Architectures

An architecture for an accelerator specifies
  Processing blocks
  Data flow between them
Parallelism through replication
  Multiple identical blocks operating on different data elements
  Works well when elements can be processed independently

Parallel Architectures

Parallelism through pipelining
  Break a computation into steps, perform them in assembly-line fashion
  Latency (time to complete a single operation) is not increased
  Throughput (rate of completion of operations) is increased
    Ideally by a factor equal to the number of pipeline stages

[Figure: data in -> step 1 -> step 2 -> step 3 -> data out]
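
A rough cycle count shows why the throughput gain approaches the number of stages. The sketch below assumes an ideal pipeline (one clock cycle per stage, no stalls) and compares it with a unit that finishes each item before starting the next. The function names are hypothetical.

```python
def sequential_cycles(num_items: int, num_stages: int) -> int:
    """Cycles when each item passes through all steps before the next item starts."""
    return num_items * num_stages


def pipelined_cycles(num_items: int, num_stages: int) -> int:
    """Cycles for an ideal pipeline: fill it once, then one result per cycle."""
    if num_items == 0:
        return 0
    return num_stages + (num_items - 1)


items, stages = 1000, 3
speedup = sequential_cycles(items, stages) / pipelined_cycles(items, stages)
print(f"throughput gain for {items} items, {stages} stages: {speedup:.2f}x")
```

A single item still takes `num_stages` cycles in both cases, which is the latency point made on the slide. Only the rate at which results appear improves.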

Direct Memory Access (DMA)

Input/output data for accelerators must be transferred at high speed
  Using the processor would be too slow
Direct memory access
  I/O controller and accelerator transfer data to and from memory autonomously
  Program supplies starting address and length

Bus Arbitration

Bus masters take turns to use bus to access slaves
  Controlled by a bus arbiter
Arbitration policies
  Priority, round-robin, ...

[Figure: processor, accelerator, and controller each exchange request and grant signals with the arbiter and share the memory bus to memory]

Block-Processing Accelerator

Data arranged in regular groups of contiguous memory locations
  Accelerator works block by block
  E.g., images in blocks of 8 × 8 16-bit pixels
Datapath comprises
  Memory access: address generation, counters
  Computation section
Control section: finite-state machine(s)

Stream-Processing Accelerator

Streams of data from an input source
  E.g., high-speed sensors
Digital signal processing (DSP)
  Analog sensor signal converted to stream of digital sample values
  Filtering, gain/attenuation, frequency-domain conversion (Fourier transform)

Processor/Accelerator Interface

Embedded software controls an accelerator
  Providing control parameters
  Synchronizing operations
Input/output registers and interrupts
  Interact with the control sequencer

Case Study: Edge Detection

Illustration of accelerator design: edge detection in video processing
  Identify where image intensity changes abruptly
  Typically at the boundary of objects
  First step in identifying objects in a scene
Application areas
  Video surveillance, computer vision, ...
For this case study
  Monochrome images of 640 × 480 8-bit pixels
  Stored row-by-row in memory
  Pixel values: 0 (black) to 255 (white)

Sobel Edge Detection

Compute derivatives of intensity in x and y directions
Look for minima and maxima (where intensity changes most rapidly)

The Sobel Algorithm

Use convolution to approximate partial derivatives Dx and Dy at each position
  Weighted sum of value of a pixel and its eight nearest neighbors
  Coefficients represented using a 3 × 3 convolution mask

Sobel masks for x and y derivatives

\[
G_x = \begin{bmatrix} -1 & 0 & +1 \\ -2 & 0 & +2 \\ -1 & 0 & +1 \end{bmatrix}
\qquad
G_y = \begin{bmatrix} +1 & +2 & +1 \\ 0 & 0 & 0 \\ -1 & -2 & -1 \end{bmatrix}
\]

\[ D_x(i, j) = O(i, j) \ast G_x \qquad D_y(i, j) = O(i, j) \ast G_y \]

The Sobel Algorithm

Combine partial derivatives

\[ |D| = \sqrt{D_x^2 + D_y^2} \]

Since we just want maxima and minima in magnitude, approximate as:

\[ |D| \approx |D_x| + |D_y| \]

Pixels at the image edge don't have eight neighbors
  Skip computation of |D| for these edge pixels
  Just set them to 0 using software

The Algorithm in Pseudocode


for (row = 1; row <= 478; row = row + 1) begin
  for (col = 1; col <= 638; col = col + 1) begin
    sumx = 0; sumy = 0;
    for (i = -1; i <= +1; i = i + 1) begin
      for (j = -1; j <= +1; j = j + 1) begin
        sumx = sumx + O[row+i][col+j] * Gx[i][j];
        sumy = sumy + O[row+i][col+j] * Gy[i][j];
      end
    end
    D[row][col] = abs(sumx) + abs(sumy);
  end
end
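
The pseudocode translates directly into a small software reference model, which is handy as a golden reference when checking accelerator output in simulation. This Python sketch is an assumption-laden illustration, not code from the book: the name `sobel_magnitude` is hypothetical, the image is a list of rows, and border pixels are left at 0 as the slides suggest. It returns the unscaled |D| values.

```python
# Sobel masks, indexed [i + 1][j + 1] for offsets i, j in -1..+1.
GX = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
GY = [[1, 2, 1], [0, 0, 0], [-1, -2, -1]]


def sobel_magnitude(image):
    """Return |Dx| + |Dy| for every interior pixel; border pixels stay 0."""
    rows, cols = len(image), len(image[0])
    result = [[0] * cols for _ in range(rows)]
    for row in range(1, rows - 1):
        for col in range(1, cols - 1):
            sumx = sumy = 0
            for i in (-1, 0, 1):
                for j in (-1, 0, 1):
                    pixel = image[row + i][col + j]
                    sumx += pixel * GX[i + 1][j + 1]
                    sumy += pixel * GY[i + 1][j + 1]
            result[row][col] = abs(sumx) + abs(sumy)
    return result


# A 4 x 4 test image with a vertical black-to-white edge down the middle.
edge_image = [[0, 0, 255, 255] for _ in range(4)]
edge_result = sobel_magnitude(edge_image)
print(edge_result[1])  # prints [0, 1020, 1020, 0]
```

For the full-size case the same function would be called on a 480-row by 640-column image, which reproduces the loop bounds 1..478 and 1..638.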

Data Formats and Rates

Pixel values: 0 to 255 (8 bits)
Coefficients are 0, ±1 and ±2
  Partial products: -510 to +510 (10 bits)
  Dx and Dy: -1020 to +1020 (11 bits)
  |D|: 0 to 2040 (11 bits)
  Final pixel value: scale back to 8 bits
Video rate: 30 frames/sec
  640 × 480 = 307,200 pixels
  307,200 × 30 ≈ 10 million pixels/sec
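
The value ranges and bit widths listed above follow from the mask coefficients and the 8-bit pixel range. The following Python sketch re-derives them. The helper `signed_bits` is a hypothetical name introduced here, and the mask is the Gx mask implied by the Dx expression in the pixel datapath code.

```python
PIXEL_MAX = 255
GX = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]


def signed_bits(lo: int, hi: int) -> int:
    """Smallest two's-complement width that holds every integer in [lo, hi]."""
    bits = 1
    while lo < -(1 << (bits - 1)) or hi > (1 << (bits - 1)) - 1:
        bits += 1
    return bits


# Largest single product: biggest coefficient magnitude times brightest pixel.
partial_max = max(abs(c) for row in GX for c in row) * PIXEL_MAX
# Largest derivative: all positive-coefficient pixels white, all others black.
deriv_max = sum(c for row in GX for c in row if c > 0) * PIXEL_MAX
# Upper bound on |Dx| + |Dy|, as listed on the slide.
magnitude_max = 2 * deriv_max

print(partial_max, signed_bits(-partial_max, partial_max))  # prints 510 10
print(deriv_max, signed_bits(-deriv_max, deriv_max))        # prints 1020 11
print(magnitude_max, magnitude_max.bit_length())            # prints 2040 11

pixels_per_sec = 640 * 480 * 30
print(pixels_per_sec)                                       # prints 9216000
```

The exact pixel rate is 9,216,000 pixels/sec, which the slides round up to 10 million for the bandwidth estimates.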

Data Dependencies

Pixels can be computed independently
For each pixel:

[Figure: data dependencies for a single pixel]

System Architecture

Data dependencies suggest a pipeline
  Coefficient multiplies are simple shift/negate, so merge with adder stage

Memory Bandwidth

Assume memory read/write takes 20 ns (2 cycles of 100 MHz clock)
  Memory is 32 bits wide, byte addressable
  Bandwidth = 50M operations/sec
Camera produces 10 Mpixels/sec
  Accelerator needs to process at this rate
  (8 reads + 1 write) × 10 Mpixels/sec = 90M operations/sec
  Greater than memory bandwidth

Memory Bandwidth

Read 4 pixels at once from each of previous, current, and next rows
  Store in accelerator to compute multiple derivative image pixels
Produce derivative pixels row-by-row, left-to-right
  Read three 32-bit words for every 4th derivative pixel computed
  Write 4 pixels at a time
  (3 reads + 1 write) / 4 × 10 Mpixels/sec = 10M operations/sec = 20% of available memory bandwidth
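
The bandwidth argument is simple arithmetic, reproduced below as a Python sketch. All constant names are made up for this illustration. The figures (20 ns access time, 10 Mpixels/sec, 8 reads plus 1 write per pixel, 3 reads plus 1 write per group of 4 pixels) come from the slides.

```python
ACCESS_TIME_NS = 20            # one memory read or write
PIXEL_RATE = 10_000_000        # pixels per second from the camera (rounded)
PIXELS_PER_WORD = 4            # 32-bit word holds four 8-bit pixels

# Memory operations the memory can sustain per second.
memory_ops_per_sec = 1_000_000_000 // ACCESS_TIME_NS

# Pixel-at-a-time access: 8 neighbor reads and 1 result write per pixel.
naive_ops_per_sec = (8 + 1) * PIXEL_RATE

# Word-at-a-time access: 3 row reads and 1 result write per 4 pixels.
word_ops_per_sec = (3 + 1) * PIXEL_RATE // PIXELS_PER_WORD

print(memory_ops_per_sec)                     # prints 50000000
print(naive_ops_per_sec)                      # prints 90000000
print(word_ops_per_sec)                       # prints 10000000
print(word_ops_per_sec / memory_ops_per_sec)  # prints 0.2
```

Reading whole words and reusing the buffered pixels cuts memory traffic by a factor of nine, which leaves most of the bandwidth free for the processor.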

Sobel Accelerator Architecture


Accelerator Sequence

Steady state
  Write 4 result pixels
  Read 4 pixels for previous, current, next rows
  Compute for 4 cycles
  Repeat
Start of row
  Omit writes until pipeline full
End of row
  Omit reads to drain pipeline

Memory Operation Timing

Steady state


Pixel Datapath
// Computation datapath signals
reg        [31:0] prev_row, curr_row, next_row;
reg         [7:0] O [-1:+1][-1:+1];
reg signed [10:0] Dx, Dy, D;
reg         [7:0] abs_D;
reg        [31:0] result_row;
...
// Computational datapath
always @(posedge clk_i) // Previous row register
  if (prev_row_load) prev_row       <= dat_i;
  else if (shift_en) prev_row[31:8] <= prev_row[23:0];
... // Current row register
... // Next row register

function [10:0] abs (input signed [10:0] x);
  abs = x >= 0 ? x : -x;
endfunction
...

Pixel Datapath
always @(posedge clk_i) // Computation pipeline
  if (shift_en) begin
    D     = abs(Dx) + abs(Dy);
    abs_D <= D[10:3];            // scale 11-bit |D| back to 8 bits
    Dx <= - $signed({3'b000, O[-1][-1]})
          + $signed({3'b000, O[-1][+1]})
          - ($signed({3'b000, O[ 0][-1]}) << 1)
          + ($signed({3'b000, O[ 0][+1]}) << 1)
          - $signed({3'b000, O[+1][-1]})
          + $signed({3'b000, O[+1][+1]});
    Dy <=   $signed({3'b000, O[-1][-1]})
          + ($signed({3'b000, O[-1][ 0]}) << 1)
          + $signed({3'b000, O[-1][+1]})
          - $signed({3'b000, O[+1][-1]})
          - ($signed({3'b000, O[+1][ 0]}) << 1)
          - $signed({3'b000, O[+1][+1]});
    ...

Pixel Datapath
    O[-1][-1] <= O[-1][ 0];
    O[-1][ 0] <= O[-1][+1];
    O[-1][+1] <= prev_row[31:24];
    O[ 0][-1] <= O[ 0][ 0];
    O[ 0][ 0] <= O[ 0][+1];
    O[ 0][+1] <= curr_row[31:24];
    O[+1][-1] <= O[+1][ 0];
    O[+1][ 0] <= O[+1][+1];
    O[+1][+1] <= next_row[31:24];
  end

always @(posedge clk_i) // Result row register
  if (shift_en) result_row <= {result_row[23:0], abs_D};

Address Generation

Given an image in memory at base address B
  Address for pixel in row r, column c is B + r × 640 + c
  Base address (B) is fixed
  Offset (r × 640 + c) increments by 4 for each group of 4 pixels read/written
Use word-aligned addresses
  Two least-significant bits always 00
  Increment word address by 1

Address Generation


Address Generation
always @(posedge clk_i) // O base address register
  if (O_base_ce) O_base <= dat_i[21:2];

always @(posedge clk_i) // O address offset counter
  if (offset_reset)         O_offset <= 0;
  else if (O_offset_cnt_en) O_offset <= O_offset + 1;

always @(posedge clk_i) // D base address register
  if (D_base_ce) D_base <= dat_i[21:2];

always @(posedge clk_i) // D address offset counter
  if (offset_reset)         D_offset <= 0;
  else if (D_offset_cnt_en) D_offset <= D_offset + 1;
...

Address Generation
assign O_prev_addr = O_base + O_offset;
assign O_curr_addr = O_prev_addr + 640/4;
assign O_next_addr = O_prev_addr + 1280/4;
assign D_addr      = D_base + D_offset;
assign adr_o[21:2] = prev_row_load ? O_prev_addr :
                     curr_row_load ? O_curr_addr :
                     next_row_load ? O_next_addr :
                                     D_addr;
assign adr_o[1:0]  = 2'b00;
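
The word-address arithmetic in the address generator can be mirrored in software to check expected bus addresses. This Python sketch is an illustration only: the function names are hypothetical, and the example base address 0x008000 is the value the testbench later writes to the O_base register.

```python
ROW_BYTES = 640        # one image row of 8-bit pixels
WORD_BYTES = 4         # 32-bit memory word


def pixel_byte_address(base: int, row: int, col: int) -> int:
    """Byte address of the pixel at (row, col): B + r * 640 + c."""
    return base + row * ROW_BYTES + col


def row_word_addresses(o_base_word: int, o_offset: int):
    """Word addresses of the previous, current, and next row words."""
    prev_addr = o_base_word + o_offset
    curr_addr = prev_addr + ROW_BYTES // WORD_BYTES        # + 640/4
    next_addr = prev_addr + 2 * ROW_BYTES // WORD_BYTES    # + 1280/4
    return prev_addr, curr_addr, next_addr


o_base_word = 0x008000 >> 2   # drop the two always-zero low address bits
prev_w, curr_w, next_w = row_word_addresses(o_base_word, 0)
# Shift back up to byte addresses, as adr_o[1:0] = 2'b00 does in hardware.
print(hex(prev_w << 2), hex(curr_w << 2), hex(next_w << 2))
# prints 0x8000 0x8280 0x8500
```

Each increment of the offset counter advances all three addresses by one word, that is, by four pixels along the row.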

Control/Status Registers
Register  Offset  Read/Write  Purpose
Int_en    0       Write-only  Interrupt enable (bit 0).
Start     4       Write-only  Write causes image processing to start (value ignored).
O_base    8       Write-only  Original image base address.
D_base    12      Write-only  Derivative image base address + 640.
Status    0       Read-only   Processing done (bit 0). Reading clears interrupt.

Slave Bus Interface


assign start     = cyc_i && stb_i && we_i && adr_i == 2'b01;
assign O_base_ce = cyc_i && stb_i && we_i && adr_i == 2'b10;
assign D_base_ce = cyc_i && stb_i && we_i && adr_i == 2'b11;

always @(posedge clk_i) // Interrupt enable register
  if (rst_i)
    int_en <= 1'b0;
  else if (cyc_i && stb_i && we_i && adr_i == 2'b00)
    int_en <= dat_i[0];

always @(posedge clk_i) // Status register
  if (rst_i)
    done <= 1'b0;
  else if (done_set)
    // This occurs when last write is acknowledged,
    // and so cannot coincide with a read of the status register.
    done <= 1'b1;
  else if (cyc_i && stb_i && !we_i && adr_i == 2'b00 && ack_o)
    // Reading the status register clears done (and the interrupt).
    done <= 1'b0;

assign int_req = int_en && done;
...

Slave Bus Interface


always @(posedge clk_i) // Generate ack output
  ack_o <= cyc_i && stb_i && !ack_o;

// Wishbone data output multiplexer
always @*
  if (cyc_i && stb_i && !we_i)
    if (adr_i == 2'b00)
      dat_o = {31'b0, done};  // status register read
    else
      dat_o = 32'b0;          // other registers read as 0
  else
    dat_o = result_row;       // for master write

Control Sequencing

Use a finite-state machine
  Counters keep track of rows (0 to 477) and columns (0 to 159)
See textbook for details of FSM output functions

State Transition Diagram


Accelerator Verification

Simulation-based verification of each section of the accelerator
  Slave bus operations
  Computation sequencing
  Master bus operations
  Address generation
  Pixel computation
Testbench including the accelerator
  Bus functional processor model
  Simplified memory and bus arbiter models

Sobel Verification Testbench


[Figure: testbench block diagram with Processor BFM, Arbiter, Sobel Accelerator, and Memory Model, connected by a multiplexed bus (muxes and connections)]

Processor Bus Functional Model


initial begin // Processor bus-functional model
  cpu_adr_o <= 23'h000000;
  cpu_sel_o <= 4'b0000;
  cpu_dat_o <= 32'h00000000;
  cpu_cyc_o <= 1'b0; cpu_stb_o <= 1'b0; cpu_we_o <= 1'b0;
  @(negedge rst);
  @(posedge clk);
  // Write 008000 (hex) to O_base_addr register
  bus_write(sobel_reg_base + sobel_O_base_reg_offset, 32'h00008000);
  // Write 053000 + 280 (hex) to D_base_addr register
  bus_write(sobel_reg_base + sobel_D_base_reg_offset, 32'h00053280);
  // Write 1 to interrupt control register (enable interrupt)
  bus_write(sobel_reg_base + sobel_int_reg_offset, 32'h00000001);
  // Write to start register (data value ignored)
  bus_write(sobel_reg_base + sobel_start_reg_offset, 32'h00000000);
  // End of write operations
  ...

Processor Bus Functional Model


  cpu_cyc_o = 1'b0; cpu_stb_o = 1'b0; cpu_we_o = 1'b0;
  begin: loop
    forever begin
      #10000;
      @(posedge clk);
      // Read status register
      cpu_adr_o <= sobel_reg_base + sobel_status_reg_offset;
      cpu_sel_o <= 4'b1111;
      cpu_cyc_o <= 1'b1; cpu_stb_o <= 1'b1; cpu_we_o <= 1'b0;
      @(posedge clk); while (!cpu_ack_i) @(posedge clk);
      cpu_cyc_o <= 1'b0; cpu_stb_o <= 1'b0; cpu_we_o <= 1'b0;
      if (cpu_dat_i[0]) disable loop;  // done bit set: stop polling
    end
  end
end

Memory Bus Functional Model


always begin // Memory bus-functional model
  mem_ack_o <= 1'b0;
  mem_dat_o <= 32'h00000000;
  @(posedge clk);
  while (!(bus_cyc && mem_stb_i)) @(posedge clk);
  if (!bus_we)
    mem_dat_o <= 32'h00000000; // in place of read data
  mem_ack_o <= 1'b1;
  @(posedge clk);
end

Bus Arbiter

Uses sobel_cyc_o and cpu_cyc_o as request inputs
  If both request at the same time, give accelerator priority
Mealy FSM

Bus Arbiter
always @(posedge clk) // Arbiter FSM register
  if (rst) arbiter_current_state <= sobel;
  else     arbiter_current_state <= arbiter_next_state;

always @* // Arbiter logic (combinational, so blocking assignments)
  case (arbiter_current_state)
    sobel:
      if (sobel_cyc_o) begin
        sobel_gnt = 1'b1; cpu_gnt = 1'b0; arbiter_next_state = sobel;
      end else if (!sobel_cyc_o && cpu_cyc_o) begin
        sobel_gnt = 1'b0; cpu_gnt = 1'b1; arbiter_next_state = cpu;
      end else begin
        sobel_gnt = 1'b0; cpu_gnt = 1'b0; arbiter_next_state = sobel;
      end
    cpu:
      if (cpu_cyc_o) begin
        sobel_gnt = 1'b0; cpu_gnt = 1'b1; arbiter_next_state = cpu;
      end else if (sobel_cyc_o && !cpu_cyc_o) begin
        sobel_gnt = 1'b1; cpu_gnt = 1'b0; arbiter_next_state = sobel;
      end else begin
        sobel_gnt = 1'b0; cpu_gnt = 1'b0; arbiter_next_state = sobel;
      end
  endcase
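
The arbiter's behavior is easy to tabulate with a small software model of the same Mealy FSM. The Python sketch below is a hypothetical helper for reasoning about the grant policy, not part of the testbench. The state names follow the Verilog, and `arbiter_step` is a name invented here.

```python
def arbiter_step(state: str, sobel_req: bool, cpu_req: bool):
    """Return (sobel_gnt, cpu_gnt, next_state) for one evaluation of the FSM."""
    if state == "sobel":
        if sobel_req:                      # accelerator wins from this state
            return True, False, "sobel"
        if cpu_req:
            return False, True, "cpu"
        return False, False, "sobel"
    if state == "cpu":
        if cpu_req:                        # processor keeps the bus it holds
            return False, True, "cpu"
        if sobel_req:
            return True, False, "sobel"
        return False, False, "sobel"
    raise ValueError(f"unknown state: {state}")


# Both masters request while the arbiter is in its reset state.
print(arbiter_step("sobel", True, True))   # prints (True, False, 'sobel')
# Processor already owns the bus and is still requesting.
print(arbiter_step("cpu", True, True))     # prints (False, True, 'cpu')
```

The model makes one design point visible: accelerator priority applies when the bus is idle or held by the accelerator, while a processor transfer already in progress is allowed to finish.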

Simulation Results

See waveforms in textbook
  Demonstrates sequencing and address generation
But what about
  Data values computed correctly
  Interactions between processor and accelerator
Need to use more sophisticated verification techniques
  Due to complexity of the design

Summary

Accelerators boost performance using parallel hardware
  Replication, pipelining, ...
Amdahl's Law
  Best payback from accelerating a kernel
DMA avoids processor overhead
Verification requires advanced techniques