09 Accelerators

Digital Design: An Embedded Systems Approach Using Verilog
Chapter 9: Accelerators
Portions of this work are from the book, Digital Design: An Embedded
Systems Approach Using Verilog, by Peter J. Ashenden, published by Morgan
Kaufmann Publishers, Copyright 2007 Elsevier Inc. All rights reserved.
Verilog
Performance and Parallelism
A processor core performs steps in sequence
  Performance limited by the instruction rate
Accelerating performance
  Perform steps in parallel
  Takes less time overall to complete an operation
Instruction-level parallelism
  Within a processor core
  Pipelining, multiple-issue
Accelerators
  Custom hardware for parallel operations

Achievable Parallelism
How many steps can be performed at once?
  Regularly structured data
  Independent processing steps
Examples
  Video and image pixel processing
  Audio or sensor signal processing
Constrained by data dependencies
  Operations that depend on results of previous steps

Algorithm Kernels
Algorithm: specification of the required processing steps
  Often expressed in a programming language
Kernel: the part that involves the most intensive, repetitive processing
  10% of operations take 90% of the time
Accelerating a kernel with parallel hardware gives the best payback

Amdahl's Law
Time for an algorithm is t
  Fraction f is spent on a kernel
  $t = ft + (1 - f)t$
Accelerator speeds up kernel by a factor s
  $t' = \frac{ft}{s} + (1 - f)t$
Overall speedup factor s'
  $s' = \frac{t}{t'} = \frac{1}{\frac{f}{s} + (1 - f)}$
  For large f, $s' \to s$
  For small f, $s' \to 1$

Amdahl's Law Example
An algorithm with two kernels
  Kernel 1: 80% of time, can be sped up 10 times
  Kernel 2: 15% of time, can be sped up 100 times
Which speedup gives best overall improvement?
  For kernel 1: $s' = \frac{1}{\frac{0.8}{10} + (1 - 0.8)} = \frac{1}{0.08 + 0.2} = 3.57$
  For kernel 2: $s' = \frac{1}{\frac{0.15}{100} + (1 - 0.15)} = \frac{1}{0.0015 + 0.85} = 1.17$
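The comparison of the two kernels can be checked numerically. The sketch below is a minimal Python model of the speedup formula; the function name `amdahl_speedup` is an illustrative choice, not something from the book.

```python
def amdahl_speedup(f: float, s: float) -> float:
    """Overall speedup s' when a fraction f of the run time is sped up by a factor s."""
    return 1.0 / (f / s + (1.0 - f))


# The two kernels from the example: (fraction of time, kernel speedup)
kernel1 = amdahl_speedup(0.80, 10)
kernel2 = amdahl_speedup(0.15, 100)
print(f"kernel 1: {kernel1:.2f}x, kernel 2: {kernel2:.2f}x")
```

Accelerating kernel 1 wins even though kernel 2 can be sped up ten times more, because kernel 1 accounts for far more of the total time.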

Parallel Architectures
An architecture for an accelerator specifies
  Processing blocks
  Data flow between them
Parallelism through replication
  Multiple identical blocks operating on different data elements
  Works well when elements can be processed independently

Parallel Architectures
Parallelism through pipelining
  Break a computation into steps, perform them in assembly-line fashion
  Latency (time to complete a single operation) is not reduced
  Throughput (rate of completion of operations) is increased
    Ideally by a factor equal to the number of pipeline stages
[Figure: data in → step 1 → step 2 → step 3 → data out]
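A rough timing model shows why pipelining raises throughput without shortening any single operation. This is a sketch under idealized assumptions (equal stage times, no stalls, no register overhead); the function names and the numbers are illustrative.

```python
def unpipelined_time(n_items: int, n_stages: int, stage_time: int) -> int:
    """Each item passes through every step before the next item starts."""
    return n_items * n_stages * stage_time


def pipelined_time(n_items: int, n_stages: int, stage_time: int) -> int:
    """The first item needs n_stages steps; each later item finishes one step after it."""
    return (n_stages + n_items - 1) * stage_time


n, k, t = 1000, 3, 10  # items, pipeline stages, ns per stage (illustrative values)
speedup = unpipelined_time(n, k, t) / pipelined_time(n, k, t)
print(f"latency of one item: {pipelined_time(1, k, t)} ns, throughput gain: {speedup:.2f}x")
```

For a long run of items the gain approaches the number of stages, while the latency of one item stays at `k * t`.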

Direct Memory Access (DMA)
Input/output data for accelerators must be transferred at high speed
  Using the processor would be too slow
Direct memory access
  I/O controller and accelerator transfer data to and from memory autonomously
  Program supplies starting address and length

Bus Arbitration
Bus masters take turns to use bus to access slaves
  Controlled by a bus arbiter
Arbitration policies
  Priority, round-robin, ...
[Figure: the processor, accelerator, and controller each exchange request and grant signals with the arbiter and share the memory bus to the memory]
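The two named arbitration policies can be contrasted with a small behavioral sketch. It assumes the requests arrive as a list of booleans, one per bus master; the function names are hypothetical and the code models only the grant decision, not bus timing.

```python
def priority_grant(requests):
    """Fixed priority: the lowest-numbered requesting master wins."""
    for i, req in enumerate(requests):
        if req:
            return i
    return None  # no master is requesting


def round_robin_grant(requests, last):
    """Round-robin: start the search just after the master granted last time."""
    n = len(requests)
    for step in range(1, n + 1):
        i = (last + step) % n
        if requests[i]:
            return i
    return None


# With all three masters requesting, priority always picks master 0,
# whereas round-robin moves on to the master after the previous winner.
print(priority_grant([True, True, True]), round_robin_grant([True, True, True], last=0))
```

Fixed priority is simpler but can starve low-priority masters; round-robin shares the bus evenly at the cost of remembering the last grant.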

Block-Processing Accelerator
Data arranged in regular groups of contiguous memory locations
  Accelerator works block by block
  E.g., images in blocks of 8 × 8 16-bit pixels
Datapath comprises
  Memory access: address generation, counters
  Computation section
  Control section: finite-state machine(s)

Stream-Processing Accelerator
Streams of data from an input source
  E.g., high-speed sensors
Digital signal processing (DSP)
  Analog sensor signal converted to stream of digital sample values
  Filtering, gain/attenuation, frequency-domain conversion (Fourier transform)
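As an illustration of filtering a sample stream, the sketch below implements a direct-form FIR filter that produces one output per input sample. The coefficients and sample values are made up for the example; a hardware accelerator would do the same multiply-accumulate in parallel logic.

```python
from collections import deque


def fir_filter(samples, coeffs):
    """Yield one filtered output per input sample (direct-form FIR)."""
    taps = deque([0] * len(coeffs), maxlen=len(coeffs))
    for x in samples:
        taps.appendleft(x)  # newest sample first; the oldest drops off the end
        yield sum(c * v for c, v in zip(coeffs, taps))


# Two-tap moving average applied to a short stream of samples
smoothed = list(fir_filter([0, 4, 8, 4, 0], [0.5, 0.5]))
print(smoothed)
```

Because each output depends only on the last few samples, the filter never needs the whole data set in memory, which is what makes stream processing a good fit.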

Processor/Accelerator Interface
Embedded software controls an accelerator
  Providing control parameters
  Synchronizing operations
Input/output registers and interrupts
  Interact with the control sequencer

Case Study: Edge Detection
Illustration of accelerator design
Edge detection in video processing
  Identify where image intensity changes abruptly
  Typically at the boundary of objects
  First step in identifying objects in a scene
Application areas
  Video surveillance, computer vision, ...
For this case study
  Monochrome images of 640 × 480 8-bit pixels
  Stored row-by-row in memory
  Pixel values: 0 (black) to 255 (white)

Sobel Edge Detection
Compute derivatives of intensity in x and y directions
Look for minima and maxima (where intensity changes most rapidly)

The Sobel Algorithm
Use convolution to approximate partial derivatives Dx and Dy at each position
  Weighted sum of value of a pixel and its eight nearest neighbors
  Coefficients represented using a 3 × 3 convolution mask
Sobel masks for x and y derivatives
  $G_x = \begin{bmatrix} -1 & 0 & +1 \\ -2 & 0 & +2 \\ -1 & 0 & +1 \end{bmatrix}$, $D_x(i,j) = O(i,j) \ast G_x$
  $G_y = \begin{bmatrix} +1 & +2 & +1 \\ 0 & 0 & 0 \\ -1 & -2 & -1 \end{bmatrix}$, $D_y(i,j) = O(i,j) \ast G_y$

The Sobel Algorithm
Combine partial derivatives
  $|D| = \sqrt{D_x^2 + D_y^2}$
Since we just want maxima and minima in magnitude, approximate as:
  $|D| \approx |D_x| + |D_y|$
Pixels on the image border don't have eight neighbors
  Skip computation of |D| for these border pixels
  Just set them to 0 using software

The Algorithm in Pseudocode
for (row = 1; row <= 478; row = row + 1) begin
  for (col = 1; col <= 638; col = col + 1) begin
    sumx = 0; sumy = 0;
    for (i = -1; i <= +1; i = i + 1) begin
      for (j = -1; j <= +1; j = j + 1) begin
        sumx = sumx + O[row+i][col+j] * Gx[i][j];
        sumy = sumy + O[row+i][col+j] * Gy[i][j];
      end
    end
    D[row][col] = abs(sumx) + abs(sumy);
  end
end
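The pseudocode translates directly into a software reference model, which is also useful as a golden model when checking the accelerator in simulation. The sketch below assumes the image is a list of rows of integers and uses the standard Sobel masks, indexed from 0 rather than from -1; border pixels are left at 0.

```python
GX = [[-1, 0, +1], [-2, 0, +2], [-1, 0, +1]]
GY = [[+1, +2, +1], [0, 0, 0], [-1, -2, -1]]


def sobel(image):
    """Return |Dx| + |Dy| for every interior pixel; border pixels stay 0."""
    rows, cols = len(image), len(image[0])
    D = [[0] * cols for _ in range(rows)]
    for row in range(1, rows - 1):
        for col in range(1, cols - 1):
            sumx = sumy = 0
            for i in (-1, 0, 1):
                for j in (-1, 0, 1):
                    pixel = image[row + i][col + j]
                    sumx += pixel * GX[i + 1][j + 1]
                    sumy += pixel * GY[i + 1][j + 1]
            D[row][col] = abs(sumx) + abs(sumy)
    return D


# A vertical black-to-white edge gives a strong response along the edge
edges = sobel([[0, 0, 255, 255] for _ in range(4)])
print(edges[1])
```

For a 640 × 480 image the loop bounds become rows 1 to 478 and columns 1 to 638, as in the pseudocode.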

Data Formats and Rates
Pixel values: 0 to 255 (8 bits)
Coefficients are 0, ±1 and ±2
  Partial products: -510 to +510 (10 bits)
  Dx and Dy: -1020 to +1020 (11 bits)
  |D|: 0 to 2040 (11 bits)
  Final pixel value: scale back to 8 bits
Video rate: 30 frames/sec
  640 × 480 = 307,200 pixels
  307,200 × 30 ≈ 10 million pixels/sec
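The ranges and bit widths follow from the pixel range and the mask coefficients, and can be checked with a few lines of arithmetic. The helper `signed_bits` is an illustrative name; it returns the smallest two's-complement width that covers a range.

```python
def signed_bits(lo: int, hi: int) -> int:
    """Smallest two's-complement width that holds every value in [lo, hi]."""
    n = 1
    while not (-(1 << (n - 1)) <= lo and hi <= (1 << (n - 1)) - 1):
        n += 1
    return n


partial = 255 * 2          # largest partial-product magnitude (coefficient ±2)
deriv = 255 * (1 + 2 + 1)  # largest |Dx| or |Dy| (three same-sign coefficients)
magnitude = 2 * deriv      # largest |Dx| + |Dy|
pixel_rate = 640 * 480 * 30

print(signed_bits(-partial, partial), signed_bits(-deriv, deriv), magnitude.bit_length())
print(pixel_rate)
```

The exact pixel rate is a little under the 10 million pixels/sec used as a round figure in the bandwidth estimates.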

Data Dependencies
Pixels can be computed independently
For each pixel:
  [Figure: data dependencies for computing one pixel]

System Architecture
Data dependencies suggest a pipeline
Coefficient multiplies are simple shift/negate, so merge with adder stage

Memory Bandwidth
Assume memory read/write takes 20 ns (2 cycles of 100 MHz clock)
  Memory is 32 bits wide, byte addressable
  Bandwidth = 50M operations/sec
Camera produces 10 Mpixels/sec
Accelerator needs to process at this rate
  (8 reads + 1 write) × 10 Mpixels/sec = 90M operations/sec
  Greater than memory bandwidth

Memory Bandwidth
Read 4 pixels at once from each of previous, current, and next rows
  Store in accelerator to compute multiple derivative image pixels
Produce derivative pixels row-by-row, left-to-right
  Read 3 32-bit words for every 4th derivative pixel computed
  Write 4 pixels at a time
  (3 reads + 1 write) / 4 × 10 Mpixels/sec = 10M operations/sec
  = 20% of available memory bandwidth
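The bandwidth budget for both access schemes can be reproduced with simple arithmetic. The constant names below are illustrative, and the pixel rate uses the rounded figure of 10 Mpixels/sec.

```python
CLOCK_HZ = 100_000_000     # 100 MHz system clock
CYCLES_PER_ACCESS = 2      # one memory read or write takes 20 ns
PIXEL_RATE = 10_000_000    # rounded camera rate, pixels/sec

mem_ops_per_sec = CLOCK_HZ // CYCLES_PER_ACCESS  # available memory operations/sec

# Pixel-at-a-time: 8 neighbor reads + 1 result write for every pixel
naive_ops = (8 + 1) * PIXEL_RATE

# Word-at-a-time: 3 row reads + 1 result write for every group of 4 pixels
grouped_ops = (3 + 1) * PIXEL_RATE // 4

print(naive_ops > mem_ops_per_sec, grouped_ops / mem_ops_per_sec)
```

Fetching whole 32-bit words and reusing the pixels inside the accelerator cuts the demand from well above the available bandwidth to one fifth of it.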

Sobel Accelerator Architecture

Accelerator Sequence
Steady state
  Write 4 result pixels
  Read 4 pixels for previous, current, next rows
  Compute for 4 cycles
  Repeat
Start of row
  Omit writes until pipeline full
End of row
  Omit reads to drain pipeline

Memory Operation Timing
Steady state

Pixel Datapath
// Computation datapath signals
reg        [31:0] prev_row, curr_row, next_row;
reg         [7:0] O [-1:+1][-1:+1];
reg signed [10:0] Dx, Dy, D;
reg         [7:0] abs_D;
reg        [31:0] result_row;
...
// Computational datapath
always @(posedge clk_i) // Previous row register
  if      (prev_row_load) prev_row       <= dat_i;
  else if (shift_en)      prev_row[31:8] <= prev_row[23:0];
... // Current row register
... // Next row register
function [10:0] abs (input signed [10:0] x);
  abs = x >= 0 ? x : -x;
endfunction
...

Pixel Datapath
always @(posedge clk_i) // Computation pipeline
  if (shift_en) begin
    D     =  abs(Dx) + abs(Dy);
    abs_D <= D[10:3];
    Dx <= - $signed({3'b000, O[-1][-1]})
          + $signed({3'b000, O[-1][+1]})
          - ($signed({3'b000, O[ 0][-1]}) << 1)
          + ($signed({3'b000, O[ 0][+1]}) << 1)
          - $signed({3'b000, O[+1][-1]})
          + $signed({3'b000, O[+1][+1]});
    Dy <=   $signed({3'b000, O[-1][-1]})
          + ($signed({3'b000, O[-1][ 0]}) << 1)
          + $signed({3'b000, O[-1][+1]})
          - $signed({3'b000, O[+1][-1]})
          - ($signed({3'b000, O[+1][ 0]}) << 1)
          - $signed({3'b000, O[+1][+1]});
    ...

Pixel Datapath
    O[-1][-1] <= O[-1][ 0];
    O[-1][ 0] <= O[-1][+1];
    O[-1][+1] <= prev_row[31:24];
    O[ 0][-1] <= O[ 0][ 0];
    O[ 0][ 0] <= O[ 0][+1];
    O[ 0][+1] <= curr_row[31:24];
    O[+1][-1] <= O[+1][ 0];
    O[+1][ 0] <= O[+1][+1];
    O[+1][+1] <= next_row[31:24];
  end
always @(posedge clk_i) // Result row register
  if (shift_en) result_row <= {result_row[23:0], abs_D};
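The behavior of the row registers and the 3 × 3 window can be mimicked in software to see what each enabled clock edge does. The sketch below is a simplified model: the function names are hypothetical, the window is a list of three `[left, middle, right]` rows, and the pipeline registers between the window, Dx/Dy, and abs_D are not modeled, so the three functions describe the combinational steps only.

```python
def shift_step(window, rows):
    """One shift_en clock edge: slide the window left and shift each row register."""
    new_window, new_rows = [], []
    for w, r in zip(window, rows):
        new_window.append([w[1], w[2], (r >> 24) & 0xFF])       # take bits 31:24
        new_rows.append(((r << 8) & 0xFFFFFF00) | (r & 0xFF))   # [31:8] <= [23:0]
    return new_window, new_rows


def derivatives(w):
    """Dx and Dy from the window, with the x2 coefficients written as multiplies."""
    dx = -w[0][0] + w[0][2] - 2 * w[1][0] + 2 * w[1][2] - w[2][0] + w[2][2]
    dy = w[0][0] + 2 * w[0][1] + w[0][2] - w[2][0] - 2 * w[2][1] - w[2][2]
    return dx, dy


def scale_to_8_bits(dx, dy):
    """|Dx| + |Dy| is an 11-bit value; keeping bits 10:3 divides it by 8."""
    return (abs(dx) + abs(dy)) >> 3


window = [[0, 0, 0] for _ in range(3)]
rows = [0x11223344] * 3
for _ in range(3):
    window, rows = shift_step(window, rows)
print(window[0], scale_to_8_bits(*derivatives([[0, 0, 255]] * 3)))
```

After three shifts the window holds the first three pixels of each row, and the fourth pixel is still waiting in bits 31:24 of the row register.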

Address Generation
Given an image in memory at base address B
Address for pixel in row r, column c is B + r × 640 + c
  Base address (B) is fixed
  Offset (r × 640 + c) increments by 4 for each group of 4 pixels read/written
Use word-aligned addresses
  Two least-significant bits always 00
  Increment word address by 1

Address Generation
always @(posedge clk_i) // O base address register
  if (O_base_ce) O_base <= dat_i[21:2];
always @(posedge clk_i) // O address offset counter
  if (offset_reset)         O_offset <= 0;
  else if (O_offset_cnt_en) O_offset <= O_offset + 1;
always @(posedge clk_i) // D base address register
  if (D_base_ce) D_base <= dat_i[21:2];
always @(posedge clk_i) // D address offset counter
  if (offset_reset)         D_offset <= 0;
  else if (D_offset_cnt_en) D_offset <= D_offset + 1;
...

Address Generation
assign O_prev_addr = O_base + O_offset;
assign O_curr_addr = O_prev_addr + 640/4;
assign O_next_addr = O_prev_addr + 1280/4;
assign D_addr      = D_base + D_offset;
assign adr_o[21:2] = prev_row_load ? O_prev_addr :
                     curr_row_load ? O_curr_addr :
                     next_row_load ? O_next_addr :
                                     D_addr;
assign adr_o[1:0]  = 2'b00;
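The word-address arithmetic can be cross-checked against the byte-address formula B + r × 640 + c. The sketch below assumes the base address is already a word address, as it is after `dat_i[21:2]` is stored; the function name and the example base address are illustrative.

```python
ROW_WORDS = 640 // 4  # 160 32-bit words per image row


def read_addresses(o_base_word: int, o_offset: int):
    """Byte addresses driven for the previous, current, and next row reads."""
    prev = o_base_word + o_offset
    curr = prev + ROW_WORDS
    nxt = prev + 2 * ROW_WORDS
    return [addr << 2 for addr in (prev, curr, nxt)]  # low two address bits are 00


# Image at byte address 0x8000, first group of four pixels in each row
print([hex(a) for a in read_addresses(0x8000 >> 2, 0)])
```

Each increment of the offset counter moves all three read addresses on by one word, that is, by four pixels.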

Control/Status Registers
Register  Offset  Read/Write  Purpose
Int_en    0       Write-only  Interrupt enable (bit 0).
Start     4       Write-only  Write causes image processing to start (value ignored).
O_base    8       Write-only  Original image base address.
D_base    12      Write-only  Derivative image base address + 640.
Status    0       Read-only   Processing done (bit 0). Reading clears interrupt.
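From the software side, starting the accelerator amounts to a short sequence of register writes. The sketch below only builds that sequence as (address, value) pairs. It assumes byte offsets equal to four times the two-bit register address decoded by the slave interface, which agrees with the offset of 12 for D_base; `reg_base` and the image addresses are hypothetical parameters.

```python
INT_EN, START, O_BASE, D_BASE, STATUS = 0, 4, 8, 12, 0  # assumed byte offsets
ROW_BYTES = 640  # one image row of 8-bit pixels


def setup_writes(reg_base: int, o_image: int, d_image: int):
    """Register writes that configure and start one image-processing run."""
    return [
        (reg_base + O_BASE, o_image),
        (reg_base + D_BASE, d_image + ROW_BYTES),  # results start in the second row
        (reg_base + INT_EN, 1),                    # enable the done interrupt
        (reg_base + START, 0),                     # value is ignored
    ]


for addr, value in setup_writes(reg_base=0x400000, o_image=0x8000, d_image=0x53000):
    print(hex(addr), hex(value))
```

D_base is written with the image address plus 640 because the first row of the derivative image is a border row that the accelerator does not compute.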

Slave Bus Interface
assign start     = cyc_i && stb_i && we_i && adr_i == 2'b01;
assign O_base_ce = cyc_i && stb_i && we_i && adr_i == 2'b10;
assign D_base_ce = cyc_i && stb_i && we_i && adr_i == 2'b11;
always @(posedge clk_i) // Interrupt enable register
  if (rst_i)
    int_en <= 1'b0;
  else if (cyc_i && stb_i && we_i && adr_i == 2'b00)
    int_en <= dat_i[0];
always @(posedge clk_i) // Status register
  if (rst_i)
    done <= 1'b0;
  else if (done_set)
    // This occurs when last write is acknowledged,
    // and so cannot coincide with a read of the status register.
    done <= 1'b1;
  else if (cyc_i && stb_i && !we_i && adr_i == 2'b00 && ack_o)
    // An acknowledged read of the status register clears done
    done <= 1'b0;
assign int_req = int_en && done;
...

Slave Bus Interface
always @(posedge clk_i) // Generate ack output
  ack_o <= cyc_i && stb_i && !ack_o;
// Wishbone data output multiplexer
always @*
  if (cyc_i && stb_i && !we_i)
    if (adr_i == 2'b00)
      dat_o = {31'b0, done}; // status register read
    else
      dat_o = 32'b0;         // other registers read as 0
  else
    dat_o = result_row;      // for master write
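The acknowledge register is easy to misread, so here is a cycle-by-cycle software model of the single line `ack_o <= cyc_i && stb_i && !ack_o`. The function name is illustrative, and the input list stands for the value of `cyc_i && stb_i` sampled at successive clock edges.

```python
def ack_sequence(strobes):
    """Value of ack_o after each clock edge, starting from ack_o = 0."""
    ack, out = False, []
    for stb in strobes:
        ack = stb and not ack  # same expression as the registered assignment
        out.append(ack)
    return out


# A master that holds the strobe until it sees the acknowledge, then releases it
print(ack_sequence([True, True, False, False]))
```

The `!ack_o` term limits the acknowledge to a single cycle per transfer, so a strobe that is still asserted during the acknowledge cycle is not acknowledged twice in a row.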

Control Sequencing
Use a finite-state machine
Counters keep track of rows (0 to 477) and columns (0 to 159)
See textbook for details of FSM output functions

State Transition Diagram

Accelerator Verification
Simulation-based verification of each section of the accelerator
  Slave bus operations
  Computation sequencing
  Master bus operations
  Address generation
  Pixel computation
Testbench including the accelerator
  Bus functional processor model
  Simplified memory and bus arbiter models

Sobel Verification Testbench
[Figure: Processor BFM, Arbiter, Sobel Accelerator, and Memory Model connected through "Multiplexed Bus: Muxes and Connections"]

Processor Bus Functional Model
initial begin // Processor bus-functional model
  cpu_adr_o <= 23'h000000;
  cpu_sel_o <= 4'b0000;
  cpu_dat_o <= 32'h00000000;
  cpu_cyc_o <= 1'b0; cpu_stb_o <= 1'b0; cpu_we_o <= 1'b0;
  @(negedge rst);
  @(posedge clk);
  // Write 008000 (hex) to O_base_addr register
  bus_write(sobel_reg_base + sobel_O_base_reg_offset, 32'h00008000);
  // Write 053000 + 280 (hex) to D_base_addr register
  bus_write(sobel_reg_base + sobel_D_base_reg_offset, 32'h00053280);
  // Write 1 to interrupt control register (enable interrupt)
  bus_write(sobel_reg_base + sobel_int_reg_offset, 32'h00000001);
  // Write to start register (data value ignored)
  bus_write(sobel_reg_base + sobel_start_reg_offset, 32'h00000000);
  // End of write operations
  ...
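
Processor Bus Functional Model
  cpu_cyc_o = 1'b0; cpu_stb_o = 1'b0; cpu_we_o = 1'b0;
  begin: loop
    forever begin
      #10000;
      @(posedge clk);
      // Read status register
      cpu_adr_o <= sobel_reg_base + sobel_status_reg_offset;
      cpu_sel_o <= 4'b1111;
      // Assumed: assert cyc/stb so that the slave can acknowledge the read
      cpu_cyc_o <= 1'b1; cpu_stb_o <= 1'b1;
      @(posedge clk); while (!cpu_ack_i) @(posedge clk);
      // Assumed: end the read cycle once it has been acknowledged
      cpu_cyc_o <= 1'b0; cpu_stb_o <= 1'b0;
      if (cpu_dat_i[0]) disable loop;
    end
  end
end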

Memory Bus Functional Model
always begin // Memory bus-functional model
  mem_ack_o <= 1'b0;
  mem_dat_o <= 32'h00000000;
  @(posedge clk);
  while (!(bus_cyc && mem_stb_i)) @(posedge clk);
  if (!bus_we)
    mem_dat_o <= 32'h00000000; // in place of read data
  mem_ack_o <= 1'b1;
  @(posedge clk);
end

Bus Arbiter
Uses sobel_cyc_o and cpu_cyc_o as request inputs
If both request at the same time, give accelerator priority
Mealy FSM
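
Bus Arbiter
always @(posedge clk) // Arbiter FSM register
  if (rst) arbiter_current_state <= sobel;
  else     arbiter_current_state <= arbiter_next_state;
always @* // Arbiter logic
  // Grant values outside the first branch are inferred from the request inputs
  case (arbiter_current_state)
    sobel: if (sobel_cyc_o) begin
             sobel_gnt <= 1'b1; cpu_gnt <= 1'b0; arbiter_next_state <= sobel;
           end
           else if (!sobel_cyc_o && cpu_cyc_o) begin
             sobel_gnt <= 1'b0; cpu_gnt <= 1'b1; arbiter_next_state <= cpu;
           end
           else begin
             sobel_gnt <= 1'b0; cpu_gnt <= 1'b0; arbiter_next_state <= sobel;
           end
    cpu:   if (cpu_cyc_o) begin
             sobel_gnt <= 1'b0; cpu_gnt <= 1'b1; arbiter_next_state <= cpu;
           end else if (sobel_cyc_o && !cpu_cyc_o) begin
             sobel_gnt <= 1'b1; cpu_gnt <= 1'b0; arbiter_next_state <= sobel;
           end else begin
             sobel_gnt <= 1'b0; cpu_gnt <= 1'b0; arbiter_next_state <= sobel;
           end
  endcase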
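The arbiter's behavior can be tabulated with a small software model of the two-state Mealy machine. The next-state values follow the listing; the grant outputs are an assumption (the granted master is the one whose state is being entered or held, and no grant is given when nobody requests). The function name and the string state encoding are illustrative.

```python
def arbiter(state: str, sobel_req: bool, cpu_req: bool):
    """Return (sobel_gnt, cpu_gnt, next_state) for the current state and requests."""
    if state == "sobel":
        if sobel_req:
            return True, False, "sobel"
        if cpu_req:
            return False, True, "cpu"
        return False, False, "sobel"
    # state == "cpu": the processor keeps the bus while it is still requesting
    if cpu_req:
        return False, True, "cpu"
    if sobel_req:
        return True, False, "sobel"
    return False, False, "sobel"


# Simultaneous requests from the idle (sobel) state go to the accelerator
print(arbiter("sobel", True, True))
```

Because the outputs depend on the current requests as well as the state, a grant is issued in the same cycle as the request, which is what makes this a Mealy machine.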

Simulation Results
See waveforms in textbook
  Demonstrates sequencing and address generation
But what about
  Data values computed correctly
  Interactions between processor and accelerator
Need to use more sophisticated verification techniques
  Due to complexity of the design

Summary
Accelerators boost performance using parallel hardware
  Replication, pipelining, ...
Amdahl's Law
  Best payback from accelerating a kernel
DMA avoids processor overhead
Verification requires advanced techniques
