
See discussions, stats, and author profiles for this publication at: https://www.researchgate.net/publication/4135463

Digital signal processor architectures and programming

Conference Paper · January 2005


DOI: 10.1109/APCCAS.2004.1412771 · Source: IEEE Xplore


2 authors:

Sen M. Kuo, Northern Illinois University
Woon-Seng Gan, Nanyang Technological University

All content following this page was uploaded by Sen M. Kuo on 07 January 2014.



The 2004 IEEE Asia-Pacific Conference on
Circuits and Systems, December 6-9, 2004

DIGITAL SIGNAL PROCESSOR ARCHITECTURES AND PROGRAMMING

Sen M. Kuo¹ and Woon S. Gan²

¹ Dept. of Electrical Engineering, Northern Illinois University, DeKalb, IL 60115
² School of Elec. & Electronics Engineering, Nanyang Technological University, Singapore

ABSTRACT

This paper presents modern digital signal processor architectures, including the multiply-accumulate unit, shifter, pipelining and parallelism, buses, data address generators, and special addressing modes and instructions. In addition, the most effective mixed C and assembly programming technique is suggested for software development.

1. INTRODUCTION

Digital signal processing (DSP) gained popularity in the 1960s with the introduction of digital technology. It became the method of choice in processing signals as digital hardware increased in speed and became easier to use, less expensive, and more available. In 1979, Intel introduced the first DSP processor (the Intel 2920), which had an architecture and an instruction set specifically tailored for DSP applications. Today, general-purpose DSP processors are commercially available from Texas Instruments, Motorola, Analog Devices, Agere, DSP Group, and many other companies [1, 2]. As DSP processors became less expensive and more powerful, real-world DSP applications such as high-speed modems and Internet access, wireless and cellular phones, audio and video players, and digital cameras have exploded onto the marketplace [3].

As illustrated in Figure 1, DSP systems use a DSP processor or other digital hardware, an analog-to-digital converter (ADC), and a digital-to-analog converter (DAC) to replace analog devices such as amplifiers, modulators, and filters. A DSP processor performs digital operations based on a specific signal-processing algorithm (or computational descriptions) implemented in software to process the digital signals. DSP algorithms can be performed on a wide variety of digital hardware and in many computer languages such as C and C++. DSP hardware includes programmable and nonprogrammable logic, general-purpose microprocessors and microcontrollers, and general-purpose digital signal processors.

[Figure: analog signal → ADC → digital signal → DSP processor → digital signal → DAC → analog signal]
Figure 1 A Typical DSP System.

The programmable processor can be programmed for a variety of tasks. It is used for systems that are too complicated to implement with nonprogrammable circuits, products that need shorter development time and lower development cost, or systems that need to be upgraded frequently with new algorithms and standards. Most DSP processors employ a modified Harvard architecture, which gives a crossover path between program and data memory. Most processors also are optimized for performing repetitive multiply-accumulate (MAC) operations that sequentially access data stored in consecutive memory locations.

As DSP hardware and algorithm capabilities have advanced, so have processing demands, which has resulted in the development of higher-performance systems with more sophisticated algorithms for the new generation of applications. In today's evolving DSP applications, flexibility and upgradeability of design are key factors in longer product cycles. Many industrial standards are in the early stages of development, and some of these standards must maintain compatibility with other standards. A good example is digital cellular phones, which have been upgraded from 2G to 2.5G, 3G, and 4G standards. Programmable DSP processors are especially suitable for designs that require multiple modes of operation and future upgradeability.

The research results of DSP increasingly are applied to the development of complete solutions that integrate algorithms, software, and hardware into a system. Because software development has become a larger expense than hardware development in major DSP systems, processor-independent design has the advantage of porting software on different processors and the

0-7803-8660-4/04/$20.00 ©2004 IEEE 365


ability to migrate to more advanced processors in the future.

In processor-independent design, a high-level language such as C or C++ is preferred and is available for most DSP processors. C programs are easier and faster to write, and they may be ported from one processor to another simply by recompiling the source code for a new processor using the C compiler for that processor. In applications where processing and memory resources are critical, the solution is a compromise that implements critical sections in assembly language and uses C to code the rest. This mixed C-and-assembly programming provides a good balance between ease of coding and efficient implementation. The efficiency of C compilers has improved significantly, and many optimized assembly-coded DSP libraries allow the user to develop mixed code easily and efficiently.

Figure 2 illustrates how a DSP system is configured around the DSP processor. The major external blocks needed are memory and peripherals. DSP processors usually provide some on-chip cache, program read-only memory (ROM), data random-access memory (RAM), and peripherals. Peripherals such as the ADC and the DAC can connect either to the data bus using a dedicated address or to the serial interface if serial ports are available on chip.

[Figure: DSP processor connected to memory and peripherals through a data bus and an address bus]
Figure 2 External interfaces for the DSP processor.

2. DIGITAL SIGNAL PROCESSOR ARCHITECTURES

The task of developing an efficient DSP system depends on the DSP hardware and software architectures, including data flow, arithmetic capabilities, memory configurations, I/O structures, programmability, and the instruction set of the processor. The processor architecture and the corresponding DSP algorithm must be complementary. For some applications, the algorithm is given, and we have to select a suitable processor. For other applications, the processor is given, and the task is to develop efficient algorithms that satisfy the application requirements.

Most DSP processors are designed to perform repetitive MAC operations such as finite-impulse response (FIR) filtering expressed as

$$y(n) = \sum_{i=0}^{L-1} b_i \, x(n-i), \qquad (1)$$

where $\{b_0, b_1, \ldots, b_{L-1}\}$ are filter coefficients, $\{x(n), x(n-1), \ldots, x(n-L+1)\}$ are signal samples, and $L$ is the length of the filter. The computation of output $y(n)$ requires the following steps:
1. Fetch two operands, $b_i$ and $x(n-i)$, from memory.
2. Multiply $b_i$ and $x(n-i)$ to obtain the product.
3. Add the product, $b_i x(n-i)$, to the accumulator.
4. Repeat steps 1, 2, and 3 for $i = 0, 1, 2, \ldots, L-1$.
5. Store the result $y(n)$, held in the accumulator, to memory.
6. Update the pointers for $b_i$ and $x(n-i)$ and repeat steps 1 through 5 for the next input sample.

The generic internal architecture of the DSP processor illustrated in Figure 3 is optimized for the FIR-filtering operations given in equation (1). Compared with general-purpose microprocessors, the most distinctive feature is the use of parallelism and pipelining for improving processing speed. DSP processors have a number of special processing units supported by multiple dedicated buses, most of which can operate independently and concurrently. As shown in Figure 3, the arithmetic and logic unit (ALU) performs addition, subtraction, and logical operations. The shifter is used for scaling data, and the hardware multiplier and accumulators are used to perform MAC operations. Data address generators (DAGENs) generate the addresses of operands used by instructions. With these available resources, the DSP processor achieves a fast execution speed by performing operations within these units simultaneously.

[Figure: block diagram with DAGEN A and memory A, DAGEN B and memory B, a shifter/multiplier/ALU block, accumulator(s), a shifter, and DAGEN C with memory C]
Figure 3 Generic DSP architecture of the data computation unit.

Multiplication operations require several clock cycles on a microprocessor or microcontroller, where they are performed by repetitive shift-and-add operations. To achieve the speed required by multiplication-intensive DSP algorithms, such as the FIR filtering given in Eq. (1), DSP processors employ a fully parallel hardware multiplier, which can multiply two

items of data within one clock cycle. At the same time, an adder immediately following the multiplier adds the product from the previous multiplication operation into a double-precision accumulator. Some processors such as the TMS320C55x provide two MAC units and four 40-bit accumulators. A number of dedicated instructions that can perform multiply, accumulate, data-move, and pointer-update operations in a single instruction are built into processors for filtering and correlation algorithms.

The basic arithmetic operations performed by DSP processors include addition and subtraction. The logical unit performs Boolean logic such as AND, OR, and NOT operations on individual bits of a data word and executes logical shifts of the entire data word. Binary division usually is implemented with a software routine because it involves a repeated series of shift and conditional subtraction operations. The shifter can be used for pre-scaling an operand in data memory or the accumulator before an ALU operation or for post-scaling the accumulator value before storing it back into data memory.

The instructions that control the operations of the DSP processor require multiple steps to execute. First, the address of the instruction is generated, and the contents of program memory at that address are read and decoded. Based on the instruction, one or more operands then are fetched to provide the required data for executing that instruction. Finally, the results are stored, and the address of the next instruction is computed. Each instruction may take several clock cycles to execute multiple steps of prefetch, decode, fetch operand, execute, and write result. These steps can be cascaded in assembly-line fashion by using pipelining. If each step requires one clock cycle, a sequence of seven-stage instructions (in the TMS320C55x) can be completed at one instruction per clock cycle after the pipeline is full. The pipeline architecture takes advantage of the inherent decomposition of instructions into multiple serial operations.

Figure 3 shows that cascading the multiplier and ALU allows the simultaneous operation of both. That is, when the multiplier is performing its work at time $i$ to produce $b_i x(n-i)$, the ALU adds the previous product $b_{i-1} x(n-i+1)$ into the accumulator. The parallel architecture takes advantage of the inherent parallelism of DSP algorithms and applications. As shown in Figure 3, all processing units in the parallel configuration can receive different data streams and execute different operations at each cycle. For example, the multiplier accesses two operands and multiplies them. At the same time, the DAGENs update the address pointers as specified, and the ALU adds the previous product into the accumulator with (or without) rounding.

Buses provide the communication paths among the units that make up a DSP processor. The execution speed of filtering given in equation (1) can be improved further by using separate data buses for each of the two inputs for the multiplier. Instead of requiring an operand fetch for $b_i$ and another on the same data bus for $x(n-i)$, both operands can be fetched simultaneously on two separate data buses, as shown in Fig. 3. The two data buses with supporting address buses connect with two separate memories [e.g., memory A for coefficient $b_i$ and memory B for signal $x(n-i)$]. This configuration avoids the conflict of accessing two operands from the same memory at the same time. A third memory with associated address and data buses is used for storing the value in the accumulator back to memory C. In addition to these data memory blocks, there is program memory and its dedicated program and data buses for fetching program instructions to avoid delays in accessing data.

As shown in Fig. 3, each memory has its own address bus, which originates with a DAGEN. Accessing the sequence of operands, $b_i$ and the corresponding $x(n-i)$, $i = 0, 1, \ldots, L-1$, is a regular sequential operation where these operands are stored in consecutive locations of memory. Each DAGEN simply increases (or decreases) the address pointer for pointing at the next data item within the same clock cycle when the multiplier and ALU are performing arithmetic operations; as a result, no extra clock cycle is needed to update the address pointer. After accessing the last coefficient, $b_{L-1}$, the coefficient pointer has to wrap around to $b_0$ for the next iteration. This operation can be performed by arranging the coefficient buffer in a circular fashion. Therefore, further improvements to the DAGEN include modulo-$L$ arithmetic for implementing circular buffers. In addition, the DAGEN supports bit-reversal addressing for computing FFT algorithms.

When processor speed is not a limiting factor, memory access time can be relaxed by using a slower clock rate. Another effective method is to let the processor run at full speed, but allow a number of wait states for accessing memory. These processors usually have on-chip memory configurable as program memory, thus allowing small programs to be stored and executed at full speed. The initial loading of the program from the external slow program memory can be accomplished using wait states.

DSP processors normally provide on-chip peripherals or peripheral interfaces to facilitate the integration of DSP with external devices, such as an ADC, a DAC, or other DSP processors or microprocessors. In addition, some internal peripherals are used to control and manage the clocking of DSP processors, the data transfer mechanism, and power management facilities.

Most modern DSP processors provide both a serial and parallel I/O capability. The serial port has the advantage of being separate from the data bus. Thus, it is not constrained by stringent access time and possible bus-conflict considerations. Modern processors have the following different types of serial ports: standard serial ports, buffered serial ports (BSP), time-division multiplexing serial ports, and multi-channel BSPs (McBSP).
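The FIR computation of Eq. (1), together with the modulo-L circular addressing that the DAGENs perform in hardware, can be illustrated with a short software model. The following Python sketch is illustrative only and is not code for any particular DSP processor: the names `CircularFIR`, `process`, and `fir_filter` are assumptions made for this example, and Python integers stand in for a fixed-width accumulator. Each pass through the inner loop corresponds to one MAC operation, and the `% self.L` index arithmetic plays the role of the pointer wrap-around that the hardware carries out without extra clock cycles.

```python
class CircularFIR:
    """Software model of an L-tap FIR filter, y(n) = sum_i b[i] * x(n - i)."""

    def __init__(self, coeffs):
        self.b = list(coeffs)          # filter coefficients b[0..L-1]
        self.L = len(self.b)
        self.x = [0] * self.L          # circular buffer of the last L samples
        self.newest = 0                # index of x(n) in the circular buffer

    def process(self, sample):
        """Accept one input sample x(n) and return the output y(n)."""
        # Step the write pointer back one slot (modulo L) and overwrite the
        # oldest sample, so no data has to be shifted through the buffer.
        self.newest = (self.newest - 1) % self.L
        self.x[self.newest] = sample

        acc = 0                        # accumulator
        idx = self.newest              # data pointer, starts at x(n)
        for i in range(self.L):
            acc += self.b[i] * self.x[idx]   # multiply-accumulate (MAC)
            idx = (idx + 1) % self.L         # modulo-L pointer update
        return acc


def fir_filter(coeffs, samples):
    """Filter a whole sequence, one sample at a time."""
    fir = CircularFIR(coeffs)
    return [fir.process(s) for s in samples]
```

For example, `fir_filter([1, 2, 3], [1, 0, 0, 0, 0])` returns `[1, 2, 3, 0, 0]`: a unit impulse at the input reproduces the coefficients at the output, as Eq. (1) predicts.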

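The bit-reversal addressing that the DAGEN provides for FFT algorithms can be sketched in the same way. In a radix-2 FFT of length N = 2^m, samples are accessed in the order obtained by reversing the m-bit binary representation of each index. The hardware generates these addresses directly; the sketch below computes them explicitly. The helper names `bit_reverse` and `bit_reversed_order` are hypothetical and chosen only for this illustration.

```python
def bit_reverse(index, num_bits):
    """Return index with its lowest num_bits bits in reversed order."""
    reversed_index = 0
    for _ in range(num_bits):
        # Shift the result left and append the current lowest bit of index.
        reversed_index = (reversed_index << 1) | (index & 1)
        index >>= 1
    return reversed_index


def bit_reversed_order(n):
    """List the bit-reversed access order for a length-n buffer (n a power of 2)."""
    if n <= 0 or n & (n - 1):
        raise ValueError("n must be a positive power of two")
    num_bits = n.bit_length() - 1      # m, where n = 2**m
    return [bit_reverse(i, num_bits) for i in range(n)]
```

For an 8-point buffer, `bit_reversed_order(8)` gives `[0, 4, 2, 6, 1, 5, 3, 7]`. Applying the reversal twice returns the original index.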
The direct-memory-access (DMA) controller is used to control data transfer of the DSP processor-memory space, which includes on- and off-chip memory and peripherals. It operates independently of the processor. The data transfer is done in block format, and the DMA controller sends an interrupt to the DSP processor when the transfer is complete. Typically, the DMA can handle multiple channels (e.g., six channels in the C54x), and the user can assign different priorities or the same priority to the channels.

With the increasing demand for running DSP-based products with less power and prolonged battery usage time, the DSP processor incorporates power-management features in addition to the conventional low-voltage approach. Several methods are used in power management: (1) clock-frequency control, (2) power-down mode, and (3) disabling peripherals.

3. SOFTWARE DEVELOPMENTS

A programming language states the algorithm in a manner that precisely defines its operations inside the processor. In documenting the algorithm, it is sometimes helpful to clarify which inputs and outputs are involved by means of signal-flow diagrams. It is essential to document programs thoroughly with titles and comments because doing so greatly simplifies the task of troubleshooting and also helps with program maintenance. For ease of understanding, it is also important to use meaningful mnemonics for variables, labels, subroutine names, etc.

In general, most execution bottlenecks occur in a few sections of DSP code, usually in the loops (especially inner loops) of a program. These loops may only occupy 10% of the code, but may take 90% of the time to execute. The best strategy is coding the entire algorithm in C first, identifying the time-critical bottlenecks, and then rewriting only that small percentage of code in assembly language. Because DSP C compilers generate intermediate assembly code for optimization, time-critical portions of code can be identified by using profiling capabilities and can be replaced with handcrafted assembly code. Another method is to use a library of hand-optimized functions coded in assembly language by the engineers or in the run-time library provided by the manufacturers. These assembly routines may either be called as functions or be coded in-line in the C program.

Software libraries become important as DSP algorithms become more complicated and computationally demanding. DSP manufacturers usually provide a set of commonly used signal-processing operations in software libraries that are written optimally for a particular processor. Because of the improvement in C-compiler efficiency and the availability of user-friendly integrated software development tools, a mix of C and assembly routines is the most effective way of developing programs for DSP systems.

4. SYSTEM CONSIDERATIONS

The same DSP processor family provides different devices to provide the best match for the given application. For example, the devices within the TMS320C54x family differ in the number of DSP cores, operating clock frequencies, voltages, on-chip ROM configurations, RAM configurations, type and number of serial ports, and host ports.

DSP processors are following the path of microprocessors in terms of performance and on-chip integration. At the same time, power consumption becomes an important issue for portable products. A DSP product design is constrained by the following key design goals:
1. Cost of the product
2. Cost of the design
3. Upgradeability
4. System integration
5. Power consumption
These design goals play key roles in selecting DSP processors.

The selection of a DSP processor suited to a given application is a complicated task. Some of the factors that might influence choice are cost, performance, future growth, and software and hardware development support. Using floating-point processors can increase the dynamic ranges of signals and coefficients. Floating-point processors are usually more expensive than fixed-point processors, but they are more suitable for high-level C programming. Thus, they are easier to use and allow a quicker time to market.

The execution speed of a DSP algorithm is also an important issue when selecting a processor. When performance is the most important factor, the algorithm must be implemented with optimized code written for each candidate processor, and the execution times must be compared. The time to complete a particular algorithm coded in optimized assembly language is called a benchmark. A benchmark can be used to give a general measure of the performance of a specific algorithm for a particular processor. Other related issues include memory size (on chip and externally addressable) and the availability of on-chip peripheral devices such as serial and parallel interfaces, timers, and multiprocessing capabilities. In addition, space, weight, and power requirements must be minimized. A key system constraint is the system cost. As with general-purpose microprocessors, second sourcing, third-party support, and industry standards are other important issues for consideration.

REFERENCES

[1] Sen M. Kuo and Woon S. Gan, Digital Signal Processors, Prentice Hall, Upper Saddle River, NJ, 2005.
[2] Sen M. Kuo and Bob H. Lee, Real-Time Digital Signal Processing, Wiley, Chichester, 2001.
[3] Phil Lapsley, Jeff Bier, Amit Shoham, and Edward A. Lee, DSP Processor Fundamentals, IEEE Press, Piscataway, NJ, 1995.

