
The Evolution of DSP Processors

Abstract:

The number and variety of products that include some form of digital signal processing (DSP) has grown dramatically. DSP has become a key component in many consumer, communications, medical, and industrial products, which use a variety of hardware approaches to implement DSP, ranging from the use of off-the-shelf DSP microprocessors to field-programmable gate arrays (FPGAs) to custom integrated circuits (ICs). In this article, we highlight some of the key differences among architectures and compare their strengths and weaknesses. Finally, we discuss the growing class of general-purpose processors that have been enhanced to address the needs of DSP applications.

INTRODUCTION

The number and variety of products that include some form of digital signal processing has grown dramatically over the last five years. DSP has become a key component in many consumer, communications, medical, and industrial products. These products use a variety of hardware approaches to implement DSP, ranging from the use of off-the-shelf microprocessors to field-programmable gate arrays to custom integrated circuits. Programmable "DSP processors", a class of microprocessors optimized for DSP, are a popular solution for several reasons. In this article, we trace the evolution of DSP processors, from early architectures to current state-of-the-art devices. Finally, we discuss the growing class of general-purpose processors that have been enhanced to address the needs of DSP applications.

DSP Algorithms Mold DSP Architectures

From the outset, DSP processor architectures have been molded by DSP algorithms. For nearly every feature found in a DSP processor, there are associated DSP algorithms whose computation is in some way eased by inclusion of this feature. As a case study, we will consider one of the most common signal processing algorithms, the FIR filter.

Fast Multipliers

The FIR filter is mathematically expressed as y[n] = Σ c[k]·x[n−k], summed over the filter's N taps (k = 0 … N−1), where x is a vector of input data and c is a vector of filter coefficients. For each "tap" of the filter, a data sample is multiplied by a filter coefficient, with the result added to a running sum for all of the taps. Hence, the main component of the FIR filter algorithm is a dot product: multiply and add, multiply and add. These operations are not unique to the FIR filter algorithm; in fact, multiplication is one of the most common operations performed in signal processing—convolution, IIR filtering, and Fourier transforms all involve heavy use of multiply-accumulate operations.
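To make this concrete, here is a minimal C sketch of the FIR dot product (the function and variable names are illustrative, not taken from any particular DSP library). The input pointer x is assumed to hold the N most recent samples, newest first, so x[k] corresponds to x[n−k] in the equation above; the loop body is exactly the multiply-accumulate that DSP hardware is built to execute in a single cycle.

```c
#include <stddef.h>

/* One output sample of an N-tap FIR filter: a dot product of the N most
 * recent input samples with the N filter coefficients.  Each iteration is
 * one "tap" -- a multiply followed by an add into a running sum, i.e. the
 * multiply-accumulate (MAC) operation that dominates DSP workloads. */
float fir_sample(const float *x, const float *c, size_t n_taps)
{
    float acc = 0.0f;                      /* running sum */
    for (size_t k = 0; k < n_taps; k++)
        acc += c[k] * x[k];                /* multiply-accumulate for tap k */
    return acc;
}
```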
Early general-purpose microprocessors needed many clock cycles to perform a single multiplication. In 1982, however, Texas Instruments introduced the first commercially successful "DSP processor", the TMS32010, which incorporated specialized hardware to enable it to compute a multiplication in a single clock cycle. DSP processors now include at least one dedicated single-cycle multiplier or combined multiply-accumulate (MAC) unit.

Multiple Execution Units

DSP processors often include several independent execution units that are capable of operating in parallel—for example, in addition to the MAC unit, they typically contain an arithmetic-logic unit (ALU) and a shifter.

Efficient Memory Accesses

Executing a MAC in every clock cycle requires more than just a single-cycle MAC unit. Figure 1 illustrates the differences in memory architectures for early general-purpose processors and early DSP processors. Because DSP algorithms consume two data operands per instruction, a further optimization commonly used is to include a small bank of RAM near the processor core that is used as an instruction cache; when a small group of instructions is executed repeatedly, fetching them from the cache frees memory bandwidth for operand accesses.
Figure 1: Differences in memory architecture for early general-purpose microprocessors vs. early DSP processors.

Memory accesses in DSP algorithms tend to exhibit very predictable patterns; for example, in an FIR filter the coefficients are accessed sequentially from start to finish for each sample, and then the accesses start over from the beginning for the next input sample. DSP processor address generation units take advantage of this predictability by supporting specialized addressing modes that enable the processor to efficiently access data in the patterns commonly found in DSP algorithms. The most common of these modes is register-indirect addressing with post-increment, which automatically increments the address pointer for algorithms where repetitive computations are performed on a series of data stored sequentially in memory. Without this feature, the programmer would need to spend instructions explicitly incrementing the address pointer. Many DSP processors also support "circular addressing", which allows the processor to access a block of data sequentially and then automatically wrap around to the beginning address—exactly the pattern used to access coefficients in FIR filtering.
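The following sketch (plain C, with hypothetical names) emulates in software what post-increment and circular addressing provide for free in hardware: the coefficient index simply advances through its array, while the delay-line index must be explicitly wrapped around the buffer.

```c
#include <stddef.h>

#define N_TAPS 8  /* illustrative filter length */

/* Software emulation of the access pattern that circular addressing
 * automates.  The delay line holds the last N_TAPS input samples; the
 * index must be wrapped around the ends of the buffer by hand, whereas a
 * circular-addressing unit wraps the address pointer for free. */
float fir_step(float delay[N_TAPS], const float coeff[N_TAPS],
               size_t *pos, float new_sample)
{
    delay[*pos] = new_sample;                 /* overwrite the oldest sample */
    float acc = 0.0f;
    size_t idx = *pos;                        /* newest sample first */
    for (size_t k = 0; k < N_TAPS; k++) {
        acc += coeff[k] * delay[idx];         /* multiply-accumulate */
        idx = (idx == 0) ? N_TAPS - 1 : idx - 1;    /* explicit wrap-around */
    }
    *pos = (*pos + 1 == N_TAPS) ? 0 : *pos + 1;     /* advance write pointer */
    return acc;
}
```

On a processor with post-increment and circular addressing, the two index-update lines disappear into the address generation unit, leaving only the multiply-accumulate in the loop body.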
Data Format

In a fixed-point format, the binary point is located at a fixed position in the data word. This is in contrast to floating-point formats, in which numbers are expressed using an exponent and a mantissa; the binary point essentially "floats" based on the value of the exponent. Floating-point formats allow a much wider range of values to be represented, and virtually eliminate the hazard of numeric overflow in most applications. In many applications, however, DSP processors face additional constraints: they must be inexpensive and provide good energy efficiency, which generally favors simpler fixed-point hardware.

Sensitivity to cost and energy consumption also influences the data word width used in DSP processors. DSP processors tend to use the shortest data word that will provide adequate accuracy in their target applications. Most fixed-point DSP processors use 16-bit data words, because that word width is sufficient for many DSP applications. A few fixed-point DSP processors use 20, 24, or even 32 bits to enable better accuracy in applications that are difficult to implement well with 16-bit data, such as high-fidelity audio processing.

To ensure adequate signal quality while using fixed-point data, DSP processors typically include specialized hardware to help programmers maintain numeric fidelity throughout a series of computations. For example, most DSP processors include one or more "accumulator" registers to hold the results of summing several multiplication products.
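To illustrate why wide accumulators matter, here is a C sketch of a 16-bit fixed-point dot product. The Q1.15 format and the 32-bit accumulator are assumptions made for the example; many real DSPs provide even wider accumulators (for example, 40 bits) whose extra "guard bits" absorb intermediate growth over long sums.

```c
#include <stdint.h>
#include <stddef.h>

/* Q1.15 fixed point: a 16-bit word represents a value in [-1, 1) with the
 * binary point fixed after the sign bit.  The product of two Q1.15 values
 * is a Q2.30 value; summing many such products in a wide accumulator
 * preserves precision and postpones overflow until the result is rounded
 * back down to 16 bits at the very end. */
int16_t fir_q15(const int16_t *x, const int16_t *c, size_t n_taps)
{
    int32_t acc = 0;                               /* wide accumulator */
    for (size_t k = 0; k < n_taps; k++)
        acc += (int32_t)c[k] * (int32_t)x[k];      /* Q1.15 * Q1.15 -> Q2.30 */

    acc += 1 << 14;                                /* round to nearest */
    acc >>= 15;                                    /* rescale to Q1.15 */
    if (acc >  32767) acc =  32767;                /* saturate on overflow */
    if (acc < -32768) acc = -32768;
    return (int16_t)acc;
}
```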
combined in an instruction.The overall
Zero-Overhead Looping

DSP processors provide special support for efficient looping. Often, a special loop or repeat instruction is provided which allows the programmer to implement a for-next loop without expending any clock cycles for updating and testing the loop counter or branching back to the top of the loop. This feature is often referred to as "zero-overhead looping".
+.Life isn’t quite so simple for the DSP
Streamlined I/O processors programmer, because high-

24
volume DSP applications unlike other Midrange DSP processrors achieve
types of applications, are often written in higher performance than the low cost
assembly language. DSP’s described above through a
There are two main reasons why DSPs aren't usually programmed in high-level languages. The first is that most widely used high-level languages, such as C, are not well suited for describing typical DSP algorithms. The second reason is that conventional DSP architectures, with their multiple memory spaces, multiple buses, irregular instruction sets, and highly specialized hardware, are difficult for compilers to use effectively. For these reasons, programmers often consider the palatability of the instruction set of a DSP processor as a key aspect of its overall desirability.

The Current DSP Landscape

Conventional DSP processors

In the low-cost, low-performance range are the industry workhorses, which are based on conventional DSP architectures. These processors are quite similar architecturally to the original DSP processors of the early 1980s. They issue and execute one instruction per clock cycle, and use the complex, multi-operation type of instructions described earlier. These processors typically include a single multiplier or MAC unit and an ALU, but few additional execution units, if any. Included in this group are Analog Devices' ADSP-21xx family, Texas Instruments' TMS320C2xx family, and Motorola's DSP560xx family. These processors generally operate at around 20-50 MHz, and provide good DSP performance while maintaining very modest power consumption and memory usage.

Midrange DSP processors achieve higher performance than the low-cost DSPs described above through a combination of increased clock speeds and somewhat more sophisticated architectures. DSP processors like the Motorola DSP563xx and Texas Instruments TMS320C54x operate at 100-150 MHz and often include a modest amount of additional hardware, such as a barrel shifter or an instruction cache, to improve performance in common DSP algorithms. Processors in this class also tend to have deeper pipelines than their lower-performance cousins. Midrange DSP processors are able to achieve noticeably better performance while keeping energy and power consumption low. Processors in this performance range are typically used in wireless telecommunications applications and high-speed modems.

Enhanced conventional DSP processors

One approach is to extend conventional DSP architectures by adding parallel execution units, typically a second multiplier and adder. Enhanced-conventional DSP processors typically have wider data buses to allow them to retrieve more data words per clock cycle in order to keep the additional execution units fed. They may also use wider instruction words to accommodate the specification of additional parallel operations within a single instruction. Increases in cost and power consumption due to the additional hardware and architectural complexity are largely offset by increased performance, allowing these processors to maintain cost-performance and energy consumption similar to those of previous-generation DSPs.
Multi-Issue Architectures

With the goals of achieving high performance and creating an architecture that lends itself to the use of compilers, some newer DSP processors use a "multi-issue" approach. In contrast to conventional and enhanced-conventional processors, multi-issue processors use very simple instructions that typically encode a single operation. These processors achieve a high level of parallelism by issuing and executing instructions in parallel groups rather than one at a time. Using simple instructions simplifies instruction decoding and execution, allowing multi-issue processors to execute at higher clock rates than conventional or enhanced-conventional DSP processors. TI was the first DSP processor vendor to use this approach in a commercial DSP processor. TI's multi-issue TMS320C62xx, introduced in 1996, was dramatically faster than any DSP processor available at the time. Other vendors have since followed suit, and now all four of the major DSP processor vendors are employing multi-issue architectures for their latest high-performance processors.

The two classes of architectures that execute multiple instructions in parallel are referred to as VLIW (Very Long Instruction Word) and superscalar. VLIW and superscalar architectures provide many execution units, each of which executes its own instruction.

Figure 3 illustrates the execution units and buses of the TMS320C62xx, which contains eight independent execution units. VLIW DSP processors typically issue a maximum of between four and eight instructions per clock cycle, which are fetched and issued as part of one long super-instruction—hence the name "very long instruction word". Superscalar processors typically issue and execute fewer instructions per cycle, usually between two and four.

In a superscalar processor, dedicated hardware decides at run time which instructions will be executed in parallel; the processor may group the same set of instructions differently at different times in the program's execution. For example, it may group instructions one way the first time it executes a loop, then group them differently for subsequent iterations. The difference in the way these two types of architectures schedule instructions for parallel execution is important in the context of using them in real-time DSP applications.

Because superscalar processors dynamically schedule parallel operations, it may be difficult for the programmer to predict exactly how long a given segment of software will take to execute. The execution time may vary based on the particular data accessed, whether the processor is executing a loop for the first time or the third, or whether it has just finished processing an interrupt. DSP processors have traditionally avoided such dynamic features for just these reasons; this may be why there is currently only one example of a commercially available superscalar DSP processor.

Although their instructions are very simple and typically encode only one operation, most current VLIW processors use wider instruction words than conventional DSP processors—for example, 32 bits instead of 16. There are a number of reasons for using a wider instruction word. In VLIW architectures, a wide instruction word may be required in order to specify information about which functional unit will execute the instruction. Wider instructions also allow the use of larger, more uniform register sets, which in turn enable higher performance. In addition, the use of wide instructions allows a higher degree of consistency and regularity in the instruction set, making VLIW processors better compiler targets.

There are disadvantages, however, to using wide, simple instructions. Since each VLIW instruction is simpler than a conventional DSP processor instruction, VLIW processors tend to require many more instructions to perform a given task. Combined with the fact that the instruction words are typically wider than those found on conventional DSP processors, this characteristic results in relatively high program memory usage, which, in turn, may result in higher chip or system cost because of the need for additional ROM or RAM.

Traditionally, VLIW processors have used the position of each instruction within the super-instruction to determine where the instruction will be routed. Some recent VLIW architectures do not use positional super-instructions, however, and instead include routing information within each sub-instruction.

To support execution of multiple parallel instructions, VLIW and superscalar processors must have sufficient instruction decoders, buses, registers, and memory bandwidth. VLIW processors typically use either wide buses or multiple independent buses to access data memory and keep the multiple execution units fed with data.

These multi-issue processors often omit some of the features that were, until recently, considered virtually part of the definition of a "DSP processor." For example, the TMS320C62xx does not include zero-overhead looping instructions; it requires the processor to explicitly perform the operations associated with maintaining a loop. This does not necessarily result in a loss of performance, however, since VLIW-based processors are able to execute many instructions in parallel.

VLIW and superscalar processors often suffer from high energy consumption relative to conventional DSP processors; in general, multi-issue processors are designed with an emphasis on increased speed rather than energy efficiency. These processors often have more execution units active in parallel than conventional DSP processors, and they use wide on-chip buses and memory banks to accommodate multiple parallel instructions and to keep the multiple execution units supplied with data, all of which contribute to increased energy consumption.

Because they often have high memory usage and energy consumption, VLIW and superscalar processors have mainly targeted applications which have very demanding computational requirements but are not very sensitive to cost or energy efficiency.

SIMD

SIMD, or single-instruction, multiple-data, improves performance on some algorithms by allowing the processor to execute multiple instances of the same operation in parallel using different data. For example, a SIMD multiplication instruction could perform two or more multiplications on different sets of input operands in parallel in a single clock cycle. This technique can greatly increase the rate of computation for some vector operations that are heavily used in multimedia and signal processing applications.

On DSP processors with SIMD capabilities, the underlying hardware that supports SIMD operations varies widely. Analog Devices, for example, added SIMD support to its SHARC family by providing a second, complete set of execution units; the new architecture is called the ADSP-2116x. Each set of execution units in the ADSP-2116x includes a MAC unit, ALU, and shifter, and each has its own set of operand registers.

In contrast, instead of having multiple sets of the same execution units, some DSP processors can logically split their execution units into multiple sub-units that process narrower operands. These processors treat operands in long registers as multiple short operands. Perhaps the most extensive SIMD capabilities we have seen in a DSP processor to date are found in Analog Devices' TigerSHARC processor. TigerSHARC is a VLIW architecture, and it combines the two types of SIMD: one instruction can control execution of the processor's two sets of execution units, and this instruction can specify a split-execution-unit operation that will be executed in each set. Using this hierarchical SIMD capability, TigerSHARC can execute eight 16-bit multiplications per cycle, for example. Figure 4 illustrates TigerSHARC's SIMD capabilities.
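A plain-C sketch of the split-register style of SIMD described above: two 16-bit samples packed into one 32-bit word are added with a single call, which is what a split execution unit does in one cycle. The function name and the packing convention are assumptions for the example; packing and unpacking in software costs more than it saves, so the point is only to show what the hardware lanes compute.

```c
#include <stdint.h>

/* Treat one 32-bit word as two independent 16-bit lanes and add the lanes
 * pairwise.  A SIMD-capable execution unit performs the same computation
 * in a single operation, with carries prevented from crossing the 16-bit
 * lane boundary. */
static uint32_t simd_add16(uint32_t a, uint32_t b)
{
    uint16_t lo = (uint16_t)((a & 0xFFFFu) + (b & 0xFFFFu));  /* low lane  */
    uint16_t hi = (uint16_t)((a >> 16)     + (b >> 16));      /* high lane */
    return ((uint32_t)hi << 16) | lo;                         /* repack    */
}
```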
Alternatives to DSP Processors

High-Performance CPUs

Many high-end CPUs, such as Pentiums and PowerPCs, have been enhanced to increase the speed of computations associated with signal processing tasks. The most common modification is the addition of SIMD-based instruction-set extensions, such as MMX and SSE for the Pentium, and AltiVec for the PowerPC. This approach is a good one for CPUs, which typically have wide resources that can be treated as multiple smaller resources to increase performance. For example, a CPU with a 64-bit data bus, 64-bit registers, and 64-bit ALUs can be treated as having four times as many 16-bit data buses, registers, and ALUs, resulting in up to four times the performance on 16-bit data. Image processing, which tends to be based on 8-bit data, can be sped up even further. Using this approach, general-purpose processors are often able to achieve performance on DSP algorithms that is better than that of even the fastest DSP processors. This surprising result is partly due to the effectiveness of SIMD, but also because many CPUs operate at extremely high clock speeds in comparison to DSP processors; high-performance CPUs typically operate at upwards of 500 MHz, while the fastest DSP processors are in the 200-250 MHz range.
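As a rough illustration of such instruction-set extensions, the sketch below uses the SSE2 integer intrinsics available in current C compilers (emmintrin.h). SSE2 post-dates the MMX, SSE, and AltiVec extensions named above, but it shows the same idea: one instruction operating on eight packed 16-bit values at once. The function name, and the assumption that len is a multiple of eight, are illustrative.

```c
#include <emmintrin.h>   /* SSE2 integer intrinsics */
#include <stdint.h>
#include <stddef.h>

/* Multiply two blocks of 16-bit samples element by element, eight lanes
 * per iteration.  Each intrinsic maps to a single SIMD instruction. */
void vec_mul16(const int16_t *a, const int16_t *b, int16_t *out, size_t len)
{
    for (size_t i = 0; i < len; i += 8) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        __m128i vp = _mm_mullo_epi16(va, vb);   /* low 16 bits of each product */
        _mm_storeu_si128((__m128i *)(out + i), vp);
    }
}
```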
DSP/Microcontroller Hybrids

There are many lower-cost general-purpose processors, referred to as "microcontrollers", that are designed to execute control-oriented tasks efficiently. These processors are often used in control applications where the computational requirements are modest but where factors that influence product cost and time-to-market, such as low program memory use and the availability of compilers, are important. Many applications require a mixture of control-oriented software and DSP software. An example is the digital cellular phone, which must implement both supervisory tasks and voice-processing tasks.

Conclusions

DSP processor architectures are evolving to meet the changing needs of DSP applications. The architectural homogeneity that prevailed during the first decade of commercial DSP processors has given way to rich diversity. Some of the forces driving the evolution of DSP processors today include the perennial push for increased speed, decreased energy consumption, decreased memory usage, and decreased cost, all of which enable DSP processors to better meet the needs of new and existing applications. Of increasing influence is the need for architectures that facilitate development of more efficient compilers, allowing DSP applications to be written primarily in high-level languages. This has become a focal point in the design of new DSP processors, because DSP applications are growing too large to comfortably implement (and maintain) in assembly language. As the needs of DSP applications continue to change, we expect to see a continuing evolution in DSP processors.
but where factors that influence product

