INTRODUCTION TO DSP PROCESSORS Unit-5

Features of DSP Processors
• DSP processors should have multiple registers so that data can be exchanged quickly from register to register.
• DSP operations require multiple operands.
• DSP processors should have circular buffers to support circular (wrap-around) operations.
• DSP processors should be able to perform multiply and accumulate operations.
• DSP processors can be used alongside general-purpose processors.
• To perform DSP operations quickly, DSP processors should have on-chip memory.
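A circular buffer can be sketched in Python as a fixed-length delay line. Its write index wraps around modulo the buffer length, so the oldest sample is overwritten instead of every sample being shifted. The class and method names are hypothetical. On a DSP the wrap-around is performed by the address-generation hardware, not by an explicit modulo operation as in this sketch.

```python
class CircularBuffer:
    """Fixed-length delay line whose write index wraps around (modulo addressing)."""

    def __init__(self, length):
        self.data = [0] * length
        self.head = 0  # index where the next sample will be written

    def push(self, sample):
        # Overwrite the oldest sample, then advance the index with wrap-around.
        self.data[self.head] = sample
        self.head = (self.head + 1) % len(self.data)

    def recent(self, k):
        """Return the sample written k pushes ago (k = 0 is the newest)."""
        return self.data[(self.head - 1 - k) % len(self.data)]


buf = CircularBuffer(4)
for s in [1, 2, 3, 4, 5, 6]:
    buf.push(s)
print([buf.recent(k) for k in range(4)])  # [6, 5, 4, 3]
```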
Introduction to programmable DSPs
Multiplier and Multiplier-Accumulator (MAC)
• Multipliers:
• Most DSP operations involve array multiplications. Operations such as convolution and correlation require multiply and accumulate operations.
MACs – Multiply-Accumulates:
• In one clock cycle, the ALU of a DSP can perform a multiplication and an addition.
• Used in:
• Vector dot products
• Correlation
• Filters
• Fourier Transforms
• In addition to these ALU changes, the bus structure must also change.
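The following Python sketch shows how correlation breaks down into repeated MAC operations. It computes the cross-correlation r[k] = Σ x[n]·y[n+k] for non-negative lags only, and each inner-loop step is one multiply-accumulate. The function name and the restriction to non-negative lags are assumptions made for illustration.

```python
def cross_correlation(x, y, max_lag):
    """r[k] = sum_n x[n] * y[n + k] for k = 0..max_lag (non-negative lags only)."""
    r = []
    for k in range(max_lag + 1):
        acc = 0  # accumulator, cleared once per output value
        for n in range(len(x)):
            if n + k < len(y):
                acc += x[n] * y[n + k]  # one multiply-accumulate (MAC) step
        r.append(acc)
    return r


print(cross_correlation([1, 2, 3], [1, 2, 3], 2))  # [14, 8, 3]
```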
Multiple Memory Accesses
• Complete many memory accesses in a single clock cycle
– The processor can fetch instructions while also fetching operands or storing to memory
– During FIR filtering, the processor can perform a multiply and accumulate while loading the operand and coefficient for the next cycle
– Three reads and one or two writes per cycle
• This requires multiple memory buses on the same chip, not just one address bus and one data bus.
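The overlap of a MAC with the loading of the next operands can be sketched as a software-pipelined FIR inner loop. In each iteration, the coefficient and sample for step i+1 are loaded while the MAC for step i is performed. A DSP with multiple buses does both in one cycle. Python executes the statements one after another, so this sketch only shows the data flow. The function name `fir_output` and its argument layout are hypothetical.

```python
def fir_output(h, x_hist):
    """Return y = sum_i h[i] * x_hist[i], overlapping loads with MACs.

    h      -- filter coefficients h[0..N-1]
    x_hist -- input history with x_hist[i] = x[n - i] (at least N values)
    """
    n_taps = len(h)
    if n_taps == 0:
        return 0
    acc = 0
    coef, sample = h[0], x_hist[0]  # prologue: load the first operand pair
    for i in range(1, n_taps):
        next_coef, next_sample = h[i], x_hist[i]  # load operands for the next step
        acc += coef * sample                       # MAC for the current step
        coef, sample = next_coef, next_sample
    acc += coef * sample  # epilogue: final MAC, nothing left to load
    return acc


print(fir_output([1, 2, 3], [4, 5, 6]))  # 32
```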
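Single-Cycle MAC unit
(Figure: the multiplier forms the product $a_i x_i$. The adder adds this product to the register contents, e.g. $a_{i-1} x_{i-1}$, and writes $a_i x_i + a_{i-1} x_{i-1}$ back to the register.)
• The register accumulates $\sum_{i=0}^{n-1} a_i x_i$.
• Can compute a sum of n products in n cycles.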
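Below is a cycle-by-cycle Python sketch of the single-cycle MAC unit. In each simulated cycle, the multiplier forms a_i·x_i and the adder adds it to the register, so n products take n cycles. The 16-bit signed inputs and the 32-bit accumulator with wrap-around overflow are assumptions chosen to resemble a fixed-point DSP. Real devices differ in accumulator width and may saturate instead of wrapping.

```python
def wrap32(value):
    """Reduce value to a 32-bit two's-complement number (wrap-around overflow)."""
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value & 0x80000000 else value


def mac_unit(a, x):
    """Return (sum of a[i] * x[i], cycle count), one multiply-add per cycle."""
    register = 0  # accumulator register, cleared before the sum starts
    cycles = 0
    for a_i, x_i in zip(a, x):
        if not (-32768 <= a_i <= 32767 and -32768 <= x_i <= 32767):
            raise ValueError("inputs must fit in 16 bits")
        product = a_i * x_i                    # multiplier output
        register = wrap32(register + product)  # adder result written back
        cycles += 1                            # one cycle per product
    return register, cycles


print(mac_unit([1, 2, 3], [4, 5, 6]))  # (32, 3)
```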
Modified Bus Structures and Memory Access Schemes in DSPs
The MACD instruction performs a multiply and accumulate with data move, which requires the following accesses:
• Fetch the MACD instruction from program memory.
• Fetch one of the operands from program memory.
• Fetch the second operand from data memory.
• Write to data memory.
• Based on these requirements, two types of architecture can be used:
1. Von-Neumann Architecture
2. Harvard Architecture.
Von-Neumann Architecture

• General-purpose processors normally have this type of architecture. The architecture shares the same memory for program and data.
• The processor performs instruction fetch, decode and execute operations sequentially.
• The speed can be increased by pipelining.
Harvard Architecture
• The Harvard Architecture has separate memories for program and data.
• There are also separate address and data buses for program and data, because it has separate on-chip memories and internal buses.
(Figure: the CPU connected to a separate PROGRAM MEMORY and DATA MEMORY.)
Multiport memory
• A multiport memory can interface with multiple address and data buses.
• A dual-port memory allows two memory accesses in a single clock period.
• Multiport memories need more pins and a larger chip area, which makes them expensive and physically large.
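To compare the architectures above, the Python sketch below counts the minimum number of cycles one MACD instruction would need under a deliberately simplified model. Each memory serves as many accesses per cycle as it has ports, separate memories work in parallel, and pipelining and bus conflicts are ignored. The dictionary and function names are hypothetical.

```python
import math

# Memory accesses needed by one MACD instruction, grouped by the memory that
# holds each item (following the list of MACD accesses given earlier).
MACD_ACCESSES = {
    "program": ["fetch MACD instruction", "fetch coefficient operand"],
    "data": ["read data operand", "write data (data move)"],
}


def macd_cycles(memories, ports_per_memory):
    """Minimum cycles if each memory serves `ports` accesses per cycle.

    memories maps a memory name to the access groups stored in it;
    different memories are assumed to operate in parallel.
    """
    cycles = 0
    for name, groups in memories.items():
        count = sum(len(MACD_ACCESSES[g]) for g in groups)
        cycles = max(cycles, math.ceil(count / ports_per_memory[name]))
    return cycles


von_neumann = macd_cycles({"shared": ["program", "data"]}, {"shared": 1})
harvard = macd_cycles({"pm": ["program"], "dm": ["data"]}, {"pm": 1, "dm": 1})
harvard_dual_port = macd_cycles({"pm": ["program"], "dm": ["data"]}, {"pm": 2, "dm": 2})
print(von_neumann, harvard, harvard_dual_port)  # 4 2 1
```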
VLIW Architectures
• Very Long Instruction Word Architecture
⇒One instruction specifies multiple operations
⇒All scheduling of execution units is static
→Done by compiler
⇒Static scheduling should mean less control logic and a higher clock speed. Less control logic leaves more room for execution units.
• Currently a very popular architecture in embedded applications
⇒DSP, Multimedia applications
⇒No compiled legacy code to support; all code libraries are in some form of high-level language
Basic Working Principles of VLIW
• Aims to speed up computation by exploiting instruction-level parallelism.
• Same hardware core as superscalar processors, having multiple execution units (EUs) working in parallel.
• An instruction consists of multiple operations; typical word lengths range from 52 bits to 1 Kbit.
• All operations in an instruction are executed in lock-step mode.
• One or multiple register files for fixed-point (FX) and floating-point (FP) data.
• Relies on the compiler to find parallelism and schedule dependency-free program code.
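The compiler's role can be illustrated with a much-simplified Python sketch. It packs operations into VLIW instructions greedily and in program order, starting a new instruction when a slot limit is reached or when an operation depends on a result produced in the current instruction. A real VLIW compiler also reorders operations and models latencies. The (dest, sources) representation and the function name are assumptions.

```python
def pack_bundles(ops, width):
    """Greedily pack operations, in program order, into VLIW instructions.

    ops   -- list of (dest, sources) pairs, e.g. ("F", ("a", "b"))
    width -- number of operation slots per VLIW instruction
    An operation joins the current instruction only if it does not read
    (read-after-write) or overwrite (write-after-write) a result produced
    in that same instruction. Write-after-read is allowed because, in
    lock-step execution, all operands are read before results are written.
    """
    bundles, current = [], []
    written = set()  # destinations written by the current instruction
    for dest, sources in ops:
        depends = dest in written or any(s in written for s in sources)
        if current and (depends or len(current) == width):
            bundles.append(current)  # close it; any leftover slots stay empty
            current, written = [], set()
        current.append((dest, sources))
        written.add(dest)
    if current:
        bundles.append(current)
    return bundles


ops = [("F", ("a", "b")), ("c", ("e", "g")), ("d", ("x", "y")),
       ("w", ("z", "h")), ("v", ("F", "c"))]  # v depends on F and c
print([[dest for dest, _ in b] for b in pack_bundles(ops, 4)])
# [['F', 'c', 'd', 'w'], ['v']]
```

In this result, the second instruction uses only one of its four slots. The three empty slots are the wasted opcode space that appears among the VLIW disadvantages.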
Basic VLIW Approach
Very Long Instruction Word (VLIW)
• A technique for instruction-level parallelism that executes instructions without dependencies (known at compile time) in parallel.
• Example of a single VLIW instruction:
F=a+b; c=e/g; d=x&y; w=z*h;
(Figure: the VLIW instruction F=a+b | c=e/g | d=x&y | w=z*h is split across four processing units (PUs). One PU computes F from a and b, one computes c from e and g, one computes d from x and y, and one computes w from z and h.)
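The example instruction can be modelled in Python to show lock-step execution. Every slot reads its operands before any slot writes its result. The register values, the slot tuple layout and the use of integer division for c=e/g are assumptions of this sketch.

```python
import operator

# One VLIW instruction: four operations, one per processing unit (PU).
# Each slot is (destination, operation, source1, source2).
VLIW_INSTRUCTION = [
    ("F", operator.add, "a", "b"),       # F = a + b
    ("c", operator.floordiv, "e", "g"),  # c = e / g (integer division assumed)
    ("d", operator.and_, "x", "y"),      # d = x & y
    ("w", operator.mul, "z", "h"),       # w = z * h
]


def execute_lockstep(instruction, regs):
    """Execute every slot of one VLIW instruction in lock-step.

    All PUs read their operands first; only then are all results written,
    so no slot sees another slot's result within the same instruction.
    """
    results = [(dest, op(regs[s1], regs[s2])) for dest, op, s1, s2 in instruction]
    new_regs = dict(regs)  # leave the caller's register file untouched
    for dest, value in results:
        new_regs[dest] = value
    return new_regs


regs = {"a": 1, "b": 2, "e": 9, "g": 3, "x": 6, "y": 3, "z": 4, "h": 5}
out = execute_lockstep(VLIW_INSTRUCTION, regs)
print(out["F"], out["c"], out["d"], out["w"])  # 3 3 2 20
```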
Advantages & Disadvantages of VLIW
• Advantages:
– Reduced hardware complexity.
– Tasks such as decoding, data dependency detection and instruction issue, among others, become simpler.
– Potentially higher clock rate.
– Higher degree of parallelism with global program information.
• Disadvantages:
– Higher complexity of the compiler.
– When operation slots in a VLIW instruction are left unfilled, memory space and instruction bandwidth are wasted.
DSP Characteristics

• Arithmetic Format (fixed-point and floating-point)
• Bus Width (fixed-point: 16-bit data bus; floating-point: 32-bit)
• Speed
• Memory/Bus/Instruction architecture
• Development Tools
• Power Consumption
• Cost
• Specialized Hardware
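A common way to represent fractions on a 16-bit fixed-point DSP is the Q15 format, in which an integer q stands for q/32768. The Python sketch below converts to and from Q15 and multiplies two Q15 values. The choice of Q15, saturation on conversion, and truncation of the product are assumptions. Individual processors and libraries may round or scale differently.

```python
Q15_SCALE = 1 << 15  # 32768: one unit in Q15 represents 2**-15


def float_to_q15(value):
    """Round a real number to a 16-bit Q15 integer, saturating at the limits."""
    q = round(value * Q15_SCALE)  # Python rounds halves to even
    return max(-Q15_SCALE, min(Q15_SCALE - 1, q))


def q15_to_float(q):
    """Convert a Q15 integer back to a real number."""
    return q / Q15_SCALE


def q15_mul(a, b):
    """Multiply two Q15 numbers: the 16 x 16 product is Q30, so shift right by 15."""
    product = a * b  # Q30 result
    return max(-Q15_SCALE, min(Q15_SCALE - 1, product >> 15))


half = float_to_q15(0.5)
print(half, q15_to_float(q15_mul(half, half)))  # 16384 0.25
```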
Pipelining
Special addressing
• Short Immediate Addressing
• Short Direct Addressing
• Memory-Mapped Addressing
• Indirect Addressing
• Bit-Reversed Addressing
• Circular Addressing
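Bit-reversed addressing is used mainly for radix-2 FFTs, whose input or output appears in bit-reversed order. The Python sketch below shows the index mapping that such addressing produces in hardware. The function names are hypothetical.

```python
def bit_reverse(index, bits):
    """Reverse the lowest `bits` bits of index (e.g. 0b001 -> 0b100 for bits=3)."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)  # move the LSB of index into result
        index >>= 1
    return result


def bit_reversed_order(samples):
    """Reorder a power-of-two-length sequence into bit-reversed order."""
    n = len(samples)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    bits = n.bit_length() - 1
    return [samples[bit_reverse(i, bits)] for i in range(n)]


print([bit_reverse(i, 3) for i in range(8)])  # [0, 4, 2, 6, 1, 5, 3, 7]
```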
Architecture of TMS320C5x - Introduction
History of the TMS320 family
This family currently includes five generations of DSPs: TMS320C1x, TMS320C2x, TMS320C3x, TMS320C4x and TMS320C5x.
The TMS320C25, a CMOS 40-MHz digital signal processor capable of twice the performance of the TMS320C1x devices, can execute 10 million instructions per second. It offers:
• 24 additional instructions (133 total)
• eight auxiliary registers
• an eight-level hardware stack
• 4K words of on-chip program ROM
• low power dissipation inherent to CMOS
Features of TMS320C5x Processors
• Powerful 16-bit CPU.
• 50 ns single-cycle instruction execution time for 5 V operation.
• 16 x 16-bit multiply/add operations can be performed in a single cycle.
• 224K x 16-bit maximum addressable external memory space. This space is divided into 64K program, 64K data, 64K I/O and 32K global memory.
• Up to 32K x 16-bit single-access on-chip program ROM.
• 1K x 16-bit dual-access on-chip program/data RAM.
• On-chip timer for control operations.
• Buffered serial port.
• Hardware/software wait-state generation capability.
Architecture of TMS320C5x
Fig: Simplified block diagram of TMS320C5x
Bus structure
• The separate program and data buses in the advanced Harvard architecture of the C5x maximize processing power by providing a high degree of parallelism.
• Many DSP applications can be implemented with a single-cycle multiply/accumulate instruction that has a data move option.
• The C5x architecture has four buses:
i. Program bus (PB)
ii. Program address bus (PAB)
iii. Data read bus (DB)
iv. Data read address bus (DAB)
Central Processing Unit
• The central processing unit consists of the following elements:
i. Central arithmetic logic unit (CALU)
ii. Parallel logic unit (PLU)
iii. Auxiliary register arithmetic unit (ARAU)
iv. Memory mapped registers
v. Program controller
Central Arithmetic Logic Unit (CALU)
• The CALU is used to perform 2's complement arithmetic.
• The CALU consists of
i. 32-bit arithmetic logic unit (ALU)
ii. 32-bit accumulator (ACC)
iii. Scaling shifters
iv. 16 x 16 parallel multiplier
v. 32-bit accumulator buffer (ACCB)
CALU (contd..)
• A typical ALU instruction:
1. Data is fetched from the RAM on the data
bus.
2. Data is passed through the scaling shifter and
the ALU.
3. The result is moved into the accumulator.
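The three ALU-instruction steps can be sketched as a Python model of an add-with-shift. A 16-bit word read from data memory is optionally sign-extended, shifted left by the scaling shifter and then added into the 32-bit accumulator. This is a simplified model with hypothetical names. The `sign_extend` flag only loosely stands in for the processor's sign-extension mode, overflow simply wraps around, and saturation modes are not modelled.

```python
def to_signed(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def calu_add(acc, data_word, shift, sign_extend=True):
    """Model ACC <- ACC + (data_word << shift) with 32-bit wrap-around.

    acc       -- current accumulator contents (signed 32-bit)
    data_word -- 16-bit value fetched from data memory (0..0xFFFF)
    shift     -- scaling-shifter left-shift count, 0..16
    """
    if not 0 <= shift <= 16:
        raise ValueError("shift must be 0..16")
    # Step 1: the fetched word, sign-extended or zero-filled.
    operand = to_signed(data_word, 16) if sign_extend else data_word & 0xFFFF
    shifted = operand << shift          # Step 2: scaling shifter
    return to_signed(acc + shifted, 32)  # Step 3: 32-bit ALU result into ACC


print(calu_add(0, 0xFFFF, 4))  # -16
```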
Parallel Logic Unit (PLU)
• The parallel logic unit (PLU) is the second logic unit.
• It executes logic operations on data without affecting the contents of the ACC.
• The PLU provides bit manipulation, which can be used to set, clear, test or toggle bits in control or status registers or in data memory.
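The four PLU bit operations can be modelled on a 16-bit word with plain Python masks. The function names are hypothetical. The accumulator is untouched here only because the model contains none, which mirrors the PLU's independence from the ACC.

```python
MASK16 = 0xFFFF  # keep every result within a 16-bit word


def plu_set(word, mask):
    """Set the bits selected by mask (OR)."""
    return (word | mask) & MASK16


def plu_clear(word, mask):
    """Clear the bits selected by mask (AND with the complemented mask)."""
    return word & ~mask & MASK16


def plu_toggle(word, mask):
    """Toggle the bits selected by mask (XOR)."""
    return (word ^ mask) & MASK16


def plu_test(word, mask):
    """Return True if any bit selected by mask is set."""
    return (word & mask) != 0


status = 0x00F0
status = plu_set(status, 0x000F)    # 0x00FF
status = plu_clear(status, 0x00F0)  # 0x000F
print(hex(status), plu_test(status, 0x0001))  # 0xf True
```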
Auxiliary Registers and Auxiliary Register Arithmetic Unit (ARAU)
• There is a register file of eight auxiliary registers. These registers are used as address pointers for indirect addressing and for temporary data storage.
• The auxiliary register file (AR0-AR7) is connected to the auxiliary register arithmetic unit (ARAU).
• The contents of the auxiliary registers can be stored in data memory.
Index Register (INDX)
• The index register (INDX) is a 16-bit register.
• The ARAU adds the value stored in INDX to, or subtracts it from, the contents of an auxiliary register (AR) to get a new address.
• This mode is called indirect addressing.
• Bit-reversed addressing also uses INDX.
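The following simplified Python model shows how the ARAU might post-modify an auxiliary register using INDX. Bit-reversed stepping is modelled as addition with the carry propagating from the most significant bit toward the least significant bit ("reverse carry"). With INDX set to half the FFT length, this yields the bit-reversed address sequence. The mode strings are only loosely modelled on assembler notation. They and the function names are assumptions of the sketch, not a description of the exact hardware.

```python
MASK16 = 0xFFFF  # auxiliary registers are 16 bits wide


def reverse_carry_add(ar, indx):
    """Add indx to ar with the carry propagating from bit 15 toward bit 0."""
    result, carry = 0, 0
    for bit in range(15, -1, -1):
        total = ((ar >> bit) & 1) + ((indx >> bit) & 1) + carry
        result |= (total & 1) << bit
        carry = total >> 1  # carry into the next lower bit
    return result           # a carry out of bit 0 is discarded


def arau_update(ar, indx, mode):
    """Return the new AR value after one indirect-addressing access."""
    if mode == "+":     # AR + 1
        return (ar + 1) & MASK16
    if mode == "0+":    # AR + INDX
        return (ar + indx) & MASK16
    if mode == "0-":    # AR - INDX
        return (ar - indx) & MASK16
    if mode == "BR0+":  # AR + INDX with reverse carry (bit-reversed)
        return reverse_carry_add(ar, indx)
    raise ValueError(f"unknown mode {mode!r}")


ar, addresses = 0, []
for _ in range(8):
    addresses.append(ar)
    ar = arau_update(ar, 4, "BR0+")  # INDX = N/2 for an 8-point FFT
print(addresses)  # [0, 4, 2, 6, 1, 5, 3, 7]
```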
Auxiliary Register Compare Register (ARCR)
• The ARCR is a 16-bit register.
• It is used for address boundary comparison.
• The CMPR instruction compares the ARCR with the selected AR.
• The result of the comparison is placed in the test/control flag (TC) bit of status register 1 (ST1).
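The following Python sketch shows how an ARCR boundary check can end an address loop. It compares an AR with the ARCR and returns the resulting TC bit. The 0..3 condition encoding, the unsigned 16-bit comparison and the example addresses are all assumptions made for illustration.

```python
# Hypothetical encoding of the CMPR condition operand (0..3) for this sketch.
CMPR_CONDITIONS = {
    0: lambda ar, arcr: ar == arcr,  # AR equal to ARCR
    1: lambda ar, arcr: ar < arcr,   # AR less than ARCR
    2: lambda ar, arcr: ar > arcr,   # AR greater than ARCR
    3: lambda ar, arcr: ar != arcr,  # AR not equal to ARCR
}


def cmpr(ar, arcr, condition):
    """Return the TC bit (1 or 0) from comparing AR with ARCR as unsigned 16-bit values."""
    return int(CMPR_CONDITIONS[condition](ar & 0xFFFF, arcr & 0xFFFF))


arcr = 0x0305  # hypothetical end-of-block address
ar, visited = 0x0300, []
while cmpr(ar, arcr, 1):  # keep going while TC = 1 (AR < ARCR)
    visited.append(ar)
    ar += 1
print(len(visited))  # 5
```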
Block Move Address Register (BMAR)
• The BMAR is a 16-bit register.
• The BMAR holds the address required in block move and multiply/accumulate operations.
• This 16-bit address is used as an indirect second operand.
Memory mapped registers
• The TMS320C5x series of processors has 96 memory-mapped registers.
• All these registers are mapped into page 0 of the data memory space.
• They include 28 CPU registers and 16 I/O port registers; the remaining locations are peripheral registers or are reserved.
• These memory-mapped registers are used for indirect data address pointers, temporary storage, and CPU status and control.
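The following Python sketch relates page 0 to data addresses. A direct address is modelled as a 9-bit data page pointer concatenated with a 7-bit offset, giving 512 pages of 128 words. The 96 memory-mapped registers are assumed to occupy the first 96 addresses of page 0. Both the field widths and that placement are assumptions of the sketch, and the function names are hypothetical.

```python
MMR_COUNT = 96  # assumed to occupy data addresses 0x00-0x5F of page 0


def direct_address(dp, offset):
    """Form a 16-bit data address from a 9-bit page pointer and a 7-bit offset."""
    if not (0 <= dp < 512 and 0 <= offset < 128):
        raise ValueError("dp must be 0..511 and offset 0..127")
    return (dp << 7) | offset


def is_memory_mapped_register(address):
    """True if a data address falls in the memory-mapped register block."""
    return 0 <= address < MMR_COUNT


addr = direct_address(4, 0x10)
print(hex(addr), is_memory_mapped_register(addr))  # 0x210 False
```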
Program Controller
The Program Controller performs the following tasks:
• Decodes instructions
• Manages the CPU pipeline
• Stores the status of CPU operations
• Decodes conditional operations
Program Controller (contd.)
The program controller consists of the following elements:
• Program Counter
• Status and control registers
• Hardware Stack
• Address generation logic
• Instruction register
• Interrupt register
Some flags in the status registers
On-chip registers
• The TMS320C5x architecture has a total
memory address range of 224 K words x 16
bits.
• The memory space is divided into four
memory segments:
i. 64 K – word Program memory space
ii. 64 K – word local data memory space
iii. 64 K – word input/output ports
iv. 32 K – word Global data memory space
On-chip peripherals
• The TMS320C5x processors have the following on-chip
peripherals:
1. Clock generator
2. Hardware timer
3. Software programmable wait state generators
4. General purpose I/O ports
5. Parallel I/O ports
6. Serial port interface
7. Buffered serial port
8. Time-division multiplexed (TDM) serial port
9. Host port interface
10. User-maskable interrupts
Comparison between DSP Processors and General Purpose Processors
Comparison between DSP Processors and General Purpose Processors (contd…)
Thank You
