
Architectural features of DSP Processors

Digital signal processing algorithms are computationally intensive, and the algebraic equations involved are generally repetitive in nature. Signal processing algorithms are most often used in real-time applications, where a signal must be processed at the rate at which it arrives. Because a very large number of arithmetic operations is involved, the processors need to work with very high speed and efficiency to keep system size and cost affordable. Advances in VLSI technology have made the realization of such sophisticated systems possible. For the implementation of complex DSP algorithms, many manufacturers have developed hardware processors specially suited to running them. These processors are known as DSP processors.

DSP processors are broadly classified as general purpose and special purpose. General-purpose DSPs have architectural features for DSP applications in many areas. Special-purpose DSPs are either algorithm specific, configured to run a particular algorithm such as digital filtering or the FFT, or application specific, targeting areas such as telecommunications, digital audio or control.

Architecture for signal processing

Most DSP algorithms, such as filtering, correlation and the FFT, require repetitive arithmetic operations (multiply, add), frequent memory access and heavy data flow through the CPU. The architecture of a standard microprocessor is not suited to this kind of activity. An important goal in DSP hardware design is to optimize both the hardware architecture and the instruction set for DSP operations. In digital signal processors, this is achieved by making extensive use of parallelism. The following are the architectural features of DSP processors.

1. Multiple bus structure with separate memory space for data and program instructions.
Data memories hold input data, intermediate data values and output samples, as well as fixed coefficients of digital filters and FFTs. Program instructions are stored in the program memory. This allows simultaneous access to program and data, so that the program fetch, decode and execute cycles can be pipelined.

2. The I/O port provides a means of fast data transfer between the CPU and external devices such as an ADC or DAC, or of passing data to other processors. The other processor may be a host processor or another similar DSP processor sharing the computational load of a complex system. DMA (Direct Memory Access) allows rapid transfer of blocks of data directly to and from RAM, typically under external control, without affecting CPU operations.

3. Dedicated arithmetic units for logical and arithmetic operations, which include the ALU, hardware multiplier, multiplier accumulator and shifter/barrel shifter.

4. Separate hardware address generators, which make the data flow faster and free the CPU for program execution. A dedicated hardware program sequencer and dedicated stacks are other features that increase the speed of execution.

The techniques used are the following.

1. Harvard/modified Harvard architecture
2. Extensive pipelining
3. Fast dedicated hardware multiplier/multiplier accumulator
4. Special instructions dedicated to DSP
5. Replication of hardware for parallel operation
6. On-chip memory/cache
7. Extended parallelism: SIMD, VLIW and static superscalar processing

SIMD (Single Instruction Multiple Data): multiple data buses and multiple computational units are provided, so that a single instruction operates on many data items in parallel in one cycle.

VLIW (Very Long Instruction Word): the core processor fetches many instruction words simultaneously and executes them in parallel in the many hardware units provided.

Harvard Architecture and Modified Harvard Architecture

Standard microprocessors have the von Neumann architecture, with a common memory for data and program and a single data bus. Here the operations of fetching an instruction from memory, decoding it in the CPU and executing it are done in sequence. This method of operation is inefficient because many of the units sit idle while a given operation is in progress, and because fetch, decode and execute are done in sequence, operation is slow. This limitation is overcome by selecting the Harvard architecture for DSPs.
The principal feature of the Harvard architecture is that the program and data lie in separate memories, each with its own address and data buses, permitting full overlap of instruction fetch, decode and execute. Since the program code is in the program memory and the operands (data) are in the data memory, both memories can be accessed at the same instant. Nowadays most processors use a modified Harvard architecture, in which transfer of data between program memory and data memory is also possible.
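
The benefit of separate program and data buses can be illustrated with a toy count of memory-bus cycles. The sketch below is only a model under stated assumptions: every memory access takes one bus cycle, and pipeline fill and decode/execute time are ignored. The function names are hypothetical and the model does not describe any particular processor.

```python
def shared_bus_cycles(num_instructions, data_accesses_per_instruction):
    """Single shared bus (von Neumann model): instruction fetches and
    data accesses are serialized on the same bus."""
    return num_instructions * (1 + data_accesses_per_instruction)


def separate_bus_cycles(num_instructions, data_accesses_per_instruction):
    """Separate program and data buses (Harvard model): the fetch stream
    and the data stream run in parallel, so the busier bus sets the count."""
    fetch_cycles = num_instructions
    data_cycles = num_instructions * data_accesses_per_instruction
    return max(fetch_cycles, data_cycles)


if __name__ == "__main__":
    n = 1000  # assumed instruction count, for illustration only
    print("shared bus:    ", shared_bus_cycles(n, 1))
    print("separate buses:", separate_bus_cycles(n, 1))
```

With one data access per instruction, the shared bus needs twice as many bus cycles as the separate buses in this model.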

TMS320C54x DSP Processor: Architecture

The TMS320C54x series of DSP processors is the third-generation 16-bit DSP family from Texas Instruments. The series includes the TMS320C54x, TMS320LC54x, and TMS320VC54x fixed-point digital signal processor (DSP) families. The devices differ in the capacity of on-chip RAM and ROM, the available peripherals, the CPU speed, and the type of package with its total pin count.

The main features provided by the 54x DSPs include:

1. Advanced multibus architecture with three separate 16-bit data memory buses and one program memory bus
2. 40-bit arithmetic logic unit (ALU), including a 40-bit barrel shifter and two independent 40-bit accumulators
3. 17 × 17-bit parallel multiplier coupled to a 40-bit dedicated adder for nonpipelined single-cycle multiply/accumulate (MAC) operation
4. Compare, select, and store unit (CSSU) for the add/compare selection of the Viterbi operator
5. Exponent encoder to compute an exponent value of a 40-bit accumulator value in a single cycle
6. Two address generators with eight auxiliary registers and two auxiliary register arithmetic units (ARAUs)
7. Single-instruction repeat and block-repeat operations for program code, block-memory-move instructions for better program and data management, and instructions with a 32-bit long-word operand
8. Instructions with two- or three-operand reads, arithmetic instructions with parallel store and parallel load, and fast return from interrupt
9. On-chip peripherals
10. Software-programmable wait-state generator and programmable bank switching
11. Phase-locked loop (PLL) clock generator with internal crystal oscillator or external clock source
12. Full-duplex standard serial port, time-division multiplexed (TDM) serial port, buffered serial port (BSP) and multichannel buffered serial port (McBSP)
13. Direct memory access (DMA) controller
14. 8-bit/16-bit parallel host-port interface (HPI)
15. Interprocessor first-in first-out (FIFO) unit (on multiple-CPU devices)
16. Low-voltage device options to reduce power consumption without compromising performance
17. Performance options of 40/80/100/200/532 million instructions per second (MIPS); a 25-ns instruction cycle time corresponds to 40 MIPS
18. High-performance, low-power C54x CPU

1 Architecture

The 54x DSPs use an advanced, modified Harvard architecture that maximizes processing power by maintaining one program memory bus and three data memory buses. These processors also provide an arithmetic logic unit (ALU) that has a high degree of parallelism, application-specific hardware logic, on-chip memory, and additional on-chip peripherals. These DSP families also provide a highly specialized instruction set, which is the basis of the operational flexibility and speed of these DSPs.

Separate program and data spaces allow simultaneous access to program instructions and data, providing a high degree of parallelism. Two read operations and one write operation can be performed in a single cycle. Instructions with parallel store and application-specific instructions can fully utilize this architecture. In addition, data can be transferred between data and program spaces. Such parallelism supports a powerful set of arithmetic, logic, and bit-manipulation operations that can all be performed in a single machine cycle. Also included are the control mechanisms to manage interrupts, repeated operations, and function calls. Figure 1 is a simplified functional block diagram that shows the principal blocks and bus structure in the 54x devices.

1.1 Central Processing Unit (CPU)

The CPU of the 54x devices contains:

A 40-bit arithmetic logic unit (ALU)
Two 40-bit accumulators
A barrel shifter
A 17 × 17-bit multiplier/adder
A compare, select, and store unit (CSSU)

1.1.1 Arithmetic Logic Unit (ALU)

The 54x devices perform 2s-complement arithmetic using a 40-bit ALU and two 40-bit accumulators (ACCA and ACCB). The ALU can also perform Boolean operations, and it can function as two 16-bit ALUs, performing two 16-bit operations simultaneously.

1.1.2 Accumulators

The accumulators, ACCA and ACCB, store the output from the ALU or the multiplier/adder block; the accumulators can also provide a second input to the ALU or the multiplier/adder.
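
The value of a 40-bit accumulator can be seen in a small sketch. It assumes 16-bit signed operands and models the accumulator as a 40-bit 2s-complement value that wraps on overflow. The helper names are hypothetical, and the saturation and rounding logic of the real device is not modeled.

```python
ACC_BITS = 40  # accumulator width taken from the text above


def wrap_signed(value, bits=ACC_BITS):
    """Reduce an integer to a signed 2s-complement value of the given width."""
    mask = (1 << bits) - 1
    value &= mask
    if value >= 1 << (bits - 1):
        value -= 1 << bits
    return value


def mac(acc, a, b):
    """One multiply/accumulate step: acc + a * b, kept to 40 bits."""
    return wrap_signed(acc + a * b)


def accumulate_products(pairs):
    """Accumulate the products of (a, b) pairs, starting from zero."""
    acc = 0
    for a, b in pairs:
        acc = mac(acc, a, b)
    return acc


if __name__ == "__main__":
    # 256 products of the most negative 16-bit value with itself.
    total = accumulate_products([(-32768, -32768)] * 256)
    print(total, "fits in 40 bits; in 32 bits it would wrap to",
          wrap_signed(total, 32))
```

Summing 256 products of the most negative 16-bit value with itself gives 2^38, which fits in 40 bits but would overflow a 32-bit accumulator.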
1.1.3 Barrel Shifter

The barrel shifter has a 40-bit input connected to the accumulator or data memory (CB, DB) and a 40-bit output connected to the ALU or data memory (EB). The barrel shifter produces a left shift of 0 to 31 bits and a right shift of 0 to 16 bits on the input data. The least significant bits (LSBs) of the output are filled with 0s, and the most significant bits (MSBs) can be either zero-filled or sign-extended, depending on the state of the sign-extension mode bit (SXM) of ST1. Additional shift capabilities enable the processor to perform numerical scaling, bit extraction, extended arithmetic, and overflow prevention operations.

1.1.4 Multiplier/Multiplier Accumulator

The multiplier/adder performs 17 × 17-bit 2s-complement multiplication with a 40-bit accumulation in a single instruction cycle. The multiplier/adder block consists of several elements: a multiplier, an adder, signed/unsigned input control, fractional control, a zero detector, a rounder (2s-complement), overflow/saturation logic, and TREG. The multiplier has two inputs: one input is selected from the TREG, a data-memory operand, or an accumulator; the other is selected from the program memory, the data memory, an accumulator, or an immediate value. The fast on-chip multiplier allows the 54x to perform operations such as convolution, correlation, and filtering efficiently. In addition, the multiplier and ALU together execute multiply/accumulate (MAC) computations and ALU operations in parallel in a single instruction cycle.

1.1.5 Compare, Select, and Store Unit (CSSU)

The compare, select, and store unit (CSSU) performs maximum comparisons between the accumulator's high and low words, allows the test/control (TC) flag bit of status register 0 (ST0) and the transition (TRN) register to keep their transition histories, and selects the larger word in the accumulator to be stored in data memory.

1.1.6 Program Control

Program control is provided by several hardware and software mechanisms. The program controller decodes instructions, manages the pipeline, stores the status of operations, and decodes conditional operations. Some of the hardware elements included in the program controller are the program counter, the status and control register, the stack, and the address-generation logic.

1.2 Bus Structure

The 54x device architecture is built around eight major 16-bit buses:

One program-read bus (PB), which carries the instruction code and immediate operands from program memory.
Two data-read buses (CB, DB) and one data-write bus (EB), which interconnect various elements such as the CPU, data-address generation logic (DAGEN), program-address generation logic (PAGEN), on-chip peripherals, and data memory. The CB and DB carry the operands read from data memory. The EB carries the data to be written to memory.
Four address buses (PAB, CAB, DAB, and EAB), which carry the addresses needed for instruction execution.

The 54x devices can generate up to two data-memory addresses per cycle, using the two auxiliary register arithmetic units (ARAU0 and ARAU1). The PB can carry data operands stored in program space (for instance, a coefficient table) to the multiplier for multiply/accumulate operations, or to a destination in data space for the data-move instruction. This capability allows implementation of single-cycle three-operand instructions such as FIRS.
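
The idea behind a symmetric FIR instruction such as FIRS can be sketched in a few lines. For a filter whose coefficients are symmetric, the two samples that share a coefficient are added first and the sum is multiplied once, so each step uses two data operands and one coefficient. The code below is a hedged illustration of that idea, not the instruction's exact definition; the function names and the even filter length are assumptions.

```python
def fir_direct(coeffs, window):
    """Reference FIR output: one multiply per coefficient."""
    return sum(c * x for c, x in zip(coeffs, window))


def fir_symmetric(half_coeffs, window):
    """FIR output for a symmetric filter of even length.

    half_coeffs holds the first half of the coefficients; the second half
    is assumed to be its mirror image. Samples sharing a coefficient are
    pre-added, which halves the number of multiplies.
    """
    n = len(window)
    if n != 2 * len(half_coeffs):
        raise ValueError("window must be twice as long as half_coeffs")
    acc = 0
    for k, c in enumerate(half_coeffs):
        acc += c * (window[k] + window[n - 1 - k])  # two data reads, one coefficient
    return acc


if __name__ == "__main__":
    half = [1, 2, 3]                 # hypothetical coefficients
    full = half + half[::-1]         # [1, 2, 3, 3, 2, 1]
    samples = [4, 5, 6, 7, 8, 9]     # hypothetical input window
    print(fir_direct(full, samples), fir_symmetric(half, samples))
```

Both functions return the same value for a symmetric filter; the second performs half as many multiplications.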

The 54x devices also have an on-chip bidirectional bus for accessing on-chip peripherals; this bus is connected to DB and EB through the bus exchanger in the CPU interface.

1.3 Memory

The minimum memory address range for the 54x devices is 192K words, composed of 64K words in program space, 64K words in data space, and 64K words in I/O space. Selected devices also provide extended program memory space of up to 8M words. The program memory space contains the instructions to be executed as well as tables used in execution. The data memory space stores data used by the instructions. The I/O memory space interfaces to external memory-mapped peripherals and can also serve as extra data storage space.

Internal Memory/On-Chip Memory

The 54x DSPs provide part of the memory on chip, as both on-chip RAM and ROM, to improve system performance. In the 54x the memory is organized into three individually selectable spaces (refer to the block diagram). The on-chip memory includes ROM for program/data, and dual-access RAM (DARAM) and single-access RAM (SARAM) for data/program. This makes possible the simultaneous access of an instruction and two operands needed for single-cycle execution of instructions. At reset, the DARAM/SARAM is mapped into data memory space; it can also be mapped into program/data memory space. It is also possible to transfer data/program between the memory blocks. A dual-access RAM can be accessed twice in a single machine cycle, whereas a single-access RAM can be accessed only once. The on-chip RAM is organized in pages of 128 word locations each. A part of the on-chip ROM contains the bootloader and lookup tables for functions such as sine and cosine. The bootloader can be used to transfer user code automatically at power-up from an external source to anywhere in the program memory.

Memory Organization

The standard external program or data memory space on the 54x devices addresses up to 64K 16-bit words.
Software can configure the devices' on-chip memory cells to reside inside or outside of the program/data address map.

Memory-Mapped Registers

The CPU maintains a set of memory-mapped registers in data memory for processor configuration and for configuration of and communication with the device peripherals. These registers support operand addressing and computations. There are as many as 95 registers for performing CPU functions and interface activities. Some of the registers are:

1. Temporary Register (TREG): The TREG holds one of the multiplicands for multiply and multiply/accumulate instructions. It can also hold a dynamic (execution-time programmable) shift count for instructions with a shift operation, such as ADD, LD, and SUB.

2. Status Registers (ST0, ST1): The status registers, ST0 and ST1, contain the status of the various conditions and modes for the 54x devices.

3. Auxiliary Registers (AR0-AR7): The eight 16-bit auxiliary registers (AR0-AR7) can be accessed by the central arithmetic logic unit (CALU) and modified by the auxiliary register arithmetic units (ARAUs). The primary function of the auxiliary registers is generating 16-bit addresses for data space. However, these registers can also act as general-purpose registers or counters.

4. Stack-Pointer Register (SP): The SP is a 16-bit register that contains the address at the top of the system stack. The SP always points to the last element pushed onto the stack. The stack is manipulated by interrupts, traps, calls and returns. A push pre-decrements and a pop post-increments all 16 bits of the SP.

5. Circular-Buffer-Size Register (BK): The 16-bit BK is used by the ARAUs in circular addressing to specify the data block size.

6. Interrupt Registers (IMR, IFR): The interrupt-mask register (IMR) is used to mask off specific interrupts individually at required times. The interrupt-flag register (IFR) indicates the current status of the interrupts.

7. Block-Repeat Registers (BRC, RSA, REA): The block-repeat counter (BRC) is a 16-bit register that specifies the number of times a block of code is to be repeated when performing a block repeat. The block-repeat start address (RSA) is a 16-bit register containing the starting address of the block of program memory to be repeated when operating in the repeat mode. The 16-bit block-repeat end address (REA) contains the ending address of the block of program memory to be repeated when operating in the repeat mode.

On-Chip Peripherals

All the 54x devices have the same CPU structure; however, they have different on-chip peripherals connected to their CPUs.
The on-chip peripheral options provided are:

Software-programmable wait-state generator
Parallel I/O ports
DMA controller
Host-port interface (standard 8-bit, enhanced 8-bit, and 16-bit)
Serial ports
16-bit timer

1.5.1 Software-Programmable Wait-State Generators

The software-programmable wait-state generator can be used to interface with slower off-chip memory and I/O devices. It is programmable up to 7 or 14 wait states, depending on the device.

1.5.3 Parallel I/O Ports

Each 54x device has a total of 64K I/O ports. These ports can be addressed by the PORTR instruction or the PORTW instruction. The 54x devices can interface easily with external devices through the I/O ports while requiring minimal off-chip address-decoding circuits.

1.5.4 Direct Memory Access (DMA) Controller

The 54x direct memory access (DMA) controller transfers data between points in the memory map without intervention by the CPU. The DMA allows movements of data to and from internal program/data memory, internal peripherals or external memory devices to occur in the background of CPU operation. The DMA has the following features:

The DMA operates independently of the CPU.
The DMA has six independent programmable channels and can keep track of the contexts of six independent block transfers.
The DMA has higher priority than the CPU for both internal and external accesses.
Each channel has independently programmable priorities.
Each read or write transfer may be initialized by selected events.

In a multiprocessor environment, where several 54x devices in a bigger system share the processing load, the DMA allows efficient data transfer between the 54x processors, in addition to transfers to and from host processors or I/O devices.

1.5.5 Host-Port Interface (HPI)

The host-port interface (HPI) is a unit that allows the DSP to interface to an 8-bit/16-bit host device or processor. Through it the 54x processor can be interfaced with a host processor or another 54x processor without any additional components on the PCB. The HPI communicates with the host independently of the DSP. The HPI allows the host to interrupt the DSP and vice versa, when required.
The HPI can interface to a PC directly through its parallel port.

1.5.6 Serial I/O Ports

The 54x devices provide three types of serial ports, depending on the type of device: synchronous, buffered and time-division multiplexed. The synchronous serial ports are high-speed, full-duplex ports that provide direct communication with codecs and analog-to-digital converters. A buffered serial port (BSP) is a synchronous serial port that is provided with an auto-buffering unit and is clocked at the full clock rate. The auto-buffering unit supports high-speed data transfers and reduces the overhead of servicing interrupts. A time-division multiplexed (TDM) serial port is a synchronous serial port that allows time-division multiplexing of data.

The functioning of these on-chip peripherals is controlled by memory-mapped registers assigned to these peripherals.

1.5.8 Hardware Timer

The 54x devices feature a 16-bit timing circuit with a four-bit prescaler. The timer counter is decremented by one at every CLKOUT cycle. Each time the counter decrements to zero, a timer interrupt is generated. The timer can be stopped, restarted, reset, or disabled by specific status bits.

1.5.9 Clock Generator

There are two basic options for clock generation on the 54x family of devices: divide-by-two and PLL. In the first option, the CPU clock is generated by dividing the input clock by two. The second option uses a phase-locked loop circuit to generate a CPU clock that is a multiple of the frequency of the input clock. The PLL method allows a high-frequency internal CPU clock to be generated from a low-frequency external clock. Maintaining a low-frequency clock off chip reduces system power consumption, reduces clock-generated electromagnetic interference (EMI), and facilitates the use of less expensive external crystals or oscillators.

Data Addressing Modes

The 54x processor provides different data addressing modes for efficient movement of operands during program execution, so that many of the arithmetic operations of signal processing can be carried out in a single instruction cycle. To achieve this, the 54x provides seven types of data addressing modes: immediate, absolute, accumulator, direct, indirect, memory-mapped register and stack addressing. Within indirect addressing, special modes provide circular-buffer and bit-reversed addressing. Circular addressing is suited to efficient computation of FIR filters and correlations, and bit-reversed addressing to the FFT. To generate these addresses, two hardware units called auxiliary register arithmetic units (ARAUs) are provided along with the auxiliary registers.

Instruction Set

The 54x processor supports a large number of instructions. Many are similar to those of general-purpose microprocessors; however, they execute much faster because of the difference in internal architecture. In particular, there are many instructions for arithmetic operations, such as MAC (multiply-accumulate) and MAS (multiply-subtract), and control instructions such as RPT (repeat). An instruction fetch, two operand accesses for execution and a memory access to save the result can all be done in a single cycle.

Ref: www.ti.com Doc no SPRU 307
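
The circular and bit-reversed addressing described under Data Addressing Modes, together with a repeated multiply-accumulate, can be sketched as follows. This is a simplified software model, not the hardware mechanism: the index arithmetic uses a plain modulo, any buffer-alignment rules of the real device are ignored, and all function names are hypothetical.

```python
def circular_advance(index, step, block_size):
    """Move an index by step inside a circular buffer of block_size words."""
    return (index + step) % block_size


def bit_reverse(index, num_bits):
    """Return index with its lowest num_bits bits in reversed order."""
    reversed_index = 0
    for _ in range(num_bits):
        reversed_index = (reversed_index << 1) | (index & 1)
        index >>= 1
    return reversed_index


def fir_output(coeffs, delay_line, newest):
    """One FIR output as a repeated multiply-accumulate.

    delay_line is used as a circular buffer; newest is the index of the
    most recent sample, and older samples are found by stepping backwards.
    """
    acc = 0
    index = newest
    for c in coeffs:
        acc += c * delay_line[index]
        index = circular_advance(index, -1, len(delay_line))
    return acc


if __name__ == "__main__":
    print([bit_reverse(i, 3) for i in range(8)])       # input order for an 8-point FFT
    print(fir_output([1, 2, 3], [10, 20, 30, 40], 1))  # wraps from index 0 to index 3
```

The circular index removes the need to shift the delay line after every sample, and the bit-reversed index gives the reordering used by radix-2 FFT routines.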

Figure 1: Simplified Block Diagram of TMS 320C54X DSP Processor

[Block diagram. Blocks shown: program ROM and program/data RAM; data buses PB, CB, DB and EB; address buses PAB, CAB, DAB and EAB; program-memory and data-memory address generators; system control and interface (to external memory); 17 × 17 multiplier with 40-bit adder, round and saturate logic (MAC); 40-bit barrel shifter; ALU; CMPS operator; EXP encoder; 40-bit accumulators ACCA and ACCB; serial ports; DMA controller; host processor interface; timer; PLL; software wait-state generator.]
