
1.1. Introduction

Digital signal processing (DSP) is one of the core technologies in rapidly growing application areas such as wireless communications, audio and video processing, and industrial control. The number and variety of products that include some form of digital signal processing has grown dramatically over the last few years. DSP has become a key component in many consumer, communications, medical, and industrial products, which implement the signal processing using microprocessors, Field Programmable Gate Arrays (FPGAs), custom ICs, and so on. Due to the increasing popularity of these applications, the variety of DSP-capable processors has expanded greatly.

DSP processors are processors or microcomputers whose hardware, software, and instruction sets are optimized for high-speed numeric processing, which is essential for processing digital data representing analog signals in real time. DSP processors have gained popularity because of advantages such as in-field reprogrammability, cost-effectiveness, speed, and energy efficiency. Digital signal processors such as the TMS320C6x (C6x) family are fast special-purpose microprocessors with a specialized architecture and an instruction set appropriate for signal processing. The C6x notation designates a member of the Texas Instruments (TI) TMS320C6000 family of digital signal processors. The architecture of the C6x digital signal processor is well suited to numerically intensive calculations; based on a very-long-instruction-word (VLIW) architecture, the C6x is considered TI's most powerful processor.

Digital signal processors are used in a wide range of applications, from communications and control to speech and image processing. The general-purpose digital signal processor market is dominated by communications applications (cellular), while embedded digital signal processors are dominated by consumer products.
They are found in cellular phones, fax/modems, disk drives, radios, printers, hearing aids, MP3 players, high-definition television (HDTV), digital cameras, and so on. These processors have become the products of choice for a number of consumer applications because they are very cost-effective and can be readily reprogrammed for a different application. DSP techniques have been very successful because of the development of low-cost software and hardware support; for example, modems and speech recognition can be implemented less expensively using DSP techniques.

DSP processors are concerned primarily with real-time signal processing. Real-time processing requires the processing to keep pace with some external event, usually the analog input, whereas non-real-time processing has no such timing constraint. Whereas analog systems built from discrete electronic components such as resistors can be sensitive to temperature changes, DSP-based systems are far less affected by environmental conditions. DSP processors also enjoy the advantages of microprocessors: they are easy to use, flexible, and economical.

1.2. History, Development, and Advantages of TMS320 DSPs

Advantages of DSPs over analog circuits:
- They can implement complex linear or nonlinear algorithms.
- They can be modified easily by changing software.
- A reduced parts count makes fabrication easier.
- They offer high reliability.

1.3. Difference between DSPs and Other Microprocessors

General-purpose computers perform two major kinds of task: (1) data manipulation and (2) mathematical calculation. All microprocessors can do both, but it is difficult to design a device that performs both optimally, because of technical trade-offs such as the size of the instruction set and how interrupts are handled. As a broad generalization, these factors have made traditional microprocessors, such as the Pentium series, primarily directed at data manipulation, while DSPs are designed to perform the mathematical calculations needed in digital signal processing.

Data manipulation involves storing and sorting information. For instance, a word-processing program performs the basic tasks of storing, organizing, and retrieving information. This is achieved by moving data from one location to another and testing for equality or inequality (A = B, A < B). While mathematics is occasionally used in this type of application, it is infrequent and does not significantly affect the overall execution speed. In comparison, the execution speed of most DSP algorithms is limited almost completely by the number of multiplications and additions required.

In addition to performing mathematical calculations very rapidly, DSPs must also have a predictable execution time. Most DSPs are used in applications where the processing is continuous, with no defined start or end. Cost, power consumption, and design difficulty increase along with execution speed, which makes accurate knowledge of the execution time critical both for selecting the proper device and for choosing the algorithms that can be applied. DSPs can also perform tasks in parallel, whereas traditional microprocessors process instructions serially.


1.4. Important Features of DSPs

As DSP processors are designed and optimized for implementing various DSP algorithms, most of them share a number of common features that support high-performance, repetitive, numerically intensive tasks.

1.4.1 MACs and Multiple Execution Units

The most widely known feature of a DSP processor is its ability to perform one or more multiply-accumulate operations (MACs) in a single instruction cycle. The MAC operation is useful in DSP algorithms that involve computing a vector dot product, such as digital filters, correlation, and Fourier transforms. It matters because DSP applications typically have very high computational requirements compared with other types of computing tasks: they often must execute algorithms such as FIR filtering in real time on lengthy segments of signals sampled at 10 to 100 kHz or higher. To support this, DSP processors often include several independent execution units that can operate in parallel.

1.4.2 Efficient Memory Access

DSP processors also share the ability to complete several memory accesses in a single instruction cycle. Thanks to the Harvard architecture used in DSPs (physically separate storage and signal pathways for instructions and data) and a pipelined structure, the processor can fetch an instruction while simultaneously fetching operands and/or storing the result of the previous instruction to memory. Some recent DSPs add a further optimization: a small bank of RAM near the processor core, often termed L1 memory, used as an instruction cache. When a small group of instructions is executed repeatedly, the cache is loaded with those instructions, freeing the bus for data fetches instead of instruction fetches.

1.4.3 Circular Buffering

The need to process digital signals in real time, where the output (processed) samples must be produced at the same rate at which the input samples are acquired (for instance in telephone communication, hearing aids, and radar), gives rise to the concept of circular buffering. Circular buffers are used to store the most recent values of a continually updated signal. Circular buffering allows the processor to access a block of data sequentially and then automatically wrap around to the beginning address, exactly the pattern used to access coefficients in an FIR filter. Circular buffering is also very helpful in implementing first-in, first-out (FIFO) buffers, commonly used for I/O and for FIR delay lines.

1.4.4 Dedicated Address Generation Units

Dedicated address generation units also help speed up arithmetic processing on a DSP. Once the appropriate addressing registers have been configured, the address generation unit operates in the background, without using the main data path of the processor: the address required for an operand access is formed in parallel with the execution of the arithmetic instruction. DSP address generation units typically support a selection of addressing modes tailored to DSP applications. The most common of these is register-indirect addressing with post-increment, used where a repetitive computation is performed on data stored sequentially in memory. Some processors also support bit-reversed addressing, which increases the speed of certain fast Fourier transform (FFT) algorithms.
1.4.5 Specialized Instruction Sets

The instruction sets of digital signal processors are designed to make maximum use of the processor's resources while minimizing the memory required to store the instructions. Maximum utilization of the DSP's resources ensures maximum efficiency, and minimizing storage ensures the cost-effectiveness of the overall system. To make the most of the underlying hardware, instructions are designed to perform several operations in parallel, typically including data fetches alongside the main arithmetic operation. To keep storage requirements low, instructions are kept short by restricting which registers can be used with which operations and which operations can be combined in a single instruction. Some of the latest processors use VLIW (very long instruction word) architectures, in which multiple instructions are issued and executed per cycle. Instructions in such architectures are short and designed to perform much less work than those of conventional DSPs, requiring less memory while achieving higher speed through the parallelism of the VLIW architecture.


2. TMS320C6713 ARCHITECTURE

2.1 C67x CPU and Instruction Set

The TMS320C6713 floating-point digital signal processor uses the C67x VelociTI advanced very-long-instruction-word (VLIW) CPU. The CPU fetches 256-bit-wide fetch packets, supplying up to eight 32-bit instructions to the eight functional units during every clock cycle. The VelociTI VLIW architecture also features variable-length execute packets; these are a key memory-saving feature distinguishing the C67x CPU from other VLIW architectures. Operating at 225 MHz, the TMS320C6713 delivers up to 1350 million floating-point operations per second (MFLOPS), 1800 million instructions per second (MIPS), and, with dual fixed/floating-point multipliers, up to 450 million multiply-accumulate operations per second (MMACS).

2.2 Functional Units


The CPU features eight functional units supported by 32 32-bit general-purpose registers. The data path is divided into two symmetric sides, each consisting of 16 registers and 4 functional units. Additionally, each side features a data bus connected to all the registers on the other side, by which the two sets of functional units can access data from the register file on the opposite side.

2.3 Fixed- and Floating-Point Instruction Set

The C67x CPU executes the C62x integer instruction set. In addition, the C67x CPU natively supports IEEE 32-bit single-precision and 64-bit double-precision floating point. Beyond the C62x fixed-point instructions, six of the eight functional units also execute floating-point instructions: two multipliers, two ALUs, and two auxiliary floating-point units. The remaining two functional units support floating point by providing address generation for the 64-bit loads that the C67x CPU adds to the C62x instruction set, providing 128 bits of data bandwidth per cycle. This double-word load capability allows multiple operands to be loaded into the register file for 32-bit floating-point instructions. Unlike other floating-point architectures, the C67x has independent control of its two floating-point multipliers and its two floating-point ALUs. This enables the CPU to operate on a broader mix of floating-point algorithms rather than being tied to the typical multiply-accumulate-oriented functions.

2.4 Load/Store Architecture

Another key feature of the C67x CPU is its load/store architecture, in which all instructions operate on registers (as opposed to directly on data in memory). Two sets of data-addressing units (.D units) are responsible for all data transfers between the register files and memory. The data addresses driven by the .D units allow addresses generated from one register file to be used to load or store data to or from the other register file.

2.5 Cache Overview

The TMS320C6713 device utilizes a highly efficient two-level real-time cache for internal program and data storage. The cache delivers high performance without the cost of large arrays of on-chip memory, and its efficiency makes low-cost, high-density external memory, such as SDRAM, as effective as on-chip memory. The first level of the memory architecture has dedicated 4K-byte instruction and data caches, L1I and L1D respectively.
The L1I is direct-mapped, whereas the L1D provides 2-way associativity to handle multiple types of data. The second level (L2) consists of a total of 256K bytes of memory, 64K bytes of which can be configured in one of five ways:
- 64K 4-way associative cache
- 48K 3-way associative cache, 16K mapped RAM
- 32K 2-way associative cache, 32K mapped RAM
- 16K direct-mapped cache, 48K mapped RAM
- 64K mapped RAM

Dedicated L1 caches eliminate conflicts for memory resources between the program and data buses, while the unified L2 memory provides flexible memory allocation between program and data for accesses that do not reside in L1.

2.6 Cache Summary

The efficiency of the cache architecture makes the device simple to use; the cache is inherently transparent to the user. Due to the level of associativity and the high cache hit rate, virtually no optimization must be done to achieve high performance. Reduced time spent on optimization leads to reduced development time, allowing functional systems to be up and running quickly. High performance can be achieved immediately with the cache architecture, while a Harvard-architecture device with small internal memory requires much more time to reach similar performance, because optimizing an application on such a device requires several iterations to tune the application to fit in the small, fixed internal memories.

2.7 Interrupt Handling

Interrupt handling is an important part of DSP operation. It is crucial that the DSP be able to receive and handle interrupts while maintaining real-time operation. In typical applications, interrupt frequency has not increased in proportion to device operating frequency: as processing speeds have increased, latency requirements have not. The TMS320C6713 can service interrupts with a latency of a fraction of a microsecond even when the service routine is located in external memory. By configuring L2 memory blocks as memory-mapped SRAM, it is possible to lock critical program and data sections into internal memory, which is ideal for situations such as interrupts and OS task switching. By locking routines that must execute in minimal time into internal memory, the microsecond-scale delay for interrupts is reduced to tens of nanoseconds.

2.8 Real-Time I/O

Peripherals are a feature of most DSP systems that can take advantage of the memory-mapped L2 RAM. Typical processors require that peripheral data first be placed in external memory before it can be accessed by the CPU.
The TMS320C6713, by contrast, can maintain data buffers in on-chip memory rather than off-chip memory, providing higher data throughput to peripherals. This improves performance when using the on-chip McASPs, the HPI, or external peripherals. The EDMA can transfer data directly into mapped L2 space while the CPU processes the data, so the CPU is not stalled while fetching data from slow external memory or directly from the peripheral. Using this method for transferring data also minimizes EMIF activity, which is crucial as data rates or the number of peripherals increase.

2.9 Pipelining

Pipelining is a key feature of a DSP for getting parallel instructions working properly, and it requires careful timing. There are three stages of pipelining: program fetch, decode, and execute. (Figure: one fetch packet (FP) with three execute packets (EPs), showing the p bit of each instruction.)

1. The program fetch stage is composed of four phases:

(a) PG: program address generate (in the CPU), to generate the fetch address
(b) PS: program address send (to memory), to send the address
(c) PW: program address ready wait (memory read), to wait for the data
(d) PR: program fetch packet receive (at the CPU), to read the opcode from memory

2. The decode stage is composed of two phases:

(a) DP: dispatch all the instructions within an FP to the appropriate functional units
(b) DC: instruction decode

3. The execute stage is composed of 6 phases (fixed point) to 10 phases (floating point), due to delays (latencies) associated with the following instructions:

(a) Multiply instruction, which consists of two phases due to one delay
(b) Load instruction, which consists of five phases due to four delays
(c) Branch instruction, which consists of six phases due to five delays

Table 3.2 shows the pipeline phases, and Table 3.3 shows the pipelining effects. The first row in Table 3.3 represents cycles 1, 2, ..., 12. Each subsequent row represents an FP, and the rows labeled PG, PS, ... illustrate the phases associated with each FP. The PG phase of the first FP starts in cycle 1, the PG of the second FP starts in cycle 2, and so on. Each FP takes four phases for program fetch and two phases for decoding, while the execute stage can take from 1 to 10 phases (not all execution phases are shown in Table 3.3). We assume that each FP contains one EP.

For example, at cycle 7, while the instructions in the first FP are in the first execution phase E1 (which may be the only one), the instructions in the second FP are in the decoding phase, the instructions in the third FP are in the dispatching phase, and so on. All seven instructions are proceeding through the various phases; therefore, at cycle 7, the pipeline is full. Most instructions have one execute phase. Instructions such as multiply (MPY), load (LDH/LDW), and branch (B) take two, five, and six phases, respectively. Additional execute phases are associated with floating-point and double-precision instructions, which can take up to 10 phases. For example, the double-precision multiply operation (MPYDP), available on the C67x, has nine delay slots, so its execution stage takes a total of 10 phases.

The functional-unit latency, which represents the number of cycles that an instruction ties up a functional unit, is 1 for all instructions except the double-precision instructions available on the floating-point C67x. Functional-unit latency is different from a delay slot. For example, the MPYDP instruction has a functional-unit latency of four cycles but nine delay slots; this implies that no other instruction can use the associated multiply functional unit for four cycles. A store has no delay slot but finishes its execution in the third execution phase of the pipeline. If the outcome of a multiply instruction such as MPY is used by a subsequent instruction, a NOP (no operation) must be inserted after the MPY instruction for the pipelining to operate properly; four or five NOPs must be inserted if an instruction uses the outcome of a load or a branch instruction, respectively.

3. Applications

Telecommunications: telephone-line modems, FAX, cellular telephones, wireless networks, speakerphones.
Voice/speech: digitization and compression, voice mail, speaker verification, speech synthesis.
Automotive: engine control, active suspension, system diagnosis.
Control systems: head-positioning servo systems in disk drives, laser printer control, motor control.
Military: radar and sonar signal processing, beamforming for adaptive antennas, navigation systems, secure spread-spectrum radios, missile guidance.
Medical: hearing aids, MRI, ultrasound imaging.
Instrumentation: spectrum analysis, signal generators.
Image processing: HDTVs, game consoles, animation, set-top boxes.
