Advanced Processors: Overview of DSP Unit-5 Unit-6

DSP Processors
DSP processors are specialized microprocessors with an architecture optimized for the fast operational needs of digital signal processing.
Need for DSP
• Filtering, correlation, FFT
• Heavy data flow through CPU
• Real time operations
Parallelism
Architecture
• Harvard architecture
• Pipelining
• Fast dedicated hardware MAC
• Special instructions
• Replication
• On-chip memory and cache
• Extended parallelism: SIMD, VLIW, superscalar
Simplified Architecture of Standard Microprocessor
Von Neumann Architecture

Independency between the operations
Limitations on the increase in speed
Hardware Architecture for Signal Processing
Multiple Bus Structure

•Separate data and program memory
•Data memory
•Coefficients, input data, output samples, intermediate data
Non-Pipelining Architecture
Pipeline Architecture
Pipelining Concept
Pipeline MAC Operation
MAC Configuration
Special Instructions
Special Instruction: MAC

Repeat: RPT
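A minimal sketch of what the MAC and repeat instructions achieve together, assuming the usual meaning of MAC (accumulator plus product) and treating the repeat as a plain loop. The function names `mac` and `repeat_mac` are illustrative only and do not come from any instruction set.

```python
def mac(acc, x, y):
    """One multiply-accumulate step: returns acc + x * y."""
    return acc + x * y


def repeat_mac(samples, coeffs):
    """Emulate a repeated MAC: one MAC per sample/coefficient pair."""
    if len(samples) != len(coeffs):
        raise ValueError("samples and coeffs must have the same length")
    acc = 0
    for x, h in zip(samples, coeffs):
        acc = mac(acc, x, h)
    return acc


dot = repeat_mac([1, 2, 3, 4], [4, 3, 2, 1])
print(dot)  # 20
```

On a DSP the loop overhead disappears because the repeat is handled in hardware, so each MAC can complete in a single cycle.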
Single Instruction Multiple Data (SIMD) Processing
Data bus-A and Data bus-B feed Execution Unit A and Execution Unit B; each execution unit contains an ALU, a MAC and a shifter.

SIMD Processing
Eight 16-bit operands feed four 16x16 MACs in parallel; each MAC produces a 32-bit result.
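The four-lane arrangement can be sketched in software as follows. This is only an emulation under stated assumptions: operands are signed 16-bit values, each lane keeps a signed 32-bit accumulator that wraps on overflow, and the names `wrap32` and `simd_mac4` are hypothetical.

```python
def wrap32(value):
    """Wrap an integer to a signed 32-bit result, as a 32-bit accumulator would."""
    value &= 0xFFFFFFFF
    return value - 0x100000000 if value & 0x80000000 else value


def simd_mac4(acc, a, b):
    """Apply one MAC to four lanes at once: acc[i] + a[i] * b[i] per lane."""
    lanes = []
    for acc_i, a_i, b_i in zip(acc, a, b):
        if not (-32768 <= a_i <= 32767 and -32768 <= b_i <= 32767):
            raise ValueError("operands must fit in signed 16 bits")
        lanes.append(wrap32(acc_i + a_i * b_i))
    return lanes


result = simd_mac4([0, 0, 0, 0], [1, 2, 3, 4], [5, 6, 7, 8])
print(result)  # [5, 12, 21, 32]
```

In hardware the four lanes run in the same cycle under one instruction; the Python loop only models the arithmetic, not the timing.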
Very Long Instruction Word (VLIW)
Instruction fetch packet (internal program memory, 8x32 bits)
•Eight 32 bit instructions
•Always 256 bits wide
Execution packet (instruction fetch, decode and dispatch, nx32 bits)
•Dispatches instructions into appropriate execution units
•Varies from one to eight instructions (32 bits to 256 bits)
Two data paths
•Functional units L1, S1, M1, D1 with register file A (32 bits)
•Functional units L2, S2, M2, D2 with register file B (32 bits)
Internal data RAM

Superscalar Processors
• Uses instruction level parallelism
• Developed to execute multiple instructions in one cycle
• Achieved through multiple execution units
• Extensive use of pipelining
• Instruction width is not fixed
• An instruction can be issued to execute in parallel like SIMD
• Uses load/store architecture suitable to take two inputs and compute an output
Fixed point and Floating point representation
16-bit signed fractional point, often indicated as Q1.15
IEEE 754 normalized representation of a single precision floating point number.
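A short sketch of the Q1.15 format, assuming the common convention of one sign bit and 15 fractional bits, so a stored integer q represents q / 32768. The helper names and the saturating behaviour are illustrative choices, not taken from the slides.

```python
Q15_SCALE = 32768          # 2**15
Q15_MIN, Q15_MAX = -32768, 32767


def saturate_q15(value):
    """Clamp an integer to the signed 16-bit range."""
    return max(Q15_MIN, min(Q15_MAX, value))


def float_to_q15(x):
    """Convert a real number in [-1, 1) to its Q1.15 integer code."""
    return saturate_q15(int(round(x * Q15_SCALE)))


def q15_to_float(q):
    """Convert a Q1.15 integer code back to a real number."""
    return q / Q15_SCALE


def q15_mul(a, b):
    """Multiply two Q1.15 codes; the 32-bit product is shifted back by 15 bits."""
    return saturate_q15((a * b) >> 15)


half = float_to_q15(0.5)
print(half)                               # 16384
print(q15_to_float(q15_mul(half, half)))  # 0.25
```

Note that +1.0 is not representable: `float_to_q15(1.0)` saturates to 32767, the largest code.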
General purpose DSP architecture
DSP Processors
Fixed point processors
•Represent each number with a minimum of 16 bits
•2^16 = 65,536 possible bit patterns can represent a number
•Unsigned integer: 0 to 65,535
•Signed integer: -32,768 to 32,767
•Unsigned fraction: spread uniformly between 0 and 1
•Signed fraction: spread uniformly between -1 and 1
Floating point processors
•Represent each number with a minimum of 32 bits
•2^32 = 4,294,967,296 possible bit patterns can represent a number
•Represented numbers are not uniformly spaced
•ANSI/IEEE Std. 754-1985: the largest and smallest numbers are ±3.4×10^38 and ±1.2×10^-38, respectively
•The represented values are unequally spaced between these two extremes, such that the gap between any two adjacent numbers is about ten million times smaller than the value of the numbers
•This is important because it places large gaps between large numbers, but small gaps between small numbers
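The "ten million times smaller" figure can be checked numerically. The sketch below uses Python's standard `struct` module to step from one single precision value to the next representable one; it assumes positive, normalized inputs, and the helper names are hypothetical.

```python
import struct


def to_float32(x):
    """Round a Python float to the nearest single precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def next_float32(x):
    """Return the next representable single precision value above a positive x."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits + 1))[0]


def relative_gap(x):
    """Gap to the next single precision value, divided by the value itself."""
    x32 = to_float32(x)
    return (next_float32(x32) - x32) / x32


for value in (1e-30, 1.0, 1e6, 3.0e38):
    gap = relative_gap(value)
    print(value, "is about", round(1 / gap), "times larger than its gap")
```

For every normalized value the ratio falls between 2^23 (about 8.4 million) and 2^24 (about 16.8 million), which is where the rounded figure comes from. A 16-bit signed fraction, by contrast, has the same gap of 1/32768 everywhere.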
Fixed point digital signal processors
First Generation
•TMS320C1X by TI in 1982
•Dedicated AU with multiplier and accumulator
•Harvard architecture with separate program and data memory
•On-chip memory and special instructions for execution of basic DSP algorithms
Second Generation
•TMS320C5X from TI, DSP5600X from Motorola, ADSP21XX from Analog Devices, DSP16XX from Lucent Technologies
•Enhanced features compared with the first generation
•Larger on-chip memory and more special instructions to execute DSP algorithms
•MAC with Repeat
Third Generation
•TMS320C54xx, DSP563X and DSP16000
•Aimed at digital communication and digital audio
•Special instructions for adaptive filtering, which included echo cancellation, adaptive equalization and Viterbi decoding
•Low power, with a power management facility
Fourth Generation
•TMS320C62XX
•VLIW
•Included extensive parallelism while maintaining the features of earlier versions
•Wider instructions, wider data paths and more registers, larger instruction cache and multiple AU
Floating point DSP processors
First Generation
•TMS320C3X from TI
•Larger memory and many on-chip peripheral facilities
•Program cache and on-chip dual access memories
•Graphics and image processing
•Supported three floating point formats
Second Generation
•TMS320C4X, ADSP-2106x SHARC
•Emphasis on multiprocessing and multiprocessor support
Third Generation
•TMS320C67xx, ADSP-TS001
•VLIW
Special purpose Digital Signal Processor
Hardware digital filters : FIR
FIR structure
Hardware Architecture
for FIR filter
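The FIR structure computes y[n] as the sum of h[k]·x[n-k] over the filter taps. A minimal software sketch of that structure, assuming a direct-form delay line; the function name `fir_filter` is illustrative and says nothing about how a particular hardware filter is built.

```python
def fir_filter(samples, coeffs):
    """Direct-form FIR: y[n] = sum over k of coeffs[k] * x[n - k]."""
    delay_line = [0.0] * len(coeffs)   # x[n], x[n-1], ..., x[n-N+1]
    output = []
    for x in samples:
        # Shift the new sample into the delay line, dropping the oldest one.
        delay_line = [x] + delay_line[:-1]
        acc = 0.0
        for h, d in zip(coeffs, delay_line):
            acc += h * d               # one MAC per tap
        output.append(acc)
    return output


impulse_response = fir_filter([1.0, 0.0, 0.0, 0.0], [0.5, 0.3, 0.2])
print(impulse_response)  # [0.5, 0.3, 0.2, 0.0]
```

Feeding a unit impulse returns the coefficients themselves, which is a quick sanity check for any FIR implementation.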
Special Purpose Digital Signal Processor
Hardware digital filters : IIR
IIR Structure
Hardware architecture for IIR filter
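An IIR structure adds feedback from past outputs. The sketch below assumes a direct form I difference equation with a[0] = 1, that is y[n] = Σ b[k]·x[n-k] − Σ a[k]·y[n-k] for k ≥ 1; the function name and argument layout are illustrative.

```python
def iir_filter(samples, b, a):
    """Direct form I IIR filter; a[0] is assumed to be 1."""
    x_hist = [0.0] * len(b)          # x[n], x[n-1], ...
    y_hist = [0.0] * (len(a) - 1)    # y[n-1], y[n-2], ...
    output = []
    for x in samples:
        x_hist = [x] + x_hist[:-1]
        acc = sum(bk * xk for bk, xk in zip(b, x_hist))       # feed-forward part
        acc -= sum(ak * yk for ak, yk in zip(a[1:], y_hist))  # feedback part
        if y_hist:
            y_hist = [acc] + y_hist[:-1]
        output.append(acc)
    return output


response = iir_filter([1.0, 0.0, 0.0, 0.0], b=[1.0], a=[1.0, -0.5])
print(response)  # [1.0, 0.5, 0.25, 0.125]
```

The single feedback coefficient produces an impulse response that decays by half each sample and never reaches exactly zero, which is the defining difference from the FIR case.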

Hardware FFT Processors
Concept of hardware butterfly processor
Simplified architecture of hardware FFT processor
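The butterfly is the basic operation such a processor repeats. A minimal sketch, assuming the radix-2 decimation-in-time form A' = A + W·B, B' = A − W·B with twiddle factor W; the function name is hypothetical.

```python
def butterfly(a, b, w):
    """Radix-2 decimation-in-time butterfly with twiddle factor w."""
    t = w * b            # one complex multiply
    return a + t, a - t  # one complex add and one complex subtract


# A 2-point DFT is a single butterfly with w = 1.
upper, lower = butterfly(1 + 0j, 1 + 0j, 1 + 0j)
print(upper, lower)  # (2+0j) 0j
```

An N-point radix-2 FFT applies (N/2)·log2(N) of these butterflies, which is why a dedicated hardware butterfly unit pays off.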
Hardware FFT Processors
Double buffering in real-time FFT
FFT performed on N point data in buffer A while buffer B is being filled
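Double buffering can be sketched as two buffers that swap roles each time one fills. The generator below is a sequential stand-in under stated assumptions: real hardware fills one buffer (for example by DMA) while the FFT runs on the other, whereas here the two steps simply alternate. All names are illustrative.

```python
def double_buffered_blocks(stream, n):
    """Yield full n-sample blocks, alternating between two buffers (A and B)."""
    buffers = [[], []]
    fill = 0                      # index of the buffer currently being filled
    for sample in stream:
        buffers[fill].append(sample)
        if len(buffers[fill]) == n:
            ready = buffers[fill]
            fill ^= 1             # swap roles: start filling the other buffer
            buffers[fill] = []
            yield ready           # the consumer processes this block


# sum() stands in for the N-point FFT applied to each completed block.
processed = [sum(block) for block in double_buffered_blocks(range(8), 4)]
print(processed)  # [6, 22]
```

For real-time operation the processing of one block must finish before the other buffer fills, otherwise samples are lost.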

Architecture of TMS320C67XX
Valid Register Pairs
Type of operation | .L unit | .S unit | .M unit | .D unit
Arithmetic operations | 32/40 bit operations | 32 bit operations | - | 32 bit add and subtract operations only
Logical operations | 32-bit operations | 32-bit operations | - | 32-bit logical operations*
Multiply operations | - | - | 16x16 multiply operations | -
Shift operations | - | 32/40 bit shift operations | - | -
Compare operations | 32/40 bit operations | - | - | -
Branch operations | - | Yes | - | -
Load and store operations | - | - | - | Loads and stores with 5-bit constant offset (15 bit constant offset in .D2 only)
Linear and circular address calculation | - | - | - | Yes
Constant generation | - | Yes | - | -
Count operations | 32/40 bit count operations | - | - | -
Move operations | Register to register only | 16 bit move operations | - | Register to register only
TMS320C67XX CPU data paths
Data lines:
•src1 and src2
•32 bits (all units)
•40 bits (.L, .S)
Register File Cross Paths:
•Functional units can read operands from and write results to their own register files
•.L1, .S1, .M1, .L2, .S2, .M2 have access to opposite side registers through cross paths
Memory Load and Store Paths:
•LD1 and LD2 (LDDW)
•ST1 and ST2
Data Address Paths:
•DA1 and DA2 allow a data address generated by any one path to access data to or from any register
Control Registers (accessed by .S2 alone using MVC)
Register Name | Abbreviation | Description
Addressing Mode Register | AMR | Specifies linear or circular addressing of A4-A7 and B4-B7
Control Status Register | CSR | Contains important control and status bits of the processor
Program Counter E1 Phase Register | PCE1 | Contains the address of the fetch packet that is in the E1 phase of the pipeline
Interrupt Flag Register | IFR | Contains the status of the INT4-INT15 and NMI interrupts
Interrupt Set Register | ISR | Used to manually set maskable pending interrupts
Interrupt Clear Register | ICR | Used to manually clear maskable pending interrupts
Interrupt Enable Register | IER | Used to enable/disable the individual maskable interrupts
Interrupt Service Table Pointer | ISTP | Points to the beginning of the interrupt service table
Interrupt Return Pointer | IRP | Contains the address to be used to return from a maskable interrupt
Non-maskable Interrupt Return Pointer | NRP | Contains the address to be used to return from a non-maskable interrupt
Addressing Mode Register (AMR)
Bits 31-26: Reserved
Bits 25-21: BK1
Bits 20-16: BK0
Bits 15-0: mode fields for B7, B6, B5, B4, A7, A6, A5, A4 (from bit 15 down to bit 0)
Mode Select | Description of mode
0 0 | Linear modification of address
0 1 | Circular addressing using BK0
1 0 | Circular addressing using BK1
1 1 | Reserved
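As a worked example, the sketch below packs an AMR value from its fields. It assumes the layout described above with BK1 in bits 25-21 and BK0 in bits 20-16 (the naming used by the mode table), and further assumes that each of the eight mode fields is two bits wide with A4 in bits 1-0 and B7 in bits 15-14. The function and constant names are hypothetical.

```python
MODE_LINEAR, MODE_CIRC_BK0, MODE_CIRC_BK1 = 0b00, 0b01, 0b10

# Assumed bit position of each register's 2-bit mode field.
MODE_SHIFT = {"A4": 0, "A5": 2, "A6": 4, "A7": 6,
              "B4": 8, "B5": 10, "B6": 12, "B7": 14}


def amr_value(modes, bk0=0, bk1=0):
    """Pack per-register modes and the two 5-bit block size fields into one word."""
    if not (0 <= bk0 <= 31 and 0 <= bk1 <= 31):
        raise ValueError("BK0 and BK1 are 5-bit fields")
    value = (bk1 << 21) | (bk0 << 16)
    for register, mode in modes.items():
        if mode not in (MODE_LINEAR, MODE_CIRC_BK0, MODE_CIRC_BK1):
            raise ValueError("mode 11 is reserved")
        value |= mode << MODE_SHIFT[register]
    return value


amr = amr_value({"A4": MODE_CIRC_BK0}, bk0=4)
print(hex(amr))  # 0x40001
```

The resulting word would then be written to the AMR with MVC on the .S2 unit, since that is the only unit that can access the control registers.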
Unit 5
Introduction to Computer Architecture R5-12.1, R5-12.2
General purpose Digital Signal Processors R5-12.3
Selecting digital signal processors R5-12.4
Special purpose DSP Hardware R5-12.6
Architecture of TMS320C67X Reference Guide TMS320C67XX/T2-13.2
Features of C67X processors Reference Guide TMS320C67XX/T2-13.2
CPU TMS320C67x/C67x+ DSP CPU and Instruction Set Reference Guide/T2-13.4
General purpose register files TMS320C67x/C67x+ DSP CPU and Instruction Set Reference Guide/T2-13.5
Functional units and operation TMS320C67x/C67x+ DSP CPU and Instruction
Data paths TMS320C67x/C67x+ DSP CPU and Instruction
Control register file TMS320C67x/C67x+ DSP CPU and Instruction
Functional Units
Name of unit | Type of operations
.L | Arithmetic, Logical, Compare, Other
.S | Arithmetic, Logical, Shift, Branch, Move, Other
.M | Multiply
.D | Arithmetic, Load/store, Other
Addressing Modes
• Register Addressing mode:
– mnemonic .unit src1, src2, dst
– Mnemonic used could be ADD, SUB, MPY etc.
• ADD .L1 A1, A2, A3
• ADD .S2 B1, B2, B2
• ADD .L1X A1, B2, A2
• Linear Addressing mode: Uses .D1 and .D2
• Circular Addressing mode
Addressing Modes
• Linear Addressing mode: Uses .D1 and .D2
– mnemonic .unit mode field, dst
Load, store
*+baseR[offsetR/ucst5] Positive offset from baseR specified by offsetR/ucst5
*-baseR[offsetR/ucst5] Negative offset from baseR specified by offsetR/ucst5
*++baseR[offsetR/ucst5] Pre-increment of baseR by offsetR/ucst5
*--baseR[offsetR/ucst5] Pre-decrement of baseR by offsetR/ucst5
*baseR++[offsetR/ucst5] Post-increment of baseR by offsetR/ucst5
*baseR--[offsetR/ucst5] Post-decrement of baseR by offsetR/ucst5
Addressing Modes
• Linear Addressing mode: Uses .D1 and .D2
– mnemonic .unit mode field, dst
LDW .D1 *A0[1], A1

Loads into register A1 the contents of the memory location pointed to by the contents of A0 plus the offset (1 shifted left by two bits, i.e. 4 bytes).
(The offset is shifted left by 3, 2, 1 or 0 bits for double word, word, half word and byte accesses, respectively.)
LDW .D1 *++A0[A4], A1
LDW .D1 *A0++[2], A1
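The effect of each mode on the access address and on the base register can be sketched as below for word (LDW) accesses, where the offset is shifted left by two bits. This is an illustrative model only; the mode strings `"*R++"` and `"*R--"` are a notation invented here for the two post-modify forms.

```python
def ldw_address(base, offset, mode):
    """Return (access_address, updated_base) for a word access."""
    scaled = offset << 2                 # word access: offset scaled by 4 bytes
    if mode == "*+":                     # positive offset, base unchanged
        return base + scaled, base
    if mode == "*-":                     # negative offset, base unchanged
        return base - scaled, base
    if mode == "*++":                    # pre-increment: modify, then access
        return base + scaled, base + scaled
    if mode == "*--":                    # pre-decrement: modify, then access
        return base - scaled, base - scaled
    if mode == "*R++":                   # post-increment: access, then modify
        return base, base + scaled
    if mode == "*R--":                   # post-decrement: access, then modify
        return base, base - scaled
    raise ValueError("unknown addressing mode")


print([hex(v) for v in ldw_address(0x100, 1, "*+")])    # ['0x104', '0x100']
print([hex(v) for v in ldw_address(0x100, 2, "*R++")])  # ['0x100', '0x108']
```

The second call corresponds to `LDW .D1 *A0++[2], A1`: the load uses the old value of A0, and A0 then advances by two words.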

Addressing Modes
• Circular Addressing mode:

– Uses .D1 and .D2
– A4-A7 and B4-B7 are used
– Addressing mode register (AMR) is used to select modes for A4-A7 and B4-B7
• mnemonic .unit mode field, dst
Circular Buffering
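A minimal sketch of how a circular address update can work, assuming the buffer is a power-of-two sized block aligned to its own size so that only the low-order address bits change. This is a simplified model; the name `circular_step` and the example addresses are made up for illustration.

```python
def circular_step(address, step, block_size):
    """Advance an address by step bytes, wrapping inside an aligned block."""
    if block_size <= 0 or block_size & (block_size - 1):
        raise ValueError("block_size must be a power of two")
    mask = block_size - 1
    # Keep the high bits (block start) and wrap only the low bits.
    return (address & ~mask) | ((address + step) & mask)


address = 0x100
visited = []
for _ in range(5):
    visited.append(hex(address))
    address = circular_step(address, 4, 16)
print(visited)  # ['0x100', '0x104', '0x108', '0x10c', '0x100']
```

After the fourth word the pointer returns to the start of the 16-byte block without any compare-and-branch, which is what makes circular addressing attractive for delay lines in filters.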
Fixed Point Instructions
Instruction | Functional Unit | Description
MV | .L1 or .L2, .S1 or .S2, .D1 or .D2 | Move value from one register to another
MVC | .S2 only | Move value between a control register and the register file
MVK | .S1 or .S2 | Move a 16-bit constant into the lower 16 bits of a register, sign extended
MVKLH | .S1 or .S2 | Move a 16-bit constant into the upper 16 bits of a register
MVKH | .S1 or .S2 | Move the upper 16 bits of a 32-bit constant into the upper 16 bits of a register
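The descriptions above imply that a full 32-bit constant is loaded in two steps. The sketch below models the register effects of MVK, MVKH and MVKLH as plain integer operations; it follows the table's descriptions and is not a cycle-accurate or complete model of the instructions.

```python
def mvk(const16):
    """Lower 16 bits set from the constant, sign extended to 32 bits."""
    low = const16 & 0xFFFF
    if low & 0x8000:
        return low | 0xFFFF0000
    return low


def mvkh(reg, const32):
    """Upper 16 bits of reg replaced by the upper 16 bits of const32."""
    return (const32 & 0xFFFF0000) | (reg & 0xFFFF)


def mvklh(reg, const16):
    """Upper 16 bits of reg replaced by a 16-bit constant."""
    return ((const16 & 0xFFFF) << 16) | (reg & 0xFFFF)


value = 0x12348765
reg = mvk(value & 0xFFFF)
print(hex(reg))          # 0xffff8765
reg = mvkh(reg, value)
print(hex(reg))          # 0x12348765
```

The intermediate value shows why the second step is needed: the sign extension performed by MVK fills the upper half with ones whenever bit 15 of the constant is set.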
Flow of Execution
•Conditional Operations:
•All instructions can be conditional
•Registers A1, A2, B0, B1 and B2 can be tested for conditional operation (the value can be tested for zero or non-zero)
•The condition specified in the register is tested at the beginning of the E1 execution phase
•Parallel Operation:
•8 instructions are fetched to form a fetch packet
•Execution of these instructions is controlled by scanning the p-bits from left to right
•p = 1 for the ith instruction: the (i+1)th instruction is executed in parallel with the ith instruction
•p = 0 for the ith instruction: the (i+1)th instruction is executed in the machine cycle after the ith instruction
Flow of Execution
Flow of Execution
– Fully serial: all p-bits are zero; needs 8 machine cycles to execute
– Fully parallel: all p-bits are 1; needs 1 machine cycle
– Partially serial
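The p-bit rule can be sketched as a small routine that splits a fetch packet into groups of instructions issued in the same cycle. It assumes, as a simplification, that the p-bit of the eighth instruction always ends the last group; the function name is hypothetical.

```python
def split_fetch_packet(p_bits):
    """Group the 8 instruction indices of a fetch packet by their p-bits."""
    if len(p_bits) != 8:
        raise ValueError("a fetch packet holds exactly 8 instructions")
    packets, current = [], []
    for index, p in enumerate(p_bits):
        current.append(index)
        # p = 0 ends the group; the last instruction always ends it (assumption).
        if p == 0 or index == len(p_bits) - 1:
            packets.append(current)
            current = []
    return packets


print(len(split_fetch_packet([0] * 8)))              # 8
print(len(split_fetch_packet([1] * 7 + [0])))        # 1
print(split_fetch_packet([0, 0, 1, 1, 0, 1, 1, 0]))  # [[0], [1], [2, 3, 4], [5, 6, 7]]
```

The number of groups is the number of cycles needed to issue the packet: 8 when fully serial, 1 when fully parallel, and 4 in the partially serial pattern shown last.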
Flow of Execution
In summary
Pipelining
Fetch Operation
Program address generate (PG): memory address of the 8 instructions of the fetch packet is generated
Program address send (PS): address is sent to memory
Program access ready wait (PW): memory read operation
Program fetch packet receive (PR): 8 instructions are received in the CPU
Execution will depend on whether the packet is fully serial, fully parallel or partially serial
Pipelining
Decode Operation
•DP: Instruction dispatch
•Fetch packets are split into execute packets
•An execute packet consists of one instruction or two to eight parallel instructions
•Instructions are assigned to appropriate functional units
•DC: Instruction decode
•Source registers, destination registers and associated paths are decoded
Pipelining
Execute Operation
Fixed point processor: E1 E2 E3 E4 E5
Floating point processor: E1 E2 E3 E4 E5 E6 E7 E8 E9 E10

Internal Memory
•Cache based internal memory architecture
•2 level memory architecture
•L1P, L1D
– 4k size
– Not included in memory map
– Always enabled
•L2: 64k size, shared for both program and data memory
•First L1P and L1D are accessed and if a miss occurs then L2 is accessed
•L2 controller facilitates
– CPU access to EMIF
– CPU access to peripherals
External Memory
• If an L2 miss occurs then external memory is accessed
• Memory Attribute Register is used to enable the external memory
On-chip Peripherals
Multichannel Buffered Serial Port
Features:
• Provides full-duplex communication
• Data size selection of 8, 12, 16, 20, 24 and 32 bits
• Independent framing and clocking for receive and transmit
• External shift clock or internal programmable clock for data transfer
• 8-bit data transfer with an option of LSB or MSB first
• Programmable polarity for both frame synchronization and data clocks
• Double buffered register which allows continuous data transmission
• Auto buffering capability through 5-channel DMA controller
• µ-law and A-law companding
• Direct interface to industry standard codecs, A/D, D/A converters etc.
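For the companding feature, the sketch below shows the continuous µ-law curve, F(x) = sgn(x)·ln(1 + µ|x|) / ln(1 + µ) with µ = 255, and its inverse. This illustrates the principle only: it is not the 8-bit code format used by telephony codecs and not a description of how the serial port hardware implements it. The function names are illustrative.

```python
import math

MU = 255.0


def mu_law_compress(x, mu=MU):
    """Compress a sample in [-1, 1] with the continuous mu-law curve."""
    if not -1.0 <= x <= 1.0:
        raise ValueError("input must lie in [-1, 1]")
    return math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)


def mu_law_expand(y, mu=MU):
    """Inverse of mu_law_compress."""
    return math.copysign(math.expm1(abs(y) * math.log1p(mu)) / mu, y)


for sample in (0.01, 0.1, 1.0):
    compressed = mu_law_compress(sample)
    print(sample, "->", round(compressed, 3), "->", round(mu_law_expand(compressed), 3))
```

Small amplitudes are boosted far more than large ones, so a quantizer placed after the compressor spends more of its levels on quiet signals.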
Multichannel Buffered Serial Port
Features of TMS320C6X Processor
• Advanced VLIW CPU with eight functional units, including two multipliers and six ALUs
• Executes up to eight instructions per cycle and allows RISC-like code to be developed
• Instruction packing reduces code size, program fetches and power consumption
• Efficient code execution on independent functional units
• Supports 8/16/32-bit data formats
• Bit-field extract, set and clear instructions and bit counting operations
• Supports single precision (32-bit) and double precision (64-bit) IEEE floating point operations, and also 32x32 bit integer multiplication with 32 or 64-bit results
Unit 6
TMS320C67X Functional units TMS320C67x/C67x+ DSP CPU and Instruction Set Reference Guide/T2 14.1
Internal memory T2-15.4
External memory T2-15.5
On-chip peripherals T2-15.6
Interrupts TMS320C67x/C67x+ DSP CPU and Instruction Set
Instruction set and addressing modes TMS320C67x/C67x+ DSP CPU and Instruction Set Reference Guide/T2-14.1, T2-14.2
Fixed point instructions TMS320C67x/C67x+ DSP CPU and Instruction Set
Floating point instructions TMS320C67x/C67x+ DSP CPU and Instruction Set
Conditional operations TMS320C67x/C67x+ DSP CPU and Instruction Set
Parallel operations TMS320C67x/C67x+ DSP CPU and Instruction Set
Pipeline operations TMS320C67x/C67x+ DSP CPU and Instruction Set
Thank you!
Functional Units and Operations Performed
On-chip Peripherals
