Reconfigurable Architecture for Efficient and Scalable Orthogonal Approximation of DCT
ABSTRACT
The Discrete Cosine Transform (DCT) has been widely applied in image and video
compression, for example in JPEG, MPEG-2/4, and H.263. Its popularity is attributed to
its ability to decorrelate spatial-domain data into frequency-domain data. The transformed
data are more compact, so redundant information can be further removed. The DCT is
considered the closest practical transform to the Karhunen-Loeve (K-L) transform, which
is the ideal energy-compaction transform. However, the matrix elements of the DCT are
real numbers represented by a finite number of bits, which inevitably leads to the
possibility of drift (a mismatch between the data decoded in the encoder and in the
decoder). Several methods were introduced to control the accumulation of drift in video
compression standards before H.264. H.264, however, makes extensive use of prediction,
which makes it very sensitive to drift [2]. To eliminate the mismatch between encoders
and decoders and to facilitate low-complexity implementations, recent video standards
such as H.264, VC-1, and AVS have adopted integer transforms. High Efficiency Video
Coding (HEVC) [3] is the newest standard for high-definition video processing and is
considered the successor of H.264. The main goal of HEVC is to achieve 50% higher
coding efficiency than H.264. To reach this goal, HEVC adopts many state-of-the-art
coding tools, including 4/8/16/32-point integer transforms. Compared to H.264, not only
the transform matrices but also their elements are larger, which makes both hardware and
software implementation considerably more complicated. In this work, a fast algorithm
for the 8x8 integer transform of HEVC is presented that is suitable for both hardware and
software implementation.
Truncation error is introduced when the least significant part of a product is
directly discarded, so it must be reduced. To reduce the effect of truncation error, several
error-compensation bias methods have been presented, based on statistical analysis of the
relationship between the partial products and the multiplier-multiplicand pair. Hardware
complexity is reduced if the truncation error is minimized. In general, the truncation part
(TP) is discarded to reduce hardware cost in parallel shifting and addition operations,
known as the direct truncation (Direct-T) method. A large truncation error then occurs
because the carry propagation from the TP into the main part (MP) is neglected.
Distributed arithmetic (DA) is a bit-level rearrangement of a multiply-accumulate that
hides the multiplications. It is a powerful technique for reducing the size of a parallel
hardware multiply-accumulate unit and is well suited to FPGA designs. The DCT is
widely used in digital image processing for image compression, especially in image
transform coding. However, although many of the published algorithms are good software
solutions for realizing the DCT, only a few of them are really suitable for VLSI
implementation. Cyclic convolution plays an important role in digital signal processing
because it is easy to implement: a number of well-developed convolution algorithms
exist, and it can be realized through modular, structured hardware such as distributed
arithmetic and systolic arrays. The pattern of data movement is a significant factor in
determining how efficiently a transform can be realized using DA.
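To make the Direct-T effect concrete, the following C sketch compares direct truncation of a product's low bits against a simple bias-compensated truncation. It is an illustration only: the 8-bit unsigned operands, the choice T = 8, and the constant half-LSB bias are our assumptions, not the statistically derived bias of the published compensation methods.

#include <stdio.h>
#include <stdlib.h>

/* Drop the T least significant bits of a product, keeping the main part (MP).
   Direct-T discards the truncation part (TP) outright; the "compensated"
   version adds a constant half-LSB bias first. (Published schemes derive the
   bias statistically from the operand bits rather than using a constant.) */
#define T 8

static int direct_t(int p)    { return p >> T; }
static int compensated(int p) { return (p + (1 << (T - 1))) >> T; }

int main(void) {
    long long e_direct = 0, e_comp = 0;
    int n = 0;
    for (int a = 0; a < 256; a++)
        for (int b = 0; b < 256; b++) {
            int p = a * b;                                  /* exact product */
            e_direct += llabs(((long long)direct_t(p)    << T) - p);
            e_comp   += llabs(((long long)compensated(p) << T) - p);
            n++;
        }
    printf("mean |error| Direct-T    : %6.2f LSBs\n", (double)e_direct / n);
    printf("mean |error| compensated : %6.2f LSBs\n", (double)e_comp / n);
    return 0;
}

Averaged over all operand pairs, even this crude constant bias roughly halves the mean truncation error relative to Direct-T, which is the effect the statistical bias methods exploit at a far lower cost than carrying the full-width addition.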
WHY COMPRESSION
Despite the many advantages of digital representation of signals over their analog
counterparts, digital signals need a very large number of bits for storage and transmission.
For example, a high-quality audio signal requires approximately 1.5 megabits per second
for digital representation and storage. Television-quality low-resolution color video at 30
frames per second, with each frame containing 640 x 480 pixels (24 bits per color pixel),
needs more than 210 megabits per second of storage. As a result, a digitized one-hour
color movie would require approximately 95 gigabytes of storage. The storage
requirement for high-definition television (HDTV), with a resolution of 1280 x 720 at
60 frames per second, is far greater: a digitized one-hour color movie of HDTV-quality
video requires approximately 560 gigabytes of storage. A digitized 14 x 17 square-inch
radiograph scanned at 70 µm occupies nearly 45 megabytes of storage. Transmitting these
digital signals through limited-bandwidth communication channels is an even greater
challenge and is sometimes impossible in raw form. Although the cost of storage has
decreased drastically over the past decade due to significant advances in microelectronics
and storage technology, the demands of data storage and data processing applications are
growing explosively and outpacing this achievement.
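The figures above follow from multiplying resolution, bit depth, and frame rate. As a sanity check, the short C program below reproduces the two video estimates; the small differences from the quoted 95 GB and 560 GB come only from rounding conventions.

#include <stdio.h>

/* Raw digital-video storage: bits/s = width * height * bits/pixel * frames/s. */
static void estimate(const char *name, double w, double h,
                     double bpp, double fps) {
    double bps = w * h * bpp * fps;
    printf("%-4s: %7.1f Mbit/s, %6.1f GB per hour\n",
           name, bps / 1e6, bps * 3600.0 / 8.0 / 1e9);
}

int main(void) {
    estimate("TV",   640,  480, 24, 30);   /* ~221 Mbit/s, ~100 GB/hour */
    estimate("HDTV", 1280, 720, 24, 60);   /* ~1327 Mbit/s, ~597 GB/hour */
    return 0;
}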
Although data compression offers numerous advantages and is one of the most
sought-after technologies in most data application areas, it also has disadvantages,
depending on the application area and the sensitivity of the data. For example, the extra
overhead incurred by the encoding and decoding processes is one of the most serious
drawbacks of data compression and discourages its use in some areas (e.g., in many large
database applications). This extra overhead is usually required in order to uniquely
identify or interpret the compressed data. For example, the encoding/decoding tree in a
Huffman-coding-type compression scheme is stored in the output file in addition to the
encoded bit-stream. These overheads run counter to the essence of data compression,
namely reducing storage requirements. In large statistical or scientific databases where
changes are infrequent, the decoding process has a greater impact on system performance
than the encoding process. Even to access and manipulate a single record in a large
database, it may be necessary to decompress the whole database first. After the data are
accessed, and possibly modified, the database is compressed again for storage. The delay
incurred by these compression and decompression processes can be prohibitive for many
real-time interactive database access requirements unless extra care and complexity are
added to the data arrangement in the database.
Literature Review:
2.1 Types of compression
There are two types of compression:
1. Lossless compression
2. Lossy compression
LOSSY
Transform coding: for an N x N image block, the 2-D DCT is

Z(u,v) = c(u) c(v) \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} x(i,j) \cos\frac{(2i+1)u\pi}{2N} \cos\frac{(2j+1)v\pi}{2N}

where x(i,j) is the image pixel data, Z(u,v) is the transformed data, and c(0) = \sqrt{1/N}, c(k) = \sqrt{2/N} for k > 0.
A standard diagram is shown in Figure 2.1, where the computation of the 2-D DCT
has been separated into two 1-D DCTs:

Z = C x C^T    (7)

Even rows of C are even-symmetric and odd rows are odd-symmetric. Therefore, by
exploiting this symmetry in the rows of C and separating even and odd rows, the 1-D DCT
can be written as

y(2k) = \sum_{i=0}^{N/2-1} C(2k, i) [x(i) + x(N-1-i)],    (8)

y(2k+1) = \sum_{i=0}^{N/2-1} C(2k+1, i) [x(i) - x(N-1-i)],    (9)

for k = 0, 1, ..., N/2-1.
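To make the row-column decomposition in (7) concrete, here is a minimal floating-point C sketch that evaluates the 2-D DCT as 1-D DCTs over the rows followed by 1-D DCTs over the columns. It is a direct O(N^3) reference (fixed at N = 8 for concreteness), not the optimized integer transform used by HEVC.

#include <math.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

#define N 8

/* 1-D DCT-II: y(u) = c(u) * sum_i x(i) cos((2i+1)u*pi / 2N),
   with c(0) = sqrt(1/N) and c(u) = sqrt(2/N) otherwise. */
static void dct_1d(const double x[N], double y[N]) {
    for (int u = 0; u < N; u++) {
        double s = 0.0;
        for (int i = 0; i < N; i++)
            s += x[i] * cos((2 * i + 1) * u * M_PI / (2.0 * N));
        y[u] = s * (u == 0 ? sqrt(1.0 / N) : sqrt(2.0 / N));
    }
}

/* 2-D DCT by separability, as in (7): rows first, then columns. */
void dct_2d(const double x[N][N], double Z[N][N]) {
    double tmp[N][N], col[N], out[N];
    for (int i = 0; i < N; i++)
        dct_1d(x[i], tmp[i]);             /* 1-D DCT of each row    */
    for (int j = 0; j < N; j++) {         /* 1-D DCT of each column */
        for (int i = 0; i < N; i++) col[i] = tmp[i][j];
        dct_1d(col, out);
        for (int i = 0; i < N; i++) Z[i][j] = out[i];
    }
}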
Algorithm for Hardware Implementation of Integer DCT for HEVC:
The N-point integer DCT for HEVC given by [14] can be computed by a partial-butterfly
approach, using an (N/2)-point DCT and a matrix-vector product of an (N/2) x (N/2)
matrix with an (N/2)-point vector:

[y(0), y(2), ..., y(N-2)]^T = C_{N/2} [a(0), a(1), ..., a(N/2-1)]^T    (1a)

and

[y(1), y(3), ..., y(N-1)]^T = M_{N/2} [b(0), b(1), ..., b(N/2-1)]^T    (1b)

where

a(i) = x(i) + x(N-1-i),  b(i) = x(i) - x(N-1-i),  i = 0, 1, ..., N/2-1,    (2)

C_{N/2} is the (N/2)-point integer DCT matrix, and the (i,j)th entry of M_{N/2} is
c^{N}_{2i+1,j}, the (2i+1, j)th entry of the matrix C_N. Note that (1a) can be decomposed
similarly, and recursively, using C_{N/4} and M_{N/4}.
Based on (1) and (2), hardware-oriented algorithms for DCT computation can be derived
in three stages, as in Table I. For the 8-, 16-, and 32-point DCTs, the even-indexed
coefficients [y(0), y(2), y(4), ..., y(N-2)] are computed as 4-, 8-, and 16-point DCTs of
[a(0), a(1), a(2), ..., a(N/2-1)], respectively, according to (1a). In Table II, we have listed
the arithmetic complexities of the reference algorithm and the MCM-based algorithm for
the 4-, 8-, 16-, and 32-point DCTs. Algorithms for the inverse DCT (IDCT) can be derived
in a similar way.
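The 8-point case of (1)-(2) can be sketched in C as follows, using the 8-point integer coefficients of the HEVC draft [3]. The scaling, rounding, and shift stages required by the standard are omitted, so this illustrates only the partial-butterfly structure, not a conformant implementation.

/* 8-point partial butterfly, following (1)-(2):
   even outputs: 4-point DCT of a(i) = x(i) + x(7-i)  -> y(0), y(2), y(4), y(6)
   odd outputs:  M4 * b, with b(i) = x(i) - x(7-i)    -> y(1), y(3), y(5), y(7) */
static const int C4[4][4] = {            /* even-part matrix C_{N/2} */
    { 64,  64,  64,  64 },
    { 83,  36, -36, -83 },
    { 64, -64, -64,  64 },
    { 36, -83,  83, -36 },
};
static const int M4[4][4] = {            /* odd-part matrix M_{N/2}: left half
                                            of the odd rows of C_8 */
    { 89,  75,  50,  18 },
    { 75, -18, -89, -50 },
    { 50, -89,  18,  75 },
    { 18, -50,  75, -89 },
};

void dct8_partial_butterfly(const int x[8], int y[8]) {
    int a[4], b[4];
    for (int i = 0; i < 4; i++) {        /* butterfly stage, eq. (2) */
        a[i] = x[i] + x[7 - i];
        b[i] = x[i] - x[7 - i];
    }
    for (int k = 0; k < 4; k++) {        /* two 4x4 matrix-vector products */
        int even = 0, odd = 0;
        for (int i = 0; i < 4; i++) {
            even += C4[k][i] * a[i];
            odd  += M4[k][i] * b[i];
        }
        y[2 * k]     = even;
        y[2 * k + 1] = odd;
    }
}

The even half could itself be split again via C_2 and M_2, which is exactly the recursion noted after (2).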
TABLE I
CHAPTER III
HARDWARE REQUIREMENTS:
5.1 GENERAL
Integrated circuit (IC) technology is the enabling technology for a whole host of
innovative devices and systems that have changed the way we live. Jack Kilby received
the 2000 Nobel Prize in Physics for his part in the invention of the integrated circuit,
which he and Robert Noyce developed independently; without the integrated circuit,
neither transistors nor computers would be as important as they are today. VLSI systems
are much smaller and consume less power than the discrete components used to build
electronic systems before the 1960s.
Integration allows us to build systems with many more transistors, allowing much
more computing power to be applied to solving a problem. Integrated circuits are also
much easier to design and manufacture and are more reliable than discrete systems; that
makes it possible to develop special-purpose systems that are more efficient than general-
purpose computers for the task at hand.
Even though a chip could be made smaller or faster with more design effort, the
advantages of having a single-chip implementation of a function that can be designed
quickly often outweigh the lost potential performance.
The problem, and the challenge, created by the ability to manufacture such large
chips is design: the ability to make effective use of the millions of transistors on a chip to
perform a useful function.
Chip designs are simulated to ensure that the chip's circuits compute the proper
functions for a sequence of inputs chosen to exercise the chip. But each chip that comes
off the manufacturing line must also undergo manufacturing test.
Manufacturing test:
The chip must be exercised to demonstrate that no manufacturing defects have
rendered it useless. Because IC manufacturing tends to introduce certain types of defects,
and because we want to minimize the time required to test each chip, we can't simply
reuse the input sequences created for design verification to perform manufacturing test.
Each chip must be designed to be fully and easily testable. Finding out that a chip is bad
only after you have plugged it into a system is annoying at best and dangerous at worst.
Customers are unlikely to keep using manufacturers who regularly supply bad chips.
Defects introduced during manufacturing range from the catastrophic
(contamination that destroys every transistor on the wafer) to the subtle (a single broken
wire or a crystalline defect that kills only one transistor). While some bad chips can be
found very easily, each chip must be thoroughly tested to find even subtle flaws that
produce erroneous results only occasionally. Tests designed to exercise functionality and
expose design bugs don't always uncover manufacturing defects. We use fault models to
identify potential manufacturing problems and to determine how they affect the chip's
operation.
The most common fault model is stuck-at-0/1: the defect causes a logic gate's
output to be always 0 (or 1), independent of the gate's input values. We can often
determine whether a logic gate's output is stuck even if we can't directly observe its
outputs or control its inputs. We can generate a good set of manufacturing tests for the
chip by assuming that each logic gate's output is stuck at 0 (and then at 1) and finding an
input to the chip that causes different outputs when the fault is present or absent.
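The idea can be illustrated with a toy C sketch. For a hypothetical three-input circuit out = (a AND b) OR c (the circuit, like the search strategy, is our own illustrative choice), we inject a stuck-at fault on the AND gate's output and exhaustively search for an input vector on which the faulty and fault-free circuits disagree:

#include <stdio.h>

/* Hypothetical circuit: out = (a & b) | c. A fault value of -1 means the
   AND gate is healthy; 0 or 1 forces its output to that stuck value. */
static int eval(int a, int b, int c, int fault) {
    int and_out = a & b;
    if (fault == 0 || fault == 1) and_out = fault;  /* stuck-at-0 / stuck-at-1 */
    return and_out | c;
}

int main(void) {
    for (int stuck = 0; stuck <= 1; stuck++) {
        for (int v = 0; v < 8; v++) {               /* try all 8 input vectors */
            int a = v & 1, b = (v >> 1) & 1, c = (v >> 2) & 1;
            int good = eval(a, b, c, -1);
            int bad  = eval(a, b, c, stuck);
            if (good != bad) {
                printf("a=%d b=%d c=%d detects stuck-at-%d\n", a, b, c, stuck);
                break;          /* one detecting vector per fault suffices */
            }
        }
    }
    return 0;
}

For example, the vector a=1, b=1, c=0 detects stuck-at-0 on the AND output: the good circuit produces 1 while the faulty one produces 0. Real test generators apply the same distinguish-the-fault principle to far larger circuits with heuristics instead of exhaustive search.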
much more efficiently in specialized logic than it can using standard lookup table
techniques. The wiring channels that connect to the logic elements' inputs and outputs
also need to be programmable. A wiring channel has a number of programmable
connections, such that each input or output can generally be connected to any one of
several different wires in the channel.
CHAPTER-VI
TOOLS
6.1 Introduction:
The main tools required for this project can be classified into two broad categories.
Hardware requirement
Software requirement
Standards Supported:
ModelSim VHDL supports both the IEEE 1076-1987 and 1076-1993 VHDL standards,
the 1164-1993 Standard Multivalue Logic System for VHDL Interoperability, and the
1076.2-1996 Standard VHDL Mathematical Packages. Any design developed with
ModelSim will be compatible with any other VHDL system that is compliant with either
IEEE Standard 1076-1987 or 1076-1993. ModelSim Verilog is based on IEEE Std 1364-
1995 and a partial implementation of 1364-2001, the Standard Hardware Description
Language Based on the Verilog Hardware Description Language. The Open Verilog
International Verilog LRM version 2.0 is also applicable to a large extent. Both the PLI
(Programming Language Interface) and VCD (Value Change Dump) are supported for
ModelSim PE and SE users.
6.4 MODELSIM:
Basic Steps For Simulation
This section provides further detail on each step in the process of simulating your
design using ModelSim.
Step 1 - Collecting files and mapping libraries
A simulation requires the following files:
design files (VHDL, Verilog, and/or SystemC), including stimulus for the design
libraries, both working and resource
modelsim.ini (automatically created by the library mapping command)
A library is a location where data to be used for simulation is stored. Libraries are
ModelSim's way of managing the creation of data before it is needed for use in
simulation. They also serve as a way to streamline simulation invocation: instead of
compiling all design data every time you simulate, ModelSim uses binary pre-compiled
data from these libraries. So, if you make a change to a single Verilog module, only that
module is recompiled, rather than all the modules in the design.
Before you can compile your source files, you must create a library in which to
store the compilation results. You can create the logical library using the GUI, via File
> New > Library (see "Creating a library"), or you can use the vlib command. For
example, the command:
vlib work
creates a library named work. By default, compilation results are stored in the work
library.
Mapping the logical work library to the physical work directory - vmap
VHDL uses logical library names that can be mapped to ModelSim
library directories. If libraries are not mapped properly when you invoke your
simulation, the necessary components will not be loaded and simulation will
fail. Similarly, compilation can also depend on proper library mapping.
By default, ModelSim can find libraries in your current directory
(assuming they have the right name), but for it to find libraries located
elsewhere, you need to map a logical library name to the pathname of the
library. You can use the GUI (see "Library mappings with the GUI"), a
command (the vmap command described below), or a project (see "Getting
started with projects") to assign a logical name to a design library.
The format for command line entry is:
vmap <logical_name> <directory_pathname>
This command sets the mapping between a logical library name and a directory.
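For example, to map a hypothetical logical library named my_lib to a directory created elsewhere on disk:
vmap my_lib /home/user/project/libs/my_lib
After this mapping, compilations directed at my_lib are stored in that directory, and the simulator can resolve design units from it by logical name.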
Step 2 - Compiling the design with vlog/vcom/sccom
Designs are compiled with one of the three language compilers.
Compiling Verilog - vlog
ModelSim's compiler for the Verilog modules in your design is vlog. Verilog files may
be compiled in any order, as they are not order-dependent. See "Compiling Verilog files"
for details. Verilog portions of the design can be optimized for better simulation
performance; see "Optimizing Verilog designs".
Compiling VHDL - vcom
ModelSim's compiler for VHDL design units is vcom. VHDL files must be compiled in
the order required by the design. Projects may assist you in determining the compile
order; for more information, see "Auto-generating compile order". See "Compiling
VHDL files" for details on VHDL compilation.
Compiling SystemC - sccom
ModelSim's compiler for SystemC design units is sccom; it is used only if you have
SystemC components in your design. See "Compiling SystemC files" for details.
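Putting the compilers together, a hypothetical mixed-language design might be compiled as follows (the file names are placeholders; vlog and vcom compile into the work library by default):
vlib work
vlog alu.v testbench.v
vcom control_unit.vhd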
Step 3 - Loading the design for simulation
vsim <top>
Your design is ready for simulation after it has been compiled and (optionally)
optimized with vopt. For more information on optimization, see "Optimizing Verilog
designs". You may then invoke vsim with the names of the top-level modules (many
designs contain only one top-level module) or with the name you assigned to the
optimized version of the design.
For example, if your top-level modules are "testbench" and "globals", then invoke the
simulator as follows:
vsim testbench globals
To simulate, the entity design first has to be loaded into the simulator. Do this by
selecting from the menu:
Simulate > Simulate
A new window will appear listing all the entities (not filenames) that are in the
work library. Select the FA entity for simulation and click OK.
In this case, it is possible to use Verilog to write a test bench that verifies the functionality
of the design, using files on the host computer to define stimuli, interacting with the user,
and comparing results with those expected.
A Verilog model is translated into the "gates and wires" that are mapped onto a
programmable logic device such as a CPLD or FPGA; it is then the actual hardware that is
configured, rather than the Verilog code being "executed" as if on some form of processor chip.
6.6.1 Implementation:
Synthesis (XST)
Produces a netlist file starting from an HDL description.
Translate (NGDBuild)
Converts all input design netlists and writes the results into a single merged
file that describes the logic and constraints.
Mapping (MAP)
Maps the logic onto device components: takes a netlist and groups the logical elements
into CLBs and IOBs (the components of an FPGA).
Place and Route (PAR)
Places FPGA cells and connects the cells.
Bit-stream generation (BitGen)
XILINX Design Process
Step 1: Design entry
HDL (Verilog or VHDL; ABEL for CPLDs), schematic drawings, bubble
diagrams
Step 2: Synthesis
Translates .v, .vhd, and .sch files into a netlist file (.ngc)
Step 3: Implementation
FPGA: Translate/Map/Place & Route; CPLD: Fitter
Step 4: Configuration/Programming
Download a BIT file into the FPGA
Program a JEDEC file into the CPLD
Program an MCS file into the Flash PROM
Simulation can occur after steps 1, 2, and 3.
6.7 Introduction to FPGA:
FPGA stands for Field Programmable Gate Array; it consists of an array of logic modules,
I/O modules, and routing tracks (programmable interconnect). An FPGA can be configured by
the end user to implement specific circuitry. Early devices ran at up to 100 MHz, while present
devices reach GHz speeds. The main applications are DSP, FPGA-based computers, logic
emulation, and ASIC/ASSP prototyping.
FPGAs are mainly programmed using SRAM (Static Random Access Memory) technology.
SRAM is volatile, and the main advantage of SRAM programming technology is
re-configurability. Issues in FPGA technology are the complexity of the logic element, clock
support, I/O support, and interconnections (routing).
FPGA Design Flow
An FPGA contains a two-dimensional array of logic blocks and interconnections
between the logic blocks. Both the logic blocks and the interconnects are programmable:
logic blocks are programmed to implement a desired function, and the interconnects are
programmed, using the switch boxes, to connect the logic blocks.
To be clearer: if we want to implement a complex design (a CPU, for instance), the
design is divided into small sub-functions, and each sub-function is implemented using one
logic block. Then, to obtain the desired design (the CPU), all the sub-functions implemented
in logic blocks must be connected together, and this is done by programming the interconnects.
FPGAs, an alternative to custom ICs, can be used to implement an entire System
on One Chip (SOC). The main advantage of an FPGA is its ability to be reprogrammed:
the user can reprogram an FPGA to implement a design after the FPGA has been
manufactured, hence the name "field programmable".
A two-input truth table (inputs A and B; the output columns correspond to A AND B and A OR B):

A B | A AND B | A OR B
0 0 |    0    |   0
0 1 |    0    |   1
1 0 |    0    |   1
1 1 |    1    |   1
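A table like the one above is exactly what an FPGA lookup table (LUT) stores. The C sketch below (a toy model, not vendor code) shows how any two-input function reduces to indexing a 4-bit truth-table mask, here reproducing the AND and OR columns:

#include <stdio.h>

/* A 2-input LUT is a 4-bit truth-table mask indexed by the inputs:
   bit (2*b + a) of the mask holds the output for inputs (a, b). */
static int lut2(unsigned mask, int a, int b) {
    return (mask >> (2 * b + a)) & 1;
}

int main(void) {
    const unsigned AND_MASK = 0x8;  /* binary 1000: 1 only when a=b=1 */
    const unsigned OR_MASK  = 0xE;  /* binary 1110: 0 only when a=b=0 */
    for (int a = 0; a <= 1; a++)
        for (int b = 0; b <= 1; b++)
            printf("%d %d -> AND=%d OR=%d\n",
                   a, b, lut2(AND_MASK, a, b), lut2(OR_MASK, a, b));
    return 0;
}

Reprogramming the device amounts to loading different masks, which is why one logic block can implement any function of its inputs.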
6.8 Interconnects:
A wire segment can be described as two end points of an interconnect with no
programmable switch between them. A sequence of one or more wire segments in an FPGA
can be termed a track.
Typically, an FPGA has logic blocks, interconnects, and switch blocks (input/output
blocks). Switch blocks lie in the periphery of the logic blocks and interconnect; wire segments
are connected to logic blocks through switch blocks. Depending on the required design, one
logic block is connected to another, and so on.
In this part of the tutorial we give a short introduction to the FPGA design flow. A
simplified version of the design flow is given in the following diagram.
Timing analysis can be done after the MAP or PAR processes. The post-MAP timing
report lists the signal path delays of the design derived from the design logic. The post-place-
and-route timing report incorporates timing delay information to provide a comprehensive
timing summary of the design.
CHAPTER VII
RESULTS
Simulation Results:
Synthesis Results:
RTL schematic:
Technology Schematic:
Design Summary:
Timing Report:
CHAPTER VIII
CONCLUSION