
A Generalized Algorithm and Reconfigurable Architecture for Efficient and Scalable Orthogonal Approximation of DCT
ABSTRACT

Approximation of discrete cosine transform (DCT) is useful for reducing its computational complexity without significant impact on its coding performance. Most of
the existing algorithms for approximation of the DCT target only the DCT of small
transform lengths, and some of them are non-orthogonal. This paper presents a
generalized recursive algorithm to obtain an orthogonal approximation of DCT, where an approximate DCT of length N can be derived from a pair of DCTs of length N/2 at the cost of N additions for input preprocessing. We perform recursive sparse matrix decomposition and
make use of the symmetries of DCT basis vectors for deriving the proposed
approximation algorithm. The proposed algorithm is highly scalable for hardware as well as software implementation of DCTs of higher lengths, and it can make use of an existing approximation of the 8-point DCT to obtain an approximate DCT of any power-of-two length. We demonstrate that the proposed approximation of DCT provides comparable or
better image and video compression performance than the existing approximation
methods. It is shown that the proposed algorithm involves lower arithmetic complexity than the other existing approximation algorithms. We have presented a fully
scalable reconfigurable parallel architecture for the computation of approximate DCT
based on the proposed algorithm. One uniquely interesting feature of the proposed design
is that it could be configured for the computation of a 32-point DCT or for parallel
computation of two 16-point DCTs or four 8-point DCTs with a marginal control
overhead. The proposed architecture is found to offer many advantages in terms of
hardware complexity, regularity and modularity. Experimental results obtained from
FPGA implementation show the advantage of the proposed method.
INTRODUCTION

The discrete cosine transform (DCT) has been widely applied in image and video compression standards such as JPEG, MPEG-2/4, and H.263. Its popularity is attributed to its ability to decorrelate spatial-domain data into frequency-domain data. Data become more compact after being transformed, so redundant information can be further removed. The DCT is considered the closest to the Karhunen-Loève (K-L) transform, which is the ideal energy-compaction transform. However, the matrix elements of the DCT are real numbers represented by a finite number of bits, which inevitably leads to the possibility of drift (mismatch between the decoded data in the encoder and the decoder). Several methods have been introduced to control the accumulation of drift in video compression standards before H.264. However, H.264 makes extensive use of prediction, which makes it very sensitive to drift [2]. In order to eliminate the mismatch between encoders and decoders and to facilitate low-complexity implementations, the latest video standards such as H.264, VC-1, and AVS have adopted integer transforms. High Efficiency Video Coding (HEVC) [3] is the newest standard for high-definition video processing and is considered the successor of H.264. The main goal of HEVC is to achieve 50% higher coding efficiency than H.264. In order to achieve this goal, HEVC adopts many state-of-the-art coding tools, including 4/8/16/32-point integer transforms. Compared to H.264, not only are the transform matrices larger, but so are their elements, which makes both hardware and software implementation considerably more complicated. In this work, a fast algorithm for the 8x8 integer transform of HEVC is presented, which is suitable for hardware and software implementation.

Today we are talking about digital networks and digital representation of images, movies, video, TV, voice, and digital libraries, all because digital representation of the signal is
more robust than the analog counterpart for processing, manipulation, storage, recovery,
and transmission over long distances, even across the globe through communication
networks. In recent years, there have been significant advancements in processing of still
image, video, graphics, speech, and audio signals through digital computers in order to
accomplish different application challenges. As a result, multimedia information
comprising image, video, audio, speech, text, and other data types has the potential to
become just another data type. Development of efficient image compression techniques
continues to be an important challenge, both in academia and in industry. Early designs implemented the DCT with multipliers; later, to reduce area, ROM-based distributed arithmetic (DA) was applied to DCT design, in which DA-based multipliers using ROMs produce partial products and adders accumulate those partial products. Applying ROM-based DA to a DCT core design reduces the required area. In addition, the symmetry properties of the DCT and a parallel DA architecture can be used to reduce the ROM size. Recently, ROM-free DA architectures were presented. Shams et al. employed a bit-level sharing scheme to construct an adder-based butterfly matrix called new DA (NEDA). After compression, the butterfly adder matrix utilized 35 adders and 8 shift-addition elements to replace the ROM. Based on the NEDA architecture, a recursive form and an ALU were applied in DCT design to reduce area cost, but speed limitations exist in the serial shifting and addition operations that follow the DA computation. In DA-based computation, partial-product words are shifted and added in parallel; however, a large truncation error can occur.

Truncation error is introduced when the least significant part of a product is directly discarded, and this error must be kept small. In order to reduce the effect of truncation error, several error-compensation bias methods have been presented based on statistical analysis of the relationship between the partial products and the multiplier and multiplicand. Hardware complexity is reduced if the truncation error is minimized. In general, the truncation part (TP) is simply discarded to reduce hardware cost in parallel shifting and addition operations, known as the direct truncation (Direct-T) method. Thus, a large truncation error occurs because the carry propagation from the TP to the main part (MP) is neglected. Distributed arithmetic is a bit-level rearrangement of a multiply-accumulate operation that hides the multiplications. It is a powerful technique for reducing the size of a parallel hardware multiply-accumulate unit and is well suited to FPGA designs. The discrete cosine transform (DCT) is widely used in digital image processing for image compression, especially in image transform coding. However, although most existing algorithms are good software solutions for realizing the DCT, only a few of them are really suitable for VLSI implementation. Cyclic convolution plays an important role in digital signal processing because of its ease of implementation. Specifically, there exist a number of well-developed convolution algorithms, and cyclic convolution can be easily realized through modular and structural hardware such as distributed arithmetic and systolic arrays. The way data are moved forms a significant part in determining the efficiency of a DA-based realization of a transform.
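As an illustration of the distributed-arithmetic idea described above, the following minimal Python sketch computes a four-term inner product by precomputing a small table (the "ROM") of partial sums of the constants and then shift-accumulating one bit plane of the inputs per step. Unsigned inputs and arbitrary example constants are assumed here; a practical DA unit handles two's-complement data with a sign correction.

def da_inner_product(c, x, word_len=8):
    # Precompute the "ROM": entry for address a is the sum of the constants
    # c[k] whose corresponding address bit is 1.
    n = len(c)
    rom = [sum(c[k] for k in range(n) if (a >> k) & 1) for a in range(1 << n)]
    acc = 0
    for j in range(word_len):             # one bit plane of the inputs per step
        addr = 0
        for k, xk in enumerate(x):        # gather bit j of every input word
            addr |= ((xk >> j) & 1) << k
        acc += rom[addr] << j             # shift-accumulate the ROM word
    return acc

# Quick check against a direct multiply-accumulate:
c, x = [64, 83, 36, 7], [12, 5, 200, 33]
assert da_inner_product(c, x) == sum(ck * xk for ck, xk in zip(c, x))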

WHY COMPRESSION
Despite the many advantages of digital representation of signals compared to the
analog counterpart, they need a very large number of bits for storage and transmission.
For example, a high-quality audio signal requires approximately 1.5 megabits per second
for digital representation and storage. A television-quality, low-resolution color video of 30 frames per second, with each frame containing 640 x 480 pixels (24 bits per color pixel), needs a data rate of more than 210 megabits per second. As a result, a digitized one-hour
color movie would require approximately 95 gigabytes of storage. The storage
requirement for upcoming high-definition television (HDTV) of resolution 1280 x 720 at
60 frames per second is far greater. A digitized one-hour color movie of HDTV-quality
video will require approximately 560 gigabytes of storage. A digitized 14 x 17 square inch radiograph scanned at 70 µm occupies nearly 45 megabytes of storage. Transmission of these digital signals through limited-bandwidth communication channels is an even greater challenge, and is sometimes impossible in raw form. Although the cost of storage
has decreased drastically over the past decade due to significant advancement in
microelectronics and storage technology, the requirement of data storage and data
processing applications is growing explosively to outpace this achievement.
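A quick back-of-the-envelope check of the figures quoted above is sketched below; the exact numbers depend on whether decimal gigabytes or binary gibibytes are used and on rounding, so small differences from the quoted values are expected.

# Standard-definition video: 640 x 480 pixels, 24 bits/pixel, 30 frames/s
sd_bits_per_second = 640 * 480 * 24 * 30
print(sd_bits_per_second / 1e6)              # ~221 Mbit/s (more than 210 Mbit/s)
print(sd_bits_per_second * 3600 / 8 / 1e9)   # ~99.5 GB for a one-hour movie

# HDTV: 1280 x 720 pixels, 24 bits/pixel, 60 frames/s
hd_bits_per_second = 1280 * 720 * 24 * 60
print(hd_bits_per_second * 3600 / 8 / 1e9)   # ~597 GB (~556 GiB) for one hour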

ADVANTAGES OF DATA COMPRESSION

The main advantage of compression is that it reduces the data storage requirements. It also offers an attractive approach to reduce the communication cost in
transmitting high volumes of data over long-haul links via higher effective utilization of
the available bandwidth in the data links. This significantly aids in reducing the cost of
communication due to the data rate reduction. Because of the data rate reduction, data
compression also increases the quality of multimedia presentation through limited-
bandwidth communication channels. Hence the audience can experience rich-quality
signals for audio-visual data representation. For example, because of the sophisticated
compression technologies we can receive toll-quality audio at the other side of the globe
through the good old telecommunications channels at a much better price compared to a
decade ago. Because of the significant progress in image compression techniques, a single
6 MHz broadcast television channel can carry HDTV signals to provide better quality
audio and video at much higher rates and enhanced resolution without additional
bandwidth requirements. The rate of input-output operations in a computing device can be
greatly increased due to shorter representation of data. In systems with levels of storage
hierarchy, data compression in principle makes it possible to store data at a higher and
faster storage level (usually with smaller capacity), thereby reducing the load on the
input-output channels. Data compression obviously reduces the cost of backup and
recovery of data in computer systems by storing the backup of large database files in
compressed form. The advantages of data compression will enable more multimedia
applications with reduced cost and hence aid its usage by a larger population with newer
applications in the near future.

DISADVANTAGES OF DATA COMPRESSION

Although data compression offers numerous advantages and it is the most sought-
after technology in most of the data application areas, it has some disadvantages too,
depending on the application area and sensitivity of the data. For example, the extra
overhead incurred by encoding and decoding processes is one of the most serious
drawbacks of data compression, which discourages its usage in some areas (e.g., in many
large database applications). This extra overhead is usually required in order to uniquely
identify or interpret the compressed data. For example, the encoding/decoding tree in a
Huffman coding type compression scheme is stored in the output file in addition to the
encoded bit-stream. These overheads run opposite to the essence of data compression,
that of reducing storage requirements. In large statistical or scientific databases where
changes in the database are not very frequent, the decoding process has greater impact on
the performance of the system than the encoding process. Even if we want to access and
manipulate a single record in a large database, it may be necessary to decompress the
whole database before we can access the desired record. After the data are accessed and possibly modified, the database must be compressed again for storage. The delay incurred
due to these compression and decompression processes could be prohibitive for many
real-time interactive database access requirements unless extra care and complexity are
added in the data arrangement in the database.
Literature Review:

2.1 Types of compressions
There are two types of compressions

1. Lossless compression

Digitally identical to the original image; only a modest amount of compression is achieved.

Lossless compression involves compressing data such that, when decompressed, the data are an exact replica of the original. This is the case when binary data such as executables are compressed.

2. Lossy compression

Discards components of the signal that are known to be redundant; the signal is therefore changed from the input.

Figure 2.1 Different Types of Lossy Compression Techniques (lossy compression divides into predictive, importance-oriented, frequency-oriented, hybrid, and transform-based methods; transform coding includes DCT, DWT, and fractal approaches, with DWT variants such as the Mallat, transversal-filter, and lifting-scheme implementations).


2.2 Description of the algorithms:

2.2.1 Discrete Cosine Transform (DCT):

The forward and inverse 2-D DCT of an N x N block can be written as

Z(u,v) = c(u) c(v) Σi Σj x(i,j) cos[(2i+1)uπ/2N] cos[(2j+1)vπ/2N],  i, j = 0, 1, ..., N-1   (1)

x(i,j) = Σu Σv c(u) c(v) Z(u,v) cos[(2i+1)uπ/2N] cos[(2j+1)vπ/2N],  u, v = 0, 1, ..., N-1   (2)

with c(0) = sqrt(1/N) and c(k) = sqrt(2/N) for k > 0, where x(i,j) is the image pixel data and Z(u,v) is the transformed data.

The 2-D DCT is an orthogonal and separable transform. It can be expressed in matrix notation as two 1-D DCTs as follows: Z = C X C^T and X = C^T Z C. Therefore, we can decompose (1) into two 1-D DCTs, and (2) can be decomposed into two 1-D IDCTs. A standard diagram is shown in Figure 2.2, where the computation of the 2-D DCT has been separated into two 1-D DCTs.

Figure 2.2 Standard DCT Computation
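The separability property can be checked numerically. The short numpy sketch below builds the orthonormal 8-point DCT-II matrix (one common definition; the scaled integer kernels used later differ only by scaling and rounding) and verifies that Z = C X C^T and X = C^T Z C recover the original block.

import numpy as np

N = 8
k = np.arange(N).reshape(-1, 1)                  # row (frequency) index
n = np.arange(N).reshape(1, -1)                  # column (sample) index
C = np.sqrt(2.0 / N) * np.cos((2 * n + 1) * k * np.pi / (2 * N))
C[0, :] = np.sqrt(1.0 / N)                       # DC row scaled for orthonormality

X = np.random.rand(N, N)                         # an 8x8 block of pixel data
Z = C @ X @ C.T                                  # forward 2-D DCT as two 1-D DCTs
X_rec = C.T @ Z @ C                              # inverse 2-D DCT
assert np.allclose(X, X_rec)                     # C is orthogonal, so C^T C = I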


2.2.2 Computation of the DCT
The 8 x 8 DCT coefficient matrix can be written as

(7)

Even rows of C are even-symmetric and odd rows are odd-symmetric. Therefore, by exploiting this symmetry in the rows of C and separating the even and odd rows, the 1-D DCT can be obtained as follows:

(8)

The 1-D DCT is then written as follows:

(9)
Algorithm for Hardware Implementation of Integer DCT for HEVC:

In the Joint Collaborative Team on Video Coding (JCT-VC), which manages the standardization of HEVC, Core Experiment 10 (CE10) studied the design of core transforms over several meeting cycles [13]. The eventual HEVC transform design [14] involves coefficients of 8-bit size but, unlike other competing proposals [13], does not allow full factorization. It does, however, allow both matrix-multiplication and partial-butterfly implementations. In this section, we
have used the partial-butterfly algorithm of [14] for the computation of integer DCT along with its
efficient algorithmic transformation for hardware implementation.

A. Key Features of Integer DCT for HEVC

The N-point integer DCT for HEVC given by [14] can be computed by a partial butterfly approach using an (N/2)-point DCT and a matrix-vector product of an (N/2) x (N/2) matrix with an (N/2)-point vector as

[y(0), y(2), ..., y(N-2)]^T = C_{N/2} [a(0), a(1), ..., a(N/2-1)]^T    (1a)
and
[y(1), y(3), ..., y(N-1)]^T = M_{N/2} [b(0), b(1), ..., b(N/2-1)]^T    (1b)
where
a(i) = x(i) + x(N-1-i) and b(i) = x(i) - x(N-1-i)

for i = 0, 1, ..., N/2-1. X = [x(0), x(1), ..., x(N-1)] is the input vector and Y = [y(0), y(1), ..., y(N-1)] is the N-point DCT of X. C_{N/2} is the (N/2)-point integer DCT kernel matrix of size (N/2) x (N/2). M_{N/2} is also a matrix of size (N/2) x (N/2), and its (i, j)th entry is defined as

m_{i,j} = c_{2i+1,j}^{N}

where c_{2i+1,j}^{N} is the (2i+1, j)th entry of the matrix C_N. Note that (1a) could be
similarly decomposed, recursively, further using C_{N/4} and M_{N/4}.
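The decomposition in (1a) and (1b) can be sketched in a few lines of Python for the 8-point case. The 4-point kernel C4 and the odd-part matrix M4 below use the HEVC coefficient values {64, 83, 36} and {89, 75, 50, 18}; the scaling and rounding stages of the actual standard are omitted, so this is an illustration of the structure rather than a bit-exact implementation.

import numpy as np

C4 = np.array([[64,  64,  64,  64],
               [83,  36, -36, -83],
               [64, -64, -64,  64],
               [36, -83,  83, -36]])
M4 = np.array([[89,  75,  50,  18],
               [75, -18, -89, -50],
               [50, -89,  18,  75],
               [18, -50,  75, -89]])

def dct8_partial_butterfly(x):
    x = np.asarray(x, dtype=np.int64)
    a = x[:4] + x[7:3:-1]          # a(i) = x(i) + x(N-1-i), STAGE-1
    b = x[:4] - x[7:3:-1]          # b(i) = x(i) - x(N-1-i)
    y = np.empty(8, dtype=np.int64)
    y[0::2] = C4 @ a               # even-indexed outputs: 4-point DCT of a, as in (1a)
    y[1::2] = M4 @ b               # odd-indexed outputs: M4 times b, as in (1b)
    return y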

B. Hardware Oriented Algorithm

Direct implementation of (1) requires N^2/4 + MUL_{N/2} multiplications, N^2/4 + N/2 + ADD_{N/2} additions, and 2 shifts, where MUL_{N/2} and ADD_{N/2} are the numbers of multiplications and additions/subtractions of the (N/2)-point DCT, respectively.
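The recurrence above can be unrolled as in the sketch below. The counts for the 4-point base case depend on how the 4-point DCT itself is realized, so they are left as assumed parameters rather than fixed numbers.

def reference_complexity(N, mul4, add4):
    # MUL_N = N^2/4 + MUL_{N/2},  ADD_N = N^2/4 + N/2 + ADD_{N/2}
    if N == 4:
        return mul4, add4
    mul_half, add_half = reference_complexity(N // 2, mul4, add4)
    return N * N // 4 + mul_half, N * N // 4 + N // 2 + add_half

# For example, reference_complexity(32, mul4, add4) expands to
# 336 + mul4 multiplications and 364 + add4 additions.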

Computation of (1) could be treated as a constant matrix multiplication (CMM) problem [15]-[17]. Since the absolute values of the coefficients in all the rows and columns of matrix M in (1b) are identical, the CMM problem can be implemented as a set of N/2 multiple constant multiplications (MCMs), which results in a highly regular architecture and a low-complexity implementation. The kernel matrices for 4-, 8-, 16-, and 32-point integer DCT for HEVC are given in [14]; the 4- and 8-point integer DCT kernels are, respectively,

        | 64  64  64  64 |
   C4 = | 83  36 -36 -83 |
        | 64 -64 -64  64 |
        | 36 -83  83 -36 |

and

        | 64  64  64  64  64  64  64  64 |
        | 89  75  50  18 -18 -50 -75 -89 |
        | 83  36 -36 -83 -83 -36  36  83 |
   C8 = | 75 -18 -89 -50  50  89  18 -75 |    (2)
        | 64 -64 -64  64  64 -64 -64  64 |
        | 50 -89  18  75 -75 -18  89 -50 |
        | 36 -83  83 -36 -36  83 -83  36 |
        | 18 -50  75 -89  89 -75  50 -18 |

Based on (1) and (2), hardware-oriented algorithms for DCT computation can be derived in three stages as in Table I. For the 8-, 16-, and 32-point DCT, the even-indexed coefficients [y(0), y(2), y(4), ..., y(N-2)] are computed as 4-, 8-, and 16-point DCTs of [a(0), a(1), a(2), ..., a(N/2-1)], respectively, according to (1a). In Table II, we have listed the arithmetic complexities of the reference algorithm and the MCM-based algorithm for 4-, 8-, 16-, and 32-point DCT. Algorithms for the inverse DCT (IDCT) can also be derived in a similar way.

TABLE I

Three-Stage Hardware-Oriented Algorithms for the Computation of 4-, 8-, 16-, and 32-Point DCT
Proposed Architectures for Integer DCT Computation:

A. Proposed Architecture for Four-Point Integer DCT:


The proposed architecture for four-point integer DCT is shown in Fig. 1(a). It
consists of an input adder unit (IAU), a shift-add unit (SAU), and an output adder unit
(OAU). The IAU computes a(0), a(1), b(0), and b(1) according to STAGE-1 of the algorithm as described in Table I. The computations of t_{i,36} and t_{i,83} are performed by two SAUs according to STAGE-2 of the algorithm. The computation of t_{0,64} and t_{1,64} does not consume any logic, since multiplication by 64 is a pure shift that can be rewired in hardware. The structure of the SAU is shown in Fig. 1(b). The outputs of the SAUs are finally added by the OAU according to STAGE-3 of the algorithm.

Fig. 1. Proposed architecture of four-point integer DCT. (a) Four-point DCT architecture. (b) Structure of SAU.
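One possible way the SAU can realize the constant multiplications for the four-point DCT is by shift-and-add decompositions of the coefficients 64, 36, and 83, as in the short sketch below; the adder arrangement inside the actual SAU of Fig. 1(b) may differ, so this is only illustrative.

def sau_4pt(v):
    t64 = v << 6                          # 64*v: a pure shift, rewired in hardware
    t36 = (v << 5) + (v << 2)             # 36*v = 32*v + 4*v (one adder)
    t83 = t64 + (v << 4) + (v << 1) + v   # 83*v = 64*v + 16*v + 2*v + v
    return t64, t36, t83

assert sau_4pt(5) == (320, 180, 415)      # 5*64, 5*36, 5*83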

B. Proposed Architecture for Integer DCT of Length 8 and Higher-Length DCTs:


The generalized architecture for N-point integer DCT based on the proposed algorithm is shown in Fig. 2. It consists of four units, namely the IAU, the (N/2)-point integer DCT unit, the SAU, and the OAU. The IAU computes a(i) and b(i) for i = 0, 1, ..., N/2-1 according to STAGE-1 of the algorithm of Section II-B. The SAU provides the result of multiplication of an input sample with a DCT coefficient according to STAGE-2 of the algorithm. Finally, the OAU generates the output of the DCT from a binary adder tree of log2(N) - 1 stages. Fig. 3(a)-(c), respectively, illustrates the structures of the IAU, SAU, and OAU in the case of the eight-point integer DCT. Four SAUs are required to compute t_{i,89}, t_{i,75}, t_{i,50}, and t_{i,18} for i = 0, 1, 2, and 3 according to STAGE-2 of the algorithm. The outputs of the SAUs are finally added by a two-stage adder tree according to STAGE-3 of the algorithm. Structures for the 16- and 32-point integer DCT can also be obtained similarly.
Fig. 2. Proposed generalized architecture for integer DCT of lengths N = 8, 16, and 32.
Fig. 3. Proposed architecture of eight-point integer DCT and IDCT. (a) Structure of IAU. (b) Structure of SAU. (c) Structure of OAU.

CHAPTER III
HARDWARE REQUIREMENTS:
5.1 GENERAL
Integrated circuit (IC) technology is the enabling technology for a whole host of
innovative devices and systems that have changed the way we live. Jack Kilby, who invented the integrated circuit together with Robert Noyce, received the 2000 Nobel Prize in Physics for the invention; without the integrated circuit, neither transistors nor computers would be as
important as they are today. VLSI systems are much smaller and consume less power than
the discrete components used to build electronic systems before the 1960s.
Integration allows us to build systems with many more transistors, allowing much
more computing power to be applied to solving a problem. Integrated circuits are also
much easier to design and manufacture and are more reliable than discrete systems; that
makes it possible to develop special-purpose systems that are more efficient than general-
purpose computers for the task at hand.

5.2 APPLICATIONS OF VLSI


Electronic systems now perform a wide variety of tasks in daily life. Electronic
systems in some cases have replaced mechanisms that operated mechanically,
hydraulically, or by other means; electronics are usually smaller, more flexible, and easier
to service. In other cases electronic systems have created totally new applications.
Electronic systems perform a variety of tasks, some of them visible, some more hidden:
Personal entertainment systems such as portable MP3 players and DVD players perform
sophisticated algorithms with remarkably little energy.
Electronic systems in cars operate stereo systems and displays; they also control fuel
injection systems, adjust suspensions to varying terrain, and perform the control functions
required for anti-lock braking (ABS) systems.
Digital electronics compress and decompress video, even at high definition data rates, on-
the-fly in consumer electronics.
Low-cost terminals for Web browsing still require sophisticated electronics, despite their
dedicated function.
Personal computers and workstations provide word-processing, financial analysis, and
games. Computers include both central processing units (CPUs) and special-purpose
hardware for disk access, faster screen display, etc.

Medical electronic systems measure bodily functions and perform complex


processing algorithms to warn about unusual conditions. The availability of these
complex systems, far from overwhelming consumers, only creates demand for even more
complex systems. The growing sophistication of applications continually pushes the
design and manufacturing of integrated circuits and electronic systems to new levels of
complexity.
And perhaps the most amazing characteristic of this collection of systems is its variety: as systems become more complex, we build not a few general-purpose computers but an ever-wider range of special-purpose systems. Our ability to do so is a testament to
our growing mastery of both integrated circuit manufacturing and design, but the
increasing demands of customers continue to test the limits of design and manufacturing.

5.3 ADVANTAGES OF VLSI:


While we will concentrate on integrated circuits in this book, the properties of integrated circuits (what we can and cannot efficiently put in an integrated circuit) largely determine the architecture of the entire system. Integrated circuits improve system characteristics in several critical ways. ICs have three key advantages over digital circuits built from discrete components:
Size: Integrated circuits are much smaller; both transistors and wires are shrunk to micrometer sizes, compared to the millimeter or centimeter scales of discrete components. Small size leads to advantages in speed and power consumption, since smaller components have smaller parasitic resistances, capacitances, and inductances.
Speed: Signals can be switched between logic 0 and logic 1 much more quickly within a chip than they can between chips. Communication within a chip can occur hundreds of times faster than communication between chips on a printed circuit board. The high speed of circuits on-chip is due to their small size: smaller components and wires have smaller parasitic capacitances to slow down the signal.
Power consumption: Logic operations within a chip also take much less power. Once again, lower power consumption is largely due to the small size of circuits on the chip: smaller parasitic capacitances and resistances require less power to drive them.

5.4 VLSI AND SYSTEMS


These advantages of integrated circuits translate into advantages at the system level:
Smaller physical size: Smallness is often an advantage in itself; consider portable televisions or handheld cellular telephones.
Lower power consumption: Replacing a handful of standard parts with a single chip reduces total power consumption. Reducing power consumption has a ripple effect on the rest of the system: a smaller, cheaper power supply can be used; since less power consumption means less heat, a fan may no longer be necessary; and a simpler cabinet with less shielding against electromagnetic interference may be feasible, too.
Reduced cost: Reducing the number of components, the power supply requirements,
cabinet costs, and so on, will inevitably reduce system cost. The ripple effect of
integration is such that the cost of a system built from custom ICs can be less, even
though the individual ICs cost more than the standard parts they replace. Understanding
why integrated circuit technology has such profound influence on the design of digital
systems requires understanding both the technology of IC manufacturing and the
economics of ICs and digital systems.
5.5 INTEGRATED CIRCUIT MANUFACTURING

Integrated circuit technology is based on our ability to manufacture huge numbers of very small devices; today, more transistors are manufactured in California each year than raindrops fall on the state. In this section, we briefly survey VLSI manufacturing.
TECHNOLOGY
Most manufacturing processes are fairly tightly coupled to the item they are
manufacturing. An assembly line built to produce Buicks, for example, would have to undergo moderate reorganization to build Chevys: tools like sheet metal molds would have to be replaced, and even some machines would have to be modified. And either
assembly line would be far removed from what is required to produce electric drills.
5.6 MASK-DRIVEN MANUFACTURING
Integrated circuit manufacturing technology, on the other hand, is remarkably
versatile. While there are several manufacturing processes for different circuit types (CMOS, bipolar, etc.), a manufacturing line can make any circuit of that type simply by changing a few basic tools called masks. For example, a single CMOS manufacturing
plant can make both microprocessors and microwave oven controllers by changing the
masks that form the patterns of wires and transistors on the chips.
Silicon wafers are the raw material of IC manufacturing. The fabrication process
forms patterns on the wafer that create wires and transistors. A series of identical chips is patterned onto the wafer (with some space reserved for test circuit structures that allow manufacturing engineers to measure the results of the manufacturing process).
The IC manufacturing process is efficient because we can produce many identical chips
by processing a single wafer. By changing the masks that determine what patterns are laid
down on the chip, we determine the digital circuit that will be created.
The IC fabrication line is a generic manufacturing line: we can quickly retool the line to make large quantities of a new kind of chip, using the same processing steps used for the line's previous product.
5.7 CIRCUITS AND LAYOUTS
We could build a breadboard circuit out of standard parts. To build it on an IC
fabrication line, we must go one step further and design the layout, or patterns on the
masks. The rectangular shapes in the layout (shown here as a sketch called a stick
diagram) form transistors and wires which conform to the circuit in the schematic.
Creating layouts is very time-consuming and very important: the size of the layout determines the cost to manufacture the circuit, and the shapes of the elements in the layout determine the speed of the circuit as well. During manufacturing, a
photolithographic (photographic printing) process is used to transfer the layout patterns
from the masks to the wafer. The patterns left by the mask are used to selectively change
the wafer: impurities are added at selected locations in the wafer; insulating and
conducting materials are added on top of the wafer as well.
These fabrication steps require high temperatures, small amounts of highly toxic
chemicals, and extremely clean environments. At the end of processing, the wafer is divided into a number of chips.
5.8 MANUFACTURING DEFECTS
Because no manufacturing process is perfect, some of the chips on the wafer may
not work. Since at least one defect is almost sure to occur on each wafer, wafers are cut into smaller, working chips; the largest chip that can be reasonably manufactured today is 1.5 to 2 cm on a side, while wafer diameters are moving from 30 to 45 cm. Each chip is
individually tested; the ones that pass the test are saved after the wafer is diced into chips.
The working chips are placed in the packages familiar to digital designers.
In some packages, tiny wires connect the chip to the package's pins while the package body protects the chip from handling and the elements; in others, solder bumps
directly connect the chip to the package. Integrated circuit manufacturing is a powerful
technology for two reasons: all circuits can be made out of a few types of transistors and
wires; and any combination of wires and transistors can be built on a single fabrication
line just by changing the masks that determine the pattern of components on the chip.
Integrated circuits run very fast because the circuits are very small.
Just as important, we are not stuck building a few standard chip types; we can build any function we want. The flexibility given by IC manufacturing lets us build faster, more complex digital systems in ever-greater variety.
5.9 ECONOMICS
Because integrated circuit manufacturing has so much leverage (a great number of parts can be built with a few standard manufacturing procedures), a great deal of effort has gone into improving IC manufacturing. However, as chips become more complex, the
cost of designing a chip goes up and becomes a major part of the overall cost of the chip.
Moore's Law
In the 1960s, Gordon Moore predicted that the number of transistors that could be manufactured on a chip would grow exponentially. His prediction, now known as Moore's Law, was remarkably prescient. Moore's ultimate prediction was that transistor count would double every two years, an estimate that has held up remarkably well. Today, an industry group maintains the International Technology Roadmap for Semiconductors (ITRS), which maps out strategies to maintain the pace of Moore's Law.
Terminology
The most basic parameter associated with a manufacturing process is the minimum
channel length of a transistor. (In this book, for example, we will use a technology that can manufacture 180 nm transistors.) A manufacturing technology at a
particular channel length is called a technology node. We often refer to a family of
technologies at similar feature sizes: micron, submicron, deep submicron, and now
nanometer technologies. The term nanometer technology is generally used for
technologies below 100 nm.
5.10 COST OF MANUFACTURING
IC manufacturing plants are extremely expensive. A single plant costs as much as
$4 billion. Given that a new, state-of-the-art manufacturing process is developed every
three years, that is a sizeable investment. The investment makes sense because a single
plant can manufacture so many chips and can easily be switched to manufacture different
types of chips.
In the early years of the integrated circuits business, companies focused on
building large quantities of a few standard parts. These parts are commodities: one 80 ns, 256 Mb dynamic RAM is more or less the same as any other, regardless of the manufacturer.
Companies concentrated on commodity parts in part because manufacturing
processes were less well understood and manufacturing variations are easier to keep
track of when the same part is being fabricated day after day.
Standard parts also made sense because designing integrated circuits was hard: not only the circuit but also the layout had to be designed, and there were few computer programs to help automate the design process.
5.11 COST OF DESIGN
One of the less fortunate consequences of Moore's Law is that the time and money required to design a chip go up steadily. The cost of designing a chip comes from several factors:
Skilled designers are required to specify, architect, and implement the chip. A design team may range from a half-dozen people for a very small chip to 500 people for a large, high-performance microprocessor.
These designers cannot work without access to a wide range of computer-aided
design (CAD) tools. These tools synthesize logic, create layouts, simulate, and
verify designs. CAD tools are generally licensed and you must pay a yearly fee to
maintain the license. A license for a single copy of one tool, such as logic
synthesis, may cost as much as $50,000 US.
The CAD tools require a large compute farm on which to run. During the most
intensive part of the design process, the design team will keep dozens of
computers running continuously for weeks or months.
A large ASIC, which contains millions of transistors but is not fabricated on the
state-of-the-art process, can easily cost $20 million US and as much as $100
million. Designing a large microprocessor costs hundreds of millions of dollars.
DESIGN COSTS AND IP
We can spread these design costs over more chips if we can reuse all or part of the
design in other chips. The high cost of design is the primary motivation for the rise of
IP-based design, which creates modules that can be reused in many different designs.
5.12 TYPES OF CHIPS
The preponderance of standard parts pushed the problems of building customized
systems back to the board-level designers who used the standard parts.
Since a function built from standard parts usually requires more components than
if the function were built with custom designed ICs, designers tended to build smaller,
simpler systems. The industrial trend, however, is to make available a wider variety of
integrated circuits. The greater diversity of chips includes:
More specialized standard parts:
In the 1960s, standard parts were logic gates; in the 1970s they were LSI
components. Today, standard parts include fairly specialized components: communication
network interfaces, graphics accelerators, floating point processors. All these parts are
more specialized than microprocessors but are used in enough volume that designing
special-purpose chips is worth the effort.
In fact, putting a complex, high-performance function on a single chip often makes other applications possible; for example, single-chip floating point processors
make high-speed numeric computation available on even inexpensive personal
computers.
Application-specific integrated circuits (ASICs)
Rather than build a system out of standard parts, designers can now create a single
chip for their particular application. Because the chip is specialized, the functions of
several standard parts can often be squeezed into a single chip, reducing system size,
power, heat, and cost. Application-specific ICs are possible because of computer tools
that help humans design chips much more quickly.
Systems-on-chips (SoCs).
Fabrication technology has advanced to the point that we can put a complete
system on a single chip. For example, a single-chip computer can include a CPU, bus, I/O
devices, and memory. SoCs allow systems to be made at much lower cost than the
equivalent board-level system. SoCs can also be higher performance and lower power
than board-level equivalents because on-chip connections are more efficient than chip-to
chip connections.
A wider variety of chips is now available in part because fabrication methods are
better understood and more reliable. More importantly, as the number of transistors per
chip grows, it becomes easier and cheaper to design special-purpose ICs. When only a
few transistors could be put on a chip, careful design was required to ensure that even
modest functions could be put on a single chip. Today's VLSI manufacturing processes, which can put millions of carefully designed transistors on a chip, can also be used to put tens of thousands of less carefully designed transistors on a chip.

Even though the chip could be made smaller or faster with more design effort, the advantages of having a single-chip implementation of a function that can be quickly designed often outweigh the lost potential performance.
The problem, and the challenge, of the ability to manufacture such large chips is design: the ability to make effective use of the millions of transistors on a chip to perform a useful function.

5.13 CMOS TECHNOLOGY:


CMOS is the dominant integrated circuit technology. In this section we will
introduce some basic concepts of CMOS to understand why it is so widespread and some
of the challenges introduced by the inherent characteristics of CMOS.
5.13.1 POWER CONSUMPTION
POWER CONSUMPTION CONSTRAINTS:
The huge chips that can be fabricated today are possible only because of the relatively tiny power consumption of CMOS circuits. Power consumption is critical at the chip
level because much of the power is dissipated as heat, and chips have limited heat
dissipation capacity. Even if the system in which a chip is placed can supply large
amounts of power, most chips are packaged to dissipate fewer than 10 to 15 Watts of
power before they suffer permanent damage (though some chips dissipate well over 50
Watts thanks to special packaging).
The power consumption of a logic circuit can, in the worst case, limit the number of transistors we can effectively put on a single chip. Limiting the number of transistors per
chip changes system design in several ways. Most obviously, it increases the physical size
of a system. Using high-powered circuits also increases power supply and cooling
requirements.
A more subtle effect is caused by the fact that the time required to transmit a
signal between chips is much larger than the time required to send the same signal
between two transistors on the same chip; as a result, some of the advantage of using a
higher-speed circuit family is lost. Another subtle effect of decreasing the level of
integration is that the electrical design of multi-chip systems is more complex:
microscopic wires on-chip exhibit parasitic resistance and capacitance, while macroscopic
wires between chips have capacitance and inductance, which can cause a number of
ringing effects that are much harder to analyze.
The close relationship between power consumption and heat makes low-power design techniques important knowledge for every CMOS designer. Of course, low-energy
design is especially important in battery-operated systems like cellular telephones.
Energy, in contrast, must be saved by avoiding unnecessary work.
We will see throughout the rest of this book that minimizing power and energy
consumption requires careful attention to detail at every level of abstraction, from system
architecture down to layout.
As CMOS features become smaller, additional power consumption mechanisms come into play. Traditional CMOS consumes power when signals change but consumes only negligible power when idle. In modern CMOS, leakage mechanisms start to drain current even when the circuit is idle.

5.13.2 DESIGN AND TESTABILITY


DESIGN VERIFICATION
Our ability to build large chips of unlimited variety introduces the problem of
checking whether those chips have been manufactured correctly. Designers accept the
need to verify or validate their designs to make sure that the circuits perform the
specified function. (Some people use the terms verification and validation
interchangeably; a finer distinction reserves verification for formal proofs of correctness,
leaving validation to mean any technique which increases confidence in correctness, such
as simulation.)

Chip designs are simulated to ensure that the chip's circuits compute the proper functions for a sequence of inputs chosen to exercise the chip. But each chip that comes off the manufacturing line must also undergo manufacturing test.
Manufacturing test:
The chip must be exercised to demonstrate that no manufacturing defects rendered
the chip useless. Because IC manufacturing tends to introduce certain types of defects and because we want to minimize the time required to test each chip, we can't just use the input sequences created for design verification to perform manufacturing test. Each chip
must be designed to be fully and easily testable. Finding out that a chip is bad only after
you have plugged it into a system is annoying at best and dangerous at worst. Customers
are unlikely to keep using manufacturers who regularly supply bad chips.
Defects introduced during manufacturing range from the catastrophic (contamination that destroys every transistor on the wafer) to the subtle (a single broken wire or a crystalline defect that kills only one transistor). While some bad chips can be found very easily, each chip must be thoroughly tested to find even subtle flaws that produce erroneous results only occasionally. Tests designed to exercise functionality and expose design bugs don't always uncover manufacturing defects. We use fault models to identify potential manufacturing problems and determine how they affect the chip's operation.

The most common fault model is stuck-at-0/1: the defect causes a logic gate's output to be always 0 (or 1), independent of the gate's input values. We can often determine whether a logic gate's output is stuck even if we can't directly observe its outputs or control its inputs. We can generate a good set of manufacturing tests for the chip by assuming each logic gate's output is stuck at 0 (then 1) and finding an input to the chip which causes different outputs when the fault is present or absent.
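The test-generation idea just described can be illustrated on a toy circuit. The sketch below models y = (a AND b) OR c, injects a stuck-at fault on each node in turn, and searches exhaustively for input vectors on which the faulty and fault-free circuits disagree; real test generators use far more efficient algorithms, so this is only a conceptual example.

from itertools import product

def circuit(a, b, c, stuck=None):
    g1 = a & b                             # internal AND gate
    if stuck == ('g1', 0): g1 = 0          # inject stuck-at-0 on g1
    if stuck == ('g1', 1): g1 = 1          # inject stuck-at-1 on g1
    y = g1 | c                             # output OR gate
    if stuck == ('y', 0): y = 0
    if stuck == ('y', 1): y = 1
    return y

for node in ('g1', 'y'):
    for value in (0, 1):
        tests = [v for v in product((0, 1), repeat=3)
                 if circuit(*v) != circuit(*v, stuck=(node, value))]
        print(node, 'stuck-at', value, 'detected by', tests)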

5.13.3 TESTABILITY AS A DESIGN PROCESS:


Unfortunately, not all chip designs are equally testable. Some faults may require long input sequences to expose; other faults may not be testable at all, even though they cause chip malfunctions that aren't covered by the fault model. Traditionally, chip designers have ignored testability problems, leaving them to a separate test engineer who must find a set of inputs to adequately test the chip. If the test engineer can't change the chip design to fix testability problems, his or her job becomes both difficult and unpleasant. The result is often poorly tested chips whose manufacturing problems are found only after the customer has plugged them into a system.
Companies now recognize that the only way to deliver high-quality chips to
customers is to make the chip designer responsible for testing, just as the designer is
responsible for making the chip run at the required speed. Testability problems can often
be fixed easily early in the design process at relatively little cost in area and performance.
But modern designers must understand testability requirements, analysis techniques which identify hard-to-test sections of the design, and design techniques which improve testability.
5.13.4 RELIABILITY
RELIABILITY IS A LIFETIME PROBLEM
Earlier generations of VLSI technology were robust enough that testing chips at manufacturing time was sufficient to identify working parts: a chip either worked or it didn't. In today's nanometer-scale technologies, the problem of determining whether a chip works is more complex. A number of mechanisms can cause transient failures that cause occasional problems but are not repeatable. Some other failure mechanisms, like overheating, cause permanent failures, but only after the chip has operated for some time. And more complex manufacturing problems cause faults that are harder to diagnose and may affect performance rather than functionality.
5.13.5 DESIGN-FOR MANUFACTURABILITY
A number of techniques, referred to as design-for-manufacturability or design-for-
yield, are in use today to improve the reliability of chips that come off the manufacturing
line. We can make chips more reliable by designing circuits and architectures that reduce
design stresses and check for problems. For example, heat is one major cause of chip failure. Proper power management circuitry can reduce the chip's heat dissipation and reduce the damage caused by overheating. We also need to change the way we design
chips.
Some of the convenient levels of abstraction that served us well in earlier
technologies are no longer entirely appropriate in nanometer technologies. We need to
check more thoroughly and be willing to solve reliability problems by modifying design
decisions made earlier.
5.13.6 INTEGRATED CIRCUIT DESIGN TECHNIQUES
To make use of the flood of transistors given to us by Moore's Law, we must design large, complex chips quickly. The obstacle to making large chips work correctly is complexity: many interesting ideas for chips have died in the swamp of details that must be made correct before the chip actually works. Integrated circuit design is hard because
designers must juggle several different problems:
Multiple levels of abstraction:
IC design requires refining an idea through many levels of detail. Starting from a
specification of what the chip must do, the designer must create an architecture which
performs the required function, expand the architecture into a logic design, and further
expand the logic design into a layout like the one in Figure 1-2. As you will learn by the
end of this book, the specification-to-layout design process is a lot of work.
Multiple and conflicting costs:
In addition to drawing a design through many levels of detail, the designer must also take into account costs: not dollar costs, but criteria by which the quality of the design is judged. One critical cost is the speed at which the chip runs. Two architectures that execute the same function (multiplication, for example) may run at very different speeds. We will see that chip area is another critical design cost: the cost of manufacturing a chip is exponentially related to its area, and chips much larger than 1 cm2 cannot be manufactured at all. Furthermore, if multiple cost criteria, such as area and speed requirements, must be satisfied, many design decisions will improve one cost metric at the expense of the other. Design is dominated by the process of balancing
conflicting constraints.
Short design time:
In an ideal world, a designer would have time to contemplate the effect of a design
decision. We do not, however, live in an ideal world. Chips which appear too late may
make little or no money because competitors have snatched market share. Therefore,
designers are under pressure to design chips as quickly as possible. Design time is
especially tight in application-specific IC design, where only a few weeks may be
available to turn a concept into a working ASIC.

5.14 FIELD-PROGRAMMABLE GATE ARRAYS(FPGA):


A field-programmable gate array (FPGA) is a block of programmable logic that
can implement multi-level logic functions. FPGAs are most commonly used as separate
commodity chips that can be programmed to implement large functions.
However, small blocks of FPGA logic can be useful components on-chip to allow the user of the chip to customize part of the chip's logical function. An FPGA block must
implement both combinational logic functions and interconnect to be able to construct
multi-level logic functions. There are several different technologies for programming
FPGAs, but most logic processes are unlikely to implement anti-fuses or similar hard
programming technologies, so we will concentrate on SRAM-programmed FPGAs.
5.14.1 LOOKUP TABLES:
The basic method used to build a combinational logic block (CLB), also called a logic element, in an SRAM-based FPGA is the lookup table (LUT). As shown in Figure 5.1, the lookup table is an SRAM that is used to implement a truth table.
Each address in the SRAM represents a combination of inputs to the logic element. The value stored at that address represents the value of the function for that input combination. An n-input function requires an SRAM with 2^n locations.

Figure 5.1 Lookup Tables


Because a basic SRAM is not clocked, the lookup table logic element operates much as any other logic gate: as its inputs change, its output changes after some delay.
5.14.2 PROGRAMMING A LOOKUP TABLE:
Unlike a typical logic gate, the function represented by the logic element can be changed by changing the values of the bits stored in the SRAM. As a result, the n-input logic element can represent 2^(2^n) different functions (though some of these functions are permutations of each other).

Figure 5.2 Programming A Lookup Table
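The lookup-table idea can be illustrated with a small Python model: the truth table is stored as 2^n bits, the inputs form the address, and reprogramming the stored bits changes the function without changing the structure or the delay path. The class and example functions below are purely illustrative.

class LUT:
    def __init__(self, n, truth_bits):
        assert len(truth_bits) == 1 << n          # one stored bit per input combination
        self.truth_bits = truth_bits

    def __call__(self, *inputs):
        addr = sum(bit << k for k, bit in enumerate(inputs))
        return self.truth_bits[addr]

# Program a 4-input LUT as XOR, then as NAND: same structure, different function.
xor4  = LUT(4, [bin(a).count('1') & 1 for a in range(16)])
nand4 = LUT(4, [0 if a == 0b1111 else 1 for a in range(16)])
assert xor4(1, 0, 1, 1) == 1 and nand4(1, 1, 1, 1) == 0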


A typical logic element has four inputs. The delay through the lookup table is
independent of the bits stored in the SRAM, so the delay through the logic element is the
same for all functions. This means that, for example, a lookup table-based logic element
will exhibit the same delay for a 4-input XOR and a 4-input NAND.
In contrast, a 4-input XOR built with static CMOS logic is considerably slower
than a 4-input NAND. Of course, the static logic gate is generally faster than the logic
element. Logic elements generally contain registersflip-flops and latchesas well as
combinational logic.
A flip-flop or latch is small compared to the combinational logic element (in
sharp contrast to the situation in custom VLSI), so it makes sense to add it to the
combinational logic element. Using a separate cell for the memory element would simply
take up routing resources. The memory element is connected to the output; whether it
stores a given value is controlled by its clock and enable inputs.
5.14.3 COMPLEX LOGIC ELEMENT:
Many FPGAs also incorporate specialized adder logic in the logic element. The critical component of an adder is the carry chain, which can be implemented much more efficiently in specialized logic than it can be using standard lookup table techniques. The wiring channels that connect to the logic element's inputs and outputs also need to be programmable. A wiring channel has a number of programmable connections such that each input or output generally can be connected to any one of several different wires in the channel.

5.14.4 PROGRAMMABLE INTERCONNECTION POINTS:


A simple version of an interconnection point is often known as a connection box.
Figure 5.3 Programming A Lookup Table
A programmable connection between two wires is made by a CMOS transistor (a pass transistor). The pass transistor's gate is controlled by a static memory program bit (shown here as a D register). When the pass transistor's gate is high, the transistor conducts and connects the two wires; when the gate is low, the transistor is off and the two wires are not connected.

CHAPTER-VI
TOOLS
6.1 Introduction:
The main tools required for this project can be classified into two broad categories.
Hardware requirement
Software requirement

6.2 Hardware Requirements:


FPGA KIT
In the hardware part, a normal computer on which the Xilinx ISE 10.1i software can be operated is required, i.e., with a minimum system configuration of a Pentium III processor, 1 GB of RAM, and a 20 GB hard disk.
6.3 Software Requirements:
MODELSIM 6.4b
XILINX 10.1
The project requires Xilinx ISE version 10.1, in which Verilog source code can be used for design implementation.
6.3.1 Introduction To Modelsim:
In Modelsim, all designs are compiled into a library. You typically start a new
simulation in Modelsim by creating a working library called "work". "Work" is the
library name used by the compiler as the default destination for compiled design units.
Compiling Your Design: After creating the working library, you compile your design
units into it. The ModelSim library format is compatible across all supported
platforms. You can simulate your design on any platform without having to recompile
your design.
Loading the Simulator with Your Design and Running the Simulation: With the design compiled, you load the simulator with your design by invoking the simulator on a top-level module (Verilog) or a configuration or entity/architecture pair (VHDL). Assuming the design loads successfully, the simulation time is set to zero, and you enter a run command to begin simulation.
Debugging Your Results: If you don't get the results you expect, you can use ModelSim's robust debugging environment to track down the cause of the problem.

Standards Supported:

ModelSim VHDL supports both the IEEE 1076-1987 and 1076-1993 VHDL, the
1164-1993 Standard Multivalue Logic System for VHDL Interoperability, and the 1076.2-
1996 Standard VHDL Mathematical Packages standards. Any design developed with
ModelSim will be compatible with any other VHDL system that is compliant with either
IEEE Standard 1076-1987 or 1076-1993. ModelSim Verilog is based on IEEE Std 1364-
1995 and a partial implementation of 1364-2001, Standard Hardware Description
Language Based on the Verilog Hardware Description Language. The Open Verilog
International Verilog LRM version 2.0 is also applicable to a large extent. Both PLI
(Programming Language Interface) and VCD (Value Change Dump) are supported for
ModelSim PE and SE users.
6.4 MODELSIM:
Basic Steps For Simulation

This section provides further detail related to each step in the process of
simulating your design using ModelSim.

Step 1 - Collecting Files and Mapping Libraries

Files needed to run ModelSim on your design:

design files (VHDL, Verilog, and/or SystemC), including stimulus for the design
libraries, both working and resource
modelsim.ini (automatically created by the library mapping command)

Providing stimulus to the design

You can provide stimulus to your design in several ways:

Language based testbench


Tcl-based ModelSim interactive command, force
VCD files / commands
See "Using extended VCD as stimulus" (UM-458) and "Using extended VCD as
stimulus"
3rd party test bench generation tools

What is a library in ModelSim?

A library is a location where data to be used for simulation is stored. Libraries are ModelSim's way of managing the creation of data before it is needed for use in simulation. They also serve as a way to streamline simulation invocation. Instead of compiling all design data each and every time you simulate, ModelSim uses binary pre-compiled data from these libraries. So, if you make a change to a single Verilog module, only that module is recompiled, rather than all modules in the design.

Working and resource libraries


Design libraries can be used in two ways: 1) as a local working library that
contains the compiled version of your design; 2) as a resource library. The contents of
your working library will change as you update your design and recompile. A resource
library is typically unchanging, and serves as a parts source for your design. Examples of
resource libraries might be: shared information within your group, vendor libraries,
packages, or previously compiled elements of your own working design.
You can create your own resource libraries, or they may be supplied by another
design team or a third party (e.g., a silicon vendor). For more information on resource
libraries and working libraries, see "Working library versus resource libraries",
"Managing library contents", "Working with design libraries, and "Specifying the
resource librarie".
Creating The Logical Library vlib

Before you can compile your source files, you must create a library in which to
store the compilation results. You can create the logical library using the GUI, using File
> New > Library (see "Creating a library"), or you can use the vlib command. For
example, the command:
vlib work
creates a library named work. By default, compilation results are stored in the work library.
Mapping The Logical Work To The Physical Work Directory vmap
VHDL uses logical library names that can be mapped to ModelSim
library directories. If libraries are not mapped properly, and you invoke your
simulation, necessary components will not be loaded and simulation will fail.
Similarly, compilation can also depend on proper library mapping.
By default, ModelSim can find libraries in your current directory
(assuming they have the right name), but for it to find libraries located
elsewhere, you need to map a logical library name to the pathname of the
library. You can use the GUI ("Library mappings with the GUI", a command
("Library mappings with the GUI" ), or a project ("Getting started with
projects" to assign a logical name to a design library.
The format for command line entry is:
vmap <logical_name> <directory_pathname>
This command sets the mapping between a logical library name and a directory.
Step 2 - Compiling the design with vlog/vcom/sccom
Designs are compiled with one of the three language compilers.
Compiling Verilog - vlog
ModelSim's compiler for the Verilog modules in your design is vlog. Verilog files may
be compiled in any order, as they are not order dependent. See "Compiling Verilog files"
for details. Verilog portions of the design can be optimized for better simulation
performance; see "Optimizing Verilog designs".

Compiling VHDL - vcom

ModelSim's compiler for VHDL design units is vcom. VHDL files must be compiled in
an order that satisfies the dependencies of the design. Projects may assist you in
determining the compile order; for more information, see "Auto-generating compile order"
(UM-46). See "Compiling VHDL files" (UM-73) for details on VHDL compilation.
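For example (file names are illustrative), a package must be compiled before the entities that use
it, and an entity before the test bench that instantiates it:

vcom fa_pkg.vhd fa.vhd tb_fa.vhd

vcom compiles the files in the order given, so listing them in dependency order satisfies this
requirement.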
Compiling SystemC - sccom
ModelSim's compiler for SystemC design units is sccom, and is used only if you have
SystemC components in your design. See "Compiling SystemC files" for details.
Step 3 - Loading the design for simulation
vsim <top>
Your design is ready for simulation after it has been compiled and (optionally)
optimized with vopt. For more information on optimization, see "Optimizing Verilog
designs". You may then invoke vsim with the names of the top-level modules (many
designs contain only one top-level module) or the name you assigned to the optimized
version of the design.
For example, if your top-level modules are "testbench" and "globals", then invoke the
simulator as follows:

vsim testbench globals


After the simulator loads the top-level modules, it iteratively loads the instantiated
modules and UDPs in the design hierarchy, linking the design together by connecting the
ports and resolving hierarchical references.
Using SDF
You can incorporate actual delay values to the simulation by applying SDF back
annotation files to the design. For more information on how SDF is used in the design,
see "Specifying SDF files for simulation" .
Step 4 - Simulating the design
Once the design has been successfully loaded, the simulation time is set to
zero, and you must enter a run command to begin simulation. For more information,
see Verilog simulation , VHDL simulation , and SystemC simulation .
The basic simulator commands are (a short example follows this list):
add wave
force
bp
run
step
next
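A short example session combining these commands might look like the following (the hierarchy
path, file name, and line number are illustrative only):

add wave /testbench/*
bp tb_fa.v 25
run 100
step

add wave adds the listed signals to the Wave window, bp sets a breakpoint at a source line, run
advances simulation time, and step single-steps through the HDL source.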
Step 5 - Debugging the design
Numerous tools and windows useful in debugging your design are available from the
ModelSim GUI. For more information, see Waveform analysis (UM-237), PSL Assertions,
and Tracing signals with the Dataflow window. In addition, several basic simulation
commands are available from the command line to assist you in debugging your design
(a short example follows this list):
describe
drivers
examine
force
log
checkpoint
restore
show
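For example (the signal path is hypothetical), the current value, the drivers, and the logged history
of a signal can be inspected from the command line:

examine /testbench/dut/sum
drivers /testbench/dut/sum
log -r /*

examine reports the current value of the named object, drivers lists its current drivers, and
log -r /* records all signals in the design so they can be inspected or added to the Wave window
later.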

A programmable connection between two wires is made by a CMOS transistor (a
pass transistor). The pass transistor's gate is controlled by a static memory program bit
(shown here as a D register). When the pass transistor's gate is high, the transistor
conducts and connects the two wires; when the gate is low, the transistor is off and the
two wires are not connected.
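The behavior just described can be sketched in Verilog with a built-in bidirectional switch
primitive; the module and port names below are illustrative and not part of any vendor library:

// Behavioral sketch of a programmable connection point: a configuration
// bit (from a static memory cell) enables a pass switch between two wires.
module prog_connect (
    input config_bit,   // static memory program bit
    inout wire_a,       // routing wire A
    inout wire_b        // routing wire B
);
    // tranif1 is a Verilog switch primitive: it connects wire_a and wire_b
    // bidirectionally when config_bit is 1, and isolates them when it is 0.
    tranif1 pass_transistor (wire_a, wire_b, config_bit);
endmodule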
A flip-flop or latch is small compared to the combinational logic element (in sharp
contrast to the situation in custom VLSI), so it makes sense to add it to the combinational
logic element. Using a separate cell for the memory element would simply take up
routing resources. The memory element is connected to the output; whether it stores a
given value is controlled by its clock and enable inputs.
The wiring channels that connect to the logic element's inputs and outputs also
need to be programmable. A wiring channel has a number of programmable connections
such that each input or output generally can be connected to any one of several different
wires in the channel.
6.5 MODELSIM BASICS
On the left side of the interface, under the Project tab, is the frame listing the
files that pertain to the opened project. The Library frame lists the entities of the project
(that have been compiled). To the right is the ModelSim shell frame. It is an extension of
MS-DOS, so both ModelSim and MS-DOS commands can be executed.
A. Creating a New Project
Once ModelSim has been started, create a new project:
File > New > Project
Name the project FA, set the project location
to F:/VHDL, and click OK.
A new window should appear to add new files to
the project. Choose Create New File.

Enter F:\VHDL\FA.vhd as the file name and click OK.


Then close the add new files window.
Additional files can be added later by choosing from the menu: Project > Add File to
Project
B. Editing Source Files
Double click the FA.vhd source found under the Workspace window's Project tab.
This will open up an empty text editor configured to highlight VHDL syntax. Copy the
source found at the end of this document. After writing the code for the entity and its
architecture, save and close the source file.
C. Compiling projects
Select the file from the project files list frame and right-click on it. Select Compile
to compile just this file, or Compile All to compile all the files in the current project. If there
are errors within the code or the project, a red failure message will be displayed. Double-click
these red messages for more detailed error information. Otherwise, if all is well, no red warning
or error messages will be displayed.
II. Simulating with ModelSim

To simulate, first the entity design has to be loaded into the simulator. Do this by
selecting from the menu:
Simulate > Simulate
A new window will appear listing all the entities (not filenames) that are in the
work library. Select the FA entity for simulation and click OK.

Often it will be necessary to create entities with multiple architectures. In
this case the architecture has to be specified for the simulation. Expand the tree for the
entity, select the architecture to be simulated, and then click OK.
Creating test files for the simulator
After the design is loaded, clear any previous simulation data and restart the
simulation time (the restart command can be typed at the prompt), then open the signals
list by selecting from the menu:
View > Signals
A new window will be displayed listing the design entity's signals and their initial
values (shown below). Items in the waveform and listing are ordered in the same order in
which they are declared in the code. To display the waveform, select the signals for the
waveform to display (hold Ctrl and click to select multiple signals) and from the signal
list window menu select:
Add > Wave > Selected signals
Basic Simulation Flow:
The following diagram shows the basic steps for simulating a design in ModelSim.

Create a working library -> Compile design files -> Load and Run -> Debug results
Figure 6.1 Basic simulation flow


Project design flow:
As you can see, the flow is similar to the basic simulation flow. However, there are two
important differences:
You do not have to create a working library in the project flow; it is done for you
automatically.
Projects are persistent. In other words, they will open every time you invoke ModelSim
unless you specifically close them.
The following diagram shows the basic steps for simulating a design within a ModelSim
project.
Figure 6.2 Project design flow

6.6 Introduction To XILINX ISE:


This tool can be used to create, implement, simulate, and synthesize Verilog designs for
implementation on FPGA chips.
ISE: Integrated Software Environment
Environment for the development and test of digital systems design targeted to FPGA or
CPLD
Integrated collection of tools accessible through a GUI
Based on a logical synthesis engine (XST: Xilinx Synthesis Technology)
XST supports different languages:
Verilog
VHDL
XST produces a netlist integrated with constraints
Supports all the steps required to complete the design:
Translate, map, place and route
Bit stream generation

It is possible to use Verilog to write a test bench to verify the functionality of the
design, using files on the host computer to define stimuli, to interact with the user, and to
compare results with those expected.
A Verilog model is translated into the "gates and wires" that are mapped onto a
programmable logic device such as a CPLD or FPGA, and then it is the actual hardware being
configured, rather than the Verilog code being "executed" as if on some form of a processor chip.
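As a minimal sketch of such a test bench (the design under test, signal names, and expected-value
check below are purely illustrative), a Verilog test bench typically instantiates the design, applies
stimulus, and compares the outputs with expected results:

`timescale 1ns/1ps
module tb_full_adder;
    reg  a, b, cin;          // stimulus registers
    wire sum, cout;          // outputs of the design under test

    // Instantiate a hypothetical design under test
    full_adder dut (.a(a), .b(b), .cin(cin), .sum(sum), .cout(cout));

    initial begin
        // Apply all eight input combinations, one every 10 ns
        {a, b, cin} = 3'b000;
        repeat (7) #10 {a, b, cin} = {a, b, cin} + 1'b1;
        #10 $finish;
    end

    // Compare the result with the expected value and report mismatches
    always @(a, b, cin)
        #1 if ({cout, sum} !== a + b + cin)
            $display("Mismatch for %b%b%b at time %0t", a, b, cin, $time);
endmodule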

6.6.1 Implementation:

Synthesis (XST)
Produce a netlist file starting from an HDL description
Translate (NGDBuild)
Converts all input design netlists and then writes the results into a single merged
file that describes logic and constraints.
Mapping (MAP)
Maps the logic on device components.
Takes a netlist and groups the logical elements into CLBs and IOBs (components of FPGA).
Place And Route (PAR)
Place FPGA cells and connects cells.
Bit stream generation
XILINX Design Process
Step 1: Design entry
HDL (Verilog or VHDL; ABEL for CPLDs), Schematic Drawings, Bubble
Diagram
Step 2: Synthesis
Translates .v, .vhd, .sch files into a netlist file (.ngc)
Step 3: Implementation
FPGA: Translate/Map/Place & Route, CPLD: Fitter
Step 4: Configuration/Programming
Download a BIT file into the FPGA
Program JEDEC file into CPLD
Program MCS file into Flash PROM
Simulation can occur after steps 1, 2, 3
6.7 Introduction to FPGA:
FPGA stands for Field Programmable Gate Array; it has an array of logic modules,
I/O modules and routing tracks (programmable interconnect). An FPGA can be configured by the
end user to implement specific circuitry. Speeds were once limited to about 100 MHz, but at present
they reach the GHz range.
Main applications are DSP, FPGA-based computers, logic emulation, ASICs and ASSPs.
FPGAs are mainly programmed using SRAM (Static Random Access Memory). It is volatile, and the
main advantage of using SRAM programming technology is re-configurability. Issues in FPGA
technology are the complexity of the logic element, clock support, I/O support and interconnections
(routing).
FPGA Design Flow
An FPGA contains a two-dimensional array of logic blocks and interconnections
between logic blocks. Both the logic blocks and interconnects are programmable. Logic blocks
are programmed to implement a desired function and the interconnects are programmed using
the switch boxes to connect the logic blocks.
To be clearer, if we want to implement a complex design (a CPU for instance), the
design is divided into small sub-functions and each sub-function is implemented using one logic
block. Then, to get the desired design (the CPU), all the sub-functions implemented in logic blocks
must be connected, and this is done by programming the interconnects.
FPGAs, an alternative to custom ICs, can be used to implement an entire System
on Chip (SoC). The main advantage of an FPGA is the ability to reprogram it. The user can
reprogram an FPGA to implement a design, and this is done after the FPGA has been
manufactured. This brings the name "Field Programmable".

Internal structure of an FPGA is depicted in the following figure.


Figure 6.3 Internal structure of FPGA
Custom ICs are expensive and take a long time to design, so they are useful when
produced in bulk amounts. But FPGAs are easy to implement within a short time with the
help of Computer Aided Design (CAD) tools (because there is no physical layout
process, no mask making, and no IC manufacturing).
Some disadvantages of FPGAs are that they are slow compared to custom ICs, they
cannot handle very complex designs, and they draw more power. A Xilinx logic block
consists of one Look-Up Table (LUT) and one flip-flop.
A LUT is used to implement any of a number of different functions. The input lines to the logic
block go into the LUT and enable it. The output of the LUT gives the result of the logic
function that it implements, and the output of the logic block is the registered or unregistered
output from the LUT. SRAM is used to implement a LUT. A k-input logic function is
implemented using a 2^k x 1 SRAM. The number of different possible functions for a k-input
LUT is 2^(2^k).
The advantage of such an architecture is that it supports the implementation of very many
logic functions; the disadvantage is the unusually large number of memory cells
required to implement such a logic block when the number of inputs is large.
Figure 6.4 4-input LUT based implementation of logic block
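A behavioral sketch of such a look-up table (the module name and ports are illustrative) simply
indexes the stored truth table with the input lines; with k = 4 the configuration memory has
2^4 = 16 one-bit entries:

// Illustrative 4-input LUT: config_bits models the 16 x 1 SRAM holding the
// truth table, and the four inputs select one of its entries.
module lut4 (
    input  [3:0]  in,           // inputs from the logic block
    input  [15:0] config_bits,  // SRAM contents (2^k x 1)
    output        out           // value of the selected truth-table entry
);
    assign out = config_bits[in];
endmodule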

In technology mapping, the optimized Boolean expressions are transformed into FPGA
logic blocks (slices); area and delay optimization take place here. During placement, algorithms
are used to place each block in the FPGA array. Routing assigns the programmable FPGA wire
segments to establish connections among the FPGA blocks.
The configuration of the final chip is made in the programming unit. LUT-based design provides
better logic block utilization. A k-input LUT-based logic block can be implemented in a
number of different ways with a trade-off between performance and logic density. An n-LUT
can be seen as a direct implementation of a function truth table; each latch holds
the value of the function corresponding to one input combination.
For example, a 2-LUT can be used to implement 16 types of functions like AND, OR, A + NOT B, etc.:

A  B | AND  OR  ...
0  0 |  0    0
0  1 |  0    1
1  0 |  0    1
1  1 |  1    1
6.8 Interconnects:
A wire segment can be described as two end points of an interconnect with no
programmable switch between them. A sequence of one or more wire segments in an FPGA can
be termed as a track.

Typically an FPGA has logic blocks, interconnects and switch blocks (Input/output blocks).
Switch blocks lie in the periphery of logic blocks and interconnect. Wire segments are connected
to logic blocks through switch blocks. Depending on the required design, one logic block is
connected to another and so on.

6.9 FPGA DESIGN FLOW:

In this part of the tutorial we give a short introduction to the FPGA design flow. A simplified
version of the design flow is given in the following diagram.

Figure 6.5 FPGA Design Flow

6.10 Design Entry:


There are different techniques for design entry: schematic based, Hardware Description
Language (HDL), a combination of both, etc. Selection of a method depends on the design and the
designer. If the designer wants to deal more with hardware, then schematic entry is the better
choice. When the design is complex, or the designer thinks of the design in an algorithmic way,
HDL is the better choice. Language-based entry is faster but lags in performance and density.
HDLs represent a level of abstraction that can isolate designers from the details of the
hardware implementation. Schematic-based entry gives designers much more visibility into the
hardware; it is the better choice for those who are hardware oriented. Another, rarely used,
method is state machines. It is the better choice for designers who think of the design as a series
of states, but the tools for state machine entry are limited. In this documentation we deal with
HDL-based design entry.
6.11 Synthesis:
Synthesis is the process which translates VHDL or Verilog code into a device netlist format, i.e. a
complete circuit with logical elements (gates, flip-flops, etc.) for the design. If the design
contains more than one sub-design (for example, to implement a processor we need a CPU as one
design element and a RAM as another, and so on), then the synthesis process generates a netlist for
each design element. The synthesis process also checks code syntax and analyzes the hierarchy of
the design, which ensures that the design is optimized for the design architecture the designer has
selected. The resulting netlist(s) is saved to an NGC (Native Generic Circuit) file (for Xilinx
Synthesis Technology (XST)).

Figure 6.6 FPGA Synthesis


6.12 Implementation:
In this work, the design of the proposed approximate DCT is described in Verilog HDL and is
synthesized on an FPGA of the Spartan 3E family using the Xilinx ISE tool. This process includes
the following:
Translate
Map
Place and Route
6.12.1 Translate:
This process combines all the input netlists and constraints into a logic design file. This
information is saved as an NGD (Native Generic Database) file. This can be done using the
NGDBuild program. Here, defining constraints means assigning the ports in the design to the
physical elements (e.g. pins, switches, buttons) of the targeted device and specifying the timing
requirements of the design. This information is stored in a file named UCF (User Constraints
File). Tools used to create or modify the UCF are PACE, Constraint Editor, etc.

Figure 6.7 FPGA Translate


6.12.2 Map:
This process divides the whole circuit of logical elements into sub-blocks so that they can
fit into the FPGA logic blocks. That means the map process fits the logic defined by the NGD file
into the targeted FPGA elements (Configurable Logic Blocks (CLBs), Input/Output Blocks
(IOBs)) and generates an NCD (Native Circuit Description) file which physically represents the
design mapped to the components of the FPGA.
The MAP program is used for this purpose.

Figure 6.8 FPGA map


6.12.3 Place and Route:
The PAR program is used for this process. The place and route process places the sub-blocks from
the map process into logic blocks according to the constraints and connects the logic blocks. For
example, if a sub-block is placed in a logic block which is very near to an I/O pin, it may save time
but it may affect some other constraint.
So a trade-off between all the constraints is taken into account by the place and route process.
The PAR tool takes the mapped NCD file as input and produces a completely routed NCD file as
output. The output NCD file contains the routing information.

Figure 6.9 FPGA Place and route


6.13 Device Programming:
Now the design must be loaded onto the FPGA. But the design must first be converted to a
format that the FPGA can accept. The BITGEN program performs this conversion. The routed
NCD file is given to the BITGEN program to generate a bit stream (a .BIT file) which can
be used to configure the target FPGA device. This is done using a cable; the selection of the cable
depends on the design.
6.13.1 Design Verification:
Verification can be done at different stages of the process steps.
6.13.2 Behavioral Simulation (RTL Simulation):
This is the first of the simulation steps encountered in the hierarchy of the
design flow. This simulation is performed before the synthesis process to verify the RTL
(behavioral) code and to confirm that the design is functioning as intended.
Behavioral simulation can be performed on either VHDL or Verilog designs. In this process,
signals and variables are observed, procedures and functions are traced, and breakpoints are set.
This is a very fast simulation and so allows the designer to change the HDL code within a short
time period if the required functionality is not met. Since the design is not yet
synthesized to the gate level, timing and resource usage properties are still unknown.
6.13.3 Functional simulation (Post Translate Simulation):
Functional simulation gives information about the logic operation of the circuit. The designer
can verify the functionality of the design using this process after the Translate process. If the
functionality is not as expected, then the designer has to make changes in the code and again
follow the design flow steps.

Static Timing Analysis:

This can be done after the MAP or PAR processes. The post-MAP timing report lists
signal path delays of the design derived from the design logic. The post-place-and-route timing
report incorporates timing delay information to provide a comprehensive timing summary of the
design.
CHAPTER VII
RESULTS
Simulation Results:

Synthesis Results:
RTL schematic:

Technology Schematic:
Design Summary:

Timing Report:
CHAPTER VIII
CONCLUSION

In this paper, we have proposed a recursive algorithm to obtain an orthogonal approximation
of the DCT, where an approximate DCT of length N could be derived from a pair of DCTs of
length N/2 at the cost of N additions for input preprocessing. The proposed approximate DCT has
several advantages, such as regularity, structural simplicity, lower computational complexity,
and scalability. Comparison with recently proposed competing methods shows the effectiveness
of the proposed approximation in terms of error energy, hardware
resource consumption, and compressed image quality. We have also proposed a fully scalable
reconfigurable architecture for approximate DCT computation, where the computation of a
32-point DCT could be configured for parallel computation of two 16-point DCTs or four 8-point
DCTs.
REFERENCES

[1] A. M. Shams, A. Chidanandan, W. Pan, and M. A. Bayoumi, "NEDA: A low-power high-performance DCT architecture," IEEE Trans. Signal Process., vol. 54, no. 3, pp. 955-964, 2006.
[2] C. Loeffler, A. Lightenberg, and G. S. Moschytz, "Practical fast 1-D DCT algorithm with 11 multiplications," in Proc. Int. Conf. Acoust., Speech, Signal Process. (ICASSP), May 1989, pp. 988-991.
[3] M. Jridi, P. K. Meher, and A. Alfalou, "Zero-quantised discrete cosine transform coefficients prediction technique for intra-frame video encoding," IET Image Process., vol. 7, no. 2, pp. 165-173, Mar. 2013.
[4] S. Bouguezel, M. O. Ahmad, and M. N. S. Swamy, "Binary discrete cosine and Hartley transforms," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 60, no. 4, pp. 989-1002, Apr. 2013.
[5] F. M. Bayer and R. J. Cintra, "DCT-like transform for image compression requires 14 additions only," Electron. Lett., vol. 48, no. 15, pp. 919-921, Jul. 2012.
[6] R. J. Cintra and F. M. Bayer, "A DCT approximation for image compression," IEEE Signal Process. Lett., vol. 18, no. 10, pp. 579-582, Oct. 2011.
[7] S. Bouguezel, M. Ahmad, and M. N. S. Swamy, "Low-complexity 8 x 8 transform for image compression," Electron. Lett., vol. 44, no. 21, pp. 1249-1250, Oct. 2008.
[8] T. I. Haweel, "A new square wave transform based on the DCT," Signal Process., vol. 81, no. 11, pp. 2309-2319, Nov. 2001.
[9] V. Britanak, P. Y. Yip, and K. R. Rao, Discrete Cosine and Sine Transforms: General Properties, Fast Algorithms and Integer Approximations. London, U.K.: Academic, 2007.
[10] G. J. Sullivan, J.-R. Ohm, W.-J. Han, and T. Wiegand, "Overview of the high efficiency video coding (HEVC) standard," IEEE Trans. Circuits Syst. Video Technol., vol. 22, no. 12, pp. 1649-1668, Dec. 2012.
[11] F. Bossen, B. Bross, K. Suhring, and D. Flynn, "HEVC complexity and implementation analysis," IEEE Trans. Circuits Syst. Video Technol., vol. 22, no. 12, pp. 1685-1696, 2012.
[12] X. Li, A. Dick, C. Shen, A. van den Hengel, and H. Wang, "Incremental learning of 3D-DCT compact representations for robust visual tracking," IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 4, pp. 863-881, Apr. 2013.
[13] A. Alfalou, C. Brosseau, N. Abdallah, and M. Jridi, "Assessing the performance of a method of simultaneous compression and encryption of multiple images and its resistance against various attacks," Opt. Express, vol. 21, no. 7, pp. 8025-8043, 2013.
[14] R. J. Cintra, "An integer approximation method for discrete sinusoidal transforms," Circuits, Syst., Signal Process., vol. 30, no. 6, pp. 1481-1501, 2011.
[15] F. M. Bayer, R. J. Cintra, A. Edirisuriya, and A. Madanayake, "A digital hardware fast algorithm and FPGA-based prototype for a novel 16-point approximate DCT for image compression applications," Meas. Sci. Technol., vol. 23, no. 11, pp. 1-10, 2012.
[16] R. J. Cintra, F. M. Bayer, and C. J. Tablada, "Low-complexity 8-point DCT approximations based on integer functions," Signal Process., vol. 99, pp. 201-214, 2014.
