
Module-1 3 hours

• Architecture of the present-day SoC
• Design issues of SoC
• Hardware-Software Co-design
• Core Libraries
• EDA Tools
What is meant by SoC?
• Stands for "System On a Chip."
An SoC (pronounced "S-O-C") is an integrated
circuit that contains all the required circuitry
and components of an electronic system on a
single chip.
• It can be contrasted with a traditional
computer system, which is composed of many
distinct components.
What is System on Chip in VLSI?
• SoC, an acronym for system on chip, is an IC
that integrates all the components of a system
into a single chip.
• It may contain analog, digital, mixed-signal
and other radio-frequency functions, all lying
on a single chip substrate.
• Today, SoCs are very common in the electronics
industry due to their low power consumption.
What Does System on a Chip (SoC)
Mean?
• A system on a chip (SoC) combines the required
electronic circuits of various computer components
onto a single, integrated chip (IC).
• SoC is a complete electronic substrate system that may
contain analog, digital, mixed-signal or radio frequency
functions.
• Its components usually include a graphics processing
unit (GPU), a central processing unit (CPU) that may be
multi-core, and system memory (RAM).
• Because an SoC integrates both the hardware and the software,
it uses less power, has better performance, requires
less space and is more reliable than a multi-chip system.
• Most systems-on-chip today are found inside mobile
devices such as smartphones and tablets.
Which was the first system on chip?
• 1974: A digital watch was the first system-on-chip integrated circuit.
• The Microma liquid crystal display (LCD) digital watch was the first product to integrate a complete electronic system onto a single silicon chip, called a system-on-chip or SoC.
• https://www.computerhistory.org/siliconengine/digital-watch-is-first-system-on-chip-integrated-circuit/
[Figure: Electronic module from a Hamilton Pulsar digital watch]
What does a system on chip contain?
• A system on a chip consists of both the hardware, described in Structure, and the software controlling the microcontroller, microprocessor or digital signal processor cores, peripherals and interfaces.
• What is the difference between an SoC and an ASIC?
• What is the difference between an SoC and an FPGA?
An SoC usually contains various components such as:
• An operating system
• Utility software applications
• Voltage regulators and power management circuits
• Timing sources such as phase-locked loop control systems or oscillators
• A microprocessor, microcontroller or digital signal processor
• Peripherals such as real-time clocks, counter-timers and power-on-reset generators
• External interfaces such as USB, FireWire, Ethernet, universal asynchronous receiver-transmitter or serial peripheral interface bus
• Analog interfaces such as digital-to-analog converters and analog-to-digital converters
• RAM and ROM
A TAP controller is a 16-state machine, programmed by the Test Mode Select (TMS) and Test Clock (TCK) inputs, which
controls the flow of data bits to the Instruction Register (IR) and the Data Registers (DR). The TAP Controller can be thought
of as the control center of a boundary-scan device.
https://www.terraelectronica.ru/pdf/show?pdf_file=%252Fz%252FDatasheet%252FM%252FMPC6410.pdf
https://www.electronicsforu.com/technology-trends/must-read/7-nm-ic-technology-trends-challenges
Hardware-Software Co-design
Directions of the HW/SW Design Process
[Figure: An integrated modeling substrate spans both development paths. System concepts and requirements analysis feed parallel Sys/HW and Sys/SW requirements analysis; each side proceeds through preliminary design, detailed design, fabrication or coding, and HWCI or CSCI testing, followed by system integration and test and operational testing and evaluation.]
[Franke91] © IEEE 1991
• Hardware/software codesign is the process of
designing computing systems consisting of
both hardware and software components
What is an embedded system?
Rapid Prototyping Design Process
[Figure: System definition, function design, and HW & SW partitioning and codesign feed a primarily-software path (SW design, SW code) and a primarily-hardware path (HW design, HW fabrication), followed by HW & SW integration and test. The whole flow is supported by reuse design libraries and a database, and by a virtual prototype.]
Module Goals
• Introduce the fundamentals of HW/SW codesign
and partitioning concepts in designing embedded
systems
– Discuss the current trends in the codesign of
embedded systems
– Provide information on the goals of and methodology
for partitioning hardware/software in systems
• Show the benefits of the codesign approach over
the current design process
– Provide information on how to incorporate these
techniques into a general digital design methodology
for embedded systems
• Illustrate how codesign concepts are being
introduced into design methodologies
– Several example codesign systems are discussed
Module Outline

• Introduction
• Unified HW/SW Representations
• HW/SW Partitioning Techniques
• Integrated HW/SW Modeling
Methodologies
• HW and SW Synthesis Methodologies
• Industry Approaches to HW/SW Codesign
• Hardware/Software Codesign Research
• Summary
Codesign Definition
and Key Concepts
• Codesign
– The meeting of system-level objectives by
exploiting the trade-offs between hardware
and software in a system through their
concurrent design

• Key concepts
– Concurrent: hardware and software
developed at the same time on parallel paths
– Integrated: interaction between hardware
and software development to produce design
meeting performance criteria and functional
specs
Motivations for Codesign
• Factors driving codesign (hardware/software
systems):
– Instruction Set Processors (ISPs) available as cores
in many design kits (386s, DSPs,
microcontrollers, etc.)
– Systems on Silicon - many transistors available in
typical processes (> 10 million transistors available
in IBM ASIC process, etc.)
– Increasing capacity of field programmable devices
- some devices even able to be reprogrammed on-
the-fly (FPGAs, CPLDs, etc.)
– Efficient C compilers for embedded processors
– Hardware synthesis capabilities
Motivations for Codesign
(cont.)
• The importance of codesign in designing
hardware/software systems:
– Improves design quality, design cycle time, and cost
• Reduces integration and test time
– Supports growing complexity of embedded systems
– Takes advantage of advances in tools and
technologies
• Processor cores
• High-level hardware synthesis capabilities
• ASIC development
Categorizing
Hardware/Software Systems
• Application Domain
– Embedded systems
• Manufacturing control
• Consumer electronics
• Vehicles
• Telecommunications
• Defense Systems
– Instruction Set Architectures
– Reconfigurable Systems
• Degree of programmability
– Access to programming
– Levels of programming
• Implementation Features
– Discrete vs. integrated components
– Fabrication technologies
Categories of Codesign Problems
• Codesign of embedded systems
– Usually consist of sensors, controller, and actuators
– Are reactive systems
– Usually have real-time constraints
– Usually have dependability constraints
• Codesign of ISAs
– Application-specific instruction set processors (ASIPs)
– Compiler and hardware optimization and trade-offs
• Codesign of Reconfigurable Systems
– Systems that can be personalized after manufacture for a
specific application
– Reconfiguration can be accomplished before execution or
concurrently with execution (called evolvable systems)
Components of the Codesign Problem
• Specification of the system
• Hardware/Software Partitioning
– Architectural assumptions - type of processor, interface style between
hardware and software, etc.
– Partitioning objectives - maximize speedup, latency requirements,
minimize size, cost, etc.
– Partitioning strategies - high level partitioning by hand, automated
partitioning using various techniques, etc.
• Scheduling
– Operation scheduling in hardware
– Instruction scheduling in compilers
– Process scheduling in operating systems
• Modeling the hardware/software system during the design process
Embedded Systems
Application-specific systems which contain hardware
and software tailored for a particular task and are
generally part of a larger system (e.g., industrial
controllers)
• Characteristics
– Are dedicated to a particular application
– Include processors dedicated to specific functions
– Represent a subset of reactive (responsive to external
inputs) systems
– Contain real-time constraints
– Include requirements that span:
• Performance
• Reliability
• Form factor
Embedded Systems:
Specific Trends
• Use of microprocessors only one or two
generations behind state-of-the-art for
desktops
– E.g. N/2 bit width where N is the bit width of
current desktop systems
• Contain limited amount of memory
• Must satisfy strict real-time and/or
performance constraints
• Must optimize additional design objectives:
– Cost
– Reliability
– Design time
• Increased use of hardware/software codesign
principles to meet constraints
Embedded Systems:
Examples
• Banking and transaction processing
applications
• Automobile engine control units
• Signal processing applications
• Home appliances (microwave ovens)
• Industrial controllers in factories
• Cellular communications
Embedded Systems:
Complexity Issues
• Complexity of embedded systems is
continually increasing
• Number of states in these systems (especially
in the software) is very large
• Description of a system can be complex,
making system analysis extremely hard
• Complexity management techniques are
necessary to model and analyze these systems
• Systems becoming too complex to achieve
accurate “first pass” design using conventional
techniques
• New issues rapidly emerging from new
implementation technologies
Techniques to Support
Complexity Management
• Delayed HW/SW partitioning
– Postpone as many decisions as possible that place
constraints on the design
• Abstractions and decomposition techniques
• Incremental development
– “Growing” software
– Requiring top-down design
• Description languages
• Simulation
• Standards
• Design methodology management framework
A Model of the Current Hardware/Software Design Process
[Figure: DOD-STD-2167A waterfall. System concepts and requirements analysis split early into separate Sys/HW and Sys/SW requirements analysis; the hardware side proceeds through hardware requirements analysis, preliminary design, detailed design, fabrication, and HWCI testing, while the software side proceeds through software requirements analysis, preliminary design, detailed design, coding and unit test, integration test, and CSCI testing; the paths meet only at system integration and test and operational testing and evaluation.]
[Franke91] © IEEE 1991
Current Hardware/Software
Design Process
• Basic features of current process:
– System immediately partitioned into hardware and
software components
– Hardware and software developed separately
– “Hardware first” approach often adopted
• Implications of these features:
– HW/SW trade-offs restricted
• Impact of HW and SW on each other cannot be assessed
easily
– Late system integration
• Consequences of these features:
– Poor quality designs
– Costly modifications
– Schedule slippages
Incorrect Assumptions in
Current Hardware/Software
Design Process
• Hardware and software can be acquired
separately and independently, with
successful and easy integration of the two
later
• Hardware problems can be fixed with
simple software modifications
• Once operational, software rarely needs
modification or maintenance
• Valid and complete software requirements
are easy to state and implement in code
Directions of the HW/SW Design Process
[Figure: The same development paths as above, but with an integrated modeling substrate spanning hardware and software development from system concepts through system integration and test to operational testing and evaluation.]
[Franke91] © IEEE 1991
Requirements for the Ideal
Codesign Environment
• Unified, unbiased hardware/software
representation
– Supports uniform design and analysis techniques for
hardware and software
– Permits system evaluation in an integrated design
environment
– Allows easy migration of system tasks to either
hardware or software
• Iterative partitioning techniques
– Allow several different designs (HW/SW partitions) to
be evaluated
– Aid in determining best implementation for a system
– Partitioning applied to modules to best meet design
criteria (functionality and performance goals)
Requirements for the Ideal
Codesign Environment
(cont.)
• Integrated modeling substrate
– Supports evaluation at several stages of the design
process
– Supports step-wise development and integration of
hardware and software
• Validation Methodology
– Ensures that the implemented system meets the initial
system requirements
Cross-fertilization Between
Hardware and Software
Design
• Fast growth in both VLSI design and
software engineering has raised awareness
of similarities between the two
– Hardware synthesis
– Programmable logic
– Description languages

• Explicit attempts have been made to


“transfer technology” between the
domains
Cross-fertilization Between
Hardware and Software
Design (cont.)
VLSI DESIGN → SOFTWARE ENGINEERING
• EDA tool technology has been transferred to SW CAD


systems
– Designer support (not automation)

– Graphics-driven design

– Central database for design information

– Tools to check design behavior early in process


Cross-fertilization Between
Hardware and Software
Design (cont.)
SOFTWARE ENGINEERING → VLSI DESIGN
• Software technology has been transferred to


EDA tools
– Single-language design
• Use of 1 common language for architecture spec. and
implementation of a chip
– Compiler-like transformations and techniques
• Dead code elimination
• Loop unrolling
– Design change management
• Information hiding
• Design families
Typical Codesign Process
[Figure: A functional system description (captured as FSMs, directed graphs, concurrent processes, or programming languages) is translated into a unified data/control-flow representation for HW/SW partitioning. Each partition drives software synthesis, interface synthesis, and hardware synthesis; system integration and instruction-set-level HW/SW evaluation follow, and another HW/SW partition can be tried if the results are unsatisfactory.]
Conventional Codesign Methodology
[Figure: Analysis of constraints and requirements produces the system specs; HW/SW partitioning yields a hardware description and a software description; HW synthesis, interface synthesis, and software generation, together with configuration and parameterization, produce configuration modules, hardware components, HW/SW interfaces, and software modules; HW/SW integration and cosimulation yield the integrated system, which feeds system evaluation and design verification.]
[Rozenblit94] © IEEE 1994
Codesign Features

Basic features of a codesign process


• Enables mutual influence of both HW and SW
early in the design cycle
– Provides continual verification throughout the
design cycle
– Separate HW/SW development paths can lead to
costly modifications and schedule slippages
• Enables evaluation of larger design space
through tool interoperability and automation
of codesign at abstract design levels
• Advances in key enabling technologies (e.g.,
logic synthesis and formal methods) make it
easier to explore design tradeoffs
State of Codesign
Technology
• Current use limited by:
– Lack of a standardized representation
– Lack of good validation and evaluation
methods
• Possible solutions:
– Extend existing hardware/software languages
to the use of heterogeneous paradigms
– Extend formal verification techniques to the
HW/SW domain
Issues and Problems:
Integration
• Errors in hardware and software design become much more costly as
more commitments are made
• “Hardware first” approach often compounds software cost because
software must compensate for hardware inadequacies

Software Cost Impact of Inadequate Hardware Resources
[Figure: Relative programming cost per instruction versus percent utilization of hardware speed and memory capacity. Cost rises sharply as utilization approaches 100%, and the “experience” curve rises far above what “folklore” suggests.]
Module Outline

• Introduction

• Unified HW/SW Representations


• HW/SW Partitioning Techniques
• Integrated HW/SW Modeling Methodologies
• HW and SW Synthesis Methodologies
• Industry Approaches to HW/SW Codesign

• Hardware/Software Codesign Research

• Summary
Unified HW/SW
Representation
• Unified Representation --
– A representation of a system that can be used to
describe its functionality independent of its
implementation in hardware or software
– Allows hardware/software partitioning to be
delayed until trade-offs can be made
– Typically used at a high-level in the design process
• Provides a simulation environment after
partitioning is done, for both hardware and
software designers to use to communicate
• Supports cross-fertilization between hardware
and software domains
Current Abstraction
Mechanisms in
Hardware Systems
Abstraction
The level of detail contained within the system
model

• A system can be modeled at system,


instruction set, register-transfer, logic, or
circuit level

• A model can describe a system in the


behavioral, structural, or physical domain
Abstractions in Modeling: Hardware Systems

Level                        | Behavior                | Structure                            | Physical
PMS (System)                 | Communicating Processes | Processors, Memories, Switches (PMS) | Cabinets, Cables
Instruction Set (Algorithm)  | Input-Output            | Memory, Ports, Processors            | Board Floorplan
Register-Transfer            | Register Transfers      | ALUs, Regs, Muxes, Bus               | ICs, Macro Cells
Logic                        | Logic Equations         | Gates, Flip-flops                    | Std. cell layout
Circuit                      | Network Equations       | Transistors, Connections             | Transistor layout

(Design typically starts at the system-level behavioral description and is worked down to a standard-cell layout at the logic level.)
© IEEE 1990 [McFarland90]


Current Abstraction
Mechanisms for
Software Systems
Virtual machine
A software layer very close to the hardware
that hides the hardware’s details and provides
an abstract and portable view to the
application programmer
Attributes
– Developer can treat it as the real machine
– A convenient set of instructions can be used
by developer to model system
– Certain design decisions are hidden from the
programmer
– Operating systems are often viewed as virtual
machines
Abstractions for
Software Systems
Virtual Machine Hierarchy

• Application Programs
• Utility Programs
• Operating System
• Monitor
• Machine Language
• Microcode
• Logic Devices
Abstract Hardware-Software Model
Uses a unified representation of the system to allow early performance analysis
[Figure: The abstract HW/SW model feeds general performance evaluation, identification of bottlenecks, evaluation of design alternatives, and evaluation of HW/SW trade-offs.]
Examples of Unified HW/SW
Representations
Systems can be modeled at a high level as:
 Data/control flow diagrams
 Concurrent processes
 Finite state machines
 Object-oriented representations
 Petri Nets
Unified Representations
(Cont.)
• Data/control flow graphs
– Graphs contain nodes corresponding to
operations in either hardware or software
– Often used in high-level hardware synthesis
– Can easily model data flow, control steps, and
concurrent operations because of its graphical
nature
Example: a dataflow graph in which inputs 5, X, 4 and Y feed two additions scheduled in control step 1, followed by further additions in control steps 2 and 3 (see the sketch below).
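As a concrete illustration of this data/control flow representation, the sketch below encodes each operation as a graph node annotated with its control step. The structure and names are hypothetical, not taken from any particular tool.

#include <iostream>
#include <string>
#include <vector>

// Hypothetical data/control flow graph node: an operation with input edges
// (operands) and an assigned control step.
struct DfgNode {
    std::string op;            // e.g., "+" for an addition
    std::vector<int> inputs;   // indices of predecessor nodes (-1 = primary input)
    int controlStep;           // control step in which the operation executes
};

int main() {
    // Two additions in control step 1, then one each in steps 2 and 3,
    // mirroring the example above.
    std::vector<DfgNode> graph = {
        {"+", {-1, -1}, 1},    // e.g., 5 + X
        {"+", {-1, -1}, 1},    // e.g., 4 + Y
        {"+", {0, 1}, 2},      // sum of the two step-1 results
        {"+", {2, -1}, 3},     // a further addition in control step 3
    };
    for (std::size_t i = 0; i < graph.size(); ++i)
        std::cout << "node " << i << ": " << graph[i].op
                  << " scheduled in control step " << graph[i].controlStep << "\n";
    return 0;
}

Walking the node list in control-step order reproduces the schedule shown in the example, which is why this form is convenient for high-level hardware synthesis.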
Unified Representations
(Cont.)
• Concurrent processes
– Interactive processes executing concurrently with
other processes in the system-level specification
– Enable hardware and software modeling
• Finite state machines
– Provide a mathematical foundation for verifying
system correctness, simulation,
hardware/software partitioning, and synthesis
– Multiple FSMs that communicate can be used to
model reactive real-time systems
Unified Representations
(Cont.)
• Object-oriented representations:
– Use techniques previously applied to software to
manage complexity and change in hardware
modeling
– Use C++ to describe hardware and display OO
characteristics
– Use OO concepts such as
• Data abstraction
• Information hiding
• Inheritance
– Use building block approach to gain OO benefits
• Higher component reuse
• Lower design cost
• Faster system design process
• Increased reliability
Unified Representations
(Cont.)
Object-oriented representation example: three levels of abstraction, each exposing its own operations (see the C++ sketch below)
• Register: Read, Write
• ALU: Add, Sub, AND, Shift
• Processor: Mult, Div, Load, Store
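A minimal C++ sketch of this object-oriented style, assuming the three levels shown above: each hardware unit is a class that hides its internal state behind a small set of operations, and higher levels are built from lower-level objects. The implementation details are illustrative only, not any particular modeling library.

#include <cstdint>
#include <iostream>

// Register abstraction: state is hidden, only Read/Write are exposed.
class Register {
public:
    uint32_t Read() const { return value_; }
    void Write(uint32_t v) { value_ = v; }
private:
    uint32_t value_ = 0;
};

// ALU abstraction: purely combinational operations.
class ALU {
public:
    uint32_t Add(uint32_t a, uint32_t b) const { return a + b; }
    uint32_t Sub(uint32_t a, uint32_t b) const { return a - b; }
    uint32_t And(uint32_t a, uint32_t b) const { return a & b; }
    uint32_t Shift(uint32_t a, unsigned n) const { return a << n; }
};

// Processor abstraction built from the lower-level objects (building-block reuse).
class Processor {
public:
    void Load(uint32_t v) { acc_.Write(v); }
    uint32_t Store() const { return acc_.Read(); }
    void AddImmediate(uint32_t v) { acc_.Write(alu_.Add(acc_.Read(), v)); }
private:
    Register acc_;
    ALU alu_;
};

int main() {
    Processor p;
    p.Load(5);
    p.AddImmediate(7);
    std::cout << "accumulator = " << p.Store() << "\n";  // prints 12
    return 0;
}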
Unified Representations
(Cont.)
• Petri Nets: a system model consisting of places, tokens, transitions, arcs, and a marking
– Places - equivalent to conditions; they hold tokens
– Tokens - represent information flow through the system
– Transitions - associated with events; a “firing” of a transition indicates that some event has occurred
– Marking - a particular placement of tokens within the places of a Petri net, representing the state of the net
Example: [Figure: A simple Petri net with input places holding tokens, a transition, and an output place.]
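These terms map naturally onto a small data structure. The sketch below is an illustrative encoding, not a standard library: places hold token counts (the marking), each transition lists its input and output places, and a transition may fire only when every input place holds a token.

#include <iostream>
#include <vector>

// Illustrative Petri net: the marking is a token count per place; a transition
// is enabled when all of its input places hold at least one token.
struct Transition {
    std::vector<int> inputs;   // indices of input places
    std::vector<int> outputs;  // indices of output places
};

struct PetriNet {
    std::vector<int> marking;            // tokens currently in each place
    std::vector<Transition> transitions;

    bool enabled(const Transition& t) const {
        for (int p : t.inputs)
            if (marking[p] == 0) return false;
        return true;
    }

    // Firing consumes one token from each input place and adds one to each output place.
    bool fire(int ti) {
        const Transition& t = transitions[ti];
        if (!enabled(t)) return false;
        for (int p : t.inputs) --marking[p];
        for (int p : t.outputs) ++marking[p];
        return true;
    }
};

int main() {
    // Two input places feeding one transition with a single output place.
    PetriNet net{{1, 1, 0}, {{{0, 1}, {2}}}};
    std::cout << "fired: " << net.fire(0)
              << ", tokens in output place: " << net.marking[2] << "\n";
    return 0;
}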
Module Outline

• Introduction
• Unified HW/SW Representations

• HW/SW Partitioning Techniques


• Integrated HW/SW Modeling Methodologies
• HW and SW Synthesis Methodologies
• Industry Approaches to HW/SW Codesign

• Hardware/Software Codesign Research

• Summary
Hardware/Software
Partitioning
• Definition
– The process of deciding, for each subsystem,
whether the required functionality is more
advantageously implemented in hardware or
software
• Goal
– To achieve a partition that will give us the
required performance within the overall system
requirements (in size, weight, power, cost, etc.)
• This is a multivariate optimization problem
that, when automated, is NP-hard
HW/SW Partitioning Issues

• Partitioning into hardware and software


affects overall system cost and performance
• Hardware implementation
– Provides higher performance via hardware speeds
and parallel execution of operations
– Incurs additional expense of fabricating ASICs

• Software implementation
– May run on high-performance processors at low
cost (due to high-volume production)
– Incurs high cost of developing and maintaining
(complex) software
Partitioning Approaches

• Start with all functionality in software and


move portions into hardware which are
time-critical and can not be allocated to
software (software-oriented
partitioning)

• Start with all functionality in hardware and


move portions into software
implementation (hardware-oriented
partitioning)
System Partitioning
(Functional Partitioning)
• System partitioning in the context of
hardware/software codesign is also
referred to as functional partitioning
• Partitioning functional objects among
system components is done as follows
– The system’s functionality is described as
collection of indivisible functional objects
– Each system component’s functionality is
implemented in either hardware or software
• An important advantage of functional
partitioning is that it allows
hardware/software solutions
Partitioning Metrics

• Deterministic estimation techniques


– Can be used only with a fully specified model with
all data dependencies removed and all
component costs known
– Result in very good partitions
• Statistical estimation techniques
– Used when the model is not fully specified
– Based on the analysis of similar systems and
certain design parameters
• Profiling techniques
– Examine control flow and data flow within an
architecture to determine computationally
expensive parts which are better realized in
hardware
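To make the profiling idea concrete, the sketch below times candidate blocks of a software-only prototype and reports where execution time concentrates; blocks that dominate the profile become candidates for hardware. The function names and workloads are hypothetical.

#include <chrono>
#include <functional>
#include <iostream>
#include <string>
#include <vector>

// Hypothetical candidate kernels from a software-only prototype.
static void filterKernel() { volatile double x = 0; for (int i = 0; i < 2000000; ++i) x = x + i * 0.5; }
static void controlLogic() { volatile int y = 0;    for (int i = 0; i < 50000;  ++i) y = y ^ i; }

// Simple profiling harness: measure wall-clock time per named block.
int main() {
    std::vector<std::pair<std::string, std::function<void()>>> blocks = {
        {"filterKernel", filterKernel},
        {"controlLogic", controlLogic},
    };
    for (auto& b : blocks) {
        auto t0 = std::chrono::steady_clock::now();
        b.second();
        auto us = std::chrono::duration_cast<std::chrono::microseconds>(
                      std::chrono::steady_clock::now() - t0).count();
        std::cout << b.first << ": " << us << " us\n";  // large values suggest hardware candidates
    }
    return 0;
}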
Binding Software
to Hardware
• Binding: assigning software to hardware
components
• After parallel implementation of assigned
modules, all design threads are joined for
system integration
– Early binding commits a design process to a
certain course
– Late binding, on the other hand, provides
greater flexibility for last minute changes
Hardware/Software System
Architecture Trends
• Some operations in special-purpose hardware
– Generally take the form of a coprocessor
communicating with the CPU over its bus
• Computation must be long enough to compensate for
the communication overhead
– May be implemented totally in hardware to avoid
instruction interpretation overhead
• Utilize high-level synthesis algorithms to generate a
register transfer implementation from a behavior
description
• Partitioning algorithms are closely related to
the process scheduling model used for the
software side of the implementation
HW/SW Partition
Formal Definition
• A hardware/software partition is defined
using two sets H and S, where H ⊆ O, S ⊆ O,
H ∪ S = O, and H ∩ S = ∅
• Associated metrics:
– Hsize(H) is the size of the hardware needed to
implement the functions in H (e.g., number of
transistors)
– Performance(G) is the total execution time for the
group of functions in G for a given partition {H,S}
– Set of performance constraints, Cons = (C1, ..., Cm),
where Cj = {G, timecon} indicates the maximum
execution time allowed for all the functions in
group G, and G ⊆ O
Performance Satisfying
Partition
• A performance-satisfying partition is one
for which performance(Cj.G) ≤ Cj.timecon,
for all j = 1...m
• Given O and Cons, the hardware/software
partitioning problem is to find a
performance satisfying partition {H,S} such
that Hsize(H) is minimized
• The all-hardware size of O is defined as the
size of an all hardware partition (i.e.,
Hsize(O))
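A minimal sketch of how these definitions can be checked in code, assuming per-function size and execution-time estimates are available. The field names, the purely additive timing model, and the numbers are placeholders, not part of the formal model.

#include <iostream>
#include <set>
#include <string>
#include <vector>

// Each functional object o in O carries estimated metrics for a HW or SW implementation.
struct FuncObj {
    std::string name;
    double hwSize;   // e.g., transistor count if implemented in hardware
    double hwTime;   // execution time if implemented in hardware
    double swTime;   // execution time if implemented in software
};

// Hsize(H): total hardware size of the functions assigned to hardware.
double Hsize(const std::vector<FuncObj>& O, const std::set<std::string>& H) {
    double s = 0;
    for (const auto& o : O) if (H.count(o.name)) s += o.hwSize;
    return s;
}

// Performance(G): total execution time of group G under the partition {H, S}
// (a crude additive model; a real estimator would also account for communication).
double Performance(const std::vector<FuncObj>& O, const std::set<std::string>& H,
                   const std::set<std::string>& G) {
    double t = 0;
    for (const auto& o : O)
        if (G.count(o.name)) t += H.count(o.name) ? o.hwTime : o.swTime;
    return t;
}

int main() {
    std::vector<FuncObj> O = {{"fft", 5000, 1.0, 8.0}, {"ui", 2000, 0.5, 1.0}};
    std::set<std::string> H = {"fft"};                 // S is implicitly O \ H
    std::set<std::string> G = {"fft", "ui"};
    double timecon = 3.0;                              // constraint Cj = {G, timecon}
    std::cout << "Hsize = " << Hsize(O, H)
              << ", satisfies constraint: " << (Performance(O, H, G) <= timecon) << "\n";
    return 0;
}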
Issues in Partitioning

• Specification abstraction level


• Granularity
• System-component allocation
• Metrics and estimations
• Objective and closeness functions
• Partitioning algorithms
• Output
• Flow of control and designer interaction
Issues in Partitioning (Cont.)
[Figure: A high-level abstraction is decomposed into functional objects; metrics and estimations, partitioning algorithms, and objective and closeness functions then drive component allocation and produce the output.]
Specification Abstraction
Levels
• Task-level dataflow graph
– A Dataflow graph where each operation
represents a task
• Task
– Each task is described as a sequential program
• Arithmetic-level dataflow graph
– A Dataflow graph of arithmetic operations along
with some control operations
– The most common model used in the partitioning
techniques
• Finite state machine (FSM) with datapath
– A finite state machine, with possibly complex
expressions being computed in a state or during a
transition
Specification Abstraction
Levels (Cont.)
• Register transfers
– The transfers between registers for each
machine state are described
• Structure
– A structural interconnection of physical
components
– Often called a netlist
Granularity Issues in
Partitioning
• The granularity of the decomposition is a
measure of the size of the specification in
each object
• The specification is first decomposed into
functional objects, which are then partitioned
among system components
– Coarse granularity means that each object
contains a large amount of the specification.
– Fine granularity means that each object contains
only a small amount of the specification
• Many more objects
• More possible partitions
– Better optimizations can be achieved
System Component
Allocation
• The process of choosing system component
types from among those allowed, and
selecting a number of each to use in a given
design
• The set of selected components is called an
allocation
– Various allocations can be used to implement a
specification, each differing primarily in monetary
cost and performance
– Allocation is typically done manually or in
conjunction with a partitioning algorithm
• A partitioning technique must designate the
types of system components to which
functional objects can be mapped
– ASICs, memories, etc.
Metrics and Estimations
Issues
• A technique must define the attributes of a
partition that determine its quality
– Such attributes are called metrics
• Examples include monetary cost, execution time,
communication bit-rates, power consumption, area,
pins, testability, reliability, program size, data size, and
memory size
• Closeness metrics are used to predict the benefit of
grouping any two objects
• Need to compute a metric’s value
– Because all metrics are defined in terms of the
structure (or software) that implements the
functional objects, it is difficult to compute costs
as no such implementation exists during
partitioning
Metrics in HW/SW
Partitioning
• Two key metrics are used in
hardware/software partitioning

– Performance: Generally improved by moving


objects to hardware

– Hardware size: Hardware size is generally


improved by moving objects out of hardware
Computation of Metrics

• Two approaches to computing metrics


– Creating a detailed implementation
• Produces accurate metric values
• Impractical as it requires too much time
– Creating a rough implementation
• Includes the major register transfer
components of a design
• Skips details such as precise routing or optimized
logic, which require much design time
• Determining metric values from a rough
implementation is called estimation
Objective and Closeness
Functions
• Multiple metrics, such as cost, power, and
performance are weighed against one another
– An expression combining multiple metric values into a
single value that defines the quality of a partition is
called an Objective Function
– The value returned by such a function is called cost
– Because many metrics may be of varying importance,
a weighted sum objective function is used
• e.g., Objfct = k1 * area + k2 * delay + k3 * power
– Because constraints always exist on each design, they
must be taken into account
• e.g., Objfct = k1 * F(area, area_constr)
+ k2 * F(delay, delay_constr)
+ k3 * F(power, power_constr) (see the sketch below)
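A small sketch of the weighted-sum objective function above, using a simple penalty-style F(metric, constraint) that charges cost only for the amount by which a constraint is exceeded. The weights and the exact shape of F are assumptions, since they are not defined here.

#include <algorithm>
#include <iostream>

// F(metric, constraint): zero when the constraint is met, otherwise the
// normalized amount of violation (one common choice; others are possible).
double F(double metric, double constr) {
    return std::max(0.0, (metric - constr) / constr);
}

// Weighted-sum objective combining area, delay, and power against their constraints.
double Objfct(double area, double delay, double power) {
    const double k1 = 1.0, k2 = 2.0, k3 = 0.5;                       // illustrative weights
    const double area_constr = 10000, delay_constr = 50, power_constr = 2.0;
    return k1 * F(area, area_constr) + k2 * F(delay, delay_constr) + k3 * F(power, power_constr);
}

int main() {
    std::cout << "cost (all constraints met) = " << Objfct(8000, 40, 1.5) << "\n";  // 0
    std::cout << "cost (delay violated)      = " << Objfct(8000, 75, 1.5) << "\n";  // > 0
    return 0;
}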
Partitioning Algorithm Issues

• Given a set of functional objects and a set


of system components, a partitioning
algorithm searches for the best partition,
which is the one with the lowest cost, as
computed by an objective function
• While the best partition can be found
through exhaustive search, this method is
impractical because of the inordinate
amount of computation and time required
• The essence of a partitioning algorithm is
the manner in which it chooses the subset
of all possible partitions to examine
Partitioning Algorithm
Classes
• Constructive algorithms
– Group objects into a complete partition
– Use closeness metrics to group objects, hoping for
a good partition
– Spend computation time constructing a small
number of partitions
• Iterative algorithms
– Modify a complete partition in the hope that such
modifications will improve the partition
– Use an objective function to evaluate each
partition
– Yield more accurate evaluations than closeness
functions used by constructive algorithms
• In practice, a combination of constructive and
iterative algorithms is often employed
Iterative Partitioning
Algorithms
• The computation time in an iterative
algorithm is spent evaluating large
numbers of partitions
• Iterative algorithms differ from one
another primarily in the ways in which they
modify the partition and in which they
accept or reject bad modifications
• The goal is to find the global minimum while performing as little computation as possible
[Figure: A cost curve with local minima A and B and the global minimum C.]
Iterative Partitioning
Algorithms (Cont.)
• Two broad categories:
– Greedy algorithms
• Only accept moves that decrease cost
• Can get trapped in local minima
– Hill-climbing algorithms
• Allow moves in directions increasing cost
(retracing)
– Through use of stochastic functions
• Can escape local minima
• E.g., simulated annealing
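The sketch below contrasts the two acceptance rules: a greedy search accepts only cost-decreasing moves, while simulated annealing occasionally accepts cost-increasing moves with a temperature-dependent probability, which is what lets it escape local minima. The partition encoding and the cost model are toy assumptions.

#include <cmath>
#include <cstdlib>
#include <iostream>
#include <vector>

// Toy cost: each object contributes either a hardware-area term or a
// software-delay term, depending on which side of the partition it is on.
double cost(const std::vector<bool>& inHw) {
    double c = 0;
    for (std::size_t i = 0; i < inHw.size(); ++i)
        c += inHw[i] ? 1.0 + 0.1 * i        // hardware area contribution
                     : 2.0 * ((i % 3) + 1);  // software delay contribution
    return c;
}

int main() {
    std::srand(42);
    std::vector<bool> part(12, false);          // start all-software
    double T = 10.0;                            // initial temperature
    for (int step = 0; step < 5000; ++step, T *= 0.999) {
        std::vector<bool> cand = part;
        cand[std::rand() % cand.size()].flip(); // move one object across the partition
        double d = cost(cand) - cost(part);
        // Greedy would require d < 0; annealing also accepts uphill moves with prob e^(-d/T).
        if (d < 0 || std::exp(-d / T) > (double)std::rand() / RAND_MAX)
            part = cand;
    }
    std::cout << "final cost = " << cost(part) << "\n";
    return 0;
}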
Output Issues in Partitioning

• Any partitioning technique must define the


representation format and potential use of
its output
– E.g., the format may be a list indicating which
functional object is mapped to which system
component
– E.g., the output may be a revised specification
• Containing structural objects for the system
components
• Defining a component’s functionality using the
functional objects mapped to it
Flow of Control and
Designer Interaction
• Sequence in making decisions is variable, and
any partitioning technique must specify the
appropriate sequences
– E.g., selection of granularity, closeness metrics,
closeness functions
• Two classes of interaction
– Directives
• Include possible actions the designer can perform
manually, such as allocation, overriding estimations,
etc.
– Feedback
• Describe the current design information available to the
designer (e.g., graphs of wires between objects,
histograms, etc.)
Comparing Partitions Using
Cost Functions
• A cost function is a function Cost(H, S, Cons, I )
which returns a natural number that
summarizes the overall quality of a given
partition
– I contains any additional information that is not
contained in H or S or Cons
– A smaller cost function value is desired
• An iterative improvement partitioning
algorithm is defined as a procedure
Part_Alg(H, S, Cons, I, Cost( ) )
which returns a partition H’, S’ such that
Cost(H’, S’, Cons, I) ≤ Cost(H, S, Cons, I)
Module Outline

• Introduction
• Unified HW/SW Representations
• HW/SW Partitioning Techniques

• Integrated HW/SW Modeling


Methodologies
• HW and SW Synthesis Methodologies
• Industry Approaches to HW/SW Codesign

• Hardware/Software Codesign Research

• Summary
Cosimulation

• An HDL (VHDL or Verilog) simulation


environment is used to perform behavioral
simulation of the system hardware
processes

• A Software environment (C or C++) is used


to develop the code

• SW and HW execute as separate processes
linked through UNIX IPC (interprocess
communication) mechanisms such as sockets (see the sketch below)
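A minimal sketch of the software side of such a link, assuming the HDL simulator (through its PLI/FLI glue code) listens on a UNIX-domain socket at a hypothetical path; the text-command message format is invented for illustration.

#include <cstdio>
#include <cstring>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

// Software process side of a HW/SW cosimulation: connect to the HDL simulator
// over a UNIX-domain socket and exchange simple request/response messages.
int main() {
    int fd = socket(AF_UNIX, SOCK_STREAM, 0);
    if (fd < 0) { perror("socket"); return 1; }

    sockaddr_un addr{};
    addr.sun_family = AF_UNIX;
    std::strncpy(addr.sun_path, "/tmp/hw_sim.sock", sizeof(addr.sun_path) - 1); // hypothetical path

    if (connect(fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr)) < 0) {
        perror("connect");   // the simulator-side PLI/FLI code would own this socket
        return 1;
    }

    const char req[] = "WRITE reg0 0x2A\n";                 // invented message format
    if (write(fd, req, sizeof(req) - 1) < 0) perror("write");

    char resp[128] = {0};
    ssize_t n = read(fd, resp, sizeof(resp) - 1);           // e.g., an acknowledgement from the HW model
    if (n > 0) std::printf("simulator replied: %s", resp);

    close(fd);
    return 0;
}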
Verilog Cosimulation Example
[Figure: A Verilog HW simulator runs modules for the application-specific hardware (HW proc 1, HW proc 2) and a bus interface. Software processes (SW proc 1, SW proc 2) communicate with the hardware simulator via UNIX sockets, and the Verilog PLI (programming language interface) serves as the translator that allows the hardware simulation models to communicate with the software processes.]
© IEEE 1993 [Thomas93]


VHDL Cosimulation Example
[Figure: A VHDL simulator runs the hardware model (RS232 and VME modules). Software processes (SW proc 1, SW proc 2) communicate with the hardware simulator via a VHDL foreign language interface, allowing the hardware simulation models to “cosimulate” with the software processes.]
VHDL-C Based HW/SW Cosimulation for DSP Multicomputer Application
[Figure: The algorithm and scheduler are written in C and mapped (e.g., round robin or based on computational requirements) onto an architecture modeled in VHDL: CPUs 1-4 connected by a communications network.]
VHDL-C Based HW/SW Cosimulation for DSP Multicomputer Application (cont.)
[Figure: A Unix C program (the algorithm/scheduler) exchanges information with the VHDL architecture model through an instrument package. The VHDL side reports system state, e.g., time to instruction completion for each CPU, messages in each comm agent's send and receive queues, and which communications channels are busy; the C side supplies the next instruction for each CPU to execute: Send(destination, message_length), Recv(source, message_length), or Compute(time).]
Model Continuity Problem
The inability to gradually refine a system-level model
into a hardware/software implementation

• Model continuity problems exist in both


hardware and software systems
• Model continuity can help address several
system design problems
– Allows validation of system level models with
corresponding HW/SW implementation
– Addresses subsystem integration
Module Outline

• Introduction
• Unified HW/SW Representations
• HW/SW Partitioning Techniques
• Integrated HW/SW Modeling Methodologies

• HW and SW Synthesis Methodologies


• Industry Approaches to HW/SW Codesign

• Hardware/Software Codesign Research

• Summary
Hardware Design Methodology
Hardware Design Process: Waterfall Model
[Figure: Hardware Requirements → Preliminary Hardware Design → Detailed Hardware Design → Fabrication → Testing]
Hardware Design
Methodology (Cont.)
• Use of HDLs for modeling and simulation
• Use of lower-level synthesis tools to derive register
transfer and lower-level designs
• Use of high-level hardware synthesis tools
– Behavioral descriptions
– System design constraints
• Introduction of synthesis for testability at all levels
Hardware Synthesis

• Definition
– The automatic design and implementation of
hardware from a specification written in a
hardware description language

• Goals/benefits
– To quickly create and modify designs
– To support a methodology that allows for multiple
design alternative consideration
– To remove from the designer the handling of the
tedious details of VLSI design
– To support the development of correct designs
Hardware Synthesis
Categories
• Algorithm synthesis
– Synthesis from design requirements to control-
flow behavior or abstract behavior
– Largely a manual process
• Register-transfer synthesis
– Also referred to as “high-level” or “behavioral”
synthesis
– Synthesis from abstract behavior, control-flow
behavior, or register-transfer behavior (on one
hand) to register-transfer structure (on the other)
• Logic synthesis
– Synthesis from register-transfer structures or
Boolean equations to gate-level logic (or physical
implementations using a predefined cell or IC
library)
Hardware Synthesis Process Overview
[Figure: Specification and implementation proceed in parallel. Behavioral simulation checks the behavioral specification, which behavioral synthesis turns into a functional implementation; optional RTL simulation accompanies RTL and test synthesis, with functional verification of the RTL; gate-level simulation and gate-level analysis check the gate-level netlist; the silicon vendor then performs layout and place and route to produce silicon.]
Software Design Methodology
Software Design Process: Waterfall Model
[Figure: Software Requirements → Software Design → Coding → Testing → Maintenance]
Software Design
Methodology (Cont.)
• Software requirements includes both
– Analysis
– Specification
• Design: 2 levels:
– System level - module specs.
– Detailed level - process design language (PDL) used
• Coding - in high-level language
– C/C++
• Testing and maintenance - several levels
– Unit testing
– Integration testing
– System testing
– Regression testing
– Acceptance testing
Software Synthesis

• Definition: the automatic development of


correct and efficient software from
specifications and reusable components
• Goals/benefits
– To Increase software productivity
– To lower development costs
– To Increase confidence that software
implementation satisfies specification
– To support the development of correct programs
Why Use
Software Synthesis?
• Software development is becoming the major
cost driver in fielding a system

• To significantly improve both the design cycle


time and life-cycle cost of embedded systems,
a new software design methodology,
including automated code generation, is
necessary

• Synthesis supports a correct-by-construction


philosophy

• Techniques support software reuse


Software Synthesis
Categories
• Language compilers

– ADA and C compilers

– YACC - yet another compiler compiler

– Visual Basic

• Domain-specific synthesis

– Application generators from software libraries


Software Synthesis Examples
• Mentor Graphics Concurrent Design Environment
System
– Uses object-oriented programming (written in C++)
– Allows communication between hardware and
software synthesis tools
• Index Technologies Excelerator and Cadre’s
Teamwork Toolsets
– Provide an interface with COBOL and PL/1 code
generators
• KnowledgeWare’s IEW Gamma
– Used in MIS applications
– Can generate COBOL source code for system designers
• MCCI’s Graph Translation Tool (GrTT)
– Used by Lockheed Martin ATL
– Can generate ADA from Processing Graph Method
(PGM) graphs
GrTT Tool Architecture
[Figure: SPGN* and GV file sets, together with constraints and error conditions, feed a parser that produces a validated graph object; graph analysis combines it with a behavioral specification, and the autocoder uses code fragments from the MCCI domain primitive database to generate an Ada source code file. (*Signal Processing Graph Notation)]
Interface Synthesis

• Definition: the automatic design and


implementation of hardware (glue logic)
and the software (drivers) components
between the processor and the dedicated
hardware
• Goals/benefits
– To quickly create and modify designs
– To remove from the designer the handling of
the tedious details of VLSI design
Interface Synthesis
Approaches
• Typical approaches use standard
interface schemes
– memory-mapped
– serial port
– parallel port
– self-timed
– synchronous
– blocking
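Of these, the memory-mapped scheme is the most common target for synthesized device drivers. The sketch below shows, in generic terms, what the software side of such an interface looks like: device registers appear at fixed addresses and are accessed through volatile pointers. The base address, register offsets, and handshake bits are invented for illustration, and the routine only makes sense when the target hardware (or a simulation model of it) is present.

#include <cstdint>

// Hypothetical register map of a memory-mapped peripheral (addresses invented).
constexpr std::uintptr_t DEV_BASE   = 0x40000000;
constexpr std::uintptr_t REG_CTRL   = 0x00;   // control register (bit 0 = start)
constexpr std::uintptr_t REG_STATUS = 0x04;   // status register (bit 0 = done)
constexpr std::uintptr_t REG_DATA   = 0x08;   // data register

// volatile ensures every access really reaches the device instead of being optimized away.
static inline volatile std::uint32_t* reg(std::uintptr_t off) {
    return reinterpret_cast<volatile std::uint32_t*>(DEV_BASE + off);
}

// Sketch of a driver routine in the style a codesign tool might generate:
// write the operand, start the device, poll for completion, read the result.
std::uint32_t run_accelerator(std::uint32_t operand) {
    *reg(REG_DATA) = operand;        // write input operand
    *reg(REG_CTRL) = 1;              // set the start bit
    while ((*reg(REG_STATUS) & 1u) == 0) { /* busy-wait until done */ }
    return *reg(REG_DATA);           // read back the result
}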
Cosynthesis

• Methodical approach to system


implementations using automated synthesis-
oriented techniques
• Methodology and performance constraints
determine partitioning into hardware and
software implementations
• The result is an “optimal” system that benefits
from analysis of hardware/software design
trade-offs
Cosynthesis Approach to System Implementation
[Figure: A behavioral specification plus memory and performance criteria map the system input to the system output. In the cost-versus-performance design space bounded by the performance constraints, a mixed HW/SW implementation lies between the pure-hardware and pure-software extremes.]
[Gupta93] © IEEE 1993
Module Outline

• Introduction
• Unified HW/SW Representations
• HW/SW Partitioning Techniques
• Integrated HW/SW Modeling Methodologies
• HW and SW Synthesis Methodologies

• Industry Approaches to HW/SW Codesign


• Hardware/Software Codesign Research

• Summary
Sanders Codesign Methodology
[Figure: Global influences (design rules, tool selection, virtual environment, cost models, libraries) feed the design development flow. HW/SW requirements analysis, integrated algorithm development, HW/SW tradeoff analysis, and integrated HW/SW simulation drive a software path (SW requirements partitioning, design, code, test) and a hardware path (HW requirements partitioning, logical and physical design, analysis and simulation, fabrication and test); the paths converge in integration and test and system checkout, with feedback to the user at all steps.]
[HOOD94]
Sanders Codesign Methodology (cont.)
[Figure: Integrated modeling substrate. System requirements drive an architecture-independent processing model; hardware and software performance models refine it into an architecture-dependent processing model. On the hardware side, behavior-level, ISA, RTL, and gate-level models lead to prototype hardware; on the software side, HOL source code and assembly lead to the load module. A simulation library supports all levels.]
[RASSP94]
Sanders Codesign
Methodology
• Subsystems process
– Processing requirements are modeled in an
architecture-independent manner
– Codesign not an issue
• Architecture process
– HW/SW allocation analyzed via modeling of SW
performance on candidate architectures
– Hierarchical verification is performed using finer grain
modeling (ISA and below)
• Detailed design
– Downloadable executable application and test code is
verified to maximum extent possible
• Library support
– SW models validated on test data
– HW models validated using existing SW models
– HW & SW models jointly iterated throughout the design
Lockheed Martin ATL Codesign Methodology
[Figure: User requirements and specification feed a top-level architecture and HW/SW tradeoff analysis. A software path (SW requirements/spec partitioning, SW design, code, debug, test) and a hardware path (HW spec partitioning, HW design, development, analysis, fabrication, and test), supported by algorithm development and simulation, interface design, HW simulation, and HW/SW cosimulation, converge in HW/SW integration, prototype test, and system checkout.]
[RASSP94]
Module Outline

• Introduction
• Unified HW/SW Representations
• HW/SW Partitioning Techniques
• Integrated HW/SW Modeling Methodologies
• HW and SW Synthesis Methodologies
• Industry Approaches to HW/SW Codesign

• Hardware/Software Codesign Research


• Summary
Major Codesign Research Efforts
• Chinook - University of Washington - Chou,
Ortega, Borriello
• Cosmos - Grenoble University - Ismail, Jerraya
• Cosyma - University of Braunschweig - Ernst,
Henkel, Benner
• Polis - U. C. Berkeley - Chiodo, Giusto,
Jurecska, Hsieh, Lavagno, Sangiovanni-
Vincentelli
• Ptolemy - U. C. Berkeley - Kalavade, Lee
Chinook
• Unified representation: Event Graph
(CDFG)
• Partitioning: constraint driven by
scheduling requirements
• Scheduling: timing driven
• Modeling substrate: based on Verilog HDL
• Validation: simulation based (Verilog)
• Main emphasis on synthesis of
hardware/software interfaces
Cosmos
• Unified representation: Initial description is
done in SDL (specification description
language) which is translated into SOLAR, an
intermediate form that allows several
description levels (CSPs, FSMs, etc.)
• Partitioning: user driven using a tool that
allows processes to be grouped together or
split into sub-processes
• Scheduling: based on the partitioning
• Modeling substrate: VHDL simulation after
Cosyma
• Unified representation: ES graph (CDFG)
• Partitioning: combined method based on
coarse partitioning by the user with cost
guidance and finer scheduling done by
simulated annealing
• Scheduling: no specific method
• Modeling substrate: based on C++
• Validation: simulation based (C++)
• Main emphasis on partitioning for
hardware accelerators
Polis
• Unified representation: Codesign Finite State
Machine (CFSM) based
• Partitioning: user driven with cost estimates
provided by co-simulation
• Scheduling: classical real-time algorithms
• Modeling substrate: Ptolemy based (C++)
• Validation: co-simulation and formal FSM
verification
• Main emphasis on verifiable specification not
Ptolemy
• Unified representation: Data Flow Graph
• Partitioning: greedy algorithm based on
scheduling constraints
• Scheduling: linear based on sorting blocks by
“criticality”
• Modeling substrate: heterogeneous modeling
and simulation framework based on C++
• Validation: based on simulation
• Main emphasis on heterogeneous modeling
Siera
• Unified representation: static, hierarchical
network of concurrent sequential processes
communicating via message queues (similar to
DFG)
• Partitioning: manual user driven
• Scheduling: static process to processor
mapping, priority based preemptive
schedulers available within real-time OS on
processors
• Modeling substrate: based on VHDL - includes
Chinook

• Hardware/Software Co-synthesis system


developed at the University of Washington

• Targeted at real-time reactive embedded


systems

• Control dominated designs constructed


from off-the-shelf components
Chinook’s Principal
Innovations
• Single Specification - one specification, with explicit timing/performance
constraints is used for the system’s hardware and software
• One Simulation Environment - the high level specification, the final result,
and any intermediate steps can be simulated to verify and debug the design
• Software Scheduling - the appropriate software architecture is synthesized
to meet the timing requirements
• Interface Synthesis - the hardware and software necessary to interface
between system components (glue logic and device drivers) is automatically
synthesized
• Complete Information for Physical Prototyping - a complete netlist is
generated for the hardware, and C source code is generated for the
software
The Chinook System
[Figure: A Verilog specification is parsed and, using processor and device libraries, passes through the scheduler, communication synthesizer, driver synthesizer, interface synthesizer, and code generator to produce a netlist and a program; behavioral, mixed, and structural simulation are available along the way.]
System Specification in
Chinook
(Unified Representation)
• The system specification is written in a dialect of Verilog and includes
the system’s behavior and the structure of the system architecture
• The behavior is specified as a set of tasks in a style similar to
communicating finite state machines - control states of the system
are organized as modes which are behavioral regimes similar to
hierarchical states
• In a given mode, the system’s responses are defined by a set of
handlers which are essentially event-triggered routines
• The designer must tag tasks or modules with the processor that is
preferred for their implementation - untagged tasks are
implemented in software
• The designer can specify response times and rate constraints for
tasks in the input description
Scheduling in Chinook
• Chinook provides an automated scheduling algorithm
• Low-level I/O routines and high level routines grouped in modes are
scheduled statically
• A static, nonpreemptive scheduling algorithm is used to meet
min/max timing constraints on low-level operations
– Determines serial ordering for operations
– Inserts delays as necessary to meet minimum constraints
– Includes heuristics in the scheduling algorithm to help exact algorithm
generate valid solution to NP-hard scheduling problem
• A customized dynamic scheduler may be generated for the top-level
modes
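A toy sketch of the static, nonpreemptive idea described above (not Chinook's actual algorithm): operations are kept in a fixed serial order, and idle delay is inserted whenever an operation's minimum-separation constraint from its predecessor would otherwise be violated. The operation names, durations, and constraint values are invented.

#include <iostream>
#include <string>
#include <vector>

// Each low-level operation takes 'duration' time units and must start at least
// 'minGapAfterPrev' after the previous operation starts (a minimum timing constraint).
struct Op {
    std::string name;
    int duration;
    int minGapAfterPrev;
};

int main() {
    // Invented I/O sequence for illustration.
    std::vector<Op> ops = {{"assert_cs", 1, 0}, {"drive_addr", 2, 1},
                           {"strobe", 1, 5}, {"read_data", 2, 2}};
    int t = 0, prevStart = 0;
    for (std::size_t i = 0; i < ops.size(); ++i) {
        int earliest = (i == 0) ? 0 : prevStart + ops[i].minGapAfterPrev;
        if (t < earliest) {
            std::cout << "  insert delay of " << (earliest - t) << "\n";  // pad to meet the min constraint
            t = earliest;
        }
        std::cout << ops[i].name << " starts at t=" << t << "\n";
        prevStart = t;
        t += ops[i].duration;   // serial (nonpreemptive) execution
    }
    return 0;
}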
Interface Synthesis in
Chinook
• Realization of communication between system components is an
area of emphasis in the Chinook system
• Chinook synthesizes device drivers from timing diagrams
• Custom code for the processor being used is generated
– For processors with I/O ports, an efficient heuristic is used to connect
devices with minimal interface hardware
– For processors w/o I/O ports, a memory mapped I/O interface is
generated including allocating address spaces, and generating the
required bus logic and instructions
• Portions of the interface that cannot be implemented in software are
synthesized into external hardware
Communications
Synthesis and System
Simulation in Chinook
• Chinook provides methods for synthesizing communications systems
between multiple processors if a multicomputer implementation is chosen
– Bus-based, point-to-point, and hybrid communications schemes are supported
– Communications library that includes FIFOs, arbiters, and interconnect
templates is provided

• Simulation of the design at different levels of detail is supported


– Verilog-XL Programming Language is used
– Verilog PLI is used to interface to device models written in C
– Each device supports the same API for simulation and synthesis - API calls can
be used by the designer to animate the model interactively
– RTL level models of the processors are used to simulate the final
implementation of the system (software)
Cosynthesis of Embedded
Applications (COSYMA)
• Developed at the Technical University of
Braunschweig, Germany
• An experimental system for HW/SW codesign of
small embedded real time systems
– Implements as many operations as possible in
software running on a processor core
– Generates external hardware only when timing
constraints are violated
• Target architecture:
– Standard RISC processor core
– Application-specific processor
COSYMA (Cont.)

• Input description of system in C* is


translated into an internal graph
representation supporting
– Partitioning
– Generating hardware descriptions for parts
moved to hardware
• Internal graph representation combines
– Control and dataflow graph
– Extended syntax (ES) graph
• Syntax graph
• Symbol table
Design Flow in a COSYMA System
[Figure: The C* input is compiled into an ES flowgraph, which drives partitioning guided by cost estimation and run-time analysis. The software part is translated back to a C program and compiled to object code; the hardware part is translated to a hardware C model and synthesized with the Olympus high-level synthesis system. A C* model simulator supports validation.]
COSYMA - Aims and
Strategies
• Major aim is automating HW/SW
partitioning process, for which very few
tools currently exist
• COSYMA partitions at the basic block and
function level (including hierarchical
function calls)
– Simulated annealing algorithm is used
because of its flexibility in the cost function
and the possibility to trade-off computation
time vs result quality
– Starts with an unfeasible all-software solution
COSYMA - Cost Function
and Metrics
• The cost function is defined to force the
annealing to reach a feasible solution
before other optimization goals (e.g., area)
• The metrics used in cost computation are:
– Expected hardware execution times
– Software execution times
– Communication
– Hardware costs
• The cost function is updated in each step
of the simulated annealing algorithm
COSYMA - Cost Function
and Metrics (Cont.)
• After partitioning, the parts selected to be
realized in software are translated to a C
program, thereby inserting code for
communicating with the coprocessor
• The rest of the system is translated to the
input description of the high-level
synthesis system, and an application-
specific coprocessor is synthesized
• Lastly, a fast-timing analysis of the whole
HW/SW system is performed to test
whether all constraints are satisfied
Ptolemy

• A software environment for simulation and


prototyping of heterogeneous systems

• Attributes
– Facilitates mixed-mode system simulation,
specification, and design
– Supports generation of DSP assembly code
from a block diagram description of algorithm
– Uses object-oriented representations to
model subsystems efficiently
– Supports different design styles, called domains
Codesign Methodology
Using Ptolemy
• Ptolemy supports a framework for
hardware/software codesign, called the
Design Assistant

• The Design Assistant consists of two


components

– Specific point tools for estimation,


partitioning, synthesis, and simulation

– An underlying design methodology
management infrastructure for design space
exploration
Codesign Methodology Using Ptolemy (Cont.)
[Figure: Design constraints, design specs, and user inputs drive the design flow: area/time estimation, then HW/SW partitioning (manual, CPLEX ILP, GCLP, ...), then hardware, interface, and software synthesis. Netlist generation (VHDL/Synopsys) and simulation within Ptolemy produce the system layout plus software.]
© IEEE 1994 [Rozenblit94]
Ptolemy Heterogeneous
Simulation Environment
[Figure: Structural components of a Ptolemy universe. Blocks with portholes are connected by geodesics and share a plasma, all running under the Ptolemy simulation kernel; different blocks may use separate models of computation (e.g., discrete event, data flow).]

• Data encapsulated in “particles”


• “Block” objects send and receive messages
• Particles travel to/from external world
through “portholes”
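A highly simplified sketch of this block/porthole/particle structure; it is not the actual Ptolemy class library, only an illustration of the pattern: particles carry data, blocks expose portholes, and a connection moves particles from an output porthole to an input porthole.

#include <deque>
#include <iostream>

// Particle: the unit of data exchanged between blocks (here just a double).
using Particle = double;

// Porthole: a typed endpoint on a block; a geodesic would connect two of these.
struct Porthole {
    std::deque<Particle> queue;                    // particles waiting to be consumed
    void put(Particle p) { queue.push_back(p); }
    Particle get() { Particle p = queue.front(); queue.pop_front(); return p; }
    bool empty() const { return queue.empty(); }
};

// Block: sends and receives particles through its portholes when fired.
struct GainBlock {
    Porthole in, out;
    double gain;
    void fire() { while (!in.empty()) out.put(gain * in.get()); }  // dataflow-style firing
};

int main() {
    GainBlock g{Porthole{}, Porthole{}, 2.5};
    g.in.put(1.0);     // the "external world" injects particles through the porthole
    g.in.put(4.0);
    g.fire();
    while (!g.out.empty()) std::cout << g.out.get() << "\n";  // prints 2.5 and 10
    return 0;
}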
POLIS
• Hardware/Software Codesign and
synthesis system developed at the
University of California, Berkeley

• Targeted towards small-scale, reactive,
control-dominated embedded systems

• Includes an “unbiased” mechanism for
specifying the system’s function that allows
for maximum flexibility in mapping to
hardware or software
POLIS
Unified Representation
• System behavior is specified in a formal manner using Codesign Finite State
Machines (CFSMs)
– CFSMs translate a set of inputs to a set of outputs with only a finite amount of internal
state
– Unlike traditional FSMs, CFSMs do not all change state exactly at the same time (globally
asynchronous)
• CFSMs are designed to be unbiased towards hardware or software
• Translators exist to convert other specification languages (e.g. ESTEREL) into CFSMs
• CFSMs can be translated into traditional FSMs to allow formal verification
• CFSMs can communicate with each other using events
– Events are unidirectional and happen in non-zero, unbounded time
– Events can be used to communicate across all domains (hardware or software)
– Events are unbuffered and can be overwritten - however, they can be used to implement
fully interlocked handshaking
• CFSMs are translated into behavioral FSMs for hardware synthesis and into S-graphs
for software synthesis
Codesign Finite State
Machines
• Specification: “Five seconds after
the key is turned on, if the belt
has not been fastened, an alarm
will beep for ten seconds or until
the key is turned off”

[Figure: CFSM for the seat belt controller, with states Off, Wait, and Alarm]
• Off → Wait: (*Key == On) ⇒ *Start
• Wait → Off: (*Key == Off) or (*Belt == On)
• Wait → Alarm: (*End == 5) ⇒ *Alarm = On
• Alarm → Off: (*End == 10) or (*Belt == On) or (*Key == Off) ⇒ *Alarm = Off
S-graph Software Specification
[Figure: The S-graph for the seat belt controller: a directed acyclic graph of test nodes (S==Off, *Key==On, S==Wait, *END==5, *END==10, *Belt==On, *Key==Off) and assignment nodes (S=Wait, S=Alarm, S=Off, *Alarm=On, *Alarm=Off) between a Begin node and an End node, from which sequential software code is generated.]

© IEEE 1994 [Chiodo94]
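The same seat belt behavior can also be written down directly as sequential software, which is essentially what the S-graph-to-C translation produces. The switch-based sketch below follows the specification quoted above; the input encoding, the per-state timer, and the test driver are assumptions, not POLIS-generated code.

#include <iostream>

enum class State { Off, Wait, Alarm };

struct Inputs { bool keyOn; bool beltOn; int secondsInState; };

// One reaction of the seat belt controller: given the current state and inputs,
// compute the next state and whether the alarm should sound.
State step(State s, const Inputs& in, bool& alarmOut) {
    switch (s) {
    case State::Off:
        if (in.keyOn) return State::Wait;                   // key turned on: start the 5 s timer
        break;
    case State::Wait:
        if (!in.keyOn || in.beltOn) return State::Off;      // key off or belt fastened: no alarm needed
        if (in.secondsInState >= 5) { alarmOut = true; return State::Alarm; }
        break;
    case State::Alarm:
        if (in.secondsInState >= 10 || in.beltOn || !in.keyOn) {
            alarmOut = false;                                // beep for 10 s or until belt/key changes
            return State::Off;
        }
        break;
    }
    return s;
}

int main() {
    bool alarm = false;
    State s = State::Off;
    s = step(s, {true, false, 0}, alarm);    // key on  -> Wait
    s = step(s, {true, false, 5}, alarm);    // 5 s, no belt -> Alarm on
    std::cout << "alarm = " << alarm << "\n";
    s = step(s, {true, true, 1}, alarm);     // belt fastened -> alarm off, back to Off
    std::cout << "alarm = " << alarm << "\n";
    return 0;
}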


Partitioning and Scheduling in
POLIS
• Partitioning based on mapping CFSMs to
either hardware or software
• This mapping is left to the user - performance
feedback is provided by simulation
• Interfaces between partitions are
automatically generated
• Scheduling based on executing CFSMs
• Selection of scheduling algorithm left to user -
built into RTOS
– Round-robin cyclic executive
Interfaces Among
Partitions
 Interfaces use a strobe/data protocol (corresponding to the event/value primitive)
[Figure: Sender → channel → receiver, each in its own domain (sender's domain, channel's domain, receiver's domain).]
 Example: HW-to-SW interface
[Figure: A small synchronizing FSM in the channel latches the hardware event x and its value and presents y to the software side, using an ack handshake.]
The POLIS Co-design Environment
[Figure: Graphical EFSM, ESTEREL, and other front-end languages are compiled into CFSMs, which support formal verification, simulation, and partitioning. SW synthesis, HW synthesis, and interface synthesis produce SW code plus an RTOS and a logic netlist, which are combined into the prototype.]
Module Outline

• Introduction
• Unified HW/SW Representations
• HW/SW Partitioning Techniques
• Integrated HW/SW Modeling Methodologies
• HW and SW Synthesis Methodologies
• Industry Approaches to HW/SW Codesign

• Hardware/Software Codesign Research

• Summary
Module Summary
• The synergistic design of hardware and software in a digital system, called
Hardware/Software Codesign, has been explored
• Elements of a HW/SW Codesign methodology have been outlined
• Industrial design flows that contain aspects of codesign have been
presented
• Present-day research into automating portions of the codesign problem
has been explored
• As digital systems become more complex and performance criteria become
more stringent, codesign will become a necessity
• Better design tools and unified design environments will allow codesign
techniques to become standard practice
References
[Boehm73] Boehm, B.W., “Software and its Impact: A Quantitative Assessment,” Datamation, May 1973, pp. 48-59.
[Buchenrieder93] Buchenrieder, K., “Codesign and Concurrent Engineering,” Hot Topics, IEEE Computer, R. D. Williams, ed., January 1993, pp. 85-86.
[Buck94] Buck, J., et al., “Ptolemy: a Framework for Simulating and Prototyping Heterogeneous Systems,” International Journal of Computer Simulation, Vol. 4, April 1994, pp. 155-182.
[Chiodo92] Chiodo, M., A. Sangiovanni-Vincentelli, “Design Methods for Reactive Real-time Systems Codesign,” International Workshop on Hardware/Software Codesign, Estes Park, Colorado, September 1992.
[Chiodo94] Chiodo, M., P. Giusto, A. Jurecska, M. Marelli, H. C. Hsieh, A. Sangiovanni-Vincentelli, L. Lavagno, “Hardware-Software Codesign of Embedded Systems,” IEEE Micro, August 1994, pp. 26-36; © IEEE 1994.
[Chou95] Chou, P., R. Ortega, G. Borriello, “The Chinook Hardware/Software Co-design System,” Proceedings ISSS, Cannes, France, 1995, pp. 22-27.
[DeMicheli93] De Micheli, G., “Extending CAD Tools and Techniques,” Hot Topics, IEEE Computer, R. D. Williams, ed., January 1993, p. 84.
[DeMicheli94] De Micheli, G., “Computer-Aided Hardware-Software Codesign,” IEEE Micro, August 1994, pp. 10-16.
[DeMicheli97] De Micheli, G., R. K. Gupta, “Hardware/Software Co-Design,” Proceedings of the IEEE, Vol. 85, No. 3, March 1997, pp. 349-365.
[Ernst93] Ernst, R., J. Henkel, T. Benner, “Hardware-Software Cosynthesis for Micro-controllers,” IEEE Design and Test, December 1993, pp. 64-75.
[Franke91] Franke, D.W., M.K. Purvis, “Hardware/Software Codesign: A Perspective,” Proceedings of the 13th International Conference on Software Engineering, May 13-16, 1991, pp. 344-352; © IEEE 1991.
References (Cont.)
[Gajski94] Gajski, D. D., F. Vahid, S. Narayan, J. Gong, Specification and Design of Embedded Systems, Prentice Hall, Englewood Cliffs, NJ 07632, 1994.
[Gupta92] Gupta, R.K., C.N. Coelho, Jr., G. De Micheli, "Synthesis and Simulation of Digital Systems Containing Interactive Hardware and Software Components," 29th Design Automation Conference, June 1992, pp. 225-230.
[Gupta93] Gupta, R.K., G. De Micheli, "Hardware-Software Cosynthesis for Digital Systems," IEEE Design and Test, September 1993, pp. 29-40; © IEEE 1993.
[Hermann94] Hermann, D., J. Henkel, R. Ernst, "An Approach to the Estimation of Adapted Cost Parameters in the COSYMA System," 3rd International Conference on Hardware/Software Codesign, Grenoble, France, September 22-24, 1994, pp. 100-107.
[Hood94] Hood, W., C. Myers, "RASSP: Viewpoint from a Prime Developer," Proceedings 1st Annual RASSP Conference, Aug. 1994.
[IEEE] All referenced IEEE material is used with permission.
[Ismail95] Ismail, T., A. Jerraya, "Synthesis Steps and Design Models for Codesign," IEEE Computer, no. 2, pp. 44-52, Feb. 1995.
[Kalavade93] Kalavade, A., E. Lee, "A Hardware-Software Co-design Methodology for DSP Applications," IEEE Design and Test, vol. 10, no. 3, pp. 16-28, Sept. 1993.
[Klenke96] Klenke, R. H., J. H. Aylor, R. Hillson, D. J. Kaplan, "VHDL-Based Performance Modeling for the Processing Graph Method Tool (PGMT) Environment," Proceedings of the VHDL International Users Forum, Spring 1996, pp. 69-73.
[Kumar95] Kumar, S., "A Unified Representation for Hardware/Software Codesign," Doctoral Dissertation, Department of Electrical Engineering, University of Virginia, May 1995.
[Jalote91] Jalote, P., An Integrated Approach to Software Engineering, Springer-Verlag, New York, 1991.
[McFarland90] McFarland, M.C., A.C. Parker, R. Camposano, "The High-Level Synthesis of Digital Systems," Proceedings of the IEEE, Vol. 78, No. 2, February 1990, pp. 301-318; © IEEE 1990.
References (Cont.)
[Parker84] Parker, A.C., "Automated Synthesis of Digital Systems," IEEE Design and Test, November 1984, pp. 75-81.
[RASSP94] Proceedings of the 1st RASSP Conference, Aug. 15-18, 1994.
[Rozenblit94] Rozenblit, J., K. Buchenrieder (editors), Codesign: Computer-Aided Software/Hardware Engineering, IEEE Press, Piscataway, NJ, 1994; © IEEE 1994.
[Smith86] Smith, C.U., R.R. Gross, "Technology Transfer between VLSI Design and Software Engineering: CAD Tools and Design Methodologies," Proceedings of the IEEE, Vol. 74, No. 6, June 1986, pp. 875-885.
[Srivastava91] Srivastava, M. B., R. W. Broderson, "Rapid Prototyping of Hardware and Software in a Unified Framework," Proceedings ICCAD, 1991, pp. 152-155.
[Subrahmanyam93] Subrahmanyam, P. A., "Hardware-Software Codesign -- Cautious Optimism for the Future," Hot Topics, IEEE Computer, R. D. Williams, ed., January 1993, p. 84.
[Tanenbaum87] Tanenbaum, A.S., Operating Systems: Design and Implementation, Prentice-Hall, Inc., Englewood Cliffs, NJ, 1987.
[Terry90] Terry, C., "Concurrent Hardware and Software Design Benefits Embedded Systems," EDN, July 1990, pp. 148-154.
[Thimbleby88] Thimbleby, H., "Delaying Commitment," IEEE Software, Vol. 5, No. 3, May 1988, pp. 78-86.
[Thomas93] Thomas, D.E., J.K. Adams, H. Schmit, "A Model and Methodology for Hardware-Software Codesign," IEEE Design and Test, September 1993, pp. 6-15; © IEEE 1993.
[Turn78] Turn, R., "Hardware-Software Tradeoffs in Reliable Software Development," 11th Annual Asilomar Conference on Circuits, Systems, and Computers, 1978, pp. 282-288.
[Vahid94] Vahid, F., J. Gong, D. D. Gajski, "A Binary Constraint Search Algorithm for Minimizing Hardware During Hardware/Software Partitioning," 3rd International Conference on Hardware/Software Codesign, Grenoble, France, September 22-24, 1994, pp. 214-219.
[Wolf94] Wolf, W.H., "Hardware-Software Codesign of Embedded Systems," Proceedings of the IEEE, Vol. 82, No. 7, July 1994, pp. 965-989.
References (Cont.)
Additional Reading:
Aylor, J.H. et al., "The Integration of Performance and Functional Modeling in VHDL” in Performance and Fault Modeling
with VHDL, J. Schoen, ed., Prentice-Hall, Englewood Cliffs, N.J., 1992.
D’Ambrosio, J. G., X. Hu, “Configuration-level Hardware-Software Partitioning for Real-time Embedded Systems”, 3rd
International Conference on Hardware/Software codesign, Grenoble, France, September 22-24, 1994, pp. 34-41
Eles, P., Z. Peng, A. Doboli, “VHDL System-Level Specification and Partitioning in a Hardware-Software Cosynthesis
Environment”, 3rd International Conference on Hardware/Software codesign, Grenoble, France, September 22-24,
1994, pp. 49-55
Gupta, R.K., G. DeMicheli, “Hardware-Software Cosynthesis for Digital Systems,” IEEE Design and Test, September 1993,
p.29-40.
Richards, M., Gadient, A., Frank, G., eds. Rapid Prototyping of Application Specific Signal Processors, Kluwer Academic
Publishers, Norwell, MA, 1997
Schultz, S.E., “An Overview of System Design,” ASIC and EDA, January 1993, p.12-21.
Thomas, D. E, J. K. Adams, H. Schmit, “A Model and Methodology for Hardware-Software Codesign”, IEEE Design and Test,
September, 1993, pp. 6-15
Zurcher, F.W., B. Randell, “Iterative Multi-level Modeling - A Methodology for Computer System Design,” Proceedings IFIP
Congress ‘68, Edinburgh, Scotland, August 1968, p.867-871.
• https://www.design-reuse.com/articles/4339/enforcing-design-rules-to-develop-reusable-ip.html
• https://www.design-reuse.com/articles/6978/hardware-software-partitioning-methodology-for-systems-on-chip-socs-with-risc-host-and-configurable-microprocessors.html
Design Reuse - 1
• Motivation
– High cost of design and verification
– Shorter design cycles
– Higher quality demands
– Emerging System-on-a-Chip (SoC) designs
• Very short design cycles
• Large numbers of distinct designs
• Analogous to board design today
Design Reuse - 2
• Requirements
– Correct and robust
• Well-written, well-documented, thoroughly-commented
code
• Well-designed verification suites and robust scripts
– Solves a general problem
• Easily configurable; parameterized
– Supports multiple technologies
• Soft macros: synthesis scripts span a variety of libraries
• Hard macros: porting strategies to new technologies
Design Reuse - 3
• Requirements (continued)
– Simulates with multiple simulators
• Both VHDL and Verilog version of models and test-benches
• Work with all major commercial simulators
– Accompanied by full verification environment
• Test benches and verification suites that provide high levels of verification
coverage
– Verified rigorously before release
• Includes construction of actual prototype tested in actual system with
real software
Design Reuse - 4

• Requirements (continued)
– Documented in terms of applications and restrictions
• Valid configurations and parameter values
• Interfacing requirements and restrictions
Example SoC

[Block diagram: a processor and a digital signal processor connected by a system bus to RAM, an I/O interface, and A/D and D/A converters]
Design Paradigms - 1

– Design flow: spiral model instead of waterfall model
– Top-down/bottom-up mixture instead of pure top-down
– Construct by correction instead of correct by construction
Design Paradigms - 2
• Waterfall model
– Design flows from one phase to another
– Phases such as algorithm development, RTL coding and functional verification, synthesis and timing verification, and physical design are all performed by different teams
– Limited reverse flow in the design process
Design Paradigms - 3
• Spiral model: teams work on multiple aspects simultaneously, performing incremental improvement
– Concurrent development of HW and SW
– Parallel verification and synthesis
– Floorplanning and place and route during synthesis
– Modules developed only if a predesigned hard or soft macro is unavailable
– Planned iteration throughout
Design Paradigms - 4
• Top-down design
– Assumes that the lowest-level blocks can be designed and built
– If not, start over
– Not compatible with maximum reuse of macros
• Top-down/bottom-up mixed design
– Design downward, but target macros, or combinations of macros, built upward
Design Paradigms - 5

• Correct by construction
– Focus on one pass design with goal of completely
correct during this pass
• Construction by correction
– Begin with the realization that multiple complete
iterations will be required
– First pass is quick to see the problems at various levels
caused by the decisions at prior levels
– Design refinement is performed several times
The Role of Reuse
• Redesign of cores such as processors, bus interfaces, DSP processors, DRAM controllers, RAMs, etc. is not cost-effective
• Redesign of common blocks such as ALUs, barrel shifters, adders, and multipliers is likewise not cost-effective
• Availability of well-designed macros, particularly parameterizable versions, can greatly reduce cost
Macros, Cores and Blocks - 1

• Terms in header used synonymously


• Other terms:
– Subblock - subcomponent of a macro, core, or block -
too small or specific to be a stand-alone component
– Hard macro - delivered to the integrator as a GDSII file (the tape-out file for fabrication); fully designed, placed, and routed by the supplier
Macros, Cores and Blocks - 2

– Soft macro - delivered to the integrator as synthesizable RTL code; may also include test benches, synthesis scripts, etc.
– Firm macro - (defined by Virtual Socket Interface
Alliance) - RTL code with supplemental physical
design information
System Design Rules and Guidelines - 1
 Timing and Synthesis Issues
 Rule - synchronous and register-based
 Rule - document clock domains and
frequencies
• Guideline - use the smallest possible number of clock domains
• Guideline - if a phase-locked loop (PLL) is used, provide a disable or bypass
 Rule - document reset strategy
System Design Rules and Guidelines - 2
 Rule - document reset strategy
• Guideline - if asynchronous, must be de-
asserted synchronously
• Guideline - Synchronous preferred
 Rule - overall design goals for timing, area,
and power should be documented before
macros designed or selected - overall synth
methodology planned early
System Design Rules and Guidelines - 3

• Functional Design Issues


– Rule - Design of on-chip buses that interconnect
the various blocks must be an integral part of
macro selection and the design process
– Rule - Develop strategy for bring-up and debug
early in the design process
• Guideline - provide controllability and observability,
the keys to easy debug
System Design Rules and Guidelines - 4

• Physical Design Issues


– Rule - A floorplanning, placement, and routing strategy for the combination of hard and soft macros must be developed before hard macros are selected or designed
• Comment - Hard macros can be very
detrimental to place and route
System Design Rules and Guidelines - 5

– Rule - Floorplanning must begin early in the


design process
– Decide on the basic clock distribution
structure early in the design process
• Guideline - Low speed synchronous bus
between modules - High speed local clocks
synchronous to the bus clock by PLLs or
buffering - multiple of bus clock
System Design Rules and Guidelines - 6

• Verification
– Rule - strategy must be developed and
documented before macro selection or
design begins
• Guideline - selection of verification tools can
affect the coding style of macros and the
design - testbench design must be started early
in the design process
System Design Rules and Guidelines - 7

• Manufacturing Test Strategies


– Rule - system-level chip manufacturing test
strategy must be documented
• Guideline - On-chip test structures are
recommended for all blocks - different test
strategies for different blocks - master test
controller

System Design Rules and Guidelines - 8

• Guideline - Built-In Self-Test (BIST)


recommended for RAMs - also non-BIST data
retention tests
• Guideline - microprocessor tests usually involve
parallel vectors and serial scan - test controller
must provide for both
• Guideline - for other blocks, full scan is good
choice - sometimes with logic BIST

RTL Coding Guidelines - 1

• Fundamental Principles
• Basic Coding Practices
• Coding for Portability
• Guidelines for Clocks and Resets
• Coding for Synthesis
• Partitioning for Synthesis
• Designing with Memories

RTL Coding Guidelines - 2
• Fundamental Principles
– Use simple constructs, basic types (for VHDL), and simple
clocking schemes
– Be consistent in coding style, naming, and style for
processes and state machines
– Use a regular partitioning method with module outputs
registered and modules of about the same size
– Use comments, meaningful names and constants and
parameters instead of numbers

RTL Coding Guidelines - 3

• Basic Coding Practices


– Rule - Develop a naming convention for the
design
• Use lowercase letters for signal names, variable names,
and port names
• Use uppercase letters for constants and user-defined
types
• Use meaningful names
• Keep parameter names short, but descriptive

RTL Coding Guidelines – 4
• Basic Coding Practices (continued)
• Use clk for the clock signal. If multiple clocks, use clk as the prefix
for all clock signals.
• Use the same name for all clk signals driven by same source
• For active low signal, end with underscore _n for standardization
• Use rst for reset signals - if active low, use rst_n
• For buses, use consistent bit ordering; recommend (y downto x)
(VHDL) or (x:0) (Verilog)
• Use same name for connected ports and signals
• Use *_r for register output, *_a for asynchronous signals, *_pn for
signals in phase n, *_nxt for data in to register *_r, and *_z for
internal, 3-state signal.
– Many more!

RTL Coding Guidelines - 5
• Coding for Portability
– Rule (VHDL) - Use only IEEE standard types
• Use std_logic instead of std_ulogic
• Be conservative re number of created types
• Do not use bit or bit_vector (no built in arithmetic)
– Do not use hard-coded values
– (VHDL) Collect all parameter values and function definitions
into a package DesignName_package.vhd
– (Verilog) Keep ‘define statements in a separate file
DesignName_params.v

RTL Coding Guidelines - 6
– Avoid embedding dc_shell scripts to avoid unintended
execution with negative impact and obsolescence
– Use technology-independent libraries to maintain technology
independence (e.g., DW in Synopsys)
– Avoid instantiating gates in designs
– If technology-specific gates must be instantiated, isolate in
separate module
– If gate instantiated, use technology-independent library (e.g.,
GTECH in Synopsys)

RTL Coding Guidelines - 7
– Code for translation between VHDL and
Verilog
• Do not use reserved keywords from Verilog as
identifiers in a description in VHDL and vice-versa.
• In VHDL, do not use:
– generate
– block
– Code to modify constant declarations

RTL Coding Guidelines - 8
• Guidelines for Clocks and Resets
– Avoid mixed clock edges
• Duty cycle of clock becomes critical in timing analysis
• Separate serial scan handling required
• If required, worst case duty cycle(s) must be accurately modeled,
duty cycle documented, and + and - edge flip-flops in separate
modules
– Avoid clock buffers (done in physical design)
– Avoid gated clocks (technology specific, timing dependent,
and non-scannable)

RTL Coding Guidelines – 9
– Avoid internally-generated clocks (logic they clock cannot
be scanned; synthesis constraints difficult to write)
– Avoid internally generated resets
– If gated clocks, or internally-generated clocks or resets
required, do in separate module at the top level of the
design and partition into modules using single clock and
reset.
– Model gated clocks as if registers enabled.
– Model complex reset by generating reset signal in
separate module

RTL Coding Guidelines - 10
• Coding for Synthesis
– Infer registers
– Avoid latches
– Avoid combinational feedback
– Specify complete sensitivity lists
– In Verilog, always use non-blocking assignments inside always @(posedge clk) or always @(negedge clk) blocks
– In VHDL, signals preferred to variables, but
variables can be used with caution

RTL Coding Guidelines - 11
• Coding for Synthesis (continued)
– Use case over if-then-else whenever priority
structure not required.
– Use separate processes for sequential state
register and combinational logic
– In VHDL, create an enumerated type for state
vector. In Verilog, use ‘define. Why ‘define rather
than parameters?
– Keep FSM logic separate

RTL Coding Guidelines - 12

• Partitioning for Synthesis


– Register all outputs
– Keep related logic together
– Separate logic with different design goals
– Avoid asynchronous logic
– Keep mergeable logic within a given module
– Avoid point-to-point timing exceptions such as multicycle paths and false paths
– Avoid top-level glue logic

RTL Coding Guidelines - 13

• Designing with Memories


– Partition address and data registers and write
enable logic in a separate module (allows use with
both synchronous and asynchronous memories)
– Add interface module for asynchronous memory

Macro Synthesis Guidelines - 1

• The basic timing budget for the macro must be specified
• Timing budgets must be developed for each subblock in the macro
• Initially synthesized to a single technology library
• For productization, synthesized to multiple technology libraries

Macro Synthesis Guidelines - 2
• Subblock Synthesis
– Typically a compile, characterize, write-script, reoptimize approach
• Macro Synthesis
– Compile individual subblocks
– Characterize-compile overall
– Perform incremental compile
• Use of RAM Compiler and Module Compiler

Developing Hard Macros

• Hard Macro Design Process


– Fig. 8-1 RMM
• Models and Documentation
– Behavioral
– Functional
– Timing
– Floorplan

Macro Deployment - 1
• Deliverables
– Soft Macros
• RTL code
• Support files
• Documentation
• See Table 9-1
– Hard Macros
• Broad set of integration models
• Documentation for integration into final chip
• Design archive of files
• See Tables 9-2 and 9-3

System Integration
• The integration process
– Selection of macros
– Design of macros
– Verification of macros
– The design flow - See Fig. 10-1 RMM
– Verification of design
• Overall: Verification important throughout the
macro design and system design

Partitioning in
Hardware/Software
Co-Design
Overview Of a Partitioner
Closer Look At Partitioner
Issues Involved during Partitioning
Process
– Nature of Application
– Target Architectures
– Interplay Of Granularity and Estimation
– Closeness Metrics
– Cost Function
Nature Of Application
• Computation oriented systems
– Workstations, PC’s or scientific parallel computers
• Control-Dominated Systems
– React to external events
• Data-Dominated Systems
– Complex transformation or transportation of data
– e.g., DSP or router
• Mixed Systems
– e.g., mobile phone or motor control
Architecture for control dominated
systems
• Each FSM mapped to a process
• Small Variable set – FSM state
• Short Program segments – FSM transitions
• Explosion of states and transitions – Issue of Code
Size
• Shared Memory architecture
• Optimizations – bit manipulations, few operations per state transition
• e.g., 8051, Motorola MC68332, Siemens 80C166
Architecture for Data Oriented Systems
• Emphasis on high throughput rather than short latency deadlines
• Large data variables – memory optimization
• Periodic behaviour of system parts
– Static schedule
• Transformations for high concurrency, such as loop unrolling
• Specialized control, data path, and interconnect function units
• A priori known address sequences and operations – memory and address unit specialization
• e.g., DSP applications – ADSP-21060, TMS320C80
Mixed Systems
• Interconnected data and control dominated
functions
• Approaches
– Heterogeneous systems – independently controlled, communicating, specialized components
– Computation applications without specific specialization potential
• e.g., printer or scanner controller
– Tailoring of less specialized systems to an application domain – e.g., minimize power consumption or cost for a required level of performance
• e.g., ARM family, Motorola ColdFire family
Modern Embedded Architectures
• Highly multiplexed data path processors
• ASIPs
– Optimized for the speed, performance, and power characteristics of the application; can be reused, which reduces cost
• VLIW processors
– Network of horizontally programmable execution units
• Commercial programmable DSPs (Harvard architecture)
– Separate program and data memories
– Instruction set is tuned to multiply-accumulate operations
Granularity Level
• Coarse Grain Partitioning
– Task / Process or Function level
• Fine Grain Partitioning
– Operator ,Statement or Basic Block Level
• Even lower level of Assembly Language not
useful – Based upon processor details
Fine Grain Granularity
• Becomes important as processor performance and
system software increases.
• Less obvious , more difficult and time consuming
and can have high overheads.
– Communication time overhead.
– Communication area overhead – May require buffers or
memories.
– Interlocks.
– Change in efficiency of compiler optimizations , pipelines
and concurrent units utilizations.
Coarse Grain Granularity
• Limits parallelism
• Reduces time and error during estimations
• Better suited for manual partitioning
Closeness Metrics
• Measures the likelihood that two pieces of
specification are mapped on to the same system
component.
• Metrics.
– Connectivity.
• Measures no. of wires shared between two behaviours.
– Communication.
• Measures amount of data transferred between two
behaviours.
– Constrained Communication.
• Measures communication metric between those behaviours
with given performance constraints.
– Common accessors.
• Grouping of behaviours(or variables) accessed via subroutine
calls and variable read/write by many of same behaviours
reduces inter component communication.
– Sequential Execution.
• If two behaviours are defined sequentially in specification ,
mapping on to same processor does not affect performance.
– Hardware Sharing.
• Measures the amount of hardware that two behaviours can
share.
– Balanced Size.
• Achieves a final partition of groups that are roughly balanced
in hardware size.Otherwise above metrics
lead to a single group.
Structural/Functional Partitioning
• Functional Partitioning.
– Partitions a functional specification into smaller sub-
specifications and synthesizes structure for each.
– Isolates a function to one part.
• Reduces I/O.
• Prevents critical path from crossing parts thus reducing clock
period.
• Yields simpler hardware , reducing clock period.
• Complete control over I/O allowing tradeoff with performance.
– Reduces synthesis tool times and memory usage.
• Structural Partitioning.
– A structure is synthesized for the entire
specification and then partitioned.
– Size and Delay can be estimated quickly and
accurately.
– It cannot satisfy both size and I/O constraints.
– Placement and routing can be done more efficiently.
– Not suitable for large systems.
Partitioning Algorithms
• Random Mapping
• Multistage Clustering
• Hierarchical Clustering
• Group Migration
• Ratio Cut
• Simulated Annealing
• Genetic Evolution
• ILP Formulation
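To make the move-based algorithms in this list concrete, below is a minimal, generic C++ sketch of a simulated-annealing partitioner over a set of blocks; the cost function is a placeholder supplied by the caller and nothing here is tied to any particular tool:

// Illustrative move-based partitioner using simulated annealing.
// partition[i] == true maps block i to hardware, false to software.
#include <cmath>
#include <cstdlib>
#include <functional>
#include <vector>

using Partition = std::vector<bool>;
using CostFn    = std::function<double(const Partition&)>;

Partition anneal(Partition p, const CostFn& cost,
                 double T = 100.0, double cooling = 0.95, int movesPerTemp = 200)
{
    double c = cost(p);
    while (T > 1e-3) {
        for (int m = 0; m < movesPerTemp; ++m) {
            std::size_t i = std::rand() % p.size();   // pick a random block
            p[i] = !p[i];                             // move it to the other partition
            double delta = cost(p) - c;
            // always accept improvements; accept worsening moves with prob. e^(-delta/T)
            if (delta <= 0 ||
                std::exp(-delta / T) > (double)std::rand() / RAND_MAX)
                c += delta;
            else
                p[i] = !p[i];                         // undo the move
        }
        T *= cooling;                                 // cool down
    }
    return p;
}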
Cosyma
• Target Architecture
– standard RISC processor core
– a fast RAM for program and data with single clock cycle
access time
– an automatically generated application specific
coprocessor.
– Peripheral units must be inserted by the designer.
– Processor and coprocessor communicate via shared
memory in mutual exclusion
• Granularity
– Partitioning works at the basic block level.
• Since communication between basic blocks of a
process is implicit , partitioning requires
communication analysis.
• Simulate on an RT-level model of the target
processor to obtain profiling and software timing
information
Hardware/Software Partitioning
• Inputs to partitioning are the ESG with profiling (or control-flow analysis) information, the CDR file, and synthesis directives, which include channel-mapping directives, partitioning directives, and component selection.
• Starts with an all software solution and tries to extract
hardware components iteratively until all timing
constraints are met.
• The partitioning goals are
– meet real-time constraints
– minimize hardware costs
– minimize the CAD system response time
Algorithm & Cost function

• It uses Simulated Annealing, a stochastic optimization


algorithm.
• The total (estimated) costs of a single basic block b - assumed
that it is moved from software to hardware - amounts to :
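One plausible shape for this cost, stated only in terms of the quantities defined on the next slide and offered as an assumption rather than the exact COSYMA formula, is

    Δc(b) = w · ( thw(b) − tsw(b) + tcom(Z ∪ b) − tcom(Z) )

where Z would denote the set of basic blocks already mapped to hardware and w a weighting factor.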
Continued….
• tsw(b) is estimated with a local source-code timing estimation based on simulation data
• thw(b) is estimated with a list scheduler
• tcom(Z ∪ b) is estimated by data-flow analysis
LYCOS
• Supports an easy inclusion of new design tools
and algorithms and new design methods.
• It is built as a suite of tools centered around
an implementation independent model of
computation called Quenya, based upon
communicating CDFGs.
LYCOS Partitioning Tool
• Input Specification is in form of CDFG.
• Granularity is chosen by the user interactively.
• Different processor architectures whose technology
files are present can be selected.
• Dedicated hardware units are selected by loading the hardware library file, which contains area, delay, latency, provided operations, storage capabilities, etc.
• Software execution time is estimated using
CDFG and selected processor technology
file.
• Hardware execution time is estimated using
a dynamic list based scheduling algorithm.
• Partitioning is done using any of the
selected algorithms.
• Allows better design space exploration.
• Traditionally hardware/software partitioning
was accomplished manually.
• However, with the increased complexity in
embedded systems, researchers currently
prefer an automatic approach to handle this
problem.
Hardware/Software Partitioning by
Particle Swarm Optimization Algorithm
https://towardsdatascience.com/genetic-algorithm-a-simple-and-intuitive-guide-51c04cc1f9ed
http://www.scholarpedia.org/article/Ant_colony_optimization
https://en.wikipedia.org/wiki/Integer_programming
The Particle Swarm
Optimization Algorithm
Summary
• Introduction to Particle Swarm Optimization
(PSO)
– Origins
– Concept
– PSO Algorithm

• PSO for the Bin Packing Problem (BPP)


– Problem Formulation
– Algorithm
– Simulation Results
Introduction to the PSO: Origins
• Inspired by the social behavior and dynamic, communicating movements of insects, birds, and fish in nature
Introduction to the PSO: Origins
• In 1986, Craig Reynolds described this process with 3 simple behaviors:
– Separation: avoid crowding local flockmates
– Alignment: move towards the average heading of local flockmates
– Cohesion: move toward the average position of local flockmates
Introduction to the PSO: Origins

• Application to optimization: Particle Swarm Optimization

• Proposed by James Kennedy & Russell Eberhart (1995)

• Combines self-experiences with social experiences


Introduction to the PSO: Concept
 Uses a number of agents (particles)
that constitute a swarm moving
around in the search space looking for
the best solution

 Each particle in search space adjusts


its “flying” according to its own flying
experience as well as the flying
experience of other particles
Introduction to the PSO: Concept
• Collection of flying particles (swarm) - Changing solutions

• Search area - Possible solutions

• Movement towards a promising area to get the global


optimum

• Each particle keeps track:


– its best solution, personal best, pbest

– the best value of any particle, global best, gbest


Introduction to the PSO: Concept
• Each particle adjusts its travelling speed dynamically
corresponding to the flying experiences of itself and its
colleagues
 Each particle modifies its
position according to:
• its current position
• its current velocity

• the distance between its


current position and pbest

• the distance between its


current position and gbest
Introduction to the PSO: Algorithm - Neighborhood
[Figures: three neighborhood topologies - geographical, social, and global]
Introduction to the PSO: Algorithm -
Parameters
 Algorithm parameters
– A : Population of agents

– pi : Position of agent ai in the solution space

– f : Objective function

– vi : Velocity of agent’s ai

– V(ai) : Neighborhood of agent ai (fixed)

 The neighborhood concept in PSO is not the same as the one


used in other meta-heuristics search, since in PSO each
particle’s neighborhood never changes (is fixed)
Introduction to the PSO: Algorithm
[x*] = PSO()
P = Particle_Initialization();
For i=1 to it_max
For each particle p in P do
fp = f(p);
If fp is better than f(pBest)
pBest = p;
end
end
gBest = best p in P;
For each particle p in P do
v = v + c1*rand*(pBest – p) + c2*rand*(gBest – p);
p = p + v;
end
end
Introduction to the PSO: Algorithm
 Particle update rule
p=p+v
 with
v = v + c1 * rand * (pBest – p) + c2 * rand * (gBest – p)
 where
• p: particle’s position
• v: path direction
• c1: weight of local information
• c2: weight of global information
• pBest: best position of the particle
• gBest: best position of the swarm
• rand: random variable
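A minimal C++ rendering of this update for one particle in a continuous search space (parameter values and structure are illustrative only; c1 = c2 = 2 matches the C1 + C2 = 4 guideline on the next slide):

#include <random>
#include <vector>

struct Particle {
    std::vector<double> x;      // current position
    std::vector<double> v;      // current velocity
    std::vector<double> pBest;  // best position found by this particle so far
};

// One application of: v = v + c1*rand*(pBest - p) + c2*rand*(gBest - p); p = p + v
void updateParticle(Particle& p, const std::vector<double>& gBest,
                    double c1 = 2.0, double c2 = 2.0)
{
    static std::mt19937 gen{std::random_device{}()};
    std::uniform_real_distribution<double> rnd(0.0, 1.0);

    for (std::size_t d = 0; d < p.x.size(); ++d) {
        p.v[d] += c1 * rnd(gen) * (p.pBest[d] - p.x[d])
                + c2 * rnd(gen) * (gBest[d]   - p.x[d]);
        p.x[d] += p.v[d];
    }
}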
Introduction to the PSO: Algorithm -
Parameters
 Number of particles usually between 10 and 50

 C1 is the importance of personal best value

 C2 is the importance of neighborhood best value

 Usually C1 + C2 = 4 (empirically chosen value)

 If velocity is too low → algorithm too slow

 If velocity is too high → algorithm too unstable


Introduction to the PSO: Algorithm
1. Create a 'population' of agents (particles) uniformly distributed over X
2. Evaluate each particle's position according to the objective function
3. If a particle's current position is better than its previous best position, update it
4. Determine the best particle (according to the particles' previous best positions)
Introduction to the PSO: Algorithm
5. Update the particles' velocities (using the update rule above)
6. Move the particles to their new positions
7. Go to step 2 until the stopping criteria are satisfied


Introduction to the PSO: Algorithm
A particle's velocity has three components:
1. Inertia - makes the particle keep moving in the same direction with the same velocity
2. Personal influence - makes the particle return to a previous position that was better than the current one (conservative)
3. Social influence - makes the particle follow the direction of its best neighbors
Introduction to the PSO: Algorithm
 Intensification: explores the previous solutions, finds
the best solution of a given region

 Diversification: searches new solutions, finds the


regions with potentially the best solutions

 In PSO:
Introduction to the PSO: Algorithm - Example
[Sequence of figures showing the swarm converging toward the optimum over successive iterations]
Introduction to the PSO: Algorithm
Characteristics
• Advantages
– Insensitive to scaling of design variables
– Simple implementation
– Easily parallelized for concurrent processing
– Derivative free
– Very few algorithm parameters
– Very efficient global search algorithm

• Disadvantages
– Tendency toward fast, premature convergence at local optimum points
– Slow convergence in refined search stage (weak local search ability)
Introduction to the PSO: Different
Approaches
• Several approaches
– 2-D Otsu PSO
– Active Target PSO
– Adaptive PSO
– Adaptive Mutation PSO
– Adaptive PSO Guided by Acceleration Information
– Attractive Repulsive Particle Swarm Optimization
– Binary PSO
– Cooperative Multiple PSO
– Dynamic and Adjustable PSO
– Extended Particle Swarms
– …
Davoud Sedighizadeh and Ellips Masehian, “Particle Swarm Optimization Methods, Taxonomy and Applications”.
International Journal of Computer Theory and Engineering, Vol. 1, No. 5, December 2009
On solving Multiobjective Bin Packing Problem
Using Particle Swarm Optimization

D.S Liu, K.C. Tan, C.K. Goh and W.K. Ho


2006 - IEEE Congress on Evolutionary Computation

• First implementation of PSO for BPP


PSO for the BPP:
Problem Formulation
• Multi-Objective 2D BPP

• Maximum of I bins with width W and height H

• J items with wj ≤ W, hj ≤ H and weight ψj

• Objectives
– Minimize the number of bins used K
– Minimize the average deviation between the overall
centre of gravity and the desired one
PSO for the BPP:
Initialization

• Usually generated randomly


• In this work:
– Solution from Bottom Left Fill (BLF) heuristic

– To sort the rectangles for BLF:


• Random

• According to a criteria (width, weight, area, perimeter..)


PSO for the BPP:
Initialization BLF

• Item moved to the right if an intersection is detected at the top
• Item moved to the top if an intersection is detected at the right
• Item moved if there is a lower available space for insertion
PSO for the BPP:
Algorithm
• Velocity depends on either pbest or gbest, never both at the same time (the update uses one term or the other)
PSO for the BPP:
Algorithm
1st Stage:
• Partial Swap between 2 bins
• Merge 2 bins
• Split 1 bin

2nd Stage:
• Random rotation

3rd Stage:
• Random shuffle

Mutation modes for a single particle


PSO for the BPP:
Algorithm

HMOPSO: Hybrid Multi-Objective Particle Swarm Optimization
[Figure: the flowchart of HMOPSO]
PSO for the BPP:
Problem Formulation
 6 classes with 20 instances randomly generated
 Size range:
– Class 1: [0, 100]
– Class 2: [0, 25]
– Class 3: [0, 50]
– Class 4: [0, 75]
– Class 5: [25, 75]
– Class 6: [25, 50]

 Class 2: small items → more difficult to pack


PSO for the BPP:
Simulation Results
• Comparison with 2 other methods
– MOPSO (Multiobjective PSO) from [1]

– MOEA (Multiobjective Evolutionary Algorithm) from [2]

• Definition of parameters:

[1] Wang, K. P., Huang, L., Zhou C. G. and Pang, W., “Particle Swarm Optimization for Traveling Salesman Problem,”
International Conference on Machine Learning and Cybernetics, vol. 3, pp. 1583-1585, 2003.

[2] Tan, K. C., Lee, T. H., Chew, Y. H., and Lee, L. H., “A hybrid multiobjective evolutionary algorithm for solving truck
and trailer vehicle routing problems,” IEEE Congress on Evolutionary Computation, vol. 3, pp. 2134-2141, 2003.
PSO for the BPP:
Simulation Results
• Comparison on the performance of metaheuristic
algorithms against the branch and bound method (BB) on
single objective BPP

• Results for each algorithm in 10 runs

• The proposed method (HMOPSO) is capable of evolving more optimal solutions than BB in 5 out of 6 classes of test instances
PSO for the BPP:
Simulation Results

Number of optimal solution obtained


PSO for the BPP:
Simulation Results
• Computational Efficiency
– stop after 1000 iterations or no improvement in last 5 generations
– MOPSO obtained inferior results compared to the other two
PSO for the BPP:
Conclusions
• Presentation of a mathematical model for MOBPP-2D

• MOBPP-2D solved by the proposed HMOPSO

• BLF chosen as the decoding heuristic

• HMOPSO is a robust search optimization algorithm


– Creation of variable length data structure

– Specialized mutation operator

• HMOPSO performs consistently well with the best average performance on


the performance metric

• Outperforms MOPSO and MOEA in most of the test cases used in this
paper
Design Methodology for Embedded Memories
• https://hal.archives-ouvertes.fr/hal-00181196/document
• https://www.ics.uci.edu/~dutt/pubs/bc12-hipc02-panda.pdf
• http://www.interradesign.com/pdf/MC2_Datasheet.pdf
SystemC
Design Challenges
Reference: http://www.doulos.com
Silicon Complexity vs. Software Complexity
• Silicon complexity is growing 10x every 6 years
• Software complexity in systems is growing faster than 10x every 6 years
Reference: http://www.doulos.com
Increasing Complexity in SoC Designs

• One or more processors
– 32-bit microcontrollers
– DSPs or specialized media processors
• On-chip memory
• Special function blocks
• Peripheral control devices
• Complex on-chip communications (on-chip busses, network)
• RTOS and embedded software, which form a layered architecture
• ……
How to Conquer the Complexity?
• Modeling strategies
– Use the appropriate modeling for different levels
– The consistency and accuracy of the model
• Reuse existing designs
– IP reuse
– Architecture reuse (platform-based design)
• Partition
– Based on functionality
– Hardware and software
Traditional System Design Flow (1/2)

• Designers partition the system into hardware and software early in the flow
• HW and SW engineers design their respective components in isolation
• HW and SW engineers do not talk to each other
• The system may not be the suitable solution
• Integration problems
• High cost and long iterations
Traditional System Design Flow (2/2)

nn System
System Level
Level Design
Hardware and
Design
nn Hardware and Software
Algorithm Developmen
nn Algorithm Development
Software
Processor tSelection
nn Processor Selection
Done mainly
nn Done mainly in
in C/C++
C/C++ C/C++
C/C++
Environment
Environment

ICDevelopment
nnIC Development Verificatio Process SoftwareDesign
Software Design
Hardware
nnHardware n w w CCooddeeDDeevveeloloppm
Implementation
nnImplementation wmRReTeTnO
nO
ttSSddeetta
Decisions
nnDecisions
$$$ ilDone
w aDone mainly in
isls mainly in
Donemainly
mainly in
in HDL
HDL C/C++
C/C++
nnDone
EDA
EDA CC//CC++++EEnnvviriroonn
Environment
Environment mmeenntt
Reference : Synopsys
9
Typical Project Schedule
[Gantt chart: System Design, Hardware Design, Prototype Build, Hardware Debug, Software Design, Software Coding, Software Debug, ending at Project Complete]
Reference: Mentor Graphics
Former Front-End Design Flow
[Flow diagram: a C/C++ system-level model is analyzed and simulated, and the results drive refinement; the model is then converted by hand to Verilog, with a Verilog testbench, for simulation and synthesis]
Reference: DAC 2002 SystemC Tutorial
Problems with the Design Flow
[Same flow as the previous slide, annotated with its problems: the hand conversion from C/C++ to Verilog is not reusable, it is not done by the system designers, and the netlist is not preserved]
Reference: DAC 2002 SystemC Tutorial
Shortcomings of the Current System Design Flow
• Natural language is used to describe the system specification
– Cannot verify the desired functions directly
• Requires many experts in system architecture for the partition of software and hardware parts
– The partition may not be the optimal solution
• Hardware designers have to restart the design process by capturing the designs using HDLs
– May have mismatch problems
• Hardware and software integration is often painful
– Hardware and software cannot work together
– Co-verification of hardware and software is inefficient
Concurrent HW/SW Design
• Can provide a significant performance improvement for embedded system design
– Allows earlier architecture closure
– Reduces risk by 80%
• Allows HW/SW engineering groups to talk together
• Allows earlier HW/SW integration
• Reduces the design cycle
– Develop HW/SW in parallel
– 100x faster than RTL
Project Schedule with HW/SW Co-design
[Gantt chart: System Design, Hardware Design, Prototype Build, Hardware Debug, Software Design, Software Coding, and Software Debug overlap, reaching Project Complete earlier than in the traditional schedule]
Reference: Mentor Graphics
Modern System Design Flow
Specification of the
System

System Level Modeling

Hardware and
Software
Partitioning

Architectural
Exploration

H/W Model S/W Model

H/W Design Flow S/W Development

Integration and Verification

16
Outline
• Introduction
• System Modeling Languages
• SystemC
– Overview
– Data-Types
– Processes
– Interfaces
– Simulation Supports
• System Design Environments
• HW/SW Co-Verification
• Conclusion
Motivation to Use a Modeling
Language
l The increasing system design complexity
l The demand of higher level abstraction
and modeling
• Traditional HDLs (Verilog, VHDL, etc.) are not well suited for system-level design
– Lack of software support
l To enable an efficient system design flow

18
Requirements of a System Design Language
• Support system models at various levels of abstraction
• Incorporation of embedded software portions of a complex system
– Both models and implementation-level code
• Creation of executable specifications of design intent
• Creation of executable platform models
– Represent possible implementation architectures on which the design intent will be mapped
Requirements of a System Design
Language
l Fast simulation speed to enable design-space
exploration
– Both functional specification and
architectural implementation alternatives
l Constructs allowing the separation of
system function from system
communications
– In order to allow flexible adaptation and reuse of
both models and implementation
l Based on a well-established programming language
– In order to capitalize on the extensive infrastructure of
capture, compilation, and debugging tools already
available
20
Model Accuracy
Requirements
• Structural accuracy
• Timing accuracy
• Functional accuracy
• Data organization accuracy
• Communication protocol accuracy
System Level Languages
[Taxonomy of system-level modeling languages:
• C/C++ based (libraries/extensions): SystemC, Cynlib, SoC++, Handel-C, A/RT Library
• VHDL/Verilog replacements: VHDL+, SystemVerilog
• Higher-level languages: SDL
• Entirely new languages: SLDL, SUPERLOG
• Java based: Java]
Language Use
                        C/C++       SystemC 2.0   TestBuilder, OpenVera   Verilog, VHDL   SUPERLOG
Embedded SW             Very Good   Good          NO                      NO              NO
System Level Design     OK          Excel         NO                      Very Poor       Good
Verification            OK          Good          Excel                   OK              Very Good
RTL Design              NO          Good          NO                      Excel           Excel
Reference: DAC 2002 SystemC Tutorial
Trend of System-Level Languages
l Extend existing design languages (ex: SystemVerilog)
– Pros:
• Familiar languages and environments to designers
• Allow descriptions of prior version of Verilog
– Cons:
• Not standardized yet
• Become more and more complex to learn

l Standard C/C++ based languages (ex: SystemC


)
– Pros:
• Suitable for very abstract descriptions
• Suitable to be an executable specification
– Cons:
• A new language to learn
• Need solutions for the gap to traditional design
flow 24
Evolution of Verilog
Language
• Proprietary design description language developed by Gateway
– 1990: donated to OVI (Open Verilog International) by Cadence
• 1993: Verilog Hardware Description Language LRM V2.0 by OVI
• IEEE Std. 1364-1995, "Verilog 1.0" (called Verilog-1995)
• Synopsys proposed Verilog-2000, including a synthesizable subset
• IEEE Std. 1364-2001, "Verilog 2.0" (called Verilog-2001) - 1st main enhancement
• SystemVerilog 3.0, approved as an Accellera standard in June 2002
– adds system-level architectural modeling
• SystemVerilog 3.1, approved as an Accellera standard in May 2003
– adds verification and C language integration (not yet an IEEE standard)
• The Verilog Standards Group (IEEE 1364) announced a project authorization request for 1364-2005
Compatibility of SystemVerilog
[Diagram: nested language subsets - earliest Verilog ⊂ Verilog-1995 ⊂ Verilog-2001 ⊂ SystemVerilog 3.0/3.1 - annotated with features such as ANSI C-style ports, named parameter passing, comma-separated sensitivity lists and attributes, initialization of variables, the semantics of "posedge" and "negedge" constructs, record-like constructs, handling of interfaces, and various keywords]
Enhancements in
SystemVerilog
• C data types
• Interfaces to encapsulate communication
• Dynamic processes
• A unique top level of hierarchy ($root)
• Verification functionality
• Synchronization
• Classes
• Dynamic memory
• Assertion mechanism
• ……
Summary about SystemVerilog
l More extension in high-level abstraction to the Verilog-
2001 standard
– Still no much enhancement in transaction-level
abstraction
l Improves the productivity and readability of Verilog code
l Provide more concise hardware descriptions
l Extends the verification aspects of Verilog by
incorporating the capabilities of assertions
– Still no coverage construct within testbench
design
l 3.0/3.1 LRM are still not very clear in more details
l Not yet simulator support
– No compiler for trying its syntax
l SV is a valuable direction to be watched
– Will it become too complex for most
28
designers/verification engineers’
requirement/understanding??
Reference for SystemVerilog
• SystemVerilog 3.1, ballot draft: Accellera's Extensions to Verilog, Accellera, Napa, California, April 2003
• Verilog 2001: A Guide to the New Verilog Standard, Stuart Sutherland, Kluwer Academic Publishers, Boston, Massachusetts, 2001
• "An overview of SystemVerilog 3.1," Stuart Sutherland, EEdesign, May 21, 2003, http://www.eedesign.com/story/OEG20030521S0086
• URLs:
1) http://www.eda.org/sv-ec/ (SystemVerilog Testbench Extension Committee)
2) http://www.eda.org/sv-ec/SV_3.1_Web/index.html (SV3.1 Web)
Why C/C++ Based Language for System
Modeling
• Specification between architects and implementers is executable
• High simulation speed due to the higher level of abstraction
• Refinement, no translation into HDL (no "semantic gap")
• Testbench re-use
Advantages of Executable
Specifications
l Ensure the completeness of specification
– Even components (e.g. Peripherals) are so complex
– Create a program that behave the same way as the
system
l Avoid ambiguous interpretation of the specification
– Avoids unspecified parts and inconsistencies
– IP customer can evaluate the functionality up-front
l Validate system functionality before implementation
– Create early model and validate system performance
l Refine and test the implementation of
the specification
– Test automation improves Time-to-Market

31
Can Traditional C++ Standard Be Used?

• C++ does not support
– Hardware-style communication
• Signals, protocols, etc.
– Notion of time
• Time-sequenced operations
– Concurrency
• Hardware and systems are inherently concurrent
– Reactivity
• Hardware is inherently reactive; it responds to stimuli and is in constant interaction with its environment
– Hardware data types
• Bit type, bit-vector type, multi-valued logic type, signed and unsigned integer types, and fixed-point types
SystemC vs. SpecC
• Constructs to model system architecture
– Hardware timing
– Concurrency
– Hardware data types (signals, etc.)
• Adding these constructs to C/C++
– SystemC
• C++ class library
• Standard C++ compiler: gcc, bcc, msvc, etc.
– SpecC
• Language extension: new keywords and syntax
• Translator for C
SystemC is…

• A library of C++ classes
– Processes (for concurrency)
– Clocks (for time)
– Hardware data types (bit vectors, 4-valued logic, fixed-point types, arbitrary-precision integers)
– Waiting and watching (for reactivity)
– Modules, ports, signals (for hierarchy)
– Abstract ports and protocols (abstract communications)
• Using channel and interface classes
SystemC Design Flow
[Flow: source files for the system and testbenches, the SystemC header files, and the SystemC libraries are compiled and linked in a standard C/C++ development environment (compiler, linker, debugger, make); the resulting executable is the simulator]
Reference: DAC 2002 SystemC Tutorial
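For instance, a minimal SystemC program of this kind (a hedged sketch, assuming the SystemC class library is installed) compiles and links into an executable that is itself the simulator:

#include <systemc.h>

// Trivial module: prints a message at every rising clock edge.
SC_MODULE(Hello) {
    sc_in<bool> clk;

    void say_hello() { std::cout << "hello @ " << sc_time_stamp() << std::endl; }

    SC_CTOR(Hello) {
        SC_METHOD(say_hello);
        sensitive << clk.pos();
        dont_initialize();
    }
};

int sc_main(int argc, char* argv[]) {
    sc_clock clk("clk", 10, SC_NS);   // 10 ns clock
    Hello h("hello");
    h.clk(clk);
    sc_start(100, SC_NS);             // run the simulation for 100 ns
    return 0;
}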
Outline
• Introduction
• System Modeling Languages
• SystemC
– Overview
– Data-Types
– Processes
– Interfaces
– Simulation Supports
• System Design Environments
• HW/SW Co-Verification
• Conclusion
SystemC Language Architecture
[Layering diagram, bottom-up:
• C++ Language Standard
• Event-Driven Simulation Kernel
• Core Language - Modules, Ports, Processes, Events, Interfaces, Channels - and Data-Types - 4-valued logic types (01XZ), 4-valued logic vectors, bits and bit-vectors, arbitrary-precision integers, fixed-point numbers, C++ user-defined types
• Elementary Channels - Signal, Timer, Mutex, Semaphore, FIFO, etc.
• Standard Channels for Various Models of Computation (Kahn process networks, static dataflow, etc.) and Methodology-Specific Channels (Master/Slave library, etc.) - not yet standard, prepared to evolve into the SystemC standard]
System Abstraction Level
(1/3)
l Untimed Functional Level (UTF)
– Refers to both the interface and
functionality
– Abstract communication channels
– Processes executed in zero time but in
order
– Transport of data executed in zero time
l Timed Functional Level (TF)
– Refers to both the interface and
functionality
– Processes are assigned an execution time
– Transport of data is assigned a time
– Latency modeled
– "Timed" but not "clocked"
System Abstraction Level
(2/3)
l Bus Cycle Accurate (BCA)
– Transaction Level Model (TLM)
– Model the communications between system
modules using shared resources such as
busses
– Bus cycle accurate or transaction accurate
• No pin-level details
l Pin Cycle Accurate (PCA)
– Fully described by HW and the
signals
communications protocol
– Pin-level details
– Clocks used for timing
39
System Abstraction (3/3)
Level
l Register Transfer Accurate
– Fully timed
– Clocks used for
synchronization
– Complete functional details
• Every register for every cycle
• Every bus for every cycle
• Every bit described for every cycle
– Ready to RTL HDL

40
Core
Language
l Time Model
– To define time unit and its resolution
l Event-Driven Simulation Kernel
– To operate on events and switch between
processes, without knowing what the
events actually represent or what the
processes do
l Modules and Ports
– To represent structural information
l Interfaces and Channels
– To describe the abstraction of
communication between the design block
41
Time Model

• Uses an integer-valued time model
• 64-bit unsigned integer
• Can be increased to more than 64 bits if necessary
• Same as in Verilog and VHDL

42
Time Model
(cont’)
• Time resolution
– Must be specified before any time objects (e.g., sc_time) are created
– Default value is one picosecond (10^-12 s)
• Time units: SC_FS (femtosecond), SC_PS (picosecond), SC_NS (nanosecond), SC_US (microsecond), SC_MS (millisecond), SC_SEC (second)

Example for 42 picoseconds:
sc_time T1(42, SC_PS);

Example for resolution:
sc_set_time_resolution(10, SC_PS);
sc_time T2(3.1416, SC_NS);
// T2 would be rounded to 3140 ps
Modules
• The basic building blocks for partitioning a design
• Declared with the SystemC keyword SC_MODULE
• Typically contain
– Ports that communicate with the environment
– Processes that describe the functionality of the module
– Internal data and communication channels for the model
– Hierarchies (other modules)
• Modules can also access a channel's interface directly
[Figure: a module containing ports, RTL processes, internal signals, and a sub-module]
Modules - Example

[Figure: FIFO block with inputs Load, Read, Data and outputs Full, Empty]

SC_MODULE(FIFO) {
    // ports, processes, internal data, etc.
    sc_in<bool>   load;
    sc_in<bool>   read;
    sc_inout<int> data;
    sc_out<bool>  full;
    sc_out<bool>  empty;

    SC_CTOR(FIFO) {
        // body of constructor:
        // process declarations, sensitivities, etc.
    }
};
Module Instantiation
[Figure: a top module instantiating sub-modules and binding their ports by positional association and by named association]
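A hedged sketch of both binding styles, using a hypothetical two-input Adder module (illustrative, not from the original slide):

#include <systemc.h>

// Hypothetical two-input adder used to show both binding styles.
SC_MODULE(Adder) {
    sc_in<int>  a, b;
    sc_out<int> sum;
    void add() { sum.write(a.read() + b.read()); }
    SC_CTOR(Adder) { SC_METHOD(add); sensitive << a << b; }
};

SC_MODULE(Top) {
    sc_signal<int> s_a, s_b, s_sum1, s_sum2;
    Adder add1, add2;

    SC_CTOR(Top) : add1("add1"), add2("add2") {
        // Positional association: signals bound in port declaration order
        add1(s_a, s_b, s_sum1);

        // Named association: each port bound explicitly by name
        add2.a(s_a);
        add2.b(s_b);
        add2.sum(s_sum2);
    }
};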
Similar Control Flow Descriptions
[Code examples: IF, CASE, and FOR constructs]
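A hedged sketch of the three constructs inside a SystemC process (plain C++ if, switch, and for; the module and port names are illustrative):

#include <systemc.h>

SC_MODULE(ControlFlowDemo) {
    sc_in<sc_uint<2> >  sel;
    sc_in<sc_uint<8> >  a, b;
    sc_out<sc_uint<8> > y;

    void comb() {
        sc_uint<8> result = 0;

        if (sel.read() == 0)              // IF
            result = a.read();
        else
            result = b.read();

        switch (sel.read().to_uint()) {   // CASE
        case 1:  result = a.read() + b.read(); break;
        case 2:  result = a.read() - b.read(); break;
        default: break;
        }

        sc_uint<8> reversed = 0;          // FOR
        for (int i = 0; i < 8; ++i)
            reversed[i] = result[7 - i].to_bool();   // bit-reverse the result

        y.write(reversed);
    }

    SC_CTOR(ControlFlowDemo) {
        SC_METHOD(comb);
        sensitive << sel << a << b;
    }
};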
Outline
• Introduction
• System Modeling Languages
• SystemC
– Overview
– Data-Types
– Processes
– Interfaces
– Simulation Supports
• System Design Environments
• HW/SW Co-Verification
• Conclusion
Data-Types
• SystemC allows users to use any C++ data types as well as unique SystemC data types
– sc_bit - 2-value single bit type
– sc_logic - 4-value single bit type
– sc_int - 1 to 64 bit signed integer type
– sc_uint - 1 to 64 bit unsigned integer type
– sc_bigint - arbitrary sized signed integer type
– sc_biguint - arbitrary sized unsigned integer type
– sc_bv - arbitrary sized 2-value vector type
– sc_lv - arbitrary sized 4-value vector type
– sc_fixed - templated signed fixed-point type
– sc_ufixed - templated unsigned fixed-point type
– sc_fix - untemplated signed fixed-point type
– sc_ufix - untemplated unsigned fixed-point type
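A few of these types in use (a minimal, hedged sketch; the values are arbitrary):

#define SC_INCLUDE_FX   // needed for the fixed-point types
#include <systemc.h>

int sc_main(int, char*[]) {
    sc_bit        b('1');                   // 2-valued single bit
    sc_logic      l('Z');                   // 4-valued single bit
    sc_int<12>    i = -1024;                // 12-bit signed integer
    sc_uint<12>   u = 3000;                 // 12-bit unsigned integer
    sc_bigint<96> big(-1);                  // 96-bit signed integer
    sc_bv<16>     bv("1010101010101010");   // 16-bit 2-valued vector
    sc_lv<8>      lv("ZZZZ0000");           // 8-bit 4-valued vector
    sc_fixed<8,4> fx = 3.25;                // 8 bits total, 4 integer bits

    std::cout << b << " " << l << " " << i << " " << u << " "
              << big << " " << bv << " " << lv << " " << fx << std::endl;
    return 0;
}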
Type sc_bit

l Type sc_bit is a two-valued data type representing a single bit


l Value ’0’ = false
l Value ’1’ = true

sc_bit operators:
Bitwise: & (and), | (or), ^ (xor), ~ (not)
Assignment: =, &=, |=, ^=
Equality: ==, !=

For example:
sc_bit a, b;   // declaration
a = a & b;
a = a | b;
Type sc_logic
l The sc_logic has 4 values, ’0’(false), ’1’(true), ’X’
(unknown), and ’Z’ (high impedance or floating)
l This type can be used to model designs with multi-driver
busses, X propagation, startup values, and floating busses
l The most common type in RTL simulation

sc_logic operators:
Bitwise: & (and), | (or), ^ (xor), ~ (not)
Assignment: =, &=, |=, ^=
Equality: ==, !=

For example:
sc_logic x;   // object declaration
x = '1';      // assign a 1 value
x = 'Z';      // assign a Z value
Fixed Precision Unsigned and Signed
Integers

l The
C++ int type is machine dependent, but
usually 32 bits
l SystemC integer type provides integers from 1 to
64 bits in signed and unsigned
forms
l sc_int<n>
– A Fixed Precision Signed Integer
– 2’s complement notation
l sc_uint<n>
– A Fixed Precision Unsigned Integer

52
The Operators of sc_int<n> and sc_uint<n>
[Table: arithmetic, bitwise, relational, assignment, and shift operators supported by sc_int and sc_uint]
Bit Select and Part Select
[Examples of bit-select ([ ]) and part-select (range()) operations]
The Examples of sc_int<n> and sc_uint<n>
sc_int<64> x;    // declaration example
sc_uint<48> y;   // declaration example

sc_int<16> x, y, z;
z = x & y;       // perform an and operation on x and y bit by bit
z = x >> 4;      // assign x shifted right by 4 bits to z

To select one bit of an integer, use the bit-select operator:
sc_uint<8> myint;
sc_logic mybit;
mybit = myint[7];

To select more than one bit, use the range method:
sc_uint<32> myint;
sc_uint<4> myrange;
myrange = myint.range(7,4);

Concatenation operation:
sc_uint<4> inta;
sc_uint<4> intb;
sc_uint<8> intc;
intc = (inta, intb);
Arbitrary Precision Signed and
Unsigned
Integer Types
• For cases where some operands have to be larger than 64 bits, the sc_int and sc_uint types will not work
• The sc_bigint (arbitrary sized signed integer) and sc_biguint (arbitrary sized unsigned integer) types solve this problem
• These types allow the designer to work on integers of any size, limited only by underlying system limitations
• Arithmetic and other operators also use arbitrary precision when performing operations
• These types execute more slowly than their fixed precision counterparts and therefore should only be used when necessary
The Operators of the sc_bigint<n> and
sc_biguint<n>
l Type sc_bigint is a 2’s complement signed integer of any size
l Type sc_biguint is an unsigned integer of any size
l The precision used for the calculations depends on the sizes
of the operands used

[Table: operators supported by sc_bigint and sc_biguint]
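A small hedged usage sketch (illustrative values only):

#include <systemc.h>

int sc_main(int, char*[]) {
    sc_bigint<128>  a(1);
    a <<= 100;                        // 2^100: far beyond the 64-bit sc_int range
    sc_biguint<128> b(12345);

    sc_bigint<129>  sum(a + b);       // precision follows the operand sizes
    std::cout << "sum = " << sum.to_string(SC_DEC) << std::endl;
    return 0;
}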
Arbitrary Length Bit Vector - sc_bv<n>

l The sc_bv is a 2-valued arbitrary length vector


to be used for large bit vector manipulation
l The sc_bv type will simulate faster than the
sc_lv type
– Without tri-state capability and arithmetic
operations
l Type sc_biguint could also be used for
these operations, but
– It is optimized for arithmetic operations, not
bit manipulation operations
– Type sc_bv will still have faster simulation time

58
The New Operators for
sc_bv<n>
l The new operators perform bit reduction
– and_reduce()
– or_reduce()
– xor_reduce()

sc_bv<64> databus;
sc_logic result;
result =
databus.or_reduce();
If databus contains 1 or more 1 values the result of the reduction will be 1.

If no 1 values are present the result of the reduction will be 0 indicating


that
databus is all 0’s.
59
The Operators of the sc_bv<n>
[Table: operators supported by sc_bv]
Arbitrary Length Logic Vector - sc_lv<n>

l The sc_lv<n> data-type represents an arbitrary


length vector in which each bit can have one of
four values
l To support designs that need to be modeled with tri-state
  capabilities
l These values are exactly the same as the four
values of type sc_logic
l
Type sc_lv<n> is just a sized array of sc_logic
objects

l
The sc_lv types cannot be used in
arithmetic operations directly

61
The Operators of the sc_lv<n>
  Bitwise    &(and)  |(or)  ^(xor)  ~(not)
  Assignment =   &=   |=   ^=
  Equality   ==  !=

sc_uint<16> uint16;
sc_int<16>  int16;
sc_lv<16>   lv16;
lv16  = uint16;   // convert uint to lv
int16 = lv16;     // convert lv to int

To perform arithmetic functions, first assign sc_lv objects to
the appropriate SystemC integer

62
Fixed Point Types
l For a high level model, floating point numbers are useful to
  model arithmetic operations
l Floating point numbers can handle a very large range of values
  and are easily scaled
l Floating point data types are typically converted to or built as
  fixed point data types to minimize the amount of hardware cost
l To model the behavior of fixed point hardware, designers need
  bit accurate fixed point data types
l Fixed point types are also used to develop DSP software
63
Fixed Point Types (cont’)
l There are 4 basic types used to model fixed point types in SystemC
  – sc_fixed
  – sc_ufixed
  – sc_fix
  – sc_ufix
l Types sc_fixed and sc_fix specify a signed fixed point data type
l Types sc_ufixed and sc_ufix specify an unsigned fixed point data type

64
Fixed Point Types
(cont’)
l Types sc_fixed and sc_ufixed use static arguments to specify the
  functionality of the type
– Static arguments must be known at compile time
l Types sc_fix and sc_ufix can use
argument types that are non-static
– Non-static arguments can be variables
– Types sc_fix and sc_ufix can use variables to
determine word length, integer word length,
etc.

65
Syntax of the Fixed Point Types
  sc_fixed<wl, iwl, q_mode, o_mode, n_bits> x;
  sc_ufixed<wl, iwl, q_mode, o_mode, n_bits> y;
  sc_fix  x(list of options);
  sc_ufix y(list of options);

l wl - Total word length
  – Used for the fixed point representation. Equivalent to the total
    number of bits used in the type.
l iwl - Integer word length
  – Specifies the number of bits that are to the left of the binary
    point (.) in a fixed point number.
l q_mode - quantization mode
l o_mode - overflow mode
l n_bits - number of saturated bits
  – This parameter is only used for the overflow mode
l x, y - object name
  – The name of the fixed point object being declared.
66
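
A hedged declaration sketch (not in the original slides): SC_RND and SC_SAT are standard SystemC quantization and overflow mode names, and the widths chosen below are arbitrary.

  #define SC_INCLUDE_FX        // enable the fixed point types
  #include "systemc.h"

  // 12 bits total, 4 integer bits, round on quantization, saturate on overflow
  sc_fixed<12, 4, SC_RND, SC_SAT> coeff;
  sc_ufixed<8, 2>                 gain;   // quantization/overflow modes left at defaults

  coeff = 3.14159;   // value is rounded/saturated into the fixed point format
  gain  = 1.75;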
Quantization Modes

67
Overflow Modes

68
The Operators of Fixed Point

69
Outline
l Introduction
l System Modeling Languages
l SystemC Overview
l Data-Types
l Processes
l Interfaces
l Simulation Supports
l System Design Environments
l HW/SW Co-Verification
l Conclusion
70
Processes
l Processes are the basic unit of execution within SystemC
l Processes are called to simulate the behavior of the target
  device or system
l Processes provide the mechanism of concurrent behavior to
  model electronic systems
l A process must be contained in a module
l Processes cannot be hierarchical
  – No process will call another process
l Processes can directly call methods and functions that are
  not processes
71
Processes
(cont’)
l Processes have sensitivity lists
  – a list of signals that cause the process to be invoked whenever
    the value of a signal in this list changes

l Processes trigger other processes by assigning new values to the
  hardware signals in the sensitivity list of the other process
72
Processes
(cont’)
l Three types of SystemC processes
– Methods — SC_METHOD
– Threads — SC_THREAD
– Clocked Threads — SC_CTHREAD

73
Process — SC_METHOD
l A method that does not have its own thread of
execution
– Cannot call code with wait()
l Executed when events (value changes) occur
on the sensitivity list
l When a method process is invoked, it
executes and returns control back to the
simulation kernel until it is finished
l Users are strongly recommended not to
write infinite loops within a method process
– Control will never be returned back to the simulator

74
SC_METHOD (Example)

// rcv.h
#include "systemc.h"
#include "frame.h"

SC_MODULE(rcv) {
  sc_in<frame_type> xin;
  sc_out<int> id;
  void extract_id();
  SC_CTOR(rcv) {
    SC_METHOD(extract_id);   // register the member function
                             // with the simulation kernel
    sensitive(xin);
  }
};

// rcv.cc
#include "rcv.h"
#include "frame.h"
void rcv::extract_id() {
  frame_type frame;
  frame = xin;
  if(frame.type == 1) {
    id = frame.ida;
  } else {
    id = frame.idb;
  }
}
75
Process —
SC_THREAD
l Thread process can be suspended and reactivated
l A thread process can contain wait() functions that
suspend process execution until an event occurs
on the sensitivity list
l An event will reactivate the thread process from
the statement that was last suspended
l The process will continue to execute until the next
wait()

76
SC_THREAD (Example of a Traffic Light)

// traff.h
#include "systemc.h"
SC_MODULE(traff)
{
  // input ports
  sc_in<bool> roadsensor;
  sc_in<bool> clock;
  // output ports
  sc_out<bool> NSred;
  sc_out<bool> NSyellow;
  sc_out<bool> NSgreen;
  sc_out<bool> EWred;
  sc_out<bool> EWyellow;
  sc_out<bool> EWgreen;
  void control_lights();
  int i;
  // Constructor
  SC_CTOR(traff) {
    SC_THREAD(control_lights);   // Thread Process
    sensitive << roadsensor;
    sensitive_pos << clock;
  }
};

// traff.cc
#include "traff.h"
void traff::control_lights()
{
  NSred    = false;
  NSgreen  = false;
  NSyellow = true;
  EWred    = true;
  EWyellow = false;
  EWgreen  = false;
  while (true) {
    while (roadsensor == false)
      wait();
    NSgreen  = false;            // road sensor triggered
    NSred    = false;
    NSyellow = true;             // set NS to yellow
    wait();
    for (i = 0; i < 5; i++)
      wait();
    NSyellow = false;            // yellow interval over
    NSgreen  = false;
    NSred    = true;             // set NS to red
    EWgreen  = true;             // set EW to green
    EWred    = false;
    EWyellow = false;
    wait();
    for (i = 0; i < 50; i++)
    . . .
77
Process —
SC_CTHREAD
l Clocked thread process is a special case of the thread processes
l A clocked thread process is only triggered on one edge of
one clock
– Matches the way that hardware is typically implemented
with synthesis tools
l Clocked threads can be used to create implicit state
machines within design descriptions
l Implicit state machine
– The states of the system are not explicitly defined
– The states are described by sets of statements with wait()
function calls between them
l Explicit state machine
– To define the state machine states in a declaration
– To use a case statement to move from state to state
78
SC_CTHREAD (Example of a BUS function)

// bus.h
#include "systemc.h"
SC_MODULE(bus) {
  sc_in_clk clock;
  sc_in<bool> newaddr;
  sc_in<sc_uint<32> > addr;
  sc_in<bool> ready;
  sc_out<sc_uint<32> > data;
  sc_out<bool> start;
  sc_out<bool> datardy;
  sc_inout<sc_uint<8> > data8;
  sc_uint<32> tdata;
  sc_uint<32> taddr;
  void xfer();
  SC_CTOR(bus) {
    SC_CTHREAD(xfer, clock.pos());
    datardy.initialize(true);    // ready to accept new address
  }
};

// bus.cc
#include "bus.h"
void bus::xfer() {
  while (true) {
    // wait for a new address to appear
    wait_until( newaddr.delayed() == true);
    // got a new address so process it
    taddr = addr.read();
    datardy = false;             // cannot accept new address now
    data8 = taddr.range(7,0);
    start = true;                // new addr for memory controller
    wait();
    // wait 1 clock between data transfers
    data8 = taddr.range(15,8);
    start = false;
    wait();
    data8 = taddr.range(23,16);
    wait();
    data8 = taddr.range(31,24);
    wait();
    // now wait for ready signal from memory controller
    wait_until(ready.delayed() == true);
    // now transfer memory data to databus
    tdata.range(7,0) = data8.read();
    wait();
    tdata.range(15,8) = data8.read();
    wait();
    tdata.range(23,16) = data8.read();
    wait();
    tdata.range(31,24) = data8.read();
    data = tdata;
    datardy = true;              // data is ready, new addresses ok
  }
}
79
Events

l An event is represented by class sc_event


– Determines whether and when a process’s execution should
be triggered or resumed
l An event is usually associated with some changes
of state in a process or of a channel
l The owner of the event is responsible for reporting
the change to the event object
– The act of reporting the change to the event is called
notification
l The event object is responsible for keeping a list
of processes that are sensitive to it
l Thus, when notified, the event object will inform
the scheduler of which processes to trigger
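
A brief sketch (added for illustration, not from the slides) of how an event owned by a module might be declared, notified, and waited on; all names are made up:

  #include "systemc.h"

  SC_MODULE(producer) {
    sc_event data_ready;                 // event object owned by this module

    void make_data() {
      // ... produce something ...
      data_ready.notify(SC_ZERO_TIME);   // report the change: notify after a delta delay
    }

    void use_data() {
      wait(data_ready);                  // this process is triggered when the event is notified
      // ... consume the data ...
    }

    SC_CTOR(producer) {
      SC_THREAD(make_data);
      SC_THREAD(use_data);
    }
  };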
80
Event Notification and Process Triggering

(Figure: a process or channel (the owner of the event) notifies the
event, either immediately, after a delta delay, or after time T; the
event then triggers Process 1, Process 2, and Process 3)

81
Events in Classical Hardware
Modeling
l A hardware signal is responsible for notifying the event
  whenever its value changes
– A signal of Boolean type has two additional events
• One associated with the positive edge
• One associated with the negative edge
– A more complex channel, such as a FIFO buffer
• An event associated with the change from being
empty to having a word written to it
• An event associated with the change from being full
to having a word read from it

82
Relationship Between the
Events
l An event object may also be used directly by one process P1 to
  control another process P2
  – If P1 has access to event object E and P2 is sensitive to or
    waiting on E, then P1 may trigger the execution of P2 by
    notifying E
  – In this case, event E is not associated with the change in a
    channel, but rather with the execution of some path in P1

83
Sensitivity
l The sensitivity of a process defines when this
process will be resumed or activated
l A process can be sensitive to a set of events.

l Whenever one of the corresponding events is

triggered, the is resumed or activated.


process
l Two types
– Static Sensitivity
– Dynamic Sensitivity

84
Static
Sensitivity
l Static sensitivity list
– In a module, the sensitivity lists of events are
determined before simulation begins
– The list remains the same throughout simulation

l RTLand synchronous behavioral processes only


use static sensitivity lists

85
Dynamic
Sensitivity
l It is possible for a process to temporarily override
its static sensitivity list
– During simulation a thread process may suspend
itself
– To designate a specific event E as the current event on
which the process wishes to wait
– Then, only the notification of E will cause the thread
process to be resumed
– The static sensitivity list is ignored

86
Dynamic Sensitivity — wait()

To wait for a specific event E, the thread process simply calls
wait() with E as its argument:

wait(E)

Composite events, for use with wait only, may be formed by
conjunction (AND) or disjunction (OR)

wait(E1 & E2 & E3);


wait(E1 | E2 | E3);
87
Dynamic Sensitivity — wait() (Cont')
The wait() function may also take as argument
a time
wait(200, SC_NS);
or, equivalently
sc_time t(200, SC_NS);
wait(t);

By combining time and events, we may impose


a timeout on the waiting of events

wait(200, SC_NS, E);

waits for event E to occur, but if E does not occur within 200ns,
the thread process will give up on the wait and resume
88
Dynamic Sensitivity —
next_trigger()
l Calling next_trigger() does not suspend the current method process
l Execution of the process will be invoked only when the event
  specified by next_trigger() occurs
l If an invocation of a method process does not call next_trigger(),
  then the static sensitivity list will be restored

The call next_trigger(200, SC_NS, E) will make the current method
process wait on E with a timeout of 200ns. If E occurs within 200ns,
the method process will be triggered.

Otherwise, when the timeout expires, the method process will be
triggered and its static sensitivity list will be back in effect.

89
Special Dynamic Sensitivity for
SC_CTHREAD — wait_until()

l The wait_until() method will halt the execution of


the process until a specific event has
occurred.
l This specific event is specified by the expression passed to the
  wait_until() method.

This statement will halt execution of the process


until the new value of roadsensor is true.
wait_until(roadsensor.delayed() == true);

90
Special Dynamic Sensitivity for
SC_CTHREAD — watching()

l SC_CTHREAD processes typically have


infinite loops that will be continuously
executed
l A designer typically wants some way to
initialize the behavior of the loop or jump out
of the loop when a condition occurs
l The watching construct will monitor a
specified condition and transfer the control to
the beginning of the process

91
Special Dynamic Sensitivity for
SC_CTHREAD — watching() (Cont’)

// datagen.h
#include "systemc.h"
SC_MODULE(data_gen) {
  sc_in_clk clk;
  sc_inout<int> data;
  sc_in<bool> reset;
  void gen_data();
  SC_CTOR(data_gen) {
    SC_CTHREAD(gen_data, clk.pos());
    watching(reset.delayed() == true);
  }
};

// datagen.cc
#include "datagen.h"
void data_gen::gen_data() {
  if (reset == true) {
    data = 0;
  }
  while (true) {
    data = data + 1;
    wait();
    data = data + 2;
    wait();
    data = data + 4;
    wait();
  }
}

watching(reset.delayed() == true) specifies that signal reset will be
watched for this process:
– If signal reset changes to true, the watching expression will be true and
  the SystemC scheduler will halt execution of the while loop for this process
– Execution then starts again at the first line of the process
92
Outline
l Introduction
l System Modeling Languages
l SystemC Overview
l Data-Types
l Processes
l Interfaces
l Simulation Supports
l System Design Environments
l HW/SW Co-Verification
93
Communication between Design
Blocks
- Channels, Interfaces, and Ports
l Traditionally, hardware signals are used for communication and
  synchronization between processes
l The level of abstraction is too low for a system design view
l Introduce: Channels, Interfaces, Ports
l Interfaces and ports describe what functions are available in a
  communications package
  – Access points
l Channels define how these functions are performed
  – Internal operations
94
Example of Modules, Ports, Interfaces,
and Channels

(Figure: two modules, each with ports; Module 1 and Module 2 are
connected through a primitive channel and through a hierarchical
channel (HC) that itself has a port; ports bind to channels through
interfaces (port-channel binding))

95
Interfaces
l The "windows" into channels that describe the set of operations
l Define sets of methods that channels must implement
l Specify only the signature of each operation, namely, the
  operation's name, parameters, and return value
l It neither specifies how the operations are
implemented nor defines data fields

96
Interfaces
(cont’)
l All interfaces must be derived, directly or indirectly,
from the abstract base class : sc_interface
l The concept of interface is useful to model layered design
  – Connection between modules which are at different levels of
    abstraction
l Relationship with ports
  – Ports are connected to channels through interfaces
  – A port that is connected to a channel through an interface sees
    only those channel methods that are defined by the interface

97
Interface
Examples
l All interface methods are pure virtual methods without
any implementation

template <class T>
class sc_read_if
: virtual public sc_interface
{
public:
  // interface methods
  virtual const T& read() const = 0;
};

An example read interface: sc_read_if. This interface provides a
'read' method.

98
Interface
Examples

template <class T>


class sc_write_if
: virtual public sc_interface
{
public:
// interface methods
virtual void write( const T& ) = 0;
};

An example write interface: sc_write_if. This interface provides a
'write' method.
99
Interface Examples

template <class T>


class sc_read_write_if
: public sc_read_if<T>,
public sc_write_if<T>
{};

An example read/write interface: sc_read_write_if. This defines a
read/write interface by deriving from the read interface and the
write interface.
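
To tie interfaces and channels together, a hedged sketch (not from the slides) of a simple channel that could implement sc_read_write_if as defined above; the class name and the choice of sc_prim_channel as the base are assumptions for illustration only:

  template <class T>
  class simple_reg
  : public sc_read_write_if<T>,
    public sc_prim_channel
  {
  public:
    simple_reg() : sc_prim_channel("simple_reg"), m_val(T()) {}

    // implementations of the interface methods
    virtual const T& read() const      { return m_val; }
    virtual void     write(const T& v) { m_val = v; }

  private:
    T m_val;   // the data field lives in the channel, not in the interface
  };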

100
Ports

l A port is an object through which a module can


access a channel’s interface
l A port is the external interface that passes information to and
  from a module, and triggers actions within the module
l A port connects to channels through interfaces

101
Ports (cont’)
lA port can have three different modes of operation
– Input (sc_in<T>)
– Output (sc_out<T>)
– Inout (sc_inout<T>)

102
Ports (Cont’)
lA port of a module can be connected to
– Zero or more channels at the same level of
hierarchy
– Zero or more ports of its parent module
– At least one interface or port
l sc_port allows accessing a channel's interface methods by using
  operator -> or operator [ ]
l In the following example:
  – "input" is an input port of a process
  – read() is an interface method of the attached channel

a = input->read();    // read from the first (or only) channel of input
b = input[2]->read(); // read from the third channel of input

103
Access Ports

(Figure: ports are read/written through methods; hardware types
cannot be accessed directly)

104
Specialized
Ports
l Specialized ports can be created by refining the port base class
  sc_port or one of the predefined port types
  – Addresses are used in addition to data
    • Bus interface
  – Additional information on the channel's status
    • The number of samples available in a FIFO/LIFO
  – Higher forms of sensitivity
    • wait_for_request()

105
Port-less Channel
Access
l In order to facilitate IP reuse and to enable tool support,
  SystemC 2.0 defines the following mandatory design styles
  – Design style for inter-module level communication
  – Design style for intra-module level communication

106
Port-less Channel Access
(cont’)
l For inter-module level communication, ports must be
used to connect modules to channels
– Ports are handles for communicating with the "outside world"
  (channels outside the module)
– The handles allow for checking design rules and
attaching
communication attributes, such as priorities
– From a software point-of-view they can be seen as a
kind of smart pointers
l For intra-module level communication, direct
access to channels is allowed
– Without using the ports.
– Access a channel's interface in a "port-less" way by directly
  calling the interface methods.
107
Channels
l A channel implements one or more interfaces, and serves as a
  container for communication functionality
l A channel is the workhorse for holding and transmitting data
l A channel is not necessarily a point-to-point connection
l A channel may be connected to more than two modules
l A channel may vary widely in complexity, from a hardware signal
  to complex protocols with embedded processes
l SystemC 2.0 allows users to create their own channel types
108
Channels
(cont’)
l Primitive channels
  – Do not exhibit any visible structure
  – Do not contain processes
  – Cannot (directly) access other primitive channels
l Hierarchical channels
  – Basically are modules
  – Can have structure
  – Can contain other modules and processes
  – Can (directly) access other channels

109
Primitive
Channels
l The hardware signal
– sc_signal<T>
l The FIFO channel
– sc_fifo<T>
l The mutual-exclusion lock (mutex)
– sc_mutex

110
The Hardware Signal –
sc_signal<T>
l The semantics are similar to the VHDL signal
l sc_signal<T> implements the interface sc_signal_inout_if<T>

// controller.h
#include "statemach.h"

SC_MODULE(controller) {
  sc_in<sc_logic>  clk;
  sc_out<sc_logic> count;
  sc_in<sc_logic>  status;
  sc_out<sc_logic> load;
  sc_out<sc_logic> clear;
  sc_signal<sc_logic> lstat;
  sc_signal<sc_logic> down;

  state_machine *s1;   // state_machine is another module

  SC_CTOR(controller) {
    // .... other module statements
    s1 = new state_machine("s1");
    s1->clock(clk);    // special case port to port binding
    s1->en(lstat);     // port en bound to signal lstat
    s1->dir(down);     // port dir bound to signal down
    s1->st(status);    // special case port to port binding
  }
};

The example above shows a port bound to another port (special case)
and a port bound to a signal.

111
The FIFO Channel –
sc_fifo<T>
l Provides both blocking and nonblocking versions of access
l sc_fifo<T> implements the interfaces sc_fifo_in_if<T> and
  sc_fifo_out_if<T>

Blocking version:
  If the FIFO is empty, suspend until more data is available
  If the FIFO is full, suspend until more space is available

Nonblocking version:
  If the FIFO is empty, do nothing
  If the FIFO is full, do nothing
112
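
An illustrative usage fragment (not in the slides; the FIFO size and variable names are made up):

  sc_fifo<int> fifo(8);      // FIFO channel with 8 entries

  // inside a thread process of a producer module:
  fifo.write(42);            // blocking write: suspends if the FIFO is full

  // inside a thread process of a consumer module:
  int v = fifo.read();       // blocking read: suspends if the FIFO is empty

  int w;
  if (fifo.nb_read(w)) {     // non-blocking read: returns false if the FIFO is empty
    // use w
  }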
The Mutual-Exclusion Lock
(Mutex)
– sc_mutex
l Model critical sections for accessing shared
variables
l A process attempts to lock the mutex
before entering a critical section
l If the mutex has already been locked by another
process, it will cause the current process to
suspend
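
A minimal sketch (not from the slides) of guarding a critical section with sc_mutex; the names are illustrative:

  sc_mutex bus_lock;

  // inside a thread process:
  bus_lock.lock();                 // suspends the caller if the mutex is already locked
  // ... critical section: access the shared variable ...
  bus_lock.unlock();               // release so another process may enter

  if (bus_lock.trylock() == 0) {   // non-blocking attempt; 0 indicates success
    // ... critical section ...
    bus_lock.unlock();
  }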

113
Channel Design Rules

sc_signal<T>:
  • No more than one driver, i.e., at most one output port
    (sc_out<T>) or bi-directional port (sc_inout<T>) connected
  • An arbitrary number of input ports (sc_in<T>) can be connected

sc_fifo<T>:
  • At most one input port can be connected
  • At most one output port can be connected
  • No bi-directional ports

114
Channel
Attributes
l Channel attributes can be used for a per-port
configuration of the communication
l Channel attributes are helpful especially when
modules are connected to a bus
l Attributes that can be used

– Addresses (in case the module doesn't use specialized


ports, addresses can be specified as arguments of the
access methods)
– Addressing schemes (e.g. constant address vs. auto-
increment)
– Connect module as master or slave or master/slave
– Priorities
– Buffer sizes

115
Channel Attributes (Example)
l Let mod be an instance of a module and let port be a port of
  this module

// create a local channel
message_queue mq;
...
// connect the module port to the channel
mod.port( mq );
...

l A channel attribute can now be specified, for example:

// specify a channel attribute
mq.priority( mod.port, 2 );
...

l which sets the priority attribute for mod.port to 2.

116
Hierarchical Channels
l To model the new generation of SoC communication
infrastructures efficiently
l For instance, OCB (On-Chip Bus)
  – The standard backbone from VSIA
  – The OCB consists of several intelligent units
    • Arbiter unit
    • Control unit
    • Programming unit
    • Decode unit
l For modeling complex channels such as the OCB backbone,
  primitive channels are not very suitable
  – Due to the lack of processes and structures
l For modeling this type of channel, hierarchical channels
  should be used
117
Primitive Channels vs. Hierarchical Channels
l Use primitive channels
– When you need to use the request-update
scheme
– When channels are atomic and cannot
reasonably be chopped into smaller pieces
– When speed is absolutely crucial (using primitive
channels we can often reduce the number of
delta cycles)
– When it doesn’t make any sense trying to build
up a channel (such as a mutex) out of processes
and other channels

118
Primitive Channels vs. Hierarchical Channels (Cont')
l Use hierarchical channels
– When channels are truly hierarchical and
users would want to be able to explore the
underlying structure
– When channels contain processes
– When channels contain other channels

119
Outline
l Introduction
l System Modeling Languages
l SystemC Overview
l Data-Types
l Processes
l Interfaces
l Simulation Supports
l System Design Environments
l HW/SW Co-Verification
l Conclusion
120
Clock
Objects
l Clock objects are special objects which generate
timing signals to synchronize events in the
simulation
l Clocks order events in time so that parallel events
in hardware are properly modeled by a simulator
on a sequential computer
l Typically, clocks are created at the top level of the design in
  the testbench and passed down through the module hierarchy to
  the rest of the design

121
Clock Objects (Example)

int sc_main(int argc, char* argv[])
{
  sc_signal<int>      val;
  sc_signal<sc_logic> load;
  sc_signal<sc_logic> reset;
  sc_signal<int>      result;
  sc_time t1(20, SC_NS);
  sc_clock ck1("ck1", t1, 0.5, 0, true);
  filter f1("filter");
  f1.clk(ck1.signal());
  f1.val(val);
  f1.load(load);
  f1.reset(reset);
  f1.out(result);
  // rest of sc_main not shown

This declaration will create a clock object named ck1 with a period of
20ns, a duty cycle of 50%, the first edge will occur at 0 time units,
and the first value will be true
122
Simulation Control

(Figure: the design is simulated for 1000ns; process execution is
suspended and resumed due to the use of wait)
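
Since the original figure is not reproduced here, a small hedged example of what the simulation control calls typically look like at the end of sc_main (module instantiation and port binding as in the clock example above):

  int sc_main(int argc, char* argv[])
  {
    // ... instantiate modules and bind ports ...
    sc_start(1000, SC_NS);   // run the simulation for 1000 ns
    return 0;
  }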
123
Design Example: 4-bit LFSR

(Block diagram: feedback signal X0 drives a chain of four D
registers R1, R2, R3, R4)

l Four files are created to describe this circuit
  – LFSR.h (header file for declaration)
  – LFSR.cpp (body of LFSR)
– LFSR_env.cpp (simulation environment)
– Main.cpp (top module)
124
LFSR.h

#include "systemc.h"

SC_MODULE(LFSR) {
  sc_in<bool>  clk;
  sc_in<bool>  reset_n;
  sc_out<unsigned int> random;
  SC_CTOR(LFSR) {
    SC_METHOD(LFSR_proc);
    sensitive_pos << clk;
    sensitive_neg << reset_n;
  }
  void LFSR_proc();        // body proc
  sc_uint<4> Reg;
  bool X0, R1, R2, R3, R4;
};

SC_MODULE(LFSR_Gen) {
  sc_out<bool> reset_n;
  SC_CTOR(LFSR_Gen) {
    SC_THREAD(Gen_proc);
  }
  void Gen_proc();         // body proc
};

SC_MODULE(LFSR_Mon) {
  sc_in<unsigned int> random;
  SC_CTOR(LFSR_Mon) {
    SC_METHOD(Mon_proc);
    sensitive << random;
  }
  void Mon_proc();         // body proc
};
125
LFSR.cpp

#include "LFSR.h"

void LFSR::LFSR_proc()
{
  if(reset_n.read() == false) {
    R1 = 0;
    R2 = 0;
    R3 = 0;
    R4 = 0;
  }
  else {
    if(R3 == R4) X0 = 1; else X0 = 0;
    R4 = R3;
    R3 = R2;
    R2 = R1;
    R1 = X0;
  }
  Reg[0] = R1;
  Reg[1] = R2;
  Reg[2] = R3;
  Reg[3] = R4;
  random.write(Reg);
}
126
LFSR_env.cpp

#include "LFSR.h"

void LFSR_Gen::Gen_proc()
{
  while (1) {
    reset_n = 1;
    wait(2, SC_NS);
    reset_n = 0;
    wait(4, SC_NS);
    reset_n = 1;
    wait(100, SC_NS);
  }
}
void LFSR_Mon::Mon_proc()
{
cout << "random= ";
cout << random << "\n";
} 127
Main.cpp

#include "LFSR.h"

int sc_main(int argc, char* argv[])
{
  sc_signal<bool> reset_n;
  sc_signal<unsigned int> random;
  sc_time t1(2, SC_NS);          // 1 cycle = 2ns
  sc_clock clk("clk", t1, 0.5);

  LFSR_Gen M1("LFSR_Gen");
  M1(reset_n);
  LFSR M2("LFSR");
  M2(clk, reset_n, random);
  LFSR_Mon M3("LFSR_Mon");
  M3(random);

  sc_start(100, SC_NS);
  return 0;
}
128
Running Results
random= 0
random= 1
random= 0
random= 1
random= 3
random= 7
random= 14
random= 13
random= 11
random= 6
random= 12
random= 9
random= 2
random= 5
random= 10
random= 4
129
Outline
l Introduction
l System Modeling Languages
l SystemC Overview
l Data-Types
l Processes
l Interfaces
l Simulation Supports
l System Design Environments
l HW/SW Co-Verification
l Conclusion
130
The Supported Tools for SystemC
l Platform and Compiler
l System design environments

131
Platform and Compiler
l Typically, a compiler for the C++ standard can compile the
  SystemC source code well
  – SystemC is just an extended template library
l GNU gcc for many platforms
l Sun with Solaris
  – Forte C++
l HP
  – HP aC++
l Intel with Microsoft OS
  – MS Visual C++

132
System Design
Environments
l Synopsys
– CoCentric System Studio (CCSS)
l Cadence

– Signal Processing Worksystem


(SPW)
l Agilent
– Advanced Design System (ADS)
l CoWare

– N2C Design System


l ……

133
CoCentric System Level Design
Platform
(Diagram: CoCentric System Studio takes a C/SystemC reference design
kit and supports performance exploration and HW/SW co-design with
processor models; from the SystemC executable specification, the
SystemC synthesizable model goes through the CoCentric SystemC
Compiler to chip verification, while the software C-code goes to
software implementation)
134
CoCentric System Level Design
Platform
l Algorithm libraries and Reference Design Kits
– Broadband Access: ADSL, DOCSIS cable
modem
– Wireless: CDMA, Bluetooth, GSM/GPRS,
PDC, DECT, EDGE
– Digital Video: MPEG-2, MPEG-4
– Broadcast standard: DAB, DVB
– Error Correcting Coding: RS coding, Hamming coding
– Speech Coding: ITU G.72X, GSM speech, AMR speech

135
CoCentric System Level Design
Platform
l Simulation, Debugging and Analysis
– Mixing of architectural and algorithmic models
in the same simulation
– Works with VCS, Verilog-XL, ModelSim, import
Matlab models for co-simulation
– Macro-debugging at the block level
– Micro-debugging at the source code level
– Davis
– VirSim

136
CoCentric System Level Design
Platform
l Path to implementation
  – Synthesizable C code generated automatically

(Flow: Cocentric System Studio executable specification
(C/C++/SystemC) produces synthesizable C for the Cocentric SystemC
Compiler, which feeds the Design Compiler / Physical Compiler
implementation flow; the C/C++ software follows the software
implementation flow)
137
Advanced Design System

(Diagram: ADS / DSP Designer at the center, connected to Ptolemy
models, HDL simulation, hardware emulation, logic synthesis, C/C++
models, HDL models, MATLAB models, and measurement/instrumentation)
138
Outline
l Introduction
l System Modeling Languages
l SystemC Overview
l Data-Types
l Processes
l Interfaces
l Simulation Supports
l System Design Environments
l HW/SW Co-Verification
l Conclusion
139
Traditional HW/SW Verification Flow

l Enter the system integration and verification stage only after
  both HW and SW are finished
  – Errors may happen at the beginning
l Hinges on a physical prototype like an FPGA
  – Chip respins waste a lot of money and time
  – Error correction through redesign
l Signal visibility keeps getting worse
  – Change the pin assignment of the FPGA in order to get
    the visibility
140
Pre-Silicon
Prototype
l Virtual prototype
  – Simulation environment
l Emulator
  – Hundreds of kilohertz
l Rapid prototype
  – Combination of FPGAs and dedicated chips that can be
    interconnected to instantiate a design
  – Tens of megahertz
l Roll-Your-Own (RYO) prototype
  – FPGA and board
  – Tens of megahertz
141
Virtual
Prototypes
l Definition
– A simulation model of a product , component, or
system
l Features
– Higher abstraction level
– Easily setup and modify
– Cost-effective
– Great observability
– Shorten design cycles

142
Verification Speed

(Figure: questions answered at each abstraction level. Algorithm
Level: cell loss? bit error rate? Transaction Level: bus bandwidth?
cache size? RTL: handshake? reset? The timing axis runs from
sub-cycle and cycle accuracy through bus-cycle, latency annotated,
and untimed models)
Reference : Synopsys
143
Verification Speed

(Figure: relative simulation speed by abstraction level. RTL is the
1x-10x baseline, Transaction Level is roughly 100x, Algorithm Level
roughly 1000x; the timing axis again runs from sub-cycle and cycle
accuracy through bus-cycle, latency annotated, and untimed models)
Reference : Synopsys

144
HW/SW Co-
Simulation
l Couple a software execution environment with a
hardware simulator
l Provides complete visibility and
debugger interface into each
environment
l Software normally executed on an Instruction Set
Simulator (ISS)
l A Bus Interface Model (BIM) converts abstract
software operations into detailed pin
operations

145
Advantages of HW/SW Co-Simulation
(1/2)
l Simulate in minutes instead of days
l Early architecture closure reduces risk by 80%

l Start software development 6 months earlier


l Simulate 100x~1000x faster than RTL
l HW designers can use the tools which are
familiar to them
l SW programmers can use all their favorite
debugger to observe software state and
control the executions

146
Advantages of HW/SW Co-Simulation (2/2)
l Software Engineers
  – Simulation model replaces stub code
  – More time to develop & debug code
  – Validate code against hardware as you develop
  – Maintain software design integrity
l Hardware Engineers
  – Embedded software replaces the test bench
  – Reduce the chance of an ASIC or board spin
  – Resolve gray areas before tape out

147
Synopsys's Solution
l System Studio
  – SystemC simulation
l SystemC Compiler
  – SystemC synthesis
l DesignWare
  – AMBA/ARM models

(Diagram: System Studio (SystemC), DesignWare models, the SystemC
Compiler, C/C++ code and Design Compiler / Physical Compiler
together produce the SoC)
Reference : Synopsys
148
Synopsys System Studio

(Diagram: architecture and algorithm models, e.g. an ARM9 / AHB
subsystem, run together in one SystemC simulation; hardware and
software debuggers, memory and the bus are visible in the same
environment)
Reference : Synopsys
149


Mentor Graphic : Seamless CVE

Reference : Mentor Graphics 150


Cadence : Incisive Platform

(Diagram: a unified environment with unified test generation and
transaction support; a single-kernel architecture simulating Verilog,
VHDL, SystemC, PSL/Sugar, AMS and algorithm models, with
acceleration-on-demand)
Reference : Cadence
151
Conclusions
l The system level design is a new design challenge
– Both hardware and software issues have to be considered
l High level abstraction and modeling is essential
for system design in future
– SystemC is a more mature language, but not the only one
l Co-design methodology can reduce the design cycle
– Allow earlier HW/SW integration
l Virtual co-simulation environment is required
– Reduce the cost and design cycle of hardware prototype
– Simulate 100x~1000x faster than RTL with the models of
higher level of abstraction
l A hot and hard area for designers and EDA vendors

152
References
l Book materials
– System Design with SystemC, T. Grotker, S. Liao, G. Martin, S.
Swan, Kluwer Academic Publishers.
– Surviving the SOC Revolution, H. Chang et al., Kluwer Academic
  Publishers.
l Manual
– SystemC Version 2.0 User’s Guide
– Functional Specification for SystemC 2.0
l Slides
– SoC Design Methodology, Prof. C.W Jen
– Concept of System Level Design using Cocentric System Studio, L.F
Chen
l WWW
– http://www.synopsys.com
– http://www.systemc.org
– http://www.forteds.com
– http://www.celoxica.com
– http://www.adelantetechnologies.com
– http://mint.cs.man.ac.uk/Projects/UPC/Languages/VHDL+.html
– http://eesof.tm.agilent.com/products/ads2002.html
– http://www.specc.gr.jp/eng/
– http://www.coware.com
– http://www.doulos.com
153
SystemC: Co-specification and
Embedded System Modeling
Module-4 SoC and NoC Interconnection Structures 7 hours CO4

SoC Interconnection Structures - Bus-based Structures - AMBA Bus.
Network on Chip - NoC Interconnection Structures - Topologies -
Routing - Flow control - Network components (router/switch,
network interface, links).
SoC and NoC Interconnection
Structures
• SoC Interconnection Structures- Bus-based
Structures- AMBA Bus.
• Network on Chip –NoC Interconnection
Structures-Topologies- routing- flow control-
network components(router/switch, network
interface, Links).
SoC Buses Overview
 AMBA bus  Wishbone
 ASB (Advanced System Bus)  CoreFrame
 AHB (Advanced High-  Marble
performance Bus)  PI bus
 APB (Advanced Peripheral Bus)
 OCP
 Avalon
 VCI (Virtual Component Interface)
 CoreConnect
 SiliconBackplane Network
 PLB (Processor Local Bus)
 OPB (On-chip Peripheral Bus)
 ST Bus
 Type I (Peripheral protocol)
 Type II (Basic Protocol)
 Type III (Advanced protocol)

3
SoC Buses Overview
 AMBA bus  Wishbone
  ASB (Advanced System Bus)  CoreFrame
  AHB (Advanced High-  Marble
  performance Bus)  PI bus
 APB (Advanced Peripheral Bus)
 OCP
 Avalon
 VCI (Virtual Component Interface)
 CoreConnect
 SiliconBackplane Network
 PLB (Processor Local Bus)
 OPB (On-chip Peripheral Bus)
 ST Bus
 Type I (Peripheral protocol)
 Type II (Basic Protocol)
 Type III (Advanced protocol)

4
Introduction
 SOC design involves the integration of intellectual property (IP)
cores, each separately designed and verified.

 Most important issue is the method by which the IP cores are


connected together.

 SOC interconnect architectures:


 Network - on - chip (NOC).
 Bus architectures

 Switch - based interconnects used in SOC are referred to as NOC.

5
Overview: Interconnect Architectures

A simplified block diagram of an SOC module in a system context.


6
Overview: Interconnect Architectures
 The SOC module typically contains a number of IP blocks
(processors).

 In addition, there are various types of on - chip memory like cache,


data, or instruction storage.

 Other IP blocks serving application - specific functions, such as


graphics processors, video codecs and network control units are
integrated in the SOC.

 All above SOC modules need to communicate with each other for
the proper operation of system. Interconnects are used to do the
communication between them.

7
Overview: Interconnect Architectures
 System level issues and specifications while Choosing a suitable
interconnect architecture:

1. Communication Bandwidth:

2. Communication Latency:

3. Master and Slave:

4. Concurrency Requirement:

5. Packet or Bus Transaction:

6. ICU: An interconnect interface Unit:

7. Multiple Clock Domains:

8
Overview: Interconnect Architectures
 System level issues and specifications while Choosing a suitable
interconnect architecture:

1. Communication Bandwidth:

 It is the rate of information transfer between a module and the


surrounding environment in which it operates.

 Usually measured in bytes per second,

 The bandwidth requirement of a module describes the type of


interconnection required to achieve the overall system throughput as
per specifications.
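
 For example (illustrative numbers only): a 32-bit bus clocked at
   100 MHz can move at most 4 bytes x 100,000,000 = 400 MB/s, so an
   IP block that must stream 500 MB/s cannot be served by such a bus
   on its own.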

9
Overview: Interconnect Architectures
 System level issues and specifications while Choosing a suitable
interconnect architecture:

2. Communication Latency:

 It is the time delay between a module requesting data and receiving


a response to the request.

 For example:
 Watching a movie that is a couple of seconds later than when it is
actually broadcast is of no consequence.
 In contrast, even small, unanticipated latencies in a two - way mobile
communication protocol can make it almost impossible to carry out a
conversation.
 Hence, Latency may or may not be important in terms of overall system
performance.
10
Overview: Interconnect Architectures
 System level issues and specifications while Choosing a suitable
interconnect architecture:

3. Master and Slave.

 These terms concern whether a unit can initiate or react to


communication requests.

 A master, such as a processor, controls transactions between itself


and other modules.
 A slave, such as memory, responds to requests from the master.

 An SOC design typically has several masters and numerous slaves.

11
Overview: Interconnect Architectures
 System level issues and specifications while Choosing a suitable
interconnect architecture:

4. Concurrency Requirement:

 The number of independent simultaneous communication


channels operating in parallel.

 Usually, additional channels improve system bandwidth.

12
Overview: Interconnect Architectures
 System level issues and specifications while Choosing a suitable
interconnect architecture:

5. Packet or Bus Transaction:

 The size and definition of the information transmitted in a single


transaction.

 For a bus, this consists of an address with control bits (read/write, etc.)
and data.

 For a NOC it is referred as a packet. The packet consists of a header


(address and control) and data.
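
 As a schematic illustration only (these struct layouts are invented
   for this note and are not part of any bus or NoC standard):

    // what one bus transaction typically carries
    struct BusTransaction {
      unsigned int address;       // slave address
      bool         write;         // control bit: read or write
      unsigned int data;          // data word
    };

    // what one NOC packet typically carries
    struct NocPacket {
      struct {
        unsigned short src_node;  // header: address information
        unsigned short dst_node;
        unsigned short length;    // header: control information
      } header;
      unsigned int payload[8];    // data
    };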

13
Overview: Interconnect Architectures
 System level issues and specifications while Choosing a suitable
interconnect architecture:

6. ICU: An interconnect interface Unit.

 This unit manages the interconnect protocol and the physical


transaction.

 If the IP core requires a protocol translation to access the bus, the


unit is called a bus wrapper.

 In an NOC, this unit manages the protocol for transport of a packet


from the IP core to the switching network.

14
Overview: Interconnect Architectures
 System level issues and specifications while Choosing a suitable
interconnect architecture:

7. Multiple Clock Domains:

 Different IP modules may operate at different clock and data rates.

 For example:
A video camera captures pixel data at a rate governed by the video
standard used,
while a processor’s clock rate is usually determined by the
technology and architectural design.
As a result, IP blocks inside an SOC often need to operate at
different clock frequencies, creating separate timing regions known
as clock domains.
 Crossing between clock domains can cause deadlock and
synchronization problems. 15
Bus: Basic Architecture
 The Computer Systems heavily dependent on the characteristics of
its interconnect architecture.

 A poorly designed system bus can throttle:

 Transfer of instructions and data between memory and


processor or

 between peripheral devices and memory.

16
Bus: Basic Architecture
 The speed at which the bus can operate is often limited by:

 The high capacitive load on each bus signal,

 The resistance of the contacts on the connector, and

 The electromagnetic noise produced by such fast–switching


signals.

17
Bus: Basic Architecture
 Arbitration and Protocols:

 A bus is just a wire shared by multiple units.

 Some logic must be present to use the bus; otherwise, two units
may send signals at the same time, causing conflicts.

 In an SOC, a bus master is a component within the chip, such as a


processor.
 Other units connected to the bus, such as I/O devices and memory
   components, are the "slaves".

 The bus master controls the bus paths using specific slave
addresses and control signals.

18
Bus: Basic Architecture
 Arbitration and Protocols:

 Arbitration determines ownership (to whom access should be given).


 There is a centralized arbitration unit with an input from each
requesting unit. The arbitration unit then grants bus ownership to
one requesting unit, as determined by the bus protocol.

 The protocol determines the following:


 The type and order of data being sent;
 How the sending device indicates that it has finished sending the
information;
 The data compression method used, if any;
 How the receiving device acknowledges successful reception of
the information; and
 How arbitration is performed to resolve contention on the bus
   and in what priority, and the type of error checking to be used.
19
Bus: Basic Architecture
 Bus Bridge:
 A bus bridge is a module that connects together two buses, which
are not necessarily of the same type.
 A typical bridge can serve three functions:
1. If the two buses use different protocols, a bus bridge provides the
necessary format and standard conversion.
2. A bridge is inserted between two buses to segment them and keep
traffic contained within the segments. This improves concurrency:
both buses can operate at the same time.
3. A bridge often contains memory buffers and the associated control
circuits.
When a master on one bus initiates a data transfer to a slave module
on another bus through the bridge, the data is temporarily stored in
the buffer, allowing the master to proceed to the next transaction
before the data are actually written to the slave.
A bus bridge can significantly improve system performance.
20
Bus: Basic Architecture
 Bus Varieties:

 Buses may be unified or split (address and data).

 In the unified bus the address is initially transmitted followed by one


or more data cycles;

 The split bus has separate buses for each of these functions.

21
AMBA bus

22
AMBA bus
 AMBA is Advanced Microcontroller Bus Architecture

 The AMBA protocol is an open standard, on-chip bus specification.

 Originally developed in 1995, the AMBA Specification has been refined


and extended with additional protocol support to provide the
capabilities required for SoC design.

 The AMBA protocol enhances a reusable design methodology.

 IP re-use is essential in reducing SoC development costs and


timescales and AMBA provides the interface standard that enables IP
re-use meeting the essential requirements
https://www.allaboutcircuits.com/technical-articles/introduction-to-the-
advanced-microcontroller-bus-architecture/ 23
AMBA bus
 Nowadays, AMBA is one of the leading on-chip busing system used
in high performance SoC design.

 AMBA is hierarchically organized into two bus segments, system-


bus and peripheral-bus, connected via bridge.

 Standard bus protocols for connecting on-chip components generalized


for different SoC structures are independent of the processor type and
are defined by AMBA specifications.

24
AMBA bus

AMBA based system architecture

25
AMBA bus
 The three distinct buses specified within the AMBA bus are:

 ASB: Advanced System Bus

 ASB - First generation of AMBA system bus used for simple cost-
effective designs.

 ASB supports:
 burst transfer,
 pipelined transfer operation and
 multiple bus masters.

26
AMBA bus
 The three distinct buses specified within the AMBA bus are:

 AHB: Advanced High-performance Bus

 AHB - later generation of AMBA bus is intended for high


performance high-clock synthesizable designs.

 It provides high-bandwidth communication channel and high


performance peripherals/hardware accelerators (ASICs MPEG, color
LCD, etc), on-chip SRAM, on-chip external memory interface and
APB bridge.

 AHB supports a multiple bus masters operation, peripheral and a


burst transfer, split transactions, wide data bus configurations.

27
AMBA bus
 The three distinct buses specified within the AMBA bus are:

 APB: Advanced Peripheral Bus

 APB - is used to connect general purpose low speed low-power


peripheral devices.

 The bridge is peripheral bus master, while all buses devices (Timer,
UART, PIA, etc) are slaves.

 APB is static bus that provides a simple addressing with latched


addresses and control signals for easy interfacing.

28
AMBA AHB
 AMBA AHB implements the features required for high-performance,

high clock frequency systems including:

 Burst transfers

 Split transactions

 Single-cycle bus master handover

 Single-clock edge operation

 Non-tristate implementation

 Wider data bus configurations (64/128 bits).

29
AMBA AHB
 A typical AMBA AHB system design contains the following components:

 AHB master

A bus master is able to initiate read and write operations by


providing an address and control information. Only one bus master is
allowed to actively use the bus at any one time.

 AHB slave

A bus slave responds to a read or write operation within a


given address space range. The bus slave signals back to the active
master the success, failure or waiting of the data transfer.

30
AMBA AHB
 A typical AMBA AHB system design contains the following components:

 AHB arbiter

The bus arbiter ensures that only one bus master at a time is
allowed to initiate data transfers.

 AHB decoder

The AHB decoder is used to decode the address of each


transfer and provide a select signal for the slave that is involved in
the transfer.

A single centralized decoder is required in all AHB implementations.

31
AMBA AHB

AMBA AHB Master Interface


32
AMBA AHB

AMBA AHB bus slave interface


33
AMBA APB Bridge Interface

 The APB bridge is the only bus master on the AMBA APB.

 The APB bridge is also a slave on the higher-level system bus.


34
AMBA APB Bridge
 The bridge converts system bus transfers into APB transfers and

performs the following functions:

 Latches the address and holds it valid throughout the transfer.

 Decodes the address and generates a peripheral select, PSELx.

Only one select signal can be active during a transfer.

 Drives the data onto the APB for a write transfer.

 Drives the APB data onto the system bus for a read transfer.

 Generates a timing strobe, PENABLE, for the transfer.

35
AMBA APB
Bridge

Block diagram
of bridge module

 The AHB to APB bridge is an AHB slave, providing an interface


between the high speed AHB and the low-power APB.
 Read and write transfers on the AHB are converted into equivalent
transfers on the APB. As the APB is not pipelined, wait states are
added during transfers to and from the APB.
36
IBM’s CoreConnect Bus

37
CoreConnect Bus

CoreConnect bus based system 38


CoreConnect Bus
 CoreConnect is an IBM-developed on-chip bus.

 By reusing processor, subsystem and peripheral cores, supplied


from different sources, enables their integration into a single VLSI
design.

 It is comprised of three buses (PLB, OPB & DCR) that provide an


efficient interconnection of cores, library macros, and custom logic
within a SoC

39
CoreConnect Bus
 PLB Bus : Processor Local Bus

 It is synchronous, multi master, central arbitrated bus that allows


achieving high-performance on-chip communication.

 Separate address and data buses support concurrent read and


write transfers.

 PLB macro is used to interconnect various master and slave


macros.

 PLB slaves are attached to PLB through shared, but decoupled,


address, read data, and write data buses.

 Up to 16 masters can be supported by the arbitration unit, while there


are no restrictions on the number of slave devices.
40
CoreConnect
Bus

Example of PLB
Interconnection

 Figure illustrates the connection of multiple masters and slaves


through the PLB macro. Each PLB master is attached to the PLB macro
via separate address, read data and write data buses.
 PLB slaves are attached to the PLB macro via shared, but decoupled,
address, read data and write data buses along with transfer control and
status signals for each data bus.
 The PLB architecture supports up to 16 master devices and any
number of slave devices. 41
CoreConnect Bus
 OPB Bus: On-chip Peripheral Bus

 It is optimized to connect lower speed, low throughput peripherals,


such as serial and parallel port, UART, etc.
 Crucial features of OPB are:

 Fully synchronous operation,

 Dynamic bus sizing,

 Separate address and data buses,

 Multiple OPB bus masters,

 Single cycle transfer of data between bus masters,

 Single cycle transfer of data between OPB bus master and OPB
slaves, etc.
 Instead of tristate drivers OPB uses distributed multiplexer.
42
CoreConnect Bus
 OPB Bridge:

 PLB masters gain access to the peripherals on the OPB bus through

the OPB bridge.

 The OPB bridge acts as a slave device on the PLB and a master on

the OPB.

 It supports word (32-bit), half-word (16-bit) and byte read and write

transfers on the 32-bit OPB data bus.

 The OPB bridge performs dynamic bus sizing, allowing devices with

different data widths to efficiently communicate.

43
CoreConnect Bus
 DCR Bus: Device Control Register Bus

 It is a single master bus mainly used as an alternative relatively low


speed datapath to the system for:
(a) passing status and setting configuration information into the
individual device-control-registers between the Processor Core and
others SoC constituents (Auxiliary Processors, On-Chip Memory,
System Cores, Peripheral Cores, etc.); and
(b) design for testability purposes.

 DCR is synchronous bus based on a ring topology implemented as


distributed multiplexer across the chip.
 It consists of a 10-bit address bus and a 32-bit data bus.

44
IBM CoreConnect Vs ARM AMBA Architectures

45
Bus Sockets and Bus Wrappers
 Using a standard SOC bus for the integration of different reusable IP
blocks has one major drawback.

 Standard buses specify protocols over wired connections

 An IP block that complies with one bus standard cannot be reused


with another block using a different bus standard.

 An approach to overcome this is to employ a hardware "socket",
   i.e., a bus wrapper.

 Core-to-Core communication is handled by the interface wrapper.

46
Bus Sockets and Bus Wrappers
 OCP - Open Core Protocol.
 The OCP defines a point - to – point interface between two
communicating entities such as two IP cores using a core - centric
protocol.
 A system consisting of three IP core modules using the OCP and bus
wrappers is shown in Figure.
 One module is a system initiator, one is a system target, and another is
both initiator and target.

47
Analytic Bus Models
 Contention and Shared Bus:

 Contention occurs wherever two or more units request a shared


resource that cannot supply both at the same time.

 When contention occurs, either (1) it delays its request and is idle until
the resource is available or (2) it queues its request in a buffer and
proceeds until the resource is available.

 As contention and queues develop at the "bottleneck" in the


system, the most limiting resource is the source of the contention,
and other parts of the system simply act as delay elements.

 Buses often have no buffering (queues), and access delays cause


immediate system slowdown.

 The analysis on the effects of bus congestion depends on the


access type and buffering.
48
Analytic Bus Models
 Contention and Shared Bus:

 Generally there are two types of access patterns:

1. Requests without Immediate Resubmissions:


Once a request is denied, processing continues despite the delay in
the resubmission of the request.

2. Requests Are Immediately Resubmitted:


A program cannot proceed after a denied request. It is immediately
resubmitted.
The processor is idle until the request is honored and serviced.

49
NOC: Networks on Chip
SoC Interconnection Structures

Overview
• Introduction to Networks on a Chip
• Bus and Point-to-point NoC Systems
• Routing Algorithms and Switching Techniques
• Flow Control
• NOC Topology Generation and Analysis
Chapter 5: Computer System Design – System on Chip by M.J. Flynn and W. Luk
Chapter 12: On-Chip Communication Architectures – SoC Interconnect by S. Pasricha & N. Dutt
System-on-Chip and NoC
System-on-Chip --to-- Network-on-Chip

(Figure: an SoC containing a CPU, MPEG core, DSP, VGA core, and
analog components (ADC/DAC), first connected by a shared bus and
then by an on-chip network)

NOC and SOC Design 2


SoC Structure

(Figure: an SoC structure with processors p1 ... p3 attached to a
bus; a communication link connects the blocks)

NOC and SOC Design 3


Multiple Processor/Core SoC

Inter-node communication between CPU/cores can be performed


by message passing or shared memory. Number of processors in
the same chip-die increases at each node (CMP and MPSoC).
• Memory sharing will require: SHARED BUS
* Large Multiplexers
* Cache coherence techniques
* Not Scalable
• Message Passing: NOC
* Scalable
* Require data transfer transactions
* Has overhead of extra communication

NOC and SOC Design 4


NOC: Network-on-Chip

System Bus

Shared bus is not a


long-term solution
• It has poor scalability
On-Chip micro-networks suit the
demand of scalability and performance
NOC and SOC Design 5
NOC and Off-Chip Networks

NOC:
  Sensitive to cost: area and power
  Wires are relatively cheap
  Latency is critical
  Traffic is known a priori
  Design time specialization
  Custom NoCs are possible

Off-Chip Networks:
  Cost is in the links
  Latency is tolerable
  Traffic/applications unknown, change at runtime
  Adherence to networking standards
NOC and SOC Design 6


On-Chip Communication Structures

NOC and SOC Design 7


On-Chip Bus Interconnection
For highly connected multi-core system
Communication bottleneck
For multi-master buses
Arbitration will become a complex problem
Power grows for each communication event as more
units attached will increase the capacitive load.
A crossbar switch can overcome some of these
problems and limitations of the buses
Crossbar is not scalable

NOC and SOC Design 8


SOC Communication Structures
Dedicated Point-to-Point
• Advantages
Optimal in terms of bandwidth, availability,
latency and power usage
Simple to design and verify as well as easier to
model
• Disadvantages
Number of links may increase exponentially
with the increase in number of cores
Hardware Area
Routing Problems
NOC and SOC Design 9
SOC Communication Structures
Network on Chip
Advantages
Structured architecture – Lower complexity and cost
of SOC design
Reuse of components, architectures, design methods
and tools
Efficient and high performance interconnect.
Scalability of communication architecture
Disadvantages
Internal network contention can cause a latency
Bus oriented IPs need smart wrapping hardware
Software needs clear synchronization in
multiprocessor systems
NOC and SOC Design 10
Networks-on-Chip
• Interconnect for SoCs, CMPs, MPSoC and
FPGAs
Multi-hop, packet-based communication
Efficient resource sharing
• Scalable communication infrastructure
provides scalable performance/efficiency in
Power
Hardware Area
Design productivity

NOC and SOC Design 11


NoC ?
A chip-wide network: Processing Elements (PEs) are inter-
connected via a packet-based network in NoC Architecture
(Figure: a message (MSG) is packetized at the source PE, carried as
packets across the network, and decoded back into the message at the
destination PE)
NOC and SOC Design 12
What is an NoC?
• Network-on-chip (NoC) is a packet switched on-chip
communication network designed using a layered methodology
―routes packets, not wires‖
• NoCs use packets to route data from the source to the destination
PE via a network fabric that consists of
switches (routers)
interconnection links (wires)

15
Network-on-Chip vs. Bus Interconnection
BUS interconnection is fairly simple and familiar. However:
• Bandwidth is limited, shared
• Speed goes down as N grows
• No concurrency
• Pipelining is tough
• Central arbitration
• No layers of abstraction (communication and computation are coupled)

Network-on-Chip:
• Total bandwidth grows
• Link speed unaffected
• Concurrent spatial reuse
• Pipelining is built-in
• Distributed arbitration
• Separate abstraction layers
However:
• No performance guarantee
• Extra delay in routers
• Area and power overhead?
• Modules need NI
• Unfamiliar methodology
NOC and SOC Design 13


NoC Evolution
• Progress of on-chip communication architectures

NOC and SOC Design 14


NoC
NoCs are an attempt to scale down the concepts of large-
scale networks, and apply them to the embedded system-
on-chip (SoC) domain
NoC Properties
Regular geometry that is scalable
Flexible QoS guarantees
Higher bandwidth
Reusable components
• Buffers, arbiters, routers, protocol stack
No long global wires (or global clock tree)
• No problematic global synchronization
• GALS: Globally asynchronous, locally synchronous design
Reliable and predictable electrical and physical properties

NOC and SOC Design 16


NoC: Buses to Networks
Original Bus Features
• One transaction at a time
• Central Arbiter
• Limited bandwidth
• Synchronous
• Low cost
Shared Bus to Segmented Bus

S
NOC and SOC Design 17
Advanced Bus
Segmented Bus
• More General/Versatile
bus architecture Shared Bus to Segmented Bus
• Pipelining capability
• Burst transfer
• Split transactions
• Overlapped arbitration
• Transaction preemption, S
resumption & reordering
S

NOC and SOC Design 18


Buses to Networks

• Architectural paradigm shift: Replace wire spaghetti by network


• Usage paradigm shift: Pack everything in packets
• Organizational paradigm shift
Confiscate communications from logic designers
Create a new discipline, a new infrastructure responsibility
NOC and SOC Design 19
NoC Related Main Problems
Global interconnect design problems:
• Delay
• Power
• Noise
• Scalability
• Reliability
System integration
Productivity problem
Chip Multi Processors
For power-efficient computing

NOC and SOC Design 20


NoC Wiring Design

• NoC links:
Regular
Point-to-point -- no fan-out tree (problem)
Can use transmission-line layout
Well-defined current return path
• Can be optimized for noise / speed / power
Low swing, current mode, ….

NOC and SOC Design 21


NoC Topology

Direct Topologies
each node has direct point-to-point link to a subset of
other nodes in the system called neighboring nodes
nodes consist of computational blocks and/or
memories, as well as a NI block that acts as a router e.g.
Nostrum, SOCBUS, Proteo, Octagon
as the number of nodes in the system increases, the total
available communication bandwidth also increases
fundamental trade-off is between connectivity and cost
NOC and SOC Design 23
NoC Topology
• Most direct network topologies have an orthogonal
implementation, where nodes can be arranged in an
n-dimensional orthogonal space
Routing for such networks is fairly simple
e.g. n-dimensional mesh, torus, folded torus, hypercube, and octagon
• 2D mesh is most popular topology
All links have the same length
• eases physical design
Chip area grows linearly with the number
of nodes
Must be designed in such a way as to
avoid traffic accumulating in the
center of the mesh

NOC and SOC Design 24


NoC Topology
Torus topology, also called a k-ary n-cube, is an n-dimensional
grid with k nodes in each dimension
k-ary 1-cube (1-D torus) is essentially a ring network with k nodes
• Limited scalability as performance decreases when more nodes

k-ary 2-cube (i.e., 2-D torus) topology is


similar to a regular mesh
• Except that nodes at the edges are connected
to switches at the opposite edge via wrap-
around channels
• Long end-around connections can, however,
lead to excessive delays

NOC and SOC Design 25


NoC Topology
• Folded torus topology overcomes the long-link limitation
of a 2-D torus
all links have the same length

• Meshes and tori can be extended by adding bypass links to


increase performance at the cost of higher area

NOC and SOC Design 26


NoC Topology
Octagon topology is another example of a direct network
messages being sent between any 2 nodes require at most two hops
more octagons can be tiled together to accommodate larger designs
• with one of the nodes used as a bridge node

NOC and SOC Design 27


NoC Topology
• Indirect Topologies
each node is connected to an external switch, and switches have
point-to-point links to other switches
switches do not perform any information processing, and
correspondingly nodes do not perform any packet switching
e.g. SPIN, crossbar topologies
• Fat tree topology
nodes are connected only to the leaves of the tree
more links near root, where bandwidth
requirements are higher

NOC and SOC Design 28


Irregular NoC Topologies

• Based on the concept of using only what is necessary.
• Application-specific topologies.
• Eliminate unneeded resources and bandwidth
from the system.
• Leads to reduced power and area use.
• Requires additional design work.

NOC and SOC Design 29


NOC Topology
[Figure: 4×4 grid of nodes 1-16, shown as the logical mesh (left) and its physical implementation (right)]

NOC and SOC Design 30


NOC Torus Topology
[Figure: 4×4 torus, nodes 1-16, shown as the logical topology (left) and its folded physical implementation (right)]

NOC and SOC Design 31


Deadlock, Livelock, and Starvation

Deadlock: A packet does not reach its destination,


because it is blocked at some intermediate resource.

Livelock: A packet does not reach its destination,


because it enters a cyclic path.

Starvation: A packet does not reach its destination,


because some resource does not grant access (while
it grants access to other packets).

NOC and SOC Design 32


Definitions and Terminology
Switch: The component of the network that is in charge
of flit routing.

Flit Latency: The time needed for a FLIT to reach its


target PE from its source PE.

Packet Latency: The time needed for a PACKET to


reach its target PE from its source PE.

Packet Spread: The time from the reception of the first


flit of a packet to the reception of the last one.
In computer networking, a flit (flow control unit or flow control digit) is a link-level atomic piece that forms a network packet or stream
NOC and SOC Design 33
Message Abstraction
Packet: An element of information that a processing element
(PE) sends to another PE. A packet may consist of a variable
number of flits.
Flit: The elementary unit of information exchanged in the
communication network in a clock cycle.

[Figure: message format hierarchy. A Message consists of a Header and a Payload; a Packet consists of a Destination/header flit, Body flits and a Tail flit; each Flit carries a Type field and a VC (virtual channel) identifier.]

NOC and SOC Design 34
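As a rough illustration of this message-to-packet-to-flit decomposition, the sketch below (plain Python; the flit payload width, field names and flit types are assumptions, not a real NoC format) splits a payload into body flits behind a header flit that carries the destination, and marks the last flit as the tail.

```python
# Minimal packetization-into-flits sketch (illustrative only; field widths
# and flit types are assumptions, not taken from a real NoC specification).
FLIT_PAYLOAD_BITS = 32  # assumed flit payload width

def packetize(dest, payload_bits, vc=0):
    """Split a payload (given as a bit-string) into header/body/tail flits."""
    flits = [{"type": "HEAD", "vc": vc, "dest": dest}]          # header flit
    chunks = [payload_bits[i:i + FLIT_PAYLOAD_BITS]
              for i in range(0, len(payload_bits), FLIT_PAYLOAD_BITS)]
    for i, chunk in enumerate(chunks):
        kind = "TAIL" if i == len(chunks) - 1 else "BODY"        # last flit closes the packet
        flits.append({"type": kind, "vc": vc, "data": chunk})
    return flits

# Example: a 96-bit message to node (2, 3) becomes 1 header + 3 body/tail flits
print(packetize(dest=(2, 3), payload_bits="1" * 96))
```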


Switching Techniques
Two main modes of transporting flits in an NoC are Circuit
Switching and Packet Switching
• Circuit switching
physical path between the source and the destination is reserved
prior to the transmission of data
message header flit traverses the network from the source to the
destination, reserving links along the way
Advantage: low latency transfers, once path is reserved
Disadvantage: pure circuit switching does not scale well with
NoC size
• Several links are occupied for the duration of the transmitted data,
even when no data is being transmitted
– for instance in the setup and tear down phases
NOC and SOC Design 35
Switching Strategies
Virtual Circuit Switching
creates virtual circuits that are multiplexed on links
number of virtual links (or virtual channels (VCs)) that can be
supported by a physical link depends on buffers allocated to
link
Possible to allocate either one buffer per virtual link or one buffer
per physical link
Allocating one buffer per virtual link
• depends on how virtual circuits are spatially distributed in the
NoC; routers can have a different number of buffers
• can be expensive due to the large number of shared buffers
• multiplexing virtual circuits on a single link also requires
scheduling at each router and link (end-to-end schedule)
• conflicts between different schedules can make it difficult to
achieve bandwidth and latency guarantees
NOC and SOC Design 36
Switching Strategies, cont.
Virtual Circuit Switching
Allocating one buffer per physical link
o virtual circuits are time multiplexed with a single buffer
per link
o uses time division multiplexing (TDM) to statically
schedule the usage of links among virtual circuits
o flits are typically buffered at the NIs and sent into the
NoC according to the TDM schedule
o global scheduling with TDM makes it easier to achieve
end-to-end bandwidth and latency guarantees
o less expensive router implementation, with fewer buffers

NOC and SOC Design 37


Packet Switching
packets are transmitted from source and make their way
independently to receiver
• Possibly along different routes and with different delays
zero start up time, followed by a variable delay due to
contention in routers along packet path
QoS guarantees are harder to make in packet switching
than in circuit switching
three main packet switching scheme variants
SAF (Store and Forward) Switching
packet is sent from one router to the next only if the receiving
router has buffer space for entire packet
buffer size in the router is at least equal to the size of a packet
Disadvantage: excessive buffer requirements
NOC and SOC Design 38
Packet Switching
VCT (Virtual Cut Through) Switching
Reduces router latency over SAF switching by forwarding the first flit of a
packet as soon as space for the entire packet is available in the next router
If no space is available in the receiving buffer, no flits are sent, and
the entire packet is buffered
Same buffering requirements as SAF switching
WH (Wormhole) Switching
A flit from a packet is forwarded to the receiving router if space exists
for that flit
Parts of the packet can be distributed among two or more routers
Buffer requirements are reduced to one flit, instead of an entire packet
Susceptible to deadlocks due to usage dependencies among links
NOC and SOC Design 39
Routing Algorithms
• Responsible for correctly and efficiently routing packets or
circuits from the source to the destination
• Choice of a routing algorithm depends on trade-offs between
several potentially conflicting metrics
Minimizing power required for routing
Minimizing logic & routing tables to achieve a lower area footprint
Increasing performance by reducing delay and maximizing traffic
utilization of the network
Improving robustness to better adapt to changing traffic needs
• Routing schemes can be classified into several categories
Static or dynamic routing
Distributed or source routing
Minimal or non-minimal routing
NOC and SOC Design 40
Routing Algorithms
Static and Dynamic routing
Static Routing: fixed paths are used to transfer data between a
particular source and destination
• does not take into account current state of the network
Advantages of static routing:
• easy to implement, since very little additional router logic is required
• in-order packet delivery if single path is used
Dynamic Routing: routing decisions are made according to the
current state of the network
• considering factors such as availability and load on links
Path between source and destination may change over time
• as traffic conditions and requirements of the application change
More resources needed to monitor state of the network and
dynamically change routing paths
Able to better distribute traffic in a network
NOC and SOC Design 41
Routing Algorithms
Distributed and Source Routing
Static and dynamic routing schemes can be further classified
depending on where the routing information is stored, and where
routing decisions are made
Distributed routing: each packet carries the destination address
(e.g., XY co-ordinates or a number identifying the destination node/router)
• Routing decisions are made in each router by looking up the destination
address in a routing table or by executing a hardware function
Source routing: packet carries routing information
• Pre-computed routing tables are stored at a node's NI
• Routing information is looked up at the source NI and added to the
header of the packet (increasing packet size)
• When a packet arrives at a router, the routing information is extracted
from the routing field in the packet header
• Does not require a destination address in a packet, any intermediate
routing tables, or functions needed to calculate the route
NOC and SOC Design 42
Routing algorithms
Minimal and Non-minimal routing
Minimal routing: the length of the routing path from the source to the
destination is the shortest possible length between the two nodes
• e.g. in a mesh NoC topology (where each node can be identified by its
XY co-ordinates in the grid), if the source node is at (0, 0) and the destination
node is at (i, j), then the minimal path length is |i| + |j|
• the source does not start sending a packet if a minimal path is not available
Non-minimal routing: can use longer paths if a minimal path is not
available
• by allowing non-minimal paths, the number of alternative paths
is increased, which can be useful for avoiding congestion
• disadvantage: overhead of additional power consumption

NOC and SOC Design 43


Routing Algorithms
Routing algorithm must ensure freedom from deadlocks
common in WH switching
e.g. cyclic dependency shown below

freedom from deadlocks can be ensured by allocating additional
hardware resources or by imposing restrictions on the routing
usually a dependency graph of the shared network resources is built
and analyzed, either statically or dynamically
NOC and SOC Design 44
Routing Algorithms
Routing Algorithm must ensure freedom from Livelocks
Livelocks are similar to deadlocks, except that the states of the resources
involved constantly change with regard to one another, without making
any progress
• occurs especially when dynamic (adaptive) routing is used
• e.g. can occur in deflective "hot potato" routing if a packet is bounced
around over and over again between routers and never reaches its
destination
Livelocks can be avoided with simple priority rules
Routing Algorithm must ensure freedom from starvation
under scenarios where certain packets are prioritized during routing,
some of the low-priority packets may never reach their intended
destination
can be avoided by using a fair routing algorithm, or by reserving some
bandwidth for low-priority packets
NOC and SOC Design 45
Flow Control Schemes
• Goal of flow control is to allocate network resources for packets
traversing a NoC
can also be viewed as a problem of resolving contention during packet
traversal
• At the data link-layer level, when transmission errors occur,
recovery from the error depends on the support provided by the
flow control mechanism
e.g. if a corrupted packet needs to be retransmitted, flow of packets from
the sender must be stopped, and request signaling must be performed to
reallocate buffer and bandwidth resources
• Most flow control techniques can manage link congestion
• But not all schemes can (by themselves) reallocate all the
resources required for retransmission when errors occur
either error correction or a scheme to handle reliable transfers must be
implemented at a higher layer
NOC and SOC Design 46
Flow Control Schemes

STALL/GO
Low overhead scheme
Requires only two control wires
• one going forward and signaling data availability
• the other going backward and signaling either a condition of buffers
filled (STALL) or of buffers free (GO)
Implement with distributed buffering (pipelining) along link
good performance – fast recovery from congestion
does not have any provision for fault handling
• higher level protocols responsible for handling flit interruption
NOC and SOC Design 47
Flow Control Schemes

T-Error
More aggressive scheme that can detect faults
• by making use of a second delayed clock at every buffer stage
Delayed clock re-samples input data to detect any inconsistencies
• then emits a VALID control signal
Re-synchronization stage added between the end of the link and the
receiving switch
• to handle offset between original and delayed clocks
Timing budget can be used to provide greater reliability by
configuring links with appropriate spacing and frequency
Does not provide a thorough fault handling mechanism
NOC and SOC Design 48
Flow Control Schemes

ACK/NACK
When flits are sent on a link, a local copy is kept in a buffer by sender
When ACK received by sender, it deletes copy of flit from its buffer
When NACK is received, the sender rewinds its output queue and starts
resending flits, starting from the corrupted one
Implemented either end-to-end or switch-to-switch
Sender needs to have a buffer of size 2N + k
• N is number of buffers encountered between source and destination
• k depends on latency of logic at the sender and receiver
Overall a minimum of 3N + k buffers are required
Fault handling support comes at cost of greater power, area overhead
NOC and SOC Design 49
Flow Control Schemes

ACK/NACK

NOC and SOC Design 50


Flow Control Schemes
Network and Transport-Layer Flow Control
Flow Control without Resource Reservation
• Technique #1: drop packets when receiver NI full
– improves congestion in short term but increases it in long term
• Technique #2: return packets that do not fit into receiver buffers to sender
– to avoid deadlock, rejected packets must be accepted by sender
• Technique #3: deflection routing
– when packet cannot be accepted at receiver, it is sent back into network
– packet does not go back to sender, but keeps hopping from router to router till
it is accepted at receiver
Flow Control with Resource Reservation
• credit-based flow control with resource reservation
• credit counter at sender NI tracks free space available in receiver NI
buffers
• credit packets can piggyback on response packets
• end-to-end or link-to-link
NOC and SOC Design 51
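A toy model of the credit-based scheme described above, assuming a sender NI that keeps a credit counter equal to the free slots in the receiver NI buffer; the class names, buffer depth and interfaces are illustrative assumptions, not part of any specific NoC.

```python
# Sketch of end-to-end credit-based flow control (illustrative only).
class CreditSender:
    def __init__(self, receiver_buffer_slots):
        self.credits = receiver_buffer_slots   # free space known at the sender NI

    def try_send(self, flit, link):
        if self.credits == 0:
            return False                       # stall: no space at the receiver NI
        self.credits -= 1
        link.append(flit)
        return True

    def credit_return(self, n=1):
        self.credits += n                      # credit piggybacked on a response packet

class CreditReceiver:
    def __init__(self):
        self.buffer = []

    def accept(self, link, sender):
        while link:
            self.buffer.append(link.pop(0))
        if self.buffer:                        # consume one flit, return one credit
            self.buffer.pop(0)
            sender.credit_return(1)

link = []
tx, rx = CreditSender(receiver_buffer_slots=2), CreditReceiver()
print([tx.try_send(f, link) for f in ("f0", "f1", "f2")])  # third send stalls
rx.accept(link, tx)
print(tx.try_send("f2", link))                             # succeeds once a credit returns
```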
Switching Techniques
Packet Switching – Routing Protocols
Store and Forward: Router cost is packet based. Packet size also
affects latency and buffering requirements. Stalling happens at two
nodes and the link between them.
Wormhole: Router cost is based on the header. The header can affect
latency, and buffering at the router is based on the header size.
Stalling can happen at all the nodes and links spanned by the
packet.
Virtual Cut-through: Router cost depends on header and packet
size. Stalling at local nodes level.

NOC and SOC Design 52


VCT and Wormhole Routing

NOC and SOC Design 53


Relevant Parameters: Routing
Minimum latency is of paramount importance in
NOCs (inter-process communication).
Ideally: One clock latency per switch/router (flit
enters at time t and exits at t+1)
Maximum switch clock frequency
(technology + routing logic limits)
Deadlock free
No flits are ever lost; once a flit is injected into the
NoC, it must reach its destination, possibly after
a long time.

NOC and SOC Design 54


Fixed Shortest Path Routing
Suitable for Regular Topologies
e.g. Mesh, Torus, Tree, etc.

X-Y routing (first the x direction, then the y direction)
Simple Router
No deadlock scenario
No retransmission
No reordering of messages
Power-efficient

NOC and SOC Design 55
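A minimal sketch of the dimension-ordered X-Y routing just described (coordinates, port names and the helper functions are my own illustrative assumptions): the packet is first moved along X until the column matches, then along Y, which also yields the minimal |i| + |j| path length on a mesh.

```python
# Deterministic X-Y (dimension-ordered) routing on a 2D mesh -- a sketch.
# Coordinates are (x, y); port names E/W/N/S/LOCAL are illustrative.
def xy_route(cur, dest):
    """Return the output port to take at node `cur` for a packet headed to `dest`."""
    cx, cy = cur
    dx, dy = dest
    if dx > cx:
        return "E"      # first correct the X (column) offset
    if dx < cx:
        return "W"
    if dy > cy:
        return "N"      # only then correct the Y (row) offset
    if dy < cy:
        return "S"
    return "LOCAL"      # arrived: eject to the attached core

def path(src, dest):
    """List of hops from src to dest; length equals the minimal distance |dx| + |dy|."""
    hops, cur = [], src
    while cur != dest:
        port = xy_route(cur, dest)
        cur = {"E": (cur[0] + 1, cur[1]), "W": (cur[0] - 1, cur[1]),
               "N": (cur[0], cur[1] + 1), "S": (cur[0], cur[1] - 1)}[port]
        hops.append(cur)
    return hops

print(path((0, 0), (2, 3)))   # 5 hops: two in X, then three in Y
```

Because packets never turn from the Y dimension back into X, this turn restriction is what rules out the cyclic channel dependencies that cause deadlock on a mesh.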


Wormhole Routing
In wormhole routing a header flit "digs" the
path and holds it.
Successive flits are routed to the same path or
direction.
In case of blocking, a loss-less NoC needs:
Buffers
A back-pressure mechanism if we don't have
infinite size FIFOs…

NOC and SOC Design 56


Wormhole
[Figure sequence, step-by-step wormhole routing animation: the header flit (H) establishes the path from Src to Dest one hop at a time; the body flits (F2, F3, F4) and the tail flit (T) follow along the same path, occupying successive links behind the header.]
NOC and SOC Design 57-68


Deflection Routing
Hot Potato – Deadlock-Free Routing
Every flit can be routed to different directions
(no packet notion at the switch level)
If the optimal direction is blocked, the flit is "deflected" to
another direction
Switch latency of 1 clock cycle whatever the level of congestion
Minimum buffer requirements
Deflection routing vs. Wormhole routing:
• Packet reordering vs. no packet reordering
• Adaptive routing vs. static routing
• No buffering vs. buffering (2 flits/port)
• No back-pressure vs. back-pressure
• Works with Torus/Mesh vs. XY routing needs mesh

NOC and SOC Design 69
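The sketch below captures the hot-potato idea under simple assumptions of my own (port names, the X-Y preference function, and random selection of the deflection port): every incoming flit must leave the switch in the same cycle, so if its preferred output is already taken it is deflected onto any free port.

```python
import random

# Hot-potato (deflection) routing decision at one switch -- a sketch.
PORTS = ["N", "S", "E", "W"]

def preferred_port(cur, dest):
    """Productive direction toward dest (X first, then Y), as in X-Y routing."""
    if dest[0] != cur[0]:
        return "E" if dest[0] > cur[0] else "W"
    if dest[1] != cur[1]:
        return "N" if dest[1] > cur[1] else "S"
    return None  # flit has arrived

def route_cycle(cur, flits):
    """Assign every flit an output port this cycle; deflect if the preferred port is busy."""
    taken, decisions = set(), {}
    for flit in flits:                      # a real switch would give the oldest flit priority
        want = preferred_port(cur, flit["dest"])
        if want is None:
            decisions[flit["id"]] = "LOCAL"
        elif want not in taken:
            taken.add(want)
            decisions[flit["id"]] = want
        else:                               # deflection: pick any remaining free port
            port = random.choice([p for p in PORTS if p not in taken])
            taken.add(port)
            decisions[flit["id"]] = port + " (deflected)"
    return decisions

flits = [{"id": "a", "dest": (3, 1)}, {"id": "b", "dest": (3, 2)}]
print(route_cycle((1, 1), flits))   # both want "E"; the second flit gets deflected
```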


Hot-Potato
[Figure sequence, step-by-step hot-potato (deflection) routing animation: flits (H, F2, F3) travel independently from Src to Dest, and a flit whose preferred output is busy is deflected onto another link, so flits of the same packet may take different paths.]
NOC and SOC Design 70-79


Network-on-Chip

NOC and SOC Design 80


Core to Network Connection

NOC and SOC Design 81


NOC Switch/Router
Generic
Router/Switch

NOC and SOC Design 82


VC: Virtual-Channels

NOC and SOC Design 83


A Router Structure
[Figure: a router with ports connected to modules or to other routers]

• Flits are stored in input ports
• Each output port schedules
transmission of pending
flits according to:
Priority (Service Level)
Buffer space in the next router
Round-Robin on input
ports of the same SL
Preemption of lower priority
packets

NOC and SOC Design 84


Virtual Channel 2D Router

[Figure: 2D router with per-direction virtual channels (for NE, SE, SW, NW) and ports N, E, S, W]

NOC and SOC Design 85


A Typical Router Pipeline

[Figure: flit in → buffers → routing → VC allocation → arbitration → switch traversal → flit out]

NOC and SOC Design 86


CAD Problems for NOC
Application Mapping (map tasks to cores)
Floorplanning/Placement (within the network)
Routing (of messages)
Buffer Sizing (size of FIFO queues in the routers)
Timing Closure (Link bandwidth capacity allocation)
Simulation (Network simulation for traffic, delay, power
modeling)
Testing … Combined with problems of designing NOC itself
(topology synthesis, switching, virtual channels, arbitration,
flow control,……)

NOC and SOC Design 87


Topology Generation and Analysis
• Aim:
Generate a viable network topology.
Analyze the generated topology.

• Targeted Network:
Best-effort, wormhole switched.
Lookup table based source routing.
No virtual channel support.
Round Robin switch output arbitration.
One NI per component master or slave interface.
All transactions converted to packets of the same length (flit
count).
Burst beats converted to separate packets.
NOC and SOC Design 88
System Input and Output

• Input:
Core Graph
Network Parameters

• Output:
Topology Graph
Route tables
Recommended
Operating Clock
Frequency

NOC and SOC Design 89


Partitioned Crossbar Topologies
• Initial topology: Fully-
Connected Crossbar
(single switch).

• Ideal latency situation.

• May violate the maximum port requirement.

• Partitioning process.

NOC and SOC Design 90


MPEG4 - Decoder

[Figures A and B omitted]
Clock Frequency: 3.43 GHz

NOC and SOC Design


Simple Bus Model
Consider a computer system where each and
every Bus_request occupies the bus for the same
time.

Bus occupancy can be approximated by a simple computation of the
per-processor average.

(Offered) Bus occupancy, ρ = Bus_transaction_time / (Processor_time + Bus_transaction_time)

G. Khan Bus and NoC Models Page:


Multiprocessor Bus Model
without Re-submission
Processor time is the mean time a processor needs
to compute before making a bus request.
Processor may overlap a fraction of its compute
time with the bus time. Then the Processor time is
the net non-overlapped time between bus requests.
In any case, ρ ≤ 1.
For n processors using a single bus
Probability (a given processor doesn't access the bus) = (1 − ρ)
Probability (bus is busy) = 1 − (1 − ρ)^n
= Fraction of bus bandwidth realized = B(ρ, n)
G. Khan Bus and NoC Models Page: 3
Multiprocessor Bus Model without
Resubmission
The fraction of bandwidth realized (times) the
maximum bus bandwidth gives the achieved bus
bandwidth, B.
The achieved bandwidth fraction (bus occupancy)
per CPU (ρa) is given by: n·ρa = B(ρ, n)
Achieved bus occupancy ρa = B(ρ, n)/n
= [1 − (1 − ρ)^n]/n
A processor slows down by ρa/ρ
due to bus congestion.
G. Khan Bus and NoC Models Page:
4
Multiprocessor Bus Model with
Request Resubmission
Given by an iterative pair of equations.
The achieved bandwidth fraction (Bus occupancy)
per CPU (ρa) i.e.
n·ρa = 1 − (1 − a)^n
and
a = ρ / [ρ + (ρa/ρ)(1 − ρ)]
where a is the actual offered request rate.
To find the final ρa, initially set a = ρ to begin the iteration;
convergence occurs in a few iterations.
G. Khan Bus and NoC Models Page:
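A small numerical sketch of the iteration described above, taking the pair of equations exactly as written on this slide and starting with a = ρ; the example operating point (ρ = 0.2, n = 4) is my own choice, not from the source.

```python
# Iterative solution of the resubmission bus model, following the slide's equations:
#   n*rho_a = 1 - (1 - a)^n,    a = rho / (rho + (rho_a/rho)*(1 - rho))
def achieved_occupancy(rho, n, iters=50):
    a = rho                                   # start the iteration with a = rho
    rho_a = rho
    for _ in range(iters):
        rho_a = (1.0 - (1.0 - a) ** n) / n    # achieved occupancy per CPU
        a = rho / (rho + (rho_a / rho) * (1.0 - rho))
    return rho_a

# Example: 4 processors, each offering rho = 0.2
rho, n = 0.2, 4
rho_a = achieved_occupancy(rho, n)
print(f"achieved occupancy per CPU = {rho_a:.3f}, slowdown factor = {rho_a / rho:.3f}")
```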
Processors with Blocking Transactions
Blocking Transactions: The processor is idle after the bus
request is made and resumes computation only after the
bus transaction is complete.
A Single Bus Master with Blocking Transactions.
There is no bus contention as the processor waits for the
transaction to complete.
The achieved occupancy, ρa is the same as the
offered occupancy.

Therefore ρ = ρa = bus_trans_time / (compute_time + bus_trans_time)

G. Khan Bus and NoC Models


Processors with non-Blocking Trans.
n Bus Masters with Blocking Transactions
The offered occupancy is simply nρ,
where ρ is as given earlier.
Now contention can develop so we use our bus model
to determine the achieved occupancy, ρa
ρa = (Bus_trans_time)/
(compute_time+Bus_trans_time+Contention_time)

Alternative case buffered (or non-blocking) transactions


This is a complex case where a processor continues
processing after making a request, and may make
several requests before completion of initial request.
G. Khan Bus and NoC Models Page:
7
Bus Model Example
Suppose a processor has bus transactions that consist of cache line
transfers. Assume that 80% of the transactions move a single line
and occupy the bus for 20 cycles and 20% of the transactions move a
double line (as in dirty line replacement), which takes 36 cycles. The
mean bus transaction time is 23.2 cycles. Assume that a cache miss
(transaction) occurs every 200 cycles.

i). Find the bus occupancy for a single processor system.


ii). Find the bus occupancy when there are four processors connected to the bus.
Bus Model Example contd.
Suppose a processor has bus transactions that consist of cache line transfers. Assume that
80% of the transactions move a single line and occupy the bus for 20 cycles and 20% of the
transactions move a double line (as in dirty line replacement), which takes 36 cycles. The
mean bus transaction time is 23.2 cycles. Assume that a cache miss (transaction) occurs
every 200 cycles.

iii) Determine the contention time in the case of a 4-processor system.

G. Khan Bus and NoC Models
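A hedged worked solution for parts (i)-(iii), using the no-resubmission model B(ρ, n) = 1 − (1 − ρ)^n from the earlier slides; the numerical results are my own arithmetic and should be checked against the intended solution.

```python
# Worked bus-model example -- sketch using the no-resubmission model.
t_bus  = 0.8 * 20 + 0.2 * 36        # mean bus transaction time = 23.2 cycles
t_comp = 200                        # compute cycles between cache misses

# (i) single processor: offered occupancy
rho = t_bus / (t_comp + t_bus)      # ~0.104

# (ii) four processors: realized bandwidth fraction and per-CPU achieved occupancy
n = 4
B = 1 - (1 - rho) ** n              # ~0.356 of the bus bandwidth is used
rho_a = B / n                       # ~0.089 achieved occupancy per CPU

# (iii) contention time per transaction, from rho_a = t_bus / (t_comp + t_bus + t_cont)
t_cont = t_bus / rho_a - t_comp - t_bus   # ~38 cycles of waiting per transaction

print(f"rho = {rho:.3f}, B = {B:.3f}, rho_a = {rho_a:.3f}, contention = {t_cont:.1f} cycles")
```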


NoC Communication
A 4×4 2D-torus: N =16 (Number of nodes or Routers)
n = 2 (dimension of the network), and its diameter (maximum travel distance) is 4.
Dimension of the network and its maximum distance are
important to NoC cost and performance.
NoC Links are characterized in three ways:
Cycle Time of the Link, Tch corresponds to time
required to transmit between neighboring nodes.
1/Tch is the bandwidth of a wire in the NoC link.
Width of the Link, w is the number of bits that can be
concurrently transmitted between two nodes.
A message consists of L data bits plus H header bits, where the header carries the destination address.
Tch ×(L + H)/w will be the time required to transmit a message
between two adjacent nodes.
G. Khan Bus and NoC Models Page:
NoC Communication
Wormhole Routing based Communication:
When a message packet is received at a node (or router), it is
buffered only long enough to decode its header to determine
its destination.
As soon as the basic routing information is determined,
the message is retransmitted to destination node.
The amount of buffering required at intermediate
nodes is significantly reduced.
The overall time of transmission between two given nodes:
Twormhole = Tch × (d·h + L/w), where h = ⌈H/w⌉
and d is the number of hops between the source and destination nodes

G. Khan Bus and NoC Models Page:


NoC Communication Example-1
Consider a wormhole routing based communication in a 2D 4×4
NoC. Determine the average number of clock cycles needed to
transmit 256-bits of data among various nodes of the NoC where the
width of NoC link is 16, and the header size is 64-bits.

G. Khan Bus and NoC Models Page:


NoC Communication Example-2
Consider a wormhole routing based communication in a 2D 4×4
Torus NoC. Determine the minimum number of clock cycles
required to transmit 1K-bits of data from source node (1,1) to
destination node (3,4) where the width of NoC link is 32, and the
header size is 32 bits.

G. Khan Bus and NoC Models Page:
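A sketch of how the wormhole latency formula T = Tch·(d·h + L/w) from the previous slides can be applied to these two examples. The average-distance calculation for Example-1 assumes a 4×4 mesh with X-Y (Manhattan) hop distances, and the minimum distance for Example-2 uses per-dimension torus wrap-around; the resulting numbers are my own arithmetic, not taken from the source.

```python
from itertools import product
from math import ceil

# Wormhole transfer time in link cycles: T/Tch = d*h + L/w, with h = ceil(H/w).
def wormhole_cycles(d, L, H, w):
    return d * ceil(H / w) + ceil(L / w)

# Example-1 (assuming a 4x4 mesh and X-Y hop distance): average over all distinct pairs.
nodes = list(product(range(4), repeat=2))
dists = [abs(ax - bx) + abs(ay - by)
         for (ax, ay) in nodes for (bx, by) in nodes if (ax, ay) != (bx, by)]
d_avg = sum(dists) / len(dists)                       # average hop count ~2.67
avg_cycles = d_avg * ceil(64 / 16) + 256 / 16         # h = 4 header flits, 16 data flits
print(f"Example-1: average distance {d_avg:.2f} hops -> ~{avg_cycles:.1f} cycles")

# Example-2: 4x4 torus, (1,1) -> (3,4); per-dimension distance is min(|delta|, 4 - |delta|).
d = min(abs(3 - 1), 4 - abs(3 - 1)) + min(abs(4 - 1), 4 - abs(4 - 1))   # 2 + 1 = 3 hops
print(f"Example-2: {wormhole_cycles(d, L=1024, H=32, w=32)} cycles")     # 3*1 + 32 = 35
```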


Communication-centric Design
Communication is the most critical aspect
affecting system performance
Communication architecture consumes up to 50% of
total on-chip power
Ever increasing number of wires, repeaters, bus
components (arbiters, bridges, decoders etc.) increases
system cost
Communication architecture design, customization,
exploration, verification and implementation takes up
the largest chunk of a design cycle
Communication Architectures in complex systems and SoCs
significantly affects performance, power, cost & time-to-market!
COE838: SoC Design ©G. Khan 3
Interconnection Strategies
Tradeoffs between interconnection complexity/parallelism.
How to interconnect hardware modules?
Consider 4 modules (registers) capable of exchanging
their contents. Methods of interconnection.

Notation for data swap: SWAP(Ri, Rj)

Concurrent register transfer (both in the same clock):  Ri <= Rj;  Rj <= Ri;
Sequential, using a temporary register:  temp <= Ri;  Ri <= Rj;  Rj <= temp;

Module-to-Module Communication
Point-to-point
Single bus
Multiple buses

COE838: SoC Design ©G. Khan 4


Point-to-Point Connection
Four Modules interconnected via 4:1 MUXes and point-to-point
connections.

Modules have edge-triggered N-bit registers to transfer, controlled by LDi signals.


Nx4:1 Multiplexers per module (register) controlled by Si<1:0> control signals.
Control of SWAP operation, SWAP(R1, R2): control signals
01 S2<1:0>;  10 S1<1:0>;    (establish the connection paths)
1 LD2;  1 LD1;              (swap takes place at the next active clock edge)

COE838: SoC Design ©G. Khan 5


MUX and Bus Transfers

Transfer S1 S0 L2 L1 L0
R0 <= R2 1 0 0 0 1
R0 <= R1, R2 <= R1 0 1 1 0 1
R0 <= R1, R1 <= R0 Impossible
COE838: SoC Design ©G. Khan 6
Single Bus Interconnection

The per-module MUXes are replaced by a single MUX block.

25% of the hardware cost of the previous alternative.
The shared set of interconnections is called a BUS.
Multiple Transfers R0 <= R1 and R3 <= R2
State X: (R0 <= R1) State Y: (R3 <= R2)
01 S<1:0>; 10 S<1:0>;
1 LD0; 1 LD3;

Two control states are required for transfers.


COE838: SoC Design ©G. Khan 7
Alternatives to Multiplexers

Tri-state buffers as an interconnection scheme


Reduces Interconnection Physical Wires
Only one register's contents are gated onto the shared bus at a time.
Decoder decodes the input control lines S<1:0> and generates one select
signal to enable only one tri-state buffer.

COE838: SoC Design ©G. Khan 8


Tri-State Bus

Tri-state bus using bi-


directional lines.
Instead of separate Input
and Output Lines

Register with bi-directional


input-output lines

COE838: SoC Design ©G. Khan 9


Bus: Basic Architecture
PCB Busses – VME, Multibus-II, ISA, EISA, PCI and PCI
Express
Bus is made of wires shared by multiple units with
logic to provide an orderly use of the bus.
Devices can be Masters or Slaves.
Arbiter determines - which device will control the
bus.
Bus protocol is a set of rules for transmitting
information between two or more devices over a bus.
Bus bridge connects two buses, which are not of the
same type having different protocols.
Buses may be unified or split type (address and data).

COE838: SoC Design ©G. Khan 10


Bus: Basic Architecture
Decoder determines the target for any transfer initiated by a
master

COE838: SoC Design ©G. Khan 11


Bus Signals
Typically a bus has three types of signal lines: address, data and control

Address
Carry the address of the destination for which the transfer is initiated
Can be shared or separate for read, write data
Data
Transfer information between source and destination devices
Can be shared or separate for read, write data
Control
Requests and acknowledgements
Specify more information about the type of data transfer, e.g. byte
enable, burst size, cacheable/bufferable

COE838: SoC Design ©G. Khan 12
Interconnect Architectures
IP blocks need to communicate among each other
System level issues and specifications of an SoC
Interconnect:
Communication Bandwidth – Rate of Information Transfer
Communication Latency – Delay between a module
requesting the data and receiving a response to its request.
Master and Slave – Initiate (Master) or response (Slave)
to communication requests
Concurrency Requirements – Simultaneous Comm. Channels
Packet or Bus Transaction – Information size per transaction
Multiple Clock Domains – IP module operate at different clocks

COE838: SoC Design ©G. Khan 13


Bus Basics
Master Slave °°°

Control Lines
Address Lines
Data Lines

Bus Master: has ability to control the bus, initiates transaction


Bus Slave: module activated by the transaction
Bus Communication Protocol: specification of sequence of events
and timing requirements in transferring information.
Asynchronous Bus Transfers: control lines (req, ack) serve to
orchestrate sequencing.
Synchronous Bus Transfers: sequence relative to common
clock.

COE838: SoC Design ©G. Khan 14


Embedded Systems busses
Atmel SAM3U

COE838: SoC Design ©G. Khan 15


Types of Bus Topologies

Shared bus

COE838: SoC Design ©G. Khan 16


Types of Bus Topologies
Hierarchical shared bus

Improves system throughput


Multiple ongoing transfers on
different buses

COE838: SoC Design ©G. Khan 17


Types of Bus Topologies
Split bus

Reduces impact of capacitance across two segments


Reduces contention and energy

COE838: SoC Design ©G. Khan 18


Types of Bus Topologies
Full crossbar/matrix bus (point to point)

COE838: SoC Design ©G. Khan 19


Types of Bus Topologies
Ring bus

COE838: SoC Design ©G. Khan 20


Bus Physical Structure
Tri-state buffer based bidirectional signals

Commonly used in off-chip/backplane buses


+ take up fewer wires, smaller area footprint
- higher power consumption, higher delay, hard to debug

COE838: SoC Design ©G. Khan 21


Bus Physical Structure
MUX based signals

Separate read, write channels

COE838: SoC Design ©G. Khan 22


Bus Clocking
Synchronous Bus
◦ Includes a clock in control lines
◦ Fixed protocol for communication that is relative to clock
◦ Involves very little logic and can run very fast
◦ Require frequency converters across frequency domains

COE838: SoC Design ©G. Khan 23


Bus Clocking
Asynchronous Bus
◦ No clock
◦ Requires a handshaking protocol
performance not as good as that of synchronous bus
No need for frequency converters, but does need extra lines
◦ No clock skew as in the synchronous bus

COE838: SoC Design ©G. Khan 24


Decoding and Arbitration
Decoding
◦ determines the target for any transfer initiated by a
master
Arbitration
◦ decides which master can use the shared bus if more
than one master request bus access simultaneously
Decoding and Arbitration can either be
◦ centralized
◦ distributed

COE838: SoC Design ©G. Khan 25


Centralized Decoding and Arbitration

Minimal change is required if new components are added to the system
COE838: SoC Design ©G. Khan 26
Distributed Decoding and Arbitration

+ requires fewer signals as compared to centralized method


- more hardware duplication and logic/ chip-area

COE838: SoC Design ©G. Khan 27


Arbitration Schemes
Random: Randomly select master to grant the bus access
Static priority
◦ Masters assigned static priorities
◦ Higher priority master request always serviced first
◦ Can be pre-emptive (AMBA-2) or non-preemptive (AMBA-3)
◦ May lead to starvation of low priority masters
Round-Robin (RR)
◦ Masters allowed to access bus in a round-robin manner
◦ No starvation – every master guaranteed bus access
◦ Inefficient if masters have vastly different data injection rates
◦ High latency for critical data streams
COE838: SoC Design ©G. Khan 28
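A behavioral sketch of the round-robin policy just listed, written in Python rather than RTL; the rotating-pointer formulation is the usual textbook one and is not taken from any specific bus specification.

```python
# Round-robin bus arbiter -- behavioral sketch (not RTL).
class RoundRobinArbiter:
    def __init__(self, n_masters):
        self.n = n_masters
        self.last = self.n - 1            # pointer to the most recently granted master

    def grant(self, requests):
        """requests: list of bools, one per master. Returns the granted master id or None."""
        for offset in range(1, self.n + 1):
            m = (self.last + offset) % self.n   # search starting after the last grant
            if requests[m]:
                self.last = m
                return m
        return None                        # no master is requesting

arb = RoundRobinArbiter(4)
print(arb.grant([True, False, True, False]))   # -> 0
print(arb.grant([True, False, True, False]))   # -> 2 (master 0 must wait its turn)
print(arb.grant([True, False, True, False]))   # -> 0 again: no starvation
```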
Arbitration Schemes
TDMA
◦ Time division multiple access
◦ Assign slots to masters based on BW requirements
◦ If a master does not have anything to read/write during
its time slots, leads to low performance
◦ Choice of time slot length and number critical
TDMA/RR
◦ Two-level scheme
◦ If master does not need to utilize its time slot, second level
RR scheme grants access to another waiting master
◦ Better bus utilization
◦ Higher implementation cost for scheme (more logic, area)

COE838: SoC Design ©G. Khan 29


Arbitration Schemes
Dynamic priority
◦ Dynamically vary priority of master during
application execution
◦ Gives masters with higher injection rates a higher priority
◦ Requires additional logic to analyze traffic at runtime
◦ Adapts to changing data traffic profiles
◦ High implementation cost
(several registers to track priorities and traffic profiles)
Programmable priority
◦ Simpler variant of dynamic priority scheme
◦ Programmable register in arbiter allows software to change priority

COE838: SoC Design ©G. Khan 30


Bus Data Transfer Modes
Single Non-pipelined Transfer
◦ The Simplest Transfer Mode
first request for access to bus from arbiter
on being granted access, set address and control signals
Send/receive data

COE838: SoC Design ©G. Khan 31


Bus Data Transfer Modes
Pipelined Transfer - Overlap address and data phases

COE838: SoC Design ©G. Khan 32


Bus Data Transfer Modes
Non-pipelined Burst Transfer
Send multiple data items, with only a single arbitration for entire
transaction
master must indicate to the arbiter it intends to perform a
burst transfer
Saves time spent requesting for arbitration

COE838: SoC Design ©G. Khan 33


Bus Data Transfer Modes
Pipelined Burst Transfer
◦ Useful when separate address and data buses available

COE838: SoC Design ©G. Khan 34


Bus Data Transfer Modes
Split Transfer
◦ If slaves take a long time to read/write data, it can
prevent other masters from using the bus
◦ Split transfers improve performance by 'splitting' a
transaction
Master sends read request to slave
Slave relinquishes control of bus as it prepares data
Arbiter can grant bus access to another waiting master
Allows utilizing otherwise idle cycles on the bus
When slave is ready, it requests bus access from
arbiter
On being granted access, it sends data to master
◦ Explicit support for split transfers required from slaves
and arbiters

COE838: SoC Design ©G. Khan 35


Bus Data Transfer Modes
Out-of-Order Transfer
◦ Allows multiple transfers from different masters, or same master, to
be SPLIT by a slave and be in progress simultaneously on a
single bus
◦ Masters can initiate data transfers without waiting for earlier data
transfers to complete
◦ Allows better parallelism, performance in buses
◦ Additional signals are needed to transmit IDs for every data transfer
in the system
◦ Master interfaces need to be extended to handle data transfer
IDs and be able to reorder the received data
◦ Slave interfaces have out-of-order buffers for reads, writes, to keep track
of pending transactions, plus logic for processing IDs
Any application typically has a limited buffer size beyond
which performance doesn't increase.

COE838: SoC Design ©G. Khan 36


Physical implementation - Bus wires
Bus wires are implemented as long metal lines on
silicon transmitting data using electromagnetic waves

As application performance requirements increase, clock


frequencies are also increasing
◦ Greater bus clock frequency = shorter bus clock period
100 MHz = 10 ns ; 500 MHz = 2 ns
Time allowed for a signal on a bus to travel from source to
destination in a single bus clock cycle is decreasing
Can take multiple cycles to send a signal across a chip
6-10 bus clock cycles @ 50 nm
unpredictability in signal propagation time has serious consequences
for performance and correct functioning of synchronous digital
circuits

COE838: SoC Design ©G. Khan 37


Physical implementation Issues - Bus wires
Partition long bus wires into shorter ones
◦ Hierarchical or split bus communication architectures
◦ Register slices or buffers to pipeline long bus wires
enable signal to traverse a segment in one clock
cycle
[Figure: a synchronous source component and a synchronous destination component connected by bus wires 1…n, with flip-flop (FF) repeaters pipelining each wire]

Asynchronous buses
Low-level techniques: add repeaters
COE838: SoC Design ©G. Khan 38
Summary
On-chip communication architectures are
critical components in SoC designs
◦ Power, performance, cost, reliability constraints
◦ Rapidly increasing in complexity with the no. of cores
Review of basic concepts of (widely used)
bus- based communication architectures
Open Problems
◦ Designing communication architectures to satisfy
diverse and complex application constraints

COE838: SoC Design ©G. Khan 39


SoC Integration and Interconnect
Architectures
• SoC Integration is the most important part of
SoC design.
Integration of IP cores.
The method used to connect the IP cores.
Maximize design reuse to lower cost.
• SoC Interconnect Architectures
Bus-based Interconnection.
NoC: Network on Chip that hides the physical
interconnects from the designer.
G. Khan Page: 2
Actel SmartFusion system/bus

G. Khan SoC Bus Interconnexion Structures Page: 3


SoC Bus Architectures

HW Area for a Slave

G. Khan SoC Bus Interconnexion Structures Page: 4


• The Advanced Microcontroller Bus Architecture (AMBA) bus
protocols are a set of interconnect specifications from ARM that
standardize on-chip communication mechanisms between various
functional blocks (or IP) for building high-performance SOC designs.

• These designs typically have one or more micro controllers or


microprocessors along with several other components —
• Internal memory or external memory bridge, DSP, DMA,
accelerators and various other peripherals like USB, UART, PCIE, I2C
etc — all integrated on a single chip.

• The primary motivation of the AMBA protocols is to have a standard
and efficient way of interconnecting these blocks, with re-use across
multiple designs.
• The first step in learning AMBA protocols is to
understand where exactly these different
protocols are used, how they evolved and how
all of them fit into a SOC design.

• Following diagram (reference from the AMBA 2.0


spec) illustrates a traditional AMBA based SOC
design that uses the AHB (Advanced High
performance) or ASB (Advanced System Bus)
protocols for high bandwidth interconnect and an
APB (Advanced Peripheral Bus) protocol for low
bandwidth peripheral interconnects.
AMBA Bus History

• The AMBA was first introduced by a company named ARM in 1996. The first buses
used in AMBA were the Advanced Peripheral Bus or APB and the Advanced System
Bus or ASB. The design was an immediate success, and this was followed in 1999
by the AMBA 2. In this version, the AMBA added a high-performance bus or AHB
that used a singular clock-edge protocol which advanced the design of the
product.

• By 2003, the AMBA 3 was created and it introduced the Advanced Extensible
Interface or AXI which boosted the performance of the interconnect to an even
higher degree. It also brought along the Advanced Trace Bus or ATB which was
used on the CoreSight trace solution and on-chip debug. This design lasted for
several years until it was surpassed in 2010 by the AMBA 4. This version boosted
the AXI to a considerable degree and laid the foundation for newer versions.

• By 2013, the AMBA 5 came along and provided the Coherent Hub Interface or CHI
along with a newly designed high-speed transport application that helped reduce
congestion and create a streamlined approach. So significant has the impact of
AMBA been that today the protocols are considered the industry standard for all
embedded processors.
• With increasing number of functional blocks (IP)
integrating into SOC designs, the shared bus
protocols (AHB/ASB) started hitting limitations.
• the new revision of AMBA 3 introduced a point to
point connectivity protocol — AXI (Advanced
Extensible Interface). Further in 2010, an
enhanced version was introduced — AXI 4.
• Following diagram illustrates this evolution of
protocols along with the SOC design trends in
industry.
Advanced eXtensible Interface (AXI)
How AMBA Bus Works

• The AMBA bus was designed to address the interconnect for SoC
application and have the peripherals interface with each other
more efficiently. The purpose of the AMBA bus is to do the
following:

• Unify and standardize SoC interconnect IP


• Enable and promote SoC modular design
• Easy reuse of IP cores
• Allow 1st time right development of SoC with one or more
embedded CPUs
• Supports high performance along with low-power communication
• The modular design helps to boost the development of IP cores
which are technology independent and the reuse of IP cores to help
accelerate and reduce cost of future designs.
• APB
• APB is low bandwidth protocol optimized for low power and low
complexity to support peripherals. It is used as low-cost interface to
peripherals, which do not require high-performance of pipelined bus
interface. Any transfer takes at least 2 cycles. A typical APB system has APB
bridge which interfaces with the AHB, AXI or ASB and peripherals which
are the slaves. It can be used to access programmable registers of the
peripheral devices.

• ASB
• ASB supports features for high-performance systems like burst transfers,
pipelined transfer operation and multiple bus masters. It supports
connection of many processors and memories. ASB bus consists of Master,
Slave, Arbiter and Decoder. Only one master can access the bus at any
time with the help of arbiter. Master initiates the read and write
operations and slave responds to the read and write requests. Address
and appropriate slave are selected using decoder.

• AHB
• AHB is specifically designed for high-performance systems. It supports
multiple bus masters and supports high bandwidth operations. A typical
AMBA system design contains AHB master, AHB slave, AHB arbiter and
AHB decoder. It is used to connect components like DMA, DSP and
Memory that require high bandwidth on a shared bus.
• AMBA AHB supports features required by high
bandwidth and high frequency designs:

• Burst transfers
• Split Transactions
• Wider data bus configurations (64/128 bits)
• Single-clock edge operation
• Single-cycle bus master handover
AXI
Advanced eXtensible Interface (AXI)


AXI is a point-to-point interconnect protocol that
overcomes limitations of shared bus protocols. It
targets high-performance and high-frequency systems
with key features as:
• Multiple outstanding transactions
• Out-of-order data completion
• Burst-based transactions with only start address issued
• Support for unaligned data transfers using strobes
• Simultaneous read and write transactions
• Pipelined interconnects for high speed operation
• ACE
ACE protocol extends the AXI4 protocol along with
hardware-coherent caches. ACE coherency protocol
ensures all the masters see correct data for any address
location. This avoids software cache maintenance to maintain
coherency between caches. ACE also provides barrier
transactions that guarantee ordering of multiple
transactions within a system and Distributed Virtual
Memory (DVM) functionality to manage virtual memory.
• CHI
CHI protocol defines interfaces for connection of fully
coherent processors. It is a packet based layered
communication protocol with Protocol, Link and Network
layer. It is topology independent and provides Quality of
Service (QoS) based mechanism to control resources in the
system. It supports high-frequency and non-blocking
coherent data transfers between processors that provides
performance and scale for applications like data center.
On-Chip Busses
• AMBA 2.0, 3.0 (ARM)
• CoreConnect (IBM)
• STBus (STMicroelectronics)
• Sonics Smart Interconnect (Sonics)
• Wishbone (Opencore)
• Avalon (Altera)
• PI Bus (OMI)
• MARBLE (Univ. of Manchester)
• CoreFrame (PalmChip)

G. Khan SoC Bus Interconnexion Structures Page: 5


AMBA 2.0

G. Khan SoC Bus Interconnexion Structures Page: 6


AMBA Busses
Advanced Microcontroller Bus Architecture
Actually three standards: APB, AHB, and AXI
AHB – Advanced High-Performance Bus
Pipelining of Address / Data
Split Transactions
Multiple Masters
APB – Advanced Peripheral Bus
Low Power / Bandwidth Peripheral Bus

Very commonly used for commercial IP cores

G. Khan SoC Bus Interconnexion Structures Page: 7


APB Bus
• A simple bus that is easy to work with
• Low-cost
• Low-power
• Low-complexity
• Low-bandwidth
• Non-pipelined
• Ideal for peripherals

G. Khan SoC Bus Interconnexion Structures Page: 8


APB bus state machine
• IDLE
Default APB state
• SETUP
When transfer required
PSELx is asserted
Only one cycle
• ACCESS
PENABLE is asserted
Addr, write, select, and
write data remain stable
Stay in ACCESS if PREADY = L
Go to IDLE if PREADY = H
and no more data
Go to SETUP if PREADY = H
and more data pending

G. Khan SoC Bus Interconnexion Structures Page: 9


Notations

G. Khan SoC Bus Interconnexion Structures Page: 10


APB bus States
• IDLE
Default APB state
• SETUP
When transfer required
PSELx is asserted
Only one cycle
• ACCESS
PENABLE is asserted
Addr, write, select, and
write data remain stable
Stay in ACCESS if PREADY = L
Go to IDLE if PREADY = H
and no more data
Go to SETUP if PREADY = H
and more data pending
[Timing diagram: Setup Phase followed by Access Phase]

G. Khan SoC Bus Interconnexion Structures Page: 11


APB Signals
• PCLK: bus clock source (rising-edge triggered)
• PRESETn: bus (and typically system) reset signal (active
low)
• PADDR: APB address bus (can be up to 32-bits wide)
• PSELx: select line for each slave device
• PENABLE: indicates the 2nd and subsequent cycles of an
APB xfer
• PWRITE: indicates transfer direction (Write=H, Read=L)
• PWDATA: write data bus (can be up to 32-bits wide)
• PREADY: used to extend a transfer
• PRDATA: read data bus (can be up to 32-bits wide)
• PSLVERR: indicates a transfer error
(OKAY=L, ERROR=H)
G. Khan SoC Bus Interconnexion Structures Page: 12
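The IDLE/SETUP/ACCESS sequencing and the PSEL/PENABLE/PREADY signals described above can be made concrete with the behavioral sketch below. It is written in Python purely for illustration (it is not RTL), and it ignores PSLVERR, PADDR/PWDATA values and multiple slaves.

```python
# APB master state machine -- behavioral sketch of IDLE -> SETUP -> ACCESS sequencing.
def apb_fsm(state, transfer_pending, pready):
    """Return (next_state, PSEL, PENABLE) for one PCLK edge."""
    if state == "IDLE":
        return ("SETUP", 1, 0) if transfer_pending else ("IDLE", 0, 0)
    if state == "SETUP":                 # SETUP always lasts exactly one cycle
        return ("ACCESS", 1, 1)
    if state == "ACCESS":
        if not pready:                   # slave extends the transfer with wait states
            return ("ACCESS", 1, 1)
        return ("SETUP", 1, 0) if transfer_pending else ("IDLE", 0, 0)

# One transfer with a single wait state:
state = "IDLE"
for cycle, (pending, pready) in enumerate([(1, 0), (1, 0), (1, 0), (0, 1), (0, 1)]):
    state, psel, penable = apb_fsm(state, pending, pready)
    print(f"cycle {cycle}: state={state:6s} PSEL={psel} PENABLE={penable}")
```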
APB bus Signals
• PCLK
Clock
• PADDR
Address on bus
• PWRITE
1=Write,
0=Read
• PWDATA
* Data written to
the I/O device.
* Supplied by
the bus
master/processor.

G. Khan SoC Bus Interconnexion Structures Page: 13


APB bus signals
• PSEL
Asserted if the current bus
transaction is targeted to
this device
• PENABLE
High during entire
transaction other than the
first cycle.
• PREADY
Driven by the target. Similar
to our #ACK. Indicates if the
target is ready to complete the
transaction.

G. Khan SoC Bus Interconnexion Structures Page: 14


A Write Transfer - No Wait States
Setup phase begins
with this rising edge

Setup Access
Phase Phase

G. Khan SoC Bus Interconnexion Structures Page: 15


A Write Transfer with Wait States
Wait as Peripheral is
not ready

Setup Wait Wait Access


Phase State State Phase

G. Khan SoC Bus Interconnexion Structures Page: 16


A Read Transfer - No Wait States
Setup phase begins with
this rising edge

Setup Access
Phase Phase

G. Khan SoC Bus Interconnexion Structures Page: 17


A Read Transfer with Wait States

Wait Access
Setup State
Phase Phase

G. Khan SoC Bus Interconnexion Structures Page: 18


AHB Bus
centralized arbitration / decode

• one unidirectional
address bus
(HADDR)
• two unidirectional
data buses
(HWDATA,
HRDATA)
• At any time only one
active data bus
Simple AHB Transfer
no wait state

G. Khan SoC Bus Interconnexion Structures Page: 20


AHB-Lite Bus Master/Slave Interface
AHB-Lite: Single Master
Global signals
HCLK
HRESETn
Master out/Slave in
HADDR (address)
HWDATA (write data)
Control
• HWRITE
• HSIZE
• HBURST
• HPROT
• HTRANS
• HMASTLOCK
Slave out/Master in
HRDATA (read data)
HREADY
HRESP

G. Khan SoC Bus Interconnexion Structures Page: 21


AHB-LITE
Signals
Global Signals
• HCLK: the bus clock source (rising-edge triggered)
• HRESETn: the bus (and system) reset signal (active low)
Master out/slave in
• HADDR[31:0]: the 32-bit system address bus
• HWDATA[31:0]: the system write data bus
• Control
HWRITE: indicates transfer direction (Write=1, Read=0)
HSIZE[2:0]: indicates size of transfer (byte, halfword, or word)
HBURST[2:0]: indicates single or burst transfer (1, 4, 8, 16 beats)
HPROT[3:0]: provides protection information (e.g. I or D; user or handler)
HTRANS: indicates current transfer type (e.g. idle, busy, nonseq, seq)
HMASTLOCK: indicates a locked (atomic) transfer sequence
Slave out/master in
HRDATA[31:0]: the slave read data bus
HREADY: indicates previous transfer is complete
HRESP: the transfer response (OKAY=0, ERROR=1)
Basic Read and Write – No Wait States

Pipelined
Address
& Data
Transfer

G. Khan SoC Bus Interconnexion Structures Page: 23


Simple AHB Transfer

Data transfer with slave wait states


G. Khan SoC Bus Interconnexion Structures Page: 24
Read – Two Wait States

Two wait states added by the slave by asserting HREADY low
Valid data produced

G. Khan SoC Bus Interconnexion Structures Page: 25


Wait States Extend the Address Phase of
Next Transfer
Address stage of
the next transfer
is also extended

One wait state


added by slave
by asserting
HREADY low
G. Khan SoC Bus Interconnexion Structures Page: 26
AHB Wait Cycles
Slave may not be ready to service the request
It inserts Wait cycle(s) by using HREADY

G. Khan SoC Bus Interconnexion Structures Page: 27


AHB Pipelining

G. Khan SoC Bus Interconnexion Structures Page: 28


AHB Pipelined Transactions

Transaction A Starts Transaction A Completes

Transaction B Starts

G. Khan SoC Bus Interconnexion Structures Page: 29


AHB Pipelining with Burst
Address and data of consecutive transfers are
transmitted in the same clock cycle

G. Khan SoC Bus Interconnexion Structures Page: 30


Transfer Types
Four types (HTRANS[1:0])
IDLE (00)
No data transfer is required
Slave must OKAY w/o waiting
Slave must ignore IDLE
BUSY (01)
Insert idle cycles in a burst
Burst will continue afterward
Address/control reflects next transfer in burst
Slave must OKAY w/o waiting
Slave must ignore BUSY
NONSEQ (10)
Indicates single transfer or first transfer of a burst
Address/control unrelated to prior transfers
SEQ (11)
Remaining transfers in a burst
Addr = prior addr + transfer size

G. Khan SoC Bus Interconnexion Structures Page: 31


AHB Pipelined Burst Transfers

Bursts cut down arbitration, handshaking time, improve performance

G. Khan SoC Bus Interconnexion Structures Page: 32


4-beat Burst – Master Busy, Slave Wait

One wait state added by slave


by asserting HREADY
low
G. Khan SoC Bus Interconnexion Structures Page: 33
AHB Burst Types

• Bursts of 1, 4, 8, 16 and undefined length. INCR bursts access sequential locations,
e.g. 0x64, 0x68, 0x6C, 0x70 for INCR4 transferring 4-byte data.
Wrapping bursts "wrap around" the address if the starting address is not aligned
to the total number of bytes in the transfer, e.g. 0x64, 0x68, 0x6C, 0x60 for WRAP4
transferring 4-byte data. Another example: 0x34, 0x38, 0x3C, 0x30.
• Burst must not cross 1KB address boundaries.
G. Khan SoC Bus Interconnexion Structures Page: 34
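A small sketch that reproduces the INCR4/WRAP4 address sequences quoted above. It assumes, as the following slide states, that the wrap boundary is the total number of bytes in the transfer (beat size × number of beats); the function name and parameters are my own.

```python
# AHB burst address generation -- sketch for incrementing and wrapping bursts.
def burst_addresses(start, beats, size_bytes, wrapping):
    total = beats * size_bytes                   # wrap boundary = total bytes in the burst
    boundary = (start // total) * total
    addrs, addr = [], start
    for _ in range(beats):
        addrs.append(addr)
        addr += size_bytes
        if wrapping and addr >= boundary + total:
            addr = boundary                      # wrap back to the aligned boundary
    return [hex(a) for a in addrs]

print(burst_addresses(0x64, 4, 4, wrapping=False))  # INCR4: 0x64, 0x68, 0x6c, 0x70
print(burst_addresses(0x64, 4, 4, wrapping=True))   # WRAP4: 0x64, 0x68, 0x6c, 0x60
print(burst_addresses(0x34, 4, 4, wrapping=True))   # WRAP4: 0x34, 0x38, 0x3c, 0x30
```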
AHB Control
Signals
Transfer Direction: HWRITE – write transfer when high,
read transfer when low
Transfer Size: HSIZE[2:0] indicates the size of the transfer
HSIZE + HBURST determine wrapping boundary for
WRAP burst.
WRAP4: 4 Beat Wrapping Burst

G. Khan SoC Bus Interconnexion Structures Page: 36


INCR4: 4 Beat Incrementing Burst

G. Khan SoC Bus Interconnexion Structures Page: 37


WRAP8: 8-Beat Wrapping Burst

G. Khan SoC Bus Interconnexion Structures Page: 38


INCR8: 8-Beat Incrementing Burst for
Half-word Transfers

G. Khan SoC Bus Interconnexion Structures Page: 39


INCR: Undefined Incrementing Burst

G. Khan SoC Bus Interconnexion Structures Page: 40


Multi-master AHB Requires a Multi-layer
Interconnect

Multi-master operation
• Must isolate masters
• Each master assigned
to layer
• Interconnect arbitrates
slave accesses
Full crossbar switch
often not needed
• Slaves 1, 2, 3 are
shared
• Slaves 4, 5 are local
to Master 2

G. Khan SoC Bus Interconnexion Structures Page: 41


AMBA Bus Arbitration
• Several masters and slaves are connected to AHB.
• An arbiter decides which master will transfer data.
• Data is transferred from a master to a slave in
bursts.
• Any burst involves read/write of a sequence of
addresses.
• The slave to service a burst is chosen depending on
the addresses (decided by a decoder).
• AHB is connected to APB via a bus bridge.

G. Khan SoC Bus Interconnexion Structures Page: 42


AHB Arbitration

Arbiter
HBREQ_M1

HBREQ_M2

HBREQ_M3

G. Khan SoC Bus Interconnexion Structures Page: 43


Arbitration Cost
Time for handshaking
Time for arbitration

G. Khan SoC Bus Interconnexion Structures Page: 44


Request Grant Protocol
Performance Impact

The
transaction
proceeds

Before a transaction a master


makes a request to the central
arbiter

G. Khan SoC Bus Interconnexion Structures Page: 45


AHB Split Transfers

• Improves bus utilization

G. Khan SoC Bus Interconnexion Structures Page: 46


AHB Bus Matrix
AHB can be employed and implemented as a
bus matrix.

G. Khan SoC Bus Interconnexion Structures Page: 47


AHB-APB
Bridge

High performance Low power (& performance)


AMBA 3.0
Introduces AXI
Support for separate read address, write address, read
data, write data, write response channels
Out of order transaction completion
Fixed mode burst support
• Useful for I/O peripherals
Advanced system cache support
• Specify if transaction is cacheable and buffer-able
• Specify attributes such as write-back/write-through
Enhanced protection support
• Secure/non-secure transaction specification
Exclusive access (for semaphore operations)
Register slice support for high frequency operation

G. Khan SoC Bus Interconnexion Structures Page: 49


AMBA AXI Read Channels

Give me some data

Independent

Here it is

G. Khan SoC Bus Interconnexion Structures Page: 50


AMBA AXI Write Channels

I'm sending data. Please store it.

Independent

Here is the data.

Independent

I received that data correctly.

G. Khan SoC Bus Interconnexion Structures Page: 51


AMBA AXI Write Channels
Sending data, store it.

Independent

The data is here.

Independent

I received that data correctly.
(Channels are synchronized with an ID # or "tags")

G. Khan SoC Bus Interconnexion Structures Page: 52


AMBA AXI Flow-Control
• Information moves only when: Source is Valid, and Destination is Ready
• On each channel the master or slave can limit the flow
• Very flexible (see the handshake sketch below)
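A minimal sketch of the per-channel rule above; the type and function names are invented for illustration, not part of any AXI header:

```c
/* Per-channel AXI handshake: a beat transfers only in a cycle where both
 * VALID (from the source) and READY (from the destination) are high. */
#include <stdbool.h>

typedef struct {
    bool valid;   /* driven by the source of the channel      */
    bool ready;   /* driven by the destination of the channel */
} axi_handshake_t;

/* Evaluated once per clock cycle: true means the payload moves this cycle.
 * Either side can stall the other simply by deasserting its signal. */
static bool channel_fires(const axi_handshake_t *ch)
{
    return ch->valid && ch->ready;
}
```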
AMBA AXI Read
(Timing diagram: read address channel followed by the read data channel.)
AMBA AXI Write
(Timing diagram: write address channel, write data channel, and write response channel.)
AHB vs. AXI Burst
AHB Burst
• Address and Data are locked together (a single pipeline stage).
• HREADY controls the intervals of address and data.
AXI Burst
• One address for the entire burst.
AHB vs. AXI Burst
AXI Burst
• Simultaneous read, write transactions
• Better bus utilization
AXI Out-of-Order Completion
With AHB
• If one slave is very slow, all data is held up
• SPLIT transactions provide very limited improvement
With AXI Burst
• Multiple outstanding addresses; out-of-order (OO) completion allowed
• Fast slaves may return data ahead of slow slaves (see the ID-matching sketch below)
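A hedged sketch of how a master might track multiple outstanding reads and match out-of-order data to the request that issued it, keyed by the AXI ID (ARID/RID). The table layout and function names are made up for illustration.

```c
/* Master-side table for outstanding AXI reads, matched by ID tag. */
#include <stdint.h>
#include <stddef.h>

#define MAX_OUTSTANDING 8

typedef struct {
    uint8_t  id;       /* ARID used when the address was issued */
    uint32_t addr;     /* address of the outstanding read       */
    int      pending;  /* 1 while no data has come back yet     */
} read_req_t;

static read_req_t outstanding[MAX_OUTSTANDING];

/* Record a read issued on the read address channel. */
static void issue_read(uint8_t arid, uint32_t addr)
{
    for (size_t i = 0; i < MAX_OUTSTANDING; i++) {
        if (!outstanding[i].pending) {
            outstanding[i] = (read_req_t){ .id = arid, .addr = addr, .pending = 1 };
            return;
        }
    }
    /* Table full: a real master would stop issuing new addresses here. */
}

/* Read data may return in any order; RID tells us which request it answers. */
static read_req_t *complete_read(uint8_t rid)
{
    for (size_t i = 0; i < MAX_OUTSTANDING; i++) {
        if (outstanding[i].pending && outstanding[i].id == rid) {
            outstanding[i].pending = 0;
            return &outstanding[i];
        }
    }
    return NULL; /* unexpected ID */
}
```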
AHB vs. AXI - Summary
IBM CoreConnect On-Chip Bus
CoreConnect is an SoC bus architecture proposed by IBM, comprising:
PLB: Processor Local Bus, PLB Arbiter, PLB to OPB Bridge
OPB: On-Chip Peripheral Bus, OPB Arbiter
DCR: Device Control Register Bus and a Bridge
CoreConnect Advanced Features
The IBM CoreConnect bus supports a variety of applications:
PLB: Fully synchronous, supports up to 8 masters
Separate read/write data buses
Burst transfers, variable and fixed-length, Pipelining
DMA transfers and No on-chip tri-states required
Overlapped arbitration, programmable priority fairness
OPB: Fully synchronous, 32-bit address and data buses
Support 1-cycle data transfers between master and slaves
Arbitration for up to 4 OPB master peripherals
Bridge function can be master on PLB or OPB
DCR: Provides fully synchronous movement of GPR data between CPU and slave logic
CoreConnect Bus based SoC
(Block diagram of a CoreConnect-based SoC.)
AMBA and CoreConnect SoC Buses
(Comparison diagram.)
IBM CoreConnect
PLB: Pipelined; Burst modes; Split transactions; Multiple masters
OPB: Low bandwidth; Burst mode; 1 read/write = 2 cycles; Multiple masters
DCR: Low throughput; Ring-type data bus
Processor Local Bus (PLB)
High performance synchronous bus
Shared address, separate read and write data buses
Support for 32-bit address, 16, 32, 64, & 128-bit data bus widths
Dynamic bus sizing: byte, half-word, word, double-word transfers
Up to 16 masters and any number of slaves
AND-OR implementation structure
Variable or fixed length (16-64 byte) burst transfers
Pipelined transfers
SPLIT transfer support
Overlapped read and write transfers (up to 2 transfers per cycle)
Centralized arbiter
Locked transfer support for atomic accesses
PLB Transfer Phases
Address and data phases are decoupled
Overlapped PLB Transfers
PLB allows the address and data buses to have different masters at the same time.
PLB Arbiter
Bus Control Unit
• Each master drives a 2-bit signal that encodes 4 priority levels
• In case of a tie, the arbiter uses a static or round-robin (RR) scheme (see the sketch below)
Timer
• Pre-empts long burst masters
• Ensures high-priority requests are served with low latency
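A minimal sketch of the priority rule just described, under the assumption that the encoded priority values are compared directly and ties are broken round-robin; none of the names below are actual PLB signal names.

```c
/* Pick the requesting master with the highest 2-bit priority; rotate the
 * scan order after the previous winner so equal priorities share the bus. */
#include <stdint.h>

#define N_MASTERS 8

static int plb_arbitrate(const uint8_t prio[N_MASTERS],     /* 0..3 each      */
                         const uint8_t request[N_MASTERS],   /* nonzero = req  */
                         int last_winner)                    /* RR tie break   */
{
    int best = -1, best_prio = -1;
    for (int k = 1; k <= N_MASTERS; k++) {
        int i = (last_winner + k) % N_MASTERS;   /* rotated scan order */
        if (request[i] && prio[i] > best_prio) { /* strictly greater, so the  */
            best = i;                            /* first equal-priority      */
            best_prio = prio[i];                 /* master after the previous */
        }                                        /* winner takes the tie      */
    }
    return best;   /* -1 if nobody is requesting */
}
```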
On-chip Peripheral Bus (OPB)
Synchronous bus to connect low performance
peripherals and reduce capacitive loading on PLB.
Shared address bus, multiple data buses.
Up to a 64-bit address bus width.
32- or 64-bit read, write data bus width support.
Support for multiple masters.
Bus parking (or locking) for reduced transfer latency.
Sequential address transfers (burst mode).
Dynamic bus sizing: byte, half-word, word, double-word transfers.
MUX-based (or AND-OR) structural implementation.
Single-cycle data transfer between OPB masters and slaves.
Timeout capability to keep latency low for important transfers.
Device Control Register (DCR) Bus
Low speed synchronous bus, used for on-chip device
configuration purposes
meant to off-load the PLB from lower performance
status and control read and write transfers
10-bit, up to 32-bit address bus
32-bit read and write data buses
4-cycle minimum read or write transfers
Slave bus timeout inhibit capability
Multi-master arbitration
Privileged and non-privileged transfers
Daisy-chain (serial) or distributed-OR (parallel) bus topologies
Nios-II CPU & Avalon Bus based System
Avalon Bus
• The Avalon bus is an active, on-chip bus architecture that accommodates the SOPC environment.
• The interface to peripherals is synchronous with the Avalon clock, so no complex asynchronous handshaking or acknowledge schemes are necessary.
• Multiplexers (not tri-state buffers) inside the bus determine which signals drive which peripheral. Peripherals are never required to tri-state their outputs, even when deselected.
• The address, data and control signals use separate, dedicated ports. This simplifies the design of peripherals, as they do not need to decode address and data bus cycles or disable their outputs when not selected.
Avalon Bus Module Features
Data-Path Multiplexing - Multiplexers transfer data from the
selected slave peripheral to the appropriate master peripheral.
Address Decoding - Produces chip-select signals for each peripheral (a small decode sketch follows this list).
Wait-State Generation
Dynamic Bus Sizing
Interrupt-Priority Assignment - When one or more
slave peripherals generate interrupts.
Latent Transfer Capabilities
Streaming Read and Write Capabilities - The logic
required to allow streaming transfers between master-slave
pairs is contained inside the Avalon bus module.
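As a rough illustration of the address-decoding feature above, the bus fabric can be thought of as comparing the master address against each slave's base address and span and asserting exactly one chip-select. The address map and names below are invented, not taken from any Altera/Intel tool output.

```c
/* Sketch of Avalon-style address decoding producing a chip-select index. */
#include <stdint.h>

typedef struct { uint32_t base; uint32_t span; } avalon_region_t;

static const avalon_region_t slave_map[] = {
    { 0x00000000, 0x00010000 },   /* slave 0: e.g. on-chip RAM (hypothetical) */
    { 0x00010000, 0x00000020 },   /* slave 1: e.g. UART registers             */
    { 0x00020000, 0x00001000 },   /* slave 2: e.g. DMA control                */
};

#define N_SLAVES (sizeof slave_map / sizeof slave_map[0])

/* Returns the index of the selected slave, or -1 if the address is unmapped. */
static int decode_chip_select(uint32_t addr)
{
    for (unsigned i = 0; i < N_SLAVES; i++)
        if (addr - slave_map[i].base < slave_map[i].span)  /* unsigned range test */
            return (int)i;
    return -1;
}
```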
Avalon Bus Module
The Avalon bus module (an Avalon bus) is a unit of active logic that takes the place of passive, metal bus lines on a physical PCB.
Bus System with Master Modules
(Block diagram of an Avalon bus system with several master modules.)
Multi-Master: Avalon Bus Arbitration
A slave (data memory) is shared by two masters (Nios CPU and DMA).
Slave Arbitrator
The Avalon bus module contains one slave arbitrator for each shared slave port. The slave arbitrator performs the following:
• Defines control, address, and data paths from multiple master
ports to the slave port and specifies the arbitration mechanism
to use when multiple masters contend for a slave at the same
time.
• At any given time, selects which master port has access to the
slave port and forces all other contending masters (if any) to
wait, based on the arbitration assignments.
• Controls the slave port, based on the address, data, and control
signals presented by the currently selected master port.
Multi-Masters and Slaves
(Diagram: request and arbitrator logic in a simultaneous multi-master system that permits bus transfers between two masters and two slaves.)
Standard Bus Architectures
• AMBA 2.0, 3.0 (ARM)
• CoreConnect (IBM)
• Avalon (Altera)
• STBus (STMicroelectronics)
• Sonics Smart Interconnect (Sonics)
• Wishbone (OpenCores)
• PI Bus (OMI)
• MARBLE (Univ. of Manchester)
• CoreFrame (PalmChip)
• …
STBus
Consists of 3 synchronous bus-based
interconnect specifications
Type 1
• Simplest protocol meant for peripheral access
Type 2
• More complex protocol
• Pipelined, SPLIT transactions
Type 3
• Most advanced protocol
• OO transactions, transaction labeling/hints
Type 1 and 3
Type 1
• Simple handshake mechanism
• 32-bit address bus
• Data bus sizes of 8, 16, 32, 64 bits
• Similar to IBM CoreConnect DCR bus
Type 3
• Supports all Type 2 functionality
• OO (out-of-order) transaction completion
• Requires only a single response/ACK for multiple data transfers (burst mode)
Type 2
• Supports all Type 1 functionality
• Pipelined transfers
• SPLIT transactions
• Data bus sizes up to 256 bits
• Compound operations
READMODWRITE: Returns read data and locks slave till
same master writes to location
SWAP : Exchanges data value between master and slave
FLUSH/PURGE: Ensure coherence between local and
main memory
USER: Reserved for user defined operations
STBus Arbitration
• Static priority
Non-preemptive
• Programmable priority
• Latency based
Each master has a register with its max. allowed latency (clock cycles)
Each master also has a counter, loaded with the max. latency value when the master makes a request
Master counters are decremented at every subsequent cycle
Arbiter grants access to the master with the lowest counter value
In case of a tie, static priority is used
If a master's counter reaches 0, it becomes the highest-priority requester and must be granted bus access as soon as it requests it (a sketch of this scheme follows)
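A hedged sketch of the latency-based scheme just described: counters are loaded on request, count down each waiting cycle, and the lowest counter wins, with static priority (here simply the lower index) breaking ties. The structure and function names are illustrative only, not STBus signal names.

```c
/* Latency-based arbitration sketch for N_MASTERS requesters. */
#include <stdint.h>

#define N_MASTERS 4

typedef struct {
    int      requesting;   /* 1 while the master wants the bus          */
    uint32_t max_latency;  /* programmed per-master register (cycles)   */
    uint32_t counter;      /* loaded with max_latency on a new request  */
} stbus_master_t;

/* One arbitration cycle; returns the granted master index, or -1. */
static int stbus_latency_arbitrate(stbus_master_t m[N_MASTERS])
{
    int grant = -1;
    uint32_t best = UINT32_MAX;

    for (int i = 0; i < N_MASTERS; i++) {
        if (!m[i].requesting)
            continue;
        if (m[i].counter < best) {   /* strictly lower wins, so on a tie the */
            best  = m[i].counter;    /* earlier (higher static priority)     */
            grant = i;               /* master keeps the grant               */
        }
        if (m[i].counter > 0)
            m[i].counter--;          /* waiting masters count down each cycle */
    }
    return grant;
}
```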
STBus Arbitration
• Bandwidth based
Similar to TDMA/RR (Round Robin) scheme
• STB
Hybrid of latency based and programmable priority schemes
In normal mode, programmable priority scheme is used
Masters have max. latency registers, counters (latency based)
Each master also has an additional latency-counter-enable bit
If this bit is set, and the counter value is 0, the master is in a "panic state"
If one or more masters are in the panic state, the programmable priority scheme is overridden, and the panic-state masters are granted access
• Message based
Pre-emptive static priority scheme
Socket-based Interface Standards
Defines the interface of components
Does not define bus architecture implementation
Shield IP designer from knowledge of interconnection system,
and enable same IP to be ported across different systems
Requires Adaptor components to interface with implementation
Socket-based Interface Standards
• Must be generic, comprehensive, and configurable
to capture basic functionality and advanced features of a wide
array of bus architecture implementations
• Adaptor (or translational) logic component
Must be created only once for each implementation (e.g. AMBA)
– adds area, performance penalties, more design time
+ enhances reuse, speeds up design time across many designs
• Commonly used socket-based interface standards
Open Core Protocol (OCP) ver 2.0
• Most popular – used in Sonics Smart Interconnect
VSIA Virtual Component Interface (VCI)
• Subset of OCP
OCP 2.0/3.0
Open Core Protocol
• Point-to-point synchronous interface
• Bus architecture independent
• Configurable data flow (address, data, control)
signals for area-efficient implementation
• Configurable sideband signals to support
additional communication requirements
• Pipelined transfer support
• Burst transfer support
• OO (out-of-order) transaction completion support
• Multiple threads
OCP 3.0 Basic Signals
Example: SoC with Mixed Profiles
Summary and Conclusions
• Standards important for seamless integration of SoC IPs
avoid costly integration mismatches
• Two categories of standards for SoC communication:
Standard bus architectures
• define interface between IPs and bus architecture
• define (at least some) specifics of bus architecture that
implements data transfer protocol
• e.g. AMBA 2.0/3.0, CoreConnect, Sonics Smart
Interconnect, STBus
Socket based bus interface standards (e.g. OCP 2.0)
• define interface between IPs and bus architecture
• do not define bus architecture implementation
specifics