Professional Documents
Culture Documents
BITS Pilani: Hardware Software Co-Design ES/SE/SS ZG626, MEL ZG651 Session 2
BITS Pilani: Hardware Software Co-Design ES/SE/SS ZG626, MEL ZG651 Session 2
BITS Pilani: Hardware Software Co-Design ES/SE/SS ZG626, MEL ZG651 Session 2
Course Overview
• Module: Introduction, Specification and
Modelling Concepts [T1 Ch 2, T2 Ch1, R2 Ch2]
– Introduction to Embedded System Design
– Challenges
– Introduction to Hardware/ Software Co-design
• HW or SW?
• Writing SW seems easier
– flexible
– fast compiler
– lots of libraries
– low cost computing platforms
– why design a new hardware when you can just buy
one, eg a desktop computer
• So what are the drivers for custom hardware?
• Therefore, performance may not be a very good metric to compare hardware and
software, especially in resource-constrained environments such as IoT
• A better metric is energy efficiency, i.e., the amount of useful work done per unit
of energy
hardware software
implementation implementation
• There are many examples of systems in which the distinction between hardware
and software is not always crystal clear
• For example:
– An FPGA is a hardware circuit that can be reconfigured
• The program for an FPGA is a bitstream, which is used to configure its logic
function
– VHDL and Verilog are used to generate software models, which in turn are
translated into bitstreams that configure the hardware
• A soft-core is a processor configured into an FPGA’s reconfigurable logic The soft-
core can then be used to execute C programs
• A digital signal processor (DSP) has instructions which are optimized for signal
processing applications
– Efficient programs require detailed knowledge of the DSP HW architecture,
which creates a strong relationship between software and hardware
Design Complexity
• Modern electronic systems are extremely complex, containing multiple
processors, large embedded memories, multiple peripherals and input-output
devices
• It is generally difficult or impossible to design all components in fixed
hardware
• On the other hand, software implementations running on processors can
better handle complexity and additionally allows for updates and bug fixes
Design Cost
• New chips are very expensive to design and fabricate
• Programmable architectures including processors and FPGAs are becoming
more attractive because they can be reused over multiple products or product
generations
• System-on-Chip (SoCs) are good examples of this trend
increasing flexibility
increasing energy efficiency
the table shows that in many aspects, hardware design and software design use dual
concepts. Hence, being able to effectively transition from one world of design to the
other is an important asset to your skills as a computer engineer. You can try to combine
the best of both worlds. Complex control processing? Use a software core. Hard-real-
time processing requirement? Add some hardware.
• Cycle-accurate: Simulators model events only at regularly-spaced intervals, i.e., the rising
edge of the clk. A cycle-accurate model does not model propagation delays and glitches, i.e.,
all signals propagate with zero delay in combinational circuits This level of abstraction is
considered the golden reference in HW/SW codesign
• Instruction-accurate: Expresses activities in steps of one microprocessor instruction (not
cycle count) If you need to determine the real-time performance of a model, you need to
translate instruction count to clk cycle count to obtain execution time Instruction-accurate
simulators are used extensively to verify complex software systems
• Both concurrency and parallelism occur often in HW/SW codesign, and they mean very different things
– Concurrency refers to simultaneous execution where the individual operations are completely independent
– Parallelism, on the other hand, refers to simultaneous execution where the operations are run on different
processors or circuit elements
• Therefore, concurrency refers to an application model while parallelism refers to the implementation of
that model
• Sequential or concurrent software require only a single processor, while parallel software requires multiple
processors
• Software running on your laptop, e.g., WORD, email, etc. is concurrent. Software running on a 65536-
processor IBM Blue Gene/L is parallel
• For example, if your application spends 33% of its time running sequentially,
the maximum speedup is 3
• This means that no matter how fast you make the parallel component run, the
maximum speedup you will ever be able to achieve is 3
• The task of making your application take advantage of parallelism is not
obvious
• C programs are sequential, and so are typical instruction set architectures
Completely connected
machine. Original machine
contained 65556 processors,
each Connection Machine:
with 4Kbits of local memory.
• The authors of the CM, Hellis and Steele, show that it is possible to express algorithms in a
concurrent fashion so that they map neatly onto a CM
• Consider the problem of summing an array of numbers
• The array can be distributed across the CM by assigning one number to each processor
• The sum is computed in log(n) steps, i.e., in only 3 time steps in this example
• The same algorithm running on a sequential processor would take 7 steps.
– the array of numbers would be stored in a single memory and the processor needs to iterate over each element,
requiring a minimum of 8 time steps!
• Even through the parallel sum speeds up the computation significantly, there remains a lot of
wasted compute power
• Compute power of a smaller 8-node CM for 3 times steps is 3*8 = 24 computation time-steps
of which only 7 are being used
• On the other hand, if the application requires all partial sums, i.e., the sum of the first two,
three, four, etc. numbers, then the full power of the parallel machine is used. Here, 17
computation time-steps are used
• Assume for now that you only have an 8085 processor at 1MHz. How would you code it.
• Assume you have a 8bit multiplier hardware block, attached to the 8085 processor. How would you code it.
• What would you do if you do have h/w flexibility, but not as much as a 8bit multiplier block. AND s/w
flexibility that it’s ok if it is not a single cycle. In other words, its ok to have a multiplier throughput of say
500kHz or 256kHz (two flavors).
• Assume external h/w also works at 1MHz max.
• Assume for now that you only have an 8085 processor at 1MHz. How
would you code it.
– Think of left shift multiplicand, conditional add
• Assume you have a 8bit multiplier hardware block, attached to the
8085 processor. How would you code it.
– Think of sending operands to hardware block, read results.
– Cost of Area at the gain of higher throughput.
• What would you do if you do have h/w flexibility, but not as much as a
8-bit multiplier block. AND s/w flexibility that it’s ok if it is not a single
cycle. In other words, its ok to have a multiplier throughput of say
500kHz or 256kHz (two flavors).
– Think of sending operands to hardware block, read results.
– What will reduce? Area at the expense of throughput.
• Assume external h/w also works at 1MHz max..
• Throw power into the equation on a more complex example.