Isa 3

ISA
(Wrap up)
1
RISC vs. CISC
2
CISC Evolution
• Storage and Memory

– High cost of memory.
– Need for compact code.
• Support for high-level languages

• Support for new applications (e.g.,
multimedia)
3
CISC Effects
• Moved complexity from s/w to h/w
• Ease of compiler design (HLLCA)
• Easier to debug
• Lengthened design times
• Increased design errors
4
RISC Evolution
• Increasingly cheap memory

• Improvement in compiler technology
Patterson: “Make the common case fast”
5
RISC Effect
• Move complexity from h/w to s/w
• Provided a single-chip solution
• Better use of chip area
• Better Speed
• Feasibility of pipelining
– Single cycle execution stages
– Uniform Instruction Format
6
Key arguments
• RISC argument
– for a given technology, RISC implementation will be faster
– current VLSI technology enables single-chip RISC
– when technology enables single-chip CISC, RISC will be pipelined
– when technology enables pipelined CISC, RISC will have caches
– When CISC have caches, RISC will have multiple cores
• CISC argument
– CISC flaws not fundamental (fixed with more transistors)
– Moore’s Law will narrow the RISC/CISC gap (true)
– software costs will dominate (very true)
7
Role of Compiler: RISC vs. CISC
• CISC instruction:
MUL <addr1>, <addr2>
• RISC instructions:
LOAD A, <addr1>
LOAD B, <addr2>
MUL A, B
STORE <addr1>
• RISC is dependent on optimizing compilers
8
Example CISC ISA: Intel X86
12 addressing modes: Operand sizes:

• Register. • Can be 8, 16, 32, 48, 64, or 80 bits
• Immediate. long.
• Direct.
•
• Also supports string operations.
Base.
• Base + Displacement.
• Index + Displacement.
• Scaled Index + Displacement. Instruction Encoding:
• Based Index.
•
• The smallest instruction is one byte.
Based Scaled Index.
• Based Index + Displacement. • The longest instruction is 12 bytes
• Based Scaled Index + long.
Displacement.
• Relative. • The first bytes generally contain the
opcode, mode specifiers, and register
fields. 9
Example RISC ISA: PowerPC
8 addressing modes:
Operand sizes:
• Register direct. • Four operand sizes: 1, 2, 4 or 8
• Immediate. bytes.
• Register indirect.
• Register indirect with immediate index
(loads and stores).
• Register indirect with register index
•
(loads and stores).
Absolute (jumps).
Instruction Encoding:
• Link register indirect (calls).
•
• Instruction set has 15 different
Count register indirect (branches).
formats with many minor
•
variations.
• All are 32 bits in length.
10
Example RISC ISA: HP-PA

• Register • Five operand sizes ranging in
• Immediate powers of two from 1 to 16 bytes.
• Base with displacement
• Base with scaled index and
displacement
• Predecrement Instruction Encoding:
• Postincrement
• PC-relative • Instruction set has 12 different
•
formats.
11
Example RISC ISA: SPARC

• Register indirect with • Four operand sizes: 1, 2, 4 or 8
immediate displacement. bytes.
• Register inderect indexed by
another register.
• Register direct. Instruction Encoding:
• Immediate.
• Instruction set has 3 basic
• PC relative.
instruction formats with 3 minor
variations.
12
CISC vs. RISC Characteristics
• RISC vs. CISC controversy is now 20 years old.
• After the initial enthusiasm for RISC machines, there has been a
growing realization that
– RISC designs may benefit from the inclusion of some CISC features, and
– Vice-versa.
• The result is that more recent RISC design, PowerPC and
SPARC, are no longer "pure" RISC and the more recent CISC
designs, notably the Pentium, AMD, and Core Duo incorporate
core RISC characteristics.
Intel called this hack CRISC. This concept was so moronic that
even Intel could not market it!
13
Current Trends
• - x86 instructions decoded into RISC-like
instructions (ROps)
Intel called this hack CRISC. This concept was so moronic that even Intel
could not market it!
• IA-64 - dependence on compilers for

scheduling
• Athlon – both direct execution and micro-
programmed instructions
14
Why did Intel (CISC) win?
• x86 won because it was the first 16-bit chip.
• IBM put it in PCs because there was no competing
choice
• Rest is inertia and “financial feedback”
– x86 is most difficult ISA to implement for high performance, but
– Because Intel sells the most processors ...
– It has the most money ...
– Which it uses to hire more and better engineers ...
– Which is uses to maintain competitive performance ...
– And given equal performance, compatibility wins ...
– So Intel sells the most processors.
15
Towards Instruction Set Consolidation
IA vision
Future innovation should come in micro-architecture
enhancements and compatible extensions to dominant
instruction sets, rather than the creation of new instruction sets.
1 Is the trend clear?

With ever growing software complexity and installed base the value of remaining
compatible with and extending existing, dominant instruction sets heavily
outweighs any disadvantages.
2 Is the time now?
Technology has passed the point where instruction set costs are no longer
relevant.
16
What The Euro Can Teach Us
• The economic benefits of moving away from
multiple currencies is enormous
17
Possible Solutions
Port thousands of
applications, operating
systems, drivers, codecs, tool x86
chains and virtual machines
MIPS® STOP
ARM
main() {
printf(“Hello World\n”); PowerPC
} • Resource
EPIC
intensive
SPARC • Slow
Cell • Costly to
maintain
Other
18
Architectural Evolution Macro-
Level
Networking
Common
Instruction Set
Architecture Storage
Happening Now
No need to port
Server
The Future?
Today
No need for multiple
validations
Desktop
Built in OS integration
Laptop
Robust security
Investment protection Handheld
Ubiquitous
19
Extending x86
20
EPIC
(Explicitly Parallel
Instruction Computing)
21
IA- 64: The Itanium Processor
• A radical departure from the traditional
paradigms.
• Intel and Hewlett-Packard Co. designed a
new architecture, IA-64, that they expected
to be much more effective at executing
instructions in parallel
• IA-64 is brand new ISA
22
IA - 64
• IA-64 is a 64-bit. In IA-64 designs,
instructions are scheduled by the compiler,
not by the hardware.
• Much of the logic that groups, schedules,
and tracks instructions is not needed thus
simplifying the circuitry and promising to
improve performance.
23
The EPIC Philosophy
• The acronym EPIC stands for Explicitly
Parallel Instruction Computing.
• The entire EPIC design philosophy can be
summed up by the following: make use of
parallel power whenever and wherever
possible; and if it's not possible, make it
possible.
24
The EPIC Philosophy
25
Intel IA-64
• Massive resources
– 128 GPRs (64-bit, plus poison bit)
– 128 FPRs (82-bit)
– 64 predicate registers
– Also has branch registers for indirect branches
• Contrast to:
– RISC: 32 int, 32 FP, handful of control regs
– x86: 8 int, 8 fp, handful of control regs
• x86-64 bumps this to 16, SSE adds 8/16 MM regs
26
IA-64 Registers
27
IA-64 Groups
• Compiler assembles groups of instructions
– No register data dependencies between insts in
the same group
• Memory deps may exist
– Compiler explicitly inserts “stops” to mark the
end of a group
– Group can be arbitrarily long
28
IA-64 Bundles
• Bundle == The “VLIW” Instruction

– 5-bit template encoding
• also encodes “stops”
– Three 41-bit instructions
• 128 bits per bundle

– average of 5.33 bytes per instruction
• x86 only needs 3 bytes on average
29
Instruction Types
• Instructions are divided into different type

– determines which functional units they can
execute on
– templates are based on these types
30
Bundle Templates
• Not all combinations of A,
I, M, F, B, L and X are
permitted
• Group “stops” are
explicitly encoded as part
of the template
– can’t stop just anywhere
Some bundles identical

except for group stop
31
Individual Instruction Formats
• Fairly RISC-like like
– easy to decode, fields tend to stay put
32
Instruction Format: Bundles &
Templates
•Bundle
•Set of three instructions (41 bits each)
•Template
•Identifies types of instructions in bundle
33
Explicitly Parallel Instruction Computing
S2
EPIC
S1 S0 T
128-bit instruction bundles from I-cache
Fetch one or more bundles for execution

(Implementation, Itanium® takes two.)
Processor
Try to execute all instructions in

functional units parallel, depending on available units.
MEM MEM INT INT FP FP B B B
Retired instruction bundles
34
The Role of
Compilers
35
Compiler and ISA
• ISA decisions are no more just for programming assembly
language (AL) easily
• Due to HLL, ISA is a compiler target today
• Performance of a computer will be significantly affected by
compiler
• Understanding the compiler technology today is critical to
designing and efficiently implementing an instruction set
• Architecture choice affects the code quality and the
complexity of building a compiler for it
36
Goal of the Compiler
• Primary goal is correctness
• Second goal is speed of the object code
• Others:
– Speed of the compilation
– Ease of providing debug support
– Inter-operability among languages
– Flexibility of the implementation - languages may not
change much but they do evolve - e. g. Fortran 66 ===>
HPF
Make the frequent cases fast and the rare case correct
37
Typical Modern Compiler Structure
Common Intermediate
Representation
Somewhat language dependent

Largely machine independent
Small language dependent
Slight machine dependent
Language independent
Highly machine dependent
38
Typical Modern Compiler Structure
(Cont.)
• Multi-pass structure  easy to write bug-free compilers
– Transform HL, more abstract representations, into progressively low-
level representations, eventually reaching the instruction set
• Compilers must make assumptions about the ability of later
steps to deal with certain problems
– Ex. 1 choose which procedure calls to expand inline before they know
the exact size of the procedure being called
– Ex. 2 Global common sub-expression elimination
• Find two instances of an expression that compute the same value
and saves the result of the first one in a temporary
– Temporary must be register, not memory (Performance)
– Assume register allocator will allocate temporary into register
39
Optimization Types
• High level - done at source code level
– Procedure called only once - so put it in-line and save CALL
• Local - done on basic sequential block (straight-line
code)
– Common sub-expressions produce same value
– Constant propagation - replace constant valued variable with the
constant - saves multiple variable accesses with same value
• Global - same as local but done across branches
– Code motion - remove code from loops that compute same value on
each pass and put it before the loop
– Simplify or eliminate array addressing calculations in loop
40
Optimization Types (Cont.)
• Register allocation
– Use graph coloring (graph theory) to allocate registers
• NP-complete
• Heuristic algorithm works best when there are at least 16 (and
preferably more) registers
• Processor-dependent optimization
– Strength reduction: replace multiply with shift and add sequence
– Pipeline scheduling: reorder instructions to minimize pipeline stalls
– Branch offset optimization: Reorder code to minimize branch offsets
41
Register Allocation
• One the most important optimizations
• Based on graph coloring techniques
– Construct graph of possible allocations to a register
– Use graph to allocate registers efficiently
– Goal is to achieve 100% register allocation for all
active variables.
– Graph coloring works best when there are at least 16
general-purpose registers available for integers and
more for floating-point variables.
42
Constant propagation
a:= 5;
...
// no change to a so far.
if (a > b)
{
. . .
}
The statement (a > b) can be replaced by (5 > b). This could free
a register when the comparison is executed.
When applied systematically, constant propagation can improve the

code significantly.
43
Strength reduction
Example:
for (j = 0; j = n; ++j)
A[j] = 2*j;
for (i = 0; 4*i <= n; ++i)

A[4*i] = 0;
An optimizing compiler can replace multiplication by 4 by

addition by 4.
This is an example of strength reduction.

In general, scalar multiplications can be replaced by
additions.
44
Major Types of Optimizations and
Example in Each Class
45
Change in IC Due to
Optimization
 Level 1: local optimizations, code scheduling, and local register allocation

 Level 2: global optimization, loop transformation (software pipelining),
global register allocation
 Level 3: + procedure integration
46
How can Architects Help Compiler
Writers
• Provide Regularity
– Address modes, operations, and data types should be orthogonal
(independent) of each other
• Simplify code generation especially multi-pass
• Counterexample: restrict what registers can be used for a certain classes
of instructions
• Provide primitives - not solutions
– Special features that match a HLL construct are often un-usable
– What works in one language may be detrimental to others
47
How can Architects Help Compiler
Writers (Cont.)
• Simplify trade-offs among alternatives
– How to write good code? What is a good code?
• Metric: IC or code size (no longer true) caches and pipeline…
– Anything that makes code sequence performance obvious is
a definite win!
• How many times a variable should be referenced before it is
cheaper to load it into a register
• Provide instructions that bind the quantities known at
compile time as constants
– Don’t hide compile time constants
• Instructions which work off of something that the compiler thinks
could be a run-time determined value hand-cuffs the optimizer
48
Short Summary -- Compilers
• ISA has at least 16 GPR (not counting FP registers) to
simplify allocation of registers using graph coloring
• Orthogonality suggests all supported addressing modes
apply to all instructions that transfer data
• Simplicity – understand that less is more in ISA design
– Provide primitives instead of solutions
– Simplify trade-offs between alternatives
– Don’t bind constants at runtime
• Counterexample – Lack of compiler support for
multimedia instructions
49
MIPS ISA
50
Growth of Processors
1400
• Language of the Machine 1300
Other
SPARC
• We’ll be working with the 1200 Hitachi SH
MIPS instruction set 1100 PowerPC
Motorola 68K
architecture 1000
MIPS
– similar to other 900 IA-32
800 ARM
architectures developed
700
since the 1980's 600
– Almost 100 million 500
MIPS processors 400
manufactured in 2002 300
– used by NEC, 200
Nintendo, Cisco, Silicon 100
Graphics, Sony, … 0
1998 1999 2000 2001 2002
51
MIPS Instruction Set (RISC)
• Instructions execute simple functions.
• Maintain regularity of format – each instruction is
one word, contains opcode and arguments.
• Minimize memory accesses – whenever possible
use registers as arguments.
• Three types of instructions:
• Register (R)-type – only registers as arguments.
• Immediate (I)-type – arguments are registers and numbers
(constants or memory addresses).
• Jump (J)-type – argument is an address.
52
MIPS Arithmetic Instructions
• All instructions have 3 operands
• Operand order is fixed (destination first)
Example:
C code: a = b + c
MIPS ‘code’: add a, b, c
(we’ll talk about registers in a bit)
“The natural number of operands for an operation like addition is

three… requiring every instruction to have exactly three operands
conforms to the philosophy of keeping the hardware simple”
53
Arithmetic Instr. (Continued)
• Design Principle: simplicity favors regularity.
• Of course this complicates some things...
C code: a = b + c + d;
MIPS code: add a, b, c

add a, a, d
• Operands must be registers (why?)

• 32 registers provided
• Each register contains 32 bits
54
Registers vs. Memory
• Arithmetic instructions operands must be registers
• 32 registers provided
• Compiler associates variables with registers.
• What about programs with lots of variables? Must use
memory.
Control Input
Memory
Datapath Output
Processor I/O
55
Memory Instructions
• Load and store instructions
• Example:
C code: A[12] = h + A[8];
MIPS code: lw $t0, 32($s3) #addr of A in reg s3

add $t0, $s2, $t0 #h in reg s2
sw $t0, 48($s3)
• Can refer to registers by name (e.g., $s2, $t2) instead of number

• Store word has destination last
• Remember arithmetic operands are registers, not memory!
Can’t write: add 48($s3), $s2, 32($s3)
56
Machine Language
• Instructions, like registers and words of data, are also 32
bits long
– Example: add $t1, $s1, $s2
– registers have numbers, $t1=8, $s1=17, $s2=18
• Instruction Format:
000000 10001 10010 01000 00000

100000
op rs rt rd shamt
funct
57
MIPS64 Instruction Format
I - type instruction 6 5 5 16
Opcode rs rt Immediate
0 5 6 10 11 15 16 31
Encodes: Loads and stores of bytes, words, half words. All immediates (rd rs op immediate)
Conditional branch instructions (rs1 is register, rd unused)
Jump register, jump and link register (rd = 0, rs = destination, immediate = 0)
R - type instruction 6 5 5 5 5 6
Opcode rs rt rd shamt func
0 5 6 10 11 15 16 20 21 25 26 31
Register-register ALU operations: rd  rs func rt Function encodes the data path operation:
Add, Sub .. Read/write special registers and moves.
J - Type instruction 6 26
Opcode Offset added to PC
0 5 6 31
Jump and jump and link. Trap and return from exception
58
Summary: MIPS Registers and
Memory
$s0-$s7, $t0-$t9, $zero, Fast locations for data. In MIPS, data m ust be in registers to perform
32 registers $a0-$a3, $v0-$v1, $gp, arithm etic. MIPS register $zero alw ays equals 0. Register $at is
$fp, $sp, $ra, $at reserved for the assem bler to handle large constants.
Memory[0], Accessed only by data transfer instructions. MIPS uses byte
230 memoryMemory[4], ..., addresses, so sequential w ords differ by 4. Mem ory holds data
words Memory[4294967292] structures, such as arrays, and spilled registers, such as those
saved on procedure calls.
59
Summary: MIPS Instructions
MIPS assembly language
Category Instruction Example Meaning Comments
add add $s1, $s2, $s3 $s1 = $s2 + $s3 Three operands; data in registers
Arithmetic subtract sub $s1, $s2, $s3 $s1 = $s2 - $s3 Three operands; data in registers
add immediate addi $s1, $s2, 100 $s1 = $s2 + 100 Used to add constants
load w ord lw $s1, 100($s2) $s1 = Memory[$s2 + 100]Word from memory to register
store w ord sw $s1, 100($s2) Memory[$s2 + 100] = $s1 Word from register to memory
Data transfer load byte lb $s1, 100($s2) $s1 = Memory[$s2 + 100]Byte from memory to register
store byte sb $s1, 100($s2) Memory[$s2 + 100] = $s1 Byte from register to memory
load upper lui $s1, 100 $s1 = 100 * 2 16 Loads constant in upper 16 bits
immediate
branch on equal beq $s1, $s2, 25 if ($s1 == $s2 ) go to Equal test; PC-relative branch
PC + 4 + 100
branch on not equal bne $s1, $s2, 25 if ($s1 != $s2 ) go to Not equal test; PC-relative
PC + 4 + 100
Conditional
branch set on less than slt $s1, $s2, $s3 if ($s2 < $s3 ) $s1 = 1; Compare less than; for beq, bne
else $s1 = 0
set less than slti $s1, $s2, 100 if ($s2 < 100) $s1 = 1; Compare less than constant
immediate else $s1 = 0
jump j 2500 go to 10000 Jump to target address
Uncondi- jump register jr $ra go to $ra For sw itch, procedure return
tional jump jump and link jal 2500 $ra = PC + 4; go to 10000 For procedure call
60
Addressing Modes
1. Immediate addressing
op rs rt Immediate
2. Register addressing
op rs rt rd ... funct Registers
Register
3. Base addressing
op rs rt Address Memor y
Register + Byte Halfword Word
4. PC-relative addressing
op rs rt Address Memor y
PC + Word
5. Pseudodirect addressing
op Address Memor y
PC Word
2004 © Morgan Kaufman Publishers
61

Isa 3

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Isa 3

Uploaded by

Copyright:

Available Formats

ISA

• Storage and Memory

• Support for high-level languages

• Increasingly cheap memory

Patterson: “Make the common case fast”

12 addressing modes: Operand sizes:

7 addressing modes: Operand sizes:

5 addressing modes: Operand sizes:

• IA-64 - dependence on compilers for

1 Is the trend clear?

2 Is the time now?

• Bundle == The “VLIW” Instruction

• 128 bits per bundle

• Instructions are divided into different type

Some bundles identical

Fetch one or more bundles for execution

Try to execute all instructions in

Retired instruction bundles

Somewhat language dependent

When applied systematically, constant propagation can improve the

for (i = 0; 4*i <= n; ++i)

An optimizing compiler can replace multiplication by 4 by

This is an example of strength reduction.

 Level 1: local optimizations, code scheduling, and local register allocation

– used by NEC, 200

Nintendo, Cisco, Silicon 100

MIPS ‘code’: add a, b, c

(we’ll talk about registers in a bit)

“The natural number of operands for an operation like addition is

MIPS code: add a, b, c

• Operands must be registers (why?)

C code: A[12] = h + A[8];

MIPS code: lw $t0, 32($s3) #addr of A in reg s3

• Can refer to registers by name (e.g., $s2, $t2) instead of number

Can’t write: add 48($s3), $s2, 32($s3)

000000 10001 10010 01000 00000

Opcode rs rt rd shamt func

Opcode Offset added to PC

Register + Byte Halfword Word

2004 © Morgan Kaufman Publishers

You might also like