Professional Documents
Culture Documents
How A Processor Can Permute N Bits in O (1) Cycles
How A Processor Can Permute N Bits in O (1) Cycles
How A Processor Can Permute N Bits in O (1) Cycles
Ruby Lee,
Zhijie Shi, Xiao Yang
Princeton Architecture Lab for Multimedia and Security(PALMS)
Department of Electrical Engineering
Princeton University
Motivation
1
Today - microprocessor or ASIC
• Logic Operations
– MASK-Gen/AND/SHIFT/OR → 4n instructions
– EXTRACT/DEPOSIT → 2n instructions
• Table lookup
– small set of fixed permutations only
– 8x2KB tables, about 32 instructions for 64 bits permutation
• Subword permutation instructions for multimedia
– Works on 8-bit or larger subwords
• ASIC
– permutation very fast in hardware, BUT
– small set of fixed permutations only
n Source to be permuted
n
Register
File Permutation
ALU Shifter
Configuration FU
bits
Intermediate result
n
2
Initial Problem Definition
Outline
3
Alternative permutation methods
0 7
Data Rs a b c d e f g h
Control Rc 1 0 0 1 1 0 1 0
Result Rd b c f h a d e g
4
GRP64 Implementation
1:
64 OR gates
output
5
8-bit CROSS instruction
building a virtual Benes Network
8-bit OMFLIP
building a virtual Omega-Flip Network
6
An OMFLIP Implementation
64 permuted bits
7
Comparison
Bit permutation,
n elements, Θ(n) Θ(n) log(n) log(n)
each 1-bit
Subword
permutation,
Θ(n/k) Θ(n) log(n/k) log(n/k)
n/k elements,
each k-bit
Speedup of DES
2.24
2.5
2.14
2
1.17
GRP
1
1 OMFLIP or CROSS
0.5
For key generation,
0 speedup is 11x-16X !
cache 1 cache 2
8
Speedup for sorting 64 elements
using GRP instruction
9
Leverage microarchitecture features
in 2-way superscalar processors
ALU1 ALU2
(4,2)- FU
10
Two (4,1) functional units, each log(n) stages
(Butterfly is faster than Omega-flip)
n=64
bits
64 permuted bits
2log(n)=12 stages
n=64
bits
Inverse
butterfly Inverse butterfly
stages network (IBFLY)
64 permuted bits
11
Conclusions
12