Brewer 1987

Knowledge Based Control in Micro-Architecture Design
Forrest D. Brewer
Daniel D. Gajski
Dept. of Computer Science

University of Illinois at Urbana-Champaign
1304 West Springfield Ave., Urbana, Illinois 61801
24th ACM/IEEE Design Automation Conference, Paper 12.2
© 1987 ACM 0738-100X/87/0600-0203$00.75
This work supported in part by a grant from AT&T Bell Labs.
Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission.

ABSTRACT
This paper describes the principles and implementation of design-process control in a micro-architecture compiler. The knowledge-base relies on both local and global evaluations to determine strategies to achieve global goals and then implements those strategies by manipulating hardware allocations and search heuristics. A system overview and annotated sample run are presented.

1. Introduction
Micro-architecture design takes a hardware behavioral specification and produces a register-transfer level design. This design must not only implement the required behavior, but also must meet the design constraints. To meet these constraints we propose a strategy of iterative design, based on the premise that computer generated design is relatively inexpensive. There is great commercial interest in automating this design process as a means of reducing the cost of developing new application specific integrated circuits (ASIC's).

Automating the process of micro-architecture design from a behavioral language requires the addition of a large amount of knowledge to the design specification. In the past this was accomplished by restricting the design model to a limited set of alternatives, then casting the behavior onto one of them. In this way the specialized knowledge about design trade-offs can be 'hard-wired' into the synthesis process. Specific examples are SYCO [JVJC86] and MacPitts [South83], which were targeted at microprocessor design, and CATHEDRAL [DeRS86], optimized for signal-processing design.

We choose instead a paradigm with a more general design model and then use a knowledge base of rules to determine the appropriate styles, design tradeoffs and strategies. The design itself is iteratively refined by specialized algorithms which quickly search the design space to implement a new potential design based on the component constraints (such as the number of busses, or ALU's) and cost functions (e.g. area, performance, and power). Separation of the knowledge about how to correctly implement the design from the knowledge about meeting design constraints simplifies the design strategies and consequently the implementation of the knowledge-base.

2. Design Process Model
The organization of the Chippe system is patterned after the "Design Paradigm" [BrGa86]. In this model the design is iteratively refined¹ according to a selected strategy. The refinement is performed by a suite of algorithmic tools [PaGa86][Dutt86] and is controlled by resource constraints (knobs). The resulting potential design is evaluated to produce quality measures (gauges) which are compared with the design goals. These gauges drive a rule-based expert system which controls the design process by selecting design styles and strategies and making tradeoffs that help the design meet the imposed constraints. Since the design control knobs are part of the input to the refinement tools, the expert need not concern itself with correctly implementing the design. Its only concern is in meeting the design constraints.

¹ Refinement is the process of producing a correct design from a desired behavior.

A somewhat similar approach to the use of design constraints is taken by the BUD [McKo86] system. In that system, however, the knowledge is used to augment an expert which directly implements the design structure. Thus in BUD the constraints are used to produce candidate designs which are structurally refined by an expert. A third approach was used by ADAM [KnPa86]. In that system the rule base was organized as a planning engine. The system would implement a plan based on estimates of the local constraints for each operation task. The global constraints were then checked against the plan after the system completed a design.

The design paradigm allows iterative exploration of the design space. After several design runs are evaluated, information is gained which can be used to constrain the search for further design refinements. This is supported in the expert by selection of a design strategy and in the refinement tools by structural constraints. Design strategies impose constraints on the design process by selecting certain design refinement and optimization steps and imposing selected tradeoffs until a cost function is satisfied. The knowledge-base can be supplied with several design strategies which are invoked to satisfy the imposed goals or to produce stylistic variations of the design. For example, the interconnect can be mux-based or bus-based, or the design could be pipelined.

Lastly, the expert is concerned with producing a design in a reasonable amount of time. Several of the problems in micro-architecture design are NP-hard, so that an unconstrained search through even a simple design model is infeasible. The expert can provide focus of attention and limits to the search, allowing long optimizations only when the design is nearing completion.

3. Problem Description
The key issues for the Chippe expert are function unit selection and allocation, control unit and data-path style selection, selection of system global control variables, and constraint of the search.

There are several tradeoffs in the selection of the function units used to implement the operations defined in the control-data-flow-graph (CDFG)². Structural similarity in the logic design of many units allows them to be merged³ into composite units with selectable operations. These composite units have smaller area than the multiple units they replace and therefore trade parallel execution for area reduction. Several systems use a clique partitioning algorithm to merge units [PaKG86]. The clique partitioning algorithm attempts to maximize the size of each cluster of operations but ignores attributes such as bit-width, number of pipeline stages, operation time delay etc. In Chippe, functional units are selected based on a search of a data-base containing several implementations of units satisfying different design tradeoffs.
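As a rough illustration of a data-base search of this kind, the sketch below picks the smallest-area implementation that covers the requested operations within a bit-width and delay bound. The record fields, the library entries, their numbers and the `select_unit` helper are all hypothetical and are not taken from Chippe; the point is only that attributes ignored by clique partitioning (bit-width, pipeline stages, delay) can serve as selection criteria.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class UnitImpl:
    name: str          # hypothetical library entry name
    ops: frozenset     # operations the unit can perform
    bit_width: int
    area: int          # gates (illustrative numbers only)
    delay_ns: int      # operation delay
    stages: int = 1    # number of pipeline stages


# Assumed example library; a real data-base would be technology dependent.
LIBRARY = [
    UnitImpl("add_ripple", frozenset({"+"}), 16, 120, 80),
    UnitImpl("add_cla", frozenset({"+"}), 16, 200, 35),
    UnitImpl("addsub_cla", frozenset({"+", "-"}), 16, 260, 40),
    UnitImpl("mul_array", frozenset({"*"}), 16, 1500, 150),
    UnitImpl("mul_pipe2", frozenset({"*"}), 16, 1700, 90, stages=2),
]


def select_unit(ops, bit_width, max_delay_ns, library=LIBRARY) -> Optional[UnitImpl]:
    """Return the smallest-area unit covering `ops` within the bounds, or None."""
    needed = frozenset(ops)
    candidates = [
        u for u in library
        if needed <= u.ops and u.bit_width >= bit_width and u.delay_ns <= max_delay_ns
    ]
    return min(candidates, key=lambda u: u.area, default=None)


if __name__ == "__main__":
    print(select_unit({"+"}, 16, 100).name)       # loose timing: ripple adder
    print(select_unit({"+"}, 16, 50).name)        # tight timing: lookahead adder
    print(select_unit({"+", "-"}, 16, 50).name)   # merged adder-subtractor
```

Tightening the delay bound moves the choice from the small ripple adder to the larger lookahead adder, which is the kind of area-time tradeoff the data-base is meant to expose.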
Control unit selection and design requires modifications to the control-flow graph and to the state scheduling. The structured nature of the input Hardware Description Language provides an initial control flow to the CDFG which can be modified at the request of the expert. These modifications can increase the parallelism by merging control macro-states or reduce the data-path complexity by serializing the execution. In addition, the global control unit itself can be implemented in one of several parameterized design styles⁴. Examples are ROM-based or PLA-based, pipelined, etc. The control unit designer (Cogent) [Dutt86] then has the responsibility of adjusting the schedule in the state-graph.
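To make the choice between control styles concrete, the sketch below compares rough size estimates for a ROM-based and a PLA-based controller. The formulas are common textbook approximations (a ROM stores one word per address; a PLA has an AND plane with true and complemented inputs and an OR plane), not the estimates used by Cogent, and the function names and example numbers are assumptions made for illustration.

```python
def state_bits(n_states: int) -> int:
    """Bits needed to encode n_states with a binary state assignment."""
    return max(1, (n_states - 1).bit_length())


def rom_bits(n_states: int, n_cond: int, n_ctrl: int) -> int:
    """ROM size in bits: one word for every (state, condition) address."""
    s = state_bits(n_states)
    words = 2 ** (s + n_cond)
    return words * (n_ctrl + s)          # control outputs plus next-state field


def pla_cells(n_states: int, n_cond: int, n_ctrl: int, n_terms: int) -> int:
    """PLA size in crosspoints: (2 * inputs + outputs) * product terms."""
    s = state_bits(n_states)
    inputs = s + n_cond
    outputs = n_ctrl + s
    return (2 * inputs + outputs) * n_terms


if __name__ == "__main__":
    # Assumed controller: 9 states, 1 condition input, 20 control outputs,
    # 10 product terms (one per transition).
    print(rom_bits(9, 1, 20))
    print(pla_cells(9, 1, 20, 10))
```

Under these assumptions the ROM grows with the number of encoded addresses while the PLA grows with the number of distinct transitions, so a sparse state graph tends to favor the PLA style.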
Finally, the design's global parameters must be set. Most important of these is the system clock speed. The generality of the model used in Chippe for function unit execution allows the graph to be partitioned into essentially arbitrary states, with the provision that the control has time to operate the interconnect and select the next state. The tradeoff here is between the fine granularity offered at high clock speeds, which allows units to operate with little delay, and the large area and power consumption required by the increase in the number of states and the requirement of faster control styles.
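The clock-speed tradeoff can be illustrated with a small calculation. The sketch below is a deliberate simplification and not Slicer's scheduling model: it assumes operations execute one after another, each starting on a clock boundary, with no chaining or pipelining, and that a fixed control delay is lost from every cycle. The function name and the delay values are assumptions.

```python
import math


def schedule_stats(op_delays_ns, clock_ns, control_delay_ns=0):
    """Return (states, total_time_ns, dead_time_ns) for a serial schedule."""
    usable = clock_ns - control_delay_ns      # time left for the data path
    if usable <= 0:
        raise ValueError("clock period must exceed the control delay")
    cycles = [math.ceil(d / usable) for d in op_delays_ns]
    states = sum(cycles)
    total_time = states * clock_ns
    # Time each unit spends idle after finishing, waiting for the next boundary.
    dead_time = sum(c * usable - d for c, d in zip(cycles, op_delays_ns))
    return states, total_time, dead_time


if __name__ == "__main__":
    delays = [150, 40, 40]                    # e.g. one multiply, two adds
    print(schedule_stats(delays, clock_ns=50, control_delay_ns=10))
    print(schedule_stats(delays, clock_ns=160, control_delay_ns=10))
```

With these numbers the fast clock needs six states but wastes little time, while the slow clock needs only three states and leaves the short operations idle for most of each cycle; the extra states of the fast clock are what enlarge the control unit.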
² A representation similar to the Flow-Graph [OrGa86] and DACON [Tric87] or Value-Trace [Kowa84] with added control information.
³ 'Merging' is identical with 'folding' defined in [Kowa84].
⁴ 'Style' here refers to a particular type of implementation; see [BrGa86].

4. The Chippe System
Figure 1 shows an overview of the basic functions in Chippe. The system input is a hardware description language [Pang87] reminiscent of PASCAL with modifications to allow specification of interface protocols and bit-fields. The language is translated by the compiler to a CDFG representation with data and operation dependencies inserted. This initial graph is stored as the starting point for all of the future refinements imposed on the design. This allows backtracking control-flow refinements to the graph.

Figure 1. Chippe Block Diagram

4.1. Refinement Tools
The Slicer [PaGa86] is the scheduler for Chippe, producing a valid set of micro-instructions for the CDFG. Slicer uses the control-delay and clock-cycle-time along with estimates of the operation times (all from the expert) to partition the critical path. Operations not on the critical path are assigned to states based on the dependences and available hardware. Thus the constraints to Slicer are the function-unit allocations, the control style, and the clock-time. Slicer correctly schedules units with delays longer than the clock and will attempt to chain units sequentially in a single state if the delays are short enough. Lastly, Slicer can schedule pipelined units and keep track of the position of the scheduled operations in successive clocks.

Slicer's timing model consists of a control state which is partitioned into one or more micro-instructions, each of which can be partitioned into several chain slices. The micro-instructions correspond to cycles of the system clock and also correspond to the time granularity of the system control unit. To prevent races in the data path, a register is assumed for each bus crossing a micro-instruction boundary. Chain slices allow execution of units from data that is not available in a register at the start of the cycle, but which will be available as output from a function unit. Figure 2 depicts a fragment of the state graph showing several micro-instructions. Most operations start shortly after the beginning of a state (as soon as data can be routed to them from the registers). However, the chain slice⁵ allows binding of the subtract operator to the end of the multiply operation without the addition of a register or a delay until the next state. To ensure the correct operation of the designed machine, such an operation requires either the input busses to be held active for the duration of the operation or that the input of the function unit be latched. This input latch is available to the expert as an optional attribute to set for the function unit. Pipelined units are treated as operators which become available periodically (not necessarily on clock boundaries) and whose outputs are delayed by an appropriate number of clock-times (the pipeline latency).

⁵ A partition of a single micro-instruction or clock-cycle.

Splicer [Pang87] performs the connectivity binding and unit allocation (selection of which units to use for each operation) for Chippe. Splicer is designed around a depth-first search using both backtrack and branch and bound techniques to bind operations to units and connections to busses. Its input will accept an arbitrarily connected partial design and use connections it can find or optionally introduce new ones if necessary. The use of depth-first search allows an initial (greedy) solution to be found quickly so that Splicer can be used by the expert to explore the design space for potential design candidates. In addition, Splicer will look ahead an arbitrary number of
states in searching for connections, allowing the expert to select the depth and iteration limits for the search. Splicer uses preset cost heuristics in calculating the quality of its designs. These costs are selected by the expert to guide the design search.

Figure 2. Fragment of State Graph

4.2. Design Style Selections
There are two major style selection decisions supported by Chippe. These are selection of the global control unit style and selection of the function units which are used to implement the design. Other possible style decisions could be added, most notably connection and layout styles. These would require floorplanning and layout refinement tools which have not been implemented yet. In the present version, Chippe allows connectivity constrained only by the allocated function units and area constraints.

Chippe's global control generation is performed by the Cogent [Dutt86] sub-system. Cogent allows the selection of several parameterized control unit styles and modifies the graph dependencies of the CDFG accordingly. It allows the specification of a control or data-path implementation for each control signal encountered in the graph. Data path implementations require merging of control macro-states to increase the available parallelism. Control path implementations require state assignments and transitions in the global control unit. In addition, Cogent combines all of the state data from Slicer and the allocation data from Splicer to produce the actual micro-code for the control unit.

The function unit data base supports the expert by supplying possible units and estimates for the operation of those units. The data-base supports several selectable attributes for function units such as input latching, multi-level pipelining, and several implementations of each function with different area-time tradeoffs. It provides insulation between the technology dependent component data-base and the rule base of the expert system. This allows the function unit data-base to be easily replaced as required by the application specific design technology. When the system is interfaced to a structural compiler these estimates can be tuned to the lower level structural design.

4.3. Design Evaluation in Chippe
The evaluator assembles the data from the state graph and the partial design to produce the quality measures used by the expert. It ensures that each subsystem has the current version of the data it requires and manages constraint handling for the system. For example, the evaluator ensures that the Slicer has the correct estimate of the control-unit delay. When Chippe is interfaced to a structural silicon compiler, the evaluator will manage the data passed back from the actual design layout to correct these estimations. The last part of the evaluator is a set of functions called from the expert which produce local and global evaluations of the graph and connectivity. These are used by the expert to focus the design effort onto specific local trade-offs.

The expert is driven by relating imposed goals for the design to evaluations made by routines which measure various parameters of the potential design. The basic measures are estimates of the area, power dissipation and execution time. These estimates are compiled from the allocated function units, the connections, and the control unit. At present there is no method of including layout constraints in the design parameters as has been done in BUD [McKo86]. All of the above quality measures point out problems with the design but do not indicate how to correct these problems. For this purpose several other quality measures are used. These include 'overlap', 'dead-time', 'bus-usage', and identification of components on the critical path.

The overlap function determines the number of scheduled states for which two units are active at the same time. This measure helps determine the relative effect of merging or eliminating units on the schedule. If the overlap is zero (i.e. the units are exclusive) then merging can be done without lengthening the schedule. Small numbers reflect relatively small changes in the execution time. Large numbers indicate that the merging will cost a great deal of time and should only be done if the area must be significantly reduced.

The dead-time and critical path functions are used to determine means of increasing the performance by alterations of the system clock-time and the critical components respectively. In a case where a small time improvement is needed it may be possible to substitute a faster (and larger) unit on the critical path. The dead-time measure collects information on how poorly the system clock fits the execution time of the function units in the design, that is, how much time expires for each unit after it has completed its task and is waiting for more inputs. This measure quantifies the efficiency of the global control clock granularity for the present design schedule. Large values of dead-time indicate a possible performance increase by modification of the system clock. It is important to note that modifications of the system parameters to improve a quality measure may change other quality measures in ways that are not desired. For example, a change in the system clock to reduce the dead-time may modify the schedule enough so that the performance of a time critical macro-state is decreased. To help avoid these problems the expert is designed to perform certain simple strategies of design refinement.

4.4. The Expert
The expert's view of the design process consists of three basic structures. These are the function unit bindings, the global parameters and goals, and an abstracted CDFG. The expert maintains a list of function units which are bound to the operations they can perform. In addition, the function units have attributes such as pipelining, input latching, power dissipation, area, number of clocks to complete an operation, and flow delay time. The expert maintains its own version of the CDFG at the granularity of control-flow blocks. Each block corresponds to a straight-line section of the CDFG, with a control condition determining the next possible blocks. Thus the design's CDFG is modeled as a finite state machine whose states correspond to (macro-)states of operations of the machine separated by explicit conditional transitions to other macro-states. These macro-states reflect the finest granularity of modifications to the graph by the expert system. Each macro-state has several attributes such as total execution time, function unit usage and type of control transition. Finally, the expert has a collection of parameters corresponding to the global state of the design. These include the system clock time, the system's total area, total power dissipation, control time delay, and other quality measures for the machine.

The organization of the macro-state graph is determined by the possible refinements selected by the expert. Modifications of the macro-state blocks, such as merging two states, result in the corresponding modifications of the CDFG and global control unit. For example, a state block with two successor blocks can be merged into a single larger block by the insertion of multiplexors controlled by the condition operation.
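The merging of a conditional block with its two successors can be sketched in a few lines. The `MacroState` record and the `merge_conditional` helper below are hypothetical and much simpler than the structures Chippe maintains; the sketch assumes that branches using disjoint function units can run in parallel, that branches sharing a unit must be serialized, and that one multiplexor stage with an assumed fixed delay selects each variable written by either branch.

```python
from dataclasses import dataclass, field


@dataclass
class MacroState:
    name: str
    time_ns: float                              # total execution time of the block
    units: set                                  # function units used by the block
    writes: set = field(default_factory=set)    # variables assigned in the block


def merge_conditional(head, then_blk, else_blk, mux_delay_ns=5.0):
    """Fold a block and its two successors into one macro-state.

    Both branches are executed and the condition computed in `head` drives
    multiplexors that pick the result for every variable written by a branch.
    Returns the merged macro-state and the sorted list of muxed variables.
    """
    if then_blk.units & else_blk.units:
        # Shared hardware: the branches cannot overlap in this sketch.
        body = then_blk.time_ns + else_blk.time_ns
    else:
        # Exclusive hardware: the branches run side by side.
        body = max(then_blk.time_ns, else_blk.time_ns)
    muxed = then_blk.writes | else_blk.writes
    merged = MacroState(
        name=f"{head.name}+{then_blk.name}+{else_blk.name}",
        time_ns=head.time_ns + body + (mux_delay_ns if muxed else 0.0),
        units=head.units | then_blk.units | else_blk.units,
        writes=head.writes | muxed,
    )
    return merged, sorted(muxed)


if __name__ == "__main__":
    head = MacroState("H", 40.0, {"CMP"})
    t = MacroState("T", 100.0, {"FU1"}, {"x"})
    e = MacroState("E", 60.0, {"FU2"}, {"x", "y"})
    merged, muxes = merge_conditional(head, t, e)
    print(merged.name, merged.time_ns, muxes)
```

The merged state removes a control transition at the cost of the multiplexors, which corresponds to moving the conditional from the control unit into the data path.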
This block merging is shown in Figure 3. The actual merging of the blocks and updating of the state graph is performed by the Cogent subsystem along with local optimizations to increase the design efficiency. This change amounts to selecting a data-path implementation of the conditional operation rather than using the global control unit to select the next state. If the two conditional blocks use exclusive parts of the machine then the parallelism of execution can be increased without the addition of significant hardware. Macro-states can often be merged vertically if the conditions are not dependent. This allows multi-way selection of next states if the control unit can perform multi-way branches. Other possible macro-state refinements include the familiar compiler control-flow changes such as constant folding and loop unwinding; see Flamel [Tric87].

Figure 3. Merging State Blocks

The purpose of these modifications is to change the control structure from one that is easily described by a programming language to one that is efficient for a potential design. The control for the design is performed by a control unit which is selectable by the expert. This selection is based on the goals set for the final design and the evaluation of the desired behavior. In addition, the expert can select the direct implementation of the control for unique state transitions. The reason for this ability is that the changes to the total design from modifications of the control are not well understood; hence, modifications to the control flow are performed directly at the request of the expert knowledge base. The coarse granularity of the data path representation, however, indicates that much better knowledge about the optimization of this part of the design exists, so that direct rule based control is deemed unnecessary. Instead, this part of the design is amenable to standard data-flow optimizations which are handled in Slicer and Splicer.

Refinements to the control flow are often irreversible after the optimizations and dependencies are resolved. Since the intended result of the modification cannot be assured at the outset, the expert must have a method for backtracking its actions on the CDFG. For this purpose the expert maintains a stack of earlier states and potential refinements. If the refinements to the design fail to achieve the goals, the design can be backtracked to an earlier state with knowledge of the modifications which led to failure.

4.5. Design Strategy
The expert maintains control of the system by modification of knobs. These knobs include modifications in the function unit allocations, the design global parameters (clock length, type of control, control delay etc.), and selection of limits and heuristics for the Slicer, Splicer, and Cogent subsystems. The primary control of the expert over the data-path is from the function-units which are selected to be the components of the design. These units are kept in a data base and are matched by desired functionality and area/time/power characteristics. The expert can select many parallel fast units for a time constrained design or a few highly merged units (those performing a large number of operations) for an area constrained design. Since the system design philosophy is design by iterative refinement, there are rules whose action part increases the merging of units as well as rules to split units, adding to the achievable parallelism. This merging also has a strong effect on the connections needed to complete the design. A smaller number of function units requires a correspondingly smaller number of busses for interconnection. Lastly, the selection of individual units performing identical functions offers additional design tradeoffs. A function unit can be pipelined or implemented with carry-lookahead or ripple logic. For example, an adder on the main data-path may be wide enough to require a lookahead function while a narrower incrementer could be fast enough (and save space) if implemented as ripple-carry.

The decision making process of the expert is performed in two phases. First the goals are compared to the evaluations to select a strategy for change. Then selected rules use local measures to determine possible actions. The action with most promise is tried first, after which the design is re-evaluated to determine the changes. Finally, the design can be backtracked if a strategy proves useless. An example of a strategy for minimizing the area usage by merging is shown in Figure 4. These rules are arranged in order of increasing change to the schedule. The first rule which potentially solves the problem is fired and the schedule and graph are updated.

    Rule: Remove-Redundant::
    If ( Never-used( FU1) )
    Then (
        Remove( FU1) )

    Rule: Merge-Exclusive::
    If ( Compatable( FU1, FU2) && Exclusive( FU1, FU2) &&
         Smaller-Area( Merge( FU1, FU2), FU1+FU2) &&
         Largest-Gain( FU1, FU2) )
    Then (
        Add( Merge( FU1, FU2)) && Remove( FU1, FU2) )

    Rule: Merge-Trade::
    If ( Compatable( FU1, FU2) &&
         Smaller-Area( Merge( FU1, FU2), FU1+FU2) &&
         Overlap-Cost( FU1, FU2) < Largest-Gain( FU1, FU2) )
    Then (
        Add( Merge( FU1, FU2)) && Remove( FU1, FU2) )

Figure 4. Function Unit Merge Rules

A possible strategy for the selection of function units is to first allocate a unique unit to each operation in the graph. After scheduling, the graph can be scanned for unused units and for potential merging candidates. The resulting machine is evaluated and compared to the goals. Since the machine is implemented with many parallel units, if the performance is not high enough, some of the units will need to be replaced with faster versions. If the area is too large, the merging candidates can be evaluated for potential gains and the allocation appropriately modified by pair-wise merging of the candidates. After a strategy has been selected (for example, reducing the number of instantiations of a unit), a set of rules for that strategy is activated. This has the effect of inducing a two level control on the rule-base: first the strategy is selected by evaluating the measures against the desired (global) goals, then the implementation of that strategy is activated. The rules that represent the action of a particular strategy determine the particular (local) change required. If no candidate meets the requirements of the activation of a given strategy, the activation fails and a new strategy is chosen. When all strategies for
a given desired change fail, the expert can optionally backtrack to earlier design decisions (such as selection of a control style).

The system clock length has a drastic effect on the final design. The flexibility of the scheduling algorithm allows clocks which may be either faster or slower than the execution delay of the units. The faster clocks reduce the granularity with which the control can be scheduled. This reduces the 'dead-time' when a unit is unused and awaiting new operands. However, it also increases the number of states which must be encoded in the control unit, thus increasing the unit's size and delay. Longer clock time allocations allow more units to execute in sequence within a single clock; this can lead to efficient operation at relatively slow clock speeds. It must be noted that the clock length should not exceed the execution time of an entire macro-state, as no further operations can be chained in sequence. Thus, designs with many small macro-states (such as a controller with many ports to service) should use higher clock rates to reduce the response time.

The design of this system is based on simple tradeoffs controlling fast search of the design space. For the connections between the function units in the design, several cost strategies with different tradeoffs are possible. The design can be interconnected with a few busses having many connections to different units, or the number of busses can be allowed to increase, reducing the loading on the busses. In addition, heuristics which give adequate results for very short iteration limits are different from those which produce high quality designs in longer searches [Pang87]. The expert should not spend long searches to optimize designs which are far from satisfying the goals, so design time can be traded against quality of the design produced. In the final stages of the design process (when the design is close to the imposed goals) the design can be connected with optimized minimal connections for the final designs.

5. Walk-Through Example
Figure 5 shows the hardware description for a small loop. This test case was used by the HAL system [PaKG86]. We will examine a run of Chippe on this fragment and indicate the trade-off decision points. All of the following examples were produced by Chippe from this one code fragment: the area and time bounds were set at the beginning and the examples were sampled from the run as the design progressed.

    program diffeq(input,output);
    /* Example from HAL: A Multi-Paradigm Approach to
       Automatic Data-Path Synthesis 23rd DAC */
    type integer = {0..11};
         three : integer;
         five : integer;
    var a, dx, x, u, y, y1, u1, u2, u3, u4, u5, u6 : integer;
    begin
      if (x < a) then
      repeat
        u1 := u * dx;
        u2 := five * x;
        u3 := three * y;
        y1 := u * dx;
        x := x + dx;
        u4 := u1 * u2;
        u5 := dx * u3;
        y := y + y1;
        u6 := u - u4;
        u := u6 - u5;
      when x < a
    end.

Figure 5. Hardware Desc. Language for Hal example

Figure 6 traces the evolution of the small design test case. The goals for the system were area < 3000 gates and delay < 1.0 uSec. These constraints are shown as the vertical dashed box on the left side of the figure. The figure shows that the first set of merges (two of the multipliers) reduced the area but did not change the performance, since these units were not used simultaneously in the schedule. The later merges required more states to complete the loop, but in each case the trade was in favor of the desired goal. In this case the area bound was first met and then the time performance goal was attempted. This resulted from the large difference between the initial area and that of the goal requirement, which led to the selection of the simple strategy outlined above for area reduction. After the area constraint was satisfied the controller tried to speed up the machine, since the time bound was now violated. In searching for modifications to speed up the machine, the rule base used a usage measure to determine where the biggest gain could be made by a unit modification. The unit returned was the multiplier, which was used in nearly all states of the loop. A data-base query determined that this unit could be pipelined. The strategy used here was that the system clock could be shortened, decreasing the loop delay. These changes led to the large drop in loop delay in the figure. Finally, the change in the clock led to a last potential merge, producing the final design.

Figure 6. Design Evolution (horizontal axis: Gates, 2000 to 9000; annotation: Area Constraints Satisfied)

These changes are shown pictorially in Figure 7, Figure 8 and Figure 9. The tables that appear under each figure are the output symbolic microcode for these three designs. The symbolic microcode and the micro-architecture contain sufficient data to build the control unit. Estimates of the control unit size based on the control-unit style and the micro-code are thus quite accurate. Each numbered block corresponds to a state of the machine while the lines describe which units are accessed and where the results are placed. The FUxx, rxxx, and bxx are function units, registers, and busses respectively. Operands are supplied to the function-units on the indicated busses. In these examples (to conform to the original Hal paper) the initial values for the registers are assumed to be stored at the start of the code fragment. In a more realistic case these values could be loaded from a constant ROM or from external ports in the environment.

Figure 7 shows the design after one unit was merged, the adder-subtractor. In this very parallel version the six multiplies are carried out in just two states, leaving the other states relatively empty. The area requirement for this design was far
(In the tables below, ? marks a character that is illegible in the source.)

State  ACTIONS
2a     r000,B01 = FU02(*r004b04r005b05)
       r001,B02 = FU03(*r001b02r002b03)
       r007,B03 = FU04(*r000b01r006b06)
       r008,B04 = FU05(*r004b04r005b05)
3a     r007,B03 = FU04(*r000b01r0??b02)
       r002,B02 = FU06(+r002b03r005b05)
       r000,B01 = FU02(*r007b04r005b05)
       r006,B04 = FU06(+r006b06r006b07)
4a
4b
       B05 = FU07(-r004b04r007b06)
       FU01(<r002b01r003b02)
       r004,B04 = FU08(-B05r000b07)
Figure 7. The HAL design after one merge

greater than the goal so the expert chose to remove multipliers
as they provide the largest gain in area. This design also made
use of the chaining ability of Slicer/Splicer. Since the clock
time set by the expert allows the multiplies to execute in one
cycle, there is sufficient time to perform both an add and a
subtract in a single state.

State  ACTIONS                            Nxt  Conditions
1a     FU01(<r002b01r003b02)              2    x<a :TRUE
                                          1    x<a :FALSE
2a     r001,B01 = FU02(*r004b03r005b04)   3
       r007,B02 = FU05(*r001b01r002b02)
3a     r002,B02 = FU05(*r000b01r006b05)   4
       r001,B01 = FU02(*r001b02r007b06)
       r002,B03 = FU03(+r002b03r005b04)
4a     r001,B01 = FU02(*r005b05r004b04)   5
       r004,B03 = FU03(-r004b04r001b01)
       r002,B02 = FU05(*r002b02r005b05)
       FU01(<r003b03r002b02)
5a     r006,B0? = FU0?(+r006b05r001b02)   2    x<a :TRUE
       r004,B03 = FU03(-r004b03r002b01)   1    x<a :FALSE
Figure 8. Intermediate Hal design

State  ACTIONS                            Nxt  Conditions
1a     FU01(<r002b01r003b02)              2    x<a :TRUE
                                          1    x<a :FALSE
2a     FU02(*r004b01r005b02)              3
3a     r001,B01 = FU02(*:)                4
       FU02(*r002b01r001b02)
4a     r000,B01 = FU02(*:)                5
       FU02(*r000b01r006b02)
5a     r000,B01 = FU02(*:)                6
       FU02(*r000b0?r00?b02)
       r002,B02 = FU03(+r002b03r005b04)
6a     r001,B01 = FU02(*:)                7
       FU02(*r004b03r005b04)
       FU01(<r002b01r003b02)
7a     r000,B01 = FU02(*:)                8
       FU02(*r000b01r005b04)
       r004,B02 = FU03(-r004b03r001b02)
8a     r001,B01 = FU02(*:)                9
       r006,B02 = FU03(+r000b03r006b02)
9a     r004,B02 = FU03(-r004b03r001b02)   2    x<a :TRUE
                                          1    x<a :FALSE
Figure 9. Final design for Hal

The design in Figure 8 shows the result after several
more merges. The gate usage in this example is still about
5500. Fewer registers are used in this example since the
decrease in parallelism allowed a schedule with one fewer
temporary register. The plethora of muxes in this design is an
artifact of the design strategy (which forbids optimization at
this point in the design) and the Splicer heuristic (picked by the
expert to minimize busses at the expense of muxes). If this
design were close to the desired goal, much better optimization
would be used. This design iteration took less than 1% of the
total design time for the code fragment. Optimizations at this
step would have been wasted design time.
The final design shown in Figure 9 shows the design
after the inclusion of a 2-stage piped multiply unit. This design
modification occurred because the number of sequential
multiplies became large enough for a pipe to be efficient. Notice
that the two-level muxing structure has resulted in a design
with four input busses and two output busses. The optimization
of this design clearly splits the registers into two structural
units: R0, R2, R4 and R1, R3, R5, R6. Additional rules could
create register arrays for these partitions. This total sequence
of designs took about 6 sec of CPU time on a Pyramid processor
(roughly 2.5 times the speed of a VAX-11/780). About 95%
of this time was spent optimizing the connections in the last
design; the other exploratory designs were not optimized.
Changing the design goals to time < 0.4 uSec and area <
6000 gates resulted in the design in Figure 10. The evolution to
this design started out the same as in the previous one but
deviates as soon as the area goal is satisfied. Several
attempts to achieve the required time were made, including
pipelining the (two) multipliers and changing the clock. These
changes are depicted in the design evolution chart as the
dashed line moving to Ex4. In this case the design attempt
failed and Figure 10 is the best design found by the expert.
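The comparison of gauges against goals, and the fallback to the best design found when the goals cannot be met, can be sketched as follows. This is only an illustrative Python sketch: the `Gauges` record, the `violation` score (a sum of relative overshoots) and all candidate readings are assumptions invented for the example. Only the goal values, 0.4 uSec and 6000 gates, come from the text, and the sketch treats a reading equal to the goal as acceptable, which is a simplification of the strict bounds stated above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gauges:
    time_us: float    # loop delay in microseconds
    area_gates: int   # estimated gate count


def violation(gauges, goal):
    """Sum of relative overshoots; 0.0 means both goals are met."""
    over_time = max(0.0, (gauges.time_us - goal.time_us) / goal.time_us)
    over_area = max(0.0, (gauges.area_gates - goal.area_gates) / goal.area_gates)
    return over_time + over_area


def best_design(candidates, goal):
    """Scan designs in order; stop at the first that meets the goals,
    otherwise return the one with the smallest violation."""
    if not candidates:
        raise ValueError("at least one candidate design is required")
    best_name, best_gauges = candidates[0]
    for name, gauges in candidates:
        if violation(gauges, goal) < violation(best_gauges, goal):
            best_name, best_gauges = name, gauges
        if violation(gauges, goal) == 0.0:
            break
    return best_name, violation(best_gauges, goal) == 0.0


goal = Gauges(time_us=0.4, area_gates=6000)

# Invented gauge readings for three hypothetical design iterations.
candidates = [
    ("initial", Gauges(time_us=0.45, area_gates=9000)),
    ("merged", Gauges(time_us=0.60, area_gates=5500)),
    ("pipelined", Gauges(time_us=0.50, area_gates=5800)),
]

name, goals_met = best_design(candidates, goal)
```

With these invented readings no candidate meets both goals, so the sketch reports the design with the smallest overshoot and flags the goals as unmet, mirroring the outcome in which the attempt fails and the best design found is kept.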
Paper 12.2
208
State  ACTIONS                            Nxt  Conditions
1a     FU01(<r002b01r003b02)              2    x<a :TRUE
                                          1    x<a :FALSE
2a     FU02(*r004b03r005b04)              3
       FU04(*r001b02r002b01)
3a     r001,B02 = FU04(*:)                4
       r000,B01 = FU02(*:)
       FU04(*r006b02r000b01)
       FU02(*r004b03r005b04)
4a     r002,B01 = FU02(*:)                5
       r001,B02 = FU04(*:)
5a                                        6
       r006,B03 = FU03(+r006b05r002b01)
       FU01(<r002b01r003b02)
6a     r000,B01 = FU02(*:)                7
       r002,B03 = FU03(-r004b05r005b04)
7a     r004,B03 = FU03(-r002b05r000b01)   2    x<a :TRUE
                                          1    x<a :FALSE
Figure 10. Faster Design

The present version of the function unit data base was
implemented around the LSI Logic LS17000 series gate array
products [LSI85] because several of the functions needed
were already designed and characterized. All of the timings
and gate counts given are estimates based on the units
allocated, the control unit design, and an estimate of the area
used by bussing and connections.
6. Conclusions
The expert control for the Chippe system is presently
under development to add more capabilities in design analysis.
Specifically, rules are needed to control optimization routines
for the graph and to optimize the control selection vs. the
system clock time. Lastly, future research is needed in the
language used to initially represent the design behavior.
There are several limitations inherent in the system; most
of these are related to the knobs and gauges control strategy.
The system cannot make direct changes in the potential design
without losing the ability to iterate the refinements. It can only
change the knobs which control the refinement process and run
optimization routines on the output design. There are several
possible optimizations which could be performed at different
stages in the design, most notably the generation of the initial
control data-flow graph, where variations on compiler
optimizations would be very useful.
The present system does show, however, that design
refinement can be carried out using strategies based on simple
design trade-offs. The simplicity of the expert control stems
greatly from the generality of the underlying design model and
the associated design tools. Several designs have been tested
using the system and with few exceptions have all been
amenable to the same rules. This gives support to the idea of a
generalized set of design strategies for a wide class of
architecture design problems.
7. References
[Barb81] M. Barbacci, "Instruction Set Processor Specifications
(ISPS): The Notation and its Applications," IEEE
Transactions on Computers, vol. 30, no. 1, Jan. 1981.
[BrGa86] F. D. Brewer, D. D. Gajski, "An Expert System Paradigm
for Design," 23rd Design Automation Conference, June 1986.
[DeRS86] H. DeMan, J. Rabaey, P. Six, "CATHEDRAL II: A Synthesis
and Module Generation System for Multiprocessor Systems
on a Chip," NATO Study Institute of Logic Synthesis and
Silicon Compilation for VLSI Design, L'Aquila, Italy,
July 7-18, 1986.
[Dutt86] N. Dutt, "COGENT: A Parameterizable Control Generator
for Constraint Driven Microarchitecture Synthesis," Ph.D.
Preliminary Proposal, University of Illinois,
Urbana-Champaign, Dec. 1986.
[JVJC86] A. Jerraya, P. Varinot, R. Jamier, B. Courtois,
"Principles of the SYCO Compiler," 23rd Design Automation
Conference, June 1986.
[KnPa86] D. W. Knapp, A. C. Parker, "A Design Utility Manager:
the ADAM Planning Engine," 23rd Design Automation
Conference, June 1986.
[LSI85] LSI Logic Corp., "CMOS Macrocell Manual," July 1985.
[McKo86] M. C. McFarland, T. J. Kowalski, "Assisting DAA: The
Use of Global Analysis in an Expert System," Proceedings
ICCD 1986.
[PaGa86] B. Pangrle, D. D. Gajski, "State Synthesis and
Connectivity Binding for Microarchitecture Compilation,"
Proceedings ICCAD 1986.
[Pang87] B. M. Pangrle, "A Behavioral Compiler for Intelligent
Silicon Compilation," Ph.D. Thesis, University of Illinois,
Urbana-Champaign, July 1987.
[PaKG86] P. G. Paulin, J. P. Knight, E. F. Girczyc, "HAL: A
Multi-Paradigm Approach to Automatic Data Path Synthesis,"
23rd Design Automation Conference.
[Sout83] J. R. Southard, "MacPitts: An Approach to Silicon
Compilation," IEEE Computer, vol. 16, no. 12, Dec. 1983.
24th ACM/IEEE Design Automation Conference
