978-1-4244-6455-5/10/$26.00 ©2010 IEEE 755 11th Int'l Symposium on Quality Electronic Design

Abstract— Clock skew scheduling is a useful sequential circuit optimization method. The runtime efficiency of this problem becomes crucial if it must be repeated iteratively in a higher level optimization. The widely recognized Burns’ algorithm proposed to solve this problem suffers from high runtime complexity, which makes it unsuitable to be deployed in iterative optimization loops. This algorithm is based on the general concept of primal-dual optimization. In this paper, we demonstrate that a more efficient approach to the clock skew scheduling problem can be developed by designing a new algorithm using the same primal-dual optimization concept. The basic idea of the algorithm is to avoid creating a new admissible graph and recalculating θ values for each iteration of the primal-dual optimization. The asymptotic runtime of our algorithm is $O(|V||E| + |V|^2 \log|V|)$, which is improved from $O(|V|^2|E|)$ thanks to the heap data structure used in our proposed algorithm. The experimental results show that our algorithm is on average 95 times faster than Burns’ implementation. In the best case, we observe as much as a 189 times speedup.

Keywords— Clock Skew Scheduling, Primal-Dual, Optimization, Sequential Circuit

I. Introduction

Clock skews are the differences in clock arrival times at different registers due to the variation in interconnect delays in the clock distribution network. On one hand, this has been viewed as a design fault and efforts have been directed towards constructing zero skew clock networks [8]. On the other hand, the concept of clock skew scheduling views clock skews as a manageable resource rather than a liability by assigning a certain clock arrival time for each register [3], [2], [11]. The ability of clock skew scheduling to enhance circuit performance and power consumption has been studied in the literature [14].

The clock period can be minimized if we carefully assign a clock arrival time $l_i$ for each register. The minimum clock period clock skew scheduling (MPCSS) problem is to find a suitable assignment to achieve this goal. It can be formulated as a linear program [3] and solved based on either the combination of the shortest path algorithm [4] and binary searching [2], or the primal-dual optimization concept [1]. The widely popular implementation in current literature performs reasonably for a single iteration, i.e., several seconds for a moderate size circuit. However, the runtime can become as much as several hundred seconds when the circuit size increases. Furthermore, we may need to solve the MPCSS problem repeatedly for thousands of iterations as a subroutine in certain applications. In that case, the runtime efficiency becomes extremely crucial. For example, during multi-domain clock skew scheduling, it has been reported that the MPCSS subroutine needed to be repeated as many as 2327 times [11], which takes as many as 20 hours. Consider another related problem, namely, the register binding problem that considers minimizing the clock period via clock skew scheduling [5]. The time complexity of the algorithm proposed to solve this problem is $O(|V||E|L_{css})$, where $L_{css}$ represents the runtime of the MPCSS problem. This shows clearly that the overall efficiency of the register binding problem directly depends on the efficiency of the local MPCSS algorithm.

One obvious way of improving the runtime efficiency of these techniques is to speed up the clock skew scheduling algorithm, since it is called as a major subroutine within each iteration. This motivates us to revisit the MPCSS problem. We started by studying the widely used algorithm by Burns [1]. Burns’ algorithm is based on the general primal-dual optimization concept, which is used to solve linear programs (LPs). Developing an efficient implementation for the primal-dual algorithm now becomes the key issue. In order to address this issue, we made the following important observation. There are existing examples which efficiently implement the primal-dual method and yield fast algorithms, such as Dijkstra’s shortest path algorithm [10]. We developed an efficient primal-dual based clock skew scheduling algorithm inspired by Dijkstra’s algorithm. Dijkstra’s algorithm uses a heap data structure to organize the elements involved in the shortest path problem. We employed the same heap structure to reduce the computation complexity of the primal-dual optimization algorithm. In addition, we also developed further enhancements that are specific to the MPCSS problem. For instance, various computation steps which are repeated redundantly in Burns’ implementation have been significantly simplified through incremental update mechanisms. As a result, the asymptotic time complexity of our proposed algorithm is $O(|V||E| + |V|^2 \log|V|)$, which is improved from Burns’ complexity of $O(|V|^2|E|)$. Our experimental results show that
in practice, our implementation can speed up the MPCSS algorithm by 95 times on average.

The remainder of this paper is organized as follows. In Section 3, we first present formally the primal-dual algorithm. Then we will show the reason why Burns’ algorithm is not efficient. Our methods of improving runtime efficiency and the Dijkstra style implementation of the primal-dual algorithm are proposed afterward. The experimental results are presented in Section 4, followed by the conclusions in Section 5.

II. Related Work

The MPCSS problem was first proposed by Fishburn [3]. A linear programming formulation was presented in the work. By recognizing the equivalent graph representation of the MPCSS problem, Deokar et al. proposed to solve the problem by iteratively solving the embedded shortest path problem [2]. On a higher hierarchy of the algorithm, binary search was implemented to predict the possible clock period for the next iteration. A similar approach was also proposed earlier [7]. The asymptotic time complexity of this algorithm is $O(|V||E| \log\frac{C}{n})$, where $|V||E|$ is the complexity of the Bellman-Ford shortest path algorithm, $C$ is the initial range of the clock period, and $n$ is the required accuracy of the clock period. In practice, the binary search based algorithm can be efficient only when the Bellman-Ford shortest path algorithm is carefully implemented by heuristics as proposed in the literature [4].

In general, the MPCSS is a linear programming problem with some special properties that faster algorithms can take advantage of. Therefore, any strategy for solving linear programs can also be used for the MPCSS problem. The primal-dual algorithm is one of these approaches [10]. The widely popular Burns’ algorithm for solving the MPCSS problem is based on the primal-dual concepts [1]. Different from the binary search strategy, where in each iteration the clock period is blindly guessed, the clock period is calculated in a more guided way in the primal-dual based approach. This usually leads to much faster convergence of the clock period in practice. The asymptotic time complexity of Burns’ algorithm is $O(|V|^2|E|)$, which is no longer dependent on the accuracy of the solution. In other words, Burns’ algorithm solves the MPCSS problem exactly while the binary search based approach is only a heuristic algorithm.

In this paper, we revisit the primal-dual based algorithm for solving the MPCSS problem. We will carefully analyze each step in the primal-dual strategy and propose a much more efficient primal-dual based algorithm than Burns’ algorithm.

III. The Primal-Dual Algorithm

Given a circuit graph $G_c(V_c, E_c)$, we can generate a constraint graph $G(V, E)$ by finding the longest and shortest paths between all pairs of flip-flops in the circuit graph in $O(|V_c|^2)$ time, as illustrated in Figure 1. There is a hold edge $(i, j)$ (forward dashed line in Figure 1) and a setup edge $(j, i)$ (backward solid line in Figure 1) between two vertices $i$ and $j$ in $G$ if and only if there exists a direct path (without having other DFFs on the path) from $i$ to $j$ in the circuit graph $G_c$. We define $T_{ij}$ as the longest path delay from vertex $i$ to $j$, $t_{ij}$ as the shortest path delay from vertex $i$ to $j$, $P$ as the clock period of the circuit, $E_s$ as the set of all setup edges in $E$, and $E_h$ as the set of all hold edges in $E$.

Fig. 1. An example of generating the constraint graph (shown on the right) from the circuit graph (shown on the left). Labels on gates represent the gate delays. Labels on the forward (backward) edges in the constraint graph represent the shortest (longest) path delays.

Clock skew scheduling assigns each flip-flop a specific clock arrival time $l_i$ such that the timing constraints between any pair of flip-flops are not violated and the clock period is minimized. This problem can be formulated as the following linear program:

$$\text{Primal:}\quad \min P$$
$$\text{s.t.}\quad l_i - l_j \ge T_{ji} - P, \quad \forall (i, j) \in E_s \tag{1}$$
$$l_i - l_j \ge -t_{ij}, \quad \forall (i, j) \in E_h \tag{2}$$

where inequality (1) represents the setup time constraint between flip-flops $i$ and $j$ and inequality (2) represents the hold time constraint. Here we have already added the setup and hold time into the corresponding path delay.

This linear program can be solved directly by binary search with efficient negative cycle detection [7]. On the other hand, the primal-dual algorithm takes an indirect approach. The principle of the primal-dual algorithm is based on the complementary slackness theorem [13]. It starts from a feasible solution of the primal problem and then tries to find a feasible solution to its dual. If these two feasible solutions satisfy the complementary slackness condition, they must be optimal. We will elaborate on the details of the basic primal-dual algorithm in Section 3-A.

In Section 3-A.1, we first show that finding a dual feasible solution is equivalent to solving a system of equations defined by complementary slackness conditions. In Section 3-A.2 we show that directly solving these equations is avoided by formulating an equivalent linear program. This linear program can be easily solved by cycle detection. In
Section 3-A.3, we propose a general primal-dual algorithm and analyze the key steps that determine the runtime efficiency of the primal-dual algorithm.

A. An overview of the primal-dual algorithm

In this section, we describe the fundamental steps involved in the primal-dual algorithm. This discussion will be necessary for two reasons. First, we will use it to illustrate the causes of inefficiencies in Burns’ implementation. Second, this helps to understand how our algorithm can address these problems. The framework of using primal-dual optimization to solve the MPCSS problem is proposed in Section 3-A.3.

A.1 finding a dual feasible solution

The dual problem of this primal linear program can be expressed as follows,

$$\text{Dual:}\quad \max \sum_{(i,j)\in E_s} T_{ji}\,\sigma_{ij} - \sum_{(i,j)\in E_h} t_{ij}\,\eta_{ij}$$
$$\text{s.t.}\quad \sum_{(i,j)\in E_h} \eta_{ij} - \sum_{(j,i)\in E_h} \eta_{ji} + \sum_{(i,j)\in E_s} \sigma_{ij} - \sum_{(j,i)\in E_s} \sigma_{ji} \le 0, \quad \forall i \in V \tag{3}$$
$$\sum_{(i,j)\in E_s} \sigma_{ij} \le 1 \tag{4}$$
$$\sigma_{ij} \ge 0, \quad \eta_{ij} \ge 0$$

where $\sigma_{ij}$ represents the dual variables corresponding to all setup constraints while $\eta_{ij}$ represents the dual variables corresponding to all hold constraints.

According to the complementary slackness condition [13], a feasible solution $\{l_1, \ldots, l_n, P\}$ of the primal problem and a feasible solution $\{\sigma_{ij}, (i, j) \in E_s\} \cup \{\eta_{ij}, (i, j) \in E_h\}$ of the dual problem are both optimal, if and only if they satisfy the following equations,

$$\sigma_{ij} \cdot (l_i - l_j - T_{ji} + P) = 0, \quad \forall (i, j) \in E_s \tag{5}$$
$$\eta_{ij} \cdot (l_i - l_j + t_{ij}) = 0, \quad \forall (i, j) \in E_h \tag{6}$$
$$P \cdot \Big(\sum_{(i,j)\in E_s} \sigma_{ij} - 1\Big) = 0 \tag{7}$$
$$\sum_{(i,j)\in E_h} \eta_{ij} - \sum_{(j,i)\in E_h} \eta_{ji} + \sum_{(i,j)\in E_s} \sigma_{ij} - \sum_{(j,i)\in E_s} \sigma_{ji} = 0, \quad \forall i \in V \tag{8}$$

In equation (8), we do not multiply the primal variables $l_i$ with the dual constraints, since it has been proven by Burns [1] that the inequalities (3) in the dual problem are strictly equal to 0.

Therefore, we can start from any given feasible solution to the primal problem and try to find a feasible solution to the dual problem that satisfies the equations (5), (6), (7), and (8). Among equations (5) and (6), the values of $\sigma_{ij}$ and $\eta_{ij}$ are easily decided (must be equal to 0) if the given primal feasible solution satisfies $l_i - l_j - T_{ji} + P > 0$ and $l_i - l_j + t_{ij} > 0$, respectively. We define those setup and hold edges that do not satisfy either of these two conditions as admissible edges.

Definition 1: The reduced weight $w_r(i, j)$ of an edge $(i, j)$ in the constraint graph $G$ is defined as $w_r(i, j) = l_i - l_j + w(i, j)$, where $w(i, j) = P - T_{ji}$ for setup edges and $w(i, j) = t_{ij}$ for hold edges.

Definition 2: A setup edge or hold edge $(i, j)$ in the constraint graph $G(V, E)$ is called an admissible edge if its reduced weight equals 0.

Therefore, the dual variables we need to solve are actually those corresponding to the admissible edges. We denote the set of all admissible setup edges as $E_{sa}$ and the set of all admissible hold edges as $E_{ha}$. Then, our problem becomes finding a feasible solution to the following system of equations,

$$\sum_{(i,j)\in E_{sa}} \sigma_{ij} - 1 = 0 \tag{9}$$
$$\sum_{(i,j)\in E_{ha}} \eta_{ij} - \sum_{(j,i)\in E_{ha}} \eta_{ji} + \sum_{(i,j)\in E_{sa}} \sigma_{ij} - \sum_{(j,i)\in E_{sa}} \sigma_{ji} = 0, \quad \forall i \in V \tag{10}$$

A.2 solving the restricted dual problem

Instead of directly solving the above system of equations, the primal-dual algorithm creates a linear program called restricted dual (RD) which is equivalent to this system.

$$\text{RD:}\quad \min \varepsilon$$
$$\text{s.t.}\quad \sum_{(i,j)\in E_{sa}} \sigma_{ij} - 1 = \varepsilon \tag{11}$$
$$\sum_{(i,j)\in E_{ha}} \eta_{ij} - \sum_{(j,i)\in E_{ha}} \eta_{ji} + \sum_{(i,j)\in E_{sa}} \sigma_{ij} - \sum_{(j,i)\in E_{sa}} \sigma_{ji} = 0, \quad \forall i \in V \tag{12}$$
$$\sigma_{ij} \ge 0, \quad \eta_{ij} \ge 0, \quad \varepsilon \ge 0$$

This linear program is similar to the dual program except that it contains a subset of dual variables that correspond to the admissible edges. The objective function is also simplified. If we solve this restricted dual problem and obtain an optimal solution, which equals 0, then it must also be a feasible solution to the dual problem. In this case, we have found the optimal solution to the primal problem according to the complementary slackness condition.
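Definitions 1 and 2 translate almost directly into code. The following Python sketch is illustrative only: the `Edge` record, the function names, and the floating-point tolerance `eps` are assumptions introduced here and do not come from the original formulation. It computes the reduced weight of each edge, checks primal feasibility (all reduced weights non-negative), and lists the admissible edges for a given schedule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    i: int        # tail vertex of the constraint edge (i, j)
    j: int        # head vertex of the constraint edge (i, j)
    kind: str     # "setup" or "hold" (hypothetical encoding)
    delay: float  # T_ji for a setup edge, t_ij for a hold edge


def reduced_weight(edge, l, P):
    """w_r(i, j) = l_i - l_j + w(i, j), as in Definition 1."""
    w = P - edge.delay if edge.kind == "setup" else edge.delay
    return l[edge.i] - l[edge.j] + w


def is_feasible(edges, l, P, eps=1e-9):
    """A schedule is primal feasible when no reduced weight is negative."""
    return all(reduced_weight(e, l, P) >= -eps for e in edges)


def admissible_edges(edges, l, P, eps=1e-9):
    """Edges whose reduced weight is zero (Definition 2), up to eps."""
    return [e for e in edges if abs(reduced_weight(e, l, P)) <= eps]


# Two flip-flops; delays are made-up numbers for illustration.
edges = [
    Edge(0, 1, "setup", 10.0),
    Edge(1, 0, "setup", 6.0),
    Edge(1, 0, "hold", 3.0),
    Edge(0, 1, "hold", 2.0),
]
l = [0.0, 0.0]   # zero-skew starting schedule
P = 10.0         # largest setup delay, so the start is feasible
weights = [reduced_weight(e, l, P) for e in edges]
tight = admissible_edges(edges, l, P)
```

With the zero-skew start, only the setup edge carrying the largest delay is tight, which is why the admissible graph initially contains a single edge.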
The next step is to find an efficient way to solve the RD problem. This can be done by solving the dual of the RD problem. We call this problem restricted primal (RP) since it is similar to our primal problem except that the constraints are only applied for admissible edges. If the RP problem has an optimal solution, which equals 0, then according to the strong duality theorem [13], the RD problem also has a zero-value optimal solution. In this case, our problem has been solved based on the above discussions.

$$\text{RP:}\quad \max \rho$$
$$\text{s.t.}\quad d_i - d_j + \rho \le 0, \quad \forall (i, j) \in E_{sa} \tag{13}$$
$$d_i - d_j \le 0, \quad \forall (i, j) \in E_{ha} \tag{14}$$
$$\rho \le 1 \tag{15}$$

We denote the admissible graph $G_a(V_a, E_a)$ as the graph comprised of all admissible edges in the original constraint graph. Then, we have the following theorem.

Theorem 1: The RP problem has an optimal solution equal to 0 if there exists a cycle $W = \{e_{i_0,i_1}, e_{i_1,i_2}, \ldots, e_{i_k,i_0}\}$ on the admissible graph $G_a$.

Proof: For each admissible edge, the edge constraints (13) and (14) can be expressed in a uniform format, that is, $d_i - d_j + \epsilon \cdot \rho \le 0$, where $\epsilon = 1$ for setup edges and $\epsilon = 0$ for hold edges. (In the following, we will use $\epsilon$ in the same way as defined here.) For the given cycle $W$, if we add up all these constraint inequalities, we have $(d_{i_0} - d_{i_1}) + (d_{i_1} - d_{i_2}) + \ldots + (d_{i_k} - d_{i_0}) + n \cdot \rho \le 0$, where $n$ represents the number of setup edges in the cycle, which must be at least 1 (otherwise, there will be a combinational cycle in the circuit). Simplifying this inequality, we have $n \cdot \rho \le 0$. Therefore, we conclude that $\rho \le 0$. In order to maximize $\rho$, we can simply set $\rho = 0$ and all $d_i$ equal to the same value.

A.3 a general primal-dual algorithm

In this subsection, we will propose our novel interpretation of the general primal-dual framework according to Theorem 1. In Section 3-C, we will illustrate our efficient implementation, which is also rooted in this same framework. Before that, we first make several important observations.

If there is no cycle in the admissible graph $G_a$, then we can easily find a feasible solution by setting $\rho = 1$, $d_j = d_i + 1, \forall (i, j) \in E_{sa}$, and $d_j = d_i, \forall (i, j) \in E_{ha}$. This feasible solution is also optimal because $\rho \le 1$ according to constraint (15). The $d_i$ obtained in this way represents the maximum number of setup edges from any vertex in the admissible graph $G_a$ to vertex $i$. Therefore, we call it the setup distance of vertex $i$. The formal definition of setup distance is as follows.

Definition 3: For each vertex $j$ in the constraint graph, setup distance $d_j$ is defined as the maximum number of setup edges from any vertex $i$ in the admissible graph $G_a$ to it through an admissible path (path consisting of only admissible edges) $i \rightsquigarrow j$. $d_j = 0$ if no such path exists.

Based on Definition 3, setup distance $d_j$ can be recursively expressed as,

$$d_j = \max\{d_i + \epsilon, \ \forall (i, j) \in E_a\} \tag{16}$$

where $\epsilon = 1$ for setup edges and $\epsilon = 0$ for hold edges. This metric is useful because it defines how much the clock arrival time $l_i$ should change at vertex $i$ in order to maintain a feasible clock skew schedule if the clock period $P$ is reduced by θ.

Lemma 1: If the clock period $P$ is reduced by a sufficiently small amount θ, reducing clock arrival time $l_i$ for vertex $i$ in the admissible graph $G_a$ by $\theta \cdot d_i$ will maintain a feasible clock skew schedule.

Proof: We require the θ value to be sufficiently small such that no inadmissible edge becomes admissible after reducing the clock period $P$ by θ. In such cases, only originally admissible edges can violate the timing constraint, that is, their reduced weights $w_r(i, j)$ become negative. Given an admissible edge $(i, j)$, its reduced weight becomes $w_r^*(i, j) = l_i - l_j + (P - \theta) - T_{ji}$ (or $w_r^*(i, j) = l_i - l_j + t_{ij}$) if it is a setup (hold) edge after we reduce $P$. Obviously, for setup edges, $w_r^*(i, j) < 0$ since $l_i - l_j + P - T_{ji} = w_r(i, j) = 0$ according to Definition 2. If we reduce the clock arrival times of $i$ and $j$ by $\theta \cdot d_i$ and $\theta \cdot d_j$, respectively, then we have,

$$w_r^*(i, j) = l_i - l_j + P - T_{ji} - \theta(d_i - d_j + 1) \tag{17}$$
$$w_r^*(i, j) = l_i - l_j + t_{ij} - \theta(d_i - d_j) \tag{18}$$

According to Definition 3 and Equation (16), $d_i - d_j + 1 \le 0$ and $d_i - d_j \le 0$. Therefore we have $w_r^*(i, j) \ge 0$, which indicates that the timing constraints are still satisfied after we reduce $P$ by θ and $l_i$ by $\theta \cdot d_i$.

Theorem 1 motivates the basic idea of the primal-dual algorithm. If there is no cycle in $G_a$, it indicates that the current clock period $P$ is not minimized. Then, we reduce $P$ by a certain amount θ. The effect of reducing $P$ is that we reduce the reduced weights $w_r(i, j)$ of edges in the constraint graph $G$, hence, we introduce new admissible edges (see Definition 2) into $G_a$. As long as θ is sufficiently small such that only one edge becomes admissible after reducing the clock period $P$ by θ, we are guaranteed to capture the moment when the first cycle in $G_a$ occurs. The clock period corresponding to this instant is the minimum clock period. The outline of the primal-dual algorithm for solving the clock skew scheduling problem can be illustrated as in Algorithm 1.

The most important step of the primal-dual algorithm is line 4, which calculates the appropriate θ value. For a given edge $(i, j)$ in the constraint graph $G$, in order to make
it admissible, we require that it satisfy Definition 2,

$$\hat{l}_i - \hat{l}_j - T_{ji} + (P - \theta) = 0, \quad \text{if } (i, j) \in E_s \tag{19}$$
$$\hat{l}_i - \hat{l}_j + t_{ij} = 0, \quad \text{if } (i, j) \in E_h \tag{20}$$

where $\hat{l}_i$ represents the clock arrival time at vertex $i$ in $G$ after the clock period $P$ is reduced by θ. Lemma 1 provides the way of calculating $\hat{l}_i$, that is,

$$\hat{l}_i = l_i - \theta \cdot d_i. \tag{21}$$

Substituting Equation (21) into (19) and (20) and rearranging the formula, we obtain the equations that can be used to calculate the θ value for any edge $(i, j)$ in $G$.

$$\theta = \frac{l_i - l_j - T_{ji} + P}{d_i - d_j + 1}, \quad \text{if } (i, j) \in E_s \wedge d_i - d_j + 1 > 0 \tag{22}$$
$$\theta = \frac{l_i - l_j + t_{ij}}{d_i - d_j}, \quad \text{if } (i, j) \in E_h \wedge d_i - d_j > 0 \tag{23}$$

A setup (hold) edge $(i, j)$ cannot become admissible if $d_i - d_j + 1 \le 0$ ($d_i - d_j \le 0$). In such cases, we set the θ values to be infinite.

Algorithm 1 Primal-Dual
1: P = max{Tij , (i, j) ∈ Es }
2: create an empty admissible graph Ga
3: while Ga is acyclic do
4:   find θ ≥ 0 s.t. only one edge (i, j) becomes admissible
5:   P = P − θ
6:   add new admissible edge (i, j) to Ga
7: end while

Different ways of implementing the subroutine on line 4 of Algorithm 1 result in different runtime efficiency. The naive approach is that we calculate θ values for every edge in the constraint graph $G$ in each iteration and choose the smallest θ to be deducted from the clock period $P$. The new admissible edge corresponding to the smallest θ is then inserted into the admissible graph $G_a$. Choosing the smallest θ guarantees that exactly one new admissible edge is generated (when more than one edge attains the smallest θ value, we can add one edge each time). This naive approach is basically how Burns’ algorithm [1] implements the primal-dual algorithm. We will show that there are many redundant computations in this implementation. Then, we propose our more efficient implementation of the primal-dual algorithm in subsection 3-C.

B. Burns’ implementation

We outline Burns’ algorithm [1] to solve the clock skew scheduling problem in Algorithm 2.

Algorithm 2 Burns’ algorithm for clock skew scheduling
1: P = max{Tij , (i, j) ∈ Es }
2: for all i ∈ V do
3:   li = 0; di = 0
4: end for
5: while true do
6:   create an empty admissible graph Ga
7:   for all (i, j) ∈ E do
8:     if (i, j) is admissible then
9:       add (i, j) into Ga
10:    end if
11:  end for
12:  if Ga is cyclic then
13:    break;
14:  end if
15:  topological sort Ga and update di for i ∈ Va
16:  for all (i, j) ∈ E do
17:    calculate θ(i, j) by Equation (22) and (23)
18:  end for
19:  θm = min{θ(i, j), (i, j) ∈ E}
20:  P = P − θm
21:  update li , ∀i ∈ V by Equation (21)
22: end while

Burns’ algorithm follows the primal-dual algorithm described in Algorithm 1 except that it creates a new admissible graph and recalculates admissible edges each iteration. The strategy it utilizes to update θ values is illustrated on lines 16-18. It simply recalculates the θ for all edges in the constraint graph $G$ for each iteration. This becomes the major runtime efficiency problem of Burns’ algorithm. We will show in the next subsection that only a small portion of the edges need to be updated for θ values each iteration. Another runtime efficiency problem lies in creating a new admissible graph $G_a$ each round; this is also improved in our proposed implementation in subsection 3-C. Before that, we first use a simple example to demonstrate in detail how Burns’ algorithm works. In Figure 2, we have a simple constraint graph comprised of 4 vertices. The solid lines represent the setup edges and the dashed lines represent the hold edges. Each vertex has a pair of numbers $(l_i, d_i)$, which are the clock arrival time and setup distance of vertex $i$. Similarly, each edge also has a pair of numbers $(t_{ij}, \theta_{ij})$, which represent the maximum or minimum delay between two flip-flops and the θ values, respectively.

In each iteration, after we update the θ value for every edge, we identify one edge with the minimum θ, which is highlighted in the figure. The corresponding edge becomes admissible at the beginning of the next iteration (also highlighted by bold solid lines). In our simple example, only one edge becomes admissible in each iteration. Therefore, it is obvious that Burns’ implementation is very inefficient.
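To make the control flow of Algorithm 2 concrete, the following Python sketch follows the same recompute-everything strategy: rebuild the admissible graph, topologically sort it to obtain setup distances via Equation (16), evaluate Equations (22) and (23) for every edge, and apply Equation (21). It is a minimal illustration under stated assumptions, not Burns’ original code: the function names, the tuple layout `(i, j, eps, delay)`, and the use of exact `Fraction` arithmetic (so that the admissibility test can be an exact comparison with zero) are choices made here.

```python
from collections import deque
from fractions import Fraction


def reduced_weight(edge, l, P):
    """w_r(i, j) = l_i - l_j + w(i, j); eps is 1 for setup, 0 for hold."""
    i, j, eps, delay = edge
    return l[i] - l[j] + (P - delay if eps else delay)


def setup_distances(n, admissible):
    """Topologically sort G_a and apply Equation (16).

    Returns the list of setup distances, or None if G_a has a cycle.
    """
    out = [[] for _ in range(n)]
    indegree = [0] * n
    for i, j, eps, _ in admissible:
        out[i].append((j, eps))
        indegree[j] += 1
    d = [0] * n
    queue = deque(v for v in range(n) if indegree[v] == 0)
    visited = 0
    while queue:
        i = queue.popleft()
        visited += 1
        for j, eps in out[i]:
            d[j] = max(d[j], d[i] + eps)
            indegree[j] -= 1
            if indegree[j] == 0:
                queue.append(j)
    return d if visited == n else None


def naive_mpcss(n, setup_edges, hold_edges):
    """setup_edges: (i, j, T_ji) triples; hold_edges: (i, j, t_ij) triples."""
    edges = [(i, j, 1, Fraction(T)) for i, j, T in setup_edges]
    edges += [(i, j, 0, Fraction(t)) for i, j, t in hold_edges]
    P = max(delay for _, _, eps, delay in edges if eps)
    l = [Fraction(0)] * n
    while True:
        # Rebuild the admissible graph from scratch in every iteration.
        admissible = [e for e in edges if reduced_weight(e, l, P) == 0]
        d = setup_distances(n, admissible)
        if d is None:          # cycle detected: P is the minimum period
            return P, l
        # Recompute theta for every edge whose denominator is positive.
        thetas = [
            reduced_weight(e, l, P) / (d[e[0]] - d[e[1]] + e[2])
            for e in edges
            if d[e[0]] - d[e[1]] + e[2] > 0
        ]
        if not thetas:
            raise ValueError("no edge can become admissible")
        theta = min(thetas)
        P -= theta
        l = [l[v] - theta * d[v] for v in range(n)]   # Equation (21)


# Two flip-flops with made-up delays: setup 10 and 6, hold 3 and 2.
P_min, schedule = naive_mpcss(
    2, [(0, 1, 10), (1, 0, 6)], [(1, 0, 3), (0, 1, 2)]
)
```

On this two-vertex example the loop runs twice: the first pass lowers P from 10 to 8 and shifts the arrival time of vertex 1 by 2, after which both setup edges are tight and form a cycle. The result agrees with the cycle bound (10 + 6) / 2 = 8. Every pass touches all edges, which is the redundancy discussed in this subsection.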
Fig. 2. An example of using Burns’ algorithm to solve the clock skew scheduling on a simple constraint graph. Panels: Iteration 0: P = 10; Iteration 1: P = 8.5; Iteration 2: P = 8.33; Iteration 3: P = 7.5; Iteration 4: Cycle Detected, P = 7.5.

Burns’ implementation actually recalculates the reduced weights of all edges at the beginning of a new iteration (line 8 in Algorithm 2). Burns’ algorithm also recalculates θ values for all edges at each iteration (line 17 in Algorithm 2). However, from our example in Figure 2 we can tell that not all θ values change for each iteration. In the following, we propose a systematic way to update the admissible graph $G_a$ and θ values, such that the runtime efficiency can be significantly improved.

C. An enhanced implementation

In this subsection, we will first propose the algorithm to maintain one admissible graph $G_a$ and update setup distances for vertices in $G_a$. Then, we describe how to update the θ values efficiently.

C.1 maintaining the admissible graph

in it. In other words, we have a k-edge forest. We will show that after we add a new admissible edge $(i, j)$, $G_a$ is still a forest. There are four cases depending on whether vertices $i$ and $j$ are in forest $G_a$ before adding edge $(i, j)$.

1. $i \notin G_a$, $j \notin G_a$
In this case, we generate a new tree rooted at $i$, which has only one child $j$. The resulting $G_a$ is still a forest;

2. $i \in G_a$, $j \notin G_a$
In this case, we add a child $j$ to one vertex $i$ in a tree, which is still a tree. Thus the resulting $G_a$ is still a forest;

3. $i \notin G_a$, $j \in G_a$
This case cannot occur because if $i \notin G_a$, $d_i = 0$; on the other hand, since $j \in G_a$, we have $d_j \ge 1$. Therefore, we have $d_i - d_j + \epsilon \le 0$. In such a case, $\theta(i, j) = \infty$ according to Equations (22) and (23). Therefore, edge $(i, j)$ cannot be an admissible edge.

4. $i \in G_a$, $j \in G_a$
In this case, we will prove that the incoming edge $(w, j) \in E_a$ of vertex $j$ must become inadmissible; therefore, we can safely remove this edge and still maintain a forest. First we observe from Equations (22) and (23) that in order for edge $(i, j)$ to become the minimum θ edge, it must satisfy $d_i - d_j + \epsilon(i, j) > 0$. Therefore, we have $d_i + \epsilon(i, j) > d_j$. On the other hand, since $(w, j) \in E_a$, we have $d_w + \epsilon(w, j) = d_j$. The reason is based on the definition of setup distance $d_j$ and our inductive assumption that $G_a$ is a forest. The new setup distance $d_j$ is $\max\{d_i + \epsilon(i, j), d_w + \epsilon(w, j)\}$, which is $d_i + \epsilon(i, j)$. This new $d_j$ value makes edge $(w, j)$ inadmissible because its new reduced weight $w_r(w, j) = w(w, j) - \theta(d_w - d_j + \epsilon(w, j))$ (see Equations (17) and (18)) becomes positive. By removing the incoming edge $(w, j)$ and inserting a new admissible edge $(i, j)$, $G_a$ is still maintained as a forest.

Theorem 2 provides us an easy way to incrementally update the admissible graph instead of generating a new graph for each iteration. The complete algorithm is shown in Algorithm 3. For each vertex $i$ in the admissible graph
• they use the same graph manipulating subroutines,
• they use the same subroutine to calculate θ values.

Burns’ algorithm is coded based on the pseudo-code given in Algorithm [1]. Our proposed algorithm is coded in the manner described in Algorithm 5. We ran both algorithms 5 times. The final execution times are calculated by taking the average of these sample data. The final runtime results are listed in Table I.

In Table I, the first and second columns are the number of vertices and number of edges in the constraint graph $G$. The next two columns represent the original clock period $P^*$ and the clock period $P$ after clock skew scheduling. The CPU time is measured in seconds for both algorithms. We also list the ratio of the execution times comparing Burns’ algorithm with our algorithm.

The benchmark circuits we use for clock skew scheduling are from two widely used sets, ISCAS89 and ITC99. We take the largest circuits from both sets because both algorithms are extremely fast (less than one second) for small circuits with fewer than 10k gates. In order to obtain more realistic results, we calculate the gate delay in the following way. We first re-synthesize the circuits (either in Verilog or VHDL format) with Synopsys Design Compiler. Then, we export the synthesized netlist along with the Standard Delay Format (SDF) file from Design Compiler. In the SDF file, the delay of each gate in the netlist is provided. Therefore, we can construct a constraint graph $G$ by calculating the shortest and longest path accumulated delay between any two flip-flops in the circuit. The total runtime of this step is $O(|V|^2)$.

It can be observed from Table I that on average our algorithm is 95 times faster than Burns’ implementation, even though both of them follow the same primal-dual idea. Such significant improvement in runtime is consistent with the fact that the proposed algorithm is superior to Burns’ algorithm in terms of the asymptotic time complexity. The major improvement of the asymptotic runtime is due to the utilization of the heap data structure in the proposed algorithm. Our algorithm has an asymptotic runtime of $O(|V||E| + |V|^2 \log|V|)$, which is improved from $O(|V|^2|E|)$ of Burns’ implementation. This is quite similar to the fact that Dijkstra’s algorithm improves runtime to $O(|E| + |V| \log|V|)$ from Bellman-Ford’s shortest path algorithm, which has $O(|V||E|)$ complexity. The only difference is that in our problem, we have one more hierarchy that needs to be iterated $O(|V|)$ times. In practice, the runtime improvement is more significant. In the best case, we observe a 189 times (circuit b22_1_opt) speedup by the proposed algorithm.
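For readers less familiar with the comparison drawn above, the heap-based pattern in Dijkstra's algorithm can be sketched in a few lines of Python. This is a generic textbook illustration, not the paper's Algorithm 5. It uses the binary heap from Python's standard `heapq` module, so its bound is $O((|V| + |E|) \log|V|)$ rather than the $O(|E| + |V| \log|V|)$ figure quoted above, which requires a Fibonacci heap. The adjacency-dictionary layout and the function name are assumptions made for this example.

```python
import heapq


def dijkstra(adj, source):
    """Shortest distances from source; adj maps u -> list of (v, weight >= 0)."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        du, u = heapq.heappop(heap)
        if du > dist.get(u, float("inf")):
            continue  # stale heap entry; a shorter path to u was already found
        for v, w in adj.get(u, []):
            nd = du + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist


graph = {"a": [("b", 4), ("c", 1)], "c": [("b", 2)], "b": []}
distances = dijkstra(graph, "a")
```

The key point mirrored in the proposed algorithm is that the heap returns the next minimum candidate without rescanning every edge, whereas a Bellman-Ford style loop revisits all edges in each pass.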
V. Conclusions
In this paper, we revisit the primal-dual based algorithm for solving the minimum clock period clock skew scheduling problem. We propose a much more efficient primal-dual based algorithm to improve the runtime efficiency of Burns’ implementation of the primal-dual algorithm. The proposed algorithm is superior to Burns’ algorithm in both theoretical asymptotic time complexity and practical measured runtime. Our experimental results show that our proposed algorithm is on average 95 times faster than Burns’ algorithm. Even for the largest circuits in our benchmarks, our algorithm takes less than one second to complete. This enables faster algorithms that take the MPCSS as a subroutine within their inner iteration loops.
References
[1] S. M. Burns. Performance analysis and optimization of asynchronous circuits. PhD thesis, California Institute of Technology, Computer Science Department, 1991.
[2] R. B. Deokar and S. S. Sapatnekar. A graph-theoretic approach to clock skew optimization. In Proc. Intl. Symposium on Circuits and Systems, 1994.
[3] J. P. Fishburn. Clock skew optimization. IEEE Transactions on Computers, 39(7), July 1990.
[4] A. V. Goldberg. A heuristic improvement of the Bellman-Ford algorithm. Applied Mathematics Letters, 1993.
[5] S. Huang, C. Cheng, Y. Nieh, and W. Yu. Register binding for clock period minimization. In Proc. of the Design Automation Conf., 2006.
[6] I. S. Kourtev and E. G. Friedman. Timing Optimization Through Clock Skew Scheduling. Kluwer Academic Publishers, 2000.
[7] E. L. Lawler. Combinatorial Optimization: Networks and Matroids. Holt, Rinehart and Winston, 1976.
[8] L. Liu, T. Chou, A. Aziz, and D. F. Wong. Zero-skew clock tree construction by simultaneous routing, wire sizing and buffer insertion. In International Symposium on Physical Design, 2000.
[9] Xun Liu, Marios C. Papaefthymiou, and Eby G. Friedman. Maximizing performance by retiming and clock skew scheduling. In DAC, pages 231–236, 1999.
[10] C. H. Papadimitriou and K. Steiglitz. Combinatorial Optimization: Algorithms and Complexity. Dover Publications, 1998.
[11] K. Ravindran, A. Kuehlmann, and E. Sentovich. Multi-domain clock skew scheduling. In Proc. Intl. Conf. on Computer-Aided Design, 2003.
[12] J. Tsai, T. Chen, and C. C. Chen. Optimal minimum-delay/area zero skew clock tree wire-sizing in pseudo-polynomial time. In International Symposium on Physical Design, 2003.
[13] V. Chvátal. Linear Programming. W. H. Freeman, 1983.
[14] Dimitrios Velenis, Kevin T. Tang, Ivan S. Kourtev, V. Adler, F. Baez, and Eby G. Friedman. Demonstration of speed and power enhancements on an industrial circuit through application of clock skew scheduling. Journal of Circuits, Systems, and Computers, 11(3):231–246, 2002.