A Revisit to the Primal-Dual Based Clock Skew Scheduling Algorithm

Min Ni, Seda Ogrenci Memik


Synopsys Inc., Mountain View, CA, USA
Northwestern University, Evanston, IL, USA
E-mail: mni166@ece.northwestern.edu

Abstract: Clock skew scheduling is a useful sequential circuit optimization method. The runtime efficiency of this problem becomes crucial if it must be solved repeatedly inside a higher-level optimization. The widely recognized Burns' algorithm proposed to solve this problem suffers from high runtime complexity, which makes it unsuitable for deployment in iterative optimization loops. This algorithm is based on the general concept of primal-dual optimization. In this paper, we demonstrate that a more efficient approach to the clock skew scheduling problem can be developed by designing a new algorithm using the same primal-dual optimization concept. The basic idea of the algorithm is to avoid creating a new admissible graph and recalculating $\theta$ values in each iteration of the primal-dual optimization. The asymptotic runtime of our algorithm is $O(|V||E| + |V|^2 \log|V|)$, which is improved from $O(|V|^2|E|)$ thanks to the heap data structure used in our proposed algorithm. The experimental results show that our algorithm is on average 95 times faster than Burns' implementation. In the best case, we observe a speedup of as much as 189 times.

Keywords: Clock Skew Scheduling, Primal-Dual, Optimization, Sequential Circuit

I. Introduction

Clock skews are the differences in clock arrival times at different registers due to the variation in interconnect delays in the clock distribution network. On one hand, this has been viewed as a design fault, and efforts have been directed towards constructing zero skew clock networks [8]. On the other hand, the concept of clock skew scheduling views clock skews as a manageable resource rather than a liability by assigning a certain clock arrival time to each register [3], [2], [11]. The ability of clock skew scheduling to improve circuit performance and power consumption has been studied in the literature [14].

The clock period can be minimized if we carefully assign a clock arrival time $l_i$ to each register. The minimum clock period clock skew scheduling (MPCSS) problem is to find a suitable assignment to achieve this goal. It can be formulated as a linear program [3] and solved based on either the combination of the shortest path algorithm [4] and binary search [2], or the primal-dual optimization concept [1]. The widely popular implementation in current literature performs reasonably for a single iteration, i.e., several seconds for a moderate size circuit. However, the runtime can become as much as several hundred seconds when the circuit size increases. Furthermore, we may need to solve the MPCSS problem repeatedly for thousands of iterations as a subroutine in certain applications. In that case, the runtime efficiency becomes extremely crucial. For example, during multi-domain clock skew scheduling, it has been reported that the MPCSS subroutine needed to be repeated as many as 2327 times [11], which takes as many as 20 hours. Consider another related problem, namely, the register binding problem that considers minimizing the clock period via clock skew scheduling [5]. The time complexity of the algorithm proposed to solve this problem is $O(|V||E|L_{css})$, where $L_{css}$ represents the runtime of the MPCSS problem. This shows clearly that the overall efficiency of the register binding problem directly depends on the efficiency of the local MPCSS algorithm.

One obvious way of improving the runtime efficiency of these techniques is to speed up the clock skew scheduling algorithm, since it is called as a major subroutine within each iteration. This motivates us to revisit the MPCSS problem. We started by studying the widely used algorithm by Burns [1]. Burns' algorithm is based on the general primal-dual optimization concept, which is used to solve linear programs (LPs). Developing an efficient implementation of the primal-dual algorithm now becomes the key issue. In order to address this issue, we made the following important observation. There are existing examples which efficiently implement the primal-dual method and yield fast algorithms, such as Dijkstra's shortest path algorithm [10]. We developed an efficient primal-dual based clock skew scheduling algorithm inspired by Dijkstra's algorithm. Dijkstra's algorithm uses a heap data structure to organize the elements involved in the shortest path problem. We employed the same heap structure to reduce the computational complexity of the primal-dual optimization algorithm. In addition, we developed further enhancements that are specific to the MPCSS problem. For instance, various computation steps which are repeated redundantly in Burns' implementation have been significantly simplified through incremental update mechanisms. As a result, the asymptotic time complexity of our proposed algorithm is $O(|V||E| + |V|^2 \log|V|)$, which is improved from Burns' complexity of $O(|V|^2|E|)$. Our experimental results show that

978-1-4244-6455-5/10/$26.00 ©2010 IEEE 755 11th Int'l Symposium on Quality Electronic Design
in practice, our implementation can speed up the MPCSS algorithm by 95 times on average.

The remainder of this paper is organized as follows. In Section 3, we first formally present the primal-dual algorithm. Then we show why Burns' algorithm is not efficient. Our methods of improving runtime efficiency and the Dijkstra-style implementation of the primal-dual algorithm are proposed afterward. The experimental results are presented in Section 4, followed by the conclusions in Section 5.

II. Related Work

The MPCSS problem was first proposed by Fishburn [3]. A linear programming formulation was presented in that work. By recognizing the equivalent graph representation of the MPCSS problem, Deokar et al. proposed to solve the problem by iteratively solving the embedded shortest path problem [2]. At a higher level of the algorithm, binary search was implemented to predict the possible clock period for the next iteration. A similar approach was also proposed earlier [7]. The asymptotic time complexity of this algorithm is $O(|V||E| \log \frac{C}{n})$, where $|V||E|$ is the complexity of the Bellman-Ford shortest path algorithm, $C$ is the initial range of the clock period, and $n$ is the required accuracy of the clock period. In practice, the binary search based algorithm can be efficient only when the Bellman-Ford shortest path algorithm is carefully implemented with heuristics as proposed in the literature [4].

In general, the MPCSS problem is a linear programming problem with some special properties that faster algorithms can take advantage of. Therefore, any strategy for solving linear programs can also be used for the MPCSS problem. The primal-dual algorithm is one of these approaches [10]. The widely popular Burns' algorithm for solving the MPCSS problem is based on the primal-dual concept [1]. Unlike the binary search strategy, where the clock period is blindly guessed in each iteration, the primal-dual based approach calculates the clock period in a more guided way. This usually leads to much faster convergence of the clock period in practice. The asymptotic time complexity of Burns' algorithm is $O(|V|^2|E|)$, which no longer depends on the accuracy of the solution. In other words, Burns' algorithm solves the MPCSS problem exactly, while the binary search based approach is only a heuristic algorithm.

In this paper, we revisit the primal-dual based algorithm for solving the MPCSS problem. We carefully analyze each step in the primal-dual strategy and propose a much more efficient primal-dual based algorithm than Burns' algorithm.

III. The Primal-Dual Algorithm

Given a circuit graph $G_c(V_c, E_c)$, we can generate a constraint graph $G(V, E)$ by finding the longest and shortest paths between all pairs of flip-flops in the circuit graph in $O(|V_c|^2)$ time, as illustrated in Figure 1. There is a hold edge $(i, j)$ (forward dashed line in Figure 1) and a setup edge $(j, i)$ (backward solid line in Figure 1) between two vertices $i$ and $j$ in $G$ if and only if there exists a direct path (without other DFFs on the path) from $i$ to $j$ in the circuit graph $G_c$. We define $T_{ij}$ as the longest path delay from vertex $i$ to $j$, $t_{ij}$ as the shortest path delay from vertex $i$ to $j$, $P$ as the clock period of the circuit, $E_s$ as the set of all setup edges in $E$, and $E_h$ as the set of all hold edges in $E$.

Fig. 1. An example of generating the constraint graph (shown on the right) from the circuit graph (shown on the left). Labels on gates represent the gate delays. Labels on the forward (backward) edges in the constraint graph represent the shortest (longest) path delays.

Clock skew scheduling assigns each flip-flop a specific clock arrival time $l_i$ such that the timing constraints between any pair of flip-flops are not violated and the clock period is minimized. This problem can be formulated as the following linear program:

$$\text{Primal}: \quad \min P$$
$$\text{s.t.} \quad l_i - l_j \ge T_{ji} - P, \quad \forall (i, j) \in E_s \tag{1}$$
$$l_i - l_j \ge -t_{ij}, \quad \forall (i, j) \in E_h \tag{2}$$

where inequality (1) represents the setup time constraint between flip-flops $i$ and $j$ and inequality (2) represents the hold time constraint. Here we have already added the setup and hold times into the corresponding path delays.

This linear program can be solved directly by binary search with efficient negative cycle detection [7]. On the other hand, the primal-dual algorithm takes an indirect approach. The principle of the primal-dual algorithm is based on the complementary slackness theorem [13]. It starts from a feasible solution of the primal problem and then tries to find a feasible solution to its dual. If these two feasible solutions satisfy the complementary slackness condition, they must be optimal. We will elaborate on the details of the basic primal-dual algorithm in Section 3-A. In Section 3-A.1, we first show that finding a dual feasible solution is equivalent to solving a system of equations defined by the complementary slackness conditions. In Section 3-A.2 we show that directly solving these equations is avoided by formulating an equivalent linear program. This linear program can be easily solved by cycle detection. In
Section 3-A.3, we propose a general primal-dual algorithm and analyze the key steps that determine the runtime efficiency of the primal-dual algorithm.

A. An overview of the primal-dual algorithm

In this section, we describe the fundamental steps involved in the primal-dual algorithm. This discussion is necessary for two reasons. First, we will use it to illustrate the causes of inefficiencies in Burns' implementation. Second, it helps to understand how our algorithm can address these problems. The framework of using primal-dual optimization to solve the MPCSS problem is proposed in Section 3-A.3.

A.1 finding a dual feasible solution

The dual problem of this primal linear program can be expressed as follows,

$$\text{Dual}: \quad \max \sum_{(i,j)\in E_s} T_{ji}\,\sigma_{ij} - \sum_{(i,j)\in E_h} t_{ij}\,\eta_{ij}$$
$$\text{s.t.} \quad \sum_{(i,j)\in E_h} \eta_{ij} - \sum_{(j,i)\in E_h} \eta_{ji} + \sum_{(i,j)\in E_s} \sigma_{ij} - \sum_{(j,i)\in E_s} \sigma_{ji} \le 0, \quad \forall i \in V \tag{3}$$
$$\sum_{(i,j)\in E_s} \sigma_{ij} \le 1 \tag{4}$$
$$\sigma_{ij} \ge 0, \quad \eta_{ij} \ge 0$$

where $\sigma_{ij}$ represents the dual variables corresponding to all setup constraints while $\eta_{ij}$ represents the dual variables corresponding to all hold constraints.

According to the complementary slackness condition [13], a feasible solution $\{l_1, \ldots, l_n, P\}$ of the primal problem and a feasible solution $\{\sigma_{ij}, (i, j) \in E_s\} \cup \{\eta_{ij}, (i, j) \in E_h\}$ of the dual problem are both optimal if and only if they satisfy the following equations,

$$\sigma_{ij} \cdot (l_i - l_j - T_{ji} + P) = 0, \quad \forall (i, j) \in E_s \tag{5}$$
$$\eta_{ij} \cdot (l_i - l_j + t_{ij}) = 0, \quad \forall (i, j) \in E_h \tag{6}$$
$$P \cdot \Big( \sum_{(i,j)\in E_s} \sigma_{ij} - 1 \Big) = 0 \tag{7}$$
$$\sum_{(i,j)\in E_h} \eta_{ij} - \sum_{(j,i)\in E_h} \eta_{ji} + \sum_{(i,j)\in E_s} \sigma_{ij} - \sum_{(j,i)\in E_s} \sigma_{ji} = 0, \quad \forall i \in V \tag{8}$$

In equation (8), we do not multiply the primal variables $l_i$ with the dual constraints, since it has been proven by Burns [1] that the inequalities (3) in the dual problem are strictly equal to 0.

Therefore, we can start from any given feasible solution to the primal problem and try to find a feasible solution to the dual problem that satisfies equations (5), (6), (7), and (8). In equations (5) and (6), the values of $\sigma_{ij}$ and $\eta_{ij}$ are easily decided (they must be equal to 0) if the given primal feasible solution satisfies $l_i - l_j - T_{ji} + P > 0$ and $l_i - l_j + t_{ij} > 0$, respectively. We define those setup and hold edges that do not satisfy either of these two conditions as admissible edges.

Definition 1: The reduced weight $w_r(i, j)$ of an edge $(i, j)$ in the constraint graph $G$ is defined as $w_r(i, j) = l_i - l_j + w(i, j)$, where $w(i, j) = P - T_{ji}$ for setup edges and $w(i, j) = t_{ij}$ for hold edges.

Definition 2: A setup edge or hold edge $(i, j)$ in the constraint graph $G(V, E)$ is called an admissible edge if its reduced weight equals 0.

Therefore, the dual variables we need to solve for are actually those corresponding to the admissible edges. We denote the set of all admissible setup edges as $E_{sa}$ and the set of all admissible hold edges as $E_{ha}$. Then, our problem becomes finding a feasible solution to the following system of equations,

$$\sum_{(i,j)\in E_{sa}} \sigma_{ij} - 1 = 0 \tag{9}$$
$$\sum_{(i,j)\in E_{ha}} \eta_{ij} - \sum_{(j,i)\in E_{ha}} \eta_{ji} + \sum_{(i,j)\in E_{sa}} \sigma_{ij} - \sum_{(j,i)\in E_{sa}} \sigma_{ji} = 0, \quad \forall i \in V \tag{10}$$

A.2 solving the restricted dual problem

Instead of directly solving the above system of equations, the primal-dual algorithm creates a linear program called the restricted dual (RD), which is equivalent to this system.

$$\text{RD}: \quad \min \varepsilon$$
$$\text{s.t.} \quad \sum_{(i,j)\in E_{sa}} \sigma_{ij} - 1 = \varepsilon \tag{11}$$
$$\sum_{(i,j)\in E_{ha}} \eta_{ij} - \sum_{(j,i)\in E_{ha}} \eta_{ji} + \sum_{(i,j)\in E_{sa}} \sigma_{ij} - \sum_{(j,i)\in E_{sa}} \sigma_{ji} = 0, \quad \forall i \in V \tag{12}$$
$$\sigma_{ij} \ge 0, \quad \eta_{ij} \ge 0, \quad \varepsilon \ge 0$$

This linear program is similar to the dual program except that it contains only the subset of dual variables that correspond to the admissible edges. The objective function is also simplified. If we solve this restricted dual problem and obtain an optimal solution whose value equals 0, then it must also be a feasible solution to the dual problem. In this case, we have found the optimal solution to the primal problem according to the complementary slackness condition.
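Definitions 1 and 2 can be made concrete with a short sketch. The Python below is only an illustration under stated assumptions, not the authors' implementation: the edge tuple layout `(i, j, kind, delay)`, the function names `reduced_weight` and `admissible_edges`, and the floating-point tolerance `tol` are all hypothetical choices. For a setup edge $(i, j)$ the stored delay is taken to be $T_{ji}$, and for a hold edge it is $t_{ij}$.

```python
from math import isclose

def reduced_weight(edge, l, P):
    """Reduced weight w_r(i, j) = l_i - l_j + w(i, j), as in Definition 1."""
    i, j, kind, delay = edge
    # Setup edge (i, j): w = P - T_ji.  Hold edge (i, j): w = t_ij.
    w = P - delay if kind == "setup" else delay
    return l[i] - l[j] + w

def admissible_edges(edges, l, P, tol=1e-9):
    """Edges whose reduced weight is zero (Definition 2).

    The tolerance is an assumption added for floating-point arithmetic.
    """
    return [e for e in edges
            if isclose(reduced_weight(e, l, P), 0.0, abs_tol=tol)]

# Hypothetical two-register example: one setup edge and one hold edge.
edges = [
    (0, 1, "setup", 10.0),  # T_10 = 10
    (1, 0, "hold", 2.0),    # t_10 = 2
]
l = {0: 0.0, 1: 0.0}        # all clock arrival times start at zero
print(admissible_edges(edges, l, P=10.0))
```

With $P$ equal to the largest setup delay and all arrival times at zero, only the setup edge with that largest delay has zero reduced weight, so it is the only admissible edge in this example.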
The next step is to find an efficient way to solve the RD problem. This can be done by solving the dual of the RD problem. We call this problem the restricted primal (RP), since it is similar to our primal problem except that the constraints are only applied to admissible edges. If the RP problem has an optimal solution whose value equals 0, then according to the strong duality theorem [13], the RD problem also has a zero-value optimal solution. In this case, our problem has been solved based on the above discussion.

$$\text{RP}: \quad \max \rho$$
$$\text{s.t.} \quad d_i - d_j + \rho \le 0, \quad \forall (i, j) \in E_{sa} \tag{13}$$
$$d_i - d_j \le 0, \quad \forall (i, j) \in E_{ha} \tag{14}$$
$$\rho \le 1 \tag{15}$$

We denote by $G_a(V_a, E_a)$ the admissible graph, which is comprised of all admissible edges in the original constraint graph. Then, we have the following theorem.

Theorem 1: The RP problem has an optimal solution equal to 0 if there exists a cycle $W = \{e_{i_0,i_1}, e_{i_1,i_2}, \ldots, e_{i_k,i_0}\}$ on the admissible graph $G_a$.

Proof: For each admissible edge, the edge constraints (13) and (14) can be expressed in a uniform format, that is, $d_i - d_j + \epsilon\rho \le 0$, where $\epsilon = 1$ for setup edges and $\epsilon = 0$ for hold edges. (In the following, we will use $\epsilon$ in the same way as defined here.) For the given cycle $W$, if we add up all these constraint inequalities, we have $(d_{i_0} - d_{i_1}) + (d_{i_1} - d_{i_2}) + \ldots + (d_{i_k} - d_{i_0}) + n \cdot \rho \le 0$, where $n$ represents the number of setup edges in the cycle, which must be larger than 1 (otherwise, there will be a combinational cycle in the circuit). Simplifying this inequality, we have $n \cdot \rho \le 0$. Therefore, we conclude that $\rho \le 0$. In order to maximize $\rho$, we can simply set $\rho = 0$ and all $d_i$ equal to the same value.

A.3 a general primal-dual algorithm

In this subsection, we propose our novel interpretation of the general primal-dual framework according to Theorem 1. In Section 3-C, we will illustrate our efficient implementation, which is also rooted in this same framework. Before that, we first make several important observations.

If there is no cycle in the admissible graph $G_a$, then we can easily find a feasible solution by setting $\rho = 1$, $d_j = d_i + 1$, $\forall (i, j) \in E_{sa}$, and $d_j = d_i$, $\forall (i, j) \in E_{ha}$, which satisfies constraints (13) and (14). This feasible solution is also optimal because $\rho \le 1$ according to constraint (15). The $d_i$ obtained in this way represents the maximum number of setup edges from any vertex in the admissible graph $G_a$ to vertex $i$. Therefore, we call it the setup distance of vertex $i$. The formal definition of setup distance is as follows.

Definition 3: For each vertex $j$ in the constraint graph, the setup distance $d_j$ is defined as the maximum number of setup edges from any vertex $i$ in the admissible graph $G_a$ to it through an admissible path (a path consisting of only admissible edges) $i \leadsto j$. $d_j = 0$ if no such path exists.

Based on Definition 3, the setup distance $d_j$ can be recursively expressed as,

$$d_j = \max\{d_i + \epsilon, \ \forall (i, j) \in E_a\} \tag{16}$$

where $\epsilon = 1$ for setup edges and $\epsilon = 0$ for hold edges. This metric is useful because it defines how much the clock arrival time $l_i$ should change at vertex $i$ in order to maintain a feasible clock skew schedule if the clock period $P$ is reduced by $\theta$.

Lemma 1: If the clock period $P$ is reduced by a sufficiently small amount $\theta$, reducing the clock arrival time $l_i$ for each vertex $i$ in the admissible graph $G_a$ by $\theta \cdot d_i$ will maintain a feasible clock skew schedule.

Proof: We require the $\theta$ value to be sufficiently small such that no inadmissible edge becomes admissible after reducing the clock period $P$ by $\theta$. In such cases, only originally admissible edges can violate the timing constraint, that is, their reduced weights $w_r(i, j)$ become negative. Given an admissible edge $(i, j)$, its reduced weight becomes $w_r^*(i, j) = l_i - l_j + (P - \theta) - T_{ji}$ (or $w_r^*(i, j) = l_i - l_j + t_{ij}$) if it is a setup (hold) edge after we reduce $P$. Obviously, for setup edges, $w_r^*(i, j) < 0$ since $l_i - l_j + P - T_{ji} = w_r(i, j) = 0$ according to Definition 2. If we reduce the clock arrival times of $i$ and $j$ by $\theta \cdot d_i$ and $\theta \cdot d_j$, respectively, then we have,

$$w_r^*(i, j) = l_i - l_j + P - T_{ji} - \theta(d_i - d_j + 1) \tag{17}$$
$$w_r^*(i, j) = l_i - l_j + t_{ij} - \theta(d_i - d_j) \tag{18}$$

According to Definition 3 and Equation (16), $d_i - d_j + 1 \le 0$ for setup edges and $d_i - d_j \le 0$ for hold edges. Therefore we have $w_r^*(i, j) \ge 0$, which indicates that the timing constraints are still satisfied after we reduce $P$ by $\theta$ and $l_i$ by $\theta \cdot d_i$.

Theorem 1 motivates the basic idea of the primal-dual algorithm. If there is no cycle in $G_a$, the current clock period $P$ is not minimized. Then, we reduce $P$ by a certain amount $\theta$. The effect of reducing $P$ is that we reduce the reduced weights $w_r(i, j)$ of edges in the constraint graph $G$; hence, we introduce new admissible edges (see Definition 2) into $G_a$. As long as $\theta$ is sufficiently small such that only one edge becomes admissible after reducing the clock period $P$ by $\theta$, we are guaranteed to capture the moment when the first cycle in $G_a$ occurs. The clock period corresponding to this instant is the minimum clock period. The outline of the primal-dual algorithm for solving the clock skew scheduling problem is given in Algorithm 1.

The most important step of the primal-dual algorithm is line 4, which calculates the appropriate $\theta$ value. For a given edge $(i, j)$ in the constraint graph $G$, in order to make
it admissible, we require that it satisfy Definition 2,

$$\hat{l}_i - \hat{l}_j - T_{ji} + (P - \theta) = 0, \quad \text{if } (i, j) \in E_s \tag{19}$$
$$\hat{l}_i - \hat{l}_j + t_{ij} = 0, \quad \text{if } (i, j) \in E_h \tag{20}$$

where $\hat{l}_i$ represents the clock arrival time at vertex $i$ in $G$ after the clock period $P$ is reduced by $\theta$. Lemma 1 provides the way of calculating $\hat{l}_i$, that is,

$$\hat{l}_i = l_i - \theta \cdot d_i. \tag{21}$$

Substituting Equation (21) into (19) and (20) and rearranging, we obtain the equations that can be used to calculate the $\theta$ value for any edge $(i, j)$ in $G$.

$$\theta = \frac{l_i - l_j - T_{ji} + P}{d_i - d_j + 1}, \quad \text{if } (i, j) \in E_s \wedge d_i - d_j + 1 > 0 \tag{22}$$
$$\theta = \frac{l_i - l_j + t_{ij}}{d_i - d_j}, \quad \text{if } (i, j) \in E_h \wedge d_i - d_j > 0 \tag{23}$$

A setup (hold) edge $(i, j)$ cannot become admissible if $d_i - d_j + 1 \le 0$ ($d_i - d_j \le 0$). In such cases, we set the $\theta$ values to be infinite.

Algorithm 1 Primal-Dual
1: $P = \max\{T_{ij}, (i, j) \in E_s\}$
2: create an empty admissible graph $G_a$
3: while $G_a$ is acyclic do
4:   find $\theta \ge 0$ s.t. only one edge $(i, j)$ becomes admissible
5:   $P = P - \theta$
6:   add new admissible edge $(i, j)$ to $G_a$
7: end while

Different ways of implementing the subroutine on line 4 of Algorithm 1 result in different runtime efficiency. The naive approach is to calculate $\theta$ values for every edge in the constraint graph $G$ in each iteration and choose the smallest $\theta$ to be deducted from the clock period $P$. The new admissible edge corresponding to the smallest $\theta$ is then inserted into the admissible graph $G_a$. Choosing the smallest $\theta$ guarantees that exactly one new admissible edge is generated (when more than one edge has the smallest $\theta$ value, we can add one edge each time). This naive approach is basically how Burns' algorithm [1] implements the primal-dual algorithm. We will show that there are many redundant computations in this implementation. Then, we propose our more efficient implementation of the primal-dual algorithm in subsection 3-C.

B. Burns' implementation

We outline Burns' algorithm [1] to solve the clock skew scheduling problem in Algorithm 2.

Algorithm 2 Burns' algorithm for clock skew scheduling
1: $P = \max\{T_{ij}, (i, j) \in E_s\}$
2: for all $i \in V$ do
3:   $l_i = 0$; $d_i = 0$
4: end for
5: while true do
6:   create an empty admissible graph $G_a$
7:   for all $(i, j) \in E$ do
8:     if $(i, j)$ is admissible then
9:       add $(i, j)$ into $G_a$
10:     end if
11:   end for
12:   if $G_a$ is cyclic then
13:     break;
14:   end if
15:   topological sort $G_a$ and update $d_i$ for $i \in V_a$
16:   for all $(i, j) \in E$ do
17:     calculate $\theta(i, j)$ by Equation (22) and (23)
18:   end for
19:   $\theta_m = \min\{\theta(i, j), (i, j) \in E\}$
20:   $P = P - \theta_m$
21:   update $l_i$, $\forall i \in V$ by Equation (21)
22: end while

Burns' algorithm follows the primal-dual algorithm described in Algorithm 1 except that it creates a new admissible graph and recalculates admissible edges in each iteration. The strategy it utilizes to update $\theta$ values is illustrated on lines 16-18. It simply recalculates the $\theta$ for all edges in the constraint graph $G$ in each iteration. This becomes the major runtime efficiency problem of Burns' algorithm. We will show in the next subsection that only a small portion of the edges need to have their $\theta$ values updated in each iteration. Another runtime efficiency problem lies in creating a new admissible graph $G_a$ each round. This is also improved in our proposed implementation in subsection 3-C. Before that, we first use a simple example to demonstrate in detail how Burns' algorithm works. In Figure 2, we have a simple constraint graph comprised of 4 vertices. The solid lines represent the setup edges and the dashed lines represent the hold edges. Each vertex has a pair of numbers $(l_i, d_i)$, which are the clock arrival time and setup distance of vertex $i$. Similarly, each edge also has a pair of numbers $(t_{ij}, \theta_{ij})$, which represent the maximum or minimum delay between two flip-flops and the $\theta$ value, respectively.

In each iteration, after we update the $\theta$ value for every edge, we identify one edge with the minimum $\theta$, which is highlighted in the figure. The corresponding edge becomes admissible at the beginning of the next iteration (also highlighted by bold solid lines). In our simple example, only one edge becomes admissible in each iteration. Therefore, it is obvious that Burns' implementation is very inefficient.
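The per-edge $\theta$ computation of Equations (22) and (23) and the naive "recompute everything, take the minimum" step can be sketched as follows. This is a minimal illustration, not the authors' or Burns' code: the names `theta` and `naive_step` and the edge tuple layout `(i, j, kind, delay)` are hypothetical, the delay of a setup edge $(i, j)$ is assumed to hold $T_{ji}$, and the setup distances `d` are assumed to be supplied by the caller rather than derived from a topological sort of the admissible graph.

```python
import math

def theta(edge, l, d, P):
    """Theta for one edge, following Equations (22) and (23)."""
    i, j, kind, delay = edge
    eps = 1 if kind == "setup" else 0
    denom = d[i] - d[j] + eps
    if denom <= 0:
        return math.inf  # this edge cannot become admissible
    # Numerator is the current reduced weight of the edge.
    w = P - delay if kind == "setup" else delay
    return (l[i] - l[j] + w) / denom

def naive_step(edges, l, d, P):
    """One naive iteration: recompute theta for every edge and take the minimum.

    Returns the new period, the new arrival times (Equation (21)),
    and the edge that becomes admissible.
    """
    thetas = [theta(e, l, d, P) for e in edges]
    theta_m = min(thetas)
    new_edge = edges[thetas.index(theta_m)]
    new_l = {v: l[v] - theta_m * d[v] for v in l}
    return P - theta_m, new_l, new_edge

# Hypothetical two-register loop with setup delays 10 and 7.
edges = [(0, 1, "setup", 10.0), (1, 0, "setup", 7.0)]
l = {0: 0.0, 1: 0.0}
P = 10.0  # largest setup delay

# First step: all setup distances are zero.
P, l, e1 = naive_step(edges, l, {0: 0, 1: 0}, P)
# Edge (0, 1) is now admissible, so vertex 1 has setup distance 1.
P, l, e2 = naive_step(edges, l, {0: 0, 1: 1}, P)
print(P, l, e2)
```

In this toy loop the second edge closes a cycle in the admissible graph, and the period at that point equals the average of the two setup delays, which is the expected minimum for a two-register loop. Every call scans all edges, which is the cost the enhanced implementation aims to avoid.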
[Figure 2: five panels showing the constraint graph with vertex labels $(l_i, d_i)$ and edge labels (delay, $\theta$). Panel titles: Iteration 0: P = 10; Iteration 1: P = 8.5; Iteration 2: P = 8.33; Iteration 3: P = 7.5; Iteration 4: Cycle Detected, P = 7.5.]

Fig. 2. An example of using Burns' algorithm to solve the clock skew scheduling on a simple constraint graph.

Burns' algorithm actually recalculates the reduced weights of all edges at the beginning of a new iteration (line 8 in Algorithm 2). It also recalculates $\theta$ values for all edges in each iteration (line 17 in Algorithm 2). However, from our example in Figure 2 we can tell that not all $\theta$ values change in each iteration. In the following, we propose a systematic way to update the admissible graph $G_a$ and the $\theta$ values, such that the runtime efficiency can be significantly improved.

C. An enhanced implementation

In this subsection, we first propose the algorithm to maintain one admissible graph $G_a$ and update setup distances for vertices in $G_a$. Then, we describe how to update the $\theta$ values efficiently.

C.1 maintaining the admissible graph

We follow the method described in Algorithm 1 to update the admissible graph $G_a$ instead of creating a new one in each iteration. Only one new admissible edge is inserted into $G_a$ in each round. If multiple edges share the minimum $\theta$ value, we arbitrarily choose one of them. We will prove that by updating the admissible graph in this way, $G_a$ is always a set of trees, i.e., a forest.

Theorem 2: If we add exactly one minimum $\theta$ value edge $(i, j)$ into the admissible graph $G_a$, $G_a$ is a forest until a cycle is generated.

Proof: The proof is by induction on the number of edges in $G_a$. The basis is trivial since an empty $G_a$ can always be regarded as an empty forest. For the inductive step, we assume that the theorem holds for $G_a$ with $k$ edges in it. In other words, we have a $k$-edge forest. We will show that after we add a new admissible edge $(i, j)$, $G_a$ is still a forest. There are four cases depending on whether vertices $i$ and $j$ are in forest $G_a$ before adding edge $(i, j)$.

1. $i \notin G_a$, $j \notin G_a$
In this case, we generate a new tree rooted at $i$, which has only one child $j$. The resulting $G_a$ is still a forest;
2. $i \in G_a$, $j \notin G_a$
In this case, we add a child $j$ to one vertex $i$ in a tree, which is still a tree. Thus the resulting $G_a$ is still a forest;
3. $i \notin G_a$, $j \in G_a$
This case cannot occur because if $i \notin G_a$, $d_i = 0$; on the other hand, since $j \in G_a$, we have $d_j \ge 1$. Therefore, we have $d_i - d_j + \epsilon \le 0$. In such a case, $\theta(i, j) = \infty$ according to Equations (22) and (23). Therefore, edge $(i, j)$ cannot be an admissible edge.
4. $i \in G_a$, $j \in G_a$
In this case, we will prove that the incoming edge $(w, j) \in E_a$ of vertex $j$ must become inadmissible; therefore, we can safely remove this edge and still maintain a forest. First we observe from Equations (22) and (23) that in order for edge $(i, j)$ to become the minimum $\theta$ edge, it must satisfy $d_i - d_j + \epsilon(i, j) > 0$. Therefore, we have $d_i + \epsilon(i, j) > d_j$. On the other hand, since $(w, j) \in E_a$, we have $d_w + \epsilon(w, j) = d_j$. The reason is based on the definition of setup distance $d_j$ and our inductive assumption that $G_a$ is a forest. The new setup distance $d_j$ is $\max\{d_i + \epsilon(i, j), d_w + \epsilon(w, j)\}$, which is $d_i + \epsilon(i, j)$. This new $d_j$ value makes edge $(w, j)$ inadmissible because its new reduced weight $w_r(w, j) = w(w, j) - \theta(d_w - d_j + \epsilon(w, j))$ (see Equations (17) and (18)) becomes positive. By removing the incoming edge $(w, j)$ and inserting the new admissible edge $(i, j)$, $G_a$ is still maintained as a forest.

Theorem 2 provides us an easy way to incrementally update the admissible graph instead of generating a new graph in each iteration. The complete algorithm is shown in Algorithm 3.

Algorithm 3 INC_GA_UPDATE(i, j)
1: add $(i, j)$ into $G_a$
2: remove $(w, j)$ from $G_a$ if exists
3: $d_j = d_i + \epsilon$
4: $p_j = p_i \cup \{i\}$
5: for all $k \in E_a$ in the subtree rooted at $j$ do
6:   update $d_k$, $p_k$
7: end for

For each vertex $i$ in the admissible graph $G_a$, we maintain a list of its parents $p_i$ in the tree. In this way, the cycle can be detected by checking if the newly inserted edge $(i, j)$ satisfies $j \in p_i$. According to Definition 3, setup distances of all vertices in the subtree rooted
at $i$ should also be increased by the same amount as that for vertex $i$. Of course, the parent list should also be updated. Our procedure for updating the admissible graph has a worst case time complexity of $O(|V|)$. On the other hand, Burns' algorithm has $O(|E|)$ complexity for the same step.

C.2 updating θ efficiently

Our method for updating the $\theta$ value is organized in a similar way to Dijkstra's shortest path algorithm. In Dijkstra's algorithm, a set of edges is maintained as candidates for shortest path tree edges. In the next iteration, a minimum edge $(i, j)$ from this edge set becomes a shortest path tree edge. The corresponding vertex $j$ is moved to the tree vertex set $W$. Then, new candidate edges, which are the outgoing edges from vertex $j$, are added into the edge set. This process stops when the tree vertex set $W$ is equal to the vertex set $V$ of the original graph $G(V, E)$.

In our MPCSS problem, we choose an edge $(i, j)$ with the minimum $\theta$ value and add the corresponding vertex $j$ into $G_a$. In Burns' implementation, the set of candidate edges includes all edges in $G$. In our algorithm, however, we propose to maintain a much smaller set of candidate edges. Furthermore, instead of recalculating $\theta$ values for all candidate edges in this set (in order to find the minimum $\theta$ edge) in each iteration, we propose to update $\theta$ values only for a small subset of our candidate edges.

Lemma 2: A candidate edge $(i, j)$ for a new admissible edge in $G$ can only be in one of the following two exclusive categories:
$E_{c1}$: $(i, j) \in E_s$ and $d_i = 0$, $d_j = 0$;
$E_{c2}$: $(i, j) \in E$ and $i \in G_a$;

Proof: According to Equations (22) and (23), an edge $(i, j)$ can only become admissible if it satisfies $d_i - d_j + \epsilon > 0$. It is trivial to verify that among all edges in $E$, only those edges in the above two categories can satisfy this constraint.

These two categories are mutually exclusive, which means that a candidate can be in either $E_{c1}$ or $E_{c2}$, but not both. The category $E_{c1}$ has the following property.

Lemma 3: Given an ordered set of candidate edges (in terms of the $\theta$ values) in $E_{c1}$, it is still an ordered set after the clock period $P$ is reduced by $\theta$.

Proof: For edge $(i, j) \in E_{c1}$, we have $d_i = 0$, $d_j = 0$, $l_i = 0$, $l_j = 0$. Its $\theta$ can be calculated as $\theta(i, j) = P - T_{ij}$. Therefore, if $\theta(i, j) < \theta(u, v)$, after reducing $P$ by the minimum $\theta_m$ (see line 19 in Algorithm 2), the inequality still holds.

Lemma 3 suggests organizing the setup edges in $E_{c1}$ separately from the candidate edges in $E_{c2}$, because once this set is ordered, it remains ordered during all iterations of the algorithm. We can first put all setup edges in a list $L_s$ sorted by $T_{ij}$ values. Then, in each iteration, we pop the first element from the list and compute its $\theta_{mc1}$. The minimum $\theta_m$ is calculated as $\min\{\theta_{mc1}, \theta_{mc2}\}$, where $\theta_{mc2}$ represents the minimum $\theta$ value in the candidate edge set $E_{c2}$.

It is possible that a setup edge which is originally in $E_{c1}$ violates the requirement $d_i = 0$ or $d_j = 0$ after the admissible graph $G_a$ is updated. If the first candidate edge in $L_s$ presents such a case, we can discard this edge and check the next top element in $L_s$, until there is one edge satisfying the condition. The discarded edges actually become elements of the edge set $E_{c2}$. This is because of the mutual exclusiveness of these two edge categories.

Unlike $E_{c1}$, which is fully built before the algorithm starts iterating, the candidate edge set $E_{c2}$ grows incrementally. Originally, there are no vertices in the admissible graph $G_a$. Therefore, $E_{c2} = \emptyset$. In each iteration of the main algorithm, we add one admissible edge $(i, j)$ into $G_a$. If vertex $j$ is already in $G_a$, we do nothing. Otherwise, $j$ becomes an admissible vertex. Based on the requirement for $E_{c2}$ described in Lemma 2, we know that all edges $(j, k) \in E$ should be added into the candidate edge set $E_{c2}$. We illustrate the subroutine for maintaining the candidate edge set $E_{c2}$ in Algorithm 4.

Algorithm 4 UPDATE_EC2(i, j)
1: if $j \notin E_a$ then
2:   for all outgoing edges $(j, k) \in E$ do
3:     calculate $\theta(j, k)$
4:     insert $(j, k)$ in $E_{c2}$
5:   end for
6: end if
7: update $\theta$ for edges in $E_{c2}$

Algorithm 4 is exactly the method by which Dijkstra's shortest path algorithm updates its candidate list, except that Dijkstra's algorithm does not need to perform the update on line 7. By implementing the edge set using the heap data structure, inserting a new edge into the heap and decreasing a key value in the heap can be done in $O(\log n)$ time. Moreover, the minimum $\theta$ value edge can be read from the top of the heap in constant time.

At this point, our implementation of the primal-dual algorithm is already much more efficient than Burns' algorithm. We only need to update $\theta$ values for those edges in heap $E_{c2}$ in each iteration, which usually is only a small portion of the full edge set $E$ in the constraint graph. However, the runtime efficiency of the procedure on line 7 in Algorithm 4 can be further improved by the following observation.

Lemma 4: The $\theta$ value for a candidate edge $(i, j) \in E_{c2}$ reduces by $\theta_m$, which is the minimum $\theta$ deducted from the clock period $P$ in the previous iteration, if $d_i$ and $d_j$ are not changed from the previous iteration, i.e., $d_i^* = d_i$ and $d_j^* = d_j$.

Proof: According to Equations (21), (22) and (23), a
new $\theta$ value $\theta^*(i, j)$ after reducing the clock period $P$ by $\theta_m$ can be expressed as

$$\theta^*(i, j) = \frac{(l_i - \theta_m d_i^*) - (l_j - \theta_m d_j^*) + P - \theta_m - T_{ji}}{d_i^* - d_j^* + 1} = \theta(i, j) - \theta_m, \quad \forall (i, j) \in E_{c2} \cap E_s \tag{24}$$

$$\theta^*(i, j) = \frac{(l_i - \theta_m d_i^*) - (l_j - \theta_m d_j^*) + t_{ij}}{d_i^* - d_j^*} = \theta(i, j) - \theta_m, \quad \forall (i, j) \in E_{c2} \cap E_h \tag{25}$$

Omitted due to space limitation.

From Lemma 4 we see that candidate edges $(i, j)$ in $E_{c2}$ are of two types, depending on how they should be updated.
1. Neither $d_i$ nor $d_j$ changes from the previous iteration.
2. Either $d_i$ or $d_j$ changes from the previous iteration.

Candidate edges of type 1 can be updated by subtracting $\theta_m$ from the previous $\theta(i, j)$ values. For edges of type 2, on the other hand, Equations (22) and (23) must be used to calculate new $\theta(i, j)$ values.

One important observation is that in practice the number of edges of type 1 is much larger than that of type 2. This is because when a new admissible edge $(i, j)$ is added into $G_a$, only edges incident to vertices in the subtree rooted at vertex $j$ are of type 2. Therefore, if we can save further computation on edges of type 1, we can improve the runtime efficiency. This can be done based on the following theorem.

Theorem 3: In each iteration of the primal-dual algorithm, $\theta$ values only need to be recalculated, by using Equation (26), on candidate edges $(i, j) \in E_{c2}$ for which $d_i$ or $d_j$ changes.

Proof: Assume the candidate edge set $E_{c2}$ is organized as a sequence ordered by the $\theta$ values. A heap is an example of such a data structure. The naive approach is to first recalculate $\theta$ values for all edges in $E_{c2}$ and then re-sort the sequence. After that, if we add a constant value $\theta_m$ to every candidate edge in this sorted sequence, it remains sorted. The minimum-$\theta$ edge is simply the first entry in this sequence. The effect of adding a constant is that the new $\theta$ values for candidate edges of type 1 become $\theta^*(i, j) = (\theta(i, j) - \theta_m) + \theta_m = \theta(i, j)$. In other words, they do not change from their values in the previous iteration. Thus computation effort can be saved for the candidate edges of type 1. As for candidate edges of type 2, the new $\theta$ value $\theta^*(i, j)$ is calculated as

$$\theta^*(i, j) = \frac{\hat{l}_i - \hat{l}_j + P - \theta_m - T_{ji}}{d_i^* - d_j^* + 1} + \theta_m \tag{26}$$

By Theorem 3, the time complexity of the subroutine for updating $\theta$ in the primal-dual algorithm becomes $O(1)$ in practice (in most iterations, only the newly entered vertex $j$ has its setup distance updated; therefore, only candidate edges going out from $j$ need to have their $\theta$ recalculated). Burns’ algorithm, however, would perform $O(|E|)$ computation for the same task.

The only remaining problem is to determine whether a given vertex $i$ in $G_a$ has its setup distance $d_i$ changed. This can be easily tracked while updating the admissible graph. The candidate edges whose $\theta$ values need to be recalculated are then the outgoing and incoming admissible edges of this vertex $i$.

Given all the methods for improving runtime efficiency discussed in the above two subsections, we present our algorithm as Algorithm 5.

Algorithm 5 Dijkstra style clock skew scheduling
 1: $P = P^* = \max\{T_{ij}, (i, j) \in E_s\}$
 2: create a sorted set $E_{c1}$ for all setup edges $E_s \subseteq E$
 3: $G_a = \emptyset$, $E_{c2} = \emptyset$
 4: while true do
 5:   $(i_1, j_1)$ = EXTRACT_MAX($E_{c1}$)
 6:   $(i_2, j_2)$ = EXTRACT_MIN($E_{c2}$)
 7:   $\theta_m(i_m, j_m) = \min\{-\theta(i_1, j_1) + P^*, \theta(i_2, j_2)\}$
 8:   update $l_i$, $\forall i \in G_a$ by Equation (21)
 9:   $P = P - \theta_m$
10:   INC_GA_UPDATE($i_m, j_m$) (Algorithm 3)
11:   if $G_a$ is cyclic then
12:     break
13:   end if
14:   UPDATE_EC2($i_m, j_m$) (Algorithm 4)
15: end while

Our proposed algorithm follows the general principle of the primal-dual algorithm while carefully optimizing the computation inside the main loop. On line 7 of this algorithm, instead of directly comparing the minimum entry from $E_{c1}$ and the minimum entry from $E_{c2}$, we first offset the value from $E_{c1}$ by $P^*$, which is the initial clock period. This is because the values in $E_{c1}$ are not actually the $\theta$ values for setup edges; they only denote $T_{ji}$. As discussed above, for candidate edges in $E_{c1}$, $\theta$ can be calculated as $\theta(i, j) = P - T_{ji}$.

IV. Experimental Results

In this section, we present experimental results demonstrating the efficiency of our proposed algorithm.

We implemented both Burns’ algorithm and our proposed algorithm to solve the MPCSS problem in C++ on a Linux machine with a 2 GHz CPU and 2 GB of RAM. In order to obtain a fair comparison, both algorithms are coded on the same infrastructure:
• they use the same graph data structure,
TABLE I
Comparison of runtime efficiency of clock skew scheduling algorithms

Circuit    |V|   |E|     CP (ps)  CPm (ps)  Burns’ CPU (s)  Ours CPU (s)  Ratio
s5378      180   2732    3471     3240      0.028           0.003         9.3
s9234      212   5144    4243     3801      0.721           0.019         37.9
s13207     639   7046    4497     4325      0.496           0.014         35.4
s15850     535   24012   7255     6903      6.80            0.090         75.6
s35932     1729  12790   3532     3433      0.230           0.097         2.4
s38417     1637  66304   6087     5760      19.28           0.252         76.5
b14_1_opt  246   39954   13424    12786     4.012           0.089         45.1
b14_1      246   40966   13188    12549     3.122           0.087         35.9
b15_1_opt  449   85948   13033    12331     5.67            0.130         43.6
b15_opt    450   126100  13782    12589     56.51           0.781         72.4
b17_1_opt  1413  290426  13952    12969     105.0           0.663         158.4
b17_opt    1415  381076  13517    12575     121.6           0.746         163.0
b18_1_opt  3271  681610  17945    16851     155.6           1.09          142.8
b18_opt    3271  683634  18304    16975     108.4           1.04          104.2
b20_1_opt  491   103000  14258    13170     44.1            0.292         151.0
b20_opt    491   105360  14224    13435     25.1            0.219         114.6
b21_1_opt  491   102940  14453    13473     36.0            0.254         141.7
b21_opt    491   104968  13944    12911     45.1            0.328         137.5
b22_1_opt  704   152142  13992    12590     110.0           0.582         189.0
b22_opt    704   155630  14303    13370     51.1            0.312         163.8
avg                                                                       95.0

• they use the same graph manipulating subroutines,
• they use the same subroutine to calculate $\theta$ values.

Burns’ algorithm is coded based on the pseudo-code given in Algorithm [1]. Our proposed algorithm is coded in the manner described in Algorithm 5. We ran each algorithm 5 times and report the average execution time over these runs. The final runtime results are listed in Table I.

In Table I, the $|V|$ and $|E|$ columns give the number of vertices and the number of edges in the constraint graph $G$. The next two columns give the original clock period $P^*$ and the clock period $P$ after clock skew scheduling. The CPU time is measured in seconds for both algorithms. We also list the ratio of the execution time of Burns’ algorithm to that of our algorithm.

The benchmark circuits we use for clock skew scheduling are from two widely used sets, ISCAS89 and ITC99. We take the largest circuits from both sets because both algorithms are extremely fast (less than one second) for small circuits with fewer than 10k gates. In order to obtain more realistic results, we calculate the gate delays in the following way. We first re-synthesize the circuits (in either Verilog or VHDL format) with Synopsys Design Compiler. Then, we export the synthesized netlist along with the Standard Delay Format (SDF) file from Design Compiler. The SDF file provides the delay of each gate in the netlist. Therefore, we can construct a constraint graph $G$ by calculating the shortest and longest accumulated path delays between any two flip-flops in the circuit. The total runtime of this step is $O(|V|^2)$.

It can be observed from Table I that on average our algorithm is 95 times faster than Burns’ implementation, even though both follow the same primal-dual idea. Such a significant runtime improvement indicates that the proposed algorithm is superior to Burns’ algorithm in terms of asymptotic time complexity. The major improvement of the asymptotic runtime is due to the use of the heap data structure in the proposed algorithm. Our algorithm has an asymptotic runtime of $O(|V||E| + |V|^2 \log|V|)$, which is improved from the $O(|V|^2|E|)$ of Burns’ implementation. This is quite similar to the way Dijkstra’s algorithm improves the runtime to $O(|E| + |V| \log|V|)$ from the $O(|V||E|)$ complexity of the Bellman-Ford shortest path algorithm. The only difference is that in our problem there is one more level of iteration, which is repeated $O(|V|)$ times. In practice, the runtime improvement is more significant. In the best case, we observe a 189-times speedup (circuit b22_1_opt) with the proposed algorithm.
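
The heap bookkeeping credited above, together with the constant-offset argument in the proof of Theorem 3, can be sketched as follows. This is a minimal illustration and not the authors' C++ implementation: the class name `OffsetHeap`, its methods, and the toy edges are hypothetical, the heap is Python's `heapq` with lazy deletion in place of a decrease-key operation, and a cumulative shift stands in for the per-iteration constant $\theta_m$ of Equation (26).

```python
import heapq
import itertools


class OffsetHeap:
    """Min-heap of candidate edges whose theta values all drop together.

    Hypothetical helper: keys are stored relative to a running shift, so
    lowering every theta by theta_m costs O(1) instead of touching each edge.
    """

    def __init__(self):
        self._heap = []                   # entries: [stored_key, tie_breaker, edge]
        self._live = {}                   # edge -> its current valid entry
        self._shift = 0.0                 # total amount subtracted so far
        self._count = itertools.count()   # unique tie-breaker for equal keys

    def set_theta(self, edge, theta):
        """Insert an edge, or overwrite its theta (an edge whose d_i or d_j changed)."""
        entry = [theta + self._shift, next(self._count), edge]
        self._live[edge] = entry          # any older entry for this edge is now stale
        heapq.heappush(self._heap, entry)

    def decrease_all(self, theta_m):
        """Unchanged edges: every theta drops by theta_m, done in O(1)."""
        self._shift += theta_m

    def pop_min(self):
        """Remove and return (edge, theta) for the smallest current theta."""
        while self._heap:
            entry = heapq.heappop(self._heap)
            edge = entry[2]
            if self._live.get(edge) is entry:   # skip stale entries
                del self._live[edge]
                return edge, entry[0] - self._shift
        raise KeyError("pop from empty OffsetHeap")


# Toy run with made-up theta values.
h = OffsetHeap()
h.set_theta(("a", "b"), 5.0)
h.set_theta(("a", "c"), 3.0)
h.set_theta(("b", "c"), 4.0)

edge, theta_m = h.pop_min()      # minimum-theta edge of this iteration
h.decrease_all(theta_m)          # remaining thetas shrink by theta_m at once
h.set_theta(("a", "b"), 0.5)     # only this edge is recomputed explicitly
first = h.pop_min()
second = h.pop_min()
print(edge, theta_m, first, second)
```

Storing keys relative to a shared shift keeps the heap order valid without visiting unchanged edges, which is the saving Theorem 3 describes; only edges whose setup distances changed are reinserted.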

V. Conclusions
In this paper, we revisit the primal-dual based algorithm for solving the minimum clock period clock skew scheduling problem. We propose a much more efficient primal-dual based algorithm that improves on the runtime of Burns’ implementation of the primal-dual algorithm. The proposed algorithm is superior to Burns’ algorithm in both theoretical asymptotic time complexity and measured runtime. Our experimental results show that our proposed algorithm is on average 95 times faster than Burns’ algorithm. Even for the largest circuits in our benchmarks, our algorithm takes only about one second to complete (at most 1.09 s in Table I). This enables faster algorithms that use MPCSS as a subroutine within their inner iteration loops.
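
As a quick arithmetic check of the summary figures quoted here, the snippet below recomputes the average and best speedup from the ratio column of Table I. It is only a sanity check under stated assumptions: the dictionary name `ratios` is hypothetical, and the circuit names are written with underscores.

```python
# Ratio of Burns' CPU time to ours, copied from the last column of Table I.
ratios = {
    "s5378": 9.3, "s9234": 37.9, "s13207": 35.4, "s15850": 75.6,
    "s35932": 2.4, "s38417": 76.5, "b14_1_opt": 45.1, "b14_1": 35.9,
    "b15_1_opt": 43.6, "b15_opt": 72.4, "b17_1_opt": 158.4, "b17_opt": 163.0,
    "b18_1_opt": 142.8, "b18_opt": 104.2, "b20_1_opt": 151.0, "b20_opt": 114.6,
    "b21_1_opt": 141.7, "b21_opt": 137.5, "b22_1_opt": 189.0, "b22_opt": 163.8,
}

# Arithmetic mean over all benchmark circuits.
mean_ratio = sum(ratios.values()) / len(ratios)

# Circuit with the largest speedup.
best_circuit = max(ratios, key=ratios.get)

print(f"circuits: {len(ratios)}")
print(f"mean speedup: {mean_ratio:.1f}x")
print(f"best speedup: {ratios[best_circuit]}x on {best_circuit}")
```

The mean is an unweighted average of per-circuit ratios, so small circuits such as s35932 count as much as the largest ones.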

References
[1] S. M. Burns. Performance analysis and optimization of asynchronous circuits. PhD thesis, California Institute of Technology, Computer Science Department, 1991.
[2] R. B. Deokar and S. S. Sapatnekar. A graph-theoretic approach to clock skew optimization. In Proc. Intl. Symposium on Circuits and Systems, 1994.
[3] J. P. Fishburn. Clock skew optimization. IEEE Transactions on Computers, 39(7), July 1990.
[4] A. V. Goldberg. A heuristic improvement of the Bellman-Ford algorithm. Applied Mathematics Letters, 1993.
[5] S. Huang, C. Cheng, Y. Nieh, and W. Yu. Register binding for clock period minimization. In Proc. of the Design Automation Conf., 2006.
[6] I. S. Kourtev and E. G. Friedman. Timing Optimization Through Clock Skew Scheduling. Kluwer Academic Publishers, 2000.
[7] E. L. Lawler. Combinatorial Optimization: Networks and Matroids. Holt, Rinehart and Winston, 1976.
[8] L. Liu, T. Chou, A. Aziz, and D. F. Wong. Zero-skew clock tree construction by simultaneous routing, wire sizing and buffer insertion. In International Symposium on Physical Design, 2000.
[9] Xun Liu, Marios C. Papaefthymiou, and Eby G. Friedman. Maximizing performance by retiming and clock skew scheduling. In DAC, pages 231–236, 1999.
[10] C. H. Papadimitriou and K. Steiglitz. Combinatorial Optimization: Algorithms and Complexity. Dover Publications, 1998.
[11] K. Ravindran, A. Kuehlmann, and E. Sentovich. Multi-domain clock skew scheduling. In Proc. Intl. Conf. on Computer-Aided Design, 2003.
[12] J. Tsai, T. Chen, and C. C. Chen. Optimal minimum-delay/area zero skew clock tree wire-sizing in pseudo-polynomial time. In International Symposium on Physical Design, 2003.
[13] V. Chvátal. Linear Programming. W. H. Freeman, 1983.
[14] Dimitrios Velenis, Kevin T. Tang, Ivan S. Kourtev, V. Adler, F. Baez, and Eby G. Friedman. Demonstration of speed and power enhancements on an industrial circuit through application of clock skew scheduling. Journal of Circuits, Systems, and Computers, 11(3):231–246, 2002.
