
2018 IEEE Computer Society Annual Symposium on VLSI

Parametric Circuit Optimization with Reinforcement Learning

Changcheng Tang
Tsinghua University, Institute of Microelectronics
Beijing
Email: tangcc15@mails.tsinghua.edu.cn

Zuochang Ye
Tsinghua University, Institute of Microelectronics
Beijing
Email: zuochang@tsinghua.edu.cn

Yan Wang
Tsinghua University, Institute of Microelectronics
Beijing
Email: wangy46@tsinghua.edu.cn

Abstract—In this paper, we focus on solving parametric optimization problems. Problems of this kind are very common in practice. We propose an efficient method to train a model that maps the parameters to the solution and thus solves all problems that share the same structure but differ in their parameters at the same time. During training, instead of independently solving a series of optimization problems with randomly sampled parameters, we adopt reinforcement learning to accelerate the training process. Two networks are trained alternately. The first is a value network, trained to fit the target loss function. The second is a policy network, whose output is connected to the input of the value network and which is trained to minimize the output of the value network. Experiments demonstrate the effectiveness of the proposed method.

Keywords-circuit optimization; reinforcement learning; parametric optimization
I. INTRODUCTION

Global optimization algorithms have been studied for many years. An efficient, simple, and robust optimization algorithm is essential in many fields of science and engineering. Conventional global optimization algorithms either search as broad a range as possible to find the minima, like evolutionary algorithms (EAs)[1][2], or follow the gradient of the objective function, like stochastic gradient descent (SGD), momentum [Nesterov, 1983; Tseng, 1998], RMSprop [Tieleman and Hinton, 2012], and Adam [Kingma and Ba, 2015].

In many applications, people encounter parametric optimization problems, i.e., a set of problems that share the same structure and differ only parametrically. A representative application is circuit optimization. The topology of the circuit to be optimized is usually fixed, such as 6-T SRAM, logic gates, and amplifiers. Such circuits have been studied thoroughly; however, engineers still need to optimize the circuit again whenever the conditions, e.g., process node, power supply voltage, and constraints, change. If such parametric optimization problems could be solved once, this repeated effort would no longer be necessary.

Motivated by Learning to Learn by Gradient Descent by Gradient Descent [4], we noticed that a well-trained optimizer can replace human-designed optimizers, which supply the updating values of the variables of the objective function at each iteration of the optimization process.

Training an optimizer constructed from a neural network by supervised learning does not work [5]. Fortunately, reinforcement learning has exactly the same formulation as the task of training an optimizer. The main idea of reinforcement learning is to define a reward function over specific actions and states of the system, and then penalize a poorly performing optimizer [5][6].

In this paper, we propose a global optimization method that optimizes a family of parametric objective functions by training an optimizer. We model the policy and the state-value function with neural networks, train them with the DDPG (Deep Deterministic Policy Gradient) algorithm of reinforcement learning [7], and achieve good performance on several global optimization cases.

II. PARAMETRIC GLOBAL OPTIMIZATION

The problem this paper focuses on is global optimization of the following form:

    min_θ f(θ, w)                                                  (1)

where f(θ, w) is the objective function to be optimized and w is a vector that parameterizes the objective function.

A naive way to solve the parametric optimization problem is to sample in the parameter space, solve the optimization problem for each sampled w_i to obtain the optimal solution θ_i, and finally train a network that maps the parameter w to the best solution θ with supervised learning.

The problem with this naive method is that for each different parameter the optimization has to be done from scratch, which is resource-wasting and time-consuming. This is particularly true when the variable space and the parameter space are large. In this case, randomly sampled parameter vectors tend to be sparse in the high-dimensional space,
and the number of samples required to preserve the sampling density grows exponentially with the number of parameters.

Our idea is to train an optimizer, based on reinforcement learning, that generates the optimal search path in the way conventional optimizers do, rather than to build a simple regression model.

A. Learning to Optimize

An unconstrained single-objective optimization task can be expressed as: find the minimum θ* = argmin_θ f(θ), where f(θ) is the objective function. Many of the global optimization methods discussed above are inspired by evolutionary algorithms: GAs, EP, ESs, GP, differential evolution (DE), and so on [1][3]. They are called zero-order optimization methods, which means they do not rely on the gradient of the objective function. In contrast, gradient descent is popular in deep learning because of its efficiency and the good performance of its variants, including stochastic gradient descent (SGD), momentum [Nesterov, 1983; Tseng, 1998], Adagrad [Duchi et al., 2011], RMSprop [Tieleman and Hinton, 2012], and Adam [Kingma and Ba, 2015].

Whichever optimizer we choose, it effectively acts as an optimizer that generates updating values for the optimizee: θ_{t+1} = θ_t + g(f(θ_t), θ_t, φ), where g is the optimizer specified by parameters φ. This generic procedure is summarized in Algorithm 1.

Algorithm 1 Procedure of Optimization Algorithms
Input: objective function f
1: θ_0 ← randomly initialized
2: for t = 1, ..., T do
3:     if stop condition then
4:         return θ_t
5:     θ_t ← θ_{t−1} + g(f(θ_{t−1}), θ_{t−1}, φ)
6: return θ_T
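As an illustration only, here is a minimal Python sketch of this generic loop. The stop condition and the stand-in update rule g_toy are our own placeholders; in the paper g is the learned policy network.

    import numpy as np

    def run_optimizer(f, g, dim, T=100, tol=1e-8):
        # Algorithm 1: theta_t = theta_{t-1} + g(f(theta_{t-1}), theta_{t-1})
        theta = np.random.randn(dim)            # line 1: random initialization
        for _ in range(T):
            loss = f(theta)
            if loss < tol:                      # placeholder stop condition
                return theta
            theta = theta + g(loss, theta)      # the optimizer g proposes the update
        return theta

    # toy stand-in for g, not the learned optimizer:
    g_toy = lambda loss, theta: -0.1 * theta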
B. Reinforcement Learning

A standard reinforcement learning problem can usually be described as an agent A interacting with an environment E in a discrete-time system [8]. The environment may be a very complex system, and we can treat it as an abstract system or a black box represented by a high-dimensional state s_t, from which we obtain an observation o_t for each state s_t. The agent takes an action a_t, depending on o_t, that changes the state of the environment from s_t to s_{t+1}, and we can then define the reward r_t ∈ R of action a_t at the current state s_t [8].

The way to evaluate the performance of an action is the action-value function:

    Q(s, a) = max_π E[r_t + γ r_{t+1} + γ² r_{t+2} + ... | s_t = s, a_t = a, π]    (2)

where γ is the reward discount factor and π is the policy function, which defines the behavior of the agent. Usually, a neural network is used to approximate this function [8].

Firstly, the agent makes an observation of the environment. Secondly, the agent takes an action based on the current observation. Lastly, the environment updates its state and returns a reward to the agent (Figure 1).

Figure 1. Environment and agent.

The goal of the algorithm is to optimize the stochastic policy π: O × A → R≥0, which can simply be expressed as a_t ∼ π(a_t | o_t) [10][13].

There are discrete-action-space algorithms, like Q-learning and DQN [19], and continuous-action-space algorithms, like DDPG and PPO [7]. In a discrete action space, the number of actions is finite, so the algorithm can enumerate all actions, calculate their Q-values, and find the maximum. The "deep Q network" (DQN) algorithm uses a neural network to approximate the value function Q(s) = [Q(s, a_1), Q(s, a_2), ..., Q(s, a_n)]^T, where n is the number of actions [8]. It enumerates all possible actions and then chooses the one with the largest Q-value.
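For concreteness, a minimal sketch (ours, not from the paper) of this discrete-action selection, assuming a Keras model q_net that outputs one Q-value per action:

    import numpy as np

    def choose_action(q_net, observation):
        # Greedy DQN-style selection: evaluate Q(s, a_1..a_n) and take the arg max.
        q_values = q_net.predict(observation[None, :])[0]   # shape: (n_actions,)
        return int(np.argmax(q_values))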
DQN is robust and efficient, but it can only handle discrete, low-dimensional action spaces: when the dimension of the action grows linearly, the number of possible actions grows exponentially. We want an algorithm that works in continuous, high-dimensional action spaces, and the DDPG algorithm addresses these issues.

C. DDPG: Deep Deterministic Policy Gradients

DDPG is a continuous-space algorithm. Because the states and actions form infinite sets in a continuous space, it is impossible to calculate the Q-value of every action a_t as DQN does in a discrete space. To address this issue, DDPG uses neural networks to learn in large state and action spaces. Directly learning the Q-value with a neural network, however, has been shown to be unstable in many environments, since the function Q(s_t, a_t | θ^Q) is both being updated at every training step and serving as the learning target [7]. DDPG therefore consists of an actor and a critic network, Q(s_t, a_t | θ^Q) and η(s_t | θ^η), together with their copies Q′(s_t, a_t | θ^{Q′}) and η′(s_t | θ^{η′}), where θ denotes the network weights. The critic is a value network, trained to fit the value function Q(s, a); the actor is a policy network, trained to determine the best action for the current observation. There are thus two pairs of networks: learning networks and target networks. The learning networks are trained, as in supervised learning, against targets generated by the target networks, while the target networks are not trained directly; instead, they update their weights from the learning networks' weights:

    θ′ ← τ θ + (1 − τ) θ′                                          (3)

where τ ≪ 1 is a small constant called the soft-updating factor. When τ = 1, this becomes hard updating. Soft updating may be slower than updating the original networks directly, but it turns out to be more stable.
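As a sketch of Eq. (3), assuming two Keras models with identical architectures (the function and variable names are ours, not the paper's):

    TAU = 0.001   # the soft-updating factor used in Section IV-A

    def soft_update(target_model, learning_model, tau=TAU):
        # Eq. (3): theta' <- tau * theta + (1 - tau) * theta'
        updated = [tau * w + (1.0 - tau) * w_t
                   for w, w_t in zip(learning_model.get_weights(),
                                     target_model.get_weights())]
        target_model.set_weights(updated)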


   
 
   


Figure 2. The construction of DDPG.

As shown in Fig. 2, DDPG contains two pairs of networks: the actor-target takes an action based on the current observation of the environment, and the critic-target evaluates the Q-value of the current action. The actor updates its policy based on the Q-value at every step, and the actor-target updates itself from the actor every τ steps. The environment then returns the reward to the critic, which is updated in the same way as the actor.

When the actor-critic networks are well trained, we can use the critic network alone to make the decision. In other words, the agent receives the observation o_t from the environment E as the input of the critic network, and the critic network then offers the best action a_t.

III. METHODOLOGY: LEARNING TO OPTIMIZE WITH REINFORCEMENT LEARNING

Our idea is that a specific topological problem can be parameterized as a family of objective functions, and we conjectured that a model could be built to describe the relationship between the parameters W and the optimal solutions Θ* for this function family, i.e., a regression model M: W → Θ*.

We would have to provide enough training data {w_i, θ_i*}, as well as a test set, to train such a regression model. For the two simple mathematical problems considered later, generating the dataset is fast: solving each instance takes only some iterations. In circuit simulation, however, generating the training dataset is barely feasible: it simply costs too much. A global optimization algorithm usually takes a hundred steps to find the optimal solution and calls the simulator 10 or more times per step. In other words, we may call the simulator thousands of times to generate one single sample.

We therefore train an optimizer that provides the updating value rather than building a regression model directly. The difference between the optimizer and a regular regression model is the variable space: an optimizer is also a regression model, namely O: Θ × W → A, where A is the space of updating values.

We might still wonder whether such a model could be trained by supervised learning, but according to [5] the answer is no: when they tried to train an optimizer by supervised learning, the optimizer diverged quickly. Fortunately, reinforcement learning is feasible for training an optimizer.

In the abstract problem above, we let f: S → R be the environment E, θ_t be the state s_t, θ_t and the parameter w together be the observation o_t, and g: R × S → R be our agent A. We then define the reward r_t = Δ(f(θ_t), f(θ_{t+1})), which means the agent receives a positive reward when it finds a better solution. This transfers the optimization problem into a reinforcement learning problem, as shown in Fig. 3.

Figure 3. The optimizer based on reinforcement learning.
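One simple choice of Δ consistent with "positive reward when the agent finds a better solution" is the decrease in loss between consecutive iterates; this particular form is our assumption, since the paper only specifies r_t = Δ(f(θ_t), f(θ_{t+1})):

    def reward(f, theta_t, theta_next, w):
        # r_t = Delta(f(theta_t), f(theta_{t+1})): positive when the new point is better.
        return f(theta_t, w) - f(theta_next, w)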
IV. EXPERIMENTS

A. Implementation Detail

In all our experiments, we build simple actor and critic networks from a few fully connected (FC) layers in Keras [15]. Each network is optimized by Adam with a learning rate of 0.001 and trained for 30,000 steps. The reward discount factor is set to 0.99 and the soft-updating factor to 0.001. The source code of DDPG is available on GitHub [keras-rl]. The keras-rl module implements many open-source algorithms on top of the OpenAI Gym interface. OpenAI Gym is a toolkit for developing and comparing reinforcement learning algorithms; it provides many different environments through a simple open-source interface to reinforcement learning tasks. We do not really need the environments themselves, only the interface they implement. We provide our own environment by defining the interface methods reset, which resets the environment to an initial (possibly random) state; sample, which samples an action a_t from the action space for the random strategy used in the warm-up process; step, which takes an action a_t as input, changes the state of the environment, and returns the new observation o_t, the reward r_t, a done flag, and other information; and a few other interfaces. step is the most important part: it defines the behavior and the reward of the environment.
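The following is a minimal sketch of such an environment for the parametric objective f(θ, w), written against the OpenAI Gym interface described above. The class name, episode length, and the way w is resampled per episode are our assumptions, not the paper's exact implementation.

    import numpy as np
    import gym
    from gym import spaces

    class ParametricOptEnv(gym.Env):
        # State is the current iterate theta; the action is the update added to it.
        def __init__(self, f, theta_dim, w_dim, sample_w, max_steps=100):
            self.f, self.sample_w, self.max_steps = f, sample_w, max_steps
            self.theta_dim = theta_dim
            self.action_space = spaces.Box(-1.0, 1.0, shape=(theta_dim,), dtype=np.float32)
            self.observation_space = spaces.Box(-np.inf, np.inf,
                                                shape=(theta_dim + w_dim,), dtype=np.float32)

        def reset(self):
            self.w = self.sample_w()                  # new problem instance per episode
            self.theta = np.random.randn(self.theta_dim)
            self.t = 0
            return np.concatenate([self.theta, self.w]).astype(np.float32)

        def step(self, action):
            prev_loss = self.f(self.theta, self.w)
            self.theta = self.theta + action          # apply the proposed update
            loss = self.f(self.theta, self.w)
            self.t += 1
            obs = np.concatenate([self.theta, self.w]).astype(np.float32)
            return obs, prev_loss - loss, self.t >= self.max_steps, {}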
B. Quadratic function

A quadratic function is convex and can be expressed as:

    f(θ; W, b) = ||W θ + b||₂²                                     (4)
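A sketch of this function family and of the Gaussian parameter sampling described below; the packing of W and b into a single vector w is our convention for use with the environment sketched earlier.

    import numpy as np

    def quadratic_loss(theta, w):
        # f(theta; W, b) = ||W theta + b||_2^2 for the 8-dimensional case (Eq. 4)
        W, b = w[:64].reshape(8, 8), w[64:]
        r = W @ theta + b
        return float(r @ r)

    def sample_w():
        # W (8x8) and b (8) drawn from a standard Gaussian, flattened into one vector
        return np.random.randn(72)

    # hypothetical wiring with the environment sketch above:
    # env = ParametricOptEnv(quadratic_loss, theta_dim=8, w_dim=72, sample_w=sample_w)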
In this experiment, we trained a simple optimizer for 8-dimensional quadratic functions specified by parameters W and b with dimensions 8 × 8 and 8, respectively. The optimizer was trained on this function family, with W and b drawn from a standard Gaussian distribution for each epoch of the training process, and tested on random parameters sampled from the same distribution.

First, we generated 500 samples with a gradient descent algorithm to train the regression models and split them into two parts: 300 for training and 200 for testing. Every sample takes nearly 200 steps to find the minimum. We use random forest [16], SVR [18], and KNN [17] as regression models for comparison. For this simple convex problem, it is easy to generate the dataset and fit the curve. We trained the three regression models on the training set and tested them on the test set; the results are shown in Fig. 4.

Figure 4. Regression models' performance on the test dataset.
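A sketch of this supervised baseline, assuming arrays ws (the sampled parameter vectors) and thetas (the corresponding optima found by gradient descent); the hyperparameters are library defaults, not the paper's settings.

    from sklearn.ensemble import RandomForestRegressor
    from sklearn.multioutput import MultiOutputRegressor
    from sklearn.neighbors import KNeighborsRegressor
    from sklearn.svm import SVR

    # ws: (500, 72) problem parameters, thetas: (500, 8) optimal solutions (assumed given)
    X_train, X_test = ws[:300], ws[300:]
    y_train, y_test = thetas[:300], thetas[300:]

    models = {
        "RandomForest": RandomForestRegressor(),
        "SVR": MultiOutputRegressor(SVR()),     # SVR is single-output, so wrap it
        "KNN": KNeighborsRegressor(),
    }
    for name, model in models.items():
        model.fit(X_train, y_train)
        print(name, model.score(X_test, y_test))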
The convergence curves are shown in Fig. 5. For a quantitative comparison, we define two metrics: the number of steps the other algorithms take to reach the point at which our algorithm has already converged, and their final loss relative to ours. The relative loss is defined as follows:

    relative loss = (loss*_final − loss^DDPG_final) / loss_init            (5)

where loss*_final is the convergent loss of the other algorithm and loss_init is the initial loss.

Figure 5. Quadratic function convergence curve.

Table I
EXPERIMENT RESULT OF QUADRATIC FUNCTION

(min / mean / max)          Steps                Relative Loss
Reinforcement Learning      1 / 1 / 1            0 / 0 / 0
Gradient Descent            10 / 23 / 36         -0.07 / -0.02 / -0.003
Differential Evolution      57 / 294 / 527       -0.07 / -0.02 / -0.003
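A small helper for Eq. (5), assuming recorded loss curves for each run and that both runs start from the same initial loss (our reading of loss_init):

    def relative_loss(other_curve, ddpg_curve):
        # Eq. (5): (loss*_final - loss^DDPG_final) / loss_init
        return (other_curve[-1] - ddpg_curve[-1]) / other_curve[0]

    # e.g. relative_loss(de_losses, ddpg_losses) for a differential-evolution run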
C. Non-convex function

The Ackley function, which has many local minima, is a classic non-convex function. Fig. 6 shows the 2-dimensional Ackley function.

Figure 6. 2-dimensional Ackley function.

    f(θ, w1, w2; w) = −20 exp(−0.2 sqrt((1/n) Σ_{i=1}^{n} w1_i (θ_i − g(w))²))
                      − exp((1/n) Σ_{i=1}^{n} w2_i cos(2π (θ_i − g(w)))) + 20 + e      (6)
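A sketch of this parameterized Ackley family in Python; we treat the shift g(w) as a value supplied by the caller, since its exact form is not spelled out here.

    import numpy as np

    def ackley(theta, w1, w2, shift):
        # Parameterized Ackley function of Eq. (6); `shift` stands in for g(w).
        n = theta.size
        d = theta - shift
        term1 = -20.0 * np.exp(-0.2 * np.sqrt(np.sum(w1 * d**2) / n))
        term2 = -np.exp(np.sum(w2 * np.cos(2.0 * np.pi * d)) / n)
        return term1 + term2 + 20.0 + np.e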

Technically, our algorithm is independent of the complexity of the objective function. In other words, we can always find the optimal solution no matter how many local minima the objective function has. In contrast, the momentum algorithm, which relies on the gradient, can barely find the global minimum, and the differential evolution algorithm may only find a local minimum in some cases.

We also compare reinforcement learning with the regression models on the test dataset (Fig. 7). The convergence curves are shown in Fig. 8.

Figure 7. Ackley regression models' performance on the test dataset.

Figure 8. Ackley function convergence curve.

Table II
EXPERIMENT RESULT OF ACKLEY FUNCTION

(min / mean / max)          Steps                 Relative Loss
Reinforcement Learning      1 / 1 / 1             0 / 0 / 0
Gradient Descent            -                     0.15 / 0.85 / 0.57
Differential Evolution      833 / 1044 / 1344     -0.07 / -0.03 / 0.13
D. SRAM

In the production process of SRAM, the performance of the SRAM depends on the process level, the process parameters w, and the environment variables θ. The device no longer works when the environment variables, which are uncontrollable, go beyond a specified margin γ. In this task, we need to find the maximum radius of the circular region in which the parameters are feasible, i.e.:

    min_θ ||θ||₂    s.t.  f(θ, w) ≤ γ                              (7)

where f is the Static Noise Margin (SNM) of the SRAM. The device does not work when the constraint above is satisfied. In order to optimize this, we transfer the constrained convex problem into an unconstrained optimization problem:

    min_θ ||θ||₂ + C · 1(f(θ, w) − γ)                              (8)

where C is a penalty factor, 1(x) = x when x > 0, and 1(x) = 0 for x ≤ 0.
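A sketch of the penalized objective of Eq. (8), where snm(theta, w) stands for one SPICE evaluation of the static noise margin; the value of C is an arbitrary choice for illustration.

    import numpy as np

    def penalized_objective(theta, w, snm, gamma, C=100.0):
        # ||theta||_2 + C * max(f(theta, w) - gamma, 0)   (Eq. 8)
        violation = snm(theta, w) - gamma
        return np.linalg.norm(theta) + C * max(violation, 0.0)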
It is inevitable to call a SPICE simulator to evaluate f(θ, w) during the optimization, which is very time-consuming. Moreover, since the simulator is a black box, gradient-based optimization algorithms are infeasible. Conventional algorithms such as differential evolution (DE) usually require calling the simulator hundreds to thousands of times to optimize a single problem. In our experiments, there are 18-dimensional environment variables and 3-dimensional parameters. We run the DE algorithm 20 times with different parameters and use a well-trained agent to optimize the same parameters for comparison.

We generated only 20 samples in the parameter space. Every sample called the simulator nearly 2,000-3,000 times to optimize, while we only need to train our model for 10,000 steps (in fact it converged quickly). For a more complex problem, we would have to generate more samples, taking O(n) time, whereas using the trained model still takes O(1) time.

The convergence curves are shown in Fig. 9; the x-axis is logarithmic. Our trained model takes 1-2 steps to find the optimal solution, while the DE algorithm takes nearly a thousand steps.

Figure 9. SRAM convergence curve.

Figure 10. Regression models' performance on the test dataset.

Table III
EXPERIMENT RESULT OF SRAM

(min / mean / max)          Steps                Relative Loss
Reinforcement Learning      1 / 1 / 1            0 / 0 / 0
Differential Evolution      182 / 343 / 594      -0.27 / -0.13 / -0.07

V. CONCLUSION

A family of optimization problems can be parameterized, and we can build a regression model that fits the parameters to the optimal solutions. Conventional supervised learning needs enough samples to train such a model, while our trained model can do this work in O(1) time and achieves good performance.
REFERENCES

[1] Das S, Suganthan P N. Differential evolution: a survey of the state-of-the-art. IEEE Transactions on Evolutionary Computation, 2011, 15(1):4-31.

[2] Storn R. On the usage of differential evolution for function optimization. In: Fuzzy Information Processing Society, 1996. NAFIPS. 1996 Biennial Conference of the North American. IEEE, 1996:519-523.

[3] Vesterstrom J, Thomsen R. A comparative study of differential evolution, particle swarm optimization, and evolutionary algorithms on numerical benchmark problems. In: Evolutionary Computation, 2004. CEC2004. Congress on. IEEE, 2004:1980-1987, Vol. 2.

[4] Andrychowicz M, Denil M, Gomez S, et al. Learning to learn by gradient descent by gradient descent. 2016.

[5] Li K, Malik J. Learning to Optimize. 2016.

[6] Li K, Malik J. Learning to Optimize Neural Nets. 2017.

[7] Lillicrap T P, Hunt J J, Pritzel A, et al. Continuous control with deep reinforcement learning. Computer Science, 2015, 8(6):A187.

[8] Mnih V, Kavukcuoglu K, Silver D, et al. Human-level control through deep reinforcement learning. Nature, 2015, 518(7540):529.

[9] Duan Y, Chen X, Houthooft R, et al. Benchmarking deep reinforcement learning for continuous control. 2016:1329-1338.

[10] Balduzzi D, Ghifary M. Compatible value gradients for reinforcement learning of continuous deep policies. Computer Science, 2015, 8(6):A187.

[11] Silver D, Lever G, Heess N, et al. Deterministic policy gradient algorithms. In: International Conference on Machine Learning. JMLR.org, 2014:387-395.

[12] Heess N, Wayne G, Silver D, et al. Learning continuous control policies by stochastic value gradients. 2015:2944-2952.

[13] Peters J, Schaal S. Reinforcement learning of motor skills with policy gradients. Neural Networks, 2008, 21(4):682-697.

[14] Bakker B. Reinforcement learning with long short-term memory. In: International Conference on Neural Information Processing Systems: Natural and Synthetic. MIT Press, 2001:1475-1482.

[15] Chollet F, and others. Keras. GitHub repository, 2015. https://github.com/fchollet/keras

[16] Breiman L. Random forests. Machine Learning, 2001, 45(1):5-32.

[17] Fallah A, et al. Forest attribute imputation using machine-learning methods and ASTER data: comparison of k-NN, SVR and random forest regression algorithms. International Journal of Remote Sensing, 2012, 33(19):6254-6280.

[18] Hong W C. Electric load forecasting by support vector model. Applied Mathematical Modelling, 2009, 33(5):2444-2454.

[19] Silver D, Lever G, Heess N, et al. Deterministic policy gradient algorithms. In: International Conference on Machine Learning. JMLR.org, 2014:387-395.
