# Algorithm for Achieving Minimum Energy Consumption in CMOS Circuits Using Multiple Supply and Threshold Voltages at the Module Level 

Yuvraj Singh Dhillon, Abdulkadir Utku Diril, Abhijit Chatterjee, and Hsien-Hsin Sean Lee School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, GA 30332 USA


#### Abstract

This paper proposes an optimum methodology for assigning supply and threshold voltages to modules in a CMOS circuit such that the overall energy consumption is minimized for a given delay constraint. The modules of the circuit should have large enough gate depths such that the delay and energy penalties of the level shifters connecting them are negligible. Both static and dynamic energy are considered in the optimization. Energy savings of up to $48 \%$ have been achieved on various example circuits. The first step in the optimization finds optimum supply and threshold voltages for each module in the circuit. If the circuit has a large number of modules, this step might yield a correspondingly large number of different supply and threshold voltages for minimum energy consumption. Since having a large number of different supply and threshold voltages on an IC is not feasible in current technologies, an additional step clusters the multiple voltages obtained from the first step into a fixed number of supply and threshold voltages (for example, 2 different supply voltages and 2 different threshold voltages). In addition to the application of this method to circuit optimization, it can also be applied to a wide range of problems with delay constraints, such as software tasks running on a dynamically variable $V_{D D}$ and $V_{\text {th }}$ processor.


## 1. Introduction

Energy consumption is recognized as one of the most important parameters in designing modern portable electronic systems. Dynamic energy has been the main component of total energy since it is proportional to the square of $\mathrm{V}_{\mathrm{DD}}$. However, with the shrinking of device sizes and reduction of supply voltages, static energy has become as important as dynamic energy. To obtain high gate overdrive $\left(\mathrm{V}_{\mathrm{DD}}-\mathrm{V}_{\mathrm{th}}\right)$ for high speeds of operation, $\mathrm{V}_{\text {th }}$ is also decreased as $\mathrm{V}_{\mathrm{DD}}$ is decreased. The decrease in threshold voltage increases the leakage current exponentially, which makes static energy consumption more significant in every new technology generation. Therefore, it has become essential to consider both supply and threshold voltage in any circuit optimization for low-energy consumption.

There has been significant research in the usage of dynamic supply voltage scaling [3, 4]; static assignment of different supply voltages to a system [1, 5, 6]; static multithreshold voltage systems [7, 8, 16]; and dynamic threshold voltage scaling [9] for energy minimization. The optimization procedure that we propose in this paper can be used in all of the above scenarios. It can be used to determine the optimal supply and threshold voltages for tasks executing on a variable voltage processor; or it can be used to statically set supply and threshold voltages for a system, given the probable switching activity and timing requirements. The benefit of our method is
that it considers both supply and threshold voltages simultaneously, an idea which is gaining importance as a means of saving energy without sacrificing performance [15]. Another important contribution of this work is that it gives a metric which can be used by circuit designers to test how close the energy consumption of their design is to the minimum possible. We use this metric to determine the stopping conditions of our optimization algorithm. If an unlimited number of supply and threshold voltages are available, the proposed algorithm is optimum in the sense that no other voltage assignment for the given modules will give lower energy consumption for the given delay constraint.

The complete procedure has two steps. The first step finds optimum supply and threshold voltage values for CMOS modules in a digital circuit that minimizes the total energy consumption. Considering a circuit as composed of modules allows energy optimization of much larger circuits than is possible with gate-level optimization algorithms. This is due to the significant reduction of problem complexity. We find the exact conditions for minimum energy using the Lagrange Multiplier Method. Then we iteratively find the supply and threshold voltage values for each module that satisfy the minimum energy condition. This step of the algorithm converges to an exact solution in a small number of iterations for a large and varied set of problems. If it is technologically feasible to assign the optimum (and perhaps all different) supply and threshold voltages to all the modules, then we stop here. Otherwise we continue to the next step.

The second step clusters the multiple voltages obtained from the first step into a fixed number of supply and threshold voltages (for example, 2 different supply voltages and 2 different threshold voltages). This step results in a feasible implementation of the system in current technologies. The organization of the paper is as follows. In Section 2, we provide the theoretical background of the paper. In Section 3, we explain the Lagrange Multiplier based optimization procedure. In Sections 4 and 5, we explain the algorithms for the first and second steps of the procedure. Section 6 shows experimental results. Finally, Section 7 concludes.

## 2. Theoretical background

In order to formulate the optimization, we need equations for delay, dynamic energy, and static energy in terms of $V_{D D}$ and $\mathrm{V}_{\mathrm{th}}$.

For a CMOS circuit, delay can be approximated as being proportional to $\mathrm{V}_{\mathrm{DD}} /\left(\mathrm{V}_{\mathrm{DD}}-\mathrm{V}_{\mathrm{th}}\right)^{\alpha}$ [2].

$$
\begin{equation*}
d_{i}=\frac{k_{0 i} \cdot V_{D D i}}{\left(V_{D D i}-V_{t h i}\right)^{\alpha}} \tag{1}
\end{equation*}
$$

Here $\mathrm{d}_{\mathrm{i}}$ is the delay of the $\mathrm{i}^{\text {th }}$ module, $\mathrm{k}_{0 \mathrm{i}}$ is a constant for that module, $\mathrm{V}_{\mathrm{DDi}}$ and $\mathrm{V}_{\text {thi }}$ are the supply voltage and threshold
voltage, respectively applied to that module, and $\alpha$ is the velocity saturation coefficient. For current technologies, $\alpha$ is between 1.2 and 1.5. The delay constant ( $\mathrm{k}_{0 \mathrm{i}}$ ) includes the effects of process, device sizes, load capacitance, and gate depth in that module. Gate depth can be included in the constant because of the additive characteristics of delays.

Having formed an equation for module delay, we now form an equation for the dynamic energy consumption in terms of supply voltage and threshold voltage. We can write the dynamic energy consumed in a module as:

$$
\begin{equation*}
E_{d i}=0.5 \cdot C_{i} \cdot V_{D D i}^{2} \tag{2}
\end{equation*}
$$

Here $\mathrm{C}_{\mathrm{i}}$ is the term for all the capacitances that are switched during operation of the $i^{\text {th }}$ module including possible multiple switching of some nodes. To simplify the derivation, we rewrite dynamic energy as:

$$
\begin{equation*}
E_{d i}=k_{1 i} \cdot V_{D D i}^{2} \tag{3}
\end{equation*}
$$

where $\mathrm{k}_{1 \mathrm{i}}$ stands for the circuit, process, and application dependent terms including switching activity. An average value for total switching activity in the module can be found by running several different tasks on the module and averaging the switching activity results. Short circuit power dissipation can also be included in $\mathrm{k}_{1 \mathrm{i}}$ because of the quadratic dependence of short-circuit power dissipation to $\mathrm{V}_{\mathrm{DD}}$ [11].

For static energy consumption, we use a generalized model.

$$
\begin{array}{r}
E_{s i}=E_{s u b i}+E_{g a t e i}=k_{2 i} \cdot V_{D D i} \cdot e^{k_{3 i} \cdot V_{D D i} \cdot k_{4 i} \cdot V_{t i i}} \cdot T_{i}  \tag{4}\\
+k_{5 i} \cdot V_{D D i} \cdot e^{k_{6 i} \cdot V_{D D i}-k_{7 i} \cdot V_{t l i}} \cdot T_{i}
\end{array}
$$

$\mathrm{E}_{\text {subi }}$ stands for the sub-threshold leakage component of the static energy consumption. This component is strongly influenced by the threshold voltage. $\mathrm{E}_{\text {gatei }}$ stands for the gate leakage component of the static energy. This component is much smaller than $\mathrm{E}_{\text {subi }}$ when $\mathrm{V}_{\text {th }}$ is small. When $\mathrm{V}_{\mathrm{th}}$ is large, the main contribution to the static energy comes from this component. Also, gate leakage increases as the gate oxide thickness becomes smaller. $T_{i}$ is the period for which the circuit is idle. $\mathrm{k}_{2 \mathrm{i}}$ and $\mathrm{k}_{5 \mathrm{i}}$ are circuit-dependent parameters. $\mathrm{k}_{3 \mathrm{i}}$, $\mathrm{k}_{4 \mathrm{i}}, \mathrm{k}_{6 \mathrm{i}}$, and $\mathrm{k}_{7 \mathrm{i}}$ are process-dependent parameters. The values of the process-dependent parameters can be found by fitting SPICE simulation results of a simple gate to Equation 4. The values for these parameters can be used for any circuit designed in the same technology.

Given these approximate models for delay and energy in terms of supply and threshold voltages, we state the energyoptimization problem for a digital system (assumed given to us) consisting of N modules and P paths from primary inputs to primary outputs as follows:

$$
\text { Minimize } \sum_{i=1}^{N} E_{i} \text { under the constraints } \sum_{i \in P_{j}} d_{i} \leq T_{d} \text { for }
$$

all paths $P_{j}$ where $E_{i}=E_{d i}+E_{s i}, T_{d}$ is the time constraint and the variables are $\mathrm{V}_{\mathrm{DDi}}$ and $\mathrm{V}_{\text {thi }}$ for each module. Note that we obtain the time constraint, $\mathrm{T}_{\mathrm{d}}$, for the optimized circuit from the initial delays of the modules of the unoptimized circuit.

In the following sections, we will consistently use i to index the modules and $j$ to index the paths. An example circuit for $\mathrm{N}=4$ and $\mathrm{P}=2$ is given in Figure 1. In the next section, we derive the conditions on $\mathrm{V}_{\mathrm{DDi}}$ and $\mathrm{V}_{\text {thi }}$ for minimum energy consumption.


Figure 1. An example circuit with 4 modules and 2 paths ( $\mathrm{N}=4, \mathrm{P}=2$ ).

## 3. Lagrange Multiplier based optimization

Consider a system of N modules and P paths from primary inputs to primary outputs. We form a binary matrix, $\mathbf{A}$, of P rows and N columns as follows:

$$
\begin{align*}
A_{j i} & =1 \text { if Module i lies on Path } j  \tag{5}\\
& =0 \text { otherwise }
\end{align*}
$$

For example, the $\mathbf{A}$ matrix corresponding to the circuit in Figure $1(\mathrm{~N}=4, \mathrm{P}=2)$ is:

$$
A=\left[\begin{array}{llll}
1 & 0 & 1 & 1  \tag{6}\\
0 & 1 & 1 & 1
\end{array}\right]_{2 \times 4}
$$

If a path $P_{u}$ is a subset of $P_{v}$ (i.e. all modules on $P_{u}$ also lie on $P_{v}$ ), then the row of $\mathbf{A}$ corresponding to $P_{u}\left(\operatorname{Row}_{u}(\mathbf{A})\right)$ is removed from $\mathbf{A}$ to reduce unnecessary computation. For the rest of the paper, we assume that the resulting $\mathbf{A}$ matrix has more columns than rows (i.e. $\mathrm{N}>\mathrm{P}$ ).

The total energy consumed by the system is given by:

$$
\begin{align*}
E_{\text {total }} & =\left(E_{s 1}+E_{d 1}\right)+\left(E_{s 2}+E_{d 2}\right)+\cdots+\left(E_{s N}+E_{d N}\right)  \tag{7}\\
& =E_{1}+E_{2}+\cdots+E_{N}
\end{align*}
$$

The initial delay of each path in the system is given by:

$$
\begin{equation*}
T_{j}=\sum_{i \in P_{j}} d_{i} \quad \text { for all } j \tag{8}
\end{equation*}
$$

We can represent the above equation in vector form as follows:

$$
\begin{equation*}
\bar{T}=A \cdot \bar{d} \tag{9}
\end{equation*}
$$

where $\bar{T}=\left[\mathrm{T}_{1} \mathrm{~T}_{2} \ldots \mathrm{~T}_{\mathrm{P}}\right]^{\mathrm{T}}$ is the vector of path delays and $\bar{d}=$ $\left[d_{1} d_{2} \ldots d_{N}\right]^{T}$ is the vector of module delays. The delay constraint is obtained from the initial delays of the modules as follows:

$$
\begin{equation*}
T_{d}=m a x(\bar{T}) \tag{10}
\end{equation*}
$$

Following is a Lagrange Multiplier formulation with multiple constraints, where the function to minimize is total energy, the constraint for each path j is that its delay, $\mathrm{T}_{\mathrm{j}}$, should be less than $\mathrm{T}_{\mathrm{d}}$, and $\lambda_{\mathrm{j}}$ is the Lagrange Multiplier for the $\mathrm{j}^{\text {th }}$ path.

$$
\begin{equation*}
G\left(V_{D D 1}, V_{t h 1}, \cdots, V_{D D N}, V_{t h N}, \lambda_{1}, \cdots, \lambda_{P}\right)=\sum_{i=1}^{N} E_{i}-\sum_{j=1}^{P} \lambda_{j} \cdot\left[\sum_{i \in P_{j}} d_{i}-T_{j}\right] \tag{11}
\end{equation*}
$$

Then, for minimum energy consumption, we have the following:
$\frac{\partial G\left(V_{D D 1}, V_{t h 1}, \cdots, V_{D D N}, V_{t h N}, \lambda_{1}, \cdots, \lambda_{P}\right)}{\partial V_{D D i}}=0 \quad$ for all $i$
$\frac{\partial G\left(V_{D D 1}, V_{t h 1}, \cdots, V_{D D N}, V_{t h N}, \lambda_{1}, \cdots, \lambda_{P}\right)}{\partial V_{t h i}}=0 \quad$ for all $i$
Equations 12 and 13 become:
$\frac{\partial \mathrm{E}_{\mathrm{i}}}{\partial V_{D D i}}=\mathrm{Row}_{\mathrm{i}}\left(\mathrm{A}^{\mathrm{T}}\right) \cdot \bar{\lambda} \cdot \frac{\partial d_{\mathrm{i}}}{\partial V_{D D i}} \quad$ for all $i$
$\frac{\partial \mathrm{E}_{\mathrm{i}}}{\partial V_{t h i}}=\mathrm{Row}_{\mathrm{i}}\left(\mathrm{A}^{\mathrm{T}}\right) \cdot \bar{\lambda} \cdot \frac{\partial d_{\mathrm{i}}}{\partial V_{t h i}} \quad$ for all $i$
where $\bar{\lambda}=\left[\begin{array}{lll}\lambda_{1} & \lambda_{2} \ldots \lambda_{\mathrm{P}}\end{array}\right]^{\mathrm{T}}$ and $\operatorname{Row}_{\mathrm{i}}\left(\mathrm{A}^{\mathrm{T}}\right)$ refers to the $\mathrm{i}^{\text {th }}$ row of $\mathrm{A}^{\mathrm{T}}$. We define two vectors, the Constant Threshold Energy Gradient Vector ( $\overline{C T E G}$ ) and the Constant Supply Energy Gradient Vector ( $\overline{C S E G}$ ), as follows:
$\overline{C T E G}=\left[\frac{\partial \mathrm{E}_{1}}{\partial V_{D D 1}} / \frac{\partial d_{1}}{\partial V_{D D 1}} \frac{\partial \mathrm{E}_{2}}{\partial V_{D D 2}} / \frac{\partial d_{2}}{\partial V_{D D 2}} \cdots \frac{\partial \mathrm{E}_{\mathrm{N}}}{\partial V_{D D N}} / \frac{\partial d_{\mathrm{N}}}{\partial V_{D D N}}\right]^{\mathrm{T}}$
and $\overline{\operatorname{CSEG}}=\left[\begin{array}{llll}\frac{\partial \mathrm{E}_{1}}{\partial V_{t h 1}} / \frac{\partial d_{1}}{\partial V_{t h 1}} & \frac{\partial \mathrm{E}_{2}}{\partial V_{t h 2}} / \frac{\partial d_{2}}{\partial V_{t h 2}} & \cdots & \frac{\partial \mathrm{E}_{\mathrm{N}}}{\partial V_{t h N}} / \frac{\partial d_{\mathrm{N}}}{\partial V_{t h \mathrm{~N}}}\end{array}\right]^{\mathrm{T}}$
Following are the equations for the partial derivatives of the energy function, $\mathrm{E}_{\mathrm{i}}$. These equations are obtained using Equations 3 and 4. In these equations, we assume that each module is active for a small amount of time compared to the total deadline $T_{d}$. Then, we may write $T_{i} \approx T_{d}$ in Equation 4.

$$
\begin{align*}
& \frac{\partial \mathrm{E}_{\mathrm{i}}}{\partial V_{D D i}}=2 \cdot k_{1 \mathrm{i}} \cdot V_{D D i} \\
& +T_{d} \cdot V_{D D i} \cdot k_{2 i} \cdot k_{3 i} \cdot e^{\left(k_{3 i} \cdot V_{D D}-k_{4 i} \cdot V_{1 i t}\right)}  \tag{18}\\
& +T_{d} \cdot V_{D D i} \cdot k_{5 i} \cdot k_{6 i} \cdot e^{\left(k_{6 i} \cdot V_{D D i}-k_{7 i} \cdot V_{k i t}\right)} \\
& +T_{d} \cdot\left(k_{2 i} \cdot e^{\left(k_{3 i} \cdot V_{D D i}-k_{4 i} \cdot V_{t i t}\right)}+k_{5 i} \cdot e^{\left(k_{6 i} \cdot V_{D D}-k_{i j} \cdot V_{k i t}\right)}\right) \\
& =\frac{2 \cdot E_{\text {di }}}{V_{D D i}}+k_{3 i} \cdot E_{\text {subi }}+k_{6 i} \cdot E_{\text {gatei }}+\frac{E_{\text {subi }}+E_{\text {gatei }}}{V_{D D i}} \\
& \frac{\partial d_{i}}{\partial V_{D D i}}=k_{0 i} \cdot \frac{\left(V_{D D i}-V_{t u}\right)^{\alpha}-\alpha \cdot\left(V_{D D i}-V_{t u i}\right)^{\alpha-1} \cdot V_{D D i}}{\left(V_{D D i}-V_{t w i}\right)^{2, \alpha}}  \tag{19}\\
& =-\frac{d_{i} \cdot\left((\alpha-1) \cdot V_{D D i}+V_{t h i}\right)}{V_{D D i} \cdot\left(V_{D D i}-V_{t w i}\right)} \\
& \frac{\partial \mathrm{E}_{\mathrm{i}}}{\partial V_{t h i}}=-k_{2 i} \cdot k_{4 i} \cdot e^{\left(k_{3 i} \cdot V_{D i i}-k_{4 i} \cdot V_{t i t}\right)} \cdot T_{d} \cdot V_{D D i} \\
& -k_{5 i} \cdot k_{7 i} \cdot e^{\left(k_{6 i} \cdot V_{D D i}-k_{7 i} \cdot V_{t i i}\right)} \cdot T_{d} \cdot V_{D D i}  \tag{20}\\
& =-k_{4 i} \cdot E_{\text {subi }}-k_{7 i} \cdot E_{\text {gatei }} \\
& \frac{\partial d_{\mathrm{i}}}{\partial V_{t h i}}=\frac{\alpha \cdot k_{0 i} \cdot V_{D D i}}{\left(V_{D D i}-V_{t h i}\right)^{\alpha+1}}=\frac{\alpha \cdot d_{i}}{\left(V_{D D i}-V_{t h i}\right)} \tag{21}
\end{align*}
$$

Finally, from Equations 18, 19, 20, and 21, we get:

CTEG $_{i}=\frac{2 \cdot E_{d i}+V_{D D i} \cdot\left(k_{3 i} \cdot E_{\text {subi }}+k_{6 i} \cdot E_{\text {gatei }}\right)+E_{\text {subi }}+E_{\text {gatei }}}{d_{i} \cdot\left((\alpha-1) V_{D D i}+V_{t h i}\right) /\left(V_{D D i}-V_{t h i}\right)}$
$\operatorname{CSEG}_{i}=\frac{k_{4 i} \cdot E_{\text {subi }}+k_{7 i} \cdot E_{\text {gatei }}}{\alpha \cdot d_{i}} \cdot\left(V_{D D i}-V_{\text {thi }}\right)$
Using Equations 16, 17, 22, Equations 14 and 15 can now be written concisely as follows:

> Minimum EnergyCondition:
> $\overline{C T E G}=\overline{C S E G}=A^{T} \cdot \bar{\lambda}$

We see that the initial energy optimization problem involving 2 N variables ( $\mathrm{V}_{\mathrm{DD}}$ and $\mathrm{V}_{\mathrm{th}}$ for each module) and a delay constraint has now been simplified to the form in Equation 23. We now need to solve N independent equations ( CTEG ${ }_{i}=$ CSEG $_{i}=$ Row $\left._{i}\left(A^{T}\right) \cdot \bar{\lambda}\right)$ in 2 variables $\left(V_{D D i}\right.$ and $\mathrm{V}_{\mathrm{th}}$ ). However, doing this is not trivial since the Lagrange Multiplier Vector, $\bar{\lambda}$, is unknown. In the next section, we propose an iterative gradient search algorithm that yields a solution to this problem in a small number of iterations. After every iteration, the condition in Equation 23 will be used to check if minimum energy is achieved.

## 4. Gradient search algorithm for the optimization problem

We use an iterative algorithm to fulfill the conditions of Equation 23. The inputs to the algorithm are the initial parameters of all the N modules, such as the $\mathrm{V}_{\mathrm{DDi}} \mathrm{S}$, the $\mathrm{V}_{\text {this }} \mathrm{S}$, the module delays $\left(\mathrm{d}_{\mathrm{i}} \mathrm{s}\right)$ and the circuit- and process-dependent parameters $\mathrm{k}_{0 \mathrm{i}} \mathrm{S}, \mathrm{k}_{1 \mathrm{i}} \mathrm{s}, \mathrm{k}_{2 \mathrm{i}} \mathrm{s}, \mathrm{k}_{3 \mathrm{i}} \mathrm{s}, \mathrm{k}_{4 \mathrm{i}} \mathrm{s}, \mathrm{k}_{5 \mathrm{i}} \mathrm{S}, \mathrm{k}_{6 \mathrm{i}} \mathrm{S}$ and $\mathrm{k}_{7 \mathrm{i}} \mathrm{S}$.

To solve $\mathrm{CTEG}_{\mathrm{i}}=\operatorname{CSEG}_{\mathrm{i}}$ for the $\mathrm{i}^{\text {th }}$ module, we fix the delay, $\mathrm{d}_{\mathrm{i}}$, for that module. Then we can write $\mathrm{V}_{\text {thi }}$ in terms of $\mathrm{V}_{\mathrm{DDi}}$ using Equation 1. This makes $\mathrm{CTEG}_{\mathrm{i}}$ and $\mathrm{CSEG}_{\mathrm{i}}$ functions of $\mathrm{V}_{\mathrm{DDi}}$ only and the equation $\mathrm{CTEG}_{\mathrm{i}}=\mathrm{CSEG}_{\mathrm{i}}$ can be solved easily (we use MATLAB's FZERO function) to get $\mathrm{V}_{\mathrm{DDi}}$ and $\mathrm{V}_{\text {thi }}$ values. Then with these $\mathrm{V}_{\mathrm{DDi}}$ and $\mathrm{V}_{\text {thi }}$ values, we can find the energy consumption of that module (this will be the optimum energy consumption for that module, for the given delay, $d_{i}$ ). Hence, we can consider energy consumption of a module as a function of delay for that module. In vector form, we can write:
$E_{\text {total }}=\operatorname{sum}(\bar{E})=\mathrm{E}(\bar{d})$
where $\bar{E}=\left[\mathrm{E}_{1} \mathrm{E}_{2} \ldots \mathrm{E}_{\mathrm{N}}\right]^{\mathrm{T}}$.
First, we get intermediate values for module delays, $\bar{d}_{\text {int }}$, that make all path delays as close to $\mathrm{T}_{\mathrm{d}}$ as possible. This step also makes sure that all modules have zero slack, so that we have an optimal starting point. A method similar to the Zero Slack Algorithm [13] is used in this step. Let $\bar{T}_{\text {int }}=A \cdot \bar{d}_{\text {int }}$ be the vector of path delays after this step.

Next, we minimize $\mathrm{E}_{\text {total }}$ by doing a gradient search on the delay vector, $\bar{d}$. But, the delay vector is constrained due to the path delay constraints $\left(A \cdot \bar{d}=\bar{T}_{i n t}\right)$. So, in every iteration, we vary $\bar{d}$ by adding $\bar{\Delta}$ such that $A \cdot \bar{\Delta}=0$. This choice of $\bar{\Delta}$ satisfies the constraints as shown below:
$A \cdot \bar{d}=A \cdot\left(\bar{d}_{\text {int }}+\bar{\Delta}\right)=A \cdot \bar{d}_{\text {int }}+A \cdot \bar{\Delta}=\bar{T}_{\text {int }}$
In other words, $\bar{\Delta}$ has to lie in the nullspace of $\mathbf{A}$.


The delay vector, $\bar{d}_{\text {new }}$, for a new iteration is obtained from the current delay vector, $\bar{d}_{\text {curr }}$, as follows:
$\bar{d}_{\text {new }}=\bar{d}_{\text {curr }}+k \cdot \bar{\nabla}_{A} E_{\text {total }}$
where $\bar{\nabla}_{A} E_{\text {total }}$ is the gradient of $E_{\text {total }}$ along the nullspace vectors of $\mathbf{A} . \mathrm{k}$ is chosen in such a way that the new energy $\left(E\left(\bar{d}_{\text {new }}\right)\right)$ is minimum in the direction of gradient vector. We now derive the stopping condition for the gradient search.

The condition $\overline{C T E G}=\overline{C S E G}$ (Equation 23) is satisfied in every iteration of the search. To check how close $\overline{C T E G}$ (or $\overline{C S E G}$ ) is to $A^{T} \cdot \bar{\lambda}$, we define a Metric_cost_fn as follows:
Metric_cost_fin $=$ norm $\left(A^{T} \cdot A^{T \uparrow} \cdot \overline{C T E G}-\overline{C T E G}\right) /$ norm $(\overline{\text { CTEG }})$ (27)
where $A^{T^{\dagger}}$ is the pseudo-inverse of $A^{T}$. At the minimum energy point, Metric_cost_fn should be zero as shown below:


Figure 3. Algorithm for clustering
$\overline{C T E G}_{m i n}=A^{T} \cdot \bar{\lambda}$
Metric_cost_fn $=\operatorname{norm}\left(A^{T} \cdot A^{T \dagger} \cdot \overline{\operatorname{CTEG}}_{\text {nin }}-\overline{\operatorname{CTEG}}_{\text {min }}\right) / \operatorname{norm}\left(\overline{\text { CTEG }}_{\text {min }}\right)$

$$
\begin{aligned}
& =\operatorname{norm}\left(A^{T} \cdot A^{T \dagger} \cdot A^{T} \cdot \bar{\lambda}-A^{T} \cdot \bar{\lambda}\right) / \operatorname{norm}\left(A^{T} \cdot \bar{\lambda}\right) \\
& =\operatorname{norm}\left(A^{T} \cdot \bar{\lambda}-A^{T} \cdot \bar{\lambda}\right) / \operatorname{norm}\left(A^{T} \cdot \bar{\lambda}\right) \\
& =0
\end{aligned}
$$

Designers can use Metric_cost_fn to determine how close their design is to the optimum.

In our algorithm, we terminate the iterations when Metric cost fn goes below $10^{-3}$. The overview of the optimization algorithm is given in the flowchart in Figure 2.

## 5. Clustering heuristic for limited number of supply and threshold voltages

The algorithm described in Section 4 yields optimum values of supply and threshold voltages for each module that minimize the overall circuit energy. But these voltages might all have different values, in which case a practical
implementation of the optimized circuit is difficult in current technologies. In this section, we propose a heuristic algorithm that clusters the optimum supply and threshold voltage values obtained into a limited number of supply and threshold voltages. The final solution meets the delay constraint at the expense of slightly higher total energy consumption than the optimum case.

Assume only n supply voltage planes and m threshold voltages are available $(\mathrm{n}<\mathrm{N}, \mathrm{m}<\mathrm{N})$. Note that the values of the available voltages are not fixed at the beginning, although their number is fixed. Let $\bar{V}_{D D_{-} o p t}$ and $\bar{V}_{t h_{-} o p t}$ be the optimum supply and threshold voltage values (obtained in the previous section), respectively. Let $\bar{V}_{D D_{-} n}$ and $\bar{V}_{t h_{-} m}$ be supply and threshold voltage vectors holding values for the limited number of supply and threshold voltages ( n supply voltages, m threshold voltages) initialized as follows:

$$
\begin{array}{ll}
\bar{V}_{D D_{-} n}(p)=\min \left(\bar{V}_{D D_{-} \text {opt }}\right)+\left(\frac{\max \left(\bar{V}_{D D_{-} \text {opt }}\right)-\min \left(\bar{V}_{D D_{-o p t}}\right)}{n+1}\right) \cdot p & \text { for } p=1 \cdots n \\
\bar{V}_{t l_{-} m}(q)=\min \left(\bar{V}_{t t_{-} \text {opt }}\right)+\left(\frac{\max \left(\bar{V}_{t l_{-} \text {opt }}\right)-\min \left(\bar{V}_{t h_{-} \text {opt }}\right)}{m+1}\right) \cdot q & \text { for } q=1 \cdots m
\end{array}
$$

These vectors will finally hold the n supply voltage values and m threshold voltage values that will be used in the circuit. For any module $i$, the function "Map" finds the nearest pair $\left[\mathrm{V}_{\mathrm{DD} \_\mathrm{n}}(\mathrm{p}), \mathrm{V}_{\text {th_m }}(\mathrm{q})\right]$ to the pair $\left[\mathrm{V}_{\mathrm{dd} \_ \text {_pt }}(\mathrm{i}), \mathrm{V}_{\text {th_opt }}(\mathrm{i})\right]$ and assigns it to [ $\left.\mathrm{V}_{\mathrm{DD} \_ \text {new }}(\mathrm{i}), \mathrm{V}_{\text {th_new }}(\mathrm{i})\right]$.

$$
\begin{equation*}
\left[\bar{V}_{D D_{-} n e w}, \bar{V}_{t t_{-} n e w}\right]=\operatorname{Map}\left(\bar{V}_{D D_{-} \text {opt }}, \bar{V}_{t t_{-} \text {opt }}, \bar{V}_{D D_{-} n}, \bar{V}_{t t_{-} m}\right) \tag{28}
\end{equation*}
$$

In any iteration, the delay of the circuit $\left(\mathrm{T}_{\mathrm{c}}\right)$ is calculated using $\left[\bar{V}_{\text {DD_new }}, \bar{V}_{\text {th_new }}\right]$. We use $E_{\text {total }} \cdot T_{c}^{2}$ as the cost function if $T_{c}$ exceeds $T_{d}$ by a fixed fraction (say 0.01 ). Doing this forces the critical path delay to go down in the next iteration, possibly increasing $E_{\text {total }}$. If $T_{c}$ is less than $T_{d}$ by a fixed fraction, $E_{\text {total }}$ is used as the cost function. Doing this decreases the energy in the next iteration, possibly by increasing $\mathrm{T}_{\mathrm{c}}$. These cost functions were chosen because they yielded good results in experiments. The gradient, $\bar{\nabla}\left(\right.$ Cost $\left._{-} f n\right)$, is obtained by changing the entries of $\bar{V}_{D D_{-} n}$ and $\bar{V}_{t h-m}$ by a small amount, mapping these to new $\left[\bar{V}_{D D_{-} n e w}, \bar{V}_{\text {th_new }}\right]$ and calculating the difference in the cost function. The new values of $\bar{V}_{D D_{-} n}$ and $\bar{V}_{t h_{-} m}$, which lower the cost function, are obtained by searching in the direction of the gradient. The search terminates when the circuit delay is in $1 \%$ proximity of the deadline, $\mathrm{T}_{\mathrm{d}}$. The flowchart of the algorithm is given in Figure 3.

## 6. Experimental results

We synthesized the hierarchical Verilog descriptions of the combinational ISCAS' 85 circuits and a 16-bit Wallace Tree Multiplier using Synopsys Design Compiler (with the TSMC $0.25 \mu$ library) to get the delay, dynamic energy and static energy consumption values for the modules at the top level of design hierarchy. The modules at the top level of hierarchy in the Verilog description were directly mapped to the modules used in the optimization*. The values of the process-dependent parameters $\left(k_{3}, k_{4}, k_{6}, k_{7}\right)$ were obtained from SPICE
simulations as explained in Section 2. SPICE simulation of simple gates showed that $k_{5}$ is 6 orders of magnitude smaller than $\mathrm{k}_{2}$ for this technology. Since $\mathrm{k}_{2}$ and $\mathrm{k}_{5}$ scale almost linearly with number of gates [10, 12], $\mathrm{k}_{5}$ can be taken to be $10^{-6}$ times $\mathrm{k}_{2}$ for any module. The circuit-dependent parameters $\left(k_{0}, k_{1}, k_{2}\right)$ were then calculated for each module by using the delay, dynamic energy and static energy values obtained from Synopsys and the process-dependent parameters.

We use the following notation for describing the results: The symbol "I" denotes the initial circuit which has the standard $0.25 \mu$ TSMC voltages ( $\mathrm{V}_{\mathrm{DD}}=2.5 \mathrm{~V}, \mathrm{~V}_{\text {th }}=0.5 \mathrm{~V}$ ). We obtain the delay of the initial circuit using Synopsys Design Compiler and use this value as the time constraint for the optimization i..e the optimized circuits (II, III, IV) will have the same delay as I. "II" denotes the baseline circuit (for energy comparisons) that has the single $V_{D D}$ and $V_{t h}$ values that give the minimum energy consumption for the given deadline. "III" denotes the circuit having optimum (and possibly all different) $\mathrm{V}_{\mathrm{DD}} \mathrm{S}$ and $\mathrm{V}_{\mathrm{th}} \mathrm{S}$ for the modules. "IV" denotes the circuit in which the $\mathrm{V}_{\mathrm{DD}} \mathrm{S}$ and $\mathrm{V}_{\mathrm{th}} \mathrm{S}$ in III have been clustered into two $\mathrm{V}_{\mathrm{DD}} \mathrm{S}$ and one $\mathrm{V}_{\mathrm{th}}$. We only use one $\mathrm{V}_{\mathrm{th}}$ in the final circuit because we found that having more $\mathrm{V}_{\mathrm{th}} \mathrm{s}$ only saved an additional $2-3 \%$ of energy in the benchmark circuits designed using $0.25 \mu$ technology. The need for multiple $\mathrm{V}_{\mathrm{th}} \mathrm{s}$ will become more pronounced as technology shrinks.

For the experiments, we used various switching activities for the input ports to observe their effects on the energy savings and the optimum voltages obtained. We noticed that for switching activities above 0.05 , the optimum $\mathrm{V}_{\mathrm{th}} \mathrm{S}$ were of the order of 10 mV . This is due to the fact that the static energy in $0.25 \mu$ technology is very small compared to the dynamic energy for high switching activities. So for these cases, the optimization algorithm scales down $\mathrm{V}_{\mathrm{DD}}$ aggressively and to achieve the delay constraint, it reduces $\mathrm{V}_{\text {th }}$ to very small values without incurring a significant increase in static energy. Since such small $\mathrm{V}_{\text {th }}$ values are not currently feasible, for these cases we fixed $\mathrm{V}_{\text {th }}$ at 0.1 V and found the optimum $\mathrm{V}_{\mathrm{DD}} \mathrm{s}$. This phenomenon is not expected ot occur for deep sub-micron technologies, where static energy is significant.

Table 1 provides detailed results for the Wallace Tree Multiplier circuit. The first column in the table shows the top level of the Verilog design hierarchy. The modules are a partial product generator (level0), Carry Save Adders (CSAs), and a Carry Propagate Adder (CPA). Also shown is the $\mathbf{A}$ matrix corresponding to the circuit. The second column gives the $\mathrm{V}_{\mathrm{DD}}$, $\mathrm{V}_{\mathrm{th}}$ and energy consumption values for the baseline circuit (II) for two different input switching activities ( $\mathrm{SA}=0.01$ and $\mathrm{SA}=0.0001$ ). Note that the delay for the baseline circuit is same as the delay of the initial circuit (I), which had $\mathrm{V}_{\mathrm{DD}}=2.5 \mathrm{~V}, \mathrm{~V}_{\text {th }}$ $=0.5 \mathrm{~V}$. The third and fourth columns give the voltages for each module as well as the energy consumptions for circuits III and IV respectively.

Figure 4 shows the energy savings obtained for the various benchmark circuits as a percentage of the baseline energy consumption for an input switching activity of 0.01 . The dynamic and static components of energy are also shown. It is observed that in II and III, static energy is $\sim 10 \%$ of the total energy. This validates the fact that at the optimum, static energy is a fixed fraction of the total energy [14], although this fraction depends on the technology used.

Figures 5 and 6 show the savings for different input switching activities for circuits III and IV, respectively. The results show that the energy savings tend to increase as the input switching activity increases. Thus, accurate estimation of

A power-aware partitioning of the circuit into modules could further improve the results, but that by itself is a very difficult problem to solve and is not handled in this work.

Table 1. Optimization Results for a Wallace Tree Multiplier (with two different input switching activities)

| Wallace Tree Multiplier |  | II (Baseline System) |  | III (unlimited $\mathrm{V}_{\mathrm{DD}} \mathrm{s}, \mathrm{V}_{\text {th }} \mathrm{s}$ ) |  | IV ( $2 \mathrm{~V}_{\text {DDS }}$, $1 \mathrm{~V}_{\text {th }}$ ) |  |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
|  | $\mathrm{A}^{\text {T }}$ | i | ii | 1 | ii | 1 | ii |
| levelo |  |  |  | $\left[\begin{array}{ll}\bar{V}_{D D} & \bar{V}_{t h}\end{array}\right]$ | $\left[\begin{array}{ll}\bar{V}_{D D} & \bar{V}_{t h}\end{array}\right]$ | $\left[\begin{array}{ll}\bar{V}_{D D} & \bar{V}_{t h}\end{array}\right]$ | $\left[\begin{array}{ll}\bar{V}_{D D} & \bar{V}_{t h}\end{array}\right]$ |
|  | $[11111111111111]$ | SA=0.01 | SA $=0.0001$ | $\left[\begin{array}{ll}2.23 & 0.10 \\ 1.38\end{array}\right]$ | $\left[\begin{array}{ll}2.80 & 0.33 \\ \hline 1.75 & 0.32\end{array}\right.$ | $\left[\begin{array}{ll}1.84 & 0.09 \\ 1.84\end{array}\right]$ | $\left[\begin{array}{ll}2.42 & 0.34 \\ \hline 2.42\end{array}\right]$ |
| - | 11111111111 | $\mathrm{V}_{\mathrm{DD}}=1.62 \mathrm{~V}$ | $\mathrm{V}_{\mathrm{DD}}=2.18 \mathrm{~V}$ | $\begin{array}{ll}1.38 & 0.09\end{array}$ | $\begin{array}{ll}1.75 & 0.32\end{array}$ | $\begin{array}{ll}1.84 & 0.09\end{array}$ | 2.42 0.34 |
|  |  | $\mathrm{V}_{\mathrm{th}}=0.11 \mathrm{~V}$ | $\mathrm{V}_{\mathrm{th}}=0.35 \mathrm{~V}$ | $\begin{array}{ll}0.85 & 0.12\end{array}$ | 1.4300 .38 | $\begin{array}{ll}1.841 & 0.09\end{array}$ | $\begin{array}{ll}1.45 & 0.34\end{array}$ |
| (M3) (M5) (M8) (M12) (M16) | 10000000000 |  | $\mathrm{V}_{\mathrm{th}}=0.35 \mathrm{~V}$ | $\begin{array}{ll}0.83 & 0.11\end{array}$ | 1.4800 .36 | 0.910 .09 | 2.420 .34 |
| , | 11000000000 | $\mathrm{T}_{\mathrm{d}}=13.7 \mathrm{~ns}$ | $\mathrm{T}_{\mathrm{d}}=13.7 \mathrm{~ns}$ | $\begin{array}{lll}0.85 & 0.12 \\ 0.77 & 0.08\end{array}$ | 1.430 .38 | $\begin{array}{ll}0.91 & 0.09\end{array}$ | 1.450 .34 |
| $\mathrm{ca212}^{5}$ | 01111000000 |  |  | 0.77 | $\begin{array}{ll}1.39 & 0.34 \\ 1.49\end{array}$ | $\begin{array}{ll}0.91 & 0.09\end{array}$ | 1.450 .34 |
| (M4) (M7) (M11) | 01111000000 | $\mathrm{E}=71.6 \mathrm{pJ}$ | $\mathrm{E}=0.35 \mathrm{pJ}$ | $\begin{array}{ll}0.84 & 0.11 \\ 0.85 & 0.12\end{array}$ | $\begin{array}{ll}1.49 & 0.37\end{array}$ | $\begin{array}{ll}0.91 & 0.09\end{array}$ | 2.420 .34 |
| T | 11100100000 | $\mathrm{Ed}=63.4 \mathrm{pJ}$ | $\mathrm{Ed}=0.33 \mathrm{pJ}$ | $\begin{array}{ll}0.85 & 0.12\end{array}$ | 1.400 .36 | $\begin{array}{ll}0.91 & 0.09\end{array}$ | $\begin{array}{ll}1.45 & 0.34\end{array}$ |
|  | $001111111000$ | $\mathrm{Es}=8.2 \mathrm{pJ}$ | $\mathrm{Es}=0.02 \mathrm{pJ}$ | $\begin{array}{ll}0.80 & 0.09\end{array}$ | 1.440 .36 | $\begin{array}{ll}0.91 & 0.09\end{array}$ | 1.450 .34 |
|  |  |  |  | 0.94 | $\begin{array}{ll}1.41 & 0.32 \\ 1.46 & 0.37\end{array}$ | $1.84 \quad 0.09$ | 1.450 .34 |
| $\bigcirc$ | 00000111000 |  |  | 0.84 | $\begin{array}{ll}1.46 & 0.37\end{array}$ | 0.910 .09 | $\begin{array}{ll}2.42 & 0.34\end{array}$ |
|  | 00011011110 |  |  | $\begin{array}{ll}0.86 & 0.12\end{array}$ | $\begin{array}{ll}1.47 & 0.38 \\ 0.97 & 0.33\end{array}$ | $\begin{array}{ll}0.91 & 0.09\end{array}$ | $\begin{array}{ll}2.42 & 0.34\end{array}$ |
| $\binom{$ cspat }{ M10 }$\left(\begin{array}{c}\text { csat } \\ \text { cil }\end{array}\right.$ | $11110110100$ |  |  | 0.50 | $\begin{array}{ll}0.97 & 0.33\end{array}$ | $\begin{array}{ll}0.91 & 0.09\end{array}$ | $\begin{array}{ll}1.45 & 0.34\end{array}$ |
| - | 11110110100 |  |  | 1.00 | $\begin{array}{ll}1.47 & 0.35 \\ 1.07 & 0.30\end{array}$ | 1.840 .09 | 2.420 .34 |
| 1 | 00000000110 |  |  | 0.630 .06 | $\begin{array}{ll}1.07 & 0.30\end{array}$ | $\begin{array}{ll}0.91 & 0.09\end{array}$ | $\begin{array}{ll}1.45 & 0.34\end{array}$ |
| ${ }_{\substack{\text { casis } \\ \text { cals }}}$ | 00000000110 |  |  | $\left[\begin{array}{ll}0.54 & 0.12 \\ 0.64 & 0.06\end{array}\right.$ | $\begin{array}{ll}1.02 & 0.36 \\ 1.07 & 0.30\end{array}$ | $\begin{array}{ll}0.91 & 0.09\end{array}$ | $\left[\begin{array}{ll}1.45 & 0.34\end{array}\right.$ |
| $15)$ | 00000000001 |  |  | $\left[\begin{array}{ll}0.64 & 0.06\end{array}\right]$ | $\left[\begin{array}{ll}1.07 & 0.30\end{array}\right]$ | $\left[\begin{array}{ll}0.91 & 0.09\end{array}\right]$ | $\left[\begin{array}{ll}1.45 & 0.34\end{array}\right]$ |
| $11$ | $00001001011$ |  |  | $\mathrm{T}_{\mathrm{c}}=13.7 \mathrm{~ns}$ | $\mathrm{T}_{\mathrm{c}}=13.7 \mathrm{~ns}$ | $\mathrm{T}_{\mathrm{c}}=13.7 \mathrm{~ns}$ | $\mathrm{T}_{\mathrm{c}}=13.7 \mathrm{~ns}$ |
| (sab7) | 11111111111 |  |  | $\mathrm{E}=36.9 \mathrm{pJ}$ | $\mathrm{E}=0.22 \mathrm{pJ}$ | $\mathrm{E}=49.4 \mathrm{pJ}$ | $\mathrm{E}=0.28 \mathrm{pJ}$ |
|  | -00000000001 |  |  | $\mathrm{Ed}=31.8 \mathrm{pJ}$ | $\mathrm{Ed}=0.20 \mathrm{pJ}$ | $\mathrm{Ed}=41.9 \mathrm{pJ}$ | $\mathrm{Ed}=0.25 \mathrm{pJ}$ |
|  | 00000000001 |  |  | $\mathrm{Es}=5.1 \mathrm{pJ}$ | $\mathrm{Es}=0.02 \mathrm{pJ}$ | $\mathrm{Es}=7.5 \mathrm{pJ}$ | $\mathrm{Es}=0.03 \mathrm{pJ}$ |
| $\binom{\text { cpal }}{\mathrm{M11}}$ | $[1111111111111]$ |  |  | Saving $=48 \%$ | Saving=39\% | Saving $=31 \%$ | Saving $=22 \%$ |



Figure 4. Energy consumption of benchmark circuits as a percentage of the baseline energy consumption when the input switching activity is 0.01 .
I: Energy consumption of circuit with standard $0.25 \mu$ TSMC voltages ( $\mathrm{V}_{\mathrm{DD}}=2.5, \mathrm{~V}_{\mathrm{th}}=0.5$ )
II: Energy consumption of circuit with optimum single $\mathrm{V}_{\mathrm{DD}}$ and single $\mathrm{V}_{\text {th }}$ (baseline case)
III: Energy consumption of circuit with unlimited $\mathrm{V}_{\mathrm{DD}} \mathrm{S}$ and $\mathrm{V}_{\text {th }} \mathrm{S}$
IV: Energy consumption of circuit with two $\mathrm{V}_{\mathrm{DD}} \mathrm{S}$ and one $\mathrm{V}_{\text {th }}$


Figure 5. Percent energy savings with unlimited $\mathbf{V}_{\mathrm{DD}} \mathbf{s}$ and $\mathrm{V}_{\mathrm{th}} \mathbf{s}$ (III) for different input switching activities


Figure 6. Percent energy savings with two $\mathrm{V}_{\mathrm{DD}} \mathrm{S}$ and one $\mathrm{V}_{\text {th }}$ (IV) for different input switching activities

Table 2. Optimization Results

| Circuit | Input Switching Activity | $\begin{gathered} E(I) \\ p J \end{gathered}$ | $\begin{gathered} E(I I) \\ \text { pJ } \end{gathered}$ | $\begin{gathered} \text { E (III) } \\ \text { pJ } \end{gathered}$ | $\begin{gathered} E(I V) \\ p J \end{gathered}$ | $V_{D D}$ <br> (II) | $\mathbf{V}_{\text {th }}$ <br> (II) | $\mathrm{V}_{\mathrm{DD} 1}$ <br> (IV) | $\begin{aligned} & \mathrm{V}_{\mathrm{DD} 2} \\ & \text { (IV) } \end{aligned}$ | $\mathrm{V}_{\text {th }}$ (IV) | \% Energy Savings (III) | \% Energy Savings (IV) |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| c1908 | 0.5 | 106 | 44.2 | 25.9 | 30.6 | 1.6 | 0.10 | 1.2 | 2.1 | 0.10 | 41.5 | 30.9 |
|  | 0.1 | 44.1 | 18.7 | 11.2 | 13.2 | 1.6 | 0.10 | 1.2 | 2.1 | 0.10 | 40.2 | 29.4 |
|  | 0.01 | 5.85 | 2.75 | 1.67 | 1.98 | 1.6 | 0.10 | 1.2 | 2.1 | 0.10 | 39.4 | 28.1 |
|  | 0.001 | 0.62 | 0.37 | 0.25 | 0.28 | 1.8 | 0.20 | 1.4 | 2.4 | 0.19 | 32.5 | 22.9 |
|  | 0.0001 | 0.07 | 0.05 | 0.04 | 0.05 | 2.0 | 0.28 | 1.7 | 2.1 | 0.27 | 26.9 | 5.1 |
| c2670 | 0.5 | 238 | 100 | 92.5 | 100 | 1.6 | 0.10 | 1.6 | 1.6 | 0.10 | 7.5 | 0 |
|  | 0.1 | 78.1 | 33.5 | 31.2 | 33.5 | 1.6 | 0.10 | 1.6 | 1.6 | 0.10 | 6.9 | 0 |
|  | 0.01 | 8.95 | 4.58 | 3.71 | 4.45 | 1.7 | 0.14 | 1.3 | 1.7 | 0.12 | 19.2 | 3.0 |
|  | 0.001 | 0.85 | 0.55 | 0.45 | 0.53 | 1.9 | 0.23 | 1.5 | 1.9 | 0.22 | 18.4 | 2.6 |
|  | 0.0001 | 0.09 | 0.07 | 0.06 | 0.07 | 2.1 | 0.32 | 1.8 | 2.1 | 0.31 | 13.4 | 1.0 |
| c3540 | 0.5 | 414 | 175 | 120 | 139 | 1.6 | 0.10 | 1.2 | 1.6 | 0.10 | 31.5 | 20.3 |
|  | 0.1 | 130 | 57.0 | 39.1 | 45.3 | 1.6 | 0.10 | 1.2 | 1.7 | 0.10 | 31.4 | 20.6 |
|  | 0.01 | 14.4 | 7.75 | 5.18 | 6.81 | 1.7 | 0.16 | 1.3 | 1.7 | 0.12 | 33.2 | 12.2 |
|  | 0.001 | 1.29 | 0.87 | 0.60 | 0.73 | 2.0 | 0.25 | 1.5 | 2.0 | 0.22 | 31.8 | 16.0 |
|  | 0.0001 | 0.09 | 0.07 | 0.05 | 0.06 | 2.2 | 0.36 | 1.7 | 2.3 | 0.35 | 32.6 | 17.6 |
| c432 | 0.5 | 23.5 | 9.81 | 9.05 | 9.47 | 1.6 | 0.10 | 1.5 | 1.8 | 0.10 | 7.7 | 3.5 |
|  | 0.1 | 6.77 | 2.88 | 2.65 | 2.77 | 1.6 | 0.10 | 1.5 | 1.7 | 0.10 | 8.0 | 3.9 |
|  | 0.01 | 0.74 | 0.37 | 0.33 | 0.36 | 1.7 | 0.14 | 1.5 | 1.8 | 0.11 | 10.6 | 4.0 |
|  | 0.001 | 0.09 | 0.053 | 0.049 | 0.052 | 1.9 | 0.22 | 1.7 | 2.1 | 0.20 | 8.7 | 3.0 |
|  | 0.0001 | 0.01 | 0.0084 | 0.0078 | 0.0081 | 2.1 | 0.30 | 1.9 | 2.4 | 0.29 | 7.5 | 4.1 |
| c499 | 0.5 | 81.4 | 34.0 | 26.9 | 27.6 | 1.6 | 0.10 | 1.2 | 1.9 | 0.10 | 20.8 | 18.8 |
|  | 0.1 | 34.4 | 14.5 | 11.8 | 12.0 | 1.6 | 0.10 | 1.2 | 1.9 | 0.10 | 18.5 | 17.1 |
|  | 0.01 | 4.81 | 2.24 | 1.95 | 1.98 | 1.6 | 0.10 | 1.3 | 1.8 | 0.09 | 12.8 | 11.5 |
|  | 0.001 | 0.49 | 0.29 | 0.26 | 0.26 | 1.8 | 0.19 | 1.5 | 2.0 | 0.19 | 11.1 | 10.2 |
|  | 0.0001 | 0.05 | 0.039 | 0.035 | 0.035 | 2.0 | 0.28 | 1.7 | 2.3 | 0.28 | 9.5 | 9.1 |
| c5315 | 0.5 | 438 | 184 | 110 | 153 | 1.6 | 0.10 | 0.5 | 1.6 | 0.10 | 40.0 | 16.8 |
|  | 0.1 | 143 | 61.5 | 37.3 | 50.7 | 1.6 | 0.10 | 0.5 | 1.6 | 0.10 | 39.3 | 17.5 |
|  | 0.01 | 16.7 | 8.59 | 5.38 | 7.83 | 1.7 | 0.14 | 0.5 | 1.6 | 0.09 | 37.3 | 8.9 |
|  | 0.001 | 1.59 | 1.03 | 0.67 | 0.97 | 1.9 | 0.23 | 0.6 | 1.8 | 0.18 | 34.8 | 5.3 |
|  | 0.0001 | 0.15 | 0.12 | 0.07 | 0.10 | 2.1 | 0.33 | 0.8 | 2.1 | 0.29 | 35.0 | 14.5 |
| c7552 | 0.5 | 861 | 361 | 259 | 285 | 1.6 | 0.10 | 1.0 | 1.6 | 0.10 | 28.1 | 21.0 |
|  | 0.1 | 283 | 121 | 84.7 | 86.0 | 1.6 | 0.10 | 0.6 | 1.6 | 0.10 | 29.9 | 28.9 |
|  | 0.01 | 32.3 | 16.4 | 9.04 | 11.00 | 1.7 | 0.14 | 0.7 | 1.6 | 0.10 | 44.9 | 32.8 |
|  | 0.001 | 3.34 | 2.12 | 1.20 | 1.42 | 1.9 | 0.22 | 1.0 | 1.9 | 0.20 | 43.4 | 33.2 |
|  | 0.0001 | 0.42 | 0.32 | 0.17 | 0.20 | 2.1 | 0.31 | 1.2 | 2.1 | 0.31 | 45.4 | 36.0 |
| Multiplier | 0.5 | 2890 | 1210 | 502 | 834 | 1.6 | 0.10 | 1.0 | 1.8 | 0.10 | 58.4 | 30.0 |
|  | 0.1 | 1180 | 500 | 245 | 342 | 1.6 | 0.10 | 1.0 | 1.8 | 0.10 | 51.0 | 31.6 |
|  | 0.01 | 151 | 71.6 | 36.9 | 49.40 | 1.6 | 0.11 | 0.9 | 1.8 | 0.09 | 48.4 | 30.9 |
|  | 0.001 | 11.7 | 7.21 | 4.00 | 5.16 | 1.9 | 0.21 | 1.1 | 2.1 | 0.20 | 44.5 | 28.5 |
|  | 0.0001 | 0.43 | 0.35 | 0.22 | 0.28 | 2.2 | 0.35 | 1.5 | 2.4 | 0.34 | 39.0 | 22.0 |

the input switching activity is crucial for obtaining good energy savings.

Table 2 summarizes the results of the experiments. $\mathrm{V}_{\mathrm{DD} 1}$ and $\mathrm{V}_{\mathrm{DD} 2}$ are the two voltages applied to the circuit after clustering. We obtained up to $48.4 \%$ savings for circuit III and up to $36 \%$ savings for circuit IV for switching activities below $0.05\left(\mathrm{~V}_{\mathrm{th}} \mathrm{S}\right.$ variable). For switching activities above $0.05\left(\mathrm{~V}_{\mathrm{th}} \mathrm{S}\right.$ fixed at 0.1 V ), we obtained up to $58.4 \%$ savings for circuit III and up to $31.6 \%$ savings for circuit IV. The average saving, for switching activities above 0.05 , was $29 \%$ for circuit IIIand $18 \%$ for circuit IV. For switching activities below 0.05 , the average saving was $28 \%$ for circuit III and $15 \%$ for circuit IV.

The optimization algorithms in Sections 4 and 5 were implemented in MATLAB and run on a PC with 1 GB RAM and PIII 800 MHz processor. To compare execution times for the different circuits, we terminate the optimal algorithm in Section 4 after 10 iterations of the loop in Figure 2. 10 iterations were enough to get near optimal results for most of the cases. Since we observed that the clustering algorithm (Section 5) takes only about $5-10 \%$ of the total execution time, we let it run to completion. Figure 7 shows the execution time of the program versus number of modules $(\mathrm{N})$ multiplied with the number of null-space vectors of $\mathbf{A}(|\operatorname{Null}(\mathbf{A})|)$. Execution time is roughly proportional to $\mathrm{Nx}|\operatorname{Null}(\mathbf{A})|$ because in each iteration of the loop in Figure 2, computation of each coordinate of the gradient vector, $\bar{\nabla}_{A} E_{\text {total }}$, requires computation of supply, threshold voltage and energy consumption values for each module.


Figure 7. Execution time versus $\mathbf{N} \mathbf{x}|\operatorname{Null}(\mathrm{A})|$

## 7. Conclusion and future work

In this paper, we presented an algorithm to find optimum values of supply and threshold voltages for circuit modules such that the energy consumption is minimized. The conditions for optimum energy were found mathematically and then a gradient search algorithm was presented which iteratively converges to the optimum values. An additional step clusters these optimum values into a limited number of supply and threshold voltages. The method can be applied to circuit modules of any kind, given the delay and energy parameters for the modules.

As a next step, we plan to apply our algorithm to deep sub-micron technologies, which we believe will give more
energy savings than the results for $0.25 \mu$. We are also investigating techniques for power-aware partitioning of circuits into modules.

## References

[1] A. U. Diril, Y. S. Dhillon, K. Choi, A. Chatterjee, "An O(N) Supply Voltage Assignment Algorithm for LowEnergy Serially Connected CMOS Modules and a Heuristic Extension to Acyclic Data Flow Graphs," ISVLSI, pp.173-179, February 2003.
[2] T. Sakurai, A.R. Newton, "Alpha-power law MOSFET model and its applications to CMOS inverter delay and other formulas," IEEE Journal of Solid-State Circuits, vol. 25, pp.584-594, April 1990.
[3] S. Lee, T. Sakurai, "Run-time Voltage Hopping for Low Power Real-Time Systems," DAC, pp.806-809, 2000.
[4] T. Pering, T. Burd, R. Brodersen, "The Simulation and Evaluation of Dynamic Voltage Scheduling Algorithms," ISLPED, pp.76-81, 1998.
[5] J. Chang, M. Pedram, "Energy Minimization Using Multiple Supply Voltages," IEEE Trans. on VLSI Systems, vol.5, no.4, December 1997.
[6] M. Johnson, K. Roy, "Optimal Selection of Supply Voltages and Level Conversions During Data Path Scheduling Under Resource Constraints," ICCD '96, pp. 72 -77, 1996.
[7] P. Pant, R.K. Roy, A. Chattejee, "Dual-threshold voltage assignment with transistor sizing for low power CMOS circuits," IEEE Trans. on VLSI Systems, vol.9, no.2, pp. 390 -394, April 2001.
[8] V. Sundararajan, K.K. Parhi, "Low power synthesis of dual threshold voltage CMOS VLSI circuits," ISLPED, pp.139-144, 1999.
[9] Y. Moisiadis, I. Bouras, A. Arapoyanni, "Dynamic back bias CMOS driver for low-voltage applications," Electronics Letters, vol.36, no.2, pp.135-136, Jan. 2000.
[10] J.A. Butts, G.S. Sohi, "A static power model for architects," IEEE/ACM MICRO, pp. 191-201, 2000.
[11] K. Nose, T. Sakurai, "Analysis and future trend of shortcircuit power," IEEE Trans. on CAD, vol.19, no.9, pp.1023-1030, Sept. 2000.
[12] R. Kumar, C.P. Ravikumar, "Leakage power estimation for deep submicron circuits in an ASIC design environment," DAC, pp.45-50, 2002.
[13] R. Nair, C.L. Berman, P.S. Hauge, E.J. Yoffa, "Generation of performance constraints for layout," IEEE Trans. on CAD, vol.8, no.8, pp.860-874, Aug. 1989.
[14] K. Nose, T. Sakurai, "Optimization of $\mathrm{V}_{\mathrm{DD}}$ and $\mathrm{V}_{\mathrm{TH}}$ for low-power and high-speed applications," $A S P-D A C$, pp.469-474, 2000.
[15] R. Bai, S. Kulkami,W. Kwong, A. Srivastava, D. Sylvester, D.Blaauw, "An implementation of a 32-bit ARM processor using dual power supplies and dual threshold voltages," ISVLSI, pp.149-154, 2003.
[16] L. Wei, Z. Chen, K. Roy, M.C. Johnson, Y. Ye, V.K. De, "Design and optimization of dual-threshold circuits for low-voltage low-power applications," IEEE Trans. on VLSI Systems, pp.16-24, vol.7, no.1, March 1999.

