# Power-Up Sequence Control for MTCMOS Designs

Shi-Hao Chen, Youn-Long Lin, Member, IEEE, and Mango C.-T. Chao, Member, IEEE

Abstract: Power gating is effective for reducing standby leakage power as multi-threshold CMOS (MTCMOS) designs have become popular in the industry. However, a large inrush current and dynamic IR drop may occur when a circuit domain is powered up with MTCMOS switches. This could in turn lead to improper circuit operation. We propose a novel framework for generating a proper power-up sequence of the switches to control the inrush current of a power-gated domain while minimizing the power-up time and reducing the dynamic IR drop of the active domains. We also propose a configurable domino-delay circuit for implementing the sequence. Experimental results based on state-of-the-art industrial designs demonstrate the effectiveness of the proposed framework in limiting the inrush current, minimizing the power-up time, and reducing the dynamic IR drop. Results further confirm the efficiency of the framework in handling large-scale designs with more than 80 K power switches and 100 M transistors.

Index Terms: Dynamic IR, inrush current, low power design, multi-threshold CMOS (MTCMOS), power gating, power-up sequence, ramp-up time.

#### I. Introduction

Energy efficiency is important to battery-powered portable devices such as smart phones, GPS, PDAs, and tablets. However, the leakage current of these devices has increased significantly with the shrinking of semiconductor process technologies. The most straightforward and effective method for reducing standby leakage is power-gating, which cuts off the power supply (or ground) to a power-gated domain when it is in an idle state and resumes the power supply when it is in an active state.
Manuscript received August 07, 2011; revised November 18, 2011 and January 25, 2012; accepted January 30, 2012. Date of publication March 16, 2012; date of current version February 20, 2013. S.-H. Chen is with the Design Service Division, Global Unichip Corporation, Hsinchu 30078, Taiwan, and also with the Department of Computer Science, National Tsing Hua University, Hsinchu 30043, Taiwan (e-mail: hockchen@globalunichip.com). Y.-L. Lin is with the Department of Computer Science, National Tsing Hua University, Hsinchu 30043, Taiwan (e-mail: ylin@cs.nthu.edu.tw). M. C.-T. Chao is with the Department of Electronics Engineering and Institute of Electronics, National Chiao Tung University, Hsinchu 30010, Taiwan (e-mail: mango@faculty.nctu.edu.tw). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TVLSI.2012.2187689

The multi-threshold CMOS (MTCMOS) technique [1] employs high-$V_t$ transistors to implement always-on circuits, such as power switches, retention flip-flops, and always-on buffers, to minimize their leakage power consumption. The power up/down of a gated domain is controlled by turning the header (or footer) power switches on or off. These switches are parallel-connected between the mesh of the chip's true VDD (or ground) and the mesh of the gated domain's virtual VDD (or virtual ground). This *power-switch fabric*, also called a distributed sleep transistor network (DSTN) [2], along with its control scheme, significantly affects the characteristics of the MTCMOS design [1]–[13], [19]–[22], and thus needs to be carefully designed. The number and size of the transistors used in the power-switch fabric determine the voltage drop between the true VDD and the virtual VDD [5]–[7]. This voltage drop degrades circuit performance and must be kept below a user-specified value.
Using a larger number of power switches can achieve a smaller voltage drop at the expense of more area overhead. After the power switches are allocated, the sequence that turns on the power switches for a domain (called the *power-up sequence*) determines the voltage *ramp-up time* and the *inrush current* of the domain. The ramp-up time is the time during which the virtual VDD rises from ground level to the required operating level for active mode. The inrush current is the maximum transient current flowing through the power switches during the sleep-to-active mode transition. There is generally a tradeoff between the ramp-up time and the inrush current [8]–[12]: a short ramp-up time may incur a large inrush current. It is necessary to constrain the inrush current of a domain, as an excessive inrush current may lead to excessive IR drop in other active domains, resulting in chip malfunction. For example, an on-chip low-dropout (LDO) voltage regulator may fail to boot up because it cannot handle an excessive current surge. This can in turn damage certain power-sensitive IPs, such as a USB.

Fig. 1 illustrates a design tradeoff between the inrush current and the ramp-up time. The dotted line "A" indicates that all power switches are turned on simultaneously; thus, the current peaks very early and the virtual rail voltage reaches the VDD level quite early. The dash-dot line "B" represents a single-chain fabric, where the peak current is intrinsically small because the power switches are turned on one by one, every 100 ps, in a sequential fashion. However, the inrush current may still exceed the specified constraint. Turning on power switches sequentially with a larger time interval (e.g., 200 ps) can reduce the inrush current at the expense of increased area and ramp-up time (dashed line "C"). This study proposes a framework to schedule the power-up sequence of the power switches.
This approach minimizes power ramp-up time while limiting the inrush current (solid line "D" in Fig. 1). In this framework, the power-up sequence turns on one bank of power switches at a time, and employs a configurable delay circuitry to control the activation of the next bank. We also propose a new model for estimating inrush current for very large-scale MTCMOS designs.

Fig. 1. Tradeoff of ramp-up time and inrush current and the expected current budget control.

The rest of this paper is organized as follows. In Section II, we present some general concepts of a multi-domain MTCMOS design, the behavior of a power switch during power-up, and the associated design challenges, such as limiting the inrush current and reducing the *dynamic IR* drop. Section III defines the problem of finding the optimal power-up sequence control. Section IV describes our current-budget method and proposes a configurable domino-delay controller. Section V deals with power switch routing. We take into consideration physical routing information and the impact on the dynamic IR drop of the most fragile active domain when grouping the power switches into banks. In Section VI, we show experimental results using an industrial 40 nm multi-domain MTCMOS design to validate the efficiency and scalability of the proposed framework. Finally, we draw conclusions and point to possible directions for future research in Section VII.

### II. MTCMOS DESIGN CHALLENGES AND PREVIOUS WORKS

Many researchers have proposed methods to reduce or control the inrush current. These methods generally fall into one of the following three types.

Type-A) Customized power switch with large slew rate and reduced saturation current [5]–[7], [19].

Type-B) Turn on multiple power switches using a custom scheduling scheme (or delay insertion) [8]–[13], [19], [21], [22].

Type-C) Separate power switch chain into two phases and turn on power switches one by one (single chain) [13], [15], [17], [19].
Type-A methods mainly focus on adjusting the electrical properties of the power switches while ignoring the effect of the control sequence. Thus, they are suitable for designs with few power switches but ineffective for managing area, leakage, and ramp-up time simultaneously for inrush current reduction.

Type-B methods turn on one group of power switches at a time such that the inrush current is under a constraint. They rely on an accurate model to estimate the current resulting from the already-on switches. This idea was first proposed by Kim *et al.* in [19]. However, they did not mention how to generate the desired timing of the turn-on sequence nor how to efficiently estimate the inrush current and ramp-up time. The method in [21] generates the desired delay of the turn-on signal by inserting buffers. However, in practice the desired delay before turning on the next group of power switches can easily exceed 10 ns. From a physical-design perspective, such a long delay can hardly be generated by using only buffer chains. Also, [21] did not provide a formal model to estimate the inrush current. The works in [10] and [22] utilized full-chip SPICE to simulate the inrush current and further minimize the ramp-up time under a predefined constraint on the inrush current. However, it is computationally infeasible to iteratively apply a full-chip SPICE simulation, especially for modern MTCMOS designs, which may easily contain more than 10 K power switches and 3 M transistors in a power-gated domain. Also, [10] and [22] did not address the physical-design issues of generating the desired delay of the turn-on signal. Another research direction in controlling the turn-on sequence of power switches is to minimize the ground bounce, which was formulated as an exact mixed integer linear programming (MILP) problem in [9].

Type-C methods reduce the inrush current by sequentially turning on the power switches. They are implemented with the help of *mother-daughter switches* [17].
These switches contain a smaller transistor (daughter switch) that is first turned on during the ramp-up of the virtual VDD and a larger transistor (mother switch) that is subsequently turned on after the virtual VDD is fully charged. The method in [15] finds a feasible *Hamiltonian-path* tour (also known as a single chain) for connecting all the power switches with minimal routing length. The design in [13] uses a Schmitt trigger to detect when the virtual VDD has reached the desired level through the daughter transistors, so that the mother transistors can then be turned on. The presence of hard macros (as routing obstacles) and the irregular placement of a large number of power switches further complicate the routing tasks. Another problem is that the inrush current of a power-gated domain may still exceed the limit, even after applying a Hamiltonian-path routing [15] to turn on daughter switches sequentially. In other words, minimum wire length is not necessarily the correct objective for inrush current control. Such violations require manual fixes, and thus may significantly impair design closure.

In addition to inrush current and ramp-up time, the power-up sequence of a domain may also affect the dynamic IR drop of other active domains [20]. Previous research [4] shows that different power-up sequences may result in different dynamic IR drops. An exemplary yield overkill due to dynamic IR drop is described in [16]. Although many previous studies propose algorithms for the problems of power switch sizing [5]–[7] and ramp-up timing minimization [8]–[12], they rarely address the issue of the dynamic IR effect. The worst-case scenario may occur when several gated blocks are being simultaneously powered up while the others are still in active mode. For large circuits, it is computationally infeasible to perform transient analysis for every pair of adjacent power modes. Thus, we need a predictable analysis model and a prevention mechanism.
For example, to ensure that frequently switching cells can be kept at a certain distance from one another during timing optimization and clock tree synthesis, an early-stage dynamic IR prevention mechanism proposed in [16] performs cell padding during placement based on a flip-flop/clock-buffer density rule.

**Power-up Sequence Control (PSC)**

**Inputs:**

- A list of variable-sized switches *N*
- Inrush current budget *B*
- Maximum-distance limit for any two adjacent switches *MD*
- Extracted capacitance *Cap* of gated domain
- Time interval for switch bank control *sT*

**Output:** Cluster switches into banks *k(T)* with proper turn-on time

**Objective:** Maximize $I(k(T))$ for all $T$

**s.t.**

- Macro routing blockage for all metal layers
- The distance of two adjacent switches $dist(i,j) \leq MD$, where $dist(i,j) = |x_i - x_j| + |y_i - y_j|$
- Inrush current $I(k(T)) \leq B$, where $I(k(T)) = \sum_j I_d(V(PS_j))$, $\Delta V(PS_j) = V_{DD} - V(T)$
- $T = n \cdot sT$, where $n$ is an integer

Fig. 2. PSC problem formulation.

# III. PROBLEM FORMULATION AND POWER RAMP-UP MODELING

## A. Problem Formulation

This section formulates the power-up sequence control (PSC) problem that configures power switches and orders their turn-on sequence, as shown in Fig. 2. The inputs include a list of *N* variable-sized switches, an inrush current limit *B*, a maximum distance constraint *MD* for any two adjacent switches, the extracted capacitance *Cap* of the gated domain, and the time interval *sT* for switch enabling control. The virtual rail voltage is initially zero when the power switch is just turned on. The output is a clustering of the switches into banks *k(T)* such that all switches of bank *k* can be turned on simultaneously at time *T*. Note that *T* is an integer multiple of *sT* and all switches in a bank satisfy the maximum distance constraint *MD*.
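The two numeric constraints of the formulation can be checked with a few lines of code. The following is a minimal sketch, not the authors' implementation: the function names are hypothetical, "adjacent" is read here as consecutive switches along a routing chain, and each switch's I-V curve is passed in as a plain callable standing in for a characterized lookup table.

```python
from typing import Callable, Sequence, Tuple

Point = Tuple[float, float]


def manhattan(a: Point, b: Point) -> float:
    """dist(i, j) = |x_i - x_j| + |y_i - y_j| from the PSC formulation."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def adjacent_within_md(chain: Sequence[Point], md: float) -> bool:
    """True if every pair of consecutive switches in the chain is at most MD apart."""
    return all(manhattan(p, q) <= md for p, q in zip(chain, chain[1:]))


def inrush_current(id_curves: Sequence[Callable[[float], float]],
                   v_rail: float, vdd: float) -> float:
    """Sum of I_d(VDD - V) over the switches that are on at rail voltage V."""
    vd = vdd - v_rail
    return sum(curve(vd) for curve in id_curves)


def is_feasible(chain: Sequence[Point],
                id_curves: Sequence[Callable[[float], float]],
                v_rail: float, vdd: float, budget: float, md: float) -> bool:
    """Check both the distance constraint and the inrush budget B."""
    return (adjacent_within_md(chain, md)
            and inrush_current(id_curves, v_rail, vdd) <= budget)


# Toy data (hypothetical): three switches with a linear 1 mA/V characteristic.
toy_curve = lambda vd: 1e-3 * vd
chain = [(0.0, 0.0), (1.0, 1.0), (2.0, 1.0)]
curves = [toy_curve] * 3
total = inrush_current(curves, v_rail=0.2, vdd=1.2)
ok = is_feasible(chain, curves, v_rail=0.2, vdd=1.2, budget=5e-3, md=2.0)
```

The macro blockage constraint is a routing matter and is deliberately left out of this sketch.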
Routing across hard macros is prohibited so that the power switch wiring does not damage the regular signal routing, and the wires are adequately buffered to meet both output loading and transition constraints. Unlike previous works [13], [15], [17], which turn on one power switch at a time, our method turns on a bank of power switches simultaneously at each predefined time interval.

To realize the above objective, we propose a simple but accurate model for estimating the virtual rail voltage of the gated domain being turned on. Initially, a vector is applied on the primary inputs, power-on reset, and isolation clamp controls. We first build an I-V curve lookup table for the source-drain current $I_d(V(PS_i))$ of switch $PS_i$ by performing HSPICE simulation, wherein $V(PS_i)$ is calculated by subtracting an estimated ramp-up voltage $V$ from the supply voltage $V_{\mathrm{DD}}$.

Fig. 3. Power ramp-up and effective capacitance modeling. (We characterized the source-drain current $I_d$ as a voltage dependent lookup table with HSPICE.)

### B. Effective Modeling for Power Ramp-Up Analysis

During power-up, a power switch (PSW) behaves like a current source and remains in the saturation region for a while. The power-gated devices connected to the virtual rail VDDV behave like resistors until the virtual rail is charged to the normal operating voltage. As depicted in Fig. 3, we model each power switch as a voltage-dependent current source $(i_1, i_2, \ldots, i_n)$ and each gate $(U_1, U_2, \ldots, U_m)$, including wires, as a lumped resistance $R_{\rm eff}$ and a lumped capacitance $C_{\rm eff}$. $T_1, T_2, \ldots, T_n$ denote the times at which the associated power switch can be turned on. In a uniform power grid distribution network, a typical power gate resistance is subject to the IR drop constraint. Therefore, given a design with 1.0 V of VDD, 500 mW of power consumption and 5% of VDD drop tolerance, a reasonable resistance value must be less than 0.1 $\Omega$.
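As a quick check of the 0.1 $\Omega$ figure, the bound follows directly from the stated numbers if one assumes that the whole 500 mW is drawn as a steady current through the power gate resistance. The symbols $I_{\text{avg}}$ and $R_{\max}$ are introduced here for illustration only:

$$I_{\text{avg}} = \frac{P}{V_{\mathrm{DD}}} = \frac{0.5\ \text{W}}{1.0\ \text{V}} = 0.5\ \text{A}, \qquad R_{\max} = \frac{0.05 \times V_{\mathrm{DD}}}{I_{\text{avg}}} = \frac{0.05\ \text{V}}{0.5\ \text{A}} = 0.1\ \Omega$$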
Only the power switch model (I–V curve lookup table) needs to be pre-characterized with SPICE simulation, assuming that Vg is tied to zero. As a result, our model is slightly pessimistic compared with the SPICE simulation.

We perform HSPICE simulation based on the accurate SPICE netlist considering layout parasitic extraction and calibrate the values of load capacitance $C_L$ and load resistance $R_L$ to evaluate the effects on both voltage ramp-up time and inrush current. For each load capacitance $C_L$ ranging from 10 to 1000 fF, the load resistance $R_L$ of each sub-circuit (out of 40 K sub-circuits) is varied from 10 $\Omega$ to 10 k$\Omega$. Fig. 4 illustrates the effect on a simple MTCMOS design constructed using a TSMC 65 nm low power library [17], wherein 400 HDRSID0HVT header switches are connected in parallel and 40 K sub-circuits are attached to the virtual VDD. The $R_{\rm on}$, $I_{\rm dsat}$ and propagation delay of an HDRSID0HVT switch are 678 $\Omega$, 0.479 mA and 120 ps, respectively. The power grid resistance is set to 0.1 $\Omega$ to comply with the IR and EM requirements. Each sub-circuit in the gated domain consists of an INVD24 inverter with 22.78 fF pin capacitance. The experimental results demonstrate that the inrush current is entirely subject to the load capacitance. For a DSTN structure, all its power switches are parallel-connected to the power mesh and share the current supply of the power-gated domain. So after parallel-connecting the resistance of all 40 K sub-circuits, their equivalent resistance may not vary too much even though the resistance of each sub-circuit varies a lot.

Fig. 4. Effect of device capacitance $C_L$ tuning from 10 to 1000 fF, wherein the power grid resistance is set to 0.1 $\Omega$ and the load resistance $R_L$ is tuned from 10 $\Omega$ to 10 k$\Omega$ for each capacitance level.
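To get a feel for the parallel-resistance argument, the sketch below computes the equivalent resistance of the 40 K sub-circuits at the ends of the sweep. It assumes identical sub-circuits in an ideal parallel connection, which is a simplification. Setting the result beside 400 switches at their quoted $R_{\rm on}$ is an illustration added here, not an analysis from the paper, and $R_{\rm on}$ only describes a switch outside saturation.

```python
def parallel_equivalent(r_each_ohm: float, count: int) -> float:
    """Equivalent resistance of `count` identical resistors in parallel."""
    return r_each_ohm / count


N_SUBCIRCUITS = 40_000   # sub-circuits attached to the virtual VDD (from the text)
N_SWITCHES = 400         # HDRSID0HVT header switches in parallel (from the text)
R_ON_OHM = 678.0         # on-resistance of one switch (from the text)

# End points of the R_L sweep, plus the 250-ohm value used for Fig. 5.
load_sweep_ohm = (10.0, 250.0, 10_000.0)
load_equivalents = {r: parallel_equivalent(r, N_SUBCIRCUITS) for r in load_sweep_ohm}
switch_network_ohm = parallel_equivalent(R_ON_OHM, N_SWITCHES)

for r_each, r_eq in load_equivalents.items():
    print(f"R_L = {r_each:8.0f} ohm each -> {r_eq * 1e3:8.3f} mohm in parallel")
print(f"{N_SWITCHES} switches at R_on in parallel -> {switch_network_ohm:.3f} ohm")
```

Under these assumptions the lumped load resistance stays at or below a quarter of an ohm across the whole sweep, which is one way to read the statement that it has little influence on the result.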
The capacitance of the power mesh (plate capacitor and fringing capacitor) can actually be ignored in contrast to the capacitance of the device load, which is proportional to the area and can be calculated after performing routing and RC extraction. Although the power grid resistance (in distributed form) may generate a slightly higher impact on the inrush current, the relative errors in terms of the inrush current and the ramp-up time for an actual power grid design could be approximately 5% more optimistic than those of an analysis ignoring the power grid resistance. For large thousand-gate designs, the results obtained by using an actual device model and by using a lumped-C model are very similar, as shown in Fig. 5. The load resistance and capacitance of each sub-circuit are set to 250 $\Omega$ and 100 fF, respectively. The equivalent effective capacitance $C_{\rm eff}$ of the power-gated domain calculated by our proposed method is 4.91 nF in this experiment.

As mentioned earlier, the power switches behave like current sources during the power-up process, while the device components hanging under the virtual rail behave like a set of parallel-connected RC networks. The collective effect of the load resistance is relatively small due to the parallel connection. Thus, the cells' loading capacitance is in fact the key factor when estimating the inrush current and ramp-up time during the power-up process. In other words, the effective capacitance $C_{\rm eff}$ determines the amplitude of the inrush current and the effective resistance $R_{\rm eff}$ can be safely ignored. In contrast to the actual device model, in which the pMOS device in the power-gated domain constrains the current to charge the loading capacitance, the effective capacitance $C_{\rm eff}$ can be estimated by summing up the output loading of each individual cell. The model does not need any layout operations. Once the timing and power are fixed, more accurate RC data (e.g., SPEF) can be generated. Thus, our approach enhances the accuracy of ramp-up analysis.

Fig. 5. Power ramp-up correlation between an actual device model and an effective capacitance model, wherein 40 K INVD24 inverters are used as the real device circuit with $R_L$ set to 250 $\Omega$, $C_L$ set to 100 fF, and the associated $C_{\rm eff}$ set to 4.91 nF. (a) Inrush current. (b) Voltage ramp-up. (We perform HSPICE with TSMC 65LP post-layout SPICE netlist.)

As highlighted by stage "1" and stage "2" in Fig. 5, the virtual VDD of the SPICE simulation is indeed different from that of the proposed model. Fortunately, the power switches turned on during stage "1" and stage "2" are still operated in the saturation region, rather than in the linear region. When a power switch operates in the saturation region, its $I_{\rm ds}$ is essentially independent of $V_{\rm ds}$. Hence, the inrush current obtained by our proposed model can be close to the SPICE simulation. In stage "1" (Fig. 5), the pMOS device (not the power switch) in the power-gated domain remains in the cutoff region. Although the currents under both models are almost identical, the virtual rail voltage under the actual device model increases faster than that of the lumped-C model. Once the virtual rail voltage is large enough, the pMOS device in the power-gated domain is switched on and acts as a smaller resistance. From this moment, the virtual rail voltage in the actual device model tends to increase more slowly in stage "2". Finally, all power switches are turned on and the two voltages converge at stage "3". Thus, the current profile of the lumped-C model is slightly worse than that of the actual device model and the same trend continues within an error of 0.1%.

Estimating the equivalent capacitance is a challenging task. The effective capacitance $C_{\rm eff}$ is visible only when the corresponding pMOS transistors in the power-gated domain are turned on.
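The idea of estimating $C_{\rm eff}$ by summing per-cell output loading can be illustrated with the numbers of the Fig. 5 experiment. The sketch below is only a plausibility check: the text does not say how the 4.91 nF value was composed, so counting each sub-circuit as its 100 fF load plus one 22.78 fF INVD24 pin capacitance is an assumption that happens to reproduce the figure.

```python
from typing import Iterable


def effective_capacitance(cell_loads_farad: Iterable[float]) -> float:
    """Lumped C_eff as the plain sum of per-cell output loading."""
    return sum(cell_loads_farad)


N_CELLS = 40_000      # sub-circuits in the gated domain (from the text)
C_LOAD = 100e-15      # per sub-circuit load capacitance C_L (from the text)
C_PIN = 22.78e-15     # INVD24 pin capacitance (from the text)

# Assumption: every cell contributes C_L plus one pin capacitance.
c_eff = effective_capacitance([C_LOAD + C_PIN] * N_CELLS)
print(f"C_eff = {c_eff * 1e9:.2f} nF")  # prints "C_eff = 4.91 nF"
```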
**Power-up Sequencing Algorithm**

Output: I-V curve, switch banks *k*, turn-on time *T* for each bank

1. `T ← 0;` // total simulation time
2. `Q ← 0;` // cumulate electric charge of PG block (C)
3. `I ← 0;` // cumulate current of PG block (A)
4. `V ← 0;` // virtual rail voltage (V)
5. `pswAv ← 0;` // present available PSW under current limit B
6. `pswOn ← 0;` // total turned-on PSW
7. `pswStage ← 0;` // available turned-on PSW per-stage
8. `Build Id lookup table lookupIdCurrent;` // perform HSPICE
9. `while Q < 0.95*VDD*Cap and N ≠ ∅ do` // target electric charge (C)
10. `vd ← VDD - V;` // delta voltage across switch
11. `id ← lookupIdCurrent(vd);` // interpolation for middle vd
12. `pswStage ← 0;`
13. `pswAv ← ⌊B/id⌋;`
14. `pswAv ← pswAv > size(N) ? size(N) : pswAv;`
15. `pswStage ← pswAv - pswOn;`
16. `if int(T*1.0e12) % int(sT*1.0e12) == 0` // stage time
17. `k(T) ← switchBank(pswStage, N, T);` // turn on bank k
18. `else`
19. `pswStage ← 0;`
20. `pswOn ← int(pswOn + pswStage);` // present total on switch
21. `I ← pswOn*id;` // update rail current
22. `dumpCurrentVoltage(T, I, V);` // output I-V curve
23. `Q ← Q + (dT*I);` // update cumulate electric charge
24. `V ← Q/Cap;` // update virtual rail voltage
25. `T ← T + dT;` // update simulation time (s)
26. `end while`
27. `k(T) ← switchBank(N, T);` // turn on the rest

Fig. 6. Proposed power-up sequencing algorithm.

In our proposed method, we first determine a fixed pattern for the entire power-up process, which specifies all the controllable inputs such as primary input, power-on reset, isolation clamp control, etc. Based on such an input pattern, we can obtain the specified value for a significant portion of the gated cells. For those cells still with an unknown value, we assume that a certain percentage of their output capacitance can be seen from the true VDD network based on an empirical rule.
This percentage is set to 50% in our method, which correlates with a fast SPICE simulation result [18].

## IV. PROPOSED FRAMEWORK

Based on the proposed power ramp-up modeling, Section IV-A presents a current budget algorithm for the PSC problem. Section IV-B illustrates an exemplary architecture. Section IV-C further proposes a quantitative metric and a heuristic approach to compensate for the IR drop effect on the active domain. Section IV-D presents a control circuit for generating the proper control signals. Finally, post-silicon tuning for variation control is discussed in Section IV-E. Note that the proposed framework is applied after the power pads and the power switches are allocated. The number and location of each power switch are already known, while the placement and routing of the gated cells are yet to be completed at this stage.

# A. Power-Up Sequencing Algorithm

Fig. 6 gives a pseudocode description of our algorithm for optimizing PSC. The supply voltage is set to $V_{\rm DD}$, the target electric charge of the gated domain is set to $0.95 \times V_{\rm DD} \times Cap$, and the simulation precision *dT* is set to 1 ps. This algorithm iteratively calculates the current and voltage of the gated domain and clusters the power switches into banks, with one bank turned on every fixed time interval. The number of switches in a bank that can be simultaneously turned on is limited by the summation of each switch's current resulting from the voltage difference between the VDD and the virtual VDD. As the voltage of the virtual VDD increases during the sleep-to-active mode transition, the current through a switch decreases, and the number of switches in a bank that can be turned on during a later time interval can be increased.

Fig. 7. Power-up sequence control with a fixed time interval.
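The iteration described above can be sketched in a few dozen lines. This is a minimal illustration under stated assumptions rather than the authors' implementation: all switches are taken to be identical, the HSPICE-characterized table is replaced by a hypothetical piecewise-linear I-V curve, and stage boundaries are detected with an integer step counter instead of the floating-point modulo of Fig. 6. The function and parameter names are made up for this sketch.

```python
from bisect import bisect_left


def make_iv_lookup(v_points, i_points):
    """Piecewise-linear stand-in for the characterized Id(vd) lookup table."""
    def lookup(vd):
        if vd <= v_points[0]:
            return i_points[0]
        if vd >= v_points[-1]:
            return i_points[-1]
        k = bisect_left(v_points, vd)
        frac = (vd - v_points[k - 1]) / (v_points[k] - v_points[k - 1])
        return i_points[k - 1] + frac * (i_points[k] - i_points[k - 1])
    return lookup


def power_up_sequence(n_switches, budget, cap, vdd, lookup_id,
                      stage_steps=10_000, dt=1e-12, target=0.95):
    """Return (banks, peak_current); banks is a list of (turn_on_time_s, size)."""
    if not 0 < lookup_id(vdd) <= budget:
        raise ValueError("budget must cover at least one switch at full VDD")
    banks, charge, v_rail, on, step, peak = [], 0.0, 0.0, 0, 0, 0.0
    while charge < target * vdd * cap and on < n_switches:
        i_d = lookup_id(vdd - v_rail)             # current of one switch now
        if step % stage_steps == 0 and i_d > 0:   # stage boundary, T = n * sT
            allowed = min(int(budget // i_d), n_switches)
            bank = max(allowed - on, 0)
            if bank:
                banks.append((step * dt, bank))
                on += bank
        current = on * i_d                        # rail current
        peak = max(peak, current)
        charge += current * dt                    # accumulate charge
        v_rail = charge / cap                     # lumped-C rail voltage
        step += 1
    if on < n_switches:                           # turn on the rest
        banks.append((step * dt, n_switches - on))
    return banks, peak


# Toy I-V curve (hypothetical shape): linear below 0.3 V, saturated above it
# at the 0.479 mA Idsat quoted for HDRSID0HVT.
toy_iv = make_iv_lookup([0.0, 0.3, 1.2], [0.0, 0.479e-3, 0.479e-3])
banks, peak = power_up_sequence(n_switches=400, budget=0.1, cap=4.91e-9,
                                vdd=1.2, lookup_id=toy_iv)
for t, size in banks:
    print(f"t = {t * 1e9:6.1f} ns: turn on {size} switches")
print(f"peak current = {peak * 1e3:.1f} mA")
```

With `stage_steps=10_000` and `dt=1e-12` the stage interval is 10 ns. Because the per-switch current can only fall as the rail rises, the number of switches allowed under the budget never shrinks, so each later bank simply tops up the count.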
Step "A" initializes the simulation time *T*, the cumulated electric charge of the power-gated domain *Q*, the cumulated current *I*, the virtual rail voltage *V*, the presently available number of switches *pswAv* to be turned on under the current limit *B*, the number of turned-on switches *pswOn*, and the presently available number of turned-on switches per stage *pswStage* based on the specified time interval *sT*. Step "B" obtains a voltage difference *dV* (*vd* in Fig. 6) by subtracting the estimated virtual rail voltage *V* from the supply voltage $V_{\rm DD}$, and calculates the inrush current contributed by each power switch based on a pre-characterized voltage-dependent Id curve. Step "C" determines the number of switches *pswAv* to be turned on under the preset current limit *B*. It also employs a switch banking procedure for clustering a set of *pswStage* switches in every time interval. Step "D" updates the accumulated current value *I*, the accumulated charge value *Q*, and the accumulated voltage value *V*. Finally, if the accumulated charge value is less than the target charge, the process returns to step "B" and repeats. If the accumulated charge value is greater than or equal to the target charge value, the process proceeds to the END step. This enables the overall procedure "E" for controlling the current lookup procedure, the computing procedure, and the updating procedure to halt the production of the renewed booting current and the renewed power switch cluster.

# B. Exemplary Power Switch Banking and Inrush Current Budgeting

Fig. 7 depicts an ideal power-up sequence control in accordance with the proposed algorithm when the fixed time interval is set to 10 ns. This experiment sets the inrush current constraint to 100 mA with the power supply voltage set to 1.2 V. After optimization, we insert delay elements to comply with a set of time slots.
By specifying an upper bound on the inrush current without restricting the size of the power switch bank, the proposed framework maximizes the number of power switches in a bank that can be turned on in a specified time interval (in this case, 209, 13, 15, 21, 34, 100, and 8).

Fig. 8 compares the results of the proposed framework and those of a single-chain Hamiltonian-path method. The dotted line was produced by the Hamiltonian-path method (denoted as HP). The accumulated current [see Fig. 8(a)] gradually increases to a maximum value (137.5 mA at 46 ns) and subsequently decreases until the voltage reaches the normal operation voltage [see Fig. 8(b)]. The dashed line and the solid line were produced using the proposed framework with time intervals of 10 and 1 ns, respectively. The accumulated current is continually kept below the specified limit (100 mA). Fig. 8(b) shows that the voltage increases faster than that achieved using the Hamiltonian-path method. The above experimental results demonstrate that the voltage ramp-up curve of the proposed framework can effectively observe a constant current limit (the smaller the time interval, the better the approximation). However, the difference between the 1 and 10 ns intervals is small (less than 3%).

Fig. 8. Inrush current and voltage ramp-up profile using different power-up sequences. (a) Inrush current. (b) Voltage ramp-up.

### C. Heuristic for Dynamic IR Drop Mitigation

We employ a model based on (1) to mitigate the dynamic IR drop on the active domains. First, power switches are weighted and ranked according to a function of their physical locations and those of the DC sources. Then, they are clustered and routed according to the ranking guidance to prevent any excessive dynamic IR drop on the active domains

$$W(PS_i) = \frac{R_{\text{on},i}}{R'_{\text{DC},i}}$$ (1)

$$R'_{\text{DC},i} = R_{\text{DC}1,i} \,//\, R_{\text{DC}2,i} \,//\, \cdots \,//\, R_{\text{DC}n,i}$$ (2)

where

$$R_{A,B} = R_{\text{oe}} \frac{L_{A,B}}{w}$$

and $W(PS_i)$ is the estimated voltage drop effect when power switch $PS_i$ is turned on, $R_{\text{on},i}$ denotes the effective resistance from $PS_i$ to the fragile active domain, and $R'_{\text{DC},i}$ derived from (2) indicates the effective resistance from $PS_i$ to the various DC sources. $R_{\text{oe}}$ is the sheet resistance of a proper metal layer, $L_{A,B}$ represents the distance between two nodes, A and B, and $w$ is the metal width.

Fig. 9. Analytical power switch ranking model for dynamic IR minimization.

To identify the most fragile domain and its corresponding power mode that will suffer from the worst-case dynamic IR drop, we should take into account the locations of power domains and DC sources, the domains to be turned on simultaneously, the power mesh size, as well as the capacitance of the gated circuit. Since it is computationally unaffordable to simulate every IR-drop map during the transition between every possible pair of adjacent power modes, we usually rely on the designers' knowledge to designate a number of power domains and power modes for worst-case IR drop analysis.

Fig. 9 depicts how our heuristic method estimates the dynamic IR effect on the active domain. Given a set of DC sources and a plurality of power switches placed uniformly over the gated domain, the switches that are physically close to the enable signal (denoted as root) are turned on first. A rule of thumb is to place the enable signal of a power-up domain near the DC source and far away from the active domain

$$W(PS_i) = \frac{R_{\text{on},i}}{R_{\text{DC}1,i} \,//\, R_{\text{DC}2,i}} = \frac{L_{\text{On},i} (L_{\text{DC}1,i} + L_{\text{DC}2,i})}{L_{\text{DC}1,i} \cdot L_{\text{DC}2,i}}$$ (3)

where

$$R_{\mathrm{DC},i} = R_{\mathrm{oe}} \frac{L_{\mathrm{DC},i}}{w}.$$

Equation (3) simplifies the estimation using the distance between a power switch, the DC sources, and the fragile active domain, where $L_{\mathrm{DC1},i}$ and $L_{\mathrm{DC2},i}$ indicate the distance from power switch $\mathrm{PS}_i$ to voltage sources DC1 and DC2, respectively.

Fig. 10. Voltage effect on powering up power switch individually based on the proposed ranking algorithm. (We perform HSPICE and select 100 power switches evenly for the voltage drop effect from the smallest to the largest.)

In order to validate the correctness and the efficiency of the proposed switch ranking algorithm, we build an $80 \times 80$ true-VDD grid, where each unit grid has a resistance of 0.1 $\Omega$. The gated power-up domain and the active domain are connected to the bottom-right quarter and the top-left quarter of the true-VDD mesh, respectively. The gated power-up domain uses $40 \times 40$ TSMC 65 nm header switches (HDRSID2HVT). Estimated from a 65 nm, 1.2 V, 133 MHz 250 K-gate design, the gated domain is represented by a 1 nF capacitance and a 0.5 A current source connected to the virtual VDD. The $R_{\rm on}$ and $I_{\rm dsat}$ are 39.42 $\Omega$ and 8.92 mA, respectively. We rank switches based on their $W(PS_i)$ computed by (3). Next, we perform SPICE simulation for the case in which each switch is turned on independently with all other switches turned off and the initial virtual rail voltage set to 0 V. We then measure the true VDD voltage of the active domain after 100 ps and plot the results in Fig. 10. The switch with a larger rank (decreasing weight) produces a larger voltage drop on the active domain. Thus our ranking scheme based on (3) can accurately predict a switch's effect on $V(PS_i)$.

# D. Configurable Domino-Delay Circuit

Fig. 11 depicts a power-up sequence control system with a configurable domino-delay based on the proposed algorithm. The controller receives a sleep enable signal (denoted by SleepEn) and distributes the sleep enable outputs (denoted by SleepEnD[0-N]) in a domino fashion using a configurable time interval.
The power switches are divided into several banks, which receive the controller's sleep enable signals according to the schedule. We place the controller in the power-gated domain. It cannot be interrupted by the system once it starts sending out power-on signals.

Our framework implements the domino-delay circuit after logic synthesis. We design parameterized RTL code together with a Perl-script-based circuit compiler to make the number of outputs and the time interval configurable. Although the implementation cannot be changed after synthesis, we keep the flexibility of dividing a 200 MHz (5 ns period) reference clock. The area overhead of the generated domino-delay controller is proportional to the number of output enable signals. In general, only about 400 gates are needed for a power-gated domain with 300 memory instances.

Fig. 11. Configurable domino-delay control.

Fig. 12. I-V curve of the TSMC65LP HDRSID2HVT among different process-voltage-temperature corners.

Each sleep enable output is supplied by a buffer tree, which may induce extra delay on the SleepEn signal. As a result, the switches are turned on a little later than our estimation model expects, which means that the actual inrush current is in fact smaller than that estimated by our inrush current model. In other words, our proposed inrush current model is a conservative one: as long as the constraint is satisfied under the model, the actual inrush current will be smaller than reported. In our practical cases, the signal buffer tree is kept within 5 levels, which may add around 0.5 ns of delay (100 ps/level).

### E. Post-Silicon Tuning for Variation Control

The effective capacitance may be affected by the input pattern and by process-voltage-temperature (PVT) variability. The input pattern during the power-up process can be determined in advance, but the PVT variation cannot. As shown in Fig. 12, the saturation current of the TSMC 65 nm header switch HDRSID2HVT at FF, 1.32 V, 25 °C is 1.52× larger than that at TT, 1.2 V, 25 °C. As a result, one predefined turn-on sequence may not be able to satisfy the inrush current constraint for every manufactured chip. One way to solve this problem is to use a large design margin (such as using FF, 1.32 V, 25 °C to generate the power-up sequence), but this would be over-designed for most cases. Another approach is post-silicon tuning according to the result of on-chip process-monitoring circuitry (e.g., a ring oscillator combined with a ripple counter). For example, one can design a new power switch with at least two selectable finger counts and slew-rate controls. During the power-up stage, only the enabled fingers sustain the current and contribute to the loading of the power-gated domain. Once the power-gated domain is charged close to the operating voltage, all fingers are enabled to maintain the performance of the power-gated domain. However, a configurable driving capability, with modified finger numbers in the output buffer, must be included. Furthermore, selective control pins must be designed within the power switch cell to tolerate PVT variation. This increases the area overhead and may cause unexpected routing congestion.

Another dynamic technique, called adaptive body bias (ABB), changes the body bias of the sleep transistors to control their threshold voltage. For fast silicon, reverse body bias (RBB) can be applied to adjust the on-current of the power switches to the same level as used in our estimation (mostly the typical corner). The body bias adjustment is done according to the PVT condition, and is cancelled once the power-gated domain is charged to the operating voltage.
However, extra routing resources are needed for the bias power, which may increase area overhead and routing congestion and cause unexpected timing behavior due to the voltage drop of the bias-power network.

In order to automate the design phase and meet time-to-market requirements, we can divide each original power switch bank into multiple sub-groups. Each sub-group has its own enable control connected to the domino-delay circuit, such that the power-up sequence can be programmed through the time interval and the reference clock.

#### V. POWER SWITCH ROUTING

After the procedure converges to an optimal power-up sequence, we proceed with power switch fabric design. Two problems may arise. First, regular power switches placed uniformly in a checkerboard pattern in the core area may coexist with ring-style power switches placed around hard macros. Second, a power switch connection across hard macros may cause routing congestion.

### A. Power Switch Routing

Our framework generates a distributed routing topology for each power switch bank, wherein each power switch is sequentially connected to the next one. It observes both the maximum fan-out and maximum distance constraints between two adjacent power switches to prevent design rule violations that would require extra always-on buffers to resolve. The maximum fan-out and maximum distance constraints ensure that neither the output loading of the current switch nor the input slew of the next routed switch exceeds the upper bound set in the timing library.

Fig. 13. Power switch routing algorithm. (a) Partitioning. (b) Patching.

1) Power Switch Banking: We partition power switches into several vertical banks (**bankV**) within the specified horizontal search range mFact [see Fig. 13(a)]. The power switches in the same vertical bank are then further divided into disjoint subbanks to satisfy the maximum distance constraint MD. Within a vertical bank, the highest ranked switch is routed first.
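The banking step can be sketched as follows. This is only an illustration under stated assumptions: a vertical bank is taken to be all switches whose x coordinate falls in the same window of width `m_fact`, and the distance between adjacent switches in a bank is approximated by their vertical separation. The `Switch` type and both function names are hypothetical, and the rank-based routing order within a bank is not modeled.

```python
from typing import Dict, List, NamedTuple


class Switch(NamedTuple):
    name: str
    x: float  # placement coordinates, e.g. in um
    y: float


def vertical_banks(switches: List[Switch], m_fact: float) -> List[List[Switch]]:
    """Group switches into vertical banks by x window of width m_fact (assumed rule)."""
    banks: Dict[int, List[Switch]] = {}
    for sw in switches:
        banks.setdefault(int(sw.x // m_fact), []).append(sw)
    return [banks[key] for key in sorted(banks)]


def sub_banks(bank: List[Switch], max_dist: float) -> List[List[Switch]]:
    """Split one bank into disjoint subbanks so that vertically adjacent
    switches in a subbank are at most max_dist apart (stand-in for MD)."""
    ordered = sorted(bank, key=lambda s: s.y)
    chains: List[List[Switch]] = [[ordered[0]]]
    for prev, cur in zip(ordered, ordered[1:]):
        if cur.y - prev.y > max_dist:
            chains.append([cur])    # gap too large: start a new subbank
        else:
            chains[-1].append(cur)  # extend the current subbank
    return chains
```

For example, with `m_fact = 10` and `max_dist = 20`, switches at (1, 0), (2, 10), and (2, 50) fall in one vertical bank that splits into two subbanks, because the last switch is 40 units above its neighbor.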
2) Patching: The entry point or floating input of each vertical bank is connected to the nearest power switch of the adjacent bank [see Fig. 13(b)]. The floating input pin of a switch bank is subsequently patched by the nearest power switch in the adjacent bank at the corresponding vertical coordinate. Finally, an extra always-on buffer is inserted when a feasible driver cannot be found within the maximum distance constraint or when the routing pattern violates the design rules.

Ideally, all switches within a bank should be turned on simultaneously. However, wiring delay causes discrepancies. Fig. 14 depicts the ideal and actual turn-on sequences after buffer tree synthesis has fixed the input-slew and output-loading violations. The fewer levels there are in the buffer tree, the more likely it is to achieve the ideal current. To minimize the number of inserted always-on buffers, we employ a greedy buffer tree synthesizer that rewires the power switch routing while minimizing the logic levels of the sleep control circuit.

Fig. 14. Activated power switch number per stage.

Fig. 15. Optimized efficiency analysis for current budget algorithm.

### B. Efficiency Analysis

Fig. 15 illustrates the effectiveness of the proposed framework for different time interval settings and power switch routing configurations. An ideal (optimal) inrush current profile, as represented by the red dashed line, can be derived by simultaneously turning on the maximum number of power switches under the current constraint. The green dotted line (or blue solid line) depicts a possible actual inrush current profile. In practice, the current may display an insignificant slant at time "1" after the application of a feasible power switch routing. Thus, there should exist a gap representing the *minimum instantaneous burst* time (denoted as *MIB*) to turn on all switches simultaneously. The slope depends on the speed with which the power switches can be turned on.
After the current has reached the specified current limit at time stamp "2", the current decreases. However, the virtual rail voltage keeps increasing due to the current through the switches that are already on. At the next time interval "3", the current increases again and reaches the current limit as a new power switch bank is enabled. The *minimum current recoverability* (denoted as *MCR*) exists due to the intrinsic propagation delay of the power switch. Given a current limit without considering the dynamic IR drop, the boundary (ramp-up voltage and inrush current curves) of the optimal solution is known. However, due to the limited availability of delay elements and the schedule control circuit (as the switch itself incurs a 50–100 ps propagation delay), the time interval for current refreshment cannot be made finer. The experimental results presented later demonstrate that the proposed framework can effectively minimize both MIB and MCR simultaneously by considering the feasibility of the switch routing and the scheduling control element.

#### VI. EXPERIMENTAL RESULTS

We evaluate the proposed approach based on an industrial smart-phone design utilizing an in-house chip implementation flow [13], [14]. The power switches are the mother-daughter switches (HDRDID2BWPHVT) from a TSMC 40 nm MTCMOS cell library [17]. Three different methods were used to implement the power switch fabric. The first method [13] (denoted as Parallel) employed a commercial automatic placement and route (APR) tool [14] to implement the power switch fabric in a multiple-short-chain fashion with a delay inserted between two adjacent chains. The second method is the single-chained method based on finding a Hamiltonian path (denoted as HP) [15]. The third method is the proposed framework with the current limit set to 150 mA (denoted as B150mA). This method initially sets the capacitances of the power-gated domains by assuming that 50% of the total capacitance of unknown nodes can be seen from the virtual rail. Ramp-up and dynamic IR analyses for the true VDD, the power switches, and the virtual rail were performed using a commercial power analysis tool [18].

TABLE I
POWER SWITCH NUMBER, ROUTING LENGTH, AREA OVERHEAD, AND RUN TIME OF THE PROPOSED CURRENT BUDGET/LIMIT FRAMEWORK FOR EACH POWER-UP DOMAIN

| Items\Domains | PD1 | PD2 | PD3 |
|---------------------------|---------|---------|---------|
| Nr. Power Switch | 31559 | 18813 | 27190 |
| Routing Length, Daughter (µm) | 2308228 | 1451779 | 2214791 |
| Area Overhead | 0.68% | 0.75% | 0.67% |
| Leakage Overhead | 0.07% | 0.17% | 0.15% |
| CPU Time, Budgeting (s) | 5.8 | 3.6 | 4.7 |
| CPU Time, PS Routing (s) | 34 | 18 | 28 |

TABLE II
PEAK CURRENT AND THE RAMP-UP TIME OF THE POWER-UP DOMAINS ASSOCIATED WITH DIFFERENT POWER-UP SEQUENCES

| Methods\Domains | Peak inrush current (A), PD1 | Peak inrush current (A), PD2 | Peak inrush current (A), PD3 | Time to 99% VDD (ns), PD1 | Time to 99% VDD (ns), PD2 | Time to 99% VDD (ns), PD3 |
|-----------------|-------|-------|-------|------|------|------|
| Parallel | 1.543 | 0.876 | 1.258 | 68 | 72 | 68 |
| HP | 0.182 | 0.147 | 0.165 | 467 | 407 | 412 |
| B150mA | 0.144 | 0.142 | 0.143 | 384 | 244 | 312 |
| Comparison | | | | | | |
| B150mA/Parallel | 0.09 | 0.16 | 0.11 | 5.65 | 3.41 | 4.59 |
| B150mA/HP | 0.79 | 0.97 | 0.87 | 0.82 | 0.60 | 0.76 |

The current budget is determined according to the input of the designers, who may evaluate the worst design scenario by simulating different power-mode transitions, input patterns, PVT variations, and previous silicon results.
To the best of our knowledge, there is no systematic and efficient method to determine this current budget.

Fig. 16. Inrush current and voltage ramp-up profile of the PD3 power-up domain associated with the Parallel power-up sequence and the proposed framework, respectively. (a) Inrush current. (b) Ramp-up voltage.

TABLE III
WORST DYNAMIC IR DROP (mV) RECORDED IN ACTIVE DOMAINS ASSOCIATED WITH DIFFERENT POWER-UP SEQUENCES

| Methods\Domains | USB | HSPA0 | HSPA1 | EDGE | DSP | SOC |
|-----------------|-------|-------|-------|-------|-------|------|
| Parallel | 152.8 | 61.7 | 72.5 | 159.3 | 175.5 | 90.2 |
| HP | 131.0 | 32.1 | 28.9 | 140.4 | 153.1 | 91.0 |
| RHP | 153.5 | 39.3 | 35.3 | 161.2 | 174.1 | 94.4 |
| B150mA | 129.6 | 32.3 | 29.2 | 136.8 | 151.4 | 92.1 |
| Comparison | | | | | | |
| B150mA/Parallel | 0.85 | 0.52 | 0.40 | 0.86 | 0.86 | 1.02 |
| B150mA/HP | 0.99 | 1.01 | 1.01 | 0.97 | 0.99 | 1.01 |

Table I lists the power switch number, the total wire length of the power switch routing (daughter chain only), the area overhead, and the run time of the proposed framework. Table II lists the peak inrush current and the ramp-up time for each of the three power-gated domains (denoted as PD1, PD2, and PD3). Compared to the Parallel method [13], the proposed framework can reduce the peak inrush current by 8.6 times while lengthening the ramp-up time by 4.5 times, as shown in Fig. 16. Only a 0.75% area penalty is observed. Although the reference method [13] may not meet the specified current limit, we show this comparison because both methods turn on multiple power switches at the same time. The difference is that our proposed method can properly control the turn-on sequence of the power switches while the reference method cannot. This experiment demonstrates that the inrush current control achieved by the proposed method cannot be easily obtained by randomly turning on power switches group by group.
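The text does not state how the 8.6× and 4.5× figures are aggregated over the three domains. One reading, which is an assumption here, is that they are ratios of the Table II column sums over PD1, PD2, and PD3. The short check below uses only the Table II values and reproduces both figures under that assumption. The same aggregation also reproduces the 13% and 27% reductions quoted later in this section for the Hamiltonian-path comparison. The names `PEAK_A`, `RAMP_NS`, and `aggregate_ratio` are illustrative.

```python
from typing import Dict, List

# Values copied from Table II, ordered as (PD1, PD2, PD3).
PEAK_A: Dict[str, List[float]] = {
    "Parallel": [1.543, 0.876, 1.258],
    "HP": [0.182, 0.147, 0.165],
    "B150mA": [0.144, 0.142, 0.143],
}
RAMP_NS: Dict[str, List[float]] = {
    "Parallel": [68, 72, 68],
    "HP": [467, 407, 412],
    "B150mA": [384, 244, 312],
}


def aggregate_ratio(table: Dict[str, List[float]], num: str, den: str) -> float:
    """Ratio of the per-method sums over all three domains (assumed aggregation)."""
    return sum(table[num]) / sum(table[den])


# Proposed framework versus the Parallel method.
peak_reduction = aggregate_ratio(PEAK_A, "Parallel", "B150mA")
ramp_slowdown = aggregate_ratio(RAMP_NS, "B150mA", "Parallel")

# Proposed framework versus the Hamiltonian-path method (fractional savings).
peak_saving_vs_hp = 1.0 - aggregate_ratio(PEAK_A, "B150mA", "HP")
ramp_saving_vs_hp = 1.0 - aggregate_ratio(RAMP_NS, "B150mA", "HP")

if __name__ == "__main__":
    print(f"peak reduction vs Parallel: {peak_reduction:.1f}x")
    print(f"ramp-up slowdown vs Parallel: {ramp_slowdown:.1f}x")
    print(f"peak saving vs HP: {peak_saving_vs_hp:.0%}")
    print(f"ramp-up saving vs HP: {ramp_saving_vs_hp:.0%}")
```

The per-domain ratios in the Comparison rows of Table II vary around these aggregates, so the aggregate numbers should be read as a summary rather than a per-domain guarantee.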
Table III lists the worst dynamic IR drop scenario, wherein three power-gated domains (PD1, PD2, and PD3) are powered up simultaneously while the others are still active. Another control experiment (denoted as RHP) reverses the turn-on order of HP. Although the dynamic IR drop results are similar to those based on the Hamiltonian-path fabric, the proposed framework significantly reduces the inrush current and ramp-up time.

For the MTCMOS design used in our experiment, we found that USB is the most fragile domain and that the switch routing of the three communication-protocol domains may significantly affect the dynamic IR drop of the USB domain, for the following reasons. First, the three communication-protocol domains often turn on and off at the same time based on the power-mode table, and hence the simultaneous current consumption is magnified. Second, the virtual-VDD mesh of these three blocks is large while the virtual-VDD mesh of the USB domain is relatively small, which makes the USB domain inherently more sensitive to power noise. Third, the USB domain is located close to the communication-protocol domains and at the same time near the center of the true-VDD mesh, which is relatively far away from the four DC sources on the mesh's boundaries. As a result, its effective resistance to the DC sources is relatively large while its effective resistance to the communication-protocol domains is small. Therefore, in our experiments, we focused on minimizing the dynamic IR drop for the USB domain when doing the power switch routing for the three communication-protocol domains.

Fig. 17. Inrush current and voltage ramp-up profile of the PD3 power-up domain associated with the Hamiltonian-path method and the proposed framework, respectively. (a) Inrush current. (b) Ramp-up voltage.

Fig. 17 demonstrates the effectiveness of the proposed framework for domain PD3. Our approach successfully constrains the inrush current within 143 mA.
The peak inrush current and ramp-up time are reduced by 13% and 27%, respectively, compared to the Hamiltonian-path method [15].

#### VII. CONCLUSION AND FUTURE WORK

We have presented a power-up sequence generation method to address the problem of minimizing ramp-up time under a peak inrush current constraint for MTCMOS designs. The proposed framework includes a current budget algorithm based on an effective model, an analytical routing guidance, and a configurable domino-delay controller. Experimental results demonstrate the effectiveness of the proposed framework in minimizing the ramp-up time while mitigating the dynamic IR effect on the active domains under a specified peak current limit.

We have not incorporated the effects of the package model in this work. One necessary step in our future work is to explore an analytical model considering package *RLC* parasitics. For package selection and cost reduction, it is possible to extend our work to analyze the impact of decoupling capacitance on ramp-up time, inrush current, and dynamic IR drop. Considering the inductance imposed by the package would result in a different inrush current estimation. We would like to address this point in our future work. However, we can still consider our proposed framework as a conservative method to limit the inrush current, since the package inductance will slow down the voltage ramp-up and hence the actual inrush current will be smaller than our estimation. Thus, the power-up sequence generated by our proposed method can still satisfy the inrush current constraint.

In nanometer technology, device variations can alter the inrush current considerably at each process-voltage-temperature corner. If we consider all the variations, the resulting power-up sequence may be either overestimated or underestimated. Our present version can only adjust the time interval and the reference clock post-silicon.
Thus, some on-chip hardware performance monitors with an adaptive voltage or body-bias controller should be taken into account for variation control.

#### REFERENCES

- [1] S. Mutoh, T. Douseki, Y. Matsuya, T. Aoki, and J. Yamada, "1 V high-speed digital circuit technology with 0.5 μm multi-threshold CMOS," in *Proc. IEEE 6th Int. Annu. ASIC Conf.*, 1993, pp. 186–189.
- [2] C. Long and L. He, "Distributed sleep transistors network for power reduction," in *Proc. DAC*, 2003, pp. 181–186.
- [3] M. Anis, S. Areibi, M. Mahmoud, and M. Elmasry, "Dynamic and leakage power reduction in MTCMOS circuits using an automated efficient gate clustering technique," in *Proc. DAC*, 2002, pp. 480–485.
- [4] S. H. Chen and J. Y. Lin, "Experiences of low power design implementation and verification," in *Proc. ASP-DAC*, 2008, pp. 742–747.
- [5] J. Kao, S. Narendra, and A. Chandrakasan, "MTCMOS hierarchical sizing based on mutual exclusive discharge patterns," in *Proc. DAC*, 1998, pp. 495–500.
- [6] C. Hwang, C. Kang, and M. Pedram, "Gate sizing and replication to minimize the effects of virtual ground parasitic resistances in MTCMOS designs," in *Proc. ISQED*, 2006, pp. 741–746.
- [7] J. Kao, A. Chandrakasan, and D. Antoniadis, "Transistor sizing issues and tool for multi-threshold CMOS technology," in *Proc. DAC*, 1997, pp. 409–414.
- [8] A. Davoodi and A. Srivastava, "Wake-up protocols for controlling current surges in MTCMOS-based technology," in *Proc. ASP-DAC*, 2005, pp. 868–871.
- [9] A. Ramalingam, A. Devgan, and D. Z. Pan, "Wake-up scheduling in MTCMOS circuits using successive relaxation to minimize ground bounce," *J. Low Power Electron.*, vol. 3, no. 1, pp. 28–35, Apr. 2007.
- [10] Y. T. Chen, D. C. Juan, M. C. Lee, and S. C. Chang, "An efficient wake-up schedule during power mode transition considering spurious glitches phenomenon," in *Proc. ICCAD*, 2007, pp. 779–782.
- [11] H. Jiang and M. Marek-Sadowska, "Power gating scheduling for power/ground noise reduction," in *Proc. DAC*, 2008, pp. 980–985.
- [12] Y. Lee, D. K. Jeong, and T. Kim, "Simultaneous control of power/ground current, wakeup time and transistor overhead in power gated circuits," in *Proc. ICCAD*, 2008, pp. 169–172.
- [13] M. Keating, D. Flynn, R. Aitken, A. Gibbons, and K. Shi, *Low Power Methodology Manual for System on Chip Design*. New York: Springer, 2007, ch. 14, pp. 225–247 [Online]. Available: http://www.lpmm-book.org
- [14] Cadence Design System, Inc., San Jose, CA, "Encounter digital implementation system user guide," Mar. 2011, pp. 491–532.
- [15] T. M. Tseng, M. C.-T. Chao, C. P. Lu, and C. H. Lo, "Power-switch routing for coarse-grain MTCMOS technologies," in *Proc. ICCAD*, 2009, pp. 39–46.
- [16] S. H. Chen, K. C. Chu, J. Y. Lin, and C. H. Tsai, "DFM/DFY practices during physical designs for timing, signal integrity, and power," in *Proc. ASP-DAC*, 2007, pp. 232–237.
- [17] Taiwan Semiconductor Manufacturing Company, Ltd., Hsinchu, Taiwan, "TSMC reference flow 8.0," 2007.
- [18] "RedHawk Users' Manual," 10.1 ed. Apache Design Solutions Inc., San Jose, CA, 2010.
- [19] S. Kim, S. V. Kosonocky, and D. R. Knebel, "Understanding and minimizing ground bounce during mode transition of power gating structures," in *Proc. ISLPED*, 2003, pp. 22–25.
- [20] H. Jiao and V. Kursun, "Ground bouncing noise aware combinational MTCMOS circuits," *IEEE Trans. Circuits Syst. I, Reg. Papers*, vol. 57, no. 8, pp. 2053–2065, Aug. 2010.
- [21] A. Calimera, L. Benini, A. Macii, E. Macii, and M. Poncino, "Design of a flexible reactivation cell for safe power-mode transition in power-gated circuits," *IEEE Trans. Circuits Syst. I, Reg. Papers*, vol. 56, no. 9, pp. 1979–1993, Sep. 2009.
- [22] D. Juan, Y. Chen, M. Lee, and S. Chang, "An efficient wake-up strategy considering spurious glitches phenomenon for power gating designs," *IEEE Trans. Very Large Scale Integr. (VLSI) Syst.*, vol. 18, no. 2, pp. 246–255, Feb. 2010.

Shi-Hao Chen received the B.S. and M.S.
degrees in electronic engineering from Chung Yuan Christian University, Chung Li, Taiwan, in 1996 and 1998, respectively. He is currently pursuing the Ph.D. degree in computer science at the National Tsing Hua University, Hsinchu, Taiwan. He is currently a Deputy Director of the Design Service Division, Global Unichip Corporation, an SOC Design Foundry, Hsinchu, Taiwan. His research interests include design flow automation, physical design, and low power design methodologies.

Youn-Long Lin (SM'00) received the B.S. degree in electronics engineering from National Taiwan University of Science and Technology, Taipei, Taiwan, in 1982, and the Ph.D. degree in computer science from the University of Illinois, Urbana-Champaign, in 1987. Upon his graduation, he joined National Tsing Hua University, Hsinchu, Taiwan, where he has served as Chairman of the Computer Science Department and Vice President of Research and Development. In 1998 he co-founded Global UniChip Corporation, an SOC Design Foundry, Hsinchu, Taiwan. Between 2001 and 2003, he worked for UniChip as its Chief Technical Officer and Executive Vice President. He is now a Chair Professor of computer science at National Tsing Hua University. He is also an Adjunct Professor with Peking University, Beijing, China, and a Guest Professor with Waseda University, Japan. His primary research interest is in computer-aided design (CAD) of VLSI circuits with emphasis on physical design automation and high-level synthesis. He coauthored the book *High Level Synthesis—Introduction to Chip and System Design* (Kluwer, 1992). His current research focus is on design technology for system-on-a-chips (SOC) employing reusable silicon intellectual properties (IPs). Prof. Lin is an Associate Editor of ACM Transactions on Embedded Computing Systems (TECS).

Mango C.-T. Chao (M'07) received the B.S. and M.S. degrees from the Department of Computer and Information Science, National Chiao Tung University, Hsinchu, Taiwan, in 1998 and 2000, respectively, and the Ph.D. degree from the Department of Electrical and Computer Engineering, University of California, Santa Barbara, in 2006. He is currently an Assistant Professor with the Department of Electronics Engineering, National Chiao Tung University, Hsinchu, Taiwan. His current research interests include VLSI testing, TFT circuitry design, and physical design automation.