# National Chiao Tung University

# Department of Electronics Engineering & Institute of Electronics

### Master's Thesis

可用於工作在次臨界/近臨界電壓區間綠色節能科技 之製程、電壓、溫度高適應性超低電壓時脈系統設計

Ultra-Low Voltage PVT-Robust Clock System Design for Sub/Near-Threshold Green Technologies

Student: Chung-Ying Hsieh

Advisor: Prof. Wei Hwang

#### July 2010

# Ultra-Low Voltage PVT-Robust Clock System Design for Sub/Near-Threshold Green Technologies

Student: Chung-Ying Hsieh

Advisor: Prof. Wei Hwang



Submitted to Department of Electronics Engineering & Institute of Electronics

College of Electrical Engineering and Computer Engineering

National Chiao Tung University

in Partial Fulfillment of the Requirements

for the Degree of

Master

in

**Electronics Engineering** 

July 2010

Hsinchu, Taiwan, Republic of China


Student: Chung-Ying Hsieh

Advisor: Prof. Wei Hwang

Department of Electronics Engineering & Institute of Electronics, National Chiao Tung University

#### Abstract

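This thesis proposes a process-, voltage-, and temperature-robust (PVT-robust) ultra-low-voltage clock system for green technologies operating in the sub/near-threshold region. For variation-aware circuit design, it proposes unified logical effort models built across four nanoscale CMOS generations and across environmental variations, with supply voltage from 0.1 V to 1 V and temperature from -50 °C to 125 °C. The average modeling error does not exceed 8.4%.

Using the unified logical effort models, a thermally robust buffered clock tree is proposed to mitigate temperature-induced clock skew. Logical effort, an index of propagation delay that varies with temperature and supply voltage, is controlled by tunable-width buffers. In this design, temperature sensors measure the temperature at different locations and dynamically adjust the logical effort of the corresponding buffers to reduce clock skew. In UMC 65 nm technology, the tunable-width buffers and a clock H-tree are built in post-layout simulation, which shows that clock skew is reduced by up to 97.8% and by 72.2% on average.

A sub/near-threshold programmable clock generator is proposed that produces an output clock at 1/8 to 4 times the reference clock frequency. Variation-aware logic design is applied in this clock generator. A pulse-circulating structure reduces the output clock jitter caused by process variation. In addition, a PVT compensation unit is implemented to adjust the locking range of the clock generator. The reference clock frequency is 625 kHz at 0.2 V and 5 MHz at 0.5 V.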

### Ultra-Low Voltage PVT-Robust Clock System Design for Sub/Near-Threshold Green Technologies

Student : Chung-Ying Hsieh

Advisor : Prof. Wei Hwang

Department of Electronics Engineering & Institute of Electronics National Chiao-Tung University

#### ABSTRACT

This thesis proposes an ultra-low voltage (ULV) PVT-robust clock system for sub/near-threshold green technologies. For variation-aware circuit design, unified logical effort models are proposed. They have been established over four nanoscale CMOS generations and over environmental variations covering supply voltages from 0.1 V to 1 V and temperatures from -50 °C to 125 °C. The average modeling error is no more than 8.40%.

Using the unified logical effort models, a thermally robust buffered clock tree is proposed to mitigate temperature-induced clock skew. Logical effort, an index of propagation delay that varies with thermal and supply voltage conditions, is controlled by a tunable-width buffer. In this design, temperature sensors sense the temperature at different parts of the clock tree and dynamically adjust the logical effort of the corresponding clock buffers to reduce the clock skew. In UMC 65 nm technology, tunable-width buffers and a clock H-tree routed in 7th-layer metal are constructed and evaluated in post-layout simulation, which shows that the clock skew is reduced by up to 97.8% and by 72.2% on average.

A sub/near-threshold programmable clock generator is proposed that produces an output clock at 1/8 to 4 times the reference clock frequency. Variation-aware logic design is applied in the clock generator. The adoption of a pulse-circulating scheme reduces process-induced output clock jitter. In addition, a PVT compensation unit is realized to adjust the locking range of the clock generator. The reference clock frequency is 625 kHz at 0.2 V and 5 MHz at 0.5 V.

# Contents

| Section | Title |
|---------|-------|
| Chapter 1 | Introduction |
| 1.1 | Background |
| 1.2 | Motivation |
| 1.3 | Organization |
| Chapter 2 | Overview on Clock Distribution Networks and Clock Generator |
| 2.1 | An Overview on Clock Distribution Networks [2.1] |
| 2.1.1 | Synchronous Systems |
| 2.1.2 | Theoretical Background of Clock Skew |
| 2.1.3 | Clock Distribution Design of Custom VLSI Circuits |
| 2.1.3.1 | Buffered Clock Distribution Trees |
| 2.1.3.2 | Symmetric H-Tree Distribution Networks |
| 2.1.4 | Previous Works on Temperature-Aware Clock Distribution Design |
| 2.1.4.1 | Dynamic Thermal Clock Skew Compensation Using Tunable Delay Buffers [2.8] |
| 2.1.4.2 | Design of Thermally Robust Clock Trees Using Dynamically Adaptive Clock Buffers [2.9] |
| 2.2 | An Overview on Clock Generator |
| 2.2.1 | DLL-Based Clock Generator [2.10] |
| 2.2.2 | PLL-Based Clock Generator [2.11] |
| 2.2.3 | Multi-Phase Clock Generator Based on a Time-to-Digital Converter [2.12] |
| 2.2.4 | Programmable Clock Generator Based on a Cyclic Clock Multiplier [2.13] |
| Chapter 3 | Unified Logical Effort Models over Wide Supply Voltage and Temperature Range |
| 3.1 | Introduction |
| 3.2 | Classic Logical Effort Model [3.3] |
| 3.3 | Unified Logical Effort Models |
| 3.3.1 | Strong-Inversion (Super-Threshold) Region |
| 3.3.2 | Moderate-Inversion (Near-Threshold) Region |
| 3.3.3 | Weak-Inversion (Sub-Threshold) Region |
| 3.4 | Experimental Result |
| 3.4.1 | Test Vehicle I |
| 3.4.2 | Test Vehicle II |
| Chapter 4 | A Thermally Robust Buffered Clock Tree Using Logical Effort Compensation |
| 4.1 | Introduction |
| 4.2 | Creating Constant Gate Delay against Thermal Variation |
| 4.2.1 | Effects of Dynamically Tuning MOSFET Width on Logical Effort |
| 4.2.2 | Creating Constant Gate Delay |
| 4.3 | A Thermally Robust Buffered Clock Tree Using Logical Effort Compensation |
| 4.4 | Simulation Results |
| Chapter 5 | A Programmable Clock Generator for Sub- and Near-Threshold DVFS System |
| 5.1 | Introduction |
| 5.2 | System Architecture |
| 5.3 | Variation-Aware Logic Design |
| 5.3.1 | Sub-Threshold Logic Design Challenge |
| 5.3.2 | Mitigating Variation by Upsizing Transistors |
| 5.4 | PVT Compensation for Locking Range of Delay Line |
| 5.4.1 | Delay Ratio of FO1-INV to FO2-NAND |
| 5.4.2 | Procedure of PVT Compensation for Locking Range of Delay Line |
| 5.5 | Circuit Description |
| 5.5.1 | Lock-In Delay Line (LIDL) Controller |
| 5.5.2 | Lock-In Delay Line (LIDL) |
| 5.5.3 | Pulse Generator |
| 5.5.4 | SEL Generator |
| 5.5.5 | Phase Detector |
| 5.5.6 | Frequency Divider |
| 5.6 | Combination of Clock Generator and Clock Tree |
| 5.7 | Design Implementation |
| 5.8 | Simulation Results |
| Chapter 6 | Conclusions and Future Work |
| 6.1 | Conclusions |
| 6.2 | Future Work |
| | Bibliography |

### List of Tables

| Table | Caption |
|-------|---------|
| 3.1 | Function A(T) for strong-inversion |
| 3.2 | Functions B(T), C(T) and D(T) for moderate-inversion |
| 3.3 | Functions E(T) and F(T) for weak-inversion |
| 3.4 | Logic effort modeling error |
| 3.5 | Ratios of logical effort for logic gates |
| 4.1 | Compensation improvement of clock skew in sub/near-threshold region |
| 5.1 | Frequency selection range, $f_{out}$ and $f_{ref}$ are the frequencies of output and reference clocks |
| 5.2 | The relation between control signal and output frequency |
| 5.3 | Summary of proposed clock generator |



### **List of Figures**

| Figure | Caption |
|--------|---------|
| 2.1 | Local data path |
| 2.2 | Timing diagram of clocked data path |
| 2.3 | Tree structure of clock distribution network |
| 2.4 | Common structures of clock distribution networks including a trunk, tree, mesh and H-tree |
| 2.5 | Three-level buffer clock distribution network |
| 2.6 | Symmetric H-tree and X-tree clock distribution networks |
| 2.7 | Structure of the tunable delay buffer |
| 2.8 | Delay and normalized power versus number of taps |
| 2.9 | Online skew compensation architecture |
| 2.10 | Overall flow of the proposed methodology |
| 2.11 | Thermally adaptive buffer schematic |
| 2.12 | Control waveforms coming from the wave-shaping circuits |
| 2.13 | Temperature-sensor schematic |
| 2.14 | Temperature-sensor output-voltage levels |
| 2.15 | Block diagram of the proposed DLL-based frequency multiplier |
| 2.16 | System architecture |
| 2.17 | Architecture of the synchronous multi-phase clock generator |
| 2.18 | The all-digital clock generator using cyclic clock multiplier |
| 2.19 | The timing diagram of the clock generator |
| 3.1 | Simplified physical alpha-power law current equations |
| 3.2 | $V_T$ - $T$ plot |
| 3.3 | 1/g in UMC 65-nm technology (strong-inversion) |
| 3.4 | 1/g in UMC 65-nm technology (moderate-inversion) |
| 3.5 | 1/g in UMC 65-nm technology (weak-inversion) |
| 3.6 | Unified logical effort models |
| 3.7 | Test vehicle I for proposed logical effort models |
| 3.8 | Simulated and estimated delays for the circuit path of Figure 3.7 in UMC 90nm technology (strong-inversion) |
| 3.9 | Simulated and estimated delays for the circuit path of Figure 3.7 in UMC 90nm technology (moderate-inversion) |
| 3.10 | Simulated and estimated delays for the circuit path of Figure 3.7 in UMC 90nm technology (weak-inversion) |
| 3.11 | 8-to-256 decoder for a 32×256 register file |
| 3.12 | 8-to-256 decoder |
| 3.13 | Simulated and estimated delays for Figure 3.12 in UMC 65nm technology (strong-inversion) |
| 3.14 | Simulated and estimated delays for Figure 3.12 in UMC 65nm technology (moderate-inversion) |
| 3.15 | Simulated and estimated delays for Figure 3.12 in UMC 65nm technology (weak-inversion) |
| 4.1 | Buffered Clock Tree |
| 4.2 | Temperature effect on edge skew between two buffers |
| 4.3 | Inversion of the temperature dependence of drain saturation current for a PTM 45-nm (a) nMOS transistor and (b) pMOS transistor [4.2] |
| 4.4 | Tunable-width inverter |
| 4.5 | Logical effort with two different widths |
| 4.6 | Tuned $W_2$ according to various thermal conditions |
| 4.7 | A ring oscillator composed of 9-stage tunable-width inverters |
| 4.8 | Normalized period before and after compensation |
| 4.9 | Tunable-width buffer with control blocks |
| 4.10 | Temperature Sensor Proposed by Shi-Wen Chen |
| 4.11 | Layout of a tunable-width inverter |
| 5.1 | Proposed clock generator for sub- and near-threshold DVFS system |
| 5.2 | Finite state machine |
| 5.3 | The schematic diagram of waveform from state Reset to SAR control |
| 5.4 | The schematic diagram of waveform from state SAR control to Lock |
| 5.5 | Effects of variations and reduced $I_{ON}$ / $I_{OFF}$ on sub-Vt inverter voltage transfer curve [5.8] |
| 5.6 | Back-to-back configuration |
| 5.7 | Schematic diagram of PVT compensation |
| 5.8 | Topology of delay line (lattice delay line [5.13]) used in the proposed clock generator |
| 5.9 | Ring oscillator using (a) FO1-INV cell, (b) FO2-NAND cell |
| 5.10 | Periods of ring oscillators (composed of FO1-INV and FO2-NAND) at 0.2V |
| 5.11 | Periods of ring oscillators (composed of FO1-INV and FO2-NAND) at 0.5V |
| 5.18 | Pulse generator |
| 5.19 | SEL generator |
| 5.20 | SEL waveform while State = SAR |
| 5.21 | SEL waveform while State = Lock |
| 5.22 | Phase detector |
| 5.23 | $RST_{PD}$ generator |
| 5.24 | Frequency divider |
| 5.25 | Combination of proposed thermally robust clock tree and programmable clock generator |
| 5.26 | Layout view of proposed clock generator |
| 5.27 | The operation waveform at 0.2V with 4X output clock |
| 5.28 | The operation waveform at 0.5V with 4X output clock |
| 5.29 | PVT compensation for locking range of clock generator at 0.2V TT corner (a) before compensation (b) after compensation |
| 5.30 | PVT compensation for locking range of clock generator at 0.2V FF corner (a) before compensation (b) after compensation |
| 5.31 | PVT compensation for locking range of clock generator at 0.5V TT corner (a) before compensation (b) after compensation |
| 5.32 | PVT compensation for locking range of clock generator at 0.5V FF corner (a) before compensation (b) after compensation |
| 6.1 | Sub/near-threshold DVFS system |

# Chapter 1 Introduction

#### **1.1 Background**

With the evolution of CMOS process technology, the number of transistors in a digital core doubles about every two years. The resulting increases in transistor density and operating frequency have shortened battery life. For some applications, such as wireless body area network (WBAN) sensors, the critical consideration is lifetime rather than operating speed. Thus, achieving a low-power design while still meeting speed and reliability requirements is an important issue.

Ultralow power dissipation can be achieved by operating digital circuits at scaled supply voltages, albeit with degraded speed and increased susceptibility to parameter variations. The operating voltage is scaled down to the sub-threshold or near-threshold region depending on the power and speed requirements of the system. Sub/near-threshold operation has been studied extensively. The work in [1.1] demonstrates optimizations of subthreshold design from the device, circuit, and architecture perspectives, which differ from conventional superthreshold design. It also analyzes these optimizations from the energy-dissipation point of view and shows that robust operation of ultralow-voltage systems is feasible. In [1.2], the trade-off between power and performance is discussed, along with the extreme ends of this balance. Another paper [1.3] gives examples showing that designing flexibility into ultralow-power (ULP) systems across the architecture and circuit levels can meet both the ULP requirements and the performance demands. It also presents a method that expands on ultradynamic voltage scaling (UDVS) by combining multiple supply voltages with component-level power switches, providing more efficient operation at any energy-delay point and low-overhead switching between points. The UDVS technique is described in [1.4], which presents voltage-scalable circuits such as logic cells, SRAMs, ADCs, and dc-dc converters and highlights some applications that use these circuits as building blocks. In [1.5], the authors explore how design in the moderate-inversion region helps recover some of the performance lost in the weak-inversion region, and they develop an energy-delay modeling framework that extends over the weak, moderate, and strong inversion regions.

The dynamic voltage and frequency scaling (DVFS) technique is widely used to save power. In addition, advances in ultra-low voltage (ULV) circuit design have demonstrated large power savings. Consequently, combining DVFS and ULV design techniques has great potential for applications that demand ultra-low power.

#### **1.2 Motivation**

In a DVFS system, clock generation and distribution are realized by the clock generator and the clock tree. The main potential problems in a clock system are clock jitter and clock skew: jitter comes from the clock generator, and skew comes from the clock tree. Both may cause functional errors in digital circuits, and both become more serious in the ULV region because of environmental variations. These variations include process, voltage, and temperature (PVT), and they should be considered carefully when designing ULV circuits. This thesis targets sub/near-threshold clock systems.

#### **1.3 Organization**

This thesis comprises six chapters, which focus on unified logical effort models and on the clock system in the sub/near-threshold region. The clock system includes a clock tree and a programmable clock generator. The content of each chapter is briefly introduced below.

Chapter 2 gives an overview of clock trees and clock generators.

Chapter 3 describes the proposed unified logical effort models, which cover the super-, near-, and sub-threshold regions.

Chapter 4 presents the proposed thermally robust clock tree using logical effort compensation. The unified logical effort models are used for thermal compensation of the clock buffers. The chapter ends with the layout and simulation results.

Chapter 5 presents the proposed programmable clock generator, which targets sub/near-threshold DVFS systems. It concludes with the layout implementation, simulation results, and a performance summary.

Chapter 6 concludes the thesis and discusses future work.

# Chapter 2 Overview on Clock Distribution Networks and Clock Generator

#### 2.1 An Overview on Clock Distribution Networks [2.1]

Clock distribution networks synchronize the flow of data signals among synchronous data paths. The design of these networks directly influences system-wide performance and reliability. Clock signals deserve particular attention because their characteristics are critical to a synchronous system: they are typically loaded with the greatest fanout, travel over the longest distances, and operate at the highest speeds. Furthermore, the clock waveforms must be clean and sharp to guarantee error-free data movement. However, the high resistance of long global metal lines degrades the clock signals, and this resistance increases further with technology scaling. Thus, the design of the clock distribution network deserves careful attention because of its effect on synchronous performance. This section covers synchronous systems, the theoretical background of clock skew, and clock distribution design.

#### 2.1.1 Synchronous Systems

In a synchronous system, the clock signal defines the timing of data transfers. A synchronous system consists of cascaded banks of sequential registers with combinational logic between each set of registers. Timing requirements between each set of registers are satisfied by carefully setting the worst-case timing of the combinational logic. Properly designing the clock distribution network further guarantees that the timing requirements are met.

A digital synchronous system is composed of logic elements and clocked registers. For an ordered pair of registers ($R_1$, $R_2$), $R_1 \Rightarrow R_2$ denotes that a signal switching at the output of $R_1$ propagates to the input of $R_2$. Such a pair is called a sequentially-adjacent pair of registers. Figure 2.1 shows a local data path.



The minimum clock period is determined by the worst-case delay between any two registers in a sequential data path:

$$\frac{1}{f_{clkMAX}} = T_{CP}(\min) = T_{PD}(\max) + T_{Skew} \tag{2.1}$$

$$T_{PD}(\max) = T_{C-Q} + T_{Logic} + T_{Int} + T_{Set-up} = D(i, f) \tag{2.2}$$

where $T_{PD}(\max)$ is the maximum data path delay, $T_{C-Q}$ is the time required for the data to leave the initial register, $T_{Logic}$ and $T_{Int}$ are the propagation times through the logic and the interconnect, and $T_{Set-up}$ is the time required for the data to successfully propagate to and latch within the final register of the data path. $D(i, f)$ denotes this total delay between the initial register $i$ and the final register $f$.
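As a quick numerical check of Eqs. (2.1) and (2.2), the following Python sketch computes the maximum clock frequency of one register-to-register path. The function names and all delay values are hypothetical and chosen only for illustration; they are not taken from this thesis.

```python
def max_path_delay(t_cq, t_logic, t_int, t_setup):
    """Eq. (2.2): worst-case delay of a register-to-register data path."""
    return t_cq + t_logic + t_int + t_setup


def min_clock_period(t_pd_max, t_skew):
    """Eq. (2.1): minimum clock period for a given path delay and clock skew."""
    return t_pd_max + t_skew


# Hypothetical delays in nanoseconds (illustrative values only).
t_pd_ns = max_path_delay(t_cq=0.5, t_logic=6.0, t_int=1.0, t_setup=0.5)
t_cp_ns = min_clock_period(t_pd_ns, t_skew=2.0)

# A period in ns converts to a frequency in MHz as 1000 / period.
f_max_mhz = 1e3 / t_cp_ns

print(f"T_PD(max) = {t_pd_ns} ns, T_CP(min) = {t_cp_ns} ns, f_clkMAX = {f_max_mhz} MHz")
```

With these assumed numbers the path delay is 8 ns, so 2 ns of skew raises the minimum period to 10 ns and limits the clock to 100 MHz. This shows how skew enters Eq. (2.1) as a direct penalty on the clock period.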

#### 2.1.2 Theoretical Background of Clock Skew

Figure 2.2 shows a generalized synchronous data path. $C_i$ and $C_f$ are the clock signals driving a sequentially-adjacent pair of registers, the initial register $R_i$ and the final register $R_f$. Both clock signals are generated from the same clock source. $T_{Ci}$ and $T_{Cj}$ denote the propagation delays from the clock source to the $i$th and $j$th clocked registers. The clock source is designed to generate a specific clock waveform for synchronizing each register. Equipotential clocking is the most commonly used scheme; under ideal conditions, it makes the clocking events occur at all registers simultaneously.



Figure 2.2 Timing diagram of clocked data path

Clock skew is defined as the difference in clock signal arrival time between two sequentially-adjacent registers. The clock skew $T_{Skew}$ is zero if the clock signals $C_i$ and $C_f$ are in complete synchronism, that is, if they arrive at their respective registers at the same time. A nonzero clock skew is the difference between the arrival times of the $i$th and $j$th clock signals:

$$T_{Skewij} = T_{Ci} - T_{Cj} \tag{2.3}$$

where  $T_{Ci}$  and  $T_{Cj}$  are the clock delays from the clock source to registers  $R_i$  and  $R_j$ .
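To make the sign convention of Eq. (2.3) concrete, the short Python sketch below evaluates the skew for a few register pairs. The register names and arrival times are hypothetical placeholders, not measurements from this work.

```python
def clock_skew(t_ci, t_cj):
    """Eq. (2.3): skew between registers i and j from their clock delays."""
    return t_ci - t_cj


# Hypothetical clock delays from the source to each register, in ns.
arrival_ns = {"R1": 1.20, "R2": 1.35, "R3": 1.10}

# Sequentially-adjacent pairs (i, j) whose skew we want to inspect.
pairs = [("R1", "R2"), ("R2", "R3")]

skews = {(i, j): clock_skew(arrival_ns[i], arrival_ns[j]) for i, j in pairs}
worst_skew_ns = max(abs(s) for s in skews.values())

for (i, j), s in skews.items():
    # A negative value means the clock reaches register i before register j.
    print(f"T_Skew({i},{j}) = {s:+.2f} ns")
print(f"worst-case |skew| = {worst_skew_ns:.2f} ns")
```

In this example the pair (R1, R2) has a skew of about -0.15 ns and the pair (R2, R3) about +0.25 ns, so the sign depends on which register of the pair receives the clock later.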

Clock skew arises for a variety of reasons. Wann and Franklin [2.2] identify four causes of clock skew: (1) differences in line lengths from the clock source to the clocked registers, (2) differences in the delays of the clock distribution buffers, (3) differences in passive interconnect parameters, such as line resistivity and via/contact resistance, and (4) differences in active device parameters, such as MOS threshold voltages and channel mobility, in the clock buffers. Among these, the distributed clock buffers are the main source of clock skew.

#### 2.1.3 Clock Distribution Design of Custom VLSI Circuits

Many approaches have been developed for designing clock distribution networks in synchronous digital integrated circuits. The clock distribution network affects the tradeoffs among system speed, physical die area, and power dissipation. Thus, the design methodology and structural topology of the clock distribution network should be considered during system development.

Many clock distribution strategies have been developed. The buffered clock tree, presented in Section 2.1.3.1, is the most general approach to equipotential clock distribution. Symmetric trees, such as the H-trees in Section 2.1.3.2, are used to distribute high-speed clock signals.

#### 2.1.3.1 Buffered Clock Distribution Trees

Buffered clock distribution trees are the most common approach to distributing clock signals within an integrated circuit. Buffers are inserted at the clock source or along the clock signal path to drive the long interconnections and the registers at the end nodes. This structure is illustrated in Figure 2.3.



Figure 2.3 Tree structure of clock distribution network

The mesh structure is an extended version of the standard clock tree. In a mesh clock tree, shunt paths further down the distribution network are used to minimize the interconnect resistance within the clock tree. Because the branch resistances are placed in parallel, this structure has the advantage of reduced clock skew. Various forms of clock distribution networks, including the trunk, tree, mesh, and H-tree, are illustrated in Figure 2.4.

An alternative to distributing clock buffers throughout the network is to use only one buffer at the clock source. This saves much of the area consumed by distributed buffers. However, the approach is suitable only for clock networks whose interconnect resistance is negligible. In addition, the buffer must be strong enough to drive the network capacitance while maintaining high-quality waveform shapes and minimizing the effects of the interconnect resistance.

Compared with a single-buffer clock distribution network, distributed buffers consume more power and area, but they greatly improve the precision of the clock waveform. Distributed buffers are therefore necessary when the interconnect lines are long. They not only amplify the clock signals but also isolate the local clock nets from upstream load impedances [2.3]. An example of a three-level buffered clock distribution network is shown in Figure 2.5. In this strategy, a single buffer drives multiple clock paths and buffers. The number of buffer stages between the clock source and the registers depends on (1) the loading of the registers and interconnect and (2) the allowable clock skew [2.4]. Note that clock skew comes mainly from the clock buffers, since active device characteristics vary much more than passive device characteristics.



Figure 2.4 Common structures of clock distribution networks including a trunk, tree, mesh and H-tree



Figure 2.5 Three-level buffer clock distribution network

The primary design goal of clock distribution networks is to ensure that the clock signal arrives at every register at the same time. With zero skew, it can enhance the system reliability.

#### 2.1.3.2 Symmetric H-Tree Distribution Networks

Figure 2.6 shows the symmetric H-tree and X-tree clock distribution networks, which ensure zero clock skew by keeping the interconnect length and the buffers identical along every path from the clock signal source to any end node. They are a subset of the distributed buffer approach described in section 2.1.3.1. In the H-tree distribution network, the clock driver is placed at the center of the main "H" structure. The clock signal is transmitted to the four corners of the H. The distances from the driver to these corners are the same, so the clock signal reaches the corners with equal delay. Then, the four corners provide the clock signal for the smaller "H" structures in the next level. The distribution process continues through several levels of progressively smaller "H"



Figure 2.6 Symmetric H-tree and X-tree clock distribution networks

The primary source of clock skew is the difference between the signal paths, including process variations on the metal lines and, in particular, on the active buffers. In the H-tree clock distribution network, the amount of clock skew depends on the physical size, the control of the semiconductor process, and the degree to which active buffers and clocked latches are distributed.

#### 2.1.4 Previous Works on Temperature-Aware Clock Distribution Design

#### 2.1.4.1 Dynamic Thermal Clock Skew Compensation Using Tunable Delay Buffers [2.8]

The temperature gradient in a high-performance chip brings the problem of clock skew in the clock distribution network. Knowing the spatial temperature distribution beforehand, it is possible to compensate the thermal non-uniformities by properly designing a clock network. However, the temperature distribution also changes over time. A. Chakraborty et al. proposed a technique of compensation for temporal variations of temperature, by dynamically modifying the clock tree. It is realized by using tunable delay buffers during the clock network generation. The control of buffer is computed offline and stored in a tuning table which is added in the design. Then, temperature-induced delay variations are compensated.

The conceptual architecture of the tunable delay buffer is shown in Figure 2.7. Each control signal decides whether the corresponding transmission gate is opened, thus achieving variable delays in discrete steps. Figure 2.8 shows that each additional tap delivers a constant delay of approximately 8 ps. This value is chosen to keep the area and power overheads within reasonable bounds.
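As a rough illustration of how discrete taps map to a compensation delay, the sketch below picks the tap count closest to a required delay, using the approximately 8 ps per tap quoted above. The function names, the `max_taps` limit, and the round-to-nearest policy are assumptions made for illustration; they are not taken from [2.8].

```python
TAP_DELAY_PS = 8.0  # approximate delay added per tap, as quoted for Figure 2.8


def select_taps(required_delay_ps: float, max_taps: int) -> int:
    """Return the number of enabled taps whose total delay best matches the request.

    max_taps is a hypothetical hardware limit, not a value from [2.8].
    """
    if required_delay_ps <= 0:
        return 0
    taps = round(required_delay_ps / TAP_DELAY_PS)
    return min(taps, max_taps)


def residual_ps(required_delay_ps: float, taps: int) -> float:
    """Delay error that remains after quantizing to whole taps."""
    return required_delay_ps - taps * TAP_DELAY_PS


taps = select_taps(30.0, max_taps=8)
print(taps, residual_ps(30.0, taps))
```

With a fixed step per tap, the residual after rounding to the nearest tap is at most half a step, as long as the request stays within the range the buffer can cover.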



Figure 2.7 Structure of the tunable delay buffer



Figure 2.8 Delay and normalized power versus number of taps

Figure 2.9 shows an online hardware mechanism that tunes the clock buffers so that the clock skew induced by the thermal gradient is compensated. Two essential elements are required. First, a set of on-chip temperature sensors detects thermal variations. Second, a hardware mechanism, hereafter called the thermal management unit (TMU), translates these variations into the proper tuning of the buffers.



Figure 2.9 Online skew compensation architecture

An algorithm that minimizes the number of inserted tunable buffers is proposed in this design. The overall flow of the methodology is depicted in Figure 2.10 and consists of three steps. In the first step, physical synthesis, the RTL design is synthesized, and the placement, clock tree generation, and global routing are done. In the second step, TDB identification, the characterization and optimization are run on the synthesized designs and their corresponding clock trees, which entails the repeated execution of the optimization algorithm for every relevant thermal profile. In the final step, physical redesign, the insertion of buffers requires some amount of





Figure 2.10 Overall flow of the proposed methodology

This design shows that the clock skew is kept within original bounds with worst-case power and area penalty of 3.5% and 5.5%, respectively.

#### 2.1.4.2 Design of Thermally Robust Clock Trees Using Dynamically Adaptive Clock Buffers [2.9]

Temperature gradient has emerged as a major concern for high-performance integrated circuit design in current and future technology nodes, because it causes undesired clock skew in the clock distribution network. The primary purpose of the research in [2.9] is to provide an intelligent solution for minimizing the temperature-induced clock skew by designing dynamically adaptive circuit elements, particularly the clock buffers.

The effect of the on-chip temperature gradient on the clock skew is investigated for a number of temperature profiles by using an RLC model of the clock tree. To mitigate the variable clock skew, an adaptive circuit technique is proposed, which senses the temperature of different parts of the clock tree and adjusts the driving strengths of the corresponding clock buffers dynamically. Figure 2.11 shows the design technique, in which the local temperature sensors sense the ambient temperatures and convert them to voltages. The voltages are used to change the driving strength of the clock buffers dynamically, thereby reducing the overall clock skew. The buffers combine two techniques to compensate for the temperature effect: buffer-current control and body-bias control. Figure 2.12 shows the control waveforms coming from the wave-shaping circuits.



Figure 2.11 Thermally adaptive buffer schematic



Figure 2.12 Control waveforms coming from the wave-shaping circuits

To distribute the thermal sensors all over the chip, a moderate-accuracy temperature sensor is needed so that area and power stay small. The architecture of the temperature sensor used here is shown in Figure 2.13. The accuracy of this temperature sensor is within 10 °C while it occupies only 30 μm<sup>2</sup> in 45-nm technology. The output waveforms are shown in Figure 2.14; they demonstrate the linearity of the output voltage over the temperature range.



Figure 2.14 Temperature-sensor output-voltage levels

SPICE simulations were performed to evaluate the performance. The clock skew equals zero when the temperature difference along the clock signal paths is zero. With a difference of 80 °C, the clock skew is 155 ps, which is reduced to 21 ps with the adaptive technique. Simulation results show that the adaptive technique is capable of reducing the temperature-induced clock skew by up to 92.4%, and by 70.2% on average.
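The relative skew reduction for a single operating point follows directly from the two skew values. The short sketch below applies this to the 80 °C case quoted above; the function name is an illustrative choice.

```python
def skew_reduction_percent(uncompensated_ps: float, compensated_ps: float) -> float:
    """Relative skew reduction, in percent, for one operating point."""
    if uncompensated_ps <= 0:
        raise ValueError("uncompensated skew must be positive")
    return 100.0 * (uncompensated_ps - compensated_ps) / uncompensated_ps


# 80 degC temperature difference: 155 ps without and 21 ps with the adaptive buffers.
reduction = skew_reduction_percent(155.0, 21.0)
print(f"{reduction:.1f}%")
```

For this operating point the reduction is about 86.5%, which lies between the reported average (70.2%) and maximum (92.4%).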

#### 2.2 An Overview on Clock Generator

A clock generator is a circuit that produces a timing signal used to synchronize a circuit's operation. Many kinds of clock generators have been presented in the literature. In this section, we briefly describe some categories of clock generators, including DLL-based, PLL-based, TDC-based and CCM-based (cyclic clock multiplier) clock generators. They are used in different applications.

#### 2.2.1 DLL-Based Clock Generator [2.10]

A low-power programmable DLL-based clock generator for dynamic frequency scaling is developed in [2.10]. The block diagram is shown in Figure 2.15. When the DLL locks, the phase difference between B0 and B8 is one reference clock cycle. The voltage-controlled delay line (VCDL) generates uniformly spaced clocks which are used for frequency multiplication. The frequency of the multiplied clock is set by the two-bit control signals. To avoid harmonic lock, an antiharmonic-lock block is added. Three clock phases, B0, B3 and B8, are selected as inputs for the antiharmonic-lock block. The phases of B0 and B8 are compared by the phase detector (PD), which then sends the signal UP or DOWN to the charge pump (CP). If the DLL locks in a harmonic state, the antiharmonic-lock block has priority in setting the PD output to UP or DOWN. These signals increase or decrease the control voltage of the VCDL, so the phase B9 can be locked.



Figure 2.15 Block diagram of the proposed DLL-based frequency multiplier

#### 2.2.2 PLL-Based Clock Generator [2.11]

A triangular-modulated spread-spectrum clock generator using a $\Delta$-$\Sigma$ modulated fractional-N phase-locked loop is presented in [2.11]. The multiphase divider is employed to implement the modulated fractional counter with increased $\Delta$-$\Sigma$ operation speed. The phase mismatching error in the phase-interpolated PLL with multiphase clocks can be randomized, and finer frequency resolution is achievable. Figure 2.16 shows the system architecture, which consists of a PLL, a $\Delta$-$\Sigma$ modulator, and a triangular modulated profile. The PLL is a digiphase-based fractional-N synthesizer with a multimodulus fractional divider (MMDF). The instantaneous phase error can be canceled by a phase-compensated technique before the phase frequency detector. When the PLL is locked, neglecting the modulated operation of the $\Delta$-$\Sigma$ modulator on the MMDF, the output frequency of the PLL is (M±k/16)f<sub>ref</sub>, and the synthesizer operates as a modulo-31 fractional-N frequency synthesizer for k = 0~15.



Figure 2.16 System architecture

#### 2.2.3 Multi-Phase Clock Generator Based on a Time-to-Digital Converter [2.12]

An all-digital fast-lock synchronous multi-phase clock generator is presented in [2.12]. It adopts a time-to-digital converter (TDC) to achieve the purposes of fast-lock and delay measurement. It can generate four-phase clocks and synchronize the reference clock within 45 cycles. Figure 2.17 shows the synchronous multi-phase clock generator, consisting of a TDC, sampling clock selector, control pulse generator, code controller and de-skewing circuit. The TDC measures the periods of the input clock and the replica delay. Then, the delay codes generated by the TDC are converted into coarse and fine codes in the code controller. Therefore, the clock generator can be synchronized, generating multi-phase output clock. In addition, the de-skewing circuits improve the phase resolution of the multi-phase clocks. The phase error between the reference and output clocks is 4.6ps at 1.8V, with 1.22GHz input clock.



Figure 2.17 Architecture of the synchronous multi-phase clock generator


#### 2.2.4 Programmable Clock Generator Based on a Cyclic Clock Multiplier [2.13]

An all-digital clock generator using a cyclic clock multiplier (CCM) is presented in [2.13]. It produces the fractional or multiplied output clock within four reference clock cycles. Figure 2.18 shows the all-digital clock generator, which is composed of a CCM, a finite state machine (FSM), a conventional time-to-digital converter (TDC), a counter\_K, a programmable divider and two multiplexers (MUXs). It can generate an output clock with a frequency of M/N times the reference clock, where the ranges of M and N are 1~7 and 1~8 respectively. CCM<sub>out</sub> is a multiplied clock whose frequency is M times that of the reference clock. The timing diagram of the clock generator is shown in Figure 2.19 with M = 5 and N = 1. Its operation has four steps. First, C[4:0] is preset to M and the CCM measures the period of the reference cycle. Second, the counted value is stored as K[4:0] = K, with K = 3 in Figure 2.19. Third, the clock $CCM_{out}$ generates M pulses by K unit delay cells. Finally, the delay of the unit delay cell in the CCM is adjusted by F[3:0] according to the TDC outputs, so the phase error between the multiplied clock and the reference clock can be reduced.
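The programmable M/N relation can be illustrated with a few lines of arithmetic. The sketch below enumerates the ratios allowed by the stated ranges of M and N; the function name and the 1 MHz reference frequency are hypothetical values chosen for illustration, not figures from [2.13].

```python
from fractions import Fraction

M_RANGE = range(1, 8)  # M = 1..7
N_RANGE = range(1, 9)  # N = 1..8


def ccm_output_frequency(f_ref_hz: int, m: int, n: int) -> Fraction:
    """Output frequency for multiplier M and divider N (f_out = M/N * f_ref)."""
    if m not in M_RANGE or n not in N_RANGE:
        raise ValueError("M must be in 1..7 and N in 1..8")
    return Fraction(m, n) * f_ref_hz


# All distinct frequency ratios that the stated ranges allow.
ratios = sorted({Fraction(m, n) for m in M_RANGE for n in N_RANGE})

f_ref = 1_000_000  # hypothetical 1 MHz reference clock
print(ccm_output_frequency(f_ref, 5, 1))  # the M = 5, N = 1 case of Figure 2.19
print(min(ratios), max(ratios))
```

Under these ranges the achievable ratio spans 1/8 to 7 times the reference frequency.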



Figure 2.18 The all-digital clock generator using cyclic clock multiplier



Figure 2.19 The timing diagram of the clock generator

# Chapter 3 Unified Logical Effort Models over Wide Supply Voltage and Temperature Range

In this chapter, we present unified logical effort models, which cover all operational regions of the MOSFET: weak, moderate and strong inversion. These models have been established over four different nanoscale CMOS generations and environmental parameter variations with a wide supply voltage range of 0.1~1V and a temperature range of -50~125°C. The simulation results use UMC 90-, 65-nm and PTM 65-, 45-, 32-nm bulk CMOS technologies, with an average modeling error of no more than 8.40%. The proposed models extend the original high-performance circuit design in the super-threshold region to low-power design in the near-threshold and sub-threshold regions. They are useful for future ultra-low voltage design and applications.

Section 3.1 is the introduction. The classic logical effort model is reviewed in section 3.2. In section 3.3 we derive the physical alpha-power law current equations and the formulas of the unified logical effort models. Section 3.4 shows the experimental results.

#### **3.1 Introduction**

Power has become the dominant design constraint in many emerging applications such as mobile consumer electronics and wireless sensor networks. Ultra-low voltage (ULV) design techniques have been explored continuously. In addition, the minimum energy point appears at a voltage where transistors operate in weak inversion (also called the sub-threshold region) [3.1], [3.2]. However, sub-threshold circuits are much more sensitive to environmental variations than super-threshold ones. Recently, three-dimensional integrated circuit (3D-IC) technology has been developed to overcome the barriers of large interconnections. The high integration of 3D-IC introduces a hot spot problem because of uneven thermal distribution. The temperature inconsistency causes a performance coherence problem in ULV circuit design. Voltage and temperature variations affect the timing behavior of logic gates significantly at lower voltages and in advanced CMOS technologies. They may lead to functional errors in digital circuits. Therefore, novel unified logical effort models that optimize combinational logic by considering temperature and voltage variations are proposed.

The logical effort model proposed by Sutherland, Sproull, and Harris in 1999 is a method for estimating circuit path delay [3.3]. With logical effort, path delay is easy to estimate from simple calculations, but the model does not consider environmental conditions. Many papers have been presented to improve the accuracy of the logical effort model under different conditions. The effect of a linear input transition time was introduced in [3.4]. A modified logical effort model accounting for the series-connected MOSFET structure, input transition time, and internodal charge was presented in [3.5]. The effects of I/O coupling capacitance and the input ramp on logical effort were considered in [3.6]. The influences of voltage and temperature on logical effort were introduced for the UMC 90nm bulk CMOS process in [3.7], in which the logic gates, however, operated only in the strong-inversion region.

In this chapter, unified logical effort models for different CMOS operation regions are proposed, which cover the strong-, moderate- and weak-inversion regions (also called the super-threshold, near-threshold and sub-threshold regions, respectively). The models have been established in UMC 90-, 65-nm and PTM 65-, 45-, 32-nm bulk CMOS technologies. In the following sections we derive them from the classic logical effort model.

#### 3.2 Classic Logical Effort Model [3.3]

The method of logical effort is founded on a simple model of the delay through a single MOS logic gate. The model describes the delay in terms of the gate's drive and its capacitive load. When the gate load increases, the delay increases; however, the delay also depends on the logic function of the gate. Inverters are the simplest logic gates and are most often chosen as amplifiers to drive large loads. Logic gates with more complex functions often require series transistor topologies, making them poorer than an inverter at driving current. Thus a NAND gate has more delay than an inverter with the same transistor sizes driving the same load. The method of logical effort quantifies these effects to simplify delay analysis.

The first step in modeling delays is dividing the absolute delay into two parts: delay unit  $\tau$  and unitless delay *d* of the gate. The delay unit is particular to a specific integrated circuit fabrication process. The absolute gate delay can be expressed as:

$$d_{abs} = d\tau \tag{3.1}$$

The delay is composed of two components, a fixed part called the parasitic delay p and a part proportional to the load on the gate's output called the stage effort f. The total delay, measured in units of  $\tau$ , is the sum of parasitic delay and stage effort:

$$d = f + p \tag{3.2}$$


The stage effort delay depends on the output load and the driving capability of the logic gate. The output load and driving capability are represented by the terms electrical effort h, and logical effort g respectively. The stage effort f is the product of these two factors:

$$f = gh \tag{3.3}$$

The logical effort characterizes the effect of the logic gate's topology on its ability to drive the load. It is independent of the size of transistors in the circuit. The electrical effort h is defined by:

$$h = \frac{C_{out}}{C_{in}} \tag{3.4}$$

In addition to estimating delay, logical effort is also used to optimize an *N*-stage logic path.

$$G = \prod g_i, \quad B = \prod b_i, \quad H = \frac{C_{out}}{C_{in}}, \quad F = GBH \tag{3.5}$$

where $b_i$ is the branching effort, and *G*, *B*, *H*, *F* are the path logical effort, path branching effort, path electrical effort and path effort. The path delay is minimized when every stage bears the same stage effort:

$$\hat{f} = g_i h_i = F^{1/N} \tag{3.6}$$

Based on the above simple equations, it is easy to arrange the logic paths and obtain the optimized path delay.
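A minimal sketch of the path optimization in (3.5) and (3.6) is given below. It computes the path effort, the optimal stage effort, and then sizes the gates backward from the load using the standard logical effort relation $C_{in,i} = g_i b_i C_{out,i} / \hat{f}$ from [3.3], which is not written out in the text above. The function name and the three-stage example values are illustrative assumptions.

```python
import math


def optimize_path(logical_efforts, branching_efforts, c_in, c_out):
    """Return (f_hat, input_caps) for an N-stage path, following (3.5) and (3.6).

    logical_efforts[i] and branching_efforts[i] belong to stage i (0 = first stage).
    c_in is the input capacitance of the first stage and c_out the final load.
    """
    n = len(logical_efforts)
    path_g = math.prod(logical_efforts)    # G
    path_b = math.prod(branching_efforts)  # B
    path_h = c_out / c_in                  # H
    path_f = path_g * path_b * path_h      # F = GBH
    f_hat = path_f ** (1.0 / n)            # optimal stage effort, F^(1/N)

    # Work backward from the load: C_in,i = g_i * b_i * C_load,i / f_hat.
    caps = [0.0] * n
    load = c_out
    for i in reversed(range(n)):
        caps[i] = logical_efforts[i] * branching_efforts[i] * load / f_hat
        load = caps[i]
    return f_hat, caps


# Hypothetical example: INV -> 2-input NAND (g = 4/3) -> INV driving 64 unit loads.
f_hat, caps = optimize_path([1.0, 4.0 / 3.0, 1.0], [1.0, 1.0, 1.0], c_in=1.0, c_out=64.0)
print(f_hat, caps)
```

A useful self-check is that the backward sizing returns the specified input capacitance for the first stage, which follows from $F = GBH$.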

#### **3.3 Unified Logical Effort Models**

The unified logical effort models are derived by considering current equation of physical alpha-power law [3.8] and conventional logical effort model simultaneously. In logic gates, the operation region of MOSFET is determined by the value of supply voltage. When the supply voltage is less than threshold voltage ( $V_{DD} < V_T$ ), then the weak-inversion (or sub-threshold) current is derived as

$$I_{DSUB} = (W/L)\mu_0 C_{OX} \frac{\eta}{\beta^2} \exp\left[(\beta/\eta)(V_{DD} - V_T - \eta/\beta)\right] \tag{3.7}$$

where (*W/L*) is the channel width-to-length ratio, $C_{OX}$ is the gate oxide capacitance per unit area, $\mu_0$ is the carrier mobility, and the MOSFET parameters are

$$\beta = q/(kT), \quad \eta = 1 + C_{D0}/C_{OX} \tag{3.8}$$

When the supply voltage is near the threshold voltage ( $V_{DD} \sim V_T$ ), velocity saturation is negligible ( $E_C L \gg V_{DD} - V_T$ ). This region is called the moderate-inversion (near-threshold) region. Thus, we simplify the saturation voltage and $I_{DSAT}$ from [3.8] and obtain

$$V_{DSSAT}\big|_{E_C L \gg V_{DD} - V_T} \approx (1/\eta)(V_{DD} - V_T) \tag{3.9}$$

$$I_{DSAT}\big|_{E_C L \gg V_{DD} - V_T} = (W/L) C_{OX} \mu_{eff} (1/\eta) (V_{DD} - V_T)^2 \tag{3.10}$$

When the supply voltage is much larger than the threshold voltage ( $V_{DD} \gg V_T$ ), strong velocity saturation ( $E_C L \ll V_{DD} - V_T$ ) is reached. This is called the strong-inversion (super-threshold) region. Again, we simplify the saturation voltage and $I_{DSAT}$ from [3.8] as

$$V_{DSSAT}\big|_{E_C L \ll V_{DD} - V_T} = \left[ (2E_C L/\eta)(V_{DD} - V_T) \right]^{1/2} \tag{3.11}$$

$$I_{DSAT}\big|_{E_C L \ll V_{DD} - V_T} \approx 2(W/L) C_{OX} \mu_{eff} (E_C L/\eta)^{1/2} (V_{DD} - V_T)^{3/2} \tag{3.12}$$

Strong-inversion (Super-threshold): $I_{DSAT}\big|_{E_C L \ll V_{DD} - V_T} \approx 2(W/L) C_{OX} \mu_{eff} (E_C L/\eta)^{1/2} (V_{DD} - V_T)^{3/2}$

Moderate-inversion (Near-threshold): $I_{DSAT}\big|_{E_C L \gg V_{DD} - V_T} = (W/L) C_{OX} \mu_{eff} (1/\eta) (V_{DD} - V_T)^2$

Weak-inversion (Sub-threshold): $I_{DSUB} = (W/L)\mu_0 C_{OX} \frac{\eta}{\beta^2} \exp\left[(\beta/\eta)(V_{DD} - V_T - \eta/\beta)\right]$

Figure 3.1 Simplified physical alpha-power law current equations

The MOS currents of all three regions are derived in (3.7), (3.10) and (3.12) and summarized in Figure 3.1. To extend the logical effort model, we start from the logical effort g introduced in section 3.2. From equations (3.1) and (3.2) we get:

$$d_{abs} = \tau(f+p) = \tau(gh+p) \tag{3.13}$$

The definitions of $\tau$, g, h, and p are:

$$\tau = \kappa R_{inv} C_{inv}, \quad g = \frac{R_t C_{int}}{R_{inv} C_{inv}}, \quad h = \frac{C_{out}}{C_{in}}, \quad p = \frac{R_t C_{pt}}{R_{inv} C_{inv}} \tag{3.14}$$

where $R_{inv}$ and $C_{inv}$ are the output resistance and input capacitance of an inverter template; $R_t$, $C_{int}$, $C_{pt}$ are the output resistance, input capacitance and output parasitic capacitance of a specific gate. In (3.15), the logical effort is equal to the ratio of the gate *RC* to the inverter *RC*:

$$g = \frac{R_t C_{int}}{R_{inv} C_{inv}} = k R_t C_{int} = k \frac{V_{DD}}{I_D} C_{int} \tag{3.15}$$

The inverter term $1/R_{inv}C_{inv}$ is equal to a constant k, and $R_t$ is equal to $V_{DD}/I_D$, where $I_D$ is the drain current. The inverse of the logical effort is

$$1/g = \frac{I_D}{kV_{DD}C_{int}} \tag{3.16}$$

From (3.16), the inverse of the logical effort is proportional to $I_D$, so g has the same three regions as $I_D$: strong, moderate and weak inversion. The driving abilities of NMOS and PMOS differ from region to region. The inverter sizing ratios Wp/Wn are set to 2.5, 2.0 and 1.5 in the strong-, moderate- and weak-inversion regions, respectively, to obtain balanced rise and fall delays.

## 3.3.1 Strong-Inversion (Super-Threshold) Region

In the strong-inversion region, the MOSFET operates with strong carrier velocity saturation. Substitute $I_D$ (3.12) into (3.16)

$$\begin{aligned} 1/g &= \frac{(W/L) C_{OX} \mu_{eff} (2/\eta)^{1/2} (E_C L)^{1/2} (V_{DD} - V_T)^{3/2}}{kV_{DD}C_{int}} \\ &= const1 \cdot \mu_{eff} \frac{(V_{DD} - V_T)^{3/2}}{V_{DD}} \end{aligned} \tag{3.17}$$

where *const1* represents all constant coefficients. From Figure 3.2, $V_T$ can be expressed as $V_{T0} - aT$, where $V_{T0}$ stands for the threshold voltage at 0 °C. The unified 1/g function is curve fitted by

$$1/g_{u} = A(T) \frac{(V_{DD} - V_{T0} + aT)^{3/2}}{V_{DD}} \tag{3.18}$$

Figure 3.2  $V_T$  - T plot

$g_u$ stands for the unified logical effort, and A(T) is a second-degree polynomial of T. By measuring the logical effort at various $V_{DD}$ and T, A(T) is solved and listed in Table 3.1. In this region, g is set to 1 at $V_{DD} = 1$ V, T = 25 °C, and the $V_{DD}$ range is from 0.5 V to 1.0 V. Figure 3.3 shows the unified and simulated 1/g at various $V_{DD}$ and T. The average absolute modeling errors are 3.89%, 3.05%, 4.12%, 8.01% and 6.55% in UMC 90-, 65-nm and PTM 65-, 45-, 32-nm technologies, respectively.

|          | A(T)                                                     |
|----------|----------------------------------------------------------|
| UMC 90nm | $1.77 \times 10^{-5}T^2 - 6.75 \times 10^{-3}T + 1.67$   |
| UMC 65nm | $3.02 \times 10^{-6} T^2 - 4.79 \times 10^{-3} T + 1.93$ |
| PTM 65nm | $4.83 \times 10^{-5} T^2 - 1.63 \times 10^{-2} T + 2.30$ |
| PTM 45nm | $7.32 \times 10^{-5} T^2 - 2.25 \times 10^{-2} T + 2.93$ |
| PTM 32nm | $5.99 \times 10^{-5} T^2 - 1.81 \times 10^{-2} T + 2.30$ |

Table 3.1 Function A(T) for strong-inversion



Figure 3.3 1/g in UMC 65-nm technology (strong-inversion)

## 3.3.2 Moderate-Inversion (Near-Threshold) Region

In the moderate-inversion region, the MOSFET operates with negligible carrier velocity saturation. Substitute $I_D$ (3.10) into (3.16)

$$\begin{aligned} 1/g &= \frac{(W/L) C_{OX} \mu_{eff} (1/\eta) (V_{DD} - V_T)^2}{kV_{DD}C_{int}} \\ &= const2 \cdot \frac{\mu_{eff}(V_{DD} - V_{T0} + aT)^2}{V_{DD}} \end{aligned} \tag{3.19}$$

where *const2* represents all constant coefficients and $V_T$ is a function of *T*. The unified 1/g is curve fitted by

$$1/g_{u} = B(T)V_{DD}^{2} + C(T)V_{DD} + D(T) \tag{3.20}$$

$g_u$ stands for the unified logical effort; B(T), C(T), and D(T) are second-degree polynomials of *T*. By measuring the logical effort at various $V_{DD}$ and *T*, B(T), C(T), and D(T) are solved and listed in Table 3.2. In this region, *g* is set to 1 at $V_{DD} = 0.5$ V, T = 25 °C, and the $V_{DD}$ range is from about 0.33 V to 0.5 V. The dividing point between moderate and weak inversion depends on the CMOS technology used. Figure 3.4 shows the unified and simulated 1/g at various $V_{DD}$ and *T*. The average absolute modeling errors are 1.52%, 2.57%, 1.20%, 1.44% and 5.04% in UMC 90-, 65-nm and PTM 65-, 45-, 32-nm technologies, respectively.

|          | B(T) | C(T) | D(T) |
|----------|------|------|------|
| UMC 90nm | $4.76 \times 10^{-4} T^2 - 9.20 \times 10^{-2} T + 84.7$ | $-3.94 \times 10^{-4} T^2 + 6.91 \times 10^{-2} T - 2.35$ | $7.39 \times 10^{-5} T^2 - 1.11 \times 10^{-2} T + 6.87 \times 10^{-2}$ |
| UMC 65nm | $-2.05 \times 10^{-4} T^2 - 4.81 \times 10^{-2} T + 15.9$ | $6.54 \times 10^{-5} T^2 + 5.87 \times 10^{-2} T - 8.75$ | $3.21 \times 10^{-6} T^2 - 1.22 \times 10^{-2} T + 1.30$ |
| PTM 65nm | $5.09 \times 10^{-4} T^2 - 1.96 \times 10^{-1} T + 26.0$ | $-3.36 \times 10^{-4} T^2 + 1.29 \times 10^{-1} T - 15.5$ | $5.49 \times 10^{-5} T^2 - 2.10 \times 10^{-2} T + 2.39$ |
| PTM 45nm | $1.16 \times 10^{-3} T^2 - 3.20 \times 10^{-1} T + 36.0$ | $-8.37 \times 10^{-4} T^2 + 2.27 \times 10^{-1} T - 23.7$ | $1.51 \times 10^{-4} T^2 - 4.01 \times 10^{-2} T + 4.00$ |
| PTM 32nm | $1.25 \times 10^{-3} T^2 - 3.75 \times 10^{-1} T + 42.8$ | $-8.93 \times 10^{-4} T^2 + 2.70 \times 10^{-1} T - 29.3$ | $1.59 \times 10^{-4} T^2 - 4.87 \times 10^{-2} T + 5.11$ |

Table 3.2 Functions B(T), C(T) and D(T) for moderate-inversion
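As an illustration of how the fitted model in (3.20) is evaluated, the sketch below computes $1/g_u$ for UMC 65nm from the Table 3.2 coefficients. It assumes that T is expressed in °C, consistent with the normalization point T = 25 °C; the function and variable names are illustrative.

```python
# Table 3.2 coefficients for UMC 65nm, ordered as (T^2, T^1, T^0).
UMC65_MODERATE = {
    "B": (-2.05e-4, -4.81e-2, 15.9),
    "C": (6.54e-5, 5.87e-2, -8.75),
    "D": (3.21e-6, -1.22e-2, 1.30),
}


def poly2(coeffs, t):
    """Evaluate c2*t^2 + c1*t + c0."""
    c2, c1, c0 = coeffs
    return c2 * t * t + c1 * t + c0


def inv_g_moderate(vdd, temp_c, coeffs=UMC65_MODERATE):
    """Unified 1/g_u in moderate inversion, equation (3.20). temp_c is assumed to be in degC."""
    b = poly2(coeffs["B"], temp_c)
    c = poly2(coeffs["C"], temp_c)
    d = poly2(coeffs["D"], temp_c)
    return b * vdd ** 2 + c * vdd + d


# Normalization point: g is set to 1 at VDD = 0.5 V and T = 25 degC.
print(inv_g_moderate(0.5, 25.0))
```

At the normalization point the fitted value is close to 1 (about 1.02), which is consistent with the reported average modeling error for this technology.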



Figure 3.4 1/g in UMC 65-nm technology (moderate-inversion)

### 3.3.3 Weak-Inversion (Sub-Threshold) Region

In the weak-inversion region, the MOSFET operates in sub-threshold mode. Substitute $I_D$ (3.7) into (3.16)

$$\begin{aligned} 1/g &= \frac{(W/L)\mu_{0} C_{OX} \frac{\eta}{\beta^{2}} \exp\left[(\beta/\eta)(V_{DD} - V_{T} - \eta/\beta)\right]}{kV_{DD}C_{int}} \\ &= const3 \cdot \mu_{0} \frac{\exp\left[(\beta/\eta)(V_{DD} - V_{T0} + aT - \eta/\beta)\right]}{V_{DD}} \end{aligned} \tag{3.21}$$

where *const3* represents all constant coefficients, and $\beta$ and $V_T$ are functions of T. The unified 1/g is curve fitted by

$$1/g_{u} = E(T)\exp\left\{F(T)[V_{DD} - V_{T0}]\right\} \tag{3.22}$$

E(T) and F(T) are fourth-degree and second-degree polynomials of T, respectively. By measuring 1/g at various T and $V_{DD}$, E(T) and F(T) are calculated and listed in Table 3.3. In this region, g is set to 1 at T = 25 °C and $V_{DD}$ of about 0.33 V, depending on the CMOS technology used. Figure 3.5 shows the unified and simulated 1/g at various $V_{DD}$ and T. The average absolute modeling errors are 6.01%, 8.40%, 3.03%, 2.97% and 5.14% in UMC 90-, 65-nm and PTM 65-, 45-, 32-nm technologies, respectively. Table 3.4 lists the average absolute modeling errors in all regions. Figure 3.6 summarizes the unified logical effort models.

|          | E(T) | F(T) |
|----------|------|------|
| UMC 90nm | $1.16 \times 10^{-9} T^4 - 2.35 \times 10^{-7} T^3 + 5.64 \times 10^{-6} T^2 + 6.35 \times 10^{-3} T + 0.467$ | $2.36 \times 10^{-4} T^2 - 1.02 \times 10^{-1} T + 21.8$ |
| UMC 65nm | $6.88 \times 10^{-10} T^4 - 2.37 \times 10^{-7} T^3 + 2.86 \times 10^{-5} T^2 + 1.20 \times 10^{-2} T + 0.855$ | $2.90 \times 10^{-4} T^2 - 1.06 \times 10^{-1} T + 21.1$ |
| PTM 65nm | $7.51 \times 10^{-10} T^4 - 1.46 \times 10^{-7} T^3 - 1.06 \times 10^{-6} T^2 + 1.20 \times 10^{-3} T + 1.020$ | $2.11 \times 10^{-4} T^2 - 9.13 \times 10^{-2} T + 22.2$ |
| PTM 45nm | $6.47 \times 10^{-10} T^4 - 1.44 \times 10^{-7} T^3 + 3.09 \times 10^{-6} T^2 + 1.15 \times 10^{-3} T + 0.989$ | $2.08 \times 10^{-4} T^2 - 9.39 \times 10^{-2} T + 22.0$ |
| PTM 32nm | $3.29 \times 10^{-10} T^4 - 1.17 \times 10^{-7} T^3 + 1.08 \times 10^{-5} T^2 + 7.29 \times 10^{-4} T + 0.959$ | $1.80 \times 10^{-4} T^2 - 8.95 \times 10^{-2} T + 21.2$ |

Table 3.3 Functions E(T) and F(T) for weak-inversion
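The weak-inversion model in (3.22) can be evaluated in the same way. The sketch below uses the UMC 65nm coefficients from Table 3.3 and assumes T is in °C. The value of $V_{T0}$ is not listed in this chapter, so it is left as a required argument; the 0.3 V used in the example call is purely a placeholder, not a fitted value.

```python
import math

# Table 3.3 coefficients for UMC 65nm, highest power of T first.
UMC65_E = (6.88e-10, -2.37e-7, 2.86e-5, 1.20e-2, 0.855)  # T^4 ... T^0
UMC65_F = (2.90e-4, -1.06e-1, 21.1)                      # T^2 ... T^0


def polyval(coeffs, t):
    """Evaluate a polynomial with Horner's rule (highest-order coefficient first)."""
    result = 0.0
    for c in coeffs:
        result = result * t + c
    return result


def inv_g_weak(vdd, temp_c, v_t0):
    """Unified 1/g_u in weak inversion, equation (3.22).

    v_t0 is the threshold voltage at 0 degC; its value is not given in the text,
    so the caller must supply it.
    """
    return polyval(UMC65_E, temp_c) * math.exp(polyval(UMC65_F, temp_c) * (vdd - v_t0))


# Placeholder V_T0 of 0.3 V, for illustration only.
for vdd in (0.20, 0.25, 0.30):
    print(vdd, inv_g_weak(vdd, 25.0, v_t0=0.3))
```

Because F(T) is positive at 25 °C, $1/g_u$ grows exponentially with $V_{DD}$ in this region, which mirrors the exponential sub-threshold current in (3.7).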



Figure 3.5 1/g in UMC 65-nm technology (weak-inversion)

| Average absolute error | Strong-inversion | Moderate-inversion | Weak-inversion |
|------------------------|------------------|--------------------|----------------|
| UMC 90nm               | 3.89%            | 1.52%              | 6.01%          |
| UMC 65nm               | 3.05%            | 2.57%              | 8.40%          |
| PTM 65nm               | 4.12%            | 1.20%              | 3.03%          |
| PTM 45nm               | 8.01%            | 1.44%              | 2.97%          |
| PTM 32nm               | 6.55%            | 5.04%              | 5.14%          |

Table 3.4 Logical effort modeling error

Strong-inversion (Super-threshold): $1/g_{u} = A(T) \frac{(V_{DD} - V_{T0} + aT)^{3/2}}{V_{DD}}$

Moderate-inversion (Near-threshold): $1/g_{u} = B(T)V_{DD}^{2} + C(T)V_{DD} + D(T)$

Weak-inversion (Sub-threshold): $1/g_{u} = E(T)\exp\left\{F(T)[V_{DD} - V_{T0}]\right\}$

Figure 3.6 Unified logical effort models

#### **3.4 Experimental Result**

In this section, we test and verify the unified logical effort models by using them to estimate path delays. There are two test vehicles. Test vehicle I is a path of simple logic gates, and test vehicle II is an 8-to-256 decoder. The test vehicles are simulated under various thermal and voltage conditions, and the actual delays are measured. The delay estimates are calculated from the delay equation of the logical effort model

$$d = \tau \times \sum (f_i + p_i) = \tau \times \sum (g_i \times h_i + p_i) \tag{3.23}$$

where *d* is the calculated delay and *h* is the electrical effort, which is independent of environmental variations. *g* and *p* are the logical effort and parasitic delay. The unified logical effort is substituted for *g* here to include the effects of temperature and supply voltage. We measured the values of *p* under the various environmental conditions beforehand, so ideal values of *p* are used in equation (3.23).

In the test vehicles, the logical efforts of the logic gates are calculated according to the classic rule. The logical efforts of the INV, 2-input NAND and 2-input NOR gates, listed in Table 3.5, are derived from the different Wp/Wn ratios in the three distinct regions. In the next two sections we show the comparisons of simulated and estimated delays.

|            | Strong-inversion  | Moderate-inversion | Weak-inversion   |
|------------|-------------------|--------------------|------------------|
| Wp/Wn      | 2.5               | 2.0                | 1.5              |
| g (INV)    | $g_u$             | $g_u$              | $g_u$            |
| g (2-NAND) | $g_u \times 9/7$  | $g_u \times 4/3$   | $g_u \times 7/5$ |
| g (2-NOR)  | $g_u \times 12/7$ | $g_u \times 5/3$   | $g_u \times 8/5$ |

Table 3.5 Ratios of logical effort for logic gates
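The ratios in Table 3.5 can be reproduced with the classic sizing rule, sketched below under the usual assumption that series transistors are doubled in width so that each gate matches the drive of the unit inverter. The ratio is then the gate's input capacitance divided by the inverter's input capacitance; the function name is illustrative.

```python
from fractions import Fraction


def gate_ratios(wp_over_wn):
    """Return (2-NAND ratio, 2-NOR ratio) relative to the inverter for a given Wp/Wn.

    Unit inverter: NMOS width 1, PMOS width r, so its input capacitance is 1 + r.
    2-input NAND: two series NMOS are doubled (width 2), PMOS stays at r.
    2-input NOR: two series PMOS are doubled (width 2r), NMOS stays at 1.
    """
    r = Fraction(wp_over_wn)
    inverter = 1 + r
    nand2 = (2 + r) / inverter
    nor2 = (1 + 2 * r) / inverter
    return nand2, nor2


for region, ratio in (("strong", Fraction(5, 2)), ("moderate", Fraction(2)), ("weak", Fraction(3, 2))):
    print(region, *gate_ratios(ratio))
```

For Wp/Wn of 2.5, 2.0 and 1.5 this yields 9/7 and 12/7, 4/3 and 5/3, and 7/5 and 8/5, matching the table.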

## 3.4.1 Test Vehicle I

Test vehicle I is an INV-NAND-NOR-INV path with another INV as the load, shown in Figure 3.7. All of these gates have the same driving ability as a unit-size inverter. They are simulated in UMC 90-nm CMOS technology. The simulated and estimated delays are compared in Figure 3.8, Figure 3.9 and Figure 3.10. The results show that the average absolute errors are 12.6%, 7.96% and 16.8% in the strong-, moderate- and weak-inversion regions, respectively.
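The effort part of (3.23) for this path can be sketched as follows. The sketch assumes that, because every gate has the drive of a unit inverter, each gate's input capacitance relative to the unit inverter equals its ratio in Table 3.5. Parasitic delays are omitted because their measured values are not listed, so the result is only $\sum g_i h_i$ in units of $g_u$; the function names are illustrative.

```python
from fractions import Fraction


def relative_caps(wp_over_wn):
    """Input capacitance of each gate relative to the unit inverter (Table 3.5 ratios)."""
    r = Fraction(wp_over_wn)
    inverter = 1 + r
    return {
        "INV": Fraction(1),
        "NAND2": (2 + r) / inverter,
        "NOR2": (1 + 2 * r) / inverter,
    }


def normalized_effort_delay(path, load, wp_over_wn):
    """Sum of g_i * h_i along the path, expressed in units of g_u."""
    caps = relative_caps(wp_over_wn)
    stages = list(path) + [load]
    total = Fraction(0)
    for gate, next_gate in zip(stages, stages[1:]):
        g = caps[gate]                    # logical effort ratio of this stage
        h = caps[next_gate] / caps[gate]  # electrical effort: load cap / input cap
        total += g * h
    return total


path = ["INV", "NAND2", "NOR2", "INV"]
for region, ratio in (("strong", Fraction(5, 2)), ("moderate", Fraction(2)), ("weak", Fraction(3, 2))):
    print(region, normalized_effort_delay(path, "INV", ratio))
```

Under this sizing assumption the effort sum is $5 g_u$ in all three regions, so the estimated delay is $\tau(5 g_u + \sum p_i)$ and the voltage and temperature dependence of the effort term enters through $g_u$ alone.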



Figure 3.7 Test vehicle I for proposed logical effort models



Figure 3.8 Simulated and estimated delays for the circuit path of Figure 3.7 in UMC 90nm technology (strong-inversion)



Figure 3.9 Simulated and estimated delays for the circuit path of Figure 3.7 in UMC 90nm technology (moderate-inversion)



Figure 3.10 Simulated and estimated delays for the circuit path of Figure 3.7 in UMC 90nm technology (weak-inversion)

#### 3.4.2 Test Vehicle II

Test vehicle II is an 8-to-256 decoder that controls a register file. Figure 3.11 shows the 8-to-256 decoder along with a 32×256 register file. The register file holds 256 words, each 32 bits wide. Each bit presents a load of 3 unit-sized inverters, so every decoder output drives a total of 3×32 unit capacitances. Figure 3.12 shows the circuit diagram of the 8-to-256 decoder. Every stage is sized for a stage effort of 4, following the FO4 rule for fast propagation, and the branch number is 128.

The 8-to-256 decoder is simulated in UMC 65-nm CMOS technology, and the path delays are estimated with the logical effort model. The simulated and estimated delays are compared in Figure 3.13, Figure 3.14 and Figure 3.15. The average absolute errors are 14.6%, 6.15% and 10.13% in the strong-, moderate- and weak-inversion regions, respectively.



Figure 3.11 8-to-256 decoder for a 32×256 register file



Figure 3.12 8-to-256 decoder



Figure 3.13 Simulated and estimated delays for Figure 3.12 in UMC 65nm technology (strong-inversion)



Figure 3.14 Simulated and estimated delays for Figure 3.12 in UMC 65nm technology (moderate-inversion)



Figure 3.15 Simulated and estimated delays for Figure 3.12 in UMC 65nm technology (weak-inversion)



## Chapter 4 A Thermally Robust Buffered Clock Tree Using Logical Effort Compensation

Temperature gradients have recently become a major design concern for integrated circuits. This chapter proposes an intelligent solution that mitigates temperature-induced clock skew by logical effort compensation. Logical effort, an index of propagation delay that varies with thermal and supply voltage conditions, is controlled by a tunable-width buffer. To mitigate the variable clock skew, this chapter presents an adaptive circuit technique that senses the temperature of different parts of the clock tree and dynamically adjusts the logical effort of the corresponding clock buffers. In UMC 65-nm technology, tunable-width buffers and a clock H-tree routed in 7th-layer metal are constructed and evaluated in post-layout simulation, which shows that the clock skew is reduced by up to 97.8%, and by 72.2% on average. This leads to much improved clock synchronization and design performance.

Section 4.1 introduces the clock tree and the effect of temperature variation on it. Section 4.2 creates a gate delay that is constant against thermal variation by using a tunable-width inverter to control the logical effort. Section 4.3 presents the thermally robust buffered clock tree, which adopts the technique proposed in section 4.2. Section 4.4 gives the simulation results of the thermally robust buffered clock tree.

## **4.1 Introduction**

With the advancement of integrated circuit technology, the temperature gradient has become a significant factor in chip design, and it significantly affects chip performance. Temperature gradients are becoming more acute because different parts of a chip have different activity levels. For instance, a processor chip contains an operating part with higher activity and a cache part with lower activity, which causes a temperature gradient. The temperature difference can be as high as 50 °C [4.1], which affects the performance of the different functional parts and of the interconnect. In this chapter, we focus on the effect of temperature on the clock skew between spatially close and functionally related points of a clocking network. In the H-tree shown in Figure 4.1, a number of terminal locations are physically close, yet the clock signals reach them through completely different paths from the source. As a result, temperature differences along the paths can lead to significant skew. As shown in Figure 4.2 for an H-tree mapped to the 45-nm technology node, the clock skew increases with the temperature difference between different parts of the chip [4.2]. Since increased clock skew is a serious threat to the performance of integrated circuits, intelligent solutions are needed to mitigate temperature-dependent clock skew.



Figure 4.1 Buffered Clock Tree



Figure 4.2 Temperature effect on edge skew between two buffers

The effect of temperature on device performance is complicated because two phenomena are mixed. First, carrier mobility decreases as temperature increases. Second, the threshold voltage decreases as temperature increases. Depending on the operating point of the transistor, the drain saturation current may therefore either increase or decrease. Figure 4.3, simulated by T. Ragheb [4.2], shows the drain saturation current of nMOS and pMOS devices modeled with a BSIM4 predictive 45-nm CMOS technology [4.3]. There is a zero-temperature-coefficient (*ZTC*) point at which the transistor current is invariant to temperature.



Figure 4.3 Inversion of the temperature dependence of drain saturation current for a PTM 45-nm (a) nMOS transistor and (b) pMOS transistor [4.2]

The ZTC point has long been well known to designers. It is the basis of the method suggested by Shakeri and Meindl in [4.4], which uses a temperature-variable supply voltage of 1V (TVS) to guarantee near-constant delay across a temperature range. However, the ZTC point is also a function of the technology node. Because the ZTC point differs between technologies, designers may need to redesign circuits that use the ZTC bias method when porting them from one technology to another.

Previous solutions use fixed, known temperature profiles that are built beforehand [4.5]–[4.7]. However, this assumption may be too optimistic, especially for processors that run different applications. Other techniques manage clock skew under thermal variations but sacrifice performance to achieve immunity against variations [4.8]. Finally, dynamic adjustment techniques have been proposed for microprocessor pipelines, but they incur significant overhead to enable the detection and correction of timing violations [4.9].

In this chapter, a thermally robust buffered clock tree is proposed. It uses a tunable-width inverter as the clock buffer and adjusts its drive ability by means of logical effort compensation. We consider thermal conditions from -50°C to 125°C.

#### 4.2 Creating Constant Gate Delay against Thermal Variation

This section introduces a method for creating a constant delay that is invariant to thermal conditions. Chapter 3 presented the unified logical effort models, in which the logical effort is a function of voltage and temperature. Here the voltage is held constant, so the logical effort of a gate varies only with temperature. By adjusting the logical effort of a gate to a constant value, a constant gate delay can be created. To adjust the logical effort, a tunable-width inverter is adopted, in which the width, and therefore the logical effort, can be tuned. The relation between width and logical effort is shown later in this section.

The constant gate delay is applied to the buffers of the clock tree. Because the buffer delays are then invariant to temperature, the clock skew can be minimized.

#### 4.2.1 Effects of Dynamically Tuning MOSFET Width on Logical Effort

From (3.19), logical effort g is inversely proportional to drain current  $I_D$ .

$$g \propto \frac{1}{I_D} \tag{4.1}$$

In the current equation, current is proportional to width-length ratio

$$I_D \propto \frac{W}{L} \tag{4.2}$$

So logical effort is inversely proportional to (*W/L*)

$$g \propto \frac{L}{W} \tag{4.3}$$

*L* is fixed, so logical effort *g* is inversely proportional to width *W*. In chapter 2, we demonstrated that logical effort is affected by thermal and supply voltage conditions. Taking temperature and supply voltage into account, the relation between two logical efforts with different widths $W_1$ and $W_2$ is:

$$\frac{g_{W2}(V,T)}{g_{W1}(V,T)} = \frac{W_1}{W_2} \tag{4.4}$$

where $g_{W1}(V, T)$ is the reference logical effort at width $W_1$, and $g_{W2}(V, T)$ is the tuned logical effort at the altered width $W_2$. The logical effort can be altered by adopting the tunable-width inverter shown in Figure 4.4. In this figure, control signals B0-B7 come from external control blocks and determine the total width of the tunable-width inverter. The MOSFET widths are binary weighted (1X, 2X, ..., 128X unit size, corresponding to control signals B0-B7), so the available tuning range of the width is from 1X to 255X. By altering the width, we can tune the logical effort to a specific value.
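The binary weighting and equation (4.4) can be sketched in a few lines of Python. The sketch assumes that a set bit in B[7:0] enables its branch (the actual control polarity of Figure 4.4 is not stated here), and the function names are hypothetical.

```python
UNIT_WEIGHTS = [1 << bit for bit in range(8)]  # 1X, 2X, ..., 128X for B0..B7


def total_width(code: int) -> int:
    """Total inverter width, in unit sizes, for an 8-bit control code B[7:0].

    Assumption: bit k set means the branch of weight 2**k is enabled.
    """
    if not 0 <= code <= 0xFF:
        raise ValueError("control code must fit in 8 bits")
    return sum(w for bit, w in enumerate(UNIT_WEIGHTS) if (code >> bit) & 1)


def tuned_logical_effort(g_w1: float, w1: int, w2: int) -> float:
    """Equation (4.4): g_W2 = g_W1 * W1 / W2 at the same voltage and temperature."""
    return g_w1 * w1 / w2
```

For instance, `total_width(0b10000000)` gives 128 unit sizes and `total_width(0xFF)` gives the maximum of 255.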



Figure 4.4 Tunable-width inverter

#### 4.2.2 Creating Constant Gate Delay

This section demonstrates how to create a constant gate delay by tuning the logical effort to a fixed value. With constant-delay buffers, temperature-induced clock skew can be mitigated, as described in section 4.3. From the delay equation of classic logical effort, the gate delay is $d_{abs} = \tau(gh + p)$. The stage effort $f = gh$ is usually much more significant than the parasitic delay *p*, so we consider the effects of temperature and voltage only on the logical effort *g* and neglect their effects on *p*. Under various thermal and supply voltage conditions, we tune the logical effort *g* to a fixed value in order to create a constant delay.

Assume that the supply voltage is fixed at $V_{SUPPLY}$. The relation between the two logical efforts $g_{W1}(V_{SUPPLY}, T)$ and $g_{W2}(V_{SUPPLY}, T)$ is given by equation (4.4). We define $g_{W1}(V_{SUPPLY}, 25^{\circ}C) = 1$, that is, the logical effort is 1 at supply voltage $V_{SUPPLY}$, temperature 25°C and width $W_1$. Our goal is to keep the logical effort at the fixed value 1 under all thermal conditions. The width is therefore altered to a target width $W_2$ according to the current temperature *T* so that $g_{W2}(V_{SUPPLY}, T) = 1$. Substituting $g_{W2}(V_{SUPPLY}, T) = 1$ into equation (4.4) gives:

$$W_2 = W_1 \times g_{W1} \left( V_{SUPPLY}, T \right) \tag{4.5}$$

$W_2$ is the product of the reference width $W_1$ and $g_{W1}(V_{SUPPLY}, T)$. For example, assume $V_{SUPPLY} = 0.5V$ and $W_1 = 128X$ unit size. If the temperature changes from 25°C to -25°C, the logical effort changes from $g_{W1}(0.5V, 25^{\circ}C) = 1$ to $g_{W1}(0.5V, -25^{\circ}C) = 1.34$, which corresponds to procedure I in Figure 4.5. The target width for which $g_{W2}(V_{SUPPLY}, -25^{\circ}C) = 1$ is $W_2 = W_1 \times g_{W1}(0.5V, -25^{\circ}C) = 128X \times 1.34 \approx 172X$ unit size. Tuning the width to $W_2$ sets the logical effort back to 1, which corresponds to procedure II in Figure 4.5. Figure 4.6 shows the $W_2$-*T* curve given by equation (4.5) with $V_{SUPPLY} = 0.5V$ and $W_1 = 128X$.
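The worked example can be checked with a short Python sketch of equation (4.5). Rounding to the nearest unit size and clamping to the 1X to 255X tuning range are assumptions added here, since the width can only take integer multiples of the unit size; the function name is hypothetical.

```python
def target_width(w1_units: int, g_w1_at_t: float, max_units: int = 255) -> int:
    """Equation (4.5): W2 = W1 * g_W1(V_SUPPLY, T), in unit sizes.

    The result is rounded to the nearest unit size and clamped to the
    available tuning range (both steps are assumptions of this sketch).
    """
    w2 = round(w1_units * g_w1_at_t)
    return max(1, min(max_units, w2))


# Example from the text: 0.5 V, W1 = 128X, g_W1 = 1.34 at -25 degrees C.
w2_cold = target_width(128, 1.34)
```

The clamp also makes the tuning limit visible: with $W_1 = 128X$, logical efforts above roughly 2 can no longer be fully compensated.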



Figure 4.5 Logical effort with two different widths



Figure 4.6 Tuned  $W_2$  according to various thermal conditions

To test the method of creating a constant gate delay against thermal variation, we simulate a ring oscillator composed of 9 stages of tunable-width inverters in UMC 65-nm technology. In Figure 4.7, control signals B0-B7 determine the total width of every tunable-width inverter. The width is tuned according to equation (4.5) to compensate for the temperature-induced delay variation. With compensation, the period of the ring oscillator is almost unchanged. Figure 4.8 shows the simulation results at 0.5V before and after compensation. The maximum normalized period reaches 1.77 with the width fixed at 128X, but it is lowered to 1.08 when the logical effort is tuned to 1.



Figure 4.7 A ring oscillator composed of 9-stage tunable-width inverters



Figure 4.8 Normalized period before and after compensation

#### 4.3 A Thermally Robust Buffered Clock Tree Using Logical Effort Compensation

To mitigate temperature-induced clock skew in the clock tree, we propose a thermally robust buffered clock tree. The commonly used buffered H-tree is taken as the clock distribution scheme, and a local temperature sensor is placed beside each clock buffer. The logical effort of each clock buffer is tuned according to the digital temperature codes. With the logical effort tuned to 1 as shown in section 4.2, the clock buffer has a nearly constant delay under temperature variation.

A typical H-tree in a UMC 65-nm design, shown in Figure 4.1, is chosen, where the die length is 2 cm. The side length is 10 mm in the first-level H-tree and 5 mm in the next level. The H-tree interconnections use 7th-layer metal with a width of 1 µm. To show the effect of temperature difference on clock skew, the die is divided into two temperature areas, TL on the left half and TR on the right half. As the clock signal propagates through the H-tree, it enters areas of different temperature and experiences different delays, which produces clock skew. In this design, the clock skew is measured between points A and B in Figure 4.1.

Beside each clock buffer there is a temperature sensor and a look-up table. Figure 4.9 shows the control blocks and the tunable-width buffer. The temperature sensor senses the thermal condition and outputs the temperature code T[9:0]. The total width of the tunable-width buffer is then adjusted according to the look-up table, where B[7:0] is the width control code. In essence, the buffer's logical effort is tuned to a fixed value based on equation (4.5). Section 4.2 showed that the tunable-width inverter has a nearly constant delay when its logical effort is tuned to a fixed value. The buffer retains this property even though the clock buffer is loaded by long, wide metal wires. Moreover, this design mainly targets the ultra-low voltage region, where the buffer rather than the metal resistance dominates the delay. Thus the tunable-width buffer in the clock tree can still have a constant delay with logical effort compensation.



Figure 4.9 Tunable-width buffer with control blocks

Figure 4.10 shows the fully on-chip temperature, process and voltage sensors proposed by Shi-Wen Chen [4.10]. P[3:0], V[4:0] and T[9:0] are the output codes for process, voltage and temperature, respectively. The temperature compensation block is fed with P[3:0] and V[4:0] to calibrate the temperature codes. Unlike conventional temperature sensors that use a voltage/current analog-to-digital converter (ADC) or a bandgap reference, this sensor adopts a frequency-to-digital scheme. The zero-temperature-coefficient (ZTC) bias point is used to remove the temperature effect. The sensor is designed in UMC 65-nm bulk CMOS technology and operates over a wide voltage range of 0.3V~1V, so it is suitable for the ultra-low voltage thermally robust buffered clock tree.

In addition, the power consumption is no more than $3.7\,\mu$W at a 0.3V supply voltage. This low power consumption makes it practical to distribute temperature sensors across the chip. The temperature error is only -0.8~0.8°C, so the thermally robust clock tree can tune the logical effort with high precision.



Figure 4.10 Temperature Sensor Proposed by Shi-Wen Chen

#### **4.4 Simulation Results**

HSPICE simulations with extracted layout parameters were performed to evaluate the proposed design. The simulations used UMC 65-nm technology and included the layout of the tunable-width buffers and the 7th-layer metal H-tree interconnection. Figure 4.11 shows the tunable-width inverter and the 7th-layer metal with 1-µm width. The clock skew is measured between points A and B in Figure 4.1 under various thermal conditions. Table 4.1 lists the improvement in clock skew after logical effort compensation at 0.3V (sub-threshold region) and 0.5V (near-threshold region). In Table 4.1, $W_1$ is set to 128X for 0.5V and 64X for 0.3V, considering the logical effort tuning range. Before compensation, the buffer width stays equal to $W_1$ under all thermal conditions. With logical effort compensation, the clock buffers have a constant delay, which mitigates temperature-induced clock skew. The clock skew is reduced by up to 97.8%, and by 71.19% on average.



| TL (°C) | TR (°C) | Skew @0.3V before (ns) | Skew @0.3V after (ns) | Improvement @0.3V | Skew @0.5V before (ns) | Skew @0.5V after (ns) | Improvement @0.5V |
|---------|---------|------------------------|-----------------------|-------------------|------------------------|-----------------------|-------------------|
| -25     | 0       | 101.4                  | 5.6                   | 94.5%             | 1.055                  | 0.120                 | 88.6%             |
| -25     | 25      | 153.8                  | 7.6                   | 95.1%             | 1.769                  | 0.327                 | 81.5%             |
| -25     | 50      | 183.2                  | 18.6                  | 89.8%             | 2.259                  | 0.476                 | 79.0%             |
| -25     | 75      | 200.9                  | 19.0                  | 90.5%             | 2.645                  | 0.641                 | 75.8%             |
| -25     | 100     | 212.2                  | 25.7                  | 87.9%             | 2.891                  | 0.800                 | 72.3%             |
| -25     | 125     | 219.7                  | 27.4                  | 87.5%             | 3.108                  | 0.873                 | 71.9%             |
| 0       | 25      | 52.4                   | 2.0                   | 96.1%             | 0.714                  | 0.207                 | 71.0%             |
| 0       | 50      | 81.8                   | 13.1                  | 84.0%             | 1.205                  | 0.356                 | 70.5%             |
| 0       | 75      | 99.5                   | 13.4                  | 86.5%             | 1.590                  | 0.521                 | 67.3%             |
| 0       | 100     | 110.8                  | 20.1                  | 81.9%             | 1.836                  | 0.680                 | 63.0%             |
| 0       | 125     | 118.3                  | 21.9                  | 81.5%             | 2.054                  | 0.753                 | 63.3%             |
| 25      | 50      | 29.4                   | 11.0                  | 62.5%             | 0.490                  | 0.149                 | 69.7%             |
| 25      | 75      | 47.1                   | 11.4                  | 75.8%             | 0.876                  | 0.313                 | 64.2%             |
| 25      | 100     | 58.4                   | 18.0                  | 69.1%             | 1.122                  | 0.473                 | 57.9%             |
| 25      | 125     | 65.9                   | 19.8                  | 69.9%             | 1.340                  | 0.546                 | 59.2%             |
| 50      | 75      | 17.7                   | 0.4                   | 97.8%             | 0.386                  | 0.165                 | 57.2%             |
| 50      | 100     | 29.0                   | 7.0                   | 75.7%             | 0.632                  | 0.324                 | 48.7%             |
| 50      | 125     | 36.5                   | 8.8                   | 75.8%             | 0.849                  | 0.398                 | 53.2%             |
| 75      | 100     | 11.3                   | 6.6                   | 41.2%             | 0.246                  | 0.159                 | 35.4%             |
| 75      | 125     | 18.8                   | 8.4                   | 55.1%             | 0.464                  | 0.233                 | 49.8%             |
| 100     | 125     | 7.5                    | 1.8                   | 76.2%             | 0.217                  | 0.073                 | 66.2%             |

Table 4.1 Compensation improvement of clock skew in sub/near-threshold region
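The improvement column of Table 4.1 appears to be the relative skew reduction, (before - after) / before. The sketch below recomputes it for two rows as a sanity check; the function name is hypothetical, and small differences from the tabulated values are expected because the table was presumably computed from unrounded skews.

```python
def improvement_percent(before_ns: float, after_ns: float) -> float:
    """Relative clock skew reduction in percent."""
    return (before_ns - after_ns) / before_ns * 100.0


# TL = -25 C, TR = 0 C, taken from Table 4.1.
imp_sub = improvement_percent(101.4, 5.6)      # 0.3 V, sub-threshold
imp_near = improvement_percent(1.055, 0.120)   # 0.5 V, near-threshold
```

Both values agree with the tabulated 94.5% and 88.6% to within 0.1 percentage point.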

## Chapter 5 A Programmable Clock Generator for Sub- and Near-Threshold DVFS System

This chapter presents a sub/near-threshold programmable clock generator. It can generate an output clock at 1/8 to 4 times the frequency of the reference clock. Variation-aware logic design is applied in the clock generator, which improves its reliability under process variation. The adoption of a pulse-circulating scheme reduces process-induced output clock jitter. In addition, we realize a PVT compensation unit that adjusts the locking range of the clock generator. The clock generator has been designed in UMC 65-nm CMOS technology. The reference clock frequency is 625 kHz at 0.2V and 5 MHz at 0.5V.

Section 5.1 gives the introduction. Section 5.2 shows the system architecture of the proposed programmable clock generator for sub/near-threshold DVFS systems. Section 5.3 introduces the variation-aware logic design for sub-threshold operation. Section 5.4 demonstrates the proposed PVT compensation technique, which mainly adjusts the locking range of the clock generator. Section 5.5 gives the circuit description of the clock generator. Section 5.6 combines the clock tree proposed in chapter 4 with the programmable clock generator. Section 5.7 shows the design implementation in UMC 65-nm CMOS technology. Finally, section 5.8 presents the post-layout simulation results.

## **5.1 Introduction**

The dynamic-voltage-and-frequency-scaling (DVFS) technique has been adopted in many low-power devices, such as wireless body area network (WBAN) communication systems. A WBAN system collects body signals and provides reliable physical monitoring through many wireless sensor nodes (WSNs) attached to or implanted inside the human body [5.1][5.2]. To meet the low-power requirement, near/sub-threshold operation has been introduced to WBAN systems.

Many clock multiplication schemes have been proposed for DVFS systems in the super-threshold region. Phase-locked loops (PLLs) are commonly used as clock generators, but locking takes hundreds of reference clock cycles. To enhance the flexibility of the clock generator for DVFS systems, an all-digital clock generator was presented in [5.3], which generates the output clock by dynamically delaying the reference clock according to a frequency control code. However, its output frequency can only be a fraction of the reference clock frequency. A delay-locked loop (DLL) [5.4] was presented for DVFS systems, but it cannot generate a fractional clock. A cyclic clock multiplier (CCM) has been presented for DVFS applications [5.5], and it has the advantage of creating either a fractional or a multiplied clock. However, the cyclic clock multiplier uses a time-to-digital converter (TDC) for phase error detection, which consumes considerable area and power.

In this chapter, a programmable clock generator aimed at the sub- and near-threshold regions is proposed. It adopts the pulse-circulating scheme in [5.5] and offers several advantages. First, the pulse always circulates through the same delay line; thus, compared with DLL-based clock multipliers [5.6][5.7], the process-induced phase error is reduced. Second, the proposed clock generator provides PVT compensation for the locking range, which takes only one reference clock cycle. Finally, variation-aware logic design is applied for sub- and near-threshold operation.

#### 5.2 System Architecture

The architecture of the proposed clock generator is shown in Figure 5.1. The clock generator consists of the following main blocks: pulse generators (PG), a phase detector, a counter, a lock-in delay line, a PVT-Comp. (PVT compensation) delay line, a PVT-Comp. block, a control block and a frequency divider.



Figure 5.1 Proposed clock generator for sub- and near-threshold DVFS system

In the proposed clock generator, the CLK<sub>REF</sub> signal enters a PG, which produces pulses ($P_{REF}$) at the frequency of CLK<sub>REF</sub>. The pulse multiplier generates pulses ($P_{OUT}$) at 8 times the frequency of the reference pulses ($P_{REF}$). In addition, the divider can divide the input frequency by 2, 4, 6 or 8. Therefore, the proposed clock generator can output a clock at M/N times the frequency of the reference clock, where M is 1 or 8 and N is 2, 4, 6 or 8, both controlled by the input frequency selection signal FS[2:0]. Table 5.1 shows the frequency selection range.

| FS[2:0] | M | N | $f_{out}$ / $f_{ref}$ |
|---------|---|---|-----------------------|
| 000     | 1 | 8 | 0.125                 |
| 001     | 1 | 6 | 0.167                 |
| 010     | 1 | 4 | 0.250                 |
| 011     | 1 | 2 | 0.5                   |
| 100     | 8 | 8 | 1                     |
| 101     | 8 | 6 | 1.333                 |
| 110     | 8 | 4 | 2                     |
| 111     | 8 | 2 | 4                     |

Table 5.1 Frequency selection range,  $f_{\text{out}}$  and  $f_{\text{ref}}$  are the frequencies of output and reference clocks
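Table 5.1 can be reproduced by a small decoding function. The bit assignment used below (the most significant bit of FS selects M, the two lower bits select N) is inferred from the table rows and is not stated explicitly in the text; the function name is hypothetical.

```python
from fractions import Fraction


def frequency_ratio(fs: int) -> Fraction:
    """Return f_out / f_ref = M / N for a 3-bit selection code FS[2:0].

    Inferred from Table 5.1: FS[2] selects M (0 -> 1, 1 -> 8) and
    FS[1:0] selects N (00 -> 8, 01 -> 6, 10 -> 4, 11 -> 2).
    """
    if not 0 <= fs <= 0b111:
        raise ValueError("FS must be a 3-bit code")
    m = 8 if fs & 0b100 else 1
    n = (8, 6, 4, 2)[fs & 0b011]
    return Fraction(m, n)


ratios = [frequency_ratio(fs) for fs in range(8)]
```

The exact fractions 1/6 and 4/3 correspond to the rounded entries 0.167 and 1.333 in the table.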

To produce $P_{OUT}$ at 8 times the frequency of $P_{REF}$, we adopt a circulating scheme. Each pulse of $P_{REF}$ enters the circulating path and circulates 8 times. The path is determined by the path selection signal SEL: when SEL = 1, the pulse from $P_{REF}$ can enter the delay line; otherwise, the circulating path is formed. The counter counts the number of times the pulse has passed through the circulating path, and it informs the phase detector and the control block through the signal countE8 when the count reaches 8. The phase detector compares the phases of $P_{OUT}$ and $P_{REF}$ only when the count is equal to 8. The control block changes the value of C[5:0] according to the comparison results, LEAD and LAG. Figure 5.2 shows the procedure of system operation. After the system is reset, the state machine passes through three steps: PVT compensation, SAR (successive approximation register) control and lock.



Figure 5.2 Finite state machine

In the first step, the system performs PVT compensation. In the sub- and near-threshold regions, device behavior is affected by PVT variations more seriously than in the super-threshold region. PVT variations cause the delay of the lock-in delay line to vary widely. To compensate for these delay variations, the clock generator uses the PVT-Comp. technique to provide adequate delay for the lock-in delay line, so that the period of the reference clock falls within the locking range of the lock-in delay line.

After PVT compensation, the system enters the second step, SAR control, which uses a binary search algorithm. In this step, the control block changes the control code C[5:0] according to the comparison results of the phase detector, LEAD and LAG. With SAR control, the lock-in delay line is tuned so that the clock generator can lock to the reference clock.
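The binary search performed by SAR control can be sketched as follows. This is a behavioral illustration only, not the control block's actual logic: it assumes that the delay grows monotonically with C[5:0] and that LAG means the delay line is currently too long. The names `sar_search` and `toy_lags` and the toy delay model are hypothetical.

```python
from typing import Callable


def sar_search(lags: Callable[[int], bool], bits: int = 6) -> int:
    """Successive-approximation search for a control code.

    lags(code) models the phase detector: it returns True when the delay
    selected by `code` is too long (LAG) and False otherwise (LEAD).
    Bits are decided from the most significant to the least significant.
    """
    code = 0
    for bit in reversed(range(bits)):
        trial = code | (1 << bit)  # tentatively set this bit
        if not lags(trial):        # delay still not too long: keep the bit
            code = trial
    return code


def toy_lags(code: int, target_ps: float = 75.0) -> bool:
    """Hypothetical delay line: 10 ps offset plus 2 ps per code step."""
    return 10.0 + 2.0 * code > target_ps


locked_code = sar_search(toy_lags)
```

With a 6-bit code the search settles in six comparisons, ending on the largest code whose delay does not exceed the target.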

Finally, the clock generator enters the lock state, in which it can output a clock with multiplied or divided frequency. In this step the feedback loop, from the output clock through the phase detector and control block to the control code C[5:0], is still maintained. The control block continues tracking by means of counter control, which increments or decrements C[5:0] by 1 at a time. Keeping the loop closed guarantees that the clock generator stays locked to the reference clock even if the voltage and temperature vary at run time. Figure 5.3 and Figure 5.4 show schematic waveforms. The state and the control code C[5:0] change every two clock cycles.



Figure 5.3 The schematic diagram of waveform from state Reset to SAR control



Figure 5.4 The schematic diagram of waveform from state SAR control to Lock

#### **5.3 Variation-Aware Logic Design**

### 5.3.1 Sub-Threshold Logic Design Challenge

Voltage scaling is an effective approach for improving the power efficiency. The power consumed by a digital circuit can be expressed as

$$P = C \times V_{DD}^{2} \times f \tag{5.1}$$

where *P* is the power consumption, *C* is the total charged capacitance, $V_{DD}$ is the supply voltage and *f* is the operating frequency. Since the power is a quadratic function of $V_{DD}$, voltage scaling lowers power consumption efficiently. However, when the supply voltage is scaled down to the sub-threshold region, two critical factors affect functionality [5.8]. First, the ratio of on current to off current ($I_{ON} / I_{OFF}$) in logic gates decreases. Second, random dopant fluctuation is a dominant source of local variation in sub-$V_t$ operation [5.9]. These two factors result not only in reduced output swing in CMOS logic gates but also in a skewed voltage transfer curve (VTC). Figure 5.5 shows the VTC of an inverter operated at 300 mV [5.8] with global as well as local variations. In some cases, the logic levels of CMOS gates are severely degraded.
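To make the quadratic dependence in equation (5.1) concrete, the following sketch compares the power at two supply voltages with the capacitance and frequency held fixed. The capacitance and frequency values are placeholders chosen only for illustration.

```python
def dynamic_power(c_farads: float, vdd_volts: float, f_hertz: float) -> float:
    """Equation (5.1): P = C * VDD^2 * f."""
    return c_farads * vdd_volts ** 2 * f_hertz


# Placeholder values: 1 pF switched at 5 MHz.
p_full = dynamic_power(1e-12, 1.0, 5e6)
p_half = dynamic_power(1e-12, 0.5, 5e6)
power_ratio = p_half / p_full
```

Halving the supply voltage at the same frequency reduces the power to one quarter; in practice the frequency is usually lowered as well, which reduces the power further.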



Figure 5.5 Effects of variations and reduced  $I_{ON}$  /  $I_{OFF}$  on sub-Vt inverter voltage transfer curve [5.8]

# **5.3.2 Mitigating Variation by Upsizing Transistors**

Upsizing transistors is one technique for mitigating local variation. Research in [5.10] shows that the standard deviation of $V_t$ varies inversely with the square root of the channel area. However, upsizing the lengths and widths of transistors increases the total capacitance and therefore the power consumption. The back-to-back configuration in Figure 5.6, proposed in [5.8], is used to deal with this trade-off. In this configuration, NAND and NOR gates are selected to check the worst case. Because a NAND gate has more leakage for output logic 1 and a NOR gate has more leakage for output logic 0, the worst skewed VTC is obtained. To mitigate $V_t$ variation at 0.2V, 3-input gates require a much larger area increase than 2-input gates. In addition, 3 stacked transistors conduct less current, which lowers the operation speed. In the proposed clock generator, we therefore use only 2-input logic gates for sub-threshold operation.



Figure 5.6 Back-to-back configuration

In Figure 5.6, the initial output voltages of the NAND and NOR gates are set to 0V and 0.2V, respectively. If a function error occurs, the output voltages change. The trend of the failure rate is analyzed in [5.8], which shows that the failure rate decreases exponentially as either $V_{DD}$ or the device width increases. In UMC 65-nm CMOS technology at 0.2V, a 10000-run Monte Carlo simulation shows 51 failures with minimum width and length, and no failures with 125% of the minimum width and 123% of the minimum length. Device-aware sizing is therefore needed in sub-threshold design to avoid function errors.
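As a rough illustration of why a modest upsizing helps, the inverse-square-root dependence of the $V_t$ standard deviation on channel area can be applied to the sizing above. This sketch only evaluates that proportionality; it does not model the failure rate, and the function name is hypothetical.

```python
import math


def sigma_vt_scale(width_factor: float, length_factor: float) -> float:
    """Relative sigma(Vt) after upsizing, assuming sigma ~ 1 / sqrt(W * L)."""
    return 1.0 / math.sqrt(width_factor * length_factor)


# 125% of minimum width and 123% of minimum length, as in the text.
relative_sigma = sigma_vt_scale(1.25, 1.23)
```

Under this assumption the channel area grows by about 54% and the standard deviation of $V_t$ drops to roughly 81% of its minimum-size value.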

## 5.4 PVT Compensation for Locking Range of Delay Line

This section demonstrates the proposed PVT compensation technique for the locking range of the delay line. Device behavior varies much more with PVT variations in the sub-threshold region than in the super-threshold region. In the clock generator, the affected devices change the delay range of the lock-in delay line, so the clock generator may fail to lock to the reference clock. The PVT compensation technique is proposed to solve this problem. Figure 5.7 shows the concept of PVT compensation. Under the typical condition, the reference clock period is within the locking range of the lock-in delay line. Under PVT variations, the locking range shifts, and the clock generator cannot lock to the reference clock. With PVT compensation, the locking range can be adjusted accordingly.



## 5.4.1 Delay Ratio of FO1-INV to FO2-NAND

This subsection demonstrates the delay ratio between an inverter with fanout 1 (FO1-INV) and a NAND gate with fanout 2 (FO2-NAND); this characteristic is used in the PVT compensation for the locking range of the delay line. FO1-INV is considered because it is the cell of the PVT sensing circuit in the PVT-comp. block. FO2-NAND is considered because its delay is the unit delay step that can be tuned in the lock-in delay line, whose topology is shown in Figure 5.8.



Figure 5.8 Topology of delay line (lattice delay line [5.13]) used in the proposed clock generator

Figure 5.9 shows two 11-stage ring oscillators composed of FO1-INV and FO2-NAND cells, respectively. The $\beta$ ratio of FO1-INV is 1, that is, the NMOS and PMOS have the same size. Figure 5.10 and Figure 5.11 show Monte Carlo simulation results of the oscillators at 0.2V (sub-threshold region) and 0.5V (near-threshold region). The simulations also include temperature effects. At both 0.2V and 0.5V, the delay ratio of FO2-NAND to FO1-INV is 2, and this ratio is unchanged under various PVT conditions. The next subsection uses this property in the PVT compensation for the locking range of the delay line.



Figure 5.9 Ring oscillator using (a) FO1-INV cell, (b) FO2-NAND cell



Figure 5.10 Periods of ring oscillators (composed of FO1-INV and FO2-NAND) at 0.2V



Figure 5.11 Periods of ring oscillators (composed of FO1-INV and FO2-NAND) at 0.5V

#### 5.4.2 Procedure of PVT Compensation for Locking Range of Delay Line

PVT compensation is performed at the beginning of the state machine, as mentioned in Section 5.2. The procedure completes within a single reference cycle. Figure 5.12 shows the PVT-comp. block and the PVT-comp. delay line, which is controlled by D[5:0]. In the PVT compensation state, the PVT-comp. block first senses the environmental conditions and records them as a counted number *count*. Then *count* is decoded into the control code D[5:0] so that the PVT-comp. delay line provides adequate delay.



Figure 5.12 PVT-comp. block and PVT-comp. delay line

The PVT-comp. block is shown in Figure 5.13. It consists of a PVT sensing circuit, a counter and a decoder. The PVT sensing circuit is a ring oscillator that can be switched on or off. When the clock generator is in the PVT-comp. state, the switch signal is asserted for one reference clock cycle, and the counter records the number of oscillation cycles in that interval.



Figure 5.13 PVT-comp. block

The ring oscillator is composed of 62 FO1-INV stages and one NAND stage. The FO1-INV is the same as that in Section 5.4.1. According to the simulation results, the period of the ring oscillator output is nearly equal to the delay of 128 FO1-INV stages, $128 \times D_{INV}$. The counted number *count* is therefore:

$$count = \frac{T_D}{128 \times D_{INV}} \tag{5.2}$$

where  $T_D$  represents one reference clock cycle. The relation between delay of FO1-INV and delay of FO2-NAND has been introduced in subsection 5.4.1:

$$D_{NAND,FO2} = 2 \times D_{INV} \tag{5.3}$$

where $D_{NAND,FO2}$ represents the delay of the FO2-NAND. Thus:

$$count = \frac{T_D}{64 \times D_{NAND,FO2}} \tag{5.4}$$

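Because the clock generator adopts a circulating scheme with output pulses at 8 times the reference frequency, the pulse signal propagates through the delay line 8 times per reference cycle. To lock to the reference clock, the target delay of the entire delay line should therefore equal $\frac{1}{8} T_D$. So from (5.4):

$$\frac{1}{8} T_D = count \times 8 \times D_{NAND,FO2} = 64 \times D_{NAND,FO2} + (count \times 8 - 64) \times D_{NAND,FO2} \tag{5.5}$$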

The delay of the entire delay line is divided into two parts: the delay provided by the PVT-comp. delay line and the delay provided by the lock-in delay line. The locking range of the lock-in delay line extends from $4 \times D_{NAND,FO2}$ to $130 \times D_{NAND,FO2}$, which will be introduced in the next subsection. The $64 \times D_{NAND,FO2}$ term in equation (5.5) sets the target delay of the lock-in delay line to about the midpoint of its locking range. The remaining delay is provided by the PVT-comp. delay line.

Figure 5.14 shows the PVT-comp. delay line, which is similar to the nested lattice delay line (NLDL) proposed in [5.13]. The unit delay step of the PVT-comp. delay line is $32 \times D_{NAND,FO2}$. To calculate the control code D[5:0], we divide the delay provided by the PVT-comp. delay line in (5.5) by $32 \times D_{NAND,FO2}$:

$$D[5:0] = \frac{1}{32 \times D_{NAND,FO2}} \left[ (count \times 8 - 64) \times D_{NAND,FO2} \right] = \frac{count}{4} - 2 \tag{5.6}$$

Because the delay provided by the PVT-comp. delay line cannot be negative, the minimum value of D[5:0] is 0: whenever the value calculated from (5.6) is negative, D[5:0] is set to 0. The decoder in Figure 5.13 implements equation (5.6). Because *count* is divided by 4, the division is implemented as a 2-bit right shift, which saves circuit area.
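The counting and decoding steps can be summarized in a short behavioral sketch. It is not the circuit itself: it assumes integer picosecond time units, a truncating 2-bit shift, and saturation of D[5:0] at 63, none of which is stated explicitly in the text. All function names and the example operating point are hypothetical.

```python
LOCK_MIN = 4         # lower bound of the lock-in range, in units of D_NAND,FO2
LOCK_MAX = 130       # upper bound of the lock-in range, in units of D_NAND,FO2
CODE_MAX = 0b111111  # assumed saturation value for the 6-bit code D[5:0]


def pvt_count(t_ref_ps: int, d_inv_ps: int) -> int:
    """Whole oscillator cycles counted in one reference cycle, per (5.2)."""
    return t_ref_ps // (128 * d_inv_ps)


def decode_pvt_code(count: int) -> int:
    """D[5:0] = count/4 - 2 per (5.6), with the division done as a 2-bit shift."""
    code = (count >> 2) - 2
    return max(0, min(code, CODE_MAX))  # negative results are set to 0


def lock_in_target(count: int, code: int) -> int:
    """Delay left for the lock-in delay line, in units of D_NAND,FO2.

    The total target T_D/8 equals 8 * count units, and each step of the
    PVT-comp. delay line accounts for 32 units.
    """
    return 8 * count - 32 * code


# Hypothetical operating point: 6.4 us reference period, 1 ns FO1-INV delay.
count = pvt_count(t_ref_ps=6_400_000, d_inv_ps=1_000)
code = decode_pvt_code(count)
residual = lock_in_target(count, code)
```

Under these assumptions the residual delay left for the lock-in delay line stays between `LOCK_MIN` and `LOCK_MAX` for any nonzero *count* below the saturation point, which is the purpose of centering the target at $64 \times D_{NAND,FO2}$.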



#### **5.5 Circuit Description**

#### 5.5.1 Lock-In Delay Line (LIDL) Controller

The lock-in delay line controller combines two locking strategies: SAR (successive approximation register) control [5.11] and counter control [5.12]. The SAR-controlled strategy adopts a binary search algorithm, which achieves fast lock time with low hardware complexity. However, because it is open-loop, it cannot track environmental variations such as temperature and voltage changes. To solve this problem, the counter-controlled strategy is added; its closed-loop characteristic allows it to track environmental variations. When the clock generator starts, it first uses the SAR strategy for fast locking; after the SAR algorithm finishes, it switches to the counter-controlled strategy. Figure 5.15 illustrates this procedure. C[5:0] is the LIDL control code, which is fed back to the combinational logic blocks. The multiplexer chooses which lock-in strategy is used. When the clock generator is in the lock state, it selects the counter-controlled strategy to track environmental variations.
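The two-phase locking procedure can be illustrated with a behavioral sketch. It is a simplified model under stated assumptions: the phase detector decision is replaced by a direct comparison of delays, and the delay line is modeled as a hypothetical linear function of C[5:0] chosen only so that it spans 4 to 130 unit delays. The function names are illustrative and do not correspond to blocks in Figure 5.15.

```python
from typing import Callable

CODE_BITS = 6  # width of the LIDL control code C[5:0]


def sar_lock(delay_of: Callable[[int], int], target: int, bits: int = CODE_BITS) -> int:
    """Binary search from MSB to LSB for the largest code whose delay <= target."""
    code = 0
    for bit in reversed(range(bits)):
        trial = code | (1 << bit)
        if delay_of(trial) <= target:  # stand-in for the phase detector decision
            code = trial
    return code


def counter_track(code: int, delay_of: Callable[[int], int], target: int,
                  bits: int = CODE_BITS) -> int:
    """One closed-loop update: move the code by at most one step toward the target."""
    max_code = (1 << bits) - 1
    delay = delay_of(code)
    if delay < target and code < max_code:
        return code + 1
    if delay > target and code > 0:
        return code - 1
    return code


def lidl_delay(code: int) -> int:
    """Hypothetical linear delay model, in units of D_NAND,FO2 (4 to 130)."""
    return 4 + 2 * code


# Phase 1: fast open-loop lock with the SAR search.
locked_code = sar_lock(lidl_delay, target=64)

# Phase 2: the target drifts (e.g. a temperature change); the counter follows.
tracked_code = locked_code
for _ in range(5):
    tracked_code = counter_track(tracked_code, lidl_delay, target=70)
```

The SAR phase needs only six comparisons for a 6-bit code, while the counter phase changes the code by one step per comparison, which is what lets it follow slow environmental drift after lock.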



Figure 5.15 Lock-In Delay Line (LIDL) Controller

#### 5.5.2 Lock-In Delay Line (LIDL)

The LIDL is modified from the nested lattice delay line (NLDL) [5.13]. Compared with the NLDL, the LIDL in Figure 5.16 saves circuit area by using a 14-stage FO2-NAND chain instead of a lattice delay line (LDL) as the block delay. It still keeps the advantages of the NLDL. First, the LIDL has equal rising and falling times. Second, the maximum operating frequency stays the same as the tuning range increases. Finally, the delay variation is only half that of the conventional configuration, as shown in Figure 5.17.



Figure 5.17 Simulation result of delay variation [5.13]

#### **5.5.3 Pulse Generator**

To circulate a pulse in the LIDL, the pulse width and the duty cycle should be designed properly to prevent the pulse from disappearing [5.14]. Figure 5.18 shows the pulse generator, which is composed of a D flip-flop and a delay line. A pulse is generated whenever a rising edge occurs on $V_{IN}$.



#### 5.5.4 SEL Generator

In Figure 5.1, the SEL signal selects the source of the pulse, either P<sub>OUT</sub> or P<sub>REF</sub>. If the pulse comes from P<sub>REF</sub>, the circulating pulses are readjusted. If the pulse comes from P<sub>OUT</sub>, the circulating path is formed. Figure 5.19 shows the SEL generator, which has two modes corresponding to the SAR and Lock states. In the SAR state, SEL is inverted at every negative edge of $P_{REF}$; the waveform is shown in Figure 5.20. In the Lock state, SEL is determined by P<sub>REF</sub> and CountE8; the waveform is shown in Figure 5.21. SEL goes high when $P_{REF}$ is high or when the 8th pulse of $P_{OUT}$ arrives; the latter condition prevents a 9th pulse from propagating through the circulating path early.



Figure 5.21 SEL waveform while State = Lock

#### **5.5.5 Phase Detector**

In Figure 5.22, the phase detector compares the arrival times of $P_{REF}$ and the 8th $P_{OUT}$ pulse. A conventional phase detector uses only two D flip-flops, which is not suitable for the pulse-circulating scheme. Here two additional flip-flops are placed in front of them, so that only $P_{REF}$ and the 8th $P_{OUT}$ pulse are compared, without disturbance from the other $P_{OUT}$ pulses. Figure 5.23 shows the RST<sub>PD</sub> generator; phase comparison is performed every two reference cycles.



The frequency divider shown in Figure 5.24 divides the frequency of $P_{DIV}$ by 2, 4, 6 or 8 according to the frequency selection signal FS[1:0]. The division ratio is determined by how many flip-flops are in the loop. For example, when FS[1:0] is 11, the clock loop passes through only one flip-flop, so the output frequency is that of $P_{DIV}$ divided by 2. The frequency divider produces an output clock with a 50% duty cycle. Table 5.2 lists the relation between FS[1:0] and the frequency division ratio.



Figure 5.24 Frequency divider

| FS[1:0] | $f_{CLKOUT}/f_{PDIV}$ |
|---------|-----------------------|
| 11      | 1/2                   |
| 10      | 1/4                   |
| 01      | 1/6                   |
| 00      | 1/8                   |

Table 5.2 The relation between control signal and output frequency
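The mapping in Table 5.2 is consistent with a loop of n flip-flops with inverted feedback (a twisted-ring arrangement), which divides by 2n and gives a 50% duty cycle. The sketch below models that behavior. It is an assumption about the structure in Figure 5.24, not a description of it, and the rule n = 4 - FS is inferred from the table; all names are hypothetical.

```python
from typing import List


def division_ratio(fs: int) -> int:
    """Division ratio for a 2-bit FS[1:0] code, matching Table 5.2."""
    if fs not in (0b00, 0b01, 0b10, 0b11):
        raise ValueError("FS[1:0] is a 2-bit code")
    flip_flops_in_loop = 4 - fs  # assumed: 11 -> 1 flip-flop, 00 -> 4 flip-flops
    return 2 * flip_flops_in_loop


def twisted_ring_output(n_ff: int, n_clocks: int) -> List[int]:
    """Output of an n_ff-stage shift loop with inverted feedback.

    On each input clock the register shifts by one position and the
    inverted last bit is fed back into the first stage.
    """
    state = [0] * n_ff
    samples = []
    for _ in range(n_clocks):
        state = [1 - state[-1]] + state[:-1]
        samples.append(state[-1])
    return samples


# FS[1:0] = 01 selects divide-by-6, i.e. three flip-flops in the loop.
ratio = division_ratio(0b01)
wave = twisted_ring_output(n_ff=ratio // 2, n_clocks=2 * ratio)
```

In this model the output repeats every 2n input clocks and is high for exactly n of them, which reproduces both the even division ratios and the 50% duty cycle.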

#### **5.6 Combination of Clock Generator and Clock Tree**

In Chapter 4, we proposed a thermally robust clock tree using logical effort compensation. In a buffered clock tree, the clock signal propagates through different paths, and temperature gradients across these paths cause clock skew. The proposed thermally robust clock tree mitigates this temperature-induced clock skew. In this chapter, a programmable clock generator for sub/near-threshold DVFS systems is proposed. It compensates the locking range of the clock generator against PVT variations. Moreover, the pulse-circulating scheme reduces the process-variation-induced clock jitter, which is even more serious in DLL-based multiphase clock generators. Figure 5.25 shows the combination of the proposed programmable clock generator and the thermally robust clock tree. This clock topology targets sub/near-threshold DVFS systems and addresses the problems caused by environmental variations.



Figure 5.25 Combination of proposed thermally robust clock tree and programmable clock generator


#### 5.7 Design Implementation

The proposed clock generator is implemented in UMC 65nm standard CMOS technology. Its major characteristics are operation in the sub- and near-threshold regions and PVT compensation for the locking range of the clock generator. The clock generator works correctly under ±10% supply voltage variation, from -25°C to 125°C, and at all process corners. The layout view of the proposed clock generator is shown in Figure 5.26.



Figure 5.26 Layout view of proposed clock generator

#### **5.8 Simulation Results**

The proposed programmable clock generator for sub/near-threshold DVFS systems is implemented in UMC 65nm CMOS technology and operates over a supply voltage range from 0.2V to 0.5V. At 0.2V, the reference clock frequency is 156 KHz, and the generator consumes 0.18 uW at the maximum output frequency of 625 KHz. At 0.5V, the reference clock frequency is 5 MHz, and the generator consumes 5.17 uW at the maximum output frequency of 20 MHz. Figure 5.27 and Figure 5.28 show the operation waveforms of the proposed clock generator at 0.2V and 0.5V. Table 5.3 gives the summary.



Figure 5.27 The operation waveform at 0.2V with 4X output clock



Figure 5.28 The operation waveform at 0.5V with 4X output clock

| Sub/Near-threshold Programmable Clock Generator |                                           |
|-------------------------------------------------|-------------------------------------------|
| Supply Voltage                                  | 0.2~0.5V                                  |
| Process                                         | UMC 65nm CMOS                             |
| Area                                            | 0.077×0.125mm <sup>2</sup>                |
| Reference Clock                                 | 156KHz @ 0.2V                             |
|                                                 | 5MHz @ 0.5V                               |
| Max. Output Frequency                           | 625KHz @ 0.2V                             |
|                                                 | 20MHz @ 0.5V                              |
| Min. Output Frequency                           | 19.5KHz @ 0.2V                            |
|                                                 | 625KHz @ 0.5V                             |
| Output Jitter                                   | 60ns @ 625KHz CLK <sub>OUT</sub> , 0.2V   |
|                                                 | 4ns @ 20MHz CLK <sub>OUT</sub> , 0.5V     |
| Power Consumption                               | 0.18uW @ 625KHz CLK <sub>OUT</sub> , 0.2V |
|                                                 | 5.17uW @ 20MHz CLK <sub>OUT</sub> , 0.5V  |

Table 5.3 Summary of proposed clock generator
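The frequency entries in Table 5.3 follow from the 1/8X to 4X programming range. The sketch below reproduces that arithmetic and also derives an energy per output cycle from the listed power figures. Two points are assumptions rather than reported results: the 0.2V reference is taken as 156.25 kHz (which the table would round to 156KHz), and the energy-per-cycle figure is a derived quantity that does not appear in the table.

```python
from typing import Tuple


def output_range(f_ref_hz: float) -> Tuple[float, float]:
    """(minimum, maximum) output frequency for a 1/8X to 4X programmable range."""
    return f_ref_hz / 8, f_ref_hz * 4


def energy_per_cycle(power_w: float, f_out_hz: float) -> float:
    """Average energy drawn per output clock cycle, in joules."""
    return power_w / f_out_hz


# 0.5V operating point: 5 MHz reference, 5.17 uW at 20 MHz output.
f_min_05, f_max_05 = output_range(5e6)
e_cycle_05 = energy_per_cycle(5.17e-6, f_max_05)

# 0.2V operating point: reference assumed to be 156.25 kHz, 0.18 uW at 625 kHz.
f_min_02, f_max_02 = output_range(156.25e3)
e_cycle_02 = energy_per_cycle(0.18e-6, f_max_02)
```

With these inputs the computed ranges match the minimum and maximum output frequencies listed in the table, and both operating points come out at a few tenths of a picojoule per output cycle.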

Figure 5.29 to Figure 5.32 show the PVT compensation for the locking range of the clock generator. Before compensation, environmental variations move the reference clock out of the locking range, so the clock generator cannot output the multiplied clock. After PVT compensation, the reference clock lies within the locking range under the various environmental conditions.



Figure 5.29 PVT compensation for locking range of clock generator at 0.2V TT corner (a) before compensation (b) after compensation



Figure 5.30 PVT compensation for locking range of clock generator at 0.2V FF corner (a) before compensation (b) after compensation



Figure 5.31 PVT compensation for locking range of clock generator at 0.5V TT corner (a) before compensation (b) after compensation



Figure 5.32 PVT compensation for locking range of clock generator at 0.5V FF corner (a) before compensation (b) after compensation

# Chapter 6 Conclusions and Future Work

#### **6.1 Conclusions**

Unified logical effort models that consider voltage and temperature variations are proposed. They cover all MOS operating ranges, including the weak-, moderate- and strong-inversion ranges. These models have been established in UMC 90- and 65-nm and PTM 65-, 45- and 32-nm CMOS technologies. They are applied to two test vehicles to estimate path delay. Simulation results show that the average modeling error of the logical effort is no more than 8.40%.

A thermally robust buffered clock tree using logical effort compensation is proposed. By tuning the logical effort of each clock buffer to a fixed value, the buffer delay is kept constant, thereby reducing temperature-induced clock skew. The logical effort is controlled by means of a tunable-width inverter. The proposed thermally robust clock tree has been built in UMC 65-nm CMOS technology; the clock skew is reduced by up to 97.8%, and by 72.2% on average.

A programmable clock generator for sub- and near-threshold DVFS systems is proposed. Its output frequency ranges from 1/8X to 4X of the reference clock. It is able to adjust the locking range of the clock generator, which fluctuates under various PVT conditions. At 0.2V, it consumes only 0.18uW with a 156KHz reference clock. The proposed clock generator has been implemented in UMC 65-nm CMOS technology and is suitable for sub- and near-threshold DVFS systems.

#### **6.2 Future Work**

Wireless medical microsensors usually have two operating modes, *Low-Power Mode* and *Performance Mode*, because the signals that characterize cardiac activity, e.g. heart rate and ECG, vary at a very low rate. Sensor nodes spend more than 99% of their operating time in *Low-Power Mode*, recording physiological signals throughout their lifetime, and less than 1% in *Performance Mode*, processing and transmitting real-time cardiovascular parameters to a host. In this *Low-Power-Mode*-dominated scenario, total energy consumption can be further reduced by applying dynamic voltage frequency scaling (DVFS). The benefit of DVFS comes from the quadratic dependence of the active power $CV_{DD}^2 f$ on the supply voltage.

The proposed programmable clock generator is intended for a dynamic voltage frequency scaling system operating in the sub/near-threshold region. Figure 6.1 shows this sub/near-threshold DVFS system, which is composed of two switched-capacitor (SC) DC-DC converters, decoupling capacitors (DeCaps), the proposed clock generator, level shifters (LS), a DVFS controller, PVT sensors, a supply switch, and a near/sub-threshold 8T SRAM-based FIFO. The clock generator is equipped with two frequency dividers and can therefore output two multiplied clocks: CLK\_r is the read clock and CLK\_w is the write clock of the FIFO.

In *Performance Mode*, heart rate information is transmitted to a host. The supply voltage of the FIFO is switched to VddH for high-performance operation, and CLK\_w and CLK\_r are scaled up to a frequency at which no functional error occurs under the PVT conditions sensed by the PVT sensors. The PVT sensors were completed by S.-W. Chen [4.10]. In *Low-Power Mode*, only the recording of physiological signals is performed, so CLK\_r is switched off and CLK\_w is scaled down to save dynamic power. The supply voltage of the FIFO is switched to the lower level VddL supplied by the DC-DC converter, which further reduces dynamic power. With 99% of the time in Low-Power Mode and 1% in Performance Mode, the sub/near-threshold DVFS system can save a large amount of power.
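The effect of the 99%/1% duty cycle on dynamic power can be estimated with a short calculation. This is a sketch only: the effective capacitance, the choice of 0.2V and 0.5V for VddL and VddH, and the clock frequencies are hypothetical placeholders, and leakage power, which can be significant in the sub-threshold region, is ignored.

```python
def dynamic_power(c_eff: float, vdd: float, freq: float) -> float:
    """Active switching power C * Vdd^2 * f, in watts."""
    return c_eff * vdd ** 2 * freq


def average_power(p_low: float, p_perf: float, low_fraction: float = 0.99) -> float:
    """Time-weighted average power over the two operating modes."""
    return low_fraction * p_low + (1.0 - low_fraction) * p_perf


C_EFF = 1e-12  # hypothetical effective switched capacitance, in farads

# Hypothetical mode settings: VddL = 0.2 V at 625 kHz, VddH = 0.5 V at 20 MHz.
p_low = dynamic_power(C_EFF, vdd=0.2, freq=625e3)
p_perf = dynamic_power(C_EFF, vdd=0.5, freq=20e6)

p_avg = average_power(p_low, p_perf)
saving_ratio = p_perf / p_avg  # relative to staying in Performance Mode
```

Even with only 1% of the time in Performance Mode, that mode contributes most of the average dynamic power in this example, so the saving depends strongly on how short the performance bursts can be kept.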



Figure 6.1 Sub/near-threshold DVFS system

## Bibliography

- [1.1] S. K. Gupta, A. Raychowdhury and K. Roy, "Digital computation in subthreshold region for ultralow-power operation: a device-circuit-architecture codesign perspective," *Proceedings of the IEEE*, Feb. 2010, pp. 160-190.
- [1.2] L. Chang, D. J. Frank, R. K. Montoye, S. J. Koester, B. L. Ji, P. W. Coteus, R. H. Dennard and W. Haensch, "Practical strategies for power-efficient computing technologies," *Proceedings of the IEEE*, Feb. 2010, pp. 215-236.
- [1.3] B. H. Calhoun, J. F. Ryan, S. Khanna, M. Putic, J. Lach, "Flexible circuits and architectures for ultralow power," *Proceedings of the IEEE*, Feb. 2010, pp. 267-282.
- [1.4] A. P. Chandrakasan, D. C. Daly, D. F. Finchelstein, J. Kwong, Y. K. Ramadass, M. E. Sinangil, V. Sze and N. Verma, "Technologies for Ultradynamic voltage scaling," *Proceedings of the IEEE*, Feb. 2010, pp. 191-214.
- [1.5] D. Markovic, C. C. Wang, L. P. Alarcon, T.-T. Liu, J. M. Rabaey, "Ultralow-power design in near-threshold region," *Proceedings of the IEEE*, Feb. 2010, pp. 237-252.

- [2.1] E.G. Friedman, "Clock distribution networks in synchronous digital integrated circuits," *Proceedings of the IEEE*, vol. 89, issue 5, May 2001, pp. 665-692.
- [2.2] D. Wann and M. Franklin, "Asynchronous and clocked control structures for VLSI based interconnection networks," *IEEE Trans. Comput.*, vol. C-32, Mar. 1983, pp. 778-783.
- [2.3] E. G. Friedman and S. Powell, "Design and analysis of a hierarchical clock distribution system for synchronous standard cell/macrocell VLSI," *IEEE J. Solid-State Circuits*, vol. SC-21, Apr. 1986, pp. 240-246.
- [2.4] D. Mijuskovic, "Clock distribution in application specific integrated circuits," *Microelectron. J.*, vol. 8, July/Aug. 1987, pp. 15-27.
- [2.5] H. B. Bakoglu, J. T. Walker, and J. D. Meindl, "A symmetric clock-distribution tree and optimized high-speed interconnections for reduced clock skew in ULSI and WSI circuits," *Proc. IEEE Int. Conf. Computer Design*, Oct. 1986, pp. 118-122.
- [2.6] M. Nekili, Y. Savaria, G. Bois, and M. Bennani, "Logic-based H-trees for large VLSI processor arrays: A novel skew modeling and high-speed clocking method," in *Proc. 5th Int. Conf. Microelectronics*, Dec. 1993, pp. 1-4.
- [2.7] H. B. Bakoglu, Circuits, Interconnections, and Packaging for VLSI. Reading, MA: Addison Wesley, 1990.

- [2.8] A. Chakraborty, K. Duraisami, A. Sathanur, P. Sithambaram, L. Benini, A. Macii, E. Macii and M. Poncino, "Dynamic Thermal Clock Skew Compensation Using Tunable Delay Buffers," *IEEE Trans. on VLSI Systems*, vol. 16, no. 6, June 2008, pp. 639-649.
- [2.9] T. Ragheb, A. Ricketts, M. Mondal, S. Kirolos, G. M. Links, V. Narayanan, and Y. Massoud, "Design of Thermally Robust Clock Trees Using Dynamically Adaptive Clock Buffers," *IEEE Transactions on Circuits and System I*, vol. 56, Feb. 2009, pp. 374–383.
- [2.10] J. Koo, S. Ok, and C. Kim, "A low-power programmable DLL-based clock generator with wide-range antiharmonic lock," *IEEE Trans. on Circuits and Systems II*, vol. 56, no. 1, Jan. 2009, pp. 21-25.
- [2.11] C.-Y. Yang, C.-H. Chang and W.-G. Wong, "A  $\triangle -\Sigma$  PLL-based spread-spectrum clock generator with a ditherless fractional topology," *IEEE Trans. on Circuits and Systems I*, vol. 56, no. 1, Jan. 2009, pp. 51-59.
- [2.12] D. Shin, J. Koo, W.-J. Yun, Y. J. Choi and C. Kim, "A fast-lock synchronous multi-phase clock generator based on a time-to-digital converter," *IEEE International Symposium on Circuits and Systems*, May 2009, pp. 1-4.
- [2.13] W.-M. Lin, C.-C. Chen and S.-I. Liu, "An All-Digital Clock Generator for Dynamic Frequency Scaling," in *Int. Symp. VLSI Design, Automation and Test*, July 2009, pp. 251-254.

- [3.1] B. H. Calhoun, S. Khanna, R. Mann, and J. Wang, "Sub-threshold circuit design with shrinking CMOS devices," *IEEE Int'l Symp. Circuits and Systems*, May 2009, pp. 2541-2544.
- [3.2] B. H. Calhoun, A. Wang, and A. Chandrakasan, "Modeling and sizing for minimum energy operation in subthreshold circuits," *IEEE J. of Solid-State Circuits*, vol. 40, Sep. 2005, pp. 1778-1786.
- [3.3] I. Sutherland, B. Sproull, and D. Harris, Logical Effort: Designing Fast CMOS Circuits. San Francisco, CA: Morgan Kaufmann, 1999.
- [3.4] X. Y. Yu, V. G. Oklobdzija, and W. W. Walker, "Application of logical effort on design of arithmetic blocks," *Conference Record of the Thirty-Fifth Asilomar Conference on Signals, Systems and Computers*, vol.1, Nov. 2001, pp. 872–874.
- [3.5] A. Kabbani, D. Al-Khalili, and A.J. Al-Khalili, "Delay macro modeling of CMOS gates using modified logical effort technique," *IEEE International Conference on Semiconductor Electronics*, Dec. 2004, pp. 56-60.
- [3.6] B. Lasbouygues, S. Engels, R. Wilson, P. Maurine, N. Azemard, and D. Auvergne, "Logical effort model extension to propagation delay representation," *IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems*, vol. 25, no. 9, Sep. 2006, pp. 1677-1684.
- [3.7] C.-H. Wu, S.-H. Lin, H. Chiueh, "Logical Effort Model Extension with Temperature and Voltage Variations," 14th *Int'l Workshop on THERMINIC*, Sep. 2008, pp. 85-88.

- [3.8] K. A. Bowman, B. L. Austin, J. C. Eble, X. Tang, and J. D. Meindl, "A Physical Alpha-Power Law MOSFET Model," *IEEE J. Solid-State Circuits*, vol. 34, no. 10, Oct. 1999, pp. 1410-1414.



- [4.1] S. Borkar, T. Karnik, S. Narendra, J. Tschanz, A. Keshavarzi, and V. De, "Parameter variation and impact on circuits and microarchitecture," in *Proc. Design Autom. Conf.*, 2003, pp. 338–342.
- [4.2] T. Ragheb, A. Ricketts, M. Mondal, S. Kirolos, G. M. Links, V. Narayanan, and Y. Massoud, "Design of Thermally Robust Clock Trees Using Dynamically Adaptive Clock Buffers," *IEEE Transactions on Circuits and System I*, vol. 56, Feb. 2009, pp. 374–383.
- [4.3] W. Zhao and Y. Cao, "New generation of predictive technology model for sub-45 nm design exploration," in *Proc. Int. Symp. Qual. Electron. Des.*, 2006, pp. 585–590. [Online]. Available: http://www.eas.asu.edu/ptm
- [4.4] K. Shakeri and J. Meindl, "Temperature variable supply voltage for power reduction," in *Proc. ISVLSI*, 2002, pp. 71–74.
- [4.5] H. Ajami, K. Banerjee, and M. Pedram, "Modeling and analysis of nonuniform substrate temperature effects on global ULSI interconnects," *IEEE Trans. Comput.-Aided Design Integr. Circuits Syst.*, vol. 24, no. 6, Jun. 2001, pp. 849–861.
- [4.6] M. Cho, S. Ahmed, and D. Z. Pan, "TACO: Temperature aware clocktree optimization," in *Proc. ICCAD*, 2005, pp. 582–587.
- [4.7] Macii, Thermal-Aware Clock Tree Design 2005.

- [4.8] V. Nawale and T. W. Chen, "Optimal useful clock skew scheduling in the presence of variations using robust ILP formulations," presented at the IEEE/ACM Int. Conf. Computer-Aided Design, San Jose, CA, 2006.
- [4.9] S. Lee, S. Das, T. Pham, T. Austin, D. Blaauw and T. Mudge, "Reducing pipeline energy demands with local DVS and dynamic retiming," presented at the Int. Symp. Low Power Electronics and Design, 2004.
- [4.10] Shi-Wen Chen, Ming-Hung Chang, Wei-Chih Hsieh, and Wei Hwang, "Fully on-chip temperature, process, and voltage sensors," *IEEE International Symposium on Circuits and Systems*, May 2010.



- [5.1] J. Y. Yu, C. C. Chung, W. C. Liao, and C. Y. Lee, "A sub-mW Multi-Tone CDMA Baseband Transceiver Chipset for Wireless Body Area Network Applications," *ISSCC Dig. Tech. Papers*, Feb. 2007, pp. 364-365.
- [5.2] A. C. W. Wong, D. M. Donagh, G. Kathiresan, O. C. Omeni, O. El-Jamaly, T. C-K. Chan, P. Paddan, and A. J. Burdett, "A 1V, Micropower System-on-Chip for Vital-Sign Monitoring in Wireless Body Sensor Network," *ISSCC Dig. Tech. Papers*, Feb. 2008, pp. 138-139.
- [5.3] A. Shibayama, K. Nose, S. Torii, M. Mizuno, and M. Edahiro, "Skew-Tolerant global synchronization based on periodically all-in-phase clocking for Multi-Core SOC platforms," *Symp. VLSI Circuits Digest of Technical Papers*, June 2007, pp. 158-159.
- [5.4] J. H. Kim, Y. H. Kwak, M. Y. Kim, S. W. Kim and C. Kim, "A 120MHz-1.8GHz CMOS DLL-Based clock generator for dynamic frequency scaling," *IEEE J. Solid-State Circuits*, vol. 41, Sep. 2006, pp. 2077-2082.
- [5.5] W.-M. Lin, C.-C. Chen and S.-I. Liu, "An All-Digital Clock Generator for Dynamic Frequency Scaling," in *Int. Symp. VLSI Design, Automation and Test*, July 2009, pp. 251-254.
- [5.6] J. Koo, S. Ok, and C. Kim, "A low-power programmable DLL-based clock generator with wide-range antiharmonic lock," *IEEE Trans. on Circuits and Systems II*, vol. 56, no. 1, Jan. 2009, pp. 21-25.

- [5.7] B. Mesgarzadeh and A. Alvandpour, "A low-power digital DLL-based clock generator in open-loop mode," *IEEE J. Solid-State Circuits*, vol. 44, no. 6, July 2009, pp. 1907-1913.
- [5.8] J. Kwong, Y. K. Ramadass, N. Verma and A. P. Chandrakasan, "A 65 nm sub-V<sub>t</sub> microcontroller with integrated SRAM and switched capacitor DC-DC converter," *IEEE J. Solid-State Circuits*, vol. 44, no. 1, Jan. 2009, pp. 115-126.
- [5.9] B. Zhai, S. Hanson, D. Blaauw, and D. Sylvester, "Analysis and mitigation of variability in subthreshold design," in *Proc. Int. Symp. Low-Power Electronics and Design (ISLPED)*, Aug. 2005, pp. 20-25.
- [5.10] M. J. M. Pelgrom, A. C. J. Duinmaijer, and A. P. G. Welbers, "Matching properties of MOS transistors," *IEEE J. Solid-State Circuits*, vol. 24, no. 5, Oct. 1989, pp. 1433-1439.
- [5.11] A. Rossi and G. Fucilli, "Nonredundant successive approximation register for A/D converters," *Electron. Lett.*, vol. 32, no. 12, Jun. 1996, pp. 1055-1057.
- [5.12] T. Matano, Y. Takai, T. Takahashi, Y. Sakito, I. Fujii, Y. Takaishi, H. Fujisawa, S. Kubouchi, S. Narui, K. Arai, M. Morino, M. Nakamura, S. Miyatake, T. Sekiguchi, and K. Koyama, "A 1-Gb/s/pin 512-Mb DDRII SDRAM using a digital DLL and a slew-rate-controlled output buffer," *IEEE J. Solid-State Circuits*, vol. 38, no. 5, May 2003, pp. 762-768.
- [5.13] R.-J. Yang and S.-I. Liu, "A 40-550 MHz Harmonic-Free All-Digital Delay-Locked Loop Using a Variable SAR Algorithm," *IEEE J. Solid-State Circuits*, vol. 42, no. 2, Feb. 2007, pp. 361-373.

- [5.14] R. Farjad-Rad, W. Dally, H. T. Ng, R. Senthinathan, M.-J. E. Lee, R. Rathi, and J. Poulton, "A low-power multiplying DLL for low-jitter multigigahertz clock generator in highly integrated digital chips," *IEEE J. Solid-State Circuits*, vol. 37, Dec. 2002, pp. 1804-1812.



# Vita

### 謝忠穎 Chung-Ying Hsieh

#### PERSONAL INFORMATION

Birth Date: May 27, 1986

Birth Place: Changhua, TAIWAN

E-Mail Address: johnny.ee97g@g2.nctu.edu.tw

#### **EDUCATION**

09/2008 – 07/2010 M.S. in Electronics Engineering, National Chiao Tung University Thesis: PVT-Robust ULV Clock System Design for Sub/Near-Threshold Green Technologies

09/2004 – 06/2008 B.S. in Engineering Science, National Cheng Kung University

#### **PUBLICATIONS**

Chung-Ying Hsieh, Ming-Hung Chang, Shang-Yuan Lin, and Wei Hwang, "Logical Effort Models with Voltage and Temperature Extensions in Super-/Near-/Sub-threshold Regions," *IEEE Asia Pacific Conference Circuits and Systems*, May 2010. (Submitted)

#### **PATENTS**

Chung-Ying Hsieh, Ming-Hung Chang, and Wei Hwang, "A thermally robust buffered clock tree using logical effort compensation" US/TW Patent Pending (submitted)

Chung-Ying Hsieh, Ming-Hung Chang, and Wei Hwang, "A programmable clock generator for sub- and near-threshold DVFS system" US/TW Patent Pending (submitted)