# A Study on Low-Power Circuit Design for Silicon VLSI's and Organic IC's in Ubiquitous Electronics Environment

ユビキタスエレクトロニクス環境におけるシリコン VLSI と有機集積回路の低電力回路設計に関する研究

Hiroshi Kawaguchi

> 川口 博

## Table of Contents

1. Introduction
1.1. References
2. RCSFF (Reduced Clock-Swing Flip-Flop) for 63% Clock Power Reduction
2.1. Introduction
2.2. Circuits
2.3. Reduced-Swing Clock Drivers
2.4. Operation Waveforms
2.5. Performance Comparison
2.5.1. Area
2.5.2. Delay
2.5.3. Power
2.6. Application to Reduced-Swing Bus
2.7. Summary
2.8. References
3. Closed-Form Expressions in Delay and Crosstalk Noise for Capacitively Coupled Distributed RC Lines
3.1. Introduction
3.2. Basic Equations
3.3. Same-Direction Drive
3.3.1. Delay
3.3.1.1. Case that $n=1$ (Two-Line System)
3.3.1.2. Case that $n=2$ (Three-Line System)
3.3.2. Crosstalk-Noise Amplitude
3.4. Opposite-Direction Drive
3.4.1. Delay
3.4.2. Crosstalk-Noise Amplitude
3.4.2.1. Case that $n=1$ (Two-Line System)
3.4.2.2. Case that $n=2$ (Three-Line System)
3.5. Summary
3.6. References
4. Leakage-Current Reduction Schemes for Logic Circuits and SRAM Cells: SCCMOS (Super-Cutoff CMOS) and DLC (Dynamic Leakage Cutoff) SRAM
4.1. Introduction
4.2. SCCMOS (Super-Cutoff CMOS)
4.2.1. Concept
4.2.2. Comparison with Other Schemes
4.2.2.1. MTCMOS (Multithreshold-Voltage CMOS)
4.2.2.2. VTCMOS (Variable-Threshold CMOS)
4.2.2.3. DTMOS (Dynamic-Threshold MOS)
4.2.3. Measurement Results
4.2.3.1. Inverter and 2NAND
4.2.3.2. Flip-Flop Keeping Information in Standby Mode
4.2.3.3. PTL (Pass-Transistor Logic) Gate
4.3. DLC (Dynamic Leakage-Cutoff) SRAM
4.3.1. Circuits
4.3.2. Well-Bias Drivers
4.3.3. Design Considerations
4.3.3.1. Leakage Current
4.3.3.2. Bitline Delay
4.3.3.3. Cell Area
4.3.4. Measurement Results
4.4. Summary
4.5. References
5. $\mathrm{V}_{\mathrm{DD}}$ Hopping with Off-the-Shelf Processors for Multimedia Applications and Its Extension to $\mu$ITRON-LP
5.1. Introduction
5.2. $\mathrm{V}_{\mathrm{DD}}$ Hopping
5.2.1. Concept
5.2.1.1. Application Slicing
5.2.1.2. Second Frequency
5.2.2. Breadboard Design
5.2.2.1. Clock Frequency
5.2.2.2. Power Switch
5.2.2.3. Power
5.2.3. LSI Design
5.3. $\mu$ITRON-LP: Power-Conscious RTOS (Real-Time Operation System) Based on CVS (Cooperative Voltage Scaling)
5.3.1. CVS (Cooperative Voltage Scaling)
5.3.1.1. Model
5.3.1.2. ETCB (Extended Task-Control Block)
5.3.1.3. Real Deadline
5.3.1.4. Example
5.3.2. Hardware Implementation
5.3.3. Power Model
5.3.4. Experimental Results
5.3.4.1. Operation Waveforms
5.3.4.2. Power
5.4. Summary
5.5. Appendix: 0.5-V 400-MHz $\mathrm{V}_{\mathrm{DD}}$-hopping Processor with Zero-$\mathrm{V}_{\mathrm{TH}}$ FD-SOI Technology
5.6. References
6. Active-Matrix and Hierarchical Structure in Organic Large-Area Sensors
6.1. Introduction
6.2. OFET (Organic Field-Effect Transistor)
6.2.1. Manufacturing Process
6.2.2. Via Holes
6.2.3. DC Characteristics
6.3. E-Skin (Electronic Artificial Skin) with Active Matrix
6.3.1. Device Structure
6.3.2. Cut-and-Paste Customization
6.3.3. Boosted-Gate E/E (Enhancement/Enhancement) Configuration
6.3.4. Scalability Limit
6.3.5. Measurement Results
6.4. A Sheet-Type Scanner with Double-Wordline and Double-Bitline Structure
6.4.1. Device Structure and Operation Principle
6.4.2. Circuits
6.4.2.1. Dynamic Decoder
6.4.2.2. Wordline-Delay Optimization
6.4.2.3. Photocurrent-Integration Scheme
6.4.3. 3D Integration
6.4.4. Measurement Results
6.4.4.1. Photocurrent
6.4.4.2. Scanned Image
6.4.4.3. Operation Waveforms and Delays
6.4.4.4. Power
6.4.5. Future Direction
6.5. Cost Comparison with Silicon
6.6. Summary
6.7. References
7. Conclusions
7.1. References
List of Publications and Presentations
Acknowledgment

## List of Figures

Fig. 1.1 Research topics. Systems in ubiquitous electronic environment are implemented by heterogeneous technologies of silicon and organic. The numbers indicate the chapter numbers in this paper.
Fig. 2.1 Power breakdowns in various VLSIs.
Fig. 2.2 Schematic diagrams of (a) the conventional flip-flop and (b) RCSFF. The numbers signify the gate widths of MOSFETs in $\mu\mathrm{m}$. The gate length is $0.5\ \mu\mathrm{m}$ for all the MOSFETs. $W_{\text{Clock}}$ is the gate width of the nMOSFET, N1.
Fig. 2.3 Two types of reduced-swing clock drivers. (a) The types A1 and An are grouped as the type A. (b) In the type B, $V_{\text{Clock}}$ is supplied externally.
Fig. 2.4 Operation waveforms of RCSFF.
Fig. 2.5 Layouts of (a) the conventional flip-flop and (b) RCSFF. $W_{\text{Clock}}$ is assumed to be $10\ \mu\mathrm{m}$, and the others are the same as the values in Fig. 2.2.
Fig. 2.6 The Clock-to-Q delay characteristic of the RCSFF simulated with HSPICE, which depend on
Fig. 2.7 Power characteristic of the RCSFF simulated with HSPICE.
Fig. 2.8 An application of an RCSFF to a long differential bus.
Fig. 2.9 Behavior of a differential bus.
Fig. 2.10 Normalized energy consumed by a distributed RC line when the terminal voltage, $V_{2}$, is reduced.
Fig. 2.11 Delay improvement of a long differential bus. The length and width of the line are assumed to be 10 mm and $0.5\ \mu\mathrm{m}$, respectively. In the RCSFF, $W_{\text{Clock}}$ is $10\ \mu\mathrm{m}$ and the type-A1 driver is used.

Fig. 3.1. Two distributed RC lines capacitively coupled (two-line system). The $x$-coordinate indicates position along lines. $t$ is time.
Fig. 3.2. Three distributed RC lines capacitively coupled (three-line system).
Fig. 3.3. Same-direction drive. Driving points are at the same ends.
Fig. 3.4. Boundary conditions and Elmore delays for distributed RC lines (a) without $C_{j}$ and (b) with $C_{j}$.
Fig. 3.5. Delay comparisons between (3.16) and HSPICE simulations (same-direction drive).
Fig. 3.6. Worst-case %error in delay (same-direction drive).
Fig. 3.7. Crosstalk-noise comparison between (3.29) and HSPICE simulation (same-direction drive).
Fig. 3.8. Worst-case %error in crosstalk-noise amplitude ($n=1$, same-direction drive).
Fig. 3.9. Worst-case %error in crosstalk-noise amplitude ($n=2$, same-direction drive).
Fig. 3.10. Opposite-direction drive. Driving points are at the opposite ends.
Fig. 3.11. Approximate voltage waveform at the receiving point.
Fig. 3.12. Delay comparisons between (3.39) and HSPICE simulations (opposite-direction drive).
Fig. 3.13. Worst-case %error in delay ($n=1$, opposite-direction drive).
Fig. 3.14. Worst-case %error in delay ($n=2$, opposite-direction drive).
Fig. 3.15. Crosstalk noise in HSPICE simulation (opposite-direction drive).
Fig. 3.16. Worst-case %error in crosstalk-noise amplitude ($n=1$, opposite-direction drive).
Fig. 3.17. Worst-case %error in crosstalk-noise amplitude ($n=2$, opposite-direction drive).
Fig. 4.1. A concept of SCCMOS.
Fig. 4.2. Mitigation of a voltage across gate oxide of a cutoff pMOSFET.
Fig. 4.3. A micrograph of a test chip.
Fig. 4.4. Measured delays of inverters and 2NANDs.
Fig. 4.5. Simulated delay dependencies on $W_{\text{switch}}$.
Fig. 4.6. A flip-flop with SCCMOS.
Fig. 4.7. Operation waveforms of a flip-flop with SCCMOS.
Fig. 4.8. Measured delays of flip-flops with SCCMOS.
Fig. 4.9. How to measure a delay of a flip-flop with SCCMOS.
Fig. 4.10. A PTL gate array with SCCMOS.
Fig. 4.11. Measured delays of PTL gates with SCCMOS.
Fig. 4.12. ITRS prediction for memory area in an SoC.
Fig. 4.13. (a) The OSD (offset-source driving) scheme, and (b) BSN (boosted storage-node) scheme.
Fig. 4.14. (a) The DLC (dynamic leakage-cutoff) scheme, and (b) its operation waveforms.
Fig. 4.15. (a) An n-well, and (b) p-well bias drivers. (c) $\mathrm{V}_{\mathrm{GS}}$-$\mathrm{V}_{\mathrm{GD}}$ trajectories in (a). No trajectories go beyond a region of $V_{DD}$.
Fig. 4.16. A total subthreshold leakage current in a 1-Mb SRAM.
Fig. 4.17. Bitline-delay characteristics.
Fig. 4.18. Layout examples of the (a) conventional, and (b) DLC SRAM cell.
Fig. 4.19. An area overhead in the DLC SRAM when the number of bits selected at a time is changed.
Fig. 4.20. A chip micrograph of the DLC SRAM. "SCs" signify SRAM cells.
Fig. 4.21. Well-disturbance test. "P" indicates a pass.
Fig. 5.1. A conceptual diagram of $\mathrm{V}_{\mathrm{DD}}$ hopping.
Fig. 5.2. Three approaches to save power. In (a) and (b), only task periods are controlled while in (c), both $f$ and $V_{DD}$ are controlled. It is assumed that no power is consumed in a sleep mode for simplicity.
Fig. 5.3. $NP$ dependences on $NW$. (a) "NOP" loop while waiting. $b$ is assumed to be 0.7. (b) Sleep while waiting. (c) DVS.
Fig. 5.4. An example of workload histogram in an MPEG4 encoder. An H.263 standard sequence "carphone" is used as input data. The total number of video frames is 72. The "carphone" sequence is also used in the experiment described in this chapter.
Fig. 5.5. Application slicing. At the head of each slice, a code fragment is inserted to determine a speed of a processor.
Fig. 5.6. Temporal behaviors in $\mathrm{V}_{\mathrm{DD}}$ hopping for one video frame. (a) Power, (b) $f$, and (c) $V_{DD}$.
Fig. 5.7. Power comparison when $f_{\max}/j$ ($j>2$, point A) and $f_{\max}/2$ (point B) are used as a second frequency.
Fig. 5.8. Numerical solution of slope of $NP(NW)$ at $NW=1/2$.
Fig. 5.9. (a) MPEG4 encoder system with $V_{DD}$ hopping. (b) SH-4 embedded system board. (c) $\mathrm{V}_{\mathrm{DD}}$-hopping board inserted in a VME slot. (d) Backside of (c).
Fig. 5.10. Block diagram of $\mathrm{V}_{\mathrm{DD}}$-hopping system.
Fig. 5.11. $\mathrm{V}_{\mathrm{DD}}$ waveforms when there is a period while both $V_{G\max}$ and $V_{G\min}$ are asserted. (a) Falling $V_{DD}$
Fig. 5.12. $\mathrm{V}_{\mathrm{DD}}$ waveforms when there is a period while both $V_{G\max}$ and $V_{G\min}$ are negated. (a) Falling $V_{DD}$
Fig. 5.13. (a) Measured power characteristics of $\mathrm{V}_{\mathrm{DD}}$-hopping system. (b) Power dependence on workload based on (a), (A) "NOP" loop while waiting, (B) sleep while waiting, and (C) two-level $\mathrm{V}_{\mathrm{DD}}$ hopping. (c) Power reduction ratio of $\mathrm{V}_{\mathrm{DD}}$-hopping system.
Fig. 5.14. Voltage-drop dependence on gate width of power switch. This shows the worst case because of the minimum gate bias ($V_{GS} = -V_{DD\min} = -1.2\ \mathrm{V}$).
Fig. 5.15. (a) Power switches with timers. (b) All-purpose decoder for power switches. (c) Clock frequency selector.
Fig. 5.16. Measured waveforms of $V_{DD}$ and sleep signal of processor.
Fig. 5.17. Power comparison between $\mathrm{V}_{\mathrm{DD}}$-hopping and fixed-$\mathrm{V}_{\mathrm{DD}}$ schemes.
Fig. 5.18. $\mathrm{V}_{\mathrm{DD}}$ hopping controller.
Fig. 5.19. Structural model of CVS. A task gets timing information and sends speed information to external $f$-$\mathrm{V}_{\mathrm{DD}}$ control hardware via processor. By using this speed information, a combination of $f$ and $V_{DD}$ is supplied to the processor.
Fig. 5.20. Pseudo code of ETCB structure.
Fig. 5.21. Task-state transition in $\mu$ITRON-LP. The READY queue and $T_{n}$ queues are renewed when a task is initiated or exits.
Fig. 5.22. How to determine $D_{v}$. Cases that (a) there are two or more tasks in the READY queue, and (b) a RUN task is the only one in the READY queue.
Fig. 5.23. Method to obtain WCET of a RUN task.
Fig. 5.24. Scheduling example of Tasks A, B, and C. A horizontal axis indicates a time scale, and a height of slices shows magnitude of $f$. Cases of (a) original $\mu$ITRON, and (b) $\mu$ITRON-LP.
Fig. 5.25. (a) Snapshot of CVS experimental system. An output image of an MPEG4 encoder is displayed on a monitor. (b) $\mathrm{V}_{\mathrm{DD}}$ supply board on an SH-4 embedded system board.
Fig. 5.26. Block diagram of CVS experimental system.
Fig. 5.27. Ideal CVS behavior and power characteristics. The left graph shows temporal ratio when $T_{tr}$ is zero and the number of slices, $N$, is infinite. In the ideal case, at 0% workload, 100% sleep. At 50% workload, 100% $f_{\max}/2$ operation. At 100% workload, 100% $f_{\max}$ operation.
Fig. 5.28. Measured waveforms of $V_{DD}$ and a sleep signal. KB indicates a KEYBOARD routine. When the sleep signal is high, a processor is in a sleep mode.
Fig. 5.29. Explanation of $\mathrm{V}_{\mathrm{DD}}$ waveform in Fig. 5.28. A height of slices indicates magnitude of $f$. Contrast with the $\mathrm{V}_{\mathrm{DD}}$ waveform in Fig. 5.28.
Fig. 5.30. Power comparison. Lines A, B, and C in the right graph are the same ones in Fig. 5.27.
Fig. 5.31. Block diagram of $\mathrm{V}_{\mathrm{DD}}$-hopping processor.
Fig. 5.32. 16-b Kogge-Stone adder.
Fig. 5.33. Block diagram of SRAM.
Fig. 5.34. (a) Replica-biasing and (b) conventional level-up converters.
Fig. 5.35. (a) Measurement setup. (b) Monitored output of VCO.
Fig. 5.36. Micrograph of processor chip.
Fig. 5.37. Types of gates in compact cell library.
Fig. 5.38. (a) Breakdown of access time and (b) performance of SRAM.
Fig. 5.39. Measured operating frequency.
Fig. 5.40. Measured leakage current.
Fig. 5.41. Measured power.
Fig. 5.42. Power scaling.
Fig. 6.1 Passive matrix.
Fig. 6.2 Manufacturing process of OFET.
Fig. 6.3 Micrograph of laser via.
Fig. 6.4 $\mathrm{V}_{\mathrm{DS}}$-$\mathrm{I}_{\mathrm{DS}}$ characteristics of fabricated p-type OFET.
Fig. 6.5 (a) Cross section and (b) circuit diagram of sensor cell.
Fig. 6.6 Photograph of e-skin system.
Fig. 6.7 Circuit diagram of e-skin system.
Fig. 6.8 Scalability of row decoders.
Fig. 6.9 Scalability of column selectors.
Fig. 6.10 $4 \times 4$ version of e-skin.
Fig. 6.11 Static decoder circuits with (a) off-state load, (b) diode load, and (c) boosted-gate load. Their simulation waveforms are also shown.
Fig. 6.12 Input-output transfer characteristics of static decoders with (a) off-state load, (b) diode load, and (c) boosted-gate load. The dotted lines are $x$-$y$ symmetries of the input-output transfer characteristics (solid lines).
Fig. 6.13 (a) Smallest on-current case, and (b) largest off-current case.
Fig. 6.14 Current dependence on pressure.
Fig. 6.15 Bitline voltage when pressed.
Fig. 6.16 Simulated access-time dependence on sencel size.
Fig. 6.17 Measured and simulated operation waveforms.
Fig. 6.18 Access time dependence on $V_{DD}$.
Fig. 6.19 Device structure. (a) Top view, and (b) cross section.
Fig. 6.20 Double-wordline and double-bitline structure.
Fig. 6.21 (a) Conventional static decoder, and (b) proposed dynamic decoder.
Fig. 6.22 Cut-and-paste customization. (a) Cut, and (b) paste.
Fig. 6.23 (a) Single-wordline scheme. (b) Double-wordline structure, and (c) its simulated delay.
Fig. 6.24 (a) Photocharge-transfer scheme, and (b) photocurrent-integration scheme.
Fig. 6.25 Simulated rise times of 2BL voltage, $t_{\text{BLACK}}$ and $t_{\text{WHITE}}$, and ratio of them.
Fig. 6.26 (a) Memory, and (b) sensor design.
Fig. 6.27 Layouts of blocks on (a) OFET sheet #1, and (b) OFET sheet #2.
Fig. 6.28 Photograph of sheet-type scanner.
Fig. 6.29 Cross-sectional photograph of sheet-type scanner.
Fig. 6.30 Measured I-V characteristics on 1BL.
Fig. 6.31 Measured histograms of (a) photocurrent, and (b) rise time of 2BL voltage.
Fig. 6.32 (a) Original image, and (b) scanned image.
Fig. 6.33 Measured operational waveforms.
Fig. 6.34 Future trends of (a) scan-out time and (b) power.
Fig. 6.35 Costs of technologies for large-area sensors. Area is assumed to be $100 \times 100\ \mathrm{mm}^{2}$, and silicon costs \$1k per that area while organic costs \$10.
Fig. 7.1 Power saved by the techniques described in this paper.
Fig. 7.2 Improved RCSFF.
Fig. 7.3. ZSCCMOS (zigzag SCCMOS).
Fig. 7.4. Row-by-row variable $V_{DD}$ SRAM.

## List of Tables

TABLE 2.1. Performance comparison between the conventional flip-flop and RCSFF.
TABLE 3.1. Expressions and %errors at a glance.
TABLE 5.1. Characteristics of applications in CVS.

## 1. Introduction

In the upcoming ubiquitous electronics environment, numerous electronic systems will be deployed in sensors, cars, robots, homes, towns, and even farms, and will be connected through networks. Ubiquitous electronics will support a comfortable and safe life, and must be low power because they are expected to be powered by small batteries or by energy harvesting [1.1].

Just as modern life is supported by high-performance silicon electronics, silicon VLSIs such as microprocessors will remain the mainstream in the ubiquitous electronics environment. Cost reduction by downsizing and power saving for these numerous electronic systems will therefore remain key issues in the future. However, ubiquitous electronics cannot be achieved by silicon alone. Other technologies such as organic electronics complement silicon systems and enable new systems through the fusion of heterogeneous technologies.

Although organic circuits operate slowly, their cost per area is lower than that of silicon circuits. This enables sparse systems such as area sensors at low cost, and thus organic circuits are suitable for large-area sensing. In the future, silicon SoCs (systems on a chip) will continue to handle high-performance information processing, while organic electronics will cover sparse systems such as large-area sensors.

This paper describes low-power circuit design for both the silicon and organic technologies that will support the ubiquitous electronics environment. Fig. 1.1 briefly illustrates our research topics for low-power electronics. In Chapter 2, a flip-flop that can accept a low-swing clock is introduced. This flip-flop reduces clock power by about $2/3$ in a silicon digital system. Chapter 3 analyzes the delay and noise caused by capacitive coupling in scaled-down devices where there are two voltage domains. The coupling issue has emerged and must be solved to achieve signal integrity and low-power circuits, in particular those using supply-voltage domains. In Chapter 4, leakage reduction techniques in a silicon SoC are described. The super-cutoff scheme decreases standby leakage in silicon logic circuits to less than 1 pA per gate, and a dynamic leakage-cutoff SRAM lowers active leakage by exploiting the body effect. In Chapter 5, software approaches that save the power a microprocessor consumes in multimedia applications are introduced. The $\mathrm{V}_{\mathrm{DD}}$-hopping scheme and a low-power RTOS (real-time operating system) adaptively change $f$ (frequency) and $V_{DD}$ (supply voltage) of an off-the-shelf microprocessor depending on the workload of the multimedia application. Chapter 6 describes low-power circuit designs in organic large-area sensors. An active-matrix implementation and a hierarchical structure adopting double wordlines and double bitlines in the organic large-area sensors are discussed. The cost comparison between silicon and organic electronics is mentioned as well. Chapter 7 concludes this paper.

### 1.1. References

[1.1] A. Kansal and M. B. Srivastava, "An Environmental Energy Harvesting Framework for Sensor Networks," ACM/IEEE Int. Symp. Low Power Elec. and Design, pp. 481-486, Aug. 2003.

[Figure: research-topic diagram. Labels: 2. Low-swing clock flip-flop; 3. Coupling analysis; 4. Super-cutoff CMOS; 5. $\mathrm{V}_{\mathrm{DD}}$ hopping; 5. Low-power RTOS ($f$, $V_{DD}$); 6. Hierarchical structure]

Fig. 1.1 Research topics. Systems in ubiquitous electronic environment are implemented by heterogeneous technologies of silicon and organic. The numbers indicate the chapter numbers in this paper.

## 2. RCSFF (Reduced Clock-Swing Flip-Flop) for 63% Clock Power Reduction

### 2.1. Introduction

The four pie charts in Fig. 2.1 show power breakdowns in various VLSIs. MPU1 is a low-end microprocessor for embedded use, and MPU2 is a high-end microprocessor with a large on-chip cache memory. ASSP1 is an MPEG2 decoder, and ASSP2 is for an ATM switch. The power breakdowns differ from product to product. Interestingly, however, the clock and logic parts consume almost the same power in these VLSIs. The clock part consumes $20\%$ to $45\%$ of the total power, of which $90\%$ is consumed by the flip-flops themselves and the last branches of the clock distribution network that directly drive the flip-flops [2.1].

One reason for the large power consumed by the clock part is that the transition probability of the clock is $100\%$, while that of the ordinary logic part is about $1/3$ on average. Reducing the clock power is therefore important for low-power design, and an effective way to do so is to reduce the clock swing. This is because the clock power is proportional to either the clock swing or the square of the clock swing, depending on the circuit configuration, as described later.
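The effect of the transition probability can be illustrated with the usual dynamic-power relation $P = \alpha C V_{DD}^{2} f$. The following is a minimal sketch, not a result from this chapter: the function name `switching_power` and the 1-pF load are hypothetical, and the clock and logic nets are assumed to have the same switched capacitance purely for illustration.

```python
def switching_power(alpha, c_load, v_dd, f_clk):
    """Average dynamic power alpha * C * V_DD^2 * f of a full-swing node."""
    return alpha * c_load * v_dd ** 2 * f_clk


C_LOAD = 1e-12  # hypothetical 1 pF, assumed equal for both nets
V_DD = 3.3      # supply voltage in volts
F_CLK = 100e6   # clock frequency in hertz

# Transition probability: 100% for the clock, about 1/3 for ordinary logic.
p_clock = switching_power(1.0, C_LOAD, V_DD, F_CLK)
p_logic = switching_power(1.0 / 3.0, C_LOAD, V_DD, F_CLK)

print(f"clock/logic power ratio: {p_clock / p_logic:.1f}")
```

Under these assumptions the clock net dissipates three times the power of a logic net with the same capacitance, which is why the clock is an attractive target for swing reduction.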

One way to reduce the clock swing was pursued in the half-swing clocking scheme [2.2]; however, it requires four clock lines, which increases the interconnection capacitance of the clock distribution. Moreover, routing four clock lines is disadvantageous in area, and skew adjustment is difficult. A dedicated clock driver outputs the half-swing clocks, but the swing is limited to $V_{DD}/2$ and cannot be set to an arbitrary value. The power of the clock driver is proportional not to the square of the clock swing but to the clock swing itself, which is a drawback in terms of power saving.

This chapter describes a novel flip-flop that uses a reduced clock swing and requires only one clock line, which we call the RCSFF (reduced clock-swing flip-flop) for short. The RCSFF also decreases the clock capacitance by reducing the number of MOSFETs connected to the clock distribution network.

### 2.2. Circuits

We propose the RCSFF, which can operate with a lower clock swing. Fig. 2.2 shows the schematic diagrams of the conventional flip-flop and the proposed RCSFF. In the conventional flip-flop, the clock swing cannot be reduced because the pMOSFETs in the clocked inverters do not completely turn off and a leakage current flows through them. The internal clock, $\phi$, is generated from $/\phi$, and the overhead becomes significant even if the two clock lines are distributed.

On the other hand, the RCSFF is composed of a current-latch sense amplifier (master) and a reset-set latch (slave) [2.3]. The current-latch sense amplifier is a true single-phase latch. The salient feature of the RCSFF is that it can accept a reduced clock swing owing to this single-phase nature. The clock swing, $V_{\text{Clock}}$, can be as low as 1 V.

The transistor count of the conventional flip-flop is 24, while that of the RCSFF is 20, including an inverter for generating the signal $/D$. The number of MOSFETs connected to the clock signal in the RCSFF is only three, compared with twelve in the conventional flip-flop. Since only three MOSFETs, P1, P2, and N1, are clocked, the capacitance of the clock distribution network is smaller as well, which in turn decreases the clock power.

Even though the clock swing is reduced in the RCSFF, one issue remains: the precharge pMOSFETs, P1 and P2, do not completely turn off when the clock is "H" at the clock voltage of $V_{\text{Clock}}$. This draws a leakage current through either P1 or P2. The RCSFF, however, has a leakage-cutoff mechanism. By applying a backgate bias, $V_{\text{well}}$, to P1 and P2, their $V_{th}$ (threshold voltage) can be increased, and thus the leakage current is reduced. Although it will be shown later that power can be saved even without the backgate bias, further power improvement is possible by applying it. Another way to increase $V_{th}$ of P1 and P2 is ion implantation, which needs process modification and is usually prohibitive. Therefore, ion implantation is not considered here, although it is a technically promising approach for the future. When the clock stops in a standby mode, it should be held at ground, because then no leakage current flows even without the backgate bias.

### 2.3. Reduced-Swing Clock Drivers

The RCSFF is used with a reduced-swing clock driver. There are two types of clock drivers, type A and type B, as shown in Fig. 2.3.

With the type-A driver, the clock swing is $V_{\text{Clock}} = V_{DD} - nV_{th}$, where $n$ is the number of inserted nMOSFETs. The power consumption associated with this type of clock driver is proportional only to $V_{\text{Clock}}$, not to its square. The type-A driver requires neither a DC-DC converter nor an external voltage supply, so it is easy to implement.

In the type-B driver, on the other hand, $V_{\text{Clock}}$ needs to be supplied from either an on-chip DC-DC converter or an external voltage supply. The power is proportional to the square of $V_{\text{Clock}}$, and thus the type-B driver is more efficient than the type-A driver, but it is more difficult to implement since it requires an additional voltage-supply line of $V_{\text{Clock}}$ to each clock driver.
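The two scaling laws can be compared numerically. The sketch below is illustrative only: the function names are hypothetical, and the effective threshold drop of 1.1 V per inserted nMOSFET is an assumption chosen so that a type-A1 driver at $V_{DD} = 3.3$ V yields a swing of about 2.2 V. It models only the power of charging the clock net, normalized to a full-swing clock, and ignores the flip-flop internals and leakage.

```python
def type_a_swing(v_dd, n, v_th_eff):
    """Clock swing of a type-A driver: V_Clock = V_DD - n * V_th."""
    return v_dd - n * v_th_eff


def relative_clock_power(v_clock, v_dd, driver_type):
    """Clock-net power normalized to a full-swing clock.

    Type A: the charge C * V_Clock is drawn from V_DD, so power is
            proportional to V_Clock.
    Type B: the charge is drawn from a V_Clock supply, so power is
            proportional to V_Clock squared.
    """
    ratio = v_clock / v_dd
    if driver_type == "A":
        return ratio
    if driver_type == "B":
        return ratio ** 2
    raise ValueError("driver_type must be 'A' or 'B'")


V_DD = 3.3
V_TH_EFF = 1.1  # assumed effective threshold drop, including body effect

v_clock = type_a_swing(V_DD, 1, V_TH_EFF)
print(f"type-A1 swing: {v_clock:.2f} V")
print(f"type A relative power: {relative_clock_power(v_clock, V_DD, 'A'):.3f}")
print(f"type B relative power: {relative_clock_power(v_clock, V_DD, 'B'):.3f}")
```

At the same swing, the quadratic dependence of the type-B driver gives the larger saving, which matches the qualitative trade-off between efficiency and ease of implementation described above.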

### 2.4. Operation Waveforms

Fig. 2.4 shows typical behavior of the RCSFF simulated by HSPICE when $V_{DD}$ is 3.3 V with the type-A1 driver. The left half of the figure shows a data acquisition phase, and the right half shows a precharge phase. It can be seen that the clock rises only to 2.2 V. Now, the data input, $D$, is assumed to be "H" when the clock is asserted. The black-line path in the left figure turns on, and the node $/P$ goes down to "L" while $P$ remains "H". $P$ and $/P$ drive the low-active reset-set latch, and the output, $Q$, becomes "H". In the precharge phase, the nodes $P$ and $/P$ are precharged back to "H" through P1 and P2. $Q$ and $/Q$ keep the previous state because both $P$ and $/P$ are "H". The original $V_{th}$ of P1 and P2 is 0.6 V, but with $V_{\text{well}}$ of 6 V it becomes 1.4 V, which is high enough to cut off the leakage current when the clock swing is 2.2 V.
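The cutoff condition quoted above can be checked with simple arithmetic. This is a first-order sketch under the assumption that the sources of P1 and P2 sit at $V_{DD}$ and their gates at $V_{\text{Clock}}$ while the clock is high; the function name is hypothetical, and subthreshold conduction is ignored.

```python
def precharge_pmos_overdrive(v_dd, v_clock, v_th_abs):
    """Gate overdrive |V_GS| - |V_th| of P1/P2 while the clock is high.

    Assumes source at V_DD and gate at V_Clock, so |V_GS| = V_DD - V_Clock.
    A positive result means the pMOSFET still conducts.
    """
    return (v_dd - v_clock) - v_th_abs


for v_th in (0.6, 1.4):  # without and with the 6-V backgate bias
    overdrive = precharge_pmos_overdrive(3.3, 2.2, v_th)
    state = "conducts" if overdrive > 0 else "cut off"
    print(f"|V_th| = {v_th} V: overdrive = {overdrive:+.1f} V ({state})")
```

With $|V_{th}| = 0.6$ V the precharge devices remain on by about 0.5 V, whereas raising $|V_{th}|$ to 1.4 V leaves them about 0.3 V below threshold.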

The RCSFF behaves as an edge-triggered flip-flop: when the clock goes to "H", $P$ and $/P$ are determined by $D$, and once $D$ is latched, subsequent changes of $D$ do not affect $P$ and $/P$ thanks to the cross-coupled inverters.
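The edge-triggered behavior can be mimicked with a small cycle-level model. This is a behavioral sketch only, not a circuit simulation: the class name `RCSFFModel` and the initial state $Q$ = "L" are assumptions, and all timing, swing, and leakage effects are omitted.

```python
class RCSFFModel:
    """Behavioral model: sense-amplifier master plus low-active RS latch slave."""

    def __init__(self):
        self.clk = 0    # previous clock level
        self.p = 1      # node P, precharged to "H"
        self.p_bar = 1  # node /P, precharged to "H"
        self.q = 0      # assumed initial output state

    def step(self, clk, d):
        if clk == 0:
            # Precharge phase: both nodes return to "H"; the slave holds Q.
            self.p, self.p_bar = 1, 1
        elif self.clk == 0:
            # Rising edge: D = "H" pulls /P low, D = "L" pulls P low.
            self.p, self.p_bar = (1, 0) if d else (0, 1)
        # While the clock stays high, P and /P keep their latched values,
        # so later changes of D are ignored.
        self.clk = clk
        # Low-active RS latch: /P low sets Q, P low resets Q, both high holds.
        if self.p_bar == 0:
            self.q = 1
        elif self.p == 0:
            self.q = 0
        return self.q


ff = RCSFFModel()
stimulus = [(0, 1), (1, 1), (1, 0), (0, 0), (1, 0), (0, 1)]  # (clock, D) pairs
print([ff.step(clk, d) for clk, d in stimulus])
```

In this stimulus, $D$ falls while the clock is still high in the third step, yet $Q$ stays "H" until the next rising edge samples $D$ = "L".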

Let us consider the sizes of the MOSFETs used in the RCSFF. The numbers in Fig. 2.2 signify the gate widths in $\mu\mathrm{m}$. Since $P$ and $/P$ can be slowly precharged while the clock is "L", the gate widths of P1 and P2 can be minimized to $0.5\ \mu\mathrm{m}$ in this technology. The gate width of N1 should be larger to achieve faster Clock-to-Q operation. There is a tradeoff between speed and power in choosing the optimum gate width for N1.

### 2.5. Performance Comparison

### 2.5.1. Area

Fig. 2.5 (a) is a layout example of the conventional flip-flop, and Fig. 2.5 (b) shows that of the RCSFF. The well for P1 and P2 is separated from the normal well in order to apply the backgate bias. Nevertheless, the area is about $20\%$ smaller than that of the conventional flip-flop. In reality, however, an extra backgate bias line is needed in the RCSFF, and this $20\%$ advantage is canceled out by the backgate bias-line overhead. If $V_{th}$ of P1 and P2 were adjusted by ion implantation, the full area reduction of $20\%$ could be obtained.

### 2.5.2. Delay

An HSPICE simulation is carried out assuming typical parameters of a generic $0.5$-$\mu\mathrm{m}$ double-metal CMOS process. The rise time of the clock is assumed to be 0.2 ns in this simulation, but even if the rise time is changed from 0.2 ns to 0.6 ns, the change in Clock-to-Q delay is less than 0.04 ns. Fig. 2.6 shows the Clock-to-Q delay characteristics of the RCSFF, where the gate width of N1, $W_{\text{Clock}}$, is varied as a parameter. Since the delay improvement saturates for $W_{\text{Clock}} \geq 10\ \mu\mathrm{m}$, this value is used in the area and power estimation. When a type-A1 driver with $V_{\text{Clock}}$ of 2.2 V and $W_{\text{Clock}}$ of $10\ \mu\mathrm{m}$ are used, the RCSFF improves the delay by about $20\%$ compared with the conventional flip-flop.

The setup and hold time in the RCSFF are 0.04 and 0 ns , respectively, regardless of magnitude of $V_{\text {Clock }}$ while those in the conventional flip-flop are 0.1 and 0 ns .

### 2.5.3. Power

Fig. 2.7 shows power characteristics of the RCSFF. The interconnection length of the clock is assumed to be $200 \mu \mathrm{~m}$ from a clock driver to an RCSFF, and transition probability of data is assumed to be $30 \%$. The clock frequency, $f_{\text {Clock }}$, is assumed to be 100 MHz . These are typical values for low-power processors.

Power consumption per flip-flop is the sum of the power of a clock driver, the flip-flop itself, and the interconnection between them. The power becomes smaller as $V_{\text {Clock }}$ is decreased. As seen from the figure, power reduction with the type-A drivers is less efficient than with the type-B drivers. In this simulation, $V_{\text {well }}$ is set to either 3.3 V or 6 V. Without the backgate bias to P1 and P2, that is, when $V_{\text {well }}=3.3 \mathrm{~V}$, the power improvement saturates around $V_{\text {Clock }}$ of 1.5 V because the leakage current through P1 or P2 increases as $V_{\text {Clock }}$ is lowered. On the other hand, when $V_{\text {well }}=6 \mathrm{~V}$, the power improvement does not saturate even at $V_{\text {Clock }}$ of 1 V. In the best case, the total power is reduced by $63 \%$ compared with the conventional flip-flop. The figure also shows the power consumed by the RCSFF itself. A slight increase of $4 \%$ in the power of the RCSFF is observed due to the leakage current through P1 or P2 in the low-$V_{\text {Clock }}$ region.

TABLE 2.1 summarizes the performance comparison between the conventional flip-flop and the RCSFF. When the type-A1 driver, which is easy to implement, is used, the power is reduced to $59 \%$ and the Clock-to-Q delay is reduced to $82 \%$. If a DC-DC converter and a type-B driver are used, the power consumption can be reduced to $37 \%$, although the delay increases by $23 \%$. Considering the improvement level and the delay increase, the cases of the type-A1 driver and the type-B driver with $V_{\text {Clock }}$ of 2.2 V are practical choices.

### 2.6. Application to Reduced-Swing Bus

As shown in Fig. 2.8, an application of the RCSFF to a long differential bus is considered [2.4]. Since the RCSFF is a differential amplifier in nature, it can be used to amplify a small voltage signal on the differential bus and at the same time, it can latch the data.

The behavior of the differential bus is shown in Fig. 2.9. The differential bus is first precharged to $V_{D D}$, and then, when the voltage difference between $D$ and $/ D$ reaches $\Delta V_{D}$, the clock is asserted and the RCSFF amplifier is activated. Since $\Delta V_{D}$ can be less than 1 V, delay reduction of the long differential bus can be achieved. Furthermore, power reduction in the logic part can be realized as well because $D$ and $/ D$ do not need to make a full swing. Let us consider how much energy is saved when a distributed RC line is driven in a full swing at the driving end and switched off when the other terminal reaches $V_{2}$.

Fig. 2.10 shows the normalized energy, $E / C V_{D D}{ }^{2}$, consumed by the distributed RC line, which is expressed as $0.64 V_{2} / V_{D D}+0.36$. This means that $50 \%$ power saving is possible if $V_{2}=0.2 V_{D D}$.
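As a quick numerical check of the linear expression above, the following minimal sketch evaluates the normalized energy and the resulting saving relative to a full swing. The function names are hypothetical and chosen here for illustration; only the coefficients 0.64 and 0.36 come from the text.

```python
def normalized_energy(v2_over_vdd: float) -> float:
    """E / (C * VDD^2) of the distributed RC line, using the linear
    expression 0.64 * V2/VDD + 0.36 quoted for Fig. 2.10."""
    if not 0.0 <= v2_over_vdd <= 1.0:
        raise ValueError("V2/VDD must lie in [0, 1]")
    return 0.64 * v2_over_vdd + 0.36


def energy_saving(v2_over_vdd: float) -> float:
    """Fractional saving relative to the full-swing case (V2 = VDD)."""
    return 1.0 - normalized_energy(v2_over_vdd) / normalized_energy(1.0)
```

For $V_{2}=0.2 V_{D D}$ the sketch gives a normalized energy of 0.488, that is, a saving of about $51 \%$, in line with the roughly $50 \%$ quoted above.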

Fig. 2.11 shows the delay improvement of the long differential bus with an RCSFF. The delay depends on $\Delta V_{D}$, and faster operation is possible as $\Delta V_{D}$ is decreased. Compared with the conventional flip-flop, acceleration by a factor of more than two is possible in the low-$\Delta V_{D}$ range.

### 2.7. Summary

The RCSFF that is compatible with generic CMOS processes was proposed to save up to $63 \%$ of the clock power. With the RCSFF, area can be reduced to $80 \%$, delay can be decreased to $80 \%$, and the power is reduced to $1 / 3$ of the conventional flip-flop. Leakage current through the precharge pMOSFETs can be eliminated by the backgate bias. As an application of the RCSFF, a long differential bus was considered. The delay and power consumed by the RC interconnect can be reduced to less than a half compared with the case of the conventional flip-flops.

### 2.8. References

[2.1] T. Sakurai and T. Kuroda, "Low-Power Circuit Design for Multimedia CMOS VLSI's," Proc. Synthesis and System Integration of Mixed Technologies (SASIMI), pp. 3-10, Nov. 1996.

[2.2] H. Kojima, S. Tanaka and K. Sasaki, "Half-Swing Clocking Scheme for 75\% Power Saving in Clocking Circuitry," IEEE/JSAP Symp. VLSI Circ. Dig. Tech. Papers, pp. 23-24, June 1994.

[2.3] J. Montanaro, R. T. Witek, K. Anne, A. J. Black, E. M. Cooper, D. W. Dobberpuhl, P. M. Donahue, J. Eno, G. W. Hoeppner, D. Kruckemyer, T. H. Lee, P. C. M. Lin, L. Madden, D. Murray, M. H. Pearce, S. Santhanam, K. J. Snyder, R. Stephany, and S. C. Thierauf, "A 160-MHz, 32-b, 0.5-W CMOS RISC Microprocessor," IEEE J. Solid-State Circ., vol. 31, no. 11, pp. 1703-1714, Nov. 1996.

[2.4] M. Matsui, H. Hara, Y. Uetani, L. Kim, T. Nagamatsu, Y. Watanabe, A. Chiba, K. Matsuda and T. Sakurai, "A 200MHz 13mm² 2-D DCT Macrocell Using Sense-Amplifying Pipeline Flip-Flop Scheme," IEEE J. Solid-State Circ., vol. 29, no. 12, pp. 1482-1490, Dec. 1994.


Fig. 2.1 Power breakdowns in various VLSIs.


Fig. 2.2 Schematic diagrams of (a) the conventional flip-flop and (b) RCSFF. The numbers signify the gate widths of the MOSFETs in $\mu \mathrm{m}$. The gate length is $0.5 \mu \mathrm{~m}$ for all the MOSFETs. $W_{\text {Clock }}$ is the gate width of the nMOSFET, N1.


Fig. 2.3 Two types of reduced-swing clock drivers. (a) The types A1 and An are grouped as the type A. (b) In the type $\mathrm{B}, V_{\text {Clock }}$ is supplied externally.


Fig. 2.4 Operation waveforms of RCSFF.

Fig. 2.5 Layouts of (a) the conventional flip-flop and (b) RCSFF. In (b), the well for the precharge pMOSFETs, P1 and P2, is separated. $W_{\text {Clock }}$ is assumed to be $10 \mu \mathrm{~m}$, and the others are the same as the values in Fig. 2.2.

Fig. 2.6 The Clock-to-Q delay characteristics of the RCSFF simulated with HSPICE, which depend on $V_{\text {Clock }}$ but are not affected by $V_{\text {well }}$.


Fig. 2.7 Power characteristic of the RCSFF simulated with HSPICE.


Fig. 2.8 An application of an RCSFF to a long differential bus.


Fig. 2.9 Behavior of a differential bus.


Fig. 2.10 Normalized energy consumed by a distributed RC line when the terminal voltage, $V_{2}$, is reduced.


Fig. 2.11 Delay improvement of a long differential bus. The length and width of the line are assumed to be 10 mm and $0.5 \mu \mathrm{~m}$, respectively. In the RCSFF, $W_{\text {Clock }}$ is $10 \mu \mathrm{~m}$ and the type-A1 driver is used.

TABLE 2.1. Performance comparison between the conventional flip-flop and RCSFF.

|  | Driver | $V_{\text {Clock }}$ [V] | Power | Delay | Area |
| :---: | :---: | :---: | :---: | :---: | :---: |
| Conventional |  | 3.3 | 100\% | 100\% | 100\% |
| RCSFF | Type A1 | 2.2 | 59\% | 82\% | 83\% |
|  | Type A2 | 1.3 | 48\% | 123\% | 83\% |
|  | Type B | 2.2 | 48\% | 82\% | 83\% |
|  | Type B | 1.3 | 37\% | 123\% | 83\% |

# 3. Closed-Form Expressions in Delay and Crosstalk Noise for Capacitively Coupled Distributed RC Lines 

### 3.1. Introduction

Interconnection-related issues are becoming more and more important in estimating VLSI behavior [3.1]. For instance, the coupling capacitance is becoming comparable to the grounding capacitance, and crosstalk noise may cause malfunctions and timing problems, in particular in dynamic circuits. Even in static circuits, the noise may generate unexpected glitches, which give rise to timing and power issues as well.

Several attempts have been made to treat delay and crosstalk noise in capacitively coupled interconnections [3.2]-[3.7]. Although [3.2] and [3.3] handle crosstalk noise in coupled RC lines, the interconnections are not distributed lines. [3.4] is limited to delay estimation in a two-line system. [3.5]-[3.7] describe both delay and crosstalk noise but do not give closed-form expressions; their approaches are useful for EDA implementation but too complicated for circuit designers. Moreover, they are restricted to the case where adjacent lines are driven from the same direction (hereafter, same-direction drive), and do not account for the junction capacitance of a driver MOSFET.

This chapter extends the analysis of delay and crosstalk noise to the more general case where adjacent lines are driven from opposite directions (hereafter, opposite-direction drive). The derived expressions are useful for circuit designers in estimating the delay and crosstalk noise, and give insight into coupling-related issues in an early stage of VLSI design.

We do not consider an inductance, $L$, and mutual inductance, $M$, here since they do not affect delay and crosstalk noise very much in an optimally buffered distributed line [3.8]-[3.9]. For lower-level local interconnections, the percentage error in delay between distributed RC and RLC lines is less than $1.5 \%$ when the width of the interconnection is at most ten times the design rule. Even for top global interconnections, it is less than $2 \%$ if the width is equal to the design rule [3.8]. The percentage errors in crosstalk-noise amplitude are of the same degree in both local and global interconnections [3.9]. This in turn means that $L$ and $M$ should be considered only in quite wide interconnections such as power-supply and clock lines.

This chapter is organized as follows. In the next section, we present the basic equations of capacitively coupled distributed lines. In Sections 3.3 and 3.4, we discuss delay and crosstalk noise in the same- and opposite-direction drive cases, respectively. Finally, a summary follows in Section 3.5.

### 3.2. Basic Equations

Capacitively coupled distributed RC lines in a two-line system are shown in Fig. 3.1, and governed by the following basic equation set.

$$
\left\{\begin{array}{l}
\frac{\partial^{2} v_{1}(x, t)}{\partial x^{2}}=r_{1}\left(c_{1}+c_{c}\right) \frac{\partial v_{1}(x, t)}{\partial t}-r_{1} c_{c} \frac{\partial v_{2}(x, t)}{\partial t}  \tag{3.1}\\
\frac{\partial^{2} v_{2}(x, t)}{\partial x^{2}}=r_{2}\left(c_{2}+c_{c}\right) \frac{\partial v_{2}(x, t)}{\partial t}-r_{2} c_{c} \frac{\partial v_{1}(x, t)}{\partial t}
\end{array}\right.
$$

where $v_{i}(x, t)$ $(i=1,2)$ is the voltage of line $i$. $r_{i}$, $c_{i}$, and $c_{c}$ are the resistance, the capacitance, and the coupling capacitance between the lines per unit length, respectively. Since a bus and other wiring structures laid out on the same level have the same resistance and capacitance per unit length, we hereafter assume $r_{1}=r_{2}=r$ and $c_{1}=c_{2}=c$. In this chapter, we do not consider lines on different levels because lines on upper and lower levels cross at right angles, and the coupling capacitance between them is negligible.

In the three-line system in Fig. 3.2, the following equation set holds.

$$
\left\{\begin{array}{l}
\frac{\partial^{2} v_{1}(x, t)}{\partial x^{2}}=r\left(c+2 c_{c}\right) \frac{\partial v_{1}(x, t)}{\partial t}-2 r c_{c} \frac{\partial v_{2}(x, t)}{\partial t}  \tag{3.2}\\
\frac{\partial^{2} v_{2}(x, t)}{\partial x^{2}}=r\left(c+c_{c}\right) \frac{\partial v_{2}(x, t)}{\partial t}-r c_{c} \frac{\partial v_{1}(x, t)}{\partial t}
\end{array} .\right.
$$

(3.1) and (3.2) can be represented as follows.

$$
\left\{\begin{array}{l}
\frac{\partial^{2} v_{1}(x, t)}{\partial x^{2}}=r\left(c+n c_{c}\right) \frac{\partial v_{1}(x, t)}{\partial t}-n r c_{c} \frac{\partial v_{2}(x, t)}{\partial t}  \tag{3.3}\\
\frac{\partial^{2} v_{2}(x, t)}{\partial x^{2}}=r\left(c+c_{c}\right) \frac{\partial v_{2}(x, t)}{\partial t}-r c_{c} \frac{\partial v_{1}(x, t)}{\partial t}
\end{array},\right.
$$

where $n=1$ and $n=2$ in the two- and three-line systems, respectively. (3.3) can be rewritten as follows.

$$
\left\{\begin{array}{l}
\frac{\partial^{2} v_{1}(x, t)}{\partial x^{2}}=r c\left\{(n \eta+1) \frac{\partial v_{1}(x, t)}{\partial t}-n \eta \frac{\partial v_{2}(x, t)}{\partial t}\right\}  \tag{3.4}\\
\frac{\partial^{2} v_{2}(x, t)}{\partial x^{2}}=r c\left\{(\eta+1) \frac{\partial v_{2}(x, t)}{\partial t}-\eta \frac{\partial v_{1}(x, t)}{\partial t}\right\}
\end{array},\right.
$$

where $\eta=c_{c} / c$. With a linear transformation (adding $n$ times the second equation to the first, and subtracting the second from the first), (3.4) becomes the following equation set.

$$
\left\{\begin{array}{l}
\frac{\partial^{2}\left\{v_{1}(x, t)+n v_{2}(x, t)\right\}}{\partial x^{2}}=r c \frac{\partial\left\{v_{1}(x, t)+n v_{2}(x, t)\right\}}{\partial t}  \tag{3.5}\\
\frac{\partial^{2}\left\{v_{1}(x, t)-v_{2}(x, t)\right\}}{\partial x^{2}}=r c \frac{\partial\left\{v_{1}(x, t)-v_{2}(x, t)\right\}}{\partial(t / p)}
\end{array}\right.
$$

where $p=(n+1) \eta+1$. $v_{1}+n v_{2}$ and $v_{1}-v_{2}$ are called a fast and slow wave, respectively.
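The decoupling in (3.5) can be checked numerically. The sketch below writes the right-hand side of (3.4) as a 2x2 coefficient matrix acting on the time derivatives of $(v_1, v_2)$ and confirms that the row vectors $(1, n)$ and $(1, -1)$ are left eigenvectors with eigenvalues $1$ and $p$. The function names are hypothetical and introduced only for this illustration.

```python
def coupling_matrix(n: int, eta: float):
    """Coefficient matrix of (3.4), normalized by r*c:
    d2v/dx2 = r*c * M * dv/dt with v = (v1, v2)."""
    return [[n * eta + 1.0, -n * eta],
            [-eta, eta + 1.0]]


def left_multiply(w, m):
    """Row vector w (length 2) times the 2x2 matrix m."""
    return [w[0] * m[0][0] + w[1] * m[1][0],
            w[0] * m[0][1] + w[1] * m[1][1]]


def check_decoupling(n: int, eta: float, tol: float = 1e-12) -> bool:
    """True if (1, n) and (1, -1) decouple (3.4) with eigenvalues 1 and p."""
    m = coupling_matrix(n, eta)
    p = (n + 1) * eta + 1.0
    fast = left_multiply([1.0, float(n)], m)   # expected: 1 * (1, n)
    slow = left_multiply([1.0, -1.0], m)       # expected: p * (1, -1)
    scale = max(1.0, p)
    ok_fast = abs(fast[0] - 1.0) < tol * scale and abs(fast[1] - n) < tol * scale
    ok_slow = abs(slow[0] - p) < tol * scale and abs(slow[1] + p) < tol * scale
    return ok_fast and ok_slow
```

The eigenvalue $1$ corresponds to the time scale $rc$ of the fast wave, and the eigenvalue $p$ to the $p$-times slower time scale of the slow wave.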

### 3.3. Same-Direction Drive

In this section, the case that adjacent lines are driven from the same direction as shown in Fig. 3.3 is treated. As boundary conditions, we account for an equivalent resistance of a driver MOSFET, $R_{t}$, equivalent junction capacitance of the driver MOSFET at the drain, $C_{j}$, and equivalent capacitance of a receiver MOSFET, $C_{t}$, as follows.

$$
\left\{\begin{array}{l}
-\left.\frac{1}{r} \cdot \frac{\partial v_{1}(x, t)}{\partial x}\right|_{x=0}=\frac{E_{1}-v_{1}(0, t)}{R_{t}}-C_{j} \frac{\partial v_{1}(0, t)}{\partial t}  \tag{3.6}\\
-\left.\frac{1}{r} \cdot \frac{\partial v_{1}(x, t)}{\partial x}\right|_{x=l}=C_{t} \frac{\partial v_{1}(l, t)}{\partial t} \\
-\left.\frac{1}{r} \cdot \frac{\partial v_{2}(x, t)}{\partial x}\right|_{x=0}=\frac{E_{2}-v_{2}(0, t)}{R_{t}}-C_{j} \frac{\partial v_{2}(0, t)}{\partial t} \\
-\left.\frac{1}{r} \cdot \frac{\partial v_{2}(x, t)}{\partial x}\right|_{x=l}=C_{t} \frac{\partial v_{2}(l, t)}{\partial t}
\end{array}\right.
$$

where $E_{i}$ $(i=1,2)$ is a step voltage at the driving point of line $i$, and $l$ is the line length. Then, we introduce the concept of the fast and slow waves, and (3.5) is rewritten as follows.

$$
\left\{\begin{array}{l}
\frac{\partial^{2} v_{\text {fast }}(x, t)}{\partial x^{2}}=r c \frac{\partial v_{\text {fast }}(x, t)}{\partial t}  \tag{3.7}\\
\frac{\partial^{2} v_{\text {slow }}(x, t)}{\partial x^{2}}=r c \frac{\partial v_{\text {slow }}(x, t)}{\partial(t / p)}
\end{array}\right.
$$

where $v_{\text {fast }}=v_{1}+n v_{2}$ and $v_{\text {slow }}=v_{1}-v_{2}$. The boundary conditions, (3.6), can be replaced as well.

$$
\left\{\begin{array}{l}
-\left.\frac{1}{r} \cdot \frac{\partial v_{\text {fast }}(x, t)}{\partial x}\right|_{x=0}=\frac{\left(E_{1}+n E_{2}\right)-v_{\text {fast }}(0, t)}{R_{t}}-C_{j} \frac{\partial v_{\text {fast }}(0, t)}{\partial t}  \tag{3.8}\\
-\left.\frac{1}{r} \cdot \frac{\partial v_{\text {fast }}(x, t)}{\partial x}\right|_{x=l}=C_{t} \frac{\partial v_{\text {fast }}(l, t)}{\partial t} \\
-\left.\frac{1}{r} \cdot \frac{\partial v_{\text {slow }}(x, t)}{\partial x}\right|_{x=0}=\frac{\left(E_{1}-E_{2}\right)-v_{\text {slow }}(0, t)}{R_{t}}-\frac{C_{j}}{p} \cdot \frac{\partial v_{\text {slow }}(0, t)}{\partial(t / p)} \\
-\left.\frac{1}{r} \cdot \frac{\partial v_{\text {slow }}(x, t)}{\partial x}\right|_{x=l}=\frac{C_{t}}{p} \cdot \frac{\partial v_{\text {slow }}(l, t)}{\partial(t / p)}
\end{array} .\right.
$$

On the other hand, it is well known that the telegraph equation of the single distributed RC line as shown in Fig. 3.4 (a),

$$
\begin{equation*}
\frac{\partial^{2} v(x, t)}{\partial x^{2}}=r c \frac{\partial v(x, t)}{\partial t} \tag{3.9}
\end{equation*}
$$

with the boundary conditions,

$$
\left\{\begin{array}{l}
-\left.\frac{1}{r} \cdot \frac{\partial v(x, t)}{\partial x}\right|_{x=0}=\frac{E-v(0, t)}{R_{t}}  \tag{3.10}\\
-\left.\frac{1}{r} \cdot \frac{\partial v(x, t)}{\partial x}\right|_{x=l}=C_{t} \frac{\partial v(l, t)}{\partial t}
\end{array}\right.
$$

has the following solution at the receiving end [3.10].

$$
\begin{align*}
v(l, t) & =E\left(1-\exp \left[-\frac{t /(R C)-0.1}{\tau_{\text {ElmoreWithoutCj }}-0.1}\right]\right) \\
& =E\left(1-\exp \left[-\frac{t /(R C)-0.1}{R_{T} C_{T}+R_{T}+C_{T}+0.4}\right]\right) \quad(\text { if } t /(R C)>0.1)  \tag{3.11}\\
& =0 \quad(\text { if } t /(R C) \leq 0.1),
\end{align*}
$$

where $R=r l$, $C=c l$, $R_{T}=R_{t} / R$, and $C_{T}=C_{t} / C$. Namely, $R$ and $C$ are the total resistance and capacitance of the line. $\tau_{\text {ElmoreWithoutCj }}$ is the Elmore delay [3.11] of the line without $C_{j}$, normalized by $R C$, which is $R_{T} C_{T}+R_{T}+C_{T}+0.5$. Supposing $C_{j}$ as shown in Fig. 3.4 (b), the Elmore delay is replaced by $\tau_{\text {ElmoreWithCj }}=R_{T}\left(C_{T}+C_{J}\right)+R_{T}+C_{T}+0.5$, and thus (3.11) is rewritten as follows.

$$
\begin{align*}
v(l, t) & =E\left(1-\exp \left[-\frac{t /(R C)-0.1}{\tau_{\text {ElmoreWithCj }}-0.1}\right]\right) \\
& =E\left(1-\exp \left[-\frac{t /(R C)-0.1}{R_{T}\left(C_{T}+C_{J}\right)+R_{T}+C_{T}+0.4}\right]\right) \quad(\text { if } t /(R C)>0.1)  \tag{3.12}\\
& =0 \quad(\text { if } t /(R C) \leq 0.1),
\end{align*}
$$

where $C_{J}=C_{j} / C$. Comparing the boundary conditions (3.8) and (3.10), the following solutions for the fast and slow waves can be obtained.

$$
\left\{\begin{array}{rl}
v_{\text {fast }}(l, t) & =\left(E_{1}+n E_{2}\right)\left(1-\exp \left[-\frac{t /(R C)-0.1}{R_{T}\left(C_{T}+C_{J}\right)+R_{T}+C_{T}+0.4}\right]\right) \quad(\text { if } t /(R C)>0.1) \\
& =0 \quad(\text { if } t /(R C) \leq 0.1) \\
v_{\text {slow }}(l, t) & =\left(E_{1}-E_{2}\right)\left(1-\exp \left[-\frac{t /(p R C)-0.1}{R_{T}\left(C_{T}+C_{J}\right) / p+R_{T}+C_{T} / p+0.4}\right]\right)  \tag{3.13}\\
& =\left(E_{1}-E_{2}\right)\left(1-\exp \left[-\frac{t /(R C)-0.1 p}{R_{T}\left(C_{T}+C_{J}\right)+p R_{T}+C_{T}+0.4 p}\right]\right) \quad(\text { if } t /(R C)>0.1 p) \\
& =0 \quad(\text { if } t /(R C) \leq 0.1 p)
\end{array} .\right.
$$

Since $v_{\text {fast }}=v_{1}+n v_{2}$ and $v_{\text {slow }}=v_{1}-v_{2}, v_{1}$ and $v_{2}$ are expressed with the linear combination as follows.

$$
\left\{\begin{array}{l}
v_{1}(l, t)=\left\{v_{\text {fast }}(l, t)+n v_{\text {slow }}(l, t)\right\} /(n+1)  \tag{3.14}\\
v_{2}(l, t)=\left\{v_{\text {fast }}(l, t)-v_{\text {slow }}(l, t)\right\} /(n+1)
\end{array} .\right.
$$

Finally, the following expression for $v_{1}$ holds.

$$
\begin{align*}
v_{1}(l, t) & =E_{1}-\frac{1}{n+1}\left\{\left(E_{1}+n E_{2}\right) \exp \left[-\frac{t /(R C)-0.1}{R_{T}\left(C_{T}+C_{J}\right)+R_{T}+C_{T}+0.4}\right]\right. \\
& \left.+n\left(E_{1}-E_{2}\right) \exp \left[-\frac{t /(R C)-0.1 p}{R_{T}\left(C_{T}+C_{J}\right)+p R_{T}+C_{T}+0.4 p}\right]\right\} \quad(\text { if } t /(R C)>0.1 p)  \tag{3.15}\\
& =\frac{E_{1}+n E_{2}}{n+1}\left(1-\exp \left[-\frac{t /(R C)-0.1}{R_{T}\left(C_{T}+C_{J}\right)+R_{T}+C_{T}+0.4}\right]\right) \quad(\text { if } 0.1<t /(R C) \leq 0.1 p) \\
& =0 \quad(\text { if } t /(R C) \leq 0.1) .
\end{align*}
$$

Since we hereafter assume that line 1 is a victim and line 2 is an aggressor in this chapter, we will focus on $v_{1}$ rather than $v_{2}$. In order to verify the validity of (3.15) and the other expressions described later, we compare the expressions with HSPICE simulations. Note that all HSPICE simulations in this chapter are carried out using a 10-stage $\pi$-type RC model instead of a distributed RC line model. We use the following parameter sets for a wide-range comparison in terms of $\eta$, $R_{T}$, $C_{T}$, and $C_{J}$.

- $\eta=\{0,0.1,0.2,0.5,1,2,5,10\}$.
- $R_{T}=\{0,0.1,0.2,0.5,1,2,5,10\}$.
- $C_{T}=\{0,0.1,0.2,0.5,1,2,5,10\}$.
- $C_{J}=\{0,0.1,0.2,0.5,1,2,5,10\}$.
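The four value lists above form a Cartesian product. A minimal sketch of how such a sweep could be enumerated is given below; the names `VALUES` and `parameter_grid` are hypothetical, and the sketch only enumerates the parameter points without performing any simulation.

```python
from itertools import product

# Normalized values listed in the text; all four parameters share the same set.
VALUES = (0, 0.1, 0.2, 0.5, 1, 2, 5, 10)


def parameter_grid():
    """Return an iterator over every (eta, R_T, C_T, C_J) combination."""
    return product(VALUES, repeat=4)


# Materialize the grid once so that it can be counted or indexed.
grid = list(parameter_grid())
```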

Consequently, the number of combinations is $4096(=8 \times 8 \times 8 \times 8)$. Unfortunately, since (3.15) does not fit the HSPICE simulations well, we introduce a fitting technique to the expressions with MATLAB Optimization Toolbox [3.12]. We introduce $a_{1}$ and $a_{2}$ into (3.15) as fitting parameters, and (3.15) is rewritten as follows.

$$
\begin{align*}
v_{1}(l, t)= & E_{1}-\frac{1}{n+1}\left\{\left(E_{1}+n E_{2}\right) \exp \left[-\frac{t /(R C)-0.1-a_{1} \sqrt{R_{T} C_{J}}}{R_{T}\left(C_{T}+a_{2} C_{J}\right)+R_{T}+C_{T}+0.4}\right]\right. \\
& \left.+n\left(E_{1}-E_{2}\right) \exp \left[-\frac{t /(R C)-0.1 p-a_{1} \sqrt{R_{T} C_{J}}}{R_{T}\left(C_{T}+a_{2} C_{J}\right)+p R_{T}+C_{T}+0.4 p}\right]\right\} \quad\left(\text { if } t /(R C)>0.1 p+a_{1} \sqrt{R_{T} C_{J}}\right)  \tag{3.16}\\
= & \frac{E_{1}+n E_{2}}{n+1}\left(1-\exp \left[-\frac{t /(R C)-0.1-a_{1} \sqrt{R_{T} C_{J}}}{R_{T}\left(C_{T}+a_{2} C_{J}\right)+R_{T}+C_{T}+0.4}\right]\right) \\
& \left(\text { if } 0.1+a_{1} \sqrt{R_{T} C_{J}}<t /(R C) \leq 0.1 p+a_{1} \sqrt{R_{T} C_{J}}\right) \\
= & 0 \quad\left(\text { if } t /(R C) \leq 0.1+a_{1} \sqrt{R_{T} C_{J}}\right) .
\end{align*}
$$
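To make the piecewise structure of (3.16) concrete, the following sketch evaluates $v_{1}(l, t)$ for a normalized time $t /(R C)$. The function name and argument names are hypothetical; the defaults $a_{1}=0.19$ and $a_{2}=1$ are the delay-fitting values reported later in this chapter, and setting $a_{1}=0$, $a_{2}=1$ recovers (3.15).

```python
import math


def v1_same_direction(t_norm, n, eta, r_t, c_t, c_j, e1, e2, a1=0.19, a2=1.0):
    """Receiving-end voltage of line 1 after (3.16).

    t_norm is t/(RC); r_t, c_t, c_j are the normalized R_T, C_T, C_J.
    """
    p = (n + 1) * eta + 1.0
    shift = a1 * math.sqrt(r_t * c_j)
    tau_fast = r_t * (c_t + a2 * c_j) + r_t + c_t + 0.4
    tau_slow = r_t * (c_t + a2 * c_j) + p * r_t + c_t + 0.4 * p
    t_fast = 0.1 + shift        # time at which the fast wave starts to rise
    t_slow = 0.1 * p + shift    # time at which the slow wave starts to rise
    if t_norm <= t_fast:
        return 0.0
    fast = math.exp(-(t_norm - t_fast) / tau_fast)
    if t_norm <= t_slow:
        # Only the fast wave contributes in this interval.
        return (e1 + n * e2) / (n + 1) * (1.0 - fast)
    slow = math.exp(-(t_norm - t_slow) / tau_slow)
    return e1 - ((e1 + n * e2) * fast + n * (e1 - e2) * slow) / (n + 1)
```

For example, `v1_same_direction(t, 2, 1.0, 0, 0, 0, 1.0, 1.0)` describes the in-phase drive of a three-line system with $\eta=1$, for which the slow-wave term vanishes because $E_{1}=E_{2}$.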

### 3.3.1. Delay

As expressed in (3.16), $v_{1}$ depends on the values of $E_{1}$ and $E_{2}$. In the delay estimation of line 1, we set $E_{1}=E$, while $E_{2}$ has three cases. $E_{2}=E$ indicates an in-phase drive, where the adjacent lines are driven in phase. When $E_{2}=0$, we call it an $E_{2}=0$ drive, where only line 1 is driven and line 2 is not. The last case, $E_{2}=-E$, is an out-of-phase drive, where the adjacent lines are driven out of phase. The delay comparisons between (3.16) and the HSPICE simulations in the three cases are shown in Fig. 3.5 for $n=2$, $\eta=1$, and $R_{T}=C_{T}=C_{J}=0$. $\eta=1$ means that the coupling capacitance is equal to the grounding capacitance, which often happens in VLSI design. The figure shows that the delays in the same-direction drive case vary from $0.38 R C$ to $1.98 R C$ depending on the $E_{2}$ drive, and the out-of-phase drive has the worst-case delay, which is treated as the line delay in this chapter.

In the out-of-phase drive, (3.16) does not have a positive value when $t /(R C) \leq 0.1 p+a_{1} \sqrt{R_{T} C_{J}}$. Therefore, only the region in which $t /(R C)>0.1 p+a_{1} \sqrt{R_{T} C_{J}}$ needs to be considered in the delay estimation, where (3.16) is rewritten as follows.

$$
\begin{align*}
\frac{v_{1}(l, t)}{E} & =1-\frac{1}{n+1}\left\{(1-n) \exp \left[-\frac{t /(R C)-0.1-a_{1} \sqrt{R_{T} C_{J}}}{R_{T}\left(C_{T}+a_{2} C_{J}\right)+R_{T}+C_{T}+0.4}\right]\right. \\
& \left.+2 n \exp \left[-\frac{t /(R C)-0.1 p-a_{1} \sqrt{R_{T} C_{J}}}{R_{T}\left(C_{T}+a_{2} C_{J}\right)+p R_{T}+C_{T}+0.4 p}\right]\right\} \quad\left(\text { if } t /(R C) \geq 0.1 p+a_{1} \sqrt{R_{T} C_{J}}\right) . \tag{3.17}
\end{align*}
$$

Then, in order to find the delay, $t_{p d, \text { same }}$, we need to solve the following equation for $t_{p d, \text { same }}$, where $v_{1}(l, t) / E$ in (3.17) is set to $1 / 2$.

$$
\begin{align*}
& \frac{1}{n+1}\left\{(1-n) \exp \left[-\frac{t_{p d, \text { same }} /(R C)-0.1-a_{1} \sqrt{R_{T} C_{J}}}{R_{T}\left(C_{T}+a_{2} C_{J}\right)+R_{T}+C_{T}+0.4}\right]\right. \\
& \left.+2 n \exp \left[-\frac{t_{p d, \text { same }} /(R C)-0.1 p-a_{1} \sqrt{R_{T} C_{J}}}{R_{T}\left(C_{T}+a_{2} C_{J}\right)+p R_{T}+C_{T}+0.4 p}\right]\right\}=\frac{1}{2} . \tag{3.18}
\end{align*}
$$

### 3.3.1.1. Case that $\boldsymbol{n}=\mathbf{1}$ (Two-Line System)

Since the fast-wave term vanishes when $n=1$, $t_{p d, \text { same }}$ in (3.18) is easily solved as follows.

$$
\begin{equation*}
t_{p d, \text { same }} /(R C)=0.1 p+a_{1} \sqrt{R_{T} C_{J}}+\ln [2]\left\{R_{T}\left(C_{T}+a_{2} C_{J}\right)+p R_{T}+C_{T}+0.4 p\right\} \tag{3.19}
\end{equation*}
$$

Compared with the HSPICE simulations, $a_{1}=0.19$ and $a_{2}=1$ are optimal in (3.19), where the percentage error is $6.9 \%$ at worst. Thus, $t_{p d, \text { same }}$ finally becomes as follows.

$$
\begin{gather*}
t_{p d, \text { same }} /(R C)=0.1(2 \eta+1)+0.19 \sqrt{R_{T} C_{J}}+\ln [2]\left\{R_{T}\left(C_{T}+C_{J}\right)+(2 \eta+1) R_{T}+C_{T}+0.4(2 \eta+1)\right\}  \tag{3.20}\\
(\because p=(n+1) \eta+1=2 \eta+1) .
\end{gather*}
$$

The worst-case $\%$ error happens when $\eta=0, R_{T}=0.5, C_{T}=0$, and $C_{J}=10$ as depicted in Fig. 3.6.
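The closed-form delay (3.20) is straightforward to evaluate. The sketch below is one possible transcription; the function name `delay_two_line` is hypothetical, and the inputs are the normalized quantities $\eta$, $R_{T}$, $C_{T}$, and $C_{J}$ defined above.

```python
import math


def delay_two_line(eta, r_t, c_t, c_j):
    """Normalized out-of-phase delay t_pd,same/(RC) of a two-line system
    (n = 1), transcribed from (3.20) with a1 = 0.19 and a2 = 1."""
    p = 2.0 * eta + 1.0
    tau_slow = r_t * (c_t + c_j) + p * r_t + c_t + 0.4 * p
    return 0.1 * p + 0.19 * math.sqrt(r_t * c_j) + math.log(2.0) * tau_slow
```

With $\eta=0$ and $R_{T}=C_{T}=C_{J}=0$ the expression reduces to $0.1+0.4 \ln 2 \approx 0.38$, the delay of a single uncoupled line according to (3.11).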

### 3.3.1.2. Case that $\boldsymbol{n}=\mathbf{2}$ (Three-Line System)

The left-hand side of (3.18) is a sum of two exponential functions, and can be represented by the following function, $f$.

$$
\begin{equation*}
f(\hat{t})=k_{\text {fast }} \exp \left[-\hat{t} / \tau_{\text {fast }}\right]+k_{\text {slow }} \exp \left[-\hat{t} / \tau_{\text {slow }}\right] \tag{3.21}
\end{equation*}
$$

where

$$
\left\{\begin{array}{rl}
\hat{t} & =t_{\text {pd,same }} /(R C)  \tag{3.22}\\
p & =(n+1) \eta+1=3 \eta+1 \\
\tau_{\text {fast }} & =R_{T}\left(C_{T}+a_{2} C_{J}\right)+R_{T}+C_{T}+0.4 \\
\tau_{\text {slow }} & =R_{T}\left(C_{T}+a_{2} C_{J}\right)+p R_{T}+C_{T}+0.4 p \\
k_{\text {fast }} & =-\frac{1}{3} \exp \left[\frac{0.1+a_{1} \sqrt{R_{T} C_{J}}}{\tau_{\text {fast }}}\right] \\
k_{\text {slow }} & =\frac{4}{3} \exp \left[\frac{0.1 p+a_{1} \sqrt{R_{T} C_{J}}}{\tau_{\text {slow }}}\right]
\end{array} .\right.
$$

Then, we assume that (3.21) is approximated by the following single exponential function, $g$,

$$
\begin{equation*}
g(\hat{t})=k_{\text {same }} \exp \left[-\hat{t} / \tau_{\text {same }}\right] \tag{3.23}
\end{equation*}
$$

and introduce the moment matching method [3.13] as follows.

$$
\begin{align*}
& m_{0}=k_{\text {fast }}+k_{\text {slow }} \Leftrightarrow n_{0}=k_{\text {same }} \\
& m_{1}=\int_{0}^{\infty} f(\hat{t}) d \hat{t}=k_{\text {fast }} \tau_{\text {fast }}+k_{\text {slow }} \tau_{\text {slow }} \Leftrightarrow n_{1}=\int_{0}^{\infty} g(\hat{t}) d \hat{t}=k_{\text {same }} \tau_{\text {same }} \\
& m_{2}=\int_{0}^{\infty} \hat{t} f(\hat{t}) d \hat{t}=k_{\text {fast }} \tau_{\text {fast }}^{2}+k_{\text {slow }} \tau_{\text {slow }}^{2} \Leftrightarrow n_{2}=\int_{0}^{\infty} \hat{t} g(\hat{t}) d \hat{t}=k_{\text {same }} \tau_{\text {same }}^{2} \\
& \qquad \vdots \\
& m_{j}=\frac{1}{(j-1)!} \int_{0}^{\infty} \hat{t}^{j-1} f(\hat{t}) d \hat{t}=k_{\text {fast }} \tau_{\text {fast }}^{j}+k_{\text {slow }} \tau_{\text {slow }}^{j} \Leftrightarrow n_{j}=\frac{1}{(j-1)!} \int_{0}^{\infty} \hat{t}^{j-1} g(\hat{t}) d \hat{t}=k_{\text {same }} \tau_{\text {same }}^{j}  \tag{3.24}\\
& m_{j+1}=\frac{1}{j!} \int_{0}^{\infty} \hat{t}^{j} f(\hat{t}) d \hat{t}=k_{\text {fast }} \tau_{\text {fast }}^{j+1}+k_{\text {slow }} \tau_{\text {slow }}^{j+1} \Leftrightarrow n_{j+1}=\frac{1}{j!} \int_{0}^{\infty} \hat{t}^{j} g(\hat{t}) d \hat{t}=k_{\text {same }} \tau_{\text {same }}^{j+1} \\
& \qquad \vdots
\end{align*}
$$

where $m_{i}$ and $n_{i}(i=0,1,2, \ldots, j, j+1, \ldots)$ are the $i$-th order moments of $f$ and $g$, respectively, and we assume $m_{i}=n_{i}$ based on the moment matching method. Once we obtain $m_{j}$ and $m_{j+1}, \tau_{\text {same }}$ and $k_{\text {same }}$ are given as follows.

$$
\left\{\begin{array}{l}
\tau_{\text {same }}=m_{j+1} / m_{j}  \tag{3.25}\\
k_{\text {same }}=m_{j}^{j+1} / m_{j+1}^{j}
\end{array} .\right.
$$

Then, $\hat{t}$ is obtained as follows.

$$
\begin{gather*}
\hat{t}=\tau_{\text {same }} \ln \left[2 k_{\text {same }}\right]=\frac{m_{j+1}}{m_{j}} \ln \left[\frac{2 m_{j}^{j+1}}{m_{j+1}^{j}}\right]  \tag{3.26}\\
\left(\because k_{\text {same }} \exp \left[-\hat{t} / \tau_{\text {same }}\right]=1 / 2\right),
\end{gather*}
$$

where $j$ is a fitting parameter. Comparing again with the HSPICE simulations, $a_{1}=0.19$, $a_{2}=1$, and $j=2$ are found to be optimal. Then, (3.26) can finally be rewritten as follows.

$$
\begin{equation*}
\frac{t_{p d, \text { same }}}{R C}=\frac{m_{3}}{m_{2}} \ln \left[\frac{2 m_{2}^{3}}{m_{3}^{2}}\right], \tag{3.27}
\end{equation*}
$$

where

$$
\left\{\begin{array}{l}
\tau_{\text {fast }}=R_{T}\left(C_{T}+C_{J}\right)+R_{T}+C_{T}+0.4  \tag{3.28}\\
\tau_{\text {slow }}=R_{T}\left(C_{T}+C_{J}\right)+(3 \eta+1) R_{T}+C_{T}+0.4(3 \eta+1) \\
k_{\text {fast }}=-\frac{1}{3} \exp \left[\frac{0.1+0.19 \sqrt{R_{T} C_{J}}}{\tau_{\text {fast }}}\right] \\
k_{\text {slow }}=\frac{4}{3} \exp \left[\frac{0.1(3 \eta+1)+0.19 \sqrt{R_{T} C_{J}}}{\tau_{\text {slow }}}\right] \\
m_{2}=k_{\text {fast }} \tau_{\text {fast }}^{2}+k_{\text {slow }} \tau_{\text {slow }}^{2} \\
m_{3}=k_{\text {fast }} \tau_{\text {fast }}^{3}+k_{\text {slow }} \tau_{\text {slow }}^{3}
\end{array} .\right.
$$

The worst-case percentage error in (3.27) is $6.9 \%$, the same as in the case of $n=1$, when $\eta=0$, $R_{T}=0.5$, $C_{T}=0$, and $C_{J}=10$; under these conditions, the waveforms are the same as in Fig. 3.6.
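The moment-matching delay of (3.26)-(3.28) can be sketched as follows. The function name `delay_three_line` is hypothetical; the defaults $a_{1}=0.19$, $a_{2}=1$, and $j=2$ are the fitted values quoted above, and the moments are computed directly as $k_{\text {fast }} \tau_{\text {fast }}^{j}+k_{\text {slow }} \tau_{\text {slow }}^{j}$.

```python
import math


def delay_three_line(eta, r_t, c_t, c_j, a1=0.19, a2=1.0, j=2):
    """Normalized out-of-phase delay t_pd,same/(RC) of a three-line system
    (n = 2), after (3.26)-(3.28)."""
    p = 3.0 * eta + 1.0
    shift = a1 * math.sqrt(r_t * c_j)
    tau_fast = r_t * (c_t + a2 * c_j) + r_t + c_t + 0.4
    tau_slow = r_t * (c_t + a2 * c_j) + p * r_t + c_t + 0.4 * p
    k_fast = -(1.0 / 3.0) * math.exp((0.1 + shift) / tau_fast)
    k_slow = (4.0 / 3.0) * math.exp((0.1 * p + shift) / tau_slow)

    def moment(order):
        # m_order = k_fast * tau_fast^order + k_slow * tau_slow^order
        return k_fast * tau_fast ** order + k_slow * tau_slow ** order

    m_j, m_j1 = moment(j), moment(j + 1)
    tau_same = m_j1 / m_j                    # (3.25)
    k_same = m_j ** (j + 1) / m_j1 ** j      # (3.25)
    return tau_same * math.log(2.0 * k_same)  # (3.26)
```

When $\eta=0$ the two time constants coincide, the single-exponential approximation becomes exact, and the result reduces to $0.1+0.4 \ln 2$ for $R_{T}=C_{T}=C_{J}=0$.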

### 3.3.2. Crosstalk-Noise Amplitude

In the crosstalk-noise estimation, we make $E_{1}=0$ and $E_{2}=E$ in (3.16) as follows.

$$
\begin{align*}
& \frac{v_{1}(l, t)}{E}=- \frac{n}{n+1}\left(\exp \left[-\frac{t /(R C)-0.1-a_{1} \sqrt{R_{T} C_{J}}}{\tau_{\text {fast }}}\right]-\exp \left[-\frac{t /(R C)-0.1 p-a_{1} \sqrt{R_{T} C_{J}}}{\tau_{\text {slow }}}\right]\right) \\
&\left(\text { if } t /(R C)>0.1 p+a_{1} \sqrt{R_{T} C_{J}}\right) \\
&= \frac{n}{n+1}\left(1-\exp \left[-\frac{t /(R C)-0.1-a_{1} \sqrt{R_{T} C_{J}}}{\tau_{\text {fast }}}\right]\right)  \tag{3.29}\\
&\left(\text { if } 0.1+a_{1} \sqrt{R_{T} C_{J}}<t /(R C) \leq 0.1 p+a_{1} \sqrt{R_{T} C_{J}}\right) \\
&=0 \quad\left(\text { if } t /(R C) \leq 0.1+a_{1} \sqrt{R_{T} C_{J}}\right),
\end{align*}
$$

where $\tau_{\text {fast }}=R_{T}\left(C_{T}+a_{2} C_{J}\right)+R_{T}+C_{T}+0.4$ and $\tau_{\text {slow }}=R_{T}\left(C_{T}+a_{2} C_{J}\right)+p R_{T}+C_{T}+0.4 p$. The crosstalk-noise comparison between (3.29) and the HSPICE simulations is shown in Fig. 3.7 for $n=2$, $\eta=1$, and $R_{T}=C_{T}=C_{J}=0$, where the noise peak in the HSPICE simulation is 0.4. This means that the noise induced by the crosstalk goes up to $40 \%$ of the signal swing under this condition, which often happens in VLSI design and may cause malfunction, in particular in dynamic circuits.

In order to obtain the noise peak, we first find the time that gives the noise peak, $t_{p, \text { same }}$. Since (3.29) is a monotonically increasing function when $t /(R C) \leq 0.1 p+a_{1} \sqrt{R_{T} C_{J}}$, $t_{p, \text { same }} /(R C) \geq 0.1 p+a_{1} \sqrt{R_{T} C_{J}}$ must hold. Therefore, we provisionally obtain $t_{p, \text { same }}$ by differentiating (3.29) and solving $\partial v_{1} / \partial t=0$ for $t$; if the obtained $t_{p, \text { same }} /(R C)$ is less than $0.1 p+a_{1} \sqrt{R_{T} C_{J}}$, it is set to $0.1 p+a_{1} \sqrt{R_{T} C_{J}}$, as follows.

$$
\begin{align*}
\frac{t_{p, \text { same }}}{R C}= & \frac{\tau_{\text {fast }} \tau_{\text {slow }} \ln \left[\tau_{\text {fast }} / \tau_{\text {slow }}\right]+0.1\left(p \tau_{\text {fast }}-\tau_{\text {slow }}\right)}{\tau_{\text {fast }}-\tau_{\text {slow }}}+a_{1} \sqrt{R_{T} C_{J}} \\
& \left(\text { if } \frac{\tau_{\text {fast }} \tau_{\text {slow }} \ln \left[\tau_{\text {fast }} / \tau_{\text {slow }}\right]+0.1\left(p \tau_{\text {fast }}-\tau_{\text {slow }}\right)}{\tau_{\text {fast }}-\tau_{\text {slow }}} \geq 0.1 p\right)  \tag{3.30}\\
= & 0.1 p+a_{1} \sqrt{R_{T} C_{J}} \quad\left(\text { if } \frac{\tau_{\text {fast }} \tau_{\text {slow }} \ln \left[\tau_{\text {fast }} / \tau_{\text {slow }}\right]+0.1\left(p \tau_{\text {fast }}-\tau_{\text {slow }}\right)}{\tau_{\text {fast }}-\tau_{\text {slow }}}<0.1 p\right) .
\end{align*}
$$

By putting (3.30) back to (3.29), the noise peak, $v_{p, \text { same }}$, is obtained as follows.

$$
\begin{align*}
\frac{v_{\text {p,same }}}{E}= & -\frac{n}{n+1}\left(\exp \left[-\frac{\tau_{\text {slow }} \ln \left[\tau_{\text {fast }} / \tau_{\text {slow }}\right]+0.1(p-1)}{\tau_{\text {fast }}-\tau_{\text {slow }}}\right]-\exp \left[-\frac{\tau_{\text {fast }} \ln \left[\tau_{\text {fast }} / \tau_{\text {slow }}\right]+0.1(p-1)}{\tau_{\text {fast }}-\tau_{\text {slow }}}\right]\right) \\
& \left(\text { if } \frac{\tau_{\text {fast }} \tau_{\text {slow }} \ln \left[\tau_{\text {fast }} / \tau_{\text {slow }}\right]+0.1\left(p \tau_{\text {fast }}-\tau_{\text {slow }}\right)}{\tau_{\text {fast }}-\tau_{\text {slow }}} \geq 0.1 p\right)  \tag{3.31}\\
= & -\frac{n}{n+1}\left\{\exp \left[-\frac{0.1(p-1)}{\tau_{\text {fast }}}\right]-1\right\} \quad\left(\text { if } \frac{\tau_{\text {fast }} \tau_{\text {slow }} \ln \left[\tau_{\text {fast }} / \tau_{\text {slow }}\right]+0.1\left(p \tau_{\text {fast }}-\tau_{\text {slow }}\right)}{\tau_{\text {fast }}-\tau_{\text {slow }}}<0.1 p\right) .
\end{align*}
$$

(3.31) does not include the fitting parameter $a_{1}$ but only $a_{2}$. Since $a_{2}=0.7$ is optimal for both $n=1$ and $n=2$, we set $\tau_{\text {fast }}=R_{T}\left(C_{T}+0.7 C_{J}\right)+R_{T}+C_{T}+0.4$ and $\tau_{\text {slow }}=R_{T}\left(C_{T}+0.7 C_{J}\right)+p R_{T}+C_{T}+0.4 p$ in the crosstalk-noise estimation in this section. For $t_{p, \text { same }}$, $a_{1}=0$ is optimal, and thus (3.30) can be rewritten as follows.

$$
\begin{align*}
\frac{t_{p, \text { same }}}{R C}= & \frac{\tau_{\text {fast }} \tau_{\text {slow }} \ln \left[\tau_{\text {fast }} / \tau_{\text {slow }}\right]+0.1\left(p \tau_{\text {fast }}-\tau_{\text {slow }}\right)}{\tau_{\text {fast }}-\tau_{\text {slow }}} \\
& \left(\text { if } \frac{\tau_{\text {fast }} \tau_{\text {slow }} \ln \left[\tau_{\text {fast }} / \tau_{\text {slow }}\right]+0.1\left(p \tau_{\text {fast }}-\tau_{\text {slow }}\right)}{\tau_{\text {fast }}-\tau_{\text {slow }}} \geq 0.1 p\right)  \tag{3.32}\\
& =0.1 p \quad\left(\text { if } \frac{\tau_{\text {fast }} \tau_{\text {slow }} \ln \left[\tau_{\text {fast }} / \tau_{\text {slow }}\right]+0.1\left(p \tau_{\text {fast }}-\tau_{\text {slow }}\right)}{\tau_{\text {fast }}-\tau_{\text {slow }}}<0.1 p\right) .
\end{align*}
$$

In case that $n=1$, the worst-case $\%$ error of $t_{p, \text { same }}$ in (3.32) is as much as $55.4 \%$ when $\eta=0.1, R_{T}=0.5$, $C_{T}=0$, and $C_{J}=10$ while the worst-case error of $v_{p, \text { same }}$ in (3.31) is just $0.033 E$ (3.3\%) as shown in Fig. 3.8 when $\eta=5, R_{T}=0.1, C_{T}=1$, and $C_{J}=10$. In case that $n=2$, the worst-case error of $v_{p, \text { same }}$ is $0.044 E(4.4 \%)$ as depicted in Fig. 3.9 when $\eta=10, R_{T}=10, C_{T}=0$, and $C_{J}=10$ although the worst-case $\%$ error of $t_{p, \text { same }}$ is as much as $56.8 \%$ when $\eta=10, R_{T}=0, C_{T}=0$, and $C_{J}=10$.
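As a rough numerical illustration of (3.31) and (3.32), the following Python sketch evaluates the normalized peak time and peak amplitude with $a_{2}=0.7$ and $a_{1}=0$. The function name and the choice to return zero noise when $\tau_{\text {fast }}=\tau_{\text {slow }}$ (that is, $\eta=0$) are assumptions made for this example, and $p=(n+1) \eta+1$ is taken from the definition quoted with (3.54) and (3.55).

```python
import math


def same_direction_noise_peak(n, eta, R_T, C_T, C_J):
    """Return (t_p/(RC), v_p/E) following (3.32) and (3.31).

    n is the number of aggressor lines (1 or 2); eta, R_T, C_T and C_J are
    the normalized parameters used in the text.
    """
    p = (n + 1) * eta + 1
    tau_fast = R_T * (C_T + 0.7 * C_J) + R_T + C_T + 0.4
    tau_slow = R_T * (C_T + 0.7 * C_J) + p * R_T + C_T + 0.4 * p

    # Assumption for this sketch: with no coupling (eta = 0) the two time
    # constants coincide and the formula is 0/0, so report zero noise.
    if math.isclose(tau_fast, tau_slow):
        return 0.1 * p, 0.0

    log_ratio = math.log(tau_fast / tau_slow)
    diff = tau_fast - tau_slow
    t_candidate = (tau_fast * tau_slow * log_ratio
                   + 0.1 * (p * tau_fast - tau_slow)) / diff

    scale = -n / (n + 1)
    if t_candidate >= 0.1 * p:
        # First branch of (3.31) and (3.32).
        v_peak = scale * (
            math.exp(-(tau_slow * log_ratio + 0.1 * (p - 1)) / diff)
            - math.exp(-(tau_fast * log_ratio + 0.1 * (p - 1)) / diff)
        )
        return t_candidate, v_peak

    # Second branch: the peak time is clamped to 0.1 p.
    v_peak = scale * (math.exp(-0.1 * (p - 1) / tau_fast) - 1.0)
    return 0.1 * p, v_peak


t_peak, v_peak = same_direction_noise_peak(n=2, eta=1.0, R_T=0.0, C_T=0.0, C_J=0.0)
print(f"t_p/(RC) = {t_peak:.3f}, v_p/E = {v_peak:.3f}")
```

For $n=2$, $\eta=1$, and $R_{T}=C_{T}=C_{J}=0$, this evaluates to a peak of about $0.40 E$.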

### 3.4. Opposite-Direction Drive

This section handles the case in which adjacent lines are driven from opposite directions, as shown in Fig. 3.10. With the Laplace transformation, (3.5) is rewritten in the $s$-domain as follows.

$$
\left\{\begin{array}{l}
\frac{\partial^{2}\left\{V_{1}(x, s)+n V_{2}(x, s)\right\}}{\partial x^{2}}=r c s\left\{V_{1}(x, s)+n V_{2}(x, s)\right\}  \tag{3.33}\\
\frac{\partial^{2}\left\{V_{1}(x, s)-V_{2}(x, s)\right\}}{\partial x^{2}}=r c p s\left\{V_{1}(x, s)-V_{2}(x, s)\right\}
\end{array} .\right.
$$

The solutions to (3.33) are expressed as follows.

$$
\left\{\begin{array}{rl}
V_{1}(x, s)+n V_{2}(x, s) & =K_{1}^{\prime} \mathrm{e}^{\sqrt{s r c} x}+K_{2}^{\prime} \mathrm{e}^{-\sqrt{s r c} x}  \tag{3.34}\\
V_{1}(x, s)-V_{2}(x, s) & =K_{3}^{\prime} \mathrm{e}^{\sqrt{s r c p} x}+K_{4}^{\prime} \mathrm{e}^{-\sqrt{s r c p} x}
\end{array},\right.
$$

where $K_{1}{ }^{\prime}, K_{2}{ }^{\prime}, K_{3}{ }^{\prime}$, and $K_{4}{ }^{\prime}$ are integration constants. With the linear combination, (3.34) is rewritten as follows.

$$
\left\{\begin{array}{l}
(n+1) V_{1}(x, s)=\left(K_{1}^{\prime} \mathrm{e}^{\sqrt{s r c} x}+K_{2}^{\prime} \mathrm{e}^{-\sqrt{s r c} x}\right)+n\left(K_{3}^{\prime} \mathrm{e}^{\sqrt{s r c p} x}+K_{4}^{\prime} \mathrm{e}^{-\sqrt{s r c p} x}\right)  \tag{3.35}\\
(n+1) V_{2}(x, s)=\left(K_{1}^{\prime} \mathrm{e}^{\sqrt{s r c} x}+K_{2}^{\prime} \mathrm{e}^{-\sqrt{s r c} x}\right)-\left(K_{3}^{\prime} \mathrm{e}^{\sqrt{s r c p} x}+K_{4}^{\prime} \mathrm{e}^{-\sqrt{s r c p} x}\right)
\end{array} .\right.
$$

Finally, the following expressions are the general solutions to (3.33) in the $s$-domain.

$$
\left\{\begin{array}{l}
V_{1}(x, s)=K_{1} \mathrm{e}^{\sqrt{s r c} x}+K_{2} \mathrm{e}^{-\sqrt{s r c} x}+n K_{3} \mathrm{e}^{\sqrt{s r c p} x}+n K_{4} \mathrm{e}^{-\sqrt{s r c p} x}  \tag{3.36}\\
V_{2}(x, s)=K_{1} \mathrm{e}^{\sqrt{s r c} x}+K_{2} \mathrm{e}^{-\sqrt{s r c} x}-K_{3} \mathrm{e}^{\sqrt{s r c p} x}-K_{4} \mathrm{e}^{-\sqrt{s r c p} x}
\end{array}\right.
$$

where integration constants, $K_{1}, K_{2}, K_{3}$, and $K_{4}$ are to be taken from boundary conditions, which in the $t$-domain are as follows.

$$
\left\{\begin{array}{l}
-\left.\frac{1}{r} \cdot \frac{\partial v_{1}(x, t)}{\partial x}\right|_{x=0}=-C_{t} \frac{\partial v_{1}(0, t)}{\partial t}  \tag{3.37}\\
-\left.\frac{1}{r} \cdot \frac{\partial v_{1}(x, t)}{\partial x}\right|_{x=l}=-\frac{E_{1}-v_{1}(l, t)}{R_{t}}+C_{j} \frac{\partial v_{1}(l, t)}{\partial t} \\
-\left.\frac{1}{r} \cdot \frac{\partial v_{2}(x, t)}{\partial x}\right|_{x=0}=\frac{E_{2}-v_{2}(0, t)}{R_{t}}-C_{j} \frac{\partial v_{2}(0, t)}{\partial t} \\
-\left.\frac{1}{r} \cdot \frac{\partial v_{2}(x, t)}{\partial x}\right|_{x=l}=C_{t} \frac{\partial v_{2}(l, t)}{\partial t}
\end{array}\right.
$$

(3.37) can be replaced in the $s$-domain as follows.

$$
\left\{\begin{array}{l}
-\left.\frac{1}{r} \cdot \frac{\partial V_{1}(x, s)}{\partial x}\right|_{x=0}=-s C_{t} V_{1}(0, s)  \tag{3.38}\\
-\left.\frac{1}{r} \cdot \frac{\partial V_{1}(x, s)}{\partial x}\right|_{x=l}=-\frac{E_{1} / s-V_{1}(l, s)}{R_{t}}+s C_{j} V_{1}(l, s) \\
-\left.\frac{1}{r} \cdot \frac{\partial V_{2}(x, s)}{\partial x}\right|_{x=0}=\frac{E_{2} / s-V_{2}(0, s)}{R_{t}}-s C_{j} V_{2}(0, s) \\
-\left.\frac{1}{r} \cdot \frac{\partial V_{2}(x, s)}{\partial x}\right|_{x=l}=s C_{t} V_{2}(l, s)
\end{array}\right.
$$

### 3.4.1. Delay

In order to obtain the delay, we again introduce the moment matching method [3.13]. As shown in Fig. 3.11, we assume that the approximate voltage waveform at the receiving point $v_{1}(0, t)$ has a form of exponential function with a time constant, $\tau_{\text {oppo }}$, and pure delay, $t_{0}$, as follows.

$$
\begin{equation*}
v_{1}(0, t)=E_{1}\left(1-\exp \left[-\left(t-t_{0}\right) / \tau_{\text {oppo }}\right]\right) \tag{3.39}
\end{equation*}
$$

Then, the coefficients of the zero-th order moment, $M_{0}$, and first order moment, $M_{1}$, in the exact solution to (3.36) are supposed to be matched to those in the approximate voltage waveform as follows.

$$
\begin{equation*}
E_{1} / s-s^{0} M_{0}+s^{1} M_{1}+O_{\text {exact }}\left(s^{2}\right) \Leftrightarrow E_{1} / s-s^{0}\left(\tau_{\text {oppo }}+t_{0}\right)+s^{1}\left(\tau_{\text {oppo }}^{2}+\tau_{\text {oppo }} t_{0}+t_{0}^{2} / 2\right)+O_{\text {approx }}\left(s^{2}\right) \tag{3.40}
\end{equation*}
$$

where the left side is the Taylor expansion of $V_{1}$ in (3.36), and the right one is that of the approximate voltage waveform in Fig. 3.11. Thus, the following equation set holds.

$$
\left\{\begin{array}{r}
\tau_{\text {oppo }}+t_{0}=M_{0}  \tag{3.41}\\
\tau_{o p p o}^{2}+\tau_{o p p o} t_{0}+t_{0}^{2} / 2=M_{1}
\end{array} .\right.
$$

The solutions to (3.41) are as follows.

$$
\left\{\begin{aligned}
\tau_{\text {oppo }} & =\sqrt{2 M_{1}-M_{0}^{2}} \\
t_{0} & =M_{0}-\tau_{\text {oppo }}
\end{aligned}\right. \tag{3.42}
$$

Finally, the delay, $t_{p d, o p p o}$, can be expressed as follows.

$$
\begin{equation*}
t_{p d, o p p o}=t_{0}+\ln [2] \tau_{o p p o}=M_{0}-\ln [\mathrm{e} / 2] \sqrt{2 M_{1}-M_{0}^{2}}, \tag{3.43}
\end{equation*}
$$

where $M_{0}$ and $M_{1}$ can be obtained with the Taylor expansion as follows from (3.36) with the boundary conditions, (3.38).

$$
\left\{\begin{array}{rl}
M_{0} /(R C) & =\left[E_{1}\left\{n \eta\left(2 R_{T}+1\right)+2 R_{T} C_{T}+2 R_{T} C_{J}+2 R_{T}+2 C_{T}+1\right\}\right. \\
& \left.-E_{2} n \eta\left(2 R_{T}+1\right)\right] / 2 \\
M_{1} /(R C)^{2} & =\left[E _ { 1 } \left\{n^{2} \eta^{2}\left(24 R_{T}^{2}+20 R_{T}+5\right)\right.\right. \\
& +n \eta^{2}\left(24 R_{T}^{2}+20 R_{T}+3\right)  \tag{3.44}\\
& +2 n \eta\left(24 R_{T}^{2} C_{T}+24 R_{T}^{2}+30 R_{T} C_{T}+20 R_{T}+10 C_{T}+5\right) \\
& \left.+24 R_{T}^{2} C_{T}^{2}+48 R_{T}^{2} C_{T}+48 R_{T} C_{T}^{2}+24 R_{T}^{2}+60 R_{T} C_{T}+24 C_{T}^{2}+20 R_{T}+20 C_{T}+5\right\} \\
& -E_{2}\left\{n^{2} \eta^{2}\left(24 R_{T}^{2}+20 R_{T}+5\right)\right. \\
& +n \eta^{2}\left(24 R_{T}^{2}+20 R_{T}+3\right) \\
& \left.\left.+2 n \eta\left(24 R_{T}^{2} C_{T}+24 R_{T}^{2}+30 R_{T} C_{T}+20 R_{T}+8 C_{T}+4\right)\right\}\right] / 24
\end{array} .\right.
$$
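The moment-matching step in (3.41) to (3.43) can be sketched in Python as follows. This is a minimal illustration, assuming the moments are normalized so that $E_{1}=1$; the function name and the error handling for a negative radicand are choices made for this example.

```python
import math


def delay_from_moments(m0, m1):
    """Return (tau, t0, t_pd) from the moments M0 and M1, per (3.42) and (3.43)."""
    radicand = 2.0 * m1 - m0 ** 2
    if radicand < 0.0:
        # Assumption for this sketch: reject moments with no real time constant.
        raise ValueError("2*M1 - M0**2 must be non-negative")
    tau = math.sqrt(radicand)           # time constant, (3.42)
    t0 = m0 - tau                       # pure delay, (3.42)
    t_pd = t0 + math.log(2.0) * tau     # 50% delay, (3.43)
    return tau, t0, t_pd


# Sanity check with a single-pole response (tau = 1, t0 = 0), whose moments
# are M0 = tau + t0 = 1 and M1 = tau**2 + tau*t0 + t0**2/2 = 1.
tau, t0, t_pd = delay_from_moments(1.0, 1.0)
print(tau, t0, t_pd)
```

For the single-pole check the recovered delay is $\ln 2$ times the time constant, which agrees with the equivalent form $M_{0}-\ln [\mathrm{e} / 2] \sqrt{2 M_{1}-M_{0}^{2}}$ in (3.43).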

Fig. 3.12 compares the delays from (3.39) with the HSPICE simulations when $n=2$, $\eta=1$, and $R_{T}=C_{T}=C_{J}=0$. The delays in the opposite-direction drive case range from $0.25 R C$ to $1.90 R C$ depending on how $E_{2}$ is driven, and the out-of-phase drive gives the worst-case delay, as in the same-direction drive case. Thus, we set $E_{1}=E$ and $E_{2}=-E$, and rewrite (3.44) as follows for the delay estimation.

$$
\left\{\begin{aligned}
M_{0} /(R C) & =E\left\{2 n \eta\left(2 R_{T}+1\right)+2 R_{T} C_{T}+2 R_{T} C_{J}+2 R_{T}+2 C_{T}+1\right\} / 2 \\
M_{1} /(R C)^{2} & =E\left\{2 n^{2} \eta^{2}\left(24 R_{T}^{2}+20 R_{T}+5\right)\right. \\
& +2 n \eta^{2}\left(24 R_{T}^{2}+20 R_{T}+3\right) \\
& +2 n \eta\left(48 R_{T}^{2} C_{T}+48 R_{T}^{2}+60 R_{T} C_{T}+40 R_{T}+18 C_{T}+9\right) \\
& \left.+24 R_{T}^{2} C_{T}^{2}+48 R_{T}^{2} C_{T}+48 R_{T} C_{T}^{2}+24 R_{T}^{2}+60 R_{T} C_{T}+24 C_{T}^{2}+20 R_{T}+20 C_{T}+5\right\} / 24
\end{aligned}\right. \tag{3.45}
$$

Consequently, (3.43) is recalculated as follows.

$$
\begin{align*}
t_{p d, o p p o} /(R C) & =\left\{2 n \eta\left(2 R_{T}+1\right)+2 R_{T} C_{T}+2 R_{T} C_{J}+2 R_{T}+2 C_{T}+1\right\} / 2 \\
& -\frac{\ln [\mathrm{e} / 2]}{\sqrt{6}} \Bigl[-n^{2} \eta^{2}\left(4 R_{T}+1\right) \\
& +n \eta^{2}\left(24 R_{T}^{2}+20 R_{T}+3\right) \\
& +n \eta\left(24 R_{T}^{2} C_{T}+24 R_{T}^{2} C_{J}+24 R_{T}^{2}+24 R_{T} C_{T}+16 R_{T}+6 C_{T}+3\right) \\
& +6 R_{T}^{2} C_{T}^{2}+12 R_{T}^{2} C_{T} C_{J}+6 R_{T}^{2} C_{J}^{2}+12 R_{T}^{2} C_{T}+12 R_{T}^{2} C_{J}+12 R_{T} C_{T}^{2} \\
& +6 R_{T}^{2}+12 R_{T} C_{T}+6 C_{T}^{2}+4 R_{T}+4 C_{T}+1\Bigr]^{1 / 2}  \tag{3.46}\\
& \approx n \eta\left\{\frac{\ln [\mathrm{e} / 2]}{\sqrt{6}} R_{T} C_{T}+\left(2-\frac{5 \ln [\mathrm{e} / 2]}{\sqrt{6}}\right) R_{T}+1-\frac{\sqrt{6} \ln [\mathrm{e} / 2]}{4}\right\} \\
& +\left(1-\frac{2 \ln [\mathrm{e} / 2]}{\sqrt{6}}\right)\left(R_{T} C_{T}+R_{T}+C_{T}\right)+R_{T} C_{J}+\frac{1}{2}-\frac{\ln [\mathrm{e} / 2]}{\sqrt{6}} \\
& =n \eta\left(0.13 R_{T} C_{T}+1.37 R_{T}+0.81\right)+0.75\left(R_{T} C_{T}+R_{T}+C_{T}\right)+R_{T} C_{J}+0.37 .
\end{align*}
$$

However, since (3.46) does not fit the HSPICE simulations well, we again introduce fitting parameters, $b_{1}, b_{2}, b_{3}, b_{4}, b_{5}$ and $b_{6}$, as follows.

$$
\begin{equation*}
t_{p d, o p p o} /(R C)=n \eta\left(b_{1} R_{T} C_{T}+b_{2} R_{T}+b_{3}\right)+b_{4}\left(R_{T} C_{T}+R_{T}+C_{T}\right)+b_{5} R_{T} C_{J}+b_{6} . \tag{3.47}
\end{equation*}
$$

In cases that both $n=1$ and $n=2, b_{1}=0, b_{2}=1.48, b_{3}=0.78, b_{4}=0.75, b_{5}=0.75$, and $b_{6}=0.4$ are optimal with a \%error of $8.1 \%$ at worst, and finally (3.47) is rewritten as follows.

$$
\begin{equation*}
t_{p d, o p p o} /(R C)=n \eta\left(1.48 R_{T}+0.78\right)+0.75\left(R_{T} C_{T}+R_{T} C_{J}+R_{T}+C_{T}\right)+0.4 . \tag{3.48}
\end{equation*}
$$

The worst-case \%error in case that $n=1$ occurs when $\eta=0, R_{T}=10, C_{T}=10$, and $C_{J}=0$ as shown in Fig. 3.13. On the other hand, the worst-case \%error in case that $n=2$ occurs when $\eta=0.1, R_{T}=0.1, C_{T}=0.5$, and $C_{J}=10$ as depicted in Fig. 3.14.
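The fitted delay expression (3.48) is simple enough to evaluate directly. The Python sketch below does so; the function name is hypothetical, and all arguments are the normalized quantities used in the text.

```python
def opposite_direction_delay(n, eta, R_T, C_T, C_J):
    """Normalized worst-case delay t_pd/(RC) from the fitted expression (3.48)."""
    return (n * eta * (1.48 * R_T + 0.78)
            + 0.75 * (R_T * C_T + R_T * C_J + R_T + C_T)
            + 0.4)


# Three-line system (n = 2) with eta = 1 and ideal terminations.
print(opposite_direction_delay(n=2, eta=1.0, R_T=0.0, C_T=0.0, C_J=0.0))
```

For $n=2$, $\eta=1$, and $R_{T}=C_{T}=C_{J}=0$, the expression gives $2 \times 0.78+0.4=1.96$, which is roughly consistent with the $1.90 R C$ upper end of the delay range quoted for Fig. 3.12.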

### 3.4.2. Crosstalk-Noise Amplitude

Unless $R_{t}, C_{t}$, and $C_{j}$ are all zero, the noise peak cannot be solved easily because the analytical expressions become very complicated. The case that $R_{t}=C_{t}=C_{j}=0$, however, gives the worst-case scenario in terms of the noise peak because the coupling effect is mitigated if $R_{t}, C_{t}$, or $C_{j}$ is not zero. Therefore, we treat the case that $R_{t}=C_{t}=C_{j}=0$ first, and then extend it to the general case. The noise peak in the HSPICE simulation is shown in Fig. 3.15 when $n=2, \eta=1$, and $R_{T}=C_{T}=C_{J}=0$, where the amplitude is 0.4, as in the same-direction drive case.

The boundary conditions, (3.38), can be rewritten as follows when $R_{t}=C_{t}=C_{j}=0$.

$$
\left\{\begin{array}{rl}
\left.\frac{\partial V_{1}(x, s)}{\partial x}\right|_{x=0} & =0  \tag{3.49}\\
V_{1}(l, s) & =E_{1} / s \\
V_{2}(0, s) & =E_{2} / s \\
\left.\frac{\partial V_{2}(x, s)}{\partial x}\right|_{x=l} & =0
\end{array} .\right.
$$

(3.36) with the boundary condition, (3.49), yields the following equation set,

$$
\left\{\begin{aligned}
K_{1} \gamma_{1}-K_{2} \gamma_{1}+n K_{3} \gamma_{2}-n K_{4} \gamma_{2} & =0 \\
K_{1} e^{\gamma_{1} l}+K_{2} e^{-\gamma_{1} l}+n K_{3} e^{\gamma_{2} l}+n K_{4} e^{-\gamma_{2} l} & =E_{1} / s \\
K_{1}+K_{2}-K_{3}-K_{4} & =E_{2} / s \\
K_{1} \gamma_{1} e^{\gamma_{1} l}-K_{2} \gamma_{1} e^{-\gamma_{1} l}-K_{3} \gamma_{2} e^{\gamma_{2} l}+K_{4} \gamma_{2} e^{-\gamma_{2} l} & =0
\end{aligned}\right. \tag{3.50}
$$

where $\gamma_{1}=\sqrt{s R C}$ and $\gamma_{2}=\sqrt{s p R C}$.
In the noise-peak estimation, we set $E_{1}=0$ and $E_{2}=E$, and solve (3.50) for $K_{1}, K_{2}, K_{3}$, and $K_{4}$. By substituting them into (3.36), $V_{1}(0, s)$ is obtained as follows.

$$
\begin{equation*}
\frac{V_{1}(0, s)}{E}=-\frac{n}{s} \frac{\left(\gamma_{1}-\gamma_{2}\right)\left(n \gamma_{1}+\gamma_{2}\right) \mathrm{e}^{2\left(\gamma_{1}+\gamma_{2}\right)}+K_{1} \mathrm{e}^{2 \gamma_{1}}+K_{3} \mathrm{e}^{\gamma_{1}+\gamma_{2}}+K_{5} \mathrm{e}^{2 \gamma_{1}}+O_{1}\left(n, \gamma_{1}, \gamma_{2}\right)}{\left(\gamma_{1}+n \gamma_{2}\right)\left(n \gamma_{1}+\gamma_{2}\right) \mathrm{e}^{2\left(\gamma_{1}+\gamma_{2}\right)}+K_{2} \mathrm{e}^{2 \gamma_{1}}+K_{4} \mathrm{e}^{\gamma_{1}+\gamma_{2}}+K_{6} \mathrm{e}^{2 \gamma_{1}}+O_{2}\left(n, \gamma_{1}, \gamma_{2}\right)} . \tag{3.51}
\end{equation*}
$$

The noise peak, $v_{p, \text { oppo }}$, can be calculated with the following initial value theorem of Laplace transformation because $v_{p, \text { oppo }}$ is given when $t=0$ if $R_{t}=C_{t}=C_{j}=0$.

$$
\begin{equation*}
\frac{v_{p, o p p o}}{E}=\frac{v_{1}(0,+0)}{E}=\lim _{s \rightarrow \infty} \frac{s V_{1}(0, s)}{E}=\frac{n \sqrt{p}-n}{n \sqrt{p}+1} \quad\left(\text { exact if } R_{t}=C_{t}=C_{j}=0\right) \tag{3.52}
\end{equation*}
$$

Then, for general cases, we extend (3.52) and introduce fitting parameters, $d_{1}, d_{2}, d_{3}$, and $d_{4}$, to it as follows when $R_{t}, C_{t}$, or $C_{j}$ is not zero.

$$
\begin{equation*}
\frac{v_{p, \text { oppo }}}{E}=\frac{n \sqrt{p}-n}{n \sqrt{p}+1+d_{1} \sqrt{C_{T}}+d_{2} \sqrt{R_{T} C_{J}}} \cdot \frac{\sqrt{R_{T}}+\sqrt{R_{T} C_{T}}+1}{d_{3} \sqrt{R_{T}}+d_{4} \sqrt{R_{T} C_{T}}+1} \tag{3.53}
\end{equation*}
$$

### 3.4.2.1. Case that $\boldsymbol{n}=\mathbf{1}$ (Two-Line System)

Since $d_{1}=2.96, d_{2}=1.05, d_{3}=1.48$, and $d_{4}=0.81$ are optimal, (3.53) is rewritten as follows.

$$
\begin{align*}
\frac{v_{p, o p p o}}{E}= & \frac{\sqrt{2 \eta+1}-1}{\sqrt{2 \eta+1}+1+2.96 \sqrt{C_{T}}+1.05 \sqrt{R_{T} C_{J}}} \cdot \frac{\sqrt{R_{T}}+\sqrt{R_{T} C_{T}}+1}{1.48 \sqrt{R_{T}}+0.81 \sqrt{R_{T} C_{T}}+1} .  \tag{3.54}\\
& (\because p=(n+1) \eta+1=2 \eta+1) .
\end{align*}
$$

The worst-case error is $0.078 E(7.8 \%)$ when $\eta=5, R_{T}=10, C_{T}=0.1$, and $C_{J}=1$ as shown in Fig. 3.16.

### 3.4.2.2. Case that $\boldsymbol{n}=\mathbf{2}$ (Three-Line System)

Since $d_{1}=3.99, d_{2}=1.81, d_{3}=1.14$, and $d_{4}=0.94$ are optimal, (3.53) is rewritten as follows.

$$
\begin{align*}
\frac{v_{p, \text { oppo }}}{E}= & \frac{2 \sqrt{3 \eta+1}-2}{2 \sqrt{3 \eta+1}+1+3.99 \sqrt{C_{T}}+1.81 \sqrt{R_{T} C_{J}}} \cdot \frac{\sqrt{R_{T}}+\sqrt{R_{T} C_{T}}+1}{1.14 \sqrt{R_{T}}+0.94 \sqrt{R_{T} C_{T}}+1} .  \tag{3.55}\\
& (\because p=(n+1) \eta+1=3 \eta+1) .
\end{align*}
$$

The worst-case error is $0.098 E$ ( $9.8 \%$ ) when $\eta=5, R_{T}=10, C_{T}=0.2$, and $C_{J}=1$ as shown in Fig. 3.17.
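The fitted noise-peak expressions (3.54) and (3.55) share the form of (3.53) and differ only in the fitting parameters, so they can be evaluated with one routine. The following Python sketch does this; the function and table names are hypothetical, while the parameter values are those quoted above.

```python
import math

# Fitting parameters (d1, d2, d3, d4) quoted for (3.54) (n = 1) and (3.55) (n = 2).
FIT_PARAMETERS = {
    1: (2.96, 1.05, 1.48, 0.81),
    2: (3.99, 1.81, 1.14, 0.94),
}


def opposite_direction_noise_peak(n, eta, R_T=0.0, C_T=0.0, C_J=0.0):
    """Normalized noise peak v_p/E from (3.53) with the fitted d1..d4."""
    d1, d2, d3, d4 = FIT_PARAMETERS[n]
    p = (n + 1) * eta + 1
    root_p = math.sqrt(p)

    # First factor: the exact result (3.52) with extra damping terms.
    first = (n * root_p - n) / (
        n * root_p + 1 + d1 * math.sqrt(C_T) + d2 * math.sqrt(R_T * C_J)
    )
    # Second factor: correction for non-zero R_T and C_T.
    second = (math.sqrt(R_T) + math.sqrt(R_T * C_T) + 1) / (
        d3 * math.sqrt(R_T) + d4 * math.sqrt(R_T * C_T) + 1
    )
    return first * second


print(opposite_direction_noise_peak(n=2, eta=1.0))
```

With $R_{T}=C_{T}=C_{J}=0$ the expression reduces to (3.52); for $n=2$ and $\eta=1$ it gives $(2 \sqrt{4}-2) /(2 \sqrt{4}+1)=0.4$, the amplitude quoted for Fig. 3.15.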

### 3.5. Summary

Closed-form expressions for delays and crosstalk-noise amplitudes in capacitively coupled two- and three-line systems are introduced with an error of less than $10 \%$ at worst, covering both same- and opposite-direction drive. The junction capacitance of a driver MOSFET is also reflected. The expressions are useful for circuit designers and give insight into coupling-related issues at an early stage of VLSI design.

In summary, we list the expressions and \%errors in TABLE 3.1.

### 3.6. References

[3.2] A. Vittal and M. Marek-Sadowska, "Crosstalk Reduction for VLSI," IEEE Trans. Comp.-Aided Design of Integrated Circ. and Sys., vol. 16, no. 3, pp. 290-298, Mar. 1997.
[3.3] A. Vittal, L. H. Chen, M. Marek-Sadowska, K.-P. Wang and S. Yang, "Crosstalk in VLSI Interconnections," IEEE Trans. Comp.-Aided Design of Integrated Circ. and Sys., vol. 18, no. 12, pp. 1817-1824, Dec. 1999.
[3.4] G. Yee, R. Chandra, V. Ganesan and C. Sechen, "Wire Delay in the Presence of Crosstalk," Proc. ACM/IEEE Int. Workshop on Timing Issues in the Specification and Synthesis of Digital Systems, pp. 170-175, Dec. 1997.
[3.5] D. S. Gao, A. T. Yang and S. M. Kang, "Modeling and Simulation of Interconnection Delays and Crosstalks in High-Speed Integrated Circuits," IEEE Trans. Circ. and Sys., vol. 37, no. 1, pp. 1-9, Jan. 1990.
[3.6] F. Dartu and L. T. Pileggi, "Calculating Worst-Case Gate Delays Due to Dominant Capacitance Coupling," Proc. ACM Design Automation Conf., pp. 46-51, June 1997.
[3.7] J. A. Davis and J. D. Meindl, "Compact Distributed RLC Interconnect Models," IEEE Trans. Elec. Dev., vol. 47, no. 11, pp. 2068-2087, Nov. 2000.
[3.8] D. D. Antono and T. Sakurai, "Transmission Line Models and Overshoots of On-Chip Interconnects," IEICE General Conference, A-3-22, Mar. 2002.
[3.9] D. D. Antono and T. Sakurai, "Inductive and Capacitive Coupling Effects among Deep-Submicron Adjacent Interconnects," JSAP Autumn Meet., 24p-YF-11, Sep. 2002.
[3.10] T. Sakurai, "Closed-Form Expressions for Interconnection Delay, Coupling and Crosstalk in VLSIs," IEEE Trans. Elec. Dev., vol. 40, no. 1, pp. 118-124, Jan. 1993.
[3.11] W. C. Elmore, "The Transient Response of Damped Linear Networks with Particular Regard to Wideband Amplifiers," J. of Applied Physics, vol. 19, pp. 55-63, Jan. 1948.
[3.12] MATLAB home page, http://www.mathworks.com/.
[3.13] L. T. Pillage and R. A. Rohrer, "Asymptotic Waveform Evaluation for Timing Analysis," IEEE Trans. Comp.-Aided Design, vol. 9, no. 4, pp. 352-366, Sep. 1990.


Fig. 3.1. Two distributed RC lines capacitively coupled (two-line system). The $x$-coordinate indicates position along lines. $t$ is time.


Fig. 3.2. Three distributed RC lines capacitively coupled (three-line system).


Fig. 3.3. Same-direction drive. Driving points are at the same ends.


Fig. 3.4. Boundary conditions and Elmore delays for distributed RC lines (a) without $C_{j}$ and (b) with $C_{j}$. The Elmore delays annotated in the figure are (a) $\tau_{\text {Elmore without } C_{j}}=R_{T} C_{T}+R_{T}+C_{T}+0.5$ and (b) $\tau_{\text {Elmore with } C_{j}}=R_{T}\left(C_{T}+C_{J}\right)+R_{T}+C_{T}+0.5$.


Fig. 3.5. Delay comparisons between (3.16) and HSPICE simulations (same-direction drive).


Fig. 3.6. Worst-case \%error in delay (same-direction drive).


Fig. 3.7. Crosstalk-noise comparison between (3.29) and HSPICE simulation (same-direction drive).


Fig. 3.8. Worst-case \%error in crosstalk-noise amplitude ( $n=1$, same-direction drive).


Fig. 3.9. Worst-case \%error in crosstalk-noise amplitude ( $n=2$, same-direction drive).


Fig. 3.10. Opposite-direction drive. Driving points are at the opposite ends.


Fig. 3.11. Approximate voltage waveform at the receiving point.


Fig. 3.12. Delay comparisons between (3.39) and HSPICE simulations (opposite-direction drive).


Fig. 3.13. Worst-case \%error in delay ( $n=1$, opposite-direction drive).


Fig. 3.14. Worst-case \%error in delay ( $n=2$, opposite-direction drive).


Fig. 3.15. Crosstalk noise in HSPICE simulation (opposite-direction drive).


Fig. 3.16. Worst-case $\%$ error in crosstalk-noise amplitude ( $n=1$, opposite-direction drive).


Fig. 3.17. Worst-case $\%$ error in crosstalk-noise amplitude ( $n=2$, opposite-direction drive).

TABLE 3.1. Expressions and \%errors at a glance.

| Drive | $n$ | Delay: Eq. \# | Delay: \%error | Crosstalk-noise amplitude: Eq. \# | Crosstalk-noise amplitude: \%error |
| :---: | :---: | :---: | :---: | :---: | :---: |
| Same-direction drive | $n=1$ | (3.20) |  | (3.31) | $3.3 \%$ |
| Same-direction drive | $n=2$ | (3.27) and (3.28) | $6.9 \%$ | (3.31) | $4.4 \%$ |
| Opposite-direction drive | $n=1$ | (3.48) | $8.1 \%$ | (3.54) | $7.8 \%$ |
| Opposite-direction drive | $n=2$ | (3.48) | $8.1 \%$ | (3.55) | $9.8 \%$ |

In (3.31), $p=(n+1) \eta+1$, $\tau_{\text {fast }}=R_{T}\left(C_{T}+0.7 C_{J}\right)+R_{T}+C_{T}+0.4$, and $\tau_{\text {slow }}=R_{T}\left(C_{T}+0.7 C_{J}\right)+p R_{T}+C_{T}+0.4 p$.

# 4. Leakage-Current Reduction Schemes for Logic Circuits and SRAM Cells: SCCMOS (Super-Cutoff CMOS) and DLC (Dynamic Leakage Cutoff) SRAM 

### 4.1. Introduction

Recently, low-power and high-performance features have been pursued extensively in CMOS VLSI designs to meet the increasing needs of portable multimedia applications and to overcome the heat crisis in high-end processors. Since the power consumption of CMOS logic circuits depends quadratically on the supply voltage, $V_{D D}$, lowering $V_{D D}$ is effective, and thus CMOS process technologies have been optimized using thinner gate oxide and shorter channel length.

If logic circuits are operated at a $V_{D D}$ of less than 1 V, for instance in a range from 0.5 to 0.8 V, the threshold voltage, $V_{T H}$, of the MOSFETs in the logic circuits should be $0.1-0.2 \mathrm{~V}$ in order to obtain ns-order delay. Such a low $V_{T H}$, however, causes a 10-nA-order leakage per logic gate in a standby mode, which results in 10 mA for 1 M logic gates. This prevents VLSIs from being applied to portable equipment powered by a small battery. In order to overcome this problem, we propose the SCCMOS (super cutoff CMOS) scheme in the next section. The SCCMOS enables operation under 1 V with a $V_{T H}$ of $0.1-0.2 \mathrm{~V}$ and at the same time realizes a pA-order standby current per logic gate.
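The chip-level standby current follows from multiplying the per-gate leakage by the gate count. The short Python sketch below reproduces the arithmetic; the 1 pA figure is used here only as a representative value for "pA-order", and the function name is hypothetical.

```python
def total_standby_current(per_gate_amps, gate_count):
    """Total standby current in amperes for identical logic gates."""
    return per_gate_amps * gate_count


GATES = 1_000_000  # 1 M logic gates

low_vth = total_standby_current(10e-9, GATES)   # 10 nA per gate, low V_TH without cutoff
sccmos = total_standby_current(1e-12, GATES)    # 1 pA per gate, assumed pA-order value

print(f"low-V_TH logic: {low_vth * 1e3:.1f} mA")
print(f"with SCCMOS:    {sccmos * 1e6:.1f} uA")
```

Under these assumptions the standby current drops from the 10 mA level to the microampere level for the same 1 M gates.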

Low-voltage operation is also important in future SRAMs, where scaled MOSFETs need to be operated in low-voltage environments with sufficient reliability. In Section 4.3, a sub-volt SRAM-circuit scheme called the DLC (dynamic leakage cutoff) SRAM is presented, which speeds up the conventional low-voltage SRAM by more than a factor of two without applying an excessive voltage to the gate oxide, while keeping the subthreshold leakage current at a tolerable level.

### 4.2. SCCMOS (Super-Cutoff CMOS)

### 4.2.1. Concept

Fig. 4.1 shows the concept of SCCMOS for the case in which a pMOSFET is inserted as a cutoff switch. This case is explained and verified by experiments in this section, since a p-type substrate is widely used and suitable. With a p-type substrate, the well voltages of the pMOSFETs in the logic circuits and of the cutoff pMOSFET can be different because the two wells can be electrically isolated. Consequently, the pMOSFET backgates in the logic circuits may be connected to a virtual-$V_{D D}$ line, $V_{D D V}$, which does not require another line for the pMOSFET backgate bias. This, in turn, means that a $\mathrm{V}_{\mathrm{DD}}$ line in existing cell libraries can be used as a $\mathrm{V}_{\mathrm{DDV}}$ line, and layout modification to the cell libraries can be minimized. Alternatively, the pMOSFETs in the logic circuits can share a well with the cutoff pMOSFET; in this case, however, an extra virtual- $\mathrm{V}_{\mathrm{DD}}$ line must be added to the cell libraries, and the modification is time-consuming. An nMOSFET-insertion case is also possible by adding an extra virtual-ground line to the cell libraries, but this also makes the area overhead large, and thus the implementation is difficult.

In Fig. 4.1, the low- $\mathrm{V}_{\text {TH }}$ cutoff pMOSFET, M1, whose $V_{T H}$ is $0.1-0.2 \mathrm{~V}$ is inserted in series to a logic circuit consisting of low- $\mathrm{V}_{\text {TH }}$ MOSFETs. The low $V_{T H}$ assures high-speed operation of the logic circuits. A gate voltage of M1, $V_{G}$, is grounded to turn M1 on in an active mode. In a standby mode, $V_{G}$ is overdriven to $V_{D D}+0.4 \mathrm{~V}$ to completely cut off a leakage current since the low $V_{T H}$ of $0.1-0.2 \mathrm{~V}$ is lower by 0.4 V than the conventional high $V_{T H}(0.5-0.6 \mathrm{~V})$, and this overdriven mechanism can sustain the standby current on the same level. If $V_{T H}$ is further lower than $0.1-0.2 \mathrm{~V}$ or negative, $V_{G}$ should be reduced as low as there is no problem with gate-oxide reliability or GIDL (gate-induced drain leakage) current [4.1]. On the other hand, in the nMOSFET-insertion case, $V_{D D}$ is applied to the gate of the cutoff nMOSFET in an active mode, and the gate is overdriven to -0.4 V in a standby mode.
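The 0.4-V overdrive argument can be illustrated with a first-order exponential subthreshold model, in which the off-state current falls by one decade for every $S$ volts of additional gate overdrive. The Python sketch below is only an illustration: the subthreshold swing of 100 mV/decade is an assumed value, not one reported in the text, and the function name is hypothetical.

```python
def leakage_reduction_factor(overdrive_v, swing_v_per_decade=0.1):
    """Factor by which subthreshold leakage drops for a given gate overdrive.

    Uses I_off proportional to 10**(-overdrive / S); S = 0.1 V/decade is an
    assumption made for this example.
    """
    return 10.0 ** (overdrive_v / swing_v_per_decade)


factor = leakage_reduction_factor(0.4)      # overdrive of V_DD + 0.4 V on the pMOSFET gate
per_gate_leakage = 10e-9 / factor           # starting from a 10-nA-order leakage per gate
print(f"reduction: {factor:.0f}x, leakage per gate: {per_gate_leakage * 1e12:.1f} pA")
```

Under this assumed swing, a 0.4-V overdrive corresponds to about four decades of leakage reduction, which is in line with the step from a 10-nA-order to a pA-order current per gate.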

The gate-bias generator for $V_{G}$ can be made relatively simple, without any feedback, as shown in Fig. 4.1, since a precise voltage of $V_{G}$ is not necessary unless the gate-oxide reliability or GIDL current becomes an issue. Moreover, since high-speed control is not necessary when the logic circuit enters a standby mode, $V_{G}$ can be overdriven slowly. Therefore, the pumping frequency can be low, and the power consumed by the pumping circuit is low, too.

Fig. 4.2 shows a technique to mitigate the voltage across the gate oxide of a cutoff pMOSFET when gate-oxide reliability is an issue. In the standby mode, $V_{D D V}$ drops to the ground due to a large leakage current of the low- $\mathrm{V}_{\mathrm{TH}}$ MOSFETs in the logic circuits. This may cause a gate-oxide reliability problem in the cutoff pMOSFET when thin gate oxide is used. For instance, assume that 1.2 V is applied across the gate oxide of the cutoff pMOSFET in the standby mode at a $0.8-\mathrm{V} V_{D D}$ as shown in the figure. In this case, connecting two cutoff pMOSFETs in series prevents the gate oxide from breaking down, since they work in a subthreshold region where the drain current depends strongly on $V_{G S}$, not $V_{D S}$. The drain voltage of M1, $V_{D}$, becomes 0.4 V so that the same amount of current flows through both of them if their gate widths are all the same. This combination can reduce the maximum voltage across the gate oxide from 1.2 V to 0.8 V.

### 4.2.2. Comparison with Other Schemes

There are a couple of schemes that have been reported achieving high speed at a low voltage, and at the same time reducing a leakage current in a standby mode.

### 4.2.2.1. MTCMOS (Multithreshold-Voltage CMOS)

The MTCMOS (multithreshold-voltage CMOS) scheme uses a high- $\mathrm{V}_{\mathrm{TH}}$ MOSFET as a cutoff switch in series with low- $\mathrm{V}_{\mathrm{TH}}$ logic circuits in order to cut off a leakage current in a standby mode [4.2]. The MTCMOS cannot be used below a $0.6-\mathrm{V} V_{D D}$ because the high- $\mathrm{V}_{\mathrm{TH}}$ cutoff MOSFETs do not turn on.

### 4.2.2.2. VTCMOS (Variable-Threshold CMOS)

Another scheme, named VTCMOS (variable-threshold CMOS), applies biases to the backgates of MOSFETs in logic circuits to cut off a leakage current in a standby mode, which exploits the body effect [4.3]-[4.4]. This scheme cannot be applied to a fully depleted SOI process. It is also difficult to apply to a partially depleted SOI process due to the area overhead required to apply the backgate biases. Another drawback is that the VTCMOS requires modification to cell libraries to separate backgate-bias lines from the $\mathrm{V}_{\mathrm{DD}}$ and ground lines.

### 4.2.2.3. DTMOS (Dynamic-Threshold MOS)

The DTMOS (dynamic-threshold MOS) scheme ties a gate and backgate of a MOSFET together and thus, changes $V_{T H}$ of the MOSFET so that $V_{T H}$ is high in an off state and low in an on state [4.5]. The DTMOS, however, suffers from a $10-\mathrm{mA}$-order leakage current at a $V_{D D}$ of $0.5-0.7 \mathrm{~V}$ per 1-M logic gates because of an inherent forward-bias current of a source-backgate junction of the MOSFET. By combining the SCCMOS and DTMOS, the leakage current in a standby mode can be reduced while high speed of the DTMOS can be enjoyed in an active mode. For this purpose, the VTCMOS cannot be used with the DTMOS, in which a backgate is always fixed to a gate.

### 4.2.3. Measurement Results

A test chip was fabricated in a $0.3-\mu \mathrm{m}$ triple-metal CMOS process, whose $V_{T H}$ is 0.2 V for both pMOSFETs and nMOSFETs, to demonstrate the effectiveness of the SCCMOS. A micrograph of the test chip is shown in Fig. 4.3. The area of the gate-bias generator is $100 \times 100 \mu \mathrm{~m}^{2}$. The current consumption of the gate-bias generator is $0.1 \mu \mathrm{~A}$ at a $0.5-\mathrm{V} V_{D D}$ when the pumping frequency is set to 10 kHz. Delays and standby currents of inverters, 2NANDs, flip-flops and pass-transistor logic gates were measured by means of ring oscillators that have 101 stages for each circuit.

### 4.2.3.1. Inverter and 2NAND

The measured speed characteristics of the inverters and 2NANDs with fanouts of three are shown in Fig. 4.4 with circles and crosses, respectively. The simulated delay characteristics using HSPICE are shown with lines as well. Gate widths in the logic gates are all $2.4 \mu \mathrm{~m}$ so that the total logic gate width is $484.8 \mu \mathrm{~m}$ for the 101 inverters and $969.4 \mu \mathrm{~m}$ for the 101 2NANDs. On the other hand, the gate width of the cutoff pMOSFET is $10 \mu \mathrm{~m}$. The SCCMOS pushes the low-voltage operation limit of the logic gates 0.2 V beyond that of the MTCMOS. In addition, the SCCMOS operates at almost the same speed as the "no cutoff MOSFET" case; namely, the $10-\mu \mathrm{m}$ width is sufficiently large for a cutoff pMOSFET in this measurement. The measured standby current is below 1 pA per logic gate. The active energy consumption of a 2NAND with fanouts of three is 8 fJ per switching.

Fig. 4.5 shows the simulated delay dependency on the gate width of the cutoff pMOSFET(s), $W_{\text {switch }}$, for both the single connection and the two-serial connection. The speed degradation is $4.6 \%$ for the inverter and $8.6 \%$ for the 2NAND in the single cutoff-pMOSFET case. For the two-serial connection, a doubled width is needed to achieve the same speed as the single cutoff-pMOSFET case, which means that the area overhead is four times as large as in the single-connection case.

### 4.2.3.2. Flip-Flop Keeping Information in Standby Mode

When logic circuits are in a standby mode and a cutoff pMOSFET turns off, $V_{D D V}$ drops almost to the ground due to a large leakage current of the low- $\mathrm{V}_{\mathrm{TH}}$ logic circuits. Consequently, flip-flops in the low- $\mathrm{V}_{\mathrm{TH}}$ logic circuits lose stored information in the standby mode. This is fatal in certain applications, and one way to solve this problem at a system level is to send all information stored in the flip-flops to external memories, for instance, with scan-path flip-flops before entering the standby mode, and then to restore the information back into the flip-flops in resume operation.

When this solution at the system level is not preferable, the flip-flop in Fig. 4.6 can be used in the SCCMOS. A current-latch flip-flop is made of the low- $\mathrm{V}_{\mathrm{TH}}$ MOSFETs for high speed with a cutoff pMOSFET, and an SRAM cell composed of high- $\mathrm{V}_{\mathrm{TH}}$ MOSFETs is added to the flip-flop to suppress a leakage current in the standby mode. The source voltage of the SRAM cell is -0.5 V to obtain strong drive in resume operation. Namely, the substantial supply voltage is equivalent to 1 V. If the driving capability of the SRAM cell is low, the SRAM cell cannot write the stored information back into the output nodes of the cross-coupled 2NORs, $Q$ and $/ Q$, and the stored information of the SRAM cell is instead overwritten. The waveforms of the flip-flop are shown in Fig. 4.7. Before entering a standby mode, $/ W L$ is first asserted, and $Q$ and $/ Q$ are stored into the nodes N1 and N2. In the standby mode, $Q$ and $/ Q$ are almost at the ground level due to the large leakage current of the low- $\mathrm{V}_{\mathrm{TH}}$ logic circuits. N1 and N2, however, keep the right information. In resume operation, $/ W L$ is asserted again, and the stored information is written back into $Q$ and $/ Q$.

Fig. 4.8 shows the measured delay characteristics of the flip-flops. To measure the flip-flop delay, the edge-trigger pulse generator shown in Fig. 4.9 is used. First, the delay of the flip-flops together with the edge-trigger pulse generators is measured, and then the delay of the edge-trigger pulse generators alone is subtracted from it to obtain the genuine flip-flop delay. Only in this measurement, $-0.5$ V is applied to the p-type substrate to prevent the p/n junctions in the SRAM cells from being forward-biased. Since the process is not a triple-well technology, the $-0.5$-V substrate bias increases the $V_{TH}$ of all nMOSFETs from 0.2 V to 0.3 V. This is why the flip-flops are slow in this experiment. With a triple-well technology, the flip-flop delay decreases to about three times the delay of an inverter with a fanout of three.

### 4.2.3.3. PTL (Pass-Transistor Logic) Gate

A test circuit of PTL (pass-transistor logic) gates that can achieve high area efficiency was fabricated by means of a gate-array structure. The layout and circuit schematics are shown in Fig. 4.10. This gate-array structure is optimized for a single-rail PTL and is simpler than a previously published PTL gate-array structure [4.6]. One basic cell is composed of a pMOSFET and two nMOSFETs. The gate width of the pMOSFET is $0.96 \mu \mathrm{m}$ and those of the nMOSFETs are $4.8 \mu \mathrm{m}$ and $1.2 \mu \mathrm{m}$, which are the optimized sizes for an SRAM cell. Therefore, an SRAM cell can be mapped onto this gate array with two basic cells. A small pull-up pMOSFET with feedback restores the $V_{TH}$ voltage drop caused by a series of nMOSFET transfer gates to a full swing. Fig. 4.11 shows the measured delay characteristics of the single-rail PTL gates with the SCCMOS. Operation at a 1-V $V_{DD}$ is verified, but 0.5-V operation is questionable with the PTL gates because of the inherent voltage drop of $V_{TH}$. Nevertheless, it can be said that the SCCMOS does not degrade the speed of CMOS-logic gates and PTL gates while the leakage current is kept below 1 pA per gate in a standby mode.

### 4.3. DLC (Dynamic Leakage-Cutoff) SRAM

According to the ITRS prediction [4.7], $90\%$ of the area of an SoC (system on a chip) will be occupied by memories in 2013, as shown in Fig. 4.12, and a considerably high leakage current flows through the memory. Since SRAMs apparently play a large part in memory even in a future SoC, it is very important to cut off their leakage current. However, an existing leakage-cutoff scheme such as the MTCMOS cannot simply be applied to SRAMs because the information stored in SRAMs is lost if a power line is cut off. Thus, other low-voltage SRAM schemes have been proposed, including the OSD (offset-source driving) scheme [4.8] and the BSN (boosted storage-node) scheme [4.9], as shown in Fig. 4.13. In these schemes, however, the gate voltages of MOSFETs exceed the supply voltage, which gives rise to reliability issues in some cases.

In the OSD scheme, when an SRAM cell is not selected, the effective supply voltage applied to the SRAM cell is 0.8 V because the source voltage of the SRAM cell, $V_{\text{SOURCE}}$, is 0.6 V. When it is selected, however, $V_{\text{SOURCE}}$ is pulled down to the ground. In this situation, the gate-source voltage of the hatched MOSFETs goes up to 1.4 V, and gate-oxide reliability cannot be assured when the MOSFETs are optimized for 0.8-V operation.

On the other hand, a 1.4-V supply voltage is applied to an SRAM cell in the BSN scheme even though peripheral circuits are operated at 0.8 V . Again, when the MOSFETs in the memory cell are optimized for $0.8-\mathrm{V}$ operation, it is not possible to assure the gate-oxide reliability.

As a result, both schemes suffer from the gate-oxide reliability issue since a voltage higher than the supply voltage of the peripheral circuits is applied to the SRAM cells to gain high-speed operation.

### 4.3.1. Circuits

Fig. 4.14 shows the schematic of the proposed DLC (dynamic leakage-cutoff) SRAM with its operation waveforms. The salient feature of the DLC SRAM is that the n- and p-well biases, $V_{NWELL}$ and $V_{PWELL}$, are dynamically changed in synchronization with a wordline signal for selected SRAM cells. This scheme differs from the VTCMOS, in which well biases synchronize with a standby signal. It should be noted that a triple-well process is required to realize the DLC SRAM, but a triple-well process is preferable anyway for SoCs, in which analog circuits and memories are embedded in digital circuit environments and electrical isolation is an issue.

As illustrated in Fig. 4.14 (a), each n-well and p-well bias driver drives two adjacent rows at the same time so that the number of drivers is halved. Unlike the OSD and BSN schemes, the DLC scheme requires only a single $V_{TH}$ and does not need multiple $V_{TH}$s.

Fig. 4.14 (b) shows the operation waveforms of $V_{NWELL}$ and $V_{PWELL}$, which are dynamically changed depending on whether the SRAM cells are selected or dormant. For selected SRAM cells, $V_{NWELL}$ and $V_{PWELL}$ are at zero bias. In contrast, they are kept at $2V_{DD}$ and $-V_{DD}$, respectively, in a dormant state, which means negative biases. By doing so, the threshold voltages of the p- and nMOSFETs in the selected SRAM cells become relatively low through the body effect and assure a large drive, which in turn achieves fast operation. On the other hand, the threshold voltages in dormant SRAM cells are relatively high, which suppresses the subthreshold leakage current. The n- and p-well bias drivers are controlled by row decoders in order to synchronize with a wordline signal.
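
As a rough illustration of the mechanism, the textbook first-order body-effect relation for an nMOSFET is sketched below. It is not taken from this work: the zero-bias threshold voltage $V_{TH0}$, the body-effect coefficient $\gamma$, and the Fermi potential $\phi_F$ are symbols introduced here for illustration only, and the source-to-body voltage $V_{SB}$ is assumed to be 0 for a selected cell and $V_{DD}$ for a dormant cell (p-well at $-V_{DD}$ with the source at the ground).

$$
V_{TH}=V_{TH0}+\gamma\left(\sqrt{2 \phi_F+V_{SB}}-\sqrt{2 \phi_F}\right)
$$

Under the same assumptions, the pMOSFET case is analogous, with the n-well raised from $V_{DD}$ to $2V_{DD}$ in the dormant state.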

### 4.3.2. Well-Bias Drivers

Fig. 4.15 (a) and (b) show circuit schematics of the n- and p-well bias drivers, respectively. Fig. 4.15 (c) plots the corresponding $\mathrm{V}_{\mathrm{GS}}-\mathrm{V}_{\mathrm{GD}}$ trajectories of all MOSFETs in the n-well bias driver in dynamic operation. It is seen that no MOSFET in the well-bias driver experiences a voltage over $V_{DD}$ across its gate oxide, which assures sufficient reliability.

The n-well bias driver draws a leakage current when $V_{in}$ is "H", and the p-well bias driver draws a leakage current when $/V_{in}$ is "L". However, there is no leakage current when $V_{in}$ and $/V_{in}$ are in the opposite states, which means that the leakage current of the well-bias drivers is not an issue because only one n-well bias driver and one p-well bias driver draw leakage currents.

### 4.3.3. Design Considerations

### 4.3.3.1. Leakage Current

Fig. 4.16 shows simulated leakage characteristics of a 1-Mb SRAM in a $0.35$-$\mu \mathrm{m}$ low-$\mathrm{V}_{\mathrm{TH}}$ process, with which a test chip was designed and fabricated. A zero $V_{TH}$ is desirable from a delay point of view. With a zero $V_{TH}$, however, the total subthreshold leakage current, $I_{LEAK}$, goes up to 200 mA, which should be compared with a dynamic current of 5 mA at 100 MHz. Therefore, the subthreshold leakage current is dominant in the SRAM even in an active mode. If more than 1 Mb is necessary, the situation gets worse because the dynamic current does not increase much while the leakage current increases linearly.

By using the DLC scheme with the employed technology, the $V_{TH}$ can be shifted by 0.14 V at a $V_{DD}$ of 0.5 V when the SRAM cell becomes dormant, and by 0.25 V at a $V_{DD}$ of 1 V. In other words, although the threshold voltage in the selected SRAM cells is 0 V, the leakage current is decreased to the level at which $V_{TH}$ is set to 0.14 V or 0.25 V, as shown in Fig. 4.16. With a 1-V $V_{DD}$, the total leakage current is suppressed from 200 mA to 0.9 mA.
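
To put the quoted numbers in perspective, the following minimal sketch assumes a purely exponential subthreshold model, $I = I_0 \cdot 10^{-\Delta V_{TH}/S}$, and computes the effective swing $S$ that the drop from 200 mA to 0.9 mA for a 0.25-V shift would imply. The model and the function names are illustrative assumptions, not the simulation behind Fig. 4.16.

```python
import math


def implied_swing(i_high_ma, i_low_ma, delta_vth_v):
    """Effective subthreshold swing (V/decade) implied by a leakage drop.

    Assumes the simple model I = I0 * 10 ** (-delta_vth / swing).
    """
    decades = math.log10(i_high_ma / i_low_ma)
    return delta_vth_v / decades


def leakage_after_shift(i0_ma, delta_vth_v, swing_v_per_decade):
    """Leakage (mA) after raising V_TH by delta_vth_v under the same model."""
    return i0_ma * 10.0 ** (-delta_vth_v / swing_v_per_decade)


if __name__ == "__main__":
    # Values quoted in the text for a 1-V V_DD: 200 mA -> 0.9 mA with a 0.25-V shift.
    swing = implied_swing(200.0, 0.9, 0.25)
    print(f"implied effective swing: {swing * 1e3:.1f} mV/decade")
    print(f"round trip: {leakage_after_shift(200.0, 0.25, swing):.2f} mA")
```

Under this simplified model the quoted figures correspond to an effective swing of roughly 0.1 V per decade. The model ignores any dependence of the leakage on $V_{DD}$, so it should not be used to extrapolate to the 0.5-V case.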

### 4.3.3.2. Bitline Delay

Fig. 4.17 shows simulated bitline-delay characteristics. Since the $V_{TH}$ is shifted to a higher value with the DLC scheme, the original $V_{TH}$ before shifting can be set lower. For instance, if the $V_{TH}$ is set to 0 V, the bitline delay can be reduced by a factor of 2.5 at a 0.5-V $V_{DD}$ while the subthreshold leakage current is kept at 0.9 mA.

### 4.3.3.3. Cell Area

The area overhead per DLC-SRAM cell is $27\%$ as shown in Fig. 4.18. In addition to the SRAM cells, the DLC SRAM has another area overhead for the well-bias drivers. If the SRAM cells are assumed to occupy about $70\%$ of the area of the conventional SRAM, the overall area overhead of the DLC SRAM lies between $20\%$ and $50\%$ as shown in Fig. 4.19, depending on the number of selected SRAM cells. The area overhead tends to decrease as the number of selected bits increases because the number of well-bias drivers decreases. A deep-trench isolation technology can reduce the overhead by $10\%$ since the major cause of the area overhead is the well-separation rule.

Compared with the DTMOS scheme, in which a gate and a well are tied together and the well-bias change is limited to 0.7 V, the DLC scheme allows a well-bias change of more than 2 V. This leads to a larger shift in $V_{TH}$, which in turn achieves a higher speed at the same leakage current.

### 4.3.4. Measurement Results

Fig. 4.20 shows a micrograph of a test chip fabricated with the $0.35$-$\mu \mathrm{m}$ CMOS process. The area overhead of the well-bias drivers and the DLC-SRAM cells is observed to be $30\%$. The $V_{TH}$s of the pMOSFETs and nMOSFETs are both 0.15 V because of limited accessibility to a low-$\mathrm{V}_{\mathrm{TH}}$ process. The 0.15-V $V_{TH}$ is not ideal from a speed and leakage point of view, but we carried out an important measurement that cannot be simulated by a circuit simulator, namely a well-disturbance test.

In the DLC SRAM, the well biases are dynamically changed, and an unexpected flip in the SRAM cells may occur. Fig. 4.21 shows the results of the well-disturbance test, in which the well-bias pulse frequency is 100 MHz. No abnormal flip in the DLC-SRAM cells was observed for a $V_{DD}$ from 0.5 V through 1 V and a well-bias amplitude, applied as a disturbance, from 0.5 V through 2 V.

### 4.4. Summary

In this chapter, two leakage-current reduction techniques for logic circuits and SRAM cells were introduced.

The SCCMOS was proposed to enable CMOS-logic circuits to work below a 0.5-V $V_{DD}$ without speed degradation. Although the SCCMOS adopts a low-$\mathrm{V}_{\mathrm{TH}}$ cutoff switch, the standby current is on the order of 1 pA per logic gate because the gate of the switch is overdriven. The SCCMOS can be effectively combined with an SOI technology, a DTMOS structure, and/or PTL gates, and thus it is promising for future technologies that are optimized for low-power operation.

The DLC SRAM can speed up the conventional low-voltage SRAM by a factor of 2.5 while the subthreshold leakage current is kept at a tolerable level by using the body effect. The DLC scheme does not apply an excessive voltage across the gate oxide. The area overhead is about $30\%$ if the data width is 32 b. This overhead can be reduced by $10\%$ by introducing a deep-trench isolation technology. The DLC SRAM is resilient against disturbance from the well biases.

### 4.5. References

[4.1] T. Y. Chan, J. Chen, P. K. Ko and C. Hu, "The Impact of Gate-Induced Drain Leakage Current on MOSFET Scaling," IEEE Int. Elec. Dev. Meet. Dig. Tech. Papers, pp. 718-721, Dec. 1987.
[4.2] S. Mutoh, T. Douseki, Y. Matsuya, T. Aoki, S. Shigematsu and J. Yamada, "1-V Power Supply High-Speed Digital Circuit Technology with Multithreshold-Voltage CMOS," IEEE J. Solid-State Circ., vol. 30, no. 8, pp. 847-854, Aug. 1995.
[4.3] K. Seta, H. Hara, T. Kuroda, M. Kakumu and T. Sakurai, " $50 \%$ Active Power Saving without Speed Degradation Using Stand-by Power Reduction (SPR) Circuit," IEEE Int. Solid-State Circ. Conf. Dig. Tech. Papers, pp. 318-319, Feb. 1995.
[4.4] T. Kuroda, T. Fujita, S. Mita, T. Nagamatsu, S. Yoshida, F. Sano, M. Norishima, M. Murota, M. Kato, M. Kinugawa, M. Kakumu and T. Sakurai, "A 0.9V 150MHz $10 \mathrm{~mW} 4 \mathrm{~mm}^{2}$ 2-D Discrete Cosine Transform Core Processor with Variable-Threshold-Voltage Scheme," IEEE Int. Solid-State Circ. Conf. Dig. Tech. Papers, pp. 166-167, Feb. 1996.
[4.5] T. Fuse, Y. Oowaki, T. Yamada, M. Kamoshida, M. Ohta, T. Shino, S. Kawanaka, M. Terauchi, T. Yoshida, G. Matsubara, S. Yoshida, S. Watanabe, M. Yoshimi, K. Ohuchi and S. Manabe, "A 0.5V 200MHz 1-Stage 32b ALU Using a Body Bias Controlled SOI Pass-Gate Logic," IEEE Int. Solid-State Circ. Conf. Dig. Tech. Papers, pp. 286-287, Feb. 1997.
[4.6] Y. Sasaki, K. Yano, M. Hiraki, K. Rikino, M. Miyamoto, T. Matsuura, T. Nishida and K. Seki, "Pass Transistor Based Gate Array Architecture," IEEE/JSAP Symp. VLSI Circ. Dig. Tech. Papers, pp. 123-124, June 1995.
[4.7] International Technology Roadmap for Semiconductors public home page, http://public.itrs.net/.
[4.8] H. Yamauchi, T. Iwata, H. Akamatsu, and A. Matsuzawa, "A 0.8V / 100Mhz / sub-5mW-Operated Mega-bit SRAM Cell Architecture with Charge-Recycle Offset-Source Driving (OSD) Scheme," IEEE/JSAP Symp. VLSI Circ. Dig. Tech. Papers, pp. 126-127, June 1996.
[4.9] K. Itoh, A. R. Fridi, A. Bellaouar, and M. I. Elmasry, "A Deep Sub-V, Single Power-Supply SRAM Cell with Multi-Vt, Boosted Storage Node and Dynamic Load," IEEE/JSAP Symp. VLSI Circ. Dig. Tech. Papers, pp. 132-133, June 1996.


Fig. 4.1. A concept of SCCMOS.


Fig. 4.2. Mitigation of a voltage across gate oxide of a cutoff pMOSFET.


Fig. 4.3. A micrograph of a test chip.


Fig. 4.4. Measured delays of inverters and 2NANDs.


Fig. 4.5. Simulated delay dependencies on $W_{\text {switch }}$.


Fig. 4.6. A flip-flop with SCCMOS.


Fig. 4.7. Operation waveforms of a flip-flop with SCCMOS.


Fig. 4.8. Measured delays of flip-flops with SCCMOS.


Fig. 4.9. How to measure a delay of a flip-flop with SCCMOS.


Fig. 4.10. A PTL gate array with SCCMOS.


Fig. 4.11. Measured delays of PTL gates with SCCMOS.


Fig. 4.12. ITRS prediction for memory area in an SoC.


Fig. 4.13. (a) The OSD (offset-source driving) scheme, and (b) BSN (boosted storage-node) scheme.

Fig. 4.14. (a) The DLC (dynamic leakage-cutoff) scheme, and (b) its operation waveforms.

Fig. 4.15. (a) An n-well, and (b) p-well bias drivers. (c) $\mathrm{V}_{\mathrm{GS}}-\mathrm{V}_{\mathrm{GD}}$ trajectories in (a). No trajectories go beyond a region of $V_{D D}$.


Fig. 4.16. A total subthreshold leakage current in a 1-Mb SRAM.


Fig. 4.17. Bitline-delay characteristics.


Fig. 4.18. Layout examples of the (a) conventional, and (b) DLC SRAM cell.


Fig. 4.19. An area overhead in the DLC SRAM when the number of bits selected at a time is changed.


Fig. 4.20. A chip micrograph of the DLC SRAM. "SCs" signify SRAM cells.


Fig. 4.21. Well-disturbance test. "P" indicates a pass.

# 5. VDD Hopping with Off-the-Shelf Processors for Multimedia Applications and Its Extension to $\mu$ ITRON-LP 

### 5.1. Introduction

For multimedia mobile systems powered by a small battery, such as a 3G cell phone, power-efficient design that achieves both low power and high speed is required. To save hardware power, several concepts have been proposed that dynamically provide an optimum fine-grained $V_{DD}$ (supply voltage) and $f$ (clock frequency) to the hardware [5.1]-[5.6]; these are called DVS (dynamic voltage scaling). However, redesign is required to implement them because $V_{DD}$ and $f$ are controlled with a model of a critical path and hardware feedback. Consequently, it is difficult to apply DVS to an off-the-shelf processor sold on the market.

Alternatively, Crusoe adopts software power management called LongRun [5.7]-[5.8], which relies on its workload history. LongRun therefore works well in a PC environment but is not suitable for embedded systems since it can neither reduce power by making use of the data-dependent nature of multimedia applications nor guarantee a real-time feature.

The next section presents a chip that externally provides $V_{D D}$ and $f$ to an off-the-shelf processor, and realizes DVS for a multimedia application with a concept of run-time voltage hopping [5.9]-[5.10]. The novel DVS system is hereafter called $\mathrm{V}_{\mathrm{DD}}$ hopping. As processor performance improves, effective power management of a system is increasingly achieved through software [5.11]-[5.16], and $\mathrm{V}_{\mathrm{DD}}$ hopping is a kind of software approach to save power consumed by a multimedia application.

In Section 5.3, with the help of not only $\mathrm{V}_{\mathrm{DD}}$ hopping but also an RTOS (real-time operating system), DVS is embedded in a real-time multitask system. The cooperative design of applications and an RTOS is more efficient than the case in which only applications perform $\mathrm{V}_{\mathrm{DD}}$ hopping. We call this cooperative design CVS (cooperative voltage scaling); it encompasses interaction among an RTOS, applications, and hardware, and saves power consumed by a processor.

### 5.2. $V_{\text{DD}}$ Hopping

Fig. 5.1 shows a conceptual diagram of $\mathrm{V}_{\mathrm{DD}}$ hopping. A multimedia application calculates the workload of a task and then sends speed information to external $\mathrm{V}_{\mathrm{DD}}$-hopping hardware through a processor. Otherwise, if there is no task to execute, the application sends sleep information, and the processor enters a sleep mode. By using the speed information, the $\mathrm{V}_{\mathrm{DD}}$-hopping hardware provides $V_{DD}$ and $f$ to the processor. $\mathrm{V}_{\mathrm{DD}}$ hopping accordingly adjusts $V_{DD}$ and $f$ dynamically depending on the workload of the processor. Power can be drastically reduced when the workload is low because $f$ and $V_{DD}$ can be lowered at the same time, and power is proportional to the square of $V_{DD}$. This is the basis of $\mathrm{V}_{\mathrm{DD}}$ hopping. A higher $V_{DD}$ should be used only when high performance is needed.

In this section, two-level $\mathrm{V}_{\mathrm{DD}}$ hopping is explained since two levels are sufficient, as described later. $f_{\max}/2$ (half frequency) is used with a low $V_{DD}$, and $f_{\max}$ (full frequency) is used with a high $V_{DD}$. These two frequencies are chosen to enable safe synchronization between a processor and peripheral circuits. By limiting the number of discrete voltage levels to two and providing $V_{DD}$ and $f$ externally, $\mathrm{V}_{\mathrm{DD}}$ hopping makes it possible to use an off-the-shelf processor. Consequently, in two-level $\mathrm{V}_{\mathrm{DD}}$ hopping, $f_{\max}/2$ and a low $V_{DD}$ are used when the workload is $50\%$ or less, and $f_{\max}$ and a high $V_{DD}$ are used when the workload is over $50\%$, which means that $V_{DD}$ hops between only two voltages depending on the required performance using software feedback. The number of hopping levels is crucial in a product because many test sequences have to be run if the number is large.

### 5.2.1. Concept

Fig. 5.2 shows three approaches to save power when the workload is, for instance, $50 \%$. The approaches (a) and (b) in the figure are the conventional ones while (c) is DVS, which shows the highest power saving. The point is to execute a task as slowly as possible.
(a) "NOP" loop while waiting: Even if there is no task to be done, an application program usually executes a "NOP" loop to wait for either the next task or an interrupt. Clock generators with a PLL/DLL, memories including caches, and address calculations have to operate in the "NOP" loop, which consumes a certain level of power, $b P_{\max}$, where $b<1$. A normalized power, $NP$, is expressed as a function of a normalized workload, $NW$, as follows.

$$
\begin{equation*}
N P(N W)=(1-b) N W+b \tag{5.1}
\end{equation*}
$$

(b) Sleep while waiting: If a sleep mode is available on a target processor, an application program can use it after a task is completed until the next task starts or an interrupt is acknowledged. In this case, since almost no power is consumed in the sleep mode, $NP$ is given as follows.

$$
\begin{equation*}
N P(N W)=N W \tag{5.2}
\end{equation*}
$$

(c) Operate slowly without waiting (DVS): $N W$ and $N P$ are given as parametric functions of $V_{D D}$ with the $\alpha$-power law MOSFET model as follows [5.17].

$$
\begin{align*}
& \left\{\begin{array}{l}
N W\left(V_{D D}\right)=\frac{V_{D D \max }}{V_{D D}}\left(\frac{V_{D D}-V_{T H}}{V_{D D \max }-V_{T H}}\right)^{\alpha} \\
N P\left(V_{D D}\right)=\left(V_{D D} / V_{D D \max }\right)^{2} N W\left(V_{D D}\right)
\end{array}\right.  \tag{5.3}\\
& \therefore N P(N W)=N W^{\frac{\alpha+1}{\alpha-1}} \quad \text { if } \quad V_{T H}=0 .
\end{align*}
$$

$V_{TH}$ denotes the threshold voltage of a MOSFET. $\alpha$ represents the velocity saturation index, and is about 1.2 in a recent short-channel MOSFET and 2.0 in a long-channel one (classic Shockley model). The dependences of $NP$ on $NW$ for the three cases are illustrated in Fig. 5.3. In DVS, when $NW$ is less than one, $V_{DD}$ is decreased to the level at which the required speed is just satisfied. It is clear that DVS saves the most total power. Furthermore, the figure shows that the effectiveness of DVS increases as $V_{TH}$ and $\alpha$ become lower, which suits MOSFET scaling and is an advantage of DVS.
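
The three approaches can be compared numerically with a minimal sketch such as the one below. It uses the linear expression of approach (a), (5.2), and the closed form of (5.3) for $V_{TH}=0$ only. The function names and the value $b=0.2$ are hypothetical choices made for illustration.

```python
def np_nop(nw, b):
    """(a) "NOP" loop while waiting: NP = (1 - b) * NW + b."""
    return (1.0 - b) * nw + b


def np_sleep(nw):
    """(b) Sleep while waiting: NP = NW."""
    return nw


def np_dvs_zero_vth(nw, alpha):
    """(c) DVS with V_TH = 0: NP = NW ** ((alpha + 1) / (alpha - 1)).

    Requires alpha > 1 so that the exponent is finite.
    """
    return nw ** ((alpha + 1.0) / (alpha - 1.0))


if __name__ == "__main__":
    b = 0.2  # hypothetical fraction of P_max burned in the "NOP" loop
    for nw in (0.25, 0.5, 0.75, 1.0):
        print(
            f"NW={nw:.2f}  NOP={np_nop(nw, b):.3f}  sleep={np_sleep(nw):.3f}  "
            f"DVS(alpha=2.0)={np_dvs_zero_vth(nw, 2.0):.4f}  "
            f"DVS(alpha=1.2)={np_dvs_zero_vth(nw, 1.2):.6f}"
        )
```

For a nonzero $V_{TH}$ there is no such closed form, and $NW$ and $NP$ have to be evaluated parametrically in $V_{DD}$ as in (5.3).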

### 5.2.1.1. Application Slicing

Multimedia applications usually synchronize with their own regular periods, for instance, 60 Hz for MPEG2 and 44.1 kHz for CD audio. The WCET (worst-case execution time) of the application has to be equal to or less than the period on a hardware platform in order to keep a real-time feature. The execution time of the application, however, is frequently less than the WCET, sometimes by a large amount, since the workload strongly depends on the data imposed on the hardware [5.12]. For example, in an MPEG4 encoder (code decode), the workload becomes higher as objects in an image move faster, but the worst-case data are seldom input, as shown in Fig. 5.4, and in most cases a task finishes well before the WCET. In addition, even if the worst-case data are input, there may be a time margin because the WCET is equal to or less than the period.

This is one of the motivations for $V_{DD}$ hopping. The execution time is not constant, and it does not always take the WCET to execute a task. At the start of each application, however, we do not have any information about its future execution time, and it is impossible to predict the future workload without an error. To solve this problem, application slicing is introduced. If a task is sliced, the unused time from the previous slices can be exploited by the following slices. By checking the current time and the slack time (time margin) available to execute the next slice, application slicing adaptively selects the optimum $f$ and $V_{DD}$ at run time to minimize power.

Fig. 5.6 explains the concept of application slicing when only one task is running on a processor. The task is periodic, and its period is $T_{\text{PERIOD}}$. If a task is not periodic, application slicing is not applicable; fortunately, however, multimedia applications are periodic, as described at the beginning of this section. In other words, $\mathrm{V}_{\mathrm{DD}}$ hopping with application slicing is suitable for synchronous tasks such as multimedia applications, and not suitable for an asynchronous task such as communication processing. In an MPEG4 encoder, although communication may be necessary, the communication rate is low, say 64 kbps. Therefore, the overhead of communication is estimated at less than $1\%$, which is negligible compared with the MPEG4 encoder itself.

As illustrated in Fig. 5.6, the WCET of a task, $T_{WCET1toN}$, is chopped into $N$ slices whose lengths may differ from each other. The WCET of the $i$-th slice, $T_{WCETi}$ $(i=1, \ldots, N)$, and the WCET from the $i$-th to the $N$-th slice, $T_{WCETitoN}$, can be obtained through static analysis or direct measurement at the design stage [5.18]. In a code fragment at the head of each slice, the slack time that is allowed for executing the slice is computed. $D$ is the deadline, which is the interval to the next initiation time.

Then, with $D$, the slack time, $T_{SLACKi}$, is checked. $T_{SLACKi}$ is obtained by subtracting $T_{WCETi+1toN}$ from $D$. Ideally, $f$ normalized by $f_{\max}$ can be reduced to $T_{WCETi}/T_{SLACKi}$. In reality, however, an arbitrary choice of $f$ causes a serious problem at interfaces with peripheral devices. To solve this issue, in $\mathrm{V}_{\mathrm{DD}}$ hopping, a candidate $f$ is limited to $f_{\max}$ or $f_{\max}/2$, where $f_{\max}$ is the maximum frequency of a processor. In two-level $\mathrm{V}_{\mathrm{DD}}$ hopping, the $i$-th slice is carried out at $f_{\max}/2$ if $T_{SLACKi} \geq 2 T_{WCETi}+T_{tr}$, where $T_{tr}$ indicates the transition time of $f$ and $V_{DD}$. According to this procedure, the optimum $f$ and the corresponding $V_{DD}$ are adaptively selected by software on a slice-by-slice basis. After finishing the $N$-th slice, the processor goes into a sleep mode until the next initiation of the task.
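
The slice-by-slice decision can be sketched as follows. This is an illustrative model under stated assumptions, not the code that ran on the processor: the function names are hypothetical, all times are in arbitrary units measured at $f_{\max}$, a slice run at $f_{\max}/2$ is assumed to take twice its $f_{\max}$ execution time, and the transition time $T_{tr}$ is charged once for every half-speed slice.

```python
def run_at_half_speed(deadline, wcet_i, wcet_rest, t_tr):
    """Return True if slice i may run at f_max / 2.

    deadline  : D, the interval from now to the next initiation time
    wcet_i    : WCET of slice i at f_max
    wcet_rest : WCET of slices i+1..N at f_max
    t_tr      : transition time of f and V_DD
    """
    slack = deadline - wcet_rest
    return slack >= 2.0 * wcet_i + t_tr


def simulate_period(period, wcets, actuals, t_tr):
    """Simulate one period and return (mode per slice, finish time)."""
    now = 0.0
    modes = []
    for i, (wcet_i, actual_i) in enumerate(zip(wcets, actuals)):
        deadline = period - now
        wcet_rest = sum(wcets[i + 1:])
        if run_at_half_speed(deadline, wcet_i, wcet_rest, t_tr):
            modes.append("half")
            now += t_tr + 2.0 * actual_i  # assumed cost model at f_max / 2
        else:
            modes.append("full")
            now += actual_i
    return modes, now


if __name__ == "__main__":
    # Hypothetical task: four slices with a WCET of 10 each, period 45.
    modes, finish = simulate_period(45.0, [10.0] * 4, [2.0, 10.0, 10.0, 10.0], 1.0)
    print(modes, finish)
```

In this hypothetical run the first slice finishes early, and the time it leaves unused is what allows the second slice to run at half speed while the whole task still completes within the period.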

Fig. 5.6 shows an example of temporal behaviors in $V_{DD}$ hopping obtained by a simulation of an MPEG4 SP@L1 encoder when the WCET is 66.7 ms (one video frame). The workload is $42\%$ of the worst case. If infinite levels of $f$ are available, namely, if infinite levels of $V_{DD}$ are provided, the maximum power reduction is possible. However, the power improvement over two-level $\mathrm{V}_{\mathrm{DD}}$ hopping is just $8\%$. This is the reason why the levels of $f$ and $V_{DD}$ are limited to two in $\mathrm{V}_{\mathrm{DD}}$ hopping.

In the case of two-level $\mathrm{V}_{\mathrm{DD}}$ hopping in Fig. 5.6, $f_{\max}$ is used for only $6\%$ of the time while $f_{\max}/2$ is used for $70\%$ of the time. For the rest of the time, the processor is in a sleep mode. $f_{\max}$ is still needed because the processor has to run at $f_{\max}$ when the rarely occurring worst-case data are input. This tendency holds for other multimedia applications such as an MPEG2 decoder and VSELP (voice encoder), and about an order-of-magnitude improvement in power is assured. $\mathrm{V}_{\mathrm{DD}}$ hopping can be applied to any application that synchronizes with a regular period and whose WCET is known.

### 5.2.1.2. Second Frequency

Here, we describe why $f_{\max}/2$, not $f_{\max}/j$ $(j>2)$, is preferable as the second frequency. One of the reasons is that the average workload of the MPEG4 encoder treated in this work is about a half. If the average workload of an application is known in advance to be about $1/3$, and a processor is used only for that application, the best choice of the second frequency would be $f_{\max}/3$. In general, however, a processor in a recent system is used for various applications, and the average workload is unknown. Therefore, the workload of an application is assumed to vary randomly from zero to one, in which case $f_{\max}/2$ as the second frequency minimizes the average power.

Fig. 5.7 shows the power dependence on workload. The segments OAC correspond to the power dependence when $f_{\max}/j$ $(j>2)$ is used as the second frequency, while the segments OBC correspond to the case of $f_{\max}/2$. The areas of the quadrilaterals OACD ($S_{OACD}$) and OBCD ($S_{OBCD}$) are proportional to the average power when the workload varies randomly from zero to one. It is demonstrated that $f_{\max}/2$ minimizes the average power if $S_{OACD}>S_{OBCD}$.
$S_{O A C D}$ and $S_{O B C D}$ are given as follows.

$$
\begin{align*}
& S_{\text {OACD }}=N P(1 / j) / 2 j+(1 / 2-1 / 2 j)[N P(1 / j)+1] \quad \text { where } \quad j>2, \\
& S_{O B C D}=N P(1 / 2) / 4+(1 / 2-1 / 4)[N P(1 / 2)+1] . \tag{5.4}
\end{align*}
$$

$N P(1 / j)$ signifies a normalized power when a workload is $1 / j$. In order to demonstrate $S_{O A C D}>S_{O B C D}$, the following inequality derived with (5.4) has to hold.

$$
\begin{equation*}
N P(1 / j)>1 / j+N P(1 / 2)-1 / 2 \quad \text { where } \quad j>2 . \tag{5.5}
\end{equation*}
$$

Now $1/j$ is replaced by a normalized workload, $NW$, and (5.5) becomes the following.

$$
\begin{equation*}
N P(N W)>N W+N P(1 / 2)-1 / 2 \quad \text { where } \quad N W<1 / 2 . \tag{5.6}
\end{equation*}
$$

In Fig. 5.7, a dashed line passing through the point B shows the function $G(NW)=NW+NP(1/2)-1/2$, and the shaded region $R$ corresponds to $R>NW+NP(1/2)-1/2$ where $NW<1/2$. Therefore, if the curve $NP(NW)$ passes through the region $R$, (5.6) holds, which in turn demonstrates $S_{OACD}>S_{OBCD}$.
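
The area comparison can be checked numerically with a small sketch. It evaluates the two terms of (5.4) for an arbitrary divisor $j$, using the closed form of $NP(NW)$ for $V_{TH}=0$ from (5.3). The function names and the restriction to $V_{TH}=0$ are simplifying assumptions made here for illustration.

```python
def np_zero_vth(nw, alpha):
    """Normalized power for V_TH = 0: NP = NW ** ((alpha + 1) / (alpha - 1))."""
    return nw ** ((alpha + 1.0) / (alpha - 1.0))


def average_power_area(j, alpha):
    """Area under the segments O -> (1/j, NP(1/j)) -> C(1, 1), as in (5.4).

    j = 2 gives S_OBCD, and j > 2 gives S_OACD.
    """
    np_j = np_zero_vth(1.0 / j, alpha)
    triangle = np_j / (2.0 * j)
    trapezoid = (0.5 - 1.0 / (2.0 * j)) * (np_j + 1.0)
    return triangle + trapezoid


if __name__ == "__main__":
    for alpha in (1.2, 2.0):
        s_obcd = average_power_area(2, alpha)
        for j in (3, 4, 8):
            s_oacd = average_power_area(j, alpha)
            print(f"alpha={alpha}  j={j}  S_OACD={s_oacd:.4f}  S_OBCD={s_obcd:.4f}")
```

For these sample values the area for $j=2$ is the smaller one, in line with the argument in the text.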

Suppose $T(NW)$ is the tangent line that touches $NP(NW)$ at the point B. Since $NP(NW)$ is a convex function, $(NP(NW)-T(NW))^{\prime\prime}>0$. Because $(NP(NW)-T(NW))^{\prime}=0$ at $NW=1/2$, it follows that $(NP(NW)-T(NW))^{\prime}<0$ where $0 \leq NW<1/2$. Therefore, $NP(NW)-T(NW)$ is a decreasing function where $0 \leq NW<1/2$, and is zero when $NW=1/2$. This means that $NP(NW)>T(NW)$ where $0 \leq NW<1/2$. If the slope of $T(NW)$ is less than one, $T(NW)$ passes through the region $R$. In this case, $S_{OACD}>S_{OBCD}$ can be demonstrated because $NP(NW)>T(NW)$, and $NP(NW)$ therefore passes through the region $R$. Finally, the condition for $S_{OACD}>S_{OBCD}$ becomes the following.

$$
\begin{equation*}
\left.\frac{d N P(N W)}{d N W}\right|_{N W=1 / 2}<1 \tag{5.7}
\end{equation*}
$$

As mentioned above, the dependence of $NP$ on $NW$ is given as follows.

$$
\begin{align*}
& \left\{\begin{array}{l}
N W\left(v_{d d}\right)=\frac{1}{v_{d d}}\left(\frac{v_{d d}-V_{T H} / V_{D D \max }}{1-V_{T H} / V_{D D \max }}\right)^{\alpha}, \\
N P\left(v_{d d}\right)=v_{d d}^{2} N W\left(v_{d d}\right)
\end{array}\right.  \tag{5.8}\\
& \text { where } \quad v_{d d}=V_{D D} / V_{D D \max } .
\end{align*}
$$

Since (5.8) is given as parametric functions and is difficult to write in a closed form, the slope of $NP(NW)$ at $NW=1/2$ is calculated numerically as follows.

$$
\begin{equation*}
\text { Slope }=\left.\frac{d N P(N W)}{d N W}\right|_{N W=1 / 2}=\left.\frac{\frac{d N P\left(v_{d d}\right)}{d v_{d d}}}{\frac{d N W\left(v_{d d}\right)}{d v_{d d}}}\right|_{N W\left(v_{d d}\right)=1 / 2} \tag{5.9}
\end{equation*}
$$

The result is shown in Fig. 5.8. In the regions where $0 \leq V_{TH}/V_{DD\max}<1$ and $1 \leq \alpha<2$, which hold in normal VLSI processors, the slope does not exceed one. Therefore, $S_{OACD}>S_{OBCD}$ is now demonstrated, and it is established that the average power is minimized when $f_{\max}/2$ is chosen as the second frequency for a system in which the workload of an application varies randomly from zero to one.
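
A minimal sketch of such a numerical evaluation is given below. It solves $NW(v_{dd})=1/2$ from (5.8) by bisection and then evaluates the ratio in (5.9). The analytic form of that ratio, $v_{dd}^{2}\left[1+2(v_{dd}-t)/((\alpha-1) v_{dd}+t)\right]$ with $t=V_{TH}/V_{DD\max}$, was derived here by differentiating (5.8) and is not quoted from the original. The function names and sample parameters are likewise illustrative.

```python
def nw_of_v(v, alpha, t):
    """Normalized workload of (5.8); v = V_DD / V_DDmax, t = V_TH / V_DDmax."""
    return ((v - t) / (1.0 - t)) ** alpha / v


def v_at_half_workload(alpha, t, iterations=200):
    """Bisection for the v in (t, 1] with NW(v) = 1/2 (NW increases with v)."""
    if not (0.0 <= t < 1.0) or alpha < 1.0 or (alpha == 1.0 and t == 0.0):
        raise ValueError("need 0 <= t < 1 and alpha >= 1 (alpha > 1 if t == 0)")
    lo, hi = t, 1.0
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if nw_of_v(mid, alpha, t) < 0.5:
            lo = mid
        else:
            hi = mid
    return hi


def slope_at_half_workload(alpha, t):
    """dNP/dNW at NW = 1/2, i.e. (dNP/dv) / (dNW/dv) as in (5.9)."""
    v = v_at_half_workload(alpha, t)
    return v * v * (1.0 + 2.0 * (v - t) / ((alpha - 1.0) * v + t))


if __name__ == "__main__":
    for alpha, t in ((1.2, 0.0), (1.5, 0.2), (2.0, 0.0)):
        print(f"alpha={alpha}  V_TH/V_DDmax={t}  slope={slope_at_half_workload(alpha, t):.4f}")
```

For $V_{TH}=0$ the result can be cross-checked against the closed form of (5.3), which gives a slope of $p\,(1/2)^{p-1}$ with $p=(\alpha+1)/(\alpha-1)$.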

### 5.2.2. Breadboard Design

An MPEG4 encoder system was built to demonstrate the feasibility of $\mathrm{V}_{\mathrm{DD}}$ hopping, as shown in Fig. 5.9. The system utilizes an off-the-shelf processor, Hitachi's SH-4 [5.19], and its embedded system board made by Densan [5.20]. The block diagram of the $\mathrm{V}_{\mathrm{DD}}$-hopping system is illustrated in Fig. 5.10. An H.263 standard sequence, "carphone", is used as input data. The image has $80 \times 64$ pixels ($5 \times 4$ macroblocks) and is stored in a flash ROM as raw data. One macroblock corresponds to one slice in application slicing. In addition, two other slices are assigned to an initial routine and a display routine, and consequently the MPEG4 encoder has 22 slices. To obtain the WCET of a video frame, the frame rate is varied to check that the system works in time without dropping a video frame. 200 ms is obtained as the WCET, which means that the frame rate of the system is five frames per second. It should be noted that the image size and frame rate are different from the standard; however, the feasibility of $\mathrm{V}_{\mathrm{DD}}$ hopping can be verified in respect of both hardware and software.

The optimum $f$ and $V_{DD}$ are calculated by the SH-4. Speed information is sent through an SH-4 I/O bus and a local bus to a VME bus as shown in Fig. 5.10, which controls a $\mathrm{V}_{\mathrm{DD}}$-hopping board implemented with an FPGA (Altera EPM7064). Because only I/O instructions are required to implement $\mathrm{V}_{\mathrm{DD}}$ hopping, no new instruction set is necessary. This is the reason why $\mathrm{V}_{\mathrm{DD}}$ hopping can be implemented without redesigning a processor.

The FPGA has two timers in itself. One timer watches the current time. The other timer is used to keep the processor in a sleep mode during $\mathrm{V}_{\mathrm{DD}}$ transition, which avoids malfunction due to the $\mathrm{V}_{\mathrm{DD}}$ transition. The FPGA requests interrupts with the timers, and the processor acknowledges them through the VME bus.

### 5.2.2.1. Clock Frequency

The processor has a frequency control register called FRQCR, as shown in Fig. 5.10. The FRQCR can instantaneously change the internal clock frequency, which is synchronized with an external clock frequency of 33 MHz. Since 200 MHz and 100 MHz are used as operating frequencies and both are divisible by the external clock frequency, there is no synchronization problem at the interfaces with peripheral devices. For a processor that does not have this kind of frequency control register, the clock frequency must be changed externally in order to provide $f_{\max }$ and $f_{\max } / 2$. In this case, the processor must be halted during the settling time of the clock distribution network, including a PLL/DLL, to avoid malfunction. The controller described later outputs such frequencies by itself.

In $\mathrm{V}_{\mathrm{DD}}$ hopping, $V_{D D}$ must be changed according to $f$. Using the speed information, the power switches on the $\mathrm{V}_{\mathrm{DD}}$-hopping board select either 2.0 V ($V_{D D \max }$) for 200 MHz or 1.2 V ($V_{D D \min }$) for 100 MHz as $V_{D D}$. The relationship between $f$ and $V_{D D}$ is obtained by measuring the physical characteristics of the processor.

### 5.2.2.2. Power Switch

On the $\mathrm{V}_{\mathrm{DD}}$-hopping board, $V_{D D}$ hops between $V_{D D \max }$ and $V_{D D \min }$ using power switches (NEC 2SJ208), which have one of the lowest threshold voltages on the market. However, since the threshold voltage is 2.8 V, which is higher than $V_{D D \max }$, the switches never turn on with $V_{D D \max }$. Consequently, an RS-232C driver (Maxim MAX232) is used as an amplifier that boosts the gate voltage of the switches to $\pm 8 \mathrm{~V}$.

Fig. 5.11 and Fig. 5.12 show measured $V_{D D}$ waveforms. The measured fall and rise times of the $V_{D D}$ transition are less than $200 \mu \mathrm{~s}$ and $100 \mu \mathrm{~s}$, respectively, with a decoupling capacitance, $C_{D}+C_{S}$, of $30 \mu \mathrm{~F}$ at the $\mathrm{V}_{\mathrm{DD}}$ node.

Care should be taken with the overlap of $V_{G \max }$ (the enabling signal of $V_{D D \max }$) and $V_{G \min }$ (the enabling signal of $V_{D D \min }$). During the $\mathrm{V}_{\mathrm{DD}}$ transition between $V_{D D \max }$ and $V_{D D \min }$, there are two cases: either $V_{G \max }$ and $V_{G \min }$ overlap, or they do not. It is virtually impossible to turn on one switch and turn off the other at exactly the same time. If there is a $2-\mu \mathrm{s}$ overlap, as depicted in Fig. 5.11, a large current might flow from $V_{D D \max }$ to $V_{D D \min }$ and cause a problem. However, thanks to the decoupling capacitors, no spike noise or voltage drop is observed.

If there is no overlap between $V_{G \max }$ and $V_{G \min }$, there is a period during which $V_{D D}$ is completely cut off from both $V_{D D \max }$ and $V_{D D \min }$, which causes a serious problem, as seen in Fig. 5.12. The falling-$\mathrm{V}_{\mathrm{DD}}$ case is barely safe, but in the rising-$\mathrm{V}_{\mathrm{DD}}$ case, $V_{D D}$ sags below $V_{D D \min }$ due to discharge of the decoupling capacitor, which hangs the system. In conclusion, switching between $V_{D D \max }$ and $V_{D D \min }$ should include a short period during which both $V_{D D \max }$ and $V_{D D \min }$ are connected to the $\mathrm{V}_{\mathrm{DD}}$ line.

Besides the timing overlap, care is needed in the power-on sequence. $V_{G \max }$ should be asserted at startup to connect $V_{D D \max }$ to the $\mathrm{V}_{\mathrm{DD}}$ line so that the system starts stably. A further point is the order in which $V_{D D}$ and $f$ are controlled. When $V_{D D}$ falls, $f$ should be decreased first, and then $V_{D D}$ should be decreased. Conversely, when $V_{D D}$ rises, $V_{D D}$ should be increased first, and then $f$ should be increased.

In order to avoid malfunction, the processor stays in a sleep mode during the $\mathrm{V}_{\mathrm{DD}}$ transition. This is realized with the $\mathrm{V}_{\mathrm{DD}}$-transition timer described at the beginning of Section 5.2.2, which is separate from the system-clock timer that tracks the current time. Before the $\mathrm{V}_{\mathrm{DD}}$ transition, the timer is set to $200 \mu \mathrm{~s}$ so that it expires at the end of the transition in both the falling and rising cases, and then the processor enters the sleep mode. The $\mathrm{V}_{\mathrm{DD}}$-transition timer wakes up the processor with an interrupt when the preset time expires. To avoid malfunction, all interrupts other than that of the $\mathrm{V}_{\mathrm{DD}}$-transition timer must be masked during the $\mathrm{V}_{\mathrm{DD}}$ transition, which means that the interrupt level of the $\mathrm{V}_{\mathrm{DD}}$-transition timer should be the highest. Since the $V_{D D}$ transition is relatively long, $\mathrm{V}_{\mathrm{DD}}$ hopping is not suitable for a fast-response system such as a servo system.
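
The ordering rules of this subsection can be summarized as a small sequencing sketch. It is only an illustration, not the firmware used on the board: the step strings, the `OperatingPoint` type, and the function name `hop_sequence` are hypothetical, and the exact position of the sleep step relative to the switch toggling is an assumption. What is taken from the text is the order of $f$ and $V_{DD}$ changes, the make-before-break overlap of the two enable signals, the interrupt masking, and the $200\text{-}\mu\mathrm{s}$ timer.

```python
from dataclasses import dataclass
from typing import List

T_TRANSITION_US = 200  # timer preset covering the V_DD transition (from the text)


@dataclass(frozen=True)
class OperatingPoint:
    f_mhz: int
    vdd: float
    enable: str  # name of the switch-enable signal for this supply


HIGH = OperatingPoint(200, 2.0, "V_Gmax")
LOW = OperatingPoint(100, 1.2, "V_Gmin")


def hop_sequence(current: OperatingPoint, target: OperatingPoint) -> List[str]:
    """Return the ordered control steps for one hop (empty if nothing changes)."""
    if current == target:
        return []
    falling = target.vdd < current.vdd
    steps = []
    if falling:
        # Falling V_DD: lower f first so the logic never runs too fast for V_DD.
        steps.append(f"set f={target.f_mhz}MHz")
    steps += [
        "mask interrupts except V_DD-transition timer",
        f"arm V_DD-transition timer {T_TRANSITION_US}us",
        f"assert {target.enable}",    # make-before-break: both switches on briefly
        f"deassert {current.enable}",
        "sleep until timer interrupt",
        "unmask interrupts",
    ]
    if not falling:
        # Rising V_DD: raise f only after V_DD has settled.
        steps.append(f"set f={target.f_mhz}MHz")
    return steps


if __name__ == "__main__":
    for step in hop_sequence(HIGH, LOW):
        print(step)
```

Returning a list of steps rather than driving hardware keeps the sketch testable; on a real system each step would map to an I/O write or a sleep instruction.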

### 5.2.2.3. Power

Fig. 5.13 (a) shows the measured power characteristics of the $\mathrm{V}_{\mathrm{DD}}$-hopping system. The power at 200 MHz is 0.8 W, while that at 100 MHz is 0.16 W. This means that operation at 100 MHz is 2.5 times as energy-efficient per clock cycle as operation at 200 MHz. The sleep mode is operated at 100 MHz and 1.2 V in order to save standby power, and it consumes 0.07 W. Since the processor spends on average $8 \%$ of the time at $V_{D D \max }$, $86 \%$ at $V_{D D \min }$, and $6 \%$ in the sleep mode, the average power in $\mathrm{V}_{\mathrm{DD}}$ hopping is 0.21 W. In the processor, the $\mathrm{I} / \mathrm{O}$ buffers are not optimized for low-voltage operation at 100 MHz; if they were carefully designed, $V_{D D \min }$ could be below 0.9 V instead of 1.2 V. In this case, the power at 100 MHz could be reduced to about a half.
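
The figures quoted above can be checked with a time-weighted average. The following sketch uses only the measured powers and time shares stated in the text; the dictionary keys and function names are chosen here for illustration.

```python
# Measured operating points from the text (power in watts).
POWER_W = {"vdd_max": 0.80, "vdd_min": 0.16, "sleep": 0.07}
# Average fraction of time spent in each mode (from the text).
TIME_SHARE = {"vdd_max": 0.08, "vdd_min": 0.86, "sleep": 0.06}


def average_power(power_w, time_share):
    """Time-weighted average power over all modes."""
    if abs(sum(time_share.values()) - 1.0) > 1e-9:
        raise ValueError("time shares must sum to one")
    return sum(power_w[mode] * share for mode, share in time_share.items())


def energy_per_cycle_nj(power_w, f_mhz):
    """Energy per clock cycle in nanojoules (W / MHz = uJ, times 1000 = nJ)."""
    return power_w / f_mhz * 1e3


if __name__ == "__main__":
    print(round(average_power(POWER_W, TIME_SHARE), 2))   # 0.21
    ratio = energy_per_cycle_nj(0.80, 200) / energy_per_cycle_nj(0.16, 100)
    print(round(ratio, 2))                                # 2.5
```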

Based on Fig. 5.13 (a), the power dependence on workload can be obtained as shown in Fig. 5.13 (b). The 0.8 W at 200 MHz corresponds to a full workload, while the 0.16 W at 100 MHz corresponds to a half workload. When the workload is zero, the processor consumes 0.07 W in the sleep mode and 0.58 W in a "NOP" loop. $\mathrm{V}_{\mathrm{DD}}$ hopping is more effective than the "NOP"-loop case in the low-workload region, as seen in Fig. 5.13 (c). On the other hand, compared with the sleep-mode case, $\mathrm{V}_{\mathrm{DD}}$ hopping is most effective when the workload is one half because the second frequency is set to $f_{\max } / 2$.
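
A rough model of this workload dependence is sketched below. It assumes that the processor simply time-shares between adjacent operating points (linear interpolation) and that the transition overhead is negligible; both are assumptions of this sketch, and the curves in Fig. 5.13 may have been derived differently. Only the four power values come from the text.

```python
P_FMAX, P_FHALF, P_SLEEP, P_NOP = 0.80, 0.16, 0.07, 0.58  # watts, from the text


def p_nop(w):
    """Fixed f_max, idle time spent in a NOP loop."""
    return w * P_FMAX + (1 - w) * P_NOP


def p_sleep(w):
    """Fixed f_max, idle time spent in the sleep mode."""
    return w * P_FMAX + (1 - w) * P_SLEEP


def p_hopping(w):
    """Ideal V_DD hopping between f_max and f_max/2 (assumed time-sharing)."""
    if w >= 0.5:
        share_max = 2 * w - 1            # fraction of time at f_max
        return share_max * P_FMAX + (1 - share_max) * P_FHALF
    share_half = 2 * w                   # fraction of time at f_max/2
    return share_half * P_FHALF + (1 - share_half) * P_SLEEP


if __name__ == "__main__":
    grid = [i / 100 for i in range(101)]
    best = max(grid, key=lambda w: p_sleep(w) - p_hopping(w))
    print(best)  # workload where the saving over the sleep-mode case peaks
```

Under these assumptions the saving over the sleep-mode case grows linearly up to a workload of one half and shrinks linearly above it, which agrees with the observation in the text.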

### 5.2.3. LSI Design

After the $\mathrm{V}_{\mathrm{DD}}$-hopping breadboard was evaluated, a $\mathrm{V}_{\mathrm{DD}}$-hopping controller with the same function as the breadboard was designed and fabricated. Fundamentally, the FPGA portion of the board was implemented in the controller in a standard-cell design style.

In the $\mathrm{V}_{\mathrm{DD}}$-hopping controller, the gate width of the power switch is critical. The simulated voltage drop across the power switch is shown in Fig. 5.14. In the process used for the controller design, the threshold voltage is 0.6 V, which is smaller than $V_{D D \min }(1.2 \mathrm{~V})$. Therefore, the signal-swing amplifier that was required in the breadboard design is not necessary. The largest gate width is needed when the gate bias is 1.2 V and the load current is 0.13 A. A gate width of 27 mm is found to be appropriate when the voltage drop across the switch is to be kept below 0.05 V. It should be noted that a switch of this gate width can draw a large current of 0.4 A when $V_{D D}$ is $V_{D D \max }(2.0 \mathrm{~V})$.

Fig. 5.15 illustrates a schematic diagram of the $\mathrm{V}_{\mathrm{DD}}$-hopping controller. The timing overlap between $V_{G \max }$ and $V_{G \min }$ is as critical as in the breadboard design described in the previous subsection. In order to adjust the period of the overlap, programmable timers are placed at the gates of the power switches, as shown in Fig. 5.15 (a). Besides the timing overlap, care is also needed at startup. $V_{D D \max }$ should be connected to the $\mathrm{V}_{\mathrm{DD}}$ line with the System_reset signal in order to start the system stably. The controller also has an all-purpose decoder for the power switches, as shown in Fig. 5.15 (b). For a processor that does not have a frequency control register, the $\mathrm{V}_{\mathrm{DD}}$-hopping controller has a clock frequency selector that outputs either $f_{\max }$ or $f_{\max } / 2$, as shown in Fig. 5.15 (c). A programmable timer in the figure prevents $f$ from changing during program execution. In general, a processor must be halted while $f$ and $V_{D D}$ are being changed to avoid malfunctions due to the transition. In addition, two other timers are available: one to keep track of the current time, and one to wake the processor from the sleep mode with an interrupt signal after the $f$ and $\mathrm{V}_{\mathrm{DD}}$ transition, as mentioned above.

Fig. 5.16 shows the measured waveforms of $V_{D D}$ and the sleep signal of the processor. The application is the MPEG4 encoder, and the input sequence data are the same as those used on the breadboard. It should be noted that just two video frames are shown in the figure. $V_{D D \max }$ is used only $8 \%$ of the time on average, while the sleep period is $6 \%$ on average. This means that the remaining $86 \%$ is spent at $V_{D D \min }$. Therefore, the average workload is $51 \%$ $(8 \% \times 1+86 \% \times 0.5+6 \% \times 0)$.

Fig. 5.17 shows a power comparison between $\mathrm{V}_{\mathrm{DD}}$ hopping and other fixed-$V_{D D}$ schemes for the MPEG4 encoder. $\mathrm{V}_{\mathrm{DD}}$ hopping is measured to consume 0.21 W. If the I/O buffers of the processor were carefully designed, $V_{D D \min }$ could be 0.9 V and the power would become 0.15 W. In this case, $\mathrm{V}_{\mathrm{DD}}$ hopping could reduce the power to less than a quarter of that in the case of executing "NOP"s while waiting.

The controller was fabricated in a Rohm $0.6-\mu \mathrm{m}$ triple-metal CMOS process, as shown in Fig. 5.18, and consumes 0.01 W with an external clock of 33 MHz. The size is about $4.6 \times 2.3 \mathrm{~mm}^{2}$, including two power switches of $27-\mathrm{mm}$ width. The switches are implemented with comb-shaped pMOSFETs because of their huge width.

## 5.3. $\mu$ITRON-LP: Power-Conscious RTOS (Real-Time Operating System) Based on CVS (Cooperative Voltage Scaling)

In order to realize the CVS mentioned in Section 5.1, an RTOS (real-time operating system) is modified so that it maintains timing information and provides it to multiple applications. Each application is also modified in the manner of application slicing and $\mathrm{V}_{\mathrm{DD}}$ hopping, as described in the previous section. A code fragment determines $f$ and $V_{D D}$ according to both its WCET and the timing information provided by the RTOS. The rationale of CVS is that the RTOS knows only global timing information among tasks, while each application has better knowledge of its own structure and behavior. In $\mathrm{V}_{\mathrm{DD}}$ hopping, only one application operates, using its own timing information without an RTOS; in this section, the RTOS helps multiple applications exploit inter-task timing information.

### 5.3.1. CVS (Cooperative Voltage Scaling)

### 5.3.1.1. Model

Fig. 5.19 shows a structural model of CVS. It is similar to Fig. 5.1, but an RTOS is included in the system. The software architecture is composed of the power-conscious RTOS and applications. Hitachi HI7750 [5.21], which is based on the $\mu$ITRON specification [5.22], is redesigned as the power-conscious RTOS, which we call $\mu$ITRON-LP. Real-time tasks are scheduled according to a fixed-priority preemptive scheduling algorithm in $\mu$ITRON-LP; however, other scheduling algorithms may be used.

In $\mu$ITRON-LP, an absolute time called a system clock is maintained by a cyclic interrupt from a hardware timer, whose interval is set to 1 ms, meaning that 1 ms is the time resolution of the system. Since the timer interrupt involves an interrupt service routine that consumes a certain number of processor cycles, the time resolution cannot be made arbitrarily fine.

An RTOS kernel is frequently realized with a TCB (task-control block) and a set of queues. The TCB holds task-specific information such as a priority and a start address, and each queue maintains a list of tasks in a given scheduling state. We add a READY queue and a $\mathrm{T}_{\mathrm{n}}$ queue to $\mu$ITRON-LP, where $T_{n}$ means the next initiation time. The READY queue holds the currently running task as well as the tasks waiting to run, in order of priority. The task that currently occupies the processor is called the RUN task and is at the head of the READY queue. It should be noted that the RUN task remains in the READY queue even though it is running. The $\mathrm{T}_{\mathrm{n}}$ queue holds all tasks in ascending order of the time at which their next initiation is due. We also extend the original TCB into what we call an ETCB (extended task-control block), which contains specific timing information.

In addition, the scheduler in $\mu$ITRON-LP is customized to perform the necessary actions during task-state transitions. The scheduler manages the READY queue and the $\mathrm{T}_{\mathrm{n}}$ queue, computes the timing information in the ETCB, and puts the processor into a sleep mode if there is no task in the READY queue. The processor, however, wakes up at every system-clock tick to keep the system clock counting, and then returns to the sleep mode.

### 5.3.1.2. ETCB (Extended Task-Control Block)

Each task is associated with an ETCB. Fig. 5.20 shows pseudocode of the ETCB structure, in which each element is managed according to the task-state transitions illustrated in Fig. 5.21.

- $T_{\text {PERIOD }}$ refers to the regular period of task initiation. It is fixed and thus does not change at run time.
- $T_{n}$ refers to the relative time at which the next initiation is due. At every system clock, $T_{n}$ of every task in any state is decremented by one unless $T_{n}$ is zero ($\mathrm{T}_{\mathrm{n}}$ time-out). In $\mu$ITRON-LP, the $\mathrm{T}_{\mathrm{n}}$ queue is adopted to monitor the $\mathrm{T}_{\mathrm{n}}$ time-out. All tasks are sorted in ascending order of $T_{n}$ so that the $\mathrm{T}_{\mathrm{n}}$ time-out is easily found. When the $\mathrm{T}_{\mathrm{n}}$ time-out happens, the associated task is automatically initiated, and $T_{n}$ is reloaded with $T_{\text {PERIOD }}$. A newly created task is also initiated immediately because its $T_{n}$ is reset to zero.
- $T_{\text {sta }}$ refers to the system clock at which the RUN task was dispatched. $T_{\text {sta }}$ is valid only when a task is in the RUN state.
- $T_{\text {exe }}$ refers to the accumulated time for which the task has already executed since its first dispatch. It should be noted that $T_{\text {exe }}$ is incremented by the difference between the system clock and $T_{\text {sta }}$ only when the task is preempted. $T_{\text {exe }}$ is reset when a task is initiated.
- $D_{v}$ refers to the relative time to the virtual deadline of the RUN task. $D_{v}$ is valid only when a task is in the RUN state. If there are two or more tasks in the READY queue, $D_{v}$ of the RUN task becomes zero regardless of the $T_{n}$ values of the tasks in the $\mathrm{T}_{\mathrm{n}}$ queue. In this event, the RUN task should finish within its own WCET. It should be noted that there is still a possibility of decreasing $f$ and $V_{D D}$ because some slices might complete their execution earlier than their WCETs. On the other hand, if the RUN task is the only one in the READY queue, $\mu$ITRON-LP chooses the smallest $T_{n}$ in the $\mathrm{T}_{\mathrm{n}}$ queue as $D_{v}$ of the RUN task. The smallest $T_{n}$ is easily obtained because the tasks in the $\mathrm{T}_{\mathrm{n}}$ queue are sorted in ascending order of $T_{n}$. In this case, the RUN task can occupy the processor at least until $D_{v}$ because no task is waiting for execution. Therefore, the RUN task can lower $f$ and $V_{D D}$ if $D_{v}$ is longer than its WCET. Fig. 5.22 shows how $D_{v}$ is determined. Incidentally, $D_{v}$ of the RUN task is also renewed at every system clock.
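
The ETCB fields and the rule for $D_{v}$ described above can be sketched as follows. This is not the pseudocode of Fig. 5.20 or the $\mu$ITRON-LP source; the class layout, the function names `tick` and `virtual_deadline`, and the task periods in the usage note are hypothetical, and plain Python lists stand in for the sorted kernel queues.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ETCB:
    """Timing fields named after the bullets above; units are system-clock ticks."""
    name: str
    t_period: int   # T_PERIOD: fixed initiation period
    t_n: int = 0    # T_n: ticks until the next initiation (0 means time-out)
    t_sta: int = 0  # T_sta: system clock at the last dispatch (valid in RUN)
    t_exe: int = 0  # T_exe: execution time accumulated over preemptions
    d_v: int = 0    # D_v: relative virtual deadline (valid in RUN)


def tick(all_tasks: List[ETCB]) -> List[ETCB]:
    """Advance T_n of every task by one system clock; return initiated tasks."""
    initiated = []
    for task in all_tasks:
        if task.t_n > 0:
            task.t_n -= 1
        if task.t_n == 0:             # T_n time-out
            task.t_n = task.t_period  # reload the period
            task.t_exe = 0            # T_exe is reset on initiation
            initiated.append(task)
    return initiated


def virtual_deadline(ready_queue: List[ETCB], tn_queue: List[ETCB]) -> int:
    """D_v of the RUN task: zero if other tasks are ready, else the smallest T_n."""
    if len(ready_queue) >= 2:
        return 0
    return min(task.t_n for task in tn_queue)
```

For instance, with three tasks created at time zero and hypothetical periods of 20, 40, and 40 ticks, calling `tick` once initiates all of them; after 16 further ticks, a task that is alone in the READY queue obtains a $D_{v}$ of 4, the remaining $T_{n}$ of the 20-tick task.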


### 5.3.1.3. Real Deadline

In each slice, a code fragment computes a real deadline, $D_{r}$. The code fragment obtains $D_{v}$ with a system call, and then compares it to its own WCET. The longer one becomes $D_{r}$.

Fig. 5.23 shows a method of obtaining the WCET of a RUN task. In the figure, the RUN task was preempted four times. Since $\mu$ITRON-LP adopts a preemptive scheduling algorithm, the WCET should be acquired by subtracting the execution time accumulated up to the present from $T_{\text {WCET1toN }}$. The accumulated execution time since the first dispatch is $T_{\text {exe }}$, and the execution time from the last dispatch up to the present is (system clock $-T_{\text {sta }}$). That is, the WCET becomes $T_{\text {WCET1toN }}-T_{\text {exe }}-\left(\right.$ system clock $\left.-T_{\text {sta }}\right)$. The RUN task can get its own $T_{\text {exe }}$ and $T_{\text {sta }}$ with system calls.
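
As a minimal sketch of this bookkeeping, the functions below compute the remaining WCET of a preempted task and the resulting real deadline. The function names are illustrative and are not the actual system calls of $\mu$ITRON-LP; the numbers in the second usage case are invented for illustration.

```python
def on_preempt(t_exe, system_clock, t_sta):
    """T_exe grows by the time executed since the last dispatch."""
    return t_exe + (system_clock - t_sta)


def remaining_wcet(t_wcet_1_to_n, t_exe, system_clock, t_sta):
    """T_WCET1toN - T_exe - (system clock - T_sta), as given in the text."""
    return t_wcet_1_to_n - t_exe - (system_clock - t_sta)


def real_deadline(d_v, wcet):
    """D_r is the longer of the virtual deadline and the task's own WCET."""
    return max(d_v, wcet)


if __name__ == "__main__":
    # No preemption: dispatched at clock 0, now at clock 1, total WCET of 6.
    print(remaining_wcet(6, t_exe=0, system_clock=1, t_sta=0))   # 5

    # Hypothetical preemption: ran from 0 to 3, resumed at 5, now at clock 7.
    t_exe = on_preempt(0, system_clock=3, t_sta=0)
    print(remaining_wcet(12, t_exe, system_clock=7, t_sta=5))    # 7
```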

### 5.3.1.4. Example

We now explain how CVS works using the example task set illustrated in Fig. 5.24. Suppose that there are three periodic tasks, Tasks A, B, and C, and that the $\mathrm{V}_{\mathrm{DD}}$-transition time, $T_{t r}$, is zero. Task A is composed of three slices, each taking two time units in the worst case. Task B is composed of six slices totaling twelve time units in the worst case. Task C has only one slice, whose WCET is two time units.

As for workloads of the tasks in the figure, we assume $50 \%$ of the worst case in Task A. That is, it takes one time unit to execute one slice in Task A. In Tasks B and C, a workload of $100 \%$ is assumed meaning that they run in their WCETs.

In the original $\mu$ITRON, the scheduling looks like Fig. 5.24 (a), while the scheduling in $\mu$ITRON-LP is shown in Fig. 5.24 (b) when $f_{\max }$ and $f_{\max } / 2$ are provided as available frequencies.

In $\mu$ITRON-LP, Tasks A, B, and C are all initiated at time zero. Task A starts first since it has the highest priority. At the first slice of Task A, $D_{v}$ is zero because there are three tasks in the READY queue. In this case, the real deadline, $D_{r}$, is six, which is $T_{\text {WCET1to3 }}$ of Task A. Then, since $T_{\text {WCET2to3 }}$ is four, the slack time, $T_{\text {SLACK1 }}$, is two. $f$ remains $f_{\max }$ since $T_{\text {WCET1 }}$ is two, which leaves no room to halve the frequency.

At time one, the first slice finishes its execution because the workload of Task A is $50 \%$. At the second slice, the remaining WCET is five since $T_{\text {exe }}$ is zero and (system clock $-T_{\text {sta }}$) is now one. $T_{\text {WCET3to3 }}$ is two, and thus $T_{\text {SLACK2 }}$ is three. This is not enough to reduce $f$ to a half, because a slice with a WCET of two may take four time units at $f_{\max } / 2$. Thus, the second slice is executed at $f_{\max }$ as well. At the last slice of Task A, the situation is different from the previous slices. The WCET of the slice is two and $T_{\text {SLACK3 }}$ is four. Therefore, the third slice is carried out at the half frequency, $f_{\max } / 2$, and power saving is possible.

Task A completes at time four. Then, Task B takes over and is executed between time 4 and 16.

At time 16, Task C is allocated to the processor. At this time, only Task C is in the READY queue. The real deadline, $D_{r}$, is the longer of the WCET of Task C and $D_{v}$. In this case, $D_{r}$ is four, which is $T_{n}$ of Task A. Even though this slice is the first slice, it can be executed at $f_{\max } / 2$, unlike in the other tasks. Task C finishes at time 20.

Then, Task A starts again. When there is no task to execute, $\mu$ITRON-LP puts the processor into a sleep mode until the next task initiation.
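
The slice-by-slice decisions in this example can be replayed with a short simulation. The sketch below assumes the selection rule implied by the example, namely that a slice runs at $f_{\max}/2$ only if twice its WCET fits into the slack; the function names are hypothetical, $T_{tr}$ is taken as zero as in the example, and Task B is assumed to have six equal slices of two time units each.

```python
def choose_speed(d_r, wcet_this, wcet_rest):
    """Return 0.5 (f_max/2) if the slice fits at half speed, else 1.0 (f_max)."""
    slack = d_r - wcet_rest              # T_SLACK in the text
    return 0.5 if 2 * wcet_this <= slack else 1.0


def run_task(slice_wcets, workload, d_v_at_start=0):
    """Simulate one task instance; return (speed per slice, elapsed time).

    workload is the actual execution time as a fraction of each slice's WCET.
    """
    total_wcet = sum(slice_wcets)
    elapsed = 0.0
    speeds = []
    for i, wcet_i in enumerate(slice_wcets):
        remaining_wcet = total_wcet - elapsed       # no preemption assumed
        d_v = max(d_v_at_start - elapsed, 0)        # D_v shrinks as time passes
        d_r = max(d_v, remaining_wcet)              # real deadline
        speed = choose_speed(d_r, wcet_i, sum(slice_wcets[i + 1:]))
        speeds.append(speed)
        elapsed += workload * wcet_i / speed
    return speeds, elapsed


if __name__ == "__main__":
    print(run_task([2, 2, 2], workload=0.5))             # Task A -> ([1.0, 1.0, 0.5], 4.0)
    print(run_task([2] * 6, workload=1.0))               # Task B: all f_max, 12 time units
    print(run_task([2], workload=1.0, d_v_at_start=4))   # Task C -> ([0.5], 4.0)
```

The three results reproduce the completion times of 4, 16, and 20 in the example when the tasks run back to back.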

### 5.3.2. Hardware Implementation

Fig. 5.25 shows a photograph of the CVS experimental system. As in the previous section, where the points to note were described, an embedded system board with a Hitachi SH-4 is used as the target platform. The block diagram of the CVS experimental system is shown in Fig. 5.26.

As described in the previous section, the $\mathrm{V}_{\mathrm{DD}}$-transition time, $T_{t r}$, is $200 \mu \mathrm{~s}$. However, in the calculation of timing information, $T_{t r}$ is set to 1 ms instead of $200 \mu \mathrm{~s}$ since the resolution of the system clock recognized in $\mu$ITRON-LP is as coarse as 1 ms, as mentioned at the beginning of this section. It should be noted that $T_{t r}$ must be smaller than the system-clock resolution in order to preserve the accuracy of the system clock. Otherwise, an interrupt from the system-clock timer is not properly acknowledged because the interrupt level of the system-clock timer is lower than that of the $\mathrm{V}_{\mathrm{DD}}$-transition timer, as mentioned in the previous section.

### 5.3.3. Power Model

The measured power characteristics of the SH-4 were shown in Fig. 5.13 (a); from them, the ideal CVS behavior and power characteristics can be obtained as shown in Fig. 5.27. Since the original $\mu$ITRON executes a "NOP" loop instead of entering a sleep mode when there is nothing to do, and the loop consumes 0.58 W, the power consumption of the original $\mu$ITRON falls on Line A in the right graph. Line B shows the case in which the processor can enter a sleep mode if there is no task to execute. In a sleep mode, a processor is usually clock-gated and the dynamic power is completely cut off. Unfortunately, the original $\mu$ITRON does not support a sleep mode because the next initiation time of a real-time application cannot generally be predicted and a sleep mode is dependent on hardware. If CVS works ideally, as shown in the left graph, the power dependence on workload becomes Line C. In practice, the power dependence of CVS theoretically lies somewhere in Region $S$ between Lines B and C.

### 5.3.4. Experimental Results

In order to demonstrate the feasibility of CVS, we constructed a task set that consists of a KEYBOARD routine, an MPEG4 encoder, and a 4096-point fast Fourier transform (FFT), as indicated in TABLE 5.1. An H.263 standard sequence, "carphone", is used as the MPEG4 input data. Functional blocks in the applications are divided into slices so that code fragments can be added.

### 5.3.4.1. Operation Waveforms

Fig. 5.28 shows the measured waveforms of $V_{D D}$ and the sleep signal of the processor, in which there are five falling and five rising $V_{D D}$ transitions. Thus, the overhead of the $V_{D D}$ transitions is just 2 ms ($200 \mu \mathrm{~s} \times 10$) during the $360-\mathrm{ms}$ period.

It should be noted that 2.0 V is used only $14 \%$ of the total time on average while the sleep takes $38 \%$. This means that the remaining $48 \%$ is used for the low-power operation at 1.2 V . This gives the average workload of $38 \%(14 \% \times 1+48 \% \times 0.5+38 \% \times 0)$.

The behavior of the measured waveform can be explained as follows with the help of Fig. 5.29. Absolute time is used for simplicity.
(a) At the beginning, the KEYBOARD routine is dispatched. The virtual deadline, $D_{v}$, is set to zero because the MPEG4 and FFT routines are also in the READY queue waiting to run. Therefore, the KEYBOARD routine should complete its execution within its WCET of 2 ms, which is the real deadline, $D_{r}$. The KEYBOARD routine finishes at 2 ms since it has no data dependency and its execution time is always fixed.
(b) At 2 ms, the MPEG4 routine is executed. $D_{v}$ is also set to zero because the FFT routine is still in the READY queue. Then, $D_{r}$ becomes 81 ms because the WCET of the MPEG4 routine is 79 ms. In this task, since the workload is much lighter than the worst case, some slices at the beginning are executed at 200 MHz and the remaining slices at 100 MHz. Eventually, the MPEG4 routine ends at 22 ms.
(c) At 22 ms, the FFT routine occupies the processor. Because it is the only task requiring the processor, $D_{v}$ and $D_{r}$ are set to 120 ms, which is equal to $T_{n}$ of both the KEYBOARD and MPEG4 routines. Thus, 98 ms is available to execute the FFT routine, whose WCET is 35 ms. This means that both slices of the FFT routine can be executed at half speed. At 92 ms, upon completion, the processor goes into a sleep mode and sleeps until 120 ms because there is nothing to execute until then. The sleep mode is carried out at 100 MHz and 1.2 V to save power, as described in 5.2.2.3.
(d) At 120 ms, the second instance of the KEYBOARD routine is dispatched.
(e) At 122 ms, MPEG4 is executed again with $D_{v}$ of 180 ms, which is $T_{n}$ of the FFT routine. Since the time interval to $D_{v}$ ($180-122=58$ ms) is less than the WCET of the MPEG4 routine, the advantage of the virtual deadline cannot be exploited. In this case, $D_{r}$ is set by the WCET, which gives 201 ms. Here, unlike in the first instance, the data is close to the worst case and most slices are executed at the high speed of 200 MHz. Then, the last slice completes its execution at 196 ms.
(f) Next, the second FFT instance, which has been waiting for execution, takes over. The remaining instances can be understood similarly.
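
The deadlines quoted in steps (b), (c), and (e) can be cross-checked in absolute time with the small sketch below. The helper name `real_deadline_abs` is hypothetical, and a $D_{v}$ of zero (other tasks ready) is represented here by `None`; the times and WCETs are those given in the list above.

```python
def real_deadline_abs(now_ms, wcet_ms, d_v_abs_ms=None):
    """Absolute D_r: the later of (now + WCET) and the absolute virtual deadline."""
    own_deadline = now_ms + wcet_ms
    if d_v_abs_ms is None:          # D_v is zero: other tasks are waiting
        return own_deadline
    return max(own_deadline, d_v_abs_ms)


def fits_at_half_speed(now_ms, wcet_ms, d_r_abs_ms):
    """True if the whole task still meets D_r when every slice takes twice as long."""
    return now_ms + 2 * wcet_ms <= d_r_abs_ms


if __name__ == "__main__":
    print(real_deadline_abs(2, 79))                   # (b) MPEG4, first instance: 81
    d_r_fft = real_deadline_abs(22, 35, 120)          # (c) FFT: 120
    print(d_r_fft, fits_at_half_speed(22, 35, d_r_fft), 22 + 2 * 35)  # 120 True 92
    print(real_deadline_abs(122, 79, 180))            # (e) MPEG4, second instance: 201
```

The FFT result of 92 ms corresponds to both slices running at half speed for their full WCET, which matches the completion time in step (c).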

### 5.3.4.2. Power

Fig. 5.30 shows a comparison of the average power among the original $\mu$ITRON, CVS, and other cases. In the original $\mu$ITRON, the processor executes "NOP"s during idle time and consumes 0.66 W, while CVS is measured to consume 0.22 W when the workload is $38 \%$. If the original $\mu$ITRON supported a sleep mode, the power consumption is estimated to be 0.35 W. Unfortunately, the I/O buffers of the SH-4 do not work below 1.2 V. If the I/O buffers were designed carefully, operation below 0.9 V could be achieved instead of 1.2 V. In this case, the power of CVS would become 0.17 W, about a quarter of that in the original $\mu$ITRON. Line C' in the right graph corresponds to this case, in which the power is 0.09 W at 100 MHz and 0.05 W in the sleep mode. The power characteristic is improved compared with the $1.2-\mathrm{V}$ case, particularly in the low-workload region. Similarly, even compared with the case in which the original $\mu$ITRON uses the sleep mode at 0.9 V (Line B'), CVS still saves about half of the power.
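
These figures are consistent with a simple time-weighted estimate built from the time shares of Section 5.3.4.1 and the operating-point powers quoted in this chapter. The sketch below is such a reconstruction, not the measurement procedure; it assumes that the original $\mu$ITRON runs at $f_{\max}$ for a fraction of time equal to the workload and idles for the rest.

```python
# Time shares measured for CVS (from the text): 2.0 V, 1.2 V, and sleep.
SHARE_FMAX, SHARE_FHALF, SHARE_SLEEP = 0.14, 0.48, 0.38
WORKLOAD = SHARE_FMAX * 1.0 + SHARE_FHALF * 0.5    # 0.38

P_FMAX, P_NOP = 0.80, 0.58                          # watts, from the text


def cvs_power(p_fhalf, p_sleep):
    """Time-weighted CVS power for a given low-voltage operating point."""
    return SHARE_FMAX * P_FMAX + SHARE_FHALF * p_fhalf + SHARE_SLEEP * p_sleep


def fixed_fmax_power(p_idle):
    """Original scheduling: busy at f_max for WORKLOAD, idle otherwise (assumed)."""
    return WORKLOAD * P_FMAX + (1 - WORKLOAD) * p_idle


if __name__ == "__main__":
    print(round(fixed_fmax_power(P_NOP), 2))   # NOP loop while idle:  0.66
    print(round(fixed_fmax_power(0.07), 2))    # sleep at 1.2 V:       0.35
    print(round(cvs_power(0.16, 0.07), 2))     # CVS at 1.2 V:         0.22
    print(round(cvs_power(0.09, 0.05), 2))     # CVS at 0.9 V:         0.17
```

With the 0.9-V values, the same estimate gives a ratio of about 0.26 against the "NOP" case and about 0.52 against the sleep-mode case at 0.9 V, in line with the "quarter" and "half" stated above.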

In reality, power saving with CVS depends on combination of tasks, which in turn determines how much we can benefit from virtual deadlines. It is also dependent on variation of execution time. Nevertheless, CVS efficiently exploits a slack time among tasks and data-dependent variations of multimedia applications.

### 5.4. Summary

In this chapter, $\mathrm{V}_{\mathrm{DD}}$ hopping and its extension to $\mu$ITRON-LP were introduced.
In Section 5.2, the feasibility of $\mathrm{V}_{\mathrm{DD}}$ hopping was verified with a breadboard-level prototype using an off-the-shelf processor, and it was extended toward the design of a $\mathrm{V}_{\mathrm{DD}}$-hopping controller. $\mathrm{V}_{\mathrm{DD}}$ hopping exploits application slicing as a software approach, and the controller is composed of the power switches, plain logic, and timers as hardware. By applying $\mathrm{V}_{\mathrm{DD}}$ hopping to an MPEG4 encoder, a $75 \%$ power saving of the processor can be achieved without redesigning the processor or degrading the real-time features of the MPEG4 encoder.

In Section 5.3, $\mathrm{V}_{\mathrm{DD}}$ hopping was extended to $\mu$ITRON-LP. CVS between tasks and $\mu$ITRON-LP achieves power saving by exploiting the slack time arising from interaction among the tasks and from variation in the execution times of the tasks. The experimental results verified that $\mu$ITRON-LP achieved a $74 \%$ power saving relative to the original $\mu$ITRON under a multitasking environment when the workload was $38 \%$.

### 5.5. Appendix: 0.5-V 400-MHz $\mathrm{V}_{\mathrm{DD}}$-hopping Processor with Zero-$\mathrm{V}_{\mathrm{TH}}$ FD-SOI Technology

In Sections 5.2 and 5.3, $\mathrm{V}_{\mathrm{DD}}$ hopping for an off-the-shelf processor was described mainly as a software approach. In this section, a dedicated processor for $\mathrm{V}_{\mathrm{DD}}$ hopping in an FD-SOI process is introduced as an appendix.

The ITRS roadmap [5.23] predicts that, in 2013, $V_{D D}$ will be as low as 0.5 V. An FD-SOI process is a promising way to fabricate this generation of devices because it provides superior characteristics. Namely, the subthreshold slope is steeper than that of bulk CMOS devices, and the subthreshold swing is near the ideal value of $60 \mathrm{mV} /$decade. This feature helps to suppress the leakage current. In addition, FD-SOI devices have a smaller junction capacitance, which makes them suitable for high-speed operation.

This appendix discusses the design methodology for a processor with a target speed of 400 MHz and a supply voltage of 0.5 V. For high-speed circuits, a rule of thumb is that $V_{T H}$ should be less than $20 \%$ of $V_{D D}$. Thus, we set $V_{T H}$ to less than 0.1 V in the logic part of the processor since $V_{D D}$ is 0.5 V. The memories in the processor use higher values of $V_{D D}$ and $V_{T H}$ to suppress the leakage current of the memory cells because the memories account for most of the transistor count. Thus, a dual-$\mathrm{V}_{\mathrm{DD}}$ dual-$\mathrm{V}_{\mathrm{TH}}$ scheme is employed to achieve both low power and high speed. Moreover, a higher $V_{D D}$ enables operation at double speed (800 MHz), which allows a $\mathrm{V}_{\mathrm{DD}}$-hopping scheme to be implemented.

As described in this chapter, $\mathrm{V}_{\mathrm{DD}}$ hopping provides dynamic-power management, in which $V_{D D}$ and $f$ change adaptively depending on the workload of the processor. In contrast, in $\mathrm{V}_{\mathrm{TH}}$ hopping [5.24], $V_{T H}$ is changed directly by means of a body bias to control leakage power. $\mathrm{V}_{\mathrm{TH}}$ hopping is thought to be more effective than $\mathrm{V}_{\mathrm{DD}}$ hopping when the leakage current is large. Unfortunately, however, $\mathrm{V}_{\mathrm{TH}}$ hopping is not applicable to FD-SOI devices because they have no back gate, which is the basis for $\mathrm{V}_{\mathrm{TH}}$ hopping. Even so, with the help of DIBL (drain-induced barrier lowering), $\mathrm{V}_{\mathrm{DD}}$ hopping is still an effective low-power technique for FD-SOI devices if the leakage power is large.

Fig. 5.31 shows the block diagram of the $\mathrm{V}_{\mathrm{DD}}$-hopping processor based on the dual-$\mathrm{V}_{\mathrm{DD}}$ dual-$\mathrm{V}_{\mathrm{TH}}$ scheme. $V_{D D L}$ is the low $V_{D D}$, and is switched between 0.5 V and 1 V. $V_{T H L}$ is the low $V_{T H}$, which is 0 V. Similarly, $V_{D D H}$ is switched between 1 V and 2 V, and $V_{T H H}$ is 0.3 V. $V_{D D L}$ and $V_{T H L}$ are used in the logic part to achieve high speed, while $V_{D D H}$ and $V_{T H H}$ are used in the instruction memory, data memory, and register files, which have a low activation ratio and low dynamic power. The instruction and data memories both have a capacity of 2 kb (128 words by 16 b). The register files have room for 16 words and are based on a two-read-port, one-write-port cell. In the processor, $V_{D D H}$ tracks the change in $V_{D D L}$, since $V_{D D H}$ is $2 V_{D D L}$, so that a balance is maintained between the critical paths of the logic part and the memories. The external-memory interface downloads and uploads memory content. For high-speed operation, a VCO (voltage-controlled oscillator) generates a clock at frequencies up to 1 GHz. It can output either $f$ or $2 f$ for $\mathrm{V}_{\mathrm{DD}}$ hopping since a frequency selector is employed. The supply voltage of the VCO is always set to $V_{V C O}$ for stable operation. Monitoring $1 / 64$ of the VCO output provides accurate information on the internal operating frequency.

The ALU is based on the 16-b Kogge-Stone adder in Fig. 5.32 to achieve the highest speed. The critical path of the processor is in the adder, and the delay time is determined by a path through six gates connected in series, namely, one gate that issues a generate or propagate signal, four gates for the binary look-ahead part, and one gate that outputs a sum from a carry. At $V_{DDL}$ of 0.5 V, the delay is 1.5 ns without pipeline flip-flops and 2.1 ns with pipeline flip-flops. The ALU also has a shifter and a bit operator.
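
The six-gate critical path described above follows from the structure of a parallel-prefix adder. As a minimal behavioral sketch (not the gate-level design of Fig. 5.32; the function name `kogge_stone_add` is hypothetical), the following Python models a Kogge-Stone adder and counts its prefix levels. For 16 b there are four levels, which together with the generate/propagate stage and the sum stage gives six stages in series.

```python
def kogge_stone_add(a: int, b: int, width: int = 16):
    """Behavioral Kogge-Stone adder (carry-in fixed to 0).

    Returns (sum modulo 2**width, carry_out, number_of_prefix_levels).
    """
    mask = (1 << width) - 1
    a &= mask
    b &= mask

    # Stage 1: per-bit generate and propagate signals.
    g = [(a >> i) & (b >> i) & 1 for i in range(width)]
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(width)]
    p_bit = p[:]  # original propagate bits, needed by the sum stage

    # Prefix (binary look-ahead) stages: the span doubles at each level.
    levels = 0
    dist = 1
    while dist < width:
        g_next, p_next = g[:], p[:]
        for i in range(dist, width):
            g_next[i] = g[i] | (p[i] & g[i - dist])
            p_next[i] = p[i] & p[i - dist]
        g, p = g_next, p_next
        dist *= 2
        levels += 1

    # Final stage: sum bit i = propagate bit XOR carry into bit i,
    # where the carry into bit i is the group generate of bits 0..i-1.
    total = 0
    for i in range(width):
        carry_in = g[i - 1] if i > 0 else 0
        total |= (p_bit[i] ^ carry_in) << i
    return total, g[width - 1], levels


if __name__ == "__main__":
    s, c, levels = kogge_stone_add(0xFFFF, 0x0001)
    print(hex(s), c, "stages in series:", 1 + levels + 1)
```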

Fig. 5.33 shows a block diagram of an SRAM for the processor. We cannot use MOSFETs with $V_{THL}$ for the SRAM cell because that would result in a large leakage current and dramatically increase the power dissipation of the SRAM. Therefore, we use $V_{DDH}$ and $V_{THH}$ for the memory cells. In order to maintain stable operation, the wordlines and bitlines work at $V_{DDH}$ as well. The buffers and predecoders, however, should use $V_{DDL}$ because they are interfaces between the memory core and the logic part. Using $V_{DDL}$ there is a reasonable way to reduce power because the buffers and predecoders have long wires and many fanouts, so their dynamic power predominates. As a result of these assignments, level-up converters are needed at the interface between the memory core and the buffers.

Level-up conversion is handled by the replica-biasing level-up converter in Fig. 5.34 (a). It is twice as fast as the conventional type in Fig. 5.34 (b) [5.25] because it incorporates a decoding function and does not need a slow cross-coupled configuration. The decoder and its replica circuit may exhibit a large static current when the series of decoding nMOSFETs turns on. This static current does not matter because the replica circuit is shared, and only one of the decoders is activated and consumes power. Also, when the nMOSFETs turn on, they fight against a $V_{THH}$-pMOSFET load, which makes the input margin small. In order to deal with the small input margin, a voltage divider always keeps a bias voltage in the middle of $V_{DDL}$, thereby compensating for fluctuations in the strengths of the n- and pMOSFETs. The voltage divider, of a type commonly used in DRAMs, is useful in suppressing the static current.

Fig. 5.35 (a) shows a measurement setup with a VLSI tester, which externally switches between the two $V_{DD}$s. Even though the processor works at a high frequency, slow testing is possible because it has the external-memory interface. The monitored output of the VCO in Fig. 5.35 (b) has a voltage of 0.5 V and a frequency of 6.27 MHz. Since this is $1/64$ of the internal operating frequency, the figure shows that the processor is working at a voltage of 0.5 V and a frequency of 400 MHz.
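
The conversion from the monitored frequency to the internal frequency is a single multiplication by the divider ratio. The short sketch below makes the arithmetic explicit; the function name `internal_frequency_mhz` is hypothetical, and the divider ratio of 64 is taken from the text.

```python
def internal_frequency_mhz(monitored_mhz: float, divider: int = 64) -> float:
    """Internal clock frequency recovered from the divided-down monitor output."""
    return monitored_mhz * divider


if __name__ == "__main__":
    # 6.27 MHz at the monitor pin corresponds to roughly 400 MHz internally.
    print(round(internal_frequency_mhz(6.27), 2))
```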

Fig. 5.36 is a micrograph of the processor chip. It was made in a $0.25-\mu \mathrm{m}$, triple-metal FD-SOI process with dual $V_{TH}$s. The logic part contains about 2 k gates, including those for the peripheral circuits of the memories and register files. It was designed with the cell library described in the next paragraph. The memories, which work at a supply voltage of $V_{DDH}$, account for $85 \%$ of the transistor count.

A compact cell library was used for the logic synthesis of the processor. It has only 20 kinds of logic gates, as shown in Fig. 5.37, because a small number of logic gates does not significantly degrade the performance [5.26]. Limiting the number to 20 makes it possible to fine-tune the design of each cell so that the processor works even under the worst-case conditions. For instance, for a two-input NOR with $V_{DDL}$ of 0.5 V, correctly sizing the transistors is of critical importance because the ratio of the on-current of a pMOSFET to the off-current of an nMOSFET is only 33, which is much smaller than that in a conventional design.

The bar graphs in Fig. 5.38 (a) show a breakdown of the simulated SRAM access time for three cases. The critical delay of the processor is the memory read-out time. In case A, both $V_{DDL}$ and $V_{DDH}$ are 0.5 V, and operation is not very fast because the decoders, bitlines, and sense amplifiers in the memory core take a long time to carry out their functions. The processor is assumed to work at a frequency of 400 MHz in case B and at 800 MHz in case C. Fig. 5.38 (b) shows the measured dependence of operating frequency on $V_{DDL}$ for the three cases. The solid line is for $V_{DDH}=2 V_{DDL}$, and the dotted line is for $V_{DDH}=V_{DDL}$. The difference between the simulation and measurement results for case C is due to an error in the SPICE model file for $V_{DDL}$ of 1 V. Clearly, $V_{DDH}$ should be $2 V_{DDL}$ for high-speed operation. However, $V_{DDH}$ should not be fixed at 2 V because the replica-biasing level-up converter might fail to convert a voltage when $V_{DDL}$ is 0.5 V.

Fig. 5.39 shows the measured dependence of operating frequency on $V_{D D L}$ for $V_{D D H}=2 V_{D D L}$ at room temperature $\left(30^{\circ} \mathrm{C}\right)$ and a high temperature $\left(100^{\circ} \mathrm{C}\right)$. At room temperature, the processor operates at a speed of 400 MHz when $V_{D D L}=0.5 \mathrm{~V}$, and at 800 MHz when $V_{D D L}=0.9 \mathrm{~V}$. Therefore, $\mathrm{V}_{\mathrm{DD}}$ hopping from 400 to 800 MHz is possible by changing $V_{D D L}$ from 0.5 V to 0.9 V . Another interesting point is that the delay has positive temperature dependence. It has been pointed out that, when $V_{D D}$ is below 1 V and $V_{T H}$ is set to a moderate value, the circuit delay generally has negative temperature dependence [5.27]. However, this processor has the usual positive temperature dependence because $V_{T H}$ is zero.

The graph in Fig. 5.40 shows how leakage current depends on $V_{D D L}$ when the clock is stopped. At a high temperature of $100^{\circ} \mathrm{C}$, the leakage current is 3.6 times greater than the value at room temperature. It should be noted that at both temperatures, the leakage current strongly depends on $V_{D D L}$. This is due to DIBL. Without DIBL, the leakage current would be constant even if $V_{D D L}$ changed.

Fig. 5.41 shows the measured power characteristics at temperatures of 30 and $100^{\circ} \mathrm{C}$. The total power, $P_{\text {TOTAL }}$, is the sum of the leakage power, $P_{L E A K}$, and dynamic power. When $V_{D D L}$ changes from 0.5 V to 0.9 V , $P_{\text {TOTAL }}$ jumps from 3.5 W to 29 W at room temperature. Note that $P_{\text {TOTAL }}$ and $P_{\text {LEAK }}$ exhibit a similar dependence on $V_{D D L}$ at the two temperatures. In other words, it is possible to effectively scale the power by changing $V_{D D L}$, but not $V_{T H L}$. This, in turn, demonstrates that $\mathrm{V}_{\mathrm{DD}}$ hopping is still effective in FD-SOI circuits when the voltage is below 1 V , where we can enjoy the power-scaling benefits of $\mathrm{V}_{\mathrm{DD}}$ hopping.
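
To illustrate how the two measured operating points translate into workload-dependent power, the following sketch applies the ideal two-level hopping behavior stated in the caption of Fig. 5.27 (100 % sleep at 0 % workload, 100 % $f_{\max}/2$ operation at 50 %, and 100 % $f_{\max}$ operation at 100 %). It is a simplified model under stated assumptions, not the measured characteristic: transition time is zero, sleep power is zero by default, and the function name `average_power` is hypothetical. The power arguments are in whatever unit Fig. 5.41 uses.

```python
def average_power(workload: float, p_high: float, p_low: float,
                  p_sleep: float = 0.0) -> float:
    """Ideal average power for two-level hopping between f_max and f_max/2.

    workload: normalized workload in [0, 1]
    p_high:   power at f_max (e.g. the 0.9-V, 800-MHz point)
    p_low:    power at f_max/2 (e.g. the 0.5-V, 400-MHz point)
    p_sleep:  power while sleeping (assumed zero unless specified)
    """
    if not 0.0 <= workload <= 1.0:
        raise ValueError("workload must be in [0, 1]")
    if workload >= 0.5:
        # Time share x at f_max satisfies x + (1 - x) / 2 = workload.
        x_high = 2.0 * workload - 1.0
        return x_high * p_high + (1.0 - x_high) * p_low
    # Below 50 %: run at f_max/2 for a share of 2 * workload, sleep otherwise.
    x_low = 2.0 * workload
    return x_low * p_low + (1.0 - x_low) * p_sleep


if __name__ == "__main__":
    for w in (0.0, 0.25, 0.5, 0.75, 1.0):
        print(w, average_power(w, p_high=29.0, p_low=3.5))
```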

Fig. 5.42 analytically shows the power-scaling benefits of $\mathrm{V}_{\mathrm{DD}}$ hopping. In the formula for $P_{LEAK}$, $V_{THL}$ is zero, $\lambda$ is the DIBL coefficient, and $s$ is the subthreshold swing. The dynamic power, $P_{DYNAMIC}$, is proportional to $V_{DDL}^{2.5}$. This is derived using the $\alpha$-power law [5.17] with $\alpha$ set to 1.5. If $\lambda$ is zero, which means that there is no DIBL, there is no power-scaling benefit because the relationship between $P_{LEAK}$ and $V_{DDL}$ is linear. On the other hand, if $\lambda$ is 0.1, the simulated curve for $P_{LEAK}$ agrees well with the measured curve. Furthermore, $P_{LEAK}$ is quite similar to $P_{DYNAMIC}$, which demonstrates the power-scaling benefits of $\mathrm{V}_{\mathrm{DD}}$ hopping.
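
The proportionalities of Fig. 5.42 can be evaluated numerically to compare how the two power components scale between $V_{DDL}=0.5$ V and 0.9 V. This is a sketch only: the subthreshold swing `S_ASSUMED` is a hypothetical value chosen for illustration (the text does not give $s$), and the function names are not from the original work.

```python
S_ASSUMED = 0.08  # subthreshold swing in V/decade; hypothetical, not from the text


def p_leak_rel(v_ddl: float, lam: float, s: float = S_ASSUMED) -> float:
    """Relative leakage power: P_LEAK ∝ V_DDL * 10**(lam * V_DDL / s), with V_THL = 0."""
    return v_ddl * 10.0 ** (lam * v_ddl / s)


def p_dynamic_rel(v_ddl: float, alpha: float = 1.5) -> float:
    """Relative dynamic power: P ∝ f * V_DDL**2, with f ∝ V_DDL**alpha / V_DDL (V_THL = 0)."""
    return (v_ddl ** alpha / v_ddl) * v_ddl ** 2


def scaling_ratio(power_fn, v_low: float = 0.5, v_high: float = 0.9, **kwargs) -> float:
    """Factor by which a power component grows when V_DDL goes from v_low to v_high."""
    return power_fn(v_high, **kwargs) / power_fn(v_low, **kwargs)


if __name__ == "__main__":
    print("leakage, no DIBL (lam=0):  ", scaling_ratio(p_leak_rel, lam=0.0))
    print("leakage, with DIBL (lam=0.1):", scaling_ratio(p_leak_rel, lam=0.1))
    print("dynamic (alpha=1.5):       ", scaling_ratio(p_dynamic_rel))
```

With $\lambda=0$ the leakage ratio equals the voltage ratio, which is the linear case mentioned in the text; a nonzero $\lambda$ makes the leakage scale super-linearly, in the same direction as the dynamic power.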

### 5.6. References

[5.1] A. Chandrakasan, V. Gutnik, and T. Xanthopoulos, "Data Driven Signal Processing: An Approach for Energy Efficient Computing," Proc. ACM/IEEE Int. Symp. Low Power Elec. and Design, pp. 347-352, Aug. 1996.
[5.2] T. Kuroda, K. Suzuki, S. Mita, T. Fujita, F. Yamane, F. Sano, A. Chiba, Y. Watanabe, K. Matsuda, T. Maeda, T. Sakurai, and T. Furuyama, "Variable Supply-Voltage Scheme for Low-Power High-Speed CMOS Digital Design," IEEE J. Solid-State Circ., vol. 33, no. 3, Mar. 1998.
[5.3] "... Variable Voltage Core-Based Systems," Proc. ACM/IEEE Design Automation Conf., pp. 176-181, June 1998.
[5.4] T. Pering, T. Burd, and R. Brodersen, "The Simulation and Evaluation of Dynamic Voltage Scaling Algorithms," Proc. ACM/IEEE Int. Symp. Low Power Elec. and Design, pp. 76-81, Aug. 1998.
[5.5] T. Ishihara, and H. Yasuura, "Voltage Scheduling Problem for Dynamically Variable Voltage Processors," Proc. ACM/IEEE Int. Symp. Low Power Elec. and Design, pp. 197-202, Aug. 1998.
[5.6] T. Burd, T. Pering, A. Stratakos, and R. Brodersen, "A Dynamic Voltage Scaled Microprocessor System," IEEE Int. Solid-State Circ. Conf. Dig. Tech. Papers, pp. 294-295, Feb. 2000.
[5.7] Transmeta Crusoe page, http://www.transmeta.com/technology/.
[5.8] D. R. Ditzel, "Transmeta's Crusoe: A Low-Power x86-Compatible Microprocessor Built with Software," Proc. Int. Symp. Low-Power and High-Speed Chips (Cool Chips), pp. 1-30, Apr. 2000.
[5.9] S. Lee, and T. Sakurai, "Run-time Power Control Scheme Using Software Feedback Loop for Low-Power Real-time Applications," Proc. ACM/IEEE Asia and South Pacific Design Automation Conf., pp. 381-386, Jan. 2000.
[5.10] S. Lee, and T. Sakurai, "Run-time Voltage Hopping for Low-power Real-time Systems," Proc. ACM/IEEE Design Automation Conf., pp. 806-809, June 2000.
[5.11] T. Okuma, H. Yasuura and T. Ishihara, "Software Energy Reduction Techniques for Variable-Voltage Processors," IEEE Design and Test of Comp., vol. 18, issue 2, pp. 31-41, Mar. 2001.
[5.12] Y. Shin, and K. Choi, "Power Conscious Fixed Priority Scheduling for Hard Real-Time Systems," Proc. ACM/IEEE Design Automation Conf., pp. 134-139, June 1999.
[5.13] M. Weiser, B. Welch, A. Demers, and S. Shenker, "Scheduling for Reduced CPU Energy," Proc. USENIX Symp. Operating Sys. Design and Imple., pp. 13-23, Nov. 1994.
[5.14] F. Yao, A. Demers, and S. Shenker, "A Scheduling Model for Reduced CPU Energy," Proc. IEEE Foundations of Comp. Sci., pp. 374-382, Oct. 1995.
[5.15] C. Hwang, and A. Wu, "A Predictive System Shutdown Method for Energy Saving of Event-Driven Computation," Proc. IEEE/ACM Int. Conf. Comp.-Aided Design, pp. 28-32, Nov. 1997.
[5.16] Y. Lee, and C. Krishna, "Voltage-Clock Scaling for Low Energy Consumption in Real-time Embedded Systems," Proc. Int. Conf. Real-Time Comp. Sys. and Appli., pp. 272-279, Dec. 1999.
[5.17] T. Sakurai, and A. R. Newton, "Alpha-Power Law MOSFET Model and its Applications to CMOS Inverter Delay and Other Formulas," IEEE J. Solid-State Circ., vol. 25, no. 2, pp. 584-594, Feb. 1990.
[5.18] S. Lim, Y. Bae, G. Jang, B. Rhee, S. Min, C. Park, H. Shin, K. Park, and C. Kim, "An Accurate Worst Case Timing Analysis for RISC Processors," Proc. IEEE Real-Time Sys. Symp., pp. 97-108, Dec. 1994.
[5.19] Hitachi SuperH home page, http://www.superh.com/.
[5.20] Densan home page, http://www.densan.com/.
[5.21] Hitachi HI-Series OS page, http://www.renesas.com/eng/products/mpumcu/tool/realtime_os/itron/.
[5.22] TRON Project home page, http://www.tron.org/.
[5.23] International Technology Roadmap for Semiconductors public home page, http://public.itrs.net/.
[5.24] K. Nose, M. Hirabayashi, H. Kawaguchi, S. Lee, and T. Sakurai, "$\mathrm{V}_{\mathrm{TH}}$-Hopping Scheme to Reduce Subthreshold Leakage for Low-Power Processors," IEEE J. Solid-State Circ., vol. 37, no. 3, pp. 413-419, Mar. 2002.
[5.25] H. Zhang, and J. Rabaey, "Low-Swing Interconnect Interface Circuits," Proc. Int. Symp. Low Power Elec. and Design, pp. 161-166, Aug. 1998.
[5.26] N. D. Minh, and T. Sakurai, "Compact yet High-Performance (CyHP) Library for Short Time-to-Market with New Technologies," Proc. ACM/IEEE Asia and South Pacific Design Automation Conf., pp. 475-480, Jan. 2000.
[5.27] K. Kanda, K. Nose, H. Kawaguchi, and T. Sakurai, "Design Impact of Positive Temperature Dependence on Drain Current in Sub-1-V CMOS VLSIs," IEEE J. Solid-State Circ., vol. 36, no. 10, pp. 1559-1564, Oct. 2001.


Fig. 5.1. A conceptual diagram of $\mathrm{V}_{\mathrm{DD}}$ hopping.


Fig. 5.2. Three approaches to save power. In (a) and (b), only task periods are controlled while in (c), both $f$ and $V_{D D}$ are controlled. It is assumed that no power is consumed in a sleep mode for simplicity.


Fig. 5.3. NP dependences on $N W$. (a) "NOP" loop while waiting. $b$ is assumed to be 0.7. (b) Sleep while waiting. (c) DVS.


Fig. 5.4. An example of workload histogram in an MPEG4 encoder. An H. 263 standard sequence "carphone" is used as input data. The total number of video frames is 72. The "carphone" sequence is also used in the experiment described in this chapter.


Fig. 5.5. Application slicing. At the head of each slice, a code fragment is inserted to determine a speed of a processor.


Fig. 5.6. Temporal behaviors in $\mathrm{V}_{\mathrm{DD}}$ hopping for one video frame. (a) Power, (b) $f$, and (c) $V_{D D}$.


Fig. 5.7. Power comparison when $f_{\max } / j(j>2$, point A$)$ and $f_{\max } / 2$ (point B) are used as a second frequency.


Fig. 5.8. Numerical solution of slope of $N P(N W)$ at $N W=1 / 2$.


Fig. 5.9. (a) MPEG4 encoder system with $V_{D D}$ hopping. (b) SH-4 embedded system board. (c) $V_{D D}-$ hopping board inserted in a VME slot. (d) Backside of (c).


Fig. 5.10. Block diagram of $\mathrm{V}_{\mathrm{DD}}$-hopping system.


Fig. 5.11. $\mathrm{V}_{\mathrm{DD}}$ waveforms when there is a period while both $V_{G \max }$ and $V_{G m i n}$ are asserted. (a) Falling $V_{D D}$ from $V_{D D \max }$ to $V_{D D \min }$. (b) Rising $V_{D D}$ from $V_{D D \min }$ to $V_{D D \max }$.


Fig. 5.12. $\mathrm{V}_{\mathrm{DD}}$ waveforms when there is a period while both $V_{G \max }$ and $V_{G \min }$ are negated. (a) Falling $V_{D D}$ from $V_{D D \max }$ to $V_{D D \min }$. (b) Rising $V_{D D}$ from $V_{D D \min }$ to $V_{D D \max }$.


Fig. 5.13. (a) Measured power characteristics of $\mathrm{V}_{\mathrm{DD}}$-hopping system. (b) Power dependence on workload based on (a), (A) "NOP" loop while waiting, (B) sleep while waiting, and (C) two-level $\mathrm{V}_{\mathrm{DD}}$ hopping. (c) Power reduction ratio of $\mathrm{V}_{\mathrm{DD}}$-hopping system.


Fig. 5.14. Voltage-drop dependence on gate width of power switch. This shows the worst case because of the minimum gate bias $\left(V_{G S}=-V_{D D m i n}=-1.2 \mathrm{~V}\right)$.


Fig. 5.15. (a) Power switches with timers. (b) All-purpose decoder for power switches. (c) Clock frequency selector.


Fig. 5.16. Measured waveforms of $V_{D D}$ and sleep signal of processor.


Fig. 5.17. Power comparison between $\mathrm{V}_{\mathrm{DD}}$-hopping and fixed $-\mathrm{V}_{\mathrm{DD}}$ schemes.


Fig. 5.18. $\mathrm{V}_{\mathrm{DD}}$ hopping controller.


Fig. 5.19. Structural model of CVS. A task gets timing information and sends speed information to external $f-\mathrm{V}_{\mathrm{DD}}$ control hardware via processor. By using this speed information, a combination of $f$ and $V_{D D}$ is supplied to the processor.

```
structure ETCB {
    T_PERIOD; // Task initiation period
    T
    Tsta; // Time when dispatched
    Texe; // Time executed already
    D
};
```

Fig. 5.20. Pseudo code of ETCB structure.


Fig. 5.21. Task-state transition in $\mu$ ITRON-LP. The READY queue and $T_{n}$ queues are renewed when a task is initiated or exits.


Fig. 5.22. How to determine $D_{v}$. Cases that (a) there are two or more tasks in the READY queue, and (b) a RUN task is the only one in the READY queue.


Fig. 5.23. Method to obtain WCET of a RUN task.


Fig. 5.24. Scheduling example of Tasks A, B, and C. A horizontal axis indicates a time scale, and a height of slices shows magnitude of $f$. Cases of (a) original $\mu$ ITRON, and (b) $\mu$ ITRON-LP.


Fig. 5.25. (a) Snapshot of CVS experimental system. An output image of an MPEG4 encoder is displayed on a monitor. (b) $\mathrm{V}_{\mathrm{DD}}$ supply board on an SH-4 embedded system board.


Fig. 5.26. Block diagram of CVS experimental system.


Fig. 5.27. Ideal CVS behavior and power characteristics. The left graph shows temporal ratio when $T_{t r}$ is zero and the number of slices, $N$, is infinite. In the ideal case, at $0 \%$ workload, $100 \%$ sleep. At $50 \%$ workload, $100 \% f_{\max } / 2$ operation. At $100 \%$ workload, $100 \% f_{\max }$ operation.


Fig. 5.28. Measured waveforms of $V_{D D}$ and a sleep signal. KB indicates a KEYBOARD routine. When the sleep signal is high, a processor is in a sleep mode.

Panels: (a) KEYBOARD, (c) FFT.


Fig. 5.29. Explanation of $\mathrm{V}_{\mathrm{DD}}$ waveform in Fig. 5.28. A height of slices indicates magnitude of $f$. Contrast with the $V_{D D}$ waveform in Fig. 5.28.


Fig. 5.30. Power comparison. Lines A, B, and C in the right graph are the same ones in Fig. 5.27.


Fig. 5.31. Block diagram of $\mathrm{V}_{\mathrm{DD}}$-hopping processor.


Fig. 5.32. 16-b Kogge-Stone adder.


Fig. 5.33. Block diagram of SRAM.

Fig. 5.34. (a) Replica-biasing and (b) conventional level-up converters.

Fig. 5.35. (a) Measurement setup. (b) Monitored output of VCO.


Fig. 5.36. Micrograph of processor chip.

| INV $\times 3$ | NAND $\times 3$ |
| :--- | :--- |
| NOR $\times 3$ | AOI $\times 2$ |
| OAI $\times 2$ | EXOR |
| EXNOR | MUX $\times 2$ |
| DFF $\times 2$ | CLKBUF |

Fig. 5.37. Types of gates in compact cell library.

Fig. 5.38. (a) Breakdown of access time and (b) performance of SRAM.


Fig. 5.39. Measured operating frequency.


Fig. 5.40. Measured leakage current.


Fig. 5.41. Measured power.


$$
\begin{aligned}
P_{LEAK} & \propto V_{DDL} \cdot I_{0} \cdot 10^{-\frac{V_{THL}-\lambda \cdot V_{DDL}}{s}} \\
& \propto V_{DDL} \cdot 10^{\frac{\lambda \cdot V_{DDL}}{s}} \\
P_{DYNAMIC} & \propto f \cdot V_{DDL}^{2} \\
& \propto \frac{\left(V_{DDL}-V_{THL}\right)^{\alpha}}{V_{DDL}} \cdot V_{DDL}^{2} \\
& \propto V_{DDL}^{2.5}
\end{aligned}
$$

Fig. 5.42. Power scaling.

TABLE 5.1. Characteristics of applications in CVS.

| Application | \# slices | WCET | Function |
| :---: | :---: | :---: | :--- |
| KEYBOARD | 1 | 2 ms | Polling |
| MPEG4 | 1 | 1 ms | Initialization |
|  | 20 | 64 ms | Macroblock calculation |
|  | 1 | 14 ms | Display |
| FFT | 1 | 2 ms | Bit-reversal |
|  | 1 | 33 ms | Danielson-Lanczos |

# 6. Active-Matrix and Hierarchical Structure in Organic Large-Area Sensors 

### 6.1. Introduction

Organic circuits [6.1]-[6.3] are attracting attention as a complement to high-performance yet expensive silicon VLSIs. By using OFETs (organic field-effect transistors), large-area circuits can be made on a plastic film. It is believed that the fabrication cost of OFETs will be low, possibly with roll-to-roll processes and printing technologies, which means that a low cost per area can be expected in the future. Besides, thanks to the plastic substrate, organic circuits are mechanically flexible. These features are suitable for large-area sensor applications, which can complement small-area silicon ICs.

In large-area sensors, a passive matrix without switches is not preferable since it results in large leakage. Fig. 6.1 shows a typical passive matrix. Leakage currents due to wordline-voltage mismatches and bitline-voltage drops flow through sensors over the whole matrix. Thus, leakage power increases quadratically as the matrix size increases. In Section 6.3, the e-skin (electronic artificial skin) is described, in which an active matrix with OFETs is adopted as crossbar switches. The active matrix will become more important for suppressing power in future large-area sensors.

Apart from large-area sensors, the major driving applications of OFETs have recently been RFID (radio-frequency identification) tags and displays, including organic EL (electroluminescence) displays and e-paper (electronic paper). These applications, except e-paper, sometimes require higher-speed operation than OFETs have ever achieved. Moreover, a silicon RFID tag is so small that it does not break even when embedded in a sheet of paper. Meanwhile, an organic circuit can bend, but breaks if it is bent sharply. In addition, a silicon RFID tag is potentially so cheap that it could be difficult for an organic circuit to compete with its silicon counterpart. Thus, in the near future, it will be difficult for organic electronics to compete with silicon electronics in these applications, and we believe that the most suitable application of organic electronics is a large-area sensor.

In reality, an OFET is slow. However, as mentioned above, OFET circuits have a definite advantage over silicon VLSIs when cost per area is considered, although they may not compete with silicon VLSIs when cost per function is considered. The carrier mobility of our OFET is about $1 \mathrm{~cm}^{2} / \mathrm{Vs}$, which is three orders of magnitude lower than that of silicon. According to published papers, the fastest silicon VCO (voltage-controlled oscillator) operates at 114 GHz [6.4] while the fastest ring oscillator made of OFETs works at only 11 MHz [6.5]. Therefore, a hierarchical structure is effective in speeding up organic circuits. The hierarchical structure also decreases circuit power, which worsens quadratically as the sensor area increases. In Section 6.4, a sheet-type scanner, in which a double-wordline and double-bitline structure is implemented, is described. The hierarchical structure not only improves the speed but also saves the power of the sheet-type scanner.

### 6.2. OFET (Organic Field-Effect Transistor)

### 6.2.1. Manufacturing Process

The manufacturing process of our OFET devices is described in detail in [6.6], and is briefly summarized in Fig. 6.2.
(a) The base film is PEN (polyethylene naphthalate) or PI (polyimide), both of which are plastics. First, gate electrodes consisting of a 5-nm Cr (chromium) adhesion layer and a 100-nm Au (gold) layer are deposited on the base film through a shadow mask with a vacuum evaporator.
(b) Then PI is spin-coated at a rotation speed of $3,000 \mathrm{rpm}$ as a gate insulator and cured at $180^{\circ} \mathrm{C}$ for 1 hr in a clean oven (class 100) under a nitrogen environment. This is a low-temperature process that does not damage the plastic base film. The thickness of the PI is 900 nm in the e-skin, and 630 nm in the sheet-type scanner.
(c) Some parts of the gate insulator are removed with a $\mathrm{CO}_{2}$ laser to make via holes, as described in the following subsection.
(d) Next, pentacene is deposited as an organic semiconductor through a shadow mask by vacuum sublimation. The pressure is $30 \mu \mathrm{~Pa}$ at ambient substrate temperature. The nominal thickness of the pentacene layer is 50 nm. The chemical structure of pentacene is shown at the bottom of the figure. Pentacene is one of the fastest and most popular low-molecular-weight p-type organic semiconductors. Deposition of the pentacene thin layer requires a vacuum system, but the mobility sometimes exceeds $1 \mathrm{~cm}^{2} / \mathrm{Vs}$. This value is one or two orders of magnitude higher than those of n-type organic semiconductors and polymers.
(e) Finally, $60-\mathrm{nm}$-thick Au is deposited through a shadow mask to form source and drain electrodes. A minimum channel length of $40 \mu \mathrm{~m}$ is possible in our process.

The OFET structure is called top-contact geometry. The device dimensions are determined by the resolution of the shadow masks. We adopted a $100-\mu \mathrm{m}$ rule in the e-skin, and a $40-\mu \mathrm{m}$ rule in the sheet-type scanner. The initial transistor yield exceeds $99 \%$. The major fault mode is gate leakage caused by a pinhole. One may think of incorporating redundancy structures and error-correction codes in a matrix. Unfortunately, unlike a memory, a sensor is close to a display, in which the physical location of a cell is meaningful. As is common practice in a normal display, a certain number of defects will be tolerated even in a product.

### 6.2.2. Via Holes

A $\mathrm{CO}_{2}$ laser selectively forms via holes through the PI gate insulator. The $\mathrm{CO}_{2}$ laser can drill one via hole per second, and implements a drain-to-gate interconnection. In order to improve the productivity of the laser via, other industrial laser-drill machines could be utilized. When the laser power is set to more than 8 mJ, good interconnections can be achieved. By optimizing the conditions of the laser-via process, the yield exceeds $99 \%$ per pulse when the criterion for the conductance is set to more than $10^{-2} \mathrm{~S}$. This result means that the yield becomes as high as $99.99 \%$ if two laser pulses are irradiated onto each electrode, which is the approach adopted in this study. Fig. 6.3 is a micrograph of the laser via with a diameter of $90 \mu \mathrm{~m}$.
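
The step from a $99 \%$ per-pulse yield to a $99.99 \%$ two-pulse yield can be checked with a short calculation. The sketch below assumes that the pulses succeed or fail independently, which the text implies but does not state; the function name `via_yield` is hypothetical.

```python
def via_yield(per_pulse_yield: float, pulses: int) -> float:
    """Probability that at least one of `pulses` independent laser pulses
    forms a good via, given the success probability of a single pulse."""
    if not 0.0 <= per_pulse_yield <= 1.0:
        raise ValueError("per_pulse_yield must be in [0, 1]")
    if pulses < 1:
        raise ValueError("pulses must be at least 1")
    return 1.0 - (1.0 - per_pulse_yield) ** pulses


if __name__ == "__main__":
    for n in (1, 2, 3):
        print(n, "pulse(s):", via_yield(0.99, n))
```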

### 6.2.3. DC Characteristics

Fig. 6.4 shows the measured $V_{DS}-I_{DS}$ characteristics of the fabricated p-type OFET when the gate insulator is 900 nm thick and the channel width/length is $2,000 / 100 \mu \mathrm{~m}$. The typical supply voltage of the OFET is 40 V, and the ESD (electrostatic discharge) immunity is 200 V with a $10-\mathrm{M} \Omega$ protection resistor when the IEC 61000-4-2 test is carried out. In circuit designs, only p-type OFETs are used since the mobility of n-type organic semiconductors reported so far is $0.1 \mathrm{~cm}^{2} / \mathrm{Vs}$ at best, which is an order of magnitude smaller than that of pentacene. Consequently, circuits made of n-type OFETs are slow. Moreover, n-type materials are much more sensitive to oxygen and humidity than pentacene, and deteriorate in a shorter time in the atmosphere.

We verified for the first time that the measured curves can be closely reproduced by simulation based on the level-1 SPICE MOS model, with $200-\mathrm{k} \Omega$ series resistors at both the source and drain, when $W$ (channel width) and $L$ (channel length) are 2 mm and $100 \mu \mathrm{~m}$, respectively. The maximum error between the measured results and the SPICE simulation is $7.2 \%$ when the on-current, $I_{D0}$, is set to $100 \%$. Since $I_{D0}$ is fit to $100 \%$, delay simulation is sufficiently accurate. It is possible to predict organic circuit behavior with SPICE simulations, which is good news for the circuit community. The layout of the circuit is carried out with an existing EDA tool, which is also good news. The GDS II data resulting from the layout design is converted into the DXF file format, and then handed to a metal-mask manufacturer.

$I_{DS}$ changes over time in the atmosphere. The rapid $I_{DS}$ change occurs on the order of minutes to days with the materials and structure used in this study. This is probably the most serious problem related to OFETs, but silicon suffered from the same problem in its very early days, and it has been fully remedied by now. OFETs have a hysteresis in $I_{DS}$, but it does not affect digital circuits.
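
The modeling approach described in this subsection, a level-1 (square-law) MOS model with series resistors at the source and drain, can be sketched as follows. This is an illustrative reimplementation, not the SPICE deck used in the study. Taken from the text are the mobility of about $1 \mathrm{~cm}^{2}/\mathrm{Vs}$, the 900-nm gate insulator, $W/L$ of 2 mm / $100 \mu \mathrm{m}$, the $200-\mathrm{k}\Omega$ resistors, and the 40-V supply. The relative permittivity of PI (`EPS_R_ASSUMED`), the threshold-voltage magnitude (`VTH_ASSUMED`), and the neglect of channel-length modulation are assumptions. Voltages and currents are handled as magnitudes, so the p-type polarity is dropped.

```python
EPS0 = 8.854e-12       # vacuum permittivity, F/m
EPS_R_ASSUMED = 3.5    # relative permittivity of PI (assumed, not from the text)
VTH_ASSUMED = 10.0     # |V_TH| in volts (assumed, not from the text)


def beta_from_geometry(mobility, t_ins, width, length, eps_r=EPS_R_ASSUMED):
    """Transconductance parameter beta = mu * C_ox * W / L (SI units)."""
    c_ox = EPS0 * eps_r / t_ins
    return mobility * c_ox * width / length


def level1_id(vgs, vds, beta, vth):
    """Square-law drain current of the intrinsic transistor (magnitudes)."""
    vov = vgs - vth
    if vov <= 0.0 or vds <= 0.0:
        return 0.0
    if vds < vov:                       # linear region
        return beta * (vov - vds / 2.0) * vds
    return 0.5 * beta * vov ** 2        # saturation region


def id_with_series_r(vgs, vds, beta, vth, rs, rd, iterations=80):
    """Solve I = Id(Vgs - I*Rs, Vds - I*(Rs + Rd)) for I by bisection."""
    if vds <= 0.0:
        return 0.0
    if rs + rd == 0.0:
        return level1_id(vgs, vds, beta, vth)
    lo, hi = 0.0, vds / (rs + rd)       # current cannot exceed Vds / (Rs + Rd)
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        model = level1_id(vgs - mid * rs, vds - mid * (rs + rd), beta, vth)
        if model > mid:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    beta = beta_from_geometry(1e-4, 900e-9, 2e-3, 100e-6)
    ideal = level1_id(40.0, 40.0, beta, VTH_ASSUMED)
    real = id_with_series_r(40.0, 40.0, beta, VTH_ASSUMED, 200e3, 200e3)
    print(f"on-current without series R: {ideal * 1e6:.1f} uA")
    print(f"on-current with 200-kOhm R:  {real * 1e6:.1f} uA")
```

Bisection is used here because the residual is monotonic in the current, so the solver cannot diverge; a circuit simulator would normally solve the same equations with Newton iteration.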

### 6.3. E-Skin (Electronic Artificial Skin) with Active Matrix

The importance of pressure sensing is increasing in applications such as area sensors and next-generation robots. Recently, we manufactured a large-area, flexible pressure-sensor array with OFETs, and successfully took a pressure image with a resolution of 10 dpi over $4 \times 4 \mathrm{~cm}^{2}$ [6.7]. In this work, by combining the pressure-sensor array with row decoders and column selectors, a customized OFET IC is realized as an e-skin (electronic artificial skin) system.

In addition, a scalable-circuit concept based on cut-and-paste customization for organic ICs is proposed and demonstrated by physically cutting out a part of a circuit and pasting it to another circuit with a connecting plastic tape. The organic circuits are designed with a level-1 SPICE MOS model and a standard layout design tool, and the operation of the e-skin is confirmed by measurement.

### 6.3.1. Device Structure

A cross section of a sensor cell is illustrated in Fig. 6.5 (a). The device structure of the OFET looks similar to an upside-down silicon MOSFET, but the channel layer is made of pentacene. On the OFET sheet, a through-hole sheet, a pressure-sensitive conductive rubber sheet, and a top electrode sheet are laminated to form the sensor cells, whose circuit diagram is shown in Fig. 6.5 (b). Hereafter, we call the sensor cell a "sencel" for short. The through-hole sheet, with through-holes $100 \mu \mathrm{~m}$ in diameter, is prepared by the conventional method of making flexible circuit boards, which combines chemical etching, mechanical drilling, and plating. It should be noted that this through-hole connects the pressure-sensitive conductive rubber with an access OFET, and is totally different from a laser via, which connects the gate and drain of an OFET. The pressure-sensitive conductive rubber sheet is a $0.5-\mathrm{mm}$-thick silicone rubber containing graphite particles. The top electrode sheet has a Cu (copper) electrode layer suspended by PI. WL and BL denote a wordline and a bitline, respectively.

### 6.3.2. Cut-and-Paste Customization

Fig. 6.6 shows a photograph of an assembled e-skin system. A $16 \times 16$ sencel matrix, row decoders, and column selectors are separately fabricated. The sencel size is $2.54 \times 2.54 \mathrm{~mm}^{2}\left(0.1 \times 0.1 \mathrm{in}^{2}\right)$, which corresponds to 10 dpi and the total area of the sencel matrix is $4 \times 4 \mathrm{~cm}^{2}$. The three parts are connected together using a PET film with evaporated Au stripes with a $2.54-\mathrm{mm}$ pitch and conductive glue. We call this film a connecting tape, which enables the cut-and-paste customization in size.

The circuit diagram of the e-skin system is shown in Fig. 6.7. All inputs are driven by Toshiba TD62981P buffers that can output up to 120 V. The 4-b data outputs $\left(D_{0}, \ldots, D_{3}\right)$ are externally pre-discharged by the pre-discharge signal, $\phi_{D}$, with nMOSFETs (Siemens BSS101). The off-current of the nMOSFET is less than 30 nA. A high-voltage probe (Tektronix P6015A) presents a light load ($100-\mathrm{M} \Omega$ resistance and $3-\mathrm{pF}$ capacitance) so that the OFETs can drive it. The 3-pF capacitance is not a problem because the gate capacitance of the OFET is as much as 60 pF. As shown in Fig. 6.6, the I/O pads are wide enough for test clips to connect test leads easily.

In order to realize the cut-and-paste customization, all parts of the e-skin system must have scalability in size. Since the sencel matrix has a simple repetition of sencels, it is scalable and it is easy to customize a size just by cutting a required part out of the original sheet. In order to maintain the scalability of the row decoder and column selector, they are laid out so that a sencel matrix of any $m$ rows by $4 n$ columns ( $m \leq 16$, $n \leq 4)$ can be driven.

The left figure in Fig. 6.8 shows the original row decoders, which activate one wordline out of 16 wordlines. If one-out-of-four decoders are needed, the dotted rectangle should be cut out of the original row decoders. The scalability is accomplished by using wired-NAND decoders. In the figure, $/\phi_{R}$ is the row-decoder activation signal, which makes one wordline "L" to activate the sencels on that wordline. The decoder circuit is explained in detail in the following subsection. This type of circuit can be achieved thanks to the high on/off ratio of more than $10^{5}$, and is preferable for OFETs since the characteristics of OFETs suffer from large process variation. Similarly, the column selectors are scalable, as shown in Fig. 6.9.
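
The scalability rules above can be summarized in a small behavioral model. This is only a logic-level sketch, not the wired-NAND circuit of Fig. 6.8: the function names are hypothetical, the activation signal $/\phi_{R}$ is assumed to be active low, and a cut-out decoder is assumed to behave as a smaller one-out-of-$m$ decoder.

```python
MAX_ROWS = 16          # m <= 16 (from the text)
MAX_COL_GROUPS = 4     # n <= 4 (from the text)
COLS_PER_GROUP = 4     # columns come in groups of four, matching the 4-b output


def is_valid_cut(rows: int, cols: int) -> bool:
    """True if a rows-by-cols sencel matrix can be driven by parts cut
    from the original decoder and selector sheets (m rows by 4n columns)."""
    return (1 <= rows <= MAX_ROWS
            and cols % COLS_PER_GROUP == 0
            and COLS_PER_GROUP <= cols <= MAX_COL_GROUPS * COLS_PER_GROUP)


def row_decoder(address: int, rows: int = MAX_ROWS, phi_r_n: int = 0):
    """Wordline levels of a one-out-of-`rows` decoder.

    phi_r_n models /phi_R (assumed active low). When the decoder is
    dormant, every wordline stays "H"; when activated, only the
    addressed wordline goes "L".
    """
    if not 0 <= address < rows:
        raise ValueError("address out of range")
    if phi_r_n != 0:
        return ["H"] * rows
    return ["L" if r == address else "H" for r in range(rows)]


if __name__ == "__main__":
    print(is_valid_cut(4, 4), is_valid_cut(16, 16), is_valid_cut(5, 6))
    print(row_decoder(2, rows=4))
    print(row_decoder(2, rows=4, phi_r_n=1))
```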

Since the three parts are scalable, even if, for example, a smaller e-skin system with $4 \times 4$ sencels surrounded by the dashed line in Fig. 6.7 is necessary, the scaled-down version works without modification. The photograph of the $4 \times 4$ version is shown in Fig. 6.10. It works fine as long as the matrix is convex and the row-decoder and column-selector sides of the matrix are not cut off. Therefore, the corner between the two other sides can be cut and removed. Unlike a normal memory matrix, the sencel matrix does not need to be rectangular. This feature is suitable for a robot application, in which a non-rectangular area sensor is sometimes needed.

Although the fabricated sencel matrix has $16 \times 16$ cells, the concept can be extended not only to a smaller size but also to an arbitrarily larger size. Long sheets of row decoders and column selectors, and a large sencel matrix, can be fabricated and prepared in advance. Once the required size and shape are fixed, the appropriate parts of the circuits are cut out of the prefabricated sheets and then glued together with connecting tapes. For a humanoid robot, the fingers need small e-skins while the body requires a large e-skin. It is not cost-effective to prepare masks and products for every different size. In terms of mask cost, the cut-and-paste customization is preferable because no new masks need to be designed or made for the various sizes. This reduces NRE (nonrecurring engineering) cost and design turnaround time as well.

### 6.3.3. Boosted-Gate E/E (Enhancement/Enhancement) Configuration

Fig. 6.11 shows three candidates for a static decoder circuit. It should be noted that the fabricated OFET exhibits enhancement-type characteristics and that, unlike in silicon, the threshold voltage of an OFET cannot be changed by impurity doping. The first candidate in Fig. 6.11 (a) has an off-state load; it is similar to an E/D (enhancement/depletion) configuration, but the load is also an enhancement type. Consequently, the load OFET must be large, which leads to slow speed. The second candidate is an E/E (enhancement/enhancement) configuration in Fig. 6.11 (b), in which the load acts as a diode. However, the current drive of the load in this configuration is low when the output is around the threshold voltage, $\left|V_{THP}\right|$. Therefore, the output does not go below $\left|V_{THP}\right|$, which results in a low operational margin.

The last candidate in Fig. 6.11 (c) has a boosted-gate load and is the one adopted as the static decoder in the e-skin. The waveforms of the output and the decoder activation signal, $/\phi$, are shown in the figure. The negatively boosted gate voltage keeps the output "H" in a dormant state, while the positively boosted gate voltage accelerates the transition to "L". The "H" outputs suppress a leakage current in the sencel matrix as well as a leakage current of the decoders themselves. The output easily goes to the ground level, which increases the on-current of a pressed sencel in the matrix. We name this circuit a boosted-gate E/E. Although $/\phi$ swings beyond the rails, there is no reliability issue for the gate insulator because the gate-breakdown voltage is more than 100 V. Since $/\phi$ is supplied from an external circuit, it can also be used to compensate for an operating-point shift caused by variability.

The input-output transfer characteristics of the three static decoders are shown in Fig. 6.12. The decoder with the off-state load works nearly between the rails, and opens an eye pattern with some SNM (static noise margin) as shown in Fig. 6.12 (a). A circuit with a deep logic depth, like an inverter chain, requires an appropriate SNM. On the other hand, the diode-load decoder has a smaller SNM and cannot output a complete "L": as shown in Fig. 6.12 (b), its "L" level stays above the ground by $\left|V_{THP}\right|$. The boosted-gate E/E decoder that is adopted in the e-skin can output nearly rail-to-rail voltages ($V_{H}$ and $V_{L}$) as shown in Fig. 6.12 (c). Although the boosted-gate E/E decoder is superior to the others in terms of speed and operational margin, $V_{L}$ worsens as the number of off-state OFETs in the decoder ($M-1$ in Fig. 6.11) increases.

### 6.3.4. Scalability Limit

In this subsection, the factors that limit the scalability of the e-skin are discussed. The S/N (signal-to-noise) ratio determines the maximum number of sencels in the matrix.

Fig. 6.13 (a) shows the case in which only one of the accessed sencels is pressed and no others are pressed, so that the on-current of the pressed sencel, $I_{SON}$, flows to the bitline. This corresponds to the smallest on-current, which should be compared with the largest off-current. Fig. 6.13 (b) shows the largest off-current case, in which only one of the accessed sencels is not pressed and all the others are pressed. The largest off-current is $\left(2^{M}-1+2^{N-2}-1\right) I_{SOFF}$, where $2^{M}$ and $2^{N}$ are the numbers of wordlines and bitlines, respectively. Note that there are 4-b output data through the column selectors in the e-skin system, and thus the number of bitlines connected in parallel is $2^{N-2}$. Accordingly, the S/N ratio is $I_{SON} /\left[\left(2^{M}-1+2^{N-2}-1\right) I_{SOFF}\right]$. If $M=N$ and both are large, the S/N ratio is approximately $0.8\, I_{SON} /\left(2^{M} I_{SOFF}\right)$.
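
The factor of 0.8 in the large-matrix limit follows in one step. As a worked check, assuming $M=N$ and neglecting the two $-1$ terms against $2^{M}$:

$$
\frac{S}{N}=\frac{I_{SON}}{\left(2^{M}-1+2^{N-2}-1\right) I_{SOFF}} \approx \frac{I_{SON}}{\left(2^{M}+2^{M-2}\right) I_{SOFF}}=\frac{I_{SON}}{1.25 \cdot 2^{M}\, I_{SOFF}}=0.8\, \frac{I_{SON}}{2^{M}\, I_{SOFF}}
$$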

Although the original on/off ratio of the OFET is more than $10^{5}$, $I_{SON} / I_{SOFF}$ can be lowered to $10^{3}$ because $V_{H}$ and $V_{L}$ are not strictly on the rails: $V_{H}$ is slightly lower than $V_{DD}$, and $V_{L}$ is degraded depending on $M$. Therefore, the theoretical maximum size is around $512 \times 512$ sencels in the e-skin system. This limit, however, could be raised almost without bound if a hierarchical arrangement of sencels is adopted and more complicated circuits are manufactured. A hierarchical approach will be discussed in the next section.
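
The size limit quoted above can be reproduced numerically. The following is a minimal sketch, not the author's procedure: it assumes that a matrix is usable while the S/N ratio stays at or above 1 (a threshold chosen here for illustration), that $M=N$, and that $M$ and $N$ are exponents as they appear in the off-current expression. The function names are hypothetical.

```python
def sn_ratio(m_bits: int, n_bits: int, on_off: float) -> float:
    """S/N = (I_SON / I_SOFF) / (2^M - 1 + 2^(N-2) - 1)."""
    leak_paths = (2 ** m_bits - 1) + (2 ** (n_bits - 2) - 1)
    return on_off / leak_paths


def max_address_bits(on_off: float, min_sn: float = 1.0) -> int:
    """Largest M (= N) whose S/N is still >= min_sn (assumed criterion)."""
    m = 2  # smallest case with a 4-b parallel output (2^(N-2) = 1)
    while sn_ratio(m + 1, m + 1, on_off) >= min_sn:
        m += 1
    return m


if __name__ == "__main__":
    m = max_address_bits(1e3)  # effective on/off ratio of 10^3
    print(f"M = N = {m}: {2 ** m} x {2 ** m} sencels, "
          f"S/N = {sn_ratio(m, m, 1e3):.2f}")
```

With an effective $I_{SON}/I_{SOFF}$ of $10^{3}$ this criterion stops at $M=N=9$, that is, $512 \times 512$ sencels, which is consistent with the figure stated in the text.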

### 6.3.5. Measurement Results

Fig. 6.14 shows the measured dependence of $I_{DS}$ on pressure. The resistance of the pressure-sensitive conductive rubber changes rapidly from $10 \mathrm{M} \Omega$ to $1 \mathrm{k} \Omega$ when a certain pressure is applied, and thus the rubber is not suitable for an analog circuit. In the e-skin system, it is used digitally. The off-resistance of the pressure-sensitive conductive rubber is sufficiently larger than the drain resistance, and the on-resistance is much smaller than the drain resistance, over a wide temperature range between $-30^{\circ} \mathrm{C}$ and $120^{\circ} \mathrm{C}$ [6.8]. When a rectangular object presses on one line of the sencel matrix, only the corresponding parts of the pressure-sensitive conductive rubber turn on, and the corresponding sencels pull the bitlines up to $V_{DD}$ of 40 V as shown in Fig. 6.15. The measured dynamic power of the sencel matrix is $100 \mu \mathrm{W}$ for the $16 \times 16$ sencels. The static power is $20 \mu \mathrm{W}$ when $100 \%$ of the area is pressed.

Fig. 6.16 shows the simulated access-time dependence on the sencel size. Since the Au interconnection is wide and its resistance is negligible under a $100$-$\mu \mathrm{m}$ rule, the access time depends linearly on the size. As is known from silicon memory designs, if the matrix is large, a double-wordline and double-bitline structure can be adopted at some extra cost in order to reduce the delay caused by long and capacitive wordlines and bitlines. The hierarchical structure will be discussed in the next section.

Fig. 6.17 shows the measured and simulated operation waveforms of the e-skin system. The access time from the row-decoder activation signal, $/\phi_{R}$, to a bit-out signal is 23 ms when the sense voltage of the bit-out signal is 20 V. The cycle time can be within 30 ms, which means that the time to scan over the whole $16 \times 16$ sencels is about 2 s ($=16 \times 4 \times 30$ ms) since 4-b output data can be read out in parallel. In order to shorten the cycle time, reducing the line widths and capacitances of the wordlines, bitlines, and other bus lines would be effective.
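
The scan-time arithmetic above generalizes to other cut-out sizes. The sketch below is illustrative only: it assumes one read cycle per group of four columns on each wordline and a fixed 30-ms cycle regardless of matrix size, which the text does not claim for sizes other than $16 \times 16$. The function name is hypothetical.

```python
def full_scan_time_s(rows: int, cols: int,
                     bits_per_read: int = 4,
                     cycle_s: float = 0.030) -> float:
    """Time to read every sencel once, bits_per_read columns per cycle."""
    reads_per_row = -(-cols // bits_per_read)  # ceiling division
    return rows * reads_per_row * cycle_s


if __name__ == "__main__":
    print(f"16 x 16: {full_scan_time_s(16, 16):.2f} s")
    print(f" 4 x 4 : {full_scan_time_s(4, 4):.2f} s")
```

For the $16 \times 16$ matrix this gives $16 \times 4 \times 30$ ms, the value rounded to about 2 s in the text.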

The access-time dependence on $V_{DD}$ is shown in Fig. 6.18. By increasing $V_{DD}$ up to 100 V, the delay can be reduced to about a half. The simulation using the abovementioned level-1 SPICE MOS model agrees well with the measurement points. In future, however, the high operation voltage of 40 V should be lowered to less than 12 V for typical consumer use. Lowering the operation voltage is not considered very difficult: it can be done by introducing a shorter channel length to the OFETs, which is feasible since the present channel length is $100 \mu \mathrm{m}$ and there is much room to shrink it without technical obstacles. Thinning the gate insulator, which can be achieved by increasing the rotation speed of spin coating, enhances the conductivity of an OFET without degrading the surface roughness [6.6]. The use of high-k materials is also expected to help lower the operation voltage, as it does in silicon.

The OFET can be bent down to 5 mm in radius without material fatigue or snapping off although the weakest part in bending is a source/drain electrode made of Au. This value is sufficient to wrap around a surface of a round object such as a robot. A current change caused by bending was also measured using a bare OFET without a pressure-sensitive conductive rubber or encapsulation. Even when an OFET is bent down to 5 mm in radius, the current decrease is just 3\% [6.9]. This demonstrates the mechanical flexibility of the organic circuits.

### 6.4. A Sheet-Type Scanner with Double-Wordline and Double-Bitline Structure

A recent advance in organic large-area sensors is the integration of an OFET and an OPD (organic photodiode) into a sheet-type scanner [6.10]. In order to improve speed and make the scanner practical, a double-wordline and double-bitline structure is implemented in organic circuits for the first time. The structure can be applied not only to the scanner but also to other organic large-area sensors, and it saves power as well as circuit delay. This section describes the circuits of the sheet-type scanner, and demonstrates the advantages of the double-wordline and double-bitline structure with the scanner taken as an example.

### 6.4.1. Device Structure and Operation Principle

Fig. 6.19 (b) illustrates the cross section of the sheet-type scanner. The two OFET sheets and one OPD sheet are separately fabricated and glued together with silver paste. All base films are transparent PEN (polyethylene naphthalate), a kind of plastic, each $125 \mu \mathrm{m}$ thick, through which light can pass. The aperture (open-area ratio) is $45 \%$ of the total pixel area as shown in Fig. 6.19 (a). The light goes through the three sheets and is then reflected by the surface of the paper under scan toward the OPDs. Black and white are discriminated by the difference in reflectance, which in turn modulates the photocurrent of the OPD. That is, in a black part the light is hardly reflected, while in a white part the light is reflected to the OPD.

Thus, the sheet-type scanner can capture a black-and-white image on a paper without heavy mechanical components or optical lens.

A parylene (poly-monochloro-para-xylylene) passivation layer is formed on the OFETs in situ so that they are not exposed to the air. Parylene protects the OFETs from oxygen and humidity, which deteriorate organic devices, and thus it is very useful for enhancing the durability and reliability of the organic devices. On the parylene passivation and on the bottom of the PEN base film, there are top and bottom metals, respectively, for interconnections and for connectivity to other sheets. If a connection through a PEN base film, PI gate dielectric, or parylene passivation is necessary, a $\mathrm{CO}_{2}$ laser selectively drills a via hole. This fabrication process is described in detail in [6.6]-[6.7].

The OPDs are the basis of the scanner. As a common anode of the OPDs, transparent ITO (indium tin oxide) covers the PEN base film. CuPc (copper phthalocyanine) is a p-type semiconductor and PTCDI (3,4,9,10-perylene tetracarboxylic diimide) is an n-type one; together they form an OPD [6.11]-[6.12]. As a cathode, Au is deposited onto the OPD. Parylene passivates the OPDs, as in the OFET case. Additional top metals on the parylene passivation provide connectivity to the OFET sheet \#1.

Since a reflection type of operation is adopted as the principle, direct incident light must be blocked. Otherwise, all pixels become white. The cathode, whose size is $900 \times 900 \mu \mathrm{m}^{2}$, acts as a shield against the direct incident light as shown in Fig. 6.19 (a). All OFETs are placed only over the cathodes.

The circled structure in Fig. 6.19 (b) corresponds to the right-side circuit schematic, including an OPD, pixel selector, and peripheral OFETs. The circuit design is discussed in the next subsection.

### 6.4.2. Circuits

Fig. 6.20 shows the circuit schematic of the proposed double-wordline and double-bitline structure in the sheet-type scanner. Only p-type OFETs are used, as in the e-skin system. The supply voltage, $V_{DD}$, is 40 V. An array of $64 \times 64$ pixels is divided into $8 \times 8$ blocks so that each block has $8 \times 8$ pixels. Every pixel has an OPD and a pixel selector.

A 1WL (first wordline) connects to a 2WL (second wordline) through a 1WL selector (first-wordline selector). A 1WL activates the gates of the aligned pixel selectors to specify a local row address. A 1WL selector selects a 1WL with $/1WLS_{x}$ (first-wordline select signal). A 2WL decoder (second-wordline decoder), which is described later, drives a 2WL.

Similar notations are used for the bitlines. A 1BL (first bitline) is a local bitline in a block. A precharge gate precharges a 1BL with $/R$ (precharge signal), and this signal also pre-discharges a 2BL (second bitline) before the readout operation. An amplifier amplifies the 1BL voltage. A 1BL selector (first-bitline selector) selectively transfers the amplified voltage to a 2BL with $/1BLS_{x}$ (first-bitline select signal). For the readout operation, a photocurrent-integration scheme is adopted, which is described later on.

### 6.4.2.1. Dynamic Decoder

Fig. 6.21 (a) shows the conventional static decoder used in the e-skin, in which the switching OFETs are connected in parallel. The load transistor must be small because it is a normally-on load, and thus its sizing is required. This results in a slow fall time at the output. In addition, adjustment of the bias voltage, $V_{BIAS}$, is necessary, through which a $\mu$A-order active leakage flows when the output is "H". To make matters worse, all but one of the decoders output "H".

On the other hand, the proposed decoder in Fig. 6.21 (b) does not draw an active leakage thanks to its dynamic operation. The switching OFETs are connected in series. This dynamic decoder is a ratioless circuit whose precharge OFET needs no sizing, and thus it has a wider margin than the conventional one.

The layout of the proposed dynamic decoder is devised so that the cut-and-paste customization described in the previous section can be made. Assume that we have prepared six switching OFETs in advance as shown in Fig. 6.22 (a), but now want a one-out-of-eight decoder and need only three switching OFETs. This does not matter, and we just cut three switching OFETs out of the prefabricated circuit. Then, we paste them to a 2WL pad as shown in Fig. 6.22 (b).
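
The count of switching OFETs to keep can be written down directly. This is a small sketch under the assumption that a one-out-of-$2^{k}$ dynamic decoder uses $k$ series switching OFETs, which matches the one-out-of-eight example above (three OFETs) but is extrapolated here to other sizes. The function name is hypothetical.

```python
def switching_ofets_needed(outputs: int) -> int:
    """Series switching OFETs for a one-out-of-`outputs` decoder (assumed log2 rule)."""
    if outputs < 1:
        raise ValueError("a decoder needs at least one output")
    return (outputs - 1).bit_length()  # ceil(log2(outputs)) for outputs >= 1


if __name__ == "__main__":
    prepared = 6  # switching OFETs on the prefabricated sheet, Fig. 6.22 (a)
    needed = switching_ofets_needed(8)
    print(f"keep {needed} of {prepared} switching OFETs")
```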

### 6.4.2.2. Wordline-Delay Optimization

Fig. 6.23 (a) is the conventional single-wordline scheme while Fig. 6.23 (b) is the proposed double-wordline structure. Fig. 6.23 (c) shows the simulated wordline delay in the proposed structure. In the simulation, we assume that the OFET sizes of a pixel selector and a 1WL selector are the same. $K$ is the number of parallel OFETs in a 1WL selector driving a 1WL. $N$ is the number of pixels per 1WL. The point of $K=0$ and $N=64$ corresponds to the conventional scheme, while in the proposed structure the wordline delay can be reduced to $1 / 6$ when $K=0.3$ and $N=8$. In the double-wordline structure, the wordline delay is optimized when $N$ is about the square root of the total number of columns in most cases. When $N$ is large, the delay in a 1WL becomes large. Conversely, when $N$ is small, the delay of $/1WLS_{x}$ becomes large. Consequently, $N$ has an optimal value at which the wordline delay is minimized.
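
The square-root rule can be illustrated with a toy model. The sketch below is not the SPICE simulation behind Fig. 6.23 (c): it simply assumes that the 1WL delay grows linearly with the $N$ pixel selectors it loads, that the select-line delay grows linearly with the number of blocks (total columns divided by $N$), and that both have equal weight. All names and coefficients are hypothetical.

```python
def toy_wordline_delay(n: int, total_columns: int,
                       t_pixel: float = 1.0, t_selector: float = 1.0) -> float:
    """Toy delay: local 1WL load (~N) plus select-line load (~total/N)."""
    local = t_pixel * n
    select = t_selector * (total_columns / n)
    return local + select


def best_block_width(total_columns: int) -> int:
    """Block width N (a divisor of total_columns) minimizing the toy delay."""
    candidates = [n for n in range(1, total_columns + 1)
                  if total_columns % n == 0]
    return min(candidates, key=lambda n: toy_wordline_delay(n, total_columns))


if __name__ == "__main__":
    for cols in (64, 4096):
        print(f"{cols} columns -> N = {best_block_width(cols)}")
```

Under these assumptions the minimum falls at $N=8$ for 64 columns, the block width used in the scanner. Unequal weights would shift the optimum away from the exact square root, which is why the rule is stated only as approximate.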

The double-wordline structure can potentially reduce the dynamic power by the same factor as the wordline delay, since the circuits operate on a block-by-block basis and the capacitance involved in each operation is lower than in the single-wordline scheme. This becomes particularly important when random access is employed for intelligent image capturing in future.

### 6.4.2.3. Photocurrent-Integration Scheme

In a double-bitline structure in a silicon imager, the charge-transfer scheme in Fig. 6.24 (a) is exploited to amplify a small charge induced by a silicon photodiode. First, a 1BL is precharged to a precharge voltage, $V_{PC}$, with the precharge signal, $/R$. Then, once one of the 1WLs is pulsed, a negative charge, $Q_{BLACK}$ or $Q_{WHITE}$, is transferred to the first-bitline capacitance, $C$, when the corresponding part is black or white, respectively. The 1BL voltage is dropped from $V_{PC}$ to a certain voltage by the negative charge. An amplifier amplifies the static voltage of the 1BL, and outputs a static 2BL voltage. This charge-transfer scheme, however, cannot be realized in an organic circuit since the gate capacitance of an OFET is huge. Instead, the photocurrent-integration scheme in Fig. 6.24 (b) was applied to the scanner.

The circuit topology in the photocurrent-integration scheme is almost the same as in the charge-transfer scheme, but the operation is different. In order to evaluate the photocurrent of an OPD, one of the 1WLs is kept activated. The photocurrent, $I_{BLACK}$ or $I_{WHITE}$, discharges $C$, and the 1BL voltage starts decreasing toward the anode voltage, $V_{A}$. The fall time of the 1BL voltage depends on the photocurrent integration. An amplifier amplifies the 1BL voltage, and starts to pull the 2BL voltage up to $V_{DD}$. Since the rise times of the 2BL voltage, $t_{BLACK}$ and $t_{WHITE}$, are functions of $I_{BLACK}$ and $I_{WHITE}$, respectively, we can determine whether a pixel is black or white.

$t_{BLACK}$ and $t_{WHITE}$ depend on $V_{PC}$, and we simulated them as shown in Fig. 6.25. They are defined as the time from the negation of $/R$ to the moment when the 2BL voltage crosses 30 V, as discussed in detail in Section 6.4.4. In the circuit simulation, the level-1 SPICE MOS model described in Subsection 6.2.3 was used for the device parameters. $V_{A}$ is restricted to $V_{PC}-5 \mathrm{~V}$, which means that the voltage across an OPD is 5 V at most. This voltage is low enough to avoid the Zener breakdown of the OPD. $t_{BLACK}$ and $t_{WHITE}$ become shorter as $V_{PC}$ decreases since the gate bias of the amplifier, $V_{GS}$, gets larger. However, the $t_{BLACK} / t_{WHITE}$ ratio, which is a kind of dynamic range, becomes smaller. We chose 25 V as $V_{PC}$ in the circuit design, at which the $t_{BLACK} / t_{WHITE}$ ratio is 1.6.

### 6.4.3. 3D Integration

Although the double-wordline and double-bitline structure is well known in memory design, the situation and constraints are different in sensor design. In a memory design, for instance, 1WL selectors and 1BL selectors are laid out on the side of the memory cells, and sense amplifiers are put at the bottom of the memory cells as shown in Fig. 6.26 (a). This does not matter since a memory is a logical device, and logically any location is acceptable for the memory cells and their peripheral circuits.

In contrast, a scanner is a physical device, and pixel positions are meaningful. The pixels must be distributed uniformly. If we place the peripheral circuits in arbitrary positions, the pixel density becomes irregular and uniform sensing is impossible. Moreover, since the OFET is large, only a single OFET per pixel is allowed and there is no room left for the peripheral circuits in the pixel region. As a result, the peripheral circuits of the sheet-type scanner have to be separated and stacked as shown in Fig. 6.26 (b). The peripheral circuits are placed on the separate OFET sheet \#2, and stacked on the pixel-selector sheet (OFET sheet \#1) with 3D-stack integration.

Fig. 6.27 (a) shows the layout of the pixel selectors on the OFET sheet \#1. There are $8 \times 8$ pixel selectors in a block. Under this sheet, there is an OPD sheet.

The OFET sheet \#2 is the peripheral-circuit sheet, on which there are four kinds of OFETs including 1WL selectors, 1BL selectors, amplifiers, and precharge gates. As shown in Fig. 6.27 (b), the checkerboard-like layout fulfills the requirement for connectivity. Between transistor rows, there are interconnection channels since a design rule for the laser vias is loose, which reduces the transistor density to a half of the OFET sheet \#1.

### 6.4.4. Measurement Results

Fig. 6.28 shows a photograph of the three organic sheets before being laminated as a sheet-type scanner. The pixel size is $1.27 \times 1.27 \mathrm{~mm}^{2}$, which corresponds to 20 dpi. Namely, $64 \times 64$ pixels occupy about $80 \times 80 \mathrm{~mm}^{2}$ in area. The biggest obstacle to enhancing the resolution is the design rule for the laser vias mentioned above. The size and enclosure rules of the laser via are more than $100 \mu \mathrm{m}$ in our process. The total thickness of the sheet-type scanner is 0.4 mm as shown in Fig. 6.29, and the total weight is 1 g. The sheet-type scanner is so thin and flexible that it can take an image of a round object such as a label on a wine bottle, which is impossible for conventional commercial scanners. The bending radius is down to 30 mm, and is limited by an ITO-layer breakdown on the OPD sheet.
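
The resolution and array size quoted above follow from the pixel pitch. A minimal sketch of the conversion, with hypothetical function names, is:

```python
MM_PER_INCH = 25.4


def dots_per_inch(pitch_mm: float) -> float:
    """Resolution in dpi for a given pixel pitch in millimeters."""
    return MM_PER_INCH / pitch_mm


def array_side_mm(pixels: int, pitch_mm: float) -> float:
    """Side length of a square array of `pixels` pixels at the given pitch."""
    return pixels * pitch_mm


if __name__ == "__main__":
    print(f"{dots_per_inch(1.27):.0f} dpi, "
          f"{array_side_mm(64, 1.27):.2f} mm per side")
```

A 1.27-mm pitch is exactly one twentieth of an inch, and 64 pixels span 81.28 mm, which the text rounds to 80 mm.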

### 6.4.4.1. Photocurrent

Fig. 6.30 shows the measured I-V characteristics of the pixel selector and OPD on a 1BL as a function of the cathode voltage, $V_{C}$ (see Fig. 6.24 (b)), when the light intensity is $80 \mathrm{~mW} / \mathrm{cm}^{2}$. The on/off ratio of the OFET reaches $10^{5}$. The 1BL is precharged to $V_{PC}$ of 25 V with the reset signal, $/R$. The anode voltage of the OPDs, $V_{A}$, is limited to 20 V in order to avoid the Zener breakdown of the OPDs as mentioned in 6.4.2.3. After $/R$ is negated, $I_{BLACK}$ initially flows through the 1BL when a pixel is black. Alternatively, when a pixel is white, $I_{WHITE}$ flows as the initial value.

### 6.4.4.2. Scanned Image

Fig. 6.31 (a) and (b) show the measured histogram of the initial $I_{BLACK}$ and $I_{WHITE}$ in a block, and that of the rise times of the 2BL voltages, when the image "F" in Fig. 6.32 (a) is scanned. Since one OPD is defunct, the number of samples in the block is $63(=8 \times 8-1)$. The major malfunction mode of an OPD is that a laser via through the parylene passivation on the OPD sheet penetrates to the anode of the OPD, which causes an electrical short. Although the average ratio of $I_{WHITE} / I_{BLACK}$ is 3.8, that of $t_{BLACK} / t_{WHITE}$ is only 1.4 due to the photocurrent-integration scheme, which agrees well with the simulation in Fig. 6.25. In order to compensate for this small dynamic range, a pure-black paper and a pure-white paper are both scanned before an image is scanned. Then, we scan the image, and interpolate every pixel datum between the data of the pure-black and pure-white papers. Fig. 6.32 (b) is the scanned image "F" with the interpolation.
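
The per-pixel interpolation can be sketched as a two-point calibration. The text does not specify the interpolation formula, so the sketch below assumes a linear mapping of each pixel's 2BL rise time between its own pure-black and pure-white reference rise times, with clamping to the range 0 (black) to 1 (white). The function name and the handling of unusable pixels are hypothetical.

```python
from typing import Optional


def interpolate_pixel(t_pixel: float, t_black: float,
                      t_white: float) -> Optional[float]:
    """Map a rise time to a gray level: 0.0 = black reference, 1.0 = white.

    Returns None when the references do not separate (t_black <= t_white),
    for example for a shorted, defunct OPD.
    """
    span = t_black - t_white
    if span <= 0:
        return None
    level = (t_black - t_pixel) / span
    return min(1.0, max(0.0, level))  # clamp measurement noise


if __name__ == "__main__":
    # Illustrative rise times in ms (not measured data).
    print(interpolate_pixel(2.5, t_black=3.0, t_white=2.0))
```

Because each pixel is normalized against its own references, pixel-to-pixel variation in photocurrent is compensated along with the small $t_{BLACK}/t_{WHITE}$ ratio.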

### 6.4.4.3. Operation Waveforms and Delays

The measured operational waveforms are shown in Fig. 6.33 together with a sketch of the stimulus signals. All inputs are driven with external high-voltage buffers (Toshiba TD62981P), and outputs are observed with a high-voltage probe (Tektronix P6015A). For comparison, we manufactured two devices: one with the conventional single-wordline and single-bitline scheme, and one with the proposed double-wordline and double-bitline structure.

In the proposed structure, the fall time of a 1WL from $V_{DD}$ to $10 \%$ of $V_{DD}$ is 3 ms while that in the conventional single-wordline scheme is 17 ms. That is, the delay on the wordline is shortened by a factor of about 5.7, which agrees well with the wordline-delay simulation in Fig. 6.23.

On a 2BL in the proposed structure, $t_{BLACK}$ is defined as the readout time because $I_{BLACK}$ is smaller than $I_{WHITE}$ and $t_{BLACK}$ is larger than $t_{WHITE}$. If the sense voltage of a 2BL is set to 30 V, the readout time in the conventional scheme is 18 ms while that in the proposed structure is 3 ms, achieving a six-fold improvement.

The cycle time in the conventional scheme is 39 ms (the wordline delay is 17 ms, the readout time is 18 ms, and the recovery time is 4 ms) while that in the proposed structure is 7 ms (the wordline delay is 3 ms, the readout time is 3 ms, and the recovery time is 1 ms). This shows that the cycle time is reduced by a factor of 5.6.
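
The delay figures in this subsection can be cross-checked with a few lines of arithmetic. The sketch uses only the measured values quoted above; the function name is hypothetical.

```python
def cycle_time_ms(wordline_ms: float, readout_ms: float,
                  recovery_ms: float) -> float:
    """Cycle time as the sum of wordline delay, readout time, and recovery time."""
    return wordline_ms + readout_ms + recovery_ms


conventional = cycle_time_ms(17, 18, 4)  # single wordline / single bitline
proposed = cycle_time_ms(3, 3, 1)        # double wordline / double bitline

if __name__ == "__main__":
    print(f"wordline speed-up: {17 / 3:.1f}")
    print(f"readout speed-up : {18 / 3:.1f}")
    print(f"cycle speed-up   : {conventional / proposed:.1f} "
          f"({conventional} ms -> {proposed} ms)")
```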

### 6.4.4.4. Power

In the conventional scheme, the total power measures 2.5 mW at the $39-\mathrm{ms}$ cycle time while that in the proposed structure is $900 \mu \mathrm{~W}$ at the 7 -ms cycle time, which means a 2.8 -times improvement. If the cycle time in the proposed structure is set to 39 ms as long as the cycle time in the conventional scheme, the power reduces to $350 \mu \mathrm{~W}$, which indicates that the proposed structure saves the power by a factor of seven.
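
The two power comparisons above differ only in the cycle time at which the proposed structure is operated. A short check using the quoted measurements, with a hypothetical helper name, is:

```python
def improvement(conventional_w: float, proposed_w: float) -> float:
    """Ratio of conventional power to proposed power."""
    return conventional_w / proposed_w


P_CONVENTIONAL = 2.5e-3   # W, at the 39-ms cycle time
P_PROPOSED_FAST = 900e-6  # W, at the 7-ms cycle time
P_PROPOSED_SLOW = 350e-6  # W, at the 39-ms cycle time

if __name__ == "__main__":
    print(f"at each scheme's own cycle time: "
          f"{improvement(P_CONVENTIONAL, P_PROPOSED_FAST):.1f}x")
    print(f"at the same 39-ms cycle time   : "
          f"{improvement(P_CONVENTIONAL, P_PROPOSED_SLOW):.1f}x")
```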

### 6.4.5. Future Direction

In future, we suppose that we will be able to make a scanner with a 320-dpi resolution and $4,096 \times 4,096$ pixels in size. Fig. 6.34 (a) shows a simulated trend of the scan-out time as the resolution and the number of pixels advance, which indicates that it might take on the order of $10^{3}$ seconds to scan out all pixels with the conventional single-wordline and single-bitline scheme. In the present process (20-dpi resolution and $64 \times 64$ pixels), scan-out in the proposed double-wordline and double-bitline structure is only 5.6 times as fast as in the conventional scheme, since the improvement equals the cycle-time improvement. In the advanced future process (320-dpi resolution and $4,096 \times 4,096$ pixels), however, the improvement factor of the scan-out time will increase to 51, and the scan-out time will be shortened to tens of seconds with the proposed structure. As shown in Fig. 6.34 (b), power can also be saved by a factor of 25 in the advanced future process. The proposed approach is applicable to other types of organic large-area sensors including the e-skin, and solves fundamental issues of speed and power. We believe that the proposed structure will be essential for organic large-area sensors in future.

In the measurement, an artificial light source was used; however, it could potentially be replaced with ambient light. The gate capacitance of an OFET will decrease as scaling advances. The hierarchical bitline structure will divide a bitline into smaller segments and reduce the bitline capacitance further. Thanks to the 3D-stack integration, an amplifier can also be placed near the segmented bitline without degrading the aperture. This means that an OPD only has to draw charge out of a relatively small capacitance. Consequently, we believe that the photocurrent sensitivity can be improved in future, making it possible to use ambient light as the light source.

### 6.5. Cost Comparison with Silicon

As shown in Fig. 6.35, a tentative cost comparison was made. The range where organic ICs have a cost advantage lies between tens of micrometers and millimeters in resolution. Note that the resolution here means neither a design rule nor a device size but the sencel sparseness. The lower bound is limited by the manufacturability of organic ICs while the upper bound is limited by competition with an assembly approach. In the assembly approach, small organic elements are assembled on a base material, or connected together with additional interconnections.

### 6.6. Summary

In the e-skin system, an active matrix based on OFETs was implemented instead of a passive matrix to suppress the leakage current. The e-skin has mechanical flexibility, and is low-cost even for large-area electronics. The cut-and-paste customization was proposed and demonstrated as a scalable circuit concept. A static decoder with a boosted-gate E/E configuration was designed using a level-1 SPICE MOS model and a standard layout design tool. The access time of the manufactured e-skin system is 23 ms. The mechanical flexibility of an OFET was proven by bending it down to 5 mm in radius.

In the sheet-type scanner, we confirmed that the proposed double-wordline and double-bitline structure reduces power and cycle time by factors of 7 and 5.6, respectively. In order to implement the proposed structure, one OPD sheet and two OFET sheets were integrated as a 3D-stack sheet-type scanner. A dynamic decoder with low-power and wide-margin capability was introduced, to which the cut-and-paste customization can be applied as well.

In terms of the cost of organic electronics, we suggested that the resolution range where an organic IC is superior to silicon lies between tens of micrometers and millimeters.

### 6.7. References

[6.1] R. Brederlow, S. Briole, H. Klauk, M. Halik, U. Zschieschang, G. Schmid, J.-M. Gorriz-Saez, C. Pacha, R. Thewes, and W. Weber, "Evaluation of the Performance Potential of Organic TFT Circuits," IEEE Int. Solid-State Circ. Conf. Dig. Tech. Papers, pp. 378-379, Feb. 2003.

[6.2] E. Huitema, G. Gelinck, B. van der Putten, E. Cantatore, E. van Veenendaal, L. Schrijnemakers, B.-H. Huisman, and D. M. Leeuw, "Plastic Transistors in Active-Matrix Displays," IEEE Int. Solid-State Circ. Conf. Dig. Tech. Papers, pp. 380-381, Feb. 2003.

[6.3] E. Cantatore, C. M. Hart, M. Digioia, G. H. Gelinck, T. C. T. Geuns, H. E. A. Huitema, L. R. R. Schrijnemakers, E. van Veenendaal, D. M. Leeuw, "Circuit Yield of Organic Integrated Electronics," IEEE Int. Solid-State Circ. Conf. Dig. Tech. Papers, pp. 382-383, Feb. 2003.

[6.4] P.-C. Huang, M.-D. Tsai, H. Wang, C.-H. Chen, and C.-S. Chang, "A 114 GHz VCO in $0.13 \mu \mathrm{m}$ CMOS Technology," IEEE Int. Solid-State Circ. Conf. Dig. Tech. Papers, pp. 404-405, Feb. 2005.

[6.5] J. H. Schoen, and C. Kloc, "Fast organic electronic circuits based on ambipolar pentacene field-effect transistors," AIP Applied Physics Let., vol. 79, no. 24, pp. 4043-4044, Dec. 2001.

[6.6] Y. Kato, S. Iba, R. Teramoto, T. Sekitani, T. Someya, H. Kawaguchi, and T. Sakurai, "High mobility of pentacene field-effect transistors with polyimide gate dielectric layers," AIP Applied Physics Let., vol. 84, no. 19, pp. 3789-3791, May 2004.

[6.7] T. Someya, T. Sekitani, S. Iba, Y. Kato, H. Kawaguchi, and T. Sakurai, "A large-area, flexible pressure sensor matrix with organic field-effect transistors for artificial skin applications," Proc. National Academy of Sci. of U.S.A., vol. 101, no. 27, pp. 9966-9970, July 2004.

[6.8] PCR Technical page, http://www.scn-net.ne.jp/~eagle/CSAJapanese_right.html\#h-ondo (in Japanese).

[6.9] T. Sekitani, H. Kawaguchi, T. Sakurai, and T. Someya, "Organic Field-Effect Transistors with Bending Radius Down to 1 mm," Proc. Materials Research Society Spring Meet., Apr. 2004.

[6.10] T. Someya, T. Sakurai, T. Sekitani, H. Kawaguchi, S. Iba, and Y. Kato, "A Large-Area, Flexible, and Lightweight Sheet Image Scanner Integrated with Organic Field-Effect Transistors and Organic Photodiodes," IEEE Int. Elec. Dev. Meet. Dig. Tech. Papers, pp. 365-368, Dec. 2004.

[6.11] Z. Bao, A. J. Lovinger, and A. Dodabalapur, "Organic field-effect transistors with high mobility based on copper phthalocyanine," AIP Applied Physics Let., vol. 69, no. 20, pp. 3066-3068, Nov. 1996.

[6.12] J. Shinar, "Organic Light-Emitting Devices: A Survey," Springer, 2003.


Fig. 6.1 Passive matrix.

Fig. 6.2 Manufacturing process of OFET. (a) Patterning of $\mathrm{Cr} / \mathrm{Au}$ gate electrodes on PEN/PI base film, (b) spin-coating and curing of gate insulator PI, (c) via process with $\mathrm{CO}_{2}$ laser, (d) deposition of pentacene, and (e) deposition of Au source/drain electrodes. Annotation in figure: pentacene, $\mu>1 \mathrm{~cm}^{2} / \mathrm{Vs}$.


Fig. 6.3 Micrograph of laser via.


Fig. 6.4 $\quad \mathrm{V}_{\mathrm{DS}}-\mathrm{I}_{\mathrm{DS}}$ characteristics of fabricated p-type OFET.


Fig. 6.5 (a) Cross section and (b) circuit diagram of sensor cell.


Fig. 6.6 Photograph of e-skin system.


Fig. 6.7 Circuit diagram of e-skin system.


Fig. 6.8 Scalability of row decoders.


Fig. 6.9 Scalability of column selectors.


Fig. 6.10 $4 \times 4$ version of e-skin.


Fig. 6.11 Static decoder circuits with (a) off-state load, (b) diode load, and (c) boosted-gate load. Their simulation waveforms are also shown.


Fig. 6.12 Input-output transfer characteristics of static decoders with (a) off-state load, (b) diode load, and (c) boosted-gate load. The dotted lines are $x-y$ symmetries of the input-output transfer characteristics (solid lines).


Fig. 6.13 (a) Smallest on-current case, and (b) largest off-current case.


Fig. 6.14 Current dependence on pressure.


Fig. 6.15 Bitline voltage when pressed.


Fig. 6.16 Simulated access-time dependence on sencel size.


Fig. 6.17 Measured and simulated operation waveforms.


Fig. 6.18 Access time dependence on $V_{D D}$.

Fig. 6.19 Device structure. (a) Top view, and (b) cross section.


Fig. 6.20 Double-wordline and double-bitline structure.

Fig. 6.21 (a) Conventional static decoder, and (b) proposed dynamic decoder.


Fig. 6.22 Cut-and-paste customization. (a) Cut, and (b) paste.


Fig. 6.23 (a) Single-wordline scheme. (b) Double-wordline structure, and (c) its simulated delay.


Fig. 6.24 (a) Photocharge-transfer scheme, and (b) photocurrent-integration scheme.


Fig. 6.25 Simulated rise times of 2BL voltage, $t_{\text {BLACK }}$ and $t_{\text {WHITE }}$, and ratio of them.


Fig. 6.26 (a) Memory, and (b) sensor design.


Fig. 6.27 Layouts of blocks on (a) OFET sheet \#1, and (b) OFET sheet \#2.


Fig. 6.28 Photograph of sheet-type scanner.


Fig. 6.29 Cross-sectional photograph of sheet-type scanner.


Fig. 6.30 Measured I-V characteristics on 1BL.


Fig. 6.31 Measured histograms of (a) photocurrent, and (b) rise time of 2BL voltage.


Fig. 6.32 (a) Original image, and (b) scanned image.


Fig. 6.33 Measured operational waveforms.


Fig. 6.34 Future trends of (a) scan-out time and (b) power.


Fig. 6.35 Costs of technologies for large-area sensors. Area is assumed to be $100 \times 100 \mathrm{~mm}^{2}$, and silicon costs $\$ 1 \mathrm{k}$ per that area while organic costs $\$ 10$.

## 7. Conclusions

This paper has described power-reduction techniques and related analyses for silicon VLSIs and organic ICs, which will help realize future ubiquitous electronics.

In Chapter 2, the RCSFF (reduced clock-swing flip-flop) was proposed to save clock power, which accounts for $20-45 \%$ of the total power in a silicon VLSI. The RCSFF accepts a low-swing clock, which is supplied by a reduced-swing clock driver. Although a low-swing clock incurs a leakage current, the RCSFF has a leakage-cutoff mechanism that uses the body effect. With the clock swing reduced to 1 V, the RCSFF reduced the clock power to $1 / 3$ of that of the conventional flip-flop. The area and delay of the RCSFF can also be reduced by $20 \%$, and the RCSFF can halve the RC delay of a long interconnect as well.

A future low-power silicon VLSI will have several voltage domains, as exemplified by the RCSFF, which uses a swing lower than the supply voltage. In such a VLSI, RC coupling causes a signal-integrity issue, and the issue becomes more pronounced as scaling proceeds. In Chapter 3, expressions for the delay and crosstalk-noise amplitude of capacitively coupled two- and three-line systems were derived, assuming bus lines and other signal lines in a silicon VLSI. Two modes were studied: one in which adjacent lines are driven from the same direction, and another in which adjacent lines are driven from opposite directions. Both correspond to typical situations in VLSI design. In addition, the junction capacitance of the driver MOSFET was considered in both cases. The expressions are in closed form, and they are useful for giving circuit designers insight into interconnection problems at an early stage of VLSI design. The expressions were extensively compared with and fitted to HSPICE simulations in order to demonstrate their validity, and the errors are all within $10 \%$.

In Chapter 4, a standby-leakage cutoff scheme for logic circuits and an active-leakage reduction scheme for an SRAM were proposed as solutions to the leakage problems that have emerged in recent silicon devices. The SCCMOS (super-cutoff CMOS) scheme, which achieves high-speed and low-standby-current CMOS VLSIs in the sub-1-V supply-voltage regime, was verified by measurement in Section 4.2. By overdriving the gate of a cutoff MOSFET, the SCCMOS suppresses the leakage current below 1 pA per logic gate in a standby mode, while a low threshold voltage of $0.1-0.2 \mathrm{~V}$ allows high-speed operation in an active mode. The SCCMOS pushes the low-voltage operation limit down by a further 0.2 V compared with the conventional schemes while maintaining the same standby-current level. In Section 4.3, the DLC (dynamic leakage cutoff) SRAM was proposed and fabricated as a $0.5-\mathrm{V}$ SRAM circuit, which is 2.5 times faster than the conventional low-voltage SRAM while keeping the subthreshold leakage current at a tolerable level. In the DLC SRAM, the n- and p-well biases are dynamically changed only in selected SRAM cells so that the threshold voltages of selected and dormant SRAM cells are low and high, respectively. Therefore, the DLC SRAM does not apply an excessive voltage to the gate oxide, and there is no gate-oxide reliability issue. The leakage current was suppressed to 0.9 mA in a $1-\mathrm{Mb}$ SRAM at a supply voltage of 1 V, whereas the conventional SRAM draws a leakage current of 200 mA under the same condition.

Chapter 5 described software approaches for a low-power multimedia system with an off-the-shelf processor. The $\mathrm{V}_{\mathrm{DD}}$-hopping scheme discussed in Section 5.2 adaptively controls the supply voltage of a processor according to its workload, using a hardware-software cooperative mechanism. When the workload of an MPEG4 encoder was about one half, $\mathrm{V}_{\mathrm{DD}}$ hopping reduced the power to less than a quarter of that of the conventional fixed-$\mathrm{V}_{\mathrm{DD}}$ scheme. The power saving was achieved without degrading the real-time feature. In addition, we fabricated a controller dedicated to $\mathrm{V}_{\mathrm{DD}}$ hopping in order to verify its feasibility at the embedded-system level. In Section 5.3, $\mathrm{V}_{\mathrm{DD}}$ hopping was extended to an RTOS (real-time operating system) called $\mu$ITRON-LP, which provides cooperative power management among $\mu$ITRON-LP itself, multimedia applications including an MPEG4 encoder, and a hardware platform with an off-the-shelf processor. The experimental results with the prototype system showed that a $74 \%$ power saving is possible in a multitasking multimedia environment.
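The statement that power falls below a quarter at about half workload can be made plausible with a back-of-the-envelope model. The following sketch is an illustration only. It assumes that dynamic power scales as $f \cdot V^{2}$, that the fixed-$\mathrm{V}_{\mathrm{DD}}$ baseline draws full power throughout, and that the processor offers two hypothetical operating points. The normalized voltages and the function names are invented for this example and are not the measured settings of Chapter 5.

```python
# Hypothetical operating points: (normalized clock frequency, normalized V_DD).
# These numbers are illustrative assumptions, not values from the thesis.
LEVELS = [(0.5, 0.6), (1.0, 1.0)]  # sorted from slowest to fastest


def pick_level(workload):
    """Return the slowest level whose frequency still covers the workload.

    `workload` is the required fraction of full-speed throughput (0..1).
    """
    for freq, vdd in LEVELS:
        if freq >= workload:
            return freq, vdd
    raise ValueError("workload exceeds full-speed capacity")


def relative_power(level):
    """Dynamic power relative to full speed, assuming P is proportional to f * V^2."""
    freq, vdd = level
    return freq * vdd ** 2


if __name__ == "__main__":
    for workload in (0.5, 0.8):
        level = pick_level(workload)
        print(workload, level, round(relative_power(level), 3))
```

Under these assumed values, a workload of 0.5 selects the half-speed point, whose relative power is $0.5 \times 0.6^{2}=0.18$, below a quarter of the full-speed power. A workload of 0.8 cannot be served at half speed, so the full-speed point is kept.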

Fig. 7.1 shows the power saved by the above techniques when an H.264 high-definition encoder is designed as a silicon VLSI. The active power of the original design without any power-reduction techniques is 2.17 W, and it is reduced to 1.34 W by introducing the RCSFFs and the DLC SRAM. $\mu$ITRON-LP can be applied only to a processor; however, if the VLSI is assumed to be a processor, the active power becomes as small as 0.48 W. The standby power can also be dramatically decreased, from 0.7 W to 2.8 mW, with the SCCMOS and the DLC SRAM.
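The power figures quoted for the H.264 encoder can be turned into relative savings with simple arithmetic. The short script below only restates the numbers given in the text; the constant and function names are chosen for this illustration.

```python
def reduction_percent(before, after):
    """Percentage by which `after` is lower than `before`."""
    return 100.0 * (before - after) / before


# Power figures quoted in the text, in watts.
ACTIVE_ORIGINAL = 2.17       # no power-reduction techniques
ACTIVE_RCSFF_DLC = 1.34      # with RCSFFs and DLC SRAM
ACTIVE_AS_PROCESSOR = 0.48   # VLSI assumed to be a processor
STANDBY_ORIGINAL = 0.7
STANDBY_REDUCED = 2.8e-3     # with SCCMOS and DLC SRAM

if __name__ == "__main__":
    print(f"RCSFF + DLC SRAM: "
          f"{reduction_percent(ACTIVE_ORIGINAL, ACTIVE_RCSFF_DLC):.1f}% lower active power")
    print(f"Processor case:   "
          f"{reduction_percent(ACTIVE_ORIGINAL, ACTIVE_AS_PROCESSOR):.1f}% lower active power")
    print(f"Standby:          "
          f"{STANDBY_ORIGINAL / STANDBY_REDUCED:.0f}x lower standby power")
```

In other words, the active power falls by about 38% with the RCSFFs and DLC SRAM and by about 78% in the processor case, and the standby power falls by a factor of 250.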

In Chapter 6, low-power techniques for organic ICs were described. The e-skin (electronic artificial skin) in Section 6.3 adopted an OFET (organic field-effect transistor) matrix, which is superior to a passive matrix in terms of leakage power. In the e-skin, the new concept of cut-and-paste customization for designing organic ICs was introduced for the first time. The e-skin comprises three kinds of circuits, all of which are scalable in size so that the cut-and-paste customization can be applied. The organic circuits were designed with a level-1 SPICE MOS model and a standard layout design tool, and their operation was confirmed by measurements. In Section 6.4, an organic sheet-type scanner and its circuits were described. The 3D-stacked sheet-type scanner consists of two organic-transistor sheets and one organic-photodiode sheet, which enable a double-wordline and double-bitline structure. The operation of the proposed hierarchical structure was compared with the conventional single-wordline and single-bitline scheme, and improvements in cycle time and power by factors of 5.6 and 7, respectively, were verified. The active matrix and its double-wordline and double-bitline structure are therefore scalable in terms of delay and power for future large-area sensors.

The above are our approaches to future low-power ubiquitous electronics, which have had an impact on researchers in the electronics industry and academia.

More than ten groups have pursued the RCSFF, which is recognized in the electronics community as the first low-swing flip-flop that can accept an arbitrary clock swing. Unfortunately, the RCSFF suffers from GIDL (gate-induced drain leakage) since it utilizes a backgate bias. However, we have studied an improved RCSFF, shown in Fig. 7.2, that eliminates this adverse effect [7.1].

We have extended the SCCMOS to the ZSCCMOS (zigzag SCCMOS) in Fig. 7.3, which overcomes the SCCMOS drawbacks of gate overdrive and slow wakeup time [7.2]. Before entering a standby mode, the output node of each gate is forced to either '0' or '1'. Gates whose output nodes are '0' are connected to a pMOSFET switch, MP, and the other gates, whose output nodes are '1', are connected to an nMOSFET switch, MN. Therefore, each gate has at least two series transistors that are off in a leakage path, which reduces the leakage current by the stack effect [7.3]. The ZSCCMOS is more suitable for scaled-down devices since the leakage current can be suppressed without gate overdriving, while the SCCMOS idea of using a low-threshold-voltage switch is retained.
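The zigzag assignment rule can be sketched in a few lines. This is a minimal logical model, not the circuit of [7.2]: the netlist format, the supported gate types, and the function names are hypothetical and introduced only for illustration. Given the input values forced before standby, it evaluates each gate output and assigns the gate to MP when the output is '0' and to MN when the output is '1'.

```python
from typing import Dict, List, Tuple

# Hypothetical netlist format: gate name -> (gate type, list of input net names).
# Each gate drives a net that carries the same name as the gate.
Netlist = Dict[str, Tuple[str, List[str]]]


def evaluate(netlist: Netlist, forced_inputs: Dict[str, int]) -> Dict[str, int]:
    """Compute the standby logic value of every net from the forced inputs."""
    values = dict(forced_inputs)
    pending = dict(netlist)
    while pending:
        progressed = False
        for name, (kind, inputs) in list(pending.items()):
            if not all(net in values for net in inputs):
                continue  # some inputs are not evaluated yet
            bits = [values[net] for net in inputs]
            if kind == "INV":
                values[name] = 1 - bits[0]
            elif kind == "NAND":
                values[name] = 0 if all(bits) else 1
            elif kind == "NOR":
                values[name] = 0 if any(bits) else 1
            else:
                raise ValueError(f"unsupported gate type: {kind}")
            del pending[name]
            progressed = True
        if not progressed:
            raise ValueError("combinational loop or undriven net")
    return values


def assign_switches(netlist: Netlist, forced_inputs: Dict[str, int]) -> Dict[str, str]:
    """Output '0' -> pMOSFET switch MP, output '1' -> nMOSFET switch MN."""
    values = evaluate(netlist, forced_inputs)
    return {gate: ("MP" if values[gate] == 0 else "MN") for gate in netlist}


if __name__ == "__main__":
    example = {
        "g1": ("INV", ["a"]),
        "g2": ("INV", ["g1"]),
        "g3": ("NAND", ["g1", "g2"]),
    }
    print(assign_switches(example, {"a": 1}))
```

For the example chain with input `a` forced to 1, `g1` outputs '0' and goes to MP, while `g2` and `g3` output '1' and go to MN, so the switch assignment alternates along the inverter chain.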

The DLC SRAM is a very early concept in which dynamic control synchronized with a wordline access is applied. Unfortunately, like the RCSFF, it utilizes backgate biases, and thus GIDL is an issue. Our latest SRAM, shown in Fig. 7.4, eliminates a large leakage current with row-by-row variable $\mathrm{V}_{\mathrm{DD}}$ control. It is a GIDL-free extension of the DLC SRAM that retains information below 0.3 V in a standby mode [7.4]. The active leakage current is suppressed by $95 \%$ thanks to the DIBL (drain-induced barrier lowering) effect at the retention voltage of 0.3 V. This novel SRAM was well received at a conference, and is expected to be an essential technique for future low-power SRAMs.

Multimedia applications have rapidly penetrated mobile equipment, a typical example of ubiquitous electronics, for which software approaches such as $\mathrm{V}_{\mathrm{DD}}$ hopping and $\mu$ITRON-LP would be suitable. $\mu$ITRON-LP is being implemented in a next-generation cell phone [7.5].

The cut-and-paste customization is the first concept of its kind, and is effective for organic large-area sensors with an active matrix that suppresses the active leakage current. The double-wordline and double-bitline structure is also effective because of its low dynamic power. These techniques are expected to be essential for future large-scale sensing and robotics using abundant OFETs.

Thus, this paper can contribute to the next ubiquitous society supported by low-power electronics.

### 7.1. References

[7.1] M. Yazid, H. Kawaguchi, and T. Sakurai, "Low-Power High-Speed Reduced-Clock-Swing Flip-Flops Based on Contention Reduction Techniques," IEICE Technical Report, ICD2005, Dec. 2005. (To be presented)
[7.2] K. Min, H. Kawaguchi, and T. Sakurai, "Zigzag Super Cut-off CMOS (ZSCCMOS) Block Activation with Self-Adaptive Voltage Level Controller: An Alternative to Clock-Gating Scheme in Leakage Dominant Era," IEEE Int. Solid-State Circ. Conf. Dig. Tech. Papers, pp. 400-401, Feb. 2003.
[7.3] S. Narendra, S. Borkar, V. De, D. Antoniadis, and A. Chandrakasan, "Scaling of Stack Effect and its Application for Leakage Reduction," Proceedings of ACM/IEEE Int. Symp. Low Power Elec. and Design, pp. 195-200, Aug. 2001.
[7.4] F. R. Saliba, H. Kawaguchi, and T. Sakurai, "Experimental Verification of Row-by-Row Variable $\mathrm{V}_{\mathrm{DD}}$ Scheme Reducing 95\% Active Leakage Power of SRAM's," IEEE/JSAP Symp. VLSI Circ. Dig. Tech. Papers, pp. 162-165, June 2005.
[7.5] S. Misaka, H. Kawaguchi, and T. Sakurai, "Time Revising Robust Frequency-Voltage Cooperative Power Reduction for Multi-tasking Multimedia Applications," Proceedings of Int. Symp. Low-Power and High-Speed Chips (COOL Chips), pp. 165-180, Apr. 2005.


Fig. 7.1 Power saved by the techniques described in this paper.


Fig. 7.2 Improved RCSFF.


Fig. 7.3 ZSCCMOS (zigzag SCCMOS).


Fig. 7.4 Row-by-row variable $V_{D D}$ SRAM.

## List of Publications and Presentations

Publications in journals and transactions as the first author

- H. Kawaguchi, and T. Sakurai, "A Reduced Clock-Swing Flip-Flop (RCSFF) for $63 \%$ Power Reduction," IEEE Journal of Solid-State Circuits, vol. 33, no. 5, pp. 807-811, May 1998.
- H. Kawaguchi, K. Nose, and T. Sakurai, "A Super Cut-Off CMOS (SCCMOS) Scheme for 0.5-V Supply Voltage with Picoampere Stand-By Current," IEEE Journal of Solid-State Circuits, vol. 35, no. 10, pp. 1498-1501, Oct. 2000.
- H. Kawaguchi, G. Zhang, S. Lee, Y. Shin, and T. Sakurai, "A Controller LSI for Realizing $\mathrm{V}_{\mathrm{DD}}$-Hopping Scheme with Off-the-Shelf Processors and Its Application to MPEG4 System," IEICE Transactions on Electronics, vol. E85-C, no. 2, pp. 263-271, Feb. 2002.
- H. Kawaguchi, T. Someya, T. Sekitani, and T. Sakurai, "Cut-and-Paste Customization of Organic FET Integrated Circuit and Its Application to Electronic Artificial Skin," IEEE Journal of Solid-State Circuits, vol. 40, no. 1, pp. 177-185, Jan. 2005.
- H. Kawaguchi, Y. Shin, and T. Sakurai, " $\mu$ ITRON-LP: Power-Conscious Real-Time OS Based on Cooperative Voltage Scaling for Multimedia Applications," IEEE Transactions on Multimedia, vol. 7, no. 1, pp. 67-74, Feb. 2005.

Presentations in international conferences as the first author

- H. Kawaguchi, and T. Sakurai, "A Reduced Clock-Swing Flip-Flop (RCSFF) for $63 \%$ Clock Power Reduction," IEEE/JSAP Symposium on VLSI Circuits Digest of Technical Papers, pp. 97-98, June 1997.
- H. Kawaguchi, and T. Sakurai, "Noise Expressions for Capacitance-Coupled Distributed RC Lines," Proceedings of ACM/IEEE International Workshop on Timing Issues in the Specification and Synthesis of Digital Systems (TAU), pp. 270-279, Dec. 1997.
- H. Kawaguchi, K. Nose, and T. Sakurai, "A CMOS Scheme for 0.5V Supply Voltage with Pico-Ampere Standby Current," IEEE International Solid-State Circuits Conference Digest of Technical Papers, pp. 192-193, Feb. 1998.
- H. Kawaguchi, and T. Sakurai, "Delay and Noise Formulas for Capacitively Coupled Distributed RC Lines," Proceedings of Asia and South Pacific Design Automation Conference, pp. 35-43, Feb. 1998.
- H. Kawaguchi, Y. Itaka, and T. Sakurai, "Dynamic Leakage Cut-off Scheme for Low-Voltage SRAM's," IEEE/JSAP Symposium on VLSI Circuits Digest of Technical Papers, pp. 140-141, June 1998.
- H. Kawaguchi, K. Nose, and T. Sakurai, "A CMOS Scheme for 0.5 V Supply Voltage with Pico-Ampere Standby Current," Proceedings of International Workshop on Advanced LSIs, pp. 45-49, July 1998.
- H. Kawaguchi, G. Zhang, S. Lee, and T. Sakurai, "An LSI for $\mathrm{V}_{\mathrm{DD}}$-Hopping and MPEG4 System Based on the Chip," Proceedings of IEEE International Symposium on Circuit and Systems, pp. 918-921, May 2001.
- H. Kawaguchi, Y. Shin, and T. Sakurai, "Experimental Evaluation of Cooperative Voltage Scaling (CVS): A Case Study," Proceedings of IEEE Workshop on Power Management for Real-Time and Embedded Systems, pp. 17-23, May 2001.
- H. Kawaguchi, K. Kanda, K. Nose, S. Hattori, D. D. Antono, D. Yamada, T. Miyazaki, K. Inagaki, T. Hiramoto, and T. Sakurai, "A 0.5-V, 400-MHz, $\mathrm{V}_{\mathrm{DD}}$-Hopping Processor with Zero-$\mathrm{V}_{\mathrm{TH}}$ FD-SOI Technology," IEEE International Solid-State Circuits Conference Digest of Technical Papers, pp. 106-107, Feb. 2003.
- H. Kawaguchi, S. Iba, Y. Kato, T. Sekitani, T. Someya, and T. Sakurai, "A Sheet-Type Scanner Based on a 3D-Stacked Organic-Transistor Circuit with Double Word-Line and Double Bit-Line Structure," IEEE International Solid-State Circuits Conference Digest of Technical Papers, pp. 580-581, Feb. 2005.

Publications in journals and transactions as a coauthor

- T. Hiramoto, M. Takamiya, H. Koura, T. Inukai, H. Gomyo, H. Kawaguchi, and T. Sakurai, "Optimum Device Parameters and Scalability of Variable Threshold Voltage Complementary MOS (VTCMOS)," Japanese Journal of Applied Physics, vol. 40, part 1, no. 4B, pp. 2854-2858, Apr. 2001.
- K. Kanda, K. Nose, H. Kawaguchi, and T. Sakurai, "Design Impact of Positive Temperature Dependence on Drain Current in Sub-1-V CMOS VLSIs," IEEE Journal of Solid-State Circuits, vol. 36, no. 10, pp. 1559-1564, Oct. 2001.
- K. Nose, M. Hirabayashi, H. Kawaguchi, S. Lee, and T. Sakurai, "$\mathrm{V}_{\mathrm{TH}}$-Hopping Scheme to Reduce Subthreshold Leakage for Low-Power Processors," IEEE Journal of Solid-State Circuits, vol. 37, no. 3, pp. 413-419, Mar. 2002.
- Y. Kato, S. Iba, R. Teramoto, T. Sekitani, T. Someya, H. Kawaguchi, and T. Sakurai, "High mobility of pentacene field-effect transistors with polyimide gate dielectric layers," Applied Physics Letters, vol. 84, no. 19, pp. 3789-3791, May 2004.
- T. Someya, T. Sekitani, S. Iba, Y. Kato, H. Kawaguchi, and T. Sakurai, "A large-area, flexible pressure sensor matrix with organic field-effect transistors for artificial skin applications," Proceedings of the National Academy of Sciences of the United States of America, vol. 101, no. 27, pp. 9966-9970, July 2004.
- K. Toyama, S. Misaka, K. Aisaka, T. Aritsuka, K. Uchiyama, K. Ishibashi, H. Kawaguchi, and T. Sakurai, "Frequency-Voltage Cooperative CPU Power Control: A Design Rule and Its Application by Feedback Prediction," Systems and Computers in Japan, vol. 36, no. 6, pp. 39-48, June 2005.
- K. Min, K. Kanda, H. Kawaguchi, K. Inagaki, F. R. Saliba, H. Choi, H. Choi, D. Kim, D. Kim, M. Min, and T. Sakurai, "Row-by-Row Dynamic Source-Line Voltage Control (RRDSV) Scheme for Two Orders of Magnitude Leakage Current Reduction of Sub-1-V-$\mathrm{V}_{\mathrm{DD}}$ SRAM's," IEICE Transactions on Electronics, vol. E88-C, no. 4, pp. 760-767, Apr. 2005.
- S. Iba, T. Sekitani, Y. Kato, T. Someya, H. Kawaguchi, M. Takamiya, T. Sakurai, and S. Takagi, "Control of threshold voltage of organic field-effect transistors with double-gate structures," Applied Physics Letters, vol. 87, 023509, July 2005.
- T. Someya, Y. Kato, T. Sekitani, S. Iba, Y. Noguchi, Y. Murase, H. Kawaguchi, and T. Sakurai, "Conformable, flexible, large-area networks of pressure and thermal sensors with organic transistor active matrixes," Proceedings of the National Academy of Sciences of the United States of America, vol. 102, no. 35, pp. 12321-12325, Aug. 2005.
- T. Someya, Y. Kato, S. Iba, Y. Noguchi, T. Sekitani, H. Kawaguchi, and T. Sakurai, "Integration of Organic FETs With Organic Photodiodes for a Large Area, Flexible, and Lightweight Sheet Image Scanners," IEEE Transactions on Electron Devices, vol. 52, no. 11, pp. 2502-2511, Nov. 2005.
- D. D. Antono, H. Kawaguchi, and T. Sakurai, "Trends of On-chip Interconnects in Deep Sub-micron VLSI," IEICE Transactions on Electronics. (To be published).
- C. Q. Tran, H. Kawaguchi, and T. Sakurai, "Low-power Low-leakage FPGA Design using Zigzag Power Gating, Dual-$\mathrm{V}_{\mathrm{TH}} / \mathrm{V}_{\mathrm{DD}}$ and Micro-$\mathrm{V}_{\mathrm{DD}}$-Hopping," IEICE Transactions on Electronics. (To be published).

Presentations in international conferences as a coauthor

- T. Sakurai, H. Kawaguchi, and T. Kuroda, "(Invited) Low-Power CMOS Design through $\mathrm{V}_{\text {TH }}$ Control and Low-Swing Circuits," Proceedings of ACM/IEEE International Symposium on Low Power Electronics and Design, pp. 1-6, Aug. 1997.
- K. Kanda, K. Nose, H. Kawaguchi, and T. Sakurai, "Design Impact of Positive Temperature Dependence of Drain Current in Sub 1V CMOS VLSI's," Proceedings of IEEE Custom Integrated Circuits Conference, pp. 563-566, May 1999.
- T. Inukai, M. Takamiya, K. Nose, H. Kawaguchi, T. Hiramoto, and T. Sakurai, "Boosted Gate MOS (BGMOS): Device/Circuit Cooperation Scheme to Achieve Leakage-Free Giga-Scale Integration," Proceedings of IEEE Custom Integrated Circuits Conference, pp. 409-412, May 2000.
- T. Hiramoto, M. Takamiya, H. Koura, T. Inukai, H. Gomyo, H. Kawaguchi, and T. Sakurai, "(Invited) Optimum Device Parameters and Scalability of Variable Threshold CMOS (VTMOS)," Proceedings of International Conference on Solid State Devices and Materials, pp. 372-373, Aug. 2000.
- K. Kanda, N. D. Minh, H. Kawaguchi, and T. Sakurai, "Abnormal Leakage Suppression (ALS) Scheme for Low Standby Current SRAMs," IEEE International Solid-State Circuits Conference Digest of Technical Papers, pp. 174-175, Feb. 2001.
- K. Nose, M. Hirabayashi, H. Kawaguchi, S. Lee, and T. Sakurai, "$\mathrm{V}_{\mathrm{TH}}$-hopping Scheme for $82 \%$ Power Saving in Low-voltage Processors," Proceedings of IEEE Custom Integrated Circuits Conference, pp. 93-96, May 2001.
- Y. Shin, H. Kawaguchi, and T. Sakurai, "Cooperative Voltage Scaling (CVS) between OS and Applications for Low-Power Real-Time Systems," Proceedings of IEEE Custom Integrated Circuits Conference, pp. 553-556, May 2001.
- K. Aisaka, T. Aritsuka, S. Misaka, K. Toyama, K. Uchiyama, K. Ishibashi, H. Kawaguchi, and T. Sakurai, "Design Rule for Frequency-Voltage Cooperative Power Control and Its Application to an MPEG-4 Decoder," IEEE/JSAP Symposium on VLSI Circuits Digest of Technical Papers, pp. 216-217, June 2002.
- K. Kanda, T. Miyazaki, K. Min, H. Kawaguchi, and T. Sakurai, "Two Orders of Magnitude Leakage Power Reduction of Low Voltage SRAM's by Row-by-Row Dynamic VDD Control (RRDV) Scheme," Proceedings of IEEE International ASIC/SOC Conference, pp. 381-385, Sep. 2002.
- K. Kanda, D. D. Antono, K. Ishida, H. Kawaguchi, T. Kuroda, and T. Sakurai, "1.27-Gbps/pin, 3mW/pin Wireless Superconnect (WSC) Interface Scheme," IEEE International Solid-State Circuits Conference Digest of Technical Papers, pp. 186-187, Feb. 2003.
- K. Min, H. Kawaguchi, and T. Sakurai, "Zigzag Super Cut-off CMOS (ZSCCMOS) Block Activation with Self-Adaptive Voltage Level Controller: An Alternative to Clock-Gating Scheme in Leakage Dominant Era," IEEE International Solid-State Circuits Conference Digest of Technical Papers, pp. 400-401, Feb. 2003.
- S. Misaka, K. Toyama, T. Aritsuka, K. Uchiyama, K. Aisaka, H. Kawaguchi, and T. Sakurai, "Frequency-Voltage Cooperative Power Reduction for Multi-tasking Multimedia Applications," International Symposium on Low-Power and High-Speed Chips (COOL Chips), vol. I, pp. 103-116, Apr. 2003.
- T. Someya, H. Kawaguchi, and T. Sakurai, "Cut-and-Paste Organic FET Customized ICs for Application to Artificial Skin," IEEE International Solid-State Circuits Conference Digest of Technical Papers, pp. 288-289, Feb. 2004.
- T. Sekitani, H. Kawaguchi, T. Sakurai, and T. Someya, "Organic field-effect transistors with bending radius down to 1 mm ," Proceedings of Materials Research Society Spring Meeting, Apr. 2004.
- T. Miyazaki, T. Q. Canh, H. Kawaguchi, and T. Sakurai, "Observation of one-fifth-a-clock wake-up time of power-gated circuit," Proceedings of IEEE Custom Integrated Circuits Conference, pp. 87-90, Oct. 2004.
- T. Someya, S. Iba, Y. Kato, T. Sekitani, Y. Noguchi, Y. Murase, H. Kawaguchi, and T. Sakurai, "(Invited) Organic transistor ICs for large-area sensors," Korea Japan Joint Forum, \#05-I, Nov. 2004.
- T. Someya, T. Sakurai, T. Sekitani, H. Kawaguchi, S. Iba, and Y. Kato, "A Large-Area, Flexible, and Lightweight Sheet Image Scanner Integrated with Organic Field-Effect Transistors and Organic Photodiodes," IEEE International Electron Devices Meeting Digest of Technical Papers, pp. 365-368, Dec. 2004.
- S. Misaka, H. Kawaguchi, and T. Sakurai, "Time Revising Robust Frequency-Voltage Cooperative Power Reduction for Multi-tasking Multimedia Applications," Proceedings of International Symposium on Low-Power and High-Speed Chips (COOL Chips), pp. 165-180, Apr. 2005.
- T. Someya, T. Sakurai, T. Sekitani, H. Kawaguchi, S. Iba, and Y. Kato, "Recent Advances in Applications of Organic Integrated Circuits for Large-Area Electronics," Proceedings of IEEE International Conference on IC Design and Technology, pp. 57-58, May 2005.
- C. Q. Tran, H. Kawaguchi, and T. Sakurai, "Low-power High-speed Level Shifter Design for Block-level Dynamic Voltage Scaling Environment," Proceedings of IEEE International Conference on IC Design and Technology, pp. 229-232, May 2005.
- K. Ishida, K. Kanda, A. Tamtrakarn, H. Kawaguchi, and T. Sakurai, "Subthreshold-Leakage Suppressed Switched Capacitor Circuit Based on Super Cut-Off CMOS (SCCMOS)," Proceedings of IEEE International Symposium on Circuits and Systems, pp. 3119-3122, May 2005.
- C. Q. Tran, H. Kawaguchi, and T. Sakurai, "More Than Two Orders of Magnitude Leakage Current Reduction in Look-Up Table for FPGA's," Proceedings of IEEE International Symposium on Circuits and Systems, pp. 4701-4704, May 2005.
- K. Ishida, K. Kanda, A. Tamtrakarn, H. Kawaguchi, and T. Sakurai, "Managing Leakage in Charge-Based Analog Circuits with Low-$\mathrm{V}_{\mathrm{TH}}$ Transistors by Analog T-Switch (AT-Switch) and Super Cut-off CMOS," IEEE/JSAP Symposium on VLSI Circuits Digest of Technical Papers, pp. 122-125, June 2005.
- F. R. Saliba, H. Kawaguchi, and T. Sakurai, "Experimental Verification of Row-by-Row Variable $\mathrm{V}_{\mathrm{DD}}$ Scheme Reducing 95\% Active Leakage Power of SRAM's," IEEE/JSAP Symposium on VLSI Circuits Digest of Technical Papers, pp. 162-165, June 2005.
- T. Someya, T. Sakurai, T. Sekitani, H. Kawaguchi, S. Iba, Y. Kato, and Y. Noguchi, "(Invited) Recent Progress of Organic Transistor Integrated Circuits for Large-Area Sensor Applications," Proceedings of International Conference on Solid State Devices and Materials, Sep. 2005.
- C. Q. Tran, H. Kawaguchi, and T. Sakurai, "95\% Leakage-Reduced FPGA using Zigzag Power-gating, Dual-$\mathrm{V}_{\mathrm{TH}} / \mathrm{V}_{\mathrm{DD}}$ and Micro-$\mathrm{V}_{\mathrm{DD}}$-Hopping," IEEE Asian Solid-State Circuits Conference Proceedings of Technical Papers, pp. 149-152, Nov. 2005.


## Acknowledgment

First of all, I would like to express my gratitude to Prof. Takayasu Sakurai, although I do not know how to thank him enough. He is an excellent researcher, and he has led me in the right direction. He has given me the warmest encouragement and the most productive advice.

I would like to thank Prof. Takashi Nanya, Prof. Yasuhiko Arakawa, Prof. Shin-ichi Takagi, Prof. Toshiro Hiramoto, and Prof. Takao Someya for their helpful advice.

During the enriching time I spent in the laboratory, I was given a lot of assistance. I am really fortunate to have met Prof. Makoto Takamiya, Mr. Kenichi Inagaki, Dr. Koichi Ishida, Prof. Seongsoo Lee, Prof. Youngsoo Shin, Prof. Kyeongsik Min, Dr. Jinhyeok Choi, Dr. Kyuwon Choi, Mr. Seiji Takeuchi, Dr. Koichi Nose, Mr. Yasuhito Itaka, Dr. Koichi Kanda, Mr. Nguyen Minh Duc, Mr. Zhang Gang, Mr. Yutaro Asano, Mr. Masayuki Hirabayashi, Mr. Sadaaki Hattori, Mr. Tran Quang Canh, Mr. Danardono Dwi Antono, Mr. Atit Tamtrakarn, Mr. Daisuke Yamada, Mr. Takayuki Miyazaki, Mr. Fayez Robert Saliba, Mr. Kohei Onizuka, Mr. Kazuhiro Tokunaga, Mr. Yingxue Xu, Mr. Yasumi Nakamura, Mr. Muhammad Yazid, Mr. Takuya Minakawa, Mr. Masaya Ishida, Mr. Limin Xiao, and Mr. Wenhao Wu. With them, I had a fruitful time. In particular, Mr. Kenichi Inagaki has given me a great deal of help. I cannot thank Ms. Yuko Nara and Ms. Jungsook Kye enough for their kindness.

We are grateful to Dr. Shigetaka Kumashiro of the STARC (Semiconductor Technology Academic Research Center) for helpful suggestions on the interconnection issues in Chapter 3.

We are thankful for research support from Toshiba Corporation and fruitful discussions with Dr. Tohru Furuyama, Dr. Mototsugu Hamada, Dr. Hiroki Ishikuro, Mr. Tetsuya Fujita, Mr. Yoshinori Watanabe, and Prof. Tadahiro Kuroda throughout the study in Chapter 4.

We would like to thank Dr. Koichiro Ishibashi, Dr. Kunio Uchiyama, Mr. Keisuke Toyama, Mr. Kazuo Aisaka, and Mr. Satoshi Misaka of Hitachi for fruitful discussion on $V_{D D}$ hopping and $\mu$ ITRON-LP in Chapter 5. I particularly appreciate Mr. Hitoshi Yamaki of Hitachi Yonezawa Electronics for tests and helpful advice. He made the system calls and embedded $\mu$ ITRON-LP into the system.

I would like to express my deepest appreciation to Dr. Tsuyoshi Sekitani, Mr. Shingo Iba, and Mr. Yusaku Kato for their assistance in the fabrication of and experiments on the organic devices. They have made the state-of-the-art devices, without which I could not have done anything about the organic circuits in Chapter 6.

We would like to express our gratitude for the financial support we received. Chapter 2 was supported in part by a grant from Toshiba Corporation. Chapter 3 was supported by a grant from the STARC. Chapter 4 was supported in part by grants from Toshiba Corporation and the JSPS (Japan Society for the Promotion of Science). Chapter 5 was supported by grants from Hitachi, Ltd. and the JSPS. Chapter 6 was supported in part by grants from the NEDO (New Energy and Industrial Technology Development Organization), MEXT (Ministry of Education, Culture, Sports, Science and Technology), and MPHPT (Ministry of Public Management, Home Affairs, Posts and Telecommunications).

The test chips were fabricated by Toshiba Corporation, Rohm Co., Ltd., and Oki Electric Industry Co., Ltd. The Rohm LSIs were fabricated through the chip fabrication program of the VDEC (VLSI Design and Education Center), University of Tokyo, in collaboration with Rohm and Toppan Printing.

Finally, I have to apologize to my family for spoiling your weekends. I will take a journey with you in return.

